Python Forum
PPDB webscraping
#1
Good morning all,

For the last two months I have had serious problems because I can't manage to scrape the various tables on the following page correctly:
https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm
I tried no-code tools (webscraper, octoparse and others...) and Python (pandas, BeautifulSoup...), but nothing gives me a usable CSV. Does anyone have a solution to help me? :crossed_fingers:

Thanks in advance and have a good weekend! Smile

import pandas as pd
#https://python.doctor/page-beautifulsoup-html-parser-python-library-xml
#https://developer.mozilla.org/fr/docs/Web/API/Document/querySelectorAll
from bs4 import BeautifulSoup
import urllib.request
link = "https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm"
f = urllib.request.urlopen(link)
html_doc= f.read()

soup = BeautifulSoup(html_doc, 'html.parser')
#print(soup)
pages = soup.find_all("a")
filtered_pages = []
for p in pages:
  if p.has_attr('href') and p['href'].startswith("Report"):
    filtered_pages.append(p)

#pages = list(filter(lambda el: el['href'].startswith("Report"), pages))
print(filtered_pages)

# fetch each report page
result = []
for page in filtered_pages[:100]:
  f = urllib.request.urlopen('https://sitem.herts.ac.uk/aeru/ppdb/en/' + page['href'])
  html_doc = f.read()
  soup = BeautifulSoup(html_doc, 'html.parser')
  #print(soup)
  titreNode = soup.find_all("td", attrs={"class" : "title"})[0].text
  trs = soup.select('table.report_data tr')
  rowDict = {}
  for tr in trs:
    if len(tr.select('td.row_header')) == 0:
      continue
    if len(tr.select('td.data1')) == 0:
      continue
    tdTitre = tr.select('td.row_header')[0].text
    tdValue = tr.select('td.data1')[0].text.replace('\xa0', ' ').replace('&nbsp', '').strip()  # the page writes &nbsp without a semicolon; cover both forms
    rowDict[tdTitre] = tdValue
  result.append(rowDict)

#print(result)
df = pd.DataFrame(result)
df.to_csv('file.csv')
#2
You can use pandas read_html() to get all the tables.
Example:
import pandas as pd
pd.set_option('expand_frame_repr', False)

df = pd.read_html('https://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm')
# There are 35 tables on the page
>>> len(df)
35

# Look at a couple 
>>> df[7]
                          0                                                  1
0               Description  A plant growth regulator used fruit setting an...
1  Example pests controlled        Growth - in terms of thinging and fruit set
2      Example applications                                   Tomatoes; Grapes
3       Efficacy & activity                                                  -
4       Availability status                                            Current
5  Introduction & key dates                                            Current

>>> df[10]
  ATAustria  BEBelgium     BGBulgaria   CYCyprus CZCzech Republic  DEGermany DKDenmark    EEEstonia      ELGreece
0     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
1   ESSpain  FIFinland       FRFrance  HRCroatia        HUHungary  IEIreland   ITItaly  LTLithuania  LULuxembourg
2     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
3  LVLatvia    MTMalta  NLNetherlands   PLPoland       PTPortugal  RORomania  SESweden   SISlovenia    SKSlovakia
4     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
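As the df[10] output above shows, the country table comes back filled with literal '&nbsp' strings (the page uses the entity without a semicolon, so the parser leaves it undecoded). A minimal cleanup sketch, using a small stand-in frame instead of the live page:

```python
import pandas as pd

# Stand-in for the scraped country-approval table; on the live page this
# would be one of the frames returned by pd.read_html().
raw = pd.DataFrame({'ATAustria': ['&nbsp'], 'BEBelgium': ['&nbsp']})

# Replace the undecoded '&nbsp' placeholders with proper missing values.
cleaned = raw.replace('&nbsp', pd.NA)
```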
#3
(Aug-20-2022, 09:35 AM)snippsat Wrote: Can use Pandas read_html() to get all the tables.

Hi @snippsat,

Thank you so much for your help! It seems so easy when you know it well!
And how would you reproduce the tables of each page? Thanks in advance :)
#4
(Aug-20-2022, 01:16 PM)Mariegain Wrote: And how would you reproduce the tables of each page? Thanks in advance :)
You have to look into this yourself, as I don't know which tables you need; there are many.
If I test with another URL, e.g. 6.htm, I get the same tables back, which suggests the table structure is the same on all pages.
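Since the structure looks the same on every report page, one approach (a sketch, not tested against the whole site; the helper name and table index are my own choices) is to loop read_html over the Reports/<n>.htm URLs and stack the table you care about. A tiny inline page stands in for a real report so the sketch runs offline:

```python
import io
import pandas as pd

def stack_report_tables(html_pages, table_index):
    """Pull one table (by position) from each report page and concatenate."""
    frames = [pd.read_html(io.StringIO(html))[table_index] for html in html_pages]
    return pd.concat(frames, ignore_index=True)

# In practice each html string would be fetched from
# https://sitem.herts.ac.uk/aeru/ppdb/en/Reports/<n>.htm
page = "<table><tr><td>Description</td><td>A plant growth regulator</td></tr></table>"
df = stack_report_tables([page, page], 0)
```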

Also, you should check whether there is an API; that can make things easier.
A quick search finds the Pesticide Product Information Database (PPID) API Guide.
In Python you can use it, e.g. like this:
import requests

id_number = '2022-4050'
url = 'https://pest-control.canada.ca/pesticide-registry-api/api/extract/application/'
response = requests.get(f'{url}{id_number}')
>>> response
<Response [200]>

>>> print(response.text)
Application number,Date received,Outcome,Active ingredients - English,Active ingredients - French,Purpose,Product name - English,Product name - French,Registration number,Name of registrant,Registration status,Category,Product type,Marketing type,Current / Historical

2022-4050,2022-08-12,PENDING,POTASSIUM SALTS OF FATTY ACIDS,SELS DE POTASSIUM D'ACIDES GRAS,Renewal,Confidential,Confidentiel,,Confidential,Full Registration,D,HERBICIDE,COMMERCIAL,Current
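The endpoint returns plain CSV, so response.text can be fed straight into pandas (the blank line in the payload is skipped by read_csv's defaults). A short literal stands in for the response here; writing it back out mirrors the original goal of getting a usable CSV:

```python
import io
import pandas as pd

# Stand-in for response.text from the extract/application endpoint above.
csv_text = (
    "Application number,Date received,Outcome\n"
    "\n"
    "2022-4050,2022-08-12,PENDING\n"
)

df = pd.read_csv(io.StringIO(csv_text))  # blank lines are skipped by default
df.to_csv('applications.csv', index=False)
```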
#5
Good evening, thank you very much for your help; I will try this now.
Have a good evening! :)