Python Forum
PPDB webscraping
#1
Good morning all,

For the last two months I have had serious problems because I can't manage to scrape the various tables on the following page correctly:
https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm
I tried no-code tools (webscraper, octoparse and others...) and Python (pandas, BeautifulSoup...), but nothing gives me a usable CSV. Does anyone have a solution to help me? :crossed_fingers:

Thanks in advance and have a good weekend! Smile

import pandas as pd
#https://python.doctor/page-beautifulsoup-html-parser-python-library-xml
#https://developer.mozilla.org/fr/docs/Web/API/Document/querySelectorAll
from bs4 import BeautifulSoup
import urllib.request
link = "https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm"
f = urllib.request.urlopen(link)
html_doc= f.read()

soup = BeautifulSoup(html_doc, 'html.parser')
#print(soup)
pages = soup.find_all("a")
filtered_pages = []
for p in pages:
  if p.has_attr('href') and p['href'].startswith("Report"):
    filtered_pages.append(p)

#pages = list(filter(lambda el: el['href'].startswith("Report"), pages))
print(filtered_pages)

# fetch each report page
result = []
for page in filtered_pages[:100]:
  f = urllib.request.urlopen('https://sitem.herts.ac.uk/aeru/ppdb/en/' + page['href'])
  html_doc = f.read()
  soup = BeautifulSoup(html_doc, 'html.parser')
  #print(soup)
  titreNode = soup.find_all("td", attrs={"class" : "title"})[0].text
  trs = soup.select('table.report_data tr')
  rowDict = {}
  for tr in trs:
    if len(tr.select('td.row_header')) == 0:
      continue
    if len(tr.select('td.data1')) == 0:
      continue
    tdTitre = tr.select('td.row_header')[0].text
    tdValue = tr.select('td.data1')[0].text.replace('\xa0', ' ').replace('&nbsp', '').strip()  # the page writes &nbsp without a semicolon; cover both forms
    rowDict[tdTitre] = tdValue
  result.append(rowDict)

#print(result)
df = pd.DataFrame(result)
df.to_csv('file.csv')
#2
You can use pandas read_html() to get all the tables.
Example:
import pandas as pd
pd.set_option('expand_frame_repr', False)

df = pd.read_html('https://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm')
# There are 35 tables on the page
>>> len(df)
35

# Look at a couple 
>>> df[7]
                          0                                                  1
0               Description  A plant growth regulator used fruit setting an...
1  Example pests controlled        Growth - in terms of thinging and fruit set
2      Example applications                                   Tomatoes; Grapes
3       Efficacy & activity                                                  -
4       Availability status                                            Current
5  Introduction & key dates                                            Current

>>> df[10]
  ATAustria  BEBelgium     BGBulgaria   CYCyprus CZCzech Republic  DEGermany DKDenmark    EEEstonia      ELGreece
0     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
1   ESSpain  FIFinland       FRFrance  HRCroatia        HUHungary  IEIreland   ITItaly  LTLithuania  LULuxembourg
2     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
3  LVLatvia    MTMalta  NLNetherlands   PLPoland       PTPortugal  RORomania  SESweden   SISlovenia    SKSlovakia
4     &nbsp      &nbsp          &nbsp      &nbsp            &nbsp      &nbsp     &nbsp        &nbsp         &nbsp
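As the df[10] output above shows, the country table comes back filled with literal '&nbsp' strings (the page uses the entity without a semicolon, so the parser leaves it undecoded). A minimal cleanup sketch, using a small stand-in frame instead of the live page:

```python
import pandas as pd

# Stand-in for the scraped country-approval table; on the live page this
# would be one of the frames returned by pd.read_html().
raw = pd.DataFrame({'ATAustria': ['&nbsp'], 'BEBelgium': ['&nbsp']})

# Replace the undecoded '&nbsp' placeholders with proper missing values.
cleaned = raw.replace('&nbsp', pd.NA)
```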
#3
(Aug-20-2022, 09:35 AM)snippsat Wrote: Can use Pandas read_html() to get all the tables.

Hi @snippsat,

Thank you so much for your help! It seems so easy when you know it well!
And how would you reproduce the tables of each page? Thanks in advance :)
#4
(Aug-20-2022, 01:16 PM)Mariegain Wrote: And how would you reproduce the tables of each page? Thanks in advance :)
You have to look into this yourself, as I don't know which tables you need; there are many.
If I test with another URL, e.g. 6.htm, I get the same tables back, which suggests the table structure is the same on all pages.
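Since the structure looks the same on every report page, one approach (a sketch, not tested against the whole site; the helper name and table index are my own choices) is to loop read_html over the Reports/<n>.htm URLs and stack the table you care about. A tiny inline page stands in for a real report so the sketch runs offline:

```python
import io
import pandas as pd

def stack_report_tables(html_pages, table_index):
    """Pull one table (by position) from each report page and concatenate."""
    frames = [pd.read_html(io.StringIO(html))[table_index] for html in html_pages]
    return pd.concat(frames, ignore_index=True)

# In practice each html string would be fetched from
# https://sitem.herts.ac.uk/aeru/ppdb/en/Reports/<n>.htm
page = "<table><tr><td>Description</td><td>A plant growth regulator</td></tr></table>"
df = stack_report_tables([page, page], 0)
```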

Also, you should check whether there is an API; that can make things easier.
A quick search finds the Pesticide Product Information Database (PPID) API Guide.
In Python you can use it, e.g. like this:
import requests

id_number = '2022-4050'
url = 'https://pest-control.canada.ca/pesticide-registry-api/api/extract/application/'
response = requests.get(f'{url}{id_number}')
>>> response
<Response [200]>

>>> print(response.text)
Application number,Date received,Outcome,Active ingredients - English,Active ingredients - French,Purpose,Product name - English,Product name - French,Registration number,Name of registrant,Registration status,Category,Product type,Marketing type,Current / Historical

2022-4050,2022-08-12,PENDING,POTASSIUM SALTS OF FATTY ACIDS,SELS DE POTASSIUM D'ACIDES GRAS,Renewal,Confidential,Confidentiel,,Confidential,Full Registration,D,HERBICIDE,COMMERCIAL,Current
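The endpoint returns plain CSV, so response.text can be fed straight into pandas (the blank line in the payload is skipped by read_csv's defaults). A short literal stands in for the response here; writing it back out mirrors the original goal of getting a usable CSV:

```python
import io
import pandas as pd

# Stand-in for response.text from the extract/application endpoint above.
csv_text = (
    "Application number,Date received,Outcome\n"
    "\n"
    "2022-4050,2022-08-12,PENDING\n"
)

df = pd.read_csv(io.StringIO(csv_text))  # blank lines are skipped by default
df.to_csv('applications.csv', index=False)
```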
#5
Good evening, thank you very much for your help; I will try this now.
Have a good evening! :)