Aug-20-2022, 09:14 AM
Good morning to all,
I have been struggling with this for the last two months: I can't manage to scrape the tables of the following page correctly:
https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm
I have tried no-code tools (Web Scraper, Octoparse and others...) and Python (pandas, BeautifulSoup...), but nothing gives me a usable CSV. Does anyone have a solution to help me? :crossed_fingers:
Thanks in advance, and have a good weekend!
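For reference, the simplest thing I would expect to work is pandas.read_html on a single report page. A minimal sketch, assuming the tables are plain HTML and lxml is installed (the report URL below is only a made-up example; the real ones are linked from atoz.htm):

import pandas as pd

# Hypothetical report URL, just for illustration
url = "https://sitem.herts.ac.uk/aeru/ppdb/en/Reports/1.htm"

# read_html returns one DataFrame per <table> whose attributes match
tables = pd.read_html(url, attrs={"class": "report_data"})
print(len(tables))
tables[0].to_csv("one_report.csv", index=False)

Even along these lines I can't get a CSV I can actually work with, hence my question.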

import pandas as pd
# https://python.doctor/page-beautifulsoup-html-parser-python-library-xml
# https://developer.mozilla.org/fr/docs/Web/API/Document/querySelectorAll
from bs4 import BeautifulSoup
import urllib.request

# Collect every link on the A-Z index that points to a report page
link = "https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm"
html_doc = urllib.request.urlopen(link).read()
soup = BeautifulSoup(html_doc, "html.parser")
pages = soup.find_all("a")
filtred_pages = []
for p in pages:
    if p.has_attr('href') and p['href'].startswith("Report"):
        filtred_pages.append(p)
print(filtred_pages)

# Fetch each report page and turn its table.report_data rows into one dict
result = []
for page in filtred_pages[:100]:
    f = urllib.request.urlopen('https://sitem.herts.ac.uk/aeru/ppdb/en/' + page['href'])
    html_doc = f.read()
    soup = BeautifulSoup(html_doc, "html.parser")
    titreNode = soup.find_all("td", attrs={"class": "title"})[0].text
    trs = soup.select('table.report_data tr')
    rowDict = {'title': titreNode}  # keep the substance name with its data
    for tr in trs:
        if len(tr.select('td.row_header')) == 0:
            continue
        if len(tr.select('td.data1')) == 0:
            continue
        tdTitre = tr.select('td.row_header')[0].text
        tdValue = tr.select('td.data1')[0].text.replace(' ', '').rstrip('\n').strip()
        rowDict[tdTitre] = tdValue
    result.append(rowDict)

df = pd.DataFrame(result)
df.to_csv('file.csv')
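If anyone wants to reproduce the loop above, here is a rough sketch of a retry/delay helper that could wrap urlopen so one failed request doesn't kill the whole run (fetch, delay and retries are just names I made up, and the one-second pause is arbitrary):

import time
import urllib.request
import urllib.error

def fetch(url, delay=1.0, retries=2):
    # Try the request a few times, pausing between attempts;
    # return the raw bytes, or None if every attempt failed.
    for attempt in range(retries + 1):
        try:
            return urllib.request.urlopen(url).read()
        except urllib.error.URLError as err:
            print(f"attempt {attempt + 1} failed for {url}: {err}")
            time.sleep(delay)
    return None

# Usage inside the loop above:
# html_doc = fetch('https://sitem.herts.ac.uk/aeru/ppdb/en/' + page['href'])
# if html_doc is None:
#     continue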