Sep-14-2022, 05:39 PM
I am trying to web scrape the following webpage: https://www.assemblee-nationale.fr/12/cr...#TopOfPage .
I would like to obtain the following dataset in the end:
name                 speech
M. le président      La séance est...
M. le président      Conformément...
M. le président      Mes chers...
M. Maxime Gremetz    Nous aussi!

Example of the html:
<html>
  <div align="center">
    <table width="90%">
      <tbody>
        <tr>
          <p align="JUSTIFY">
            <strong></strong>
            <strong> M. le président</strong> Conformément... <br>
            Mes chers... <br>
            <strong> M. Maxime Gremetz </strong> Nous aussi! <br>
            <strong></strong>
          </p>
        </tr>
      </tbody>
    </table>
  </div>
</html>

The code that I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp#TopOfPage"
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'html.parser')

debate = []  # collected rows; this list has to exist before append() is called

list_soup_div = soup_data.find_all('div', {'align': 'center'})
for item_soup_div in list_soup_div:
    for item_soup_p in item_soup_div.find_all('p', {'align': 'JUSTIFY'}):
        # whole text of the paragraph, i.e. every speech in it at once
        text_speech = item_soup_p.get_text()
        for item_n in item_soup_p.find_all('strong'):
            # overwritten on each pass, so only the last <strong> survives
            text_name = item_n.get_text()
        debate.append({'name': text_name, 'speech': text_speech})

df = pd.DataFrame(debate)

What I get is a "name" column that is almost empty, and the passages separated by <br> do not all end up as separate rows. Could anyone help me improve my code?
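For reference, here is a minimal sketch of the row-building logic I think is needed, run only against the simplified HTML above and not against the live page. The idea is to walk the paragraph in document order, remember the most recent non-empty <strong> as the current speaker, and emit one row per text fragment that follows it. To keep it runnable without extra installs it uses the standard library's html.parser instead of BeautifulSoup. The class name SpeechParser and the sample string are illustrative assumptions, and the real page may well be structured differently.

```python
from html.parser import HTMLParser


class SpeechParser(HTMLParser):
    """Collect {'name', 'speech'} rows from <p align="JUSTIFY"> blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, in document order
        self.in_p = False       # currently inside a <p align="JUSTIFY">?
        self.in_strong = False  # currently inside a <strong>?
        self.speaker = None     # most recent non-empty speaker name

    def handle_starttag(self, tag, attrs):
        align = dict(attrs).get("align") or ""
        if tag == "p" and align.upper() == "JUSTIFY":
            self.in_p = True
        elif tag == "strong" and self.in_p:
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and newlines
        if not self.in_p or not text:
            return  # ignore text outside the paragraph and empty fragments
        if self.in_strong:
            self.speaker = text  # empty <strong></strong> never reaches here
        else:
            self.rows.append({"name": self.speaker, "speech": text})


# Simplified sample taken from the HTML example in this post (assumption:
# the live page follows the same pattern).
sample = (
    '<p align="JUSTIFY"><strong></strong>'
    '<strong> M. le président</strong> Conformément... <br>'
    ' Mes chers... <br>'
    '<strong> M. Maxime Gremetz </strong> Nous aussi! <br>'
    '<strong></strong></p>'
)

parser = SpeechParser()
parser.feed(sample)
parser.close()
rows = parser.rows
```

Passing rows to pd.DataFrame should then give the two-column layout described at the top. The same "current speaker" bookkeeping could presumably be done in BeautifulSoup by iterating over the children of each paragraph instead of calling get_text() on the whole paragraph.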