Python Forum

I am trying to web scrape the following webpage: https://www.assemblee-nationale.fr/12/cr...#TopOfPage .
I would like to obtain the following dataset in the end:

name                 speech
M. le président      La séance est...
M. le président      Conformément...
M. le président      Mes chers...
M. Maxime Gremetz    Nous aussi!

Example of the html:

<html>
  <div align="center">
     <table width="90%">
        <tbody>
          <tr>
            <p align="JUSTIFY">
              <strong></strong>
              <strong> M. le président</strong>
              Conformément...
              <br>
              Mes chers...
              <br>
              <strong> M. Maxime Gremetz </strong>
              Nous aussi!
              <br>
              <strong></strong>
           </p> 
         </tr>
       </tbody>
     </table>
   </div>
</html>

The code that I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url="https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp#TopOfPage"
r=requests.get(url)
soup_data=BeautifulSoup(r.content, 'html.parser')
debate=[]
list_soup_div=soup_data.find_all('div', {'align':'center'})
for item_soup_div in list_soup_div:
    for item_soup_p in item_soup_div.find_all('p', {'align':'JUSTIFY'}):
        text_speech=item_soup_p.get_text()
        for item_n in item_soup_p.find_all('strong'):
            text_name=item_n.get_text()
        debate.append({'name': text_name, 'speech': text_speech})
df=pd.DataFrame(debate)

What I get is a "name" column that is almost empty, and the speeches are not split into separate rows. Could anyone help me improve my code?

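The problem is in how the name and the speech are extracted from the HTML structure. Inside each <p>, the inner loop overwrites text_name on every <strong>, so only the last <strong> of the paragraph is kept, and in your sample that one is empty. Meanwhile, get_text() on the whole <p> returns everything in the paragraph (names and speeches) as a single string, so you get one row per paragraph rather than one row per speech. You can also use the str.strip() method to remove unnecessary whitespace from the extracted text.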
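To make the symptom concrete, here is a small reproduction on a simplified copy of the sample paragraph from the question. It is only a sketch: the `sample` string and the `all_names` variable are introduced for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Simplified copy of the sample paragraph from the question
sample = """
<p align="JUSTIFY">
  <strong></strong>
  <strong> M. le président</strong>
  Conformément...
  <br>
  Mes chers...
  <br>
  <strong> M. Maxime Gremetz </strong>
  Nous aussi!
  <br>
  <strong></strong>
</p>
"""

p = BeautifulSoup(sample, "html.parser").find("p")

# Same inner loop as in the question: every pass overwrites text_name
for item_n in p.find_all("strong"):
    text_name = item_n.get_text()

# Every <strong> text in document order, for comparison
all_names = [s.get_text().strip() for s in p.find_all("strong")]

print(all_names)        # both speakers are present, surrounded by empty entries
print(repr(text_name))  # only the last <strong> survives the loop
```

On this sample the last <strong> is empty, so text_name ends up as an empty string even though both speakers appear in all_names.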

Here is an updated version of your code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp#TopOfPage"
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'html.parser')

# Initialize an empty list to store the speech data
debate = []

# Find all the <p> elements with the 'align' attribute set to 'JUSTIFY'
list_soup_p = soup_data.find_all('p', {'align': 'JUSTIFY'})

# Initialize a variable to store the current speaker's name
current_name = None

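# Loop through the <p> elements
for item_soup_p in list_soup_p:
    # Keep only non-empty <strong> texts (the sample also has empty <strong></strong> tags)
    names = [s.get_text(strip=True) for s in item_soup_p.find_all('strong')]
    names = [n for n in names if n]
    text_speech = item_soup_p.get_text(' ', strip=True)
    if names:
        # A new speaker starts here: remember the name and drop it from the speech text
        current_name = names[0]
        if text_speech.startswith(current_name):
            text_speech = text_speech[len(current_name):].strip()
    if text_speech:
        debate.append({'name': current_name, 'speech': text_speech})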

# Create a DataFrame from the debate list
df = pd.DataFrame(debate)

# Print the resulting DataFrame
print(df)

In this code, I've added logic to extract the speaker's name from the <strong> tag within the <p> element when available. That name is then reused for subsequent speeches until a new name is encountered. This should help you organize the data properly in your DataFrame.
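If several speakers share a single <p>, as in the sample HTML from the question, carrying the name from paragraph to paragraph is not enough. One possible approach is to walk the direct children of each <p> and carry the name from node to node instead. The sketch below assumes that speeches are bare text nodes sitting directly inside the <p>, separated by <br>, as in the sample. `split_paragraph` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def split_paragraph(p, current_name=None):
    """Split one <p> into rows; return (rows, last speaker name seen)."""
    rows = []
    for node in p.children:
        if isinstance(node, Tag) and node.name == "strong":
            # A non-empty <strong> marks a new speaker; empty ones are ignored
            name = node.get_text(strip=True)
            if name:
                current_name = name
        elif isinstance(node, NavigableString):
            # Bare text between tags is treated as one speech fragment
            speech = str(node).strip()
            if speech:
                rows.append({"name": current_name, "speech": speech})
        # Other tags (<br>, <i>, ...) are skipped in this sketch
    return rows, current_name


# Simplified copy of the sample paragraph from the question
sample = """
<p align="JUSTIFY">
  <strong></strong>
  <strong> M. le président</strong>
  Conformément...
  <br>
  Mes chers...
  <br>
  <strong> M. Maxime Gremetz </strong>
  Nous aussi!
  <br>
  <strong></strong>
</p>
"""

soup = BeautifulSoup(sample, "html.parser")
rows, last_name = [], None
for p in soup.find_all("p", {"align": "JUSTIFY"}):
    new_rows, last_name = split_paragraph(p, last_name)
    rows.extend(new_rows)

for row in rows:
    print(row)
```

Passing last_name back in on each call keeps the speaker across paragraphs, and the resulting rows list can be handed to pd.DataFrame(rows) as before. Text wrapped in other inline tags inside a speech would be dropped by this sketch, so it may need adjusting for the real page.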

mfernandes

Gaurav_Kumar