Python Forum
web scraping with <br> tag
#1
I am trying to scrape the following web page: https://www.assemblee-nationale.fr/12/cr...#TopOfPage
I would like to end up with the following dataset:

name                 speech
M. le président      La séance est...
M. le président      Conformément...
M. le président      Mes chers...
M. Maxime Gremetz    Nous aussi!
Example of the HTML:
<html>
  <div align="center">
     <table width="90%">
        <tbody>
          <tr>
            <p align="JUSTIFY">
              <strong></strong>
              <strong> M. le président</strong>
              Conformément...
              <br>
              Mes chers...
              <br>
              <strong> M. Maxime Gremetz </strong>
              Nous aussi!
              <br>
              <strong></strong>
           </p> 
         </tr>
       </tbody>
     </table>
   </div>
</html>
The code that I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url="https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp#TopOfPage"
r=requests.get(url)
soup_data=BeautifulSoup(r.content, 'html.parser')
list_soup_div=soup_data.find_all('div', {'align':'center'})
debate=[]
for item_soup_div in list_soup_div:
    for item_soup_p in item_soup_div.find_all('p', {'align':'JUSTIFY'}):
        text_speech=item_soup_p.get_text()
        for item_n in item_soup_p.find_all('strong'):
            text_name=item_n.get_text()
        debate.append({'name': text_name, 'speech': text_speech})
df=pd.DataFrame(debate)
The result has an almost empty "name" column, and the lines separated by <br> do not each get their own row. Could anyone help me improve my code?
#2
The problem is in how the names and speeches are read out of each <p>. In your sample, a single <p align="JUSTIFY"> holds several speakers: each name sits in its own <strong> tag, and the speech lines follow as plain text separated by <br>. Calling get_text() on the whole <p> therefore merges every speech into one string, and the inner loop over <strong> keeps only the last tag, which is often the empty <strong></strong>. That is why the "name" column is almost empty.

Here is an updated version of your code:

import requests
from bs4 import BeautifulSoup, NavigableString
import pandas as pd

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp#TopOfPage"
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'html.parser')

# Initialize an empty list to store the speech data
debate = []

# Speaker from the most recent non-empty <strong>; carried across <br> lines
current_name = None

# Loop through the <p> elements with the 'align' attribute set to 'JUSTIFY'
for item_soup_p in soup_data.find_all('p', {'align': 'JUSTIFY'}):
    # Walk the direct children of the <p> in document order
    for node in item_soup_p.children:
        if isinstance(node, NavigableString):
            # Plain text between tags: one speech line
            text_speech = str(node).strip()
            if text_speech:
                debate.append({'name': current_name, 'speech': text_speech})
        elif node.name == 'strong':
            # Skip empty <strong></strong> tags so they do not erase the name
            text_name = node.get_text(strip=True)
            if text_name:
                current_name = text_name

# Create a DataFrame from the debate list
df = pd.DataFrame(debate)

# Print the resulting DataFrame
print(df)
This version walks the direct children of each <p> in document order. A non-empty <strong> updates the current speaker, and every non-empty text node becomes one row under that speaker, so lines separated by <br> end up as separate rows. It assumes the speech text sits directly inside the <p>, as in your sample; text wrapped in other inline tags (for example <i> or <a>) would need extra handling.
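To check the parsing logic without requesting the live page, one option is to run it on the HTML sample from the question. The sketch below is only an illustration under that assumption: SAMPLE_HTML is a trimmed copy of the posted snippet, extract_rows is a hypothetical helper name, and the real page may contain markup the sample does not show. It walks the children of each <p>, treats a non-empty <strong> as the current speaker, and turns each non-empty text node into one row.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the HTML sample from the question (not the live page)
SAMPLE_HTML = """
<p align="JUSTIFY">
  <strong></strong>
  <strong> M. le président</strong>
  Conformément...
  <br>
  Mes chers...
  <br>
  <strong> M. Maxime Gremetz </strong>
  Nous aussi!
  <br>
  <strong></strong>
</p>
"""


def extract_rows(html):
    """Return one {'name', 'speech'} dict per text line inside the <p> tags."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    current_name = None  # last non-empty <strong> seen so far
    for p in soup.find_all("p", {"align": "JUSTIFY"}):
        for node in p.children:
            if isinstance(node, NavigableString):
                # Text between tags is a speech line; ignore pure whitespace
                speech = str(node).strip()
                if speech:
                    rows.append({"name": current_name, "speech": speech})
            elif node.name == "strong":
                # Empty <strong></strong> tags must not overwrite the speaker
                name = node.get_text(strip=True)
                if name:
                    current_name = name
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row["name"], "|", row["speech"])
```

The resulting list can be passed straight to pd.DataFrame(rows) to get the name and speech columns. Keeping the parsing in a function that takes an HTML string makes it easy to try on small samples before running it against the full page.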