Python Forum

Full Version: Cleaning HTML data using Jupyter Notebook
I need help cleaning up data extracted from HTML. My output is split into individual words, so every word shows up as its own comma-separated list item (a small example is shown below). My full code is at the bottom and can also be found at https://github.com/aaron1986/Coursera_Ca...tats.ipynb

['Defence',
'Clean',
'sheets',
'13',
'Goals',
'Conceded',
'11',

Instead, I would like the data to look like this:

[Defence,
Clean sheets 13,
Goals Conceded 11,
]

import requests
import pandas as pd
import numpy as np
import seaborn as sns

from urllib.request import urlopen
from bs4 import BeautifulSoup
# --- next cell: download the page ---
main_url = 'xxxxxxxx'
result = requests.get(main_url)
result.text
# --- next cell: parse the HTML ---
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())
# --- next cell: locate the stats list ---
new = soup.find("ul", class_="normalStatList")
new.get_text()
# --- next cell: flatten the text and split it on whitespace ---
new2 = new.get_text().replace('\n', ' ').split()
new2
I assume you are using BeautifulSoup.
Calling split() on the text of the whole list destroys the original structure, because it also splits every phrase into separate words.
Since you haven't shown the HTML, it's not easy to help.
Here is a quick example; note that the sentences don't get split up.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
>>> ptag = soup.find_all('p')
>>> ptag
[<p>This is a paragraph</p>, <p>blue car</p>]
>>> 
>>> for t in ptag:
...     print(t.text)     
...     
This is a paragraph
blue car
>>> lst = [t.text for t in ptag]
>>> lst
['This is a paragraph', 'blue car']
I have updated my post with full code.
Here is an example for the first normalStat element; you can try to work out the loop yourself.
import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
stat = soup.find('div', class_="statsListBlock")
>>> norm = soup.find(class_="normalStat")
>>> text = norm.select_one('.stat').text.strip()
>>> text
'Clean sheets   \n      13'
>>> " ".join(text.split())
'Clean sheets 13'
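The last step works because str.split() with no argument splits on any run of whitespace (spaces, tabs, newlines) and discards empty pieces, so joining the parts with a single space tidies the string. A minimal sketch using the string from the output above; the helper name normalize_ws is made up for illustration and is not part of BeautifulSoup:

```python
def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    # split() with no argument splits on any whitespace and drops empty pieces,
    # so leading/trailing whitespace disappears as well
    return " ".join(text.split())


raw = 'Clean sheets   \n      13'
print(normalize_ws(raw))
```

Unlike splitting the text of the whole list, this is applied to one stat at a time, so the words of a single stat stay together.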
Hi, thank you for the reply. I have tried to write the loop, but I cannot get it to go through all of the '.stat' fields.
Try this.
import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
norm_stat = soup.find_all(class_='normalStat')
for tag in norm_stat:
    temp = tag.select_one('.stat').text.strip()
    result = " ".join(temp.split())
    print(result)
Output:
Clean sheets 13
Goals Conceded 11
Tackles 19
Tackle success % 63%
Last man tackles 0
Blocked shots 1
Interceptions 24
Clearances 68
Headed Clearance 36
.....
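If a list like the one asked for in the first post is wanted instead of printed lines, the same loop can be written as a list comprehension. This is only a sketch: the HTML below is a hypothetical, simplified stand-in shaped to match the class names used in this thread (the real page may be structured differently), and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the real page
html = '''\
<ul class="normalStatList">
  <li>
    <div class="normalStat">
      <span class="stat">Clean sheets
        <span>13</span>
      </span>
    </div>
    <div class="normalStat">
      <span class="stat">Goals Conceded
        <span>11</span>
      </span>
    </div>
  </li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# One cleaned string per .normalStat element
stats = [
    " ".join(tag.select_one('.stat').text.split())
    for tag in soup.find_all(class_='normalStat')
]
print(stats)  # ['Clean sheets 13', 'Goals Conceded 11']
```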
Thank you. It was the 'select_one' part that was confusing me.
(Mar-05-2021, 10:13 PM) jacob1986 wrote: Thank you. It was the 'select_one' part that was confusing me.
For your information, select() and select_one() give you the full power of CSS selectors.
Many people forget about this powerful feature of BeautifulSoup and just stick to find() and find_all().
As you can see, the two approaches can be mixed.
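A small sketch of what that can look like, again on hypothetical markup that only reuses the class names from this thread (the "other" block is invented to show that the selector skips it). The first query uses a single CSS selector; the second narrows down with find() and then continues with select():

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '''\
<div class="statsListBlock">
  <div class="normalStat"><span class="stat">Tackles <span>19</span></span></div>
  <div class="normalStat"><span class="stat">Interceptions <span>24</span></span></div>
</div>
<div class="other"><span class="stat">Ignored <span>0</span></span></div>'''

soup = BeautifulSoup(html, 'html.parser')

# Pure CSS: every .stat inside a .normalStat inside .statsListBlock
via_select = [
    " ".join(tag.text.split())
    for tag in soup.select('.statsListBlock .normalStat .stat')
]

# Mixed: find() picks the block, select() does the rest
# ('>' matches direct children only)
block = soup.find('div', class_='statsListBlock')
via_mixed = [
    " ".join(tag.text.split())
    for tag in block.select('.normalStat > .stat')
]

print(via_select)  # ['Tackles 19', 'Interceptions 24']
print(via_mixed)   # ['Tackles 19', 'Interceptions 24']
```

Both queries skip the .stat element in the "other" block, because it is not nested inside .statsListBlock.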