Python Forum
Cleaning HTML data using Jupyter Notebook
#1
I need help cleaning up text extracted from HTML. The output splits every word into a separate list item (a small example is shown below). My full code is at the bottom and can also be found at https://github.com/aaron1986/Coursera_Ca...tats.ipynb

['Defence',
'Clean',
'sheets',
'13',
'Goals',
'Conceded',
'11',

Instead, I would like the data to look like this:

[Defence,
Clean sheets 13,
Goals Conceded 11,
]

import requests
import pandas as pd
import numpy as np
import seaborn as sns

from urllib.request import urlopen
from bs4 import BeautifulSoup
# --- next notebook cell: download the page ---
main_url = 'xxxxxxxx'
result = requests.get(main_url)
result.text

# --- next notebook cell: parse the HTML ---
soup = BeautifulSoup(result.text, 'html.parser')
print(soup.prettify())

# --- next notebook cell: grab the stats list ---
new = soup.find("ul", class_="normalStatList")
new.get_text()

# --- next notebook cell: split() breaks the text into single words ---
new2 = new.get_text().replace('\n', ' ').split()
new2
#2
I guess you are using BeautifulSoup.
Doing it this way destroys the original structure, because split() also breaks each phrase into single words.
Since you don't show the HTML, it's not easy to help.
Here is a quick example; note that the sentences do not get split up.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
>>> ptag = soup.find_all('p')
>>> ptag
[<p>This is a paragraph</p>, <p>blue car</p>]
>>> 
>>> for t in ptag:
...     print(t.text)     
...     
This is a paragraph
blue car
>>> lst = [t.text for t in ptag]
>>> lst
['This is a paragraph', 'blue car']
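To make the difference concrete, here is a minimal sketch that compares the two approaches on the same input. The HTML is a made-up stand-in (the real page markup was not posted at this point), and it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML; the real page structure is an assumption here.
html = """\
<ul>
  <li>Clean sheets 13</li>
  <li>Goals Conceded 11</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# Splitting the combined text on whitespace breaks every phrase into words.
words = soup.find("ul").get_text().split()

# Reading each <li> on its own keeps each phrase together.
phrases = [li.get_text(strip=True) for li in soup.find_all("li")]

print(words)
print(phrases)
```

The first list has one item per word, the second has one item per list element, which is the shape asked for in the first post.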
#3
I have updated my post with full code.
#4
Here is an example for the first normalStat element; you can try to figure out the loop yourself.
import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
stat = soup.find('div', class_="statsListBlock")
>>> norm = soup.find(class_="normalStat")
>>> text = norm.select_one('.stat').text.strip()
>>> text
'Clean sheets   \n      13'
>>> " ".join(text.split())
'Clean sheets 13'
#5
Hi, thank you for the reply. I have tried to code the loop, but I cannot seem to loop over all the '.stat' fields together.
#6
Try this.
import requests
from bs4 import BeautifulSoup

url = 'https://www.premierleague.com/players/16431/player/stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
norm_stat = soup.find_all(class_='normalStat')
for tag in norm_stat:
    temp = tag.select_one('.stat').text.strip()
    result = " ".join(temp.split())
    print(result)
Output:
Clean sheets 13
Goals Conceded 11
Tackles 19
Tackle success % 63%
Last man tackles 0
Blocked shots 1
Interceptions 24
Clearances 68
Headed Clearance 36
.....
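If you want the results in a list, as in the first post, rather than printed, the same loop can append to a list. This is only a sketch: the sample HTML is hypothetical and merely reuses the class names from this thread (normalStatList, normalStat, stat; the allStatContainer name is invented), and collect_stats is an illustrative helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names mentioned in this thread.
sample_html = """\
<ul class="normalStatList">
  <li><div class="normalStat"><span class="stat">Clean sheets
      <span class="allStatContainer">13</span></span></div></li>
  <li><div class="normalStat"><span class="stat">Goals Conceded
      <span class="allStatContainer">11</span></span></div></li>
</ul>"""


def collect_stats(soup):
    """Return one whitespace-normalized string per normalStat element."""
    stats = []
    for tag in soup.find_all(class_="normalStat"):
        stat = tag.select_one(".stat")
        if stat is None:
            # Skip blocks that have no '.stat' child instead of raising.
            continue
        # split() + join() collapses newlines and runs of spaces to one space.
        stats.append(" ".join(stat.get_text().split()))
    return stats


soup = BeautifulSoup(sample_html, "html.parser")
stats = collect_stats(soup)
print(stats)
```

With a live page you would build soup from response.content as in the code above and call collect_stats(soup) in the same way.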
#7
Thank you. It was the 'select_one' part that was confusing me.
#8
(Mar-05-2021, 10:13 PM)jacob1986 Wrote: Thank you. It was the 'select_one' part that was confusing me.
For your information, select() and select_one() give you the full power of CSS selectors.
Many people forget about this powerful feature of BeautifulSoup and just stick to find() and find_all().
As you can see, the two approaches can be mixed.
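A small sketch of what that means in practice: a single CSS selector passed to select() can do the job of find_all() followed by select_one(). The HTML below is hypothetical and only borrows the class names from this thread (allStatContainer is an invented name).

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names normalStat and stat come from the thread.
sample_html = """\
<ul class="normalStatList">
  <li><div class="normalStat"><span class="stat">Clean sheets
      <span class="allStatContainer">13</span></span></div></li>
  <li><div class="normalStat"><span class="stat">Goals Conceded
      <span class="allStatContainer">11</span></span></div></li>
</ul>"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS only: every .stat element nested anywhere inside a .normalStat element.
via_select = [
    " ".join(tag.get_text().split())
    for tag in soup.select(".normalStat .stat")
]

# Mixed: find_all() for the outer blocks, select_one() for the inner element.
via_mixed = [
    " ".join(tag.select_one(".stat").get_text().split())
    for tag in soup.find_all(class_="normalStat")
]

print(via_select)
print(via_mixed)
```

Both lists come out the same here, so which style to use is mostly a matter of taste.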