Python Forum
Cleaning HTML data using Jupyter Notebook
#2
I guess you are using BeautifulSoup.
Doing it like this messes up the original structure, because it also splits sentences.
Since you don't show the HTML, it's not easy to help.
Here is a quick example; note that the sentences don't get split up here.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph</p>
  <p>blue car</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
>>> ptag = soup.find_all('p')
>>> ptag
[<p>This is a paragraph</p>, <p>blue car</p>]
>>> 
>>> for t in ptag:
...     print(t.text)     
...     
This is a paragraph
blue car
>>> lst = [t.text for t in ptag]
>>> lst
['This is a paragraph', 'blue car']
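
Since the original HTML isn't shown, here is only a rough sketch of one way sentences can end up split, and how extracting text per tag avoids it. The sample HTML (with a nested `<b>` tag and extra spaces) is made up for illustration, and `html.parser` is used instead of `lxml` only so the snippet runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML: one paragraph contains an inline <b> tag,
# the other contains a run of extra spaces.
html = """\
<body>
  <h1>This is a Heading</h1>
  <p>This is a <b>bold</b> paragraph</p>
  <p>blue   car</p>
</body>"""

# html.parser ships with Python; 'lxml' as in the post above works too.
soup = BeautifulSoup(html, "html.parser")

# Walking every text node yields one fragment per node, so the sentence
# in the first paragraph is broken up around the <b> tag.
fragments = list(soup.body.stripped_strings)

# Taking the text of each <p> as a whole keeps the sentence together;
# split() + join() collapses runs of whitespace into single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

print(fragments)
print(paragraphs)
```

So if the goal is one clean string per paragraph, looping over the tags you care about and calling `get_text()` on each is usually safer than collecting every string in the document.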


Messages In This Thread
RE: Cleaning HTML data using Jupyter Notebook - by snippsat - Mar-04-2021, 10:05 PM
