Python Forum
Need Tip On Cleaning My BS4 Scraped Data
#1
Hey guys, I'm having an issue cleaning and refining some scraped data. Here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]
This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})
For some reason it saves the data as a list.

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, etc.

I tried to use .text to pull all this data out, but it only gives me one result ("12h"). I can do [4].text and it will output "5d", which is confusing: isn't each span supposed to be in quotes to count as a separate list item?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?
#2
Don't post scraped HTML data as a list.
That makes it harder for others to run.
Here is how it can look:
from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('span', {'data-class': 'timestamp'})
Test:
>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
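As for why [4].text works while the items are not in quotes: find_all() returns a list-like ResultSet of Tag objects, not strings, so .text has to be read from each item. Here is a minimal sketch of that, assuming a shortened three-span sample (one with extra whitespace added for illustration) and the built-in 'html.parser' instead of lxml; the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample; the padded " 4d " is an assumption added to show stripping
html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp"> 4d </span>
<span data-class="timestamp">19 Jan</span>'''

# 'html.parser' ships with Python, so lxml is not needed for this sketch
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like ResultSet of Tag objects, not strings
tags = soup.find_all('span', {'data-class': 'timestamp'})

# Indexing picks out one Tag, which is why [n].text gives a single value
first = tags[0].text

# get_text(strip=True) reads the text of each Tag and trims whitespace
timestamps = [tag.get_text(strip=True) for tag in tags]
joined = ', '.join(timestamps)
print(joined)
```

So yes, a loop (or a list comprehension, which is the same thing in one line) is the usual way to get the text out of every match.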
#3
Works perfectly. Much appreciated!