Need Tip On Cleaning My BS4 Scraped Data

digitalmatic7 · Jan-29-2018, 05:08 PM

Hey guys

I'm having an issue cleaning up some scraped data. Here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]

This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})

For some reason it saves the data as a list.

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan.. etc

I tried to use .text to pull all this data out, but it's only giving me one result ("12h"). I can do [4].text and it will output "5d", which is confusing, because I thought each span would need to be in quotes to be a separate list item, right?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?

snippsat · (This post was last modified: Jan-29-2018, 06:06 PM by snippsat.)

Don't post scraped HTML data as a printed list.
That makes it harder for others to run it.
Here is how it can look:

from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('span', {'data-class': 'timestamp'})

Test:

>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
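
On the question of why indexing works: find_all returns a list-like result of Tag objects, not strings, so [4].text reads the text of the fifth tag, and a loop or list comprehension is needed to get the text of every tag. A minimal sketch of that, assuming a shortened sample (the extra whitespace in the second span is made up for illustration) and the built-in 'html.parser' in place of 'lxml':

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; the padded second span is only for illustration.
html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp"> 4d </span>
<span data-class="timestamp">19 Jan</span>'''

# 'html.parser' ships with Python; 'lxml' as used above also works if installed.
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like collection of Tag objects.
spans = soup.find_all('span', attrs={'data-class': 'timestamp'})

# Indexing gives a single Tag, whose text can be read directly.
first_text = spans[0].text

# get_text(strip=True) also trims surrounding whitespace from each tag's text.
texts = [span.get_text(strip=True) for span in spans]
joined = ', '.join(texts)
print(joined)
```

If the scraped text never has stray whitespace, plain .text as in the example above is enough; get_text(strip=True) is just a slightly more defensive choice.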

digitalmatic7 · Jan-29-2018, 08:49 PM

Works perfectly. Much appreciated!
