Python Forum

Need Tip On Cleaning My BS4 Scraped Data
Hey guys, I'm having an issue cleaning and refining some scraped data. Here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]
This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})
For some reason it saves the data as a list.

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, etc.

I tried using .text to pull all this data out, but it only gives me one result ("12h"). I can do [4].text and it outputs "5d", which is confusing: doesn't each span need to be in quotes to count as a separate item?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?
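
A small sketch of what is going on here, using a hypothetical two-span sample and the standard-library 'html.parser' (an assumption, chosen so nothing extra has to be installed). find_all() returns a ResultSet, which is a list subclass holding Tag objects rather than strings, so the items don't need quotes to be separate; .text belongs to each Tag, not to the list. find(), by contrast, returns only the first match, which would explain getting a single "12h".

```python
from bs4 import BeautifulSoup

# Hypothetical two-span sample, not the original page.
html = '<span data-class="timestamp">12h</span><span data-class="timestamp">4d</span>'
soup = BeautifulSoup(html, 'html.parser')

spans = soup.find_all('span', {'data-class': 'timestamp'})

# The result behaves like a list whose items are Tag objects.
is_list = isinstance(spans, list)

# .text works on a single Tag, picked out by index...
first_text = spans[0].text

# ...or on every Tag in turn, via a loop or comprehension.
all_texts = [tag.text for tag in spans]

# find() returns just the first matching Tag.
only_first = soup.find('span', {'data-class': 'timestamp'}).text

print(is_list, first_text, all_texts, only_first)
```

So yes, some kind of loop (or a list comprehension) is needed to get the text out of every item.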
Don't post scraped HTML data as a list.
That makes it harder to run.
Here is how it can look:
from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('span', {'data-class': 'timestamp'})
Test:
>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
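
If the real page has stray whitespace or newlines inside the spans, get_text(strip=True) may be a handy variant of .text. This is only a sketch: the sample HTML is made up, and 'html.parser' is used instead of 'lxml' purely to avoid the extra install.

```python
from bs4 import BeautifulSoup

# Made-up sample with padding and a line break inside the spans.
html = '''
<span data-class="timestamp"> 12h </span>
<span data-class="timestamp">19 Jan
</span>
<span data-class="other">skip me</span>
'''

soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('span', {'data-class': 'timestamp'})

# strip=True trims leading/trailing whitespace from each piece of text.
cleaned = [item.get_text(strip=True) for item in data]

# Join into one comma-separated string.
result = ', '.join(cleaned)
print(result)
```

With the sample above, only the two timestamp spans are matched and the padding is dropped before joining.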
Works perfectly. Much appreciated!