Python Forum
Need Tip On Cleaning My BS4 Scraped Data


Need Tip On Cleaning My BS4 Scraped Data - digitalmatic7 - Jan-29-2018

Hey guys, I'm having an issue cleaning and refining some scraped data. Here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]
This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})
For some reason it saves the data as a list.

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, etc.

I tried to use .text to pull all this data out, but it's only giving me one result ("12h"). I can do [4].text and it will output "5d", which is confusing, because each span is supposed to be in quotes for it to be a separate item, right?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?


RE: Need Tip On Cleaning My BS4 Scraped Data - snippsat - Jan-29-2018

Don't post scraped HTML data as a list.
This makes it harder to run.
Here is how it can look:
from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

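# The 'lxml' parser must be installed separately (pip install lxml)
soup = BeautifulSoup(html, 'lxml')
# find_all() returns a list-like ResultSet of the matching tags
data = soup.find_all('span', {'data-class': 'timestamp'})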
Test:
>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
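
As a side note, the same result can probably be reached with a CSS selector. This is only a sketch under a few assumptions: the shortened html string is made up for illustration (with extra spaces around the last value), 'html.parser' is swapped in so that lxml is not required, and the names spans, timestamps and result are just placeholders. Like find_all(), select() returns a list-like collection of tags, so you pull the text out of each tag by looping or indexing rather than calling .text on the whole collection.

```python
from bs4 import BeautifulSoup

# Shortened, made-up sample; the last span has stray whitespace on purpose
html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp"> 19 Jan </span>'''

# 'html.parser' ships with Python, so no extra parser install is needed
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns the matching tags
spans = soup.select('span[data-class="timestamp"]')

# get_text(strip=True) trims surrounding whitespace from each tag's text
timestamps = [span.get_text(strip=True) for span in spans]

# Join the cleaned strings into one comma-separated line
result = ', '.join(timestamps)
print(result)
```

Using get_text(strip=True) instead of .text only matters when the scraped page has whitespace or newlines inside the tags; for the sample posted in this thread both give the same strings.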



RE: Need Tip On Cleaning My BS4 Scraped Data - digitalmatic7 - Jan-29-2018

Works perfectly. Much appreciated!