Python Forum
Need Tip On Cleaning My BS4 Scraped Data
Hey guys, I'm having an issue cleaning and refining some scraped data. Here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]
This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})
For some reason it saves the data as a list.

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, etc.

I tried to use .text to pull all this data out, but it's only giving me one result ("12h"). I can do [4].text and it will output "5d", which is confusing, because isn't each span supposed to be in quotes for it to be a separate item?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?
Don't post scraped HTML as a printed list.
That makes it harder for others to run.
Here is how it can look:
from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('span', {'data-class': 'timestamp'})
>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
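
To answer the loop question directly: find_all() returns a list-like ResultSet of Tag objects, so the text has to be pulled from each item in turn. Here is a rough sketch of the same thing written as an explicit loop. The shortened html sample, the padded " 4d " value, and the names spans, timestamps and result are made up for illustration; it also assumes the built-in 'html.parser' is acceptable in place of lxml.

```python
from bs4 import BeautifulSoup

# Shortened, made-up sample; the middle span is padded on purpose.
html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp"> 4d </span>
<span data-class="timestamp">19 Jan</span>'''

# 'html.parser' ships with Python, so lxml does not need to be installed.
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like ResultSet of Tag objects, one per match.
spans = soup.find_all('span', attrs={'data-class': 'timestamp'})

# Explicit loop; does the same job as a list comprehension.
timestamps = []
for span in spans:
    # get_text(strip=True) also trims whitespace around the text.
    timestamps.append(span.get_text(strip=True))

result = ', '.join(timestamps)
print(result)
```

The list comprehension is shorter, but the loop form is handy if you later need to skip or transform individual items.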
Works perfectly. Much appreciated!
