Python Forum
Need Tip On Cleaning My BS4 Scraped Data
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Need Tip On Cleaning My BS4 Scraped Data
#1
Hey guys Smile I'm having an issue cleaning and refining some scraped data.. here's a sample:

[<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>, <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]
This is how I'm scraping it:

js_test5 = soup.find_all('span', {'data-class': 'timestamp'})
For some reason it saves the data as a list item..

I want my output to look like this: 12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan.. etc

I tried to use .text to pull all this data out, but it's only giving me 1 result ("12h").. I can do [4].text and it will output "5d".. which is confusing, because each span is supposed to be in quotes for it to be a separate item right?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?
Reply
#2
Don't post scraped html data in list.
This make it harder to run it.
Here how it can look
from bs4 import BeautifulSoup

html = '''\
<span data-class="timestamp">12h</span>
<span data-class="timestamp">12h</span>
<span data-class="timestamp">4d</span>
<span data-class="timestamp">2d</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">18 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">19 Jan</span>
<span data-class="timestamp">5d</span>
<span data-class="timestamp">18 Jan</span>'''

soup = BeautifulSoup(html, 'lxml')
data = soup.find_all('span', {'data-class': 'timestamp'})
Test:
>>> [item.text for item in data]
['12h',
 '12h',
 '4d',
 '2d',
 '5d',
 '19 Jan',
 '18 Jan',
 '18 Jan',
 '19 Jan',
 '19 Jan',
 '5d',
 '18 Jan']

>>> ', '.join([item.text for item in data])
'12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'
Reply
#3
Works perfectly. Much appreciated!
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Any way to remove HTML tags from scraped data? (I want text only) SeBz2020uk 1 231 Nov-02-2020, 08:12 PM
Last Post: Larz60+
  cant loop through scraped site matt42 3 358 Aug-12-2020, 06:48 AM
Last Post: ndc85430
  Normalizig scraped text wuggs 3 576 Jan-07-2020, 03:32 AM
Last Post: Larz60+
  non-finite value error when cleaning data yokaso 0 1,264 Dec-17-2019, 07:26 AM
Last Post: yokaso
  Parsing infor from scraped files. Larz60+ 2 1,133 Apr-12-2019, 05:06 PM
Last Post: Larz60+
  beautiful soup - parsing scraped code in a script lilbigwill99 2 1,387 Mar-09-2018, 04:10 PM
Last Post: lilbigwill99

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020