Need Tip On Cleaning My BS4 Scraped Data
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-7908.html
Need Tip On Cleaning My BS4 Scraped Data - digitalmatic7 - Jan-29-2018

Hey guys, I'm having an issue cleaning and refining some scraped data. Here's a sample:

    [<span data-class="timestamp">12h</span>, <span data-class="timestamp">12h</span>,
     <span data-class="timestamp">4d</span>, <span data-class="timestamp">2d</span>,
     <span data-class="timestamp">5d</span>, <span data-class="timestamp">19 Jan</span>,
     <span data-class="timestamp">18 Jan</span>, <span data-class="timestamp">18 Jan</span>,
     <span data-class="timestamp">19 Jan</span>, <span data-class="timestamp">19 Jan</span>,
     <span data-class="timestamp">5d</span>, <span data-class="timestamp">18 Jan</span>]

This is how I'm scraping it:

    js_test5 = soup.find_all('span', {'data-class': 'timestamp'})

For some reason it saves the data as a list. I want my output to look like this:

    12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan.. etc

I tried to use .text to pull all this data out, but it's only giving me 1 result ("12h"). I can do [4].text and it will output "5d", which is confusing, because each span is supposed to be in quotes for it to be a separate item, right?

Do I need to run a loop to pull all the results out? Or maybe my method of scraping can be improved? What's the best way for me to solve this?

RE: Need Tip On Cleaning My BS4 Scraped Data - snippsat - Jan-29-2018

Don't post scraped HTML data as a printed list. That makes it harder to run.
Here's how it can look:

    from bs4 import BeautifulSoup

    html = '''\
    <span data-class="timestamp">12h</span>
    <span data-class="timestamp">12h</span>
    <span data-class="timestamp">4d</span>
    <span data-class="timestamp">2d</span>
    <span data-class="timestamp">5d</span>
    <span data-class="timestamp">19 Jan</span>
    <span data-class="timestamp">18 Jan</span>
    <span data-class="timestamp">18 Jan</span>
    <span data-class="timestamp">19 Jan</span>
    <span data-class="timestamp">19 Jan</span>
    <span data-class="timestamp">5d</span>
    <span data-class="timestamp">18 Jan</span>'''

    soup = BeautifulSoup(html, 'lxml')
    data = soup.find_all('span', {'data-class': 'timestamp'})

Test:

    >>> [item.text for item in data]
    ['12h', '12h', '4d', '2d', '5d', '19 Jan', '18 Jan', '18 Jan', '19 Jan', '19 Jan', '5d', '18 Jan']
    >>> ', '.join([item.text for item in data])
    '12h, 12h, 4d, 2d, 5d, 19 Jan, 18 Jan, 18 Jan, 19 Jan, 19 Jan, 5d, 18 Jan'

RE: Need Tip On Cleaning My BS4 Scraped Data - digitalmatic7 - Jan-29-2018

Works perfectly. Much appreciated!
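
On the "do I need a loop?" question: find_all() returns a list-like ResultSet of Tag objects, and .text belongs to each Tag rather than to the list, so some form of iteration is needed. The sketch below is only an illustration, not the thread's exact code. It assumes a shortened sample with padding added around one value (made up here to show strip=True), uses the built-in "html.parser" instead of lxml, and the names spans, first, timestamps and joined are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened sample of the thread's markup. The spaces around "4d" are an
# assumption added for this sketch, to show what strip=True does.
html = """
<span data-class="timestamp">12h</span>
<span data-class="timestamp"> 4d </span>
<span data-class="timestamp">19 Jan</span>
"""

# "html.parser" ships with Python, so lxml is not required for this sketch.
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet of Tag objects, not strings.
spans = soup.find_all("span", attrs={"data-class": "timestamp"})

# find() returns only the first matching Tag.
first = soup.find("span", attrs={"data-class": "timestamp"})

# Pull the text out of every Tag; strip=True trims surrounding whitespace.
timestamps = [span.get_text(strip=True) for span in spans]

# Join the cleaned strings into one comma-separated line.
joined = ", ".join(timestamps)

print(first.text)
print(timestamps)
print(joined)
```

Indexing such as spans[1].text works for the same reason [4].text did in the first post: each element of the result is its own Tag, even though the printed list shows no quotes around it.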