Scraping Numbers from HTML using BeautifulSoup
Python Forum (https://python-forum.io), Forum: Homework (https://python-forum.io/forum-9.html), Thread: /thread-26278.html
Scraping Numbers from HTML using BeautifulSoup - eyavuz21 - Apr-26-2020

Hey all,

I am currently completing the 'Python for Everybody' course on Coursera and I am stuck on the 'Scraping Numbers from HTML using BeautifulSoup' problem. The last line of my code is not working!

This is the task: You are to find all the <span> tags in the file, pull out the numbers from the tags, and sum the numbers.

The URL for the code below is: http://py4e-data.dr-chuck.net/comments_42.html

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import ssl
    import re

    listed = list()

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = input('Enter - ')
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")

    # Look for <a> (anchor) tags
    tags = soup('a')

When I enter 'tags', the program gives me an empty list:

    []

I believed it should show me the anchor tags from this webpage. When I print 'soup', I get the following:

    <html>
    <head>
    <title>Welcome to the comments assignment from www.py4e.com</title>
    </head>
    <body>
    <h1>This file contains the sample data for testing</h1>
    <table border="2">
    <tr> <td>Name</td><td>Comments</td> </tr>
    <tr><td>Romina</td><td><span class="comments">97</span></td></tr>
    <tr><td>Laurie</td><td><span class="comments">97</span></td></tr>
    <tr><td>Bayli</td><td><span class="comments">90</span></td></tr>
    <tr><td>Siyona</td><td><span class="comments">90</span></td></tr>
    <tr><td>Taisha</td><td><span class="comments">88</span></td></tr>
    <tr><td>Alanda</td><td><span class="comments">87</span></td></tr>
    <tr><td>Ameelia</td><td><span class="comments">87</span></td></tr>
    <tr><td>Prasheeta</td><td><span class="comments">80</span></td></tr>
    <tr><td>Asif</td><td><span class="comments">79</span></td></tr>
    <tr><td>Risa</td><td><span class="comments">79</span></td></tr>
    <tr><td>Zi</td><td><span class="comments">78</span></td></tr>
    <tr><td>Danyil</td><td><span class="comments">76</span></td></tr>
    <tr><td>Ediomi</td><td><span class="comments">76</span></td></tr>
    <tr><td>Barry</td><td><span class="comments">72</span></td></tr>
    <tr><td>Lance</td><td><span class="comments">72</span></td></tr>
    <tr><td>Hattie</td><td><span class="comments">66</span></td></tr>
    <tr><td>Mathu</td><td><span class="comments">66</span></td></tr>
    <tr><td>Bowie</td><td><span class="comments">65</span></td></tr>
    <tr><td>Samara</td><td><span class="comments">65</span></td></tr>
    <tr><td>Uchenna</td><td><span class="comments">64</span></td></tr>
    <tr><td>Shauni</td><td><span class="comments">61</span></td></tr>
    <tr><td>Georgia</td><td><span class="comments">61</span></td></tr>
    <tr><td>Rivan</td><td><span class="comments">59</span></td></tr>
    <tr><td>Kenan</td><td><span class="comments">58</span></td></tr>
    <tr><td>Hassan</td><td><span class="comments">57</span></td></tr>
    <tr><td>Isma</td><td><span class="comments">57</span></td></tr>
    <tr><td>Samanthalee</td><td><span class="comments">54</span></td></tr>
    <tr><td>Alexa</td><td><span class="comments">51</span></td></tr>
    <tr><td>Caine</td><td><span class="comments">49</span></td></tr>
    <tr><td>Grady</td><td><span class="comments">47</span></td></tr>
    <tr><td>Anne</td><td><span class="comments">40</span></td></tr>
    <tr><td>Rihan</td><td><span class="comments">38</span></td></tr>
    <tr><td>Alexei</td><td><span class="comments">37</span></td></tr>
    <tr><td>Indie</td><td><span class="comments">36</span></td></tr>
    <tr><td>Rhuairidh</td><td><span class="comments">36</span></td></tr>
    <tr><td>Annoushka</td><td><span class="comments">32</span></td></tr>
    <tr><td>Kenzi</td><td><span class="comments">25</span></td></tr>
    <tr><td>Shahd</td><td><span class="comments">24</span></td></tr>
    <tr><td>Irvine</td><td><span class="comments">22</span></td></tr>
    <tr><td>Carys</td><td><span class="comments">21</span></td></tr>
    <tr><td>Skye</td><td><span class="comments">19</span></td></tr>
    <tr><td>Atiya</td><td><span class="comments">18</span></td></tr>
    <tr><td>Rohan</td><td><span class="comments">18</span></td></tr>
    <tr><td>Nuala</td><td><span class="comments">14</span></td></tr>
    <tr><td>Maram</td><td><span class="comments">12</span></td></tr>
    <tr><td>Carlo</td><td><span class="comments">12</span></td></tr>
    <tr><td>Japleen</td><td><span class="comments">9</span></td></tr>
    <tr><td>Breeanna</td><td><span class="comments">7</span></td></tr>
    <tr><td>Zaaine</td><td><span class="comments">3</span></td></tr>
    <tr><td>Inika</td><td><span class="comments">2</span></td></tr>
    </table>
    </body>
    </html>

I only want to extract the lines that look like the following:

    <tr><td>Inika</td><td><span class="comments">2</span></td></tr>

I have tried:

    tags = soup('<tr><td>')

which also does not work.

Alternatively, I have also tried converting the contents of the web page to a string and then extracting the desired numeric values:

    # convert soup to a string
    soup = str(soup)
    # extract the desired numeric values from the relevant lines
    soup = re.findall('>[0-9]+', soup)

After this, soup holds:

    soup = ['>97', '>97', '>90', '>90', '>88', '>87', '>87', '>80', '>79', '>79', '>78', '>76', '>76', '>72', '>72', '>66', '>66', '>65', '>65', '>64', '>61', '>61', '>59', '>58', '>57', '>57', '>54', '>51', '>49', '>47', '>40', '>38', '>37', '>36', '>36', '>32', '>25', '>24', '>22', '>21', '>19', '>18', '>18', '>14', '>12', '>12', '>9', '>7', '>3', '>2']

I want to find the sum of these values.

    # turn the list back into a string and keep only the digits
    soup = str(soup)
    soup = re.findall('[0-9]+', soup)
    # convert each match to an integer and collect it
    for value in soup:
        value = int(value)
        listed.append(value)
    print(sum(listed))

This gives me a value of 2553. However, is this a valid way of solving the issue? I would also be grateful if someone could let me know why I was unable to extract the anchor tags above.

Thank you!

RE: Scraping Numbers from HTML using BeautifulSoup - Larz60+ - Apr-26-2020

I would suggest reading (on this forum) 'web scraping part 1' and 'web scraping part 2'.
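
For reference, here is a minimal sketch of the BeautifulSoup route the task describes (find the <span> tags, pull out the numbers, sum them). The page printed above contains no <a> tags at all, which would explain why soup('a') comes back as an empty list; the numbers sit inside <span class="comments"> tags instead. The helper name sum_span_numbers and the two-row SAMPLE_HTML string are assumptions made for illustration so the example runs without network access; they are not part of the original post or the course material.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample shaped like the assignment page
# (two rows only), so the sketch runs without fetching the URL.
SAMPLE_HTML = """
<html><body>
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Romina</td><td><span class="comments">97</span></td></tr>
<tr><td>Inika</td><td><span class="comments">2</span></td></tr>
</table>
</body></html>
"""


def sum_span_numbers(html):
    """Return the sum of the integer text inside every <span> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # soup("span") is shorthand for soup.find_all("span")
    spans = soup("span")
    # get_text() returns the text between <span> and </span>
    return sum(int(tag.get_text().strip()) for tag in spans)


if __name__ == "__main__":
    print(sum_span_numbers(SAMPLE_HTML))
    # There are no anchor tags in this markup, so this prints an empty list.
    print(BeautifulSoup(SAMPLE_HTML, "html.parser")("a"))
```

To run it against the real page, the html bytes returned by urlopen(url, context=ctx).read() in the original post could be passed to sum_span_numbers in place of SAMPLE_HTML. The regex approach in the post reaches the same total on this page, but it depends on no other number directly following a '>' character, whereas selecting the <span> tags reads only the cells the task asks for.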