Apr-26-2020, 09:10 AM
Hey all,
I am currently completing the 'Python for Everybody' course on Coursera and I am stuck on the 'Scraping Numbers from HTML using BeautifulSoup' problem. However, the last line of my code is not working!
This is the task: You are to find all the <span> tags in the file and pull out the numbers from the tag and sum the numbers.
The URL used in the code below is: http://py4e-data.dr-chuck.net/comments_42.html
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

listed = list()

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')

When I enter 'tags', the program gives me an 'empty' answer:

[]

But I believe it should show me the anchor tags from this webpage.
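One way to sanity-check this, sketched here against a one-row stand-in for the page rather than the live URL, is to list the tag names that actually occur in the parsed document:

```python
from bs4 import BeautifulSoup

# A one-row stand-in for the comments_42.html table
html = '<tr><td>Inika</td><td><span class="comments">2</span></td></tr>'
soup = BeautifulSoup(html, 'html.parser')

# find_all(True) matches every tag; collect the distinct names.
# An empty result from soup('a') simply means the document
# contains no <a> elements at all.
print(sorted({tag.name for tag in soup.find_all(True)}))
```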
When I print 'soup', I get the following:
<html> <head> <title>Welcome to the comments assignment from www.py4e.com</title> </head> <body> <h1>This file contains the sample data for testing</h1> <table border="2"> <tr> <td>Name</td><td>Comments</td> </tr> <tr><td>Romina</td><td><span class="comments">97</span></td></tr> <tr><td>Laurie</td><td><span class="comments">97</span></td></tr> <tr><td>Bayli</td><td><span class="comments">90</span></td></tr> <tr><td>Siyona</td><td><span class="comments">90</span></td></tr> <tr><td>Taisha</td><td><span class="comments">88</span></td></tr> <tr><td>Alanda</td><td><span class="comments">87</span></td></tr> <tr><td>Ameelia</td><td><span class="comments">87</span></td></tr> <tr><td>Prasheeta</td><td><span class="comments">80</span></td></tr> <tr><td>Asif</td><td><span class="comments">79</span></td></tr> <tr><td>Risa</td><td><span class="comments">79</span></td></tr> <tr><td>Zi</td><td><span class="comments">78</span></td></tr> <tr><td>Danyil</td><td><span class="comments">76</span></td></tr> <tr><td>Ediomi</td><td><span class="comments">76</span></td></tr> <tr><td>Barry</td><td><span class="comments">72</span></td></tr> <tr><td>Lance</td><td><span class="comments">72</span></td></tr> <tr><td>Hattie</td><td><span class="comments">66</span></td></tr> <tr><td>Mathu</td><td><span class="comments">66</span></td></tr> <tr><td>Bowie</td><td><span class="comments">65</span></td></tr> <tr><td>Samara</td><td><span class="comments">65</span></td></tr> <tr><td>Uchenna</td><td><span class="comments">64</span></td></tr> <tr><td>Shauni</td><td><span class="comments">61</span></td></tr> <tr><td>Georgia</td><td><span class="comments">61</span></td></tr> <tr><td>Rivan</td><td><span class="comments">59</span></td></tr> <tr><td>Kenan</td><td><span class="comments">58</span></td></tr> <tr><td>Hassan</td><td><span class="comments">57</span></td></tr> <tr><td>Isma</td><td><span class="comments">57</span></td></tr> <tr><td>Samanthalee</td><td><span class="comments">54</span></td></tr> 
<tr><td>Alexa</td><td><span class="comments">51</span></td></tr> <tr><td>Caine</td><td><span class="comments">49</span></td></tr> <tr><td>Grady</td><td><span class="comments">47</span></td></tr> <tr><td>Anne</td><td><span class="comments">40</span></td></tr> <tr><td>Rihan</td><td><span class="comments">38</span></td></tr> <tr><td>Alexei</td><td><span class="comments">37</span></td></tr> <tr><td>Indie</td><td><span class="comments">36</span></td></tr> <tr><td>Rhuairidh</td><td><span class="comments">36</span></td></tr> <tr><td>Annoushka</td><td><span class="comments">32</span></td></tr> <tr><td>Kenzi</td><td><span class="comments">25</span></td></tr> <tr><td>Shahd</td><td><span class="comments">24</span></td></tr> <tr><td>Irvine</td><td><span class="comments">22</span></td></tr> <tr><td>Carys</td><td><span class="comments">21</span></td></tr> <tr><td>Skye</td><td><span class="comments">19</span></td></tr> <tr><td>Atiya</td><td><span class="comments">18</span></td></tr> <tr><td>Rohan</td><td><span class="comments">18</span></td></tr> <tr><td>Nuala</td><td><span class="comments">14</span></td></tr> <tr><td>Maram</td><td><span class="comments">12</span></td></tr> <tr><td>Carlo</td><td><span class="comments">12</span></td></tr> <tr><td>Japleen</td><td><span class="comments">9</span></td></tr> <tr><td>Breeanna</td><td><span class="comments">7</span></td></tr> <tr><td>Zaaine</td><td><span class="comments">3</span></td></tr> <tr><td>Inika</td><td><span class="comments">2</span></td></tr> </table> </body> </html>

I only want to extract the lines that look like the following:
<tr><td>Inika</td><td><span class="comments">2</span></td></tr>

I have tried:

tags = soup('<tr><td>')

which also does not work.
Alternatively, I have also tried to convert the contents of the web page to a string and then extract the desired numeric values:
#to convert soup to a string
soup = str(soup)
#to extract the desired numeric values from relevant lines
soup = re.findall('>[0-9]+', soup)

soup is now:

['>97', '>97', '>90', '>90', '>88', '>87', '>87', '>80', '>79', '>79', '>78', '>76', '>76', '>72', '>72', '>66', '>66', '>65', '>65', '>64', '>61', '>61', '>59', '>58', '>57', '>57', '>54', '>51', '>49', '>47', '>40', '>38', '>37', '>36', '>36', '>32', '>25', '>24', '>22', '>21', '>19', '>18', '>18', '>14', '>12', '>12', '>9', '>7', '>3', '>2']

I want to find the sum of these values.
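A small tweak that avoids carrying the '>' into the matches is to use a capturing group. Sketched on a short stand-in string (the pattern '>([0-9]+)<' assumes the digits sit directly between tags, as in the page shown above):

```python
import re

page = '<span class="comments">97</span><span class="comments">2</span>'
# The parentheses form a capturing group, so findall returns only
# the digits, not the surrounding '>' and '<'
numbers = re.findall('>([0-9]+)<', page)
print(numbers)                       # ['97', '2']
print(sum(int(n) for n in numbers))  # 99
```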
soup = str(soup)
soup = re.findall('[0-9]+', soup)
for value in soup:
    value = int(value)
    listed.append(value)
print(sum(listed))

This gives me a value of 2553.
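For comparison, pulling the numbers straight out of the <span> tags avoids the regex round-trip entirely. A minimal sketch against a two-row stand-in for the page (the real script would feed in the html fetched above instead):

```python
from bs4 import BeautifulSoup

# Two-row stand-in for the comments_42.html table
html = '''<table>
<tr><td>Romina</td><td><span class="comments">97</span></td></tr>
<tr><td>Inika</td><td><span class="comments">2</span></td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
# soup('span') is shorthand for soup.find_all('span');
# tag.text gives the text inside each <span>
total = sum(int(tag.text) for tag in soup('span'))
print(total)  # 99
```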
However, is this a valid way of solving the problem? I would also be grateful if someone could explain why I was unable to extract the anchor tags above.
Thank you!