Python Forum
Scraping Numbers from HTML using BeautifulSoup
#1
Hey all,

I am currently completing the 'Python for Everybody' course on Coursera and I am stuck on the 'Scraping Numbers from HTML using BeautifulSoup' problem. The last line of my code is not working!

This is the task: You are to find all the <span> tags in the file and pull out the numbers from the tag and sum the numbers.

The URL link for the code below is: http://py4e-data.dr-chuck.net/comments_42.html

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
listed = list() 
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
When I enter 'tags', the program gives me an 'empty' answer:

[]
I expected it to show me the 'anchor' tags from this webpage.

When I print 'soup', I get the following:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the sample data for testing</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Romina</td><td><span class="comments">97</span></td></tr>
<tr><td>Laurie</td><td><span class="comments">97</span></td></tr>
<tr><td>Bayli</td><td><span class="comments">90</span></td></tr>
<tr><td>Siyona</td><td><span class="comments">90</span></td></tr>
<tr><td>Taisha</td><td><span class="comments">88</span></td></tr>
<tr><td>Alanda</td><td><span class="comments">87</span></td></tr>
<tr><td>Ameelia</td><td><span class="comments">87</span></td></tr>
<tr><td>Prasheeta</td><td><span class="comments">80</span></td></tr>
<tr><td>Asif</td><td><span class="comments">79</span></td></tr>
<tr><td>Risa</td><td><span class="comments">79</span></td></tr>
<tr><td>Zi</td><td><span class="comments">78</span></td></tr>
<tr><td>Danyil</td><td><span class="comments">76</span></td></tr>
<tr><td>Ediomi</td><td><span class="comments">76</span></td></tr>
<tr><td>Barry</td><td><span class="comments">72</span></td></tr>
<tr><td>Lance</td><td><span class="comments">72</span></td></tr>
<tr><td>Hattie</td><td><span class="comments">66</span></td></tr>
<tr><td>Mathu</td><td><span class="comments">66</span></td></tr>
<tr><td>Bowie</td><td><span class="comments">65</span></td></tr>
<tr><td>Samara</td><td><span class="comments">65</span></td></tr>
<tr><td>Uchenna</td><td><span class="comments">64</span></td></tr>
<tr><td>Shauni</td><td><span class="comments">61</span></td></tr>
<tr><td>Georgia</td><td><span class="comments">61</span></td></tr>
<tr><td>Rivan</td><td><span class="comments">59</span></td></tr>
<tr><td>Kenan</td><td><span class="comments">58</span></td></tr>
<tr><td>Hassan</td><td><span class="comments">57</span></td></tr>
<tr><td>Isma</td><td><span class="comments">57</span></td></tr>
<tr><td>Samanthalee</td><td><span class="comments">54</span></td></tr>
<tr><td>Alexa</td><td><span class="comments">51</span></td></tr>
<tr><td>Caine</td><td><span class="comments">49</span></td></tr>
<tr><td>Grady</td><td><span class="comments">47</span></td></tr>
<tr><td>Anne</td><td><span class="comments">40</span></td></tr>
<tr><td>Rihan</td><td><span class="comments">38</span></td></tr>
<tr><td>Alexei</td><td><span class="comments">37</span></td></tr>
<tr><td>Indie</td><td><span class="comments">36</span></td></tr>
<tr><td>Rhuairidh</td><td><span class="comments">36</span></td></tr>
<tr><td>Annoushka</td><td><span class="comments">32</span></td></tr>
<tr><td>Kenzi</td><td><span class="comments">25</span></td></tr>
<tr><td>Shahd</td><td><span class="comments">24</span></td></tr>
<tr><td>Irvine</td><td><span class="comments">22</span></td></tr>
<tr><td>Carys</td><td><span class="comments">21</span></td></tr>
<tr><td>Skye</td><td><span class="comments">19</span></td></tr>
<tr><td>Atiya</td><td><span class="comments">18</span></td></tr>
<tr><td>Rohan</td><td><span class="comments">18</span></td></tr>
<tr><td>Nuala</td><td><span class="comments">14</span></td></tr>
<tr><td>Maram</td><td><span class="comments">12</span></td></tr>
<tr><td>Carlo</td><td><span class="comments">12</span></td></tr>
<tr><td>Japleen</td><td><span class="comments">9</span></td></tr>
<tr><td>Breeanna</td><td><span class="comments">7</span></td></tr>
<tr><td>Zaaine</td><td><span class="comments">3</span></td></tr>
<tr><td>Inika</td><td><span class="comments">2</span></td></tr>
</table>
</body>
</html>
I only want to extract the lines that look like the following:

<tr><td>Inika</td><td><span class="comments">2</span></td></tr>
I have tried:

tags = soup('<tr><td>')
This also does not work.

Alternatively, I have also tried to convert the contents of the web page to a string and then extract the desired numeric values:

#to convert soup to a string
soup = str(soup)

#to extract the desired numeric values from relevant lines
soup = re.findall('>[0-9]+',soup)  
soup = ['>97', '>97', '>90', '>90', '>88', '>87', '>87', '>80', '>79', '>79', '>78', '>76', '>76', '>72', '>72', '>66', '>66', '>65', '>65', '>64', '>61', '>61', '>59', '>58', '>57', '>57', '>54', '>51', '>49', '>47', '>40', '>38', '>37', '>36', '>36', '>32', '>25', '>24', '>22', '>21', '>19', '>18', '>18', '>14', '>12', '>12', '>9', '>7', '>3', '>2']
I want to find the sum of these values.

soup = str(soup) 
soup = re.findall('[0-9]+',soup)
for value in soup:
    value = int(value)
    listed.append(value)
print(sum(listed)) 
This gives me a value of 2553.

However, is this a valid way of solving the issue? I would be grateful if someone could let me know why I was unable to extract the anchor tags above.

Thank you!
#2
I would suggest reading these tutorials (on this forum):
web scraping part 1
web scraping part 2
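
For reference, here is a minimal sketch of the tag-based approach the task describes. The page printed in post #1 contains no <a> tags, which would explain why soup('a') comes back empty; the numbers sit inside <span> tags. Note also that soup(...) expects a tag name such as 'span' or 'tr', not a piece of markup like '<tr><td>'. The sample_html string and the sum_span_numbers helper are made up for illustration (a few rows copied from the output above so the code runs offline); with the real page you would pass in the html read from urlopen instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a few rows copied from the page printed in post #1,
# so this sketch runs without a network connection.
sample_html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Romina</td><td><span class="comments">97</span></td></tr>
<tr><td>Zaaine</td><td><span class="comments">3</span></td></tr>
<tr><td>Inika</td><td><span class="comments">2</span></td></tr>
</table>
"""


def sum_span_numbers(html):
    """Return the sum of the integers found inside <span> tags (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all takes a tag name; soup("span") is shorthand for the same call.
    spans = soup.find_all("span")
    # get_text() returns the text inside each tag, e.g. "97".
    return sum(int(tag.get_text()) for tag in spans)


soup = BeautifulSoup(sample_html, "html.parser")
print(len(soup.find_all("a")))        # number of <a> tags in the sample
print(sum_span_numbers(sample_html))  # sum of the numbers in the sample rows
```

The regex approach in post #1 reaches the same total, but it depends on the exact text layout of the page (any other number following a '>' would be counted too), whereas selecting the <span> tags follows the structure the task asks for.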