Python Forum

Scraping number in % from website
Hi all, for my chess club I'm trying to automate collecting timeout percentages.

The value is hidden in this code:          <aside>7.69%</aside>



 
 <ul class="stats-list no-border">
     <li>
       Winning Streak        <aside>19</aside>
     </li>
             <li>
         Time per Move          <aside>14 hours 15 minutes</aside>
       </li>
                   <li>
         Timeouts          <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0"><i class="icon-circle-question"
     
     
         ></i></span>
         <aside>7.69%</aside>
</li>
           <li>
       Glicko RD        <aside>
          73         </aside>
     </li>
             <li>
         Top Opponent          <aside>N/A</aside>
       </li>
         </ul>
 </div>

 <div class="col-md-6">

   <div class="chart-box live">
     <span class="ui-select-search-container">
       <ui-select class="chess-select"
           ng-model="model.selectedOpponent"
           on-select="selectOpponent($item)" ng-cloak>
         <ui-select-match
           placeholder="vs. All Opponents"
           allow-clear="true">
           [[ $select.selected.id ]]
         </ui-select-match>
         <ui-select-choices repeat="opponent in UI.opponents"
           refresh="findOpponents($select.search)"
           refresh-delay="0">
           [[ opponent.id ]]
         </ui-select-choices>
       </ui-select>
     </span>
*******************************************************************************


It's not always a number with decimals, but when it is, I can only collect the last 2 decimals, which is a problem.
I need the complete number (or at least the digits before the decimal point), and it also has to work when the number is 0%, 10% or 100% instead of 24.76%.
The code I have is here:

import sys
import fileinput
import requests
from bs4 import BeautifulSoup
import pandas as dataset
import string
import re
from decimal import *

static_profile_url= REMOVED DUE TO ANTISPAM MEASURES
namen = []
timeouts = []


# Search between string patterns and return the value as a string.
# This extracts the timeout percentage, without the %, from the HTML
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start)
        timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1)
        #timeout = (s.split(first))[1].split(last)[0]
        print (timeout)
        return (timeout)
    except ValueError:
        return "error parsing"



def retrieve_timeouts(speler_stats_url):
    try:
        r = requests.get(speler_stats_url)
        soup = BeautifulSoup(r.text, 'lxml')
        #  stats = stat_soup.findAll(class_='stats-list no-border')
        stats = soup.findAll('ul', class_='stats-list no-border')
        timeout_percentage = find_between( str(stats), '<aside>', '%</aside>' )
        print (timeout_percentage)
        return int(timeout_percentage)
    except ValueError:
        return "error parsing"


print('processing, please wait... this may take a long time!')
fnamen = open('namen.txt', 'r')
tnamen = fnamen.read().splitlines()
for naam in tnamen:
    print (naam)
    namen.append(naam)
    timeouts.append(retrieve_timeouts(static_profile_url + str(naam)))
    print (retrieve_timeouts(static_profile_url + str(naam)))

spelersdata = { 'naam': namen, 'timeout': timeouts }
ds = dataset.DataFrame(spelersdata)
f = open('timouts.csv', 'w')
f.writelines(ds.to_csv())
f.close()
I don't know why it's not working. I'm not used to coding in Python, let alone building scrapers.
So my code is mostly copy-pasted from elsewhere...


Could someone please help me out with this one or point me in the right direction?
Take a look at this.
from bs4 import BeautifulSoup

html = '''\
<li>
 Timeouts <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0">
 <i class="icon-circle-question"></i></span>
 <aside>7.69%</aside>
</li>'''

soup = BeautifulSoup(html, 'lxml')
soup = soup.select('li > aside')[0]
number = soup.text
print(number) #--> 7.69%
# Only float number
print(float(number[:-1])) #--> 7.69
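If the real page has more list items than the snippet, a possible variation is to look the row up by its label instead of a fixed index. This is only a sketch under assumptions: the helper name timeout_percentage is made up for illustration, the HTML is a trimmed copy of the markup from the question, and it uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to be representative).
html = '''\
<ul class="stats-list no-border">
  <li>Winning Streak <aside>19</aside></li>
  <li>Timeouts <span class="stats-list-info" tip="Last 3 Months">
    <i class="icon-circle-question"></i></span>
    <aside>7.69%</aside>
  </li>
  <li>Glicko RD <aside>
    73 </aside></li>
</ul>'''


def timeout_percentage(page_html):
    """Return the Timeouts value as a float, or None if the row is missing."""
    soup = BeautifulSoup(page_html, 'html.parser')
    for li in soup.select('ul.stats-list li'):
        # Pick the row by its label, not by its position in the list.
        if not li.get_text(strip=True).startswith('Timeouts'):
            continue
        aside = li.find('aside')
        if aside is None:
            return None
        # Drop the trailing % so 0%, 10%, 100% and 7.69% all convert cleanly.
        return float(aside.get_text(strip=True).rstrip('%'))
    return None


print(timeout_percentage(html))  # 7.69
```

Returning None when the row is absent leaves it to the caller to decide how to record a missing value, rather than mixing an error string into a column of numbers.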
Your regular expression in timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1), i.e. r'(\d+)$', matches only the string of digits that ends the original string. To match the whole string you need something like r'^(\d+(\.\d+)?)$', which matches a bunch of digits, optionally followed by a dot and another bunch of digits.
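To make the difference concrete, here is a small sketch using only the re module. The variable chunk is a hand-made stand-in for what s[start:end] holds in find_between: since that slice starts at the first <aside> on the page, it carries leading markup before the number. Under that assumption a pattern anchored with ^ does not match the slice, so the sketch keeps only the end anchor for that case; the ^...$ form fits when the string is just the number.

```python
import re

# Hypothetical stand-in for s[start:end] in find_between: everything from the
# first <aside> on the page up to (but not including) '%</aside>'.
chunk = '19</aside>\n</li>\n<li>Timeouts <aside>7.69'

old = re.compile(r'(\d+)$')               # original pattern: trailing digits only
new = re.compile(r'(\d+(?:\.\d+)?)$')     # digits, optional '.digits', at the end
whole = re.compile(r'^(\d+(\.\d+)?)$')    # the string must be nothing but the number

print(old.search(chunk).group(1))    # 69
print(new.search(chunk).group(1))    # 7.69
print(whole.search(chunk))           # None (the slice has leading markup)
print(whole.search('7.69').group(1)) # 7.69

# Whole numbers still work with the end-anchored pattern.
for text in ('0', '10', '100', '24.76'):
    print(float(new.search(text).group(1)))
```

Note that the result is converted with float() here; int() raises ValueError on a string such as '7.69'.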
Awesome, it wasn't the real solution but it got me on the right track!
The only thing I had to alter was the index of the li: the HTML is a lot larger than what I pasted here, but since the value is always in the same spot, this fixes the problem.
Thank you very much! I was struggling with this for hours and would never have been able to solve it myself.

@ofnuts: I hadn't seen your post yet, I will try that one as well, since it would let me keep most of my code. Thank you both!

EDIT: Well, since I've implemented the first solution I can get rid of the two functions completely. Sticking with that one! Thanks again!