 Scraping number in % from website
#1
Hi all, for my chess club I'm trying to automate collecting timeout percentages.

It is hidden in this code:          <aside>7.69%</aside>



 
 <ul class="stats-list no-border">
     <li>
       Winning Streak        <aside>19</aside>
     </li>
             <li>
         Time per Move          <aside>14 hours 15 minutes</aside>
       </li>
                   <li>
         Timeouts          <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0"><i class="icon-circle-question"
     
     
         ></i></span>
         <aside>7.69%</aside>
</li>
           <li>
       Glicko RD        <aside>
          73         </aside>
     </li>
             <li>
         Top Opponent          <aside>N/A</aside>
       </li>
         </ul>
 </div>

 <div class="col-md-6">

   <div class="chart-box live">
     <span class="ui-select-search-container">
       <ui-select class="chess-select"
           ng-model="model.selectedOpponent"
           on-select="selectOpponent($item)" ng-cloak>
         <ui-select-match
           placeholder="vs. All Opponents"
           allow-clear="true">
           [[ $select.selected.id ]]
         </ui-select-match>
         <ui-select-choices repeat="opponent in UI.opponents"
           refresh="findOpponents($select.search)"
           refresh-delay="0">
           [[ opponent.id ]]
         </ui-select-choices>
       </ui-select>
     </span>
*******************************************************************************


It's not always a number with decimals, but when it is, I can only collect the last two decimals, which is a problem.
I need the first digits, or the complete number, and it also has to work when the number is 0%, 10% or 100% instead of 24.76%.
The code I have is here:

import sys
import fileinput
import requests
from bs4 import BeautifulSoup
import pandas as dataset
import string
import re
from decimal import *

static_profile_url= REMOVED DUE TO ANTISPAM MEASURES
namen = []
timeouts = []


# Search between string patterns and return the value as a string.
# This extracts the timeout percentage, without the %, from the HTML
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start)
        timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1)
        #timeout = (s.split(first))[1].split(last)[0]
        print (timeout)
        return (timeout)
    except ValueError:
        return "error parsing"



def retrieve_timeouts(speler_stats_url):
    try:
        r = requests.get(speler_stats_url)
        soup = BeautifulSoup(r.text, 'lxml')
        #  stats = stat_soup.findAll(class_='stats-list no-border')
        stats = soup.findAll('ul', class_='stats-list no-border')
        timeout_percentage = find_between( str(stats), '<aside>', '%</aside>' )
        print (timeout_percentage)
        return int(timeout_percentage)
    except ValueError:
        return "error parsing"


print('processing, please wait... this may take a long time!')
fnamen = open('namen.txt', 'r')
tnamen = fnamen.read().splitlines()
for naam in tnamen:
    print (naam)
    namen.append(naam)
    timeouts.append(retrieve_timeouts(static_profile_url + str(naam)))
    print (retrieve_timeouts(static_profile_url + str(naam)))

spelersdata = { 'naam': namen, 'timeout': timeouts }
ds = dataset.DataFrame(spelersdata)
f = open('timouts.csv', 'w')
f.writelines(ds.to_csv())
f.close()

I don't know why it's not working. I'm not used to coding in Python, let alone building scrapers.
So my code is made up of a lot of copy-pasta...


Could someone please help me out with this one or point me in the right direction?
#2
Take a look at this.
from bs4 import BeautifulSoup

html = '''\
<li>
 Timeouts <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0">
 <i class="icon-circle-question"></i></span>
 <aside>7.69%</aside>
</li>'''

soup = BeautifulSoup(html, 'lxml')
soup = soup.select('li > aside')[0]
number = soup.text
print(number) #--> 7.69%
# Only float number
print(float(number[:-1])) #--> 7.69
#3
Your regular expression in timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1), i.e. r'(\d+)$', matches only the string of digits that ends the original string, so for 7.69 it returns just 69. To match the whole number you need something like r'^(\d+(\.\d+)?)$', which matches a bunch of digits, optionally followed by a dot and another bunch of digits.
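As a rough sketch of that anchored pattern (the helper name parse_percentage is made up for illustration and is not part of the original script), something like the following should cope with 0%, 10%, 100% and 24.76% alike. It uses only the standard re module and assumes the text has already been pulled out of the <aside> tag:

```python
import re

# Anchored pattern: digits, optionally followed by a dot and more digits.
PERCENT_RE = re.compile(r'^(\d+(\.\d+)?)$')


def parse_percentage(text):
    """Return the number in a string such as '7.69%' as a float, or None.

    parse_percentage is a hypothetical helper, not part of the original code.
    """
    # Strip surrounding whitespace and the trailing percent sign first.
    match = PERCENT_RE.match(text.strip().rstrip('%'))
    return float(match.group(1)) if match else None


for sample in ['7.69%', '0%', '10%', '100%', '24.76%', 'N/A']:
    print(sample, '->', parse_percentage(sample))
```

Returning a float rather than calling int() matters here: int('7.69') raises ValueError, which is what the except branch in retrieve_timeouts would then turn into "error parsing".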
#4
Awesome, it wasn't the complete solution, but it got me on the right track!
The only thing I had to alter was the index of the li: the HTML is a lot larger than what I pasted here, but since the value is always in the same spot, this fixes the problem.
Thank you very much! I was struggling with this for hours and would never have been able to solve it myself.

@ofnuts: I hadn't seen your post yet, I will try that one as well, since I would be able to keep most of my code. Thank you both!

EDIT: Well, since I've implemented the first solution I can get rid of the two functions completely. Sticking with that one! Thanks again!
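For anyone who would rather not depend on a fixed li index, here is a minimal sketch that looks the value up by its label instead. The helper name find_stat is hypothetical, the HTML is a trimmed copy of the snippet from the first post, and the built-in 'html.parser' is used instead of 'lxml' only to keep the example free of extra dependencies:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the stats list from the first post.
html = '''
<ul class="stats-list no-border">
  <li>Winning Streak <aside>19</aside></li>
  <li>Timeouts <span class="stats-list-info" tip="Last 3 Months"><i class="icon-circle-question"></i></span>
    <aside>7.69%</aside></li>
  <li>Glicko RD <aside> 73 </aside></li>
</ul>'''


def find_stat(soup, label):
    """Return the <aside> text of the first stats <li> whose text starts with label.

    find_stat is a hypothetical helper, not part of the original code.
    """
    for li in soup.select('ul.stats-list li'):
        aside = li.find('aside')
        if aside is not None and li.get_text(strip=True).startswith(label):
            return aside.get_text(strip=True)
    return None  # label not present on this page


soup = BeautifulSoup(html, 'html.parser')
timeout_text = find_stat(soup, 'Timeouts')
print(timeout_text)                     # the raw text, including the % sign
print(float(timeout_text.rstrip('%')))  # the number only
```

Matching on the label should keep working if the site adds or reorders list items, whereas a hard-coded index would silently pick up a different statistic.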