Python Forum
Scraping number in % from website
#1
Hi all, for my chess club I'm trying to automate collecting timeout percentages.

The value is hidden in this bit of HTML: <aside>7.69%</aside>

 <ul class="stats-list no-border">
     <li>
       Winning Streak        <aside>19</aside>
     </li>
             <li>
         Time per Move          <aside>14 hours 15 minutes</aside>
       </li>
                   <li>
         Timeouts          <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0"><i class="icon-circle-question"
     
     
         ></i></span>
         <aside>7.69%</aside>
</li>
           <li>
       Glicko RD        <aside>
          73         </aside>
     </li>
             <li>
         Top Opponent          <aside>N/A</aside>
       </li>
         </ul>
 </div>

 <div class="col-md-6">

   <div class="chart-box live">
     <span class="ui-select-search-container">
       <ui-select class="chess-select"
           ng-model="model.selectedOpponent"
           on-select="selectOpponent($item)" ng-cloak>
         <ui-select-match
           placeholder="vs. All Opponents"
           allow-clear="true">
           [[ $select.selected.id ]]
         </ui-select-match>
         <ui-select-choices repeat="opponent in UI.opponents"
           refresh="findOpponents($select.search)"
           refresh-delay="0">
           [[ opponent.id ]]
         </ui-select-choices>
       </ui-select>
     </span>
*******************************************************************************


It's not always a number with decimals, but when it is, my code only captures the two digits after the decimal point, which is a problem.
I need the leading digits, or preferably the complete number, and it also has to work when the value is 0%, 10% or 100% instead of something like 24.76%.
The code I have so far:

import sys
import fileinput
import requests
from bs4 import BeautifulSoup
import pandas as dataset
import string
import re
from decimal import *

static_profile_url= REMOVED DUE TO ANTISPAM MEASURES
namen = []
timeouts = []


# Search between two string patterns and return the value as a string.
# This extracts the timeout (TO) percentage, without the % sign, from the HTML.
def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start)
        timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1)
        #timeout = (s.split(first))[1].split(last)[0]
        print (timeout)
        return (timeout)
    except ValueError:
        return "error parsing"



# Fetch a player's stats page and extract the timeout percentage from the stats list.
def retrieve_timeouts(speler_stats_url):
    try:
        r = requests.get(speler_stats_url)
        soup = BeautifulSoup(r.text, 'lxml')
        #  stats = stat_soup.findAll(class_='stats-list no-border')
        stats = soup.findAll('ul', class_='stats-list no-border')
        timeout_percentage = find_between( str(stats), '<aside>', '%</aside>' )
        print (timeout_percentage)
        return int(timeout_percentage)
    except ValueError:
        return "error parsing"


print('processing, please wait... this may take a long time!')
fnamen = open('namen.txt', 'r')
tnamen = fnamen.read().splitlines()
for naam in tnamen:
    print (naam)
    namen.append(naam)
    timeouts.append(retrieve_timeouts(static_profile_url + str(naam)))
    print (retrieve_timeouts(static_profile_url + str(naam)))

spelersdata = { 'naam': namen, 'timeout': timeouts }
ds = dataset.DataFrame(spelersdata)
f = open('timouts.csv', 'w')
f.writelines(ds.to_csv())
f.close()
I don't know why it's not working. I'm not used to coding in Python, let alone building scrapers,
so my code is made up of a lot of copy-pasting...


Could someone please help me out with this one or point me in the right direction?
#2
Take a look at this.
from bs4 import BeautifulSoup

html = '''\
<li>
 Timeouts <span class="stats-list-info" tip="Last 3 Months" tip-popup-delay="0">
 <i class="icon-circle-question"></i></span>
 <aside>7.69%</aside>
</li>'''

soup = BeautifulSoup(html, 'lxml')
aside = soup.select('li > aside')[0]
number = aside.text
print(number)  # --> 7.69%
# Only the float number
print(float(number[:-1]))  # --> 7.69
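On the real page there are several li/aside pairs, so the [0] index will grab the first stat (Winning Streak). Here is a rough sketch of one way to target the Timeouts entry by its text instead of by position, assuming the stats-list markup from the first post (the timeout_percentage name is just for illustration):
# Sketch: pull the aside that belongs to the "Timeouts" li, assuming the
# stats-list markup shown in the first post.
from bs4 import BeautifulSoup

def timeout_percentage(html):
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('ul.stats-list li'):
        if 'Timeouts' in li.get_text():
            # aside text looks like '7.69%', '10%' or '0%'
            return float(li.select_one('aside').text.strip().rstrip('%'))
    return None  # no Timeouts entry found
Call it with the page source, e.g. timeout_percentage(requests.get(url).text), and it returns the value as a float regardless of how many decimals are shown.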
#3
Your regular expression in timeout = re.compile(r'(\d+)$').search(s[start:end]).group(1) (i.e. r'(\d+)$') matches only the string of digits at the very end of the original string. To match the whole number you need something like r'^(\d+(\.\d+)?)$', which matches a bunch of digits, optionally followed by a dot and another bunch of digits.
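For illustration, a quick check of that pattern against some clean number strings (the test values here are just examples, not pulled from the site):
import re

pattern = re.compile(r'^(\d+(\.\d+)?)$')
for value in ['7.69', '0', '10', '100', '24.76']:
    match = pattern.search(value)
    print(value, '->', match.group(1))  # each value is matched in full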
#4
Awesome, it wasn't the real solution but it got me on the right track!
The only thing I had to alter was the index of the li; the HTML is a lot larger than what I pasted here, but since the value is always in the same spot this fixes the problem.
Thank you very much! I was struggling with this for hours and would never have been able to solve it myself.

@ofnuts: I hadn't seen your post yet, I will try that one as well, since it would let me keep most of my code. Thank you both!

EDIT: Now that I've implemented the first solution I can get rid of the two functions completely. Sticking with that one! Thanks again!
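For anyone reading along later, the retrieval now boils down to something like this (a rough sketch, not my exact code; the index 2 matches the stats list I pasted above, the real page needed a different one):
# Rough sketch: fetch a profile page and pull the timeout value with the
# selector approach from post #2.
import requests
from bs4 import BeautifulSoup

def retrieve_timeouts(speler_stats_url):
    r = requests.get(speler_stats_url)
    soup = BeautifulSoup(r.text, 'lxml')
    asides = soup.select('ul.stats-list li > aside')
    # index 2 corresponds to the Timeouts entry in the list pasted in the first post
    return float(asides[2].text.strip().rstrip('%'))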