Python Forum

Full Version: Scraping a JavaScript website with Selenium where pages randomly fail to load
I have a Python scraper with Selenium for scraping a dynamically loaded JavaScript website.
The scraper by itself works OK, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL with the data I need sometimes won't load for a random period of time.
Even weirder is that the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried webdrivers for Chrome, Firefox, Firefox Developer Edition and Opera. Not a single one loads all pages every time.
A public link that doesn't have the data I need looks like this: https://www.sazka.cz/kurzove-sazky/fotbal/*League*/.
A JavaScript link that has the data I need looks like this: https://rsb.sazka.cz/fotbal/*League*/.
On average, out of around 30 links, about 8 fail to load, although the same link at the same time loads flawlessly in a different browser.
I searched the page source for clues but found nothing.
Can anyone help me find out where the problem might be? Thank you.

Edit: here is my code that I think is relevant

Edit2: You can reproduce this problem by right-clicking on some league and trying to open the link in another tab. You can then see that, even though the page loaded properly at first, after opening it in a new tab the start of the URL changes from https://www.sazka.cz to https://rsb.sazka.cz and it sometimes gives a 404 error that can last for an hour or more.

driver = webdriver.Chrome(executable_path='chromedriver', 
                               service_args=['--ssl-protocol=any', 
                               '--ignore-ssl-errors=true'])
driver.maximize_window()
for single_url in urls:   
    randomLoadTime = random.randint(400, 600)/100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)  
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)

    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
        pass

    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable \
            ((By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass

        unloadedOdds = driver1.find_elements_by_xpath \
        (".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100)/100
            time.sleep(randomLoadTime2)

        matchArr = []
        leaguer = single_url

        htmlSourceOrig = driver1.page_source
Please post code that can be run (without modification), with the base URL.
from bs4 import BeautifulSoup
from pprint import pprint
import requests
import csv
import re
import time
import random
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# URL Load

url = "URL_List.txt"
with open(url, "r") as urllist:
  url_pages = urllist.read()
urls = url_pages.split("\n")

# Variables

matchArr = []
matchArrFinal = []
scrapeDate = time.strftime("%d-%m-%Y")

# Driver Load

driver = webdriver.Chrome(executable_path='chromedriver', 
                               service_args=['--ssl-protocol=any', 
                               '--ignore-ssl-errors=true'])
driver.maximize_window()

# URL Scrapping

for single_url in urls:
    
    randomLoadTime = random.randint(400, 600)/100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)  
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)

    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
        pass

    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable \
            ((By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass

        unloadedOdds = driver1.find_elements_by_xpath \
        (".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100)/100
            time.sleep(randomLoadTime2)
    
        matchArr = []
        leaguer = single_url

        htmlSourceOrig = driver1.page_source
        htmlSource = htmlSourceOrig.replace('Dagenham & Redbridge', 'Dagenham')

        # REGEX

        try:
            leagueFinal = re.findall('fotbal/(.*?)/', leaguer)
            print(leagueFinal)
        except IndexError:
            leagueFinal = 'null'
        try:
            home = re.findall('"event-details-team-a-name">(.*?)</span>', htmlSource)
        except IndexError:
            home = 'null'
        try:
            away = re.findall('"event-details-team-b-name">(.*?)</span>', htmlSource)
        except IndexError:
            away = 'null'
        try:
            date = re.findall('"event-details-date">(.*?)</span>', htmlSource)
        except IndexError:
            date = 'null'
        try:
            odds = re.findall('bet-odds-value">([0-9]+,[0-9][0-9])</span>', htmlSource)
        except IndexError:
            odds = 'null'

        oddsFinal = [o.replace(',', '.') for o in odds]

        # Live date fix
        matchNumber = len(home)
        dateNumber = len(date)
        dateFix = matchNumber - dateNumber
        if matchNumber > dateNumber:
            for fixing in range (dateFix):
                date.insert(0, 'LIVE')
                
        # Matches

        matchNum = len(home)

        for matches in range (matchNum):
            matchArr.append(leagueFinal[0])
            matchArr.append(home[0])
            matchArr.append(away[0])
            try:        
                matchArr.append(date[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[1])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[2])
            except IndexError:
                matchArr.append(None) 


            del home[0]
            del away[0]

            try:
                del date[0]
            except IndexError:         
                pass

            del oddsFinal[0:3]

        for matchesFinal in range (matchNum):
            matchArrFinal.append(matchArr[0:7])
            
            del matchArr[0:7]

driver.close() 

# CSV

# raw string keeps the backslashes in the Windows path from being treated as escapes
with open(r'D:\Betting\BET Fotbal\Scrapped Odds\Sazkabet' + ' ' + scrapeDate + '.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(["league", "home", "away", "date", "1", "0", "2"])
    writer.writerows(matchArrFinal)
Here is the content of the URL_List.txt file:
I'll get back, this will take a bit of time.
I created the following short script just to see what was available from the URLs that you supplied.
Many of the URLs have no pages, so that explains the 404 errors.
Run the following code and see what I'm talking about:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import PrettifyPage
from pathlib import Path
import time
import os


class FootballInfo:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.pp = PrettifyPage.PrettifyPage()

        homepath = Path('.')
        self.savepath = homepath / 'PrettyPages'
        self.savepath.mkdir(exist_ok=True)

        self.teams = ['', 'anglie-1-liga', 'anglie-2-liga', 'anglie-3-liga', \
            'anglie-4-liga', 'anglie-5-liga', 'n%C4%9Bmecko-1-liga', \
            'n%C4%9Bmecko-2-liga', 'francie-1-liga', 'francie-2-liga', \
            'it%C3%A1lie-1-liga', 'it%C3%A1lie-2-liga', \
            '%C5%A1pan%C4%9Blsko-1-liga', '%C5%A1pan%C4%9Blsko-2-liga', \
            'belgie-1-liga', 'd%C3%A1nsko-1-liga', 'nizozemsko-1-liga', \
            'norsko-1-liga', 'polsko-1-liga', 'portugalsko-1-liga', \
            'rakousko-1-liga', 'rumunsko-1-liga', 'rusko-1-liga', \
            '%C5%99ecko-1-liga', 'skotsko-1-liga', 'skotsko-2-liga', \
            'skotsko-3-liga', 'skotsko-4-liga', \
            '%C5%A1v%C3%BDcarsko-1-liga', \
            'turecko-1-liga']

    def start_browser(self):
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()
    
    def next_url(self):
        for team in self.teams:
            yield team, f"https://rsb.sazka.cz/fotbal/{team}/"

    def get_team_pages(self):
        self.start_browser()
        for team, url in self.next_url():
            print(f"loading team: {team}, url: {url}")
            self.browser.get(url)
            time.sleep(5)
            soup = BeautifulSoup(self.browser.page_source, 'lxml')
            savefile = self.savepath / f"{team}_pretty.html"
            with savefile.open('w') as fp:
                fp.write(f"{self.pp.prettify(soup,2)}")
        self.stop_browser()

if __name__ == '__main__':
    fi = FootballInfo()
    fi.get_team_pages()
You'll also need this script (save as PrettifyPage.py):
# PrettifyPage.py

from bs4 import BeautifulSoup


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "		
        new_line += str(line) + "\n"
        return new_line
A better way would be to scan for team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'}),
which should pull all teams from the page.
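For example, something along these lines could be dropped into get_team_pages() to list what each page actually contains (a rough sketch; the class name comes from the suggestion above, and what sits inside each block is an assumption that would need to be checked against the saved pages):

from bs4 import BeautifulSoup

def list_team_entries(page_source):
    # parse the rendered page and pull every collapsible block
    soup = BeautifulSoup(page_source, 'lxml')
    team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'})
    # flatten each block to plain text; the real structure may need finer parsing
    return [entry.get_text(" ", strip=True) for entry in team_entries]

Calling print(list_team_entries(self.browser.page_source)) right after the time.sleep(5) would show whether a league page really came back with data or just the 404 text.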
Thank you, I will look into it. I am aware that many links are not available right now due to the winter break in the soccer leagues, but certain leagues return a 404 error even when they are available.
So it seems that the same problem occurs with your code too :( . I had to add encoding="utf-8" to savefile.open(), and after that the code ran as it should. Still, some pages randomly did not load, and I checked that they were available in another browser; when I ran the code a second time, right after the first run ended, a different set of pages randomly failed to load. In the PrettyPages folder I can see that your code scrapes the web pages properly, but when a page fails to load it is saved with just the 404 error message. I am using Windows 7 and the latest Python 3.8.1 with the latest geckodriver.
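For reference, the change mentioned above is just passing the encoding when opening the output file, since Windows otherwise defaults to a legacy codepage and the accented league names break the write:

# write the prettified page as UTF-8 so accented league names don't raise UnicodeEncodeError
with savefile.open('w', encoding='utf-8') as fp:
    fp.write(f"{self.pp.prettify(soup, 2)}")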
That's because the page does not exist!
Look at the screens as they attempt to be brought up.
If you use the page's internal links instead of preset links you will avoid this error,
or you can:
  • supply a proper URL (one tested manually)
  • just ignore missing data.
You cannot make a non-existent page appear; you need a proper URL.
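As a rough illustration of the internal-links idea (the landing page and the '/fotbal/' filter are assumptions about the site's layout and would need checking), the league URLs could be collected from the page itself so only links that actually exist get visited:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def collect_league_links(browser, base_url="https://rsb.sazka.cz/fotbal/"):
    # load the football landing page and gather the links it exposes itself
    browser.get(base_url)
    soup = BeautifulSoup(browser.page_source, 'lxml')
    links = set()
    for a in soup.find_all('a', href=True):
        href = urljoin(base_url, a['href'])
        # keep only football league pages; this pattern is a guess
        if '/fotbal/' in href:
            links.add(href)
    return sorted(links)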
The web pages do exist, that is why I wrote that I checked in another browser. I get a 404 error even though the page loads at that same moment in another browser. I understand a few of the provided URLs are indeed not available, but the 404 error keeps showing up randomly even for pages the code does load. For example, with this URL https://rsb.sazka.cz/fotbal/anglie-3-liga/ I ran the code five times, one run right after the other, and this web page loaded 3 times and 2 times I got a 404 error.
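Since it is this intermittent, one workaround might be to retry the load a few times before giving up (a minimal sketch; the attempt count and delay are arbitrary, and the 404 marker is the same string the scraper already checks for):

import time

def get_with_retry(driver, url, attempts=3, delay=10):
    # reload the page until it stops coming back as the site's 404 page
    for attempt in range(attempts):
        driver.get(url)
        if '404 - Page not found' not in driver.page_source:
            return True
        time.sleep(delay)
    return False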
I just manually tried: https://rsb.sazka.cz/fotbal/anglie-3-liga/
received 404 - Page not found.