Python Forum
Scraping a JavaScript website with Selenium where pages randomly fail to load
#1
I have a Python scraper using Selenium for scraping a dynamically loaded JavaScript website.
The scraper itself works OK, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL with the data I need sometimes won't load for a random period of time.
Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried the webdrivers for Chrome, Firefox, Firefox Developer Edition and Opera. Not a single one loads all pages every time.
The public link that doesn't have the data I need looks like this: https://www.sazka.cz/kurzove-sazky/fotbal/*League*/.
The JavaScript link that has the data I need looks like this: https://rsb.sazka.cz/fotbal/*League*/.
On average, out of around 30 links, about 8 fail to load, even though the same link loads flawlessly in a different browser at the same time.
I searched the page source for clues but found nothing.
Can anyone help me figure out where the problem might be? Thank you.

Edit: here is the part of my code that I think is relevant:

Edit 2: You can reproduce the problem by right-clicking on some league and trying to open the link in another tab. You can then see that even if the page loaded properly at first, after opening it in a new tab the start of the URL changes from https://www.sazka.cz to https://rsb.sazka.cz, and it sometimes gives a 404 error that can last for an hour or more.

driver = webdriver.Chrome(executable_path='chromedriver', 
                               service_args=['--ssl-protocol=any', 
                               '--ignore-ssl-errors=true'])
driver.maximize_window()
for single_url in urls:   
    randomLoadTime = random.randint(400, 600)/100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)  
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)

    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
        pass

    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable \
            ((By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass

        unloadedOdds = driver1.find_elements_by_xpath \
        (".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100)/100
            time.sleep(randomLoadTime2)

        matchArr = []
        leaguer = single_url

        htmlSourceOrig = driver1.page_source
#2
Please post code that can be run (without modification), including the base URL.
#3
from bs4 import BeautifulSoup
from pprint import pprint
import requests
import csv
import re
import time
import random
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# URL Load

url = "URL_List.txt"
with open(url, "r") as urllist:
  url_pages = urllist.read()
urls = url_pages.split("\n")

# Variables

matchArr = []
matchArrFinal = []
scrapeDate = time.strftime("%d-%m-%Y")

# Driver Load

driver = webdriver.Chrome(executable_path='chromedriver', 
                               service_args=['--ssl-protocol=any', 
                               '--ignore-ssl-errors=true'])
driver.maximize_window()

# URL Scraping

for single_url in urls:
    
    randomLoadTime = random.randint(400, 600)/100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)  
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)

    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
        pass

    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable \
            ((By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass

        unloadedOdds = driver1.find_elements_by_xpath \
        (".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100)/100
            time.sleep(randomLoadTime2)
    
        matchArr = []
        leaguer = single_url

        htmlSourceOrig = driver1.page_source
        htmlSource = htmlSourceOrig.replace('Dagenham & Redbridge', 'Dagenham')

        # REGEX

        try:
            leagueFinal = re.findall('fotbal/(.*?)/', leaguer)
            print(leagueFinal)
        except IndexError:
            leagueFinal = 'null'
        try:
            home = re.findall('"event-details-team-a-name">(.*?)</span>', htmlSource)
        except IndexError:
            home = 'null'
        try:
            away = re.findall('"event-details-team-b-name">(.*?)</span>', htmlSource)
        except IndexError:
            away = 'null'
        try:
            date = re.findall('"event-details-date">(.*?)</span>', htmlSource)
        except IndexError:
            date = 'null'
        try:
            odds = re.findall('bet-odds-value">([0-9]+,[0-9][0-9])</span>', htmlSource)
        except IndexError:
            odds = 'null'

        oddsFinal = [o.replace(',', '.') for o in odds]

        # Live date fix
        matchNumber = len(home)
        dateNumber = len(date)
        dateFix = matchNumber - dateNumber
        if matchNumber > dateNumber:
            for fixing in range (dateFix):
                date.insert(0, 'LIVE')
                
        # Matches

        matchNum = len(home)

        for matches in range (matchNum):
            matchArr.append(leagueFinal[0])
            matchArr.append(home[0])
            matchArr.append(away[0])
            try:        
                matchArr.append(date[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[1])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[2])
            except IndexError:
                matchArr.append(None) 


            del home[0]
            del away[0]

            try:
                del date[0]
            except IndexError:         
                pass

            del oddsFinal[0:3]

        for matchesFinal in range (matchNum):
            matchArrFinal.append(matchArr[0:7])
            
            del matchArr[0:7]

driver.close() 

# CSV

# use a raw string so the backslashes in the Windows path are not treated as escape sequences;
# the with statement closes the file automatically
with open(r'D:\Betting\BET Fotbal\Scrapped Odds\Sazkabet' + ' ' + scrapeDate + '.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(["league", "home", "away", "date", "1", "0", "2"])
    writer.writerows(matchArrFinal)
Here is the content of the URL_List.txt file:
#4
I'll get back, this will take a bit of time.
#5
I created the following short script just to see what was available from the URLs that you supplied.
Many of the URLs have no pages, so that explains the 404 errors.
Run the following code and see what I'm talking about:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import PrettifyPage
from pathlib import Path
import time
import os


class FootballInfo:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.pp = PrettifyPage.PrettifyPage()

        homepath = Path('.')
        self.savepath = homepath / 'PrettyPages'
        self.savepath.mkdir(exist_ok=True)

        self.teams = ['', 'anglie-1-liga', 'anglie-2-liga', 'anglie-3-liga', \
            'anglie-4-liga', 'anglie-5-liga', 'n%C4%9Bmecko-1-liga', \
            'n%C4%9Bmecko-2-liga', 'francie-1-liga', 'francie-2-liga', \
            'it%C3%A1lie-1-liga', 'it%C3%A1lie-2-liga', \
            '%C5%A1pan%C4%9Blsko-1-liga', '%C5%A1pan%C4%9Blsko-2-liga', \
            'belgie-1-liga', 'd%C3%A1nsko-1-liga', 'nizozemsko-1-liga', \
            'norsko-1-liga', 'polsko-1-liga', 'portugalsko-1-liga', \
            'rakousko-1-liga', 'rumunsko-1-liga', 'rusko-1-liga', \
            '%C5%99ecko-1-liga', 'skotsko-1-liga', 'skotsko-2-liga', \
            'skotsko-3-liga', 'skotsko-4-liga', \
            '%C5%A1v%C3%BDcarsko-1-liga', \
            'turecko-1-liga']

    def start_browser(self):
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()
    
    def next_url(self):
        for team in self.teams:
            yield team, f"https://rsb.sazka.cz/fotbal/{team}/"

    def get_team_pages(self):
        self.start_browser()
        for team, url in self.next_url():
            print(f"loading team: {team}, url: {url}")
            self.browser.get(url)
            time.sleep(5)
            soup = BeautifulSoup(self.browser.page_source, 'lxml')
            savefile = self.savepath / f"{team}_pretty.html"
            with savefile.open('w') as fp:
                fp.write(f"{self.pp.prettify(soup,2)}")
        self.stop_browser()

if __name__ == '__main__':
    fi = FootballInfo()
    fi.get_team_pages()
You'll also need this script (save as PrettifyPage.py):
# PrettifyPage.py

from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "		
        new_line += str(line) + "\n"
        return new_line
A better way would be to scan for team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'}),
which should pull all the teams from the page.
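Something along these lines (a quick, untested sketch that parses one of the saved pages and just prints whatever text each entry holds):
from bs4 import BeautifulSoup

# rough sketch: open one of the saved pages and list the collapsible entries found
with open('PrettyPages/anglie-1-liga_pretty.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'})
print(f"found {len(team_entries)} entries")
for entry in team_entries:
    print(entry.get_text(strip=True))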
#6
Thank you, I will look into it. I am aware that many links are not available right now due to the winter break in the soccer leagues, but certain leagues return a 404 error even when they are available.
#7
So it seems that the same problem occurs with your code too :( . I had to add encoding="utf-8" to savefile.open, and after that the code ran as it should. Still, some pages randomly did not load, and I checked that they were available in another browser; when I ran the code a second time right after the first run ended, a different set of pages failed to load. In the PrettyPages folder I can see that your code scrapes the web pages properly, but when a page fails to load it is saved with just the 404 error message. I am using Windows 7 and the latest Python 3.8.1 with the latest geckodriver.
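For reference, with the encoding fix the save line in get_team_pages now reads:
with savefile.open('w', encoding='utf-8') as fp:
    fp.write(f"{self.pp.prettify(soup, 2)}")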
#8
That's because the page does not exist!
Look at the screens as the pages attempt to load.
If you use the page's internal links instead of hard-coded links you will avoid this error (see the sketch below),
or you can:
  • supply a proper URL (one tested manually)
  • just ignore the missing data.
You cannot make a non-existent page appear; you need a proper URL.
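Here is a rough, untested sketch of what I mean by using the page's internal links; the start URL and the '/fotbal/' filter are guesses, not verified against the live site:
from selenium import webdriver

# collect whatever league links the page itself exposes instead of building the URLs by hand
driver = webdriver.Firefox()
driver.get("https://rsb.sazka.cz/fotbal/")

league_urls = set()
for anchor in driver.find_elements_by_tag_name("a"):
    href = anchor.get_attribute("href")
    if href and "/fotbal/" in href:
        league_urls.add(href)

for url in sorted(league_urls):
    print(url)

driver.close()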
#9
The web pages do exist; that is why I wrote that I checked in another browser. I get a 404 error even though the same page loads at that same moment in another browser. I understand a few of the provided URLs are indeed not available, but the 404 error keeps randomly showing even for pages the code does load on other runs. For example, with the URL https://rsb.sazka.cz/fotbal/anglie-3-liga/ I ran the code five times, each run right after the previous one ended: the page loaded 3 times and two times I got a 404 error.
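Maybe I should just retry a URL a few times when the 404 text shows up. Something like this untested sketch is what I have in mind (the retry count and delay are arbitrary):
import time

def get_with_retry(driver, url, retries=3, delay=5):
    # untested sketch: reload the URL a few times when the 404 marker appears
    for attempt in range(retries):
        driver.get(url)
        if '404 - Page not found' not in driver.page_source:
            return True
        print(f"{url}: got 404 on attempt {attempt + 1}, retrying...")
        time.sleep(delay)
    return False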
#10
I just manually tried https://rsb.sazka.cz/fotbal/anglie-3-liga/ and received "404 - Page not found".