Scrapping javascript website with Selenium where pages randomly fail to load - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Scrapping javascript website with Selenium where pages randomly fail to load (/thread-23367.html)
Scrapping javascript website with Selenium where pages randomly fail to load - JuanJuan - Dec-25-2019

I have a Python scraper using Selenium for a dynamically loaded JavaScript website. The scraper itself works, but pages sometimes fail to load with a 404 error. The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL with the data I need sometimes won't load for a random stretch of time. Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa. I tried the webdrivers for Chrome, Firefox, Firefox Developer Edition and Opera; not a single one loads all pages every time.

A public link that doesn't have the data I need looks like this: https://www.sazka.cz/kurzove-sazky/fotbal/*League*/. A JavaScript link that does have the data looks like this: https://rsb.sazka.cz/fotbal/*League*/. On average, out of around 30 links about 8 fail to load, although the same link at the same moment loads flawlessly in a different browser. I searched the page source for clues but found nothing. Can anyone help me find where the problem might be? Thank you.

Edit: here is my code that I think is relevant.

Edit2: You can reproduce the problem by right-clicking on some league and opening the link in another tab. Even though the page loaded properly at first, after opening it in a new tab the start of the link changes from https://www.sazka.cz to https://rsb.sazka.cz and sometimes gives a 404 error that can last for an hour or more.
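Since the failures described above are intermittent rather than permanent, one generic mitigation is to retry a failing page a few times with a delay before giving up. This is a minimal sketch, not the poster's code: `fetch` and `is_error` are stand-ins for `driver.get` plus the "404 - Page not found" check used later in the thread.

```python
import time

def get_with_retry(fetch, is_error, attempts=3, delay=2.0):
    """Call fetch() until is_error(result) is False, retrying up to
    `attempts` times with a fixed delay between tries."""
    result = None
    for _ in range(attempts):
        result = fetch()
        if not is_error(result):
            return result
        time.sleep(delay)
    return result  # still the error page after all attempts

# Stand-in fetch: the page "loads" only on the third try.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return "404 - Page not found" if calls["n"] < 3 else "<html>odds</html>"

page = get_with_retry(fake_fetch,
                      lambda s: "404 - Page not found" in s,
                      attempts=5, delay=0)
```

In the real scraper, `fetch` would do `driver.get(url)` and return `driver.page_source`; keeping a nonzero `delay` also spaces out requests to the site.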
driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any', '--ignore-ssl-errors=true'])
driver.maximize_window()

for single_url in urls:
    randomLoadTime = random.randint(400, 600) / 100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100) / 100
            time.sleep(randomLoadTime2)
        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source

RE: Scrapping javascript website with Selenium where pages randomly fail to load - Larz60+ - Dec-25-2019

Please post code that can be run (without modification) with the base URL.
RE: Scrapping javascript website with Selenium where pages randomly fail to load - JuanJuan - Dec-25-2019

from bs4 import BeautifulSoup
from pprint import pprint
import requests
import csv
import re
import time
import random
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# URL load
url = "URL_List.txt"
with open(url, "r") as urllist:
    url_pages = urllist.read()
urls = url_pages.split("\n")

# Variables
matchArr = []
matchArrFinal = []
scrapeDate = time.strftime("%d-%m-%Y")

# Driver load
driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any', '--ignore-ssl-errors=true'])
driver.maximize_window()

# URL scraping
for single_url in urls:
    randomLoadTime = random.randint(400, 600) / 100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100) / 100
            time.sleep(randomLoadTime2)

        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source
        htmlSource = htmlSourceOrig.replace('Dagenham & Redbridge', 'Dagenham')

        # Regex extraction
        try:
            leagueFinal = re.findall('fotbal/(.*?)/', leaguer)
            print(leagueFinal)
        except IndexError:
            leagueFinal = 'null'
        try:
            home = re.findall('"event-details-team-a-name">(.*?)</span>', htmlSource)
        except IndexError:
            home = 'null'
        try:
            away = re.findall('"event-details-team-b-name">(.*?)</span>', htmlSource)
        except IndexError:
            away = 'null'
        try:
            date = re.findall('"event-details-date">(.*?)</span>', htmlSource)
        except IndexError:
            date = 'null'
        try:
            odds = re.findall('bet-odds-value">([0-9]+,[0-9][0-9])</span>', htmlSource)
        except IndexError:
            odds = 'null'
        oddsFinal = [o.replace(',', '.') for o in odds]

        # Live date fix
        matchNumber = len(home)
        dateNumber = len(date)
        dateFix = matchNumber - dateNumber
        if matchNumber > dateNumber:
            for fixing in range(dateFix):
                date.insert(0, 'LIVE')

        # Matches
        matchNum = len(home)
        for matches in range(matchNum):
            matchArr.append(leagueFinal[0])
            matchArr.append(home[0])
            matchArr.append(away[0])
            try:
                matchArr.append(date[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[1])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[2])
            except IndexError:
                matchArr.append(None)
            del home[0]
            del away[0]
            try:
                del date[0]
            except IndexError:
                pass
            del oddsFinal[0:3]
        for matchesFinal in range(matchNum):
            matchArrFinal.append(matchArr[0:7])
            del matchArr[0:7]

driver.close()

# CSV (raw string so the backslashes in the Windows path are taken literally)
with open(r'D:\Betting\BET Fotbal\Scrapped Odds\Sazkabet' + ' ' + scrapeDate + '.csv',
          'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(["league", "home", "away", "date", "1", "0", "2"])
    writer.writerows(matchArrFinal)

here is the content of the URL_List.txt file:

RE: Scrapping javascript website with Selenium where pages randomly fail to load - Larz60+ - Dec-26-2019

I'll get back; this will take a bit of time.
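Two pieces of the parsing logic above are pure data transformations: converting Czech decimal commas in the odds to dots, and padding the date list with 'LIVE' when live matches have no date. Pulled out as small functions (the names here are illustrative, not from the original script), they can be tested without running the scraper:

```python
def normalise_odds(raw_odds):
    """Convert Czech decimal commas ('2,50') to dots ('2.50')."""
    return [o.replace(',', '.') for o in raw_odds]

def pad_live_dates(dates, match_count):
    """Live matches carry no date on the page, so the date list comes up
    short; pad its front with 'LIVE' until it lines up with the matches."""
    missing = match_count - len(dates)
    return ['LIVE'] * max(missing, 0) + list(dates)
```

This mirrors the `oddsFinal` list comprehension and the `date.insert(0, 'LIVE')` loop in the script, just isolated from the Selenium session.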
RE: Scrapping javascript website with Selenium where pages randomly fail to load - Larz60+ - Dec-26-2019

I created the following short script just to see what was available from the URLs that you supplied. Many of the URLs have no pages, so that explains the 404 errors. Run the following code and see what I'm talking about:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import PrettifyPage
from pathlib import Path
import time
import os


class FootballInfo:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.pp = PrettifyPage.PrettifyPage()
        homepath = Path('.')
        self.savepath = homepath / 'PrettyPages'
        self.savepath.mkdir(exist_ok=True)
        self.teams = ['', 'anglie-1-liga', 'anglie-2-liga', 'anglie-3-liga',
                      'anglie-4-liga', 'anglie-5-liga', 'n%C4%9Bmecko-1-liga',
                      'n%C4%9Bmecko-2-liga', 'francie-1-liga', 'francie-2-liga',
                      'it%C3%A1lie-1-liga', 'it%C3%A1lie-2-liga',
                      '%C5%A1pan%C4%9Blsko-1-liga', '%C5%A1pan%C4%9Blsko-2-liga',
                      'belgie-1-liga', 'd%C3%A1nsko-1-liga', 'nizozemsko-1-liga',
                      'norsko-1-liga', 'polsko-1-liga', 'portugalsko-1-liga',
                      'rakousko-1-liga', 'rumunsko-1-liga', 'rusko-1-liga',
                      '%C5%99ecko-1-liga', 'skotsko-1-liga', 'skotsko-2-liga',
                      'skotsko-3-liga', 'skotsko-4-liga',
                      '%C5%A1v%C3%BDcarsko-1-liga',
                      'turecko-1-liga']

    def start_browser(self):
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()

    def next_url(self):
        for team in self.teams:
            yield team, f"https://rsb.sazka.cz/fotbal/{team}/"

    def get_team_pages(self):
        self.start_browser()
        for team, url in self.next_url():
            print(f"loading team: {team}, url: {url}")
            self.browser.get(url)
            time.sleep(5)
            soup = BeautifulSoup(self.browser.page_source, 'lxml')
            savefile = self.savepath / f"{team}_pretty.html"
            with savefile.open('w') as fp:
                fp.write(f"{self.pp.prettify(soup, 2)}")
        self.stop_browser()


if __name__ == '__main__':
    fi = FootballInfo()
    fi.get_team_pages()

You'll also need this script (save as PrettifyPage.py):

# PrettifyPage.py
from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line

A better way would be to scan for

team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'})

which should pull all teams from the page.

RE: Scrapping javascript website with Selenium where pages randomly fail to load - JuanJuan - Dec-26-2019

Thank you, I will look into it. I am aware that many links are not available right now due to the winter break in the soccer leagues, but certain leagues return a 404 error even when they are available.

RE: Scrapping javascript website with Selenium where pages randomly fail to load - JuanJuan - Dec-26-2019

So it seems the same problem occurs with your code too :( . I had to add encoding="utf-8" to the savefile open call, and after that the code ran as it should. Still, some pages randomly did not load, and I checked that they were available in another browser at the time; when I ran the code a second time, right after the first run ended, a different set of pages failed to load. In the PrettyPages folder I can see that your code scrapes the webpages properly, but when a page fails to load it is saved with just the 404 error message. I am using Windows 7, the latest Python 3.8.1, and the latest geckodriver.
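As an aside on the percent-encoded league slugs in the `self.teams` list above: they can be generated from the readable Czech names with the standard library's `urllib.parse.quote`, which avoids hand-encoding mistakes. A small sketch (the `leagues` list here is a subset of the names from the script):

```python
from urllib.parse import quote

# Human-readable Czech league names (subset of the list in the script).
leagues = ["anglie-1-liga", "německo-1-liga", "itálie-1-liga", "španělsko-1-liga"]

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters,
# producing exactly the slugs hard-coded in self.teams.
slugs = [quote(name) for name in leagues]
urls = [f"https://rsb.sazka.cz/fotbal/{slug}/" for slug in slugs]
```

This keeps the list readable in source while still producing the URLs the site expects.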
RE: Scrapping javascript website with Selenium where pages randomly fail to load - Larz60+ - Dec-26-2019

That's because the page does not exist! Look at the screens as they attempt to be brought up. If you use the page's internal links instead of fixed links, you will avoid this error, or you can:
RE: Scrapping javascript website with Selenium where pages randomly fail to load - JuanJuan - Dec-26-2019

The webpages do exist; that is why I wrote that I checked in another browser. I get a 404 error even though the same page loads at that same moment in another browser. I understand a few of the provided URLs are indeed not available, but the 404 error keeps showing up randomly even for pages the code can otherwise load. For example, with this URL, https://rsb.sazka.cz/fotbal/anglie-3-liga/, I ran the code five times, each run right after the previous one ended: the webpage loaded three times, and twice I got a 404 error.

RE: Scrapping javascript website with Selenium where pages randomly fail to load - Larz60+ - Dec-26-2019

I just manually tried https://rsb.sazka.cz/fotbal/anglie-3-liga/ and received 404 - Page not found.
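Since a URL that 404s once often loads on a later run (as with anglie-3-liga above), one pragmatic pattern is to collect the failing URLs and re-queue them for another pass instead of dropping them. This is a sketch, not code from the thread; `fetch` and `is_error` are stand-ins for the Selenium page load and the "404 - Page not found" check:

```python
def scrape_with_requeue(urls, fetch, is_error, max_passes=3):
    """Fetch each URL; URLs whose page looks like an error are retried
    in later passes. Returns (results dict, URLs still failing)."""
    pending = list(urls)
    results = {}
    for _ in range(max_passes):
        failed = []
        for url in pending:
            page = fetch(url)
            if is_error(page):
                failed.append(url)
            else:
                results[url] = page
        pending = failed
        if not pending:
            break
    return results, pending

# Stand-in fetch: anglie-3-liga fails on its first attempt only,
# mimicking the intermittent 404 described in the thread.
seen = set()
def fake_fetch(url):
    if "anglie-3-liga" in url and url not in seen:
        seen.add(url)
        return "404 - Page not found"
    return f"<html>{url}</html>"

ok, bad = scrape_with_requeue(
    ["https://rsb.sazka.cz/fotbal/anglie-3-liga/",
     "https://rsb.sazka.cz/fotbal/anglie-1-liga/"],
    fake_fetch,
    lambda p: "404 - Page not found" in p)
```

With the real driver, `fetch` would be `driver.get(url)` followed by returning `driver.page_source`, and a `time.sleep` between passes would give the flaky backend time to recover.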