Posts: 8
Threads: 1
Joined: Dec 2019
Dec-25-2019, 01:35 PM
(This post was last modified: Dec-25-2019, 01:36 PM by JuanJuan.)
I have a Python scraper that uses Selenium to scrape a dynamically loaded JavaScript website.
The scraper itself works fine, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL that does have the data sometimes fails to load for a random length of time.
Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried webdrivers for Chrome, Firefox, Firefox Developer Edition, and Opera. Not a single one loads all pages every time.
The public link without the data I need looks like this: https://www.sazka.cz/kurzove-sazky/fotbal/*League*/.
The JavaScript link with the data I need looks like this: https://rsb.sazka.cz/fotbal/*League*/.
On average, about 8 out of roughly 30 links fail to load, even though the same link at the same time loads flawlessly in a different browser.
I searched the page source for clues but found nothing.
Can anyone help me figure out where the problem might be? Thank you.
Edit: here is the part of my code that I think is relevant.
Edit 2: You can reproduce the problem by right-clicking a league and opening the link in another tab. Even though the page loaded properly at first, after opening it in a new tab the start of the URL changes from https://www.sazka.cz to https://rsb.sazka.cz, and it sometimes gives a 404 error that can last for an hour or more.
driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any',
                                        '--ignore-ssl-errors=true'])
driver.maximize_window()

for single_url in urls:
    randomLoadTime = random.randint(400, 600) / 100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100) / 100
            time.sleep(randomLoadTime2)

        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source
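Since the 404s are transient (the same URL loads a moment later in another browser), one workaround is to retry a failed load a few times before giving up. Below is a minimal, framework-agnostic sketch; `retry_load` and its callback arguments are hypothetical names, not part of Selenium:

```python
import random
import time

def retry_load(load_page, is_loaded, attempts=3, base_delay=4.0):
    """Call load_page() up to `attempts` times until is_loaded() reports
    success, sleeping a jittered delay between tries. Returns True on
    success, False if every attempt failed."""
    for attempt in range(attempts):
        load_page()
        if is_loaded():
            return True
        if attempt < attempts - 1:
            # Jittered back-off so retries don't hit the server in lockstep.
            time.sleep(base_delay * (1 + random.random()))
    return False
```

In the scraper above it could be wired in as `retry_load(lambda: driver1.get(single_url), lambda: '404 - Page not found' not in driver1.page_source)`, replacing the single `driver1.get(single_url)` call.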
Posts: 11,875
Threads: 474
Joined: Sep 2016
Please post code that can be run (without modification) with the base URL.
Posts: 8
Threads: 1
Joined: Dec 2019
Dec-25-2019, 07:21 PM
(This post was last modified: Dec-25-2019, 07:23 PM by JuanJuan.)
from bs4 import BeautifulSoup
from pprint import pprint
import requests
import csv
import re
import time
import random
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# URL Load
url = "URL_List.txt"
with open(url, "r") as urllist:
    url_pages = urllist.read()
urls = url_pages.split("\n")

# Variables
matchArr = []
matchArrFinal = []
scrapeDate = time.strftime("%d-%m-%Y")

# Driver Load
driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any',
                                        '--ignore-ssl-errors=true'])
driver.maximize_window()

# URL Scraping
for single_url in urls:
    randomLoadTime = random.randint(400, 600) / 100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source

    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
    else:
        try:
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100) / 100
            time.sleep(randomLoadTime2)

        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source
        htmlSource = htmlSourceOrig.replace('Dagenham & Redbridge', 'Dagenham')

        # REGEX
        try:
            leagueFinal = re.findall('fotbal/(.*?)/', leaguer)
            print(leagueFinal)
        except IndexError:
            leagueFinal = 'null'
        try:
            home = re.findall('"event-details-team-a-name">(.*?)</span>', htmlSource)
        except IndexError:
            home = 'null'
        try:
            away = re.findall('"event-details-team-b-name">(.*?)</span>', htmlSource)
        except IndexError:
            away = 'null'
        try:
            date = re.findall('"event-details-date">(.*?)</span>', htmlSource)
        except IndexError:
            date = 'null'
        try:
            odds = re.findall('bet-odds-value">([0-9]+,[0-9][0-9])</span>', htmlSource)
        except IndexError:
            odds = 'null'
        oddsFinal = [o.replace(',', '.') for o in odds]

        # Live date fix
        matchNumber = len(home)
        dateNumber = len(date)
        dateFix = matchNumber - dateNumber
        if matchNumber > dateNumber:
            for fixing in range(dateFix):
                date.insert(0, 'LIVE')

        # Matches
        matchNum = len(home)
        for matches in range(matchNum):
            matchArr.append(leagueFinal[0])
            matchArr.append(home[0])
            matchArr.append(away[0])
            try:
                matchArr.append(date[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[0])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[1])
            except IndexError:
                matchArr.append(None)
            try:
                matchArr.append(oddsFinal[2])
            except IndexError:
                matchArr.append(None)
            del home[0]
            del away[0]
            try:
                del date[0]
            except IndexError:
                pass
            del oddsFinal[0:3]
        for matchesFinal in range(matchNum):
            matchArrFinal.append(matchArr[0:7])
            del matchArr[0:7]

driver.close()

# CSV (raw string so the backslashes in the Windows path are not treated as escapes)
with open(r'D:\Betting\BET Fotbal\Scrapped Odds\Sazkabet' + ' ' + scrapeDate + '.csv',
          'w', newline='') as csvFile:
    writer = csv.writer(csvFile, delimiter=',')
    writer.writerow(["league", "home", "away", "date", "1", "0", "2"])
    writer.writerows(matchArrFinal)
Here is the content of the URL_List.txt file:
Posts: 11,875
Threads: 474
Joined: Sep 2016
I'll get back, this will take a bit of time.
Posts: 11,875
Threads: 474
Joined: Sep 2016
Dec-26-2019, 02:15 AM
(This post was last modified: Dec-26-2019, 02:46 AM by Larz60+.)
I created the following short script just to see what was available from the URLs that you supplied.
Many of the URLs have no pages, so that explains the 404 errors.
Run the following code and see what I'm talking about:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import PrettifyPage
from pathlib import Path
import time
import os

class FootballInfo:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.pp = PrettifyPage.PrettifyPage()
        homepath = Path('.')
        self.savepath = homepath / 'PrettyPages'
        self.savepath.mkdir(exist_ok=True)
        self.teams = ['', 'anglie-1-liga', 'anglie-2-liga', 'anglie-3-liga',
                      'anglie-4-liga', 'anglie-5-liga', 'n%C4%9Bmecko-1-liga',
                      'n%C4%9Bmecko-2-liga', 'francie-1-liga', 'francie-2-liga',
                      'it%C3%A1lie-1-liga', 'it%C3%A1lie-2-liga',
                      '%C5%A1pan%C4%9Blsko-1-liga', '%C5%A1pan%C4%9Blsko-2-liga',
                      'belgie-1-liga', 'd%C3%A1nsko-1-liga', 'nizozemsko-1-liga',
                      'norsko-1-liga', 'polsko-1-liga', 'portugalsko-1-liga',
                      'rakousko-1-liga', 'rumunsko-1-liga', 'rusko-1-liga',
                      '%C5%99ecko-1-liga', 'skotsko-1-liga', 'skotsko-2-liga',
                      'skotsko-3-liga', 'skotsko-4-liga',
                      '%C5%A1v%C3%BDcarsko-1-liga',
                      'turecko-1-liga']

    def start_browser(self):
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()

    def next_url(self):
        for team in self.teams:
            yield team, f"https://rsb.sazka.cz/fotbal/{team}/"

    def get_team_pages(self):
        self.start_browser()
        for team, url in self.next_url():
            print(f"loading team: {team}, url: {url}")
            self.browser.get(url)
            time.sleep(5)
            soup = BeautifulSoup(self.browser.page_source, 'lxml')
            savefile = self.savepath / f"{team}_pretty.html"
            with savefile.open('w') as fp:
                fp.write(f"{self.pp.prettify(soup, 2)}")
        self.stop_browser()

if __name__ == '__main__':
    fi = FootballInfo()
    fi.get_team_pages()
You'll also need this script (save as PrettifyPage.py):
# PrettifyPage.py
from bs4 import BeautifulSoup
import requests
import pathlib

class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line
A better way would be to scan for team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'}), which should pull all teams from the page.
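To illustrate that suggestion, here is a sketch of pulling the team entries out of a saved page with `find_all`. The class name comes from the post above, but the HTML fragment is a made-up stand-in for the real page, so the inner structure (an `h3` per entry) is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one of the saved *_pretty.html pages.
html = """
<div class="rj-instant-collapsible"><h3>Arsenal - Chelsea</h3></div>
<div class="rj-instant-collapsible"><h3>Leeds - Fulham</h3></div>
"""
soup = BeautifulSoup(html, 'html.parser')
team_entries = soup.find_all('div', {'class': 'rj-instant-collapsible'})
matches = [entry.h3.get_text() for entry in team_entries]
```

Iterating over `team_entries` like this avoids the brittle regex-over-page-source approach in the original scraper.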
Posts: 8
Threads: 1
Joined: Dec 2019
Thank you, I will look into it. I am aware that many links are not available right now due to the winter break in the soccer leagues, but certain leagues return a 404 error even when they are available.
Posts: 8
Threads: 1
Joined: Dec 2019
It seems the same problem occurs with your code too :( . I had to add encoding="utf-8" to the savefile open call, and after that the code ran as it should. Still, some pages randomly did not load, even though I checked that they were available in another browser at the time; when I ran the code a second time right after the first run ended, a different set of pages randomly failed to load. In the PrettyPages folder I can see that your code scrapes the webpages properly, but when a page fails to load it is saved with just the 404 error message. I am using Windows 7 with the latest Python 3.8.1 and the latest geckodriver.
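For anyone else hitting the same UnicodeEncodeError on Windows, the fix amounts to passing the encoding explicitly when opening the save file. The path below is a hypothetical stand-in for the script's `self.savepath / f"{team}_pretty.html"`:

```python
import tempfile
from pathlib import Path

# Hypothetical save location; the real script writes into its PrettyPages folder.
savefile = Path(tempfile.gettempdir()) / 'anglie-1-liga_pretty.html'

# Without encoding='utf-8', Windows opens the file in a legacy code page
# and non-ASCII characters in the page source raise UnicodeEncodeError.
with savefile.open('w', encoding='utf-8') as fp:
    fp.write('<h3>Plzeň - Slavia</h3>')
```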
Posts: 11,875
Threads: 474
Joined: Sep 2016
Dec-26-2019, 09:58 PM
(This post was last modified: Dec-26-2019, 09:58 PM by Larz60+.)
That's because the page does not exist!
Look at the screens as they attempt to be brought up.
If you use the page's internal links instead of hard-coded links you will avoid this error. Alternatively, you can:
- supply a proper URL (one tested manually)
- just ignore the missing data.
You cannot make a non-existent page appear; you need a proper URL.
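Harvesting the league links from the page itself, rather than from a hand-maintained list, could look roughly like this. The HTML fragment is a made-up stand-in for the real landing page, so treat the selector as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the sazka.cz football landing page.
html = """
<a href="https://rsb.sazka.cz/fotbal/anglie-1-liga/">Anglie 1. liga</a>
<a href="https://rsb.sazka.cz/fotbal/anglie-2-liga/">Anglie 2. liga</a>
<a href="https://www.sazka.cz/loterie/">Loterie</a>
"""
soup = BeautifulSoup(html, 'html.parser')
# Keep only anchors that point at football league pages.
league_links = [a['href'] for a in soup.find_all('a', href=True)
                if '/fotbal/' in a['href']]
```

Links collected this way are, by construction, pages the site currently advertises, which sidesteps 404s from stale hard-coded URLs (though not the transient ones described above).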
Posts: 8
Threads: 1
Joined: Dec 2019
Dec-26-2019, 10:13 PM
(This post was last modified: Dec-26-2019, 10:13 PM by JuanJuan.)
The webpages do exist; that is why I wrote that I checked in another browser. I get a 404 error even though the page loads at that same moment in another browser. I understand a few of the provided URLs are indeed unavailable, but the 404 error keeps showing randomly even for pages the code does load. For example, with https://rsb.sazka.cz/fotbal/anglie-3-liga/ I ran the code five times, each right after the previous iteration ended; the webpage loaded three times, and twice I got a 404 error.
Posts: 11,875
Threads: 474
Joined: Sep 2016
I just manually tried https://rsb.sazka.cz/fotbal/anglie-3-liga/ and received 404 - Page not found.