Posts: 5,151
Threads: 396
Joined: Sep 2016
I have a Selenium program that is archiving the last 1500 threads of the forum. At around 818 threads it times out. The method run to archive each URL is:
def archive_url(self, url):
    self.browser.get('https://web.archive.org/')
    WebDriverWait(self.browser, 10).until(EC.presence_of_element_located((By.ID, "web_save_div")))
    self.browser.find_element_by_xpath("/html/body/div[3]/div/div[3]/div/div[2]/div[3]/div[2]/form/input").click()
    self.browser.find_element_by_class_name('web-save-url-input').send_keys(url)
    self.delay()
    self.browser.find_element_by_xpath('/html/body/div[3]/div/div[3]/div/div[2]/div[3]/div[2]/form/button').click()
    WebDriverWait(self.browser, 10).until(EC.presence_of_element_located((By.ID, "wmtbURL")))
    print(f'Archived: {url}')
Error: Traceback (most recent call last):
  File "archive_forum.py", line 213, in <module>
  File "archive_forum.py", line 177, in __init__
    def archive_url(self, url):
  File "archive_forum.py", line 187, in archive_url
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
At first I assumed the wait needed to be longer than 10 seconds, but it consistently times out only after roughly 818 threads.
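One way to keep a long batch run alive is to catch the TimeoutException and retry the URL a couple of times before giving up. This is only a sketch of that idea using a generic retry helper; the name `retry` and the delay values are my own, not from the program above:

```python
import time

def retry(func, attempts=3, exceptions=(Exception,), delay=0.0):
    """Call func(); if it raises one of `exceptions`, wait `delay`
    seconds and try again, re-raising after the final attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Hypothetical usage inside the archiver, with selenium's TimeoutException:
# retry(lambda: self.archive_url(url),
#       attempts=3, exceptions=(TimeoutException,), delay=5)
```

That way one slow save doesn't kill the whole 1500-thread run.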
Posts: 12,046
Threads: 487
Joined: Sep 2016
I usually set the timeout of WebDriverWait to 50 seconds. Although 10 should be sufficient, it sometimes is not, and as soon as the condition is True the wait ends anyway.
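For reference, WebDriverWait.until is essentially a poll loop: it keeps re-evaluating the condition and returns the moment the condition yields a truthy value, so a generous timeout costs nothing on the happy path. A minimal analogue (my own simplified sketch, not selenium's actual implementation):

```python
import time

def until(condition, timeout=10.0, poll=0.5):
    """Re-evaluate condition() until it returns a truthy value
    (returned immediately) or `timeout` seconds elapse (TimeoutError)."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() > deadline:
            raise TimeoutError('condition not met within timeout')
        time.sleep(poll)
```

So raising the timeout from 10 to 50 only changes how long it waits when the element never shows up.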
Posts: 5,151
Threads: 396
Joined: Sep 2016
Jan-13-2019, 11:51 PM
(This post was last modified: Jan-13-2019, 11:55 PM by metulburr.)
I still get a timeout with 50 seconds, even though it only takes about 6 seconds between each URL to save. I think I'll just go back to time.sleep, as that was working perfectly fine. What is weird is that the time.sleep was only for 1.5 seconds.
EDIT:
The one that I got last time was for this line:
Quote:WebDriverWait(self.browser, 50).until(EC.presence_of_element_located((By.ID,"wmtbURL")))
which runs after the archive is already done. I probably won't even need it: it replaced a time.sleep(1.5), and there is nothing left to wait for at that point.
Posts: 12,046
Threads: 487
Joined: Sep 2016
Quote:What is weird is that the time.sleep was only for 1.5 seconds
That is weird. I suppose you could dig in and find out why, but it's probably not worth the effort.
Posts: 5,151
Threads: 396
Joined: Sep 2016
Well, I guess I take that back: I get a timeout error with just time.sleep too.
Error: Traceback (most recent call last):
  File "archive_forum.py", line 208, in <module>
    App()
  File "archive_forum.py", line 175, in __init__
    self.archive_url(url)
  File "archive_forum.py", line 184, in archive_url
    self.browser.find_element_by_xpath('/html/body/div[3]/div/div[3]/div/div[2]/div[3]/div[2]/form/button').click()
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 78, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 499, in _execute
    return self._parent.execute(command, params)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 297, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout
  (Session info: headless chrome=63.0.3239.108)
  (Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 4.4.0-141-generic x86_64)
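Since the failure only shows up after roughly 800 pages in one headless Chrome session, another thing worth trying is restarting the browser every few hundred URLs so the session can't accumulate state. This is just a sketch of that idea; `chunked` and the batch size of 200 are my own invention, and the selenium parts are shown only as comments:

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical usage: a fresh browser per batch of URLs.
# for batch in chunked(urls, 200):
#     browser = webdriver.Chrome()   # new session each batch
#     for url in batch:
#         archive_url(browser, url)
#     browser.quit()
```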
Posts: 12,046
Threads: 487
Joined: Sep 2016
Jan-14-2019, 02:23 AM
(This post was last modified: Jan-14-2019, 02:43 AM by Larz60+.)
It looks like it's not seeing the By.ID, "wmtbURL".
I'll try pulling it up in the debugger and see where it's hanging (maybe).
I'm wondering if you are running into an issue that I recently encountered, where the buttons that need to be clicked are off the page and need to be scrolled to. Here's some working code where that's exactly what happened, and the only indication I had was a timeout.
This might not have anything at all to do with your issue. I struggled with it for a couple of days before realizing what was going on.
code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.common import exceptions
from itertools import permutations
from bs4 import BeautifulSoup
import BusinessPaths
import time
import PrettifyPage
import string
import sys


class GetArkansas:
    def __init__(self):
        self.bpath = BusinessPaths.BusinessPaths()
        self.pp = PrettifyPage.PrettifyPage()
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)
        self.get_all_data()

    def get_all_data(self):
        end = False
        # 3 characters satisfies minimum search requirements
        al = f'{string.ascii_uppercase}-&0123456789 '
        for letter1 in al:
            if end:
                break
            for letter2 in al:
                if end:
                    break
                for letter3 in al:
                    sitem = f'{letter1}{letter2}{letter3}'
                    print(f'initial value: {sitem}')
                    if sitem == 'BAA':
                        end = True
                        break
                    alph = [''.join(p) for p in permutations(sitem)]
                    for entry in alph:
                        self.get_data(entry)
        self.browser.close()

    def get_data(self, searchitem):
        mainfilename = self.bpath.htmlpath / f'mainpage_{searchitem}.html'
        if mainfilename.exists():
            return None
        arkansas_url = 'https://www.sos.arkansas.gov/corps/search_all.php'
        self.browser.get(arkansas_url)
        time.sleep(2)
        mainsrc = self.browser.page_source
        soup = BeautifulSoup(mainsrc, "lxml")
        with mainfilename.open('w') as fp:
            fp.write(self.pp.prettify(soup, 2))
        # This gets first page
        search_box = self.browser.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div/form/table/tbody/tr[4]/td[2]/font/input')
        search_box.clear()
        search_box.send_keys(searchitem)
        self.browser.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div/form/table/tbody/tr[11]/td/font/input').click()
        time.sleep(3)
        src = self.browser.page_source
        if 'There were no records found!' in src:
            print(f'There are no records for {searchitem}')
            return None
        print(f'got page 1 - {searchitem}')
        soup = BeautifulSoup(src, "lxml")
        filename = self.bpath.htmlpath / f'results_{searchitem}page1.html'
        with filename.open('w') as fp:
            fp.write(self.pp.prettify(soup, 2))
        page = 2
        while True:
            try:
                height = self.browser.execute_script("return document.documentElement.scrollHeight")
                self.browser.execute_script("window.scrollTo(0, " + str(height) + ");")
                # Next line fails on third page!
                # mainContent > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(3) > font:nth-child(1) > a:nth-child(1)
                # page_button_xpath = f'/html/body/div[2]/div/div[2]/div/table[4]/tbody/tr/td[{page}]/font/a'
                page_button_css = 'mainContent > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(3) > font:nth-child(1) > a:nth-child(1)'
                next_page = self.browser.find_element(By.PARTIAL_LINK_TEXT, 'Next 250')
                # perform() is needed or the move never actually executes
                ActionChains(self.browser).move_to_element(next_page).perform()
                next_page.click()
                time.sleep(2)
                print(f'got page {page} - {searchitem}')
                src = self.browser.page_source
                soup = BeautifulSoup(src, "lxml")
                filename = self.bpath.htmlpath / f'results_{searchitem}page{page}.html'
                with filename.open('w') as fp:
                    fp.write(self.pp.prettify(soup, 2))
                page += 1
            except exceptions.NoSuchElementException:
                break
        # sys.exit(0)


if __name__ == '__main__':
    GetArkansas()

Look at the garbage beginning here:
while True:
    try:
        height = self.browser.execute_script("return document.documentElement.scrollHeight")
This determines the height of the page and scrolls the page until the buttons are visible, then moves to the button and clicks.
If you try to run this, I'll either have to post the two external files, BusinessPaths and PrettifyPage, or you'll have to comment them out.
I think the scroll code is fairly straightforward, so you probably won't need the other files.
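The shape of that pagination loop, i.e. keep clicking "next" until the element lookup fails, reduces to a small pattern. A stripped-down analogue, where LookupError stands in for selenium's NoSuchElementException and the names are mine:

```python
def collect_pages(fetch_next):
    """Call fetch_next() repeatedly, collecting each page's result,
    until it signals the end of pagination by raising LookupError."""
    pages = []
    while True:
        try:
            pages.append(fetch_next())
        except LookupError:
            break
    return pages
```

In the real script, `fetch_next` would scroll, click "Next 250", and return the page source; the missing link on the last page is what ends the loop.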
Posts: 5,151
Threads: 396
Joined: Sep 2016
No, I mean that last error is with all the WebDriverWaits replaced with time.sleeps.
Posts: 12,046
Threads: 487
Joined: Sep 2016
I've been coding since 3 A.M. and am getting weird. That's 20 hours straight. Time to quit, I don't think I'm making any sense!
Posts: 5,151
Threads: 396
Joined: Sep 2016
those are the best moments