Python Forum

Full Version: Random Loss of Control of Website When Scraping
I've been writing a Python script to scrape a site. I'm accessing & controlling the site using Selenium.

Here's an example of a typical page on the site:

You'll probably get prompted with a warning initially when trying to hit the above page, but it's 100% safe. It's a government site listing info on parts for aircraft. Just click that you're OK with proceeding on to the site and the above page should render for you.

The problem I'm having is this: the script will successfully scrape the 50 records on a page, send a Javascript "doPostBack" command to navigate to the next page, successfully scrape the 50 records on that page, send another "doPostBack" command to land on the page after that, and so forth... but this process eventually breaks after a very random number of pages, and from that point on it seemingly can no longer navigate to any further pages.
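For reference, here's roughly the kind of navigation code I'm talking about (the control name and the "grid" element id are made up for the example; the real values come from inspecting the page's own `__doPostBack` calls):

```python
# Hypothetical WebForms control name; find the real one by inspecting
# the href/onclick of the site's pager links.
GRID_CONTROL = "ctl00$MainContent$grid"

def build_postback_js(control: str, argument: str) -> str:
    """Build the JavaScript snippet that triggers a WebForms post-back."""
    return f"__doPostBack('{control}', '{argument}')"

def go_to_page(driver, page_number: int, timeout: int = 30) -> None:
    """Send the post-back for the given page and wait for the old grid to go stale.

    `driver` is a selenium.webdriver instance; the selenium imports are kept
    inside the function so the pure helper above works without a browser.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    old_grid = driver.find_element(By.ID, "grid")  # hypothetical element id
    driver.execute_script(build_postback_js(GRID_CONTROL, f"Page${page_number}"))
    # Waiting for the old grid element to go stale distinguishes a real
    # re-render from the "stuck on the same page" case described below.
    WebDriverWait(driver, timeout).until(EC.staleness_of(old_grid))
```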

For example, I've seen the scrape successfully navigate and scrape 28 pages out of a possible 120 and then be seemingly unable to navigate from page 28 to 29.

I've re-run the scrape and seen it successfully navigate/scrape 73 pages out of the possible 120 and then hit the same thing... the Javascript "doPostBack" command absolutely will not produce the navigation to page 74.

I've re-run / re-started the scrape again and have seen it then successfully navigate through all 120 pages without any issue.

I can then re-run it again and it might navigate through 95 of the 120 pages and be unable to navigate on to page 96.

Etc... you get my drift... it very randomly hits a point where the navigation attempts stop working and the Python script simply cannot continue on to the remaining pages; the attempts just leave the browser stuck on whatever page it's currently on.

I've tried every trick under the sun that I've read about online without any definitive resolution.

It's as if I've lost all control of the page from Python at a random point during each scrape attempt and the page simply won't respond to the same Javascript "doPostBack" command that it successfully responded to many times over and over up until that point.
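The check I have in mind for "did we actually move on?" would be something like this (assuming the grid renders a pager label like "Page 28 of 120" — the real markup may differ, so the regex is a guess):

```python
import re
from typing import Optional

def current_page(pager_text: str) -> Optional[int]:
    """Extract the current page number from a pager label like 'Page 28 of 120'.

    Returns None when the label doesn't match, so an unreadable pager can be
    treated the same as a failed navigation.
    """
    match = re.search(r"Page\s+(\d+)\s+of\s+\d+", pager_text)
    return int(match.group(1)) if match else None

def navigation_succeeded(pager_text: str, expected_page: int) -> bool:
    """True only if the page actually advanced to the page we asked for."""
    return current_page(pager_text) == expected_page
```

In the script, `pager_text` would come from something like `driver.find_element(...).text` on the grid's pager row after each "doPostBack".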

The only thing I can think of to attempt at this point is the following: once I've somehow determined that a navigation attempt did *NOT* land on the next intended page and that we're still "stuck" on the same page of 50 records, issue a "driver.quit()" command in Python to close out of the existing browser session completely, re-instantiate a new browser instance with Python/Selenium, re-open the desired page, and send a Javascript "doPostBack" command to jump directly to the page right after where the last attempt stalled.

In other words: close out of the browser entirely with "driver.quit()" whenever it's evident I'm no longer navigating successfully, open a brand new browser instance, send a fresh "doPostBack" command to go straight to the next page of 50 records where I left off, see how many additional pages I can get through, and if it stalls out again, repeat the quit/reopen cycle however many times needed until the script has navigated all remaining pages.
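Sketched out, that restart idea would look something like this (all the names here are mine; the browser work is delegated to callables so the quit-and-resume logic itself is plain Python):

```python
def scrape_with_restarts(make_driver, open_listing, goto_page, scrape_page,
                         get_current_page, total_pages, max_restarts=10):
    """Scrape all pages, restarting the browser whenever navigation stalls.

    The callables wrap the Selenium work:
      make_driver()            -> new WebDriver instance
      open_listing(driver)     -> load the search-results page (lands on page 1)
      goto_page(driver, n)     -> send the "doPostBack" for page n
      scrape_page(driver)      -> return the records on the current page
      get_current_page(driver) -> page number the grid is actually showing
    """
    results = []
    page, restarts = 1, 0
    driver = make_driver()
    open_listing(driver)
    try:
        while page <= total_pages:
            if page > 1:
                goto_page(driver, page)
            if get_current_page(driver) != page:
                # Navigation stalled: throw the whole session away and
                # resume from the same page in a fresh browser.
                restarts += 1
                if restarts > max_restarts:
                    raise RuntimeError(f"still stuck on page {page} "
                                       f"after {max_restarts} restarts")
                driver.quit()
                driver = make_driver()
                open_listing(driver)
                continue
            results.extend(scrape_page(driver))
            page += 1
    finally:
        driver.quit()
    return results
```

The `max_restarts` cap is just there so a genuinely broken run fails loudly instead of quitting/reopening browsers forever.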

Any thoughts or suggestions are greatly appreciated, as the page navigation issues are very much a show-stopper for me if I can't get the page navigation to function consistently somehow.

Also, this isn't a one-time scrape... it's a scrape that will ultimately be executed once every single day.

Thanks in advance!