Webcrawler with Selenium and BrowserMob: HAR file not complete

I hope somebody can assist me.

I am building a complex web crawler, and it works perfectly when I'm not using Selenium.
I need the response codes and headers, so I set the crawler up to use BrowserMob Proxy.

Sometimes my HAR file is not complete, so I get a lot of errors, and I don't understand what's going on.

I start my proxy, then run Selenium in threads, and then close the proxy again.
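Stripped down, the flow looks roughly like this (the binary path, URLs and block patterns are placeholders, not my real values):

from threading import Thread

from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy")   # placeholder path to the BrowserMob Proxy binary
server.start()
proxy = server.create_proxy()

# Step 4: block trackers before crawling (example patterns only)
proxy.blacklist("https?://(www\\.)?google-analytics\\.com/.*", 200)
proxy.blacklist("https?://(www\\.)?facebook\\.com/.*", 200)

def fetch(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server={}".format(proxy.proxy))
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    # ... later: driver.page_source goes to BeautifulSoup (step 7) ...
    driver.quit()

proxy.new_har("crawl", options={"captureHeaders": True})

urls = ["https://example.com/", "https://example.com/about"]   # placeholder URLs
threads = [Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

har = proxy.har    # step 6 reads this after the threads finish
server.stop()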

When I add a 5-second sleep after driver.get() I only get errors sometimes, but 5 seconds is a lot of time to wait when it isn't needed.

How can I know when I have a complete HAR file?

I found out that if I match the number of threads to the number of pages to fetch, then I don't get any problems. Hmm.

How can I do this better?
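What I had in mind is something like the helper below: instead of a fixed sleep, poll the HAR until it stops growing and every entry has a response. The function name, thresholds and timeout are made up for illustration, and I don't know if this is actually reliable:

import time

def har_looks_complete(proxy, quiet_checks=3, interval=0.5, timeout=10.0):
    """Return True once the HAR has stopped growing and every entry has a response."""
    deadline = time.time() + timeout
    last_count = -1
    stable = 0
    while time.time() < deadline:
        entries = proxy.har["log"]["entries"]
        finished = all(e.get("response", {}).get("status", 0) > 0 for e in entries)
        if entries and finished and len(entries) == last_count:
            stable += 1
            if stable >= quiet_checks:   # no new traffic for a few checks in a row
                return True
        else:
            stable = 0
        last_count = len(entries)
        time.sleep(interval)
    return False                         # timed out; the HAR may still be partial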

The current setup is this:

1. User enters their domain (Works)

2. Crawler starts (Works)

3. Proxy starts (Works)

4. Proxy blocks requests to Google Analytics, Facebook and more (Works)

5. Crawler requests the URL with Selenium, each URL in its own thread (Works)
-- Reason for using a proxy: it's not possible to get headers or status codes with Selenium alone

6. I get the HAR file, go through it, and collect the relevant data (headers, status code) (Errors) - see the sketch after this list
-- Error: Here is the problem: some pages return an almost empty HAR file, and when I check the URL there is nothing wrong with it.

7. I get the page source from Selenium and pass it to BeautifulSoup. (Works)
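For reference, step 6 boils down to something like this (field names follow the HAR format; the real code does more, this is just a sketch):

def collect_from_har(har):
    """Step 6: map each requested URL to its status code and response headers."""
    results = {}
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        response = entry.get("response", {})
        status = response.get("status", 0)   # incomplete entries show up here as 0
        headers = {h["name"]: h["value"] for h in response.get("headers", [])}
        results[url] = {"status": status, "headers": headers}
    return results

data = collect_from_har(proxy.har)   # proxy is the BrowserMob client from above

When the problem hits, the entries list is nearly empty, so most URLs are simply missing from the result.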

How would you handle step 6 if we were forced to use Selenium?