Python Forum
Webcrawler with Selenium and Browsermob, Har file not complete
#1
I hope somebody can help me.

I am building a complex web crawler, and it works perfectly when I'm not using Selenium.
I need the response codes and headers, so I set up the crawler to use Browsermob.

Sometimes my HAR file is not complete, so I get a lot of errors. I don't understand what's going on.

I start my proxy, run Selenium in threads, and then close the proxy again.

When I add a 5-second sleep after driver.get, I only get errors occasionally, but 5 seconds is a long time to wait when it isn't needed.

How can I know when I have a complete HAR file?
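
One way to replace the fixed 5-second sleep is to poll the HAR until it contains a finished entry for the requested page, and give up after a timeout. The sketch below is only an illustration under stated assumptions: `har_is_complete`, `wait_for_har` and the `get_har` callable are hypothetical names, the HAR is assumed to follow the usual `log` / `entries` / `request` / `response` layout, and a pending response is assumed to show up as a missing entry or a status of 0.

```python
import time


def har_is_complete(har, page_url):
    """Return True if the HAR holds an entry for page_url with a real status.

    Assumption: an unfinished request is either absent from the entries
    or recorded with status 0. The URL is compared exactly, so the caller
    may need to normalise things like a trailing slash first.
    """
    entries = har.get("log", {}).get("entries", [])
    for entry in entries:
        if entry.get("request", {}).get("url") != page_url:
            continue
        if entry.get("response", {}).get("status", 0) > 0:
            return True
    return False


def wait_for_har(get_har, page_url, timeout=5.0, interval=0.1):
    """Poll get_har() until the HAR looks complete or the timeout expires.

    get_har is any zero-argument callable returning the current HAR dict
    (hypothetical; in a real crawler it would read the HAR from the proxy).
    Returns the HAR on success, or None if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        har = get_har()
        if har_is_complete(har, page_url):
            return har
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With the browsermob-proxy Python client, `get_har` could probably be something like `lambda: proxy.har`, so a fast page returns after a fraction of a second and only a slow page waits the full timeout. That client also documents a `wait_for_traffic_to_stop(quiet_period, timeout)` call, which may serve the same purpose; I have not verified how it behaves with several threads.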

I found out that if I match the number of threads to the number of pages to fetch, I don't get any problems. Hmm.
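
The thread-count observation may point at threads interfering with each other's HAR. This is an assumption, not something confirmed here: if several threads share one proxy, starting a new HAR in one thread replaces the HAR another thread is still waiting to read. A minimal sketch of serialising the "new HAR, load page, read HAR" sequence with a lock follows. `HarCapture` and the three callables are hypothetical placeholders for the real proxy and driver calls.

```python
import threading


class HarCapture:
    """Serialise HAR capture so threads sharing one proxy cannot overlap.

    new_har(ref), load_page(url) and get_har() are hypothetical callables
    standing in for the proxy's "start a new HAR", the driver's page load,
    and the proxy's "read the current HAR".
    """

    def __init__(self, new_har, load_page, get_har):
        self._new_har = new_har
        self._load_page = load_page
        self._get_har = get_har
        self._lock = threading.Lock()

    def fetch(self, url):
        # Only one thread at a time may reset, fill, and read the HAR.
        with self._lock:
            self._new_har(url)
            self._load_page(url)
            return self._get_har()
```

The lock removes the parallelism for the page loads themselves, so this is mainly useful for checking whether shared state is the cause. If it is, giving each thread its own proxy port and driver would keep the crawl parallel.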

How can I do this better?

The current setup is this:

1. User enters their domain (Works)

2. Crawler starts (Works)

3. Proxy starts (Works)

4. Proxy blocks requests to Google Analytics, Facebook and more (Works)

5. Crawler requests each URL with Selenium in its own thread (Works)
-- Reason for using a proxy: it's not possible to get headers or status codes with Selenium

6. I get the HAR file, go through it, and collect the relevant data (headers, status code) - (Errors)
-- Error: this is the problem. Some pages return an almost empty HAR file, and when I check the URL, there is nothing wrong with it.

7. I get the page source from Selenium and pass it to soup. (Works)
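
Step 6 above can be sketched as a small function that pulls the status code and headers for the requested page out of a HAR dictionary and returns None when the entry is missing, so an almost empty HAR can be detected and retried rather than raising an error. `summarize_har` is a hypothetical name, and the code assumes the standard HAR layout in which headers are a list of `name` / `value` pairs.

```python
def summarize_har(har, page_url):
    """Return {"status": ..., "headers": {...}} for page_url, or None.

    None means the HAR has no entry for the page, which is how an
    almost empty HAR would show up. Header names are lower-cased so
    lookups do not depend on the server's capitalisation.
    """
    for entry in har.get("log", {}).get("entries", []):
        if entry.get("request", {}).get("url") != page_url:
            continue
        response = entry.get("response", {})
        headers = {
            h["name"].lower(): h["value"]
            for h in response.get("headers", [])
        }
        return {"status": response.get("status"), "headers": headers}
    return None
```

A caller can then treat a None result as "HAR not ready yet" and wait or retry, instead of failing while indexing into missing entries.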

How would you handle step 6 if we were forced to use Selenium?