How to properly do multi-processing with selenium
Hello everyone,

I am trying to do a complex web scraping task, but because it involves a large number of pages, and the processing I do on each page is itself complex, it takes a long time. I need to take advantage of my machine's multiple cores to speed this up.

I am using Selenium mainly because I need to make sure the JavaScript in the pages runs, so they render properly before I extract the HTML. I tried requests_html, but its render method would not work for some reason. The problem with Selenium is that it adds a big overhead every time it starts and closes a Chrome instance. So I am trying to figure out the proper way to do the parallel processing: open as many Selenium instances as I have parallel processes, and have each process reuse its own instance so they don't collide. But I am not sure how to do this correctly, at least with the concurrent.futures library that I am using (see the rough sketch after my code below).

Any advice on how to do this would be very much appreciated. Below are the relevant parts of my code; I am not posting the full code to avoid confusing you with the complex processing task I am working on.

from bs4 import BeautifulSoup
import re
import csv
from datetime import datetime
import numpy as np
import concurrent.futures
import multiprocessing
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920x1080')
options.add_argument('--no-sandbox')

def runit(row):
    # Starts (and later quits) a fresh Chrome instance for every row -- this is the overhead I want to avoid
    driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated; options= works in Selenium 3.8+
    driver.set_page_load_timeout(500)
    driver.implicitly_wait(500)
    url = row[1]
    driver.get(url)
    html_doc = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Long processing code that uses the soup object and builds the result object returned below

    return result, row

if __name__ == '__main__':
    multiprocessing.freeze_support()
    print(datetime.now())
    # The file below has the list of all the pages that I need to process, along with some other pieces of relevant data
    # The URL is the second field in the csv file
    with open('D:\\Users\\shina\\OneDrive\\testTiles.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        # I have 4 physical cores, but Windows shows 8 logical processors; I have tried values below 8, but 8 seems to give the fastest results
        with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
            results = executor.map(runit, csv_reader)
            
        # Later I will add code here to handle the results after all the processes finish

    print(datetime.now())