Apr-16-2018, 07:18 PM
(This post was last modified: Apr-16-2018, 07:18 PM by digitalmatic7.)
Phew.. I think I'm 95% done with this script.. just hit another issue though
I'm taking the URLs I want processed, adding them to a Pandas dataframe, and trying to pass that through map() via the array so each item gets processed in my 'scraper' function.
I need the dataframe inside the scraper function because the 'counter' lets me fill the scraped data into the right table cell.
Part of my problem is that I don't know where to create the dataframe, or how to manage it properly inside the function.
Here's the code:
from multiprocessing import Lock, Pool, Manager
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests

exceptions = []
lock = Lock()


def scraper(obj):
    # obj is the tuple passed from map: (counter, url)
    counter, url = obj  # not sure what this does

    df.insert(1, 'Alexa Rank:', "")  # insert new column
    df.insert(2, 'Status:', "")  # insert new column

    lock.acquire()
    counter_val = counter.get()
    try:
        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
        if scrape.status_code == 200:
            # ---------------------------------------------------
            # --> SCRAPE ALEXA RANK: <--
            # ---------------------------------------------------
            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
            df.iloc[counter_val, 0] = url   # fill cell with URL data
            df.iloc[counter_val, 1] = rank  # fill cell with alexa rank
            counter.set(counter_val + 1)    # increment counter
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', counter_val, '-',
                  df.iloc[counter_val, 0], '-', "Rank:", rank[0])
        else:
            print("Server Status:", scrape.status_code)
            df.iloc[counter_val, 2] = scrape.status_code  # fill df cell with server status code
            counter.set(counter_val + 1)
    except BaseException as e:
        exceptions.append(e)
        print("Exception:", e)
        df.iloc[counter_val, 2] = e  # fill df cell with script exception message
        counter.set(counter_val + 1)
    finally:
        lock.release()
        df.to_csv("output.csv", index=False)
    return


if __name__ == '__main__':
    # ---------------------------------------------------
    # GET LINK LIST:
    # ---------------------------------------------------
    # get list1 from the pastebin (link list): https://pastebin.com/h42wqJPp
    df = pd.DataFrame(list1, columns=["Links:"])  # create pandas dataframe from links list

    # ---------------------------------------------------
    # MULTIPROCESSING:
    # ---------------------------------------------------
    counter = Manager().Value(int, 0)  # set counter as manager with value of 0
    array = [(counter, url) for url in df]  # ***** ERROR - not adding links to array correctly *****
    print("Problem here, it's not adding all the links to array", array)
    p = Pool(20)   # worker count
    p.map(scraper, array)  # function, iterable
    p.terminate()