Python Forum
Problem With Simple Multiprocessing Script
#12
Phew... I think I'm 95% done with this script... just hit another issue, though.

I'm taking the URLs I want processed, adding them to a Pandas dataframe, and trying to pass them (each paired with a counter) through map as an array, to be processed inside my 'scraper' function.

I need the dataframe inside the scraper function because the 'counter' lets me fill the scraped data into the right table cell.

Part of my problem is that I don't know where to create the dataframe, or how to manage it properly inside the function.
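One alternative I've been considering (just a sketch, with a stand-in `scrape_one` function instead of the real request code): since `Pool.map` returns results in the same order as its input, the workers could simply return their values and the parent could build the dataframe afterwards, which would avoid sharing `df` and the counter across processes entirely:

```python
from multiprocessing import Pool

import pandas as pd


def scrape_one(url):
    # stand-in for the real requests/BeautifulSoup work
    rank = len(url)  # dummy value in place of the Alexa rank
    return url, rank


if __name__ == '__main__':
    links = ['http://a.example', 'http://b.example']
    with Pool(2) as p:
        rows = p.map(scrape_one, links)  # results come back in input order
    df = pd.DataFrame(rows, columns=['Links:', 'Alexa Rank:'])
    df.to_csv("output.csv", index=False)
```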

Here's the code:

from multiprocessing import Lock, Pool, Manager
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests


exceptions = []
lock = Lock()


def scraper(obj):  # obj is the array passed from map (counter, url items)

    counter, url = obj  # tuple unpacking: splits the (counter, url) pair passed in from map

    # note: these inserts run on every call; the columns should probably be added once, before mapping
    df.insert(1, 'Alexa Rank:', "")  # insert new column
    df.insert(2, 'Status:', "")  # insert new column

    lock.acquire()

    counter_val = counter.get()

    try:

        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)

        if scrape.status_code == 200:

            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """

            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')

            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))  # returns a list; rank[0] below raises IndexError if nothing matched

            df.iloc[counter_val, 0] = url  # fill cell with URL data
            df.iloc[counter_val, 1] = rank  # fill cell with alexa rank

            counter.set(counter_val + 1)  # increment counter

            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', counter_val, '-', df.iloc[counter_val, 0], '-', "Rank:", rank[0])

        else:
            print("Server Status:", scrape.status_code)
            df.iloc[counter_val, 2] = scrape.status_code  # fill df cell with server status code
            counter.set(counter_val + 1)

    except Exception as e:  # Exception rather than BaseException, so Ctrl-C isn't swallowed
        exceptions.append(e)
        print("Exception:", e)
        df.iloc[counter_val, 2] = str(e)  # fill df cell with the exception message
        counter.set(counter_val + 1)

    finally:
        lock.release()
        df.to_csv("output.csv", index=False)


if __name__ == '__main__':

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               GET LINK LIST:                  '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    # get the link list (list1) from the pastebin: https://pastebin.com/h42wqJPp

    df = pd.DataFrame(list1, columns=["Links:"])  # create pandas dataframe from links list

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               MULTIPROCESSING:                '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    counter = Manager().Value('i', 0)  # shared counter proxy starting at 0 ('i' = C int typecode)
    array = [(counter, url) for url in df]  # pair the counter with each link  ***** ERROR: not adding the links to the array correctly *****
    print("Problem here, it's not adding all the links to the array:", array)

    p = Pool(20)  # worker count
    p.map(scraper, array)  # function, iterable
    p.close()  # close() then join(): let the workers finish; terminate() can kill them mid-task
    p.join()
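Here's a minimal repro of the array problem with dummy URLs: iterating a DataFrame directly yields its column labels, not its rows, so I think the loop has to go over the column (a Series) instead:

```python
import pandas as pd

df = pd.DataFrame(['http://a.example', 'http://b.example'], columns=['Links:'])

print(list(df))            # ['Links:'] -- iterating the frame gives column labels

urls = list(df['Links:'])  # iterate the column itself to get the URLs
array = list(enumerate(urls))
print(array)               # [(0, 'http://a.example'), (1, 'http://b.example')]
```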


Messages In This Thread
RE: Problem With Simple Multiprocessing Script - by digitalmatic7 - Apr-16-2018, 07:18 PM

