 Problem With Simple Multiprocessing Script
#11
Ok, so I've rewritten your code a bit. You created a manager but never passed it to the function. Being fast and loose with global variables can slide now and then, but not when you're dealing with multiple processes: each worker process gets its own copy of module-level state, so everything a worker uses needs to be passed in explicitly. This is a fairly advanced topic, so please don't take it poorly that you didn't get it right the first time (I've spent the past day reading docs just to get to this point, lol).

from functools import partial
from multiprocessing import Pool, Manager

def test(counter, lock, current_item):
    with lock:
        value = counter.get() + 1
        counter.set(value)
        print(f"{value} => {current_item}")

if __name__ == '__main__':
    list1 = ["item1",
             "item2",
             "item3",
             "item4",
             "item5",
             "item6",
             "item7",
             "item8",
             "item9",
             "item10",
             "item11",
             "item12"]

    with Manager() as manager:
        value = manager.Value("i", 0)
        lock = manager.Lock()

        worker = partial(test, value, lock)
        with Pool(processes=4) as p:
            p.map(worker, list1)
Output:
1 => item1
2 => item3
3 => item2
4 => item4
5 => item5
6 => item8
7 => item6
8 => item7
9 => item9
10 => item10
11 => item11
12 => item12
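
If it helps to see why plain globals don't work here, this is a minimal sketch (the names `plain`, `shared` and `worker` are made up for illustration, they are not from your script). The child process appends to both a module-level list and a manager list; only the manager list shows the change back in the parent.

```python
from multiprocessing import Process, Manager

plain = []  # ordinary module-level list, NOT shared between processes


def worker(shared):
    plain.append("x")   # only changes the copy living in the child process
    shared.append("x")  # goes through the manager proxy, so the parent sees it


if __name__ == "__main__":
    with Manager() as manager:
        shared = manager.list()
        p = Process(target=worker, args=(shared,))
        p.start()
        p.join()
        # The parent's plain list is still empty; the manager list has one item.
        print(len(plain), len(shared))
```

The same thing applies to a module-level list of exceptions or a module-level Lock: the workers end up with their own copies unless you pass a managed object in.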
#12
Phew.. I think I'm 95% done with this script.. just hit another issue though.

I'm taking the URLs I want processed, adding them to a Pandas dataframe, and trying to pass that through map as an array so each URL gets processed by my 'scraper' function.

I need the dataframe inside the scraper function because the 'counter' tells me which row to fill with the scraped data.

Part of my problem is that I don't know where to create the dataframe or how to manage it properly inside the function.

Here's the code:

from multiprocessing import Lock, Pool, Manager
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests


exceptions = []
lock = Lock()


def scraper(obj):  # obj is the array passed from map (counter, url items)

    counter, url = obj  # not sure what this does

    df.insert(1, 'Alexa Rank:', "")  # insert new column
    df.insert(2, 'Status:', "")  # insert new column

    lock.acquire()

    counter_val = counter.get()

    try:

        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)

        if scrape.status_code == 200:

            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """

            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')

            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))

            df.iloc[counter_val, 0] = url  # fill cell with URL data
            df.iloc[counter_val, 1] = rank  # fill cell with alexa rank

            counter.set(counter_val + 1)  # increment counter

            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', counter_val, '-', df.iloc[counter_val, 0], '-', "Rank:", rank[0])

        else:
            print("Server Status:", scrape.status_code)
            df.iloc[counter_val, 2] = scrape.status_code  # fill df cell with server status code
            counter.set(counter_val + 1)
            pass

    except BaseException as e:
        exceptions.append(e)
        print("Exception:", e)
        df.iloc[counter_val, 2] = e  # fill df cell with script exception message
        counter.set(counter_val + 1)
        pass

    finally:
        lock.release()
        df.to_csv("output.csv", index=False)
        return


if __name__ == '__main__':

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               GET LINK LIST:                  '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    # list1 is defined here: get this line of code from the pastebin (link list)
    # https://pastebin.com/h42wqJPp

    df = pd.DataFrame(list1, columns=["Links:"])  # create pandas dataframe from links list

    """ --------------------------------------------- """
    # ---------------------------------------------------
    '''               MULTIPROCESSING:                '''
    # ---------------------------------------------------
    """ --------------------------------------------- """

    counter = Manager().Value(int, 0)  # set counter as manager with value of 0
    array = [(counter, url) for url in df]  # link together the counter and list in an array ---------------------------------------- ***** ERROR - not adding links to array correctly *****
    print("Problem here, it's not adding all the links to array", array)

    p = Pool(20)  # worker count
    p.map(scraper, array)  # function, iterable
    p.terminate()
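
For the dataframe question above, one possible direction, offered as a sketch rather than a drop-in fix. Iterating over a DataFrame directly yields its column labels, which is why `for url in df` only produces "Links:"; iterating `df["Links:"]` would give the URLs. Beyond that, each worker process has its own copy of any dataframe, so filling cells inside `scraper` would not reach the parent. The sketch below has each worker return one row instead, and relies on `Pool.map` returning results in input order, so no counter or lock is needed. `fetch_rank`, `scrape_one` and `scrape_all` are hypothetical names, and `fetch_rank` is a placeholder for the real requests/Alexa lookup.

```python
from multiprocessing import Pool


def fetch_rank(url):
    """Hypothetical placeholder for the real requests/Alexa lookup."""
    if not url:
        raise ValueError("empty url")
    return len(url)  # fake rank so the sketch runs without network access


def scrape_one(url):
    """Runs in a worker process; returns one row instead of touching a shared df."""
    try:
        return {"Links:": url, "Alexa Rank:": fetch_rank(url), "Status:": "ok"}
    except Exception as e:
        return {"Links:": url, "Alexa Rank:": "", "Status:": f"error: {e}"}


def scrape_all(urls, workers=4):
    """Pool.map returns results in the same order as the input list."""
    with Pool(processes=workers) as p:
        return p.map(scrape_one, urls)


if __name__ == "__main__":
    urls = ["http://example.com", "", "http://example.org"]  # made-up list
    rows = scrape_all(urls)
    for row in rows:
        print(row)
    # In the real script, the parent could then build and save the dataframe:
    # pd.DataFrame(rows).to_csv("output.csv", index=False)
```

Building the dataframe once in the parent also means the CSV is written a single time at the end, rather than from every worker.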