Apr-16-2018, 07:18 PM
(This post was last modified: Apr-16-2018, 07:18 PM by digitalmatic7.)
Phew.. I think I'm 95% done with this script.. just hit another issue though
I'm taking the URLs I want processed, adding them to a Pandas dataframe, and trying to pass that through map() via the array so each item gets processed in my 'scraper' function.
I need the dataframe inside the scraper function because the 'counter' lets me fill the scraped data into the right table cell.
Part of my problem is that I don't know where to create the dataframe, or how to manage it properly inside the function.
Here's the code:
from multiprocessing import Lock, Pool, Manager
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests

exceptions = []
lock = Lock()


def scraper(obj):
    # obj is the tuple passed from map: (counter, url)
    counter, url = obj  # not sure what this does

    df.insert(1, 'Alexa Rank:', "")  # insert new column
    df.insert(2, 'Status:', "")  # insert new column

    lock.acquire()
    counter_val = counter.get()
    try:
        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
        if scrape.status_code == 200:
            # ---------------------------------------------------
            # --> SCRAPE ALEXA RANK: <--
            # ---------------------------------------------------
            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
            df.iloc[counter_val, 0] = url   # fill cell with URL data
            df.iloc[counter_val, 1] = rank  # fill cell with alexa rank
            counter.set(counter_val + 1)    # increment counter
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', counter_val, '-',
                  df.iloc[counter_val, 0], '-', "Rank:", rank[0])
        else:
            print("Server Status:", scrape.status_code)
            df.iloc[counter_val, 2] = scrape.status_code  # fill df cell with server status code
            counter.set(counter_val + 1)
    except BaseException as e:
        exceptions.append(e)
        print("Exception:", e)
        df.iloc[counter_val, 2] = e  # fill df cell with script exception message
        counter.set(counter_val + 1)
    finally:
        lock.release()
        df.to_csv("output.csv", index=False)
    return


if __name__ == '__main__':
    # ---------------------------------------------------
    # GET LINK LIST:
    # ---------------------------------------------------
    # get list1 from the pastebin (link list): https://pastebin.com/h42wqJPp
    df = pd.DataFrame(list1, columns=["Links:"])  # create pandas dataframe from links list

    # ---------------------------------------------------
    # MULTIPROCESSING:
    # ---------------------------------------------------
    counter = Manager().Value(int, 0)  # set counter as manager with value of 0
    array = [(counter, url) for url in df]  # ***** ERROR - not adding links to array correctly *****
    print("Problem here, it's not adding all the links to array", array)
    p = Pool(20)   # worker count
    p.map(scraper, array)  # function, iterable
    p.terminate()