 Problem With Simple Multiprocessing Script
#1
Is it possible to multiprocess this script without breaking it up into functions?

I'm trying to keep it as barebones and simple as possible.


# EXTREMELY SIMPLE SCRAPING SCRIPT


from time import sleep
from bs4 import BeautifulSoup
import re
import requests
from multiprocessing import Pool


exceptions = []

list1 = ["http://www.wallstreetinvestorplace.com/2018/04/cvs-health-corporation-cvs-to-touch-7-54-earnings-growth-for-next-year/",
         "https://macondaily.com/2018/04/06/cetera-advisors-llc-lowers-position-in-cvs-health-cvs.html",
         "http://www.thesportsbank.net/football/liverpool/jurgen-klopp-very-positive-about-mo-salah-injury/",
         "https://www.moneyjournals.com/trump-wasting-time-trying-bring-amazon/",
         "https://www.pmnewsnigeria.com/2018/04/06/fcta-targets-800000-children-for-polio-immunisation/",
         "http://toronto.citynews.ca/2018/04/06/officials-in-canada-braced-for-another-spike-in-illegal-border-crossings/",
         "https://www.pmnewsnigeria.com/2018/04/04/pdp-describes-looters-list-as-plot-to-divert-attention/",
         "https://beyondpesticides.org/dailynewsblog/2018/04/epa-administrator-pruitt-colluding-regulated-industry/",
         "http://thyblackman.com/2018/04/06/robert-mueller-is-searching-for/",
         "https://www.theroar.com.au/2018/04/06/2018-commonwealth-games-swimming-night-2-finals-live-updates-results-blog/",
         "https://medicalresearch.com/pain-research/migraine-linked-to-increased-risk-of-heart-disease-and-stroke/40858/",
         "http://www.investingbizz.com/2018/04/amazon-com-inc-amzn-stock-creates-investors-concerns/",
         "https://stocknewstimes.com/2018/04/06/convergence-investment-partners-llc-grows-position-in-amazon-com-inc-amzn.html",
         "https://factsherald.com/old-food-rules-needs-to-be-updated/",
         "https://www.nextadvisor.com/blog/2018/04/06/the-facebook-scandal-evolves/",
         "http://sacramento.cbslocal.com/2018/04/04/police-family-youtube-shooter/",
         "http://en.brinkwire.com/245768/why-does-stress-lead-to-weight-gain-study-sheds-light/",
         "https://www.marijuana.com/news/2018/04/monterey-bud-jeff-sessions-is-on-the-wrong-side-of-history-science-and-public-opinion/",
         "http://www.stocksgallery.com/2018/04/06/jpmorgan-chase-co-jpm-noted-a-price-change-of-0-80-and-amazon-com-inc-amzn-closes-with-a-move-of-2-92/",
         "https://stocknewstimes.com/2018/04/06/front-barnett-associates-llc-has-2-41-million-position-in-cvs-health-corp-cvs.html",
         "http://www.liveinsurancenews.com/colorado-mental-health-insurance-bill-to-help-consumers-navigate-the-system/",
         "http://newyork.cbslocal.com/2018/04/04/youtube-headquarters-shooting-suspect/",
         "https://ledgergazette.com/2018/04/06/liberty-interactive-co-series-a-liberty-ventures-lvnta-shares-bought-by-brandywine-global-investment-management-llc.html",
         "http://bangaloreweekly.com/2018-04-06-city-holding-co-invests-in-cvs-health-corporation-cvs-shares/",
         "https://www.thenewsguru.com/didnt-know-lawyer-paid-prostitute-130000-donald-trump/",
         "http://www.westlondonsport.com/chelsea/football-wls-conte-gives-two-main-reasons-chelseas-loss-tottenham",
         "https://registrarjournal.com/2018/04/06/amazon-com-inc-amzn-shares-bought-by-lenox-wealth-management-inc.html",
         "http://www.businessdayonline.com/1bn-eca-withdrawal-commence-action-president-buhari-pdp-tasks-nass/",
         "http://www.thesportsbank.net/football/manchester-united/pep-guardiola-asks-for-his-fans-help-vs-united-in-manchester-derby/",
         "https://www.pakistantoday.com.pk/2018/04/06/three-palestinians-martyred-as-new-clashes-erupt-along-gaza-border/",
         "http://www.nasdaqfortune.com/2018/04/06/risky-factor-of-cvs-health-corporation-cvs-is-observed-at-1-03/",
         "https://stocknewstimes.com/2018/04/06/cetera-advisor-networks-llc-decreases-position-in-cvs-health-cvs.html",
         "http://nasdaqjournal.com/index.php/2018/04/06/planet-fitness-inc-nyseplnt-do-analysts-think-you-should-buy/",
         "http://www.tv360nigeria.com/apc-to-hold-national-congress/",
         "https://www.pmnewsnigeria.com/2018/04/03/apc-governors-keep-sealed-lips-after-meeting-with-buhari/",
         "https://www.healththoroughfare.com/diet/healthy-lifestyle-best-foods-you-should-eat-for-weight-loss/7061",
         "https://stocknewstimes.com/2018/04/05/amazon-com-inc-amzn-shares-bought-by-west-oak-capital-llc.html",
         "http://www.current-movie-reviews.com/48428/dr-oz-could-you-be-a-victim-of-sexual-assault-while-on-vacation/",
         "https://www.brecorder.com/2018/04/07/410124/world-health-day-to-be-observed-on-april-7/",
         "http://www.coloradoindependent.com/169637/trump-pruitt-emissions-epa-pollution",
         "https://thecrimereport.org/2018/04/05/will-sessions-new-justice-strategy-turn-the-clock-back-on-civil-rights/",
         "http://en.brinkwire.com/245490/pasta-unlikely-to-cause-weight-gain-as-part-of-a-healthy-diet/"]

list_counter = 0

p = Pool(10)  # process count
records = p.map(,list1[list_counter])  # argument required
p.terminate()
p.join()

print()
print('Total URLS:', len(list1), "- Starting Task...")
print()

for items in list1:

    try:

        scrape = requests.get(list1[list_counter],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)

        if scrape.status_code == 200:

            html = scrape.content
            soup = BeautifulSoup(html, 'html.parser')

            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """

            sleep(0.15)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + list1[list_counter],
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')

            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))

            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', list_counter, '-', list1[list_counter], '-', "Rank:", rank[0])

            list_counter = list_counter + 1

        else:
            print("Server Status:", scrape.status_code)
            list_counter = list_counter + 1
            pass

    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        list_counter = list_counter + 1
        pass

if len(exceptions) > 0:
    print("OUTPUT ERROR LOGS:", exceptions)
else:
    print("No Errors To Report")

#2
No, you need to have a callable object that you can pass to the process pool. Or, you can rewrite it so it only handles one url, then store the urls in a different file, and let your operating system handle the multiprocessing with something like cat urls.txt | parallel my_file.py {} (https://www.gnu.org/software/parallel/).
#3
You should also consider saving your URL list to a file rather than hard-coding it.
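
A minimal sketch of what that could look like, assuming one URL per line in a text file. The file name urls.txt and the helper name load_urls are made up for illustration; blank lines and lines starting with # are skipped.

```python
from pathlib import Path


def load_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


if __name__ == "__main__":
    url_file = Path("urls.txt")  # hypothetical file name
    if url_file.exists():
        list1 = load_urls(url_file)
        print("Total URLS:", len(list1))
```

The resulting list can be handed to the rest of the script in place of the hard-coded list1.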
#4
Hmmmm. I got it working.. but.. now I'm so confused..

How does it manage to pull "url" from list1? I was surprised this script actually works lol.

from multiprocessing import Lock, Pool
from time import sleep
from bs4 import BeautifulSoup
import re
import requests

exceptions = []
lock = Lock()


def scraper(url):

    """
    Testing multiprocessing and requests
    """
    lock.acquire()

    try:

        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)

        if scrape.status_code == 200:

            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """

            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')

            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))

            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', url, '-', "Rank:", rank[0])

        else:
            print("Server Status:", scrape.status_code)
            pass

    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        pass

    finally:
        lock.release()


if __name__ == '__main__':

    list1 = ["http://www.wallstreetinvestorplace.com/2018/04/cvs-health-corporation-cvs-to-touch-7-54-earnings-growth-for-next-year/",
             "https://macondaily.com/2018/04/06/cetera-advisors-llc-lowers-position-in-cvs-health-cvs.html",
             "http://www.thesportsbank.net/football/liverpool/jurgen-klopp-very-positive-about-mo-salah-injury/",
             "https://www.moneyjournals.com/trump-wasting-time-trying-bring-amazon/",
             "https://www.pmnewsnigeria.com/2018/04/06/fcta-targets-800000-children-for-polio-immunisation/",
             "http://toronto.citynews.ca/2018/04/06/officials-in-canada-braced-for-another-spike-in-illegal-border-crossings/",
             "https://www.pmnewsnigeria.com/2018/04/04/pdp-describes-looters-list-as-plot-to-divert-attention/",
             "https://beyondpesticides.org/dailynewsblog/2018/04/epa-administrator-pruitt-colluding-regulated-industry/",
             "http://thyblackman.com/2018/04/06/robert-mueller-is-searching-for/",
             "https://www.theroar.com.au/2018/04/06/2018-commonwealth-games-swimming-night-2-finals-live-updates-results-blog/",
             "https://medicalresearch.com/pain-research/migraine-linked-to-increased-risk-of-heart-disease-and-stroke/40858/",
             "http://www.investingbizz.com/2018/04/amazon-com-inc-amzn-stock-creates-investors-concerns/",
             "https://stocknewstimes.com/2018/04/06/convergence-investment-partners-llc-grows-position-in-amazon-com-inc-amzn.html",
             "https://factsherald.com/old-food-rules-needs-to-be-updated/",
             "https://www.nextadvisor.com/blog/2018/04/06/the-facebook-scandal-evolves/",
             "http://sacramento.cbslocal.com/2018/04/04/police-family-youtube-shooter/",
             "http://en.brinkwire.com/245768/why-does-stress-lead-to-weight-gain-study-sheds-light/",
             "https://www.marijuana.com/news/2018/04/monterey-bud-jeff-sessions-is-on-the-wrong-side-of-history-science-and-public-opinion/",
             "http://www.stocksgallery.com/2018/04/06/jpmorgan-chase-co-jpm-noted-a-price-change-of-0-80-and-amazon-com-inc-amzn-closes-with-a-move-of-2-92/",
             "https://stocknewstimes.com/2018/04/06/front-barnett-associates-llc-has-2-41-million-position-in-cvs-health-corp-cvs.html",
             "http://www.liveinsurancenews.com/colorado-mental-health-insurance-bill-to-help-consumers-navigate-the-system/",
             "http://newyork.cbslocal.com/2018/04/04/youtube-headquarters-shooting-suspect/",
             "https://ledgergazette.com/2018/04/06/liberty-interactive-co-series-a-liberty-ventures-lvnta-shares-bought-by-brandywine-global-investment-management-llc.html",
             "http://bangaloreweekly.com/2018-04-06-city-holding-co-invests-in-cvs-health-corporation-cvs-shares/",
             "https://www.thenewsguru.com/didnt-know-lawyer-paid-prostitute-130000-donald-trump/",
             "http://www.westlondonsport.com/chelsea/football-wls-conte-gives-two-main-reasons-chelseas-loss-tottenham",
             "https://registrarjournal.com/2018/04/06/amazon-com-inc-amzn-shares-bought-by-lenox-wealth-management-inc.html",
             "http://www.businessdayonline.com/1bn-eca-withdrawal-commence-action-president-buhari-pdp-tasks-nass/",
             "http://www.thesportsbank.net/football/manchester-united/pep-guardiola-asks-for-his-fans-help-vs-united-in-manchester-derby/",
             "https://www.pakistantoday.com.pk/2018/04/06/three-palestinians-martyred-as-new-clashes-erupt-along-gaza-border/",
             "http://www.nasdaqfortune.com/2018/04/06/risky-factor-of-cvs-health-corporation-cvs-is-observed-at-1-03/",
             "https://stocknewstimes.com/2018/04/06/cetera-advisor-networks-llc-decreases-position-in-cvs-health-cvs.html",
             "http://nasdaqjournal.com/index.php/2018/04/06/planet-fitness-inc-nyseplnt-do-analysts-think-you-should-buy/",
             "http://www.tv360nigeria.com/apc-to-hold-national-congress/",
             "https://www.pmnewsnigeria.com/2018/04/03/apc-governors-keep-sealed-lips-after-meeting-with-buhari/",
             "https://www.healththoroughfare.com/diet/healthy-lifestyle-best-foods-you-should-eat-for-weight-loss/7061",
             "https://stocknewstimes.com/2018/04/05/amazon-com-inc-amzn-shares-bought-by-west-oak-capital-llc.html",
             "http://www.current-movie-reviews.com/48428/dr-oz-could-you-be-a-victim-of-sexual-assault-while-on-vacation/",
             "https://www.brecorder.com/2018/04/07/410124/world-health-day-to-be-observed-on-april-7/",
             "http://www.coloradoindependent.com/169637/trump-pruitt-emissions-epa-pollution",
             "https://thecrimereport.org/2018/04/05/will-sessions-new-justice-strategy-turn-the-clock-back-on-civil-rights/",
             "http://en.brinkwire.com/245490/pasta-unlikely-to-cause-weight-gain-as-part-of-a-healthy-diet/"]

    p = Pool(10)
    p.map(scraper, list1)
    p.terminate()
    p.join()
#5
(Apr-10-2018, 06:40 PM)digitalmatic7 Wrote: p.map(scraper, list1)
map will call the function, scraper, once for each item in the iterable, list1. It might be a method of a Process Pool, but it works very similarly to map: https://docs.python.org/3/library/functions.html#map
#6
(Apr-10-2018, 06:55 PM)nilamo Wrote:
(Apr-10-2018, 06:40 PM)digitalmatic7 Wrote: p.map(scraper, list1)
map will call the function, scraper, once for each item in the iterable, list1. It might be a method of a Process Pool, but it works very similarly to map: https://docs.python.org/3/library/functions.html#map

Thanks for the help! I really, really appreciate it. I think I almost have a grasp on it now.

def scraper(url):
This is the last part I need some clarification on. url is just some name I made up, yet somehow it cycles through list1 items.

I don't really understand how that happens. Is map passing each individual list item into the scraper function, and then it gets named whatever I call it in the function brackets?
#7
When you call a function, you can pass it arguments. The function decides what names those arguments are bound to. Nothing outside the function needs to know that it's called "url"; as far as Pool.map is concerned, it's just an element of the list.
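
A small sketch to illustrate, using a made-up function called shout with a deliberately arbitrary parameter name. Both the built-in map and Pool.map simply call the function once per list element; the name inside the brackets only exists inside the function.

```python
from multiprocessing import Pool


def shout(whatever_name):
    """The parameter name is local to the function; callers never see it."""
    return whatever_name.upper()


if __name__ == "__main__":
    words = ["alpha", "beta", "gamma"]
    # Built-in map: calls shout(words[0]), shout(words[1]), ...
    print(list(map(shout, words)))
    # Pool.map makes the same calls, spread across worker processes
    with Pool(2) as pool:
        print(pool.map(shout, words))
```

Renaming whatever_name to url, or anything else, changes nothing for the caller.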
#8
I've run into issues getting a counter to work inside the scraper function.

I just need a very basic counter that increments for each URL (iteration) that is processed. I tried using a global variable and it didn't work: each individual process ends up with its own counter.

I tried passing the variable as an argument but couldn't get it to work.

What do you guys think? Is it even possible to have a counter work across multiple processes?

Code here: https://pastebin.com/qnRbdaC2
#9
You shouldn't use shared state in separate processes. The easy answer is to just tell each function call which number it is, something like:
class Counter:
    def __init__(self, start=0):
        self.value = start
    def inc(self, value=1):
        self.value += value
        return self.value

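count = Counter()
# Pool.map pickles the callable, and a lambda can't be pickled, so pair each
# element with its count up front and let starmap unpack the pairs.
p.starmap(scraper, [(elem, count.inc()) for elem in list1])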
That way you handle the incrementing before you hand things over to different processes.
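
A self-contained sketch of that idea, assuming Pool.starmap and a module-level worker (a Pool can only send picklable callables to its workers). The scraper here is a stand-in that only formats a string, and number_items is a hypothetical helper name.

```python
from multiprocessing import Pool


def number_items(items, start=1):
    """Pair every item with its position before any worker process starts."""
    return [(item, index) for index, item in enumerate(items, start=start)]


def scraper(url, count):
    """Stand-in worker; a real version would fetch the URL here."""
    return "%d - %s" % (count, url)


if __name__ == "__main__":
    urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
    with Pool(2) as pool:
        # starmap unpacks each (url, count) pair into scraper(url, count)
        for line in pool.starmap(scraper, number_items(urls)):
            print(line)
```

Because the numbers are assigned in the parent process, no state has to be shared between workers.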

If you actually want scraper to keep a global count (...you shouldn't), then you'd need to use some way for the processes to talk to each other, like a queue or multiprocessing.Value.
#10
I looked over some __init__ tutorials, but I still don't really understand what it does. How would I set up the Counter class within the scraper function?

I was playing around with multiprocessing manager (it seems to offer the functionality I need).. but I can't get it to work! Any idea where I'm going wrong?

from multiprocessing import Pool, Manager


def test(current_item, manager):

    counter = manager.value(+1)
    print(counter)

    print(current_item)


if __name__ == '__main__':

    list1 = ["item1",
             "item2",
             "item3",
             "item4",
             "item5",
             "item6",
             "item7",
             "item8",
             "item9",
             "item10",
             "item11",
             "item12"]

    manager = Manager()

    p = Pool(4)  # worker count
    p.map(test, list1)  # (function, iterable)
    p.terminate()
    p.join()