Problem With Simple Multiprocessing Script
#1
Is it possible to multiprocess this script without breaking it up into functions?

I'm trying to keep it as barebones and simple as possible.

# EXTREMELY SIMPLE SCRAPING SCRIPT
 
 
from time import sleep
from bs4 import BeautifulSoup
import re
import requests
from multiprocessing import Pool
 
 
exceptions = []
 
 
list_counter = 0
 
p = Pool(10)  # process count
records = p.map(, list1[list_counter])  # <-- a callable (function) argument is required here
p.terminate()
p.join()
 
print()
print('Total URLS:', len(list1), "- Starting Task...")
print()
 
for items in list1:
 
    try:
 
        scrape = requests.get(list1[list_counter],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
 
        if scrape.status_code == 200:
 
            html = scrape.content
            soup = BeautifulSoup(html, 'html.parser')
 
            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """
 
            sleep(0.15)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + list1[list_counter],
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
 
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
 
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', list_counter, '-', list1[list_counter], '-', "Rank:", rank[0])
 
            list_counter = list_counter + 1
 
        else:
            print("Server Status:", scrape.status_code)
            list_counter = list_counter + 1
            pass
 
    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        list_counter = list_counter + 1
        pass
 
if len(exceptions) > 0:
    print("OUTPUT ERROR LOGS:", exceptions)
else:
    print("No Errors To Report")
#2
No, you need to have a callable object that you can pass to the process pool. Or, you can rewrite it so it only handles one url, then store the urls in a different file, and let your operating system handle the multiprocessing with something like cat urls.txt | parallel my_file.py {} (https://www.gnu.org/software/parallel/).
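A rough sketch of that single-URL approach (a minimal my_file.py, trimmed down to just the request; the Alexa lookup and headers from your script would go in the same place):

# my_file.py -- handles exactly one URL, passed on the command line
import sys

import requests

url = sys.argv[1]
scrape = requests.get(url, timeout=10)
print(url, scrape.status_code)

GNU parallel then runs the script once per line of urls.txt, several processes at a time, so the Python side stays completely single-process.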
#3
You should also consider saving your URL list to a file, rather than hard-coding it.
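For example, with one URL per line in a plain text file (urls.txt is just an example name):

with open("urls.txt") as f:
    list1 = [line.strip() for line in f if line.strip()]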
#4
Hmmmm. I got it working... but now I'm so confused.

How does it manage to pull "url" from list1? I was surprised this script actually works lol.

from multiprocessing import Lock, Pool
from time import sleep
from bs4 import BeautifulSoup
import re
import requests
 
exceptions = []
lock = Lock()
 
 
def scraper(url):
 
    """
    Testing multiprocessing and requests
    """
    lock.acquire()
 
    try:
 
        scrape = requests.get(url,
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"},
                              timeout=10)
 
        if scrape.status_code == 200:
 
            """ --------------------------------------------- """
            # ---------------------------------------------------
            '''           --> SCRAPE ALEXA RANK: <--          '''
            # ---------------------------------------------------
            """ --------------------------------------------- """
 
            sleep(0.1)
            scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                                  headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
            html = scrape.content
            soup = BeautifulSoup(html, 'lxml')
 
            rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))
 
            print("Server Status:", scrape.status_code, '-', u"\u2713", '-', url, '-', "Rank:", rank[0])
 
        else:
            print("Server Status:", scrape.status_code)
            pass
 
    except BaseException as e:
        exceptions.append(e)
        print()
        print(e)
        print()
        pass
 
    finally:
        lock.release()
 
 
if __name__ == '__main__':
 
 
    p = Pool(10)
    p.map(scraper, list1)
    p.terminate()
    p.join()
#5
(Apr-10-2018, 06:40 PM)digitalmatic7 Wrote: p.map(scraper, list1)
map will call the function, scraper, once for each item in the iterable, list1. It happens to be a method of a process Pool here, but it works very similarly to the built-in map: https://docs.python.org/3/library/functions.html#map
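For example, with the built-in map:

def shout(word):
    return word.upper()

print(list(map(shout, ["a", "b", "c"])))  # ['A', 'B', 'C']

Pool.map does the same thing, except each call runs in one of the pool's worker processes instead of the main process.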
#6
(Apr-10-2018, 06:55 PM)nilamo Wrote: map will call the function, scraper, once for each item in the iterable, list1. [...]

Thanks for the help! I really, really appreciate it. I think I almost have a grasp on it now.

def scraper(url):
This is the last part I need some clarification on. url is just some name I made up, yet somehow it cycles through list1 items.

I don't really understand how that happens. Is map passing each individual list item into the scraper function, where it then gets named whatever I call it in the function's parentheses?
#7
When you call a function, you can pass it parameters. The function decides what variables those parameters are bound to, and what they're named. Nothing outside the function needs to know that it's called "url"; as far as Pool.map is concerned, it's just an element of the list.
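For example:

def scraper(url):
    # "url" is just the local name for whatever value was passed in
    print(url)

scraper("https://example.com")  # inside scraper, url == "https://example.com"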
#8
I've run into issues getting a counter to work inside the scraper function.

I just need a very basic counter that increments for each URL (iteration) that is processed. I tried using a global variable and it didn't work: it ends up assigning a separate counter to each individual process.

I tried passing the variable as an argument but couldn't get it to work.

What do you guys think? Is it even possible to have a counter work across multiple processes?

Code here: https://pastebin.com/qnRbdaC2
#9
You shouldn't use shared state in separate processes. The easy answer is to just tell each call which number it is, something like:
class Counter:
    def __init__(self, start=0):
        self.value = start

    def inc(self, value=1):
        self.value += value
        return self.value


count = Counter()
# Number each URL up front; a lambda can't be pickled for Pool.map,
# so build the (url, count) pairs first and hand them to starmap.
numbered = [(url, count.inc()) for url in list1]
p.starmap(scraper, numbered)  # calls scraper(url, count) once per pair
That way you handle the incrementing before you hand things over to the different processes.

If you actually want scraper to keep a global count (...you shouldn't), then you'd need to use some way for the processes to talk to each other, like a queue or multiprocessing.Value.
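A rough sketch of the Value approach (the names and the simplified scraper here are just for illustration):

from multiprocessing import Pool, Value

counter = None  # shared counter, handed to each worker by the initializer

def init_worker(shared):
    global counter
    counter = shared

def scraper(url):
    with counter.get_lock():   # the Value carries its own lock for safe increments
        counter.value += 1
        n = counter.value
    print(n, url)

if __name__ == '__main__':
    shared = Value('i', 0)     # 'i' = signed int, starting at 0
    urls = ["https://example.com", "https://example.org"]
    with Pool(4, initializer=init_worker, initargs=(shared,)) as p:
        p.map(scraper, urls)

The counter lives in shared memory, so every process increments the same number instead of keeping its own copy.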
#10
I looked over some __init__ tutorials, but I still don't really understand what it does. How would I set up the Counter class within the scraper function?

I was playing around with the multiprocessing Manager (it seems to offer the functionality I need), but I can't get it to work! Any idea where I'm going wrong?

from multiprocessing import Pool, Manager
 
 
def test(current_item, manager):
 
    counter = manager.value(+1)
    print(counter)
 
    print(current_item)
 
 
if __name__ == '__main__':
 
    list1 = ["item1",
             "item2",
             "item3",
             "item4",
             "item5",
             "item6",
             "item7",
             "item8",
             "item9",
             "item10",
             "item11",
             "item12"]
 
    manager = Manager()
 
    p = Pool(4)  # worker count
    p.map(test, list1)  # (function, iterable)
    p.terminate()
    p.join()