Python Forum
Need help Multiprocessing with BeautifulSoup
#1
I am trying to look through HTML for certain tags and, if a certain tag is found, have Python notify me as quickly as possible. This is my code so far:
import ast
import bs4 as bs
doc = open('C:/Users/Me/AppData/Local/Programs/Python/Python36/sample_requests.txt', 'r').readlines()
results_string = doc[0]
results_list = ast.literal_eval(results_string)
results = []
for i in results_list:  # This converts my list of strings to a list of bytes of html text.
    n_coded = i.encode()
    results.append(n_coded)
to_notify_list = []


def parseru(requested):
    soup = bs.BeautifulSoup(requested, 'lxml')
    tr_list = soup.find_all('tr')
    tr_list = (tr_list[3:])[:5]
    for tr in tr_list:
        if 'text I am searching for' in tr.text:  # check the row's text, not the Tag object itself
            to_notify_list.append(requested)


for i in results:
    parseru(i)
for i in to_notify_list:
    print(i)
I've experimented with multiprocessing and multiprocessing.dummy:
from multiprocessing.dummy import Pool
if __name__ == '__main__':
    pool = Pool(4)
    pool.map(parseru, results)
However, multiprocessing.dummy just makes the code run about twice as slowly.

I've also experimented with multiprocessing (without the dummy):
from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool(4)
    pool.map(parseru, results)
This just ends up running the function four times at the same time and almost crashes PyCharm each time (it freezes for several seconds). It also makes code outside of the if statement run multiple times; for instance, print calls 100 lines away run four times.

Well, I've been at this for hours, and I am starting to feel like the dummy. One version of the code iterates through the list at half speed, and the other runs through the whole list four times at once. What am I doing wrong?
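For what it's worth, the behaviour described above (code outside the if statement running multiple times) is what the spawn start method does on Windows: each worker process re-imports the script, so everything at module level runs again in every worker. Below is a minimal sketch of keeping all the setup under the guard, reusing the file path and search text from the code above; it is only an illustration of the layout, not a drop-in fix:

import ast
import bs4 as bs
from multiprocessing import Pool


def parseru(requested):
    # Return the page if the target text is in any of rows 4-8, else None.
    soup = bs.BeautifulSoup(requested, 'lxml')
    for tr in soup.find_all('tr')[3:8]:
        if 'text I am searching for' in tr.text:
            return requested
    return None


if __name__ == '__main__':
    # Only the parent process runs this block; spawned workers just re-import the defs above.
    doc = open('C:/Users/Me/AppData/Local/Programs/Python/Python36/sample_requests.txt', 'r').readlines()
    results = [s.encode() for s in ast.literal_eval(doc[0])]
    with Pool(4) as pool:
        to_notify_list = [r for r in pool.map(parseru, results) if r is not None]
    for item in to_notify_list:
        print(item)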
#2
line 22, syntax error??
#3
I fixed it, put back the 'in'. Sorry, that was an attempt from a while ago (a few hours) and I had to recreate it from memory.
#4
Figured out a solution. For anyone interested:
import math
import bs4 as bs
from multiprocessing import Pool
from multiprocessing import cpu_count
# results_list = list of requests.get(<url>).content items (HTML pages as bytes)
chunky_monkey = math.ceil(len(results_list) / cpu_count())
# chunky_monkey is a variable (that works on any PC and list size) used to evenly distribute chunks of equal size to the CPU cores. I came up with that myself :D
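# e.g. 1,000 pages on a 6-core machine: chunky_monkey = math.ceil(1000 / 6) = 167 items per chunk (hypothetical numbers)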


def parseru(requested):
    # parse_only1, date_string, and notified_list are defined elsewhere (not shown);
    # parse_only1 is presumably a bs.SoupStrainer so only the tags of interest get parsed
    soup = bs.BeautifulSoup(requested, 'lxml', parse_only=parse_only1)
    tr_list = soup.find_all('tr')
    tr_list = (tr_list[3:])[:10]  # keep rows 4 through 13
    for tr in tr_list:
        if date_string in tr.text:
            if '8-K' in tr.text:
                if requested not in notified_list:
                    return requested


if __name__ == '__main__':
    pool = Pool(cpu_count())
    fat_list = pool.map(func=parseru, iterable=results_list, chunksize=chunky_monkey)
    pool.close()
    pool.join()
    send_list = [x for x in fat_list if x is not None] # I couldn't figure out how to use global variables for multi-processes, so I just delete every returned value that's None
#5
(Jun-07-2018, 05:31 AM)HiImNew Wrote: Figured out a solution. For anyone interested:
# I couldn't figure out how to use global variables for multi-processes, so I just delete every returned value that's None
Check out this page on synchronization of resources among threads: http://effbot.org/zone/thread-synchronization.htm
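If shared state across worker processes is what you were after with the global variable, one option is a multiprocessing.Manager list. Here is a minimal sketch; the parseru body is simplified, and results_list is assumed to be the same list of page bytes as in your post above:

import bs4 as bs
from functools import partial
from multiprocessing import Pool, Manager, cpu_count


def parseru_shared(requested, shared_list):
    # Simplified parseru that appends matches to a list proxy shared across processes.
    soup = bs.BeautifulSoup(requested, 'lxml')
    for tr in soup.find_all('tr')[3:13]:
        if '8-K' in tr.text:
            shared_list.append(requested)
            break


if __name__ == '__main__':
    with Manager() as manager:
        to_notify = manager.list()  # proxy object; every worker appends to the same underlying list
        with Pool(cpu_count()) as pool:
            pool.map(partial(parseru_shared, shared_list=to_notify), results_list)
        send_list = list(to_notify)  # copy out before the manager process shuts down

That said, returning values from pool.map and filtering out the Nones, as you already do, works just as well and avoids the manager overhead.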

