Python Forum
Need help Multiprocessing with BeautifulSoup
I am trying to search HTML for certain tags and, if a certain tag is found, have Python notify me as quickly as possible. This is my code so far:
import ast
import bs4 as bs
# Read only the first line of the file; the with-block closes the file afterwards.
with open('C:/Users/Me/AppData/Local/Programs/Python/Python36/sample_requests.txt', 'r') as f:
    results_string = f.readline()
results_list = ast.literal_eval(results_string)
results = []
for i in results_list:  # This converts my list of strings to a list of bytes of html text.
    n_coded = i.encode()
    results.append(n_coded)
to_notify_list = []


def parseru(requested):
    soup = bs.BeautifulSoup(requested, 'lxml')
    tr_list = soup.find_all('tr')
    tr_list = tr_list[3:8]  # skip the first three rows, keep the next five
    for tr in tr_list:
        # `in tr` only checks the tag's direct children, so test the row's text instead.
        if 'text I am searching for' in tr.get_text():
            to_notify_list.append(requested)


for i in results:
    parseru(i)
for i in to_notify_list:
    print(i)
I've experimented with multiprocessing and multiprocessing.dummy:
from multiprocessing.dummy import Pool
if __name__ == '__main__':
    pool = Pool(4)
    pool.map(parseru, results)
However, multiprocessing.dummy just makes the code run twice as slow.

I've also experimented with multiprocessing (without the dummy):
from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool(4)
    pool.map(parseru, results)
This just ends up running the function four times at the same time and almost crashes PyCharm each time (it freezes for several seconds). It also makes code outside of the if statement run multiple times. For instance, print calls 100 lines away run four times.

Well, I've been at this for hours, and I am starting to feel like the dummy. One of the versions of code iterates through the list twice as slow, and the other runs through the list four times at once. What am I doing wrong?
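
A minimal sketch of one likely cause, under stated assumptions. On Windows, multiprocessing starts each worker by importing the main script again, so any code outside the if __name__ == '__main__' guard runs once per worker. Each worker also has its own copy of to_notify_list, so items appended inside a worker never reach the parent process. multiprocessing.dummy uses threads, which usually do not speed up CPU-bound parsing. The sketch returns results from the worker instead of appending to a global. check_page, find_matches, TARGET, and the sample data are hypothetical names for illustration, and the substring test is only a stand-in for the BeautifulSoup parsing in parseru.

```python
from multiprocessing import Pool

TARGET = b'text I am searching for'  # placeholder search text (assumption)


def check_page(page):
    """Return the page if it contains TARGET, otherwise None.

    Stand-in for the BeautifulSoup parsing in parseru(); the point is that
    the worker returns its result instead of appending to a global list.
    """
    return page if TARGET in page else None


def find_matches(pages, workers=4):
    # Each worker process has its own copy of the module globals, so results
    # have to travel back through the return value of pool.map().
    with Pool(workers) as pool:
        checked = pool.map(check_page, pages)
    return [page for page in checked if page is not None]


if __name__ == '__main__':
    # Hypothetical sample data; the real script would load `results` here,
    # inside the guard, so that workers do not repeat the loading step.
    sample = [b'<tr><td>nothing</td></tr>',
              b'<tr><td>text I am searching for</td></tr>']
    for match in find_matches(sample, workers=2):
        print(match)
```

Keeping the file loading, the pool, and the final print loop inside the guard means a worker that re-imports the script only defines the functions and does nothing else.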
Messages In This Thread
Need help Multiprocessing with BeautifulSoup - by HiImNew - Jun-06-2018, 04:00 AM
