Feb-28-2018, 09:21 AM
(This post was last modified: Feb-28-2018, 09:21 AM by digitalmatic7.)
I've created a very simple bot, and I'm having trouble threading it.
It loads a list of URLs from links.csv (sample list here: https://pastebin.com/QiH3qpRD).
Then it scrapes the rank of each URL from the Alexa API.
The problem is that all the threads are handling the same URL. Can you guys help me figure out where I went wrong in my code?
![[Image: lA0Zf1_BQEGJWIIwSIHrJg.png]](https://image.prntscr.com/image/lA0Zf1_BQEGJWIIwSIHrJg.png)
Do I need to use "multiprocessing.Pool"?
Here's my code:

```python
from __future__ import print_function
import threading
import time  # was missing; needed for time.sleep() in worker()
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# CREATE URL LIST FROM CSV
df = pd.read_csv('links.csv', header=0)  # df = dataframe
df.insert(1, 'Alexa Rank:', "")  # create new column

# GET URL TOTAL FROM CSV
url_total = len(df.index)
print()
print('Total URLS Loaded:', url_total, "- Task Starting...")
print()
url_total = len(df.index) - 1  # last valid index into the URL list


def worker(Id):
    time.sleep(0.3)
    # COUNTER TO INCREMENT THROUGH URL_LIST
    list_counter = 0
    while list_counter <= url_total:
        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[list_counter, 0],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
        html = scrape.content
        soup = BeautifulSoup(html, 'lxml')
        rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))  # scrape alexa rank
        rank = rank[0]
        df.iloc[list_counter, 1] = rank  # add to dataframe
        print(u"\u2713", '-', list_counter, '-', df.iloc[list_counter, 0], '-', "Alexa Rank:", rank)
        list_counter = list_counter + 1


def main():
    threads = []
    for i in range(4):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()
    print("Main has spawned all the threads")
    for t in threads:
        t.join()


main()
```
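If multiprocessing.Pool is the way to go, I'm guessing it would look something like this? Just a sketch I put together, not tested (fetch_rank is a helper name I made up, and I'm applying the same regex from my code above straight to the response text):

```python
# Sketch only, not tested. Each worker process gets handed one URL by map(),
# so there's no shared counter for the workers to fight over.
import re
from multiprocessing import Pool

import pandas as pd
import requests

HEADERS = {"user-agent": "Mozilla/5.0"}


def fetch_rank(url):
    # one URL in, one rank (or None) out
    scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + url,
                          headers=HEADERS)
    found = re.findall(r'<popularity[^>]*text="(\d+)"', scrape.text)
    return found[0] if found else None


if __name__ == '__main__':
    df = pd.read_csv('links.csv', header=0)
    with Pool(4) as pool:
        # map() keeps the results in the same order as the input URLs
        df.insert(1, 'Alexa Rank:', pool.map(fetch_rank, df.iloc[:, 0]))
    print(df)
```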
EDIT: I think I've figured out what I was doing wrong. Concurrent instead of parallel. I'm playing with some new code.
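Something along these lines is what I'm playing with: a shared queue.Queue hands each thread its own row index, so no two threads ever grab the same URL. Rough sketch, Python 3, not tested yet:

```python
# Rough sketch, not tested yet. The queue holds one row index per URL;
# each thread claims the next unworked index, so work is never duplicated.
import queue
import re
import threading

import pandas as pd
import requests

df = pd.read_csv('links.csv', header=0)

q = queue.Queue()
for i in range(len(df.index)):
    q.put(i)  # one row index per task

ranks = [None] * len(df.index)  # each task writes to its own slot


def worker():
    while True:
        try:
            i = q.get_nowait()  # claim the next unworked index
        except queue.Empty:
            return  # queue drained, this thread is done
        scrape = requests.get(
            "http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[i, 0],
            headers={"user-agent": "Mozilla/5.0"})
        found = re.findall(r'<popularity[^>]*text="(\d+)"', scrape.text)
        ranks[i] = found[0] if found else None
        print(u"\u2713", '-', i, '-', df.iloc[i, 0], '-', "Alexa Rank:", ranks[i])


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

df.insert(1, 'Alexa Rank:', ranks)  # add the column once all threads finish
print(df)
```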