Multi-Threaded Alexa Website Ranker Problem - All Threads Doing Same Task
#1
I've created a very simple bot, and I'm having trouble threading it.

It loads a list of URLs from links.csv (sample list here: https://pastebin.com/QiH3qpRD)

Then it scrapes the rank of each URL from the Alexa API.

The problem is that all the threads are handling the same URL. Can you guys help me figure out where I went wrong in my code?
Do I need to use "multiprocessing.Pool"?

from __future__ import print_function
import threading
import time
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# CREATE URL LIST FROM CSV

df = pd.read_csv('links.csv', header=0)  # df = dataframe
df.insert(1, 'Alexa Rank:', "")  # create new column

# GET URL TOTAL FROM CSV

url_total = len(df.index)

print()
print('Total URLS Loaded:', url_total, "- Task Starting...")
print()

url_total = len(df.index) - 1  # highest zero-based index in the list


def worker(Id):

    time.sleep(0.3)

    # COUNTER TO INCREMENT THROUGH URL_LIST
    list_counter = 0

    while list_counter <= url_total:

        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[list_counter, 0],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
        html = scrape.content
        soup = BeautifulSoup(html, 'lxml')

        rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))       # scrape alexa rank
        rank = rank[0]
        df.iloc[list_counter, 1] = rank                                     # add to dataframe

        print(u"\u2713", '-', list_counter, '-', df.iloc[list_counter, 0], '-', "Alexa Rank:", rank)

        list_counter = list_counter + 1


def main():

    threads = []
    for i in range(4):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()
    print("Main has spawn all the threads")

    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
***
EDIT: I think I've figured out what I was doing wrong. Concurrent instead of parallel. I'm playing with some new code.
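The actual mistake is visible in worker(): every thread runs its own loop from index 0 over the whole dataframe, so all four threads repeat the full list instead of splitting it. Here's a rough sketch of the direction I'm going: row indices go into a shared queue.Queue, and each thread pulls the next unclaimed index, so every URL is fetched exactly once. This is untested beyond my sample CSV; the 0.3 second sleep is just my guess at a polite delay, and it assumes the default RangeIndex that read_csv produces.

import queue
import re
import threading
import time

import pandas as pd
import requests

df = pd.read_csv('links.csv', header=0)
df.insert(1, 'Alexa Rank:', "")

# Put every row index on a shared queue; threads pull work
# from it instead of each looping over the whole list.
work = queue.Queue()
for i in range(len(df.index)):
    work.put(i)


def worker(thread_id):
    while True:
        try:
            i = work.get_nowait()  # claim the next unprocessed row
        except queue.Empty:
            return                 # queue drained: this thread is finished
        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[i, 0],
                              headers={"user-agent": "Mozilla/5.0"})
        # Regex straight on the response text; BeautifulSoup wasn't needed for this.
        rank = re.findall(r'<popularity[^>]*text="(\d+)"', scrape.text)
        df.iloc[i, 1] = rank[0] if rank else ""  # guard against rows with no <popularity> tag
        print(thread_id, '-', i, '-', df.iloc[i, 0], '-', "Alexa Rank:", df.iloc[i, 1])
        time.sleep(0.3)            # small pause so the API isn't hammered
        work.task_done()


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

And to answer my own multiprocessing.Pool question: this job is network-bound, so threads (or concurrent.futures.ThreadPoolExecutor) should be enough; multiprocessing only really pays off when the work is CPU-bound.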
***