Multi-Threaded Alexa Website Ranker Problem - All Threads Doing Same Task
#1
I've created a very simple bot, and I'm having trouble threading it.

It loads a list of URLs from links.csv (sample list here: https://pastebin.com/QiH3qpRD)

Then it scrapes the rank of each URL from the Alexa API.

The problem is that all the threads are handling the same URL. Can you guys help me figure out where I went wrong in my code?
Do I need to use "multiprocessing.Pool"?

from __future__ import print_function
import threading
import time
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# CREATE URL LIST FROM CSV

df = pd.read_csv('links.csv', header=0)  # df = dataframe
df.insert(1, 'Alexa Rank:', "")  # create new column

# GET URL TOTAL FROM CSV

url_total = len(df.index)

print()
print('Total URLS Loaded:', url_total, "- Task Starting...")
print()

url_total = len(df.index) - 1  # highest zero-based index in the list


def worker(Id):

    time.sleep(0.3)

    # COUNTER TO INCREMENT THROUGH URL_LIST
    list_counter = 0

    while list_counter <= url_total:

        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[list_counter, 0],
                              headers={"user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"})
        html = scrape.content
        soup = BeautifulSoup(html, 'lxml')

        rank = re.findall(r'<popularity[^>]*text="(\d+)"', str(soup))       # scrape alexa rank
        rank = rank[0]
        df.iloc[list_counter, 1] = rank                                     # add to dataframe

        print(u"\u2713", '-', list_counter, '-', df.iloc[list_counter, 0], '-', "Alexa Rank:", rank)

        list_counter = list_counter + 1


def main():

    threads = []
    for i in range(4):
        t = threading.Thread(target=worker, args=(i,))
        threads.append(t)
        t.start()
    print("Main has spawn all the threads")

    for t in threads:
        t.join()


if __name__ == '__main__':
    main()
***
EDIT: I think I've figured out what I was doing wrong. Concurrent instead of parallel. I'm playing with some new code.
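The actual mistake is visible in worker(): every thread runs its own loop from index 0 over the whole dataframe, so all four threads repeat the full list instead of splitting it. Here's a rough sketch of the direction I'm going: row indices go into a shared queue.Queue, and each thread pulls the next unclaimed index, so every URL is fetched exactly once. This is untested beyond my sample CSV; the 0.3 second sleep is just my guess at a polite delay, and it assumes the default RangeIndex that read_csv produces.

import queue
import re
import threading
import time

import pandas as pd
import requests

df = pd.read_csv('links.csv', header=0)
df.insert(1, 'Alexa Rank:', "")

# Put every row index on a shared queue; threads pull work
# from it instead of each looping over the whole list.
work = queue.Queue()
for i in range(len(df.index)):
    work.put(i)


def worker(thread_id):
    while True:
        try:
            i = work.get_nowait()  # claim the next unprocessed row
        except queue.Empty:
            return                 # queue drained: this thread is finished
        scrape = requests.get("http://data.alexa.com/data?cli=10&dat=s&url=" + df.iloc[i, 0],
                              headers={"user-agent": "Mozilla/5.0"})
        # Regex straight on the response text; BeautifulSoup wasn't needed for this.
        rank = re.findall(r'<popularity[^>]*text="(\d+)"', scrape.text)
        df.iloc[i, 1] = rank[0] if rank else ""  # guard against rows with no <popularity> tag
        print(thread_id, '-', i, '-', df.iloc[i, 0], '-', "Alexa Rank:", df.iloc[i, 1])
        time.sleep(0.3)            # small pause so the API isn't hammered
        work.task_done()


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

And to answer my own multiprocessing.Pool question: this job is network-bound, so threads (or concurrent.futures.ThreadPoolExecutor) should be enough; multiprocessing only really pays off when the work is CPU-bound.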
***