Tkinter Web Scraping w/Multithreading Question.... - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Tkinter Web Scraping w/Multithreading Question.... (/thread-38983.html) |
Tkinter Web Scraping w/Multithreading Question.... - AaronCatolico1 - Dec-15-2022 I have a simple program in Tkinter that I'm using to scrape the age of various people from wikipedia just for webscraping practice. I am able to scrape the age of each person one-by-one on one thread, but I'm trying to have one thread for each person to handle scraping their ages all at the same time so that the program will be much faster. So, in other words, the program currently scrapes only 1 person at a time and can only return 1 row at a time in the Treeview, but I'd like to have it to where a thread works for each person at the same time (concurrently) so that the Treeview will return each person's age in one shot as well. Here's the code that I've come up with so far: from tkinter import Tk, Button, Listbox from tkinter import ttk import threading import requests import re #imports RegEx class MainWindow(Tk): def __init__(self): super().__init__() self.option_add("*Font", "poppins, 11 bold") self.lb1 = Listbox(self, width=22, cursor='hand2') self.lb1.pack(side='left', fill='y', padx=20, pady=20) #create list of names names = ['Adam Levine', 'Arnold Schwarzenegger', 'Artur Beterbiev', 'Chris Hemsworth', 'Dan Henderson', 'Dustin Poirier', 'Fedor Emelianenko', 'Gennady Golovkin', 'Igor Vovchanchyn', 'Ken Shamrock', 'Mirko Cro Cop', 'Oleksandr Usyk', 'Ronnie Coleman', 'Vasiliy Lomachenko'] #populate listbox with names for name in names: self.lb1.insert('end', name) self.tv1 = ttk.Treeview(self, show='tree headings', cursor='hand2') columns = ('NAME', 'AGE') self.tv1.config(columns = columns) style = ttk.Style() style.configure("Treeview", highlightthickness=2, bd=0, rowheight=26,font=('Poppins', 11)) # Modify the font of the body style.configure("Treeview.Heading", font=('Poppins', 12, 'bold')) # Modify the font of the headings #configure headers self.tv1.column('#0', width=0, stretch=0) self.tv1.column('NAME', width=190) self.tv1.column('AGE', width=80, stretch=0) #define headings self.tv1.heading('NAME', text='NAME', anchor='w') self.tv1.heading('AGE', text='AGE', anchor='w') self.tv1.pack(fill='both', expand=1, padx=(0, 20), pady=20) #create start button self.b1 = Button(self, text='START', bg='green', fg='white', cursor='hand2', command=self.start) self.b1.pack(pady=(0, 20)) #scrape data from WikiPedia.org def start(self): for item in self.tv1.get_children(): self.tv1.delete(item) t1 = threading.Thread(target=self.scrape_wiki, daemon=True) t1.start() def scrape_wiki(self): for i in range(self.lb1.size()): #select the name self.name = self.lb1.get(i).replace(' ', '_') # create a simple dictionary to hold the user agent inside of the headers headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'} # the required first parameter of the 'get' method is the 'url': r = requests.get('https://en.wikipedia.org/wiki/' + self.name, headers=headers) # regex setup regex = re.search('(age \d+)', r.text) age = regex.group(0).replace('age ', '').replace(')', '') # Populate Treeview with row data self.name = self.name.replace('_', ' ') self.tv1.insert(parent='', index='end', iid=i, values=(self.name, age)) if __name__ == '__main__': app = MainWindow() #app.iconbitmap('imgs/logo-icon.ico') app.title('Main Window') app.configure(bg='#333333') #center the Main Window: w = 600 # Width h = 520 # Height screen_width = app.winfo_screenwidth() # Width of the screen screen_height = app.winfo_screenheight() # Height of the screen # Calculate Starting X and Y coordinates for Window x = (screen_width / 2) - (w / 2) y = (screen_height / 2) - (h / 2) app.mainloop()So, how exactly do I create multiple threads to handle each person in the list and return them concurrently/simultaneously instead of row-by-row, one at a time? I'd appreciate any support. RE: Tkinter Web Scraping w/Multithreading Question.... - deanhystad - Dec-15-2022 Take a look at ProcessPoolExecutor. https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor RE: Tkinter Web Scraping w/Multithreading Question.... - AaronCatolico1 - Dec-15-2022 Hi Dean, and thank you for your reply. So, I've already tried the ThreadPoolExector from concurrent.futures class, but it was not compatible with updating Tkinter's GUI. Is ProcessPoolExecutor going to be any different? I'm currently away from home, but I'll give it a shot once I return if you have tested this class & have concluded that it will indeed work with Tkinter's GUI.. Anyhow, thanks again in the meantime. 👍👍 (Dec-15-2022, 06:52 PM)deanhystad Wrote: Take a look at ProcessPoolExecutor. RE: Tkinter Web Scraping w/Multithreading Question.... - deanhystad - Dec-16-2022 After my post I gave ProcessPoolExecuter a better look. I thought you could launch all the processes and use futures to get the results when available. Unfortunately, it looks like the executor blocks until all processes are completed, or at least started. Have you tried Asnyc IO? I think async io works great, but it is invasive. You only want to call one function asynchronously and before you know it you have async def sprinkled everywhere. Here's an idea that uses threads, a list of queue objects, and the tkinter after() funciton. import tkinter as tk from threading import Thread from queue import Queue from time import sleep from random import randint import functools class TkinterThreadManager: """A thread pool manager that maintains function call order and doesn't block tkinter """ def __init__(self, window, period=100): self.queue = [] self.window = window self.period = period def submit(self, func, args=None, callback=None): """Add thread to the pool func: Function executed in thread args: Optional arguments passed to function. callback: Optional function that is called using return value """ if callback is None: Thread(target=func, args=args).start() else: q = Queue(maxsize=1) q.callback = callback Thread(target=func, args=(q,)+args).start() self.queue.append(q) self.join() def join(self): """Periodically check for thread completion. Execute callback.""" while self.queue and self.queue[0].full(): q = self.queue.pop(0) q.callback(q.get()) if self.queue: self.window.after(self.period, self.join) def thread_func(func): """TkinterThreadManager function wrapper. Puts function return value in Queue to notify manager of thread completion. """ @functools.wraps(func) def wrapper(q, *args, **kwargs): value = func(*args, **kwargs) print(value) # For demonstration purposes q.put(value) return wrapper @thread_func def get_name(name, delay): """Dummy function to execute via thread""" sleep(randint(0, delay)) return name class MainWindow(tk.Tk): """Window to demonstrate TkinterThreadManager.""" def __init__(self): super().__init__() self.threads = TkinterThreadManager(self) self.lb1 = tk.Listbox(self, width=22, height=30) self.lb1.pack(side='left', fill='y', padx=10, pady=10) button = tk.Button(self, text='Letters', command=self.letters) button.pack(padx=10, pady=10) button = tk.Button(self, text='Numbers', command=self.numbers) button.pack(padx=10, pady=10) def letters(self): """Add some letters to the listbox""" for delay, name in enumerate(('A', 'B', 'C')): self.threads.submit( func=get_name, args=(name, delay), callback=lambda x: self.lb1.insert(tk.END, x)) def numbers(self): """Add some numbers to the listbox""" for delay, name in enumerate((1, 2, 3)): self.threads.submit( func=get_name, args=(name, delay), callback=lambda x: self.lb1.insert(tk.END, x)) if __name__ == '__main__': MainWindow().mainloop() |