Thread: Parsing infor from scraped files. (/thread-17474.html)
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
Parsing infor from scraped files. - Larz60+ - Apr-12-2019

I am trying to get one simple bit of data from several thousand scraped files. I want to do this using concurrent.futures, but am having a bit of an issue. I created a sample which contains just 10 files for testing, and it looks like this:

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
    from pathlib import Path
    from bs4 import BeautifulSoup
    import os


    class Scrape3:
        def __init__(self):
            os.chdir(os.path.abspath(os.path.dirname(__file__)))
            filepath = Path('./html')
            citylist = [
                ['Andover'],
                ['Berlin'],
                ['Brooklyn'],
                ['Burlington'],
                ['Colchester'],
                ['Groton'],
                ['Hartland'],
                ['Kent'],
                ['Manchester'],
                ['Marlborough']
            ]
            for city in citylist:
                city.append(filepath / f'{city[0]}_page1.html')

            # for item in citylist:
            #     print(f'{item[0]}, {item[1].resolve()}')
            self.numpages = []
            # self.test_parse(citylist)
            self.get_numpages(citylist)
            print(f'numpages: {self.numpages}')

        def parse(self, city):
            with city[1].open('rb') as fp:
                page = fp.read()
            soup = BeautifulSoup(page, 'lxml')
            return [city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]

        def test_parse(self, citylist):
            for city in citylist:
                print(self.parse(city))

        def get_numpages(self, citylist):
            ex = ThreadPoolExecutor(max_workers=10)
            for city in citylist:
                wait_for = [
                    ex.submit(self.parse(city))
                ]
            for f in as_completed(wait_for):
                self.numpages.append(f.result)


    if __name__ == '__main__':
        Scrape3()

It all appears to function properly until it comes to what's getting stored in self.numpages. I expected a list of lists, each containing the city name and number of pages, but what I get is:

I am missing something. I don't know what's creating the TypeError. Does anybody know what that might be?

I tested the return statement (by running test_parse) and it does what it's supposed to do:

    [city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]
RE: Parsing infor from scraped files. - Yoriz - Apr-12-2019

Missing the () off f.result
RE: Parsing infor from scraped files. - Larz60+ - Apr-12-2019

I knew it was something stupid. Thanks!
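
For anyone reading later, here is a minimal sketch of the submit/result pattern discussed in this thread. It is not the original poster's final code. The `parse` function below is a hypothetical stand-in that skips the BeautifulSoup file parsing, and the page counts are made-up placeholder data. Besides calling `f.result()` with parentheses, note that `Executor.submit` expects the callable and its arguments separately (`ex.submit(parse, city)`). Written as `ex.submit(self.parse(city))`, the parse runs in the calling thread and its returned list is submitted as if it were a function, so `f.result()` would likely re-raise a TypeError from the worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse(city):
    # Hypothetical stand-in for Scrape3.parse: the real method opens an HTML
    # file and pulls the page count out with BeautifulSoup.
    name, fake_pages = city
    return [name, str(fake_pages)]


def get_numpages(citylist, max_workers=10):
    numpages = []
    # The context manager shuts the pool down once all submitted work is done.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Pass the callable and its argument separately so that parse runs
        # inside a worker thread rather than in the calling thread.
        wait_for = [ex.submit(parse, city) for city in citylist]
        for f in as_completed(wait_for):
            # Call result(); a bare f.result is only a bound method object.
            numpages.append(f.result())
    return numpages


if __name__ == '__main__':
    # Placeholder data: city names from the thread, page counts invented.
    cities = [['Andover', 3], ['Berlin', 5], ['Brooklyn', 2]]
    print(sorted(get_numpages(cities)))
```

Because `as_completed` yields futures in completion order, the results may come back in a different order than `citylist`. Sort them, or key them by city name, if order matters.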