Python Forum
Parsing info from scraped files.
#1
I am trying to extract one simple piece of data from several thousand scraped files.
I want to do this using concurrent.futures, but am running into an issue.
I created a sample containing just 10 files for testing, and it looks like this:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from bs4 import BeautifulSoup
import os


class Scrape3:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        filepath = Path('./html')
        citylist = [
            ['Andover'],
            ['Berlin'],
            ['Brooklyn'],
            ['Burlington'], 
            ['Colchester'],
            ['Groton'],
            ['Hartland'],
            ['Kent'],
            ['Manchester'],
            ['Marlborough']
        ]

        for city in citylist:
            city.append(filepath / f'{city[0]}_page1.html')

        # for item in citylist:
        #     print(f'{item[0]}, {item[1].resolve()}')

        self.numpages = []
        # self.test_parse(citylist)
        self.get_numpages(citylist)

        print(f'numpages: {self.numpages}')

    def parse(self, city):
        with city[1].open('rb') as fp:
            page = fp.read()
        soup = BeautifulSoup(page, 'lxml')
        return [city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]

    def test_parse(self, citylist):
        for city in citylist:
            print(self.parse(city))

    def get_numpages(self, citylist):
        ex = ThreadPoolExecutor(max_workers=10)

        for city in citylist:
            wait_for = [
                ex.submit(self.parse(city))
            ]

            for f in as_completed(wait_for):
                self.numpages.append(f.result)

if __name__ == '__main__':
    Scrape3()
It all appears to function properly until it comes to what's getting stored in self.numpages.
I expected a list of lists, each containing city name and number of pages, but what I get is:
Output:
numpages: [<bound method Future.result of <Future at 0x7f23423502e8 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f23413fe390 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340b930b8 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340320e80 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f23402fe278 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f2340bbbf60 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234029d240 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234022e198 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234029dcc0 state=finished raised TypeError>>, <bound method Future.result of <Future at 0x7f234034f0f0 state=finished raised TypeError>>]
I am missing something. Don't know what's creating the TypeError. Anybody know what that might be?

I tested the return statement (by running test_parse) and it does what it's supposed to do:
[city[0], str(soup.find('span', {'class': "paginate-info"}).text.split()[2])]
Output:
['Andover', '18']
['Berlin', '91']
['Brooklyn', '76']
['Burlington', '59']
['Colchester', '77']
['Groton', '92']
['Hartland', '1']
['Kent', '23']
['Manchester', '278']
['Marlborough', '39']
#2
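You're missing the () on f.result, so you are appending the bound method itself rather than its return value. Use f.result().
The TypeError comes from ex.submit(self.parse(city)): that calls parse() immediately and submits the list it returns as if it were the callable. Pass the function and its argument separately: ex.submit(self.parse, city).
Also build wait_for for all cities before looping over as_completed; submitting and waiting inside the same for loop handles the files one at a time.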
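Here is a minimal sketch of how get_numpages could look with those points applied. To keep it runnable without the HTML files, parse() here is a hypothetical stand-in that splits a plain string instead of using BeautifulSoup, and the sample text ("1 of 18") is invented for illustration, not taken from the real pages.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse(city):
    # Hypothetical stand-in for the BeautifulSoup-based parse() in post #1:
    # city is [name, text], and the page count is the third word of the text.
    return [city[0], city[1].split()[2]]


def get_numpages(citylist, max_workers=10):
    numpages = []
    # The with-block shuts the pool down once all submitted work is finished.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Pass the callable and its argument separately; submit(parse(city))
        # would run parse() right away and hand its returned list to the pool.
        wait_for = [ex.submit(parse, city) for city in citylist]
        for f in as_completed(wait_for):
            # Call result(); a bare f.result is only the bound method.
            numpages.append(f.result())
    return numpages


# Invented sample data standing in for the scraped files.
sample = [['Andover', '1 of 18'], ['Berlin', '1 of 91']]
# as_completed yields futures in completion order, so sort for a stable display.
print(sorted(get_numpages(sample)))
```

Note that f.result() re-raises any exception raised inside the worker, so a bad file will surface as an error at that line rather than being silently stored.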
#3
I knew it was something stupid. Thanks!