Python Forum
web scraper using pathlib
#1
Ok, since DeaD_EyE introduced me to pathlib last night, I've been playing with it to get familiar with
its capabilities.
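For anyone else new to pathlib, here's a quick standalone tour of the features the script below relies on (the directory and file names are just illustrative, created in a temp dir):

```python
# Demo of the pathlib operations used in the scraper:
# / joins, mkdir, .name, write_bytes/read_text, and glob.
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / 'data'                # join paths with the / operator
    data_dir.mkdir(exist_ok=True, parents=True)  # no error if it already exists

    zip_fname = data_dir / 'tl_2017_01001_addr.zip'
    print(zip_fname.name)                        # final path component

    note = data_dir / 'note.txt'
    note.write_bytes(b'hello')                   # write bytes without open()
    print(note.read_text())                      # read contents back as text

    print([p.name for p in data_dir.glob('*.txt')])  # pattern matching
```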

I wanted to get a bunch of files (3,220 to be exact) from the US Census TIGER files.
If you run this code, you'll probably want to stop it after a bit: the delays I've
inserted (so as not to abuse downloads) alone add up to almost an hour.
I added a stop_after attribute, set to 10, to stop after 10 files. Set it to None if you actually have
a use for the data, and it will fetch everything.

from pathlib import Path
from shutil import unpack_archive
import requests
from bs4 import BeautifulSoup
from time import sleep


class TryThis:
    def __init__(self):
        self.debug = False
        self.stop_after = 10
        self.data_main_url = 'https://www2.census.gov/geo/tiger/TIGER2017/ADDR/'
        self.data_url = 'https://www2.census.gov/geo/tiger/TIGER2017/ADDR/tl_2017_01001_addr.zip'
        self.filelist = None
        self.homepath = Path('.')
        self.data_dir = self.homepath / 'data'
        self.data_dir.mkdir(exist_ok=True, parents=True)
        self.soup_index_fname = self.data_dir / 'index.html'
        self.resp = None
        self.get_main_page()
        self.get_files()

    # TODO: cache the index page to self.soup_index_fname so we're not banging on the website
    def get_main_page(self):
        """
        Extract filenames from download page
        :return: None
        """
        self.filelist = []
        self.resp = requests.get(self.data_main_url)
        soup = BeautifulSoup(self.resp.content, 'lxml')
        selection = soup.select('a')
        links = [pt.get_text() for pt in selection]
        for link in links:
            link = link.strip()
            if link.startswith('tl_'):
                self.filelist.append(link)
        print(f'{len(self.filelist)} files to download')

    def get_files(self):
        """
        Get all zip files from filelist, and extract on the fly
        :return: None
        """
        filesdownloaded = 0
        for filename in self.filelist:
            self.data_url = f'{self.data_main_url}{filename}'
            print(self.data_url)
            self.resp = requests.get(self.data_url)
            self.save(filename)
            self.zip_fname = self.data_dir.joinpath(Path(self.data_url).name)
            print(f'self.zip_fname: {self.zip_fname}')
            self.unpack()
            filesdownloaded += 1
            if self.stop_after:
                if filesdownloaded >= self.stop_after:
                    break
            sleep(1)

    def save(self, filename):
        """
        Saves each zip file
        :param filename:
        :return: None
        """
        self.zip_fname = self.data_dir / filename
        self.zip_fname.write_bytes(self.resp.content)

    def unpack(self):
        """
        Unpack current file
        :return: None
        """
        unpack_archive(str(self.zip_fname), extract_dir=str(self.data_dir))
        for fpath in self.data_dir.glob('*.txt'):
            print(fpath, fpath.read_text())


if __name__ == '__main__':
    tt = TryThis()
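Since unpack() leans on shutil.unpack_archive, here's it working in isolation, with a zip built on the fly so there's no network involved (file names are illustrative):

```python
# Standalone demo of shutil.unpack_archive as used in unpack() above.
from pathlib import Path
from shutil import make_archive, unpack_archive
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src'
    src.mkdir()
    (src / 'addr.txt').write_text('123 Main St')

    # Build a zip to stand in for a downloaded TIGER archive.
    archive = make_archive(str(Path(tmp) / 'bundle'), 'zip', root_dir=src)

    out = Path(tmp) / 'data'
    out.mkdir()
    unpack_archive(archive, extract_dir=out)   # extract everything into out/
    for fpath in out.glob('*.txt'):
        print(fpath.name, fpath.read_text())
```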
#2
Please note: there is a download limit of 10 files per session; see https://ask.census.gov/prweb/PRServletCu.../!STANDARD
I found out the hard way (for the second time, shame on me).
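Given that limit, keeping stop_after at 10 or below is the safe default. If you want an explicit guard independent of the class, a minimal sketch (MAX_PER_SESSION and capped are my names, not anything from the census site or the script above):

```python
# Cap how many filenames are handed to the downloader per session,
# so the census site's 10-file limit is never exceeded.
MAX_PER_SESSION = 10

def capped(filenames, limit=MAX_PER_SESSION):
    """Yield at most `limit` filenames, then stop."""
    for count, name in enumerate(filenames, start=1):
        if count > limit:
            break
        yield name

# Feed only the first 10 of however many files the index page lists.
sample = [f'tl_2017_{n:05}_addr.zip' for n in range(1, 15)]
print(list(capped(sample)))
```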