Python Forum
scraping multiple pages of a website.
#6
Ok,

Couldn't resist writing this one:
This code can be run by itself, or imported into another module.
Once it has been run, a class that wants to use the index only needs to load the JSON file into
a dictionary (see testit).
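
A minimal sketch of that consumer side, assuming the index file has already been written and has the layout {page letter: {agency name: url}}. The class name AgencyLookup and its url_for method are hypothetical and only here for illustration. It also assumes every name is stored under its own first letter, which holds for the 'c' output further down but is not checked for the other pages.

```python
import json
from pathlib import Path


class AgencyLookup:
    """Hypothetical consumer class: loads the saved index into a dictionary."""

    def __init__(self, index_file):
        # index_file is assumed to be the FedIndex.json written by the
        # index builder (layout: {page letter: {agency name: url}}).
        with Path(index_file).open() as fp:
            self.fed_index = json.load(fp)

    def url_for(self, name):
        # Assumption: an entry is stored under the lower-cased first
        # letter of its name. Raises KeyError if the name is unknown.
        return self.fed_index[name[0].lower()][name]
```

Usage would be something like AgencyLookup('../data/json/FedIndex.json').url_for('Coast Guard') when run from the src directory.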

Create a project directory and a src directory:
mkdir FederalAgencies
cd FederalAgencies
mkdir src

Add an empty __init__.py file to the FederalAgencies directory and another one to the src directory.
Once both modules below are saved, the layout looks like this:
FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py

Save this in the src directory as FederalPaths.py:
from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure the start path is properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        # Created on first run: ../data and ../data/json
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)

        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        # One index page per letter; the string omits k, q, x, y and z
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        # Change the working directory to the one holding this file,
        # so the relative paths above resolve the same way on every run
        os.chdir(Path(__file__).resolve().parent)

def testit():
    FederalPaths()

if __name__ == '__main__':
    testit()
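
FederalPaths relies on os.chdir so that its relative paths resolve next to the source file. If changing the working directory of the whole process is not wanted, a possible alternative (a sketch only, not part of the module above) is to anchor every path to a directory that is passed in. The function name build_paths and the returned dictionary are hypothetical.

```python
from pathlib import Path


def build_paths(src_dir):
    """Return the same data paths, anchored to src_dir, without os.chdir."""
    src_dir = Path(src_dir).resolve()
    rootpath = src_dir.parent                    # project root, one level above src
    outpath = rootpath / 'data' / 'json'
    outpath.mkdir(parents=True, exist_ok=True)   # creates data/ and data/json/
    return {
        'rootpath': rootpath,
        'datapath': rootpath / 'data',
        'outpath': outpath,
        'fed_index_file': outpath / 'FedIndex.json',
    }
```

Inside a module in src this could be called as build_paths(Path(__file__).parent).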
Save this one in the src directory as BuildFederalAgencyIndex.py:
import FederalPaths
import requests
from bs4 import BeautifulSoup
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        # Layout: {page letter: {agency name: url}}
        self.fed_index = {}
        self.valid_pages = self.fpath.valid_pages

        self.build_index()

    def build_index(self):
        # One index page per letter, e.g. .../federal-agencies/c
        for alpha in self.valid_pages:
            url = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(url)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"})
                links = ulist.find_all('a')
                for link in links:
                    suffix = link.get('href')
                    # gov_urlbase ends with '/' and the hrefs start with '/',
                    # so the stored URL contains '//' (see the output below)
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except Exception as err:
                # Report the failure and carry on with the next letter
                print(f'error on page {alpha!r}: {err!r}')

        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()

    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)

    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')

    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")

if __name__ == '__main__':
    testit()
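
The index builder joins gov_urlbase and each href by plain concatenation, which is why the stored URLs come out as https://www.usa.gov//... with a doubled slash. One way to avoid that, sketched here with the standard library's urljoin, is shown below. The helper name absolute_url and the GOV_URLBASE constant are made up for the example; they are not part of the modules above.

```python
from urllib.parse import urljoin

# Same value as FederalPaths.gov_urlbase
GOV_URLBASE = 'https://www.usa.gov/'


def absolute_url(suffix, base=GOV_URLBASE):
    """Join a scraped href onto the site base without doubling the slash."""
    # urljoin treats a suffix that starts with '/' as a path from the site
    # root, and returns an already absolute href unchanged.
    return urljoin(base, suffix)
```

In build_index this would replace the f-string that builds href, at the cost of changing the URLs already stored in an existing FedIndex.json.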
Test run:
cd FederalAgencies/src
python BuildFederalAgencyIndex.py
This creates the JSON file (in the data/json directory) and prints all of the 'c' entries.
The directories are created the first time it is run.
Results:
Output:
name: California, url: https://www.usa.gov//state-government/california
name: Capitol Police, url: https://www.usa.gov//federal-agencies/u-s-capitol-police
name: Capitol Visitor Center, url: https://www.usa.gov//federal-agencies/u-s-capitol-visitor-center
name: Career, Technical, and Adult Education, Office of, url: https://www.usa.gov//federal-agencies/office-of-career-technical-and-adult-education
name: Census Bureau, url: https://www.usa.gov//federal-agencies/u-s-census-bureau
name: Center for Food Safety and Applied Nutrition, url: https://www.usa.gov//federal-agencies/center-for-food-safety-and-applied-nutrition
name: Center for Nutrition Policy and Promotion (CNPP), url: https://www.usa.gov//federal-agencies/center-for-nutrition-policy-and-promotion
name: Centers for Disease Control and Prevention (CDC), url: https://www.usa.gov//federal-agencies/centers-for-disease-control-and-prevention
name: Centers for Medicare and Medicaid Services (CMS), url: https://www.usa.gov//federal-agencies/centers-for-medicare-and-medicaid-services
name: Central Command (CENTCOM), url: https://www.usa.gov//federal-agencies/u-s-central-command
name: Central Intelligence Agency (CIA), url: https://www.usa.gov//federal-agencies/central-intelligence-agency
name: Chemical Safety Board, url: https://www.usa.gov//federal-agencies/u-s-chemical-safety-board
name: Chief Acquisition Officers Council, url: https://www.usa.gov//federal-agencies/chief-acquisition-officers-council
name: Chief Financial Officers Council, url: https://www.usa.gov//federal-agencies/chief-financial-officers-council
name: Chief Human Capital Officers Council, url: https://www.usa.gov//federal-agencies/chief-human-capital-officers-council
name: Chief Information Officers Council, url: https://www.usa.gov//federal-agencies/chief-information-officers-council
name: Child Support Enforcement, Office of (OCSE), url: https://www.usa.gov//federal-agencies/office-of-child-support-enforcement
name: Circuit Courts of Appeal, url: https://www.usa.gov//federal-agencies/u-s-courts-of-appeal
name: Citizens' Stamp Advisory Committee, url: https://www.usa.gov//federal-agencies/citizens-stamp-advisory-committee
name: Citizenship and Immigration Services (USCIS), url: https://www.usa.gov//federal-agencies/u-s-citizenship-and-immigration-services
name: Civil Rights, Department of Education Office of, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-education
name: Civil Rights, Department of Health and Human Services Office for, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-health-and-human-services
name: Coast Guard, url: https://www.usa.gov//federal-agencies/u-s-coast-guard
name: Colorado, url: https://www.usa.gov//state-government/colorado
name: Commerce Department (DOC), url: https://www.usa.gov//federal-agencies/u-s-department-of-commerce
name: Commission of Fine Arts, url: https://www.usa.gov//federal-agencies/u-s-commission-of-fine-arts
name: Commission on Civil Rights, url: https://www.usa.gov//federal-agencies/commission-on-civil-rights
name: Commission on International Religious Freedom, url: https://www.usa.gov//federal-agencies/u-s-commission-on-international-religious-freedom
name: Commission on Presidential Scholars, url: https://www.usa.gov//federal-agencies/commission-on-presidential-scholars
name: Commission on Security and Cooperation in Europe (Helsinki Commission), url: https://www.usa.gov//federal-agencies/commission-on-security-and-cooperation-in-europe-helsinki-commission
name: Committee for the Implementation of Textile Agreements, url: https://www.usa.gov//federal-agencies/committee-for-the-implementation-of-textile-agreements
name: Committee on Foreign Investment in the United States, url: https://www.usa.gov//federal-agencies/committee-on-foreign-investment-in-the-united-states
name: Commodity Futures Trading Commission (CFTC), url: https://www.usa.gov//federal-agencies/u-s-commodity-futures-trading-commission
name: Community Oriented Policing Services (COPS), url: https://www.usa.gov//federal-agencies/community-oriented-policing-services
name: Community Planning and Development, url: https://www.usa.gov//federal-agencies/office-of-community-planning-and-development
name: Compliance, Office of, url: https://www.usa.gov//federal-agencies/office-of-compliance
name: Comptroller of the Currency, Office of (OCC), url: https://www.usa.gov//federal-agencies/office-of-the-comptroller-of-the-currency
name: Computer Emergency Readiness Team (US CERT), url: https://www.usa.gov//federal-agencies/computer-emergency-readiness-team
name: Congress—U.S. House of Representatives, url: https://www.usa.gov//federal-agencies/u-s-house-of-representatives
name: Congress—U.S. Senate, url: https://www.usa.gov//federal-agencies/u-s-senate
name: Congressional Budget Office (CBO), url: https://www.usa.gov//federal-agencies/congressional-budget-office
name: Congressional Research Service, url: https://www.usa.gov//federal-agencies/congressional-research-service
name: Connecticut, url: https://www.usa.gov//state-government/connecticut
name: Consular Affairs, Bureau of, url: https://www.usa.gov//federal-agencies/bureau-of-consular-affairs
name: Consumer Financial Protection Bureau, url: https://www.usa.gov//federal-agencies/consumer-financial-protection-bureau
name: Consumer Product Safety Commission (CPSC), url: https://www.usa.gov//federal-agencies/consumer-product-safety-commission
name: Coordinating Council on Juvenile Justice and Delinquency Prevention, url: https://www.usa.gov//federal-agencies/coordinating-council-on-juvenile-justice-and-delinquency-prevention
name: Copyright Office, url: https://www.usa.gov//federal-agencies/copyright-office
name: Corporation for National and Community Service, url: https://www.usa.gov//federal-agencies/corporation-for-national-and-community-service
name: Corps of Engineers, url: https://www.usa.gov//federal-agencies/u-s-army-corps-of-engineers
name: Council of Economic Advisers, url: https://www.usa.gov//federal-agencies/council-of-economic-advisers
name: Council of the Inspectors General on Integrity and Efficiency, url: https://www.usa.gov//federal-agencies/council-of-the-inspectors-general-on-integrity-and-efficiency
name: Council on Environmental Quality, url: https://www.usa.gov//federal-agencies/council-on-environmental-quality
name: Court Services and Offender Supervision Agency for the District of Columbia, url: https://www.usa.gov//federal-agencies/court-services-and-offender-supervision-agency-for-the-district-of-columbia
name: Court of Appeals for Veterans Claims, url: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
name: Court of Appeals for the Armed Forces, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-armed-forces
name: Court of Appeals for the Federal Circuit, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-federal-circuit
name: Court of Federal Claims, url: https://www.usa.gov//federal-agencies/court-of-federal-claims
name: Court of International Trade, url: https://www.usa.gov//federal-agencies/court-of-international-trade
name: Customs and Border Protection, url: https://www.usa.gov//federal-agencies/u-s-customs-and-border-protection

Individual entry url for Court of Appeals for Veterans Claims: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
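
The individual lookup in testit needs both the page letter and the exact name. When only part of the name is known, a search over every letter might be handier. This is a rough sketch, assuming the {page letter: {agency name: url}} layout of the saved index; the function name find_agencies is hypothetical.

```python
def find_agencies(fed_index, text):
    """Return (name, url) pairs whose name contains text, ignoring case."""
    needle = text.lower()
    return [
        (name, url)
        for entries in fed_index.values()   # one dict of entries per letter
        for name, url in entries.items()
        if needle in name.lower()
    ]
```

With the index loaded as in testit, find_agencies(fed_index, 'veterans') would return every entry that mentions veterans, whichever letter page it was listed on.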


Messages In This Thread
scraping multiple pages of a website. - by Blue Dog - Jun-07-2018, 10:07 PM
RE: scraping multiple pages of a website. - by Larz60+ - Jun-08-2018, 07:56 AM