Python Forum

Full Version: scraping multiple pages of a website.
Hello All,

I have a website that has 26 pages, which start with 'a' and end with 'z'.
This is the url of the site: https://www.usa.gov/federal-agencies/a
I have a scraper that does what I want. I know that to all of you python kings
it will look crude. What I need help with is how to scrape all 26 pages.
I have been all over the net looking for how to do it, and there is just not much out there.
I have found a few ways of doing it, but none work. So here I am, hoping someone can help. LOL

Here is my code:
#Python 3.7
import requests
from bs4 import BeautifulSoup

# fetch the first page ('a')
r = requests.get('https://www.usa.gov/federal-agencies/a')
first_page = r.text

# parse the HTML and pull out the agency list
soup = BeautifulSoup(first_page, 'html.parser')
boxes = soup.find_all('ul', {'class': 'one_column_bullet'})

print(boxes[0].text.strip())
I tried everything I could think of, mostly for loops.
Here is one that sort of works, except it prints out the same page 26 times.

#Python 3.7
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase


for letter in ascii_lowercase:
    r = requests.get('https://www.usa.gov/federal-agencies/' + letter + ' ')

    first_page = r.text

    soup = BeautifulSoup(first_page, 'html.parser')

    page_soup = soup.find('h1')

    print(page_soup)
So if someone knows how to use my code to scrape all 26 pages, let me know.
Thank you
renny
The pages have the same URL base, with the letter added to the end.
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
etc.

>>> baseurl = 'https://www.usa.gov/federal-agencies/'
>>> valid_pages = 'abcdefghijlmnoprstuvw'
>>> for n in range(len(valid_pages)):
...     url = f'{baseurl}{valid_pages[n]}'
...     print(url)
...
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
https://www.usa.gov/federal-agencies/c
https://www.usa.gov/federal-agencies/d
https://www.usa.gov/federal-agencies/e
https://www.usa.gov/federal-agencies/f
https://www.usa.gov/federal-agencies/g
https://www.usa.gov/federal-agencies/h
https://www.usa.gov/federal-agencies/i
https://www.usa.gov/federal-agencies/j
https://www.usa.gov/federal-agencies/l
https://www.usa.gov/federal-agencies/m
https://www.usa.gov/federal-agencies/n
https://www.usa.gov/federal-agencies/o
https://www.usa.gov/federal-agencies/p
https://www.usa.gov/federal-agencies/r
https://www.usa.gov/federal-agencies/s
https://www.usa.gov/federal-agencies/t
https://www.usa.gov/federal-agencies/u
https://www.usa.gov/federal-agencies/v
https://www.usa.gov/federal-agencies/w
>>>
So you can iterate over this. Pseudocode:

for char in valid_pages

Within each page, the following can be used as an anchor:
<ul class="az-list group">

After that, all the links (regular <a> tags) up until the closing </ul>
are what you need; see the sketch below.

So it seems pretty simple.
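A minimal sketch of that approach (a sketch only: it assumes the 'one_column_bullet' class from renny's first snippet marks the agency list; the 'az-list group' class above may be the right anchor instead, depending on the markup):

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'
valid_pages = 'abcdefghijlmnoprstuvw'

for letter in valid_pages:
    # fetch each lettered page and parse the returned HTML
    resp = requests.get(f'{baseurl}{letter}')
    soup = BeautifulSoup(resp.text, 'html.parser')
    # anchor on the <ul> that holds the agency links
    ulist = soup.find('ul', {'class': 'one_column_bullet'})
    if ulist is None:
        continue  # skip pages where that list is missing
    for link in ulist.find_all('a'):
        print(link.text, link.get('href'))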
You lost me, but I will try to use it.
Thank you.
renny

Well, I have been at this for about 14 hours today, so I am going to hit the sack.
This is what I have got so far.

import requests
from bs4 import BeautifulSoup


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    page = soup = BeautifulSoup(url, 'html.parser')

    for page in soup.find_all('ul', {'class': 'one_column_bullet'}):
        print(page)
This is what I get:

<Response [200]>a
<Response [200]>b
<Response [200]>c
<Response [200]>d
<Response [200]>e
<Response [200]>f
<Response [200]>g
<Response [200]>h
<Response [200]>i
<Response [200]>j
<Response [200]>l
<Response [200]>m
<Response [200]>n
<Response [200]>o
<Response [200]>p
<Response [200]>r
<Response [200]>s
<Response [200]>t
<Response [200]>u
<Response [200]>v
<Response [200]>w
I do get to all the pages, but soup does not work.
I want to thank you, Larz60+, for your help. I will start back on it tomorrow.
renny
All the code I showed does is create a url for each page;
you still have to fetch each page with requests and extract the links.
So instead of the print, add your page-scraping code.
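A minimal sketch of that fix (keep baseurl a plain string, fetch inside the loop, and hand the fetched HTML, not the URL string, to BeautifulSoup; the 'one_column_bullet' class comes from your first post):

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'  # a string, not a Response
valid_pages = 'abcdefghijlmnoprstuvw'

for letter in valid_pages:
    url = f'{baseurl}{letter}'
    response = requests.get(url)  # fetch the page itself
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML, not the URL
    for ul in soup.find_all('ul', {'class': 'one_column_bullet'}):
        print(ul.text.strip())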
Still up, trying some new stuff:

import requests
from bs4 import BeautifulSoup


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    pages = soup = BeautifulSoup(url, 'html.parser')
    print(pages.title)
    for page in pages.find_all('ul', {'class': 'z-list group'}):
        print(page.a)
        print(page)
Here is the output; bs4 is not kicking in. I am just too beat to mess with it tonight.

<Response [200]>a
None
<Response [200]>b
None
<Response [200]>c
None
<Response [200]>d
None
<Response [200]>e
None
<Response [200]>f
None
<Response [200]>g
None
<Response [200]>h
None
<Response [200]>i
None
<Response [200]>j
None
<Response [200]>l
None
<Response [200]>m
None
<Response [200]>n
None
<Response [200]>o
None
<Response [200]>p
None

I just had a brain fart; maybe my for loop is not working.
Tomorrow is another day.
Ok,

Couldn't resist writing this one.
This code can be run by itself, or imported into another module.
Once run, all a class that wants to use the index has to do is load the json file into
a dictionary (see testit).

Create a project directory with a src directory inside it:

mkdir FederalAgencies
cd FederalAgencies
mkdir src

Add an __init__.py module to the FederalAgencies directory, so the layout looks like:

FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py

In the src directory:

1. Create an empty __init__.py file.

2. Save this as FederalPaths.py:
from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure the start path is properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)

        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        path = Path(__file__).resolve()
        path, file = os.path.split(path)
        path = os.path.abspath(path)
        os.chdir(path)

def testit():
    FederalPaths()

if __name__ == '__main__':
    testit()
Save this one in the src directory as BuildFederalAgencyIndex.py:
import FederalPaths
import requests
from bs4 import BeautifulSoup
import sys
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        self.fed_index = {}
        self.valid_pages = 'abcdefghijlmnoprstuvw'

        self.build_index()
    
    def build_index(self):
        for n in range(len(self.valid_pages)):
            alpha = self.valid_pages[n]
            URL = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(URL)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"} )
                links = ulist.find_all('a')
                for link in links:
                    suffix = link.get('href')
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except:
                print(f'error: {sys.exc_info()[0]}')

        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()

    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)
    
    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')

    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")

if __name__ == '__main__':
    testit()
Test run:

cd FederalAgencies/src
python BuildFederalAgencyIndex.py

This will create the json file (in the data/json directory) and print out all the 'c' entries;
the directories will be created the first time it is run.
Results:
Output:
name: California, url: https://www.usa.gov//state-government/california
name: Capitol Police, url: https://www.usa.gov//federal-agencies/u-s-capitol-police
name: Capitol Visitor Center, url: https://www.usa.gov//federal-agencies/u-s-capitol-visitor-center
name: Career, Technical, and Adult Education, Office of, url: https://www.usa.gov//federal-agencies/office-of-career-technical-and-adult-education
name: Census Bureau, url: https://www.usa.gov//federal-agencies/u-s-census-bureau
name: Center for Food Safety and Applied Nutrition, url: https://www.usa.gov//federal-agencies/center-for-food-safety-and-applied-nutrition
name: Center for Nutrition Policy and Promotion (CNPP), url: https://www.usa.gov//federal-agencies/center-for-nutrition-policy-and-promotion
name: Centers for Disease Control and Prevention (CDC), url: https://www.usa.gov//federal-agencies/centers-for-disease-control-and-prevention
name: Centers for Medicare and Medicaid Services (CMS), url: https://www.usa.gov//federal-agencies/centers-for-medicare-and-medicaid-services
name: Central Command (CENTCOM), url: https://www.usa.gov//federal-agencies/u-s-central-command
name: Central Intelligence Agency (CIA), url: https://www.usa.gov//federal-agencies/central-intelligence-agency
name: Chemical Safety Board, url: https://www.usa.gov//federal-agencies/u-s-chemical-safety-board
name: Chief Acquisition Officers Council, url: https://www.usa.gov//federal-agencies/chief-acquisition-officers-council
name: Chief Financial Officers Council, url: https://www.usa.gov//federal-agencies/chief-financial-officers-council
name: Chief Human Capital Officers Council, url: https://www.usa.gov//federal-agencies/chief-human-capital-officers-council
name: Chief Information Officers Council, url: https://www.usa.gov//federal-agencies/chief-information-officers-council
name: Child Support Enforcement, Office of (OCSE), url: https://www.usa.gov//federal-agencies/office-of-child-support-enforcement
name: Circuit Courts of Appeal, url: https://www.usa.gov//federal-agencies/u-s-courts-of-appeal
name: Citizens' Stamp Advisory Committee, url: https://www.usa.gov//federal-agencies/citizens-stamp-advisory-committee
name: Citizenship and Immigration Services (USCIS), url: https://www.usa.gov//federal-agencies/u-s-citizenship-and-immigration-services
name: Civil Rights, Department of Education Office of, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-education
name: Civil Rights, Department of Health and Human Services Office for, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-health-and-human-services
name: Coast Guard, url: https://www.usa.gov//federal-agencies/u-s-coast-guard
name: Colorado, url: https://www.usa.gov//state-government/colorado
name: Commerce Department (DOC), url: https://www.usa.gov//federal-agencies/u-s-department-of-commerce
name: Commission of Fine Arts, url: https://www.usa.gov//federal-agencies/u-s-commission-of-fine-arts
name: Commission on Civil Rights, url: https://www.usa.gov//federal-agencies/commission-on-civil-rights
name: Commission on International Religious Freedom, url: https://www.usa.gov//federal-agencies/u-s-commission-on-international-religious-freedom
name: Commission on Presidential Scholars, url: https://www.usa.gov//federal-agencies/commission-on-presidential-scholars
name: Commission on Security and Cooperation in Europe (Helsinki Commission), url: https://www.usa.gov//federal-agencies/commission-on-security-and-cooperation-in-europe-helsinki-commission
name: Committee for the Implementation of Textile Agreements, url: https://www.usa.gov//federal-agencies/committee-for-the-implementation-of-textile-agreements
name: Committee on Foreign Investment in the United States, url: https://www.usa.gov//federal-agencies/committee-on-foreign-investment-in-the-united-states
name: Commodity Futures Trading Commission (CFTC), url: https://www.usa.gov//federal-agencies/u-s-commodity-futures-trading-commission
name: Community Oriented Policing Services (COPS), url: https://www.usa.gov//federal-agencies/community-oriented-policing-services
name: Community Planning and Development, url: https://www.usa.gov//federal-agencies/office-of-community-planning-and-development
name: Compliance, Office of, url: https://www.usa.gov//federal-agencies/office-of-compliance
name: Comptroller of the Currency, Office of (OCC), url: https://www.usa.gov//federal-agencies/office-of-the-comptroller-of-the-currency
name: Computer Emergency Readiness Team (US CERT), url: https://www.usa.gov//federal-agencies/computer-emergency-readiness-team
name: Congress—U.S. House of Representatives, url: https://www.usa.gov//federal-agencies/u-s-house-of-representatives
name: Congress—U.S. Senate, url: https://www.usa.gov//federal-agencies/u-s-senate
name: Congressional Budget Office (CBO), url: https://www.usa.gov//federal-agencies/congressional-budget-office
name: Congressional Research Service, url: https://www.usa.gov//federal-agencies/congressional-research-service
name: Connecticut, url: https://www.usa.gov//state-government/connecticut
name: Consular Affairs, Bureau of, url: https://www.usa.gov//federal-agencies/bureau-of-consular-affairs
name: Consumer Financial Protection Bureau, url: https://www.usa.gov//federal-agencies/consumer-financial-protection-bureau
name: Consumer Product Safety Commission (CPSC), url: https://www.usa.gov//federal-agencies/consumer-product-safety-commission
name: Coordinating Council on Juvenile Justice and Delinquency Prevention, url: https://www.usa.gov//federal-agencies/coordinating-council-on-juvenile-justice-and-delinquency-prevention
name: Copyright Office, url: https://www.usa.gov//federal-agencies/copyright-office
name: Corporation for National and Community Service, url: https://www.usa.gov//federal-agencies/corporation-for-national-and-community-service
name: Corps of Engineers, url: https://www.usa.gov//federal-agencies/u-s-army-corps-of-engineers
name: Council of Economic Advisers, url: https://www.usa.gov//federal-agencies/council-of-economic-advisers
name: Council of the Inspectors General on Integrity and Efficiency, url: https://www.usa.gov//federal-agencies/council-of-the-inspectors-general-on-integrity-and-efficiency
name: Council on Environmental Quality, url: https://www.usa.gov//federal-agencies/council-on-environmental-quality
name: Court Services and Offender Supervision Agency for the District of Columbia, url: https://www.usa.gov//federal-agencies/court-services-and-offender-supervision-agency-for-the-district-of-columbia
name: Court of Appeals for Veterans Claims, url: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
name: Court of Appeals for the Armed Forces, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-armed-forces
name: Court of Appeals for the Federal Circuit, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-federal-circuit
name: Court of Federal Claims, url: https://www.usa.gov//federal-agencies/court-of-federal-claims
name: Court of International Trade, url: https://www.usa.gov//federal-agencies/court-of-international-trade
name: Customs and Border Protection, url: https://www.usa.gov//federal-agencies/u-s-customs-and-border-protection

Individual entry url for Court of Appeals for Veterans Claims: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
Larz60+ has done a wonderful job writing this for you, but I think it's too complicated for something that can be done with a couple of lines (i.e. the OOP, etc. is overkill).
First of all, your code: the problem is that baseurl is a Response object, and that you hand the URL string to BeautifulSoup instead of fetching the page and parsing its HTML.
Here it is with some small changes:
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase
 
base_url = 'https://www.usa.gov/federal-agencies/'
for letter in ascii_lowercase:
    url = '{}{}'.format(base_url, letter)
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for ul in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(ul)
Now the nice part.
If you inspect the page and what it loads, you will notice that it gets all the information as json,
so:
import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)
print(resp.json())
If you want, you can save the json response as a file (see the sketch below). Either way, you get all 548 agencies and their respective urls in a single GET request, as json.
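A minimal sketch of saving it (the output file name here is just an illustration):

import json
import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)

# write the decoded json payload to a local file for later use
with open('federal_agencies.json', 'w') as fp:
    json.dump(resp.json(), fp, indent=2)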
Buran,

With all respect for your concise response,
Thanks, Great way to start my day!
(Jun-08-2018, 01:55 PM)Larz60+ Wrote: Thanks, Great way to start my day!
Sorry for any misstep :-)
Not a problem.
I felt like I was building a bridge over a body of water that already had stepping stones in place.