Python Forum

Full Version: scraping multiple pages of a website.
Hello All,

I have a website that has 26 pages, which start with 'a' and end with 'z'.
This is the url of the site: https://www.usa.gov/federal-agencies/a
I have a scraper that does what I want. I know that to all of you python kings
it will look crude. What I need help with is how to scrape all 26 pages.
I have been all over the net looking for how to do it, and there is just not much out there.
I have found a few ways of doing it, but none work. So here I am, hoping someone can help. LOL

Here is my code:
#Python 3.7
import requests
from bs4 import BeautifulSoup

# fetch the first page ('a')
r = requests.get('https://www.usa.gov/federal-agencies/a')
first_page = r.text

# parse the HTML and pull out the agency list
soup = BeautifulSoup(first_page, 'html.parser')
boxes = soup.find_all('ul', {'class': 'one_column_bullet'})

print(boxes[0].text.strip())
I tried everything I could think of, mostly for loops.
Here is one that sort of works, except it prints out the same page 26 times.

#Python 3.7
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase


for letter in ascii_lowercase:
    r = requests.get('https://www.usa.gov/federal-agencies/' + letter + ' ')

    first_page = r.text

    soup = BeautifulSoup(first_page, 'html.parser')

    page_soup = soup.find('h1')

    print(page_soup)
So if someone knows how to use my code to scrape all 26 pages, let me know.
Thank you
renny
The pages have the same URL base, with the letter added to the end.
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
etc.

>>> baseurl = 'https://www.usa.gov/federal-agencies/'
>>> valid_pages = 'abcdefghijlmnoprstuvw'
>>> for n in range(len(valid_pages)):
...     url = f'{baseurl}{valid_pages[n]}'
...     print(url)
...
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
https://www.usa.gov/federal-agencies/c
https://www.usa.gov/federal-agencies/d
https://www.usa.gov/federal-agencies/e
https://www.usa.gov/federal-agencies/f
https://www.usa.gov/federal-agencies/g
https://www.usa.gov/federal-agencies/h
https://www.usa.gov/federal-agencies/i
https://www.usa.gov/federal-agencies/j
https://www.usa.gov/federal-agencies/l
https://www.usa.gov/federal-agencies/m
https://www.usa.gov/federal-agencies/n
https://www.usa.gov/federal-agencies/o
https://www.usa.gov/federal-agencies/p
https://www.usa.gov/federal-agencies/r
https://www.usa.gov/federal-agencies/s
https://www.usa.gov/federal-agencies/t
https://www.usa.gov/federal-agencies/u
https://www.usa.gov/federal-agencies/v
https://www.usa.gov/federal-agencies/w
>>>
So you can iterate over this. Pseudocode:

for char in valid_pages

Within each page, the following can be used as an anchor:
<ul class="az-list group">

After that, all the links (regular <a> tags) up until the closing </ul>
are what you need; see the sketch below.

So it seems pretty simple.
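A minimal sketch of that approach (a sketch only: it assumes the 'one_column_bullet' class from renny's first snippet marks the agency list; the 'az-list group' class above may be the right anchor instead, depending on the markup):

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'
valid_pages = 'abcdefghijlmnoprstuvw'

for letter in valid_pages:
    # fetch each lettered page and parse the returned HTML
    resp = requests.get(f'{baseurl}{letter}')
    soup = BeautifulSoup(resp.text, 'html.parser')
    # anchor on the <ul> that holds the agency links
    ulist = soup.find('ul', {'class': 'one_column_bullet'})
    if ulist is None:
        continue  # skip pages where that list is missing
    for link in ulist.find_all('a'):
        print(link.text, link.get('href'))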
You lost me, but I will try to use it.
Thank you.
renny

Well, I have been at this for about 14 hours today, so I am going to hit the sack.
This is what I have got so far.

import requests
from bs4 import BeautifulSoup


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    page = soup = BeautifulSoup(url, 'html.parser')

    for page in soup.find_all('ul', {'class': 'one_column_bullet'}):
        print(page)
This is what I get:

<Response [200]>a
<Response [200]>b
<Response [200]>c
<Response [200]>d
<Response [200]>e
<Response [200]>f
<Response [200]>g
<Response [200]>h
<Response [200]>i
<Response [200]>j
<Response [200]>l
<Response [200]>m
<Response [200]>n
<Response [200]>o
<Response [200]>p
<Response [200]>r
<Response [200]>s
<Response [200]>t
<Response [200]>u
<Response [200]>v
<Response [200]>w
I do get to all the pages, but soup does not work.
I want to thank you, Larz60+, for your help. I will start back on it tomorrow.
renny
All the code I showed does is create a url for each page;
you still have to fetch each page with requests and extract the links.
So instead of the print, add your page-scraping code.
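A minimal sketch of that fix (keep baseurl a plain string, fetch inside the loop, and hand the fetched HTML, not the URL string, to BeautifulSoup; the 'one_column_bullet' class comes from your first post):

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.usa.gov/federal-agencies/'  # a string, not a Response
valid_pages = 'abcdefghijlmnoprstuvw'

for letter in valid_pages:
    url = f'{baseurl}{letter}'
    response = requests.get(url)  # fetch the page itself
    soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML, not the URL
    for ul in soup.find_all('ul', {'class': 'one_column_bullet'}):
        print(ul.text.strip())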
Still up, trying some new stuff:

import requests
from bs4 import BeautifulSoup


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
    url = f'{baseurl}{valid_pages[n]}'
    print(url)
    pages = soup = BeautifulSoup(url, 'html.parser')
    print(pages.title)
    for page in pages.find_all('ul', {'class': 'z-list group'}):
        print(page.a)
        print(page)
Here is the output; bs4 is not kicking in. I am just too beat to mess with it tonight.

<Response [200]>a
None
<Response [200]>b
None
<Response [200]>c
None
<Response [200]>d
None
<Response [200]>e
None
<Response [200]>f
None
<Response [200]>g
None
<Response [200]>h
None
<Response [200]>i
None
<Response [200]>j
None
<Response [200]>l
None
<Response [200]>m
None
<Response [200]>n
None
<Response [200]>o
None
<Response [200]>p
None

I just had a brain fart; maybe my for loop is not working.
Tomorrow is another day.
Ok,

Couldn't resist writing this one.
This code can be run by itself, or imported into another module.
Once run, all a class that wants to use the index has to do is load the json file into
a dictionary (see testit).

Create a project directory with a src directory inside it:

mkdir FederalAgencies
cd FederalAgencies
mkdir src

Add an __init__.py module to the FederalAgencies directory, so the layout looks like:

FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py

In the src directory:

1. Create an empty __init__.py file.

2. Save this as FederalPaths.py:
from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure the start path is properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)

        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        path = Path(__file__).resolve()
        path, file = os.path.split(path)
        path = os.path.abspath(path)
        os.chdir(path)

def testit():
    FederalPaths()

if __name__ == '__main__':
    testit()
Save this one in the src directory as BuildFederalAgencyIndex.py:
import FederalPaths
import requests
from bs4 import BeautifulSoup
import sys
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        self.fed_index = {}
        self.valid_pages = 'abcdefghijlmnoprstuvw'

        self.build_index()
    
    def build_index(self):
        for n in range(len(self.valid_pages)):
            alpha = self.valid_pages[n]
            URL = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(URL)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"} )
                links = ulist.find_all('a')
                for link in links:
                    suffix = link.get('href')
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except:
                print(f'error: {sys.exc_info()[0]}')

        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()

    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)
    
    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')

    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")

if __name__ == '__main__':
    testit()
Test run:

cd FederalAgencies/src
python BuildFederalAgencyIndex.py

This will create the json file (in the data/json directory) and print out all the 'c' entries;
the directories will be created the first time it is run.
Results:
Output:
name: California, url: https://www.usa.gov//state-government/california
name: Capitol Police, url: https://www.usa.gov//federal-agencies/u-s-capitol-police
name: Capitol Visitor Center, url: https://www.usa.gov//federal-agencies/u-s-capitol-visitor-center
name: Career, Technical, and Adult Education, Office of, url: https://www.usa.gov//federal-agencies/office-of-career-technical-and-adult-education
name: Census Bureau, url: https://www.usa.gov//federal-agencies/u-s-census-bureau
name: Center for Food Safety and Applied Nutrition, url: https://www.usa.gov//federal-agencies/center-for-food-safety-and-applied-nutrition
name: Center for Nutrition Policy and Promotion (CNPP), url: https://www.usa.gov//federal-agencies/center-for-nutrition-policy-and-promotion
name: Centers for Disease Control and Prevention (CDC), url: https://www.usa.gov//federal-agencies/centers-for-disease-control-and-prevention
name: Centers for Medicare and Medicaid Services (CMS), url: https://www.usa.gov//federal-agencies/centers-for-medicare-and-medicaid-services
name: Central Command (CENTCOM), url: https://www.usa.gov//federal-agencies/u-s-central-command
name: Central Intelligence Agency (CIA), url: https://www.usa.gov//federal-agencies/central-intelligence-agency
name: Chemical Safety Board, url: https://www.usa.gov//federal-agencies/u-s-chemical-safety-board
name: Chief Acquisition Officers Council, url: https://www.usa.gov//federal-agencies/chief-acquisition-officers-council
name: Chief Financial Officers Council, url: https://www.usa.gov//federal-agencies/chief-financial-officers-council
name: Chief Human Capital Officers Council, url: https://www.usa.gov//federal-agencies/chief-human-capital-officers-council
name: Chief Information Officers Council, url: https://www.usa.gov//federal-agencies/chief-information-officers-council
name: Child Support Enforcement, Office of (OCSE), url: https://www.usa.gov//federal-agencies/office-of-child-support-enforcement
name: Circuit Courts of Appeal, url: https://www.usa.gov//federal-agencies/u-s-courts-of-appeal
name: Citizens' Stamp Advisory Committee, url: https://www.usa.gov//federal-agencies/citizens-stamp-advisory-committee
name: Citizenship and Immigration Services (USCIS), url: https://www.usa.gov//federal-agencies/u-s-citizenship-and-immigration-services
name: Civil Rights, Department of Education Office of, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-education
name: Civil Rights, Department of Health and Human Services Office for, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-health-and-human-services
name: Coast Guard, url: https://www.usa.gov//federal-agencies/u-s-coast-guard
name: Colorado, url: https://www.usa.gov//state-government/colorado
name: Commerce Department (DOC), url: https://www.usa.gov//federal-agencies/u-s-department-of-commerce
name: Commission of Fine Arts, url: https://www.usa.gov//federal-agencies/u-s-commission-of-fine-arts
name: Commission on Civil Rights, url: https://www.usa.gov//federal-agencies/commission-on-civil-rights
name: Commission on International Religious Freedom, url: https://www.usa.gov//federal-agencies/u-s-commission-on-international-religious-freedom
name: Commission on Presidential Scholars, url: https://www.usa.gov//federal-agencies/commission-on-presidential-scholars
name: Commission on Security and Cooperation in Europe (Helsinki Commission), url: https://www.usa.gov//federal-agencies/commission-on-security-and-cooperation-in-europe-helsinki-commission
name: Committee for the Implementation of Textile Agreements, url: https://www.usa.gov//federal-agencies/committee-for-the-implementation-of-textile-agreements
name: Committee on Foreign Investment in the United States, url: https://www.usa.gov//federal-agencies/committee-on-foreign-investment-in-the-united-states
name: Commodity Futures Trading Commission (CFTC), url: https://www.usa.gov//federal-agencies/u-s-commodity-futures-trading-commission
name: Community Oriented Policing Services (COPS), url: https://www.usa.gov//federal-agencies/community-oriented-policing-services
name: Community Planning and Development, url: https://www.usa.gov//federal-agencies/office-of-community-planning-and-development
name: Compliance, Office of, url: https://www.usa.gov//federal-agencies/office-of-compliance
name: Comptroller of the Currency, Office of (OCC), url: https://www.usa.gov//federal-agencies/office-of-the-comptroller-of-the-currency
name: Computer Emergency Readiness Team (US CERT), url: https://www.usa.gov//federal-agencies/computer-emergency-readiness-team
name: Congress—U.S. House of Representatives, url: https://www.usa.gov//federal-agencies/u-s-house-of-representatives
name: Congress—U.S. Senate, url: https://www.usa.gov//federal-agencies/u-s-senate
name: Congressional Budget Office (CBO), url: https://www.usa.gov//federal-agencies/congressional-budget-office
name: Congressional Research Service, url: https://www.usa.gov//federal-agencies/congressional-research-service
name: Connecticut, url: https://www.usa.gov//state-government/connecticut
name: Consular Affairs, Bureau of, url: https://www.usa.gov//federal-agencies/bureau-of-consular-affairs
name: Consumer Financial Protection Bureau, url: https://www.usa.gov//federal-agencies/consumer-financial-protection-bureau
name: Consumer Product Safety Commission (CPSC), url: https://www.usa.gov//federal-agencies/consumer-product-safety-commission
name: Coordinating Council on Juvenile Justice and Delinquency Prevention, url: https://www.usa.gov//federal-agencies/coordinating-council-on-juvenile-justice-and-delinquency-prevention
name: Copyright Office, url: https://www.usa.gov//federal-agencies/copyright-office
name: Corporation for National and Community Service, url: https://www.usa.gov//federal-agencies/corporation-for-national-and-community-service
name: Corps of Engineers, url: https://www.usa.gov//federal-agencies/u-s-army-corps-of-engineers
name: Council of Economic Advisers, url: https://www.usa.gov//federal-agencies/council-of-economic-advisers
name: Council of the Inspectors General on Integrity and Efficiency, url: https://www.usa.gov//federal-agencies/council-of-the-inspectors-general-on-integrity-and-efficiency
name: Council on Environmental Quality, url: https://www.usa.gov//federal-agencies/council-on-environmental-quality
name: Court Services and Offender Supervision Agency for the District of Columbia, url: https://www.usa.gov//federal-agencies/court-services-and-offender-supervision-agency-for-the-district-of-columbia
name: Court of Appeals for Veterans Claims, url: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
name: Court of Appeals for the Armed Forces, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-armed-forces
name: Court of Appeals for the Federal Circuit, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-federal-circuit
name: Court of Federal Claims, url: https://www.usa.gov//federal-agencies/court-of-federal-claims
name: Court of International Trade, url: https://www.usa.gov//federal-agencies/court-of-international-trade
name: Customs and Border Protection, url: https://www.usa.gov//federal-agencies/u-s-customs-and-border-protection

Individual entry url for Court of Appeals for Veterans Claims: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
Larz60+ has done a wonderful job writing this for you, but I think it's too complicated for something that can be done with a couple of lines (i.e. the OOP, etc. is overkill).
First of all, your code: the problem is that baseurl is a Response object, and that you hand the URL string to BeautifulSoup instead of fetching the page and parsing its HTML.
Here it is with some small changes:
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase
 
base_url = 'https://www.usa.gov/federal-agencies/'
for letter in ascii_lowercase:
    url = '{}{}'.format(base_url, letter)
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for ul in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(ul)
Now the nice part.
If you inspect the page and what it loads, you will notice that it gets all the information as json,
so:
import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)
print(resp.json())
If you want, you can save the json response as a file (see the sketch below). Either way, you get all 548 agencies and their respective urls in a single GET request, as json.
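A minimal sketch of saving it (the output file name here is just an illustration):

import json
import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)

# write the decoded json payload to a local file for later use
with open('federal_agencies.json', 'w') as fp:
    json.dump(resp.json(), fp, indent=2)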
Buran,

With all respect for your concise response,
Thanks, Great way to start my day!
(Jun-08-2018, 01:55 PM)Larz60+ Wrote: Thanks, Great way to start my day!
Sorry for any misstep :-)
Not a problem.
I felt like I was building a bridge over a body of water that already had stepping stones in place.