 scraping multiple pages of a website.
#1
Hello All,

I have a website that has 26 pages, which start with 'a' and end with 'z'.
This is the URL of the site: https://www.usa.gov/federal-agencies/a
I have a scraper that does what I want. I know that to all of you Python kings
it will look crude. What I need help with is how to scrape all 26 pages.
I have been all over the net looking for how to do it. There is just not much out there.
I have found a few ways of doing it, but none work. So here I am, hoping someone can help. LOL

Here is my code:
#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup







r = requests.get('https://www.usa.gov/federal-agencies/a')
        
first_page = r.text

soup = BeautifulSoup(first_page, 'html.parser')

page_soup = soup

#page_soup.h1

#page_soup.p
boxes = page_soup.find_all('ul', {'class' : 'one_column_bullet'})
boxes[0].text.strip()


print(boxes)

I tried everything I could think of, mostly a lot of for loops.
Here is one that works a bit: it prints out the same page 26 times.

#Python 3.7
from html.parser import HTMLParser
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase


for letter in ascii_lowercase:
        r = requests.get('https://www.usa.gov/federal-agencies/' + letter +' ')
        
        first_page = r.text

        soup = BeautifulSoup(first_page, 'html.parser')

        page_soup = soup.find('h1')


        print(page_soup)

So if someone knows how to use my code to scrape all 26 pages, let me know.
Thank you
renny
#2
The pages have the same URL base, with the letter added to the end.
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
etc.

>>> baseurl = 'https://www.usa.gov/federal-agencies/'
>>> valid_pages = 'abcdefghijlmnoprstuvw'
>>> for n in range(len(valid_pages)):
...     url = f'{baseurl}{valid_pages[n]}'
...     print(url)
...
https://www.usa.gov/federal-agencies/a
https://www.usa.gov/federal-agencies/b
https://www.usa.gov/federal-agencies/c
https://www.usa.gov/federal-agencies/d
https://www.usa.gov/federal-agencies/e
https://www.usa.gov/federal-agencies/f
https://www.usa.gov/federal-agencies/g
https://www.usa.gov/federal-agencies/h
https://www.usa.gov/federal-agencies/i
https://www.usa.gov/federal-agencies/j
https://www.usa.gov/federal-agencies/l
https://www.usa.gov/federal-agencies/m
https://www.usa.gov/federal-agencies/n
https://www.usa.gov/federal-agencies/o
https://www.usa.gov/federal-agencies/p
https://www.usa.gov/federal-agencies/r
https://www.usa.gov/federal-agencies/s
https://www.usa.gov/federal-agencies/t
https://www.usa.gov/federal-agencies/u
https://www.usa.gov/federal-agencies/v
https://www.usa.gov/federal-agencies/w
>>>
So you can iterate over this. Pseudo code:
for char in valid_pages:
    url = f'{baseurl}{char}'
Within each page, the following can be used as an anchor:
<ul class="az-list group">

After that, all links (regular <a> tags) up until the </ul>
are what you need.

So it seems pretty simple.
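
The "anchor on the <ul>, then take every <a> until the </ul>" idea can be sketched roughly as below. This is only an illustration, not the code used later in the thread: it uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra installs, the names ListLinkParser and extract_links are made up for this example, and the sample HTML is hypothetical markup, not a copy of the real page.

```python
from html.parser import HTMLParser


class ListLinkParser(HTMLParser):
    """Collect (text, href) for every <a> inside a <ul> carrying a given class."""

    def __init__(self, ul_class):
        super().__init__()
        self.ul_class = ul_class
        self.depth = 0      # > 0 while we are inside the anchor <ul>
        self.href = None    # href of the <a> currently open, if any
        self.parts = []     # text pieces of the <a> currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul':
            classes = (attrs.get('class') or '').split()
            # Enter the anchor list, or track a nested <ul> inside it
            if self.depth or self.ul_class in classes:
                self.depth += 1
        elif tag == 'a' and self.depth and attrs.get('href'):
            self.href = attrs['href']
            self.parts = []

    def handle_data(self, data):
        if self.href is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'ul' and self.depth:
            self.depth -= 1
        elif tag == 'a' and self.href is not None:
            self.links.append((''.join(self.parts).strip(), self.href))
            self.href = None


def extract_links(html, ul_class):
    """Return the links found inside <ul class="... ul_class ...">."""
    parser = ListLinkParser(ul_class)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, only to exercise the parser offline
sample = '''
<a href="/skip">not in the list</a>
<ul class="az-list group">
  <li><a href="/federal-agencies/example-one">Example One</a></li>
  <li><a href="/federal-agencies/example-two">Example Two</a></li>
</ul>
<a href="/also-skip">after the list</a>
'''
print(extract_links(sample, 'az-list'))
```

The class name is a parameter because the thread mentions two different ones for this list ("az-list group" here, "one_column_bullet" in the other posts); check the live page source to see which applies.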
#3
You lost me, but I will try to use it.
Thank you.
renny

Well, I have been at this for about 14 hours today. I am going to hit the sack.
This is what I have so far:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
        url = f'{baseurl}{valid_pages[n]}'
        print(url)
        page = soup = BeautifulSoup(url, 'html.parser')
       
        for page in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        
        
                print(page)
This is what I get:

<Response [200]>a
<Response [200]>b
<Response [200]>c
<Response [200]>d
<Response [200]>e
<Response [200]>f
<Response [200]>g
<Response [200]>h
<Response [200]>i
<Response [200]>j
<Response [200]>l
<Response [200]>m
<Response [200]>n
<Response [200]>o
<Response [200]>p
<Response [200]>r
<Response [200]>s
<Response [200]>t
<Response [200]>u
<Response [200]>v
<Response [200]>w
I do get to all the pages, but soup does not work.
I want to thank you, Larz60+, for your help. I will start back on it tomorrow.
renny
#4
All that the code I showed does is create a URL for each page;
you still have to fetch each one with requests and extract the links.
So instead of the print, add your page scraping code.
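
For context on the output in post #3: there, baseurl held the Response object returned by requests.get, so the f-string produced text like '<Response [200]>a', and BeautifulSoup was then handed that short string rather than any page HTML. A rough sketch of the intended build-URL, fetch, parse order follows. It is an assumption-laden illustration: scrape_pages and fetch_html are names invented here, and urllib from the standard library stands in for the requests.get(url).text call used in the thread.

```python
from urllib.request import urlopen

BASE_URL = 'https://www.usa.gov/federal-agencies/'  # a plain string, not a Response
VALID_PAGES = 'abcdefghijlmnoprstuvw'


def fetch_html(url):
    """Download one page and return its HTML as text (makes a network request)."""
    with urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='replace')


def scrape_pages(letters, parse, fetch=fetch_html):
    """Return {letter: parse(html)}, fetching one page per letter."""
    results = {}
    for letter in letters:
        url = f'{BASE_URL}{letter}'     # step 1: build the URL string
        html = fetch(url)               # step 2: fetch the page
        results[letter] = parse(html)   # step 3: parse the HTML, not the URL
    return results


def fake_fetch(url):
    """Offline stand-in for fetch_html so the demo needs no network."""
    return f'<h1>page for {url}</h1>'


print(scrape_pages('ab', parse=len, fetch=fake_fetch))
```

With the real fetcher, the call would be scrape_pages(VALID_PAGES, parse=some_function), where some_function is whatever per-page scraping code you already have; it receives the page HTML as a string.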
#5
Still up, trying some new stuff:

import requests
from bs4 import BeautifulSoup
from html.parser import HTMLParser


baseurl = requests.get('https://www.usa.gov/federal-agencies/')
valid_pages = 'abcdefghijlmnoprstuvw'
for n in range(len(valid_pages)):
        url = f'{baseurl}{valid_pages[n]}'
        print(url)
        pages = soup = BeautifulSoup(url, 'html.parser')
        print(pages.title)
        for page in pages.find_all('ul', {'class' : 'z-list group'}):
        
                print(page.a)
                print(page)

Here is the output; bs4 is not kicking in. I am just too beat to mess with it tonight.

<Response [200]>a
None
<Response [200]>b
None
<Response [200]>c
None
<Response [200]>d
None
<Response [200]>e
None
<Response [200]>f
None
<Response [200]>g
None
<Response [200]>h
None
<Response [200]>i
None
<Response [200]>j
None
<Response [200]>l
None
<Response [200]>m
None
<Response [200]>n
None
<Response [200]>o
None
<Response [200]>p
None

I just had a brain fart: maybe my for loop is not working.
Tomorrow is another day.
#6
Ok,

Couldn't resist writing this one.
This code can be run by itself or imported into another module.
Once it has run, any class that wants to use the index only needs to load the JSON file into
a dictionary (see testit).

Create a project directory and a src directory:
mkdir FederalAgencies
cd FederalAgencies
mkdir src
Add an empty __init__.py to the FederalAgencies directory, so the layout ends up as:
FederalAgencies/
    __init__.py
    src/
        __init__.py
        BuildFederalAgencyIndex.py
        FederalPaths.py
Also create an empty __init__.py file in the src directory.

Save this in the src directory as FederalPaths.py:
from pathlib import Path
import os


class FederalPaths:
    def __init__(self):
        # Make sure start path is  properly set
        self.set_starting_dir()
        self.homepath = Path('.')
        self.rootpath = self.homepath / '..'
        self.datapath = self.rootpath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.outpath = self.datapath / 'json'
        self.outpath.mkdir(exist_ok=True)

        self.gov_urlbase = 'https://www.usa.gov/'
        self.baseurl = 'https://www.usa.gov/federal-agencies/'
        self.valid_pages = 'abcdefghijlmnoprstuvw'
        self.fed_index_file = self.outpath / 'FedIndex.json'

    def set_starting_dir(self):
        # Run relative to this file's directory so '..' resolves to the project root
        os.chdir(Path(__file__).resolve().parent)

def testit():
    FederalPaths()

if __name__ == '__main__':
    testit()
Save this one in the src directory as BuildFederalAgencyIndex.py:
import FederalPaths
import requests
from bs4 import BeautifulSoup
import sys
import json


class BuildFederalAgencyIndex:
    def __init__(self):
        self.fpath = FederalPaths.FederalPaths()
        self.fed_index = {}
        self.valid_pages = 'abcdefghijlmnoprstuvw'

        self.build_index()
    
    def build_index(self):
        # One request per letter page; store {agency name: url} under each letter
        for alpha in self.valid_pages:
            url = f'{self.fpath.baseurl}{alpha}'
            self.fed_index[alpha] = {}
            try:
                response = requests.get(url)
                soup = BeautifulSoup(response.content, 'lxml')
                ulist = soup.find('ul', {"class": "one_column_bullet"})
                for link in ulist.find_all('a'):
                    suffix = link.get('href')
                    href = f'{self.fpath.gov_urlbase}{suffix}'
                    self.fed_index[alpha][link.text] = href
            except Exception:
                # Report the error type (e.g. the <ul> was not found) and move on
                print(f'error: {sys.exc_info()[0]}')

        with self.fpath.fed_index_file.open('w') as jout:
            json.dump(self.fed_index, jout)


def testit():
    # Create json file
    fa = BuildFederalAgencyIndex()

    # test json file
    with fa.fpath.fed_index_file.open() as fp:
        fed_index = json.load(fp)
    
    # Show all entries for 'c'
    for name, url in fed_index['c'].items():
        print(f'name: {name}, url: {url}')

    # Individual entry:
    print(f"\nIndividual entry url for Court of Appeals for Veterans Claims: {fed_index['c']['Court of Appeals for Veterans Claims']}")

if __name__ == '__main__':
    testit()
Test run:
cd FederalAgencies/src
python BuildFederalAgencyIndex.py
This creates the JSON file (in the data/json directory) and prints all 'c' entries.
The directories are created the first time it runs.
Output:
name: California, url: https://www.usa.gov//state-government/california
name: Capitol Police, url: https://www.usa.gov//federal-agencies/u-s-capitol-police
name: Capitol Visitor Center, url: https://www.usa.gov//federal-agencies/u-s-capitol-visitor-center
name: Career, Technical, and Adult Education, Office of, url: https://www.usa.gov//federal-agencies/office-of-career-technical-and-adult-education
name: Census Bureau, url: https://www.usa.gov//federal-agencies/u-s-census-bureau
name: Center for Food Safety and Applied Nutrition, url: https://www.usa.gov//federal-agencies/center-for-food-safety-and-applied-nutrition
name: Center for Nutrition Policy and Promotion (CNPP), url: https://www.usa.gov//federal-agencies/center-for-nutrition-policy-and-promotion
name: Centers for Disease Control and Prevention (CDC), url: https://www.usa.gov//federal-agencies/centers-for-disease-control-and-prevention
name: Centers for Medicare and Medicaid Services (CMS), url: https://www.usa.gov//federal-agencies/centers-for-medicare-and-medicaid-services
name: Central Command (CENTCOM), url: https://www.usa.gov//federal-agencies/u-s-central-command
name: Central Intelligence Agency (CIA), url: https://www.usa.gov//federal-agencies/central-intelligence-agency
name: Chemical Safety Board, url: https://www.usa.gov//federal-agencies/u-s-chemical-safety-board
name: Chief Acquisition Officers Council, url: https://www.usa.gov//federal-agencies/chief-acquisition-officers-council
name: Chief Financial Officers Council, url: https://www.usa.gov//federal-agencies/chief-financial-officers-council
name: Chief Human Capital Officers Council, url: https://www.usa.gov//federal-agencies/chief-human-capital-officers-council
name: Chief Information Officers Council, url: https://www.usa.gov//federal-agencies/chief-information-officers-council
name: Child Support Enforcement, Office of (OCSE), url: https://www.usa.gov//federal-agencies/office-of-child-support-enforcement
name: Circuit Courts of Appeal, url: https://www.usa.gov//federal-agencies/u-s-courts-of-appeal
name: Citizens' Stamp Advisory Committee, url: https://www.usa.gov//federal-agencies/citizens-stamp-advisory-committee
name: Citizenship and Immigration Services (USCIS), url: https://www.usa.gov//federal-agencies/u-s-citizenship-and-immigration-services
name: Civil Rights, Department of Education Office of, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-education
name: Civil Rights, Department of Health and Human Services Office for, url: https://www.usa.gov//federal-agencies/office-for-civil-rights-department-of-health-and-human-services
name: Coast Guard, url: https://www.usa.gov//federal-agencies/u-s-coast-guard
name: Colorado, url: https://www.usa.gov//state-government/colorado
name: Commerce Department (DOC), url: https://www.usa.gov//federal-agencies/u-s-department-of-commerce
name: Commission of Fine Arts, url: https://www.usa.gov//federal-agencies/u-s-commission-of-fine-arts
name: Commission on Civil Rights, url: https://www.usa.gov//federal-agencies/commission-on-civil-rights
name: Commission on International Religious Freedom, url: https://www.usa.gov//federal-agencies/u-s-commission-on-international-religious-freedom
name: Commission on Presidential Scholars, url: https://www.usa.gov//federal-agencies/commission-on-presidential-scholars
name: Commission on Security and Cooperation in Europe (Helsinki Commission), url: https://www.usa.gov//federal-agencies/commission-on-security-and-cooperation-in-europe-helsinki-commission
name: Committee for the Implementation of Textile Agreements, url: https://www.usa.gov//federal-agencies/committee-for-the-implementation-of-textile-agreements
name: Committee on Foreign Investment in the United States, url: https://www.usa.gov//federal-agencies/committee-on-foreign-investment-in-the-united-states
name: Commodity Futures Trading Commission (CFTC), url: https://www.usa.gov//federal-agencies/u-s-commodity-futures-trading-commission
name: Community Oriented Policing Services (COPS), url: https://www.usa.gov//federal-agencies/community-oriented-policing-services
name: Community Planning and Development, url: https://www.usa.gov//federal-agencies/office-of-community-planning-and-development
name: Compliance, Office of, url: https://www.usa.gov//federal-agencies/office-of-compliance
name: Comptroller of the Currency, Office of (OCC), url: https://www.usa.gov//federal-agencies/office-of-the-comptroller-of-the-currency
name: Computer Emergency Readiness Team (US CERT), url: https://www.usa.gov//federal-agencies/computer-emergency-readiness-team
name: Congress—U.S. House of Representatives, url: https://www.usa.gov//federal-agencies/u-s-house-of-representatives
name: Congress—U.S. Senate, url: https://www.usa.gov//federal-agencies/u-s-senate
name: Congressional Budget Office (CBO), url: https://www.usa.gov//federal-agencies/congressional-budget-office
name: Congressional Research Service, url: https://www.usa.gov//federal-agencies/congressional-research-service
name: Connecticut, url: https://www.usa.gov//state-government/connecticut
name: Consular Affairs, Bureau of, url: https://www.usa.gov//federal-agencies/bureau-of-consular-affairs
name: Consumer Financial Protection Bureau, url: https://www.usa.gov//federal-agencies/consumer-financial-protection-bureau
name: Consumer Product Safety Commission (CPSC), url: https://www.usa.gov//federal-agencies/consumer-product-safety-commission
name: Coordinating Council on Juvenile Justice and Delinquency Prevention, url: https://www.usa.gov//federal-agencies/coordinating-council-on-juvenile-justice-and-delinquency-prevention
name: Copyright Office, url: https://www.usa.gov//federal-agencies/copyright-office
name: Corporation for National and Community Service, url: https://www.usa.gov//federal-agencies/corporation-for-national-and-community-service
name: Corps of Engineers, url: https://www.usa.gov//federal-agencies/u-s-army-corps-of-engineers
name: Council of Economic Advisers, url: https://www.usa.gov//federal-agencies/council-of-economic-advisers
name: Council of the Inspectors General on Integrity and Efficiency, url: https://www.usa.gov//federal-agencies/council-of-the-inspectors-general-on-integrity-and-efficiency
name: Council on Environmental Quality, url: https://www.usa.gov//federal-agencies/council-on-environmental-quality
name: Court Services and Offender Supervision Agency for the District of Columbia, url: https://www.usa.gov//federal-agencies/court-services-and-offender-supervision-agency-for-the-district-of-columbia
name: Court of Appeals for Veterans Claims, url: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
name: Court of Appeals for the Armed Forces, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-armed-forces
name: Court of Appeals for the Federal Circuit, url: https://www.usa.gov//federal-agencies/court-of-appeals-for-the-federal-circuit
name: Court of Federal Claims, url: https://www.usa.gov//federal-agencies/court-of-federal-claims
name: Court of International Trade, url: https://www.usa.gov//federal-agencies/court-of-international-trade
name: Customs and Border Protection, url: https://www.usa.gov//federal-agencies/u-s-customs-and-border-protection

Individual entry url for Court of Appeals for Veterans Claims: https://www.usa.gov//federal-agencies/u-s-court-of-appeals-for-veterans-claims
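
The URLs in that output contain a doubled slash after the host, which suggests the base ends with '/' and each href starts with '/', so plain concatenation keeps both. One possible way to avoid that, sketched here with the standard library's urljoin (the helper name absolute_url is made up for this example):

```python
from urllib.parse import urljoin

GOV_URLBASE = 'https://www.usa.gov/'


def absolute_url(href, base=GOV_URLBASE):
    """Resolve a site-relative href against the base URL."""
    return urljoin(base, href)


suffix = '/federal-agencies/u-s-capitol-police'
print(f'{GOV_URLBASE}{suffix}')   # plain concatenation keeps both slashes
print(absolute_url(suffix))       # urljoin resolves the href against the base
```

urljoin also leaves an href alone if it is already a full URL, which plain string concatenation would not.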
#7
Larz60+ has done a wonderful job writing this for you, but I think it's too complicated for something that can be done with a couple of lines (i.e. OOP etc. is overkill).
First of all, your code. The problem is on line 11.
Here it is with some small changes:
import requests
from bs4 import BeautifulSoup
from string import ascii_lowercase
 
base_url  = 'https://www.usa.gov/federal-agencies/'
for letter in ascii_lowercase:
    url = '{}{}'.format(base_url, letter)
    print(url)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    for ul in soup.find_all('ul', {'class' : 'one_column_bullet'}):
        print(ul)
Now the nice part:
If you inspect the page and what it loads, you will notice that it gets all the information as JSON.
So:
import requests

url = "https://www.usa.gov/ajax/federal-agencies/autocomplete"
resp = requests.get(url)
print(resp.json())
If you want, you can save the JSON response to a file. Either way, you get all 548 agencies and their respective URLs in one GET request, as JSON.
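
Saving the response to a file could look something like the sketch below. The helper names save_json and load_json and the file name agencies.json are made up for illustration, and the sample payload is a placeholder, since the actual structure of the endpoint's response is not shown in this thread.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Write data to path as JSON and return the path."""
    path = Path(path)
    with path.open('w', encoding='utf-8') as fp:
        json.dump(data, fp, indent=2)
    return path


def load_json(path):
    """Read a JSON file back into Python objects."""
    with Path(path).open(encoding='utf-8') as fp:
        return json.load(fp)


# Placeholder payload; the real response structure is an unknown here
sample = [{'name': 'Example Agency', 'url': '/federal-agencies/example-agency'}]

with tempfile.TemporaryDirectory() as tmp:
    saved = save_json(sample, Path(tmp) / 'agencies.json')
    print(load_json(saved) == sample)
```

With the request from the post, the call would be along the lines of save_json(resp.json(), 'agencies.json').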
#8
Buran,

With all respect for your concise response,
Thanks, Great way to start my day!
#9
(Jun-08-2018, 01:55 PM)Larz60+ Wrote: Thanks, Great way to start my day!
Sorry for any misstep :-)
#10
Not a problem.
It was like I was building a bridge over a body of water that already had stepping stones in place.