getting links from webpage and store it into an array - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: getting links from webpage and store it into an array (/thread-33706.html)
getting links from webpage and store it into an array - apollo - May-22-2021

Dear Python experts,

first of all: I hope that you are all right and everything is going well on your end.

I am currently attempting to gather some data from the Failed Bank List: this list includes banks which have failed since October 1, 2000. I am getting it from the website below:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/

With the approach below I extract the links from the base page into a nice little array. Besides that, I want to open all the links and gather a little piece of information from the (subsequently linked) sub-pages.

My approach: to repeatedly extract links out of the target page I use the function below.

Setup: I run Anaconda on Win 10 with Python 3.8.5 and BS4 (version 4.8.2).

from bs4 import BeautifulSoup
import requests
import re

def getLinks(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = []
    # it will scrape all the a tags, and for each a tag,
    # it will append the href attribute to the links list
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

print(getLinks("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"))

dataset: the page

html_page = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"

contains rows like the following, which hold the information about the banks and the towns they are located in. See the dataset:

Quote:Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 | 10538

Note: what is aimed at is to gather the data out of the sub-pages; therefore I need a parser that loops through the sub-pages, e.g. like the following:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/almenastate.html
https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/firstcitybank.html
https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/fsb-wv.html

and so forth.

Btw, currently I am getting back the error:

ModuleNotFoundError: No module named 'BeautifulSoup'

although I run BeautifulSoup4 version 4.8.2.

After fixing this issue I want to get all the info out of the Failed Bank List, cf.:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/firstcitybank.html

Quote:Failed Bank Information for First City Bank of Florida, Fort Walton Beach, FL

This is the tag:

<div class="usa-layout desktop:grid-col-12">
  <p class="fbankcategory">Failed Bank List</p> <!-- don't touch -->
  <!--Failed Bank Title-->
  <h1 class="fbanktitle">Failed Bank Information for First City Bank of Florida, Fort Walton Beach, FL</h1> <!-- update -->
  <div class="fbankgrayborder"></div> <!-- don't touch -->
  <p class="fbankdescription"><!-- update -->
  On Friday, October 16, 2020, First City Bank of Florida was closed by the Florida Office of Financial Regulation. The FDIC was named Receiver. No advance notice is given to the public when a financial institution is closed. United Fidelity Bank, fsb, Evansville, IN acquired all deposit accounts and substantially all the assets.
  All shares of stock were owned by the holding company, which was not involved in this transaction.</p>
</div>

At the moment I need to fix the first issue: I am getting back the error

ModuleNotFoundError: No module named 'BeautifulSoup'

although I run BeautifulSoup4 version 4.8.2.

After having fixed this I will have a closer look at how to get the combination of a. gathering the links on the first page and b. collecting the piece of data that is on the second page ...

RE: getting links from webpage and store it into an array - perfringo - May-22-2021

Is there a question? It appears that the problem is related to a missing module and not the actual code (the package is installed as beautifulsoup4, but the module is imported as bs4, not as BeautifulSoup).

Some observations nevertheless.

Is there a particular need to ignore the function naming convention set in PEP 8 ("Function names should be lowercase, with words separated by underscores as necessary to improve readability.")?

Is there a need to use re? Every time I see re in the context of web parsing it reminds me of this legendary StackOverflow answer. My take (without re) would be something along those lines:

import requests
import bs4 as bs

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
response = requests.get(url)
soup = bs.BeautifulSoup(response.text, 'lxml')
table = soup.find('table')

links = []
for row in table.find_all('tr'):
    for link in row.find_all('a'):
        links.append(f'{url}{link.get("href")}')

Just for the fun of irritating others (and your future self, if you return to this code in, say, a couple of weeks), much of it can be condensed into two rows:

table = bs.BeautifulSoup(requests.get(url).text, 'lxml').find('table')
links = [f'{url}{link.get("href")}' for row in table.find_all('tr') for link in row.find_all('a')]

In a real-life scenario I would probably write a generator function instead of constructing a list just for a single iteration.
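[Editor's sketch] The generator variant mentioned above could look roughly like this; the function name `iter_links` is my own, and the HTML-parsing logic simply mirrors the table/row/link walk from the answer (using the built-in 'html.parser' instead of lxml):

```python
import bs4 as bs

def iter_links(html, base_url):
    """Yield absolute sub-page links from the overview table, one at a
    time, instead of building the whole list in memory first."""
    table = bs.BeautifulSoup(html, 'html.parser').find('table')
    for row in table.find_all('tr'):
        for link in row.find_all('a'):
            # the hrefs in the table are relative, so prepend the base URL
            yield f'{base_url}{link.get("href")}'

# usage (the network call stays outside the generator, which also makes
# it easy to test against a saved copy of the page):
# for sub_url in iter_links(requests.get(url).text, url):
#     ...
```

Keeping the `requests.get` call outside the generator is a deliberate choice: the parsing logic can then be exercised on any HTML string.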
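[Editor's sketch] Once the import issue is sorted out, the combination apollo asks about (a. gather the links from the first page, b. collect the piece of data from each sub-page) could be sketched as follows. The function names `get_bank_info` and `scrape_failed_banks` are my own, the class names `fbanktitle` and `fbankdescription` are taken from the markup quoted in the thread, and this has not been tested against the live site:

```python
import requests
from bs4 import BeautifulSoup

def get_bank_info(html):
    """Extract the title and description from one failed-bank sub-page,
    based on the class names shown in the quoted markup."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='fbanktitle')
    desc = soup.find('p', class_='fbankdescription')
    return (title.get_text(strip=True) if title else None,
            desc.get_text(strip=True) if desc else None)

def scrape_failed_banks(base_url):
    """a. gather the sub-page links from the overview table,
    b. fetch each sub-page and pull out the bank information."""
    soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')
    results = []
    for link in soup.find('table').find_all('a'):
        sub_url = f'{base_url}{link.get("href")}'
        title, description = get_bank_info(requests.get(sub_url).text)
        results.append({'url': sub_url,
                        'title': title,
                        'description': description})
    return results
```

Splitting the parsing of a single sub-page into its own function keeps it testable without any network access.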