getting links from webpage and store it into an array - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: getting links from webpage and store it into an array (/thread-33706.html)
getting links from webpage and store it into an array - apollo - May-22-2021

Dear Python experts,

first of all: I hope that you are all right and everything is going well on your end.

I am currently attempting to gather some data from the Failed Bank List: this list includes banks which have failed since October 1, 2000. I am getting it from the website below:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/

With the approach below I extract the links from the base page into a nice little array. Besides that, I want to open all the links and gather a little piece of information from the (subsequently linked) sub-pages.

My approach: to repeatedly extract links out of the target page I use the function below.

Setup: I run Anaconda on Win 10 with Python 3.8.5 and BS4 (version 4.8.2).

from bs4 import BeautifulSoup
import requests
import re

def getLinks(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    links = []
    # it will scrape all the a tags, and for each a tag,
    # it will append the href attribute to the links list
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

print(getLinks("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"))

dataset: the page

html_page = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"

contains rows like the following, which hold the information about the banks and the towns they are located in. See the dataset:

Quote:Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 | 10538

Note: what is aimed at is to gather the data out of the sub-pages; therefore I need a parser that loops through the sub-pages, e.g. like the following:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/almenastate.html
https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/firstcitybank.html
https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/fsb-wv.html

and so forth.

Btw, currently I am getting back the error:

ModuleNotFoundError: No module named 'BeautifulSoup'

although I run BeautifulSoup4 version 4.8.2.

After fixing this issue I want to get all the info out of the Failed Bank List, cf.:

https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/firstcitybank.html

Quote:Failed Bank Information for First City Bank of Florida, Fort Walton Beach, FL

This is the tag:

<div class="usa-layout desktop:grid-col-12">
  <p class="fbankcategory">Failed Bank List</p> <!-- don't touch -->
  <!--Failed Bank Title-->
  <h1 class="fbanktitle">Failed Bank Information for First City Bank of Florida, Fort Walton Beach, FL</h1> <!-- update -->
  <div class="fbankgrayborder"></div> <!-- don't touch -->
  <p class="fbankdescription"><!-- update -->
  On Friday, October 16, 2020, First City Bank of Florida was closed by the Florida Office of Financial Regulation. The FDIC was named Receiver. No advance notice is given to the public when a financial institution is closed. United Fidelity Bank, fsb, Evansville, IN acquired all deposit accounts and substantially all the assets.
  All shares of stock were owned by the holding company, which was not involved in this transaction.</p>
</div>

At the moment I need to fix the first issue: I am getting back the error

ModuleNotFoundError: No module named 'BeautifulSoup'

although I run BeautifulSoup4 version 4.8.2.

After having fixed this I will have a closer look at how to get the combination of a. gathering the links on the first page and b. collecting the piece of data that is on the second page ...

RE: getting links from webpage and store it into an array - perfringo - May-22-2021

Is there a question? It appears that the problem is related to a missing module and not the actual code (the package is installed as beautifulsoup4, but the module is imported as bs4, not as BeautifulSoup).

Some observations nevertheless.

Is there a particular need to ignore the function naming convention set in PEP 8 ("Function names should be lowercase, with words separated by underscores as necessary to improve readability.")?

Is there a need to use re? Every time I see re in the context of web parsing it reminds me of this legendary StackOverflow answer. My take (without re) would be something along those lines:

import requests
import bs4 as bs

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
response = requests.get(url)
soup = bs.BeautifulSoup(response.text, 'lxml')
table = soup.find('table')

links = []
for row in table.find_all('tr'):
    for link in row.find_all('a'):
        links.append(f'{url}{link.get("href")}')

Just for the fun of irritating others (and your future self, if you return to this code in, say, a couple of weeks), much of it can be condensed into two rows:

table = bs.BeautifulSoup(requests.get(url).text, 'lxml').find('table')
links = [f'{url}{link.get("href")}' for row in table.find_all('tr') for link in row.find_all('a')]

In a real-life scenario I would probably write a generator function instead of constructing a list just for a single iteration.
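[Editor's sketch] The generator variant mentioned above could look roughly like this; the function name `iter_links` is my own, and the HTML-parsing logic simply mirrors the table/row/link walk from the answer (using the built-in 'html.parser' instead of lxml):

```python
import bs4 as bs

def iter_links(html, base_url):
    """Yield absolute sub-page links from the overview table, one at a
    time, instead of building the whole list in memory first."""
    table = bs.BeautifulSoup(html, 'html.parser').find('table')
    for row in table.find_all('tr'):
        for link in row.find_all('a'):
            # the hrefs in the table are relative, so prepend the base URL
            yield f'{base_url}{link.get("href")}'

# usage (the network call stays outside the generator, which also makes
# it easy to test against a saved copy of the page):
# for sub_url in iter_links(requests.get(url).text, url):
#     ...
```

Keeping the `requests.get` call outside the generator is a deliberate choice: the parsing logic can then be exercised on any HTML string.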
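[Editor's sketch] Once the import issue is sorted out, the combination apollo asks about (a. gather the links from the first page, b. collect the piece of data from each sub-page) could be sketched as follows. The function names `get_bank_info` and `scrape_failed_banks` are my own, the class names `fbanktitle` and `fbankdescription` are taken from the markup quoted in the thread, and this has not been tested against the live site:

```python
import requests
from bs4 import BeautifulSoup

def get_bank_info(html):
    """Extract the title and description from one failed-bank sub-page,
    based on the class names shown in the quoted markup."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1', class_='fbanktitle')
    desc = soup.find('p', class_='fbankdescription')
    return (title.get_text(strip=True) if title else None,
            desc.get_text(strip=True) if desc else None)

def scrape_failed_banks(base_url):
    """a. gather the sub-page links from the overview table,
    b. fetch each sub-page and pull out the bank information."""
    soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')
    results = []
    for link in soup.find('table').find_all('a'):
        sub_url = f'{base_url}{link.get("href")}'
        title, description = get_bank_info(requests.get(sub_url).text)
        results.append({'url': sub_url,
                        'title': title,
                        'description': description})
    return results
```

Splitting the parsing of a single sub-page into its own function keeps it testable without any network access.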