Jul-09-2021, 11:04 PM
Here's a better way: it doesn't require Excel or pandas, and it can be reused for any site of the type you mention.
I have included a sample which scrapes the bird pages for the species listed (compare to your 'endpart'); you can try it with your URLs.
The pages will be downloaded and placed in the directory 'renderhtml', ready to be parsed with BeautifulSoup.
Note that this class uses pathlib (a Python built-in) to create POSIX paths.
It uses the lxml parser and BeautifulSoup, which you may have to install (from the command line):
pip install lxml
pip install beautifulsoup4
Name this module:
RenderUrl.py
import os
from pathlib import Path

import requests


class RenderUrl:
    def __init__(self, baseurl=None):
        self.base_url = baseurl
        # Create the save directory next to this script if it doesn't exist yet
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.savepath = Path('.') / 'renderhtml'
        self.savepath.mkdir(exist_ok=True)
        # temporary storage for URL suffixes
        self.suffixlist = []

    def url_emitter(self):
        # yield one full URL per suffix
        for suffix in self.suffixlist:
            yield f"{self.base_url}{suffix}"

    def get_pages(self, suffixlist, cache=False):
        self.suffixlist = suffixlist
        for url in self.url_emitter():
            print(f"fetching: {url}")
            fname = (url.split('/')[-1]).replace('-', '_')
            filename = self.savepath / f"{fname}.html"
            if cache and filename.exists():
                # already downloaded; read the cached copy instead of fetching again
                with filename.open('rb') as fp:
                    page = fp.read()
            else:
                response = requests.get(url)
                if response.status_code == 200:
                    page = response.content
                    with filename.open('wb') as fp:
                        fp.write(page)

Here's how to use it (put both files in the same directory):
TryRenderUrl.py
from RenderUrl import RenderUrl


def main():
    '''
    AudubonBirds:
    '''
    baseurl = "https://www.massaudubon.org/learn/nature-wildlife/birds/"
    birdlist = ['american-goldfinches', 'american-kestrels', 'american-robins',
                'bald-eagles', 'baltimore-orchard-o', 'baltimore-orchard-orioles',
                'birds-of-prey']
    rurl = RenderUrl(baseurl)
    rurl.get_pages(birdlist, cache=True)


if __name__ == '__main__':
    main()

Then run:
python TryRenderUrl.py
Output:
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-goldfinches
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-kestrels
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-robins
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/bald-eagles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-o
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-orioles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/birds-of-prey
Now look in the directory renderhtml; you will find:
Output:
american_goldfinches.html
american_kestrels.html
american_robins.html
bald_eagles.html
baltimore_orchard_orioles.html
birds_of_prey.html
ready to be parsed. Now try with your data.
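If it helps, here is a minimal sketch of that parsing step. Printing each page's <title> is just a placeholder; swap in whatever elements you actually want to extract:

from pathlib import Path

from bs4 import BeautifulSoup


def parse_saved_pages():
    # iterate over the files RenderUrl saved in 'renderhtml'
    for htmlfile in sorted(Path('renderhtml').glob('*.html')):
        with htmlfile.open('rb') as fp:
            soup = BeautifulSoup(fp.read(), 'lxml')
        # <title> is only an example of something to pull out
        if soup.title and soup.title.string:
            title = soup.title.string.strip()
        else:
            title = 'no title'
        print(f"{htmlfile.name}: {title}")


if __name__ == '__main__':
    parse_saved_pages()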