Jul-18-2021, 06:59 AM
Hi Larz60+,
I'm sorry for the very late reply.
Thank you for providing me with this cool sample code to extract the HTML locally for each page. I've tried to go through the code and I sort of understand it (I wouldn't be able to write it myself yet), and I've applied it to my test project and it works very well indeed.
I've copied the code into my notes to use if this scenario comes up again. It's nice having the local HTML to then play around with to extract other attributes, especially when practicing.
Thanks again mate and have a great week ahead.
(Jul-09-2021, 11:04 PM)Larz60+ Wrote: here's a better way, doesn't require excel or pandas, and can be reused for any sites of the type you mention.
I have included a sample which scrapes the bird pages for species listed (compare to your 'endpart')
you can try with your url's
the pages will be downloaded and placed in directory 'renderhtml', ready to be parsed with Beautifulsoup.
note that this class uses pathlib (a Python built-in) to create POSIX paths.
It uses the lxml parser and BeautifulSoup, which you may have to install:
(from command line):
pip install lxml
pip install Beautifulsoup4

Name this module: RenderUrl.py
import os
from pathlib import Path

import requests


class RenderUrl:
    def __init__(self, baseurl=None):
        self.base_url = baseurl

        # Create new savepath if needed
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.savepath = Path('.') / 'renderhtml'
        self.savepath.mkdir(exist_ok=True)

        # temp storage for url suffixes
        self.suffixlist = []

    def url_emmitter(self):
        n = len(self.suffixlist)
        i = 0
        while i < n:
            suffix = self.suffixlist[i]
            url = f"{self.base_url}{suffix}"
            yield url
            i += 1

    def get_pages(self, suffixlist, cache=False):
        self.suffixlist = suffixlist
        for url in self.url_emmitter():
            print(f"fetching: {url}")
            fname = (url.split('/')[-1]).replace('-', '_')
            filename = self.savepath / f"{fname}.html"
            if cache and filename.exists():
                with filename.open('rb') as fp:
                    page = fp.read()
            else:
                response = requests.get(url)
                if response.status_code == 200:
                    page = response.content
                    with filename.open('wb') as fp:
                        fp.write(page)

Here's how to use it (put both files in same directory):
TryRenderUrl.py
from RenderUrl import RenderUrl


def main():
    '''
    AudubonBirds:
    '''
    baseurl = "https://www.massaudubon.org/learn/nature-wildlife/birds/"
    birdlist = ['american-goldfinches', 'american-kestrels', 'american-robins',
                'bald-eagles', 'baltimore-orchard-o', 'baltimore-orchard-orioles',
                'birds-of-prey']
    rurl = RenderUrl(baseurl)
    rurl.get_pages(birdlist, cache=True)


if __name__ == '__main__':
    main()

then run:

python TryRenderUrl.py
Now look in the directory renderhtml
Output:
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-goldfinches
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-kestrels
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/american-robins
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/bald-eagles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-o
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/baltimore-orchard-orioles
fetching: https://www.massaudubon.org/learn/nature-wildlife/birds/birds-of-prey
you will find:

Output:
american_goldfinches.html
american_kestrels.html
american_robins.html
bald_eagles.html
baltimore_orchard_orioles.html
birds_of_prey.html

ready to be parsed.
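Once the pages are on disk you can parse any of them with BeautifulSoup. Here's a minimal sketch of what that might look like; the tags pulled out (title, h1) and the tiny inline HTML stand-in are just assumptions for illustration, so it runs on its own without the downloaded files:

from bs4 import BeautifulSoup

# Stand-in for the contents of a saved page; in practice you would read
# one of the files in renderhtml instead, e.g.:
#   from pathlib import Path
#   html = Path('renderhtml/american_goldfinches.html').read_text()
html = """
<html><head><title>American Goldfinch</title></head>
<body>
  <h1>American Goldfinch</h1>
  <p class="intro">A small finch.</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')  # or 'lxml' if installed
print(soup.title.string)                          # American Goldfinch
print(soup.find('p', class_='intro').get_text())  # A small finch.

Because the HTML is local, you can re-run this as often as you like while experimenting, without hitting the site again.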
Now try with your data.