Nov-14-2019, 10:54 AM
If you examine the site HTML, you will see that most of the links are indexed with:
    <table border="5" cellpadding="3">
      <tbody>
        <tr>
          <td colspan="3" align="left">
            <table border="1" cellpadding="4">
              <tbody>
                <tr>
                  <th><a href="indexa-anat.htm">A - Anat</a></th>
                  <th><a href="indexanc.htm">Anc - Az</a></th>
                  <th><a href="indexb.htm">B - Br</a></th>
                  <th><a href="indexbulb.htm">Bulb - By</a></th>
                  <th><a href="indexc.htm">C - Cattleya</a></th>
                  <th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
                  <th><a href="indexco.htm">Co - Cz</a></th>
                  <th><a href="indexde.htm">D - Dendrob</a></th>
                  <th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
                  <th><a href="indexe-ep.htm">E - Epic</a></th>
                  <th><a href="indexepid-ex.htm">Epid - Ez</a></th>
                  <th><a href="indexfghijkl.htm">FG</a></th>
                  <th><a href="indexhi.htm">HI</a></th>
                  <th><a href="indexjkl.htm">JK</a></th>
                  <th><a href="indexjkl.htm#sec9">L</a></th>
                  <th><a href="indexm-masd.htm">M-Masd</a></th>
                  <th><a href="indexmast-max.htm">Mast-Max</a></th>
                  <th><a href="indexme.htm">Me - Ny</a></th>
                  <th><a href="indexo.htm">O</a></th>
                  <th><a href="indexor.htm">Or - Oz</a></th>
                  <th><a href="indexp-pf.htm">P - Pe</a></th>
                  <th><a href="indexph-pk.htm">Ph - Pi</a></th>
                  <th><a href="indexpl-pz.htm">Pl - Pz</a></th>
                  <th><a href="indexqrsel.htm">QRS - Sel</a></th>
                  <th><a href="indexser.htm">Ser - Sz</a></th>
                  <th><a href="indextuvwxyz.htm">T-Z</a></th>
                </tr>
              </tbody>
            </table>
          </td>
        </tr>
      </tbody>
    </table>

So you need to capture each of these links and navigate to the pages shown.
Each of these pages is for a particular species.
Then each of these pages will have to be scraped as well.
To get the list of (initial) links, you can use the following code:
    from bs4 import BeautifulSoup
    import requests


    class ScrapeOrchids:
        def __init__(self):
            self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
            self.links = {}
            self.get_initial_list()
            self.show_links()

        def get_initial_list(self):
            baseurl = 'http://www.orchidspecies.com/'
            response = requests.get(self.main_url)
            if response.status_code == 200:
                page = response.content
                soup = BeautifulSoup(page, 'lxml')
                # css_select link can be found using browser inspect element, then right click-->Copy-->CSS_Selector
                tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
                ths = tr.find_all('th')
                for th in ths:
                    self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
            else:
                print(f"Problem fetching {self.main_url}")

        def show_links(self):
            for key, value in self.links.items():
                print(f"{key}: {value}")


    if __name__ == '__main__':
        ScrapeOrchids()

Results:
Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm
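One thing visible in the output above: JK and L resolve to the same page (indexjkl.htm) and differ only by the #sec9 fragment. If you loop over these links to fetch each page, you may want to drop the fragment and skip repeats so that page is only downloaded once. A minimal sketch using only the standard library; `unique_pages` is a made-up helper name for illustration, and the sample dict just copies three entries from the output:

```python
from urllib.parse import urldefrag


def unique_pages(links):
    """Return page URLs without '#fragment' parts, in first-seen order.

    `links` is a dict like ScrapeOrchids.links (label -> URL).
    unique_pages is a hypothetical helper, not part of the code above.
    """
    seen = set()
    pages = []
    for url in links.values():
        page, _fragment = urldefrag(url)  # split off anything after '#'
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


if __name__ == '__main__':
    sample = {
        'HI': 'http://www.orchidspecies.com/indexhi.htm',
        'JK': 'http://www.orchidspecies.com/indexjkl.htm',
        'L': 'http://www.orchidspecies.com/indexjkl.htm#sec9',
    }
    for page in unique_pages(sample):
        print(page)
```

If you still need to know which section of the page a label points at, keep the original dict around as well; this helper only decides which pages to fetch.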
You may need to install lxml:
    pip install lxml
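For the next step (navigating to each of the pages and collecting the links found there), something along these lines could be a starting point. Treat it as a sketch under a few assumptions: `LinkCollector`, `collect_links`, `fetch` and `scrape_index_pages` are names I made up, the parsing uses the standard library's `html.parser` so that part runs without extra installs (BeautifulSoup's `find_all('a')` would do the same job), and it simply grabs every `<a href=...>` on a page. I have not checked how the links are laid out on each page, so you will most likely need to filter the result.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) for every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # urljoin resolves relative hrefs against the page URL
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def collect_links(page_url, html):
    """Parse one page and return its links as (text, url) tuples."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page with requests; return None if the status is not 200."""
    import requests  # imported here so the parsing part works without it
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    print(f"Problem fetching {url}")
    return None


def scrape_index_pages(index_urls, fetch=fetch):
    """Map each page URL to the list of links found on that page."""
    results = {}
    for url in index_urls:
        html = fetch(url)
        if html is not None:
            results[url] = collect_links(url, html)
    return results
```

You would call it with the URLs gathered earlier, for example `scrape_index_pages(scraper.links.values())` where `scraper` is a `ScrapeOrchids` instance. Passing `fetch` as a parameter is just a convenience so you can try the parsing on saved HTML without hitting the site each time.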