Hello to all.
I'm new to Python, but I would like to use it to scrape a website.
I need to search for a name within the text part of an anchor and, if the string is found, take the href link to open the page.
This is the part to look for:
<P> <LI> <A href="earinaaestivalis.htm">Earina aestivalis Cheeseman 1919<A/> <P> <img src="orphotdir/scent.jpg"> <img src="orphotdir/partialshade.jpg"> <img src="orphotdir/tempint.jpg"> MID <img src="orphotdir/spring.jpg"> THROUGH <img src="orphotdir/summer.jpg">
I have to look for the name Earina in the text Earina aestivalis Cheeseman 1919 and, if I find it, open the page in the href.
I wrote some scripts with BeautifulSoup, and it works fine when I search for the exact inner text, but I can't match a partial string within the complete text.
Who can help me?
Thanks
Thank you
I wrote this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.orchidspecies.com/indexe-ep.htm')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('a')
for image in images:
    prova = image.text
    if re.search('Earina sigmoidea', prova):
        #print(image.text)
        print(image.get('href'))
But the result is not entirely correct, because the name 'Earina sigmoidea' appears only once on the page. Why is this?
This is the result.
Output:
earinaaestivalis.htm
earsigmoidea.htm
Interesting. Either way, it is returning both URLs. This is probably because the anchors on that page are not closed properly (note the <A/> in the snippet above), so html.parser keeps the earlier anchor open and 'Earina sigmoidea' ends up inside the .text of more than one <a>.
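If that is the cause, a minimal sketch of a workaround (untested against the live page) is to match only the text that sits directly inside each anchor, so text swallowed from later, unclosed anchors is ignored:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.orchidspecies.com/indexe-ep.htm')
bs = BeautifulSoup(html, 'html.parser')

for anchor in bs.find_all('a'):
    # join only the strings that are direct children of this <a>,
    # not text that belongs to anchors nested inside it
    direct_text = ''.join(anchor.find_all(string=True, recursive=False))
    if re.search('Earina sigmoidea', direct_text):
        print(anchor.get('href'))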

If you examine the site HTML, you will see that most of the links are indexed with:
<table border="5" cellpadding="3">
<tbody>
<tr>
<td colspan="3" align="left">
<table border="1" cellpadding="4">
<tbody>
<tr>
<th><a href="indexa-anat.htm">A - Anat</a></th>
<th><a href="indexanc.htm">Anc - Az</a></th>
<th><a href="indexb.htm">B - Br</a></th>
<th><a href="indexbulb.htm">Bulb - By</a></th>
<th><a href="indexc.htm">C - Cattleya</a></th>
<th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
<th><a href="indexco.htm">Co - Cz</a></th>
<th><a href="indexde.htm">D - Dendrob</a></th>
<th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
<th><a href="indexe-ep.htm">E - Epic</a></th>
<th><a href="indexepid-ex.htm">Epid - Ez</a></th>
<th><a href="indexfghijkl.htm">FG</a></th>
<th><a href="indexhi.htm">HI</a></th>
<th><a href="indexjkl.htm">JK</a></th>
<th><a href="indexjkl.htm#sec9">L</a></th>
<th><a href="indexm-masd.htm">M-Masd</a></th>
<th><a href="indexmast-max.htm">Mast-Max</a></th>
<th><a href="indexme.htm">Me - Ny</a></th>
<th><a href="indexo.htm">O</a></th>
<th><a href="indexor.htm">Or - Oz</a></th>
<th><a href="indexp-pf.htm">P - Pe</a></th>
<th><a href="indexph-pk.htm">Ph - Pi</a></th>
<th><a href="indexpl-pz.htm">Pl - Pz</a></th>
<th><a href="indexqrsel.htm">QRS - Sel</a></th>
<th><a href="indexser.htm">Ser - Sz</a></th>
<th><a href="indextuvwxyz.htm">T-Z</a></th>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
So you need to capture each of these links and navigate to the pages shown; each of those pages then lists the individual species.
Each of those pages will have to be scraped as well (see the sketch after the output below).
To get the list of (initial) links, you can use the following code:
from bs4 import BeautifulSoup
import requests


class ScrapeOrchids:
    def __init__(self):
        self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
        self.links = {}

        self.get_initial_list()
        self.show_links()

    def get_initial_list(self):
        baseurl = 'http://www.orchidspecies.com/'
        response = requests.get(self.main_url)
        if response.status_code == 200:
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # css_select link can be found using browser inspect element, then right click-->Copy-->CSS_Selector
            tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
            ths = tr.find_all('th')
            for th in ths:
                self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
        else:
            print(f"Problem fetching {self.main_url}")

    def show_links(self):
        for key, value in self.links.items():
            print(f"{key}: {value}")


if __name__ == '__main__':
    ScrapeOrchids()
Results:
Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm
You may need to install lxml:
pip install lxml
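As a rough sketch of that next step (the helper name find_species_link is just for illustration, and I'm assuming the species entries on each index page are plain <a href="...htm"> links whose own text contains the name, as in the snippet at the top of the thread):

from bs4 import BeautifulSoup
import requests

BASE = 'http://www.orchidspecies.com/'

def find_species_link(index_url, name):
    """Return the absolute href of the first anchor whose own text contains name."""
    response = requests.get(index_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'lxml')
    for anchor in soup.find_all('a', href=True):
        # use only the anchor's direct text, since the site's anchors are not closed properly
        direct_text = ''.join(anchor.find_all(string=True, recursive=False))
        if name.lower() in direct_text.lower():
            return BASE + anchor['href']
    return None

print(find_species_link(BASE + 'indexe-ep.htm', 'Earina aestivalis'))

From there, the species page itself would still need to be parsed to locate the photo.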
Thank you all.
My project is to get the image of each orchid species, starting from a list of plants.
This list is in Excel format (.xls).
If the image is found, save it in the Excel file next to the species name.
Please help me, because I don't know Python yet, but I can see it is the best tool for web scraping.
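For the Excel side, a minimal sketch of the reading step, assuming pandas is available and that the file and column names below exist (both are invented for illustration; the old .xls format also needs the xlrd package):

import pandas as pd

# 'species.xls' and the 'Species' column are placeholders for the real file
df = pd.read_excel('species.xls')
for name in df['Species']:
    print(name)   # each of these names would then be searched for on the site

Writing an image back into the sheet next to each name can be done with openpyxl, which works with .xlsx rather than .xls files.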
Sorry Larz60+,
I have analyzed the code with Chrome but I don't find the TH tag... Why?
What do you use to analyze the web page?
I used Firefox Inspect Element:
- right click on the page element to highlight it, then choose Inspect Element
- right click again on the html line
- click copy
- click CSS Selector
- paste
tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')
to get the tr, then use
ths = tr.find_all('th')
to get the th list (note this is all case sensitive)
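Based on the nested-table HTML quoted earlier in the thread, a shorter selector should reach the same <th> anchors without copying the long path from the browser (untested beyond that snippet):

from bs4 import BeautifulSoup
import requests

BASE = 'http://www.orchidspecies.com/'
soup = BeautifulSoup(requests.get(BASE + 'indexe-ep.htm').content, 'lxml')

# 'table table th a' matches the <a> tags inside <th> cells of the inner table
links = {a.get_text(strip=True): BASE + a['href']
         for a in soup.select('table table th a')}
print(links)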
Thank you Larz60+.
Please, could you help me with the project?
As written above, I need to search the pages for a species name, follow the link, and download the photo.
I'm studying Python, but I'm still a mess at scraping.
Perhaps tonight or tomorrow. I have a regular occupation that requires my attention during a good part of each day. The forum has 29,000+ users per day and we are all volunteers here, so just answering questions requires a lot of time.