Web Scraping on href text

Superzaffo · Nov-13-2019, 10:32 PM

Hello everyone.
I'm new to Python, but I would like to use it to scrape a website.
I need to search for a name within the text of an anchor and, if the string is found, take the href link and open that page.
This is the part to look for:

<P> <LI> <A href="earinaaestivalis.htm"> Earina aestivalis Cheeseman 1919 <A/> <P> <img src="orphotdir/scent.jpg"> <img src="orphotdir/partialshade.jpg"> <img src="orphotdir/tempint.jpg"> MID <img src="orphotdir/spring.jpg"> THROUGH <img src="orphotdir/summer.jpg">

I have to look for the name Earina in the text Earina aestivalis Cheeseman 1919 and, if I find it, open the page in the href.
I wrote some scripts with BeautifulSoup, and they work when I search for the exact anchor text, but I can't match a name that is only part of the full text.
Who can help me?
Thanks

**Larz60+** · Nov-14-2019, 12:30 AM

Start with
Web scraping part 1
Web scraping Part 2

Superzaffo · (This post was last modified: Nov-14-2019, 10:15 AM by Larz60+.)

Thank you
I wrote this code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.orchidspecies.com/indexe-ep.htm')
bs = BeautifulSoup(html, 'html.parser')
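# Collect every anchor on the page
images = bs.find_all('a')
for image in images:
    # .text is the anchor's text plus the text of any tags nested inside it
    prova = image.text
    if re.search('Earina sigmoidea', prova):
        # print(image.text)
        print(image.get('href'))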

But the result is not entirely correct, because the name 'Earina sigmoidea' appears only once on the page.
Why is this?

This is the result:

Output:
earinaaestivalis.htm
earsigmoidea.htm
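
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the markup quoted earlier ends the anchor with `<A/>` instead of `</A>`. If the page really contains that, `html.parser` would likely read `<A/>` as a new empty tag, leave the first anchor open, and nest later anchors inside it, so `.text` on the first anchor would also include the text of later links. A minimal sketch of a workaround is to compare against only the strings that are direct children of each anchor. The HTML string and the helper name `own_text` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet imitating the quoted markup: the first anchor is
# "closed" with <a/> instead of </a>.
html = (
    '<li><a href="earinaaestivalis.htm"> Earina aestivalis Cheeseman 1919 <a/>'
    '<p><img src="orphotdir/scent.jpg"> MID'
    '<li><a href="earsigmoidea.htm"> Earina sigmoidea </a>'
)


def own_text(tag):
    """Join only the strings that are direct children of tag (hypothetical helper)."""
    return ''.join(c for c in tag.children if isinstance(c, NavigableString))


bs = BeautifulSoup(html, 'html.parser')

# Keep the href of each anchor whose own text (not nested text) matches.
matches = [
    a.get('href')
    for a in bs.find_all('a')
    if re.search('Earina sigmoidea', own_text(a))
]
print(matches)
```

Because nested anchors contribute nothing to `own_text`, only the link that actually carries the name should be reported, whether or not the parser nests the anchors.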

Malt · Nov-14-2019, 10:27 AM

Interesting. It really is returning both URLs.

**Larz60+** · Nov-14-2019, 10:54 AM

If you examine the site's HTML, you will see that most of the links are indexed with:

<table border="5" cellpadding="3">
  <tbody>
    <tr>

      <td colspan="3" align="left">
      <table border="1" cellpadding="4">
        <tbody>
          <tr>

            <th><a href="indexa-anat.htm">A - Anat</a></th>
            <th><a href="indexanc.htm">Anc - Az</a></th>
            <th><a href="indexb.htm">B - Br</a></th>
            <th><a href="indexbulb.htm">Bulb - By</a></th>

            <th><a href="indexc.htm">C - Cattleya</a></th>
            <th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
            <th><a href="indexco.htm">Co - Cz</a></th>
            <th><a href="indexde.htm">D - Dendrob</a></th>
            <th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
            <th><a href="indexe-ep.htm">E - Epic</a></th>

            <th><a href="indexepid-ex.htm">Epid - Ez</a></th>
            <th><a href="indexfghijkl.htm">FG</a></th>
            <th><a href="indexhi.htm">HI</a></th>
            <th><a href="indexjkl.htm">JK</a></th>
            <th><a href="indexjkl.htm#sec9">L</a></th>
            <th><a href="indexm-masd.htm">M-Masd</a></th>

            <th><a href="indexmast-max.htm">Mast-Max</a></th>
            <th><a href="indexme.htm">Me - Ny</a></th>
            <th><a href="indexo.htm">O</a></th>
            <th><a href="indexor.htm">Or - Oz</a></th>
            <th><a href="indexp-pf.htm">P - Pe</a></th>
            <th><a href="indexph-pk.htm">Ph - Pi</a></th>

            <th><a href="indexpl-pz.htm">Pl - Pz</a></th>
            <th><a href="indexqrsel.htm">QRS - Sel</a></th>
            <th><a href="indexser.htm">Ser - Sz</a></th>
            <th><a href="indextuvwxyz.htm">T-Z</a></th>
          </tr>
        </tbody>

      </table>

      </td>
    </tr>
  </tbody>
</table>

So you need to capture each of these links and navigate to the pages shown.
Each of these pages is an index for one alphabetical range of species.
Then each of these pages will have to be scraped as well.
To get the list of (initial) links, you can use the following code:

from bs4 import BeautifulSoup
import requests


class ScrapeOrchids:
    def __init__(self):
        self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
        self.links = {}
        self.get_initial_list()
        self.show_links()
    
    def get_initial_list(self):
        baseurl = 'http://www.orchidspecies.com/'
        response = requests.get(self.main_url)
        if response.status_code == 200:
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # css_select link can be found using browser inspect element, then right click-->Copy-->CSS_Selector
            tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
            ths = tr.find_all('th')
            for th in ths:
                self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
        else:
            print(f"Problem fetching {self.main_url}")

    def show_links(self):
        for key, value in self.links.items():
            print(f"{key}: {value}")


if __name__ == '__main__':
    ScrapeOrchids()

Results:

Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm

You may need to install lxml: pip install lxml
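
The code above builds absolute URLs by concatenating `baseurl` with each `href`. A hedged alternative is the standard library's `urllib.parse.urljoin`, which resolves a relative link against the URL of the page it was found on and keeps fragments such as `#sec9`. The `hrefs` list is copied from the table shown earlier, and nothing here touches the network.

```python
from urllib.parse import urljoin

# URL of the index page the links were found on (taken from the thread).
page_url = 'http://www.orchidspecies.com/indexe-ep.htm'

# A few href values copied from the navigation table.
hrefs = ['indexa-anat.htm', 'indexjkl.htm#sec9', 'indextuvwxyz.htm']

# Resolve each relative href against the page it came from.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

This also keeps working if a page ever links to a subdirectory or uses a full URL, cases where plain string concatenation would produce a wrong address.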

Superzaffo · Nov-14-2019, 08:20 PM

Thank you all.
My project is to get an image of each orchid species, starting from a list of plants.
The list is in Excel format (.xls).
If an image is found, I want to save it in the Excel file next to the species name.
Please help me: I don't know Python, but I can see it is the best tool for web scraping.

Superzaffo · Nov-14-2019, 10:05 PM

Sorry, Larz60+.
I analyzed the page with Chrome but can't find the TH tag. Why?
What do you use to analyze the web page?

**Larz60+** · (This post was last modified: Nov-15-2019, 02:34 AM by Larz60+.)

I used Firefox's Inspect Element:

Right-click the element on the page and choose Inspect Element
Right-click again on the highlighted HTML line
Click Copy
Click CSS Selector
Paste the selector: tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0] to get tr
Use ths = tr.find_all('th') to get the list of th tags (note that this is all case sensitive)
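
A note on the steps above, offered as a sketch and not a diagnosis of the Chrome question: browsers often insert `<tbody>` elements into the inspected DOM that may not exist in the raw HTML, so a long copied selector can end up matching nothing. A shorter selector that targets anchors directly inside `th` cells avoids depending on the exact nesting. The HTML below is a cut-down, hypothetical version of the navigation table.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the navigation table; note there is no <tbody>.
html = """
<table border="5"><tr><td>
  <table border="1"><tr>
    <th><a href="indexa-anat.htm">A - Anat</a></th>
    <th><a href="indexe-ep.htm">E - Epic</a></th>
  </tr></table>
</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# 'th > a[href]' matches any anchor with an href that is a direct child of a th.
links = {a.get_text(strip=True): a['href'] for a in soup.select('th > a[href]')}
print(links)
```

The trade-off is that a loose selector would also pick up any other `th > a` on the page, so it is worth checking the result against what the browser shows.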

Superzaffo · Nov-15-2019, 08:10 AM

Thank you Larz60+.
Please, could you help me with the project?
As written above, I need to search the pages for a species name, enter the link and download the photo.
I'm studying Python, but I'm still struggling with scraping.
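
The task described here (find a species name in the index, follow its link, collect the photo URLs) could be sketched roughly as below. Everything in it is an assumption for illustration: the function names, the two HTML strings standing in for downloaded pages, and the file name `earsigmoidea.jpg`. A real run would fetch each page with `requests.get` and write the image bytes to disk.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup, NavigableString

INDEX_URL = 'http://www.orchidspecies.com/indexe-ep.htm'

# Hypothetical stand-ins for the index page and one species page.
index_html = (
    '<li><a href="earinaaestivalis.htm">Earina aestivalis Cheeseman 1919</a>'
    '<li><a href="earsigmoidea.htm">Earina sigmoidea</a>'
)
species_html = '<img src="orphotdir/earsigmoidea.jpg"><img src="orphotdir/scent.jpg">'


def find_species_link(html, name, page_url):
    """Return the absolute URL of the first anchor whose own text contains name."""
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        # Look only at strings directly inside the anchor, not nested tags.
        own = ''.join(c for c in a.children if isinstance(c, NavigableString))
        if name.lower() in own.lower():
            return urljoin(page_url, a['href'])
    return None


def image_urls(html, page_url):
    """Return absolute URLs for every <img> on a page that has a src."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(page_url, img['src']) for img in soup.find_all('img', src=True)]


species_url = find_species_link(index_html, 'Earina sigmoidea', INDEX_URL)
photos = image_urls(species_html, species_url)
print(species_url)
print(photos)
```

On the real site the species pages also contain small icon images such as `scent.jpg`, so some filtering of the returned list would presumably be needed before saving anything.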

**Larz60+** · Nov-15-2019, 12:43 PM

Perhaps tonight or tomorrow. I have a regular occupation that requires my attention during a good part of each day. The forum has 29,000+ users per day and we are all volunteers here, so just answering questions requires a lot of time.
