Python Forum
Web Scraping on href text
#1
Hello everyone.
I'm new to Python, but I would like to use it to scrape a website.
I need to search for a name within the text of an anchor and, if the string is found, take the href link and open that page.
This is the part to look for:

<P> <LI> <A href="earinaaestivalis.htm"> Earina aestivalis Cheeseman 1919 <A/> <P> <img src = "orphotdir/scent.jpg"> <img src = "orphotdir/partialshade.jpg" > <img src = "orphotdir/tempint.jpg"> MID <img src = "orphotdir/spring.jpg"> THROUGH <img src = "orphotdir/summer.jpg">

I have to look for the name Earina in the text "Earina aestivalis Cheeseman 1919" and, if I find it, open the page given in the href.
I wrote some scripts with BeautifulSoup, and they work when I search for the exact anchor text, but I can't match a name that is only part of the full text.
Who can help me?
Thanks
Reply
#2
Start with
Web scraping part 1
Web scraping Part 2
Reply
#3
Thank you
I wrote this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.orchidspecies.com/indexe-ep.htm')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('a')
for image in images: 
  prova=image.text
  if re.search('Earina sigmoidea', prova):
    #print(image.text)
    print(image.get('href')) 
But the result is not entirely correct, because the name 'Earina sigmoidea' appears only once on the page, yet two links are returned.
Why is that?

This is the result:
Output:
earinaaestivalis.htm
earsigmoidea.htm
Reply
#4
Interesting. It really is returning both URLs.
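One possible explanation, which is only an assumption since nobody in the thread confirms it: if the page does not close its <a> tags properly, a lenient parser can nest the second link inside the first. The text of the outer link then also contains the text of the inner link, so both links match the search. The sketch below illustrates the idea on a made-up two-link snippet. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs, and LinkCollector and find_links are hypothetical helper names. A new <a> is treated as closing any link that is still open, so each href only gets its own text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs, giving each link only its own text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the link currently open, if any
        self._parts = []     # text pieces seen inside that link

    def _close_link(self):
        # Store the open link (if any) with whitespace normalised.
        if self._href is not None:
            text = " ".join("".join(self._parts).split())
            self.links.append((self._href, text))
        self._href, self._parts = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._close_link()  # a new <a> ends any unclosed one
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_link()

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._close_link()  # flush a link left open at end of input


def find_links(html, name):
    """Return the hrefs of all links whose own text contains name."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if name in text]


# Made-up sample: the first link is never closed.
sample = (
    '<li><a href="earinaaestivalis.htm">Earina aestivalis'
    '<li><a href="earsigmoidea.htm">Earina sigmoidea</a>'
)
print(find_links(sample, "Earina sigmoidea"))  # prints ['earsigmoidea.htm']
```

If this is indeed the cause, the same idea should carry over to the BeautifulSoup version: compare against the text that belongs directly to each link rather than the text of everything nested inside it.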
Reply
#5
If you examine the site HTML, you will see that most of the links are indexed with:
<table border="5" cellpadding="3">
  <tbody>
    <tr>

      <td colspan="3" align="left">
      <table border="1" cellpadding="4">
        <tbody>
          <tr>

            <th><a href="indexa-anat.htm">A - Anat</a></th>
            <th><a href="indexanc.htm">Anc - Az</a></th>
            <th><a href="indexb.htm">B - Br</a></th>
            <th><a href="indexbulb.htm">Bulb - By</a></th>

            <th><a href="indexc.htm">C - Cattleya</a></th>
            <th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
            <th><a href="indexco.htm">Co - Cz</a></th>
            <th><a href="indexde.htm">D - Dendrob</a></th>
            <th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
            <th><a href="indexe-ep.htm">E - Epic</a></th>

            <th><a href="indexepid-ex.htm">Epid - Ez</a></th>
            <th><a href="indexfghijkl.htm">FG</a></th>
            <th><a href="indexhi.htm">HI</a></th>
            <th><a href="indexjkl.htm">JK</a></th>
            <th><a href="indexjkl.htm#sec9">L</a></th>
            <th><a href="indexm-masd.htm">M-Masd</a></th>

            <th><a href="indexmast-max.htm">Mast-Max</a></th>
            <th><a href="indexme.htm">Me - Ny</a></th>
            <th><a href="indexo.htm">O</a></th>
            <th><a href="indexor.htm">Or - Oz</a></th>
            <th><a href="indexp-pf.htm">P - Pe</a></th>
            <th><a href="indexph-pk.htm">Ph - Pi</a></th>

            <th><a href="indexpl-pz.htm">Pl - Pz</a></th>
            <th><a href="indexqrsel.htm">QRS - Sel</a></th>
            <th><a href="indexser.htm">Ser - Sz</a></th>
            <th><a href="indextuvwxyz.htm">T-Z</a></th>
          </tr>
        </tbody>

      </table>

      </td>
    </tr>
  </tbody>
</table>
So you need to capture each of these links and navigate to the pages they point to.
Each of these pages covers an alphabetical range of names (for example, E - Epic).
Each of these pages will then have to be scraped as well.
To get the list of (initial) links, you can use the following code:
from bs4 import BeautifulSoup
import requests


class ScrapeOrchids:
    def __init__(self):
        self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
        self.links = {}
        self.get_initial_list()
        self.show_links()
    
    def get_initial_list(self):
        baseurl = 'http://www.orchidspecies.com/'
        response = requests.get(self.main_url)
        if response.status_code == 200:
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # css_select link can be found using browser inspect element, then right click-->Copy-->CSS_Selector
            tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
            ths = tr.find_all('th')
            for th in ths:
                self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
        else:
            print(f"Problem fetching {self.main_url}")

    def show_links(self):
        for key, value in self.links.items():
            print(f"{key}: {value}")


if __name__ == '__main__':
    ScrapeOrchids()
Results:
Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm
You may need to install lxml first: pip install lxml
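A small aside on building the full URLs: the code above joins baseurl and the href by plain string concatenation, which works here because every href is a bare file name. A more general option is urllib.parse.urljoin from the standard library, which resolves a relative href against the URL of the page it was found on. A minimal sketch, using only hrefs that appear in this thread:

```python
from urllib.parse import urljoin

page_url = "http://www.orchidspecies.com/indexe-ep.htm"

# hrefs as they appear in the page source quoted in this thread
hrefs = ["indexa-anat.htm", "indexjkl.htm#sec9", "earinaaestivalis.htm"]

# Resolve each relative href against the page it was found on
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

The fragment in indexjkl.htm#sec9 is preserved, and the same call would also cope with hrefs that are already absolute or that live in a subdirectory.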
Reply
#6
Thank you all.
My project is to get an image of each orchid species, starting from a list of plants.
The list is in Excel format (.xls).
When an image is found, it should be saved in the Excel file next to the species name.
Please help me: I don't know Python yet, but I can see it is the best tool for web scraping.
Reply
#7
Sorry, Larz60+.
I inspected the page with Chrome but could not find the TH tag. Why?
What do you use to analyze the web page?
Reply
#8
I used Firefox's Inspect Element:
  • right-click the element on the page and choose Inspect Element
  • right-click again on the highlighted HTML line
  • click Copy
  • click CSS Selector
  • paste the selector into soup.select() and take the first match to get tr: tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
  • use ths = tr.find_all('th') to get the list of th elements (note that this is all case sensitive)
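A selector copied from the browser depends on the exact position of every table on the page, so it can break when the layout changes. A shorter alternative is to ask for every link that sits inside a th cell, wherever that cell is. With BeautifulSoup, something like soup.select('th > a') expresses that idea. The sketch below shows the same logic with the standard library's html.parser so it runs without extra installs; HeaderLinkParser is a hypothetical helper name, and the sample HTML is cut down from the table quoted earlier in the thread.

```python
from html.parser import HTMLParser


class HeaderLinkParser(HTMLParser):
    """Map the text of each <a> inside a <th> cell to its href."""

    def __init__(self):
        super().__init__()
        self.links = {}       # link text -> href
        self._in_th = False   # True while inside a <th> cell
        self._href = None     # href of the header link currently open
        self._text = []       # text pieces of that link

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
        elif tag == "a" and self._in_th:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None
        elif tag == "th":
            self._in_th = False


# Two header cells from the index table, plus one link outside any <th>
sample = """
<table><tbody><tr>
  <th><a href="indexa-anat.htm">A - Anat</a></th>
  <th><a href="indexe-ep.htm">E - Epic</a></th>
</tr></tbody></table>
<p><a href="earinaaestivalis.htm">not in a header cell</a></p>
"""

parser = HeaderLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
# prints {'A - Anat': 'indexa-anat.htm', 'E - Epic': 'indexe-ep.htm'}
```

The link outside the table is skipped because it is not inside a th cell, which is the only condition this approach relies on.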
Reply
#9
Thank you Larz60+.
Please, could you help me with the project?
As written above, I need to search the pages for a species name, enter the link and download the photo.
I'm studying Python, but I'm still struggling with scraping.
Reply
#10
Perhaps tonight or tomorrow. I have a regular occupation that requires my attention during a good part of each day. The forum has 29,000+ users per day and we are all volunteers here, so just answering questions requires a lot of time.
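In the meantime, the last step described above (downloading a photo once its URL is known) could look roughly like the sketch below. It uses only the standard library. save_image is a hypothetical helper name, and how to find the right image URL on a species page is not covered here, since the thread does not show what those pages look like.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def save_image(url, dest_dir, filename):
    """Download url and write it to dest_dir/filename; return the path."""
    dest = Path(dest_dir) / filename
    # Create the target folder if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response straight into the file instead of holding it in memory
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling save_image(photo_url, "photos", "earina_sigmoidea.jpg") would store the file under a photos folder, where photo_url stands for whatever image address was scraped from the species page. Putting the picture into the Excel file is a separate step that would need a spreadsheet library.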
Reply

