Python Forum

Full Version: Web Scraping on href text
Hello to all.
I'm new to Python, but I would like to use it to scrape a website.
I need to search for a name within the text of an anchor and, if the string is found, take the href link and open that page.
This is the part to look for:

<P> <LI> <A href="earinaaestivalis.htm"> Earina aestivalis Cheeseman 1919 <A/> <P> <img src="orphotdir/scent.jpg"> <img src="orphotdir/partialshade.jpg"> <img src="orphotdir/tempint.jpg"> MID <img src="orphotdir/spring.jpg"> THROUGH <img src="orphotdir/summer.jpg">

I have to look for the name Earina in the text "Earina aestivalis Cheeseman 1919" and, if I find it, open the page given in the href.
I wrote some scripts with BeautifulSoup, and they work when I search for the exact anchor text, but I can't match a name that is only part of the full text.
Who can help me?
Thanks
Thank you
I wrote this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.orchidspecies.com/indexe-ep.htm')
bs = BeautifulSoup(html, 'html.parser')
links = bs.find_all('a')
for link in links:
    text = link.text
    if re.search('Earina sigmoidea', text):
        # print(link.text)
        print(link.get('href'))
But the result is not entirely correct, because the name 'Earina sigmoidea' appears only once on the page.
Why is this?

This is the result:
Output:
earinaaestivalis.htm
earsigmoidea.htm
Interesting. It really is returning both URLs.
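One possible explanation, offered as a guess rather than a confirmed diagnosis: the snippet quoted in the first post ends the anchor with <A/> instead of </A>. With 'html.parser', that leaves the first anchor open, so the anchors that follow end up nested inside it and its .text contains their text too. The sketch below reproduces this on a small made-up fragment (not the live page) and keeps only the innermost matching anchor. The helper name innermost_matches is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the quoted snippet: the first anchor is
# "closed" with <a/>, which html.parser does not treat as a closing tag.
SAMPLE = """
<ul>
<li><a href="earinaaestivalis.htm">Earina aestivalis<a/>
<li><a href="earsigmoidea.htm">Earina sigmoidea</a>
</ul>
"""


def innermost_matches(soup, name):
    """Return hrefs of anchors whose text contains name, skipping any
    anchor that only matches because a nested anchor matches."""
    hrefs = []
    for a in soup.find_all("a", href=True):
        if name not in a.get_text():
            continue
        # If a nested anchor also contains the name, the outer one is noise
        if any(name in inner.get_text() for inner in a.find_all("a")):
            continue
        hrefs.append(a["href"])
    return hrefs


soup = BeautifulSoup(SAMPLE, "html.parser")

# Naive search, same idea as the code in the question: both anchors match
all_hits = [a["href"] for a in soup.find_all("a", href=True)
            if "Earina sigmoidea" in a.get_text()]
print(all_hits)

# Filtered search: only the anchor that really carries the name
print(innermost_matches(soup, "Earina sigmoidea"))
```

If the real page is malformed in a different way, the filter may need adjusting, so it is worth printing a.get_text() for each hit to see what the parser actually built.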
If you examine the site's HTML, you will see that most of the links are indexed with:
<table border="5" cellpadding="3">
  <tbody>
    <tr>

      <td colspan="3" align="left">
      <table border="1" cellpadding="4">
        <tbody>
          <tr>

            <th><a href="indexa-anat.htm">A - Anat</a></th>
            <th><a href="indexanc.htm">Anc - Az</a></th>
            <th><a href="indexb.htm">B - Br</a></th>
            <th><a href="indexbulb.htm">Bulb - By</a></th>

            <th><a href="indexc.htm">C - Cattleya</a></th>
            <th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
            <th><a href="indexco.htm">Co - Cz</a></th>
            <th><a href="indexde.htm">D - Dendrob</a></th>
            <th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
            <th><a href="indexe-ep.htm">E - Epic</a></th>

            <th><a href="indexepid-ex.htm">Epid - Ez</a></th>
            <th><a href="indexfghijkl.htm">FG</a></th>
            <th><a href="indexhi.htm">HI</a></th>
            <th><a href="indexjkl.htm">JK</a></th>
            <th><a href="indexjkl.htm#sec9">L</a></th>
            <th><a href="indexm-masd.htm">M-Masd</a></th>

            <th><a href="indexmast-max.htm">Mast-Max</a></th>
            <th><a href="indexme.htm">Me - Ny</a></th>
            <th><a href="indexo.htm">O</a></th>
            <th><a href="indexor.htm">Or - Oz</a></th>
            <th><a href="indexp-pf.htm">P - Pe</a></th>
            <th><a href="indexph-pk.htm">Ph - Pi</a></th>

            <th><a href="indexpl-pz.htm">Pl - Pz</a></th>
            <th><a href="indexqrsel.htm">QRS - Sel</a></th>
            <th><a href="indexser.htm">Ser - Sz</a></th>
            <th><a href="indextuvwxyz.htm">T-Z</a></th>
          </tr>
        </tbody>

      </table>

      </td>
    </tr>
  </tbody>
</table>
So you need to capture each of these links and navigate to the pages shown.
Each of these pages covers one alphabetical range of names, as the link labels show.
Each of these pages will then have to be scraped as well.
To get the (initial) list of links, you can use the following code:
from bs4 import BeautifulSoup
import requests


class ScrapeOrchids:
    def __init__(self):
        self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
        self.links = {}
        self.get_initial_list()
        self.show_links()
    
    def get_initial_list(self):
        baseurl = 'http://www.orchidspecies.com/'
        response = requests.get(self.main_url)
        if response.status_code == 200:
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # The CSS selector can be copied from the browser: Inspect Element, then right click > Copy > CSS Selector
            tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
            ths = tr.find_all('th')
            for th in ths:
                self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
        else:
            print(f"Problem fetching {self.main_url}")

    def show_links(self):
        for key, value in self.links.items():
            print(f"{key}: {value}")


if __name__ == '__main__':
    ScrapeOrchids()
Results:
Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm
You may need to install lxml: pip install lxml
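The follow-up step described above (visiting each index page and searching it for a species name) could look roughly like the sketch below. It has not been run against the live site. The function names find_species_links and search_all_indexes are hypothetical, and it assumes the species name appears as plain text inside the anchor. On pages with anchors that are not closed properly it may return extra links, so the results should be checked.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_species_links(html, page_url, name):
    """Return absolute URLs of anchors on one page whose text contains name.

    The comparison is case-insensitive; page_url is used to resolve
    relative hrefs such as 'earsigmoidea.htm'.
    """
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):
        if name.lower() in a.get_text().lower():
            url = urljoin(page_url, a["href"])
            if url not in found:  # keep order, drop duplicates
                found.append(url)
    return found


def search_all_indexes(index_links, name):
    """Search every index page for name.

    index_links is assumed to be a dict of label -> index page URL,
    like the links dictionary built by the class above.
    """
    results = []
    for label, url in index_links.items():
        with urlopen(url, timeout=30) as response:
            html = response.read()
        results.extend(find_species_links(html, url, name))
    return results
```

A call such as search_all_indexes(scraper.links, 'Earina sigmoidea') would then fetch each index page in turn, so it is polite to add a short pause between requests if the list is long.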
Thank you to all.
My project is to get the image of each orchid species, starting from a list of plants.
This list is in Excel format (.xls).
If an image is found, it should be saved in the Excel file next to the species name.
Please help me: I don't know Python yet, but I can see it is the best tool for web scraping.
Sorry Larz60+
I analyzed the page with Chrome but couldn't find the TH tag... Why?
What do you use to analyze the web page?
I used Firefox's Inspect Element:
  • right-click the element on the page and choose Inspect Element
  • right-click again on the highlighted HTML line
  • click Copy
  • click CSS Selector
  • paste it into tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0] to get tr
  • use ths = tr.find_all('th') to get the list of th elements (note that this is all case sensitive)
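One caveat with selectors copied from the browser: browsers add tbody elements to tables even when the HTML source has none, so a long copied selector can fail to match what the parser sees. A shorter selector is less brittle. Here is a minimal sketch on a made-up fragment shaped like the index table (not the live page), written without tbody on purpose:

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like the index table, with no <tbody> in the source
html = """
<table><tr>
  <th><a href="indexa-anat.htm">A - Anat</a></th>
  <th><a href="indexanc.htm">Anc - Az</a></th>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match every anchor inside a <th>, whatever the nesting depth,
# so the result does not depend on <tbody> being present
links = {a.get_text(strip=True): a["href"] for a in soup.select("th a[href]")}
print(links)
```

The trade-off is that "th a[href]" picks up anchors in every th on the page, so on a page with several header tables the result may need filtering.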
Thank you Larz60+.
Please, could you help me with the project?
As written above, I need to search the pages for a species name, enter the link and download the photo.
I'm studying Python, but I'm still struggling with scraping.
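For the last part (entering the species page and downloading the photo), a rough sketch follows. It rests on two assumptions that would need checking against the real pages: that the small images named in the first post (scent.jpg, partialshade.jpg and so on) are icons rather than photos, and that the first remaining image on a species page is the photo. The names first_photo_url, download and ICON_NAMES are hypothetical. Writing the result into the spreadsheet is a separate step not shown here.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: these file names (taken from the snippet in the first post)
# are icons for scent, light, temperature and season, not species photos.
ICON_NAMES = {"scent.jpg", "partialshade.jpg", "tempint.jpg",
              "spring.jpg", "summer.jpg"}


def first_photo_url(html, page_url, icon_names=ICON_NAMES):
    """Return the absolute URL of the first <img> that is not a known icon,
    or None if the page has no such image."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img", src=True):
        src = img["src"].strip()
        file_name = PurePosixPath(urlparse(src).path).name
        if file_name.lower() in icon_names:
            continue
        return urljoin(page_url, src)
    return None


def download(url, folder="photos"):
    """Save the file behind url into folder and return the local path."""
    Path(folder).mkdir(exist_ok=True)
    target = Path(folder) / PurePosixPath(urlparse(url).path).name
    with urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Keeping the HTML parsing separate from the download makes it possible to try first_photo_url on a saved copy of a page before fetching anything.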
Perhaps tonight or tomorrow. I have a regular occupation that requires my attention during a good part of each day. The forum has 29,000+ users per day and we are all volunteers here, so just answering questions requires a lot of time.