Python Forum
Web Scraping on href text
#5
If you examine the site's HTML, you will see that most of the links are listed in an index table:
<table border="5" cellpadding="3">
  <tbody>
    <tr>

      <td colspan="3" align="left">
      <table border="1" cellpadding="4">
        <tbody>
          <tr>

            <th><a href="indexa-anat.htm">A - Anat</a></th>
            <th><a href="indexanc.htm">Anc - Az</a></th>
            <th><a href="indexb.htm">B - Br</a></th>
            <th><a href="indexbulb.htm">Bulb - By</a></th>

            <th><a href="indexc.htm">C - Cattleya</a></th>
            <th><a href="indexcattleyo.htm">Cattleyo - Cn</a></th>
            <th><a href="indexco.htm">Co - Cz</a></th>
            <th><a href="indexde.htm">D - Dendrob</a></th>
            <th><a href="indexdendroc.htm">Dendroc - Dy</a></th>
            <th><a href="indexe-ep.htm">E - Epic</a></th>

            <th><a href="indexepid-ex.htm">Epid - Ez</a></th>
            <th><a href="indexfghijkl.htm">FG</a></th>
            <th><a href="indexhi.htm">HI</a></th>
            <th><a href="indexjkl.htm">JK</a></th>
            <th><a href="indexjkl.htm#sec9">L</a></th>
            <th><a href="indexm-masd.htm">M-Masd</a></th>

            <th><a href="indexmast-max.htm">Mast-Max</a></th>
            <th><a href="indexme.htm">Me - Ny</a></th>
            <th><a href="indexo.htm">O</a></th>
            <th><a href="indexor.htm">Or - Oz</a></th>
            <th><a href="indexp-pf.htm">P - Pe</a></th>
            <th><a href="indexph-pk.htm">Ph - Pi</a></th>

            <th><a href="indexpl-pz.htm">Pl - Pz</a></th>
            <th><a href="indexqrsel.htm">QRS - Sel</a></th>
            <th><a href="indexser.htm">Ser - Sz</a></th>
            <th><a href="indextuvwxyz.htm">T-Z</a></th>
          </tr>
        </tbody>

      </table>

      </td>
    </tr>
  </tbody>
</table>
So you need to capture each of these links and navigate to the pages they point to.
Each of these pages is an index for one alphabetical range of names (for example "A - Anat").
Each of those pages will then have to be scraped as well.
To get the list of (initial) links, you can use the following code:
from bs4 import BeautifulSoup
import requests


class ScrapeOrchids:
    def __init__(self):
        self.main_url = 'http://www.orchidspecies.com/indexe-ep.htm'
        self.links = {}
        self.get_initial_list()
        self.show_links()
    
    def get_initial_list(self):
        baseurl = 'http://www.orchidspecies.com/'
        response = requests.get(self.main_url)
        if response.status_code == 200:
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # The CSS selector below was copied from the browser: Inspect Element, then right-click --> Copy --> CSS Selector
            tr = soup.select('body > table:nth-child(3) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1)')[0]
            ths = tr.find_all('th')
            for th in ths:
                self.links[th.a.text.strip()] = f"{baseurl}{th.a.get('href')}"
        else:
            print(f"Problem fetching {self.main_url}")

    def show_links(self):
        for key, value in self.links.items():
            print(f"{key}: {value}")


if __name__ == '__main__':
    ScrapeOrchids()
results:
Output:
A - Anat: http://www.orchidspecies.com/indexa-anat.htm
Anc - Az: http://www.orchidspecies.com/indexanc.htm
B - Br: http://www.orchidspecies.com/indexb.htm
Bulb - By: http://www.orchidspecies.com/indexbulb.htm
C - Cattleya: http://www.orchidspecies.com/indexc.htm
Cattleyo - Cn: http://www.orchidspecies.com/indexcattleyo.htm
Co - Cz: http://www.orchidspecies.com/indexco.htm
D - Dendrob: http://www.orchidspecies.com/indexde.htm
Dendroc - Dy: http://www.orchidspecies.com/indexdendroc.htm
E - Epic: http://www.orchidspecies.com/indexe-ep.htm
Epid - Ez: http://www.orchidspecies.com/indexepid-ex.htm
FG: http://www.orchidspecies.com/indexfghijkl.htm
HI: http://www.orchidspecies.com/indexhi.htm
JK: http://www.orchidspecies.com/indexjkl.htm
L: http://www.orchidspecies.com/indexjkl.htm#sec9
M-Masd: http://www.orchidspecies.com/indexm-masd.htm
Mast-Max: http://www.orchidspecies.com/indexmast-max.htm
Me - Ny: http://www.orchidspecies.com/indexme.htm
O: http://www.orchidspecies.com/indexo.htm
Or - Oz: http://www.orchidspecies.com/indexor.htm
P - Pe: http://www.orchidspecies.com/indexp-pf.htm
Ph - Pi: http://www.orchidspecies.com/indexph-pk.htm
Pl - Pz: http://www.orchidspecies.com/indexpl-pz.htm
QRS - Sel: http://www.orchidspecies.com/indexqrsel.htm
Ser - Sz: http://www.orchidspecies.com/indexser.htm
T-Z: http://www.orchidspecies.com/indextuvwxyz.htm
You may need to install lxml first: pip install lxml
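One thing to watch before visiting the pages: "JK" and "L" in the output point at the same file (indexjkl.htm, with and without #sec9), so a naive loop would fetch that page twice. Below is a minimal sketch of one way to build the list of pages to visit, using only the standard library so it can be tried without a network connection. The names LinkCollector and pages_to_visit are made up for this illustration and are not part of the code above; the same idea should carry over to the BeautifulSoup version by passing each href through urljoin and urldefrag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def pages_to_visit(html, base_url):
    """Return unique absolute page URLs in document order, fragments removed."""
    collector = LinkCollector()
    collector.feed(html)
    seen = []
    for _text, href in collector.links:
        # urljoin resolves the relative href; urldefrag drops any '#sec9' part
        page, _fragment = urldefrag(urljoin(base_url, href))
        if page not in seen:
            seen.append(page)
    return seen


# Three cells copied from the index table shown earlier in this post.
sample = '''
<tr>
  <th><a href="indexjkl.htm">JK</a></th>
  <th><a href="indexjkl.htm#sec9">L</a></th>
  <th><a href="indexm-masd.htm">M-Masd</a></th>
</tr>
'''

for url in pages_to_visit(sample, 'http://www.orchidspecies.com/indexe-ep.htm'):
    print(url)
```

With the three sample cells, this prints two URLs rather than three, because the "L" link differs from "JK" only by its fragment. Each URL in the resulting list could then be fetched with requests.get and scraped the same way as the first page.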


Messages In This Thread
Web Scraping on href text - by Superzaffo - Nov-13-2019, 10:32 PM
RE: Web Scraping on href text - by Larz60+ - Nov-14-2019, 12:30 AM
RE: Web Scraping on href text - by Superzaffo - Nov-14-2019, 09:06 AM
RE: Web Scraping on href text - by Malt - Nov-14-2019, 10:27 AM
RE: Web Scraping on href text - by Larz60+ - Nov-14-2019, 10:54 AM
RE: Web Scraping on href text - by Superzaffo - Nov-14-2019, 08:20 PM
RE: Web Scraping on href text - by Superzaffo - Nov-14-2019, 10:05 PM
RE: Web Scraping on href text - by Larz60+ - Nov-15-2019, 02:33 AM
RE: Web Scraping on href text - by Superzaffo - Nov-15-2019, 08:10 AM
RE: Web Scraping on href text - by Larz60+ - Nov-15-2019, 12:43 PM
RE: Web Scraping on href text - by Superzaffo - Nov-15-2019, 01:18 PM
RE: Web Scraping on href text - by Superzaffo - Nov-16-2019, 10:52 AM
