Python Forum
Extract Href URL and Text From List
#1
Hello all,

I'm new to Python and I'm practicing web scraping by challenging myself to extract various elements from different websites. In this challenge, I've become stuck trying to extract the URL and the anchor text from an ol list on a site (as shown in the output below). After hours of trying to resolve this, I thought I would ask for some assistance.

From what I've read, you need to nest one for loop inside another, and although I've tried many different variations, I must admit I'm still confused.

I've been able to use the following nested for loop to almost get the results I'm after:

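from bs4 import BeautifulSoup
import requests
import pandas as pd

url = [mytesturl]  # placeholder for the address of the page being scraped


page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

# Every <ol> whose class attribute is "nav browse-group-list"
full_list = soup.findAll('ol', {'class': 'nav browse-group-list'})

for category in full_list:
    group_list = category.findAll('li')
    for weblink in group_list:
        # findAll returns a list, so each print shows a one-element list
        links = weblink.findAll('a')
        print(links)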


The results I'm getting from this code are:

Output:
[<a href="/tour-operator-software/">Tour Operator Software</a>]
[<a href="/treasury-software/">Treasury Software</a>]
[<a href="/trucking-software/">Trucking Software</a>]
[<a href="/trust-accounting-software/">Trust Accounting Software</a>]
[<a href="/tutoring-software/">Tutoring Software</a>]
[<a href="/unified-communications-software/">Unified Communications Software </a>]
[<a href="/unified-endpoint-management-software/">Unified Endpoint Management (UEM) Software</a>]
[<a href="/url-shortener-software/">URL Shortener</a>]
[<a href="/user-testing-software/">User Testing Software</a>]
[<a href="/utility-billing-software/">Utility Billing Software</a>]
[<a href="/utility-management-systems-software/">Utility Management Systems Software</a>]
[<a href="/ux-software/">UX Software</a>]
[<a href="/vacation-rental-software/">Vacation Rental Software</a>]
[<a href="/vaccine-management-software/">Vaccine Management Software</a>]
[<a href="/vdi-software/">VDI Software</a>]
But I want to extract both the URL (for example, /vdi-software/) and the anchor text (e.g. VDI Software), and I'm stuck and unsure what to use. I would really appreciate some assistance.
#2
Use text and attrs.
Don't use findAll('a') inside the loop (it returns a list); just use find('a').
Also, use find_all('a') rather than findAll('a'); the old camelCase names (do not use camelCase in Python) are kept only for backward compatibility.
>>> from bs4 import BeautifulSoup
>>> 
>>> html = '<a href="/vdi-software/">VDI Software</a>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> tag = soup.find('a')
>>> tag
<a href="/vdi-software/">VDI Software</a>
>>> 
>>> tag.text
'VDI Software'
>>> tag.attrs.get('href', 'Not found')
'/vdi-software/'
>>> tag.attrs.get('car', 'Not found')
'Not found'
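
Applied to the nested loop from post #1, the same idea might look like the sketch below. The sample markup is hypothetical (modelled on the output shown in the question), and 'html.parser' is used here only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the output shown in post #1.
html = """
<ol class="nav browse-group-list">
  <li><a href="/ux-software/">UX Software</a></li>
  <li><a href="/vdi-software/">VDI Software</a></li>
  <li>No link here</li>
</ol>
"""

soup = BeautifulSoup(html, 'html.parser')

results = []
for category in soup.find_all('ol', {'class': 'nav browse-group-list'}):
    for item in category.find_all('li'):
        tag = item.find('a')  # a single Tag, or None if the <li> has no link
        if tag is None:
            continue
        href = tag.attrs.get('href', 'Not found')
        results.append((href, tag.text.strip()))

for href, text in results:
    print(href, text)
```

Because find('a') returns None when an li contains no link, the check before reading attrs avoids an AttributeError on such items.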
#3
Hi snippsat,

Thanks so much for helping out and teaching me. I also appreciate you pointing out the better practice of using find_all. Awesome to learn new stuff!