Extract Href URL and Text From List

knight2000 · Jul-08-2021, 11:32 AM

Hello all,

I'm new to Python and I'm practicing web scraping by challenging myself to extract various elements from different websites. On this personal challenge, I've become stuck trying to extract the URL and the anchor text from a list on a site (as shown below in the output). After hours of trying to resolve this, I thought I would ask for some assistance please.

From what I've read, you need to create a for loop within a loop, and although I've tried many different variations, I must admit I'm still confused.

I've been able to use the following 'for loop' to almost get the results I'm after:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = [mytesturl]


page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

full_list = soup.findAll('ol', {'class': 'nav browse-group-list'})

for category in full_list:
    group_list = category.findAll('li')
    for weblink in group_list:
        url = weblink.findAll('a')
        print(url)

The output I'm getting from this code is:

Output:
[<a href="/tour-operator-software/">Tour Operator Software</a>]
[<a href="/treasury-software/">Treasury Software</a>]
[<a href="/trucking-software/">Trucking Software</a>]
[<a href="/trust-accounting-software/">Trust Accounting Software</a>]
[<a href="/tutoring-software/">Tutoring Software</a>]
[<a href="/unified-communications-software/">Unified Communications Software	</a>]
[<a href="/unified-endpoint-management-software/">Unified Endpoint Management (UEM) Software</a>]
[<a href="/url-shortener-software/">URL Shortener</a>]
[<a href="/user-testing-software/">User Testing Software</a>]
[<a href="/utility-billing-software/">Utility Billing Software</a>]
[<a href="/utility-management-systems-software/">Utility Management Systems Software</a>]
[<a href="/ux-software/">UX Software</a>]
[<a href="/vacation-rental-software/">Vacation Rental Software</a>]
[<a href="/vaccine-management-software/">Vaccine Management Software</a>]
[<a href="/vdi-software/">VDI Software</a>]

I'd like to extract both the URL (for example, /vdi-software/) and the anchor text (e.g. VDI Software), but I'm stuck and unsure what to use. I would really appreciate some assistance please.

snippsat · Jul-08-2021, 11:58 AM

Use text and attrs.
Don't use findAll('a') in the loop (it returns a list); just use find('a').
Also, use find_all() rather than findAll(); the old camelCase name (not the convention in Python) is kept only for backward compatibility.

>>> from bs4 import BeautifulSoup
>>> 
>>> html = '<a href="/vdi-software/">VDI Software</a>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> tag = soup.find('a')
>>> tag
<a href="/vdi-software/">VDI Software</a>
>>> 
>>> tag.text
'VDI Software'
>>> tag.attrs.get('href', 'Not found')
'/vdi-software/'
>>> tag.attrs.get('car', 'Not found')
'Not found'
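
Applied to the nested loop from the question, the same idea might look like the sketch below. It is only an illustration: SAMPLE_HTML is made-up markup standing in for the real page (the actual URL was not given), extract_links is a hypothetical helper name, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<ol class="nav browse-group-list">
  <li><a href="/ux-software/">UX Software</a></li>
  <li><a href="/vdi-software/">VDI Software</a></li>
  <li>No link in this item</li>
</ol>
"""


def extract_links(html):
    """Return a list of (href, anchor text) tuples found in the list items."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for category in soup.find_all('ol', {'class': 'nav browse-group-list'}):
        for item in category.find_all('li'):
            tag = item.find('a')  # a single tag (or None), not a list
            if tag is None:
                continue  # skip list items that contain no link
            href = tag.attrs.get('href', 'Not found')
            links.append((href, tag.text.strip()))
    return links


for href, text in extract_links(SAMPLE_HTML):
    print(href, text)
```

Because find('a') returns None when a list item has no link, the check before reading text and attrs avoids an AttributeError. The strip() call is there because some anchor texts in the output above carry trailing whitespace.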

knight2000 · Jul-08-2021, 12:53 PM

Hi snippsat,

Thanks so much for helping out and teaching me. I also appreciate you pointing out better practices regarding find_all. Awesome to learn new stuff!
