Python Forum

Hello all,

I'm new to Python and I'm practicing web scraping by challenging myself to extract various elements from different websites. In this personal challenge, I've become stuck trying to extract the URL and the anchor text from each link in a list on a site (as shown in the output below). After hours of trying to resolve this, I thought I would ask for some assistance.

From what I've read, you need to nest one for loop inside another, and although I've tried many different variations, I must admit I'm still confused.

I've been able to use the following 'for loop' to almost get the results I'm after:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = [mytesturl]


page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

full_list = soup.findAll('ol', {'class': 'nav browse-group-list'})

for category in full_list:
    group_list = category.findAll('li')
    for weblink in group_list:
        url = weblink.findAll('a')
        print(url)

The results I'm getting from this code are:

Output:
[<a href="/tour-operator-software/">Tour Operator Software</a>]
[<a href="/treasury-software/">Treasury Software</a>]
[<a href="/trucking-software/">Trucking Software</a>]
[<a href="/trust-accounting-software/">Trust Accounting Software</a>]
[<a href="/tutoring-software/">Tutoring Software</a>]
[<a href="/unified-communications-software/">Unified Communications Software	</a>]
[<a href="/unified-endpoint-management-software/">Unified Endpoint Management (UEM) Software</a>]
[<a href="/url-shortener-software/">URL Shortener</a>]
[<a href="/user-testing-software/">User Testing Software</a>]
[<a href="/utility-billing-software/">Utility Billing Software</a>]
[<a href="/utility-management-systems-software/">Utility Management Systems Software</a>]
[<a href="/ux-software/">UX Software</a>]
[<a href="/vacation-rental-software/">Vacation Rental Software</a>]
[<a href="/vaccine-management-software/">Vaccine Management Software</a>]
[<a href="/vdi-software/">VDI Software</a>]

But I want to extract both the URL (for example, /vdi-software/) and the anchor text (for example, VDI Software), and I'm stuck and unsure of what to use. I would really appreciate some assistance.

Use text and attrs.
Don't use findAll('a') in the loop (it returns a list); just use find('a').
Also, when you do need a list, use find_all('a'); the old camelCase name findAll (a style not used in Python) is kept only for backward compatibility.

>>> from bs4 import BeautifulSoup
>>> 
>>> html = '<a href="/vdi-software/">VDI Software</a>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> tag = soup.find('a')
>>> tag
<a href="/vdi-software/">VDI Software</a>
>>> 
>>> tag.text
'VDI Software'
>>> tag.attrs.get('href', 'Not found')
'/vdi-software/'
>>> tag.attrs.get('car', 'Not found')
'Not found'
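
Applied to the nested loop from the question, this could look roughly like the sketch below. The HTML string is a made-up stand-in for the real page (an assumption, since the actual URL was not posted), and extract_links is just an illustrative helper name. It uses 'html.parser' so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page (assumption).
html = """
<ol class="nav browse-group-list">
  <li><a href="/ux-software/">UX Software</a></li>
  <li><a href="/vdi-software/">VDI Software</a></li>
  <li>No link here</li>
</ol>
"""


def extract_links(markup):
    """Return (href, anchor text) pairs for each <li> that contains a link."""
    soup = BeautifulSoup(markup, 'html.parser')
    links = []
    for category in soup.find_all('ol', {'class': 'nav browse-group-list'}):
        for item in category.find_all('li'):
            tag = item.find('a')  # a single Tag, or None if there is no link
            if tag is None:
                continue
            # .get() avoids a KeyError when the href attribute is missing
            links.append((tag.get('href', 'Not found'), tag.text.strip()))
    return links


for href, text in extract_links(html):
    print(href, text)
```

The check for None matters because find('a') returns None for a list item without a link, and calling .text on None would raise an AttributeError.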

(Jul-08-2021, 11:58 AM) snippsat wrote: Use text and attrs.

Hi snippsat,

Thanks so much for helping out and teaching me. I also appreciate you pointing out better practices regarding find_all. Awesome to learn new stuff!

knight2000

snippsat

knight2000