Extract Href URL and Text From List
Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-34226.html
Extract Href URL and Text From List - knight2000 - Jul-08-2021

Hello all,

I'm new to Python and I'm practising web scraping by challenging myself to extract various elements from different websites. On this personal challenge, I've become stuck trying to extract the URL and the anchor text from a ul list on a site. After hours of trying to resolve this, I thought I would ask for some assistance.

From what I've read, you need to create a for loop within a loop, and although I've tried many different variations, I must admit I'm still confused. I've been able to use the following for loop to almost get the results I'm after:

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd

    url = "[mytesturl]"  # placeholder; the real URL was not given in the post
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Outer loop: every matching list; inner loop: every item in that list
    full_list = soup.findAll('ol', {'class': 'nav browse-group-list'})
    for category in full_list:
        group_list = category.findAll('li')
        for weblink in group_list:
            url = weblink.findAll('a')  # a list of <a> tags, not a single tag
            print(url)

The results that I'm getting from this code are:

[output missing]

But I want to extract both the URL (for example, /vdi-software/) and the anchor text (e.g. VDI Software), and I'm unsure what to use. I would really appreciate some assistance.
RE: Extract Href URL and Text From List - snippsat - Jul-08-2021

Use .text and .attrs. Don't use findAll('a') in the loop (it returns a list); use find('a') instead. Also use find_all('a'): the old camelCase findAll (a naming style not used in Python) is kept only for backward compatibility.

    >>> from bs4 import BeautifulSoup
    >>>
    >>> html = '<a href="/vdi-software/">VDI Software</a>'
    >>> soup = BeautifulSoup(html, 'lxml')
    >>> tag = soup.find('a')
    >>> tag
    <a href="/vdi-software/">VDI Software</a>
    >>>
    >>> tag.text
    'VDI Software'
    >>> tag.attrs.get('href', 'Not found')
    '/vdi-software/'
    >>> tag.attrs.get('car', 'Not found')
    'Not found'

RE: Extract Href URL and Text From List - knight2000 - Jul-08-2021

Hi snippsat,

Thanks so much for helping out and teaching me. I also appreciate you pointing out better practices regarding find_all. Awesome to learn new stuff!
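For anyone landing on this thread later, here is a rough sketch of how the advice above might be combined with the nested loop from the first post. The sample HTML, the second list entry, and the helper name extract_links are assumptions made up for illustration, since the real page was never posted. It also uses the built-in 'html.parser' instead of 'lxml' only so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (assumption, not the actual site).
html = """
<ol class="nav browse-group-list">
  <li><a href="/vdi-software/">VDI Software</a></li>
  <li><a href="/video-editing-software/">Video Editing Software</a></li>
  <li>No link here</li>
</ol>
"""


def extract_links(markup):
    """Return a list of (href, anchor text) pairs from the category lists."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    # Outer loop: each list with the class string used in the original post.
    for group in soup.find_all("ol", class_="nav browse-group-list"):
        # Inner loop: each list item inside that list.
        for item in group.find_all("li"):
            tag = item.find("a")  # a single tag (or None), not a list
            if tag is None:
                continue  # skip items that contain no link
            href = tag.get("href", "Not found")
            text = tag.get_text(strip=True)
            links.append((href, text))
    return links


links = extract_links(html)
for href, text in links:
    print(href, text)
```

With a live page, page.text from requests.get() would be passed to extract_links in place of the sample string. Collecting the pairs in a list rather than only printing them should make it easy to hand them to pandas afterwards, if that was the plan for the pandas import.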