Python Forum
Soup('A')
#1
Hello,

I've been going through an introductory Python book that includes some material on web scraping using BeautifulSoup. My question is about the final three lines in the below code:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "https://docs.python.org"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
I understand what the code is doing (and it works on my computer), but I'm curious about what's going on in the line "tags = soup('a')". The get() method used in the for loop on the next two lines suggests that "soup('a')" refers to a dictionary. But if that were the case, shouldn't the code be written with square brackets, as "tags = soup['a']"? Also, when I print tags, the output suggests that tags is a list (it starts and ends with square brackets).

I tried looking through the BeautifulSoup documentation for any insights on this but am still unclear.

Thanks in advance for any help.
Reply
#2
I would have used: tags=soup.find_all('a')
Reply
#3
(Aug-31-2022, 12:23 PM)Larz60+ Wrote: I would have used: tags=soup.find_all('a')

Thanks Larz60+, that definitely makes more sense to me (and produces the same result).
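
A quick way to check that the two spellings are interchangeable is to run both on a small inline document. This is only a sketch: the HTML string and the variable names called and found are made up for illustration, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, so no network access is needed.
html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

called = soup("a")          # calling the soup object directly
found = soup.find_all("a")  # the explicit method call

print(called == found)                      # True
print([tag.get("href") for tag in found])   # ['/one', '/two']
```

Calling the soup object is a documented shortcut for find_all(), so the explicit form is mostly a readability choice.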
Reply
#4
1.> soup('a') is not referring to a dictionary. Calling the soup object like a function is a shortcut for soup.find_all('a'): it finds all <a> tags in the parsed HTML and returns a ResultSet (a list-like object).
2.> The code tags = soup('a') assigns this list of Tag objects to the variable tags.
3.> When you print tags, it displays the representation of the ResultSet, which looks like a list.

Your observation is correct: the square brackets indicate a list-like structure, and the content inside those brackets is a collection of Tag objects, not dictionary keys or values. The get() call works because each individual Tag supports dictionary-style access to its attributes, so tag.get('href', None) looks up the href attribute of one tag and returns None if it is missing.
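
To make the list-versus-dictionary distinction concrete, here is a minimal sketch using an inline HTML string. The markup and the name first are assumptions for illustration, not taken from the book's example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor with an href, one without.
html = '<a class="nav-logo" href="https://www.python.org/">Python</a><a name="top">No link</a>'
soup = BeautifulSoup(html, "html.parser")
tags = soup("a")

# The container behaves like a list ...
print(type(tags).__name__)     # ResultSet
print(isinstance(tags, list))  # True

# ... while each element is a Tag with dictionary-style attribute access.
first = tags[0]
print(type(first).__name__)    # Tag
print(first["href"])           # https://www.python.org/
print(tags[1].get("href"))     # None, because this anchor has no href
```

Square-bracket access such as tags[1]["href"] would raise a KeyError for the second anchor, which is presumably why the book uses get() with a default.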
Reply
#6
Say you already have tags:

type(tags)
Output:
<class 'bs4.element.ResultSet'>
Take, say, the first element:

s = str(tags[0])
print(s)
Now you have:

Output:
<a class="nav-logo" href="https://www.python.org/"> <img alt="Python logo" src="_static/py.svg"/> </a>
Now you can get the actual link address using a regular expression:

import re

e = re.compile(r'(href=")([:/a-z\.]+)')
res = e.search(s)
print(res.group(2))  # https://www.python.org/
Output:
https://www.python.org/
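That gives the same result as tag.get('href') for this particular tag, but it is not how BeautifulSoup works internally. BeautifulSoup hands the HTML to a parser, builds a tree of Tag objects, and stores each tag's attributes in a dictionary, so get('href') is a dictionary-style lookup rather than a regex search.

The regex above is also fragile: it stops at the first character outside [:/a-z.], so URLs containing uppercase letters, digits, or query strings are cut short.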
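
For comparison, BeautifulSoup does not extract attributes with a regex. When you pass 'html.parser', it builds on Python's standard html.parser module. The following is a rough sketch of parser-based link extraction using only the standard library. The LinkCollector class and the sample string are hypothetical, and this is not BeautifulSoup's actual implementation.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with lowercased names.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


# The second link has an uppercase attribute name and a query string,
# both of which the lowercase-only regex above would not handle.
sample = (
    '<a class="nav-logo" href="https://www.python.org/">'
    '<img alt="Python logo" src="_static/py.svg"/></a>'
    '<a HREF="/Downloads?x=1">Downloads</a>'
)

collector = LinkCollector()
collector.feed(sample)
print(collector.hrefs)  # ['https://www.python.org/', '/Downloads?x=1']
```

Because the parser tokenizes tags and attributes itself, the lookup does not depend on the exact characters in the URL.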
Reply