Soup('A')

new_coder_231013 · Aug-31-2022, 12:09 PM

Hello,

I've been going through an introductory Python book that includes some material on web scraping using BeautifulSoup. My question is about the final three lines of the code below:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "https://docs.python.org"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

I understand what the code is doing (and it works on my computer), but I'm curious about what's going on in the line "tags = soup('a')". The get() method used in the for loop on the next two lines suggests that "soup('a')" refers to a dictionary. But if that were the case, shouldn't the code use square brackets, as in "tags = soup['a']"? Also, when I print tags, the output suggests that tags is a list (it starts and ends with a square bracket).

I tried looking through the BeautifulSoup documentation for any insights on this but am still unclear.

Thanks in advance for any help.

Larz60+ · Aug-31-2022, 12:23 PM

I would have used: tags=soup.find_all('a')

new_coder_231013 · Sep-12-2022, 12:02 PM

(Aug-31-2022, 12:23 PM)Larz60+ Wrote: I would have used: tags=soup.find_all('a')

Thanks Larz60+, that definitely makes more sense to me (and produces the same result).

Gaurav_Kumar · Aug-09-2023, 11:49 AM

1. soup('a') does not refer to a dictionary. Calling the soup object is a shortcut for soup.find_all('a'): it finds every <a> tag in the parsed HTML and returns a ResultSet (a subclass of list).
2. The code tags = soup('a') assigns this list of Tag objects to the variable tags.
3. When you print tags, it displays the representation of the ResultSet, which looks like a list.

Your reading of the output is correct: the square brackets indicate a list-like structure, and the items inside those brackets are Tag objects, not dictionary keys or values. The get() call in the loop is made on each individual Tag, which offers dictionary-style access to that tag's attributes; it is not called on tags itself.
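To check these points without fetching a page, here is a minimal sketch that parses a small inline HTML string (the string is made up for illustration, not taken from docs.python.org). It compares soup('a') with soup.find_all('a'), confirms that the result is a list subclass, and shows that get() is called on each Tag rather than on the list.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

# Made-up HTML for illustration: one anchor with an href, one without.
html = (
    '<p><a class="nav-logo" href="https://www.python.org/">Python</a> '
    '<a name="top">no link here</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

tags = soup("a")           # calling the soup object...
same = soup.find_all("a")  # ...gives the same result as find_all()

print(tags == same)               # True
print(isinstance(tags, list))     # True: ResultSet is a subclass of list
print(type(tags[0]).__name__)     # Tag

first = tags[0]
print(first.get("href"))          # https://www.python.org/
print(first["href"])              # square brackets work on a Tag too
print(tags[1].get("href"))        # None, because this <a> has no href
```

So the parentheses in soup('a') are an ordinary call, and the dictionary-like behaviour belongs to the individual Tag objects inside the list. Using get() rather than square brackets in the loop avoids a KeyError for anchors that have no href.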

jenson · (This post was last modified: Sep-24-2024, 11:11 AM by jenson.)

The conversational tone of your writing makes your blog feel like a friendly conversation. It's a pleasure to read and learn from.

Pedroski55 · Oct-13-2024, 06:19 AM

Say you already have tags:

type(tags)

Output:
<class 'bs4.element.ResultSet'>

Take, say, the first element:

s = str(tags[0])
print(s)

Now you have:

Output:
<a class="nav-logo" href="https://www.python.org/">
<img alt="Python logo" src="_static/py.svg"/>
</a>

Now you can get the actual link address using a regular expression:

import re

e = re.compile(r'(href=")([:/a-z\.]+)')
res = e.search(s)
print(res.group(2)) # https://www.python.org/

Output:
https://www.python.org/

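That gives the same result as tags[0].get('href'), although BeautifulSoup does not get there with a regex.

It hands the HTML to a parser (html.parser in the code above), stores each tag's attributes in a dictionary (tag.attrs), and get() simply looks the key up there. That also copes with cases this simple pattern would miss, such as uppercase letters or digits in the URL.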
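For comparison, here is a rough sketch of the parser-based route, using only the standard library's html.parser module (the same parser the first post selects with 'html.parser'). The class name LinkCollector is made up for this example, and this is not BeautifulSoup's actual code, just an illustration of how a parser delivers attributes as name/value pairs without any regex.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already parsed.
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


# The same markup as the first element printed above.
snippet = (
    '<a class="nav-logo" href="https://www.python.org/">\n'
    '<img alt="Python logo" src="_static/py.svg"/>\n'
    '</a>'
)

collector = LinkCollector()
collector.feed(snippet)
collector.close()
print(collector.links)  # ['https://www.python.org/']
```

Turning the attribute pairs into a dict and calling get("href") mirrors what tag.get('href') does on a BeautifulSoup Tag, including returning None when the attribute is missing.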
