I have HTML code like the following from a URL:
<img class="this" alt="this" src="this_source1.gif">
<img class="this" alt="this" src="this_source2.gif">
<img class="this" alt="this" src="this_source3.gif">
<img class="this and that" alt="not this" src="this__and_that_source1.gif">
<img class="this and that" alt="not this" src="this__and_that_source2.gif">
<img class="this and that" alt="not this" src="this__and_that_source3.gif">
I'm trying to get the alt values of only the img tags whose class is exactly class="this":
import requests
from bs4 import BeautifulSoup
url = 'https://someurl.com'
resp = requests.get(url)
txt = resp.text
soup = BeautifulSoup(txt, 'lxml')
imgThis = soup.find_all('img', class_='this')
for img in imgThis:
    print(img['alt'])
The find_all method returns matches for both class_="this" and class_="this and that"
Output:
this
this
this
not this
not this
not this
How do I restrict the results to only the tags with class="this"?
for example,
<img class="this" alt="this" src="this_source1.gif">
use:
source1 = soup.find('img', {'class': 'this'})
Thank you Larz.
I did try:
test = soup.find('img', {'class': 'this'})
But that returned just the first img tag whose class contains "this", which happened to be an <img class="this and that">,
and
test = soup.find_all('img', {'class': 'this'})
returns all img tags with class="this" and class="this and that"
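A possible explanation for the behaviour above: bs4 treats class as a multi-valued attribute and stores it as a list of tokens, so a filter on 'this' matches any tag that has 'this' among its classes. The sketch below illustrates that on a shortened copy of the sample HTML and shows one way to stay with find_all by passing a function instead. The helper name has_only_class_this is made up for this illustration, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the question.
html = '''
<img class="this" alt="this" src="this_source1.gif">
<img class="this and that" alt="not this" src="this__and_that_source1.gif">
'''

# Assumption: html.parser instead of lxml, to keep the sketch dependency-free.
soup = BeautifulSoup(html, 'html.parser')

# The class attribute is stored as a list of tokens, one per class name.
for img in soup.find_all('img'):
    print(img['class'])

# Matches every img that has 'this' among its class tokens, so both tags.
loose = soup.find_all('img', class_='this')

# Hypothetical helper: accept only img tags whose class list is exactly ['this'].
def has_only_class_this(tag):
    return tag.name == 'img' and tag.get('class') == ['this']

exact = soup.find_all(has_only_class_this)
exact_alts = [img['alt'] for img in exact]
print(len(loose), exact_alts)
```

The function filter compares the whole class list, so tags that carry additional classes are skipped.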
If you really must use bs4, I would use its CSS selector support and stay away from the weird find/find_all API.
This is one way to achieve what you want:
soup.select('img[class="this"]')
In general, I'd recommend using lxml instead of bs4 for pretty much anything.
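For comparison, here is a rough sketch of how the same lookup might look with lxml on its own, using an XPath expression that compares the whole class attribute string. It assumes lxml is installed and works on a shortened copy of the sample HTML rather than a fetched page.

```python
import lxml.html

# Shortened copy of the sample HTML from the question.
html = '''
<img class="this" alt="this" src="this_source1.gif">
<img class="this" alt="this" src="this_source2.gif">
<img class="this and that" alt="not this" src="this__and_that_source1.gif">
'''

# Parse the markup as a full HTML document.
doc = lxml.html.document_fromstring(html)

# @class="this" compares the complete attribute string, so
# class="this and that" does not match; /@alt returns the alt values directly.
alts = [str(alt) for alt in doc.xpath('//img[@class="this"]/@alt')]
print(alts)
```

Because the comparison is on the raw attribute string, a value with extra whitespace or additional classes would not match.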
Thanks stranac!
That seems to have done the trick.
It's a shame the BeautifulSoup documentation is less than optimal!
Edit: the threads have been merged, so my answer is the same as @stranac's.
-----
You can use CSS selectors to match the exact class name.
from bs4 import BeautifulSoup
html = '''\
<img class="this" alt="this" src="this_source1.gif">
<img class="this" alt="this" src="this_source2.gif">
<img class="this" alt="this" src="this_source3.gif">
<img class="this and that" alt="not this" src="this__and_that_source1.gif">
<img class="this and that" alt="not this" src="this__and_that_source2.gif">
<img class="this and that" alt="not this" src="this__and_that_source3.gif">'''
soup = BeautifulSoup(html, 'lxml')
only_this = soup.select('img[class="this"]')
Test:
>>> only_this
[<img alt="this" class="this" src="this_source1.gif"/>,
<img alt="this" class="this" src="this_source2.gif"/>,
<img alt="this" class="this" src="this_source3.gif"/>]
>>> [i.get('src') for i in only_this]
['this_source1.gif', 'this_source2.gif', 'this_source3.gif']