Posts: 212
Threads: 94
Joined: Aug 2018
Aug-17-2022, 01:56 PM
(This post was last modified: Aug-17-2022, 01:56 PM by Winfried.)
Hello,
Can BeautifulSoup grab the text between the tags ("John Doe") in the following line?
from bs4 import BeautifulSoup
with open("input.txt") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
#<a href="/authorx/john_doe">John Doe</a>
Thank you.
--
Edit: Found it
items = soup.select("a[href*=authorx]")
for item in items:
    #print(item)
    print(item.string)
Posts: 7,320
Threads: 123
Joined: Sep 2016
There is no need to loop to get the text.
from bs4 import BeautifulSoup
html = '<a href="/authorx/john_doe">John Doe</a>'
soup = BeautifulSoup(html, 'html.parser')
>>> item = soup.find('a')
>>> item.text
'John Doe'
Or, if you use a CSS selector, you can use select_one() when you only need this one element.
>>> item = soup.select_one("a[href*=authorx]")
>>> item.text
'John Doe'
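A side note, since both item.string and item.text appear above: they agree when the tag holds a single piece of text, but they can differ once the tag has nested markup. The snippet below is a minimal sketch; the nested <b> tag is a made-up variation on the original link, not something from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the original link, with nested markup inside it.
html = '<a href="/authorx/john_doe">John <b>Doe</b></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a[href*=authorx]")

# .string is None because the tag has more than one child node.
print(link.string)      # None
# get_text() (and .text) joins the text of all descendants.
print(link.get_text())  # John Doe
```

So .text or get_text() is probably the safer default when the markup inside the link is not known in advance.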
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-17-2022, 02:19 PM
(This post was last modified: Aug-17-2022, 02:20 PM by Winfried.)
Even if a book has more than one author, and the page has a bunch of href links that have nothing to do with the authors?
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Aug-17-2022, 02:19 PM)Winfried Wrote: Even if a book has more than one author, and the page has a bunch of href links that have nothing to do with the authors?
Give an example if you have trouble.
There are several ways to get a tag even if there are several similar ones.
from bs4 import BeautifulSoup
html = '''\
<body>
<a href="/authorx/john_doe">John Doe1</a>
<a href="/authorx/john_doe">John Doe2</a>
<a href="/authorx/john_doe">John Doe3</a>
</body>'''
soup = BeautifulSoup(html, 'html.parser')
>>> item = soup.select_one('body > a:nth-child(2)')
>>> item.text
'John Doe2'
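For the case asked about (several authors mixed in with unrelated links), one possible approach is to keep select() and filter on the href, as in the first post. This is only a sketch: the extra links and the second author name are invented for illustration, and it assumes every author link contains "/authorx/" in its href.

```python
from bs4 import BeautifulSoup

# Made-up page: two author links surrounded by unrelated links.
html = """
<body>
<a href="/home">Home</a>
<a href="/authorx/john_doe">John Doe</a>
<a href="/authorx/jane_roe">Jane Roe</a>
<a href="/about">About</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# select() returns every match; the attribute selector skips the other links.
authors = [a.get_text(strip=True) for a in soup.select('a[href*="/authorx/"]')]
print(authors)  # ['John Doe', 'Jane Roe']
```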
Posts: 212
Threads: 94
Joined: Aug 2018
The code above works fine, so I'm happy with it.
However, what about this?
<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>
How could I get the publication date ("January 1, 1999")?
Posts: 7,320
Threads: 123
Joined: Sep 2016
from bs4 import BeautifulSoup
html = '''\
<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# CSS selector
>>> item = soup.select_one("span.bi_col_value")
>>> item.text
'January 1, 1999'
# Using find(), add a single trailing underscore (class_), because class is a reserved word in Python
>>> item = soup.find(class_="bi_col_value")
>>> item.text
'January 1, 1999'
# Getting attributes looks like this
>>> item.attrs
{'class': ['bi_col_value']}
>>> item.get('class')
['bi_col_value']
Posts: 212
Threads: 94
Joined: Aug 2018
Aug-17-2022, 03:58 PM
(This post was last modified: Aug-17-2022, 03:58 PM by Winfried.)
Sorry, forgot to say the webpage contains multiple items with identical elements:
<div class="bi_row">
<span class="bi_col_title">Publisher</span>
<span class="bi_col_value">Some publisher Inc</span>
</div>
<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>
etc.
--
Edit: Kludgy but it works:
for col in soup.find_all("div", {"class": "bi_row"}):
    if col.find("span", {"class": "bi_col_title"}).text == "Publication date":
        print(col.find("span", {"class": "bi_col_value"}).text)
        break
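The loop above does the job. Two possible alternatives are sketched below, assuming the title span contains exactly the text "Publication date" and that each bi_row holds one title span and one value span, as in the sample. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="bi_row">
<span class="bi_col_title">Publisher</span>
<span class="bi_col_value">Some publisher Inc</span>
</div>
<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: find the title span by its exact text, then step to its sibling.
title = soup.find("span", class_="bi_col_title", string="Publication date")
value = title.find_next_sibling("span", class_="bi_col_value") if title else None
print(value.get_text(strip=True) if value else None)  # January 1, 1999

# Option 2: collect every row into a dict keyed by its title.
details = {}
for row in soup.select("div.bi_row"):
    key = row.select_one("span.bi_col_title")
    val = row.select_one("span.bi_col_value")
    if key and val:  # skip rows that lack either span
        details[key.get_text(strip=True)] = val.get_text(strip=True)
print(details.get("Publication date"))  # January 1, 1999
```

The dict version may be handier when several fields (publisher, date, and so on) are needed from the same page, since the rows are only walked once.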