Python Forum

[SOLVED] [BeautifulSoup] How to get this text?
Hello,

Can BeautifulSoup grab the text between the opening and closing tags ("John Doe") in the following line?

from bs4 import BeautifulSoup

with open("input.txt") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

#<a href="/authorx/john_doe">John Doe</a>
Thank you.

--
Edit: Found it

items = soup.select("a[href*=authorx]")
for item in items:
	#print(item)
	print(item.string)
There is no need to loop to get the text.
from bs4 import BeautifulSoup

html = '<a href="/authorx/john_doe">John Doe</a>'
soup = BeautifulSoup(html, 'html.parser')
>>> item = soup.find('a')
>>> item.text
'John Doe'
Or, if you prefer a CSS selector, you can use select_one() when you only need this one element.
>>> item = soup.select_one("a[href*=authorx]")
>>> item.text
'John Doe' 
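One thing to be aware of when choosing between item.string (used in the first post) and item.text: they only agree when the tag holds a single piece of text. A minimal sketch, assuming a made-up variant of the link with a nested <b> tag (that markup is not from the original page):

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the link: the name is split across a nested <b> tag
html = '<a href="/authorx/john_doe"><b>John</b> Doe</a>'
soup = BeautifulSoup(html, 'html.parser')
item = soup.select_one("a[href*=authorx]")

# .string is None because the <a> tag has more than one child
string_value = item.string
# .text (same as get_text()) joins all the text inside the tag
text_value = item.text

print(string_value)
print(text_value)
```

So .text or get_text() is usually the safer choice if the markup inside the link might vary.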
Even if a book has more than one author, and the page has a bunch of href links that have nothing to do with the authors?
(Aug-17-2022, 02:19 PM) Winfried wrote: Even if a book has more than one author, and the page has a bunch of href links that have nothing to do with the authors?
Give an example if you have trouble.
There are several ways to get a specific tag even if there are several similar ones.
from bs4 import BeautifulSoup

html = '''\
<body>
  <a href="/authorx/john_doe">John Doe1</a>
  <a href="/authorx/john_doe">John Doe2</a>
  <a href="/authorx/john_doe">John Doe3</a>
</body>'''

soup = BeautifulSoup(html, 'html.parser') 
>>> item = soup.select_one('body > a:nth-child(2)')
>>> item.text
'John Doe2'
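For the case asked about (several authors mixed in with unrelated links), select() with the same attribute selector returns only the matching links, so the others are skipped. A rough sketch; the /about and /contact links and the second author are invented for illustration:

```python
from bs4 import BeautifulSoup

# Sample page: two author links plus unrelated links (all made up)
html = '''\
<body>
  <a href="/about">About</a>
  <a href="/authorx/john_doe">John Doe</a>
  <a href="/authorx/jane_roe">Jane Roe</a>
  <a href="/contact">Contact</a>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns every <a> whose href contains "authorx", in document order
authors = [a.get_text(strip=True) for a in soup.select('a[href*="authorx"]')]
print(authors)
```

This should print only the two author names. Here a loop (or list comprehension) is needed again, since there is more than one element to collect.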
The code above works fine, so I'm happy with it.

However, what about this?

<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>


How could I get the publication date ("January 1, 1999")?
from bs4 import BeautifulSoup

html = '''\
<div class="bi_row">
  <span class="bi_col_title">Publication date</span>
  <span class="bi_col_value">January 1, 1999</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# CSS selector
>>> item = soup.select_one("span.bi_col_value")
>>> item.text
'January 1, 1999'

# Using find(): add a single trailing underscore (class_), because class is a reserved word in Python
>>> item = soup.find(class_="bi_col_value")
>>> item.text
'January 1, 1999'

# Getting the tag's attributes looks like this
>>> item.attrs
{'class': ['bi_col_value']}
>>> item.get('class')
['bi_col_value']
Sorry, I forgot to say that the webpage contains multiple rows with identical elements:

<div class="bi_row">
<span class="bi_col_title">Publisher</span>
<span class="bi_col_value">Some publisher Inc</span>
</div>

<div class="bi_row">
<span class="bi_col_title">Publication date</span>
<span class="bi_col_value">January 1, 1999</span>
</div>


etc.

--
Edit: Kludgy but it works:

for col in soup.find_all("div", {"class": "bi_row"}):
	if col.find("span", {"class": "bi_col_title"}).text == "Publication date":
		print(col.find("span", {"class": "bi_col_value"}).text)
		break
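A possibly less kludgy alternative, offered as a sketch rather than the one right way: find the title span by its text, then step to its sibling value span. It assumes the title text is exactly "Publication date" with no extra whitespace inside the span. The second half builds a dict of all rows, in case more than one field is needed.

```python
from bs4 import BeautifulSoup

html = '''\
<div class="bi_row">
  <span class="bi_col_title">Publisher</span>
  <span class="bi_col_value">Some publisher Inc</span>
</div>
<div class="bi_row">
  <span class="bi_col_title">Publication date</span>
  <span class="bi_col_value">January 1, 1999</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Option 1: locate the title span by its exact text, then take the next sibling span
title = soup.find("span", class_="bi_col_title", string="Publication date")
value = title.find_next_sibling("span", class_="bi_col_value") if title else None
pub_date = value.get_text(strip=True) if value else None
print(pub_date)

# Option 2: collect every title/value pair into a dict and look up what you need
details = {}
for row in soup.select("div.bi_row"):
    t = row.select_one("span.bi_col_title")
    v = row.select_one("span.bi_col_value")
    if t and v:  # skip rows that are missing either span
        details[t.get_text(strip=True)] = v.get_text(strip=True)
print(details.get("Publication date"))
```

Both variants guard against a missing span, so a row without a title or value is skipped instead of raising AttributeError.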