Python Forum

Hi Guys,

After testing my code, I found out that BeautifulSoup strips HTML tags when using get_text().

I'm getting data from an .xml file:

xml_content_body = soup.find('taskBody')

This field contains text with an anchor tag in it, like:

word word word <a href="https://www.thesite.com/">work</a> word etc

Is there a way to keep the html tags instead of stripping them with get_text()?

# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')

# xml values
xml_content_body = soup.find('taskBody')

I cannot see a way to do this, any help would be appreciated guys!

regards

Graham

With HTML it's something like this:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen(some_url)
>>> bs = BeautifulSoup(html.read(), 'html.parser')
>>> title = bs.body.h1
>>> title
<h1>An Interesting Title</h1>

graham23s Wrote: Is there a way to keep the html tags instead of stripping them with get_text()?

The HTML tag is always kept, so I don't understand this.
You should only call get_text() when you need the text.

Maybe this will clear up the usage of get_text(), or better, .text.
Try to use .text as much as possible instead of get_text().
.text is a property that calls get_text(), so it's identical except that you don't use parentheses.
Only use get_text() if you need to pass in parameters, e.g. .get_text(strip=True, separator='\n').
Example:

from bs4 import BeautifulSoup

html = '''\
<body>
 <div id='images'>
   <a href='image1.html'>My image<img src='image1_thumb.jpg'/></a>
 </div>
</body>
'''
soup = BeautifulSoup(html, 'lxml')

Use:

>>> image_tag = soup.find('div', id='images')
>>> image_tag.find('a')
<a href="image1.html">My image<img src="image1_thumb.jpg"/></a>

>>> # Best way if you only need the text
>>> image_tag.find('a').text
'My image'

>>> # Works the same way
>>> image_tag.find('a').get_text()
'My image'

>>> # But it can take parameters if needed
>>> image_tag.find('a').get_text(strip=True)
'My image'
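
To make the difference between .text and get_text() with parameters a bit more visible, here is a small sketch. The two-paragraph snippet is made up for illustration, and it uses the built-in 'html.parser' so nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with extra whitespace inside each paragraph
html = "<div><p> First </p><p> Second </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

plain = div.text            # property, same result as get_text()
same = div.get_text()       # no parameters: strings are joined as-is
cleaned = div.get_text(separator="\n", strip=True)  # strip each string, join with newline

print(repr(plain))    # ' First  Second '
print(repr(cleaned))  # 'First\nSecond'
```

So the parameters only matter when you want to control whitespace or how the pieces are joined.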

Hello :)

If I do this test:

print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")

The string This is an <a href="https://www.thesite.com/">test</a> string. is printed to the console exactly as typed above.

But when I get the exact same string from the XML file, the console shows: This is a test string. with the <a href=""></a> parts completely stripped.

soup = BeautifulSoup(projects.text, 'xml')

xml_content_body = soup.find('taskBody')

print("XML: " + xml_content_body.text)  # prints: XML: This is a test string.
print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")  # prints: RAW: This is an <a href="https://www.thesite.com/">test</a> string.

The RAW one is the one I need, but when I get the exact same line of text from the XML, the <a href> tag is stripped. I read that get_text() strips all HTML tags, and .text also seems to strip them. I will keep debugging the issue :)

regards

Quote: print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")

This is just a string, not valid HTML or XML.
In that case regex can be a better tool.

>>> import re
>>> 
>>> s = "RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string."
>>> re.sub(r'<.*>', '', s)
'RAW: This is an  string.'
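
Note that the greedy pattern above also removes the link text between the tags. If the goal were to drop only the tags and keep the text, a narrower pattern could be used. This is only a sketch for simple strings like this one (the pattern `<[^>]+>` is my choice, not something from the posts above), and it is not a general HTML parser:

```python
import re

s = 'RAW: This is an <a href="https://www.thesite.com/">test</a> string.'

# Greedy: matches from the first '<' to the last '>', so "test" is removed too
greedy = re.sub(r"<.*>", "", s)

# Narrower: each match stops at the first '>', so only the tags are removed
tags_only = re.sub(r"<[^>]+>", "", s)

print(greedy)     # RAW: This is an  string.
print(tags_only)  # RAW: This is an test string.
```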

Hi :)

Here is maybe a better example: https://www.thesite.com/api.php?getProjectsToAction=1

That URL contains XML content. When I get this part:

<taskBody>This is a &lt;a href=&quot;https://www.test.com/&quot;&gt;test&lt;/a&gt; using html tags.</taskBody>

Via Python using BeautifulSoup, the string does not show in the console with the link intact, as: This is a <a href="https://www.test.com/">test</a> using html tags. Instead it shows as: This is a test using html tags.

If I use get_text(), it strips out all HTML tags (so I read somewhere), so without it the HTML shouldn't be stripped, do you think? :)
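
For reference, here is a rough sketch of how an entity-escaped payload behaves, assuming the element arrives exactly as pasted above. It uses the built-in 'html.parser' instead of 'xml' so it runs without lxml; that parser lowercases tag names, hence find('taskbody'). Under that assumption the escaped markup is just text, so .text returns it with the tags as literal characters. If the real feed contains unescaped tags, they become child elements and .text drops them.

```python
from bs4 import BeautifulSoup

# Payload copied from the post; the inner markup is entity-escaped
xml = ('<taskBody>This is a &lt;a href=&quot;https://www.test.com/&quot;&gt;'
       'test&lt;/a&gt; using html tags.</taskBody>')

soup = BeautifulSoup(xml, "html.parser")
body = soup.find("taskbody")  # html.parser lowercases tag names

# Entities are decoded, so the tags come back as plain characters
escaped_text = body.text
print(escaped_text)

# Parsing that string a second time turns the link into a real tag
inner = BeautifulSoup(escaped_text, "html.parser")
link = inner.find("a")
print(link["href"], link.text)
```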

thank you for the help!

Update:

Just fixed it. I did:

body_with_html = ''.join(str(c) for c in soup.find('taskBody').children)

Which seems to keep the html intact.
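
A small self-contained sketch of the same idea, assuming the element contains a real (unescaped) anchor tag like the example in the first post. It uses 'html.parser' so it runs without lxml, which lowercases the tag name to 'taskbody'; with the 'xml' parser the name stays 'taskBody'. It also shows decode_contents(), which returns the inner markup of a tag in one call:

```python
from bs4 import BeautifulSoup

# Sample content modelled on the first post
xml = '<taskBody>word word <a href="https://www.thesite.com/">work</a> word</taskBody>'

soup = BeautifulSoup(xml, "html.parser")
body = soup.find("taskbody")  # html.parser lowercases tag names

text_only = body.get_text()                      # markup of child tags is dropped
joined = "".join(str(c) for c in body.children)  # approach from the post above
inner = body.decode_contents()                   # inner markup as a string

print(text_only)  # word word work word
print(joined)     # word word <a href="https://www.thesite.com/">work</a> word
print(inner == joined)
```

For this input both approaches give the same string. They can differ when the text contains characters such as & or <, because decode_contents() escapes them on output while str() on a text node does not.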

Thank you for the help guys!

graham23s

perfringo

snippsat

graham23s

snippsat

graham23s

graham23s