Python Forum
Preserving anchor tags in BeautifulSoup
#1
Hi Guys,

After testing my code I found that BeautifulSoup strips HTML tags when I use get_text().

I'm getting data from an .xml file:

xml_content_body = soup.find('taskBody')
This field contains text with anchor tags in it, like:

word word word <a href="https://www.thesite.com/">work</a> word etc
Is there a way to keep the HTML tags instead of stripping them with get_text()?

# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')

# xml values
xml_content_body = soup.find('taskBody')
I cannot see a way to do this, so any help would be appreciated, guys!

regards

Graham
#2
With HTML it works something like this:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen(some_url)
>>> bs = BeautifulSoup(html.read(), 'html.parser')
>>> title = bs.body.h1
>>> title
<h1>An Interesting Title</h1>
#3
graham23s Wrote: Is there a way to keep the html tags instead of stripping them with get_text()?
The HTML tag is always kept, so I don't understand this.
You should only call get_text() when you need just the text.

Maybe this will clear up the usage of get_text(), or better, .text.
Try to use .text as much as possible rather than get_text().
.text is a property that calls get_text(), so it's identical except that you don't use parentheses.
Only use get_text() if you need to pass in parameters, e.g. .get_text(strip=True, separator='\n').
Example:
from bs4 import BeautifulSoup

html = '''\
<body>
 <div id='images'>
   <a href='image1.html'>My image<img src='image1_thumb.jpg'/></a>
 </div>
</body>
'''
soup = BeautifulSoup(html, 'lxml')
Use:
>>> image_tag = soup.find('div', id='images')
>>> image_tag.find('a')
<a href="image1.html">My image<img src="image1_thumb.jpg"/></a>

>>> # Best way if only need text
>>> image_tag.find('a').text
'My image'

>>> # Work the same way
>>> image_tag.find('a').get_text()
'My image'

>>> # But it can take parameters if needed
>>> image_tag.find('a').get_text(strip=True)
'My image'
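To make the difference between .text and get_text() with parameters visible, here is a minimal sketch. The two-paragraph snippet is made up for illustration, and it uses the built-in 'html.parser' so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with padded text in two sibling tags
html = "<div><p> First </p><p> Second </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .text concatenates every string exactly as it appears in the document
plain = div.text

# get_text() with parameters strips each string and joins them with a separator
joined = div.get_text(separator="\n", strip=True)

print(repr(plain))   # ' First  Second '
print(repr(joined))  # 'First\nSecond'
```

So .text is fine for a quick look, and the parameters only matter when the whitespace or the joins between tags need cleaning up.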
#4
Hello :)

If I do this test:

print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")
The string This is an <a href="https://www.thesite.com/">test</a> string. is printed to the console exactly as typed above.

But when I get the exact same string from the XML file, the console shows: This is a test string. The <a href=""></a> parts are stripped completely.

soup = BeautifulSoup(projects.text, 'xml')

xml_content_body = soup.find('taskBody')

print("XML: " + xml_content_body.text)  # prints: This is a test string.
print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")  # prints: This is an <a href="https://www.thesite.com/">test</a> string.
The RAW one is the one I need, but when I get the exact same line of text from the XML, the <a href> tag is stripped. I read that get_text() strips all HTML tags, and .text seems to strip them too. I will keep debugging the issue :)

regards
#5
Quote: print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")
This is just a string, not valid HTML or XML.
In that case regex can be a better tool.
>>> import re
>>> 
>>> s = "RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string."
>>> re.sub(r'<.*>', '', s)
'RAW: This is an  string.'
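Note that the pattern above is greedy: `<.*>` matches from the first `<` to the last `>`, so the word between the tags disappears as well. A small sketch of the alternative, using the same string and a pattern of my own choosing (`<[^>]+>`), which removes one tag at a time:

```python
import re

s = 'RAW: This is an <a href="https://www.thesite.com/">test</a> string.'

# Greedy: one match spanning the opening tag, the word "test", and the closing tag
greedy = re.sub(r'<.*>', '', s)

# Per-tag: each match stops at the first '>', so the text between tags survives
per_tag = re.sub(r'<[^>]+>', '', s)

print(greedy)   # RAW: This is an  string.
print(per_tag)  # RAW: This is an test string.
```

This is only reasonable for short, simple strings like this one. For real documents a parser is the safer tool.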
#6
Hi :)

Here is maybe a better example: https://www.thesite.com/api.php?getProjectsToAction=1

That URL contains XML content. When I get this part:

<taskBody>This is a &lt;a href=&quot;https://www.test.com/&quot;&gt;test&lt;/a&gt; using html tags.</taskBody>
Via Python using BeautifulSoup, the console does not show the string with the <a href> intact, as: This is a <a href="https://www.test.com/">test</a> using html tags. Instead it shows: This is a test using html tags.

I read somewhere that get_text() strips out all HTML tags. Do you think it shouldn't be stripping the HTML here? :)
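One thing that may be worth checking here is whether the body really arrives entity-escaped, as in the sample above. The following is only a sketch under that assumption, with the built-in 'html.parser' standing in for the thread's 'xml' parser (which needs lxml). Note that html.parser lowercases tag names, hence 'taskbody'.

```python
from bs4 import BeautifulSoup

# The sample from this post: the anchor is entity-escaped, so it is plain
# character data inside <taskBody>, not a nested element
xml = ('<taskBody>This is a &lt;a href=&quot;https://www.test.com/&quot;&gt;'
       'test&lt;/a&gt; using html tags.</taskBody>')

soup = BeautifulSoup(xml, 'html.parser')
body = soup.find('taskbody')  # html.parser lowercases tag names

# No <a> element exists in the tree; the markup is part of the text node
anchors = body.find_all('a')
body_text = body.text

print(len(anchors))  # 0
print(body_text)     # This is a <a href="https://www.test.com/">test</a> using html tags.
```

In this check the escaped markup comes back as part of the text. If the anchor vanishes in the real program, the feed probably contains an actual <a> element, or the text goes through a second parsing step somewhere.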

thank you for the help!
#7
Update:

Just fixed it. I did:

body_with_html = ''.join(str(c) for c in soup.find('taskBody').children)
This seems to keep the HTML intact.
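For anyone landing here later, a minimal sketch of why joining the children works, assuming the body holds a real nested <a> element. The sample string is made up, and the built-in 'html.parser' is used instead of 'xml' (an assumption, to avoid needing lxml), so the tag is looked up as 'taskbody'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with a real <a> element nested inside the body
doc = '<taskBody>This is a <a href="https://www.thesite.com/">test</a> string.</taskBody>'

soup = BeautifulSoup(doc, 'html.parser')
body = soup.find('taskbody')  # html.parser lowercases tag names

# .text keeps only the strings and drops the markup
text_only = body.text

# str() on each child renders nested tags back to markup
joined = ''.join(str(child) for child in body.children)

# decode_contents() renders the tag's inner markup in one call
inner = body.decode_contents()

print(text_only)  # This is a test string.
print(joined)     # This is a <a href="https://www.thesite.com/">test</a> string.
print(inner)      # This is a <a href="https://www.thesite.com/">test</a> string.
```

Tag.decode_contents() looks like a shorter route to the same result for this input, while the join makes it easier to filter individual children if that is ever needed.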

Thank you for the help guys!