Hi guys,
After testing my code I found out that BeautifulSoup strips HTML tags when using:
get_text()
I'm getting data from an .xml file:
xml_content_body = soup.find('taskBody')
This field contains text with anchor text in it, like:
word word word <a href="https://www.thesite.com/">work</a> word etc
Is there a way to keep the html tags instead of stripping them with
get_text()?
# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')
# xml values
xml_content_body = soup.find('taskBody')
I cannot see a way to do this. Any help would be appreciated, guys!
Regards,
Graham
With HTML it's something like this:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen(some_url)
>>> bs = BeautifulSoup(html.read(), 'html.parser')
>>> title = bs.body.h1
>>> title
<h1>An Interesting Title</h1>
graham23s Wrote: Is there a way to keep the html tags instead of stripping them with get_text()?
The HTML tag is always kept, so I don't understand this.
You should only call get_text() when you need the text.
Maybe this will clear up the usage of get_text(), or the better option, .text.
Use .text as much as possible rather than get_text(). .text is a property that calls get_text(), so it's identical except that you don't use parentheses.
Only use get_text() if you need to pass in parameters, e.g. .get_text(strip=True, separator='\n').
Example:
from bs4 import BeautifulSoup
html = '''\
<body>
<div id='images'>
<a href='image1.html'>My image<img src='image1_thumb.jpg'/></a>
</div>
</body>
'''
soup = BeautifulSoup(html, 'lxml')
Use:
>>> image_tag = soup.find('div', id='images')
>>> image_tag.find('a')
<a href="image1.html">My image<img src="image1_thumb.jpg"/></a>
>>> # Best way if you only need the text
>>> image_tag.find('a').text
'My image'
>>> # Works the same way
>>> image_tag.find('a').get_text()
'My image'
>>> # But it can take parameters if needed
>>> image_tag.find('a').get_text(strip=True)
'My image'
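To illustrate when the parameters actually matter, here is a minimal sketch. The two-paragraph HTML snippet and the variable names are made up for illustration, and it uses the built-in 'html.parser' so nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not from the thread: two paragraphs with stray whitespace
html = "<div><p> First </p><p> Second </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .text and get_text() with no arguments return the same string
plain = div.text
same = div.get_text()

# strip=True trims each text fragment; separator is placed between fragments
lines = div.get_text(separator="\n", strip=True)

print(repr(plain))  # ' First  Second '
print(repr(lines))  # 'First\nSecond'
```

So .text is enough for a quick read of the text, and get_text() is only needed once the whitespace or the joining of fragments has to be controlled.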
Hello :)
If I do this test:
print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")
The string:
This is a <a href="https://www.thesite.com/">test</a> string.
is printed to the console exactly as typed above.
But when I get the exact same string from the XML file, the console shows:
This is a test string.
completely stripping the <a href=""></a> parts.
soup = BeautifulSoup(projects.text, 'xml')
xml_content_body = soup.find('taskBody')
print("XML: " + xml_content_body.text)  # prints "This is a test string." to the console
print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")  # prints "This is a <a href="https://www.thesite.com/">test</a> string." to the console
The RAW one is the one I need, but when I get the exact same line of text from the XML, the HTML <a href> tag is stripped. I read that get_text() strips all HTML tags, and .text also seems to strip them. I will keep debugging the issue :)
Regards
Quote: print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")
This is just a string, not valid HTML or XML.
In that case regex can be a better tool.
>>> import re
>>>
>>> s = "RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string."
>>> re.sub(r'<.*>', '', s)
'RAW: This is an  string.'
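Note that the pattern <.*> is greedy, so it removes everything from the first < to the last >, including the link text. If the goal is to drop only the tags, a pattern that stops at the first > may be closer to what is wanted. A small sketch (the variable names are just for illustration, and regex is only reasonable for simple strings like this one, not for real HTML):

```python
import re

s = 'RAW: This is an <a href="https://www.thesite.com/">test</a> string.'

# Greedy: matches from the first '<' to the last '>', so "test" is removed too
greedy = re.sub(r'<.*>', '', s)

# Stops at the first '>' after each '<', so only the tags themselves are removed
tags_only = re.sub(r'<[^>]+>', '', s)

print(repr(greedy))     # 'RAW: This is an  string.'
print(repr(tags_only))  # 'RAW: This is an test string.'
```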
Hi :)
Here is maybe a better example:
https://www.thesite.com/api.php?getProjectsToAction=1
That URL contains XML content. When I get this part:
<taskBody>This is a <a href="https://www.test.com/">test</a> using html tags.</taskBody>
via Python using BeautifulSoup, the string is not shown in the console with the <a href> tag intact, as:
This is a <a href="https://www.test.com/">test</a> using html tags.
Instead it shows as:
This is a test using html tags.
If I use get_text() it strips out all HTML tags (so I read somewhere), so is there a way to stop it stripping the HTML, do you think? :)
Thank you for the help!
Update:
Just fixed it. I did:
body_with_html = ''.join(str(c) for c in soup.find('taskBody').children)
which seems to keep the HTML intact.
Thank you for the help, guys!
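For anyone landing here later, a self-contained sketch of the fix above. The XML string is the <taskBody> sample from this thread embedded inline instead of fetched from the API, the <project> wrapper element is made up, and the 'xml' parser assumes lxml is installed. Tag.decode_contents() is shown as an alternative that, as far as I know, gives the same result for plain text like this:

```python
from bs4 import BeautifulSoup

# Sample from the thread, wrapped in a hypothetical <project> root element
xml = (
    '<project><taskBody>This is a <a href="https://www.test.com/">test</a> '
    'using html tags.</taskBody></project>'
)
soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
task_body = soup.find("taskBody")

# .text / get_text() return only the text nodes, so the <a> markup is dropped
text_only = task_body.text

# Serialising each child keeps nested tags as markup (the fix from the thread)
body_with_html = "".join(str(c) for c in task_body.children)

# decode_contents() renders the tag's inner markup in one call
inner = task_body.decode_contents()

print(text_only)       # This is a test using html tags.
print(body_with_html)  # This is a <a href="https://www.test.com/">test</a> using html tags.
```

The reason .text looked like it was "stripping" anything is that the parser turns the <a> element into a child tag of <taskBody>, and .text only collects the text nodes underneath it.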