Thread: Preserving anchor tags in BeautifulSoup (Python Forum, General Coding Help, /thread-18455.html)

Preserving anchor tags in BeautifulSoup - graham23s - May-18-2019

Hi Guys,

After testing my code I found out that BeautifulSoup strips HTML tags when using get_text(). I'm getting data from an .xml file:

    xml_content_body = soup.find('taskBody')

This field contains text with anchor tags in it, like:

    word word word <a href="https://www.thesite.com/">work</a> word etc

Is there a way to keep the HTML tags instead of stripping them with get_text()?

    # beautifulsoup setup
    soup = BeautifulSoup(projects.text, 'xml')

    # xml values
    xml_content_body = soup.find('taskBody')

I cannot see a way to do this, any help would be appreciated guys!

regards

Graham

RE: Preserving anchor tags in BeautifulSoup - perfringo - May-18-2019

With HTML it's something like this:

    >>> from urllib.request import urlopen
    >>> from bs4 import BeautifulSoup
    >>> html = urlopen(some_url)  # some_url is a placeholder
    >>> bs = BeautifulSoup(html.read(), 'html.parser')
    >>> title = bs.body.h1
    >>> title
    <h1>An Interesting Title</h1>

RE: Preserving anchor tags in BeautifulSoup - snippsat - May-18-2019

graham23s Wrote: Is there a way to keep the html tags instead of stripping them with get_text()?

The HTML tag is always kept, so I don't understand this.
You should only call get_text() when you need the text. Maybe this will clear up the usage of get_text(), or better, .text. Try to use .text as much as possible, and not get_text(). .text is a property that calls get_text(), so it's identical except you don't use parentheses. Only use get_text() if you need to pass in parameters, e.g. .get_text(strip=True, separator='\n').

Example:

    from bs4 import BeautifulSoup

    html = '''\
    <body>
      <div id='images'>
        <a href='image1.html'>My image<img src='image1_thumb.jpg'/></a>
      </div>
    </body>
    '''
    soup = BeautifulSoup(html, 'lxml')

Use:

    >>> image_tag = soup.find('div', id='images')
    >>> image_tag.find('a')
    <a href="image1.html">My image<img src="image1_thumb.jpg"/></a>

    >>> # Best way if you only need the text
    >>> image_tag.find('a').text
    'My image'

    >>> # Works the same way
    >>> image_tag.find('a').get_text()
    'My image'

    >>> # But the function can take parameters if needed
    >>> image_tag.find('a').get_text(strip=True)
    'My image'

RE: Preserving anchor tags in BeautifulSoup - graham23s - May-18-2019

Hello :)

If I do this test:

    print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")

The string:

    This is a <a href="https://www.thesite.com/">test</a> string.

is printed to the console exactly as typed above. But when I get the exact same string from the XML file the console shows:

    This is a test string.

completely stripping the <a href=""></a> parts.

    soup = BeautifulSoup(projects.text, 'xml')
    xml_content_body = soup.find('taskBody')

    # prints "XML: This is a test string." to the console
    print("XML: " + xml_content_body.text)

    # prints "RAW: This is an <a href="https://www.thesite.com/">test</a> string." to the console
    print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")

The RAW one is the one I need, but when I get the exact same line of text from the XML the <a href> tag is stripped. I read that get_text() strips all HTML tags, and .text also seems to strip them. I will keep debugging the issue :)

regards

RE: Preserving anchor tags in BeautifulSoup - snippsat - May-18-2019

Quote: print("RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string.")

This is just a string, not valid HTML or XML. In that case regex can be a better tool.

    >>> import re
    >>>
    >>> s = "RAW: This is an <a href=\"https://www.thesite.com/\">test</a> string."
    >>> re.sub(r'<.*>', '', s)
    'RAW: This is an  string.'

RE: Preserving anchor tags in BeautifulSoup - graham23s - May-19-2019

Hi :)

Here is maybe a better example: https://www.thesite.com/api.php?getProjectsToAction=1

That URL contains XML content. When I get this part:

    <taskBody>This is a <a href="https://www.test.com/">test</a> using html tags.</taskBody>

via Python using BeautifulSoup, the entire string including the <a href> is not showing in the console as:

    This is a <a href="https://www.test.com/">test</a> using html tags.

Instead it shows as:

    This is a test using html tags.

If I use get_text() it strips out all HTML tags (so I read somewhere), so without it the HTML shouldn't be stripped, do you think? :)

Thank you for the help!

RE: Preserving anchor tags in BeautifulSoup - graham23s - May-19-2019

Update:

Just fixed it, I did:

    body_with_html = ''.join(str(c) for c in soup.find('taskBody').children)

which seems to keep the HTML intact.

Thank you for the help guys!
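
For anyone finding this thread later, here is a minimal sketch of why that fix works. The parser turns the <a> inside <taskBody> into a child element, so .text / get_text() only returns the text nodes, while converting each child back to a string keeps the markup. The sample XML and its <root> wrapper are made up for illustration, the 'xml' parser assumes lxml is installed, and decode_contents() is shown only as a possible shorter alternative, not something used in the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; the <root> wrapper is an assumption for illustration.
xml = (
    '<root><taskBody>This is a <a href="https://www.test.com/">test</a> '
    'using html tags.</taskBody></root>'
)

# The 'xml' parser (as used in the thread) requires lxml to be installed.
soup = BeautifulSoup(xml, 'xml')
task_body = soup.find('taskBody')

# .text concatenates only the text nodes, so the <a> markup is dropped.
plain = task_body.text

# The fix from the thread: serialise every child (text or tag) and join them.
joined = ''.join(str(child) for child in task_body.children)

# Alternative: decode_contents() renders the tag's contents without the
# enclosing <taskBody> element itself.
inner = task_body.decode_contents()

print(plain)
print(joined)
print(inner == joined)
```

If the enclosing <taskBody> tag is wanted as well, str(task_body) should return the whole element including its own opening and closing tags.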