Hello,
Does anyone know how to find the closing "</body>" tag in an HTML file?
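from bs4 import BeautifulSoup as bs

# Parse the file's contents; bs("file.html") would parse that literal string instead
with open("file.html", encoding="utf-8") as f:
    soup = bs(f, "html.parser")

# How to find </body>?
element = soup.body.previous_sibling
if element is None:
    print("Nothing")
else:
    print("Found :", element)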
Thank you.
You are not finding an opening or a closing tag. You parse the HTML into a BeautifulSoup object and [maybe] find the whole tag, which gives you an instance of bs4.element.Tag.
If you want to search for the string </body>, then maybe a regex is the tool you need, but that is NOT parsing HTML. Have a look at this famous answer on Stack Overflow.
Thanks. Indeed, it looks like using a regex would be simpler in this case.
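For what it's worth, here is a minimal sketch of the regex route, assuming the HTML is already in memory as a string. The names BODY_CLOSE and find_body_close are made up for this example. It only searches text, so it will also match a </body> that sits inside a comment or a script string.

```python
import re

# Matches </body>, tolerating whitespace before ">" and any letter case
BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)

def find_body_close(html):
    """Return (start, end) offsets of the last </body> in html, or None."""
    matches = list(BODY_CLOSE.finditer(html))
    if not matches:
        return None
    return matches[-1].span()

sample = "<html><body><p>Hi</p></body></html>"
span = find_body_close(sample)
if span is None:
    print("Nothing")
else:
    print("Found at", span)
```

Taking the last match rather than the first is a guess at what is usually wanted, since the real closing tag normally comes near the end of the file.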
from bs4 import BeautifulSoup as bs

# Load the HTML file
with open("file.html", "r", encoding="utf-8") as file:
    html_data = file.read()

# Create a BeautifulSoup object
soup = bs(html_data, "html.parser")

# The parse tree has no separate node for </body>, so look up the <body> element itself
body = soup.find("body")
if body is None:
    print("Nothing")
else:
    print("Found:", body)

1.> Open the HTML file in read mode and read its contents into the html_data variable.
2.> Create a BeautifulSoup object named soup to parse the HTML data.
3.> Use soup.find("body") to get the <body> element. Searching with soup.find_all(text="</body>") does not work here: it only matches text nodes, and the closing tag is markup rather than text, so it would normally return an empty list.
4.> If the element exists, we print it, which shows the whole <body> tag including its closing </body>. If it does not exist, we print "Nothing."
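If what you actually need is where the closing tag sits in the file, another option is the standard library's html.parser.HTMLParser, which calls handle_endtag for every closing tag it meets. The sketch below is only an illustration under that assumption; EndTagLocator and find_end_tags are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser

class EndTagLocator(HTMLParser):
    """Record the parser position at each closing tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.positions = []

    def handle_endtag(self, tag):
        # tag arrives lowercased; getpos() returns the current (line, offset)
        if tag == self.tag_name:
            self.positions.append(self.getpos())

def find_end_tags(html, tag_name="body"):
    locator = EndTagLocator(tag_name)
    locator.feed(html)
    locator.close()
    return locator.positions

sample = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>"
positions = find_end_tags(sample)
print(positions if positions else "Nothing")
```

Each entry is a (line, offset) pair with line numbers starting at 1. Unlike a regex, this ignores a </body> that appears inside a comment, because the parser never reports it as an end tag.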