Python Forum

Full Version: cleaning HTML pages using lxml and XPath
I'm new to Python and lxml. I have a basic task: I need to clean up the HTML files in a local directory (recursively).
I want to remove unnecessary divs together with their content (divs with the IDs "box", "header", "columnLeft", "adbox", "footer", any div with class="box", plus all stylesheets and scripts). I have the following code, which recursively searches for all HTML files and parses them:

#!/usr/bin/python
import os
import lxml.html as lh

path = '/path/to/directory'

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(name)
            root = tree.getroot()
            for element in root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
But this code gives IOError:

Output:
./clean2.py
Traceback (most recent call last):
  File "./clean2.py", line 10, in <module>
    tree = lh.parse(name)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117841)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118188)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117100)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111646)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105621)
IOError: Error reading file '5.htm': failed to load external entity "5.htm"
I don't know how to solve this. Possibly this error is raised by lxml's etree when the file or directory is not found. If so, how can I make the script cope with that? Can anybody help?
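One likely cause, judging from the traceback: os.walk yields bare file names in `files`, each relative to the directory in `root`, so lh.parse(name) looks for '5.htm' in the current working directory rather than where the file actually lives. Below is a minimal sketch of the path handling only, assuming that is indeed the problem; the helper name `iter_html_files` and its `suffix` parameter are made up for illustration.

```python
import os


def iter_html_files(top, suffix=".htm"):
    """Yield the full path of every file under `top` whose name ends with `suffix`."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(suffix):
                # `name` is relative to `dirpath`, not to the current working
                # directory, so join the two before handing it to a parser.
                yield os.path.join(dirpath, name)
```

With this, the loop in the question could become `for path in iter_html_files('/path/to/directory'): tree = lh.parse(path)`. Using `dirpath` instead of `root` as the loop variable also avoids the name being overwritten by `root = tree.getroot()`.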
Do not use Python 2.7; it reached end of life in 2020 💀
html-text is best suited for this.
People often approach this the wrong way round: they try to remove stuff, rather than just parsing out the stuff that is needed and leaving all the unnecessary stuff alone.
Don't add the loop until you have tested on a few HTML files individually.
In my case, the HTML markup is still required. The goal is not to extract plain text from the HTML, but to clean the HTML of unnecessary elements. I need to prepare the HTML for an e-book format, which supports HTML and CSS styling.
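For removing elements while keeping the rest of the markup, here is a rough sketch using lxml.html, tried on a single HTML string before wiring it into any directory loop. The function name `clean_html` and the constant `UNWANTED` are hypothetical; the XPath simply lists the ids and class from the first post and treats `<script>`, `<style>`, and `<link rel="stylesheet">` as the "scripts and stylesheets" to drop, which is an assumption about what was meant.

```python
import lxml.html as lh

# Elements to drop: the div ids named in the question, any div whose class
# list contains "box", plus scripts and stylesheets (assumed to mean
# <script>, <style> and <link rel="stylesheet">).
UNWANTED = (
    '//div[@id="box" or @id="header" or @id="columnLeft" '
    'or @id="adbox" or @id="footer"]'
    ' | //div[contains(concat(" ", normalize-space(@class), " "), " box ")]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_html(source):
    """Return `source` (an HTML string) with the unwanted elements removed."""
    # document_fromstring always returns an <html> root, so a matched div
    # is never the root element and always has a parent to be dropped from.
    root = lh.document_fromstring(source)
    for element in root.xpath(UNWANTED):
        # drop_tree() removes the element and its children but keeps any
        # text that follows it, unlike getparent().remove(element).
        element.drop_tree()
    return lh.tostring(root, encoding="unicode")
```

Once this gives the expected result on a few files, the cleaned string can be written back out for each path found by the directory walk.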