I'm new to Python and lxml. I have a basic task: I need to clean up HTML files in a local directory (recursively).
I want to remove unnecessary divs, including their content (divs with the IDs "box", "header", "columnLeft", "adbox", "footer", and div class="box", plus all stylesheets and scripts). I have the following code that recursively searches for all HTML files and parses them:

    #!/usr/bin/python

    import os
    import lxml.html as lh

    path = '/path/to/directory'
    for root, dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".htm"):
                tree = lh.parse(name)
                root = tree.getroot()
                for element in root.xpath('//div[@id="header"]'):
                    element.getparent().remove(element)

But this code gives an IOError:
Output:

    ./clean2.py
    Traceback (most recent call last):
      File "./clean2.py", line 10, in <module>
        tree = lh.parse(name)
      File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 940, in parse
        return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
      File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81110)
      File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117841)
      File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118188)
      File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117100)
      File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111646)
      File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
      File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
      File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105621)
    IOError: Error reading file '5.htm': failed to load external entity "5.htm"

I don't know how to solve this. As far as I can tell, lxml raises this error when the file or directory cannot be found. How can I make the script handle this? Can anybody help?
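
One likely cause, going by the traceback: os.walk yields bare file names, so lh.parse(name) looks for '5.htm' in the current working directory instead of the directory being walked. The script also reuses the name root for both the os.walk directory and the parsed document. Below is a minimal sketch of one way around both, not a definitive fix. The names UNWANTED, iter_html_files and clean_file are made up for illustration, the XPath is only a guess at what should be removed based on the list in the question, and matching ".html" as well as ".htm" is an assumption. It uses drop_tree() from lxml.html rather than getparent().remove() so that text following a removed element is kept.

```python
import os
import lxml.html as lh

# Hypothetical XPath covering what the question lists: the div ids,
# div class="box", scripts, inline styles and linked stylesheets.
UNWANTED = (
    '//div[@id="box" or @id="header" or @id="columnLeft" '
    'or @id="adbox" or @id="footer" or @class="box"]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def iter_html_files(top):
    """Yield the full path of every .htm/.html file below top."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith((".htm", ".html")):
                # os.walk gives bare names; join them with dirpath so they
                # are not resolved against the current working directory.
                yield os.path.join(dirpath, name)


def clean_file(path):
    """Parse one file, drop the unwanted elements, and overwrite it."""
    tree = lh.parse(path)
    for element in tree.getroot().xpath(UNWANTED):
        # drop_tree() removes the element and its children but keeps
        # the tail text that follows it in the document.
        element.drop_tree()
    tree.write(path, method="html", encoding="utf-8")


if __name__ == "__main__":
    for path in iter_html_files("/path/to/directory"):
        try:
            clean_file(path)
        except (IOError, OSError) as exc:
            # Unreadable or missing files are reported and skipped.
            print("skipped %s: %s" % (path, exc))
```

Note that clean_file overwrites each file in place, so it is safer to try it on a copy of the directory first. The try/except only skips files that cannot be read; it does not hide the path problem, which is handled by os.path.join.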