cleaning HTML pages using lxml and XPath

wenkos · (This post was last modified: Aug-24-2021, 03:44 PM by wenkos.)

I'm new to Python and lxml. I have a basic task: I need to clean up HTML files in a local directory (recursively).
I want to remove unnecessary divs, including their content (divs with the IDs "box", "header", "columnLeft", "adbox", "footer", and div class="box", plus all stylesheets and scripts). I have the following code, which recursively searches for all HTML files and parses them:

#!/usr/bin/python
import os
import lxml.html as lh

path = '/path/to/directory'

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(name)
            root = tree.getroot()
            for element in root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)

But this code raises an IOError:

Output:
./clean2.py
Traceback (most recent call last):
  File "./clean2.py", line 10, in <module>
    tree = lh.parse(name)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117841)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118188)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117100)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111646)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105621)
IOError: Error reading file '5.htm': failed to load external entity "5.htm"

I don't know how to solve this. As far as I can tell, lxml raises this error when the file or directory is not found. How can I make the script handle this? Can anybody help?
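
A likely cause, going by the traceback: os.walk yields bare file names in files, relative to the directory in root, so lh.parse(name) looks for '5.htm' in the current working directory rather than in the subdirectory where it was found. Joining the directory and the name with os.path.join should avoid the error. The posted loop also rebinds root to the parsed document, which is worth renaming to keep the two apart. Below is a minimal sketch under some assumptions that are not in the original post: the names UNWANTED, clean_tree and clean_directory are hypothetical, ".html" is matched in addition to ".htm", each file is overwritten in place, and UTF-8 output is acceptable.

```python
import os
import lxml.html as lh

# One XPath union covering the elements listed in the post: the div ids,
# div class="box", scripts, inline styles and linked stylesheets.
UNWANTED = (
    '//div[@id="box" or @id="header" or @id="columnLeft"'
    ' or @id="adbox" or @id="footer" or @class="box"]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_tree(doc_root):
    """Remove every element matched by UNWANTED, in place."""
    for element in doc_root.xpath(UNWANTED):
        # drop_tree() removes the element and its children but keeps
        # any tail text that follows the element.
        element.drop_tree()
    return doc_root


def clean_directory(path):
    """Clean every .htm/.html file under path; return the rewritten paths."""
    cleaned = []
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith(('.htm', '.html')):
                continue
            # Build the full path; a bare name only resolves against the
            # current working directory.
            full_path = os.path.join(dirpath, name)
            tree = lh.parse(full_path)
            doc_root = tree.getroot()
            if doc_root is None:
                # Guard against documents with no root element.
                continue
            clean_tree(doc_root)
            # Assumption: overwrite the original file, serialized as HTML.
            tree.write(full_path, method='html', encoding='utf-8')
            cleaned.append(full_path)
    return cleaned
```

Calling clean_directory('/path/to/directory') would then process the whole tree. It is probably wise to try it on a copy of the files first, since the sketch overwrites them.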
