Python Forum
cleaning HTML pages using lxml and XPath
#1
I'm new to Python and lxml. I have a basic task: I need to clean up HTML files in a local directory (recursively).
I want to remove unnecessary divs, including their content (divs with the IDs "box", "header", "columnLeft", "adbox", and "footer", plus any div with class="box"), as well as all stylesheets and scripts. I have the following code, which recursively searches for all HTML files and parses them:

#!/usr/bin/python
import os
import lxml.html as lh

path = '/path/to/directory'

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(name)
            root = tree.getroot()
            for element in root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
But this code raises an IOError:

Output:
./clean2.py
Traceback (most recent call last):
  File "./clean2.py", line 10, in <module>
    tree = lh.parse(name)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117841)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118188)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117100)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111646)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105621)
IOError: Error reading file '5.htm': failed to load external entity "5.htm"
I don't know how to solve this. It looks like lxml raises this error when the file or directory cannot be found. How can I make the script handle this? Can anybody help?
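
The traceback points at a likely cause: os.walk() yields bare file names, so lh.parse(name) looks for '5.htm' in the current working directory instead of in the folder where it was found. Below is a minimal sketch of one possible fix, assuming the file name only needs to be joined with the directory that os.walk() reports. The names REMOVE_XPATH, clean_file and clean_directory are hypothetical, the XPath is only a guess built from the IDs and class listed in the post, and writing the result back over the original file is an assumption about what "clean up" should mean here, so try it on a copy of the directory first.

```python
import os
import lxml.html as lh

# Hypothetical selector built from the ids/class named in the post.
# Note: @class="box" only matches when "box" is the whole class attribute.
REMOVE_XPATH = (
    '//div[@id="box" or @id="header" or @id="columnLeft" '
    'or @id="adbox" or @id="footer" or @class="box"]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_file(filepath):
    """Parse one HTML file, drop the unwanted elements, write it back."""
    tree = lh.parse(filepath)
    doc = tree.getroot()
    if doc is None:
        # Nothing was parsed (for example an empty file), so leave it alone.
        return
    for element in doc.xpath(REMOVE_XPATH):
        # drop_tree() removes the element and its children but keeps
        # any text that follows the element in its parent.
        element.drop_tree()
    # Assumption: overwrite the original file in place.
    tree.write(filepath, method="html", encoding="utf-8")


def clean_directory(path):
    """Walk path recursively and clean every .htm/.html file found."""
    cleaned = []
    # Use a name other than "root" so the directory is not shadowed
    # by the parsed document's root element.
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if name.endswith((".htm", ".html")):
                # Join with dirpath: "name" alone is relative to the
                # current working directory, not to the folder walked.
                filepath = os.path.join(dirpath, name)
                clean_file(filepath)
                cleaned.append(filepath)
    return cleaned
```

Calling clean_directory('/path/to/directory') would then process every matching file and return the list of paths it touched. Joining dirpath and name fixes the path that is handed to lh.parse(), rather than hiding the IOError with a try/except.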


Messages In This Thread
cleaning HTML pages using lxml and XPath - by wenkos - Aug-24-2021, 03:44 PM
