Python Forum
cleaning HTML pages using lxml and XPath
#1
I'm new to Python and lxml. I have a basic task: I need to clean up HTML files in a local directory (recursively).
I want to remove unnecessary divs, including their content (divs with the IDs "box", "header", "columnLeft", "adbox", "footer", any div with class="box", plus all stylesheets and scripts). I have the following code, which recursively searches for all HTML files and parses them:

#!/usr/bin/python
import os
import lxml.html as lh

path = '/path/to/directory'

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(name)
            root = tree.getroot()
            for element in root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
But this code raises an IOError:

Output:
./clean2.py
Traceback (most recent call last):
  File "./clean2.py", line 10, in <module>
    tree = lh.parse(name)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117841)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118188)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117100)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111646)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
  File "src/lxml/parser.pxi", line 633, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105621)
IOError: Error reading file '5.htm': failed to load external entity "5.htm"
I don't know how to solve this. Possibly this error is how lxml etree reports that a file or directory was not found. If so, how can I make the script robust against it? Can anybody help with that?
#2
Do not use Python 2.7; it has been dead since 2020 💀
html-text is best suited for this.
It is easy to think about this the wrong way: removing stuff, rather than just parsing out the stuff you need and leaving all the unnecessary stuff alone.
Don't write a loop before you have tested on a few single HTML files.
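On the IOError itself: os.walk yields bare file names in files, so lh.parse(name) looks for '5.htm' relative to the current working directory instead of the directory being walked, which would explain the "failed to load external entity" message. The posted loop also rebinds root, the directory variable from os.walk, to the parsed tree. A minimal sketch of just the path handling, with no lxml involved yet; iter_html_files and its suffixes parameter are hypothetical names, not part of the original script or of any library:

```python
import os


def iter_html_files(top, suffixes=(".htm", ".html")):
    """Yield the full path of every HTML file below `top` (hypothetical helper)."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # str.endswith accepts a tuple, so both extensions are checked at once
            if name.lower().endswith(suffixes):
                # os.walk gives bare names; join with dirpath to get a usable path
                yield os.path.join(dirpath, name)
```

Each yielded path can then be passed to lh.parse() in place of the bare name.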
#3
In my case, the HTML markup is still required. The goal is not to extract plain text from the HTML, but to clean the HTML of unnecessary elements. I need to prepare the HTML for an e-book format, which supports HTML and CSS styling.
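Since the markup has to survive, one option is to stay with lxml and drop only the unwanted elements. A rough sketch, assuming each input is a full HTML document; clean_html and UNWANTED are hypothetical names, and the ids and class are the ones listed in post #1. It uses drop_tree() from lxml.html rather than getparent().remove(), because remove() also discards any text that directly follows the removed element (its tail).

```python
import lxml.html as lh

# One XPath union for everything post #1 wants gone.
# The class test matches "box" as a whole word, so class="boxed" is left alone.
UNWANTED = (
    '//div[@id="box" or @id="header" or @id="columnLeft" '
    'or @id="adbox" or @id="footer"]'
    ' | //div[contains(concat(" ", normalize-space(@class), " "), " box ")]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_html(source):
    """Return `source` as HTML text with the unwanted elements removed."""
    root = lh.fromstring(source)
    for element in root.xpath(UNWANTED):
        # drop_tree() removes the element and its children but keeps the tail text
        element.drop_tree()
    return lh.tostring(root, encoding="unicode")
```

Try it on a copy of a single file first, before running it over the whole directory.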

