Python Forum
How do I read the HTML files in a directory and write the content into a CSV file?


How do I read the HTML files in a directory and write the content into a CSV file? - glittergirl - Sep-23-2019

I am trying to read all the HTML files in a directory and write them into a CSV file. Each row in the CSV file will contain the contents of one HTML file.

I seem to be able to only read one HTML file and write that one file into one row of a CSV file.

import csv
import fnmatch
import os

directory = "directory/"

# Walk the directory tree and pick out the .html files
for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            # Only keep files that mention the phrase
            if 'apples and oranges' in html:
                with open('output.csv', 'w') as f:
                    writer = csv.writer(f)
                    lines = [[html]]
                    for l in lines:
                        writer.writerow(l)

Currently I only see one HTML file written out, as a single CSV row.
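
The single-row result is consistent with output.csv being reopened in 'w' mode for every matching file: 'w' truncates the file each time, so only the last match survives. A minimal sketch of one way around that, assuming the output should be opened once before the walk and that the files are UTF-8 (the function name html_files_to_csv and its parameters are hypothetical, not from the original post):

```python
import csv
import fnmatch
import os


def html_files_to_csv(directory, output_path, phrase="apples and oranges"):
    """Write one CSV row per matching HTML file; return the number of rows."""
    rows_written = 0
    # Open the output once, before the walk, so earlier rows are not truncated.
    # newline="" is what the csv module expects for files it writes to.
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for dirpath, dirs, files in os.walk(directory):
            for filename in fnmatch.filter(files, "*.html"):
                path = os.path.join(dirpath, filename)
                # Assumption: the HTML files are UTF-8 encoded.
                with open(path, encoding="utf-8") as f:
                    html = f.read()
                if phrase in html:
                    writer.writerow([html])
                    rows_written += 1
    return rows_written
```

Calling html_files_to_csv("directory/", "output.csv") should then give one row per HTML file that contains the phrase. Opening in append mode ('a') inside the loop would also keep earlier rows, but it would keep rows from previous runs as well.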


RE: How do I read the HTML files in a directory and write the content into a CSV file? - Larz60+ - Sep-23-2019

What you need to do is scrape the contents of the HTML.
There are several tools to do this, and each works for certain types of HTML content.
There is a quick tutorial on this forum, written by Snippsat (it applies to both local HTML files and web pages):
Web scraping part 1
Web scraping part 2
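
As a rough illustration of what "scraping the contents" can mean for a local file, here is a minimal sketch that pulls the visible text out of an HTML string. It uses only the standard library's html.parser, not necessarily the tools covered in the linked tutorials, and the names TextExtractor and html_to_text are made up for this example:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The result of html_to_text(html) could be written to the CSV row in place of the raw markup. For messy real-world pages a dedicated scraping library is likely to be more robust than this sketch.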