Python Forum

I am trying to read all the HTML files in a directory and write them into a CSV file. Each row in the CSV file will contain the contents of one HTML file.

However, I seem to be able to read only one HTML file and write that one file into a single row of the CSV file.

import csv
import fnmatch
import os

directory = "directory/"

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            if 'apples and oranges' in html:
                with open('output.csv', 'w') as f:
                    writer = csv.writer(f)
                    lines = [[html]]
                    for l in lines:
                        writer.writerow(l)

I currently only see one HTML file being printed out into one CSV row.

What you need to do is scrape the contents of the HTML.
There are several tools to do this, and each works for certain types of HTML content.
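As a rough illustration of what scraping means here, the sketch below pulls the visible text out of an HTML string using only the standard library's html.parser. The names TextExtractor and html_to_text are made up for this example. A third-party parser such as Beautiful Soup is the more common choice for real pages, so treat this as a minimal sketch rather than a recommendation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, with text nodes joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling html_to_text(html) on the contents of each file would give plain text to write into the CSV instead of the raw markup. Note that joining text nodes with spaces is a simplification and can add a space where the original markup had none.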
There is a quick tutorial on this forum, written by Snippsat, that applies to both local HTML files and web pages:
Web scraping part 1
Web scraping part 2
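
Separately from scraping, one likely cause of the single-row output is that the posted code reopens output.csv in 'w' mode for every matching file. Mode 'w' truncates the file each time, so only the last match survives. A minimal sketch of a fix, assuming the goal is still one row per matching file, is to open the CSV once before the loop. The function name html_files_to_csv, its parameters, and the UTF-8 encoding are assumptions made for this example.

```python
import csv
import fnmatch
import os


def html_files_to_csv(directory, out_path, phrase):
    """Write one CSV row per HTML file under `directory` that contains `phrase`.

    Returns the number of rows written. Files are assumed to be UTF-8.
    """
    rows = 0
    # Open the output once; reopening it with 'w' inside the loop would
    # truncate it on every match and leave only the last row.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for dirpath, dirs, files in os.walk(directory):
            for filename in sorted(fnmatch.filter(files, "*.html")):
                path = os.path.join(dirpath, filename)
                with open(path, encoding="utf-8") as f:
                    html = f.read()
                if phrase in html:
                    # The whole file goes into a single cell of its own row.
                    writer.writerow([html])
                    rows += 1
    return rows
```

With the values from the question, the call would be html_files_to_csv("directory/", "output.csv", "apples and oranges"). Passing newline="" when opening the output is what the csv module documentation recommends, and it avoids blank lines between rows on Windows.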

glittergirl

Larz60+