Python Forum
How do I turn a directory of HTML files into one CSV?


How do I turn a directory of HTML files into one CSV? - glittergirl - Sep-21-2019

I am currently trying to do the following:

1. Identify all the files that contain the text "business class".

2. Write every file that contains the text "business class" to a CSV with columns for "filename" and "text."

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)

The problem I'm encountering is that the code currently only loops through one folder in the directory and prints all four filenames in the "filename" column and all the texts in the "text" column, so I end up with one row containing four files.

So the CSV file should look something like:

Quote:
filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes



RE: How do I turn a directory of HTML files into one CSV? - woooee - Sep-21-2019

Quote:prints all four filenames in the "filename" column
You put all of the file names (the whole files list) in the row, not just the current one.
    lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.
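
To make the point above concrete, here is a minimal sketch, not the original poster's code. It builds a hypothetical temporary directory with two made-up files (a.html and b.html) and then prints what files, filename and html hold once the loops have finished. The names and file contents are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical demo directory with two small HTML files (names are made up).
with tempfile.TemporaryDirectory() as directory:
    for name in ("a.html", "b.html"):
        with open(os.path.join(directory, name), "w") as f:
            f.write("<p>business class on " + name + "</p>")

    for dirpath, dirs, files in os.walk(directory):
        # sorted() only makes the demo order predictable.
        for filename in sorted(files):
            with open(os.path.join(dirpath, filename)) as f:
                html = f.read()

    # After the loops: `files` is the whole list for the last directory
    # visited, while `filename` and `html` only hold the values from the
    # final iteration.
    print(sorted(files))
    print(filename)
    print(html)
```

Anything that should happen once per file (the "business class" check, building the row) therefore has to be indented inside the inner for loop.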


RE: How do I turn a directory of HTML files into one CSV? - glittergirl - Sep-21-2019

(Sep-21-2019, 04:04 PM)woooee Wrote:
Quote:prints all four filenames in the "filename" column
You put all of the file names (the whole files list) in the row, not just the current one.
    lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.

How do I put only one file name in the list? Do I do the following:

line = [[filename, html]]
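
Using filename instead of files is the right direction, but each row would need to be collected inside the loop, one [filename, html] list per file. Below is a rough sketch of one way to arrange that, under some assumptions: the helper names collect_rows and write_csv are hypothetical, the files are assumed to be UTF-8, the "text" column holds the raw HTML (BeautifulSoup's get_text() could be applied first if only the visible text is wanted), and the file-removal step from the original post is left out.

```python
import csv
import fnmatch
import os


def collect_rows(directory, phrase="business class"):
    """Return one [filename, html] row per matching .html file under directory."""
    rows = []
    for dirpath, dirs, files in os.walk(directory):
        for filename in fnmatch.filter(files, "*.html"):
            # encoding="utf-8" is an assumption about the input files.
            with open(os.path.join(dirpath, filename), encoding="utf-8") as f:
                html = f.read()
            # The check sits inside the inner loop, so every file is tested.
            if phrase in html:
                rows.append([filename, html])
    return rows


def write_csv(rows, out_path="test.csv"):
    """Open the CSV once, write the header, then one row per file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        writer.writerows(rows)
```

It could be called as write_csv(collect_rows("/directory")). Opening the output file once, after all rows are gathered, avoids overwriting test.csv on every match.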