How do I turn a directory of HTML files into one CSV?

glittergirl · Sep-21-2019, 02:33 PM

I am currently trying to do the following:

1. Identify all the files that contain the text "business class".

2. Write every file that contains the text "business class" to a CSV with the columns "filename" and "text".

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)

The problem that I'm encountering is that the code currently only loops through one folder in the directory and prints all four filenames in the "filename" column and all the texts in the "text" column. So I have one row containing four files.

So the CSV file should look something like:

Quote:
filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes

woooee · Sep-21-2019, 04:04 PM (last modified Sep-21-2019, 04:06 PM)

Quote:prints all four filenames in the "filename" column

You put all of the file names in the list, not just one.

    lines = [[files, html]]

I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or few) files it contains.
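To illustrate the point, here is a minimal sketch of what os.walk hands back on each pass. The helper name walk_summary and the demo file names a.html and b.html are made up for this example; only os.walk itself comes from the original code.

```python
import os
import tempfile


def walk_summary(directory):
    """Return (dirpath, files) pairs so the shape of `files` is visible."""
    summary = []
    for dirpath, dirs, files in os.walk(directory):
        # `files` is a list holding every file name in this one folder.
        summary.append((dirpath, sorted(files)))
    return summary


# Throwaway folder with two hypothetical HTML files, just for the demo.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.html", "b.html"):
        with open(os.path.join(demo_dir, name), "w") as f:
            f.write("<p>business class</p>")

    for dirpath, files in walk_summary(demo_dir):
        print(type(files).__name__, files)  # a list, not a single name
        for filename in files:
            print(type(filename).__name__, filename)  # one str per pass
```

Writing `files` into a CSV cell therefore puts the whole list of names in that cell, while `filename` is the single name for the current pass of the inner loop.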

glittergirl · Sep-21-2019, 05:33 PM

(Sep-21-2019, 04:04 PM)woooee Wrote:
Quote:prints all four filenames in the "filename" column
You put all of the file names in the list, not just one.
    lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or few) files it contains.

How do I put only one file name in the list? Do I do the following:

line = [[filename, html]]
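Using filename instead of files is the right direction, but the check also has to run once per file, inside the inner loop, and the CSV should be opened only once so earlier rows are not overwritten. The following is a rough sketch of one way to arrange that, not a definitive fix. The function names collect_matches and write_rows are invented for this example, the utf-8 encoding is an assumption about the files, the raw HTML is stored as the "text" value just as in the original code, and the file-removal step is left out.

```python
import csv
import fnmatch
import os


def collect_matches(directory, phrase="business class"):
    """Return one [filename, html] row per HTML file containing `phrase`."""
    rows = []
    for dirpath, dirs, files in os.walk(directory):
        for filename in fnmatch.filter(files, "*.html"):
            path = os.path.join(dirpath, filename)
            # Assumption: files are utf-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as f:
                html = f.read()
            # The check sits inside the inner loop, so it runs once per file.
            if phrase in html:
                rows.append([filename, html])
    return rows


def write_rows(rows, csv_path):
    """Write the header once, then one CSV row per matching file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        writer.writerows(rows)
```

With the placeholder paths from the first post, the call would be write_rows(collect_matches("/directory"), "test.csv"). Each row is a flat [filename, html] list, so the extra pair of brackets in [[filename, html]] is only needed if that single row is being wrapped in a list of rows.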
