Python Forum
How do I turn a directory of HTML files into one CSV?
#1
I am currently trying to do the following:

1. Identify all the files that contain the text "business class".

2. Write every file that contains the text "business class" to a CSV with the columns "filename" and "text".

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)
The problem is that the code currently only loops through one folder in the directory and writes all four filenames into the "filename" column and all the texts into the "text" column, so I end up with one row containing four files.

So the CSV file should look something like:

Quote:
filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes
#2
Quote: prints all four filenames in the "filename" column
You put all of the file names in the list, not just one.
    lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.
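To make the difference concrete, here is a small sketch. It assumes a throwaway directory with two made-up files, a.html and b.html, created only for the demonstration. In os.walk, files is the full list of names in the current folder, while filename is the single name the inner loop is currently on.

```python
import os
import tempfile

# Build a throwaway directory with two HTML files (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    for name in ("a.html", "b.html"):
        with open(os.path.join(root, name), "w") as f:
            f.write("business class")

    rows_wrong = []
    rows_right = []
    for dirpath, dirs, files in os.walk(root):
        for filename in files:
            # `files` is the whole list, so every row gets all the names.
            rows_wrong.append([files, "text"])
            # `filename` is one name, so every row gets exactly one file.
            rows_right.append([filename, "text"])

print(rows_wrong[0][0])  # a list holding both names
print(rows_right[0][0])  # a single name
```

The same reasoning applies to html: once the loop has ended it only holds the contents of the last file that was read.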
#3
(Sep-21-2019, 04:04 PM)woooee Wrote:
Quote: prints all four filenames in the "filename" column
You put all of the file names in the list, not just one.
 lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.

How do I put only one file name in the list? Would I do the following:

line = [[filename, html]]
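
For reference, using filename instead of files is one part of it, but the check also has to run once per file, and the CSV should be opened only once so that earlier rows are not overwritten. Below is a minimal sketch of that idea, not a definitive solution. The function name html_files_to_csv and its parameters are made up for illustration, the files are assumed to be UTF-8, and the sketch only reports the non-matching files instead of deleting them.

```python
import csv
import fnmatch
import os


def html_files_to_csv(directory, phrase, out_csv):
    """Write one CSV row per *.html file under `directory` containing `phrase`.

    Returns the paths of the files that did not contain the phrase.
    (Hypothetical helper; nothing is deleted here.)
    """
    non_matching = []
    # Open the CSV once, before the loops, so each match becomes its own row.
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "text"])
        for dirpath, dirs, files in os.walk(directory):
            for filename in fnmatch.filter(files, "*.html"):
                path = os.path.join(dirpath, filename)
                # Assumption: UTF-8 input; undecodable bytes are replaced.
                with open(path, encoding="utf-8", errors="replace") as f:
                    html = f.read()
                # The check sits inside the inner loop, so it runs per file.
                if phrase in html:
                    writer.writerow([filename, html])
                else:
                    non_matching.append(path)
    return non_matching
```

A call could look like html_files_to_csv("/directory", "business class", "test.csv"). The returned list can then be passed to the existing os.remove loop if deleting the other files is still wanted.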