 How do I turn a directory of HTML files into one CSV?
#1
I am currently trying to do the following:

1. Identify all the files that contain the text "business class".

2. Write every file that contains "business class" to a CSV with the columns "filename" and "text".

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print('REMOVED: %s' % each)
The problem that I'm encountering is that the code currently only loops through one folder in the directory and prints all four filenames in the "filename" column and all the texts in the "text" column. So I have one row containing four files.

So the CSV file should look something like:

Quote:
filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes
#2
Quote:prints all four filenames in the "filename" column
You put all of the file names in the list, not just one.
    lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.
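To see the difference between files and filename, here is a small sketch. The temporary folder and the names page1.html and page2.html are made up for illustration; they are not from the original post.

```python
import os
import tempfile

# Throwaway folder with two hypothetical files, just for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page1.html", "page2.html"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("business class")

    for dirpath, dirs, files in os.walk(tmp):
        # `files` is every name in this folder, so this row holds a whole list.
        row_with_list = [sorted(files), "some text"]
        # Iterating gives one name at a time, so each row holds a single string.
        rows_per_file = [[filename, "some text"] for filename in sorted(files)]

print(row_with_list)
print(rows_per_file)
```

The first print shows one row whose first cell is a list of names, which is what ends up in the "filename" column when files is written. The second shows one row per file.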
#3
(Sep-21-2019, 04:04 PM)woooee Wrote:
Quote:prints all four filenames in the "filename" column
You put all of the file names in the list, not just one.
 lines = [[files, html]]
I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or how few) files it contains.

How do I put only one file name in the list? Do I do the following:

line = [[filename, html]]
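Using filename instead of files is the right idea, but on its own it is probably not enough: the check also has to run once per file, inside the inner loop, and the CSV should be opened once before the walk so that each match adds a row. A minimal sketch of one way to arrange that follows. The function name html_files_to_csv, its parameters, and the utf-8 encoding are my own assumptions, not something from the original code.

```python
import csv
import fnmatch
import os


def html_files_to_csv(directory, csv_path, phrase="business class"):
    """Write one CSV row per HTML file under `directory` that contains `phrase`.

    Returns the paths of the HTML files that did not contain the phrase.
    """
    non_matching = []
    # Open the CSV once, before walking, so rows accumulate across folders.
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "text"])
        for dirpath, dirs, files in os.walk(directory):
            for filename in fnmatch.filter(files, "*.html"):
                path = os.path.join(dirpath, filename)
                # Assumption: the files are UTF-8; adjust if yours are not.
                with open(path, encoding="utf-8") as f:
                    html = f.read()
                # This check runs once for every file, inside the inner loop.
                if phrase in html:
                    writer.writerow([filename, html])
                else:
                    non_matching.append(path)
    return non_matching
```

Calling html_files_to_csv("/directory", "test.csv") would write the CSV and hand back the non-matching paths. The sketch deliberately does not delete anything; you can loop over the returned list with os.remove as in the original code once you have checked the output.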