How do I turn a directory of HTML files into one CSV? - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: How do I turn a directory of HTML files into one CSV? (/thread-21260.html)
How do I turn a directory of HTML files into one CSV? - glittergirl - Sep-21-2019

I am currently trying to do the following:

1. Identify all the files that contain the text "business class".
2. Write all the files that contain the text "business class" to a CSV with columns for "filename" and "text".

    import os
    import fnmatch
    from pathlib import Path
    from bs4 import BeautifulSoup
    import csv

    directory = "/directory"
    remove_files = []
    for dirpath, dirs, files in os.walk(directory):
        for filename in fnmatch.filter(files, '*.html'):
            with open(os.path.join(dirpath, filename)) as f:
                html = f.read()
                if 'business class' in html:
                    lines = [[files, html]]
                    header = ['filename', 'text']
                    with open("test.csv", "w", newline='') as f:
                        writer = csv.writer(f, delimiter=',')
                        writer.writerow(header)
                        for l in lines:
                            writer.writerow(l)
                else:
                    remove_files.append(os.path.join(dirpath, filename))

    for each in remove_files:
        os.remove(each)
        print('REMOVED: %s' % each)

The problem I'm encountering is that the code currently only loops through one folder in the directory, and it prints all four filenames in the "filename" column and all the texts in the "text" column. So I end up with one row containing four files.

The CSV file should look something like:

Quote:
    filename,text

RE: How do I turn a directory of HTML files into one CSV? - woooee - Sep-21-2019

Quote:
    prints all four filenames in the "filename" column

You put all of the file names in the list, not just one.

    lines = [[files, html]]

I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or few) files it contains.

RE: How do I turn a directory of HTML files into one CSV? - glittergirl - Sep-21-2019

(Sep-21-2019, 04:04 PM)woooee Wrote:
    Quote:
        prints all four filenames in the "filename" column
    You put all of the file names in the list, not just one.
        lines = [[files, html]]
    I would also suggest that you print the contents of the html variable after the for loop has finished, and see how many (or few) files it contains.

How do I put only one file name in the list? Do I do the following:

    line = [[filename, html]]
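
For reference, here is a minimal sketch of one way the two points raised in the thread could be addressed. It is not taken from any reply. It assumes that using `filename` instead of `files` is what woooee is hinting at, and it also moves the CSV writing out of the loop, because opening "test.csv" in "w" mode on every match overwrites whatever was written before. The function names `collect_rows` and `write_csv` are made up for illustration, the UTF-8 encoding is an assumption about the HTML files, and the file-removal step from the original post is left out on purpose.

```python
import csv
import fnmatch
import os


def collect_rows(directory, needle="business class"):
    """Return one [filename, text] row per matching .html file under directory."""
    rows = []
    for dirpath, _dirs, files in os.walk(directory):
        for filename in sorted(fnmatch.filter(files, "*.html")):
            path = os.path.join(dirpath, filename)
            # Assumption: the files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as f:
                html = f.read()
            if needle in html:
                # Store the single filename, not the whole `files` list.
                rows.append([filename, html])
    return rows


def write_csv(rows, out_path="test.csv"):
    """Open the CSV once, write the header once, then one row per file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerow(["filename", "text"])
        writer.writerows(rows)
```

Calling `write_csv(collect_rows("/directory"))` would then produce one row per matching file across all subfolders. Collecting the rows first and writing once at the end keeps the header from being repeated and avoids reopening the output file inside the loop.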