Sep-21-2019, 02:33 PM
I am currently trying to do the following:
1. Identify all the files that contain the text "business class".
2. Write each of those files to a CSV with the columns "filename" and "text", one row per file.
import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"
remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
            if 'business class' in html:
                lines = [[files, html]]
                header = ['filename', 'text']
                with open("test.csv", "w", newline='') as f:
                    writer = csv.writer(f, delimiter=',')
                    writer.writerow(header)
                    for l in lines:
                        writer.writerow(l)
            else:
                remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)

The problem that I'm encountering is that the code currently only loops through one folder in the directory and prints all four filenames in the "filename" column and all the texts in the "text" column. So I have one row containing four files.
So the CSV file should look something like:
filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes
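
For reference, here is a minimal sketch of one way to get that layout. It is not a tested fix for my exact setup. It assumes the files are UTF-8 and that the raw file contents are acceptable in the "text" column. The function names `find_matches` and `write_csv` are made up for illustration. The two changes from my code are that each row uses the single `filename` rather than the whole `files` list, and that the CSV is opened once after the walk instead of being reopened in "w" mode (which truncates it) for every match.

```python
import csv
import fnmatch
import os


def find_matches(directory, phrase="business class"):
    """Walk every subfolder and split *.html files into matches and non-matches."""
    matches = []      # one [filename, text] row per matching file
    non_matches = []  # full paths of files that lack the phrase
    for dirpath, _dirs, files in os.walk(directory):
        for filename in sorted(fnmatch.filter(files, "*.html")):
            path = os.path.join(dirpath, filename)
            # Assumption: the files are UTF-8 encoded.
            with open(path, encoding="utf-8") as f:
                html = f.read()
            if phrase in html:
                # Use the single filename, not the whole `files` list.
                matches.append([filename, html])
            else:
                non_matches.append(path)
    return matches, non_matches


def write_csv(rows, out_path="test.csv"):
    """Open the CSV once, write the header once, then one row per file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        writer.writerows(rows)
```

Calling `rows, to_remove = find_matches("/directory")` and then `write_csv(rows)` should give one row per matching file. The paths in `to_remove` can still be passed to `os.remove` afterwards, as in my original loop. If only the visible text is wanted rather than the raw markup, `BeautifulSoup(html, "html.parser").get_text()` could be applied to `html` before appending the row.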