Separate text files and convert into csv

marfer · Dec-09-2021, 07:00 PM

I have several text files in a folder that I want to split by paragraph and convert into a CSV file. Each text file consists of several paragraphs, and some paragraphs span several lines. Paragraphs are separated by one empty line.
Text file example:
" A very long story
and paragraph.

Paragraph with several lines.
More information here."

This is how I want my CSV file to look:

id, text
abc.txt, A very long story and paragraph.
abc.txt, Paragraph with several lines. More information here.
def.txt, Imagine there is another text file.

This is my code:

import csv, os
import glob

os.chdir(path)
with open('output.csv', 'w', newline="", encoding="utf-16") as f:
    output = csv.writer(f)
    output.writerow(['id', 'text'])
    for txt_file in glob.iglob('*.txt'):
        with open(txt_file, 'r') as txt:
            for line in txt.read().split("\n\n"):
                output.writerow([(txt_file), line])

This is how my csv file looks now:

id, text
abc.txt, A very long story
and paragraph.
abc.txt, Paragraph with several lines.
More information here.
def.txt, Imagine there is another text file.

**Larz60+** · (This post was last modified: Dec-09-2021, 11:32 PM by Larz60+.)

A CSV file is a structured text file that uses a delimiter (usually a comma, hence the name) to separate fields.
It usually contains a header record with the name of each field.
Both the data records and the header record are terminated with a line feed.
Finally, every field must be present on each line; where a field has no data, its delimiter must still be written.
Thus:

import csv

with open('my.csv', 'w') as myfile:
    myfile.write("'Field1','Field2','Field2','Field4'\n")
    myfile.write("'AAA',,'bbb','ccc'\n")

print(f"\nWill now open as csv file")

with open('my.csv') as fp:
    crdr = csv.reader(fp)
    for row in crdr:
        print(row)

Output:

Will now open as csv file

["'Field1'", "'Field2'", "'Field2'", "'Field4'"]
["'AAA'", '', "'bbb'", "'ccc'"]

BashBedlam · Dec-09-2021, 11:51 PM

Try replacing line eleven of your code (the output.writerow call) with this line. I think it's what you want.

output.writerow([(txt_file), line.replace ('\n', '')])

supuflounder · Dec-10-2021, 09:30 AM

There is a problem with several of these suggestions.
For example

output.writerow([(txt_file), line.replace('\n', '')])

will mean that

This is one line.
This is another line which
is kind of long

will come out as

abc.txt, This is one line.This is another line whichis kind of long

If you want the paragraph to come out as one line, you will have to replace each newline, "\n", with a space. This loses your original line arrangement, and long paragraphs can become very, very long, extending for many yards past the end of your screen.
You might consider doing

line.replace("\n", "\\n")

which would allow you to put the newlines back in at some point, by doing
line.replace("\\n", "\n")
if you wanted to see the original line breaks.
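
To make the two options concrete, here is a minimal sketch using the example paragraph above. The variable names are only for illustration. Note the assumption that the text does not already contain a literal backslash followed by "n"; if it did, the round trip would turn that into a real line break.

```python
paragraph = "This is one line.\nThis is another line which\nis kind of long"

# Option 1: join the lines with a space (the original line breaks are lost)
flattened = paragraph.replace("\n", " ")
print(flattened)

# Option 2: store each line break as the two characters backslash + n ...
escaped = paragraph.replace("\n", "\\n")
print(escaped)

# ... so the original line breaks can be restored later
restored = escaped.replace("\\n", "\n")
print(restored == paragraph)
```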

Also, text values are best enclosed in quotes.

But, if you enclose text in quote marks, you have to worry about quote marks in the text, e.g.,

I ran up to her.  "Look out" I cried, shoving her out of the way of the falling anvil.

If you don't enclose this string in quote marks, then Excel will split the text at the comma into two cells, which is not what you want.

   A    |                  B                    |                     C                             |
abc.txt | I ran up to her.  "Look out" I cried  | shoving her out of the way of the falling anvil.  |

But if you put quote marks around it, you then get into trouble because of the quote marks inside. In fact, because of the quote marks, Excel may become unhappy. I leave this as An Exercise For The Reader.

To test what Excel found acceptable, I created an Excel file
(Well, I had a screenshot, but there seems to be no way to upload it. My file looked like this:)

    A    |                               B                           |
abc.txt  | This is a test                                            |
def.txt  | This is "another test"                                    |
1.txt    | This is "an example of" a comma, in a line                |
2,tzr    | There is a comma, here                                    |

When I saved it, it came out as

abc.txt,This is a test
def.txt,"This is ""another test"""
1.txt,"This is ""an example of"" a comma, in a line"
"2,tzr","There is a comma, here"

So you will want to apply

'"' + line.replace('"', '""') + '"'

Note that Excel does not quote text that does not contain a comma or quotes. You may choose to do this, or you may choose to always put quotes around the text. Note that I mistyped the file name; instead of "2.txt" my fingers were in the wrong place, and I typed "2,tzr", and because there was a comma in the filename (which is actually legal), it put the filename in quotes also.
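
If the rows are written with Python's csv module, as in the original code, the quoting does not have to be done by hand. The sketch below writes the same four rows to an in-memory buffer (io.StringIO stands in for a real file here) and reads them back. With the default settings, csv.writer quotes only the fields that need it and doubles any embedded quote marks.

```python
import csv
import io

rows = [
    ["abc.txt", "This is a test"],
    ["def.txt", 'This is "another test"'],
    ["1.txt", 'This is "an example of" a comma, in a line'],
    ["2,tzr", "There is a comma, here"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back with csv.reader recovers the original fields
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)
```

Passing quoting=csv.QUOTE_ALL to csv.writer would put quotes around every field instead.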

Pedroski55 · Dec-10-2021, 10:41 AM

Not too clear what you want, maybe this.

Text files don't have paragraphs, they just have lines.

If you read the string, how will you split it for paragraphs?

If you mean one section of text is separated from the next by an empty line, then use text.readlines() to get the text.

This will give you a list of lines.

Join the lines into one string, but replace lines that contain only \n with an unusual marker; I chose _=_

import os, glob, csv

path2text = '/home/pedro/temp/'
# glob.glob() takes a single pattern string, so join the folder and the wildcard
files = glob.glob(os.path.join(path2text, '*.txt'))

def getParagraphs(f):
    # Read the file line by line and mark each empty line with a separator
    with open(f) as atext:
        data = atext.readlines()
        parastring = ''
        separator = '_=_'
        for line in data:
            if line == '\n':
                parastring = parastring + separator
            else:
                parastring = parastring + line
    return parastring

savename = path2text + 'output.csv'
# newline='' lets the csv module control the line endings
with open(savename, mode='w', newline='') as csvout:
    f_writer = csv.writer(csvout, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Write the header once, not once per file
    fieldnames = ['text_name', 'paragraph']
    f_writer.writerow(fieldnames)
    for f in files:
        name = f.split(os.sep)
        idd = name[-1]
        mystring = getParagraphs(f)
        paragraphs = mystring.split('_=_')
        for i in range(0, len(paragraphs)):
            rowName = idd + '_paragraph_' + str(i+1)
            row = [rowName, paragraphs[i]]
            f_writer.writerow(row)
    print('All done and saved to', savename)

marfer · Dec-10-2021, 12:09 PM

Thank you all for your comments and time. In my case it works well

output.writerow([(txt_file), line.replace ('\n', ' ')])

as suggested by BashBedlam.
I also decided to remove punctuation in my text files to not have the problems referred by supuflounder.
Thank you!
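
For anyone landing on this thread later, here is one possible way to put the pieces together. It is only a sketch under some assumptions: the function name paragraphs_to_csv is made up, the files are assumed to be UTF-8 (the original code wrote UTF-16), and re.split and pathlib are used instead of split("\n\n") and glob so that blank lines containing stray spaces still count as paragraph breaks. The demo at the bottom uses a temporary folder with the example text from the first post.

```python
import csv
import re
import tempfile
from pathlib import Path


def paragraphs_to_csv(folder, csv_path):
    """Write one CSV row (file name, paragraph text) per paragraph of every .txt file."""
    folder = Path(folder)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])
        for txt_file in sorted(folder.glob("*.txt")):
            content = txt_file.read_text(encoding="utf-8")
            # A paragraph break is a blank (or whitespace-only) line
            for paragraph in re.split(r"\n\s*\n", content):
                # Collapse line breaks and repeated spaces into single spaces
                text = " ".join(paragraph.split())
                if text:  # skip empty chunks, e.g. from trailing blank lines
                    writer.writerow([txt_file.name, text])


# Demo with the example text from the first post
with tempfile.TemporaryDirectory() as tmp:
    sample = (
        "A very long story\nand paragraph.\n\n"
        "Paragraph with several lines.\nMore information here.\n"
    )
    Path(tmp, "abc.txt").write_text(sample, encoding="utf-8")
    out_file = Path(tmp, "output.csv")
    paragraphs_to_csv(tmp, out_file)
    with open(out_file, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

for row in rows:
    print(row)
```

Because csv.writer takes care of commas and quote marks in the text, the punctuation would not need to be removed with this approach.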
