Python Forum
Separate text files and convert into csv
#1
I have several text files in a folder that I want to split by paragraph and convert into csv. Each text file is composed of several paragraphs and some paragraphs have several lines. Paragraphs are separated by 1 empty line.
Text file example:
" A very long story
and paragraph.

Paragraph with several lines.
More information here."

How I want my csv file to look:

id, text
abc.txt, A very long story and paragraph.
abc.txt, Paragraph with several lines. More information here.
def.txt, Imagine there is another text file.

This is my code:
import csv, os
import glob

os.chdir(path)
with open('output.csv', 'w', newline="", encoding="utf-16") as f:
    output = csv.writer(f)
    output.writerow(['id', 'text'])
    for txt_file in glob.iglob('*.txt'):
        with open(txt_file, 'r') as txt:
            for line in txt.read().split("\n\n"):
                output.writerow([(txt_file), line])
This is how my csv file looks now:

id, text
abc.txt, A very long story
and paragraph.
abc.txt, Paragraph with several lines.
More information here.
def.txt, Imagine there is another text file.
#2
A CSV file is a structured text file that uses a delimiter (usually a comma, hence the name) to separate fields.
It usually contains a header record with the names of each field.
Both the data records and the header are terminated with a line feed.
Finally, every record needs the same number of fields: each field is either filled in or left empty with its delimiter still in place.
Thus:
import csv

# Write a header record and one data record; the second field is left empty
with open('my.csv', 'w') as myfile:
    myfile.write("'Field1','Field2','Field3','Field4'\n")
    myfile.write("'AAA',,'bbb','ccc'\n")

print("\nWill now open as csv file")

# Read the file back; each record comes out as a list of fields
with open('my.csv') as fp:
    crdr = csv.reader(fp)
    for row in crdr:
        print(row)
Output:
Will now open as csv file
["'Field1'", "'Field2'", "'Field3'", "'Field4'"]
["'AAA'", '', "'bbb'", "'ccc'"]
#3
Try replacing line eleven of your code (the last line, the output.writerow call) with this line. I think it's what you want.
output.writerow([txt_file, line.replace('\n', '')])
#4
There is a problem with several of these suggestions.
For example
output.writerow([txt_file, line.replace('\n', '')])
will mean that
This is one line.
This is another line which
is kind of long
will come out as
abc.txt, This is one line.This is another line whichis kind of long
If you want the paragraph to come out as one line, you will have to replace each newline, "\n", with a space rather than with nothing. This loses your original line arrangement, and long paragraphs can become very, very long, extending for many yards past the end of your screen.
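A minimal sketch of the join-with-a-space approach, assuming the paragraph is already held in a single string. The helper name flatten_paragraph is made up for illustration and is not part of the original script.

```python
def flatten_paragraph(paragraph):
    """Join the lines of one paragraph with single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # newlines included, and ignores leading and trailing whitespace.
    return " ".join(paragraph.split())


text = "This is one line.\nThis is another line which\nis kind of long"
print(flatten_paragraph(text))
```

Be aware that this also collapses runs of spaces inside a line (two spaces after a full stop become one), which may or may not matter for your data.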
You might consider doing
line.replace("\n", "\\n")
which would allow you to put the newlines back in at some point, by doing
line.replace("\\n", "\n")
if you wanted to see the original line breaks.
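A small round-trip sketch of that idea. It assumes the text does not already contain a literal backslash followed by n; if it might, the restore step could turn that into a line break that was never there.

```python
paragraph = "Paragraph with several lines.\nMore information here."

# Store the paragraph on one physical line by writing each newline
# as the two characters backslash + n.
escaped = paragraph.replace("\n", "\\n")
print(escaped)

# Later, turn the two-character marker back into a real newline.
restored = escaped.replace("\\n", "\n")
print(restored == paragraph)
```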

Also, text values are best enclosed in quotes.

But, if you enclose text in quote marks, you have to worry about quote marks in the text, e.g.,
I ran up to her.  "Look out" I cried, shoving her out of the way of the falling anvil.
If you don't enclose this string in quote marks, then Excel will split the text at the comma into two cells, which is not what you want.
   A    |                  B                    |                     C                             |
abc.txt | I ran up to her.  "Look out" I cried  | shoving her out of the way of the falling anvil.  |
But if you put quote marks around it, you then get into trouble because of the quote marks inside. In fact, because of the quote marks, Excel may become unhappy. I leave this as An Exercise For The Reader.

To test what Excel found acceptable, I created an Excel file
(Well, I had a screenshot, but there seems to be no way to upload it. My file looked like this:)
    A    |                               B                           |
abc.txt  | This is a test                                            |
def.txt  | This is "another test"                                    |
1.txt    | This is "an example of" a comma, in a line                |
2,tzr    | There is a comma, here                                    |
When I saved it, it came out as
abc.txt,This is a test
def.txt,"This is ""another test"""
1.txt,"This is ""an example of"" a comma, in a line"
"2,tzr","There is a comma, here"
So you will want to apply
'"' + line.replace('"', '""') + '"'
Note that Excel does not quote text that does not contain a comma or quotes. You may choose to do this, or you may choose to always put quotes around the text. Note that I mistyped the file name; instead of "2.txt" my fingers were in the wrong place, and I typed "2,tzr", and because there was a comma in the filename (which is actually legal), it put the filename in quotes also.
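Since the script in the first post already writes through csv.writer, the quoting may not need to be done by hand at all: with its default settings the writer quotes only the fields that need it and doubles any embedded quote marks. The sketch below writes the same four rows to an in-memory buffer (io.StringIO and lineterminator="\n" are used only to keep the example self-contained and easy to print).

```python
import csv
import io

rows = [
    ["abc.txt", "This is a test"],
    ["def.txt", 'This is "another test"'],
    ["1.txt", 'This is "an example of" a comma, in a line'],
    ["2,tzr", "There is a comma, here"],
]

buffer = io.StringIO()
# The default quoting mode is csv.QUOTE_MINIMAL: only fields containing
# the delimiter, the quote character or a line break get quoted.
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields.
print(list(csv.reader(io.StringIO(csv_text))) == rows)
```

If you add the quote marks yourself and then pass the result to csv.writer, the writer will quote it a second time, so pick one approach or the other.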
#5
Not too clear what you want, maybe this.

Text files don't have paragraphs, they just have lines.

If you read the string, how will you split it for paragraphs?

If you mean one section of text is separated from the next by an empty line, then use text.readlines() to get the text.

This will give you a list of lines.

Join the lines to a string, but replace lines that only have \n with something weird, I chose _=_

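import os, glob, csv

path2text = '/home/pedro/temp/'
# glob.glob() takes a single pattern, so join the folder and the wildcard first
files = glob.glob(os.path.join(path2text, '*.txt'))

def getParagraphs(f):
    # Read the file as a list of lines and mark every empty line with the separator
    with open(f) as atext:
        data = atext.readlines()
    parastring = ''
    separator = '_=_'
    for line in data:
        if line == '\n':
            parastring = parastring + separator
        else:
            # swap the trailing newline for a space so a paragraph stays on one line
            parastring = parastring + line.rstrip('\n') + ' '
    return parastring

savename = path2text + 'output.csv'
# newline='' is what the csv module recommends for files handed to csv.writer
with open(savename, mode='w', newline='') as csvout:
    f_writer = csv.writer(csvout, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # write the header once, not once per text file
    fieldnames = ['text_name', 'paragraph']
    f_writer.writerow(fieldnames)
    for f in files:
        idd = os.path.basename(f)
        mystring = getParagraphs(f)
        # drop empty pieces left by trailing or repeated empty lines
        paragraphs = [p.strip() for p in mystring.split('_=_') if p.strip()]
        for i, paragraph in enumerate(paragraphs, start=1):
            rowName = idd + '_paragraph_' + str(i)
            f_writer.writerow([rowName, paragraph])
    print('All done and saved to', savename)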
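As an alternative to the _=_ marker, a paragraph split can also be sketched with a regular expression that treats any blank line as the boundary. This is only an illustration under the assumption that "paragraph" means "block of text separated by one or more empty lines"; the function name split_paragraphs is hypothetical.

```python
import re


def split_paragraphs(text):
    """Return the paragraphs of text, each flattened to a single line."""
    # A boundary is a newline, optional whitespace, then another newline,
    # i.e. one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the lines inside each block with single spaces and skip empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = (
    " A very long story\nand paragraph.\n"
    "\n"
    "Paragraph with several lines.\nMore information here."
)
for paragraph in split_paragraphs(sample):
    print(paragraph)
```

Each returned string can then be passed to writerow together with the file name.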
#7
Thank you all for your comments and time. In my case this works well:
output.writerow([txt_file, line.replace('\n', ' ')])
as suggested by BashBedlam.
I also decided to remove punctuation from my text files to avoid the problems mentioned by supuflounder.
Thank you!