Parse text from a .txt file and save multiple output .txt

rattlerskin · (This post was last modified: Mar-15-2017, 08:20 PM by rattlerskin.)

Here's some quick background: I need to finalize the code I have included below so it saves each chapter of the .txt ebook as a separate file, complete with the chapter name. I have already written the code to strip out the table of contents, split the rest of the book on "CHAPTER", and add the word "CHAPTER" back into the text since it was removed when I split the text. What I am having trouble with is the writing of the loop to save each chapter as a separate .txt file, complete with the chapter number and name. So for example, in the ebook entitled "Jed" (http://www.gutenberg.org/ebooks/54350.txt.utf-8), the text for chapter 1 is:

CHAPTER I.

JED.

I need to create the loop to save each chapter in the format of book name-Chapter-#-Chapter name (with no punctuation and spaces replaced with underscores), where the example above would be Jed-Chapter-1-Jed. I have already extracted the contents section of the book in a separate part of the program, so I just need to wrap this up and make some simple changes to the format of the text, but I can't seem to find a way of making this type of loop work, since it needs to save each file under a different name based on the text in the book. The hard part is that I cannot hardcode it: the program must find it from within the text or previously written code.

Any help is greatly appreciated.

for line in text[table_contents_end:]:
    if line == "***END OF THE PROJECT GUTENBERG EBOOK JED\n":
        book_end = text.index(line) + 1

whole_book = text[table_contents_end:book_end]

whole_book_string = "".join(whole_book)
chapters = whole_book_string.split("CHAPTER")
#print(len(chapters))

for chapter in chapters:
    chapter_rebuilt = "CHAPTER " + chapter
    print(chapter_rebuilt)

***ichabod801*** · Mar-15-2017, 09:08 PM

First, you need to open the new file each time through the loop. I'm assuming you read the text from a file, so you understand file basics. If not, there's a tutorial here. The file name you can generate with the string format method:

>>> '1, 2, {}, and throw the {} {}'.format(5, 'holy', 'hand grenade')
'1, 2, 5, and throw the holy hand grenade'

Basically, it replaces each pair of curly braces with the corresponding parameter to the format method. You can do a lot more with it, but that should be sufficient for your purposes.

Then just write the text with the write method of the file object, close the file, and go on to the next chapter.
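
The steps above (build a name with format, open a new file, write, close) can be sketched as follows. This is only a minimal illustration: the function name write_chapters, the "Chapter-{}.txt" naming pattern, and the out_dir parameter are assumptions, not part of the original program.

```python
import os


def write_chapters(chapters, out_dir="."):
    """Write each chapter string to its own numbered .txt file."""
    paths = []
    # enumerate() supplies the chapter number, so no number is typed in by hand.
    for number, chapter_text in enumerate(chapters, start=1):
        # Hypothetical naming pattern; adjust it to the format you need.
        file_name = "Chapter-{}.txt".format(number)
        path = os.path.join(out_dir, file_name)
        # The with block closes the file before the next chapter is handled.
        with open(path, "w", encoding="utf-8") as out_file:
            out_file.write(chapter_text)
        paths.append(path)
    return paths
```

Calling write_chapters(chapters) with the list produced by the split would create one file per list item in the current directory.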

rattlerskin · Mar-15-2017, 09:51 PM

Yes, I read from the text file downloaded from the Gutenberg website. The problem with the string method is that it looks like I have to manually assign the numbers, whereas I need to find a way to read from the table of contents and use the chapter name listed there, then use the chapter number contained within the text. I already split the text on "CHAPTER" and replaced the word "chapter" when it was removed at the split, but all of the operations need to be performed within the program based on the text of the book, not on hardcoded text.

wavic · (This post was last modified: Mar-15-2017, 10:41 PM by wavic.)

Hello! You do not need the table of contents. You can get the chapter name and use it.
Here is an example script that works on the file's lines loaded into a list. It reads each line and looks for 'CHAPTER' in it. If the next line is empty, it takes the line after that as the chapter title and uses it in the file name. It then writes to the file until three consecutive blank lines are reached.

while book:
    line = book.pop(0)

    if "CHAPTER" in line and book.pop(0) == '\n':
        chapter = line.strip()
        title = book.pop(0).strip("\n\r .")  # drop line endings, spaces and the trailing dot

        with open("{} - {}.txt".format(chapter, title), 'w') as f:
            print('Writing: "{} - {}.txt"'.format(chapter, title))

            new_lines = 0  # counts consecutive blank lines
            try:
                while new_lines < 3:
                    row = book.pop(0)
                    if row == '\n':
                        f.write(row)
                        new_lines += 1
                    else:
                        f.write(row)
                        new_lines = 0
            except IndexError:  # the list ran out: end of the book
                pass

Hm! The indentation could be messy.

rattlerskin · Mar-16-2017, 05:21 PM

I would prefer to use the chapter titles in my table of contents, since they are shorter and have already been cleaned up in my earlier code. That's the issue I've been having- trying to figure out how to pull the chapter titles out from my earlier lines and use them within the file name. I would also prefer not to need line 13, since it is relying on a certain number of spaces after the chapter title as opposed to simply pulling out the text I have already separated and pulling the text out of the associated line number. Any ideas?

wavic · (This post was last modified: Mar-16-2017, 07:18 PM by wavic.)

Well, it's easy to modify the code so that it no longer looks for 'CHAPTER' > new line > next line (the chapter title).

Skip the first 'CHAPTER', as I did in the example. Then, in one loop:

look for 'CHAPTER' in the line,
write every line after it to a file (take the file name from your generated list),
until the next line that contains 'CHAPTER'.
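
The one-loop idea above could look roughly like this. It is a sketch under assumptions: the helper name split_into_chapters is hypothetical, and it treats any line that starts with 'CHAPTER' as a heading, which is stricter than the 'CHAPTER' in line test used earlier.

```python
def split_into_chapters(lines):
    """Group lines into chapters; a new chapter starts at each heading line."""
    chapters = []
    current = None  # stays None until the first heading, so front matter is skipped
    for line in lines:
        if line.startswith("CHAPTER"):
            # Start collecting a new chapter, heading line included.
            current = [line]
            chapters.append(current)
        elif current is not None:
            current.append(line)
    # Join each chapter's lines back into one string.
    return ["".join(chapter) for chapter in chapters]
```

Each returned string can then be written to its own file, with the file name taken from the list of chapter titles.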

***ichabod801*** · Mar-17-2017, 12:36 AM

If you have a list of chapter titles and list of chapter text, you could zip the lists together and loop through that.
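
A possible sketch of that approach, combined with the file name format asked for in the first post. The names make_file_name and name_chapters are hypothetical, and the code assumes the titles and texts lists are in the same order and of the same length.

```python
import string


def make_file_name(book, number, title):
    """Build 'Book-Chapter-N-Title.txt' without punctuation, spaces as underscores."""
    # Remove every ASCII punctuation character from the title.
    cleaned = title.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single underscores.
    cleaned = "_".join(cleaned.split())
    return "{}-Chapter-{}-{}.txt".format(book, number, cleaned)


def name_chapters(book, titles, texts):
    """Pair each title with its text and return (file_name, text) tuples."""
    pairs = []
    # zip() walks both lists in step; enumerate() supplies the chapter number.
    for number, (title, text) in enumerate(zip(titles, texts), start=1):
        pairs.append((make_file_name(book, number, title), text))
    return pairs
```

Looping over the result and writing each text to its file name would then produce one file per chapter.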

***snippsat*** · (This post was last modified: Mar-17-2017, 04:02 AM by snippsat.)

Here is an example of the zip() approach that ichabod801 mentions. I would also write a regex split to take out "CHAPTER" plus the Roman numerals.
E.g.:

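import re

with open('small_sample.txt') as f:
    book = f.read()

# Chapter headings, in the same order as they appear in the book
lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.']
# Split on "CHAPTER" + Roman numeral; [1:] drops the text before the first chapter
chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:]
# Pair each heading with its chapter text
chapter = list(zip(lst, chap))
for c in chapter:
    print(''.join(c))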

Example of output.

Output:
>>> print(chapter[0])
('CHAPTER I.', '\n\nJED.\n\n\n"Here, you Jed!"\n\nJed paused in his work with his axe suspended above him, for he was\nsplitting wood. He turned his face toward the side door at which stood a\n\n\n')
>>> print(chapter[1])
('CHAPTER II.', '\n\nMR. AND MRS. FOGSON.\n\n\nMr. Fogson was about as unpleasant-looking as his wife, but was not so\nthin. He had stiff red hair with a tendency to stand up straight, a\n\n\n')

From the loop with join().

Output:
CHAPTER I.

JED.


"Here, you Jed!"

Jed paused in his work with his axe suspended above him, for he was
splitting wood. He turned his face toward the side door at which stood a



CHAPTER II.

MR. AND MRS. FOGSON.


Mr. Fogson was about as unpleasant-looking as his wife, but was not so
thin. He had stiff red hair with a tendency to stand up straight, a



CHAPTER III.

THE SCRANTON POORHOUSE.


"Ahem!" began Squire Dixon, clearing his throat; "the announcement of my
friend Mrs. Fogson furnishes me with a text. I hope you all appreciate
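
If the chapter number should come from the text itself rather than from a hand-written list, one option is to put the numerals in a capturing group so that re.split() keeps them. The sketch below is an assumption-laden variant of the split above: the pattern only accepts the letters IVXLCDM followed by a dot, and roman_to_int and split_numbered_chapters are hypothetical helper names.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    for i, char in enumerate(numeral):
        value = ROMAN_VALUES[char]
        # A smaller value placed before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def split_numbered_chapters(book):
    """Return (chapter_number, chapter_text) pairs from the whole book string."""
    # With a capturing group, re.split() returns
    # [text_before_first_chapter, numeral, text, numeral, text, ...].
    parts = re.split(r"CHAPTER\s([IVXLCDM]+)\.", book)
    numerals, texts = parts[1::2], parts[2::2]
    return [(roman_to_int(n), t) for n, t in zip(numerals, texts)]
```

The integer from each pair can then go straight into the file name with format().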

wavic · (This post was last modified: Mar-17-2017, 08:06 AM by wavic.)

I can't test it because I am not at home and can't install Python here.

with open(file_name, 'r') as f:
    book = f.readlines()

while book:
    line = book.pop(0)

    if "CHAPTER" in line and book.pop(0) == '\n':
        for title in chapters_names_list:  # your list of chapter names here
            with open("{}.txt".format(title), 'w') as f:
                try:
                    while True:
                        line = book.pop(0)
                        if 'CHAPTER' in line:  # next chapter heading reached
                            break
                        else:
                            f.write(line)
                except IndexError:  # no lines left: end of the book
                    pass

This might work, but you should look at @snippsat's code snippet, which seems clearer than this. If you decide to use the code above, there will be empty lines in the text: one above the title and a few more at the end of each chapter.

And I should not write full working code, but... I can't explain what is in my head very well. My English is not as good as I need for this.

**Larz60+** · Mar-17-2017, 01:45 PM

If you need to create a medium for publishing, it's better to keep the table of contents separate.
Most publishers will want you to submit documents in LaTeX or some other typesetting language
(usually LaTeX), and the conversion will be easier if the TOC is separate.
