Python Forum
Parse text from a .txt file and save multiple output .txt
#1
Some quick background: I need to finish the code below so that it saves each chapter of a .txt ebook as a separate file whose name includes the chapter name. I have already written the code that strips out the table of contents, splits the rest of the book on "CHAPTER", and adds the word "CHAPTER" back in, since the split removes it. What I am having trouble with is writing the loop that saves each chapter as a separate .txt file named with the chapter number and name. For example, in the ebook "Jed" (http://www.gutenberg.org/ebooks/54350.txt.utf-8), chapter 1 begins:

CHAPTER I.

JED.


I need the loop to save each chapter in the format book name-Chapter-#-Chapter name (no punctuation, spaces replaced with underscores), so the example above would be Jed-Chapter-1-Jed. I have already extracted the contents section of the book in a separate part of the program, so I just need to wrap this up and make some simple changes to the format of the text. However, I can't find a way to make this kind of loop work, since each file has to be saved under a different name based on the text in the book. The hard part is that I cannot hard-code the names: the program must find them in the text or in the results of the previously written code.

Any help is greatly appreciated.

for line in text[table_contents_end:]:
    if line == "***END OF THE PROJECT GUTENBERG EBOOK JED\n":
        book_end = text.index(line) + 1

whole_book = text[table_contents_end:book_end]

whole_book_string = "".join(whole_book)
chapters = whole_book_string.split("CHAPTER")
#print(len(chapters))

for chapter in chapters:
    chapter_rebuilt = "CHAPTER " + chapter
    print(chapter_rebuilt)
#2
First, you need to open the new file each time through the loop. I'm assuming you read the text from a file, so you understand file basics. If not, there's a tutorial here. The file name you can generate with the string format method:

>>> '1, 2, {}, and throw the {} {}'.format(5, 'holy', 'hand grenade')
'1, 2, 5, and throw the holy hand grenade'
Basically, it replaces each pair of curly braces with the corresponding parameter to the format method. You can do a lot more with it, but that should be sufficient for your purposes.

Then just write the text with the write method of the file object, close the file, and go on to the next chapter.
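
A minimal sketch of those steps, assuming the chapters are already in a list of strings. The names save_chapters, out_dir and book_name are made up for illustration, and the chapter number here simply comes from the position in the list. The with statement closes each file automatically.

```python
import os
import tempfile


def save_chapters(chapters, out_dir, book_name="Jed"):
    """Write each chapter string to its own file and return the paths."""
    paths = []
    for number, chapter in enumerate(chapters, start=1):
        # Build the file name with str.format, one name per chapter.
        file_name = "{}-Chapter-{}.txt".format(book_name, number)
        path = os.path.join(out_dir, file_name)
        # Opening inside the loop gives a fresh file for every chapter.
        with open(path, "w", encoding="utf-8") as f:
            f.write(chapter)
        paths.append(path)
    return paths


# Hypothetical sample data standing in for the split book text.
out_dir = tempfile.mkdtemp()
sample = ["CHAPTER I.\n\nJED.\n", "CHAPTER II.\n\nMR. AND MRS. FOGSON.\n"]
paths = save_chapters(sample, out_dir)
print([os.path.basename(p) for p in paths])
```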
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
Yes, I read from the text file downloaded from the Gutenberg website. The problem with the string format method is that it looks like I have to assign the numbers manually, whereas I need to read the chapter name from the table of contents and take the chapter number from the text itself. I have already split the text on "CHAPTER" and added the word "CHAPTER" back where the split removed it, but every operation has to be performed within the program based on the text of the book, not on hard-coded text.
#4
Hello! You do not need the table of contents. You can get the chapter name from the text and use it.
Here is an example script that works on the file's lines loaded into a list. It reads each line and looks for 'CHAPTER' in it. If the next line is empty, it takes the line after that and uses it in the file name. It then writes to the file until three consecutive empty lines are reached.

while book:                        # book: list of lines, e.g. from readlines()
    line = book.pop(0)

    if "CHAPTER" in line and book.pop(0) == '\n':
        chapter = line.strip()
        title = book.pop(0).strip("\n\r .")  # drop line ending, spaces and the final period

        with open("{} - {}.txt".format(chapter,title), 'w') as f:
            print('Writing: "{} - {}.txt"'.format(chapter,title))

            new_lines = 0
            try:
                while new_lines < 3:
                    row = book.pop(0)
                    if row == '\n':
                        f.write(row)
                        new_lines += 1
                    else:
                        f.write(row)
                        new_lines = 0
            except IndexError:     # ran out of lines: end of the book
                pass
Hm! The indentation could be messy.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#5
I would prefer to use the chapter titles in my table of contents, since they are shorter and have already been cleaned up in my earlier code. That's the issue I've been having: figuring out how to pull the chapter titles out of my earlier lines and use them in the file name. I would also prefer not to need line 13 (while new_lines < 3:), since it relies on a certain amount of blank space after the chapter title instead of simply using the text I have already separated and the associated line number. Any ideas?
#6
Well, it's easy to modify the code so that it no longer looks for 'CHAPTER' > new line > next line (the chapter title).

Skip the first 'CHAPTER', as I did in the example. Then, in one loop:
  • look for 'CHAPTER' in the line
  • write every line after it to a file (take the name from the generated list)
  • until the next 'CHAPTER' appears in a line.
#7
If you have a list of chapter titles and list of chapter text, you could zip the lists together and loop through that.
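
A small sketch of that idea, assuming one list of cleaned-up titles and one list of chapter texts in the same order. The helper make_file_name and the sample lists are hypothetical. It builds names in the book name-Chapter-#-Chapter name format asked for in the first post, with punctuation removed and spaces replaced by underscores.

```python
import string


def make_file_name(book_name, number, title):
    """Return 'Book-Chapter-N-Title.txt' without punctuation or spaces in the title."""
    # Delete every ASCII punctuation character from the title.
    cleaned = title.translate(str.maketrans("", "", string.punctuation))
    # Collapse any run of whitespace into a single underscore.
    cleaned = "_".join(cleaned.split())
    return "{}-Chapter-{}-{}.txt".format(book_name, number, cleaned)


# Hypothetical data: titles from the table of contents, texts from the split.
titles = ["Jed", "Mr. and Mrs. Fogson"]
texts = ["CHAPTER I.\n\nJED.\n", "CHAPTER II.\n\nMR. AND MRS. FOGSON.\n"]

# zip pairs each title with its text; enumerate supplies the chapter number.
for number, (title, text) in enumerate(zip(titles, texts), start=1):
    print(make_file_name("Jed", number, title), len(text))
```

Each name produced this way can be passed straight to open() to write the matching chapter text.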
#8
An example of the zip() approach that ichabod801 mentioned. I would also use a regex split to take out "CHAPTER" plus the Roman numerals.
E.g.:
import re

with open('small_sample.txt') as f:
   book = f.read()

lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.']
chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:]
chapter = list(zip(lst, chap))
for c in chapter:
   print(''.join(c))
Example of output.
Output:
>>> print(chapter[0])
('CHAPTER I.', '\n\nJED.\n\n\n"Here, you Jed!"\n\nJed paused in his work with his axe suspended above him, for he was\nsplitting wood. He turned his face toward the side door at which stood a\n\n\n')
>>> print(chapter[1])
('CHAPTER II.', '\n\nMR. AND MRS. FOGSON.\n\n\nMr. Fogson was about as unpleasant-looking as his wife, but was not so\nthin. He had stiff red hair with a tendency to stand up straight, a\n\n\n')
From the loop with join().
Output:
CHAPTER I.

JED.


"Here, you Jed!"

Jed paused in his work with his axe suspended above him, for he was
splitting wood. He turned his face toward the side door at which stood a



CHAPTER II.

MR. AND MRS. FOGSON.


Mr. Fogson was about as unpleasant-looking as his wife, but was not so
thin. He had stiff red hair with a tendency to stand up straight, a



CHAPTER III.

THE SCRANTON POORHOUSE.


"Ahem!" began Squire Dixon, clearing his throat; "the announcement of my
friend Mrs. Fogson furnishes me with a text. I hope you all appreciate
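
The headings do not have to be hard-coded in a list. As a hedged sketch, assuming every heading looks like "CHAPTER" followed by a Roman numeral and a period, a capturing group in re.split keeps the numeral, which can then be converted to an integer for the file name. The functions roman_to_int and split_chapters and the sample string are hypothetical, not part of the code above.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    # Pair each symbol with the one after it (a space pads the last one).
    for symbol, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN[symbol]
        # A smaller symbol before a larger one is subtracted (IV, IX, ...).
        total += -value if ROMAN.get(following, 0) > value else value
    return total


def split_chapters(book):
    """Return a list of (chapter_number, chapter_text) pairs."""
    # With one capturing group, re.split returns
    # [text_before_first_heading, numeral, text, numeral, text, ...].
    parts = re.split(r"CHAPTER\s+([IVXLCDM]+)\.", book)
    return [(roman_to_int(n), t) for n, t in zip(parts[1::2], parts[2::2])]


# Made-up sample; the real input would be the whole book as one string.
sample = "preface\nCHAPTER I.\n\nJED.\n\nCHAPTER II.\n\nMR. AND MRS. FOGSON.\n\nCHAPTER IV.\n\nX\n"
for number, text in split_chapters(sample):
    print(number, repr(text[:12]))
```

The resulting numbers can be combined with the chapter titles to name each output file.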
#9
I can't test it because I am not at home and can't install Python here.

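with open(file_name, 'r') as f:
    book = f.readlines()

titles = iter(chapters_names_list)  # your list of chapter names here

while book:
    line = book.pop(0)

    # A heading is a 'CHAPTER' line followed by an empty line.
    if "CHAPTER" in line and book and book[0] == '\n':
        title = next(titles, None)
        if title is None:           # more headings than names: stop
            break
        with open("{}.txt".format(title), 'w') as out:
            # Copy lines until the next heading. That heading stays in
            # the list so the outer loop can pick it up.
            while book and 'CHAPTER' not in book[0]:
                out.write(book.pop(0))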
While this might work, you should look at @snipsat's code snippet. It seems clearer than this. If you decide to use the code above, there will be empty lines in the text: one above the title and a few more at the end of the chapter.

I should not write full working code, but I can't explain what is in my head very well. My English is not as good as I need for this.
#10
If you need to create a medium for publishing, it's better to have the table of contents separate,
most publishers will want you to submit documents in LaTeX or some other typesetting language,
(usually LaTeX), and the conversion will be easier if TOC is separate.