parse text, save individual chapters into text files

isniffbooks · (This post was last modified: Nov-05-2019, 11:58 PM by Larz60+.)

Your job will be to parse through this document, find the individual chapters, and write those out as separate files. the chapter number from inside the chapter, so you will need to use some form of a counter.

• Each chapter file should start with “CHAPTER…” as the first line and contain exactly the text content of that chapter. Don’t worry about extra newlines at the beginning or the end, but you should not have extra newlines between the text lines (watch out for this one).
  o Example: Dracula-Chapter-1.txt should start with “CHAPTER I” and the last line of text should be “sky.” Plus maybe some newlines on either end, but no other text.
  o This means you’ll need to retain or recreate the newlines within the text file as they appear.
  o The content of your chapter files needs to look exactly as it does from the original file (but again, don’t worry about extra newlines at the beginning and the end of the file).
• You may not hard code any line numbers, position numbers, or chapter numbers into your script (as in explicitly list a line number for the text in the code). Navigating this can be tricky, which is where the check-in comes in. If I see anything that looks like hard coding, I’ll tell you. I’m also happy to answer any emails to ask if the method you are using is hard coding. If you are struggling and want to hard code these things in, you’ll only lose small points on that element. This will let you complete the assignment so you can get points on the other elements.
• More details:
  o You can look for text that says “CHAPTER I” or “CHAPTER”, but you cannot do this 27 times to find all the chapters.
  o You absolutely can use split, though, on specific text. But again, you can’t do this for each chapter.
  o If you are doing anything 27 separate times in your code, you are hard coding.
  o You can use methods like .index(), .find(), etc. to find the start and end positions and then slice, but you may not add these position numbers directly into your code. In other words, you can use position numbers in your code, so long as you are using a Python tool to find those position numbers.
  o Minimal use of .pop() or slicing is allowed. This means that you must use search strategies to detect where these things are and store the values in variables. You will need to find a way to detect where the text starts and ends (warning: there is more text after the book, which you will need to figure out how to get rid of).
  o If you are looking and saying that the chapters start on line 100 and go through line 10000, you are hard coding. Use other string processing tools to break it apart.

This is using the Gutenberg version of Dracula. I'm having a hard time trying to parse these out without hard coding each individual text file, and I can't figure out where I'm going wrong. Here is the input file: http://www.gutenberg.org/cache/epub/345/pg345.txt.
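As a rough illustration of the "find the positions, then slice" rule in the assignment, here is a minimal sketch. The marker strings and the `extract_body` helper are assumptions made up for this example; the wording in the real file may differ, so check it before relying on them.

```python
# Hypothetical marker strings: the real file's wording may differ.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def extract_body(text, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the start-marker line and the end marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    # Move past the marker's own line so it is not part of the body.
    newline = text.find("\n", start)
    start = len(text) if newline == -1 else newline + 1
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError("end marker not found")
    # Both positions were found by searching, not typed in by hand.
    return text[start:end]


sample = (
    "Some licence header\n"
    "*** START OF THIS EBOOK ***\n"
    "CHAPTER I\n\nfirst chapter text\n"
    "*** END OF THIS EBOOK ***\n"
    "trailing licence text\n"
)
body = extract_body(sample)
print(body)
```

The positions live only in variables that `str.find()` filled in, which seems to be what the assignment means by using a Python tool to find them.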

code i have thus far:

infile = open('dracula.txt', 'r')
readlines = infile.readlines()

toc_list = readlines[74:185]

toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)

#print(len(toc_text_lines))

chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)

#print(len(chaptitles))
print(chaptitles)
infile.close()

import re

with open('dracula.txt') as f:
    book = f.read()

lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV.', 'CHAPTER XXVI.', 'CHAPTER XXVII.']

chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:27]
chapter = list(zip(lst, chap))
for c in chapter:
    print(''.join(c))

any ideas welcome!
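One idea, offered as a hedged sketch rather than a definitive answer: instead of typing out all 27 headings, a single regular expression can locate every heading, and the chapters are the slices between consecutive heading positions. The pattern assumes each heading sits alone on its own line (for example "CHAPTER IV"), which may not match every edition of the file; `split_chapters` is a name invented for this example.

```python
import re

# Assumption: a heading is a whole line such as "CHAPTER IV" or "CHAPTER IV."
HEADING = re.compile(r"^CHAPTER [IVXLC]+\.?[ \t]*$", re.MULTILINE)


def split_chapters(body):
    """Return a list of chapter strings, each starting with its heading."""
    starts = [m.start() for m in HEADING.finditer(body)]
    # Each chapter ends where the next one starts; the last ends with the text.
    ends = starts[1:] + [len(body)]
    return [body[s:e] for s, e in zip(starts, ends)]


sample = "preface\nCHAPTER I\n\nline one\nline two\n\nCHAPTER II\n\nlast line\n"
chapters = split_chapters(sample)
for number, chapter in enumerate(chapters, start=1):
    print(number, repr(chapter))
```

Because the heading text stays inside each slice, there is no need for a separate list of titles to zip back on, and the newlines inside each chapter are kept exactly as they were.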

isniffbooks · Nov-06-2019, 03:17 AM

sorry and thanks

ichabod801 · Nov-06-2019, 02:36 PM

First of all, we need to see your code with proper indentation. Copy and paste it as plain text, and put it between python tags, using the instructions Larz linked to. Second, if your code is not working, please explain how it is not working, including the full text of any error messages you are getting. That will help us narrow down where the problem is.

isniffbooks · (This post was last modified: Nov-07-2019, 12:04 AM by isniffbooks.)

infile = open('dracula.txt', 'r')


readlines = infile.readlines()

toc_list = readlines[74:185]

toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)

#print(len(toc_text_lines))

chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)

# print(len(chaptitles))
#print(chaptitles)

def main():
    # read the whole file; the with block closes it automatically
    with open('dracula.txt', 'r') as in_file:
        text_lines = in_file.readlines()

    toc_start = text_lines.index('CONTENTS\n')
    toc_end = text_lines.index('DRACULA\n')

    toc_list = text_lines[toc_start:toc_end]

    #print(''.join(toc_list))
#import re

#with open('dracula.txt') as f:
  # book = f.read()

#lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV', 'CHAPTER XXVI.', 'CHAPTER XXVII.']
#chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:]
#chapter = list(zip(lst, chap))
#for c in chapter:
   #print(''.join(c))

counter = 0
outFile = None
with open('dracula.txt', 'r') as infile:
    # iterate line by line; looping over infile.read() yields single characters
    for text in infile:
        if text.startswith('CHAPTER'):
            # new heading: close the previous chapter file and start the next one
            # (any line starting with CHAPTER counts, table of contents included)
            if outFile is not None:
                outFile.close()
            counter += 1
            outFile = open('CHAPTER' + str(counter) + '.txt', 'w')
        if outFile is not None:
            outFile.write(text)
if outFile is not None:
    outFile.close()

# small experiment with split() on a made-up string
myNewText = "CHAPTER 1 I am chapter 1 \n CHAPTER 2 I am chapter 2 \n CHAPTER 3 I am ch. 3\n"
print(myNewText)
chapters = myNewText.split("CHAPTER")
myNewString = "CHAPTER" + chapters[1] # concatenation
#print(myNewString)
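For the file-writing step, a minimal sketch of the counter idea might look like the following. It assumes the chapters are already available as a list of strings (the `demo` list is made up), and `write_chapters` is a hypothetical helper, not part of any library; the file name pattern follows the Dracula-Chapter-1.txt example from the assignment.

```python
from pathlib import Path
import tempfile


def write_chapters(chapters, out_dir, prefix="Dracula-Chapter-"):
    """Write each chapter string to <prefix><n>.txt and return the paths."""
    out_dir = Path(out_dir)
    paths = []
    for number, chapter in enumerate(chapters, start=1):  # the counter
        path = out_dir / f"{prefix}{number}.txt"
        # strip("\n") only trims the ends, so line breaks inside the chapter survive
        path.write_text(chapter.strip("\n") + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Made-up chapter strings standing in for the real split result.
demo = ["CHAPTER I\n\nfirst text\n\n", "CHAPTER II\n\nsecond text\n"]
with tempfile.TemporaryDirectory() as tmp:
    written = write_chapters(demo, tmp)
    names = [p.name for p in written]
    first = written[0].read_text(encoding="utf-8")
print(names)
```

Using `enumerate` keeps the chapter number out of the code entirely: the count comes from however many chapters the search step found, so nothing is done 27 separate times.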
