Python Forum

Full Version: parse text, save individual chapters into text files
Your job will be to parse through this document, find the individual chapters, and write those out as separate files. [...] the chapter number from inside the chapter, so you will need to use some form of a counter.

• Each chapter file should start with “CHAPTER…” as the first line and contain exactly the text content of that chapter. Don’t worry about extra newlines at the beginning or the end, but you should not have extra newlines between the text lines (watch out for this one).
  o Example: Dracula-Chapter-1.txt should start with “CHAPTER I” and the last line of text should be “sky.” Plus maybe some newlines on either end, but no other text.
  o This means you’ll need to retain or recreate the newlines within the text file as they appear.
  o The content of your chapter files needs to look exactly as it does in the original file (but again, don’t worry about extra newlines at the beginning and the end of the file).
• You may not hard code any line numbers, position numbers, or chapter numbers into your script (as in explicitly listing a line number for the text in the code). Navigating this can be tricky, which is where the check-in comes in. If I see anything that looks like hard coding, I’ll tell you. I’m also happy to answer any emails asking whether the method you are using is hard coding. If you are struggling and want to hard code these things in, you’ll only lose a few points on that element. This will let you complete the assignment so you can get points on the other elements.
• More details:
  o You can look for text that says “CHAPTER I” or “CHAPTER”, but you cannot do this 27 times to find all the chapters.
  o You absolutely can use split, though, on specific text. But again, you can’t do this for each chapter.
  o If you are doing anything 27 separate times in your code, you are hard coding.
  o You can use methods like .index(), .find(), etc. to find the start and end positions and then slice, but you may not add these position numbers directly into your code. In other words, you can use position numbers in your code, so long as you are using a Python tool to find those position numbers.
  o Minimal use of .pop() or slicing is allowed. This means that you must use search strategies to detect where these things are and store the values in variables. You will need to find a way to detect where the text starts and ends (warning: there is more text after the book, which you will need to figure out how to get rid of).
  o If you are looking and saying that the chapters start on line 100 and go through line 10000, you are hard coding. Use other string processing tools to break it apart.

This is using the Gutenberg version of Dracula. I'm having a hard time trying to parse these out without hard coding each individual text file, and I can't figure out where I'm going wrong. Here is the input file: http://www.gutenberg.org/cache/epub/345/pg345.txt
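The requirement to detect where the text starts and ends, using positions that a Python tool finds rather than numbers typed into the script, could be sketched roughly as follows. This is only an illustration: `trim_book` is a hypothetical helper, and the marker strings ("CONTENTS", "DRACULA", "THE END") are assumptions about the file that should be checked against the actual download.

```python
def trim_book(text, toc_marker, body_marker, end_marker):
    """Return the text between body_marker (first occurrence after the
    table of contents) and end_marker, using only searched positions."""
    toc_pos = text.find(toc_marker)
    if toc_pos == -1:
        raise ValueError("table-of-contents marker not found")
    # Search for the body marker only after the table of contents starts.
    body_pos = text.find(body_marker, toc_pos + len(toc_marker))
    if body_pos == -1:
        raise ValueError("body marker not found")
    body_start = body_pos + len(body_marker)
    end_pos = text.find(end_marker, body_start)
    if end_pos == -1:
        raise ValueError("end marker not found")
    return text[body_start:end_pos]


# Tiny stand-in for the real file; the marker strings are assumptions.
sample = (
    "Front matter\n"
    "CONTENTS\n"
    "CHAPTER I. Jonathan Harker's Journal 1\n"
    "DRACULA\n"
    "CHAPTER I\n"
    "Some text.\n"
    "THE END\n"
    "License text\n"
)
body = trim_book(sample, "CONTENTS\n", "DRACULA\n", "THE END")
print(body)
```

Every position used for slicing comes from `.find()`, so no line or character number appears in the code.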

code i have thus far:
infile = open('dracula.txt', 'r')
readlines = infile.readlines()
toc_list = readlines[74:185]

toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)
#print(len(toc_text_lines))

chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)
#print(len(chaptitles))

print(chaptitles)
infile.close()

import re

with open('dracula.txt') as f:
    book = f.read()

lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.',
       'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.',
       'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.',
       'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.',
       'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV',
       'CHAPTER XXVI.', 'CHAPTER XXVII.']

chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:27]
chapter = list(zip(lst, chap))
for c in chapter:
    print(''.join(c))
any ideas welcome!
sorry and thanks
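One thing to note about the `re.split` attempt: the pattern `CHAPTER\s[A-Z.]+` is not anchored, so it also matches any table-of-contents entries that begin with "CHAPTER", and the hand-typed list of 27 titles is the kind of hard coding the assignment forbids. A possible alternative, sketched under the assumption that each real heading sits alone on its own line (for example "CHAPTER I") while contents entries carry extra text on the same line, is to capture the heading in the split itself. `HEADING` and `split_chapters` are hypothetical names introduced only for this sketch.

```python
import re

# A heading is assumed to be "CHAPTER" plus a Roman numeral, alone on a line.
HEADING = re.compile(r"^(CHAPTER [IVXLC]+\.?)[ \t]*$", re.MULTILINE)


def split_chapters(text):
    """Split text into chapters, each starting with its own heading line."""
    # With a capturing group, re.split keeps the matched headings:
    # [before_first_heading, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    headings = parts[1::2]
    bodies = parts[2::2]
    return [heading + body for heading, body in zip(headings, bodies)]


# Small stand-in for the book, including two contents entries.
sample = (
    "CONTENTS\n"
    "CHAPTER I. Jonathan Harker's Journal 1\n"
    "CHAPTER II. Jonathan Harker's Journal 14\n"
    "\n"
    "CHAPTER I\n"
    "\n"
    "First line.\n"
    "Second line.\n"
    "\n"
    "CHAPTER II\n"
    "\n"
    "Other text.\n"
)
chapters = split_chapters(sample)
print(len(chapters))
```

Because the heading text comes back from the split, there is no need to type out the chapter titles, and the newlines inside each chapter are left exactly as they were.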
First of all, we need to see your code with proper indentation. Copy and paste it as plain text, and put it between python tags, using the instructions Larz linked to. Second, if your code is not working, please explain how it is not working, including the full text of any error messages you are getting. That will help us narrow down where the problem is.
Output:
infile = open('dracula.txt', 'r')
readlines = infile.readlines()
toc_list = readlines[74:185]
toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)
#print(len(toc_text_lines))
chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)
# print(len(chaptitles))
#print(chaptitles)

def main():
    in_file = open('dracula.txt', 'r')
    text_lines = in_file.readlines()
    in_file.close()
    toc_start = text_lines.index('CONTENTS\n')
    toc_end = text_lines.index('DRACULA\n')
    toc_list = text_lines[toc_start:toc_end]
    #print(''.join(toc_list))

#import re
#with open('dracula.txt') as f:
#    book = f.read()
#lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV', 'CHAPTER XXVI.', 'CHAPTER XXVII.']
#chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:]
#chapter = list(zip(lst, chap))
#for c in chapter:
    #print(''.join(c))

chapterfiles = 27
counter = 1
chaptitlecounter= 0
with open('dracula.txt', 'r') as infile:
    outFile = open('CHAPTER' + str(counter) + '.txt', 'a')
    for text in infile.read():
        if text == 'CHAPTER':
            chaptitlecounter += 1
            if chaptitlecounter == chapterfiles:
                chaptitlecounter = chaptertitlecount+1
                outfile = open('dracula.txt', 'w')
                #print(bit_of_content, file=outfile)
                outfile.close()
                sentenceNum = sentenceNum + 1

myNewText = ("CHAPTER 1 I am chapter 1 \n CHAPTER 2 I am chapter 2 \n CHAPTER 3 I am ch. 3\n")
chapters = mynewText.split("CHAPTER")
print(myNewText)
chapters = myNewText.split("CHAPTER")
myNewString = "CHAPTER" + chapters[1] # concatenation
#print(myNewString)
#counter += 1
#outFile = open('output' + str(counter) + '.txt', 'a')
#else:
outFile.write(text)
outFile.close()
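Two details in the posted loop are worth noting. `for text in infile.read()` visits the string one character at a time, so `text == 'CHAPTER'` can never be true, and opening 'dracula.txt' with mode 'w' would truncate the input file. A line-by-line version with a counter might look roughly like the sketch below. It assumes the lines passed in have already been trimmed to the body of the book, so that any line starting with "CHAPTER" is a real heading. `write_chapters` is a hypothetical helper, and the file name pattern follows the Dracula-Chapter-1.txt example from the assignment.

```python
import os
import tempfile


def write_chapters(lines, out_dir, prefix="Dracula-Chapter-"):
    """Write each chapter to its own file and return the chapter count."""
    counter = 0
    out = None
    try:
        for line in lines:
            # Assumption: a line starting with "CHAPTER" opens a new chapter.
            if line.startswith("CHAPTER"):
                if out is not None:
                    out.close()
                counter += 1
                path = os.path.join(out_dir, f"{prefix}{counter}.txt")
                out = open(path, "w", encoding="utf-8")
            # Lines before the first heading are skipped.
            if out is not None:
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return counter


# Demo on a few made-up lines, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_lines = ["intro\n", "CHAPTER I\n", "a\n", "\n", "b\n", "CHAPTER II\n", "c\n"]
count = write_chapters(demo_lines, demo_dir)
print(count, sorted(os.listdir(demo_dir)))
```

The counter supplies the number in the file name, and each line is written unchanged, so blank lines inside a chapter are kept as they appear in the source.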