Python Forum
parse text, save individual chapters into text files
#1
Your job will be to parse through this document, find the individual chapters, and write those out as separate files. the chapter number from inside the chapter, so you will need to use some form of a counter.
• Each chapter file should start with “CHAPTER…” as the first line and contain exactly the text content of that chapter. Don’t worry about extra newlines at the beginning or the end, but you should not have extra newlines between the text lines (watch out for this one).
  o Example: Dracula-Chapter-1.txt should start with “CHAPTER I” and the last line of text should be “sky.” Plus maybe some newlines on either end, but no other text.
  o This means you’ll need to retain or recreate the newlines within the text file as they appear.
  o The content of your chapter files needs to look exactly as it does in the original file (but again, don’t worry about extra newlines at the beginning and the end of the file).
• You may not hard code any line numbers, position numbers, or chapter numbers into your script (as in explicitly listing a line number for the text in the code). Navigating this can be tricky, which is where the check-in comes in. If I see anything that looks like hard coding, I’ll tell you. I’m also happy to answer any emails asking whether the method you are using is hard coding. If you are struggling and want to hard code these things in, you’ll only lose small points on that element. This will let you complete the assignment so you can get points on the other elements.
• More details:
  o You can look for text that says “CHAPTER I” or “CHAPTER”, but you cannot do this 27 times to find all the chapters.
  o You absolutely can use split, though, on specific text. But again, you can’t do this for each chapter.
  o If you are doing anything 27 separate times in your code, you are hard coding.
  o You can use methods like .index(), .find(), etc. to find the start and end positions and then slice, but you may not add these position numbers directly into your code. In other words, you can use position numbers in your code, so long as you are using a Python tool to find those position numbers.
  o Minimal use of .pop() or slicing is allowed. This means that you must use search strategies to detect where these things are and store the values in variables. You will need to find a way to detect where the text starts and ends (warning: there is more text after the book, which you will need to figure out how to get rid of).
  o If you are looking and saying that the chapters start on line 100 and go through line 10000, you are hard coding. Use other string processing tools to break it apart.

This is using the Gutenberg version of Dracula. I'm having a hard time trying to parse these out without hard coding each individual text file, and I can't figure out where I'm going wrong. Here is the input file: http://www.gutenberg.org/cache/epub/345/pg345.txt.
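Editorial sketch: the "find the start and end with a Python tool" requirement can be illustrated with str.find and a slice. The helper name trim_to_body, the marker strings, and the sample text are all hypothetical and made up for this example; the real file's markers would need to be checked by eye first.

```python
def trim_to_body(text, start_marker, end_marker):
    """Return the slice of text from start_marker up to (not including) end_marker."""
    # Positions are computed by searching, never typed in as literal numbers.
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found")
    end = text.find(end_marker, start)  # only search after the start position
    if end == -1:
        raise ValueError("end marker not found")
    return text[start:end].strip("\n")


# Made-up miniature of a book file: front matter, a contents line, one chapter,
# then trailing text that should be discarded.
sample = (
    "License header\n"
    "CONTENTS\n"
    "CHAPTER I. A Journal 1\n"
    "\n"
    "CHAPTER I\n"
    "\n"
    "Text of chapter one.\n"
    "\n"
    "THE END\n"
    "License footer\n"
)

# Assumption: a real heading sits alone on its line, so "\nCHAPTER I\n"
# skips the contents entry, which has a title after the numeral.
body = trim_to_body(sample, "\nCHAPTER I\n", "THE END")
print(body)
```

Because the contents entry continues with a title after the numeral, the newline-delimited start marker only matches the real heading in this sample.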

code i have thus far:
infile = open('dracula.txt', 'r')
readlines = infile.readlines()
toc_list = readlines[74:185]
toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)
#print(len(toc_text_lines))
chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)
#print(len(chaptitles))
print(chaptitles)
infile.close()

import re

with open('dracula.txt') as f:
    book = f.read()
lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV', 'CHAPTER XXVI.', 'CHAPTER XXVII.']
chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:27]
chapter = list(zip(lst, chap))
for c in chapter:
    print(''.join(c))
any ideas welcome!
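Editorial sketch: one way to avoid both the hand-typed list of 27 headings and the [1:27] slice is to put the heading inside a capturing group, so that re.split returns the headings along with the text between them. The pattern assumes that a real chapter heading is a line containing only "CHAPTER" plus a Roman numeral (optionally with a period); that is a guess about the file layout, and split_chapters and the sample text are hypothetical.

```python
import re

# Assumption: heading lines hold nothing but "CHAPTER <numeral>", while
# table-of-contents lines also carry a title, so they do not match.
HEADING = re.compile(r"^(CHAPTER [IVXLC]+\.?[ \t]*)$", re.MULTILINE)


def split_chapters(body):
    """Split body into chapter strings, each beginning with its own heading."""
    # With a capturing group, re.split keeps the matched headings:
    # [text before first heading, heading 1, text 1, heading 2, text 2, ...]
    parts = HEADING.split(body)
    headings = parts[1::2]
    texts = parts[2::2]
    return [heading + text for heading, text in zip(headings, texts)]


# Made-up sample text, not taken from the book.
sample = (
    "preface\n"
    "CHAPTER I. A Journal 1\n"
    "\n"
    "CHAPTER I\n"
    "\n"
    "first text\n"
    "\n"
    "CHAPTER II\n"
    "\n"
    "second text\n"
)

chapters = split_chapters(sample)
for number, chapter in enumerate(chapters, start=1):
    print(number, repr(chapter))
```

Here enumerate supplies the counter, so no chapter number is written into the code, and the number of chapters is whatever the pattern finds.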
#2
sorry and thanks
#3
First of all, we need to see your code with proper indentation. Copy and paste it as plain text, and put it between python tags, using the instructions Larz linked to. Second, if your code is not working, please explain how it is not working, including the full text of any error messages you are getting. That will help us narrow down where the problem is.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
#4
Output:
infile = open('dracula.txt', 'r')
readlines = infile.readlines()
toc_list = readlines[74:185]
toc_text_lines = []
for line in toc_list:
    if len(line) > 1:
        stripped_line = line.strip()
        toc_text_lines.append(stripped_line)
#print(len(toc_text_lines))
chaptitles = []
for text_lines in toc_text_lines:
    split_text_line = text_lines.split()
    if split_text_line[-1].isdigit():
        chaptitles.append(text_lines)
# print(len(chaptitles))
#print(chaptitles)

def main():
    in_file = open('dracula.txt', 'r')
    text_lines = in_file.readlines()
    in_file.close()
    toc_start = text_lines.index('CONTENTS\n')
    toc_end = text_lines.index('DRACULA\n')
    toc_list = text_lines[toc_start:toc_end]
    #print(''.join(toc_list))

#import re
#with open('dracula.txt') as f:
#    book = f.read()
#lst = ['CHAPTER I.', 'CHAPTER II.', 'CHAPTER III.', 'CHAPTER IV.', 'CHAPTER V.', 'CHAPTER VI.', 'CHAPTER VII.', 'CHAPTER VIII.', 'CHAPTER IX.', 'CHAPTER X.', 'CHAPTER XI.', 'CHAPTER XII.', 'CHAPTER XIII.', 'CHAPTER XIV.', 'CHAPTER XV.', 'CHAPTER XVI.', 'CHAPTER XVII.', 'CHAPTER XVIII.', 'CHAPTER XIX.', 'CHAPTER XX.', 'CHAPTER XXI.', 'CHAPTER XXII.', 'CHAPTER XXIII.', 'CHAPTER XXIV.', 'CHAPTER XXV', 'CHAPTER XXVI.', 'CHAPTER XXVII.']
#chap = re.split(r'CHAPTER\s[A-Z.]+', book)[1:]
#chapter = list(zip(lst, chap))
#for c in chapter:
#    print(''.join(c))

chapterfiles = 27
counter = 1
chaptitlecounter= 0
with open('dracula.txt', 'r') as infile:
    outFile = open('CHAPTER' + str(counter) + '.txt', 'a')
    for text in infile.read():
        if text == 'CHAPTER':
            chaptitlecounter += 1
        if chaptitlecounter == chapterfiles:
            chaptitlecounter = chaptertitlecount+1
            outfile = open('dracula.txt', 'w')
            #print(bit_of_content, file=outfile)
            outfile.close()
            sentenceNum = sentenceNum + 1
            myNewText = ("CHAPTER 1 I am chapter 1 \n CHAPTER 2 I am chapter 2 \n CHAPTER 3 I am ch. 3\n")
            chapters = mynewText.split("CHAPTER")
            print(myNewText)
            chapters = myNewText.split("CHAPTER")
            myNewString = "CHAPTER" + chapters[1] # concatenation
            #print(myNewString)
            #counter += 1
            #outFile = open('output' + str(counter) + '.txt', 'a')
        #else:
        outFile.write(text)
outFile.close()
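Editorial sketch: in the loop above, iterating over infile.read() yields one character at a time, so text == 'CHAPTER' can never be true. A line-based version of the same counter idea might look like the following. The heading pattern, the "THE END" marker, the helper name write_chapters, and the sample text are assumptions for illustration only. The output file name follows the Dracula-Chapter-1.txt example from the assignment.

```python
import os
import re
import tempfile

# Assumption: a heading line is exactly "CHAPTER <Roman numeral>" plus optional
# period and trailing whitespace; contents entries with titles do not match.
HEADING = re.compile(r"CHAPTER [IVXLC]+\.?\s*")


def write_chapters(lines, out_dir, end_marker):
    """Write one file per chapter into out_dir and return the file paths."""
    counter = 0        # chapter number comes from counting headings
    out_file = None    # nothing is written until the first heading is seen
    paths = []
    try:
        for line in lines:
            if line.startswith(end_marker):
                break  # everything after the book's end is ignored
            if HEADING.fullmatch(line):
                if out_file is not None:
                    out_file.close()
                counter += 1
                path = os.path.join(out_dir, f"Dracula-Chapter-{counter}.txt")
                out_file = open(path, "w", encoding="utf-8")
                paths.append(path)
            if out_file is not None:
                out_file.write(line)  # lines keep their own newlines
    finally:
        if out_file is not None:
            out_file.close()
    return paths


# Made-up sample text standing in for the real book.
sample = (
    "CONTENTS\n"
    "CHAPTER I. A Journal 1\n"
    "\n"
    "CHAPTER I\n"
    "\n"
    "first text\n"
    "\n"
    "CHAPTER II\n"
    "\n"
    "second text\n"
    "\n"
    "THE END\n"
    "License footer\n"
)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_chapters(sample.splitlines(keepends=True), tmp, "THE END")
    names = [os.path.basename(p) for p in paths]
    with open(paths[0], encoding="utf-8") as f:
        first = f.read()
    with open(paths[1], encoding="utf-8") as f:
        second = f.read()

print(names)
```

For the real file, the lines could come from open('dracula.txt').readlines(), provided the heading and end-marker assumptions hold for that text.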