May-23-2024, 05:39 AM
(This post was last modified: May-23-2024, 05:55 AM by Pedroski55.)
Using just Input1 as the text, I tried the code below. First I saved Input1 as input.txt.
For better formatting maybe you should use the Python module docx, which has more options.
I doubt that all your text files have the same style, so you may need some more regular expressions! That depends on what is in your other texts.
import re

# you can do the following with Python but do it like this for now for testing
# copy Input1 text from sample_data_strings.txt and paste it into a text editor
# remove " at both ends, just ' remains
# put [ at the beginning and ] at the end
# now you have a list.
# call the list l
# Paste the list into your Python IDE: l = ['Opening: ... players.']
# save the list l
# savepath = '/home/pedro/myPython/re/text_files/unformatted_text1.txt'
# with open(savepath, 'w') as output: output.writelines(l)
# now you have your text as lines of text, open with readlines()

# path to the unformatted lines of text from above
Input1 = '/home/pedro/myPython/re/text_files/unformatted_text1.txt'

# get the text as a list
with open(Input1, 'r') as infile:
    text_list = infile.readlines()

# have a look
for line in text_list:
    print(line)

# patterns to look for
p = re.compile(r'([A-Za-z]+:)')  # get text followed by : like: Opening:
endline = re.compile(r'(:)$')  # get end of line :
num = re.compile(r'(\d+\.)')  # get a number or numbers followed by .
numsub = re.compile(r'(\w\.)')  # get number subpoints \w followed by .
subsub = re.compile(r'(- )')  # get the number subpoint marker -

if __name__ == "__main__":
    # match only looks at the beginning of each line
    # search looks through the whole line
    # re.compile(r'(:)$') finds : at the end of a line, if : is there
    for i in range(len(text_list)):
        res = p.match(text_list[i])
        end = endline.search(text_list[i])
        number = num.match(text_list[i])
        numsubp = numsub.match(text_list[i])
        subp = subsub.match(text_list[i])  # was sub.match, but the pattern is named subsub
        if res and end:
            # heading on a line of its own, like Opening:
            text_list[i] = ' ' + text_list[i]
        elif res and not end:
            # heading followed by more text on the same line
            text_list[i] = ' ' + text_list[i]
        elif number:
            text_list[i] = '\t' + text_list[i]
        elif numsubp:
            text_list[i] = '\t ' + text_list[i]
        elif subp:
            text_list[i] = '\t\t ' + text_list[i]
        elif i == len(text_list) - 1:
            # last line of the text
            text_list[i] = '\n\t' + text_list[i]
    savepath = '/home/pedro/myPython/re/text_files/formatted_text1.txt'
    with open(savepath, 'w') as output:
        output.writelines(text_list)

You can tweak and change the elifs to get what you want. Add more regular expressions for different text parts!
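If the chain of elifs gets long, one possible way to organise it is a table of (pattern, prefix) pairs, so that adding a regular expression for a new text part is one extra line. This is only a sketch, not tested against your real files: the names RULES, indent_line and indent_lines, the sample lines and the prefixes are all made up for illustration. I also swapped `\w\.` for `[A-Za-z]\.` here, because `\w` matches digits and underscores too.

```python
import re

# Hypothetical rule table: (compiled pattern, prefix to prepend).
# Order matters: the first pattern that matches at the start of a line wins.
RULES = [
    (re.compile(r'[A-Za-z]+:'), ''),      # headings such as "Opening:"
    (re.compile(r'\d+\.'), '\t'),         # numbered points such as "1."
    (re.compile(r'[A-Za-z]\.'), '\t  '),  # lettered subpoints such as "a."
    (re.compile(r'- '), '\t\t'),          # dash subpoints
]


def indent_line(line):
    """Return line with the prefix of the first matching rule, else unchanged."""
    for pattern, prefix in RULES:
        # match() only looks at the beginning of the line
        if pattern.match(line):
            return prefix + line
    return line


def indent_lines(lines):
    """Apply indent_line to every line and return a new list."""
    return [indent_line(line) for line in lines]


# Made-up sample lines, standing in for the result of readlines()
sample = [
    'Opening:\n',
    '1. First point\n',
    'a. A subpoint\n',
    '- A detail\n',
    'Plain text\n',
]
formatted = indent_lines(sample)
print(''.join(formatted), end='')
```

Because match() only looks at the beginning of the line, a line like 'See 1. above' is left alone, even though search() would find the '1.' in the middle of it.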