How to parse and group hierarchical list items from an unindented string in Python?

Pedroski55 · (This post was last modified: May-23-2024, 05:55 AM by Pedroski55.)

Using only Input1 as the text, I tried the code below. First I saved Input1 as a text file (unformatted_text1.txt in the code below).

For better formatting, you might use the python-docx package (imported as docx), which has more options.

I doubt that all your text files have the same style, so you may need more regular expressions, depending on what your other texts contain.

import re

# all of the following can be done in Python, but do it by hand for now for testing
# copy Input1 text from sample_data_strings.txt and paste it into a text editor
# remove " at both ends, just ' remains
# put [ at the beginning and ] at the end
# now you have a list.
# call the list l
# Paste the list into your Python IDE: l = ['Opening: ... players.']
# save the list l
# savepath = '/home/pedro/myPython/re/text_files/unformatted_text1.txt'
# with open(savepath, 'w') as output: output.writelines(l)
# now you have your text as lines of text, open with readlines()

# path to the unformatted lines of text from above
Input1 = '/home/pedro/myPython/re/text_files/unformatted_text1.txt'
# get the text as a list
with open(Input1, 'r') as infile:
    text_list = infile.readlines()

# have a look
for line in text_list:
    print(line)

# patterns to look for
p = re.compile(r'([A-Za-z]+:)') # get text followed by : like: Opening:
endline = re.compile(r'(:)$') # get end of line :
num = re.compile(r'(\d+\.)') # get a number or numbers followed by .
numsub = re.compile(r'(\w\.)') # get number subpoints \w followed by .
subsub = re.compile(r'(- )') # get the number subpoint marker -

if __name__ == "__main__":
    # match only looks at the beginning of each line
    # search looks through the whole line
    # re.compile(r'(:)$') finds : at the end of a line, if : is there
    for i in range(len(text_list)):
        res = p.match(text_list[i]) 
        end = endline.search(text_list[i])
        number = num.match(text_list[i])
        numsubp = numsub.match(text_list[i])
        subp = subsub.match(text_list[i])  # the pattern is named subsub, not sub
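        if res and end:
            # heading on its own line, like Opening:
            text_list[i] = '  ' + text_list[i]
        elif res and not end:
            # heading followed by more text on the same line
            text_list[i] = '  ' + text_list[i]
        elif number:
            # numbered point, like 1.
            text_list[i] = '\t' + text_list[i]
        elif numsubp:
            # subpoint, like a.
            text_list[i] = '\t  ' + text_list[i]
        elif subp:
            # sub-subpoint starting with -
            text_list[i] = '\t\t ' + text_list[i]
        elif i == len(text_list) - 1:
            # last line, if nothing else matched: blank line before it, then a tab
            text_list[i] = '\n\t' + text_list[i]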

    savepath = '/home/pedro/myPython/re/text_files/formatted_text1.txt'       
    with open(savepath, 'w') as output:
        output.writelines(text_list)
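
If the difference between match and search in the comments above is not clear, here is a small sketch you can run on its own. It reuses two of the patterns from the script; the sample line is made up for illustration and is not from Input1.

```python
import re

num = re.compile(r'(\d+\.)')    # same pattern as in the script
endline = re.compile(r'(:)$')   # same pattern as in the script

line = 'See item 2. for details:\n'  # hypothetical sample line

# match() only tries at position 0, so a number later in the line is ignored
at_start = num.match(line)
# search() scans the whole line and returns the first hit anywhere
anywhere = num.search(line)
# $ also matches just before a trailing newline, so lines from readlines() work
ends_with_colon = endline.search(line)

print(at_start)               # None
print(anywhere.group(1))      # 2.
print(bool(ends_with_colon))  # True
```

That is why the script uses match for the markers at the start of a line and search for the colon at the end.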

You can tweak and change the elifs to get what you want. Add more re expressions for different text parts!
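
One possible way to make that tweaking easier is to keep the patterns and their indents together in a list, so adding a new text part means adding one line instead of another elif. This is only a sketch: RULES and indent_line are names I made up for it, the sample lines are invented, and it leaves out the special handling of the last line.

```python
import re

# (pattern, prefix) pairs, checked in order; the first match wins.
# Patterns and prefixes mirror the script above; RULES and indent_line
# are hypothetical names used only for this sketch.
RULES = [
    (re.compile(r'[A-Za-z]+:'), '  '),   # heading such as "Opening:"
    (re.compile(r'\d+\.'), '\t'),        # numbered point such as "1."
    (re.compile(r'\w\.'), '\t  '),       # subpoint such as "a."
    (re.compile(r'- '), '\t\t '),        # sub-subpoint starting with "- "
]

def indent_line(line):
    """Return line with the prefix of the first rule matching its start."""
    for pattern, prefix in RULES:
        if pattern.match(line):
            return prefix + line
    return line  # no rule matched: leave the line unchanged

# invented sample lines, not the real Input1 text
sample = ['Opening:\n', '1. First point\n', 'a. Detail\n', '- Note\n', 'Plain text\n']
formatted = [indent_line(line) for line in sample]
print(''.join(formatted), end='')
```

The order of the rules matters: a line starting with "1." would also match the \w\. pattern, so the number rule has to come before it, just like the order of the elifs in the script.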
