Python Forum
How to parse and group hierarchical list items from an unindented string in Python?
Problem Statement

Given an unindented string as input, perform these steps:

  1. Identify the list items at the highest level of the hierarchy within the string. These top-level items can be identified by the following criteria:

    - Numbering systems (e.g., 1., 2., 3.)
    - Lettering systems (e.g., A., B., C.)
    - Bullets (e.g., -, *, •)
    - Symbols (e.g., >, #, §)

  2. For each top-level item identified in step 1:

    a. Group it with all subsequent lower-level items until the next top-level item is encountered. Lower-level items can be identified by the following criteria:

      - Prefixes (e.g., 1.1, 1.2, 1.3)
      - Bullets (e.g., -, *, •)
      - Alphanumeric sequences (e.g., a., b., c.)
      - Roman numerals (e.g., i., ii., iii.)

    b. Concatenate the top-level item with its associated lower-level items into a single string, preserving the formatting and delimiters exactly as they appear in the input string.

  3. Return the resulting grouped list items as a Python list. Each element should be a string containing one top-level item concatenated with its lower-level items.

  4. Exclude any text that appears before the first top-level item and after the last top-level item. Only the content between the first and last top-level items should be included in the output list.
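As a rough illustration of the criteria above, marker detection could be sketched with regular expressions. The pattern names, their order, and the exact syntax each one accepts (for example, requiring whitespace after the marker) are assumptions made for this sketch, not part of the problem statement. `marker_kind` is a hypothetical helper name.

```python
import re

# Assumed marker patterns. The problem statement lists the categories
# but not their exact syntax, so these are illustrative guesses.
# Order matters: the first matching pattern wins.
MARKER_PATTERNS = [
    ("decimal_nested", re.compile(r"\d+(?:\.\d+)+\.?\s+")),  # 1.1  1.2.3
    ("decimal", re.compile(r"\d+[.)]\s+")),                  # 1.  2)
    # Only i, v, x are treated as roman digits here, so "i." and "v."
    # are classified as roman rather than alphabetic (an ambiguity
    # that cannot be resolved from a single line).
    ("roman_lower", re.compile(r"[ivx]+[.)]\s+")),           # i.  ii.  iv)
    ("alpha_lower", re.compile(r"[a-z][.)]\s+")),            # a.  b)
    ("alpha_upper", re.compile(r"[A-Z][.)]\s+")),            # A.  B)
    ("bullet", re.compile(r"[-*•]\s+")),                     # -  *  •
    ("symbol", re.compile(r"[>#§]\s+")),                     # >  #  §
]


def marker_kind(line):
    """Return the name of the first pattern matching the start of the line, or None."""
    stripped = line.lstrip()
    for kind, pattern in MARKER_PATTERNS:
        if pattern.match(stripped):
            return kind
    return None


print(marker_kind("1. First item"))   # decimal
print(marker_kind("1.1 Sub item"))    # decimal_nested
print(marker_kind("plain sentence"))  # None
```

Bullets appear in both the top-level and lower-level criteria, so the marker kind alone cannot decide the level. One possible rule, again an assumption, is to treat the kind of the first marked line in the input as the top level.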

Goal
The goal is a Python method that takes an unindented string, identifies the top-level items and their associated lower-level items using the criteria above, and returns a Python list. Each element of the list is one top-level item concatenated with its lower-level items, with the original formatting and delimiters preserved.

Request
Please provide an explanation and guidance on how to create a Python method that can successfully achieve the goal outlined above. The explanation should include the steps involved, any necessary data structures or algorithms, and considerations for handling different scenarios and edge cases.
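For reference, the grouping step on its own can be sketched as a single pass over the lines, independent of how top-level items are detected. This is a minimal sketch under stated assumptions: `group_by_top_level` and `starts_with_number` are hypothetical names, the predicate shown only recognises `1.`-style numbering, and the sample string is invented for illustration (it is not taken from the pastebin data).

```python
import re


def group_by_top_level(text, is_top_level):
    """Split text into groups, each starting at a line where is_top_level(line) is true.

    Lines before the first top-level line are dropped. Lines are joined
    back with the newline they were split on, so the formatting inside
    each group is unchanged.
    """
    groups = []
    current = None  # lines of the group being built; None until the first top-level line
    for line in text.split("\n"):
        if is_top_level(line):
            if current is not None:
                groups.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # otherwise: text before the first top-level item, skipped
    if current is not None:
        groups.append("\n".join(current))
    return groups


def starts_with_number(line):
    """Hypothetical predicate: a top-level item looks like '1. ', '2. ', ..."""
    return re.match(r"\d+\.\s", line) is not None


sample = "Intro text\n1. Fruit\n- apple\n- pear\n2. Veg\na. kale"
print(group_by_top_level(sample, starts_with_number))
# ['1. Fruit\n- apple\n- pear', '2. Veg\na. kale']
```

This sketch does not trim text that follows the last top-level item, because doing so needs a rule for where the last item's lower-level lines end. Passing the predicate as a parameter keeps the grouping logic separate from the marker-detection rules.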

Additional Details
  - I have attempted to create a Python method to achieve the tasks outlined above, but my attempts have been unsuccessful. The methods I have tried do not produce the expected outputs for the given inputs.

  - To aid in testing and validating the solution, I have created and included numerous sample inputs and their corresponding expected outputs below. These test cases cover various scenarios and edge cases to ensure the robustness of the method.


Code Attempts:


Attempt 1:



    def process_list_hierarchy(text):
        # Helper function to determine the indentation level
        def get_indentation_level(line):
            return len(line) - len(line.lstrip())
    
        # Helper function to parse the input text into a list of lines with their hierarchy levels
        def parse_hierarchy(text):
            lines = text.split('\n')
            hierarchy = []
            for line in lines:
                if line.strip():  # Ignore empty lines
                    level = get_indentation_level(line)
                    hierarchy.append((level, line.strip()))
            return hierarchy
    
        # Helper function to build a tree structure from the hierarchy levels
        def build_tree(hierarchy):
            tree = []
            stack = [(-1, tree)]  # Start with a dummy root level
            for level, content in hierarchy:
                # Find the correct parent level
                while stack and stack[-1][0] >= level:
                    stack.pop()
                # Create a new node and add it to its parent's children
                node = {'content': content, 'children': []}
                stack[-1][1].append(node)
                stack.append((level, node['children']))
            return tree
    
        # Helper function to combine the tree into a single list
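        def combine_tree(tree, combined_list=None, level=0):
            # Use None instead of a mutable default so no list is shared between calls
            if combined_list is None:
                combined_list = []
            for node in tree:
                combined_list.append(('  ' * level) + node['content'])
                if node['children']:
                    combine_tree(node['children'], combined_list, level + 1)
            return combined_list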
    
        # Parse the input text into a hierarchy
        hierarchy = parse_hierarchy(text)
        # Build a tree structure from the hierarchy
        tree = build_tree(hierarchy)
        # Combine the tree into a single list while maintaining the hierarchy
        combined_list = combine_tree(tree)
        # Return the combined list as a string
        return '\n'.join(combined_list)

Attempt 2:



    import re
    from itertools import groupby

    def organize_hierarchically(items):
        def get_level(item):
            match = re.match(r'^(\d+\.?|\-|\*)', item)
            return len(match.group()) if match else 0
    
        grouped_items = []
        for level, group in groupby(items, key=get_level):
            if level == 1:
                grouped_items.append('\n'.join(group))
            else:
                grouped_items[-1] += '\n' + '\n'.join(group)
    
        return grouped_items


Attempt 3:

    from bs4 import BeautifulSoup
    import nltk
    
    def extract_sub_objectives(input_text):
        soup = BeautifulSoup(input_text, 'html.parser')
        text_content = soup.get_text()
    
        # Tokenize the text into sentences
        sentences = nltk.sent_tokenize(text_content)
    
        # Initialize an empty list to store the sub-objectives
        sub_objectives = []
    
        # Iterate through the sentences and extract sub-objectives
        current_sub_objective = ""
        for sentence in sentences:
            if sentence.startswith(("1.", "2.", "3.", "4.")):
                if current_sub_objective:
                    sub_objectives.append(current_sub_objective)
                    current_sub_objective = ""
                current_sub_objective += sentence + "\n"
            elif current_sub_objective:
                current_sub_objective += sentence + "\n"
    
        # Append the last sub-objective, if any
        if current_sub_objective:
            sub_objectives.append(current_sub_objective)
    
        return sub_objectives
Attempt 4:


    def extract_sub_objectives(input_text, preserve_formatting=False):
        # Modified to strip both single and double quotes
        input_text = input_text.strip('\'"')
        messages = []
        messages.append("Debug: Starting to process the input text.")
        # Debug message to show the input text after stripping quotes
        messages.append(f"Debug: Input text after stripping quotes: '{input_text}'")
    
        # Define possible starting characters for new sub-objectives
        start_chars = [str(i) + '.' for i in range(1, 100)]  # Now includes up to two-digit numbering
        messages.append(f"Debug: Start characters defined: {start_chars}")
    
        # Define a broader range of continuation characters
        continuation_chars = ['-', '*', '+', '•', '>', '→', '—']  # Expanded list
        messages.append(f"Debug: Continuation characters defined: {continuation_chars}")
    
        # Replace escaped newline characters with actual newline characters
        input_text = input_text.replace('\\n', '\n')
        # Split the input text into lines
        lines = input_text.split('\n')
        messages.append(f"Debug: Input text split into lines: {lines}")
    
        # Initialize an empty list to store the sub-objectives
        sub_objectives = []
        # Initialize an empty string to store the current sub-objective
        current_sub_objective = ''
        # Initialize a counter for the number of continuations in the current sub-objective
        continuation_count = 0
    
        # Function to determine if a line is a new sub-objective
        def is_new_sub_objective(line):
            # Strip away leading quotation marks and whitespace
            line = line.strip('\'"').strip()
            return any(line.startswith(start_char) for start_char in start_chars)
    
        # Function to determine if a line is a continuation
        def is_continuation(line, prev_line):
            if not prev_line:
                return False
            # Check if the line starts with an alphanumeric followed by a period or parenthesis
            if len(line) > 1 and line[0].isalnum() and (line[1] == '.' or line[1] == ')'):
                # Check if it follows the sequence of the previous line
                if line[0].isdigit() and prev_line[0].isdigit() and int(line[0]) == int(prev_line[0]) + 1:
                    return False
                elif line[0].isalpha() and prev_line[0].isalpha() and ord(line[0].lower()) == ord(prev_line[0].lower()) + 1:
                    return False
                else:
                    return True
            # Add a condition to check for lower-case letters followed by a full stop
            # (the length guard avoids an IndexError on empty or one-character lines)
            if len(line) > 1 and line[0].islower() and line[1] == '.':
                return True
            return any(line.startswith(continuation_char) for continuation_char in continuation_chars)
    
        # Iterate over each line
        for i, line in enumerate(lines):
            prev_line = lines[i - 1] if i > 0 else ''
            # Check if the line is a new sub-objective
            if is_new_sub_objective(line):
                messages.append(f"Debug: Found a new sub-objective at line {i + 1}: '{line}'")
                # If we have a current sub-objective, check the continuation count
                if current_sub_objective:
                    if continuation_count < 2:
                        messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                        for message in messages:
                            print(message)
                        return None
                    # Check the preserve_formatting parameter before adding
                    sub_objectives.append(
                        current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
                    messages.append(f"Debug: Added a sub-objective to the list. Current count: {len(sub_objectives)}.")
                # Reset the current sub-objective to the new one and reset the continuation count
                current_sub_objective = line
                continuation_count = 0
            # Check if the line is a continuation
            elif is_continuation(line, prev_line):
                messages.append(f"Debug: Line {i + 1} is a continuation of the previous line: '{line}'")
                # Add the line to the current sub-objective, checking preserve_formatting
                current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
                # Increment the continuation count
                continuation_count += 1
            # Handle lines that are part of the current sub-objective but don't start with a continuation character
            elif current_sub_objective:
                messages.append(f"Debug: Line {i + 1} is part of the current sub-objective: '{line}'")
                # Add the line to the current sub-objective, checking preserve_formatting
                current_sub_objective += '\n' + line if preserve_formatting else ' ' + line.strip()
    
        # If we have a current sub-objective, check the continuation count before adding it to the list
        if current_sub_objective:
            if continuation_count < 2:
                messages.append(f"Debug: Sub-objective does not meet the continuation criterion: '{current_sub_objective}'")
                for message in messages:
                    print(message)
                return None
            # Check the preserve_formatting parameter before adding
            sub_objectives.append(current_sub_objective.strip() if not preserve_formatting else current_sub_objective)
            messages.append(f"Debug: Added the final sub-objective to the list. Final count: {len(sub_objectives)}.")
    
        # Print the debug messages if no sub-objectives are found
        if not sub_objectives:
            for message in messages:
                print(message)
    
        return sub_objectives
Sample Data (Inputs and associated Outputs):

https://pastebin.com/s8nWktbZ