Python Forum

Hello
I am trying out Python for process mining. I want to create a file from a text with 4 columns (case ID, name, process and time), but my problem is that my code puts everything in the same column of the CSV/Excel file, which I don't want. I want the values in 4 different columns, and the same for the titles.

import re
import pandas as pd

# Sample text paragraph (replace with your actual text)
text_paragraph = """
Character: Maria
Case1 - 2023-11-01 09:00 AM: Started the process
Character: George
Case2 - 2023-11-01 10:30 AM: Joined the project
Character: Maria
Case1 - 2023-11-01 11:45 AM: Continued working
Character: George
Case2 - 2023-11-01 12:15 PM: Left for a meeting
"""

# Initialize variables to store event data
event_data = {
    'Case ID': [],
    'Character': [],
    'Process': [],
    'Time': []
}

# Use regular expressions to extract character, case ID, process, and time information
event_pattern = r"(Character: (.+)|Case(\d+) - (\d{4}-\d{2}-\d{2} \d{2}:\d{2} [APM]{2}): (.+))"
matches = re.findall(event_pattern, text_paragraph)

current_character = None

for match in matches:
    character, case_id, timestamp, process = match[1], match[2], match[3], match[4]

    if character:
        current_character = character
    else:
        event_data['Character'].append(current_character)
        event_data['Case ID'].append(case_id)
        event_data['Time'].append(timestamp)
        event_data['Process'].append(process)

# Create a DataFrame from the event data
df = pd.DataFrame(event_data)

# Save the DataFrame as a CSV file
df.to_csv('process_mining_data_4_columns.csv', index=False)

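By default, ^ matches only at the very start of the string and . does not match a newline, so a pattern is normally written to match within a single line. Each of your records spans two lines, so you can match both lines in one pattern with an explicit \n, and use re.MULTILINE so that ^ anchors at the start of every line.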

import re
import pandas as pd


text_paragraph = """
Character: Maria
Case1 - 2023-11-01 09:00 AM: Started the process
Character: George
Case2 - 2023-11-01 10:30 AM: Joined the project
Character: Maria
Case1 - 2023-11-01 11:45 AM: Continued working
Character: George
Case2 - 2023-11-01 12:15 PM: Left for a meeting
"""

event_pattern = re.compile(
    r"^Character: (.+)\nCase(\d+) - (\d{4}-\d{2}-\d{2} \d{2}:\d{2} [APM]{2}): (.+)",
    re.MULTILINE
)
df = pd.DataFrame(
    re.findall(event_pattern, text_paragraph), 
    columns=["Character", "Case Num", "Time", "Process"]
)
print(df)

Output:
  Character Case Num                 Time              Process
0     Maria        1  2023-11-01 09:00 AM  Started the process
1    George        2  2023-11-01 10:30 AM   Joined the project
2     Maria        1  2023-11-01 11:45 AM    Continued working
3    George        2  2023-11-01 12:15 PM   Left for a meeting
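
To see what re.MULTILINE changes on its own, here is a minimal sketch. The shortened sample string and the variable names are made up for illustration; only the re calls come from the standard library.

```python
import re

# Shortened sample text (illustrative); note it starts directly with "Character:"
text = (
    "Character: Maria\n"
    "Case1 - 2023-11-01 09:00 AM: Started the process\n"
    "Character: George\n"
    "Case2 - 2023-11-01 10:30 AM: Joined the project\n"
)

pattern = r"^Character: (.+)"

# Without re.MULTILINE, ^ anchors only at the very start of the string.
default_matches = re.findall(pattern, text)

# With re.MULTILINE, ^ also anchors right after every newline.
multiline_matches = re.findall(pattern, text, re.MULTILINE)

print(default_matches)    # ['Maria']
print(multiline_matches)  # ['Maria', 'George']
```

In both cases .+ stops at the end of the line, because . does not match a newline unless re.DOTALL is set.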

OK, thank you. But can I save the data above as a CSV file with 4 columns, each with its title and data?

I don't understand the question. The code from your first post wrote a CSV file. I just ran your code and the CSV file looks like this:

Output:
Case ID,Character,Process,Time
1,Maria,Started the process,2023-11-01 09:00 AM
2,George,Joined the project,2023-11-01 10:30 AM
1,Maria,Continued working,2023-11-01 11:45 AM
2,George,Left for a meeting,2023-11-01 12:15 PM

4 columns with titles. Please describe how this is not what you want.
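
One possible explanation, and this is only a guess since nothing in the thread confirms it: if the file is opened by double-clicking it in Excel, some regional settings make Excel expect a semicolon as the list separator, so a comma-separated file shows up in a single column. If that is the case, writing the file with sep=";" might help. The sketch below uses a small made-up DataFrame and writes to an in-memory buffer so the result can be inspected.

```python
import io

import pandas as pd

# Small stand-in for the DataFrame built in the first post (illustrative data)
df = pd.DataFrame({
    "Case ID": ["1", "2"],
    "Character": ["Maria", "George"],
    "Process": ["Started the process", "Joined the project"],
    "Time": ["2023-11-01 09:00 AM", "2023-11-01 10:30 AM"],
})

# Write with a semicolon delimiter instead of the default comma.
# A StringIO buffer is used here only so the text can be printed.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

To write a real file, pass a file name such as 'process_mining_data_4_columns.csv' instead of the buffer. Another option may be Excel's own text import, which lets you choose the delimiter when loading the file.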

thomaskissas33

deanhystad

thomaskissas33

deanhystad