Python Forum

Full Version: How to combine multiple column values into 1?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
So ive looked around and found some steps on how to accomplish this, but seem like alot of code for this, maybe im not searching for the right thing, maybe it goes by another name, but i am using pandas and working with CSV and excel files in other scripts..

So here is what i have, just looking for suggestions on the proper or right coding to use to accomplish.

So i have a CSV file that has 4 columns(legit columns)
Date/Time/User/Message

BUT my problem is that if the "Message" string value has their own comma in the value, then the CSV ends up with additional columns.. which then prevents the data from being imported correctly, unless we consolidate those into the main "message" column..

So what im trying to do is save ourselves the manual step of copying over those rare instances of extra columns into the main message column.

Say the CSV has 3000 rows.. and MAYBE about 20 rows have those extra columns.. sometimes the row may only have 1 extra column and other times there may be 5 extra columns for a row..

How can i run a script against this file to check for the extra columns, if found, then copy those column values into the main message column so that we have 1 message string value?
Please provide samples of the data with and without the extra commas.
yea im working on finding a file that has not been cleaned up yet so i can upload.
ok here is a small sample

This is literally a real example on how we get he file because there are extra commas in the message value which then throws everything off
And there is one row you will see has the string value in the first column which is the date column and in those cases it seems to happen when the message column has a huge paragraph worth of data, it gets places into other columns instead of just new columns.
That is not a csv file. Do you have any control over how the file is generated?
How do you mean its not a CSV file?

When i open it in notepad this is what i get...

Date,Time,User,Message,,,,
5/22/2022,8:44 AM,Don,Oh no,,,,
5/22/2022,8:43 AM,Jenn,Did i tell you my mom, dad, nephew, my sister's husband, and i think my brother have covid
5/22/2022,8:42 AM,Jenn,A little sore,,,,
5/21/2022,10:11 PM,Don,Ok,,,,
5/21/2022,10:11 PM,Jenn,I will talk to you tomorrow... Yes it's in Hulu.... With Jessica Beal,,,,
5/21/2022,10:10 PM,Don,Candy?,,,,
5/21/2022,10:10 PM,Jenn,That's good,,,,
And I've also been told you are with everyone is praying her husband doesn't find out,,,,,,,
5/11/2022,7:29 PM,Jenn,Buttttttt,,,,


And unfortunately i do not have access to the generation of this file, it comes from a 3rd party and I'm just trying to clean it up as best as possible before pulling the data into our side of things.
If the extra commas can only be in the message column, then split on comma to get the first columns, then rsplit on comma to get the last columns.

# name, age, notes, zip_code
table = '''Susan,27,works with HR on Zoom calls,02134
Roger,41,Gets coffee, bagels, and sodas for all the meetings,90210
'''

for row in table.splitlines():
    name, age, rest = row.split(",", maxsplit=2)
    notes, zip_code = rest.rsplit(",", maxsplit = 1)

    print(f"Name: {name}.  Notes: {notes}")
Output:
Name: Susan. Notes: works with HR on Zoom calls Name: Roger. Notes: Gets coffee, bagels, and sodas for all the meetings
So i tried to follow your example with my file as the source and get an error:

import csv

with open("sample.csv", "r") as file_in: 
    dataReader = csv.reader(file_in)
    
    for row in dataReader.splitlines():
        date, time, user, rest = row.split(",", maxsplit=2)
        message = rest.rsplit(",", maxsplit = 1)
        
        print(f"Date: {date}.  Message: {message}")
Error:
AttributeError: '_csv.reader' object has no attribute 'splitlines'
This is not a csv format file. Do not use csv reader.

import re
import pandas as pd

date_pattern = re.compile("\d+/\d+/\d+")

lines = []
with open("Sample.csv", "r") as f:
    # Get column headers
    columns = next(f).rstrip(",\n").split(",")
    for line in f:
        line = line.rstrip(",\n")
        # Check if line starts with date, time,
        if re.match(date_pattern, line)
            # This is a new row.  Split into columns
            row = line.split(",", maxsplit=len(columns) - 1)
            lines.append(row)
        else:
            # This is a continuation of previous message.
            row = lines[-1]
            row[-1] = f"{row[-1]}\n{line}"

df = pd.DataFrame(lines, columns=columns)
print(df)
ok, so that i can understand, how are you identifying that is not a true csv format file?
Pages: 1 2