Posts: 170
Threads: 43
Joined: May 2019
So I've looked around and found some steps on how to accomplish this, but it seems like a lot of code for this. Maybe I'm not searching for the right thing, or maybe it goes by another name. I am using pandas and working with CSV and Excel files in other scripts.
So here is what I have; I'm just looking for suggestions on the right code to use to accomplish it.
I have a CSV file that has 4 columns (legit columns):
Date/Time/User/Message
But my problem is that if the "Message" string value has its own comma in it, then the CSV ends up with additional columns, which prevents the data from being imported correctly unless we consolidate those into the main "Message" column.
So what I'm trying to do is save ourselves the manual step of copying those rare instances of extra columns over into the main Message column.
Say the CSV has 3000 rows, and maybe about 20 rows have those extra columns. Sometimes a row may have only 1 extra column, and other times there may be 5 extra columns for a row.
How can I run a script against this file to check for the extra columns and, if found, copy those column values into the main Message column so that we have 1 message string value?
Posts: 6,783
Threads: 20
Joined: Feb 2020
Please provide samples of the data with and without the extra commas.
Posts: 170
Threads: 43
Joined: May 2019
Yeah, I'm working on finding a file that has not been cleaned up yet so I can upload it.
Posts: 170
Threads: 43
Joined: May 2019
OK, here is a small sample.
This is literally a real example of how we get the file. There are extra commas in the message value, which then throws everything off.
And there is one row where you will see the string value in the first column, which is the date column. That seems to happen when the message column has a huge paragraph's worth of data: it gets placed into the existing columns instead of just new columns.
Attached Files
Sample.csv (Size: 556 bytes / Downloads: 142)
Posts: 6,783
Threads: 20
Joined: Feb 2020
That is not a csv file. Do you have any control over how the file is generated?
Posts: 170
Threads: 43
Joined: May 2019
How do you mean it's not a CSV file?
When I open it in Notepad this is what I get:
Date,Time,User,Message,,,,
5/22/2022,8:44 AM,Don,Oh no,,,,
5/22/2022,8:43 AM,Jenn,Did i tell you my mom, dad, nephew, my sister's husband, and i think my brother have covid
5/22/2022,8:42 AM,Jenn,A little sore,,,,
5/21/2022,10:11 PM,Don,Ok,,,,
5/21/2022,10:11 PM,Jenn,I will talk to you tomorrow... Yes it's in Hulu.... With Jessica Beal,,,,
5/21/2022,10:10 PM,Don,Candy?,,,,
5/21/2022,10:10 PM,Jenn,That's good,,,,
And I've also been told you are with everyone is praying her husband doesn't find out,,,,,,,
5/11/2022,7:29 PM,Jenn,Buttttttt,,,,
And unfortunately i do not have access to the generation of this file, it comes from a 3rd party and I'm just trying to clean it up as best as possible before pulling the data into our side of things.
Posts: 1,583
Threads: 3
Joined: Mar 2020
If the extra commas can only be in the message column, then split on comma to get the first columns, then rsplit on comma to get the last columns.
# name, age, notes, zip_code
table = '''Susan,27,works with HR on Zoom calls,02134
Roger,41,Gets coffee, bagels, and sodas for all the meetings,90210
'''
for row in table.splitlines():
    name, age, rest = row.split(",", maxsplit=2)
    notes, zip_code = rest.rsplit(",", maxsplit=1)
    print(f"Name: {name}. Notes: {notes}")
Output:
Name: Susan. Notes: works with HR on Zoom calls
Name: Roger. Notes: Gets coffee, bagels, and sodas for all the meetings
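Applied to the Date/Time/User/Message layout from the question, where Message is the last column, only the left-hand split should be needed. A minimal sketch, assuming each record sits on a single line and the trailing commas are just padding; `parse_line` is a hypothetical helper name, not something from the thread:

```python
def parse_line(line):
    """Split one raw line into (date, time, user, message)."""
    # Drop the newline and the padding commas at the end of the line.
    line = line.rstrip("\n").rstrip(",")
    # Split on the first three commas only; later commas stay in the message.
    date, time, user, message = line.split(",", maxsplit=3)
    return date, time, user, message


sample = [
    "5/22/2022,8:44 AM,Don,Oh no,,,,",
    "5/22/2022,8:43 AM,Jenn,Did i tell you my mom, dad, nephew",
]
for raw in sample:
    print(parse_line(raw))
```

Two caveats: a message that genuinely ends in a comma would lose it, and a line with fewer than three commas (such as a continuation line) raises a ValueError on unpacking.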
Posts: 170
Threads: 43
Joined: May 2019
So I tried to follow your example with my file as the source and got an error:
import csv
with open("sample.csv", "r") as file_in:
    dataReader = csv.reader(file_in)
    for row in dataReader.splitlines():
        date, time, user, rest = row.split(",", maxsplit=2)
        message = rest.rsplit(",", maxsplit = 1)
        print(f"Date: {date}. Message: {message}")
Error:
AttributeError: '_csv.reader' object has no attribute 'splitlines'
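For context on that error: `csv.reader` already yields each row as a list of strings, so there is no string left to call `splitlines()` or `split()` on. A rough sketch of working with those lists instead, assuming every extra field belongs to the message; `merge_extra_fields` is a made-up helper, and it does not deal with a continuation line that starts without a date:

```python
import csv
import io


def merge_extra_fields(row, n_columns=4):
    """Fold every field past the first n_columns - 1 back into the last column."""
    head = row[: n_columns - 1]
    # Rejoin with the comma that csv.reader split on, then drop padding commas.
    tail = ",".join(row[n_columns - 1:]).rstrip(",")
    return head + [tail]


# Two lines in the same shape as the sample file.
text = (
    "5/22/2022,8:44 AM,Don,Oh no,,,,\n"
    "5/22/2022,8:43 AM,Jenn,Did i tell you my mom, dad, nephew\n"
)
for row in csv.reader(io.StringIO(text)):
    print(merge_extra_fields(row))
```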
Posts: 6,783
Threads: 20
Joined: Feb 2020
Aug-11-2022, 05:50 PM
(This post was last modified: Aug-11-2022, 06:04 PM by deanhystad.)
This is not a csv format file. Do not use csv reader.
import re
import pandas as pd
# A new record starts with a date such as 5/22/2022.
date_pattern = re.compile(r"\d+/\d+/\d+")
lines = []
with open("Sample.csv", "r") as f:
    # Get column headers, dropping the trailing padding commas
    columns = next(f).rstrip(",\n").split(",")
    for line in f:
        line = line.rstrip(",\n")
        # Check if line starts with a date
        if date_pattern.match(line):
            # This is a new row. Split into columns; extra commas stay in the last one
            row = line.split(",", maxsplit=len(columns) - 1)
            lines.append(row)
        else:
            # This is a continuation of previous message.
            row = lines[-1]
            row[-1] = f"{row[-1]}\n{line}"
df = pd.DataFrame(lines, columns=columns)
print(df)
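For comparison, a conforming CSV writer wraps any field that contains a comma in quotes, which is what lets a reader keep it as one column. A small sketch using Python's `csv` module; the row is adapted from the sample, with the message shortened:

```python
import csv
import io

row = ["5/22/2022", "8:43 AM", "Jenn", "Did i tell you my mom, dad, nephew"]

# Write the row the way a CSV library would.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
written = buffer.getvalue()
print(written)

# Reading the quoted line back gives the original four fields.
parsed = next(csv.reader(io.StringIO(written)))
print(parsed == row)
```

The third-party file has no such quotes, so the commas inside the message cannot be told apart from delimiters.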
Posts: 170
Threads: 43
Joined: May 2019
OK, so that I can understand: how are you identifying that this is not a true CSV format file?