Python Forum
Correctly read a malformed CSV file data - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Correctly read a malformed CSV file data (/thread-39285.html)



Correctly read a malformed CSV file data - klllmmm - Jan-25-2023

I have a malformed CSV file where I need to create a proper dataframe.
[Image: s!An03iU493hAgnpxZ2rP2oUzB9BJNNw?e=djFMls]

I get an error when reading with pandas
Error:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

so I used csv module in order to rectify the error in csv data.

import csv
with open(r'C:\Users\Klllmmm\Downloads\test_CSV_file.csv', 'r') as file:
    csv_file = csv.DictReader(file)
    df = pd.DataFrame(csv_file)
Output:
df.head() Out[75]: SOURCE CATEGORY PERIOD_NAME \ 0 3 7 Sep-22 1 3 7 Sep-22 2 3 7 Apr-22 3 EP A 4926954 1068 EP 21-APR-2022 Receipts None 4 3 7 Apr-22 BATCH_NAME JOURNAL_NAME 0 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 1 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 2 ext001 None 3 None None 4 ext001 None
How can I combine the data from next line if the "JOURNAL_NAME" field is None and remove the next line.

This is the expected format of the table.

[Image: s!An03iU493hAgnpxaCh7UB-IHNWpGBQ?e=OLO7l1]

Appreciate it if someone can help on how to rectify the data.


RE: Correctly read a malformed CSV file data - buran - Jan-25-2023

read the whole file in memory and replace \n with just " "

import pandas as pd
from io import StringIO
file_path = r'path-to-file\test_CSV_file.csv'

with open(file_path, 'r') as f:
    data = f.read().replace('\n ', ' ')

df = pd.read_csv(StringIO(data))
print(df)
output

Output:
SOURCE CATEGORY PERIOD_NAME BATCH_NAME JOURNAL_NAME 0 3 7 Sep-22 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 1 3 7 Sep-22 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 2 3 7 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Receipts 3 3 7 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Receipts 4 3 6 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Deposits 5 3 7 Aug-22 ext002 EP A 4962225 1131 EP 03-AUG-2022 Receipts
If you want you can write data back to file in order to fix it, instead of using io.StringIO to directly work with data in pandas


RE: Correctly read a malformed CSV file data - klllmmm - Jan-25-2023

@buran

This is perfect....
Thank you so much!!


(Jan-25-2023, 02:18 PM)buran Wrote: read the whole file in memory and replace \n with just ""

import pandas as pd
from io import StringIO
file_path = r'path-to-file\test_CSV_file.csv'

with open(file_path, 'r') as f:
    data = f.read().replace('\n ', ' ')

df = pd.read_csv(StringIO(data))
print(df)
output

Output:
SOURCE CATEGORY PERIOD_NAME BATCH_NAME JOURNAL_NAME 0 3 7 Sep-22 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 1 3 7 Sep-22 ext002 EP A 5005720 1167 EP 23-SEP-2022 Receipts 2 3 7 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Receipts 3 3 7 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Receipts 4 3 6 Apr-22 ext001 EP A 4926954 1068 EP 21-APR-2022 Deposits 5 3 7 Aug-22 ext002 EP A 4962225 1131 EP 03-AUG-2022 Receipts
If you want you can write data back to file in order to fix it, instead of using io.StringIO to directly work with data in pandas