Correctly read a malformed CSV file data

Correctly read a malformed CSV file data - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Correctly read a malformed CSV file data (/thread-39285.html)

Correctly read a malformed CSV file data - klllmmm - Jan-25-2023

I have a malformed CSV file where I need to create a proper dataframe.
[Image: s!An03iU493hAgnpxZ2rP2oUzB9BJNNw?e=djFMls]

[Image: s!An03iU493hAgnpxZ2rP2oUzB9BJNNw?e=djFMls]

I get an error when reading with pandas

Error:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

so I used csv module in order to rectify the error in csv data.

import csv
with open(r'C:\Users\Klllmmm\Downloads\test_CSV_file.csv', 'r') as file:
    csv_file = csv.DictReader(file)
    df = pd.DataFrame(csv_file)

Output:df.head()
Out[75]: 
               SOURCE                 CATEGORY PERIOD_NAME  \
0                   3                        7      Sep-22   
1                   3                        7      Sep-22   
2                   3                        7      Apr-22   
3   EP A 4926954 1068  EP 21-APR-2022 Receipts        None   
4                   3                        7      Apr-22   

                 BATCH_NAME             JOURNAL_NAME  
0  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts  
1  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts  
2                    ext001                     None  
3                      None                     None  
4                    ext001                     None

How can I combine the data from next line if the "JOURNAL_NAME" field is None and remove the next line.

This is the expected format of the table.

[Image: s!An03iU493hAgnpxaCh7UB-IHNWpGBQ?e=OLO7l1]

[Image: s!An03iU493hAgnpxaCh7UB-IHNWpGBQ?e=OLO7l1]

Appreciate it if someone can help on how to rectify the data.

RE: Correctly read a malformed CSV file data - buran - Jan-25-2023

read the whole file in memory and replace \n with just " "

import pandas as pd
from io import StringIO
file_path = r'path-to-file\test_CSV_file.csv'

with open(file_path, 'r') as f:
    data = f.read().replace('\n ', ' ')

df = pd.read_csv(StringIO(data))
print(df)

output

Output:   SOURCE  CATEGORY PERIOD_NAME                BATCH_NAME             JOURNAL_NAME
0       3         7      Sep-22  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts
1       3         7      Sep-22  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts
2       3         7      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Receipts
3       3         7      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Receipts
4       3         6      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Deposits
5       3         7      Aug-22  ext002 EP A 4962225 1131  EP 03-AUG-2022 Receipts

If you want you can write data back to file in order to fix it, instead of using io.StringIO to directly work with data in pandas

RE: Correctly read a malformed CSV file data - klllmmm - Jan-25-2023

@buran

This is perfect....
Thank you so much!!

(Jan-25-2023, 02:18 PM)buran Wrote: read the whole file in memory and replace \n with just ""

import pandas as pd
from io import StringIO
file_path = r'path-to-file\test_CSV_file.csv'

with open(file_path, 'r') as f:
    data = f.read().replace('\n ', ' ')

df = pd.read_csv(StringIO(data))
print(df)

output

Output:   SOURCE  CATEGORY PERIOD_NAME                BATCH_NAME             JOURNAL_NAME
0       3         7      Sep-22  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts
1       3         7      Sep-22  ext002 EP A 5005720 1167  EP 23-SEP-2022 Receipts
2       3         7      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Receipts
3       3         7      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Receipts
4       3         6      Apr-22  ext001 EP A 4926954 1068  EP 21-APR-2022 Deposits
5       3         7      Aug-22  ext002 EP A 4962225 1131  EP 03-AUG-2022 Receipts

If you want you can write data back to file in order to fix it, instead of using io.StringIO to directly work with data in pandas