Scenerio: Parse the PDF Bank statement and transform into clean and formatted csv file.
What I've tried: I manage to parse the pdf file(tabular format) using camelot library but failed to produce the desired result in sense of formatting.
Code:
My_Result:
https://ibb.co/VYLczdr
Desired_Output:
https://ibb.co/2NZby99
The logic I came up with is to move those rows up i guess n-1 if date column is NaN i don't know if this logic is right or not.Can anyone help me to sort out this properly?
I tried pandas groupby and aggregation functions but it only merging the whole data and removing NaN and duplicate dates which is not suitable because every entry is necessary.
What I've tried: I manage to parse the pdf file(tabular format) using camelot library but failed to produce the desired result in sense of formatting.
Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
import camelot import pandas as pd tables = camelot.read_pdf( 'test.pdf' , pages = '3' ) for i, table in enumerate (tables): print ( f 'table_id:{i}' ) print ( f 'page:{table.page}' ) print ( f 'coordinates:{table._bbox}' ) tables = camelot.read_pdf( 'test.pdf' , flavor = 'stream' , pages = '3' ) columns = df.iloc[ 0 ] df.columns = columns df = df.drop( 0 ) df.head() for c in df.select_dtypes( 'object' ).columns: df[c] = df[c]. str .replace( '$' , '') df[c] = df[c]. str .replace( '-' , '') def convert_to_float(num): try : return float (num.replace( ',' ,'')) except : return 0 for col in [ 'Deposits' , 'Withdrawals' , 'Balance' ]: df[col] = df[col]. map (convert_to_float) |
https://ibb.co/VYLczdr
Desired_Output:
https://ibb.co/2NZby99
The logic I came up with is to move those rows up i guess n-1 if date column is NaN i don't know if this logic is right or not.Can anyone help me to sort out this properly?
I tried pandas groupby and aggregation functions but it only merging the whole data and removing NaN and duplicate dates which is not suitable because every entry is necessary.