Pandas DataFrame combine rows by column value, where Date Rows are NULL - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Pandas DataFrame combine rows by column value, where Date Rows are NULL (/thread-33554.html) |
Pandas DataFrame combine rows by column value, where Date Rows are NULL - rhat398 - May-04-2021 Scenerio: Parse the PDF Bank statement and transform into clean and formatted csv file. What I've tried: I manage to parse the pdf file(tabular format) using camelot library but failed to produce the desired result in sense of formatting. Code: import camelot import pandas as pd tables = camelot.read_pdf('test.pdf', pages = '3') for i, table in enumerate(tables): print(f'table_id:{i}') print(f'page:{table.page}') print(f'coordinates:{table._bbox}') tables = camelot.read_pdf('test.pdf', flavor='stream', pages = '3') columns = df.iloc[0] df.columns = columns df = df.drop(0) df.head() for c in df.select_dtypes('object').columns: df[c] = df[c].str.replace('$', '') df[c] = df[c].str.replace('-', '') def convert_to_float(num): try: return float(num.replace(',','')) except: return 0 for col in ['Deposits', 'Withdrawals', 'Balance']: df[col] = df[col].map(convert_to_float)My_Result: https://ibb.co/VYLczdr Desired_Output: https://ibb.co/2NZby99 The logic I came up with is to move those rows up i guess n-1 if date column is NaN i don't know if this logic is right or not.Can anyone help me to sort out this properly? I tried pandas groupby and aggregation functions but it only merging the whole data and removing NaN and duplicate dates which is not suitable because every entry is necessary. |