Oct-09-2023, 06:32 PM
Hello dear community,
I am faced with a challenge and need your expertise. I am working with an Excel spreadsheet that contains various columns.
I would like to aggregate similar rows of this table (each row is one transaction) based on multiple criteria, in order to end up with fewer records.
The original Excel file has grown to more than 94k rows, is roughly 40 MB in size, and has understandably become cumbersome to handle.
ChatGPT introduced me to Python and pandas, and after several hours, I have now reached a point where I can't proceed any further because ChatGPT keeps forgetting or omitting things.
I am not familiar with Python or any other programming language, and unfortunately I only understand fragments of what the code ChatGPT produced actually does.
I realize this is quite an imposition on any programmer, but I don't know any other way to get help.
Here's a link to a sample table on docs.google:
https://docs.google.com/spreadsheets/d/ ... sp=sharing
I've tried to compile what's supposed to happen, though my structure might be unnecessarily complicated.
I hope it's clear what I theoretically want to achieve.
Aggregation criteria:
If the value in the "isIgnored" column reads "TRUE" -> ignore the row.
If the value in the "isIgnored" column reads "FALSE" -> process the row.
Only rows with identical entries in "area" can be aggregated.
Only rows with identical entries in "location" can be aggregated.
Only rows with identical or empty entries in "type" can be aggregated.
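ChatGPT sketched the following for these base criteria. I can't judge it myself, so please treat it only as a rough sketch; the file and column names are the ones described above, and empty cells are assumed to come in as NaN:

import pandas as pd

# Load the sheet
df = pd.read_excel('testfile.xlsx', engine='openpyxl')

# Criteria 1 + 2: skip rows flagged as ignored ("TRUE"/"FALSE" is stored as text)
active = df[df['isIgnored'].astype(str).str.upper() != 'TRUE'].copy()

# Criteria 3-5: only rows sharing "area", "location" and "type" may be merged.
# Rows with an empty "type" would still need extra handling; with
# dropna=False they at least stay visible as their own group.
for keys, group in active.groupby(['area', 'location', 'type'], dropna=False):
    print(keys, len(group))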
At this point a check is necessary whose further steps depend on whether the entry in the "type" column is "deposit", "withdraw", or "order". Only rows of the same type can possibly be aggregated.
Let's start with deposit. I'll try to figure it out for the other two types on my own. Withdraw shouldn't be too hard (I hope) and for "Order", I might need to ask again.
The check for "deposit" goes as follows:
6 a) For "deposit", only rows can be aggregated whose entry in "boughtCurrencyId" is identical.
6 b) If the entries are identical, it needs to be checked if the entries in "feeCurrencyId" are identical.
6 c) If they are identical, they can be aggregated.
6 d) If they are different, the rows can only be aggregated if the deviating rows have an empty entry in "feeQuantity".
6 e) The fields "soldQuantity", "soldCurrency", and "soldCurrencyId" must be empty, otherwise the entries cannot be aggregated.
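For this check, ChatGPT gave me a helper like the one below (again only a sketch I can't fully verify; it assumes empty cells are read as NaN):

def deposits_compatible(group):
    # Rules 6 a) to 6 e) for a candidate group of "deposit" rows.
    # 6 a) all rows must share the same "boughtCurrencyId"
    if group['boughtCurrencyId'].nunique() > 1:
        return False
    # 6 b) to 6 d): rows that actually carry a fee must agree on
    # "feeCurrencyId"; rows with an empty "feeQuantity" don't count
    with_fee = group[group['feeQuantity'].notna()]
    if with_fee['feeCurrencyId'].nunique() > 1:
        return False
    # 6 e) the sold-side fields must be empty throughout
    sold_cols = ['soldQuantity', 'soldCurrency', 'soldCurrencyId']
    return group[sold_cols].isna().all().all()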
Rows should be aggregated for "deposit" as follows:
The time in the "timeExecuted" column should be the latest of the aggregated transactions.
There is a special case here: transactions on the last day of a month are only aggregated up to 11:59:59 pm, so an aggregated group never reaches into the following month.
In the "type" column, it remains (logically) "deposit".
The values from all aggregated rows in the "boughtQuantity" column are summed up.
If there are fee entries, the values from "feeQuantity" are to be summed as well.
The value from "feeCurrency" should be carried over too. If a transaction with empty "feeCurrency" and "feeCurrencyId" is included in the aggregation, the non-empty entry is kept.
The values from all aggregated rows in the "txId" column should be transferred to the aggregated transaction.
From the first to the last transaction, the entries should be joined with " + " in the new, aggregated transaction.
So, if the "txId" of 3 aggregated transactions reads 1, 2, and 3, the entry in the "txId" field of the aggregated transaction should read "1 + 2 + 3".
For the "Id" field, the entry of the last aggregated transaction should be adopted.
The "comments" field of the aggregated transaction should be filled as follows:
<Number of aggregated transactions> "Transactions" from <Timestamp of the earliest transaction included in this summary> to <Timestamp of the latest transaction included in this summary> aggregated. ID: <Entries of the fields from the "Id" column for all aggregated transactions, separated by a "+">
Then insert </ original comments> and behind it, all the entries in the "comments" fields of the aggregated transactions should be adopted.
A complete entry in the "comments" field might look like this:
4 transactions from 01.01.2020 00:09:12 to 01.01.2020 07:06:14 aggregated. ID: 1 + 2 + 3 + 4 / original comments: Hello + this + is + a + test
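Putting the "deposit" rules together, ChatGPT produced roughly this aggregation function. Once more this is only a sketch using the column names from the description; the end-of-month cutoff is not handled here (I suppose the month of "timeExecuted" would have to become part of the grouping key for that):

def aggregate_deposits(group):
    # Collapse a compatible group of "deposit" rows into one row.
    merged = group.iloc[-1].copy()                 # last row supplies "Id" etc.
    merged['timeExecuted'] = group['timeExecuted'].max()
    merged['boughtQuantity'] = group['boughtQuantity'].sum()
    # sum fees only if at least one row carries one (min_count keeps it empty otherwise)
    merged['feeQuantity'] = group['feeQuantity'].sum(min_count=1)
    # adopt the non-empty fee currency, if any row has one
    fee_currencies = group['feeCurrency'].dropna()
    merged['feeCurrency'] = fee_currencies.iloc[0] if not fee_currencies.empty else None
    merged['txId'] = ' + '.join(group['txId'].dropna().astype(str))
    ids = ' + '.join(group['Id'].astype(str))
    old = ' + '.join(group['comments'].dropna().astype(str))
    merged['comments'] = (f"{len(group)} transactions from {group['timeExecuted'].min()} "
                          f"to {group['timeExecuted'].max()} aggregated. ID: {ids}")
    if old:
        merged['comments'] += f" / original comments: {old}"
    return merged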
That concludes the procedure for the "deposit" type.
With ChatGPT's help, I have tried to get the following code for the "Order" area.
import pandas as pd

# Read the data
df = pd.read_excel('testfile.xlsx', engine='openpyxl')

# Step 1: Skip rows flagged in "isIgnored" (the flag is stored as text
# "TRUE"/"FALSE"; the rows shouldn't be deleted from the original file,
# only left out of the aggregation!)
df = df[df['isIgnored'].astype(str).str.upper() != 'TRUE']

# Function to summarize a group of transactions into a single row
def summarize_group(group):
    if len(group) == 1:
        return group
    first, last = group.iloc[0], group.iloc[-1].copy()
    # Build the comment in the agreed format; "id" keeps the entry of the
    # last transaction, the joined ids only go into the comment
    ids = ' + '.join(group['id'].astype(str))
    old_comments = ' + '.join(group['comments'].dropna().astype(str))
    last['comments'] = (f"{len(group)} transactions from {first['timeExecuted']} "
                        f"to {last['timeExecuted']} aggregated. ID: {ids}")
    if old_comments:
        last['comments'] += f" / original comments: {old_comments}"
    last['txId'] = ' + '.join(group['txId'].dropna().astype(str))
    return pd.DataFrame([last])

# Determine the keys for grouping.
# Note: the shift()/cumsum() trick only groups *consecutive* rows, so the
# sheet must be sorted so that rows belonging together are adjacent.
group_keys = df['type'].ne(df['type'].shift()).cumsum().astype(str)
if 'order' in df['type'].values:
    # For orders, area, location and the currency pair have to match as well
    pair = (df['area'].fillna('') + df['location'].fillna('')
            + df['boughtCurrency'].fillna('') + df['soldCurrency'].fillna(''))
    group_keys = group_keys + '-' + pair.ne(pair.shift()).cumsum().astype(str)

result = df.groupby(group_keys).apply(summarize_group).reset_index(drop=True)

# Save the aggregated DataFrame
result.to_excel('cleaned_testfile.xlsx', index=False)