Python Forum
Python and pandas: Aggregate lines from Excel sheet
#1
Hello dear community,

I am faced with a challenge and need your expertise. I am working with an Excel spreadsheet that contains various columns.
I would like to aggregate similar rows of this table (each row is a transaction) based on multiple criteria, so that I end up with fewer records.
The original Excel file has grown to more than 94k rows, is approximately 40 MB in size, and is understandably cumbersome to handle.

ChatGPT introduced me to Python and pandas, and after several hours, I have now reached a point where I can't proceed any further because ChatGPT keeps forgetting or omitting things.

I am not familiar with Python or with any other programming language, and I unfortunately only partially understand what the code ChatGPT produced is actually doing.

I realize this is quite an imposition on any programmer, but I don't know any other way to get help.

Here's a link to a sample table on docs.google:
https://docs.google.com/spreadsheets/d/ ... sp=sharing

I've tried to compile what's supposed to happen, though my structure might be unnecessarily complicated.
I hope it's clear what I theoretically want to achieve.

Aggregation criteria:

If the value in the "isIgnored" column reads "TRUE" -> ignore the row.
If the value in the "isIgnored" column reads "FALSE" -> process the row.
Only rows with identical entries in "area" can be aggregated.
Only rows with identical entries in "location" can be aggregated.
Only rows with identical or empty entries in "type" can be aggregated.
At this point a check is necessary, since the further steps differ depending on whether the entry in the "type" column is "deposit", "withdraw", or "order". Only entries of the same type can be aggregated.
Let's start with deposit. I'll try to figure out the other two types on my own. Withdraw shouldn't be too hard (I hope), and for "order" I might need to ask again.

The check for "deposit" goes as follows:
6 a) For "deposit", only rows can be aggregated where the entry in "boughtCurrencyId" is identical.
6 b) If the entries are identical, it needs to be checked if the entries in "feeCurrencyId" are identical.
6 c) If they are identical, they can be aggregated.
6 d) If they are different, they can only be aggregated if the entries of the other rows in "feeQuantity" are empty.
6 e) The fields "soldQuantity", "soldCurrency", and "soldCurrencyId" must be empty, otherwise the entries cannot be aggregated.
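
As far as I understand it, checks 6 a) to 6 e) boil down to something like this sketch in pandas (column names as described above; this is only my rough understanding, not working code I can vouch for):

import pandas as pd

def deposit_rows_compatible(group: pd.DataFrame) -> bool:
    """Checks 6 a) to 6 e) for a candidate group of "deposit" rows."""
    # 6 a) all rows must share one boughtCurrencyId
    if group['boughtCurrencyId'].nunique(dropna=False) > 1:
        return False
    # 6 b) to 6 d): among the rows that actually have a fee, the
    # feeCurrencyId must be identical; rows without a feeQuantity
    # entry do not block the aggregation
    has_fee = group['feeQuantity'].notna()
    if group.loc[has_fee, 'feeCurrencyId'].nunique() > 1:
        return False
    # 6 e) the "sold" fields must all be empty
    sold_cols = ['soldQuantity', 'soldCurrency', 'soldCurrencyId']
    return group[sold_cols].isna().all().all()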

Rows should be aggregated for "deposit" as follows:
The "timeExecuted" column should take the latest timestamp of the aggregated transactions.
There is one special rule: transactions on the last day of a month are only aggregated up to 11:59:59 pm, so an aggregated row never crosses into the next month.
In the "type" column, it remains (logically) "deposit".
The values from all aggregated rows in the "boughtQuantity" column are summed up.

If there are fee entries, the values from "feeQuantity" are summed.
The value from "feeCurrency" is carried over as well. If a transaction with an empty "feeCurrency" and "feeCurrencyId" is aggregated in too, the non-empty entry is kept.

The values from all aggregated rows in the "txId" column should be transferred to the aggregated transaction.
Starting from the first to the last, the entries should be separated by a "+" in the new and aggregated transaction.
So, if the "txId" of 3 aggregated transactions reads 1, 2, and 3, the entry in the "txId" field of the aggregated transaction should read "1 + 2 + 3".
For the "Id" field, the entry of the last aggregated transaction should be adopted.
The "comments" field of the aggregated transaction should be filled as follows:
<Number of aggregated transactions> "Transactions" from <Timestamp of the earliest transaction included in this summary> to <Timestamp of the latest transaction included in this summary> aggregated. ID: <Entries of the fields from the "Id" column for all aggregated transactions, separated by a "+">
Then insert " / original comments: " and, after it, all the entries from the "comments" fields of the aggregated transactions.

A complete entry in the "comments" field might look like this:
4 transactions from 01.01.2020 00:09:12 to 01.01.2020 07:06:14 aggregated. ID: 1 + 2 + 3 + 4 / original comments: Hello + this + is + a + test

That concludes the procedure for the "deposit" type.
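
Putting the rules above together, a rough sketch of how one compatible group of deposit rows might be collapsed could look like this (the month-end rule is simplified, and the column names are just how I read them from the sheet; I can't vouch for the details):

import pandas as pd

def aggregate_deposit_group(group: pd.DataFrame) -> pd.Series:
    # Assumes the group already passed checks 6 a) to 6 e) and does not
    # cross a month boundary (the 11:59:59 pm rule from above).
    group = group.sort_values('timeExecuted')
    first, last = group.iloc[0], group.iloc[-1]

    out = last.copy()  # keeps the Id and timeExecuted of the latest row
    out['boughtQuantity'] = group['boughtQuantity'].sum()
    out['feeQuantity'] = group['feeQuantity'].sum()  # pandas skips empty (NaN) cells
    fee_currencies = group['feeCurrency'].dropna()   # keep the non-empty fee currency, if any
    out['feeCurrency'] = fee_currencies.iloc[0] if not fee_currencies.empty else None
    out['txId'] = ' + '.join(group['txId'].dropna().astype(str))

    ids = ' + '.join(group['Id'].astype(str))
    originals = ' + '.join(group['comments'].dropna().astype(str))
    out['comments'] = (f"{len(group)} transactions from {first['timeExecuted']} "
                       f"to {last['timeExecuted']} aggregated. ID: {ids}")
    if originals:
        out['comments'] += f" / original comments: {originals}"
    return out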

With ChatGPT's help, I have tried to get the following code for the "Order" area.

import pandas as pd

# Read the data
df = pd.read_excel('testfile.xlsx', engine='openpyxl')

# Step 1: set rows with TRUE in "isIgnored" aside (they should be kept, not removed)
# Excel booleans may arrive as bool or as text, so normalise before comparing
is_ignored = df['isIgnored'].astype(str).str.upper() == 'TRUE'
ignored = df[is_ignored]
df = df[~is_ignored]

# Function to summarize groups of transactions
def summarize_group(group):
    if len(group) == 1:
        return group

    first, last = group.iloc[0], group.iloc[-1].copy()

    # Adjusting the relevant columns
    last['comments'] = f"Transactions from {first['timeExecuted']} to {last['timeExecuted']} (a total of {len(group)} aggregated)"
    originals = ' + '.join(group['comments'].dropna().astype(str))
    if originals:
        last['comments'] += ' / original comments: ' + originals
    last['id'] = ' + '.join(group['id'].astype(str))
    last['txId'] = ' + '.join(group['txId'].dropna().astype(str))

    return pd.DataFrame([last])

# Determine the keys for grouping: consecutive rows that share the same
# type, area, location and currencies fall into one group
key = (df['type'].astype(str) + df['area'].astype(str) + df['location'].astype(str)
       + df['boughtCurrency'].astype(str) + df['soldCurrency'].astype(str))
group_keys = key.ne(key.shift()).cumsum()

result = df.groupby(group_keys).apply(summarize_group).reset_index(drop=True)

# Re-attach the ignored rows unchanged (they were only set aside, not removed)
result = pd.concat([result, ignored], ignore_index=True)

# Save the cleaned DataFrame
result.to_excel('cleaned_testfile.xlsx', index=False)
#2
First, I would like to clear up the meaning of aggregate here:

"aggregate" here means "put rows which meet the criteria in a block, one beneath the other" or '"add the values of each row which meets the criteria to produce 1 row"?? I think you mean the latter.

Criterion 1 is, I believe: isIgnored = True: do nothing, next row please
Criterion 2 is, I believe: isIgnored = False: do something
Criterion 3 is, I believe: aggregate rows if columns area, location and type have identical values
Criterion 4 is, I believe: aggregate rows if column type has the same value, either: deposit, withdraw, order
Criterion 5 is, I believe: aggregate rows if column boughtCurrencyId is identical
Criterion 6 is, I believe: aggregate rows if column feeCurrencyId is identical
Criterion 7 is, I believe: aggregate rows if column feeCurrencyId is different and feeQuantity is empty
Criterion 8 is, I believe: do not aggregate rows if column soldQuantity, soldCurrency or soldCurrencyId are not empty

If "aggregate" means add the values from each row which meets the criteria in 1 row, what do you want to do with say, columns timeExecuted, classification, txId and id? Lose them in the output?

It seems to me you should sort at the time you make a new entry to the XL, but if you already have a big XL, it must be possible to sort it, perhaps even in XL.

Why don't you post a slightly bigger extract from your actual XL, say 100 rows?

Das kriegen wir hin! (We'll get this sorted!)
#3
(Oct-10-2023, 07:13 AM)Pedroski55 Wrote: First, I would like to clear up the meaning of aggregate here:

"aggregate" here means "put rows which meet the criteria in a block, one beneath the other" or '"add the values of each row which meets the criteria to produce 1 row"?? I think you mean the latter.

Thank you for your message. As you correctly assumed at the bottom of your post, I am not a native speaker, so please excuse me for not knowing that "aggregate" has two meanings.

Indeed I was thinking of the latter.


Quote:Criterion 1 is, I believe: isIgnored = True: do nothing, next row please
Criterion 2 is, I believe: isIgnored = False: do something

affirmative

Quote:Criterion 3 is, I believe: aggregate rows if columns area, location and type have identical values
Criterion 4 is, I believe: aggregate rows if column type has the same value, either: deposit, withdraw, order

correct

Quote:Criterion 5 is, I believe: aggregate rows if column boughtCurrencyId is identical
Criterion 6 is, I believe: aggregate rows if column feeCurrencyId is identical

This is correct for the type deposit. (If the type is withdraw, it is soldCurrencyId.)
A deposit of 6 Stones with 0.6 Stone fees and another deposit of 12 Stones with 0.1 Stone fees could be aggregated to
18 Stones deposited with 0.7 Stone fees.

Quote:Criterion 7 is, I believe: aggregate rows if column feeCurrencyId is different and feeQuantity is empty
Criterion 8 is, I believe: do not aggregate rows if column soldQuantity, soldCurrency or soldCurrencyId are not empty

This is a non-mandatory feature that should take care of the situation where there are rows with entries like, for example:
Line 1: order: 1 copper traded for 2 Stones (0.1 Stone fees)
Line 2: order: 2 copper traded for 4 Stones (no value entered in the fee field)
Line 3: order: 3 copper traded for 6 Stones (no entry in fee value or fee currency ID)
Line 4: order: 2 copper traded for 4 Stones (0.2 Gold fees, i.e. the fees were deducted in another currency)

So in this example, lines 1-3 could be aggregated (with the rest of the above criteria matching, of course) into one line:
6 copper traded for 12 Stones (0.1 Stone fees)
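
In pandas terms, I imagine this check looks roughly like the following (a hypothetical helper, not tested by me; with the example above, lines 1-3 together would pass, adding line 4 would fail):

import pandas as pd

def fees_compatible(group: pd.DataFrame) -> bool:
    # Rows may share a group if at most one distinct fee currency appears;
    # rows with no fee entry at all do not block aggregation.
    return group['feeCurrencyId'].dropna().nunique() <= 1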

Quote:If "aggregate" means add the values from each row which meets the criteria in 1 row, what do you want to do with say, columns timeExecuted, classification, txId and id? Lose them in the output?
The timeExecuted value of an aggregated line should be (for the types deposit and order) the timestamp of the latest aggregated transaction;
for the type withdraw it should be the earliest timestamp.

classification can be lost as all lines that should be summed up would have the same classification.

txId can be dropped for all lines that should be summed up and just the txId of the latest line should be kept.

Only the id of the latest line should be kept in the id field. The other ids should go into the comment field with an entry like:

"11.10.2023 07:00:00 to 11.10.2023 12:00:00: 27 deposits merged (or aggregated) ID 101 + 102 + 103 + ... + 127"

Quote:It seems to me you should sort at the time you make a new entry to the XL, but if you already have a big XL, it must be possible to sort it, perhaps even in XL.

Those lines are added by a piece of software.
So it would have been an "easy" (but annoying) task if I had summed up on a daily basis over the last few years...

Quote:Why don't you post a slightly bigger extract from your actual XL, say 100 rows?
I will happily do that, but I won't have access to the Excel sheet until Thursday evening, as I am traveling.

Quote:Das kriegen wir hin!

Das wäre großartig! That would be fantastic!
#4
(Oct-10-2023, 07:13 AM)Pedroski55 Wrote: Why don't you post a slightly bigger extract from your actual XL, say 100 rows?

I am finally back home and was able to add more lines :-)
#5
Please have a look to see if this starts what you need correctly.

I would save the output to a new Excel file. I added 2 columns:
1. How many rows joined here
2. Next source row to process

The purpose of 2. is, if you later return with a longer version of the source file, you know which row to start at.

Using the Excel you posted with 104 rows, the way I see it, you need 9 rows for different permutations of deposit, 4 rows for different permutations of withdraw and 4 rows for different permutations of order, if I understand what you want correctly.

Is that correct so far?

If this is correct, I can easily add similar rows, that is, rows having the same permutation of the important columns.

import openpyxl
# import os
# these 4 can help you format cells in openpyxl
# check the documentation online
from openpyxl.styles import PatternFill
from openpyxl.styles import Font
from openpyxl.utils import get_column_letter
from openpyxl.styles import Alignment

# set the paths
path2file = '/home/pedro/myPython/openpyxl/xlsx_files/Beispieltabelle.xlsx'
savepath = '/home/pedro/myPython/openpyxl/xlsx_files/condensed_Beispieltabelle.xlsx'
# get the source XL
sourceFile = openpyxl.load_workbook(path2file)
sourceFilesheets = sourceFile.sheetnames
# show the available sheets
for sheet in sourceFilesheets:
    print('These are the sheets:', sheet)
# get the sheet you want to process
mysheet = input('What sheet do you want to process? Copy and paste the sheet name ... ')
# get the maximum row and column numbers for later loops
smaxRow = sourceFile[mysheet].max_row
smaxCol = sourceFile[mysheet].max_column
# each entry in column 2 type is either deposit, withdraw or order
# deposit may have different boughtCurrency, col 4, boughtCurrencyId, col 5, soldCurrency, col 7, feeCurrency, col 10, classification, col 12,
# location, col 13, area, col 14 (and of course boughtCurrencyId, col 5, txId, col 15 and id, col 16, and comment, col 17)
# collect those different permutations
type_dict = {'deposit': [], 'withdraw': [], 'order': []}

for rowNum in range(2, smaxRow + 1):
    mylist = [sourceFile[mysheet].cell(row=rowNum, column=t).value for t in [4, 7, 10, 12, 13, 14]]
    # no doubles
    if mylist not in type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value]:
        type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value].append(mylist)
    
# these are the different permutations of the types deposit, withdraw and order
for key in type_dict.keys():
    print(key, type_dict[key])
    print('This list is', len(type_dict[key]), 'long')

"""
deposit [['Stone', None, 'Stone', 'mined', 'quarry', 'Kingstown'], ['Copper', None, None, 'add', 'quarry', 'Kingstown'], ['Iron', None, None, None, 'quarry', 'Kingstown'], ['Iron', None, None, 'mined', 'quarry', 'Kingstown'], ['Stone', None, None, 'internal', 'quarry', 'Kingstown'], ['Stone', None, None, 'internal', 'warehouse', 'Kingstown'], ['wood', None, None, None, 'warehouse', 'Kingstown'], ['gold', None, None, 'rent_income', 'tenement houses', 'Belize'], ['gold', None, None, 'rent_income', 'tenement houses', 'Kingstown']]
This list is 9 long
withdraw [[None, 'Stone', 'Stone', 'payment', 'quarry', 'Kingstown'], [None, 'Stone', 'Stone', 'remove', 'quarry', 'Kingstown'], [None, 'Stone', 'Stone', 'internal', 'quarry', 'Kingstown'], [None, 'Stone', 'Stone', 'internal', 'warehouse', 'Kingstown']]
This list is 4 long
order [['Stone', 'Copper', 'Copper', None, 'quarry', 'Kingstown'], ['Iron', 'Stone', 'Copper', None, 'quarry', 'Kingstown'], ['wool', 'silver', 'gold', None, 'market', 'Kingstown'], ['VAI', 'silver', 'gold', None, 'market', 'Kingstown']]
This list is 4 long

"""
# The above are the different combinations of type, boughtCurrency or soldCurrency, classification, location and area, 
# of course there may be more when the XL file is much longer
# at the moment you need 9 amalgamated rows for deposit and 4 amalgamated rows for withdraw and order
# make a new XL to take the results of processing
targetFile = openpyxl.Workbook()
# a new workbook only has 1 sheet called Sheet
# give the sheet a new name
tsheet = targetFile.active
# rename the sheet
tsheet.title = mysheet + '_condensed data'
targetFilesheets = targetFile.sheetnames
# for some reason I get an error here: the sheet is recognised as <Worksheet "condensed data">  not just "condensed data"
# so assign sheet directly
tsheet = targetFilesheets[0]
# save the basic file or reload the existing file
targetFile.save(savepath)
### reload the file because I added a column: How many rows joined here
### targetFile = openpyxl.load_workbook(savepath)
##tmaxRow = targetFile[sheet].max_row
##tmaxCol = targetFile[sheet].max_column
# get a list of tuples of format (column number, column name)
# will be useful later
column_numbers_names = []
count = 1
for colNum in range(1, smaxCol + 1):
    # write the column headers to the target file
    data1 = sourceFile[mysheet].cell(row=1, column=colNum).value
    # get 1 row of data from the sourceFile
    data2 = sourceFile[mysheet].cell(row=2, column=colNum).value
    # save the column_numbers_names for later
    column_numbers_names.append((count, data1))
    # put the headers in the target file
    targetFile[tsheet].cell(row=1, column=colNum).value=data1
    # put the first row of data in the targetFile
    targetFile[tsheet].cell(row=2, column=colNum).value=data2
    count +=1

# have a look at column_numbers_names any time you need to
# just run showColumn_data()
def showColumn_data():
    for tup in column_numbers_names:
        print(tup)

showColumn_data()
# save the basic file or reload the existing file
targetFile.save(savepath)
The above does not add similar rows yet. I would first like to know if I have understood what you want correctly.
#6
(Oct-14-2023, 06:46 AM)Pedroski55 Wrote: Please have a look to see if this starts what you need correctly.

Thank you very very very much! I will try tonight when I get back home and will report back asap.
#7
(Oct-14-2023, 06:46 AM)Pedroski55 Wrote: Please have a look to see if this starts what you need correctly.

Hey there...

Unfortunately I was not able to troubleshoot the following error:

These are the sheets: TX
What sheet do you want to process? Copy and paste the sheet name ... TX
Traceback (most recent call last):
  File "D:\Python-Pandas\script.py", line 33, in <module>
    if not mylist in type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value]:
                     ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: None
#8
In your Excel file, column 2 is type.

type can have only the values deposit, withdraw and order, as I understand it.

This is a dictionary with empty lists as values. The keys are type, which may have 3 different values.

type_dict = {'deposit': [], 'withdraw': [], 'order': []}
If you picked up None as a key value, there must be a row without an entry in column 2.

This gets the value of each cell in column 2, starting at row 2 and going down:

sourceFile[mysheet].cell(row=rowNum, column=2).value
If the above finds a cell without a value, that will cause a KeyError:

for rowNum in range(2, smaxRow + 1):
    mylist = [sourceFile[mysheet].cell(row=rowNum, column=t).value for t in [4, 7, 10, 12, 13, 14]]
    # no doubles
    if not mylist in type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value]:
        type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value].append(mylist)
You cannot assign the values anywhere if there is no type value! There should not be any empty cells in column 2, the type column.
A simple if clause can sidestep the problem by skipping such a row.
But if there are any other possible values for type, you should say so.

To illustrate that:

>>> type_dict = {'deposit': [], 'withdraw': [], 'order': []}
>>> 'deposit' in type_dict.keys()
True
>>> None in type_dict.keys()
False
Add the if not clause:

for rowNum in range(2, smaxRow + 1):
    # skip rows whose type is not one of the expected keys (e.g. empty cells)
    if sourceFile[mysheet].cell(row=rowNum, column=2).value not in type_dict.keys():
        continue
    mylist = [sourceFile[mysheet].cell(row=rowNum, column=t).value for t in [4, 7, 10, 12, 13, 14]]
    # no doubles
    if mylist not in type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value]:
        type_dict[sourceFile[mysheet].cell(row=rowNum, column=2).value].append(mylist)
If you have cells which may have numerical values but are empty, it is better to write zeroes in them first, because you can't add an integer and None.

It is easy to sweep through each row and the numerical columns and, if a cell is empty, write 0 in it.
#9
(Oct-16-2023, 06:33 PM)Pedroski55 Wrote: In your Excel file column 2 is type

type can have only the values deposit, withdraw and order, as I understand it.
If you have cells which may have numerical values but are empty, it is better to write zeroes in them first, because you can't add an integer and None.

It is easy to sweep through each row and the numerical columns and, if a cell is empty, write 0 in it.

Hey there,

thank you very much for your very helpful response.
I triple-checked the Excel and there is no empty field in column 2 (type),
except at the end of the Excel (where all fields / lines / columns are empty).
As far as I can see, Excel does not allow me to delete all those empty lines.
Are you suggesting I fill all those fields with zeros?
I am worried it could lead to Excel just adding more new lines (not sure here).

Unfortunately I won't have access to my workstation for 24h, so I can't try the if / if not clause yet...
But I will report back asap. Thank you again for your help!
#10
Yes, I found that the sample Excel file which you posted had an empty row at the bottom. The first sample XL had maxRow 31, but only 30 rows had entries, if I remember correctly.

I deleted row 31.

But now that is not necessary. The code below will simply ignore any rows which do not have deposit, withdraw or order in column 2:

for rowNum in range(2, smaxRow + 1):
    if not sourceFile[mysheet].cell(row=rowNum, column=2).value in type_dict.keys():
        continue
About inserting zeroes: None values can cause problems when you are trying to add. You can't add an integer and None.
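
You can see the problem straight away in the interpreter:

>>> 1 + None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'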

In your example Excel, row 2 is a deposit, so there is no entry in column soldQuantity, column 6:

>>> sourceFile[mysheet].cell(row=2, column=6).value is None
True
This will be true for all deposit rows. But I thought it easier to pick up all the values of all the rows, regardless of type.

So, if we first give all numerical cells with None values the value zero, that will simply make the addition simpler! Adding zero changes nothing! Take a look at my bank account!

Just as an example, the code below would set all None values in columns 4, 6 and 8 to zero:
import openpyxl

# set the paths
path2file = '/home/pedro/myPython/openpyxl/xlsx_files/Beispieltabelle.xlsx'
# get the source XL
sourceFile = openpyxl.load_workbook(path2file)
sourceFilesheets = sourceFile.sheetnames
# show the available sheets
for sheet in sourceFilesheets:
    print('These are the sheets:', sheet)
# get the sheet you want to process
mysheet = input('What sheet do you want to process? Copy and paste the sheet name ... ')
# get the maximum row and column numbers for later loops
smaxRow = sourceFile[mysheet].max_row
smaxCol = sourceFile[mysheet].max_column
# if, just as an example, columns 4, 6 and 8 can contain a number, set all empty cells there to zero first
for rowNum in range(2, smaxRow + 1):
    for colNum in range(4, 10, 2):
        if sourceFile[mysheet].cell(row=rowNum, column=colNum).value is None:
            sourceFile[mysheet].cell(row=rowNum, column=colNum, value=0)
Of course, you can avoid None values with an if-clause too!
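
For example, reusing sourceFile, mysheet and smaxRow from the block above, a running total that simply skips empty cells:

total = 0
for rowNum in range(2, smaxRow + 1):
    value = sourceFile[mysheet].cell(row=rowNum, column=6).value
    if value is not None:  # the if-clause: skip empty cells instead of adding None
        total += value
print('Sum of column 6, soldQuantity:', total)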

I have visitors from Germany at the moment, so not much time this week, but if you are happy with this approach of collecting the various permutations for deposit, withdraw and order, I will "amalgamate" the rows into a very small XL.