Python Forum

Full Version: dataframe groupby with totals for certain fields
Hello,

I frequently need to calculate aggregates over a number of dataframe columns, but with totals included.
I know I can ignore a specific dimension using the transform function (https://stackoverflow.com/questions/3094...-in-pandas)
But that adds an extra column, whereas I want extra rows with totals, for an arbitrary number of columns.

I currently use an expensive loop (see "for e in grpby" below), but it causes the runtime to explode dramatically.
Below is a simplified example of what I do. Does anyone know a more pythonic way of adding those totals?

Any help is vastly appreciated!

Regards,
Mikis

import pandas as pd
region = ['france', 'france', 'france', 'france', 'uk', 'uk', 'uk', 'uk']
illness = ['pain', 'pain', 'no pain', 'no pain', 'pain', 'pain', 'no pain', 'no pain']
respondent = [1, 2, 3, 1, 2, 3, 1, 2]
numbers = [1, 2, 3, 4, 5, 6, 7, 8]

data = list(zip(region, illness, respondent, numbers))
columns = ['region', 'illness', 'respondent', 'numbers']
df = pd.DataFrame(data=data, columns=columns)

grpby = ['region', 'illness']

df_scores = df.groupby(grpby) \
                .agg({'numbers':'mean', 'respondent':pd.Series.nunique}) \
                .rename(columns={'numbers':'score', 'respondent':"sample"}) \
                .reset_index()
#But I also want the total of region & illness (cf next)

#------------ THIS IS WHAT I DO RIGHT NOW
# Add all those totals within a loop (not very pythonic)
df_totals = df.copy()
for e in grpby:
    tmp = df_totals.copy()
    tmp[e] = 'All'
    df_totals = pd.concat([df_totals, tmp], ignore_index=True)
# And then group by, which gives me the outcome I want
df_scores_total = df_totals.groupby(grpby) \
                .agg({'numbers':'mean', 'respondent':pd.Series.nunique}) \
                .rename(columns={'numbers':'score', 'respondent':"sample"}) \
                .reset_index()
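For scale: each pass of the loop doubles df_totals, so k grouping columns inflate the concatenated frame to 2**k times its original size before the groupby even runs. A quick check with the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['france'] * 4 + ['uk'] * 4,
    'illness': ['pain', 'pain', 'no pain', 'no pain'] * 2,
    'respondent': [1, 2, 3, 1, 2, 3, 1, 2],
    'numbers': [1, 2, 3, 4, 5, 6, 7, 8],
})

grpby = ['region', 'illness']
df_totals = df.copy()
for e in grpby:                      # same loop as above
    tmp = df_totals.copy()
    tmp[e] = 'All'
    df_totals = pd.concat([df_totals, tmp], ignore_index=True)

# 8 rows -> 16 -> 32: the frame grows by 2**len(grpby)
print(len(df), len(df_totals))
```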
Your for-loop iterates over two items (or a few columns)... I don't understand why it is expensive (probably, copying and concatenating huge dataframes is what's expensive).
An alternative is to calculate the subtotals as a separate df and concatenate it to the source df, e.g.

gg=df.groupby('region').agg({'numbers':'mean', 'respondent': pd.Series.nunique}).reset_index()
gg['illness'] = 'All'
pd.concat([df, gg], ignore_index=True)
The same can be done with 'illness', and finally without groupby at all (i.e. df.agg(...)) for the grand total.
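Spelled out for both columns, the idea might look like this (a sketch; concatenating these pieces with the per-(region, illness) scores gives the full grid of totals):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['france'] * 4 + ['uk'] * 4,
    'illness': ['pain', 'pain', 'no pain', 'no pain'] * 2,
    'respondent': [1, 2, 3, 1, 2, 3, 1, 2],
    'numbers': [1, 2, 3, 4, 5, 6, 7, 8],
})

agg_spec = {'numbers': 'mean', 'respondent': 'nunique'}

# Subtotal over illness: one row per region
by_region = df.groupby('region').agg(agg_spec).reset_index()
by_region['illness'] = 'All'

# Subtotal over region: one row per illness
by_illness = df.groupby('illness').agg(agg_spec).reset_index()
by_illness['region'] = 'All'

# Grand total: no groupby at all
grand = pd.DataFrame({'region': ['All'], 'illness': ['All'],
                      'numbers': [df['numbers'].mean()],
                      'respondent': [df['respondent'].nunique()]})

totals = pd.concat([by_region, by_illness, grand], ignore_index=True)
print(totals)
```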
(Oct-19-2020, 12:21 AM)scidam Wrote: Your for-loop iterates over two items... I don't understand why it is expensive (probably, copying and concatenating huge df's are expensive).

Hi scidam,

Those 2 fields are just an example; it can be any number of columns (usually not too many, but still).
It's expensive in the sense that it's a loop, and I was told that if you're looping, most of the time you're probably doing it wrong :)
I was hoping there was some function or package that has a method for this (like the transform function, of which I'm a huge fan).
So there's nothing that would work like the hypothetical "column_totals" below?
gg=df.groupby(['region', 'illness']).column_totals(name = "Total").agg({'numbers':'mean', 'respondent': pd.Series.nunique}).reset_index() 
Quote: I was told that if you're looping, most of the time you're probably doing it wrong

In this case the loop doesn't apply to the whole dataframe; you just iterate over column names. Loops are also widely used inside libraries such as pandas and numpy, even if we don't see them. E.g. numpy.vectorize is essentially a for-loop.

As far as I understand, you need something like df.groupby([[column_subset1], [column_subset2], ...]), i.e. to perform the grouping by different column subsets, including subsets consisting of only one column (and then apply the agg method to each grouping).

So, you can probably implement a helper function that performs the grouping over the different column subsets and populates the dataframe with the calculated values (you will likely use a for-loop to implement this function).
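Such a helper might be sketched as follows (the name `groupby_with_totals` and its signature are my own): it loops over every subset of the grouping columns, marks that subset as 'All', and aggregates each piece separately, so only one extra copy of df is alive at a time instead of the 2**k-row concatenation:

```python
from itertools import combinations

import pandas as pd

def groupby_with_totals(df, grpby, agg_spec):
    """Group by grpby and append subtotal rows, with 'All' marking
    the dimensions that a given row totals over."""
    pieces = []
    for k in range(len(grpby) + 1):
        for subset in combinations(grpby, k):
            tmp = df.copy()
            for col in subset:            # collapse these dimensions
                tmp[col] = 'All'
            pieces.append(tmp.groupby(grpby).agg(agg_spec).reset_index())
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({
    'region': ['france'] * 4 + ['uk'] * 4,
    'illness': ['pain', 'pain', 'no pain', 'no pain'] * 2,
    'respondent': [1, 2, 3, 1, 2, 3, 1, 2],
    'numbers': [1, 2, 3, 4, 5, 6, 7, 8],
})

scores = groupby_with_totals(df, ['region', 'illness'],
                             {'numbers': 'mean', 'respondent': 'nunique'})
# 4 detail rows + 2 region totals + 2 illness totals + 1 grand total = 9 rows
print(scores)
```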
(Oct-23-2020, 05:41 AM)scidam Wrote: So, you probably can implement a helper function that will perform grouping over different column subsets and populate the dataframe with calculated values.
Ok thanks! Guess I wasn't being as silly as I thought :)