What exactly does .agg() do?

Mark17 · Nov-07-2023, 02:26 PM

Hi all,

I don't fully appreciate what the .agg() method is for. This creates a df:

import pandas as pd

name = ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie']
breed = ['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']
color = ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White']
height_cm = [56, 43, 46, 49, 59, 18, 77]
weight_kg = [25, 23, 22, 17, 29, 2, 74]
dob = ['2013-07-01', '2016-09-16', '2014-08-25', '2011-12-11', '2017-01-20', '2015-04-20', '2018-02-27']
dogs_dict = {'Name':name, 'Breed':breed, 'Color':color, 'Height (cm)':height_cm, 'Weight (kg)':weight_kg, 'Date of Birth':dob}
dogs = pd.DataFrame(dogs_dict)

Now, with or without the .agg() method, these do the same thing:

def pct30(column):
    return column.quantile(0.3)

print(pct30(dogs[['Height (cm)', 'Weight (kg)']]))
print()
print(dogs[['Height (cm)', 'Weight (kg)']].agg(pct30))

What is the reason to use .agg(), exactly? Thanks!

noisefloor · Nov-07-2023, 04:04 PM

The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See the Pandas documentation for details.

Simple example: Create a dataframe with two columns and calculate the sum and average per column:

>>> import pandas as pd
>>> data = {'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]}
>>> df = pd.DataFrame(data = data)
>>> df
   height  weight
0     100      10
1     120      20
2     160      15
3     200       8
>>> df.agg('sum')
height    580
weight     53
dtype: int64
>>> df.agg('mean')
height    145.00
weight     13.25
dtype: float64

You can also aggregate specific columns only:

>>> df.agg({'height': ['mean']})
      height
mean   145.0

The examples above use Pandas' built-in functions, but you can also aggregate with a custom function:

>>> def calc_median(column):
...     # column is a Series, so its own median() method is available
...     return column.median()
...
>>> df.agg(calc_median)
height    140.0
weight     12.5
dtype: float64
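
The row-wise case mentioned at the top can be sketched the same way. This is only a minimal illustration on the same small dataframe; adding a height to a weight is not meaningful, it just shows which direction the aggregation runs:

```python
import pandas as pd

df = pd.DataFrame({'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]})

# Default (axis=0): one result per column.
per_column = df.agg('sum')

# axis=1: one result per row.
per_row = df.agg('sum', axis=1)

print(per_column)
print(per_row)
```

With axis=0 the result is indexed by the column names; with axis=1 it is indexed by the row labels.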

Regards, noisefloor

Mark17 · Nov-07-2023, 06:33 PM

(Nov-07-2023, 04:04 PM)noisefloor Wrote: The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See Pandas documentation for details.

So it aggregates across columns (default) or rows (axis = 1). How can I get a list of which functions can be used? I understand custom functions can. You used 'sum' and 'mean'. I wondered about 'mean' because this worked for me earlier:

print(dogs.groupby('Color')['Weight (kg)'].agg([min, max, sum]))

However, when I tried .agg([min, max, sum, mean]), I get:

NameError: name 'mean' is not defined

Thanks!

snippsat · (This post was last modified: Nov-07-2023, 07:10 PM by snippsat.)

In addition to noisefloor's good explanation.

Your code:

>>> dogs
      Name        Breed  Color  Height (cm)  Weight (kg) Date of Birth
0    Bella     Labrador  Brown           56           25    2013-07-01
1  Charlie       Poodle  Black           43           23    2016-09-16
2     Lucy    Chow Chow  Brown           46           22    2014-08-25
3   Cooper    Schnauzer   Gray           49           17    2011-12-11
4      Max     Labrador  Black           59           29    2017-01-20
5   Stella    Chihuahua    Tan           18            2    2015-04-20
6   Bernie  St. Bernard  White           77           74    2018-02-27

>>> dogs[['Height (cm)', 'Weight (kg)']].agg([pct30, 'mean', 'max'])
       Height (cm)  Weight (kg)
pct30    45.400000    21.000000
mean     49.714286    27.428571
max      77.000000    74.000000

This gives you the 30th percentile, mean, and max for both columns in one output.

This computes the 30th percentile and mean for Height (cm) only, and the max for Weight (kg) only.

>>> dogs.agg({'Height (cm)': [pct30, 'mean'], 'Weight (kg)': 'max'})
       Height (cm)  Weight (kg)
pct30    45.400000          NaN
mean     49.714286          NaN
max            NaN         74.0

Most people pass the function names as strings.

>>> dogs.groupby('Breed')[['Height (cm)', 'Weight (kg)']].agg(['mean', 'max', 'sum'])
            Height (cm)          Weight (kg)        
                   mean max  sum        mean max sum
Breed                                               
Chihuahua          18.0  18   18         2.0   2   2
Chow Chow          46.0  46   46        22.0  22  22
Labrador           57.5  59  115        27.0  29  54
Poodle             43.0  43   43        23.0  23  23
Schnauzer          49.0  49   49        17.0  17  17
St. Bernard        77.0  77   77        74.0  74  74

Mark17 Wrote:How can I get a list of what functions can be used?

Quote:Built-in String Methods: These are simple aggregations that pandas understands as strings,
like 'sum', 'mean', 'std', 'var', 'min', 'max', 'median', 'mode', 'count', 'prod' (for product), 'size' (counts NaN values too), and more.

Functions from the numpy library: Since pandas is built on top of numpy,
you can use functions from numpy like np.sum, np.mean, np.std, np.var, np.min, np.max, np.median, np.prod, and many others.

Custom Functions: You can define your own function that takes a Series and returns a value. This function can then be passed to .agg()
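
As a rough sketch of the three kinds side by side, using noisefloor's small dataframe: my_mean is a made-up helper for illustration only. Depending on your pandas version, passing a NumPy function may emit a warning suggesting the string name instead, so treat the np.mean line as an assumption about your setup.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]})

def my_mean(column):
    # Hypothetical custom aggregation: receives one column (a Series),
    # returns a single value.
    return column.sum() / len(column)

by_string = df.agg('mean')    # string name resolved by pandas
by_numpy = df.agg(np.mean)    # NumPy function
by_custom = df.agg(my_mean)   # user-defined function

print(by_string)
print(by_numpy)
print(by_custom)
```

All three should print the same per-column means here; the string form is usually the shortest to write.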

deanhystad · Nov-07-2023, 07:09 PM

I think the question is more "Why would I use df.agg(func) instead of func(df)?" The original post shows a function that works when passed a dataframe or a column (series), so there is little difference between df.agg(pct30) and pct30(df).

Let's try a different function. We import statistics and use the median function.

import pandas as pd
import statistics

def median(column):
    return statistics.median(column)

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))

Output:
height    140.0
weight     12.5
dtype: float64

Look what happens when we try to pass df as an argument to median.

print(median(df))

Error:
Traceback (most recent call last):
  File "agg_test.py", line 10, in <module>
    print(median(df))
  File "agg_test.py", line 6, in median
    return statistics.median(column)
  File "C:\Program Files\Python310\lib\statistics.py", line 457, in median
    return (data[i - 1] + data[i]) / 2
TypeError: unsupported operand type(s) for /: 'str' and 'int'

That's an interesting error. statistics.median doesn't know how to work with a DataFrame: iterating over a DataFrame yields its column labels, so median ended up trying to average the strings 'height' and 'weight'. When using df.agg(), agg splits the dataframe into series and passes them one at a time to the function. You can see that here.

def median(column):
    print(column)
    return statistics.median(column)


df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))

Output:
0    100
1    120
2    160
3    200
Name: height, dtype: int64
0    10
1    20
2    15
3     8
Name: weight, dtype: int64
height    140.0
weight     12.5
dtype: float64

This allows using functions that are not "pandas aware". Functions that take an iterable as an argument should work.
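
Here is one more sketch of the same idea with a function that uses only plain Python. value_range is a made-up name for illustration; it knows nothing about pandas and just needs an iterable of numbers.

```python
import pandas as pd

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})

def value_range(values):
    # Plain Python: works on any iterable of numbers.
    values = list(values)
    return max(values) - min(values)

# Iterating over a DataFrame yields its column labels, not its data.
labels = list(df)
print(labels)

# agg hands the function one column (a Series) at a time.
result = df.agg(value_range)
print(result)

# Calling the function on the whole DataFrame only sees the labels and fails.
try:
    value_range(df)
    direct_call_failed = False
except TypeError as exc:
    direct_call_failed = True
    print("direct call failed:", exc)
```

The direct call tries to subtract one column label from another, which is why it raises a TypeError, while the agg call works column by column.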

noisefloor · Nov-08-2023, 07:01 AM

And to answer

(Nov-07-2023, 06:33 PM)Mark17 Wrote: However, when I tried .agg([min, max, sum, mean])), I get:

The difference is the missing quotes in your code. Your code works if you write

.agg(['min', 'max', 'sum', 'mean'])

Note the quotes in this code.

Your code like .agg([min, max]) expects callables / functions named min and max to exist, and these are then called from the agg method. min, max and sum happen to be Python built-in functions, which is why .agg([min, max, sum]) worked; there is no built-in named mean, hence the NameError. By contrast, .agg(['min', 'max']) calls the built-in functions from Pandas, as explained in @snippsat's post.
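
A small sketch to check this yourself, using only the two columns of your dogs data that the groupby needs:

```python
import builtins
import pandas as pd

dogs = pd.DataFrame({
    'Color': ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White'],
    'Weight (kg)': [25, 23, 22, 17, 29, 2, 74],
})

# Which bare names exist as Python built-ins?
is_builtin = {name: hasattr(builtins, name) for name in ['min', 'max', 'sum', 'mean']}
print(is_builtin)

# Strings are looked up by pandas itself, so 'mean' works here.
summary = dogs.groupby('Color')['Weight (kg)'].agg(['min', 'max', 'sum', 'mean'])
print(summary)
```

Only mean is missing from the built-ins, which matches the NameError you saw.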

Regards, noisefloor
