Python Forum
What exactly does .agg() do?
#1
Hi all,

I'm not appreciating the functionality of the .agg() method. This creates a df:

import pandas as pd

name = ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie']
breed = ['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']
color = ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White']
height_cm = [56, 43, 46, 49, 59, 18, 77]
weight_kg = [25, 23, 22, 17, 29, 2, 74]
dob = ['2013-07-01', '2016-09-16', '2014-08-25', '2011-12-11', '2017-01-20', '2015-04-20', '2018-02-27']
dogs_dict = {'Name':name, 'Breed':breed, 'Color':color, 'Height (cm)':height_cm, 'Weight (kg)':weight_kg, 'Date of Birth':dob}
dogs = pd.DataFrame(dogs_dict)
Now, with or without the .agg() method, these do the same thing:

def pct30(column):
    return column.quantile(0.3)

print(pct30(dogs[['Height (cm)', 'Weight (kg)']]))
print()
print(dogs[['Height (cm)', 'Weight (kg)']].agg(pct30))
What is the reason to use .agg(), exactly? Thanks!
#2
The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See Pandas documentation for details.

Simple example: Create a dataframe with two columns and calculate the sum and average per column:

>>> import pandas as pd
>>> data = {'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]}
>>> df = pd.DataFrame(data = data)
>>> df
   height  weight
0     100      10
1     120      20
2     160      15
3     200       8
>>> df.agg('sum')
height    580
weight     53
dtype: int64
>>> df.agg('mean')
height    145.00
weight     13.25
dtype: float64
You can also aggregate specific columns only:

>>> df.agg({'height': ['mean']})
      height
mean   145.0
The examples above use built-in functions of Pandas, but you can also aggregate with a custom function:

>>> def calc_median(column):
...     return column.median()
...
>>> df.agg(calc_median)
height    140.0
weight     12.5
dtype: float64
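To illustrate the "columns by default, or rows" part, here is a minimal sketch using the same small dataframe. The variable names per_column and per_row are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})

# Default (axis=0): the aggregation is applied to each column.
per_column = df.agg("sum")

# axis=1: the aggregation is applied to each row instead.
per_row = df.agg("sum", axis=1)

print(per_column)
print(per_row)
```

With axis=1 the result has one value per row rather than one per column.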
Regards, noisefloor
#3
(Nov-07-2023, 04:04 PM)noisefloor Wrote: The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See Pandas documentation for details.

So it aggregates across columns (default) or rows (axis=1). How can I get a list of what functions can be used? I understand custom functions can. You used 'sum' and 'mean'. I wondered about 'mean' because this worked for me earlier:

print(dogs.groupby('Color')['Weight (kg)'].agg([min, max, sum]))
However, when I tried .agg([min, max, sum, mean]), I get:

NameError: name 'mean' is not defined

Thanks!
#4
In addition to noisefloor's good explanation.

Your code:
>>> dogs
      Name        Breed  Color  Height (cm)  Weight (kg) Date of Birth
0    Bella     Labrador  Brown           56           25    2013-07-01
1  Charlie       Poodle  Black           43           23    2016-09-16
2     Lucy    Chow Chow  Brown           46           22    2014-08-25
3   Cooper    Schnauzer   Gray           49           17    2011-12-11
4      Max     Labrador  Black           59           29    2017-01-20
5   Stella    Chihuahua    Tan           18            2    2015-04-20
6   Bernie  St. Bernard  White           77           74    2018-02-27

>>> dogs[['Height (cm)', 'Weight (kg)']].agg([pct30, 'mean', 'max'])
       Height (cm)  Weight (kg)
pct30    45.400000    21.000000
mean     49.714286    27.428571
max      77.000000    74.000000
This would give you the 30th percentile, mean, and max for both columns in one output.

This would compute the 30th percentile and mean for Height (cm) only, and the max for Weight (kg). Combinations that were not requested show up as NaN.
>>> dogs.agg({'Height (cm)': [pct30, 'mean'], 'Weight (kg)': 'max'})
       Height (cm)  Weight (kg)
pct30    45.400000          NaN
mean     49.714286          NaN
max            NaN         74.0
Most people pass the aggregation names as strings.
>>> dogs.groupby('Breed')[['Height (cm)', 'Weight (kg)']].agg(['mean', 'max', 'sum'])
            Height (cm)          Weight (kg)        
                   mean max  sum        mean max sum
Breed                                               
Chihuahua          18.0  18   18         2.0   2   2
Chow Chow          46.0  46   46        22.0  22  22
Labrador           57.5  59  115        27.0  29  54
Poodle             43.0  43   43        23.0  23  23
Schnauzer          49.0  49   49        17.0  17  17
St. Bernard        77.0  77   77        74.0  74  74
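If the two-level column header above is awkward to work with, one possible alternative is pandas' named aggregation, where each output column gets its own name. This is only a sketch: the output names mean_height and max_weight are made up for illustration, and the dataframe is trimmed to the columns needed.

```python
import pandas as pd

dogs = pd.DataFrame({
    "Breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "Height (cm)": [56, 43, 46, 49, 59, 18, 77],
    "Weight (kg)": [25, 23, 22, 17, 29, 2, 74],
})

# Named aggregation: output_name=(source column, aggregation).
# The output names here are hypothetical; pick whatever reads well.
summary = dogs.groupby("Breed").agg(
    mean_height=("Height (cm)", "mean"),
    max_weight=("Weight (kg)", "max"),
)
print(summary)
```

The result has flat column names, one per requested aggregation.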
Mark17 Wrote:How can I get a list of what functions can be used?
Quote:Built-in String Methods: These are simple aggregations that pandas understands as strings,
like 'sum', 'mean', 'std', 'var', 'min', 'max', 'median', 'mode', 'count', 'prod' (for product), 'size' (counts NaN values too), and more.

Functions from the numpy library: Since pandas is built on top of numpy,
you can use functions from numpy like np.sum, np.mean, np.std, np.var, np.min, np.max, np.median, np.prod, and many others.

Custom Functions: You can define your own function that takes a Series and returns a value. This function can then be passed to .agg()
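
As a rough sketch of the three options quoted above, the snippet below passes a string name, a numpy function, and a custom function to .agg(). The helper value_range is a made-up example, not something from pandas. Depending on the pandas version, passing a numpy reducer may emit a FutureWarning recommending the string name instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})

def value_range(column):
    """Hypothetical custom aggregation: spread between largest and smallest value."""
    return column.max() - column.min()

by_name = df.agg("median")      # string name understood by pandas
by_numpy = df.agg(np.median)    # numpy function applied to each column
by_custom = df.agg(value_range) # custom function: Series in, single value out

print(by_name)
print(by_numpy)
print(by_custom)
```

The string and numpy variants should give the same medians here; the custom function shows that anything taking a Series and returning one value can be used.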
#5
I think the question is more "Why would I use df.agg(func) instead of func(df)?" The original post shows a function that works when passed a dataframe or a column (series), so there is little difference between df.agg(pct30) and pct30(df).

Let's try a different function. We import statistics and use the median function.
import pandas as pd
import statistics

def median(column):
    return statistics.median(column)

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))
Output:
height    140.0
weight     12.5
dtype: float64
Look what happens when we try to pass df as an argument to median.
print(median(df))
Error:
Traceback (most recent call last):
  File "agg_test.py", line 10, in <module>
    print(median(df))
  File "agg_test.py", line 6, in median
    return statistics.median(column)
  File "C:\Program Files\Python310\lib\statistics.py", line 457, in median
    return (data[i - 1] + data[i]) / 2
TypeError: unsupported operand type(s) for /: 'str' and 'int'
That's an interesting error. statistics.median doesn't know how to work with a DataFrame: iterating over a DataFrame yields its column labels, so median ended up trying to average the strings 'height' and 'weight'. When using df.agg(), agg splits the dataframe into series and passes them one at a time to the function. You can see that here.
def median(column):
    print(column)
    return statistics.median(column)


df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))
Output:
0    100
1    120
2    160
3    200
Name: height, dtype: int64
0    10
1    20
2    15
3     8
Name: weight, dtype: int64
height    140.0
weight     12.5
dtype: float64
This allows using functions that are not "pandas aware". Functions that take an iterable as an argument should work.
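To see where the TypeError earlier in this post comes from, here is a small sketch. It shows what a plain iterable-based function sees when it is handed the whole dataframe, and then does by hand roughly what agg does per column. The names labels, error and per_column are only for illustration.

```python
import statistics

import pandas as pd

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})

# Iterating over a DataFrame yields the column labels, not the data.
labels = list(df)
print(labels)

# So statistics.median(df) tries to average two strings and fails.
error = None
try:
    statistics.median(df)
except TypeError as exc:
    error = exc
print(type(error).__name__, error)

# Passing one column (Series) at a time works, which is roughly what agg arranges.
per_column = {name: statistics.median(df[name]) for name in df}
print(per_column)
```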
#6
And to answer

(Nov-07-2023, 06:33 PM)Mark17 Wrote: However, when I tried .agg([min, max, sum, mean])), I get:
The difference is the missing quotes in your code. Your code works if you write

.agg(['min', 'max', 'sum', 'mean'])
Note the quotes in this code.

Code like .agg([min, max]) expects two callables / functions named min and max to be defined somewhere. min, max and sum happen to be Python built-ins, so those names resolve; there is no built-in named mean, hence the NameError. The callables are then called from the agg method. In contrast, .agg(['min', 'max']) calls the built-in functions from Pandas, as explained in @snippsat's post.
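
A quick way to check the point about names, sketched with only the two columns of the dogs dataframe that the groupby needs. The dictionary has_builtin is just for illustration.

```python
import builtins

import pandas as pd

dogs = pd.DataFrame({
    "Color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "Weight (kg)": [25, 23, 22, 17, 29, 2, 74],
})

# min, max and sum exist as Python built-ins; mean does not,
# which is why the bare name mean raises NameError.
has_builtin = {name: hasattr(builtins, name) for name in ("min", "max", "sum", "mean")}
print(has_builtin)

# With quotes, pandas looks up its own aggregation for each name.
summary = dogs.groupby("Color")["Weight (kg)"].agg(["min", "max", "sum", "mean"])
print(summary)
```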

Regards, noisefloor