Posts: 279
Threads: 107
Joined: Aug 2019
Hi all,
I don't see the point of the .agg() method. This creates a DataFrame:
import pandas as pd
name = ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie']
breed = ['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']
color = ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White']
height_cm = [56, 43, 46, 49, 59, 18, 77]
weight_kg = [25, 23, 22, 17, 29, 2, 74]
dob = ['2013-07-01', '2016-09-16', '2014-08-25', '2011-12-11', '2017-01-20', '2015-04-20', '2018-02-27']
dogs_dict = {'Name':name, 'Breed':breed, 'Color':color, 'Height (cm)':height_cm, 'Weight (kg)':weight_kg, 'Date of Birth':dob}
dogs = pd.DataFrame(dogs_dict)
Now, with or without the .agg() method, these do the same thing:
def pct30(column):
    return column.quantile(0.3)
print(pct30(dogs[['Height (cm)', 'Weight (kg)']]))
print()
print(dogs[['Height (cm)', 'Weight (kg)']].agg(pct30))
What is the reason to use .agg(), exactly? Thanks!
Posts: 136
Threads: 0
Joined: Jun 2019
The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See Pandas documentation for details.
Simple example: create a DataFrame with two columns and calculate the sum and average per column:
>>> import pandas as pd
>>> data = {'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]}
>>> df = pd.DataFrame(data = data)
>>> df
   height  weight
0     100      10
1     120      20
2     160      15
3     200       8
>>> df.agg('sum')
height    580
weight     53
dtype: int64
>>> df.agg('mean')
height    145.00
weight     13.25
dtype: float64
You can also aggregate specific columns only:
>>> df.agg({'height': ['mean']})
      height
mean   145.0
The examples above use built-in functions of Pandas, but you can also aggregate with a custom function:
>>> def calc_median(column):
...     return column.median()
...
>>> df.agg(calc_median)
height    140.0
weight     12.5
dtype: float64
Regards, noisefloor
Posts: 279
Threads: 107
Joined: Aug 2019
(Nov-07-2023, 04:04 PM)noisefloor Wrote: The agg() method performs aggregation on the columns (or rows, but by default columns) of a dataframe. See Pandas documentation for details.
So it aggregates each column (default) or each row (axis = 1). How can I get a list of the functions that can be used? I understand custom functions can. You used 'sum' and 'mean'. I wondered about 'mean' because this worked for me earlier:
print(dogs.groupby('Color')['Weight (kg)'].agg([min, max, sum]))
However, when I tried .agg([min, max, sum, mean]), I get:
NameError: name 'mean' is not defined
Thanks!
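On the axis point, a minimal sketch using noisefloor's small DataFrame may help. The variable names per_column and per_row are only illustrative. With the default axis=0 each column is reduced to one value; with axis=1 each row is.

```python
import pandas as pd

df = pd.DataFrame({'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]})

# Default (axis=0): the function is applied to each column, one result per column.
per_column = df.agg('sum')

# axis=1: the function is applied to each row, one result per row.
per_row = df.agg('sum', axis=1)

print(per_column)
print(per_row)
```

Adding a height to a weight is not meaningful, of course; the row-wise call is only there to show which direction the reduction runs.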
Posts: 7,324
Threads: 123
Joined: Sep 2016
Nov-07-2023, 07:07 PM
(This post was last modified: Nov-07-2023, 07:10 PM by snippsat.)
In addition to noisefloor's good explanation.
Your code.
>>> dogs
      Name        Breed  Color  Height (cm)  Weight (kg)  Date of Birth
0    Bella     Labrador  Brown           56           25     2013-07-01
1  Charlie       Poodle  Black           43           23     2016-09-16
2     Lucy    Chow Chow  Brown           46           22     2014-08-25
3   Cooper    Schnauzer   Gray           49           17     2011-12-11
4      Max     Labrador  Black           59           29     2017-01-20
5   Stella    Chihuahua    Tan           18            2     2015-04-20
6   Bernie  St. Bernard  White           77           74     2018-02-27
>>> dogs[['Height (cm)', 'Weight (kg)']].agg([pct30, 'mean', 'max'])
       Height (cm)  Weight (kg)
pct30    45.400000    21.000000
mean     49.714286    27.428571
max      77.000000    74.000000
This gives you the 30th percentile, mean, and max for both columns in one output.
The next one computes the 30th percentile and mean for Height (cm) only, and the max for Weight (kg) only. Combinations that were not requested show up as NaN.
>>> dogs.agg({'Height (cm)': [pct30, 'mean'], 'Weight (kg)': 'max'})
       Height (cm)  Weight (kg)
pct30    45.400000          NaN
mean     49.714286          NaN
max            NaN         74.0
Most people pass the aggregation names as strings.
>>> dogs.groupby('Breed')[['Height (cm)', 'Weight (kg)']].agg(['mean', 'max', 'sum'])
            Height (cm)           Weight (kg)
                   mean max  sum         mean max sum
Breed
Chihuahua          18.0  18   18          2.0   2   2
Chow Chow          46.0  46   46         22.0  22  22
Labrador           57.5  59  115         27.0  29  54
Poodle             43.0  43   43         23.0  23  23
Schnauzer          49.0  49   49         17.0  17  17
St. Bernard        77.0  77   77         74.0  74  74
Mark17 Wrote: How can I get a list of what functions can be used?
Quote:
Built-in String Methods: These are simple aggregations that pandas understands as strings, like 'sum', 'mean', 'std', 'var', 'min', 'max', 'median', 'mode', 'count', 'prod' (for product), 'size' (counts NaN values too), and more.
Functions from the numpy library: Since pandas is built on top of numpy, you can use functions from numpy like np.sum, np.mean, np.std, np.var, np.min, np.max, np.median, np.prod, and many others.
Custom Functions: You can define your own function that takes a Series and returns a value. This function can then be passed to .agg().
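A rough sketch of the three kinds of argument listed above, assuming the small height/weight DataFrame from earlier in the thread. The helper name value_range is made up for illustration. Some recent pandas versions emit a FutureWarning when a NumPy function is passed directly, so the sketch silences that category; treat this as an assumption about your installed version.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [100, 120, 160, 200], 'weight': [10, 20, 15, 8]})

# 1. String name: pandas looks up its own implementation.
by_string = df.agg('max')

# 2. NumPy function: applied to each column.
#    The FutureWarning filter is only a precaution for newer pandas versions.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', FutureWarning)
    by_numpy = df.agg(np.median)

# 3. Custom function (hypothetical helper): receives one column (a Series)
#    and returns a single value.
def value_range(column):
    return column.max() - column.min()

by_custom = df.agg(value_range)

print(by_string)
print(by_numpy)
print(by_custom)
```

If in doubt, the string form is the one the posts above recommend.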
Posts: 6,809
Threads: 20
Joined: Feb 2020
I think the question is more "Why would I use df.agg(func) instead of func(df)?" The original post shows a function that works when passed either a DataFrame or a column (Series), so there is little difference between df.agg(pct30) and pct30(df).
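For what it's worth, here is a small sketch of why both calls agree for pct30, assuming the dogs data from the first post (only the two numeric columns are rebuilt here). DataFrame has its own quantile method, so pct30 happens to accept the whole frame as well as a single column.

```python
import pandas as pd

dogs = pd.DataFrame({
    'Height (cm)': [56, 43, 46, 49, 59, 18, 77],
    'Weight (kg)': [25, 23, 22, 17, 29, 2, 74],
})

def pct30(column):
    return column.quantile(0.3)

# Called directly: this is DataFrame.quantile on the whole frame.
direct = pct30(dogs)

# Called through agg: this is Series.quantile, once per column.
via_agg = dogs.agg(pct30)

print(direct)
print(via_agg)
```

Both results hold the same two numbers, roughly 45.4 and 21.0, matching the output shown earlier in the thread.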
Let's try a different function. We import statistics and use its median function.
import pandas as pd
import statistics

def median(column):
    return statistics.median(column)

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))
Output:
height    140.0
weight     12.5
dtype: float64
Look what happens when we try to pass df as an argument to median.
print(median(df))
Error:
Traceback (most recent call last):
  File "agg_test.py", line 10, in <module>
    print(median(df))
  File "agg_test.py", line 6, in median
    return statistics.median(column)
  File "C:\Program Files\Python310\lib\statistics.py", line 457, in median
    return (data[i - 1] + data[i]) / 2
TypeError: unsupported operand type(s) for /: 'str' and 'int'
That's an interesting error. Iterating over a DataFrame yields its column labels rather than its values, so statistics.median ends up trying to average the strings 'height' and 'weight'; it doesn't know how to work with a DataFrame. When using df.agg(), agg splits the DataFrame into Series and passes them one at a time to the function. You can see that here.
def median(column):
    print(column)
    return statistics.median(column)

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})
print(df.agg(median))
Output:
0    100
1    120
2    160
3    200
Name: height, dtype: int64
0    10
1    20
2    15
3     8
Name: weight, dtype: int64
height    140.0
weight     12.5
dtype: float64
This allows using functions that are not "pandas aware". Functions that take an iterable as an argument should work.
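A small sketch of that difference, assuming the same height/weight DataFrame: iterating over the DataFrame itself yields column labels, whereas agg hands the function one Series of values at a time.

```python
import statistics

import pandas as pd

df = pd.DataFrame({"height": [100, 120, 160, 200], "weight": [10, 20, 15, 8]})

# Iterating a DataFrame directly yields its column labels, not its values.
labels = list(df)

# Iterating a single column (a Series) yields the values, which is what a
# plain-Python function such as statistics.median expects.
height_values = list(df["height"])

# agg passes each column to the function in turn.
medians = df.agg(statistics.median)

print(labels)
print(height_values)
print(medians)
```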
Posts: 136
Threads: 0
Joined: Jun 2019
And to answer
(Nov-07-2023, 06:33 PM)Mark17 Wrote: However, when I tried .agg([min, max, sum, mean]), I get:
The difference is the missing quotes in your code. Your code works if you write
.agg(['min', 'max', 'sum', 'mean'])
Note the quotes in this code.
Without quotes, .agg([min, max, sum, mean]) passes Python objects by name, so each name must already be defined. min, max, and sum are Python built-in functions, which is why your earlier version ran, but there is no built-in called mean, hence the NameError. With quotes, .agg(['min', 'max', 'mean']) tells Pandas to look up its own methods by name, as explained in snippsat's post.
Regards, noisefloor
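A minimal sketch of the quoted versus unquoted forms, assuming a cut-down version of the dogs data (only the Color and Weight (kg) columns). The variable names weights_by_color, summary, and error_message are illustrative.

```python
import pandas as pd

dogs = pd.DataFrame({
    'Color': ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White'],
    'Weight (kg)': [25, 23, 22, 17, 29, 2, 74],
})
weights_by_color = dogs.groupby('Color')['Weight (kg)']

# Quoted names are resolved by pandas itself, so 'mean' works.
summary = weights_by_color.agg(['min', 'max', 'sum', 'mean'])
print(summary)

# An unquoted name must already exist as a Python object.
# There is no built-in called mean, so Python raises NameError while
# building the list, before pandas is even called.
error_message = None
try:
    weights_by_color.agg([min, max, sum, mean])
except NameError as exc:
    error_message = str(exc)
    print(error_message)
```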