Posts: 35
Threads: 13
Joined: Mar 2017
Apr-07-2017, 03:23 PM
(This post was last modified: Apr-07-2017, 03:58 PM by nilamo.)
I hope you are all having a good day. I have a question in regards to the apply function. Below is a function I created and then I apply it to my data frame:
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)

Why, when I don't specify axis=1, does the code not correctly replace the null values in the Age column? I understand axis=1 means to apply it to the columns, however, I don't get the logic of how applying it to the rows (axis=0) doesn't work.
Moderator nilamo: Please use code tags in the future
Posts: 331
Threads: 2
Joined: Feb 2017
If you use df.apply(func), then func is applied to the columns of the dataframe. So in your case impute_age is applied first to the entire column "Age", then to the entire column "Pclass", and apply returns a series with only two elements: the first one is based on the first two values of "Age", the second one on the first two values of "Pclass". Inside impute_age, cols[0] and cols[1] are then the first two entries of a single column, not an Age/Pclass pair from one row.
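To make the difference concrete, here is a minimal sketch on a made-up four-row frame (the values are invented for illustration, not taken from the real data). It uses cols.iloc[...] instead of cols[...] as an assumption, so that the lookup is positional for both axis settings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real training data (values are made up).
train = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0, np.nan],
    "Pclass": [3, 1, 2, 2],
})

def impute_age(cols):
    # Positional access: first and second entry of whatever Series is passed in.
    age = cols.iloc[0]
    pclass = cols.iloc[1]
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age

# Default axis=0: impute_age is called once per column and sees a whole column.
by_column = train[["Age", "Pclass"]].apply(impute_age)

# axis=1: impute_age is called once per row and sees one (Age, Pclass) pair.
by_row = train[["Age", "Pclass"]].apply(impute_age, axis=1)

print(by_column)  # two entries, indexed by the column names
print(by_row)     # one entry per row of the frame
```

With the default axis the result has one entry per column, so it cannot be a replacement for the Age column; with axis=1 it has one entry per row.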
You don't need to use .apply for this imputing - you can do it directly by assigning. Either one by one:
train.loc[train.Age.isnull() & (train.Pclass == 1), 'Age'] = 37
train.loc[train.Age.isnull() & (train.Pclass == 2), 'Age'] = 29
train.loc[train.Age.isnull(), 'Age'] = 24  # everything still missing after the two lines above
or using something more complicated like nested np.where:
train.loc[train.Age.isnull(), 'Age'] = np.where(train.Pclass==1, 37, np.where(train.Pclass==2, 29, 24))[train.Age.isnull()]
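As a rough, self-contained sketch of the nested np.where idea, the snippet below runs it on a small invented frame. The names missing, fill_values and imputed are only illustrative, and the work is done on a copy so the toy data stays untouched.

```python
import numpy as np
import pandas as pd

# Made-up sample with missing ages in every passenger class.
train = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, np.nan, 35.0],
    "Pclass": [3, 1, 2, 3, 1],
})

# Boolean mask of the rows that need a value.
missing = train["Age"].isnull()

# Candidate fill value for every row, chosen by class: 1 -> 37, 2 -> 29, else 24.
fill_values = np.where(train["Pclass"] == 1, 37,
                       np.where(train["Pclass"] == 2, 29, 24))

# Assign only where Age is missing; existing ages are left alone.
imputed = train.copy()
imputed.loc[missing, "Age"] = fill_values[missing.to_numpy()]

print(imputed)
```

Going through .loc with a mask writes into the frame itself, which sidesteps the chained-assignment warning pandas may raise for the train.Age[mask] = ... form.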
Posts: 35
Threads: 13
Joined: Mar 2017
(Apr-07-2017, 06:13 PM)zivoni Wrote: If you use df.apply(func) , then func is applied to a columns of dataframe. [...]
Thank you for the response. However, I am still confused on what is happening if axis=0. Can you dumb it down for me please?
Posts: 331
Threads: 2
Joined: Feb 2017
df.apply(func, axis=0) is exactly the same as df.apply(func) - the default value for axis is 0. As I mentioned in my previous post, in this case .apply applies the function func to entire columns. A simple example, with func printing its argument and some information about it:
Output:
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[5,6,7]})
In [3]: df
Out[3]:
   a  b
0  1  5
1  2  6
2  3  7
In [4]: def func(s):
   ...:     print("=== type: {}, shape: {}".format(type(s), s.shape))
   ...:     print(s)
   ...:
In [5]: apply_result = df.apply(func)
=== type: <class 'pandas.core.series.Series'>, shape: (3,)
0    1
1    2
2    3
Name: a, dtype: int64
=== type: <class 'pandas.core.series.Series'>, shape: (3,)
0    5
1    6
2    7
Name: b, dtype: int64
In [6]: apply_result
Out[6]:
a    None
b    None
dtype: object
As you can see, func is applied to the column "a" first and then to the column "b". The result is a series whose index is the column index of the original dataframe, and it contains None values because func has no return statement.
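For comparison, a small sketch with a function that does return something, run on the same frame with both axis settings. The helper name total is just an illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})

def total(s):
    # s is one whole column (axis=0) or one whole row (axis=1).
    return s.sum()

# axis=0 (the default): one result per column, indexed by column name.
per_column = df.apply(total)

# axis=1: one result per row, indexed like the original frame.
per_row = df.apply(total, axis=1)

print(per_column)
print(per_row)
```

Only the axis=1 result lines up with the rows of the frame, which is why it is the one that can be assigned back as a column.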
Posts: 35
Threads: 13
Joined: Mar 2017
(Apr-08-2017, 08:44 AM)zivoni Wrote: df.apply(func, axis=0) is exactly same as df.apply(func) - default value for axis is 0. [...]
Thank you for your help.