Python Forum
Pandas + Groupby + Filter unique values - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Pandas + Groupby + Filter unique values (/thread-27646.html)



Pandas + Groupby + Filter unique values - JosepMaria - Jun-15-2020

I have this dataset:

df=pd.DataFrame({'user':[1,1,2,2,2,3,3,3,3,3,4,4],
                  'date':['1995-09-01','1995-09-02','1995-10-03','1995-10-04','1995-10-05','1995-11-07','1995-11-08','1995-11-09','1995-11-10','1995-11-15','1995-12-18','1995-12-20'],
                  'type':['a','b','a','x','b','a','x','a','x','b','a','b']})
Which gives me:

user    date    type
1   1995-09-01  a
1   1995-09-02  b
2   1995-10-03  a
2   1995-10-04  x
2   1995-10-05  b
3   1995-11-07  a
3   1995-11-08  x
3   1995-11-09  a
3   1995-11-10  x
3   1995-11-15  b
4   1995-12-18  a
4   1995-12-20  b
and I want to get a new column, with dates based on a values. When a user has more than an value type a value , the date of the oldest a value of this user should be selected to show on the new column.

For example, user 3 has several a values on the type column. Thus, on the a_type_date column, the eldest date for the a value is chosen.

Here is the desired output:

user    date      type   a_type_date
    1   1995-09-01  a     1995-09-01
    1   1995-09-02  b     1995-09-01
    2   1995-10-03  a     1995-10-03
    2   1995-10-04  x     1995-10-03
    2   1995-10-05  b     1995-10-03
    3   1995-11-07  a     1995-11-07
    3   1995-11-08  x     1995-11-07
    3   1995-11-09  a     1995-11-07
    3   1995-11-10  x     1995-11-07
    3   1995-11-15  b     1995-11-07
    4   1995-12-18  a     1995-12-18
    4   1995-12-20  b     1995-12-18
Any help will be highly appreciated.


RE: Pandas + Groupby + Filter unique values - JosepMaria - Jun-15-2020

1. First, convert the date column into a datetime() index:

df.date = pd.to_datetime(df.date)
2. Next use min() and list comprehension to create a new column containing your earliest (use max() for latest) type a dates per user as follows:

df['a_type_date'] = [min(df[(df.user==i)&(df.type=='a')]['date']).date() for i in df.user]