Pandas - cumulative sum of two columns

Pandas - cumulative sum of two columns - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Pandas - cumulative sum of two columns (/thread-18470.html)

Pages: 1 2

Pandas - cumulative sum of two columns - tobbs - May-19-2019

I want to calculate points for soccer teams. I have points for the hometeam and awayteam for each game and i want to get the sum for previous games. (not include the row i'm standing on) This is a part of the dataframe (i have more columns then displayed in the example):

 

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018, 1, 1],
["Malmo","Gothenburg",  2018, 1, 1],
["Malmo", "Gothenburg", 2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Gothenburg", "Malmo" ,2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Gothenburg", "Malmo", 2018, 0, 3],
["Malmo", "Gothenburg", 2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Malmo", "Gothenburg", 2018, 0, 3],
[ "Malmo","Gothenburg", 2018, 1, 1],
[ "Malmo", "Gothenburg",2018, 0, 3],
])

The desired output is:

        H_team      A_team  Year  H_points  A_points  H_cumsum  A_cumsum
0   Gothenburg       Malmo  2018         1         1       NaN       NaN
1        Malmo  Gothenburg  2018         1         1       1.0       1.0
2        Malmo  Gothenburg  2018         0         3       2.0       2.0
3   Gothenburg       Malmo  2018         1         1       5.0       2.0
4   Gothenburg       Malmo  2018         0         3       6.0       3.0
5   Gothenburg       Malmo  2018         1         1       6.0       6.0
6   Gothenburg       Malmo  2018         0         3       7.0       7.0
7        Malmo  Gothenburg  2018         0         3       10.0      7.0
8   Gothenburg       Malmo  2018         1         1       10.0      10.0
9        Malmo  Gothenburg  2018         0         3       11.0      11.0
10       Malmo  Gothenburg  2018         1         1       11.0      14.0
11       Malmo  Gothenburg  2018         0         3       12.0      15.0

I can calculate total points using stack, but since i have around 10 columns (all not displayed in this example), the result gets really messy.

df.columns = [['Team', 'Team', "Year", 'Points', 'Points'],
    ['Home', 'Away', 'Year', 'Home', 'Away']]

d1 = df.stack()
total = d1.groupby('Team').Points.apply(lambda x: x.shift().cumsum())
df = d1.assign(Total=total).unstack()

print(df)

   Points                  Team                  Year              Total           
     Away Home Year        Away        Home Year Away Home    Year  Away  Home Year
0     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   NaN   NaN  NaN
1     1.0  1.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   1.0   1.0  NaN
2     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   2.0   2.0  NaN
3     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   2.0   5.0  NaN
4     3.0  0.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   3.0   6.0  NaN
5     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   6.0   6.0  NaN
6     3.0  0.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   7.0   7.0  NaN
7     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   7.0  10.0  NaN
8     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0  10.0  10.0  NaN
9     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  11.0  11.0  NaN
10    1.0  1.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  14.0  11.0  NaN
11    3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  15.0  12.0  NaN

I have also tried groupby home- and awayteam, but that doesn't give me the total points:

df.columns = ['Hometeam', 'Awayteam', "Year", 'homepoints', 'awayponts']
df["Homepoints_tot"] = df.groupby(['Hometeam', "Year"])['homepoints'].transform(lambda x: x.cumsum().shift())
df["Awaypoints_tot"] = df.groupby(['Awayteam', "Year"])['awayponts'].transform(lambda x: x.cumsum().shift())

      Hometeam    Awayteam  Year  homepoints  awayponts  Homepoints_tot  Awaypoints_tot
0   Gothenburg       Malmo  2018           1          1             NaN             NaN
1        Malmo  Gothenburg  2018           1          1             NaN             NaN
2        Malmo  Gothenburg  2018           0          3             1.0             1.0
3   Gothenburg       Malmo  2018           1          1             1.0             1.0
4   Gothenburg       Malmo  2018           0          3             2.0             2.0
5   Gothenburg       Malmo  2018           1          1             2.0             5.0
6   Gothenburg       Malmo  2018           0          3             3.0             6.0
7        Malmo  Gothenburg  2018           0          3             1.0             4.0
8   Gothenburg       Malmo  2018           1          1             3.0             9.0
9        Malmo  Gothenburg  2018           0          3             1.0             7.0
10       Malmo  Gothenburg  2018           1          1             1.0            10.0
11       Malmo  Gothenburg  2018           0          3             2.0            11.0

Any ideas how to get the desired output?

RE: Pandas - cumulative sum of two columns - Yoriz - May-19-2019

I think you might have your desired output cumulative sums mixed up(or i don't understand), this output is not the same but i think it could be what you are after.

import pandas as pd

df = pd.DataFrame([
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Malmo", "Gothenburg",  2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Gothenburg", "Malmo", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Gothenburg", "Malmo", 2018, 0, 3],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Malmo", "Gothenburg", 2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
])

df.columns = ['H_team', 'A_team', 'Year', 'H_points', 'A_points']
df['H_cumsum'] = df['H_points'].cumsum().shift(1)
df['A_cumsum'] = df['A_points'].cumsum().shift(1)
print(df)

Output:        H_team      A_team  Year  H_points  A_points  H_cumsum  A_cumsum
0   Gothenburg       Malmo  2018         1         1       NaN       NaN
1        Malmo  Gothenburg  2018         1         1       1.0       1.0
2        Malmo  Gothenburg  2018         0         3       2.0       2.0
3   Gothenburg       Malmo  2018         1         1       2.0       5.0
4   Gothenburg       Malmo  2018         0         3       3.0       6.0
5   Gothenburg       Malmo  2018         1         1       3.0       9.0
6   Gothenburg       Malmo  2018         0         3       4.0      10.0
7        Malmo  Gothenburg  2018         0         3       4.0      13.0
8   Gothenburg       Malmo  2018         1         1       4.0      16.0
9        Malmo  Gothenburg  2018         0         3       5.0      17.0
10       Malmo  Gothenburg  2018         1         1       5.0      20.0
11       Malmo  Gothenburg  2018         0         3       6.0      21.0

RE: Pandas - cumulative sum of two columns - tobbs - May-19-2019

No that's not correct, the desired output is what i described at top. It's a score table for soccer teams. The team get 1 point for draw and 3 for victory. Therefor the cumlative score for Gothenburg at row 11 should be: 1+1+3+1+0+1+0+3+1+3+1 = 15 (shifted 1 row, meaning don't include the 3 points at row 11)

RE: Pandas - cumulative sum of two columns - Yoriz - May-19-2019

Ok, i only know little about pandas and nothing about soccer Huh

. good luck finding a solution Smile

RE: Pandas - cumulative sum of two columns - perfringo - May-19-2019

Pandas is not my cup of tea but isn’t there pandas.cumsum?

RE: Pandas - cumulative sum of two columns - tobbs - May-19-2019

Yea, i use cumsum but i can only get either the home points or the away points, like this:

df['H_cumsum'] = df['H_points'].cumsum().shift(1)
df['A_cumsum'] = df['A_points'].cumsum().shift(1)

I just don't know how to add the home points with the away points.

RE: Pandas - cumulative sum of two columns - tobbs - May-24-2019

Okey a different approach for the same problem. I have the same results but instead of points i have a value for either home win (H), draw (D) or away win (A). A win is 3 points for the winning team and draw is 1 point to both teams. I need the cumulative points for both the home and the away team.

This is the dataframe:

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018,"A"],
["Malmo","Gothenburg",  2018, "D"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Gothenburg", "Malmo" ,2018, "D"],
["Gothenburg", "Malmo", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "H"],
[ "Malmo","Gothenburg", 2018, "D"],
[ "Malmo", "Gothenburg",2018, "H"],

])

And this is desired output:

      Hometeam    Awayteam  Year Tecken H_points A_points
0   Gothenburg       Malmo  2018      D      Nan      Nan
1        Malmo  Gothenburg  2018      D        1        1
2        Malmo  Gothenburg  2018      A        2        2
3   Gothenburg       Malmo  2018      D        5        2
4   Gothenburg       Malmo  2018      A        6        3
5   Gothenburg       Malmo  2018      D        6        6
6   Gothenburg       Malmo  2018      A        7        7
7        Malmo  Gothenburg  2018      A       10        7
8   Gothenburg       Malmo  2018      D       10       10
9        Malmo  Gothenburg  2018      A       11       11
10       Malmo  Gothenburg  2018      D       11       14
11       Malmo  Gothenburg  2018      A       12       15

Im losing my hair over this, hope someone can give me a hand :)

RE: Pandas - cumulative sum of two columns - michalmonday - May-24-2019

I know it's ugly and I'm not experienced with pandas but it works:p

import pandas as pd

def get_team_names(df):
    return df[0][0], df[1][0]


df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018,"A"],
["Malmo","Gothenburg",  2018, "D"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Gothenburg", "Malmo" ,2018, "D"],
["Gothenburg", "Malmo", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "H"],
[ "Malmo","Gothenburg", 2018, "D"],
[ "Malmo", "Gothenburg",2018, "H"],
 
])

team_1, team_2 = get_team_names(df)

df.columns = ['H_team', 'A_team', 'Year', 'Result']

draws = df['Result']=='D'

hh = (df['Result']=='H').where(df['H_team'] == team_1) * 3
ha = (df['Result']=='A').where(df['A_team'] == team_1) * 3

ah = (df['Result']=='H').where(df['H_team'] == team_2) * 3
aa = (df['Result']=='A').where(df['A_team'] == team_2) * 3

team_1_points = hh.add(ha, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)
team_2_points = ah.add(aa, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)

df['H_points'] = team_1_points.where(df['H_team'] == team_1, team_2_points)
df['A_points'] = team_1_points.where(df['A_team'] == team_1, team_2_points)

print(df)

Output:        H_team      A_team  Year Result  H_points  A_points
0   Gothenburg       Malmo  2018      A       NaN       NaN
1        Malmo  Gothenburg  2018      D       3.0       0.0
2        Malmo  Gothenburg  2018      A       4.0       1.0
3   Gothenburg       Malmo  2018      H       4.0       4.0
4   Gothenburg       Malmo  2018      D       7.0       4.0
5   Gothenburg       Malmo  2018      A       8.0       5.0
6   Gothenburg       Malmo  2018      H       8.0       8.0
7        Malmo  Gothenburg  2018      A       8.0      11.0
8   Gothenburg       Malmo  2018      H      14.0       8.0
9        Malmo  Gothenburg  2018      H       8.0      17.0
10       Malmo  Gothenburg  2018      D      11.0      17.0
11       Malmo  Gothenburg  2018      H      12.0      18.0

RE: Pandas - cumulative sum of two columns - tobbs - May-25-2019

Yes, that works! Thank you so much, i've been struggling with this one.

I need to make a change though, the points need to be reset every year since i'm reading data from 4-5 years back. Is there a way to either group by year or maybe use reset so that the points starts at zero for a new year+ 2016, 2017, 2018 and so on...

RE: Pandas - cumulative sum of two columns - michalmonday - May-25-2019

Glad I could help, btw I think it would be important to make sure that the 'Year' value actually groups games from particular season, not strictly a year.

import pandas as pd

def get_team_names(df):
    return df[0][0], df[1][0]

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2010,"A"],
["Malmo","Gothenburg",  2010, "D"],
["Malmo", "Gothenburg", 2010, "A"],
["Gothenburg", "Malmo", 2010, "H"],
["Gothenburg", "Malmo" ,2011, "D"],
["Gothenburg", "Malmo", 2011, "A"],
["Gothenburg", "Malmo", 2011, "H"],
["Malmo", "Gothenburg", 2011, "A"],
["Gothenburg", "Malmo", 2012, "H"],
["Malmo", "Gothenburg", 2012, "H"],
[ "Malmo","Gothenburg", 2012, "D"],
[ "Malmo", "Gothenburg",2012, "H"],
["Malmo", "Gothenburg", 2013, "H"],
[ "Malmo","Gothenburg", 2013, "D"],
[ "Malmo", "Gothenburg",2013, "H"],
 
])

team_1, team_2 = get_team_names(df)

df.columns = ['H_team', 'A_team', 'Year', 'Result']

df['H_points'] = 0
df['A_points'] = 0

for year in set(df['Year']):

    draws = (df['Result']=='D').where(df['Year'] == year)

    hh = (df['Result']=='H').where((df['H_team'] == team_1) & (df['Year'] == year)) * 3
    ha = (df['Result']=='A').where((df['A_team'] == team_1) & (df['Year'] == year)) * 3

    ah = (df['Result']=='H').where((df['H_team'] == team_2) & (df['Year'] == year)) * 3
    aa = (df['Result']=='A').where((df['A_team'] == team_2) & (df['Year'] == year)) * 3

    team_1_points = hh.add(ha, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)
    team_2_points = ah.add(aa, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)

    df['H_points'] = df['H_points'].add( team_1_points.where(df['H_team'] == team_1, team_2_points), fill_value=0)
    df['A_points'] = df['A_points'].add( team_1_points.where(df['A_team'] == team_1, team_2_points), fill_value=0)

    # set first game of season to 0 points - https://stackoverflow.com/a/37911231/4620679
    # otherwise it would be equal to the sum of points from previous season (due to ".shift(1)" method)
    df.loc[df.index[df['Year'] == year][:1], 'H_points'] = 0 
    df.loc[df.index[df['Year'] == year][:1], 'A_points'] = 0

print(df)

Output:        H_team      A_team  Year Result  H_points  A_points
0   Gothenburg       Malmo  2010      A       0.0       0.0
1        Malmo  Gothenburg  2010      D       3.0       0.0
2        Malmo  Gothenburg  2010      A       4.0       1.0
3   Gothenburg       Malmo  2010      H       4.0       4.0
4   Gothenburg       Malmo  2011      D       0.0       0.0
5   Gothenburg       Malmo  2011      A       1.0       1.0
6   Gothenburg       Malmo  2011      H       1.0       4.0
7        Malmo  Gothenburg  2011      A       4.0       4.0
8   Gothenburg       Malmo  2012      H       0.0       0.0
9        Malmo  Gothenburg  2012      H       0.0       3.0
10       Malmo  Gothenburg  2012      D       3.0       3.0
11       Malmo  Gothenburg  2012      H       4.0       4.0
12       Malmo  Gothenburg  2013      H       0.0       0.0
13       Malmo  Gothenburg  2013      D       3.0       0.0
14       Malmo  Gothenburg  2013      H       4.0       1.0