Pandas - cumulative sum of two columns

tobbs · May-19-2019, 09:29 AM

I want to calculate points for soccer teams. I have points for the hometeam and awayteam for each game and i want to get the sum for previous games. (not include the row i'm standing on) This is a part of the dataframe (i have more columns then displayed in the example):

 

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018, 1, 1],
["Malmo","Gothenburg",  2018, 1, 1],
["Malmo", "Gothenburg", 2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Gothenburg", "Malmo" ,2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Gothenburg", "Malmo", 2018, 0, 3],
["Malmo", "Gothenburg", 2018, 0, 3],
["Gothenburg", "Malmo", 2018, 1, 1],
["Malmo", "Gothenburg", 2018, 0, 3],
[ "Malmo","Gothenburg", 2018, 1, 1],
[ "Malmo", "Gothenburg",2018, 0, 3],
])

The desired output is:

        H_team      A_team  Year  H_points  A_points  H_cumsum  A_cumsum
0   Gothenburg       Malmo  2018         1         1       NaN       NaN
1        Malmo  Gothenburg  2018         1         1       1.0       1.0
2        Malmo  Gothenburg  2018         0         3       2.0       2.0
3   Gothenburg       Malmo  2018         1         1       5.0       2.0
4   Gothenburg       Malmo  2018         0         3       6.0       3.0
5   Gothenburg       Malmo  2018         1         1       6.0       6.0
6   Gothenburg       Malmo  2018         0         3       7.0       7.0
7        Malmo  Gothenburg  2018         0         3       10.0      7.0
8   Gothenburg       Malmo  2018         1         1       10.0      10.0
9        Malmo  Gothenburg  2018         0         3       11.0      11.0
10       Malmo  Gothenburg  2018         1         1       11.0      14.0
11       Malmo  Gothenburg  2018         0         3       12.0      15.0

I can calculate total points using stack, but since i have around 10 columns (all not displayed in this example), the result gets really messy.

df.columns = [['Team', 'Team', "Year", 'Points', 'Points'],
    ['Home', 'Away', 'Year', 'Home', 'Away']]

d1 = df.stack()
total = d1.groupby('Team').Points.apply(lambda x: x.shift().cumsum())
df = d1.assign(Total=total).unstack()

print(df)

   Points                  Team                  Year              Total           
     Away Home Year        Away        Home Year Away Home    Year  Away  Home Year
0     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   NaN   NaN  NaN
1     1.0  1.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   1.0   1.0  NaN
2     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   2.0   2.0  NaN
3     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   2.0   5.0  NaN
4     3.0  0.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   3.0   6.0  NaN
5     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   6.0   6.0  NaN
6     3.0  0.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0   7.0   7.0  NaN
7     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0   7.0  10.0  NaN
8     1.0  1.0  NaN       Malmo  Gothenburg  NaN  NaN  NaN  2018.0  10.0  10.0  NaN
9     3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  11.0  11.0  NaN
10    1.0  1.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  14.0  11.0  NaN
11    3.0  0.0  NaN  Gothenburg       Malmo  NaN  NaN  NaN  2018.0  15.0  12.0  NaN

I have also tried groupby home- and awayteam, but that doesn't give me the total points:

df.columns = ['Hometeam', 'Awayteam', "Year", 'homepoints', 'awayponts']
df["Homepoints_tot"] = df.groupby(['Hometeam', "Year"])['homepoints'].transform(lambda x: x.cumsum().shift())
df["Awaypoints_tot"] = df.groupby(['Awayteam', "Year"])['awayponts'].transform(lambda x: x.cumsum().shift())

      Hometeam    Awayteam  Year  homepoints  awayponts  Homepoints_tot  Awaypoints_tot
0   Gothenburg       Malmo  2018           1          1             NaN             NaN
1        Malmo  Gothenburg  2018           1          1             NaN             NaN
2        Malmo  Gothenburg  2018           0          3             1.0             1.0
3   Gothenburg       Malmo  2018           1          1             1.0             1.0
4   Gothenburg       Malmo  2018           0          3             2.0             2.0
5   Gothenburg       Malmo  2018           1          1             2.0             5.0
6   Gothenburg       Malmo  2018           0          3             3.0             6.0
7        Malmo  Gothenburg  2018           0          3             1.0             4.0
8   Gothenburg       Malmo  2018           1          1             3.0             9.0
9        Malmo  Gothenburg  2018           0          3             1.0             7.0
10       Malmo  Gothenburg  2018           1          1             1.0            10.0
11       Malmo  Gothenburg  2018           0          3             2.0            11.0

Any ideas how to get the desired output?

**Yoriz** · (This post was last modified: May-19-2019, 10:48 AM by Yoriz.)

I think you might have your desired output cumulative sums mixed up(or i don't understand), this output is not the same but i think it could be what you are after.

import pandas as pd

df = pd.DataFrame([
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Malmo", "Gothenburg",  2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Gothenburg", "Malmo", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Gothenburg", "Malmo", 2018, 0, 3],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Gothenburg", "Malmo", 2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
    ["Malmo", "Gothenburg", 2018, 1, 1],
    ["Malmo", "Gothenburg", 2018, 0, 3],
])

df.columns = ['H_team', 'A_team', 'Year', 'H_points', 'A_points']
df['H_cumsum'] = df['H_points'].cumsum().shift(1)
df['A_cumsum'] = df['A_points'].cumsum().shift(1)
print(df)

Output:        H_team      A_team  Year  H_points  A_points  H_cumsum  A_cumsum
0   Gothenburg       Malmo  2018         1         1       NaN       NaN
1        Malmo  Gothenburg  2018         1         1       1.0       1.0
2        Malmo  Gothenburg  2018         0         3       2.0       2.0
3   Gothenburg       Malmo  2018         1         1       2.0       5.0
4   Gothenburg       Malmo  2018         0         3       3.0       6.0
5   Gothenburg       Malmo  2018         1         1       3.0       9.0
6   Gothenburg       Malmo  2018         0         3       4.0      10.0
7        Malmo  Gothenburg  2018         0         3       4.0      13.0
8   Gothenburg       Malmo  2018         1         1       4.0      16.0
9        Malmo  Gothenburg  2018         0         3       5.0      17.0
10       Malmo  Gothenburg  2018         1         1       5.0      20.0
11       Malmo  Gothenburg  2018         0         3       6.0      21.0

tobbs · May-19-2019, 11:43 AM

No that's not correct, the desired output is what i described at top. It's a score table for soccer teams. The team get 1 point for draw and 3 for victory. Therefor the cumlative score for Gothenburg at row 11 should be: 1+1+3+1+0+1+0+3+1+3+1 = 15 (shifted 1 row, meaning don't include the 3 points at row 11)

**Yoriz** · May-19-2019, 11:55 AM

Ok, i only know little about pandas and nothing about soccer Huh

. good luck finding a solution Smile

**perfringo** · May-19-2019, 12:16 PM

Pandas is not my cup of tea but isn’t there pandas.cumsum?

tobbs · (This post was last modified: May-19-2019, 01:35 PM by tobbs.)

Yea, i use cumsum but i can only get either the home points or the away points, like this:

df['H_cumsum'] = df['H_points'].cumsum().shift(1)
df['A_cumsum'] = df['A_points'].cumsum().shift(1)

I just don't know how to add the home points with the away points.

tobbs · (This post was last modified: May-24-2019, 09:23 PM by tobbs.)

Okey a different approach for the same problem. I have the same results but instead of points i have a value for either home win (H), draw (D) or away win (A). A win is 3 points for the winning team and draw is 1 point to both teams. I need the cumulative points for both the home and the away team.

This is the dataframe:

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018,"A"],
["Malmo","Gothenburg",  2018, "D"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Gothenburg", "Malmo" ,2018, "D"],
["Gothenburg", "Malmo", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "H"],
[ "Malmo","Gothenburg", 2018, "D"],
[ "Malmo", "Gothenburg",2018, "H"],

])

And this is desired output:

      Hometeam    Awayteam  Year Tecken H_points A_points
0   Gothenburg       Malmo  2018      D      Nan      Nan
1        Malmo  Gothenburg  2018      D        1        1
2        Malmo  Gothenburg  2018      A        2        2
3   Gothenburg       Malmo  2018      D        5        2
4   Gothenburg       Malmo  2018      A        6        3
5   Gothenburg       Malmo  2018      D        6        6
6   Gothenburg       Malmo  2018      A        7        7
7        Malmo  Gothenburg  2018      A       10        7
8   Gothenburg       Malmo  2018      D       10       10
9        Malmo  Gothenburg  2018      A       11       11
10       Malmo  Gothenburg  2018      D       11       14
11       Malmo  Gothenburg  2018      A       12       15

Im losing my hair over this, hope someone can give me a hand :)

michalmonday · May-24-2019, 10:55 PM

I know it's ugly and I'm not experienced with pandas but it works:p

import pandas as pd

def get_team_names(df):
    return df[0][0], df[1][0]


df  = pd.DataFrame([
["Gothenburg", "Malmo", 2018,"A"],
["Malmo","Gothenburg",  2018, "D"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Gothenburg", "Malmo" ,2018, "D"],
["Gothenburg", "Malmo", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "A"],
["Gothenburg", "Malmo", 2018, "H"],
["Malmo", "Gothenburg", 2018, "H"],
[ "Malmo","Gothenburg", 2018, "D"],
[ "Malmo", "Gothenburg",2018, "H"],
 
])

team_1, team_2 = get_team_names(df)

df.columns = ['H_team', 'A_team', 'Year', 'Result']

draws = df['Result']=='D'

hh = (df['Result']=='H').where(df['H_team'] == team_1) * 3
ha = (df['Result']=='A').where(df['A_team'] == team_1) * 3

ah = (df['Result']=='H').where(df['H_team'] == team_2) * 3
aa = (df['Result']=='A').where(df['A_team'] == team_2) * 3

team_1_points = hh.add(ha, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)
team_2_points = ah.add(aa, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)

df['H_points'] = team_1_points.where(df['H_team'] == team_1, team_2_points)
df['A_points'] = team_1_points.where(df['A_team'] == team_1, team_2_points)

print(df)

Output:        H_team      A_team  Year Result  H_points  A_points
0   Gothenburg       Malmo  2018      A       NaN       NaN
1        Malmo  Gothenburg  2018      D       3.0       0.0
2        Malmo  Gothenburg  2018      A       4.0       1.0
3   Gothenburg       Malmo  2018      H       4.0       4.0
4   Gothenburg       Malmo  2018      D       7.0       4.0
5   Gothenburg       Malmo  2018      A       8.0       5.0
6   Gothenburg       Malmo  2018      H       8.0       8.0
7        Malmo  Gothenburg  2018      A       8.0      11.0
8   Gothenburg       Malmo  2018      H      14.0       8.0
9        Malmo  Gothenburg  2018      H       8.0      17.0
10       Malmo  Gothenburg  2018      D      11.0      17.0
11       Malmo  Gothenburg  2018      H      12.0      18.0

tobbs · May-25-2019, 07:46 AM

Yes, that works! Thank you so much, i've been struggling with this one.

I need to make a change though, the points need to be reset every year since i'm reading data from 4-5 years back. Is there a way to either group by year or maybe use reset so that the points starts at zero for a new year+ 2016, 2017, 2018 and so on...

michalmonday · May-25-2019, 10:25 AM

Glad I could help, btw I think it would be important to make sure that the 'Year' value actually groups games from particular season, not strictly a year.

import pandas as pd

def get_team_names(df):
    return df[0][0], df[1][0]

df  = pd.DataFrame([
["Gothenburg", "Malmo", 2010,"A"],
["Malmo","Gothenburg",  2010, "D"],
["Malmo", "Gothenburg", 2010, "A"],
["Gothenburg", "Malmo", 2010, "H"],
["Gothenburg", "Malmo" ,2011, "D"],
["Gothenburg", "Malmo", 2011, "A"],
["Gothenburg", "Malmo", 2011, "H"],
["Malmo", "Gothenburg", 2011, "A"],
["Gothenburg", "Malmo", 2012, "H"],
["Malmo", "Gothenburg", 2012, "H"],
[ "Malmo","Gothenburg", 2012, "D"],
[ "Malmo", "Gothenburg",2012, "H"],
["Malmo", "Gothenburg", 2013, "H"],
[ "Malmo","Gothenburg", 2013, "D"],
[ "Malmo", "Gothenburg",2013, "H"],
 
])

team_1, team_2 = get_team_names(df)

df.columns = ['H_team', 'A_team', 'Year', 'Result']

df['H_points'] = 0
df['A_points'] = 0

for year in set(df['Year']):

    draws = (df['Result']=='D').where(df['Year'] == year)

    hh = (df['Result']=='H').where((df['H_team'] == team_1) & (df['Year'] == year)) * 3
    ha = (df['Result']=='A').where((df['A_team'] == team_1) & (df['Year'] == year)) * 3

    ah = (df['Result']=='H').where((df['H_team'] == team_2) & (df['Year'] == year)) * 3
    aa = (df['Result']=='A').where((df['A_team'] == team_2) & (df['Year'] == year)) * 3

    team_1_points = hh.add(ha, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)
    team_2_points = ah.add(aa, fill_value=0).add(draws, fill_value=0).cumsum().shift(1)

    df['H_points'] = df['H_points'].add( team_1_points.where(df['H_team'] == team_1, team_2_points), fill_value=0)
    df['A_points'] = df['A_points'].add( team_1_points.where(df['A_team'] == team_1, team_2_points), fill_value=0)

    # set first game of season to 0 points - https://stackoverflow.com/a/37911231/4620679
    # otherwise it would be equal to the sum of points from previous season (due to ".shift(1)" method)
    df.loc[df.index[df['Year'] == year][:1], 'H_points'] = 0 
    df.loc[df.index[df['Year'] == year][:1], 'A_points'] = 0

print(df)

Output:        H_team      A_team  Year Result  H_points  A_points
0   Gothenburg       Malmo  2010      A       0.0       0.0
1        Malmo  Gothenburg  2010      D       3.0       0.0
2        Malmo  Gothenburg  2010      A       4.0       1.0
3   Gothenburg       Malmo  2010      H       4.0       4.0
4   Gothenburg       Malmo  2011      D       0.0       0.0
5   Gothenburg       Malmo  2011      A       1.0       1.0
6   Gothenburg       Malmo  2011      H       1.0       4.0
7        Malmo  Gothenburg  2011      A       4.0       4.0
8   Gothenburg       Malmo  2012      H       0.0       0.0
9        Malmo  Gothenburg  2012      H       0.0       3.0
10       Malmo  Gothenburg  2012      D       3.0       3.0
11       Malmo  Gothenburg  2012      H       4.0       4.0
12       Malmo  Gothenburg  2013      H       0.0       0.0
13       Malmo  Gothenburg  2013      D       3.0       0.0
14       Malmo  Gothenburg  2013      H       4.0       1.0

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	iretate over columns in df and calculate euclidean distance with one column in pandas	Pit292	0	3,386	May-09-2021, 06:46 PM Last Post: Pit292
	pandas.to_datetime: Combine data from 2 columns	ju21878436312	1	2,520	Feb-20-2021, 08:25 PM Last Post: perfringo
	Remove extra count columns created by pandas groupby	spyf8	1	2,810	Feb-10-2021, 09:19 AM Last Post: Naheed
	Pandas: summing columns conditional on the column labels	ddd2332	0	2,194	Sep-10-2020, 05:58 PM Last Post: ddd2332
	Difference of two columns in Pandas dataframe	zinho	2	3,460	Jun-17-2020, 03:36 PM Last Post: zinho
	Pandas dataframe columns collapsed in Spyder when printing	UniKlixX	2	4,896	Nov-04-2019, 07:00 AM Last Post: UniKlixX
	[pandas] How to re-arrange DataFrame columns	SriMekala	8	5,069	Jun-22-2019, 12:55 AM Last Post: scidam
	comparing two columns two different files in pandas	nuncio	0	2,455	Jun-06-2018, 01:04 PM Last Post: nuncio
	subtruction of columns in pandas	garikhgh0	3	2,752	May-14-2018, 10:31 AM Last Post: volcano63

Pandas - cumulative sum of two columns

User Panel Messages

Announcements