Python Forum
column grouping (sum)
#1
Dear Python Experts,

I am looking again at Daniel Breen's case study. In cell In [69] he writes a function
called convert_housing_data_to_quarters().

http://danielbreen.net/projects/housing_...ege_towns/

The data provides a column for every month from 1996-04 through 2016-08.
He loops over the data to consider only the columns from 2000q1 through 2016q3 and
aggregates them by quarter (taking the mean of each quarter's months). He does that in a
nested loop that looks like it deals specifically with q3 and q4 of 2016.

for year in range(2000, 2017):
    for quarter in range(1, 5):

        # the data ends before 2016q4, so stop there
        if quarter == 4 and year == 2016:
            break

        new_column_name = '{0}q{1}'.format(year, quarter)
        begin_month = (quarter - 1) * 3 + 1
        end_month = quarter * 3
        begin_column = '{0}-{1:02d}'.format(year, begin_month)
        end_column = '{0}-{1:02d}'.format(year, end_month)

        # 2016q3 is incomplete, so it is built from the last available months
        if quarter == 3 and year == 2016:
            new_column_name = '2016q3'
            begin_month = 6
            end_month = 8
            begin_column = '{0}-{1:02d}'.format(year, begin_month)
            end_column = '{0}-{1:02d}'.format(year, end_month)

        # slice the month columns for this quarter and average them
        data = housing_df.loc[:, begin_column:end_column]

        housing_df[new_column_name] = data.mean(axis=1)
I wonder if there is an easier way. I would rather fix the non-existing q3 and q4 problem
without a nested loop.

It would be really great if someone had a suggestion for how to do that.
Maybe there is a function I don't know yet.
#2
It's very difficult to read with all the formatting. When posting between the code tags, try pasting with "CTRL + SHIFT + V"; this should strip the formatting.
If it ain't broke, I just haven't gotten to it yet.
OS: Windows 10, openSuse 42.3, freeBSD 11, Raspian "Stretch"
Python 3.6.5, IDE: PyCharm 2018 Community Edition
#3
Yes, there are functions to do it more directly. A common way is to convert the given dataframe to a "narrow" format, where the column names become one variable, do some transformation there, and then convert it back to a "wide" format, where the content of one (or more) columns supplies the new column names. It has some similarity to Excel pivot tables.

There is pandas.melt(), which can be used to convert a dataframe from wide to narrow format, and .pivot() or .unstack() can be used to convert narrow back to wide (there are other functions too).
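As a quick, minimal illustration of what pandas.melt() does (a toy frame of my own, not from the case study): it turns column names into values of a "variable" column, one row per (id, column) pair.

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1], "2001-01": [10, 20], "2001-02": [30, 40]})
narrow = pd.melt(df, id_vars="id")
print(narrow)
#    id variable  value
# 0   0  2001-01     10
# 1   1  2001-01     20
# 2   0  2001-02     30
# 3   1  2001-02     40
```

Once the month names live in a column, reshaping them into quarters is just ordinary value manipulation plus a groupby.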

Unfortunately even with those functions it could look rather ugly:
Output:
In [214]: data = {"state":["New York", "California"], "region":["New York", "Los Angeles"],
     ...:         "2001-01":[123,345], "2001-02":[343,132], "2001-03":[63,423], "2001-04":[393,42]}

In [215]: df = pd.DataFrame(data, columns=["state", "region", "2001-01", "2001-02", "2001-03", "2001-04"])

In [216]: df
Out[216]:
        state       region  2001-01  2001-02  2001-03  2001-04
0    New York     New York      123      343       63      393
1  California  Los Angeles      345      132      423       42

In [217]: wide = df.drop(['state', 'region'], axis=1).reset_index()
     ...: melted = pd.melt(wide, id_vars='index')
     ...: melted.variable = melted.variable.apply(lambda x: "{}q{}".format(x[:4], (int(x[5:]) - 1) // 3 + 1))
     ...: grouped = melted.groupby(['index','variable']).sum().unstack()
     ...: grouped.columns = grouped.columns.get_level_values(1)
     ...: df[['state', 'region']].join(grouped)
Out[217]:
        state       region  2001q1  2001q2
0    New York     New York     529     393
1  California  Los Angeles     900      42
In reality it is not so bad:
  • cut part to modify
  • convert it to narrow
  • aggregate
  • convert to wide
  • join with remaining part
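The steps above can be wrapped into one small helper. This is a sketch under the assumption that every non-identifier column follows the YYYY-MM pattern; the function name and the id_cols parameter are my own, not from the case study.

```python
import pandas as pd

def monthly_to_quarterly(df, id_cols):
    """Sum YYYY-MM month columns into YYYYqN quarter columns."""
    # 1. cut off the part to modify (everything except the id columns)
    wide = df.drop(columns=id_cols).reset_index()
    # 2. convert it to narrow format: month names become values
    melted = pd.melt(wide, id_vars="index")
    # map each 'YYYY-MM' name to its quarter label, e.g. '2001-02' -> '2001q1'
    melted["variable"] = melted["variable"].map(
        lambda c: "{}q{}".format(c[:4], (int(c[5:]) - 1) // 3 + 1))
    # 3. aggregate within each (row, quarter) group
    grouped = melted.groupby(["index", "variable"]).sum().unstack()
    # 4. convert back to wide: keep only the quarter level of the column index
    grouped.columns = grouped.columns.get_level_values(1)
    # 5. join with the remaining (identifier) part
    return df[id_cols].join(grouped)
```

On the example frame above, monthly_to_quarterly(df, ["state", "region"]) produces the same 2001q1/2001q2 columns as the In [217] pipeline.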