Python Forum
column grouping (sum)
#1
Dear Python Experts,

I am looking again at Daniel Breen's case study. In cell In [69] he writes a function
called convert_housing_data_to_quarters().

http://danielbreen.net/projects/housing_...ege_towns/

The data provides a column for every month from 1996-04 through 2016-08.
He loops over the data to consider only the columns from 2000q1 through 2016q3 and
aggregates them by quarter (taking the mean of each quarter's months). He does that in a
nested loop that looks like it deals specifically with q3 and q4 of 2016.

for year in range(2000, 2017):
    for quarter in range(1, 5):

        # the data ends before 2016q4, so stop there
        if quarter == 4 and year == 2016:
            break

        new_column_name = '{0}q{1}'.format(year, quarter)
        begin_month = (quarter - 1) * 3 + 1
        end_month = quarter * 3
        begin_column = '{0}-{1:02d}'.format(year, begin_month)
        end_column = '{0}-{1:02d}'.format(year, end_month)

        # 2016q3 is incomplete, so it is built from the last available months
        if quarter == 3 and year == 2016:
            new_column_name = '2016q3'
            begin_month = 6
            end_month = 8
            begin_column = '{0}-{1:02d}'.format(year, begin_month)
            end_column = '{0}-{1:02d}'.format(year, end_month)

        # slice the month columns for this quarter and average them
        data = housing_df.loc[:, begin_column:end_column]

        housing_df[new_column_name] = data.mean(axis=1)
I wonder if there is an easier way. I would rather fix the non-existing q3 and q4 problem
without a nested loop.

It would be really great if someone had a suggestion for how to do that.
Maybe there is a function I don't know yet.
#2
It's very difficult to read with all the formatting. When posting between the code tags, try pasting with "CTRL + SHIFT + V"; this should strip the formatting.
If it ain't broke, I just haven't gotten to it yet.
OS: Windows 10, openSuse 42.3, freeBSD 11, Raspian "Stretch"
Python 3.6.5, IDE: PyCharm 2018 Community Edition
#3
Yes, there are functions to do it more directly. A common way is to convert the given dataframe to a "narrow" format, where the column names become one variable, do some transformation there, and then convert it back to a "wide" format, where the content of one (or more) columns supplies the new column names. It has some similarity to Excel pivot tables.

There is pandas.melt(), which can be used to convert a dataframe from wide to narrow format, and .pivot() or .unstack() can be used to convert narrow back to wide (there are other functions too).
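As a quick, minimal illustration of what pandas.melt() does (a toy frame of my own, not from the case study): it turns column names into values of a "variable" column, one row per (id, column) pair.

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1], "2001-01": [10, 20], "2001-02": [30, 40]})
narrow = pd.melt(df, id_vars="id")
print(narrow)
#    id variable  value
# 0   0  2001-01     10
# 1   1  2001-01     20
# 2   0  2001-02     30
# 3   1  2001-02     40
```

Once the month names live in a column, reshaping them into quarters is just ordinary value manipulation plus a groupby.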

Unfortunately even with those functions it could look rather ugly:
Output:
In [214]: data = {"state":["New York", "California"], "region":["New York", "Los Angeles"],
     ...:         "2001-01":[123,345], "2001-02":[343,132], "2001-03":[63,423], "2001-04":[393,42]}

In [215]: df = pd.DataFrame(data, columns=["state", "region", "2001-01", "2001-02", "2001-03", "2001-04"])

In [216]: df
Out[216]:
        state       region  2001-01  2001-02  2001-03  2001-04
0    New York     New York      123      343       63      393
1  California  Los Angeles      345      132      423       42

In [217]: wide = df.drop(['state', 'region'], axis=1).reset_index()
     ...: melted = pd.melt(wide, id_vars='index')
     ...: melted.variable = melted.variable.apply(lambda x: "{}q{}".format(x[:4], (int(x[5:]) - 1) // 3 + 1))
     ...: grouped = melted.groupby(['index','variable']).sum().unstack()
     ...: grouped.columns = grouped.columns.get_level_values(1)
     ...: df[['state', 'region']].join(grouped)
Out[217]:
        state       region  2001q1  2001q2
0    New York     New York     529     393
1  California  Los Angeles     900      42
In reality it is not so bad:
  • cut part to modify
  • convert it to narrow
  • aggregate
  • convert to wide
  • join with remaining part
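The steps above can be wrapped into one small helper. This is a sketch under the assumption that every non-identifier column follows the YYYY-MM pattern; the function name and the id_cols parameter are my own, not from the case study.

```python
import pandas as pd

def monthly_to_quarterly(df, id_cols):
    """Sum YYYY-MM month columns into YYYYqN quarter columns."""
    # 1. cut off the part to modify (everything except the id columns)
    wide = df.drop(columns=id_cols).reset_index()
    # 2. convert it to narrow format: month names become values
    melted = pd.melt(wide, id_vars="index")
    # map each 'YYYY-MM' name to its quarter label, e.g. '2001-02' -> '2001q1'
    melted["variable"] = melted["variable"].map(
        lambda c: "{}q{}".format(c[:4], (int(c[5:]) - 1) // 3 + 1))
    # 3. aggregate within each (row, quarter) group
    grouped = melted.groupby(["index", "variable"]).sum().unstack()
    # 4. convert back to wide: keep only the quarter level of the column index
    grouped.columns = grouped.columns.get_level_values(1)
    # 5. join with the remaining (identifier) part
    return df[id_cols].join(grouped)
```

On the example frame above, monthly_to_quarterly(df, ["state", "region"]) produces the same 2001q1/2001q2 columns as the In [217] pipeline.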