Python Forum
can anybody explain what such function doing - Printable Version




can anybody explain what such function doing - cdbs - Jan-20-2019

Hello,
I'm trying to understand a piece of code written by someone with 10+ years of coding experience.
He won a data science competition and published his code here:
https://github.com/drivendataorg/power-laws-forecasting/blob/master/1st%20Place/preprocess.py

Can anybody explain what this function is intended to do, please?
(Sorry if the question is confusing, I'm a newbie.)
def get_aggregates(df, TestTimestamp, period, target_col, cols, func_list, offset_name, col_values, group_cache, noval_name = ''):
  
#  prtime('cols = ', cols)
  start = time.time()

#  print('gagc = ', get_aggregates.gb_cache)
  if (tuple(cols), target_col, col_values) in group_cache:
    subset = group_cache[(tuple(cols), target_col, col_values)]
  else:
    if tuple(cols) in get_aggregates.gb_cache:
      gb = get_aggregates.gb_cache[tuple(cols)]
    else:
      if len(cols):
        gb = df.groupby(cols)[['Value', 'Temperature']]  # list of columns; a bare tuple here fails on pandas 2.x
        get_aggregates.gb_cache[tuple(cols)] = gb
    
    if len(cols):
      if col_values in gb.groups.keys():
        subset = gb.get_group(col_values)[target_col] #.set_index('Timestamp')  # Slice with the current values in the corresponding columns
      else:
        subset = df.iloc[0:0] # empty slice
    else:
      subset = df # No slicing, using all data
    group_cache[(tuple(cols), target_col, col_values)] = subset
  get_aggregates.times['get_group'] += time.time()-start
  start = time.time()



RE: can anybody explain what such function doing - Larz60+ - Jan-20-2019

First, read the script's comments, since the author knows more about his code than we do:
Output:
# Calculating historical aggregates
# df - source dataframe (train set)
# TestTimestamp - start of test period, no data at this point or beyond is used
# period - amount of time before TestTimestamp used to calculate aggregates
# target_col - column to calculate averages (can be Value, Temperature, ...)
# cols - columns to group by (i.e. we are getting aggregate values for the same values in these columns in the past)
# col_values - current values in these columns (for example, current time and day of week)
# Notice: in its current state this function relies on Timestamp values being sorted (ascending) within each group,
# so can't be used for aggregates over different SiteIds
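
As for the part of the function that was actually posted: it only selects the historical rows whose grouping columns match col_values, and it caches that lookup at two levels so that repeated calls are cheap. Below is a minimal sketch of that idea in plain Python, not the author's code. The name get_subset, the list-of-dicts data and the sample column names dow and hour are made up for illustration; a dict of lists stands in for what df.groupby(cols) and gb.get_group(col_values) do in pandas. Unlike the posted code, the sketch returns the target column in every branch, which is a simplification.

```python
import time
from collections import defaultdict


def get_subset(rows, target_col, cols, col_values, group_cache):
    """Return target_col values from the rows whose `cols` equal `col_values`."""
    start = time.time()
    key = (tuple(cols), target_col, col_values)
    if key not in group_cache:
        if cols:
            # Level 1: group the rows once per column combination and keep the
            # result on the function object, like get_aggregates.gb_cache.
            groups = get_subset.gb_cache.get(tuple(cols))
            if groups is None:
                groups = defaultdict(list)
                for row in rows:
                    groups[tuple(row[c] for c in cols)].append(row)
                get_subset.gb_cache[tuple(cols)] = groups
            # Plays the role of gb.get_group(col_values); empty if no match.
            matching = groups.get(col_values, [])
        else:
            matching = rows  # no grouping columns: use all rows
        # Level 2: remember the finished slice for this exact request.
        group_cache[key] = [row[target_col] for row in matching]
    # Accumulate the time spent in this step, like get_aggregates.times.
    get_subset.times["get_group"] += time.time() - start
    return group_cache[key]


# Function attributes act as storage that survives between calls.
get_subset.gb_cache = {}
get_subset.times = defaultdict(float)

# Hypothetical sample data: day of week, hour, and two measured columns.
rows = [
    {"dow": 0, "hour": 10, "Value": 1.0, "Temperature": 5.0},
    {"dow": 0, "hour": 11, "Value": 2.0, "Temperature": 6.0},
    {"dow": 0, "hour": 10, "Value": 3.0, "Temperature": 7.0},
]
cache = {}
print(get_subset(rows, "Value", ["dow", "hour"], (0, 10), cache))  # [1.0, 3.0]
print(get_subset(rows, "Value", ["dow", "hour"], (5, 10), cache))  # []
```

Note that the first-level cache is keyed only by the column names, not by the data, so (as in the posted code) it silently assumes every call is made with the same source data. The rest of the original function, which is cut off in the post, presumably goes on to compute the aggregates over the period before TestTimestamp, as the comments describe.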