Nov-27-2018, 05:06 AM
After reading a file I have a DataFrame with 6 columns named 'Spclty1' to 'Spclty6' and 6 corresponding columns named 'StartDt1' to 'StartDt6', along with other identifying info. Any or all of the Spclty/StartDt field pairs can be empty. Each row is one user but many users have more than 1 row (and can have more than 6 specialties). I want to consolidate this DataFrame into 1 row per user with a single column that contains all of each user's specialty codes and dates as key-value pairs.
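To illustrate what I'm after, here's a toy version with only 2 specialty pairs instead of 6 (column names as in my real data, values made up):

```python
import pandas as pd

# toy input: two rows for the same user, each with up to 2 Spclty/StartDt pairs
df = pd.DataFrame({
    'UserNbr':  [101, 101],
    'Spclty1':  ['A', 'C'],
    'StartDt1': ['2001-01-01', '2003-03-03'],
    'Spclty2':  ['B', None],       # second pair can be empty
    'StartDt2': [None, None],
})

# desired result: one row for user 101 with a single column holding
# all specialty codes and dates as key-value pairs, e.g.
# {'A': '2001-01-01', 'B': None, 'C': '2003-03-03'}
```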
I know better than to append to a DataFrame within a loop, so I'm appending to a list instead, then converting to a DataFrame at the end. My code is working -- it only takes 4.5 minutes on a 7 MB input file -- but it never completes on a 150 MB input file. It seems to get slower as the output list gets larger, which is contrary to my understanding that lists have little overhead. Am I doing something dumb, or is there an obscure issue at work?
import numpy as np
import pandas as pd

# concatenate 6 fields into 1 for both Spclty* and StartDt*:
df_tmp['Spclty'] = df_tmp[['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6']].values.tolist()
df_tmp['StartDt'] = df_tmp[['StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6']].values.tolist()

# remove missing values:
df_tmp['Spclty'] = [[x for x in df_tmp['Spclty'][row] if str(x) != 'nan'] for row in range(df_tmp.shape[0])]
df_tmp['StartDt'] = [[x for x in df_tmp['StartDt'][row] if str(x) != 'None'] for row in range(df_tmp.shape[0])]

# convert to a dict-like set:
df_tmp['Spclty'] = [set(zip(df_tmp['Spclty'][row], df_tmp['StartDt'][row])) for row in range(df_tmp.shape[0])]

# drop columns no longer needed:
df_tmp = df_tmp.drop(['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6',
                      'StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6'], axis = 1)

# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]

    # select rows with no RegEndDt:
    df_user_open = df_user.loc[np.isnat(df_user['RegEndDt'])]
    if df_user_open.shape[0] > 0:
        # have rows with empty RegEndDt, so treat these as current:
        open_combined = df_user_open.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user_open.shape[0]):      # union with any subsequent rows
            open_combined = open_combined.union(df_user_open.iloc[row, spcltycol])
        new_row = df_user_open.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
                      columns = ['FamilyName', 'GivenName', 'MiddleName', 'UserNbr',
                                 'AddressLine1', 'AddressLine2', 'AddressLine3', 'Postcode',
                                 'State', 'RegStartDt', 'RegEndDt', 'Title', 'Closed', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
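In case the bottleneck is the per-user boolean filter (`df_tmp['UserNbr'] == user` re-scans the whole frame on every iteration) rather than the list append, here's a stripped-down sketch of the same loop rewritten with `groupby`, which visits each user's rows only once. Same column names as above; I haven't run this against the real 150 MB file:

```python
import pandas as pd

def consolidate(df_tmp):
    """One output row per user, keeping only rows with an open registration.

    Assumes df_tmp already has the combined 'Spclty' column (a set of
    (code, date) tuples per row) and the 'StartDt' list column, as built above.
    """
    out_rows = []
    # groupby hands us each user's rows directly, instead of re-scanning
    # the whole frame with a boolean mask for every user
    for _, df_user in df_tmp.groupby('UserNbr', sort=False):
        df_user_open = df_user.loc[df_user['RegEndDt'].isna()]  # open registrations
        if len(df_user_open) > 0:
            # union all of this user's per-row specialty sets in one call
            open_combined = set().union(*df_user_open['Spclty'])
            new_row = df_user_open.drop(columns=['Spclty', 'StartDt']).iloc[0].tolist()
            new_row.append(dict(open_combined))
            out_rows.append(new_row)
    return out_rows
```

I used `.isna()` in place of `np.isnat` since it handles NaT the same way here; the row-by-row `.union` is also collapsed into a single `set().union(*...)` call.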