Slow loop due to row & column consolidation
#1
After reading a file I have a DataFrame with 6 columns named 'Spclty1' to 'Spclty6' and 6 corresponding columns named 'StartDt1' to 'StartDt6', along with other identifying info. Any or all of the Spclty/StartDt field pairs can be empty. Each row is one user but many users have more than 1 row (and can have more than 6 specialties). I want to consolidate this DataFrame into 1 row per user with a single column that contains all of each user's specialty codes and dates as key-value pairs.
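
To make the target shape concrete, here is a tiny made-up example (column names as in my data, values invented; only 2 of the 6 pairs shown):

import pandas as pd

# hypothetical input: two rows for the same user
wide = pd.DataFrame({
    'UserNbr':  [1001, 1001],
    'Spclty1':  ['A1', 'A1'],
    'StartDt1': ['2005-01-01', '2005-01-01'],
    'Spclty2':  ['B2', 'C3'],
    'StartDt2': ['2007-06-01', '2010-03-15'],
})
# desired output: one row for user 1001 whose 'Spclty' cell is
# {'A1': '2005-01-01', 'B2': '2007-06-01', 'C3': '2010-03-15'}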

I know better than to append to a DataFrame inside a loop, so I append to a list instead and convert it to a DataFrame at the end. My code works, but it takes 4.5 minutes on a 7 MB input file and never completes on a 150 MB one. It seems to get slower as the output list grows, which contradicts my understanding that appending to a list is cheap. Am I doing something dumb, or is there an obscure issue at work?

import numpy as np
import pandas as pd

# concatenate 6 fields into 1 for both Spclty* and StartDt*:
df_tmp['Spclty'] = df_tmp[['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6']].values.tolist()
df_tmp['StartDt'] = df_tmp[['StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6']].values.tolist()

# remove missing values:
df_tmp['Spclty'] = [[x for x in df_tmp['Spclty'][row] if str(x) != 'nan'] for row in range(df_tmp.shape[0])]
df_tmp['StartDt'] = [[x for x in df_tmp['StartDt'][row] if str(x) != 'None'] for row in range(df_tmp.shape[0])]

# convert to a dict-like set:
df_tmp['Spclty'] = [set(zip(df_tmp['Spclty'][row], df_tmp['StartDt'][row])) for row in range(df_tmp.shape[0])]

# drop columns no longer needed:
df_tmp = df_tmp.drop(['Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6',
                      'StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6'],
                     axis=1)

# loop over all UserNbr: 
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to new data frame
out_rows = list() 
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
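    # note: this boolean mask rescans every row of df_tmp on each pass,
    # so the loop does O(len(df_tmp)) work per user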
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    # select rows with no RegEndDt:
    df_user_open = df_user.loc[np.isnat(df_user['RegEndDt'])]
    if df_user_open.shape[0] > 0:   
        # have rows with empty RegEndDt, so treat these as current:
        open_combined = df_user_open.iloc[0, spcltycol]   # capture 1st row
        for row in range(1, df_user_open.shape[0]):       # union with any subsequent rows
            open_combined = open_combined.union(df_user_open.iloc[row, spcltycol])
        new_row = df_user_open.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, 
                      columns = ['FamilyName', 'GivenName', 'MiddleName', 
                                 'UserNbr',
                                 'AddressLine1', 'AddressLine2', 'AddressLine3',
                                 'Postcode', 'State', 
                                 'RegStartDt', 'RegEndDt', 'Title', 'Closed', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
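
If the per-user filtering above is the bottleneck (each pass rescans the whole frame, so the loop is roughly quadratic), a single groupby pass would avoid it. A minimal sketch, assuming the same column names as above and untested on the real data:

import pandas as pd

def consolidate(df_tmp):
    # keep only rows with an open registration (RegEndDt missing):
    df_open = df_tmp[df_tmp['RegEndDt'].isna()]
    # union each user's (Spclty, StartDt) sets in one pass over the data:
    spclty = (df_open.groupby('UserNbr')['Spclty']
                     .apply(lambda s: set().union(*s))
                     .reset_index())
    # take the identifying columns from each user's first open row:
    ids = (df_open.drop(columns=['Spclty', 'StartDt'])
                  .drop_duplicates(subset='UserNbr', keep='first'))
    out = ids.merge(spclty, on='UserNbr')
    # convert each set of (code, date) pairs to a dict:
    out['Spclty'] = out['Spclty'].map(dict)
    return out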