Nov-20-2018, 10:33 AM
(This post was last modified: Nov-20-2018, 10:47 AM by Gribouillis.)
The below function calc_time_diff calculates time difference between two timestamps and finally adds them up and returns the result. But it should ignore the overlapping timestamps and should calculate the net sum of all the time difference.
The existing function performs this using a loop but it is causing a huge performance issue when it accessing more records.
Is there anyway to tune this function or any other approach?
I have attached some of the sample data.
The existing function performs this using a loop but it is causing a huge performance issue when it accessing more records.
Is there anyway to tune this function or any other approach?
I have attached some of the sample data.
data = [{'start_timestamp': '2017-07-10 00:01:00', 'end_timestamp': '2017-07-10 00:43:00'}, {'start_timestamp': '2017-07-10 02:23:00', 'end_timestamp': '2017-07-10 05:30:00'}, {'start_timestamp': '2017-07-10 09:01:00', 'end_timestamp': '2017-07-10 16:30:00'}, {'start_timestamp': '2017-07-10 10:01:00', 'end_timestamp': '2017-07-10 11:25:00'}, {'start_timestamp': '2017-07-10 12:01:00', 'end_timestamp': '2017-07-10 13:45:00'}, {'start_timestamp': '2017-07-10 13:51:00', 'end_timestamp': '2017-07-10 14:26:00'}, {'start_timestamp': '2017-07-10 14:43:00', 'end_timestamp': '2017-07-12 05:00:00'}, {'start_timestamp': '2017-07-10 18:01:00', 'end_timestamp': '2017-07-10 18:45:00'}, {'start_timestamp': '2017-07-10 21:21:00', 'end_timestamp': '2017-07-10 22:46:00'}] df = pd.DataFrame(data) def calc_time_diff( df): df = df.dropna(how='any', subset=['start_timestamp', 'end_timestamp']) if df.empty: return pd.Series({'duration': np.nan}) #VSD Added on 15th Nov just to expedite testing of some other KPIs. The ultimate target is to optimize this function by replacing for loop with dataframe operations #VSD after testing following line needs to be commented or removed #return pd.Series({self.name: ((df['end_timestamp'][0] - df['start_timestamp'][0]) / np.timedelta64(1, 'h'))}) df = df.sort_values(by=['start_timestamp']) # logger.debug('input df=\n%s' % df.head(100)) new_start_timestamp = [] prev_endtime = None for index, row in df.iterrows(): if prev_endtime is None: new_start_timestamp.append(row['start_timestamp']) prev_endtime = row['end_timestamp'] continue if row['start_timestamp'] < prev_endtime: new_start_timestamp.append(prev_endtime) # df.loc[index, 'start_timestamp'] = prev_endtime else: new_start_timestamp.append(row['start_timestamp']) if prev_endtime < row['end_timestamp']: prev_endtime = row['end_timestamp'] df['start_timestamp'] = new_start_timestamp #df['new_start_timestamp'] = df['end_timestamp'].shift(1) #df['new_start_timestamp'].iloc[0] = df['start_timestamp'].iloc[0] #df['start_timestamp'] = np.where( df['new_start_timestamp'] > df['start_timestamp'], df['new_start_timestamp'], df['start_timestamp'] ) # logger.debug('result_df=\n%s' % df.head(100)) series = np.where(df['start_timestamp'] < df['end_timestamp'], ((df['end_timestamp'] - df['start_timestamp']) / np.timedelta64(1, 'h')).astype(float), 0) # logger.debug('result_series:%s' % series) return pd.Series({'duration': series.sum()})