Python Forum

Full Version: iterrows() performance issue
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
The below function calc_time_diff calculates time difference between two timestamps and finally adds them up and returns the result. But it should ignore the overlapping timestamps and should calculate the net sum of all the time difference.

The existing function performs this using a loop but it is causing a huge performance issue when it accessing more records.
Is there anyway to tune this function or any other approach?

I have attached some of the sample data.
data = [{'start_timestamp': '2017-07-10 00:01:00', 'end_timestamp': '2017-07-10 00:43:00'},
        {'start_timestamp': '2017-07-10 02:23:00', 'end_timestamp': '2017-07-10 05:30:00'}, 
        {'start_timestamp': '2017-07-10 09:01:00', 'end_timestamp': '2017-07-10 16:30:00'}, 
        {'start_timestamp': '2017-07-10 10:01:00', 'end_timestamp': '2017-07-10 11:25:00'},
        {'start_timestamp': '2017-07-10 12:01:00', 'end_timestamp': '2017-07-10 13:45:00'}, 
        {'start_timestamp': '2017-07-10 13:51:00', 'end_timestamp': '2017-07-10 14:26:00'}, 
        {'start_timestamp': '2017-07-10 14:43:00', 'end_timestamp': '2017-07-12 05:00:00'}, 
        {'start_timestamp': '2017-07-10 18:01:00', 'end_timestamp': '2017-07-10 18:45:00'}, 
        {'start_timestamp': '2017-07-10 21:21:00', 'end_timestamp': '2017-07-10 22:46:00'}]
df = pd.DataFrame(data)

def calc_time_diff( df):
        df = df.dropna(how='any', subset=['start_timestamp', 'end_timestamp'])
        if df.empty:
            return pd.Series({'duration': np.nan})
        
        #VSD Added on 15th Nov just to expedite testing of some other KPIs. The ultimate target is to optimize this function by replacing for loop with dataframe operations
        #VSD after testing following line needs to be commented or removed
        #return pd.Series({self.name: ((df['end_timestamp'][0] - df['start_timestamp'][0]) / np.timedelta64(1, 'h'))})
        
        df = df.sort_values(by=['start_timestamp'])

#         logger.debug('input df=\n%s' % df.head(100))
        new_start_timestamp = []
        prev_endtime = None
        for index, row in df.iterrows():
            if prev_endtime is None:
                new_start_timestamp.append(row['start_timestamp'])
                prev_endtime = row['end_timestamp']
                continue

            if row['start_timestamp'] < prev_endtime:
                new_start_timestamp.append(prev_endtime)
                # df.loc[index, 'start_timestamp'] = prev_endtime
            else:
                new_start_timestamp.append(row['start_timestamp'])

            if prev_endtime < row['end_timestamp']:
                prev_endtime = row['end_timestamp']

        df['start_timestamp'] = new_start_timestamp
        #df['new_start_timestamp'] = df['end_timestamp'].shift(1)
        #df['new_start_timestamp'].iloc[0] = df['start_timestamp'].iloc[0]
        #df['start_timestamp'] = np.where( df['new_start_timestamp'] > df['start_timestamp'], df['new_start_timestamp'], df['start_timestamp'] )

#         logger.debug('result_df=\n%s' % df.head(100))

        series = np.where(df['start_timestamp'] < df['end_timestamp'], 
            ((df['end_timestamp'] - df['start_timestamp']) / np.timedelta64(1, 'h')).astype(float), 0)

#         logger.debug('result_series:%s' % series)

        return pd.Series({'duration': series.sum()})