Python Forum
iterrows() performance issue
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
iterrows() performance issue
#1
The below function calc_time_diff calculates time difference between two timestamps and finally adds them up and returns the result. But it should ignore the overlapping timestamps and should calculate the net sum of all the time difference.

The existing function performs this using a loop but it is causing a huge performance issue when it accessing more records.
Is there anyway to tune this function or any other approach?

I have attached some of the sample data.
data = [{'start_timestamp': '2017-07-10 00:01:00', 'end_timestamp': '2017-07-10 00:43:00'},
        {'start_timestamp': '2017-07-10 02:23:00', 'end_timestamp': '2017-07-10 05:30:00'}, 
        {'start_timestamp': '2017-07-10 09:01:00', 'end_timestamp': '2017-07-10 16:30:00'}, 
        {'start_timestamp': '2017-07-10 10:01:00', 'end_timestamp': '2017-07-10 11:25:00'},
        {'start_timestamp': '2017-07-10 12:01:00', 'end_timestamp': '2017-07-10 13:45:00'}, 
        {'start_timestamp': '2017-07-10 13:51:00', 'end_timestamp': '2017-07-10 14:26:00'}, 
        {'start_timestamp': '2017-07-10 14:43:00', 'end_timestamp': '2017-07-12 05:00:00'}, 
        {'start_timestamp': '2017-07-10 18:01:00', 'end_timestamp': '2017-07-10 18:45:00'}, 
        {'start_timestamp': '2017-07-10 21:21:00', 'end_timestamp': '2017-07-10 22:46:00'}]
df = pd.DataFrame(data)

def calc_time_diff( df):
        df = df.dropna(how='any', subset=['start_timestamp', 'end_timestamp'])
        if df.empty:
            return pd.Series({'duration': np.nan})
        
        #VSD Added on 15th Nov just to expedite testing of some other KPIs. The ultimate target is to optimize this function by replacing for loop with dataframe operations
        #VSD after testing following line needs to be commented or removed
        #return pd.Series({self.name: ((df['end_timestamp'][0] - df['start_timestamp'][0]) / np.timedelta64(1, 'h'))})
        
        df = df.sort_values(by=['start_timestamp'])

#         logger.debug('input df=\n%s' % df.head(100))
        new_start_timestamp = []
        prev_endtime = None
        for index, row in df.iterrows():
            if prev_endtime is None:
                new_start_timestamp.append(row['start_timestamp'])
                prev_endtime = row['end_timestamp']
                continue

            if row['start_timestamp'] < prev_endtime:
                new_start_timestamp.append(prev_endtime)
                # df.loc[index, 'start_timestamp'] = prev_endtime
            else:
                new_start_timestamp.append(row['start_timestamp'])

            if prev_endtime < row['end_timestamp']:
                prev_endtime = row['end_timestamp']

        df['start_timestamp'] = new_start_timestamp
        #df['new_start_timestamp'] = df['end_timestamp'].shift(1)
        #df['new_start_timestamp'].iloc[0] = df['start_timestamp'].iloc[0]
        #df['start_timestamp'] = np.where( df['new_start_timestamp'] > df['start_timestamp'], df['new_start_timestamp'], df['start_timestamp'] )

#         logger.debug('result_df=\n%s' % df.head(100))

        series = np.where(df['start_timestamp'] < df['end_timestamp'], 
            ((df['end_timestamp'] - df['start_timestamp']) / np.timedelta64(1, 'h')).astype(float), 0)

#         logger.debug('result_series:%s' % series)

        return pd.Series({'duration': series.sum()})
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Python performance Jacques 4 2,285 Jun-23-2020, 01:03 PM
Last Post: GaryNR
  performance time Skaperen 2 2,219 Apr-13-2019, 08:29 PM
Last Post: Skaperen

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020