Mar-24-2021, 02:10 PM
Hi, I have a df like:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 10, 20], [1, 5, 8], [1, 5, 15], [1, 13, 14],
                            [1, 18, 21], [2, 2, 2], [1, 21, 100], [1, 1, 50]]),
                  columns=['id', 'start', 'stop'])
df['valid'] = True
print(f"{df} \n")

>>>
   id  start  stop  valid
0   1     10    20   True
1   1      5     8   True
2   1      5    15   True
3   1     13    14   True
4   1     18    21   True
5   2      2     2   True
6   1     21   100   True
7   1      1    50   True

To mark invalid entries, which have the same id and an overlap in time, I came up with this idea:
col2 = 'id'
col3 = 'start'
col4 = 'stop'
counter = 0
for index, row in df.iterrows():
    id = df.at[index, col2]
    start = df.at[index, col3]
    stop = df.at[index, col4]
    # 3 cases for overlapping time:
    # case 1: index.start <= stop <= index.stop
    # case 2: index.start <= start <= index.stop
    # case 3: start <= index.start and stop >= index.stop
    df_temp = df.query(f"{col2} == '{id}' and "
                       f"(({start} <= {col4} <= {stop}) or "
                       f"({start} <= {col3} <= {stop}) or "
                       f"({col3} <= {start} and {col4} >= {stop}))")
    # drop the current row itself from the matches
    df_temp = df_temp.drop([index])
    for other_index in df_temp.index:
        if df.loc[other_index, 'valid']:
            df.loc[other_index, 'valid'] = False
            counter += 1
print(f"Affected Rows: {counter}")

My solution takes approx. 6 s for 0.1 % of the data (326 rows), which extrapolates linearly to approx. 100 min for the full set. Is there a way to make this faster? I'd appreciate a hint. Thanks.
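Edit for reference: since a row ends up invalid exactly when it overlaps any other row with the same id, the pairwise loop can be replaced by sorting each id group by start and comparing every interval against the running maximum of earlier stops and the next row's start. This is only a sketch of that idea (`flag_overlaps` is a made-up name, and it assumes inclusive bounds like the query above):

```python
import numpy as np
import pandas as pd

def flag_overlaps(df):
    """True where a row overlaps no other row sharing its id (inclusive bounds)."""
    d = df.sort_values(['id', 'start', 'stop'])
    g = d.groupby('id')
    # Largest stop among earlier-starting rows of the same id.
    prev_max_stop = g['stop'].transform(lambda s: s.cummax().shift())
    # Start of the next row of the same id (rows are sorted by start).
    next_start = g['start'].shift(-1)
    # Overlaps an earlier or a later interval; comparisons with NaN are False.
    invalid = (d['start'] <= prev_max_stop) | (d['stop'] >= next_start)
    return (~invalid).reindex(df.index)

df = pd.DataFrame(np.array([[1, 10, 20], [1, 5, 8], [1, 5, 15], [1, 13, 14],
                            [1, 18, 21], [2, 2, 2], [1, 21, 100], [1, 1, 50]]),
                  columns=['id', 'start', 'stop'])
df['valid'] = flag_overlaps(df)
```

On the sample data every id=1 row comes out invalid (each touches at least one other id=1 interval) and the single id=2 row stays valid, matching the loop's result; the work per id group is a sort plus two vectorized passes instead of one query per row.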