Python Forum

Full Version: Efficient way to mark entries in df with overlap in time ranges
Hi, I have df like:
df = pd.DataFrame(np.array([[1, 10, 20], [1, 5, 8], [1, 5, 15], [1, 13, 14], [1, 18, 21],[2, 2, 2], [1, 21, 100], [1, 1, 50]]),
                   columns=['id', 'start', 'stop'])
df['valid'] = True

print(f"{df} \n")
>>>    id  start  stop  valid
    0   1     10    20   True
    1   1      5     8   True
    2   1      5    15   True
    3   1     13    14   True
    4   1     18    21   True
    5   2      2     2   True
    6   1     21   100   True
    7   1      1    50   True
To mark invalid entries, which share an id and overlap in time, I came up with this idea:
col2 = 'id'
col3 = 'start'
col4 = 'stop'

counter = 0

for index, row in df.iterrows():
    row_id = df.at[index, col2]   # don't shadow the builtin `id`
    start = df.at[index, col3]
    stop = df.at[index, col4]

    # 3 cases for overlapping time:
    # case 1: start <= other.stop  <= stop
    # case 2: start <= other.start <= stop
    # case 3: other.start <= start and other.stop >= stop
    # id is numeric here, so it must not be quoted in the query string
    df_temp = df.query(
        f"{col2} == {row_id} and "
        f"(({start} <= {col4} <= {stop}) or ({start} <= {col3} <= {stop}) "
        f"or ({col3} <= {start} and {col4} >= {stop}))"
    )
    # drop the row itself from its own matches
    df_temp = df_temp.drop([index])

    # use a different loop variable so the outer `index` is not overwritten
    for match_index in df_temp.index:
        if df.loc[match_index, 'valid']:
            df.loc[match_index, 'valid'] = False
            counter += 1
print(f"Affected Rows: {counter}")
My solution takes approx. 6 s for 0.1 % of the data (326 rows), which extrapolates linearly to roughly 100 min for the full set. Is there a way to make this faster? I'd appreciate a hint. Thanks.
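One possible direction (a sketch, not a drop-in answer): since the row-by-row `query` rescans the whole frame for every row, the O(n²) scan can be replaced by a single sort per id group. After sorting by `start` within each id, a row overlaps an *earlier* row exactly when its `start` is at most the running maximum of earlier `stop` values, and a *later* row exactly when its `stop` is at least the running minimum of later `start` values. `mark_overlaps` below is a hypothetical helper name, and it assumes the inclusive overlap definition used in the query above:

```python
import numpy as np
import pandas as pd

def mark_overlaps(df):
    """Vectorized check: a row is invalid if its [start, stop] range
    overlaps (inclusive bounds) any other row with the same id."""
    s = df.sort_values(['id', 'start', 'stop'])
    gid = s['id']
    # running maximum of 'stop' over strictly earlier rows in the same id group
    prev_max_stop = s.groupby('id')['stop'].cummax().groupby(gid).shift()
    # running minimum of 'start' over strictly later rows in the same id group
    next_min_start = (s[::-1].groupby('id')['start'].cummin()[::-1]
                      .groupby(gid).shift(-1))
    # sorted by start: overlap with an earlier row iff start <= prev_max_stop,
    # with a later row iff stop >= next_min_start (NaN comparisons are False,
    # so the first/last row of a group only gets one of the two checks)
    overlap = (s['start'] <= prev_max_stop) | (s['stop'] >= next_min_start)
    out = df.copy()
    out['valid'] = ~overlap.reindex(df.index)
    return out

df = pd.DataFrame(np.array([[1, 10, 20], [1, 5, 8], [1, 5, 15], [1, 13, 14],
                            [1, 18, 21], [2, 2, 2], [1, 21, 100], [1, 1, 50]]),
                  columns=['id', 'start', 'stop'])
result = mark_overlaps(df)
print(result)
print(f"Affected Rows: {(~result['valid']).sum()}")
```

On the sample data this marks every id=1 row invalid (they all overlap the `[1, 50]` row) and leaves the single id=2 row valid, matching what the loop version produces. The cost is dominated by the sort, so it should scale roughly O(n log n) instead of O(n²).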