Jan-26-2023, 11:22 PM
(This post was last modified: Jan-27-2023, 05:27 AM by deanhystad.)
Looping in pandas is really slow. I would get the datetime values from the two dataframes and process those to build a list of the dataframe 2 rows that should be incorporated into dataframe 1. Something like this:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

a_values = df1["A"].values
b_values = df2["A"].values

# For each df1 value, look at the first df2 value above it and keep
# that df2 row only if it is within 1.  Both arrays are sorted, so the
# inner loop resumes where it left off (a single pass overall).
keepers = [False] * len(b_values)
j = 0
for i in range(len(a_values)):
    for j in range(j, len(b_values)):
        if b_values[j] > a_values[i]:
            keepers[j] = 0 < (b_values[j] - a_values[i]) <= 1
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)
Output:
    A  B
0   1  1
1   5  2
2   6  1
3   9  3
4  13  4
5  17  5
6  21  6
7  22  4
8  25  7
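As an aside, the same keep/drop rule can be expressed without a Python loop using pandas' built-in pd.merge_asof, which matches each row of one sorted frame to the nearest preceding key in another. This is only a sketch for the numeric example above; the helper column name A1 is my own, not part of the original code:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

# For each df2 value, find the closest df1 value that is strictly smaller.
right = df1[["A"]].rename(columns={"A": "A1"})
matched = pd.merge_asof(
    df2, right, left_on="A", right_on="A1",
    direction="backward", allow_exact_matches=False,
)

# Keep a df2 row only if it lies within 1 of that df1 value
# (unmatched rows produce NaN, which compares as False).
keep = ((matched["A"] - matched["A1"]) <= 1).to_numpy()

df3 = (
    pd.concat([df1, df2[keep]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)
```

This produces the same output as the loop for this example, with the matching pushed down into compiled code.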
It should be pretty quick because it only has to make one pass through the dataframes. I need to think about how this can be expanded to 3 or more dataframes.
I did a little experiment with dataframes more in line with the sizes you describe. Here I combine a dataframe with 20,000 rows with a dataframe that has 5000 rows. Time to combine the frames = 0.2 seconds.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from random import random
from time import time

# Build two sorted datetime columns: 20,000 rows and 5,000 rows.
a = [datetime.now()]
for i in range(20000):
    a.append(a[-1] + timedelta(seconds=random() * 3600))
b = [a[0]]
for i in range(5000):
    b.append(b[-1] + timedelta(seconds=random() * 3600))

df1 = pd.DataFrame({"Datetime": a, "B": list(range(len(a)))})
df2 = pd.DataFrame({"Datetime": b, "B": list(range(len(b)))})

a_times = df1["Datetime"].values
b_times = df2["Datetime"].values

start = time()
delta = np.timedelta64(5, "m")

# Keep a df2 row if it falls within 5 minutes after a df1 time.
keepers = [False] * len(b_times)
i = 0
for a_time in a_times:
    for i in range(i, len(b_times)):
        if b_times[i] > a_time:
            keepers[i] = (b_times[i] - a_time) <= delta
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(time() - start)
print(df3)
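The timed loop can likewise be written with pd.merge_asof, which accepts a tolerance argument for exactly this kind of time-window matching, and being pairwise it might also be one route to the 3+ dataframe case. A sketch on a small deterministic frame; the hit marker column and the sample timestamps are my own illustration, not from the code above:

```python
import pandas as pd

base = pd.Timestamp("2023-01-26 00:00")
df1 = pd.DataFrame({
    "Datetime": [base + pd.Timedelta(minutes=m) for m in (0, 10, 20)],
    "B": [1, 2, 3],
})
df2 = pd.DataFrame({
    "Datetime": [base + pd.Timedelta(minutes=m) for m in (3, 12, 40)],
    "B": [1, 2, 3],
})

# Mark each df2 row that has a df1 timestamp strictly before it
# and within the 5-minute tolerance; unmatched rows get NaN.
right = df1[["Datetime"]].assign(hit=True)
merged = pd.merge_asof(
    df2, right, on="Datetime", direction="backward",
    allow_exact_matches=False, tolerance=pd.Timedelta(minutes=5),
)
keep = merged["hit"].notna().tolist()

df3 = (
    pd.concat([df1, df2[keep]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(df3)
```

Here the 3-minute and 12-minute rows fall within 5 minutes of a df1 time and are kept, while the 40-minute row is dropped.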