Jan-26-2023, 11:22 PM
(This post was last modified: Jan-27-2023, 05:27 AM by deanhystad.)
Looping in pandas is really slow. I would get the datetime values from the two dataframes and process those to build a list of the dataframe 2 rows that should be incorporated into dataframe 1. Something like this:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

a_values = df1["A"].values
b_values = df2["A"].values

# For each df1 value, look at the first df2 value above it and keep
# that df2 row only if it is within 1.  Both arrays are sorted, so the
# inner loop resumes where it left off (a single pass overall).
keepers = [False] * len(b_values)
j = 0
for i in range(len(a_values)):
    for j in range(j, len(b_values)):
        if b_values[j] > a_values[i]:
            keepers[j] = 0 < (b_values[j] - a_values[i]) <= 1
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)
Output:
    A  B
0   1  1
1   5  2
2   6  1
3   9  3
4  13  4
5  17  5
6  21  6
7  22  4
8  25  7
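As an aside, the same keep/drop rule can be expressed without a Python loop using pandas' built-in pd.merge_asof, which matches each row of one sorted frame to the nearest preceding key in another. This is only a sketch for the numeric example above; the helper column name A1 is my own, not part of the original code:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

# For each df2 value, find the closest df1 value that is strictly smaller.
right = df1[["A"]].rename(columns={"A": "A1"})
matched = pd.merge_asof(
    df2, right, left_on="A", right_on="A1",
    direction="backward", allow_exact_matches=False,
)

# Keep a df2 row only if it lies within 1 of that df1 value
# (unmatched rows produce NaN, which compares as False).
keep = ((matched["A"] - matched["A1"]) <= 1).to_numpy()

df3 = (
    pd.concat([df1, df2[keep]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)
```

This produces the same output as the loop for this example, with the matching pushed down into compiled code.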
It should be pretty quick because it only has to make one pass through the dataframes. I need to think about how this can be expanded to 3 or more dataframes.
I did a little experiment with dataframes more in line with the sizes you describe. Here I combine a dataframe with 20,000 rows with a dataframe that has 5000 rows. Time to combine the frames = 0.2 seconds.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from random import random
from time import time

# Build two sorted datetime columns: 20,000 rows and 5,000 rows.
a = [datetime.now()]
for i in range(20000):
    a.append(a[-1] + timedelta(seconds=random() * 3600))
b = [a[0]]
for i in range(5000):
    b.append(b[-1] + timedelta(seconds=random() * 3600))

df1 = pd.DataFrame({"Datetime": a, "B": list(range(len(a)))})
df2 = pd.DataFrame({"Datetime": b, "B": list(range(len(b)))})

a_times = df1["Datetime"].values
b_times = df2["Datetime"].values

start = time()
delta = np.timedelta64(5, "m")

# Keep a df2 row if it falls within 5 minutes after a df1 time.
keepers = [False] * len(b_times)
i = 0
for a_time in a_times:
    for i in range(i, len(b_times)):
        if b_times[i] > a_time:
            keepers[i] = (b_times[i] - a_time) <= delta
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(time() - start)
print(df3)
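The timed loop can likewise be written with pd.merge_asof, which accepts a tolerance argument for exactly this kind of time-window matching, and being pairwise it might also be one route to the 3+ dataframe case. A sketch on a small deterministic frame; the hit marker column and the sample timestamps are my own illustration, not from the code above:

```python
import pandas as pd

base = pd.Timestamp("2023-01-26 00:00")
df1 = pd.DataFrame({
    "Datetime": [base + pd.Timedelta(minutes=m) for m in (0, 10, 20)],
    "B": [1, 2, 3],
})
df2 = pd.DataFrame({
    "Datetime": [base + pd.Timedelta(minutes=m) for m in (3, 12, 40)],
    "B": [1, 2, 3],
})

# Mark each df2 row that has a df1 timestamp strictly before it
# and within the 5-minute tolerance; unmatched rows get NaN.
right = df1[["Datetime"]].assign(hit=True)
merged = pd.merge_asof(
    df2, right, on="Datetime", direction="backward",
    allow_exact_matches=False, tolerance=pd.Timedelta(minutes=5),
)
keep = merged["hit"].notna().tolist()

df3 = (
    pd.concat([df1, df2[keep]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(df3)
```

Here the 3-minute and 12-minute rows fall within 5 minutes of a df1 time and are kept, while the 40-minute row is dropped.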