Python Forum
Manipulating panda dataframes more python-like
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Manipulating panda dataframes more python-like
#2
Looping in pandas is really slow. I would get the datetime values for two dataframes and process those to create a list of dataframe 2 rows that should be incorporated into dataframe 1. Something like this.
import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

a_values = df1["A"].values
b_values = df2["A"].values

keepers = [False] * len(b_values)
j = 0
for i in range(len(a_values)):
    for j in range(j, len(b_values)):
        if b_values[j] > a_values[i]:
            keepers[j] = 0 < (b_values[j] - a_values[i]) <= 1
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)
Output:
A B 0 1 1 1 5 2 2 6 1 3 9 3 4 13 4 5 17 5 6 21 6 7 22 4 8 25 7
It should be pretty quick because it only has to make one pass through the dataframes.

I need to think about how this can be expanded to 3 or more dataframes.

I did a little experiment with dataframes more in line with the sizes you describe. Here I combine a dataframe with 20,000 rows with a dataframe that has 5000 rows. Time to combine the frames = 0.2 seconds.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from random import random
from time import time


a = [datetime.now()]
for i in range(20000):
    a.append(a[-1] + timedelta(seconds=random() * 3600))

b = [a[0]]
for i in range(5000):
    b.append(b[-1] + timedelta(seconds=random() * 3600))

df1 = pd.DataFrame({"Datetime": a, "B": list(range(len(a)))})
df2 = pd.DataFrame({"Datetime": b, "B": list(range(len(b)))})
 
a_times = df1["Datetime"].values
b_times = df2["Datetime"].values

start = time()
delta = np.timedelta64(5, "m")
keepers = [False] * len(b_times)
i = 0
for a_time in a_times:
    for i in range(i, len(b_times)):
        if b_times[i] > a_time:
            keepers[i] = (b_times[i] - a_time) <= delta
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(time() - start)
print(df3)
Reply


Messages In This Thread
RE: Manipulating panda dataframes more python-like - by deanhystad - Jan-26-2023, 11:22 PM

Possibly Related Threads…
Thread Author Replies Views Last Post
  Panda Exercise nyzs92 0 1,502 Sep-24-2021, 03:35 AM
Last Post: nyzs92
  Manipulating code to draw a tree Py_thon 8 3,257 Nov-21-2019, 05:00 PM
Last Post: sumana
  Manipulating List frenchyinspace 2 2,679 Oct-08-2019, 07:57 AM
Last Post: perfringo
  Manipulating __init__ method schniefen 5 3,530 May-06-2019, 11:22 AM
Last Post: buran

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020