Manipulating panda dataframes more python-like - Printable Version
Python Forum (https://python-forum.io) > Python Coding (https://python-forum.io/forum-7.html) > Homework (https://python-forum.io/forum-9.html)
Thread: Manipulating panda dataframes more python-like (/thread-39243.html)
Manipulating panda dataframes more python-like - badtwistoffate - Jan-20-2023

Hello,

While not necessarily homework, this seems the best place to post considering the question and difficulty level. I am trying to analyze some data, so first I need to get it into the right format. I have multiple DataFrames of large shapes (rows vary around 20000-22000 in length, columns 10-17 in length). I have one DataFrame (let's call it the "master" one) that represents one location, and I have multiple others of similar row counts (each DataFrame represents a different location).

Initially, I am trying to fill in a column that I added to the existing "master" DataFrame, based on comparing each master row's time stamp against the row time stamps of the other DataFrames. The time stamps don't match up perfectly and the shapes of the other DataFrames differ, so my goal is: fill the master DataFrame's row in the new column with the corresponding data from the other DataFrame (other location), but only if that data row is within 59 minutes and 30 seconds of the master's time stamp.

Unfortunately, I don't think I'm doing it in a python-efficient way, as the script has been executing for over three hours. Here's some example code. Is there a better way to compare time stamps and not rely as heavily on for loops?

# Iterate through each row
for ind in df1.index:
    # See if it has a wind gust by checking whether the wind gust is higher than 0 mph
    if df1.loc[ind, 'wind_gust'] > 0:
        # Check now if there is a corresponding neighbor obs within the window; if so, plug it in
        # Check site df2 first
        for ind_df2 in df2.index:
            # Compare the time stamps of df1 and df2. If df2 falls within
            # +/- 29 minutes and 45 seconds, then copy it in!
            if (df2.loc[ind_df2, 'Date_Time'] <= df1.loc[ind, 'Date_Time'] + pd.Timedelta("29 min 45 s")) and \
               (df2.loc[ind_df2, 'Date_Time'] >= df1.loc[ind, 'Date_Time'] - pd.Timedelta("29 min 45 s")):
                # The df2 obs is within 59 minutes and 30 seconds, so insert the obs
                df1.loc[ind, 'df2 pressure'] = df2.loc[ind_df2, 'pressure']
                # Stop the df2 loop once filled and move on to check df3
                break
        # Check site df3 now
        for ind_df3 in df3.index:
            if (df3.loc[ind_df3, 'Date_Time'] <= df1.loc[ind, 'Date_Time'] + pd.Timedelta("29 min 45 s")) and \
               (df3.loc[ind_df3, 'Date_Time'] >= df1.loc[ind, 'Date_Time'] - pd.Timedelta("29 min 45 s")):
                # The df3 obs is within 59 minutes and 30 seconds, so insert the pressure obs
                df1.loc[ind, 'df3 pressure'] = df3.loc[ind_df3, 'sea_level_pressure_set_1']
                break

RE: Manipulating panda dataframes more python-like - deanhystad - Jan-26-2023

Looping in pandas is really slow. I would get the datetime values for the two dataframes and process those to create a list of the dataframe 2 rows that should be incorporated into dataframe 1. Something like this:

import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

a_values = df1["A"].values
b_values = df2["A"].values
keepers = [False] * len(b_values)
j = 0
for i in range(len(a_values)):
    for j in range(j, len(b_values)):
        if b_values[j] > a_values[i]:
            keepers[j] = 0 < (b_values[j] - a_values[i]) <= 1
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)

It should be pretty quick because it only has to make one pass through the dataframes. I need to think about how this can be expanded to 3 or more dataframes.

I did a little experiment with dataframes more in line with the sizes you describe. Here I combine a dataframe with 20,000 rows with a dataframe that has 5,000 rows. Time to combine the frames = 0.2 seconds.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from random import random
from time import time

a = [datetime.now()]
for i in range(20000):
    a.append(a[-1] + timedelta(seconds=random() * 3600))
b = [a[0]]
for i in range(5000):
    b.append(b[-1] + timedelta(seconds=random() * 3600))

df1 = pd.DataFrame({"Datetime": a, "B": list(range(len(a)))})
df2 = pd.DataFrame({"Datetime": b, "B": list(range(len(b)))})

a_times = df1["Datetime"].values
b_times = df2["Datetime"].values

start = time()
delta = np.timedelta64(5, "m")
keepers = [False] * len(b_times)
i = 0
for a_time in a_times:
    for i in range(i, len(b_times)):
        if b_times[i] > a_time:
            keepers[i] = (b_times[i] - a_time) <= delta
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(time() - start)
print(df3)

RE: Manipulating panda dataframes more python-like - snippsat - Jan-27-2023

(Jan-20-2023, 12:57 AM)badtwistoffate Wrote: Unfortunately, I don't think I'm doing it in a python-efficient way as the script has been executing for over three hours. Here's some example code. Is there a better way to compare time stamps and not rely as heavily on for loops?

Pandas often needs a totally different approach than the way things are done in plain Python; the loops are usually out 🚀 Look at this blog post. So try to work on whole dataframes (built-in tools) using vectorization (this means avoiding Python-level loops). To give an example of how to compare times between two or more dataframes:
import pandas as pd

# Create DataFrame 1
date1 = ['2022-01-01 00:00:00', '2022-01-02 12:00:00', '2022-01-03 18:00:00']
wind_speed1 = [5, 6, 7]
df1 = pd.DataFrame({'date_time': pd.to_datetime(date1), 'wind_speed': wind_speed1})

# Create DataFrame 2
date2 = ['2022-01-02 08:00:00', '2022-01-03 15:00:00', '2022-01-04 20:00:00']
wind_speed2 = [8, 9, 10]
df2 = pd.DataFrame({'date_time': pd.to_datetime(date2), 'wind_speed': wind_speed2})

# Take the time of day from a specific row as a 'datetime.time' object
specific_time = df1.iloc[1]['date_time'].time()

# Compare against all times of day less than the specific time and show the result
print(df2[df2['date_time'].dt.time < specific_time])
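Building on the vectorization point: pandas also ships pd.merge_asof, which matches each row of one frame to the nearest timestamp in another, within a tolerance; that tolerance maps directly onto the original poster's ±29 min 45 s window. The following is a minimal sketch with made-up timestamps and pressure values, not code from the thread:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:10", "2022-01-01 02:30"]),
    "wind_gust": [3, 5, 0],
})
df2 = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2022-01-01 00:20", "2022-01-01 01:05", "2022-01-01 02:00"]),
    "pressure": [1011.2, 1012.5, 1009.8],
})

# merge_asof requires both frames to be sorted on the key column.
# direction="nearest" picks the closest df2 row by timestamp; tolerance
# rejects any match farther than +/- 29 min 45 s away (it becomes NaN).
merged = pd.merge_asof(
    df1.sort_values("Date_Time"),
    df2.sort_values("Date_Time").rename(columns={"pressure": "df2 pressure"}),
    on="Date_Time",
    direction="nearest",
    tolerance=pd.Timedelta("29min 45s"),
)
print(merged)
```

The third df1 row gets NaN because its nearest df2 timestamp is 30 minutes away, just outside the window; the whole match runs in C, with no Python-level loop.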
RE: Manipulating panda dataframes more python-like - deanhystad - Jan-28-2023

I think I understand your question a bit better now.

df1 - Has most of the information you want.
df2 - Has some additional information you would like to add to df1. Only some rows in df1 should be updated.

df1 and df2 both have a datetime column that can be used to synchronize the two.

How about something like this? It takes half a second to process a 20,000 row dataframe.

import pandas as pd
import numpy as np
from time import time

# Make some dummy dataframes
# df1 has datetime, windspeed and pressure. Pressures are all 0/empty
# df2 has datetime and pressure
df1 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 20000),
    "WindSpeed": np.random.randint(2, size=20000),
    "Pressure": np.zeros(20000)
})
df2 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 15000),
    "Pressure": np.random.random(15000) * 5 + 1,
})

starting_time = time()

# We are only interested in wind events. No reason to look at other rows.
# We can roll changes back into df1 later on using the row index values.
wind_events = df1[df1["WindSpeed"] > 0]

# Extract the time and pressure info. Faster to work with these
# as arrays instead of getting them from the dataframes each time.
a_times = wind_events["Datetime"].values
b_times = df2["Datetime"].values
a_pressure = wind_events.columns.get_loc("Pressure")
b_pressure = df2["Pressure"].values

# For each wind event, find the pressure from df2 that has the
# closest timestamp. Only accept pressures within 10 minutes of
# the wind event.
window = np.timedelta64(10, "m")
b = 0
for a, a_time in enumerate(a_times):
    best_time = window + 1
    best_index = None
    for b in range(b, len(b_times)):
        diff = b_times[b] - a_time
        if diff > window:
            break
        diff = abs(diff)
        if diff < best_time:
            best_time = diff
            best_index = b
    if best_index is not None and best_time <= window:
        wind_events.iat[a, a_pressure] = b_pressure[best_index]

df1.update(wind_events["Pressure"])
print(time() - starting_time)
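The nearest-neighbour scan above can also be written without a Python-level inner loop: numpy.searchsorted finds, for every event time at once, the insertion point in the sorted neighbour times, and the nearest match is either that index or the one just before it. This is a hedged sketch on toy data; the 10-minute window mirrors the post, while the timestamps and pressures are illustrative:

```python
import numpy as np
import pandas as pd

a_times = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:10", "2022-01-01 02:30"]).values
b_times = pd.to_datetime(["2022-01-01 00:20", "2022-01-01 01:05", "2022-01-01 02:00"]).values
b_pressure = np.array([1011.2, 1012.5, 1009.8])

# For each a_time, searchsorted returns the index of the first b_time >= a_time.
# The nearest neighbour is either that index or the one before it.
right = np.searchsorted(b_times, a_times)
left = np.clip(right - 1, 0, len(b_times) - 1)
right = np.clip(right, 0, len(b_times) - 1)

# Pick whichever candidate is closer in time.
pick = np.where(
    np.abs(b_times[right] - a_times) < np.abs(b_times[left] - a_times),
    right, left,
)

# Only accept matches within the window; everything else becomes NaN.
window = np.timedelta64(10, "m")
within = np.abs(b_times[pick] - a_times) <= window
pressures = np.where(within, b_pressure[pick], np.nan)
print(pressures)
```

Here only the middle event has a neighbour observation within 10 minutes, so the other two entries come out as NaN. Both arrays must be sorted by time for this to work, which matches the assumption in the loop version above.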
RE: Manipulating panda dataframes more python-like - deanhystad - Jan-31-2023

Better and faster than before. 20,000 record table in about 0.16 seconds. You can ignore the slice/copy warning.

import pandas as pd
import numpy as np
from time import time

# Make some dummy dataframes
# df1 has datetime, windspeed and pressure. Pressures are all 0/empty
# df2 has datetime and pressure
df1 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 20000),
    "WindSpeed": np.random.randint(2, size=20000),
    "Pressure": np.zeros(20000)
})
df2 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 15000),
    "Pressure": np.random.random(15000) * 5 + 1,
})

starting_time = time()

# We are only interested in wind events. No reason to look at other rows.
# We can roll changes back into df1 later on using the row index values.
wind_events = df1[df1["WindSpeed"] > 0]

# Extract the time and pressure info. Faster to work with these
# as arrays instead of getting them from the dataframes each time.
a_times = wind_events["Datetime"].values
b_times = df2["Datetime"].values
a_pressure = wind_events["Pressure"].values
b_pressure = df2["Pressure"].values

# For each wind event, find the pressure from df2 that has the
# closest timestamp. Only accept pressures within the time window.
window = np.timedelta64(10, "m")
b = 0
for a, a_time in enumerate(a_times):
    best_diff = prev_diff = abs(b_times[b] - a_time)
    best_index = b
    for b in range(b + 1, len(b_times)):
        diff = abs(b_times[b] - a_time)
        if diff < best_diff:
            best_diff, best_index = diff, b
        if diff > prev_diff:
            # We are not going to get a better match
            break
        prev_diff = diff
    if best_index is not None and best_diff <= window:
        a_pressure[a] = b_pressure[best_index]

# Replace wind_events["Pressure"] with the a_pressure array and fold
# wind_events back into df1.
wind_events["Pressure"] = a_pressure
df1.update(wind_events["Pressure"])
print(time() - starting_time)
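On the open question of extending this to 3 or more dataframes: one possible sketch (my own, not from the thread) wraps a nearest-within-window match in a helper and applies it once per neighbour site, adding one column each time. It uses pd.merge_asof instead of the manual scan above, and the helper name, column names, and sample data are all illustrative:

```python
import pandas as pd
import numpy as np

def add_nearest(master, other, value_col, new_col, tolerance="29min 45s"):
    """Copy other[value_col] into a new master column from the row whose
    Datetime is nearest, provided it is within the tolerance window."""
    return pd.merge_asof(
        master.sort_values("Datetime"),
        other[["Datetime", value_col]]
            .rename(columns={value_col: new_col})
            .sort_values("Datetime"),
        on="Datetime",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )

master = pd.DataFrame({"Datetime": pd.date_range("2022-01-01", periods=5, freq="1h")})
site2 = pd.DataFrame({"Datetime": pd.date_range("2022-01-01 00:10", periods=5, freq="1h"),
                      "pressure": np.arange(5.0)})
site3 = pd.DataFrame({"Datetime": pd.date_range("2022-01-01 00:50", periods=5, freq="2h"),
                      "pressure": np.arange(10.0, 15.0)})

# One call per neighbour site; each adds one column to the master frame.
master = add_nearest(master, site2, "pressure", "df2 pressure")
master = add_nearest(master, site3, "pressure", "df3 pressure")
print(master)
```

Every site2 observation lands within the window, so "df2 pressure" fills completely, while site3's two-hourly observations only match every other master row and the rest stay NaN. Adding a fourth or fifth site is just one more call.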