Manipulating panda dataframes more python-like - Printable Version
Python Forum (https://python-forum.io) > Python Coding (https://python-forum.io/forum-7.html) > Homework (https://python-forum.io/forum-9.html)
Thread: Manipulating panda dataframes more python-like (/thread-39243.html)
Manipulating panda dataframes more python-like - badtwistoffate - Jan-20-2023

Hello,

While not necessarily homework, this seems the best place to post considering the question and difficulty level. I am trying to analyze some data, so first I need to get it into the right format. I have multiple DataFrames of large shapes (rows vary around 20000-22000 in length, columns 10-17 in length). I have one DataFrame (let's call it the "master" one) that represents one location, and I have multiple others of similar row counts (each DataFrame represents a different location).

Initially, I am trying to fill in a column that I added to the existing "master" DataFrame, based on comparing each master row's time stamp against the row time stamps of the other DataFrames. The time stamps don't match up perfectly and the shapes of the other DataFrames differ, so my goal is: fill the master DataFrame's row in the new column with the corresponding data from the other DataFrame (other location), but only if that data row is within 59 minutes and 30 seconds of the master's time stamp.

Unfortunately, I don't think I'm doing it in a python-efficient way, as the script has been executing for over three hours. Here's some example code. Is there a better way to compare time stamps and not rely as heavily on for loops?

# Iterate through each row
for ind in df1.index:
    # See if it has a wind gust by checking whether the wind gust is higher than 0 mph
    if df1.loc[ind, 'wind_gust'] > 0:
        # Check now if there is a corresponding neighbor obs within the window; if so, plug it in
        # Check site df2 first
        for ind_df2 in df2.index:
            # Compare the time stamps of df1 and df2. If df2 falls within
            # +/- 29 minutes and 45 seconds, then copy it in!
            if (df2.loc[ind_df2, 'Date_Time'] <= df1.loc[ind, 'Date_Time'] + pd.Timedelta("29 min 45 s")) and \
               (df2.loc[ind_df2, 'Date_Time'] >= df1.loc[ind, 'Date_Time'] - pd.Timedelta("29 min 45 s")):
                # The df2 obs is within 59 minutes and 30 seconds, so insert the obs
                df1.loc[ind, 'df2 pressure'] = df2.loc[ind_df2, 'pressure']
                # Stop the df2 loop once filled and move on to check df3
                break
        # Check site df3 now
        for ind_df3 in df3.index:
            if (df3.loc[ind_df3, 'Date_Time'] <= df1.loc[ind, 'Date_Time'] + pd.Timedelta("29 min 45 s")) and \
               (df3.loc[ind_df3, 'Date_Time'] >= df1.loc[ind, 'Date_Time'] - pd.Timedelta("29 min 45 s")):
                # The df3 obs is within 59 minutes and 30 seconds, so insert the pressure obs
                df1.loc[ind, 'df3 pressure'] = df3.loc[ind_df3, 'sea_level_pressure_set_1']
                break

RE: Manipulating panda dataframes more python-like - deanhystad - Jan-26-2023

Looping in pandas is really slow. I would get the datetime values for the two dataframes and process those to create a list of the dataframe 2 rows that should be incorporated into dataframe 1. Something like this:

import pandas as pd

df1 = pd.DataFrame({"A": [1, 5, 9, 13, 17, 21, 25], "B": list(range(1, 8))})
df2 = pd.DataFrame({"A": [6, 13, 19, 22, 28], "B": list(range(1, 6))})

a_values = df1["A"].values
b_values = df2["A"].values
keepers = [False] * len(b_values)
j = 0
for i in range(len(a_values)):
    for j in range(j, len(b_values)):
        if b_values[j] > a_values[i]:
            keepers[j] = 0 < (b_values[j] - a_values[i]) <= 1
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["A"])
    .reset_index(drop=True)
)
print(df3)

It should be pretty quick because it only has to make one pass through the dataframes. I need to think about how this can be expanded to 3 or more dataframes.

I did a little experiment with dataframes more in line with the sizes you describe. Here I combine a dataframe with 20,000 rows with a dataframe that has 5,000 rows. Time to combine the frames = 0.2 seconds.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from random import random
from time import time

a = [datetime.now()]
for i in range(20000):
    a.append(a[-1] + timedelta(seconds=random() * 3600))
b = [a[0]]
for i in range(5000):
    b.append(b[-1] + timedelta(seconds=random() * 3600))

df1 = pd.DataFrame({"Datetime": a, "B": list(range(len(a)))})
df2 = pd.DataFrame({"Datetime": b, "B": list(range(len(b)))})

a_times = df1["Datetime"].values
b_times = df2["Datetime"].values

start = time()
delta = np.timedelta64(5, "m")
keepers = [False] * len(b_times)
i = 0
for a_time in a_times:
    for i in range(i, len(b_times)):
        if b_times[i] > a_time:
            keepers[i] = (b_times[i] - a_time) <= delta
            break

df3 = (
    pd.concat([df1, df2[keepers]], ignore_index=True)
    .sort_values(by=["Datetime"])
    .reset_index(drop=True)
)
print(time() - start)
print(df3)

RE: Manipulating panda dataframes more python-like - snippsat - Jan-27-2023

(Jan-20-2023, 12:57 AM)badtwistoffate Wrote: Unfortunately, I don't think I'm doing it in a python-efficient way as the script has been executing for over three hours. Here's some example code. Is there a better way to compare time stamps and not rely as heavily on for loops?

Pandas often needs a totally different approach than the way things are done in plain Python; the loops are usually out 🚀 Look at this blog post. So try to work on whole dataframes (built-in tools) using vectorization (this means avoiding Python-level loops). To give an example of how to compare times between two or more dataframes:
import pandas as pd

# Create DataFrame 1
date1 = ['2022-01-01 00:00:00', '2022-01-02 12:00:00', '2022-01-03 18:00:00']
wind_speed1 = [5, 6, 7]
df1 = pd.DataFrame({'date_time': pd.to_datetime(date1), 'wind_speed': wind_speed1})

# Create DataFrame 2
date2 = ['2022-01-02 08:00:00', '2022-01-03 15:00:00', '2022-01-04 20:00:00']
wind_speed2 = [8, 9, 10]
df2 = pd.DataFrame({'date_time': pd.to_datetime(date2), 'wind_speed': wind_speed2})

# Take the time of day from a specific row as a 'datetime.time' object
specific_time = df1.iloc[1]['date_time'].time()

# Compare against all times of day less than the specific time and show the result
print(df2[df2['date_time'].dt.time < specific_time])
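Building on the vectorization point: pandas also ships pd.merge_asof, which matches each row of one frame to the nearest timestamp in another, within a tolerance; that tolerance maps directly onto the original poster's ±29 min 45 s window. The following is a minimal sketch with made-up timestamps and pressure values, not code from the thread:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:10", "2022-01-01 02:30"]),
    "wind_gust": [3, 5, 0],
})
df2 = pd.DataFrame({
    "Date_Time": pd.to_datetime(["2022-01-01 00:20", "2022-01-01 01:05", "2022-01-01 02:00"]),
    "pressure": [1011.2, 1012.5, 1009.8],
})

# merge_asof requires both frames to be sorted on the key column.
# direction="nearest" picks the closest df2 row by timestamp; tolerance
# rejects any match farther than +/- 29 min 45 s away (it becomes NaN).
merged = pd.merge_asof(
    df1.sort_values("Date_Time"),
    df2.sort_values("Date_Time").rename(columns={"pressure": "df2 pressure"}),
    on="Date_Time",
    direction="nearest",
    tolerance=pd.Timedelta("29min 45s"),
)
print(merged)
```

The third df1 row gets NaN because its nearest df2 timestamp is 30 minutes away, just outside the window; the whole match runs in C, with no Python-level loop.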
RE: Manipulating panda dataframes more python-like - deanhystad - Jan-28-2023

I think I understand your question a bit better now.

df1 - Has most of the information you want.
df2 - Has some additional information you would like to add to df1. Only some rows in df1 should be updated.

df1 and df2 both have a datetime column that can be used to synchronize the two.

How about something like this? It takes half a second to process a 20,000 row dataframe.

import pandas as pd
import numpy as np
from time import time

# Make some dummy dataframes
# df1 has datetime, windspeed and pressure. Pressures are all 0/empty
# df2 has datetime and pressure
df1 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 20000),
    "WindSpeed": np.random.randint(2, size=20000),
    "Pressure": np.zeros(20000)
})
df2 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 15000),
    "Pressure": np.random.random(15000) * 5 + 1,
})

starting_time = time()

# We are only interested in wind events. No reason to look at other rows.
# We can roll changes back into df1 later on using the row index values.
wind_events = df1[df1["WindSpeed"] > 0]

# Extract the time and pressure info. Faster to work with these
# as arrays instead of getting them from the dataframes each time.
a_times = wind_events["Datetime"].values
b_times = df2["Datetime"].values
a_pressure = wind_events.columns.get_loc("Pressure")
b_pressure = df2["Pressure"].values

# For each wind event, find the pressure from df2 that has the
# closest timestamp. Only accept pressures within 10 minutes of
# the wind event.
window = np.timedelta64(10, "m")
b = 0
for a, a_time in enumerate(a_times):
    best_time = window + 1
    best_index = None
    for b in range(b, len(b_times)):
        diff = b_times[b] - a_time
        if diff > window:
            break
        diff = abs(diff)
        if diff < best_time:
            best_time = diff
            best_index = b
    if best_index is not None and best_time <= window:
        wind_events.iat[a, a_pressure] = b_pressure[best_index]

df1.update(wind_events["Pressure"])
print(time() - starting_time)
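The nearest-neighbour scan above can also be written without a Python-level inner loop: numpy.searchsorted finds, for every event time at once, the insertion point in the sorted neighbour times, and the nearest match is either that index or the one just before it. This is a hedged sketch on toy data; the 10-minute window mirrors the post, while the timestamps and pressures are illustrative:

```python
import numpy as np
import pandas as pd

a_times = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 01:10", "2022-01-01 02:30"]).values
b_times = pd.to_datetime(["2022-01-01 00:20", "2022-01-01 01:05", "2022-01-01 02:00"]).values
b_pressure = np.array([1011.2, 1012.5, 1009.8])

# For each a_time, searchsorted returns the index of the first b_time >= a_time.
# The nearest neighbour is either that index or the one before it.
right = np.searchsorted(b_times, a_times)
left = np.clip(right - 1, 0, len(b_times) - 1)
right = np.clip(right, 0, len(b_times) - 1)

# Pick whichever candidate is closer in time.
pick = np.where(
    np.abs(b_times[right] - a_times) < np.abs(b_times[left] - a_times),
    right, left,
)

# Only accept matches within the window; everything else becomes NaN.
window = np.timedelta64(10, "m")
within = np.abs(b_times[pick] - a_times) <= window
pressures = np.where(within, b_pressure[pick], np.nan)
print(pressures)
```

Here only the middle event has a neighbour observation within 10 minutes, so the other two entries come out as NaN. Both arrays must be sorted by time for this to work, which matches the assumption in the loop version above.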
RE: Manipulating panda dataframes more python-like - deanhystad - Jan-31-2023

Better and faster than before. 20,000 record table in about 0.16 seconds. You can ignore the slice/copy warning.

import pandas as pd
import numpy as np
from time import time

# Make some dummy dataframes
# df1 has datetime, windspeed and pressure. Pressures are all 0/empty
# df2 has datetime and pressure
df1 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 20000),
    "WindSpeed": np.random.randint(2, size=20000),
    "Pressure": np.zeros(20000)
})
df2 = pd.DataFrame({
    "Datetime": pd.date_range("1/1/2022", "1/1/2023", 15000),
    "Pressure": np.random.random(15000) * 5 + 1,
})

starting_time = time()

# We are only interested in wind events. No reason to look at other rows.
# We can roll changes back into df1 later on using the row index values.
wind_events = df1[df1["WindSpeed"] > 0]

# Extract the time and pressure info. Faster to work with these
# as arrays instead of getting them from the dataframes each time.
a_times = wind_events["Datetime"].values
b_times = df2["Datetime"].values
a_pressure = wind_events["Pressure"].values
b_pressure = df2["Pressure"].values

# For each wind event, find the pressure from df2 that has the
# closest timestamp. Only accept pressures within the time window.
window = np.timedelta64(10, "m")
b = 0
for a, a_time in enumerate(a_times):
    best_diff = prev_diff = abs(b_times[b] - a_time)
    best_index = b
    for b in range(b + 1, len(b_times)):
        diff = abs(b_times[b] - a_time)
        if diff < best_diff:
            best_diff, best_index = diff, b
        if diff > prev_diff:
            # We are not going to get a better match
            break
        prev_diff = diff
    if best_index is not None and best_diff <= window:
        a_pressure[a] = b_pressure[best_index]

# Replace wind_events["Pressure"] with the a_pressure array and fold
# wind_events back into df1.
wind_events["Pressure"] = a_pressure
df1.update(wind_events["Pressure"])
print(time() - starting_time)
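On the open question of extending this to 3 or more dataframes: one possible sketch (my own, not from the thread) wraps a nearest-within-window match in a helper and applies it once per neighbour site, adding one column each time. It uses pd.merge_asof instead of the manual scan above, and the helper name, column names, and sample data are all illustrative:

```python
import pandas as pd
import numpy as np

def add_nearest(master, other, value_col, new_col, tolerance="29min 45s"):
    """Copy other[value_col] into a new master column from the row whose
    Datetime is nearest, provided it is within the tolerance window."""
    return pd.merge_asof(
        master.sort_values("Datetime"),
        other[["Datetime", value_col]]
            .rename(columns={value_col: new_col})
            .sort_values("Datetime"),
        on="Datetime",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )

master = pd.DataFrame({"Datetime": pd.date_range("2022-01-01", periods=5, freq="1h")})
site2 = pd.DataFrame({"Datetime": pd.date_range("2022-01-01 00:10", periods=5, freq="1h"),
                      "pressure": np.arange(5.0)})
site3 = pd.DataFrame({"Datetime": pd.date_range("2022-01-01 00:50", periods=5, freq="2h"),
                      "pressure": np.arange(10.0, 15.0)})

# One call per neighbour site; each adds one column to the master frame.
master = add_nearest(master, site2, "pressure", "df2 pressure")
master = add_nearest(master, site3, "pressure", "df3 pressure")
print(master)
```

Every site2 observation lands within the window, so "df2 pressure" fills completely, while site3's two-hourly observations only match every other master row and the rest stay NaN. Adding a fourth or fifth site is just one more call.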