Feb-10-2022, 06:14 AM
I have this code that I have been working on to create data based on my actual data. I am using pandas and Python. Here is what my code looks like:
new_df = pd.DataFrame(columns=['dates', 'Column_D', 'Column_A', 'VALUE', 'Column_B', 'Column_C'])
for i in df["dates"].unique():
    for j in df["Column_A"].unique():
        for k in df["Column_B"].unique():
            for m in df["Column_C"].unique():
                n = df[(df["Column_D"] == 'orange') & (df["dates"] == '2005-1-1') & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE']
                x = df[(df["dates"] == '2005-1-1') & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE'].sum()
                tempVal = df[(df["dates"] == i) & (df["Column_A"] == j) & (df["Column_B"] == k) & (df["Column_C"] == m)]['VALUE'].sum()
                finalVal = (n * tempVal) / (x - n)
                if finalVal.empty or finalVal.isna().values.any() or np.isinf(finalVal).values.any():
                    finalVal = 0
                finalVal = int(finalVal)
                new_df = new_df.append({'dates': i, 'Column_D': 'orange', 'Column_A': j, 'VALUE': finalVal, 'Column_B': k, 'Column_C': m}, ignore_index=True)

It takes a long time for this code to run right now and I'm not sure how to fix it. I suspect the code runs sequentially. Could I get some help reducing the runtime? I want to know how to write my code in parallel and reduce the number of for loops. I heard PySpark is good, but will it help me? Thanks!
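For comparison, here is a vectorized groupby/merge sketch of the same computation, which avoids the nested loops entirely. It assumes the column names above; the toy data is made up for illustration, and one behavioral difference is that combinations absent from df are dropped rather than written out as 0:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real df (hypothetical values, same columns).
df = pd.DataFrame({
    'dates':    ['2005-1-1', '2005-1-1', '2005-2-1', '2005-2-1'],
    'Column_D': ['orange', 'apple', 'orange', 'apple'],
    'Column_A': ['a', 'a', 'a', 'a'],
    'Column_B': ['b', 'b', 'b', 'b'],
    'Column_C': ['c', 'c', 'c', 'c'],
    'VALUE':    [10, 20, 5, 15],
})

keys = ['Column_A', 'Column_B', 'Column_C']
base = df[df['dates'] == '2005-1-1']

# n: the 'orange' VALUE at the base date, per (A, B, C) group.
n = (base[base['Column_D'] == 'orange']
     .groupby(keys, as_index=False)['VALUE'].sum()
     .rename(columns={'VALUE': 'n'}))
# x: total VALUE at the base date, per group.
x = (base.groupby(keys, as_index=False)['VALUE'].sum()
     .rename(columns={'VALUE': 'x'}))
# tempVal: total VALUE per (date, A, B, C) combination.
temp = (df.groupby(['dates'] + keys, as_index=False)['VALUE'].sum()
          .rename(columns={'VALUE': 'tempVal'}))

# One merge per lookup instead of one filter per loop iteration.
out = temp.merge(n, on=keys, how='left').merge(x, on=keys, how='left')
out['VALUE'] = out['n'] * out['tempVal'] / (out['x'] - out['n'])
# Replace NaN/inf (missing groups or x == n) with 0, as the loop did.
out['VALUE'] = (out['VALUE'].replace([np.inf, -np.inf], np.nan)
                .fillna(0).astype(int))
out['Column_D'] = 'orange'
new_df = out[['dates', 'Column_D', 'Column_A', 'VALUE', 'Column_B', 'Column_C']]
```

On the toy data this gives VALUE = 10*30/(30-10) = 15 for the base date and 10*20/(30-10) = 10 for the second date. Note that `DataFrame.append` inside a loop is itself quadratic (it copies the frame every call), so replacing it with a single merge-based construction like this usually matters far more than parallelism or PySpark.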