Python Forum
Different code execution times
#1
Hi,
I have a problem with the following code: it takes an extremely long time to run to completion under Python 3.9 - 3.11, but under 3.6 - 3.8 it works without any big problem. Can someone explain to me why this is and how to make it faster on Python 3.9 and higher? The time difference ranges from a few seconds (on 3.8) to about an hour (on 3.11).

I have tested it under windows 11 and Ubuntu 22.04 with the same results.

import pandas as pd
import numpy as np

fileName = "Data"
inputRange = 20

df = pd.read_csv(fileName + ".csv", delimiter=";")

x_data = df[["Close","High","Low","Volumen"]]

y_data = df["Signal"]

x_train = []

for i in range(0,len(x_data)-inputRange):
    x_train.append(x_data[(i+1):(i+inputRange+1)])
#2
How big is the CSV file? I just ran your code using Python 3.10.7 with Data.csv having 100,000 rows and it took 3 seconds. Using x_train.append(x_data.iloc[(i + 1) : (i + inputRange + 1)]) was about 1 second faster.
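To illustrate the difference (a minimal sketch with made-up data, not the forum's CSV): plain [] slicing on a DataFrame with integer bounds is also positional, but .iloc states that intent explicitly and skips some of the label-resolution machinery, which is why it tends to be a bit faster in a tight loop.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for x_data.
df = pd.DataFrame(np.arange(20.0).reshape(10, 2), columns=["Close", "High"])

# Both select rows 2, 3 and 4 by position and produce the same frame,
# but .iloc is the explicit (and usually faster) positional indexer.
w_plain = df[2:5]
w_iloc = df.iloc[2:5]
print(w_plain.equals(w_iloc))  # True
```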

Out of curiosity I updated my pandas from 2.0.0 to 2.1.1. Now the code takes much longer to execute, about 80 times longer. Appears to be a pandas version issue, not a python version issue.

Pandas article on improving performance.

https://pandas.pydata.org/pandas-docs/st...gperf.html

Does x_train need to be a list of pandas dataframes? Would your code work if x_train was an array of numpy arrays (a 3D numpy array)? In the example below I use numpy.lib.stride_tricks.sliding_window_view(). This creates a new array that contains a series of sliding windows, 20 rows long, from x_data.
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

input_range = 20
df = pd.read_csv("data.csv")
x_data = df[["Close", "High", "Low", "Volumen"]].to_numpy()
x_train = sliding_window_view(x_data, (input_range, 4)).reshape(-1, input_range, 4)
When I tried using pandas 2.1.1 and your code on a CSV file with 100,000 rows, it took 4 minutes, 15 seconds. Using the code above it took 0.05 seconds. That's 5000 times faster. For your data it should reduce processing time from an hour to under a second.

I lack some understanding about sliding_window_view(). x_data.shape = (100000, 4), but when I call sliding_window_view(x_data, (input_range, 4)), x_train's shape is (99981, 1, 20, 4). I don't know why the extra axis (1) is created. The fix for now is to reshape the array to remove it.
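The extra axis appears because a 2D window slides along both axes: a (20, 4) window fits in 100000 - 20 + 1 = 99981 positions along axis 0 but only 4 - 4 + 1 = 1 position along axis 1, hence the size-1 dimension. A small sketch with made-up numbers (not the forum's data) showing the effect, and an alternative that slides along axis 0 only:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(12.0).reshape(6, 2)  # 6 rows, 2 columns

# A 2D window slides along BOTH axes: a (3, 2) window has
# 6-3+1 = 4 positions along axis 0 and 2-2+1 = 1 position along
# axis 1, so the result has shape (4, 1, 3, 2) - the extra axis.
w2d = sliding_window_view(x, (3, 2))
print(w2d.shape)  # (4, 1, 3, 2)

# Sliding along axis 0 only avoids the size-1 axis; the window
# dimension is appended last, so move it back into the middle.
w1d = sliding_window_view(x, 3, axis=0)  # shape (4, 2, 3)
w1d = w1d.transpose(0, 2, 1)             # shape (4, 3, 2)
print(np.array_equal(w1d, w2d.reshape(-1, 3, 2)))  # True
```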
#3
Hi,
thanks for the quick reply.

I have now noticed that the slowdown comes from the change from pandas 2.0.3 to 2.1.0.
I had assumed that I was using the same pandas version everywhere, but that was not the case.
If I install pandas 2.0.3 everywhere, it also works up to Python 3.10 (I have not tested it on 3.11, and on 3.12 I got an error message).

I will take a closer look at the 3D numpy array approach in the next few days.

Thanks for your help.
#4
Normally you don't specify a version when installing a package, so you get the newest stable version.
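When reproducibility matters, though, you can pin the version explicitly; a sketch (pandas 2.0.3 is the release this thread found to be fast, substitute whatever your project needs):

```shell
# Install an exact pandas release instead of the newest stable one.
pip install "pandas==2.0.3"

# Or pin it in requirements.txt so every environment gets the same version:
#   pandas==2.0.3
```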
#5
(Oct-04-2023, 06:06 PM)Wirbelwind94 Wrote:
for i in range(0,len(x_data)-inputRange):
When you use a standard Python loop like this with Pandas, there is usually a much faster way.
Also, as deanhystad mentioned, the run time can blow up between versions when using code like this,
so loops like this should be avoided in Pandas.

Here is a test with a generated Data.csv of 1,000,000 rows.
It reads the data into a numpy.ndarray and back into a DataFrame for easier access to the data.
This takes about 3.3 seconds.
import pandas as pd
import numpy as np

fileName = "Data"
inputRange = 20
df = pd.read_csv(fileName + ".csv", delimiter=";")
x_data = df[["Close","High","Low","Volumen"]].values
y_data = df["Signal"].values

# Create sequences of x_data
num_sequences = len(x_data) - inputRange
x_train = np.zeros((num_sequences, inputRange, x_data.shape[1]))

for i in range(num_sequences):
    x_train[i] = x_data[i:(i + inputRange)]

# Adjust y_data to align with the end of each sequence
y_train = y_data[inputRange:]

# Convert back to a DataFrame
features = ["Close", "High", "Low", "Volumen"]
# Create multi-level columns
multi_columns = pd.MultiIndex.from_product([features, range(inputRange)], names=['Feature', 'Timestep'])
# Reshape the 3D numpy array to 2D
reshaped_data = x_train.reshape((num_sequences, -1))
df_converted = pd.DataFrame(reshaped_data, columns=multi_columns)
print(df_converted.head())
print(df_converted.tail())
Feature        Close                          ...    Volumen                   
Timestep          0           1           2   ...         17         18      19
0         100.496714  100.708180  100.096421  ...  97.350607  95.925274   983.0
1         100.358450  101.260176   99.535949  ...  98.185528  97.624068  1036.0
2         101.006138  101.566533  100.399888  ...  98.023315  97.650150  1048.0
3         102.529168  103.378286  102.401322  ...  98.283334  97.853196  1002.0
4         102.295015  103.010577  101.300510  ...  96.642765  95.833808  1012.0

[5 rows x 80 columns]
Feature         Close               ...      Volumen        
Timestep           0            1   ...           18      19
999975   -1506.391447 -1506.368737  ... -1502.330789   951.0
999976   -1506.438789 -1505.710167  ... -1502.366849  1029.0
999977   -1506.919674 -1506.542284  ... -1502.014728  1029.0
999978   -1508.160130 -1507.298449  ... -1502.327236   979.0
999979   -1509.262004 -1508.807126  ... -1500.596051  1010.0
Example look at max for High column.
>>> df_converted['High'].max()
Timestep
0      472.933948
1      473.926661
2      472.407345
3     1049.000000
4      472.933948
5      473.926661
6      472.407345
7     1049.000000
8      472.933948
9      473.926661
10     472.407345
11    1049.000000
12     472.933948
13     473.926661
14     472.407345
15    1049.000000
16     472.933948
17     473.926661
18     472.407345
19    1049.000000 
There are also several tools that are great for better speed and memory usage, e.g. Dask or Polars.