May-03-2024, 05:30 PM
I'm using joblib for parallelization.
Here is my call to joblib:
backtestdf_list = Parallel(n_jobs=coremax)(
    delayed(do_backtest)(backtest_start_index)
    for backtest_start_index in backtest_start_index_list
)

Here is the function do_backtest:
def do_backtest(backtest_start_index):
    try:
        day_start_time = time.time()

        #Calculate the best params for the last X days
        def summarize(sample_dur):
            sampledf = completedf[(completedf['date'] >= unique_dates[backtest_start_index - sample_dur]) & (completedf['date'] < unique_dates[backtest_start_index])]
            return get_summarydf(sampledf).nlargest(1, 'avgdaily_percent_ROR')

        summarized_list = list(map(summarize, sample_dur_list))
        summarized_df = pd.concat(summarized_list)

        #Add sample_dur and b_index to the dataframe
        summarized_df['sample_dur'] = sample_dur_list
        summarized_df['b_index'] = backtest_start_index

        #summarized_df = summarized_df.sort_values(by='avgdaily_percent_ROR', ascending=False) #Uncomment if you want to view the summarized_df sorted, logically it does not matter if it is sorted or not at this point

        print('Finished backtesting ' + unique_dates[backtest_start_index] + ' @ ' + datetime.now().strftime(progressformat) + ' - Execution time (HH:MM:SS.xx) : ' + timer(day_start_time))
        return summarized_df
    except Exception as e:
        print('oh nooooo')
        print(e)
        errorlist.append((unique_dates[backtest_start_index], e))

And this is the output I get:
Output:
Finished backtesting 2023-12-21 @ 05-03-2024 12:01:09 PM - Execution time (HH:MM:SS.xx) : 00:00:06.46
Finished backtesting 2023-12-22 @ 05-03-2024 12:01:10 PM - Execution time (HH:MM:SS.xx) : 00:00:06.51
Finished backtesting 2023-12-26 @ 05-03-2024 12:01:11 PM - Execution time (HH:MM:SS.xx) : 00:00:06.43
Finished backtesting 2023-12-27 @ 05-03-2024 12:01:12 PM - Execution time (HH:MM:SS.xx) : 00:00:06.43
Finished backtesting 2023-12-28 @ 05-03-2024 12:01:12 PM - Execution time (HH:MM:SS.xx) : 00:00:06.41
Finished backtesting 2023-12-29 @ 05-03-2024 12:01:13 PM - Execution time (HH:MM:SS.xx) : 00:00:06.35
Finished backtesting 2024-01-02 @ 05-03-2024 12:01:14 PM - Execution time (HH:MM:SS.xx) : 00:00:06.52
Finished backtesting 2024-01-03 @ 05-03-2024 12:01:15 PM - Execution time (HH:MM:SS.xx) : 00:00:06.55
Finished backtesting 2024-01-04 @ 05-03-2024 12:01:15 PM - Execution time (HH:MM:SS.xx) : 00:00:06.37
Finished backtesting 2024-01-05 @ 05-03-2024 12:01:16 PM - Execution time (HH:MM:SS.xx) : 00:00:06.43
Finished backtesting 2024-01-08 @ 05-03-2024 12:01:17 PM - Execution time (HH:MM:SS.xx) : 00:00:06.46
Finished backtesting 2024-01-09 @ 05-03-2024 12:01:18 PM - Execution time (HH:MM:SS.xx) : 00:00:06.69
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
warnings.warn(
Finished backtesting 2024-01-10 @ 05-03-2024 12:01:19 PM - Execution time (HH:MM:SS.xx) : 00:00:06.59
Finished backtesting 2024-01-11 @ 05-03-2024 12:01:19 PM - Execution time (HH:MM:SS.xx) : 00:00:06.49
Finished backtesting 2024-01-12 @ 05-03-2024 12:01:20 PM - Execution time (HH:MM:SS.xx) : 00:00:06.27
Finished backtesting 2024-01-16 @ 05-03-2024 12:01:20 PM - Execution time (HH:MM:SS.xx) : 00:00:06.12
Finished backtesting 2024-01-17 @ 05-03-2024 12:01:21 PM - Execution time (HH:MM:SS.xx) : 00:00:06.48
Finished backtesting 2024-01-18 @ 05-03-2024 12:01:22 PM - Execution time (HH:MM:SS.xx) : 00:00:06.37
Finished backtesting 2024-01-19 @ 05-03-2024 12:01:23 PM - Execution time (HH:MM:SS.xx) : 00:00:06.32
Finished backtesting 2024-01-22 @ 05-03-2024 12:01:23 PM - Execution time (HH:MM:SS.xx) : 00:00:06.09
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
warnings.warn(
Finished backtesting 2024-01-23 @ 05-03-2024 12:01:25 PM - Execution time (HH:MM:SS.xx) : 00:00:06.76
Finished backtesting 2024-01-24 @ 05-03-2024 12:01:25 PM - Execution time (HH:MM:SS.xx) : 00:00:06.62
Finished backtesting 2024-01-25 @ 05-03-2024 12:01:26 PM - Execution time (HH:MM:SS.xx) : 00:00:06.50
Finished backtesting 2024-01-29 @ 05-03-2024 12:01:27 PM - Execution time (HH:MM:SS.xx) : 00:00:06.61
Finished backtesting 2024-01-26 @ 05-03-2024 12:01:28 PM - Execution time (HH:MM:SS.xx) : 00:00:06.63
Finished backtesting 2024-01-30 @ 05-03-2024 12:01:28 PM - Execution time (HH:MM:SS.xx) : 00:00:06.57
Finished backtesting 2024-01-31 @ 05-03-2024 12:01:29 PM - Execution time (HH:MM:SS.xx) : 00:00:06.36
Finished backtesting 2024-02-01 @ 05-03-2024 12:01:30 PM - Execution time (HH:MM:SS.xx) : 00:00:06.70
Finished backtesting 2024-02-02 @ 05-03-2024 12:01:30 PM - Execution time (HH:MM:SS.xx) : 00:00:06.57
Finished backtesting 2024-02-05 @ 05-03-2024 12:01:31 PM - Execution time (HH:MM:SS.xx) : 00:00:06.64
Finished backtesting 2024-02-06 @ 05-03-2024 12:01:32 PM - Execution time (HH:MM:SS.xx) : 00:00:06.65
Finished backtesting 2024-02-07 @ 05-03-2024 12:01:33 PM - Execution time (HH:MM:SS.xx) : 00:00:06.53
So as you can see, I am occasionally getting errors with certain joblib workers. Here are my observations:

I'm running on 2 different computers, Windows 11 and macOS 14.4.1, both on Python 3 with joblib updated to the latest version.
The error seems to happen MUCH more often on the Mac than on the Windows machine. In fact, it's quite rare on the Windows machine, but on the Mac, if I call the joblib function more than a few hundred times, it's bound to happen at least once or twice. This discrepancy between macOS and Windows makes me think it may be an actual bug in the joblib code on macOS.
I also thought maybe I'm running out of RAM, but I have been monitoring my memory usage during execution, and the error happens often enough while there is still RAM to spare, so that's not it.
If I run my script multiple times, it never happens in the same place, nor does it happen the same number of times; sometimes it doesn't happen at all. So it seems to be random.
It's also my understanding that if a worker fails, joblib does NOT automatically retry the task that worker was running, so a bit of our data goes missing. If someone can confirm that this is correct, I'd appreciate it. As you can see in my function, I wrote in some error handling code, because I was looking for ways to make joblib re-run a task if it fails. My error handling code above is not working, however: if it caught the exception, it should at least print('oh nooooo'), but it doesn't even do that, so the exceptions are not being caught. Since joblib seems to have no internal way to automatically re-run failed workers, I assume I need to handle the re-run manually.
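To make the error handling part concrete, here is a minimal sketch of reporting failures through the return value instead of through print() and a global list. It assumes joblib's default process-based backend, where each worker has its own copy of module globals, so an errorlist.append(...) inside a worker would not show up in the main script. The names Outcome, safe_call and flaky are hypothetical and not part of joblib or my script; flaky only stands in for do_backtest. Note that a try/except like this can only catch exceptions raised by Python code inside the task; it cannot catch a worker process that is stopped or killed from outside.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Outcome:
    """Result of one task: either a value or the text of the error."""
    item: Any
    value: Any = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def safe_call(func: Callable[[Any], Any], item: Any) -> Outcome:
    """Run func(item) and report any exception in the returned Outcome.

    Hypothetical helper: the error is stored as a string so that it
    pickles cleanly on its way back from a worker process.
    """
    try:
        return Outcome(item=item, value=func(item))
    except Exception as exc:  # only exceptions raised inside the task
        return Outcome(item=item, error=f"{type(exc).__name__}: {exc}")


# Stand-in for do_backtest, used only to exercise the helper.
def flaky(index: int) -> int:
    if index % 3 == 0:
        raise ValueError(f"bad index {index}")
    return index * 10


# Run sequentially here; the same calls could be wrapped in delayed(...).
outcomes = [safe_call(flaky, i) for i in range(1, 5)]
failed_items = [o.item for o in outcomes if not o.ok]
print(failed_items)
```

With joblib, the call would presumably become Parallel(n_jobs=coremax)(delayed(safe_call)(do_backtest, i) for i in backtest_start_index_list), and the failed items could then be collected in the main script.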
The warning also says this may be caused by too short a worker timeout. From my research, if you don't specify a timeout, isn't the timeout infinite in joblib by default? If so, specifying a timeout would only make it shorter, so that wouldn't help... or so I think.
Ultimately, if a worker fails, I just need that data re-run. Any suggestions on how to code this properly so that the data is re-run when a worker fails?
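For reference, this is roughly the kind of manual re-run I have in mind, as a sketch only. It relies on the fact that do_backtest, as written above, returns None when its except branch runs, so a None result marks a task to resubmit. The names run_with_retries, run_batch, flaky_backtest and sequential_batch are hypothetical; the demo runs sequentially so it does not depend on joblib. As far as I can tell, a worker process that dies outright would surface as an exception from the Parallel call itself rather than as a None result, so this sketch does not cover that case.

```python
from typing import Any, Callable, Dict, List, Sequence


def run_with_retries(
    run_batch: Callable[[Sequence[Any]], List[Any]],
    items: Sequence[Any],
    max_attempts: int = 3,
) -> Dict[Any, Any]:
    """Call run_batch on items, resubmitting those whose result is None."""
    results: Dict[Any, Any] = {}
    pending = list(items)
    for _ in range(max_attempts):
        if not pending:
            break
        batch_results = run_batch(pending)
        still_pending = []
        for item, result in zip(pending, batch_results):
            if result is None:
                still_pending.append(item)  # failed, try again next round
            else:
                results[item] = result
        pending = still_pending
    if pending:
        raise RuntimeError(
            f"items still failing after {max_attempts} attempts: {pending}"
        )
    return results


# Demo with a stand-in task that fails once for index 2, then succeeds.
attempts: Dict[int, int] = {}


def flaky_backtest(index: int):
    attempts[index] = attempts.get(index, 0) + 1
    if index == 2 and attempts[index] == 1:
        return None  # mimics do_backtest falling into its except branch
    return f"result-{index}"


def sequential_batch(batch: Sequence[int]) -> List[Any]:
    return [flaky_backtest(i) for i in batch]


results = run_with_retries(sequential_batch, [1, 2, 3])
print(results)
```

With joblib, run_batch might be something like lambda batch: Parallel(n_jobs=coremax)(delayed(do_backtest)(i) for i in batch). Keying the results by index means the final list can be rebuilt in the original order regardless of which round produced each result.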