Hello everyone,
I'm currently working on a regression model to predict financial asset returns, building upon the research by Welch and Goyal and the work of Gu et al. (2020). However, I'm facing a few challenges and would really appreciate any guidance you can offer.
R² Comparison: I am using the r2_score from sklearn.metrics to calculate the R² of my model. In the paper by Gu et al. (2020), they also mention an R², but I want to ensure that the formula they used is identical to that of r2_score. Could someone confirm whether this function adheres to the standard R² formula, or are there specific nuances to consider?
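To make the question concrete, here is a minimal sketch of the two definitions as I understand them. It assumes the out-of-sample R² in Gu et al. (2020) divides by the sum of squared returns without demeaning, whereas r2_score subtracts the sample mean of y_true in the denominator. The helper name r2_zero_benchmark and the numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_zero_benchmark(y_true, y_pred):
    """R2 whose denominator is the raw sum of squared returns (no demeaning).

    Hypothetical helper: this is my reading of the out-of-sample R2,
    not code taken from the paper.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum(y_true ** 2)
    return 1.0 - rss / tss


# Made-up monthly returns and forecasts, for illustration only.
y_true = np.array([0.02, -0.01, 0.03, 0.00, 0.01])
y_pred = np.array([0.01, 0.00, 0.02, 0.01, 0.01])

r2_sklearn = r2_score(y_true, y_pred)        # benchmark: sample mean of y_true
r2_zero = r2_zero_benchmark(y_true, y_pred)  # benchmark: forecast of zero

print(f"sklearn r2_score:  {r2_sklearn:.4f}")
print(f"zero-benchmark R2: {r2_zero:.4f}")
```

Because the raw sum of squares is never smaller than the demeaned one, the zero-benchmark figure is always at least as large as the r2_score figure for the same predictions, so the two numbers should not be compared directly.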
Explanatory Variables: Currently, my model only uses years and months as explanatory variables. I'm looking to incorporate individual asset characteristics, macroeconomic variables, and potentially the products of these variables.
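A minimal sketch of how the files could be combined, assuming the characteristics file is keyed by permno and DATE (yyyymmdd) and the macro file by yyyymm, as in the samples in this post. build_features is a hypothetical helper, the values are made up, and any lagging of the predictors relative to the return month is deliberately left out.

```python
import pandas as pd

# Tiny made-up panels; only the column names follow the files in this post.
chars = pd.DataFrame({
    "permno": [10000, 10000, 10001],
    "DATE": [19860228, 19860331, 19860228],
    "mvel1": [16100.0, 11960.0, 6000.0],
    "mom1m": [-0.25, 0.05, 0.02],
})
macro = pd.DataFrame({
    "yyyymm": [198602, 198603],
    "tbl": [0.070, 0.065],
    "dfy": [0.012, 0.011],
})


def build_features(chars, macro, char_cols, macro_cols):
    """Merge monthly macro series onto the stock panel and add
    characteristic x macro product terms (hypothetical helper)."""
    panel = chars.copy()
    # DATE is yyyymmdd, so integer division by 100 gives the yyyymm merge key.
    panel["yyyymm"] = panel["DATE"] // 100
    panel = panel.merge(macro[["yyyymm"] + macro_cols], on="yyyymm", how="left")
    for c in char_cols:
        for m in macro_cols:
            panel[f"{c}_x_{m}"] = panel[c] * panel[m]
    return panel


features = build_features(chars, macro, ["mvel1", "mom1m"], ["tbl", "dfy"])
print(features.columns.tolist())
```

With all characteristics and all macro series the number of product columns grows quickly (characteristics times macro variables), so it may be worth starting with a small subset.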
Here is what the ticker_10000 file looks like with all the features:

permno DATE mvel1 beta betasq chmom dolvol idiovol indmom mom1m mom6m mom12m mom36m pricedelay turn absacc acc age agr bm bm_ia cashdebt cashpr cfp cfp_ia chatoia chcsho chempia chinv chpmia convind currat depr divi divo dy egr ep gma grcapx grltnoa herf hire invest lev lgr mve_ia operprof orgcap pchcapx_ia pchcurrat pchdepr pchgm_pchsale pchquick pchsale_pchinvt pchsale_pchrect pchsale_pchxsga pchsaleinv pctacc ps quick rd rd_mve rd_sale realestate roic salecash saleinv salerec secured securedind sgr sin sp tang tb aeavol cash chtx cinvest ear nincr roaq roavol roeq rsup stdacc stdcf ms baspread ill maxret retvol std_dolvol std_turn zerotrade sic2
10000 19860228 16100 0.2111585701 0.0769982184 1.2440505E-6 0.25 0.0652783882 1.2312885045 2.1208045717 4.7851753E-8 39
10000 19860331 11960 0.2624713557 -0.257142872 0.0555114619 1.8917602E-6 0.0447761193 0.0310041349 1.0210892479 1.0797738144 1.0233918E-7 39

Here is the returns file for ticker_10000:

PERMNO  date      NCUSIP    TICKER  COMNAM                     PRC     RET        RETX
10000   19851231
10000   19860131  68391610  OMFGA   OPTIMUM MANUFACTURING INC  -4.375  C          C
10000   19860228  68391610  OMFGA   OPTIMUM MANUFACTURING INC  -3.25   -0.257143  -0.257143

Here are the macroeconomic series:

yyyymm  b/m          tbl     ntis         Rfree   svar         dp            ep            tms     dfy
195701  0.567242675  0.0311  0.027991994  0.0027  0.000901942  -3.248451342  -2.574685554  0.0017  0.0072

Analysis by Assets and Sub-Periods: I also want to extend the analysis to cover multiple assets and different sub-periods. What would be the best way to structure my code to loop over multiple assets and sub-periods? Are there specific code examples or frameworks you would recommend?
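For the asset and sub-period loop, this is the kind of structure I have in mind: a rough sketch on synthetic data, assuming a long-format panel with permno, date, feature, and return columns. evaluate_by_asset_and_period, the 70/30 chronological split, the min_obs threshold, and the sub-period boundaries are all hypothetical choices, and OLS stands in for whichever model is being tested.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def r2_zero_benchmark(y_true, y_pred):
    """R2 with a raw sum-of-squares denominator (no demeaning)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)


def evaluate_by_asset_and_period(panel, feature_cols, target_col, periods,
                                 train_frac=0.7, min_obs=10):
    """Fit one OLS per (asset, sub-period), training on the earliest
    train_frac of the dates and testing on the remainder."""
    rows = []
    for permno, asset in panel.groupby("permno"):
        asset = asset.sort_values("date")
        for name, (start, end) in periods.items():
            window = asset[(asset["date"] >= start) & (asset["date"] <= end)]
            if len(window) < min_obs:
                continue  # too few observations to split
            cut = int(len(window) * train_frac)
            train, test = window.iloc[:cut], window.iloc[cut:]
            model = LinearRegression().fit(train[feature_cols], train[target_col])
            pred = model.predict(test[feature_cols])
            rows.append({
                "permno": permno,
                "period": name,
                "n_test": len(test),
                "r2_oos": r2_zero_benchmark(test[target_col], pred),
            })
    return pd.DataFrame(rows)


# Synthetic panel: two assets, 48 monthly observations each (made-up data).
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=48, freq="MS")
frames = []
for permno in (10000, 10001):
    x = rng.normal(size=len(dates))
    ret = 0.5 * x + rng.normal(scale=0.1, size=len(dates))
    frames.append(pd.DataFrame({"permno": permno, "date": dates, "x": x, "RET": ret}))
panel = pd.concat(frames, ignore_index=True)

periods = {
    "2000-2001": (pd.Timestamp("2000-01-01"), pd.Timestamp("2001-12-31")),
    "2002-2003": (pd.Timestamp("2002-01-01"), pd.Timestamp("2003-12-31")),
}

results = evaluate_by_asset_and_period(panel, ["x"], "RET", periods)
print(results)
```

The split is chronological rather than random so that each test window only contains dates after its training window; a per-asset fit is only one option, and pooling all assets within a sub-period would need the groupby moved inside the period loop.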
Thank you in advance for your help and suggestions. Any advice or code examples would be greatly appreciated and could really help improve the robustness and efficacy of my model.
Best regards,
My code:
# -*- coding: utf-8 -*-
"""
Created on Wed May 1 15:26:41 2024

@author: Lucas
"""
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

# Path to the return files
path = r"C:\Users\Lucas\Desktop\Base de données\Stocks Prices - Returns"
files = [f for f in os.listdir(path) if f.startswith('return_')]

data_frames = []
for file in files:
    full_path = os.path.join(path, file)
    df = pd.read_csv(full_path, delimiter='\t', engine='python', dtype={'RETX': str})
    data_frames.append(df)

data = pd.concat(data_frames)
data.columns = data.columns.str.replace('"', '').str.strip()
data.dropna(subset=['RETX'], inplace=True)
# Drop rows whose RETX contains anything other than digits, '.' or '-' (e.g. the code "C")
data = data[~data['RETX'].str.contains('[^0-9.-]', regex=True)]
data['date'] = pd.to_datetime(data['date'], format='%Y%m%d')
data['RETX'] = data['RETX'].astype(float)
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month

# Select the columns for X and y
X = data[['year', 'month']]  # Add other relevant numeric features here
y = data['RETX']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize for the neural networks and other models where needed
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Machine learning models
model_ols = LinearRegression().fit(X_train, y_train)
model_lasso = Lasso(alpha=0.1).fit(X_train, y_train)
model_ridge = Ridge(alpha=1.0).fit(X_train, y_train)
model_svr = SVR(kernel='rbf').fit(X_train_scaled, y_train)  # SVR uses the standardized data
model_rf = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# Neural network model
model_nn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1)
])
model_nn.compile(optimizer='adam', loss='mse')
model_nn.fit(X_train_scaled, y_train, epochs=100, batch_size=32)

# Deep neural network
model_dnn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dense(64, activation='relu'),
    Dense(1)
])
model_dnn.compile(optimizer='adam', loss='mse')
model_dnn.fit(X_train_scaled, y_train, epochs=10, batch_size=32)

# LSTM network (input reshaped to: samples, 1 time step, features)
X_train_lstm = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
model_lstm = Sequential([
    LSTM(50, activation='relu', input_shape=(1, X_train_scaled.shape[1])),
    Dense(1)
])
model_lstm.compile(optimizer='adam', loss='mse')
model_lstm.fit(X_train_lstm, y_train, epochs=100, batch_size=32)

# Predictions and evaluation
y_pred_ols = model_ols.predict(X_test)
y_pred_lasso = model_lasso.predict(X_test)
y_pred_ridge = model_ridge.predict(X_test)
y_pred_svr = model_svr.predict(X_test_scaled)
y_pred_rf = model_rf.predict(X_test)
y_pred_dnn = model_dnn.predict(X_test_scaled).flatten()
X_test_lstm = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])
y_pred_lstm = model_lstm.predict(X_test_lstm).flatten()

# Print the errors
print("MSE - OLS:", mean_squared_error(y_test, y_pred_ols))
print("MSE - Lasso:", mean_squared_error(y_test, y_pred_lasso))
print("MSE - Ridge:", mean_squared_error(y_test, y_pred_ridge))
print("MSE - SVR:", mean_squared_error(y_test, y_pred_svr))
print("MSE - RandomForest:", mean_squared_error(y_test, y_pred_rf))
print("MSE - DNN:", mean_squared_error(y_test, y_pred_dnn))
print("MSE - LSTM:", mean_squared_error(y_test, y_pred_lstm))

# Comparison plot
plt.figure(figsize=(10, 5))
plt.plot(y_test.index, y_test, label='Real')
plt.plot(y_test.index, y_pred_ols, label='OLS Predicted')
plt.legend()
plt.show()

# Model names and predictions, in the same order, for the plots below
model_names = ['OLS', 'LASSO', 'Ridge', 'SVR', 'Random Forest', 'DNN', 'LSTM']
all_preds = [y_pred_ols, y_pred_lasso, y_pred_ridge, y_pred_svr, y_pred_rf, y_pred_dnn, y_pred_lstm]

# Compute the R2 for each model
r2_scores = [r2_score(y_test, y_pred) for y_pred in all_preds]

# Plot the R2 results
plt.figure(figsize=(10, 6))
plt.bar(model_names, r2_scores, color='blue')
plt.xlabel('Model Type')
plt.ylabel('R2 Score')
plt.title('Comparison of R2 Scores Across Different Models')
plt.ylim(0, 1)  # Adjust the limit to better fit your data if needed
plt.show()


def calculate_r2_total(actual, predicted):
    rss = np.sum((actual - predicted) ** 2)
    tss = np.sum(actual ** 2)  # no demeaning, unlike r2_score
    return 1 - rss / tss


def calculate_r2_predictive(actual, predicted):
    # This function is the same as R2 total in this simplified context
    return calculate_r2_total(actual, predicted)


# Computation for the OLS model
r2_total_ols = calculate_r2_total(y_test, y_pred_ols)
r2_predictive_ols = calculate_r2_predictive(y_test, y_pred_ols)  # Assume future predictions are the same for simplification
print(f"R2 Total OLS: {r2_total_ols}")
print(f"R2 Predictive OLS: {r2_predictive_ols}")

# R2 score lists for the visualization (SVR, DNN and LSTM predictions are already flattened)
r2_totals = [calculate_r2_total(y_test, y_pred) for y_pred in all_preds]
r2_predictives = [calculate_r2_predictive(y_test, y_pred) for y_pred in all_preds]

# Grouped bar chart of R2 total and R2 predictive
x = np.arange(len(model_names))  # bar positions
width = 0.35  # bar width
fig, ax = plt.subplots(figsize=(12, 6))
rects1 = ax.bar(x - width/2, r2_totals, width, label='R2 Total')
rects2 = ax.bar(x + width/2, r2_predictives, width, label='R2 Predictive')

# Add labels, title, etc.
ax.set_xlabel('Model Type')
ax.set_ylabel('R2 Score')
ax.set_title('Comparison of R2 Total and Predictive Across Different Models')
ax.set_xticks(x)
ax.set_xticklabels(model_names)
ax.legend()
plt.show()

# Test-set indices for the x axis
test_indices = y_test.index

# Figure setup
plt.figure(figsize=(18, 12))

# One scatter panel per model
for i, (name, y_pred) in enumerate(zip(model_names, all_preds), start=1):
    plt.subplot(3, 3, i)
    plt.scatter(test_indices, y_test, label='Real', color='blue', marker='o')
    plt.scatter(test_indices, y_pred, label=f'{name} Predicted', color='red', marker='o')
    plt.title(f'{name} Predictions')
    plt.legend()

# Adjust the spacing between the panels
plt.tight_layout()

# Show the figure
plt.show()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time


# Function to simulate model predictions
def simulate_model_predictions(data, params):
    """
    Simulate predictions based on a simple, fictitious linear regression for the example.
    """
    return data['feature'] * params['slope'] + np.random.normal(0, params['noise_level'], size=len(data))


# Function to plot the predictions
def plot_predictions(real, predicted, title):
    plt.figure(figsize=(10, 6))
    plt.scatter(real.index, real, color='blue', label='Real Returns')
    plt.scatter(real.index, predicted, color='red', label='Predicted Returns', alpha=0.5)
    plt.title(title)
    plt.xlabel('Index')
    plt.ylabel('Returns')
    plt.legend()
    plt.show()


# Simulation data
data = pd.DataFrame({
    'feature': np.random.rand(100),
    'RET': np.random.rand(100) * 10
})

# Fictitious parameters and model simulation
models = {
    'GLM': {'slope': 10, 'noise_level': 1},
    'LightGBM': {'slope': 9, 'noise_level': 1.5},
    'Random Forest': {'slope': 8, 'noise_level': 2},
    'XGBoost': {'slope': 7, 'noise_level': 2.5}
}

for model_name, params in models.items():
    start_time = time.time()
    predictions = simulate_model_predictions(data, params)
    elapsed_time = time.time() - start_time
    print(f'{model_name} simulation finished! Execution time: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')
    plot_predictions(data['RET'], predictions, f'{model_name} Model Predictions')