Apr-19-2020, 06:12 PM
Hey guys! First of all, let me say that I am completely new to this. I am working on my capstone and have been trying to learn Python, but things are going downhill, haha. I need to train a model to create a demand forecast based on previous sales. I am using Spyder (via Anaconda) and I am getting an error that I have no idea how to fix.
The error is: "ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required."
The error seems to happen in the "# SEGUNDO TREINO DE ERRO" (second error training) part. In that part I need to "train" the code to decrease the RMSLE.
Here is my code:
# IMPORTAR BIBLIOTECA (import libraries)
import pandas as pd
import numpy as np
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

# IMPORTAR ARQUIVO (import file)
data = pd.read_csv(r"C:\Users\Marcella\Documents\FEI\9 ciclo\TCC1\Banco de dados\Empresa Leo\SKU_csv2.csv", sep = ';')
df = pd.DataFrame(data)

# CRIAR COLUNA "PERÍODO" COM "ANO" E "MÊS" (create "Period" column from "Year" and "Month")
data["Period"] = data["Year"].astype(str) + "-" + data["Month"].astype(str)
# We use the datetime formatting to make sure format is consistent
data["Period"] = pd.to_datetime(data["Period"]).dt.strftime("%Y-%m")
data3 = data.filter(regex=r'Code|Timeline|Quantity')
data3.head()

# INVERTER A ORDEM DA TABELA (reorder the table)
df = pd.DataFrame(data3)
dfOrdenado = df.sort_values(by = 'Code', ascending = True)
dfOrdenado.head()

# DIFERENÇA DE VOLUME TIMELINE ATUAL E ANTERIOR (MES ATUAL-MES ANTERIOR)
# (volume difference between current and previous timeline: current month - previous month)
data2 = dfOrdenado.copy()
data2['Last_Month_Quantity'] = data2.groupby(['Code'])['Quantity'].shift(-1)
data2['Last_Month_Diff'] = data2.groupby(['Code'])['Last_Month_Quantity'].diff()
data2 = data2.dropna()
data2.head()

# PRIMEIRO TREINO DE ERRO (first error training)
def rmsle(ytrue, ypred):
    return np.sqrt(mean_squared_log_error(ytrue, ypred))

mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    p = val['Last_Month_Quantity'].values
    error = rmsle(val['Quantity'].values, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

# HISTOGRAMA DO ERRO (error histogram)
data2['Quantity'].hist(bins=20, figsize=(10,5))

# SEGUNDO TREINO DE ERRO (second error training)
mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    xtr, xts = train.drop(['Quantity'], axis=1), val.drop(['Quantity'], axis=1)
    ytr, yts = train['Quantity'].values, val['Quantity'].values
    mdl = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    mdl.fit(xtr, ytr)
    p = mdl.predict(xts)
    error = rmsle(yts, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

And here is the Output:
IPython 7.12.0 -- An enhanced Interactive Python.

Timeline 1 - Error 2.70350
Timeline 2 - Error 1.61701
Timeline 3 - Error 3.18454
Timeline 4 - Error 2.40659
Timeline 5 - Error 1.45284
Timeline 6 - Error 0.69815
Timeline 7 - Error 1.02462
Timeline 8 - Error 1.93734
Timeline 9 - Error 0.48172
Timeline 10 - Error 1.87422
Timeline 11 - Error 2.91395
Timeline 12 - Error 2.15465
Timeline 13 - Error 2.24474
Timeline 14 - Error 1.58562
Timeline 15 - Error 1.24788
Timeline 16 - Error 0.20848
Timeline 17 - Error 0.72884
Timeline 18 - Error 0.10210
Timeline 19 - Error 0.55287
Timeline 20 - Error 2.73459
Timeline 21 - Error 1.87676
Timeline 22 - Error 3.05041
Timeline 23 - Error 0.97720
Timeline 24 - Error 1.62730
Timeline 25 - Error 1.85567
Timeline 26 - Error 2.42298
Timeline 27 - Error 0.91488
Timeline 28 - Error 0.88662
Timeline 29 - Error 2.16283
Timeline 30 - Error 1.81922
Timeline 31 - Error 1.46269
Timeline 32 - Error 0.53905
Timeline 33 - Error 0.27669
Timeline 34 - Error 1.87140
Timeline 35 - Error 1.87198
Mean Error = 1.58486
Traceback (most recent call last):

  File "<ipython-input-1-587546307fe9>", line 70, in <module>
    mdl.fit(xtr, ytr)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 586, in check_array
    context))

ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required.

Could anyone help me with this? Thank you so much in advance!
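
One possible reading of the traceback, offered as a sketch and not a confirmed diagnosis: the output shows the first loop finishing (Timeline 1 to 35 plus the mean), and the second loop failing before it prints anything, so the failure would be on its first pass. There, the training slice is the rows with Timeline < 1, which may simply be empty, and that would match the shape (0, 4) in the message. The snippet below reproduces the situation on made-up data; the `toy` frame and the `walk_forward_splits` helper are hypothetical names for illustration, not part of the original script.

```python
import pandas as pd

# Hypothetical toy data standing in for data2 (the real CSV is not shown).
toy = pd.DataFrame({
    "Code": [1, 1, 1, 2, 2, 2],
    "Timeline": [1, 2, 3, 1, 2, 3],
    "Quantity": [10, 12, 9, 5, 7, 6],
})


def walk_forward_splits(frame, timelines):
    """Yield (timeline, train, val), skipping splits where either part is empty."""
    for timeline in timelines:
        train = frame[frame["Timeline"] < timeline]
        val = frame[frame["Timeline"] == timeline]
        if train.empty or val.empty:
            # Nothing to fit on (or nothing to score), so skip this period.
            continue
        yield timeline, train, val


# At Timeline 1 no row is earlier, so the training slice has 0 rows.
first_train = toy[toy["Timeline"] < 1]
print(first_train.shape)  # (0, 3)

# Only the periods that have earlier data are kept.
used = [t for t, _, _ in walk_forward_splits(toy, range(1, 4))]
print(used)  # [2, 3]
```

If that reading is right, starting the second loop at a later Timeline (or skipping empty slices as above) would let `mdl.fit(xtr, ytr)` receive at least one row. The first loop never calls `fit`, which would explain why it ran without complaint.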