ValueError: Found array with 0 samples

ValueError: Found array with 0 samples - marcellam - Apr-19-2020

Hey guys! First of all, let me say that I am completely new to this. I am working on my capstone project and have been trying to learn Python, but things are going downhill, haha. I need to train a model to create a demand forecast based on previous sales. I am using Spyder (via Anaconda) and I am getting an error that I have no idea how to fix.

The error is: "ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required."
It seems that the error happens in the "#SEGUNDO TREINO DE ERRO" (second error training) part. In that part I need to "train" the model to decrease the RMSLE.

Here is my code:

# IMPORT LIBRARIES
import pandas as pd
import numpy as np
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')

from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

# IMPORT FILE
data = pd.read_csv(r"C:\Users\Marcella\Documents\FEI\9 ciclo\TCC1\Banco de dados\Empresa Leo\SKU_csv2.csv", sep = ';')
df = pd.DataFrame(data)

# CREATE "PERIOD" COLUMN FROM "YEAR" AND "MONTH"
data["Period"] = data["Year"].astype(str) + "-" + data["Month"].astype(str) 

# We use the datetime formatting to make sure format is consistent 
data["Period"] = pd.to_datetime(data["Period"]).dt.strftime("%Y-%m")

data3 = data.filter(regex=r'Code|Timeline|Quantity')
data3.head()

# REORDER THE TABLE
df = pd.DataFrame(data3)
dfOrdenado = df.sort_values(by = 'Code', ascending = True)
dfOrdenado.head()


# VOLUME DIFFERENCE BETWEEN CURRENT AND PREVIOUS TIMELINE (CURRENT MONTH - PREVIOUS MONTH)

data2 = dfOrdenado.copy()
data2['Last_Month_Quantity'] = data2.groupby(['Code'])['Quantity'].shift(-1)
data2['Last_Month_Diff'] = data2.groupby(['Code'])['Last_Month_Quantity'].diff()
data2 = data2.dropna()
data2.head()

# FIRST ERROR TRAINING
def rmsle(ytrue, ypred):
    return np.sqrt(mean_squared_log_error(ytrue, ypred))

mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    
    p = val['Last_Month_Quantity'].values

    error = rmsle(val['Quantity'].values, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

# ERROR HISTOGRAM
data2['Quantity'].hist(bins=20, figsize=(10,5))


# SECOND ERROR TRAINING (SEGUNDO TREINO DE ERRO)
mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]

    xtr, xts = train.drop(['Quantity'], axis=1), val.drop(['Quantity'], axis=1)
    ytr, yts = train['Quantity'].values, val['Quantity'].values

    mdl = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    mdl.fit(xtr, ytr)

    p = mdl.predict(xts)

    error = rmsle(yts, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

And here is the Output:

IPython 7.12.0 -- An enhanced Interactive Python.

Timeline 1 - Error 2.70350
Timeline 2 - Error 1.61701
Timeline 3 - Error 3.18454
Timeline 4 - Error 2.40659
Timeline 5 - Error 1.45284
Timeline 6 - Error 0.69815
Timeline 7 - Error 1.02462
Timeline 8 - Error 1.93734
Timeline 9 - Error 0.48172
Timeline 10 - Error 1.87422
Timeline 11 - Error 2.91395
Timeline 12 - Error 2.15465
Timeline 13 - Error 2.24474
Timeline 14 - Error 1.58562
Timeline 15 - Error 1.24788
Timeline 16 - Error 0.20848
Timeline 17 - Error 0.72884
Timeline 18 - Error 0.10210
Timeline 19 - Error 0.55287
Timeline 20 - Error 2.73459
Timeline 21 - Error 1.87676
Timeline 22 - Error 3.05041
Timeline 23 - Error 0.97720
Timeline 24 - Error 1.62730
Timeline 25 - Error 1.85567
Timeline 26 - Error 2.42298
Timeline 27 - Error 0.91488
Timeline 28 - Error 0.88662
Timeline 29 - Error 2.16283
Timeline 30 - Error 1.81922
Timeline 31 - Error 1.46269
Timeline 32 - Error 0.53905
Timeline 33 - Error 0.27669
Timeline 34 - Error 1.87140
Timeline 35 - Error 1.87198
Mean Error = 1.58486
Traceback (most recent call last):

  File "<ipython-input-1-587546307fe9>", line 70, in <module>
    mdl.fit(xtr, ytr)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 586, in check_array
    context))

ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required.

Could anyone help me with this? Thank you so much in advance!
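
A note on reading the traceback: shape=(0, 4) means the array passed to mdl.fit has zero rows and four columns, which would match the four columns left after dropping 'Quantity'. One likely cause, assuming Timeline 1 is the earliest period in the data, is that on the first pass of the second loop the filter data2['Timeline'] < Timeline selects nothing. The sketch below reproduces that on a small synthetic frame; the values are made up and only the column names are taken from the post.

```python
import pandas as pd

# Synthetic stand-in for data2 (values are invented for illustration only).
data2 = pd.DataFrame({
    "Code": [1, 1, 1, 2, 2, 2],
    "Timeline": [1, 2, 3, 1, 2, 3],
    "Quantity": [10, 12, 9, 5, 7, 6],
    "Last_Month_Quantity": [12, 9, 11, 7, 6, 8],
    "Last_Month_Diff": [2.0, -3.0, 2.0, 2.0, -1.0, 2.0],
})

first = data2["Timeline"].min()

# Same filter as the loop in the post, evaluated at the earliest period:
# no row has a Timeline smaller than the first one, so the slice is empty.
train = data2[data2["Timeline"] < first]
xtr = train.drop(["Quantity"], axis=1)

print(first, xtr.shape)  # 1 (0, 4)
```

The first loop in the post never fits a model on train, which would explain why it runs through all 35 periods without complaint.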

RE: ValueError: Found array with 0 samples - jefsummers - Apr-22-2020

Could you post the actual error message in its entirety? There is often more specific information that can help to sort this out.
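
If the empty first training slice is indeed the cause, one way around it is to skip any period that has no earlier rows to train on. The following is a minimal sketch under that assumption, not a confirmed fix for the original data; walk_forward_splits and its parameters are hypothetical names introduced here, and the toy frame is invented.

```python
import pandas as pd

def walk_forward_splits(df, periods, time_col="Timeline", min_train_rows=1):
    """Yield (period, train, val), skipping periods with too little data.

    A period is skipped when fewer than `min_train_rows` rows precede it
    or when it has no validation rows at all.
    """
    for period in periods:
        train = df[df[time_col] < period]
        val = df[df[time_col] == period]
        if len(train) < min_train_rows or val.empty:
            continue
        yield period, train, val

# Toy data with three periods and two product codes (made-up values).
toy = pd.DataFrame({
    "Code": [1, 1, 1, 2, 2, 2],
    "Timeline": [1, 2, 3, 1, 2, 3],
    "Quantity": [10, 12, 9, 5, 7, 6],
})

splits = list(walk_forward_splits(toy, range(1, 5)))
print([period for period, _, _ in splits])  # [2, 3]
```

Period 1 is dropped because nothing precedes it, and period 4 is dropped because it has no rows. In the loop from the post, the body (the drop, mdl.fit, mdl.predict and rmsle calls) would go inside a for loop over these splits. If Timeline 1 really is the first period, simply starting the original loop at range(2, 36) should have the same effect.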