ValueError: Found array with 0 samples

marcellam · Apr-19-2020, 06:12 PM

Hey guys! First of all, let me say that I am completely new to this. I am working on my capstone project and have been trying to learn Python, but things are going downhill, haha. I need to train a model to create a demand forecast based on previous sales. I am using Spyder (via Anaconda) and I am getting an error that I have no idea how to fix.

The error is: "ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required."
It seems that the error happens in the "# SEGUNDO TREINO DE ERRO" (second error training) part. In that part I need to "train" the model to decrease the RMSLE.

Here is my code:

# IMPORT LIBRARIES
import pandas as pd
import numpy as np
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')

from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

# IMPORT FILE
data = pd.read_csv(r"C:\Users\Marcella\Documents\FEI\9 ciclo\TCC1\Banco de dados\Empresa Leo\SKU_csv2.csv", sep = ';')
df = pd.DataFrame(data)

# CREATE "PERIOD" COLUMN FROM "YEAR" AND "MONTH"
data["Period"] = data["Year"].astype(str) + "-" + data["Month"].astype(str) 

# We use the datetime formatting to make sure format is consistent 
data["Period"] = pd.to_datetime(data["Period"]).dt.strftime("%Y-%m")

data3 = data.filter(regex=r'Code|Timeline|Quantity')
data3.head()

# REVERSE THE TABLE ORDER
df = pd.DataFrame(data3)
dfOrdenado = df.sort_values(by = 'Code', ascending = True)
dfOrdenado.head()


# VOLUME DIFFERENCE BETWEEN CURRENT AND PREVIOUS TIMELINE (CURRENT MONTH - PREVIOUS MONTH)

data2 = dfOrdenado.copy()
data2['Last_Month_Quantity'] = data2.groupby(['Code'])['Quantity'].shift(-1)
data2['Last_Month_Diff'] = data2.groupby(['Code'])['Last_Month_Quantity'].diff()
data2 = data2.dropna()
data2.head()

# FIRST ERROR TRAINING
def rmsle(ytrue, ypred):
    return np.sqrt(mean_squared_log_error(ytrue, ypred))

mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    
    p = val['Last_Month_Quantity'].values

    error = rmsle(val['Quantity'].values, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

# ERROR HISTOGRAM
data2['Quantity'].hist(bins=20, figsize=(10,5))


# SECOND ERROR TRAINING (SEGUNDO TREINO DE ERRO)
mean_error = []
for Timeline in range(1,36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]

    xtr, xts = train.drop(['Quantity'], axis=1), val.drop(['Quantity'], axis=1)
    ytr, yts = train['Quantity'].values, val['Quantity'].values

    mdl = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    mdl.fit(xtr, ytr)

    p = mdl.predict(xts)

    error = rmsle(yts, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

And here is the output:

IPython 7.12.0 -- An enhanced Interactive Python.

Timeline 1 - Error 2.70350
Timeline 2 - Error 1.61701
Timeline 3 - Error 3.18454
Timeline 4 - Error 2.40659
Timeline 5 - Error 1.45284
Timeline 6 - Error 0.69815
Timeline 7 - Error 1.02462
Timeline 8 - Error 1.93734
Timeline 9 - Error 0.48172
Timeline 10 - Error 1.87422
Timeline 11 - Error 2.91395
Timeline 12 - Error 2.15465
Timeline 13 - Error 2.24474
Timeline 14 - Error 1.58562
Timeline 15 - Error 1.24788
Timeline 16 - Error 0.20848
Timeline 17 - Error 0.72884
Timeline 18 - Error 0.10210
Timeline 19 - Error 0.55287
Timeline 20 - Error 2.73459
Timeline 21 - Error 1.87676
Timeline 22 - Error 3.05041
Timeline 23 - Error 0.97720
Timeline 24 - Error 1.62730
Timeline 25 - Error 1.85567
Timeline 26 - Error 2.42298
Timeline 27 - Error 0.91488
Timeline 28 - Error 0.88662
Timeline 29 - Error 2.16283
Timeline 30 - Error 1.81922
Timeline 31 - Error 1.46269
Timeline 32 - Error 0.53905
Timeline 33 - Error 0.27669
Timeline 34 - Error 1.87140
Timeline 35 - Error 1.87198
Mean Error = 1.58486
Traceback (most recent call last):

  File "<ipython-input-1-587546307fe9>", line 70, in <module>
    mdl.fit(xtr, ytr)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)

  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 586, in check_array
    context))

ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required.

Could anyone help me with this? Thank you so much in advance!

jefsummers · Apr-22-2020, 04:12 PM

Could you post the actual error message in its entirety? There is often more specific information that can help to sort this out.
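
Going by the traceback already shown, shape=(0, 4) suggests that the training frame itself is empty. On the first pass of the second loop, Timeline is 1, and `data2['Timeline'] < 1` probably matches no rows (assuming Timeline starts at 1), so `mdl.fit` receives zero samples. The first loop never fits a model, which would explain why it runs to the end. Below is a minimal sketch of one way to guard against this. The toy data is made up, and `walk_forward_splits` is a hypothetical helper, not part of pandas or scikit-learn:

```python
import pandas as pd


def walk_forward_splits(df, time_col="Timeline", min_train_rows=1):
    """Yield (period, train, val) per period, skipping unusable folds.

    A fold is skipped when it has fewer than `min_train_rows` training
    rows or no validation rows. Hypothetical helper for illustration.
    """
    for period in sorted(df[time_col].unique()):
        train = df[df[time_col] < period]   # everything strictly earlier
        val = df[df[time_col] == period]    # the period being predicted
        if len(train) < min_train_rows or val.empty:
            continue
        yield int(period), train, val


# Toy data standing in for data2 (values are invented for illustration).
toy = pd.DataFrame({
    "Code": [1, 1, 1, 2, 2, 2],
    "Timeline": [1, 2, 3, 1, 2, 3],
    "Quantity": [10, 12, 9, 5, 7, 6],
})

# Same slicing as the original loop: nothing is earlier than period 1.
first_train = toy[toy["Timeline"] < 1]
print(first_train.shape)  # zero rows, which is what fit() complains about

# With the guard, the first period is skipped instead of raising an error.
folds = [(p, len(tr), len(va)) for p, tr, va in walk_forward_splits(toy)]
print(folds)
```

In the original loop, the simplest equivalent would be to start the range at 2, or to `continue` when `train` is empty, before calling `mdl.fit(xtr, ytr)`.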
