Python Forum
Join Predicted values with test dataset - Printable Version




Join Predicted values with test dataset - bhuwan - Mar-25-2019

I am a beginner in machine learning and am writing my first machine learning code. I have trained a model and obtained y_pred. Now I want to join X_test with y_pred so that I can compare the predicted values with the real values. I noticed that the indices of X_test differ from those of y_pred, so how can I join them and make sure that the y_pred values are aligned with the respective rows of X_test?
y_pred = predict_classifier(clf, X_test, y_test)



RE: Join Predicted values with test dataset - scidam - Mar-26-2019

Could you provide the full code? The problem is not clear: why do you need to join the arrays? You can check their sizes, e.g. using len, or, if they are NumPy arrays, via the .shape attribute.
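As a minimal sketch of that size check (the array sizes here are made up for illustration and are not taken from the original post):

```python
import numpy as np

# Hypothetical stand-ins: 5 test rows with 3 features, one predicted label per row.
X_test = np.zeros((5, 3))
y_pred = np.zeros(5, dtype=int)

print(len(X_test), len(y_pred))    # 5 5
print(X_test.shape, y_pred.shape)  # (5, 3) (5,)

# The two can be compared row by row only if the row counts match.
same_rows = X_test.shape[0] == y_pred.shape[0]
```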

Let's consider the common steps of verifying an ML model in general.

1) You have an original dataset X and class labels y. Suppose these arrays have shapes (n, m) and (n,) respectively, i.e. we have m features (columns), n measurements (rows) and n class labels, one per row. The classes can be encoded as integers (some ML frameworks work only with numerical values).

2) We could train our classifier (or model) on X and y, apply the trained model to X to get y_pred with the same shape as y, and compute some quality measure such as precision, recall or accuracy: measure_score(y, y_pred) => some value. Unfortunately, measures computed this way are overestimated, because the model is scored on the same rows it was fitted to, so overfitting goes unnoticed.

3) A common way to deal with the overfitting problem consists of splitting the original dataset (X, y) into two datasets: (X_train, y_train) and (X_test, y_test). Usually the split is random, e.g. 85 % of the rows of X (and the corresponding entries of y) are selected for X_train and y_train, and the remaining 15 % are used for X_test and y_test. The first pair (X_train, y_train) is used to train the model. The second pair, which was not shown to the model, is used for testing: we apply the model to X_test and compare the resulting y_pred with y_test; these vectors have the same size.
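The gap between steps 2 and 3 can be seen in a small experiment. This is only a sketch under assumptions that are not part of the thread: the data is synthetic (make_classification with noisy labels), and a decision tree is used because it memorizes training rows easily.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns random labels to about 20% of the samples.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)

# 85 % of the rows for training, 15 % held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Score on the rows the model has seen vs. the rows it has not seen.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy on training rows: {train_acc:.2f}")
print(f"accuracy on held-out rows: {test_acc:.2f}")
```

The first number is the optimistic estimate from step 2; the second is the estimate from step 3.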

So, pseudocode would be the following:

Quote:X, y -- original dataset

(X_train, y_train), (X_test, y_test) = split_data(X, y)

model -- ML-model used to solve classification problem

model.fit(X_train, y_train) --- fitting the model on train data

# Now we have a fitted model, and we wish to estimate its accuracy

y_pred = model.predict(X_test) # predict classes on test data

some_accuracy_measure(y_pred, y_test) => float value (usually in [0,1])
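A runnable version of the pseudocode above might look like this with scikit-learn. The iris dataset, the linear SVC and accuracy_score are stand-ins chosen for illustration; any dataset, classifier and metric would fit the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X, y -- original dataset (iris is used here only as an example)
X, y = load_iris(return_X_y=True)

# Random split: 85 % for training, 15 % for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# ML model used to solve the classification problem
model = SVC(kernel="linear")
model.fit(X_train, y_train)      # fit the model on the training data

y_pred = model.predict(X_test)   # predict classes on the test data

# Accuracy is a float in [0, 1]
score = accuracy_score(y_test, y_pred)
print(score)
```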



RE: Join Predicted values with test dataset - bhuwan - Mar-26-2019

Thank you for the reply.
This is how I get the y_pred.
# split data 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, shuffle=True)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)	

# Fitting SVM
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test
y_pred = classifier.predict(X_test)
This is my desired result dataset.
df1 = pd.concat([X_test.reset_index(drop=True), y_pred.reset_index(drop=True)], axis=1)
I want to join X_test and y_pred so that I can compare the predicted results one by one. With the concat above, is each row of y_test and y_pred aligned in the same order as in X_test?
I am developing a model to classify the behavior of people so that necessary actions can be taken for each individual.


RE: Join Predicted values with test dataset - scidam - Mar-27-2019

(Mar-26-2019, 03:54 PM)bhuwan Wrote: By doing above concat, does each row of y_test and y_pred align in the same order as in x_test?

The .predict method doesn't change the order of the classified cases. Let X_test.shape = (k, m); then y_test.shape = (k,), and train_test_split guarantees that row i of X_test and element i of y_test come from the same original sample. Finally, y_pred is produced by .predict, which returns one label per row of X_test in the same order, so y_pred.shape = (k,) as well.
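One possible way to build the joined table is sketched below. It rests on several assumptions that are not in the thread: X is a pandas DataFrame and y a Series (iris is used as stand-in data), and the names X_train_scaled, X_test_scaled and result are made up for this example. With scikit-learn's default settings, StandardScaler and .predict return plain NumPy arrays, which have no .reset_index method, so the sketch keeps the unscaled X_test DataFrame and reuses its index when attaching y_pred.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: X is a DataFrame and y is a Series with the same index.
X, y = load_iris(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, shuffle=True)

# Scale into new variables so the X_test DataFrame and its index are kept.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

classifier = SVC(kernel="linear", random_state=0)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)  # NumPy array, same row order as X_test

# Attach the true and predicted labels to a copy of the test rows.
result = X_test.copy()
result["y_test"] = y_test                                 # aligned by index
result["y_pred"] = pd.Series(y_pred, index=X_test.index)  # aligned by position
result["correct"] = result["y_test"] == result["y_pred"]
print(result.head())
```

Each row of result keeps its original row label from X, so a prediction can be traced back to the sample it belongs to.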


RE: Join Predicted values with test dataset - bhuwan - Mar-28-2019

Thank you so much, this is what I needed to confirm.