 ValueError: X has 231 features per sample; expecting 1228
#1
Here is the top of the script where the model was trained (I am using Logistic Regression):

data_raw = pd.read_sql(sql,cnxn)

pd.Series(data_raw.columns) 
pd.Series(data_raw.dtypes)

data_raw.describe(include='all')

data_raw['collision_type'] = data_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')

data_raw['property_damage'] = data_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')

data_raw.isnull().sum()

dropping_columns = ['months_as_customer', 'policy_bind_date', 'age', 'policy_number', 'policy_annual_premium', 'insured_zip', 
                    'capital_gains', 'capital_loss', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim',
                   'auto_year']

data_cleaned = data_raw.drop(dropping_columns, axis=1)

data_preprocessed = pd.get_dummies(data_cleaned, drop_first=True)


targets = data_preprocessed['fraud_reported_Y']
features = data_preprocessed.drop(['fraud_reported_Y'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=420)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
Now I'm trying to make predictions on a test input (test dataset imported from SQL table):

test = df['TestTable']
test = test[0]
sql = 'SELECT * FROM '+ test
test_raw = pd.read_sql(sql,cnxn)

#sample_rows = test_raw.sample(n=5)

test_raw.describe(include='all')

test_raw['collision_type'] = test_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')

test_raw['property_damage'] = test_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')

test_raw.isnull().sum()

print(test_raw.shape)

test_dropped = test_raw.drop(dropping_columns, axis=1)
test_preprocessed = pd.get_dummies(test_dropped, drop_first=True)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
test_predicted = logreg.predict(test_preprocessed)
Here is the error I got:

Error:
Traceback (most recent call last):
  File "<ipython-input-149-e6d470e94433>", line 1, in <module>
    runfile('C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py', wdir='C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master')
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py", line 402, in <module>
    test_predicted = logreg.predict(test_preprocessed)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 231 features per sample; expecting 1228
My training dataset has 999 rows and includes a final prediction result column, while the test dataset has 50 rows and no prediction result column. The other columns are basically the same.

I'm quite a newbie, and I'm pretty sure there is something basic about model training that I'm missing. Thank you guys so much for helping me out.
#2
***UPDATE***

Thank you for your attention! It turns out that the number of features does not match between the training dataset and the testing input. However, I am still not sure how to make the number of features match. I expect the testing input to have fewer records than the training data, so after converting the categorical values to dummy values, would that result in fewer features? For your information, the original dataset and the testing input have the same columns, except that the training set has a predicted result column (the last column).

I understand that the testing data should also go through the same preprocessing and cleansing steps, in which the categorical columns need to be converted to dummy values. Is there a way to use the model to make predictions based on the original columns rather than the encoded features? Please correct me if my knowledge or wording is wrong here.
#3
(Feb-05-2020, 08:06 PM)anhnguyen Wrote: I understand that the testing data might also be gone through same preprocessing and cleansing steps
You are right, train and test samples should pass through the same preprocessing steps.
(Feb-05-2020, 08:06 PM)anhnguyen Wrote: I'm expecting the testing input would have less records than the training data, so after converting the categorical values to dummy values would it result in less number of features?
That seems to be your case. get_dummies creates one column per category value it actually sees, so a smaller test sample that contains fewer distinct values yields fewer columns. As a result, you end up with a different number of features (columns) for the training and testing samples.
One way to overcome this issue is to apply one-hot encoding (get_dummies) to the entire dataset: preload the train and test data, concatenate them into one dataset, apply one-hot encoding to that dataset, split it back into train and test parts, then train the model (and, finally, test it).
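A minimal sketch of that concatenate, encode, and split-back idea, using made-up toy frames in place of the SQL tables (the values and the names train, test, combined, x_train_enc and x_test_enc are placeholders, not taken from the original script). It tags each part with a key in pd.concat so the split back is by label and not random:

```python
import pandas as pd

# Hypothetical toy data standing in for the training and test tables.
train = pd.DataFrame({
    "collision_type": ["Front", "Rear", "Side", "Unknown"],
    "property_damage": ["YES", "NO", "Unknown", "NO"],
    "fraud_reported": ["Y", "N", "N", "Y"],
})
test = pd.DataFrame({
    "collision_type": ["Rear", "Front"],
    "property_damage": ["NO", "NO"],
})

train_features = train.drop(columns="fraud_reported")

# Encoding each frame on its own gives different column sets,
# because the test sample contains fewer distinct category values.
train_alone = pd.get_dummies(train_features, drop_first=True)
test_alone = pd.get_dummies(test, drop_first=True)
print(train_alone.shape, test_alone.shape)

# Concatenate with keys so every row remembers where it came from.
combined = pd.concat([train_features, test], keys=["train", "test"])

# One-hot encode once, over all category values seen in either part.
encoded = pd.get_dummies(combined, drop_first=True)

# Split back by key: these are exactly the original rows, in order.
x_train_enc = encoded.xs("train")
x_test_enc = encoded.xs("test")
print(list(x_train_enc.columns) == list(x_test_enc.columns))
```

Because the split is done by key, x_test_enc holds only the original test rows and x_train_enc only the training rows, so the model can be fitted on x_train_enc and used to predict on x_test_enc without the two being mixed.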
#4
Hi scidam,

Thank you for your helpful response. My concern is how the split works. Will it return my original testing dataset rather than a mix of training and testing data? (I'm assuming the train_test_split method is used here.) My goal is to train the model first, then use it on any new dataset or record, preferably without making this testing data part of the cross-validation data.

Please correct me if any of my understanding is wrong, as I'm quite new to machine learning. Thank you very much for your help.