 How to match number of features of training dataset to testing input
***This is a revised question from a previous post since I do not know how to delete my old post***

Here is the top of the script where the model was trained (I am using Logistic Regression):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# sql (query string) and cnxn (database connection) are defined elsewhere in the script
data_raw = pd.read_sql(sql, cnxn)
# Treat the '?' placeholder as its own category
data_raw['collision_type'] = data_raw['collision_type'].replace('?', 'Unknown')
data_raw['property_damage'] = data_raw['property_damage'].replace('?', 'Unknown')
dropping_columns = ['months_as_customer', 'policy_bind_date', 'age', 'policy_number', 'policy_annual_premium', 'insured_zip',
                    'capital_gains', 'capital_loss', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim',
                    ]
data_cleaned = data_raw.drop(dropping_columns, axis=1)
# One-hot encode every remaining categorical column
data_preprocessed = pd.get_dummies(data_cleaned, drop_first=True)
targets = data_preprocessed['fraud_reported_Y']
features = data_preprocessed.drop(['fraud_reported_Y'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=420)
logreg = LogisticRegression().fit(x_train, y_train)
y_pred = logreg.predict(x_test)
Now I'm trying to make predictions on a test input (test dataset imported from SQL table):

test = df['TestTable']
test = test[0]
sql = 'SELECT * FROM '+ test
test_raw = pd.read_sql(sql,cnxn)
#sample_rows = test_raw.sample(n=5)
test_raw['collision_type'] = test_raw['collision_type'].replace('?', 'Unknown')
test_raw['property_damage'] = test_raw['property_damage'].replace('?', 'Unknown')
test_dropped = test_raw.drop(dropping_columns, axis=1)
test_preprocessed = pd.get_dummies(test_dropped, drop_first=True)
logreg = LogisticRegression().fit(x_train, y_train)
test_predicted = logreg.predict(test_preprocessed)
Here is the error I got:

Traceback (most recent call last):
  File "<ipython-input-149-e6d470e94433>", line 1, in <module>
    runfile('C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/', wdir='C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master')
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\", line 110, in execfile
    exec(compile(, filename, 'exec'), namespace)
  File "C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/", line 402, in <module>
    test_predicted = logreg.predict(test_preprocessed)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\", line 289, in predict
    scores = self.decision_function(X)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\", line 270, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 231 features per sample; expecting 1228
For your information, my training dataset has 999 records while the testing input has 50 records. The testing input has the same columns as the training dataset, except that the training dataset has one extra final column: the result to be predicted.

I am not sure how to make the number of features in the testing input match the training dataset. I expect the testing input to have fewer records than the training data, so after converting the categorical values to dummy variables it ends up with fewer features.
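To make the mismatch concrete, here is a minimal sketch on made-up data (the toy tables and the names train_df, test_df and test_aligned are assumptions for illustration, not part of my script). It shows how a small test table produces fewer dummy columns, and one common way to line them up: reindexing the test dummies to the training columns. Note that drop_first=True is deliberately not used on the test table here, because the category dropped there may differ from the one dropped in training.

```python
import pandas as pd

# Hypothetical toy data standing in for the real training and test tables.
train_df = pd.DataFrame({
    "collision_type": ["Rear", "Side", "Front", "Unknown"],
    "witnesses": [0, 1, 2, 3],
})
test_df = pd.DataFrame({
    "collision_type": ["Rear", "Side"],
    "witnesses": [1, 0],
})

# Training features: one dummy column per category, minus the first one.
train_features = pd.get_dummies(train_df, drop_first=True)

# Test features: encode every category that is present (no drop_first),
# so no value is silently mapped to the training baseline.
test_features = pd.get_dummies(test_df)

# The small test table contains fewer categories, hence fewer columns.
print(train_features.shape[1], test_features.shape[1])

# Align to the training layout: missing dummy columns are added and filled
# with 0, columns unknown to the model are dropped, and the order matches.
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(list(test_aligned.columns))
```

With the columns aligned this way, test_aligned has the same number of features as the training matrix, which is what predict() checks for.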

I understand that the testing data has to go through the same preprocessing and cleansing steps, in which the categorical columns are converted to dummy variables. Is there a way to use the model to make predictions from the original columns rather than the dummy features?
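One way this might be done, as far as I can tell, is to put the encoding inside a scikit-learn Pipeline so that the fitted model accepts the original columns directly. The sketch below is only an illustration on made-up data: it assumes scikit-learn 0.20 or newer (for ColumnTransformer), and the toy tables and the names new_df, categorical_columns and model are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names and values are illustrative only.
train_df = pd.DataFrame({
    "collision_type": ["Rear", "Side", "Front", "Unknown", "Rear", "Side"],
    "witnesses": [0, 1, 2, 3, 1, 0],
    "fraud_reported": ["Y", "N", "N", "Y", "Y", "N"],
})
new_df = pd.DataFrame({
    "collision_type": ["Rear", "Rollover"],  # "Rollover" never appears in training
    "witnesses": [1, 2],
})

categorical_columns = ["collision_type"]

# The encoder learns its categories from the training data only; categories
# it has not seen are encoded as all zeros instead of raising an error.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ],
    remainder="passthrough",  # numeric columns are passed through unchanged
)
model = Pipeline(steps=[("preprocess", preprocess), ("logreg", LogisticRegression())])

# Fit on the original (non-dummy) columns, then predict on raw new rows.
model.fit(train_df.drop(columns=["fraud_reported"]), train_df["fraud_reported"])
predictions = model.predict(new_df)
print(predictions)
```

Because the encoder is fitted once on the training data and reused at prediction time, the number of features stays the same no matter how few rows or categories the new input contains.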

Since I'm a newbie, please correct me if my understanding or wording is wrong. Thank you all so much for helping me out.
