Feb-05-2020, 08:28 PM
***This is a revised question from a previous post since I do not know how to delete my old post***
Here is the top of the script where the model was trained (I am using Logistic Regression):
data_raw = pd.read_sql(sql, cnxn)
pd.Series(data_raw.columns)
pd.Series(data_raw.dtypes)
data_raw.describe(include='all')
data_raw['collision_type'] = data_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')
data_raw['property_damage'] = data_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')
data_raw.isnull().sum()
dropping_columns = ['months_as_customer', 'policy_bind_date', 'age', 'policy_number',
                    'policy_annual_premium', 'insured_zip', 'capital_gains', 'capital_loss',
                    'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim',
                    'auto_year']
data_cleaned = data_raw.drop(dropping_columns, axis=1)
data_preprocessed = pd.get_dummies(data_cleaned, drop_first=True)
targets = data_preprocessed['fraud_reported_Y']
features = data_preprocessed.drop(['fraud_reported_Y'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=420)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

Now I'm trying to make predictions on a test input (test dataset imported from SQL table):
test = df['TestTable']
test = test[0]
sql = 'SELECT * FROM ' + test
test_raw = pd.read_sql(sql, cnxn)
#sample_rows = test_raw.sample(n=5)
test_raw.describe(include='all')
test_raw['collision_type'] = data_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')
test_raw['property_damage'] = data_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')
test_raw.isnull().sum()
print(test_raw.shape)
test_dropped = test_raw.drop(dropping_columns, axis=1)
test_preprocessed = pd.get_dummies(test_dropped, drop_first=True)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
test_predicted = logreg.predict(test_preprocessed)

Here is the error I got:
Error:
Traceback (most recent call last):
  File "<ipython-input-149-e6d470e94433>", line 1, in <module>
    runfile('C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py', wdir='C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master')
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py", line 402, in <module>
    test_predicted = logreg.predict(test_preprocessed)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 231 features per sample; expecting 1228
For your information, my training dataset has 999 records while the testing input has 50 records. The training dataset has the same columns as the testing input, except that the training dataset has one final column: the predicted result column.

I am not sure how to ensure that the number of features matches between the training dataset and the testing input. I expect the testing input to have fewer records than the training data, so I assume that converting its categorical values to dummy values results in fewer features.
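One common way to keep the feature count consistent is to encode the test table and then reindex its columns to the training feature columns. The sketch below is only an illustration under assumptions, not the script from this post: the two small tables and their values are made up, and only the column names collision_type and property_damage come from the code above. The test table is encoded without drop_first here, because dropping the first category separately on a smaller table can remove a different baseline category than the one dropped in training.

```python
import pandas as pd

# Hypothetical toy tables standing in for the training and test data.
train = pd.DataFrame({
    "collision_type": ["Rear", "Side", "Front", "Unknown"],
    "property_damage": ["YES", "NO", "Unknown", "NO"],
})
test = pd.DataFrame({
    "collision_type": ["Rear", "Side"],
    "property_damage": ["NO", "NO"],
})

# Training features: one dummy per category, minus the first category per column.
train_dummies = pd.get_dummies(train, drop_first=True, dtype=int)

# Test features: keep every dummy, then let the training columns decide what stays.
test_dummies = pd.get_dummies(test, dtype=int)

# Encoding each table on its own gives different column sets.
print(train_dummies.shape[1], test_dummies.shape[1])

# Align the test columns to the training columns: dummies missing from the
# test table are filled with 0, and dummies the model never saw are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_dummies.columns))
```

In the script above, the equivalent step would presumably be to reindex the encoded test table to features.columns before calling logreg.predict.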
I understand that the testing data also has to go through the same preprocessing and cleansing steps, in which the categorical columns need to be converted to dummy values. Is there a way to use the model to make predictions based on the original columns rather than the dummy features?
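One possible way to predict from the original columns is to put the encoding inside a scikit-learn Pipeline, so the fitted model remembers the training categories. This is a sketch under assumptions rather than a drop-in fix: it swaps pd.get_dummies for OneHotEncoder, assumes a scikit-learn version that includes ColumnTransformer, uses made-up rows, and guesses that the original target column is named fraud_reported with values Y and N.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the values are invented for illustration.
train = pd.DataFrame({
    "collision_type": ["Rear", "Side", "Front", "Unknown", "Rear", "Side"],
    "property_damage": ["YES", "NO", "Unknown", "NO", "YES", "NO"],
    "fraud_reported": ["Y", "N", "N", "N", "Y", "N"],
})
new_rows = pd.DataFrame({
    "collision_type": ["Rear", "Top"],  # "Top" never appears in training
    "property_damage": ["NO", "YES"],
})

categorical = ["collision_type", "property_damage"]

# The encoder learns its categories during fit and reuses them at predict
# time; handle_unknown="ignore" encodes unseen categories as all zeros.
model = Pipeline([
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("logreg", LogisticRegression()),
])

model.fit(train[categorical], train["fraud_reported"])

# Prediction takes the original (un-encoded) columns directly.
predictions = model.predict(new_rows)
print(predictions)
```

With this layout the number of encoded features is fixed at fit time, so a smaller test table with fewer distinct categories still produces input of the width the model expects.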
Since I'm a newbie, please correct me if my understanding or wording is wrong. Thank you all so much for helping me out.