Python Forum

Full Version: How to match number of features of training dataset to testing input
***This is a revised question from a previous post since I do not know how to delete my old post***

Here is the top of the script where the model was trained (I am using Logistic Regression):

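import pandas as pd
from sklearn.model_selection import train_test_split

# sql (query string) and cnxn (database connection) are defined earlier in the script
data_raw = pd.read_sql(sql, cnxn)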
 
pd.Series(data_raw.columns) 
pd.Series(data_raw.dtypes)
 
data_raw.describe(include='all')
 
# Treat the '?' placeholder as its own 'Unknown' category
data_raw['collision_type'] = data_raw['collision_type'].replace('?', 'Unknown')
 
data_raw['property_damage'] = data_raw['property_damage'].replace('?', 'Unknown')
 
data_raw.isnull().sum()
 
dropping_columns = ['months_as_customer', 'policy_bind_date', 'age', 'policy_number', 'policy_annual_premium', 'insured_zip', 
                    'capital_gains', 'capital_loss', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim',
                   'auto_year']
 
data_cleaned = data_raw.drop(dropping_columns, axis=1)
 
data_preprocessed = pd.get_dummies(data_cleaned, drop_first=True)
 
 
targets = data_preprocessed['fraud_reported_Y']
features = data_preprocessed.drop(['fraud_reported_Y'], axis=1)
 
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=420)
 
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
Now I'm trying to make predictions on a test input (test dataset imported from SQL table):

test = df['TestTable']
test = test[0]
sql = 'SELECT * FROM '+ test
test_raw = pd.read_sql(sql,cnxn)
 
#sample_rows = test_raw.sample(n=5)
 
test_raw.describe(include='all')
 
# Clean the test columns themselves (not data_raw, which is the training table)
test_raw['collision_type'] = test_raw['collision_type'].replace('?', 'Unknown')
 
test_raw['property_damage'] = test_raw['property_damage'].replace('?', 'Unknown')
 
test_raw.isnull().sum()
 
print(test_raw.shape)
 
test_dropped = test_raw.drop(dropping_columns, axis=1)
test_preprocessed = pd.get_dummies(test_dropped, drop_first=True)
 
# logreg was already fitted on x_train above, so it is reused here without refitting
test_predicted = logreg.predict(test_preprocessed)
Here is the error I got:

Error:
Traceback (most recent call last):
  File "<ipython-input-149-e6d470e94433>", line 1, in <module>
    runfile('C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py', wdir='C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master')
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py", line 402, in <module>
    test_predicted = logreg.predict(test_preprocessed)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 231 features per sample; expecting 1228
For your information, my training dataset has 999 records, while the testing input has 50 records. The training dataset has the same columns as the testing input, except for one extra final column in the training dataset: the column holding the result to be predicted.

I am not sure how to make the number of features in the testing input match the training dataset. I expect the testing input to have fewer records than the training data, so I assume that converting its categorical values to dummy variables results in fewer features.
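
A minimal sketch of why the counts can differ, using hypothetical toy frames (the values and the `witnesses` column are made up for illustration). `pd.get_dummies` only creates columns for the categories it actually sees, so a smaller frame usually yields fewer dummy columns. One common workaround, assumed here and not taken from the script above, is to reindex the test dummies onto the training feature columns:

```python
import pandas as pd

# Hypothetical toy frames standing in for the cleaned training and test tables.
train = pd.DataFrame({
    "collision_type": ["Front", "Rear", "Side", "Unknown"],
    "witnesses": [0, 1, 2, 3],
})
test = pd.DataFrame({
    "collision_type": ["Rear", "Front"],
    "witnesses": [1, 0],
})

train_dummies = pd.get_dummies(train, drop_first=True)

# Encoding the test frame on its own sees only two categories,
# so it produces fewer columns than the training frame.
test_naive = pd.get_dummies(test, drop_first=True)
print(train_dummies.shape[1], test_naive.shape[1])

# Encode the test frame without drop_first, then keep exactly the training
# columns: missing dummies are filled with 0, extra ones are discarded.
test_aligned = pd.get_dummies(test).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(list(test_aligned.columns))
```

For the script above, the analogous call would presumably be `pd.get_dummies(test_dropped).reindex(columns=features.columns, fill_value=0)`. The test frame is encoded without `drop_first` here because the "first" category in a small test set may not be the one that was dropped during training. Note that categories appearing only in the test data are silently discarded by this approach.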

I understand that the testing data has to go through the same preprocessing and cleansing steps, in which the categorical columns are converted to dummy variables. Is there a way to use the model to make predictions from the original columns rather than from the dummy features?
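
One way this might be done, sketched under assumptions: wrap the encoding and the model in a single scikit-learn `Pipeline`, so that `fit` and `predict` both take the original columns and the encoder learned from the training data is reused on new input. The toy data, the `witnesses` column, and the `new_claims` frame are hypothetical, and `ColumnTransformer` requires scikit-learn 0.20 or later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the cleaned training table.
train = pd.DataFrame({
    "collision_type": ["Front", "Rear", "Side", "Unknown", "Rear", "Front"],
    "witnesses": [0, 1, 2, 3, 1, 0],
    "fraud_reported": ["N", "Y", "N", "Y", "Y", "N"],
})
# New input with the original columns, including a category never seen in training.
new_claims = pd.DataFrame({
    "collision_type": ["Rear", "Rollover"],
    "witnesses": [1, 2],
})

X = train.drop(columns="fraud_reported")
y = train["fraud_reported"]

# One-hot encode the listed categorical columns and pass the rest through.
# handle_unknown="ignore" encodes unseen categories as all zeros instead of failing.
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["collision_type"])],
    remainder="passthrough",
)
model = Pipeline([("encode", encode), ("logreg", LogisticRegression())])

model.fit(X, y)                    # the encoder learns its categories here
preds = model.predict(new_claims)  # the same encoding is applied to the raw columns
print(preds)
```

The design point is that the set of dummy columns is fixed at training time and stored inside the fitted pipeline, so the number of features seen by the model cannot drift between training and prediction.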

Since I'm a newbie, please correct me if my understanding or wording is wrong. Thank you all so much for helping me out.