How to match number of features of training dataset to testing input
#1
***This is a revised question from a previous post since I do not know how to delete my old post***

Here is the top of the script where the model was trained (I am using Logistic Regression):

import pandas as pd
from sklearn.model_selection import train_test_split
 
# sql and cnxn (the query string and the database connection) are assumed to be defined before this point
data_raw = pd.read_sql(sql, cnxn)
 
# quick look at the column names and dtypes
pd.Series(data_raw.columns)
pd.Series(data_raw.dtypes)
 
data_raw.describe(include='all')
 
# replace the '?' placeholder in the categorical columns with 'Unknown'
data_raw['collision_type'] = data_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')
 
data_raw['property_damage'] = data_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')
 
data_raw.isnull().sum()
 
# columns dropped before modelling
dropping_columns = ['months_as_customer', 'policy_bind_date', 'age', 'policy_number', 'policy_annual_premium', 'insured_zip', 
                    'capital_gains', 'capital_loss', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim',
                    'auto_year']
 
data_cleaned = data_raw.drop(dropping_columns, axis=1)
 
# one-hot encode the categorical columns (this is where the dummy features are created)
data_preprocessed = pd.get_dummies(data_cleaned, drop_first=True)
 
 
# fraud_reported_Y is the dummy-encoded target column
targets = data_preprocessed['fraud_reported_Y']
features = data_preprocessed.drop(['fraud_reported_Y'], axis=1)
 
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=420)
 
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
Now I'm trying to make predictions on a test input (a test dataset imported from a SQL table):

# look up the name of the test table and load it from SQL
test = df['TestTable']
test = test[0]
sql = 'SELECT * FROM ' + test
test_raw = pd.read_sql(sql, cnxn)
 
#sample_rows = test_raw.sample(n=5)
 
test_raw.describe(include='all')
 
# same '?' replacement as for the training data (note: this should read from test_raw, not data_raw)
test_raw['collision_type'] = test_raw.loc[0:, 'collision_type'].replace('?', 'Unknown')
 
test_raw['property_damage'] = test_raw.loc[0:, 'property_damage'].replace('?', 'Unknown')
 
test_raw.isnull().sum()
 
print(test_raw.shape)
 
# apply the same column drops and one-hot encoding to the test data
test_dropped = test_raw.drop(dropping_columns, axis=1)
test_preprocessed = pd.get_dummies(test_dropped, drop_first=True)
 
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
test_predicted = logreg.predict(test_preprocessed)
Here is the error I got:

Error:
Traceback (most recent call last):
  File "<ipython-input-149-e6d470e94433>", line 1, in <module>
    runfile('C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py', wdir='C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master')
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/BusinessUser/Downloads/insurance_claim_fraud_detection-master/insurance_claim_fraud_detection.py", line 402, in <module>
    test_predicted = logreg.predict(test_preprocessed)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)
  File "C:\Users\BusinessUser\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 231 features per sample; expecting 1228
For your information, my training dataset has 999 records while the testing input has 50 records. The training dataset has the same columns as the testing input, except that the training dataset has one additional final column: the result column that the model predicts.

I am not sure how to ensure that the number of features in the testing input matches the training dataset. I expect the testing input to have fewer records than the training data, so after converting the categorical values to dummy values, does it end up with fewer features?
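For example, would something like the following work? This is only my rough sketch, reusing the training column list (features.columns from the training script above) and filling in 0 for any dummy column that does not show up in the 50 test records:

# rough idea: force the test dummies to have exactly the training feature columns,
# filling any dummy column that is missing from the test data with 0
test_aligned = test_preprocessed.reindex(columns=features.columns, fill_value=0)
print(test_aligned.shape)   # should now have the same number of columns as x_train
test_predicted = logreg.predict(test_aligned)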

I understand that the testing data should also go through the same preprocessing and cleansing steps, in which the categorical columns need to be converted to dummy values. May I ask if there is a way to use the model and make predictions based on the original columns rather than the dummy features?
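In case it clarifies what I mean, here is a rough sketch of the kind of thing I was imagining, using scikit-learn's OneHotEncoder inside a Pipeline so that the encoding learned on the training data is reused on the test data. I am not sure this is right, and the raw target column name 'fraud_reported' is only my guess (get_dummies turned it into 'fraud_reported_Y' above); data_cleaned and test_dropped are the variables from my script:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
 
# split the cleaned training data into raw features and target
# ('fraud_reported' is my guess at the raw target column name)
X_raw = data_cleaned.drop(['fraud_reported'], axis=1)
y_raw = data_cleaned['fraud_reported']
 
# one-hot encode only the text columns; pass the numeric columns through unchanged
categorical_cols = X_raw.select_dtypes(include='object').columns.tolist()
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')
 
model = Pipeline([('prep', preprocess), ('logreg', LogisticRegression())])
model.fit(X_raw, y_raw)
 
# the test data only needs the same raw columns dropped, no separate get_dummies call
test_predicted = model.predict(test_dropped)

Would something like this be a reasonable way to avoid the mismatch, or is aligning the dummy columns (as in the earlier sketch) the more usual approach?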

Since I'm a newbie, please correct me if my knowledge or wording is wrong here. Thank you all so much for helping me out.