Python Forum

Full Version: Using Python and scikitlearn, how to output the individual feature dependencies?
Hello,
I am relatively new to Python and Machine Learning.
I have a basic dataset for insurance fraud and a script that generates the model and runs the predictions.
I am able to output the accuracy percentages, but I would also like to output the feature dependencies. For example, what role did each attribute play in the prediction? The policy_number would be 0.0%, whereas the claim_amount would likely be 56.2%. Does this make sense?
Is there a scikit function for this? Also, is "feature dependency" even the correct term?
Thank you for your help!
-Matt
So in other words, you would like the coefficients of your model? Once you fit your regression with LR = model.fit(X, y) or similar, LR.coef_ is an array of the coefficients, one for each feature. Take that and convert to percent of total and you will have what you are looking for.
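A minimal sketch of the "percent of total" idea, assuming a plain LinearRegression and synthetic data (the data and the names x0, x1, x2 are made up for illustration, not taken from the insurance dataset). It uses absolute values so that negative and positive coefficients do not cancel, and it assumes the features are on comparable scales, since raw coefficients are otherwise not directly comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends strongly on
# column 0, weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

LR = LinearRegression().fit(X, y)

# Absolute values, so a negative coefficient still counts as influence.
abs_coef = np.abs(LR.coef_)
percent = 100.0 * abs_coef / abs_coef.sum()

for name, pct in zip(["x0", "x1", "x2"], percent):
    print(f"{name}: {pct:.1f}%")
```

With real data, standardizing the features before fitting is one common way to make the resulting percentages easier to compare.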
Hello Jef,
Yes, exactly! Thank you so much for this suggestion. So the proper terminology is "coefficients."
Aside from LR, does this coef_ attribute work for any model?
Thank you, again, for taking the time to help me.
I hesitate to say yes to any or all, but in general that is true. Probably not for classification models, but I have not checked.

The other term besides coefficients is "weights". I use coefficients for the equation, and weights once you have converted to a percentage. Others can correct me if I'm wrong.
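One way to check which attribute a given model exposes is to fit it and inspect it. The sketch below is only an illustration on synthetic data; the two models are chosen as examples, not as an exhaustive list. In scikit-learn, linear models such as LogisticRegression (a classifier) expose coef_, while tree ensembles such as ExtraTreesClassifier expose feature_importances_ instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic classification data, used only to get fitted models.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
]

found = {}
for model in models:
    model.fit(X, y)
    name = type(model).__name__
    # Record which of the two attributes exists on the fitted model.
    found[name] = [
        attr for attr in ("coef_", "feature_importances_") if hasattr(model, attr)
    ]
    print(name, found[name])
```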
Hello Jef,
Thanks again for your input. Ok, I have made some changes to my code:
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(x_train, y_train)
# Pair each column name with its importance, largest first
coef = pd.DataFrame({'Columns': x_train.columns, 'Importances': model.feature_importances_}).sort_values(by=['Importances'], ascending=False)
print(coef.nlargest(10, 'Importances'))
I am getting the following output:
Output:
     Columns                             Importances
125  incident_severity_Minor Damage      0.042847
40   insured_hobbies_chess               0.041505
126  incident_severity_Total Loss        0.028544
124  collision_type_Unknown              0.019634
41   insured_hobbies_cross-fit           0.014173
1    policy_state_OH                     0.009765
16   insured_sex_MALE                    0.009697
57   insured_relationship_own-child      0.009582
25   insured_occupation_exec-managerial  0.009513
5    policy_deductable_500               0.009146
I can't make sense of this, as the percentages don't seem right. Do they need to be calibrated or converted?
Thank you!
Sum the coefficients, then divide each coefficient by the sum and multiply by 100 to convert to a percent.
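A short sketch of that conversion for a tree ensemble, assuming synthetic data and hypothetical column names (feature_0 and so on) in place of the real x_train. In scikit-learn, feature_importances_ of a fitted forest is already normalized to sum to 1, so dividing by the sum changes little and each value is simply a fraction of the total: 0.042847 in the output above corresponds to about 4.3%.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for x_train / y_train (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
x_train = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(x_train, y)

# Label each importance with its column name, then express it as a percent.
importances = pd.Series(model.feature_importances_, index=x_train.columns)
percent = (100.0 * importances / importances.sum()).sort_values(ascending=False)

print(percent.round(1))
```

With many one-hot encoded columns, as in the output above, the total is spread over all of them, so small individual percentages are to be expected.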
Good Morning,
I am a student at the University of Rzeszow. As part of my master's thesis, I am conducting a study on the use of data clustering methods. Please complete the survey found at the link https://forms.gle/tK8mdjbxaKeRAQpm7. The survey is anonymous and consists of 9 short questions.
Thank you for your time.
Piotr Kuras
Not really telling you what to do, but a survey is usually to describe and/or predict behavior in a population. What population do you think you have posting here? For your thesis, how are you going to describe the eligible population that is surveyed?
I don't see the problem; that is your Gini importance feature ranking. Of course you can tune your algorithm, but the logic is always the same.