Python Forum
Using Python and scikit-learn, how to output the individual feature dependencies?
#1
Hello,
I am relatively new to Python and Machine Learning.
I have a basic dataset for insurance fraud and a script that generates the model and runs the predictions.
I am able to output the accuracy percentages, but I would also like to output the feature dependencies, that is, what role each attribute played in the prediction. For example, policy_number would be 0.0%, whereas claim_amount would likely be 56.2%. Does this make sense?
Is there a scikit-learn function for this? Also, is "feature dependency" even the correct term?
Thank you for your help!
-Matt
#2
So in other words, you would like the coefficients of your model? Once you fit your regression with LR = model.fit(X, y) or similar, LR.coef_ is an array holding one coefficient per feature. Take that array and convert it to percent of total, and you will have what you are looking for.
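A minimal sketch of that idea, in case it helps. The toy data and the feature names are made up for illustration. Two assumptions worth stating: coefficients can be negative, so the sketch takes absolute values before normalising, and the comparison is only meaningful when the features are on a comparable scale (here they all are).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 3 features on the same scale; the target depends
# mostly on the first feature, a little on the second, not at all on the third.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

LR = LinearRegression().fit(X, y)

# Coefficients can be negative, so use absolute values before normalising.
abs_coef = np.abs(LR.coef_)
percent = 100.0 * abs_coef / abs_coef.sum()

for name, p in zip(["feature_0", "feature_1", "feature_2"], percent):
    print(f"{name}: {p:.1f}%")
```

If the features are on very different scales (dollars versus counts, say), standardising them before fitting is one common way to make the percentages comparable.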
#3
Hello Jef,
Yes, exactly! Thank you so much for this suggestion. So the proper terminology is "coefficients."
Aside from LR, does this .coef_ attribute work for any model?
Thank you, again, for taking the time to help me.
#4
I hesitate to say yes to any or all, but in general that is true for linear models, including linear classifiers such as LogisticRegression. Tree-based models such as random forests do not have coef_; they expose feature_importances_ instead.

The other term besides coefficients is "weights". I use coefficients for the equation, weights once you have converted to a percentage. Others can correct me if I am wrong.
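One way to check what a given fitted model offers is to look for the attribute directly. This is only a sketch: describe_weights is a hypothetical helper name, the data is synthetic, and the two attributes checked are just the common ones, not an exhaustive list.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression


def describe_weights(model):
    """Return (attribute name, values) for the first weight-like attribute found."""
    for attr in ("coef_", "feature_importances_"):
        if hasattr(model, attr):
            return attr, np.asarray(getattr(model, attr))
    return None, None


# Hypothetical toy data: 4 features, binary label driven mainly by the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

for model in (LogisticRegression(), ExtraTreesClassifier(random_state=0)):
    model.fit(X, y)
    attr, values = describe_weights(model)
    print(type(model).__name__, attr, values.shape)
```

Note that the shapes differ: for a binary LogisticRegression, coef_ has one row of coefficients, while feature_importances_ is a flat array with one entry per feature.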
#5
Hello Jef,
Thanks again for your input. Ok, I have made some changes to my code:
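import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
# x_train is a pandas DataFrame of one-hot encoded features; y_train holds the labels
model = ExtraTreesClassifier()
model.fit(x_train, y_train)
# Pair each column name with its importance, sorted from most to least important
coef = pd.DataFrame({'Columns': x_train.columns, 'Importances': model.feature_importances_}).sort_values(by=['Importances'], ascending=False)
print(coef.nlargest(10, 'Importances'))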
I am getting the following output:
Output:
     Columns                             Importances
125  incident_severity_Minor Damage      0.042847
40   insured_hobbies_chess               0.041505
126  incident_severity_Total Loss        0.028544
124  collision_type_Unknown              0.019634
41   insured_hobbies_cross-fit           0.014173
1    policy_state_OH                     0.009765
16   insured_sex_MALE                    0.009697
57   insured_relationship_own-child      0.009582
25   insured_occupation_exec-managerial  0.009513
5    policy_deductable_500               0.009146
I can't make sense of this, as the percentages don't seem right. Do they need to be calibrated or converted?
Thank you!
#6
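Sum the coefficients (here, the importances), then divide each one by the sum and multiply by 100 to convert to a percent. Note that scikit-learn's feature_importances_ already sum to 1 across all features, so in this case you only need to multiply by 100.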
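Here is a small sketch of that conversion for a tree ensemble. The data and column names ("a" to "d") are invented for the example; only the normalise-and-multiply step is the point. Dividing by the sum is kept in so the same lines also work for values that are not already normalised.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical toy data: 4 columns, label driven by "a" and, more weakly, "b".
rng = np.random.default_rng(0)
x_train = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_train = (x_train["a"] + 0.5 * x_train["b"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)

# One importance per column; divide by the total and scale to percent.
importances = pd.Series(model.feature_importances_, index=x_train.columns)
percent = (100.0 * importances / importances.sum()).sort_values(ascending=False)
print(percent.round(1))
```

With well over a hundred one-hot columns, as in the output above, each individual share will naturally be small, because the 100% is spread across every column.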
#7
Good Morning,
I am a student at the University of Rzeszow. As part of my master's thesis, I am conducting a study on the use of data clustering methods. Please complete the survey found at the link https://forms.gle/tK8mdjbxaKeRAQpm7. The survey is anonymous and consists of 9 short questions.
Thank you for your time.
Piotr Kuras
#8
Not really telling you what to do, but a survey is usually to describe and/or predict behavior in a population. What population do you think you have posting here? For your thesis, how are you going to describe the eligible population that is surveyed?
#9
I don't see the problem; that is your Gini importance feature ranking. Of course you can tune your algorithm, but the logic is always the same.
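To illustrate how that ranking behaves, here is a rough sketch on synthetic data. The column names borrow from the original question but the values are entirely made up: claim_amount drives the label, and policy_number is an ID-like column unrelated to it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 500
x_train = pd.DataFrame({
    # Hypothetical informative column: the label is derived from it below.
    "claim_amount": rng.normal(size=n),
    # Hypothetical ID-like column: a shuffled counter, unrelated to the label.
    "policy_number": rng.permutation(n).astype(float),
})
y_train = (x_train["claim_amount"] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)

# Impurity-based (Gini) importances, sorted into a ranking.
ranking = pd.Series(model.feature_importances_, index=x_train.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The informative column should come out on top, but the ID-like column will not necessarily score exactly 0.0: impurity-based importances can give some credit to high-cardinality columns that happen to split the training data, so it is usually safer to drop identifiers before fitting.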

