Jun-06-2019, 02:51 PM
(This post was last modified: Jun-06-2019, 02:51 PM by michalmonday.)
I'll take a look at the dataset in a moment, but in general you'd have 2 separate datasets, each with its corresponding binary results (also referred to as "labels"):
1. Training one (used to fit the logistic regression model)
2. Testing one (used to verify how accurate the predictions are)
If you have 1 big dataset, you could split it (scikit-learn has a function especially for that: train_test_split).
In the code you posted, it seems that the same data is used to train the model and then to make predictions. This will probably result in "fake" great performance of the classifier, because the model is being scored on examples it has already seen.
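Here's a rough sketch of what the split could look like. I haven't seen your dataset yet, so the data here is synthetic (generated with make_classification), and the sizes, the 25% test share and the random_state values are just assumptions for illustration, not something taken from your code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: 200 rows,
# 5 features, binary labels). Replace X and y with your own data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows for testing; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit only on the training part.
model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on seen data vs. accuracy on unseen data.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The test accuracy is the number to trust, since it comes from rows the model wasn't fitted on. If the train accuracy is much higher than the test one, the model is probably overfitting.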