Sep-25-2019, 05:56 PM
Hi all,
I am stuck on a classification problem with unlabeled data, so in practice a clustering problem. One of the issues is that the dataset is imbalanced, and I would like to rebalance it a bit to make the job easier for the clustering algorithms.
What I can use, though, is that one of the features we know to be important for the clustering is itself imbalanced. In the figure below, where the x axis is speed, you can see that the dataset consists mostly of slow speeds.
![[Image: imbalanced.jpg]](https://i.ibb.co/WVjPfL7/imbalanced.jpg)
Is it possible, based on this distribution, to sample the dataset more evenly? That is, to pick a smaller percentage of the entries at low speeds and a larger percentage at higher speeds?
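To make the idea concrete, here is a rough sketch of what I have in mind, using only NumPy. The function name `balanced_sample`, the equal-width binning, and the default of 10 bins are my own assumptions rather than anything taken from an existing package, and the synthetic `speed` array only stands in for my real feature.

```python
import numpy as np

def balanced_sample(speed, n_samples, n_bins=10, seed=0):
    """Return row indices drawn so that crowded speed bins are picked less often.

    Hypothetical helper: bins the feature, then weights each row by the
    inverse of its bin's size before sampling without replacement.
    """
    speed = np.asarray(speed, dtype=float)
    rng = np.random.default_rng(seed)
    # Equal-width bins over the observed speed range (an assumption;
    # quantile bins or a density estimate would be alternatives).
    edges = np.linspace(speed.min(), speed.max(), n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0 .. n_bins - 1.
    bin_ids = np.digitize(speed, edges[1:-1])
    counts = np.bincount(bin_ids, minlength=n_bins)
    # Each row gets weight 1 / (number of rows in its bin).
    weights = 1.0 / counts[bin_ids]
    probs = weights / weights.sum()
    return rng.choice(speed.size, size=n_samples, replace=False, p=probs)

# Synthetic stand-in for the real data: mostly slow speeds, a long tail of fast ones.
speed = np.random.default_rng(42).exponential(scale=10.0, size=5000)
idx = balanced_sample(speed, n_samples=500)
print("mean speed, full dataset:", speed.mean())
print("mean speed, resampled   :", speed[idx].mean())
```

The returned indices could then be used to select rows of the full feature matrix before clustering. Because the sampling is without replacement, sparse high-speed bins can run out of rows, so the result is flatter than the original distribution but not perfectly uniform.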
The sklearn package does not seem to have such functionality. Could you please help me find relevant packages? I am quite sure that your answers will help many others besides me.
Regards, Alex