Python Forum
Sample based on the distribution of a feature to create more balanced data set - Printable Version




Sample based on the distribution of a feature to create more balanced data set - dervast - Sep-25-2019

Hi all,

I am stuck on a classification problem with unlabeled data. One of the issues is that the dataset is imbalanced, and I would like to rebalance it a bit to make the job easier for the clustering algorithms.

What I can use, though, is that one of the features we know to be important for the clustering is itself imbalanced. In the figure below, where the x axis is speed, you can see that the dataset consists mostly of slow speeds. [Image: imbalanced.jpg]

Is it possible, based on this distribution, to sample the dataset more evenly? For example, to pick a smaller percentage of the low-speed entries and a larger percentage of the higher-speed ones?

The sklearn package does not seem to have such functionality. Can you please help me find the relevant packages? I am quite sure that your answers will help many others besides me.
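
The idea described above (keep fewer of the slow entries, more of the fast ones) can be sketched with the standard library alone. This is only one possible approach, not a packaged solution: it cuts the speed range into equal-width bins and keeps at most a fixed number of rows from each bin. The function name `balanced_sample_indices`, the parameters `n_bins` and `per_bin`, and the synthetic speed values are all made up for illustration.

```python
import random
from collections import defaultdict


def balanced_sample_indices(values, n_bins=10, per_bin=100, seed=0):
    """Return row indices so that each equal-width bin contributes at most `per_bin` rows."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0

    # Group row indices by the bin their value falls into.
    bins = defaultdict(list)
    for i, v in enumerate(values):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[b].append(i)

    # Crowded bins are cut down to `per_bin` rows; sparse bins are kept whole.
    chosen = []
    for b in sorted(bins):
        idx = bins[b]
        chosen.extend(idx if len(idx) <= per_bin else rng.sample(idx, per_bin))
    return sorted(chosen)


# Synthetic, made-up "speed" column skewed towards slow values.
data_rng = random.Random(42)
speeds = [data_rng.expovariate(1 / 20) for _ in range(10_000)]
keep = balanced_sample_indices(speeds, n_bins=10, per_bin=50)
print(len(speeds), "->", len(keep), "rows kept")
```

If the data lives in a pandas DataFrame, the returned positions could be passed to `DataFrame.iloc` to pull out the sampled rows. The number of bins and the per-bin cap are tuning choices; very sparse high-speed bins will still contribute only the few rows they contain.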

Regards,
Alex


RE: Sample based on the distribution of a feature to create more balanced data set - schuler - Nov-15-2019

I would put a batch norm per channel on the first layer and give it a go.