Sample based on the distribution of a feature to create a more balanced data set
Hi all,

I am stuck on a clustering problem with unlabeled data. One of the issues I have is that the dataset is imbalanced, and I would like to rebalance it a bit to make the job easier for the clustering algorithms.

What I can use, though, is that one of the features we know is important for the clustering is itself imbalanced. In the figure below, where the x axis is speed, you can see that the dataset consists mostly of slow speeds.

[Image: distribution of the speed feature]

Is it possible, based on this distribution, to sample the dataset more evenly? For example, to pick a smaller percentage of the entries from the low speeds and a higher percentage from the higher speeds?

The sklearn package does not seem to have such functionality. Can you please help me find the relevant packages? I am quite sure that your answers will help many others besides me.

Regards,
Alex
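
One possible way to do the sampling asked about above, sketched with pandas only: bin the feature with `pandas.cut`, weight each row by the inverse of its bin's row count, and pass those weights to `DataFrame.sample`. This is a minimal sketch, not a definitive method. The function name `balanced_sample_by_feature`, the column name `speed`, the number of bins, and the synthetic exponential data are all assumptions made for illustration; they do not come from the original dataset.

```python
import numpy as np
import pandas as pd


def balanced_sample_by_feature(df, column, n_bins=10, n_samples=1000, random_state=0):
    """Draw rows so that sparse regions of `column` are over-represented.

    Hypothetical helper: each row is weighted by 1 / (rows in its bin),
    so every non-empty bin carries the same total sampling weight.
    """
    # Assign every row to one of n_bins equal-width bins of the feature.
    bins = pd.cut(df[column], bins=n_bins, labels=False)
    # Look up, for each row, how many rows share its bin.
    rows_in_bin = bins.map(bins.value_counts())
    weights = 1.0 / rows_in_bin
    # Sampling is without replacement, so never ask for more rows than exist.
    n = min(n_samples, len(df))
    return df.sample(n=n, weights=weights, random_state=random_state)


# Synthetic stand-in for the real data (assumption): mostly slow speeds.
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed": rng.exponential(scale=10.0, size=20000)})

sample = balanced_sample_by_feature(df, "speed", n_bins=10, n_samples=500)
print(df["speed"].describe())
print(sample["speed"].describe())
```

Because the rows are drawn without replacement, bins with very few rows are exhausted quickly, so the result is flatter than the original but not perfectly uniform. Passing `replace=True` to `DataFrame.sample`, or lowering `n_samples`, are options if the sparse high-speed bins should carry the same share as the dense ones.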
