Python Forum
Sample based on the distribution of a feature to create more balanced data set
#1
Hi all,

I am stuck on a classification problem with unlabeled data. One of the issues is that the dataset is imbalanced, and I would like to rebalance it a bit to make the job easier for the clustering algorithms.

What I can use, though, is that one of the features we know is important for the clustering is itself imbalanced. In the figure below, where the x-axis is speed, you can see that the dataset consists mostly of low speeds. [Image: imbalanced.jpg]

Is it possible, based on this distribution, to sample the dataset more evenly? For example, by picking a smaller percentage of the entries at low speeds and a larger percentage at higher speeds?
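To make the idea concrete, here is a minimal sketch of one possible approach, not a definitive solution: bin the feature, then weight each row by the inverse of its bin's row count so that rows from crowded bins are picked less often. It assumes the data sits in a pandas DataFrame; the `speed` column name, the `balanced_sample` helper, the number of bins, and the synthetic data are all hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, column, n_samples, n_bins=10, random_state=0):
    """Sample rows so that bins of `column` are more evenly represented.

    Hypothetical helper: equal-width bins plus inverse-frequency weights.
    """
    # Assign every row to an equal-width bin of the feature (0 .. n_bins - 1).
    bins = pd.cut(df[column], bins=n_bins, labels=False)
    # Look up how many rows share each row's bin.
    counts = bins.map(bins.value_counts())
    # Rows in crowded bins get small weights, rows in sparse bins large ones.
    weights = 1.0 / counts
    # DataFrame.sample normalizes the weights and samples without replacement.
    return df.sample(n=n_samples, weights=weights, random_state=random_state)


# Synthetic, heavily skewed "speed" values (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed": rng.exponential(scale=10.0, size=10_000)})

sample = balanced_sample(df, "speed", n_samples=500)
```

Because sampling is without replacement, very sparse bins can be used up entirely, so the result is more balanced rather than perfectly uniform. The bin count controls how aggressively the distribution is flattened and would need tuning for real data.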

The sklearn package does not seem to have such functionality. Can you please help me find the relevant packages? I am quite sure your answers will help many others besides me.

Regards,
Alex
#2
I would put a batch norm per channel on the first layer and give it a go.