Sample based on the distribution of a feature to create more balanced data set

dervast · Sep-25-2019, 05:56 PM

Hi all,

I am stuck on a classification problem with unlabeled data. One issue is that the dataset is imbalanced, and I would like to rebalance it somewhat to make the job easier for the clustering algorithms.

What I can exploit is that one of the features we know to be important for the clustering is itself imbalanced. In the figure below, where the x-axis is speed, you can see that the dataset consists mostly of slow speeds.

[Figure: distribution of the speed feature]

Is it possible, based on this distribution, to sample the dataset more evenly? For example, to pick a smaller percentage of the entries at low speeds and a larger percentage of the entries at higher speeds?
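One way this idea could be sketched, assuming the data sits in a pandas DataFrame: bin the feature, then weight each row by the inverse of its bin's size so that sparse (high-speed) bins are drawn proportionally more often. The helper name `balanced_sample`, the column name `speed`, the number of bins, and the synthetic data are all assumptions made for illustration; they do not come from the original dataset.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, column, n_samples, n_bins=10, random_state=0):
    """Sample rows so that bins of `column` are drawn more evenly.

    Hypothetical helper: rows in crowded bins get small weights,
    rows in sparse bins get large weights.
    """
    # Assign every row to an equal-width bin of the feature.
    bins = pd.cut(df[column], bins=n_bins, labels=False)
    # Look up, for each row, how many rows share its bin.
    counts = bins.map(bins.value_counts())
    # Weight each row by 1 / (size of its bin).
    weights = 1.0 / counts
    # DataFrame.sample normalizes the weights and samples without replacement.
    return df.sample(n=n_samples, weights=weights, random_state=random_state)


# Synthetic stand-in for the real data: speeds skewed towards low values.
rng = np.random.default_rng(0)
data = pd.DataFrame({"speed": rng.exponential(scale=10.0, size=10_000)})

sample = balanced_sample(data, "speed", n_samples=500)
```

Because sampling is without replacement, very sparse bins can be exhausted, so the result is flatter than the original distribution but not perfectly uniform. The bin count controls how aggressively the tail is boosted.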

The sklearn package does not seem to have such functionality. Can you please help me find the relevant packages? I am quite sure that your answers will help many others besides me.

Regards, Alex

schuler · Nov-15-2019, 12:25 AM

I would put a batch norm per channel on the first layer and give it a go.
