Python Forum

Hi all,

I'm a bit of a novice Pythoner and have recently been running K-Means clustering in Azure Machine Learning. That method doesn't suit my dataset because of noise, and I'm thinking that DBSCAN may be the way to go. The dataset is 940 columns by 104k rows, so, as a bit of a newbie, I'm not sure of the best way to deal with a dataset this size. Any high-level advice is much appreciated.

Thanks

Mads

I would suggest applying a dimensionality reduction technique first. It might be useful to explore the dataset using e.g. t-SNE, or even PCA. DBSCAN is a good choice, but you need to choose an appropriate metric for it to work well. You will probably need to scale the data before applying any clustering or dimensionality reduction technique. If your dataset is sparse, you could consider applying an NMDS approach first. Everything depends on the specifics of your dataset: what data types do the columns have? Are they all numeric, or do some columns hold categorical data?
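
For what it's worth, here is a minimal sketch of that order of operations (fill blanks, scale, reduce, then cluster). It assumes scikit-learn and NumPy, which nobody in this thread has named, and it runs on small synthetic data rather than the real 104k x 940 table. The median imputation, the 10 PCA components, and the eps / min_samples values are all placeholder assumptions that would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table (assumption: every column is numeric).
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(200, 50)) for c in centers])
X[rng.random(X.shape) < 0.01] = np.nan  # mimic the odd blank cell

# Fill blanks, put every column on the same scale, then reduce dimensions.
# All three choices here are illustrative, not recommendations.
preprocess = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=10, random_state=0),
)
X_reduced = preprocess.fit_transform(X)

# eps and min_samples are placeholders; metric defaults to Euclidean.
labels = DBSCAN(eps=3.0, min_samples=10, metric="euclidean").fit_predict(X_reduced)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

On the real data, the cluster count and the share of noise points will depend heavily on eps, so it is worth trying several values rather than trusting the ones above.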

Hi Scidam

Thanks for your reply, and sorry for mine being late (busy night!). It's all numerical data at present and, apart from the odd blank cell where no data was available, it is fully populated.

I'm very much wanting to use this process as a learning exercise for myself, but being a newbie I'm still a little confused about what to use and when, so guidance (such as the suggestions in your previous reply) is much appreciated :). Any guidance on the best scaling method for my dataset would also be very welcome.

Thanks so much!

Mads

PS I should probably say the purpose of the clustering is to create 'profiles' of users
