I'm looking to perform a k-means cluster analysis on a data set whose variables have ranges containing both positive and negative values. Because the ranges vary so much, the data will need to be scaled, but my concern is with the variables whose ranges are negative. Should I perform some sort of log transformation on all the data so as to scale it to positive values? For example:
Variable A: 3.4, 5.6, 1.3, 7.6, 8.3
Variable B: 1, 2, 3, 2, 1
Variable C: -1.3, -1.4, -2.3, -4.2, -1.3

Jeff
  • I'm probably not the best expert for k-means, but to the best of my knowledge there's no requirement at all to provide it with positive values only. It proceeds by calculating distances between points in Euclidean space, so negative coordinates are totally fine. Applying a log transformation doesn't sound like a good idea to me. – Erwan Jul 04 '19 at 00:19
  • So you think if I simply rescale the values I should be fine? – Jeff Jul 04 '19 at 00:21
  • Yes I think so. Hopefully someone else will confirm. – Erwan Jul 04 '19 at 00:27

1 Answer

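You'll want to standardize each variable to z-scores, i.e. zero mean and unit standard deviation. (This rescales the values; it does not make them normally distributed.) For example, in Matlab, for all values of Variable A this would be something like: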

VarA = zscore(VarA);

And then you'll want to repeat that for each variable before running k-means. Make sure you standardize each variable separately. This will put everything on the same scale, so that the Euclidean distances are not dominated by whichever variables happen to have the widest spread.
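For readers not using Matlab, here is a minimal Python sketch of the same per-variable standardization, applied to the three example variables from the question. It uses only the standard library. The helper name `zscore` and the `data` / `scaled` dictionaries are made up for illustration. It divides by the sample standard deviation (n - 1), which is the default behaviour of Matlab's `zscore`.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize a list of numbers to zero mean and unit sample std."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


# Example variables copied from the question; C is entirely negative.
data = {
    "A": [3.4, 5.6, 1.3, 7.6, 8.3],
    "B": [1, 2, 3, 2, 1],
    "C": [-1.3, -1.4, -2.3, -4.2, -1.3],
}

# Standardize each variable separately, never the pooled values.
scaled = {name: zscore(values) for name, values in data.items()}

for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After this step every variable, including the all-negative Variable C, is centred on zero with the same spread, so each one contains both positive and negative values. No log transformation is involved, and the scaled columns can be passed to k-means directly.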

There is another good explanation of this on the Stats Stack Exchange.

gcalongi