1

I recently received a manuscript for review in which the author added ~1000 "fake" data points so that the final k-means centroid stays within a required range. Neither the author nor I seem to have a background in data science, and the paper is more of an application to our research area.

I have tried to find published work on this method of restricting k-means centers, but failed to do so. However, by simple logic it seems like a valid approach, so maybe the author used the wrong terminology.

Hence, I would like to ask: is this a valid way to restrict k-means centers, and is there any published work on it?

Joe89
  • 11
  • 2
  • What do you mean by restricting? Usually k-means is calculated on a whole dataset. This dataset could be filtered to get rid of outliers. – keiv.fly Oct 29 '18 at 07:01
  • 1
    Very broadly speaking, I have also used a similar approach, where I manually annotate data points in order to introduce domain-specific restrictions. However, this is no man's land when it comes to data science. It is common for data scientists to shy away from domain knowledge, as it introduces constraints that raise questions like yours. I really hope you get some good answers, as I'm looking forward to reading them too. – mapto Oct 29 '18 at 08:25
  • I have many questions about this question but let’s start with something that should be easy to answer. What is the value of k (how many clusters are you finding) and how many data points are in your dataset? – Paul Aug 26 '19 at 21:39

2 Answers

0

I highly recommend finding a source that explains how k-means works and understanding it well. K-means is well known, so it is hard to find a reference that discusses it as an algorithm or explains how it works.

I noticed you state that the "author used ~1000 "fake" data points, so that the final centroid of K-mean stays within the required range", which is always going to be true. K-means calculates the mean (average) of the data points assigned to each cluster, which guarantees that every centroid ends up within the range of the data used.

The power of this algorithm (k-means) lies in recalculating the means iteratively until the means (centroids) stabilize. In other words, in each iteration the means shift toward the centers of the dense regions. Consequently, if you are looking for k = 1 (a single centroid), you will find it in one iteration.
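To make the two points above concrete (centroids stay within the range of the data, and k = 1 settles after a single update), here is a minimal NumPy sketch of the standard Lloyd iteration. The function name `lloyd_kmeans`, the random data, and the parameters are my own illustrative choices, not anything taken from the manuscript in the question.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration; returns (centroids, labels, number of updates that moved a centroid)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (an assumption; other initializations exist).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for n_updates in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            return centroids, labels, n_updates
        centroids = new_centroids
    return centroids, labels, n_iter

# Made-up data purely for illustration.
X = np.random.default_rng(42).normal(size=(200, 2))

c1, _, updates_k1 = lloyd_kmeans(X, k=1)
print("k=1 centroid:", c1[0], "updates needed:", updates_k1)

c3, _, _ = lloyd_kmeans(X, k=3)
inside = np.all((c3 >= X.min(axis=0)) & (c3 <= X.max(axis=0)))
print("k=3 centroids inside the data's bounding box:", inside)
```

Because each centroid is an average of data points, it cannot leave the bounding box (more precisely, the convex hull) of the points it averages.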

I personally suggest starting with some videos and going on from there. Here is the first result on YouTube about k-means: https://youtu.be/_aWzGGNrcic.

krayyem
  • 179
  • 1
  • 10
0

A generalized solution would be constrained optimization: change the loss function to only allow solutions within a certain region.

Adding fake data points to nudge the solution into a valid region has several limitations: it requires manual adjustment for every model run and offers no guarantees. Constrained optimization would be automated and provide strong guarantees.
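As a rough sketch of what a constrained version could look like, assume the allowed region is an axis-aligned box given by `lower` and `upper` (my assumption; the question does not say what shape the required range has). For a fixed assignment, the within-cluster squared error is a quadratic centered on the cluster mean, so the best centroid inside a box is the mean clipped to that box. The function name `box_constrained_kmeans` and the toy data are hypothetical.

```python
import numpy as np

def box_constrained_kmeans(X, k, lower, upper, n_iter=100, seed=0):
    """Lloyd-style k-means whose centroids are kept inside the box [lower, upper]."""
    X = np.asarray(X, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize from data points, projected into the allowed box.
    centroids = np.clip(X[rng.choice(len(X), size=k, replace=False)], lower, upper)
    for _ in range(n_iter):
        # Assignment step: nearest (constrained) centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: cluster mean, then projection onto the box via clipping.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = np.clip(members.mean(axis=0), lower, upper)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two blobs, one of which lies outside the allowed box.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(10.0, 1.0, size=(100, 2))])
centroids, labels = box_constrained_kmeans(X, k=2, lower=[-2.0, -2.0], upper=[5.0, 5.0])
print(centroids)
```

Here the constraint holds by construction on every run, with no fake points to tune. Regions that are not boxes would need a different projection or a general constrained solver.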

Brian Spiering
  • 21,136
  • 2
  • 26
  • 109