
I am currently reading slides about the $k$-means algorithm. In the analysis, the professor writes

Minimize Schwarz Criterion: $W(C) + \lambda m k \log R$

$W(C)$ is Within-class scatter. I guess $\lambda$ is a weighting factor which has to be chosen by the developer and $k$ is the number of clusters. But what is $m$ and what is $R$?

Martin Thoma

1 Answer

See also this StackExchange post. The Schwarz criterion, or BIC, is very general, so the question is, "how does this formula apply the BIC to $k$-means?"

The Wikipedia page on Bayesian Information Criterion (BIC) uses this definition of BIC:

$$BIC = -2 \ln \hat{L} + \kappa \ln {n}$$

where:

  • $n$ is the #observations
  • $\kappa$ is the #free parameters (I use Greek $\kappa$ to distinguish from your $k$)
  • $\hat{L}$ is the maximized likelihood after fitting parameters

Pelleg and Moore (2000) use $R$ for the number of observations. Assuming your professor does too, that means:

  • $\kappa = \lambda m k$, the number of free parameters
  • $W(C) = -2 \ln \hat{L}$

So, what are $\lambda$ and $m$? I can't tell which is which, but it has to come out something like this:

  • $k$ is the number of clusters [as you say]
  • $\lambda$ is the number of dimensions (columns) in the data
  • $m$ is the number of parameters to estimate for each dimension (2 for the usual Gaussian: mean and variance).

It makes sense that $W(C)$, the "within-class scatter", would be measured by log-likelihood. Note that $W(C)$ has to sum the log-likelihood over all clusters.
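To make the mapping concrete, here is a minimal sketch of the criterion in code. It assumes spherical Gaussian clusters with a single pooled variance, approximates $-2\ln\hat{L}$ from the Gaussian log-density of each point under its assigned center, and takes $\kappa = m \cdot d \cdot k$ free parameters with $m = 2$ (mean and variance per dimension), as in the interpretation above. The function name and the exact parameter count are my assumptions, not something from the slides.

```python
import numpy as np

def kmeans_bic(X, labels, centers, m=2):
    """Approximate BIC for a k-means clustering (sketch, not canonical).

    Assumes spherical Gaussian clusters with one pooled variance.
    kappa = m * d * k is an assumed parameter count (m = 2: mean and
    variance per dimension), matching the interpretation in the answer.
    """
    n, d = X.shape
    k = centers.shape[0]
    # Residual sum of squares: squared distance of each point to its center.
    rss = ((X - centers[labels]) ** 2).sum()
    # Pooled per-dimension variance estimate.
    sigma2 = rss / (n * d)
    # Gaussian log-likelihood under spherical clusters with shared variance.
    loglik = -0.5 * n * d * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)
    kappa = m * d * k  # assumed number of free parameters
    return -2 * loglik + kappa * np.log(n)
```

With well-separated data, the two-cluster fit should score a lower (better) BIC than lumping everything into one cluster, since the drop in $-2\ln\hat{L}$ outweighs the $\kappa \ln n$ penalty for the extra parameters.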

ctwardy