See also this StackExchange post. The Schwarz criterion, or BIC, is very general, so the question is, "how does this formula apply the BIC to $k$-means?"
The Wikipedia page on Bayesian Information Criterion (BIC) uses this definition of BIC:
$$BIC = -2 \ln \hat{L} + \kappa \ln {n}$$
where:
- $n$ is the #observations
- $\kappa$ is the #free parameters (I use Greek $\kappa$ to distinguish from your $k$)
- $\hat{L}$ is the maximized likelihood after fitting parameters
Pelleg and Moore (2000) use $R$ for the #observations. Assuming your professor does too, that means:
- $\kappa = \lambda m k$, the # free parameters.
- $W(C) = -2 \ln \hat{L}$
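Substituting these into the BIC definition above gives

$$BIC = W(C) + \lambda m k \ln R$$

which should match the formula you were given.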
So, what are $\lambda$ and $m$? I can't tell which is which, but it has to come out something like this:
- $k$ is the number of clusters [as you say]
- $\lambda$ is the number of dimensions (columns) in the data
- $m$ is the number of parameters to estimate for each dimension (2 for the usual Gaussian: mean and variance).
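As a quick sanity check on the counting: with $k = 3$ clusters in $\lambda = 2$ dimensions and $m = 2$ Gaussian parameters per dimension, you would be estimating $\kappa = 2 \cdot 2 \cdot 3 = 12$ free parameters.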
It makes sense that $W(C)$, the "within-class scatter," would be measured via the log likelihood. Note that $W(C)$ has to sum the log likelihoods over all clusters.
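To make the mapping concrete, here is a minimal sketch of computing this BIC for a $k$-means fit, assuming each cluster is modeled as a Gaussian with an independent mean and variance per dimension (so $m = 2$). The function name `kmeans_bic` and the diagonal-Gaussian likelihood are my own illustration, not necessarily the exact variant Pelleg and Moore (2000) derive:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bic(X, labels, k):
    """BIC = -2*ln(L_hat) + kappa*ln(n), with kappa = lambda*m*k free parameters.

    Assumes each cluster is a Gaussian with an independent mean and variance
    per dimension, so m = 2 parameters per dimension per cluster.
    """
    n, n_dims = X.shape               # n observations, lambda dimensions
    m = 2                             # mean and variance per dimension
    kappa = n_dims * m * k            # total free parameters

    log_lik = 0.0
    for c in range(k):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-12  # guard against zero variance
        # sum of per-point, per-dimension Gaussian log densities for this cluster
        log_lik += -0.5 * np.sum(np.log(2 * np.pi * var) + (Xc - mu) ** 2 / var)

    return -2.0 * log_lik + kappa * np.log(n)

# Compare candidate cluster counts: the smallest BIC wins.
X = np.random.default_rng(0).normal(size=(300, 2))
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(kmeans_bic(X, labels, k), 1))
```

In practice you would compute this for several candidate values of $k$ and prefer the one with the smallest BIC.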