See also this StackExchange post. The Schwarz criterion, or BIC, is very general, so the question is, "how does this formula apply the BIC to $k$-means?"
The Wikipedia page on Bayesian Information Criterion (BIC) uses this definition of BIC:
$$BIC = -2 \ln \hat{L} + \kappa \ln {n}$$
where:
- $n$ is the #observations
- $\kappa$ is the #free parameters (I use Greek $\kappa$ to distinguish from your $k$)
- $\hat{L}$ is the maximized likelihood after fitting parameters
Pelleg and Moore (2000) use $R$ for the #observations. Assuming your professor does too, that means:
- $\kappa = \lambda m k$, the # free parameters.
- $W(C) = -2 \ln \hat{L}$
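Substituting these into the BIC definition above gives

$$BIC = W(C) + \lambda m k \ln R$$

which should match the formula you were given.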
So, what are $\lambda$ and $m$? I can't tell which is which, but it has to come out something like this:
- $k$ is the number of clusters [as you say]
- $\lambda$ is the number of dimensions (columns) in the data
- $m$ is the number of parameters to estimate for each dimension (2 for the usual Gaussian: mean and variance).
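As a quick sanity check on the counting: with $k = 3$ clusters in $\lambda = 2$ dimensions and $m = 2$ Gaussian parameters per dimension, you would be estimating $\kappa = 2 \cdot 2 \cdot 3 = 12$ free parameters.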
It makes sense that $W(C)$, the "within-class scatter," would be measured via the log likelihood. Note that $W(C)$ has to sum the log likelihoods over all clusters.
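To make the mapping concrete, here is a minimal sketch of computing this BIC for a $k$-means fit, assuming each cluster is modeled as a Gaussian with an independent mean and variance per dimension (so $m = 2$). The function name `kmeans_bic` and the diagonal-Gaussian likelihood are my own illustration, not necessarily the exact variant Pelleg and Moore (2000) derive:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bic(X, labels, k):
    """BIC = -2*ln(L_hat) + kappa*ln(n), with kappa = lambda*m*k free parameters.

    Assumes each cluster is a Gaussian with an independent mean and variance
    per dimension, so m = 2 parameters per dimension per cluster.
    """
    n, n_dims = X.shape               # n observations, lambda dimensions
    m = 2                             # mean and variance per dimension
    kappa = n_dims * m * k            # total free parameters

    log_lik = 0.0
    for c in range(k):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-12  # guard against zero variance
        # sum of per-point, per-dimension Gaussian log densities for this cluster
        log_lik += -0.5 * np.sum(np.log(2 * np.pi * var) + (Xc - mu) ** 2 / var)

    return -2.0 * log_lik + kappa * np.log(n)

# Compare candidate cluster counts: the smallest BIC wins.
X = np.random.default_rng(0).normal(size=(300, 2))
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(kmeans_bic(X, labels, k), 1))
```

In practice you would compute this for several candidate values of $k$ and prefer the one with the smallest BIC.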