
Principal Component Analysis (PCA) is used to reduce n-dimensional data to k-dimensional data, typically to speed up machine learning. After PCA is applied, one can check how much of the original dataset's variance remains in the reduced dataset. A common goal is to retain between 90% and 99% of the variance.
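For instance, here is a minimal sketch of checking the retained variance (assuming scikit-learn and a placeholder NumPy matrix `X` standing in for real data):

```python
# Minimal sketch: check how much variance a k-dimensional PCA keeps.
# X is placeholder data; replace it with your own feature matrix.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(500, 20)     # placeholder n-dimensional data
k = 5
pca = PCA(n_components=k).fit(X)
retained = pca.explained_variance_ratio_.sum()  # fraction of variance retained
print(f"Variance retained with k={k}: {retained:.2%}")
```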

My question is: is it considered good practice to try different values of the parameter k (the dimensionality of the reduced dataset) and then check the resulting models against a cross-validation dataset, in the same way we pick good values for other hyperparameters such as regularization lambdas and thresholds?

J. Doe

1 Answer


Your emphasis on using a validation set rather than the training set for selecting $k$ is a good practice and should be followed. However, we can do even better!

The parameter $k$ in $\text{PCA}$ is more special than a general hyper-parameter, because the solution to $\text{PCA}(k)$ is already contained in $\text{PCA}(K)$ for $K > k$: it consists of the first $k$ eigenvectors (those corresponding to the $k$ largest eigenvalues) of $\text{PCA}(K)$. Therefore, instead of running $\text{PCA}(1)$, $\text{PCA}(2)$, ..., $\text{PCA}(K)$ separately on the training data, as we generally do for a hyper-parameter, we only need to run $\text{PCA}(K)$ once to obtain the solution for every $k \in \{1,..,K\}$.
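To illustrate the nesting, here is a minimal sketch (using NumPy, scikit-learn, and synthetic data, which are my own assumptions for the example) showing that the first $k$ components of $\text{PCA}(K)$ coincide with those of $\text{PCA}(k)$:

```python
# Sketch: the solution of PCA(k) is contained in PCA(K) for K > k.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10) @ rng.randn(10, 10)      # synthetic correlated data

K, k = 8, 3
pca_K = PCA(n_components=K).fit(X)
pca_k = PCA(n_components=k).fit(X)

# Components may differ only by sign, so compare absolute values.
assert np.allclose(np.abs(pca_K.components_[:k]),
                   np.abs(pca_k.components_), atol=1e-8)
print(f"The first {k} components of PCA({K}) equal those of PCA({k}).")
```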

As a result, the process would be as follows (a code sketch appears after the list):

  1. Run $\text{PCA}(K)$ for the largest acceptable $K$ on the training set,
  2. Plot, or tabulate, the ($k$, retained variance) pairs on the validation set,
  3. Select the smallest $k$ that reaches the minimum acceptable variance, e.g. 90% or 99%.
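Here is a hedged sketch of these three steps, assuming NumPy arrays `X_train` / `X_val`, scikit-learn's `PCA`, and a 90% threshold (all names are illustrative, not part of the question):

```python
# Sketch of steps 1-3: one PCA(K) fit on the training set, then pick the
# smallest k whose retained variance on the validation set meets a threshold.
import numpy as np
from sklearn.decomposition import PCA

def choose_k(X_train, X_val, K, threshold=0.90):
    # Step 1: a single PCA fit with the largest acceptable K.
    pca = PCA(n_components=K).fit(X_train)

    # Step 2: retained variance on the validation set for every k <= K,
    # using the mean and components learned on the training set.
    X_val_centered = X_val - pca.mean_
    total_var = np.var(X_val_centered, axis=0).sum()
    scores = X_val_centered @ pca.components_.T   # projections on all K axes
    retained = np.cumsum(np.var(scores, axis=0)) / total_var

    # Step 3: smallest k that reaches the threshold (or K if none does).
    meets = retained >= threshold
    k = K if not meets.any() else int(np.argmax(meets)) + 1
    return k, retained
```

The design choice here is to measure retained variance on the validation data while keeping the mean and components learned from the training data, so the validation set plays no role in fitting the projection itself.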

And N-fold cross-validation would be as follows (see the sketch after the list):

  1. Run $\text{PCA}(K)$ for the largest acceptable $K$ on the N training folds,
  2. Plot, or tabulate, the ($k$, average of the N retained variances) pairs on the held-out folds,
  3. Select the smallest $k$ whose average variance reaches the minimum acceptable level, e.g. 90% or 99%.
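A corresponding sketch for the N-fold variant, reusing the hypothetical `choose_k` helper above together with scikit-learn's `KFold` (again an illustrative assumption, not a prescribed implementation):

```python
# Sketch: average the per-fold (k, retained variance) curves, then pick the
# smallest k whose average reaches the threshold.
import numpy as np
from sklearn.model_selection import KFold

def choose_k_cv(X, K, threshold=0.90, n_folds=5):
    curves = []
    for train_idx, val_idx in KFold(n_splits=n_folds, shuffle=True,
                                    random_state=0).split(X):
        _, retained = choose_k(X[train_idx], X[val_idx], K, threshold)
        curves.append(retained)

    avg_retained = np.mean(curves, axis=0)        # average of the N curves
    meets = avg_retained >= threshold
    k = K if not meets.any() else int(np.argmax(meets)) + 1
    return k, avg_retained
```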

Also, here is a related post that asks "why do we choose principal components based on maximum variance explained?".

Esmailian