# How to do (semi) supervised learning with principal component analysis (PCA)

By itself, principal component analysis (PCA) is an *unsupervised* learning method, meaning it does not take into account any labels or prediction variables of your data. PCA is simply a common method for dimensionality reduction of your $X$'s, without worrying about the $y$'s.

Most people intuitively understand this and nod their heads when they hear "PCA". But the real power of PCA comes from using it in some supervised prediction task. However, it's not immediately obvious how to go from the task of reducing the dimensions of $X$ to making predictions about $y$. This post is a quick explainer on how to use PCA in supervised learning.

## 1. PCA your training data

The goal of PCA is to represent your data $X$ in an orthonormal basis $W$. The orthogonality of this basis is what allows us to identify the "principal" components (which are sometimes interpreted as inherent latent factors of your data's structure). The coordinates of your data in this new basis will be represented as $Z$, i.e.:

$$X = Z W^\top$$

Because $W$ is orthonormal, we can invert it simply by taking its transpose: $W^{-1} = W^\top$. This allows us to transform our raw data $X$ into the orthonormal basis simply by multiplying by $W$:

$$Z = X W$$

To reduce dimensionality, let's pick some number of components $k$. Assuming our basis vectors in $W$ are ordered from largest to smallest eigenvalue (i.e., the eigenvector corresponding to the largest eigenvalue is first, etc.), this amounts to simply keeping the first $k$ columns of $W$, which I'll call $W_k$. This results in an "approximated" version of $Z$, which I'll call $\tilde{Z}$:

$$\tilde{Z} = X W_k$$
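To make this concrete, here's a minimal numpy sketch of this step. The toy data, variable names, and the choice of $k$ are all placeholders I'm assuming for illustration:

```python
import numpy as np

# Toy training data: n samples, d features (placeholder for your real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# Center the data; PCA is defined on zero-mean features
mu = X.mean(axis=0)
Xc = X - mu

# The orthonormal basis W comes from the eigenvectors of the covariance matrix
eigvals, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]   # sort from largest to smallest eigenvalue
W = W[:, order]

# Keep the first k columns of W and project the training data onto them
k = 5
W_k = W[:, :k]
Z_train = Xc @ W_k                  # the (n x k) approximated representation
```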

## 2. Train a classifier on your transformed training data

Now that we have a $k$-dimensional representation of our training data $\tilde{Z}$, you can train your favorite classifier (SVM, kNN, logistic regression, etc.) on the *transformed* features $\tilde{Z}$. This amounts to finding the "best" fit $\hat{\beta}$ for some model $f(\tilde{Z}; \beta)$. For example, if your goal was to minimize the squared error between the data and your model prediction, your estimated parameters would simply be:

$$\hat{\beta} = \arg\min_{\beta} \, \lVert y - f(\tilde{Z}; \beta) \rVert^2$$
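Continuing the sketch above, fitting a classifier on the transformed features might look like this (the labels `y_train` and the choice of logistic regression are assumed placeholders):

```python
from sklearn.linear_model import LogisticRegression

# Assumed placeholder labels for the toy training data above
y_train = (X[:, 0] + X[:, 1] > 0).astype(int)

# Train any off-the-shelf classifier on the k-dimensional features
clf = LogisticRegression()
clf.fit(Z_train, y_train)
```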

Going through all this trouble pays off when the number of raw features $d$ is very large. If you have 100,000 features, running your favorite classifier on them may take a very long time. However, if you can pick a much smaller number of principal components (i.e., $k \ll d$) that accurately capture the covariance structure of your data, you can dramatically improve the efficiency of your classifier. This can also be thought of as a form of regularization, since it's unlikely that all 100,000 features of your dataset have a meaningful effect on your outcome variable.

## 3. Project your test data into the same $k$-dimensional subspace

Where do you go once you've performed PCA on your training data and built a classifier on your transformed data $\tilde{Z}$? The key is to realize that $W_k$ is in some sense a canonical transformation from our space of $d$ features down to a space of $k$ features (or at least the best such transformation we could find using our training data). Thus, we can hit our *test* data $X^*$ with the same transformation, resulting in a $k$-dimensional representation of our test features:

$$\tilde{Z}^* = X^* W_k$$
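In code, the crucial point is to reuse the centering and the columns $W_k$ learned from the *training* data. Continuing the sketch (with placeholder test data):

```python
# Placeholder test data with the same d raw features
X_test = rng.normal(size=(100, 20))

# Apply the same centering (training mean) and the same projection W_k
Z_test = (X_test - mu) @ W_k        # the (100 x k) representation of the test data
```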

## 4. Run your classifier on your transformed test data

We can now use the classifier trained on the $k$-dimensional representation of our training data (with the corresponding weights $\hat{\beta}$) to make predictions on the $k$-dimensional representation of our test data:

$$\hat{y}^* = f(\tilde{Z}^*; \hat{\beta})$$
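The final step of the sketch is just calling the trained classifier on the projected test features:

```python
# Predict with the classifier that was trained on the k-dimensional training features
y_pred = clf.predict(Z_test)
print(y_pred[:10])
```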

And that's how you use PCA to make predictions on test data. Again, the key is to think of $W_k$ as a rotating transformation that *projects* your raw features into a $k$-dimensional subspace. The entire goal of PCA is essentially to learn this projection operator. The same projection works on both training data and test data, allowing you to build a classifier on your new features and use that same classifier to make predictions on your test data.
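For what it's worth, if you use scikit-learn, this whole recipe is roughly what a two-step pipeline does for you: `PCA` learns the centering and $W_k$ from the training data, and the pipeline applies that exact same projection again at prediction time. (The `n_components=5` below is just an assumed example value.)

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# PCA -> classifier, fit on raw training features and applied as-is to test features
model = make_pipeline(PCA(n_components=5), LogisticRegression())
model.fit(X, y_train)
y_pred = model.predict(X_test)
```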