# Machine Learning: Dimensionality Reduction

## A Note from Ng

Yao Yao on September 6, 2014

## 1. Motivation

Dimensionality Reduction helps in:

• Data Compression
• Visualization (because we can only plot 2D or 3D)

Principal Component Analysis (主成分分析), abbreviated as PCA, is the algorithm to implement Dimensionality Reduction。

## 3. PCA: Algorithm

### 3.1 Data preprocessing

Given training set $x^{(1)}, \dots, x^{(m)}$, the preprocessing goes like mean normalization:

• calculate $\mu_j = \frac{1}{m} \sum_{i=1}^m { x_j^{(i)} }$
• replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$

If there are different features on different scales (e.g. $x_1 = \text{size of house}$, $x_2 = \text{number of bedrooms}$), scale features to have comparable range of values.

### 3.2 PCA algorithm and implementation in Octave

Suppose we are reducing data from $n$-dimensions to $k$-dimensions

#### Step 1: Compute covariance matrix (协方差矩阵):

Non-vectorized formula is $\Sigma = \frac{1}{m} \sum_{i=1}^n {x^{(i)}*(x^{(i)})^T}$

$\because x^{(i)} = \begin{vmatrix} x^{(i)}_1 \newline x^{(i)}_2 \newline … \newline x^{(i)}_n \end{vmatrix}$ ($\operatorname{size} = n \times 1$)

$\therefore \operatorname{size}(\Sigma) = (n \times 1) \ast (1 \times n) = n \times n$

$\therefore$ Vectorized formula is $\Sigma = \frac{1}{m} X^T X$

#### Step 2: Compute eigenvectors of covariance matrix

[U, S, V] = svd(Σ), svd for Singular Value Decomposition (奇异值分解). eig(Σ) also works but less stable.

Covariance matrix always satisfies a property called “symmetric positive semidefinite” (对称半正定矩阵), so svd == eig.

The structure of U in [U, S, V] is:

$U = \begin{vmatrix} \vert & \vert & & \vert \newline u^{(1)} & u^{(2)} & … & u^{(n)} \newline \vert & \vert & & \vert \end{vmatrix}$ ($\operatorname{size}=n \times n$)

$u^{(i)} = \begin{vmatrix} u^{(i)}_1 \newline u^{(i)}_2 \newline … \newline u^{(i)}_n \end{vmatrix}$ ($\operatorname{size} = n \times 1$)

#### Step 3: Generate the $k$ dimensions

We want to reduce to $k$-dimensions, so pick up the first $k$ columns of U, i.e. $u^{(1)}, u^{(2)}, …, u^{(k)}$ into $U_{reduce}$

$U_{reduce} = \begin{vmatrix} \vert & \vert & & \vert \newline u^{(1)} & u^{(2)} & … & u^{(k)} \newline \vert & \vert & & \vert \end{vmatrix}$ ($\operatorname{size}=n \times k$)

In Octave, use U_reduce = U(:, 1:k).

The new dimension $z^{(i)} = (U_{reduce})^T*x^{(i)}$ ($\operatorname{size} = (k \times n) \ast (n \times 1) = k \times 1$)

Vectorized formula is $Z = X*U_{reduce}$ ($\operatorname{size} = (m \times n) \ast (n \times k) = m \times k$)

The structure of $Z$ is:

$Z = \begin{vmatrix} – (z^{(1)})^T – \newline – (z^{(2)})^T – \newline …… \newline – (z^{(m)})^T – \end{vmatrix}$ ($\operatorname{size} = m \times k$)

$z^{(i)} = \begin{vmatrix} z^{(i)}_1 \newline z^{(i)}_2 \newline … \newline z^{(i)}_k \end{vmatrix}$ ($\operatorname{size} = k \times 1$)

## 4. Reconstruction from Compressed Representation

Vectorized formula is: $X_{approx} = Z*U_{reduce}^T$

## 5. Choosing the Number of Principal Components

I.e. how to choose $k$.

### 5.1 Algorithm

Average squared projection error: $ASPE = \frac{1}{m} \sum_{i=1}^m { \lVert x^{(i)} - x^{(i)}_{approx} \rVert ^2 }$

Total variation in the data: $TV = \frac{1}{m} \sum_{i=1}^m { \lVert x^{(i)} \rVert^2 }$

Typically, choose $k$ to be smallest value that satisfy $\frac{ASPE}{TV} <= 0.01$, which means “99% of variance is retained”

### 5.2 Convenient calculation with SVD results

For a given $k$, $\frac{ASPE}{TV} = 1 - \frac{\sum_{i=1}^k {s_{ii}}} {\sum_{i=1}^n {s_{ii}}}$.（注意这里 $s_{ii}$ 是递减的，i.e. $s_{11}$ 占 variance 的比重最大，$s_{22}$ 次之，依次类推）

## 6. Advice for Applying PCA

### 6.1 Good use of PCA

Application of PCA:

• Compression
• Reduce memory/disk needed to store data
• Speed up learning algorithm
• choose $k$ by xx% of variance retaining
• Visualization
• choose $k=2$ or $k=3$

PCA can be used to speedup learning algorithm, most commonly the supervised learning.

Suppose we have $(x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)})$ and $n=10000$ (feature#). Extract inputs to make a unlabeled dataset $x^{(1)}, …, x^{(m)} \in \mathbb{R}^{10000}$. If PCA applied, say we reduce to 1000 features, we would have $z^{(1)}, …, z^{(m)} \in \mathbb{R}^{1000}$. Then we have a new training set $(z^{(1)}, y^{(1)}), …, (z^{(m)}, y^{(m)})$, which is much cheaper computationally.

### 6.2 Bad use of PCA

• To prevent overfitting
• Use PCA to reduce the number of features, thus, fewer features, less likely to overfit.

This is bad use because PCA is not a good way to address overfitting. Use regularization instead.

### 6.3 Implementation tips

• Note 1: The mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. But this mapping can be applied as well to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross validation and test sets.
• Note 2: Before implemen1ng PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that does not do what you want, then implement PCA and consider using $z^{(i)}$.