This post is a more detailed mathematical look at TLDR part 3 of chapter 1, Linear Algebra, from the Deep Learning Book by Goodfellow. In particular, it is a more detailed mathematical analysis of PCA as introduced in that previous post on Linear Algebra, so readers coming from there will find a lot of overlap between the two posts.
Consider m points represented by m vectors: ${ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, …, \mathbf{x}^{(m)} }$, all in $\mathbb{R}^n$. The goal is to compress each point into a code vector $\mathbf{c} \in \mathbb{R}^l$ with $l < n$, using an encoding function $f(\mathbf{x}) = \mathbf{c}$ and a decoding function $\mathbf{g}$ that reconstructs the point from its code.
Design choices made:
Decoding function is linear.
\(\mathbf{g}(\mathbf{c}) = \mathbf{Dc}\) where $\mathbf{D} \in \mathbb{R}^{n,l}$
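To make the decoding step concrete, here is a minimal numpy sketch; the dimensions n = 5, l = 2 and the random values are purely illustrative and not part of the derivation:

```python
import numpy as np

n, l = 5, 2                       # ambient dimension and code dimension (illustrative)
rng = np.random.default_rng(0)

D = rng.standard_normal((n, l))   # decoding matrix D in R^{n,l}
c = rng.standard_normal(l)        # a code vector c in R^l

x_decoded = D @ c                 # linear decoding: g(c) = Dc, a vector in R^n
print(x_decoded.shape)            # (5,)
```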
Optimal coding: The code $\mathbf{c^*}$ is chosen such that the reconstructed point $\mathbf{g}(f(\mathbf{x}))$ is as close as possible to the original point $\mathbf{x}$ in terms of the L2 norm:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2^2}\]Simplifying condition: To simplify the problem, we additionally require the columns of $\mathbf{D}$ to be orthogonal to each other and to have unit norm ($\mathbf{D}$ is not strictly an orthogonal matrix unless $l = n$).
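One simple way to build such a matrix in code (so that $\mathbf{D^TD} = \mathbf{I}_l$ holds) is to take the Q factor of a reduced QR decomposition of a random matrix; this is only an illustrative sketch, not a step in the derivation:

```python
import numpy as np

n, l = 5, 2
rng = np.random.default_rng(0)

# The Q factor of a reduced QR decomposition has orthonormal columns,
# which is exactly the constraint D^T D = I_l used below.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

print(np.allclose(D.T @ D, np.eye(l)))  # True
```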
The problem is defined as a minimization problem:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2^2}\]Substituting $g(\mathbf{c}) = \mathbf{Dc}$ and expanding the squared norm as $(\mathbf{x} - \mathbf{Dc})^T(\mathbf{x} - \mathbf{Dc})$, our problem becomes:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{ \mathbf{x}^T\mathbf{x} - 2 \mathbf{x^TDc} + \mathbf{c^TD^TDc} }\]Since the variable being minimized over is $\mathbf{c}$, the term $\mathbf{x}^T\mathbf{x}$, which does not depend on $\mathbf{c}$, can be dropped:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{ -2 \mathbf{x^TDc} + \mathbf{c^TD^TDc} }\]Using the constraint $\mathbf{D^TD} = \mathbf{I}_l$ (orthonormal columns of $\mathbf{D}$):
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{- 2 \mathbf{x^TDc} + \mathbf{c^Tc} }\]Using calculus: the gradient with respect to $\mathbf{c}$ must be zero at the minimum:
\[\nabla_{\mathbf{c}}{ (- 2 \mathbf{x^TDc} + \mathbf{c^Tc}) } = \mathbf{0} \\ \nabla_{\mathbf{c}}{(- 2 \mathbf{x^TDc} )} + \nabla_{\mathbf{c}}{(\mathbf{c^Tc} )} = \mathbf{0} \\ {(- 2 \mathbf{D^Tx})} + {2 \mathbf{c}} = \mathbf{0} \\ \mathbf{c} = \mathbf{D^Tx}\]Thus our optimal linear encoder is:
\[\mathbf{f(x)} = \mathbf{c} = \mathbf{D^Tx}\]Since the decoding function is $\mathbf{g}(\mathbf{c}) = \mathbf{Dc}$, the reconstruction $\mathbf{r}$ of $\mathbf{x}$ after encoding followed by decoding is:
\[\mathbf{r} = \mathbf{g(f(x))} = \mathbf{g(D^Tx)} = \mathbf{DD^Tx}\]The remaining problem is to find the matrix $\mathbf{D}$ that makes these reconstructions as accurate as possible across all m points.
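Before searching for the best $\mathbf{D}$, the closed-form encoder can be sanity-checked numerically: with orthonormal columns in $\mathbf{D}$, the code $\mathbf{c} = \mathbf{D^Tx}$ should coincide with the least-squares solution of $\min_{\mathbf{c}} \vert\vert \mathbf{x} - \mathbf{Dc} \vert\vert_2$. A small sketch with random, illustrative data:

```python
import numpy as np

n, l = 5, 2
rng = np.random.default_rng(1)

D, _ = np.linalg.qr(rng.standard_normal((n, l)))         # orthonormal columns: D^T D = I_l
x = rng.standard_normal(n)

c_closed_form = D.T @ x                                   # optimal encoder f(x) = D^T x
c_least_squares, *_ = np.linalg.lstsq(D, x, rcond=None)   # argmin_c ||x - Dc||_2

print(np.allclose(c_closed_form, c_least_squares))        # True
r = D @ c_closed_form                                     # reconstruction r = D D^T x
print(np.linalg.norm(x - r))                              # leftover reconstruction error
```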
Following the book, start with the case $l = 1$, where $\mathbf{D}$ reduces to a single unit vector $\mathbf{d} \in \mathbb{R}^n$ and the reconstruction of a point is $\mathbf{dd^Tx}$. Minimizing the total reconstruction error over all m points (writing $\mathbf{x_i}$ for $\mathbf{x}^{(i)}$), the optimization problem reduces to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert \mathbf{x_i} - \mathbf{dd^Tx_i} \vert\vert_2^2} }\]In this optimization problem, the quantity $\mathbf{x_i} - \mathbf{dd^Tx_i}$ is a vector in $\mathbb{R}^n$, but it can equally be viewed as a matrix in $\mathbb{R}^{n,1}$, and transposing it does not change its norm.
The optimization can then be seen as:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i} - \mathbf{dd^Tx_i})^T \vert\vert_2^2} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i}^T - (\mathbf{dd^Tx_i})^T )\vert\vert_2^2} }\\ = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i}^T - \mathbf{x_i^Tdd^T} )\vert\vert_2^2} }\]Defining $\mathbf{X}$ to be a matrix containing all m points stacked as rows:
\[\mathbf{X} = \begin{bmatrix}\mathbf{- x_1^T -}\\\mathbf{- x_2^T -}\\...\\\mathbf{- x_m^T -} \end{bmatrix} \in \mathbb{R}^{m,n}\]Using this definition for $\mathbf{X}$, and noting that the sum of squared row norms of a matrix is exactly its squared Frobenius norm, the optimization problem can be written compactly as:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{\vert\vert (\mathbf{X} - \mathbf{Xdd^T}) \vert\vert_F^2}\]We also have a property linking the trace to this norm (take this relation as given for now): $Tr(\mathbf{AA^T}) = \vert \vert \mathbf{A} \vert \vert_F^2$, and since $Tr(\mathbf{AA^T}) = Tr(\mathbf{A^TA})$, the same holds for $Tr(\mathbf{A^TA})$. This reduces the optimization to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin} \ }{ Tr\{(\mathbf{X} - \mathbf{Xdd^T})^T (\mathbf{X} - \mathbf{Xdd^T}) \} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{ Tr\{(\mathbf{X^TX} - \mathbf{X^TXdd^T} - \mathbf{dd^TX^TX} + \mathbf{dd^TX^TXdd^T}) \} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{ Tr(\mathbf{X^TX}) - Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})}\]Since $Tr(\mathbf{X^TX})$ does not depend on the minimization variable $\mathbf{d}$, it can be dropped from the loss function, which reduces the optimization problem to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})}\]Also, from our simplifying condition on $\mathbf{D}$, we chose $\mathbf{d}$ to be a unit vector. This can be used as a constraint here: $\mathbf{d^Td} = 1$. Together with the cyclic property of the trace, this means:
\[Tr(\mathbf{dd^TX^TXdd^T}) = Tr(\mathbf{X^TXdd^Tdd^T}) = Tr(\mathbf{X^TXdd^T})\]The loss function then becomes:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{X^TXdd^T}) + Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmax \ }}{Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmax \ }}{Tr(\mathbf{d^TX^TXd})}\]The unit vector that maximizes this quantity, subject to $\mathbf{d^Td} = 1$, is the eigenvector of $\mathbf{X^TX}$ with the largest eigenvalue.
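Both the trace identity $Tr(\mathbf{AA^T}) = \vert \vert \mathbf{A} \vert \vert_F^2$ that was taken as given and this eigenvector claim can be checked numerically. The sketch below uses random, illustrative data and compares the reconstruction error of the top eigenvector of $\mathbf{X^TX}$ against many random unit vectors:

```python
import numpy as np

m, n = 100, 5
rng = np.random.default_rng(2)
X = rng.standard_normal((m, n))

# Trace identity: Tr(A A^T) == ||A||_F^2
A = rng.standard_normal((4, 3))
print(np.isclose(np.trace(A @ A.T), np.linalg.norm(A, "fro") ** 2))  # True

# Eigenvector of X^T X with the largest eigenvalue (eigh sorts eigenvalues ascending).
_, eigvecs = np.linalg.eigh(X.T @ X)
d_best = eigvecs[:, -1]

def recon_error(d):
    """Squared Frobenius norm of X - X d d^T for a unit vector d."""
    return np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

# No random unit vector should achieve a lower reconstruction error.
random_ds = rng.standard_normal((1000, n))
random_ds /= np.linalg.norm(random_ds, axis=1, keepdims=True)
print(all(recon_error(d) >= recon_error(d_best) - 1e-9 for d in random_ds))  # True
```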
Generalizing to the case where $l > 1$, the optimal $\mathbf{D}$ is given by the $l$ eigenvectors of $\mathbf{X^TX}$ corresponding to the $l$ largest eigenvalues, stacked as its columns.
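As a final sanity check, the sketch below (again with random, illustrative data) builds $\mathbf{D}$ from the $l$ top eigenvectors of $\mathbf{X^TX}$ and verifies that the resulting reconstruction matches the projection onto the top $l$ right-singular vectors of $\mathbf{X}$, which span the same subspace:

```python
import numpy as np

m, n, l = 100, 5, 2
rng = np.random.default_rng(3)
X = rng.standard_normal((m, n))

# Columns of D: eigenvectors of X^T X with the l largest eigenvalues.
_, eigvecs = np.linalg.eigh(X.T @ X)      # eigenvalues come out in ascending order
D = eigvecs[:, -l:]

codes = X @ D                             # each row is a code c = D^T x
X_reconstructed = codes @ D.T             # each row is a reconstruction r = D D^T x

# The top-l right-singular vectors of X span the same subspace,
# so projecting onto them gives the same reconstruction.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = X @ Vt[:l].T @ Vt[:l]
print(np.allclose(X_reconstructed, X_svd))  # True
```

One practical note: library implementations of PCA usually center the data (subtract the per-feature mean from $\mathbf{X}$) before this step; the derivation above works directly with $\mathbf{X}$ as given.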