This post is a more detailed mathematical look at TLDR part 3 of chapter 1, Linear Algebra, from the Deep Learning Book by Goodfellow. In particular, it is a more detailed mathematical analysis of PCA as introduced in that previous post on Linear Algebra, so readers coming from there will find a lot of overlap between the two posts.
Consider m points represented by m vectors: ${ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, …, \mathbf{x}^{(m)} }$, all in $\mathbb{R}^n$. The goal is to compress each point into a code vector $\mathbf{c} \in \mathbb{R}^l$ with $l < n$, using an encoding function $f(\mathbf{x}) = \mathbf{c}$ and a decoding function $\mathbf{g}$ that reconstructs the point from its code.
Design choices made:
Decoding function is linear.
\(\mathbf{g}(\mathbf{c}) = \mathbf{Dc}\) where $\mathbf{D} \in \mathbb{R}^{n,l}$
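To make the decoding step concrete, here is a minimal numpy sketch; the dimensions n = 5, l = 2 and the random values are purely illustrative and not part of the derivation:

```python
import numpy as np

n, l = 5, 2                       # ambient dimension and code dimension (illustrative)
rng = np.random.default_rng(0)

D = rng.standard_normal((n, l))   # decoding matrix D in R^{n,l}
c = rng.standard_normal(l)        # a code vector c in R^l

x_decoded = D @ c                 # linear decoding: g(c) = Dc, a vector in R^n
print(x_decoded.shape)            # (5,)
```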
Optimal coding: The code $\mathbf{c^*}$ is chosen such that the reconstructed point $\mathbf{g}(f(\mathbf{x}))$ is as close as possible to the original point $\mathbf{x}$ in terms of the L2 norm:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2^2}\]Simplifying condition: To simplify the problem, we additionally require the columns of $\mathbf{D}$ to be orthogonal to each other and to have unit norm ($\mathbf{D}$ is not strictly an orthogonal matrix unless $l = n$).
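One simple way to build such a matrix in code (so that $\mathbf{D^TD} = \mathbf{I}_l$ holds) is to take the Q factor of a reduced QR decomposition of a random matrix; this is only an illustrative sketch, not a step in the derivation:

```python
import numpy as np

n, l = 5, 2
rng = np.random.default_rng(0)

# The Q factor of a reduced QR decomposition has orthonormal columns,
# which is exactly the constraint D^T D = I_l used below.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))

print(np.allclose(D.T @ D, np.eye(l)))  # True
```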
The problem is defined as a minimization problem:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c}) \vert\vert_2^2}\]Substituting $g(\mathbf{c}) = \mathbf{Dc}$ and expanding the squared norm as $(\mathbf{x} - \mathbf{Dc})^T(\mathbf{x} - \mathbf{Dc})$, our problem becomes:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{ \mathbf{x}^T\mathbf{x} - 2 \mathbf{x^TDc} + \mathbf{c^TD^TDc} }\]Since the variable being minimized over is $\mathbf{c}$, the term $\mathbf{x}^T\mathbf{x}$, which does not depend on $\mathbf{c}$, can be dropped:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{ -2 \mathbf{x^TDc} + \mathbf{c^TD^TDc} }\]Using the constraint $\mathbf{D^TD} = \mathbf{I}_l$ (orthonormal columns of $\mathbf{D}$):
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin \ }}{- 2 \mathbf{x^TDc} + \mathbf{c^Tc} }\]Using calculus: the gradient with respect to $\mathbf{c}$ must be zero at the minimum:
\[\nabla_{\mathbf{c}}{ (- 2 \mathbf{x^TDc} + \mathbf{c^Tc}) } = \mathbf{0} \\ \nabla_{\mathbf{c}}{(- 2 \mathbf{x^TDc} )} + \nabla_{\mathbf{c}}{(\mathbf{c^Tc} )} = \mathbf{0} \\ {(- 2 \mathbf{D^Tx})} + {2 \mathbf{c}} = \mathbf{0} \\ \mathbf{c} = \mathbf{D^Tx}\]Thus our optimal linear encoder is:
\[\mathbf{f(x)} = \mathbf{c} = \mathbf{D^Tx}\]Since the decoding function is $\mathbf{g}(\mathbf{c}) = \mathbf{Dc}$, the reconstruction $\mathbf{r}$ of $\mathbf{x}$ after encoding followed by decoding is:
\[\mathbf{r} = \mathbf{g(f(x))} = \mathbf{g(D^Tx)} = \mathbf{DD^Tx}\]The remaining problem is to find the matrix $\mathbf{D}$ that makes these reconstructions as accurate as possible across all m points.
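Before searching for the best $\mathbf{D}$, the closed-form encoder can be sanity-checked numerically: with orthonormal columns in $\mathbf{D}$, the code $\mathbf{c} = \mathbf{D^Tx}$ should coincide with the least-squares solution of $\min_{\mathbf{c}} \vert\vert \mathbf{x} - \mathbf{Dc} \vert\vert_2$. A small sketch with random, illustrative data:

```python
import numpy as np

n, l = 5, 2
rng = np.random.default_rng(1)

D, _ = np.linalg.qr(rng.standard_normal((n, l)))         # orthonormal columns: D^T D = I_l
x = rng.standard_normal(n)

c_closed_form = D.T @ x                                   # optimal encoder f(x) = D^T x
c_least_squares, *_ = np.linalg.lstsq(D, x, rcond=None)   # argmin_c ||x - Dc||_2

print(np.allclose(c_closed_form, c_least_squares))        # True
r = D @ c_closed_form                                     # reconstruction r = D D^T x
print(np.linalg.norm(x - r))                              # leftover reconstruction error
```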
Following the book, start with the case $l = 1$, where $\mathbf{D}$ reduces to a single unit vector $\mathbf{d} \in \mathbb{R}^n$ and the reconstruction of a point is $\mathbf{dd^Tx}$. Minimizing the total reconstruction error over all m points (writing $\mathbf{x_i}$ for $\mathbf{x}^{(i)}$), the optimization problem reduces to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert \mathbf{x_i} - \mathbf{dd^Tx_i} \vert\vert_2^2} }\]In this optimization problem, the quantity $\mathbf{x_i} - \mathbf{dd^Tx_i}$ is a vector in $\mathbb{R}^n$, but it can equally be viewed as a matrix in $\mathbb{R}^{n,1}$, and transposing it does not change its norm.
The optimization can then be seen as:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i} - \mathbf{dd^Tx_i})^T \vert\vert_2^2} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i}^T - (\mathbf{dd^Tx_i})^T )\vert\vert_2^2} }\\ = \underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert (\mathbf{x_i}^T - \mathbf{x_i^Tdd^T} )\vert\vert_2^2} }\]Defining $\mathbf{X}$ to be a matrix containing all m points stacked as rows:
\[\mathbf{X} = \begin{bmatrix}\mathbf{- x_1^T -}\\\mathbf{- x_2^T -}\\...\\\mathbf{- x_m^T -} \end{bmatrix} \in \mathbb{R}^{m,n}\]Using this definition for $\mathbf{X}$, and noting that the sum of squared row norms of a matrix is exactly its squared Frobenius norm, the optimization problem can be written compactly as:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{\vert\vert (\mathbf{X} - \mathbf{Xdd^T}) \vert\vert_F^2}\]We also have a property linking the trace to this norm (take this relation as given for now): $Tr(\mathbf{AA^T}) = \vert \vert \mathbf{A} \vert \vert_F^2$, and since $Tr(\mathbf{AA^T}) = Tr(\mathbf{A^TA})$, the same holds for $Tr(\mathbf{A^TA})$. This reduces the optimization to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin} \ }{ Tr\{(\mathbf{X} - \mathbf{Xdd^T})^T (\mathbf{X} - \mathbf{Xdd^T}) \} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{ Tr\{(\mathbf{X^TX} - \mathbf{X^TXdd^T} - \mathbf{dd^TX^TX} + \mathbf{dd^TX^TXdd^T}) \} } \\ = \underset{\mathbf{d}}{\operatorname{argmin}}{ Tr(\mathbf{X^TX}) - Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})}\]Since $Tr(\mathbf{X^TX})$ does not depend on the minimization variable $\mathbf{d}$, it can be dropped from the loss function, which reduces the optimization problem to:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})}\]Also, from our simplifying condition on $\mathbf{D}$, we chose $\mathbf{d}$ to be a unit vector. This can be used as a constraint here: $\mathbf{d^Td} = 1$. Together with the cyclic property of the trace, this means:
\[Tr(\mathbf{dd^TX^TXdd^T}) = Tr(\mathbf{X^TXdd^Tdd^T}) = Tr(\mathbf{X^TXdd^T})\]The loss function then becomes:
\[\mathbf{d^*} = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{dd^TX^TX}) + Tr(\mathbf{dd^TX^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T}) - Tr(\mathbf{X^TXdd^T}) + Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmin \ }}{- Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmax \ }}{Tr(\mathbf{X^TXdd^T})} \\ = \underset{\mathbf{d}}{\operatorname{argmax \ }}{Tr(\mathbf{d^TX^TXd})}\]The unit vector that maximizes this quantity, subject to $\mathbf{d^Td} = 1$, is the eigenvector of $\mathbf{X^TX}$ with the largest eigenvalue.
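Both the trace identity $Tr(\mathbf{AA^T}) = \vert \vert \mathbf{A} \vert \vert_F^2$ that was taken as given and this eigenvector claim can be checked numerically. The sketch below uses random, illustrative data and compares the reconstruction error of the top eigenvector of $\mathbf{X^TX}$ against many random unit vectors:

```python
import numpy as np

m, n = 100, 5
rng = np.random.default_rng(2)
X = rng.standard_normal((m, n))

# Trace identity: Tr(A A^T) == ||A||_F^2
A = rng.standard_normal((4, 3))
print(np.isclose(np.trace(A @ A.T), np.linalg.norm(A, "fro") ** 2))  # True

# Eigenvector of X^T X with the largest eigenvalue (eigh sorts eigenvalues ascending).
_, eigvecs = np.linalg.eigh(X.T @ X)
d_best = eigvecs[:, -1]

def recon_error(d):
    """Squared Frobenius norm of X - X d d^T for a unit vector d."""
    return np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

# No random unit vector should achieve a lower reconstruction error.
random_ds = rng.standard_normal((1000, n))
random_ds /= np.linalg.norm(random_ds, axis=1, keepdims=True)
print(all(recon_error(d) >= recon_error(d_best) - 1e-9 for d in random_ds))  # True
```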
Generalizing to the case where $l > 1$, the optimal $\mathbf{D}$ is given by the $l$ eigenvectors of $\mathbf{X^TX}$ corresponding to the $l$ largest eigenvalues, stacked as its columns.
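As a final sanity check, the sketch below (again with random, illustrative data) builds $\mathbf{D}$ from the $l$ top eigenvectors of $\mathbf{X^TX}$ and verifies that the resulting reconstruction matches the projection onto the top $l$ right-singular vectors of $\mathbf{X}$, which span the same subspace:

```python
import numpy as np

m, n, l = 100, 5, 2
rng = np.random.default_rng(3)
X = rng.standard_normal((m, n))

# Columns of D: eigenvectors of X^T X with the l largest eigenvalues.
_, eigvecs = np.linalg.eigh(X.T @ X)      # eigenvalues come out in ascending order
D = eigvecs[:, -l:]

codes = X @ D                             # each row is a code c = D^T x
X_reconstructed = codes @ D.T             # each row is a reconstruction r = D D^T x

# The top-l right-singular vectors of X span the same subspace,
# so projecting onto them gives the same reconstruction.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = X @ Vt[:l].T @ Vt[:l]
print(np.allclose(X_reconstructed, X_svd))  # True
```

One practical note: library implementations of PCA usually center the data (subtract the per-feature mean from $\mathbf{X}$) before this step; the derivation above works directly with $\mathbf{X}$ as given.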