Linear Algebra 3: Eigendecomposition, SVD
This post is part 3 of the TL;DR of the chapter on Linear Algebra from the
Deep Learning Book by Goodfellow.
Eigenvalues and eigenvectors
- Think of matrix-vector multiplication as a linear transformation.
- Matrix $\mathbf{A}$ linearly transforms vector $\mathbf{x}$ into
$\mathbf{A}\mathbf{x}$.
- For a matrix $\mathbf{A}$, a nonzero vector $\mathbf{v}$ that is only scaled under this
transformation is called an eigenvector of the matrix $\mathbf{A}$, and the scale
factor is called the corresponding eigenvalue.
- If
\(\mathbf{A}\mathbf{v} = \lambda \mathbf{v}\)
then $\mathbf{v}$ is an eigenvector and $\lambda$ is an eigenvalue of
$\mathbf{A}$.
- If $\mathbf{v}$ is an eigenvector of $\mathbf{A}$, any vector $k\mathbf{v}$
(parallel to $\mathbf{v}$) is also an eigenvector of $\mathbf{A}$.
- $k\mathbf{v}$ has the same eigenvalue $\lambda$.
- Generally, only the unit vector along that direction is considered.
- If matrix $\mathbf{A}$ has $n$ linearly independent eigenvectors
$\{\mathbf{v^{(1)}}, …, \mathbf{v^{(n)}}\}$ with corresponding eigenvalues
$\{\lambda_1, …, \lambda_n\}$:
- Define: $\mathbf{V} = [\mathbf{v^{(1)}}, … , \mathbf{v^{(n)}}]$
(concatenating the eigenvectors as columns of a matrix).
- Define: $\mathbf{\lambda} = [\lambda_1, …, \lambda_n]$
(concatenating the eigenvalues as a vector).
- Then:
\[\mathbf{A} = \mathbf{V} \ \operatorname{diag}(\mathbf{\lambda}) \ \mathbf{V}^{-1}\]
- The above equation is called the eigendecomposition of $\mathbf{A}$.
- If $\mathbf{A}$ is a symmetric real-valued matrix, its eigendecomposition is guaranteed
to exist and can be written with an orthogonal eigenvector matrix:
$\mathbf{A} = \mathbf{Q} \ \operatorname{diag}(\mathbf{\lambda}) \ \mathbf{Q}^T$,
where the columns of $\mathbf{Q}$ are eigenvectors of $\mathbf{A}$.
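A minimal NumPy sketch of both facts above; the matrix is a small made-up symmetric example, purely for illustration:

```python
import numpy as np

# Small made-up symmetric matrix, purely for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Columns of V are eigenvectors; lam holds the corresponding eigenvalues.
lam, V = np.linalg.eig(A)

# Defining property: A v = lambda v for each eigenpair.
v = V[:, 0]
print(np.allclose(A @ v, lam[0] * v))   # True

# Eigendecomposition: A = V diag(lambda) V^{-1}.
A_rec = V @ np.diag(lam) @ np.linalg.inv(V)
print(np.allclose(A, A_rec))            # True
```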
Determinants
- For a square matrix $\mathbf{A}$, its determinant:
- is denoted as $det(\mathbf{A})$.
- equals the product of all its eigenvalues.
- Can be negative
- Geometric interpretation:
- Each eigenvalue gives scaling in the direction corresponding to its eigenvectors.
- Product of all eigenvalues gives total scaling along all eigenvectors.
- Therefore, determinant is geometrically the magnitude by which the space is scaled.
- A negative determinant means the orientation of the basis is reversed (the transformation includes a reflection).
- Determinant of 1 means: volume of space is preserved by transformation.
- Determinant of 0 means space is contracted completely along at least one dimension.
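A quick NumPy check of the eigenvalue-product interpretation, using a made-up matrix:

```python
import numpy as np

# Made-up matrix for illustration; its eigenvalues are 2 and 3.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

eigvals = np.linalg.eigvals(A)

# The determinant equals the product of the eigenvalues.
print(np.prod(eigvals))     # 6.0
print(np.linalg.det(A))     # 6.0 (up to floating-point error)
```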
Singular value decomposition
- Extends the concept of eigendecomposition to non-symmetric and even
rectangular matrices.
- For any matrix $\mathbf{A}$, the matrices $\mathbf{A}^T \mathbf{A}$ and
$\mathbf{A} \mathbf{A}^T$ are both symmetric.
- In general, $\mathbf{A}^T \mathbf{A} \ne \mathbf{A} \mathbf{A}^T$.
- But the nonzero eigenvalues of $\mathbf{A}^T \mathbf{A}$ and $\mathbf{A} \mathbf{A}^T$ are equal.
- The eigenvectors of the two matrices are not equal.
- In fact, they can even be of different dimensions ($\mathbf{A}^T \mathbf{A}$ is $n \times n$ while $\mathbf{A} \mathbf{A}^T$ is $m \times m$).
- Let the eigenvectors of $\mathbf{A}^T \mathbf{A}$ form the columns of $\mathbf{V}$.
- Let the eigenvectors of $\mathbf{A} \mathbf{A}^T$ form the columns of $\mathbf{U}$.
- Define $\mathbf{D}$ as a diagonal matrix, not necessarily square, whose diagonal
elements are the square roots of the eigenvalues of $\mathbf{A}^T \mathbf{A}$.
These diagonal elements are called singular values.
- $\mathbf{A}$ can then be written as:
\[\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T\]
where $\mathbf{A} \in \mathbb{R}^{m,n}$,
$\mathbf{U} \in \mathbb{R}^{m,m}$,
$\mathbf{V} \in \mathbb{R}^{n,n}$,
and $\mathbf{D} \in \mathbb{R}^{m,n}$.
- SVD is useful for partially generalizing matrix inversion to non-square matrices.
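A minimal NumPy sketch of the decomposition, using a made-up 3×2 matrix, together with a check that the singular values are the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$:

```python
import numpy as np

# Made-up rectangular (3x2) matrix for illustration.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# full_matrices=True gives square U (3x3) and V^T (2x2); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Assemble the rectangular (3x2) diagonal matrix D from the singular values.
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)

# A = U D V^T
print(np.allclose(A, U @ D @ Vt))   # True

# Singular values squared match the eigenvalues of A^T A.
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A))))  # True
```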
Moore-Penrose Pseudoinverse
- For the linear equation $\mathbf{A}\mathbf{x} = \mathbf{y}$, let the solution for $\mathbf{x}$
be: $\mathbf{x} = \mathbf{B}\mathbf{y}$.
- If $\mathbf{A}$ is an invertible square matrix, then $\mathbf{B} = \mathbf{A}^{-1}$.
- Such an exact inverse does not always exist (e.g., when $\mathbf{A}$ is not square or is singular).
-
For a given matrix $\mathbf{A}$ with SVD $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T$,
the pseudoinverse is given by:
\[\mathbf{A}^+ = \mathbf{V}\mathbf{D}^+\mathbf{U}^T\]
where $\mathbf{D}^+$ is obtained by taking the reciprocal of the nonzero diagonal
elements of $\mathbf{D}$ and then transposing the resulting matrix.
- Even when an exact $\mathbf{B}$ does not exist, choosing
$\mathbf{B} = \mathbf{A}^+$ produces a solution of $\mathbf{A}\mathbf{x} = \mathbf{y}$
such that the norm $\vert \vert \mathbf{Ax} - \mathbf{y} \vert \vert_2$ is minimized.
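A small NumPy sketch of the pseudoinverse built from the SVD, for a made-up over-determined system (it assumes $\mathbf{A}$ has full column rank, so no singular value is zero):

```python
import numpy as np

# Made-up over-determined system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 2.5])

# Pseudoinverse via SVD: A^+ = V D^+ U^T.
# Assumes full column rank, so every singular value is nonzero.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Matches NumPy's built-in pseudoinverse.
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True

# x = A^+ y minimizes ||Ax - y||_2 (the least-squares solution).
x = A_pinv @ y
print(x)
```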
Principal Component Analysis (PCA)
Defining the problem
- The problem under consideration is designing a compression algorithm.
- Lossy compression: store the points using less memory, possibly losing some precision.
- Only linear operations are used.
-
Consider m points represented by m vectors:
$\{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, …, \mathbf{x}^{(m)} \}$ all in
$\mathbb{R}^n$.
- Find: for each point $\mathbf{x}^{(i)} \in \mathbb{R}^n$, a code
$\mathbf{c}^{(i)} \in \mathbb{R}^l$ such that $l \leq n$.
- Also find a mapping $\mathbf{f}(\mathbf{x}) = \mathbf{c}$, called the encoding function.
- And a mapping $\mathbf{g}(\mathbf{f}(\mathbf{x})) \approx \mathbf{x}$, called the decoding function.
- Many design choices are possible for this compression.
- PCA is one such design choice.
- PCA is defined by its choice of decoding function.
-
Design choices made:
- Decoding function is linear:
\[\mathbf{g}(\mathbf{c}) = \mathbf{Dc}\]
where $\mathbf{D} \in \mathbb{R}^{n,l}$.
-
Optimal coding: the code $\mathbf{c}^*$ is chosen such that the reconstructed point
$\mathbf{g}(\mathbf{c})$ is as close as possible to the original point $\mathbf{x}$
in terms of the L2 norm:
\[\mathbf{c^*} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c})
\vert\vert_2} = \underset{c}{\operatorname{argmin}}{\vert\vert \mathbf{x} - g(\mathbf{c})
\vert\vert_2^2}\]
-
Simplifying condition: to simplify the problem, we add the constraint
that the matrix $\mathbf{D}$ is orthogonal.
- The columns $\mathbf{D}_{:,i}$ are orthogonal to each other.
- The columns $\mathbf{D}_{:,i}$ all have unit norm.
- This matrix is technically not orthogonal because it is not
square, but it has many properties similar to those of orthogonal matrices.
- Summary of assumptions:
- Decoding function is linear:
\[\mathbf{g}(\mathbf{c}) = \mathbf{Dc}\]
- Columns of $\mathbf{D}$ are orthonormal:
\[\mathbf{D^TD} = \mathbf{I}_l\]
(note that $\mathbf{DD^T} \ne \mathbf{I}_n$ in general, since $\mathbf{D}$ is not square when $l < n$).
Solving the Optimization problem
-
This optimization can be solved to show that the encoding function is:
\[\mathbf{f}(\mathbf{x}) = \mathbf{c} = \mathbf{D^Tx}\]
-
Also, since the decoder function is $\mathbf{g}(\mathbf{c}) = \mathbf{Dc}$,
if $\mathbf{r}$ is the reconstruction of $\mathbf{x}$ after encoding followed by
decoding:
\[\mathbf{r} = \mathbf{g(f(x))} = \mathbf{g(D^Tx)} = \mathbf{DD^Tx}\]
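A small sketch of this encode/decode pipeline with a randomly generated orthonormal $\mathbf{D}$ (built via QR, which is just one convenient way to get orthonormal columns; the dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 2                              # original and code dimensions (made up)

# Orthonormal columns via QR of a random matrix; D plays the role of the decoder matrix.
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
print(np.allclose(D.T @ D, np.eye(l)))   # True: D^T D = I_l

x = rng.standard_normal(n)
c = D.T @ x                              # encoder: f(x) = D^T x
r = D @ c                                # decoder/reconstruction: r = D D^T x

# D D^T is a projection, so encoding and decoding r again changes nothing.
print(np.allclose(D @ (D.T @ r), r))     # True
```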
Encoding function
- From this we can make certain inferences about the encoding function, given our
design choices for the decoding function:
- The encoding function is also linear.
- The parameters of the linear transformation in the encoder can be deduced
entirely from the decoding function; no other parameters are required.
- The parameters of the decoding function for the optimal encoder are, however,
still unknown. We will find them now.
Optimal encoder
- Assume that $l=1$:
- The encoded points are represented by a single scalar value $c$.
- The decoding matrix is $\mathbf{D} \in \mathbb{R}^{n,1}$, a vector
that we can denote by $\mathbf{d}$.
- The optimal encoder must encode all the points such that the total squared Euclidean
distance between the points {$\mathbf{x}^{(1)}, …, \mathbf{x}^{(m)}$} and their
reconstructions {$\mathbf{r}^{(1)}, …, \mathbf{r}^{(m)}$} is minimized.
-
Thus the optimization problem reduces to:
\[\mathbf{d^*} =
\underset{\mathbf{d}}{\operatorname{argmin}}{\sum_{i=1}^m{\vert\vert
\mathbf{x_i} - \mathbf{dd^Tx_i} \vert\vert_2^2} }\]
-
Defining $\mathbf{X}$ to be the matrix containing all $m$ points stacked as
rows:
\[\mathbf{X} = \begin{bmatrix}\mathbf{- x_1^T -}\\\mathbf{- x_2^T
-}\\ \vdots \\\mathbf{- x_m^T -} \end{bmatrix}\]
-
Using this definition to rewrite the sum in the loss function, the optimization
problem becomes:
\[\mathbf{d^*} =
\underset{\mathbf{d}}{\operatorname{argmin \ }}{
\vert \vert \mathbf{X} - \mathbf{Xdd^T} \vert \vert_F^2}\]
-
Solving this optimization problem under the constraint $\mathbf{d^Td}=1$
gives the best choice of encoder:
\[\mathbf{d^*} = \operatorname{max \ eigenvector}(\mathbf{X^TX})\]
Here the max eigenvector of $\mathbf{X^TX}$ refers to the eigenvector of
$\mathbf{X^TX}$ with the largest eigenvalue.
-
Generalizing to the case where $l > 1$, the optimal columns of
$\mathbf{D}$ are the $l$ eigenvectors of $\mathbf{X^TX}$ with the $l$ largest
eigenvalues.
- Thus the optimal PCA is:
- Define $\mathbf{D} \in \mathbb{R}^{n,l}$ whose columns are the $l$ eigenvectors of
$\mathbf{X^TX}$ with the $l$ largest eigenvalues.
- The encoder and the reconstruction are then:
\[\mathbf{c} = \mathbf{D^Tx}\]
\[\mathbf{r} = \mathbf{DD^Tx} = \mathbf{Dc}\]
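A minimal NumPy sketch of the whole PCA pipeline under these design choices; the data is random and not mean-centered, and the dimensions are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 200, 5, 2                 # number of points, original dim, code dim (made up)

# Data matrix with the m points stacked as rows.
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))

# Eigenvectors of the symmetric matrix X^T X, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
D = eigvecs[:, order[:l]]           # n x l matrix of the top-l eigenvectors

C = X @ D                           # codes: row i is (D^T x_i)^T
R = C @ D.T                         # reconstructions: row i is (D D^T x_i)^T

# Total reconstruction error, which the optimal PCA minimizes among such linear codecs.
print(np.linalg.norm(X - R, ord="fro"))
```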
September 14th, 2020 by Bipin Lekhak