Linear Algebra 1
This post works through the introductory Linear Algebra material from the Deep Learning Book by Goodfellow et al.
Scalars, Vectors, Matrices & Tensors
Scalar
- A scalar is a single number.
- Denoted by a lower-case variable name, e.g. $s$
- Examples:
  - $s \in \mathbb{R}$, where $\mathbb{R}$ is the set of all real numbers
  - $n \in \mathbb{N}$, where $\mathbb{N}$ is the set of all natural numbers
Vector
- An array of numbers (also called elements) arranged in order.
- The order is important.
- Represented by a lower-case name in bold typeface, e.g. $\mathbf{x}$
- Elements represented in italic typeface with the index as a subscript, e.g. $x_1$
- For $\mathbf{x}$ containing $n$ elements, if each element is in $\mathbb{R}$ then:
  - $\mathbf{x} \in \mathbb{R}^n$
  - $\mathbf{x}$ is a member of the set formed by the Cartesian product of $\mathbb{R}$ taken $n$ times
- If $x_i$ is the $i$-th element of $\mathbf{x}$:
\(\mathbf{x}=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n \end{bmatrix}\)
- Indexing: if $S = \{1,2,4\}$ then $\mathbf{x}_S$ is the vector of elements of $\mathbf{x}$ whose
  indices are given by $S$; therefore $\mathbf{x}_S = \{ x_1, x_2, x_4 \}$
  (see the NumPy sketch below).
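
As a concrete illustration (my own NumPy sketch, not from the book), here is the set-based indexing described above. Note that NumPy uses 0-based indexing, so the set $S = \{1, 2, 4\}$ in the book's 1-based notation corresponds to positions `[0, 1, 3]`.

```python
import numpy as np

# A vector x with n = 5 elements.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# S = {1, 2, 4} in 1-based notation -> positions [0, 1, 3] in NumPy.
S = [0, 1, 3]
x_S = x[S]          # fancy indexing picks out x_1, x_2, x_4

print(x_S)          # [10. 20. 40.]
```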
Matrices
- A two-dimensional array of numbers
- Indexed by two numbers (a vector needed only one)
- Denoted by an upper-case bold face letter, e.g. $\mathbf{A}$ is a matrix
- If it has $m$ rows and $n$ columns: $\mathbf{A} \in \mathbb{R}^{m \times n}$
- Elements denoted by subscripted indices (see the NumPy sketch after this list):
  - $\mathbf{A}_{1,1}$ is the top-left element
  - $\mathbf{A}_{m,n}$ is the bottom-right element
  - $\mathbf{A}_{i,j}$ is the element in the $i$-th row and $j$-th column
  - All elements with vertical coordinate $i$ form the $i$-th row: $\mathbf{A}_{i,:}$ (a vector)
  - All elements with horizontal coordinate $j$ form the $j$-th column: $\mathbf{A}_{:,j}$ (a vector)
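
A small NumPy sketch (my own example, not from the book) of the row and column slicing notation above; `A[i, :]` plays the role of $\mathbf{A}_{i,:}$ and `A[:, j]` of $\mathbf{A}_{:,j}$, again with 0-based indices.

```python
import numpy as np

# A 2 x 3 matrix A in R^{m x n} with m = 2, n = 3.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A[0, 0])    # top-left element, A_{1,1} in 1-based notation
print(A[-1, -1])  # bottom-right element, A_{m,n}
print(A[0, :])    # first row, A_{1,:}  -> [1 2 3]
print(A[:, 1])    # second column, A_{:,2} -> [2 5]
```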
Tensors
- Array of more than two dimensions
- Must be in a regular grid
- Matrix like indexing
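
For tensors, the same idea extends to one index per axis. A minimal NumPy sketch (my own example) with a three-axis tensor:

```python
import numpy as np

# A rank-3 tensor of shape (2, 3, 4): a regular grid of numbers.
T = np.arange(24).reshape(2, 3, 4)

print(T.ndim)       # 3 axes
print(T.shape)      # (2, 3, 4)
print(T[1, 2, 3])   # matrix-like indexing, one index per axis -> 23
```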
Operations on Matrices
Transpose
- An operation on a matrix $\mathbf{A}$
- Denoted as $\mathbf{A}^T$: the transpose of $\mathbf{A}$, or "$\mathbf{A}$ transposed"
- Defined as the new matrix such that $(\mathbf{A}^T)_{i,j} = \mathbf{A}_{j,i}$ (see the NumPy check below)
- Visual aid: the mirror image of the matrix along its main diagonal
- Vectors can be written as: $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$
- The transpose of a scalar is the scalar itself: $a^T = a$
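
A quick NumPy check of the definition $(\mathbf{A}^T)_{i,j} = \mathbf{A}_{j,i}$ (my own sketch, not from the book). One practical caveat: a 1-D NumPy array has no distinct row or column form, so to get an explicit column vector you have to reshape.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)

At = A.T                         # shape (3, 2)

# (A^T)_{i,j} equals A_{j,i} for every valid pair of indices.
i, j = 2, 1
print(At[i, j] == A[j, i])       # True

# A 1-D array is unchanged by .T; reshape to get a column vector [x_1, ..., x_n]^T.
x = np.array([1.0, 2.0, 3.0])
print(x.reshape(-1, 1))
```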
Addition
- Matrix with matrix:
  - Given two matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$
    and $\mathbf{B} \in \mathbb{R}^{m \times n}$, their sum is the matrix
    $\mathbf{C} \in \mathbb{R}^{m \times n}$
    such that $C_{i,j} = A_{i,j} + B_{i,j}$
- Matrix with vector:
  - Conventionally, not possible
  - In the deep-learning context, it can be done through broadcasting
  - Given $\mathbf{A} \in \mathbb{R}^{m \times n}$
    and $\mathbf{b} \in \mathbb{R}^{n}$, their sum can be defined as
    $\mathbf{C} = \mathbf{A} + \mathbf{b}$ where $C_{i,j} = A_{i,j} + b_{j}$
  - This is equivalent to stacking $m$ copies of $\mathbf{b}$ into a matrix in
    $\mathbb{R}^{m \times n}$ and adding that matrix to $\mathbf{A}$
- Matrix with scalar:
  - A scalar can be broadcast to add to a matrix.
  - Given $\mathbf{A} \in \mathbb{R}^{m \times n}$
    and a scalar $c$, their sum can be defined as the matrix
    $\mathbf{D} \in \mathbb{R}^{m \times n}$
    for which $D_{i,j} = A_{i,j} + c$ (all three cases are sketched in NumPy after this list)
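
A minimal NumPy sketch (my own illustration) of the three addition cases above; NumPy's broadcasting implements the matrix-plus-vector and matrix-plus-scalar cases directly.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # A in R^{2 x 3}
B = np.ones((2, 3))               # B in R^{2 x 3}
b = np.array([10.0, 20.0, 30.0])  # b in R^3
c = 100.0                         # scalar

print(A + B)   # elementwise: C_{i,j} = A_{i,j} + B_{i,j}
print(A + b)   # broadcasting: b is added to every row, C_{i,j} = A_{i,j} + b_j
print(A + c)   # broadcasting: D_{i,j} = A_{i,j} + c

# The broadcast A + b is equivalent to stacking m copies of b and adding:
print(np.allclose(A + b, A + np.tile(b, (2, 1))))   # True
```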
Multiplication
Identity and Inverse Matrix
- Identity matrix
  - A square matrix (height = width) with 1s on the main diagonal and 0s elsewhere
  - Denoted by $\mathbf{I_n}$ for $\mathbb{R}^{n \times n}$
  - Defined such that for any matrix
    $\mathbf{A} \in \mathbb{R}^{n \times n}$,
    \(\mathbf{I_n} \mathbf{A} = \mathbf{A} \mathbf{I_n} = \mathbf{A}\)
- Inverse matrix
  - For a square matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$,
    if there is another matrix $\mathbf{B}$ such that
    \(\mathbf{A} \mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I_n}\)
  - then this $\mathbf{B} \in \mathbb{R}^{n \times n}$
    is called the inverse of $\mathbf{A}$.
  - Denoted as: $\mathbf{B} = \mathbf{A}^{-1}$
  - Therefore, the definition of the inverse is:
    \(\mathbf{A} \mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I_n}\)
  - Not possible to find an inverse for every matrix:
    - The matrix must be square for the inverse to exist
    - The columns of the matrix must be linearly independent
    - Equivalently, $\mathbf{A}$ must be full rank (see the NumPy sketch after this list).
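
A NumPy sketch (my own example) of the identity/inverse relationship; `np.linalg.inv` raises a `LinAlgError` when the matrix is singular, i.e. not full rank.

```python
import numpy as np

n = 3
I = np.eye(n)                       # identity matrix I_n

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])     # a full-rank square matrix

A_inv = np.linalg.inv(A)            # raises np.linalg.LinAlgError if A is singular

# Check the defining property A A^{-1} = A^{-1} A = I_n (up to floating-point error).
print(np.allclose(A @ A_inv, I))    # True
print(np.allclose(A_inv @ A, I))    # True

# Full rank can be checked via the rank itself.
print(np.linalg.matrix_rank(A) == n)  # True
```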
Note:
For more details on concepts such as linear independence, span, and rank, refer to the book.
Norms
- A measure of the size of a vector
- Properties a norm $f$ must satisfy:
  - $f(x) = 0 \Rightarrow x = 0$
  - $f(x + y) \le f(x) + f(y)$
  - $\forall \alpha \in \mathbb{R}, f(\alpha x) = \lvert \alpha \rvert f(x)$
- The $L^p$ norm of a vector $\mathbf{x}$ is given by:
  \(\lvert\lvert\mathbf{x}\rvert\rvert_p = \left(\sum_i{\lvert x_i \rvert^p}\right)^{\frac{1}{p}} \text{ for } p \in \mathbb{R},\ p \ge 1\)
- When $p=2$, the L2 norm of the vector ($\lvert \lvert \mathbf{x}\rvert \rvert _2$) is most commonly used.
  - It is common to omit the subscript 2: $\lvert \lvert \mathbf{x}\rvert \rvert $
  - Use of the squared L2 norm is also very common; it can be computed
    simply as $\mathbf{x}^T\mathbf{x}$
    - The squared L2 norm is easier to compute than the L2 norm
    - The squared L2 norm's derivative with respect to each element of
      $\mathbf{x}$ depends only on the corresponding element of $\mathbf{x}$, but
      for the L2 norm the derivative depends on the entire vector
- The L1 norm (where $p=1$) is also used in some applications.
  - L1 norm = $\lvert \lvert \mathbf{x}\rvert \rvert _1 = \sum_i{\lvert x_i \rvert}$
  - Used when we need to discriminate between elements that are exactly zero and
    elements that are small but nonzero.
  - L1 is preferred in these situations because it grows at the same rate at all locations:
    every time an element of $\mathbf{x}$ moves away from $0$ by $\epsilon$,
    the L1 norm increases by $\epsilon$.
- L0 norm (where $p=0$)
  - Not formally a norm, because it does not satisfy all the properties above
    (scaling the vector does not change the number of nonzero entries)
  - Equal to the number of nonzero elements in the vector
- L$\infty$ norm
  - Also called the max norm
  - L$\infty$ norm =
    $\lvert\lvert\mathbf{x}\rvert\rvert _\infty = \max_i{\lvert x_i \rvert}$
- Frobenius norm
  - Analogous to the L2 norm, but for matrices instead of vectors
  - \[\vert\vert\mathbf{A}\vert\vert_F = \sqrt{\sum_{i,j}{A^2_{i,j}}}\]
- The dot product can be written in terms of norms:
  - \(\mathbf{a}^T\mathbf{b} = \vert\vert\mathbf{a}\vert\vert_2
    \vert\vert\mathbf{b}\vert\vert_2 \cos{\theta}\), where $\theta$ is the angle
    between the vectors (a NumPy sketch of these norms follows this list).
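
To tie the norms together, here is a small NumPy sketch (my own illustration, not from the book) computing the L1, L2, squared L2, max, and Frobenius norms, plus a numerical check of the dot-product identity above.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.linalg.norm(x, 1))        # L1 norm: sum of |x_i| -> 7.0
print(np.linalg.norm(x, 2))        # L2 norm -> 5.0
print(x @ x)                       # squared L2 norm as x^T x -> 25.0
print(np.linalg.norm(x, np.inf))   # max norm: max_i |x_i| -> 4.0
print(np.count_nonzero(x))         # "L0" count of nonzero elements -> 2
print(np.linalg.norm(A, 'fro'))    # Frobenius norm of a matrix

# Dot product in terms of norms: a^T b = ||a||_2 ||b||_2 cos(theta)
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
theta = np.pi / 4                  # known angle between [1, 0] and [1, 1]
print(np.isclose(a @ b, np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)))  # True
```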
September 9th, 2020 by Bipin Lekhak
Feel free to share!