C.5 Vector and Matrix Norms

In this Section we discuss popular vector and matrix norms that will arise frequently in our study of machine learning, particularly when discussing regularization. A norm is a function that measures the length of real vectors and matrices. The notion of length is extremely useful, as it enables us to define the distance, or similarity, between any two vectors (or matrices) living in the same space.

Vector norms

The $\ell_{2}$ norm

We begin with the most widely used vector norm in machine learning, the $\ell_{2}$ norm, defined for an $N$-dimensional vector $\mathbf{x}$ as

\begin{equation} \left\Vert \mathbf{x}\right\Vert _{2}=\sqrt{\sum_{n=1}^{N} x_{n}^{2}} \end{equation}

Using the $\ell_{2}$ norm we can measure the distance between any two points $\mathbf{x}$ and $\mathbf{y}$ via $\left\Vert \mathbf{x}-\mathbf{y}\right\Vert _{2}$, which is simply the length of the vector connecting $\mathbf{x}$ and $\mathbf{y}$.

For example, the distance between $\mathbf{x}=\left[\begin{array}{c} 1\\ 2 \end{array}\right]$ and $\mathbf{y}=\left[\begin{array}{c} 9\\ 8 \end{array}\right]$ is calculated as $\sqrt{\left(1-9\right)^{2}+\left(2-8\right)^{2}}=10$, as shown pictorially in the figure below (recall the Pythagorean theorem).
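As a quick numerical check, here is a minimal sketch of this calculation, assuming NumPy is available (its np.linalg.norm routine computes vector norms directly).

```python
import numpy as np

# the two points from the example above
x = np.array([1.0, 2.0])
y = np.array([9.0, 8.0])

# l2 distance: sqrt((1 - 9)**2 + (2 - 8)**2) = sqrt(100) = 10
print(np.linalg.norm(x - y, ord=2))  # 10.0
```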

The $\ell_{2}$ norm is not the only way to measure, or more precisely to define, the length of a vector. Other norms arise occasionally in machine learning as well, and we discuss them now.

The $\ell_{1}$ norm

The $\ell_{1}$ norm of a vector $\mathbf{x}$ is defined as the sum of the absolute values of its entries

\begin{equation} \Vert\mathbf{x}\Vert_{1}=\sum_{n=1}^{N}\left|x_{n}\right| \end{equation}

In terms of the $\ell_{1}$ norm the distance between $\mathbf{x}$ and $\mathbf{y}$ is given by $\left\Vert \mathbf{x}-\mathbf{y}\right\Vert _{1}$, which provides a measurement of distance different from the $\ell_{2}$ norm. As illustrated in the figure below, the distance defined by the $\ell_{1}$ norm is the length of a path made up of perpendicular pieces (shown in blue). Because such a path is akin to how an automobile would travel from $\mathbf{x}$ to $\mathbf{y}$ if they were two locations in a gridded city, having to traverse perpendicular city blocks one after the other, the $\ell_{1}$ norm is sometimes referred to as the taxicab norm, and the distance measured via the $\ell_{1}$ norm, the Manhattan distance.

In the example above, the distance between $\mathbf{x}=\left[\begin{array}{c} 1\\ 2 \end{array}\right]$ and $\mathbf{y}=\left[\begin{array}{c} 9\\ 8 \end{array}\right]$ using the $\ell_{1}$ norm is calculated as $\left|1-9\right|+\left|2-8\right|=14$.
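The same calculation can be sketched in NumPy by passing ord=1 to np.linalg.norm.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([9.0, 8.0])

# l1 (Manhattan) distance: |1 - 9| + |2 - 8| = 14
print(np.linalg.norm(x - y, ord=1))  # 14.0
```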

The $\ell_{\infty}$ norm

The $\ell_{\infty}$ norm of a vector $\mathbf{x}$ equals the largest of its entries in absolute value, defined mathematically as

\begin{equation} \Vert\mathbf{x}\Vert_{\infty}=\underset{n}{\text{max}}\left|x_{n}\right| \end{equation}

For example, the distance between $\mathbf{x}=\left[\begin{array}{c} 1\\ 2 \end{array}\right]$ and $\mathbf{y}=\left[\begin{array}{c} 9\\ 8 \end{array}\right]$ in terms of the $\ell_{\infty}$ norm is found as $\text{max}\left(\left| 1-9\right|,\left|2-8\right|\right)=8$.
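In NumPy the same quantity can be sketched by passing ord=np.inf.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([9.0, 8.0])

# l_infinity distance: max(|1 - 9|, |2 - 8|) = 8
print(np.linalg.norm(x - y, ord=np.inf))  # 8.0
```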

The $\ell_{0}$ norm

The $\ell_0$ vector norm, written as $\left \Vert \mathbf{x} \right \Vert_0$, measures 'magnitude' as

\begin{equation} \left \Vert \mathbf{x} \right \Vert_0 = \text{number of non-zero entries of } \mathbf{x}. \end{equation}

Strictly speaking the $\ell_{0}$ norm is not a true norm, as it fails the scaling property discussed below (scaling $\mathbf{x}$ by any $\alpha\neq0$ leaves its number of non-zero entries unchanged), hence the quotes around 'magnitude'.
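A minimal NumPy sketch, using a small example vector of our own choosing, simply counts the non-zero entries.

```python
import numpy as np

x = np.array([1.0, 0.0, -3.0, 0.0, 2.0])

# the l0 'norm': the number of non-zero entries of x
print(np.count_nonzero(x))  # 3
```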

What do these vector norms have in common?

The $\ell_{2}$, $\ell_{1}$, and $\ell_{\infty}$ norms, by virtue of being vector norms, share a number of useful properties that we detail below. Since these properties hold in general for any norm, we momentarily drop the subscript and represent the norm of $\mathbf{x}$ simply by $\Vert\mathbf{x}\Vert$.

  1. Norms are always non-negative, that is, $\Vert\mathbf{x}\Vert\geq 0$. Furthermore, equality holds only when $\mathbf{x}=\mathbf{0}$. Therefore the norm of any nonzero vector is always greater than zero.

  2. The norm of $\alpha\mathbf{x}$, that is, a scalar multiple of $\mathbf{x}$, can be written in terms of the norm of $\mathbf{x}$ as $\Vert\alpha\mathbf{x}\Vert=\left|\alpha\right|\Vert\mathbf{x}\Vert$. With $\alpha=-1$, for example, we have that $\Vert-\mathbf{x}\Vert=\Vert\mathbf{x}\Vert$.

  3. Norms also satisfy the so-called triangle inequality, where for any vectors $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ we have $\Vert\mathbf{x}-\mathbf{z}\Vert+\Vert\mathbf{z}-\mathbf{y}\Vert\geq\Vert\mathbf{x}-\mathbf{y}\Vert$. As illustrated in the figure below for the $\ell_{2}$ norm (left), the $\ell_{1}$ norm (middle), and the $\ell_{\infty}$ norm (right), the triangle inequality simply states that the distance between $\mathbf{x}$ and $\mathbf{y}$ is always smaller than (or equal to) the distance between $\mathbf{x}$ and $\mathbf{z}$ and the distance between $\mathbf{z}$ and $\mathbf{y}$ combined. In other words, if one wanted to travel from a given point $\mathbf{x}$ to a given point $\mathbf{y}$, it would never be worse to travel directly from $\mathbf{x}$ to $\mathbf{y}$ than to travel first to a third point $\mathbf{z}$, and then to $\mathbf{y}$ (see the numerical sketch after this list). With the change of variables $\mathbf{u}=\mathbf{x}-\mathbf{z}$ and $\mathbf{v}=\mathbf{z}-\mathbf{y}$, the triangle inequality is sometimes written in the simpler form $\Vert\mathbf{u}\Vert+\Vert\mathbf{v}\Vert\geq\Vert\mathbf{u}+\mathbf{v}\Vert$ for all vectors $\mathbf{u}$ and $\mathbf{v}$.
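As a sanity check on the triangle inequality, here is a minimal NumPy sketch; the third point $\mathbf{z}$ is an arbitrary choice of ours.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([9.0, 8.0])
z = np.array([4.0, -1.0])  # an arbitrary third point

# check ||x - z|| + ||z - y|| >= ||x - y|| for the l1, l2, and l_infinity norms
for p in [1, 2, np.inf]:
    detour = np.linalg.norm(x - z, ord=p) + np.linalg.norm(z - y, ord=p)
    direct = np.linalg.norm(x - y, ord=p)
    print(p, detour >= direct)  # True in each case
```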

In addition to the general properties above, which hold for any norm, the $\ell_{2}$, $\ell_{1}$, and $\ell_{\infty}$ norms share a stronger bond that ties them together: they are all members of the $\ell_{p}$ norm family. The $\ell_{p}$ norm is generally defined as

\begin{equation} \left\Vert \mathbf{x}\right\Vert _{p}=\left(\sum_{n=1}^{N}\left|x_{n}\right|^{p}\right)^{\frac{1}{p}} \end{equation}

for $p\geq1$. One can easily verify that with $p=1$, with $p=2$, and as $p\longrightarrow\infty$, the $\ell_{p}$ norm reduces to the $\ell_{1}$, $\ell_{2}$, and $\ell_{\infty}$ norms, respectively.
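This convergence is easy to observe numerically; a small sketch, with an example vector of our own choosing, follows.

```python
import numpy as np

x = np.array([1.0, -4.0, 3.0])

# the l_p norm of x for growing p approaches max_n |x_n| = 4
for p in [1, 2, 5, 20, 100]:
    print(p, np.sum(np.abs(x) ** p) ** (1.0 / p))
print('inf', np.linalg.norm(x, ord=np.inf))  # 4.0
```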

The $\ell_{p}$ norm balls

A norm ball is the set of all vectors $\mathbf{x}$ with the same norm value, that is, all $\mathbf{x}$ such that $\left\Vert \mathbf{x}\right\Vert =c$ for some constant $c>0$. When $c=1$, this set is called the unit norm ball, or simply the unit ball.

As you can see in the figure below, the $\ell_{1}$ unit ball takes the form of a diamond (in blue), whose equation is given by

\begin{equation} \left|x_{1}\right|+\left|x_{2}\right|=1 \end{equation}

The $\ell_{2}$ unit ball is a circle (in red) defined by

\begin{equation} x_{1}^{2}+x_{2}^{2}=1 \end{equation}

And finally, as $p\longrightarrow\infty$, the unit ball approaches a square (in green) characterized by

\begin{equation} \text{max}\left(\left|x_{1}\right|,\left|x_{2}\right|\right)=1 \end{equation}
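A figure along these lines can be reproduced with a short matplotlib sketch: each direction on the circle is rescaled so that its $\ell_{p}$ norm equals one, tracing out the corresponding unit ball.

```python
import numpy as np
import matplotlib.pyplot as plt

# sample directions around the circle
theta = np.linspace(0, 2 * np.pi, 400)
directions = np.vstack((np.cos(theta), np.sin(theta)))

# rescale each direction so its l_p norm equals 1, tracing out the unit ball
for p, color in [(1, 'blue'), (2, 'red'), (np.inf, 'green')]:
    ball = directions / np.linalg.norm(directions, ord=p, axis=0)
    plt.plot(ball[0], ball[1], color=color)

plt.axis('equal')
plt.show()
```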


Matrix norms

The Frobenius norm

Recall that the $\ell_{2}$ norm of a vector is defined as the square root of the sum of the squares of its elements. The Frobenius norm is the intuitive extension of the $\ell_{2}$ norm for vectors to matrices, defined similarly as the square root of the sum of the squares of all the elements in the matrix. Thus, the Frobenius norm of an $N\times M$ matrix $\mathbf{X}$ is calculated as

\begin{equation} \left\Vert \mathbf{X}\right\Vert _{F}=\sqrt{\sum_{n=1}^{N}\sum_{m=1}^{M} x_{n,m}^{2}} \end{equation}

For example, the Frobenius norm of the matrix $\mathbf{X}=\left[\begin{array}{cc} -1 & 2\\ 0 & 5 \end{array}\right]$ can be found as $\sqrt{\left(-1\right)^{2}+2^{2}+0^{2}+5^{2}}=\sqrt{30}$.
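A quick NumPy check of this calculation (np.linalg.norm computes the Frobenius norm with ord='fro'):

```python
import numpy as np

X = np.array([[-1.0, 2.0],
              [ 0.0, 5.0]])

# sqrt((-1)**2 + 2**2 + 0**2 + 5**2) = sqrt(30)
print(np.linalg.norm(X, ord='fro'))  # 5.4772...
print(np.sqrt(np.sum(X ** 2)))       # the same value, computed entrywise
```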

The connection between the $\ell_{2}$ norm and the Frobenius norm goes even deeper: collecting all singular values of $\mathbf{X}$ in the vector $\mathbf{s}$ (assuming $N\leq M$)

\begin{equation} \mathbf{s}=\left[\begin{array}{c} \sigma_{1}\\ \sigma_{2}\\ \vdots\\ \sigma_{N} \end{array}\right] \end{equation}

the Frobenius norm of $\mathbf{X}$ can be shown to equal the $\ell_{2}$ norm of $\mathbf{s}$, i.e.,

\begin{equation} \left\Vert \mathbf{X}\right\Vert _{F}=\left\Vert \mathbf{s}\right\Vert _{2} \end{equation}
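This identity is easy to verify numerically; a minimal sketch using NumPy's SVD routine follows.

```python
import numpy as np

X = np.array([[-1.0, 2.0],
              [ 0.0, 5.0]])

# s holds the singular values of X
s = np.linalg.svd(X, compute_uv=False)

print(np.linalg.norm(X, ord='fro'))  # sqrt(30) = 5.4772...
print(np.linalg.norm(s, ord=2))      # identical, as claimed
```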

The spectral and nuclear norms

The observation that the $\ell_{2}$ norm of the vector of singular values of a matrix is identical to its Frobenius norm motivates the use of other $\ell_{p}$ norms on the vector $\mathbf{s}$. In particular, the $\ell_{1}$ norm of $\mathbf{s}$ defines the nuclear norm of $\mathbf{X}$, denoted by $\left\Vert \mathbf{X}\right\Vert _{*}$

\begin{equation} \left\Vert \mathbf{X}\right\Vert _{*}=\left\Vert \mathbf{s}\right\Vert _{1} \end{equation}

and the $\ell_{\infty}$ norm of $\mathbf{s}$ defines the spectral norm of $\mathbf{X}$, denoted by $\left\Vert \mathbf{X}\right\Vert _{2}$

\begin{equation} \left\Vert \mathbf{X}\right\Vert _{2}=\left\Vert \mathbf{s}\right\Vert _{\infty} \end{equation}

Because the singular values of real matrices are always non-negative, the spectral norm is simply the largest of the singular values, and the nuclear norm is simply their sum.
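Both identities can be checked with NumPy, which exposes the nuclear and spectral norms via ord='nuc' and ord=2 respectively.

```python
import numpy as np

X = np.array([[-1.0, 2.0],
              [ 0.0, 5.0]])

s = np.linalg.svd(X, compute_uv=False)  # singular values, all non-negative

# nuclear norm: the sum of the singular values (the l1 norm of s)
print(np.linalg.norm(X, ord='nuc'), np.sum(s))

# spectral norm: the largest singular value (the l_infinity norm of s)
print(np.linalg.norm(X, ord=2), np.max(s))
```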