A.5 Mini-Batch Optimization*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

In machine learning applications we are almost never tasked with minimizing a single mathematical function, but rather one that consists of a sum of $P$ functions. In other words, the sort of function $g$ we very often need to minimize in machine learning applications takes the general form

\begin{equation} g\left(\mathbf{w}\right) = \sum_{p=1}^P g_p\left(\mathbf{w}\right) \end{equation}

where $g_1,\,g_2,\,...,g_P$ are mathematical functions themselves. In machine learning applications these functions $g_1,\,g_2,\,...,g_P$ are almost always of the same type - e.g., they can be convex quadratic functions with different constants, all parameterized by the same weights $\mathbf{w}$.
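To make this summation structure concrete, below is a minimal Python sketch of such a cost: $P$ single-input convex quadratics $g_p(w) = a_p + b_pw + c_pw^2$, all parameterized by the same weight $w$ and summed to form $g$. The coefficients here are illustrative placeholders (scaled so that the unit steplength used in the examples later in this Section behaves sensibly), not values taken from any particular problem.

```python
import numpy as np

# placeholder coefficients for P convex quadratic summands g_p(w) = a_p + b_p*w + c_p*w**2
np.random.seed(0)
P = 100
a = np.random.randn(P)         # constant term of each summand
b = np.random.randn(P)         # linear coefficient of each summand
c = np.random.rand(P) / P      # small positive curvature keeps each g_p convex

def g_p(w, p):
    """Value of the p-th summand at the weight w."""
    return a[p] + b[p] * w + c[p] * w ** 2

def g(w):
    """The full cost: every summand evaluated at the same weight w, then summed."""
    return sum(g_p(w, p) for p in range(P))
```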

This special summation structure allows for a simple but very effective enhancement to virtually any local optimization scheme, called mini-batch optimization. Mini-batch optimization is most often used in combination with a gradient-based step like any of those discussed in the prior Sections, which is why we discuss the subject in this Chapter.


A simple idea with powerful consequences¶

Suppose we were to apply a local optimization scheme to minimize a function $g$ of the form

\begin{equation} g\left(\mathbf{w}\right) = \sum_{p=1}^P g_p\left(\mathbf{w}\right) \end{equation}

where $g_1,\,g_2,\,...,g_P$ are all functions of the same kind (e.g., quadratics with different constants parameterized by $\mathbf{w}$).

The motivation for mini-batch optimization rests on a simple question: for this sort of function $g$, what would happen if, instead of taking one descent step in $g$ - that is, one descent step in the entire sum of the functions $g_1,\,g_2,\,...,g_P$ simultaneously - we took a sequence of $P$ descent steps in $g_1,\,g_2,\,...,g_P$ sequentially, first descending in $g_1$, then in $g_2$, and so on, until finally we descend in $g_P$? In other words, what would happen if we tried to minimize $g$ by taking descent steps in the summand functions $g_1,\,g_2,\,...,g_P$ one-at-a-time? As we will see empirically throughout this text, starting with the examples below, in many instances this idea can lead to considerably faster optimization of a function $g$ consisting of a sum of $P$ functions as detailed above.

The gist of this idea is drawn graphically in the figure below for the case $P = 3$, where we compare taking a single descent step simultaneously in $g_1,\,g_2,\,...,g_P$ versus a sequence of $P$ descent steps, first in $g_1$, then in $g_2$, and so on up to $g_P$.

Taking the first step of a local method, we begin at some initial point $\mathbf{w}^0$, determine a descent direction $\mathbf{d}^0$, and transition to a new point $\mathbf{w}^1$ as

\begin{equation} \mathbf{w}^1 = \mathbf{w}^0 + \alpha \, \mathbf{d}^0. \end{equation}

By analogy, if we were to follow the mini-batch idea detailed above, this entails taking a sequence of $P$ steps, which we now describe. If we call our initial point $\mathbf{w}^{0,0} = \mathbf{w}^0$, we first determine a descent direction $\mathbf{d}^{0,1}$ in $g_1$, the first function in the sum for $g$, and take a step in this direction as

\begin{equation} \mathbf{w}^{0,1} = \mathbf{w}^{0,0} + \alpha \, \mathbf{d}^{0,1}. \end{equation}

Next we determine a descent direction $\mathbf{d}^{0,2}$ in $g_2$, the second function in the sum for $g$, and take a step in this direction

\begin{equation} \mathbf{w}^{0,2} = \mathbf{w}^{0,1} + \alpha \, \mathbf{d}^{0,2}. \end{equation}

Continuing this pattern we take a sequence of $P$ steps, where $\mathbf{d}^{0,p}$ is the descent direction found in $g_p$, that takes the following form

\begin{array}{l} \mathbf{w}^{0,1} = \mathbf{w}^{0,0} + \alpha \, \mathbf{d}^{0,1} \\ \mathbf{w}^{0,2} = \mathbf{w}^{0,1} + \alpha \, \mathbf{d}^{0,2} \\ \,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \vdots \\ \mathbf{w}^{0,p} = \mathbf{w}^{0,p-1} + \alpha \, \mathbf{d}^{0,p} \\ \,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \vdots \\ \mathbf{w}^{0,P} = \mathbf{w}^{0,P-1} + \alpha \, \mathbf{d}^{0,P} \end{array}

This sequence of steps completes one sweep through the functions $g_1,\,g_2,\,...,g_P$, and is commonly referred to as an epoch. If we continued this pattern and took another sweep through each of the $P$ functions we would perform a second epoch of steps, and so on. Below we compare the $k^{th}$ step of a standard descent method (left) to the $k^{th}$ epoch of this mini-batch idea (right).

\begin{array} {r|r} \text{full (batch) descent step} & \text{mini-batch epoch, batch-size = 1} \\ \hline & \mathbf{w}^{k,1} = \mathbf{w}^{k-1,0} + \alpha \, \mathbf{d}^{k-1,1} \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ \mathbf{w}^{k} = \mathbf{w}^{k-1} + \alpha \, \mathbf{d}^{k-1} & \mathbf{w}^{k,p} = \mathbf{w}^{k-1,p-1} + \alpha \, \mathbf{d}^{k-1,p} \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ & \mathbf{w}^{k,P} = \mathbf{w}^{k-1,P-1} + \alpha \, \mathbf{d}^{k-1,P} \\ \end{array}
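In code, the right-hand column of this table is just a loop over the summand functions. The sketch below (reusing `P` from the earlier sketch) expresses one epoch in terms of a placeholder routine `descent_direction(w, p)` standing in for whatever local method supplies the direction in $g_p$ alone - a negative gradient, a zero-order direction, and so on.

```python
def epoch(w, alpha, descent_direction):
    """One epoch: P sequential steps, the p-th taken along a direction computed from g_p alone."""
    for p in range(P):
        d = descent_direction(w, p)   # direction in the single summand g_p
        w = w + alpha * d             # take the step
    return w
```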

When employed with gradient-based steps this scheme takes a very convenient form. For example, noting that $\nabla g\left(\mathbf{w}\right) = \sum_{p=1}^P \nabla g_p\left(\mathbf{w}\right)$, the $k^{th}$ standard gradient descent step in all $P$ summands simultaneously can be written as

\begin{equation} \mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla \,\sum_{p=1}^P g_p\left(\mathbf{w}^{k-1}\right) \end{equation}

where here the descent direction is $\mathbf{d}^{k-1} = - \nabla \,\sum_{p=1}^P g_p\left(\mathbf{w}^{k-1}\right)$. The gradient descent direction in just a single function $g_p$ at the $k^{th}$ epoch of a mini-batch run is likewise just the negative gradient of this function alone, $\mathbf{d}^{k-1,p} = - \nabla g_p\left(\mathbf{w}^{k-1,p-1}\right)$, and so the mini-batch step $\mathbf{w}^{k,p} = \mathbf{w}^{k-1,p-1} + \alpha \, \mathbf{d}^{k-1,p}$ is given by

\begin{equation} \mathbf{w}^{k,p} = \mathbf{w}^{k-1,p-1} - \alpha \,\nabla g_p\left(\mathbf{w}^{k-1,p-1}\right). \end{equation}

The table above, comparing the $k^{th}$ step of standard gradient descent to its analogous mini-batch epoch, can then be written as follows.

\begin{array} {r|r} \text{full (batch) descent step} & \text{mini-batch epoch, batch-size = 1} \\ \hline & \mathbf{w}^{k,1} = \mathbf{w}^{k-1,0} - \alpha \,\nabla g_1\left(\mathbf{w}^{k-1,0}\right) \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ \mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \,\nabla\sum_{p=1}^P g_p\left(\mathbf{w}^{k-1}\right) & \mathbf{w}^{k,p} = \mathbf{w}^{k-1,p-1} - \alpha \,\nabla g_p\left(\mathbf{w}^{k-1,p-1}\right) \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ & \mathbf{w}^{k,P} = \mathbf{w}^{k-1,P-1} - \alpha \,\nabla g_P\left(\mathbf{w}^{k-1,P-1}\right) \\ \end{array}
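Specializing the generic epoch sketch above to gradient steps, and reusing the placeholder quadratics defined earlier, gives the following; `grad_g_p` is simply the analytic derivative of the placeholder $g_p(w) = a_p + b_pw + c_pw^2$.

```python
def grad_g_p(w, p):
    """Derivative of the p-th placeholder quadratic at w."""
    return b[p] + 2 * c[p] * w

def full_batch_step(w, alpha):
    """One standard gradient step in the entire sum (left column of the table above)."""
    return w - alpha * sum(grad_g_p(w, p) for p in range(P))

def minibatch_epoch(w, alpha):
    """One batch-size 1 epoch (right column): a gradient step in each g_p, one-at-a-time."""
    return epoch(w, alpha, lambda v, p: -grad_g_p(v, p))
```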

Descending with larger mini-batch sizes¶

Instead of taking $P$ sequential steps in single functions $g_p$ one-at-a-time (a mini-batch of size $1$), we can more generally - with functions $g$ that take the form $g\left(\mathbf{w}\right) = \sum_{p=1}^P g_p\left(\mathbf{w}\right)$ - take fewer steps per epoch, with each step taken with respect to several of the functions $g_p$ at once, e.g., two at-a-time, or three at-a-time, etc. With this slight twist on the idea detailed above we take fewer steps per epoch, each with respect to a larger non-overlapping subset of the functions $g_1,\,g_2,\,...,g_P$, but still sweep through each function exactly once per epoch.

Choosing a batch size, we split our list of functions into $M$ subsets, each of the same size with perhaps the exception of one (e.g., if $P = 5$ and we choose a batch size of $2$ then we have two subsets of two functions each and a final subset containing just one, so $M = 3$). If we denote by $\Omega_{m}$ the set of indices of those summand functions $g_1,\,g_2,\,...,g_P$ in our $m^{th}$ mini-batch, then during our $k^{th}$ epoch we take steps like those shown in the table below, where $\mathbf{d}^{k-1,m}$ is the descent direction computed for the $m^{th}$ mini-batch.

\begin{array} {r|r} \text{full (batch) descent step} & \text{mini-batch epoch, } M \text{ mini-batches} \\ \hline & \mathbf{w}^{k,1} = \mathbf{w}^{k-1,0} + \alpha \, \mathbf{d}^{k-1,1} \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ \mathbf{w}^{k} = \mathbf{w}^{k-1} + \alpha \, \mathbf{d}^{k-1} & \mathbf{w}^{k,m} = \mathbf{w}^{k-1,m-1} + \alpha \, \mathbf{d}^{k-1,m} \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ & \mathbf{w}^{k,M} = \mathbf{w}^{k-1,M-1} + \alpha \, \mathbf{d}^{k-1,M} \\ \end{array}

The size of the subsets used is called the batch-size of the process, e.g., in our description of the mini-batch optimization scheme above we used a batch-size of $1$ (mini-batch optimization using a batch-size of $1$ is also often referred to as stochastic optimization). What batch-size works best in practice - in terms of providing the greatest speed-up in optimization - varies and is often problem dependent.

Once again, in the case of gradient-based methods a mini-batch step using a batch size greater than $1$ can be written out quite clearly, and just as easily implemented. With $\Omega_m$ denoting, as above, the set of indices of the summand functions $g_1,\,g_2,\,...,g_P$ in the $m^{th}$ mini-batch, the descent direction for the $m^{th}$ step of the $k^{th}$ epoch using the standard gradient method is simply the negative gradient of the sum of these functions, i.e.,

\begin{equation} \mathbf{d}^{k-1,m} = - \nabla \,\sum_{p \in \Omega_m} g_p\left(\mathbf{w}^{k-1,m-1}\right). \end{equation}

Our corresponding epoch of mini-batch steps, sweeping through all $M$ mini-batches, then looks as follows.

\begin{array} {r|r} \text{full (batch) descent step} & \text{mini-batch epoch, } M \text{ mini-batches} \\ \hline & \mathbf{w}^{k,1} = \mathbf{w}^{k-1,0} - \alpha \nabla \,\sum_{p \in \Omega_1} g_p\left(\mathbf{w}^{k-1,0}\right) \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ \mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \,\nabla\sum_{p=1}^P g_p\left(\mathbf{w}^{k-1}\right) & \mathbf{w}^{k,m} = \mathbf{w}^{k-1,m-1} - \alpha \nabla \,\sum_{p \in \Omega_m} g_p\left(\mathbf{w}^{k-1,m-1}\right) \\ & \vdots \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \,\,\,\,\,\, \\ & \mathbf{w}^{k,M} = \mathbf{w}^{k-1,M-1} - \alpha \nabla \,\sum_{p \in \Omega_M} g_p\left(\mathbf{w}^{k-1,M-1}\right) \\ \end{array}
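A sketch of this more general epoch, again built on the placeholder quadratics and `grad_g_p` from the earlier sketches: the indices $1,\,...,P$ are split into non-overlapping subsets $\Omega_1,\,...,\Omega_M$, and a gradient step is taken in the sum over each subset in turn. The default `batch_size` is purely illustrative.

```python
def minibatch_epoch_general(w, alpha, batch_size=10):
    """One epoch with a general batch size: sweep through non-overlapping mini-batches,
    taking a gradient step in the sum of each mini-batch's summands."""
    indices = np.arange(P)
    # the last mini-batch may be smaller if batch_size does not divide P evenly
    batches = [indices[i:i + batch_size] for i in range(0, P, batch_size)]
    for omega_m in batches:
        grad = sum(grad_g_p(w, p) for p in omega_m)
        w = w - alpha * grad
    return w
```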

General performance of mini-batch optimization¶

Is the trade-off - taking more steps per epoch with a mini-batch approach as opposed to a single full descent step - worth the extra effort? Typically, yes. In practice, when minimizing machine learning functions an epoch of mini-batch steps like those detailed above will often drastically outperform an analogous full descent step - often referred to as a full batch or simply a batch epoch in the context of mini-batch optimization.

A prototypical comparison of the cost function histories of a batch run and the corresponding epochs of a mini-batch run, applied to the same hypothetical function $g$ with the same initialization $\mathbf{w}^0$, is shown in the figure below. Because we take far more steps with the mini-batch approach, and because each $g_p$ takes the same form, each epoch of the mini-batch approach typically outperforms its full batch analog. Even taking into account the far greater number of descent steps performed during each mini-batch epoch, the method often greatly outperforms its full batch version.

Example 1. Minimizing a sum of quadratic functions via gradient based mini-batch optimization¶

In this example we compare a full batch run and two mini-batch runs (using batch-sizes of $1$ and $10$, respectively), all employing the standard gradient descent method. The function $g$ we minimize in these runs is a sum of $P = 100$ single-input convex quadratic functions $g_p(w) = a^{\,}_p + b_pw + c_pw^2$. In other words, our function $g$ takes the form

\begin{equation} g\left(w\right) = \sum_{p=1}^P g_p(w) = \sum_{p=1}^P\left( a^{\,}_p + b_pw + c_pw^2\right). \end{equation}

All three runs use the same initial point $w^0 = 0$ and the same steplength / learning rate $\alpha = 10^{0}$, and we take $2$ epochs of each method.

Below we plot the cost function histories associated with the full batch run (shown in blue) and the mini-batch runs with batch-sizes equal to $10$ and $1$ (in magenta and black, respectively). In the left panel we show these three histories with respect to the batch-size $1$ steps; that is, we show the cost function value at every single step of the batch-size $1$ run, and plot the history of the other mini-batch run every $10$ steps and of the full batch run every $100$ steps for visual comparison. In the right panel we show all three histories only after each full epoch is complete. From these plots we can see clearly that both mini-batch approaches descend significantly faster than the full batch version.
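The notebook cells that produced these plots are not reproduced here; the sketch below is an illustrative re-creation of the experiment using the placeholder quadratics and routines from the earlier sketches, recording the cost value after each epoch of the three runs (with the settings described above: $w^0 = 0$, $\alpha = 10^0$, and two epochs).

```python
def run(batch_size, w0=0.0, alpha=1.0, num_epochs=2):
    """Record the cost after each epoch for a run with the given batch size."""
    w, history = w0, [g(w0)]
    for _ in range(num_epochs):
        if batch_size == P:
            w = full_batch_step(w, alpha)   # full batch: one step per epoch
        else:
            w = minibatch_epoch_general(w, alpha, batch_size)
        history.append(g(w))
    return history

cost_full = run(batch_size=P)    # full batch
cost_ten  = run(batch_size=10)   # mini-batch, batch-size 10
cost_one  = run(batch_size=1)    # mini-batch, batch-size 1 (stochastic)
```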


Example 2. Minimizing a sum of quadratic functions via zero order mini-batch optimization¶

In this example we compare a full batch run and two mini-batch runs (using batch-sizes of $1$ and $10$, respectively), this time employing the zero-order coordinate search method detailed in Section 2.6. The function $g$ we minimize in these runs is the same sum of random convex quadratics used in the previous Example.

Below we plot the cost function histories associated with the full batch run (shown in blue) and the mini-batch runs with batch-sizes equal to $10$ and $1$ (in magenta and black, respectively). In the left panel we show these three histories with respect to the batch-size $1$ steps; that is, we show the cost function value at every single step of the batch-size $1$ run, and plot the history of the other mini-batch run every $10$ steps and of the full batch run every $100$ steps for visual comparison. In the right panel we show all three histories only after each full epoch is complete. In either case, we can see clearly that both mini-batch approaches descend significantly faster than the full batch version.
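As with the previous Example, the original notebook cells are not shown; the sketch below re-creates the idea using the placeholder quadratics from earlier, swapping the gradient step for a zero-order coordinate search step that evaluates the mini-batch cost at candidate points a distance $\alpha$ away and keeps the best one (for a single weight there are only the two coordinate directions $\pm 1$).

```python
def coordinate_search_step(w, alpha, batch):
    """One zero-order step in the sum of g_p over the given mini-batch."""
    def cost(v):
        return sum(g_p(v, p) for p in batch)
    best_w, best_cost = w, cost(w)
    for direction in (1.0, -1.0):            # single-input case: two coordinate directions
        candidate = w + alpha * direction
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:       # only move if the candidate lowers the cost
            best_w, best_cost = candidate, candidate_cost
    return best_w

def zero_order_minibatch_epoch(w, alpha, batch_size=10):
    """One epoch of coordinate search steps taken mini-batch by mini-batch."""
    indices = np.arange(P)
    batches = [indices[i:i + batch_size] for i in range(0, P, batch_size)]
    for omega_m in batches:
        w = coordinate_search_step(w, alpha, omega_m)
    return w
```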
