A.4 Advanced Gradient-Based Methods*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

In Section 3.7 we discussed two fundamental problems with the negative gradient insofar as it is used as a descent direction for local optimization: the zig-zagging and slow-crawling behaviors. In the Sections that followed we detailed a fundamental remedy for each of these problems independently: momentum acceleration and normalized gradient descent. These two problems can certainly occur together in practice, particularly with the sort of functions we minimize in machine learning (especially those involving neural networks). Such functions often contain long narrow valleys whose contours provoke zig-zagging of gradient descent steps, and whose flatness causes those steps to slow to a crawl.

Because of this, many advanced first order gradient-based methods have been developed in the machine learning community that essentially combine the momentum and normalized gradient ideas in various interesting ways. In this Section we detail several of these popular methods, including RMSprop and Adam, highlighting their connection to the two fundamental remedies discussed in the prior two Sections.


Combining momentum with normalized gradient descent

In Section 3.7 we described the notion of momentum-accelerated gradient descent, and how it is a natural remedy for the zig-zagging problem the standard gradient descent algorithm suffers from when run along long narrow valleys. As we saw, the momentum-accelerated descent direction $\mathbf{d}^{k-1}$ is simply an exponential average of gradient descent directions, taking the form

\begin{equation}
\begin{array}{l}
\mathbf{d}^{k-1} = \beta \, \mathbf{d}^{k-2} - \left(1 - \beta\right)\nabla g\left(\mathbf{w}^{k-1}\right) \\
\mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \, \mathbf{d}^{k-1}
\end{array}
\end{equation}

where $\beta \in \left[0,1 \right]$ is typically set at a value of $\beta = 0.8$ or higher.
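To make this recipe concrete, below is a minimal Python / NumPy sketch of the momentum-accelerated step, assuming a gradient function `grad` for $g$ (e.g., one produced by an automatic differentiator) and an initial weight array `w`; the function name and the default values of `alpha` and `beta` are illustrative choices, not prescribed values.

```python
import numpy as np

def momentum_descent(grad, w, alpha=0.1, beta=0.8, max_its=100):
    # initialize the exponential average at the first negative gradient direction
    d = -grad(w)
    weight_history = [w.copy()]
    for k in range(max_its):
        # exponential average of negative gradient directions (first line of equation (1))
        d = beta*d - (1 - beta)*grad(w)
        # momentum-accelerated descent step (second line of equation (1))
        w = w + alpha*d
        weight_history.append(w.copy())
    return weight_history
```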

Then in Section 3.9 we saw how normalizing the gradient descent direction componentwise helps deal with the problem standard gradient descent has when traversing flat regions of a function. We saw there how a component-normalized gradient descent step takes the form (for the $j^{th}$ component of $\mathbf{w}$)

\begin{equation} w_j^k = w_j^{k-1} - \alpha \, \frac{\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right)}{\sqrt{\left(\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right)\right)^2}} \end{equation}

where in practice of course a small fixed value $\epsilon > 0$ is often added to the denominator on the right hand side to avoid division by zero.
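For comparison, here is an analogous sketch of the component-normalized step in equation (2), under the same assumptions about `grad` and `w`; the small constant `eps` plays the role of the fixed value $\epsilon$ added to the denominator.

```python
import numpy as np

def normalized_descent(grad, w, alpha=0.1, eps=1e-8, max_its=100):
    weight_history = [w.copy()]
    for k in range(max_its):
        g = grad(w)
        # divide each partial derivative by its own magnitude, with eps
        # added to the denominator to avoid division by zero
        w = w - alpha * g / (np.sqrt(g**2) + eps)
        weight_history.append(w.copy())
    return weight_history
```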

Knowing that these two additions to the standard gradient descent step help solve two fundamental problems associated with gradient descent, it is natural to try to combine them to leverage both enhancements. There are several ways one might think to combine these two ideas. For example, one could momentum-accelerate a componentwise-normalized direction - in other words, replace the gradient descent direction in the exponential average in equation (1) with its component-normalized version shown in equation (2).

Another way of combining the two ideas would be to component-normalize the exponential average descent direction computed in momentum-accelerated gradient descent. That is, compute the exponential average direction in the top line of equation (1) and then normalize it (instead of the raw gradient descent direction) as shown in equation (2).

Doing this - and writing out the update for only the $j^{th}$ component of the resulting step - we have

\begin{equation}
\begin{array}{l}
d^{k-1}_j = \beta \, d^{k-2}_j - \left(1 - \beta\right)\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right) \\
d^{k-1}_j \longleftarrow \frac{d^{k-1}_j}{\sqrt{\left(d^{k-1}_j\right)^2}}
\end{array}
\end{equation}

where in practice of course a small $\epsilon > 0$ (e.g., $\epsilon = 10^{-8}$) is added to the denominator to avoid division by zero.

With a full direction $\mathbf{d}^{k-1}$ computed in this way we can then take a descent step

\begin{equation} \mathbf{w}^{\,k} = \mathbf{w}^{\,k-1} + \alpha \, \mathbf{d}^{k-1}. \end{equation}
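Putting the pieces together, the sketch below implements this combined step - exponential averaging followed by componentwise normalization - under the same assumptions as the snippets above (an assumed gradient function `grad`, an initial NumPy array `w`, and illustrative default parameter values).

```python
import numpy as np

def momentum_normalized_descent(grad, w, alpha=0.1, beta=0.8, eps=1e-8, max_its=100):
    # initialize the exponential average at the first negative gradient direction
    d = -grad(w)
    weight_history = [w.copy()]
    for k in range(max_its):
        # exponential average of negative gradient directions
        d = beta*d - (1 - beta)*grad(w)
        # componentwise normalization of the averaged direction
        d_normalized = d / (np.sqrt(d**2) + eps)
        # descent step
        w = w + alpha*d_normalized
        weight_history.append(w.copy())
    return weight_history
```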

Many popular first order steps used to tune machine learning models employing deep neural networks combine momentum and normalized gradient descent in this sort of way. Below we list a few examples including the popular Adam and RMSprop first order steps.

Example 1. Adaptive Moment Estimation (Adam)

Adaptive Moment Estimation (Adam) is a componentwise-normalized gradient step employing independently computed exponential averages of both the descent direction and its magnitude. That is, along the $j^{th}$ coordinate we compute an exponential average $d_j^{k-1}$ of the partial derivative and an exponential average $h_j^{k-1}$ of its squared magnitude separately as

\begin{equation}
\begin{array}{l}
d^{k-1}_j = \beta_1 \, d^{k-2}_j + \left(1 - \beta_1\right)\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right) \\
h_j^{k-1} = \beta_2 \, h_j^{k-2} + \left(1 - \beta_2\right)\left(\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right)\right)^2
\end{array}
\end{equation}

where $\beta_1$ and $\beta_2$ both lie in the range $[0,1]$. Popular values for the parameters of this update step are $\beta_1 = 0.9$ and $\beta_2 = 0.999$.

Note that, as with any exponential average, these two updates apply when $k-1 > 0$ and should be initialized at the first values of the series they respectively model: that is, the initial descent direction $d^0_j = \frac{\partial}{\partial w_j}g\left(\mathbf{w}^{0}\right)$ and its squared magnitude $h^0_j = \left(\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{0}\right)\right)^2$.

The Adam step is then a component-normalized descent step using this exponentially averaged descent direction and magnitude. A step in the $j^{th}$ coordinate then takes the form

\begin{equation} w_j^k = w_j^{k-1} - \alpha \frac{d^{k-1}_j}{\sqrt{h_j^{k-1}}} \end{equation}

where in practice of course a small $\epsilon > 0$ (e.g., $\epsilon = 10^{-8}$) is added to the denominator to avoid division by zero.

Notice - as we saw with the component-normalized step in the previous Section - that if we slightly re-write the above as

\begin{equation} w_j^k = w_j^{k-1} - \frac{\alpha}{\sqrt{h_j^{k-1}}} \, d^{k-1}_j \end{equation}

we can interpret the Adam step as a momentum-accelerated gradient descent step with an individual steplength / learning rate value $\frac{\alpha}{\sqrt{h_j^{k-1}}}$ for each component, each of which adjusts itself at each step based on the component-wise exponentially averaged squared magnitude of the gradient.


Note: the authors of this particular update step proposed that each exponential average be initialized at zero - i.e., as $d^0_j = 0$ and $h^0_j = 0$ - instead of at the first step of each series they respectively model (i.e., the initial derivative and its squared magnitude). This initialization - along with values for $\beta_1$ and $\beta_2$ typically chosen greater than $0.9$ - causes the first few updates of these exponential averages to be 'biased' towards zero as well. Because of this the authors also employ a 'bias-correction' of the form $d^{k-1}_j \longleftarrow \frac{d^{k-1}_j }{1-\left(\beta_1\right)^{k-1}}$ and $h^{k-1}_j \longleftarrow \frac{h^{k-1}_j }{1-\left(\beta_2\right)^{k-1}}$ to compensate for this initialization.
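A minimal sketch of the complete Adam step, including the zero initialization and bias correction just described, might then look as follows; `grad` and `w` are assumed as before, and the default steplength `alpha` shown here is a common choice rather than a value quoted in the text.

```python
import numpy as np

def adam(grad, w, alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8, max_its=100):
    # exponential averages of partial derivatives and their squared magnitudes,
    # initialized at zero as described in the note above
    d = np.zeros_like(w)
    h = np.zeros_like(w)
    weight_history = [w.copy()]
    for k in range(1, max_its + 1):
        g = grad(w)
        d = beta_1*d + (1 - beta_1)*g       # average of partial derivatives
        h = beta_2*h + (1 - beta_2)*g**2    # average of squared magnitudes
        # bias correction compensating for the zero initialization
        d_hat = d / (1 - beta_1**k)
        h_hat = h / (1 - beta_2**k)
        # component-normalized descent step
        w = w - alpha * d_hat / (np.sqrt(h_hat) + eps)
        weight_history.append(w.copy())
    return weight_history
```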


Example 2. Root Mean Squared Propagation (RMSprop)

This popular first order step is a variant of the component-normalized step discussed in Section 3.9, where each component of the gradient is normalized by an exponential average of the magnitudes of previously computed gradient descent directions.

In other words, we compute an exponential average of the squared magnitude of the gradient at each step. Denoting by $h_j^{k-1}$ the exponential average of the squared magnitude of the $j^{th}$ partial derivative, we have

\begin{equation} h_j^{k-1} = \gamma \, h_j^{k-2} + \left(1 - \gamma\right)\left(\frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right)\right)^2 \end{equation}

The Root Mean Squared Propagation (RMSprop) step is then a component-normalized descent step using this exponential average. A step in the $j^{th}$ coordinate then takes the form

\begin{equation} w_j^k = w_j^{k-1} - \alpha \frac{\frac{\partial}{\partial w_j} g\left(\mathbf{w}^{k-1}\right)}{\sqrt{h_j^{k-1}}} \end{equation}

where in practice of course a small $\epsilon > 0$ (e.g., $\epsilon = 10^{-8}$) is added to the denominator to avoid division by zero. Popular values for the parameters of this update step are $\gamma = 0.9$ and $\alpha = 10^{-2}$.

Notice - as we saw with the component-normalized step in the previous Section - that if we slightly re-write the above as

\begin{equation} w_j^k = w_j^{k-1} - \frac{\alpha}{\sqrt{h_j^{k-1}}} \, \frac{\partial}{\partial w_j}g\left(\mathbf{w}^{k-1}\right) \end{equation}

we can interpret the RMSprop step as a standard gradient descent step with an individual steplength / learning rate value $\frac{\alpha}{\sqrt{h_j^{k-1}}}$ for each component, each of which adjusts itself at each step based on the component-wise exponentially averaged squared magnitude of the gradient.
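A corresponding sketch of the RMSprop step, using the parameter values quoted above and the same assumed gradient function `grad` and initial array `w`, might look as follows.

```python
import numpy as np

def rmsprop(grad, w, alpha=1e-2, gamma=0.9, eps=1e-8, max_its=100):
    # initialize the exponential average at the first squared partial derivatives
    h = grad(w)**2
    weight_history = [w.copy()]
    for k in range(max_its):
        g = grad(w)
        # exponential average of squared partial derivatives
        h = gamma*h + (1 - gamma)*g**2
        # component-normalized descent step
        w = w - alpha * g / (np.sqrt(h) + eps)
        weight_history.append(w.copy())
    return weight_history
```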