In the previous Section we saw how boosting based cross-validation automatically learns the proper level of model complexity for a given dataset by optimizing a general high capacity model one unit at-a-time. In this Section we introduce what are collectively referred to as regularization techniques for efficient cross-validation. With this set of approaches we once again start with a single high capacity model, and once again adjust its complexity with respect to a training dataset via careful optimization. However, with regularization we tune all of the units simultaneously, controlling how well we optimize the model's associated cost so that we recover the instance of the model achieving minimum validation error.
Imagine for a moment that we have a simple nonlinear regression dataset, like the one shown in the top-left panel of Figure 11.37, and we use a high capacity model (relative to the nature of the data), made up of a sum of universal approximators of a single kind, to fit it as
\begin{equation} \text{model}\left(\mathbf{x},\Theta\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} + f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_M\left(\mathbf{x}\right)w_M. \label{equation:regularization-original-construct} \end{equation}
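To make this construction concrete, below is a minimal Python sketch of such a model, assuming (purely for illustration) polynomial units $f_m(x) = x^m$ as our single kind of universal approximator; any other fixed unit type could be swapped in for `poly_unit`.

```python
import numpy as np

# A minimal sketch of the model above, assuming polynomial units
# f_m(x) = x^m as our single kind of universal approximator.
def poly_unit(x, m):
    return x ** m

def model(x, Theta):
    """Evaluate w_0 + f_1(x) w_1 + ... + f_M(x) w_M at input(s) x."""
    x = np.asarray(x, dtype=float)
    out = Theta[0] * np.ones_like(x)          # the constant term w_0
    for m in range(1, len(Theta)):            # add each unit f_m(x) w_m
        out = out + poly_unit(x, m) * Theta[m]
    return out
```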
Suppose then that we partition this data into training and validation portions, and then train our high capacity model by completely optimizing the Least Squares cost over the training portion of the data. In other words, we determine a set of parameters for our high capacity model that lie very close to a global minimum of its associated cost function. In the top-right panel of the Figure we draw a hypothetical two dimensional illustration of the cost function associated with our high capacity model over the training data, denoting the global minimum by a blue dot and its evaluation on the function by a blue 'x'.
Since our model has high capacity, the parameters lying at the global minimum of our cost will produce a tuned model that is overly complex and severely overfits the training portion of our dataset. In the bottom-left panel of Figure 11.37 we show the tuned model fit (in blue) provided by such a set of parameters, which wildly overfits the training data. In the top-right panel we also show, as a yellow dot, a set of parameters lying relatively near the global minimum, whose evaluation of the function is shown as a yellow 'x'. This set of parameters, lying in the general neighborhood of the global minimum, is the one that minimizes the cost function over the validation portion of our data. Because of this the corresponding fit (shown in the bottom-right panel in yellow) provides a much better representation of the data.
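A hypothetical numerical sketch of this setup is given below: we build a toy sinusoidal dataset, split it into training and validation portions, and completely minimize the Least Squares cost of a degree $10$ polynomial model over the training portion via a standard least squares solver. The particular dataset, split, and model degree are all illustrative assumptions; the pattern to note is that validation error typically far exceeds training error at the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)  # toy sinusoid

train, val = np.arange(20), np.arange(20, 30)   # simple train/validation split
F = np.vander(x, 11, increasing=True)           # features 1, x, ..., x^10

# "complete optimization": the least squares solution lies at a global minimum
w, *_ = np.linalg.lstsq(F[train], y[train], rcond=None)

train_err = np.mean((F[train] @ w - y[train]) ** 2)
val_err = np.mean((F[val] @ w - y[val]) ** 2)
print(train_err, val_err)  # validation error typically dwarfs training error
```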
This toy example is illustrative of a more general principle we have seen earlier in Section 11.3.2: that overfitting is due both to the capacity of an un-tuned model being too high and its corresponding cost function (over the training data) being optimized too well, leading to an overly complex tuned model. Conversely, when we use a high capacity model the best set of parameters to take from the minimization of its cost function (over the training data) is not the one lying at the global minimum, but the one that provides minimum error over the validation portion of the data.
This phenomenon holds true for all machine learning problems (including regression, classification, and unsupervised learning techniques like the Autoencoder) and is the motivation for general regularization based cross-validation strategies: if proper optimization of all parameters of a high capacity model leads to overfitting, overfitting can be avoided by optimizing said model imperfectly, halting where validation error (not training error) is at its lowest. In other words, regularization in the context of cross-validation constitutes a set of approaches wherein we carefully tune all parameters of a high capacity model by setting them purposefully away from the global minima of its associated cost function. This can be done in a variety of ways; we detail the two most popular approaches below.
With early stopping regularization we properly tune a high capacity model by making a run of local optimization (tuning all parameters of the model simultaneously), and using the set of weights from this run where the model achieves minimum validation error. This idea is illustrated in the left panel of Figure 11.38, where we employ the same prototypical cost function (associated with a high capacity model) first shown in the top-right panel of Figure 11.37. Here again we mark its global minimum and the set of validation error minimizing weights in blue and yellow, respectively (as detailed originally in Figure 11.37). During a run of local optimization we frequently compute training and validation errors (e.g., at each step of the optimization procedure). Thus, depending on the optimization procedure used (as detailed further below), a set of weights providing minimum validation error for a high capacity model can be determined with fine resolution.
This regularization approach is especially popular when employing high capacity deep neural network models as detailed in Section 13.7.
Whether one literally stops the optimization run when minimum validation error has been reached (which can be challenging in practice given the somewhat unpredictable behavior of validation error, as first noted in Section 11.4), or runs the optimization to completion and picks the best set of weights afterwards, in either case we refer to this method as early stopping regularization. Note that the method itself is analogous to the early stopping procedure outlined for boosting based cross-validation in Section 11.5, in that we sequentially increase the complexity of a model until minimum validation error is reached. However, here (unlike with boosting) we do this by controlling how well we optimize all of a model's parameters simultaneously, as opposed to one unit at-a-time.
Supposing that we begin our optimization with a small initial value for the weights (which we typically do; see, for example, Section 3.6), the corresponding training and validation error curves will in general\footnote{Note that both can oscillate in practice depending on the optimization method used.} look like those shown in the top panel of Figure 11.39. At the start of the run the complexity of our model (evaluated at, for instance, the initial weights) is quite small, resulting in large training and validation errors. As minimization proceeds, and we continue optimizing one step at-a-time, error over both training and validation portions of the data decreases while the complexity of the tuned model increases. This trend continues up until a point when the model complexity becomes too great, overfitting begins, and validation error increases.
In terms of the capacity/optimization dial scheme detailed in the context of real data in Section 11.3.2, we can think of (early stopping based) regularization as beginning with our capacity dial set all the way to the right (since we employ a high capacity model) and our optimization dial all the way to the left (at the initialization of our optimization). With this configuration, summarized visually in the bottom panels of Figure 11.39, we allow our optimization dial to (roughly speaking) directly govern the amount of complexity our tuned models can take, with each notch on the optimization dial denoting a single step of local optimization. In other words, with this configuration our optimization dial becomes (roughly speaking) the ideal complexity dial described at the start of the Chapter in Section 11.1. With early stopping we turn our optimization dial from left to right, starting at our initialization and making a run of local optimization one step at-a-time, seeking out a set of parameters that provides minimum validation error for our (high capacity) model. This is illustrated in the bottom panels of Figure 11.39, where we see our capacity dial set all the way to the right and our generic validation error curve wrapped around our optimization dial (as it now, roughly speaking, controls the complexity of each tuned model).
With an initial set of parameters $\Theta_0$, taking $M$ steps of a local optimization produces a sequence of $M+1$ parameter settings $\left\{\Theta_m\right\}_{m=0}^M$ for our model, or similarly (ignoring the initialization for the sake of illustration) a set of $M$ models of generally increasing complexity with respect to the training data $\left\{\text{model}\left(\textbf{x},\Theta_m\right)\right\}_{m=1}^M$.
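The following sketch shows one way this procedure might be implemented for a linear-in-the-weights model like those above, using plain gradient descent on the Least Squares cost. The step length, step count, and feature matrix naming are assumptions for illustration; we record validation error after every step and return the weights from the step where it was lowest.

```python
import numpy as np

def early_stopping_run(F_train, y_train, F_val, y_val, alpha=0.1, max_steps=5000):
    w = np.zeros(F_train.shape[1])            # small (here zero) initialization
    best_w, best_val = w.copy(), np.inf
    for _ in range(max_steps):
        # one gradient descent step on the Least Squares cost (training data)
        grad = 2 * F_train.T @ (F_train @ w - y_train) / len(y_train)
        w = w - alpha * grad
        # measure validation error at this step, keep the best weights seen
        val_err = np.mean((F_val @ w - y_val) ** 2)
        if val_err < best_val:
            best_val, best_w = val_err, w.copy()
    return best_w, best_val

# usage with the feature matrix F and split from the earlier sketch:
# best_w, best_val = early_stopping_run(F[train], y[train], F[val], y[val])
```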
There are a number of important engineering details associated with implementing an effective early stopping regularization procedure.
The second popular approach employs what is called a regularizer. A regularizer is a simple function that can be added to a machine learning cost for a variety of purposes, e.g., to prevent unstable learning (as we saw in Section 6.4.4), as a natural part of relaxing the support vector machine (Section 6.5.4) and multi-class learning scenarios (Section 7.3.3), and for feature selection (Section 9.7). As we will see, the latter of these applications (feature selection) is very similar to our use of the regularizer here.
By adding a simple regularizer function like one of those we have seen in previous applications, e.g., the $\ell_2$ norm, to the cost of a high capacity model, we can alter its shape and, in particular, move the location of its global minima away from their original location(s). In general, if our high capacity model is given as $\text{model}\left(\mathbf{x},\Theta\right)$, its associated cost function as $g$, and a regularizer as $h$, then the regularized cost is given as the linear combination of $g$ and $h$ as
\begin{equation} g\left(\Theta \right) + \lambda h\left(\Theta\right) \label{equation:cross-validation-general-regularized-cost} \end{equation}
where $\lambda$ is referred to as the regularization parameter. The regularization parameter is always non-negative, $\lambda \geq 0$, and controls the mixture of the cost and regularizer. When it is set small and close to zero, $\lambda \approx 0$, the regularized cost is essentially just $g$; conversely, when set very large the regularizer $h$ dominates the linear combination (and so upon minimization we are really just minimizing it alone). In the right panel of Figure 11.38 we show how the shape of a figurative regularized cost (and consequently the location of its minima) changes with the value of $\lambda$.
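As a simple illustration, below is a sketch of this regularized cost for a Least Squares $g$ and the squared $\ell_2$ norm as the regularizer $h$; the function names are ours, chosen to mirror the notation above.

```python
import numpy as np

def g(w, F, y):                      # original cost over the training data
    return np.mean((F @ w - y) ** 2)

def h(w):                            # regularizer: squared l2 norm of weights
    return np.sum(w ** 2)

def regularized_cost(w, F, y, lam):  # the linear combination g(w) + lambda h(w)
    return g(w, F, y) + lam * h(w)
```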
Supposing that we begin with a large value of $\lambda$ and try progressively smaller values (completely optimizing each regularized cost), the corresponding training and validation error curves will in general look something like those shown in the top panel of Figure 11.41 (remember that in practice validation error can oscillate, and need not take just one dip down). At the start of this procedure, using a large value of $\lambda$, the complexity of our model is quite small since the regularizer completely dominates the regularized cost, and thus the minimum recovered belongs to the regularizer and not to the cost function itself. Since this set of weights is virtually unrelated to the data we are training over, the corresponding model will tend to have large training and validation errors. As $\lambda$ is decreased the parameters provided by complete minimization of the regularized cost lie closer to the global minima of the original cost itself, and so error on both training and validation portions of the data decreases while (generally speaking) the complexity of the tuned model increases. This trend continues up until a point when the regularization parameter is small enough that the recovered parameters lie too close to those of the original cost, so that the corresponding model complexity becomes too great. Here overfitting begins and validation error increases.
In terms of the capacity/optimization dial scheme detailed in the context of real data in Section 11.3.2, we can think of (regularizer based) regularization as beginning with our capacity dial set all the way to the right (since we employ a high capacity model) and our optimization dial all the way to the left (employing a large value for $\lambda$ in our regularized cost). Here each notch on the optimization dial represents the complete minimization of the regularized cost for a given value of $\lambda$; thus when the dial is turned all the way to the right (where $\lambda = 0$) we completely minimize the original cost. With this configuration, summarized visually in the bottom panels of Figure 11.41, we allow our optimization dial to (roughly speaking) directly govern the amount of complexity our tuned models can take (here each setting of the capacity dial defines a model and each setting of the optimization dial a set of parameters of that model). As we turn our optimization dial from left to right we decrease the value of $\lambda$ and completely minimize the corresponding regularized cost, seeking out a set of parameters that provides minimum validation error for our (high capacity) model. This is illustrated in the bottom panels of Figure 11.41, where we see our capacity dial set all the way to the right and our generic validation error curve wrapped around our optimization dial (as it now, roughly speaking, controls the complexity of each tuned model).
With a set of $M$ values $\left\{\lambda_m\right\}_{m=1}^M$ for our regularization parameter $\lambda$, sorted from largest to smallest ($\lambda_1$ being the largest value chosen and $\lambda_M$ the smallest), this scheme produces a sequence of $M$ parameter settings $\left\{\Theta_m\right\}_{m=1}^M$ and corresponding models $\left\{\text{model}\left(\textbf{x},\Theta_m\right)\right\}_{m=1}^M$ of generally increasing complexity, as sketched below. Thus, formally speaking, we can see that regularizer based regularization falls into the general category of cross-validation techniques outlined in Section 11.4.
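Below is a sketch of this scheme for the Least Squares cost and quadratic ($\ell_2$) regularizer, where complete minimization of each regularized cost is available in closed form by setting the gradient of the regularized cost to zero. The helper names, and the choice to regularize all weights (including the bias), are illustrative assumptions.

```python
import numpy as np

def ridge_weights(F, y, lam):
    """Completely minimize mean((F w - y)^2) + lam * ||w||^2 in closed form."""
    n, d = F.shape
    # setting the gradient to zero gives (F^T F / n + lam I) w = F^T y / n
    return np.linalg.solve(F.T @ F / n + lam * np.eye(d), F.T @ y / n)

def cross_validate(F_train, y_train, F_val, y_val, lams):
    best_lam, best_val, best_w = None, np.inf, None
    for lam in sorted(lams, reverse=True):     # largest lambda first
        w = ridge_weights(F_train, y_train, lam)
        val_err = np.mean((F_val @ w - y_val) ** 2)
        if val_err < best_val:                 # keep validation-minimizing model
            best_lam, best_val, best_w = lam, val_err, w
    return best_lam, best_w
```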
There are a number of important engineering details associated with implementing an effective regularizer based regularization procedure as well.
Below we illustrate the use of regularizer based regularization on a simple example.
In this example we use a quadratic regularizer to fit a proper nonlinear regression to our toy sinusoidal dataset. Here the training set is colored light blue, and the validation points are colored yellow. We use a high capacity model (with respect to this data), here a polynomial of degree $B = 10$, and $100$ values of $\lambda$ between $0$ and $1$ (completely minimizing the corresponding regularized cost in each instance). For each value of $\lambda$ the fit provided by the weights recovered from the global minimum of the corresponding regularized cost is shown in red, while the associated training and validation errors are shown in blue and yellow, respectively. In this simple experiment, a value somewhere around $\lambda \approx 0.3$ appears to provide the lowest validation error and correspondingly the best fit to the dataset overall.
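A hypothetical replication of this experiment, reusing the pieces sketched earlier (the toy data, its train/validation split, and the `cross_validate` helper), might look as follows; the exact grid of $\lambda$ values is an assumption.

```python
import numpy as np

# x, y, train, val are the toy data and split from the earlier sketch
B = 10
F = np.vander(x, B + 1, increasing=True)    # features 1, x, ..., x^B
lams = np.linspace(1.0, 0.01, 100)          # 100 values, largest to smallest

best_lam, best_w = cross_validate(F[train], y[train], F[val], y[val], lams)
print(best_lam)  # the experiment in the text found lambda near 0.3 works best
```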
As with the boosting procedure detailed in the previous Section, the careful reader will notice how similar the regularizer based regularization framework described here is to the regularization approach to feature selection detailed in Section 9.7. The two approaches are very similar in theme, except that here we do not select from a set of given input features but instead create them ourselves based on a universal approximator. Additionally, whereas our main concern with regularization there was the human interpretability of a machine learning model, here we use regularization as a tool for cross-validation.