10.3 Nonlinear Multi-Output Regression*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

In this Section we describe nonlinear feature engineering for multi-output regression, first introduced in Section 5.6. This mirrors what we have seen in the previous Section completely, with one small but important difference: in the multi-output case we can choose to model each regression separately, employing one nonlinear model per output, or jointly, producing a single nonlinear model for all outputs simultaneously.


Modeling principles of linear multi-output regression

Recall that when dealing with multi-output regression (introduced in Section 5.6) we have $N$ dimensional input / $C$ dimensional output pairs $\left(\mathbf{x}_p,\,\mathbf{y}_p\right)$, and our joint linear model for all $C$ regressions takes the form

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{W}\right) = \mathring{\mathbf{x}}^T\mathbf{W}^{\,} \end{equation}

where the weight matrix $\mathbf{W}$ has dimension $\left(N+1\right)\times C$. As discussed there, we can tune the parameters of this joint model one column at a time by solving each linear regression independently. We can also tune the parameters of $\mathbf{W}$ simultaneously by minimizing an appropriate regression cost function over the entire matrix, e.g., the Least Squares

\begin{equation} g\left(\mathbf{W}\right) = \frac{1}{P}\sum_{p=1}^{P} \left \Vert \mathring{\mathbf{x}}_{p}^T \mathbf{W} - \overset{\,}{\mathbf{y}}_{p}^{\,} \right \Vert_2^2. \end{equation}

However, because this model is linear, both approaches result in the same tuned parameters.
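To make these pieces concrete, below is a minimal NumPy sketch of the joint linear model and its Least Squares cost. The data layout is an assumption made purely for illustration: inputs stacked column-wise in an $N \times P$ array `x`, outputs stacked row-wise in a $P \times C$ array `y`, and weights stored in an $\left(N+1\right) \times C$ array `w`.

```python
import numpy as np

def model(x, w):
    # x: N x P array of inputs, w: (N + 1) x C array of weights;
    # prepend a row of ones so the first row of w acts as the bias weights
    x_ring = np.vstack((np.ones((1, x.shape[1])), x))   # (N + 1) x P
    return np.dot(x_ring.T, w)                          # P x C array of predictions

def least_squares(w, x, y):
    # average squared Euclidean distance between predictions and true outputs
    return np.sum((model(x, w) - y) ** 2) / float(y.shape[0])
```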

Modeling principles of nonlinear multi-output regression

With multi-output regression the move from linear to nonlinear modeling mirrors what we have seen in the previous Section completely - with one small but important wrinkle. With $C$ regressions to perform we can choose to either produce $C$ independent nonlinear models or one shared nonlinear model for all $C$ regressions.

Figure 1: Nonlinear multi-output regression: we can either solve each regression problem independently, employing a distinct nonlinearity per sub-problem (left panel), or use a *shared nonlinear architecture* in solving all problems *simultaneously* (right panel).

If we choose the former approach - forming $C$ separate nonlinear models - each feature engineering / nonlinear regression is executed precisely as we have seen in the previous Section. That is, for the $c^{th}$ regression problem we construct a model using (in general) $B_c$ nonlinear feature transformations as

\begin{equation} \text{model}_c\left(\mathbf{x},\Theta_c\right) = w_{c,0} + f_{1}\left(\mathbf{x}\right){w}_{c,1} + f_{2}\left(\mathbf{x}\right){w}_{c,2} + \cdots + f_{B_c}\left(\mathbf{x}\right)w_{c,B_c} \end{equation}

where $f_1,\,f_2,\,...,\,f_{B_c}$ are nonlinear parameterized or unparameterized functions that can be chosen uniquely for the $c^{th}$ model, and $w_{c,0}$ through $w_{c,B_c}$ (along with any additional weights internal to the nonlinear functions) are represented in the weight set $\Theta_c$ and must be tuned properly. To perform multi-output regression we can then - with nonlinear feature transformations chosen - tune each of the $C$ models above via the minimization of an appropriate regression cost, e.g., the Least Squares, as sketched below.
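As a sketch of this independent approach, the snippet below fits each output with its own unparameterized feature map by solving a separate single-output Least Squares problem. The name `feature_transforms_list` and the column-wise input / row-wise output layout are assumptions made for illustration.

```python
import numpy as np

def fit_independently(x, y, feature_transforms_list):
    # x: N x P inputs, y: P x C outputs,
    # feature_transforms_list: one unparameterized feature map per output,
    # the c-th mapping an N x P input array to a B_c x P array of features
    weights = []
    for c, feature_transforms_c in enumerate(feature_transforms_list):
        f = feature_transforms_c(x)                           # B_c x P
        f_ring = np.vstack((np.ones((1, f.shape[1])), f))     # (B_c + 1) x P
        # solve the c-th single-output Least Squares problem on its own
        w_c = np.linalg.lstsq(f_ring.T, y[:, c], rcond=None)[0]
        weights.append(w_c)
    return weights                                            # C weight vectors
```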

With the latter approach to nonlinear multi-output regression - that is, engineering a single set of nonlinear feature transformations and sharing them among all $C$ regressions - we simply use the same nonlinear features for all $C$ models. This is very often done to simplify the chores of nonlinear feature engineering and - as we will see - nonlinear feature learning. If we choose the same set of $B$ nonlinear features for all $C$ models, the $c^{th}$ of them takes the form

\begin{equation} \text{model}_c\left(\mathbf{x},\Theta_c\right) = w_{c,0} + f_{1}\left(\mathbf{x}\right){w}_{c,1} + f_{2}\left(\mathbf{x}\right){w}_{c,2} + \cdots + f_{B}\left(\mathbf{x}\right)w_{c,B} \end{equation}

Note here that while $\Theta_c$ contains both the linear combination weights $w_{c,0},\,...,\,w_{c,B}$ and any weights internal to the feature transformations, the only parameters unique to this model alone are the linear combination weights (since every model shares any weights internal to the feature transformations).

Employing the compact notation for our feature transformations introduced in Section 10.2.2, we can express each of these models more compactly as

\begin{equation} \text{model}_c\left(\mathbf{x},\Theta_c\right) = \mathring{\mathbf{f}}_{\,}^T \mathbf{w}_c. \end{equation}

Figure: (left) Linear multi-output regression. (middle) Nonlinear multi-output regression where each output uses its own distinct nonlinear feature transformation. (right) Nonlinear multi-output regression where both outputs use the same nonlinear feature transformation (a case of feature-sharing).

Alternatively, we can formulate them all as one joint model by stacking the linear combination weights of our $C$ nonlinear models above into a $\left(B + 1\right) \times C$ array of the form

\begin{equation} \mathbf{W}=\begin{bmatrix} w_{0,0} & w_{0,1} & w_{0,2} & \cdots & w_{0,C-1} \\ w_{1,0} & w_{1,1} & w_{1,2} & \cdots & w_{1,C-1} \\ w_{2,0} & w_{2,1} & w_{2,2} & \cdots & w_{2,C-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{B,0} & w_{B,1} & w_{B,2} & \cdots & w_{B,C-1} \\ \end{bmatrix} \end{equation}

With this notation in hand we can then express all $C$ models together as the single joint model

\begin{equation} \text{model}\left(\mathbf{x},\Theta\right) = \mathring{\mathbf{f}}_{\,}^T \mathbf{W} \end{equation}

This of course is a direct generalization of the original linear model shown in equation (1). Note here that the set $\Theta$ contains the linear combination weights $\mathbf{W}$ as well as any parameters internal to our feature transformations. To tune the weights of our joint model we minimize an appropriate regression cost over the parameters in $\Theta$, e.g., the Least Squares

\begin{equation} g\left(\Theta\right) = \frac{1}{P}\sum_{p=1}^{P} \left \Vert \mathring{\mathbf{f}}_{p}^T \mathbf{W} - \overset{\,}{\mathbf{y}}_{p}^{\,} \right \Vert_2^2. \end{equation}

If these feature transformations contain no internal parameters (e.g., if they are polynomial functions) then, since the joint cost decomposes into a sum of $C$ individual regression costs, each model can be tuned separately. However when employing parameterized features (e.g., neural networks) the joint cost function does not decompose over the individual regressors, and we must tune all of our model parameters jointly, that is, learn all $C$ regressions simultaneously.
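To see the first claim directly we can write the squared norm in the Least Squares cost coordinate-wise, with $y_{p,c}$ denoting the $c^{th}$ entry of $\mathbf{y}_p$ and $\mathbf{w}_c$ the column of $\mathbf{W}$ belonging to the $c^{th}$ model

\begin{equation} g\left(\Theta\right) = \frac{1}{P}\sum_{p=1}^{P} \sum_{c=1}^{C} \left( \mathring{\mathbf{f}}_{p}^T \mathbf{w}_c - y_{p,c} \right)^2 = \sum_{c=1}^{C} \left( \frac{1}{P}\sum_{p=1}^{P} \left( \mathring{\mathbf{f}}_{p}^T \mathbf{w}_c - y_{p,c} \right)^2 \right). \end{equation}

When the features $\mathring{\mathbf{f}}_p$ are fixed (unparameterized) the $c^{th}$ summand on the right depends only on $\mathbf{w}_c$, so each sub-problem can be minimized independently; when the features are parameterized every summand also depends on the shared internal feature weights, coupling the $C$ sub-problems together.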

A feature engineering example

Given the additional number of regressions here, determining features by visual analysis is more challenging than the basic instance of regression detailed in the previous Section. Here we provide one relatively simple example of this sort of feature engineering.

Example 1. Tuning multiple regressions simultaneously

Below we plot a $C = 2$ multi-output regression dataset: the input/output pairs for the first output are shown in the left panel, while those for the second output are shown in the right. Both regressions appear to be sinusoidal in nature, with each having its own unique shape.


To model both regressions simultaneously we will use $B = 2$ parameterized sinusoidal feature transformations

\begin{equation} \begin{array}{c} f_1\left(\mathbf{x}\right) = \text{sin}\left(w_{1,0} + w_{1,1}x_1 + w_{1,2}x_2\right) \\ f_2\left(\mathbf{x}\right) = \text{sin}\left(w_{2,0} + w_{2,1}x_1 + w_{2,2}x_2\right) \\ \end{array} \end{equation}

Fitting this set of nonlinear features jointly to both regression problems above (using gradient descent) results in the fits shown below - both of which are quite good.
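For completeness, here is a minimal sketch of these two parameterized features implemented as a single `feature_transforms` function. The column-wise input layout and the packing of the internal parameters $w_{b,0},\,w_{b,1},\,w_{b,2}$ into the $b^{th}$ column of a $3 \times 2$ array are assumptions made for illustration.

```python
import autograd.numpy as np

def feature_transforms(x, w_internal):
    # x: 2 x P array of inputs; w_internal: 3 x 2 array whose b-th column holds
    # the internal weights (w_{b,0}, w_{b,1}, w_{b,2}) of the b-th sinusoidal feature
    x_ring = np.vstack((np.ones((1, x.shape[1])), x))   # 3 x P, with a row of ones
    return np.sin(np.dot(w_internal.T, x_ring))         # 2 x P: rows are f_1 and f_2
```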


Implementing joint nonlinear multi-output regression in Python

As with the linear case, detailed in Section 5.6.3, here likewise we can piggy-back on our general Pythonic implementation of nonlinear regression introduced in Section 10.2.4 and employ precisely the same model and cost function implementation as used in the single-output case. The only difference here is in how we define our feature transformations and the dimension of our matrix of linear combination weights.
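A sketch of what this can look like is given below, mirroring the single-output pattern. The parameter container `Theta` is assumed (for illustration) to be a Python list holding the weights internal to `feature_transforms` followed by the $\left(B+1\right) \times C$ matrix of linear combination weights, and `autograd` supplies the gradient for tuning everything jointly.

```python
import autograd.numpy as np
from autograd import grad

def model(x, Theta):
    # Theta[0]: weights internal to the feature transformations
    # Theta[1]: (B + 1) x C matrix of linear combination weights
    f = feature_transforms(x, Theta[0])                 # B x P array of features
    f_ring = np.vstack((np.ones((1, f.shape[1])), f))   # (B + 1) x P
    return np.dot(f_ring.T, Theta[1])                   # P x C array of predictions

def least_squares(Theta, x, y):
    # joint Least Squares cost over all C regressions
    return np.sum((model(x, Theta) - y) ** 2) / float(y.shape[0])

def gradient_descent(cost, Theta, x, y, alpha=1e-1, max_its=1000):
    # fixed-steplength gradient descent over every parameter in Theta
    gradient = grad(cost)
    for _ in range(max_its):
        grad_eval = gradient(Theta, x, y)
        Theta = [t - alpha * g for t, g in zip(Theta, grad_eval)]
    return Theta
```

Compared with the single-output implementation, the only changes are that the linear combination weights form a matrix with $C$ columns rather than a single column, and that `y` carries $C$ outputs per point.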