7.4 Which Approach Produces the Best Results?*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

We have now seen two fundamental approaches to linear multi-class classification: One-versus-All (OvA) and the Multi-class Perceptron / Softmax. Both approaches are commonly used in practice and can - depending on the dataset - produce similarly good results. However the latter approach (the Multi-class Perceptron / Softmax) is - at least in principle - capable of achieving higher accuracy on a broader range of datasets. Why is this?

As we detailed in Sections 7.2.7 and 7.3.4, regardless of which approach we take we can express all $C$ individual linear classifiers we employ jointly as one large linear model of the form

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{W}\right) = \mathring{\mathbf{x}}_{\,}^T\mathbf{W} \end{equation}

where $\mathbf{W}$ is an $\left(N+1\right) \times C$ matrix of weights whose $c^{th}$ column contains the weights of our $c^{th}$ linear classifier.
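To make the bookkeeping concrete, below is a minimal sketch (in NumPy, with names like `model` and `fusion_rule` chosen purely for illustration) of how this joint linear model and its associated fusion rule can be evaluated.

```python
import numpy as np

# A minimal sketch of the joint linear model: W is an (N+1) x C array whose
# c-th column holds the weights of the c-th linear classifier.
def model(x, W):
    x_ring = np.hstack(([1.0], x))   # prepend a 1 to absorb the bias weight
    return x_ring @ W                # a length-C vector of classifier outputs

def fusion_rule(x, W):
    # assign x the label of the classifier producing the largest output
    return int(np.argmax(model(x, W)))

# example usage with N = 2 inputs and C = 5 classes (random weights)
W = np.random.randn(3, 5)
print(fusion_rule(np.array([0.4, -1.2]), W))
```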

With OvA (as we saw in Section 7.2) we solve a sequence of $C$ two-class subproblems (one per class), tuning the weights of this matrix one column at a time and independently of one another. Only afterward do we combine all classifiers together to form the fusion rule. On the other hand, as we saw in the previous Section, when we minimize the Multi-class Perceptron or Softmax cost functions we tune all parameters of this matrix jointly - i.e., all $C$ classifiers simultaneously - in order to make the fusion rule hold as well as possible over our training dataset. This joint minimization permits valuable interactions to take place in the tuning of the weights or - in other words - by tuning all weights together we are directly aiming to minimize the number of misclassifications our final model makes over the training data. OvA - by comparison - achieves this goal only indirectly, since the $c^{th}$ weight vector $\mathbf{w}_c$ is tuned in a comparatively myopic way, i.e., to minimize misclassifications of the $c^{th}$ class alone. The end goal - minimizing the number of misclassifications over the entire dataset - is 'seen by the weights' only after they have all been tuned.
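To see how joint tuning couples the columns of $\mathbf{W}$, the sketch below writes the Multi-class Softmax cost of Section 7.3 as a function of the entire weight matrix at once; the variable names (`X`, `y`, `P`) and the vectorized layout are our own choices, not the book's code.

```python
import numpy as np

# Multi-class Softmax cost evaluated over an entire dataset: every column of W
# enters every summand, so minimizing it tunes all C classifiers simultaneously.
def multiclass_softmax_cost(W, X, y):
    # X: P x N array of inputs, y: length-P array of integer labels 0,...,C-1
    P = X.shape[0]
    X_ring = np.hstack((np.ones((P, 1)), X))     # P x (N+1)
    scores = X_ring @ W                          # P x C matrix of classifier outputs
    log_sum = np.log(np.sum(np.exp(scores), axis=1))
    correct = scores[np.arange(P), y]            # score of each point's true class
    return np.mean(log_sum - correct)
```

A single gradient step on this cost moves every column of $\mathbf{W}$; an OvA step, by contrast, only ever touches the one column belonging to the subproblem being solved.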

Also notice that (as noted previously) this linear model is precisely the one we use for multi-output regression (see Section 5.6). However, unlike the case of linear multi-output regression - as detailed in Section 5.6.2 - it is not the case here that tuning $\mathbf{W}$ by minimizing a multi-class cost (e.g., the Multi-class Softmax) is equivalent to tuning each of its columns independently.


We illustrate this in-principle superiority of the Multi-class Perceptron / Softmax approach over OvA using a toy $C = 5$ class dataset shown below. Here points colored red, blue, green, khaki, and violet have label values $y_p = 0$, $1$, $2$, $3$, and $4$ respectively.
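The dataset itself is not reproduced here; the following is a hypothetical stand-in with the same qualitative structure - a small central class (playing the role of the khaki class, label $y_p = 3$) surrounded by four larger ones - useful if you wish to reproduce the experiment that follows.

```python
import numpy as np

# Hypothetical stand-in data: four large clusters surrounding one small central
# cluster (label 3), mimicking the structure of the toy dataset described above.
np.random.seed(0)
centers = np.array([[-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [0.0, 0.0], [2.0, 2.0]])
sizes = [40, 40, 40, 10, 40]                     # the central (khaki-like) class is much smaller
X = np.vstack([c + 0.6 * np.random.randn(s, 2) for c, s in zip(centers, sizes)])
y = np.concatenate([np.full(s, i, dtype=int) for i, s in enumerate(sizes)])
```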


Think for a moment about how the OvA approach will perform in terms of the khaki colored class. In particular, think of how the subproblem in which we distinguish between members of this class and all others will be solved. Because this class of data is surrounded by members of the other classes - and there are fewer members of the khaki class than all other classes combined - the optimal classification rule for this subproblem is to classify all points as non-khaki (or in other words to misclassify the entire khaki class). This implies that the linear decision boundary will lie outside the range of the points shown, with all points lying on its negative side. The resulting individual classifier decision boundaries are shown in the left panel below. Now, because the weights of the decision boundary associated with the khaki colored class are tuned solely based on this subproblem, the entire khaki class ends up misclassified in the final OvA solution. This is shown in the right panel of the figure below, along with the fused decision boundary.
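On the stand-in data above, a bare-bones OvA routine - our own minimal implementation using a two-class softmax (logistic) cost and plain gradient descent, not the book's code - typically reproduces this behavior: the column of $\mathbf{W}$ belonging to the khaki-like class learns to call everything non-khaki.

```python
import numpy as np

# Bare-bones OvA: each column of W is tuned on its own "class c versus the rest"
# two-class softmax cost, independently of all other columns.
def ova_train(X, y, C, steps=3000, alpha=0.1):
    P, N = X.shape
    X_ring = np.hstack((np.ones((P, 1)), X))
    W = np.zeros((N + 1, C))
    for c in range(C):
        y_pm = np.where(y == c, 1.0, -1.0)       # +1 for class c, -1 for all others
        w = np.zeros(N + 1)
        for _ in range(steps):                    # gradient descent on log(1 + e^{-y w.x})
            s = 1.0 / (1.0 + np.exp(y_pm * (X_ring @ w)))
            w += alpha * (X_ring * (y_pm * s)[:, None]).mean(axis=0)
        W[:, c] = w
    return W

W_ova = ova_train(X, y, C=5)
X_ring = np.hstack((np.ones((X.shape[0], 1)), X))
preds = np.argmax(X_ring @ W_ova, axis=1)
print("points predicted khaki:", np.sum(preds == 3))   # typically 0 - the class is lost
print("total misclassifications:", np.sum(preds != y))
```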


If - on the other hand - we employ the Multi-class Perceptron or Softmax approach we will not miss this class. Here - because we tune all weights together in order to minimize the number of misclassifications over the entire dataset - we end up with a final fused decision boundary that is far superior to the one provided by OvA (shown in the right panel below). We misclassify far fewer points and - in particular - do not misclassify the entire khaki class of data. We show the individual linear decision boundaries found here as well in the left panel. While the decision boundary associated with the khaki class again lies outside the range of the datapoints, because we have tuned everything together the weights of the other individual decision boundaries are adjusted to better satisfy our goal of minimizing misclassifications.
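For comparison, here is a matching sketch of the joint approach on the same stand-in data - gradient descent on the Multi-class Softmax cost, again our own minimal implementation - in which every column of $\mathbf{W}$ is updated at once.

```python
import numpy as np

# Joint tuning: each gradient step on the Multi-class Softmax cost updates every
# column of W simultaneously, directly targeting the fused decision boundary.
def softmax_train(X, y, C, steps=3000, alpha=0.1):
    P, N = X.shape
    X_ring = np.hstack((np.ones((P, 1)), X))
    W = np.zeros((N + 1, C))
    onehot = np.eye(C)[y]
    for _ in range(steps):
        scores = X_ring @ W
        scores -= scores.max(axis=1, keepdims=True)       # for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= alpha * X_ring.T @ (probs - onehot) / P      # gradient couples all columns
    return W

W_joint = softmax_train(X, y, C=5)
X_ring = np.hstack((np.ones((X.shape[0], 1)), X))
preds = np.argmax(X_ring @ W_joint, axis=1)
print("points predicted khaki:", np.sum(preds == 3))     # typically nonzero, unlike OvA
print("total misclassifications:", np.sum(preds != y))
```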
