7.2 One-versus-All multi-class classification

In practice many classification problems have more than two classes we wish to distinguish, e.g., face recognition, hand gesture recognition, general object detection, speech recognition, and more. However because it has only two sides, a single linear separator is fundamentally insufficient as a mechanism for differentiating between more than two classes of data. Nonetheless we can use our understanding of two-class classification to overcome this shortcoming when dealing with $C>2$ classes by learning $C$ linear classifiers (one per class), each distinguishing one class from the rest of the data.

The heart of the matter is how we should combine these individual classifiers to create a reasonable multi-class decision boundary. In this Section we develop this basic scheme - called One-versus-All multi-class classification - step-by-step by studying how such an idea should unfold on a toy dataset. With due diligence and a little common sense we can intuitively derive universal ideas regarding multiclass classification that are the basis for most popular multi-class classification schemes, including One-versus-All (OvA) classification.

In [1]:

7.2.1 Notation and modeling

A multiclass dataset $\left\{ \left(\mathbf{x}_{p,}\,y_{p}\right)\right\} _{p=1}^{P}$ consists of $C$ distinct classes of data. As with the two class case we can in theory use any $C$ distinct labels to distinguish between the classes. For the sake of this derivation it is convenient to use label values $y_{p}\in\left\{ 0,1,...,C-1\right\} $, which is what we will use throughout this Section.

Below we show a prototypical toy $C = 3$ class dataset we will use in our derivation of OvA. Afterwards we will apply what we have developed to other datasets as well. Here the points colored blue have label value $y_p = 0$, those colored red have label value $y_p = 1$, and those colored green have label value $y_p = 2$.

In [2]:

7.2.2 Training $C$ One-versus-All classifiers

The first step of OvA classification is simple - we reduce the new problem of multi-class classification into a sequence of smaller problems that we are already familiar with. If we reflect for a moment on what we want, one way of phrasing the goal of multi-class classification is that we want to learn how to distinguish each class of our data from the other $C-1$ classes. From this perspective it certainly seems that a good first step towards accomplishing our goal would be to learn $C$ two-class classifiers on the entire dataset, with the $c^{th}$ classifier trained to distinguish the $c^{th}$ class from the remainder of the data. With the $c^{th}$ two-class subproblem we simply assign temporary labels $\tilde y_p$ to the entire dataset, giving $+1$ labels to the $c^{th}$ class and $-1$ labels to the remainder of the dataset

\begin{equation} \tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases} \end{equation}

where again $y_p$ is the original label for the $p^{th}$ point, and run the two-class classification scheme of our choice.

Doing this for our $C = 3$ dataset in this case we end up learning 3 linear classifiers - here we use logistic regression for each subproblem, solving each using Newton's method.

With our classifiers trained we can now illustrate our learned decision boundaries - each learned to distinguish a single class from the remainder of the data. Below we plot two rows of images - in the top row our original dataset is plotted three times with each instance showing just one of the three two-class classifiers learned. The single class being distinguished is colored with its original color - with the corresponding learned decision boundary colored similarly - and all other data is colored gray. In the bottom row the dataset is shown with along with all three learned decision boundaries all at once.

In [3]:

7.2.3 Points on the positive side of a single classifier

With OvA we learn $C$ two-class classifiers, and we can denote the weights from the $c^{th}$ classifier as $\mathbf{w}_c$ where

\begin{equation} \mathbf{w}_c=\begin{bmatrix} w_{0,c}\\ w_{1,c}\\ w_{2,c}\\ \vdots\\ w_{N,c} \end{bmatrix} \end{equation}

and then the corresponding decision boundary as

\begin{equation} \mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c = 0. \end{equation}

Now note how in the figure above how in each case - because each subproblem is perfectly linearly separable and because of our choice of temporary labels - that the class to be distinguished from the rest lies on the positive side of its respective classifier, with the remainder of the points lying on the negative side. This of course means that for the $j^{th}$ classifier we have for the $p^{th}$ point $\mathbf{x}_p$ that

\begin{equation} \mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c \begin{cases} > 0 \,\,\,\,\,\,\text{if}\,\,\, y_p = c \\ < 0 \,\,\,\,\,\, \text{if} \,\,\, y_p \neq c. \end{cases} \end{equation}

This implies that - when evaluated by each two-class classifier individually - the one identifying a point's true label always provides the largest evaluation, i.e.,

\begin{equation} \mathring{\mathbf{x}}_{\,}^T \mathbf{w}_j = \underset{c \,=\, 0,...,C-1}{\text{max}} \,\,\,\mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c \end{equation}

So we know how to classify the points that we have, what about those we do not? How do we classify arbitrary points in the space of our example? Lets figure this out step-by-step.

First, those points that lie solely on the positive side of the $j^{th}$ classifier only - like the points we already have - should clearly belong to the $j^{th}$ class. Such a point $\mathbf{x}$ lies close to those we already have, also clearly satisfying the condition that

\begin{equation} \mathring{\mathbf{x}}_{\,}^T \mathbf{w}_j = \underset{c \,=\, 0,...,C-1}{\text{max}} \,\,\,\mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c \end{equation}

Therefore to get the associated label $y$ we therefore want the maximum argument of the right hand side - since this gives the index associated with the largest classifier. Formally then the predicted label $y$ for input point $\mathbf{x}$ here can be written as

\begin{equation} y = \underset{c \,=\, 0,...,C-1}{\text{argmax}} \,\,\,\mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c \end{equation}

We color in the points representing these areas of the space of our dataset in the Python cell below.

In [4]:

7.2.4 Points on the positive side of more than one classifier

Notice there are regions left uncolored above that do not fall into this first category (of being on the positive side of a single classifier). These include regions where points are on the positive side of more than one classifier - these are the three triangular white regions bordered by two decision boundaries. For example points in the white region at the top - bordered by the red and blue decision boundaries - are on the positive side of both the red and blue classifiers. The un-colored regions also include the one in the very middle of the plot whose points are on the negative side of all three classifiers.

Let us first determine how to appropriately classify points in those un-colored regions where more than one classifier is positive. We deal with the case where a point ends up on the negative side of all classifiers in the next Subsection. For simplicity, here we only consider the top region that is on the positive side of both the red and blue classifiers (the other regions should behave similarly).

In the Python cell below we show two example points in this region. In each case we plot the original dataset and individual classifiers along with new point in question in black. We also highlight the distance from the new point to both decision boundaries in dashed black, with the projection of the point onto each decision boundary shown as an 'X' in the same color as its respective hyperplane.

In [5]:

Beginning with the point shown on the left, which class should we assign this point to?

Recall, when we discussed logistic regression, that we think of a classifier as being 'more confident' of the class identity of given a point the farther the point lies from the classifier's decision boundary. This is a simple geometric/ probabilistic concept, the bigger a point's distance to the boundary the deeper into one region of a classifier's half-space it lies, and thus we can be much more confident in its class identity than a point closer to the boundary. Another way we can think about it - imagine if we slightly perturbed the decision boundary: those points originally close to its boundary might end up on the other side of the perturbed hyperplane, changing classes, whereas those points farther from the boundary are less likely to be so affected (hence we can be more confident in their class identities to begin with).

With this in mind, which classifier can we say is more confident about this point? The answer is the red one since our input point lies a greater distance from it.

What about the point shown in the right panel? By the same logic it is best assigned to the blue class, being at a greater distance from the blue classifier.

If we repeat this logic for every point in the region - as well as those points in the other two regions where two or more classifiers are positive - and color each point the color of its respective class, we will end up shading each such region as shown by the Python cell below.

In [6]:

And how do those points equidistant to two or more decision boundaries get assigned to a class? In the same way points lying on a two-class decision boundary do: we assign a class label at random. These points form the multi-class decision boundary.

How do we formalize this rule? In these regions we have assigned each point to the class whose boundary is at the largest nonnegative distance from it. How do we compute the distance of an arbitrary point to each of our decision boundaries?

As we saw in Section 6.3, the signed distance of an arbitrary point $\mathbf{x}$ to the $j^{th}$ linear decision boundary is equal to the evaluation at this boundary provided we normalize its parameters by the magnitude of its feature-touching weights (which is also the normal vector to the separating hyperplane). We can express this by using notation that exposes the bias and feature-touching weights of each classifier

\begin{equation} \text{(bias):}\,\, b_j = w_{0,j} \,\,\,\,\,\,\,\, \text{(feature-touching weights):} \,\,\,\,\,\, \boldsymbol{\omega}_j = \begin{bmatrix} w_{1,j} \\ w_{2,j} \\ \vdots \\ w_{N,j} \end{bmatrix}. \end{equation}

With this notation we have that the

\begin{equation} \text{signed distance of $\mathbf{x}$ to $j^{th}$ boundary} = \frac{\mathring{\mathbf{x}}_{\,}^T \mathbf{w}_j }{\left\Vert \boldsymbol{\omega}_{j}^{\,} \right\Vert_2} \end{equation}

Therefore if we normalize the weights of each linear classifier by the length of its normal vector as

\begin{equation} \mathbf{w}_{j}^{\,} \longleftarrow \frac{\mathbf{w}_{j}^{\,}}{\left\Vert \boldsymbol{\omega}_{j}^{\,} \right\Vert_2} \end{equation}

which we will assume from here on, then this distance is simply written as the raw evalaution of the point via the decision boundary as

\begin{equation} \text{signed distance of $\mathbf{x}$ to $j^{th}$ boundary} = \mathring{\mathbf{x}}_{\,}^T \mathbf{w}_j \end{equation}

To assign a point in one of our current regions we seek out the classifier which maximizes this quantity. That is, we have found that - after weight-normalization - we have precisely the same prediction rule we found originally for regions of the space where only a single classifier is positive

\begin{equation} y = \underset{c \,=\, 0,...,C-1}{\text{argmax}} \,\,\,\mathring{\mathbf{x}}_{\,}^T \mathbf{w}_c. \end{equation}

So the same rule above can be used in any region of the space where at least one classifier is positive.

7.2.5 Points on the negative side of all classifiers

Finally, what about the middle blank region of our dataset - where all of our classifiers are negative? How shall we assign class labels to these points. Once again let us examine two points in this region and reason out what should be done.

In [7]: