6.9 Weighted Two-Class Classification*




* The following is part of an early draft of the second edition of Machine Learning Refined. The published text (with revised material) is now available on Amazon as well as other major book retailers. Instructors may request an examination copy from Cambridge University Press.

Because our two-class classification cost functions are summable over individual points we can - as we did with regression in Section 5.5 - weight individual points in order to emphasize or de-emphasize their importance to a classification model. This is called weighted classification. This idea is often employed when dealing with highly imbalanced two-class datasets, as we discuss here.


Weighted two-class classification

Just as we saw with regression in Section 5.5, weighted classification cost functions naturally arise due to repeated points in a dataset. When collecting metadata (e.g., census data) it is not uncommon to collect duplicate entries - multiple people can share similar or identical statistics on a given survey.

Below we take a standard census dataset and plot a subset of it along a single input feature. With only one feature taken into account we end up with multiple entries of the same datapoint, which we show visually via the radius of each point (the more times a given datapoint appears in the dataset the larger we make its radius). These datapoints should not be thrown away - they did not arise due to some error in data collection or storage - they represent the true dataset.
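As a rough stand-in for that plot, the short sketch below counts duplicate entries along a single feature column using numpy; each count is the radius described above and, as we will see next, also the natural weight of the corresponding point. The feature values here are made up purely for illustration.

```python
import numpy as np

# a hypothetical 1-D feature column (e.g., 'age') drawn from a census-style dataset
x = np.array([38., 50., 38., 53., 28., 38., 50., 28.])

# collapse duplicate entries: each unique value keeps a count R of how many
# times it appears, which serves as both its plotted radius and its weight
x_unique, counts = np.unique(x, return_counts=True)
for value, R in zip(x_unique, counts):
    print(f"x = {value}: appears {R} time(s)")
```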


Just as with a regression cost, if we examine any two-class classification cost it will 'collapse', with summands containing identical points combining naturally. One can see this by performing the same kind of simple exercise used in Section 5.5 to illustrate this fact for regression. This leads to the notion of weighted two-class cost functions like, e.g., the weighted softmax below (written using our general model notation)

\begin{equation} g\left(\mathbf{w}\right) = \sum_{p=1}^P\beta_p\,\text{log}\left(1 + e^{-y_p\text{model}\left(x_p,\mathbf{w}\right)}\right). \end{equation}

Here the values $\beta_1,\,\beta_2,\,...,\,\beta_P$ are fixed point-wise weights. That is, a unique point $\left(x_p,\,y_p\right)$ in the dataset has weight $\beta_p = 1$ whereas if this point is repeated $R$ times in the dataset then one instance of it will have weight $\beta_p = R$ while the others have weight $\beta_p = 0$.

Since these weights are fixed (i.e., unlike $\mathbf{w}$, they are not parameters that need to be tuned) we can minimize a weighted classification cost precisely as we would any other, e.g., via a local optimization scheme like gradient descent or Newton's method.
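As a minimal sketch of this, the code below implements the weighted softmax cost over a simple linear model and minimizes it with gradient descent, using the autograd library to compute gradients automatically. The linear model, the step length alpha, and the data arrays x, y, and beta are all assumptions made purely for illustration.

```python
import autograd.numpy as np
from autograd import grad

# an assumed linear model: model(x, w) = w_0 + w_1 * x (any differentiable
# model could be substituted here)
def model(x, w):
    return w[0] + w[1] * x

# weighted softmax cost: g(w) = sum_p beta_p * log(1 + exp(-y_p * model(x_p, w)))
def weighted_softmax(w, x, y, beta):
    return np.sum(beta * np.log(1 + np.exp(-y * model(x, w))))

# standard gradient descent: the weights beta stay fixed, only w is updated
def gradient_descent(cost, w, x, y, beta, alpha=0.1, max_its=1000):
    cost_grad = grad(cost)   # gradient of the cost with respect to w
    for _ in range(max_its):
        w = w - alpha * cost_grad(w, x, y, beta)
    return w
```

With data arrays x, y and a weight vector beta in hand, a call like gradient_descent(weighted_softmax, np.array([0.0, 0.0]), x, y, beta) minimizes the weighted cost exactly as it would the unweighted one.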


Just as with regression, we can also think of assigning these fixed weight values to points ourselves based on our 'confidence' in the legitimacy of a datapoint. If we believe that a point is very trustworthy we can set its corresponding weight $\beta_p$ closer to $1$, and the more untrustworthy we find a point the smaller we set $\beta_p$ within the range $0 \leq \beta_p \leq 1$, where $\beta_p = 0$ implies we do not trust the point at all. In making these weight selections we of course determine how important each datapoint is in the training of the model.

Dealing with imbalanced datasets via weighted classification

Weighted classification - in the manner detailed above - is often used to deal with imbalanced datasets. These are datasets which contain far more examples of one class than the other. With such datasets it is often easy to achieve high accuracy even while misclassifying most or all of the points in the smaller class. For example, if a two-class dataset consisted of $90\%$ points with label value $-1$ and $10\%$ points with label value $+1$, then simply assigning every datapoint to the $-1$ class would provide $90\%$ accuracy.
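This baseline is easy to verify numerically; the toy labels below are fabricated purely for illustration.

```python
import numpy as np

# a toy imbalanced dataset: 90 points labeled -1 and 10 points labeled +1
y = np.hstack((-np.ones(90), np.ones(10)))

# a trivial 'classifier' that assigns every point to the majority class -1
predictions = -np.ones(100)

# accuracy is 90% even though every member of the +1 class is misclassified
print(np.mean(predictions == y))   # prints 0.9
```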

This idea of 'sacrificing' members of the smaller class by misclassifying them (instead of members from the majority class) is - depending on the application - not very desirable. For example

  • if the classification determines whether or not someone has a particularly rare but deadly disease that requires further examination, one would likely rather misclassify a healthy individual (and give them further testing) than miss someone with the disease
  • if the classification determines whether or not a particular financial transaction was fraudulent, one would likely rather misclassify a standard transaction (to review further or alert a customer) than miss an actually fraudulent transaction

One way of ameliorating this issue is to use a weighted classification cost to alter the behavior of the learned classifier so that it weights points in the smaller class more heavily, and points in the larger class less heavily.

In order to produce this outcome it is common to assign such weights inversely proportional to the number of members of each class. That is, if we denote by $\Omega_{+1}$ and $\Omega_{-1}$ the index sets for the points in classes $+1$ and $-1$ respectively, then first note that $P = \left\vert \Omega_{+1} \right\vert + \left\vert \Omega_{-1} \right\vert$. Then, denoting by $\beta_{+1}$ and $\beta_{-1}$ the weight for each member of class $+1$ and $-1$ respectively, we can set these class-wise weights inversely proportional to the number of points in each class as

\begin{equation} \begin{aligned} \beta_{+1} &= \frac{1}{\left\vert \Omega_{+1} \right\vert} \\ \beta_{-1} &= \frac{1}{\left\vert \Omega_{-1} \right\vert}. \end{aligned} \end{equation}
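A brief sketch of this weighting scheme, assuming the labels are stored in a numpy array y with entries in {-1, +1}:

```python
import numpy as np

def balanced_weights(y):
    # class-wise weights inversely proportional to each class's size
    beta_plus = 1.0 / np.sum(y == +1)    # beta_{+1} = 1 / |Omega_{+1}|
    beta_minus = 1.0 / np.sum(y == -1)   # beta_{-1} = 1 / |Omega_{-1}|
    # assign each point the weight of its class
    return np.where(y == +1, beta_plus, beta_minus)
```

The resulting vector can then be passed as the point-wise weights (the beta argument) of a weighted cost like the one sketched earlier.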

Below, the weight on the minority class is increased as you move the animation slider from left to right, with the resulting classification shown at each step (the point size of the minority red class is also increased for visualization purposes). As you increase the weighting on the minority class members, you incentivize the learned classifier to classify these points correctly.
