# Don't use AUC for model selection and definitely don't use accuracy

It’s pretty hard to build a predictive model, let’s be honest. My previous post was about the strange practice of under-/oversampling. We impose this on ourselves because we want to do yes/no classification instead of predicting probability. This post is on the same theme. It has a very clear message: if we use accuracy-like model performance measures (looking at you, AUC, F1, etc.) we reduce our ability to select the best model.

I often do little simulations to try to make sense of a confusing world. Below I present some pretty convincing evidence for the case of dropping classification in favor of prediction. At least in the model selection phase.

# A simulation model

I will make a simulation model based on the one in my undersampling post.
Once again the log odds of class `True`

is \( \textrm{logit}(p) = -9 +6x, \) but I
now sample \(X \sim N(1.5, 1)\). This is different from last because I no longer want
severe class imbalance. I will also add an uninformative noise variable
\(X_{\epsilon} \sim N(0, 2)\) post hoc. The below function generates a data set
to these specifications.

This suggests two logistic regression models:

- The true model, \( \textrm{logit}(p) = \beta_0 + \beta_1x + \epsilon; \)
- A clearly worse model, \( \textrm{logit}(p) = \beta_0 + \beta_1x + \beta_2x_{\epsilon} + \epsilon. \)

Imagine that we didn’t know the truth. Our goal is to choose between the two models: which one predicts better?

# Three performance metrics

I will compare three metrics for model performance on held-out data.

- Accuracy is
the fraction of times we got the classification right. For this metric I will adopt the
standard decision rule of classifying as
`True`

when \(\hat p > .5\). - The area under the ROC
curve is a strange measure that luckliy directly corresponds to the
probability of ranking a
`True`

higher than a`False`

.

So both of these measures have a straight-forward interpretation, which is nice. But since the main thesis of this post is that you shouldn’t use them, let’s look at a third metric:

- Brier score is the mean squared error between estimated probabilities and the true probabilities (the true probabilities are either unity or zero). It measures the calibration of your probability estimates, but has no simple interpretation.

# An experiment

I will simulate the scenario where we compare the two methods by five-fold cross validation to get a score for each. I want to estimate the statistical power of each metric: the probability that it gives the true model a better score.

Below I run the experiment 10000 times for each of the three performance metrics. Beware that if you run this yourself it’ll take a while! Afterward I go over the scores of the two models and check whether the One True Model scored better. I use this below to estimate power.

# Results

The plot above shows the convergence of our three simulations toward the final estimates. Showing just the numbers isn’t social media friendly. The results are unequivocal. In this simulation, using Brier score gives you an extra \(10 \% \) power to detect the right model over AUC! Accuracy is worse still: it is little better than making a monkey decide.

# Why is the Brier score better?

To get accuracy we impose a (more or less) random threshold on the prediction.
Many would argue that this is a problem in itself. As a score it has the problem
that a miniscule improvement in model can lead to any size improvement in
accuracy. Consider the following: among \(n\) observations to classify, we
gave a `True`

the probability of \(.5\). Mistake. Classified as `False`

. If we
improve the model a very little so that the probability is now \(.5001\) we
get a discrete jump of \(1/n\) in accuracy! We’ve crossed the magical barrier.
Any further improvement in predicted probability makes no difference. The jump
from \(.5\) to \(.5001\) is as good as a jump from \(.5\) to \(1\).
Similarly, a model that predicts a probability of \(.5\) for all `False`

s and
\(.5001\) for all `True`

s has perfect accuracy. It’s as good as a model that
predicted zero for all `False`

s and unity for all `True`

s. Clearly that’s
nonsense. Similar reasoning applies to AUC, which is just accuracy stretched
out, or indeed anything based on counting “correct” classifications. The Brier
score is a proper scoring
rule. Any change
in predictions results in a proportional change in Brier score. Hence the Brier
score is a more sensitive and sensible instrument. It reflects the actual
improvement in model.

# Full disclosure

Recently I ran days, maybe weeks, worth of resampling validation on our local supercomputer. All based on AUC score. I have yet to rerun those experiments; maybe I should follow my own advice.