Lyngby Correspondent e04: gotta go fast

This weekly update covers the last two weeks as I was simply too busy last week. Friend of the show Kajsa was visiting our Lyngby location for a week of focussed writing. The idea was that we have this mostly-finished paper on the ranking of classifiers so let’s just sit down and finish it. “Mostly finished” is a vague description, and I’m not sure we actually made it much more finished than it was. For example there was this quick little simulation study I wanted to do that was neither quick nor little once I started looking at it.

AUC is interpretable, actually

For this simulation—work still in progress—I need to calculate a lot of AUCs. The AUC (or area under the receiver operating characteristic curve for short) used to be a completely inscrutable statistic for me because you involve this story about radars and moving some kind of decision threshold back and forth and etc., etc., etc., etc. It’s nearly impossible to give it a sane explanation.

Or so I used to think. Some years ago I learned that the AUC does have a completely sane interpretation. You’ve got some data that can be classified as either T or F and you have a model that spits out a T-ness score for each data point. A higher score means that it’s more likely that this data point belongs to class T. Could be a probability, but it doesn’t have to be. AUC is the probability that a randomly chosen T will have a higher score than a randomly chosen F. That’s it. End of explanation. No need to invoke any curve, nor threshold, nor hypothetical radar.

How to calculate AUC very quickly

In numerical linear algebra it’s a folk theorem that if you’re inverting a matrix you’re probably doing something wrong. Similarly if you’re calculating an AUC by first building the ROC curve and then integrating you’re probably doing it wrong. I figured this out last week; my journey was something like this:

  1. I start out using pROC::auc in R because this is what I had been doing when I didn’t need actual millions of the things. This is very slow.
  2. A little bit of searching lead me to this SO question where one of the lower-down answers uses an approach from this blog post that’s supposed to be pretty fast compared to pROC::auc, and indeed it is.
  3. But I notice that as a comment to that answer the author of the bigstatsr package says that they have another fast method, and indeed it is a little faster.
  4. I don’t want my son to grow up with me just staring at a progress bar, so I want to get it a little faster still. I do some profiling and I notice that the bigstatsr::AUC function spends almost half its time in sanity checks. This isn’t a problem for the size of data that bigstatsr is made for: the asserts will proportionally take up no time. But for me it matters because I’m only doing like 1000 scores at a time.
  5. Luckily bigstatsr::AUC just calls another function after those asserts. This function is “not exported,” but actually R doesn’t enforce that. You just do bigstatsr:::AUC2 (notice ::: instead of ::) and you access the not-exported function to skip the sanity checks. I think the only way to speed up my simulation now is parallelizing, which is easy.

Both the bigstatsr approach and the one from the SO answer are based on the observation that, as I say, AUC is mathematically equivalent to a probability about ranking (see Hanley and McNeil (1982) for more). If you sort the scores it’s a really fast thing to count pairs of T, F where the score of T is greater than the score of F. You can do it in a single pass. By the same approach you could in fact also build the ROC curve in a single pass, but then you’d still have to do your stupid little integration if AUC was what you wanted.

Backlinks:

Hanley, J A, and B J McNeil. 1982. “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.” Radiology 143 (1): 29–36. https://doi.org/10.1148/radiology.143.1.7063747.
this file last touched 2024.02.14