### Back Story

There’s an interesting thread on the BioNLP mailing list (here’s a link to the publicly readable thread). The thread was kicked off when Andrew Clegg asked:

Suppose I have two precision-recall curves for two different IR algorithms on the same test data and query.

What would be the correct statistical test to determine if they are significantly different?

I’m particularly sympathetic to (part of) Bill Hersh’s second response.

… All statistical inference tells you is how likely your differences are due to chance. It says nothing about whether the difference is important. …

to which Andrew replied

… in my case it’s the other way round, I think we’ve already displayed a clear benefit in terms of usability, I’m just thinking of including significance testing in case a very picky reviewer asks if it’s down to chance.

Can you say CYA?

A host of classical tests was proposed, including the new-to-me Page trend test. Others suggested taking an aggregate statistic like area under the curve and doing a standard difference test.

### ROC Curves vs. PR Curves

There’s much more literature on receiver operating characteristic curves (aka ROC curves) than precision-recall (PR) curves. Just as a refresher, if we take a given number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), precision, recall, sensitivity and specificity are:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Recall = Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)
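As a quick sanity check, here are those four definitions in code (the counts below are made up for illustration):

```python
def metrics(tp, fp, tn, fn):
    """The four basic rates from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)    # aka recall, the true positive rate
    specificity = tn / (tn + fp)    # the true negative rate
    recall = sensitivity
    precision = tp / (tp + fp)
    return precision, recall, sensitivity, specificity

# Made-up counts: 8 TP, 2 FP, 85 TN, 5 FN.
p, r, sens, spec = metrics(tp=8, fp=2, tn=85, fn=5)
print(p, r, sens, spec)   # 0.8, then 8/13 twice, then 85/87
```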

If we have a ranked list of outputs for which we know their gold-standard category (relevant or not in the case of IR), we can consider initial segments of the ranked list and compute the precision/recall at that point. The ROC curve plots 1-specificity (the false positive rate) versus sensitivity (the true positive rate) at various operating points for an algorithm. The PR curve plots precision versus recall.
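That computation over initial segments can be sketched as follows; `relevant` marks the gold-standard label at each rank and `total_relevant` is the number of relevant items in the whole collection (the function name and sample list are mine, not from any particular library):

```python
def pr_curve(relevant, total_relevant):
    """Precision/recall at each initial segment of a ranked list.

    relevant: booleans, True where the item at that rank is relevant.
    total_relevant: number of relevant items in the whole collection.
    """
    points, tp = [], 0
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            tp += 1
        points.append((tp / total_relevant, tp / rank))  # (recall, precision)
    return points

# A ranked list of five outputs; 4 relevant items in the collection.
print(pr_curve([True, False, True, True, False], total_relevant=4))
```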

In PR curves, it is often the case that only “interpolated” results are considered: the precision at each recall level is replaced by the maximum precision at that or any higher recall level, which makes the curve non-increasing in recall by propagating later, higher precision values back. For instance, if precision at 1 recall is 1/2, at 2 recall is 2/3, and at 3 recall is 1/2, then the precision at 1 recall is interpolated to 2/3. (I’m not sure what this does to hypothesis testing.)
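A minimal sketch of that interpolation step, sweeping from the high-recall end and propagating maxima backwards:

```python
def interpolate(precisions):
    """Interpolated precision: replace each value with the maximum of
    itself and every precision at a higher recall (later in the list)."""
    out = list(precisions)
    for i in range(len(out) - 2, -1, -1):   # sweep backwards
        out[i] = max(out[i], out[i + 1])
    return out

# The example from the text: 1/2, 2/3, 1/2 becomes 2/3, 2/3, 1/2.
print(interpolate([1/2, 2/3, 1/2]))
```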

There’s a lot more information and examples in the class documentation for LingPipe’s `classify.ScoredPrecisionRecallEvaluation`.

### The Bayesian Way

It’s in exactly these kinds of situations where a Bayesian approach seems natural.

Just add a prior and compute a posterior over entire PR curves. Then compare the PR curves any way you want. Say by calculating how likely it is for one PR curve to be above the other at every operating point. Or how likely it is for one to be above the other at a given operating point (say 99% recall). Or how likely one is to be substantially better than another, or how much you expect one to be higher than another.

See my earlier post on Bayesian counterpart to Fisher’s exact test on contingency table data.

The nifty part is that “adjustments” for multiple comparisons are built in. So we can ask what’s the posterior probability of one system having at least 1% higher precision than all others at 99% recall.

But once we’re doing multiple comparisons, it makes sense to build a hierarchical model. Often, a system will report results on multiple queries, so there’s a set of precision-recall curves for each system, leading to two levels of grouping.

I provide an example of Bayesian multiple comparisons in a hierarchical model in my earlier blog post on Bayesian batting ability with multiple comparisons.

### Will it Work?

How do I model the posterior PR (or ROC) curve? From that, a regular or interpolated PR (or ROC) curve can be generated if desired.

I’m specifically worried about precision and recall sharing the TP term in their numerators and denominators; ROC curves seem simpler.

I’m also worried about the low counts at the low recall end of the curves.

And about the correlations among the different operating points.

It doesn’t seem that a naive independent binomial for each recall point would be justified, though it’d make the posteriors very clean, especially with uniform priors.

For instance, suppose that after the first 5 true positives in the systems’ ranked lists, we have seen 7, 10, 4, 11, and 21 false positives for the five systems. The maximum likelihood estimates of precision are then 5/12, 5/15, 5/9, 5/16, and 5/26. Can we just use the posteriors under a uniform prior, Beta(6,8), …, Beta(6,22), for comparison? Somehow it just seems wrong.
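Under that (admittedly suspect) independent-Beta assumption, the comparisons reduce to Monte Carlo draws from the posteriors. A sketch using only the standard library, with Beta(tp + 1, fp + 1) as each system’s precision posterior under a uniform prior; the sample size and seed are arbitrary:

```python
import random

random.seed(42)

# One posterior per system's precision at this operating point:
# Beta(tp + 1, fp + 1), with tp = 5 throughout and fp = 7, 10, 4, 11, 21.
posteriors = [(6, 8), (6, 11), (6, 5), (6, 12), (6, 22)]

def draw():
    """One joint draw of the five precisions (assumed independent)."""
    return [random.betavariate(a, b) for a, b in posteriors]

# Posterior probability that system 3 (fewest false positives) has the
# highest precision of all five -- a multiple comparison with no
# post-hoc adjustment needed.
n = 100_000
wins = 0
for _ in range(n):
    d = draw()
    wins += d[2] == max(d)
frac = wins / n
print(frac)
```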

### Any Ideas?

Does anyone know how to compute PR or ROC posteriors appropriately?

### Does Anyone Have Data?

Andrew, or anyone: if you have the ranked data and gold standard answers, I’d love to code this up in R/BUGS and see what it looks like in the context of data someone cares about.

January 29, 2010 at 4:49 pm

Significance tests on PR curves seem wicked hard, given the nonlinear dependence on number of true positives. ROC seems easier but not easy. (As a side note, one thing that people often forget is that precision is effectively lower bounded by generality, i.e. the proportion of positive examples.)

I’m also sympathetic to Bill Hersh’s comment. Significance tests are vastly overused in the traditional sciences, and it’s not really a disease I’d care for the information sciences to contract.

January 29, 2010 at 5:19 pm

There seem to be two different questions asked here. The first is whether the two curves are significantly *different*. The second is whether one curve is significantly *better* than another.

They’re different questions because one curve can be higher on the left than another curve and lower on the right. The two curves are then different, but it’s a subjective question whether one is better than the other.

The first question also seems easier, because you “just” want to model whether the two curves might plausibly have been generated from the same system.

For the second question I’d probably want to integrate the PR function times a utility function that tells you the desired tradeoff between P & R. As Dave says though, carrying a significance test through that process seems horribly messy.

January 30, 2010 at 6:13 am

Hi Bob, I was hoping you would jump in at some point.

A bit of background on the system: It’s not actually a document IR system, it’s a set of algorithms (and ensembles) for selecting and ranking relevant proteins from a much larger set of candidates, using various evidence sources — gene expression profiles, evolutionary relatedness via domain fusion, MEDLINE co-occurrence, semantic similarity of GO annotations, yadda yadda.

Dave’s comment about the ‘traditional sciences’ is bang on, as this is going to a bioinformatics journal (formerly a trad molecular biology one), and yes, there is a certain element of ass-covering. However the deadline looms so the significance testing will probably have to wait until a reviewer requests it.

In the meantime, it’s just become an interesting intellectual challenge.

Ken, you’re right about the difference between significantly different vs. better, I was thinking of the former really. I think any reviewer who insisted on the latter would be making too many assumptions about what ‘better’ means, since as you point out, curves cross, and part of our motivation for presenting various ensembles is that different ones emphasize precision or recall.

Bob, can you do any informative hacking just based on the scores, or would you need the actual GS and predictions? If you need the raw data, I’ll need to ‘anonymize’ it, as our collaborators who assembled the GS by hand don’t want it published until their own paper is out.

February 1, 2010 at 1:34 pm

I’d think an evaluation based on PR curves should be run just from the ranked results of each system and a notion of which results are “relevant”. Of course, you can compute document ranks from document scores.

What’s a “GS”?

I’m in no hurry; I know the biologists are close with their data and wouldn’t want to step on anyone’s toes.

February 1, 2010 at 1:46 pm

Err, Gold Standard, sorry.

Okay, I’ll punt the raw results to you when we have clearance to share it. Thanks — always glad to have a new angle, and I don’t think Bayesianly very much myself.

February 1, 2010 at 3:42 pm

As I said below, I’m now thinking a bootstrap analysis would be simpler. But you’d still need a statistic to measure. I really like precision at 99% recall, but that’s just us. Recall at high precision is also interesting. In our experience, apps tend to be heavily recall or precision focused.

February 1, 2010 at 10:58 pm

Had to go and read up on bootstrapping; it’s been years since I covered sampling and the like in a Masters course. But yes, I can see the appeal of that.

I’d been using AUC of the P-R curve as the point statistic for comparison, as most of the algorithms predict conservatively and don’t actually reach 99% recall…

January 30, 2010 at 10:50 pm

ROC is a model comparison that turns probability into a rank. Not sure I buy it. If you feed probabilities into a decision theoretic utility, it will give you the optimal threshold automatically.

Now, if you insist on doing significance on ROC curves, why not do a bunch of 2-fold cross validations over a number of permutations of the indices, and paint the ROC for each one of them? If you run 100 cross validations, you will have 100 ROC curves, and if you try two models on the same split, you will be even able to compare them. I’ve done such things in my PhD.
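A sketch of that repeated-split idea. The data and “model” here are toys of my own invention (the score is just the raw feature, with no training step); for a real system you’d retrain on each training fold before scoring the held-out fold, but the loop structure, one ROC curve per replication, is the point:

```python
import random

def roc_points(scored):
    """ROC curve from (score, label) pairs: sort by descending score and
    sweep the threshold down, tracking false and true positive rates."""
    scored = sorted(scored, reverse=True)
    pos = sum(1 for _, y in scored if y)
    neg = len(scored) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, y in scored:
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

random.seed(0)
data = ([(random.gauss(1.0, 1.0), True) for _ in range(50)] +
        [(random.gauss(0.0, 1.0), False) for _ in range(50)])

# 100 random splits; score the held-out half each time. With a trainable
# model you'd fit on data[50:] before scoring data[:50].
curves = []
for _ in range(100):
    random.shuffle(data)
    curves.append(roc_points(data[:50]))
```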

February 1, 2010 at 1:30 pm

Both great ideas with big “BUT”s for Andrew’s proposed application of running an information retrieval-like evaluation.

1. Not everyone uses a probabilistic system,

2. Many IR systems don’t have an estimation/training phase, and

3. When running a retrospective evaluation, you typically just have ranked responses to queries.

Having said all that, it occurs to me from Aleks’s suggestions you might be able to use a bootstrap analysis.
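For concreteness, here’s one way such a bootstrap might be set up: resample rank positions with replacement (jointly for both systems, which is only one of several reasonable resampling schemes), recompute a summary statistic on each replicate, and read an interval off the replicates. Everything below, including the statistic (area under the uninterpolated PR curve, computed as average precision), the data, and all names, is invented for illustration:

```python
import random

def auc_pr(relevant):
    """Area under the uninterpolated PR curve of one ranked list,
    computed as average precision over the relevant items."""
    tp, precisions = 0, []
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

random.seed(7)
# Invented gold-standard judgments down two systems' ranked lists.
sys_a = [True, True, False, True, False, False, True, False, False, False]
sys_b = [True, False, True, False, False, True, False, False, True, False]

# Bootstrap the difference in AUC-PR by resampling rank positions.
diffs = []
for _ in range(10_000):
    idx = sorted(random.randrange(len(sys_a)) for _ in range(len(sys_a)))
    diffs.append(auc_pr([sys_a[i] for i in idx]) -
                 auc_pr([sys_b[i] for i in idx]))

diffs.sort()
lo, hi = diffs[250], diffs[9750]   # 95% percentile interval
print(lo, hi)
```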

February 1, 2010 at 3:51 pm

Many people are fond of aROC (or area under the ROC curve) as a statistic. I still prefer to just throw in my utilities and go decision-theoretic.

February 4, 2010 at 12:51 pm

Just go for the bootstrap… so much easier!