## Many problems with the standard story, especially when comparing (say) 20 different configurations of a learner.

• Say each test has a confidence of 0.95.
• Across 20 such tests, the net confidence is 0.95^20 = 0.36 (assuming the tests are independent).
• The complexity of the testing procedure makes me lose confidence in my ability to understand what it reports.
• The parametric tests make too many parametric assumptions.
• The blurred telescope effect (Friedman-Nemenyi)
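
The arithmetic behind the second bullet can be checked with a minimal sketch. The function name `family_confidence` is hypothetical, and the calculation assumes the individual tests are independent, which the notes do not state explicitly.

```python
def family_confidence(per_test=0.95, n_tests=20):
    """Confidence that all n_tests hold at once, assuming independent tests."""
    return per_test ** n_tests


if __name__ == "__main__":
    # Net confidence shrinks quickly as more tests are chained together.
    for n in (1, 5, 10, 20):
        print(n, round(family_confidence(0.95, n), 2))
```

With 20 tests the net confidence falls to roughly 0.36, so a claim built on all 20 comparisons is more likely wrong than right.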

## A12

The Vargha and Delaney A12 statistic is a non-parametric effect size measure.

Given a performance measure M observed in m measurements of X and n measurements of Y, the A12 statistic estimates the probability that running algorithm X yields higher M values than running another algorithm Y.

A12 = #(X > Y)/(m*n) + 0.5 * #(X = Y)/(m*n)
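
As a minimal sketch, the formula can be computed by counting over all m*n pairs. The function name `a12` and the brute-force pairwise loop are illustrative choices, not taken from the original notes; faster rank-based versions exist.

```python
def a12(xs, ys):
    """Estimate the probability that a value from xs exceeds a value from ys.

    Ties count as half a win, following the A12 formula.
    """
    greater = sum(1 for x in xs for y in ys if x > y)   # #(X > Y)
    equal = sum(1 for x in xs for y in ys if x == y)    # #(X = Y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


if __name__ == "__main__":
    print(a12([4, 5, 6], [1, 2, 3]))  # X always wins
    print(a12([1, 2, 3], [1, 2, 3]))  # identical samples
```

A value of 0.5 means neither algorithm tends to win; values near 1 or 0 mean X almost always yields higher or lower values than Y.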

## rA12

A12 compares two treatments. rA12 handles multiple treatments using a Scott Knott procedure; i.e. divide a list of treatments L into two lists L1 and L2 by finding the division that maximizes the expected value of the sum of squares of the differences between each sub-list's mean and the mean of the undivided list; i.e.

L = union(L1,L2)
|L1|  *  abs(L1.mu - L.mu)^2 + |L2|  *  abs(L2.mu - L.mu)^2

rA12 calls itself recursively on the two lists, terminating when no further _useful_ division can be found (where _useful_ is checked by the A12 procedure).
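
The recursion can be sketched as follows. This is an illustration under several assumptions that are not in the original notes: treatments are sorted by their mean, |L1| and |L2| count pooled measurements rather than treatments, and a division is _useful_ when the A12 score between the two halves is at least a hypothetical `threshold` of 0.6 away from parity. The names `ra12`, `best_split`, and `pool` are also hypothetical.

```python
from statistics import mean


def a12(xs, ys):
    """Probability that a value from xs exceeds one from ys (ties count half)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


def pool(group):
    """Flatten the measurements of several (name, values) treatments."""
    return [v for _, values in group for v in values]


def best_split(group):
    """Cut index maximizing |L1|*(L1.mu - L.mu)^2 + |L2|*(L2.mu - L.mu)^2."""
    mu = mean(pool(group))
    best_score, best_cut = -1.0, None
    for cut in range(1, len(group)):
        left, right = pool(group[:cut]), pool(group[cut:])
        score = (len(left) * (mean(left) - mu) ** 2
                 + len(right) * (mean(right) - mu) ** 2)
        if score > best_score:
            best_score, best_cut = score, cut
    return best_cut


def ra12(treatments, threshold=0.6):
    """Group treatments (dict: name -> measurements) into ranked clusters."""
    ordered = sorted(treatments.items(), key=lambda kv: mean(kv[1]))
    ranks = []

    def recurse(group):
        cut = best_split(group) if len(group) > 1 else None
        if cut is not None:
            left, right = group[:cut], group[cut:]
            effect = a12(pool(right), pool(left))
            # Assumed test of "useful": the halves differ by a large enough effect.
            if max(effect, 1 - effect) >= threshold:
                recurse(left)
                recurse(right)
                return
        ranks.append([name for name, _ in group])  # no useful split: one rank

    recurse(ordered)
    return ranks


if __name__ == "__main__":
    data = {"a": [1, 2, 3], "b": [1, 2, 3.1],
            "c": [10, 11, 12], "d": [10, 11, 12.5]}
    print(ra12(data))
```

Treatments that land in the same inner list are not distinguishable by the A12 check, so they share a rank.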

For more on the Scott Knott procedure (where ANOVA is used to define _useful_), see

• N. Mittas and L. Angelis, "Ranking and clustering software cost estimation models through a multiple comparisons algorithm," IEEE Transactions on Software Engineering (pre-print), 2012.