The standard story:
There are many problems with the standard story, especially when comparing (say) 20 different configurations of a learner:
- Say each test has a confidence of 0.95.
- Then the 20 tests taken together have a confidence of only 0.95^20 = 0.36.
- The complexity of the test makes me lose confidence in my ability to understand the test
- For the parametric tests, too many parametric assumptions.
- The blurred telescope effect (Friedman-Nemenyi)
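
The arithmetic behind the confidence figures above can be checked with a short sketch. It assumes the tests are independent, so their confidences multiply; the function name `net_confidence` is hypothetical and used only for illustration.

```python
def net_confidence(per_test_confidence: float, n_tests: int) -> float:
    """Confidence that all n_tests hold at once, assuming independent tests."""
    return per_test_confidence ** n_tests


print(round(net_confidence(0.95, 20), 2))  # 0.36
```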
A12
The Vargha and Delaney A12 statistic is a non-parametric effect size measure.
Given a performance measure M seen in m measures of X and n measures of Y, the A12 statistic measures the probability that running algorithm X yields higher M values than running another algorithm Y.
A12 = #(X > Y)/mn + 0.5*#(X=Y)/mn
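
The formula above translates almost directly into code. The following is a minimal sketch, not a reference implementation: it counts every (x, y) pair explicitly, which costs O(mn) comparisons, and the function name `a12` is chosen here for illustration.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: chance that a value from xs exceeds one from ys.

    Ties count as half a win, as in the formula
    A12 = #(X > Y)/mn + 0.5 * #(X = Y)/mn.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = ties = 0
    for x in xs:
        for y in ys:
            if x > y:
                greater += 1
            elif x == y:
                ties += 1
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


print(a12([5, 6, 7], [1, 2, 3]))  # 1.0: every x beats every y
print(a12([1, 2, 3], [1, 2, 3]))  # 0.5: identical samples, no effect
```

A value near 0.5 means the two algorithms are indistinguishable on M; values near 1 or 0 mean X is almost always higher or almost always lower than Y.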
- A. Vargha and H. D. Delaney. "A critique and improvement of the CL common language effect size statistics of McGraw and Wong". Journal of Educational and Behavioral Statistics, 25(2):101-132, 2000.
rA12
A12 studies two treatments. rA12 handles multiple treatments using a Scott Knott procedure; i.e. it divides a list of treatments into two lists L1 and L2 by finding the division that maximizes the expected value of the sum of squares of the mean differences before and after the division; i.e.
L = union(L1,L2)
|L1| * abs(L1.mu - L.mu)^2 + |L2| * abs(L2.mu - L.mu)^2
rA12 calls itself recursively on the two lists, terminating when no further useful division can be found (where "useful" is checked by the A12 procedure).
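
The recursive procedure described above might be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: treatments are sorted by their mean, |L1| and |L2| are taken as the number of pooled measurements, a split counts as "useful" when A12 between the two halves differs from 0.5 by at least `threshold - 0.5`, and the default `threshold=0.6` is an arbitrary illustrative value. The names `ra12`, `best_split` and `pooled` are hypothetical.

```python
from statistics import mean


def a12(xs, ys):
    """A12 = #(X > Y)/mn + 0.5 * #(X = Y)/mn."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def pooled(group):
    """Flatten the measurements of every (name, values) pair in a group."""
    return [v for _, values in group for v in values]


def best_split(group):
    """Cut index maximizing |L1|*|L1.mu - L.mu|^2 + |L2|*|L2.mu - L.mu|^2.

    Returns None when the group holds fewer than two treatments.
    """
    all_mu = mean(pooled(group))
    best_score, best_cut = -1.0, None
    for cut in range(1, len(group)):
        left, right = pooled(group[:cut]), pooled(group[cut:])
        score = (len(left) * abs(mean(left) - all_mu) ** 2
                 + len(right) * abs(mean(right) - all_mu) ** 2)
        if score > best_score:
            best_score, best_cut = score, cut
    return best_cut


def ra12(treatments, threshold=0.6):
    """Map each treatment name to a rank; equal ranks were never separated."""
    # Assumption: order treatments by mean before looking for splits.
    ordered = sorted(treatments.items(), key=lambda kv: mean(kv[1]))
    ranks = {}

    def recurse(group, rank):
        cut = best_split(group)
        if cut is not None:
            left, right = pooled(group[:cut]), pooled(group[cut:])
            # The split is "useful" only if A12 sees a large enough effect.
            if abs(a12(right, left) - 0.5) >= threshold - 0.5:
                rank = recurse(group[:cut], rank) + 1
                return recurse(group[cut:], rank)
        for name, _ in group:  # no useful split: everyone shares one rank
            ranks[name] = rank
        return rank

    recurse(ordered, 1)
    return ranks


print(ra12({"a": [1, 2, 3, 2, 1],
            "b": [1, 2, 3, 2, 2],
            "c": [9, 10, 11, 10, 9]}))
```

In this example `a` and `b` overlap heavily (A12 of 0.58, below the illustrative threshold), so they share rank 1, while `c` is clearly separated and receives rank 2.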
For more on the Scott Knott procedure (where ANOVA is used to define useful), see
- N. Mittas and L. Angelis, "Ranking and clustering software cost estimation models through a multiple comparisons algorithm" IEEE Transactions on Software Engineering (pre-print), 2012.