Monday, April 5, 2010

which2 : a stochastic anytime rule learner

(Code: http://unbox.org/wisp/tags/which2/0.1/which2.awk.)

WHICH2 builds rules by ranking ideas, then repeatedly building new ideas by picking and combining two old ideas (favoring those with higher ranks). New ideas generated in this way are ranked and thrown back into the same pot as the old ideas so, if they are any good, they might be picked and extended in subsequent rounds. Alternatively, if the new idea stinks, it gets buried by the better ideas and is ignore.

One important aspect of the following is that the scoring routines for ideas are completely seperate from the rest of the code (see the "score1" function). Hence, it is a simple matter to try our different search biases.
For example, here's the current score routine:
function score(i,rows,data, 
            cols,row, col,a,b,c,d,triggered,pd,pf,
prec,acc,supports,s,fits) 
{
    a=b=c=d=Pinch # stop divide by zero errors
    cols=Name[0]
    for(row=1;row<=rows;row++) {
        triggered = matched(row,cols,data,i)
        if (data[row,cols]) {
            if (triggered) {d++} else {b++}
        } else {
            if (triggered) {c++} else {a++}
        }
    } 
        fits    = c + d
    pd      = d/(b+d)
    pf      = a/(a+c)
    prec    = d/(c+d)
    acc     = (a+d)/(a+b+c+d)
    support = (c+d)/(a+b+c+d)
    return score1(pd,pf,prec,acc,support,fits)
}
function score1(pd,pf,prec,acc,supportm,fits) {
        if (fits <= Overfitted)
                return 0
    if (Eval==1) return acc
    if (Eval==2) return 2 * pd * prec/(pd+prec)
    if (Eval==3) return 2 * pd * pf/(pd+pf)
    if (Eval==4) return support * 2 * pd * pf/(pd+pf)
    return support * 2 * pd * prec/(pd+prec)
}
Various students (ZachM and Joe) have implemented this before and found it surprisingly slow.  For example, in the above, note that scoring one rule means running it over the entire data set. Which2 therefore uses two speed up tricks: 
  • Rather that sorting new ideas straight away into the old ones, it generates (say) 20 new ones at each round before sorting them back into the old.
  • Which2 keeps a cache of how ideas scored before. If we re-score an old idea, we just return that score (no need to recompute it).
The result is pretty zip: under a second for the 2000+ records of titanic.arff. e.g. This call produces the following output.

gawk -f which2.awk weather.nominal.arff 

In the following, the "candidates" are ideas that look promising and "score" ranks the candidates. If the max score does not improve from the last round, then "lives" decreases.

Each round tries random combinations of the stuff from prior rounds (favoring those things with higher scores). Hence, at round 1, all the candidates are singletons. But. later on (see line 54) the candidates can grow to combinations of things.

Each round prunes the candiates so that only the better candidates survive to round+1.
The final output are the candidates of the last round. In this example, the best rule is tempareture = cool or mild.


----------------------------------------------------
% class: yes seed: 1 round: 1 max: 0 lives: 5

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.072352013,yes,temperature=mild,windy=FALSE].
[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.22876845,yes,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 2 max: 0.52640158 lives: 5

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,outlook=overcast,temperature=cool
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 3 max: 0.52640158 lives: 4

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 4 max: 0.52640158 lives: 3

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 5 max: 0.52640158 lives: 2

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast,temperature=cool
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 6 max: 0.52640158 lives: 1

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 7 max: 0.52640158 lives: 0

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

This is not the usual solution to weather but note how it covers most of the examples. The reason for this selection is that the scoring function (shown above) has a support term. Hence, this run favors things with high support.

1 comment:

  1. FYI: converting the above to a rule learner that tries to select for maximal continuous classes took about 30 minutes (much of the code is the same).

    see http://unbox.org/wisp/var/timm/10/which2/which2n.awk

    ReplyDelete