Tuesday, January 26, 2010

1987 Evett Fiber Model

;; Requires Incanter's matrix operators (nrow, plus, div, minus, pow,
;; mult, mmult, trans, det).
(use '(incanter core))

;; Numerator of the Evett (1987) fiber-model likelihood for measurement y,
;; given the control data x (m rows; the code assumes two feature columns).
(defn numerator [y x]
  (let [m     (nrow x)
        fnc1  (fn [t u]
                (- (/ (* (Math/exp (- u)) (- t 1) (Math/pow u (- t 2)))
                      (- t 2))
                   (* (- (Math/exp (- u))) (Math/pow u (- t 1)))))
        fnc2  (fn []
                (/ (* (- m 1) (+ m 1)) m))
        ;; column means of the control data
        smean (fn [x]
                (div (reduce plus x) m))
        ;; diagonal covariance matrix and its diagonal-swapped companion
        Sx    (fn [x]
                (let [vals   (div (reduce plus
                                          (matrix (map #(pow (minus % (smean x)) 2) x)))
                                  m)
                      comat  (matrix [[(first vals) 0] [0 (second vals)]])
                      icomat (matrix [[(second vals) 0] [0 (first vals)]])]
                  [comat icomat]))]
    (/ (/ (fnc1 (/ m 2) (first y))
          (* Math/PI (fnc1 (/ (- m 2) 2) (first y))))
       (mult (pow (det (mult (fnc2) (first (Sx x)))) 0.5)
             (Math/pow (+ 1 (mmult (mmult (mult (minus y (smean x)) (fnc2))
                                          (second (Sx x)))
                                   (trans (minus y (smean x)))))
                       (/ m 2))))))

;; Denominator: mean Normal(y; zi, lamb^2) density over the population sample z.
(defn denominator [y z lamb]
  (let [Normal-zlamb (fn [zi]
                       (* (/ 1 (Math/sqrt (* 2 Math/PI (Math/pow lamb 2))))
                          (Math/exp (- (/ (mmult (minus y zi) (trans (minus y zi)))
                                          (* 2 (Math/pow lamb 2)))))))
        SumZ (/ (apply + (map #(Normal-zlamb %) z)) (count z))]
    SumZ))

(defn likelihood [x y z lamb]
  (/ (numerator y x) (denominator y z lamb)))

Why "W" Can't Make Up Its Mind

For the past few weeks, I've been having trouble generating consistent constrained queries in our CBR rig, "W". For half the datasets, we'd generate recommendations that wouldn't map back to the test set, preventing further analysis. The following chart should shed some light on the problem:

For the longest time the concern was with the size of our datasets, given that we weren't returning historical results. Upon closer inspection, the datasets giving us trouble contain a very large number of distinct attribute values, and currently "W" only learns single-value constraints. This explains why constraining our query with learned data returns no results.

Solution: discretize the data, or build a better matching system.
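One way to picture the first fix: bin each continuous attribute into a small number of ranges, so that a single learned value covers many raw values. Here is a minimal equal-width discretization sketch in Clojure; the function and bin count are illustrative and not part of "W".

(defn discretize
  "Map each numeric value in xs onto one of n-bins integer labels,
   so a single-value constraint on a label matches a range of raw values."
  [xs n-bins]
  (let [lo    (apply min xs)
        hi    (apply max xs)
        width (/ (- hi lo) n-bins)]
    (map (fn [x]
           (if (= hi lo)
             0
             (min (dec n-bins) (int (/ (- x lo) width)))))
         xs)))

;; e.g. (discretize [0.1 0.4 0.35 2.9 3.1] 3) => (0 0 0 2 2)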

A Simple Approach to RDR

Gaines and Compton [1] explain Ripple Down Rules pretty simply by analyzing Compton's reconstruction of the GARVAN ES-1 diagnostic system. GARVAN ES-1 was one of the world's first medical expert systems to be put into routine use; it interpreted medical reports of thyroid hormone measurements in blood samples.

The idea behind RDR is to combine a dynamic, incremental rule-based hierarchy with human experts. Given data and domain-specific knowledge, the knowledge-based system can make assumptions about a space. The human's role is to repair broken rules created by the system or to provide further insight to the system itself.



Take a look at the GARVAN RDR tree supplied by Gaines and Compton. The system begins at a root node where the first rule (0.00) diagnoses that there isn't an instance of thyroid disease. Obviously this rule is always true, so, in decision-tree fashion, we follow the 'if-true' link to the next rule, 1.46, which attempts to make a better diagnosis. Based on the pass-fail result of the diagnosis, as judged by the human experts, the knowledge-based system refines its diagnosis as we travel down the tree. In effect, the KBS limits the comparisons and decisions required in knowledge acquisition and maintenance.

You might be thinking, "Hey buddy! What's the big deal, that's just a decision tree!" And it is, but what we're gaining is the strong context provided by the 'if-true' and 'if-false' links, which strictly constrain where new rules can be added. A new rule can only affect conclusions applied to cases previously falling under at most one rule, and in general it can only affect cases that would fall under the rule responsible for the conclusion if the new rule were not present. Any rule considered for addition to the system need only be applied underneath one rule.
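To make that structure concrete, here is a minimal RDR sketch in Clojure. This is my own illustration, not code from Gaines and Compton, and the TSH threshold in the toy tree is made up: each node carries a condition, a conclusion, and its 'if-true' / 'if-false' children, and the diagnosis returned is the conclusion of the last rule that fired along the path.

;; Minimal ripple-down-rules sketch (illustrative only).
(defrecord Rule [condition conclusion if-true if-false])

(defn diagnose
  "Walk the tree: follow if-true when the condition fires, if-false otherwise;
   keep the conclusion of the last rule that fired."
  ([rule patient] (diagnose rule patient nil))
  ([rule patient so-far]
   (if (nil? rule)
     so-far
     (if ((:condition rule) patient)
       (diagnose (:if-true rule) patient (:conclusion rule))
       (diagnose (:if-false rule) patient so-far)))))

;; Toy tree: the root always fires ("no thyroid disease"); one child refines it.
(def root
  (->Rule (constantly true) :no-thyroid-disease
          (->Rule #(> (:tsh %) 6.0) :hypothyroid nil nil)
          nil))

;; (diagnose root {:tsh 8.2}) => :hypothyroid
;; (diagnose root {:tsh 2.1}) => :no-thyroid-disease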

The structure of the tree allows the computer to take care of all the dirty work, leaving the human designer free to apply his/her knowledge effectively within the framework.

--
[1] Gaines, Brian R. and Compton, Paul. Induction of Ripple Down Rules Applied to Modeling Large Databases. University of New South Wales, 1995.

Monday, January 25, 2010

Simple single pass k-means

Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl. 2, 1 (Jun. 2000), 51-57.  http://portal.acm.org/citation.cfm?id=360419


Model clusters as Gaussians:
  • Clusters are modeled by the sum, sumSquared, and n of each feature.
  • When clusters combine, their new stats are the sum of the old sums, the sum of the old sumSquareds, and the sum of the old n's.
    • Reminder: when x is added to a Gaussian
      • n++
      • sum = sum + x
      • sumSquared = sumSquared + x*x
      • mean = sum/n
      • sd = sqrt((sumSquared - ((sum*sum)/n)) / (n-1))
  • Do not add a point to a cluster if any of its attributes falls outside of mean±1.96sd (for 95% confidence); a sketch of these updates appears right after this list.
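Here is a minimal Clojure sketch of that per-feature bookkeeping (my own illustration of the statistics described above, not the authors' code): add a point to a cluster's stats, merge two clusters, and test whether a value sits inside mean±1.96sd.

;; Per-feature running statistics: {:n, :sum, :sum-sq} (illustrative sketch).
(defn add-point [{:keys [n sum sum-sq]} x]
  {:n (inc n) :sum (+ sum x) :sum-sq (+ sum-sq (* x x))})

(defn merge-stats [a b]
  {:n      (+ (:n a) (:n b))
   :sum    (+ (:sum a) (:sum b))
   :sum-sq (+ (:sum-sq a) (:sum-sq b))})

(defn mean [{:keys [n sum]}] (/ sum n))

(defn sd [{:keys [n sum sum-sq]}]          ; needs n >= 2
  (Math/sqrt (/ (- sum-sq (/ (* sum sum) n)) (dec n))))

(defn within-95? [stats x]
  (let [m (mean stats) s (* 1.96 (sd stats))]
    (and (>= x (- m s)) (<= x (+ m s)))))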
Algorithm:
  1. Initialize k random clusters C1... Ck
  2. Now we have old stats and new stats (initially, old stats is nil)
  3. LOOP: Take 1% of the data and add it to a buffer.
  4. For all items remaining in the buffer, run k-means, moving the centroids until they converge. If an item can be added to a cluster (according to the old confidence intervals), then add it (updating the sum, sumSquared, and n of the new stats). Here, each cluster is treated as a point at its mean values, weighted by the number of points in that cluster.
  5. When no more items can be added, then:
    1. If any cluster has no items, seed it with the most distant data point
    2. Now clear the buffer 
    3. Add the new stats to the old stats.
  6. If there is no more new data, stop. Otherwise, go to LOOP. (A condensed sketch of this loop appears below.)
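Below is a condensed, single-feature sketch of that buffer-and-compress loop in Clojure. It is my paraphrase of the steps above, not the authors' code, and it simplifies freely: it reuses add-point, mean, and within-95? from the previous sketch, seeds clusters from the first k points rather than at random, replaces the inner k-means over weighted cluster-points with a plain assignment-to-nearest-centroid pass, and simply drops rejected points.

;; Condensed single-pass sketch (one feature, illustrative only).
;; Relies on add-point, mean, and within-95? defined above.
(defn nearest [centroids x]
  (apply min-key #(Math/abs (double (- x (nth centroids %))))
         (range (count centroids))))

(defn consume-buffer
  "Assign each buffered point to its nearest centroid; fold it into that
   cluster's stats only if it lies inside the old 95% interval (or the
   cluster is still too small to have one)."
  [clusters buffer]
  (reduce (fn [cs x]
            (let [centroids (mapv mean cs)
                  i         (nearest centroids x)
                  c         (nth cs i)]
              (if (or (< (:n c) 2) (within-95? c x))
                (assoc cs i (add-point c x))
                cs)))
          clusters
          buffer))

(defn single-pass-kmeans [k data chunk-size]
  (let [empty-stats {:n 0 :sum 0.0 :sum-sq 0.0}
        seeds       (mapv #(add-point empty-stats %) (take k data))]
    (reduce consume-buffer seeds (partition-all chunk-size (drop k data)))))

;; e.g. (map mean (single-pass-kmeans 2 [1.0 1.1 0.9 9.0 9.2 8.8 1.05 9.1] 3))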
Runs nearly twice as fast as old k-means. Here are the results from using various clustering algorithms on the KDD 1998 cup data (95412 rows, 481 fields). This algorithm is "N1". Normal batch k-means is "K". "R1" is a random-sampling k-means (which, by the way, generated low-quality clusters).


Note that this algorithm, working in increments of 1% of the data, produces clusters of almost the same quality as multiple-pass k-means.

Other advantages:
  • The memory footprint is small enough that you could have multiple versions going and pick the one with the lowest variance in the generated clusters.
  • Very simple implementation.
  • Satisfies the data mining desiderata:
    1. Require one scan (or less) of the database if possible: a single data scan is considered costly, and early termination, if appropriate, is highly desirable.
    2. On-line “anytime” behavior: a “best” answer is always available, with status information on progress, expected remaining time, etc., provided.
    3. Suspendable, stoppable, resumable; incremental progress saved to resume a stopped job.
    4. Ability to incrementally incorporate additional data with existing models efficiently.
    5. Work within confines of a given limited RAM buffer.
    6. Utilize a variety of possible scan modes: sequential, index, and sampling scans if available.
    7. Ability to operate on forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).

Searching For Ceiling Effects in NASA MDP Components - Part I - Undersampling

Defect distribution visualizations across components can be found here. Components are flagged to be used or ignored based on the median number of defects: components whose defect counts exceed the median are filtered in for selection. These flags can be seen below.



Component ID    Use/Ignore
22295           ignore
22296           ignore
22297           ignore
22298           ignore
22299           ignore
22302           ignore
22303           ignore
22305           ignore
22306           ignore
24952           ignore
24954           ignore
24958           ignore
24959           ignore
24960           ignore
24962           ignore
24964           ignore
24967           ignore
24970           ignore
24971           ignore
25              ignore
25492           ignore
26              ignore
27              ignore
28              ignore
29              ignore
30              ignore
31              ignore
32              ignore
33132           ignore
33135           ignore
33137           ignore
33138           ignore
33139           ignore
33140           ignore
33142           ignore
33143           ignore
33146           ignore
33147           ignore
33148           ignore
34753           ignore
34754           ignore
34755           ignore
34756           ignore
34757           ignore
34758           ignore
34759           ignore
34760           ignore
34761           ignore
34762           ignore
34763           ignore
34764           ignore
34765           ignore
34766           ignore
34767           ignore
34768           ignore
34769           ignore
70812           ignore
70814           ignore
70815           ignore
70818           ignore
70820           ignore
70821           ignore
70822           ignore
70823           ignore
70824           ignore
17              use
18              use
19              use
20              use
21              use
22              use
22292           use
22293           use
22294           use
22300           use
22301           use
22304           use
22307           use
22308           use
22309           use
23              use
24              use
24953           use
24955           use
24956           use
24957           use
24961           use
24963           use
24965           use
24966           use
24968           use
24969           use
25493           use
33133           use
33134           use
33136           use
33141           use
33144           use
33145           use
33149           use
33150           use
34752           use
70816           use
70817           use
70819           use

Plots of the potential ceiling effects under subsampling, where M=N (#defective = #non-defective), can be found here. A 10x10-way cross-validation was used in the experiment.
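For reference, here is a minimal Clojure sketch of that balancing step (my own illustration, not the experiment's code): randomly drop majority-class rows until the two classes are the same size. The defective? predicate and the :defects key in the usage comment are illustrative.

;; Undersampling sketch: keep all defective rows, sample an equal number of the rest.
(defn undersample [rows defective?]
  (let [{defects true others false} (group-by (comp boolean defective?) rows)]
    (concat defects (take (count defects) (shuffle (or others []))))))

;; e.g. (undersample data #(pos? (:defects %)))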

Kernel estimation

Palpanas et al. 2003 Distributed deviation detection in sensor networks (p79)

  • Assume a local distribution
  • Scream if you do not find it.
But how to assume the local kernel function?
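One standard way to "assume a local distribution" is a kernel density estimate: center a small kernel on every observed reading and sum them, so the only assumptions are the kernel shape and bandwidth. Here is a minimal Gaussian-kernel sketch in Clojure; this is my own illustration, not the paper's estimator, which may use a different kernel and bandwidth rule.

;; Gaussian kernel density estimate (illustrative sketch).
;; samples = locally observed readings, h = bandwidth, x = query point.
(defn gaussian-kernel [u]
  (/ (Math/exp (- (/ (* u u) 2.0)))
     (Math/sqrt (* 2.0 Math/PI))))

(defn kde [samples h x]
  (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) samples))
     (* (count samples) h)))

;; "Scream" when a reading's estimated density falls below a threshold.
(defn outlier? [samples h threshold x]
  (< (kde samples h x) threshold))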





Tuesday, January 19, 2010

Defect distributions across software components in NASA data

Below are some findings showing that a small number of components contain a high density of defects. Note that most components lie below either the red or the green line (average and median, respectively), telling us that the majority of components in the data sets shown contain relatively few defects.

KC1 - Ground Control System



PC3 - Flight Software from a current Earth-orbiting satellite



CM1 - Software from a science instrument

Friday, January 15, 2010

Checking prior results in student retention estimation

Prior results in learning predictors for student retention are shown here.

Those reports offer a variety of numbers. But using the Zhang equations, we can compute the missing values from the given ones:

% Given Pos, Neg, precision, and recall, derive the false alarm rate and accuracy.
calc(Pos,Neg,Prec,Recall, Pf,Acc) :-
        Pf      is Pos/Neg * (1-Prec)/Prec * Recall,
        D       is Recall * Pos,      % true positives
        C       is Pf * Neg,          % false positives
        A       is C*(1/Pf - 1),      % true negatives
        Acc     is (A+D)/(Neg + Pos).

Then we can write a simulator to explore a range of possible values. For example, for the Atwell paper:

run(atwel,[prec/Prec,neg=Neg,pos=Pos,pf/Pf,pd/Recall,acc/Acc]) :-
        nl,
        member(Prec,[0.88,0.82,0.73]),
        N   = 5990, 
        Neg = 4881, 
        Pos is (N-Neg), 
        member(Recall,[0.65,0.7,0.75,0.8,0.85,0.9]), 
        calc(Pos,Neg,Prec,Recall,Pf,Acc).

When we run this, we get the following numbers. Note the suspiciously low false alarm rates:

[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=65, acc=92]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=70, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=75, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=80, acc=94]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=85, acc=95]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=90, acc=96]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=65, acc=91]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=70, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=75, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=80, acc=93]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=85, acc=94]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=90, acc=94]
[who=atwel, prec=73, neg=4881, pos=1109, pf=5, pd=65, acc=89]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=70, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=75, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=80, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=85, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=8, pd=90, acc=92]

Similarly for the delong results. Here's the query:

run(delong,[prec/Prec,neg=Neg,pos=Pos,pf/Pf,pd/Recall,acc/Acc]) :-
        nl,
        Neg is 500,
        Pos is 500,
        member(Prec,[0.57,0.58,0.59]),
        member(Recall,[0.65,0.7,0.75,0.8,0.85,0.9]), 
        calc(Pos,Neg,Prec,Recall,Pf,Acc).

And here are the results. Note the very high false alarm rates and mediocre accuracies.

[who=delong, prec=57, neg=500, pos=500, pf=49, pd=65, acc=58]
[who=delong, prec=57, neg=500, pos=500, pf=53, pd=70, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=57, pd=75, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=60, pd=80, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=64, pd=85, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=68, pd=90, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=47, pd=65, acc=59]
[who=delong, prec=58, neg=500, pos=500, pf=51, pd=70, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=54, pd=75, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=58, pd=80, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=62, pd=85, acc=62]
[who=delong, prec=58, neg=500, pos=500, pf=65, pd=90, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=45, pd=65, acc=60]
[who=delong, prec=59, neg=500, pos=500, pf=49, pd=70, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=52, pd=75, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=56, pd=80, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=59, pd=85, acc=63]
[who=delong, prec=59, neg=500, pos=500, pf=63, pd=90, acc=64]

Thursday, January 14, 2010

ssh keys

This guide will give you everything you need to create and start using an ssh key to bypass password authentication. Before you begin, please understand that ssh keys need to be maintained with a high level of security. If your keys were to become compromised for some reason, whoever obtained them could potentially break your pass phrase and gain access to your data.

In a safe location, run the following command to generate your key pair:
ssh-keygen -b 4096 -t rsa

You'll immediately be prompted to choose a file name for the key pair. Most commonly, the string 'identity' is used, and I suggest using that. Following that, you'll need to enter a pass phrase twice. Running the above command will leave you with two files: 'identity' and 'identity.pub'. The first is your private key; keep it only on your home machine, laptop, or flash drive. The second is your public key.

Next, back up your current '~/.ssh' folder and create a new one to house your fresh keys.
mv ~/.ssh ~/.ssh-old
mkdir ~/.ssh
mv identity* ~/.ssh/

Once your keys are in place, you can run the following command to 'add' your keys and ensure that everything has gone according to plan. Running this will prompt you for your pass phrase.
ssh-add

Now, we need to set up the public part of your key on any remote hosts you want to access without typing your password. If you have an NFS mount on those hosts (like CSEE), you just need to do this in your home directory. ssh into the server or machine in question and, if '~/.ssh' does not exist, create it. Then go to '~/.ssh' and create the file 'authorized_keys'.
cd ~/.ssh && touch authorized_keys

Finally, copy the contents of your 'identity.pub' file into 'authorized_keys' on the remote host. The next time you authenticate (ensuring that you've used 'ssh-add' in advance) you won't need to enter your password.
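For example, if 'identity.pub' has been copied into your home directory on the remote host (that location is just for illustration), appending it looks like this:
cat ~/identity.pub >> ~/.ssh/authorized_keys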