Tuesday, January 26, 2010

1987 Evett Fiber Model

;; Requires Incanter's matrix operators (nrow, plus, div, minus, pow,
;; mult, mmult, trans, det).
(use '(incanter core))

;; Numerator of the Evett (1987) fiber-model likelihood for measurement y,
;; given the control data x (m rows; the code assumes two feature columns).
(defn numerator [y x]
  (let [m     (nrow x)
        fnc1  (fn [t u]
                (- (/ (* (Math/exp (- u)) (- t 1) (Math/pow u (- t 2)))
                      (- t 2))
                   (* (- (Math/exp (- u))) (Math/pow u (- t 1)))))
        fnc2  (fn []
                (/ (* (- m 1) (+ m 1)) m))
        ;; column means of the control data
        smean (fn [x]
                (div (reduce plus x) m))
        ;; diagonal covariance matrix and its diagonal-swapped companion
        Sx    (fn [x]
                (let [vals   (div (reduce plus
                                          (matrix (map #(pow (minus % (smean x)) 2) x)))
                                  m)
                      comat  (matrix [[(first vals) 0] [0 (second vals)]])
                      icomat (matrix [[(second vals) 0] [0 (first vals)]])]
                  [comat icomat]))]
    (/ (/ (fnc1 (/ m 2) (first y))
          (* Math/PI (fnc1 (/ (- m 2) 2) (first y))))
       (mult (pow (det (mult (fnc2) (first (Sx x)))) 0.5)
             (Math/pow (+ 1 (mmult (mmult (mult (minus y (smean x)) (fnc2))
                                          (second (Sx x)))
                                   (trans (minus y (smean x)))))
                       (/ m 2))))))

;; Denominator: mean Normal(y; zi, lamb^2) density over the population sample z.
(defn denominator [y z lamb]
  (let [Normal-zlamb (fn [zi]
                       (* (/ 1 (Math/sqrt (* 2 Math/PI (Math/pow lamb 2))))
                          (Math/exp (- (/ (mmult (minus y zi) (trans (minus y zi)))
                                          (* 2 (Math/pow lamb 2)))))))
        SumZ (/ (apply + (map #(Normal-zlamb %) z)) (count z))]
    SumZ))

(defn likelihood [x y z lamb]
  (/ (numerator y x) (denominator y z lamb)))

Why "W" Can't Make Up Its Mind

For the past few weeks, I've been having trouble generating consistent constrained queries in our CBR rig, "W". For half the datasets, we'd generate recommendations that wouldn't map back to the test set, preventing further analysis. The following chart should shed some light on the problem:

For the longest time the concern was with the size of our datasets, given that we weren't returning historical results. Upon closer inspection, the datasets giving us trouble contain a very large number of distinct attribute values, and currently "W" only learns single-value constraints. This explains why constraining our query with learned data returns no results.

Solution: discretize the data, or build a better matching system.
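One way to picture the first fix: bin each continuous attribute into a small number of ranges, so that a single learned value covers many raw values. Here is a minimal equal-width discretization sketch in Clojure; the function and bin count are illustrative and not part of "W".

(defn discretize
  "Map each numeric value in xs onto one of n-bins integer labels,
   so a single-value constraint on a label matches a range of raw values."
  [xs n-bins]
  (let [lo    (apply min xs)
        hi    (apply max xs)
        width (/ (- hi lo) n-bins)]
    (map (fn [x]
           (if (= hi lo)
             0
             (min (dec n-bins) (int (/ (- x lo) width)))))
         xs)))

;; e.g. (discretize [0.1 0.4 0.35 2.9 3.1] 3) => (0 0 0 2 2)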

A Simple Approach to RDR

Gaines and Compton [1] explain Ripple Down Rules pretty simply by analyzing Compton's reconstruction of the GARVAN ES-1 diagnostic system. GARVAN ES-1 was one of the world's first medical expert systems to be put into routine use; it interpreted medical reports of thyroid hormone measurements in blood samples.

The idea behind RDR is to combine a dynamic, incremental rule-based hierarchy with human experts. Given data and domain-specific knowledge, the knowledge-based system can make assumptions about a space. The human's role is to repair broken rules created by the system or to provide further insight to the system itself.



Take a look at the GARVAN RDR tree supplied by Gaines and Compton. The system begins at a root node where the first rule (0.00) diagnoses that there isn't an instance of thyroid disease. Obviously this rule is always true, so, in decision-tree fashion, we follow the 'if-true' link to the next rule, 1.46, which attempts to make a better diagnosis. Based on the pass-fail result of the diagnosis, as judged by the human experts, the knowledge-based system refines its diagnosis as we travel down the tree. In effect, the KBS limits the comparisons and decisions required in knowledge acquisition and maintenance.

You might be thinking, "Hey buddy! What's the big deal, that's just a decision tree!" And it is, but what we're gaining is the strong context provided by the 'if-true' and 'if-false' links, which strictly constrain where new rules can be added. A new rule can only affect conclusions applied to cases previously falling under at most one rule, and in general it can only affect cases that would fall under the rule responsible for the conclusion if the new rule were not present. Any rule considered for addition to the system need only be applied underneath one rule.
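To make that structure concrete, here is a minimal RDR sketch in Clojure. This is my own illustration, not code from Gaines and Compton, and the TSH threshold in the toy tree is made up: each node carries a condition, a conclusion, and its 'if-true' / 'if-false' children, and the diagnosis returned is the conclusion of the last rule that fired along the path.

;; Minimal ripple-down-rules sketch (illustrative only).
(defrecord Rule [condition conclusion if-true if-false])

(defn diagnose
  "Walk the tree: follow if-true when the condition fires, if-false otherwise;
   keep the conclusion of the last rule that fired."
  ([rule patient] (diagnose rule patient nil))
  ([rule patient so-far]
   (if (nil? rule)
     so-far
     (if ((:condition rule) patient)
       (diagnose (:if-true rule) patient (:conclusion rule))
       (diagnose (:if-false rule) patient so-far)))))

;; Toy tree: the root always fires ("no thyroid disease"); one child refines it.
(def root
  (->Rule (constantly true) :no-thyroid-disease
          (->Rule #(> (:tsh %) 6.0) :hypothyroid nil nil)
          nil))

;; (diagnose root {:tsh 8.2}) => :hypothyroid
;; (diagnose root {:tsh 2.1}) => :no-thyroid-disease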

The structure of the tree allows the computer to take care of all the dirty work, leaving the human designer free to apply his/her knowledge effectively within the framework.

--
[1] Gaines, Brian R. and Compton, Paul. Induction of Ripple Down Rules Applied to Modeling Large Databases. University of New South Wales, 1995.

Monday, January 25, 2010

Simple single pass k-means

Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl. 2, 1 (Jun. 2000), 51-57.  http://portal.acm.org/citation.cfm?id=360419


Model clusters as Gaussians:
  • Clusters are modeled by the sum, sumSquared, and n of each feature.
  • When clusters combine, their new stats are the sum of the old sums, the sum of the old sumSquareds, and the sum of the old n's.
    • Reminder: when x is added to a Gaussian
      • n++
      • sum = sum + x
      • sumSquared = sumSquared + x*x
      • mean = sum/n
      • sd = sqrt((sumSquared - ((sum*sum)/n)) / (n-1))
  • Do not add a point to a cluster if any of its attributes falls outside of mean±1.96sd (for 95% confidence); a sketch of these updates appears right after this list.
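Here is a minimal Clojure sketch of that per-feature bookkeeping (my own illustration of the statistics described above, not the authors' code): add a point to a cluster's stats, merge two clusters, and test whether a value sits inside mean±1.96sd.

;; Per-feature running statistics: {:n, :sum, :sum-sq} (illustrative sketch).
(defn add-point [{:keys [n sum sum-sq]} x]
  {:n (inc n) :sum (+ sum x) :sum-sq (+ sum-sq (* x x))})

(defn merge-stats [a b]
  {:n      (+ (:n a) (:n b))
   :sum    (+ (:sum a) (:sum b))
   :sum-sq (+ (:sum-sq a) (:sum-sq b))})

(defn mean [{:keys [n sum]}] (/ sum n))

(defn sd [{:keys [n sum sum-sq]}]          ; needs n >= 2
  (Math/sqrt (/ (- sum-sq (/ (* sum sum) n)) (dec n))))

(defn within-95? [stats x]
  (let [m (mean stats) s (* 1.96 (sd stats))]
    (and (>= x (- m s)) (<= x (+ m s)))))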
Algorithm:
  1. Initialize k random clusters C1... Ck
  2. Now we have old stats and new stats (initially, old stats is nil)
  3. LOOP: Take 1% of the data and add it to a buffer.
  4. For all items remaining in the buffer, run k-means, moving the centroids until they converge. If an item can be added to a cluster (according to the old confidence intervals), then add it (updating the sum, sumSquared, and n of the new stats). Here, each cluster is treated as a point at its mean values, weighted by the number of points in that cluster.
  5. When no more items can be added, then:
    1. If any cluster has no items, seed it with the most distant data point
    2. Now clear the buffer 
    3. Add the new stats to the old stats.
  6. If there is no more new data, stop. Otherwise, go to LOOP. (A condensed sketch of this loop appears below.)
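Below is a condensed, single-feature sketch of that buffer-and-compress loop in Clojure. It is my paraphrase of the steps above, not the authors' code, and it simplifies freely: it reuses add-point, mean, and within-95? from the previous sketch, seeds clusters from the first k points rather than at random, replaces the inner k-means over weighted cluster-points with a plain assignment-to-nearest-centroid pass, and simply drops rejected points.

;; Condensed single-pass sketch (one feature, illustrative only).
;; Relies on add-point, mean, and within-95? defined above.
(defn nearest [centroids x]
  (apply min-key #(Math/abs (double (- x (nth centroids %))))
         (range (count centroids))))

(defn consume-buffer
  "Assign each buffered point to its nearest centroid; fold it into that
   cluster's stats only if it lies inside the old 95% interval (or the
   cluster is still too small to have one)."
  [clusters buffer]
  (reduce (fn [cs x]
            (let [centroids (mapv mean cs)
                  i         (nearest centroids x)
                  c         (nth cs i)]
              (if (or (< (:n c) 2) (within-95? c x))
                (assoc cs i (add-point c x))
                cs)))
          clusters
          buffer))

(defn single-pass-kmeans [k data chunk-size]
  (let [empty-stats {:n 0 :sum 0.0 :sum-sq 0.0}
        seeds       (mapv #(add-point empty-stats %) (take k data))]
    (reduce consume-buffer seeds (partition-all chunk-size (drop k data)))))

;; e.g. (map mean (single-pass-kmeans 2 [1.0 1.1 0.9 9.0 9.2 8.8 1.05 9.1] 3))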
Runs nearly twice as fast as old k-means. Here are the results from using various clustering algorithms on the KDD 1998 cup data (95412 rows, 481 fields). This algorithm is "N1". Normal batch k-means is "K". "R1" is a random-sampling k-means (which, by the way, generated low-quality clusters).


Note that this algorithm, working in increments of 1% of the data, produces clusters of almost the same quality as multiple-pass k-means.

Other advantages:
  • The memory footprint is small enough that you could have multiple versions going and pick the one with the lowest variance in the generated clusters.
  • Very simple implementation.
  • Satisfies the data mining desiderata:
    1. Require one scan (or less) of the database if possible: a single data scan is considered costly, and early termination, if appropriate, is highly desirable.
    2. On-line “anytime” behavior: a “best” answer is always available, with status information on progress, expected remaining time, etc., provided.
    3. Suspendable, stoppable, resumable; incremental progress saved to resume a stopped job.
    4. Ability to incrementally incorporate additional data with existing models efficiently.
    5. Work within confines of a given limited RAM buffer.
    6. Utilize a variety of possible scan modes: sequential, index, and sampling scans if available.
    7. Ability to operate on forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).

Searching For Ceiling Effects in NASA MDP Components - Part I - Undersampling

Defect distribution visualizations across components can be found here. Components are flagged to be used or ignored based on the median number of defects: components whose defect counts exceed the median are filtered in for selection. These flags can be seen below.



Component ID    Use/Ignore
22295           ignore
22296           ignore
22297           ignore
22298           ignore
22299           ignore
22302           ignore
22303           ignore
22305           ignore
22306           ignore
24952           ignore
24954           ignore
24958           ignore
24959           ignore
24960           ignore
24962           ignore
24964           ignore
24967           ignore
24970           ignore
24971           ignore
25              ignore
25492           ignore
26              ignore
27              ignore
28              ignore
29              ignore
30              ignore
31              ignore
32              ignore
33132           ignore
33135           ignore
33137           ignore
33138           ignore
33139           ignore
33140           ignore
33142           ignore
33143           ignore
33146           ignore
33147           ignore
33148           ignore
34753           ignore
34754           ignore
34755           ignore
34756           ignore
34757           ignore
34758           ignore
34759           ignore
34760           ignore
34761           ignore
34762           ignore
34763           ignore
34764           ignore
34765           ignore
34766           ignore
34767           ignore
34768           ignore
34769           ignore
70812           ignore
70814           ignore
70815           ignore
70818           ignore
70820           ignore
70821           ignore
70822           ignore
70823           ignore
70824           ignore
17              use
18              use
19              use
20              use
21              use
22              use
22292           use
22293           use
22294           use
22300           use
22301           use
22304           use
22307           use
22308           use
22309           use
23              use
24              use
24953           use
24955           use
24956           use
24957           use
24961           use
24963           use
24965           use
24966           use
24968           use
24969           use
25493           use
33133           use
33134           use
33136           use
33141           use
33144           use
33145           use
33149           use
33150           use
34752           use
70816           use
70817           use
70819           use

Plots of the potential ceiling effects under subsampling, where M=N (#defective = #non-defective), can be found here. A 10x10-way cross-validation was used in the experiment.
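For reference, here is a minimal Clojure sketch of that balancing step (my own illustration, not the experiment's code): randomly drop majority-class rows until the two classes are the same size. The defective? predicate and the :defects key in the usage comment are illustrative.

;; Undersampling sketch: keep all defective rows, sample an equal number of the rest.
(defn undersample [rows defective?]
  (let [{defects true others false} (group-by (comp boolean defective?) rows)]
    (concat defects (take (count defects) (shuffle (or others []))))))

;; e.g. (undersample data #(pos? (:defects %)))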

Kernel estimation

Palpanas et al. 2003 Distributed deviation detection in sensor networks (p79)

  • Assume a local distribution
  • Scream if you do not find it.
But how to assume the local kernel function?
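One standard way to "assume a local distribution" is a kernel density estimate: center a small kernel on every observed reading and sum them, so the only assumptions are the kernel shape and bandwidth. Here is a minimal Gaussian-kernel sketch in Clojure; this is my own illustration, not the paper's estimator, which may use a different kernel and bandwidth rule.

;; Gaussian kernel density estimate (illustrative sketch).
;; samples = locally observed readings, h = bandwidth, x = query point.
(defn gaussian-kernel [u]
  (/ (Math/exp (- (/ (* u u) 2.0)))
     (Math/sqrt (* 2.0 Math/PI))))

(defn kde [samples h x]
  (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) samples))
     (* (count samples) h)))

;; "Scream" when a reading's estimated density falls below a threshold.
(defn outlier? [samples h threshold x]
  (< (kde samples h x) threshold))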





Tuesday, January 19, 2010

Defect distributions across software components in NASA data

Below are some findings showing that a small number of components contain a high density of defects. Note that most components lie below either the red or the green line (average and median, respectively), telling us that the majority of components in the data sets shown contain relatively few defects.

KC1 - Ground Control System



PC3 - Flight Software from a current Earth-orbiting satellite



CM1 - Software from a science instrument

Friday, January 15, 2010

Checking prior results in student retention estimation

Prior results in learning predictors for student retention are shown here.

Those reports offer a variety of numbers. But using the Zhang equations, we can compute the missing values from the given ones:

% Given Pos, Neg, precision, and recall, derive the false alarm rate and accuracy.
calc(Pos,Neg,Prec,Recall, Pf,Acc) :-
        Pf      is Pos/Neg * (1-Prec)/Prec * Recall,
        D       is Recall * Pos,      % true positives
        C       is Pf * Neg,          % false positives
        A       is C*(1/Pf - 1),      % true negatives
        Acc     is (A+D)/(Neg + Pos).

Then we can write a simulator to explore a range of possible values. For example, for the Atwell paper:

run(atwel,[prec/Prec,neg=Neg,pos=Pos,pf/Pf,pd/Recall,acc/Acc]) :-
        nl,
        member(Prec,[0.88,0.82,0.73]),
        N   = 5990, 
        Neg = 4881, 
        Pos is (N-Neg), 
        member(Recall,[0.65,0.7,0.75,0.8,0.85,0.9]), 
        calc(Pos,Neg,Prec,Recall,Pf,Acc).

When we run this, we get the following numbers. Note the suspiciously low false alarm rates:

[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=65, acc=92]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=70, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=75, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=80, acc=94]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=85, acc=95]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=90, acc=96]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=65, acc=91]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=70, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=75, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=80, acc=93]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=85, acc=94]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=90, acc=94]
[who=atwel, prec=73, neg=4881, pos=1109, pf=5, pd=65, acc=89]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=70, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=75, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=80, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=85, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=8, pd=90, acc=92]

Similarly for the delong results. Here's the query:

run(delong,[prec/Prec,neg=Neg,pos=Pos,pf/Pf,pd/Recall,acc/Acc]) :-
        nl,
        Neg is 500,
        Pos is 500,
        member(Prec,[0.57,0.58,0.59]),
        member(Recall,[0.65,0.7,0.75,0.8,0.85,0.9]), 
        calc(Pos,Neg,Prec,Recall,Pf,Acc).

And here are the results. Note the very high false alarm rates and mediocre accuracies.

[who=delong, prec=57, neg=500, pos=500, pf=49, pd=65, acc=58]
[who=delong, prec=57, neg=500, pos=500, pf=53, pd=70, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=57, pd=75, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=60, pd=80, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=64, pd=85, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=68, pd=90, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=47, pd=65, acc=59]
[who=delong, prec=58, neg=500, pos=500, pf=51, pd=70, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=54, pd=75, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=58, pd=80, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=62, pd=85, acc=62]
[who=delong, prec=58, neg=500, pos=500, pf=65, pd=90, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=45, pd=65, acc=60]
[who=delong, prec=59, neg=500, pos=500, pf=49, pd=70, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=52, pd=75, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=56, pd=80, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=59, pd=85, acc=63]
[who=delong, prec=59, neg=500, pos=500, pf=63, pd=90, acc=64]

Thursday, January 14, 2010

ssh keys

This guide will give you everything you need to create and start using an ssh key to bypass password authentication. Before you begin, please understand that ssh keys need to be maintained with a high level of security. If your keys were to become compromised for some reason, whoever obtained them could potentially break your pass phrase and gain access to your data.

In a safe location, run the following command to generate your key pair:
ssh-keygen -b 4096 -t rsa

You'll immediately be prompted to choose a file name for the key pair. Most commonly, the string 'identity' is used, and I suggest using that. Following that, you'll need to enter a pass phrase twice. Running the above command will leave you with two files: 'identity' and 'identity.pub'. The first is your private key; keep it only on your home machine, laptop, or flash drive. The second is your public key.

Next, back up your current '~/.ssh' folder and create a new one to house your fresh keys.
mv ~/.ssh ~/.ssh-old
mkdir ~/.ssh
mv identity* ~/.ssh/

Once your keys are in place, you can run the following command to 'add' your keys and ensure that everything has gone according to plan. Running this will prompt you for your pass phrase.
ssh-add

Now, we need to set up the public part of your key on any remote hosts you want to access without typing your password. If you have an NFS mount on those hosts (like CSEE), you just need to do this in your home directory. ssh into the server or machine in question and, if '~/.ssh' does not exist, create it. Then go to '~/.ssh' and create the file 'authorized_keys'.
cd ~/.ssh && touch authorized_keys

Finally, copy the contents of your 'identity.pub' file into 'authorized_keys' on the remote host. The next time you authenticate (ensuring that you've used 'ssh-add' in advance) you won't need to enter your password.
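For example, if 'identity.pub' has been copied into your home directory on the remote host (that location is just for illustration), appending it looks like this:
cat ~/identity.pub >> ~/.ssh/authorized_keys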