```clojure
(defn numerator [y x]
  (let [m (nrow x)
        fnc1 (fn [t u]
               (- (/ (* (Math/exp (- u)) (- t 1) (Math/pow u (- t 2)))
                     (- t 2))
                  (* (- (Math/exp (- u))) (Math/pow u (- t 1)))))
        fnc2 (fn [] (/ (* (- m 1) (+ m 1)) m))
        smean (fn [x] (div (reduce plus x) m))
        Sx (fn [x]
             (let [vals (div (reduce plus
                                     (matrix (map #(pow (minus % (smean x)) 2) x)))
                             m)
                   comat (matrix [[(first vals) 0] [0 (second vals)]])
                   icomat (matrix [[(second vals) 0] [0 (first vals)]])]
               [comat icomat]))]
    (/ (/ (fnc1 (/ m 2) (first y))
          (* Math/PI (fnc1 (/ (- m 2) 2) (first y))))
       (mult (pow (det (mult (fnc2) (first (Sx x)))) 0.5)
             (Math/pow (+ 1 (mmult (mmult (mult (minus y (smean x)) (fnc2))
                                          (second (Sx x)))
                                   (trans (minus y (smean x)))))
                       (/ m 2))))))

(defn denominator [y z lamb]
  (let [Normal-zlamb (fn [zi]
                       (* (/ 1 (Math/sqrt (* 2 Math/PI (Math/pow lamb 2))))
                          (Math/exp (- (/ (mmult (minus y zi) (trans (minus y zi)))
                                          (* 2 (Math/pow lamb 2)))))))
        SumZ (/ (apply + (map Normal-zlamb z)) (count z))]
    SumZ))

(defn likelihood [x y z lamb]
  (/ (numerator y x) (denominator y z lamb)))
```

## Tuesday, January 26, 2010

### Why "W" Can't Make Up Its Mind

For the past few weeks, I've been having trouble generating consistent constrained queries in our CBR rig, "W". For half the datasets, we'd generate recommendations that wouldn't map back to the test set, preventing further analysis. The following chart should shed some light on the problem:

For the longest time the concern was with the size of our datasets, given that we weren't returning historical results. Upon closer inspection, the datasets giving us trouble contain a very large number of attribute values, and currently "W" only learns single-value constraints. This explains why constraining our query with learned data returns no results.

Solution: discretize the data, or build a better matching system.
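For the first option, even naive equal-width discretization would collapse the huge attribute-value sets into a handful of learnable values. A minimal sketch (the function name and bin count are illustrative; nothing here is part of "W"):

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k equal-width bins (0 .. k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # constant columns all land in bin 0
    return [min(int((v - lo) / width), k - 1) for v in values]
```

After binning, a single-value constraint learned on a bin label matches every raw value that falls in that bin, instead of exactly one.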

### A Simple Approach to RDR

Gaines and Compton [1] explain Ripple Down Rules pretty simply by analyzing Compton's reconstruction of the GARVAN ES-1 diagnostic system. GARVAN ES-1 was one of the world's first medical expert systems put into routine use; it interpreted medical reports of thyroid hormone measurements in blood samples.

The idea behind RDR is combining a dynamic, incremental rule based hierarchy with human experts. Given data and domain specific knowledge, the knowledge based system can make assumptions about a space. The human interaction involved is to repair broken rules created by the system or to provide further insight to the system itself.

Take a look at the GARVAN RDR tree supplied by Gaines and Compton. The system begins at a root node where the first rule (0.00) diagnoses that there isn't an instance of thyroid disease. Obviously this rule is always true, so, in decision tree fashion, we follow the 'if-true' link to the next rule, 1.46, which attempts to make a better diagnosis. Based on the pass-fail result of the diagnosis, as governed by the human experts, the knowledge based system attempts to refine its diagnosis as we travel down the tree. In effect, the KBS limits the comparisons and decisions required in knowledge acquisition and maintenance.

You might be thinking, "Hey buddy! What's the big deal, that's just a decision tree!" And it is, but what we're gaining is the strong context provided by the 'if-true', 'if-false' links which strictly contain the addition of new rules. A new rule can only affect conclusions applied to cases previously falling under at most one rule and in general can only affect cases falling under the rule responsible for the conclusion if the new rule were not present. Any rule considered for addition to the system need only be applied underneath one rule.
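That traversal is short enough to sketch in code. This is an illustrative Python rendering (the attribute name and threshold are invented, not GARVAN's): each rule carries a condition, a conclusion, and 'if-true'/'if-false' children, and the conclusion of the last satisfied rule wins.

```python
class Rule:
    """One RDR node: if cond fires on a case, its conclusion tentatively holds
    and we descend the if-true branch; otherwise we take the if-false branch."""
    def __init__(self, cond, conclusion, if_true=None, if_false=None):
        self.cond, self.conclusion = cond, conclusion
        self.if_true, self.if_false = if_true, if_false

def classify(rule, case):
    """Walk the tree; the conclusion of the last satisfied rule wins."""
    conclusion = None
    while rule is not None:
        if rule.cond(case):
            conclusion = rule.conclusion
            rule = rule.if_true
        else:
            rule = rule.if_false
    return conclusion

# Root rule 0.00 always fires ("no thyroid disease"); one child refines it.
root = Rule(lambda c: True, "no thyroid disease",
            if_true=Rule(lambda c: c["tsh"] > 4.0, "hypothyroid"))
```

Here `classify(root, {"tsh": 6.0})` returns "hypothyroid", while `classify(root, {"tsh": 1.0})` falls back to the root's conclusion.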

The structure of the tree allows the computer to take care of all the dirty work, leaving the human designer the capability to apply his/her knowledge effectively within the framework.

--

[1] Gaines, Brian R., and Compton, Paul. Induction of Ripple Down Rules Applied to Modeling Large Databases. University of New South Wales, 1995.


## Monday, January 25, 2010

### Simple single pass k-means

Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl. 2, 1 (Jun. 2000), 51-57. http://portal.acm.org/citation.cfm?id=360419

Runs nearly twice as fast as old k-means. Here's the results from using various clustering algorithms on the KDD 1998 cup data (95412 rows, 481 fields). This algorithm is "N1". Normal batch k-means is "K". "R1" is a random sampling k-means (that, btw, generated low quality clusters).

Model clusters as gaussians:

- Clusters are modeled by the sum, sumSquared, and n of each feature.
- When clusters combine, their new stats are the sum of the old sums, the sum of the old sumSquareds, and the sum of the old ns.
- Reminder: when x is added to a Gaussian:
  - n++
  - sum = sum + x
  - sumSquared = sumSquared + x*x
  - mean = sum/n
  - sd = sqrt((sumSquared - (sum*sum)/n) / (n-1))
- Do not add a point to a cluster if any of its attributes falls outside of mean±1.96sd (for 95% confidence).
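The bookkeeping above fits in a few lines. A minimal illustration (the class and method names are mine, not Farnstrom et al.'s):

```python
import math

class GaussStat:
    """Running Gaussian summary of one feature: n, sum, sumSquared."""
    def __init__(self):
        self.n, self.sum, self.sum_sq = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.sum += x
        self.sum_sq += x * x

    def mean(self):
        return self.sum / self.n

    def sd(self):
        return math.sqrt((self.sum_sq - self.sum * self.sum / self.n) / (self.n - 1))

    def accepts(self, x):
        """Reject points outside mean +/- 1.96 sd (95% confidence)."""
        m, s = self.mean(), self.sd()
        return m - 1.96 * s <= x <= m + 1.96 * s

    def merge(self, other):
        """Combining clusters just adds the summaries."""
        self.n += other.n
        self.sum += other.sum
        self.sum_sq += other.sum_sq
```

Note that merging never touches the raw points; that is what keeps the memory footprint constant.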

- Initialize k random clusters C1... Ck.
- Now we have old stats and new stats (initially, old stats is nil).
- LOOP: Take 1% of the data and add it to a buffer.
- For all items in the buffer, run k-means, moving the centroids until they converge. If an item can be added to a cluster (according to the old confidence intervals), add it (updating sum, sumSquared, and n of the new stats). Here, each cluster is treated as a point at its mean values, weighted by the number of points in that cluster.
- When no more items can be added:
  - If any cluster has no items, seed it with the most distant data point.
  - Now clear the buffer.
  - Add the new stats to the old stats.
- If there is no more new data, stop. Otherwise, go to LOOP.
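The buffered loop above can be sketched as follows. This is a simplified illustration (function names are mine): it omits the within-buffer convergence iterations, the confidence-interval test, and the empty-cluster reseeding, and any final partial buffer is dropped.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def single_pass_kmeans(stream, centroids, chunk_size):
    """Buffered single-pass k-means: clusters keep only n and per-feature sums."""
    d = len(centroids[0])
    stats = [{"n": 0, "sum": [0.0] * d} for _ in centroids]
    buffer = []
    for point in stream:
        buffer.append(point)
        if len(buffer) < chunk_size:
            continue
        for p in buffer:  # assign each buffered point to its nearest cluster
            i = nearest(p, centroids)
            stats[i]["n"] += 1
            stats[i]["sum"] = [s + v for s, v in zip(stats[i]["sum"], p)]
        # move each non-empty cluster's centroid to its running mean
        centroids = [[s / st["n"] for s in st["sum"]] if st["n"] else c
                     for st, c in zip(stats, centroids)]
        buffer = []  # a real implementation would reseed empty clusters here
    return centroids
```

Because only (n, sum) survives each buffer flush, memory stays constant no matter how long the stream runs.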

Note that clusters generated by this algorithm, working in increments of 1% of the data, are of almost the same quality as those from multiple-pass k-means.

Other advantages:

- Small enough memory that you could have multiple versions going, and pick the one with the lowest variances in the generated clusters.
- Very simple implementation.
- Satisfies the data mining desiderata:

- Require one scan (or less) of the database if possible: a single data scan is considered costly; early termination, if appropriate, is highly desirable.
- On-line “anytime” behavior: a “best” answer is always available, with status information on progress, expected remaining time, etc. provided.
- Suspendable, stoppable, resumable; incremental progress saved to resume a stopped job.
- Ability to incrementally incorporate additional data with existing models efficiently.
- Work within the confines of a given limited RAM buffer.
- Utilize a variety of possible scan modes: sequential, index, and sampling scans if available.
- Ability to operate on a forward-only cursor over a view of the database. This is necessary since the database view may be the result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).

### Searching For Ceiling Effects in NASA MDP Components - Part I - Undersampling

Visualizations of the defect distributions among components can be found here. Components are flagged to be used or ignored based on the median number of defects: components whose defect counts exceed the median are filtered for selection first. These components can be seen below.

Plots of the potential ceiling effects can be found here, using subsampling where M=N (#defective = #non-defective). A 10x10-way cross-validation was used in the experiment.
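Assuming "exceed the median" means strictly greater, the flagging rule can be sketched as below (the function name and toy counts are illustrative; the real component IDs are in the table that follows):

```python
import statistics

def flag_components(defects):
    """defects maps component id -> defect count; flag 'use' when the count
    strictly exceeds the median (my reading of the rule above), else 'ignore'."""
    med = statistics.median(defects.values())
    return {cid: ("use" if n > med else "ignore") for cid, n in defects.items()}
```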

Component ID | Use/Ignore |
---|---|
22295 | ignore |
22296 | ignore |
22297 | ignore |
22298 | ignore |
22299 | ignore |
22302 | ignore |
22303 | ignore |
22305 | ignore |
22306 | ignore |
24952 | ignore |
24954 | ignore |
24958 | ignore |
24959 | ignore |
24960 | ignore |
24962 | ignore |
24964 | ignore |
24967 | ignore |
24970 | ignore |
24971 | ignore |
25 | ignore |
25492 | ignore |
26 | ignore |
27 | ignore |
28 | ignore |
29 | ignore |
30 | ignore |
31 | ignore |
32 | ignore |
33132 | ignore |
33135 | ignore |
33137 | ignore |
33138 | ignore |
33139 | ignore |
33140 | ignore |
33142 | ignore |
33143 | ignore |
33146 | ignore |
33147 | ignore |
33148 | ignore |
34753 | ignore |
34754 | ignore |
34755 | ignore |
34756 | ignore |
34757 | ignore |
34758 | ignore |
34759 | ignore |
34760 | ignore |
34761 | ignore |
34762 | ignore |
34763 | ignore |
34764 | ignore |
34765 | ignore |
34766 | ignore |
34767 | ignore |
34768 | ignore |
34769 | ignore |
70812 | ignore |
70814 | ignore |
70815 | ignore |
70818 | ignore |
70820 | ignore |
70821 | ignore |
70822 | ignore |
70823 | ignore |
70824 | ignore |
17 | use |
18 | use |
19 | use |
20 | use |
21 | use |
22 | use |
22292 | use |
22293 | use |
22294 | use |
22300 | use |
22301 | use |
22304 | use |
22307 | use |
22308 | use |
22309 | use |
23 | use |
24 | use |
24953 | use |
24955 | use |
24956 | use |
24957 | use |
24961 | use |
24963 | use |
24965 | use |
24966 | use |
24968 | use |
24969 | use |
25493 | use |
33133 | use |
33134 | use |
33136 | use |
33141 | use |
33144 | use |
33145 | use |
33149 | use |
33150 | use |
34752 | use |
70816 | use |
70817 | use |
70819 | use |

### Kernel estimation

Palpanas et al. 2003 Distributed deviation detection in sensor networks (p79)

- Assume a local distribution
- Scream if you do not find it.

But how to assume the local kernel function?

- There are so many.
- Palpanas et al. comment that the actual kernel does not matter. (p79).
- This is not quite correct. Schneid reports that Wand & Jones report that the choice of kernel is not as important as the choice of bandwidth (if the sampling bandwidth is high enough, then almost any kernel is good enough).
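That claim is easy to check numerically. A small sketch (the data points, evaluation point, and bandwidths are all illustrative): swapping the kernel moves the density estimate far less than changing the bandwidth does.

```python
import math

def kde(data, x, h, kernel):
    """Kernel density estimate at x with bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

def gaussian(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

data = [0.0, 0.5, 1.0, 1.5, 2.0]
g_wide = kde(data, 1.0, 1.0, gaussian)        # ~0.317
e_wide = kde(data, 1.0, 1.0, epanechnikov)    # ~0.375
g_narrow = kde(data, 1.0, 0.1, gaussian)      # ~0.798
# Swapping kernels (0.317 vs 0.375) shifts the estimate far less than
# shrinking the bandwidth does (0.317 vs 0.798).
```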

## Tuesday, January 19, 2010

### Defect distributions across software components in NASA data

Below are some findings showing the small number of components containing a high density of defects. Note that most components lie below either the red or green lines (average and median respectively), telling us that the majority of components in the data sets shown contain relatively few defects.

KC1 - Ground Control System

CM1 - Software from a science instrument

## Friday, January 15, 2010

### Checking prior results in student retention estimation

Prior results in learning predictors for student retention are shown here.

Those reports offer a variety of numbers. But using the Zhang equations, we can compute the missing values from the given ones:

```prolog
calc(Pos, Neg, Prec, Recall, Pf, Acc) :-
    Pf is Pos/Neg * (1-Prec)/Prec * Recall,
    D is Recall * Pos,
    C is Pf * Neg,
    A is C * (1/Pf - 1),
    Acc is (A+D) / (Neg+Pos).
```

Then we can write a simulator to explore a range of possible values. For example, for the Atwell paper:

```prolog
run(atwel, [prec/Prec, neg=Neg, pos=Pos, pf/Pf, pd/Recall, acc/Acc]) :-
    nl,
    member(Prec, [0.88, 0.82, 0.73]),
    N = 5990, Neg = 4881, Pos is (N - Neg),
    member(Recall, [0.65, 0.7, 0.75, 0.8, 0.85, 0.9]),
    calc(Pos, Neg, Prec, Recall, Pf, Acc).
```

When we run this, we get the following numbers. Note the suspiciously low false alarm rates:

```
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=65, acc=92]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=70, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=75, acc=93]
[who=atwel, prec=88, neg=4881, pos=1109, pf=2, pd=80, acc=94]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=85, acc=95]
[who=atwel, prec=88, neg=4881, pos=1109, pf=3, pd=90, acc=96]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=65, acc=91]
[who=atwel, prec=82, neg=4881, pos=1109, pf=3, pd=70, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=75, acc=92]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=80, acc=93]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=85, acc=94]
[who=atwel, prec=82, neg=4881, pos=1109, pf=4, pd=90, acc=94]
[who=atwel, prec=73, neg=4881, pos=1109, pf=5, pd=65, acc=89]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=70, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=6, pd=75, acc=90]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=80, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=7, pd=85, acc=91]
[who=atwel, prec=73, neg=4881, pos=1109, pf=8, pd=90, acc=92]
```

Similarly for the delong results. Here's the query:

```prolog
run(delong, [prec/Prec, neg=Neg, pos=Pos, pf/Pf, pd/Recall, acc/Acc]) :-
    nl,
    Neg is 500, Pos is 500,
    member(Prec, [0.57, 0.58, 0.59]),
    member(Recall, [0.65, 0.7, 0.75, 0.8, 0.85, 0.9]),
    calc(Pos, Neg, Prec, Recall, Pf, Acc).
```

And here are the results. Note the very high false alarm rates and mediocre accuracies:

```
[who=delong, prec=57, neg=500, pos=500, pf=49, pd=65, acc=58]
[who=delong, prec=57, neg=500, pos=500, pf=53, pd=70, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=57, pd=75, acc=59]
[who=delong, prec=57, neg=500, pos=500, pf=60, pd=80, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=64, pd=85, acc=60]
[who=delong, prec=57, neg=500, pos=500, pf=68, pd=90, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=47, pd=65, acc=59]
[who=delong, prec=58, neg=500, pos=500, pf=51, pd=70, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=54, pd=75, acc=60]
[who=delong, prec=58, neg=500, pos=500, pf=58, pd=80, acc=61]
[who=delong, prec=58, neg=500, pos=500, pf=62, pd=85, acc=62]
[who=delong, prec=58, neg=500, pos=500, pf=65, pd=90, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=45, pd=65, acc=60]
[who=delong, prec=59, neg=500, pos=500, pf=49, pd=70, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=52, pd=75, acc=61]
[who=delong, prec=59, neg=500, pos=500, pf=56, pd=80, acc=62]
[who=delong, prec=59, neg=500, pos=500, pf=59, pd=85, acc=63]
[who=delong, prec=59, neg=500, pos=500, pf=63, pd=90, acc=64]
```
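For readers without a Prolog handy, the same algebra transcribes directly to Python (a sketch of the calc/6 clause above; a, c, and d are the true negatives, false positives, and true positives):

```python
def calc(pos, neg, prec, recall):
    """Derive pf and accuracy from precision and recall via the Zhang equations."""
    pf = pos / neg * (1 - prec) / prec * recall
    d = recall * pos          # true positives
    c = pf * neg              # false positives
    a = c * (1 / pf - 1)      # true negatives
    return pf, (a + d) / (neg + pos)

pf, acc = calc(1109, 4881, 0.88, 0.65)  # first atwel row: pf ~2%, acc ~92%
```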

## Thursday, January 14, 2010

### ssh keys

This guide will give you everything you need to create and start using an ssh key to bypass password authentication. Before you begin, please understand that ssh keys need to be maintained with a high level of security: if your keys were to become compromised for some reason, whoever obtained them could potentially break your pass phrase and gain access to your data.

In a safe location, run the following command to generate your key pair:

```shell
ssh-keygen -b 4096 -t rsa
```

You'll immediately be prompted to choose a file name for the key pair. Most commonly, the string 'identity' is used, and I suggest using that. Following that, you'll need to enter a pass phrase twice. Running the above command leaves you with two files: 'identity' and 'identity.pub'. The first is your private key; keep it only on your home machine, laptop, or flash drive. The second is your public key.

Next, back up your current '~/.ssh' folder and create a new one to house your fresh keys.

```shell
mv ~/.ssh ~/.ssh-old
mkdir ~/.ssh
mv identity* ~/.ssh/
```

Once your keys are in place, you can run the following command to 'add' your keys and ensure that everything has gone according to plan. Running this will prompt you for your pass phrase.

```shell
ssh-add
```

Now we need to set up the public part of your key on any remote hosts you want to access without typing your password. If you have an NFS mount on those hosts (like CSEE), you just need to do this in your home directory. ssh into the server or machine in question and, if '~/.ssh' does not exist, create it. Once created, go to '~/.ssh' and create the file 'authorized_keys'.

```shell
cd ~/.ssh && touch authorized_keys
```

Finally, copy the contents of your 'identity.pub' file into 'authorized_keys' on the remote host. The next time you authenticate (ensuring that you've used 'ssh-add' in advance) you won't need to enter your password.
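One caveat the steps above don't mention: sshd is usually configured (via its StrictModes setting) to ignore 'authorized_keys' unless the permissions are locked down. On the remote host, something like the following is a safe default (a sketch; adjust the path if you copied your public key under a different name):

```shell
# Append the public key and tighten permissions so sshd will accept the file.
mkdir -p ~/.ssh
cat identity.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```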
