Tuesday, August 31, 2010

Problem with Stopping Rule in Compass Defect Prediction

I've found that the stopping rule for walking through the Compass tree stops too early. Compass currently stops when the variance of the node is less than the weighted variance of the node's children.

;; Stopping test: stop walking when the node's own variance is below
;; the size-weighted variance of its children.
(if (< (node-variance c-node)
       (weighted-variance c-node))
    ...)

(defun weighted-variance (c-node)
  ;; Size-weighted mean of the children's variances. A leaf returns its
  ;; own variance; a node with one child returns that child's variance.
  (if (and (null (node-right c-node)) (null (node-left c-node)))
      (node-variance c-node)
      (if (or (null (node-right c-node)) (null (node-left c-node)))
          (if (null (node-right c-node))
              (node-variance (node-left c-node))
              (node-variance (node-right c-node)))
          (/ (+ (* (node-variance (node-right c-node))
                   (length (node-contents (node-right c-node))))
                (* (node-variance (node-left c-node))
                   (length (node-contents (node-left c-node)))))
             (+ (length (node-contents (node-right c-node)))
                (length (node-contents (node-left c-node))))))))


In the defect prediction data sets, this condition fires high in the tree (usually at depth 1 or 2). The result is that the cluster used for majority voting is often most of the data set rather than a small set of similar instances.
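For concreteness, here is a minimal Python sketch of the stopping rule above, assuming a simple binary-tree node with a cached variance and a list of contents. The class and field names are illustrative stand-ins, not the actual Compass code.

```python
class Node:
    def __init__(self, variance, contents, left=None, right=None):
        self.variance = variance   # variance of the instances in this node
        self.contents = contents   # list of instances held at this node
        self.left = left
        self.right = right

def weighted_variance(node):
    """Size-weighted mean of the children's variances."""
    left, right = node.left, node.right
    if left is None and right is None:
        return node.variance                 # leaf: nothing to weight
    if left is None or right is None:
        return (left or right).variance      # one child: use it directly
    n_l, n_r = len(left.contents), len(right.contents)
    return (left.variance * n_l + right.variance * n_r) / (n_l + n_r)

def stop_here(node):
    """Current rule: stop when the node beats its children on variance."""
    return node.variance < weighted_variance(node)
```

Under this rule, any node whose variance undercuts the size-weighted child variance halts the walk, which is why large nodes near the root can win and the "cluster" ends up being most of the data.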



Results from Splitting Data // Oracle

Results can be found HERE.

The System

1. Randomize 20x

2. Divide data into two halves.
a. Separate first half into eras of 10 instances.
b. Use the second half to build an oracle compass tree.

3. Using the eras from (2a), build an initial tree with the first several eras.

4. Incrementally insert the instances from the remaining eras of (2a) into the compass tree formed in (3).
a. After each era, find the most interesting instances by finding high-variance child pairs and removing the center of their union.
b. Classify the instances in the naughty list (the instances removed in (4a)) using the oracle from (2b).
c. Insert the naughty list back into the incremental tree formed in (3) and (4), using their classification information from (4b).
d. To keep the tree growing, re-compass high-variance areas (leaves in the top X% of variance among leaves).

5. Compare with a standard tree.
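The loop above (steps 2 through 4) might be sketched roughly as follows, with lists of (feature, label) pairs standing in for real compass trees, a 1-NN lookup standing in for the oracle tree, and distance from the cluster mean as a stand-in "interestingness" test. All names, structures, and thresholds here are illustrative assumptions, not the actual system.

```python
import random

def run_experiment(data, era_size=10, seed=1):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)                                # step 1: randomize
    half = len(data) // 2
    stream, oracle_data = data[:half], data[half:]   # step 2: two halves
    # step 2a: break the first half into eras
    eras = [stream[i:i + era_size] for i in range(0, len(stream), era_size)]

    def oracle(x):                                   # step 2b: oracle stand-in (1-NN)
        return min(oracle_data, key=lambda p: abs(p[0] - x))[1]

    tree = [inst for era in eras[:2] for inst in era]  # step 3: initial tree
    for era in eras[2:]:                             # step 4: incremental insertion
        mean = sum(x for x, _ in tree) / len(tree)
        # step 4a: "interesting" = far from the current cluster mean
        naughty = [p for p in era if abs(p[0] - mean) > 1.0]
        rest = [p for p in era if p not in naughty]
        tree.extend(rest)
        # steps 4b-4c: classify the naughty list via the oracle, re-insert
        tree.extend((x, oracle(x)) for x, _ in naughty)
    return tree
```

The point of the sketch is the control flow: only the naughty list ever consults the oracle; everything else is inserted with whatever labels it arrived with.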

W Java Mockup

Rough-cut of a Java GUI for the W Recommendation System

Wednesday, August 25, 2010

Compass demos

Andrew Butcher  writes:

Once you have source from http://unbox.org/wisp/var/butcher/compass/tags/compass-0.0.1

cd compass-0.0.1/src
make demo1

They will all pipe to less, except demo10 which outputs maxwell.dot.png.

Tuesday, August 24, 2010

Discretization & Stability





Pred Comparisons, Datasets and Bolstering

Pred(25) and loss values, as well as a comparison of algorithms and datasets at 3 points over all datasets, are here.

New 30+ datasets for checking are here.

Bolstering: is it the inverse of boosting?
Boosting is supervised; neither TEAK nor Compass is supervised.
Boosting combines a set of weak learners; TEAK and Compass are more like a cascade of one algorithm.
Boosting weights the instances (sending the misclassified ones to the next weak learner); TEAK and Compass also weight instances (0-1 weighting: keep or eliminate for the next round).
Algorithms similar to boosting that are sometimes confused with it are called "leveraging algorithms" (need to read more on this).
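The weighting contrast above can be made concrete with a toy sketch. The function names and numbers are illustrative only: boosting-style updates scale instance weights continuously, while the TEAK/Compass-style update is strictly keep-or-eliminate.

```python
def boost_weights(weights, wrong, factor=2.0):
    """Boosting-style update: up-weight the misclassified instances."""
    return [w * factor if i in wrong else w for i, w in enumerate(weights)]

def prune_weights(weights, eliminated):
    """TEAK/Compass-style 0-1 update: survivors get 1, eliminated get 0."""
    return [0.0 if i in eliminated else 1.0 for i, _ in enumerate(weights)]
```

In the boosting case the weight distribution drifts gradually; in the 0-1 case each round simply shrinks the instance pool.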

Sunday, August 22, 2010

Agenda, August 24

Stuff in blue shows additions after posting0.
Stuff in green shows additions after posting1.

- need to name our desk. landscape paper, name in 30 point, photo.
- we now have 2 sets of ugrad students working on the range experiment, and on other data sets.
- need that graphic max wanted showing the complexity of our data
- thesis outline (on paper).

- can you attend? want to talk about china and java W

- all the latexing of the e,d,m and edm results done.
- ditto with the discretization experiments. outline of thesis.
- thesis outline (on paper)
- co-ordination with adam re java W
- need a revision of the paper that shows a good example of instability and
how it is less with W2
- receipt from japan

- when you used the "log" pre-processor, did you unlog afterwards before computing (say) MRE?
- check jacky's TSE corrections
- the abstract only mentions MRE. add others? or delete reference to MRE?
- check that our quotes from the paper in the review section are still accurate
- FIG5: when you checked (median(a) < median(b)), did you reverse that for pred?
- needs a bridge at end of 3.2. As can be seen in this example, it is not necessarily true that moving to a smaller set of neighbors decreases variance. As shown below, it can improve prediction accuracy if ABE takes this matter into account.
- bring (small) copies of ASE poster
- revisit all old results. which need to be done with strong data sets?
- need your full passport name
- can we do ensembles just as a simple TEAK extension? just by randomly selecting 90% from the current sub-tree, 10 times?
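That ensemble idea might look roughly like the sketch below, where a median over effort values stands in for the real TEAK predictor and `subtree_efforts` is a hypothetical list of the effort values in the current sub-tree. Everything here is an assumption for illustration, not TEAK itself.

```python
import random
import statistics

def subtree_ensemble(subtree_efforts, rounds=10, keep=0.9, seed=1):
    """Build `rounds` variants by keeping `keep` of the sub-tree's
    instances at random, then combine the variants' estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(rounds):
        sample = rng.sample(subtree_efforts,
                            max(1, int(len(subtree_efforts) * keep)))
        estimates.append(statistics.median(sample))  # one variant's estimate
    return statistics.median(estimates)              # combine the variants
```

Since each 90% sample shares most of its instances with the others, the spread of the per-variant estimates would also give a cheap stability measure.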

- (the following will take some time)
- redo the icse paper's results with
- compass
- (sum, pred25, mre)
- win%,loss%, (win-loss)%
- leave-one-out, cross-val
- china broken up. jacky says it naturally divides into a few large chunks. chinaA, chinaB, etc.
- report the number of rank positions things change.
- i.e. what is expected wriggle in an algorithm's rankings
- a simple MSR paper http://2011.msrconf.org/important-dates

- seeking results on
- (a) splitting data in half
- (b) building a predictor (?teak) from one half
- (c) walking the other half in (say) 10 eras
- (d) after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
- (e) extend the era[i] cluster tree with the new results found during (d)
- (f) test the new tree on era[i+1]
- (g) increment "i" and goto (d)

- photos, videos of the helicopter on the blog
- Effort Estimation Experiments with Compass and K-means
- Code is ready to go (Thanks in very large part to Andrew's previous work)
- Want to discuss experiments about effort estimation and defect prediction.
- Working on Matlab versions of TEAK and Compass to work with Ekrem's Matlab rig later.

Tuesday, August 17, 2010

Defect Sets and GUI's

Defect Sets
Very large (19k pixels wide) Compass trees for the defect sets can be found --> HERE.

Working on a way to display variance in the Compass Explorer... either going with labels or colors --> HERE.

Changes of Note:

In-line pruning, along with a square-root minimum cluster size.

ASE presentation (1st draft)


Monday, August 16, 2010

Agenda, August 17

  1. news: the great MSR vs PROMISE marriage
  2. wboy tv coverage
  3. request: all on the blog
  4. need a new time for meetings

  1. GUI for HALE

  1. feel free not to attend but have you got a start date from ye yang?
  2. Please get with Adam2 regarding "W"
  1. Teaching data mining. got a copy of chapter 4 of witten?
  2. TSE rewrite (should have gone to Jacky by now)
  3. Next TSE article 

  • PRED(25) results for X=12, X=1, X=-10
  • What if we build COMBA by just flipping the pre-processor?
  • What if we build COMBA by just using 10 90% samples and log(K=1)?
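For reference, the PRED(25) scores in these agendas are conventionally computed from the magnitude of relative error (MRE); a minimal sketch:

```python
def mre(actual, predicted):
    """Magnitude of relative error for one estimate."""
    return abs(actual - predicted) / actual

def pred(actuals, predictions, level=25):
    """Percentage of estimates whose MRE is at most level/100."""
    hits = sum(mre(a, p) <= level / 100 for a, p in zip(actuals, predictions))
    return 100.0 * hits / len(actuals)
```

So PRED(25) = 80 would mean 80% of estimates fall within 25% of the actual value; varying the X in PRED(X) just moves that tolerance.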

  1. Bolstering vs boosting (just an idea)
  2. attending ase. need to write a poster. see http://ai-at-wvu.blogspot.com/2010/08/how-to-write-posters.html
  3. new data sets to PROMISE
  4. travel to ase. you got visas to belgium?
  1. seeking results on
  • (a) splitting data in half
  • (b) building a predictor (?teak) from one half
  • (c) walking the other half in (say) 10 eras
  • (d) after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
  • (e) extend the era[i] cluster tree with the new results found during (d)
  • (f) test the new tree on era[i+1]
  • (g) increment "i" and goto (d)
  1. note that by end august we will need at least one run of IDEA on a large defect data set (e.g. PC1)
  2. note by mid-oct we need some (possibly gawd awful) prototype to describe to GT
  3. what can we brainstorm as tasks for IDEA?
  1. Progress towards masters thesis?
  2. Got time to do experiments on IDEA (formerly known as "compass")?
  3. IDEA and TEAK in ekrem's matlab rig?
  4. Try defect prediction in IDEA?
  5. Try effort estimation with K-means, IDEA?
  1. word version
  2. need an example of error reduction in a medical domain. UCI?
  3. Outline for thesis. Checked in with Cox?
  4. Teaching LISP: do you have a copy of graham?
  5. decision regarding phd direction
  1. Outline for thesis. have you checked in with cox? update to locallessons.pdf: table of results for coc/nasa, 20 trials, updated. right now, each cell presents one number. i want three, written down as a,b,c where "a" is for just reducing effort, "b" is for just reducing defects and "c" is the current one reducing effort, defects, time
  2. also, for that week, i need the discretization study repeated. some succinct representation showing that if we discretize to small N we do as well as with larger N.
  3. then which of the following do you want to do?
  • port it to your favorite language. python?
  • build a web front end where it is easy to alter the goal function and the query. in python? in awk? check it: http://awk.info/?tools/server
  • start writing your thesis
  • do it all!
  1. get with adam1 regarding understanding W and a gui for china
  2. still need to sort out the payment receipts. got a publication date from jsea?
  3. what is your timetable for the nsf stuff

Sunday, August 15, 2010

How to write posters