http://rpi.edu/dept/arc/training/latex/resumes/
includes some nice sample resumes including http://rpi.edu/dept/arc/training/latex/resumes/res9b.pdf
Friday, December 10, 2010
Tuesday, November 9, 2010
Word cloud
of papers in my FoSER track.
combined cloud in the middle
individual papers all around
http://menzies.us/tmp/Track5uberwordle.pdf
MoE Experiments
On the x-axis, algorithms are ordered w.r.t. their loss values over all error measures and all datasets. Lines show the maximum ordering change w.r.t. win and loss values over any of the error measures. The individual rankings are in this sheet.
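For reference, here is a minimal sketch of how such win/loss counts might be tallied. It assumes a simple pairwise median comparison over per-dataset loss values; the actual rig and statistical tests live in the linked sheet.

import random
from itertools import combinations
from statistics import median

def win_loss_order(alg_losses, tolerance=0.0):
    """alg_losses: algorithm name -> list of loss values (one per dataset/error measure)."""
    wins = {a: 0 for a in alg_losses}
    losses = {a: 0 for a in alg_losses}
    for a, b in combinations(alg_losses, 2):
        ma, mb = median(alg_losses[a]), median(alg_losses[b])
        if abs(ma - mb) <= tolerance:          # treat near-equal medians as a tie
            continue
        winner, loser = (a, b) if ma < mb else (b, a)
        wins[winner] += 1
        losses[loser] += 1
    # order the algorithms best-first by (losses - wins)
    return sorted(alg_losses, key=lambda a: losses[a] - wins[a])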
Tuesday, October 26, 2010
Friday, October 22, 2010
Tuesday, October 12, 2010
IDEA - Stained Glass Visualization
Defect Sets:
Boxes with color indicate a cluster. Clusters are self-tested and ranked based on harmonic mean for the TRUE defect class. Green indicates the cluster performed in the top 25%. Yellow indicates the middle 50%, and red indicates the bottom 25%.
KC1
2,109 examples
Run time: 12.5 seconds
JM1
10,880 examples
Run time: 2 minutes
Effort sets:
Effort set performance is based on self-test MDMRE. Clusters are colored in the same manner as above.
China
499 examples
2.6 seconds
Datasets, charts for more datasets, and time results are available in WISP.
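As a rough illustration of the coloring rule described above, here is a minimal sketch (hypothetical helpers, not the WISP code). It assumes the harmonic mean is the usual F-measure of precision and recall from each cluster's self-test; the quartile cutoffs follow the green/yellow/red scheme above.

def harmonic_mean(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def color_clusters(cluster_scores):
    """cluster_scores: cluster id -> harmonic mean on the TRUE class from its self-test."""
    ranked = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
    n = len(ranked)
    colors = {}
    for rank, cid in enumerate(ranked):
        if rank < n * 0.25:
            colors[cid] = "green"    # top 25%
        elif rank < n * 0.75:
            colors[cid] = "yellow"   # middle 50%
        else:
            colors[cid] = "red"      # bottom 25%
    return colors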
Stability and Bias-Variance of LOO vs. 3Way
Stability across datasets: For each evaluation method, over 19 datasets, see how many times solos appear in top 16. The numbers and plots for all methods are here.
Bias & Variance of LOO and 3-Way: Let L be the squared loss function, f be a predictor and y be the predicted value for a particular x. Also let y* be the optimal prediction (actual response for x), and ym be the main prediction. Under squared loss function, main function becomes mean of all predictions: ym=mean(all y's). Then the following definitions follow:
Bias & Variance of LOO and 3-Way: Let L be the squared loss function, f be a predictor and y be the predicted value for a particular x. Also let y* be the optimal prediction (actual response for x), and ym be the main prediction. Under squared loss function, main function becomes mean of all predictions: ym=mean(all y's). Then the following definitions follow:
- Bias(x) = L(y*, ym)
- Var(x) = E_D(L(ym, y)), where the expectation is over D, the occurrences of x in all training sets.
- Bias_avg = E_x(Bias(x))
- Var_avg = E_x(Var(x))
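A small numerical sketch of these definitions (assuming numpy; y_preds holds the predictions for one test point x across the different training sets):

import numpy as np

def squared_loss(a, b):
    return (a - b) ** 2

def bias_variance_at_x(y_star, y_preds):
    """y_star: actual response for x; y_preds: predictions for x from each training set D."""
    y_m = np.mean(y_preds)                                    # main prediction under squared loss
    bias = squared_loss(y_star, y_m)                          # Bias(x) = L(y*, ym)
    var = np.mean([squared_loss(y_m, y) for y in y_preds])    # Var(x) = E_D(L(ym, y))
    return bias, var

# Bias_avg and Var_avg are then the means of these quantities over all test points x.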
Tuesday, October 5, 2010
IDEA - Progress
Completed so far:
- Flipping the X-axis so that the dense area of points is at the West pole (essential for logging the x-axis).
- Logging of X and Y Coordinates
- Equal width and equal frequency grid creation for m x n divisions (a small sketch follows this list).
- Separating data point into Quadrants based on the axes. (Not recursive yet)
- Coloring the plot where data exists.
- Next step will be to color in interesting* neighboring quadrants.
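A minimal sketch of the equal-width and equal-frequency grid creation mentioned in the list above (hypothetical helpers using numpy; not the IDEA code itself):

import numpy as np

def equal_width_edges(values, n_bins):
    """n_bins equal-width divisions between the min and max of one axis."""
    return np.linspace(min(values), max(values), n_bins + 1)

def equal_frequency_edges(values, n_bins):
    """n_bins divisions holding (roughly) the same number of points each."""
    return np.quantile(np.sort(values), np.linspace(0, 1, n_bins + 1))

# An m x n grid is then the cross product of the edges computed for the
# (possibly logged) x and y coordinates.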
Linear v. Dynamic Search in Molecular Structures
In the field of cheminformatics, a common task is navigating through a database of molecules that could have thousands of entries.
One task is how to store the molecules themselves. A common approach in cheminformatics is a "SMILES" string, which encodes the atoms involved plus special characters that indicate structural properties of the molecule. Previous tests in the Lewis Research Group have involved RDX adsorption (RDX is an explosive compound). The Lewis Research Group uses a default file format of .xyz, but an open file conversion system called OpenBabel exists. The SMILES string for RDX is provided below, along with a picture for comparison.
RDX Molecule
O=N(=O)N1CN(N(=O)=O)CN(N(=O)=O)C1
SMILES strings can represent the same structure in different ways, so databases use a special form known as Canonical SMILES to prevent duplicate entries. A further advantage of SMILES strings is that they break into comparable substructures.
There are clever indexing schemes for quickly searching structural properties of molecules. One is substructure keys, in which binary flags about structural properties are stored for each molecule. Another is using a hash table encoding that serves as a proximity filter.
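As a rough illustration of the substructure-key idea just described (the keys and the similarity measure below are illustrative assumptions, not any particular cheminformatics toolkit):

# Hypothetical substructure keys: each molecule gets a binary flag per substructure;
# a fast set-based similarity can then act as a proximity filter before slower searches.
SUBSTRUCTURE_KEYS = ["nitro_group", "six_membered_ring", "C-N_bond"]   # illustrative only

def fingerprint(present_substructures):
    return {key: (key in present_substructures) for key in SUBSTRUCTURE_KEYS}

def tanimoto(fp_a, fp_b):
    """Common set-based similarity over binary fingerprints."""
    both = sum(1 for k in fp_a if fp_a[k] and fp_b[k])
    either = sum(1 for k in fp_a if fp_a[k] or fp_b[k])
    return both / either if either else 0.0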
There is also the area of molecular similarity and molecular diversity. This field seeks to find similar molecules by noting differences in derived attributes. For example, it is computationally easy to compute the molecular weight of a molecule. This attribute, and others, can be used as a search term to find similar molecules or correlations with more complex attributes such as molecular adsorption in repeating lattice structures.
Other features can also be noted and collected, providing a linear database for searching.
Overall, the question is whether to use structural similarity analysis (dynamic) or molecular diversity measures (linear) for the system.
The answer, at the current moment, appears to be using both.
The first reason is that we are currently not sure which variables correlate with performance in the desired chemical property of adsorption in a lattice structure, so having more features provides a greater possibility for correlation and more accurate estimates. The linear search could also be used to select candidates for dynamic search, using a hybridized preprocessing approach.
The second reason is that the system has the advantage of being domain-specific and thus having access to domain-specific algorithms and methods. The database in question will have many properties which need to be handled in relation to their environment. For example, it is likely that there will be missing or blank features for entries in the database. The algorithm might be able to call a function to compute missing entries, which is a highly domain-specific solution not appropriate for general algorithms.
Tuesday, September 28, 2010
Visualizing Datasets with Compass - Progress
Work is partially finished. Need to add different colors for TRUE/FALSE and a gradient for MRE.
Tuesday, September 21, 2010
Does Active Learning Work?
Here's work from yesterday. Unfortunately, I encountered several walls during my look at NASA93 before beginning work on all of the other datasets.
The first section of the document describes the variables and our ranking function for selecting devious children pairs.
The second section details everything I worked on yesterday and the pitfalls encountered.
Status of Compass Comparison Defect Prediction Experiments
I'm currently working my way through the NASA and Softlab datasets shown in the table, comparing two Compass variations with different stopping rules against K-means with k = 1, 2, 4, 8, 16 and bisecting K-means with k = 4, 6, and 8. I'm using leave-one-out and 66%/33% train-test splits repeated 20 times for cross-validation.
The NASA datasets are still running, but ugly, non-formatted Softlab results are available in: http://www.unbox.org/wisp/var/kel/CompassDefectResults/
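For reference, a minimal sketch of the two evaluation protocols described above (generic Python standing in for the actual rig):

import random

def two_thirds_splits(data, repeats=20, seed=1):
    """Yield 66%/33% train/test splits, repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        yield shuffled[:cut], shuffled[cut:]

def leave_one_out(data):
    """Yield (train, test) pairs where each instance is held out once."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], [data[i]]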
Thursday, September 2, 2010
Flight of the MILL
Four brave researchers take time from their busy schedules to embark on a journey via helicopter that will bring them closer together than ever before. Their adventure can now be shared thanks to the power of a Blackberry and YouTube.
A special thanks to Jacky Keung for the helicopter. Dr. Menzies will be lucky if he gets it back.
Tuesday, August 31, 2010
Problem with Stopping Rule in Compass Defect Prediction
I've found that the stopping rule for walking through the Compass tree stops too early. Compass currently stops when the variance of the node is less than the weighted variance of the node's children.
Example:
http://github.com/abutcher/compass/raw/master/doc/dot/defect/pruned/jm1.dot.png
In the defect prediction data sets, this condition occurs at a high level (usually level 1 or 2). The result is that the cluster used for majority voting is often most of the data set rather than similar instances.

The stopping test and the weighted-variance function (from variance.lisp):

(if (< (node-variance c-node) (weighted-variance c-node))
    ...

(defun weighted-variance (c-node)
  (if (and (null (node-right c-node)) (null (node-left c-node)))
      (node-variance c-node)
      (if (or (null (node-right c-node)) (null (node-left c-node)))
          (if (null (node-right c-node))
              (node-variance (node-left c-node))
              (node-variance (node-right c-node)))
          (/ (+ (* (node-variance (node-right c-node))
                   (length (node-contents (node-right c-node))))
                (* (node-variance (node-left c-node))
                   (length (node-contents (node-left c-node)))))
             (+ (length (node-contents (node-right c-node)))
                (length (node-contents (node-left c-node))))))))

http://github.com/abutcher/compass/raw/master/trunk/src/lisp/variance.lisp
Results from Splitting Data // Oracle
Results can be found HERE.
The System
1. Randomize 20x
2. Divide data into two halves.
a. Separate first half into eras of 10 instances.
b. Use the second half to build an oracle compass tree.
3. Using the eras from (2a), build an initial tree with the first several eras.
4. Incrementally insert the instances from eras (2a) into the compass tree formed in (3).
a. After each era find the most interesting instances by finding high variance children pairs and removing the center of their union.
b. Classify instances in the naughty list using the oracle from (2b).
c. Insert the naughty list back into the incremental tree formed in (3), (4) by using their classification information from (4b).
d. To keep the tree growing, re-compass high areas of variance (leaves in the top X% of variance among leaves).
5. Compare with a standard tree (a sketch of this procedure follows).
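For concreteness, here is a rough Python restatement of that procedure. All the helpers (tree building, incremental insertion, the devious-pair query, the oracle lookup, re-compassing, the final comparison) are passed in as parameters because the real implementations live in the group's Lisp code; this only sketches the control flow.

import random

def oracle_experiment(data, build_tree, insert, query_candidates, oracle_label,
                      recompass, compare, era_size=10, seed_eras=3, rng=None):
    rng = rng or random.Random(1)
    rng.shuffle(data)                                      # step 1 (repeated 20x in the rig)
    half = len(data) // 2
    first, second = data[:half], data[half:]               # step 2
    eras = [first[i:i + era_size] for i in range(0, len(first), era_size)]   # 2a
    oracle = build_tree(second)                            # 2b: oracle compass tree
    tree = build_tree([x for era in eras[:seed_eras] for x in era])          # step 3
    for era in eras[seed_eras:]:                           # step 4
        insert(tree, era)
        naughty = query_candidates(tree)                   # 4a: high-variance children pairs
        labelled = [oracle_label(oracle, x) for x in naughty]                # 4b
        insert(tree, labelled)                             # 4c
        recompass(tree)                                    # 4d: rebuild high-variance leaves
    return compare(tree, data)                             # step 5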
Wednesday, August 25, 2010
Compass demos
Andrew Butcher writes:
Once you have source from http://unbox.org/wisp/var/butcher/compass/tags/compass-0.0.1
cd compass-0.0.1/src
make demo1
etc...
They will all pipe to less, except demo10 which outputs maxwell.dot.png.
Tuesday, August 24, 2010
Pred Comparisons, Datasets and Bolstering
Pred(25) and loss values, as well as comparisons of algorithms and datasets at 3 points over all datasets, are here.
New 30+ datasets for checking are here.
Bolstering: is it the inverse of boosting?
Boosting is supervised - neither TEAK nor Compass is supervised
Boosting is a set of weak learners - TEAK and Compass are like cascading one algorithm
Boosting weights the instances in a way (sending the misclassified instances to the next weak learner) - TEAK and Compass also weight instances (0-1 weighting: keep or eliminate for the next round)
Algorithms similar to boosting that are sometimes confused with it are called "leveraging algorithms" (need to read more on this)
Sunday, August 22, 2010
Agenda, August 24
Stuff in blue shows additions after posting 0.
Stuff in green shows additions after posting 1.
ALL:
- need to name our desk. landscape paper, name in 30 point, photo.
- we now have 2 sets of ugrad students working the
FAYOLA:
- range experiment, and on other data sets.
- need that graphic max wanted showing the complexity of our data
- thesis outline (on paper).
ADAM1:
- can you attend? want to talk about china and java W
ADAM2:
- all the latexing of the e,d,m and edm results done.
- ditto with the discretization experiments. outline of thesis.
- thesis outline (on paper)
- co-ordination with adam re java W
- need a revision of the paper that shows a good example of instability and
how it is less with W2
- receipt from japan
EKREM:
- when you used "log" pre-preprocessor, did you unlog afterwards before (say) MRE?
- check jacky's TSE corrections
- the abstract only mentions MRE. add others? or delete reference to MRE?
- check that our quotes from the paper in the review section are still accurate
- FIG5: when you checked (median(a) < median(b)), did you reverse that for pred?
- needs a bridge at end of 3.2. As can be seen in this example, it is not necessarily true that moving to a smaller set of neighbors decreases variance. As shown below, it can improve prediction accuracy if ABE takes this matter into account.
- bring (small) copies of ASE poster
- revisit all old results. which need to be done with strong data sets?
- need your full passport name
- can we do ensembles just as a simple TEAK extension? just by randomly selecting from the current sub-tree 90%, 10 times?
- (the following will take some time)
- redo the icse paper's results with
- compass
- (sum, pred25, mre)
- win%,loss%, (win-loss)%
- leave-1-one, cross-val
- china broken up. jacky says it naturally divides into a few large chunks. chinaA, chinaB, etc.
- report the number of rank positions things change.
- i.e. what is expected wriggle in an algorithm's rankings
- a simple MSR paper http://2011.msrconf.org/important-dates
ANDREW:
- seeking results on
- splitting data in half
- building a predictor (TEAK?) from one half
- walking the other half in (say) 10 eras
- after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
- extend the era[i] cluster tree with the new results found during (c)
- test the new tree on era[i+1]
- increment "i" and goto (c)
KEL:
- photos, videos of the helicopter on the blog
- Effort Estimation Experiments with Compass and K-means
- Code is ready to go (Thanks in very large part to Andrew's previous work)
- Want to discuss experiments about effort estimation and defect predictions.
- Working on Matlab versions of TEAK and Compass to work with Ekrem's Matlab rig later.
Tuesday, August 17, 2010
Defect Sets and GUIs
Defect Sets
Very large (19k pixels wide) Defect set Compass trees can be found --> HERE.
GUI
Working on a way to display variance in the Compass Explorer... either going with labels or colors --> HERE.
Changes of Note:
In-line pruning, along with square root min cluster size.
Monday, August 16, 2010
Agenda, August 17
TIM
- news: the great MSR vs PROMISE marriage
- wboy tv coverage
- request: all on the blog
- need a new time for meetings
- GUI for HALE
ADAM1
- feel free not to attend but have you got a start date from ye yang?
- Please get with Adam2 regarding "W"
- Teaching data mining. got a copy of chapter 4 of witten?
- TSE rewrite (should have gone to Jacky by now)
- Next TSE article
- PRED(25) results for X=12, X=1, X=-10
- What if we build COMBA by just flipping the pre-preprocessor?
- What if we build COMBA by just using 10 90% samples and log(K=1)?
- Bolstering vs boosting (just an idea)
- attending ASE. need to write a poster. see http://ai-at-wvu.blogspot.com/2010/08/how-to-write-posters.html
- new data sets to PROMISE
- travel to ASE. have you got visas for Belgium?
- seeking results on
- (a) splitting data in half
- (b) building a predictor (?teac) from one half
- (c) walking the other half in (say) 10 eras
- (c) after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
- (d) extend the era[i] cluster tree with the new results found during (c)
- (e) test the new tree on era[i+1]
- (f) increment "i" and goto (c)
- note that by end august we will need at least one run of IDEA on a large defect data set (e.g. PC1)
- note by mid-oct we need some (possibly gawd awful) prototype to describe to GT
- what can we brainstorm as tasks for IDEA?
- Progress towards masters thesis?
- Got time to do experiments on IDEA (formerly known as "compass")
- IDEA and TEAK in ekrem's matlab rig?
- Try defect prediction in IDEA?
- Try effort estimation with K-means, IDEA?
- word version
- need an example of error reduction in a medical domain. UCI?
- Outline for thesis. Checked in with Cox?
- Teaching LISP: do you have a copy of graham?
- decision regarding phd direction
- Outline for thesis. have you checked in with cox? update to locallessons.pdf: table of results for coc/nasa 20 trials updated. right now, each cell presents one number. i want three written down as a,b,c where "a" is for just reducing effort, "b" is for just reducing defects and "c" is the current ones reducing effort, defects, time
- also, for that week, i need the discretization study repeated. some succinct representation showing that if we discretize to small N we do as well as larger N.
- then which of the following do you want to do?
- port it to your favorite language. python?
- build a web front end where it is easy to alter the goal function and the query. in python? in awk? check it: http://awk.info/?tools/server
- start writing your thesis
- do it all!
- get with adam1 regarding understanding W and a gui for china
- still need to sort out the payment receipts. got a publication date from jsea?
- what is your timetable for the nsf stuff
Sunday, August 15, 2010
How to write posters
- For a sample poster, see http://ourmine.googlecode.com/svn/trunk/share/pdf/poster.pdf
- You can write this using
- Latex (use "svn export http://unbox.org/wisp/var/09/pom2/doc/ase09/poster/pom2poster/ poster" then run "make")
- Powerpoint : layup one page in 5pt text. Won't look as good as the Latex, but some will find it easier.
- Open Office, etc (see the Powerpoint advice)
- Basement of library can print these; you can print non-gloss ($6) or full gloss ($12).
- For test prints, A3 sheets are readable, and cheap to produce
Sunday, June 13, 2010
Active Learning
Labeling every data point is time-consuming and unlabeled data is abundant. This motivates the field of active learning, in which the learner is able to ask for the labels of specific points, but each question has a cost.
Other ideas: use two learners and query the points where they disagree, or use an SVM and only query the point closest to the hyperplane at each round (a small sketch of this follows below). The question to me is how I can adapt this to effort estimation (a regression problem). We formed a reading group for this problem; the bib file etc. are here: http://unbox.org/wisp/var/ekrem/activeLearning/Literature/ The points to be queried are usually chosen from a pool of unlabeled data points. What we need is to ask as few queries as possible and pick the points that would help the learner the most (highest information content).
Possible ideas to find the points to be queried:
1) Build a Voronoi structure and ask for the points which are a) the center of the largest circumcircle or b) the subset of Voronoi vertices whose nearest neighbors belong to different classes. This is difficult in high dimensions.
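As a small illustration of the "query the point closest to the hyperplane" idea mentioned above, here is a sketch using scikit-learn's SVC (an assumption of convenience; the group's effort-estimation rig is not shown):

import numpy as np
from sklearn.svm import SVC

def next_query(X_labelled, y_labelled, X_pool):
    """Return the index of the pool point nearest to the current SVM hyperplane."""
    clf = SVC(kernel="linear").fit(X_labelled, y_labelled)
    margins = np.abs(clf.decision_function(X_pool))     # distance-like score to the hyperplane
    return int(np.argmin(margins))                       # most uncertain pool point

# Loop: ask the oracle for the label of X_pool[next_query(...)], move it to the
# labelled set, retrain, and repeat until the query budget is spent.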
Tuesday, June 8, 2010
beamer, and zoom
Adam nelson is writing his slides using the beamer class. the slides look fabulous.
Beamer is a LaTeX document class which, unlike other approaches, can be processed with pdflatex.
Also, last time I checked, it's a standard item in most Linux/OS X/Cygwin package managers.
For notes on beamer, see http://www.math.umbc.edu/~rouben/beamer/
For an advanced guide, see http://www.math.umbc.edu/~rouben/beamer/beamer_guide.pdf
For a reason to use beamer instead of anything else, look at the zoom feature on slide 32 of http://www.math.umbc.edu/~rouben/beamer/beamer_guide.pdf: zoomable figures. Now you can include results to any level of detail.
Monday, June 7, 2010
Wednesday, June 2, 2010
Data: Feature Weighting and Instance Selection Optimization
http://unbox.org/wisp/var/bill/table.txt
Results of optimizing feature weighting and instance selection for analogy based estimation of software effort w/different methods.
Random Pre-Processor + Algorithm Results for Normal and Reduced Datasets
The results for the experiments: http://unbox.org/wisp/var/ekrem/resultsVariance/Results/FinalResults.xls
- The behavior of different pre-processor & algorithm combinations becomes more similar as the instance count gets smaller (e.g. the cocomo81s and desharnaisL3 figures)
- Some algorithms have considerably lower loss values, but there is no best algorithm for all datasets.
- The reduced variance datasets are reduced by the GAC tree (only instances in the nodes that have less than or equal to ten percent of the max variance).
- When reduction is applied 3 datasets reduce to only 2 instances: kemerer, nasa-center1, telecom1.
- Since reduction makes the dataset smaller, their results become more similar (both the plots look similar and the cases of all algorithms getting zero losses increase).
- The graphs for these experiments can be found at: http://unbox.org/wisp/var/ekrem/resultsVariance/Results/NORMAL-DATA RESULTS.zip and http://unbox.org/wisp/var/ekrem/resultsVariance/Results/REDUCED-DATA RESULTS.zip. Related plots are at http://unbox.org/wisp/var/ekrem/resultsVariance/Results/resultsPlotterTexFiles/plots.pdf
Some more results regarding GAC-simulated datasets:
- Populated datasets attain very high MdMRE and Pred(25) values.
- There is more of a pattern regarding the best algorithm.
- A better check of GAC-simulation would be simulation&prediction with leave-one-out.
Tuesday, June 1, 2010
Makefile tricks
I write to record Bryan's "embed postscript fonts" trick. Without this trick, some conference/journal submission systems won't let you submit, complaining that "fonts not embedded".
The trick is to use the "embed" rule, as called by "done" in the following Makefile. This code is available at /wisp/var/adam2/cbr/doc/Makefile
Src=model-vs-cbr-v3

all : dirs tex bib tex tex done
one : dirs tex done

done : embed
	@printf "\n\n\n======================================\n"
	@printf "see output in $(HOME)/tmp/$(Src).pdf\n"
	@printf "=======================================\n\n\n"
	@printf "\n\nWarnings (may be none):\n\n"
	grep arning $(HOME)/tmp/${Src}.log

dirs :
	- [ ! -d $(HOME)/tmp ] && mkdir $(HOME)/tmp

tex :
	- pdflatex -output-directory=$(HOME)/tmp $(Src)

embed :
	@ cd $(HOME)/tmp; \
	gs -q -dNOPAUSE -dBATCH -dPDFSETTINGS=/prepress \
	   -sDEVICE=pdfwrite \
	   -sOutputFile=$(Src)1.pdf $(Src).pdf
	@ mv $(HOME)/tmp/$(Src)1.pdf $(HOME)/tmp/$(Src).pdf

bib :
	- bibtex $(HOME)/tmp/$(Src)
lab meeting, wednesday
note meeting time: 11am
10am: skype call with adam2. adam2-
11am: meeting
1pm: break
1:45pm: bryan's defense. (edited by Bryan L. to correct time. Its 1:45, not 2:00)
newbies (make them welcome)
- Kel Cecil
- Charles Corb
- Tomi Prifti
- promise paper
- travel arrangements to Beijing
- the vision thing: Structured Machine Learning: Ten Problems for the Next Ten Years (Section 5 in Structured Machine Learning: The Next Ten Years). Machine Learning, 73, 3-23, 2008.
- journal paper
1) Stable rankings for different effort models ;
3) How to Understand Complex Models
2) Defect prediction from static code features: current results, limitations, new approaches - dod sttrt2
- clustering as compression
Towards Parameter-Free Data Mining
Clustering by Compression
Compression and Machine Learning: A New Perspective on Feature Space Vectors
Parameterless Outlier Detection in Data Stream - temporal data mining and anomaly detection
the joy of SAX: Visually Mining and Monitoring Massive Time Series
- what news?
fayola:
- did not understand your last explanation of your distance measure (i.e. is 20% more or less movement). help me, please, to obtain clarity
- does your result hold as you increase number of clusters?
- starting 3 papers
ekrem, andrew:
- start a sub-group: active learning.
- begin lit reviewing
- that experiment with people in front of interfaces..
andrew:
- what news on ddp?
- papers for comparing compass against other clusterers:
Evaluation of Hierarchical Clustering
A Comparison of Document Clustering Techniques
- what news on using teak as an instance selector for other data miners
william:
- effort estimation data table
- instance collection environment
Monday, May 24, 2010
Quick and Dirty Lisp Scripting
Here's a quick and dirty example of scripting a lisp routine so that it can be executed simply from shell.
Code can be found --> HERE
Script is --> HERE
So, I have this data simulation lisp code lying around that I'm using to generate larger datasets from smaller ones using a GAC tree. We don't need to worry about the specifics, but what I needed was a way to run it fast without jumping into slime and loading everything whenever I want to use it. Also, I'm using this lisp code to output to csv, but when you write from emacs lisp there will be a newline every 80 characters printed into the file, and that messes up how I want to use the csv's. AND, because lisp is so precise, it outputs floats as their division with a "/". I wrote a couple of python scripts to clean that up quickly, which I need to apply to each file after I generate it. A perfect example of combining a few steps into an easy-to-use bash script...
All I do is run something to the effect of this to output a big albrecht (size 1000) into a csv file named albrecht.csv:
./sampler albrecht 1000 albrecht.csv
It's not that fancy but it does one thing and it does it well, in the unix way.
For you black belts in the crowd, here's how I generate 2000 eg data files from all my existing data.
for file in `ls data/ | cut -d"." -f1`; do ./sampler $file 2000 examples/2000egs/$file.csv; done;
Wednesday, April 21, 2010
Brittle 100420085813-phpapp01
Tuesday, April 13, 2010
Monday, April 12, 2010
More Context Variables on Nasa93
Possible Context Variables for Datasets
- Cocomo81: Development mode
- Nasa93: Dev. center, flight-ground, category of application, year of development
- Desharnais: Development language
- Albrecht: None (all numeric attributes)
Component vs. All Learning - Summary
Tuesday, April 6, 2010
Monday, April 5, 2010
Context Variables
About the context variables, we have the following matrices. In parentheses, the total number of instances located in that particular context variable is given. These context variables are the ones we used to divide the datasets into subsets.
It is difficult to see a pattern here. In each case, there is one dominant column (or context variable) from which most of the instances are selected. On the other hand, this particular column is not the most densely populated context variable. Therefore:
1) Center (for nasa93), development type (for cocomo81) or development language (for desharnais) are influential context variables
2) There is one particular context variable from which always most of the instances are selected but this context variable is not the most densely populated one, so size is not a dominant indicator either.
Master's Defense - The Robust Optimization of Non-Linear Requirements Models
Master's Defense - WVU Lane Dept. CSEE
Gregory Gay
Wednesday, April 7, 2010
1:00 PM, room G112 ESB
Open to the public, so feel free to come and see me make a fool of myself. :D
which2 : a stochastic anytime rule learner
(Code: http://unbox.org/wisp/tags/which2/0.1/which2.awk.)
WHICH2 builds rules by ranking ideas, then repeatedly building new ideas by picking and combining two old ideas (favoring those with higher ranks). New ideas generated in this way are ranked and thrown back into the same pot as the old ideas so, if they are any good, they might be picked and extended in subsequent rounds. Alternatively, if the new idea stinks, it gets buried by the better ideas and is ignored.
One important aspect of the following is that the scoring routines for ideas are completely separate from the rest of the code (see the "score1" function). Hence, it is a simple matter to try out different search biases.
For example, here's the current score routine:
function score(i,rows,data,   cols,row,col,a,b,c,d,triggered,pd,pf,prec,acc,support,s,fits) {
  a=b=c=d=Pinch                 # stop divide by zero errors
  cols=Name[0]
  for(row=1;row<=rows;row++) {
    triggered = matched(row,cols,data,i)
    if (data[row,cols]) {
      if (triggered) {d++} else {b++}
    } else {
      if (triggered) {c++} else {a++}
    }
  }
  fits    = c + d
  pd      = d/(b+d)
  pf      = a/(a+c)
  prec    = d/(c+d)
  acc     = (a+d)/(a+b+c+d)
  support = (c+d)/(a+b+c+d)
  return score1(pd,pf,prec,acc,support,fits)
}
function score1(pd,pf,prec,acc,support,fits) {
  if (fits <= Overfitted) return 0
  if (Eval==1) return acc
  if (Eval==2) return 2 * pd * prec/(pd+prec)
  if (Eval==3) return 2 * pd * pf/(pd+pf)
  if (Eval==4) return support * 2 * pd * pf/(pd+pf)
  return support * 2 * pd * prec/(pd+prec)
}

Various students (ZachM and Joe) have implemented this before and found it surprisingly slow. For example, in the above, note that scoring one rule means running it over the entire data set. Which2 therefore uses two speed-up tricks:
- Rather than sorting new ideas straight away into the old ones, it generates (say) 20 new ones at each round before sorting them back into the old.
- Which2 keeps a cache of how ideas scored before. If we re-score an old idea, we just return that score (no need to recompute it). Both tricks are sketched below.
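A rough Python re-sketch of those two tricks (hypothetical helper names; ideas are assumed to be hashable, e.g. frozensets of attribute=value pairs; the real code is which2.awk):

import random

_score_cache = {}
def score_once(idea, score_fn):
    """Trick 2: an idea seen in an earlier round is never re-scored."""
    if idea not in _score_cache:
        _score_cache[idea] = score_fn(idea)
    return _score_cache[idea]

def one_round(old_ideas, combine, score_fn, batch=20):
    """Trick 1: generate a batch of new ideas before sorting them back into the old ones."""
    ranked = sorted(old_ideas, key=lambda i: score_once(i, score_fn))    # worst ... best
    weights = range(1, len(ranked) + 1)                                  # favor higher-scored ideas
    new = [combine(*random.choices(ranked, weights=weights, k=2)) for _ in range(batch)]
    return sorted(set(ranked + new), key=lambda i: score_once(i, score_fn))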
To run it on the weather data: gawk -f which2.awk weather.nominal.arff
In the following, the "candidates" are ideas that look promising and "score" ranks the candidates. If the max score does not improve from the last round, then "lives" decreases.
Each round tries random combinations of the stuff from prior rounds (favoring those things with higher scores). Hence, at round 1, all the candidates are singletons. But later on (see line 54) the candidates can grow to combinations of things.
Each round prunes the candidates so that only the better candidates survive to round+1.
The final output is the candidates of the last round. In this example, the best rule is temperature = cool or mild.
---------------------------------------------------- % class: yes seed: 1 round: 1 max: 0 lives: 5 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=mild % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool % candidate: yes,outlook=overcast,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,temperature=mild,windy=FALSE % candidate: yes,windy=FALSE [0.072352013,yes,temperature=mild,windy=FALSE]. [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.22876845,yes,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. ---------------------------------------------------- % class: yes seed: 1 round: 2 max: 0.52640158 lives: 5 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast % candidate: yes,humidity=normal,outlook=overcast,temperature=cool % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool % candidate: yes,outlook=overcast,temperature=cool,temperature=mild % candidate: yes,outlook=overcast,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,temperature=mild,windy=FALSE % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. ---------------------------------------------------- % class: yes seed: 1 round: 3 max: 0.52640158 lives: 4 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,humidity=normal,temperature=cool,windy=FALSE % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool,temperature=mild % candidate: yes,outlook=overcast,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=cool,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,temperature=mild,windy=FALSE % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. 
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. ---------------------------------------------------- % class: yes seed: 1 round: 4 max: 0.52640158 lives: 3 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast,windy=FALSE % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool % candidate: yes,outlook=overcast,temperature=cool,temperature=mild % candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,outlook=overcast,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=cool,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,temperature=mild,windy=FALSE % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. ---------------------------------------------------- % class: yes seed: 1 round: 5 max: 0.52640158 lives: 2 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast,temperature=cool % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,humidity=normal,temperature=cool,windy=FALSE % candidate: yes,humidity=normal,temperature=mild % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=cool,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. 
---------------------------------------------------- % class: yes seed: 1 round: 6 max: 0.52640158 lives: 1 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,humidity=normal,temperature=cool,windy=FALSE % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild]. ---------------------------------------------------- % class: yes seed: 1 round: 7 max: 0.52640158 lives: 0 % candidate: yes,humidity=normal % candidate: yes,humidity=normal,outlook=overcast % candidate: yes,humidity=normal,temperature=cool % candidate: yes,humidity=normal,temperature=cool,temperature=mild % candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,humidity=normal,windy=FALSE % candidate: yes,outlook=overcast % candidate: yes,outlook=overcast,temperature=cool,temperature=mild % candidate: yes,outlook=overcast,temperature=mild % candidate: yes,outlook=overcast,windy=FALSE % candidate: yes,temperature=cool % candidate: yes,temperature=cool,temperature=mild % candidate: yes,temperature=cool,temperature=mild,windy=FALSE % candidate: yes,temperature=mild % candidate: yes,windy=FALSE [0.13229596,yes,humidity=normal,temperature=cool]. [0.13264848,yes,temperature=cool]. [0.17611714,yes,humidity=normal,windy=FALSE]. [0.17652966,yes,outlook=overcast]. [0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE]. [0.22876845,yes,temperature=mild]. [0.28604711,yes,humidity=normal,temperature=cool,temperature=mild]. [0.37582652,yes,humidity=normal]. [0.40354402,yes,windy=FALSE]. [0.52640158,yes,temperature=cool,temperature=mild].
This is not the usual solution to weather but note how it covers most of the examples. The reason for this selection is that the scoring function (shown above) has a support term. Hence, this run favors things with high support.
Tuesday, March 16, 2010
Effect of various Bandwidth&Kernel Combinations in Neighbor Weighting
What is the effect of weighting your neighbors in kNN? According to 3 datasets (Coc81, Nasa93, Desharnais), the effects are as follows:
- Weighting your neighbors according to a kernel function creates significantly different results
- However, in these 3 datasets, subject to different kernels (Uniform, Triangular, Epanechnikov, Gaussian) and to the inverse rank weighted mean (proposed by Mendes et al.) with various bandwidths, we could not observe an improvement in accuracy (a small sketch of kernel weighting follows this list).
- Detailed experimental results under http://unbox.org/wisp/var/ekrem/kernel/
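For reference, a minimal sketch of kernel-weighted kNN prediction (the kernel forms below are the standard textbook ones; the exact experimental rig lives in the linked directory):

import numpy as np

def triangular(u):    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)
def epanechnikov(u):  return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
def gaussian(u):      return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_knn_estimate(distances, efforts, k, kernel, bandwidth):
    """Weight the k nearest neighbors by kernel(distance / bandwidth)."""
    nearest = np.argsort(distances)[:k]
    w = kernel(distances[nearest] / bandwidth)
    if w.sum() == 0:                          # fall back to the unweighted mean
        return float(efforts[nearest].mean())
    return float(np.average(efforts[nearest], weights=w))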
Monday, March 15, 2010
Presentation - Finding Robust Solutions to Requirements Models
Presentation to Tsinghua University - Thursday, March 18, 2010.
Based on: Gay, Gregory and Menzies, Tim and Jalali, Omid and Mundy, Gregory and Gilkerson, Beau and Feather, Martin and Kiper, James. Finding robust solutions in requirements models. Automated Software Engineering, 17(1): 87-116, 2010.
Labels:
GregG,
KEYS,
presentation,
SBSE,
treatment learning
Active Machine Learning
Unlabeled data is easy to come by and labeling that data can be a tedious task. Imagine that you've been tasked with gathering sound bites. All you have to do is walk around with a microphone and you've got your data. Once you have the data, you have to label each individual sound bite and catalogue it. Obviously, combing through and labeling all the data wouldn't be fun -- regardless of the domain.
Active machine learning is a supervised learning technique whose goal is to produce adequately labeled (classified) data with as little human interference as possible. The active learning process takes in a small chunk of data which has already been assigned a classification by a human (oracle) with extensive domain knowledge. The learner then uses that data to create a classifier and applies it to a larger set of unlabeled data. The entire learning process aims at keeping human annotation to a minimum -- only referring to the oracle when the cost of querying the data is high.
Active learning typically allows for monitoring of the learning process and offers the human expert the ability to halt learning. If the classification error grows beyond a heuristic threshold, the oracle can pull the plug on the learner and attempt to rectify the problem...
For a less high level view of Active Machine Learning see the following literature survey on the topic.
Tuesday, March 9, 2010
Brittleness vs. Number of Classes
Monday, March 8, 2010
Is kernel effect statistical?
See file kernelWinTieLossValues.xls from http://unbox.org/wisp/var/ekrem/kernel/
In summary:
- Different kernels did not yield a significant difference in performance when compared to each other
- Kernel weighting did not significantly improve the performance of different k values, i.e. there is not a significant difference between k=x and k=x plus kernel
Fastmap: Mapping objects into a k-dimensional space
The Fastmap algorithm turns data-set instances into a k-dimensional coordinate which can be primarily used for visualization and clustering purposes. Given a set of n instances from a dataset, Fastmap operates by doing the following. Assume we're using euclidean distance as the dissimilarity function and that we're operating in a 2-D space.
1. Arbitrarily choose an instance from the set and then find the farthest instance from it. These instances will serve as pivots to map the rest of our instances. Call these objects Oa and Ob. Each incoming instance Oi will be mapped using the line that exists between Oa and Ob.
2. As each new instance enters the space, apply geometric rules to extrapolate the coordinates of the new point.
Using the Cosine Law we can find Xi and then use the Pythagorean theorem to determine the coordinates of Oi as it relates to the line formed between Oa and Ob.
3. From there we can easily turn our coordinates into a visualization or use the coordinates to aid in clustering (a small sketch follows).
To move from a 2-D space to a k-D space, see http://portal.acm.org/citation.cfm?id=223812
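A compact sketch of the 2-D projection described above, assuming Euclidean distance on numeric instances (a hypothetical helper, not the paper's code):

import numpy as np

def fastmap_2d(points):
    """Project each instance onto the line between two far-apart pivots (x), plus
    the residual distance off that line (y)."""
    pts = np.asarray(points, dtype=float)
    dist = lambda a, b: np.linalg.norm(a - b)
    o_a = pts[0]                                   # arbitrary start
    o_b = max(pts, key=lambda p: dist(o_a, p))     # farthest from o_a
    o_a = max(pts, key=lambda p: dist(o_b, p))     # one refinement pass for the pivots
    d_ab = dist(o_a, o_b)
    coords = []
    for o_i in pts:
        # Cosine law: x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b))
        x = (dist(o_a, o_i)**2 + d_ab**2 - dist(o_b, o_i)**2) / (2 * d_ab)
        # Pythagoras: the part of d(a,i) not explained by the projection x
        y = np.sqrt(max(dist(o_a, o_i)**2 - x**2, 0.0))
        coords.append((x, y))
    return coords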
Tuesday, March 2, 2010
Monday, March 1, 2010
Locality in Defect Prediction
Here, we show how a group of learners performs on 7 of the NASA datasets for software defect prediction both for Within Company and Cross Company.
We also are looking at varying the K for LWL. The greatest change between PD/PF with K values of {5,10,25,50} is 5%, usually around 3%. With changes in PD/PF this small, the actual value of K does not seem to matter.
Two algorithms, Clump and Naive Bayes, are shown both with and without relevancy filtering via the Burak Filter. Although applying the Burak Filter can reduce the variance of the results (for within and cross company when the data is logged), it does not significantly affect the median PD/PF's.
Observing the interaction between PD and PF for all algorithms explored, I cannot see a case where locality is beneficial. The PD vs PF ratio of both the locality based methods and the global methods are almost identical, showing that any gain in PD/loss in PF is at the cost of additional PF/less PD.
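For reference, the relevancy filtering mentioned above works roughly as follows (a sketch under the commonly cited setting of k=10 nearest neighbors per test instance; the exact distance function and k used in these runs are assumptions here):

import numpy as np

def burak_filter(cross_company, test, k=10):
    # For every test row, keep its k nearest cross-company training rows;
    # the union of those neighbors becomes the filtered training set.
    keep = set()
    for row in test:
        d = np.sqrt(((cross_company - row) ** 2).sum(axis=1))  # Euclidean distances
        keep.update(int(i) for i in np.argsort(d)[:k])
    return cross_company[sorted(keep)]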
Friday, February 26, 2010
Review article on anomaly detection
Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly Detection : A Survey", ACM Computing Surveys, Vol. 41(3), Article 15, July 2009.
http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php
IEEE TSE accepts "Genetic Algorithms for Randomized Unit Testing"
Jamie Andrews and Tim Menzies
Randomized testing is an effective method for testing software units. Thoroughness of randomized unit testing varies widely according to the settings of certain parameters, such as the relative frequencies with which methods are called. In this paper, we describe Nighthawk, a system which uses a genetic algorithm (GA) to find parameters for randomized unit testing that optimize test coverage.
Designing GAs is somewhat of a black art. We therefore use a feature subset selection (FSS) tool to assess the size and content of the representations within the GA. Using that tool, we can reduce the size of the representation substantially, while still achieving most of the coverage found using the full representation. Our reduced GA achieves almost the same results as the full system, but in only 10% of the time. These results suggest that FSS could significantly optimize meta heuristic search-based software engineering tools.
Download: http://menzies.us/pdf/10nighthawk.pdf
Monday, February 22, 2010
Monday, February 15, 2010
Does Kernel Choice Affect Prediction Performance?
With kernel estimation, we derive a probability density function from our training data and then use that pdf to assign weights to the neighbors in a kNN approach.
When the bandwidths are assigned according to Scott's rule, we see that in critical regions (where the performance of different methods diverges) kernel-weighted selection methods perform better than the non-weighted versions. For the rest of the graphs, however, the methods perform very close to one another and it is difficult to draw a solid conclusion. Below are the graphs for the Desharnais dataset for the Triangular, Epanechnikov, and Gaussian kernels, respectively.
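As a rough illustration of what kernel-weighted selection means here, below is a sketch of kernel-weighted kNN estimation with the bandwidth set by Scott's rule; exactly how the bandwidth and kernel enter the weighting in our experiments is not spelled out in this post, so this wiring is an assumption:

import numpy as np

def triangular(u):   return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)
def epanechnikov(u): return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
def gaussian(u):     return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kernel_knn_predict(X_train, y_train, x, k=5, kernel=gaussian):
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # distances to the query
    idx = np.argsort(d)[:k]                         # the k nearest neighbors
    sigma = d.std(ddof=1)                           # Scott's rule in 1-D: h = sigma * n^(-1/5)
    h = sigma * len(d) ** (-1.0 / 5.0) if sigma > 0 else 1.0
    w = kernel(d[idx] / h)
    if w.sum() == 0:                                # kernel support missed every neighbor
        w = np.ones_like(w)                         # fall back to unweighted kNN
    return float((w * y_train[idx]).sum() / w.sum())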
Classification comparisons using DENSITY
DENSITY is a new classifier I wrote that works by
- Pruning irrelevant data
- Setting density values using a geometric series
- Ranking based on a "fairness" class-distribution weight
Defect data results: http://unbox.org/wisp/var/adam/results/density/accuracies
Component vs. Whole Defect Prediction
- Components trimmed based on number of defective modules (using medians)
- For each learner:
- 10x10-way cross-validation using training sets composed of multiple components vs. varying components (a sketch of this loop follows the list)
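A sketch of the 10x10-way cross-validation mentioned in the last bullet, with the learner and scoring function left as placeholders:

import random

def ten_by_ten_cv(rows, train_and_score, folds=10, repeats=10, seed=1):
    # Ten repeats of a shuffled 10-fold split; one score per fold per repeat.
    scores = []
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        for f in range(folds):
            test = shuffled[f::folds]
            train = [r for i, r in enumerate(shuffled) if i % folds != f]
            scores.append(train_and_score(train, test))
    return scores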
Tuesday, February 9, 2010
Generality and ``W''
To recap, we've demonstrated that ``W'' is capable of sizable median effort and "spread" reductions in software estimation. I've spent more time getting to know what types of datasets and projects ``W'' is proficient at predicting. Here's some new data:
Two new points to make: I played around with the Desharnais dataset and realized the original synthetic projects were making constraints that represented too few historical projects. The new synthetic project simply makes all independent attributes controllable, and gives results in line with the other synthetic and non-synthetic projects.
Second, multiple-goal NASA93 projects are included. The BFC-NASA93 projects seek to minimize defects, effort, and months as a whole. The reductions are in line with the effort-only version of this dataset.
Finally, the big generality statement. To make this somewhat easier to read, I've left it in list format with the frequency counts of each attribute value trailing. I'm still working on a more concise way to express this data; one giant table doesn't seem to help.
key: DATASET.project:::attribute=freq count
NASA93.flight:::acap=25
NASA93.flight:::aexp=6
NASA93.flight:::cplx=13
NASA93.flight:::data=14
NASA93.flight:::lexp=8
NASA93.flight:::pcap=9
NASA93.flight:::rely=20
NASA93.flight:::stor=17
NASA93.flight:::turn=6
NASA93.ground:::acap=16
NASA93.ground:::aexp=9
NASA93.ground:::cplx=13
NASA93.ground:::data=20
NASA93.ground:::lexp=7
NASA93.ground:::pcap=7
NASA93.ground:::rely=18
NASA93.ground:::stor=13
NASA93.ground:::time=14
NASA93.osp2:::sced=87
NASA93.osp:::acap=29
NASA93.osp:::aexp=8
NASA93.osp:::cplx=10
NASA93.osp:::sced=12
NASA93.osp:::stor=24
NASA93.osp:::tool=20
BFC-NASA93.flight:::acap=28
BFC-NASA93.flight:::aexp=2
BFC-NASA93.flight:::cplx=7
BFC-NASA93.flight:::data=15
BFC-NASA93.flight:::lexp=2
BFC-NASA93.flight:::pcap=10
BFC-NASA93.flight:::rely=11
BFC-NASA93.flight:::stor=17
BFC-NASA93.flight:::turn=9
BFC-NASA93.ground:::acap=22
BFC-NASA93.ground:::apex=9
BFC-NASA93.ground:::cplx=4
BFC-NASA93.ground:::data=14
BFC-NASA93.ground:::ltex=10
BFC-NASA93.ground:::pcap=8
BFC-NASA93.ground:::rely=10
BFC-NASA93.ground:::stor=14
BFC-NASA93.ground:::time=11
BFC-NASA93.osp2:::sced=109
BFC-NASA93.osp:::acap=43
BFC-NASA93.osp:::aexp=2
BFC-NASA93.osp:::cplx=17
BFC-NASA93.osp:::sced=13
BFC-NASA93.osp:::stor=16
BFC-NASA93.osp:::team=4
BFC-NASA93.osp:::tool=10
ISBSG-clientserver-discretized.proj1:::Norm_PDF_UFP=9
ISBSG-clientserver-discretized.proj1:::Norm_PDR_AFP=11
ISBSG-clientserver-discretized.proj1:::Projected_PDR_UFP=3
ISBSG-clientserver-discretized.proj1:::Reported_PDR_AFP=4
ISBSG-clientserver-discretized.proj1:::Size_Added=5
ISBSG-clientserver-discretized.proj1:::Size_Enquiry=2
ISBSG-clientserver-discretized.proj1:::Size_File=19
ISBSG-clientserver-discretized.proj1:::Size_Input=16
ISBSG-clientserver-discretized.proj1:::Size_Interface=6
ISBSG-clientserver-discretized.proj1:::Size_Output=4
ISBSG-standalone-discretized.proj1:::Adj_FP=1
ISBSG-standalone-discretized.proj1:::Norm_PDF_UFP=14
ISBSG-standalone-discretized.proj1:::Norm_PDR_AFP=15
ISBSG-standalone-discretized.proj1:::Projected_PDR_UFP=4
ISBSG-standalone-discretized.proj1:::Reported_PDR_AFP=5
ISBSG-standalone-discretized.proj1:::Size_Added=5
ISBSG-standalone-discretized.proj1:::Size_Enquiry=1
ISBSG-standalone-discretized.proj1:::Size_File=13
ISBSG-standalone-discretized.proj1:::Size_Input=17
ISBSG-standalone-discretized.proj1:::Size_Interface=4
ISBSG-standalone-discretized.proj1:::Size_Output=4
ISBSG-standalone-discretized.proj2:::Adj_FP=1
ISBSG-standalone-discretized.proj2:::Norm_PDF_UFP=6
ISBSG-standalone-discretized.proj2:::Norm_PDR_AFP=3
ISBSG-standalone-discretized.proj2:::Projected_PDR_UFP=2
ISBSG-standalone-discretized.proj2:::Reported_PDR_AFP=1
ISBSG-standalone-discretized.proj2:::Size_Input=4
ISBSG-standalone-discretized.proj2:::VAF=2
ISBSG-standalone-discretized.proj2:::Work_Effort=33
MAXWELL.proj1:::T01=9
MAXWELL.proj1:::T02=11
MAXWELL.proj1:::T03=8
MAXWELL.proj1:::T04=4
MAXWELL.proj1:::T05=7
MAXWELL.proj1:::T06=3
MAXWELL.proj1:::T07=23
MAXWELL.proj1:::T08=5
MAXWELL.proj1:::T09=4
MAXWELL.proj1:::T10=7
MAXWELL.proj1:::T11=6
MAXWELL.proj1:::T12=5
MAXWELL.proj1:::T13=4
MAXWELL.proj1:::T14=5
MAXWELL.proj1:::T15=4
coc81.allLarge:::acap=4
coc81.allLarge:::aexp=6
coc81.allLarge:::cplx=3
coc81.allLarge:::data=14
coc81.allLarge:::lexp=5
coc81.allLarge:::modp=1
coc81.allLarge:::pcap=4
coc81.allLarge:::rely=22
coc81.allLarge:::sced=4
coc81.allLarge:::stor=14
coc81.allLarge:::time=5
coc81.allLarge:::tool=4
coc81.allLarge:::turn=6
coc81.allLarge:::vexp=13
coc81.allLarge:::virt=1
coc81.flight:::acap=6
coc81.flight:::aexp=21
coc81.flight:::cplx=7
coc81.flight:::data=18
coc81.flight:::lexp=4
coc81.flight:::pcap=1
coc81.flight:::rely=7
coc81.flight:::stor=27
coc81.flight:::turn=19
coc81.ground:::acap=8
coc81.ground:::cplx=7
coc81.ground:::data=30
coc81.ground:::ltex=2
coc81.ground:::pcap=3
coc81.ground:::pmat=2
coc81.ground:::rely=8
coc81.ground:::stor=30
coc81.ground:::time=9
coc81.osp2:::sced=58
coc81.osp:::acap=4
coc81.osp:::aexp=6
coc81.osp:::cplx=20
coc81.osp:::sced=7
coc81.osp:::stor=23
coc81.osp:::tool=9
desharnais-discretized.proj1:::Adjustment=4
desharnais-discretized.proj1:::Entities=2
desharnais-discretized.proj1:::Language=11
desharnais-discretized.proj1:::Length=3
desharnais-discretized.proj1:::ManagerExp=2
desharnais-discretized.proj1:::PointsAjust=6
desharnais-discretized.proj1:::PointsNonAdjust=3
desharnais-discretized.proj1:::TeamExp=1
desharnais-discretized.proj1:::Transactions=1
desharnais-discretized.proj1:::YearEnd=1
These numbers are out of what was returned after 100 runs, so the total counts within one project might not match another. The take-home here is that we've been able to maintain these large effort reductions even with widely different recommendations, both within the same project and across different projects with similar metrics. Some projects such as OSP2 are very stable, only recommending SCED. Others show a distribution favoring a particular attribute, and some seem to get by recommending only a small subset of attributes.
Closer study is needed of the attributes returned. Right now I'm not looking at the recommended values, just what attribute was chosen.