ai @ wvu

Sunday, August 22, 2010

Agenda, August 24

Stuff in blue shows addits after posting0.
Stuff in green shows addits after posting1.

ALL:

- need to name our desk. landscape paper, name in 30 point, photo.
- we now have 2 sets of ugrads students working the

FAYOLA:
- range experiment, and on other data sets.
- need that graphic max wanted showing the complexity of our data
- thesis outline (on paper).

ADAM1:
- can you attend? want to talk abut china and java W

ADAM2:
- all the latexing of the e,d,m and edm results done.
- ditto with the discretization experiments. outline of thesis.
- thesis outline (on paper)
- co-ordination with adam re java W
- need a revision of the paper that shows a good example of instability and
how it is less with W2
- receipt from japan

EKREM:
- when you used "log" pre-preprocessor, did you unlog afterwards before (say) MRE?
- check jacky's TSE corrections
- the abstract only mentions MRE. add others? or delete reference to MRE?
- check that our quotes from the paper in the review section are still accurate
- FIG5: when you checked (median(a) < median(b)), did you reverse that for pred?
- needs a bridge at end of 3.2. As can be seen in this example, it is not necessarily true that moving to a smaller set of neighbors decreases variance. As shown below, it can improve prediction accuracy if ABE takes this matter into account.
- bring (small) copies of ASE poster
- revist all old results. which need to be done with strong data sets?
- need your full passport name
- can we do ensembles just as a simple TEAK extension? just by randomly selecting from the current sub-tree 90%, 10 times?

- (the following will take some time)

- redo the icse paper's results with

- compass

- (sum, pred25, mre)

- win%,loss%, (win-loss)%

- leave-1-one, cross-val

- china broken up. jacky says it naturally divides into a few large chunks. chinaA, chinaB, etc.

- report the number of rank positions things change.

- i.e. what is expected wriggle in an algorithm's rankings

- a simple MSR paper http://2011.msrconf.org/important-dates

ANDREW:
- seeking results on
- splitting data in half
- building a predictor (?teac) from one half
- walking the other one half in in (say) 10 eras
- after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
- extend the era[i] cluster tree with the new results found during (c)
- test the new tree on era[i+i]
- increment "i" and goto (c)

KEL:
- photos, videos of the helicopter on the blog
- Effort Estimation Experiments with Compass and K-means
- Code is ready to go (Thanks in very large part to Andrew's previous work)
- Want to discuss experiments about effort estimation and defect predictions.
- Working on Matlab versions of TEAK and Compass to work with Ekrem's Matlab rig later.

Tuesday, August 17, 2010

Defect Sets and GUI's

Defect Sets
Very large (19k pixels wide) Defect set Compass trees can be found --> HERE.

GUI
Working on a way to display variance in the Compass Explorer... either going with labels or colors --> HERE.

Changes of Note:

In-line pruning, along with square root min cluster size.

ASE presentation (1st draft)

http://unbox.org/wisp/var/ekrem/ase/poster/poster-1st-draft.pdf

Monday, August 16, 2010

Agenda, August 17

TIM

news: the great MSR vs PROMISE marriage
wboy tv coverage
request: all on the blog
need a new time for meetings

WILLIAM

GUI for HALE

ADAM1

feel free not to attend but have you got a start date from ye yang?
Please get with Adam2 regarding "W"

EKREM

Reaching data mining. got a copy of chapter 4 of witten?
TSE rewrite (should have gone to Jacky by now)
Next TSE article

PRED(25) results for X=12, X=1, X=-10
What is we build COMBA by just fliiping the pre-preprocessor?
What is we build COMBA by just using 10 90% samples and log(K=1)?

Boolstering vs boosting (just an idea)
atteding ase. need to write a poster. see http://ai-at-wvu.blogspot.com/2010/08/how-to-write-posters.html
new data sets to PROMISE
travel to ase. you got visas to belgium

ANDREW

seeking results on

(a) splitting data in half
(b) building a predictor (?teac) from one half
(c) walking the other one half in in (say) 10 eras
(c) after clustering the other half from era[i], find the most informative regions to query and send those queries to the predictor
(d) extend the era[i] cluster tree with the new results found during (c)
(e) test the new tree on era[i+i]
(f) increment "i" and goto (c)

note that by end august we will need at least one run of IDEA on a large defect data set (e.g. PC1)
note by mid-oct we need some (possibly gawd awful) prototype to describe to GT
what can we brainstorm as tasks for IDEA?

KEL

Progress towards masters thesis?
Got time to do experiments on IDEA (formally known as "compass")
IDEA and TEAK in ekrem's matlab rig?
Try defect prediction in IDEA?
Try effort estimation with K-means, IDEA?

FAYOLA

word version
need an example of error reduction in a medical domain. UCI?
Outline for thesis. Checked in with Cox?
Teaching LISP: do you have a copy of graham?
decision regarding phd direction

ADAM2

Outline for thesis. have you checked in with cox? update to locallessons.pdf table of results for coc/nasa 20 trials updated. right now, each cell presents one number. i want three written down as a,b,c where "a" is for just reducing effort, "b" is for just reducing detects and "c" is the current ones reducing effort,defects, time
also, for that week, i need the discretization study repeated. some succinct representation showing that if we discretize to small N we do as well as larger N.
then which of the following do you want to do?

port it to your favorite language. python?
build a web front end where it is easy to alter the goal function and the query. in python? in awk? check it: http://awk.info/?tools/server
start writing your thesis
do it all!

get with adam1 regarding understanding W and a gui for china
still need to sort out the payment receipts. got a publication date from jsea?
what is your timetable for the nsf stuff

Sunday, August 15, 2010

How to write posters

For a sample poster, see http://ourmine.googlecode.com/svn/trunk/share/pdf/poster.pdf
You can write this using

Latex (use "svn export http://unbox.org/wisp/var/09/pom2/doc/ase09/poster/pom2poster/ poster" then run "make")
Powerpoint : layup one page in 5pt text. Won't look as good as the Latex, but some will find it easier.
Open Office, etc (see the Powerpoint advice)

Basement of library can print these; you can print non-gloss ($6) or full gloss ($12).

For test prints, A3 sheets are readable, and cheap to produce

Sunday, June 13, 2010

Active Learning

Labeling every data point is time consuming and unlabeled data is abundant. This motivates the field of active learning, in which learner is able to ask for labels of specific points, but each question is charged.

The points to be queried are usually chosen from a pool of unlabeled data points. What we need is to ask as few as possible queries and pick up points that would help the learner the most (highest information content).

Possible ideas to find the points to be queried:

1) Build a Voronoi structure and ask the points which are a) center of largest circumcirle or b) subset of Voronoi vertices whose nearest neighbors belong to different classes. It is difficult for high dimensions.

Another ideas: Use two learners and ask points where they disagree, use SVM and only ask point closest to hyperplane at each round. The question to me is how I can adapt it to effort estimation (a regression problem). We formed a reading group for this problem, the bib file etc. are here: http://unbox.org/wisp/var/ekrem/activeLearning/Literature/

Tuesday, June 8, 2010

beamer, and zoom

Adam nelson is writing his slides using the beamer class. the slides look fabulous.

Beamer is a latex style file which, unlike other approaches, can be processed with pdflatex

Also, last time i checked, its a standard item on most linux /osX /cygwin package managers

For notes on beamer, see http://www.math.umbc.edu/~rouben/beamer/

For an advanced guide, see http://www.math.umbc.edu/~rouben/beamer/beamer_guide.pdf

For a reason to use beamer instead of anything else, look at the zoom feature on slide 32 of http://www.math.umbc.edu/~rouben/beamer/beamer_guide.pdf: zoomable figures. Now you can include results to any level of detail.

t

Monday, June 7, 2010

Abstracts - first draft

http://unbox.org/wisp/var/fayola/forensics/

http://unbox.org/wisp/var/fayola/code/docs/

Wednesday, June 2, 2010

Data: Feature Weighting and Instance Selection Optimization

http://unbox.org/wisp/var/bill/table.txt

Results of optimizing feature weighting and instance selection for analogy based estimation of software effort w/different methods.

Random Pre-Processor + Algorithm Results for Normal and Reduced Datasets

The results for the experiments:http://unbox.org/wisp/var/ekrem/resultsVariance/Results/FinalResults.xls

Behavior of different pre-processor&algorithm combinations become more similar as the instances size gets smaller (e.g. cocomo81s and desharnaisL3 figures)
Some algorithms have considerable less loss values, but there is no best algorithm for all datasets.
The reduced variance datasets are reduced by the GAC tree (only instances in the nodes that have less than or equal to ten percent of the max variance).
When reduction is applied 3 datasets reduce to only 2 instances: kemerer, nasa-center1, telecom1.
Since reduction makes the dataset smaller, their results become more similar (both the plots look similar and the cases of all algorithms getting zero losses increase).
The graphs for these experiments can be found at:http://unbox.org/wisp/var/ekrem/resultsVariance/Results/NORMAL-DATA RESULTS.zip andhttp://unbox.org/wisp/var/ekrem/resultsVariance/Results/REDUCED-DATA RESULTS.zip Related plots are at http://unbox.org/wisp/var/ekrem/resultsVariance/Results/resultsPlotterTexFiles/plots.pdf

Some more results regarding GAC-simulated datasets:

Populated datasets attain very high MdMRE and Pred(25) values.
There is more of a pattern regarding the best algorithm.
A better check of GAC-simulation would be simulation&prediction with leave-one-out.

Tuesday, June 1, 2010

Makefile tricks

I write to record Bryan's "embed postscript fonts" trick. Without this trick, some conference/journal submission systems won't let you submit, complaining that "fonts not embedded".

The trick is to use the "embed" rule, as called by "done" in the following Makefile. This code is available at /wisp/var/adam2/cbr/doc/Makefile

Src=model-vs-cbr-v3
all : dirs tex bib tex tex done
one : dirs tex done

done : embed
@printf "\n\n\n======================================\n"
@printf "see output in $(HOME)/tmp/$(Src).pdf\n"
@printf "=======================================\n\n\n"
@printf "\n\nWarnings (may be none):\n\n"
grep arning $(HOME)/tmp/${Src}.log
dirs :
- [ ! -d $(HOME)/tmp ] && mkdir $(HOME)/tmp
tex :
- pdflatex -output-directory=$(HOME)/tmp $(Src)
embed :
@ cd $(HOME)/tmp; \
   gs -q -dNOPAUSE -dBATCH -dPDFSETTINGS=/prepress \
   -sDEVICE=pdfwrite \
   -sOutputFile=$(Src)1.pdf $(Src).pdf
@ mv $(HOME)/tmp/$(Src)1.pdf $(HOME)/tmp/$(Src).pdf
bib :
- bibtex $(HOME)/tmp/$(Src)

lab meeting, wednesday

note meeting time: 11am

10am: skype call with adam2. adam2-
11am: meeting
1pm: break
1:45pm: bryan's defense. (edited by Bryan L. to correct time. Its 1:45, not 2:00)

newbies (make them welcome)

Kel Cecil
Charles Corb
Tomi Prifti

adam1:

promise paper
travel arrangements to bejing

timm:

the vision thing Structured Machine Learning: Ten Problems for the Next Ten Years (Section 5 in Structured Machine Learning: The Next Ten Years). Machine Learning, 73, 3-23, 2008.
journal paper

1) Stable rankings for different effort models ;
2) Defect prediction from static code features: current results, limitations, new approaches
3) How to Understand Complex Models
dod sttrt2
clustering as compression
Towards Parameter-Free Data Mining
Clustering by Compression
Compression and Machine Learning: A New Perspective on Feature Space Vectors
Parameterless Outlier Detection in Data Stream
temporal data mining and anomaly detection
the joy of sax: Visually Mining and Monitoring Massive Time Series

tomi:

what news?

fayola:

did not understand your last explanation of your distance measure (i.e. is 20% more or less movement). help me, please, to obtain clarity
does your result hold as you increase number of clusters?
starting 3 papers

ekrem, andrew:

start a sub-group: active learning.
begin lit reviewing
that experiment with people in front of interfaces..

andrew-

what news on ddp?
papers for comparing compass against other clusterers:
Evalaution of Hierarchical Clustering
A Comparison of Document Clustering Techniques

ekrem:

what news on using teac as an instance selector for other data miners

william:

effort estimation data table
-instance collection environment

Monday, May 24, 2010

Quick and Dirty Lisp Scripting

Here's a quick and dirty example of scripting a lisp routine so that it can be executed simply from shell.

Code can be found --> HERE
Script is --> HERE

So, I have this data simulation lisp code lying around that I'm using to generate larger datasets from smaller ones using a GAC tree. We don't need to worry about the specifics but what I needed was a way to run it fast without jumping into slime and loading everything whenever I want to use it. Also, I'm using this lisp code to output to csv but when you write from emacs lisp there will be a newline every 80 characters printed into the file and that messes up how I want to use the csv's. AND, because lisp is so precise, it outputs floats as their division with a "/". I wrote a couple python scripts to clean that up quickly which I need to apply that to each file after I generate it. Perfect example of compounding a few steps into an easy to use bash script...

All I do is run something to the effect of this to output a big albrecht (size 1000) into a csv file named albrecht.csv:

./sampler albrecht 1000 albrecht.csv

It's not that fancy but it does one thing and it does it well, in the unix way.

For you black belts in the crowd, here's how I generate 2000 eg data files from all my existing data.

for file in `ls data/ | cut -d"." -f1`; do ./sampler $file 2000 examples/2000egs/$file.csv; done;

Wednesday, April 21, 2010

Brittle 100420085813-phpapp01

Check out this SlideShare Presentation:

Brittle 100420085813-phpapp01

View more presentations from fayola21.

Tuesday, April 13, 2010

which2 on UCI data sets

Results from Bryan: HERE

Monday, April 12, 2010

More Context Variables on Nasa93

Possible Context Variables for Datasets

Cocomo81: Development mode
Nasa93: Dev. center, flight-ground, category of application, year of development
Desharnais: Development language
Albrecht: None (all numeric attributes)

Previous results suggested that there is not a strong correlation between above context variables and selection. Further experiments for Nasa93 (flight-ground, category of application, year of development) point to a similar conclusion: Afore mentioned context variables do not have strong influence on selection.

Component vs. All Learning - Summary

It appears, from our experiments, that learning on dense components and testing on all components wins over the traditional learning on all, and testing on all. Here is the summary table.

Tuesday, April 6, 2010

Brittleness

Brittle

View more presentations from fayola21.

Brittle Results

Monday, April 5, 2010

Context Variables

About the context variables, we have the following matrices. In parenthesis, the total number of instances located in this particular context variable is written. Those context variables are the ones which we used to divide the datasets into subsets.

It is difficult to see a pattern here. In each case, there is one dominant column (or context variable) from which most of the instances are selected. On the other hand, this particular column is not the most densely populated context variable. Therefore:
1) Center (for nasa93), development type (for cocomo81) or development language (for desharnais) are influential context variables
2) There is one particular context variable from which always most of the instances are selected but this context variable is not the most densely populated one, so size is not a dominant indicator either.

Master's Defense - The Robust Optimization of Non-Linear Requirements Models

Master's Defense - WVU Lane Dept. CSEE
Gregory Gay
Wednesday, April 7, 2010
1:00 PM, room G112 ESB

Open to the public, so feel free to come and see me make a fool of myself. :D

which2 : a stochastic anytime rule learner

(Code: http://unbox.org/wisp/tags/which2/0.1/which2.awk.)

WHICH2 builds rules by ranking ideas, then repeatedly building new ideas by picking and combining two old ideas (favoring those with higher ranks). New ideas generated in this way are ranked and thrown back into the same pot as the old ideas so, if they are any good, they might be picked and extended in subsequent rounds. Alternatively, if the new idea stinks, it gets buried by the better ideas and is ignore.

One important aspect of the following is that the scoring routines for ideas are completely seperate from the rest of the code (see the "score1" function). Hence, it is a simple matter to try our different search biases.
For example, here's the current score routine:

function score(i,rows,data, 
            cols,row, col,a,b,c,d,triggered,pd,pf,

prec,acc,supports,s,fits)

{
    a=b=c=d=Pinch # stop divide by zero errors
    cols=Name[0]
    for(row=1;row<=rows;row++) {
        triggered = matched(row,cols,data,i)
        if (data[row,cols]) {
            if (triggered) {d++} else {b++}
        } else {
            if (triggered) {c++} else {a++}
        }
    } 
        fits    = c + d
    pd      = d/(b+d)
    pf      = a/(a+c)
    prec    = d/(c+d)
    acc     = (a+d)/(a+b+c+d)
    support = (c+d)/(a+b+c+d)
    return score1(pd,pf,prec,acc,support,fits)
}
function score1(pd,pf,prec,acc,supportm,fits) {
        if (fits <= Overfitted)
                return 0
    if (Eval==1) return acc
    if (Eval==2) return 2 * pd * prec/(pd+prec)
    if (Eval==3) return 2 * pd * pf/(pd+pf)
    if (Eval==4) return support * 2 * pd * pf/(pd+pf)
    return support * 2 * pd * prec/(pd+prec)
}

Various students (ZachM and Joe) have implemented this before and found it surprisingly slow. For example, in the above, note that scoring one rule means running it over the entire data set. Which2 therefore uses two speed up tricks:

Rather that sorting new ideas straight away into the old ones, it generates (say) 20 new ones at each round before sorting them back into the old.
Which2 keeps a cache of how ideas scored before. If we re-score an old idea, we just return that score (no need to recompute it).

The result is pretty zip: under a second for the 2000+ records of titanic.arff. e.g. This call produces the following output.

gawk -f which2.awk weather.nominal.arff

In the following, the "candidates" are ideas that look promising and "score" ranks the candidates. If the max score does not improve from the last round, then "lives" decreases.

Each round tries random combinations of the stuff from prior rounds (favoring those things with higher scores). Hence, at round 1, all the candidates are singletons. But. later on (see line 54) the candidates can grow to combinations of things.

Each round prunes the candiates so that only the better candidates survive to round+1.
The final output are the candidates of the last round. In this example, the best rule is tempareture = cool or mild.

----------------------------------------------------
% class: yes seed: 1 round: 1 max: 0 lives: 5

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.072352013,yes,temperature=mild,windy=FALSE].
[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.22876845,yes,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 2 max: 0.52640158 lives: 5

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,outlook=overcast,temperature=cool
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 3 max: 0.52640158 lives: 4

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 4 max: 0.52640158 lives: 3

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,temperature=mild,windy=FALSE
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 5 max: 0.52640158 lives: 2

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast,temperature=cool
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,temperature=mild
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 6 max: 0.52640158 lives: 1

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,temperature=cool,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

----------------------------------------------------
% class: yes seed: 1 round: 7 max: 0.52640158 lives: 0

% candidate: yes,humidity=normal
% candidate: yes,humidity=normal,outlook=overcast
% candidate: yes,humidity=normal,temperature=cool
% candidate: yes,humidity=normal,temperature=cool,temperature=mild
% candidate: yes,humidity=normal,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,humidity=normal,windy=FALSE
% candidate: yes,outlook=overcast
% candidate: yes,outlook=overcast,temperature=cool,temperature=mild
% candidate: yes,outlook=overcast,temperature=mild
% candidate: yes,outlook=overcast,windy=FALSE
% candidate: yes,temperature=cool
% candidate: yes,temperature=cool,temperature=mild
% candidate: yes,temperature=cool,temperature=mild,windy=FALSE
% candidate: yes,temperature=mild
% candidate: yes,windy=FALSE

[0.13229596,yes,humidity=normal,temperature=cool].
[0.13264848,yes,temperature=cool].
[0.17611714,yes,humidity=normal,windy=FALSE].
[0.17652966,yes,outlook=overcast].
[0.20419978,yes,temperature=cool,temperature=mild,windy=FALSE].
[0.22876845,yes,temperature=mild].
[0.28604711,yes,humidity=normal,temperature=cool,temperature=mild].
[0.37582652,yes,humidity=normal].
[0.40354402,yes,windy=FALSE].
[0.52640158,yes,temperature=cool,temperature=mild].

This is not the usual solution to weather but note how it covers most of the examples. The reason for this selection is that the scoring function (shown above) has a support term. Hence, this run favors things with high support.

Tuesday, March 16, 2010

Performance of 'nb' on Winshield Dataset

Effect of various Bandwidth&Kernel Combinations in Neighbor Weighting

What is the effect of weighting your neighbors in kNN? Acc. to 3 datasets (Coc81, Nasa93, Desharnais), the effects are as follows:

Weighting your neighbors acc. to a kernel function creates significantly different results
However, in these 3 datasets subject to different kernels (Uniform, Triangular, Epanechnikov, Gaussian) and to inverse rank weighted mean (proposed by Mendes et. al.) with various bandwidths, we could not observe an improvement in accuracies.
Detailed experimental results under http://unbox.org/wisp/var/ekrem/kernel/

Monday, March 15, 2010

Presentation - Finding Robust Solutions to Requirements Models

Presentation to Tsinghua University - Thursday, March 18, 2010.

Based on: Gay, Gregory and Menzies, Tim and Jalali, Omid and Mundy, Gregory and Gilkerson, Beau and Feather, Martin and Kiper, James. Finding robust solutions in requirements models. Automated Software Engineering, 17(1): 87-116, 2010.

Active Machine Learning

Unlabeled data is easy to come by and labeling that data can be a tedious task. Imagine that you've been tasked with gathering sound bytes. All you have to do is walk around with a microphone and you've got your data. Once you have the data you have to label each individual sound byte and catalogue it. Obviously, combing and labeling all the data wouldn't be fun -- regardless of the domain.

Active machine learning is a supervised learning technique whose goal is to produce adequately labeled (classified) data with as little human interference as possible. The active learning process takes in a small chunk of data which has already been assigned a classification by a human (oracle) with extensive domain knowledge. The learner then uses that data to create a classifier and applies it to a larger set of unlabeled data. The entire learning process aims at keeping human annotation to a minimum -- only referring to the oracle when the cost of querying the data is high.

Active learning typically allows for monitoring of the learning process and offers the human expert the ability to halt learning. If the classification error grows beyond a heuristic the oracle can pull the plug on the learner and attempt to rectify the problem...

For a less high level view of Active Machine Learning see the following literature survey on the topic.

Tuesday, March 9, 2010

Does learning on duplicated or unique defects improve software defect prediction in the nasa mdp data?

The short answer is "no".

Brittleness vs. Number of Classes

The following graph shows the brittleness level of the windshield data set as the number of classes are increased with k-means clustering.

Monday, March 8, 2010

Is kernel effect statistical?

See file kernelWinTieLossValues.xls from http://unbox.org/wisp/var/ekrem/kernel/

In summary:

Different kernels did not yield a significant difference in performance when compared to each other
Kernel weighting did not significantly improve the performance of different k values, i.e. there is not a significant difference between k=x and k=x plus kernel

Fastmap: Mapping objects into a k-dimensional space

The Fastmap algorithm turns data-set instances into a k-dimensional coordinate which can be primarily used for visualization and clustering purposes. Given a set of n instances from a dataset, Fastmap operates by doing the following. Assume we're using euclidean distance as the dissimilarity function and that we're operating in a 2-D space.

1. Arbitrarily choose an instance from the set and then find the farthest instance from it. These instances will serve as pivots to map the rest of our instances. Call these objects Oa and Ob. Each incoming instance Oi will be mapped using the line that exists between Oa and Ob.

2. As each new instance enters the space, apply geometric rules to extrapolate the coordinates of the new point.

Using the Cosine Law we can find Xi and then use the Pythagorean theorem to determine the coordinates of Oi as it relates to the line formed between Oa and Ob.

3. From there we can easily turn our coordinates into a visualization or use the coordinates to aid in clustering.

To move from a 2-D space to a k-D space, see http://portal.acm.org/citation.cfm?id=223812

Tuesday, March 2, 2010

Working With Windshield Data

1. Clustered to determine classes.
2. Found mean pd/pf

http://unbox.org/wisp/var/fayola/forensics/Class_1.pdf

Monday, March 1, 2010

Locality in Defect Prediction

Here, we show how a group of learners performs on 7 of the NASA datasets for software defect prediction both for Within Company and Cross Company.

We also are looking at varying the K for LWL. The greatest change between PD/PF with K values of {5,10,25,50} is 5%, usually around 3%. With changes in PD/PF this small, the actual value of K does not seem to matter.

Two algorithms, Clump and Naive Bayes, are shown both with and without relevancy filtering via the Burak Filter. Although applying the Burak Filter can reduce the variance of the results (for within and cross company when the data is logged), it does not significantly affect the median PD/PF's.

Observing the interaction between PD and PF for all algorithms explored, I cannot see a case where locality is beneficial. The PD vs PF ratio of both the locality based methods and the global methods are almost identical, showing that any gain in PD/loss in PF is at the cost of additional PF/less PD.

Friday, February 26, 2010

Review article on anomaly detection

Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly Detection : A Survey", ACM Computing Surveys, Vol. 41(3), Article 15, July 2009.

http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php

IEEE TSE accepts "Genetic Algorithms for Randomized Unit Testing"

Jamie Andrews and Tim Menzies

Randomized testing is an effective method for testing software units. Thoroughness of randomized unit testing varies widely according to the settings of certain parameters, such as the relative frequencies with which methods are called. In this paper, we describe Nighthawk, a system which uses a genetic algorithm (GA) to ﬁnd parameters for randomized unit testing that optimize test coverage.

Designing GAs is somewhat of a black art. We therefore use a feature subset selection (FSS) tool to assess the size and content of the representations within the GA. Using that tool, we can reduce the size of the representation substantially, while still achieving most of the coverage found using the full representation. Our reduced GA achieves almost the same results as the full system, but in only 10% of the time. These results suggest that FSS could signiﬁcantly optimize meta heuristic search-based software engineering tools.

Download: http://menzies.us/pdf/10nighthawk.pdf

Webcast: "Finding Local Lessons in SE"

Speaker: Tim Menzies

http://coursecast02.cs.wayne.edu:8080/ess/echo/presentation/d1db6a70-7295-4b90-b1b8-36096cc1d21f

Monday, February 22, 2010

Yet more better ways to do k-means

Monday, February 15, 2010

Does Kernel Choice Affect Prediction Performance?

What we do with the kernel estimation is to come up with a probability distribution function that is derived from our train data and then use this pdf to assign weights to our neighbors in a kNN approach.

Well when the bandwidths are assigned in accordance with Scott's Rule, we see that at critical regions (where the performance of different methods diverge) kernel weighted selection methods perform better than non-weighted versions. However, for the rest of the graphs, the performance of methods are very close to each other and it is difficult to draw a solid conclusion. Below are the graphs for Desharnais dataset for Triangular, Epanechnikov and Gaussian Kernels respectively.

Classification comparisons using DENSITY

DENSITY is a new classifier I wrote that works by

Pruning irrelevant data
Set density values using geometric series
Rank based on "fairness" class distribution weight

UCI data results: http://unbox.org/wisp/var/adam/results/density/times-accs-pdpf-results.pdf

Defect data results: http://unbox.org/wisp/var/adam/results/density/accuracies

Component vs. Whole Defect Prediction

Components trimmed based on number of defective modules (using medians)
For each learner..
10X10-way cross val using training sets composed of multiple components vs. varying components

for page, see http://unbox.org/wisp/var/adam/results/defect_dist/datas/mwq

Tuesday, February 9, 2010

Generality and ``W''

To recap, we've demonstrated that ``W'' is capable of sizable median effort and "spread" reductions in software estimation. I've spent more time getting to know what types of datasets and projects ``W'' is proficient at predicting. Here's some new data:

Two new points to make: I played around with the Desharnais dataset and realized the original synthetic projects where making constraints that represented too few historical projects. The new synthetic project simply makes all independent attributes controllable, and gives results in-line with the other synthetic and non-synthetic project.

Second, multiple-goal NASA93 projects are included. The BFC-NASA93 projects seek to minimize the defects, effort, and months as a whole. The reductions are in-line with the effort-only version of this dataset.

Finally, the big generality statement. To make this somewhat easier to read, I've left it in list format with the frequency counts of each attribute value trailing. I'm still working on a more concise way to express this data, one giant table doesn't seem to help.

key: DATASET.project:::attribute=freq count

NASA93.flight:::acap=25

NASA93.flight:::aexp=6

NASA93.flight:::cplx=13

NASA93.flight:::data=14

NASA93.flight:::lexp=8

NASA93.flight:::pcap=9

NASA93.flight:::rely=20

NASA93.flight:::stor=17

NASA93.flight:::turn=6

NASA93.ground:::acap=16

NASA93.ground:::aexp=9

NASA93.ground:::cplx=13

NASA93.ground:::data=20

NASA93.ground:::lexp=7

NASA93.ground:::pcap=7

NASA93.ground:::rely=18

NASA93.ground:::stor=13

NASA93.ground:::time=14

NASA93.osp2:::sced=87

NASA93.osp:::acap=29

NASA93.osp:::aexp=8

NASA93.osp:::cplx=10

NASA93.osp:::sced=12

NASA93.osp:::stor=24

NASA93.osp:::tool=20

BFC-NASA93.flight:::acap=28

BFC-NASA93.flight:::aexp=2

BFC-NASA93.flight:::cplx=7

BFC-NASA93.flight:::data=15

BFC-NASA93.flight:::lexp=2

BFC-NASA93.flight:::pcap=10

BFC-NASA93.flight:::rely=11

BFC-NASA93.flight:::stor=17

BFC-NASA93.flight:::turn=9

BFC-NASA93.ground:::acap=22

BFC-NASA93.ground:::apex=9

BFC-NASA93.ground:::cplx=4

BFC-NASA93.ground:::data=14

BFC-NASA93.ground:::ltex=10

BFC-NASA93.ground:::pcap=8

BFC-NASA93.ground:::rely=10

BFC-NASA93.ground:::stor=14

BFC-NASA93.ground:::time=11

BFC-NASA93.osp2:::sced=109

BFC-NASA93.osp:::acap=43

BFC-NASA93.osp:::aexp=2

BFC-NASA93.osp:::cplx=17

BFC-NASA93.osp:::sced=13

BFC-NASA93.osp:::stor=16

BFC-NASA93.osp:::team=4

BFC-NASA93.osp:::tool=10

ISBSG-clientserver-discretized.proj1:::Norm_PDF_UFP=9

ISBSG-clientserver-discretized.proj1:::Norm_PDR_AFP=11

ISBSG-clientserver-discretized.proj1:::Projected_PDR_UFP=3

ISBSG-clientserver-discretized.proj1:::Reported_PDR_AFP=4

ISBSG-clientserver-discretized.proj1:::Size_Added=5

ISBSG-clientserver-discretized.proj1:::Size_Enquiry=2

ISBSG-clientserver-discretized.proj1:::Size_File=19

ISBSG-clientserver-discretized.proj1:::Size_Input=16

ISBSG-clientserver-discretized.proj1:::Size_Interface=6

ISBSG-clientserver-discretized.proj1:::Size_Output=4

ISBSG-standalone-discretized.proj1:::Adj_FP=1

ISBSG-standalone-discretized.proj1:::Norm_PDF_UFP=14

ISBSG-standalone-discretized.proj1:::Norm_PDR_AFP=15

ISBSG-standalone-discretized.proj1:::Projected_PDR_UFP=4

ISBSG-standalone-discretized.proj1:::Reported_PDR_AFP=5

ISBSG-standalone-discretized.proj1:::Size_Added=5

ISBSG-standalone-discretized.proj1:::Size_Enquiry=1

ISBSG-standalone-discretized.proj1:::Size_File=13

ISBSG-standalone-discretized.proj1:::Size_Input=17

ISBSG-standalone-discretized.proj1:::Size_Interface=4

ISBSG-standalone-discretized.proj1:::Size_Output=4

ISBSG-standalone-discretized.proj2:::Adj_FP=1

ISBSG-standalone-discretized.proj2:::Norm_PDF_UFP=6

ISBSG-standalone-discretized.proj2:::Norm_PDR_AFP=3

ISBSG-standalone-discretized.proj2:::Projected_PDR_UFP=2

ISBSG-standalone-discretized.proj2:::Reported_PDR_AFP=1

ISBSG-standalone-discretized.proj2:::Size_Input=4

ISBSG-standalone-discretized.proj2:::VAF=2

ISBSG-standalone-discretized.proj2:::Work_Effort=33

MAXWELL.proj1:::T01=9

MAXWELL.proj1:::T02=11

MAXWELL.proj1:::T03=8

MAXWELL.proj1:::T04=4

MAXWELL.proj1:::T05=7

MAXWELL.proj1:::T06=3

MAXWELL.proj1:::T07=23

MAXWELL.proj1:::T08=5

MAXWELL.proj1:::T09=4

MAXWELL.proj1:::T10=7

MAXWELL.proj1:::T11=6

MAXWELL.proj1:::T12=5

MAXWELL.proj1:::T13=4

MAXWELL.proj1:::T14=5

MAXWELL.proj1:::T15=4

coc81.allLarge:::acap=4

coc81.allLarge:::aexp=6

coc81.allLarge:::cplx=3

coc81.allLarge:::data=14

coc81.allLarge:::lexp=5

coc81.allLarge:::modp=1

coc81.allLarge:::pcap=4

coc81.allLarge:::rely=22

coc81.allLarge:::sced=4

coc81.allLarge:::stor=14

coc81.allLarge:::time=5

coc81.allLarge:::tool=4

coc81.allLarge:::turn=6

coc81.allLarge:::vexp=13

coc81.allLarge:::virt=1

coc81.flight:::acap=6

coc81.flight:::aexp=21

coc81.flight:::cplx=7

coc81.flight:::data=18

coc81.flight:::lexp=4

coc81.flight:::pcap=1

coc81.flight:::rely=7

coc81.flight:::stor=27

coc81.flight:::turn=19

coc81.ground:::acap=8

coc81.ground:::cplx=7

coc81.ground:::data=30

coc81.ground:::ltex=2

coc81.ground:::pcap=3

coc81.ground:::pmat=2

coc81.ground:::rely=8

coc81.ground:::stor=30

coc81.ground:::time=9

coc81.osp2:::sced=58

coc81.osp:::acap=4

coc81.osp:::aexp=6

coc81.osp:::cplx=20

coc81.osp:::sced=7

coc81.osp:::stor=23

coc81.osp:::tool=9

desharnais-discretized.proj1:::Adjustment=4

desharnais-discretized.proj1:::Entities=2

desharnais-discretized.proj1:::Language=11

desharnais-discretized.proj1:::Length=3

desharnais-discretized.proj1:::ManagerExp=2

desharnais-discretized.proj1:::PointsAjust=6

desharnais-discretized.proj1:::PointsNonAdjust=3

desharnais-discretized.proj1:::TeamExp=1

desharnais-discretized.proj1:::Transactions=1

desharnais-discretized.proj1:::YearEnd=1

These numbers are out of what was returned after 100 runs, so the total counts within one project might not match another. The take-home here is we've able to maintain these large effort reductions even with widely different recommendations both within the same project and across different projects with similar metrics. Some projects such as OSP2 are every stable, only recommending SCED. Others show some distribution favoring a particular attribute, some seem to get by recommending a small subset of attributes.

Closer study is needed of the attributes returned. Right now I'm not looking at the recommended values, just what attribute was chosen.

Monday, February 8, 2010

Initial Kernel Estimation Plots

Kernel-based methods are most popular for non-parametric estimators
They can discover structural features in data which parametric approaches may not reveal
Kernel performance is measured by AMISE (asymptotic mean integrated squared error)
Below graphs were produced via Epanechnikov kernel, since Epanechnikov kernel minimizes AMISE
For below graphs we compute a probability density estimate of a sample vector and then plot the evaluation of this probability density estimate
Plots (lines) below correspond to probability density function for k=16 from Cocomo81 and Cocomo81 itself respectively. The stars are the sorted effort values of related datasets.

Postcards from CHINA

Greg reports he can halve the median MRE as follows:

take data
cluster using k=10 means
assign a weight alpha to each cluster
learn alpha by random sampling (assessing each sample with LSR), selecting best alphas via bore

Results:

nasa93: median(MRE) now 0.4, was 0.61.
cocomonasa: median(MRE) now 0.39, was 0.64

Nasa93:
        #key, ties, win, loss, win-loss
 oversampled,    0,   1,    0,        1
    original,    0,   0,    1,       -1

cocomonasa:
        #key, ties, win, loss, win-loss
 oversampled,    0,   1,    0,        1
    original,    0,   0,    1,       -1

Combined:
                  #key, ties, win, loss, win-loss
oversampled|nasa93    ,    1,   2,    0,        2
oversampled|cocomonasa,    1,   2,    0,        2
original|nasa93       ,    1,   0,    2,       -2
original|cocomonasa   ,    1,   0,    2,       -2

Learning Treatments for Student Retention using TAR3

Using the treatment learner TAR3 (here, Menzies 2001), treatments were learned from features previously selected from enrollment data experiments. Below are the first obtained treatments:


Class = YES
bestClass = 20.00%

 1 worth=1.112139 [FinAidSTUDENT_AG=[5123.000000..724724.000000]] 
 2 worth=1.105999 [FinAidPARENT_MAR=M] [FinAidSTUDENT_WA=[3878.000000..561500.000000]]   
 3 worth=1.102110 [FinAidFATHER_WAG=[43669.000000..999999.000000]] [FinAidFATHER_ED=3.0]   
4 worth=1.097397 [FinAidPARENT_AGI=[67908.000000..999999.000000]]   5 worth=1.092824 [FinAidSTUDENT_WA=[3878.000000..561500.000000]]   
6 worth=1.092252 [FinAidMOTHER_ED=3.0] [FinAidFATHER_ED=3.0]   
7 worth=1.090565 [FinAidFATHER_WAG=[43669.000000..999999.000000]]   8 worth=1.082257 [FinAidFATHER_ED=3.0]   
9 worth=1.080906 [FinAidMOTHER_WAG=[23139.000000..533395.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]  
10 worth=1.079164 [TotalClasses=[6.000000..10.000000]] [FinAidPARENT_MAR=M]

Class = YES
bestClass = 10.00%

 1 worth=1.159171    [FinAidPARENT_MAR=M] [FinAidPARENT_AGI=[67908.000000..999999.000000]] [CUR_ERLHRS=[15.000000..22.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]] 
 2 worth=1.156786    [FinAidFATHER_WAG=[43669.000000..999999.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]   
3 worth=1.132347    [FinAidPARENT_HOU=4.0] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]   
4 worth=1.128918    [HS_PERCENT=[60.000000..100.000000]] [FinAidMOTHER_WAG=[23139.000000..533395.000000]] [PercentileRankHSGPA=[49.500000..98.400002]]   
5 worth=1.113224    [FinAidSTUDENT_TA=2.0] [HS_PERCENT=[60.000000..100.000000]]   
6 worth=1.112139    [FinAidSTUDENT_AG=[5123.000000..724724.000000]]   7 worth=1.097397    [FinAidPARENT_AGI=[67908.000000..999999.000000]]   8 worth=1.092824    [FinAidSTUDENT_WA=[3878.000000..561500.000000]]   9 worth=1.090565    [FinAidFATHER_WAG=[43669.000000..999999.000000]]  10 worth=1.089154    [ACT1_MATH=[19.000000..510.000000]] [FinAidSTUDENT_MA=U]  

Class = NO
bestClass = 20.00%

 1 worth=1.073376 [PercentileRankHSGPA=[0.000000..49.500000)]  

Class = YES
bestClass = 10.00%

1 worth=1.090117 [ENG10=Y]

Tuesday, February 2, 2010

Whole vs. Partial Learning - NASA MDP Components - Part II - Undersampling Vs. Subsampling Vs. No sampling

Experiment stats:

discretize
nb, j48
10*10-way
supersampling/undersampling/no sampling vs component vs all

mwu and quarts here

On Profiling RDR (Clump)...

Following are runtime statistics for Bryan's RDR implementation using the JM1 data set (~11000 lines).

As you can easily see NBAYES is the outlier weighing in at a whopping 18.3 minutes. What this tells us is that we're spending way too much time classifying as we go down the tree. What's more is that the data doesn't necessarily change from node to node but what we use to classify (based on our rules) is the variable factor.

Now it's time for the proposal. Since we're using the same information for each classification, why not cache the frequency counts early on and just merge counts when RDR generates a new rule? Once a system like the aforementioned has been implemented, the runtime for RDR would be minimal.

Discretizing W

In my last post, I discovered that the current problem datasets all suffered from an abundance of attribute values. Given that W acts on exact matching for it's query constraints, this causes a problem where nothing is left in the testing set after applying the first constraint. At least in the case of the ISBSG dataset, discretization makes a big difference:

Median Effort Reductions

Median Spread Reductions

For the Desharnais dataset, problems still remain. Even with discretization we can only generate a constrained query 25% of the time. The reported results use 9 discrete bins, but tinkering with this number has yet to see any gains in effort reduction. This is concerning, but currently all Desharnais projects are highly synthetic. A better look at the data is in order.

Awktimization

If you're doing a lot of heavy text crunching with Awk, you've probably found that it can get quite slow very quickly. It always seems to happen whether you program well or not. Unfortunately for us, Awk isn't a very high efficiency tool. But that's not going to stop us. Not today.

You see, the GNU implementation of Awk, gawk, has a built-in profiler that you can use in order to analyze and optimize your Awk scripts. Its beautiful simplicity will cast light on those sinister bottlenecks destroying your performance. To use it, simply call pgawk from the command line on any machine with the GNU Awk installed. (Alternatively, you can pass the --profile option to gawk.) The runtime will execute your script and write the profiler results to a file by default named awkprof.out. This file contains the execution counts for each statement executed in the left margin.

Optimizing the now elucidated bottlenecks in your program should give you a considerable performance boost. However, if after improving on your runtimes, you still find that your program isn't zippy enough, you can put a little bit more spring in its step by compiling it to ANSI C.

That's right, the open-source Awka project takes plain old Awk code and transforms it into a nitroglycerin-breathing, high-performance rocket car written in C. Nested loops, arrays, and user functions benefit the most from the conversion. The final product will typically see performance gains of around 50%, but depending on the code, you could see 300-400% speed boosts.

So if you find that your Awk text crunching just isn't quite up to par, maybe it's time to consider grinding that puppy through a profiler. And if that still isn't enough, give it the Total Language Makeover with a conversion to C. And if you have a problem that's still taking millennia to compute, well, maybe you should contemplate asking a simpler question. :)