## Saturday, October 27, 2012

### Looking at the data

These figures are hypercube projections of effort data sets.

The first side was picked at random and runs between any two corners 0,1.

After that, side N=2,3,4... was built by picking a point whose sum of the distances to corners 0,1,...N-1 is maximal.
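The corner-picking rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: rows are numeric tuples, distance is plain Euclidean, and the names `dist` and `pick_corners` are made up for this note (they are not from the code that drew the figures).

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_corners(rows, n_corners, rng=random):
    """Return row indices: two random corners, then greedy picks.

    Each later corner is the row whose summed distance to all the
    corners chosen so far is maximal.
    """
    corners = rng.sample(range(len(rows)), 2)      # corners 0 and 1
    while len(corners) < n_corners:
        rest = [i for i in range(len(rows)) if i not in corners]
        if not rest:                               # ran out of rows
            break
        best = max(rest,
                   key=lambda i: sum(dist(rows[i], rows[c]) for c in corners))
        corners.append(best)
    return corners
```

For example, `pick_corners(rows, 4)` returns the indices of four corners, the first two chosen at random.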

## Tuesday, October 23, 2012

### Effect size: the new religion

The paper about the effect size metrics in software engineering is here: http://www.sciencedirect.com/science/article/pii/S0950584907000195

In case you need the BibTeX:

    @article{DBLP:journals/infsof/KampenesDHS07,
      author    = {Vigdis By Kampenes and
                   Tore Dyb{\aa} and
                   Jo Erskine Hannay and
                   Dag I. K. Sj{\o}berg},
      title     = {A systematic review of effect size in software engineering
                   experiments},
      journal   = {Information {\&} Software Technology},
      volume    = {49},
      number    = {11-12},
      year      = {2007},
      pages     = {1073-1086},
      ee        = {http://dx.doi.org/10.1016/j.infsof.2007.02.015},
      bibsource = {DBLP, http://dblp.uni-trier.de}
    }

For code that implements the Hedges' g measure recommended by this paper, see below (the second-last function: ghedge). Note that it's kinda simple. For this to work, you need a value of "small". As per the recommendations of the above paper, I use 0.17.

    # Needs gawk 4+ (arrays of arrays).
    # sd(), abs() and bootstrap() are not shown here (abs is not an awk builtin).
    function scottknott(datas,small,  data) {
      for(data in datas)
        sk1(datas[data],small,1,1,length(datas[data]),0)
    }
    # Recursively split rxs[lo..hi]; items left unsplit share one rank.
    function sk1(rxs,small,rank,lo,hi,  cut,i) {
      cut = sk0(rxs,small,lo,hi)
      if (cut) {
        rank = sk1(rxs,small,rank,lo,cut-1) + 1
        rank = sk1(rxs,small,rank,cut,hi)
      } else {
        for(i=lo;i<=hi;i++)
          rxs[i]["rank"] = rank
      }
      return rank
    }
    # Find the cut that most separates the left and right means.
    # Returns empty (false) if no cut is both large enough and significant.
    function sk0(a,small,lo,hi,\
                 s,  s2, n, mu,                     \
                 ls,ls2,ln,lmu,                     \
                 rs,rs2,rn,rmu,                     \
                 i,best,tmp,cut) {
      for(i=lo;i<=hi;i++) {
        s += a[i]["s"]; s2+= a[i]["s2"]; n += a[i]["n"]
      }
      mu= s/n
      best = -1;
      for(i=lo+1;i<=hi;i++) {
        ls  += a[i-1]["s"];
        ls2 += a[i-1]["s2"];
        ln  += a[i-1]["n"]
        rs   = s - ls;
        rs2  = s2- ls2;
        rn   = n - ln;
        rmu  = rs/rn
        lmu  = ls/ln
        tmp = ln/n * (mu - lmu)^2 + rn/n * (mu - rmu)^2
        if (tmp > best)
          if (ghedge(rmu,lmu,rn,ln,rs,ls,rs2,ls2) > small)
            if (significantlyDifferent(a,lo,i,hi)) {
              best = tmp; cut = i }}
      return cut
    }
    # Hedges' g: difference in means over pooled sd, with small-sample correction.
    function ghedge(rmu,lmu,rn,ln,rs,ls,rs2,ls2,\
                    n, ln1,rn1,rsd, lsd,sp,correct,g) {
      n       = rn + ln
      ln1     = ln - 1
      rn1     = rn - 1
      rsd     = sd(rn,rs,rs2)
      lsd     = sd(ln,ls,ls2)
      sp      = sqrt((ln1*lsd^2 + rn1*rsd^2)/(ln1 + rn1))
      correct = 1 - 3/(4*(n - 2) - 1)
      return abs(lmu - rmu)/sp * correct
    }
    # Pool the raw values on each side of the cut, then bootstrap.
    function significantlyDifferent(a,lo,cut,hi,\
                                    i,j,x,r,l,rhs,lhs) {
      for(i=lo;i<=hi;i++)
        for(j in  a[i]["="]) {
          x = a[i]["="][j]
          if (i < cut) lhs[++l] = x; else rhs[++r] = x
        }
      print ":lo",lo,":cut",cut,":hi",hi,":left",l,":right",r > "/dev/stderr"
      return bootstrap(lhs,rhs,500,0.05)
    }
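For readers who want to check the ghedge arithmetic outside of awk, here is a small Python sketch of the same calculation. It works from two raw samples rather than running sums. The `sd` helper is an assumption on my part (the post's own `sd()` is not shown), written as the usual sample standard deviation from count, sum, and sum of squares.

```python
import math

def sd(n, s, s2):
    """Sample standard deviation from count, sum and sum of squares (assumed form)."""
    return math.sqrt(max(0.0, (s2 - s * s / n) / (n - 1)))

def hedges_g(left, right):
    """Hedges' g: |difference in means| / pooled sd, times a small-sample correction."""
    ln, rn = len(left), len(right)
    lmu, rmu = sum(left) / ln, sum(right) / rn
    lsd = sd(ln, sum(left), sum(x * x for x in left))
    rsd = sd(rn, sum(right), sum(x * x for x in right))
    sp = math.sqrt(((ln - 1) * lsd ** 2 + (rn - 1) * rsd ** 2) / (ln + rn - 2))
    correct = 1 - 3 / (4 * (ln + rn - 2) - 1)
    return abs(lmu - rmu) / sp * correct

SMALL = 0.17  # the threshold used in this post

if __name__ == "__main__":
    g = hedges_g([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
    print(round(g, 4), g > SMALL)
```

Each sample needs at least two values, and the pooled standard deviation must be non-zero, otherwise the division fails.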

## Monday, October 22, 2012

### the opposite of "parameter tuning is bad"

"With more parameters data sparsity becomes an issue again, but with proper smoothing the models are usually more accurate than the original models. Thus, no matter how much data one has, smoothing can almost always help performance, and for a relatively small effort."

"Adding free parameters to an algorithm and optimizing these parameters on held-out data can improve performance."

http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

### WHERE = spill tree \tau = 0

fig4 reports better precision with higher \tau

"As should be expected, for all splitting rules and spill thresholds, recall performance degrades as the
maximum depth of the tree increases." ????

http://unbox.org/stuff/doc/11rptrees.pdf

### Random projections rule

Note the computations shown in Fig2. So fast!

http://unbox.org/stuff/doc/01randomProjections.pdf

### Graphics beats stats!

behold the biggest big-ass effect i've ever seen

lowest errors
long training times
ok test times

http://unbox.org/stuff/doc/11clusterHyperCube.pdf

## Friday, October 19, 2012

### a few random projections is enough

In the following, e=0 means use all attributes and e>0 means stop after finding e better projections

The way this works is that it builds "e" projections as follows:
• Draw a line between two randomly selected points.
• All the other points are then mapped onto this line using the cosine rule.
• The error of this line is the mean distance to all the points around it (found using Pythagoras)
• Keep projection if its error is less than best projection found so far
In the e>0 experiments, the weight given to each projection is biased by the error of each projection:
• Weight[X] = error[X] / error[W]
where "W" is the worst error seen in all projections.

After all that, it's Euclidean distance times weight over the projections.
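The projection steps listed above can be sketched as follows. This is a minimal Python illustration, not the code that produced the results below: the names `project` and `weights` are invented here, rows are assumed to be numeric tuples, and the error is averaged over every row (including the two end points, which sit on the line).

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project(rows, a, b):
    """Project all rows onto the line through rows[a] and rows[b].

    Returns (xs, error): xs[i] is the position of row i along the line
    (cosine rule); error is the mean distance from the rows to the
    line (Pythagoras). rows[a] and rows[b] must be distinct points.
    """
    c = dist(rows[a], rows[b])
    xs, offsets = [], []
    for row in rows:
        da, db = dist(row, rows[a]), dist(row, rows[b])
        x = (da ** 2 + c ** 2 - db ** 2) / (2 * c)          # cosine rule
        xs.append(x)
        offsets.append(math.sqrt(max(0.0, da ** 2 - x ** 2)))  # Pythagoras
    return xs, sum(offsets) / len(offsets)

def weights(errors):
    """Weight[X] = error[X] / error[W], where W is the worst projection."""
    worst = max(errors)
    if worst == 0:
        return [1.0 for _ in errors]
    return [e / worst for e in errors]
```

With rows `(0,0)`, `(4,0)`, `(2,3)` and the line drawn between the first two, the third row lands at x=2 and sits 3 away from the line, so the mean error is 1.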

In the following, note how e=0 is not as critical as other factors (except in big datasets like China). The LHS number comes from ScottKnott which ignores small effect size (Hedges' corrected g = 0.17) and which uses bootstrapping (500 samples, 95% confidence) to test for differences.

----| 0 |----------------------
1,    k=4,n=1,e=0,   18,   36,   66,|,  36,  48, 4029,>,-*-      |          ,<,
1,    k=2,n=1,e=0,   17,   37,   63,|,  37,  46, 3800,>,-*-      |          ,<,
1,    k=1,n=1,e=0,   21,   40,   68,|,  40,  47, 1943,>,-*-      |          ,<,
2,    k=4,n=1,e=1,   33,   61,  145,|,  61, 112, 4551,>, --*---- |          ,<,
2,    k=4,n=1,e=2,   33,   61,  145,|,  61, 112, 4551,>, --*---- |          ,<,
2,    k=4,n=1,e=4,   33,   62,  144,|,  62, 111, 4551,>, --*---- |          ,<,
2,    k=2,n=1,e=1,   30,   63,  132,|,  63, 102, 7761,>, --*---  |          ,<,
2,    k=2,n=1,e=4,   32,   63,  132,|,  63, 100, 7761,>, --*---  |          ,<,
2,    k=2,n=1,e=2,   32,   64,  139,|,  64, 107, 7761,>, --*---- |          ,<,
2,    k=1,n=1,e=1,   37,   66,  108,|,  66,  71, 8926,>, --*--   |          ,<,
2,    k=1,n=1,e=2,   36,   66,  108,|,  66,  72, 8926,>, --*--   |          ,<,
2,    k=1,n=1,e=4,   36,   66,  107,|,  66,  71, 8926,>, --*--   |          ,<,

----| /home/timm/tmp/desharnais.dash |----------------------
1,    k=4,n=1,e=0,   11,   29,   55,|,  29,  44, 378,>,*-       |          ,<,
2,    k=4,n=1,e=2,   20,   33,   68,|,  33,  48, 883,>,-*-      |          ,<,
2,    k=4,n=1,e=4,   18,   33,   68,|,  33,  50, 883,>,-*-      |          ,<,
2,    k=2,n=1,e=0,   16,   36,   59,|,  36,  43, 320,>,-*       |          ,<,
2,    k=4,n=1,e=1,   16,   37,   69,|,  37,  53, 854,>,-*-      |          ,<,
2,    k=1,n=1,e=0,   19,   41,   66,|,  41,  47, 387,>,-*-      |          ,<,
3,    k=2,n=1,e=4,   15,   45,   77,|,  45,  62, 764,>,--*-     |          ,<,
3,    k=2,n=1,e=2,   19,   47,   83,|,  47,  64, 764,>,--*-     |          ,<,
3,    k=2,n=1,e=1,   19,   48,   79,|,  48,  60, 844,>,--*-     |          ,<,
3,    k=1,n=1,e=2,   24,   48,   76,|,  48,  52, 655,>,--*-     |          ,<,
3,    k=1,n=1,e=4,   25,   48,   78,|,  48,  53, 1220,>,--*-     |          ,<,
3,    k=1,n=1,e=1,   24,   52,   85,|,  52,  61, 655,>,--*-     |          ,<,

----| /home/timm/tmp/miyazaki94.dash |----------------------
1,    k=1,n=1,e=0,   29,   42,   75,|,  42,  46, 655,>,-*--     |          ,<,
1,    k=2,n=1,e=1,   17,   43,   88,|,  43,  71, 490,>,-*--     |          ,<,
1,    k=4,n=1,e=1,   25,   45,   95,|,  45,  70, 340,>,--*--    |          ,<,
1,    k=2,n=1,e=0,   27,   45,   88,|,  45,  61, 472,>,--*-     |          ,<,
1,    k=1,n=1,e=2,   16,   45,   83,|,  45,  67, 648,>,--*-     |          ,<,
1,    k=1,n=1,e=4,   16,   45,   83,|,  45,  67, 648,>,--*-     |          ,<,
1,    k=4,n=1,e=0,   18,   45,   73,|,  45,  55, 370,>,--*      |          ,<,
1,    k=4,n=1,e=2,   30,   51,   81,|,  51,  51, 602,>, -*-     |          ,<,
1,    k=4,n=1,e=4,   30,   51,   81,|,  51,  51, 602,>, -*-     |          ,<,
1,    k=2,n=1,e=2,   28,   57,   92,|,  57,  64, 472,>,--*--    |          ,<,
1,    k=2,n=1,e=4,   28,   57,   92,|,  57,  64, 472,>,--*--    |          ,<,
1,    k=1,n=1,e=1,   35,   60,  100,|,  60,  65, 648,>, --*-    |          ,<,

----| /home/timm/tmp/nasa93.dash |----------------------
1,    k=2,n=1,e=0,   22,   52,   86,|,  52,  64, 2058,>,--*-     |          ,<,
1,    k=1,n=1,e=0,   15,   50,   93,|,  50,  78, 900,>,--*--    |          ,<,
1,    k=1,n=1,e=1,   33,   54,   95,|,  54,  62, 4207,>, -*--    |          ,<,
1,    k=4,n=1,e=0,   27,   58,  104,|,  58,  77, 4029,>,--*--    |          ,<,
1,    k=1,n=1,e=2,   33,   62,  104,|,  62,  71, 4207,>, --*-    |          ,<,
1,    k=1,n=1,e=4,   33,   63,  104,|,  63,  71, 4207,>, --*-    |          ,<,
1,    k=4,n=1,e=1,   40,   73,  123,|,  73,  83, 1719,>, --*---  |          ,<,
1,    k=2,n=1,e=1,   29,   73,  123,|,  73,  94, 2443,>,---*---  |          ,<,
1,    k=2,n=1,e=4,   29,   74,  132,|,  74, 103, 2443,>,---*---  |          ,<,
1,    k=4,n=1,e=2,   40,   74,  129,|,  74,  89, 1719,>, --*---  |          ,<,
1,    k=2,n=1,e=2,   29,   75,  146,|,  75, 117, 2443,>,----*--- |          ,<,
1,    k=4,n=1,e=4,   40,   74,  126,|,  74,  86, 1719,>, --*---  |          ,<,

----| /home/timm/tmp/china.dash |----------------------
1,    k=4,n=1,e=0,   16,   30,   53,|,  30,  37, 1139,>,-*       |          ,<,
1,    k=2,n=1,e=0,   14,   31,   49,|,  31,  35, 1609,>,-*       |          ,<,
1,    k=1,n=1,e=0,   18,   36,   56,|,  36,  38, 1943,>,-*       |          ,<,
2,    k=4,n=1,e=1,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<,
2,    k=4,n=1,e=2,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<,
2,    k=4,n=1,e=4,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<,
2,    k=2,n=1,e=1,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<,
2,    k=2,n=1,e=2,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<,
2,    k=2,n=1,e=4,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<,
2,    k=1,n=1,e=1,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<,
2,    k=1,n=1,e=2,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<,
2,    k=1,n=1,e=4,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<,

----| /home/timm/tmp/finnish.dash |----------------------
1,    k=1,n=1,e=0,   23,   36,   79,|,  36,  56, 950,>,-*--     |          ,<,
1,    k=1,n=1,e=1,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<,
1,    k=1,n=1,e=2,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<,
1,    k=1,n=1,e=4,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<,
1,    k=4,n=1,e=0,   24,   51,  120,|,  51,  96, 1120,>,--*----  |          ,<,
1,    k=4,n=1,e=1,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<,
1,    k=4,n=1,e=2,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<,
1,    k=4,n=1,e=4,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<,
1,    k=2,n=1,e=0,   23,   56,   81,|,  56,  58, 489,>,--*-     |          ,<,
1,    k=2,n=1,e=1,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<,
1,    k=2,n=1,e=2,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<,
1,    k=2,n=1,e=4,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<,

----| /home/timm/tmp/coc81.dash |----------------------
1,    k=2,n=1,e=0,   40,   76,  222,|,  76, 182, 3800,>, ---*----|---       ,<,
1,    k=2,n=1,e=1,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<,
1,    k=2,n=1,e=2,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<,
1,    k=2,n=1,e=4,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<,
1,    k=1,n=1,e=0,   39,   76,  163,|,  76, 124, 1240,>, ---*----|          ,<,
1,    k=4,n=1,e=1,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<,
1,    k=4,n=1,e=2,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<,
1,    k=4,n=1,e=4,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<,
1,    k=1,n=1,e=1,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<,
1,    k=1,n=1,e=2,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<,
1,    k=1,n=1,e=4,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<,
1,    k=4,n=1,e=0,   36,   81,  224,|,  81, 188, 3188,>, ---*----|---       ,<,

## Sunday, October 14, 2012

### Erin's bio

Erin Moore is pursuing a doctorate in Computer Engineering at WVU where she has been awarded a WV EPSCoR Cancer, Energy and Security Nanotechnology Fellowship for the year of 2012.  Her research uses Data Mining to find patterns in biological data.  Specifically, she is predicting a special type of DNA sequences called Aptamers.  These molecules can be used for targeted pharmaceutical therapies and in sensors to detect the presence of chemical substances at a very low concentration. She would like to continue developing Artificial Intelligence algorithms to further medical diagnostics and pharmaceutical development technologies.

Erin's academic and professional experience is detailed at

### Vasil's Info

Vasil is currently pursuing his MSc in Computer Science at West Virginia University. He previously received a BSc in Computer Science from the University of Tirana. His main interest is using data mining techniques in setting up public health policies. He is currently collaborating with the WVU’s Regional Research Institute on a project that investigates the influence of community nutrition environment on child obesity.

### Ekrem Kocaguneli

Ekrem started his CS career at Bogazici University, where he received his BSc and MS degrees. Currently he is a PhD candidate in the LCSEE department at West Virginia University. His main research interests are using machine learning methods to solve difficult software estimation problems, analyzing various types of software data, and investigating stable conclusions in empirical software engineering. He currently maintains a personal web site (www.kocaguneli.com), where he posts his ideas about software engineering problems, announces newly accepted papers, and so on.
His publications are:
• Journal
• J. Keung, E. Kocaguneli, T. Menzies, “Finding Conclusion Stability for Selecting the Best Effort Predictor in Software Effort Estimation”, Automated Software Engineering, to appear in 2012.
• E. Kocaguneli, T. Menzies, A. Bener, J. Keung, “Exploiting the Essential Assumptions of Analogy-based Effort Estimation”, IEEE Transactions on Software Engineering, vol. 38, no.2, pp. 425-438, March-April 2012.
• E. Kocaguneli, T. Menzies, J. Keung, “On the Value of Ensemble Effort Estimation”, IEEE Transactions on Software Engineering, pre-prints, 2011.
• E. Kocaguneli, T. Menzies, J. Keung, “Kernel Methods for Software Effort Estimation”, Empirical Software Engineering Journal, pre-prints, 2011.
• A. Brady, T. Menzies, E. Kocaguneli, “What Is Enough Quality for Data Repositories?”, Software Quality Professional, 2011.
• Conference
• E. Kocaguneli, T. Menzies, “How to Find Relevant Data for Effort Estimation”, International Symposium on Empirical Software Engineering and Measurement (ESEM) 2011
• E. Kocaguneli, B.Caglayan, A. Tosun, A. Bener, “Experiences on Developer Participation and Effort Estimation”, EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) 2011
• E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, “When to Use Data from Other Projects for Effort Estimation”, International Conference on Automated Software Engineering (ASE) 2010
• E. Kocaguneli, A. Tosun, A. Bener, “AI-Based Models for Software Effort Estimation”, 36th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) 2010
• E. Kocaguneli, Y. Kultur, A. Bener, “Combining Multiple Learners Induced on Multiple Datasets for Software Effort Prediction”, Proceedings of International Symposium on Software Reliability Engineering (ISSRE) 2009.
• E. Kocaguneli, A. Tosun, A. Bener, B. Turhan, B. Caglayan, “Prest: An Intelligent Software Metrics Extraction, Analysis and Defect Prediction Tool”, Proceedings of International Conference on Software Engineering and Knowledge Engineering (SEKE) 2009.
• Ayse Tosun, Ayse Bener, Ekrem Kocaguneli, “BITS: Issue Tracking and Project Management Tool in Healthcare Software Development”, Proceedings of International Conference on Software Engineering and Knowledge Engineering (SEKE) 2009.
• A. Bakir, E. Kocaguneli, A. Tosun, A.Bener and B. Turhan “Xiruxe: An Intelligent Fault Tracking Tool”, Proceedings of International Conference on Artificial Intelligence and Pattern Recognition (AIPR) 2009.
• Y. Kultur, E. Kocaguneli, A. Bener, “Discovering Cost Related Features: Focus on Classification Models”, Poster Paper in Empirical Software Engineering and Measurement (ESEM) 2009.
• E. Kocaguneli, A. Tosun, B. Caglayan, A. Bener, T. Aytac, B. Turhan, “Bulutlarda Akilli Bir Yazilim Olcumleme, Hata Analiz ve Tahmin Araci: Prest”, (in Turkish), 2nd National Symposium on Software Engineering (UYMS) 2009
• Y. Kultur, E. Kocaguneli, A. Bener, “Domain Specific Phase by Phase Effort Estimation in Software Projects”, Proceedings of International Symposium on Computer and Information Sciences (ISCIS) 2009.

### Joe Krall

I'm Joe, Ph.D. student at WVU, studying games with computer science, aiming for a high level job in the gaming industry; an indoor outdoors-man, writer, artist, and gamer at heart.

#### Education

- B.S. in Computer Science at University of Pittsburgh at Johnstown (2008)
- B.S. in Mathematics at University of Pittsburgh at Johnstown (2008)
- M.S. in Computer Science at West Virginia University (2010)
- Ph.D. pending, in Computer Science at West Virginia University (2013)

#### Research Interests

- Studying Games
- Cognitive Psychology & Believability
- Data Mining
- Multiple Objective Optimization

#### Publications

- JSEA'12 "Aspects of Replayability and Software Engineering: Towards a Methodology of Developing Games" (The most downloaded paper in Vol.5 No. 7) [Link]

#### Abilities

- Excellent Writer
- Good with Graphics & Presentation
- Intensive Programmer

- Trap Shooting
- Archery
- Gaming

## Monday, October 8, 2012

### how many random projections, how many neighbors

same rig as before http://unbox.org/stuff/var/timm/12/dash/var/egdash.out

### how many neighbors? normalize?

• N=1 : normalize nums min..max 0..1
• k=K : use mean of the K nearest neighbors
• data sets : desharnais, autompg, nasa93, finnish, miyazaki94, china, coc81
• results : mre
• small effect size : median of numbers bottom third (max(mre) - min(mre))/3

results: http://unbox.org/stuff/var/timm/12/dash/var/egknn.out

note the 010101 patterns on everything except china which is 1111110000000

• normally normalization does not matter
• but when it matters, it really matters

in the n=1 results

• k=1 has less error in 6/7 data sets (exception: desharnais)

conclusion: k=1 a good default
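The rig described above (min..max normalization, mean of the k nearest neighbors, MRE as the score) can be sketched as below. This is a hedged Python illustration, not the script behind egknn.out: the function names are invented here, and MRE is assumed to be the usual magnitude of relative error, |actual - predicted| / actual.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(rows):
    """Scale every column min..max to 0..1 (constant columns become 0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]

def knn_estimate(train_x, train_y, query, k=1):
    """Mean effort of the k training rows nearest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    near = order[:k]
    return sum(train_y[i] for i in near) / len(near)

def mre(actual, predicted):
    """Magnitude of relative error (assumed definition)."""
    return abs(actual - predicted) / actual
```

A leave-one-out run would normalize all rows once, then estimate each row from the others and collect the MRE values.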

## Tuesday, October 2, 2012

### Why is HV in IBEA?

Because the indicator used in IBEA must be dominance preserving, and the IHD indicator below is dominance preserving.

### Defining orderly mutation

How it is done now (or so we thought!)

    not_mutated = true
    while not_mutated:
        for bit in variable:
            if rand(0,1) < probability:
                if deselecting root feature:
                    continue
                else if selecting feature whose parent is not selected:
                    continue
                else if deselecting feature that another selected feature requires:
                    continue
                else if cardinality violation:
                    continue
                else:
                    flip(bit)
                    not_mutated = false
                    if isSelected(bit):
                        reselect(bit.children)
                    else:
                        deselect(bit.children)

How it might be done otherwise:

    not_mutated = true

    while not_mutated:
        traverseTree(root)

    function traverseTree(node):
        if rand(0,1) < probability:
            if deselecting root feature:
                continue
            else if selecting feature whose parent is not selected:
                continue
            else if deselecting feature that another selected feature requires:
                continue
            else if cardinality violation:
                continue
            else:
                flip(node.bit)
                not_mutated = false
                if isSelected(node.bit):
                    reselect(node.children)
                else:
                    deselect(node.children)
        if isSelected(node):
            for child in node.children:
                traverseTree(child)
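Here is a runnable Python sketch of the tree-walking variant above, to make the control flow concrete. It is a simplification under stated assumptions: only the root and parent checks are implemented (the "requires" and cardinality checks are left out), deselecting a node clears its whole subtree so that no selected feature is left under an unselected parent, and `Node`, `can_flip`, and `max_tries` are names invented for this illustration.

```python
import random

class Node:
    """One feature in the tree (hypothetical structure for this sketch)."""
    def __init__(self, name, children=(), selected=True):
        self.name = name
        self.children = list(children)
        self.selected = selected
        self.parent = None
        for child in self.children:
            child.parent = self

def can_flip(node):
    """Root and parent checks only; 'requires' and cardinality are omitted."""
    if node.parent is None:
        return False                      # never deselect the root
    if not node.selected and not node.parent.selected:
        return False                      # parent must be selected first
    return True

def deselect_subtree(node):
    for child in node.children:
        child.selected = False
        deselect_subtree(child)

def traverse(node, probability, rng):
    """Walk the tree top-down; return True if any bit was flipped."""
    mutated = False
    if rng.random() < probability and can_flip(node):
        node.selected = not node.selected
        mutated = True
        if node.selected:
            for child in node.children:   # reselect(node.children)
                child.selected = True
        else:
            deselect_subtree(node)        # assumption: cascade downwards
    if node.selected:
        for child in node.children:
            mutated = traverse(child, probability, rng) or mutated
    return mutated

def mutate(root, probability, rng=random, max_tries=1000):
    """Repeat the walk until something flips; give up after max_tries."""
    for _ in range(max_tries):
        if traverse(root, probability, rng):
            return True
    return False
```

The `max_tries` cap is there because the walk can otherwise loop forever, for example when the probability is zero or the root has no children.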

## Monday, October 1, 2012

### signal and the noise

The Signal and the Noise:
Why So Many Predictions Fail-but Some Don't

• Hedgehogs dig in- say one thing, only, all the time. Do best on the talk show circuit.
• Foxes, sniff around. Craftier.  Do best at prediction.

### paper accepted with minor revisions to tse

Local vs. Global Lessons for Defect Prediction and Effort Estimation

Tim Menzies, Member, IEEE,
Andrew Butcher,
David Cok, Member, IEEE,
Andrian Marcus, Member, IEEE,
Lucas Layman,
Forrest Shull, Member, IEEE,
Burak Turhan, Member, IEEE,
Thomas Zimmermann

Abstract—Existing research is unclear on how to generate lessons learned for defect prediction and effort estimation. Should we seek lessons that are global to multiple projects, or just local to particular projects? This paper aims to comparatively evaluate local vs. global lessons learned for effort estimation and defect prediction. We applied automated clustering tools to effort and defect data sets from the PROMISE repository. Rule learners generated lessons learned from all the data, from local projects, or just from each cluster. The results indicate that the lessons learned after combining small parts of different data sources (i.e., the clusters) were superior to either generalizations formed over all the data or local lessons formed from particular projects. We conclude that when researchers attempt to draw lessons from some historical data source, they should (a) ignore any existing local divisions into multiple sources; (b) cluster across all available data; then (c) restrict the learning of lessons to the clusters from other sources that are nearest to the test data.