Tuesday, November 13, 2012

Make cross-project defect prediction a successful story: a research roadmap

#Research questions:

1. For the first release of a new project, how can we learn quality prediction models on cross-project data?

Burak proposed using an NN-filter to help cross-project defect prediction and obtained some promising results (JASE 2010). Fayola is now also working on this and doing much better (mostly on small test sets). Much here is worth investigating further.

1.1 Can we do better on cross-learning performance?
      -- Are there more effective ways to understand the training and test data, like TEAK?
1.2 Can we generalize cross-learning strategies to other data?
      -- Not only small test sets and static code measures.
1.3 Can we make cross-learning algorithms more scalable?
      -- Maybe low dimensionality can help?

2. For the following releases, we have to:

2.1 handle the data shift or concept drift problem caused by changes in the development environment, to produce better prediction results.
      -- I am currently working in this direction; see my post for last week's group meeting.
2.2 know whether cross-project data can further improve prediction performance even when we have local data. If cross-project data helps, how should we make use of it?


#Potential solutions:

Transductive transfer learning for research question 1, and inductive transfer learning for research question 2.
      -- transductive transfer learning: plenty of labeled data in the source domain, but no labeled data in the target domain.
      -- inductive transfer learning: plenty of labeled data in the source domain, plus a small amount of labeled data in the target domain.


#My expectation:

1-2 publications on this topic.
     -- FSE conference, JESE, or somewhere better.

All comments are welcome.


Zhimin

Friday, November 9, 2012

Interesting tidbits from Discover Magazine

Nate Silver will tax your crap

Tarnished Silver

Effort Estimation Intrinsic Dimensions


Code to compute the correlation dimension as a function of "r". Note that the intrinsic dimensionality of a data set is, at most, the steepest slope in a plot of log(C(r)) vs log(r).

@include "lib.awk"
@include "dist.awk"
@include "readcsv.awk"

function _cr(   o) {
  if(args("-d,data/nasa93.csv,-steps,20,-jumps,4",o))
    cr(o["-d"],o["-steps"],o["-jumps"])
}
function cr(f,steps,jumps,
   _Rows,r,k,c,x,logc,logr) {
  readcsv("cat "f,_Rows)
  distances(_Rows,x)
  for(r=1/steps; r<=1 ; r+= 1/steps) {
    c = correlationDimension(r,x,length(all))  # n = number of rows; assumes the global "all" used by distances() indexes the rows
    if (c==0) continue
    if (c==1) break
    k++
    print (logr[k] = log(r)) "\t" (logc[k] = log(c)) 
  }
  say("# " steepest(k,jumps,logr,logc)  " " f "\n")
}
function distances(_Rows,x,       i,j) {
  for(i in all)
    for(j in all)
      if (j > i) 
        x[i][j] = dist(all[i],all[j],_Rows,1)
}
function correlationDimension(r,x,n,   i,j,c) {
  for(i in x)
    for(j in x[i])
      c += x[i][j] <= r
  return 2/(n*(n-1)) * c
}
function steepest(max,jumps,logr,logc, 
 i,rise,run,m,most) {
  for(i=1; i <= max-jumps; i += jumps) {

    rise = logc[i + jumps] - logc[i]
    run  = logr[i + jumps] - logr[i]
    m    = rise / run
    if (m > most) 
      most = m
  }
  return most
}

And the results are, sorted lowest to highest...


  • low
    • 0.91 data/china.csv
    • 1.97 data/kemerer.csv
    • 2.77 data/finnish.csv
    • 2.92 data/miyazaki94.csv
    • 3.00 data/albrecht.csv
    • 3.35 data/nasa93c1.csv

  • medium
    • 3.70 data/coc81o.csv
    • 3.96 data/telecom.csv
    • 4.00 data/coc81sd.csv
    • 4.07 data/coc81.csv
    • 4.10 data/desharnais.csv

  • high
    • 4.51 data/nasa93c2.csv
    • 4.54 data/nasa93c5.csv
    • 4.78 data/coc81e.csv
    • 5.74 data/nasa93.csv
    • 8.19 data/sdr.csv




Thursday, November 8, 2012

The Peters Filter


The cross-company problem: how to find which training data is relevant to you.


Why do cross-company learning?
  • Because when you don't have enough local data, you do very badly
  • In the following, we are training and testing on very small data sets: (lo, median, hi) = (6, 20, 65) instances


So, let's reach out across data sets and compare.
Two cross-company selection filters
  • Burak: N things nearest the test data (shown in gray)
  • Peters: Cluster the train data, find the clusters with the test data
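
To make the Burak filter concrete, here is a rough Python sketch (not the gawk used elsewhere on this blog). It assumes rows are plain numeric tuples, Euclidean distance, and that "N nearest" means the N nearest training rows for each test row, with the union kept; the name `nn_filter` and the toy data are made up for illustration. The Peters filter is not sketched since its details are not spelled out here.

```python
import math

def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_filter(train, test, n=10):
    """Indices of training rows that are among the n nearest
    neighbours of at least one test row (assumed reading of the filter)."""
    keep = set()
    for t in test:
        ranked = sorted(range(len(train)), key=lambda i: dist(train[i], t))
        keep.update(ranked[:n])
    return sorted(keep)

# Toy data: two test rows, each sitting between a pair of training rows.
train = [(0, 0), (1, 1), (5, 5), (6, 6), (9, 9)]
test = [(0.5, 0.5), (5.5, 5.5)]
print(nn_filter(train, test, n=2))  # [0, 1, 2, 3]
```

Note that the selection is driven entirely by the test rows; the far-away training row (9, 9) is never picked.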




Note that the Peters filter uses the structure of the train data to guide the initial selection of the data.
Why?
  • Intuition behind the Peters filter: there is more experience in the repo than with you, so use it to guide you.


In the following

  • Train on selected members of the 46 data sets in the repo: (lo, med, hi) = (109, 293, 885) instances
  • g = 2*(1 - pf)*pd / (1 - pf + pd)
  • The last column is the delta between the Peters and Burak filters
  • Delta is usually positive and large
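
As a quick sanity check on the g measure, here is a small Python sketch. The function name `g_measure` and the pd/pf values are illustrative assumptions, not numbers from these experiments. g is the harmonic mean of pd and (1 - pf), so it is only high when recall is high and the false alarm rate is low.

```python
def g_measure(pd, pf):
    """Harmonic mean of recall (pd) and 1 - false alarm rate (pf)."""
    spec = 1.0 - pf
    if pd + spec == 0:
        return 0.0  # both terms are zero, so define g as 0
    return 2.0 * spec * pd / (spec + pd)

# Made-up values: a balanced predictor vs. one with many false alarms.
print(round(g_measure(0.8, 0.2), 3))  # 0.8
print(round(g_measure(0.9, 0.6), 3))  # 0.554
```

The second case shows why g is used here: high recall alone does not rescue a predictor that raises many false alarms.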










Does transfer learning work for cross project defect prediction?

The data shift issue troubles software defect prediction in two main ways:
1) along the time dimension: for multi-version projects, historical data from early releases may not be applicable to a new release.
2) along the space dimension: when local data is lacking, we sometimes need to do cross-project defect prediction.

We see transfer learning as a potential solution to the data shift problem in defect prediction, because it can learn and predict when the training and test distributions differ. Our ongoing experiments support this argument.

First, we observe a serious drop in performance on cross-release and cross-project defect prediction (i.e., along the time dimension and the space dimension).


Second, after employing the transfer learning framework BRSD (Bias Reduction via Structure Discovery), we do better on both kinds of cross prediction.




--Zhimin 

Wednesday, November 7, 2012

Novel Subspace Clustering


Novel Subspace Clustering algorithm that uses a top-down FP Growth approach for attribute filtering versus the typically agglomerative bottom-up Apriori paradigm.

It produces disjoint clusters of various dimensionality and does not suffer from the exhaustive subspace search problem of bottom-up approaches.  It finds the densest, correlated attributes and then searches the points for patterns (clusters).

As with most subspace and projected clustering algorithms, the clustering is done in a cyclical manner.  In this method, FP Growth is used to find a candidate attribute subset, then EM clustering is performed over the attributes. EM produces multiple clusters, which are tested by classification learners. Good clusters are labeled and removed from the data set, creating disjoint instance clusters.  The null and bad clusters remain in the data set for further cycles of clustering.  All attributes are available for the FP Growth step and may be repeated in later clusters.

This method requires several parameters, for FP Growth, EM clustering, minimum test and stopping criteria. I believe that it will be significantly less sensitive to the parameter values than current methods. I also believe that it will be more computationally efficient than existing techniques since it uses FP Growth to find candidate subspaces escaping a combinatorial search, and removes clustered instances reducing the overall data set size.

Literature surveys compare the performance of methods over varying data set  sizes.  Moise09 varies attribute size and instance size to create several data sets, but does not test against data sets with roughly the same attribute and instance size.  My method was designed for this case, named the n x m problem. Other subspace and projected clustering algorithms suffer as the attributes are increased; whereas my method is suited to scalability.

- Erin

Tuesday, November 6, 2012

Intrinsic dimensionality

Here's an article playing my song:

http://www.stat.berkeley.edu/~bickel/mldim.pdf (cited by 300+ people)

Maximum Likelihood Estimation of Intrinsic Dimension
Elizaveta Levina
Peter J. Bickel

In Advances in NIPS, volume 17. MIT Press, 2005
There is a consensus in the high-dimensional data analysis community that the only
reason any methods work in very high dimensions is that, in fact, the data are not
truly high-dimensional. Rather, they are embedded in a high-dimensional space,
but can be efficiently summarized in a space of a much lower dimension, such as a
nonlinear manifold. Then one can reduce dimension without losing much information
for many types of real-life high-dimensional data, such as images, and avoid
many of the "curses of dimensionality".

Most standard method: the correlation dimension plot log(C(r)) vs log(r):


They go on to offer another method, which I don't understand, but the resulting numbers are very close to (1).

For a sample of these log(C(r)) vs log(r) graphs, see fig1 of http://oldweb.ct.infn.it/~rapis/corso-fsc2/grassberger-procaccia-prl1983.pdf
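
For anyone who wants to poke at the correlation dimension idea, here is a minimal Python sketch. C(r) is taken to be the fraction of distinct pairs of points no more than r apart, and the slope of log(C(r)) vs log(r) between two radii serves as a local dimension estimate. The function names, the toy data and the two radii are my own choices for illustration.

```python
import math
from itertools import combinations

def correlation_integral(points, r):
    """C(r): fraction of distinct pairs whose distance is at most r."""
    n = len(points)
    close = sum(1 for a, b in combinations(points, 2) if math.dist(a, b) <= r)
    return 2.0 * close / (n * (n - 1))

def slope(points, r1, r2):
    """Slope of log C(r) vs log r between radii r1 and r2."""
    c1 = correlation_integral(points, r1)
    c2 = correlation_integral(points, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

# 200 evenly spaced points on a straight line embedded in 3-D:
# three attributes, but only one intrinsic dimension.
line = [(i / 199.0, 0.0, 0.0) for i in range(200)]
print(round(slope(line, 0.05, 0.1), 2))
```

The data has three attributes, but the slope comes out close to 1, which is the point of the measure: it reports the dimensionality of the shape the data lies on, not the number of columns.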





Saturday, October 27, 2012

Looking at the data

These figures are hypercube projections of effort data sets.

The first side was picked at random and runs between any two corners 0,1.

After that, side N=2,3,4... was built by picking a point whose sum of the distances to corners 0,1,...N-1 is maximal.
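
The corner-picking rule can be sketched in a few lines of Python. This is a sketch under assumptions: rows are numeric tuples, distance is Euclidean, and the names `pick_corners` and `seed` are made up here. The actual code behind the figures may differ.

```python
import math
import random

def pick_corners(points, k, seed=1):
    """Pick k corner indices: two at random, then repeatedly the point
    whose summed distance to the corners chosen so far is maximal."""
    rng = random.Random(seed)
    corners = rng.sample(range(len(points)), 2)  # the first side
    while len(corners) < k:
        rest = [i for i in range(len(points)) if i not in corners]
        far = max(rest, key=lambda i: sum(math.dist(points[i], points[c])
                                          for c in corners))
        corners.append(far)
    return corners

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (0.5, 0.5)]
print(pick_corners(pts, 3))
```

Since the first two corners are random, different seeds give different hypercubes over the same data.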





















Tuesday, October 23, 2012

Effect size: the new religion

The paper about the effect size metrics in software engineering is here: http://www.sciencedirect.com/science/article/pii/S0950584907000195

In case you need the bibtex:
@article{DBLP:journals/infsof/KampenesDHS07,
  author    = {Vigdis By Kampenes and
               Tore Dyb{\aa} and
               Jo Erskine Hannay and
               Dag I. K. Sj{\o}berg},
  title     = {A systematic review of effect size in software engineering
               experiments},
  journal   = {Information {\&} Software Technology},
  volume    = {49},
  number    = {11-12},
  year      = {2007},
  pages     = {1073-1086},
  ee        = {http://dx.doi.org/10.1016/j.infsof.2007.02.015},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}


For code that implements the Hedges' g measure recommended by this paper, see below (the second-to-last function: ghedge). Note that it's kinda simple. For this to work, you need a value for "small". As per the recommendations of the above paper, I use 0.17.


function scottknott(datas,small,  data) {
  for(data in datas) 
    sk1(datas[data],small,1,1,length(datas[data]),0) 
}
function sk1(rxs,small,rank,lo,hi,  cut,i) { 
  cut = sk0(rxs,small,lo,hi)
  if (cut) {
    rank = sk1(rxs,small,rank,lo,cut-1) + 1
    rank = sk1(rxs,small,rank,cut,hi) 
  } else {
    for(i=lo;i<=hi;i++)
      rxs[i]["rank"] = rank
  }
  return rank
}
function sk0(a,small,lo,hi,\
             s,  s2, n, mu,                     \
             ls,ls2,ln,lmu,                     \
             rs,rs2,rn,rmu,                     \
             i,best,tmp,cut) {
  for(i=lo;i<=hi;i++) {
    s += a[i]["s"]; s2+= a[i]["s2"]; n += a[i]["n"]
  }
  mu= s/n
  best = -1; 
  for(i=lo+1;i<=hi;i++) {
    ls  += a[i-1]["s"]; 
    ls2 += a[i-1]["s2"]; 
    ln  += a[i-1]["n"]
    rs   = s - ls; 
    rs2  = s2- ls2; 
    rn   = n - ln; 
    rmu  = rs/rn
    lmu  = ls/ln
    tmp = ln/n * (mu - lmu)^2 + rn/n * (mu - rmu)^2 
    if (tmp > best) 
      if  (ghedge(rmu,lmu,rn,ln,rs,ls,rs2,ls2) > small) 
        if (significantlyDifferent(a,lo,i,hi)) {
            best = tmp; cut = i }}
  return cut
}
function ghedge(rmu,lmu,rn,ln,rs,ls,rs2,ls2,\
              n, ln1,rn1,rsd, lsd,sp,correct,g) {
  n       = rn + ln
  ln1     = ln - 1
  rn1     = rn - 1  
  rsd     = sd(rn,rs,rs2)
  lsd     = sd(ln,ls,ls2)
  sp      = sqrt((ln1*lsd^2 + rn1*rsd^2)/(ln1 + rn1))
  correct = 1 - 3/(4*(n - 2) - 1)
  return abs(lmu - rmu)/sp * correct
}
function significantlyDifferent(a,lo,cut,hi,\
                               i,j,x,r,l,rhs,lhs) {
  for(i=lo;i<=hi;i++) 
    for(j in  a[i]["="]) {
      x = a[i]["="][j]
      i < cut ?  lhs[++l] = x :  rhs[++r] = x
    }
  print ":lo",lo,":cut",cut,":hi",hi,":left",l,":right",r > "/dev/stderr"
  return bootstrap(lhs,rhs,500,0.05)
}
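
To see the ghedge arithmetic on raw numbers, here is a small Python re-statement of the same formula: pooled standard deviation, then the small-sample correction 1 - 3/(4*(n - 2) - 1). It assumes sd() above is the usual sample standard deviation (n - 1 in the denominator); the function name `hedges_g` and the sample lists are invented for illustration.

```python
import math
from statistics import mean, stdev

def hedges_g(xs, ys):
    """Hedges' g: absolute mean difference over the pooled sd,
    times the small-sample correction used in ghedge."""
    nx, ny = len(xs), len(ys)
    sp = math.sqrt(((nx - 1) * stdev(xs) ** 2 + (ny - 1) * stdev(ys) ** 2)
                   / (nx + ny - 2))
    correct = 1 - 3 / (4 * (nx + ny - 2) - 1)
    return abs(mean(xs) - mean(ys)) / sp * correct

small = 0.17
a = [10, 12, 11, 13, 12]
b = [10, 12, 11, 13, 13]   # nearly the same as a
c = [20, 22, 21, 23, 22]   # a, shifted up by 10
print(hedges_g(a, b) > small)  # False
print(hedges_g(a, c) > small)  # True
```

So a and b would not be split by sk0 even if some other test called them different, since the effect is too small to matter.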

Monday, October 22, 2012

the opposite of "parameter tuning is bad"


"With more parameters data sparsity becomes an issue again, but with proper smoothing the models are usually more accurate than the original models. Thus, no matter how much data one has, smoothing can almost always help performance, and for a relatively small effort."

"Adding free parameters to an algorithm and optimizing these parameters on held-out data can improve performance."

http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

WHERE = spill tree \tau = 0

fig4 reports better precision with higher \tau

but what about false alarms?


"As should be expected, for all splitting rules and spill thresholds, recall performance degrades as the
maximum depth of the tree increases." ????

http://unbox.org/stuff/doc/11rptrees.pdf

Random projections rule

Note the computations shown in Fig2. So fast!

http://unbox.org/stuff/doc/01randomProjections.pdf

Graphics beats stats!

behold the biggest big-ass effect i've ever seen

lowest errors
long training times
ok test times

http://unbox.org/stuff/doc/11clusterHyperCube.pdf


Friday, October 19, 2012

a few random projections is enough



In the following, e=0 means use all attributes and e>0 means stop after finding e better projections

The way this works is that it builds "e" projections as follows:
  • Draw a line between two randomly selected points.
  • All the other points are then mapped onto this line using the cosine rule.
  • The error of this line is the mean distance from the line to all the points around it (found using Pythagoras).
  • Keep the projection if its error is less than that of the best projection found so far.
In the e>0 experiments, the weight given to each projection is biased by the error of each projection:
  • Weight[X] = error[X] / error[W]
where "W" is the worst error seen in all projections.

After all that, it's Euclidean distance times weight over the projections.
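
The cosine rule and Pythagoras steps are easy to get backwards, so here is a minimal Python sketch of just those two steps. It assumes Euclidean distance and takes the two pivot points as arguments rather than picking them at random; the names `project` and `line_error` are mine, not from the experiment code.

```python
import math

def project(p, a, b):
    """Map point p onto the line from a to b.
    Returns (x, y): x is the distance along the line from a (cosine rule),
    y is the distance from p to the line (Pythagoras)."""
    c = math.dist(a, b)
    da, db = math.dist(p, a), math.dist(p, b)
    x = (da ** 2 + c ** 2 - db ** 2) / (2 * c)
    y = math.sqrt(max(0.0, da ** 2 - x ** 2))  # clamp tiny negative round-off
    return x, y

def line_error(points, a, b):
    """Mean distance from the points to the line through a and b."""
    return sum(project(p, a, b)[1] for p in points) / len(points)

a, b = (0.0, 0.0), (4.0, 0.0)
x, y = project((1.0, 3.0), a, b)
print(round(x, 6), round(y, 6))  # 1.0 3.0
```

Only distances are needed, never coordinates, which is why this works for any data with a distance function.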

In the following, note how e=0 is not as critical as other factors (except in big datasets like China). The LHS number comes from ScottKnott which ignores small effect size (Hedges' corrected g = 0.17) and which uses bootstrapping (500 samples, 95% confidence) to test for differences.


----| 0 |----------------------
  1,    k=4,n=1,e=0,   18,   36,   66,|,  36,  48, 4029,>,-*-      |          ,<, 
  1,    k=2,n=1,e=0,   17,   37,   63,|,  37,  46, 3800,>,-*-      |          ,<, 
  1,    k=1,n=1,e=0,   21,   40,   68,|,  40,  47, 1943,>,-*-      |          ,<, 
  2,    k=4,n=1,e=1,   33,   61,  145,|,  61, 112, 4551,>, --*---- |          ,<, 
  2,    k=4,n=1,e=2,   33,   61,  145,|,  61, 112, 4551,>, --*---- |          ,<, 
  2,    k=4,n=1,e=4,   33,   62,  144,|,  62, 111, 4551,>, --*---- |          ,<, 
  2,    k=2,n=1,e=1,   30,   63,  132,|,  63, 102, 7761,>, --*---  |          ,<, 
  2,    k=2,n=1,e=4,   32,   63,  132,|,  63, 100, 7761,>, --*---  |          ,<, 
  2,    k=2,n=1,e=2,   32,   64,  139,|,  64, 107, 7761,>, --*---- |          ,<, 
  2,    k=1,n=1,e=1,   37,   66,  108,|,  66,  71, 8926,>, --*--   |          ,<, 
  2,    k=1,n=1,e=2,   36,   66,  108,|,  66,  72, 8926,>, --*--   |          ,<, 
  2,    k=1,n=1,e=4,   36,   66,  107,|,  66,  71, 8926,>, --*--   |          ,<, 

----| /home/timm/tmp/desharnais.dash |----------------------
  1,    k=4,n=1,e=0,   11,   29,   55,|,  29,  44, 378,>,*-       |          ,<, 
  2,    k=4,n=1,e=2,   20,   33,   68,|,  33,  48, 883,>,-*-      |          ,<, 
  2,    k=4,n=1,e=4,   18,   33,   68,|,  33,  50, 883,>,-*-      |          ,<, 
  2,    k=2,n=1,e=0,   16,   36,   59,|,  36,  43, 320,>,-*       |          ,<, 
  2,    k=4,n=1,e=1,   16,   37,   69,|,  37,  53, 854,>,-*-      |          ,<, 
  2,    k=1,n=1,e=0,   19,   41,   66,|,  41,  47, 387,>,-*-      |          ,<, 
  3,    k=2,n=1,e=4,   15,   45,   77,|,  45,  62, 764,>,--*-     |          ,<, 
  3,    k=2,n=1,e=2,   19,   47,   83,|,  47,  64, 764,>,--*-     |          ,<, 
  3,    k=2,n=1,e=1,   19,   48,   79,|,  48,  60, 844,>,--*-     |          ,<, 
  3,    k=1,n=1,e=2,   24,   48,   76,|,  48,  52, 655,>,--*-     |          ,<, 
  3,    k=1,n=1,e=4,   25,   48,   78,|,  48,  53, 1220,>,--*-     |          ,<, 
  3,    k=1,n=1,e=1,   24,   52,   85,|,  52,  61, 655,>,--*-     |          ,<, 

----| /home/timm/tmp/miyazaki94.dash |----------------------
  1,    k=1,n=1,e=0,   29,   42,   75,|,  42,  46, 655,>,-*--     |          ,<, 
  1,    k=2,n=1,e=1,   17,   43,   88,|,  43,  71, 490,>,-*--     |          ,<, 
  1,    k=4,n=1,e=1,   25,   45,   95,|,  45,  70, 340,>,--*--    |          ,<, 
  1,    k=2,n=1,e=0,   27,   45,   88,|,  45,  61, 472,>,--*-     |          ,<, 
  1,    k=1,n=1,e=2,   16,   45,   83,|,  45,  67, 648,>,--*-     |          ,<, 
  1,    k=1,n=1,e=4,   16,   45,   83,|,  45,  67, 648,>,--*-     |          ,<, 
  1,    k=4,n=1,e=0,   18,   45,   73,|,  45,  55, 370,>,--*      |          ,<, 
  1,    k=4,n=1,e=2,   30,   51,   81,|,  51,  51, 602,>, -*-     |          ,<, 
  1,    k=4,n=1,e=4,   30,   51,   81,|,  51,  51, 602,>, -*-     |          ,<, 
  1,    k=2,n=1,e=2,   28,   57,   92,|,  57,  64, 472,>,--*--    |          ,<, 
  1,    k=2,n=1,e=4,   28,   57,   92,|,  57,  64, 472,>,--*--    |          ,<, 
  1,    k=1,n=1,e=1,   35,   60,  100,|,  60,  65, 648,>, --*-    |          ,<, 

----| /home/timm/tmp/nasa93.dash |----------------------
  1,    k=2,n=1,e=0,   22,   52,   86,|,  52,  64, 2058,>,--*-     |          ,<, 
  1,    k=1,n=1,e=0,   15,   50,   93,|,  50,  78, 900,>,--*--    |          ,<, 
  1,    k=1,n=1,e=1,   33,   54,   95,|,  54,  62, 4207,>, -*--    |          ,<, 
  1,    k=4,n=1,e=0,   27,   58,  104,|,  58,  77, 4029,>,--*--    |          ,<, 
  1,    k=1,n=1,e=2,   33,   62,  104,|,  62,  71, 4207,>, --*-    |          ,<, 
  1,    k=1,n=1,e=4,   33,   63,  104,|,  63,  71, 4207,>, --*-    |          ,<, 
  1,    k=4,n=1,e=1,   40,   73,  123,|,  73,  83, 1719,>, --*---  |          ,<, 
  1,    k=2,n=1,e=1,   29,   73,  123,|,  73,  94, 2443,>,---*---  |          ,<, 
  1,    k=2,n=1,e=4,   29,   74,  132,|,  74, 103, 2443,>,---*---  |          ,<, 
  1,    k=4,n=1,e=2,   40,   74,  129,|,  74,  89, 1719,>, --*---  |          ,<, 
  1,    k=2,n=1,e=2,   29,   75,  146,|,  75, 117, 2443,>,----*--- |          ,<, 
  1,    k=4,n=1,e=4,   40,   74,  126,|,  74,  86, 1719,>, --*---  |          ,<, 

----| /home/timm/tmp/china.dash |----------------------
  1,    k=4,n=1,e=0,   16,   30,   53,|,  30,  37, 1139,>,-*       |          ,<, 
  1,    k=2,n=1,e=0,   14,   31,   49,|,  31,  35, 1609,>,-*       |          ,<, 
  1,    k=1,n=1,e=0,   18,   36,   56,|,  36,  38, 1943,>,-*       |          ,<, 
  2,    k=4,n=1,e=1,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<, 
  2,    k=4,n=1,e=2,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<, 
  2,    k=4,n=1,e=4,   35,   62,  159,|,  62, 124, 4551,>, --*-----|          ,<, 
  2,    k=2,n=1,e=1,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<, 
  2,    k=2,n=1,e=2,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<, 
  2,    k=2,n=1,e=4,   34,   65,  151,|,  65, 117, 7761,>, --*-----|          ,<, 
  2,    k=1,n=1,e=1,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<, 
  2,    k=1,n=1,e=2,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<, 
  2,    k=1,n=1,e=4,   37,   69,  115,|,  69,  78, 8926,>, --*--   |          ,<, 

----| /home/timm/tmp/finnish.dash |----------------------
  1,    k=1,n=1,e=0,   23,   36,   79,|,  36,  56, 950,>,-*--     |          ,<, 
  1,    k=1,n=1,e=1,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<, 
  1,    k=1,n=1,e=2,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<, 
  1,    k=1,n=1,e=4,   31,   50,   85,|,  50,  54, 1761,>, -*-     |          ,<, 
  1,    k=4,n=1,e=0,   24,   51,  120,|,  51,  96, 1120,>,--*----  |          ,<, 
  1,    k=4,n=1,e=1,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<, 
  1,    k=4,n=1,e=2,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<, 
  1,    k=4,n=1,e=4,   28,   52,  121,|,  52,  93, 809,>,--*----  |          ,<, 
  1,    k=2,n=1,e=0,   23,   56,   81,|,  56,  58, 489,>,--*-     |          ,<, 
  1,    k=2,n=1,e=1,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<, 
  1,    k=2,n=1,e=2,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<, 
  1,    k=2,n=1,e=4,   23,   58,   88,|,  58,  65, 1467,>,--*-     |          ,<, 

----| /home/timm/tmp/coc81.dash |----------------------
  1,    k=2,n=1,e=0,   40,   76,  222,|,  76, 182, 3800,>, ---*----|---       ,<, 
  1,    k=2,n=1,e=1,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<, 
  1,    k=2,n=1,e=2,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<, 
  1,    k=2,n=1,e=4,   46,   76,  262,|,  76, 216, 1286,>,  --*----|------    ,<, 
  1,    k=1,n=1,e=0,   39,   76,  163,|,  76, 124, 1240,>, ---*----|          ,<, 
  1,    k=4,n=1,e=1,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<, 
  1,    k=4,n=1,e=2,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<, 
  1,    k=4,n=1,e=4,   45,   80,  284,|,  80, 239, 977,>,  --*----|-------   ,<, 
  1,    k=1,n=1,e=1,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<, 
  1,    k=1,n=1,e=2,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<, 
  1,    k=1,n=1,e=4,   56,   80,  160,|,  80, 104, 1523,>,  --*----|          ,<, 
  1,    k=4,n=1,e=0,   36,   81,  224,|,  81, 188, 3188,>, ---*----|---       ,<, 


Sunday, October 14, 2012

Erin's bio

Erin Moore is pursuing a doctorate in Computer Engineering at WVU where she has been awarded a WV EPSCoR Cancer, Energy and Security Nanotechnology Fellowship for the year of 2012.  Her research uses Data Mining to find patterns in biological data.  Specifically, she is predicting a special type of DNA sequences called Aptamers.  These molecules can be used for targeted pharmaceutical therapies and in sensors to detect the presence of chemical substances at a very low concentration. She would like to continue developing Artificial Intelligence algorithms to further medical diagnostics and pharmaceutical development technologies.

Erin's academic and professional experience is detailed at LinkedIn. 

A paper from Domingos on some basic things about data mining

Vasil's Info


Vasil is currently pursuing his MSc in Computer Science at West Virginia University. He previously received a BSc in Computer Science from the University of Tirana. His main interest is using data mining techniques in setting up public health policies. He is currently collaborating with the WVU’s Regional Research Institute on a project that investigates the influence of community nutrition environment on child obesity.  

 For more information please visit: http://al.linkedin.com/in/vpapakroni.

Ekrem Kocaguneli

Ekrem started his CS career at Bogazici University, where he received his BSc and MS degrees. He is currently a PhD candidate in the LCSEE department at West Virginia University. His main research interests are using machine learning methods to find solutions to difficult software estimation problems, analyzing various types of software data, and investigating stable conclusions in empirical software engineering. He maintains a personal web site (www.kocaguneli.com), where he posts his ideas about software engineering problems, announces newly accepted papers, and so on. 
His publications are listed below:
  • Journal
    • J. Keung, E. Kocaguneli, T. Menzies, “Finding Conclusion Stability for Selecting the Best Effort Predictor in Software Effort Estimation”, Automated Software Engineering, to appear in 2012.
    • E. Kocaguneli, T. Menzies, A. Bener, J. Keung, “Exploiting the Essential Assumptions of Analogy-based Effort Estimation”, IEEE Transactions on Software Engineering, vol. 38, no.2, pp. 425-438, March-April 2012.
    • E. Kocaguneli, T. Menzies, J. Keung, “On the Value of Ensemble Effort Estimation”, IEEE Transactions on Software Engineering, pre-prints, 2011.
    • E. Kocaguneli, T. Menzies, J. Keung, “Kernel Methods for Software Effort Estimation”, Empirical Software Engineering Journal, pre-prints, 2011.
    • A. Brady, T. Menzies, E. Kocaguneli, “What Is Enough Quality for Data Repositories?”, Software Quality Professional, 2011.
  • Conference
    • E. Kocaguneli, T. Menzies, “How to Find Relevant Data for Effort Estimation”, International Symposium on Empirical Software Engineering and Measurement (ESEM) 2011
    • E. Kocaguneli, B.Caglayan, A. Tosun, A. Bener, “Experiences on Developer Participation and Effort Estimation”, EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) 2011 
    • E. Kocaguneli, G. Gay, Y. Yang, T. Menzies, “When to Use Data from Other Projects for Effort Estimation”, International Conference on Automated Software Engineering (ASE) 2010
    • E. Kocaguneli, A. Tosun, A. Bener, “AI-Based Models for Software Effort Estimation”, 36th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) 2010
    • E. Kocaguneli, Y. Kultur, A. Bener, “Combining Multiple Learners Induced on Multiple Datasets for Software Effort Prediction”, Proceedings of International Symposium on Software Reliability Engineering (ISSRE) 2009.
    • E. Kocaguneli, A. Tosun, A. Bener, B. Turhan, B. Caglayan, “Prest: An Intelligent Software Metrics Extraction, Analysis and Defect Prediction Tool”, Proceedings of International Conference on Software Engineering and Knowledge Engineering (SEKE) 2009.
    • Ayse Tosun, Ayse Bener, Ekrem Kocaguneli, “BITS: Issue Tracking and Project Management Tool in Healthcare Software Development”, Proceedings of International Conference on Software Engineering and Knowledge Engineering (SEKE) 2009.
    • A. Bakir, E. Kocaguneli, A. Tosun, A.Bener and B. Turhan “Xiruxe: An Intelligent Fault Tracking Tool”, Proceedings of International Conference on Artificial Intelligence and Pattern Recognition (AIPR) 2009.
    • Y. Kultur, E. Kocaguneli, A. Bener, “Discovering Cost Related Features: Focus on Classification Models”, Poster Paper in Empirical Software Engineering and Measurement (ESEM) 2009.
    • E. Kocaguneli, A. Tosun, B. Caglayan, A. Bener, T. Aytac, B. Turhan, “Bulutlarda Akilli Bir Yazilim Olcumleme, Hata Analiz ve Tahmin Araci: Prest”, (in Turkish), 2nd National Symposium on Software Engineering (UYMS) 2009
    • Y. Kultur, E. Kocaguneli, A. Bener, “Domain Specific Phase by Phase Effort Estimation in Software Projects”, Proceedings of International Symposium on Computer and Information Sciences (ISCIS) 2009.

Friday, October 12, 2012

Who is Joe

Joe Krall

I'm Joe, Ph.D. student at WVU, studying games with computer science, aiming for a high level job in the gaming industry; an indoor outdoors-man, writer, artist, and gamer at heart.

Education

 - B.S. in Computer Science at University of Pittsburgh at Johnstown (2008)
 - B.S. in Mathematics at University of Pittsburgh at Johnstown (2008)
 - M.S. in Computer Science at West Virginia University (2010)
 - Ph.D. pending, in Computer Science at West Virginia University (2013)

Research Interests

 - Studying Games
 - Cognitive Psychology & Believability
 - Data Mining
 - Multiple Objective Optimization

Publications

 - JSEA'12 "Aspects of Replayability and Software Engineering: Towards a Methodology of Developing Games" (The most downloaded paper in Vol.5 No. 7) [Link]

Abilities

 - Excellent Writer
 - Good with Graphics & Presentation
 - Intensive Programmer

Leisure Interests

 - Trap Shooting
 - Archery
 - Gaming