Monday, June 24, 2013

EMFP Wilcoxon Results on Synthetic Dbs

In all of the tables, higher numbers were preferred, so with pf, and RMSE it is preferable to lose.

The Wilcoxon tables need the header row repeated down the side to indicate the comparisons.

Wilcox Tables
Win-Lose Tables

Brian: what I do

MSCS Computer Science - June 2013 -> ???

Research: Data Reduction with Vasil on Public Health Data
          Data-Quality

Projects: I am currently working on a project to test some of the current 
"Data Cleaning Techniques" outlined below. First to see how many of 
our current datasets contain such "problem data" then to see if injecting 
"problem data" into existing datasets will actually have any meaningful effect.

Some examples of "problem data" as defined here. 
Features 
1) Identical Features
2) Constant Features
3) Features with Missing Values
4) Features with Implausible Values

Instances
1) Identical Cases
2) Inconsistant Cases
3) Cases with Missing Values
4) Cases with Conflicting Values
5) Cases with Implausible Values

The current approach outlined in the paper above is to lump all of 
the data falling into these categories together then throw them out 
thus cleaning the data. 
 

Fayola: Okay, here's the story

Prototype Learning

During my masters, I worked on prototype learning. Created the CLIFF algorithm and applied it to the forensic science domain as a forensic interpretation model.

Prototype learning algorithms are designed to eliminated the drawbacks of the K-nearest neighbor algorithm:
  1. The high computation costs caused by the need for each test sample to find the distance between it and each training sample.
  2. The storage requirement is large since the entire dataset needs to be stored in memory.
  3. Outliers can negatively affect the accuracy of the classifier.
  4. The negative effect of data sets with non-separable and/or overlapping classes.
  5. The low tolerance to noise.
The CLIFF algorithm reduced the effects of these drawbacks significantly better that the state-of-art prototype learning algorithms.

CLIFF takes a dataset and for each class ranks (power) each attribute sub-range using BORE. Multiply ranks of each row then select the most powerful rows of each class.

Privacy

Now I use CLIFF along with MORPH to privatized defect datasets. CLIFF removes the overlap of classes with instance reduction while MORPH moves remaining data to low density areas and avoids overlap.


y = x ± (x − z) ∗ r
  • x ∈ D, the original instance;
  • z ∈ D the NUN of x;
  • y the resulting MORPHed instance.

Erin : What I do

Biography:
  • 10 years industry experience as Software and Systems Engineer
  • Masters Systems Engineering, University of Arizona, 2006
  • Doctoral Student - Computer Engineering
    • NSF WVEPSCOR STEM Fellow 2012
    • NSF WV NanoSAFE Fellow 2013
Research Interests:
  • Unsupervised Learning
  • Association Learning
  • Biclustering
  • Large Datasets where Attributes >= Instances
  • Bioinformatics
Research Project: 

Crowd Heuristics Algorithm Learning Kit (CHALK)

    Goal: Develop a Virtual Screeing for DNA Aptamers leveraging insights from aptamer researchers

    Algorithm: EMFP
  • Combining Association Learning and Probabilistic Clustering into a Biclustering Framework
  • Pattern-based, Soft Biclustering Algorithm for Binary Dbs
  • High quality ( F, pd, prec), low error ( RMSE, pf ) statistically relevant biclusters
  • Compared to BicBin, vanUitert08

        Paper pending internal review

   Further Research:
  • Show that Prelic's Match Score is biased against statistically relevant small clusters in high noise databases. Prelic06
  • Develop a streaming EMFP implementation



Short presentation on slideshare


http://www.slideshare.net/abedsayyad/my-summary-6242013

Vasil: What I do

Master's CS Computer Science student since 2011, expected to graduate on August 2013.

Research scope: Data Reduction. Condense large training data into succinct summaries. Enabling users to review and analyze raw data.

Algorithm: PEEKING2 - A tool for data carving.

Implementation:  
  1. Feature Selection via Information Gain.
    Prune irrelevant features, select 25% of features with highest Info Gain.
  2. FASTMAP.
    Project into the direction of the greatest variability, applying a PCA-like linear time projection.
  3. Grid-clustering.
    Recursively split clusters by the median of each projected dimension.
  4. Centroid estimation.
    Replace each data cluster with its centroid.

 Inference:
  1. Instance base learning. (k=2 Nearest Neighbor)
    Extrapolate between centroids to make predictions.
  2. Contrast set rule learning.
    Generated rules estimating the deltas btw. centroids. 

Experiments:
  • PEEKING2 applied on 10 defect data sets and 10 effort data sets from PROMISE.
  • Large data reduction is being observed: 93% of original data is reduced.
  • Little information is lost. In most of the cases, k=2 NN applied on condensed data performed as well or better as other state of the art algorithm applied on overall data.
Addition research:
Applied data mining techniques to reduce the cost of data collection for a public health study conducted by WVU.

  • Applying Correlation-based feature selection, we could drastically reduced the number of features without significantly impacting the performance of Linear Regression.
  • We have also observed some degree of stability across different geographical regions.
  • Small samples of stores (20%) can be used to make prediction for the rest.
  • Samples are selected to minimize the travel distance btw. stores

Sunday, June 23, 2013

Timm: What I do

For years I've been thinking that our learners are really a menu of options that we can mix and mash to achieve other goals.

Why Bother?

  1. Its good science
  2. Its fun
  3. I've become a little tired of the same dull old data mining in SE paper. Sure, we can get results from data miners on SE data sets. Ok. Next? 
  4. Recent papers report that there’s little to be gained from such algorithm mining because the “improvements” found from this approach are marginal, at best:

And if it works, what would we expect?

  • Simpler implementations.
    •  If we cut down to the essence, then the essence should be less than the confusion we cleared away.
  • Faster implementations. 
    • Again, if we cut the crap, the rest should run faster.
  • Scalable inference. 
    • Era of big data. Need scalability. 
  • More services: 
    • the parts should be mashable into new services, not currently offered.

What new services?

  • Local learning
  • Decision mining:
    • aka weighted contrast set learning
    • planning (what to do)
      • learning decisions from now to better
    • monitoring (what to avoid)
      • learning decisions from now to worse
  • Discussion mining
    • Learn how a community's bias by what decisions they take, avoid, retake, undo
  • Temporal reasoning. 
    • Anytime inference. Recognition of anomalies. Pinpointing which part of it all needs updating. Updating just that section.
  • Privacy:
    • Can do
  • Quality assurance
  • Transfer learning:
    •  keeping what can be kept, changing what needs to change as we move
      •  across organizations (cross-company learning)
      •  across time (dataset shit)
  • MOEA :
    •  the distinction between traditional learning and optimization is bogus. 
  • Inference at the business level: 
    • ability to model business goals and tune the learners appropriately.

What business goals:

From Buse & Zimmermann: Information needs, 100+ managers, Microsoft



Under the hood (the current implementation):

Fast spectral learning (map to eigenvectors, cluster, reason from there)
  • O(N) inference with Fastmap
  • Defect data set (POI-3.). Green=less defective
  • Cluster in 2D space (fast)
  • Find neighboring clusters C1,C2 where scores(C1) > scores(C2)
  • Learn contrast C1 to C2 
Papers



Wednesday, June 19, 2013

Seminar: Mernik on Evolutionary Algorithms


Seminar notification, WVU, CSEE, AI lab

When: 1:30pm Mnday June 24

Where: ESB, room 1011 (map)

What: A Parameter Control Method of Evolutionary Algorithms Using Exploration and Exploitation Measures

Abstract: Exploration and exploitation are omnipresent terms in evolutionary computation community that have been broadly utilized to explain how evolutionary algorithms perform search. However, only recently exploration and exploitation measures were presented in a quantitative way enabling to measure amounts of exploration and exploitation. To move a step further, this talk introduces a parameter control approach that utilizes such measures as feedback to adaptively control evolution processes. With new exploration and exploitation measures, the evolution process generates relatively well results in terms of fitness when applying to a practical chemical engineering problem of fitting Sovova’s model.

Who: Marjan Mernik received the M.Sc. and Ph.D. degrees in computer science from the University of Maribor in 1994 and 1998, respectively. He is currently Professor of Computer Science at the University of Maribor, Slovenia. He is also Visiting Professor of Computer and Information Sciences at the University of Alabama at Birmingham, and at the University of Novi Sad, Faculty of Technical Sciences, Serbia. His research interests include programming languages, compilers, domain-specific (modeling) languages, grammar-based systems, grammatical inference, and evolutionary computations.  He is a member of the IEEE, ACM and EAPLS.

Tuesday, June 18, 2013

GUI Development With Tkinter

Absolute necessities:

Read and parse an input file composed of centroids.

Display a graph with the centroids.

Allow the end user to select two centroids and view the deltas between them.



Other concerns:


Quality of the UI (Ease of use, etc.)

Portability

(Concerning the GUI package used in development) How easy is install? Can the package do what is needed of it?


Between WxPython and Tkinter, Tkinter showed itself to be an exceedingly easy deployment.
Tkinter is not a complex or versatile package.

Not really important anyways -- graph display/centroid selection is all relatively simple, as is feeding data out to a user.


Development:

The package is extremely simplistic. I can read mouse input and place widgets such as buttons, text displays, "canvases", etc. with ease. The next goal is to build the program so as to display the graph using a canvas widget.

Another necessary program feature is reading file input. I have not yet worked on this, and will begin delving into this area by the end of the week. Concerns in this area are namely portability (due to a differing of file structure/naming archetectures)


EMFP comparison to BicBin

EMFP a sparse binary biclustering algorithm

How to enumerate solutions using Z3



http://stackoverflow.com/questions/11867611/z3py-checking-all-solutions-for-equation

1) Find first solution, record it
2) Rerun SAT blocking the previous solution, find one more solution, record it
3) Rerun SAT blocking the first 2 solutions, and so forth.

Strange that SAT has to be repeated all over again each time. This is why enumeration takes time.

Need to:
1) Read DIMACS model, and run the above routine.
2) Not all the solutions will be useful. What I need is feature-rich solutions.

Tuesday, June 4, 2013

How not to drive the international research community

http://storify.com/timmenzies/how-not-to-drive-the-international-research-commin#publicize

recoding peeker2

diabetes (probability of positive for diabetes)


housing (normalized resale value 0..1, min..max)


poi -1.5*log(p(defects)

Such sweet code:

def Abcx(row,c,left,right):
  a = dist(row,left)
  b = dist(row,right)
  x = (a**2 + c**2 - b**2)* 1.0 / (2*c + 1.0/The.inf)
  y = (a**2 - x**2)**0.5
  return Thing(x=x,y=y,row=row,a=a,b=b)

def quadrants(t,demo=False):
  z     = one(t.rows)
  left  = furthest(z)
  right = furthest(left)
  out = []
  quad1(t,left,right,out,demo)
  return out

def quad1(t,l,r,out,demo):
  t.left  = l
  t.right = r
  c = t.c = dist(l,r)
  x,y     = [],[]
  for row in t.rows:
    xy = Abcx(row,c,l,r)
    if   xy.a > t.c : return quad1(t,l,row,out,demo)
    elif xy.b > t.c : return quad1(t,row,r,out,demo)
    else:
      x += [xy]
      y += [xy]
  x = sorted(x,key=lambda z:z.x)
  y = sorted(x,key=lambda z:z.y)
  hi   = len(x) 
  stop = The.enough(hi)
  print "stop>",stop
  tile(x,y,0,hi,0,hi,out,stop,demo,2) 

def tile(x,y,x0,x2,y0,y2,out,stop,demo,depth=0):
  def tile1(x0,x2,y0,y2,out):
    tmp = []
    for one in x[x0:x2]:
      for two in y[y0:y2]:
        if one.row.row == two.row.row:
          tmp += [one]
    if demo:
      print '--|'*depth,len(tmp)
    if len(tmp) > stop: 
      tile(x,y,x0,x2,y0,y2,out,stop,demo,depth+1)
    else: 
      out += [tmp] 
   
  x1 = x0 + (x2 - x0)/2
  y1 = y0 + (y2 - y0)/2
  tile1( x0,x1, y0,y1 ,out)
  tile1( x0,x1, y1,y2 ,out) 
  tile1( x1,x2, y0,y1 ,out)
  tile1( x1,x2, y1,y2 ,out)

icse'13 report

9 days. 2 conferences. 3 workshops. 1 tutorial.

MSR


data science tutorial

  • 25 attendees
  • over half from industry
  • ekrem reports that in the san fran start up culture, anyone with a little data mining skills goes far

 CMSBSE (combining Modelling and Search-Based Software Engineering)

  • abdel's workshop talk went well
Dapse (data analysis patterns)

book contract signing
  • book #2 (sharing data and models) is now underway
  • book #1 (data analysis patterns book) scuttled by dapse workshop. field currently too diverse to accept analysis patters
ICSE

  • abdel's talk went well. and, the ever present question, why didn't you use a theorem prover
  • sonia (from wayne state)'s talk went well

RAISE (realizing AI synergies in SE)


Wednesday, May 1, 2013

Vargha-Delaney Effect Size

Hedge's effect size should not be used when your data isn't known to have a normal distribution, thus you would have to use a non-parametric effect size calculation, e.g. Vargha-Delaney. Check out the description and R code.

http://doofussoftware.blogspot.com/2012/07/measuring-effect-size-with-vargha.html

Wednesday, April 24, 2013

pom decision frequencies + html5

POM3 Results (Showing POM3A and POM3B):

10 even width bins between min and max
min <= val < bin1
bin1 <= val < bin2
...
etc





HTML5

Popular Buzz Word of recent; very powerful (cool and fast) with Javascript + jquery + CSS3.
(initially opened in 2011, updates in 2012, became a WC3 recommendation Dec 2012.)

- Some stuff that's new with HTML5: http://net.tutsplus.com/tutorials/html-css-techniques/25-html5-features-tips-and-techniques-you-must-know/
- HTML5 is the way to go today.

Tools:

http://craftyjs.com/ - a game library set of tools for javascript to make html5 games
(one of many, there's also game engines and interface-only designers not-for-free)

http://html5games.com/ - popular web site to submit your games and gain audiences/revenue
(there are several other sites, as well)

My Fun-box site & Test of HTML5

fun.unbox.org/testgame

IBEA vs NSGAII for large feature models

10 runs for 2000sec each. pop=300, xover=0.05, mut=0.001, 5 obj

      IBEA With Rule Fixing
FM Features Rules %correct TimeToAnyC TimeTo50C TimeTo100C
toybox 544 1020 40.4% 28.4 NA NA
axTLS 684 2155 99.6% 45.2 187 224
ecos 1244 3146 100% 93.2 115 223
freebsd 1396 62183 98% 315 501 NA
fiasco 1638 5228 99.6% 170 656 810
uClinux 1850 2468 38% 75 NA NA
busybox 6796 17836 1.98% 1350 NA NA

      NSGA-II With Rule Fixing
FM Features Rules %correct TimeToAnyC TimeTo50C TimeTo100C
toybox 544 1020    
axTLS 684 2155    
ecos 1244 3146 0% NA NA NA
freebsd 1396 62183 0% NA NA NA
fiasco 1638 5228 2% 61 NA NA
uClinux 1850 2468 3.33% 26 NA NA
busybox 6796 17836 1.17% 1110 NA NA

Wednesday, April 17, 2013

DIMACS feature models

IBEA With Rule Fixing NSGA-II With Rule Fixing
FM Features Rules %correct TimeToAnyC TimeTo50C TimeTo100C %correct TimeToAnyC  TimeTo50C TimeTo100C
toybox 544 1020 25% 23.2 NA NA 12.4%
axTLS 684 2155 100% 43.8 333 NA
ecos 1244 3146 100% 87.8 129 NA
freebsd 1396 62183 98% 377 717 NA
fiasco 1638 5228 100% 104 444 NA
uClinux 1850 2468 32% 47.8 229 NA
busybox 6796 17836 18% 1190 NA NA

Wednesday, April 10, 2013

GALE + PCG

Updates to GALE:

What if while recursing in RRSL clustering method, we only go down 1 level (split only a single time)?   Previously, we were going down  4 levels (because rstop = square root(mu=100)).

NOTE: We tested below on a really small sample for quick results, using rho=0.005.  That means we were running GALE for a -long- time, and still get results under <30 numEval.  With rho=0.10, the numEval would be ~ 4-5.







Introducing SBSE into game design in two ways:

1) "Outer Loop"

decisions = {all the features used to build the game} *these are categories and non-numeric*
evaluate = the player plays the game, and rates it 0-5 stars across categories
objectives = the player rating (aesthetic, story, gameplay, replay, overall)

A designer builds the game (a supply of the decisions).  Then, a player plays it (evaluates it across objectives).  The evaluation is human feedback, which is subject to human error and bias.  It takes a long time for a player to play the game.  It takes a long time to get the feedback.  And the feedback may not even be correct.  SBSE can be theoretically used, but it would take decades.

GALE can speed this up, tremendously.  But there's still the risk of human error in the feedback.  The method of evaluation is subjective here.  And the decisions may not be easily modelled at all.

2) "Inner Loop"

There are many "inner loops" that complete the single "outer loop" for the whole game.

decisions = {parameters that define one feature of the game} *introducing PCG*
evaluate = the player plays the game, and is scored on his/her performance w/ respect to that feature
objectives = the player score (points, accuracy%, evade%, time taken, etc)

A designer builds the game.  The game supplies the decisions to a PCG module in the game; i.e. the generation of random AI (speed, strength, defense) enemies.  The player plays the stage in the game for which these enemies are met.  After completion of the stage (or failure to), the player is graded across objectives: for example, points, accuracy of hitting the enemies, evasion rate of avoiding enemies, time taken before death or completion, etc.

The inner loop provides SBSE a way to search on the fly while the gamer is playing.  It avoids the issue of human bias/error.  A single stage can be rapid, and there is no overhead of feedback.  The game attunes itself according the players that play it, and searches for optimal PCG that can provide entertainment on the assumption that when players score well, they are having fun.  With GALE, a user only needs to play ~2-8 games before seeing decent results, and after 20 results, the optimality should be superb.

Wednesday, March 20, 2013

EMFP Alg, early stop implemented

My algorithm now stops early if FP Growth finds the same set of in two sequential runs.

This likely means that the attributes do not form statistically significant clusters and that there are likely no more attribute sets that will be stat. sig. because FP Growth finds the most common column sets first.

It is currently running with a cutoff of a 95% Binomial CI, I can lower the CI percentage as Dr. Menzies suggested.

Below is the latest printout demonstrating the early exit.


 time ./trial.sh main synthDb_M100_N500_m5_n10_p0.050000_num1-1.arff

***Initial Values***
confidence levels:      0.99 0.95 0.90 0.85 0.80
min perc levels (rows): 0.05 0.04 0.03 0.02 0.01
min cols:               2
FPGrowth Step level:    .01
FPGrowth Upper limit:   0.5
---------------------------------------
number of times in loop=  1
Confidence Perc =  0.99
Min Rules Perc= 0.05
dvAvg dims=500
 dbAvg rows=100
 running FPGrowth
500,Number of Cols from FPGrowth=  9
in idCentroids file  3 cluster1 cluster2 cluster3 numArffs=  0
End Clause
#Cols
#Cols
#Cols
#Cols
#Cols
#Cols
numArffs=  1
---------------------------------------
number of times in loop=  2
Confidence Perc =  0.99
Min Rules Perc= 0.05
dvAvg dims=500
 dbAvg rows=100
 running FPGrowth
500,Number of Cols from FPGrowth=  9
NumDiff  6
in idCentroids file  3 cluster1 cluster3 cluster4 numArffs=  1
End Clause
#Cols
#Cols
#Cols
#Cols
#Cols
#Cols
numArffs=  2
---------------------------------------
number of times in loop=  3
Confidence Perc =  0.99
Min Rules Perc= 0.05
dvAvg dims=500
 dbAvg rows=100
 running FPGrowth
500,Number of Cols from FPGrowth=  9
NumDiff  0
FP Same Cols-break

real 0m19.057s
user 0m15.725s
sys 0m1.425s

Playing with DIMACS feature models -- updated 4/10/2013

Running IBEA for 1 hour over the available DIMACS feature models
Parameters: Population = 300, constrained mutation at rate = 0.001, NO crossover.
Technique: Before evolution, TWO rounds of rule checking are made:
1) First round: all features standing by themselves in a line are fixed: some are selected (mandatory), and some are deselected ("prohibited"!!!)
2) Second round: all features that share a line with "fixed features" from first round, they become fixed as well.
Mutation is constrained so that it doesn't mess with the fixed features.


Skipped features Skipped Rules
FM Features Rules %correct Round 1 Total Total
toybox 544 1020 100% 361 66% 363 67% 394 39%
axTLS 684 2155 100% 382 56% 384 56% 259 12%
ecos 1244 3146 100% 10 1% 19 2% 11 0%
freebsd 1396 62183 100% 3 0% 3 0% 20 0%
fiasco 1638 5228 100% 995 61% 995 61% 553 11%
uClinux 1850 2468 100% 1234 67% 1244 67% 1850 75%
busybox 6796 17836 100% 3947 58% 3949 58% 2644 15%
uClinuxconfig 11254 31637 4% 6025 54% 6027 54% 4641 15%
coreboot 12268 47091 1% 4592 37% 4672 38% 2060 4%
buildroot 14910 45603 0% 6755 45% 6759 45% 3534 8%
embtoolkit 23516 180511 0% 6370 27% 6619 28% 657 0%
freetz 31012 102705 0% 14444 47% 14493 47% 3911 4%
Linux 2.6.32 60072 268223 0% 32329 54% 32479 54% 18734 7%
Linux 2.6.33 62482 273799 0% 33597 54% 33766 54% 19394 7%
Linux 2.6.28.6 6888 343944 0%








1) Verify how many features are fixed in the first and second rounds of rule checking. Do we benefit by making further rounds? Done. No benefit expected from further rounds.
2) The first 7 feature models are "easy"... why? Do they have a lot of "fixed" features? or just because they're smaller? Done. ecos and freebsd have very few "fixed" features, yet they are solved easily... Size is a factor that is part of the larger notion of "hardness" or "complexity".
3) These 7 can be used in experiments to compare IBEA with others (NSGAII etc)... We already know others will suck. This could be our "scale-up" verification. No.
4) We could also compare this "rule detection" method with complete randomness (ICSE'13 style) but we've already learned not to rely on complete randomness. No.
5) We could try to optimize two objectives only, then plug the result into 5-objective optimization... This can save a lot of wasted "number crunching". We could run 2-objective and identify the rules that are most frequently violated, make recommendations, take feedback, fix a bunch of features... run again.. sound familiar?

Intrinsic Dimensions for Defects and Effort

Wednesday, March 13, 2013

Sticking points in TSE paper


1-      We’re actually expanding two papers: ICSE and CMSBSE.
2-      In ICSE we fix the number of fitness evaluations. In CMSBSE (and TSE) we fix the run time… A reviewer complained about this in CMSBSE…
3-      Statistical analysis needed in addition to effect size.

Wednesday, March 6, 2013

bagging+FSS+dataset selection for defect prediction

http://unbox.org/things/var/zhimin/

SBSE MOO in Few Evals: GALE

GALE = Geometric Active Learning Evolution


Hypervolume versus IBD and IBS
  -  IBD = how good we are
  -  IBS = how spread out we are
  -  Both based on IBEA indicator



More Results:
 - added some more flavors just so we have one of every cookie
Notes:
  - Left table shows the Raw Averages for everyone over 20 repeats
  - Right table uses t-statistic to generate confidence interval (99%) 
  -  7-3-2 on wins-losses-ties when only considering IBD (quality)
  - WIN or LOSS = no overlap of confidence interval
  - RRSL should be GALE, here

Analysis Notes:
  - RRSL wins badly on NumEval
  - NSGA-II outperforms RRSL in runtime for small MOPs that are cheap to evaluate.  But when models become expensive (i.e. POM2), NSGA-II becomes slow and RRSL wins in runtime
  - RRSL is a very decision-sided routine.  For ZDT1 (30 decisions) and osyzcka2( 5 decisions ), the runtime of RRSL is punished

Paper "3/4 Draft":
 - nothing in here, yet: http://unbox.org/things/var/joe/active/
 - promise to be integrated in linux/latex fully by full draft time
 - https://www.dropbox.com/s/y9ey0hr6agsolzt/solving_problems_in_few_evaluations.pdf?m

Wednesday, February 27, 2013

Not Global vs Local

I'm playing with the idea of a global and local rules family, where local rules are the children and global rules the grandparents. The parents are the middle rules. In other words, the local rules are subsets of global and middle rules.

The goal - provide the project manager with a family of options. If a local rule is best but would prefer a more global solution, find its global and middle relatives and see if they work just as well.

My initial experiment is a little flawed by my simple rule score metric, P(defects), but results are useful.

Aptamer Research Presentation

RII 2 Meeting, Feb 18

Wednesday, February 20, 2013

TSE paper outline

1- The ICSE'13 result: All algorithms perform almost the same at 2 or 3 objectives. IBEA beats all by a large margin when optimizing 4 or 5 objectives.
2- The CMSBSE'13 result: Lower rates are better. This is a brief section, IBEA vs NSGAII vs SPEA2, E-Shop only, high rates vs low rates. It can be extended to other algorithms and feature models, but then it's dwarfed by the tree mutation results. 
3- The comparison of algorithms is run for 20 feature models, with effect size calculations... Confirms superiority of IBEA. Lower rates are used here to give all algorithms their best chance to perform.
4- Tree Mutation: By removing random crossover and mutation and deploying tree mutation [at a low rate 0.01], IBEA performs better (converges faster) than 1 & 2 above. 
4-a) The improvement is 3-orders-of-magnitude when comparing E-Shop with ICSE'13 setup vs Tree Mutation. In fact, we now have good results at 2 sec and very good at 10 sec, thus interactive responses are possible. [Comparison: With ICSE'13 IBEA, We achieve HV=0.28 & 52% correct solutions after 3 hours. With Tree Mutation IBEA, We achieve HV=0.25 & 99% correct solutions after 8 seconds. The 0.28 figure is inflated by incorrect solutions that achieve supreme objective values... 3hours/8sec = 1350].
4-b) The improvement can also be demonstrated on all 20 feature models, only the numbers are less than 3-orders-of-magnitude... [This needs to be repeated because of the bug we've just discovered].
5- Tree Mutation does not help other algorithms much... This can be demonstrated on all 20 models.  [This needs to be repeated because of the bug we've just discovered]. Should this be included? If we don't include it here, is it worthy of a whole new paper?
5-a) Comparison of 6 algorithms with and without Tree Mutation. found at
http://www.unbox.org/things/var/abdel/13/tse/Comparison of Methods Feb-Mar 2013.pdf

Biclustering Comparisons


BicBin comparisons VanUitert08

Comparison using Artificial Db
p {.001:.05}  # prob of 1s
M = 10, 100, 1000 # rows
N = 50, 500, 5000 # cols
m x n = {2:10:50} # size of biclusters

create 5 dbs for each combination
run each algorithm 50 X per alg
average the fp rates for each combination
results shown in heat maps

Real Dbs
TRANSFAC v 10.3 Wingender00
transaction factor binding site data

STRING vonMering07
protein to protein interactions

GO categories Prelic06

measuring distances between the motifs found by these dbs
don't understand all of this group yet

Wednesday, February 13, 2013

Over the horizon: ML + SBSE = What?


here's the talk i gave at UCL last week

so come see what i say about you behind your back (sorry, only could get in references to ekrem, joe krall, fayola into this talk- but clearly vasil's work fits this framework as well)

Wednesday, February 6, 2013

A12measure

The standard story:



Many problems with the standard story, especially when comparing (say) 20 different configurations of a learner.

  • Say each test has confidence of 0.95
    • Net test has confidence of 0.95^20 = 0.36
  • The complexity of the test makes me lose confidence in my ability to understand the test
  • For the parametric tests, too many parametric assumptions.
  • The blurred telescope effect (Friedman-Nemenyi)

A12

The Vargha and Delaney A12 statistics is a non-parametric effect size measure.

 Given a performance measure M seen in m measures of X and n measures of Y, the A12 statistics measures the probability that running algorithm X yields higher M values than running another algorithm Y.

 A12 = #(X > Y)/mn + 0.5*#(X=Y)/mn 



rA12

A12 studies two treatments. rA12 handles mulitple treatments using a Scott Knott procedure; i.e. divide a list of treatments into two lists  L1 and  L2 _ by finding the division that maximizes the expected value of the sum of square of the mean differences before and after divisions; i.e.

L = union(L1,L2)
 |L1|  *  abs(L1.mu - L.mu)^2 + |L2|  *  abs(L2.mu - L.mu)^2

 rA12 calles itself recursively on the lsts terminating when no further  useful  division can be found (and  useful  is checked by the A12 procedure.

 For more on the Scott Knott procedure (where ANOVA is used to define  useful), see

  •  N. Mittas and L. Angelis, "Ranking and clustering software cost estimation models through a multiple comparisons algorithm" IEEE Transactions on Software Engineering (pre-print), 2012.

Source code

active learning for multi-objective optimization


  • same orders of magnitude as joe
  • technique seems a little more complex
  • no execution on standard models (i think)


Marcela ZuluagaAndreas KrauseGuillaume Sergent and Markus Püschel (to appear in Proc. International Conference on Machine Learning (ICML), 2013)
Active Learning for Multi-Objective Optimization
In many fields one encounters the challenge of identifying, out of a pool of possible designs, those that simultaneously optimize multiple objectives. This means that usually there is not one optimal design but an entire set of Pareto-optimal ones with optimal trade-offs in the objectives. In many applications, evaluating one design is expensive; thus, an exhaustive search for the Pareto-optimal set is unfeasible. To address this challenge, we propose the Pareto Active Learning (PAL) algorithm which intelligently samples the design space to predict the Pareto-optimal set. Key features of PAL include (1) modeling the objectives as samples from a Gaussian process distribution to capture structure and accomodate noisy evaluation; (so parametric assumptions)  (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on PAL’s sampling cost required to achieve a desired accuracy. Further, we show an experimental evaluation on three real-world data sets (only 3???). The results show PAL’s effectiveness; in particular it improves significantly over a state-of-the-art multi-objective optimization method, saving in many cases about 33% evaluations to achieve the same accuracy.

Wednesday 2-6-2013

Linux early results http://www.unbox.org/things/var/abdel/13/linux/
CMSBSE'13 http://www.unbox.org/things/var/abdel/13/cmsbse13/
RAISE'13 http://www.unbox.org/things/var/abdel/13/raise13/

Wednesday, January 23, 2013

Free Big Data Symposium, Feb 1

http://www.psc.edu/index.php/events/sherlocklaunch

Nate Silver: The Numbers Don't Lie

http://www.youtube.com/watch?v=GuAZtOJqFr0

A 56-minute talk about the signal and the noise by Nate Silver

Results for TSE paper

Results for 20 feature models

http://www.unbox.org/things/var/abdel/13/tse/

Wednesday, January 16, 2013

RRSL Report

JMOO
1) We have a system for very easily instancing new problems, and have already implemented most baselines such as Fonseca, Srinivas, ConstrEx and more.  Capabilities can handle constrained or non-constrained problems.

Modifiable Genetic Algorithm Approach
2) A basis for a very plain and dull GA is as follows:
 - Read a 100-row dataset for the problem (so that we begin on a consistent standpoint).
 - Evaluate all of these as the initial population and record Median & IQR metrics for each objective.
 - Begin an iterative process, stopping at a given criteria (# of runs, or %change between iterations):
 - - - a: Select offspring for CM
 - - - b: Crossover for the offspring
 - - - c: Mutation for the offspring
 - - - d: Combine population with the new offspring and select a new population from these

RRSL GA
3) Different Algorithms follow from adjusting (a,b,c,d) above:
 - RRSL GA:
 - - - a: Selection of offspring via rrsl non-dominating leaves. Maintain two copies of these offspring.
 - - - b: Crossover of the offspring via moving averages idea
 - - - c: Mutation for the offspring via Differential Evolution
 - - - d: Tournament Selection on the (few) offspring + adjustedoffspring

NSGA2/SPEA2
4) Benchmark Algorithms also follow from adjusting (a,b,c,d) above:
 - NSGA2/SPEA2:
 - - - a: Selection of offspring using k-round tournament selection
 - - - b: Standard Crossover to swap attributes for a given pair of individuals
 - - - c: Standard Mutation to re-roll the dice on some individual's attribute chosen decision
 - - - d: Selection via NSGA2 or SPEA2 sub-routines.

Goals
5) The Goal of the Research:
 - We want RRSL GA to perform as well as NSGA2/SPEA2.
 - There's an argument on what it means to be better.  Solutions lie across the optimal pareto frontier, but different algorithms may discover solutions at different areas along the frontier.
 - The absolute emphasis of the research: Improving upon the number of objective function evaluations.  The theory is that RRSL GA will require log N the number of evaluations needed by NSGA2/SPEA, and still do just as well.

Results
6) Results

charts: http://unbox.org/stuff/var/joe/joe/fall2012/pom2rrslTesting/charts/1-16-2013/
--- full results pending ---

Wednesday, January 9, 2013

Erin's update

Her is the current pseudocode and code for my alg.  BicBin is the alg that I will run comparisons against.


PSEUDOCODE

loop over (confidence levels, or coverage)

# ATTRIBUTE SELECTION
run FP Growth
extract unique 4-mers from results
compile the list of selected attribute numbers
derive unselected attribute numbers

# CLUSTERING
EM cluster db using selected attributes
label rows by the clusters
parse results


# CLASSIFIED TESTING
divide db into 10 bins
run nb, JRip, oner, prism, ZeroR, j48 using cluster labels as the class

keep clusters that are greater than the db average (effect size),
AND pd, pf, calc you gave me today in classification

record the cluster patterns, and rows and columns

change 1s in the clusters to 0s (non overlapping biclusters)

end loop

? do I recalculate the db average or use inital avg
? classification performance criteria

CODE (thus far)
########################

#!/bin/bash

Jar="/usr/share/java/weka.jar"
Weka="nice -n 20 java -Xmx2048M -cp $Jar "
Input="/tiny/data/DNA/WordRules/secondTry/ell4Words.arff"
CONFIDENCE=0.99
STEP=0.05
MinRulesPerc=.60
UPPER=1.0
InvAttSet=""
DbAvg=0

#COPY THE INPUT DATA
cp $Input /tiny/data/trialInput.arff
Input="/tiny/data/trialInput.arff"

cat $Input | ./dbAvg.awk
#GET THE DB AVERAGE
#DbAvg=`cat $Input | ./dbAvg.awk`
#echo "DbAvg" $DbAvg

#LOOP OVER CONFIDENCE LEVELS or MinRulesPerc(aka coverage)

  #for ((CONFIDENCE=99;CONFIDENCE==80;CONFIDENCE=CONFIDENCE-5)); do

#GET ATTRIBUTES FROM FP GROWTH
#run FPGrowth
$Weka weka.associations.FPGrowth -P 1 -I -1 -N 20 -T 0 -C $CONFIDENCE -D $STEP -U $UPPER -M $MinRulesPerc -t $Input > fpResults$MinRulesPerc$CONFIDENCE.txt

#extract unique 4-mers from FP

cat fpResults$MinRulesPerc$CONFIDENCE.txt |
sed 's/_binarized/ /g' |
sed -e 's/=1/ /g' -e 's/=0/ /g' |
./fpFieldParser.awk > attA$MinRulesPerc$CONFIDENCE.txt

#compile list of attribute numbers

cat fieldsNum.txt separator.txt attA$MinRulesPerc$CONFIDENCE.txt  |
./makeFieldNumList.awk > fieldListA$MinRulesPerc$CONFIDENCE.txt
#remove last comma

sed 's/.\{1\}$//' attIndicies.txt > fldListA$MinRulesPerc$CONFIDENCE.txt
sed 's/.\{1\}$//' inverseAttIndicies.txt > fldListIA$MinRulesPerc$CONFIDENCE.txt

#put list into variable

InvAttSet=`cat fldListIA$MinRulesPerc$CONFIDENCE.txt`
echo $InvAttSet

#cluster all instances on the reduced attribute list

$Weka weka.filters.unsupervised.attribute.AddCluster -W "weka.clusterers.EM  -t $Input -I 100 -N -1 -M 1.0E-6 -S 100" -I $InvAttSet

#keep clusters that are 30% more than the db average
#AND Test well 70+

#done

####
used 'time' and got these numbers


real 0m33.106s
user 0m38.313s
sys 0m0.890s




Vasil - Progress Report

1. From the previous meeting, I showed that we can approximate the NEMS score results by training on 10% of the population.
However, by clustering (cluster+LR and cluster+Centroid) we did not get better results than by applying linear regression on the entire instances space.

I tried a new method of grid clustering. Instead of using Recursive FastMap, the projected instance space was split into a specific number of equally-sized quadrants, then the quadrants were clustered using the variance filter.
I used different number of quadrants to split the space, but still clustering could not outperform the linear regression on the entire space.


2. Next I have shifted by attention on applying the clustering methods on defect prediction data. (I am currently using the ant data set from promise repository).

- The instance space is projected using FastMap.
- Then the instance space is recursively split into quadrants on the median of x and y dimensions.
- The clusters are created using Entropy to combine quadrants (i.e. a quadrant is not added if the entropy is increased).

ToDo:
- Apply Naive Bayes on the entire space and on clusters. (The implementation will be finished today and I can report the results).
- Use information gain to define the attributes that contribute on clustering.

Vasil

Tuesday, December 11, 2012

Kernel Mean Match (KMM) for cross-project defect prediction

Context: cross-project defect prediction

Algorithm: Kernel Mean Match



where  B limits the scope of discrepancy between Pr and Prand esp  ensures that the measure β(x) Pr(x) is close to a probability distribution. 

Data: 5 open source project, process metrics (change data)

performance indicators: AUC, F, Recall, Precision


Experiments:

parameter tuning: B=1,3,5,10,20,50; esp=0.01, 0.1, 0.5, 1, 5, 10.

Naive Bayes (B=5, esp=0.01)
AUC Acc F F_2 Recall Precision PF TestSet TrainSet
0.6749 0.8517 0.2238 0.1752 0.1531 0.4156 0.0349 PDE_change JDT_change 
0.6731 0.6451 0.2282 0.1586 0.1318 0.8500 0.0154 equinox_change JDT_change 
0.7555 0.9204 0.3529 0.2708 0.2344 0.7143 0.0096 lucene_change JDT_change 
0.6424 0.8598 0.1092 0.0778 0.0653 0.3333 0.0198 mylyn_change JDT_change 
0.7185 0.6229 0.4551 0.6002 0.7621 0.3244 0.4134 JDT_change PDE_change 
0.7953 0.6944 0.4343 0.3381 0.2946 0.8261 0.0410 equinox_change PDE_change 
0.7772 0.8857 0.4060 0.4154 0.4219 0.3913 0.0670 lucene_change PDE_change 
0.5753 0.8448 0.2684 0.2345 0.2163 0.3533 0.0600 mylyn_change PDE_change 
0.7488 0.6810 0.4838 0.6037 0.7233 0.3634 0.3300 JDT_change equinox_change 
0.6985 0.7508 0.3935 0.4871 0.5789 0.2980 0.2213 PDE_change equinox_change 
0.7118 0.7424 0.3101 0.4444 0.6250 0.2062 0.2456 lucene_change equinox_change 
0.5615 0.7873 0.2826 0.3030 0.3184 0.2541 0.1416 mylyn_change equinox_change 
0.7411 0.6911 0.4901 0.6056 0.7184 0.3719 0.3161 JDT_change lucene_change 
0.6889 0.7916 0.3735 0.4133 0.4450 0.3218 0.1522 PDE_change lucene_change 
0.7825 0.7222 0.5588 0.4822 0.4419 0.7600 0.0923 equinox_change lucene_change 
0.6489 0.8281 0.3043 0.2929 0.2857 0.3256 0.0897 mylyn_change lucene_change 
0.6040 0.3400 0.3286 0.5038 0.7816 0.2080 0.7750 JDT_change mylyn_change 
0.5824 0.5311 0.2626 0.3958 0.5981 0.1682 0.4798 PDE_change mylyn_change 
0.5599 0.6636 0.5068 0.4605 0.4341 0.6087 0.1846 equinox_change mylyn_change 
0.6228 0.7829 0.2857 0.3731 0.4688 0.2055 0.1850 lucene_change mylyn_change 


KMM vs. non-weights
B=5 esp=0.01 non-weights
AUC AUC improvement mean(imp) median(imp)
0.6749 0.6313 0.0436 0.0218 0.0228
0.6731 0.7702 -0.0970
0.7555 0.7149 0.0406
0.6424 0.4153 0.2271
0.7185 0.7207 -0.0022
0.7953 0.7843 0.0110
0.7772 0.7413 0.0359
0.5753 0.5716 0.0038
0.7488 0.7105 0.0384
0.6985 0.6843 0.0142
0.7118 0.7257 -0.0139
0.5615 0.5131 0.0484
0.7411 0.7098 0.0313
0.6889 0.6954 -0.0065
0.7825 0.7739 0.0086
0.6489 0.6140 0.0349
0.6040 0.6305 -0.0265
0.5824 0.6424 -0.0601
0.5599 0.5102 0.0497
0.6228 0.5678 0.0550



Observations:
1) after parameter tuning, KMM helps to improve prediction performance, but not so significant (2%)
2) long running time! 900 training instances + 300 test instances: 30min; 10,000 training instances + 300 test instance: 130 min! 

Questions:
1) KMM may be a stupid idea for cross defect prediction though it's beautiful on mathematics?  


New results(2012-12-18):
parameter tuning: B=1,3,5,10,20,50; esp=0.01, 0.1, 0.5, 1, 5, 10.



Fig. 1 Improvement of PD



Fig. 2. Improvement of PF