Monday, November 11, 2013

Tree query languages for MOEA

Method

  1. Cluster the data
  2. Find deltas of interest between the clusters
    • Score each cluster
      • Let each row have objective scores, normalized 0..1, min..max
      • Let the score of a row by the sum of the normalized scores
      • Let the score of a cluster be the mean of the score of its rows
      • Technically, this is almost the cdom predicate used in IBEA
    • For each cluster C1
      • Find its nearest neighbor C2 with a better score 
      • Assert one  (leave,goto)  tuple for  (C1,C2)
  3. Build and prune a decision tree on the clusters
    • Label each instance with the cluster it belongs to
    • Build a decision tree on the labelled data set.
    • Find the clusters that are only weakly recognized by the decision tree learner
      • e.g. use a three-way cross val and prune anything with F < 0.5
    • Remove the weakly recognized clusters
  4. For each (C1,C2) tuple where both are not weakly recognized, 
    • Query the tree to find the delta
Observation: the trees are so small that this can be done manually.

Example

Nasa93 clustered into 2D (one color per cluster)

Cluster details 
  • All the following values are normalized 0..1, min..max
  • Defects and months are connected, but not always
  • Effort is not what separates the projects- its more about defects and calender time to develop
  • Clearly, cluster 2 is a bad place and 10 and 13 look nicest.


A decision tree learned from the data labelled with each cluster (ignoring the objectives) generated:

 acap = h
|   apex = h
|   |   pmat = h
|   |   |   plex = h: _2 (6.0/1.0)
|   |   |   plex = n: _4 (3.0)
|   |   pmat = l
|   |   |   cplx = vh: _3 (2.0)
|   |   |   cplx = h
|   |   |   |   time = vh: _3 (3.0)
|   |   |   |   time = n: _6 (4.0/1.0)
|   |   |   cplx = n: _5 (2.0)
|   |   pmat = n: _6 (4.0/1.0)
|   apex = n
|   |   data = h: _6 (2.0/1.0)
|   |   data = n: _4 (3.0/1.0)
|   |   data = l: _13 (1.0)
|   apex = vh
|   |   pcap = h: _10 (3.0)
|   |   pcap = vh: _7 (2.0/1.0)
acap = n
|   sced = n
|   |   stor = xh: _7 (1.0)
|   |   stor = n
|   |   |   cplx = h
|   |   |   |   pcap = h: _10 (3.0/1.0)
|   |   |   |   pcap = n: _13 (3.0)
|   |   |   cplx = n: _7 (3.0/1.0)
|   |   stor = vh: _11 (3.0)
|   |   stor = h: _11 (2.0)
|   sced = l
|   |   $kloc <= 16.3: _9 (5.0)
|   |   $kloc > 16.3: _8 (6.0)
acap = vh: _12 (7.0/1.0)

A 3-way cross-val yielded following confusion matrix. 
  • The underlined and bold entries are the correctly classified rows.
  • The red entries are errors.
  •  Note the poor performance for recognizing clusters 4,5,6,7,10,13

 a b c d e f g h i j k l   <-- as="" classified="" font="">
 5 0 0 0 0 0 0 0 0 0 0 0 | a = _2
 0 4 1 1 0 0 0 0 0 0 0 0 | b = _3
 1 0 2 0 1 0 0 0 1 0 0 0 | c = _4
 0 1 1 0 3 0 0 0 0 0 0 0 | d = _5
 0 1 2 3 0 0 0 0 1 0 0 1 | e = _6
 0 0 0 0 0 0 0 0 1 2 1 1 | f = _7
 0 0 0 0 0 0 6 0 0 0 0 0 | g = _8
 0 0 0 0 0 0 1 4 0 0 0 0 | h = _9
 0 0 1 0 0 0 0 0 3 0 0 2 | i = _10
 0 0 0 0 0 1 0 0 0 4 0 0 | j = _11
 0 0 0 0 0 1 0 0 0 0 6 0 | k = _12
 0 0 1 0 0 0 0 0 1 1 0 2 | l = _13


The above confusion matrix is mapped into the "f" measures of the following table. 
  • The "goto" column marks the deltas of interest.
  •  Low "f" values are marked in gray. 
  • Any "goto" that comes or goes into gray is marked with gray.

cluster n effort defects months f     goto
2 5 43% 25% 72% 91% 3
3 6 5% 32% 42% 67% 6
4 5 6% 17% 37% 31% 8
5 5 7% 29% 43% 0% 6
6 8 6% 24% 40% 0% 10
7 5 7% 16% 36% 0% 13
8 6 2% 6% 17% 92% 9
9 5 0% 1% 3% 89%
10 6 2% 9% 22% 46% 12
11 5 7% 18% 31% 67% 13
12 7 2% 7% 18% 86%
13 5 7% 15% 26% 36%
total: 68




 
If we prune the above tree of any branch that leads only to gray classes, we get, as promised above, a very small tree.

acap = h
|   apex = h
|   |   pmat = h
|   |   |   plex = h: _2 (6.0/1.0)
|   |   pmat = l
|   |   |   cplx = vh: _3 (2.0)
|   |   |   cplx = h
|   |   |   |   time = vh: _3 (3.0)
acap = n
|   sced = n
|   |   stor = vh: _11 (3.0)
|   |   stor = h: _11 (2.0)
|   sced = l
|   |   $kloc <= 16.3: _9 (5.0)
|   |   $kloc > 16.3: _8 (6.0)
acap = vh: _12 (7.0/1.0)

Summary

  • The definite statements that clearly make changes in SE data are very succinct.
    • But they might not cover everything.
Question: what would you baseline this against? I.e. how would you certify this as a good/crappy idea?

No comments:

Post a Comment