Method
- Cluster the data
- Find deltas of interest between the clusters
- Score each cluster
- Let each row have objective scores, normalized 0..1, min..max
- Let the score of a row be the sum of its normalized scores
- Let the score of a cluster be the mean of the scores of its rows
- Technically, this is almost the cdom predicate used in IBEA
- For each cluster C1
- Find its nearest neighbor C2 with a better score
- Assert one (leave,goto) tuple for (C1,C2) (sketched in code after this list)
- Build and prune a decision tree on the clusters
- Label each instance with the cluster it belongs to
- Build a decision tree on the labelled data set.
- Find the clusters that are only weakly recognized by the decision tree learner
- e.g. use a three-way cross-val and prune anything with F < 0.5 (see the second code sketch below)
- Remove the weakly recognized clusters
- For each (C1,C2) tuple where neither cluster is weakly recognized,
- Query the tree to find the delta
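The scoring and (leave,goto) selection above can be sketched in a few lines of Python. The objective column names, the centroid-distance reading of "nearest", and the assumption that every objective is minimized are assumptions made here for illustration, not part of the method:

import math

OBJECTIVES = ["effort", "defects", "months"]   # assumed objective columns (smaller = better)

def normalize(rows):
    # min-max normalize each objective to 0..1 across ALL rows (do this once,
    # on the whole data set, before the rows are split into clusters)
    lo = {o: min(r[o] for r in rows) for o in OBJECTIVES}
    hi = {o: max(r[o] for r in rows) for o in OBJECTIVES}
    return [{o: (r[o] - lo[o]) / (hi[o] - lo[o] + 1e-32) for o in OBJECTIVES}
            for r in rows]

def row_score(row):
    # score of a row = sum of its normalized objective scores
    return sum(row[o] for o in OBJECTIVES)

def cluster_score(rows):
    # score of a cluster = mean score of its rows
    return sum(row_score(r) for r in rows) / len(rows)

def centroid(rows):
    return [sum(r[o] for r in rows) / len(rows) for o in OBJECTIVES]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gotos(clusters):
    # clusters: {cluster_id: [normalized rows]}
    # for each cluster C1, find its nearest neighbor C2 with a better (lower)
    # score and assert one (leave, goto) tuple for (C1, C2)
    scores = {c: cluster_score(rows) for c, rows in clusters.items()}
    mids   = {c: centroid(rows)      for c, rows in clusters.items()}
    pairs  = []
    for c1 in clusters:
        better = [c2 for c2 in clusters if scores[c2] < scores[c1]]
        if better:
            c2 = min(better, key=lambda c: dist(mids[c1], mids[c]))
            pairs.append((c1, c2))
    return pairs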
Observation: the trees are so small that this can be done manually.
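Finding the weakly recognized clusters can still be automated before any manual reading of the tree. A minimal sketch, assuming scikit-learn (a choice made here, not named above) and a numeric encoding of the project features:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

def weak_clusters(X, cluster_ids, threshold=0.5):
    # X: numerically encoded project features; cluster_ids: the cluster label
    # of each row. Returns the cluster labels whose 3-fold cross-validated
    # F measure falls below the threshold.
    tree = DecisionTreeClassifier(min_samples_leaf=2)   # light pruning; an assumption
    predicted = cross_val_predict(tree, X, cluster_ids, cv=3)
    labels = sorted(set(cluster_ids))
    f = f1_score(cluster_ids, predicted, labels=labels, average=None)
    return {label for label, score in zip(labels, f) if score < threshold}

Any (leave,goto) tuple that touches a cluster returned by weak_clusters is then discarded before querying the tree.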
Example
Nasa93 clustered into 2D (one color per cluster)
Cluster details
- All the following values are normalized 0..1, min..max
- Defects and months are connected, but not always
- Effort is not what separates the projects; it's more about defects and calendar time to develop.
- Clearly, cluster 2 is a bad place, while clusters 10 and 13 look nicest.
acap = h
| apex = h
| | pmat = h
| | | plex = h: _2 (6.0/1.0)
| | | plex = n: _4 (3.0)
| | pmat = l
| | | cplx = vh: _3 (2.0)
| | | cplx = h
| | | | time = vh: _3 (3.0)
| | | | time = n: _6 (4.0/1.0)
| | | cplx = n: _5 (2.0)
| | pmat = n: _6 (4.0/1.0)
| apex = n
| | data = h: _6 (2.0/1.0)
| | data = n: _4 (3.0/1.0)
| | data = l: _13 (1.0)
| apex = vh
| | pcap = h: _10 (3.0)
| | pcap = vh: _7 (2.0/1.0)
acap = n
| sced = n
| | stor = xh: _7 (1.0)
| | stor = n
| | | cplx = h
| | | | pcap = h: _10 (3.0/1.0)
| | | | pcap = n: _13 (3.0)
| | | cplx = n: _7 (3.0/1.0)
| | stor = vh: _11 (3.0)
| | stor = h: _11 (2.0)
| sced = l
| | $kloc <= 16.3: _9 (5.0)
| | $kloc > 16.3: _8 (6.0)
acap = vh: _12 (7.0/1.0)
A 3-way cross-val yielded the following confusion matrix.
- The diagonal entries are the correctly classified rows.
- The off-diagonal entries are errors.
- Note the poor performance for recognizing clusters 4, 5, 6, 7, 10, and 13.
a b c d e f g h i j k l   <-- classified as
5 0 0 0 0 0 0 0 0 0 0 0 | a = _2
0 4 1 1 0 0 0 0 0 0 0 0 | b = _3
1 0 2 0 1 0 0 0 1 0 0 0 | c = _4
0 1 1 0 3 0 0 0 0 0 0 0 | d = _5
0 1 2 3 0 0 0 0 1 0 0 1 | e = _6
0 0 0 0 0 0 0 0 1 2 1 1 | f = _7
0 0 0 0 0 0 6 0 0 0 0 0 | g = _8
0 0 0 0 0 0 1 4 0 0 0 0 | h = _9
0 0 1 0 0 0 0 0 3 0 0 2 | i = _10
0 0 0 0 0 1 0 0 0 4 0 0 | j = _11
0 0 0 0 0 1 0 0 0 0 6 0 | k = _12
0 0 1 0 0 0 0 0 1 1 0 2 | l = _13
The above confusion matrix is mapped into the "f" measures of the following table.
- The "goto" column marks the deltas of interest.
- Low "f" values are marked in gray.
- Any "goto" that comes or goes into gray is marked with gray.
cluster |  n | effort | defects | months |   f | goto |
      2 |  5 |    43% |     25% |    72% | 91% |    3 |
      3 |  6 |     5% |     32% |    42% | 67% |    6 |
      4 |  5 |     6% |     17% |    37% | 31% |    8 |
      5 |  5 |     7% |     29% |    43% |  0% |    6 |
      6 |  8 |     6% |     24% |    40% |  0% |   10 |
      7 |  5 |     7% |     16% |    36% |  0% |   13 |
      8 |  6 |     2% |      6% |    17% | 92% |    9 |
      9 |  5 |     0% |      1% |     3% | 89% |      |
     10 |  6 |     2% |      9% |    22% | 46% |   12 |
     11 |  5 |     7% |     18% |    31% | 67% |   13 |
     12 |  7 |     2% |      7% |    18% | 86% |      |
     13 |  5 |     7% |     15% |    26% | 36% |      |
  total | 68 |        |         |        |     |      |
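For reference, the "f" column follows from the confusion matrix in the usual way: per class, recall = correct / actual (row total), precision = correct / predicted (column total), and f is their harmonic mean. A small sketch:

def f_measures(matrix, names):
    # matrix[i][j] = number of rows of actual class i classified as class j
    out = {}
    for i, name in enumerate(names):
        correct   = matrix[i][i]
        actual    = sum(matrix[i])                  # row total
        predicted = sum(row[i] for row in matrix)   # column total
        recall    = correct / actual    if actual    else 0.0
        precision = correct / predicted if predicted else 0.0
        out[name] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return out

For example, cluster 2 is predicted six times and correct five times out of five actual rows, so precision = 5/6, recall = 5/5, and f = 2*(5/6)*(1)/(5/6 + 1) ≈ 91%, matching the table.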
If we prune the above tree of any branch that leads only to gray classes, we get, as promised above, a very small tree.
acap = h
| apex = h
| | pmat = h
| | | plex = h: _2 (6.0/1.0)
| | pmat = l
| | | cplx = vh: _3 (2.0)
| | | cplx = h
| | | | time = vh: _3 (3.0)
acap = n
| sced = n
| | stor = vh: _11 (3.0)
| | stor = h: _11 (2.0)
| sced = l
| | $kloc <= 16.3: _9 (5.0)
| | $kloc > 16.3: _8 (6.0)
acap = vh: _12 (7.0/1.0)
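Querying the tree for the delta of a (leave,goto) pair then amounts to comparing the attribute tests on the paths to the "leave" cluster's leaves against those on the paths to the "goto" cluster's leaves. The branch representation below is an assumption made for illustration; as noted above, the trees are small enough to do this by eye:

def delta(paths, leave, goto):
    # paths: {cluster_id: [branch, ...]} where each branch is a dict of the
    # attribute tests on the path from the root to one of that cluster's leaves.
    # Returns, for every pair of branches, the tests that differ.
    out = []
    for p1 in paths.get(leave, []):
        for p2 in paths.get(goto, []):
            out.append({a: (p1.get(a), t) for a, t in p2.items() if p1.get(a) != t})
    return out

# two entries read off the pruned tree above
paths = {
    "_2": [{"acap": "h", "apex": "h", "pmat": "h", "plex": "h"}],
    "_3": [{"acap": "h", "apex": "h", "pmat": "l", "cplx": "vh"},
           {"acap": "h", "apex": "h", "pmat": "l", "cplx": "h", "time": "vh"}],
}
print(delta(paths, "_2", "_3"))
# leaving cluster 2 for cluster 3 means changing pmat from "h" to "l"
# and picking up a cplx (and possibly time) test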
Summary
- The definite statements about how to make changes in SE data are very succinct.
- But they might not cover everything.
Question: what would you baseline this against? I.e. how would you certify this as a good/crappy idea?