Friday, February 26, 2010

Review article on anomaly detection

Varun Chandola, Arindam Banerjee, and Vipin Kumar, "Anomaly Detection: A Survey", ACM Computing Surveys, Vol. 41(3), Article 15, July 2009. 

http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php 

IEEE TSE accepts "Genetic Algorithms for Randomized Unit Testing"

Jamie Andrews and Tim Menzies

Randomized testing is an effective method for testing software units. Thoroughness of randomized unit testing varies widely according to the settings of certain parameters, such as the relative frequencies with which methods are called. In this paper, we describe Nighthawk, a system which uses a genetic algorithm (GA) to find parameters for randomized unit testing that optimize test coverage.

Designing GAs is somewhat of a black art. We therefore use a feature subset selection (FSS) tool to assess the size and content of the representations within the GA. Using that tool, we can reduce the size of the representation substantially, while still achieving most of the coverage found using the full representation. Our reduced GA achieves almost the same results as the full system, but in only 10% of the time. These results suggest that FSS could significantly optimize metaheuristic search-based software engineering tools.
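For readers curious about the mechanics, here is a minimal, hedged sketch of the basic idea: a GA that evolves the relative frequencies with which a randomized tester calls each method, scored by a coverage measure. This is an illustration only, not Nighthawk itself; the coverage() function below is a made-up stand-in for running the randomized tester and measuring coverage.

# Minimal GA sketch (not Nighthawk itself): evolve relative method-call
# frequencies for a randomized unit tester and keep the fittest individuals.
import random

def coverage(weights):
    # Hypothetical stand-in for "run the randomized tester with these relative
    # call frequencies and measure coverage": reward one particular call mix.
    total = sum(weights) or 1.0
    freqs = [w / total for w in weights]
    target = [0.2, 0.2, 0.4, 0.2]
    return 1.0 - sum(abs(f - t) for f, t in zip(freqs, target))

def mutate(ind, rate=0.2):
    return [max(0.0, w + random.gauss(0, 0.1)) if random.random() < rate else w
            for w in ind]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(n_methods=4, pop_size=20, generations=50):
    pop = [[random.random() for _ in range(n_methods)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=coverage, reverse=True)
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=coverage)

print(evolve())

In the paper, the chromosome carries far more than call frequencies, and FSS is then used to decide which of those genes are actually worth keeping.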

Webcast: "Finding Local Lessons in SE"


Speaker: Tim Menzies


Monday, February 15, 2010

Does Kernel Choice Affect Prediction Performance?

With kernel estimation, we derive a probability density function (pdf) from the training data and then use that pdf to assign weights to the neighbors in a kNN approach.
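As a concrete (if simplified) illustration of the idea, here is a minimal sketch of kernel-weighted kNN effort estimation. The bandwidth uses one common one-dimensional form of Scott's rule (h = sigma * n^(-1/5)), the kernel is the triangular kernel, and the single feature, the data, and k are all made up.

# Minimal sketch of kernel-weighted kNN (illustration only; made-up data).
# Bandwidth uses one common 1-D form of Scott's rule: h = sigma * n**(-1/5).
import math

def scott_bandwidth(values):
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sigma * n ** (-1.0 / 5.0)

def triangular(u):
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0

def kernel_weighted_knn(train, query, k=3):
    # train: list of (feature, effort) pairs; one numeric feature keeps the sketch short.
    h = scott_bandwidth([f for f, _ in train])
    neighbors = sorted(train, key=lambda fe: abs(fe[0] - query))[:k]
    weights = [triangular((f - query) / h) for f, _ in neighbors]
    if sum(weights) == 0:            # all neighbors fell outside the kernel support
        weights = [1.0] * k          # fall back to an unweighted average
    return sum(w * e for w, (_, e) in zip(weights, neighbors)) / sum(weights)

train = [(1.0, 10), (2.0, 14), (2.5, 15), (4.0, 30), (5.0, 40)]
print(kernel_weighted_knn(train, query=2.2))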

When the bandwidths are assigned according to Scott's Rule, we see that in critical regions (where the performance of different methods diverges) the kernel-weighted selection methods outperform the non-weighted versions. For the rest of the graphs, however, the methods perform very close to one another and it is difficult to draw a solid conclusion. Below are the graphs for the Desharnais dataset using the Triangular, Epanechnikov, and Gaussian kernels, respectively.



Classification comparisons using DENSITY

DENSITY is a new classifier I wrote that works by:
  1. pruning irrelevant data,
  2. setting density values using a geometric series, and
  3. ranking based on a "fairness" class-distribution weight (a minimal sketch follows this list).
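Since the three steps above are terse, here is a heavily hedged sketch of how such a scheme might look. The pruning rule (keep the k nearest rows), the geometric decay factor, and the "fairness" weight (dividing by overall class frequency) are my own guesses at the intent, not the actual DENSITY code.

# Heavily hedged sketch of a DENSITY-like classifier; the pruning rule,
# geometric-series decay, and "fairness" weight are assumptions on my part.
from collections import Counter

def classify(train, query, k=5, decay=0.5):
    # train: list of (features, label) pairs with numeric feature tuples.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # 1. Prune irrelevant data: keep only the k nearest training rows.
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]

    # 2. Set density values using a geometric series: 1, decay, decay**2, ...
    votes = Counter()
    for rank, (_, label) in enumerate(nearest):
        votes[label] += decay ** rank

    # 3. Rank using a "fairness" class-distribution weight: divide each class's
    #    vote by how common that class is overall, so rare classes are not drowned out.
    base = Counter(label for _, label in train)
    return max(votes, key=lambda c: votes[c] / base[c])

train = [((1, 1), "yes"), ((1, 2), "yes"), ((1, 0), "yes"), ((5, 5), "no"), ((6, 5), "no")]
print(classify(train, (1.2, 1.1), k=3))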



Component vs. Whole Defect Prediction


  1. Components were trimmed based on the number of defective modules (using medians)
  2. For each learner:
  3. 10x10-way cross-validation using training sets composed of multiple components vs. varying components (a sketch of this setup follows the list)
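A sketch of the trimming and cross-validation steps follows. This is my own reading of the list above, not the actual experiment code; the data layout and the toy majority-class learner standing in for nb/j48 are assumptions.

# Sketch of component trimming plus 10x10-way cross-validation (assumed data
# layout: each row is (component_name, features, defective_flag)).
import random
from statistics import median

def trim_components(rows):
    # Step 1: keep only components with at least the median number of defective modules.
    defects = {}
    for comp, _, defective in rows:
        defects[comp] = defects.get(comp, 0) + int(defective)
    cutoff = median(defects.values())
    keep = {c for c, d in defects.items() if d >= cutoff}
    return [r for r in rows if r[0] in keep]

def majority_learner(train):
    # Toy stand-in for a real learner: always predict the most common training label.
    labels = [lab for _, _, lab in train]
    guess = max(set(labels), key=labels.count)
    return lambda features: guess

def cross_val(rows, learner, repeats=10, folds=10):
    # Step 3: 10x10-way cross-validation; returns mean accuracy over all folds.
    scores = []
    for _ in range(repeats):
        data = rows[:]
        random.shuffle(data)
        for f in range(folds):
            test = data[f::folds]
            train = [r for i, r in enumerate(data) if i % folds != f]
            model = learner(train)
            scores.append(sum(model(feat) == lab for _, feat, lab in test) / len(test))
    return sum(scores) / len(scores)

rows = [(f"comp{i % 4}", (i,), i % 3 == 0) for i in range(80)]
print(cross_val(trim_components(rows), majority_learner))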


Tuesday, February 9, 2010

Generality and "W"

To recap, we've demonstrated that "W" is capable of sizable reductions in median effort and "spread" in software estimation. I've spent more time getting to know what types of datasets and projects "W" is proficient at predicting. Here's some new data:



Two new points to make: I played around with the Desharnais dataset and realized the original synthetic projects were making constraints that represented too few historical projects. The new synthetic project simply makes all independent attributes controllable, and gives results in line with the other synthetic and non-synthetic projects.

Second, multiple-goal NASA93 projects are included. The BFC-NASA93 projects seek to minimize defects, effort, and months as a whole. The reductions are in line with the effort-only version of this dataset.

Finally, the big generality statement. To make this somewhat easier to read, I've left it in list format with the frequency counts of each attribute value trailing. I'm still working on a more concise way to express this data; one giant table doesn't seem to help.

key: DATASET.project:::attribute=freq count

NASA93.flight:::acap=25
NASA93.flight:::aexp=6
NASA93.flight:::cplx=13
NASA93.flight:::data=14
NASA93.flight:::lexp=8
NASA93.flight:::pcap=9
NASA93.flight:::rely=20
NASA93.flight:::stor=17
NASA93.flight:::turn=6
NASA93.ground:::acap=16
NASA93.ground:::aexp=9
NASA93.ground:::cplx=13
NASA93.ground:::data=20
NASA93.ground:::lexp=7
NASA93.ground:::pcap=7
NASA93.ground:::rely=18
NASA93.ground:::stor=13
NASA93.ground:::time=14
NASA93.osp2:::sced=87
NASA93.osp:::acap=29
NASA93.osp:::aexp=8
NASA93.osp:::cplx=10
NASA93.osp:::sced=12
NASA93.osp:::stor=24
NASA93.osp:::tool=20
BFC-NASA93.flight:::acap=28
BFC-NASA93.flight:::aexp=2
BFC-NASA93.flight:::cplx=7
BFC-NASA93.flight:::data=15
BFC-NASA93.flight:::lexp=2
BFC-NASA93.flight:::pcap=10
BFC-NASA93.flight:::rely=11
BFC-NASA93.flight:::stor=17
BFC-NASA93.flight:::turn=9
BFC-NASA93.ground:::acap=22
BFC-NASA93.ground:::apex=9
BFC-NASA93.ground:::cplx=4
BFC-NASA93.ground:::data=14
BFC-NASA93.ground:::ltex=10
BFC-NASA93.ground:::pcap=8
BFC-NASA93.ground:::rely=10
BFC-NASA93.ground:::stor=14
BFC-NASA93.ground:::time=11
BFC-NASA93.osp2:::sced=109
BFC-NASA93.osp:::acap=43
BFC-NASA93.osp:::aexp=2
BFC-NASA93.osp:::cplx=17
BFC-NASA93.osp:::sced=13
BFC-NASA93.osp:::stor=16
BFC-NASA93.osp:::team=4
BFC-NASA93.osp:::tool=10
ISBSG-clientserver-discretized.proj1:::Norm_PDF_UFP=9
ISBSG-clientserver-discretized.proj1:::Norm_PDR_AFP=11
ISBSG-clientserver-discretized.proj1:::Projected_PDR_UFP=3
ISBSG-clientserver-discretized.proj1:::Reported_PDR_AFP=4
ISBSG-clientserver-discretized.proj1:::Size_Added=5
ISBSG-clientserver-discretized.proj1:::Size_Enquiry=2
ISBSG-clientserver-discretized.proj1:::Size_File=19
ISBSG-clientserver-discretized.proj1:::Size_Input=16
ISBSG-clientserver-discretized.proj1:::Size_Interface=6
ISBSG-clientserver-discretized.proj1:::Size_Output=4
ISBSG-standalone-discretized.proj1:::Adj_FP=1
ISBSG-standalone-discretized.proj1:::Norm_PDF_UFP=14
ISBSG-standalone-discretized.proj1:::Norm_PDR_AFP=15
ISBSG-standalone-discretized.proj1:::Projected_PDR_UFP=4
ISBSG-standalone-discretized.proj1:::Reported_PDR_AFP=5
ISBSG-standalone-discretized.proj1:::Size_Added=5
ISBSG-standalone-discretized.proj1:::Size_Enquiry=1
ISBSG-standalone-discretized.proj1:::Size_File=13
ISBSG-standalone-discretized.proj1:::Size_Input=17
ISBSG-standalone-discretized.proj1:::Size_Interface=4
ISBSG-standalone-discretized.proj1:::Size_Output=4
ISBSG-standalone-discretized.proj2:::Adj_FP=1
ISBSG-standalone-discretized.proj2:::Norm_PDF_UFP=6
ISBSG-standalone-discretized.proj2:::Norm_PDR_AFP=3
ISBSG-standalone-discretized.proj2:::Projected_PDR_UFP=2
ISBSG-standalone-discretized.proj2:::Reported_PDR_AFP=1
ISBSG-standalone-discretized.proj2:::Size_Input=4
ISBSG-standalone-discretized.proj2:::VAF=2
ISBSG-standalone-discretized.proj2:::Work_Effort=33
MAXWELL.proj1:::T01=9
MAXWELL.proj1:::T02=11
MAXWELL.proj1:::T03=8
MAXWELL.proj1:::T04=4
MAXWELL.proj1:::T05=7
MAXWELL.proj1:::T06=3
MAXWELL.proj1:::T07=23
MAXWELL.proj1:::T08=5
MAXWELL.proj1:::T09=4
MAXWELL.proj1:::T10=7
MAXWELL.proj1:::T11=6
MAXWELL.proj1:::T12=5
MAXWELL.proj1:::T13=4
MAXWELL.proj1:::T14=5
MAXWELL.proj1:::T15=4
coc81.allLarge:::acap=4
coc81.allLarge:::aexp=6
coc81.allLarge:::cplx=3
coc81.allLarge:::data=14
coc81.allLarge:::lexp=5
coc81.allLarge:::modp=1
coc81.allLarge:::pcap=4
coc81.allLarge:::rely=22
coc81.allLarge:::sced=4
coc81.allLarge:::stor=14
coc81.allLarge:::time=5
coc81.allLarge:::tool=4
coc81.allLarge:::turn=6
coc81.allLarge:::vexp=13
coc81.allLarge:::virt=1
coc81.flight:::acap=6
coc81.flight:::aexp=21
coc81.flight:::cplx=7
coc81.flight:::data=18
coc81.flight:::lexp=4
coc81.flight:::pcap=1
coc81.flight:::rely=7
coc81.flight:::stor=27
coc81.flight:::turn=19
coc81.ground:::acap=8
coc81.ground:::cplx=7
coc81.ground:::data=30
coc81.ground:::ltex=2
coc81.ground:::pcap=3
coc81.ground:::pmat=2
coc81.ground:::rely=8
coc81.ground:::stor=30
coc81.ground:::time=9
coc81.osp2:::sced=58
coc81.osp:::acap=4
coc81.osp:::aexp=6
coc81.osp:::cplx=20
coc81.osp:::sced=7
coc81.osp:::stor=23
coc81.osp:::tool=9
desharnais-discretized.proj1:::Adjustment=4
desharnais-discretized.proj1:::Entities=2
desharnais-discretized.proj1:::Language=11
desharnais-discretized.proj1:::Length=3
desharnais-discretized.proj1:::ManagerExp=2
desharnais-discretized.proj1:::PointsAjust=6
desharnais-discretized.proj1:::PointsNonAdjust=3
desharnais-discretized.proj1:::TeamExp=1
desharnais-discretized.proj1:::Transactions=1
desharnais-discretized.proj1:::YearEnd=1

These numbers come from what was returned after 100 runs, so the total counts within one project might not match another. The take-home here is that we're able to maintain these large effort reductions even with widely different recommendations, both within the same project and across different projects with similar metrics. Some projects, such as OSP2, are very stable, recommending only SCED. Others show a distribution favoring a particular attribute, and some seem to get by recommending a small subset of attributes.

Closer study is needed of the attributes returned. Right now I'm not looking at the recommended values, just what attribute was chosen.

Monday, February 8, 2010

Initial Kernel Estimation Plots

  • Kernel-based methods are among the most popular non-parametric estimators
  • They can discover structural features in the data that parametric approaches may not reveal
  • Kernel performance is measured by AMISE (asymptotic mean integrated squared error)
  • The graphs below were produced with the Epanechnikov kernel, since it minimizes AMISE
  • For the graphs below, we compute a probability density estimate from a sample vector and then plot the evaluation of that estimate (a minimal sketch follows this list)
  • The plots (lines) below correspond to the probability density functions for k=16 from Cocomo81 and for Cocomo81 itself, respectively. The stars are the sorted effort values of the related datasets.
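To make the procedure concrete, here is the minimal Epanechnikov KDE sketch promised above. The effort values are made up, and the Scott-style bandwidth is just a simple default, not necessarily the one used for the plots.

# Minimal Epanechnikov kernel density estimate (made-up effort values); the
# bandwidth is a simple Scott-style default, not necessarily what the plots used.
import math

def epanechnikov_kde(sample, h=None):
    n = len(sample)
    if h is None:
        mean = sum(sample) / n
        sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        h = sigma * n ** (-1.0 / 5.0)
    def pdf(x):
        total = sum(0.75 * (1 - u * u)
                    for u in ((x - xi) / h for xi in sample) if abs(u) <= 1)
        return total / (n * h)
    return pdf

efforts = [2.0, 2.4, 3.1, 3.3, 4.8, 6.0, 8.2, 11.4, 15.0, 20.1]
pdf = epanechnikov_kde(efforts)
for x in efforts:                    # evaluate the estimate at the sorted sample points
    print(x, round(pdf(x), 4))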





Postcards from CHINA

Greg reports he can halve the median MRE as follows:
  1. take data
  2. cluster using k=10 means
  3. assign a weight alpha to each cluster
  4. learn alpha by random sampling (assessing each sample with LSR), selecting the best alphas via BORE (a sketch of this pipeline appears after the results)
Results:
  • nasa93: median(MRE) now 0.4, was 0.61.
  • cocomonasa: median(MRE) now 0.39, was 0.64
  • Nasa93:
            #key, ties, win, loss, win-loss
     oversampled,    0,   1,    0,        1
        original,    0,   0,    1,       -1
    
    cocomonasa:
            #key, ties, win, loss, win-loss
     oversampled,    0,   1,    0,        1
        original,    0,   0,    1,       -1
    
    Combined:
                      #key, ties, win, loss, win-loss
    oversampled|nasa93    ,    1,   2,    0,        2
    oversampled|cocomonasa,    1,   2,    0,        2
    original|nasa93       ,    1,   0,    2,       -2
    original|cocomonasa   ,    1,   0,    2,       -2
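Here is a rough sketch of that pipeline, as promised above. It is my own reconstruction, not Greg's code: it uses a single size-like feature so that the weighted least-squares step stays short, replaces BORE with a plain "keep the best-scoring alphas" rule, and runs on synthetic data.

# Rough sketch of the reported pipeline (not Greg's code): k-means clusters,
# one alpha weight per cluster, random search over alphas scored by the median
# MRE of a weighted least-squares fit. BORE is simplified to "keep the best".
import random
from statistics import median

def kmeans(points, k=10, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return lambda p: min(range(k), key=lambda i: abs(p - centers[i]))

def weighted_lsr(xs, ys, ws):
    # Weighted least squares for y = a*x + b.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    a = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / \
        max(sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)), 1e-9)
    return a, my - a * mx

def median_mre(xs, ys, alphas, cluster_of):
    ws = [alphas[cluster_of(x)] for x in xs]
    a, b = weighted_lsr(xs, ys, ws)
    return median(abs(y - (a * x + b)) / y for x, y in zip(xs, ys))

random.seed(1)
xs = [random.uniform(1, 100) for _ in range(60)]          # e.g. size
ys = [3 * x + random.uniform(-40, 40) + 50 for x in xs]   # e.g. effort
cluster_of = kmeans(xs, k=10)
best = min(([random.random() for _ in range(10)] for _ in range(200)),
           key=lambda alphas: median_mre(xs, ys, alphas, cluster_of))
print("median MRE:", round(median_mre(xs, ys, best, cluster_of), 3))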
    

Learning Treatments for Student Retention using TAR3

Using the treatment learner TAR3 (here, Menzies 2001), treatments were learned from features previously selected in the enrollment-data experiments. Below are the first treatments obtained (a note on reading the "worth" scores follows the listing):


Class = YES

bestClass = 20.00%


1 worth=1.112139 [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
2 worth=1.105999 [FinAidPARENT_MAR=M] [FinAidSTUDENT_WA=[3878.000000..561500.000000]]
3 worth=1.102110 [FinAidFATHER_WAG=[43669.000000..999999.000000]] [FinAidFATHER_ED=3.0]
4 worth=1.097397 [FinAidPARENT_AGI=[67908.000000..999999.000000]]
5 worth=1.092824 [FinAidSTUDENT_WA=[3878.000000..561500.000000]]
6 worth=1.092252 [FinAidMOTHER_ED=3.0] [FinAidFATHER_ED=3.0]
7 worth=1.090565 [FinAidFATHER_WAG=[43669.000000..999999.000000]]
8 worth=1.082257 [FinAidFATHER_ED=3.0]
9 worth=1.080906 [FinAidMOTHER_WAG=[23139.000000..533395.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
10 worth=1.079164 [TotalClasses=[6.000000..10.000000]] [FinAidPARENT_MAR=M]


Class = YES

bestClass = 10.00%


1 worth=1.159171 [FinAidPARENT_MAR=M] [FinAidPARENT_AGI=[67908.000000..999999.000000]] [CUR_ERLHRS=[15.000000..22.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
2 worth=1.156786 [FinAidFATHER_WAG=[43669.000000..999999.000000]] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
3 worth=1.132347 [FinAidPARENT_HOU=4.0] [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
4 worth=1.128918 [HS_PERCENT=[60.000000..100.000000]] [FinAidMOTHER_WAG=[23139.000000..533395.000000]] [PercentileRankHSGPA=[49.500000..98.400002]]
5 worth=1.113224 [FinAidSTUDENT_TA=2.0] [HS_PERCENT=[60.000000..100.000000]]
6 worth=1.112139 [FinAidSTUDENT_AG=[5123.000000..724724.000000]]
7 worth=1.097397 [FinAidPARENT_AGI=[67908.000000..999999.000000]]
8 worth=1.092824 [FinAidSTUDENT_WA=[3878.000000..561500.000000]]
9 worth=1.090565 [FinAidFATHER_WAG=[43669.000000..999999.000000]]
10 worth=1.089154 [ACT1_MATH=[19.000000..510.000000]] [FinAidSTUDENT_MA=U]

Class = NO

bestClass = 20.00%


1 worth=1.073376 [PercentileRankHSGPA=[0.000000..49.500000)]


Class = YES

bestClass = 10.00%


1 worth=1.090117 [ENG10=Y]
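To read the listings above: each numbered line is a candidate treatment (a conjunction of attribute ranges), and its worth scores how much the class distribution improves when only the matching rows are kept. The scoring below is a simplified lift-style stand-in I wrote for illustration, not TAR3's exact formula, and the toy rows are invented.

# Simplified stand-in for TAR3-style scoring (not the exact TAR3 formula):
# a treatment is a set of attribute constraints, and its "worth" is the lift
# in the weighted class distribution after selecting only the matching rows.
def worth(rows, treatment, scores={"YES": 2, "NO": 1}):
    # rows: list of (attributes_dict, class_label); treatment: dict of attr -> required value.
    def mean_score(subset):
        return sum(scores[c] for _, c in subset) / len(subset)
    selected = [r for r in rows if all(r[0].get(a) == v for a, v in treatment.items())]
    if not selected:
        return 0.0
    return mean_score(selected) / mean_score(rows)

rows = [({"FinAidPARENT_MAR": "M", "ENG10": "Y"}, "YES"),
        ({"FinAidPARENT_MAR": "S", "ENG10": "N"}, "NO"),
        ({"FinAidPARENT_MAR": "M", "ENG10": "Y"}, "YES"),
        ({"FinAidPARENT_MAR": "S", "ENG10": "Y"}, "NO")]
print(worth(rows, {"FinAidPARENT_MAR": "M"}))   # > 1 means the treatment helps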

Tuesday, February 2, 2010

Whole vs. Partial Learning - NASA MDP Components - Part II - Undersampling vs. Subsampling vs. No Sampling

Experiment stats:

  • discretize
  • nb, j48
  • 10*10-way
  • supersampling / undersampling / no sampling vs. component vs. all (a sketch of the sampling options appears below)

Mann-Whitney U and quartile results here
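Here is the promised sketch of the sampling options. The terms are my own reading of the setup and may not match the experiment exactly: "no sampling" uses the training data as-is, undersampling discards majority-class rows down to the minority count, and I read "supersampling" as duplicating minority-class rows up to the majority count.

# Sketch of the sampling options (my own simplification of the setup).
import random

def split_by_class(rows):
    # rows: list of (features, label) with a binary defect label.
    pos = [r for r in rows if r[1]]
    neg = [r for r in rows if not r[1]]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def undersample(rows):
    # Discard majority-class rows down to the minority count.
    minority, majority = split_by_class(rows)
    return minority + random.sample(majority, len(minority))

def supersample(rows):
    # Duplicate randomly chosen minority-class rows up to the majority count.
    minority, majority = split_by_class(rows)
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rows = [((i,), i % 5 == 0) for i in range(50)]            # toy data: 10 defective, 40 clean
print(len(undersample(rows)), len(supersample(rows)))     # 20 and 80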


On Profiling RDR (Clump)...

Following are runtime statistics for Bryan's RDR implementation using the JM1 data set (~11000 lines).


As you can easily see, NBAYES is the outlier, weighing in at a whopping 18.3 minutes. This tells us that we're spending far too much time classifying as we go down the tree. What's more, the data doesn't necessarily change from node to node; what we use to classify (based on our rules) is the variable factor.

Now it's time for the proposal. Since we're using the same information for each classification, why not cache the frequency counts early on and simply merge counts when RDR generates a new rule? Once such a system is implemented, RDR's runtime should be minimal.
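Here is a rough sketch of what that caching might look like. This is my reading of the proposal, not Bryan's code: each rule node keeps Naive Bayes style frequency-count tables, classification reads from those tables, and merging two nodes is just adding their counts rather than re-scanning the data.

# Rough sketch of the caching idea (my reading, not Bryan's code): per-node
# frequency-count tables that can be merged without re-scanning the data.
from collections import Counter

class CountCache:
    def __init__(self):
        self.class_counts = Counter()         # class -> n
        self.attr_counts = Counter()          # (class, attr, value) -> n

    def add_rows(self, rows):
        # rows: list of (attributes_dict, class_label)
        for attrs, cls in rows:
            self.class_counts[cls] += 1
            for a, v in attrs.items():
                self.attr_counts[(cls, a, v)] += 1

    def merge(self, other):
        # Merging two caches is just adding count tables -- no re-classification needed.
        self.class_counts += other.class_counts
        self.attr_counts += other.attr_counts

    def classify(self, attrs, m=1.0):
        # Naive Bayes over the cached counts; m is a small smoothing constant.
        total = sum(self.class_counts.values())
        def score(cls):
            p = self.class_counts[cls] / total
            for a, v in attrs.items():
                p *= (self.attr_counts[(cls, a, v)] + m) / (self.class_counts[cls] + 2 * m)
            return p
        return max(self.class_counts, key=score)

node_a, node_b = CountCache(), CountCache()
node_a.add_rows([({"loc": "hi"}, "defective"), ({"loc": "lo"}, "clean")])
node_b.add_rows([({"loc": "hi"}, "defective")])
node_a.merge(node_b)
print(node_a.classify({"loc": "hi"}))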

Discretizing W

In my last post, I discovered that the current problem datasets all suffered from an abundance of attribute values. Given that W relies on exact matching for its query constraints, this causes a problem where nothing is left in the testing set after applying the first constraint. At least in the case of the ISBSG dataset, discretization makes a big difference:

Median Effort Reductions


Median Spread Reductions

For the Desharnais dataset, problems remain. Even with discretization we can only generate a constrained query 25% of the time. The reported results use 9 discrete bins, but tinkering with this number has yet to yield any gains in effort reduction. This is concerning, but currently all Desharnais projects are highly synthetic. A closer look at the data is in order.
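For reference, here is a minimal discretizer of the kind used here. The equal-width choice is an assumption on my part (the actual binning scheme may differ); bins=9 matches the setting reported above.

# Minimal equal-width discretizer (the actual binning scheme used for W may
# differ); bins=9 matches the setting reported above.
def discretize(column, bins=9):
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0          # guard against a constant column
    return [min(int((v - lo) / width), bins - 1) for v in column]

efforts = [546, 1200, 2300, 3100, 4500, 5200, 7800, 9000, 12000]
print(discretize(efforts))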

Awktimization

If you're doing a lot of heavy text crunching with Awk, you've probably found that it can get quite slow very quickly. It always seems to happen whether you program well or not. Unfortunately for us, Awk isn't a very high efficiency tool. But that's not going to stop us. Not today.

You see, the GNU implementation of Awk, gawk, has a built-in profiler that you can use to analyze and optimize your Awk scripts. Its beautiful simplicity will cast light on those sinister bottlenecks destroying your performance. To use it, simply call pgawk from the command line on any machine with GNU Awk installed. (Alternatively, you can pass the --profile option to gawk.) The runtime will execute your script and write the profiler results to a file named awkprof.out by default. This file lists, in its left margin, the execution count for each statement executed.

Optimizing the now elucidated bottlenecks in your program should give you a considerable performance boost. However, if after improving on your runtimes, you still find that your program isn't zippy enough, you can put a little bit more spring in its step by compiling it to ANSI C.

That's right, the open-source Awka project takes plain old Awk code and transforms it into a nitroglycerin-breathing, high-performance rocket car written in C. Nested loops, arrays, and user functions benefit the most from the conversion. The final product will typically see performance gains of around 50%, but depending on the code, you could see 300-400% speed boosts.

So if you find that your Awk text crunching just isn't quite up to par, maybe it's time to consider grinding that puppy through a profiler. And if that still isn't enough, give it the Total Language Makeover with a conversion to C. And if you have a problem that's still taking millennia to compute, well, maybe you should contemplate asking a simpler question. :)