ai @ wvu: Software

Showing posts with label Software. Show all posts

Tuesday, October 6, 2009

POM2

In 2008, Dr. Menzies authored a paper titled "Using Simulation to Investigate Requirements Prioritization Strategies". This paper showed the effects of project and requirement dynamism on the software development processes. In the Spring of 2008, he tasked his Artificial Intelligence class to expand on this model.

Thus, POM2 was born. One of the main differences between POM and POM2 is that POM focused on small teams and small projects. POM2 built its model around the 5 different branches proposed by Dr. Turner and Dr. Boehm. This varied the project size as well as introducing other factors.

Personnel - The number skill level of the staff
Dynamism - The amount of project change
Criticality - The impact of failure after deployment
Culture - How accepting the staff is of change
Size - The total number of staff across all teams

POM2 implimented all of the branches excluding Personnel. With the information available to us, Personnel was a wash. Better people cost more and produced more. Less talented people cost less and produced less. This lead to a near zero effect.

POM2 was then put through a Monte Carlo simulator while a KEYS search engine isolated the interesting areas in the simulator.

The main discovery is that Agile development methods performed as well as or better than Plan based development methods in all areas, especially excelling in dynamic environments. Further research is needed into the Personell and Criticality branches of POM2 to fully flesh out the model.

POM2 was accepted as a short paper to ASE'09. The full version of the paper can be found here, and the short version can be found here.

Monday, September 28, 2009

Want to try some treatment learning?

Have a difficult problem to solve? Do you want to know the exact factors that lead to a diagnosis? You might be able to apply treatment learning to the problem.

First, read this and this. Those two articles should give you the basic background. For a more in-depth discussion, take a look at this draft.

If you're feeling brave, go grab the source code for either TAR3 or TAR4.1. NASA has recently given us the newest versions of both treatment learners. To run these, your computer will require a C compiler and the Matlab runtime environment.

TAR Source Code

Warning: This code does have bugs and isn't particularly well-documented. It works, but don't shout at us if it occasionally crashes.

Monday, September 7, 2009

KEYS and KEYS2

Within a model, there are chains of reasons that link inputs to the desired goals. As one might imagine, some of the links in the chain clash with others. Some of those clashes are most upstream; they are not dependent on other clashes. In the following chains of reasoning, the clashes are {e, -e}, {g, -g}, {j, -j}; the most upstream clashes are {e, -e}, {g, -g}

In order to optimize decision making about this model, we must first decide about these most upstream clashing reasons (the "keys"). Most of this model is completely irrelevant. In the context of reaching the goal, the only important discussions are the clashes {g,-g,j, -j}. Further, since {j, -j} are dependent on {g, -g}, then the core decision must be about variable g with two disputed values: true and false.

Setting the keys reduces the number of reachable states within the model. Formally, the reachable states reduce to the cross-product of all of the ranges of the collars. We call this the clumping effect. Only a small fraction of the possible states are actually reachable. This notion of $keys$ has been discovered and rediscovered
many times by many researchers. Historically, finding the keys has seen to be a very hard task. For example, finding the keys is analogous to finding the minimal environments of DeKleer's ATMS algorithm. Formally, this is logical abduction, which is an NP-hard task.

Our method for finding the keys uses a Bayesian sampling method. If a model contains keys then, by definition, those variables must appear in all solutions to that model. If model outputs are scored by some oracle, then the key variables are those with ranges that occur with very different frequencies in high/low scored model outputs. Therefore, we need not search for the keys- rather, we just need to keep frequency counts on how often ranges appear in best or rest outputs.

KEYS implements this Bayesian sampling methods. It has two main components - a greedy search and the BORE ranking heuristic. The greedy search explores a space of M mitigations over the course of M "eras". Initially, the entire set of mitigations is set randomly. During each era, one more mitigation is set to M_i=X_j, X_j being either true or false. In the original version of KEYS, the greedy search fixes one variable per era. Recent experiments use a newer version, called KEYS2, that fixes an increasing number of variables as the search progresses
(see below for details).

KEYS (and KEYS2), each era e generates a set as follows:

[1:] MaxTries times repeat:
- Selected[1{\ldots}(e-1)] are settings from previous eras.
- Guessed are randomly selected values for unfixed mitigations.
- Input = selected from guessed.
- Call model to compute score=ddp(input);
[2:] The MaxTries scores are divided into B "best" and the remainder become the "rest".
[3:] The input mitigation values are then scored using BORE (described below).
[4:] The top ranked mitigations (the default is one, but the user may fix multiple mitigations at once) are fixed and stored in selected[e].

The search moves to era e+1 and repeats steps 1,2,3,4. This process stops when every mitigation has a setting. The exact settings for MaxTries and B must be set via engineering judgment. After some experimentation, we used MaxTries=100 and B=10.

Pseudocode for the KEYS algorithm:


1. Procedure KEYS
2. while FIXED_MITIGATIONS != TOTAL_MITIGATIONS
3.    for I:=1 to 100
4.       SELECTED[1...(I-1)] = best decisions up to this step
5.       GUESSED = random settings to the remaining mitigations
6.       INPUT = SELECTED + GUESSED
7.       SCORES= SCORE(INPUT)
8.    end for   
9.    for J:=1 to NUM_MITIGATIONS_TO_SET 
10.      TOP_MITIGATION = BORE(SCORES)       
11.      SELECTED[FIXED_MITIGATIONS++] = TOP_MITIGATION 
12.   end for  
13. end while          
14. return SELECTED \end{verbatim} }

KEYS ranks mitigations by combining a novel support-based Bayesian ranking measure. BORE (short for "best or rest") divides numeric scores seen over K runs and stores the top 10% in best and the remaining 90% scores in the set rest (the bes$ set is computed by studying the delta of each score to the best score seen in any era). It then computes the probability that a value is found in best using Bayes theorem. The theorem uses evidence E and a prior probability P(H) for hypothesis H in{best, rest}, to calculate a posterior probability

When applying the theorem, likelihoods are computed from observed frequencies. These likelihoods (called "like" below) are then normalized to create probabilities. This normalization cancels out P(E) in Bayes theorem. For example, after K=10,000 runs are divided into 1,000 best solutions and 9,000 rest, the value mitigation31=false might appear 10 times in the best solutions, but only 5 times in the rest. Hence:

Previously, we have found that Bayes theorem is a poor ranking heuristic since it is easily distracted by low frequency evidence. For example, note how the probability of E belonging to the best class is moderately high even though its support is very low; i.e. P(best|E)=0.66 but freq(E|best) = 0.01.

To avoid the problem of unreliable low frequency evidence, we augment the equation with a support term. Support should increase as the frequency of a value increases, i.e. like(best|E) is a valid support measure. Hence, step 3 of our greedy search ranks values via

For each era, KEYS samples the model and fixes the top N=1 settings. Observations have suggested that N=1 is, perhaps, an overly conservative search policy. At least for the DDP models, we have observed that the improvement in costs and attainments generally increase for each era of KEYS. This lead to the speculation that we could jump further and faster into the solution space by fixing N=1 settings per era. Such a jump policy can be implemented as a small change to KEYS, which we call KEYS2.

Standard KEYS assigns the value of one to NUM_MITIGATIONS_TO_SET (see the pseudocode above);
Other variants of KEYS assigns larger values.

In era 1, KEYS2 behaves exactly the same as KEYS. In (say) era 3, KEYS2 will fix the top 3 ranked ranges. Since it sets more variables at each era, KEYS2 terminates earlier than KEYS.

TAR4.1

New to Treatment Learning and TAR3? Read this first.

TAR3 is not a data miner. Not only does it have a different goal (treatment versus classification), its very design runs counter to that of traditional classifiers such as Naive Bayes. Data miners, according to Bradley et al:

Require one scan, or less, of the data.

Are on-line, anytime algorithms.

Are suspendable, stoppable, and resumable.

Efficiently and incrementally add new data to existing models.

Work within the available RAM.

TAR3 stores all examples from the dataset in RAM. It also requires three scans of the data in order to discretize, build theories, and rank the generated treatments. TAR4 was designed and modeled after the SAWTOOTH incremental Naive Bayes classifier in order to improve the runtime, lessen the memory usage, and better scale to large datasets.

Pseudocode framework for a treatment learner. TAR3 and TAR4.1 use this basic framework, each defining their own objective function:


1. Procedure TreatmentLearner(LIVES=MAX_LIVES)
2. for every range R
3.    SCORES[R]:=score(R)
4. while LIVES>0
5.    BEFORE:= size(TEMP)
6.    TEMP:= union(TEMP,some())
7.    if BEFORE:= size(TEMP)
8.       LIVES--
9.    else
10.      LIVES:= MAX_LIVES
11. sort(TEMP) by score()
12. return TOP_ITEMS 

1. Procedure one(X=random(SIZE))
2. for 1 to X
3.    TREATMENT:=TREATMENT+ a random range from CDF(SCORES)
4. return TREATMENT

1. Procedure some
2. for 1 to REPEATS
3.    TREATMENTS:= TREATMENTS + one()
4. sort(TREATMENTS) by score()
5. return TOP_ITEMS

Naive Bayes classifiers offer a relationship between fragments of evidence E_i, a prior probability for a class P(H), and a posterior probability P(H|E):

For numeric features, a features mean and standard deviation are used in a Gaussian probability function:

Simple naive Bayes classifiers are called "naive" since they assume independence of each feature. Potentially, this is a significant problem for data sets where the static code measures are highly correlated (e.g. the number of symbols in a module increases linearly with the module's lines of code). However, Domingos and Pazzini have shown theoretically that the independence assumption is a problem in a vanishingly small percent of cases.

TAR4.1 still requires two passes through the data, for discretization and for building treatments. These two steps function in exactly the same manner as the corresponding steps in the TAR3 learner. TAR4.1, however, eliminates the final pass by building a scoring cache during the BORE classification stage. During classification, examples are placed in a U-dimensional hypercube, with one dimension for each utility. Each example e in E has a normalized distance 0 <= D_i <= 1 from the apex, the area where the best examples reside. When BORE classifies examples into best and rest, that normalized distance adds D_i to the down table and 1-D_i to the up table.

When treatments are scored by TAR4.1, the algorithm does a linear-time table lookup instead of scanning the entire dataset. Each range R_j containing example_i adds down_i and up_i to frequency counts F(R_j|base) and F(R_j|apex). These frequency counts
are then used to compute the following equations:

TAR4.1 finds the smallest treatment T that maximizes:

Note the squared term in the top of the equation, L(apex|T)^2. This term was not squared in the first design of TAR4. The first version of TAR4 was a failure. Its results were muddled by the dependent attributes. The standard Naive Bayes assumes independence between all attributes and keeps singleton counts. TAR4 added redundant information, which altered the probabilities. In effect, it produced treatments with high scores, but without the minimum best support required in TAR3. TAR4.1 was redesigned to prune treatments with low support in the original dataset. By squaring the likelihood that a range appears in the apex, those treatments that lack support are dropped in favor of those that appear often in the apex.

Ourmine

Ourmine is a data mining toolkit built to be easily learned/taught. It implements a variety of languages constructed in a modular way, providing the opportunity for easy extendability.

Ourmine features

Easily obtained and can be run from any UNIX based operating system, including OSX & Linux
Without the complications of a GUI, advanced data mining concepts can be taught effectively through the use of just a few commands
Ourmine utilizes a modulated structure, allowing experiments to be easily conducted using Ourmine by adding any custom code
Along with its many data preparation and formatting functions, Ourmine allows the power of interactive bash scripting to be conducted on the fly, providing a higher level of understanding and productivity
And much more...

For more information about Ourmine, please go here

Thursday, September 3, 2009

CLIFF

CLIFF is a plotting tool which offers the visualization of data. This software features four(4) of the current models used in the forensic community, namely the 1995 Evett model, the 1996 Walsh model, the Seheult model and the Grove model. For each model, data can be generated randomly and plotted. Other features of CLIFF include:

the ability to perform dimensionality reduction by applying the Fastmap algorithm
the ability to determine if the results gained from a region of space is important and well supported.

Tuesday, September 1, 2009

TAR3

Testing large-scale systems is expensive in terms of both time and money. Running simulations early in the process is a proven method of finding the design faults likely to lead to critical system failures, but determining the exact cause of those errors is still a time-consuming process and requires access to a limited number of domain experts. It is desirable to find an automated method that explores the large number of combinations and is able to isolate likely fault points.

Treatment learning is a subset of minimal contrast-set learning that, rather than classifying data into distinct categories, focuses on finding the unique factors that lead to a particular classification. That is, they find the smallest change to the data that causes the largest change in the class distribution. These treatments, when imposed, are able to identify the settings most likely to cause a mission-critical failure.

Pseudocode framework for a treatment learner. TAR3 and TAR4.1 use this basic framework, each defining their own objective function:



1. Procedure TreatmentLearner(LIVES=MAX_LIVES)
2. for every range R
3.    SCORES[R]:=score(R)
4. while LIVES>0
5.    BEFORE:= size(TEMP)
6.    TEMP:= union(TEMP,some())
7.    if BEFORE:= size(TEMP)
8.       LIVES--
9.    else
10.      LIVES:= MAX_LIVES
11. sort(TEMP) by score()
12. return TOP_ITEMS 

1. Procedure one(X=random(SIZE))
2. for 1 to X
3.    TREATMENT:=TREATMENT+ a random range from CDF(SCORES)
4. return TREATMENT

1. Procedure some
2. for 1 to REPEATS
3.    TREATMENTS:= TREATMENTS + one()
4. sort(TREATMENTS) by score()
5. return TOP_ITEMS

TAR3 (and its predecessor TAR2) are based on two fundamental concepts - lift and support.

The lift of a treatment is the change that some decision makes to a set of examples after imposing that decision. TAR3 is given a set of training examples E. Each example in E maps a set of attribute ranges R_i, R_j, ... -> C to some class score. The class symbols C_1, C_2,... are stamped with a rank generated from a utility score {U_1 < U_2 < ... < U_C}. Within E, these classes occur at frequencies F_1, F_2,..., F_C where sum(F_i) = 1. A treatment T of size M is a conjunction of attribute ranges {R_1 ^ R_2... ^ R_M}. Some subset of e from E is contained within the treatment. In that subset, the classes occur at frequencies f_1, f_2,..., f_C. TAR3 seeks the smallest treatment T which induces the biggest changes in the weighted sum of the utilities times frequencies of the classes. Formally, the lift is defined as:

lift = sum(U_c*f_c)/sum(U_c*F_c)

The classes used for treatment learning get a score U_C and the learner uses this to assess the class frequencies resulting from applying a treatment, finding the subset of the inputs that falls within the reduced treatment space. In normal operation, a treatment learner conducts controller learning; that is, it finds a treatment which selects for better classes and rejects worse classes. By reversing the scoring function, treatment learning can also select for the worst classes and reject the better classes. This mode is called monitor learning because it locates the one thing we should most watch for.

Real-world datasets, especially those from hardware systems, contain some noise, incorrect or misleading data caused my accidents and miscalculations. If these noisy examples are perfectly correlated with failing examples, the treatment may become overfitted. An overfitted model may come with a massive lift score, but it does not
accurately reflect the details of the entire dataset. To avoid overfitting, learners need to adopt a threshold and reject all treatments that fall on the wrong side of this threshold. We define this threshold as the minimum best support.

Given the desired class, the best support is the ratio of the frequency of that class within the treatment subset to the frequency of that class in the overall dataset. To avoid overfitting, TAR3 rejects all treatments with best support lower than a user-defined minimum (usually 0.2). As a result, the only treatments returned by TAR3 will have both a high lift and a high best support. This is also the reason that TAR3 prefers smaller treatments. The fewer rules adopted, the more evidence that will exist supporting those rules.

TAR3's lift and support calculations can assess the effectiveness of a treatment, but they are not what generates the treatments themselves. A naive treatment learner might attempt to test all subsets of all ranges of all of the attributes. Because a dataset of size N has 2^N possible subsets, this type of brute force attempt is inefficient. The art of a good treatment learner is in finding good heuristics for generating
candidate treatments.

The algorithm begins by discretizing every continuous attribute into smaller ranges by sorting their values and dividing them into a set of equally-sized bins. It then assumes the small-treatment effect; that is, it only builds treatments up to a user-defined size. Past research has shown that treatments size should be less than four attributes. TAR3 will then only build treatments from the discretized ranges with
a high heuristic value.

TAR3 determines which ranges to use by first determining the lift of each individual attribute. These individual scores are then sorted and converted into a cumulative probability distribution. TAR3 randomly selects values from this distribution, meaning that low-scoring ranges are unlikely to be selected. To build a treatment, n (random from 1...max treatment size) ranges are selected and combined. These treatments are then scored and sorted. If no improvement is seen after a certain number of rounds, TAR3 terminates and returns the top treatments.