TAR3 is not a data miner. Not only does it have a different goal (treatment versus classification), its very design runs counter to that of traditional classifiers such as Naive Bayes. Data miners, according to Bradley et al:
TAR3 stores all examples from the dataset in RAM. It also requires three scans of the data in order to discretize, build theories, and rank the generated treatments. TAR4 was designed and modeled after the SAWTOOTH incremental Naive Bayes classifier in order to improve the runtime, lessen the memory usage, and better scale to large datasets.
Pseudocode framework for a treatment learner. TAR3 and TAR4.1 use this basic framework, each defining their own objective function:
1. Procedure TreatmentLearner(LIVES=MAX_LIVES)
2. for every range R
4. while LIVES>0
5. BEFORE:= size(TEMP)
6. TEMP:= union(TEMP,some())
7. if BEFORE:= size(TEMP)
10. LIVES:= MAX_LIVES
11. sort(TEMP) by score()
12. return TOP_ITEMS
1. Procedure one(X=random(SIZE))
2. for 1 to X
3. TREATMENT:=TREATMENT+ a random range from CDF(SCORES)
4. return TREATMENT
1. Procedure some
2. for 1 to REPEATS
3. TREATMENTS:= TREATMENTS + one()
4. sort(TREATMENTS) by score()
5. return TOP_ITEMS
Naive Bayes classifiers offer a relationship between fragments of evidence E_i, a prior probability for a class P(H), and a posterior probability P(H|E):
For numeric features, a features mean and standard deviation are used in a Gaussian probability function:
Simple naive Bayes classifiers are called "naive" since they assume independence of each feature. Potentially, this is a significant problem for data sets where the static code measures are highly correlated (e.g. the number of symbols in a module increases linearly with the module's lines of code). However, Domingos and Pazzini have shown theoretically that the independence assumption is a problem in a vanishingly small percent of cases.
TAR4.1 still requires two passes through the data, for discretization and for building treatments. These two steps function in exactly the same manner as the corresponding steps in the TAR3 learner. TAR4.1, however, eliminates the final pass by building a scoring cache during the BORE classification stage. During classification, examples are placed in a U-dimensional hypercube, with one dimension for each utility. Each example e in E has a normalized distance 0 <= D_i <= 1 from the apex, the area where the best examples reside. When BORE classifies examples into best and rest, that normalized distance adds D_i to the down table and 1-D_i to the up table.
When treatments are scored by TAR4.1, the algorithm does a linear-time table lookup instead of scanning the entire dataset. Each range R_j containing example_i adds down_i and up_i to frequency counts F(R_j|base) and F(R_j|apex). These frequency counts
are then used to compute the following equations:
TAR4.1 finds the smallest treatment T that maximizes:
Note the squared term in the top of the equation, L(apex|T)^2. This term was not squared in the first design of TAR4. The first version of TAR4 was a failure. Its results were muddled by the dependent attributes. The standard Naive Bayes assumes independence between all attributes and keeps singleton counts. TAR4 added redundant information, which altered the probabilities. In effect, it produced treatments with high scores, but without the minimum best support required in TAR3. TAR4.1 was redesigned to prune treatments with low support in the original dataset. By squaring the likelihood that a range appears in the apex, those treatments that lack support are dropped in favor of those that appear often in the apex.