9/18/14- Possible problems with "defaults only"
Here is the NB+RF result I was waiting on. Looks like we now have a convincing argument that tuning works, right?
Let's confirm by checking RF only. Not as nice of a picture:
What about NB only?
Now we can see that the low performance in the defaults only case merely reflects that RF is a better choice of learner than NB. (not a big surprise) The "defaults only" case works just as well as out-of-set tuning.
8/15/14-- Bayes Test Confirms Combination Suspicions, Proposed Experiments
So here are the old results from the combination of three Bayesian classifiers (stupid combination method):
Ranking Param: pD,pF AUC
1: 0.76 defaults only -- current xval
1: 0.75 best on current -- current xval
1: 0.74 best on prev -- current xval
0: 0.63 defaults only -- prev to current full
0: 0.63 best on current -- prev to current full
0: 0.62 best on current -- prev to current xval
0: 0.61 best on prev -- prev to current full
0: 0.61 defaults only -- prev to current xval
0: 0.60 best on prev -- prev to current xval
And here are the new results from the same three classifiers, same params, new linear score-weighted combination method:
Ranking Param: pD,pF AUC
1: 0.83 best on current -- current xval
1: 0.83 best on prev -- current xval
1: 0.83 defaults only -- current xval
0: 0.67 best on current -- prev to current xval
0: 0.67 defaults only -- prev to current full
0: 0.67 best on prev -- prev to current full
0: 0.67 best on prev -- prev to current xval
0: 0.65 best on current -- prev to current full
0: 0.63 defaults only -- prev to current xval
Clearly this shows that my old combination method is inferior and gives us a hint that choice of combination method may have a sizable effect on performance.
As per our discussion the other day about using the RQs to guide the experiments, here's my current plan:
8/10/14 -- All Learners Results, Flowcharts, Suggested Changes
To start off, here are the results from the same experiment as before, but with all learners:
(Gaussian NB, Bernoulli Bayes, Multinomial Bayes, Random Forest, Logistic Regression)
Scott-Knott Rank: pD,pF AUC
1: 0.84 best on current -- current xval
1: 0.84 best on prev -- current xval
1: 0.81 defaults only -- current xval
0: 0.71 best on current -- prev to current full
0: 0.70 best on prev -- prev to current full
0: 0.70 best on current -- prev to current xval
0: 0.69 best on prev -- prev to current xval
0: 0.64 defaults only -- prev to current full
0: 0.64 defaults only -- prev to current xval
As you can see, the pD and pF here are inferior to the results that we saw for RF or LR, but slightly better than what we saw for NB. This is contrary to what we would expect to see. After reading Thomas et al. on classifier combination, I don't think my combination method is smart enough. I'm doing the equivalent of unweighted voting by combining the results from multiple learners post hoc, whereas I think the strategy best suited to this is the score-based weighted voting from Thomas et al.
I've also realized that the process I'm doing is going to be very difficult to explain in a way that makes sense, so I've started on a couple of flowcharts. I know these don't entirely conform to the actual definition of the various symbols (using "document" instead of "data" for datasets, etc.), but are these flowcharts, with a few captions, sufficient to convey what's being done in the experiment? Suggestions?
Experiment Overview: Shows how data flows through the two-step experiment for each combination of tuning method and evaluation method. The tuning step and evaluation step will be shown in more detail in the next chart.
Tuning Step: This step can be in one of three modes: defaults only, best on prev, best on current. This step accepts learner objects, a previous and a current dataset, and the tuning method. It passes the datasets and a list of "tuned" learner objects, with their default params overridden by optimized params, to the next step.
Note: Here is where the proposed change to reflect the score-based method discussed in Thomas et al. would apply.
Rather than returning a list of non-dominated learners, this section should return a single ensemble learner object which has all the non-dominated learners as constituents. The ensemble learner will also carry two weights for each constituent learner, based on its precision and negative predictive value from the tuning study. The precision weight will be applied to each positive classification from a learner, and the negative predictive value weight will be applied to its negative classifications. After weighting has been applied, the ensemble learner will determine a consensus through voting and report only the consensus classifications. For the "Defaults Only" case, the ensemble learner will contain one of each type of learner with a weight of 1 for each learner, with the exception of the Bayesian learners, which will each receive a weight of 1/3 because three different Bayesian learners are used as opposed to one of each other scheme.
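The weighting scheme proposed above could be sketched roughly as follows. This is only an illustration under assumptions: the `Constituent` class, the `weighted_vote` function, the toy threshold "learners", and the tie-break toward the negative class are all hypothetical and are not taken from Thomas et al. or from the actual experiment code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Constituent:
    """One non-dominated learner plus its two weights from the tuning study."""
    predict: Callable[[Sequence[float]], int]  # returns 1 (defective) or 0 (clean)
    precision: float  # weight applied to this learner's positive votes
    npv: float        # weight applied to this learner's negative votes


def weighted_vote(constituents: List[Constituent], row: Sequence[float]) -> int:
    """Return the consensus class for one row by score-weighted voting."""
    pos_score = 0.0
    neg_score = 0.0
    for c in constituents:
        if c.predict(row) == 1:
            pos_score += c.precision
        else:
            neg_score += c.npv
    # Ties go to the negative class; this tie-break is an arbitrary assumption.
    return 1 if pos_score > neg_score else 0


# Toy constituents: fixed-threshold rules stand in for trained learners.
learners = [
    Constituent(lambda r: int(r[0] > 0.5), precision=0.9, npv=0.6),
    Constituent(lambda r: int(r[0] > 0.8), precision=0.7, npv=0.8),
    Constituent(lambda r: int(r[0] > 0.2), precision=0.4, npv=0.9),
]

print(weighted_vote(learners, [0.6]))  # two positive votes (0.9 + 0.4) beat one negative (0.8)
```

For the "Defaults Only" case, the same function would work with both weights set to 1 for RF and LR and to 1/3 for each of the three Bayesian learners.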
Evaluation Step: This step evaluates the "tuned" learners which are passed from the previous step in one of three ways: prev to current xval, current xval, or prev to current full set. This step generates a result which constitutes a single point on each of the pD, pF plots.
(I know this is probably a little small in the blog, but opening the image in a new tab should get you the full resolution.)
Result Structure: The "flow" part of "flowchart" doesn't really apply here, but this is the structure of the results generated after the experiment is finished. For each combination of tuning method and evaluation method, there is a list of individual results, one for each dataset.
8/04/14 -- Logistic Regression
Same story, different learner.
Scott-Knott Rank: pD,pF AUC
1: 0.86 defaults only -- current xval
1: 0.85 best on prev -- current xval
1: 0.85 best on current -- current xval
0: 0.73 best on prev -- prev to current xval
0: 0.72 best on current -- prev to current full
0: 0.72 defaults only -- prev to current full
0: 0.72 best on prev -- prev to current full
0: 0.71 defaults only -- prev to current xval
0: 0.71 best on current -- prev to current xval
8/02/14 -- Random Forests
Doing the same thing as before, but with RF instead of NB.
Step 1: Choose one tuning strategy from:
- Defaults Only- no tuning occurs, only the default parameters are used
- Best on Prev- parameters are tuned for best performance on previous version's data
- Best on Current- parameters are tuned by peeking at this version's data (must know current version's class)
Step 2: Choose one evaluation method from:
- Current Xval- current version split into test/train groups for 5x5 cross-validation
- Prev to Current Xval- like above, but training on previous version and testing on the current
- Prev to Current Full- Entire previous set is used for training, entire current set used for testing
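As a rough illustration of the "Current Xval" method, the sketch below generates index splits for 5x5 cross-validation, read here as 5 repeats of 5-fold cross-validation (an assumption about what "5x5" means). The function name `five_by_five_splits` is hypothetical, and the split is not stratified by class, which the real experiment may well do differently.

```python
import random
from typing import Iterator, List, Tuple


def five_by_five_splits(n_rows: int, seed: int = 0,
                        repeats: int = 5, folds: int = 5
                        ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeats x folds cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        # Reshuffle the row order once per repeat.
        order = list(range(n_rows))
        rng.shuffle(order)
        for k in range(folds):
            # Every folds-th row, starting at offset k, forms the test fold.
            test = order[k::folds]
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test


splits = list(five_by_five_splits(23))
print(len(splits))  # 25 train/test pairs per dataset version
```

Within each repeat every row lands in exactly one test fold, so each treatment is scored 25 times per dataset version.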
Scikit-Learn's Random Forest was used as the sole learner and parameter tuning was conducted within the following parameter space:
params = {
    'n_estimators': ['values', 3, 5, 10],
    'criterion': ['values', "gini", "entropy"],
    'max_features': ['values', 'sqrt', 'log2', None],
    'max_depth': ['values', None, 4, 8],
    'min_samples_split': ['values', 4, 8],
    'min_samples_leaf': ['values', 2, 4],
    'bootstrap': ['values', True, False],
}
default_params = {
    'n_estimators': 10,
    'criterion': 'gini',
    'max_features': "sqrt",
    'max_depth': None,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'bootstrap': True,
}
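To get a feel for the size of this search space, here is a minimal sketch that expands it into individual configurations. It assumes the leading 'values' tag simply means "try each listed value"; the `expand_grid` helper is hypothetical and is not the tuner actually used in the experiment.

```python
from itertools import product

# Same search space as above; the leading 'values' tag is kept as written.
params = {
    'n_estimators': ['values', 3, 5, 10],
    'criterion': ['values', "gini", "entropy"],
    'max_features': ['values', 'sqrt', 'log2', None],
    'max_depth': ['values', None, 4, 8],
    'min_samples_split': ['values', 4, 8],
    'min_samples_leaf': ['values', 2, 4],
    'bootstrap': ['values', True, False],
}


def expand_grid(space):
    """Yield one dict per parameter combination, skipping the 'values' tag."""
    names = list(space)
    choices = [space[name][1:] for name in names]
    for combo in product(*choices):
        yield dict(zip(names, combo))


grid = list(expand_grid(params))
print(len(grid))  # 432 = 3 * 2 * 3 * 3 * 2 * 2 * 2
```

Note that the default values for min_samples_split (2) and min_samples_leaf (1) are not in the search space, so if the tuner only visits these grid points, the default configuration is never one of the tuned candidates.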
All nine combinations of tuning strategy and evaluation method were tried on every non-0th-version dataset from the usual group. (Ant, Camel, Ivy, Jedit, Log4J, Lucene, Synapse, Velocity, Xalan, Xerces)
datasets*usable versions=26
Runtime~= 22hrs
Scott-Knott Rank: pD,pF AUC
1: 0.99 defaults only -- current xval
1: 0.99 best on current -- current xval
1: 0.98 best on prev -- current xval
0: 0.74 defaults only -- prev to current full
0: 0.73 best on current -- prev to current full
0: 0.73 best on prev -- prev to current full
0: 0.72 best on prev -- prev to current xval
0: 0.72 defaults only -- prev to current xval
0: 0.70 best on current -- prev to current xval
Update 7/28/14
Three styles of parameter tuning and three styles of train->test setup were compared. They are defined as follows:
Defaults Only- no tuning occurs, only the default parameters are used
Best on Prev- parameters are tuned for best performance on previous version's data
Best on Current- parameters are tuned by peeking at this version's data (must know current version's class)
Current Xval- current version split into test/train groups for 5x5 cross-validation
Prev to Current Xval- like above, but training on previous version and testing on the current
Prev to Current Full- Entire previous set is used for training, entire current set used for testing
All nine combinations were tried on every non-0th-version dataset from the usual group. (Ant, Camel, Ivy, Jedit, Log4J, Lucene, Synapse, Velocity, Xalan, Xerces)
datasets*usable versions=26
Pd/Pf plots with each dot representing an individual dataset version:
and if we rank all 9 treatments by pd/pf AUC using Scott Knott:
1: 0.76 defaults only -- current xval
1: 0.75 best on current -- current xval
1: 0.74 best on prev -- current xval
0: 0.63 defaults only -- prev to current full
0: 0.63 best on current -- prev to current full
0: 0.62 best on current -- prev to current xval
0: 0.61 best on prev -- prev to current full
0: 0.61 defaults only -- prev to current xval
0: 0.60 best on prev -- prev to current xval
We find, unsurprisingly, that results look better when both the train and test data come from the same dataset. Other than that, nothing else appears to matter too much.
Update 6/16/14
Outline of how I see these topics being presented in reference to the Shepperd, Bowes, Hall and Hall et al. results. (to make sure we're on the same page)
- Parameter Tuning Style
- parameter tuning practices fall into the researcher group "basket" of concepts and prior knowledge mentioned by Shepperd, Bowes, and Hall
- Perhaps parameter tuning can explain some of the variance in literature
- Train -> Test Style
- Much consideration and discussion of error in results focuses on sound experimental design. One design element often suspect is the style of segregating training and testing data.
- This was not examined in Shepperd, Bowes, Hall or Hall et al. but if this truly has a large effect on results, perhaps it could explain some of the variance.
- History Inclusion
- This doesn't really fit. Perhaps we should drop it?
- Learn by Cluster
- What if some authors are only using a subset of data that performs the best?
- Comparing the performance of clusters should tell us if this is even worth considering
- (Spoiler alert: It's probably not)
- Other Things I think may be appropriate:
- Since ~70% of the studies in meta-studies used NASA or Eclipse datasets, I probably should pick those up as well and make them the primary focus
- I should probably also include the Matthews Correlation Coefficient for comparison even if we keep pD, pF and pD/pF-AUC as our primary means of comparison.
- I should probably also include more learners eventually.
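For reference, the measures mentioned in the list above can all be computed from one confusion matrix. The sketch below uses the standard definitions (pD is recall, pF is the false alarm rate); the function name `pd_pf_mcc` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def pd_pf_mcc(tp: int, fp: int, tn: int, fn: int):
    """Return (pD, pF, MCC) from confusion-matrix counts."""
    # pD: fraction of defective modules that were flagged.
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    # pF: fraction of clean modules that were wrongly flagged.
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    # MCC: correlation between predicted and actual classes, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return pd, pf, mcc


print(pd_pf_mcc(tp=40, fp=10, tn=40, fn=10))  # (0.8, 0.2, 0.6)
```

Unlike pD and pF taken separately, MCC uses all four cells at once, which may make it a useful single-number cross-check on imbalanced defect data.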
Original Post:
Things with which to experiment:
- Parameter Tuning Style
- Default parameters only
- Global best tuning
- best parameters from x-val on previous versions
- best parameters from x-val on current versions
- Train->Test Style
- current->current (standard x-val or leave-one-out)
- prev->current full
- prev->current with subsampling
- History Included
- No historical deltas
- Historical deltas to previous version
- Historical deltas to all (or n) previous versions
- Learn by Cluster
- No
- Yes
- Shepperd, Hall, Bowes results
- I've seen authorship in Shepperd's ppt, but not in the Hall, Bowes paper
- Is there a paper to go with the embarrassing result?
- Previous->Current :: Train->Test
- The same as incremental learning or not quite?
- More papers?