Tuesday, October 26, 2010

3Way-LOO Rank Changes + 3Bands Appearances

Above is the plot of 3-Way vs. LOO
The appearances of methods in the 3 bands are here.

Tuesday, October 12, 2010

IDEA - Stained Glass Visualization

Defect Sets:

Boxes with color indicate a cluster. Clusters are self-tested and ranked based on harmonic mean for the TRUE defect class. Green indicates the cluster performed in the top 25%. Yellow indicates the middle 50%, and red indicates the bottom 25%.
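The coloring rule above can be sketched in a few lines. This is only an illustration, assuming each cluster already has a single score; the function name `color_clusters` and the `higher_is_better` flag are hypothetical and not taken from the IDEA code.

```python
def color_clusters(scores, higher_is_better=True):
    """Map each cluster name to 'green', 'yellow' or 'red' by rank.

    scores: dict of cluster name -> score (assumed already computed).
    higher_is_better: True for scores such as a harmonic mean,
    False for error measures where smaller is better.
    """
    # Best cluster first.
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ranked)
    colors = {}
    for position, name in enumerate(ranked):
        fraction = position / n  # 0.0 for the best-ranked cluster
        if fraction < 0.25:
            colors[name] = "green"   # top 25%
        elif fraction < 0.75:
            colors[name] = "yellow"  # middle 50%
        else:
            colors[name] = "red"     # bottom 25%
    return colors
```

How ties and cluster counts that are not divisible by four should be handled is a design choice; this sketch simply ranks by position.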

2,109 examples
Run time: 12.5 seconds

10,880 examples
Run time: 2 minutes

Effort sets:

Effort set performance is based on self-test MDMRE. Clusters are colored in the same manner as above.
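For reference, a minimal sketch of the measure, assuming MDMRE here means the median magnitude of relative error, with each relative error taken as |actual - predicted| / actual. The function name is hypothetical.

```python
from statistics import median


def mdmre(actual, predicted):
    """Median magnitude of relative error over paired effort values.

    Assumes every actual value is non-zero.
    """
    mres = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return median(mres)
```

Lower values are better, so a cluster ranking based on this measure would sort in ascending order.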

499 examples
2.6 seconds

Datasets, charts for more datasets, and time results are available in WISP.

Stability and Bias-Variance of LOO vs. 3Way

Stability across datasets: For each evaluation method, over 19 datasets, we count how many times solos appear in the top 16. The numbers and plots for all methods are here.

Bias & Variance of LOO and 3-Way: Let L be the squared loss function, f be a predictor, and y be the predicted value for a particular x. Also let y* be the optimal prediction (the actual response for x), and ym be the main prediction. Under the squared loss function, the main prediction becomes the mean of all predictions: ym = mean(all y's). Then the following definitions follow:
  1. Bias(x) = L(y*, ym)
  2. Var(x) = ED(L(ym, y)) where D is the occurrence of x in all training sets.
The above definitions are valid for single instances; however, they can be averaged over all instances:
  1. Biasavg = Ex(Bias(x))
  2. Varavg = Ex(Var(x))
Following these definitions, we get the bias (x-axis) vs. variance (y-axis) figures here.
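The definitions above translate fairly directly into code. The sketch below assumes that, for each instance x, we have its actual response y* and the list of predictions y made for it across training sets; the function names are hypothetical.

```python
from statistics import mean


def bias_variance(actual, predictions):
    """Return (Bias(x), Var(x)) for one instance under squared loss.

    actual: y*, the actual response for x.
    predictions: the y values predicted for x from different training sets.
    """
    ym = mean(predictions)                         # main prediction
    bias = (actual - ym) ** 2                      # L(y*, ym)
    var = mean([(ym - y) ** 2 for y in predictions])  # E_D(L(ym, y))
    return bias, var


def average_bias_variance(instances):
    """Average bias and variance over (actual, predictions) pairs."""
    pairs = [bias_variance(actual, preds) for actual, preds in instances]
    bias_avg = mean([b for b, _ in pairs])
    var_avg = mean([v for _, v in pairs])
    return bias_avg, var_avg
```

Each evaluation method would then contribute one (Biasavg, Varavg) point per dataset to a bias vs. variance plot.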

Tuesday, October 5, 2010

Do Reductions with Total Defects Select Smaller Data?

Mann-Whitney says no.

Stability of Combination Methods


IDEA - Progress

List completed so far:

  • Flipping the X-axis so that the dense area of points is at the West pole (essential for logging the X-axis).
  • Logging of X and Y coordinates.
  • Equal width and equal frequency grid creation for m x n divisions.
  • Separating data points into quadrants based on the axes. (Not recursive yet.)
  • Coloring the plot where data exists.

Next step: color in interesting* neighboring quadrants.
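The two grid styles in the list can be sketched as follows. This is an illustration only, not the IDEA implementation; all function names are hypothetical, and the rule that a point on the top edge belongs to the last bin is an assumption.

```python
def equal_width_edges(values, bins):
    """Bin edges that split the value range into equally wide bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


def equal_frequency_edges(values, bins):
    """Bin edges chosen so each bin holds roughly the same number of points."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(i * n) // bins] for i in range(bins)]
    edges.append(ordered[-1])
    return edges


def cell_index(value, edges):
    """Index of the bin containing value; the top edge goes to the last bin."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def occupied_cells(points, x_edges, y_edges):
    """Set of (column, row) grid cells that contain at least one point."""
    return {(cell_index(x, x_edges), cell_index(y, y_edges)) for x, y in points}
```

For an m x n grid, the edges would be computed once per axis (m bins on X, n bins on Y), and only the cells returned by `occupied_cells` would be colored.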

Quadrants w/points, N=10

Quadrants w/points, N=4

The images below do not show the quadrant highlighting shown above.

jm1 defect dataset

jm1 defect dataset with log(x)

kc1 defect dataset

kc1 defect dataset with log(x)

Linear v. Dynamic Search in Molecular Structures

In the field of cheminformatics, a common task is navigating through a database of molecules that could have thousands of entries.

One task is deciding how to store the molecules themselves. A common approach in cheminformatics is the "SMILES" string. A SMILES string stores the atoms involved, along with special characters that indicate structural properties of the molecule. Previous tests in the Lewis Research Group have involved RDX adsorption (RDX is an explosive compound). The Lewis Research Group uses .xyz as its default file format, but an open file conversion system called OpenBabel exists. The SMILES string for RDX is provided below, along with a picture for comparison.
RDX Molecule
SMILES strings can represent the same structure in different ways, so databases use a special form known as Canonical SMILES to prevent duplicate entries. An advantage of SMILES strings is that they can be broken into comparable substructures.

There are clever indexing schemes for quickly searching structural properties of molecules. One is substructure keys, in which binary flags about structural properties are stored for each molecule. Another is using a hash table encoding that serves as a proximity filter.
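As a rough sketch of the substructure key idea, each molecule can carry one integer whose bits record structural properties, and a query is screened with a bitwise test before any expensive matching. The feature list below is a made-up example, not a real key set, and the function names are hypothetical.

```python
# Hypothetical structural flags; a real key set would be much larger.
FEATURES = ["has_ring", "has_nitro_group", "has_nitrogen", "has_oxygen"]


def make_key(present):
    """Pack the set of present feature names into an integer bit key."""
    key = 0
    for bit, name in enumerate(FEATURES):
        if name in present:
            key |= 1 << bit
    return key


def may_contain(molecule_key, query_key):
    """True if every flag set in the query is also set in the molecule."""
    return (molecule_key & query_key) == query_key


def screen(database_keys, query_key):
    """Names of molecules that pass the key screen for the query."""
    return [name for name, key in database_keys.items()
            if may_contain(key, query_key)]
```

The screen only rules molecules out; anything that passes would still need a full structural comparison.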

There is also the area of molecular similarity and molecular diversity. This field seeks to find similar molecules by noting differences in derived attributes. For example, it is computationally easy to compute the molecular weight of a molecule. This attribute, and others, can be used as a search term to find similar molecules, or to look for correlations with more complex attributes such as molecular adsorption in repeating lattice structures.
Other features can also be noted and collected, providing a linear database for searching.

Overall, the question is whether to use structural similarity analysis (dynamic) or molecular diversity measures (linear) for the system.
The answer, at the moment, appears to be to use both, for two reasons.
First, we are currently not sure what variables correlate with performance in the desired chemical property of adsorption in a lattice structure, so having more features provides a greater possibility for correlation and more accurate estimates. The linear search could also be used to select candidates for dynamic search, using a hybridized preprocessing approach.
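A minimal sketch of such a hybrid, under stated assumptions: each database entry has a precomputed "weight" attribute for the linear pass and a set of structural "features" for the dynamic pass, and the dynamic comparison is stood in for by a Tanimoto coefficient over those feature sets. The entry layout and all names are hypothetical.

```python
def linear_candidates(database, target_weight, tolerance):
    """Linear pass: keep entries whose weight is close to the target."""
    return [name for name, entry in database.items()
            if abs(entry["weight"] - target_weight) <= tolerance]


def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hybrid_search(database, query, tolerance, top=3):
    """Filter by weight first, then rank the survivors by structure."""
    names = linear_candidates(database, query["weight"], tolerance)
    scored = [(tanimoto(database[n]["features"], query["features"]), n)
              for n in names]
    scored.sort(reverse=True)  # most similar first
    return [name for _, name in scored[:top]]
```

The cheap attribute filter keeps the more expensive structural comparison from running over the whole database.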

The second is that the system has the advantage of being domain-specific and thus having access to domain-specific algorithms and methods. The database in question will have many properties which need to be handled in relation to their environment. For example, it is likely that there will be missing or blank features for entries in the database. The algorithm might be able to call a function to compute missing entries, which is a highly domain-specific solution not appropriate for general algorithms.
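One way this could look, as a sketch only: a lookup that falls back to a domain-specific calculator when a feature is blank. Molecular weight is used as the example because it can be derived from atom counts; the entry layout, the `CALCULATORS` table, and the function names are all hypothetical.

```python
# Standard atomic weights for the elements used in this example.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(entry):
    """Compute molecular weight from the entry's atom counts."""
    return sum(ATOMIC_WEIGHT[element] * count
               for element, count in entry["atoms"].items())


# Hypothetical registry: feature name -> function that can compute it.
CALCULATORS = {"weight": molecular_weight}


def get_feature(entry, name):
    """Return a stored feature, computing and caching it if it is missing."""
    value = entry.get(name)
    if value is None and name in CALCULATORS:
        value = CALCULATORS[name](entry)
        entry[name] = value  # cache so it is only computed once
    return value
```

Features with no registered calculator simply stay missing, so a search would still need a policy for blanks it cannot fill.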