Tuesday, June 26, 2012

Jared To do

To do:

Easiest

show perpendicular split in clustering graph (purely cosmetic)
Min and max on 1st pass
analyze non-numerical values
Dominance between east-west heuristics
implement splitting while y is decreasing
SA (simulated annealing) on clusters
read in csv data files
get results on real data
get pareto frontier graphs
Code DE (differential evolution)
Maybe try standard genetic algorithms
create baseline or benchmark to test against


Hardest

Thursday, June 21, 2012

Jared Update


  • Updated Models
  • Plotting Models instead of random points
  • Normalized the table data to the range 0 to 1 in all dimensions

  • Q's dealing with simulated annealing on clusters

Tuesday, June 19, 2012

Update on Recursive Clustering


What's different in the code:
  • Cluster class: reorganized the code; it now captures the points in each cluster (a sketch of the idea is below)
  • changed to handle multiple dimensions (hopefully)
  • a few more shortcuts in the code
  • clusters until root n (initial tests of "while y decreases" turned out badly)
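
Roughly what the reorganized splitter might look like (a minimal sketch, not the actual code; class and method names are my own): each Cluster keeps its points, splits at the median of its widest dimension so any number of dimensions works, and recurses until a cluster holds no more than about √n points.

import math

class Cluster:
    """Sketch of the reorganized splitter: keeps its points and its children."""
    def __init__(self, points):
        self.points = points              # each point is a list of numbers
        self.children = []

    def split(self, min_size):
        if len(self.points) <= min_size:
            return                        # small enough: leaf cluster
        # split on whichever dimension has the widest spread (any dimension count)
        dims = range(len(self.points[0]))
        d = max(dims, key=lambda i: max(p[i] for p in self.points)
                                    - min(p[i] for p in self.points))
        ordered = sorted(self.points, key=lambda p: p[d])
        mid = len(ordered) // 2           # median point goes to the upper half
        self.children = [Cluster(ordered[:mid]), Cluster(ordered[mid:])]
        for child in self.children:
            child.split(min_size)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# e.g. stop once clusters are down to about sqrt(n) points:
# root = Cluster(points); root.split(max(1, math.isqrt(len(points))))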



Cluster:
[96, 5]
[88, 17]
[93, 24]
[63, 49]
[93, 52]
[97, 54]
Cluster: 
[71, 55]
[71, 57]
[93, 64]
[88, 68]
[92, 74]
Cluster:
[94, 2]
[83, 14]
[78, 14]
[62, 21]
[62, 34]
[59, 39]
Cluster:
[51, 34]
[51, 35]
[37, 28]
[30, 10]
[19, 26]
Cluster:
[18, 95]
[18, 85]
[32, 84]
[34, 64]
[41, 71]
[45, 55]
Cluster:
[61, 76]
[63, 75]
[70, 82]
[82, 88]
[82, 83]
Cluster:
[1, 20]
[2, 33]
[0, 37]
[2, 39]
[5, 41]
Cluster:
[19, 46]
[9, 49]
[7, 57]
[1, 61]
[1, 63]


Friday, June 15, 2012

Association Learning with FP-Tree/FP-Growth

Slides introducing the FP-Growth algorithm for learning from frequent patterns.
Another worked example.

Atrazine Target Rough Results



Rough tree on Atrazine Target (final selection round)
Created using the FP-Tree prefix-tree algorithm. Not complete.
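
For reference, a minimal sketch of how such a prefix tree gets built (my own reconstruction for illustration, not the code behind the tree above): count word frequencies, drop infrequent words, then insert each record with its words ordered by global frequency so shared prefixes collapse into shared branches.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, word, parent=None):
        self.word = word                 # None for the root
        self.count = 0                   # records sharing this prefix
        self.parent = parent
        self.children = {}               # word -> FPNode

def build_fp_tree(records, min_support=2):
    # 1. Count how many records each word appears in; keep only frequent words.
    counts = Counter(w for r in records for w in set(r))
    frequent = {w for w, c in counts.items() if c >= min_support}

    root, header = FPNode(None), defaultdict(list)   # header links nodes per word
    # 2. Insert each record with its words sorted by global frequency,
    #    so common prefixes share branches of the tree.
    for r in records:
        words = sorted((w for w in set(r) if w in frequent),
                       key=lambda w: (-counts[w], w))
        node = root
        for w in words:
            child = node.children.get(w)
            if child is None:
                child = FPNode(w, parent=node)
                node.children[w] = child
                header[w].append(child)
            child.count += 1
            node = child
    return root, header

# e.g. build_fp_tree([["ACGT", "TTAG", "GGCA"], ["ACGT", "TTAG"], ["ACGT", "GGCA"]])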

DNA Words, Atrazine Target Trends
Words that either increase or decrease during selection rounds.

Thursday, June 14, 2012

Recursive Clustering Algorithm


Example 1

Example 2

Things to Note:
  • The thickest blue line is the first east-west split; lines get thinner as they split recursively
  • Implemented on 50 random points between (0, 0) and (100, 100) (sketched at the end of this entry)
  • Set to 3 recursive splits (8 clusters of 6 or 7 points)
Things to Do:
  • ASCII tree of clusters
  • Docs for program
  • Implement w/ Models (should be very easy)
  • Optimize for more than two dimensions
  • Make decisions on most interesting clusters (genetic algorithm, differential evolution, or simulated annealing)
  • Implement possible heuristics? 
Q's:
  • Keep the median point? As of right now it is assigned to the upper half
  • What cluster size to aim for? Or how many times to recurse?
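
A small sketch of the splitter as described above (a hypothetical reconstruction, not the plotted code): 50 random points, the first split east-west on the median x, axes alternating after that, three levels deep, which gives the 8 clusters of 6 or 7 points.

import random

def split(points, depth, axis=0):
    """Recursively split at the median of the current axis, alternating axes."""
    if depth == 0 or len(points) < 2:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2              # the median point joins the upper half
    nxt = 1 - axis                       # alternate east-west / north-south
    return split(ordered[:mid], depth - 1, nxt) + split(ordered[mid:], depth - 1, nxt)

points = [(random.randint(0, 100), random.randint(0, 100)) for _ in range(50)]
for cluster in split(points, depth=3):   # 2^3 = 8 clusters of 6 or 7 points
    print("Cluster:")
    for p in cluster:
        print(list(p))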

Wednesday, June 13, 2012

Test Post

Jared Test Post

Tuesday, June 12, 2012


Hypothesis 1: The top 25% of words from the loops FSS will best define aptamers.
i.e., informed FSS is better than random.
Corollary: Random subsets in general will not do well.

1a: Aptamers are not homogeneous and will therefore contain multiple groupings.

Hypothesis 2: Aptamers will fall into clusters. Some, but not all, non-aptamers will be outliers and cause the error rate to rise. The highest classification error rate will occur in the earliest rounds.

-----------------------------------------------------------------------------------------------------------------------------------
Paper 1: CREATE BENCHMARK; establish that the chosen FSS and clusters are accurate for aptamers

Find Representative FSS and CLUSTERS for Aptamers
FIND best FSS and clusters for Full Aptamer DB
LABEL data with best clustering for that subset
TEST with classification accuracy of classifiers
RANK the sets on lowest classification error
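
A rough, purely illustrative sketch of that FIND/LABEL/TEST/RANK loop (scikit-learn names are a stand-in for whatever tooling is actually used; the feature subsets, cluster count, and data are assumptions):

from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def rank_subsets(X, subsets, n_clusters=8):
    """Rank candidate feature subsets by how well their clustering can be re-learned."""
    results = []
    for name, cols in subsets.items():
        Xs = X[:, cols]                                                     # FIND: candidate subset
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Xs)   # LABEL with clustering
        acc = cross_val_score(GaussianNB(), Xs, labels, cv=5).mean()        # TEST classification accuracy
        results.append((1.0 - acc, name))                                   # classification error
    return sorted(results)                                                  # RANK: lowest error first

# e.g. rank_subsets(word_counts, {"top25_loops": top_cols, "random": rand_cols})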


P2 APPLY REPRESENTATIVE FSS AND CLUSTERS TO ROUNDS DATA
Apply the FSSs and centroids to the rounds data as a way to discriminate between rounds
Use accuracy or the number of additional clusters as a measurement of non-aptamers (a sketch of one reading follows)
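
One reading of that measurement, sketched below (my interpretation; the centroid array, feature matrix, and radius threshold are all assumptions):

import numpy as np

def assign_round(centroids, X_round, radius):
    """Assign each round instance to its nearest aptamer centroid; instances
    farther than `radius` from every centroid are flagged as possible non-aptamers."""
    dists = np.linalg.norm(X_round[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                    # nearest existing cluster
    outliers = (dists.min(axis=1) > radius).sum()    # would-be "additional clusters"
    return labels, outliers

# e.g. labels, n_out = assign_round(aptamer_centroids, round3_features, radius=0.5)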


issues/questions
HOW TO EVAL FSS/CLUSTERS?
How many clusters? Or just the best one?

HOW DO THE OUTLYING INSTANCES INTEGRATE WITH EXISTING CLUSTERS? New cluster? Error?


----------------------------------------------------------------------------------------------------------------------------------

KNOW

The aptamer DB does not easily split into a tree. It is hard to cluster, and J48 is not good for classifying it.
NAIVE BAYES: excellent classification; JRip on its heels with 7-20 rules, in the 80-90% range.

nb > jrip > j48 > decision table >> oneR > zeroR

INFERENCE: The aptamer DB follows a predictable distribution. Which one?
IMPLICATION: Use a probabilistic clusterer? Try to determine the distribution and use goodness of fit for the rounds?
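
One way the probabilistic-clusterer idea could be tried (this is my assumption about what that would mean; Weka's EM would be the in-toolbox choice, and sklearn's GaussianMixture below is only a stand-in): fit mixtures of increasing size, pick one by BIC, then score each round by its average log-likelihood as a crude goodness-of-fit check.

from sklearn.mixture import GaussianMixture

def fit_mixture(X, max_components=10):
    """Fit Gaussian mixtures of increasing size and keep the one with the best BIC."""
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, max_components + 1)]
    return min(models, key=lambda m: m.bic(X))

def round_fit(model, X_round):
    """Average log-likelihood of one selection round under the aptamer model."""
    return model.score(X_round)

# e.g. model = fit_mixture(aptamer_features)
#      fits = {name: round_fit(model, X) for name, X in rounds.items()}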


Clusters of 10-ish or more on the rounds data start to identify areas that are pure.
Both are highly represented in one big cluster.
Need a better discriminator.

The full Ellington DB will not cluster.
MUST use FSS.
Reversed strings do differ from their forward counterparts.

Association rules are mostly for one cluster, the one that is all zeros.
So, in its current configuration: crap.
Would the rules generated from the loops DB be valuable?

----------------------------------------------------------------------------------------------------------------------------------

HAVE
Constructed multiple smart and random subsets and have stats on them.

Don't have the rig completed to test all FSS, cluster, and classify combinations
- wanted to explore to make sure this could work before running the big experiment

Read research papers:
Word occurrence
Using association rules for clustering

DB: Full Ellington DNA Aptamer
Loop DB subset
   Two sets of rounds data with 4-6 rounds of 30-ish instances

Monday, June 11, 2012

Active Learning on Defect Paper

I ran all the data sets through Weka's Bayes and Random Forest learners. The g-values produced by Weka are very close to those given by my implementation, but I kept Weka's g-values. The figures are placed together in an under-construction version of the paper.
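
A small sanity check for comparing the two sets of g-values could look like the sketch below (the definition used here, the harmonic mean of recall and 1 - pf, is common in defect-prediction work but is an assumption; the post does not spell out how g is computed).

def g_value(tp, fp, tn, fn):
    """Assumed g: harmonic mean of recall (pd) and 1 - false-alarm rate (pf)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    denom = recall + (1 - pf)
    return 2 * recall * (1 - pf) / denom if denom else 0.0

# e.g. compare g_value(...) from my counts against the value derived from
# Weka's confusion matrix for the same learner and data set.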