Tuesday, June 12, 2012


Hypothesis 1: The top 25% of words from the loops FS will best define aptamers.
i.e., informed FSS is better than random.
Corollary: Random subsets in general will not do well.

 1a: Aptamers are not homogeneous and will therefore contain multiple groupings.
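
A minimal sketch of how Hypothesis 1 could be set up, assuming "words" are fixed-length substrings (k-mers) and "top 25%" means the most frequent quarter of the distinct words. Both are assumptions, as are the function names and the toy loop sequences, which are made up for illustration.

```python
import random
from collections import Counter

def kmers(seq, k):
    """Return all overlapping substrings ("words") of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def word_counts(seqs, k):
    """Count every word of length k across all sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))
    return counts

def top_quartile_words(seqs, k):
    """Informed subset: the most frequent 25% of distinct words."""
    counts = word_counts(seqs, k)
    n_keep = max(1, len(counts) // 4)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:n_keep]

def random_words(seqs, k, n, seed=0):
    """Baseline subset: n words drawn at random from the same vocabulary."""
    vocab = sorted(word_counts(seqs, k))
    return random.Random(seed).sample(vocab, n)

# Made-up loop sequences, for illustration only.
loops = ["ACGTAC", "CGTACG", "GTACGT"]
informed = top_quartile_words(loops, 2)
baseline = random_words(loops, 2, len(informed))
```

Both subsets have the same size, so any difference in downstream clustering or classification error comes from which words were chosen, not how many.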

Hypothesis 2: Aptamers will fall into clusters. Some, but not all, non-aptamers will be outliers and cause the error rate to rise. The highest classification error rate will occur in the earliest rounds.

-----------------------------------------------------------------------------------------------------------------------------------
Paper 1: CREATE BENCHMARK, establish that the chosen FSS and clusters are accurate for aptamers

Find Representative FSS and CLUSTERS for Aptamers
FIND best FSS and clusters for Full Aptamer DB
LABEL data with best clustering for that subset
TEST with classification accuracy of classifiers
RANK the sets on lowest classification error
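
The FIND / LABEL / TEST / RANK steps above could be wired together roughly like this. It is a sketch only: the clusterer is a toy stand-in passed in as `cluster_fn`, and leave-one-out 1-nearest-neighbour stands in for the real classifiers. All names and data here are hypothetical.

```python
def project(rows, cols):
    """Keep only the feature columns belonging to one subset."""
    return [[r[c] for c in cols] for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_1nn_error(points, labels):
    """Leave-one-out error of a 1-nearest-neighbour classifier."""
    wrong = 0
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = min(others, key=lambda j: sq_dist(p, points[j]))
        wrong += labels[nearest] != labels[i]
    return wrong / len(points)

def rank_subsets(rows, subsets, cluster_fn):
    """LABEL each subset with its clustering, TEST it, RANK by lowest error."""
    scored = []
    for name, cols in subsets.items():
        pts = project(rows, cols)
        labels = cluster_fn(pts)
        scored.append((loo_1nn_error(pts, labels), name))
    return sorted(scored)

def split_on_first_axis(points):
    """Toy clusterer: two clusters, split at the mean of the first feature."""
    mean = sum(p[0] for p in points) / len(points)
    return [int(p[0] > mean) for p in points]

# Made-up data: columns 0-1 separate cleanly, column 2 is evenly spread.
rows = [[0, 0, 1], [0, 1, 4], [1, 0, 2], [9, 9, 5], [9, 8, 3], [8, 9, 6]]
subsets = {"informed": [0, 1], "random": [2]}
ranking = rank_subsets(rows, subsets, split_on_first_axis)
```

A subset whose clusters are well separated is easy to classify, so it ranks first; a subset with no real structure picks up errors near the cluster boundary.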


Paper 2: APPLY REPRESENTATIVE FSS AND CLUSTERS TO ROUNDS DATA
Apply the FSSs and centroids to rounds data as a way to discriminate between rounds
Use accuracy or number of additional clusters as a measurement of non-aptamers
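
One possible way to apply fixed centroids to the rounds data: assign each instance to its nearest centroid and count the instances that sit too far from every centroid. The distance threshold `max_dist`, the centroids, and the per-round points are all invented for this sketch. The fraction of far instances is only one candidate measurement, alongside the accuracy and extra-cluster counts mentioned above.

```python
import math

def nearest_centroid(point, centroids):
    """Return (index, distance) of the closest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=dists.__getitem__)
    return best, dists[best]

def outlier_fraction(points, centroids, max_dist):
    """Share of instances farther than max_dist from every centroid."""
    far = sum(1 for p in points if nearest_centroid(p, centroids)[1] > max_dist)
    return far / len(points)

# Hypothetical centroids learned from the aptamer DB.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Made-up rounds data: earlier rounds contain more points far from any centroid.
rounds = {
    1: [(5.0, 5.0), (0.0, 1.0), (4.0, 6.0), (10.0, 9.0)],
    2: [(0.0, 1.0), (1.0, 0.0), (10.0, 9.0), (5.0, 5.0)],
    3: [(0.0, 1.0), (1.0, 0.0), (10.0, 9.0), (9.0, 10.0)],
}
per_round = {r: outlier_fraction(pts, centroids, max_dist=2.0)
             for r, pts in rounds.items()}
```

If Hypothesis 2 holds, this fraction should fall from early rounds to late rounds.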


issues/questions
HOW TO EVAL FSS/CLUSTERS?
how many clusters? or best one?

HOW DO THE OUTLYING INSTANCES INTEGRATE WITH EXISTING CLUSTERS? new cluster? error?


----------------------------------------------------------------------------------------------------------------------------------

KNOW

Aptamer DB does not split easily into a tree. Hard to cluster, and J48 is not good for classifying it.
NAIVE BAYES EXCELLENT CLASSIFICATION, JRip on its heels with 7-20 rules. 80-90s %

nb > jrip > j48 > decision table >> oneR > zeroR

INFERENCE: Aptamer DB follows a predictable distribution. Which one?
IMPLICATION: Use a probabilistic clusterer? Try to determine the distribution and use goodness of fit for rounds?
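
One way the goodness-of-fit idea might look, assuming the comparison is made on word counts and using the Pearson chi-square statistic. That choice is an assumption (the notes do not name a test), and the counts below are made up.

```python
def chi_square_stat(observed, reference):
    """Pearson chi-square statistic of observed counts vs. reference proportions.

    Only words present in the reference are compared; every reference count
    must be greater than zero.
    """
    total_obs = sum(observed.get(w, 0) for w in reference)
    total_ref = sum(reference.values())
    stat = 0.0
    for word, ref_count in reference.items():
        expected = total_obs * ref_count / total_ref
        stat += (observed.get(word, 0) - expected) ** 2 / expected
    return stat

# Made-up word counts: a reference from the aptamer DB and two rounds.
reference = {"AC": 50, "CG": 30, "GT": 20}
early_round = {"AC": 20, "CG": 40, "GT": 40}
late_round = {"AC": 48, "CG": 32, "GT": 20}

early_stat = chi_square_stat(early_round, reference)
late_stat = chi_square_stat(late_round, reference)
```

A smaller statistic means the round's word distribution sits closer to the reference, so the statistic could serve as a per-round score.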


Clusters of 10ish+ on Rounds data start to identify areas that are pure.
Both are highly represented in one big cluster.
Need better discriminator.

Full Ellington DB will not cluster
MUST FSS
Reversed strings do differ from their forward counterparts
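
A small illustration of why a reversed string gives different features, assuming word features are k-mer counts and "reversed" means plain string reversal (not reverse complement). The sequence is made up.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping words of length k in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

forward = "ACGGT"       # made-up sequence
reverse = forward[::-1]  # "TGGCA"

fwd_words = kmer_counts(forward, 2)
rev_words = kmer_counts(reverse, 2)
shared = set(fwd_words) & set(rev_words)
```

Only palindromic words such as "GG" survive reversal, so the two feature vectors mostly disagree.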

Association rules are mostly for one cluster, the one that is all zeros.
So, in its current configuration - crap
Would the rules generated from the loops DB be valuable?

----------------------------------------------------------------------------------------------------------------------------------

HAVE
Constructed multiple smart and random subsets
and have stats on them

Don't have rig completed to test all fss, clusters, classify
- wanted to explore to make sure this could work before running big experiment

Read Research papers
Word occurrence
Using association rules for clustering

DB: Full Ellington DNA Aptamer
Loop DB subset
Two sets of Rounds data with 4-6 rounds of 30-ish instances
