Monday, February 24, 2014

Scoring JPL

Ideas implemented:

  1. Changed the split criterion to gini and checked the results. Result: minor changes to the CART tree.
  2. Train on the N points nearest to the centroid. What should N be? I plan to try a range of values and compare the results.
  3. Prune leaves whose sd is greater than 0.3 of the root's sd. --> Very different clusters. Result: 3 clusters eliminated.
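
Ideas 2 and 3 above can be sketched roughly as follows. This is a minimal sketch, not the code used for these results: it assumes Euclidean distance to the centroid, reads "0.3 of root" as 0.3 times the sd of the dependent variable at the root, and the function names and the dict layout for leaves are made up for illustration.

```python
import math
import statistics


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [statistics.fmean(col) for col in zip(*rows)]


def nearest_to_centroid(rows, n):
    """Return the n rows closest to the centroid (Euclidean distance, an assumption)."""
    c = centroid(rows)
    return sorted(rows, key=lambda r: math.dist(r, c))[:n]


def prune_leaves(leaves, root_values, ratio=0.3):
    """Keep only leaves whose sd is at most ratio * sd of the root.

    leaves: hypothetical dict mapping a leaf name to its dependent-variable values.
    root_values: dependent-variable values of all rows at the root.
    """
    limit = ratio * statistics.pstdev(root_values)
    return {name: vals for name, vals in leaves.items()
            if statistics.pstdev(vals) <= limit}
```

Trying a range of N then comes down to calling `nearest_to_centroid(rows, n)` for each candidate n and retraining on the returned rows.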

Ideas to be implemented:
  1. Join two clusters that are not statistically significantly different. Problem: this needs major changes to the structure; I am trying to figure out a smart way to do it.

Scoring:

For every cluster:
  • Compare it with all other clusters.
  • It is said to be "better" than the other when:
    • The two clusters are statistically significantly different according to both tests.
    • Its centroid is better than the other's (higher or lower, depending on the dependent variable).
  • It is said to be "similar" when:
    • A12 is run first; if it returns similar, the clusters are similar.
    • If A12 returns different, a bootstrap test is run to check that result.
    • The clusters are similar if both tests return similar.
  • It is said to be "worse" when:
    • The two clusters are different according to both tests.
    • Its centroid is not better than the other's.
  • If a cluster is better than at least one other cluster and worse than none, its score is incremented.
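
The scoring rules above can be sketched as below. This is a sketch under several assumptions, not the code behind these results: each cluster is reduced to a list of values of one dependent variable, "centroid is better" means a lower mean, the A12 threshold of 0.6 is a guess, and the bootstrap is a simplified pooled-resampling test on the difference in means. A pair where A12 says different but the bootstrap says similar is treated as similar.

```python
import random
import statistics


def a12(xs, ys):
    """Vargha-Delaney A12: chance that a value drawn from xs exceeds one from ys."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))


def a12_different(xs, ys, small=0.6):
    """Different if the effect size exceeds `small` (threshold is an assumption)."""
    a = a12(xs, ys)
    return max(a, 1 - a) > small


def bootstrap_different(xs, ys, rounds=1000, alpha=0.05, seed=1):
    """Simplified bootstrap: resample from the pooled data and count how often
    the resampled difference in means is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    extreme = 0
    for _ in range(rounds):
        bx = rng.choices(pooled, k=len(xs))
        by = rng.choices(pooled, k=len(ys))
        if abs(statistics.fmean(bx) - statistics.fmean(by)) >= observed:
            extreme += 1
    return extreme / rounds < alpha


def compare(xs, ys, lower_is_better=True):
    """Return 'better', 'worse' or 'similar' for cluster xs against cluster ys."""
    # A12 runs first; the bootstrap only runs when A12 reports a difference.
    if not (a12_different(xs, ys) and bootstrap_different(xs, ys)):
        return "similar"
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    x_wins = mx < my if lower_is_better else mx > my
    return "better" if x_wins else "worse"


def score_clusters(clusters):
    """clusters: hypothetical dict mapping a cluster name to its values."""
    scores = {name: 0 for name in clusters}
    for name, xs in clusters.items():
        results = [compare(xs, ys)
                   for other, ys in clusters.items() if other != name]
        # Increment when better than at least one cluster and worse than none.
        if "better" in results and "worse" not in results:
            scores[name] += 1
    return scores
```

With several dependent variables, `compare` would need a rule for combining them; that choice is left open here.
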
Scores of each cluster when training on the nearest n = 20 to the centroid:

Scores when nearest n is not used:

To do:
As the structures for each branch are already done, categorize the clusters by score into "good" and "bad". For each "bad" cluster, find the nearest branch in CART that holds a "good" cluster and take the difference between the two sets of conditions.

Run these conditions through xomo to generate a new dataset and compare the results.
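
One way the branch comparison in the to-do above might look. This is only a sketch under assumptions of my own: a branch is a root-to-leaf list of `(feature, operator, threshold)` conditions, "nearest" means the good branch sharing the longest leading run of conditions with the bad one, and the difference is the set of conditions on the good branch that the bad branch lacks. All function names are hypothetical, and the feature names in the usage example are placeholders.

```python
def shared_prefix(a, b):
    """Number of leading conditions two root-to-leaf paths have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def nearest_good_branch(bad_path, good_paths):
    """Good branch sharing the longest prefix with bad_path (first one wins ties)."""
    return max(good_paths, key=lambda p: shared_prefix(bad_path, p))


def difference(bad_path, good_path):
    """Conditions on the good branch that do not appear on the bad branch."""
    return [c for c in good_path if c not in bad_path]


# Usage with placeholder conditions.
bad = [("kloc", "<=", 50), ("pcap", ">", 3)]
goods = [
    [("kloc", ">", 50)],
    [("kloc", "<=", 50), ("pcap", "<=", 3), ("acap", ">", 2)],
]
target = nearest_good_branch(bad, goods)
changes = difference(bad, target)
```

The resulting `changes` list is what would be handed to xomo as constraints; how xomo takes them is not covered here.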

Ideas for "good" and "bad":
Sort the clusters by score, call the top n^0.5 "good", and call the rest "bad".
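
A small sketch of this split, assuming n is the number of clusters and that n^0.5 is rounded to the nearest integer (with at least one "good" cluster). The function name and the dict of scores are hypothetical.

```python
import math


def split_good_bad(scores):
    """Split cluster names into (good, bad): the top sqrt(n) by score are good.

    scores: dict mapping a cluster name to its score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(math.sqrt(len(ranked))))  # rounding rule is an assumption
    return ranked[:k], ranked[k:]
```

Ties keep their original order because `sorted` is stable, so a tie at the cut-off is broken arbitrarily by insertion order.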
