Monday, June 24, 2013

Vasil: What I do

Master's student in Computer Science since 2011, expected to graduate in August 2013.

Research scope: Data reduction. Condense large training data into succinct summaries that let users review and analyze the raw data.

Algorithm: PEEKING2 - A tool for data carving.

Implementation:  
  1. Feature Selection via Information Gain.
    Prune irrelevant features by keeping only the 25% of features with the highest information gain.
  2. FASTMAP.
    Project the data onto the direction of greatest variability using a PCA-like, linear-time projection.
  3. Grid-clustering.
    Recursively split clusters at the median of each projected dimension.
  4. Centroid estimation.
    Replace each data cluster with its centroid.
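
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PEEKING2 source: features are assumed to be already discretized for the information gain step, a single FASTMAP axis is recomputed inside each cluster (a simplification of splitting on every projected dimension), and the names top_features, fastmap_1d, median_split, centroid, and the min_size stopping parameter are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of one discrete feature column (assumed pre-discretized)."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def top_features(rows, labels, keep=0.25):
    """Step 1: indices of the top `keep` fraction of features by information gain."""
    n_feats = len(rows[0])
    gains = [info_gain([r[j] for r in rows], labels) for j in range(n_feats)]
    k = max(1, round(keep * n_feats))
    return sorted(sorted(range(n_feats), key=lambda j: -gains[j])[:k])

def dist(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fastmap_1d(points):
    """Step 2: one FASTMAP axis. Pick two distant pivots, project with the cosine rule."""
    start = points[0]
    east = max(points, key=lambda p: dist(start, p))
    west = max(points, key=lambda p: dist(east, p))
    c = dist(east, west)
    if c == 0:  # all points coincide
        return [0.0] * len(points)
    return [(dist(p, east) ** 2 + c ** 2 - dist(p, west) ** 2) / (2 * c)
            for p in points]

def median_split(points, min_size=4):
    """Step 3 (simplified): recursively halve a cluster at the median of its projection.

    Assumption: stop once a split would leave fewer than `min_size` points per half.
    """
    if len(points) < 2 * min_size:
        return [points]
    xs = fastmap_1d(points)
    order = sorted(range(len(points)), key=lambda i: xs[i])
    mid = len(order) // 2
    left = [points[i] for i in order[:mid]]
    right = [points[i] for i in order[mid:]]
    return median_split(left, min_size) + median_split(right, min_size)

def centroid(cluster):
    """Step 4: replace a cluster with the per-feature mean of its members."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

# Usage on made-up numeric data: 16 rows are condensed into a handful of centroids.
data = [(float(i), float(i % 3)) for i in range(16)]
centroids = [centroid(c) for c in median_split(data, min_size=4)]
```

Each pivot search and each projection touches every point once, which is where the linear-time behaviour of the projection step comes from.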

Inference:
  1. Instance-based learning (k=2 nearest neighbor).
    Extrapolate between centroids to make predictions.
  2. Contrast set rule learning.
    Generate rules that estimate the deltas between centroids.
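
Both inference steps can be illustrated with a short Python sketch. The details here are assumptions made for illustration: the prediction combines the targets of the two nearest centroids by inverse-distance weighting (one plausible reading of "extrapolate between centroids" for numeric targets such as effort), and the "delta" between two centroids is taken to be the per-feature difference. The names predict_2nn, centroid_deltas, and the tol threshold are hypothetical.

```python
import math

def predict_2nn(query, centroids):
    """Predict a numeric target from the two nearest centroids.

    `centroids` is a list of (features, target) pairs. The inverse-distance
    weighting is an assumed combination rule, not taken from PEEKING2.
    """
    ranked = sorted(centroids, key=lambda c: math.dist(query, c[0]))
    (f1, t1), (f2, t2) = ranked[:2]
    d1, d2 = math.dist(query, f1), math.dist(query, f2)
    if d1 == 0:  # query sits exactly on the nearest centroid
        return t1
    w1, w2 = 1 / d1, 1 / d2
    return (w1 * t1 + w2 * t2) / (w1 + w2)

def centroid_deltas(src, dst, names, tol=0.0):
    """Per-feature changes that separate centroid `src` from centroid `dst`.

    Features whose absolute difference is at most `tol` are left out.
    """
    return {n: b - a for n, a, b in zip(names, src, dst) if abs(b - a) > tol}

# Usage on made-up centroids: (features, target) pairs.
cents = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0), ((10.0, 10.0), 100.0)]
estimate = predict_2nn((1.0, 0.0), cents)
changes = centroid_deltas((1, 2, 3), (1, 5, 2), ["a", "b", "c"])
```

For class targets such as defect labels, the weighted average would need to be replaced by a vote between the two neighbors.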

Experiments:
  • PEEKING2 applied on 10 defect data sets and 10 effort data sets from PROMISE.
  • Large data reduction is observed: the original data is reduced by 93%.
  • Little information is lost. In most cases, k=2 NN applied to the condensed data performed as well as or better than other state-of-the-art algorithms applied to the full data.

Additional research:
Applied data mining techniques to reduce the cost of data collection for a public health study conducted by WVU.

  • Applying correlation-based feature selection, we could drastically reduce the number of features without significantly impacting the performance of linear regression.
  • We have also observed some degree of stability across different geographical regions.
  • Small samples of stores (20%) can be used to make predictions for the rest.
  • Samples are selected to minimize the travel distance between stores.
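
As a rough illustration of the correlation-based feature selection mentioned above, here is a minimal Python sketch of the standard CFS merit heuristic: a subset scores highly when its features correlate with the target but not with each other. The assumptions are that Pearson correlation is used as the correlation measure and that the search is a simple greedy forward selection; the study's actual settings are not stated here, and the names pearson, cfs_merit, and cfs_forward are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def cfs_merit(subset, columns, target):
    """CFS merit: k * mean|r_cf| / sqrt(k + k * (k - 1) * mean|r_ff|)."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(columns, target):
    """Greedy forward search (an assumed search strategy): add features while merit improves."""
    chosen, best = [], 0.0
    remaining = set(range(len(columns)))
    while remaining:
        score, j = max((cfs_merit(chosen + [j], columns, target), j) for j in remaining)
        if score <= best:
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen

# Usage on made-up data: columns 0 and 1 are redundant copies of the signal,
# column 2 is uncorrelated with the target.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [[1.0, 2.0, 3.0, 4.0, 5.0],
           [2.0, 4.0, 6.0, 8.0, 10.0],
           [1.0, -1.0, 1.0, -1.0, 1.0]]
selected = cfs_forward(columns, target)
```

Because redundant features raise the denominator without raising the numerator, the search keeps only one of the two correlated columns in this example.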
