Wednesday, November 7, 2012

Novel Subspace Clustering

Novel Subspace Clustering algorithm that uses a top-down FP Growth approach for attribute filtering versus the typically agglomerative bottom-up Apriori paradigm.

It produces disjoint clusters of various dimensionality and does not suffer from the exhaustive subspace search problem of bottom-up approaches.  It finds the densest, correlated attributes and then searches the points for patterns (clusters).

As with most subspace and projected clustering algorithms, the clustering is done in a cyclical manner.  In this method, FP Growth is used to find an candidate attribute subset, then EM clustering is performed over the attributes. EM produces multiple clusters, which are tested by classification learners. Good clusters are labeled and removed from the data set, creating disjoint instance clusters.  The null and bad clusters remain in the data set for further cycles of clustering.  All attributes are available for the FP Growth step and may be repeated in later clusters.

This method requires several parameters, for FP Growth, EM clustering, minimum test and stopping criteria. I believe that it will be significantly less sensitive to the parameter values than current methods. I also believe that it will be more computationally efficient than existing techniques since it uses FP Growth to find candidate subspaces escaping a combinatorial search, and removes clustered instances reducing the overall data set size.

Literature surveys compare the performance of methods over varying data set  sizes.  Moise09 varies attribute size and instance size to create several data sets, but does not test against data sets with roughly the same attribute and instance size.  My method was designed for this case, named the n x m problem. Other subspace and projected clustering algorithms suffer as the attributes are increased; whereas my method is suited to scalability.

- Erin

No comments:

Post a Comment