## Monday, January 25, 2010

### Simple single pass k-means

Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl. 2, 1 (Jun. 2000), 51-57.  http://portal.acm.org/citation.cfm?id=360419

Model clusters as gaussians:
• Clusters are modeled by the  sum, sumSquared, and n of each feature.
• When clusters combine, their new stats are the sum of the of the old sums, the sum of the old symSquareds, and the n
• Reminder: when x is added to a Gaussian
• n++
• sum = sum + x
• sumSquared = sumSquared + x*x
• mean = sum/n
• sd=  sqrt((sumSquared-((sum*sum)/n))/(n-1))
• Do not add a point to a cluster if any of the its attributes falls outside of mean±1.96sd (for 95% confidence)
Algorithm:
1. Initialize k random clusters C1... Ck
2. Now we have old stats and new stats (initially, old stats is nil)
3. LOOP: Take 1% of the data and add it to a buffer.
4. For all remaining items in the buffer,  run k-means, moving the centroids (until they converge). If can add to a cluster (according to old confidence intervals), then add it (updating sum, sumSquared, and n of new stats).  Here, each cluster is treated as being a point at the mean values, weighted with the number of points in the that cluster.
5. When no more items can be added, then..
1. If any cluster has no items, seed it with the most distant data point
2. Now clear the buffer
3. Add the new stats to the old stats.
6. If there is no more new data, stop. Otherwise, go to LOOP
Runs nearly twice as fast as old k-means.  Here's the results from using various clustering algorithms on the KDD 1998 cup data (95412 rows, 481 fields). This algorithm is "N1". Normal batch k-means is "K".  "R1" is a random sampling k-means (that, btw, generated low quality clusters).

Note that clusters generated by this algorithms, working in increments of 1% of the data,  produces clusters of almost the same quality as multiple-pass K-means.

• Small enough memory that you could have multiple versions going, and pick the one with the lowest variances in the generated clusters.
• Very simple implementation.
• Satisfies the data mining desiderata:
1. Require one scan (or less) of the database if possible: a single data scan is considered costly, early termination if appropriate is highly desirable.
2. On-line “anytime” behavior: a “best” answer is always available, with status information on progress, expected remaining time, etc. provided
3. Suspendable, stoppable, resumable; incremental  progress saved to resume a stopped job.
4. Ability to incrementally incorporate additional data with existing models efficiently.
5. Work within confines of a given limited RAM buffer.
6. Utilize variety of possible scan modes: sequential,index, and sampling scans if available.
7. Ability to operate on forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).