Model clusters as gaussians:
- Clusters are modeled by the sum, sumSquared, and n of each feature.
- When clusters combine, their new stats are the sum of the of the old sums, the sum of the old symSquareds, and the n
- Reminder: when x is added to a Gaussian
- n++
- sum = sum + x
- sumSquared = sumSquared + x*x
- mean = sum/n
- sd= sqrt((sumSquared-((sum*sum)/n))/(n-1))
- Do not add a point to a cluster if any of the its attributes falls outside of mean±1.96sd (for 95% confidence)
- Initialize k random clusters C1... Ck
- Now we have old stats and new stats (initially, old stats is nil)
- LOOP: Take 1% of the data and add it to a buffer.
- For all remaining items in the buffer, run k-means, moving the centroids (until they converge). If can add to a cluster (according to old confidence intervals), then add it (updating sum, sumSquared, and n of new stats). Here, each cluster is treated as being a point at the mean values, weighted with the number of points in the that cluster.
- When no more items can be added, then..
- If any cluster has no items, seed it with the most distant data point
- Now clear the buffer
- Add the new stats to the old stats.
- If there is no more new data, stop. Otherwise, go to LOOP
Note that clusters generated by this algorithms, working in increments of 1% of the data, produces clusters of almost the same quality as multiple-pass K-means.
Other advantages:
- Small enough memory that you could have multiple versions going, and pick the one with the lowest variances in the generated clusters.
- Very simple implementation.
- Satisfies the data mining desiderata:
- Require one scan (or less) of the database if possible: a single data scan is considered costly, early termination if appropriate is highly desirable.
- On-line “anytime” behavior: a “best” answer is always available, with status information on progress, expected remaining time, etc. provided
- Suspendable, stoppable, resumable; incremental progress saved to resume a stopped job.
- Ability to incrementally incorporate additional data with existing models efficiently.
- Work within confines of a given limited RAM buffer.
- Utilize variety of possible scan modes: sequential,index, and sampling scans if available.
- Ability to operate on forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).
No comments:
Post a Comment