Tuesday, February 2, 2010

On Profiling RDR (Clump)...

The following are runtime statistics for Bryan's RDR implementation using the JM1 data set (~11,000 lines).

As you can easily see, NBAYES is the outlier, weighing in at a whopping 18.3 minutes. This tells us that we're spending far too much time classifying as we go down the tree. What's more, the data doesn't necessarily change from node to node; what changes is what we use to classify, which depends on our rules.

Now it's time for the proposal. Since we're using the same information for each classification, why not cache the frequency counts early on and just merge counts when RDR generates a new rule? With a scheme like that in place, the runtime for RDR should drop considerably, since the counts would no longer be rebuilt for every classification.
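
To make the idea concrete, here is a rough Python sketch of cached frequency counts that can be merged instead of recounted. It is not Bryan's code: the `FreqCounts` class, its methods, and the `count_rows` helper are hypothetical names made up for illustration, and it assumes the attributes have already been discretized so that plain frequency counts with add-one smoothing are enough for Naive Bayes.

```python
from collections import Counter
import math


class FreqCounts:
    """Hypothetical cache of Naive Bayes frequency counts for a set of rows."""

    def __init__(self):
        self.class_counts = Counter()    # label -> number of rows
        self.feature_counts = Counter()  # (label, column, value) -> number of rows
        self.values = {}                 # column -> distinct values seen

    def add(self, row, label):
        """Count one row. Assumes the row holds discrete attribute values."""
        self.class_counts[label] += 1
        for col, value in enumerate(row):
            self.feature_counts[(label, col, value)] += 1
            self.values.setdefault(col, set()).add(value)

    def merge(self, other):
        """Combine two caches without touching the underlying rows again."""
        merged = FreqCounts()
        merged.class_counts = self.class_counts + other.class_counts
        merged.feature_counts = self.feature_counts + other.feature_counts
        for col in set(self.values) | set(other.values):
            merged.values[col] = (self.values.get(col, set())
                                  | other.values.get(col, set()))
        return merged

    def classify(self, row):
        """Pick the most likely label using add-one (Laplace) smoothing."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for col, value in enumerate(row):
                k = len(self.values.get(col, ())) or 1
                seen = self.feature_counts[(label, col, value)]
                score += math.log((seen + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def count_rows(rows):
    """Build a cache from (row, label) pairs; done once per group of rows."""
    counts = FreqCounts()
    for row, label in rows:
        counts.add(row, label)
    return counts
```

The point of the sketch is that `merge` costs time proportional to the number of distinct counts rather than the number of rows, so a node could combine cached counts rather than rescanning the data each time it classifies. How the rows would be grouped per rule is left open here.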
