Tuesday, December 4, 2012

Workshop on Data Analysis Patterns in Software Engineering


DAPSE’13: International Workshop on Data Analysis Patterns in Software Engineering
The workshop has been accepted to the International Conference on Software Engineering. 
Data science is a skilled art with a steep learning curve. To shorten that learning curve, this workshop will collect best practices in the form of data analysis patterns, that is, analyses of data that lead to meaningful conclusions and can be reused for comparable data.
In the workshop we will compile a catalog of such patterns that will help both experienced and emerging data scientists communicate better about data analysis. The workshop is intended for anyone interested in how to analyze data correctly and efficiently in a community-accepted way.


For illustrative purposes, here’s an example pattern in short and simplified form. For the workshop, we expect the discussion to be more comprehensive. We do welcome both simple and complex analysis patterns.
Pattern name: Contrast

Determine if there is a difference in one or more properties between two populations.

1. Apply a hypothesis test (Student's t-test for parametric data, Mann-Whitney test for non-parametric data) to check if the property is statistically different between the populations.

2. Determine the magnitude of the difference, either through visualization (e.g., a boxplot) or, when appropriate, through the mean or median.

Either step without the other can be misleading. For large populations, tiny differences might be statistically significant. In contrast, for small populations, large differences might not be statistically significant.
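
To make the two steps concrete, here is a minimal sketch in Python. It is only an illustration, not the analysis from any particular paper: the function names (mann_whitney_u, contrast), the 0.05 threshold, and the sample numbers are assumptions made for this example. The test is implemented with the normal approximation (with tie correction, no continuity correction), which is reasonable for moderately sized samples; for very small samples an exact test from a statistics library would be the better choice.

```python
import math
from statistics import median


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for sample a, approximate p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])

    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u, p_value


def contrast(a, b, alpha=0.05):
    """Step 1: test for a difference. Step 2: report its magnitude."""
    u, p_value = mann_whitney_u(a, b)
    return {
        "u": u,
        "p_value": p_value,
        "significant": p_value < alpha,
        "median_difference": median(a) - median(b),
    }


# Made-up numbers, purely for illustration.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
result = contrast(group_a, group_b)
print(result)
```

Reporting the p-value and the median difference together is the point of the pattern: the first says whether a difference is likely to be real, the second says whether it is large enough to matter.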

Choosing the wrong hypothesis test is a common mistake.

For example, at ICSE 2009, Bird et al. used a Mann-Whitney test to compare the defect proneness (the property) between distributed and co-located binaries (the two populations). See Figure 5 in their paper for a sample visualization of the differences between the two populations.
