Defect Analysis (DA) explores a number of important indicators regarding the state of a team’s software and the effectiveness of its development process. At GripQA, we study the output of our defect analysis algorithm to identify anomalies that may represent issues requiring the project team’s attention. I’m particularly fond of this technique as it combines a methodology that is relatively easy to comprehend with results that correspond well to our empirical data.
Defect Analysis falls into a category of techniques that we refer to as Code Correlation Analysis (CCA). The fundamental premise of CCA is that the history of a project’s codebase is an extremely rich source of information about both what has happened with the project and why it happened. CCA works by taking some factor that can be extracted from the information in a project’s repository and associating it with pull requests (or commits), so that the factor can be analyzed in those terms. Some examples include defects, complexity, duplication and style guide violations. The association between the factor that we intend to study and the interactions with the repository is key to this methodology.
In order to make the most effective use of CCA techniques, we need both a target for the investigation (for DA, as we’ll be discussing in this post, we’re exploring the source of defects in the project) and a set of hypothetical causes for the issue that we’re investigating. The current implementation of DA concentrates on two potential sources of defects: the individual responsible for the code involved and the programming language used.
This discussion presents a high-level explanation of the algorithm we use for defect analysis and also explores some of the insight that we can gain from applying this technology. In order to be more accessible to a larger group of readers, we’ll gloss over some of the gory details of the algorithm. However, the source code will soon be joining our other open source offerings for those who are interested in the nitty-gritty.
Although DA is just one of the many technologies that comprise GripQA’s Software Development Intelligence suite, it is also one of the easiest to explain, and the results generally compare favorably with the observed data for the projects that we’ve analyzed.
The algorithm for Defect Analysis is fairly straightforward. First we identify all pull requests that incorporate bug fixes. Then we generate a list of the files changed by each commit in the selected pull requests and extract the number of lines changed for each file.
Lines per Pull Request
We store the results in a matrix, represented in this post by a table that would look something like “Lines Per Pull Request.”
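As a rough illustration of this first step, the sketch below totals the lines changed per file for each bug-fix pull request. The record format, the function name and all of the numbers are assumptions made up for this example; they are not taken from the GripQA implementation or from the table in this post.

```python
from collections import defaultdict

# Hypothetical input: one record per file touched by a commit in a bug-fix
# pull request, as (pull_request, file, lines_changed). Invented numbers.
changes = [
    ("PR1", "File A", 10),
    ("PR1", "File B", 30),
    ("PR1", "File A", 10),  # a second commit in the same pull request
    ("PR2", "File B", 5),
    ("PR2", "File C", 45),
]

def lines_per_pull_request(records):
    """Return {file: {pull_request: total lines changed}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for pull_request, file_name, lines in records:
        matrix[file_name][pull_request] += lines
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {name: dict(row) for name, row in matrix.items()}

lines_matrix = lines_per_pull_request(changes)
for file_name in sorted(lines_matrix):
    print(file_name, lines_matrix[file_name])
```

A file that a pull request never touched has no entry for that pull request, which corresponds to a zero in the matrix.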
Relative File Contribution
Next, we need a relative measure of each file’s contribution to the pull request. We’ll get this by converting the raw numbers of lines into ratios calculated as the number of changed lines in each file divided by the total number of lines in the pull request. The resulting table is shown in “Relative File Contribution.”
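The conversion to ratios amounts to dividing each entry by its column total. Here is a minimal sketch, assuming a small matrix of invented line counts with files as rows and pull requests as columns (the values are illustrative only, not the ones from the tables in this post):

```python
# Hypothetical lines-changed matrix: rows are files, columns are pull requests.
files = ["File A", "File B", "File C"]
lines = [
    [20, 0],   # File A: lines changed in PR 1, PR 2
    [30, 5],   # File B
    [0, 45],   # File C
]

def column_ratios(matrix):
    """Divide every entry by its column total so that each column sums to 1.

    Assumes every column has a non-zero total, which holds here because a
    bug-fix pull request changes at least one line.
    """
    totals = [sum(column) for column in zip(*matrix)]
    return [[value / total for value, total in zip(row, totals)]
            for row in matrix]

relative_file_contribution = column_ratios(lines)
for name, row in zip(files, relative_file_contribution):
    print(name, [round(value, 2) for value in row])
```

With these made-up numbers, File B accounts for 30 of the 50 lines changed in the first pull request, so its ratio there is 0.6.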
Now, we’re in a position to begin exploring factors that might be contributing to the defects addressed by our pull requests.
In order to simplify this example, we’ll assume that all pull requests were simultaneous. For our real analysis, we scan each file’s history for each pull request.
Do some team members contribute more than their share of defects?
Individual Contribution to Files
The individuals who wrote the code in the files that were included in a pull request might be one factor that contributes to the defect(s) addressed by the pull request. We can get a sense of this by extracting the list of individuals who changed the file along with the total number of lines that they added/changed over the life of the file. Note that we’re not filtering for only lines that were changed for a specific pull request, at this point. Instead, we’re counting every line attributable to each individual. We’ll take care of relative contribution to the defect in a later step. Again, this information is stored in a matrix, presented here in the “Individual Contribution to Files” table, and again the raw numbers of lines are converted to ratios as shown in “Individual Contribution to Files as a Ratio.”
Individual Contribution to Files as a Ratio
At this point we’re ready to understand how each individual’s efforts contributed to each of the pull requests. For those who remember their linear algebra, it might be obvious that we’re about to perform a matrix multiplication. We’ll designate the “Individual Contribution to Files as a Ratio” table as the operations matrix and the “Relative File Contribution” table as the input matrix ([person contribution to files] x [files to pull requests]). As our matrix dimensions are 4×3 (four people by three files) and 3×2 (three files by two pull requests), our pair of matrices is conformable for multiplication, and the result is shown in “Individual Contribution to Pull Requests.”
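For readers who would rather see the multiplication than recall it, here is a small sketch in plain Python. Every number is invented for illustration and is not taken from the tables in this post; “Bob” and “Carol” are hypothetical names added to fill out a four-person team.

```python
# Hypothetical 4x3 matrix: each person's share of each file (columns sum to 1).
people = ["Alice", "Ted", "Bob", "Carol"]
person_to_file = [
    [0.5, 0.4, 0.3],  # Alice's share of File A, File B, File C
    [0.2, 0.3, 0.1],  # Ted
    [0.2, 0.2, 0.2],  # Bob
    [0.1, 0.1, 0.4],  # Carol
]

# Hypothetical 3x2 matrix: each file's share of each pull request.
file_to_pull_request = [
    [0.4, 0.0],  # File A's share of PR 1, PR 2
    [0.6, 0.1],  # File B
    [0.0, 0.9],  # File C
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("matrices are not conformable for multiplication")
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

person_to_pull_request = matmul(person_to_file, file_to_pull_request)
for name, row in zip(people, person_to_pull_request):
    print(name, [round(value, 2) for value in row])
```

Each entry of the 4×2 result sums, over the three files, one person’s share of a file multiplied by that file’s share of the pull request.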
Individual Contribution to Pull Requests
With the information in “Individual Contribution to Pull Requests,” we have a sense of the “responsibility” of each individual for a given pull request. Another way to think about this is that the people who wrote the code in a file share responsibility for a defect in direct proportion to both how much code each individual wrote and how much code in the file was changed to address the defect. For the purposes of this analysis, we are not explicitly tracking whether, for example, “Ted” specifically wrote the lines of code that had to be changed to “fix” a given defect.
As the Pull Request columns each sum to 1 (accepting a bit of rounding error), we can pat ourselves on the back for performing the matrix multiplication correctly.
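The reason this check works can be written out in a line. The symbols are notation introduced here, not the post’s own: P is the person-to-file ratio matrix, F is the file-to-pull-request ratio matrix, and R = PF is their product. Every column of P and every column of F sums to 1, so for each pull request j:

```latex
\[
\sum_{i} R_{ij}
  = \sum_{i} \sum_{k} P_{ik} F_{kj}
  = \sum_{k} F_{kj} \Bigl( \sum_{i} P_{ik} \Bigr)
  = \sum_{k} F_{kj}
  = 1 .
\]
```

A column total that differs from 1 by more than rounding error therefore points to a mistake in one of the input ratios or in the multiplication itself.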
Now for the fun part: we can start thinking about what we could deduce from the data in “Individual Contribution to Pull Requests.” One might observe that Alice’s efforts contributed to both of the pull requests. Further, nearly half of the contribution to Pull Request 1 came from Alice. However, if we go back to the “Individual Contribution to Files as a Ratio” table, we’ll observe that Alice also had the largest code contribution. Since Alice is a major contributor to the files in question, we would naturally expect her to also share a corresponding responsibility for the pull requests. Clearly, we are well served to consider the information in the “Individual Contribution to Pull Requests” table in the context of each individual’s total code contribution.
Individual Code to Defect Contribution
We can get a rough idea of the expected individual contribution to defects if we do a bit more data manipulation. To generate the data shown in “Individual Code to Defect Contribution,” we start with the “Individual Contribution to Files” table, sum each individual’s contributions across all files and then calculate the ratio of each person’s contribution to the total team contribution. We put this number in the “Lines/Total” column of the “Individual Code to Defect Contribution” table. Finally, we average each individual’s contribution across the pull requests to generate the data in the “PR Avg” column.
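The two columns can be computed in a few lines. This sketch assumes invented line counts and pull request shares; the dictionary layout, the key names `lines_over_total` and `pr_avg`, and the names “Bob” and “Carol” are all hypothetical and chosen only for illustration.

```python
# Hypothetical raw line counts per person for File A, File B, File C.
line_counts = {
    "Alice": [500, 400, 300],
    "Ted":   [200, 300, 100],
    "Bob":   [200, 200, 200],
    "Carol": [100, 100, 400],
}

# Hypothetical share of PR 1 and PR 2 attributed to each person.
pull_request_share = {
    "Alice": [0.44, 0.31],
    "Ted":   [0.26, 0.12],
    "Bob":   [0.20, 0.20],
    "Carol": [0.10, 0.37],
}

team_total = sum(sum(counts) for counts in line_counts.values())

summary = {}
for person, counts in line_counts.items():
    shares = pull_request_share[person]
    summary[person] = {
        # "Lines/Total": this person's lines over the whole team's lines.
        "lines_over_total": sum(counts) / team_total,
        # "PR Avg": mean of this person's share across the pull requests.
        "pr_avg": sum(shares) / len(shares),
    }

for person, row in summary.items():
    print(person, round(row["lines_over_total"], 3), round(row["pr_avg"], 3))
```

Both columns sum to 1 across the team, which is what makes them directly comparable.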
Defect to Code Correlation
From the information shown in the “Individual Code to Defect Contribution” table and graphed in the “Defect to Code Correlation” chart, we can see that there is a reasonably good correlation between each team member’s contribution to the project and their contribution to the code that resulted in defects. There don’t appear to be any glaring anomalies that we can attribute to individual team members. We can probably conclude that this team is well balanced and that each individual’s defect contribution is roughly where we expect it to be.
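One simple way to make “where we expect it to be” concrete is to divide each person’s “PR Avg” by their “Lines/Total” and look for ratios well above 1. The sketch below is an assumption about how such a screen could work, not a description of GripQA’s algorithm; the threshold of 1.5, the numbers, and the names “Bob” and “Carol” are invented.

```python
# Hypothetical figures per person: (Lines/Total, PR Avg). Invented numbers.
summary = {
    "Alice": (0.40, 0.375),
    "Ted":   (0.20, 0.190),
    "Bob":   (0.20, 0.200),
    "Carol": (0.20, 0.235),
}

# Assumed cutoff: flag anyone whose defect share is more than 1.5 times
# their code share. A real analysis would need to tune this value.
RATIO_THRESHOLD = 1.5

def defect_to_code_ratios(rows):
    """Return {person: PR Avg / Lines/Total}, skipping people with no code."""
    return {name: pr_avg / lines_share
            for name, (lines_share, pr_avg) in rows.items()
            if lines_share > 0}

ratios = defect_to_code_ratios(summary)
flagged = sorted(name for name, ratio in ratios.items()
                 if ratio > RATIO_THRESHOLD)

for name, ratio in ratios.items():
    print(name, round(ratio, 2))
print("flagged:", flagged)
```

With these made-up figures every ratio stays close to 1 and nobody is flagged, which mirrors the “well balanced” conclusion above.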
Does the programming language used contribute to more defects?
Programming Language Used for Each File
Another factor that could contribute to defects is the programming language used. As is generally the case, each file contains code written in a single language, so we’ll use matrix values of 1 to mark a file written in a given language and 0 if the file was not written in the given language. Mapping this to Files A, B & C gives us the information in “Programming Language Used for Each File.”
Language Contribution to Pull Requests
Given the 1-to-1 relationship between programming languages and files, we don’t need to calculate any ratios. We can just go ahead and multiply the “Programming Language Used for Each File” matrix by the “Relative File Contribution” matrix ([language contribution to files] x [files to pull requests]) to see how our programming languages correlate to pull requests. The results are presented in the “Language Contribution to Pull Requests” table.
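The same multiplication works for languages, with a 0/1 matrix in place of the ratios. In the sketch below, the language names, the assignment of languages to files and the file-to-pull-request ratios are all assumptions invented for illustration.

```python
# Hypothetical 2x3 indicator matrix: 1 if the file is written in the language.
languages = ["Python", "Java"]
language_by_file = [
    [1, 1, 0],  # Python: File A and File B
    [0, 0, 1],  # Java: File C
]

# Hypothetical 3x2 matrix: each file's share of PR 1 and PR 2.
file_to_pull_request = [
    [0.4, 0.0],  # File A
    [0.6, 0.1],  # File B
    [0.0, 0.9],  # File C
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("matrices are not conformable for multiplication")
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

language_to_pull_request = matmul(language_by_file, file_to_pull_request)
for name, row in zip(languages, language_to_pull_request):
    print(name, [round(value, 2) for value in row])
```

Because each file belongs to exactly one language, every column of the indicator matrix sums to 1, so the columns of the result still sum to 1.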
Language Code to Defect Contribution
Defect to Code Ratios by Language
Once we can establish a correlation between a factor in a project’s codebase and an anomaly in the project’s defects, we can start exploring ways to address the issue. Some common measures to mitigate concerns around contributions from team members include pair programming, additional training, more frequent and more thorough code reviews, and greater focus on unit testing. When the issues trace back to the selection of programming languages, training, code reviews, additional testing and refactoring are among the possible remedies.
As suggested earlier, anything that can be directly associated with a pull request / commit can be used for Defect Analysis. The specific analyses discussed here are two of the explorations that we’ve found to be particularly useful. We’ll add others over time, and we can work with project teams to implement measurements that might be more illuminating for their unique situations.
Hopefully, after reading this post, you share my enthusiasm for Code Correlation Analysis in general and Defect Analysis in particular. Algorithms like these are powerful tools to help us move the field of Software Development Intelligence forward towards a time when we are fully embracing Data Driven Software Development.