Introduction

In his book Software Design X-Rays, Adam Tornhill presents a nice metric for finding out whether parts of your code are coupled through their joint changes: Temporal Coupling.

In this and the next blog posts, I’m playing around with Adam’s ideas (and more) to find hidden dependencies of code parts based on version control data.

In this part, we just want to spot co-changing files, that is, files that change within the same commit.

As almost always, we are using Python and pandas for this analysis.

Data

With the help of a small helper library, we extract the relevant log data from a Git repository. In this case, we are using a synthetic repository to make it easier to check that everything works as expected.

Here are all files and all the commits from the repository:

In [1]:
from lib.ozapfdis.git_tc import log_numstat
commits = log_numstat("../../synthetic_repo//")
commits
Out[1]:
    additions  deletions  file  sha      timestamp            author
1   1.0        0.0        b     a2abe69  2018-07-19 13:53:09  Markus Harrer
2   1.0        0.0        d     a2abe69  2018-07-19 13:53:09  Markus Harrer
4   1.0        0.0        a     f80a5af  2018-07-19 13:52:48  Markus Harrer
5   1.0        0.0        b     f80a5af  2018-07-19 13:52:48  Markus Harrer
7   0.0        0.0        e     fcf1498  2018-07-19 13:52:31  Markus Harrer
9   1.0        0.0        b     7e6d738  2018-07-19 13:52:10  Markus Harrer
10  1.0        0.0        d     7e6d738  2018-07-19 13:52:10  Markus Harrer
12  0.0        0.0        d     2b4d97d  2018-07-19 13:51:14  Markus Harrer
14  1.0        0.0        b     732ebbb  2018-07-19 10:51:03  Markus Harrer
15  1.0        0.0        c     732ebbb  2018-07-19 10:51:03  Markus Harrer
17  1.0        0.0        a     72f5268  2018-07-19 10:50:49  Markus Harrer
18  1.0        0.0        b     72f5268  2018-07-19 10:50:49  Markus Harrer
20  2.0        0.0        a     f3c99c6  2018-07-19 10:50:38  Markus Harrer
21  1.0        0.0        b     f3c99c6  2018-07-19 10:50:38  Markus Harrer
23  0.0        0.0        c     5d5fba5  2018-07-19 10:50:13  Markus Harrer
25  0.0        0.0        a     eb668d1  2018-07-19 10:49:16  Markus Harrer
26  0.0        0.0        b     eb668d1  2018-07-19 10:49:16  Markus Harrer

We see that some files often change together (like “a” and “b” or “b” and “d”) and some files only ever change alone (like “e”).

Let’s first get rid of all the unneeded columns by keeping just the columns that we really need for this analysis.

In [2]:
commits = commits[['file', 'sha']]
commits.head()
Out[2]:
    file  sha
1   b     a2abe69
2   d     a2abe69
4   a     f80a5af
5   b     f80a5af
7   e     fcf1498
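
If you don't have the helper library at hand, the reduced data can also be typed in by hand. The following is only a sketch: the name `commits_by_hand` is made up for this example, the values are copied from the log output above, and the index is a plain 0..16 range instead of the original index values.

```python
import pandas as pd

# (file, sha) pairs copied from the log output above
rows = [
    ("b", "a2abe69"), ("d", "a2abe69"),
    ("a", "f80a5af"), ("b", "f80a5af"),
    ("e", "fcf1498"),
    ("b", "7e6d738"), ("d", "7e6d738"),
    ("d", "2b4d97d"),
    ("b", "732ebbb"), ("c", "732ebbb"),
    ("a", "72f5268"), ("b", "72f5268"),
    ("a", "f3c99c6"), ("b", "f3c99c6"),
    ("c", "5d5fba5"),
    ("a", "eb668d1"), ("b", "eb668d1"),
]

# hypothetical stand-in for the DataFrame produced by the helper library
commits_by_hand = pd.DataFrame(rows, columns=["file", "sha"])
print(commits_by_hand.head())
```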

Idea

In this analysis, we need to create a relationship from each changed file to all other changed files within the same commit.

I tried different things there with various data transformations, but in the end, the following simple, straightforward approach worked best: We just assign to each file in a commit all files of the same commit and count the occurrences of these relationships.

This gives us the view on co-changing files that we want.
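
Before turning to pandas, the idea can be sketched in plain Python. This is only an illustration and not the approach used in the rest of the post; the dictionary `commit_files` is a hand-typed copy of the synthetic log, with each file name being a single letter.

```python
from collections import Counter
from itertools import permutations

# sha -> files changed in that commit (one letter per file), copied from the log
commit_files = {
    "a2abe69": "bd", "f80a5af": "ab", "fcf1498": "e", "7e6d738": "bd",
    "2b4d97d": "d", "732ebbb": "bc", "72f5268": "ab", "f3c99c6": "ab",
    "5d5fba5": "c", "eb668d1": "ab",
}

# count every ordered pair of different files that occur in the same commit
pair_counts = Counter()
for files in commit_files.values():
    pair_counts.update(permutations(sorted(files), 2))

for (file, file_other), count in sorted(pair_counts.items()):
    print(file, file_other, count)
```

Commits with only one file produce no pairs at all, which is why “e” never shows up in the counts.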

Analysis

To implement the idea above, we can use the pd.merge function of pandas to combine the commits DataFrame with itself. The key here is to use an outer join to expand each file in a commit (identified by its sha value) to all the files of that commit (again, identified by the values in sha).

In [3]:
import pandas as pd

commit_counts = pd.merge(
    commits,
    commits,
    left_on='sha',
    right_on='sha',
    suffixes=['','_other'],
    how='outer')
commit_counts.head()
Out[3]:
    file  sha      file_other
0   b     a2abe69  b
1   b     a2abe69  d
2   d     a2abe69  b
3   d     a2abe69  d
4   a     f80a5af  a

With this, for example, the most recent commit (= the first entries in the DataFrame) was expanded from

0   b   a2abe69
1   d   a2abe69

to

0   b   a2abe69     b
1   b   a2abe69     d
2   d   a2abe69     b
3   d   a2abe69     d

with the additional column of all files of the commit.
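
The expansion can be tried in isolation on this single commit. A minimal sketch, with `one_commit` and `expanded` as made-up names: a commit with n files turns into n × n rows, here 2 × 2 = 4.

```python
import pandas as pd

# the two rows of commit a2abe69 from above
one_commit = pd.DataFrame({"file": ["b", "d"], "sha": ["a2abe69", "a2abe69"]})

# joining the commit with itself pairs every file with every file of that commit
expanded = pd.merge(
    one_commit,
    one_commit,
    on="sha",
    suffixes=("", "_other"),
    how="outer")
print(expanded)
```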

Because we’re only interested in co-changing files, we can filter out all entries where a file is paired with itself.

In [4]:
commit_counts = commit_counts[commit_counts['file'] != commit_counts['file_other']]
commit_counts.head()
Out[4]:
    file  sha      file_other
1   b     a2abe69  d
2   d     a2abe69  b
5   a     f80a5af  b
6   b     f80a5af  a
10  b     7e6d738  d

We can then count how often each pair of files occurs in the same commit with the groupby command.

In [5]:
commit_coupling = commit_counts.groupby(['file', 'file_other']).count()
commit_coupling.head()
Out[5]:
                  sha
file  file_other
a     b           4
b     a           4
      c           1
      d           2
c     b           1

We also want to know the number of all changes for each file to get the degree of the overall coupling between co-changing files. For this, we can use the groupby command on the file index level together with the transform method to calculate the number of changes per file. Note that this sums up the co-change counts of a file, so commits that touch only a single file are not included.

In [6]:
commit_coupling['all_changes'] = commit_coupling.groupby(['file']).sha.transform('sum')
commit_coupling.head()
Out[6]:
                  sha  all_changes
file  file_other
a     b           4    4
b     a           4    7
      c           1    7
      d           2    7
c     b           1    1
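
The transform method returns a result that has the same index as the original rows, which is why the group sum can be assigned directly as a new column. A small sketch on the first four rows of the table above (`counts` is a made-up name):

```python
import pandas as pd

# first four rows of the co-change counts, typed in by hand
counts = pd.DataFrame(
    {"sha": [4, 4, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("a", "b"), ("b", "a"), ("b", "c"), ("b", "d")],
        names=["file", "file_other"]))

# every row gets the sum of its "file" group: 4 for "a" and 4 + 1 + 2 = 7 for "b"
counts["all_changes"] = counts.groupby("file")["sha"].transform("sum")
print(counts)
```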

Next, we calculate the ratio of the co-changes of each file pair to the number of all co-changes of the first file. A high ratio is an indicator for pairs of files that change together very often.

In [7]:
commit_coupling['ratio'] = commit_coupling['sha'] / commit_coupling['all_changes']
commit_coupling
Out[7]:
                  sha  all_changes  ratio
file  file_other
a     b           4    4            1.000000
b     a           4    7            0.571429
      c           1    7            0.142857
      d           2    7            0.285714
c     b           1    1            1.000000
d     b           2    2            1.000000
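
The denominator all_changes counts only co-changes, so a commit that touches a single file (for example 2b4d97d, which changes only “d”) doesn't lower the ratio. If you would rather divide by the number of all commits of a file, one possible variant looks like the sketch below. It is an alternative to the calculation above, not part of it, and all names in it (`commit_files`, `demo_commits`, `coupling`, `commits_per_file`) are made up for this example.

```python
import pandas as pd

# sha -> files changed in that commit (one letter per file), copied from the log
commit_files = {
    "a2abe69": "bd", "f80a5af": "ab", "fcf1498": "e", "7e6d738": "bd",
    "2b4d97d": "d", "732ebbb": "bc", "72f5268": "ab", "f3c99c6": "ab",
    "5d5fba5": "c", "eb668d1": "ab",
}
demo_commits = pd.DataFrame(
    [(file, sha) for sha, files in commit_files.items() for file in files],
    columns=["file", "sha"])

# pair each file with the other files of the same commit
pairs = pd.merge(demo_commits, demo_commits, on="sha", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]
coupling = (
    pairs.groupby(["file", "file_other"]).size()
    .rename("co_changing").reset_index())

# denominator: number of commits per file, including single-file commits
commits_per_file = demo_commits.groupby("file")["sha"].nunique()
coupling["commits"] = coupling["file"].map(commits_per_file)
coupling["ratio"] = coupling["co_changing"] / coupling["commits"]
print(coupling.sort_values("ratio", ascending=False))
```

With this denominator, “a” still changes with “b” in 4 of 4 commits, while “d” changes with “b” in only 2 of its 3 commits and “c” in 1 of its 2 commits.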

As the last step, we do some housekeeping on the data to get a nicely sorted list of co-changing files.

In [8]:
# sort by ratio (and file name), highest coupling first
coupling_list = commit_coupling.reset_index().sort_values(
    by=['ratio', 'file'], ascending=False)
# give the count column a more meaningful name
coupling_list = coupling_list.rename(columns={"sha" : "co-changing"})
coupling_list = coupling_list.reset_index(drop=True)
coupling_list
Out[8]:
    file  file_other  co-changing  all_changes  ratio
0   d     b           2            2            1.000000
1   c     b           1            1            1.000000
2   a     b           4            4            1.000000
3   b     a           4            7            0.571429
4   b     d           2            7            0.285714
5   b     c           1            7            0.142857

With this result, we can see, for example, in the first three rows that whenever the files “d”, “c” and “a” change together with another file, that file is always “b”.

In detail, you can read and interpret the table like this:

  • Row with index 0: In all co-changes of “d”, “b” was changed as well. This shows a high change dependency from the file “d” to the file “b”. In other words: If “d” changes together with another file, that file is “b”. Keep in mind that commits that change only “d” (like 2b4d97d) are not part of this ratio.
  • Row with index 3: If “b” was changed, “a” was changed as well in 4 out of 7 cases (= commits). Together with the row with index 2, we can see that “a” always changes with “b”, but “b” doesn’t always change with “a”.
  • Row with index 5: If “b” was changed, “c” was changed in one case. This shows a slight (or even negligible) dependency from “b” to “c” (and maybe even from “c” to “b”, because a single commit could also be a coincidence).

If you are more into graphs, here are all the change relationships between the files with their ratio measure:

Note: The file “e” doesn’t occur in the table because it is always changed completely independently.
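
One way to draw such a graph yourself is to write the pairs as Graphviz DOT text. The sketch below only builds the text; rendering it requires Graphviz, which is an assumption about your setup. The function `to_dot` and the list `coupling_ratios` are made up for this example, and the values are copied from the result table above.

```python
# (file, file_other, ratio) triples copied from the result table above
coupling_ratios = [
    ("d", "b", 1.000000),
    ("c", "b", 1.000000),
    ("a", "b", 1.000000),
    ("b", "a", 0.571429),
    ("b", "d", 0.285714),
    ("b", "c", 0.142857),
]


def to_dot(edges):
    """Render the coupling pairs as a directed graph in Graphviz DOT syntax."""
    lines = ["digraph coupling {"]
    for source, target, ratio in edges:
        # one labeled edge per (file, file_other) pair
        lines.append(f'    "{source}" -> "{target}" [label="{ratio:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(coupling_ratios)
print(dot_text)
```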

Visualizations

We can also try to draw a diagram that suits the tiny amount of data that we have. In our case, we use a D3 chord diagram to explore the coupling of co-changing files interactively. Pandas can output the data in the JSON format that this D3 visualization expects.

In [9]:
coupling_list[['file','file_other','co-changing']].to_json(
    "chord_coupling_data.json", orient='values')

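With orient='values', pandas writes only the cell values as a list of rows, without column names or index. A small sketch with two hand-typed rows from the result table (`demo` and `json_text` are made-up names); called without a file path, to_json returns the JSON as a string:

```python
import json

import pandas as pd

# two rows of the result table, typed in by hand
demo = pd.DataFrame({
    "file": ["d", "b"],
    "file_other": ["b", "a"],
    "co-changing": [2, 4],
})

# without a path argument, to_json returns the JSON text instead of writing a file
json_text = demo.to_json(orient="values")
print(json_text)

# the text parses back into a plain list of rows
rows = json.loads(json_text)
```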
The chord diagram gives us a hint about the inherent coupling of files based on their co-changes.

You can find the interactive version of this visualization here.

Summary

We’ve seen that we can spot co-changing files in a straightforward way with the help of pandas.

Doing it step by step also allows us to refine the analysis gradually to fit our own circumstances. For example, we could define co-changing files as files that change not only within the same commit, but on the same day. We could also look for peaks of co-changing activity that could point us to chaotic changes in the code.
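
The same-day variant could look roughly like the sketch below. It uses a small made-up log (the files, shas and timestamps are hypothetical and not taken from the synthetic repository) and groups by calendar day instead of by sha.

```python
import pandas as pd

# hypothetical log: four single-file commits on two days
log = pd.DataFrame({
    "file": ["a", "b", "c", "a"],
    "sha": ["s1", "s2", "s3", "s4"],
    "timestamp": pd.to_datetime([
        "2018-07-18 09:00:00", "2018-07-18 17:30:00",
        "2018-07-19 10:00:00", "2018-07-19 11:00:00"]),
})

# cut off the time of day, then keep each file only once per day
log["day"] = log["timestamp"].dt.normalize()
daily = log[["file", "day"]].drop_duplicates()

# same approach as before, but joined on the day instead of the sha
pairs = pd.merge(daily, daily, on="day", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]
day_coupling = pairs.groupby(["file", "file_other"]).size()
print(day_coupling)
```

Dropping duplicates first keeps a file that is committed several times on one day from being counted more than once for that day.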

But for now, we leave it there.

You can find the notebook on GitHub.
