Introduction

There are many reasons to analyze the history of a version control system such as your Git repository. Adam Tornhill’s book “Your Code as a Crime Scene” and his upcoming book “Software Design X-Rays” offer plenty of inspiration.

You can

  • analyze knowledge islands
  • distinguish often changing code from stable code parts
  • identify code that is temporally coupled to other code

Having the necessary data for those analyses in a Pandas DataFrame lets you quickly gain insights into the evolution of your software system.

The idea

In a preceding blog post, I showed you a way to read a Git log file with Pandas’ DataFrame and GitPython. Looking back, this was really complicated and tedious. So, with a few tricks we can do it much better this time:

  • We use GitPython’s feature to directly access an underlying Git installation. This is much faster than using GitPython’s object representation of the commits and makes it possible to have everything we need in one notebook.
  • We read the Git output in memory via StringIO. This avoids storing the Git output on disk and reading it back from disk again, which makes this approach faster, too.
  • We also exploit Pandas’ read_csv method even more. This makes the transformation of the Git log into a DataFrame as easy as pie.

Exporting the Git repo’s history

The first step is to connect GitPython with the Git repo. If we have an instance of the repo, we can gain access to the underlying Git installation of the operating system via repo.git.

In our case, we tap the Spring PetClinic repo, a small sample application for the Spring framework (I also analyzed the huge Linux repo, which works as well).

In [1]:
import git 

GIT_REPO_PATH = r'../../spring-petclinic/'
repo = git.Repo(GIT_REPO_PATH)
git_bin = repo.git
git_bin
Out[1]:
<git.cmd.Git at 0x24a61ce8ee8>

With the git_bin, we can execute almost any Git command we like directly. In our hypothetical use case, we want to retrieve some information about the change frequency of files. For this, we need the complete history of the Git repo including statistics for the changed files (via --numstat).

We use a little trick to make sure that the format of the file statistics fits nicely with the commit’s metadata (SHA %h, UNIX timestamp %at and author’s name %aN). The --numstat option lists the additions, the deletions and the name of each affected file on one line, separated by the tab character \t:

1\t1\tsome/file/name.ext

We use the same tab separator \t for the format string:

%h\t%at\t%aN

And here is the trick: we prepend as many tabs as a file statistics line contains (two) plus one additional tab to the format string. This pretends that each commit meta data string is preceded by three empty file statistics fields.

The result looks like this:

\t\t\t%h\t%at\t%aN
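
As a minimal sketch, the format string can also be assembled programmatically to make the column arithmetic explicit. The constant names and the sample commit values below are hypothetical; only the placeholders %h, %at and %aN and the file statistics line come from the text above.

```python
# Hypothetical names for illustration; only the placeholders come from the text.
NUMSTAT_FIELDS = ["additions", "deletions", "filename"]
COMMIT_PLACEHOLDERS = ["%h", "%at", "%aN"]

# One leading tab per numstat field, followed by the tab-separated placeholders.
pretty_format = "\t" * len(NUMSTAT_FIELDS) + "\t".join(COMMIT_PLACEHOLDERS)

# A numstat line as shown above and a made-up commit line in the new format.
numstat_line = "1\t1\tsome/file/name.ext"
commit_line = "\t\t\tabc1234\t1500000000\tJane Doe"

numstat_cells = numstat_line.split("\t")
commit_cells = commit_line.split("\t")

print(repr(pretty_format))
print(numstat_cells)
print(commit_cells)
```

The commit line splits into six cells whose first three are empty, so it lines up with the three cells of a file statistics line once both are read with the same six-column layout.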

Note: If you want to export the Git log on the command line into a file, you need to use the horizontal tab placeholder %x09 as separator instead of \t in the format string. Otherwise, the trick doesn’t work (I’ll show the corresponding format string at the end of this article).
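
The command-line variant of the format string can be derived from the Python one. This is only a sketch with hypothetical variable names; it relies on Git’s pretty-format placeholder %xNN, which prints the byte with the given hexadecimal code, and on the tab character being byte 0x09.

```python
python_format = "\t\t\t%h\t%at\t%aN"

# Git's pretty format prints the byte with hex code NN for "%xNN"; a tab is 0x09.
tab_placeholder = "%x{:02X}".format(ord("\t"))

shell_format = python_format.replace("\t", tab_placeholder)
print(shell_format)  # %x09%x09%x09%h%x09%at%x09%aN
```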

OK, let’s execute the Git log export:

In [2]:
# pass the command as a list so that it also runs without a shell on non-Windows systems
git_log = git_bin.execute(['git', 'log', '--numstat', '--pretty=format:\t\t\t%h\t%at\t%aN'])
git_log[:80]
Out[2]:
'\t\t\t101c9dc\t1498817227\tDave Syer\n2\t3\tpom.xml\n\n\t\t\tffa967c\t1492026060\tAntoine Rey\n1'

Reading the Git log

We have now read the complete file history into the git_log variable. Don’t let all the \t characters confuse you.

Let’s read the result into a Pandas DataFrame by using the read_csv method. Because we can’t provide a file path to CSV data, we have to use StringIO to read in our in-memory buffered content.

Pandas reads the first line of the tab-separated “file”, sees the tab-separated columns and parses all other lines with the same column layout. Additionally, we set header to None because we don’t have a header row, and we provide nice names for all the columns that we read in.

In [3]:
import pandas as pd
from io import StringIO

commits_raw = pd.read_csv(StringIO(git_log), 
    sep="\t",
    header=None,              
    names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author']
    )
commits_raw.head()
Out[3]:
   additions  deletions  filename   sha      timestamp     author
0  NaN        NaN        NaN        101c9dc  1.498817e+09  Dave Syer
1  2          3          pom.xml    NaN      NaN           NaN
2  NaN        NaN        NaN        ffa967c  1.492026e+09  Antoine Rey
3  1          1          readme.md  NaN      NaN           NaN
4  NaN        NaN        NaN        fd1c742  1.488785e+09  Antoine Rey

Now we have two different kinds of content for the rows:

  • The commit meta data without file statistics (see rows with the indexes 0, 2 and 4 above)
  • The file statistics without the commit meta data (see rows with the indexes 1 and 3 above)

But we are interested in the commit meta data for each file’s statistic. For this, we forward fill the empty commit meta data entries of the file statistics rows with the preceding commit’s meta data via the DataFrame’s ffill method (equivalent to fillna(method='ffill') in older Pandas versions) and join this data with the existing columns of the file statistics.

In [4]:
commits = commits_raw[['additions', 'deletions', 'filename']]\
            .join(commits_raw[['sha', 'timestamp', 'author']].ffill())
commits.head()
Out[4]:
   additions  deletions  filename   sha      timestamp     author
0  NaN        NaN        NaN        101c9dc  1.498817e+09  Dave Syer
1  2          3          pom.xml    101c9dc  1.498817e+09  Dave Syer
2  NaN        NaN        NaN        ffa967c  1.492026e+09  Antoine Rey
3  1          1          readme.md  ffa967c  1.492026e+09  Antoine Rey
4  NaN        NaN        NaN        fd1c742  1.488785e+09  Antoine Rey

This gives us the commit meta data for each file change!
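
To see the forward fill in isolation, here is a small self-contained sketch. The sample text reuses the first two commits from the output above; the variable names are chosen for illustration only.

```python
from io import StringIO

import pandas as pd

# Two commits with one changed file each, in the exported format.
sample_log = (
    "\t\t\t101c9dc\t1498817227\tDave Syer\n"
    "2\t3\tpom.xml\n"
    "\n"
    "\t\t\tffa967c\t1492026060\tAntoine Rey\n"
    "1\t1\treadme.md\n"
)

raw = pd.read_csv(
    StringIO(sample_log),
    sep="\t",
    header=None,
    names=["additions", "deletions", "filename", "sha", "timestamp", "author"],
)

# Copy each commit's meta data down to the file rows that follow it.
meta = raw[["sha", "timestamp", "author"]].ffill()
filled = raw[["additions", "deletions", "filename"]].join(meta)
print(filled)
```

The blank line between the commits is skipped by read_csv, so the frame contains two commit rows and two file rows, and every file row now carries the meta data of its commit.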

Because we aren’t interested in the pure commit meta data anymore, we use dropna to drop all rows that don’t contain file statistics, i.e. rows that contain null values.

In [5]:
commits = commits.dropna()
commits.head()
Out[5]:
   additions  deletions  filename                                         sha      timestamp     author
1  2          3          pom.xml                                          101c9dc  1.498817e+09  Dave Syer
3  1          1          readme.md                                        ffa967c  1.492026e+09  Antoine Rey
5  1          0          pom.xml                                          fd1c742  1.488785e+09  Antoine Rey
8  1          1          pom.xml                                          75912a0  1.487331e+09  Stephane Nicoll
9  11         9          src/main/java/org/springframework/samples/petc…  75912a0  1.487331e+09  Stephane Nicoll

And that’s it! We are finished!

In summary, you just need a “one-liner” to convert a Git log that was exported to a file with

git log --numstat --pretty=format:"%x09%x09%x09%h%x09%at%x09%aN" > git.log

into a DataFrame:

In [6]:
# reading
git_log = pd.read_csv(
    "../../spring-petclinic/git.log",
    sep="\t", 
    header=None,
    names=[
        'additions', 
        'deletions', 
        'filename', 
        'sha', 
        'timestamp', 
        'author'])

# converting in "one line"
git_log[['additions', 'deletions', 'filename']]\
    .join(git_log[['sha', 'timestamp', 'author']]\
    .ffill())\
    .dropna().head()
Out[6]:
   additions  deletions  filename                                         sha      timestamp     author
1  2          3          pom.xml                                          101c9dc  1.498817e+09  Dave Syer
3  1          1          readme.md                                        ffa967c  1.492026e+09  Antoine Rey
5  1          0          pom.xml                                          fd1c742  1.488785e+09  Antoine Rey
8  1          1          pom.xml                                          75912a0  1.487331e+09  Stephane Nicoll
9  11         9          src/main/java/org/springframework/samples/petc…  75912a0  1.487331e+09  Stephane Nicoll
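
If you need this conversion in several notebooks, the steps can be wrapped in a small helper. This is a sketch only: the function name read_git_log and its parameter are hypothetical, and it assumes the log was produced with the format string shown above.

```python
from io import StringIO

import pandas as pd

COLUMNS = ["additions", "deletions", "filename", "sha", "timestamp", "author"]


def read_git_log(source):
    """Read a Git log exported in the format above into one row per file change.

    `source` may be a file path or a file-like object such as StringIO.
    """
    raw = pd.read_csv(source, sep="\t", header=None, names=COLUMNS)
    meta = raw[["sha", "timestamp", "author"]].ffill()
    return raw[["additions", "deletions", "filename"]].join(meta).dropna()


# Usage with in-memory text; a path such as "git.log" works the same way.
sample = StringIO(
    "\t\t\t101c9dc\t1498817227\tDave Syer\n"
    "2\t3\tpom.xml\n"
    "\n"
    "\t\t\tffa967c\t1492026060\tAntoine Rey\n"
    "1\t1\treadme.md\n"
)
changes = read_git_log(sample)
print(changes)
```

Accepting either a path or a buffer keeps the helper usable for both variants described in this article: the in-memory output from GitPython and a log file exported on the command line.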

Summary

In this notebook, I showed you how to read Git log output in only one line by using Pandas’ read_csv method. This is a very handy method and a good starting point for further analyses!
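
As one possible next step, the change frequency of files mentioned earlier can be derived from such a DataFrame. The following sketch rebuilds four of the rows shown above by hand, with timestamps rounded as displayed; the variable names are hypothetical.

```python
import pandas as pd

# Four file changes copied from the output above (timestamps as displayed).
commits = pd.DataFrame({
    "filename": ["pom.xml", "readme.md", "pom.xml", "pom.xml"],
    "sha": ["101c9dc", "ffa967c", "fd1c742", "75912a0"],
    "timestamp": [1.498817e+09, 1.492026e+09, 1.488785e+09, 1.487331e+09],
    "author": ["Dave Syer", "Antoine Rey", "Antoine Rey", "Stephane Nicoll"],
})

# UNIX timestamps are seconds since the epoch.
commits["timestamp"] = pd.to_datetime(commits["timestamp"], unit="s")

# Number of changes per file, most frequently changed first.
change_frequency = commits["filename"].value_counts()
print(change_frequency)
```

Converting the timestamp column first is optional for counting, but it makes later grouping by day, month or year straightforward.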

 

This notebook is also available on GitHub.

Update 1: Fixed a missing join that was causing wrong results.
