There are multiple reasons for analyzing a version control system like your Git repository. See, for example, Adam Tornhill’s book “Your Code as a Crime Scene” or his upcoming book “Software Design X-Rays” for plenty of inspiration:
- analyze knowledge islands
- distinguish often changing code from stable code parts
- identify code that is temporally coupled to other code
Having the necessary data for those analyses in a Pandas DataFrame gives you many possibilities to quickly gain insights into the evolution of your software system in various ways.
In a preceding blog post, I showed you a way to read a Git log file with Pandas’ DataFrame and GitPython. Looking back, this was really complicated and tedious. So, with a few tricks we can do it much better this time:
- We use GitPython’s feature to directly access an underlying Git installation. This is much faster than using GitPython’s object representation of the commits and makes it possible to have everything we need in one notebook.
- We read the data in memory by using StringIO. This avoids storing the Git output on disk and reading it from disk again, which is faster, too.
- We also exploit Pandas’ read_csv method even more. This makes the transformation of the Git log into a DataFrame as easy as pie.
Exporting the Git repo’s history
The first step is to connect GitPython with the Git repo. If we have an instance of the repo, we can gain access to the underlying Git installation of the operating system via repo.git.
In our case, we tap the Spring PetClinic repo, a small sample application for the Spring framework (I also analyzed the huge Linux repo, works as well).
import git

GIT_REPO_PATH = r'../../spring-petclinic/'
repo = git.Repo(GIT_REPO_PATH)
git_bin = repo.git
git_bin
<git.cmd.Git at 0x24a61ce8ee8>
With the git_bin, we can execute almost any Git command we like directly. In our hypothetical use case, we want to retrieve some information about the change frequency of files. For this, we need the complete history of the Git repo including statistics for the changed files (via --numstat).
We use a little trick to make sure that the format of the file statistics fits nicely with the commit’s metadata (SHA %h, UNIX timestamp %at and author’s name %aN). The --numstat option prints the additions, the deletions and the affected file name in one line, separated by the tab character \t, for example:
2\t3\tpom.xml
We use the same tab separator \t for the format string:
%h\t%at\t%aN
And here is the trick: Additionally, we put as many tabs as a file statistics line contains, plus one additional tab, in front of the format string. This pretends that there is an empty file statistics entry in front of each commit meta data string:
\t\t\t%h\t%at\t%aN
The result looks like this:
\t\t\t101c9dc\t1498817227\tDave Syer
2\t3\tpom.xml
Note: If you want to export the Git log on the command line into a file, you need to use the horizontal tab %x09 as separator instead of \t in the format string. Otherwise, the trick doesn’t work (I’ll show the corresponding format string at the end of this article).
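To see why the padding matters, here is a minimal sketch that uses only plain Python string handling. It splits one commit line and one file statistics line at the tab character and maps both onto the same six columns. The helper to_record and the COLUMNS list are hypothetical names for illustration; the two sample lines are taken from the export shown in this article.

```python
# One commit meta data line (padded with three leading tabs) and one
# --numstat line, as they appear in the exported log.
commit_line = "\t\t\t101c9dc\t1498817227\tDave Syer"
numstat_line = "2\t3\tpom.xml"

# Hypothetical column names for the shared layout.
COLUMNS = ["additions", "deletions", "filename", "sha", "timestamp", "author"]


def to_record(line):
    """Split a tab-separated line and pad it to the six-column layout."""
    fields = line.split("\t")
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))


commit_record = to_record(commit_line)
numstat_record = to_record(numstat_line)
print(commit_record)
print(numstat_record)
```

Because of the three leading tabs, the commit’s SHA, timestamp and author end up in columns four to six, while the file statistics occupy columns one to three.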
OK, let’s execute the Git log export:
git_log = git_bin.execute('git log --numstat --pretty=format:"\t\t\t%h\t%at\t%aN"')
git_log[:80]
'\t\t\t101c9dc\t1498817227\tDave Syer\n2\t3\tpom.xml\n\n\t\t\tffa967c\t1492026060\tAntoine Rey\n1'
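If GitPython isn’t available, the same export can probably be done with Python’s standard library alone. The following is only a sketch under that assumption, not the approach used in this notebook: build_git_log_command, export_git_log and PRETTY_FORMAT are hypothetical names, and the command is passed as an argument list so that no shell quoting of the format string is needed.

```python
import subprocess

# Same format string as above: three leading tabs, then SHA, timestamp, author.
PRETTY_FORMAT = "--pretty=format:\t\t\t%h\t%at\t%aN"


def build_git_log_command():
    """Return the git log call as an argument list."""
    return ["git", "log", "--numstat", PRETTY_FORMAT]


def export_git_log(repo_path):
    """Run git log inside repo_path and return its output as one string."""
    result = subprocess.run(
        build_git_log_command(),
        cwd=repo_path,          # run the command inside the repository
        capture_output=True,    # collect stdout instead of printing it
        text=True,
        encoding="utf-8",
        check=True,             # raise an error if git fails
    )
    return result.stdout
```

Calling export_git_log(GIT_REPO_PATH) should then return the same kind of string as git_log above, provided git is installed and the path points to a Git repository.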
Reading the Git log
We now have the complete file history in the git_log variable. Don’t let all the \t characters confuse you.
Let’s read the result into a Pandas DataFrame by using the read_csv method. Because we can’t provide a file path to CSV data, we have to use StringIO to read in our in-memory buffered content.
Pandas reads the first line of the tab-separated “file”, sees the many tab-separated columns and parses all other lines in the same format / column layout. Additionally, we set header to None because we don’t have a header row and provide nice names for all the columns that we read in.
import pandas as pd
from io import StringIO

commits_raw = pd.read_csv(
    StringIO(git_log),
    sep="\t",
    header=None,
    names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author'])
commits_raw.head()
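To make the resulting layout easier to follow without a real repository, here is a small sketch that runs the same read_csv call on a hand-written sample. The first two commits mirror the beginning of the output above; the last file statistics line (readme.md) is made up for illustration, as are the names sample_log and sample_raw.

```python
import pandas as pd
from io import StringIO

# Hand-written sample in the layout of the exported log.
# The "readme.md" line is invented for this example.
sample_log = (
    "\t\t\t101c9dc\t1498817227\tDave Syer\n"
    "2\t3\tpom.xml\n"
    "\n"
    "\t\t\tffa967c\t1492026060\tAntoine Rey\n"
    "1\t1\treadme.md\n"
)

sample_raw = pd.read_csv(
    StringIO(sample_log),   # in-memory buffer instead of a file path
    sep="\t",
    header=None,
    names=["additions", "deletions", "filename", "sha", "timestamp", "author"],
)
print(sample_raw)
```

The blank line between the commits is skipped by read_csv by default, so the sample yields four rows: commit rows with empty file statistics alternate with file rows with empty commit meta data.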
Now we have two different kinds of content for the rows:
- The commit meta data without file statistics (rows with the indexes 0, 2 and 4 in the output of commits_raw.head())
- The file statistics without the commit meta data (rows with the indexes 1 and 3 in the output of commits_raw.head())
But we are interested in the commit meta data for each file’s statistics. For this, we forward fill (ffill) the empty commit meta data entries of the file statistics rows with the preceding commit’s meta data via the DataFrame’s ffill method and join this data with the existing columns of the file statistics.
commits = commits_raw[['additions', 'deletions', 'filename']]\
    .join(commits_raw[['sha', 'timestamp', 'author']].ffill())
commits.head()
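The forward fill can be tried out on a tiny hand-built DataFrame. This is just a sketch with invented sample rows (the readme.md entry and the variable names sample_raw, meta and sample_commits are assumptions for illustration); it uses the ffill method, which does the same as a forward-filling fillna.

```python
import pandas as pd

nan = float("nan")

# Invented sample in the raw layout: commit rows and file rows alternate.
sample_raw = pd.DataFrame({
    "additions": [nan, 2, nan, 1],
    "deletions": [nan, 3, nan, 1],
    "filename": [None, "pom.xml", None, "readme.md"],
    "sha": ["101c9dc", None, "ffa967c", None],
    "timestamp": [1498817227, nan, 1492026060, nan],
    "author": ["Dave Syer", None, "Antoine Rey", None],
})

# Copy each commit's meta data down into the file rows that follow it ...
meta = sample_raw[["sha", "timestamp", "author"]].ffill()

# ... and put it next to the untouched file statistics columns.
sample_commits = sample_raw[["additions", "deletions", "filename"]].join(meta)
print(sample_commits)
```

Only the meta data columns are forward filled. The file statistics columns keep their empty values in the commit rows, which is what makes it possible to drop those rows afterwards.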
This gives us the commit meta data for each file change!
Because we aren’t interested in the pure commit meta data rows anymore, we drop all rows that don’t contain file statistics, i.e. rows that contain null values, via dropna.
commits = commits.dropna()
commits.head()
And that’s it! We are finished!
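As a sketch of what such a DataFrame makes possible, the change frequency of files mentioned at the beginning could be computed roughly like this. The sample rows and the names sample_commits and changes_per_file are invented for illustration; with real data you would use the commits DataFrame from above instead.

```python
import pandas as pd

# Invented sample in the final layout: one row per changed file per commit.
sample_commits = pd.DataFrame({
    "additions": [2.0, 1.0, 4.0],
    "deletions": [3.0, 1.0, 0.0],
    "filename": ["pom.xml", "readme.md", "pom.xml"],
    "sha": ["101c9dc", "ffa967c", "ffa967c"],
    "timestamp": [1498817227.0, 1492026060.0, 1492026060.0],
    "author": ["Dave Syer", "Antoine Rey", "Antoine Rey"],
})

# The UNIX timestamps are seconds since the epoch; convert them to datetimes.
sample_commits["timestamp"] = pd.to_datetime(sample_commits["timestamp"], unit="s")

# Count in how many commits each file was changed.
changes_per_file = sample_commits["filename"].value_counts()
print(changes_per_file)
```

Files with a high count are the often-changing parts of the code base, files with a low count the stable ones.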
In summary, you just need a “one-liner” for converting the Git log file output that was exported with
git log --numstat --pretty=format:"%x09%x09%x09%h%x09%at%x09%aN" > git.log
and read into a DataFrame:
# reading
git_log = pd.read_csv(
    "../../spring-petclinic/git.log",
    sep="\t",
    header=None,
    names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author'])

# converting in "one line"
git_log[['additions', 'deletions', 'filename']]\
    .join(git_log[['sha', 'timestamp', 'author']]\
    .ffill())\
    .dropna().head()
In this notebook, I showed you how you can read a Git log output in only one line by using Pandas’ read_csv method. This is a very handy method and a good starting point for further analyses!
This notebook is also available on GitHub.
Update 1: Fixed a missing join that was causing wrong results.