Context
Reading data from a software version control system can be pretty useful if you want to answer evolutionary questions like:
- Who are our main committers to the software?
- Are there any areas in the code that only one developer knows about?
- What were we working on during the last months?
I take this opportunity to show you how to read any kind of structured, linear data into a Pandas DataFrame. The general rule of thumb is: as long as you see a pattern in the raw data, Pandas can read and tame it, too!
Hint: This notebook is also available on GitHub.
The idea
We are taking a shortcut for retrieving the commit history by exporting it into a log file. You can use e.g.
git log --all --numstat --pretty=format:'--%h--%ad--%aN' --no-renames > git.log
to do this. This will output a file with all the log information of a repository.
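If you'd rather create that log file from within Python, here is a minimal sketch using the standard subprocess module (assuming git is on your PATH and that you run the script inside the repository you want to analyze):
import subprocess

# Export the complete commit history (same options as the command line
# above) into a git.log file in the current working directory.
with open("git.log", "w") as log_file:
    subprocess.run(
        ["git", "log", "--all", "--numstat",
         "--pretty=format:--%h--%ad--%aN", "--no-renames"],
        stdout=log_file, check=True)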
In this notebook, we analyze the Git repository of aim42 (an open book project about how to improve legacy systems).
The first entries of that file look something like this:
with open(r'data/gitlog_aim42.log') as log:
    for line in log.readlines()[:8]:
        print(line, end='')
For each commit, we choose to create a header line with the following commit info (by using --pretty=format:'--%h--%ad--%aN'):
--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste
It contains the commit's SHA key, timestamp and author's name, separated by --.
Every other row contains some statistics about the modified files:
2 0 src/main/asciidoc/appendices/bibliography.adoc
It contains the number of lines inserted, the number of lines deleted and the relative path of the file. With a little trick and a little bit of data wrangling, we can read that information into a nicely structured DataFrame.
Let’s get started!
Import the data
First, I’ll show you my approach on how to read nearly everything into a DataFrame. The key is to use Pandas’ read_csv for reading “non-character separated values”. How to do that? We simply choose a separator that doesn’t occur in the file that we want to read. My favorite character for this is the “DEVICE CONTROL TWO” character U+0012. I haven’t encountered a situation yet where this character was included in a data set.
We just read our git.log file without any headers (because there are none) and give the only column a nice name.
import pandas as pd

commits = pd.read_csv("data/gitlog_aim42.log",
                      sep="\u0012",
                      header=None,
                      names=['raw'])
commits.head()
Data Wrangling
OK, but now we have a data wrangling challenge: we have the commit info as well as the statistics for the modified files in one column, but they don't belong together. What we want is the commit info along with the file statistics in separate columns to get some serious analysis started.
Commit info
Let’s treat the commit info first. Luckily, we set some kind of anchor or marker to identify the commit info: Each commit info starts with a --. So let’s extract all the commit info from the original commits DataFrame.
commit_marker = commits[
    commits['raw'].str.startswith("--")]
commit_marker.head()
With this, we can focus on extracting the information from a commit info row. The next command may look a little frightening, but don't worry, we'll go through it step by step.
commit_info = commit_marker['raw'].str.extract(
    r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$",
    expand=True)
commit_info['date'] = pd.to_datetime(commit_info['date'])
commit_info.head()
We want to extract some data from the raw column. For this, we use the extract method on the string representation (note the str) of all the rows. This method expects a regular expression. We provide our own regex
^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$
that works as follows:
- ^: the beginning of the row
- --: the two dashes that we chose as the separator between the entries in the git log file
- (?P<sha>.*?)--: a named match group (marked by ( and )) with the name sha, matching any characters (.*) non-greedily (?) up to the next -- separator
- and so on until
- $: the marker for the end of the row (actually, ^ and $ aren't needed here, but in my eyes the regex looks nicer with them 😉)
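If you want to check what the named match groups deliver before applying them to the whole column, here is a minimal sketch using Python's built-in re module on the sample header line from above:
import re

sample = "--fa1ca6f--Thu Dec 22 08:04:18 2016 +0100--feststelltaste"
match = re.match(r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$", sample)

# The group names become the keys of the resulting dictionary.
print(match.groupdict())
# {'sha': 'fa1ca6f', 'date': 'Thu Dec 22 08:04:18 2016 +0100', 'author': 'feststelltaste'}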
I use these ugly-looking named match groups because Pandas will then use the name of such a group as the column name (so we avoid renaming the columns later on).
The expand=True keyword delivers a DataFrame with columns for each detected regex group.
We simply store the result into a new DataFrame variable commit_info.
Because we’ve worked with the string representation of the row, Pandas didn’t recognize the right data types for our newly created columns. That’s why we need to cast the date column to the right type.
OK, this part is ready, let’s have a look at the file statistics!
File statistics
Every row that is not a commit info row is a file statistics row. So we just reuse the index of our already prepared commit_info DataFrame to get all the other data by saying "give me all commits that are not in the index of the commit_info DataFrame".
file_stats_marker = commits[
    ~commits.index.isin(commit_info.index)]
file_stats_marker.head()
Luckily, each row's data is just a tab-separated string that we can easily split with the split method. We expand the result to get a DataFrame, rename the default columns to names that make more sense and adjust some data types. For the latter, we use the keyword errors='coerce', which lets to_numeric return NaN for all entries that are not a number.
file_stats = file_stats_marker['raw'].str.split(
    "\t", expand=True)
file_stats = file_stats.rename(
    columns={0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(
    file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(
    file_stats['deletions'], errors='coerce')
file_stats.head()
Putting it all together
Now we have three parts: all commits, the separated commit info and the file statistics.
We only need to glue the commit info and the file statistics together into a normalized DataFrame. For this, we have to make some adjustments to the indexes.
For the commit info, we want to have each info for each file statistics row. That means we reindex the commit info by using the index of the commits DataFrame…
commit_info.reindex(commits.index).head(3)
…and fill the missing values for the file statistics' rows to get the needed structure. Together, this is done as follows:
commit_data = commit_info.reindex(
    commits.index).ffill()
commit_data.head()
After filling the file statistics rows, we can throw away the dedicated commit info rows by reusing the index from above (have a look at the index to see this clearly).
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data.head()
The easy step afterward is to join the file_stats DataFrame with the commit_data.
commit_data = commit_data.join(file_stats)
commit_data.head()
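As a quick sanity check (just a sketch based on the variables defined above): every file statistics row should have received exactly one commit info row, so both DataFrames have to line up.
# Each file statistics row got exactly one commit info attached,
# so the joined DataFrame and file_stats must have the same number of rows.
assert len(commit_data) == len(file_stats)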
We’re done!
Complete code block
Too much code to look through? Here is everything from above in a condensed format.
%%time
import pandas as pd
commits = pd.read_csv("data/gitlog_aim42.log", sep="\u0012", header=None, names=['raw'])
commit_marker = commits[commits['raw'].str.startswith("--", na=False)]
commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<date>.*?)--(?P<author>.*?)$", expand=True)
commit_info['date'] = pd.to_datetime(commit_info['date'])
file_stats_marker = commits[~commits.index.isin(commit_info.index)]
file_stats = file_stats_marker['raw'].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertions", 1: "deletions", 2: "filename"})
file_stats['insertions'] = pd.to_numeric(file_stats['insertions'], errors='coerce')
file_stats['deletions'] = pd.to_numeric(file_stats['deletions'], errors='coerce')
commit_data = commit_info.reindex(commits.index).ffill()
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]
commit_data = commit_data.join(file_stats)
Just a few milliseconds to run through, not bad!
Summary
In this notebook, I showed you how to read non-perfectly structured data via the non-character separator trick. I also showed you how to transform rows that contain multiple kinds of data into one nicely structured DataFrame.
Now that we have the Git repository DataFrame, we can do some nice things with it, e.g. visualizing the code churn of a project, but that's a story for another notebook! To give you a short preview:
%matplotlib inline
timed_commits = commit_data.set_index(pd.DatetimeIndex(commit_data['date']))[['insertions', 'deletions']].resample('1D').sum()
(timed_commits['insertions'] - timed_commits['deletions']).cumsum().ffill().plot()
Stay tuned!