In the past, I did a lot of Git log analysis on my blog. The main reason is that developers know what Git is and what kind of data it provides. That makes it easy to connect with developers before moving on to more advanced analyses of Git data.

But this kind of analysis runs into a whole class of problems when you want to do file-based analysis in a long-running repository: deletions, merges, splits and renames.

In this notebook, I want to show you the kinds of problems that renames cause:

Example Git repository

For this analysis, we use a small but long-lived repository: the Spring PetClinic project (anti-refactored by me to show some interesting things).

We first clone this repository locally.

In [1]:
%%bash

git clone https://github.com/JavaOnAutobahn/spring-petclinic
Cloning into 'spring-petclinic'...

Next, we export the Git history by using a special command (background explained here)

In [2]:
%%bash

# path to git repository
cd spring-petclinic
git log --numstat --pretty=format:"%x09%x09%x09%ai" -- '*.java' > git_log.csv

With a little helper function, we import the exported data (see link above for details on that as well).

In [3]:
import pandas as pd

def parse_git_log(path):
    # reading
    git_log = pd.read_csv(
        path,
        sep="\t", 
        header=None,
        names=[
            'additions', 
            'deletions', 
            'filename', 
            'timestamp'])

    # forward-fill each commit's timestamp onto its file rows ("one line" per change)
    git_log = git_log[['additions', 'deletions', 'filename']]\
        .join(git_log[['timestamp']].ffill())\
        .dropna().reset_index(drop=True)

    # data type conversions
    git_log['additions'] = pd.to_numeric(git_log['additions'], errors='coerce')
    git_log['deletions'] = pd.to_numeric(git_log['deletions'], errors='coerce')
    churn = git_log['additions'] - git_log['deletions']
    git_log.insert(2, "churn", churn)
    git_log['timestamp'] = pd.to_datetime(git_log['timestamp'])
    return git_log.set_index('timestamp')

timed_log = parse_git_log("spring-petclinic/git_log.csv")
timed_log.head()
Out[3]:
additions deletions churn filename
timestamp
2019-03-05 22:32:20+01:00 1.0 1.0 0.0 src/main/java/org/springframework/samples/petc…
2019-03-05 22:32:20+01:00 2.0 0.0 2.0 src/main/java/org/springframework/samples/petc…
2019-03-05 22:32:20+01:00 2.0 1.0 1.0 src/main/java/org/springframework/samples/petc…
2019-03-05 22:32:20+01:00 2.0 0.0 2.0 src/main/java/org/springframework/samples/petc…
2019-03-05 22:32:20+01:00 3.0 0.0 3.0 src/main/java/org/springframework/samples/petc…

So what we got is a nicely parsed pandas DataFrame that we can use for further analysis.
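To see why the export format and the helper function fit together, here is a minimal sketch on a tiny, made-up log excerpt (the file names and numbers are hypothetical). The three tab characters (%x09) in front of the timestamp push it into the fourth column, so commit lines and numstat lines share one four-column layout, and a forward fill then copies each timestamp onto the file rows below it.

```python
import io
import pandas as pd

# Made-up excerpt in the shape produced by
# git log --numstat --pretty=format:"%x09%x09%x09%ai"
sample = (
    "\t\t\t2019-03-05 22:32:20 +0100\n"
    "1\t1\tsrc/A.java\n"
    "2\t0\tsrc/B.java\n"
    "\n"
    "\t\t\t2018-11-15 18:39:27 +0100\n"
    "3\t2\tsrc/A.java\n"
)

raw = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["additions", "deletions", "filename", "timestamp"],
)

# Commit lines only carry a timestamp, file lines only carry the numbers.
# Forward-filling the timestamp and dropping the commit lines leaves
# one row per changed file.
raw["timestamp"] = raw["timestamp"].ffill()
commits = raw.dropna().reset_index(drop=True)
print(commits)
```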

Analysis

Let’s dive into the actual problem analysis. Say we want to do some file-based analysis of the software project with data based on Git. So we group our data by file name and sum up the changes.

(Note that we keep the most recent timestamp for each file, which is the first entry per file because git log lists the newest commits first. This lets us base the analysis on the most recent data later on.)

In [4]:
file_churns = timed_log.reset_index().groupby('filename').agg({
    "additions" : "sum",
    "deletions" : "sum",
    "churn" : "sum",
    "timestamp" : "first"
})
file_churns.head()
Out[4]:
additions deletions churn timestamp
filename
org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/HomeController.java 17.0 17.0 0.0 2013-01-09 17:24:48+08:00
org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/Appointment.java 37.0 37.0 0.0 2013-01-09 17:24:48+08:00
org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/AppointmentBook.java 13.0 13.0 0.0 2013-01-09 17:24:48+08:00
org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/AppointmentForm.java 67.0 67.0 0.0 2013-01-09 17:24:48+08:00
org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/Appointments.java 15.0 15.0 0.0 2013-01-09 17:24:48+08:00

So, at this point, something weird happens: There are files with a negative churn, that is, a negative number of lines!

How can this happen?

In [5]:
weird_churns = file_churns[file_churns['churn'] < 0].sort_values(by="timestamp", ascending=False)
weird_churns.head()
Out[5]:
additions deletions churn timestamp
filename
src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java 11.0 13.0 -2.0 2018-11-15 18:39:27+01:00
src/main/java/org/springframework/samples/petclinic/repository/jpa/package-info.java 0.0 3.0 -3.0 2015-10-16 09:33:06+02:00
src/main/java/org/springframework/samples/petclinic/repository/jdbc/package-info.java 0.0 3.0 -3.0 2015-10-16 09:33:06+02:00
src/main/java/org/springframework/samples/petclinic/web/VetsAtomView.java 56.0 130.0 -74.0 2015-05-12 19:07:35+08:00
src/test/java/org/springframework/samples/petclinic/web/VisitsViewTests.java 7.0 77.0 -70.0 2015-05-10 06:45:39+08:00

Let’s look at the most recently changed file with such a negative number of lines (“recent” because then it is more likely that it still exists in the repository).

In [6]:
weird_churn_filename = weird_churns.iloc[0].name
weird_churn_filename
Out[6]:
'src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java'

For this file, we want to follow the development. Using the --follow option of Git, we can trace the evolution of this single file. As with the first Git data export, we store this data in a file.

In [7]:
%%bash
cd spring-petclinic
git log --numstat --pretty=format:"%x09%x09%x09%ai" --follow src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java > ../weird_churn_filename_log.csv

Let’s read in the data with our little helper function from above.

In [8]:
weird_file_churn = parse_git_log("weird_churn_filename_log.csv")
weird_file_churn.head()
Out[8]:
additions deletions churn filename
timestamp
2018-11-15 18:39:27+01:00 3.0 2.0 1.0 src/test/java/org/springframework/samples/petc…
2015-10-16 09:33:06+02:00 3.0 4.0 -1.0 src/test/java/org/springframework/samples/petc…
2013-12-16 20:58:15+09:00 2.0 1.0 1.0 src/test/java/org/springframework/samples/petc…
2013-06-28 12:00:29+08:00 1.0 2.0 -1.0 src/test/java/org/springframework/samples/petc…
2013-03-04 12:15:20+08:00 2.0 4.0 -2.0 src/test/java/org/springframework/samples/petc…

Insights

OK, what is the problem with the negative number of lines?

Let’s look at the history of this one specific file: It was renamed several times!

In [9]:
weird_file_churn['filename'].value_counts()
Out[9]:
src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java                                                         5
src/test/java/org/springframework/samples/petclinic/jpa/AbstractJpaClinicTests.java                                                            4
src/test/java/org/springframework/samples/petclinic/jpa/JpaClinicTests.java                                                                    3
src/test/java/org/springframework/samples/petclinic/repository/jpa/JpaOwnerRepositoryImplTests.java                                            2
src/test/java/org/springframework/samples/petclinic/jpa/JpaOwnerRepositoryImplTests.java                                                       2
src/test/java/org/springframework/samples/petclinic/{ => repository}/jpa/JpaOwnerRepositoryImplTests.java                                      1
src/test/java/org/springframework/samples/petclinic/{repository/jpa/JpaOwnerRepositoryImplTests.java => service/ClinicServiceJpaTests.java}    1
src/test/java/org/springframework/samples/petclinic/jpa/{AbstractJpaClinicTests.java => JpaClinicTests.java}                                   1
src/test/java/org/springframework/samples/petclinic/jpa/{JpaClinicTests.java => JpaClinicImplTests.java}                                       1
src/test/java/org/springframework/samples/petclinic/jpa/{JpaClinicImplTests.java => JpaOwnerRepositoryImplTests.java}                          1
Name: filename, dtype: int64

Although Git provides rename tracking, not all of these renames happened in a way that Git’s rename detection recognizes (the entries with => are the ones that Git can track). That makes it difficult to track the remaining renames with standard means.
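The => entries use Git’s compact rename notation, where the part in curly braces is what changed between the old and the new path. As a small sketch (split_rename is a hypothetical helper, not part of the notebook), such an entry could be expanded into the old and the new path like this:

```python
import re

# Matches entries like "a/{old => new}/b"; either side of "=>" may be empty.
RENAME = re.compile(r"^(?P<prefix>.*)\{(?P<old>.*) => (?P<new>.*)\}(?P<suffix>.*)$")

def split_rename(entry):
    """Return (old_path, new_path) for a numstat file entry.

    Entries without rename notation are returned unchanged for both paths.
    """
    match = RENAME.match(entry)
    if match:
        old = match["prefix"] + match["old"] + match["suffix"]
        new = match["prefix"] + match["new"] + match["suffix"]
        # An empty side such as "{ => repository}" leaves a double slash behind.
        return old.replace("//", "/"), new.replace("//", "/")
    if " => " in entry:
        # Renames without a common prefix or suffix have no braces at all.
        old, new = entry.split(" => ", 1)
        return old, new
    return entry, entry

old_path, new_path = split_rename(
    "src/test/java/org/springframework/samples/petclinic/"
    "{ => repository}/jpa/JpaOwnerRepositoryImplTests.java"
)
print(old_path)
print(new_path)
```

Applied to the filename column, this would give one way to chain the different names of a file together. It only covers the renames that Git reports itself.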

If we now sum up all the churn values across all of these names, we get the actual number of lines of the file based on pure Git repository data.

In [10]:
weird_file_churn['churn'].sum()
Out[10]:
23.0

Let’s compare this with the actual number of lines in the real file, using the word count command wc.

In [11]:
%%bash
wc -l spring-petclinic/src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java
23 spring-petclinic/src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java

Cool, this one matches! This might not always be the case, for example if you do some weird renaming in your code base, or if files get merged or split up. And if we sum up only the churn recorded under the current file name, we are back at the negative value from before:

In [12]:
weird_file_churn[weird_file_churn['filename'] == weird_churn_filename]['churn'].sum()
Out[12]:
-2.0
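A toy example with made-up numbers (the file names Old.java and New.java are hypothetical) illustrates how this can happen: the bulk of the lines was added under the old name, so grouping by file name credits the new name only with the small changes made after the rename.

```python
import pandas as pd

# Made-up history of one file, newest commit first, that was
# created as Old.java and later renamed to New.java.
history = pd.DataFrame({
    "filename": ["New.java", "New.java", "{Old.java => New.java}", "Old.java", "Old.java"],
    "additions": [3, 1, 0, 2, 30],
    "deletions": [4, 2, 0, 5, 2],
})
history["churn"] = history["additions"] - history["deletions"]

# Per-name sums split the history at the rename ...
churn_by_name = history.groupby("filename")["churn"].sum()
# ... while the sum over the whole followed history does not.
total_churn = history["churn"].sum()

print(churn_by_name)
print(total_churn)
```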

Visualization

Let’s plot the number of lines for this specific file over time to get a feeling for whether the data is right at all.

In [13]:
%matplotlib inline
weird_file_churn[['additions', 'deletions', 'churn']].cumsum().plot(figsize=[20,5]);

We see that we somehow get a negative number of lines of code at the beginning, which could indicate that something went wrong with the rename detection. Later on, the number of lines becomes positive.
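One thing worth checking here is the row order: git log lists the newest commit first, so a cumulative sum over the rows as exported accumulates backwards in time. Whether this accounts for the dip in the plot is an assumption on my side, but the effect is easy to see in a small sketch with made-up churn values:

```python
import pandas as pd

# Made-up churn values in git log order (newest commit first).
log = pd.DataFrame(
    {"churn": [-2, 1, 24]},
    index=pd.to_datetime(["2018-11-15", "2015-10-16", "2013-01-09"]),
)

# Accumulating in log order starts with the newest change.
in_log_order = log["churn"].cumsum()

# Sorting by timestamp first accumulates from the oldest commit onwards,
# which is what "number of lines over time" usually means.
lines_over_time = log.sort_index()["churn"].cumsum()

print(in_log_order.tolist())
print(lines_over_time.tolist())
```

Both variants end at the same total. Only the chronologically sorted one shows the size of the file at each point in time.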

Conclusion

So there are limitations to Git repository analysis if you don’t want to dive deep into a more sophisticated model of a project’s evolution.

Here are some ideas to mitigate this problem around renames:

  1. Use more advanced Git repository mining tools: There are tools like the open-source PyDriller or commercial tools like CodeScene or Teamscale (about the latter I know that they’ve invested significant brain-power in solving file renaming and merging problems).
  2. Leverage Git rename detection: Git provides rename detection by default. You might be able to tweak some parameters to get the results you need. I once used this, but I can’t remember any further details 🙁
  3. Avoid file-based Git analysis: There are plenty of other interesting analyses waiting for you out there which could be more valuable in your specific context.
  4. Use the actual lines of code: You might use tools like cloc to get the real number of lines of your currently existing files in the repository.
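For option 2, one parameter that can be tweaked is the similarity threshold of Git’s rename detection, which git log accepts as -M or --find-renames (the default is 50%). The following is only a sketch: the helper names and the 40% value are assumptions for illustration, and whether a lower threshold actually improves the results for a given repository has to be tried out.

```python
import subprocess

def rename_aware_log_command(similarity="40%", pathspec="*.java"):
    """Build a git log call with an explicit rename similarity threshold."""
    return [
        "git", "log", "--numstat",
        f"--find-renames={similarity}",
        "--pretty=format:%x09%x09%x09%ai",
        "--", pathspec,
    ]

def export_log(repo_path, **kwargs):
    """Run the command inside repo_path and return the raw log text."""
    result = subprocess.run(
        rename_aware_log_command(**kwargs),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

print(" ".join(rename_aware_log_command()))
```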

As of today, I’ve chosen the latter two options (with a tendency to 3. ;-)).

Using Git repository data together with the actual number of lines of code (option 4.) is good enough for me to get a first glimpse at the evolution of a software project.
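A minimal sketch of option 4 without any extra tooling could look like this. The function name count_lines is hypothetical, and it simply counts physical lines per file, which is cruder than what cloc reports (cloc separates code, comments and blank lines).

```python
from pathlib import Path
import pandas as pd

def count_lines(repo_root, pattern="*.java"):
    """Count the physical lines of all files below repo_root that match pattern."""
    root = Path(repo_root)
    counts = {}
    for path in root.rglob(pattern):
        # Reading in binary mode avoids encoding errors. A last line
        # without a trailing newline is still counted, unlike with wc -l.
        with path.open("rb") as f:
            counts[path.relative_to(root).as_posix()] = sum(1 for _ in f)
    return pd.Series(counts, name="lines", dtype="int64")
```

Because the result is indexed by the path relative to the repository root, it could be joined with the per-file data from the Git log, for example via file_churns.join(count_lines("spring-petclinic"), how="inner"). That keeps only the files that still exist.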

Your context could be a different one where you have to choose more sophisticated techniques to handle all the problems around Git analysis. It would be very interesting to get to know your specific context!

 

