This blog post is part of a three-part series. See part 1 for retrieving the dataset and part 3 (upcoming) for the visualization.

In big, old legacy systems, tests are often a mess. End-to-end tests written with UI testing frameworks like Selenium, in particular, quickly become a pain to maintain. They run slowly, and you are soon overwhelmed by a multitude of tests that partly cover the same things.

In this data analysis, I want to illustrate a way out of this misery. We want to spot test cases that are structurally very similar and can therefore be regarded as duplicates. We’ll calculate the similarity between tests based on their invocations of production code. We can achieve this by treating our software data as observations of linear features. This lets us leverage existing mathematical techniques like vector distance calculation (as we’ll see in this post) as well as machine learning techniques like multidimensional scaling or clustering (in a follow-up post).

As the software data under analysis, we’ll use the JUnit tests of a Java application to demonstrate the approach. We want to figure out whether there are test cases that exercise production code that other, more dedicated tests already cover. With the result, we may be able to delete some superfluous test cases (and always remember: less code is good code, no code is best :-)).

Reality check
The real use case originates from a software system with a massive number of Selenium end-to-end tests that use the Page Object pattern. Each page object represents one HTML page of a web application. Technically, a page object exposes methods, in the programming language you use, that enable programmatic interaction with that page. In such a scenario, you can infer which tests call the same pages and trigger the same set of UI components (like buttons). This is a good indicator of test cases that test the same use cases in the application. We can use the results of such an analysis to find repeated test scenarios.


I’m using a dataset that I created with jQAssistant in a previous blog post. It shows which test methods call which code in the application (the “production code”). It’s a purely static and structural view of our code, but it can be very helpful, as we’ll see shortly.

Note: There are other ways to get this kind of information, e.g. by mining the log file of a test execution (this would even add real runtime information). But to demonstrate the general approach, the purely static and structural information about the relationship between the test code and our production code is sufficient.

First, we read in the data with Pandas, my favorite data analysis framework for getting things done easily.

In [1]:
import pandas as pd

invocations = pd.read_csv("datasets/test_code_invocations.csv", sep=";")
invocations.head()
   test_type       test_method                                prod_type   prod_method                                      invocations
0  AddCommentTest  void blankSiteContainsRightComment()       AddComment  at.dropover.comment.boundary.GetCommentRespons…  1
1  AddCommentTest  void blankSiteContainsRightCreationTime()  AddComment  at.dropover.comment.boundary.GetCommentRespons…  1
2  AddCommentTest  void blankSiteContainsRightUser()          AddComment  at.dropover.comment.boundary.GetCommentRespons…  1
3  AddCommentTest  void failsAtCommentNull()                  AddComment  at.dropover.comment.boundary.GetCommentRespons…  1
4  AddCommentTest  void failsAtCreatorNull()                  AddComment  at.dropover.comment.boundary.GetCommentRespons…  1

What we’ve got here are

  • all names of our test types (test_type) and production types (prod_type)
  • the signatures of the test methods (test_method) and production methods (prod_method)
  • the number of calls from the test methods to the production methods (invocations).


OK, let’s do some actual work! We want

  • to calculate the structural similarity of test cases
  • to spot possible duplications of tests

to figure out which test cases are superfluous (and can be deleted).

What we have are all test cases (aka test methods) and their calls to the production code base (= the production methods). We can transform this data into a matrix representation that shows which test method triggers which production method by using Pandas’ pivot_table function on our invocations DataFrame.

In [2]:
invocation_matrix = invocations.pivot_table(
    index=['test_type', 'test_method'],
    columns=['prod_type', 'prod_method'],
    values='invocations',  # each cell holds the number of calls
    fill_value=0)          # test/production pairs without a call become 0
# show interesting parts of results
prod_type / prod_method of the two columns shown:
  AddComment:     at.dropover.comment.boundary.GetCommentResponseModel doSync(at.dropover.comment.boundary.AddCommentRequestModel)
  AddScheduling:  at.dropover.scheduling.boundary.AddSchedulingResponseModel doSync(at.dropover.scheduling.boundary.AddSchedulingRequestModel)

test_type              test_method                     AddComment  AddScheduling
AddCommentTest         void failsAtCreatorNull()       1           0
                       void worksAtMinimalRequest()    1           0
AddSchedulingDateTest  void addDateToScheduling()      0           0
                       void addTwoDatesToScheduling()  0           0
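To make the reshaping step more tangible, here is a minimal sketch on a made-up three-row dataset. All type and method names in it (ATest, Foo, void x() and so on) are hypothetical and only mirror the column layout of the real invocations data. Each row of the resulting matrix is the invocation vector of one test method.

```python
import pandas as pd

# Hypothetical miniature of the invocations data (all names are made up)
toy = pd.DataFrame({
    "test_type":   ["ATest", "ATest", "BTest"],
    "test_method": ["void a1()", "void a2()", "void b1()"],
    "prod_type":   ["Foo", "Foo", "Bar"],
    "prod_method": ["void x()", "void x()", "void y()"],
    "invocations": [1, 2, 3],
})

# One row per test method, one column per production method,
# call counts as cell values and 0 where no call exists
toy_matrix = toy.pivot_table(
    index=["test_type", "test_method"],
    columns=["prod_type", "prod_method"],
    values="invocations",
    fill_value=0,
)
print(toy_matrix)

# Each row is now a plain numeric vector
vectors = toy_matrix.to_numpy()
print(vectors.shape)  # (3, 2): three test methods, two production methods
```

Note that pivot_table aggregates duplicate index/column combinations (with the mean by default). That makes no difference here as long as every test/production pair occurs only once in the data.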

What we’ve got now is the information about each invocation (or non-invocation) of production methods by test methods. In mathematical terms, we now have an n-dimensional vector for each test method, where n is the number of tested production methods in our code base. That means we’ve just transformed our software data into a representation that we can work on with standard Data Science tooling :-D! All further problem-solving techniques in this area are now available to us.

And this is exactly what we do in our further analysis: we’ve reduced our problem to a distance calculation between vectors (we use distance instead of similarity because the visualization techniques used later work with distances). For this, we can use the cosine_distances function (see this article for the mathematical background) of the machine learning library scikit-learn to calculate a pairwise distance matrix between the test methods, i.e. between their vectors of linear features.

In [3]:
from sklearn.metrics.pairwise import cosine_distances

distance_matrix = cosine_distances(invocation_matrix)
# show some interesting parts of results
array([[ 0.10557281,  0.2       ],
       [ 0.10557281,  0.2       ],
       [ 0.80388386,  0.8245884 ],
       [ 1.        ,  1.        ]])
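For readers who want to see what happens inside cosine_distances, here is a small sketch that computes the same quantity by hand with NumPy: one minus the dot product of the length-normalized vectors. The helper name cosine_distance_matrix and the toy vectors are my own assumptions for illustration, and the sketch assumes that no row consists of zeros only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances


def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (assumes no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # length of every row vector
    unit = X / norms                                  # scale every row to length 1
    return 1.0 - unit @ unit.T                        # 1 - cosine similarity


# Hypothetical invocation vectors: rows 0 and 1 point in the same direction,
# row 2 shares no production method with them
toy_vectors = np.array([[1, 0, 2],
                        [2, 0, 4],
                        [0, 3, 0]])

manual = cosine_distance_matrix(toy_vectors)
reference = cosine_distances(toy_vectors)
print(np.allclose(manual, reference))
```

Vectors that point in the same direction get a distance of 0 regardless of their length, and vectors without any common production method get a distance of 1.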

From this result, we create a DataFrame to get a better visual representation of the data.

In [4]:
distance_df = pd.DataFrame(distance_matrix, index=invocation_matrix.index, columns=invocation_matrix.index)
# show some interesting parts of results
test_type                                                            CommentGatewayTest
test_method                                                          void readRoundtripWorksWithFullData()  void readRoundtripWorksWithMandatoryData()
test_type              test_method
CommentsResourceTest   void postCommentActuallyCreatesComment()      0.105573                               0.200000
                       void postCommentActuallyCreatesCommentJSON()  0.105573                               0.200000
                       void postTwiceCreatesTwoElements()            0.803884                               0.824588
ConfigurationFileTest  void keyWorks()                               1.000000                               1.000000
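One possible way to turn such a distance table into a list of duplicate candidates is to collect all pairs below a distance threshold. The following is only a sketch under my own assumptions: the helper closest_pairs, the labels t1 to t3 and the threshold of 0.2 are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd


def closest_pairs(distance_df, threshold):
    """Return (label_a, label_b, distance) for every pair below threshold, closest first."""
    values = distance_df.to_numpy()
    # Upper triangle without the diagonal: every pair once, no self-comparisons
    rows, cols = np.triu_indices(len(distance_df), k=1)
    pairs = [
        (distance_df.index[i], distance_df.columns[j], float(values[i, j]))
        for i, j in zip(rows, cols)
        if values[i, j] < threshold
    ]
    return sorted(pairs, key=lambda pair: pair[2])


# Hypothetical symmetric distance table for three tests
labels = ["t1", "t2", "t3"]
toy_distances = pd.DataFrame(
    [[0.0, 0.1, 0.9],
     [0.1, 0.0, 0.8],
     [0.9, 0.8, 0.0]],
    index=labels, columns=labels,
)

print(closest_pairs(toy_distances, threshold=0.2))  # [('t1', 't2', 0.1)]
```

With a MultiIndex like the one in distance_df, the labels in the result would be (test_type, test_method) tuples. Which threshold is sensible depends on the code base and needs some experimentation.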

You can also find the complete DataFrame as an Excel file (~0.5 MB). It shows all dissimilarities between test cases based on the static calls to production code and looks something like this:

Can you already spot some clusters? We’ll have a detailed look at that in the next blog post ;-)!


Let’s have a look at what we’ve achieved by discussing some of the results. We compare the actual source code of the test method readRoundtripWorksWithFullData() from the test class CommentGatewayTest

    public void readRoundtripWorksWithFullData() {

with the test method postCommentActuallyCreatesComment() of another test class, CommentsResourceTest

    public void postCommentActuallyCreatesComment() {
        Assert.assertEquals(4L, (long)"sitewith3comments").size());
        Assert.assertEquals("comment3", ((Comment)"sitewith3comments").get(3)).getContent());

Although both classes represent different test levels (unit vs. integration test), they share some similarities (a dissimilarity of ~0.1, i.e. a cosine similarity of ~0.9 between their calls to production methods). We can see exactly which production methods are invoked by both test cases by filtering for these two test methods in the original invocations DataFrame.

In [5]:
invocations[
    (invocations.test_method == "void readRoundtripWorksWithFullData()") |
    (invocations.test_method == "void postCommentActuallyCreatesComment()")]
     test_type             test_method                               prod_type       prod_method                            invocations
112  CommentGatewayTest    void readRoundtripWorksWithFullData()     CommentGateway  java.util.List read(java.lang.String)  2
147  CommentsResourceTest  void postCommentActuallyCreatesComment()  Comment         java.lang.String getContent()          1
148  CommentsResourceTest  void postCommentActuallyCreatesComment()  CommentGateway  java.util.List read(java.lang.String)  2

We see that both test methods share calls to the production method read(...), but differ in the call of the method with the name getContent() in the class Comment, because only the test method postCommentActuallyCreatesComment() of CommentsResourceTest invokes it.
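The distance of ~0.1 can be retraced by hand from the invocation counts in the table above. This is a small sketch in plain Python; the variable names and the ordering of the two dimensions are my own choice.

```python
import math

# Invocation counts from the filtered table, ordered as [read(...), getContent()]
read_roundtrip = [2, 0]  # readRoundtripWorksWithFullData()
post_comment = [2, 1]    # postCommentActuallyCreatesComment()

dot = sum(a * b for a, b in zip(read_roundtrip, post_comment))  # 2*2 + 0*1 = 4
norm_a = math.sqrt(sum(a * a for a in read_roundtrip))          # sqrt(4) = 2
norm_b = math.sqrt(sum(b * b for b in post_comment))            # sqrt(5)

distance = 1 - dot / (norm_a * norm_b)
print(round(distance, 6))  # 0.105573
```

The single additional getContent() call tilts the second vector only slightly away from the first one, which is why the distance stays small.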

We can repeat this discussion for another method named postTwiceCreatesTwoElements() in the test class CommentsResourceTest:

public void postTwiceCreatesTwoElements() {
        Assert.assertEquals(5L, (long)comments.size());
        Assert.assertEquals("comment1", ((Comment)comments.get(0)).getContent());
        Assert.assertEquals("comment2", ((Comment)comments.get(1)).getContent());
        Assert.assertEquals("comment3", ((Comment)comments.get(2)).getContent());
        Assert.assertEquals("comment4", ((Comment)comments.get(3)).getContent());
        Assert.assertEquals("comment5", ((Comment)comments.get(4)).getContent());

Although the test method is a little bit awkward (with all those successive getContent() calls), we can see a slight similarity of ~20%. Here are the details on the production method calls as well:

In [6]:
invocations[
    (invocations.test_method == "void readRoundtripWorksWithFullData()") |
    (invocations.test_method == "void postTwiceCreatesTwoElements()")]
     test_type             test_method                            prod_type       prod_method                            invocations
112  CommentGatewayTest    void readRoundtripWorksWithFullData()  CommentGateway  java.util.List read(java.lang.String)  2
151  CommentsResourceTest  void postTwiceCreatesTwoElements()     Comment         java.lang.String getContent()          5
152  CommentsResourceTest  void postTwiceCreatesTwoElements()     CommentGateway  java.util.List read(java.lang.String)  1

Both test methods invoke the read(...) method, but only postTwiceCreatesTwoElements() calls getContent(), and it does so five times. This explains the dissimilarity between the two test methods.

In contrast, we can have a look at the method void keyWorks() from the test class ConfigurationFileTest, which has absolutely nothing in common (= dissimilarity 1.0) with the method readRoundtripWorksWithFullData() or its underlying calls to the production code.

    public void keyWorks() {
        assertEquals("InMemory", config.get("gateway.type"));

Looking at the corresponding invocation data, we see that the two test methods have no production methods in common.

In [7]:
invocations[
    (invocations.test_method == "void readRoundtripWorksWithFullData()") |
    (invocations.test_method == "void keyWorks()")]
     test_type              test_method                            prod_type          prod_method                             invocations
112  CommentGatewayTest     void readRoundtripWorksWithFullData()  CommentGateway     java.util.List read(java.lang.String)   2
153  ConfigurationFileTest  void keyWorks()                        ConfigurationFile  java.lang.String get(java.lang.String)  1
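As a cross-check, the three distances to readRoundtripWorksWithFullData() discussed in this section can be recomputed from the invocation counts listed in the filtered tables alone. The following sketch assumes that those tables show all production calls of the four test methods. The column order of the small matrix is my own choice.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Columns: [CommentGateway.read(String), Comment.getContent(), ConfigurationFile.get(String)]
vectors = np.array([
    [2, 0, 0],  # readRoundtripWorksWithFullData()
    [2, 1, 0],  # postCommentActuallyCreatesComment()
    [1, 5, 0],  # postTwiceCreatesTwoElements()
    [0, 0, 1],  # keyWorks()
])

# Distances from the first test method to the three others
distances_to_roundtrip = cosine_distances(vectors[:1], vectors[1:])[0]
print(distances_to_roundtrip.round(6))  # approx. 0.105573, 0.803884 and 1.0
```

The values agree with the first column of the distance table shown earlier. The last one is exactly 1.0 because the dot product of two vectors without a shared production method is 0.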


We’ve calculated the structural distances between test cases based on their invocations of production methods. We’ve seen that we can reduce a question about our complex software data to a question that can be answered with standard Data Science techniques.

In the next blog post, we’ll have a deeper look into how we can get some insights into the cohesion of all test classes. We’ll use our distance matrix to visualize and cluster the data by using some simple machine learning techniques.

I hope I could illustrate how the (dis-)similarity calculation of test cases works behind the scenes. If you have any questions, or if you spot shortcomings in my analysis, please let me know!

You can find this blog post as a Jupyter notebook on GitHub.

Calculating the Structural Similarity of Test Cases
