Creating or modifying an application involves many aspects, such as following best practices and applying design patterns to solve everyday problems. After writing the code, developers usually add unit tests and rely on tools like Sonar to track metrics such as code coverage and highlight potentially untested areas.
However, high test coverage does not guarantee low risk in production. A test may execute a line of code without actually verifying the outcome. A test may instantiate an object but never check all of its significant attributes. Coverage shows what code ran—not whether the tests would catch a fundamental defect.
This raises an important question: how can we measure not only how much code is tested, but how effective those tests really are?
Context of the Situation
Imagine a team responsible for several microservices. The team is recognized as one of the best, with strong code coverage and consistent promotion of good practices, such as using Sonar to detect problems and creating integration tests to verify interactions between the application and external resources, such as databases.
One day, a new feature was deployed to production—just a slight change in a few classes. Nothing that appeared risky, considering the large number of existing tests. However, a few minutes later, a significant issue surfaced, affecting the entire platform rather than just the application involved.
Shortly after the problem appeared, someone on the team analyzed the code, found the issue, and fixed it. During the investigation, it became clear that the tests were not reliable: they failed neither before nor after the change.
The following is the test that caused the problem:
@Test
public void should_return_a_country() {
    when(countryRepository.findByCode("AR"))
            .thenReturn(getCountryModel());

    // The response is never verified: there is no assertion on its attributes.
    CountryDTO response = countryService.getCountryByCode("AR");
}
The test calls the method but never asserts anything about the response, so changes to crucial attributes can slip through and cause errors in other applications.
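For comparison, a stronger version of the same test would verify the significant attributes of the response. The following is a minimal sketch in the style of the original test; it assumes that CountryDTO exposes getCode() and getName() accessors and that the stubbed model represents Argentina, both hypothetical details:

@Test
public void should_return_a_country_with_expected_attributes() {
    when(countryRepository.findByCode("AR"))
            .thenReturn(getCountryModel());

    CountryDTO response = countryService.getCountryByCode("AR");

    // Assert the outcome instead of only executing the code.
    // getCode() and getName() are assumed accessors on CountryDTO.
    assertNotNull(response);
    assertEquals("AR", response.getCode());
    assertEquals("Argentina", response.getName());
}

With assertions like these, any change that alters the returned attributes causes a test failure instead of slipping through unnoticed.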
What Is Mutation Testing?
Mutation testing is a technique for evaluating the effectiveness of a test suite by introducing small, controlled changes into the code and verifying whether the tests detect them. Instead of measuring only which lines of code are executed, mutation testing focuses on the quality of assertions and the ability of tests to catch meaningful defects.
Core Concepts
This technique rests on a handful of core concepts:
- Mutants: A mutant is a copy of the code with a minor, controlled modification, called a mutation. Examples include flipping a boolean condition, replacing an arithmetic operator, or removing a method call.
- Goal: Check whether the existing tests can detect these changes. If a test fails when a mutant is introduced, the test suite is sensitive enough to detect behavioral differences, a sign of strong, effective tests.
After creating the mutants and executing all the tests, each mutant ends up in one of two states:
- Killed: When at least one test fails after a change, it indicates that the test suite has effectively detected a behavioral difference.
- Survived: A mutant survives when all tests pass despite the injected modification. Typically, this happens when an assertion is weak or when test cases are missing.
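To make these states concrete, consider a minimal hypothetical example; isAdult is an invented method, and the mutation shown corresponds to a boundary change like the one Pitest's CONDITIONALS_BOUNDARY mutator performs:

// Original production code.
public boolean isAdult(int age) {
    return age >= 18;
}

// A mutant replaces '>=' with '>':
// return age > 18;

// This test kills the mutant: for age = 18, the original code returns
// true while the mutant returns false, so the assertion fails.
@Test
public void should_consider_18_an_adult() {
    assertTrue(isAdult(18));
}

// A test that only checks age = 30 would let the mutant survive,
// because both versions return true for that input.

In other words, a killed mutant proves that at least one test pins down the behavior the mutation changed, while a survivor reveals an input or assertion the suite is missing.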
Mutation Score
Mutation testing produces a percentage, the mutation score, that indicates how effective the tests are. It is calculated with the following formula:

Mutation Score = (Killed Mutants / Total Mutants) × 100

For example, 7 killed mutants out of 64 generated yields a score of roughly 11%. The way to interpret the percentage is:
- High scores (e.g., 80–95%) often suggest strong test coverage with meaningful assertions.
- Medium scores highlight opportunities to improve tests or simplify logic.
- Low scores (below ~50%) usually indicate insufficient or weak tests that may not protect against regressions.
It’s important to note that a high mutation score does not mean the application is bug-free.
Types of Mutations
Mutation testing tools create various modifications, known as mutations, that simulate potential defects in the code. While the specific mutation operators can differ based on the programming language or library used, they generally fall into three main categories: Decision mutations, Statement mutations, and Value mutations. Each category focuses on different aspects of the program’s behavior, helping assess how effectively the test suite validates the code’s logic, control flow, and data integrity.
Here is a brief explanation of each:
- Decision Mutations: Modify the conditions that control program flow. These mutations focus on expressions found in if, switch, while, and for statements, as well as boolean-returning operations.
- Statement Mutations: Add, remove, or alter entire statements. They test whether the test suite can detect situations where part of the logic disappears or behaves differently.
- Value Mutations: Target constants, literals, return values, and field assignments. They simulate defects caused by incorrect data produced or used by the program.
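The following hypothetical snippet shows where each category applies; the class, the method, and the auditPurchase helper are invented for illustration, and the mutants appear as comments:

public class PricingService {

    public int calculatePrice(int basePrice, int quantity) {
        if (quantity > 10) {             // Decision mutation: '>' becomes '>=', or the condition is negated
            basePrice = basePrice - 50;  // Value mutation: the constant 50 is replaced with another value
        }
        auditPurchase(basePrice);        // Statement mutation: this void method call is removed
        return basePrice;                // Value mutation: the return value is replaced (e.g., with 0)
    }

    private void auditPurchase(int total) {
        // Hypothetical side effect; if the statement mutant above survives,
        // no test verifies that the audit actually happens.
    }
}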
How to Implement It in an Application?
To implement mutation testing in a JVM ecosystem, several libraries are available, such as Pitest, Major, and MuJava. Pitest is usually the best option because it is actively maintained, performs well, integrates seamlessly with Maven, Gradle, JUnit, and TestNG, supports incremental analysis with extensive configuration options, generates clear HTML reports of killed and surviving mutants, and even provides plugins for SonarQube and other tools.
The examples in this article come from a GitHub repository; feel free to clone it and use it to learn about mutation testing.
To use this library, you first need to add the dependency to your application. The following block represents how to do it on a Maven project:
<!-- Mutation Test -->
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>${pitest-maven.version}</version>
    <configuration>
        <outputFormats>
            <outputFormat>HTML</outputFormat>
            <outputFormat>XML</outputFormat>
        </outputFormats>
        <targetClasses>
            <param>com.twa.flights.api.catalog.*</param>
        </targetClasses>
        <targetTests>
            <param>com.twa.flights.api.catalog.*</param>
        </targetTests>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>org.pitest</groupId>
            <artifactId>pitest-junit5-plugin</artifactId>
            <version>${pitest-junit5-plugin.version}</version>
        </dependency>
    </dependencies>
</plugin>
As a recommendation, regularly check for the latest version of this library on the official webpage or in a repository like Maven Central.
Pitest allows users to export execution results in multiple formats, including HTML, CSV, and XML. The relevance of each format depends on the report’s purpose. For example, the HTML format is ideal for those who want a simple view of execution results, including the number of mutations used. In contrast, the XML format helps integrate this information with other tools, such as Sonar, which can display mutation execution results.
As the previous code block shows, targetClasses indicates which packages of the source code will be mutated, while targetTests indicates which test classes will be executed against the mutants.
Executing mutation testing is just a matter of running a command like the following:
$ mvn clean package org.pitest:pitest-maven:mutationCoverage
[INFO] --- pitest:1.7.6:mutationCoverage (default-cli) @ api-catalog ---
[INFO] Root dir is : /home/asacco/Code/testing-your-test/api-catalog
[INFO] Found plugin : Default csv report plugin
[INFO] Found plugin : Default xml report plugin
[INFO] Found plugin : Default html report plugin
......
[INFO] Found shared classpath plugin : Default mutation engine
[INFO] Found shared classpath plugin : JUnit 5 test framework support
[INFO] Found shared classpath plugin : JUnit plugin
[INFO] Available mutators : EXPERIMENTAL_ARGUMENT_PROPAGATION,FALSE_RETURNS,TRUE_RETURNS,CONDITIONALS_BOUNDARY,CONSTRUCTOR_CALLS,EMPTY_RETURNS,INCREMENTS,INLINE_CONSTS,INVERT_NEGS,MATH,NEGATE_CONDITIONALS,NON_VOID_METHOD_CALLS,NULL_RETURNS,PRIMITIVE_RETURNS,REMOVE_CONDITIONALS_EQUAL_IF,REMOVE_CONDITIONALS_EQUAL_ELSE,REMOVE_CONDITIONALS_ORDER_IF,REMOVE_CONDITIONALS_ORDER_ELSE,RETURN_VALS,VOID_METHOD_CALLS,EXPERIMENTAL_BIG_DECIMAL,EXPERIMENTAL_BIG_INTEGER,EXPERIMENTAL_MEMBER_VARIABLE,EXPERIMENTAL_NAKED_RECEIVER,REMOVE_INCREMENTS,EXPERIMENTAL_RETURN_VALUES_MUTATOR,EXPERIMENTAL_SWITCH,EXPERIMENTAL_BIG_DECIMAL,EXPERIMENTAL_BIG_INTEGER
......
......
================================================================================
- Timings
================================================================================
> pre-scan for mutations : < 1 second
> scan classpath : < 1 second
> coverage and dependency analysis : < 1 second
> build mutation tests : < 1 second
> run mutation analysis : 4 seconds
--------------------------------------------------------------------------------
> Total : 5 seconds
--------------------------------------------------------------------------------
================================================================================
- Statistics
================================================================================
>> Line Coverage: 63/195 (32%)
>> Generated 64 mutations Killed 7 (11%)
>> Mutations with no coverage 54. Test strength 70%
>> Ran 15 tests (0.23 tests per mutation)
Enhanced functionality available at https://www.arcmutate.com/
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.757 s
[INFO] Finished at: 2025-11-27T10:07:37-03:00
[INFO] ------------------------------------------------------------------------
Execution time may vary depending on the size of the source code and the available resources on the machine where these tests are running.
To see the HTML report, open the target folder and look for the pit-reports folder. The report will look like the following image:
To reduce execution time, you can use historical execution data to detect changes in code and tests. In concrete terms, this only requires adding one parameter to the command, as shown in the following block.
$ mvn clean package org.pitest:pitest-maven:mutationCoverage -DwithHistory
......
.....
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.689 s
[INFO] Finished at: 2025-11-27T10:21:54-03:00
[INFO] ------------------------------------------------------------------------
The execution time drops from 9.7 to 5.6 seconds in a small project with a few classes. This approach is especially beneficial for applications with a lot of code and tests.
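If you want more control over where the historical data lives, for example to cache it between CI runs, pitest-maven also exposes the historyInputFile and historyOutputFile configuration parameters. The following is a sketch; the file path is just an example:

<configuration>
    <!-- Reuse and persist execution history between runs; the path is an example. -->
    <historyInputFile>${project.build.directory}/pitest.history</historyInputFile>
    <historyOutputFile>${project.build.directory}/pitest.history</historyOutputFile>
</configuration>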
A critical aspect of mutation testing is the ability to use multiple mutation engines. An engine is responsible for modifying the source code; some engines change all the logic inside a method or class, rather than just adjusting a method's parameters or its return value. By default, Pitest uses Gregor, which mutates individual statements inside a method, but it's possible to use Descartes, an extreme-mutation engine that works at the level of whole method bodies and therefore generates far fewer mutants. To use it, it's necessary to introduce some changes, like the following:
<!-- Mutation Test -->
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>${pitest-maven.version}</version>
    <configuration>
        <outputFormats>
            <outputFormat>HTML</outputFormat>
            <outputFormat>XML</outputFormat>
        </outputFormats>
        <targetClasses>
            <param>com.twa.flights.api.catalog.*</param>
        </targetClasses>
        <targetTests>
            <param>com.twa.flights.api.catalog.*</param>
        </targetTests>
        <mutationEngine>descartes</mutationEngine>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>org.pitest</groupId>
            <artifactId>pitest-junit5-plugin</artifactId>
            <version>${pitest-junit5-plugin.version}</version>
        </dependency>
        <dependency>
            <groupId>eu.stamp-project</groupId>
            <artifactId>descartes</artifactId>
            <version>1.3.2</version>
        </dependency>
    </dependencies>
</plugin>
As a recommendation, check the latest version of this library, as new versions are released at regular intervals.
What Are the Challenges and Costs?
Introducing mutation testing into an existing application is not free of challenges, as it requires understanding its limitations and trade-offs. With this in mind, it’s crucial to set realistic expectations for this type of testing. Some of the most relevant issues are:
- Execution time: Creating mutants and executing tests takes time because it involves generating source code variations and running the tests to validate their effects. In large applications with hundreds of tests or large codebases, execution time can increase drastically. In some cases, this can be a barrier to running this type of testing in a CI pipeline.
- Flaky or unstable tests: In some cases, tests pass or fail sporadically due to issues such as concurrent access to external resources. These scenarios can affect mutation execution and lead to false positives, so it's essential to either exclude these tests or find a way to mitigate the problem.
- Resource consumption: In addition to the time required to create the mutants and execute the tests, there are other resource-related issues, such as CPU and memory usage. This can affect not only the pipeline running mutation tests but also other jobs running in the same CI environment. The big challenge here is to reduce resource consumption, limit the mutations, or shorten the execution time of each pipeline.
- Complex configuration: Most tools or libraries offer many parameters for tuning the configuration to each application. The first attempts to use these tools can lead to unrealistic expectations about the achievable performance and results. Finding the right balance between performance, accuracy, and execution time often requires several iterations.
None of these issues invalidates the benefits of mutation testing, but it’s essential to develop a plan to mitigate or reduce their impact.
Which Strategies Exist for Adopting It?
Adopting a new technique or tool involves several considerations, especially in an existing application with many classes and tests. There is no magic formula for implementing mutation testing without pain, but there are different approaches that keep the problems to a minimum. Some of the most relevant strategies are:
- Start small: This approach focuses on familiarizing yourself with the tool by scanning just one package, module, or critical flow to identify potential implementation issues. It's especially useful when the application has a large number of unit tests.
- Focus on high-risk code first: The critical code or flows are the ones everyone on a team or in a company most wants to know are working. Adding mutation testing to the few classes that represent those flows can be done with minimal impact. Once everything is in order, the configuration can be incrementally updated to scan more packages.
- Limit the scope: At some point, execution time becomes a critical factor, so a possible approach is to limit the number or types of mutations that can be created. A good starting point is to include only the most relevant mutators during execution and measure the impact before enabling all of them (see the configuration sketch after this list).
- Restrict execution: These tests can slow down delivery, so a possible approach is to restrict them to local runs or to the main pipeline that deploys to production environments.
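As an illustration of limiting the scope, Pitest exposes the mutators and threads configuration parameters. The following sketch enables only a reduced set of mutators and parallelizes the analysis; the specific selection and thread count are just examples:

<configuration>
    <!-- Run only a reduced set of mutators to shorten execution time. -->
    <mutators>
        <mutator>CONDITIONALS_BOUNDARY</mutator>
        <mutator>NEGATE_CONDITIONALS</mutator>
        <mutator>VOID_METHOD_CALLS</mutator>
    </mutators>
    <!-- Number of threads for the analysis; adjust to the CI agent's resources. -->
    <threads>4</threads>
</configuration>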
It is possible to use one of these strategies or combine them to achieve better results, but in all cases, the choice depends on the size of the application and the number of unit tests.
What’s Next?
There are many resources on unit testing and mutation testing; the following is just a short list:
- Latent Mutants: A Large-Scale Study on the Interplay Between Mutation Testing and Software Evolution, by Jeongju Sohn
- Practical Mutation Testing at Scale: A View From Google
- Mutation Testing in Evolving Systems: Studying the Relevance of Mutants to Code Evolution, by Milos Ojdanic
- Does Mutation Testing Improve Testing Practices?, by Goran Petrovic
Other resources can help you understand testing-related concepts in more depth:
- Testing Web APIs by Mark Winteringham
- Unit Testing Principles, Practices, and Patterns by Vladimir Khorikov
- Software Testing with Generative AI by Mark Winteringham
Consider this just a small sample of the available resources. If something is unclear, look for another resource that explains it better.
Conclusion
Creating tests for an application does not guarantee that nothing will go wrong, and mutation testing is not a silver bullet that can detect every possible issue. However, it provides a valuable and objective way to evaluate how effective existing tests really are.
A practical approach is to adopt it gradually: start with a small number of packages or a limited mutation scope, measure the effect on the build and pipeline, and then expand its use as appropriate.
Used pragmatically, mutation testing can significantly improve test quality and increase confidence in the application’s behavior without overwhelming the development process.
