Test suite effectiveness metric evaluation: what do we know and what should we do?

by   Peng Zhang, et al.

Comparing test suite effectiveness metrics has always been a research hotspot. However, prior studies have different conclusions or even contradict each other for comparing different test suite effectiveness metrics. The problem we found most troubling to our community is that researchers tend to oversimplify the description of the ground truth they use. For example, a common expression is that "we studied the correlation between real faults and the metric to evaluate (MTE)". However, the meaning of "real faults" is not clear-cut. As a result, there is a need to scrutinize the meaning of "real faults". Without this, it will be half-knowledgeable with the conclusions. To tackle this challenge, we propose a framework ASSENT (evAluating teSt Suite EffectiveNess meTrics) to guide the follow-up research. In nature, ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. First, materialize the ground truth for determining the real order in effectiveness among test suites. Second, generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, for the benchmark test suites, generate the MTE order in effectiveness by the metric to evaluate (MTE). Finally, calculate the agreement indicator between the two orders. Under ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score metrics and code coverage metrics. Our results show that, based on the real faults, mutation score and subsuming mutation score are the best metrics to quantify test suite effectiveness. Meanwhile, by using mutants instead of real faults, MTEs will be overestimated by more than 20


page 1

page 2

page 3

page 4


DeepMutation: A Neural Mutation Tool

Mutation testing can be used to assess the fault-detection capabilities ...

Mutant reduction evaluation: what is there and what is missing?

Background. Many mutation reduction strategies, which aim to reduce the ...

Does mutation testing improve testing practices?

Various proxy metrics for test quality have been defined in order to gui...

Is the Stack Distance Between Test Case and Method Correlated With Test Effectiveness?

Mutation testing is a means to assess the effectiveness of a test suite ...

Magma: A Ground-Truth Fuzzing Benchmark

High scalability and low running costs have made fuzz testing the de fac...

A Brief Survey on Oracle-based Test Adequacy Metrics

Even though code coverage is a widespread and popular test adequacy metr...

Using causal inference and Bayesian statistics to explain the capability of a test suite in exposing software faults

Test effectiveness refers to the capability of a test suite in exposing ...

Please sign up or login with your details

Forgot password? Click here to reset