Testing

Section 7.3 Testing

Testing is a technique for finding bugs in programs. Testing means to construct inputs to the program (or a part of a program such as a module or a single function), the so-called subject under test, executes the program on these inputs, observes its behavior and judges if the outcome was correct. If the outcome was not correct we say that the test failed. If a test failed, we discovered a bug in the program that we can repair. So, a failing test is a witness of a bug. Having an appropriate (in extent and quality) test suite is an important ingredient for increasing the quality for software. However, using tests, we cannot prove the correctness of a program. If all tests of a test suite pass, it does not imply that the program is correct¹⁵:

Testing shows the presence, not the absence of bugs.
―Edsger Dijkstra

Definition 7.3.1. Test Case.

Given a program \(\prgltr\) and a specification \((P,Q)\text{.}\) A state \(\sigma\) that satisfies the precondition is a test of \(\prgltr\text{.}\) A test is passed if \(\config\prgltr\sigma\terminates\config\CTerm{\sigma'}\) and \(\sigma'\models Q\text{.}\) If \(\sigma\) is a failure we say the test failed. A set of tests is called a test suite.

Example 7.3.2. Maximum.

Let us consider the specification of the maximum program from (7.2) and the following C0 program:

if (x > y)
  m = x;
else
  m = x;

This program passes the test \(\passing=\{\CVar x\mapsto 2,\CVar y\mapsto 1\}\) and fails the test \(\failing=\{\CVar x\mapsto 1,\CVar y\mapsto 2\}\text{.}\)

In a concrete programming language, it is hard to just specify a state. Therefore, in testing, we typically use a program to create the state that constitutes the test. This program is then often also called “test”. This test then calls the subject under test and checks if the postcondition is satisfied:

void test1() {
  int x = 2;
  int y = 1;
  int m = max(x, y);
  assert (m == x);
}

It is very common to test individual modules/units of a program separately. We then speak of unit testing. A unit can be a single function or multiple functions that implement parts of a self-contained part of the code. For example, for a binary search tree (see Subsection 5.2.2), a unit test could be inserting an element into the tree and checking that it has indeed been inserted at the right place. A unit test could also be inserting a couple of elements and the iterating over the tree to see if all inserted elements have been visited.

Unit tests are very helpful to isolate failures in the individual modules of a larger system. Without unit tests, bugs in module A may very easily cause failures when running code of module B. As discussed in Section 7.2, finding the defect is harder if the failure is far away from the cause. Unit tests help because they consider only one module and therefore make finding the bug's cause easier. Furthermore, they help in preventing the introduction of regressions: When working with a version control system, running unit tests can be automated and new commits that fail existing unit tests can be automatically prevented from entering the central code repository.

In addition to unit tests one also uses integration tests to test the interaction among different units. System tests test the entire program end-to-end. The notions of integration and system test do not denote specific test techniques but delineate the extent of a test.

In general, there are two different paradigms of testing:

Specification-based Testing (black-box testing): Tests are written only with respect to a specification. The code of the subject to test does not have an influence on the test itself.

The advantage is that black-box testing can, because it is based on a specification, uncover missing code if some cases have not been considered or implemented. On the other hand, because it ignores the structure of the program to test, it may be that some code paths are not covered by a black-box test suite.
Structural Testing (“white-box”/glass-box testing): uses the structure of the source code to design tests that execute as much code of the subject under test as possible. Based on specific metrics (see Subsection 7.3.3) one tries to cover (i.e. execute) as much of the code of the subject under test as possible. However, even if a program passes a test suite with good coverage, it doesn't not necessarily mean that it is free of significant bugs because white-box testing can only test what is there and can't test code that is missing.

Subsection 7.3.1 The Oracle Problem

A mechanism that decides if a test passes or fails is called a test oracle. In the ideal setting we have a specification that can serve as the test oracle as defined in Definition 7.3.1. In practice it is often the case that we do not have a formal specification (as defined in Definition 7.1.5). Often, the specification is only available in prose or even only available in the brains of the developer. However, some parts of the specification may be documented using assertions in the program as suggested in Remark 7.2.7.

In such cases, a typical specification is \((\ltrue,\ltrue)\text{.}\) Although it looks very weak, this specification at least requires that the program must neither get stuck, nor abort, nor diverge which is sort of the minimum requirement for a properly functioning program. If some parts of the specification are documented using assertions in the source code, these assertions will make tests fail that trigger them which results in a failure whose cause we can fix.

Note that the precondition \(\ltrue\) is also satisfied by states that do not satisfy the actual implicit specification that is only available in prose or maybe even only implicitly available. When executing the program on such a state we may observe behavior that violates the postcondition \(\ltrue\text{,}\) i.e. getting stuck, abort, failure. However, in that case this does not necessarily indicate a bug because the initial state violated the implicit precondition. Therefore, good software aborts if preconditions are violated.

Subsection 7.3.2 Black-box Testing

As mentioned before, black-box tests are written with respect to a specification. This means they can be written even before the subject under test has even been created. Let us discuss a typical black-box scenario by means of an example (for more details, refer to [13]).

Example 7.3.3.

Given is the specification of a function that is supposed to compute the roots of a quadratic polynomial:

“The function unsigned roots(double a, double b, double c, double* r); shall compute the real roots of the polynomial \(ax^2+bx+c\text{.}\) It shall return the number of real roots. If the polynomial has more than two roots, the value UINT_MAX shall be returned. The parameter r points to an array that can hold at least two doubles. The function roots places the found roots into the array pointed to by r.”

Based on the mathematics of quadratic functions, we know that there are several different cases that we must cover.

The polynomial is equal to the zero polynomial (\(a=b=c=0\)). Then, roots has to return UINT_MAX
The polynomial is linear (\(a=0\)). Then, it has one (\(b\ne 0\)) or no (\(b=0\)) root.
The polynomial is quadratic (\(a\ne 0\)). Then there is either none, one, or two roots.

It is typical for black-box testing to partition the input space along such case distinctions and then create suitable unit tests for each partition. For our example, the following test suite is suitable but not really sufficient. You can improve the test suite by developing more tests that cover more partitions.

void test_null(void) {
    unsigned res = roots(0, 0, 0, NULL);
    assert(res == UINT_MAX);
}

void test_linear(void) {
    double r[2];
    unsigned res = roots(0, 2, 2, r);
    assert(res == 1 && r[0] == -1);
}

void test_linear_none(void) {
    double r[2];
    unsigned res = roots(0, 0, 2, r);
    assert(res == 0);
}

Subsection 7.3.3 White-box Testing and Coverage

When doing structure-based testing, the coverage plays a crucial role. The coverage of a test designates the part of the subject under test that is being executed by the test. There are different kinds of coverage metrics that we briefly discuss here.

Definition 7.3.4. Statement Coverage.

A statement of the program is covered by a test \(\sigma\text{,}\) if the statement is executed in the trace of \(\sigma\text{.}\)

Remark 7.3.5.

In C0, branch and statement coverage is the same: for every branch there is a statement whose execution also implies the execution of the corresponding branch. For languages that allow for ifs without elses, this does no longer hold:

r = 0;
if (x < 0)
    r = -x;
/* next statement */

Remark 7.3.6.

The set of paths through a program may be enormously large if not infinite (if the program has loops). Therefore, it is very hard to achieve high path coverage in practice.

Example 7.3.7.

Let us consider the following, erroneous implementation of the specification of Example 7.3.3.

unsigned roots_buggy(double a, double b, double c, double *r) {
    if (a != 0) {                        // 1
        double p = b / a;                // 2
        double q = c / a;                // 3
        double d = p * p / 4 - q;        // 4

        if (d > 0) {                     // 5
            r[0] = -p / 2 + sqrt(d);     // 6
            r[1] = -p / 2 - sqrt(d);     // 7
            return 2;                    // 8
        }

        r[0] = -p / 2;                   // 9
        return 1;                        // 10
    }

    if (b != 0) {                        // 11
        r[0] = -b / c;                   // 12
        return 1;                        // 13
    }

    return 0;                            // 14
}

The tests of Example 7.3.3 cover the following statements: 1, 11, 12, 13, 14. The covered paths are:

`test_null`	1,11,14
`test_linear`	1,11,12,13
`test_linear_none`	1,11,14

The first test fails and exposes the bug that the zero-polynomial case is missing. The second test passes but does not find the bug in statement 12 which should be r[0] = -c / b;. The issue here is that the input \(\{\CVar a\mapsto 0, \CVar b\mapsto 2, \CVar\mapsto 2\}\) is not well chosen because it cannot detect the swapping of divisor and dividend. A better test would be:

void test_linear(void) {
    double r[2];
    unsigned res = roots(0, 4, 2, r);
    assert(res == 1 && r[0] == -0.5);
}

test_linear_none passes because the code handles the case of a constant polynomial unequal to 0 correctly. None of the tests exposes that the quadratic case is not correctly implemented: If the discriminant is less than 0, the function erroneously returns a root with the value of \(-p/2\text{.}\)

Also, this test suite does not have a good statement coverage:

\begin{equation*} \frac{|\{1,11,12,13,14\}|}{|\{1,\dots,14\}|}=\frac 5{14}\approx 35,7\% \end{equation*}

Remark 7.3.8.

Test suites that don't cover all statements are unsatisfactory.

Once again, let us emphasize that high coverage does not detect the situation that some functionality has not been implemented because it just means that a high proportion of the code that is there is covered. So, high coverage is necessary for a good test suite but not sufficient.

Subsection 7.3.4 Fuzzing

Finally, we would like to briefly mention fuzz testing. In fuzzing one creates a program that more or less randomly creates inputs for the subject under test and checks if the output crashes or produces a sensible output. Crucial to fuzzing is the ability to generate inputs that satisfy the precondition. For example, producing a text input that adheres to a specific grammar or file format.

Fuzzing is a black-box technique because the actual implementation has no influence on how the inputs are generated. Depending on how complicated the specification is, it can be very hard to come up with an appropriate test oracle for a fuzz tester. So, very often, fuzzing is used to test for specifications like “the program doesn't abort or get stuck”.

Furthermore, a very common thing is to use another (maybe simpler and maybe less efficient) implementation of a program A to fuzz test a more complicated implementation of the same problem B. For example, one could use a simple list-based implementation of a set to fuzz test a more complicated implementation of a set that uses binary trees.

unless in the unrealistic theoretical case where the test case really covers every input/output pair.

An Introduction to Imperative Programming

Search Results: