Testing high-performance computing applications


High-performance computing applications run on massively parallel supercomputers and consist of concurrent programs designed using multi-threaded, multi-process models. The applications may consist of various constructs (threads, local processes, distributed processes, etc.) with varying degrees of parallelism. Although high-performance concurrent programs use design patterns, models and principles similar to those of sequential programs, unlike sequential programs they typically demonstrate non-deterministic behavior. The probability of bugs increases with the number of interactions between the various parallel constructs. Race conditions, data races, deadlocks, missed signals and livelock are common error types.


Challenges

Parallel programs can be divided into two general categories: explicitly and implicitly parallel. Using parallel language constructs for process creation, communication and synchronization makes an application explicitly parallel; using a tool or parallelizing compiler to convert a serial program into a parallel one makes it implicitly parallel. Both categories are equally bug-prone.
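
As an illustration (hypothetical code, assuming POSIX threads), the explicit form spells out thread creation and joining directly, whereas the implicit form would leave a plain serial loop for a parallelizing compiler to transform:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Explicitly parallel: the programmer creates and joins the threads. */
void *work(void *arg) {
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    /* An implicitly parallel version would keep a plain serial loop and
       rely on a parallelizing compiler to distribute the iterations. */
    return 0;
}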

Heisenbugs

Concurrent applications should execute correctly on every possible thread schedule in the underlying operating system. However, traditional testing methods detect few bugs, chiefly because of the Heisenbug problem. [1] A Heisenbug is an error that changes or disappears when an attempt is made to isolate and probe it via a debugger, or by adding constructs such as synchronization requests or delay statements.
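
As a minimal sketch (hypothetical code, assuming POSIX threads), the following program usually loses updates to an unprotected counter, yet probing it with a delay statement tends to make the failure vanish:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static long counter = 0;   /* shared, intentionally unprotected */

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        /* Uncommenting the next line to "probe" the race perturbs the
           interleaving and usually hides the lost updates: */
        /* usleep(1); */
        counter++;         /* racy read-modify-write */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected 200000, got %ld\n", counter);  /* often less */
    return 0;
}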

Non-repeatability

Another issue arises from the unpredictable behavior of the scheduler. Differences in system load influence scheduler behavior, and this behavior cannot be changed manually. To counter this indeterminism, the program must be executed many times under various execution environments. Even then, it is not guaranteed that a bug can be reproduced: most of the time the program runs correctly, and the bug surfaces only when specific conditions are met. As a result, the non-repeatability of concurrent programs is a major roadblock to detecting errors. As an example, consider the following.

void thread1(void *t) {
    mutex_lock(a);
    mutex_lock(b);
    // do some work ...
    mutex_unlock(b);
    mutex_unlock(a);
}

void thread2(void *t) {
    mutex_lock(b);
    mutex_lock(a);
    // do some work ...
    mutex_unlock(a);
    mutex_unlock(b);
}

Because the two threads acquire the locks in opposite orders, this code can clearly deadlock. Yet it deadlocks only in those runs where each thread acquires its first lock before the other releases both; in all other runs it completes successfully.
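
A self-contained version of the example (a sketch assuming POSIX threads, with a and b as pthread_mutex_t objects) makes the non-repeatability concrete: the join hangs only on runs where the scheduler lets each thread grab its first lock before the other finishes:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *t) {
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);
    /* do some work ... */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *thread2(void *t) {
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);
    /* do some work ... */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);   /* hangs here on a deadlocking run */
    pthread_join(t2, NULL);
    printf("completed without deadlock this time\n");
    return 0;
}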

Probe effect

The probe effect is seen when delay statements are inserted in parallel programs that have synchronization problems. Like Heisenbugs, this effect alters the program's behavior in ways that may obscure problems, and detecting the source of a probe effect is a great challenge in testing parallel applications. The main difference between the probe effect and Heisenbugs is that Heisenbugs appear when additional delay statements and/or synchronization requests are added to the concurrent application during testing, while the probe effect appears when the developer adds delay statements to a concurrent application with poor synchronization.

Testing strategies

The differences between sequential and concurrent programs lead to differences in their testing strategies. Strategies for sequential programs can be modified to make them suitable for concurrent applications, and specialized strategies have also been developed. Conventionally, testing includes designing test cases and checking that the program produces the expected results; thus, errors in specification, functionality, etc. are detected by running the application and subjecting it to methods such as functional testing and white-box, black-box and grey-box testing. [2] Static analysis is also used for detecting errors in high-performance software, using methods such as data-flow analysis, control-flow analysis, cyclomatic complexity, thread escape analysis and static slicing analysis. Using static analysis before functionality testing can save time: it can detect what the error is and find the error source. Static analysis techniques can detect problems such as a lack of synchronization or improper synchronization, and can predict the occurrence of deadlocks and post-wait errors in rendezvous requests.


Deterministic scheduling / reproducible testing

The indeterminacy of scheduling has two sources. [1]

1. Time slicing
The scheduler switches contexts at equal intervals of time. This interval depends on the speed of individual processors, the state of the memory-cache hierarchy and the system load. Even on the same processor under the same load, the interval varies slightly due to minor variations in the frequency of the system clock.
2. Page faults
The scheduler suspends a program if a page fault occurs, letting other threads proceed while the system fetches the page. As the occurrence of page faults depends upon what other processes are running, the timing of a context switch becomes indeterminate.

To make concurrent programs repeatable, an external scheduler is used. The program under test is instrumented to add calls to this scheduler; such calls are made at the beginning and end of each thread, as well as before every synchronization request. This scheduler selectively blocks threads of execution by maintaining a semaphore associated with each thread, such that only one thread is ready for execution at any given time. Thus it converts a parallel, non-deterministic application into a serial execution sequence in order to achieve repeatability. The number of scheduling decisions made by the serializing scheduler is given by:

(N * K / P) * (N + P)!

where
N = number of threads
K = number of potential context-switch points
P = budget of pre-emptive context switches
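
As a minimal sketch of this serialization (hypothetical code, assuming POSIX threads and semaphores; a real tool would vary the handoff order between runs rather than fixing a round-robin schedule):

#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 2
#define STEPS 3

static sem_t turn[NTHREADS];   /* one semaphore per thread */

void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    for (int step = 0; step < STEPS; step++) {
        sem_wait(&turn[id]);   /* scheduler call: block until our turn */
        printf("thread %d runs step %d\n", id, step);
        sem_post(&turn[(id + 1) % NTHREADS]);   /* hand the turn on */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        sem_init(&turn[i], 0, i == 0 ? 1 : 0);  /* thread 0 starts */
    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

The sem_wait/sem_post pair plays the role of the instrumented scheduler call: only one thread at a time proceeds past a scheduling point, so every run produces the same serialized execution sequence.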

Feedback-directed testing

To obtain more accurate results using deterministic scheduling, an alternate approach can be chosen. A few properly placed pre-emptions in the concurrent program can detect bugs related to data races. [1] Bugs tend to be found in clusters: the existence of one bug raises the probability of more bugs in the same region of code. Thus each pass of the testing process identifies sections of code that contain bugs, and the next pass scrutinizes those sections more thoroughly by adding scheduler calls around them. Allowing the problematic locations to execute in a different order can reveal unexpected behavior.

This strategy ensures that the application is not prone to the probe effect. Sources of error that cause the probe effect range from task-creation issues to synchronization and communication problems. Timing-related tests have the following requirements: [3]

The number of test cases per input data set is:

nC1 + nC2 + … + nCn = 2^n − 1
where n = total number of synchronization, process-creation and communication calls.

This count grows exponentially with n; for example, a program with n = 10 such calls already yields 1,023 test cases per input data set. To reduce the number of test cases, either the Deterministic Execution Method (DET) or the Multiple Execution Technique (MET) is used. Various issues must be handled:

Addition of delays is a straightforward task: a typical sleep() statement can be used to insert delays.
The insertion locations are known as delay-points. As the objective of timing-related test cases is to detect synchronization, communication and thread-creation errors, delay statements are added just in front of these statements, as in the sketch below.
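
For instance (a hypothetical fragment assuming POSIX threads), a delay statement is placed immediately before the synchronization request it is meant to perturb:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared;

void *worker(void *arg) {
    sleep(1);                /* delay-point: inserted just before the
                                synchronization request under test */
    pthread_mutex_lock(&m);  /* the synchronization call being probed */
    shared++;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}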

Advantages

  • Easy to implement on multiple processors, without any need to order the synchronization requests.
  • No need to generate a concurrency graph.
  • More effective for fault detection.
  • The total number of test cases is smaller, yet code coverage is greater, due to static analysis.

All Du-Path testing

This method applies the concept of the definition-use (du) pair in order to determine the paths to be tested: each pairing of a definition of a variable with a use reachable from it yields a du-path that a test must exercise.
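
As a small illustration (hypothetical code; x stands for any variable of interest), the definitions and uses below induce the du-paths a test suite must cover:

#include <stdio.h>

int main(int argc, char **argv) {
    int x = argc;        /* d1: definition of x */
    if (x > 1)           /* u1: use of x; covers du-path (d1, u1) */
        x = x - 1;       /* u2: use of x; d2: redefinition of x */
    printf("%d\n", x);   /* u3: covering (d1, u3) needs the branch untaken,
                            covering (d2, u3) needs it taken */
    return 0;
}

Running the program both with and without command-line arguments exercises the two distinct du-paths reaching the final use.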

Verification strategies

Software verification is a process that proves that software works correctly and performs the intended task as designed.

Test calculations

Input is given to the system to generate a known result. This input-result pair can be obtained from previous empirical results and/or manual calculations. [4] This is a system-level test that can be performed only when all relevant modules are integrated, and it only shows that bugs exist; it offers no detailed information about the number of bugs or their location or nature.
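
A sketch of such a check (the computation, reference value and tolerance below are hypothetical): run the application on a known input and compare the output against the previously recorded result, allowing a small floating-point tolerance:

#include <math.h>
#include <stdio.h>

/* Stand-in for the full application run on a known input. */
double run_application(double input) {
    return input * input;   /* hypothetical computation */
}

int main(void) {
    const double known_input = 3.0;
    const double expected = 9.0;   /* from earlier empirical runs
                                      or manual calculation */
    const double tol = 1e-9;
    double result = run_application(known_input);
    if (fabs(result - expected) > tol)
        printf("FAIL: got %.12g, expected %.12g\n", result, expected);
    else
        printf("PASS\n");
    return 0;
}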

Symmetry tests

These tests are primarily used for scientific simulations. The output of the simulation often cannot be predicted. Since these simulations attempt to describe scientific laws, any symmetries in the theory must be honored by the simulation. Thus, by varying input conditions along the lines of symmetry and then comparing the obtained results with externally derived results, the existence of bugs can be detected. [4]
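
As a sketch (the simulation kernel below is hypothetical), if the simulated law is mirror-symmetric, then simulating a reflected input must produce the reflected output, and any mismatch beyond floating-point tolerance indicates a bug:

#include <math.h>
#include <stdio.h>

#define N 8

/* Stand-in simulation step; a mirror-symmetric law means
   simulate(reflect(in)) == reflect(simulate(in)). */
void simulate(const double *in, double *out) {
    for (int i = 0; i < N; i++)
        out[i] = 0.5 * (in[i] + in[N - 1 - i]);   /* hypothetical kernel */
}

int main(void) {
    double in[N], mirrored[N], out1[N], out2[N];
    for (int i = 0; i < N; i++) in[i] = i + 1.0;
    for (int i = 0; i < N; i++) mirrored[i] = in[N - 1 - i];

    simulate(in, out1);
    simulate(mirrored, out2);

    for (int i = 0; i < N; i++)   /* compare against reflected output */
        if (fabs(out1[i] - out2[N - 1 - i]) > 1e-12) {
            printf("symmetry violated at index %d\n", i);
            return 1;
        }
    printf("symmetry test passed\n");
    return 0;
}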

Figure 1: Data Distribution

In scientific computing, most data lies in the central region of the simulation conditions; as a result, it is difficult to perform boundary-value testing [2] with real-time experimental data. The center of the simulation (for example, the data value 10 in Figure 1) is therefore shifted to one of the boundaries so as to test the boundary condition effectively.

Parallel implementation tests

Parallel implementation tests are usually used for applications that use distributed-memory programming models such as message passing. These tests are often applied to programs that use regular grids of processors. [4]

Global summation

Many parallel databases use distributed parallel processing to execute queries. When executing an aggregate function such as sum, the following strategy is used: [5]

  • Compute a partial sum locally and concurrently at each processor, using the data in its associated disk partition.
  • Add these local subtotals to obtain the final result.

The final result may contain rounding error, because each processor independently rounds off its local result. One test is to ensure that such errors do not occur, which requires showing that the aggregate sum is decomposition-independent. An alternate summation scheme sends all of the individual values to one processor for serial summation; this result can be compared with the distributed result to ensure consistency, as in the sketch below.
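
A sketch of this consistency check using MPI (assuming, for brevity, one value per processor): the distributed reduction is compared on one processor against a serial summation of the gathered values:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 1.0 / (rank + 1);   /* hypothetical partial sum */

    /* Distributed result: tree-shaped reduction across processors. */
    double distributed;
    MPI_Reduce(&local, &distributed, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    /* Reference result: gather every value and sum serially on rank 0. */
    double *all = NULL;
    if (rank == 0) all = malloc(size * sizeof(double));
    MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        double serial = 0.0;
        for (int i = 0; i < size; i++) serial += all[i];
        printf("distributed=%.17g serial=%.17g diff=%.3g\n",
               distributed, serial, fabs(distributed - serial));
        free(all);
    }
    MPI_Finalize();
    return 0;
}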

Tools

Microsoft CHESS

This tool eliminates non-determinacy using deterministic scheduling. It tracks the schedule paths executed previously and guarantees that a new schedule path is executed on each run. [1]


References

  1. Thomas Ball; Sebastian Burckhardt; Peli de Halleux; Madanlal Musuvathi; Shaz Qadeer (May–June 2011). "Predictable and Progressive Testing of Multithreaded Code". IEEE Software. 28 (3): 75–83. doi:10.1109/MS.2010.64.
  2. Myers, Glenford J. (2004). The Art of Software Testing (2nd ed.). John Wiley and Sons. p. 256. ISBN 978-0-471-46912-4.
  3. Cheer-Sun Yang; Lori L. Pollock (1997). "The Challenges in Automated Testing of Multithreaded Programs". Proceedings of the 14th International Conference on Testing Computer Software: 157–166. CiteSeerX 10.1.1.52.8791.
  4. Stephen Booth; David Henty (2004). "Verification strategies for high performance computing software". Software Engineering for High Performance Computing System (HPCS) Applications, W3S Workshop at the 26th International Conference on Software Engineering. pp. 24–26. doi:10.1049/ic:20040413. ISBN 0-86341-418-4.
  5. Korth, Henry. Database System Concepts. McGraw-Hill.