Global Arrays

Global Arrays (GA)
Paradigm: parallel, one-sided message passing, imperative (procedural), structured
First appeared: 1994
Stable release: 5.8.2 / November 2022
Typing discipline: static, weak
OS: Cross-platform
Website: hpc.pnl.gov/globalarrays/

Global Arrays (GA) is a library developed by scientists at Pacific Northwest National Laboratory for parallel computing. GA provides a friendly API for shared-memory-style programming with multidimensional arrays on distributed-memory computers. The GA library is a predecessor to the global address space (GAS) languages currently being developed for high-performance computing. [1] [2] [3] [4]

The GA toolkit includes additional libraries, among them a Memory Allocator (MA), the Aggregate Remote Memory Copy Interface (ARMCI), and functionality for out-of-core storage of arrays (ChemIO). Although GA was initially developed to run with TCGMSG, a message-passing library that predates the MPI standard (Message Passing Interface), it is now fully compatible with MPI. GA includes simple matrix computations (matrix-matrix multiplication, LU solve) and works with ScaLAPACK. Sparse matrices are supported, but the implementation is not yet optimal.
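The programming model can be illustrated with GA's C bindings: a program creates a distributed multidimensional array and then reads or writes arbitrary patches of it with one-sided put/get calls, regardless of which process owns the data. The sketch below is illustrative rather than definitive; routine names such as NGA_Create, NGA_Put and NGA_Get follow the published C API, but the memory sizes, array dimensions and patch bounds are arbitrary choices.

/* Minimal GA sketch (C bindings): create a distributed 2-D array,
   write a patch with a one-sided put, and read it back with a get.
   Sizes and bounds are illustrative. */
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);            /* GA runs on top of MPI */
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);  /* local scratch memory used by GA */

    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};           /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
    GA_Zero(g_a);

    /* process 0 writes a 10x10 patch it does not necessarily own */
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    double buf[100];
    for (int i = 0; i < 100; i++) buf[i] = (double)i;
    if (GA_Nodeid() == 0) NGA_Put(g_a, lo, hi, buf, ld);
    GA_Sync();                         /* make the update globally visible */

    NGA_Get(g_a, lo, hi, buf, ld);     /* any process can read the patch */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}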

GA was developed by Jarek Nieplocha, Robert Harrison, R. J. Littlefield, Manoj Krishnan, and Vinod Tipparaju. The ChemIO library for out-of-core storage was developed by Jarek Nieplocha, Robert Harrison and Ian Foster.

The GA library is incorporated into many quantum chemistry packages, including NWChem, MOLPRO, UTChem, MOLCAS, and TURBOMOLE. It is also incorporated into the subsurface transport code STOMP. [5]

The GA toolkit is free software, licensed under its own custom license.

Related Research Articles

<span class="mw-page-title-main">Supercomputer</span> Type of extremely powerful computer

A supercomputer is a type of computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, supercomputers have existed which can perform over 10^17 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10^11) to tens of teraFLOPS (10^13). Since November 2017, all of the world's fastest 500 supercomputers run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

The Message Passing Interface (MPI) is a standardized and portable message-passing specification designed to function on parallel computing architectures. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several open-source MPI implementations, which fostered the development of a parallel software industry and encouraged the development of portable and scalable large-scale parallel applications.
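A hedged C sketch of the message-passing style that MPI standardizes is given below; it uses only routines defined by the MPI standard and simply sends one integer from rank 0 to rank 1.

/* Minimal MPI sketch: rank 0 sends an integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* dest 1, tag 0 */
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", msg);
    }

    MPI_Finalize();
    return 0;
}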

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
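A hedged C sketch of this directive-based style is shown below: one pragma splits a loop across threads and combines a reduction, assuming an OpenMP-enabled compiler (for example, the -fopenmp flag with GCC).

/* Minimal OpenMP sketch: parallelize a loop and reduce a sum. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   /* iterations split across threads */
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        b[i] = 2.0 * a[i];
        sum += b[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}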

Checkpointing is a technique that provides fault tolerance for computing systems. It consists of saving a snapshot of the application's state so that the application can restart from that point in case of failure. This is particularly important for long-running applications executed on failure-prone computing systems.
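As a hedged sketch of application-level checkpointing in C, the program below periodically writes its state to a file and, on restart, resumes from the last snapshot; the file name, layout and checkpoint interval are illustrative assumptions rather than any standard API.

/* Minimal checkpointing sketch: save an iteration counter and an array,
   and resume from the snapshot if one exists. */
#include <stdio.h>

#define N 1024
#define CKPT_FILE "state.ckpt"   /* hypothetical file name */

static int load_checkpoint(int *step, double *data) {
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;                               /* no snapshot: cold start */
    int ok = fread(step, sizeof *step, 1, f) == 1 &&
             fread(data, sizeof *data, N, f) == (size_t)N;
    fclose(f);
    return ok;
}

static void save_checkpoint(int step, const double *data) {
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) return;
    fwrite(&step, sizeof step, 1, f);
    fwrite(data, sizeof *data, N, f);
    fclose(f);
}

int main(void) {
    static double data[N];
    int step = 0;
    if (!load_checkpoint(&step, data)) step = 0;    /* resume if a snapshot exists */

    for (; step < 100000; step++) {
        for (int i = 0; i < N; i++) data[i] += 1.0; /* the application's real work */
        if (step % 1000 == 0) save_checkpoint(step, data);
    }
    return 0;
}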

The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for use in large-scale cluster computing and focuses on high-performance access to large data sets. It consists of a server process and a client library, both of which are written entirely in user-level code. A Linux kernel module and pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides for high-performance access via the Message Passing Interface (MPI). PVFS is jointly developed by the Parallel Architecture Research Laboratory at Clemson University, the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center. PVFS development has been funded by NASA Goddard Space Flight Center, the DOE Office of Science Advanced Scientific Computing Research program, NSF PACI and HECURA programs, and other government and private agencies. PVFS is now known as OrangeFS in its newest development branch.

Charm++ is a parallel object-oriented programming paradigm based on C++ and developed in the Parallel Programming Laboratory at the University of Illinois at Urbana–Champaign. Charm++ is designed with the goal of enhancing programmer productivity by providing a high-level abstraction of a parallel program while at the same time delivering good performance on a wide variety of underlying hardware platforms. Programs written in Charm++ are decomposed into a number of cooperating message-driven objects called chares. When a programmer invokes a method on an object, the Charm++ runtime system sends a message to the invoked object, which may reside on the local processor or on a remote processor in a parallel computation. This message triggers the execution of code within the chare to handle the message asynchronously.

<span class="mw-page-title-main">CUDA</span> Parallel computing platform and programming model

Compute Unified Device Architecture (CUDA) is a proprietary parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). The CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism and GPU-specific operations (such as moving data between the CPU and the GPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

Trilinos is a collection of open-source software libraries, called packages, intended to be used as building blocks for the development of scientific applications. The word "Trilinos" is Greek and conveys the idea of "a string of pearls", suggesting a number of software packages linked together by a common infrastructure. Trilinos was developed at Sandia National Laboratories from a core group of existing algorithms and utilizes the functionality of software interfaces such as BLAS, LAPACK, and MPI. In 2004, Trilinos received an R&D100 Award.

In computer science, partitioned global address space (PGAS) is a parallel programming model paradigm. PGAS is typified by communication operations involving a global memory address space abstraction that is logically partitioned, where a portion is local to each process, thread, or processing element. The novelty of PGAS is that the portions of the shared memory space may have an affinity for a particular process, thereby exploiting locality of reference in order to improve performance. A PGAS memory model is featured in various parallel programming languages and libraries, including: Coarray Fortran, Unified Parallel C, Split-C, Fortress, Chapel, X10, UPC++, Coarray C++, Global Arrays, DASH and SHMEM. The PGAS paradigm is now an integrated part of the Fortran language, as of Fortran 2008 which standardized coarrays.

<span class="mw-page-title-main">Data parallelism</span> Parallelization across multiple processors in parallel computing environments

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied to regular data structures like arrays and matrices by working on each element in parallel. It contrasts with task parallelism, another form of parallelism.

HPX, short for High Performance ParalleX, is a runtime system for high-performance computing. It is currently under active development by the STE||AR group at Louisiana State University. Focused on scientific computing, it provides an alternative execution model to conventional approaches such as MPI. HPX aims to overcome the challenges MPI faces with increasingly large supercomputers by using asynchronous communication between nodes and lightweight control objects instead of global barriers, allowing application developers to exploit fine-grained parallelism.

<span class="mw-page-title-main">Computer cluster</span> Set of computers configured in a distributed computing system

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Intel Parallel Studio XE was a software development product developed by Intel that facilitated native code development on Windows, macOS and Linux in C++ and Fortran for parallel computing. Parallel programming enables software programs to take advantage of multi-core processors from Intel and other processor vendors.

<span class="mw-page-title-main">Parasoft</span> Software testing framework

Parasoft is an independent software vendor specializing in automated software testing and application security with headquarters in Monrovia, California. It was founded in 1987 by four graduates of the California Institute of Technology who planned to commercialize the parallel computing software tools they had been working on for the Caltech Cosmic Cube, which was the first working hypercube computer built.

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

HPC Challenge Benchmark combines several benchmarks to test a number of independent attributes of the performance of high-performance computer (HPC) systems. The project has been co-sponsored by the DARPA High Productivity Computing Systems program, the United States Department of Energy and the National Science Foundation.

<span class="mw-page-title-main">Tachyon (software)</span>

Tachyon is parallel/multiprocessor ray tracing software: a parallel ray tracing library for use on distributed-memory parallel computers, shared-memory computers, and clusters of workstations. Tachyon implements rendering features such as ambient occlusion lighting, depth-of-field focal blur, shadows, reflections, and others. It was originally developed for the Intel iPSC/860 by John Stone for his M.S. thesis at the University of Missouri-Rolla. Tachyon subsequently became a more functional and complete ray tracing engine, and it is now incorporated into a number of other open source software packages such as VMD and SageMath. Tachyon is released under a permissive license.

The Center for Supercomputing Research and Development (CSRD) at the University of Illinois (UIUC) was a research center funded from 1984 to 1993. It built the shared memory Cedar computer system, which included four hardware multiprocessor clusters, as well as parallel system and applications software. It was distinguished from the four earlier UIUC Illiac systems by starting with commercial shared memory subsystems that were based on an earlier paper published by the CSRD founders. Thus CSRD was able to avoid many of the hardware design issues that slowed the Illiac series work. Over its 9 years of major funding, plus follow-on work by many of its participants, CSRD pioneered many of the shared memory architectural and software technologies upon which all 21st century computation is based.

References

  1. Nieplocha, Jarek; Harrison, Robert (1997). "Shared Memory Programming in Metacomputing Environments: The Global Array Approach". The Journal of Supercomputing. 11 (2): 119–136. doi:10.1023/A:1007955822788. S2CID 27322677.
  2. Nieplocha, Jarek (2006). "Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit". International Journal of High Performance Computing Applications. 20 (2): 203–231. CiteSeerX 10.1.1.133.9926. doi:10.1177/1094342006064503. S2CID 116634.
  3. Nieplocha, Jaroslaw; Harrison, Robert J.; Littlefield, Richard J. (1996). "Global arrays: A nonuniform memory access programming model for high-performance computers". The Journal of Supercomputing. 10 (2): 169–189. CiteSeerX 10.1.1.41.5891. doi:10.1007/BF00130708. S2CID 1272614.
  4. Tipparaju, Vinod; Krishnan, Manoj; Palmer, Bruce; Petrini, Fabrizio; Nieplocha, Jarek (2008). "Towards Fault Resilient Global Arrays". In Bischof, Christian; Bücker, Martin; Gibbon, Paul; Joubert, Gerhard R.; Lippert, Thomas; Mohr, Bernd; Peters, Frans (eds.). Parallel Computing: Architectures, Algorithms and Applications. Advances in Parallel Computing. Vol. 15. Amsterdam: IOS Press. pp. 339–345. ISBN 978-1-58603-796-3. ISSN 0927-5452. OCLC 226966397. Archived from the original on 2021-03-06. Retrieved 2012-07-17.
  5. "Gordon Bell Finalist at SC09 - GA Crosses the Petaflop Barrier". PNNL. 2009. Archived from the original on 2013-02-22. Retrieved 2015-05-23.
