Quasi-opportunistic supercomputing

A typical centralized supercomputer center (the Pleiades supercomputer at NASA Ames), with over 100 cabinets, each housing many processors, for a total of about 14,000 interconnected processors in one room. On the other hand, a distributed system (e.g. BOINC) can opportunistically use tens of thousands of personal computers on the internet, whenever available.

Quasi-opportunistic supercomputing is a computational paradigm for supercomputing on a large number of geographically dispersed computers.[3] Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing.[4]


The quasi-opportunistic approach coordinates computers that are often under different ownership to achieve reliable and fault-tolerant high performance, with more control than opportunistic computer grids, in which computational resources are used whenever they happen to become available.[3]

While the "opportunistic match-making" approach to task scheduling on computer grids is simpler in that it merely matches tasks to whatever resources may be available at a given time, demanding supercomputer applications such as weather simulations or computational fluid dynamics have remained out of reach, partly due to the barriers in reliable sub-assignment of a large number of tasks as well as the reliable availability of resources at a given time. [5] [6]

The quasi-opportunistic approach enables the execution of demanding applications within computer grids by establishing grid-wise resource allocation agreements and by using fault-tolerant message passing to abstractly shield against failures of the underlying resources, thus maintaining some opportunism while allowing a higher level of control.[3]

Opportunistic supercomputing on grids

The general principle of grid computing is to use distributed computing resources from diverse administrative domains to solve a single task, employing resources as they become available. Traditionally, most grid systems have approached the task-scheduling challenge with an "opportunistic match-making" approach, in which tasks are matched to whatever resources may be available at a given time.[5]
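The match-making idea can be sketched in a few lines of code. The following Python fragment is a hypothetical illustration (the Task and Resource classes and the matching loop are invented for this example, not taken from any real grid middleware): each pending task is simply paired with whichever registered resource happens to have enough free capacity at that moment, and anything that does not fit waits for the next matching round.

```python
# Minimal sketch of opportunistic "match-making": pending tasks are paired
# with whatever resources happen to be free right now; nothing is reserved
# in advance. All names and structure are illustrative only.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    min_cores: int

@dataclass
class Resource:
    name: str
    free_cores: int

def opportunistic_match(tasks, resources):
    """Assign each task to the first currently available resource that fits."""
    assignments = {}
    for task in tasks:
        for res in resources:
            if res.free_cores >= task.min_cores:
                assignments[task.name] = res.name
                res.free_cores -= task.min_cores
                break
        # If nothing fits, the task simply waits for the next matching round.
    return assignments

if __name__ == "__main__":
    tasks = [Task("render_frame", 2), Task("fold_protein", 4)]
    pool = [Resource("volunteer_pc_1", 4), Resource("volunteer_pc_2", 8)]
    print(opportunistic_match(tasks, pool))
```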

Example architecture of a geographically dispersed, distributively owned distributed computing system connecting many personal computers over a network

BOINC, developed at the University of California, Berkeley, is an example of a volunteer-based, opportunistic grid computing system.[2] Applications based on the BOINC grid have reached multi-petaflop levels by using close to half a million computers connected to the internet, whenever volunteer resources become available.[7] Another system, Folding@home, which is not based on BOINC, computes protein folding and has reached 8.8 petaflops by using clients that include GPUs and PlayStation 3 systems.[8][9][2] However, these results are not eligible for the TOP500 ratings because they do not run the general-purpose Linpack benchmark.

A key strategy for grid computing is the use of middleware that partitions pieces of a program among the different computers on the network.[10] Although general grid computing has had success in parallel task execution, demanding supercomputer applications such as weather simulations or computational fluid dynamics have remained out of reach, partly due to the barriers in reliable sub-assignment of a large number of tasks as well as the reliable availability of resources at a given time.[2][10][9]
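As a rough illustration of such partitioning, the short Python sketch below (purely hypothetical; the partition function and its round-robin policy are not drawn from any specific middleware) divides a list of independent work items among the machines currently on the network.

```python
# Hypothetical sketch of grid middleware partitioning a data-parallel job:
# independent work items are dealt out round-robin to the participating nodes.
def partition(work_items, nodes):
    """Return a mapping from node name to the work items it should run."""
    chunks = {node: [] for node in nodes}
    for i, item in enumerate(work_items):
        chunks[nodes[i % len(nodes)]].append(item)
    return chunks

if __name__ == "__main__":
    items = [f"work_unit_{i}" for i in range(10)]
    print(partition(items, ["node_a", "node_b", "node_c"]))
```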

Since 1997, the opportunistic Internet PrimeNet Server has supported GIMPS, one of the earliest grid computing projects, which searches for Mersenne prime numbers. As of May 2011, GIMPS's distributed research achieved about 60 teraflops as a volunteer-based computing project.[11] The use of computing resources on "volunteer grids" such as GIMPS is usually purely opportunistic: geographically dispersed, distributively owned computers contribute whenever they become available, with no preset commitment that any resources will be available at any given time. Hence, hypothetically, if many of the volunteers decide to switch their computers off on a certain day, grid resources will become significantly reduced.[12][2][9] Furthermore, users will find it exceedingly costly to organize a very large number of opportunistic computing resources in a manner that can achieve reasonable high-performance computing.[12][13]

Quasi-control of computational resources

Representation of an atmospheric model with differential equations that require supercomputing capabilities

An example of a more structured grid for high-performance computing is DEISA, a supercomputing project organized by the European Community that uses computers in seven European countries.[14] Although different parts of a program executing within DEISA may be running on computers located in different countries under different ownerships and administrations, there is more control and coordination than with a purely opportunistic approach. DEISA has a two-level integration scheme: the "inner level" consists of a number of strongly connected high-performance computer clusters that share similar operating systems and scheduling mechanisms and provide a homogeneous computing environment, while the "outer level" consists of heterogeneous systems that have supercomputing capabilities.[15] Thus DEISA can provide somewhat controlled, yet dispersed, high-performance computing services to users.[15][16]

The quasi-opportunistic paradigm aims to overcome these limitations by achieving more control over the assignment of tasks to distributed resources and by using pre-negotiated scenarios for the availability of systems within the network. Quasi-opportunistic distributed execution of demanding parallel computing software in grids focuses on the implementation of grid-wise allocation agreements, co-allocation subsystems, communication topology-aware allocation mechanisms, fault-tolerant message passing libraries and data pre-conditioning.[17] In this approach, fault-tolerant message passing is essential to abstractly shield against the failures of the underlying resources.[3]
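The contrast with pure opportunism can be made concrete with a small sketch. The Python code below is an assumption-laden illustration, not an actual QosCosGrid or middleware interface: the Agreement records stand in for pre-negotiated availability agreements, and a job is co-allocated across sites only if every participating site can honour its agreement for the whole requested time window.

```python
# Illustrative sketch of pre-negotiated, agreement-based co-allocation.
# Before a run starts, the requested sites are checked against availability
# agreements; the job proceeds only if all sites can cover the full window.
from dataclasses import dataclass

@dataclass
class Agreement:
    site: str
    guaranteed_cores: int
    available_hours: range  # hours of the day the site has promised

def can_co_allocate(request_cores, start_hour, duration, agreements):
    """True only if every site can cover its share for the whole time window."""
    hours_needed = range(start_hour, start_hour + duration)
    per_site = request_cores // len(agreements)
    return all(
        a.guaranteed_cores >= per_site
        and all(h in a.available_hours for h in hours_needed)
        for a in agreements
    )

if __name__ == "__main__":
    deals = [Agreement("site_a", 512, range(0, 24)),
             Agreement("site_b", 256, range(8, 20))]
    print(can_co_allocate(400, 9, 6, deals))   # True: both sites can cover it
    print(can_co_allocate(400, 18, 6, deals))  # False: site_b's window ends at 20
```

In a purely opportunistic grid there is no such pre-check; the quasi-opportunistic model trades some flexibility for the ability to refuse or reschedule jobs that cannot be reliably served.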

The quasi-opportunistic approach goes beyond volunteer computing on highly distributed systems such as BOINC, or general grid computing on a system such as Globus, by allowing the middleware to provide almost seamless access to many computing clusters, so that existing programs in languages such as Fortran or C can be distributed among multiple computing resources.[3]

A key component of the quasi-opportunistic approach, as in the QosCosGrid, is an economic-based resource allocation model in which resources are provided based on agreements among specific supercomputer administration sites. Unlike volunteer systems that rely on altruism, specific contractual terms are stipulated for the performance of specific types of tasks. However, "tit-for-tat" paradigms, in which computations are paid back via future computations, are not suitable for supercomputing applications and are avoided.[18]

The other key component of the quasi-opportunistic approach is a reliable message passing system that provides distributed checkpoint/restart mechanisms when computer hardware or networks inevitably experience failures.[18] In this way, if some part of a large computation fails, the entire run need not be abandoned but can be restarted from the last saved checkpoint.[18]
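A minimal sketch of the checkpoint/restart idea, in Python and with entirely hypothetical names (save_checkpoint, load_checkpoint, the simulated failure), is shown below: the computation periodically writes its state to disk, and when a failure interrupts it, the run resumes from the last checkpoint rather than from the beginning.

```python
# Minimal, hypothetical sketch of checkpoint/restart: the long computation
# periodically saves its state, and after a simulated node failure the run
# resumes from the last saved checkpoint instead of starting over.
import os
import pickle
import random

CHECKPOINT = "state.ckpt"

def save_checkpoint(step, state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump((step, state), f)

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return 0, 0.0  # fresh start

def run(total_steps=1000, interval=100):
    step, state = load_checkpoint()
    while step < total_steps:
        state += 1.0                 # stand-in for one unit of real work
        step += 1
        if step % interval == 0:
            save_checkpoint(step, state)
        if random.random() < 0.001:  # simulated hardware failure
            raise RuntimeError(f"node failed at step {step}")
    return state

if __name__ == "__main__":
    while True:
        try:
            print("result:", run())
            break
        except RuntimeError as err:
            print(err, "- restarting from last checkpoint")
```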


Related Research Articles

Great Internet Mersenne Prime Search

The Great Internet Mersenne Prime Search (GIMPS) is a collaborative project of volunteers who use freely available software to search for Mersenne prime numbers.

Supercomputer

A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, supercomputers have existed which can perform over 10¹⁷ FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS). For comparison, a desktop computer has performance in the range of hundreds of gigaFLOPS (10¹¹) to tens of teraFLOPS (10¹³). Since November 2017, all of the world's fastest 500 supercomputers run on Linux-based operating systems. Additional research is being conducted in the United States, the European Union, Taiwan, Japan, and China to build faster, more powerful and technologically superior exascale supercomputers.

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

In computing, floating point operations per second is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations. For such cases, it is a more accurate measure than measuring instructions per second.

High-performance computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

TeraGrid

TeraGrid was an e-Science grid computing infrastructure combining resources at eleven partner sites. The project started in 2001 and operated from 2004 through 2011.

Volunteer computing

Volunteer computing is a type of distributed computing in which people donate their computers' unused resources to a research-oriented project, sometimes in exchange for credit points. The fundamental idea behind it is that a modern desktop computer is sufficiently powerful to perform billions of operations a second, but for most users only 10–15% of its capacity is used. Common tasks such as word processing or web browsing leave the computer mostly idle.

In computer science, high-throughput computing (HTC) is the use of many computing resources over long periods of time to accomplish a computational task.

Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Many-task computing (MTC) in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) and high-performance computing (HPC).

Distributed European Infrastructure for Supercomputing Applications

Distributed European Infrastructure for Supercomputing Applications (DEISA) was a consortium of major national supercomputing centres in Europe. Initiated in 2002, it became a European Union funded supercomputer project. The consortium of eleven national supercomputing centres from seven European countries promoted pan-European research on European high-performance computing systems by creating a European collaborative environment in the area of supercomputing.

The Swiss National Supercomputing Centre is the national high-performance computing centre of Switzerland. It was founded in Manno, canton Ticino, in 1991. In March 2012, the CSCS moved to its new location in Lugano-Cornaredo.

Supercomputing in India has a history going back to the 1980s. The Government of India created an indigenous development programme as they had difficulty purchasing foreign supercomputers. As of June 2023, the AIRAWAT supercomputer is the fastest supercomputer in India, having been ranked 75th fastest in the world in the TOP500 supercomputer list. AIRAWAT has been installed at the Centre for Development of Advanced Computing (C-DAC) in Pune.

Supercomputing in Europe

Several centers for supercomputing exist across Europe, and distributed access to them is coordinated by European initiatives to facilitate high-performance computing. One such initiative, the HPC Europa project, fits within the Distributed European Infrastructure for Supercomputing Applications (DEISA), which was formed in 2002 as a consortium of eleven supercomputing centers from seven European countries. Operating within the CORDIS framework, HPC Europa aims to provide access to supercomputers across Europe.

The QosCosGrid is a quasi-opportunistic supercomputing system using grid computing.

Supercomputer architecture

Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance. However, in time the demand for increased computational power ushered in the age of massively parallel systems.

Francine Berman

Francine Berman is an American computer scientist, and a leader in digital data preservation and cyber-infrastructure. In 2009, she was the inaugural recipient of the ACM/IEEE-CS Ken Kennedy Award "for her influential leadership in the design, development and deployment of national-scale cyberinfrastructure, her inspiring work as a teacher and mentor, and her exemplary service to the high performance community". In 2004, Business Week called her the "reigning teraflop queen".

Supercomputer operating system

A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems toward some form of Linux, which ran on all of the supercomputers on the TOP500 list in November 2017. In 2021, the top 10 supercomputers ran, for instance, Red Hat Enterprise Linux (RHEL) or some variant of it, or another Linux distribution such as Ubuntu.

Massively parallel is the term for using a large number of computer processors to simultaneously perform a set of coordinated computations in parallel. GPUs are massively parallel architectures with tens of thousands of threads.

Message passing in computer clusters

Message passing is an inherent element of all computer clusters. All computer clusters, ranging from homemade Beowulfs to some of the fastest supercomputers in the world, rely on message passing to coordinate the activities of the many nodes they encompass. Message passing in computer clusters built with commodity servers and switches is used by virtually every internet service.

References

  1. NASA website
  2. Parallel and Distributed Computational Intelligence by Francisco Fernández de Vega, 2010, ISBN 3-642-10674-9, pages 65-68
  3. Quasi-opportunistic supercomputing in grids by Valentin Kravtsov, David Carmeli, Werner Dubitzky, Ariel Orda, Assaf Schuster, Benny Yoshpa, in IEEE International Symposium on High Performance Distributed Computing, 2007, pages 233-244
  4. Computational Science - ICCS 2008: 8th International Conference, edited by Marian Bubak, 2008, ISBN 978-3-540-69383-3, pages 112-113
  5. Grid computing: experiment management, tool integration, and scientific workflows by Radu Prodan, Thomas Fahringer, 2007, ISBN 3-540-69261-4, pages 1-4
  6. Computational Science - ICCS 2009: 9th International Conference, edited by Gabrielle Allen, Jarek Nabrzyski, 2009, ISBN 3-642-01969-2, pages 387-388
  7. BOINC statistics, 2011. Archived 2010-09-19 at the Wayback Machine
  8. "Folding@home statistics, 2011". Archived from the original on 2013-05-13. Retrieved 2011-07-21.
  9. Euro-Par 2010, Parallel Processing Workshops, edited by Mario R. Guarracino, 2011, ISBN 3-642-21877-6, pages 274-277
  10. Languages and Compilers for Parallel Computing by Guang R. Gao, 2010, ISBN 3-642-13373-8, pages 10-11
  11. "Internet PrimeNet Server Distributed Computing Technology for the Great Internet Mersenne Prime Search". GIMPS. Retrieved June 6, 2011.
  12. Grid Computing: Towards a Global Interconnected Infrastructure, edited by Nikolaos P. Preve, 2011, ISBN 0-85729-675-2, page 71
  13. Cooper, Curtis and Steven Boone. "The Great Internet Mersenne Prime Search at the University of Central Missouri". The University of Central Missouri. Retrieved 4 August 2011.
  14. High Performance Computing - HiPC 2008, edited by P. Sadayappan, 2008, ISBN 3-540-89893-X, page 1
  15. Euro-Par 2006 Workshops: Parallel Processing: CoreGRID 2006, edited by Wolfgang Lehner, 2007, ISBN 3-540-72226-2
  16. Grid Computing: International Symposium on Grid Computing (ISGC 2007), edited by Stella Shen, 2008, ISBN 0-387-78416-0, page 170
  17. Kravtsov, Valentin; Carmeli, David; Dubitzky, Werner; Orda, Ariel; Schuster, Assaf; Yoshpa, Benny. "Quasi-opportunistic supercomputing in grids, hot topic paper (2007)". IEEE International Symposium on High Performance Distributed Computing. IEEE. Retrieved 4 August 2011.
  18. Algorithms and Architectures for Parallel Processing by Anu G. Bourgeois, 2008, ISBN 3-540-69500-1, pages 234-242