Keywords:

  • real-time;
  • message passing;
  • communication;
  • application programming interface;
  • buffer management;
  • admission control;
  • quality of service;
  • object oriented

Abstract

The Real-Time Message Passing Interface (MPI/RT) standard is the product of many people working in an open community standards group for more than six years. The purpose of this archival publication is to preserve the significant knowledge and experience in real-time message-passing systems that was developed through the research and development effort and through the specification of the standard. Notably, several implementations of MPI/RT (as well as comprehensive test suites) were created in industry and academia while the standard was being written. MPI/RT is likely to attract adoption interest over time, and this adoption may be driven by the promulgation of the standard, including this publication. We expect that, when people are interested in understanding options for reliable, quality of service (QoS)-oriented parallel computing with message passing, MPI/RT will serve as a foundation for such a study, whether or not its complete formalism is accepted into other systems or standards.

MPI/RT is an offshoot of MPI-1, and retains many of the communication patterns of MPI-1. However, MPI/RT has investigated issues of fine-grain concurrency and highest achievable performance in many ways that were evidently inappropriate for MPI-1 in the scientific computing space in which it resides, with its much broader audience. MPI/RT focuses on early-binding (planned transfer), concurrent message passing, while integrating multiple real-time models: time-based, event-driven, and priority-oriented channels. Group admission control and declarative (deferred early binding) semantics support the goal of hard-real-time for the message-passing component of computation. Importantly, MPI/RT emphasizes the decoupling of message transfer and process/thread scheduling as part of its contribution to parallel processing with QoS. Buffer management and state transition diagrams are also integral to the notion of streaming data into and out of processors in a way that is consistent with QoS, and friendly to zero-copy approaches to communication.
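The early-binding and group admission ideas described above can be illustrated with a toy model. This is a minimal sketch, not the MPI/RT API and not the admission test defined by the standard: the names `ChannelSpec` and `admit_group`, the single per-node link capacity, and the simple bandwidth-sum rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelSpec:
    """Hypothetical declarative description of one planned, periodic transfer."""
    src: int               # sending node
    dst: int               # receiving node
    bytes_per_period: int  # payload sent once per period
    period_us: int         # period length in microseconds

    @property
    def bandwidth(self) -> float:
        """Sustained demand in bytes per second."""
        return self.bytes_per_period * 1_000_000 / self.period_us


def admit_group(specs, link_capacity):
    """All-or-nothing check (an assumed rule, not the standard's test):
    the summed outgoing and incoming demand at every node must fit
    within link_capacity, given in bytes per second."""
    out_load = defaultdict(float)
    in_load = defaultdict(float)
    for spec in specs:
        out_load[spec.src] += spec.bandwidth
        in_load[spec.dst] += spec.bandwidth
    loads = list(out_load.values()) + list(in_load.values())
    return all(load <= link_capacity for load in loads)


# Two channels leaving node 0: 4 MB/s and 3 MB/s, so 7 MB/s in total.
specs = [
    ChannelSpec(src=0, dst=1, bytes_per_period=4000, period_us=1000),
    ChannelSpec(src=0, dst=2, bytes_per_period=3000, period_us=1000),
]
admitted_ok = admit_group(specs, link_capacity=10_000_000)    # 7 MB/s fits in 10 MB/s
admitted_tight = admit_group(specs, link_capacity=5_000_000)  # 7 MB/s exceeds 5 MB/s
```

Because the whole set of channels is declared before any data moves, a plan that cannot be satisfied is rejected as a group up front rather than discovered as overload at run time.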

MPI/RT has also made strides toward a parallel middleware specification by emphasizing an object-oriented design for the application programmer interface (API), as opposed to an object-based or ad hoc API. The advantages of this design are plain in the standard, in that the functionality has useful polymorphic adaptations where needed. Furthermore, the concepts that derive from MPI-1 (such as collective operations) appear as objects in MPI/RT. This modification has allowed for the removal of certain MPI-1 constructs (such as the communicator), in favor of a specification and implementation phase for objects that describe communication in MPI/RT. Overall, the result is a cleaner, more extensible design that does not utilize more resources per se than those needed to admit the required channels for a program. Both offline and online admission control are contemplated, and multiple modes are supported, albeit weakly.

MPI/RT-1.1, the standard version described here, does not cover all possible real-time parallel programming possibilities. It is silent concerning process/thread scheduling, so, in some sense, it still has strong aspects of ‘best effort’ in terms of process scheduling. Leaving process scheduling as an orthogonal concern was intentional, so that, rather than offering a monolith, MPI/RT could be used in concert with the best concepts in these areas.

Furthermore, MPI/RT-1.1 does not explicitly address mode changes (with guaranteed mode-change QoS) between sets of channels, with invariants and non-invariants among the resources consumed. This remains important work for the future. The object-oriented, resource-conscious approach of MPI/RT naturally extends to multiple modes. Realizing this in a standard or in prototypes remains for the future, although it was discussed and prototyped extensively during standardization.

It is interesting to consider whether the connection-oriented, but limited QoS and resource specification of MPI/RT leads to a more-scalable or less-scalable system environment than that posed by MPI and similar middleware. While connections themselves indicate that resources will be assigned per connection, only those connections that are program-mandated are actually built. By way of contrast, in MPI it is necessary to offer a virtual all-to-all communication topology, and introduce overheads associated with either the static realization of such a topology, or else the dynamic build-up/tear-down of connections in constrained environments seeking to scale.
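The scaling argument above can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions, not a result from the standard: it counts directed connections for a virtual all-to-all topology and for a hypothetical program whose only mandated channels are nearest-neighbor links on a 2D periodic grid. The function names and the choice of communication pattern are assumptions.

```python
def all_to_all_connections(n):
    """Directed connections if every node may talk to every other node."""
    return n * (n - 1)


def torus_neighbor_connections(rows, cols):
    """Directed connections when each node on a rows x cols periodic grid
    opens channels only to its north, south, east, and west neighbors."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            me = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                neighbor = ((r + dr) % rows) * cols + (c + dc) % cols
                if neighbor != me:  # skip self-links on degenerate grids
                    links.add((me, neighbor))
    return len(links)


nodes = 100 * 100  # a 100 x 100 grid
virtual_all_to_all = all_to_all_connections(nodes)
program_mandated = torus_neighbor_connections(100, 100)
```

For this assumed pattern, the count of program-mandated connections grows linearly with the number of nodes, while the all-to-all count grows quadratically. The sketch says nothing about the resource cost of each individual connection, which is the other half of the comparison.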

Events over the past six years involving the evolution of networking technology make MPI/RT as interesting as it was when started, and possibly of more ubiquitous application in the long term. Infiniband, Rapid I/O, 3GIO, and other System Area Network standards are likely to offer rudimentary QoS over time in real applications. Likewise, the production of massively concurrent supercomputers (10^4 nodes or more) is likely to drive the need for predictable message passing in the runtime aspects of such systems. These events are likely to cause the ideas and concepts defined in this standard to have impact in areas far broader than originally anticipated. Copyright © 2004 John Wiley & Sons, Ltd.

Contents

Preface

1 Introduction

1.1 General introduction

1.1.1 Parallel models

1.1.2 ‘Sidedness’ of communication

1.1.3 Real-time models and QoS

1.1.4 Ontogeny of an MPI/RT application

1.1.5 The MPI/RT API

1.2 Introduction for users

1.3 Introduction for implementors

1.3.1 The basics

1.3.2 The admission test

1.3.3 Other advice to implementors

1.4 Error checking and kinds of libraries

1.4.1 Erroneous programs

1.4.2 Conformance and kinds of libraries

1.4.3 Error reporting

1.4.4 String representation of error codes

1.5 Related work

1.5.1 Admission control and resource reservation

1.5.2 Access arbitration and transmission control

1.5.3 Early results

1.6 Summary

I Concepts and basic objects

2 MPI/RT objects

2.1 Overview

2.2 Behavior of objects in MPI/RT

2.3 Generic operations defined on all MPI/RT objects

2.4 Attributes: object decoration

2.4.1 Keyval object parameter accessors

2.4.2 Object attribute manipulation functions

2.5 Containers

2.5.1 Container constructors

2.5.2 Generic container operations

2.5.3 Set operations

2.5.4 Vector base operations

2.5.5 Container iterators

2.6 Groups

2.6.1 Group definition

2.6.2 Group management

2.7 Miscellaneous objects

2.7.1 MPIRT_TASK_ADDRESS

2.8 Summary

3 Dataspecs and types

3.1 Overview

3.2 Dataspecs

3.2.1 Operations on dataspec

3.3 Predefined MPI/RT types

3.3.1 MPIRT_BOOLEAN

3.3.2 MPIRT_STRING_NAME

3.3.3 MPIRT_INT64

3.3.4 MPIRT_TIME_SPEC

3.3.5 MPIRT_ADDRESS

3.3.6 MPIRT_BUFITER_MODE

3.4 Summary

4 Event delivery abstraction and handlers

4.1 Introduction

4.2 Event delivery abstraction

4.2.1 Event naming

4.2.2 Common event delivery abstraction operations

4.2.3 Triggers

4.2.4 Event receptors

4.3 Handlers

4.3.1 Handler constructors

4.3.2 Light-weight handler functions

4.3.3 Handler accessors

4.4 Waiting for an event

4.5 Summary

5 Buffer management

5.1 Introduction

5.2 Buffer object

5.3 Buffer object functions

5.3.1 Variable length transfers

5.3.2 Variable buffer offsets

5.4 Buffer operations

5.4.1 Operations on buffer labels

5.4.2 Buffer-partitioning operations

5.5 Buffer iterator

5.6 Buffer iterator accessors

5.7 Bufiter modes

5.8 Summary

II Transfer mechanisms and advanced objects

6 Channel overview

6.1 Introduction

6.2 Common attribute operations for channels

6.3 Summary

7 Point-to-point channels

7.1 Introduction

7.2 Operations on the point-to-point channel object

7.3 Summary

8 Collective channels

8.1 Introduction

8.2 Broadcast collective channel

8.3 Gather operation channel

8.4 Scatter operation channel

8.5 Reduce operation channel

8.5.1 Predefined reduce operations

8.6 Barrier operation channel

8.7 All-to-all channel

8.8 Summary

9 Channel operations

9.1 Data transfers

9.1.1 Performance considerations

9.1.2 Channel states and transitions

9.1.3 Methods for single message transfer

9.1.4 Methods for multiple message transfers

9.2 Testing completion and determining the state of data transfers

9.2.1 Wait operation

9.2.2 Test operation

9.3 Summary

III Real-time programming models and QoS

10 QoS overview

10.1 Introduction

10.2 Time-driven real-time programming model

10.2.1 Scheduling message transfers

10.2.2 Schedulable time intervals

10.2.3 The MPI/RT time specification

10.3 Event-driven real-time programming model

10.3.1 Overview

10.3.2 Event triggers

10.3.3 Event receptors

10.4 Priority-driven real-time programming model

10.4.1 Channel priority

10.4.2 Process priority

10.5 Best-effort QoS programming model

11 QoS specification

11.1 Introduction

11.2 Channel QoS specification

11.2.1 Specification for time-driven channels

11.2.2 Relationship of time-based schedules for different channels

11.2.3 Specification for event-driven-with-priority channels

11.2.4 Specification for combined event and time-driven with priority channels

11.3 Handler QoS specification

11.4 Event-delivery abstraction's QoS specification

11.4.1 QoS for triggers

11.4.2 QoS for receptors

11.5 Summary

12 Committing objects and resource allocation

12.1 Commit operation

12.2 Summary

IV Environmental mechanisms and functionality

13 Initialization and termination

13.1 Initialization and termination of MPI/RT

13.2 Version information

13.3 Summary

14 Clocks

14.1 Synchronization of clocks

14.2 Description of the clocks

14.3 Clock synchronization parameters

14.3.1 The epoch

14.3.2 The MPIRT_TIME type

14.3.3 The synchronized time service

14.3.4 Parameters

14.4 Behavior of the time services

14.5 Timed waiting

14.6 Summary

15 Instrumentation

15.1 Introduction

15.2 MPI/RT metrics

15.3 MPI/RT probes

15.4 MPI/RT user metrics

15.5 Summary

V Appendices

A Return codes

A.1 Return codes

B Deprecated functionality

B.1 Functionality deprecated in MPI/RT-1.1

B.1.1 MPIRT_ERR_COMMITTED_OBJECT

B.1.2 MPIRT_ERR_INITIALIZED

B.1.3 MPIRT_CSET_RETRIEVE_NEXT

B.1.4 MPIRT_ERR_ACTIVE_CHANNEL

Acknowledgments

Glossary

Bibliography

MPI/RT return code index

MPI/RT function index

MPI/RT entity index

Index

Preface

PREFACE TO THE CURRENT STANDARD VERSION, MPI/RT-1.1

At the conclusion of MPI/RT-1.0, many of us participating in and contributing to the standard recognized the need to continue to improve certain key features, in pursuit of a system with extremely low cost of portability, and to widen the applicability of MPI/RT. A restrained set of extensions has emerged in MPI/RT-1.1, from among a huge set of proposals, ideas, and concepts. While assiduously trying to avoid the ‘second system syndrome’, MPI/RT-1.1 works hard to fix small issues and make the standard easier to use, better, and more applicable. This document combines the contributions of MPI/RT-1.0, the principal work, with the newly accepted developments of MPI/RT-1.1, plus errata and other improvements designed to keep this document as the main reference for understanding how to implement and use MPI/RT.

A subset of the original participants in MPI/RT-1.0, together with some new participants, have built this standard extension, with the view that future extensions (whether termed MPI/RT-1.2 or MPI/RT-2) would come much later, after a period of two or more years of implementation and usage. While initial implementations and experience with MPI/RT-1.0 have driven MPI/RT-1.1 in part, much room remains for further implementation and experience in various application spaces.

Historical perspective will of course assess the validity, efficacy, and overall impact of building application programmer interface (API) standards in the way that MPI/RT-1.0 and MPI/RT-1.1 have been built, namely, by a small group of dedicated individuals, supported by a larger group of interested participants from the application, user, and research communities of both private and public sectors. The shoestring funding associated with MPI/RT over the last three years of its six-year life has actually strengthened, rather than diminished, the drive for such progress and success. What appears clear, however, is that a tremendous amount of useful computer science related to advanced systems programming of middleware with real-time has been captured in this standards document and its first extension, MPI/RT-1.1. This stable intermediate form will clearly play an important role in the future exploration of real-time middleware for scalable, parallel processing.

The work of the MPI/RT Forum and its members over the past six years reflects the strong commitment, perseverance, and tireless efforts of its individual participants, for which the chairs offer their sincerest gratitude.

Anthony Skjellum, Starkville, MS

Arkady Kanevsky, Waltham, MA

March 2001

PREFACE TO MPI/RT-1.0

In 1995, several researchers and practitioners became interested in advancing real-time extensions to the then existing Message Passing Interface (MPI) de facto standard and began meeting informally. Later, the group became a sanctioned subcommittee of the MPI-2 de facto standards body, which met regularly in Chicago. People involved with high-performance computing, distributed computing, message-passing systems, and real-time systems were all represented. Researchers and practitioners from industry, academia and defense laboratories were included. The MPI-2 Forum condoned this effort and allowed it to blossom as a Journal of Development activity, with a clear view that it would not formally be part of MPI-2, but nonetheless was a worthwhile working activity. This status was productive and helpful to the work because of the valuable proximity to many interested in messaging, without the compelling deadline faced by MPI-2.

From a technical perspective, a lot of issues, approaches, requirements, and techniques evolved, and significant new ideas previously thought about were introduced into the discussion. These issues included provision of key concepts not available or readily addressable within the confines of MPI-1 [1] and MPI-2 [2]: channels, real-time models, predictability, greater support for thread interactions, early-binding strategies, and admission tests. The requirements posed by the subcommittee emerged as follows: achieve highest-performance messaging under the added constraints of predictability and quality of service (QoS), together with the capability to support relevant memory management that helps eliminate data copies and supports the ‘cut-through’ of data.

These requirements drove us over time, sometimes systematically and sometimes ad hoc, to re-examine much of what was decided in earlier messaging systems and ultimately to evolve away from explicit upward or downward compatibility with MPI. The outcome of this effort is a ‘lower middleware’ standard called MPI/RT, which strives to offer extremely low cost of portability as compared with any native software architecture for messaging, while providing useful real-time notions of performance, predictability, and QoS. Other major decisions included the elimination of Fortran77 language bindings in favor of C++ language bindings and the thorough and continued use of object-oriented APIs and design methodologies to motivate and support the process. Positioning MPI as conceptually ‘higher middleware’ in the form of a layer on top of MPI/RT establishes a conceptual relationship between this work and the previous standard. In fact, we expect that MPI implementations may actually be layered over MPI/RT on systems where users require both notations, and this ‘layerability’ is mentioned in appropriate parts of the standard.

With encouragement and support from DARPA, and from the strong commitment of many of the subcommittee participants, including people from the mainstream of MPI Forum participants, significant progress was made over the first 18 months. The group continued to meet and burgeoned into a full-scale de facto effort of its own after the conclusion of the MPI-2 standards effort. Early versions of this effort appear in the ‘Journal of Development’ of the MPI-2 standard, but the results presented in that snapshot are quite different from what we have ultimately accomplished.

This three-year effort has led to quite a satisfactory messaging middleware specification and standard that we expect to see deployed by industry in real-time computing multicomputers and networks of workstations. The group is committed to extending the specification in a limited fashion in 1999 to support channel input/output (I/O), dynamic processes, and a few other features intentionally delayed at present. After that, sincere efforts to introduce MPI/RT to a formal standard body will be undertaken by us and others, in order to help assure its long-term acceptance.

To make this document more accessible, we have explicitly moved all references to the MPI standards into footnotes. These footnotes provide additional information of relevance to MPI-informed people, while keeping the main flow and standardized content of the document fully free of dependencies on either the MPI-1 or MPI-2 standard documents [1,2]. No knowledge of MPI-1 or MPI-2 is needed to proceed with MPI/RT. In this way, we can mention issues where credit is due or different decisions have been made, while avoiding dependence on the other documents moving forward, and without requiring much background study from new arrivals to MPI/RT.

As part of this work, we have generated our own Journal of Development that has been divided into two classes of results. First, there is material to be revisited in the MPI/RT-1.1 version, beginning in late 1998. Second, there is ‘best engineering practice’: results and issues worked out during the past years that we want to preserve. A separate, but not formally accepted, best engineering practice document is consequently a related outcome of this group's work. Such issues as interoperability, relationships to other standards, and a standardized subset of MPI/RT-1.1 figure in that other document. This other body of results remains separate from, yet valuable to, the community we are seeking to support.

The complete specification for MPI/RT is presented in this document. Chapter 1 provides a high-level overview of the standard for first-time readers. Chapter 2 presents the object-oriented (OO) design of MPI/RT and describes the fundamental ancestor objects in MPI/RT's class hierarchy. Chapters 3–5 describe the basic datatypes, the event-handling mechanism, and buffer management. Chapters 6–9 describe the channel-based data transfer operations. Chapters 10–12 discuss real-time QoS issues and admittance testing. Finally, Chapters 13–15 address the environmental issues for MPI/RT programs such as initialization, termination, synchronized clocks, and performance monitoring. Material of special interest to users, such as a detailed description of how to use a particular feature of MPI/RT, is highlighted by Advice to users. Material of particular interest to implementors of MPI/RT is highlighted by Advice to implementors. The motivation behind particular features is highlighted by Rationale.

Arkady Kanevsky, Bedford, MA

Anthony Skjellum, Starkville, MS

July 1998