From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001
From: Thomas Voss <mail@thomasvoss.com>
Date: Wed, 27 Nov 2024 20:54:24 +0100
Subject: doc: Add RFC documents

---
 doc/rfc/rfc992.txt | 1060 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1060 insertions(+)
 create mode 100644 doc/rfc/rfc992.txt

(limited to 'doc/rfc/rfc992.txt')

diff --git a/doc/rfc/rfc992.txt b/doc/rfc/rfc992.txt
new file mode 100644
index 0000000..1d21ad9
--- /dev/null
+++ b/doc/rfc/rfc992.txt
@@ -0,0 +1,1060 @@
+
+                                                       K. P. Birman (Cornell)
+Network Working Group                                  T. A. Joseph (Cornell)
+Request for Comments: 992                              November 1986
+
+
+
+       On Communication Support for Fault Tolerant Process Groups
+
+                     K. P. Birman and T. A. Joseph
+             Dept. of Computer Science, Cornell University
+                           Ithaca, N.Y. 14853
+                              607-255-9199
+
+
+1. Status of this Memo.
+
+   This memo describes a collection of multicast communication primi-
+   tives integrated with a mechanism for handling process failure and
+   recovery.  These primitives facilitate the implementation of fault-
+   tolerant process groups, which can be used to provide distributed
+   services in an environment subject to non-malicious crash failures.
+   Unlike other process group approaches, such as Cheriton's "host
+   groups" (RFC's 966, 988, [Cheriton]), our approach provides powerful
+   guarantees about the behavior of the communication subsystem when
+   process group membership is changing dynamically, for example due to
+   process or site failures, recoveries, or migration of a process from
+   one site to another.  Our approach also addresses delivery ordering
+   issues that arise when multiple clients communicate with a process
+   group concurrently, or a single client transmits multiple multicast
+   messages to a group without pausing to wait until each is received.
+   Moreover, the cost of the approach is low.  An implementation is be-
+   ing undertaken at Cornell as part of the ISIS project.
+
+   Here, we argue that the form of "best effort" reliability provided by
+   host groups may not address the requirements of those researchers who
+   are building fault tolerant software.  Our basic premise is that re-
+   liable handling of failures, recoveries, and dynamic process migra-
+   tion are important aspects of programming in distributed environ-
+   ments, and that communication support that provides unpredictable
+   behavior in the presence of such events places an unacceptable burden
+   of complexity on higher level application software.  This complexity
+   does not arise when using the fault-tolerant process group alterna-
+   tive.
+
+   This memo summarizes our approach and briefly contrasts it with other
+   process group approaches.  For a detailed discussion, together with
+   figures that clarify the details of the approach, readers are re-
+   ferred to the papers cited below.
+
+   Distribution of this memo is unlimited.
+
+
+
+
+Birman & Joseph                                                 [Page 1]
+
+RFC 992                                                    November 1986
+
+
+2. Acknowledgments
+
+   This memo was adopted from a paper presented at the Asilomar workshop
+   on fault-tolerant distributed computing, March 1986, and summarizes
+   material from a technical report that was issued by Cornell Universi-
+   ty, Dept. of Computer Science, in August 1985, which will appear in
+   ACM Transactions on Computer Systems in February 1987 [Birman-b].
+   Copies of these paper, and other relevant papers, are available on
+   request from the author: Dept. of Computer Science, Cornell Universi-
+   ty, Ithaca, New York 14853. (birman@gvax.cs.cornell.edu).  The ISIS
+   project also maintains a mailing list.  To be added to this list,
+   contact M. Schmizzi (schiz@gvax.cs.cornell.edu).
+
+   This work was supported by the Defense Advanced Research Projects
+   Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and by
+   the National Science Foundation under grant DCR-8412582.  The views,
+   opinions and findings contained in this report are those of the au-
+   thors and should not be construed as an official Department of De-
+   fense position, policy, or decision.
+
+3. Introduction
+
+   At Cornell, we recently completed a prototype of the ISIS system,
+   which transforms abstract type specifications into fault-tolerant
+   distributed implementations, while insulating users from the mechan-
+   isms by which fault-tolerance is achieved.  This version of ISIS, re-
+   ported in [Birman-a], supports transactional resilient objects as a
+   basic programming abstraction.  Our current work undertakes to pro-
+   vide a much broader range of fault-tolerant programming mechanisms,
+   including fault-tolerant distributed bulletin boards [Birman-c] and
+   fault-tolerant remote procedure calls on process groups [Birman-b].
+   The approach to communication that we report here arose as part of
+   this new version of the ISIS system.
+
+   Unreliable communication primitives, such as the multicast group com-
+   munication primitives proposed in RFC's 966 and 988 and in [Cheri-
+   ton], leave some uncertainty in the delivery status of a message when
+   failures and other exceptional events occur during communication.
+   Instead, a form of "best effort" delivery is provided, but with the
+   possibility that some member of a group of processes did not receive
+   the message if the group membership was changing just as communica-
+   tion took place.  When we tried to use this sort of primitive in our
+   original work on ISIS, which must behave reliably in the presence of
+   such events, we had to address this aspect at an application level.
+   The resulting software was complex, difficult to reason about, and
+   filled with obscure bugs, and we were eventually forced to abandon
+   the entire approach as infeasible.
+
+   A wide range of reliable communication primitives have been proposed
+   in the literature, and we became convinced that by using them, the
+   complexity of our software could be greatly reduced.  These range
+
+
+
+Birman & Joseph                                                 [Page 2]
+
+RFC 992                                                    November 1986
+
+
+   from reliable and atomic broadcast [Chang] [Cristian] [Schneider] to
+   Byzantine agreement [Strong].  For several reasons, however, the ex-
+   isting work does not solve the problem at hand.  The most obvious is
+   that they do not provide a mechanism for sending a message to all the
+   members of a group when the membership is changing dynamically (the
+   "group addressing" problem).  In addition, one can identify delivery
+   ordering issues and questions regarding the detection of communica-
+   tion failures that should be handled within the broadcast mechanism.
+   These motivate a careful reexamination of the entire reliable broad-
+   cast problem.
+
+   The multicast primitives we report here are designed to respect
+   several sorts of ordering constraints, and have cost and latency that
+   varies depending on the nature of the constraint required [Birman-b]
+   [Joseph-a] [Joseph-b].  Failure and recovery are integrated into the
+   communication subsystem by treating these events as a special sort of
+   multicast issued on behalf of a process that has failed or recovered.
+   The primitives are presented in the context of fault tolerant process
+   groups: groups of processes that cooperate to implement some distri-
+   buted algorithm or service, and which need to see consistent order-
+   ings of system events in order to achieve mutually consistent
+   behavior.  Such groups are similar to the host groups of the V system
+   and the ones described in RFC's 966 and 988, but provide guarantees
+   of consistency in just the situations where a host group provides a
+   "best effort" delivery which may sometimes be erroneous.
+
+   It is helpful to think of our primitives as providing a logical or
+   "virtual" form of reliability: rather than addressing physical
+   delivery issues, they ensure that a client will never observe a sys-
+   tem state "inconsistent" with the assumption that reliable delivery
+   has occurred.  Readers familiar with serializability theory may want
+   to think of this as a weaker analog: in serializability, one allows
+   interleaved executions of operations provided that the resulting sys-
+   tem state is consistent with the assumption that execution was
+   sequential.  Similarly, reliable communication primitives permit de-
+   viations from the reliable delivery abstraction provided that the
+   resulting system state is indistinguishable from one in which reli-
+   able delivery actually did occur.
+
+   Using our primitives, the ISIS system achieved both high levels of
+   concurrency and suprisingly good performance.  Equally important, its
+   structure was made suprisingly simple, making it feasible to reason
+   about the correctness of the algorithms that are needed to maintain
+   high availability even when failures, recoveries, or process migra-
+   tion occurs.  More recently, we have applied the same approach to a
+   variety of other problems in distributed computing, and even designed
+   a consistent, fault tolerant, distributed bulletin board data struc-
+   ture (a generalized version of the blackboards used in artificial in-
+   telligence programs), with equally good results [Birman-c].  Thus, we
+   feel that the approach has been shown to work in a variety of set-
+   tings where unreliable primitives simply could not be used.
+
+
+
+Birman & Joseph                                                 [Page 3]
+
+RFC 992                                                    November 1986
+
+
+   In the remainder of this memo we summarize the issues and alterna-
+   tives that the designer of a distributed system is presented with,
+   focusing on two styles of support for fault-tolerant computing: re-
+   mote procedure calls coupled with a transactional execution facility,
+   such as is used in the ARGUS system [Liskov], and the fault-tolerant
+   process group mechanism mentioned above.  We argue that transactional
+   interactions are too restrictive to support the sort of mechanism
+   needed, and then show how our primitives can be used to provide such
+   a mechanism.  We conclude by speculating on future directions in
+   which this work might be taken.
+
+4. Issues in fault-tolerance
+
+   The difficulty of constructing fault-tolerant distributed software
+   can be traced to a number of interrelated issues.  The list that fol-
+   lows is not exhaustive, but attempts to touch on the principal con-
+   siderations that must be addressed in any such system:
+
+      [1]Synchronization.  Distributed systems offer the potential for
+      large amounts of concurrency, and it is usually desirable to
+      operate at as high a level of concurrency as possible.  However,
+      when we move from a sequential execution environment to a con-
+      current one, it becomes necessary to synchronize actions that may
+      conflict in their access to shared data or entail communication
+      with overlapping sets of processes.  Thus, a mechanism is needed
+      for ordering conflicting events.  Additional problems that can
+      arise in this context include deadlock avoidance or detection,
+      livelock avoidance, etc.
+
+      [2]Failure detection.  It is usually necessary for a fault-
+      tolerant application to have a consistent picture of which com-
+      ponents fail, and in what order. Timeout, the most common mechan-
+      ism for detecting failure, is unsatisfactory, because there are
+      many situations in which a healthy component can timeout with
+      respect to one component without this being detected by some
+      another.  Failure detection under more rigorous requirements
+      requires an agreement protocol that is related to Byzantine agree-
+      ment [Strong] [Hadzilacos].  Regardless of how this problem is
+      solved, some sort of reliable failure detection mechanism will be
+      needed in any fault-tolerant distributed system.
+
+      [3] Consistency.  When a group of processes cooperate in a distri-
+      buted system, it is necessary to ensure that the operational
+      processes have consistent views of the state of the group as a
+      whole.  For example, if process p believes that some property A
+      holds, and on the basis of this interacts with process q, the
+      state of q should not contradict the fact that p believes A to be
+      true.  This problem is closely related to notions of knowledge and
+      consistency in distributed systems [Halpern] [Lamport].  In our
+      context, A will often be the assertion that a multicast has been
+      received by q, or that q saw some sequence of events occur in the
+
+
+
+Birman & Joseph                                                 [Page 4]
+
+RFC 992                                                    November 1986
+
+
+      same order as did p.  Thus, it is necessary to be able to specify
+      the precise consistency constraints on a distributed software sys-
+      tem, and system support should be available to facilitate the
+      attainment of these constraints.
+
+      [4] Serializability.  Many distributed systems are partitioned
+      into data manager processes, which implement shared variables, and
+      transaction manager processes, which issue requests to data
+      managers [Bernstein].  If transaction managers can execute con-
+      currently, it is desirable to ensure that transactions produce
+      serializable outcomes [Eswaren] [Papadimitrou].  Serializability
+      is increasingly viewed as an important property in "object-
+      oriented" distributed systems that package services as abstract
+      objects with which clients communicate by remote procedure calls
+      (RPC).  On the other hand, there are systems for which serializa-
+      bility is either too strong a constraint, or simply inappropriate.
+      Thus, one needs a way to achieve serializability in applications
+      where it will be needed, without imposing system-wide restrictions
+      that would prevent the design of software subsystems for which
+      serializability is not needed.
+
+   Jointly, these problems render the design of fault-tolerant distri-
+   buted software daunting in the absence of adequate support.  The
+   correctness of any proposed design and of its implementation become
+   serious, if not insurmountable, concerns.  In Sec. 7, we will show
+   how the primitives of Sec. 6 provide simple ways to overcome all of
+   these issues.
+
+5. Existing alternatives
+
+   If one rules out "unreliable" communication mechanisms, there are
+   basically two fault-tolerant alternatives that can be pursued.
+
+   The first approach is to provide mechanisms for transactional
+   interactions between processes that communicate using remote pro-
+   cedure calls [Birrell].  This has lead to work on nested transactions
+   (due to nested RPC's) [Moss], support for transactions at the
+   language level [Liskov], transactions within an operating systems
+   kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional
+   access to higher-level replicated services, such as resilient objects
+   in ISIS or relations in database systems.  The primitives in a tran-
+   sactional system provide mechanisms for distributing the request that
+   initiates the transaction, accessing data (which may be replicated),
+   performing concurrency control, and implementing commit or abort.
+   Additional mechanisms are normally needed for orphan termination,
+   deadlock detection, etc.  The issue then arises of how these mechan-
+   isms should themselves be implemented.
+
+   Our work in ISIS leads us to believe that whereas transactions are
+   easily implemented on top of fault-tolerant process groups -- we have
+   done so -- the converse is much harder.  Moreover, transactions
+
+
+
+Birman & Joseph                                                 [Page 5]
+
+RFC 992                                                    November 1986
+
+
+   represent a relatively heavy-weight solution to the problems surveyed
+   in the previous section, and might impose an unacceptable overhead on
+   subsystems that need to run non-transactionally, for example because
+   a pair of concurrent processes needs to interact on a frequent basis.
+   (We are not saying that "transactional" mechanisms such as cobegins
+   and toplevel actions can't solve this problem, but just that they
+   yield a solution that is awkward and costly).  This sort of reasoning
+   has lead us to focus on non-transactional interaction mechanisms, and
+   to treat transactions as a special class of mechanisms used when
+   processes that have been designed to employ a transactional protocol
+   interact.
+
+   The second approach involves the provision of a communication primi-
+   tive, such as atomic broadcast, which can be used as the framework on
+   which higher level algorithms are designed.  Such a primitive seeks
+   to deliver messages reliably to some set of destinations, despite the
+   possibility that failures might occur during the execution of the
+   protocol.  Above, we termed this the fault tolerant process group
+   approach, since it lends itself to the organization of cooperating
+   processes into groups, as described in the introduction.  Process
+   groups are an extremely flexible abstraction, and have been employed
+   in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS
+   system.  A proposal to provide Internet support for host groups was
+   raised in RFC's 966 and 988.  However, the idea of adapting the pro-
+   cess group approach to work reliably in an environment subject to the
+   sorts of exception events and concurrency cited in the previous sec-
+   tion seems to be new.
+
+   As noted earlier, existing reliable communication protocols do not
+   address the requirements of fault-tolerant process groups.  For exam-
+   ple, in [Schneider], an implementation of a reliable multicast primi-
+   tive is described.  Such a primitive ensures that a designated mes-
+   sage will be transmitted from one site to all other operational sites
+   in a system; if a failure occurs but any site has received the mes-
+   sage, all will eventually do so.  [Chang] and [Cristian] describe
+   implementations for atomic broadcast, which is a reliable broadcast
+   (sent to all sites in a system) with the additional property that
+   messages are delivered in the same order at all overlapping destina-
+   tions, and this order preserves the transmission order if messages
+   originate in a single site.
+
+   Atomic broadcast is a powerful abstraction, and essentially the same
+   behavior is provided by one of the multicast primitives we discuss in
+   the next section.  However, it has several drawbacks which made us
+   hesitant to adopt it as the only primitive in the system.  Most seri-
+   ous is the latency that is incurred in order to satisfy the delivery
+   ordering property.  Without delving deeply into the implementations,
+   which are based on a token scheme in [Chang] and an acknowledgement
+   protocol in [Schneider], we observe that the delaying of certain mes-
+   sages is fundamental to the establishment of a unique global delivery
+   ordering; indeed, it is easy to prove on knowledge theoretic grounds
+
+
+
+Birman & Joseph                                                 [Page 6]
+
+RFC 992                                                    November 1986
+
+
+   that this must always be the case.  In [Chang] a primary goal is to
+   minimize the number of messages sent, and the protocol given performs
+   extremely well in this regard.  However, a delay occurs while waiting
+   for tokens to arrive and the delivery latency that results may be
+   high.  [Cristian] assumes that clocks are closely synchronized and
+   that message transit times are bounded by well-known constants, and
+   uses this to derive atomic broadcast protocols tolerant of increas-
+   ingly severe classes of failures.  The protocols explicitly delay
+   delivery to achieve the desired global ordering on multicasts.  For
+   reasons discussed below, this tends to result in high latency in typ-
+   ical local networking environments.  An additional drawback of the
+   atomic broadcast protocols is that no mechanism is provided for
+   ensuring that all processes observe the same sequence of failures and
+   recoveries, or for ensuring that failures and recoveries are ordered
+   relative to ongoing multicasts.  Since this problem arises in any
+   setting where one process monitors another, we felt it should be
+   addressed at the same level as the communication protocol.  Finally,
+   one wants a group oriented multicast protocol, not a site oriented
+   broadcast, and this issue must be resolved too.
+
+6. Our multicast primitives
+
+   We now describe three multicast protocols - GBCAST, ABCAST, and
+   CBCAST - for transmitting a message reliably from a sender process to
+   some set of destination processes.  Details of the protocols and
+   their correctness proofs can be found in [Birman-b].  The protocols
+   ensure "all or nothing" behavior: if any destination receives a mes-
+   sage, then unless it fails, all destinations will receive it.  Group
+   addressing is discussed in Sec. 6.5.
+
+   The failure model that one adopts has a considerable impact on the
+   structure of the resulting system.  We adopted the model of fail-stop
+   processors [Schneider]: when failures occur, a processor simply stops
+   (crashes), as do all the processes executing on it.  We also assume
+   that individual processes can crash, and that this is detected when
+   it occurs by a monitoring mechanism present at each site.  Further
+   assumptions are sometimes made about the availability of synchronized
+   realtime clocks.  Here, we adopt the position that although reason-
+   ably accurate elapsed-time clocks may be available, closely synchron-
+   ized clocks probably will not be.  For example, the 60Hz "line"
+   clocks commonly used on current workstations are only accurate to
+   16ms.  On the other hand, 4-8ms inter-site message transit times are
+   common and 1-2ms are reported increasingly often.  Thus, it is impos-
+   sible to synchronize clocks to better than 32-48ms, enough time for a
+   pair of sites to exchange between 4 and 50 messages.  Even with
+   advancing technology, it seems safe to assume that clock skew will
+   remain "large" when compared to inter-site message transmission
+   speed.  In particular, this argues against time-based protocols such
+   as the one used in [Cristian]
+
+
+
+
+
+Birman & Joseph                                                 [Page 7]
+
+RFC 992                                                    November 1986
+
+
+   6.1 The GBCAST primitive
+
+       GBCAST (group multicast) is the most constrained, and costly, of
+       the three primitives.  It is used to transmit information about
+       failures and recoveries to members of a process group.  A recov-
+       ering member uses GBCAST to inform the operational ones that it
+       has become available.  Additionally, when a member fails, the
+       system arranges for a GBCAST to be issued to group members on its
+       behalf, informing them of its failure.  Arguments to GBCAST are a
+       message and a process group identifier, which is translated into
+       a set of destinations as described below (Sec. 6.5).
+
+       Our GBCAST protocol ensures that if any process receives a multi-
+       cast B before receiving a GBCAST G, then all overlapping destina-
+       tions will receive B before G <1> This is true regardless of the
+       type of multicast involved.  Moreover, when a failure occurs, the
+       corresponding GBCAST message is delivered after any other multi-
+       casts from the failed process.  Each member can therefore main-
+       tain a VIEW listing the membership of the process group, updating
+       it when a GBCAST is received.  Although VIEW's are not updated
+       simultaneously in real time, all members observe the same
+       sequence of VIEW changes.  Since, GBCAST's are ordered relative
+       to all other multicasts, all members receiving a given multicast
+       will have the same value of VIEW when they receive it.
+
+       Notice that GBCAST also provides a convenient way to change other
+       global properties of a group "atomically".  In our work, we have
+       used GBCAST to dynamically change a ranking on the members of a
+       group, to request that group members establish checkpoints for
+       use if recovery is needed after all failure, and to implement
+       process migration.  In each case, the ordering of GBCAST relative
+       to other events that makes it possible to perform the desired
+       action without running any additional protocol.  Other uses for
+       GBCAST will no doubt emerge as our research continues.
+
+       Members of a process group can also use the value of VIEW to pick
+       a strategy for processing an incoming request, or to react to
+       failure or recovery without having to run any special protocol
+       first.  Since the GBCAST ordering is the same everywhere, their
+       actions will all be consistent.  Notice that when all the members
+       of a process group may have failed, GBCAST also provides an inex-
+       pensive way to determine the last site that failed: process group
+       members simply log each value of VIEW that becomes defined on
+       stable storage before using it; a simplified version of the algo-
+       rithm in [Skeen-a] can then be executed when recovering from
+       failure.
+
+
+
+
+
+
+
+
+Birman & Joseph                                                 [Page 8]
+
+RFC 992                                                    November 1986
+
+
+   6.2 The ABCAST primitive
+
+       The GBCAST primitive is too costly to be used for general commun-
+       ication between process group members.  This motivates the intro-
+       duction of weaker (less ordered) primitives, which might be used
+       in situations where a total order on multicast messages is not
+       necessary.  Our second primitive, ABCAST (atomic multicast),
+       satisfies such a weaker constraint.  Specifically, it is often
+       desired that if two multicasts are received in some order at a
+       common destination site, they be received in that order at all
+       other common destinations, even if this order was not predeter-
+       mined.  For example, if a process group is being used to maintain
+       a replicated queue and ABCAST is used to transmit queue opera-
+       tions to all copies, the operations will be done in the same
+       order everywhere, hence the copies of the queue will remain mutu-
+       ally consistent.  The primitive ABCAST(msg, label, dests) pro-
+       vides this behavior.  Two ABCAST's having the same label are
+       delivered in the same order at all common destinations.
+
+   6.3 The CBCAST primitive
+
+       Our third primitive, CBCAST (causal multicast), is weakest in the
+       sense that it involves less distributed synchronization then
+       GBCAST or ABCAST.  CBCAST(msg, dests) atomically delivers msg to
+       each operational dest.  The CBCAST protocol ensures that if two
+       multicasts are potentially causally dependent on another, then
+       the former is delivered after the latter at all overlapping des-
+       tinations.  A multicast B' is potentially causally dependent on a
+       multicast B if both multicasts originate from the same process,
+       and B' is sent after B, or if there exists a chain of message
+       transmissions and receptions or local events by which knowledge
+       could have been transferred from the process that issued B to the
+       process that issued B' [Lamport].  For causally independent mul-
+       ticasts, the delivery ordering is not constrained.
+
+       CBCAST is valuable in systems like ISIS, where concurrency con-
+       trol algorithms are used to synchronize concurrent computations.
+       In these systems, if two processes communicate concurrently with
+       the same process the messages are almost always independent ones
+       that can be processed in any order: otherwise, concurrency con-
+       trol would have caused one to pause until the other was finished.
+       On the other hand, order is clearly important within a causally
+       linked series of multicasts, and it is precisely this sort of
+       order that CBCAST respects.
+
+   6.4 Other multicast primitives
+
+       A weaker multicast primitive is reliable multicast, which pro-
+       vides all-or-nothing delivery, but no ordering properties.  The
+       formulation of CBCAST in [Birman-b] actually includes a mechanism
+       for performing multicasts of this sort, hence no special
+
+
+
+Birman & Joseph                                                 [Page 9]
+
+RFC 992                                                    November 1986
+
+
+       primitive is needed for the purpose.  Additionally, there may be
+       situations in which ABCAST protocols that also satisfy a CBCAST
+       ordering property would be valuable.  Our ABCAST primitive could
+       be changed to respect such a rule, and we made use of a multicast
+       primitive that is simultaneously causal and atomic in our work on
+       consistent shared bulletin boards ([Birman-c]).  For simplicity,
+       the presentation here assumes that ABCAST is completely orthogo-
+       nal to CBCAST, but a simple way to build an efficient "causal
+       atomic" multicast is described in our full-length paper.  The
+       cost of this protocol is only slightly higher than that of
+       ABCAST.
+
+   6.5 Group addressing protocol
+
+       Since group membership can change dynamically, it may be diffi-
+       cult for a process to compute a list of destinations to which a
+       message should be sent, for example, as is needed to perform a
+       GBCAST.  In [Birman-b] we report on a protocol for ensuring that
+       a given multicast will be delivered to all members of a process
+       group in the same view.  This view is either the view that was
+       operative when the message transmission was initiated, or a view
+       that was defined subsequently.  The algorithm is a simple itera-
+       tive one that costs nothing unless the group membership changes,
+       and permits the caching of possibly inaccurate membership infor-
+       mation near processes that might want to communicate with a
+       group.  Using the protocol, a flexible message addressing scheme
+       can readily be supported.
+
+       Iterative addressing is only required when the process transmit-
+       ting a message has an inaccurate copy of the process group view.
+       In the implementation we are now building, this would rarely be
+       the case, and iteration is never needed if the view is known to
+       be accurate.  Thus, iterated delivery should be very infrequent.
+
+   6.6 Synchronous versus asynchronous multicast abstractions
+
+       Many systems employ RPC internally, as a lowest level primitive
+       for interaction between processes.  It should be evident that all
+       of our multicast primitives can be used to implement replicated
+       remote procedure calls [Cooper]: the caller would simply pause
+       until replies have been received from all the participants
+       (observation of a failure constitutes a reply in this case).  We
+       term such a use of the primitives synchronous, to distinguish it
+       from from an asynchronous multicast in which no replies, or just
+       one reply, suffices.
+
+       In our work on ISIS, GBCAST and ABCAST are normally invoked syn-
+       chronously, to implement a remote procedure call by one member of
+       an object on all the members of its process group.  However,
+       CBCAST, which is the most frequently used overall, is almost
+       never invoked synchronously.  Asynchronous CBCAST's are the
+
+
+
+Birman & Joseph                                                [Page 10]
+
+RFC 992                                                    November 1986
+
+
+       primary source of concurrency in ISIS: although the delivery ord-
+       ering is assured, transmission can be delayed to enable a message
+       to be piggybacked on another, or to schedule IO within the system
+       as a whole.  While the system cannot defer an asynchronous multi-
+       cast indefinitely, the ability to defer it a little, without
+       delaying some computation by doing so, permits load to be
+       smoothed.  Since CBCAST respects the delivery orderings on which
+       a computation might depend, and is ordered with respect to
+       failures, the concurrency introduced does not complicate higher
+       level algorithms.  Moreover, the protocol itself is extremely
+       cheap.
+
+       A problem is introduced by our decision to allow asynchronous
+       multicasts: the atomic reception property must now be extended to
+       address causally related sequences of asynchronous messages.  If
+       a failure were to result in some multicasts being delivered to
+       all their destinations but others that precede them not being
+       delivered anywhere, inconsistency might result even if the desti-
+       nations do not overlap.  We therefore extend the atomicity pro-
+       perty as follows.  If process t receives a message m from process
+       s, and s subsequently fails, then unless t fails as well, all
+       messages m' that s received prior to its failure must be
+       delivered to their remaining operational destinations.  This is
+       because the state of t may now depend on the contents of any such
+       m', hence the system state could become inconsistent if the
+       delivery of m' were not completed.  The costs of the protocols
+       are not affected by this change.
+
+       A second problem arises when the user-level implications of this
+       atomicity rule are considered.  In the event of a failure, any
+       suffix of a sequence of aysnchronous multicasts could be lost and
+       the system state would still be internally consistent.  A process
+       that is about to take some action that may leave an externally
+       visible side-effect will need a way to pause until it is
+       guaranteed that such multicasts have actually been delivered.
+       For this purpose, a flush primitive is provided.  Occasional
+       calls to flush do not eliminate the benefit of using CBCAST asyn-
+       chronously.  Unless the system has built up a considerable back-
+       log of undelivered multicast messages, which should be rare,
+       flush will only pause while transmission of the last few multi-
+       casts complete.
+
+7. Using the primitives
+
+   The reliable communication primitives described above lead to simple
+   solutions for the problems cited in Sec. 4:
+
+       [1]  Synchronization.  Many synchronization problems are subsumed
+       into the primitives themselves.  For example, consider the use of
+       GBCAST to implement recovery.  A recovering process would issue a
+       GBCAST to the process group members, requesting that state
+
+
+
+Birman & Joseph                                                [Page 11]
+
+RFC 992                                                    November 1986
+
+
+       information be transferred to it.  In addition to sending the
+       current state of the group to the recovering process, group
+       members update the process group view at this time.  Subsequent
+       messages to the group will be delivered to the recovered process,
+       with all necessary synchronization being provided by the ordering
+       properties of GBCAST.  In situations where other forms of syn-
+       chronization are needed, ABCAST provides a simple way to ensure
+       that several processes take actions in the same order, and this
+       form of low-level synchronization simplifies a number of higher-
+       level synchronization problems.  For example, if ABCAST is used
+       to do P() and V() operations on a distributed semaphore, the
+       order of operations on the semaphore is set by the ABCAST, hence
+       all the managers of the semaphore see these operations in a fixed
+       order.
+
+       [2]  Failure detection.  Consistent failure (and recovery) detec-
+       tion are trivial using our primitives: a process simply waits for
+       the appropriate process group view to change.  This facilitates
+       the implementation of algorithms in which one processes monitors
+       the status of another process.  A process that acts on the basis
+       of a process group view change does so with the assurance that
+       other group members will (eventually) observe the same event and
+       will take consistent actions.
+
+       [3]  Consistency.  We believe that consistency is generally
+       expressible as a set of atomicity and ordering constraints on
+       message delivery, particularly causal ones of the sort provided
+       by CBCAST.  Our primitives permit a process to specify the com-
+       munication properties needed to achieve a desired form of con-
+       sistency.  Continued research will be needed to understand pre-
+       cisely how to pick the weakest primitive in a designated situa-
+       tion.
+
+       [4]  Serializability.  To achieve serializability, one implements
+       a concurrency control algorithm and then forces computations to
+       respect the serialization order that this algorithm choses.  The
+       ABCAST primitive, as observed above, is a powerful tool for
+       establishing an order between concurrent events, e.g. by lock
+       acquisition.  Having established such an order, CBCAST can be
+       used to distribute information about the computation and also its
+       termination (commit or abort).  Any process that observes the
+       commit or abort of a computation will only be able to interact
+       with data managers that have received messages preceding the com-
+       mit or abort, hence a highly asynchronous transactional execution
+       results.  If a process running a computation fails, this is
+       detected when a failure GBCAST is received instead of the commit.
+       Thus, executions are simple and quite deterministic.
+
+       If commit is conditional, CBCAST would be used to first interro-
+       gate participants to learn if they are prepared to commit, and
+       then to transmit the commit or abort decision (the usual two-
+
+
+
+Birman & Joseph                                                [Page 12]
+
+RFC 992                                                    November 1986
+
+
+       phase commit).  On the other hand, conditional commits can often
+       be avoided using our approach.  A method for building transac-
+       tions that will roll-forward after failure after failure is dis-
+       cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b].  Other
+       forms of concurrency control, such as timestamp generation, can
+       similarly be implemented using ABCAST and CBCAST.  We view tran-
+       sactional data storage as an application-level concern, which can
+       be handled using a version stack approach or a multi-version
+       store, or any other appropriate mechanism.
+
+8. Implementation
+
+   The communication primitives can be built in layers, starting with a
+   bare network providing unreliable Internet datagrams.  The software
+   structure is, however, less mature and more complex than the one sug-
+   gested in RFC's 966 and 988.  For example, at this stage of our
+   research we do not understand how to optimize our protocols to the
+   same extent as for the unreliable host multicast approach described
+   in those RFC's.  Thus, the implementation we describe here should be
+   understood to be a prototype.  A particularly intriguing question,
+   which we are investigating actively, concerns the use of a "best
+   effort" ethernet or Internet multicast as a tool to optimize the
+   implementation of our protocols.
+
+   Our basic approach is to view large area networks as a set of clus-
+   ters of sites interconnected by high speed LAN devices and intercon-
+   nected by slower long-haul links.  We first provide protocols for use
+   within clusters, and then extend them to run between clusters too.
+   Network partitioning can be tolerated at all levels of the hierarchy
+   in the sense that no incorrect actions can result after network par-
+   titioning, although our approach will sometimes block until the par-
+   tition is repaired.  Our protocols also tend to block within a clus-
+   ter while the list of operational sites for that cluster is being
+   changed.  In normal LAN's, this happens infrequently (during site
+   failure or recovery), and would not pose a problem.  (In failure
+   intensive applications, alternative protocols might be needed to
+   address this issue).
+
+   The lowest level of our software uses a site-to-site acknowledgement
+   protocol to convert the unreliable packet transport this into a
+   sequenced, error-free message abstraction, using timeouts to detect
+   apparent failures.  TCP can also be used for this purpose, provided
+   that a "filter" is placed on the incoming message stream and certain
+   types of messages are handled specially.  An agreement protocol is
+   then used to order the site-failures and recoveries consistently.  If
+   timeouts cause a failure to be detected erroneously, the protocol
+   forces the affected site to undergo recovery.
+
+   Built on this is a layer that supports the primitives themselves.
+   CBCAST has a very light-weight implementation, based on the idea of
+   flooding the system with copies of a message: Each process buffers
+
+
+
+Birman & Joseph                                                [Page 13]
+
+RFC 992                                                    November 1986
+
+
+   copies of any messages needed to ensure the consistency of its view
+   of the system.  If message m is delivered to process p, and m is
+   potentially causally dependent on a message m prime, then a copy of m
+   prime is sent to p as well (duplicates are discarded).  A garbage
+   collector deletes superfluous copies after a message has reached all
+   its destinations.  By using extensive piggybacking and a simple
+   scheduling algorithm to control message transmission, the cost of a
+   CBCAST is kept low -- often, less than one packet per destination.
+   ABCAST employs a two-phase protocol based on one suggested to us by
+   Skeen [Skeen-b].  This protocol has higher latency than CBCAST
+   because delivery can only occur during the second phase; ABCAST is
+   thus inherently synchronous.  In ISIS, however, ABCAST is used
+   rarely; we believe that this would be the case in other systems as
+   well.  GBCAST is implemented using a two-phase protocol similar to
+   the one for ABCAST, but with an additional mechanism that flushes
+   messages from a failed process before delivering the GBCAST announc-
+   ing the failure.  Although GBCAST is slower than ABCAST or CBCAST, it
+   is used rarely enough so that performance is probably less of an
+   issue here -- and in any case, even GBCAST could be tuned to give
+   very high throughput.  Preliminary performance figures appear in
+   [Birman-b].
+
+   Although satisfactory performance should be possible using an imple-
+   mentation that sits on top of a conventional Internet mechanism, it
+   should be noted that to achieve really high rates of communication
+   the layers of software described above must reside in the kernel,
+   because they run on behalf of large numbers of clients, run fre-
+   quently, and tend to execute for very brief periods before doing I/O
+   and pausing.  A non-kernel implementation will thus incur high
+   scheduling and context switching overhead.  Additionally, it is not
+   at all clear how to use ethernet style broadcast mechanisms to optim-
+   ize the performance of this sort of protocol, although it should be
+   possible.  We view this as an interesting area for research.
+
+   A forthcoming paper will describe higher level software that we are
+   building on top of the basic fault-tolerant process group mechanism
+   described above.
+
+9. Conclusions
+
+   The experience of implementing a substantial fault-tolerant system
+   left us with insights into the properties to be desired from a com-
+   munication subsystem.  In particular, we became convinced that to
+   build a reliable distributed system, one must start with a reliable
+   communication subsystem.  The multicast primitives described in this
+   memo present a simple interface, achieve a high level of concurrency,
+   can be used in both local and wide area networks, and are applicable
+   to software ranging from distributed database systems to the fault-
+   tolerant objects and bulletin boards provided by ISIS.  Because they
+   are integrated with failure handling mechanisms and respect desired
+   event orderings, they introduce a desirable form of determinism into
+
+
+
+Birman & Joseph                                                [Page 14]
+
+RFC 992                                                    November 1986
+
+
+   distributed computation without compromising efficiency.  A conse-
+   quence is that high-level algorithms are greatly simplified, reducing
+   the probability of error.  We believe that this is a very promising
+   and practical approach to building large fault-tolerant distributed
+   systems, and it is the only one we know of that leads to a rigorous
+   form of confidence in the resulting software.
+
+NOTES:
+
+   <1> A problem arises if a process p fails without receiving some mes-
+   sage after that message has already been delivered to some other pro-
+   cess q: q's VIEW when it received the message would show p to be
+   operational; hence, q will assume that p received the message,
+   although p is physically incapable of doing so.  However, the state
+   of the system is now equivalent to one in which p did receive the
+   message, but failed before acting on it.  In effect, there exists an
+   interpretation of the actual system state that is consistent with q's
+   assumption.  Thus, GBCAST satisfies the sort of logical delivery pro-
+   perty cited in the introduction.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Birman & Joseph                                                [Page 15]
+
+RFC 992                                                    November 1986
+
+
+10. References
+
+[RFC966] Deering, S. and Cheriton, D.  Host groups: A multicast exten-
+      sion to the internet protocol.  Stanford University, December
+      1985.
+
+[RFC988] Deering, S.  Host extensions for IP multicasting.  Stanford
+      University, July 1986.
+
+[Allchin] Allchin, J., McKendry, M.  Synchronization and recovery of
+      actions.  Proc. 2nd ACM SIGACT/SIGOPS Principles of Distributed
+      Computing, Montreal, Canada, 1983.
+
+[Babaoglu] Babaoglu, O., Drummond, R.  The streets of Byzantium: Network
+      architectures for fast reliable multicast.  IEEE Trans. on
+      Software Engineering TSE-11, 6 (June 1985).
+
+[Bernstein] Bernstein, P., Goodman, N.  Concurrency control algorithms
+      for replicated database systems.  ACM Computing Surveys 13, 2
+      (June 1981), 185-222.
+
+[Birman-a] Birman, K.  Replication and fault-tolerance in the ISIS sys-
+      tem.  Proc. 10th ACM SIGOPS Symposium on Operating Systems Princi-
+      ples.  Orcas Island, Washington, Dec. 1985, 79-86.
+
+[Birman-b] Birman, K., Joseph, T.  Reliable communication in the pres-
+      ence of failures.  Dept. of Computer Science, Cornell Univ., TR
+      85-694, Aug. 1985.  To appear in ACM TOCS (Feb. 1987).
+
+[Birman-c] Birman, K., Joseph, T., Stephenson, P.  Programming with
+      fault tolerant bulletin boards in asynchronous distributed sys-
+      tems.  Dept. of Computer Science, Cornell Univ., TR 85-788, Aug.
+      1986.
+
+[Birrell] Birrell, A., Nelson, B.  Implementing remote procedure calls.
+      ACM Transactions on Computer Systems 2, 1 (Feb. 1984), 39-59.
+
+[Chang] Chang, J., Maxemchuck, M. Reliable multicast protocols.  ACM
+      TOCS 2, 3 (Aug. 1984), 251-273.
+
+[Cheriton] Cheriton, D. The V Kernel: A software base for distributed
+      systems.  IEEE Software 1 12, (1984), 19-43.
+
+[Cooper] Cooper, E. Replicated procedure call.  Proc. 3rd ACM Symposium
+      on Principles of Distributed Computing., August 1984, 220-232.
+      (May 1985).
+
+[Cristian] Cristian, F. et al Atomic multicast: From simple diffusion to
+      Byzantine agreement.  IBM Technical Report RJ 4540 (48668), Oct.
+      1984.
+
+
+
+
+Birman & Joseph                                                [Page 16]
+
+RFC 992                                                    November 1986
+
+
+[Eswaren] Eswaren, K.P., et al The notion of consistency and predicate
+      locks in a database system.  Comm. ACM 19, 11 (Nov. 1976), 624-
+      633.
+
+[Hadzilacos] Hadzilacos, V.  Byzantine agreement under restricted types
+      of failures (not telling the truth is different from telling of
+      lies).  Tech. ARep. TR-19-83, Aiken Comp. Lab., Harvard University
+      (June 1983).
+
+[Halpern] Halpern, J., and Moses, Y.  Knowledge and common knowledge in
+      a distributed environment.  Tech. Report RJ-4421, IBM San Jose
+      Research Laboratory, 1984.
+
+[Joseph-a] Joseph, T.  Low cost management of replicated data.  Ph.D.
+      dissertation, Dept. of Computer Science, Cornell Univ., Ithaca
+      (Dec. 1985).
+
+[Joseph-b] Joseph, T., Birman, K.  Low cost management of replicated
+      data in fault-tolerant distributed systems.  ACM TOCS 4, 1 (Feb
+      1986), 54-70.
+
+[Lamport] Lamport, L.  Time, clocks, and the ordering of events in a
+      distributed system.  CACM 21, 7, July 1978, 558-565.
+
+[Lazowska] Lazowska, E. et al The architecture of the EDEN system.
+      Proc. 8th Symposium on Operating Systems Principles, Dec. 1981,
+      148-159.
+
+[Liskov] Liskov, B., Scheifler, R. Guardians and actions: Linguistic
+      support for robust, distributed programs.  ACM TOPLAS 5, 3 (July
+      1983), 381-404.
+
+[Moss] Moss, E.  Nested transactions: An approach to reliable, distri-
+      buted computing.  Ph.D. thesis, MIT Dept of EECS, TR 260, April
+      1981.
+
+[Papadimitrou] Papadimitrou, C.  The serializability of concurrent data-
+      base updates.  JACM 26, 4 (Oct. 1979), 631-653.
+
+[Popek] Popek, G. et al.  Locus: A network transparent, high reliability
+      distributed system.  Proc. 8th Symposium on Operating Systems
+      Principles, Dec. 1981, 169-177.
+
+[Schlicting] Schlicting, R, Schneider, F.  Fail-stop processors: An
+      approach to designing fault-tolerant distributed computing sys-
+      tems.  ACM TOCS 1, 3, August 1983, 222-238.
+
+[Schneider] Schneider, F., Gries, D., Schlicting, R.  Reliable multicast
+      protocols.  Science of computer programming 3, 2 (March 1984).
+
+[Skeen-a] Skeen, D.  Determining the last process to fail.  ACM TOCS 3,
+
+
+
+Birman & Joseph                                                [Page 17]
+
+RFC 992                                                    November 1986
+
+
+      1, Feb. 1985, 15-30.
+
+[Skeen-b] Skeen, D.  A reliable multicast protocol.  Unpublished.
+
+[Spector] Spector, A., et al  Distributed transactions for reliable sys-
+      tems.  Proc. 10th ACM SIGOPS Symposium on Operating Systems Prin-
+      ciples, Dec. 1985, 127-146.
+
+[Strong] Strong, H.R., Dolev, D. Byzantine agreement. Digest of papers,
+      Spring Compcon 83, San Francisco, CA, March 1983, 77-81.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Birman & Joseph                                                [Page 18]
+
-- 
cgit v1.2.3