author     Thomas Voss <mail@thomasvoss.com>          2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>          2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc992.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc992.txt')
-rw-r--r--   doc/rfc/rfc992.txt   1060
1 file changed, 1060 insertions, 0 deletions
diff --git a/doc/rfc/rfc992.txt b/doc/rfc/rfc992.txt new file mode 100644 index 0000000..1d21ad9 --- /dev/null +++ b/doc/rfc/rfc992.txt @@ -0,0 +1,1060 @@ + + K. P. Birman (Cornell) +Network Working Group T. A. Joseph (Cornell) +Request for Comments: 992 November 1986 + + + + On Communication Support for Fault Tolerant Process Groups + + K. P. Birman and T. A. Joseph + Dept. of Computer Science, Cornell University + Ithaca, N.Y. 14853 + 607-255-9199 + + +1. Status of this Memo. + + This memo describes a collection of multicast communication primi- + tives integrated with a mechanism for handling process failure and + recovery. These primitives facilitate the implementation of fault- + tolerant process groups, which can be used to provide distributed + services in an environment subject to non-malicious crash failures. + Unlike other process group approaches, such as Cheriton's "host + groups" (RFC's 966, 988, [Cheriton]), our approach provides powerful + guarantees about the behavior of the communication subsystem when + process group membership is changing dynamically, for example due to + process or site failures, recoveries, or migration of a process from + one site to another. Our approach also addresses delivery ordering + issues that arise when multiple clients communicate with a process + group concurrently, or a single client transmits multiple multicast + messages to a group without pausing to wait until each is received. + Moreover, the cost of the approach is low. An implementation is be- + ing undertaken at Cornell as part of the ISIS project. + + Here, we argue that the form of "best effort" reliability provided by + host groups may not address the requirements of those researchers who + are building fault tolerant software. Our basic premise is that re- + liable handling of failures, recoveries, and dynamic process migra- + tion are important aspects of programming in distributed environ- + ments, and that communication support that provides unpredictable + behavior in the presence of such events places an unacceptable burden + of complexity on higher level application software. This complexity + does not arise when using the fault-tolerant process group alterna- + tive. + + This memo summarizes our approach and briefly contrasts it with other + process group approaches. For a detailed discussion, together with + figures that clarify the details of the approach, readers are re- + ferred to the papers cited below. + + Distribution of this memo is unlimited. + + + + +Birman & Joseph [Page 1] + +RFC 992 November 1986 + + +2. Acknowledgments + + This memo was adopted from a paper presented at the Asilomar workshop + on fault-tolerant distributed computing, March 1986, and summarizes + material from a technical report that was issued by Cornell Universi- + ty, Dept. of Computer Science, in August 1985, which will appear in + ACM Transactions on Computer Systems in February 1987 [Birman-b]. + Copies of these paper, and other relevant papers, are available on + request from the author: Dept. of Computer Science, Cornell Universi- + ty, Ithaca, New York 14853. (birman@gvax.cs.cornell.edu). The ISIS + project also maintains a mailing list. To be added to this list, + contact M. Schmizzi (schiz@gvax.cs.cornell.edu). + + This work was supported by the Defense Advanced Research Projects + Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and by + the National Science Foundation under grant DCR-8412582. 
The views, + opinions and findings contained in this report are those of the au- + thors and should not be construed as an official Department of De- + fense position, policy, or decision. + +3. Introduction + + At Cornell, we recently completed a prototype of the ISIS system, + which transforms abstract type specifications into fault-tolerant + distributed implementations, while insulating users from the mechan- + isms by which fault-tolerance is achieved. This version of ISIS, re- + ported in [Birman-a], supports transactional resilient objects as a + basic programming abstraction. Our current work undertakes to pro- + vide a much broader range of fault-tolerant programming mechanisms, + including fault-tolerant distributed bulletin boards [Birman-c] and + fault-tolerant remote procedure calls on process groups [Birman-b]. + The approach to communication that we report here arose as part of + this new version of the ISIS system. + + Unreliable communication primitives, such as the multicast group com- + munication primitives proposed in RFC's 966 and 988 and in [Cheri- + ton], leave some uncertainty in the delivery status of a message when + failures and other exceptional events occur during communication. + Instead, a form of "best effort" delivery is provided, but with the + possibility that some member of a group of processes did not receive + the message if the group membership was changing just as communica- + tion took place. When we tried to use this sort of primitive in our + original work on ISIS, which must behave reliably in the presence of + such events, we had to address this aspect at an application level. + The resulting software was complex, difficult to reason about, and + filled with obscure bugs, and we were eventually forced to abandon + the entire approach as infeasible. + + A wide range of reliable communication primitives have been proposed + in the literature, and we became convinced that by using them, the + complexity of our software could be greatly reduced. These range + + + +Birman & Joseph [Page 2] + +RFC 992 November 1986 + + + from reliable and atomic broadcast [Chang] [Cristian] [Schneider] to + Byzantine agreement [Strong]. For several reasons, however, the ex- + isting work does not solve the problem at hand. The most obvious is + that they do not provide a mechanism for sending a message to all the + members of a group when the membership is changing dynamically (the + "group addressing" problem). In addition, one can identify delivery + ordering issues and questions regarding the detection of communica- + tion failures that should be handled within the broadcast mechanism. + These motivate a careful reexamination of the entire reliable broad- + cast problem. + + The multicast primitives we report here are designed to respect + several sorts of ordering constraints, and have cost and latency that + varies depending on the nature of the constraint required [Birman-b] + [Joseph-a] [Joseph-b]. Failure and recovery are integrated into the + communication subsystem by treating these events as a special sort of + multicast issued on behalf of a process that has failed or recovered. + The primitives are presented in the context of fault tolerant process + groups: groups of processes that cooperate to implement some distri- + buted algorithm or service, and which need to see consistent order- + ings of system events in order to achieve mutually consistent + behavior. 
Such groups are similar to the host groups of the V system and the ones described in RFC's 966 and 988, but provide guarantees of consistency in just the situations where a host group provides a "best effort" delivery which may sometimes be erroneous.

It is helpful to think of our primitives as providing a logical or "virtual" form of reliability: rather than addressing physical delivery issues, they ensure that a client will never observe a system state "inconsistent" with the assumption that reliable delivery has occurred. Readers familiar with serializability theory may want to think of this as a weaker analog: in serializability, one allows interleaved executions of operations provided that the resulting system state is consistent with the assumption that execution was sequential. Similarly, reliable communication primitives permit deviations from the reliable delivery abstraction provided that the resulting system state is indistinguishable from one in which reliable delivery actually did occur.

Using our primitives, the ISIS system achieved both high levels of concurrency and surprisingly good performance. Equally important, its structure was made surprisingly simple, making it feasible to reason about the correctness of the algorithms that are needed to maintain high availability even when failures, recoveries, or process migration occurs. More recently, we have applied the same approach to a variety of other problems in distributed computing, and even designed a consistent, fault tolerant, distributed bulletin board data structure (a generalized version of the blackboards used in artificial intelligence programs), with equally good results [Birman-c]. Thus, we feel that the approach has been shown to work in a variety of settings where unreliable primitives simply could not be used.

In the remainder of this memo we summarize the issues and alternatives that the designer of a distributed system is presented with, focusing on two styles of support for fault-tolerant computing: remote procedure calls coupled with a transactional execution facility, such as is used in the ARGUS system [Liskov], and the fault-tolerant process group mechanism mentioned above. We argue that transactional interactions are too restrictive to support the sort of mechanism needed, and then show how our primitives can be used to provide such a mechanism. We conclude by speculating on future directions in which this work might be taken.

4. Issues in fault-tolerance

The difficulty of constructing fault-tolerant distributed software can be traced to a number of interrelated issues. The list that follows is not exhaustive, but attempts to touch on the principal considerations that must be addressed in any such system:

[1] Synchronization. Distributed systems offer the potential for large amounts of concurrency, and it is usually desirable to operate at as high a level of concurrency as possible. However, when we move from a sequential execution environment to a concurrent one, it becomes necessary to synchronize actions that may conflict in their access to shared data or entail communication with overlapping sets of processes. Thus, a mechanism is needed for ordering conflicting events. Additional problems that can arise in this context include deadlock avoidance or detection, livelock avoidance, etc.
+ + [2]Failure detection. It is usually necessary for a fault- + tolerant application to have a consistent picture of which com- + ponents fail, and in what order. Timeout, the most common mechan- + ism for detecting failure, is unsatisfactory, because there are + many situations in which a healthy component can timeout with + respect to one component without this being detected by some + another. Failure detection under more rigorous requirements + requires an agreement protocol that is related to Byzantine agree- + ment [Strong] [Hadzilacos]. Regardless of how this problem is + solved, some sort of reliable failure detection mechanism will be + needed in any fault-tolerant distributed system. + + [3] Consistency. When a group of processes cooperate in a distri- + buted system, it is necessary to ensure that the operational + processes have consistent views of the state of the group as a + whole. For example, if process p believes that some property A + holds, and on the basis of this interacts with process q, the + state of q should not contradict the fact that p believes A to be + true. This problem is closely related to notions of knowledge and + consistency in distributed systems [Halpern] [Lamport]. In our + context, A will often be the assertion that a multicast has been + received by q, or that q saw some sequence of events occur in the + + + +Birman & Joseph [Page 4] + +RFC 992 November 1986 + + + same order as did p. Thus, it is necessary to be able to specify + the precise consistency constraints on a distributed software sys- + tem, and system support should be available to facilitate the + attainment of these constraints. + + [4] Serializability. Many distributed systems are partitioned + into data manager processes, which implement shared variables, and + transaction manager processes, which issue requests to data + managers [Bernstein]. If transaction managers can execute con- + currently, it is desirable to ensure that transactions produce + serializable outcomes [Eswaren] [Papadimitrou]. Serializability + is increasingly viewed as an important property in "object- + oriented" distributed systems that package services as abstract + objects with which clients communicate by remote procedure calls + (RPC). On the other hand, there are systems for which serializa- + bility is either too strong a constraint, or simply inappropriate. + Thus, one needs a way to achieve serializability in applications + where it will be needed, without imposing system-wide restrictions + that would prevent the design of software subsystems for which + serializability is not needed. + + Jointly, these problems render the design of fault-tolerant distri- + buted software daunting in the absence of adequate support. The + correctness of any proposed design and of its implementation become + serious, if not insurmountable, concerns. In Sec. 7, we will show + how the primitives of Sec. 6 provide simple ways to overcome all of + these issues. + +5. Existing alternatives + + If one rules out "unreliable" communication mechanisms, there are + basically two fault-tolerant alternatives that can be pursued. + + The first approach is to provide mechanisms for transactional + interactions between processes that communicate using remote pro- + cedure calls [Birrell]. 
This has led to work on nested transactions (due to nested RPC's) [Moss], support for transactions at the language level [Liskov], transactions within an operating system kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional access to higher-level replicated services, such as resilient objects in ISIS or relations in database systems. The primitives in a transactional system provide mechanisms for distributing the request that initiates the transaction, accessing data (which may be replicated), performing concurrency control, and implementing commit or abort. Additional mechanisms are normally needed for orphan termination, deadlock detection, etc. The issue then arises of how these mechanisms should themselves be implemented.

Our work in ISIS leads us to believe that whereas transactions are easily implemented on top of fault-tolerant process groups -- we have done so -- the converse is much harder. Moreover, transactions represent a relatively heavy-weight solution to the problems surveyed in the previous section, and might impose an unacceptable overhead on subsystems that need to run non-transactionally, for example because a pair of concurrent processes needs to interact on a frequent basis. (We are not saying that "transactional" mechanisms such as cobegins and toplevel actions can't solve this problem, but just that they yield a solution that is awkward and costly). This sort of reasoning has led us to focus on non-transactional interaction mechanisms, and to treat transactions as a special class of mechanisms used when processes that have been designed to employ a transactional protocol interact.

The second approach involves the provision of a communication primitive, such as atomic broadcast, which can be used as the framework on which higher level algorithms are designed. Such a primitive seeks to deliver messages reliably to some set of destinations, despite the possibility that failures might occur during the execution of the protocol. Above, we termed this the fault tolerant process group approach, since it lends itself to the organization of cooperating processes into groups, as described in the introduction. Process groups are an extremely flexible abstraction, and have been employed in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS system. A proposal to provide Internet support for host groups was raised in RFC's 966 and 988. However, the idea of adapting the process group approach to work reliably in an environment subject to the sorts of exception events and concurrency cited in the previous section seems to be new.

As noted earlier, existing reliable communication protocols do not address the requirements of fault-tolerant process groups. For example, in [Schneider], an implementation of a reliable multicast primitive is described. Such a primitive ensures that a designated message will be transmitted from one site to all other operational sites in a system; if a failure occurs but any site has received the message, all will eventually do so.
[Chang] and [Cristian] describe + implementations for atomic broadcast, which is a reliable broadcast + (sent to all sites in a system) with the additional property that + messages are delivered in the same order at all overlapping destina- + tions, and this order preserves the transmission order if messages + originate in a single site. + + Atomic broadcast is a powerful abstraction, and essentially the same + behavior is provided by one of the multicast primitives we discuss in + the next section. However, it has several drawbacks which made us + hesitant to adopt it as the only primitive in the system. Most seri- + ous is the latency that is incurred in order to satisfy the delivery + ordering property. Without delving deeply into the implementations, + which are based on a token scheme in [Chang] and an acknowledgement + protocol in [Schneider], we observe that the delaying of certain mes- + sages is fundamental to the establishment of a unique global delivery + ordering; indeed, it is easy to prove on knowledge theoretic grounds + + + +Birman & Joseph [Page 6] + +RFC 992 November 1986 + + + that this must always be the case. In [Chang] a primary goal is to + minimize the number of messages sent, and the protocol given performs + extremely well in this regard. However, a delay occurs while waiting + for tokens to arrive and the delivery latency that results may be + high. [Cristian] assumes that clocks are closely synchronized and + that message transit times are bounded by well-known constants, and + uses this to derive atomic broadcast protocols tolerant of increas- + ingly severe classes of failures. The protocols explicitly delay + delivery to achieve the desired global ordering on multicasts. For + reasons discussed below, this tends to result in high latency in typ- + ical local networking environments. An additional drawback of the + atomic broadcast protocols is that no mechanism is provided for + ensuring that all processes observe the same sequence of failures and + recoveries, or for ensuring that failures and recoveries are ordered + relative to ongoing multicasts. Since this problem arises in any + setting where one process monitors another, we felt it should be + addressed at the same level as the communication protocol. Finally, + one wants a group oriented multicast protocol, not a site oriented + broadcast, and this issue must be resolved too. + +6. Our multicast primitives + + We now describe three multicast protocols - GBCAST, ABCAST, and + CBCAST - for transmitting a message reliably from a sender process to + some set of destination processes. Details of the protocols and + their correctness proofs can be found in [Birman-b]. The protocols + ensure "all or nothing" behavior: if any destination receives a mes- + sage, then unless it fails, all destinations will receive it. Group + addressing is discussed in Sec. 6.5. + + The failure model that one adopts has a considerable impact on the + structure of the resulting system. We adopted the model of fail-stop + processors [Schneider]: when failures occur, a processor simply stops + (crashes), as do all the processes executing on it. We also assume + that individual processes can crash, and that this is detected when + it occurs by a monitoring mechanism present at each site. Further + assumptions are sometimes made about the availability of synchronized + realtime clocks. 
Here, we adopt the position that although reasonably accurate elapsed-time clocks may be available, closely synchronized clocks probably will not be. For example, the 60Hz "line" clocks commonly used on current workstations are only accurate to 16ms. On the other hand, 4-8ms inter-site message transit times are common and 1-2ms are reported increasingly often. Thus, it is impossible to synchronize clocks to better than 32-48ms, enough time for a pair of sites to exchange between 4 and 50 messages. Even with advancing technology, it seems safe to assume that clock skew will remain "large" when compared to inter-site message transmission speed. In particular, this argues against time-based protocols such as the one used in [Cristian].

6.1 The GBCAST primitive

GBCAST (group multicast) is the most constrained, and costly, of the three primitives. It is used to transmit information about failures and recoveries to members of a process group. A recovering member uses GBCAST to inform the operational ones that it has become available. Additionally, when a member fails, the system arranges for a GBCAST to be issued to group members on its behalf, informing them of its failure. Arguments to GBCAST are a message and a process group identifier, which is translated into a set of destinations as described below (Sec. 6.5).

Our GBCAST protocol ensures that if any process receives a multicast B before receiving a GBCAST G, then all overlapping destinations will receive B before G <1>. This is true regardless of the type of multicast involved. Moreover, when a failure occurs, the corresponding GBCAST message is delivered after any other multicasts from the failed process. Each member can therefore maintain a VIEW listing the membership of the process group, updating it when a GBCAST is received. Although VIEW's are not updated simultaneously in real time, all members observe the same sequence of VIEW changes. Since GBCAST's are ordered relative to all other multicasts, all members receiving a given multicast will have the same value of VIEW when they receive it.

Notice that GBCAST also provides a convenient way to change other global properties of a group "atomically". In our work, we have used GBCAST to dynamically change a ranking on the members of a group, to request that group members establish checkpoints for use if recovery is needed after a failure, and to implement process migration. In each case, it is the ordering of GBCAST relative to other events that makes it possible to perform the desired action without running any additional protocol. Other uses for GBCAST will no doubt emerge as our research continues.

Members of a process group can also use the value of VIEW to pick a strategy for processing an incoming request, or to react to failure or recovery without having to run any special protocol first. Since the GBCAST ordering is the same everywhere, their actions will all be consistent. Notice that when all the members of a process group may have failed, GBCAST also provides an inexpensive way to determine the last site that failed: process group members simply log each value of VIEW that becomes defined on stable storage before using it; a simplified version of the algorithm in [Skeen-a] can then be executed when recovering from failure.
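As an illustration of how a member might exploit these guarantees, the sketch below (written in Python for this memo; the class and method names are our own invention and are not part of the ISIS interface described in [Birman-b]) applies membership changes only when a GBCAST is delivered and logs each view to stable storage before using it. Because every operational member sees the same sequence of views, the ranking test at the end requires no additional protocol.

   # Sketch of a process group member driven by GBCAST deliveries.  All names
   # (Member, View, deliver_gbcast, ...) are hypothetical, for illustration.

   import json

   class View:
       def __init__(self, view_id, members):
           self.view_id = view_id        # position in the agreed sequence of views
           self.members = list(members)  # ranked list of member identifiers

   class Member:
       def __init__(self, my_id, initial_view, log_path):
           self.my_id = my_id
           self.view = initial_view
           self.log_path = log_path

       def deliver_gbcast(self, event, process):
           # Called by the communication layer; every operational member sees
           # the same sequence of calls, ordered relative to other multicasts.
           members = [m for m in self.view.members if m != process]
           if event in ("join", "recover"):
               members.append(process)
           new_view = View(self.view.view_id + 1, members)
           self.log_view(new_view)       # log before use, so the last site to
           self.view = new_view          # fail can be determined after recovery
           if event == "fail" and self.now_highest_ranked():
               self.take_over()          # react to the failure consistently

       def now_highest_ranked(self):
           # All members agree on the ranking because all saw the same views.
           return bool(self.view.members) and self.view.members[0] == self.my_id

       def log_view(self, view):
           with open(self.log_path, "a") as log:
               log.write(json.dumps({"view_id": view.view_id,
                                     "members": view.members}) + "\n")

       def take_over(self):
           pass  # application-specific: reissue pending requests, etc.

A member recovering from a total failure would replay this log before reinitializing the group state, along the lines of the [Skeen-a] algorithm mentioned above.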
6.2 The ABCAST primitive

The GBCAST primitive is too costly to be used for general communication between process group members. This motivates the introduction of weaker (less ordered) primitives, which might be used in situations where a total order on multicast messages is not necessary. Our second primitive, ABCAST (atomic multicast), satisfies such a weaker constraint. Specifically, it is often desired that if two multicasts are received in some order at a common destination site, they be received in that order at all other common destinations, even if this order was not predetermined. For example, if a process group is being used to maintain a replicated queue and ABCAST is used to transmit queue operations to all copies, the operations will be done in the same order everywhere, hence the copies of the queue will remain mutually consistent. The primitive ABCAST(msg, label, dests) provides this behavior. Two ABCAST's having the same label are delivered in the same order at all common destinations.

6.3 The CBCAST primitive

Our third primitive, CBCAST (causal multicast), is weakest in the sense that it involves less distributed synchronization than GBCAST or ABCAST. CBCAST(msg, dests) atomically delivers msg to each operational dest. The CBCAST protocol ensures that if one multicast is potentially causally dependent on another, then the former is delivered after the latter at all overlapping destinations. A multicast B' is potentially causally dependent on a multicast B if both multicasts originate from the same process, and B' is sent after B, or if there exists a chain of message transmissions and receptions or local events by which knowledge could have been transferred from the process that issued B to the process that issued B' [Lamport]. For causally independent multicasts, the delivery ordering is not constrained.

CBCAST is valuable in systems like ISIS, where concurrency control algorithms are used to synchronize concurrent computations. In these systems, if two processes communicate concurrently with the same process, the messages are almost always independent ones that can be processed in any order: otherwise, concurrency control would have caused one to pause until the other was finished. On the other hand, order is clearly important within a causally linked series of multicasts, and it is precisely this sort of order that CBCAST respects.

6.4 Other multicast primitives

A weaker multicast primitive is reliable multicast, which provides all-or-nothing delivery, but no ordering properties. The formulation of CBCAST in [Birman-b] actually includes a mechanism for performing multicasts of this sort, hence no special primitive is needed for the purpose. Additionally, there may be situations in which ABCAST protocols that also satisfy a CBCAST ordering property would be valuable. Our ABCAST primitive could be changed to respect such a rule, and we made use of a multicast primitive that is simultaneously causal and atomic in our work on consistent shared bulletin boards ([Birman-c]). For simplicity, the presentation here assumes that ABCAST is completely orthogonal to CBCAST, but a simple way to build an efficient "causal atomic" multicast is described in our full-length paper.
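To make the CBCAST ordering constraint of Sec. 6.3 concrete, the sketch below shows one standard way of enforcing it: each process keeps a vector of counters recording how many multicasts it has delivered from every other process, and a received message is delayed until everything that causally precedes it has been delivered locally. This is our own illustration of the delivery rule, not the ISIS implementation, which (as outlined in Sec. 8) obtains the same ordering by piggybacking copies of prior messages; the names used here are invented for this memo.

   # Illustration of the CBCAST delivery rule using vector timestamps.  The
   # sender increments its own entry before multicasting, so msg.vt counts the
   # message itself.  Names (CausalReceiver, Message, ...) are hypothetical.

   from collections import namedtuple

   Message = namedtuple("Message", ["sender", "vt", "payload"])

   class CausalReceiver:
       def __init__(self, n_processes):
           self.vt = [0] * n_processes   # vt[j] = multicasts delivered from j
           self.pending = []             # messages delayed to respect causality

       def receive(self, msg):
           self.pending.append(msg)
           self.try_deliver()

       def deliverable(self, msg):
           s = msg.sender
           if msg.vt[s] != self.vt[s] + 1:       # not the next multicast from s
               return False
           return all(msg.vt[j] <= self.vt[j]    # every multicast the sender had
                      for j in range(len(self.vt)) if j != s)   # seen is here too

       def try_deliver(self):
           progress = True
           while progress:
               progress = False
               for msg in list(self.pending):
                   if self.deliverable(msg):
                       self.pending.remove(msg)
                       self.vt[msg.sender] += 1
                       self.deliver(msg)         # hand the message upward
                       progress = True

       def deliver(self, msg):
           print("delivered:", msg.payload)

Causally independent messages pass the test as soon as they arrive, so they may be delivered in any order; this is exactly the freedom that CBCAST is intended to preserve.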
The cost of the causal atomic protocol described in [Birman-b] is only slightly higher than that of ABCAST.

6.5 Group addressing protocol

Since group membership can change dynamically, it may be difficult for a process to compute a list of destinations to which a message should be sent, for example, as is needed to perform a GBCAST. In [Birman-b] we report on a protocol for ensuring that a given multicast will be delivered to all members of a process group in the same view. This view is either the view that was operative when the message transmission was initiated, or a view that was defined subsequently. The algorithm is a simple iterative one that costs nothing unless the group membership changes, and permits the caching of possibly inaccurate membership information near processes that might want to communicate with a group. Using the protocol, a flexible message addressing scheme can readily be supported.

Iterative addressing is only required when the process transmitting a message has an inaccurate copy of the process group view. In the implementation we are now building, this would rarely be the case, and iteration is never needed if the view is known to be accurate. Thus, iterated delivery should be very infrequent.

6.6 Synchronous versus asynchronous multicast abstractions

Many systems employ RPC internally, as a lowest level primitive for interaction between processes. It should be evident that all of our multicast primitives can be used to implement replicated remote procedure calls [Cooper]: the caller would simply pause until replies have been received from all the participants (observation of a failure constitutes a reply in this case). We term such a use of the primitives synchronous, to distinguish it from an asynchronous multicast in which no replies, or just one reply, suffices.

In our work on ISIS, GBCAST and ABCAST are normally invoked synchronously, to implement a remote procedure call by one member of an object on all the members of its process group. However, CBCAST, which is the most frequently used overall, is almost never invoked synchronously. Asynchronous CBCAST's are the primary source of concurrency in ISIS: although the delivery ordering is assured, transmission can be delayed to enable a message to be piggybacked on another, or to schedule IO within the system as a whole. While the system cannot defer an asynchronous multicast indefinitely, the ability to defer it a little, without delaying some computation by doing so, permits load to be smoothed. Since CBCAST respects the delivery orderings on which a computation might depend, and is ordered with respect to failures, the concurrency introduced does not complicate higher level algorithms. Moreover, the protocol itself is extremely cheap.

A problem is introduced by our decision to allow asynchronous multicasts: the atomic reception property must now be extended to address causally related sequences of asynchronous messages. If a failure were to result in some multicasts being delivered to all their destinations but others that precede them not being delivered anywhere, inconsistency might result even if the destinations do not overlap. We therefore extend the atomicity property as follows.
If process t receives a message m from process s, and s subsequently fails, then unless t fails as well, all messages m' that s received prior to its failure must be delivered to their remaining operational destinations. This is because the state of t may now depend on the contents of any such m', hence the system state could become inconsistent if the delivery of m' were not completed. The costs of the protocols are not affected by this change.

A second problem arises when the user-level implications of this atomicity rule are considered. In the event of a failure, any suffix of a sequence of asynchronous multicasts could be lost and the system state would still be internally consistent. A process that is about to take some action that may leave an externally visible side-effect will need a way to pause until it is guaranteed that such multicasts have actually been delivered. For this purpose, a flush primitive is provided. Occasional calls to flush do not eliminate the benefit of using CBCAST asynchronously. Unless the system has built up a considerable backlog of undelivered multicast messages, which should be rare, flush will only pause while transmission of the last few multicasts completes.

7. Using the primitives

The reliable communication primitives described above lead to simple solutions for the problems cited in Sec. 4:

[1] Synchronization. Many synchronization problems are subsumed into the primitives themselves. For example, consider the use of GBCAST to implement recovery. A recovering process would issue a GBCAST to the process group members, requesting that state information be transferred to it. In addition to sending the current state of the group to the recovering process, group members update the process group view at this time. Subsequent messages to the group will be delivered to the recovered process, with all necessary synchronization being provided by the ordering properties of GBCAST. In situations where other forms of synchronization are needed, ABCAST provides a simple way to ensure that several processes take actions in the same order, and this form of low-level synchronization simplifies a number of higher-level synchronization problems. For example, if ABCAST is used to do P() and V() operations on a distributed semaphore, the order of operations on the semaphore is set by the ABCAST, hence all the managers of the semaphore see these operations in a fixed order.

[2] Failure detection. Consistent failure (and recovery) detection is trivial using our primitives: a process simply waits for the appropriate process group view to change. This facilitates the implementation of algorithms in which one process monitors the status of another process. A process that acts on the basis of a process group view change does so with the assurance that other group members will (eventually) observe the same event and will take consistent actions.

[3] Consistency. We believe that consistency is generally expressible as a set of atomicity and ordering constraints on message delivery, particularly causal ones of the sort provided by CBCAST. Our primitives permit a process to specify the communication properties needed to achieve a desired form of consistency. Continued research will be needed to understand precisely how to pick the weakest primitive in a designated situation.
[4] Serializability. To achieve serializability, one implements a concurrency control algorithm and then forces computations to respect the serialization order that this algorithm chooses. The ABCAST primitive, as observed above, is a powerful tool for establishing an order between concurrent events, e.g. by lock acquisition. Having established such an order, CBCAST can be used to distribute information about the computation and also its termination (commit or abort). Any process that observes the commit or abort of a computation will only be able to interact with data managers that have received messages preceding the commit or abort, hence a highly asynchronous transactional execution results. If a process running a computation fails, this is detected when a failure GBCAST is received instead of the commit. Thus, executions are simple and quite deterministic.

If commit is conditional, CBCAST would be used to first interrogate participants to learn if they are prepared to commit, and then to transmit the commit or abort decision (the usual two-phase commit). On the other hand, conditional commits can often be avoided using our approach. A method for building transactions that will roll forward after a failure is discussed in more detail in [Birman-a] [Joseph-a] [Joseph-b]. Other forms of concurrency control, such as timestamp generation, can similarly be implemented using ABCAST and CBCAST. We view transactional data storage as an application-level concern, which can be handled using a version stack approach or a multi-version store, or any other appropriate mechanism.

8. Implementation

The communication primitives can be built in layers, starting with a bare network providing unreliable Internet datagrams. The software structure is, however, less mature and more complex than the one suggested in RFC's 966 and 988. For example, at this stage of our research we do not understand how to optimize our protocols to the same extent as for the unreliable host multicast approach described in those RFC's. Thus, the implementation we describe here should be understood to be a prototype. A particularly intriguing question, which we are investigating actively, concerns the use of a "best effort" ethernet or Internet multicast as a tool to optimize the implementation of our protocols.

Our basic approach is to view large area networks as a set of clusters of sites connected internally by high speed LAN devices and interconnected by slower long-haul links. We first provide protocols for use within clusters, and then extend them to run between clusters too. Network partitioning can be tolerated at all levels of the hierarchy in the sense that no incorrect actions can result after network partitioning, although our approach will sometimes block until the partition is repaired. Our protocols also tend to block within a cluster while the list of operational sites for that cluster is being changed. In normal LAN's, this happens infrequently (during site failure or recovery), and would not pose a problem. (In failure intensive applications, alternative protocols might be needed to address this issue).
+ + The lowest level of our software uses a site-to-site acknowledgement + protocol to convert the unreliable packet transport this into a + sequenced, error-free message abstraction, using timeouts to detect + apparent failures. TCP can also be used for this purpose, provided + that a "filter" is placed on the incoming message stream and certain + types of messages are handled specially. An agreement protocol is + then used to order the site-failures and recoveries consistently. If + timeouts cause a failure to be detected erroneously, the protocol + forces the affected site to undergo recovery. + + Built on this is a layer that supports the primitives themselves. + CBCAST has a very light-weight implementation, based on the idea of + flooding the system with copies of a message: Each process buffers + + + +Birman & Joseph [Page 13] + +RFC 992 November 1986 + + + copies of any messages needed to ensure the consistency of its view + of the system. If message m is delivered to process p, and m is + potentially causally dependent on a message m prime, then a copy of m + prime is sent to p as well (duplicates are discarded). A garbage + collector deletes superfluous copies after a message has reached all + its destinations. By using extensive piggybacking and a simple + scheduling algorithm to control message transmission, the cost of a + CBCAST is kept low -- often, less than one packet per destination. + ABCAST employs a two-phase protocol based on one suggested to us by + Skeen [Skeen-b]. This protocol has higher latency than CBCAST + because delivery can only occur during the second phase; ABCAST is + thus inherently synchronous. In ISIS, however, ABCAST is used + rarely; we believe that this would be the case in other systems as + well. GBCAST is implemented using a two-phase protocol similar to + the one for ABCAST, but with an additional mechanism that flushes + messages from a failed process before delivering the GBCAST announc- + ing the failure. Although GBCAST is slower than ABCAST or CBCAST, it + is used rarely enough so that performance is probably less of an + issue here -- and in any case, even GBCAST could be tuned to give + very high throughput. Preliminary performance figures appear in + [Birman-b]. + + Although satisfactory performance should be possible using an imple- + mentation that sits on top of a conventional Internet mechanism, it + should be noted that to achieve really high rates of communication + the layers of software described above must reside in the kernel, + because they run on behalf of large numbers of clients, run fre- + quently, and tend to execute for very brief periods before doing I/O + and pausing. A non-kernel implementation will thus incur high + scheduling and context switching overhead. Additionally, it is not + at all clear how to use ethernet style broadcast mechanisms to optim- + ize the performance of this sort of protocol, although it should be + possible. We view this as an interesting area for research. + + A forthcoming paper will describe higher level software that we are + building on top of the basic fault-tolerant process group mechanism + described above. + +9. Conclusions + + The experience of implementing a substantial fault-tolerant system + left us with insights into the properties to be desired from a com- + munication subsystem. In particular, we became convinced that to + build a reliable distributed system, one must start with a reliable + communication subsystem. 
The multicast primitives described in this + memo present a simple interface, achieve a high level of concurrency, + can be used in both local and wide area networks, and are applicable + to software ranging from distributed database systems to the fault- + tolerant objects and bulletin boards provided by ISIS. Because they + are integrated with failure handling mechanisms and respect desired + event orderings, they introduce a desirable form of determinism into + + + +Birman & Joseph [Page 14] + +RFC 992 November 1986 + + + distributed computation without compromising efficiency. A conse- + quence is that high-level algorithms are greatly simplified, reducing + the probability of error. We believe that this is a very promising + and practical approach to building large fault-tolerant distributed + systems, and it is the only one we know of that leads to a rigorous + form of confidence in the resulting software. + +NOTES: + + <1> A problem arises if a process p fails without receiving some mes- + sage after that message has already been delivered to some other pro- + cess q: q's VIEW when it received the message would show p to be + operational; hence, q will assume that p received the message, + although p is physically incapable of doing so. However, the state + of the system is now equivalent to one in which p did receive the + message, but failed before acting on it. In effect, there exists an + interpretation of the actual system state that is consistent with q's + assumption. Thus, GBCAST satisfies the sort of logical delivery pro- + perty cited in the introduction. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Birman & Joseph [Page 15] + +RFC 992 November 1986 + + +10. References + +[RFC966] Deering, S. and Cheriton, D. Host groups: A multicast exten- + sion to the internet protocol. Stanford University, December + 1985. + +[RFC988] Deering, S. Host extensions for IP multicasting. Stanford + University, July 1986. + +[Allchin] Allchin, J., McKendry, M. Synchronization and recovery of + actions. Proc. 2nd ACM SIGACT/SIGOPS Principles of Distributed + Computing, Montreal, Canada, 1983. + +[Babaoglu] Babaoglu, O., Drummond, R. The streets of Byzantium: Network + architectures for fast reliable multicast. IEEE Trans. on + Software Engineering TSE-11, 6 (June 1985). + +[Bernstein] Bernstein, P., Goodman, N. Concurrency control algorithms + for replicated database systems. ACM Computing Surveys 13, 2 + (June 1981), 185-222. + +[Birman-a] Birman, K. Replication and fault-tolerance in the ISIS sys- + tem. Proc. 10th ACM SIGOPS Symposium on Operating Systems Princi- + ples. Orcas Island, Washington, Dec. 1985, 79-86. + +[Birman-b] Birman, K., Joseph, T. Reliable communication in the pres- + ence of failures. Dept. of Computer Science, Cornell Univ., TR + 85-694, Aug. 1985. To appear in ACM TOCS (Feb. 1987). + +[Birman-c] Birman, K., Joseph, T., Stephenson, P. Programming with + fault tolerant bulletin boards in asynchronous distributed sys- + tems. Dept. of Computer Science, Cornell Univ., TR 85-788, Aug. + 1986. + +[Birrell] Birrell, A., Nelson, B. Implementing remote procedure calls. + ACM Transactions on Computer Systems 2, 1 (Feb. 1984), 39-59. + +[Chang] Chang, J., Maxemchuck, M. Reliable multicast protocols. ACM + TOCS 2, 3 (Aug. 1984), 251-273. + +[Cheriton] Cheriton, D. The V Kernel: A software base for distributed + systems. IEEE Software 1 12, (1984), 19-43. + +[Cooper] Cooper, E. Replicated procedure call. Proc. 
3rd ACM Symposium + on Principles of Distributed Computing., August 1984, 220-232. + (May 1985). + +[Cristian] Cristian, F. et al Atomic multicast: From simple diffusion to + Byzantine agreement. IBM Technical Report RJ 4540 (48668), Oct. + 1984. + + + + +Birman & Joseph [Page 16] + +RFC 992 November 1986 + + +[Eswaren] Eswaren, K.P., et al The notion of consistency and predicate + locks in a database system. Comm. ACM 19, 11 (Nov. 1976), 624- + 633. + +[Hadzilacos] Hadzilacos, V. Byzantine agreement under restricted types + of failures (not telling the truth is different from telling of + lies). Tech. ARep. TR-19-83, Aiken Comp. Lab., Harvard University + (June 1983). + +[Halpern] Halpern, J., and Moses, Y. Knowledge and common knowledge in + a distributed environment. Tech. Report RJ-4421, IBM San Jose + Research Laboratory, 1984. + +[Joseph-a] Joseph, T. Low cost management of replicated data. Ph.D. + dissertation, Dept. of Computer Science, Cornell Univ., Ithaca + (Dec. 1985). + +[Joseph-b] Joseph, T., Birman, K. Low cost management of replicated + data in fault-tolerant distributed systems. ACM TOCS 4, 1 (Feb + 1986), 54-70. + +[Lamport] Lamport, L. Time, clocks, and the ordering of events in a + distributed system. CACM 21, 7, July 1978, 558-565. + +[Lazowska] Lazowska, E. et al The architecture of the EDEN system. + Proc. 8th Symposium on Operating Systems Principles, Dec. 1981, + 148-159. + +[Liskov] Liskov, B., Scheifler, R. Guardians and actions: Linguistic + support for robust, distributed programs. ACM TOPLAS 5, 3 (July + 1983), 381-404. + +[Moss] Moss, E. Nested transactions: An approach to reliable, distri- + buted computing. Ph.D. thesis, MIT Dept of EECS, TR 260, April + 1981. + +[Papadimitrou] Papadimitrou, C. The serializability of concurrent data- + base updates. JACM 26, 4 (Oct. 1979), 631-653. + +[Popek] Popek, G. et al. Locus: A network transparent, high reliability + distributed system. Proc. 8th Symposium on Operating Systems + Principles, Dec. 1981, 169-177. + +[Schlicting] Schlicting, R, Schneider, F. Fail-stop processors: An + approach to designing fault-tolerant distributed computing sys- + tems. ACM TOCS 1, 3, August 1983, 222-238. + +[Schneider] Schneider, F., Gries, D., Schlicting, R. Reliable multicast + protocols. Science of computer programming 3, 2 (March 1984). + +[Skeen-a] Skeen, D. Determining the last process to fail. ACM TOCS 3, + + + +Birman & Joseph [Page 17] + +RFC 992 November 1986 + + + 1, Feb. 1985, 15-30. + +[Skeen-b] Skeen, D. A reliable multicast protocol. Unpublished. + +[Spector] Spector, A., et al Distributed transactions for reliable sys- + tems. Proc. 10th ACM SIGOPS Symposium on Operating Systems Prin- + ciples, Dec. 1985, 127-146. + +[Strong] Strong, H.R., Dolev, D. Byzantine agreement. Digest of papers, + Spring Compcon 83, San Francisco, CA, March 1983, 77-81. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Birman & Joseph [Page 18] + |