author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc816.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc816.txt')
-rw-r--r-- | doc/rfc/rfc816.txt | 648 |
1 files changed, 648 insertions, 0 deletions
diff --git a/doc/rfc/rfc816.txt b/doc/rfc/rfc816.txt new file mode 100644 index 0000000..28e01d5 --- /dev/null +++ b/doc/rfc/rfc816.txt @@ -0,0 +1,648 @@ + + +RFC: 816 + + + + FAULT ISOLATION AND RECOVERY + + David D. Clark + MIT Laboratory for Computer Science + Computer Systems and Communications Group + July, 1982 + + + 1. Introduction + + + Occasionally, a network or a gateway will go down, and the sequence + +of hops which the packet takes from source to destination must change. + +Fault isolation is that action which hosts and gateways collectively + +take to determine that something is wrong; fault recovery is the + +identification and selection of an alternative route which will serve to + +reconnect the source to the destination. In fact, the gateways perform + +most of the functions of fault isolation and recovery. There are, + +however, a few actions which hosts must take if they wish to provide a + +reasonable level of service. This document describes the portion of + +fault isolation and recovery which is the responsibility of the host. + + + 2. What Gateways Do + + + Gateways collectively implement an algorithm which identifies the + +best route between all pairs of networks. They do this by exchanging + +packets which contain each gateway's latest opinion about the + +operational status of its neighbor networks and gateways. Assuming that + +this algorithm is operating properly, one can expect the gateways to go + +through a period of confusion immediately after some network or gateway + + 2 + + +has failed, but one can assume that once a period of negotiation has + +passed, the gateways are equipped with a consistent and correct model of + +the connectivity of the internet. 
At present this period of negotiation + +may actually take several minutes, and many TCP implementations time out + +within that period, but it is a design goal of the eventual algorithm + +that the gateway should be able to reconstruct the topology quickly + +enough that a TCP connection should be able to survive a failure of the + +route. + + + 3. Host Algorithm for Fault Recovery + + + Since the gateways always attempt to have a consistent and correct + +model of the internetwork topology, the host strategy for fault recovery + +is very simple. Whenever the host feels that something is wrong, it + +asks the gateway for advice, and, assuming the advice is forthcoming, it + +believes the advice completely. The advice will be wrong only during + +the transient period of negotiation, which immediately follows an + +outage, but will otherwise be reliably correct. + + + In fact, it is never necessary for a host to explicitly ask a + +gateway for advice, because the gateway will provide it as appropriate. + +When a host sends a datagram to some distant net, the host should be + +prepared to receive back either of two advisory messages which the + +gateway may send. The ICMP "redirect" message indicates that the + +gateway to which the host sent the datagram is no longer the best + +gateway to reach the net in question. The gateway will have forwarded + +the datagram, but the host should revise its routing table to have a + +different immediate address for this net. The ICMP "destination + + 3 + + +unreachable" message indicates that as a result of an outage, it is + +currently impossible to reach the addressed net or host in any manner. + +On receipt of this message, a host can either abandon the connection + +immediately without any further retransmission, or resend slowly to see + +if the fault is corrected in reasonable time. 
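The host-side handling of the two advisory messages described above can be sketched as follows. This is a minimal illustration in Python, not part of the RFC or of this commit: the class `HostRouting`, its method names, and the string results `"abandon"` and `"retry_slowly"` are all hypothetical, and real ICMP parsing is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class HostRouting:
    """Hypothetical per-host routing state (illustrative names only)."""
    default_gateway: str
    # Maps a destination net to the first-hop gateway learned for it.
    first_hop: dict = field(default_factory=dict)

    def gateway_for(self, net: str) -> str:
        # Use the learned first hop if there is one, else the default gateway.
        return self.first_hop.get(net, self.default_gateway)

    def on_redirect(self, net: str, better_gateway: str) -> None:
        # The old gateway has already forwarded the datagram, so nothing is
        # resent here; the host only revises its table for this net.
        self.first_hop[net] = better_gateway

    def on_destination_unreachable(self, net: str, patient: bool) -> str:
        # The text allows two reactions: abandon at once, or resend slowly
        # in case the fault is corrected in reasonable time.
        return "retry_slowly" if patient else "abandon"
```

A host that starts with one known gateway would call `on_redirect("net-9", "gw-b")` when advised, after which `gateway_for("net-9")` returns `"gw-b"` while other nets still use the default.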
+ + + If a host could assume that these two ICMP messages would always + +arrive when something was amiss in the network, then no other action on + +the part of the host would be required in order to maintain its tables in + +an optimal condition. Unfortunately, there are two circumstances under + +which the messages will not arrive properly. First, during the + +transient following a failure, error messages may arrive that do not + +correctly represent the state of the world. Thus, hosts must take an + +isolated error message with some scepticism. (This transient period is + +discussed more fully below.) Second, if the host has been sending + +datagrams to a particular gateway, and that gateway itself crashes, then + +all the other gateways in the internet will reconstruct the topology, + +but the gateway in question will still be down, and therefore cannot + +provide any advice back to the host. As long as the host continues to + +direct datagrams at this dead gateway, the datagrams will simply vanish + +off the face of the earth, and nothing will come back in return. Hosts + +must detect this failure. + + + If some gateway many hops away fails, this is not of concern to the + +host, for then the discovery of the failure is the responsibility of the + +immediate neighbor gateways, which will perform this action in a manner + +invisible to the host. The problem only arises if the very first + + 4 + + +gateway, the one to which the host is immediately sending the datagrams, + +fails. We thus identify one single task which the host must perform as + +its part of fault isolation in the internet: the host must use some + +strategy to detect that a gateway to which it is sending datagrams is + +dead. + + + Let us assume for the moment that the host implements some + +algorithm to detect failed gateways; we will return later to discuss + +what this algorithm might be. First, let us consider what the host + +should do when it has determined that a gateway is down. 
In fact, with + +the exception of one small problem, the action the host should take is + +extremely simple. The host should select some other gateway, and try + +sending the datagram to it. Assuming that gateway is up, this will + +either produce correct results, or some ICMP advice. Since we assume + +that, ignoring temporary periods immediately following an outage, any + +gateway is capable of giving correct advice, once the host has received + +advice from any gateway, that host is in as good a condition as it can + +hope to be. + + + There is always the unpleasant possibility that when the host tries + +a different gateway, that gateway too will be down. Therefore, whatever + +algorithm the host uses to detect a dead gateway must continuously be + +applied, as the host tries every gateway in turn that it knows about. + + + The only difficult part of this algorithm is to specify the means + +by which the host maintains the table of all of the gateways to which it + +has immediate access. Currently, the specification of the internet + +protocol does not architect any message by which a host can ask to be + + 5 + + +supplied with such a table. The reason is that different networks may + +provide very different mechanisms by which this table can be filled in. + +For example, if the net is a broadcast net, such as an ethernet or a + +ringnet, every gateway may simply broadcast such a table from time to + +time, and the host need do nothing but listen to obtain the required + +information. Alternatively, the network may provide the mechanism of + +logical addressing, by which a whole set of machines can be provided + +with a single group address, to which a request can be sent for + +assistance. Failing those two schemes, the host can build up its table + +of neighbor gateways by remembering all the gateways from which it has + +ever received a message. 
Finally, in certain cases, it may be necessary + +for this table, or at least the initial entries in the table, to be + +constructed manually by a manager or operator at the site. In cases + +where the network in question provides absolutely no support for this + +kind of host query, at least some manual intervention will be required + +to get started, so that the host can find out about at least one + +gateway. + + + 4. Host Algorithms for Fault Isolation + + + We now return to the question raised above. What strategy should + +the host use to detect that it is talking to a dead gateway, so that it + +can know to switch to some other gateway in the list? In fact, there are + +several algorithms which can be used. All are reasonably simple to + +implement, but they have very different implications for the overhead on + +the host, the gateway, and the network. Thus, to a certain extent, the + +algorithm picked must depend on the details of the network and of the + +host. + + 6 + + + +1. NETWORK LEVEL DETECTION + + + Many networks, particularly the Arpanet, perform precisely the + +required function internal to the network. If a host sends a datagram + +to a dead gateway on the Arpanet, the network will return a "host dead" + +message, which is precisely the information the host needs to know in + +order to switch to another gateway. Some early implementations of + +Internet on the Arpanet threw these messages away. That is an + +exceedingly poor idea. + + +2. CONTINUOUS POLLING + + + The ICMP protocol provides an echo mechanism by which a host may + +solicit a response from a gateway. A host could simply send this + +message at a reasonable rate, to assure itself continuously that the + +gateway was still up. This works, but, since the message must be sent + +fairly often to detect a fault in a reasonable time, it can imply an + +unbearable overhead on the host itself, the network, and the gateway. 
+ +This strategy is prohibited except where a specific analysis has + +indicated that the overhead is tolerable. + + +3. TRIGGERED POLLING + + + If the use of polling could be restricted to only those times when + +something seemed to be wrong, then the overhead would be bearable. + +Provided that one can get the proper advice from one's higher level + +protocols, it is possible to implement such a strategy. For example, + +one could program the TCP level so that whenever it retransmitted a + + 7 + + +segment more than once, it sent a hint down to the IP layer which + +triggered polling. This strategy does not have excessive overhead, but + +does have the problem that the host may be somewhat slow to respond to + +an error, since only after polling has started will the host be able to + +confirm that something has gone wrong, and by then the TCP above may + +have already timed out. + + + Both forms of polling suffer from a minor flaw. Hosts as well as + +gateways respond to ICMP echo messages. Thus, polling cannot be used to + +detect the error that a foreign address thought to be a gateway is + +actually a host. Such a confusion can arise if the physical addresses + +of machines are rearranged. + + +4. TRIGGERED RESELECTION + + + There is a strategy which makes use of a hint from a higher level, + +as did the previous strategy, but which avoids polling altogether. + +Whenever a higher level complains that the service seems to be + +defective, the Internet layer can pick the next gateway from the list of + +available gateways, and switch to it. Assuming that this gateway is up, + +no real harm can come of this decision, even if it was wrong, for the + +worst that will happen is a redirect message which instructs the host to + +return to the gateway originally being used. If, on the other hand, the + +original gateway was indeed down, then this immediately provides a new + +route, so the period of time until recovery is shortened. 
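Triggered reselection as just described can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from the RFC or from this repository: `GatewaySelector` and its method names are hypothetical, the gateway list is assumed to be already known to the host, and wrapping around to the first gateway after the last is a choice made here rather than something the text specifies.

```python
class GatewaySelector:
    """Hypothetical first-hop selector using triggered reselection."""

    def __init__(self, gateways):
        if not gateways:
            raise ValueError("at least one gateway must be known")
        self.gateways = list(gateways)
        self.current = 0  # index of the gateway in use

    def active(self):
        return self.gateways[self.current]

    def on_higher_level_complaint(self):
        # A higher layer (for example TCP after repeated retransmission)
        # reports poor service: switch to the next gateway, with no polling.
        self.current = (self.current + 1) % len(self.gateways)
        return self.active()

    def on_redirect(self, gateway):
        # If the switch was wrong, the new gateway's redirect sends the host
        # back; a gateway not yet in the list is remembered for later use.
        if gateway not in self.gateways:
            self.gateways.append(gateway)
        self.current = self.gateways.index(gateway)
```

The design point is that a wrong guess costs at most one redirect, while a right guess restores the route immediately.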
This last + +strategy seems particularly clever, and is probably the most generally + +suitable for those cases where the network itself does not provide fault + +isolation. (Regrettably, I have forgotten who suggested this idea to me. + +It is not my invention.) + + 8 + + + 5. Higher Level Fault Detection + + + The previous discussion has concentrated on fault detection and + +recovery at the IP layer. This section considers what the higher layers + +such as TCP should do. + + + TCP has a single fault recovery action; it repeatedly retransmits a + +segment until either it gets an acknowledgement or its connection timer + +expires. As discussed above, it may use retransmission as an event to + +trigger a request for fault recovery to the IP layer. In the other + +direction, information may flow up from IP, reporting such things as + +ICMP Destination Unreachable or error messages from the attached + +network. The only subtle question about TCP and faults is what TCP + +should do when such an error message arrives or its connection timer + +expires. + + + The TCP specification discusses the timer. In the description of + +the open call, the timeout is described as an optional value that the + +client of TCP may specify; if any segment remains unacknowledged for + +this period, TCP should abort the connection. The default for the + +timeout is 30 seconds. Early TCPs were often implemented with a fixed + +timeout interval, but this did not work well in practice, as the + +following discussion may suggest. + + + Clients of TCP can be divided into two classes: those running on + +immediate behalf of a human, such as Telnet, and those supporting a + +program, such as a mail sender. Humans require a sophisticated response + +to errors. Depending on exactly what went wrong, they may want to + + 9 + + +abandon the connection at once, or wait for a long time to see if things + +get better. 
Programs do not have this human impatience, but also lack + +the power to make complex decisions based on details of the exact error + +condition. For them, a simple timeout is reasonable. + + + Based on these considerations, at least two modes of operation are + +needed in TCP. One, for programs, abandons the connection without + +exception if the TCP timer expires. The other mode, suitable for + +people, never abandons the connection on its own initiative, but reports + +to the layer above when the timer expires. Thus, the human user can see + +error messages coming from all the relevant layers, TCP and ICMP, and + +can request TCP to abort as appropriate. This second mode requires that + +TCP be able to send an asynchronous message up to its client to report + +the timeout, and it requires that error messages arriving at lower + +layers similarly flow up through TCP. + + + At levels above TCP, fault detection is also required. Either of + +the following can happen. First, the foreign client of TCP can fail, + +even though TCP is still running, so data is still acknowledged and the + +timer never expires. Alternatively, the communication path can fail, + +without the TCP timer going off, because the local client has no data to + +send. Both of these have caused trouble. + + + Sending mail provides an example of the first case. When sending + +mail using SMTP, there is an SMTP level acknowledgement that is returned + +when a piece of mail is successfully delivered. Several early mail + +receiving programs would crash just at the point where they had received + +all of the mail text (so TCP did not detect a timeout due to outstanding + + 10 + + +unacknowledged data) but before the mail was acknowledged at the SMTP + +level. This failure would cause early mail senders to wait forever for + +the SMTP level acknowledgement. 
The obvious cure was to set a timer at + +the SMTP level, but the first attempt to do this did not work, for there + +was no simple way to select the timer interval. If the interval + +selected was short, it expired in normal operation when sending a + +large file to a slow host. An interval of many minutes was needed to + +prevent false timeouts, but that meant that failures were detected only + +very slowly. The current solution in several mailers is to pick a + +timeout interval proportional to the size of the message. + + + Server telnet provides an example of the other kind of failure. It + +can easily happen that the communications link can fail while there is + +no traffic flowing, perhaps because the user is thinking. Eventually, + +the user will attempt to type something, at which time he will discover + +that the connection is dead and abort it. But the host end of the + +connection, having nothing to send, will not discover anything wrong, + +and will remain waiting forever. In some systems there is no way for a + +user in a different process to destroy or take over such a hanging + +process, so there is no way to recover. + + + One solution to this would be to have the host server telnet query + +the user end now and then, to see if it is still up. (Telnet does not + +have an explicit query feature, but the host could negotiate some + +unimportant option, which should produce either agreement or + +disagreement in return.) The only problem with this is that a + +reasonable sample interval, if applied to every user on a large system, + + 11 + + +can generate an unacceptable amount of traffic and system overhead. A + +smart server telnet would use this query only when something seems + +wrong, perhaps when there had been no user activity for some time. + + + In both these cases, the general conclusion is that client level + +error detection is needed, and that the details of the mechanism are + +very dependent on the application. 
Application programmers must be made + +aware of the problem of failures, and must understand that error + +detection at the TCP or lower level cannot solve the whole problem for + +them. + + + 6. Knowing When to Give Up + + + It is not obvious, when error messages such as ICMP Destination + +Unreachable arrive, whether TCP should abandon the connection. The + +reason that error messages are difficult to interpret is that, as + +discussed above, after a failure of a gateway or network, there is a + +transient period during which the gateways may have incorrect + +information, so that irrelevant or incorrect error messages may + +sometimes return. An isolated ICMP Destination Unreachable may arrive + +at a host, for example, if a packet is sent during the period when the + +gateways are trying to find a new route. To abandon a TCP connection + +based on such a message arriving would be to ignore the valuable feature + +of the Internet that for many internal failures it reconstructs its + +function without any disruption of the end points. + + + But if failure messages do not imply a failure, what are they for? + +In fact, error messages serve several important purposes. First, if + + 12 + + +they arrive in response to opening a new connection, they probably are + +caused by opening the connection improperly (e.g., to a non-existent + +address) rather than by a transient network failure. Second, they + +provide valuable information, after the TCP timeout has occurred, as to + +the probable cause of the failure. Finally, certain messages, such as + +ICMP Parameter Problem, imply a possible implementation problem. In + +general, error messages give valuable information about what went wrong, + +but are not to be taken as absolutely reliable. 
A general alerting + +mechanism, such as the TCP timeout discussed above, provides a good + +indication that whatever is wrong is a serious condition, but without + +the advisory messages to augment the timer, there is no way for the + +client to know how to respond to the error. The combination of the + +timer and the advice from the error messages provides a reasonable set of + +facts for the client layer to have. It is important that error messages + +from all layers be passed up to the client module in a useful and + +consistent way. + + +------- |
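The combination of timer and advisory messages argued for in sections 5 and 6 can be sketched as below. This Python fragment is an illustration only, not part of the RFC or of this commit: `ConnectionFaultState`, its method names, and the `"ok"`, `"abort"` and `"report"` results are hypothetical. Clearing stored advisories when an acknowledgement arrives is an assumption made here, on the grounds that a working path makes earlier errors stale.

```python
class ConnectionFaultState:
    """Hypothetical per-connection record of the timer and advisories."""

    def __init__(self, timeout_seconds, abort_on_timeout):
        self.timeout_seconds = timeout_seconds
        # True for the "program" mode, False for the "human" mode.
        self.abort_on_timeout = abort_on_timeout
        self.advisories = []           # error messages seen, oldest first
        self.oldest_unacked_at = None  # send time of oldest unacked segment

    def on_segment_sent(self, now):
        if self.oldest_unacked_at is None:
            self.oldest_unacked_at = now

    def on_ack(self):
        # Assumption: an acknowledgement shows the path works again.
        self.oldest_unacked_at = None
        self.advisories.clear()

    def on_error_message(self, text):
        # An isolated error message never ends the connection by itself;
        # it is only remembered as a probable cause.
        self.advisories.append(text)

    def check(self, now):
        # Returns (action, advisories) for the client layer.
        if (self.oldest_unacked_at is None
                or now - self.oldest_unacked_at < self.timeout_seconds):
            return ("ok", [])
        action = "abort" if self.abort_on_timeout else "report"
        return (action, list(self.advisories))
```

With a 30 second timeout, a mail-style client created with `abort_on_timeout=True` gives up once the timer expires and receives the stored messages as the probable cause, while a Telnet-style client created with `abort_on_timeout=False` is only told about the expiry and decides for itself.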