Diffstat (limited to 'doc/rfc/rfc896.txt')
-rw-r--r--  doc/rfc/rfc896.txt  512
1 file changed, 512 insertions, 0 deletions
diff --git a/doc/rfc/rfc896.txt b/doc/rfc/rfc896.txt
new file mode 100644
index 0000000..d8a480a
--- /dev/null
+++ b/doc/rfc/rfc896.txt
@@ -0,0 +1,512 @@
+
+
+Network Working Group John Nagle
+Request For Comments: 896 6 January 1984
+ Ford Aerospace and Communications Corporation
+
+ Congestion Control in IP/TCP Internetworks
+
+This memo discusses some aspects of congestion control in IP/TCP
+Internetworks. It is intended to stimulate thought and further
+discussion of this topic. While some specific suggestions are
+made for improved congestion control implementation, this memo
+does not specify any standards.
+
+ Introduction
+
+Congestion control is a recognized problem in complex networks.
+We have discovered that the Department of Defense's Internet Pro-
+tocol (IP), a pure datagram protocol, and Transmission Control
+Protocol (TCP), a transport layer protocol, when used together,
+are subject to unusual congestion problems caused by interactions
+between the transport and datagram layers. In particular, IP
+gateways are vulnerable to a phenomenon we call "congestion col-
+lapse", especially when such gateways connect networks of widely
+different bandwidth. We have developed solutions that prevent
+congestion collapse.
+
+These problems are not generally recognized because these proto-
+cols are used most often on networks built on top of ARPANET IMP
+technology. ARPANET IMP based networks traditionally have uni-
+form bandwidth and identical switching nodes, and are sized with
+substantial excess capacity. This excess capacity, and the abil-
+ity of the IMP system to throttle the transmissions of hosts,
+has for most IP/TCP hosts and networks been adequate to handle
+congestion. With the recent split of the ARPANET into two inter-
+connected networks and the growth of other networks with differ-
+ing properties connected to the ARPANET, however, reliance on the
+benign properties of the IMP system is no longer enough to allow
+hosts to communicate rapidly and reliably. Improved handling of
+congestion is now mandatory for successful network operation
+under load.
+
+Ford Aerospace and Communications Corporation, and its parent
+company, Ford Motor Company, operate the only private IP/TCP
+long-haul network in existence today. This network connects four
+facilities (one in Michigan, two in California, and one in Eng-
+land) some with extensive local networks. This net is cross-tied
+to the ARPANET but uses its own long-haul circuits; traffic
+between Ford facilities flows over private leased circuits,
+including a leased transatlantic satellite connection. All
+switching nodes are pure IP datagram switches with no node-to-
+node flow control, and all hosts run software either written or
+heavily modified by Ford or Ford Aerospace. Bandwidth of links
+in this network varies widely, from 1200 to 10,000,000 bits per
+second. In general, we have not been able to afford the luxury
+of excess long-haul bandwidth that the ARPANET possesses, and our
+long-haul links are heavily loaded during peak periods. Transit
+times of several seconds are thus common in our network.
+
+Because of our pure datagram orientation, heavy loading, and wide
+variation in bandwidth, we have had to solve problems that the
+ARPANET / MILNET community is just beginning to recognize. Our
+network is sensitive to suboptimal behavior by host TCP implemen-
+tations, both on and off our own net. We have devoted consider-
+able effort to examining TCP behavior under various conditions,
+and have solved some widely prevalent problems with TCP. We
+present here two problems and their solutions. Many TCP imple-
+mentations have these problems; if throughput is worse through an
+ARPANET / MILNET gateway for a given TCP implementation than
+throughput across a single net, there is a high probability that
+the TCP implementation has one or both of these problems.
+
+ Congestion collapse
+
+Before we proceed with a discussion of the two specific problems
+and their solutions, a description of what happens when these
+problems are not addressed is in order. In heavily loaded pure
+datagram networks with end-to-end retransmission, as switching
+nodes become congested, the round trip time through the net
+increases and the count of datagrams in transit within the net
+also increases. This is normal behavior under load. As long as
+there is only one copy of each datagram in transit, congestion is
+under control. Once retransmission of datagrams not yet
+delivered begins, there is potential for serious trouble.
+
+Host TCP implementations are expected to retransmit packets
+several times at increasing time intervals until some upper limit
+on the retransmit interval is reached. Normally, this mechanism
+is enough to prevent serious congestion problems. Even with the
+better adaptive host retransmission algorithms, though, a sudden
+load on the net can cause the round-trip time to rise faster than
+the sending host's measurements of round-trip time can be updated.
+Such a load occurs when a new bulk transfer, such as a file
+transfer, begins and starts filling a large window. Should the
+round-trip time exceed the maximum retransmission interval for
+any host, that host will begin to introduce more and more copies
+of the same datagrams into the net. The network is now in seri-
+ous trouble. Eventually all available buffers in the switching
+nodes will be full and packets must be dropped. The round-trip
+time for packets that are delivered is now at its maximum. Hosts
+are sending each packet several times, and eventually some copy
+of each packet arrives at its destination. This is congestion
+collapse.
+
+This condition is stable. Once the saturation point has been
+reached, if the algorithm for selecting packets to be dropped is
+fair, the network will continue to operate in a degraded condi-
+tion. In this condition every packet is being transmitted
+several times and throughput is reduced to a small fraction of
+normal. We have pushed our network into this condition experi-
+mentally and observed its stability. It is possible for round-
+trip time to become so large that connections are broken because
+the hosts involved time out.
+
+Congestion collapse and pathological congestion are not normally
+seen in the ARPANET / MILNET system because these networks have
+substantial excess capacity. Where connections do not pass
+through IP gateways, the IMP-to-host flow control mechanisms usu-
+ally prevent congestion collapse, especially since TCP implemen-
+tations tend to be well adjusted for the time constants associ-
+ated with the pure ARPANET case. However, other than ICMP Source
+Quench messages, nothing fundamentally prevents congestion col-
+lapse when TCP is run over the ARPANET / MILNET and packets are
+being dropped at gateways. Worth noting is that a few badly-
+behaved hosts can by themselves congest the gateways and prevent
+other hosts from passing traffic. We have observed this problem
+repeatedly with certain hosts (with whose administrators we have
+communicated privately) on the ARPANET.
+
+Adding additional memory to the gateways will not solve the prob-
+lem. The more memory added, the longer round-trip times must
+become before packets are dropped. Thus, the onset of congestion
+collapse will be delayed, but when collapse occurs, an even larger
+fraction of the packets in the net will be duplicates and
+throughput will be even worse.
+
+ The two problems
+
+Two key problems with the engineering of TCP implementations have
+been observed; we call these the small-packet problem and the
+source-quench problem. The second is being addressed by several
+implementors; the first is generally believed (incorrectly) to be
+solved. We have discovered that once the small-packet problem
+has been solved, the source-quench problem becomes much more
+tractable. We thus present the small-packet problem and our
+solution to it first.
+
+ The small-packet problem
+
+There is a special problem associated with small packets. When
+TCP is used for the transmission of single-character messages
+originating at a keyboard, the typical result is that 41 byte
+packets (one byte of data, 40 bytes of header) are transmitted
+for each byte of useful data. This 4000% overhead is annoying
+but tolerable on lightly loaded networks. On heavily loaded net-
+works, however, the congestion resulting from this overhead can
+result in lost datagrams and retransmissions, as well as exces-
+sive propagation time caused by congestion in switching nodes and
+gateways. In practice, throughput may drop so low that TCP con-
+nections are aborted.
+
+This classic problem is well-known and was first addressed in the
+Tymnet network in the late 1960s. The solution used there was to
+impose a limit on the count of datagrams generated per unit time.
+This limit was enforced by delaying transmission of small packets
+until a short (200-500ms) time had elapsed, in hope that another
+character or two would become available for addition to the same
+packet before the timer ran out. An additional feature to
+enhance user acceptability was to inhibit the time delay when a
+control character, such as a carriage return, was received.
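+
+As an illustration only, the following C sketch shows the shape of
+such a timer scheme.  The names, the 300ms figure, and the callback
+arguments are assumptions made for exposition here, not code from
+any of the systems mentioned.
+
+    #include <stddef.h>
+
+    #define BATCH_DELAY_MS 300       /* within the 200-500ms range */
+
+    struct outq {
+        char   buf[512];             /* characters awaiting a send */
+        size_t len;
+        int    timer_running;
+    };
+
+    /* Called for each character the user types.  Small packets
+       are held until the timer fires, except that a control
+       character flushes the queue at once.  */
+    void queue_char(struct outq *q, char c,
+                    void (*send_packet)(struct outq *),
+                    void (*start_timer)(int ms))
+    {
+        if (q->len < sizeof q->buf)
+            q->buf[q->len++] = c;
+        if (c == '\r' || c == '\n') {     /* control character:  */
+            send_packet(q);               /* inhibit the delay   */
+            q->timer_running = 0;
+            return;
+        }
+        if (!q->timer_running) {          /* first character     */
+            start_timer(BATCH_DELAY_MS);  /* starts flush timer  */
+            q->timer_running = 1;
+        }
+    }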
+
+This technique has been used in NCP Telnet, X.25 PADs, and TCP
+Telnet. It has the advantage of being well-understood, and is not
+too difficult to implement. Its flaw is that it is hard to come
+up with a time limit that will satisfy everyone. A time limit
+short enough to provide highly responsive service over a 10M bits
+per second Ethernet will be too short to prevent congestion col-
+lapse over a heavily loaded net with a five second round-trip
+time; and conversely, a time limit long enough to handle the
+heavily loaded net will produce frustrated users on the Ethernet.
+
+ The solution to the small-packet problem
+
+Clearly an adaptive approach is desirable. One would expect a
+proposal for an adaptive inter-packet time limit based on the
+round-trip delay observed by TCP. While such a mechanism could
+certainly be implemented, it is unnecessary. A simple and
+elegant solution has been discovered.
+
+The solution is to inhibit the sending of new TCP segments when
+new outgoing data arrives from the user if any previously
+transmitted data on the connection remains unacknowledged. This
+inhibition is to be unconditional; no timers, tests for size of
+data received, or other conditions are required. Implementation
+typically requires one or two lines inside a TCP program.
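+
+As an illustration only, the inhibition reduces to a comparison of
+two sequence variables.  The C sketch below uses the common names
+snd_una and snd_nxt; the structure and names are assumptions made
+for exposition here, not a quotation from any implementation.
+
+    #include <stdint.h>
+
+    struct tcb {                /* per-connection control block    */
+        uint32_t snd_una;       /* oldest unacknowledged sequence  */
+        uint32_t snd_nxt;       /* next sequence number to be sent */
+    };
+
+    /* New data from the user may go out at once only if nothing
+       previously transmitted remains unacknowledged; otherwise
+       it is queued and will be sent when the next ACK or window
+       update arrives and the output routine runs again.  */
+    static int may_send_now(const struct tcb *tp)
+    {
+        return tp->snd_una == tp->snd_nxt;
+    }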
+
+At first glance, this solution seems to imply drastic changes in
+the behavior of TCP. This is not so. It all works out right in
+the end. Let us see why this is so.
+
+When a user process writes to a TCP connection, TCP receives some
+data. It may hold that data for future sending or may send a
+packet immediately. If it refrains from sending now, it will
+typically send the data later when an incoming packet arrives and
+changes the state of the system. The state changes in one of two
+ways: the incoming packet acknowledges old data the distant host
+has received, or announces the availability of buffer space in
+the distant host for new data. (This last is referred to as
+"updating the window"). Each time data arrives on a connec-
+tion, TCP must reexamine its current state and perhaps send some
+packets out. Thus, when we omit sending data on arrival from the
+user, we are simply deferring its transmission until the next
+message arrives from the distant host. A message must always
+arrive soon unless the connection was previously idle or communi-
+cations with the other end have been lost. In the first case,
+the idle connection, our scheme will result in a packet being
+sent whenever the user writes to the TCP connection. Thus we do
+not deadlock in the idle condition. In the second case, where
+the distant host has failed, sending more data is futile anyway.
+Note that we have done nothing to inhibit normal TCP retransmis-
+sion logic, so lost messages are not a problem.
+
+Examination of the behavior of this scheme under various condi-
+tions demonstrates that the scheme does work in all cases. The
+first case to examine is the one we wanted to solve, that of the
+character-oriented Telnet connection. Let us suppose that the
+user is sending TCP a new character every 200ms, and that the
+connection is via an Ethernet with a round-trip time including
+software processing of 50ms. Without any mechanism to prevent
+small-packet congestion, one packet will be sent for each charac-
+ter, and response will be optimal. Overhead will be 4000%, but
+this is acceptable on an Ethernet. The classic timer scheme,
+with a limit of 2 packets per second, will cause two or three
+characters to be sent per packet. Response will thus be degraded
+even though on a high-bandwidth Ethernet this is unnecessary.
+Overhead will drop to 1500%, but on an Ethernet this is a bad
+tradeoff. With our scheme, every character the user types will
+find TCP with an idle connection, and the character will be sent
+at once, just as in the no-control case. The user will see no
+visible delay. Thus, our scheme performs as well as the no-
+control scheme and provides better responsiveness than the timer
+scheme.
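+
+(The 1500% figure is the header arithmetic: 40 header bytes per
+two to three data bytes gives an overhead near 40/2.7, or roughly
+1500%.)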
+
+The second case to examine is the same Telnet test but over a
+long-haul link with a 5-second round trip time. Without any
+mechanism to prevent small-packet congestion, 25 new packets
+would be sent in 5 seconds.* Overhead here is 4000%. With the
+classic timer scheme, and the same limit of 2 packets per second,
+there would still be 10 packets outstanding and contributing to
+congestion. Round-trip time will not be improved by sending many
+packets, of course; in general it will be worse since the packets
+will contend for line time. Overhead now drops to 1500%. With
+our scheme, however, the first character from the user would find
+an idle TCP connection and would be sent immediately. The next
+24 characters, arriving from the user at 200ms intervals, would
+be held pending a message from the distant host. When an ACK
+arrived for the first packet at the end of 5 seconds, a single
+packet with the 24 queued characters would be sent. Our scheme
+thus results in an overhead reduction to 320% with no penalty in
+response time. Response time will usually be improved with our
+scheme because packet overhead is reduced, here by a factor of
+4.7 over the classic timer scheme. Congestion will be reduced by
+this factor and round-trip delay will decrease sharply. For this
+________
+ * This problem is not seen in the pure ARPANET case because the
+ IMPs will block the host when the count of packets
+ outstanding becomes excessive, but in the case where a pure
+ datagram local net (such as an Ethernet) or a pure datagram
+ gateway (such as an ARPANET / MILNET gateway) is involved, it
+ is possible to have large numbers of tiny packets
+ outstanding.
+
+case, our scheme has a striking advantage over either of the
+other approaches.
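+
+(To check the 320% figure: the 25 characters now travel in two
+packets, so 2 x 40 = 80 header bytes carry 25 data bytes, and
+80/25 gives 320% overhead.)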
+
+We use our scheme for all TCP connections, not just Telnet con-
+nections. Let us see what happens for a file transfer data con-
+nection using our technique. The two extreme cases will again be
+considered.
+
+As before, we first consider the Ethernet case. The user is now
+writing data to TCP in 512 byte blocks as fast as TCP will accept
+them. The user's first write to TCP will start things going; our
+first datagram will be 512+40 bytes or 552 bytes long. The
+user's second write to TCP will not cause a send but will cause
+the block to be buffered. Assume that the user fills up TCP's
+outgoing buffer area before the first ACK comes back. Then when
+the ACK comes in, all queued data up to the window size will be
+sent. From then on, the window will be kept full, as each ACK
+initiates a sending cycle and queued data is sent out. Thus,
+after a one round-trip time initial period when only one block is
+sent, our scheme settles down into a maximum-throughput condi-
+tion. The delay in startup is only 50ms on the Ethernet, so the
+startup transient is insignificant. All three schemes provide
+equivalent performance for this case.
+
+Finally, let us look at a file transfer over the 5-second round
+trip time connection. Again, only one packet will be sent until
+the first ACK comes back; the window will then be filled and kept
+full. Since the round-trip time is 5 seconds, only 512 bytes of
+data are transmitted in the first 5 seconds. Assuming a 2K win-
+dow, once the first ACK comes in, 2K of data will be sent and a
+steady rate of 2K per 5 seconds will be maintained thereafter.
+Only for this case is our scheme inferior to the timer scheme,
+and the difference is only in the startup transient; steady-state
+throughput is identical. The naive scheme and the timer scheme
+would both take 250 seconds to transmit a 100K byte file under
+the above conditions and our scheme would take 254 seconds, a
+difference of 1.6%.
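+
+(The arithmetic: at 2K bytes per 5-second round trip, a 100K byte
+file takes 50 round trips, or 250 seconds.  Under our scheme the
+first round trip carries only 512 bytes, a deficit of about three
+quarters of a window, or roughly 4 seconds; hence about 254
+seconds, 1.6% more.)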
+
+Thus, for all cases examined, our scheme provides at least 98% of
+the performance of both other schemes, and provides a dramatic
+improvement in Telnet performance over paths with long round trip
+times. We use our scheme in the Ford Aerospace Software
+Engineering Network, and are able to run screen editors over Eth-
+ernet and talk to distant TOPS-20 hosts with improved performance
+in both cases.
+
+ Congestion control with ICMP
+
+Having solved the small-packet congestion problem and with it the
+problem of excessive small-packet congestion within our own net-
+work, we turned our attention to the problem of general conges-
+tion control. Since our own network is pure datagram with no
+node-to-node flow control, the only mechanism available to us
+under the IP standard was the ICMP Source Quench message. With
+careful handling, we find this adequate to prevent serious
+congestion problems. We do find it necessary to be careful about
+the behavior of our hosts and switching nodes regarding Source
+Quench messages.
+
+ When to send an ICMP Source Quench
+
+The present ICMP standard* specifies that an ICMP Source Quench
+message should be sent whenever a packet is dropped, and addi-
+tionally may be sent when a gateway finds itself becoming short
+of resources. There is some ambiguity here but clearly it is a
+violation of the standard to drop a packet without sending an
+ICMP message.
+
+Our basic assumption is that packets ought not to be dropped dur-
+ing normal network operation. We therefore want to throttle
+senders back before they overload switching nodes and gateways.
+All our switching nodes send ICMP Source Quench messages well
+before buffer space is exhausted; they do not wait until it is
+necessary to drop a message before sending an ICMP Source Quench.
+As demonstrated in our analysis of the small-packet problem,
+merely providing large amounts of buffering is not a solution.
+In general, our experience is that Source Quench should be sent
+when about half the buffering space is exhausted; this is not
+based on extensive experimentation but appears to be a reasonable
+engineering decision. One could argue for an adaptive scheme
+that adjusted the quench generation threshold based on recent
+experience; we have not found this necessary as yet.
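+
+As an illustration only, the half-full rule reduces to a one-line
+test; the structure and names below are assumptions made for
+exposition, not our switching-node code.
+
+    struct node_bufs {
+        int total;               /* buffers configured           */
+        int in_use;              /* buffers currently occupied   */
+    };
+
+    /* Quench the source well before exhaustion, here at the
+       half-full point suggested above, rather than waiting
+       until a packet must actually be dropped.  */
+    static int should_send_source_quench(const struct node_bufs *b)
+    {
+        return 2 * b->in_use >= b->total;
+    }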
+
+There exist other gateway implementations that generate Source
+Quenches only after more than one packet has been discarded. We
+consider this approach undesirable since any system for control-
+ling congestion based on the discarding of packets is wasteful of
+bandwidth and may be susceptible to congestion collapse under
+heavy load. Our understanding is that the decision to generate
+Source Quenches with great reluctance stems from a fear that ack-
+nowledge traffic will be quenched and that this will result in
+connection failure. As will be shown below, appropriate handling
+of Source Quench in host implementations eliminates this possi-
+bility.
+
+ What to do when an ICMP Source Quench is received
+
+We inform TCP or any other protocol at that layer when ICMP
+receives a Source Quench. The basic action of our TCP implemen-
+tations is to reduce the amount of data outstanding on connec-
+tions to the host mentioned in the Source Quench. This control is
+________
+ * ARPANET RFC 792 is the present standard. We are advised by
+ the Defense Communications Agency that the description of
+ ICMP in MIL-STD-1777 is incomplete and will be deleted from
+      future revisions of that standard.
+
+applied by causing the sending TCP to behave as if the distant
+host's window size has been reduced. Our first implementation
+was simplistic but effective; once a Source Quench has been
+received, our TCP behaves as if the window size is zero whenever
+the window isn't empty. This behavior continues until some
+number (at present 10) of ACKs have been received, at which time
+TCP returns to normal operation.* David Mills of Linkabit Cor-
+poration has since implemented a similar but more elaborate
+throttle on the count of outstanding packets in his DCN systems.
+The additional sophistication seems to produce a modest gain in
+throughput, but we have not made formal tests. Both implementa-
+tions effectively prevent congestion collapse in switching nodes.
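+
+As an illustration only, the simplistic throttle described above
+can be sketched in C as follows.  The count of 10 ACKs is as
+stated in the text; the structure and names are assumptions made
+for exposition, not our actual implementation.
+
+    #include <stdint.h>
+
+    #define QUENCH_ACKS 10      /* ACKs to count before resuming */
+
+    struct conn {
+        uint32_t snd_wnd;       /* window offered by distant host */
+        int      quenched;      /* ACKs left until normal running */
+    };
+
+    /* ICMP hands the Source Quench up to TCP. */
+    static void on_source_quench(struct conn *c)
+    {
+        c->quenched = QUENCH_ACKS;
+    }
+
+    /* While throttled, the usable window is reported as zero,
+       which stops new data but not acknowledgments or
+       retransmissions; each arriving ACK counts down toward
+       normal operation.  */
+    static uint32_t usable_window(const struct conn *c)
+    {
+        return c->quenched > 0 ? 0 : c->snd_wnd;
+    }
+
+    static void on_ack(struct conn *c)
+    {
+        if (c->quenched > 0)
+            c->quenched--;
+    }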
+
+Source Quench thus has the effect of restricting the connection to
+a small number (perhaps one) of outstanding messages.  Thus, com-
+munication can continue, but at a reduced rate, which is exactly
+the effect desired.
+
+This scheme has the important property that Source Quench doesn't
+inhibit the sending of acknowledges or retransmissions. Imple-
+mentations of Source Quench entirely within the IP layer are usu-
+ally unsuccessful because IP lacks enough information to throttle
+a connection properly. Holding back acknowledges tends to pro-
+duce retransmissions and thus unnecessary traffic. Holding back
+retransmissions may cause loss of a connection by a retransmis-
+sion timeout. Our scheme will keep connections alive under
+severe overload but at reduced bandwidth per connection.
+
+Other protocols at the same layer as TCP should also be respon-
+sive to Source Quench. In each case we would suggest that new
+traffic should be throttled but acknowledges should be treated
+normally. The only serious problem comes from the User Datagram
+Protocol, not normally a major traffic generator. We have not
+implemented any throttling in these protocols as yet; all are
+passed Source Quench messages by ICMP but ignore them.
+
+ Self-defense for gateways
+
+As we have shown, gateways are vulnerable to host mismanagement
+of congestion. A host misbehaving by excessive traffic generation
+can not only prevent its own traffic from getting through, but can
+also interfere with other, unrelated traffic. The problem can
+be dealt with at the host level but since one malfunctioning host
+can interfere with others, future gateways should be capable of
+defending themselves against such behavior by obnoxious or mali-
+cious hosts. We offer some basic self-defense techniques.
+
+On one occasion in late 1983, a TCP bug in an ARPANET host caused
+the host to frantically generate retransmissions of the same
+datagram as fast as the ARPANET would accept them. The gateway
+________
+ * This follows the control engineering dictum "Never bother
+ with proportional control unless bang-bang doesn't work".
+
+that connected our net with the ARPANET was saturated and little
+useful traffic could get through, since the gateway had more
+bandwidth to the ARPANET than to our net. The gateway busily
+sent ICMP Source Quench messages but the malfunctioning host
+ignored them. This continued for several hours, until the mal-
+functioning host crashed. During this period, our network was
+effectively disconnected from the ARPANET.
+
+When a gateway is forced to discard a packet, the packet is
+selected at the discretion of the gateway. Classic techniques
+for making this decision are to discard the most recently
+received packet, or the packet at the end of the longest outgoing
+queue. We suggest that a worthwhile practical measure is to dis-
+card the latest packet from the host that originated the most
+packets currently queued within the gateway. This strategy will
+tend to balance throughput amongst the hosts using the gateway.
+We have not yet tried this strategy, but it seems a reasonable
+starting point for gateway self-protection.
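+
+As an illustration only, the following C sketch expresses the
+policy over a small fixed-size queue; the data structures and
+names are assumptions made for exposition, and as stated above
+the strategy itself is untested.
+
+    #include <stdint.h>
+
+    #define QLEN 64
+
+    struct qpkt {
+        uint32_t src;            /* originating host             */
+        int      valid;          /* slot holds a queued packet   */
+    };                           /* higher index = queued later  */
+
+    /* Drop the most recently queued packet from whichever host
+       has the most packets queued, tending to balance
+       throughput among the hosts using the gateway.  */
+    static void drop_from_greediest(struct qpkt q[QLEN])
+    {
+        uint32_t worst = 0;
+        int worst_count = 0, count, i, j;
+
+        for (i = 0; i < QLEN; i++) {
+            if (!q[i].valid)
+                continue;
+            count = 0;
+            for (j = 0; j < QLEN; j++)
+                if (q[j].valid && q[j].src == q[i].src)
+                    count++;
+            if (count > worst_count) {
+                worst_count = count;
+                worst = q[i].src;
+            }
+        }
+        for (i = QLEN - 1; i >= 0; i--)       /* latest first */
+            if (q[i].valid && q[i].src == worst) {
+                q[i].valid = 0;
+                break;
+            }
+    }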
+
+Another strategy is to discard a newly arrived packet if the
+packet duplicates a packet already in the queue. The computa-
+tional load for this check is not a problem if hashing techniques
+are used. This check will not protect against malicious hosts
+but will provide some protection against TCP implementations with
+poor retransmission control. Gateways between fast local net-
+works and slower long-haul networks may find this check valuable
+if the local hosts are tuned to work well with the local network.
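+
+As an illustration only, a hash over identifying header fields
+makes the duplicate test cheap.  The hash and table below are
+assumptions made for exposition; since distinct packets can
+collide, a real gateway would confirm a match by comparing the
+queued packet itself, and would clear entries as packets leave
+the queue.
+
+    #include <stdint.h>
+
+    #define NBUCKETS 256
+
+    static uint32_t seen[NBUCKETS];  /* hashes of queued packets */
+
+    /* A packet is a probable duplicate if one with the same
+       (source, destination, protocol, identification) hash is
+       already queued.  */
+    static int probable_duplicate(uint32_t src, uint32_t dst,
+                                  uint8_t proto, uint16_t ip_id)
+    {
+        uint32_t h = src ^ (dst << 1) ^
+                     ((uint32_t)proto << 16) ^ ip_id;
+        uint32_t *slot = &seen[h % NBUCKETS];
+
+        if (*slot == h)
+            return 1;
+        *slot = h;                   /* remember this packet */
+        return 0;
+    }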
+
+Ideally the gateway should detect malfunctioning hosts and
+squelch them; such detection is difficult in a pure datagram sys-
+tem. Failure to respond to an ICMP Source Quench message,
+though, should be regarded as grounds for action by a gateway to
+disconnect a host. Detecting such failure is non-trivial but is
+a worthwhile area for further research.
+
+ Conclusion
+
+The congestion control problems associated with pure datagram
+networks are difficult, but effective solutions exist. If IP /
+TCP networks are to be operated under heavy load, TCP implementa-
+tions must address several key issues in ways at least as effec-
+tive as the ones described here.
+