Diffstat (limited to 'doc/rfc/rfc896.txt')
-rw-r--r--  doc/rfc/rfc896.txt  512
1 file changed, 512 insertions, 0 deletions
diff --git a/doc/rfc/rfc896.txt b/doc/rfc/rfc896.txt
new file mode 100644
index 0000000..d8a480a
--- /dev/null
+++ b/doc/rfc/rfc896.txt
@@ -0,0 +1,512 @@
+
+
+Network Working Group John Nagle
+Request For Comments: 896 6 January 1984
+ Ford Aerospace and Communications Corporation
+
+ Congestion Control in IP/TCP Internetworks
+
+This memo discusses some aspects of congestion control in IP/TCP
+Internetworks. It is intended to stimulate thought and further
+discussion of this topic. While some specific suggestions are
+made for improved congestion control implementation, this memo
+does not specify any standards.
+
+ Introduction
+
+Congestion control is a recognized problem in complex networks.
+We have discovered that the Department of Defense's Internet Pro-
+tocol (IP), a pure datagram protocol, and Transmission Control
+Protocol (TCP), a transport layer protocol, when used together,
+are subject to unusual congestion problems caused by interactions
+between the transport and datagram layers. In particular, IP
+gateways are vulnerable to a phenomenon we call "congestion col-
+lapse", especially when such gateways connect networks of widely
+different bandwidth. We have developed solutions that prevent
+congestion collapse.
+
+These problems are not generally recognized because these proto-
+cols are used most often on networks built on top of ARPANET IMP
+technology. ARPANET IMP based networks traditionally have uni-
+form bandwidth and identical switching nodes, and are sized with
+substantial excess capacity. This excess capacity, and the abil-
+ity of the IMP system to throttle the transmissions of hosts,
+has for most IP/TCP hosts and networks been adequate to handle
+congestion. With the recent split of the ARPANET into two inter-
+connected networks and the growth of other networks with differ-
+ing properties connected to the ARPANET, however, reliance on the
+benign properties of the IMP system is no longer enough to allow
+hosts to communicate rapidly and reliably. Improved handling of
+congestion is now mandatory for successful network operation
+under load.
+
+Ford Aerospace and Communications Corporation, and its parent
+company, Ford Motor Company, operate the only private IP/TCP
+long-haul network in existence today. This network connects four
+facilities (one in Michigan, two in California, and one in Eng-
+land) some with extensive local networks. This net is cross-tied
+to the ARPANET but uses its own long-haul circuits; traffic
+between Ford facilities flows over private leased circuits,
+including a leased transatlantic satellite connection. All
+switching nodes are pure IP datagram switches with no node-to-
+node flow control, and all hosts run software either written or
+heavily modified by Ford or Ford Aerospace. Bandwidth of links
+in this network varies widely, from 1200 to 10,000,000 bits per
+second. In general, we have not been able to afford the luxury
+of excess long-haul bandwidth that the ARPANET possesses, and our
+long-haul links are heavily loaded during peak periods. Transit
+times of several seconds are thus common in our network.
+
+Because of our pure datagram orientation, heavy loading, and wide
+variation in bandwidth, we have had to solve problems that the
+ARPANET / MILNET community is just beginning to recognize. Our
+network is sensitive to suboptimal behavior by host TCP implemen-
+tations, both on and off our own net. We have devoted consider-
+able effort to examining TCP behavior under various conditions,
+and have solved some widely prevalent problems with TCP. We
+present here two problems and their solutions. Many TCP imple-
+mentations have these problems; if throughput is worse through an
+ARPANET / MILNET gateway for a given TCP implementation than
+throughput across a single net, there is a high probability that
+the TCP implementation has one or both of these problems.
+
+ Congestion collapse
+
+Before we proceed with a discussion of the two specific problems
+and their solutions, a description of what happens when these
+problems are not addressed is in order. In heavily loaded pure
+datagram networks with end-to-end retransmission, as switching
+nodes become congested, the round trip time through the net
+increases and the count of datagrams in transit within the net
+also increases. This is normal behavior under load. As long as
+there is only one copy of each datagram in transit, congestion is
+under control. Once retransmission of datagrams not yet
+delivered begins, there is potential for serious trouble.
+
+Host TCP implementations are expected to retransmit packets
+several times at increasing time intervals until some upper limit
+on the retransmit interval is reached. Normally, this mechanism
+is enough to prevent serious congestion problems. Even with the
+better adaptive host retransmission algorithms, though, a sudden
+load on the net can cause the round-trip time to rise faster than
+the sending host's measurements of round-trip time can be updated.
+Such a load occurs when a new bulk transfer, such as a file
+transfer, begins and starts filling a large window. Should the
+round-trip time exceed the maximum retransmission interval for
+any host, that host will begin to introduce more and more copies
+of the same datagrams into the net. The network is now in seri-
+ous trouble. Eventually all available buffers in the switching
+nodes will be full and packets must be dropped. The round-trip
+time for packets that are delivered is now at its maximum. Hosts
+are sending each packet several times, and eventually some copy
+of each packet arrives at its destination. This is congestion
+collapse.
+
+This condition is stable. Once the saturation point has been
+reached, if the algorithm for selecting packets to be dropped is
+fair, the network will continue to operate in a degraded condi-
+tion. In this condition every packet is being transmitted
+several times and throughput is reduced to a small fraction of
+normal. We have pushed our network into this condition experi-
+mentally and observed its stability. It is possible for round-
+trip time to become so large that connections are broken because
+the hosts involved time out.
+
+Congestion collapse and pathological congestion are not normally
+seen in the ARPANET / MILNET system because these networks have
+substantial excess capacity. Where connections do not pass
+through IP gateways, the IMP-to-host flow control mechanisms usu-
+ally prevent congestion collapse, especially since TCP implemen-
+tations tend to be well adjusted for the time constants associ-
+ated with the pure ARPANET case. However, other than ICMP Source
+Quench messages, nothing fundamentally prevents congestion col-
+lapse when TCP is run over the ARPANET / MILNET and packets are
+being dropped at gateways. Worth noting is that a few badly-
+behaved hosts can by themselves congest the gateways and prevent
+other hosts from passing traffic. We have observed this problem
+repeatedly with certain hosts (with whose administrators we have
+communicated privately) on the ARPANET.
+
+Adding additional memory to the gateways will not solve the prob-
+lem. The more memory added, the longer round-trip times must
+become before packets are dropped. Thus, the onset of congestion
+collapse will be delayed, but when collapse occurs, an even larger
+fraction of the packets in the net will be duplicates and
+throughput will be even worse.
+
+ The two problems
+
+Two key problems with the engineering of TCP implementations have
+been observed; we call these the small-packet problem and the
+source-quench problem. The second is being addressed by several
+implementors; the first is generally believed (incorrectly) to be
+solved. We have discovered that once the small-packet problem
+has been solved, the source-quench problem becomes much more
+tractable. We thus present the small-packet problem and our
+solution to it first.
+
+ The small-packet problem
+
+There is a special problem associated with small packets. When
+TCP is used for the transmission of single-character messages
+originating at a keyboard, the typical result is that 41 byte
+packets (one byte of data, 40 bytes of header) are transmitted
+for each byte of useful data. This 4000% overhead is annoying
+but tolerable on lightly loaded networks. On heavily loaded net-
+works, however, the congestion resulting from this overhead can
+result in lost datagrams and retransmissions, as well as exces-
+sive propagation time caused by congestion in switching nodes and
+gateways. In practice, throughput may drop so low that TCP con-
+nections are aborted.
+
+This classic problem is well-known and was first addressed in the
+Tymnet network in the late 1960s. The solution used there was to
+impose a limit on the count of datagrams generated per unit time.
+This limit was enforced by delaying transmission of small packets
+until a short (200-500ms) time had elapsed, in hope that another
+character or two would become available for addition to the same
+packet before the timer ran out. An additional feature to
+enhance user acceptability was to inhibit the time delay when a
+control character, such as a carriage return, was received.
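+
+As an illustration only, the following C sketch shows the shape of
+such a timer scheme.  The names, the 300ms figure, and the callback
+arguments are assumptions made for exposition here, not code from
+any of the systems mentioned.
+
+    #include <stddef.h>
+
+    #define BATCH_DELAY_MS 300       /* within the 200-500ms range */
+
+    struct outq {
+        char   buf[512];             /* characters awaiting a send */
+        size_t len;
+        int    timer_running;
+    };
+
+    /* Called for each character the user types.  Small packets
+       are held until the timer fires, except that a control
+       character flushes the queue at once.  */
+    void queue_char(struct outq *q, char c,
+                    void (*send_packet)(struct outq *),
+                    void (*start_timer)(int ms))
+    {
+        if (q->len < sizeof q->buf)
+            q->buf[q->len++] = c;
+        if (c == '\r' || c == '\n') {     /* control character:  */
+            send_packet(q);               /* inhibit the delay   */
+            q->timer_running = 0;
+            return;
+        }
+        if (!q->timer_running) {          /* first character     */
+            start_timer(BATCH_DELAY_MS);  /* starts flush timer  */
+            q->timer_running = 1;
+        }
+    }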
+
+This technique has been used in NCP Telnet, X.25 PADs, and TCP
+Telnet. It has the advantage of being well-understood, and is not
+too difficult to implement. Its flaw is that it is hard to come
+up with a time limit that will satisfy everyone. A time limit
+short enough to provide highly responsive service over a 10M bits
+per second Ethernet will be too short to prevent congestion col-
+lapse over a heavily loaded net with a five second round-trip
+time; and conversely, a time limit long enough to handle the
+heavily loaded net will produce frustrated users on the Ethernet.
+
+ The solution to the small-packet problem
+
+Clearly an adaptive approach is desirable. One would expect a
+proposal for an adaptive inter-packet time limit based on the
+round-trip delay observed by TCP. While such a mechanism could
+certainly be implemented, it is unnecessary. A simple and
+elegant solution has been discovered.
+
+The solution is to inhibit the sending of new TCP segments when
+new outgoing data arrives from the user if any previously
+transmitted data on the connection remains unacknowledged. This
+inhibition is to be unconditional; no timers, tests for size of
+data received, or other conditions are required. Implementation
+typically requires one or two lines inside a TCP program.
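+
+As an illustration only, the inhibition reduces to a comparison of
+two sequence variables.  The C sketch below uses the common names
+snd_una and snd_nxt; the structure and names are assumptions made
+for exposition here, not a quotation from any implementation.
+
+    #include <stdint.h>
+
+    struct tcb {                /* per-connection control block    */
+        uint32_t snd_una;       /* oldest unacknowledged sequence  */
+        uint32_t snd_nxt;       /* next sequence number to be sent */
+    };
+
+    /* New data from the user may go out at once only if nothing
+       previously transmitted remains unacknowledged; otherwise
+       it is queued and will be sent when the next ACK or window
+       update arrives and the output routine runs again.  */
+    static int may_send_now(const struct tcb *tp)
+    {
+        return tp->snd_una == tp->snd_nxt;
+    }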
+
+At first glance, this solution seems to imply drastic changes in
+the behavior of TCP. This is not so. It all works out right in
+the end. Let us see why this is so.
+
+When a user process writes to a TCP connection, TCP receives some
+data. It may hold that data for future sending or may send a
+packet immediately. If it refrains from sending now, it will
+typically send the data later when an incoming packet arrives and
+changes the state of the system. The state changes in one of two
+ways: the incoming packet acknowledges old data the distant host
+has received, or announces the availability of buffer space in
+the distant host for new data. (This last is referred to as
+"updating the window"). Each time data arrives on a connec-
+tion, TCP must reexamine its current state and perhaps send some
+packets out. Thus, when we omit sending data on arrival from the
+user, we are simply deferring its transmission until the next
+message arrives from the distant host. A message must always
+arrive soon unless the connection was previously idle or communi-
+cations with the other end have been lost. In the first case,
+the idle connection, our scheme will result in a packet being
+sent whenever the user writes to the TCP connection. Thus we do
+not deadlock in the idle condition. In the second case, where
+the distant host has failed, sending more data is futile anyway.
+Note that we have done nothing to inhibit normal TCP retransmis-
+sion logic, so lost messages are not a problem.
+
+Examination of the behavior of this scheme under various condi-
+tions demonstrates that the scheme does work in all cases. The
+first case to examine is the one we wanted to solve, that of the
+character-oriented Telnet connection. Let us suppose that the
+user is sending TCP a new character every 200ms, and that the
+connection is via an Ethernet with a round-trip time including
+software processing of 50ms. Without any mechanism to prevent
+small-packet congestion, one packet will be sent for each charac-
+ter, and response will be optimal. Overhead will be 4000%, but
+this is acceptable on an Ethernet. The classic timer scheme,
+with a limit of 2 packets per second, will cause two or three
+characters to be sent per packet. Response will thus be degraded
+even though on a high-bandwidth Ethernet this is unnecessary.
+Overhead will drop to 1500%, but on an Ethernet this is a bad
+tradeoff. With our scheme, every character the user types will
+find TCP with an idle connection, and the character will be sent
+at once, just as in the no-control case. The user will see no
+visible delay. Thus, our scheme performs as well as the no-
+control scheme and provides better responsiveness than the timer
+scheme.
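+
+(The 1500% figure is the header arithmetic: 40 header bytes per
+two to three data bytes gives an overhead near 40/2.7, or roughly
+1500%.)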
+
+The second case to examine is the same Telnet test but over a
+long-haul link with a 5-second round trip time. Without any
+mechanism to prevent small-packet congestion, 25 new packets
+would be sent in 5 seconds.* Overhead here is 4000%. With the
+classic timer scheme, and the same limit of 2 packets per second,
+there would still be 10 packets outstanding and contributing to
+congestion. Round-trip time will not be improved by sending many
+packets, of course; in general it will be worse since the packets
+will contend for line time. Overhead now drops to 1500%. With
+our scheme, however, the first character from the user would find
+an idle TCP connection and would be sent immediately. The next
+24 characters, arriving from the user at 200ms intervals, would
+be held pending a message from the distant host. When an ACK
+arrived for the first packet at the end of 5 seconds, a single
+packet with the 24 queued characters would be sent. Our scheme
+thus results in an overhead reduction to 320% with no penalty in
+response time. Response time will usually be improved with our
+scheme because packet overhead is reduced, here by a factor of
+4.7 over the classic timer scheme. Congestion will be reduced by
+this factor and round-trip delay will decrease sharply. For this
+________
+ * This problem is not seen in the pure ARPANET case because the
+ IMPs will block the host when the count of packets
+ outstanding becomes excessive, but in the case where a pure
+ datagram local net (such as an Ethernet) or a pure datagram
+ gateway (such as an ARPANET / MILNET gateway) is involved, it
+ is possible to have large numbers of tiny packets
+ outstanding.
+
+case, our scheme has a striking advantage over either of the
+other approaches.
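+
+(To check the 320% figure: the 25 characters now travel in two
+packets, so 2 x 40 = 80 header bytes carry 25 data bytes, and
+80/25 gives 320% overhead.)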
+
+We use our scheme for all TCP connections, not just Telnet con-
+nections. Let us see what happens for a file transfer data con-
+nection using our technique. The two extreme cases will again be
+considered.
+
+As before, we first consider the Ethernet case. The user is now
+writing data to TCP in 512 byte blocks as fast as TCP will accept
+them. The user's first write to TCP will start things going; our
+first datagram will be 512+40 bytes or 552 bytes long. The
+user's second write to TCP will not cause a send but will cause
+the block to be buffered. Assume that the user fills up TCP's
+outgoing buffer area before the first ACK comes back. Then when
+the ACK comes in, all queued data up to the window size will be
+sent. From then on, the window will be kept full, as each ACK
+initiates a sending cycle and queued data is sent out. Thus,
+after a one round-trip time initial period when only one block is
+sent, our scheme settles down into a maximum-throughput condi-
+tion. The delay in startup is only 50ms on the Ethernet, so the
+startup transient is insignificant. All three schemes provide
+equivalent performance for this case.
+
+Finally, let us look at a file transfer over the 5-second round
+trip time connection. Again, only one packet will be sent until
+the first ACK comes back; the window will then be filled and kept
+full. Since the round-trip time is 5 seconds, only 512 bytes of
+data are transmitted in the first 5 seconds. Assuming a 2K win-
+dow, once the first ACK comes in, 2K of data will be sent and a
+steady rate of 2K per 5 seconds will be maintained thereafter.
+Only for this case is our scheme inferior to the timer scheme,
+and the difference is only in the startup transient; steady-state
+throughput is identical. The naive scheme and the timer scheme
+would both take 250 seconds to transmit a 100K byte file under
+the above conditions and our scheme would take 254 seconds, a
+difference of 1.6%.
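+
+(The arithmetic: at 2K bytes per 5-second round trip, a 100K byte
+file takes 50 round trips, or 250 seconds.  Under our scheme the
+first round trip carries only 512 bytes, a deficit of about three
+quarters of a window, or roughly 4 seconds; hence about 254
+seconds, 1.6% more.)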
+
+Thus, for all cases examined, our scheme provides at least 98% of
+the performance of both other schemes, and provides a dramatic
+improvement in Telnet performance over paths with long round trip
+times. We use our scheme in the Ford Aerospace Software
+Engineering Network, and are able to run screen editors over Eth-
+ernet and talk to distant TOPS-20 hosts with improved performance
+in both cases.
+
+ Congestion control with ICMP
+
+Having solved the small-packet congestion problem and with it the
+problem of excessive small-packet congestion within our own net-
+work, we turned our attention to the problem of general conges-
+tion control. Since our own network is pure datagram with no
+node-to-node flow control, the only mechanism available to us
+under the IP standard was the ICMP Source Quench message. With
+careful handling, we find this adequate to prevent serious
+congestion problems. We do find it necessary to be careful about
+the behavior of our hosts and switching nodes regarding Source
+Quench messages.
+
+ When to send an ICMP Source Quench
+
+The present ICMP standard* specifies that an ICMP Source Quench
+message should be sent whenever a packet is dropped, and addi-
+tionally may be sent when a gateway finds itself becoming short
+of resources. There is some ambiguity here but clearly it is a
+violation of the standard to drop a packet without sending an
+ICMP message.
+
+Our basic assumption is that packets ought not to be dropped dur-
+ing normal network operation. We therefore want to throttle
+senders back before they overload switching nodes and gateways.
+All our switching nodes send ICMP Source Quench messages well
+before buffer space is exhausted; they do not wait until it is
+necessary to drop a message before sending an ICMP Source Quench.
+As demonstrated in our analysis of the small-packet problem,
+merely providing large amounts of buffering is not a solution.
+In general, our experience is that Source Quench should be sent
+when about half the buffering space is exhausted; this is not
+based on extensive experimentation but appears to be a reasonable
+engineering decision. One could argue for an adaptive scheme
+that adjusted the quench generation threshold based on recent
+experience; we have not found this necessary as yet.
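+
+As an illustration only, the half-full rule reduces to a one-line
+test; the structure and names below are assumptions made for
+exposition, not our switching-node code.
+
+    struct node_bufs {
+        int total;               /* buffers configured           */
+        int in_use;              /* buffers currently occupied   */
+    };
+
+    /* Quench the source well before exhaustion, here at the
+       half-full point suggested above, rather than waiting
+       until a packet must actually be dropped.  */
+    static int should_send_source_quench(const struct node_bufs *b)
+    {
+        return 2 * b->in_use >= b->total;
+    }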
+
+There exist other gateway implementations that generate Source
+Quenches only after more than one packet has been discarded. We
+consider this approach undesirable since any system for control-
+ling congestion based on the discarding of packets is wasteful of
+bandwidth and may be susceptible to congestion collapse under
+heavy load. Our understanding is that the decision to generate
+Source Quenches with great reluctance stems from a fear that ack-
+nowledge traffic will be quenched and that this will result in
+connection failure. As will be shown below, appropriate handling
+of Source Quench in host implementations eliminates this possi-
+bility.
+
+ What to do when an ICMP Source Quench is received
+
+We inform TCP or any other protocol at that layer when ICMP
+receives a Source Quench. The basic action of our TCP implemen-
+tations is to reduce the amount of data outstanding on connec-
+tions to the host mentioned in the Source Quench. This control is
+________
+ * ARPANET RFC 792 is the present standard. We are advised by
+ the Defense Communications Agency that the description of
+ ICMP in MIL-STD-1777 is incomplete and will be deleted from
+      future revisions of that standard.
+
+applied by causing the sending TCP to behave as if the distant
+host's window size has been reduced. Our first implementation
+was simplistic but effective; once a Source Quench has been
+received, our TCP behaves as if the window size is zero whenever
+the window isn't empty. This behavior continues until some
+number (at present 10) of ACKs have been received, at which time
+TCP returns to normal operation.* David Mills of Linkabit Cor-
+poration has since implemented a similar but more elaborate
+throttle on the count of outstanding packets in his DCN systems.
+The additional sophistication seems to produce a modest gain in
+throughput, but we have not made formal tests. Both implementa-
+tions effectively prevent congestion collapse in switching nodes.
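+
+As an illustration only, the simplistic throttle described above
+can be sketched in C as follows.  The count of 10 ACKs is as
+stated in the text; the structure and names are assumptions made
+for exposition, not our actual implementation.
+
+    #include <stdint.h>
+
+    #define QUENCH_ACKS 10      /* ACKs to count before resuming */
+
+    struct conn {
+        uint32_t snd_wnd;       /* window offered by distant host */
+        int      quenched;      /* ACKs left until normal running */
+    };
+
+    /* ICMP hands the Source Quench up to TCP. */
+    static void on_source_quench(struct conn *c)
+    {
+        c->quenched = QUENCH_ACKS;
+    }
+
+    /* While throttled, the usable window is reported as zero,
+       which stops new data but not acknowledgments or
+       retransmissions; each arriving ACK counts down toward
+       normal operation.  */
+    static uint32_t usable_window(const struct conn *c)
+    {
+        return c->quenched > 0 ? 0 : c->snd_wnd;
+    }
+
+    static void on_ack(struct conn *c)
+    {
+        if (c->quenched > 0)
+            c->quenched--;
+    }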
+
+Source Quench thus has the effect of restricting the connection to
+a small number (perhaps one) of outstanding messages.  Thus, com-
+munication can continue, but at a reduced rate, which is exactly
+the effect desired.
+
+This scheme has the important property that Source Quench doesn't
+inhibit the sending of acknowledges or retransmissions. Imple-
+mentations of Source Quench entirely within the IP layer are usu-
+ally unsuccessful because IP lacks enough information to throttle
+a connection properly. Holding back acknowledges tends to pro-
+duce retransmissions and thus unnecessary traffic. Holding back
+retransmissions may cause loss of a connection by a retransmis-
+sion timeout. Our scheme will keep connections alive under
+severe overload but at reduced bandwidth per connection.
+
+Other protocols at the same layer as TCP should also be respon-
+sive to Source Quench. In each case we would suggest that new
+traffic should be throttled but acknowledges should be treated
+normally. The only serious problem comes from the User Datagram
+Protocol, not normally a major traffic generator. We have not
+implemented any throttling in these protocols as yet; all are
+passed Source Quench messages by ICMP but ignore them.
+
+ Self-defense for gateways
+
+As we have shown, gateways are vulnerable to host mismanagement
+of congestion. A host misbehaving by excessive traffic generation
+can not only prevent its own traffic from getting through, but can
+also interfere with other, unrelated traffic. The problem can
+be dealt with at the host level but since one malfunctioning host
+can interfere with others, future gateways should be capable of
+defending themselves against such behavior by obnoxious or mali-
+cious hosts. We offer some basic self-defense techniques.
+
+On one occasion in late 1983, a TCP bug in an ARPANET host caused
+the host to frantically generate retransmissions of the same
+datagram as fast as the ARPANET would accept them. The gateway
+________
+ * This follows the control engineering dictum "Never bother
+ with proportional control unless bang-bang doesn't work".
+
+that connected our net with the ARPANET was saturated and little
+useful traffic could get through, since the gateway had more
+bandwidth to the ARPANET than to our net. The gateway busily
+sent ICMP Source Quench messages but the malfunctioning host
+ignored them. This continued for several hours, until the mal-
+functioning host crashed. During this period, our network was
+effectively disconnected from the ARPANET.
+
+When a gateway is forced to discard a packet, the packet is
+selected at the discretion of the gateway. Classic techniques
+for making this decision are to discard the most recently
+received packet, or the packet at the end of the longest outgoing
+queue. We suggest that a worthwhile practical measure is to dis-
+card the latest packet from the host that originated the most
+packets currently queued within the gateway. This strategy will
+tend to balance throughput amongst the hosts using the gateway.
+We have not yet tried this strategy, but it seems a reasonable
+starting point for gateway self-protection.
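+
+As an illustration only, the following C sketch expresses the
+policy over a small fixed-size queue; the data structures and
+names are assumptions made for exposition, and as stated above
+the strategy itself is untested.
+
+    #include <stdint.h>
+
+    #define QLEN 64
+
+    struct qpkt {
+        uint32_t src;            /* originating host             */
+        int      valid;          /* slot holds a queued packet   */
+    };                           /* higher index = queued later  */
+
+    /* Drop the most recently queued packet from whichever host
+       has the most packets queued, tending to balance
+       throughput among the hosts using the gateway.  */
+    static void drop_from_greediest(struct qpkt q[QLEN])
+    {
+        uint32_t worst = 0;
+        int worst_count = 0, count, i, j;
+
+        for (i = 0; i < QLEN; i++) {
+            if (!q[i].valid)
+                continue;
+            count = 0;
+            for (j = 0; j < QLEN; j++)
+                if (q[j].valid && q[j].src == q[i].src)
+                    count++;
+            if (count > worst_count) {
+                worst_count = count;
+                worst = q[i].src;
+            }
+        }
+        for (i = QLEN - 1; i >= 0; i--)       /* latest first */
+            if (q[i].valid && q[i].src == worst) {
+                q[i].valid = 0;
+                break;
+            }
+    }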
+
+Another strategy is to discard a newly arrived packet if the
+packet duplicates a packet already in the queue. The computa-
+tional load for this check is not a problem if hashing techniques
+are used. This check will not protect against malicious hosts
+but will provide some protection against TCP implementations with
+poor retransmission control. Gateways between fast local net-
+works and slower long-haul networks may find this check valuable
+if the local hosts are tuned to work well with the local network.
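+
+As an illustration only, a hash over identifying header fields
+makes the duplicate test cheap.  The hash and table below are
+assumptions made for exposition; since distinct packets can
+collide, a real gateway would confirm a match by comparing the
+queued packet itself, and would clear entries as packets leave
+the queue.
+
+    #include <stdint.h>
+
+    #define NBUCKETS 256
+
+    static uint32_t seen[NBUCKETS];  /* hashes of queued packets */
+
+    /* A packet is a probable duplicate if one with the same
+       (source, destination, protocol, identification) hash is
+       already queued.  */
+    static int probable_duplicate(uint32_t src, uint32_t dst,
+                                  uint8_t proto, uint16_t ip_id)
+    {
+        uint32_t h = src ^ (dst << 1) ^
+                     ((uint32_t)proto << 16) ^ ip_id;
+        uint32_t *slot = &seen[h % NBUCKETS];
+
+        if (*slot == h)
+            return 1;
+        *slot = h;                   /* remember this packet */
+        return 0;
+    }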
+
+Ideally the gateway should detect malfunctioning hosts and
+squelch them; such detection is difficult in a pure datagram sys-
+tem. Failure to respond to an ICMP Source Quench message,
+though, should be regarded as grounds for action by a gateway to
+disconnect a host. Detecting such failure is non-trivial but is
+a worthwhile area for further research.
+
+ Conclusion
+
+The congestion control problems associated with pure datagram
+networks are difficult, but effective solutions exist. If IP /
+TCP networks are to be operated under heavy load, TCP implementa-
+tions must address several key issues in ways at least as effec-
+tive as the ones described here.
+