summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc5681.txt
blob: 07b414fb30b70158a832dcecafa52125054b0ae7 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
Network Working Group                                          M. Allman
Request for Comments: 5681                                     V. Paxson
Obsoletes: 2581                                                     ICSI
Category: Standards Track                                     E. Blanton
                                                       Purdue University
                                                          September 2009


                         TCP Congestion Control

Abstract

   This document defines TCP's four intertwined congestion control
   algorithms: slow start, congestion avoidance, fast retransmit, and
   fast recovery.  In addition, the document specifies how TCP should
   begin transmission after a relatively long idle period, as well as
   discussing various acknowledgment generation methods.  This document
   obsoletes RFC 2581.

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may





Allman, et al.              Standards Track                     [Page 1]
^L
RFC 5681                 TCP Congestion Control           September 2009


   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table Of Contents

   1. Introduction ....................................................2
   2. Definitions .....................................................3
   3. Congestion Control Algorithms ...................................4
      3.1. Slow Start and Congestion Avoidance ........................4
      3.2. Fast Retransmit/Fast Recovery ..............................8
   4. Additional Considerations ......................................10
      4.1. Restarting Idle Connections ...............................10
      4.2. Generating Acknowledgments ................................11
      4.3. Loss Recovery Mechanisms ..................................12
   5. Security Considerations ........................................13
   6. Changes between RFC 2001 and RFC 2581 ..........................13
   7. Changes Relative to RFC 2581 ...................................14
   8. Acknowledgments ................................................15
   9. References .....................................................15
      9.1. Normative References ......................................15
      9.2. Informative References ....................................16

1.  Introduction

   This document specifies four TCP [RFC793] congestion control
   algorithms: slow start, congestion avoidance, fast retransmit and
   fast recovery.  These algorithms were devised in [Jac88] and [Jac90].
   Their use with TCP is standardized in [RFC1122].  Additional early
   work in additive-increase, multiplicative-decrease congestion control
   is given in [CJ89].

   Note that [Ste94] provides examples of these algorithms in action and
   [WS95] provides an explanation of the source code for the BSD
   implementation of these algorithms.

   In addition to specifying these congestion control algorithms, this
   document specifies what TCP connections should do after a relatively
   long idle period, as well as specifying and clarifying some of the
   issues pertaining to TCP ACK generation.

   This document obsoletes [RFC2581], which in turn obsoleted [RFC2001].

   This document is organized as follows.  Section 2 provides various
   definitions that will be used throughout the document.  Section 3
   provides a specification of the congestion control algorithms.
   Section 4 outlines concerns related to the congestion control
   algorithms and finally, section 5 outlines security considerations.



Allman, et al.              Standards Track                     [Page 2]
^L
RFC 5681                 TCP Congestion Control           September 2009


   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Definitions

   This section provides the definition of several terms that will be
   used throughout the remainder of this document.

   SEGMENT: A segment is ANY TCP/IP data or acknowledgment packet (or
      both).

   SENDER MAXIMUM SEGMENT SIZE (SMSS): The SMSS is the size of the
      largest segment that the sender can transmit.  This value can be
      based on the maximum transmission unit of the network, the path
      MTU discovery [RFC1191, RFC4821] algorithm, RMSS (see next item),
      or other factors.  The size does not include the TCP/IP headers
      and options.

   RECEIVER MAXIMUM SEGMENT SIZE (RMSS): The RMSS is the size of the
      largest segment the receiver is willing to accept.  This is the
      value specified in the MSS option sent by the receiver during
      connection startup.  Or, if the MSS option is not used, it is 536
      bytes [RFC1122].  The size does not include the TCP/IP headers and
      options.

   FULL-SIZED SEGMENT: A segment that contains the maximum number of
      data bytes permitted (i.e., a segment containing SMSS bytes of
      data).

   RECEIVER WINDOW (rwnd): The most recently advertised receiver window.

   CONGESTION WINDOW (cwnd): A TCP state variable that limits the amount
      of data a TCP can send.  At any given time, a TCP MUST NOT send
      data with a sequence number higher than the sum of the highest
      acknowledged sequence number and the minimum of cwnd and rwnd.

   INITIAL WINDOW (IW): The initial window is the size of the sender's
      congestion window after the three-way handshake is completed.

   LOSS WINDOW (LW): The loss window is the size of the congestion
      window after a TCP sender detects loss using its retransmission
      timer.

   RESTART WINDOW (RW): The restart window is the size of the congestion
      window after a TCP restarts transmission after an idle period (if
      the slow start algorithm is used; see section 4.1 for more
      discussion).



Allman, et al.              Standards Track                     [Page 3]
^L
RFC 5681                 TCP Congestion Control           September 2009


   FLIGHT SIZE: The amount of data that has been sent but not yet
      cumulatively acknowledged.

   DUPLICATE ACKNOWLEDGMENT: An acknowledgment is considered a
      "duplicate" in the following algorithms when (a) the receiver of
      the ACK has outstanding data, (b) the incoming acknowledgment
      carries no data, (c) the SYN and FIN bits are both off, (d) the
      acknowledgment number is equal to the greatest acknowledgment
      received on the given connection (TCP.UNA from [RFC793]) and (e)
      the advertised window in the incoming acknowledgment equals the
      advertised window in the last incoming acknowledgment.

      Alternatively, a TCP that utilizes selective acknowledgments
      (SACKs) [RFC2018, RFC2883] can leverage the SACK information to
      determine when an incoming ACK is a "duplicate" (e.g., if the ACK
      contains previously unknown SACK information).

3.  Congestion Control Algorithms

   This section defines the four congestion control algorithms: slow
   start, congestion avoidance, fast retransmit, and fast recovery,
   developed in [Jac88] and [Jac90].  In some situations, it may be
   beneficial for a TCP sender to be more conservative than the
   algorithms allow; however, a TCP MUST NOT be more aggressive than the
   following algorithms allow (that is, MUST NOT send data when the
   value of cwnd computed by the following algorithms would not allow
   the data to be sent).

   Also, note that the algorithms specified in this document work in
   terms of using loss as the signal of congestion.  Explicit Congestion
   Notification (ECN) could also be used as specified in [RFC3168].

3.1.  Slow Start and Congestion Avoidance

   The slow start and congestion avoidance algorithms MUST be used by a
   TCP sender to control the amount of outstanding data being injected
   into the network.  To implement these algorithms, two variables are
   added to the TCP per-connection state.  The congestion window (cwnd)
   is a sender-side limit on the amount of data the sender can transmit
   into the network before receiving an acknowledgment (ACK), while the
   receiver's advertised window (rwnd) is a receiver-side limit on the
   amount of outstanding data.  The minimum of cwnd and rwnd governs
   data transmission.

   Another state variable, the slow start threshold (ssthresh), is used
   to determine whether the slow start or congestion avoidance algorithm
   is used to control data transmission, as discussed below.




Allman, et al.              Standards Track                     [Page 4]
^L
RFC 5681                 TCP Congestion Control           September 2009


   Beginning transmission into a network with unknown conditions
   requires TCP to slowly probe the network to determine the available
   capacity, in order to avoid congesting the network with an
   inappropriately large burst of data.  The slow start algorithm is
   used for this purpose at the beginning of a transfer, or after
   repairing loss detected by the retransmission timer.  Slow start
   additionally serves to start the "ACK clock" used by the TCP sender
   to release data into the network in the slow start, congestion
   avoidance, and loss recovery algorithms.

   IW, the initial value of cwnd, MUST be set using the following
   guidelines as an upper bound.

   If SMSS > 2190 bytes:
       IW = 2 * SMSS bytes and MUST NOT be more than 2 segments
   If (SMSS > 1095 bytes) and (SMSS <= 2190 bytes):
       IW = 3 * SMSS bytes and MUST NOT be more than 3 segments
   if SMSS <= 1095 bytes:
       IW = 4 * SMSS bytes and MUST NOT be more than 4 segments

   As specified in [RFC3390], the SYN/ACK and the acknowledgment of the
   SYN/ACK MUST NOT increase the size of the congestion window.
   Further, if the SYN or SYN/ACK is lost, the initial window used by a
   sender after a correctly transmitted SYN MUST be one segment
   consisting of at most SMSS bytes.

   A detailed rationale and discussion of the IW setting is provided in
   [RFC3390].

   When initial congestion windows of more than one segment are
   implemented along with Path MTU Discovery [RFC1191], and the MSS
   being used is found to be too large, the congestion window cwnd
   SHOULD be reduced to prevent large bursts of smaller segments.
   Specifically, cwnd SHOULD be reduced by the ratio of the old segment
   size to the new segment size.

   The initial value of ssthresh SHOULD be set arbitrarily high (e.g.,
   to the size of the largest possible advertised window), but ssthresh
   MUST be reduced in response to congestion.  Setting ssthresh as high
   as possible allows the network conditions, rather than some arbitrary
   host limit, to dictate the sending rate.  In cases where the end
   systems have a solid understanding of the network path, more
   carefully setting the initial ssthresh value may have merit (e.g.,
   such that the end host does not create congestion along the path).







Allman, et al.              Standards Track                     [Page 5]
^L
RFC 5681                 TCP Congestion Control           September 2009


   The slow start algorithm is used when cwnd < ssthresh, while the
   congestion avoidance algorithm is used when cwnd > ssthresh.  When
   cwnd and ssthresh are equal, the sender may use either slow start or
   congestion avoidance.

   During slow start, a TCP increments cwnd by at most SMSS bytes for
   each ACK received that cumulatively acknowledges new data.  Slow
   start ends when cwnd exceeds ssthresh (or, optionally, when it
   reaches it, as noted above) or when congestion is observed.  While
   traditionally TCP implementations have increased cwnd by precisely
   SMSS bytes upon receipt of an ACK covering new data, we RECOMMEND
   that TCP implementations increase cwnd, per:

      cwnd += min (N, SMSS)                      (2)

   where N is the number of previously unacknowledged bytes acknowledged
   in the incoming ACK.  This adjustment is part of Appropriate Byte
   Counting [RFC3465] and provides robustness against misbehaving
   receivers that may attempt to induce a sender to artificially inflate
   cwnd using a mechanism known as "ACK Division" [SCWA99].  ACK
   Division consists of a receiver sending multiple ACKs for a single
   TCP data segment, each acknowledging only a portion of its data.  A
   TCP that increments cwnd by SMSS for each such ACK will
   inappropriately inflate the amount of data injected into the network.

   During congestion avoidance, cwnd is incremented by roughly 1 full-
   sized segment per round-trip time (RTT).  Congestion avoidance
   continues until congestion is detected.  The basic guidelines for
   incrementing cwnd during congestion avoidance are:

      * MAY increment cwnd by SMSS bytes

      * SHOULD increment cwnd per equation (2) once per RTT

      * MUST NOT increment cwnd by more than SMSS bytes

   We note that [RFC3465] allows for cwnd increases of more than SMSS
   bytes for incoming acknowledgments during slow start on an
   experimental basis; however, such behavior is not allowed as part of
   the standard.

   The RECOMMENDED way to increase cwnd during congestion avoidance is
   to count the number of bytes that have been acknowledged by ACKs for
   new data.  (A drawback of this implementation is that it requires
   maintaining an additional state variable.)  When the number of bytes
   acknowledged reaches cwnd, then cwnd can be incremented by up to SMSS
   bytes.  Note that during congestion avoidance, cwnd MUST NOT be




Allman, et al.              Standards Track                     [Page 6]
^L
RFC 5681                 TCP Congestion Control           September 2009


   increased by more than SMSS bytes per RTT.  This method both allows
   TCPs to increase cwnd by one segment per RTT in the face of delayed
   ACKs and provides robustness against ACK Division attacks.

   Another common formula that a TCP MAY use to update cwnd during
   congestion avoidance is given in equation (3):

      cwnd += SMSS*SMSS/cwnd                     (3)

   This adjustment is executed on every incoming ACK that acknowledges
   new data.  Equation (3) provides an acceptable approximation to the
   underlying principle of increasing cwnd by 1 full-sized segment per
   RTT.  (Note that for a connection in which the receiver is
   acknowledging every-other packet, (3) is less aggressive than allowed
   -- roughly increasing cwnd every second RTT.)

   Implementation Note: Since integer arithmetic is usually used in TCP
   implementations, the formula given in equation (3) can fail to
   increase cwnd when the congestion window is larger than SMSS*SMSS.
   If the above formula yields 0, the result SHOULD be rounded up to 1
   byte.

   Implementation Note: Older implementations have an additional
   additive constant on the right-hand side of equation (3).  This is
   incorrect and can actually lead to diminished performance [RFC2525].

   Implementation Note: Some implementations maintain cwnd in units of
   bytes, while others in units of full-sized segments.  The latter will
   find equation (3) difficult to use, and may prefer to use the
   counting approach discussed in the previous paragraph.

   When a TCP sender detects segment loss using the retransmission timer
   and the given segment has not yet been resent by way of the
   retransmission timer, the value of ssthresh MUST be set to no more
   than the value given in equation (4):

      ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

   where, as discussed above, FlightSize is the amount of outstanding
   data in the network.

   On the other hand, when a TCP sender detects segment loss using the
   retransmission timer and the given segment has already been
   retransmitted by way of the retransmission timer at least once, the
   value of ssthresh is held constant.






Allman, et al.              Standards Track                     [Page 7]
^L
RFC 5681                 TCP Congestion Control           September 2009


   Implementation Note: An easy mistake to make is to simply use cwnd,
   rather than FlightSize, which in some implementations may
   incidentally increase well beyond rwnd.

   Furthermore, upon a timeout (as specified in [RFC2988]) cwnd MUST be
   set to no more than the loss window, LW, which equals 1 full-sized
   segment (regardless of the value of IW).  Therefore, after
   retransmitting the dropped segment the TCP sender uses the slow start
   algorithm to increase the window from 1 full-sized segment to the new
   value of ssthresh, at which point congestion avoidance again takes
   over.

   As shown in [FF96] and [RFC3782], slow-start-based loss recovery
   after a timeout can cause spurious retransmissions that trigger
   duplicate acknowledgments.  The reaction to the arrival of these
   duplicate ACKs in TCP implementations varies widely.  This document
   does not specify how to treat such acknowledgments, but does note
   this as an area that may benefit from additional attention,
   experimentation and specification.

3.2.  Fast Retransmit/Fast Recovery

   A TCP receiver SHOULD send an immediate duplicate ACK when an out-
   of-order segment arrives.  The purpose of this ACK is to inform the
   sender that a segment was received out-of-order and which sequence
   number is expected.  From the sender's perspective, duplicate ACKs
   can be caused by a number of network problems.  First, they can be
   caused by dropped segments.  In this case, all segments after the
   dropped segment will trigger duplicate ACKs until the loss is
   repaired.  Second, duplicate ACKs can be caused by the re-ordering of
   data segments by the network (not a rare event along some network
   paths [Pax97]).  Finally, duplicate ACKs can be caused by replication
   of ACK or data segments by the network.  In addition, a TCP receiver
   SHOULD send an immediate ACK when the incoming segment fills in all
   or part of a gap in the sequence space.  This will generate more
   timely information for a sender recovering from a loss through a
   retransmission timeout, a fast retransmit, or an advanced loss
   recovery algorithm, as outlined in section 4.3.

   The TCP sender SHOULD use the "fast retransmit" algorithm to detect
   and repair loss, based on incoming duplicate ACKs.  The fast
   retransmit algorithm uses the arrival of 3 duplicate ACKs (as defined
   in section 2, without any intervening ACKs which move SND.UNA) as an
   indication that a segment has been lost.  After receiving 3 duplicate
   ACKs, TCP performs a retransmission of what appears to be the missing
   segment, without waiting for the retransmission timer to expire.





Allman, et al.              Standards Track                     [Page 8]
^L
RFC 5681                 TCP Congestion Control           September 2009


   After the fast retransmit algorithm sends what appears to be the
   missing segment, the "fast recovery" algorithm governs the
   transmission of new data until a non-duplicate ACK arrives.  The
   reason for not performing slow start is that the receipt of the
   duplicate ACKs not only indicates that a segment has been lost, but
   also that segments are most likely leaving the network (although a
   massive segment duplication by the network can invalidate this
   conclusion).  In other words, since the receiver can only generate a
   duplicate ACK when a segment has arrived, that segment has left the
   network and is in the receiver's buffer, so we know it is no longer
   consuming network resources.  Furthermore, since the ACK "clock"
   [Jac88] is preserved, the TCP sender can continue to transmit new
   segments (although transmission must continue using a reduced cwnd,
   since loss is an indication of congestion).

   The fast retransmit and fast recovery algorithms are implemented
   together as follows.

   1.  On the first and second duplicate ACKs received at a sender, a
       TCP SHOULD send a segment of previously unsent data per [RFC3042]
       provided that the receiver's advertised window allows, the total
       FlightSize would remain less than or equal to cwnd plus 2*SMSS,
       and that new data is available for transmission.  Further, the
       TCP sender MUST NOT change cwnd to reflect these two segments
       [RFC3042].  Note that a sender using SACK [RFC2018] MUST NOT send
       new data unless the incoming duplicate acknowledgment contains
       new SACK information.

   2.  When the third duplicate ACK is received, a TCP MUST set ssthresh
       to no more than the value given in equation (4).  When [RFC3042]
       is in use, additional data sent in limited transmit MUST NOT be
       included in this calculation.

   3.  The lost segment starting at SND.UNA MUST be retransmitted and
       cwnd set to ssthresh plus 3*SMSS.  This artificially "inflates"
       the congestion window by the number of segments (three) that have
       left the network and which the receiver has buffered.

   4.  For each additional duplicate ACK received (after the third),
       cwnd MUST be incremented by SMSS.  This artificially inflates the
       congestion window in order to reflect the additional segment that
       has left the network.

       Note: [SCWA99] discusses a receiver-based attack whereby many
       bogus duplicate ACKs are sent to the data sender in order to
       artificially inflate cwnd and cause a higher than appropriate





Allman, et al.              Standards Track                     [Page 9]
^L
RFC 5681                 TCP Congestion Control           September 2009


       sending rate to be used.  A TCP MAY therefore limit the number of
       times cwnd is artificially inflated during loss recovery to the
       number of outstanding segments (or, an approximation thereof).

       Note: When an advanced loss recovery mechanism (such as outlined
       in section 4.3) is not in use, this increase in FlightSize can
       cause equation (4) to slightly inflate cwnd and ssthresh, as some
       of the segments between SND.UNA and SND.NXT are assumed to have
       left the network but are still reflected in FlightSize.

   5.  When previously unsent data is available and the new value of
       cwnd and the receiver's advertised window allow, a TCP SHOULD
       send 1*SMSS bytes of previously unsent data.

   6.  When the next ACK arrives that acknowledges previously
       unacknowledged data, a TCP MUST set cwnd to ssthresh (the value
       set in step 2).  This is termed "deflating" the window.

       This ACK should be the acknowledgment elicited by the
       retransmission from step 3, one RTT after the retransmission
       (though it may arrive sooner in the presence of significant out-
       of-order delivery of data segments at the receiver).
       Additionally, this ACK should acknowledge all the intermediate
       segments sent between the lost segment and the receipt of the
       third duplicate ACK, if none of these were lost.

   Note: This algorithm is known to generally not recover efficiently
   from multiple losses in a single flight of packets [FF96].  Section
   4.3 below addresses such cases.

4.  Additional Considerations

4.1.  Restarting Idle Connections

   A known problem with the TCP congestion control algorithms described
   above is that they allow a potentially inappropriate burst of traffic
   to be transmitted after TCP has been idle for a relatively long
   period of time.  After an idle period, TCP cannot use the ACK clock
   to strobe new segments into the network, as all the ACKs have drained
   from the network.  Therefore, as specified above, TCP can potentially
   send a cwnd-size line-rate burst into the network after an idle
   period.  In addition, changing network conditions may have rendered
   TCP's notion of the available end-to-end network capacity between two
   endpoints, as estimated by cwnd, inaccurate during the course of a
   long idle period.






Allman, et al.              Standards Track                    [Page 10]
^L
RFC 5681                 TCP Congestion Control           September 2009


   [Jac88] recommends that a TCP use slow start to restart transmission
   after a relatively long idle period.  Slow start serves to restart
   the ACK clock, just as it does at the beginning of a transfer.  This
   mechanism has been widely deployed in the following manner.  When TCP
   has not received a segment for more than one retransmission timeout,
   cwnd is reduced to the value of the restart window (RW) before
   transmission begins.

   For the purposes of this standard, we define RW = min(IW,cwnd).

   Using the last time a segment was received to determine whether or
   not to decrease cwnd can fail to deflate cwnd in the common case of
   persistent HTTP connections [HTH98].  In this case, a Web server
   receives a request before transmitting data to the Web client.  The
   reception of the request makes the test for an idle connection fail,
   and allows the TCP to begin transmission with a possibly
   inappropriately large cwnd.

   Therefore, a TCP SHOULD set cwnd to no more than RW before beginning
   transmission if the TCP has not sent data in an interval exceeding
   the retransmission timeout.

4.2.  Generating Acknowledgments

   The delayed ACK algorithm specified in [RFC1122] SHOULD be used by a
   TCP receiver.  When using delayed ACKs, a TCP receiver MUST NOT
   excessively delay acknowledgments.  Specifically, an ACK SHOULD be
   generated for at least every second full-sized segment, and MUST be
   generated within 500 ms of the arrival of the first unacknowledged
   packet.

   The requirement that an ACK "SHOULD" be generated for at least every
   second full-sized segment is listed in [RFC1122] in one place as a
   SHOULD and another as a MUST.  Here we unambiguously state it is a
   SHOULD.  We also emphasize that this is a SHOULD, meaning that an
   implementor should indeed only deviate from this requirement after
   careful consideration of the implications.  See the discussion of
   "Stretch ACK violation" in [RFC2525] and the references therein for a
   discussion of the possible performance problems with generating ACKs
   less frequently than every second full-sized segment.

   In some cases, the sender and receiver may not agree on what
   constitutes a full-sized segment.  An implementation is deemed to
   comply with this requirement if it sends at least one acknowledgment
   every time it receives 2*RMSS bytes of new data from the sender,
   where RMSS is the Maximum Segment Size specified by the receiver to
   the sender (or the default value of 536 bytes, per [RFC1122], if the
   receiver does not specify an MSS option during connection



Allman, et al.              Standards Track                    [Page 11]
^L
RFC 5681                 TCP Congestion Control           September 2009


   establishment).  The sender may be forced to use a segment size less
   than RMSS due to the maximum transmission unit (MTU), the path MTU
   discovery algorithm or other factors.  For instance, consider the
   case when the receiver announces an RMSS of X bytes but the sender
   ends up using a segment size of Y bytes (Y < X) due to path MTU
   discovery (or the sender's MTU size).  The receiver will generate
   stretch ACKs if it waits for 2*X bytes to arrive before an ACK is
   sent.  Clearly this will take more than 2 segments of size Y bytes.
   Therefore, while a specific algorithm is not defined, it is desirable
   for receivers to attempt to prevent this situation, for example, by
   acknowledging at least every second segment, regardless of size.
   Finally, we repeat that an ACK MUST NOT be delayed for more than 500
   ms waiting on a second full-sized segment to arrive.

   Out-of-order data segments SHOULD be acknowledged immediately, in
   order to accelerate loss recovery.  To trigger the fast retransmit
   algorithm, the receiver SHOULD send an immediate duplicate ACK when
   it receives a data segment above a gap in the sequence space.  To
   provide feedback to senders recovering from losses, the receiver
   SHOULD send an immediate ACK when it receives a data segment that
   fills in all or part of a gap in the sequence space.

   A TCP receiver MUST NOT generate more than one ACK for every incoming
   segment, other than to update the offered window as the receiving
   application consumes new data (see [RFC813] and page 42 of [RFC793]).

4.3.  Loss Recovery Mechanisms

   A number of loss recovery algorithms that augment fast retransmit and
   fast recovery have been suggested by TCP researchers and specified in
   the RFC series.  While some of these algorithms are based on the TCP
   selective acknowledgment (SACK) option [RFC2018], such as [FF96],
   [MM96a], [MM96b], and [RFC3517], others do not require SACKs, such as
   [Hoe96], [FF96], and [RFC3782].  The non-SACK algorithms use "partial
   acknowledgments" (ACKs that cover previously unacknowledged data, but
   not all the data outstanding when loss was detected) to trigger
   retransmissions.  While this document does not standardize any of the
   specific algorithms that may improve fast retransmit/fast recovery,
   these enhanced algorithms are implicitly allowed, as long as they
   follow the general principles of the basic four algorithms outlined
   above.

   That is, when the first loss in a window of data is detected,
   ssthresh MUST be set to no more than the value given by equation (4).
   Second, until all lost segments in the window of data in question are
   repaired, the number of segments transmitted in each RTT MUST be no
   more than half the number of outstanding segments when the loss was
   detected.  Finally, after all loss in the given window of segments



Allman, et al.              Standards Track                    [Page 12]
^L
RFC 5681                 TCP Congestion Control           September 2009


   has been successfully retransmitted, cwnd MUST be set to no more than
   ssthresh and congestion avoidance MUST be used to further increase
   cwnd.  Loss in two successive windows of data, or the loss of a
   retransmission, should be taken as two indications of congestion and,
   therefore, cwnd (and ssthresh) MUST be lowered twice in this case.

   We RECOMMEND that TCP implementors employ some form of advanced loss
   recovery that can cope with multiple losses in a window of data.  The
   algorithms detailed in [RFC3782] and [RFC3517] conform to the general
   principles outlined above.  We note that while these are not the only
   two algorithms that conform to the above general principles these two
   algorithms have been vetted by the community and are currently on the
   Standards Track.

5.  Security Considerations

   This document requires a TCP to diminish its sending rate in the
   presence of retransmission timeouts and the arrival of duplicate
   acknowledgments.  An attacker can therefore impair the performance of
   a TCP connection by either causing data packets or their
   acknowledgments to be lost, or by forging excessive duplicate
   acknowledgments.

   In response to the ACK division attack outlined in [SCWA99], this
   document RECOMMENDS increasing the congestion window based on the
   number of bytes newly acknowledged in each arriving ACK rather than
   by a particular constant on each arriving ACK (as outlined in section
   3.1).

   The Internet, to a considerable degree, relies on the correct
   implementation of these algorithms in order to preserve network
   stability and avoid congestion collapse.  An attacker could cause TCP
   endpoints to respond more aggressively in the face of congestion by
   forging excessive duplicate acknowledgments or excessive
   acknowledgments for new data.  Conceivably, such an attack could
   drive a portion of the network into congestion collapse.

6.  Changes between RFC 2001 and RFC 2581

   [RFC2001] was extensively rewritten editorially and it is not
   feasible to itemize the list of changes between [RFC2001] and
   [RFC2581].  The intention of [RFC2581] was to not change any of the
   recommendations given in [RFC2001], but to further clarify cases that
   were not discussed in detail in [RFC2001].  Specifically, [RFC2581]
   suggested what TCP connections should do after a relatively long idle
   period, as well as specified and clarified some of the issues





Allman, et al.              Standards Track                    [Page 13]
^L
RFC 5681                 TCP Congestion Control           September 2009


   pertaining to TCP ACK generation.  Finally, the allowable upper bound
   for the initial congestion window was raised from one to two
   segments.

7.  Changes Relative to RFC 2581

   A specific definition for "duplicate acknowledgment" has been added,
   based on the definition used by BSD TCP.

   The document now notes that what to do with duplicate ACKs after the
   retransmission timer has fired is future work and explicitly
   unspecified in this document.

   The initial window requirements were changed to allow Larger Initial
   Windows as standardized in [RFC3390].  Additionally, the steps to
   take when an initial window is discovered to be too large due to Path
   MTU Discovery [RFC1191] are detailed.

   The recommended initial value for ssthresh has been changed to say
   that it SHOULD be arbitrarily high, where it was previously MAY.
   This is to provide additional guidance to implementors on the matter.

   During slow start, the usage of Appropriate Byte Counting [RFC3465]
   with L=1*SMSS is explicitly recommended.  The method of increasing
   cwnd given in [RFC2581] is still explicitly allowed.  Byte counting
   during congestion avoidance is also recommended, while the method
   from [RFC2581] and other safe methods are still allowed.

   The treatment of ssthresh on retransmission timeout was clarified.
   In particular, ssthresh must be set to half the FlightSize on the
   first retransmission of a given segment and then is held constant on
   subsequent retransmissions of the same segment.

   The description of fast retransmit and fast recovery has been
   clarified, and the use of Limited Transmit [RFC3042] is now
   recommended.

   TCPs now MAY limit the number of duplicate ACKs that artificially
   inflate cwnd during loss recovery to the number of segments
   outstanding to avoid the duplicate ACK spoofing attack described in
   [SCWA99].

   The restart window has been changed to min(IW,cwnd) from IW.  This
   behavior was described as "experimental" in [RFC2581].

   It is now recommended that TCP implementors implement an advanced
   loss recovery algorithm conforming to the principles outlined in this
   document.



Allman, et al.              Standards Track                    [Page 14]
^L
RFC 5681                 TCP Congestion Control           September 2009


   The security considerations have been updated to discuss ACK division
   and recommend byte counting as a counter to this attack.

8.  Acknowledgments

   The core algorithms we describe were developed by Van Jacobson
   [Jac88, Jac90].  In addition, Limited Transmit [RFC3042] was
   developed in conjunction with Hari Balakrishnan and Sally Floyd.  The
   initial congestion window size specified in this document is a result
   of work with Sally Floyd and Craig Partridge [RFC2414, RFC3390].

   W. Richard ("Rich") Stevens wrote the first version of this document
   [RFC2001] and co-authored the second version [RFC2581].  This present
   version much benefits from his clarity and thoughtfulness of
   description, and we are grateful for Rich's contributions in
   elucidating TCP congestion control, as well as in more broadly
   helping us understand numerous issues relating to networking.

   We wish to emphasize that the shortcomings and mistakes of this
   document are solely the responsibility of the current authors.

   Some of the text from this document is taken from "TCP/IP
   Illustrated, Volume 1: The Protocols" by W. Richard Stevens
   (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
   Implementation" by Gary R. Wright and W. Richard Stevens (Addison-
   Wesley, 1995).  This material is used with the permission of
   Addison-Wesley.

   Anil Agarwal, Steve Arden, Neal Cardwell, Noritoshi Demizu, Gorry
   Fairhurst, Kevin Fall, John Heffner, Alfred Hoenes, Sally Floyd,
   Reiner Ludwig, Matt Mathis, Craig Partridge, and Joe Touch
   contributed a number of helpful suggestions.

9.  References

9.1.  Normative References

   [RFC793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
             793, September 1981.

   [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts -
             Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
             November 1990.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.



Allman, et al.              Standards Track                    [Page 15]
^L
RFC 5681                 TCP Congestion Control           September 2009


9.2.  Informative References

   [CJ89]    Chiu, D. and R. Jain, "Analysis of the Increase/Decrease
             Algorithms for Congestion Avoidance in Computer Networks",
             Journal of Computer Networks and ISDN Systems, vol. 17, no.
             1, pp. 1-14, June 1989.

   [FF96]    Fall, K. and S. Floyd, "Simulation-based Comparisons of
             Tahoe, Reno and SACK TCP", Computer Communication Review,
             July 1996, ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

   [Hoe96]   Hoe, J., "Improving the Start-up Behavior of a Congestion
             Control Scheme for TCP", In ACM SIGCOMM, August 1996.

   [HTH98]   Hughes, A., Touch, J., and J. Heidemann, "Issues in TCP
             Slow-Start Restart After Idle", Work in Progress, March
             1998.

   [Jac88]   Jacobson, V., "Congestion Avoidance and Control", Computer
             Communication Review, vol. 18, no. 4, pp. 314-329, Aug.
             1988.  ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

   [Jac90]   Jacobson, V., "Modified TCP Congestion Avoidance
             Algorithm", end2end-interest mailing list, April 30, 1990.
             ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

   [MM96a]   Mathis, M. and J. Mahdavi, "Forward Acknowledgment:
             Refining TCP Congestion Control", Proceedings of
             SIGCOMM'96, August, 1996, Stanford, CA.  Available from
             http://www.psc.edu/networking/papers/papers.html

   [MM96b]   Mathis, M. and J. Mahdavi, "TCP Rate-Halving with Bounding
             Parameters", Technical report.  Available from
             http://www.psc.edu/networking/papers/FACKnotes/current.

   [Pax97]   Paxson, V., "End-to-End Internet Packet Dynamics",
             Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.

   [RFC813]  Clark, D., "Window and Acknowledgement Strategy in TCP",
             RFC 813, July 1982.

   [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
             Retransmit, and Fast Recovery Algorithms", RFC 2001,
             January 1997.

   [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
             Selective Acknowledgment Options", RFC 2018, October 1996.




Allman, et al.              Standards Track                    [Page 16]
^L
RFC 5681                 TCP Congestion Control           September 2009


   [RFC2414] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
             Initial Window", RFC 2414, September 1998.

   [RFC2525] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J.,
             Heavens, I., Lahey, K., Semke, J., and B. Volz, "Known TCP
             Implementation Problems", RFC 2525, March 1999.

   [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
             Control", RFC 2581, April 1999.

   [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
             Extension to the Selective Acknowledgement (SACK) Option
             for TCP", RFC 2883, July 2000.

   [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission
             Timer", RFC 2988, November 2000.

   [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
             TCP's Loss Recovery Using Limited Transmit", RFC 3042,
             January 2001.

   [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
             Explicit Congestion Notification (ECN) to IP", RFC 3168,
             September 2001.

   [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
             Initial Window", RFC 3390, October 2002.

   [RFC3465] Allman, M., "TCP Congestion Control with Appropriate Byte
             Counting (ABC)", RFC 3465, February 2003.

   [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A
             Conservative Selective Acknowledgment (SACK)-based Loss
             Recovery Algorithm for TCP", RFC 3517, April 2003.

   [RFC3782] Floyd, S., Henderson, T., and A. Gurtov, "The NewReno
             Modification to TCP's Fast Recovery Algorithm", RFC 3782,
             April 2004.

   [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
             Discovery", RFC 4821, March 2007.

   [SCWA99]  Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
             "TCP Congestion Control With a Misbehaving Receiver", ACM
             Computer Communication Review, 29(5), October 1999.

   [Ste94]   Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
             Addison-Wesley, 1994.



Allman, et al.              Standards Track                    [Page 17]
^L
RFC 5681                 TCP Congestion Control           September 2009


   [WS95]    Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
             The Implementation", Addison-Wesley, 1995.

Authors' Addresses

   Mark Allman
   International Computer Science Institute (ICSI)
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 440 235 1792
   EMail: mallman@icir.org
   http://www.icir.org/mallman/


   Vern Paxson
   International Computer Science Institute (ICSI)
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198
   Phone: +1 510/642-4274 x302
   EMail: vern@icir.org
   http://www.icir.org/vern/


   Ethan Blanton
   Purdue University Computer Sciences
   305 North University Street
   West Lafayette, IN  47907
   EMail: eblanton@cs.purdue.edu
   http://www.cs.purdue.edu/homes/eblanton/




















Allman, et al.              Standards Track                    [Page 18]
^L