doc/rfc/rfc9040.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573

Internet Engineering Task Force (IETF)                          J. Touch
Request for Comments: 9040                                   Independent
Obsoletes: 2140                                                 M. Welzl
Category: Informational                                         S. Islam
ISSN: 2070-1721                                       University of Oslo
                                                               July 2021


                   TCP Control Block Interdependence

Abstract

   This memo provides guidance to TCP implementers that is intended to
   help improve connection convergence to steady-state operation without
   affecting interoperability.  It updates and replaces RFC 2140's
   description of sharing TCP state, as typically represented in TCP
   Control Blocks, among similar concurrent or consecutive connections.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Not all documents
   approved by the IESG are candidates for any level of Internet
   Standard; see Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc9040.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions Used in This Document
   3.  Terminology
   4.  The TCP Control Block (TCB)
   5.  TCB Interdependence
   6.  Temporal Sharing
     6.1.  Initialization of a New TCB
     6.2.  Updates to the TCB Cache
     6.3.  Discussion
   7.  Ensemble Sharing
     7.1.  Initialization of a New TCB
     7.2.  Updates to the TCB Cache
     7.3.  Discussion
   8.  Issues with TCB Information Sharing
     8.1.  Traversing the Same Network Path
     8.2.  State Dependence
     8.3.  Problems with Sharing Based on IP Address
   9.  Implications
     9.1.  Layering
     9.2.  Other Possibilities
   10. Implementation Observations
   11. Changes Compared to RFC 2140
   12. Security Considerations
   13. IANA Considerations
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  TCB Sharing History
   Appendix B.  TCP Option Sharing and Caching
   Appendix C.  Automating the Initial Window in TCP over Long
           Timescales
     C.1.  Introduction
     C.2.  Design Considerations
     C.3.  Proposed IW Algorithm
     C.4.  Discussion
     C.5.  Observations
   Acknowledgments
   Authors' Addresses

1.  Introduction

   TCP is a connection-oriented reliable transport protocol layered over
   IP [RFC0793].  Each TCP connection maintains state, usually in a data
   structure called the "TCP Control Block (TCB)".  The TCB contains
   information about the connection state, its associated local process,
   and feedback parameters about the connection's transmission
   properties.  As originally specified and usually implemented, most
   TCB information is maintained on a per-connection basis.  Some
   implementations share certain TCB information across connections to
   the same host [RFC2140].  Such sharing is intended to lead to better
   overall transient performance, especially for numerous short-lived
   and simultaneous connections, as can be used in the World Wide Web
   and other applications [Be94] [Br02].  This sharing of state is
   intended to help TCP connections converge to long-term behavior
   (assuming stable application load, i.e., so-called "steady-state")
   more quickly without affecting TCP interoperability.

   This document updates RFC 2140's discussion of TCB state sharing and
   provides a complete replacement for that document.  This state
   sharing affects only TCB initialization [RFC2140] and thus has no
   effect on the long-term behavior of TCP after a connection has been
   established or on interoperability.  Path information shared across
   SYN destination port numbers assumes that TCP segments having the
   same host-pair experience the same path properties, i.e., that
   traffic is not routed differently based on port numbers or other
   connection parameters (also addressed further in Section 8.1).  The
   observations about TCB sharing in this document apply similarly to
   any protocol with congestion state, including the Stream Control
   Transmission Protocol (SCTP) [RFC4960] and the Datagram Congestion
   Control Protocol (DCCP) [RFC4340], as well as to individual subflows
   in Multipath TCP [RFC8684].

2.  Conventions Used in This Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The core of this document describes behavior that is already
   permitted by TCP standards.  As a result, this document provides
   informative guidance but does not use normative language except when
   quoting other documents.  Normative language is used in Appendix C as
   examples of requirements for future consideration.

3.  Terminology

   The following terminology is used frequently in this document.  Items
   preceded with a "+" may be part of the state maintained as TCP
   connection state in the TCB of associated connections and are the
   focus of sharing as described in this document.  Note that terms are
   used as originally introduced where possible; in some cases,
   direction is indicated with a suffix (_S for send, _R for receive)
   and in other cases spelled out (sendcwnd).

   +cwnd:  TCP congestion window size [RFC5681]

   host:  a source or sink of TCP segments associated with a single IP
         address

   host-pair:  a pair of hosts and their corresponding IP addresses

   ISN:  Initial Sequence Number

   +MMS_R:  maximum message size that can be received, the largest
         received transport payload of an IP datagram [RFC1122]

   +MMS_S:  maximum message size that can be sent, the largest
         transmitted transport payload of an IP datagram [RFC1122]

   path:  an Internet path between the IP addresses of two hosts

   PCB:  protocol control block, the data associated with a protocol as
         maintained by an endpoint; a TCP PCB is called a "TCB"

   PLPMTUD:  packetization-layer path MTU discovery, a mechanism that
         uses transport packets to discover the Path Maximum
         Transmission Unit (PMTU) [RFC4821]

   +PMTU:  largest IP datagram that can traverse a path [RFC1191]
         [RFC8201]

   PMTUD:  path-layer MTU discovery, a mechanism that relies on ICMP
         error messages to discover the PMTU [RFC1191] [RFC8201]

   +RTT:  round-trip time of a TCP packet exchange [RFC0793]

   +RTTVAR:  variation of round-trip times of a TCP packet exchange
         [RFC6298]

   +rwnd:  TCP receive window size [RFC5681]

   +sendcwnd:  TCP send-side congestion window (cwnd) size [RFC5681]

   +sendMSS:  TCP maximum segment size, a value transmitted in a TCP
         option that represents the largest TCP user data payload that
         can be received [RFC6691]

   +ssthresh:  TCP slow-start threshold [RFC5681]

   TCB:  TCP Control Block, the data associated with a TCP connection as
         maintained by an endpoint

   TCP-AO:  TCP Authentication Option [RFC5925]

   TFO:  TCP Fast Open option [RFC7413]

   +TFO_cookie:  TCP Fast Open cookie, state that is used as part of the
         TFO mechanism, when TFO is supported [RFC7413]

   +TFO_failure:  an indication of when TFO option negotiation failed,
         when TFO is supported

   +TFOinfo:  information cached when a TFO connection is established,
         which includes the TFO_cookie [RFC7413]

4.  The TCP Control Block (TCB)

   A TCB describes the data associated with each connection, i.e., with
   each association of a pair of applications across the network.  The
   TCB contains at least the following information [RFC0793]:

      Local process state

         pointers to send and receive buffers
         pointers to retransmission queue and current segment
         pointers to Internet Protocol (IP) PCB

      Per-connection shared state

         macro-state
            connection state
            timers
            flags
            local and remote host numbers and ports
            TCP option state
         micro-state
            send and receive window state (size*, current number)
            congestion window size (sendcwnd)*
            congestion window size threshold (ssthresh)*
            max window size seen*
            sendMSS#
            MMS_S#
            MMS_R#
            PMTU#
            round-trip time and its variation#

   The per-connection information is shown as split into macro-state and
   micro-state, terminology borrowed from [Co91].  Macro-state describes
   the protocol for establishing the initial shared state about the
   connection; we include the endpoint numbers and components (timers,
   flags) required upon commencement that are later used to help
   maintain that state.  Micro-state describes the protocol after a
   connection has been established, to maintain the reliability and
   congestion control of the data transferred in the connection.

   We distinguish two other classes of shared micro-state that are
   associated more with host-pairs than with application pairs.  One
   class is clearly host-pair dependent (shown above as "#", e.g.,
   sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are
   defined by the endpoint or endpoint pair (of the given example:
   sendMSS, MMS_R, MMS_S, RTT) or are already cached and shared on that
   basis (of the given example: PMTU [RFC1191] [RFC4821]).  The other is
   host-pair dependent in its aggregate (shown above as "*", e.g.,
   congestion window information, current window sizes, etc.) because
   they depend on the total capacity between the two endpoints.

   Not all of the TCB state is necessarily shareable.  In particular,
   some TCP options are negotiated only upon request by the application
   layer, so their use may not be correlated across connections.  Other
   options negotiate connection-specific parameters, which are similarly
   not shareable.  These are discussed further in Appendix B.

   Finally, we exclude rwnd from further discussion because its value
   should depend on the send window size, so it is already addressed by
   send window sharing and is not independently affected by sharing.

5.  TCB Interdependence

   There are two cases of TCB interdependence.  Temporal sharing occurs
   when the TCB of an earlier (now CLOSED) connection to a host is used
   to initialize some parameters of a new connection to that same host,
   i.e., in sequence.  Ensemble sharing occurs when a currently active
   connection to a host is used to initialize another (concurrent)
   connection to that host.

6.  Temporal Sharing

   The TCB data cache is accessed in two ways: it is read to initialize
   new TCBs and written when more current per-host state is available.

6.1.  Initialization of a New TCB

   TCBs for new connections can be initialized using cached context from
   past connections as follows:

              +==============+=============================+
              | Cached TCB   | New TCB                     |
              +==============+=============================+
              | old_MMS_S    | old_MMS_S or not cached (2) |
              +--------------+-----------------------------+
              | old_MMS_R    | old_MMS_R or not cached (2) |
              +--------------+-----------------------------+
              | old_sendMSS  | old_sendMSS                 |
              +--------------+-----------------------------+
              | old_PMTU     | old_PMTU (1)                |
              +--------------+-----------------------------+
              | old_RTT      | old_RTT                     |
              +--------------+-----------------------------+
              | old_RTTVAR   | old_RTTVAR                  |
              +--------------+-----------------------------+
              | old_option   | (option specific)           |
              +--------------+-----------------------------+
              | old_ssthresh | old_ssthresh                |
              +--------------+-----------------------------+
              | old_sendcwnd | old_sendcwnd                |
              +--------------+-----------------------------+

              Table 1: Temporal Sharing - TCB Initialization

   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

   (2)  Note that some values are not cached when they are computed
      locally (MMS_R) or indicated in the connection itself (MMS_S in
      the SYN).

   Table 2 gives an overview of option-specific information that can be
   shared.  Additional information on some specific TCP options and
   sharing is provided in Appendix B.

                   +=================+=================+
                   | Cached          | New             |
                   +=================+=================+
                   | old_TFO_cookie  | old_TFO_cookie  |
                   +-----------------+-----------------+
                   | old_TFO_failure | old_TFO_failure |
                   +-----------------+-----------------+

                        Table 2: Temporal Sharing -
                         Option Info Initialization

6.2.  Updates to the TCB Cache

   During a connection, the TCB cache can be updated based on events of
   current connections and their TCBs as they progress over time, as
   shown in Table 3.

     +==============+===============+=============+=================+
     | Cached TCB   | Current TCB   | When?       | New Cached TCB  |
     +==============+===============+=============+=================+
     | old_MMS_S    | curr_MMS_S    | OPEN        | curr_MMS_S      |
     +--------------+---------------+-------------+-----------------+
     | old_MMS_R    | curr_MMS_R    | OPEN        | curr_MMS_R      |
     +--------------+---------------+-------------+-----------------+
     | old_sendMSS  | curr_sendMSS  | MSSopt      | curr_sendMSS    |
     +--------------+---------------+-------------+-----------------+
     | old_PMTU     | curr_PMTU     | PMTUD (1) / | curr_PMTU       |
     |              |               | PLPMTUD (1) |                 |
     +--------------+---------------+-------------+-----------------+
     | old_RTT      | curr_RTT      | CLOSE       | merge(curr,old) |
     +--------------+---------------+-------------+-----------------+
     | old_RTTVAR   | curr_RTTVAR   | CLOSE       | merge(curr,old) |
     +--------------+---------------+-------------+-----------------+
     | old_option   | curr_option   | ESTAB       | (depends on     |
     |              |               |             | option)         |
     +--------------+---------------+-------------+-----------------+
     | old_ssthresh | curr_ssthresh | CLOSE       | merge(curr,old) |
     +--------------+---------------+-------------+-----------------+
     | old_sendcwnd | curr_sendcwnd | CLOSE       | merge(curr,old) |
     +--------------+---------------+-------------+-----------------+

                Table 3: Temporal Sharing - Cache Updates

   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

   Merge() is the function that combines the current and previous (old)
   values and may vary for each parameter of the TCB cache.  The
   particular function is not specified in this document; examples
   include windowed averages (mean of the past N values, for some N) and
   exponential decay (new = (1-alpha)*old + alpha *new, where alpha is
   in the range [0..1]).

   Table 4 gives an overview of option-specific information that can be
   similarly shared.  The TFO cookie is maintained until the client
   explicitly requests it be updated as a separate event.

      +=================+=================+=======+=================+
      | Cached          | Current         | When? | New Cached      |
      +=================+=================+=======+=================+
      | old_TFO_cookie  | old_TFO_cookie  | ESTAB | old_TFO_cookie  |
      +-----------------+-----------------+-------+-----------------+
      | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure |
      +-----------------+-----------------+-------+-----------------+

              Table 4: Temporal Sharing - Option Info Updates

6.3.  Discussion

   As noted, there is no particular benefit to caching MMS_S and MMS_R
   as these are reported by the local IP stack.  Caching sendMSS and
   PMTU is trivial; reported values are cached (PMTU at the IP layer),
   and the most recent values are used.  The cache is updated when the
   MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4
   Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is
   received [RFC8201] or the equivalent is inferred, e.g., as from
   PLPMTUD [RFC4821]), respectively, so the cache always has the most
   recent values from any connection.  For sendMSS, the cache is
   consulted only at connection establishment and not otherwise updated,
   which means that MSS options do not affect current connections.  The
   default sendMSS is never saved; only reported MSS values update the
   cache, so an explicit override is required to reduce the sendMSS.
   Cached sendMSS affects only data sent in the SYN segment, i.e.,
   during client connection initiation or during simultaneous open; the
   MSS of all other segments are constrained by the value updated as
   included in the SYN.

   RTT values are updated by formulae that merge the old and new values,
   as noted in Section 6.2.  Dynamic RTT estimation requires a sequence
   of RTT measurements.  As a result, the cached RTT (and its variation)
   is an average of its previous value with the contents of the
   currently active TCB for that host, when a TCB is closed.  RTT values
   are updated only when a connection is closed.  The method for merging
   old and current values needs to attempt to reduce the transient
   effects of the new connections.

   The updates for RTT, RTTVAR, and ssthresh rely on existing
   information, i.e., old values.  Should no such values exist, the
   current values are cached instead.

   TCP options are copied or merged depending on the details of each
   option.  For example, TFO state is updated when a connection is
   established and read before establishing a new connection.

   Sections 8 and 9 discuss compatibility issues and implications of
   sharing the specific information listed above.  Section 10 gives an
   overview of known implementations.

   Most cached TCB values are updated when a connection closes.  The
   exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122];
   PMTU, which is updated after Path MTU Discovery and also reported by
   IP [RFC1191] [RFC4821] [RFC8201]; and sendMSS, which is updated if
   the MSS option is received in the TCP SYN header.

   Sharing sendMSS information affects only data in the SYN of the next
   connection, because sendMSS information is typically included in most
   TCP SYN segments.  Caching PMTU can accelerate the efficiency of
   PMTUD but can also result in black-holing until corrected if in
   error.  Caching MMS_R and MMS_S may be of little direct value as they
   are reported by the local IP stack anyway.

   The way in which state related to other TCP options can be shared
   depends on the details of that option.  For example, TFO state
   includes the TCP Fast Open cookie [RFC7413] or, in case TFO fails, a
   negative TCP Fast Open response.  RFC 7413 states,

   |  The client MUST cache negative responses from the server in order
   |  to avoid potential connection failures.  Negative responses
   |  include the server not acknowledging the data in the SYN, ICMP
   |  error messages, and (most importantly) no response (SYN-ACK) from
   |  the server at all, i.e., connection timeout.

   TFOinfo is cached when a connection is established.

   State related to other TCP options might not be as readily cached.
   For example, TCP-AO [RFC5925] success or failure between a host-pair
   for a single SYN destination port might be usefully cached.  TCP-AO
   success or failure to other SYN destination ports on that host-pair
   is never useful to cache because TCP-AO security parameters can vary
   per service.

7.  Ensemble Sharing

   Sharing cached TCB data across concurrent connections requires
   attention to the aggregate nature of some of the shared state.  For
   example, although MSS and RTT values can be shared by copying, it may
   not be appropriate to simply copy congestion window or ssthresh
   information; instead, the new values can be a function (f) of the
   cumulative values and the number of connections (N).

7.1.  Initialization of a New TCB

   TCBs for new connections can be initialized using cached context from
   concurrent connections as follows:

              +===================+=========================+
              | Cached TCB        | New TCB                 |
              +===================+=========================+
              | old_MMS_S         | old_MMS_S               |
              +-------------------+-------------------------+
              | old_MMS_R         | old_MMS_R               |
              +-------------------+-------------------------+
              | old_sendMSS       | old_sendMSS             |
              +-------------------+-------------------------+
              | old_PMTU          | old_PMTU (1)            |
              +-------------------+-------------------------+
              | old_RTT           | old_RTT                 |
              +-------------------+-------------------------+
              | old_RTTVAR        | old_RTTVAR              |
              +-------------------+-------------------------+
              | sum(old_ssthresh) | f(sum(old_ssthresh), N) |
              +-------------------+-------------------------+
              | sum(old_sendcwnd) | f(sum(old_sendcwnd), N) |
              +-------------------+-------------------------+
              | old_option        | (option specific)       |
              +-------------------+-------------------------+

               Table 5: Ensemble Sharing - TCB Initialization

   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

   In Table 5, the cached sum() is a total across all active connections
   because these parameters act in aggregate; similarly, f() is a
   function that updates that sum based on the new connection's values,
   represented as "N".

   Table 6 gives an overview of option-specific information that can be
   similarly shared.  Again, the TFO_cookie is updated upon explicit
   client request, which is a separate event.

                   +=================+=================+
                   | Cached          | New             |
                   +=================+=================+
                   | old_TFO_cookie  | old_TFO_cookie  |
                   +-----------------+-----------------+
                   | old_TFO_failure | old_TFO_failure |
                   +-----------------+-----------------+

                        Table 6: Ensemble Sharing -
                         Option Info Initialization

7.2.  Updates to the TCB Cache

   During a connection, the TCB cache can be updated based on changes to
   concurrent connections and their TCBs, as shown below:

      +==============+===============+===========+=================+
      | Cached TCB   | Current TCB   | When?     | New Cached TCB  |
      +==============+===============+===========+=================+
      | old_MMS_S    | curr_MMS_S    | OPEN      | curr_MMS_S      |
      +--------------+---------------+-----------+-----------------+
      | old_MMS_R    | curr_MMS_R    | OPEN      | curr_MMS_R      |
      +--------------+---------------+-----------+-----------------+
      | old_sendMSS  | curr_sendMSS  | MSSopt    | curr_sendMSS    |
      +--------------+---------------+-----------+-----------------+
      | old_PMTU     | curr_PMTU     | PMTUD+ /  | curr_PMTU       |
      |              |               | PLPMTUD+  |                 |
      +--------------+---------------+-----------+-----------------+
      | old_RTT      | curr_RTT      | update    | rtt_update(old, |
      |              |               |           | curr)           |
      +--------------+---------------+-----------+-----------------+
      | old_RTTVAR   | curr_RTTVAR   | update    | rtt_update(old, |
      |              |               |           | curr)           |
      +--------------+---------------+-----------+-----------------+
      | old_ssthresh | curr_ssthresh | update    | adjust sum as   |
      |              |               |           | appropriate     |
      +--------------+---------------+-----------+-----------------+
      | old_sendcwnd | curr_sendcwnd | update    | adjust sum as   |
      |              |               |           | appropriate     |
      +--------------+---------------+-----------+-----------------+
      | old_option   | curr_option   | (depends) | (option         |
      |              |               |           | specific)       |
      +--------------+---------------+-----------+-----------------+

                Table 7: Ensemble Sharing - Cache Updates

   +  Note that the PMTU is cached at the IP layer [RFC1191] [RFC4821].

   In Table 7, rtt_update() is the function used to combine old and
   current values, e.g., as a windowed average or exponentially decayed
   average.

   Table 8 gives an overview of option-specific information that can be
   similarly shared.

      +=================+=================+=======+=================+
      | Cached          | Current         | When? | New Cached      |
      +=================+=================+=======+=================+
      | old_TFO_cookie  | old_TFO_cookie  | ESTAB | old_TFO_cookie  |
      +-----------------+-----------------+-------+-----------------+
      | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure |
      +-----------------+-----------------+-------+-----------------+

              Table 8: Ensemble Sharing - Option Info Updates

7.3.  Discussion

   For ensemble sharing, TCB information should be cached as early as
   possible, sometimes before a connection is closed.  Otherwise,
   opening multiple concurrent connections may not result in TCB data
   sharing if no connection closes before others open.  The amount of
   work involved in updating the aggregate average should be minimized,
   but the resulting value should be equivalent to having all values
   measured within a single connection.  The function "rtt_update" in
   Table 7 indicates this operation, which occurs whenever the RTT would
   have been updated in the individual TCP connection.  As a result, the
   cache contains the shared RTT variables, which no longer need to
   reside in the TCB.

   Congestion window size and ssthresh aggregation are more complicated
   in the concurrent case.  When there is an ensemble of connections, we
   need to decide how that ensemble would have shared these variables,
   in order to derive initial values for new TCBs.

   Sections 8 and 9 discuss compatibility issues and implications of
   sharing the specific information listed above.

   There are several ways to initialize the congestion window in a new
   TCB among an ensemble of current connections to a host.  Current TCP
   implementations initialize it to 4 segments as standard [RFC3390] and
   10 segments experimentally [RFC6928].  These approaches assume that
   new connections should behave as conservatively as possible.  The
   algorithm described in [Ba12] adjusts the initial cwnd depending on
   the cwnd values of ongoing connections.  It is also possible to use
   sharing mechanisms over long timescales to adapt TCP's initial window
   automatically, as described further in Appendix C.

8.  Issues with TCB Information Sharing

   Here, we discuss various types of problems that may arise with TCB
   information sharing.

   For the congestion and current window information, the initial values
   computed by TCB interdependence may not be consistent with the long-
   term aggregate behavior of a set of concurrent connections between
   the same endpoints.  Under conventional TCP congestion control, if
   the congestion window of a single existing connection has converged
   to 40 segments, two newly joining concurrent connections will assume
   initial windows of 10 segments [RFC6928] and the existing
   connection's window will not decrease to accommodate this additional
   load.  As a consequence, the three connections can mutually
   interfere.  One example of this is seen on low-bandwidth, high-delay
   links, where concurrent connections supporting Web traffic can
   collide because their initial windows were too large, even when set
   at 1 segment.

   The authors of [Hu12] recommend caching ssthresh for temporal sharing
   only when flows are long.  Some studies suggest that sharing ssthresh
   between short flows can deteriorate the performance of individual
   connections [Hu12] [Du16], although this may benefit aggregate
   network performance.

8.1.  Traversing the Same Network Path

   TCP is sometimes used in situations where packets of the same host-
   pair do not always take the same path, such as when connection-
   specific parameters are used for routing (e.g., for load balancing).
   Multipath routing that relies on examining transport headers, such as
   ECMP and Link Aggregation Group (LAG) [RFC7424], may not result in
   repeatable path selection when TCP segments are encapsulated,
   encrypted, or altered -- for example, in some Virtual Private Network
   (VPN) tunnels that rely on proprietary encapsulation.  Similarly,
   such approaches cannot operate deterministically when the TCP header
   is encrypted, e.g., when using IPsec Encapsulating Security Payload
   (ESP) (although TCB interdependence among the entire set sharing the
   same endpoint IP addresses should work without problems when the TCP
   header is encrypted).  Measures to increase the probability that
   connections use the same path could be applied; for example, the
   connections could be given the same IPv6 flow label [RFC6437].  TCB
   interdependence can also be extended to sets of host IP address pairs
   that share the same network path conditions, such as when a group of
   addresses is on the same LAN (see Section 9).

   Traversing the same path is not important for host-specific
   information (e.g., rwnd), TCP option state (e.g., TFOinfo), or for
   information that is already cached per-host (e.g., path MTU).  When
   TCB information is shared across different SYN destination ports,
   path-related information can be incorrect; however, the impact of
   this error is potentially diminished if (as discussed here) TCB
   sharing affects only the transient event of a connection start or if
   TCB information is shared only within connections to the same SYN
   destination port.

   In the case of temporal sharing, TCB information could also become
   invalid over time, i.e., indicating that although the path remains
   the same, path properties have changed.  Because this is similar to
   the case when a connection becomes idle, mechanisms that address idle
   TCP connections (e.g., [RFC7661]) could also be applied to TCB cache
   management, especially when TCP Fast Open is used [RFC7413].

8.2.  State Dependence

   There may be additional considerations to the way in which TCB
   interdependence rebalances congestion feedback among the current
   connections.  For example, it may be appropriate to consider the
   impact of a connection being in Fast Recovery [RFC5681] or some other
   similar unusual feedback state that could inhibit or affect the
   calculations described herein.

8.3.  Problems with Sharing Based on IP Address

   It can be wrong to share TCB information between TCP connections on
   the same host as identified by the IP address if an IP address is
   assigned to a new host (e.g., IP address spinning, as is used by ISPs
   to inhibit running servers).  It can be wrong if Network Address
   Translation (NAT) [RFC2663], Network Address and Port Translation
   (NAPT) [RFC2663], or any other IP sharing mechanism is used.  Such
   mechanisms are less likely to be used with IPv6.  Other methods to
   identify a host could also be considered to make correct TCB sharing
   more likely.  Moreover, some TCB information is about dominant path
   properties rather than the specific host.  IP addresses may differ,
   yet the relevant part of the path may be the same.

9.  Implications

   There are several implications to incorporating TCB interdependence
   in TCP implementations.  First, it may reduce the need for
   application-layer multiplexing for performance enhancement [RFC7231].
   Protocols like HTTP/2 [RFC7540] avoid connection re-establishment
   costs by serializing or multiplexing a set of per-host connections
   across a single TCP connection.  This avoids TCP's per-connection
   OPEN handshake and also avoids recomputing the MSS, RTT, and
   congestion window values.  By avoiding the so-called "slow-start
   restart", performance can be optimized [Hu01].  TCB interdependence
   can provide the "slow-start restart avoidance" of multiplexing,
   without requiring a multiplexing mechanism at the application layer.

   Like the initial version of this document [RFC2140], this update's
   approach to TCB interdependence focuses on sharing a set of TCBs by
   updating the TCB state to reduce the impact of transients when
   connections begin, end, or otherwise significantly change state.
   Other mechanisms have since been proposed to continuously share
   information between all ongoing communication (including
   connectionless protocols) and update the congestion state during any
   congestion-related event (e.g., timeout, loss confirmation, etc.)
   [RFC3124].  By dealing exclusively with transients, the approach in
   this document is more likely to exhibit the "steady-state" behavior
   as unmodified, independent TCP connections.

9.1.  Layering

   TCB interdependence pushes some of the TCP implementation from its
   typical placement solely within the transport layer (in the ISO
   model) to the network layer.  This acknowledges that some components
   of state are, in fact, per-host-pair or can be per-path as indicated
   solely by that host-pair.  Transport protocols typically manage per-
   application-pair associations (per stream), and network protocols
   manage per-host-pair and path associations (routing).  Round-trip
   time, MSS, and congestion information could be more appropriately
   handled at the network layer, aggregated among concurrent
   connections, and shared across connection instances [RFC3124].

   An earlier version of RTT sharing suggested implementing RTT state at
   the IP layer rather than at the TCP layer.  Our observations describe
   sharing state among TCP connections, which avoids some of the
   difficulties in an IP-layer solution.  One such problem of an IP-
   layer solution is determining the correspondence between packet
   exchanges using IP header information alone, where such
   correspondence is needed to compute RTT.  Because TCB sharing
   computes RTTs inside the TCP layer using TCP header information, it
   can be implemented more directly and simply than at the IP layer.
   This is a case where information should be computed at the transport
   layer but could be shared at the network layer.

9.2.  Other Possibilities

   Per-host-pair associations are not the limit of these techniques.  It
   is possible that TCBs could be similarly shared between hosts on a
   subnet or within a cluster, because the predominant path can be
   subnet-subnet rather than host-host.  Additionally, TCB
   interdependence can be applied to any protocol with congestion state,
   including SCTP [RFC4960] and DCCP [RFC4340], as well as to individual
   subflows in Multipath TCP [RFC8684].

   There may be other information that can be shared between concurrent
   connections.  For example, knowing that another connection has just
   tried to expand its window size and failed, a connection may not
   attempt to do the same for some period.  The idea is that existing
   TCP implementations infer the behavior of all competing connections,
   including those within the same host or subnet.  One possible
   optimization is to make that implicit feedback explicit, via extended
   information associated with the endpoint IP address and its TCP
   implementation, rather than per-connection state in the TCB.

   This document focuses on sharing TCB information at connection
   initialization.  Subsequent to RFC 2140, there have been numerous
   approaches that attempt to coordinate ongoing state across concurrent
   connections, both within TCP and other congestion-reactive protocols,
   which are summarized in [Is18].  These approaches are more complex to
   implement, and their comparison to steady-state TCP equivalence can
   be more difficult to establish, sometimes intentionally (i.e., they
   sometimes intend to provide a different kind of "fairness" than
   emerges from TCP operation).

10.  Implementation Observations

   The observation that some TCB state is host-pair specific rather than
   application-pair dependent is not new and is a common engineering
   decision in layered protocol implementations.  Although now
   deprecated, T/TCP [RFC1644] was the first to propose using caches in
   order to maintain TCB states (see Appendix A).

   Table 9 describes the current implementation status for TCB temporal
   sharing in Windows as of December 2020, Apple variants (macOS, iOS,
   iPadOS, tvOS, and watchOS) as of January 2021, Linux kernel version
   5.10.3, and FreeBSD 12.  Ensemble sharing is not yet implemented.

        +==============+=========================================+
        | TCB data     | Status                                  |
        +==============+=========================================+
        | old_MMS_S    | Not shared                              |
        +--------------+-----------------------------------------+
        | old_MMS_R    | Not shared                              |
        +--------------+-----------------------------------------+
        | old_sendMSS  | Cached and shared in Apple, Linux (MSS) |
        +--------------+-----------------------------------------+
        | old_PMTU     | Cached and shared in Apple, FreeBSD,    |
        |              | Windows (PMTU)                          |
        +--------------+-----------------------------------------+
        | old_RTT      | Cached and shared in Apple, FreeBSD,    |
        |              | Linux, Windows                          |
        +--------------+-----------------------------------------+
        | old_RTTVAR   | Cached and shared in Apple, FreeBSD,    |
        |              | Windows                                 |
        +--------------+-----------------------------------------+
        | old_TFOinfo  | Cached and shared in Apple, Linux,      |
        |              | Windows                                 |
        +--------------+-----------------------------------------+
        | old_sendcwnd | Not shared                              |
        +--------------+-----------------------------------------+
        | old_ssthresh | Cached and shared in Apple, FreeBSD*,   |
        |              | Linux*                                  |
        +--------------+-----------------------------------------+
        | TFO failure  | Cached and shared in Apple              |
        +--------------+-----------------------------------------+

                   Table 9: KNOWN IMPLEMENTATION STATUS

   *  Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and
      its previous value if a previous value exists; in Linux, the
      calculation depends on state and is max(curr_cwnd/2, old_ssthresh)
      in most cases.

   In Table 9, "Apple" refers to all Apple OSes, i.e., macOS (desktop/
   laptop), iOS (phone), iPadOS (tablet), tvOS (video player), and
   watchOS (smart watch), which all share the same Internet protocol
   stack.

11.  Changes Compared to RFC 2140

   This document updates the description of TCB sharing in RFC 2140 and
   its associated impact on existing and new connection state, providing
   a complete replacement for that document [RFC2140].  It clarifies the
   previous description and terminology and extends the mechanism to its
   impact on new protocols and mechanisms, including multipath TCP, Fast
   Open, PLPMTUD, NAT, and the TCP Authentication Option.

   The detailed impact on TCB state addresses TCB parameters with
   greater specificity.  It separates the way MSS is used in both send
   and receive directions, it separates the way both of these MSS values
   differ from sendMSS, it adds both path MTU and ssthresh, and it
   addresses the impact on state associated with TCP options.

   New sections have been added to address compatibility issues and
   implementation observations.  The relation of this work to T/TCP has
   been moved to Appendix A (which describes the history to TCB sharing)
   partly to reflect the deprecation of that protocol.

   Appendix C has been added to discuss the potential to use temporal
   sharing over long timescales to adapt TCP's initial window
   automatically, avoiding the need to periodically revise a single
   global constant value.

   Finally, this document updates and significantly expands the
   referenced literature.

12.  Security Considerations

   These presented implementation methods do not have additional
   ramifications for direct (connection-aborting or information-
   injecting) attacks on individual connections.  Individual
   connections, whether using sharing or not, also may be susceptible to
   denial-of-service attacks that reduce performance or completely deny
   connections and transfers if not otherwise secured.

   TCB sharing may create additional denial-of-service attacks that
   affect the performance of other connections by polluting the cached
   information.  This can occur across any set of connections in which
   the TCB is shared, between connections in a single host, or between
   hosts if TCB sharing is implemented within a subnet (see
   "Implications" (Section 9)).  Some shared TCB parameters are used
   only to create new TCBs; others are shared among the TCBs of ongoing
   connections.  New connections can join the ongoing set, e.g., to
   optimize send window size among a set of connections to the same
   host.  PMTU is defined as shared at the IP layer and is already
   susceptible in this way.

   Options in client SYNs can be easier to forge than complete, two-way
   connections.  As a result, their values may not be safely
   incorporated in shared values until after the three-way handshake
   completes.

   Attacks on parameters used only for initialization affect only the
   transient performance of a TCP connection.  For short connections,
   the performance ramification can approach that of a denial-of-service
   attack.  For example, if an application changes its TCB to have a
   false and small window size, subsequent connections will experience
   performance degradation until their window grows appropriately.

   TCB sharing reuses and mixes information from past and current
   connections.  Although reusing information could create a potential
   for fingerprinting to identify hosts, the mixing reduces that
   potential.  There has been no evidence of fingerprinting based on
   this technique, and it is currently considered safe in that regard.
   Further, information about the performance of a TCP connection has
   not been considered as private.

13.  IANA Considerations

   This document has no IANA actions.

14.  References

14.1.  Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122,
              DOI 10.17487/RFC1122, October 1989,
              <https://www.rfc-editor.org/info/rfc1122>.

   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
              DOI 10.17487/RFC1191, November 1990,
              <https://www.rfc-editor.org/info/rfc1191>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU
              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007,
              <https://www.rfc-editor.org/info/rfc4821>.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
              <https://www.rfc-editor.org/info/rfc5681>.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              DOI 10.17487/RFC6298, June 2011,
              <https://www.rfc-editor.org/info/rfc6298>.

   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014,
              <https://www.rfc-editor.org/info/rfc7413>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed.,
              "Path MTU Discovery for IP version 6", STD 87, RFC 8201,
              DOI 10.17487/RFC8201, July 2017,
              <https://www.rfc-editor.org/info/rfc8201>.

14.2.  Informative References

   [Al10]     Allman, M., "Initial Congestion Window Specification",
              Work in Progress, Internet-Draft, draft-allman-tcpm-bump-
              initcwnd-00, 15 November 2010,
              <https://datatracker.ietf.org/doc/html/draft-allman-tcpm-
              bump-initcwnd-00>.

   [Ba12]     Barik, R., Welzl, M., Ferlin, S., and O. Alay, "LISA: A
              linked slow-start algorithm for MPTCP", IEEE ICC,
              DOI 10.1109/ICC.2016.7510786, May 2016,
              <https://doi.org/10.1109/ICC.2016.7510786>.

   [Ba20]     Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit
              Congestion Notification (ECN) to TCP Control Packets",
              Work in Progress, Internet-Draft, draft-ietf-tcpm-
              generalized-ecn-07, 16 February 2021,
              <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-
              generalized-ecn-07>.

   [Be94]     Berners-Lee, T., Cailliau, C., Luotonen, A., Nielsen, H.,
              and A. Secret, "The World-Wide Web", Communications of the
              ACM V37, pp. 76-82, DOI 10.1145/179606.179671, August
              1994, <https://doi.org/10.1145/179606.179671>.

   [Br02]     Brownlee, N. and KC. Claffy, "Understanding Internet
              traffic streams: dragonflies and tortoises", IEEE
              Communications Magazine, pp. 110-117,
              DOI 10.1109/MCOM.2002.1039865, 2002,
              <https://doi.org/10.1109/MCOM.2002.1039865>.

   [Br94]     Braden, B., "T/TCP -- Transaction TCP: Source Changes for
              Sun OS 4.1.3", USC/ISI Release 1.0, September 1994.

   [Co91]     Comer, D. and D. Stevens, "Internetworking with TCP/IP",
              ISBN 10: 0134685059, ISBN 13: 9780134685052, 1991.

   [Du16]     Dukkipati, N., Cheng, Y., and A. Vahdat, "Research
              Impacting the Practice of Congestion Control", Computer
              Communication Review, The ACM SIGCOMM newsletter, July
              2016.

   [FreeBSD]  FreeBSD, "The FreeBSD Project",
              <https://www.freebsd.org/>.

   [Hu01]     Hughes, A., Touch, J., and J. Heidemann, "Issues in TCP
              Slow-Start Restart After Idle", Work in Progress,
              Internet-Draft, draft-hughes-restart-00, December 2001,
              <https://datatracker.ietf.org/doc/html/draft-hughes-
              restart-00>.

   [Hu12]     Hurtig, P. and A. Brunstrom, "Enhanced metric caching for
              short TCP flows", IEEE International Conference on
              Communications, DOI 10.1109/ICC.2012.6364516, 2012,
              <https://doi.org/10.1109/ICC.2012.6364516>.

   [IANA]     IANA, "Transmission Control Protocol (TCP) Parameters",
              <https://www.iana.org/assignments/tcp-parameters>.

   [Is18]     Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G.,
              and S. Gjessing, "ctrlTCP: Reducing latency through
              coupled, heterogeneous multi-flow TCP congestion control",
              IEEE INFOCOM 2018 - IEEE Conference on Computer
              Communications Workshops (INFOCOM WKSHPS),
              DOI 10.1109/INFCOMW.2018.8406887, April 2018,
              <https://doi.org/10.1109/INFCOMW.2018.8406887>.

   [Ja88]     Jacobson, V. and M. Karels, "Congestion Avoidance and
              Control", SIGCOMM Symposium proceedings on Communications
              architectures and protocols, November 1988.

   [RFC1379]  Braden, R., "Extending TCP for Transactions -- Concepts",
              RFC 1379, DOI 10.17487/RFC1379, November 1992,
              <https://www.rfc-editor.org/info/rfc1379>.

   [RFC1644]  Braden, R., "T/TCP -- TCP Extensions for Transactions
              Functional Specification", RFC 1644, DOI 10.17487/RFC1644,
              July 1994, <https://www.rfc-editor.org/info/rfc1644>.

   [RFC2001]  Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
              Retransmit, and Fast Recovery Algorithms", RFC 2001,
              DOI 10.17487/RFC2001, January 1997,
              <https://www.rfc-editor.org/info/rfc2001>.

   [RFC2140]  Touch, J., "TCP Control Block Interdependence", RFC 2140,
              DOI 10.17487/RFC2140, April 1997,
              <https://www.rfc-editor.org/info/rfc2140>.

   [RFC2414]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
              Initial Window", RFC 2414, DOI 10.17487/RFC2414, September
              1998, <https://www.rfc-editor.org/info/rfc2414>.

   [RFC2663]  Srisuresh, P. and M. Holdrege, "IP Network Address
              Translator (NAT) Terminology and Considerations",
              RFC 2663, DOI 10.17487/RFC2663, August 1999,
              <https://www.rfc-editor.org/info/rfc2663>.

   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
              RFC 3124, DOI 10.17487/RFC3124, June 2001,
              <https://www.rfc-editor.org/info/rfc3124>.

   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
              Initial Window", RFC 3390, DOI 10.17487/RFC3390, October
              2002, <https://www.rfc-editor.org/info/rfc3390>.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340,
              DOI 10.17487/RFC4340, March 2006,
              <https://www.rfc-editor.org/info/rfc4340>.

   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol",
              RFC 4960, DOI 10.17487/RFC4960, September 2007,
              <https://www.rfc-editor.org/info/rfc4960>.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010, <https://www.rfc-editor.org/info/rfc5925>.

   [RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
              "IPv6 Flow Label Specification", RFC 6437,
              DOI 10.17487/RFC6437, November 2011,
              <https://www.rfc-editor.org/info/rfc6437>.

   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)",
              RFC 6691, DOI 10.17487/RFC6691, July 2012,
              <https://www.rfc-editor.org/info/rfc6691>.

   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
              "Increasing TCP's Initial Window", RFC 6928,
              DOI 10.17487/RFC6928, April 2013,
              <https://www.rfc-editor.org/info/rfc6928>.

   [RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Semantics and Content", RFC 7231,
              DOI 10.17487/RFC7231, June 2014,
              <https://www.rfc-editor.org/info/rfc7231>.

   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R.
              Scheffenegger, Ed., "TCP Extensions for High Performance",
              RFC 7323, DOI 10.17487/RFC7323, September 2014,
              <https://www.rfc-editor.org/info/rfc7323>.

   [RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B.
              Khasnabish, "Mechanisms for Optimizing Link Aggregation
              Group (LAG) and Equal-Cost Multipath (ECMP) Component Link
              Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424,
              January 2015, <https://www.rfc-editor.org/info/rfc7424>.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015,
              <https://www.rfc-editor.org/info/rfc7540>.

   [RFC7661]  Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating
              TCP to Support Rate-Limited Traffic", RFC 7661,
              DOI 10.17487/RFC7661, October 2015,
              <https://www.rfc-editor.org/info/rfc7661>.

   [RFC8684]  Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C.
              Paasch, "TCP Extensions for Multipath Operation with
              Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March
              2020, <https://www.rfc-editor.org/info/rfc8684>.

Appendix A.  TCB Sharing History

   T/TCP proposed using caches to maintain TCB information across
   instances (temporal sharing), e.g., smoothed RTT, RTT variation,
   congestion-avoidance threshold, and MSS [RFC1644].  These values were
   in addition to connection counts used by T/TCP to accelerate data
   delivery prior to the full three-way handshake during an OPEN.  The
   goal was to aggregate TCB components where they reflect one
   association -- that of the host-pair rather than artificially
   separating those components by connection.

   At least one T/TCP implementation saved the MSS and aggregated the
   RTT parameters across multiple connections but omitted caching the
   congestion window information [Br94], as originally specified in
   [RFC1379].  Some T/TCP implementations immediately updated MSS when
   the TCP MSS header option was received [Br94], although this was not
   addressed specifically in the concepts or functional specification
   [RFC1379] [RFC1644].  In later T/TCP implementations, RTT values were
   updated only after a CLOSE, which does not benefit concurrent
   sessions.

   Temporal sharing of cached TCB data was originally implemented in the
   Sun OS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same
   [FreeBSD].  As mentioned before, only the MSS and RTT parameters were
   cached, as originally specified in [RFC1379].  Later discussion of T/
   TCP suggested including congestion control parameters in this cache;
   for example, Section 3.1 of [RFC1644] hints at initializing the
   congestion window to the old window size.

Appendix B.  TCP Option Sharing and Caching

   In addition to the options that can be cached and shared, this memo
   also lists known TCP options [IANA] for which state is unsafe to be
   kept.  This list is not intended to be authoritative or exhaustive.

   Obsolete (unsafe to keep state):

      Echo

      Echo Reply

      Partial Order Connection Permitted

      Partial Order Service Profile

      CC

      CC.NEW

      CC.ECHO

      TCP Alternate Checksum Request

      TCP Alternate Checksum Data

   No state to keep:

      End of Option List (EOL)

      No-Operation (NOP)

      Window Scale (WS)

      SACK

      Timestamps (TS)

      MD5 Signature Option

      TCP Authentication Option (TCP-AO)

      RFC3692-style Experiment 1

      RFC3692-style Experiment 2

   Unsafe to keep state:

      Skeeter (DH exchange, known to be vulnerable)

      Bubba (DH exchange, known to be vulnerable)

      Trailer Checksum Option

      SCPS capabilities

      Selective Negative Acknowledgements (S-NACK)

      Records Boundaries

      Corruption experienced

      SNAP

      TCP Compression Filter

      Quick-Start Response

      User Timeout Option (UTO)

      Multipath TCP (MPTCP) negotiation success (see below for
      negotiation failure)

      TCP Fast Open (TFO) negotiation success (see below for negotiation
      failure)

   Safe but optional to keep state:

      Multipath TCP (MPTCP) negotiation failure (to avoid negotiation
      retries)

      Maximum Segment Size (MSS)

      TCP Fast Open (TFO) negotiation failure (to avoid negotiation
      retries)

   Safe and necessary to keep state:

      TCP Fast Open (TFO) Cookie (if TFO succeeded in the past)

Appendix C.  Automating the Initial Window in TCP over Long Timescales

C.1.  Introduction

   Temporal sharing, as described earlier in this document, builds on
   the assumption that multiple consecutive connections between the same
   host-pair are somewhat likely to be exposed to similar environment
   characteristics.  The stored information can become less accurate
   over time and suitable precautions should take this aging into
   consideration (this is discussed further in Section 8.1).  However,
   there are also cases where it can make sense to track these values
   over longer periods, observing properties of TCP connections to
   gradually influence evolving trends in TCP parameters.  This appendix
   describes an example of such a case.

   TCP's congestion control algorithm uses an initial window value (IW)
   both as a starting point for new connections and as an upper limit
   for restarting after an idle period [RFC5681] [RFC7661].  This value
   has evolved over time; it was originally 1 maximum segment size (MSS)
   and increased to the lesser of 4 MSSs or 4,380 bytes [RFC3390]
   [RFC5681].  For a typical Internet connection with a maximum
   transmission unit (MTU) of 1500 bytes, this permits 3 segments of
   1,460 bytes each.

   The IW value was originally implied in the original TCP congestion
   control description and documented as a standard in 1997 [RFC2001]
   [Ja88].  The value was updated in 1998 experimentally and moved to
   the Standards Track in 2002 [RFC2414] [RFC3390].  In 2013, it was
   experimentally increased to 10 [RFC6928].

   This appendix discusses how TCP can objectively measure when an IW is
   too large and that such feedback should be used over long timescales
   to adjust the IW automatically.  The result should be safer to deploy
   and might avoid the need to repeatedly revisit IW over time.

   Note that this mechanism attempts to make the IW more adaptive over
   time.  It can increase the IW beyond that which is currently
   recommended for wide-scale deployment, so its use should be carefully
   monitored.

C.2.  Design Considerations

   TCP's IW value has existed statically for over two decades, so any
   solution to adjusting the IW dynamically should have similarly
   stable, non-invasive effects on the performance and complexity of
   TCP.  In order to be fair, the IW should be similar for most machines
   on the public Internet.  Finally, a desirable goal is to develop a
   self-correcting algorithm so that IW values that cause network
   problems can be avoided.  To that end, we propose the following
   design goals:

   *  Impart little to no impact to TCP in the absence of loss, i.e., it
      should not increase the complexity of default packet processing in
      the normal case.

   *  Adapt to network feedback over long timescales, avoiding values
      that persistently cause network problems.

   *  Decrease the IW in the presence of sustained loss of IW segments,
      as determined over a number of different connections.

   *  Increase the IW in the absence of sustained loss of IW segments,
      as determined over a number of different connections.

   *  Operate conservatively, i.e., tend towards leaving the IW the same
      in the absence of sufficient information, and give greater
      consideration to IW segment loss than IW segment success.

   We expect that, without other context, a good IW algorithm will
   converge to a single value, but this is not required.  An endpoint
   with additional context or information, or deployed in a constrained
   environment, can always use a different value.  In particular,
   information from previous connections, or sets of connections with a
   similar path, can already be used as context for such decisions (as
   noted in the core of this document).

   However, if a given IW value persistently causes packet loss during
   the initial burst of packets, it is clearly inappropriate and could
   be inducing unnecessary loss in other competing connections.  This
   might happen for sites behind very slow boxes with small buffers,
   which may or may not be the first hop.

C.3.  Proposed IW Algorithm

   Below is a simple description of the proposed IW algorithm.  It
   relies on the following parameters:

   *  MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])

   *  MaxIW = 10 MSS (as per [RFC6928])

   *  MulDecr = 0.5

   *  AddIncr = 2 MSS

   *  Threshold = 0.05

   We assume that the minimum IW (MinIW) should be as currently
   specified as standard [RFC3390].  The maximum IW (MaxIW) can be set
   to a fixed value (we suggest using the experimental and now somewhat
   de facto standard in [RFC6928]) or set based on a schedule if trusted
   time references are available [Al10]; here, we prefer a fixed value.
   We also propose to use an Additive Increase Multiplicative Decrease
   (AIMD) algorithm, with increase and decreases as noted.

   Although these parameters are somewhat arbitrary, their initial
   values are not important except that the algorithm is AIMD and the
   MaxIW should not exceed that recommended for other systems on the
   Internet (here, we selected the current de facto standard rather than
   the actual standard).  Current proposals, including default current
   operation, are degenerate cases of the algorithm below for given
   parameters, notably MulDec = 1.0 and AddIncr = 0 MSS, thus disabling
   the automatic part of the algorithm.

   The proposed algorithm is as follows:

   1.  On boot:

         IW = MaxIW; # assume this is in bytes and indicates an integer
                     # multiple of 2 MSS (an even number to support
                     # ACK compression)

   2.  Upon starting a new connection:

         CWND = IW;
         conncount++;
         IWnotchecked = 1; # true

   3.  During a connection's SYN-ACK processing, if SYN-ACK includes ECN
       (as similarly addressed in Section 5 of ECN++ for TCP [Ba20]),
       treat as if the IW is too large:

         if (IWnotchecked && (synackecn == 1)) {
            losscount++;
            IWnotchecked = 0; # never check again
         }

   4.  During a connection, if retransmission occurs, check the seqno of
       the outgoing packet (in bytes) to see if the re-sent segment
       fixes an IW loss:

         if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW))) {
            losscount++;
            IWnotchecked = 0; # never do this entire "if" again
         } else {
            IWnotchecked = 0; # you're beyond the IW so stop checking
         }

   5.  Once every 1000 connections, as a separate process (i.e., not as
       part of processing a given connection):

         if (conncount > 1000) {
            if (losscount/conncount > threshold) {
               # the number of connections with errors is too high
               IW = IW * MulDecr;
            } else {
               IW = IW + AddIncr;
            }
         }

   As presented, this algorithm can yield a false positive when the
   sequence number wraps around, e.g., the code might increment
   losscount in step 4 when no loss occurred or fail to increment
   losscount when a loss did occur.  This can be avoided using either
   Protection Against Wrapped Sequences (PAWS) [RFC7323] context or
   internal extended sequence number representations (as in TCP
   Authentication Option (TCP-AO) [RFC5925]).  Alternately, false
   positives can be tolerated because they are expected to be infrequent
   and thus will not significantly impact the algorithm.

   A number of additional constraints need to be imposed if this
   mechanism is implemented to ensure that it defaults to values that
   comply with current Internet standards, is conservative in how it
   extends those values, and returns to those values in the absence of
   positive feedback (i.e., success).  To that end, we recommend the
   following list of example constraints:

   *  The automatic IW algorithm MUST initialize MaxIW a value no larger
      than the currently recommended Internet default in the absence of
      other context information.

      Thus, if there are too few connections to make a decision or if
      there is otherwise insufficient information to increase the IW,
      then the MaxIW defaults to the current recommended value.

   *  An implementation MAY allow the MaxIW to grow beyond the currently
      recommended Internet default but not more than 2 segments per
      calendar year.

      Thus, if an endpoint has a persistent history of successfully
      transmitting IW segments without loss, then it is allowed to probe
      the Internet to determine if larger IW values have similar
      success.  This probing is limited and requires a trusted time
      source; otherwise, the MaxIW remains constant.

   *  An implementation MUST adjust the IW based on loss statistics at
      least once every 1000 connections.

      An endpoint needs to be sufficiently reactive to IW loss.

   *  An implementation MUST decrease the IW by at least 1 MSS when
      indicated during an evaluation interval.

      An endpoint that detects loss needs to decrease its IW by at least
      1 MSS; otherwise, it is not participating in an automatic reactive
      algorithm.

   *  An implementation MUST increase by no more than 2 MSSs per
      evaluation interval.

      An endpoint that does not experience IW loss needs to probe the
      network incrementally.

   *  An implementation SHOULD use an IW that is an integer multiple of
      2 MSSs.

      The IW should remain a multiple of 2 MSS segments to enable
      efficient ACK compression without incurring unnecessary timeouts.

   *  An implementation MUST decrease the IW if more than 95% of
      connections have IW losses.

      Again, this is to ensure an implementation is sufficiently
      reactive.

   *  An implementation MAY group IW values and statistics within
      subsets of connections.  Such grouping MAY use any information
      about connections to form groups except loss statistics.

   There are some TCP connections that might not be counted at all, such
   as those to/from loopback addresses or those within the same subnet
   as that of a local interface (for which congestion control is
   sometimes disabled anyway).  This may also include connections that
   terminate before the IW is full, i.e., as a separate check at the
   time of the connection closing.

   The period over which the IW is updated is intended to be a long
   timescale, e.g., a month or so, or 1,000 connections, whichever is
   longer.  An implementation might check the IW once a month and simply
   not update the IW or clear the connection counts in months where the
   number of connections is too small.

C.4.  Discussion

   There are numerous parameters to the above algorithm that are
   compliant with the given requirements; this is intended to allow
   variation in configuration and implementation while ensuring that all
   such algorithms are reactive and safe.

   This algorithm continues to assume segments because that is the basis
   of most TCP implementations.  It might be useful to consider revising
   the specifications to allow byte-based congestion given sufficient
   experience.

   The algorithm checks for IW losses only during the first IW after a
   connection start; it does not check for IW losses elsewhere the IW is
   used, e.g., during slow-start restarts.

   *  An implementation MAY detect IW losses during slow-start restarts
      in addition to losses during the first IW of a connection.  In
      this case, the implementation MUST count each restart as a
      "connection" for the purposes of connection counts and periodic
      rechecking of the IW value.

   False positives can occur during some kinds of segment reordering,
   e.g., that might trigger spurious retransmissions even without a true
   segment loss.  These are not expected to be sufficiently common to
   dominate the algorithm and its conclusions.

   This mechanism does require additional per-connection state, which is
   currently common in some implementations and is useful for other
   reasons (e.g., the ISN is used in TCP-AO [RFC5925]).  The mechanism
   in this appendix also benefits from persistent state kept across
   reboots, which would also be useful to other state sharing mechanisms
   (e.g., TCP Control Block Sharing per the main body of this document).

   The receive window (rwnd) is not involved in this calculation.  The
   size of rwnd is determined by receiver resources and provides space
   to accommodate segment reordering.  Also, rwnd is not involved with
   congestion control, which is the focus of the way this appendix
   manages the IW.

C.5.  Observations

   The IW may not converge to a single global value.  It also may not
   converge at all but rather may oscillate by a few MSSs as it
   repeatedly probes the Internet for larger IWs and fails.  Both
   properties are consistent with TCP behavior during each individual
   connection.

   This mechanism assumes that losses during the IW are due to IW size.
   Persistent errors that drop packets for other reasons, e.g., OS bugs,
   can cause false positives.  Again, this is consistent with TCP's
   basic assumption that loss is caused by congestion and requires
   backoff.  This algorithm treats the IW of new connections as a long-
   timescale backoff system.

Acknowledgments

   The authors would like to thank Praveen Balasubramanian for
   information regarding TCB sharing in Windows; Christoph Paasch for
   information regarding TCB sharing in Apple OSs; Yuchung Cheng, Lars
   Eggert, Ilpo Jarvinen, and Michael Scharf for comments on earlier
   draft versions of this document; as well as members of the TCPM WG.
   Earlier revisions of this work received funding from a collaborative
   research project between the University of Oslo and Huawei
   Technologies Co., Ltd. and were partly supported by USC/ISI's Postel
   Center.

Authors' Addresses

   Joe Touch
   Manhattan Beach, CA 90266
   United States of America

   Phone: +1 (310) 560-0334
   Email: touch@strayalpha.com


   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   N-0316 Oslo
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no


   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no