Internet Engineering Task Force (IETF)                          J. Touch
Request for Comments: 9040                                   Independent
Obsoletes: 2140                                                 M. Welzl
Category: Informational                                         S. Islam
ISSN: 2070-1721                                       University of Oslo
                                                               July 2021


                   TCP Control Block Interdependence

Abstract

   This memo provides guidance to TCP implementers that is intended to
   help improve connection convergence to steady-state operation without
   affecting interoperability.  It updates and replaces RFC 2140's
   description of sharing TCP state, as typically represented in TCP
   Control Blocks, among similar concurrent or consecutive connections.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Not all documents
   approved by the IESG are candidates for any level of Internet
   Standard; see Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc9040.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions Used in This Document
   3.  Terminology
   4.  The TCP Control Block (TCB)
   5.  TCB Interdependence
   6.  Temporal Sharing
     6.1.  Initialization of a New TCB
     6.2.  Updates to the TCB Cache
     6.3.  Discussion
   7.  Ensemble Sharing
     7.1.  Initialization of a New TCB
     7.2.  Updates to the TCB Cache
     7.3.  Discussion
   8.  Issues with TCB Information Sharing
     8.1.  Traversing the Same Network Path
     8.2.  State Dependence
     8.3.  Problems with Sharing Based on IP Address
   9.  Implications
     9.1.  Layering
     9.2.  Other Possibilities
   10. Implementation Observations
   11. Changes Compared to RFC 2140
   12. Security Considerations
   13. IANA Considerations
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  TCB Sharing History
   Appendix B.  TCP Option Sharing and Caching
   Appendix C.  Automating the Initial Window in TCP over Long
           Timescales
     C.1.  Introduction
     C.2.
Design Considerations +     C.3.  Proposed IW Algorithm +     C.4.  Discussion +     C.5.  Observations +   Acknowledgments +   Authors' Addresses + +1.  Introduction + +   TCP is a connection-oriented reliable transport protocol layered over +   IP [RFC0793].  Each TCP connection maintains state, usually in a data +   structure called the "TCP Control Block (TCB)".  The TCB contains +   information about the connection state, its associated local process, +   and feedback parameters about the connection's transmission +   properties.  As originally specified and usually implemented, most +   TCB information is maintained on a per-connection basis.  Some +   implementations share certain TCB information across connections to +   the same host [RFC2140].  Such sharing is intended to lead to better +   overall transient performance, especially for numerous short-lived +   and simultaneous connections, as can be used in the World Wide Web +   and other applications [Be94] [Br02].  This sharing of state is +   intended to help TCP connections converge to long-term behavior +   (assuming stable application load, i.e., so-called "steady-state") +   more quickly without affecting TCP interoperability. + +   This document updates RFC 2140's discussion of TCB state sharing and +   provides a complete replacement for that document.  This state +   sharing affects only TCB initialization [RFC2140] and thus has no +   effect on the long-term behavior of TCP after a connection has been +   established or on interoperability.  Path information shared across +   SYN destination port numbers assumes that TCP segments having the +   same host-pair experience the same path properties, i.e., that +   traffic is not routed differently based on port numbers or other +   connection parameters (also addressed further in Section 8.1).  The +   observations about TCB sharing in this document apply similarly to +   any protocol with congestion state, including the Stream Control +   Transmission Protocol (SCTP) [RFC4960] and the Datagram Congestion +   Control Protocol (DCCP) [RFC4340], as well as to individual subflows +   in Multipath TCP [RFC8684]. + +2.  Conventions Used in This Document + +   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and +   "OPTIONAL" in this document are to be interpreted as described in +   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all +   capitals, as shown here. + +   The core of this document describes behavior that is already +   permitted by TCP standards.  As a result, this document provides +   informative guidance but does not use normative language except when +   quoting other documents.  Normative language is used in Appendix C as +   examples of requirements for future consideration. + +3.  Terminology + +   The following terminology is used frequently in this document.  Items +   preceded with a "+" may be part of the state maintained as TCP +   connection state in the TCB of associated connections and are the +   focus of sharing as described in this document.  Note that terms are +   used as originally introduced where possible; in some cases, +   direction is indicated with a suffix (_S for send, _R for receive) +   and in other cases spelled out (sendcwnd). 
+ +   +cwnd:  TCP congestion window size [RFC5681] + +   host:  a source or sink of TCP segments associated with a single IP +         address + +   host-pair:  a pair of hosts and their corresponding IP addresses + +   ISN:  Initial Sequence Number + +   +MMS_R:  maximum message size that can be received, the largest +         received transport payload of an IP datagram [RFC1122] + +   +MMS_S:  maximum message size that can be sent, the largest +         transmitted transport payload of an IP datagram [RFC1122] + +   path:  an Internet path between the IP addresses of two hosts + +   PCB:  protocol control block, the data associated with a protocol as +         maintained by an endpoint; a TCP PCB is called a "TCB" + +   PLPMTUD:  packetization-layer path MTU discovery, a mechanism that +         uses transport packets to discover the Path Maximum +         Transmission Unit (PMTU) [RFC4821] + +   +PMTU:  largest IP datagram that can traverse a path [RFC1191] +         [RFC8201] + +   PMTUD:  path-layer MTU discovery, a mechanism that relies on ICMP +         error messages to discover the PMTU [RFC1191] [RFC8201] + +   +RTT:  round-trip time of a TCP packet exchange [RFC0793] + +   +RTTVAR:  variation of round-trip times of a TCP packet exchange +         [RFC6298] + +   +rwnd:  TCP receive window size [RFC5681] + +   +sendcwnd:  TCP send-side congestion window (cwnd) size [RFC5681] + +   +sendMSS:  TCP maximum segment size, a value transmitted in a TCP +         option that represents the largest TCP user data payload that +         can be received [RFC6691] + +   +ssthresh:  TCP slow-start threshold [RFC5681] + +   TCB:  TCP Control Block, the data associated with a TCP connection as +         maintained by an endpoint + +   TCP-AO:  TCP Authentication Option [RFC5925] + +   TFO:  TCP Fast Open option [RFC7413] + +   +TFO_cookie:  TCP Fast Open cookie, state that is used as part of the +         TFO mechanism, when TFO is supported [RFC7413] + +   +TFO_failure:  an indication of when TFO option negotiation failed, +         when TFO is supported + +   +TFOinfo:  information cached when a TFO connection is established, +         which includes the TFO_cookie [RFC7413] + +4.  The TCP Control Block (TCB) + +   A TCB describes the data associated with each connection, i.e., with +   each association of a pair of applications across the network.  The +   TCB contains at least the following information [RFC0793]: + +      Local process state + +         pointers to send and receive buffers +         pointers to retransmission queue and current segment +         pointers to Internet Protocol (IP) PCB + +      Per-connection shared state + +         macro-state +            connection state +            timers +            flags +            local and remote host numbers and ports +            TCP option state +         micro-state +            send and receive window state (size*, current number) +            congestion window size (sendcwnd)* +            congestion window size threshold (ssthresh)* +            max window size seen* +            sendMSS# +            MMS_S# +            MMS_R# +            PMTU# +            round-trip time and its variation# + +   The per-connection information is shown as split into macro-state and +   micro-state, terminology borrowed from [Co91].  
   Macro-state describes the protocol for establishing the initial
   shared state about the connection; we include the endpoint numbers
   and components (timers, flags) required upon commencement that are
   later used to help maintain that state.  Micro-state describes the
   protocol after a connection has been established, to maintain the
   reliability and congestion control of the data transferred in the
   connection.

   We distinguish two other classes of shared micro-state that are
   associated more with host-pairs than with application pairs.  One
   class is clearly host-pair dependent (shown above as "#", e.g.,
   sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are
   defined by the endpoint or endpoint pair (of the given example:
   sendMSS, MMS_R, MMS_S, RTT) or are already cached and shared on that
   basis (of the given example: PMTU [RFC1191] [RFC4821]).  The other
   class is host-pair dependent in its aggregate (shown above as "*",
   e.g., congestion window information, current window sizes, etc.),
   because these parameters depend on the total capacity between the
   two endpoints.

   Not all of the TCB state is necessarily shareable.  In particular,
   some TCP options are negotiated only upon request by the application
   layer, so their use may not be correlated across connections.  Other
   options negotiate connection-specific parameters, which are
   similarly not shareable.  These are discussed further in Appendix B.

   Finally, we exclude rwnd from further discussion because its value
   should depend on the send window size, so it is already addressed by
   send window sharing and is not independently affected by sharing.

5.  TCB Interdependence

   There are two cases of TCB interdependence.  Temporal sharing occurs
   when the TCB of an earlier (now CLOSED) connection to a host is used
   to initialize some parameters of a new connection to that same host,
   i.e., in sequence.  Ensemble sharing occurs when a currently active
   connection to a host is used to initialize another (concurrent)
   connection to that host.

6.  Temporal Sharing

   The TCB data cache is accessed in two ways: it is read to initialize
   new TCBs and written when more current per-host state is available.
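   As a purely illustrative sketch (not part of this specification or
   any particular stack), a per-host cache supporting this read/write
   pattern might be shaped as follows in C.  All structure, field, and
   function names here are hypothetical; option-specific state is
   reduced to a single TFO field, and units are arbitrary choices.

   #include <stddef.h>
   #include <stdint.h>
   #include <netinet/in.h>

   /* Hypothetical per-host cache entry for temporal sharing.  Fields
    * mirror the "+"-marked terminology of Section 3.  MMS_R/MMS_S are
    * omitted (reported by the local IP stack), and PMTU is omitted
    * (cached at the IP layer per [RFC1191] [RFC4821]). */
   struct tcb_cache_entry {
       struct in6_addr peer;        /* remote host address (cache key) */
       uint32_t sendmss;            /* old_sendMSS                     */
       uint32_t srtt_us, rttvar_us; /* old_RTT, old_RTTVAR (microsec)  */
       uint32_t ssthresh;           /* old_ssthresh (bytes)            */
       uint32_t sendcwnd;           /* old_sendcwnd (bytes)            */
       uint8_t  tfo_cookie[16];     /* old_TFO_cookie, if TFO in use   */
       int      valid;              /* nonzero once written            */
   };

   /* Minimal stand-in for the per-connection TCB micro-state. */
   struct tcb {
       uint32_t sendmss, srtt_us, rttvar_us, ssthresh, sendcwnd;
   };

   /* Read path: initialize a new TCB from cached context, following
    * the copy semantics of Table 1 below. */
   void tcb_init_from_cache(struct tcb *t,
                            const struct tcb_cache_entry *e)
   {
       if (e == NULL || !e->valid)
           return;                  /* no history: keep stack defaults */
       t->sendmss   = e->sendmss;
       t->srtt_us   = e->srtt_us;
       t->rttvar_us = e->rttvar_us;
       t->ssthresh  = e->ssthresh;
       t->sendcwnd  = e->sendcwnd;
   }

   The write path (updating the cache from a current TCB) is driven by
   the events listed in Table 3 below.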
6.1.  Initialization of a New TCB

   TCBs for new connections can be initialized using cached context
   from past connections as follows:

              +==============+=============================+
              | Cached TCB   | New TCB                     |
              +==============+=============================+
              | old_MMS_S    | old_MMS_S or not cached (2) |
              +--------------+-----------------------------+
              | old_MMS_R    | old_MMS_R or not cached (2) |
              +--------------+-----------------------------+
              | old_sendMSS  | old_sendMSS                 |
              +--------------+-----------------------------+
              | old_PMTU     | old_PMTU (1)                |
              +--------------+-----------------------------+
              | old_RTT      | old_RTT                     |
              +--------------+-----------------------------+
              | old_RTTVAR   | old_RTTVAR                  |
              +--------------+-----------------------------+
              | old_option   | (option specific)           |
              +--------------+-----------------------------+
              | old_ssthresh | old_ssthresh                |
              +--------------+-----------------------------+
              | old_sendcwnd | old_sendcwnd                |
              +--------------+-----------------------------+

              Table 1: Temporal Sharing - TCB Initialization

   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

   (2)  Note that some values are not cached when they are computed
      locally (MMS_R) or indicated in the connection itself (MMS_S in
      the SYN).

   Table 2 gives an overview of option-specific information that can be
   shared.  Additional information on some specific TCP options and
   sharing is provided in Appendix B.

                   +=================+=================+
                   | Cached          | New             |
                   +=================+=================+
                   | old_TFO_cookie  | old_TFO_cookie  |
                   +-----------------+-----------------+
                   | old_TFO_failure | old_TFO_failure |
                   +-----------------+-----------------+

                        Table 2: Temporal Sharing -
                         Option Info Initialization

6.2.  Updates to the TCB Cache

   During a connection, the TCB cache can be updated based on events of
   current connections and their TCBs as they progress over time, as
   shown in Table 3.

     +==============+===============+=============+=================+
     | Cached TCB   | Current TCB   | When?
| New Cached TCB  | +     +==============+===============+=============+=================+ +     | old_MMS_S    | curr_MMS_S    | OPEN        | curr_MMS_S      | +     +--------------+---------------+-------------+-----------------+ +     | old_MMS_R    | curr_MMS_R    | OPEN        | curr_MMS_R      | +     +--------------+---------------+-------------+-----------------+ +     | old_sendMSS  | curr_sendMSS  | MSSopt      | curr_sendMSS    | +     +--------------+---------------+-------------+-----------------+ +     | old_PMTU     | curr_PMTU     | PMTUD (1) / | curr_PMTU       | +     |              |               | PLPMTUD (1) |                 | +     +--------------+---------------+-------------+-----------------+ +     | old_RTT      | curr_RTT      | CLOSE       | merge(curr,old) | +     +--------------+---------------+-------------+-----------------+ +     | old_RTTVAR   | curr_RTTVAR   | CLOSE       | merge(curr,old) | +     +--------------+---------------+-------------+-----------------+ +     | old_option   | curr_option   | ESTAB       | (depends on     | +     |              |               |             | option)         | +     +--------------+---------------+-------------+-----------------+ +     | old_ssthresh | curr_ssthresh | CLOSE       | merge(curr,old) | +     +--------------+---------------+-------------+-----------------+ +     | old_sendcwnd | curr_sendcwnd | CLOSE       | merge(curr,old) | +     +--------------+---------------+-------------+-----------------+ + +                Table 3: Temporal Sharing - Cache Updates + +   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821]. + +   Merge() is the function that combines the current and previous (old) +   values and may vary for each parameter of the TCB cache.  The +   particular function is not specified in this document; examples +   include windowed averages (mean of the past N values, for some N) and +   exponential decay (new = (1-alpha)*old + alpha *new, where alpha is +   in the range [0..1]). + +   Table 4 gives an overview of option-specific information that can be +   similarly shared.  The TFO cookie is maintained until the client +   explicitly requests it be updated as a separate event. + +      +=================+=================+=======+=================+ +      | Cached          | Current         | When? | New Cached      | +      +=================+=================+=======+=================+ +      | old_TFO_cookie  | old_TFO_cookie  | ESTAB | old_TFO_cookie  | +      +-----------------+-----------------+-------+-----------------+ +      | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure | +      +-----------------+-----------------+-------+-----------------+ + +              Table 4: Temporal Sharing - Option Info Updates + +6.3.  Discussion + +   As noted, there is no particular benefit to caching MMS_S and MMS_R +   as these are reported by the local IP stack.  Caching sendMSS and +   PMTU is trivial; reported values are cached (PMTU at the IP layer), +   and the most recent values are used.  The cache is updated when the +   MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4 +   Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is +   received [RFC8201] or the equivalent is inferred, e.g., as from +   PLPMTUD [RFC4821]), respectively, so the cache always has the most +   recent values from any connection.  
   For sendMSS, the cache is consulted only at connection establishment
   and not otherwise updated, which means that MSS options do not
   affect current connections.  The default sendMSS is never saved;
   only reported MSS values update the cache, so an explicit override
   is required to reduce the sendMSS.  Cached sendMSS affects only data
   sent in the SYN segment, i.e., during client connection initiation
   or during simultaneous open; the MSS of all other segments is
   constrained by the value included in the SYN.

   RTT values are updated by formulae that merge the old and new
   values, as noted in Section 6.2.  Dynamic RTT estimation requires a
   sequence of RTT measurements.  As a result, the cached RTT (and its
   variation) is an average of its previous value with the contents of
   the currently active TCB for that host, when a TCB is closed.  RTT
   values are updated only when a connection is closed.  The method for
   merging old and current values should attempt to reduce the
   transient effects of new connections.

   The updates for RTT, RTTVAR, and ssthresh rely on existing
   information, i.e., old values.  Should no such values exist, the
   current values are cached instead.

   TCP options are copied or merged depending on the details of each
   option.  For example, TFO state is updated when a connection is
   established and read before establishing a new connection.

   Sections 8 and 9 discuss compatibility issues and implications of
   sharing the specific information listed above.  Section 10 gives an
   overview of known implementations.

   Most cached TCB values are updated when a connection closes.  The
   exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122];
   PMTU, which is updated after Path MTU Discovery and also reported by
   IP [RFC1191] [RFC4821] [RFC8201]; and sendMSS, which is updated if
   the MSS option is received in the TCP SYN header.

   Sharing sendMSS information affects only data in the SYN of the next
   connection, because sendMSS information is typically included in
   most TCP SYN segments.  Caching PMTU can improve the efficiency of
   PMTUD but can also result in black-holing until corrected if in
   error.  Caching MMS_R and MMS_S may be of little direct value as
   they are reported by the local IP stack anyway.

   The way in which state related to other TCP options can be shared
   depends on the details of that option.  For example, TFO state
   includes the TCP Fast Open cookie [RFC7413] or, in case TFO fails, a
   negative TCP Fast Open response.  RFC 7413 states,

   |  The client MUST cache negative responses from the server in order
   |  to avoid potential connection failures.  Negative responses
   |  include the server not acknowledging the data in the SYN, ICMP
   |  error messages, and (most importantly) no response (SYN-ACK) from
   |  the server at all, i.e., connection timeout.

   TFOinfo is cached when a connection is established.

   State related to other TCP options might not be as readily cached.
   For example, TCP-AO [RFC5925] success or failure between a host-pair
   for a single SYN destination port might be usefully cached.  TCP-AO
   success or failure to other SYN destination ports on that host-pair
   is never useful to cache because TCP-AO security parameters can vary
   per service.
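   As a hedged illustration of option-specific caching, the C sketch
   below applies the RFC 7413 rule quoted above: a cached negative
   response suppresses TFO on the next attempt, while a cached cookie
   permits sending data in the SYN.  All names and the layout are
   hypothetical, not taken from any implementation.

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   /* Minimal TFO slice of a hypothetical per-host cache entry. */
   struct tfo_cache {
       uint8_t failure;      /* old_TFO_failure: negative response   */
       uint8_t cookie[16];   /* old_TFO_cookie                       */
       uint8_t cookie_len;   /* 0 until a cookie has been learned    */
   };

   /* Decide whether the next connection may send data in the SYN.
    * Returns 0 for a regular 3-way handshake (including the case of
    * a plain SYN that merely requests a cookie). */
   int tfo_can_send_data_in_syn(const struct tfo_cache *c,
                                uint8_t *cookie_out, uint8_t *len_out)
   {
       if (c == NULL || c->failure)
           return 0;             /* cached negative response: no TFO */
       if (c->cookie_len == 0)
           return 0;             /* no cookie yet: request one first */
       memcpy(cookie_out, c->cookie, c->cookie_len);
       *len_out = c->cookie_len;
       return 1;                 /* cookie available: attempt TFO    */
   }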
7.  Ensemble Sharing

   Sharing cached TCB data across concurrent connections requires
   attention to the aggregate nature of some of the shared state.  For
   example, although MSS and RTT values can be shared by copying, it
   may not be appropriate to simply copy congestion window or ssthresh
   information; instead, the new values can be a function (f) of the
   cumulative values and the number of connections (N).

7.1.  Initialization of a New TCB

   TCBs for new connections can be initialized using cached context
   from concurrent connections as follows:

              +===================+=========================+
              | Cached TCB        | New TCB                 |
              +===================+=========================+
              | old_MMS_S         | old_MMS_S               |
              +-------------------+-------------------------+
              | old_MMS_R         | old_MMS_R               |
              +-------------------+-------------------------+
              | old_sendMSS       | old_sendMSS             |
              +-------------------+-------------------------+
              | old_PMTU          | old_PMTU (1)            |
              +-------------------+-------------------------+
              | old_RTT           | old_RTT                 |
              +-------------------+-------------------------+
              | old_RTTVAR        | old_RTTVAR              |
              +-------------------+-------------------------+
              | sum(old_ssthresh) | f(sum(old_ssthresh), N) |
              +-------------------+-------------------------+
              | sum(old_sendcwnd) | f(sum(old_sendcwnd), N) |
              +-------------------+-------------------------+
              | old_option        | (option specific)       |
              +-------------------+-------------------------+

               Table 5: Ensemble Sharing - TCB Initialization

   (1)  Note that PMTU is cached at the IP layer [RFC1191] [RFC4821].

   In Table 5, the cached sum() is a total across all active
   connections because these parameters act in aggregate; similarly,
   f() is a function that derives the new connection's initial value
   from that sum and the number of connections (N).

   Table 6 gives an overview of option-specific information that can be
   similarly shared.  Again, the TFO_cookie is updated upon explicit
   client request, which is a separate event.

                   +=================+=================+
                   | Cached          | New             |
                   +=================+=================+
                   | old_TFO_cookie  | old_TFO_cookie  |
                   +-----------------+-----------------+
                   | old_TFO_failure | old_TFO_failure |
                   +-----------------+-----------------+

                        Table 6: Ensemble Sharing -
                         Option Info Initialization

7.2.  Updates to the TCB Cache

   During a connection, the TCB cache can be updated based on changes
   to concurrent connections and their TCBs, as shown below:

      +==============+===============+===========+=================+
      | Cached TCB   | Current TCB   | When?
| New Cached TCB  | +      +==============+===============+===========+=================+ +      | old_MMS_S    | curr_MMS_S    | OPEN      | curr_MMS_S      | +      +--------------+---------------+-----------+-----------------+ +      | old_MMS_R    | curr_MMS_R    | OPEN      | curr_MMS_R      | +      +--------------+---------------+-----------+-----------------+ +      | old_sendMSS  | curr_sendMSS  | MSSopt    | curr_sendMSS    | +      +--------------+---------------+-----------+-----------------+ +      | old_PMTU     | curr_PMTU     | PMTUD+ /  | curr_PMTU       | +      |              |               | PLPMTUD+  |                 | +      +--------------+---------------+-----------+-----------------+ +      | old_RTT      | curr_RTT      | update    | rtt_update(old, | +      |              |               |           | curr)           | +      +--------------+---------------+-----------+-----------------+ +      | old_RTTVAR   | curr_RTTVAR   | update    | rtt_update(old, | +      |              |               |           | curr)           | +      +--------------+---------------+-----------+-----------------+ +      | old_ssthresh | curr_ssthresh | update    | adjust sum as   | +      |              |               |           | appropriate     | +      +--------------+---------------+-----------+-----------------+ +      | old_sendcwnd | curr_sendcwnd | update    | adjust sum as   | +      |              |               |           | appropriate     | +      +--------------+---------------+-----------+-----------------+ +      | old_option   | curr_option   | (depends) | (option         | +      |              |               |           | specific)       | +      +--------------+---------------+-----------+-----------------+ + +                Table 7: Ensemble Sharing - Cache Updates + +   +  Note that the PMTU is cached at the IP layer [RFC1191] [RFC4821]. + +   In Table 7, rtt_update() is the function used to combine old and +   current values, e.g., as a windowed average or exponentially decayed +   average. + +   Table 8 gives an overview of option-specific information that can be +   similarly shared. + +      +=================+=================+=======+=================+ +      | Cached          | Current         | When? | New Cached      | +      +=================+=================+=======+=================+ +      | old_TFO_cookie  | old_TFO_cookie  | ESTAB | old_TFO_cookie  | +      +-----------------+-----------------+-------+-----------------+ +      | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure | +      +-----------------+-----------------+-------+-----------------+ + +              Table 8: Ensemble Sharing - Option Info Updates + +7.3.  Discussion + +   For ensemble sharing, TCB information should be cached as early as +   possible, sometimes before a connection is closed.  Otherwise, +   opening multiple concurrent connections may not result in TCB data +   sharing if no connection closes before others open.  The amount of +   work involved in updating the aggregate average should be minimized, +   but the resulting value should be equivalent to having all values +   measured within a single connection.  The function "rtt_update" in +   Table 7 indicates this operation, which occurs whenever the RTT would +   have been updated in the individual TCP connection.  As a result, the +   cache contains the shared RTT variables, which no longer need to +   reside in the TCB. 
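   As an illustration of one such merge, the sketch below implements
   rtt_update() as the exponentially decayed average suggested in
   Sections 6.2 and 7.2, falling back to the current sample when no
   prior value exists (per Section 6.3).  The weight ALPHA is an
   arbitrary illustrative choice; this document prescribes no
   particular function.

   #include <stdint.h>

   /* One plausible rtt_update(): an exponentially decayed average,
    * new = (1 - alpha) * old + alpha * curr, with alpha in [0..1].
    * ALPHA = 0.125 is illustrative only. */
   #define ALPHA 0.125

   static uint32_t rtt_update(uint32_t old_us, uint32_t curr_us)
   {
       return (uint32_t)((1.0 - ALPHA) * (double)old_us
                         + ALPHA * (double)curr_us);
   }

   /* When no cached value exists yet, cache the current value
    * instead (Section 6.3); otherwise merge. */
   static uint32_t merge_rtt(uint32_t old_us, uint32_t curr_us,
                             int have_old)
   {
       return have_old ? rtt_update(old_us, curr_us) : curr_us;
   }

   A windowed average (the mean of the past N samples) would be an
   equally valid choice of merge function.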
+ +   Congestion window size and ssthresh aggregation are more complicated +   in the concurrent case.  When there is an ensemble of connections, we +   need to decide how that ensemble would have shared these variables, +   in order to derive initial values for new TCBs. + +   Sections 8 and 9 discuss compatibility issues and implications of +   sharing the specific information listed above. + +   There are several ways to initialize the congestion window in a new +   TCB among an ensemble of current connections to a host.  Current TCP +   implementations initialize it to 4 segments as standard [RFC3390] and +   10 segments experimentally [RFC6928].  These approaches assume that +   new connections should behave as conservatively as possible.  The +   algorithm described in [Ba12] adjusts the initial cwnd depending on +   the cwnd values of ongoing connections.  It is also possible to use +   sharing mechanisms over long timescales to adapt TCP's initial window +   automatically, as described further in Appendix C. + +8.  Issues with TCB Information Sharing + +   Here, we discuss various types of problems that may arise with TCB +   information sharing. + +   For the congestion and current window information, the initial values +   computed by TCB interdependence may not be consistent with the long- +   term aggregate behavior of a set of concurrent connections between +   the same endpoints.  Under conventional TCP congestion control, if +   the congestion window of a single existing connection has converged +   to 40 segments, two newly joining concurrent connections will assume +   initial windows of 10 segments [RFC6928] and the existing +   connection's window will not decrease to accommodate this additional +   load.  As a consequence, the three connections can mutually +   interfere.  One example of this is seen on low-bandwidth, high-delay +   links, where concurrent connections supporting Web traffic can +   collide because their initial windows were too large, even when set +   at 1 segment. + +   The authors of [Hu12] recommend caching ssthresh for temporal sharing +   only when flows are long.  Some studies suggest that sharing ssthresh +   between short flows can deteriorate the performance of individual +   connections [Hu12] [Du16], although this may benefit aggregate +   network performance. + +8.1.  Traversing the Same Network Path + +   TCP is sometimes used in situations where packets of the same host- +   pair do not always take the same path, such as when connection- +   specific parameters are used for routing (e.g., for load balancing). +   Multipath routing that relies on examining transport headers, such as +   ECMP and Link Aggregation Group (LAG) [RFC7424], may not result in +   repeatable path selection when TCP segments are encapsulated, +   encrypted, or altered -- for example, in some Virtual Private Network +   (VPN) tunnels that rely on proprietary encapsulation.  Similarly, +   such approaches cannot operate deterministically when the TCP header +   is encrypted, e.g., when using IPsec Encapsulating Security Payload +   (ESP) (although TCB interdependence among the entire set sharing the +   same endpoint IP addresses should work without problems when the TCP +   header is encrypted).  Measures to increase the probability that +   connections use the same path could be applied; for example, the +   connections could be given the same IPv6 flow label [RFC6437].  
TCB +   interdependence can also be extended to sets of host IP address pairs +   that share the same network path conditions, such as when a group of +   addresses is on the same LAN (see Section 9). + +   Traversing the same path is not important for host-specific +   information (e.g., rwnd), TCP option state (e.g., TFOinfo), or for +   information that is already cached per-host (e.g., path MTU).  When +   TCB information is shared across different SYN destination ports, +   path-related information can be incorrect; however, the impact of +   this error is potentially diminished if (as discussed here) TCB +   sharing affects only the transient event of a connection start or if +   TCB information is shared only within connections to the same SYN +   destination port. + +   In the case of temporal sharing, TCB information could also become +   invalid over time, i.e., indicating that although the path remains +   the same, path properties have changed.  Because this is similar to +   the case when a connection becomes idle, mechanisms that address idle +   TCP connections (e.g., [RFC7661]) could also be applied to TCB cache +   management, especially when TCP Fast Open is used [RFC7413]. + +8.2.  State Dependence + +   There may be additional considerations to the way in which TCB +   interdependence rebalances congestion feedback among the current +   connections.  For example, it may be appropriate to consider the +   impact of a connection being in Fast Recovery [RFC5681] or some other +   similar unusual feedback state that could inhibit or affect the +   calculations described herein. + +8.3.  Problems with Sharing Based on IP Address + +   It can be wrong to share TCB information between TCP connections on +   the same host as identified by the IP address if an IP address is +   assigned to a new host (e.g., IP address spinning, as is used by ISPs +   to inhibit running servers).  It can be wrong if Network Address +   Translation (NAT) [RFC2663], Network Address and Port Translation +   (NAPT) [RFC2663], or any other IP sharing mechanism is used.  Such +   mechanisms are less likely to be used with IPv6.  Other methods to +   identify a host could also be considered to make correct TCB sharing +   more likely.  Moreover, some TCB information is about dominant path +   properties rather than the specific host.  IP addresses may differ, +   yet the relevant part of the path may be the same. + +9.  Implications + +   There are several implications to incorporating TCB interdependence +   in TCP implementations.  First, it may reduce the need for +   application-layer multiplexing for performance enhancement [RFC7231]. +   Protocols like HTTP/2 [RFC7540] avoid connection re-establishment +   costs by serializing or multiplexing a set of per-host connections +   across a single TCP connection.  This avoids TCP's per-connection +   OPEN handshake and also avoids recomputing the MSS, RTT, and +   congestion window values.  By avoiding the so-called "slow-start +   restart", performance can be optimized [Hu01].  TCB interdependence +   can provide the "slow-start restart avoidance" of multiplexing, +   without requiring a multiplexing mechanism at the application layer. + +   Like the initial version of this document [RFC2140], this update's +   approach to TCB interdependence focuses on sharing a set of TCBs by +   updating the TCB state to reduce the impact of transients when +   connections begin, end, or otherwise significantly change state. 
   Other mechanisms have since been proposed to continuously share
   information between all ongoing communication (including
   connectionless protocols) and update the congestion state during any
   congestion-related event (e.g., timeout, loss confirmation, etc.)
   [RFC3124].  By dealing exclusively with transients, the approach in
   this document is more likely to exhibit the same "steady-state"
   behavior as unmodified, independent TCP connections.

9.1.  Layering

   TCB interdependence pushes some of the TCP implementation from its
   typical placement solely within the transport layer (in the ISO
   model) to the network layer.  This acknowledges that some components
   of state are, in fact, per-host-pair or can be per-path as indicated
   solely by that host-pair.  Transport protocols typically manage
   per-application-pair associations (per stream), and network
   protocols manage per-host-pair and path associations (routing).
   Round-trip time, MSS, and congestion information could be more
   appropriately handled at the network layer, aggregated among
   concurrent connections, and shared across connection instances
   [RFC3124].

   An earlier version of RTT sharing suggested implementing RTT state
   at the IP layer rather than at the TCP layer.  Our observations
   describe sharing state among TCP connections, which avoids some of
   the difficulties of an IP-layer solution.  One such problem is
   determining the correspondence between packet exchanges using IP
   header information alone, where such correspondence is needed to
   compute RTT.  Because TCB sharing computes RTTs inside the TCP layer
   using TCP header information, it can be implemented more directly
   and simply than at the IP layer.  This is a case where information
   should be computed at the transport layer but could be shared at the
   network layer.

9.2.  Other Possibilities

   Per-host-pair associations are not the limit of these techniques.
   It is possible that TCBs could be similarly shared between hosts on
   a subnet or within a cluster, because the predominant path can be
   subnet-subnet rather than host-host.  Additionally, TCB
   interdependence can be applied to any protocol with congestion
   state, including SCTP [RFC4960] and DCCP [RFC4340], as well as to
   individual subflows in Multipath TCP [RFC8684].

   There may be other information that can be shared between concurrent
   connections.  For example, knowing that another connection has just
   tried to expand its window size and failed, a connection may not
   attempt to do the same for some period.  The idea is that existing
   TCP implementations infer the behavior of all competing connections,
   including those within the same host or subnet.  One possible
   optimization is to make that implicit feedback explicit, via
   extended information associated with the endpoint IP address and its
   TCP implementation, rather than per-connection state in the TCB.

   This document focuses on sharing TCB information at connection
   initialization.  Subsequent to RFC 2140, there have been numerous
   approaches that attempt to coordinate ongoing state across
   concurrent connections, both within TCP and other congestion-
   reactive protocols, which are summarized in [Is18].
   These approaches are more complex to implement, and their comparison
   to steady-state TCP equivalence can be more difficult to establish,
   sometimes intentionally (i.e., they sometimes intend to provide a
   different kind of "fairness" than emerges from TCP operation).

10.  Implementation Observations

   The observation that some TCB state is host-pair specific rather
   than application-pair dependent is not new and is a common
   engineering decision in layered protocol implementations.  Although
   now deprecated, T/TCP [RFC1644] was the first to propose using
   caches in order to maintain TCB states (see Appendix A).

   Table 9 describes the current implementation status for TCB temporal
   sharing in Windows as of December 2020, Apple variants (macOS, iOS,
   iPadOS, tvOS, and watchOS) as of January 2021, Linux kernel version
   5.10.3, and FreeBSD 12.  Ensemble sharing is not yet implemented.

        +==============+=========================================+
        | TCB data     | Status                                  |
        +==============+=========================================+
        | old_MMS_S    | Not shared                              |
        +--------------+-----------------------------------------+
        | old_MMS_R    | Not shared                              |
        +--------------+-----------------------------------------+
        | old_sendMSS  | Cached and shared in Apple, Linux (MSS) |
        +--------------+-----------------------------------------+
        | old_PMTU     | Cached and shared in Apple, FreeBSD,    |
        |              | Windows (PMTU)                          |
        +--------------+-----------------------------------------+
        | old_RTT      | Cached and shared in Apple, FreeBSD,    |
        |              | Linux, Windows                          |
        +--------------+-----------------------------------------+
        | old_RTTVAR   | Cached and shared in Apple, FreeBSD,    |
        |              | Windows                                 |
        +--------------+-----------------------------------------+
        | old_TFOinfo  | Cached and shared in Apple, Linux,      |
        |              | Windows                                 |
        +--------------+-----------------------------------------+
        | old_sendcwnd | Not shared                              |
        +--------------+-----------------------------------------+
        | old_ssthresh | Cached and shared in Apple, FreeBSD*,   |
        |              | Linux*                                  |
        +--------------+-----------------------------------------+
        | TFO failure  | Cached and shared in Apple              |
        +--------------+-----------------------------------------+

                   Table 9: Known Implementation Status

   *  Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and
      its previous value if a previous value exists; in Linux, the
      calculation depends on state and is max(curr_cwnd/2,
      old_ssthresh) in most cases.

   In Table 9, "Apple" refers to all Apple OSes, i.e., macOS (desktop/
   laptop), iOS (phone), iPadOS (tablet), tvOS (video player), and
   watchOS (smart watch), which all share the same Internet protocol
   stack.
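   For concreteness, the two starred ssthresh merges in the note above
   can be restated in C.  These are illustrative paraphrases of the
   behavior described in that note, not excerpts of the actual FreeBSD
   or Linux kernel code.

   #include <stdint.h>

   /* FreeBSD-style: mean of the current value and the prior cached
    * value, when a prior value exists. */
   static uint32_t ssthresh_merge_freebsd(uint32_t old_ssthresh,
                                          uint32_t curr_ssthresh,
                                          int have_old)
   {
       return have_old ? (old_ssthresh + curr_ssthresh) / 2
                       : curr_ssthresh;
   }

   /* Linux-style (most common case): max(curr_cwnd / 2,
    * old_ssthresh); the kernel's actual choice depends on state. */
   static uint32_t ssthresh_merge_linux(uint32_t old_ssthresh,
                                        uint32_t curr_cwnd)
   {
       uint32_t half = curr_cwnd / 2;
       return half > old_ssthresh ? half : old_ssthresh;
   }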
11.  Changes Compared to RFC 2140

   This document updates the description of TCB sharing in RFC 2140 and
   its associated impact on existing and new connection state,
   providing a complete replacement for that document [RFC2140].  It
   clarifies the previous description and terminology and extends the
   mechanism to its impact on new protocols and mechanisms, including
   multipath TCP, Fast Open, PLPMTUD, NAT, and the TCP Authentication
   Option.

   The detailed impact on TCB state addresses TCB parameters with
   greater specificity.  It separates the way MSS is used in both send
   and receive directions, it separates the way both of these MSS
   values differ from sendMSS, it adds both path MTU and ssthresh, and
   it addresses the impact on state associated with TCP options.

   New sections have been added to address compatibility issues and
   implementation observations.  The relation of this work to T/TCP has
   been moved to Appendix A (which describes the history of TCB
   sharing) partly to reflect the deprecation of that protocol.

   Appendix C has been added to discuss the potential to use temporal
   sharing over long timescales to adapt TCP's initial window
   automatically, avoiding the need to periodically revise a single
   global constant value.

   Finally, this document updates and significantly expands the
   referenced literature.

12.  Security Considerations

   The implementation methods presented here do not have additional
   ramifications for direct (connection-aborting or information-
   injecting) attacks on individual connections.  Individual
   connections, whether using sharing or not, also may be susceptible
   to denial-of-service attacks that reduce performance or completely
   deny connections and transfers if not otherwise secured.

   TCB sharing may create additional denial-of-service attacks that
   affect the performance of other connections by polluting the cached
   information.  This can occur across any set of connections in which
   the TCB is shared, between connections in a single host, or between
   hosts if TCB sharing is implemented within a subnet (see
   "Implications" (Section 9)).  Some shared TCB parameters are used
   only to create new TCBs; others are shared among the TCBs of ongoing
   connections.  New connections can join the ongoing set, e.g., to
   optimize send window size among a set of connections to the same
   host.  PMTU is defined as shared at the IP layer and is already
   susceptible in this way.

   Options in client SYNs can be easier to forge than complete, two-way
   connections.  As a result, their values may not be safely
   incorporated in shared values until after the three-way handshake
   completes.

   Attacks on parameters used only for initialization affect only the
   transient performance of a TCP connection.  For short connections,
   the performance ramification can approach that of a denial-of-
   service attack.  For example, if an application changes its TCB to
   have a false and small window size, subsequent connections will
   experience performance degradation until their window grows
   appropriately.

   TCB sharing reuses and mixes information from past and current
   connections.  Although reusing information could create a potential
   for fingerprinting to identify hosts, the mixing reduces that
   potential.
There has been no evidence of fingerprinting based on +   this technique, and it is currently considered safe in that regard. +   Further, information about the performance of a TCP connection has +   not been considered as private. + +13.  IANA Considerations + +   This document has no IANA actions. + +14.  References + +14.1.  Normative References + +   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, +              RFC 793, DOI 10.17487/RFC0793, September 1981, +              <https://www.rfc-editor.org/info/rfc793>. + +   [RFC1122]  Braden, R., Ed., "Requirements for Internet Hosts - +              Communication Layers", STD 3, RFC 1122, +              DOI 10.17487/RFC1122, October 1989, +              <https://www.rfc-editor.org/info/rfc1122>. + +   [RFC1191]  Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, +              DOI 10.17487/RFC1191, November 1990, +              <https://www.rfc-editor.org/info/rfc1191>. + +   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate +              Requirement Levels", BCP 14, RFC 2119, +              DOI 10.17487/RFC2119, March 1997, +              <https://www.rfc-editor.org/info/rfc2119>. + +   [RFC4821]  Mathis, M. and J. Heffner, "Packetization Layer Path MTU +              Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007, +              <https://www.rfc-editor.org/info/rfc4821>. + +   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion +              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, +              <https://www.rfc-editor.org/info/rfc5681>. + +   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent, +              "Computing TCP's Retransmission Timer", RFC 6298, +              DOI 10.17487/RFC6298, June 2011, +              <https://www.rfc-editor.org/info/rfc6298>. + +   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP +              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, +              <https://www.rfc-editor.org/info/rfc7413>. + +   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC +              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, +              May 2017, <https://www.rfc-editor.org/info/rfc8174>. + +   [RFC8201]  McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed., +              "Path MTU Discovery for IP version 6", STD 87, RFC 8201, +              DOI 10.17487/RFC8201, July 2017, +              <https://www.rfc-editor.org/info/rfc8201>. + +14.2.  Informative References + +   [Al10]     Allman, M., "Initial Congestion Window Specification", +              Work in Progress, Internet-Draft, draft-allman-tcpm-bump- +              initcwnd-00, 15 November 2010, +              <https://datatracker.ietf.org/doc/html/draft-allman-tcpm- +              bump-initcwnd-00>. + +   [Ba12]     Barik, R., Welzl, M., Ferlin, S., and O. Alay, "LISA: A +              linked slow-start algorithm for MPTCP", IEEE ICC, +              DOI 10.1109/ICC.2016.7510786, May 2016, +              <https://doi.org/10.1109/ICC.2016.7510786>. + +   [Ba20]     Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit +              Congestion Notification (ECN) to TCP Control Packets", +              Work in Progress, Internet-Draft, draft-ietf-tcpm- +              generalized-ecn-07, 16 February 2021, +              <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm- +              generalized-ecn-07>. + +   [Be94]     Berners-Lee, T., Cailliau, C., Luotonen, A., Nielsen, H., +              and A. 
Secret, "The World-Wide Web", Communications of the +              ACM V37, pp. 76-82, DOI 10.1145/179606.179671, August +              1994, <https://doi.org/10.1145/179606.179671>. + +   [Br02]     Brownlee, N. and KC. Claffy, "Understanding Internet +              traffic streams: dragonflies and tortoises", IEEE +              Communications Magazine, pp. 110-117, +              DOI 10.1109/MCOM.2002.1039865, 2002, +              <https://doi.org/10.1109/MCOM.2002.1039865>. + +   [Br94]     Braden, B., "T/TCP -- Transaction TCP: Source Changes for +              Sun OS 4.1.3", USC/ISI Release 1.0, September 1994. + +   [Co91]     Comer, D. and D. Stevens, "Internetworking with TCP/IP", +              ISBN 10: 0134685059, ISBN 13: 9780134685052, 1991. + +   [Du16]     Dukkipati, N., Cheng, Y., and A. Vahdat, "Research +              Impacting the Practice of Congestion Control", Computer +              Communication Review, The ACM SIGCOMM newsletter, July +              2016. + +   [FreeBSD]  FreeBSD, "The FreeBSD Project", +              <https://www.freebsd.org/>. + +   [Hu01]     Hughes, A., Touch, J., and J. Heidemann, "Issues in TCP +              Slow-Start Restart After Idle", Work in Progress, +              Internet-Draft, draft-hughes-restart-00, December 2001, +              <https://datatracker.ietf.org/doc/html/draft-hughes- +              restart-00>. + +   [Hu12]     Hurtig, P. and A. Brunstrom, "Enhanced metric caching for +              short TCP flows", IEEE International Conference on +              Communications, DOI 10.1109/ICC.2012.6364516, 2012, +              <https://doi.org/10.1109/ICC.2012.6364516>. + +   [IANA]     IANA, "Transmission Control Protocol (TCP) Parameters", +              <https://www.iana.org/assignments/tcp-parameters>. + +   [Is18]     Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G., +              and S. Gjessing, "ctrlTCP: Reducing latency through +              coupled, heterogeneous multi-flow TCP congestion control", +              IEEE INFOCOM 2018 - IEEE Conference on Computer +              Communications Workshops (INFOCOM WKSHPS), +              DOI 10.1109/INFCOMW.2018.8406887, April 2018, +              <https://doi.org/10.1109/INFCOMW.2018.8406887>. + +   [Ja88]     Jacobson, V. and M. Karels, "Congestion Avoidance and +              Control", SIGCOMM Symposium proceedings on Communications +              architectures and protocols, November 1988. + +   [RFC1379]  Braden, R., "Extending TCP for Transactions -- Concepts", +              RFC 1379, DOI 10.17487/RFC1379, November 1992, +              <https://www.rfc-editor.org/info/rfc1379>. + +   [RFC1644]  Braden, R., "T/TCP -- TCP Extensions for Transactions +              Functional Specification", RFC 1644, DOI 10.17487/RFC1644, +              July 1994, <https://www.rfc-editor.org/info/rfc1644>. + +   [RFC2001]  Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast +              Retransmit, and Fast Recovery Algorithms", RFC 2001, +              DOI 10.17487/RFC2001, January 1997, +              <https://www.rfc-editor.org/info/rfc2001>. + +   [RFC2140]  Touch, J., "TCP Control Block Interdependence", RFC 2140, +              DOI 10.17487/RFC2140, April 1997, +              <https://www.rfc-editor.org/info/rfc2140>. + +   [RFC2414]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's +              Initial Window", RFC 2414, DOI 10.17487/RFC2414, September +              1998, <https://www.rfc-editor.org/info/rfc2414>. + +   [RFC2663]  Srisuresh, P. 
and M. Holdrege, "IP Network Address +              Translator (NAT) Terminology and Considerations", +              RFC 2663, DOI 10.17487/RFC2663, August 1999, +              <https://www.rfc-editor.org/info/rfc2663>. + +   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager", +              RFC 3124, DOI 10.17487/RFC3124, June 2001, +              <https://www.rfc-editor.org/info/rfc3124>. + +   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's +              Initial Window", RFC 3390, DOI 10.17487/RFC3390, October +              2002, <https://www.rfc-editor.org/info/rfc3390>. + +   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram +              Congestion Control Protocol (DCCP)", RFC 4340, +              DOI 10.17487/RFC4340, March 2006, +              <https://www.rfc-editor.org/info/rfc4340>. + +   [RFC4960]  Stewart, R., Ed., "Stream Control Transmission Protocol", +              RFC 4960, DOI 10.17487/RFC4960, September 2007, +              <https://www.rfc-editor.org/info/rfc4960>. + +   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP +              Authentication Option", RFC 5925, DOI 10.17487/RFC5925, +              June 2010, <https://www.rfc-editor.org/info/rfc5925>. + +   [RFC6437]  Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, +              "IPv6 Flow Label Specification", RFC 6437, +              DOI 10.17487/RFC6437, November 2011, +              <https://www.rfc-editor.org/info/rfc6437>. + +   [RFC6691]  Borman, D., "TCP Options and Maximum Segment Size (MSS)", +              RFC 6691, DOI 10.17487/RFC6691, July 2012, +              <https://www.rfc-editor.org/info/rfc6691>. + +   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, +              "Increasing TCP's Initial Window", RFC 6928, +              DOI 10.17487/RFC6928, April 2013, +              <https://www.rfc-editor.org/info/rfc6928>. + +   [RFC7231]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer +              Protocol (HTTP/1.1): Semantics and Content", RFC 7231, +              DOI 10.17487/RFC7231, June 2014, +              <https://www.rfc-editor.org/info/rfc7231>. + +   [RFC7323]  Borman, D., Braden, B., Jacobson, V., and R. +              Scheffenegger, Ed., "TCP Extensions for High Performance", +              RFC 7323, DOI 10.17487/RFC7323, September 2014, +              <https://www.rfc-editor.org/info/rfc7323>. + +   [RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. +              Khasnabish, "Mechanisms for Optimizing Link Aggregation +              Group (LAG) and Equal-Cost Multipath (ECMP) Component Link +              Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, +              January 2015, <https://www.rfc-editor.org/info/rfc7424>. + +   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext +              Transfer Protocol Version 2 (HTTP/2)", RFC 7540, +              DOI 10.17487/RFC7540, May 2015, +              <https://www.rfc-editor.org/info/rfc7540>. + +   [RFC7661]  Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating +              TCP to Support Rate-Limited Traffic", RFC 7661, +              DOI 10.17487/RFC7661, October 2015, +              <https://www.rfc-editor.org/info/rfc7661>. + +   [RFC8684]  Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C. 
+              Paasch, "TCP Extensions for Multipath Operation with +              Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March +              2020, <https://www.rfc-editor.org/info/rfc8684>. + +Appendix A.  TCB Sharing History + +   T/TCP proposed using caches to maintain TCB information across +   instances (temporal sharing), e.g., smoothed RTT, RTT variation, +   congestion-avoidance threshold, and MSS [RFC1644].  These values were +   in addition to connection counts used by T/TCP to accelerate data +   delivery prior to the full three-way handshake during an OPEN.  The +   goal was to aggregate TCB components where they reflect one +   association -- that of the host-pair rather than artificially +   separating those components by connection. + +   At least one T/TCP implementation saved the MSS and aggregated the +   RTT parameters across multiple connections but omitted caching the +   congestion window information [Br94], as originally specified in +   [RFC1379].  Some T/TCP implementations immediately updated MSS when +   the TCP MSS header option was received [Br94], although this was not +   addressed specifically in the concepts or functional specification +   [RFC1379] [RFC1644].  In later T/TCP implementations, RTT values were +   updated only after a CLOSE, which does not benefit concurrent +   sessions. + +   Temporal sharing of cached TCB data was originally implemented in the +   Sun OS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same +   [FreeBSD].  As mentioned before, only the MSS and RTT parameters were +   cached, as originally specified in [RFC1379].  Later discussion of T/ +   TCP suggested including congestion control parameters in this cache; +   for example, Section 3.1 of [RFC1644] hints at initializing the +   congestion window to the old window size. + +Appendix B.  TCP Option Sharing and Caching + +   In addition to the options that can be cached and shared, this memo +   also lists known TCP options [IANA] for which state is unsafe to be +   kept.  This list is not intended to be authoritative or exhaustive. 
Appendix C.  Automating the Initial Window in TCP over Long Timescales

C.1.  Introduction

   Temporal sharing, as described earlier in this document, builds on
   the assumption that multiple consecutive connections between the
   same host-pair are somewhat likely to be exposed to similar
   environment characteristics.  The stored information can become
   less accurate over time, and suitable precautions should take this
   aging into consideration (this is discussed further in
   Section 8.1).  However, there are also cases where it can make
   sense to track these values over longer periods, observing
   properties of TCP connections to gradually influence evolving
   trends in TCP parameters.  This appendix describes an example of
   such a case.

   TCP's congestion control algorithm uses an initial window value
   (IW) both as a starting point for new connections and as an upper
   limit for restarting after an idle period [RFC5681] [RFC7661].
   This value has evolved over time; it was originally 1 maximum
   segment size (MSS) and increased to the lesser of 4 MSSs or 4,380
   bytes [RFC3390] [RFC5681].  For a typical Internet connection with
   a maximum transmission unit (MTU) of 1500 bytes, this permits 3
   segments of 1,460 bytes each.

   The IW value was implied in the original TCP congestion control
   description and documented as a standard in 1997 [RFC2001] [Ja88].
   The value was updated experimentally in 1998 and moved to the
   Standards Track in 2002 [RFC2414] [RFC3390].  In 2013, it was
   experimentally increased to 10 MSSs [RFC6928].

   This appendix discusses how TCP can objectively measure when an IW
   is too large, and it argues that such feedback should be used over
   long timescales to adjust the IW automatically.  The result should
   be safer to deploy and might avoid the need to repeatedly revisit
   the IW over time.

   Note that this mechanism attempts to make the IW more adaptive over
   time.
   It can increase the IW beyond that which is currently recommended
   for wide-scale deployment, so its use should be carefully
   monitored.

C.2.  Design Considerations

   TCP's IW value has remained static for over two decades, so any
   mechanism for adjusting the IW dynamically should have similarly
   stable, non-invasive effects on the performance and complexity of
   TCP.  In order to be fair, the IW should be similar for most
   machines on the public Internet.  Finally, a desirable goal is to
   develop a self-correcting algorithm so that IW values that cause
   network problems can be avoided.  To that end, we propose the
   following design goals:

   *  Impose little to no impact on TCP in the absence of loss, i.e.,
      it should not increase the complexity of default packet
      processing in the normal case.

   *  Adapt to network feedback over long timescales, avoiding values
      that persistently cause network problems.

   *  Decrease the IW in the presence of sustained loss of IW
      segments, as determined over a number of different connections.

   *  Increase the IW in the absence of sustained loss of IW segments,
      as determined over a number of different connections.

   *  Operate conservatively, i.e., tend towards leaving the IW the
      same in the absence of sufficient information, and give greater
      consideration to IW segment loss than IW segment success.

   We expect that, without other context, a good IW algorithm will
   converge to a single value, but this is not required.  An endpoint
   with additional context or information, or deployed in a
   constrained environment, can always use a different value.  In
   particular, information from previous connections, or sets of
   connections with a similar path, can already be used as context for
   such decisions (as noted in the core of this document).

   However, if a given IW value persistently causes packet loss during
   the initial burst of packets, it is clearly inappropriate and could
   be inducing unnecessary loss in other competing connections.  This
   might happen for sites behind very slow boxes with small buffers,
   which may or may not be the first hop.

C.3.  Proposed IW Algorithm

   Below is a simple description of the proposed IW algorithm.  It
   relies on the following parameters:

   *  MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])

   *  MaxIW = 10 MSS (as per [RFC6928])

   *  MulDecr = 0.5

   *  AddIncr = 2 MSS

   *  Threshold = 0.05

   We assume that the minimum IW (MinIW) should be as currently
   specified as standard [RFC3390].  The maximum IW (MaxIW) can be set
   to a fixed value (we suggest using the experimental and now
   somewhat de facto standard in [RFC6928]) or set based on a schedule
   if trusted time references are available [Al10]; here, we prefer a
   fixed value.  We also propose to use an Additive Increase
   Multiplicative Decrease (AIMD) algorithm, with increases and
   decreases as noted.

   Although these parameters are somewhat arbitrary, their initial
   values are not important, except that the algorithm is AIMD and the
   MaxIW should not exceed that recommended for other systems on the
   Internet (here, we selected the current de facto standard rather
   than the actual standard).  Current proposals, including current
   default operation, are degenerate cases of the algorithm below for
   given parameters, notably MulDecr = 1.0 and AddIncr = 0 MSS, thus
   disabling the automatic part of the algorithm.

   The proposed algorithm is as follows:

   1.  On boot:

         IW = MaxIW; # assume this is in bytes and indicates an integer
                     # multiple of 2 MSS (an even number to support
                     # ACK compression)

   2.  Upon starting a new connection:

         CWND = IW;
         conncount++;
         IWnotchecked = 1; # true

   3.  During a connection's SYN-ACK processing, if the SYN-ACK
       includes ECN (as similarly addressed in Section 5 of ECN++ for
       TCP [Ba20]), treat it as if the IW is too large:

         if (IWnotchecked && (synackecn == 1)) {
            losscount++;
            IWnotchecked = 0; # never check again
         }

   4.  During a connection, if retransmission occurs, check the seqno
       of the outgoing packet (in bytes) to see if the re-sent segment
       fixes an IW loss:

         if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) {
            losscount++;
            IWnotchecked = 0; # never do this entire "if" again
         } else {
            IWnotchecked = 0; # you're beyond the IW so stop checking
         }

   5.  Once every 1000 connections, as a separate process (i.e., not
       as part of processing a given connection):

         if (conncount > 1000) {
            if (losscount/conncount > Threshold) {
               # the number of connections with errors is too high
               IW = IW * MulDecr;
            } else {
               IW = IW + AddIncr;
            }
            conncount = 0; # reset the counts for the
            losscount = 0; # next evaluation interval
         }
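   To make the steps above concrete, the following is a minimal,
   compilable C sketch of the same logic.  It assumes a
   single-threaded, host-wide context; the type and function names
   (struct tcp_conn, iw_conn_start(), etc.) are illustrative
   inventions of this example, not part of any existing stack.  The
   sketch also clamps the result to [MinIW, MaxIW], which the
   pseudocode leaves implicit, and the sequence number wraparound
   caveat discussed below applies here as well.

   /* Illustrative restatement of steps 1-5; hypothetical names
    * throughout.  A real stack would call these hooks from connection
    * setup, SYN-ACK processing, its retransmission path, and a
    * periodic timer. */

   #include <stdint.h>

   #define MSS        1460
   #define MIN_IW     (3 * MSS)   /* per RFC 3390 */
   #define MAX_IW     (10 * MSS)  /* per RFC 6928 */
   #define MUL_DECR   0.5
   #define ADD_INCR   (2 * MSS)
   #define THRESHOLD  0.05

   /* Host-wide state; step 1 (on boot) is the initializer. */
   static uint32_t iw = MAX_IW;
   static uint32_t conncount, losscount;

   struct tcp_conn {
       uint32_t isn;           /* initial sequence number */
       uint32_t cwnd;
       int      iw_notchecked; /* still watching the first IW? */
   };

   /* Step 2: upon starting a new connection. */
   void iw_conn_start(struct tcp_conn *c, uint32_t isn)
   {
       c->isn = isn;
       c->cwnd = iw;
       c->iw_notchecked = 1;
       conncount++;
   }

   /* Step 3: the SYN-ACK carried an ECN congestion indication. */
   void iw_synack_ecn(struct tcp_conn *c)
   {
       if (c->iw_notchecked) {
           losscount++;
           c->iw_notchecked = 0;  /* never check again */
       }
   }

   /* Step 4: called whenever a segment is retransmitted. */
   void iw_retransmit(struct tcp_conn *c, uint32_t seqno)
   {
       if (c->iw_notchecked && (seqno - c->isn) < iw)
           losscount++;           /* re-sent segment fixes an IW loss */
       c->iw_notchecked = 0;      /* counted, or beyond the IW */
   }

   /* Step 5: periodic evaluation, outside per-packet processing. */
   void iw_evaluate(void)
   {
       if (conncount > 1000) {
           if ((double)losscount / conncount > THRESHOLD)
               iw = (uint32_t)(iw * MUL_DECR);  /* too many IW losses */
           else
               iw += ADD_INCR;
           if (iw < MIN_IW) iw = MIN_IW;  /* clamp to the parameters */
           if (iw > MAX_IW) iw = MAX_IW;
           conncount = losscount = 0;     /* start a new interval */
       }
   }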
   As presented, this algorithm can yield a false positive when the
   sequence number wraps around, e.g., the code might increment
   losscount in step 4 when no loss occurred or fail to increment
   losscount when a loss did occur.  This can be avoided using either
   Protection Against Wrapped Sequences (PAWS) [RFC7323] context or
   internal extended sequence number representations (as in the TCP
   Authentication Option (TCP-AO) [RFC5925]).  Alternately, false
   positives can be tolerated because they are expected to be
   infrequent and thus will not significantly impact the algorithm.

   A number of additional constraints need to be imposed if this
   mechanism is implemented to ensure that it defaults to values that
   comply with current Internet standards, is conservative in how it
   extends those values, and returns to those values in the absence of
   positive feedback (i.e., success).  To that end, we recommend the
   following list of example constraints (a sketch of how the
   per-interval bounds might be enforced follows the list):

   *  The automatic IW algorithm MUST initialize MaxIW to a value no
      larger than the currently recommended Internet default in the
      absence of other context information.

      Thus, if there are too few connections to make a decision or if
      there is otherwise insufficient information to increase the IW,
      then the MaxIW defaults to the current recommended value.

   *  An implementation MAY allow the MaxIW to grow beyond the
      currently recommended Internet default but not more than 2
      segments per calendar year.

      Thus, if an endpoint has a persistent history of successfully
      transmitting IW segments without loss, then it is allowed to
      probe the Internet to determine if larger IW values have similar
      success.  This probing is limited and requires a trusted time
      source; otherwise, the MaxIW remains constant.

   *  An implementation MUST adjust the IW based on loss statistics at
      least once every 1000 connections.

      An endpoint needs to be sufficiently reactive to IW loss.

   *  An implementation MUST decrease the IW by at least 1 MSS when
      indicated during an evaluation interval.

      An endpoint that detects loss needs to decrease its IW by at
      least 1 MSS; otherwise, it is not participating in an automatic
      reactive algorithm.

   *  An implementation MUST increase the IW by no more than 2 MSSs
      per evaluation interval.

      An endpoint that does not experience IW loss needs to probe the
      network incrementally.

   *  An implementation SHOULD use an IW that is an integer multiple
      of 2 MSSs.

      The IW should remain a multiple of 2 MSSs to enable efficient
      ACK compression without incurring unnecessary timeouts.

   *  An implementation MUST decrease the IW if more than 95% of
      connections have IW losses.

      Again, this is to ensure that an implementation is sufficiently
      reactive.

   *  An implementation MAY group IW values and statistics within
      subsets of connections.  Such grouping MAY use any information
      about connections to form groups except loss statistics.
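   The per-interval bounds above can be folded into a single clamping
   step when a new IW is computed.  The following C sketch is one such
   encoding; the function name and its arguments are hypothetical.
   Note that the MinIW floor of 3 MSSs (per [RFC3390]) is deliberately
   allowed to override the SHOULD-level "multiple of 2 MSSs" rounding,
   since 3 MSSs is not an even multiple.

   /* Sketch: bound a proposed IW update per the constraints above.
    * old_iw and new_iw are in bytes; loss_indicated reflects whether
    * the evaluation interval crossed the loss threshold. */

   #include <stdint.h>

   #define MSS    1460
   #define MIN_IW (3 * MSS)  /* RFC 3390 floor */

   uint32_t apply_iw_bounds(uint32_t old_iw, uint32_t new_iw,
                            int loss_indicated)
   {
       if (loss_indicated) {
           /* MUST decrease by at least 1 MSS. */
           uint32_t cap = (old_iw > MSS) ? old_iw - MSS : MSS;
           if (new_iw > cap)
               new_iw = cap;
       } else {
           /* MUST increase by no more than 2 MSSs per interval. */
           if (new_iw > old_iw + 2 * MSS)
               new_iw = old_iw + 2 * MSS;
       }

       /* SHOULD be an integer multiple of 2 MSSs (ACK compression)... */
       new_iw -= new_iw % (2 * MSS);

       /* ...but never below the standard minimum. */
       if (new_iw < MIN_IW)
           new_iw = MIN_IW;
       return new_iw;
   }

   The MaxIW growth limit of 2 segments per calendar year is not shown
   here; it requires a trusted time source, as noted above, and would
   be enforced where MaxIW itself is updated.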
   There are some TCP connections that might not be counted at all,
   such as those to/from loopback addresses or those within the same
   subnet as that of a local interface (for which congestion control
   is sometimes disabled anyway).  This may also include connections
   that terminate before the IW is fully used, which can be determined
   by a separate check at the time the connection closes.

   The period over which the IW is updated is intended to be a long
   timescale, e.g., a month or so, or 1,000 connections, whichever is
   longer.  An implementation might check the IW once a month and
   simply not update the IW or clear the connection counts in months
   where the number of connections is too small.
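   A minimal sketch of that gating, reusing the iw_evaluate() and
   conncount names from the earlier sketch, might look as follows; the
   month length and the use of time() as a (trusted) time source are
   assumptions of this example.

   /* Run the periodic IW evaluation on a long timescale: monthly,
    * but only when enough connections have been observed; otherwise
    * the counts carry forward into the next month. */

   #include <time.h>
   #include <stdint.h>

   extern uint32_t conncount;   /* from the earlier sketch */
   void iw_evaluate(void);      /* likewise */

   #define EVAL_SECS (30 * 24 * 3600)  /* roughly one month */

   static time_t last_eval;

   void iw_periodic(void)
   {
       time_t now = time(NULL);

       if (now - last_eval < EVAL_SECS)
           return;                /* less than a month elapsed */
       if (conncount <= 1000)
           return;                /* too few samples; keep counting */
       iw_evaluate();             /* resets the counts itself */
       last_eval = now;
   }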
C.4.  Discussion

   There are numerous parameter choices for the above algorithm that
   comply with the given requirements; this is intended to allow
   variation in configuration and implementation while ensuring that
   all such algorithms are reactive and safe.

   This algorithm continues to assume segments because that is the
   basis of most TCP implementations.  It might be useful to consider
   revising the specifications to allow byte-based congestion control
   given sufficient experience.

   The algorithm checks for IW losses only during the first IW after a
   connection start; it does not check for IW losses wherever else the
   IW is used, e.g., during slow-start restarts.

   *  An implementation MAY detect IW losses during slow-start
      restarts in addition to losses during the first IW of a
      connection.  In this case, the implementation MUST count each
      restart as a "connection" for the purposes of connection counts
      and periodic rechecking of the IW value.

   False positives can occur during some kinds of segment reordering,
   e.g., reordering that might trigger spurious retransmissions even
   without a true segment loss.  These are not expected to be
   sufficiently common to dominate the algorithm and its conclusions.

   This mechanism does require additional per-connection state, which
   is already common in some implementations and is useful for other
   reasons (e.g., the ISN is used in TCP-AO [RFC5925]).  The mechanism
   in this appendix also benefits from persistent state kept across
   reboots, which would also be useful to other state-sharing
   mechanisms (e.g., TCP Control Block sharing per the main body of
   this document).

   The receive window (rwnd) is not involved in this calculation.  The
   size of rwnd is determined by receiver resources and provides space
   to accommodate segment reordering.  Also, rwnd is not involved with
   congestion control, which is the focus of the way this appendix
   manages the IW.

C.5.  Observations

   The IW may not converge to a single global value.  It also may not
   converge at all but rather may oscillate by a few MSSs as it
   repeatedly probes the Internet for larger IWs and fails.  Both
   properties are consistent with TCP behavior during each individual
   connection.

   This mechanism assumes that losses during the IW are due to the IW
   size.  Persistent errors that drop packets for other reasons, e.g.,
   OS bugs, can cause false positives.  Again, this is consistent with
   TCP's basic assumption that loss is caused by congestion and
   requires backoff.  This algorithm treats the IW of new connections
   as a long-timescale backoff system.

Acknowledgments

   The authors would like to thank Praveen Balasubramanian for
   information regarding TCB sharing in Windows; Christoph Paasch for
   information regarding TCB sharing in Apple OSs; Yuchung Cheng, Lars
   Eggert, Ilpo Jarvinen, and Michael Scharf for comments on earlier
   draft versions of this document; as well as members of the TCPM WG.
   Earlier revisions of this work received funding from a
   collaborative research project between the University of Oslo and
   Huawei Technologies Co., Ltd. and were partly supported by
   USC/ISI's Postel Center.

Authors' Addresses

   Joe Touch
   Manhattan Beach, CA 90266
   United States of America

   Phone: +1 (310) 560-0334
   Email: touch@strayalpha.com


   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   N-0316 Oslo
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no


   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   N-0316 Oslo
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no