From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc9040.txt | 1573 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1573 insertions(+) create mode 100644 doc/rfc/rfc9040.txt (limited to 'doc/rfc/rfc9040.txt') diff --git a/doc/rfc/rfc9040.txt b/doc/rfc/rfc9040.txt new file mode 100644 index 0000000..439a948 --- /dev/null +++ b/doc/rfc/rfc9040.txt @@ -0,0 +1,1573 @@ + + + + +Internet Engineering Task Force (IETF) J. Touch +Request for Comments: 9040 Independent +Obsoletes: 2140 M. Welzl +Category: Informational S. Islam +ISSN: 2070-1721 University of Oslo + July 2021 + + + TCP Control Block Interdependence + +Abstract + + This memo provides guidance to TCP implementers that is intended to + help improve connection convergence to steady-state operation without + affecting interoperability. It updates and replaces RFC 2140's + description of sharing TCP state, as typically represented in TCP + Control Blocks, among similar concurrent or consecutive connections. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Not all documents + approved by the IESG are candidates for any level of Internet + Standard; see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc9040. + +Copyright Notice + + Copyright (c) 2021 IETF Trust and the persons identified as the + document authors. All rights reserved. 
+ + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction + 2. Conventions Used in This Document + 3. Terminology + 4. The TCP Control Block (TCB) + 5. TCB Interdependence + 6. Temporal Sharing + 6.1. Initialization of a New TCB + 6.2. Updates to the TCB Cache + 6.3. Discussion + 7. Ensemble Sharing + 7.1. Initialization of a New TCB + 7.2. Updates to the TCB Cache + 7.3. Discussion + 8. Issues with TCB Information Sharing + 8.1. Traversing the Same Network Path + 8.2. State Dependence + 8.3. Problems with Sharing Based on IP Address + 9. Implications + 9.1. Layering + 9.2. Other Possibilities + 10. Implementation Observations + 11. Changes Compared to RFC 2140 + 12. Security Considerations + 13. IANA Considerations + 14. References + 14.1. Normative References + 14.2. Informative References + Appendix A. TCB Sharing History + Appendix B. TCP Option Sharing and Caching + Appendix C. Automating the Initial Window in TCP over Long + Timescales + C.1. Introduction + C.2. Design Considerations + C.3. Proposed IW Algorithm + C.4. Discussion + C.5. Observations + Acknowledgments + Authors' Addresses + +1. Introduction + + TCP is a connection-oriented reliable transport protocol layered over + IP [RFC0793]. Each TCP connection maintains state, usually in a data + structure called the "TCP Control Block (TCB)". 
The TCB contains + information about the connection state, its associated local process, + and feedback parameters about the connection's transmission + properties. As originally specified and usually implemented, most + TCB information is maintained on a per-connection basis. Some + implementations share certain TCB information across connections to + the same host [RFC2140]. Such sharing is intended to lead to better + overall transient performance, especially for numerous short-lived + and simultaneous connections, as can be used in the World Wide Web + and other applications [Be94] [Br02]. This sharing of state is + intended to help TCP connections converge to long-term behavior + (assuming stable application load, i.e., so-called "steady-state") + more quickly without affecting TCP interoperability. + + This document updates RFC 2140's discussion of TCB state sharing and + provides a complete replacement for that document. This state + sharing affects only TCB initialization [RFC2140] and thus has no + effect on the long-term behavior of TCP after a connection has been + established or on interoperability. Path information shared across + SYN destination port numbers assumes that TCP segments having the + same host-pair experience the same path properties, i.e., that + traffic is not routed differently based on port numbers or other + connection parameters (also addressed further in Section 8.1). The + observations about TCB sharing in this document apply similarly to + any protocol with congestion state, including the Stream Control + Transmission Protocol (SCTP) [RFC4960] and the Datagram Congestion + Control Protocol (DCCP) [RFC4340], as well as to individual subflows + in Multipath TCP [RFC8684]. + +2. 
Conventions Used in This Document + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all + capitals, as shown here. + + The core of this document describes behavior that is already + permitted by TCP standards. As a result, this document provides + informative guidance but does not use normative language except when + quoting other documents. Normative language is used in Appendix C as + examples of requirements for future consideration. + +3. Terminology + + The following terminology is used frequently in this document. Items + preceded with a "+" may be part of the state maintained as TCP + connection state in the TCB of associated connections and are the + focus of sharing as described in this document. Note that terms are + used as originally introduced where possible; in some cases, + direction is indicated with a suffix (_S for send, _R for receive) + and in other cases spelled out (sendcwnd). 
+ + +cwnd: TCP congestion window size [RFC5681] + + host: a source or sink of TCP segments associated with a single IP + address + + host-pair: a pair of hosts and their corresponding IP addresses + + ISN: Initial Sequence Number + + +MMS_R: maximum message size that can be received, the largest + received transport payload of an IP datagram [RFC1122] + + +MMS_S: maximum message size that can be sent, the largest + transmitted transport payload of an IP datagram [RFC1122] + + path: an Internet path between the IP addresses of two hosts + + PCB: protocol control block, the data associated with a protocol as + maintained by an endpoint; a TCP PCB is called a "TCB" + + PLPMTUD: packetization-layer path MTU discovery, a mechanism that + uses transport packets to discover the Path Maximum + Transmission Unit (PMTU) [RFC4821] + + +PMTU: largest IP datagram that can traverse a path [RFC1191] + [RFC8201] + + PMTUD: path-layer MTU discovery, a mechanism that relies on ICMP + error messages to discover the PMTU [RFC1191] [RFC8201] + + +RTT: round-trip time of a TCP packet exchange [RFC0793] + + +RTTVAR: variation of round-trip times of a TCP packet exchange + [RFC6298] + + +rwnd: TCP receive window size [RFC5681] + + +sendcwnd: TCP send-side congestion window (cwnd) size [RFC5681] + + +sendMSS: TCP maximum segment size, a value transmitted in a TCP + option that represents the largest TCP user data payload that + can be received [RFC6691] + + +ssthresh: TCP slow-start threshold [RFC5681] + + TCB: TCP Control Block, the data associated with a TCP connection as + maintained by an endpoint + + TCP-AO: TCP Authentication Option [RFC5925] + + TFO: TCP Fast Open option [RFC7413] + + +TFO_cookie: TCP Fast Open cookie, state that is used as part of the + TFO mechanism, when TFO is supported [RFC7413] + + +TFO_failure: an indication of when TFO option negotiation failed, + when TFO is supported + + +TFOinfo: information cached when a TFO connection is established, + which includes 
the TFO_cookie [RFC7413] + +4. The TCP Control Block (TCB) + + A TCB describes the data associated with each connection, i.e., with + each association of a pair of applications across the network. The + TCB contains at least the following information [RFC0793]: + + Local process state + + pointers to send and receive buffers + pointers to retransmission queue and current segment + pointers to Internet Protocol (IP) PCB + + Per-connection shared state + + macro-state + connection state + timers + flags + local and remote host numbers and ports + TCP option state + micro-state + send and receive window state (size*, current number) + congestion window size (sendcwnd)* + congestion window size threshold (ssthresh)* + max window size seen* + sendMSS# + MMS_S# + MMS_R# + PMTU# + round-trip time and its variation# + + The per-connection information is shown as split into macro-state and + micro-state, terminology borrowed from [Co91]. Macro-state describes + the protocol for establishing the initial shared state about the + connection; we include the endpoint numbers and components (timers, + flags) required upon commencement that are later used to help + maintain that state. Micro-state describes the protocol after a + connection has been established, to maintain the reliability and + congestion control of the data transferred in the connection. + + We distinguish two other classes of shared micro-state that are + associated more with host-pairs than with application pairs. One + class is clearly host-pair dependent (shown above as "#", e.g., + sendMSS, MMS_R, MMS_S, PMTU, RTT), because these parameters are + defined by the endpoint or endpoint pair (of the given example: + sendMSS, MMS_R, MMS_S, RTT) or are already cached and shared on that + basis (of the given example: PMTU [RFC1191] [RFC4821]). The other is + host-pair dependent in its aggregate (shown above as "*", e.g., + congestion window information, current window sizes, etc.) 
because + they depend on the total capacity between the two endpoints. + + Not all of the TCB state is necessarily shareable. In particular, + some TCP options are negotiated only upon request by the application + layer, so their use may not be correlated across connections. Other + options negotiate connection-specific parameters, which are similarly + not shareable. These are discussed further in Appendix B. + + Finally, we exclude rwnd from further discussion because its value + should depend on the send window size, so it is already addressed by + send window sharing and is not independently affected by sharing. + +5. TCB Interdependence + + There are two cases of TCB interdependence. Temporal sharing occurs + when the TCB of an earlier (now CLOSED) connection to a host is used + to initialize some parameters of a new connection to that same host, + i.e., in sequence. Ensemble sharing occurs when a currently active + connection to a host is used to initialize another (concurrent) + connection to that host. + +6. Temporal Sharing + + The TCB data cache is accessed in two ways: it is read to initialize + new TCBs and written when more current per-host state is available. + +6.1. 
Initialization of a New TCB + + TCBs for new connections can be initialized using cached context from + past connections as follows: + + +==============+=============================+ + | Cached TCB | New TCB | + +==============+=============================+ + | old_MMS_S | old_MMS_S or not cached (2) | + +--------------+-----------------------------+ + | old_MMS_R | old_MMS_R or not cached (2) | + +--------------+-----------------------------+ + | old_sendMSS | old_sendMSS | + +--------------+-----------------------------+ + | old_PMTU | old_PMTU (1) | + +--------------+-----------------------------+ + | old_RTT | old_RTT | + +--------------+-----------------------------+ + | old_RTTVAR | old_RTTVAR | + +--------------+-----------------------------+ + | old_option | (option specific) | + +--------------+-----------------------------+ + | old_ssthresh | old_ssthresh | + +--------------+-----------------------------+ + | old_sendcwnd | old_sendcwnd | + +--------------+-----------------------------+ + + Table 1: Temporal Sharing - TCB Initialization + + (1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821]. + + (2) Note that some values are not cached when they are computed + locally (MMS_R) or indicated in the connection itself (MMS_S in + the SYN). + + Table 2 gives an overview of option-specific information that can be + shared. Additional information on some specific TCP options and + sharing is provided in Appendix B. + + +=================+=================+ + | Cached | New | + +=================+=================+ + | old_TFO_cookie | old_TFO_cookie | + +-----------------+-----------------+ + | old_TFO_failure | old_TFO_failure | + +-----------------+-----------------+ + + Table 2: Temporal Sharing - + Option Info Initialization + +6.2. Updates to the TCB Cache + + During a connection, the TCB cache can be updated based on events of + current connections and their TCBs as they progress over time, as + shown in Table 3. 
+ + +==============+===============+=============+=================+ + | Cached TCB | Current TCB | When? | New Cached TCB | + +==============+===============+=============+=================+ + | old_MMS_S | curr_MMS_S | OPEN | curr_MMS_S | + +--------------+---------------+-------------+-----------------+ + | old_MMS_R | curr_MMS_R | OPEN | curr_MMS_R | + +--------------+---------------+-------------+-----------------+ + | old_sendMSS | curr_sendMSS | MSSopt | curr_sendMSS | + +--------------+---------------+-------------+-----------------+ + | old_PMTU | curr_PMTU | PMTUD (1) / | curr_PMTU | + | | | PLPMTUD (1) | | + +--------------+---------------+-------------+-----------------+ + | old_RTT | curr_RTT | CLOSE | merge(curr,old) | + +--------------+---------------+-------------+-----------------+ + | old_RTTVAR | curr_RTTVAR | CLOSE | merge(curr,old) | + +--------------+---------------+-------------+-----------------+ + | old_option | curr_option | ESTAB | (depends on | + | | | | option) | + +--------------+---------------+-------------+-----------------+ + | old_ssthresh | curr_ssthresh | CLOSE | merge(curr,old) | + +--------------+---------------+-------------+-----------------+ + | old_sendcwnd | curr_sendcwnd | CLOSE | merge(curr,old) | + +--------------+---------------+-------------+-----------------+ + + Table 3: Temporal Sharing - Cache Updates + + (1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821]. + + Merge() is the function that combines the current and previous (old) + values and may vary for each parameter of the TCB cache. The + particular function is not specified in this document; examples + include windowed averages (mean of the past N values, for some N) and + exponential decay (new = (1-alpha)*old + alpha *new, where alpha is + in the range [0..1]). + + Table 4 gives an overview of option-specific information that can be + similarly shared. 
The TFO cookie is maintained until the client + explicitly requests it be updated as a separate event. + + +=================+=================+=======+=================+ + | Cached | Current | When? | New Cached | + +=================+=================+=======+=================+ + | old_TFO_cookie | old_TFO_cookie | ESTAB | old_TFO_cookie | + +-----------------+-----------------+-------+-----------------+ + | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure | + +-----------------+-----------------+-------+-----------------+ + + Table 4: Temporal Sharing - Option Info Updates + +6.3. Discussion + + As noted, there is no particular benefit to caching MMS_S and MMS_R + as these are reported by the local IP stack. Caching sendMSS and + PMTU is trivial; reported values are cached (PMTU at the IP layer), + and the most recent values are used. The cache is updated when the + MSS option is received in a SYN or after PMTUD (i.e., when an ICMPv4 + Fragmentation Needed [RFC1191] or ICMPv6 Packet Too Big message is + received [RFC8201] or the equivalent is inferred, e.g., as from + PLPMTUD [RFC4821]), respectively, so the cache always has the most + recent values from any connection. For sendMSS, the cache is + consulted only at connection establishment and not otherwise updated, + which means that MSS options do not affect current connections. The + default sendMSS is never saved; only reported MSS values update the + cache, so an explicit override is required to reduce the sendMSS. + Cached sendMSS affects only data sent in the SYN segment, i.e., + during client connection initiation or during simultaneous open; the + MSS of all other segments are constrained by the value updated as + included in the SYN. + + RTT values are updated by formulae that merge the old and new values, + as noted in Section 6.2. Dynamic RTT estimation requires a sequence + of RTT measurements. 
As a result, the cached RTT (and its variation) + is an average of its previous value with the contents of the + currently active TCB for that host, when a TCB is closed. RTT values + are updated only when a connection is closed. The method for merging + old and current values needs to attempt to reduce the transient + effects of the new connections. + + The updates for RTT, RTTVAR, and ssthresh rely on existing + information, i.e., old values. Should no such values exist, the + current values are cached instead. + + TCP options are copied or merged depending on the details of each + option. For example, TFO state is updated when a connection is + established and read before establishing a new connection. + + Sections 8 and 9 discuss compatibility issues and implications of + sharing the specific information listed above. Section 10 gives an + overview of known implementations. + + Most cached TCB values are updated when a connection closes. The + exceptions are MMS_R and MMS_S, which are reported by IP [RFC1122]; + PMTU, which is updated after Path MTU Discovery and also reported by + IP [RFC1191] [RFC4821] [RFC8201]; and sendMSS, which is updated if + the MSS option is received in the TCP SYN header. + + Sharing sendMSS information affects only data in the SYN of the next + connection, because sendMSS information is typically included in most + TCP SYN segments. Caching PMTU can accelerate the efficiency of + PMTUD but can also result in black-holing until corrected if in + error. Caching MMS_R and MMS_S may be of little direct value as they + are reported by the local IP stack anyway. + + The way in which state related to other TCP options can be shared + depends on the details of that option. For example, TFO state + includes the TCP Fast Open cookie [RFC7413] or, in case TFO fails, a + negative TCP Fast Open response. RFC 7413 states, + + | The client MUST cache negative responses from the server in order + | to avoid potential connection failures. 
Negative responses + | include the server not acknowledging the data in the SYN, ICMP + | error messages, and (most importantly) no response (SYN-ACK) from + | the server at all, i.e., connection timeout. + + TFOinfo is cached when a connection is established. + + State related to other TCP options might not be as readily cached. + For example, TCP-AO [RFC5925] success or failure between a host-pair + for a single SYN destination port might be usefully cached. TCP-AO + success or failure to other SYN destination ports on that host-pair + is never useful to cache because TCP-AO security parameters can vary + per service. + +7. Ensemble Sharing + + Sharing cached TCB data across concurrent connections requires + attention to the aggregate nature of some of the shared state. For + example, although MSS and RTT values can be shared by copying, it may + not be appropriate to simply copy congestion window or ssthresh + information; instead, the new values can be a function (f) of the + cumulative values and the number of connections (N). + +7.1. 
Initialization of a New TCB + + TCBs for new connections can be initialized using cached context from + concurrent connections as follows: + + +===================+=========================+ + | Cached TCB | New TCB | + +===================+=========================+ + | old_MMS_S | old_MMS_S | + +-------------------+-------------------------+ + | old_MMS_R | old_MMS_R | + +-------------------+-------------------------+ + | old_sendMSS | old_sendMSS | + +-------------------+-------------------------+ + | old_PMTU | old_PMTU (1) | + +-------------------+-------------------------+ + | old_RTT | old_RTT | + +-------------------+-------------------------+ + | old_RTTVAR | old_RTTVAR | + +-------------------+-------------------------+ + | sum(old_ssthresh) | f(sum(old_ssthresh), N) | + +-------------------+-------------------------+ + | sum(old_sendcwnd) | f(sum(old_sendcwnd), N) | + +-------------------+-------------------------+ + | old_option | (option specific) | + +-------------------+-------------------------+ + + Table 5: Ensemble Sharing - TCB Initialization + + (1) Note that PMTU is cached at the IP layer [RFC1191] [RFC4821]. + + In Table 5, the cached sum() is a total across all active connections + because these parameters act in aggregate; similarly, f() is a + function that updates that sum based on the new connection's values, + represented as "N". + + Table 6 gives an overview of option-specific information that can be + similarly shared. Again, the TFO_cookie is updated upon explicit + client request, which is a separate event. + + +=================+=================+ + | Cached | New | + +=================+=================+ + | old_TFO_cookie | old_TFO_cookie | + +-----------------+-----------------+ + | old_TFO_failure | old_TFO_failure | + +-----------------+-----------------+ + + Table 6: Ensemble Sharing - + Option Info Initialization + +7.2. 
Updates to the TCB Cache + + During a connection, the TCB cache can be updated based on changes to + concurrent connections and their TCBs, as shown below: + + +==============+===============+===========+=================+ + | Cached TCB | Current TCB | When? | New Cached TCB | + +==============+===============+===========+=================+ + | old_MMS_S | curr_MMS_S | OPEN | curr_MMS_S | + +--------------+---------------+-----------+-----------------+ + | old_MMS_R | curr_MMS_R | OPEN | curr_MMS_R | + +--------------+---------------+-----------+-----------------+ + | old_sendMSS | curr_sendMSS | MSSopt | curr_sendMSS | + +--------------+---------------+-----------+-----------------+ + | old_PMTU | curr_PMTU | PMTUD+ / | curr_PMTU | + | | | PLPMTUD+ | | + +--------------+---------------+-----------+-----------------+ + | old_RTT | curr_RTT | update | rtt_update(old, | + | | | | curr) | + +--------------+---------------+-----------+-----------------+ + | old_RTTVAR | curr_RTTVAR | update | rtt_update(old, | + | | | | curr) | + +--------------+---------------+-----------+-----------------+ + | old_ssthresh | curr_ssthresh | update | adjust sum as | + | | | | appropriate | + +--------------+---------------+-----------+-----------------+ + | old_sendcwnd | curr_sendcwnd | update | adjust sum as | + | | | | appropriate | + +--------------+---------------+-----------+-----------------+ + | old_option | curr_option | (depends) | (option | + | | | | specific) | + +--------------+---------------+-----------+-----------------+ + + Table 7: Ensemble Sharing - Cache Updates + + + Note that the PMTU is cached at the IP layer [RFC1191] [RFC4821]. + + In Table 7, rtt_update() is the function used to combine old and + current values, e.g., as a windowed average or exponentially decayed + average. + + Table 8 gives an overview of option-specific information that can be + similarly shared. 
+ + +=================+=================+=======+=================+ + | Cached | Current | When? | New Cached | + +=================+=================+=======+=================+ + | old_TFO_cookie | old_TFO_cookie | ESTAB | old_TFO_cookie | + +-----------------+-----------------+-------+-----------------+ + | old_TFO_failure | old_TFO_failure | ESTAB | old_TFO_failure | + +-----------------+-----------------+-------+-----------------+ + + Table 8: Ensemble Sharing - Option Info Updates + +7.3. Discussion + + For ensemble sharing, TCB information should be cached as early as + possible, sometimes before a connection is closed. Otherwise, + opening multiple concurrent connections may not result in TCB data + sharing if no connection closes before others open. The amount of + work involved in updating the aggregate average should be minimized, + but the resulting value should be equivalent to having all values + measured within a single connection. The function "rtt_update" in + Table 7 indicates this operation, which occurs whenever the RTT would + have been updated in the individual TCP connection. As a result, the + cache contains the shared RTT variables, which no longer need to + reside in the TCB. + + Congestion window size and ssthresh aggregation are more complicated + in the concurrent case. When there is an ensemble of connections, we + need to decide how that ensemble would have shared these variables, + in order to derive initial values for new TCBs. + + Sections 8 and 9 discuss compatibility issues and implications of + sharing the specific information listed above. + + There are several ways to initialize the congestion window in a new + TCB among an ensemble of current connections to a host. Current TCP + implementations initialize it to 4 segments as standard [RFC3390] and + 10 segments experimentally [RFC6928]. These approaches assume that + new connections should behave as conservatively as possible. 
The + algorithm described in [Ba12] adjusts the initial cwnd depending on + the cwnd values of ongoing connections. It is also possible to use + sharing mechanisms over long timescales to adapt TCP's initial window + automatically, as described further in Appendix C. + +8. Issues with TCB Information Sharing + + Here, we discuss various types of problems that may arise with TCB + information sharing. + + For the congestion and current window information, the initial values + computed by TCB interdependence may not be consistent with the long- + term aggregate behavior of a set of concurrent connections between + the same endpoints. Under conventional TCP congestion control, if + the congestion window of a single existing connection has converged + to 40 segments, two newly joining concurrent connections will assume + initial windows of 10 segments [RFC6928] and the existing + connection's window will not decrease to accommodate this additional + load. As a consequence, the three connections can mutually + interfere. One example of this is seen on low-bandwidth, high-delay + links, where concurrent connections supporting Web traffic can + collide because their initial windows were too large, even when set + at 1 segment. + + The authors of [Hu12] recommend caching ssthresh for temporal sharing + only when flows are long. Some studies suggest that sharing ssthresh + between short flows can deteriorate the performance of individual + connections [Hu12] [Du16], although this may benefit aggregate + network performance. + +8.1. Traversing the Same Network Path + + TCP is sometimes used in situations where packets of the same host- + pair do not always take the same path, such as when connection- + specific parameters are used for routing (e.g., for load balancing). 
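[Editor's sketch] The port-dependent path selection described above can be illustrated with a small model of a hash-based multipath forwarding device that picks among equal-cost links by hashing the 5-tuple. The hash function, link count, and addresses below are illustrative assumptions (using RFC 5737 documentation addresses), not any particular router's algorithm and not part of this document:

```python
import hashlib

def choose_link(src_ip, dst_ip, proto, src_port, dst_port, num_links=2):
    """Illustrative hash-based path selection: hash the 5-tuple and
    take it modulo the number of equal-cost links.  Real devices use
    vendor-specific hash functions; this is only a sketch."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# Same host-pair and service, differing only in source port: the
# selected link can differ, so path properties cached per host-pair
# may describe neither connection's actual path.
base = choose_link("192.0.2.1", "198.51.100.2", "tcp", 1024, 443)
other = next(p for p in range(1025, 65536)
             if choose_link("192.0.2.1", "198.51.100.2", "tcp", p, 443) != base)
print(f"source port 1024 -> link {base}; source port {other} -> a different link")
```

Because only the source port differs between the two connections, a cache keyed on the host-pair alone would treat them as sharing one path even though the device may forward them over different links.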
+ Multipath routing that relies on examining transport headers, such as + ECMP and Link Aggregation Group (LAG) [RFC7424], may not result in + repeatable path selection when TCP segments are encapsulated, + encrypted, or altered -- for example, in some Virtual Private Network + (VPN) tunnels that rely on proprietary encapsulation. Similarly, + such approaches cannot operate deterministically when the TCP header + is encrypted, e.g., when using IPsec Encapsulating Security Payload + (ESP) (although TCB interdependence among the entire set sharing the + same endpoint IP addresses should work without problems when the TCP + header is encrypted). Measures to increase the probability that + connections use the same path could be applied; for example, the + connections could be given the same IPv6 flow label [RFC6437]. TCB + interdependence can also be extended to sets of host IP address pairs + that share the same network path conditions, such as when a group of + addresses is on the same LAN (see Section 9). + + Traversing the same path is not important for host-specific + information (e.g., rwnd), TCP option state (e.g., TFOinfo), or for + information that is already cached per-host (e.g., path MTU). When + TCB information is shared across different SYN destination ports, + path-related information can be incorrect; however, the impact of + this error is potentially diminished if (as discussed here) TCB + sharing affects only the transient event of a connection start or if + TCB information is shared only within connections to the same SYN + destination port. + + In the case of temporal sharing, TCB information could also become + invalid over time, i.e., indicating that although the path remains + the same, path properties have changed. Because this is similar to + the case when a connection becomes idle, mechanisms that address idle + TCP connections (e.g., [RFC7661]) could also be applied to TCB cache + management, especially when TCP Fast Open is used [RFC7413]. 
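[Editor's sketch] The cache-management idea in the preceding paragraph, i.e., that temporally shared TCB information can go stale and should eventually be discarded, can be sketched as a per-host table whose entries expire after a configured lifetime. The class, field names, and TTL policy below are illustrative assumptions by the editor, not mechanisms specified by this document:

```python
import time

class TCBCache:
    """Minimal per-host TCB cache sketch with time-based invalidation.
    The entry layout and TTL policy are illustrative assumptions,
    not part of RFC 9040."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for deterministic tests
        self.entries = {}           # host IP -> (timestamp, state dict)

    def update(self, host, **state):
        """Cache (or refresh) shared state, e.g., RTT or ssthresh."""
        self.entries[host] = (self.clock(), state)

    def lookup(self, host):
        """Return cached state, or None if absent or expired."""
        hit = self.entries.get(host)
        if hit is None:
            return None
        stamp, state = hit
        if self.clock() - stamp > self.ttl:
            del self.entries[host]  # stale path properties: discard
            return None
        return state

# Usage with a fake clock so the example is deterministic:
now = [0.0]
cache = TCBCache(ttl_seconds=60.0, clock=lambda: now[0])
cache.update("203.0.113.7", srtt=0.120, ssthresh=20)
assert cache.lookup("203.0.113.7") == {"srtt": 0.120, "ssthresh": 20}
now[0] = 61.0                      # more than the TTL later
assert cache.lookup("203.0.113.7") is None
```

A real implementation might instead decay cached values gradually (in the spirit of [RFC7661]) rather than dropping them outright; expiry is simply the smallest policy that bounds staleness.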
+ +8.2. State Dependence + + There may be additional considerations to the way in which TCB + interdependence rebalances congestion feedback among the current + connections. For example, it may be appropriate to consider the + impact of a connection being in Fast Recovery [RFC5681] or some other + similar unusual feedback state that could inhibit or affect the + calculations described herein. + +8.3. Problems with Sharing Based on IP Address + + It can be wrong to share TCB information between TCP connections on + the same host as identified by the IP address if an IP address is + assigned to a new host (e.g., IP address spinning, as is used by ISPs + to inhibit running servers). It can be wrong if Network Address + Translation (NAT) [RFC2663], Network Address and Port Translation + (NAPT) [RFC2663], or any other IP sharing mechanism is used. Such + mechanisms are less likely to be used with IPv6. Other methods to + identify a host could also be considered to make correct TCB sharing + more likely. Moreover, some TCB information is about dominant path + properties rather than the specific host. IP addresses may differ, + yet the relevant part of the path may be the same. + +9. Implications + + There are several implications to incorporating TCB interdependence + in TCP implementations. First, it may reduce the need for + application-layer multiplexing for performance enhancement [RFC7231]. + Protocols like HTTP/2 [RFC7540] avoid connection re-establishment + costs by serializing or multiplexing a set of per-host connections + across a single TCP connection. This avoids TCP's per-connection + OPEN handshake and also avoids recomputing the MSS, RTT, and + congestion window values. By avoiding the so-called "slow-start + restart", performance can be optimized [Hu01]. TCB interdependence + can provide the "slow-start restart avoidance" of multiplexing, + without requiring a multiplexing mechanism at the application layer. 
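[Editor's sketch] The size of the transient that "slow-start restart avoidance" removes can be approximated with the textbook model in which slow start doubles the congestion window once per RTT. The function below is a hypothetical illustration under that idealized model (it ignores delayed ACKs, pacing, and loss) and is not a calculation from this document:

```python
from math import ceil, log2

def slow_start_rtts(initial_cwnd, target_cwnd):
    """RTTs of slow start needed to grow cwnd from initial_cwnd to
    target_cwnd, under the idealized doubling-per-RTT model."""
    if initial_cwnd >= target_cwnd:
        return 0
    return ceil(log2(target_cwnd / initial_cwnd))

# A fresh connection with IW = 10 segments [RFC6928] reaching a
# 40-segment window spends 2 RTTs in slow start; a connection whose
# cwnd is initialized from a cached value of 40 spends none.  That
# difference is the transient that TCB sharing can remove.
print(slow_start_rtts(10, 40))   # 2
print(slow_start_rtts(40, 40))   # 0
```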
   Like the initial version of this document [RFC2140], this update's
   approach to TCB interdependence focuses on sharing a set of TCBs by
   updating the TCB state to reduce the impact of transients when
   connections begin, end, or otherwise significantly change state.
   Other mechanisms have since been proposed to continuously share
   information between all ongoing communication (including
   connectionless protocols) and update the congestion state during any
   congestion-related event (e.g., timeout, loss confirmation, etc.)
   [RFC3124].  By dealing exclusively with transients, the approach in
   this document is more likely to exhibit the same steady-state
   behavior as unmodified, independent TCP connections.

9.1.  Layering

   TCB interdependence pushes some of the TCP implementation from its
   typical placement solely within the transport layer (in the ISO
   model) to the network layer.  This acknowledges that some components
   of state are, in fact, per-host-pair or can be per-path as indicated
   solely by that host-pair.  Transport protocols typically manage
   per-application-pair associations (per stream), and network
   protocols manage per-host-pair and path associations (routing).
   Round-trip time, MSS, and congestion information could be more
   appropriately handled at the network layer, aggregated among
   concurrent connections, and shared across connection instances
   [RFC3124].

   An earlier version of RTT sharing suggested implementing RTT state
   at the IP layer rather than at the TCP layer.  Our observations
   describe sharing state among TCP connections, which avoids some of
   the difficulties in an IP-layer solution.  One such problem of an
   IP-layer solution is determining the correspondence between packet
   exchanges using IP header information alone, where such
   correspondence is needed to compute RTT.
Because TCB sharing + computes RTTs inside the TCP layer using TCP header information, it + can be implemented more directly and simply than at the IP layer. + This is a case where information should be computed at the transport + layer but could be shared at the network layer. + +9.2. Other Possibilities + + Per-host-pair associations are not the limit of these techniques. It + is possible that TCBs could be similarly shared between hosts on a + subnet or within a cluster, because the predominant path can be + subnet-subnet rather than host-host. Additionally, TCB + interdependence can be applied to any protocol with congestion state, + including SCTP [RFC4960] and DCCP [RFC4340], as well as to individual + subflows in Multipath TCP [RFC8684]. + + There may be other information that can be shared between concurrent + connections. For example, knowing that another connection has just + tried to expand its window size and failed, a connection may not + attempt to do the same for some period. The idea is that existing + TCP implementations infer the behavior of all competing connections, + including those within the same host or subnet. One possible + optimization is to make that implicit feedback explicit, via extended + information associated with the endpoint IP address and its TCP + implementation, rather than per-connection state in the TCB. + + This document focuses on sharing TCB information at connection + initialization. Subsequent to RFC 2140, there have been numerous + approaches that attempt to coordinate ongoing state across concurrent + connections, both within TCP and other congestion-reactive protocols, + which are summarized in [Is18]. These approaches are more complex to + implement, and their comparison to steady-state TCP equivalence can + be more difficult to establish, sometimes intentionally (i.e., they + sometimes intend to provide a different kind of "fairness" than + emerges from TCP operation). + +10. 
Implementation Observations + + The observation that some TCB state is host-pair specific rather than + application-pair dependent is not new and is a common engineering + decision in layered protocol implementations. Although now + deprecated, T/TCP [RFC1644] was the first to propose using caches in + order to maintain TCB states (see Appendix A). + + Table 9 describes the current implementation status for TCB temporal + sharing in Windows as of December 2020, Apple variants (macOS, iOS, + iPadOS, tvOS, and watchOS) as of January 2021, Linux kernel version + 5.10.3, and FreeBSD 12. Ensemble sharing is not yet implemented. + + +==============+=========================================+ + | TCB data | Status | + +==============+=========================================+ + | old_MMS_S | Not shared | + +--------------+-----------------------------------------+ + | old_MMS_R | Not shared | + +--------------+-----------------------------------------+ + | old_sendMSS | Cached and shared in Apple, Linux (MSS) | + +--------------+-----------------------------------------+ + | old_PMTU | Cached and shared in Apple, FreeBSD, | + | | Windows (PMTU) | + +--------------+-----------------------------------------+ + | old_RTT | Cached and shared in Apple, FreeBSD, | + | | Linux, Windows | + +--------------+-----------------------------------------+ + | old_RTTVAR | Cached and shared in Apple, FreeBSD, | + | | Windows | + +--------------+-----------------------------------------+ + | old_TFOinfo | Cached and shared in Apple, Linux, | + | | Windows | + +--------------+-----------------------------------------+ + | old_sendcwnd | Not shared | + +--------------+-----------------------------------------+ + | old_ssthresh | Cached and shared in Apple, FreeBSD*, | + | | Linux* | + +--------------+-----------------------------------------+ + | TFO failure | Cached and shared in Apple | + +--------------+-----------------------------------------+ + + Table 9: KNOWN IMPLEMENTATION STATUS + 
   *  Note: In FreeBSD, new ssthresh is the mean of curr_ssthresh and
      its previous value if a previous value exists; in Linux, the
      calculation depends on state and is max(curr_cwnd/2,
      old_ssthresh) in most cases.

   In Table 9, "Apple" refers to all Apple OSes, i.e., macOS (desktop/
   laptop), iOS (phone), iPadOS (tablet), tvOS (video player), and
   watchOS (smart watch), which all share the same Internet protocol
   stack.

11.  Changes Compared to RFC 2140

   This document updates the description of TCB sharing in RFC 2140
   and its associated impact on existing and new connection state,
   providing a complete replacement for that document [RFC2140].  It
   clarifies the previous description and terminology and extends the
   mechanism to its impact on new protocols and mechanisms, including
   multipath TCP, Fast Open, PLPMTUD, NAT, and the TCP Authentication
   Option.

   The detailed impact on TCB state addresses TCB parameters with
   greater specificity.  It separates the way MSS is used in both send
   and receive directions, it separates the way both of these MSS
   values differ from sendMSS, it adds both path MTU and ssthresh, and
   it addresses the impact on state associated with TCP options.

   New sections have been added to address compatibility issues and
   implementation observations.  The relation of this work to T/TCP
   has been moved to Appendix A (which describes the history of TCB
   sharing), partly to reflect the deprecation of that protocol.

   Appendix C has been added to discuss the potential to use temporal
   sharing over long timescales to adapt TCP's initial window
   automatically, avoiding the need to periodically revise a single
   global constant value.

   Finally, this document updates and significantly expands the
   referenced literature.

12.  Security Considerations

   The implementation methods presented here do not have additional
   ramifications for direct (connection-aborting or information-
   injecting) attacks on individual connections.  Individual
   connections, whether using sharing or not, may also be susceptible
   to denial-of-service attacks that reduce performance or completely
   deny connections and transfers if not otherwise secured.

   TCB sharing may create additional denial-of-service attacks that
   affect the performance of other connections by polluting the cached
   information.  This can occur across any set of connections in which
   the TCB is shared, between connections in a single host, or between
   hosts if TCB sharing is implemented within a subnet (see
   "Implications" (Section 9)).  Some shared TCB parameters are used
   only to create new TCBs; others are shared among the TCBs of
   ongoing connections.  New connections can join the ongoing set,
   e.g., to optimize send window size among a set of connections to
   the same host.  PMTU is defined as shared at the IP layer and is
   already susceptible in this way.

   Options in client SYNs can be easier to forge than complete,
   two-way connections.  As a result, their values may not be safely
   incorporated in shared values until after the three-way handshake
   completes.

   Attacks on parameters used only for initialization affect only the
   transient performance of a TCP connection.  For short connections,
   the performance ramification can approach that of a denial-of-
   service attack.  For example, if an application changes its TCB to
   have a false and small window size, subsequent connections will
   experience performance degradation until their window grows
   appropriately.

   TCB sharing reuses and mixes information from past and current
   connections.  Although reusing information could create a potential
   for fingerprinting to identify hosts, the mixing reduces that
   potential.
There has been no evidence of fingerprinting based on + this technique, and it is currently considered safe in that regard. + Further, information about the performance of a TCP connection has + not been considered as private. + +13. IANA Considerations + + This document has no IANA actions. + +14. References + +14.1. Normative References + + [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, + RFC 793, DOI 10.17487/RFC0793, September 1981, + . + + [RFC1122] Braden, R., Ed., "Requirements for Internet Hosts - + Communication Layers", STD 3, RFC 1122, + DOI 10.17487/RFC1122, October 1989, + . + + [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, + DOI 10.17487/RFC1191, November 1990, + . + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + . + + [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU + Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007, + . + + [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion + Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, + . + + [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, + "Computing TCP's Retransmission Timer", RFC 6298, + DOI 10.17487/RFC6298, June 2011, + . + + [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP + Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, + . + + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, . + + [RFC8201] McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed., + "Path MTU Discovery for IP version 6", STD 87, RFC 8201, + DOI 10.17487/RFC8201, July 2017, + . + +14.2. Informative References + + [Al10] Allman, M., "Initial Congestion Window Specification", + Work in Progress, Internet-Draft, draft-allman-tcpm-bump- + initcwnd-00, 15 November 2010, + . + + [Ba12] Barik, R., Welzl, M., Ferlin, S., and O. 
Alay, "LISA: A + linked slow-start algorithm for MPTCP", IEEE ICC, + DOI 10.1109/ICC.2016.7510786, May 2016, + . + + [Ba20] Bagnulo, M. and B. Briscoe, "ECN++: Adding Explicit + Congestion Notification (ECN) to TCP Control Packets", + Work in Progress, Internet-Draft, draft-ietf-tcpm- + generalized-ecn-07, 16 February 2021, + . + + [Be94] Berners-Lee, T., Cailliau, C., Luotonen, A., Nielsen, H., + and A. Secret, "The World-Wide Web", Communications of the + ACM V37, pp. 76-82, DOI 10.1145/179606.179671, August + 1994, . + + [Br02] Brownlee, N. and KC. Claffy, "Understanding Internet + traffic streams: dragonflies and tortoises", IEEE + Communications Magazine, pp. 110-117, + DOI 10.1109/MCOM.2002.1039865, 2002, + . + + [Br94] Braden, B., "T/TCP -- Transaction TCP: Source Changes for + Sun OS 4.1.3", USC/ISI Release 1.0, September 1994. + + [Co91] Comer, D. and D. Stevens, "Internetworking with TCP/IP", + ISBN 10: 0134685059, ISBN 13: 9780134685052, 1991. + + [Du16] Dukkipati, N., Cheng, Y., and A. Vahdat, "Research + Impacting the Practice of Congestion Control", Computer + Communication Review, The ACM SIGCOMM newsletter, July + 2016. + + [FreeBSD] FreeBSD, "The FreeBSD Project", + . + + [Hu01] Hughes, A., Touch, J., and J. Heidemann, "Issues in TCP + Slow-Start Restart After Idle", Work in Progress, + Internet-Draft, draft-hughes-restart-00, December 2001, + . + + [Hu12] Hurtig, P. and A. Brunstrom, "Enhanced metric caching for + short TCP flows", IEEE International Conference on + Communications, DOI 10.1109/ICC.2012.6364516, 2012, + . + + [IANA] IANA, "Transmission Control Protocol (TCP) Parameters", + . + + [Is18] Islam, S., Welzl, M., Hiorth, K., Hayes, D., Armitage, G., + and S. Gjessing, "ctrlTCP: Reducing latency through + coupled, heterogeneous multi-flow TCP congestion control", + IEEE INFOCOM 2018 - IEEE Conference on Computer + Communications Workshops (INFOCOM WKSHPS), + DOI 10.1109/INFCOMW.2018.8406887, April 2018, + . + + [Ja88] Jacobson, V. and M. 
Karels, "Congestion Avoidance and + Control", SIGCOMM Symposium proceedings on Communications + architectures and protocols, November 1988. + + [RFC1379] Braden, R., "Extending TCP for Transactions -- Concepts", + RFC 1379, DOI 10.17487/RFC1379, November 1992, + . + + [RFC1644] Braden, R., "T/TCP -- TCP Extensions for Transactions + Functional Specification", RFC 1644, DOI 10.17487/RFC1644, + July 1994, . + + [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast + Retransmit, and Fast Recovery Algorithms", RFC 2001, + DOI 10.17487/RFC2001, January 1997, + . + + [RFC2140] Touch, J., "TCP Control Block Interdependence", RFC 2140, + DOI 10.17487/RFC2140, April 1997, + . + + [RFC2414] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's + Initial Window", RFC 2414, DOI 10.17487/RFC2414, September + 1998, . + + [RFC2663] Srisuresh, P. and M. Holdrege, "IP Network Address + Translator (NAT) Terminology and Considerations", + RFC 2663, DOI 10.17487/RFC2663, August 1999, + . + + [RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", + RFC 3124, DOI 10.17487/RFC3124, June 2001, + . + + [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's + Initial Window", RFC 3390, DOI 10.17487/RFC3390, October + 2002, . + + [RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram + Congestion Control Protocol (DCCP)", RFC 4340, + DOI 10.17487/RFC4340, March 2006, + . + + [RFC4960] Stewart, R., Ed., "Stream Control Transmission Protocol", + RFC 4960, DOI 10.17487/RFC4960, September 2007, + . + + [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP + Authentication Option", RFC 5925, DOI 10.17487/RFC5925, + June 2010, . + + [RFC6437] Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme, + "IPv6 Flow Label Specification", RFC 6437, + DOI 10.17487/RFC6437, November 2011, + . + + [RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)", + RFC 6691, DOI 10.17487/RFC6691, July 2012, + . 
+ + [RFC6928] Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, + "Increasing TCP's Initial Window", RFC 6928, + DOI 10.17487/RFC6928, April 2013, + . + + [RFC7231] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer + Protocol (HTTP/1.1): Semantics and Content", RFC 7231, + DOI 10.17487/RFC7231, June 2014, + . + + [RFC7323] Borman, D., Braden, B., Jacobson, V., and R. + Scheffenegger, Ed., "TCP Extensions for High Performance", + RFC 7323, DOI 10.17487/RFC7323, September 2014, + . + + [RFC7424] Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. + Khasnabish, "Mechanisms for Optimizing Link Aggregation + Group (LAG) and Equal-Cost Multipath (ECMP) Component Link + Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, + January 2015, . + + [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext + Transfer Protocol Version 2 (HTTP/2)", RFC 7540, + DOI 10.17487/RFC7540, May 2015, + . + + [RFC7661] Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating + TCP to Support Rate-Limited Traffic", RFC 7661, + DOI 10.17487/RFC7661, October 2015, + . + + [RFC8684] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C. + Paasch, "TCP Extensions for Multipath Operation with + Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March + 2020, . + +Appendix A. TCB Sharing History + + T/TCP proposed using caches to maintain TCB information across + instances (temporal sharing), e.g., smoothed RTT, RTT variation, + congestion-avoidance threshold, and MSS [RFC1644]. These values were + in addition to connection counts used by T/TCP to accelerate data + delivery prior to the full three-way handshake during an OPEN. The + goal was to aggregate TCB components where they reflect one + association -- that of the host-pair rather than artificially + separating those components by connection. 
+ + At least one T/TCP implementation saved the MSS and aggregated the + RTT parameters across multiple connections but omitted caching the + congestion window information [Br94], as originally specified in + [RFC1379]. Some T/TCP implementations immediately updated MSS when + the TCP MSS header option was received [Br94], although this was not + addressed specifically in the concepts or functional specification + [RFC1379] [RFC1644]. In later T/TCP implementations, RTT values were + updated only after a CLOSE, which does not benefit concurrent + sessions. + + Temporal sharing of cached TCB data was originally implemented in the + Sun OS 4.1.3 T/TCP extensions [Br94] and the FreeBSD port of same + [FreeBSD]. As mentioned before, only the MSS and RTT parameters were + cached, as originally specified in [RFC1379]. Later discussion of T/ + TCP suggested including congestion control parameters in this cache; + for example, Section 3.1 of [RFC1644] hints at initializing the + congestion window to the old window size. + +Appendix B. TCP Option Sharing and Caching + + In addition to the options that can be cached and shared, this memo + also lists known TCP options [IANA] for which state is unsafe to be + kept. This list is not intended to be authoritative or exhaustive. 
+ + Obsolete (unsafe to keep state): + + Echo + + Echo Reply + + Partial Order Connection Permitted + + Partial Order Service Profile + + CC + + CC.NEW + + CC.ECHO + + TCP Alternate Checksum Request + + TCP Alternate Checksum Data + + No state to keep: + + End of Option List (EOL) + + No-Operation (NOP) + + Window Scale (WS) + + SACK + + Timestamps (TS) + + MD5 Signature Option + + TCP Authentication Option (TCP-AO) + + RFC3692-style Experiment 1 + + RFC3692-style Experiment 2 + + Unsafe to keep state: + + Skeeter (DH exchange, known to be vulnerable) + + Bubba (DH exchange, known to be vulnerable) + + Trailer Checksum Option + + SCPS capabilities + + Selective Negative Acknowledgements (S-NACK) + + Records Boundaries + + Corruption experienced + + SNAP + + TCP Compression Filter + + Quick-Start Response + + User Timeout Option (UTO) + + Multipath TCP (MPTCP) negotiation success (see below for + negotiation failure) + + TCP Fast Open (TFO) negotiation success (see below for negotiation + failure) + + Safe but optional to keep state: + + Multipath TCP (MPTCP) negotiation failure (to avoid negotiation + retries) + + Maximum Segment Size (MSS) + + TCP Fast Open (TFO) negotiation failure (to avoid negotiation + retries) + + Safe and necessary to keep state: + + TCP Fast Open (TFO) Cookie (if TFO succeeded in the past) + +Appendix C. Automating the Initial Window in TCP over Long Timescales + +C.1. Introduction + + Temporal sharing, as described earlier in this document, builds on + the assumption that multiple consecutive connections between the same + host-pair are somewhat likely to be exposed to similar environment + characteristics. The stored information can become less accurate + over time and suitable precautions should take this aging into + consideration (this is discussed further in Section 8.1). 
However, + there are also cases where it can make sense to track these values + over longer periods, observing properties of TCP connections to + gradually influence evolving trends in TCP parameters. This appendix + describes an example of such a case. + + TCP's congestion control algorithm uses an initial window value (IW) + both as a starting point for new connections and as an upper limit + for restarting after an idle period [RFC5681] [RFC7661]. This value + has evolved over time; it was originally 1 maximum segment size (MSS) + and increased to the lesser of 4 MSSs or 4,380 bytes [RFC3390] + [RFC5681]. For a typical Internet connection with a maximum + transmission unit (MTU) of 1500 bytes, this permits 3 segments of + 1,460 bytes each. + + The IW value was originally implied in the original TCP congestion + control description and documented as a standard in 1997 [RFC2001] + [Ja88]. The value was updated in 1998 experimentally and moved to + the Standards Track in 2002 [RFC2414] [RFC3390]. In 2013, it was + experimentally increased to 10 [RFC6928]. + + This appendix discusses how TCP can objectively measure when an IW is + too large and that such feedback should be used over long timescales + to adjust the IW automatically. The result should be safer to deploy + and might avoid the need to repeatedly revisit IW over time. + + Note that this mechanism attempts to make the IW more adaptive over + time. It can increase the IW beyond that which is currently + recommended for wide-scale deployment, so its use should be carefully + monitored. + +C.2. Design Considerations + + TCP's IW value has existed statically for over two decades, so any + solution to adjusting the IW dynamically should have similarly + stable, non-invasive effects on the performance and complexity of + TCP. In order to be fair, the IW should be similar for most machines + on the public Internet. 
Finally, a desirable goal is to develop a + self-correcting algorithm so that IW values that cause network + problems can be avoided. To that end, we propose the following + design goals: + + * Impart little to no impact to TCP in the absence of loss, i.e., it + should not increase the complexity of default packet processing in + the normal case. + + * Adapt to network feedback over long timescales, avoiding values + that persistently cause network problems. + + * Decrease the IW in the presence of sustained loss of IW segments, + as determined over a number of different connections. + + * Increase the IW in the absence of sustained loss of IW segments, + as determined over a number of different connections. + + * Operate conservatively, i.e., tend towards leaving the IW the same + in the absence of sufficient information, and give greater + consideration to IW segment loss than IW segment success. + + We expect that, without other context, a good IW algorithm will + converge to a single value, but this is not required. An endpoint + with additional context or information, or deployed in a constrained + environment, can always use a different value. In particular, + information from previous connections, or sets of connections with a + similar path, can already be used as context for such decisions (as + noted in the core of this document). + + However, if a given IW value persistently causes packet loss during + the initial burst of packets, it is clearly inappropriate and could + be inducing unnecessary loss in other competing connections. This + might happen for sites behind very slow boxes with small buffers, + which may or may not be the first hop. + +C.3. Proposed IW Algorithm + + Below is a simple description of the proposed IW algorithm. 
It relies on the following parameters:

   *  MinIW = 3 MSS or 4,380 bytes (as per [RFC3390])

   *  MaxIW = 10 MSS (as per [RFC6928])

   *  MulDecr = 0.5

   *  AddIncr = 2 MSS

   *  Threshold = 0.05

   We assume that the minimum IW (MinIW) should be as currently
   specified as standard [RFC3390].  The maximum IW (MaxIW) can be set
   to a fixed value (we suggest using the experimental and now somewhat
   de facto standard in [RFC6928]) or set based on a schedule if
   trusted time references are available [Al10]; here, we prefer a
   fixed value.  We also propose to use an Additive Increase
   Multiplicative Decrease (AIMD) algorithm, with increases and
   decreases as noted.

   Although these parameters are somewhat arbitrary, their initial
   values are not important except that the algorithm is AIMD and the
   MaxIW should not exceed that recommended for other systems on the
   Internet (here, we selected the current de facto standard rather
   than the actual standard).  Current proposals, including default
   current operation, are degenerate cases of the algorithm below for
   given parameters, notably MulDecr = 1.0 and AddIncr = 0 MSS, thus
   disabling the automatic part of the algorithm.

   The proposed algorithm is as follows:

   1.  On boot:

       IW = MaxIW;  # assume this is in bytes and indicates an integer
                    # multiple of 2 MSS (an even number to support
                    # ACK compression)

   2.  Upon starting a new connection:

       CWND = IW;
       conncount++;
       IWnotchecked = 1;  # true

   3.  During a connection's SYN-ACK processing, if SYN-ACK includes
       ECN (as similarly addressed in Section 5 of ECN++ for TCP
       [Ba20]), treat as if the IW is too large:

       if (IWnotchecked && (synackecn == 1)) {
          losscount++;
          IWnotchecked = 0;  # never check again
       }

   4.  During a connection, if retransmission occurs, check the seqno
       of the outgoing packet (in bytes) to see if the re-sent segment
       fixes an IW loss:

       if (Retransmitting && IWnotchecked && ((seqno - ISN) < IW)) {
          losscount++;
          IWnotchecked = 0;  # never do this entire "if" again
       } else {
          IWnotchecked = 0;  # you're beyond the IW so stop checking
       }

   5.  Once every 1000 connections, as a separate process (i.e., not
       as part of processing a given connection):

       if (conncount > 1000) {
          if (losscount/conncount > Threshold) {
             # the number of connections with errors is too high
             IW = IW * MulDecr;
          } else {
             IW = IW + AddIncr;
          }
       }

   As presented, this algorithm can yield a false positive when the
   sequence number wraps around, e.g., the code might increment
   losscount in step 4 when no loss occurred or fail to increment
   losscount when a loss did occur.  This can be avoided using either
   Protection Against Wrapped Sequences (PAWS) [RFC7323] context or
   internal extended sequence number representations (as in TCP
   Authentication Option (TCP-AO) [RFC5925]).  Alternately, false
   positives can be tolerated because they are expected to be
   infrequent and thus will not significantly impact the algorithm.

   A number of additional constraints need to be imposed if this
   mechanism is implemented to ensure that it defaults to values that
   comply with current Internet standards, is conservative in how it
   extends those values, and returns to those values in the absence of
   positive feedback (i.e., success).  To that end, we recommend the
   following list of example constraints:

   *  The automatic IW algorithm MUST initialize MaxIW to a value no
      larger than the currently recommended Internet default in the
      absence of other context information.

      Thus, if there are too few connections to make a decision or if
      there is otherwise insufficient information to increase the IW,
      then the MaxIW defaults to the current recommended value.
   *  An implementation MAY allow the MaxIW to grow beyond the
      currently recommended Internet default but not more than 2
      segments per calendar year.

      Thus, if an endpoint has a persistent history of successfully
      transmitting IW segments without loss, then it is allowed to
      probe the Internet to determine if larger IW values have similar
      success.  This probing is limited and requires a trusted time
      source; otherwise, the MaxIW remains constant.

   *  An implementation MUST adjust the IW based on loss statistics at
      least once every 1000 connections.

      An endpoint needs to be sufficiently reactive to IW loss.

   *  An implementation MUST decrease the IW by at least 1 MSS when
      indicated during an evaluation interval.

      An endpoint that detects loss needs to decrease its IW by at
      least 1 MSS; otherwise, it is not participating in an automatic
      reactive algorithm.

   *  An implementation MUST increase the IW by no more than 2 MSSs
      per evaluation interval.

      An endpoint that does not experience IW loss needs to probe the
      network incrementally.

   *  An implementation SHOULD use an IW that is an integer multiple
      of 2 MSSs.

      The IW should remain a multiple of 2 MSS segments to enable
      efficient ACK compression without incurring unnecessary
      timeouts.

   *  An implementation MUST decrease the IW if more than 95% of
      connections have IW losses.

      Again, this is to ensure an implementation is sufficiently
      reactive.

   *  An implementation MAY group IW values and statistics within
      subsets of connections.  Such grouping MAY use any information
      about connections to form groups except loss statistics.

   There are some TCP connections that might not be counted at all,
   such as those to/from loopback addresses or those within the same
   subnet as that of a local interface (for which congestion control
   is sometimes disabled anyway).
This may also include connections that
   terminate before the IW is full, i.e., as a separate check at the
   time of the connection closing.

   The period over which the IW is updated is intended to be a long
   timescale, e.g., a month or so, or 1,000 connections, whichever is
   longer.  An implementation might check the IW once a month and
   simply not update the IW or clear the connection counts in months
   where the number of connections is too small.

C.4.  Discussion

   There are numerous parameters to the above algorithm that are
   compliant with the given requirements; this is intended to allow
   variation in configuration and implementation while ensuring that
   all such algorithms are reactive and safe.

   This algorithm continues to assume segments because that is the
   basis of most TCP implementations.  It might be useful to consider
   revising the specifications to allow byte-based congestion control,
   given sufficient experience.

   The algorithm checks for IW losses only during the first IW after a
   connection start; it does not check for IW losses elsewhere that
   the IW is used, e.g., during slow-start restarts.

   *  An implementation MAY detect IW losses during slow-start
      restarts in addition to losses during the first IW of a
      connection.  In this case, the implementation MUST count each
      restart as a "connection" for the purposes of connection counts
      and periodic rechecking of the IW value.

   False positives can occur during some kinds of segment reordering,
   e.g., that might trigger spurious retransmissions even without a
   true segment loss.  These are not expected to be sufficiently
   common to dominate the algorithm and its conclusions.

   This mechanism does require additional per-connection state, which
   is currently common in some implementations and is useful for other
   reasons (e.g., the ISN is used in TCP-AO [RFC5925]).
The mechanism + in this appendix also benefits from persistent state kept across + reboots, which would also be useful to other state sharing mechanisms + (e.g., TCP Control Block Sharing per the main body of this document). + + The receive window (rwnd) is not involved in this calculation. The + size of rwnd is determined by receiver resources and provides space + to accommodate segment reordering. Also, rwnd is not involved with + congestion control, which is the focus of the way this appendix + manages the IW. + +C.5. Observations + + The IW may not converge to a single global value. It also may not + converge at all but rather may oscillate by a few MSSs as it + repeatedly probes the Internet for larger IWs and fails. Both + properties are consistent with TCP behavior during each individual + connection. + + This mechanism assumes that losses during the IW are due to IW size. + Persistent errors that drop packets for other reasons, e.g., OS bugs, + can cause false positives. Again, this is consistent with TCP's + basic assumption that loss is caused by congestion and requires + backoff. This algorithm treats the IW of new connections as a long- + timescale backoff system. + +Acknowledgments + + The authors would like to thank Praveen Balasubramanian for + information regarding TCB sharing in Windows; Christoph Paasch for + information regarding TCB sharing in Apple OSs; Yuchung Cheng, Lars + Eggert, Ilpo Jarvinen, and Michael Scharf for comments on earlier + draft versions of this document; as well as members of the TCPM WG. + Earlier revisions of this work received funding from a collaborative + research project between the University of Oslo and Huawei + Technologies Co., Ltd. and were partly supported by USC/ISI's Postel + Center. 
Authors' Addresses

   Joe Touch
   Manhattan Beach, CA 90266
   United States of America

   Phone: +1 (310) 560-0334
   Email: touch@strayalpha.com


   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   N-0316 Oslo
   Norway

   Phone: +47 22 85 24 20
   Email: michawe@ifi.uio.no


   Safiqul Islam
   University of Oslo
   PO Box 1080 Blindern
   Oslo N-0316
   Norway

   Phone: +47 22 84 08 37
   Email: safiquli@ifi.uio.no