diff options
Diffstat (limited to 'doc/rfc/rfc9293.txt')
-rw-r--r-- | doc/rfc/rfc9293.txt | 5576 |
1 files changed, 5576 insertions, 0 deletions
diff --git a/doc/rfc/rfc9293.txt b/doc/rfc/rfc9293.txt new file mode 100644 index 0000000..34594bd --- /dev/null +++ b/doc/rfc/rfc9293.txt @@ -0,0 +1,5576 @@ + + + + +Internet Engineering Task Force (IETF) W. Eddy, Ed. +STD: 7 MTI Systems +Request for Comments: 9293 August 2022 +Obsoletes: 793, 879, 2873, 6093, 6429, 6528, + 6691 +Updates: 1011, 1122, 5961 +Category: Standards Track +ISSN: 2070-1721 + + + Transmission Control Protocol (TCP) + +Abstract + + This document specifies the Transmission Control Protocol (TCP). TCP + is an important transport-layer protocol in the Internet protocol + stack, and it has continuously evolved over decades of use and growth + of the Internet. Over this time, a number of changes have been made + to TCP as it was specified in RFC 793, though these have only been + documented in a piecemeal fashion. This document collects and brings + those changes together with the protocol specification from RFC 793. + This document obsoletes RFC 793, as well as RFCs 879, 2873, 6093, + 6429, 6528, and 6691 that updated parts of RFC 793. It updates RFCs + 1011 and 1122, and it should be considered as a replacement for the + portions of those documents dealing with TCP requirements. It also + updates RFC 5961 by adding a small clarification in reset handling + while in the SYN-RECEIVED state. The TCP header control bits from + RFC 793 have also been updated based on RFC 3168. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc9293. + +Copyright Notice + + Copyright (c) 2022 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Revised BSD License text as described in Section 4.e of the + Trust Legal Provisions and are provided without warranty as described + in the Revised BSD License. + + This document may contain material from IETF Documents or IETF + Contributions published or made publicly available before November + 10, 2008. The person(s) controlling the copyright in some of this + material may not have granted the IETF Trust the right to allow + modifications of such material outside the IETF Standards Process. + Without obtaining an adequate license from the person(s) controlling + the copyright in such materials, this document may not be modified + outside the IETF Standards Process, and derivative works of it may + not be created outside the IETF Standards Process, except to format + it for publication as an RFC or to translate it into languages other + than English. + +Table of Contents + + 1. Purpose and Scope + 2. Introduction + 2.1. Requirements Language + 2.2. Key TCP Concepts + 3. Functional Specification + 3.1. Header Format + 3.2. Specific Option Definitions + 3.2.1. Other Common Options + 3.2.2. Experimental TCP Options + 3.3. TCP Terminology Overview + 3.3.1. Key Connection State Variables + 3.3.2. State Machine Overview + 3.4. Sequence Numbers + 3.4.1. Initial Sequence Number Selection + 3.4.2. Knowing When to Keep Quiet + 3.4.3. The TCP Quiet Time Concept + 3.5. Establishing a Connection + 3.5.1. Half-Open Connections and Other Anomalies + 3.5.2. Reset Generation + 3.5.3. Reset Processing + 3.6. Closing a Connection + 3.6.1. Half-Closed Connections + 3.7. Segmentation + 3.7.1. Maximum Segment Size Option + 3.7.2. Path MTU Discovery + 3.7.3. Interfaces with Variable MTU Values + 3.7.4. Nagle Algorithm + 3.7.5. IPv6 Jumbograms + 3.8. Data Communication + 3.8.1. Retransmission Timeout + 3.8.2. TCP Congestion Control + 3.8.3. TCP Connection Failures + 3.8.4. TCP Keep-Alives + 3.8.5. The Communication of Urgent Information + 3.8.6. Managing the Window + 3.9. Interfaces + 3.9.1. User/TCP Interface + 3.9.2. TCP/Lower-Level Interface + 3.10. Event Processing + 3.10.1. OPEN Call + 3.10.2. SEND Call + 3.10.3. RECEIVE Call + 3.10.4. CLOSE Call + 3.10.5. ABORT Call + 3.10.6. STATUS Call + 3.10.7. SEGMENT ARRIVES + 3.10.8. Timeouts + 4. Glossary + 5. Changes from RFC 793 + 6. IANA Considerations + 7. Security and Privacy Considerations + 8. References + 8.1. Normative References + 8.2. Informative References + Appendix A. Other Implementation Notes + A.1. IP Security Compartment and Precedence + A.1.1. Precedence + A.1.2. MLS Systems + A.2. Sequence Number Validation + A.3. Nagle Modification + A.4. Low Watermark Settings + Appendix B. TCP Requirement Summary + Acknowledgments + Author's Address + +1. Purpose and Scope + + In 1981, RFC 793 [16] was released, documenting the Transmission + Control Protocol (TCP) and replacing earlier published specifications + for TCP. + + Since then, TCP has been widely implemented, and it has been used as + a transport protocol for numerous applications on the Internet. + + For several decades, RFC 793 plus a number of other documents have + combined to serve as the core specification for TCP [49]. Over time, + a number of errata have been filed against RFC 793. There have also + been deficiencies found and resolved in security, performance, and + many other aspects. The number of enhancements has grown over time + across many separate documents. These were never accumulated + together into a comprehensive update to the base specification. + + The purpose of this document is to bring together all of the IETF + Standards Track changes and other clarifications that have been made + to the base TCP functional specification (RFC 793) and to unify them + into an updated version of the specification. + + Some companion documents are referenced for important algorithms that + are used by TCP (e.g., for congestion control) but have not been + completely included in this document. This is a conscious choice, as + this base specification can be used with multiple additional + algorithms that are developed and incorporated separately. This + document focuses on the common basis that all TCP implementations + must support in order to interoperate. Since some additional TCP + features have become quite complicated themselves (e.g., advanced + loss recovery and congestion control), future companion documents may + attempt to similarly bring these together. + + In addition to the protocol specification that describes the TCP + segment format, generation, and processing rules that are to be + implemented in code, RFC 793 and other updates also contain + informative and descriptive text for readers to understand aspects of + the protocol design and operation. This document does not attempt to + alter or update this informative text and is focused only on updating + the normative protocol specification. This document preserves + references to the documentation containing the important explanations + and rationale, where appropriate. + + This document is intended to be useful both in checking existing TCP + implementations for conformance purposes, as well as in writing new + implementations. + +2. Introduction + + RFC 793 contains a discussion of the TCP design goals and provides + examples of its operation, including examples of connection + establishment, connection termination, and packet retransmission to + repair losses. + + This document describes the basic functionality expected in modern + TCP implementations and replaces the protocol specification in RFC + 793. It does not replicate or attempt to update the introduction and + philosophy content in Sections 1 and 2 of RFC 793. Other documents + are referenced to provide explanations of the theory of operation, + rationale, and detailed discussion of design decisions. This + document only focuses on the normative behavior of the protocol. + + The "TCP Roadmap" [49] provides a more extensive guide to the RFCs + that define TCP and describe various important algorithms. The TCP + Roadmap contains sections on strongly encouraged enhancements that + improve performance and other aspects of TCP beyond the basic + operation specified in this document. As one example, implementing + congestion control (e.g., [8]) is a TCP requirement, but it is a + complex topic on its own and not described in detail in this + document, as there are many options and possibilities that do not + impact basic interoperability. Similarly, most TCP implementations + today include the high-performance extensions in [47], but these are + not strictly required or discussed in this document. Multipath + considerations for TCP are also specified separately in [59]. + + A list of changes from RFC 793 is contained in Section 5. + +2.1. Requirements Language + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [3] [12] when, and only when, they appear in all capitals, as + shown here. + + Each use of RFC 2119 keywords in the document is individually labeled + and referenced in Appendix B, which summarizes implementation + requirements. + + Sentences using "MUST" are labeled as "MUST-X" with X being a numeric + identifier enabling the requirement to be located easily when + referenced from Appendix B. + + Similarly, sentences using "SHOULD" are labeled with "SHLD-X", "MAY" + with "MAY-X", and "RECOMMENDED" with "REC-X". + + For the purposes of this labeling, "SHOULD NOT" and "MUST NOT" are + labeled the same as "SHOULD" and "MUST" instances. + +2.2. Key TCP Concepts + + TCP provides a reliable, in-order, byte-stream service to + applications. + + The application byte-stream is conveyed over the network via TCP + segments, with each TCP segment sent as an Internet Protocol (IP) + datagram. + + TCP reliability consists of detecting packet losses (via sequence + numbers) and errors (via per-segment checksums), as well as + correction via retransmission. + + TCP supports unicast delivery of data. There are anycast + applications that can successfully use TCP without modifications, + though there is some risk of instability due to changes of lower- + layer forwarding behavior [46]. + + TCP is connection oriented, though it does not inherently include a + liveness detection capability. + + Data flow is supported bidirectionally over TCP connections, though + applications are free to send data only unidirectionally, if they so + choose. + + TCP uses port numbers to identify application services and to + multiplex distinct flows between hosts. + + A more detailed description of TCP features compared to other + transport protocols can be found in Section 3.1 of [52]. Further + description of the motivations for developing TCP and its role in the + Internet protocol stack can be found in Section 2 of [16] and earlier + versions of the TCP specification. + +3. Functional Specification + +3.1. Header Format + + TCP segments are sent as internet datagrams. The Internet Protocol + (IP) header carries several information fields, including the source + and destination host addresses [1] [13]. A TCP header follows the IP + headers, supplying information specific to TCP. This division allows + for the existence of host-level protocols other than TCP. In the + early development of the Internet suite of protocols, the IP header + fields had been a part of TCP. + + This document describes TCP, which uses TCP headers. + + A TCP header, followed by any user data in the segment, is formatted + as follows, using the style from [66]: + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Source Port | Destination Port | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Sequence Number | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Acknowledgment Number | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Data | |C|E|U|A|P|R|S|F| | + | Offset| Rsrvd |W|C|R|C|S|S|Y|I| Window | + | | |R|E|G|K|H|T|N|N| | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Checksum | Urgent Pointer | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | [Options] | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + : Data : + : | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Note that one tick mark represents one bit position. + + Figure 1: TCP Header Format + + where: + + Source Port: 16 bits + + The source port number. + + Destination Port: 16 bits + + The destination port number. + + Sequence Number: 32 bits + + The sequence number of the first data octet in this segment (except + when the SYN flag is set). If SYN is set, the sequence number is + the initial sequence number (ISN) and the first data octet is + ISN+1. + + Acknowledgment Number: 32 bits + + If the ACK control bit is set, this field contains the value of the + next sequence number the sender of the segment is expecting to + receive. Once a connection is established, this is always sent. + + Data Offset (DOffset): 4 bits + + The number of 32-bit words in the TCP header. This indicates where + the data begins. The TCP header (even one including options) is an + integer multiple of 32 bits long. + + Reserved (Rsrvd): 4 bits + + A set of control bits reserved for future use. Must be zero in + generated segments and must be ignored in received segments if the + corresponding future features are not implemented by the sending or + receiving host. + + Control bits: The control bits are also known as "flags". + Assignment is managed by IANA from the "TCP Header Flags" registry + [62]. The currently assigned control bits are CWR, ECE, URG, ACK, + PSH, RST, SYN, and FIN. + + CWR: 1 bit + + Congestion Window Reduced (see [6]). + + ECE: 1 bit + + ECN-Echo (see [6]). + + URG: 1 bit + + Urgent pointer field is significant. + + ACK: 1 bit + + Acknowledgment field is significant. + + PSH: 1 bit + + Push function (see the Send Call description in Section 3.9.1). + + RST: 1 bit + + Reset the connection. + + SYN: 1 bit + + Synchronize sequence numbers. + + FIN: 1 bit + + No more data from sender. + + Window: 16 bits + + The number of data octets beginning with the one indicated in the + acknowledgment field that the sender of this segment is willing to + accept. The value is shifted when the window scaling extension is + used [47]. + + The window size MUST be treated as an unsigned number, or else + large window sizes will appear like negative windows and TCP will + not work (MUST-1). It is RECOMMENDED that implementations will + reserve 32-bit fields for the send and receive window sizes in the + connection record and do all window computations with 32 bits (REC- + 1). + + Checksum: 16 bits + + The checksum field is the 16-bit ones' complement of the ones' + complement sum of all 16-bit words in the header and text. The + checksum computation needs to ensure the 16-bit alignment of the + data being summed. If a segment contains an odd number of header + and text octets, alignment can be achieved by padding the last + octet with zeros on its right to form a 16-bit word for checksum + purposes. The pad is not transmitted as part of the segment. + While computing the checksum, the checksum field itself is replaced + with zeros. + + The checksum also covers a pseudo-header (Figure 2) conceptually + prefixed to the TCP header. The pseudo-header is 96 bits for IPv4 + and 320 bits for IPv6. Including the pseudo-header in the checksum + gives the TCP connection protection against misrouted segments. + This information is carried in IP headers and is transferred across + the TCP/network interface in the arguments or results of calls by + the TCP implementation on the IP layer. + + +--------+--------+--------+--------+ + | Source Address | + +--------+--------+--------+--------+ + | Destination Address | + +--------+--------+--------+--------+ + | zero | PTCL | TCP Length | + +--------+--------+--------+--------+ + + Figure 2: IPv4 Pseudo-header + + Pseudo-header components for IPv4: + Source Address: the IPv4 source address in network byte order + + Destination Address: the IPv4 destination address in network + byte order + + zero: bits set to zero + + PTCL: the protocol number from the IP header + + TCP Length: the TCP header length plus the data length in octets + (this is not an explicitly transmitted quantity but is + computed), and it does not count the 12 octets of the pseudo- + header. + + For IPv6, the pseudo-header is defined in Section 8.1 of RFC 8200 + [13] and contains the IPv6 Source Address and Destination Address, + an Upper-Layer Packet Length (a 32-bit value otherwise equivalent + to TCP Length in the IPv4 pseudo-header), three bytes of zero + padding, and a Next Header value, which differs from the IPv6 + header value if there are extension headers present between IPv6 + and TCP. + + The TCP checksum is never optional. The sender MUST generate it + (MUST-2) and the receiver MUST check it (MUST-3). + + Urgent Pointer: 16 bits + + This field communicates the current value of the urgent pointer as + a positive offset from the sequence number in this segment. The + urgent pointer points to the sequence number of the octet following + the urgent data. This field is only to be interpreted in segments + with the URG control bit set. + + Options: [TCP Option]; size(Options) == (DOffset-5)*32; present only + when DOffset > 5. Note that this size expression also includes any + padding trailing the actual options present. + + Options may occupy space at the end of the TCP header and are a + multiple of 8 bits in length. All options are included in the + checksum. An option may begin on any octet boundary. There are + two cases for the format of an option: + + Case 1: A single octet of option-kind. + + Case 2: An octet of option-kind (Kind), an octet of option-length, + and the actual option-data octets. + + The option-length counts the two octets of option-kind and option- + length as well as the option-data octets. + + Note that the list of options may be shorter than the Data Offset + field might imply. The content of the header beyond the End of + Option List Option MUST be header padding of zeros (MUST-69). + + The list of all currently defined options is managed by IANA [62], + and each option is defined in other RFCs, as indicated there. That + set includes experimental options that can be extended to support + multiple concurrent usages [45]. + + A given TCP implementation can support any currently defined + options, but the following options MUST be supported (MUST-4 -- + note Maximum Segment Size Option support is also part of MUST-14 in + Section 3.7.1): + + +======+========+============================+ + | Kind | Length | Meaning | + +======+========+============================+ + | 0 | - | End of Option List Option. | + +------+--------+----------------------------+ + | 1 | - | No-Operation. | + +------+--------+----------------------------+ + | 2 | 4 | Maximum Segment Size. | + +------+--------+----------------------------+ + + Table 1: Mandatory Option Set + + These options are specified in detail in Section 3.2. + + A TCP implementation MUST be able to receive a TCP Option in any + segment (MUST-5). + + A TCP implementation MUST (MUST-6) ignore without error any TCP + Option it does not implement, assuming that the option has a length + field. All TCP Options except End of Option List Option (EOL) and + No-Operation (NOP) MUST have length fields, including all future + options (MUST-68). TCP implementations MUST be prepared to handle + an illegal option length (e.g., zero); a suggested procedure is to + reset the connection and log the error cause (MUST-7). + + Note: There is ongoing work to extend the space available for TCP + Options, such as [65]. + + Data: variable length + + User data carried by the TCP segment. + +3.2. Specific Option Definitions + + A TCP Option, in the mandatory option set, is one of an End of Option + List Option, a No-Operation Option, or a Maximum Segment Size Option. + + An End of Option List Option is formatted as follows: + + 0 + 0 1 2 3 4 5 6 7 + +-+-+-+-+-+-+-+-+ + | 0 | + +-+-+-+-+-+-+-+-+ + + where: + + Kind: 1 byte; Kind == 0. + + This option code indicates the end of the option list. This might + not coincide with the end of the TCP header according to the Data + Offset field. This is used at the end of all options, not the end + of each option, and need only be used if the end of the options + would not otherwise coincide with the end of the TCP header. + + A No-Operation Option is formatted as follows: + + 0 + 0 1 2 3 4 5 6 7 + +-+-+-+-+-+-+-+-+ + | 1 | + +-+-+-+-+-+-+-+-+ + + where: + + Kind: 1 byte; Kind == 1. + + This option code can be used between options, for example, to align + the beginning of a subsequent option on a word boundary. There is + no guarantee that senders will use this option, so receivers MUST + be prepared to process options even if they do not begin on a word + boundary (MUST-64). + + A Maximum Segment Size Option is formatted as follows: + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | 2 | Length | Maximum Segment Size (MSS) | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + where: + + Kind: 1 byte; Kind == 2. + + If this option is present, then it communicates the maximum receive + segment size at the TCP endpoint that sends this segment. This + value is limited by the IP reassembly limit. This field may be + sent in the initial connection request (i.e., in segments with the + SYN control bit set) and MUST NOT be sent in other segments (MUST- + 65). If this option is not used, any segment size is allowed. A + more complete description of this option is provided in + Section 3.7.1. + + Length: 1 byte; Length == 4. + + Length of the option in bytes. + + Maximum Segment Size (MSS): 2 bytes. + + The maximum receive segment size at the TCP endpoint that sends + this segment. + +3.2.1. Other Common Options + + Additional RFCs define some other commonly used options that are + recommended to implement for high performance but are not necessary + for basic TCP interoperability. These are the TCP Selective + Acknowledgment (SACK) Option [22] [26], TCP Timestamp (TS) Option + [47], and TCP Window Scale (WS) Option [47]. + +3.2.2. Experimental TCP Options + + Experimental TCP Option values are defined in [30], and [45] + describes the current recommended usage for these experimental + values. + +3.3. TCP Terminology Overview + + This section includes an overview of key terms needed to understand + the detailed protocol operation in the rest of the document. There + is a glossary of terms in Section 4. + +3.3.1. Key Connection State Variables + + Before we can discuss the operation of the TCP implementation in + detail, we need to introduce some detailed terminology. The + maintenance of a TCP connection requires maintaining state for + several variables. We conceive of these variables being stored in a + connection record called a Transmission Control Block or TCB. Among + the variables stored in the TCB are the local and remote IP addresses + and port numbers, the IP security level, and compartment of the + connection (see Appendix A.1), pointers to the user's send and + receive buffers, pointers to the retransmit queue and to the current + segment. In addition, several variables relating to the send and + receive sequence numbers are stored in the TCB. + + +==========+=====================================================+ + | Variable | Description | + +==========+=====================================================+ + | SND.UNA | send unacknowledged | + +----------+-----------------------------------------------------+ + | SND.NXT | send next | + +----------+-----------------------------------------------------+ + | SND.WND | send window | + +----------+-----------------------------------------------------+ + | SND.UP | send urgent pointer | + +----------+-----------------------------------------------------+ + | SND.WL1 | segment sequence number used for last window update | + +----------+-----------------------------------------------------+ + | SND.WL2 | segment acknowledgment number used for last window | + | | update | + +----------+-----------------------------------------------------+ + | ISS | initial send sequence number | + +----------+-----------------------------------------------------+ + + Table 2: Send Sequence Variables + + +==========+=================================+ + | Variable | Description | + +==========+=================================+ + | RCV.NXT | receive next | + +----------+---------------------------------+ + | RCV.WND | receive window | + +----------+---------------------------------+ + | RCV.UP | receive urgent pointer | + +----------+---------------------------------+ + | IRS | initial receive sequence number | + +----------+---------------------------------+ + + Table 3: Receive Sequence Variables + + The following diagrams may help to relate some of these variables to + the sequence space. + + 1 2 3 4 + ----------|----------|----------|---------- + SND.UNA SND.NXT SND.UNA + +SND.WND + + 1 - old sequence numbers that have been acknowledged + 2 - sequence numbers of unacknowledged data + 3 - sequence numbers allowed for new data transmission + 4 - future sequence numbers that are not yet allowed + + Figure 3: Send Sequence Space + + The send window is the portion of the sequence space labeled 3 in + Figure 3. + + 1 2 3 + ----------|----------|---------- + RCV.NXT RCV.NXT + +RCV.WND + + 1 - old sequence numbers that have been acknowledged + 2 - sequence numbers allowed for new reception + 3 - future sequence numbers that are not yet allowed + + Figure 4: Receive Sequence Space + + The receive window is the portion of the sequence space labeled 2 in + Figure 4. + + There are also some variables used frequently in the discussion that + take their values from the fields of the current segment. + + +==========+===============================+ + | Variable | Description | + +==========+===============================+ + | SEG.SEQ | segment sequence number | + +----------+-------------------------------+ + | SEG.ACK | segment acknowledgment number | + +----------+-------------------------------+ + | SEG.LEN | segment length | + +----------+-------------------------------+ + | SEG.WND | segment window | + +----------+-------------------------------+ + | SEG.UP | segment urgent pointer | + +----------+-------------------------------+ + + Table 4: Current Segment Variables + +3.3.2. State Machine Overview + + A connection progresses through a series of states during its + lifetime. The states are: LISTEN, SYN-SENT, SYN-RECEIVED, + ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, + TIME-WAIT, and the fictional state CLOSED. CLOSED is fictional + because it represents the state when there is no TCB, and therefore, + no connection. Briefly the meanings of the states are: + + LISTEN - represents waiting for a connection request from any remote + TCP peer and port. + + SYN-SENT - represents waiting for a matching connection request + after having sent a connection request. + + SYN-RECEIVED - represents waiting for a confirming connection + request acknowledgment after having both received and sent a + connection request. + + ESTABLISHED - represents an open connection, data received can be + delivered to the user. The normal state for the data transfer + phase of the connection. + + FIN-WAIT-1 - represents waiting for a connection termination request + from the remote TCP peer, or an acknowledgment of the connection + termination request previously sent. + + FIN-WAIT-2 - represents waiting for a connection termination request + from the remote TCP peer. + + CLOSE-WAIT - represents waiting for a connection termination request + from the local user. + + CLOSING - represents waiting for a connection termination request + acknowledgment from the remote TCP peer. + + LAST-ACK - represents waiting for an acknowledgment of the + connection termination request previously sent to the remote TCP + peer (this termination request sent to the remote TCP peer already + included an acknowledgment of the termination request sent from + the remote TCP peer). + + TIME-WAIT - represents waiting for enough time to pass to be sure + the remote TCP peer received the acknowledgment of its connection + termination request and to avoid new connections being impacted by + delayed segments from previous connections. + + CLOSED - represents no connection state at all. + + A TCP connection progresses from one state to another in response to + events. The events are the user calls, OPEN, SEND, RECEIVE, CLOSE, + ABORT, and STATUS; the incoming segments, particularly those + containing the SYN, ACK, RST, and FIN flags; and timeouts. + + The OPEN call specifies whether connection establishment is to be + actively pursued, or to be passively waited for. + + A passive OPEN request means that the process wants to accept + incoming connection requests, in contrast to an active OPEN + attempting to initiate a connection. + + The state diagram in Figure 5 illustrates only state changes, + together with the causing events and resulting actions, but addresses + neither error conditions nor actions that are not connected with + state changes. In a later section, more detail is offered with + respect to the reaction of the TCP implementation to events. Some + state names are abbreviated or hyphenated differently in the diagram + from how they appear elsewhere in the document. + + NOTA BENE: This diagram is only a summary and must not be taken as + the total specification. Many details are not included. + + +---------+ ---------\ active OPEN + | CLOSED | \ ----------- + +---------+<---------\ \ create TCB + | ^ \ \ snd SYN + passive OPEN | | CLOSE \ \ + ------------ | | ---------- \ \ + create TCB | | delete TCB \ \ + V | \ \ + rcv RST (note 1) +---------+ CLOSE | \ + -------------------->| LISTEN | ---------- | | + / +---------+ delete TCB | | + / rcv SYN | | SEND | | + / ----------- | | ------- | V + +--------+ snd SYN,ACK / \ snd SYN +--------+ + | |<----------------- ------------------>| | + | SYN | rcv SYN | SYN | + | RCVD |<-----------------------------------------------| SENT | + | | snd SYN,ACK | | + | |------------------ -------------------| | + +--------+ rcv ACK of SYN \ / rcv SYN,ACK +--------+ + | -------------- | | ----------- + | x | | snd ACK + | V V + | CLOSE +---------+ + | ------- | ESTAB | + | snd FIN +---------+ + | CLOSE | | rcv FIN + V ------- | | ------- + +---------+ snd FIN / \ snd ACK +---------+ + | FIN |<---------------- ------------------>| CLOSE | + | WAIT-1 |------------------ | WAIT | + +---------+ rcv FIN \ +---------+ + | rcv ACK of FIN ------- | CLOSE | + | -------------- snd ACK | ------- | + V x V snd FIN V + +---------+ +---------+ +---------+ + |FINWAIT-2| | CLOSING | | LAST-ACK| + +---------+ +---------+ +---------+ + | rcv ACK of FIN | rcv ACK of FIN | + | rcv FIN -------------- | Timeout=2MSL -------------- | + | ------- x V ------------ x V + \ snd ACK +---------+delete TCB +---------+ + -------------------->|TIME-WAIT|------------------->| CLOSED | + +---------+ +---------+ + + Figure 5: TCP Connection State Diagram + + The following notes apply to Figure 5: + + Note 1: The transition from SYN-RECEIVED to LISTEN on receiving a + RST is conditional on having reached SYN-RECEIVED after a passive + OPEN. + + Note 2: The figure omits a transition from FIN-WAIT-1 to TIME-WAIT + if a FIN is received and the local FIN is also acknowledged. + + Note 3: A RST can be sent from any state with a corresponding + transition to TIME-WAIT (see [70] for rationale). These + transitions are not explicitly shown; otherwise, the diagram would + become very difficult to read. Similarly, receipt of a RST from + any state results in a transition to LISTEN or CLOSED, though this + is also omitted from the diagram for legibility. + +3.4. Sequence Numbers + + A fundamental notion in the design is that every octet of data sent + over a TCP connection has a sequence number. Since every octet is + sequenced, each of them can be acknowledged. The acknowledgment + mechanism employed is cumulative so that an acknowledgment of + sequence number X indicates that all octets up to but not including X + have been received. This mechanism allows for straightforward + duplicate detection in the presence of retransmission. The numbering + scheme of octets within a segment is as follows: the first data octet + immediately following the header is the lowest numbered, and the + following octets are numbered consecutively. + + It is essential to remember that the actual sequence number space is + finite, though large. This space ranges from 0 to 2^32 - 1. Since + the space is finite, all arithmetic dealing with sequence numbers + must be performed modulo 2^32. This unsigned arithmetic preserves + the relationship of sequence numbers as they cycle from 2^32 - 1 to 0 + again. There are some subtleties to computer modulo arithmetic, so + great care should be taken in programming the comparison of such + values. The symbol "=<" means "less than or equal" (modulo 2^32). + + The typical kinds of sequence number comparisons that the TCP + implementation must perform include: + + (a) Determining that an acknowledgment refers to some sequence + number sent but not yet acknowledged. + + (b) Determining that all sequence numbers occupied by a segment have + been acknowledged (e.g., to remove the segment from a + retransmission queue). + + (c) Determining that an incoming segment contains sequence numbers + that are expected (i.e., that the segment "overlaps" the receive + window). + + In response to sending data, the TCP endpoint will receive + acknowledgments. The following comparisons are needed to process the + acknowledgments: + + SND.UNA = oldest unacknowledged sequence number + + SND.NXT = next sequence number to be sent + + SEG.ACK = acknowledgment from the receiving TCP peer (next + sequence number expected by the receiving TCP peer) + + SEG.SEQ = first sequence number of a segment + + SEG.LEN = the number of octets occupied by the data in the segment + (counting SYN and FIN) + + SEG.SEQ+SEG.LEN-1 = last sequence number of a segment + + A new acknowledgment (called an "acceptable ack") is one for which + the inequality below holds: + + SND.UNA < SEG.ACK =< SND.NXT + + A segment on the retransmission queue is fully acknowledged if the + sum of its sequence number and length is less than or equal to the + acknowledgment value in the incoming segment. + + When data is received, the following comparisons are needed: + + RCV.NXT = next sequence number expected on an incoming segment, + and is the left or lower edge of the receive window + + RCV.NXT+RCV.WND-1 = last sequence number expected on an incoming + segment, and is the right or upper edge of the receive window + + SEG.SEQ = first sequence number occupied by the incoming segment + + SEG.SEQ+SEG.LEN-1 = last sequence number occupied by the incoming + segment + + A segment is judged to occupy a portion of valid receive sequence + space if + + RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND + + or + + RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND + + The first part of this test checks to see if the beginning of the + segment falls in the window, the second part of the test checks to + see if the end of the segment falls in the window; if the segment + passes either part of the test, it contains data in the window. + + Actually, it is a little more complicated than this. Due to zero + windows and zero-length segments, we have four cases for the + acceptability of an incoming segment: + + +=========+=========+======================================+ + | Segment | Receive | Test | + | Length | Window | | + +=========+=========+======================================+ + | 0 | 0 | SEG.SEQ = RCV.NXT | + +---------+---------+--------------------------------------+ + | 0 | >0 | RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND | + +---------+---------+--------------------------------------+ + | >0 | 0 | not acceptable | + +---------+---------+--------------------------------------+ + | >0 | >0 | RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND | + | | | | + | | | or | + | | | | + | | | RCV.NXT =< SEG.SEQ+SEG.LEN-1 < | + | | | RCV.NXT+RCV.WND | + +---------+---------+--------------------------------------+ + + Table 5: Segment Acceptability Tests + + Note that when the receive window is zero no segments should be + acceptable except ACK segments. Thus, it is possible for a TCP + implementation to maintain a zero receive window while transmitting + data and receiving ACKs. A TCP receiver MUST process the RST and URG + fields of all incoming segments, even when the receive window is zero + (MUST-66). + + We have taken advantage of the numbering scheme to protect certain + control information as well. This is achieved by implicitly + including some control flags in the sequence space so they can be + retransmitted and acknowledged without confusion (i.e., one and only + one copy of the control will be acted upon). Control information is + not physically carried in the segment data space. Consequently, we + must adopt rules for implicitly assigning sequence numbers to + control. The SYN and FIN are the only controls requiring this + protection, and these controls are used only at connection opening + and closing. For sequence number purposes, the SYN is considered to + occur before the first actual data octet of the segment in which it + occurs, while the FIN is considered to occur after the last actual + data octet in a segment in which it occurs. The segment length + (SEG.LEN) includes both data and sequence space-occupying controls. + When a SYN is present, then SEG.SEQ is the sequence number of the + SYN. + +3.4.1. Initial Sequence Number Selection + + A connection is defined by a pair of sockets. Connections can be + reused. New instances of a connection will be referred to as + incarnations of the connection. The problem that arises from this is + -- "how does the TCP implementation identify duplicate segments from + previous incarnations of the connection?" This problem becomes + apparent if the connection is being opened and closed in quick + succession, or if the connection breaks with loss of memory and is + then reestablished. To support this, the TIME-WAIT state limits the + rate of connection reuse, while the initial sequence number selection + described below further protects against ambiguity about which + incarnation of a connection an incoming packet corresponds to. + + To avoid confusion, we must prevent segments from one incarnation of + a connection from being used while the same sequence numbers may + still be present in the network from an earlier incarnation. We want + to assure this even if a TCP endpoint loses all knowledge of the + sequence numbers it has been using. When new connections are + created, an initial sequence number (ISN) generator is employed that + selects a new 32-bit ISN. There are security issues that result if + an off-path attacker is able to predict or guess ISN values [42]. + + TCP initial sequence numbers are generated from a number sequence + that monotonically increases until it wraps, known loosely as a + "clock". This clock is a 32-bit counter that typically increments at + least once every roughly 4 microseconds, although it is neither + assumed to be realtime nor precise, and need not persist across + reboots. The clock component is intended to ensure that with a + Maximum Segment Lifetime (MSL), generated ISNs will be unique since + it cycles approximately every 4.55 hours, which is much longer than + the MSL. Please note that for modern networks that support high data + rates where the connection might start and quickly advance sequence + numbers to overlap within the MSL, it is recommended to implement the + Timestamp Option as mentioned later in Section 3.4.3. + + A TCP implementation MUST use the above type of "clock" for clock- + driven selection of initial sequence numbers (MUST-8), and SHOULD + generate its initial sequence numbers with the expression: + + ISN = M + F(localip, localport, remoteip, remoteport, secretkey) + + where M is the 4 microsecond timer, and F() is a pseudorandom + function (PRF) of the connection's identifying parameters ("localip, + localport, remoteip, remoteport") and a secret key ("secretkey") + (SHLD-1). F() MUST NOT be computable from the outside (MUST-9), or + an attacker could still guess at sequence numbers from the ISN used + for some other connection. The PRF could be implemented as a + cryptographic hash of the concatenation of the TCP connection + parameters and some secret data. For discussion of the selection of + a specific hash algorithm and management of the secret key data, + please see Section 3 of [42]. + + For each connection there is a send sequence number and a receive + sequence number. The initial send sequence number (ISS) is chosen by + the data sending TCP peer, and the initial receive sequence number + (IRS) is learned during the connection-establishing procedure. + + For a connection to be established or initialized, the two TCP peers + must synchronize on each other's initial sequence numbers. This is + done in an exchange of connection-establishing segments carrying a + control bit called "SYN" (for synchronize) and the initial sequence + numbers. As a shorthand, segments carrying the SYN bit are also + called "SYNs". Hence, the solution requires a suitable mechanism for + picking an initial sequence number and a slightly involved handshake + to exchange the ISNs. + + The synchronization requires each side to send its own initial + sequence number and to receive a confirmation of it in acknowledgment + from the remote TCP peer. Each side must also receive the remote + peer's initial sequence number and send a confirming acknowledgment. + + 1) A --> B SYN my sequence number is X + 2) A <-- B ACK your sequence number is X + 3) A <-- B SYN my sequence number is Y + 4) A --> B ACK your sequence number is Y + + Because steps 2 and 3 can be combined in a single message this is + called the three-way (or three message) handshake (3WHS). + + A 3WHS is necessary because sequence numbers are not tied to a global + clock in the network, and TCP implementations may have different + mechanisms for picking the ISNs. The receiver of the first SYN has + no way of knowing whether the segment was an old one or not, unless + it remembers the last sequence number used on the connection (which + is not always possible), and so it must ask the sender to verify this + SYN. The three-way handshake and the advantages of a clock-driven + scheme for ISN selection are discussed in [69]. + +3.4.2. Knowing When to Keep Quiet + + A theoretical problem exists where data could be corrupted due to + confusion between old segments in the network and new ones after a + host reboots if the same port numbers and sequence space are reused. + The "quiet time" concept discussed below addresses this, and the + discussion of it is included for situations where it might be + relevant, although it is not felt to be necessary in most current + implementations. The problem was more relevant earlier in the + history of TCP. In practical use on the Internet today, the error- + prone conditions are sufficiently unlikely that it is safe to ignore. + Reasons why it is now negligible include: (a) ISS and ephemeral port + randomization have reduced likelihood of reuse of port numbers and + sequence numbers after reboots, (b) the effective MSL of the Internet + has declined as links have become faster, and (c) reboots often + taking longer than an MSL anyways. + + To be sure that a TCP implementation does not create a segment + carrying a sequence number that may be duplicated by an old segment + remaining in the network, the TCP endpoint must keep quiet for an MSL + before assigning any sequence numbers upon starting up or recovering + from a situation where memory of sequence numbers in use was lost. + For this specification the MSL is taken to be 2 minutes. This is an + engineering choice, and may be changed if experience indicates it is + desirable to do so. Note that if a TCP endpoint is reinitialized in + some sense, yet retains its memory of sequence numbers in use, then + it need not wait at all; it must only be sure to use sequence numbers + larger than those recently used. + +3.4.3. The TCP Quiet Time Concept + + Hosts that for any reason lose knowledge of the last sequence numbers + transmitted on each active (i.e., not closed) connection shall delay + emitting any TCP segments for at least the agreed MSL in the internet + system that the host is a part of. In the paragraphs below, an + explanation for this specification is given. TCP implementers may + violate the "quiet time" restriction, but only at the risk of causing + some old data to be accepted as new or new data rejected as old + duplicated data by some receivers in the internet system. + + TCP endpoints consume sequence number space each time a segment is + formed and entered into the network output queue at a source host. + The duplicate detection and sequencing algorithm in TCP relies on the + unique binding of segment data to sequence space to the extent that + sequence numbers will not cycle through all 2^32 values before the + segment data bound to those sequence numbers has been delivered and + acknowledged by the receiver and all duplicate copies of the segments + have "drained" from the internet. Without such an assumption, two + distinct TCP segments could conceivably be assigned the same or + overlapping sequence numbers, causing confusion at the receiver as to + which data is new and which is old. Remember that each segment is + bound to as many consecutive sequence numbers as there are octets of + data and SYN or FIN flags in the segment. + + Under normal conditions, TCP implementations keep track of the next + sequence number to emit and the oldest awaiting acknowledgment so as + to avoid mistakenly reusing a sequence number before its first use + has been acknowledged. This alone does not guarantee that old + duplicate data is drained from the net, so the sequence space has + been made large to reduce the probability that a wandering duplicate + will cause trouble upon arrival. At 2 megabits/sec., it takes 4.5 + hours to use up 2^32 octets of sequence space. Since the maximum + segment lifetime in the net is not likely to exceed a few tens of + seconds, this is deemed ample protection for foreseeable nets, even + if data rates escalate to 10s of megabits/sec. At 100 megabits/sec., + the cycle time is 5.4 minutes, which may be a little short but still + within reason. Much higher data rates are possible today, with + implications described in the final paragraph of this subsection. + + The basic duplicate detection and sequencing algorithm in TCP can be + defeated, however, if a source TCP endpoint does not have any memory + of the sequence numbers it last used on a given connection. For + example, if the TCP implementation were to start all connections with + sequence number 0, then upon the host rebooting, a TCP peer might re- + form an earlier connection (possibly after half-open connection + resolution) and emit packets with sequence numbers identical to or + overlapping with packets still in the network, which were emitted on + an earlier incarnation of the same connection. In the absence of + knowledge about the sequence numbers used on a particular connection, + the TCP specification recommends that the source delay for MSL + seconds before emitting segments on the connection, to allow time for + segments from the earlier connection incarnation to drain from the + system. + + Even hosts that can remember the time of day and use it to select + initial sequence number values are not immune from this problem + (i.e., even if time of day is used to select an initial sequence + number for each new connection incarnation). + + Suppose, for example, that a connection is opened starting with + sequence number S. Suppose that this connection is not used much and + that eventually the initial sequence number function (ISN(t)) takes + on a value equal to the sequence number, say S1, of the last segment + sent by this TCP endpoint on a particular connection. Now suppose, + at this instant, the host reboots and establishes a new incarnation + of the connection. The initial sequence number chosen is S1 = ISN(t) + -- last used sequence number on old incarnation of connection! If + the recovery occurs quickly enough, any old duplicates in the net + bearing sequence numbers in the neighborhood of S1 may arrive and be + treated as new packets by the receiver of the new incarnation of the + connection. + + The problem is that the recovering host may not know for how long it + was down between rebooting nor does it know whether there are still + old duplicates in the system from earlier connection incarnations. + + One way to deal with this problem is to deliberately delay emitting + segments for one MSL after recovery from a reboot -- this is the + "quiet time" specification. Hosts that prefer to avoid waiting and + are willing to risk possible confusion of old and new packets at a + given destination may choose not to wait for the "quiet time". + Implementers may provide TCP users with the ability to select on a + connection-by-connection basis whether to wait after a reboot, or may + informally implement the "quiet time" for all connections. + Obviously, even where a user selects to "wait", this is not necessary + after the host has been "up" for at least MSL seconds. + + To summarize: every segment emitted occupies one or more sequence + numbers in the sequence space, and the numbers occupied by a segment + are "busy" or "in use" until MSL seconds have passed. Upon + rebooting, a block of space-time is occupied by the octets and SYN or + FIN flags of any potentially still in-flight segments. If a new + connection is started too soon and uses any of the sequence numbers + in the space-time footprint of those potentially still in-flight + segments of the previous connection incarnation, there is a potential + sequence number overlap area that could cause confusion at the + receiver. + + High-performance cases will have shorter cycle times than those in + the megabits per second that the base TCP design described above + considers. At 1 Gbps, the cycle time is 34 seconds, only 3 seconds + at 10 Gbps, and around a third of a second at 100 Gbps. In these + higher-performance cases, TCP Timestamp Options and Protection + Against Wrapped Sequences (PAWS) [47] provide the needed capability + to detect and discard old duplicates. + +3.5. Establishing a Connection + + The "three-way handshake" is the procedure used to establish a + connection. This procedure normally is initiated by one TCP peer and + responded to by another TCP peer. The procedure also works if two + TCP peers simultaneously initiate the procedure. When simultaneous + open occurs, each TCP peer receives a SYN segment that carries no + acknowledgment after it has sent a SYN. Of course, the arrival of an + old duplicate SYN segment can potentially make it appear, to the + recipient, that a simultaneous connection initiation is in progress. + Proper use of "reset" segments can disambiguate these cases. + + Several examples of connection initiation follow. Although these + examples do not show connection synchronization using data-carrying + segments, this is perfectly legitimate, so long as the receiving TCP + endpoint doesn't deliver the data to the user until it is clear the + data is valid (e.g., the data is buffered at the receiver until the + connection reaches the ESTABLISHED state, given that the three-way + handshake reduces the possibility of false connections). It is a + trade-off between memory and messages to provide information for this + checking. + + The simplest 3WHS is shown in Figure 6. The figures should be + interpreted in the following way. Each line is numbered for + reference purposes. Right arrows (-->) indicate departure of a TCP + segment from TCP Peer A to TCP Peer B or arrival of a segment at B + from A. Left arrows (<--) indicate the reverse. Ellipses (...) + indicate a segment that is still in the network (delayed). Comments + appear in parentheses. TCP connection states represent the state + AFTER the departure or arrival of the segment (whose contents are + shown in the center of each line). Segment contents are shown in + abbreviated form, with sequence number, control flags, and ACK field. + Other fields such as window, addresses, lengths, and text have been + left out in the interest of clarity. + + TCP Peer A TCP Peer B + + 1. CLOSED LISTEN + + 2. SYN-SENT --> <SEQ=100><CTL=SYN> --> SYN-RECEIVED + + 3. ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED + + 4. ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK> --> ESTABLISHED + + 5. ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED + + Figure 6: Basic Three-Way Handshake for Connection Synchronization + + In line 2 of Figure 6, TCP Peer A begins by sending a SYN segment + indicating that it will use sequence numbers starting with sequence + number 100. In line 3, TCP Peer B sends a SYN and acknowledges the + SYN it received from TCP Peer A. Note that the acknowledgment field + indicates TCP Peer B is now expecting to hear sequence 101, + acknowledging the SYN that occupied sequence 100. + + At line 4, TCP Peer A responds with an empty segment containing an + ACK for TCP Peer B's SYN; and in line 5, TCP Peer A sends some data. + Note that the sequence number of the segment in line 5 is the same as + in line 4 because the ACK does not occupy sequence number space (if + it did, we would wind up ACKing ACKs!). + + Simultaneous initiation is only slightly more complex, as is shown in + Figure 7. Each TCP peer's connection state cycles from CLOSED to + SYN-SENT to SYN-RECEIVED to ESTABLISHED. + + TCP Peer A TCP Peer B + + 1. CLOSED CLOSED + + 2. SYN-SENT --> <SEQ=100><CTL=SYN> ... + + 3. SYN-RECEIVED <-- <SEQ=300><CTL=SYN> <-- SYN-SENT + + 4. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED + + 5. SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ... + + 6. ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED + + 7. ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED + + Figure 7: Simultaneous Connection Synchronization + + A TCP implementation MUST support simultaneous open attempts (MUST- + 10). + + Note that a TCP implementation MUST keep track of whether a + connection has reached SYN-RECEIVED state as the result of a passive + OPEN or an active OPEN (MUST-11). + + The principal reason for the three-way handshake is to prevent old + duplicate connection initiations from causing confusion. To deal + with this, a special control message, reset, is specified. If the + receiving TCP peer is in a non-synchronized state (i.e., SYN-SENT, + SYN-RECEIVED), it returns to LISTEN on receiving an acceptable reset. + If the TCP peer is in one of the synchronized states (ESTABLISHED, + FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), it + aborts the connection and informs its user. We discuss this latter + case under "half-open" connections below. + + TCP Peer A TCP Peer B + + 1. CLOSED LISTEN + + 2. SYN-SENT --> <SEQ=100><CTL=SYN> ... + + 3. (duplicate) ... <SEQ=90><CTL=SYN> --> SYN-RECEIVED + + 4. SYN-SENT <-- <SEQ=300><ACK=91><CTL=SYN,ACK> <-- SYN-RECEIVED + + 5. SYN-SENT --> <SEQ=91><CTL=RST> --> LISTEN + + 6. ... <SEQ=100><CTL=SYN> --> SYN-RECEIVED + + 7. ESTABLISHED <-- <SEQ=400><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED + + 8. ESTABLISHED --> <SEQ=101><ACK=401><CTL=ACK> --> ESTABLISHED + + Figure 8: Recovery from Old Duplicate SYN + + As a simple example of recovery from old duplicates, consider + Figure 8. At line 3, an old duplicate SYN arrives at TCP Peer B. + TCP Peer B cannot tell that this is an old duplicate, so it responds + normally (line 4). TCP Peer A detects that the ACK field is + incorrect and returns a RST (reset) with its SEQ field selected to + make the segment believable. TCP Peer B, on receiving the RST, + returns to the LISTEN state. When the original SYN finally arrives + at line 6, the synchronization proceeds normally. If the SYN at line + 6 had arrived before the RST, a more complex exchange might have + occurred with RSTs sent in both directions. + +3.5.1. Half-Open Connections and Other Anomalies + + An established connection is said to be "half-open" if one of the TCP + peers has closed or aborted the connection at its end without the + knowledge of the other, or if the two ends of the connection have + become desynchronized owing to a failure or reboot that resulted in + loss of memory. Such connections will automatically become reset if + an attempt is made to send data in either direction. However, half- + open connections are expected to be unusual. + + If at site A the connection no longer exists, then an attempt by the + user at site B to send any data on it will result in the site B TCP + endpoint receiving a reset control message. Such a message indicates + to the site B TCP endpoint that something is wrong, and it is + expected to abort the connection. + + Assume that two user processes A and B are communicating with one + another when a failure or reboot occurs causing loss of memory to A's + TCP implementation. Depending on the operating system supporting A's + TCP implementation, it is likely that some error recovery mechanism + exists. When the TCP endpoint is up again, A is likely to start + again from the beginning or from a recovery point. As a result, A + will probably try to OPEN the connection again or try to SEND on the + connection it believes open. In the latter case, it receives the + error message "connection not open" from the local (A's) TCP + implementation. In an attempt to establish the connection, A's TCP + implementation will send a segment containing SYN. This scenario + leads to the example shown in Figure 9. After TCP Peer A reboots, + the user attempts to reopen the connection. TCP Peer B, in the + meantime, thinks the connection is open. + + TCP Peer A TCP Peer B + + 1. (REBOOT) (send 300,receive 100) + + 2. CLOSED ESTABLISHED + + 3. SYN-SENT --> <SEQ=400><CTL=SYN> --> (??) + + 4. (!!) <-- <SEQ=300><ACK=100><CTL=ACK> <-- ESTABLISHED + + 5. SYN-SENT --> <SEQ=100><CTL=RST> --> (Abort!!) + + 6. SYN-SENT CLOSED + + 7. SYN-SENT --> <SEQ=400><CTL=SYN> --> + + Figure 9: Half-Open Connection Discovery + + When the SYN arrives at line 3, TCP Peer B, being in a synchronized + state, and the incoming segment outside the window, responds with an + acknowledgment indicating what sequence it next expects to hear (ACK + 100). TCP Peer A sees that this segment does not acknowledge + anything it sent and, being unsynchronized, sends a reset (RST) + because it has detected a half-open connection. TCP Peer B aborts at + line 5. TCP Peer A will continue to try to establish the connection; + the problem is now reduced to the basic three-way handshake of + Figure 6. + + An interesting alternative case occurs when TCP Peer A reboots and + TCP Peer B tries to send data on what it thinks is a synchronized + connection. This is illustrated in Figure 10. In this case, the + data arriving at TCP Peer A from TCP Peer B (line 2) is unacceptable + because no such connection exists, so TCP Peer A sends a RST. The + RST is acceptable so TCP Peer B processes it and aborts the + connection. + + TCP Peer A TCP Peer B + + 1. (REBOOT) (send 300,receive 100) + + 2. (??) <-- <SEQ=300><ACK=100><DATA=10><CTL=ACK> <-- ESTABLISHED + + 3. --> <SEQ=100><CTL=RST> --> (ABORT!!) + + Figure 10: Active Side Causes Half-Open Connection Discovery + + In Figure 11, two TCP Peers A and B with passive connections waiting + for SYN are depicted. An old duplicate arriving at TCP Peer B (line + 2) stirs B into action. A SYN-ACK is returned (line 3) and causes + TCP A to generate a RST (the ACK in line 3 is not acceptable). TCP + Peer B accepts the reset and returns to its passive LISTEN state. + + TCP Peer A TCP Peer B + + 1. LISTEN LISTEN + + 2. ... <SEQ=Z><CTL=SYN> --> SYN-RECEIVED + + 3. (??) <-- <SEQ=X><ACK=Z+1><CTL=SYN,ACK> <-- SYN-RECEIVED + + 4. --> <SEQ=Z+1><CTL=RST> --> (return to LISTEN!) + + 5. LISTEN LISTEN + + Figure 11: Old Duplicate SYN Initiates a Reset on Two Passive Sockets + + A variety of other cases are possible, all of which are accounted for + by the following rules for RST generation and processing. + +3.5.2. Reset Generation + + A TCP user or application can issue a reset on a connection at any + time, though reset events are also generated by the protocol itself + when various error conditions occur, as described below. The side of + a connection issuing a reset should enter the TIME-WAIT state, as + this generally helps to reduce the load on busy servers for reasons + described in [70]. + + As a general rule, reset (RST) is sent whenever a segment arrives + that apparently is not intended for the current connection. A reset + must not be sent if it is not clear that this is the case. + + There are three groups of states: + + 1. If the connection does not exist (CLOSED), then a reset is sent + in response to any incoming segment except another reset. A SYN + segment that does not match an existing connection is rejected by + this means. + + If the incoming segment has the ACK bit set, the reset takes its + sequence number from the ACK field of the segment; otherwise, the + reset has sequence number zero and the ACK field is set to the + sum of the sequence number and segment length of the incoming + segment. The connection remains in the CLOSED state. + + 2. If the connection is in any non-synchronized state (LISTEN, SYN- + SENT, SYN-RECEIVED), and the incoming segment acknowledges + something not yet sent (the segment carries an unacceptable ACK), + or if an incoming segment has a security level or compartment + (Appendix A.1) that does not exactly match the level and + compartment requested for the connection, a reset is sent. + + If the incoming segment has an ACK field, the reset takes its + sequence number from the ACK field of the segment; otherwise, the + reset has sequence number zero and the ACK field is set to the + sum of the sequence number and segment length of the incoming + segment. The connection remains in the same state. + + 3. If the connection is in a synchronized state (ESTABLISHED, FIN- + WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT), + any unacceptable segment (out-of-window sequence number or + unacceptable acknowledgment number) must be responded to with an + empty acknowledgment segment (without any user data) containing + the current send sequence number and an acknowledgment indicating + the next sequence number expected to be received, and the + connection remains in the same state. + + If an incoming segment has a security level or compartment that + does not exactly match the level and compartment requested for + the connection, a reset is sent and the connection goes to the + CLOSED state. The reset takes its sequence number from the ACK + field of the incoming segment. + +3.5.3. Reset Processing + + In all states except SYN-SENT, all reset (RST) segments are validated + by checking their SEQ fields. A reset is valid if its sequence + number is in the window. In the SYN-SENT state (a RST received in + response to an initial SYN), the RST is acceptable if the ACK field + acknowledges the SYN. + + The receiver of a RST first validates it, then changes state. If the + receiver was in the LISTEN state, it ignores it. If the receiver was + in SYN-RECEIVED state and had previously been in the LISTEN state, + then the receiver returns to the LISTEN state; otherwise, the + receiver aborts the connection and goes to the CLOSED state. If the + receiver was in any other state, it aborts the connection and advises + the user and goes to the CLOSED state. + + TCP implementations SHOULD allow a received RST segment to include + data (SHLD-2). It has been suggested that a RST segment could + contain diagnostic data that explains the cause of the RST. No + standard has yet been established for such data. + +3.6. Closing a Connection + + CLOSE is an operation meaning "I have no more data to send." The + notion of closing a full-duplex connection is subject to ambiguous + interpretation, of course, since it may not be obvious how to treat + the receiving side of the connection. We have chosen to treat CLOSE + in a simplex fashion. The user who CLOSEs may continue to RECEIVE + until the TCP receiver is told that the remote peer has CLOSED also. + Thus, a program could initiate several SENDs followed by a CLOSE, and + then continue to RECEIVE until signaled that a RECEIVE failed because + the remote peer has CLOSED. The TCP implementation will signal a + user, even if no RECEIVEs are outstanding, that the remote peer has + closed, so the user can terminate their side gracefully. A TCP + implementation will reliably deliver all buffers SENT before the + connection was CLOSED so a user who expects no data in return need + only wait to hear the connection was CLOSED successfully to know that + all their data was received at the destination TCP endpoint. Users + must keep reading connections they close for sending until the TCP + implementation indicates there is no more data. + + There are essentially three cases: + + 1) The user initiates by telling the TCP implementation to CLOSE the + connection (TCP Peer A in Figure 12). + + 2) The remote TCP endpoint initiates by sending a FIN control signal + (TCP Peer B in Figure 12). + + 3) Both users CLOSE simultaneously (Figure 13). + + Case 1: Local user initiates the close + + In this case, a FIN segment can be constructed and placed on the + outgoing segment queue. No further SENDs from the user will be + accepted by the TCP implementation, and it enters the FIN-WAIT-1 + state. RECEIVEs are allowed in this state. All segments + preceding and including FIN will be retransmitted until + acknowledged. When the other TCP peer has both acknowledged the + FIN and sent a FIN of its own, the first TCP peer can ACK this + FIN. Note that a TCP endpoint receiving a FIN will ACK but not + send its own FIN until its user has CLOSED the connection also. + + Case 2: TCP endpoint receives a FIN from the network + + If an unsolicited FIN arrives from the network, the receiving TCP + endpoint can ACK it and tell the user that the connection is + closing. The user will respond with a CLOSE, upon which the TCP + endpoint can send a FIN to the other TCP peer after sending any + remaining data. The TCP endpoint then waits until its own FIN is + acknowledged whereupon it deletes the connection. If an ACK is + not forthcoming, after the user timeout the connection is aborted + and the user is told. + + Case 3: Both users close simultaneously + + A simultaneous CLOSE by users at both ends of a connection causes + FIN segments to be exchanged (Figure 13). When all segments + preceding the FINs have been processed and acknowledged, each TCP + peer can ACK the FIN it has received. Both will, upon receiving + these ACKs, delete the connection. + + TCP Peer A TCP Peer B + + 1. ESTABLISHED ESTABLISHED + + 2. (Close) + FIN-WAIT-1 --> <SEQ=100><ACK=300><CTL=FIN,ACK> --> CLOSE-WAIT + + 3. FIN-WAIT-2 <-- <SEQ=300><ACK=101><CTL=ACK> <-- CLOSE-WAIT + + 4. (Close) + TIME-WAIT <-- <SEQ=300><ACK=101><CTL=FIN,ACK> <-- LAST-ACK + + 5. TIME-WAIT --> <SEQ=101><ACK=301><CTL=ACK> --> CLOSED + + 6. (2 MSL) + CLOSED + + Figure 12: Normal Close Sequence + + TCP Peer A TCP Peer B + + 1. ESTABLISHED ESTABLISHED + + 2. (Close) (Close) + FIN-WAIT-1 --> <SEQ=100><ACK=300><CTL=FIN,ACK> ... FIN-WAIT-1 + <-- <SEQ=300><ACK=100><CTL=FIN,ACK> <-- + ... <SEQ=100><ACK=300><CTL=FIN,ACK> --> + + 3. CLOSING --> <SEQ=101><ACK=301><CTL=ACK> ... CLOSING + <-- <SEQ=301><ACK=101><CTL=ACK> <-- + ... <SEQ=101><ACK=301><CTL=ACK> --> + + 4. TIME-WAIT TIME-WAIT + (2 MSL) (2 MSL) + CLOSED CLOSED + + Figure 13: Simultaneous Close Sequence + + A TCP connection may terminate in two ways: (1) the normal TCP close + sequence using a FIN handshake (Figure 12), and (2) an "abort" in + which one or more RST segments are sent and the connection state is + immediately discarded. If the local TCP connection is closed by the + remote side due to a FIN or RST received from the remote side, then + the local application MUST be informed whether it closed normally or + was aborted (MUST-12). + + +3.6.1. Half-Closed Connections + + The normal TCP close sequence delivers buffered data reliably in both + directions. Since the two directions of a TCP connection are closed + independently, it is possible for a connection to be "half closed", + i.e., closed in only one direction, and a host is permitted to + continue sending data in the open direction on a half-closed + connection. + + A host MAY implement a "half-duplex" TCP close sequence, so that an + application that has called CLOSE cannot continue to read data from + the connection (MAY-1). If such a host issues a CLOSE call while + received data is still pending in the TCP connection, or if new data + is received after CLOSE is called, its TCP implementation SHOULD send + a RST to show that data was lost (SHLD-3). See [23], Section 2.17 + for discussion. + + When a connection is closed actively, it MUST linger in the TIME-WAIT + state for a time 2xMSL (Maximum Segment Lifetime) (MUST-13). + However, it MAY accept a new SYN from the remote TCP endpoint to + reopen the connection directly from TIME-WAIT state (MAY-2), if it: + + (1) assigns its initial sequence number for the new connection to be + larger than the largest sequence number it used on the previous + connection incarnation, and + + (2) returns to TIME-WAIT state if the SYN turns out to be an old + duplicate. + + When the TCP Timestamp Options are available, an improved algorithm + is described in [40] in order to support higher connection + establishment rates. This algorithm for reducing TIME-WAIT is a Best + Current Practice that SHOULD be implemented since Timestamp Options + are commonly used, and using them to reduce TIME-WAIT provides + benefits for busy Internet servers (SHLD-4). + +3.7. Segmentation + + The term "segmentation" refers to the activity TCP performs when + ingesting a stream of bytes from a sending application and + packetizing that stream of bytes into TCP segments. Individual TCP + segments often do not correspond one-for-one to individual send (or + socket write) calls from the application. Applications may perform + writes at the granularity of messages in the upper-layer protocol, + but TCP guarantees no correlation between the boundaries of TCP + segments sent and received and the boundaries of the read or write + buffers of user application data. In some specific protocols, such + as Remote Direct Memory Access (RDMA) using Direct Data Placement + (DDP) and Marker PDU Aligned Framing (MPA) [34], there are + performance optimizations possible when the relation between TCP + segments and application data units can be controlled, and MPA + includes a specific mechanism for detecting and verifying this + relationship between TCP segments and application message data + structures, but this is specific to applications like RDMA. In + general, multiple goals influence the sizing of TCP segments created + by a TCP implementation. + + Goals driving the sending of larger segments include: + + * Reducing the number of packets in flight within the network. + + * Increasing processing efficiency and potential performance by + enabling a smaller number of interrupts and inter-layer + interactions. + + * Limiting the overhead of TCP headers. + + Note that the performance benefits of sending larger segments may + decrease as the size increases, and there may be boundaries where + advantages are reversed. For instance, on some implementation + architectures, 1025 bytes within a segment could lead to worse + performance than 1024 bytes, due purely to data alignment on copy + operations. + + Goals driving the sending of smaller segments include: + + * Avoiding sending a TCP segment that would result in an IP datagram + larger than the smallest MTU along an IP network path because this + results in either packet loss or packet fragmentation. Making + matters worse, some firewalls or middleboxes may drop fragmented + packets or ICMP messages related to fragmentation. + + * Preventing delays to the application data stream, especially when + TCP is waiting on the application to generate more data, or when + the application is waiting on an event or input from its peer in + order to generate more data. + + * Enabling "fate sharing" between TCP segments and lower-layer data + units (e.g., below IP, for links with cell or frame sizes smaller + than the IP MTU). + + Towards meeting these competing sets of goals, TCP includes several + mechanisms, including the Maximum Segment Size Option, Path MTU + Discovery, the Nagle algorithm, and support for IPv6 Jumbograms, as + discussed in the following subsections. + +3.7.1. Maximum Segment Size Option + + TCP endpoints MUST implement both sending and receiving the MSS + Option (MUST-14). + + TCP implementations SHOULD send an MSS Option in every SYN segment + when its receive MSS differs from the default 536 for IPv4 or 1220 + for IPv6 (SHLD-5), and MAY send it always (MAY-3). + + If an MSS Option is not received at connection setup, TCP + implementations MUST assume a default send MSS of 536 (576 - 40) for + IPv4 or 1220 (1280 - 60) for IPv6 (MUST-15). + + The maximum size of a segment that a TCP endpoint really sends, the + "effective send MSS", MUST be the smaller (MUST-16) of the send MSS + (that reflects the available reassembly buffer size at the remote + host, the EMTU_R [19]) and the largest transmission size permitted by + the IP layer (EMTU_S [19]): + + Eff.snd.MSS = min(SendMSS+20, MMS_S) - TCPhdrsize - IPoptionsize + + where: + + * SendMSS is the MSS value received from the remote host, or the + default 536 for IPv4 or 1220 for IPv6, if no MSS Option is + received. + + * MMS_S is the maximum size for a transport-layer message that TCP + may send. + + * TCPhdrsize is the size of the fixed TCP header and any options. + This is 20 in the (rare) case that no options are present but may + be larger if TCP Options are to be sent. Note that some options + might not be included on all segments, but that for each segment + sent, the sender should adjust the data length accordingly, within + the Eff.snd.MSS. + + * IPoptionsize is the size of any IPv4 options or IPv6 extension + headers associated with a TCP connection. Note that some options + or extension headers might not be included on all packets, but + that for each segment sent, the sender should adjust the data + length accordingly, within the Eff.snd.MSS. + + The MSS value to be sent in an MSS Option should be equal to the + effective MTU minus the fixed IP and TCP headers. By ignoring both + IP and TCP Options when calculating the value for the MSS Option, if + there are any IP or TCP Options to be sent in a packet, then the + sender must decrease the size of the TCP data accordingly. RFC 6691 + [43] discusses this in greater detail. + + The MSS value to be sent in an MSS Option must be less than or equal + to: + + MMS_R - 20 + + where MMS_R is the maximum size for a transport-layer message that + can be received (and reassembled at the IP layer) (MUST-67). TCP + obtains MMS_R and MMS_S from the IP layer; see the generic call + GET_MAXSIZES in Section 3.4 of RFC 1122. These are defined in terms + of their IP MTU equivalents, EMTU_R and EMTU_S [19]. + + When TCP is used in a situation where either the IP or TCP headers + are not fixed, the sender must reduce the amount of TCP data in any + given packet by the number of octets used by the IP and TCP options. + This has been a point of confusion historically, as explained in RFC + 6691, Section 3.1. + +3.7.2. Path MTU Discovery + + A TCP implementation may be aware of the MTU on directly connected + links, but will rarely have insight about MTUs across an entire + network path. For IPv4, RFC 1122 recommends an IP-layer default + effective MTU of less than or equal to 576 for destinations not + directly connected, and for IPv6 this would be 1280. Using these + fixed values limits TCP connection performance and efficiency. + Instead, implementation of Path MTU Discovery (PMTUD) and + Packetization Layer Path MTU Discovery (PLPMTUD) is strongly + recommended in order for TCP to improve segmentation decisions. Both + PMTUD and PLPMTUD help TCP choose segment sizes that avoid both on- + path (for IPv4) and source fragmentation (IPv4 and IPv6). + + PMTUD for IPv4 [2] or IPv6 [14] is implemented in conjunction between + TCP, IP, and ICMP. It relies both on avoiding source fragmentation + and setting the IPv4 DF (don't fragment) flag, the latter to inhibit + on-path fragmentation. It relies on ICMP errors from routers along + the path whenever a segment is too large to traverse a link. Several + adjustments to a TCP implementation with PMTUD are described in RFC + 2923 in order to deal with problems experienced in practice [27]. + PLPMTUD [31] is a Standards Track improvement to PMTUD that relaxes + the requirement for ICMP support across a path, and improves + performance in cases where ICMP is not consistently conveyed, but + still tries to avoid source fragmentation. The mechanisms in all + four of these RFCs are recommended to be included in TCP + implementations. + + The TCP MSS Option specifies an upper bound for the size of packets + that can be received (see [43]). Hence, setting the value in the MSS + Option too small can impact the ability for PMTUD or PLPMTUD to find + a larger path MTU. RFC 1191 discusses this implication of many older + TCP implementations setting the TCP MSS to 536 (corresponding to the + IPv4 576 byte default MTU) for non-local destinations, rather than + deriving it from the MTUs of connected interfaces as recommended. + +3.7.3. Interfaces with Variable MTU Values + + The effective MTU can sometimes vary, as when used with variable + compression, e.g., RObust Header Compression (ROHC) [37]. It is + tempting for a TCP implementation to advertise the largest possible + MSS, to support the most efficient use of compressed payloads. + Unfortunately, some compression schemes occasionally need to transmit + full headers (and thus smaller payloads) to resynchronize state at + their endpoint compressors/decompressors. If the largest MTU is used + to calculate the value to advertise in the MSS Option, TCP + retransmission may interfere with compressor resynchronization. + + As a result, when the effective MTU of an interface varies packet-to- + packet, TCP implementations SHOULD use the smallest effective MTU of + the interface to calculate the value to advertise in the MSS Option + (SHLD-6). + +3.7.4. Nagle Algorithm + + The "Nagle algorithm" was described in RFC 896 [17] and was + recommended in RFC 1122 [19] for mitigation of an early problem of + too many small packets being generated. It has been implemented in + most current TCP code bases, sometimes with minor variations (see + Appendix A.3). + + If there is unacknowledged data (i.e., SND.NXT > SND.UNA), then the + sending TCP endpoint buffers all user data (regardless of the PSH + bit) until the outstanding data has been acknowledged or until the + TCP endpoint can send a full-sized segment (Eff.snd.MSS bytes). + + A TCP implementation SHOULD implement the Nagle algorithm to coalesce + short segments (SHLD-7). However, there MUST be a way for an + application to disable the Nagle algorithm on an individual + connection (MUST-17). In all cases, sending data is also subject to + the limitation imposed by the slow start algorithm [8]. + + Since there can be problematic interactions between the Nagle + algorithm and delayed acknowledgments, some implementations use minor + variations of the Nagle algorithm, such as the one described in + Appendix A.3. + +3.7.5. IPv6 Jumbograms + + In order to support TCP over IPv6 Jumbograms, implementations need to + be able to send TCP segments larger than the 64-KB limit that the MSS + Option can convey. RFC 2675 [24] defines that an MSS value of 65,535 + bytes is to be treated as infinity, and Path MTU Discovery [14] is + used to determine the actual MSS. + + The Jumbo Payload Option need not be implemented or understood by + IPv6 nodes that do not support attachment to links with an MTU + greater than 65,575 [24], and the present IPv6 Node Requirements does + not include support for Jumbograms [55]. + +3.8. Data Communication + + Once the connection is established, data is communicated by the + exchange of segments. Because segments may be lost due to errors + (checksum test failure) or network congestion, TCP uses + retransmission to ensure delivery of every segment. Duplicate + segments may arrive due to network or TCP retransmission. As + discussed in the section on sequence numbers (Section 3.4), the TCP + implementation performs certain tests on the sequence and + acknowledgment numbers in the segments to verify their acceptability. + + The sender of data keeps track of the next sequence number to use in + the variable SND.NXT. The receiver of data keeps track of the next + sequence number to expect in the variable RCV.NXT. The sender of + data keeps track of the oldest unacknowledged sequence number in the + variable SND.UNA. If the data flow is momentarily idle and all data + sent has been acknowledged, then the three variables will be equal. + + When the sender creates a segment and transmits it, the sender + advances SND.NXT. When the receiver accepts a segment, it advances + RCV.NXT and sends an acknowledgment. When the data sender receives + an acknowledgment, it advances SND.UNA. The extent to which the + values of these variables differ is a measure of the delay in the + communication. The amount by which the variables are advanced is the + length of the data and SYN or FIN flags in the segment. Note that, + once in the ESTABLISHED state, all segments must carry current + acknowledgment information. + + The CLOSE user call implies a push function (see Section 3.9.1), as + does the FIN control flag in an incoming segment. + +3.8.1. Retransmission Timeout + + Because of the variability of the networks that compose an + internetwork system and the wide range of uses of TCP connections, + the retransmission timeout (RTO) must be dynamically determined. + + The RTO MUST be computed according to the algorithm in [10], + including Karn's algorithm for taking RTT samples (MUST-18). + + RFC 793 contains an early example procedure for computing the RTO, + based on work mentioned in IEN 177 [71]. This was then replaced by + the algorithm described in RFC 1122, which was subsequently updated + in RFC 2988 and then again in RFC 6298. + + RFC 1122 allows that if a retransmitted packet is identical to the + original packet (which implies not only that the data boundaries have + not changed, but also that none of the headers have changed), then + the same IPv4 Identification field MAY be used (see Section 3.2.1.5 + of RFC 1122) (MAY-4). The same IP Identification field may be reused + anyways since it is only meaningful when a datagram is fragmented + [44]. TCP implementations should not rely on or typically interact + with this IPv4 header field in any way. It is not a reasonable way + to indicate duplicate sent segments nor to identify duplicate + received segments. + +3.8.2. TCP Congestion Control + + RFC 2914 [5] explains the importance of congestion control for the + Internet. + + RFC 1122 required implementation of Van Jacobson's congestion control + algorithms slow start and congestion avoidance together with + exponential backoff for successive RTO values for the same segment. + RFC 2581 provided IETF Standards Track description of slow start and + congestion avoidance, along with fast retransmit and fast recovery. + RFC 5681 is the current description of these algorithms and is the + current Standards Track specification providing guidelines for TCP + congestion control. RFC 6298 describes exponential backoff of RTO + values, including keeping the backed-off value until a subsequent + segment with new data has been sent and acknowledged without + retransmission. + + A TCP endpoint MUST implement the basic congestion control algorithms + slow start, congestion avoidance, and exponential backoff of RTO to + avoid creating congestion collapse conditions (MUST-19). RFC 5681 + and RFC 6298 describe the basic algorithms on the IETF Standards + Track that are broadly applicable. Multiple other suitable + algorithms exist and have been widely used. Many TCP implementations + support a set of alternative algorithms that can be configured for + use on the endpoint. An endpoint MAY implement such alternative + algorithms provided that the algorithms are conformant with the TCP + specifications from the IETF Standards Track as described in RFC + 2914, RFC 5033 [7], and RFC 8961 [15] (MAY-18). + + Explicit Congestion Notification (ECN) was defined in RFC 3168 and is + an IETF Standards Track enhancement that has many benefits [51]. + + A TCP endpoint SHOULD implement ECN as described in RFC 3168 (SHLD- + 8). + +3.8.3. TCP Connection Failures + + Excessive retransmission of the same segment by a TCP endpoint + indicates some failure of the remote host or the internetwork path. + This failure may be of short or long duration. The following + procedure MUST be used to handle excessive retransmissions of data + segments (MUST-20): + + (a) There are two thresholds R1 and R2 measuring the amount of + retransmission that has occurred for the same segment. R1 and + R2 might be measured in time units or as a count of + retransmissions (with the current RTO and corresponding backoffs + as a conversion factor, if needed). + + (b) When the number of transmissions of the same segment reaches or + exceeds threshold R1, pass negative advice (see Section 3.3.1.4 + of [19]) to the IP layer, to trigger dead-gateway diagnosis. + + (c) When the number of transmissions of the same segment reaches a + threshold R2 greater than R1, close the connection. + + (d) An application MUST (MUST-21) be able to set the value for R2 + for a particular connection. For example, an interactive + application might set R2 to "infinity", giving the user control + over when to disconnect. + + (e) TCP implementations SHOULD inform the application of the + delivery problem (unless such information has been disabled by + the application; see the "Asynchronous Reports" section + (Section 3.9.1.8)), when R1 is reached and before R2 (SHLD-9). + This will allow a remote login application program to inform the + user, for example. + + The value of R1 SHOULD correspond to at least 3 retransmissions, at + the current RTO (SHLD-10). The value of R2 SHOULD correspond to at + least 100 seconds (SHLD-11). + + An attempt to open a TCP connection could fail with excessive + retransmissions of the SYN segment or by receipt of a RST segment or + an ICMP Port Unreachable. SYN retransmissions MUST be handled in the + general way just described for data retransmissions, including + notification of the application layer. + + However, the values of R1 and R2 may be different for SYN and data + segments. In particular, R2 for a SYN segment MUST be set large + enough to provide retransmission of the segment for at least 3 + minutes (MUST-23). The application can close the connection (i.e., + give up on the open attempt) sooner, of course. + +3.8.4. TCP Keep-Alives + + A TCP connection is said to be "idle" if for some long amount of time + there have been no incoming segments received and there is no new or + unacknowledged data to be sent. + + Implementers MAY include "keep-alives" in their TCP implementations + (MAY-5), although this practice is not universally accepted. Some + TCP implementations, however, have included a keep-alive mechanism. + To confirm that an idle connection is still active, these + implementations send a probe segment designed to elicit a response + from the TCP peer. Such a segment generally contains SEG.SEQ = + SND.NXT-1 and may or may not contain one garbage octet of data. If + keep-alives are included, the application MUST be able to turn them + on or off for each TCP connection (MUST-24), and they MUST default to + off (MUST-25). + + Keep-alive packets MUST only be sent when no sent data is + outstanding, and no data or acknowledgment packets have been received + for the connection within an interval (MUST-26). This interval MUST + be configurable (MUST-27) and MUST default to no less than two hours + (MUST-28). + + It is extremely important to remember that ACK segments that contain + no data are not reliably transmitted by TCP. Consequently, if a + keep-alive mechanism is implemented it MUST NOT interpret failure to + respond to any specific probe as a dead connection (MUST-29). + + An implementation SHOULD send a keep-alive segment with no data + (SHLD-12); however, it MAY be configurable to send a keep-alive + segment containing one garbage octet (MAY-6), for compatibility with + erroneous TCP implementations. + +3.8.5. The Communication of Urgent Information + + As a result of implementation differences and middlebox interactions, + new applications SHOULD NOT employ the TCP urgent mechanism (SHLD- + 13). However, TCP implementations MUST still include support for the + urgent mechanism (MUST-30). Information on how some TCP + implementations interpret the urgent pointer can be found in RFC 6093 + [39]. + + The objective of the TCP urgent mechanism is to allow the sending + user to stimulate the receiving user to accept some urgent data and + to permit the receiving TCP endpoint to indicate to the receiving + user when all the currently known urgent data has been received by + the user. + + This mechanism permits a point in the data stream to be designated as + the end of urgent information. Whenever this point is in advance of + the receive sequence number (RCV.NXT) at the receiving TCP endpoint, + then the TCP implementation must tell the user to go into "urgent + mode"; when the receive sequence number catches up to the urgent + pointer, the TCP implementation must tell user to go into "normal + mode". If the urgent pointer is updated while the user is in "urgent + mode", the update will be invisible to the user. + + The method employs an urgent field that is carried in all segments + transmitted. The URG control flag indicates that the urgent field is + meaningful and must be added to the segment sequence number to yield + the urgent pointer. The absence of this flag indicates that there is + no urgent data outstanding. + + To send an urgent indication, the user must also send at least one + data octet. If the sending user also indicates a push, timely + delivery of the urgent information to the destination process is + enhanced. Note that because changes in the urgent pointer correspond + to data being written by a sending application, the urgent pointer + cannot "recede" in the sequence space, but a TCP receiver should be + robust to invalid urgent pointer values. + + A TCP implementation MUST support a sequence of urgent data of any + length (MUST-31) [19]. + + The urgent pointer MUST point to the sequence number of the octet + following the urgent data (MUST-62). + + A TCP implementation MUST (MUST-32) inform the application layer + asynchronously whenever it receives an urgent pointer and there was + previously no pending urgent data, or whenever the urgent pointer + advances in the data stream. The TCP implementation MUST (MUST-33) + provide a way for the application to learn how much urgent data + remains to be read from the connection, or at least to determine + whether more urgent data remains to be read [19]. + +3.8.6. Managing the Window + + The window sent in each segment indicates the range of sequence + numbers the sender of the window (the data receiver) is currently + prepared to accept. There is an assumption that this is related to + the data buffer space currently available for this connection. + + The sending TCP endpoint packages the data to be transmitted into + segments that fit the current window, and may repackage segments on + the retransmission queue. Such repackaging is not required but may + be helpful. + + In a connection with a one-way data flow, the window information will + be carried in acknowledgment segments that all have the same sequence + number, so there will be no way to reorder them if they arrive out of + order. This is not a serious problem, but it will allow the window + information to be on occasion temporarily based on old reports from + the data receiver. A refinement to avoid this problem is to act on + the window information from segments that carry the highest + acknowledgment number (that is, segments with an acknowledgment + number equal to or greater than the highest previously received). + + Indicating a large window encourages transmissions. If more data + arrives than can be accepted, it will be discarded. This will result + in excessive retransmissions, adding unnecessarily to the load on the + network and the TCP endpoints. Indicating a small window may + restrict the transmission of data to the point of introducing a + round-trip delay between each new segment transmitted. + + The mechanisms provided allow a TCP endpoint to advertise a large + window and to subsequently advertise a much smaller window without + having accepted that much data. This so-called "shrinking the + window" is strongly discouraged. The robustness principle [19] + dictates that TCP peers will not shrink the window themselves, but + will be prepared for such behavior on the part of other TCP peers. + + A TCP receiver SHOULD NOT shrink the window, i.e., move the right + window edge to the left (SHLD-14). However, a sending TCP peer MUST + be robust against window shrinking, which may cause the "usable + window" (see Section 3.8.6.2.1) to become negative (MUST-34). + + If this happens, the sender SHOULD NOT send new data (SHLD-15), but + SHOULD retransmit normally the old unacknowledged data between + SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also + retransmit old data beyond SND.UNA+SND.WND (MAY-7), but SHOULD NOT + time out the connection if data beyond the right window edge is not + acknowledged (SHLD-17). If the window shrinks to zero, the TCP + implementation MUST probe it in the standard way (described below) + (MUST-35). + +3.8.6.1. Zero-Window Probing + + The sending TCP peer must regularly transmit at least one octet of + new data (if available), or retransmit to the receiving TCP peer even + if the send window is zero, in order to "probe" the window. This + retransmission is essential to guarantee that when either TCP peer + has a zero window the reopening of the window will be reliably + reported to the other. This is referred to as Zero-Window Probing + (ZWP) in other documents. + + Probing of zero (offered) windows MUST be supported (MUST-36). + + A TCP implementation MAY keep its offered receive window closed + indefinitely (MAY-8). As long as the receiving TCP peer continues to + send acknowledgments in response to the probe segments, the sending + TCP peer MUST allow the connection to stay open (MUST-37). This + enables TCP to function in scenarios such as the "printer ran out of + paper" situation described in Section 4.2.2.17 of [19]. The behavior + is subject to the implementation's resource management concerns, as + noted in [41]. + + When the receiving TCP peer has a zero window and a segment arrives, + it must still send an acknowledgment showing its next expected + sequence number and current window (zero). + + The transmitting host SHOULD send the first zero-window probe when a + zero window has existed for the retransmission timeout period (SHLD- + 29) (Section 3.8.1), and SHOULD increase exponentially the interval + between successive probes (SHLD-30). + +3.8.6.2. Silly Window Syndrome Avoidance + + The "Silly Window Syndrome" (SWS) is a stable pattern of small + incremental window movements resulting in extremely poor TCP + performance. Algorithms to avoid SWS are described below for both + the sending side and the receiving side. RFC 1122 contains more + detailed discussion of the SWS problem. Note that the Nagle + algorithm and the sender SWS avoidance algorithm play complementary + roles in improving performance. The Nagle algorithm discourages + sending tiny segments when the data to be sent increases in small + increments, while the SWS avoidance algorithm discourages small + segments resulting from the right window edge advancing in small + increments. + +3.8.6.2.1. Sender's Algorithm -- When to Send Data + + A TCP implementation MUST include a SWS avoidance algorithm in the + sender (MUST-38). + + The Nagle algorithm from Section 3.7.4 additionally describes how to + coalesce short segments. + + The sender's SWS avoidance algorithm is more difficult than the + receiver's because the sender does not know (directly) the receiver's + total buffer space (RCV.BUFF). An approach that has been found to + work well is for the sender to calculate Max(SND.WND), which is the + maximum send window it has seen so far on the connection, and to use + this value as an estimate of RCV.BUFF. Unfortunately, this can only + be an estimate; the receiver may at any time reduce the size of + RCV.BUFF. To avoid a resulting deadlock, it is necessary to have a + timeout to force transmission of data, overriding the SWS avoidance + algorithm. In practice, this timeout should seldom occur. + + The "usable window" is: + + U = SND.UNA + SND.WND - SND.NXT + + i.e., the offered window less the amount of data sent but not + acknowledged. If D is the amount of data queued in the sending TCP + endpoint but not yet sent, then the following set of rules is + recommended. + + Send data: + + (1) if a maximum-sized segment can be sent, i.e., if: + + min(D,U) >= Eff.snd.MSS; + + (2) or if the data is pushed and all queued data can be sent now, + i.e., if: + + [SND.NXT = SND.UNA and] PUSHed and D <= U + + (the bracketed condition is imposed by the Nagle algorithm); + + (3) or if at least a fraction Fs of the maximum window can be sent, + i.e., if: + + [SND.NXT = SND.UNA and] + + min(D,U) >= Fs * Max(SND.WND); + + (4) or if the override timeout occurs. + + Here Fs is a fraction whose recommended value is 1/2. The override + timeout should be in the range 0.1 - 1.0 seconds. It may be + convenient to combine this timer with the timer used to probe zero + windows (Section 3.8.6.1). + +3.8.6.2.2. Receiver's Algorithm -- When to Send a Window Update + + A TCP implementation MUST include a SWS avoidance algorithm in the + receiver (MUST-39). + + The receiver's SWS avoidance algorithm determines when the right + window edge may be advanced; this is customarily known as "updating + the window". This algorithm combines with the delayed ACK algorithm + (Section 3.8.6.3) to determine when an ACK segment containing the + current window will really be sent to the receiver. + + The solution to receiver SWS is to avoid advancing the right window + edge RCV.NXT+RCV.WND in small increments, even if data is received + from the network in small segments. + + Suppose the total receive buffer space is RCV.BUFF. At any given + moment, RCV.USER octets of this total may be tied up with data that + has been received and acknowledged but that the user process has not + yet consumed. When the connection is quiescent, RCV.WND = RCV.BUFF + and RCV.USER = 0. + + Keeping the right window edge fixed as data arrives and is + acknowledged requires that the receiver offer less than its full + buffer space, i.e., the receiver must specify a RCV.WND that keeps + RCV.NXT+RCV.WND constant as RCV.NXT increases. Thus, the total + buffer space RCV.BUFF is generally divided into three parts: + + |<------- RCV.BUFF ---------------->| + 1 2 3 + ----|---------|------------------|------|---- + RCV.NXT ^ + (Fixed) + + 1 - RCV.USER = data received but not yet consumed; + 2 - RCV.WND = space advertised to sender; + 3 - Reduction = space available but not yet + advertised. + + The suggested SWS avoidance algorithm for the receiver is to keep + RCV.NXT+RCV.WND fixed until the reduction satisfies: + + RCV.BUFF - RCV.USER - RCV.WND >= + + min( Fr * RCV.BUFF, Eff.snd.MSS ) + + where Fr is a fraction whose recommended value is 1/2, and + Eff.snd.MSS is the effective send MSS for the connection (see + Section 3.7.1). When the inequality is satisfied, RCV.WND is set to + RCV.BUFF-RCV.USER. + + Note that the general effect of this algorithm is to advance RCV.WND + in increments of Eff.snd.MSS (for realistic receive buffers: + Eff.snd.MSS < RCV.BUFF/2). Note also that the receiver must use its + own Eff.snd.MSS, making the assumption that it is the same as the + sender's. + +3.8.6.3. Delayed Acknowledgments -- When to Send an ACK Segment + + A host that is receiving a stream of TCP data segments can increase + efficiency in both the network and the hosts by sending fewer than + one ACK (acknowledgment) segment per data segment received; this is + known as a "delayed ACK". + + A TCP endpoint SHOULD implement a delayed ACK (SHLD-18), but an ACK + should not be excessively delayed; in particular, the delay MUST be + less than 0.5 seconds (MUST-40). An ACK SHOULD be generated for at + least every second full-sized segment or 2*RMSS bytes of new data + (where RMSS is the MSS specified by the TCP endpoint receiving the + segments to be acknowledged, or the default value if not specified) + (SHLD-19). Excessive delays on ACKs can disturb the round-trip + timing and packet "clocking" algorithms. More complete discussion of + delayed ACK behavior is in Section 4.2 of RFC 5681 [8], including + recommendations to immediately acknowledge out-of-order segments, + segments above a gap in sequence space, or segments that fill all or + part of a gap, in order to accelerate loss recovery. + + Note that there are several current practices that further lead to a + reduced number of ACKs, including generic receive offload (GRO) [72], + ACK compression, and ACK decimation [28]. + +3.9. Interfaces + + There are of course two interfaces of concern: the user/TCP interface + and the TCP/lower-level interface. We have a fairly elaborate model + of the user/TCP interface, but the interface to the lower-level + protocol module is left unspecified here since it will be specified + in detail by the specification of the lower-level protocol. For the + case that the lower level is IP, we note some of the parameter values + that TCP implementations might use. + +3.9.1. User/TCP Interface + + The following functional description of user commands to the TCP + implementation is, at best, fictional, since every operating system + will have different facilities. Consequently, we must warn readers + that different TCP implementations may have different user + interfaces. However, all TCP implementations must provide a certain + minimum set of services to guarantee that all TCP implementations can + support the same protocol hierarchy. This section specifies the + functional interfaces required of all TCP implementations. + + Section 3.1 of [53] also identifies primitives provided by TCP and + could be used as an additional reference for implementers. + + The following sections functionally characterize a user/TCP + interface. The notation used is similar to most procedure or + function calls in high-level languages, but this usage is not meant + to rule out trap-type service calls. + + The user commands described below specify the basic functions the TCP + implementation must perform to support interprocess communication. + Individual implementations must define their own exact format and may + provide combinations or subsets of the basic functions in single + calls. In particular, some implementations may wish to automatically + OPEN a connection on the first SEND or RECEIVE issued by the user for + a given connection. + + In providing interprocess communication facilities, the TCP + implementation must not only accept commands, but must also return + information to the processes it serves. The latter consists of: + + (a) general information about a connection (e.g., interrupts, remote + close, binding of unspecified remote socket). + + (b) replies to specific user commands indicating success or various + types of failure. + +3.9.1.1. Open + + Format: OPEN (local port, remote socket, active/passive [, timeout] + [, Diffserv field] [, security/compartment] [, local IP address] [, + options]) -> local connection name + + If the active/passive flag is set to passive, then this is a call to + LISTEN for an incoming connection. A passive OPEN may have either a + fully specified remote socket to wait for a particular connection or + an unspecified remote socket to wait for any call. A fully specified + passive call can be made active by the subsequent execution of a + SEND. + + A transmission control block (TCB) is created and partially filled in + with data from the OPEN command parameters. + + Every passive OPEN call either creates a new connection record in + LISTEN state, or it returns an error; it MUST NOT affect any + previously created connection record (MUST-41). + + A TCP implementation that supports multiple concurrent connections + MUST provide an OPEN call that will functionally allow an application + to LISTEN on a port while a connection block with the same local port + is in SYN-SENT or SYN-RECEIVED state (MUST-42). + + On an active OPEN command, the TCP endpoint will begin the procedure + to synchronize (i.e., establish) the connection at once. + + The timeout, if present, permits the caller to set up a timeout for + all data submitted to TCP. If data is not successfully delivered to + the destination within the timeout period, the TCP endpoint will + abort the connection. The present global default is five minutes. + + The TCP implementation or some component of the operating system will + verify the user's authority to open a connection with the specified + Diffserv field value or security/compartment. The absence of a + Diffserv field value or security/compartment specification in the + OPEN call indicates the default values must be used. + + TCP will accept incoming requests as matching only if the security/ + compartment information is exactly the same as that requested in the + OPEN call. + + The Diffserv field value indicated by the user only impacts outgoing + packets, may be altered en route through the network, and has no + direct bearing or relation to received packets. + + A local connection name will be returned to the user by the TCP + implementation. The local connection name can then be used as a + shorthand term for the connection defined by the <local socket, + remote socket> pair. + + The optional "local IP address" parameter MUST be supported to allow + the specification of the local IP address (MUST-43). This enables + applications that need to select the local IP address used when + multihoming is present. + + A passive OPEN call with a specified "local IP address" parameter + will await an incoming connection request to that address. If the + parameter is unspecified, a passive OPEN will await an incoming + connection request to any local IP address and then bind the local IP + address of the connection to the particular address that is used. + + For an active OPEN call, a specified "local IP address" parameter + will be used for opening the connection. If the parameter is + unspecified, the host will choose an appropriate local IP address + (see RFC 1122, Section 3.3.4.2). + + If an application on a multihomed host does not specify the local IP + address when actively opening a TCP connection, then the TCP + implementation MUST ask the IP layer to select a local IP address + before sending the (first) SYN (MUST-44). See the function + GET_SRCADDR() in Section 3.4 of RFC 1122. + + At all other times, a previous segment has either been sent or + received on this connection, and TCP implementations MUST use the + same local address that was used in those previous segments (MUST- + 45). + + A TCP implementation MUST reject as an error a local OPEN call for an + invalid remote IP address (e.g., a broadcast or multicast address) + (MUST-46). + +3.9.1.2. Send + + Format: SEND (local connection name, buffer address, byte count, + URGENT flag [, PUSH flag] [, timeout]) + + This call causes the data contained in the indicated user buffer to + be sent on the indicated connection. If the connection has not been + opened, the SEND is considered an error. Some implementations may + allow users to SEND first; in which case, an automatic OPEN would be + done. For example, this might be one way for application data to be + included in SYN segments. If the calling process is not authorized + to use this connection, an error is returned. + + A TCP endpoint MAY implement PUSH flags on SEND calls (MAY-15). If + PUSH flags are not implemented, then the sending TCP peer: (1) MUST + NOT buffer data indefinitely (MUST-60), and (2) MUST set the PSH bit + in the last buffered segment (i.e., when there is no more queued data + to be sent) (MUST-61). The remaining description below assumes the + PUSH flag is supported on SEND calls. + + If the PUSH flag is set, the application intends the data to be + transmitted promptly to the receiver, and the PSH bit will be set in + the last TCP segment created from the buffer. + + The PSH bit is not a record marker and is independent of segment + boundaries. The transmitter SHOULD collapse successive bits when it + packetizes data, to send the largest possible segment (SHLD-27). + + If the PUSH flag is not set, the data may be combined with data from + subsequent SENDs for transmission efficiency. When an application + issues a series of SEND calls without setting the PUSH flag, the TCP + implementation MAY aggregate the data internally without sending it + (MAY-16). Note that when the Nagle algorithm is in use, TCP + implementations may buffer the data before sending, without regard to + the PUSH flag (see Section 3.7.4). + + An application program is logically required to set the PUSH flag in + a SEND call whenever it needs to force delivery of the data to avoid + a communication deadlock. However, a TCP implementation SHOULD send + a maximum-sized segment whenever possible (SHLD-28) to improve + performance (see Section 3.8.6.2.1). + + New applications SHOULD NOT set the URGENT flag [39] due to + implementation differences and middlebox issues (SHLD-13). + + If the URGENT flag is set, segments sent to the destination TCP peer + will have the urgent pointer set. The receiving TCP peer will signal + the urgent condition to the receiving process if the urgent pointer + indicates that data preceding the urgent pointer has not been + consumed by the receiving process. The purpose of the URGENT flag is + to stimulate the receiver to process the urgent data and to indicate + to the receiver when all the currently known urgent data has been + received. The number of times the sending user's TCP implementation + signals urgent will not necessarily be equal to the number of times + the receiving user will be notified of the presence of urgent data. + + If no remote socket was specified in the OPEN, but the connection is + established (e.g., because a LISTENing connection has become specific + due to a remote segment arriving for the local socket), then the + designated buffer is sent to the implied remote socket. Users who + make use of OPEN with an unspecified remote socket can make use of + SEND without ever explicitly knowing the remote socket address. + + However, if a SEND is attempted before the remote socket becomes + specified, an error will be returned. Users can use the STATUS call + to determine the status of the connection. Some TCP implementations + may notify the user when an unspecified socket is bound. + + If a timeout is specified, the current user timeout for this + connection is changed to the new one. + + In the simplest implementation, SEND would not return control to the + sending process until either the transmission was complete or the + timeout had been exceeded. However, this simple method is both + subject to deadlocks (for example, both sides of the connection might + try to do SENDs before doing any RECEIVEs) and offers poor + performance, so it is not recommended. A more sophisticated + implementation would return immediately to allow the process to run + concurrently with network I/O, and, furthermore, to allow multiple + SENDs to be in progress. Multiple SENDs are served in first come, + first served order, so the TCP endpoint will queue those it cannot + service immediately. + + We have implicitly assumed an asynchronous user interface in which a + SEND later elicits some kind of SIGNAL or pseudo-interrupt from the + serving TCP endpoint. An alternative is to return a response + immediately. For instance, SENDs might return immediate local + acknowledgment, even if the segment sent had not been acknowledged by + the distant TCP endpoint. We could optimistically assume eventual + success. If we are wrong, the connection will close anyway due to + the timeout. In implementations of this kind (synchronous), there + will still be some asynchronous signals, but these will deal with the + connection itself, and not with specific segments or buffers. + + In order for the process to distinguish among error or success + indications for different SENDs, it might be appropriate for the + buffer address to be returned along with the coded response to the + SEND request. TCP-to-user signals are discussed below, indicating + the information that should be returned to the calling process. + +3.9.1.3. Receive + + Format: RECEIVE (local connection name, buffer address, byte count) + -> byte count, URGENT flag [, PUSH flag] + + This command allocates a receiving buffer associated with the + specified connection. If no OPEN precedes this command or the + calling process is not authorized to use this connection, an error is + returned. + + In the simplest implementation, control would not return to the + calling program until either the buffer was filled or some error + occurred, but this scheme is highly subject to deadlocks. A more + sophisticated implementation would permit several RECEIVEs to be + outstanding at once. These would be filled as segments arrive. This + strategy permits increased throughput at the cost of a more elaborate + scheme (possibly asynchronous) to notify the calling program that a + PUSH has been seen or a buffer filled. + + A TCP receiver MAY pass a received PSH bit to the application layer + via the PUSH flag in the interface (MAY-17), but it is not required + (this was clarified in RFC 1122, Section 4.2.2.2). The remainder of + text describing the RECEIVE call below assumes that passing the PUSH + indication is supported. + + If enough data arrive to fill the buffer before a PUSH is seen, the + PUSH flag will not be set in the response to the RECEIVE. The buffer + will be filled with as much data as it can hold. If a PUSH is seen + before the buffer is filled, the buffer will be returned partially + filled and PUSH indicated. + + If there is urgent data, the user will have been informed as soon as + it arrived via a TCP-to-user signal. The receiving user should thus + be in "urgent mode". If the URGENT flag is on, additional urgent + data remains. If the URGENT flag is off, this call to RECEIVE has + returned all the urgent data, and the user may now leave "urgent + mode". Note that data following the urgent pointer (non-urgent data) + cannot be delivered to the user in the same buffer with preceding + urgent data unless the boundary is clearly marked for the user. + + To distinguish among several outstanding RECEIVEs and to take care of + the case that a buffer is not completely filled, the return code is + accompanied by both a buffer pointer and a byte count indicating the + actual length of the data received. + + Alternative implementations of RECEIVE might have the TCP endpoint + allocate buffer storage, or the TCP endpoint might share a ring + buffer with the user. + +3.9.1.4. Close + + Format: CLOSE (local connection name) + + This command causes the connection specified to be closed. If the + connection is not open or the calling process is not authorized to + use this connection, an error is returned. Closing connections is + intended to be a graceful operation in the sense that outstanding + SENDs will be transmitted (and retransmitted), as flow control + permits, until all have been serviced. Thus, it should be acceptable + to make several SEND calls, followed by a CLOSE, and expect all the + data to be sent to the destination. It should also be clear that + users should continue to RECEIVE on CLOSING connections since the + remote peer may be trying to transmit the last of its data. Thus, + CLOSE means "I have no more to send" but does not mean "I will not + receive any more." It may happen (if the user-level protocol is not + well thought out) that the closing side is unable to get rid of all + its data before timing out. In this event, CLOSE turns into ABORT, + and the closing TCP peer gives up. + + The user may CLOSE the connection at any time on their own + initiative, or in response to various prompts from the TCP + implementation (e.g., remote close executed, transmission timeout + exceeded, destination inaccessible). + + Because closing a connection requires communication with the remote + TCP peer, connections may remain in the closing state for a short + time. Attempts to reopen the connection before the TCP peer replies + to the CLOSE command will result in error responses. + + Close also implies push function. + +3.9.1.5. Status + + Format: STATUS (local connection name) -> status data + + This is an implementation-dependent user command and could be + excluded without adverse effect. Information returned would + typically come from the TCB associated with the connection. + + This command returns a data block containing the following + information: + + local socket, + + remote socket, + + local connection name, + + receive window, + + send window, + + connection state, + + number of buffers awaiting acknowledgment, + + number of buffers pending receipt, + + urgent state, + + Diffserv field value, + + security/compartment, and + + transmission timeout. + + Depending on the state of the connection, or on the implementation + itself, some of this information may not be available or meaningful. + If the calling process is not authorized to use this connection, an + error is returned. This prevents unauthorized processes from gaining + information about a connection. + +3.9.1.6. Abort + + Format: ABORT (local connection name) + + This command causes all pending SENDs and RECEIVES to be aborted, the + TCB to be removed, and a special RST message to be sent to the remote + TCP peer of the connection. Depending on the implementation, users + may receive abort indications for each outstanding SEND or RECEIVE, + or may simply receive an ABORT-acknowledgment. + +3.9.1.7. Flush + + Some TCP implementations have included a FLUSH call, which will empty + the TCP send queue of any data that the user has issued SEND calls + for but is still to the right of the current send window. That is, + it flushes as much queued send data as possible without losing + sequence number synchronization. The FLUSH call MAY be implemented + (MAY-14). + +3.9.1.8. Asynchronous Reports + + There MUST be a mechanism for reporting soft TCP error conditions to + the application (MUST-47). Generically, we assume this takes the + form of an application-supplied ERROR_REPORT routine that may be + upcalled asynchronously from the transport layer: + + ERROR_REPORT(local connection name, reason, subreason) + + The precise encoding of the reason and subreason parameters is not + specified here. However, the conditions that are reported + asynchronously to the application MUST include: + + * ICMP error message arrived (see Section 3.9.2.2 for description of + handling each ICMP message type since some message types need to + be suppressed from generating reports to the application) + + * Excessive retransmissions (see Section 3.8.3) + + * Urgent pointer advance (see Section 3.8.5) + + However, an application program that does not want to receive such + ERROR_REPORT calls SHOULD be able to effectively disable these calls + (SHLD-20). + +3.9.1.9. Set Differentiated Services Field (IPv4 TOS or IPv6 Traffic + Class) + + The application layer MUST be able to specify the Differentiated + Services field for segments that are sent on a connection (MUST-48). + The Differentiated Services field includes the 6-bit Differentiated + Services Codepoint (DSCP) value. It is not required, but the + application SHOULD be able to change the Differentiated Services + field during the connection lifetime (SHLD-21). TCP implementations + SHOULD pass the current Differentiated Services field value without + change to the IP layer, when it sends segments on the connection + (SHLD-22). + + The Differentiated Services field will be specified independently in + each direction on the connection, so that the receiver application + will specify the Differentiated Services field used for ACK segments. + + TCP implementations MAY pass the most recently received + Differentiated Services field up to the application (MAY-9). + +3.9.2. TCP/Lower-Level Interface + + The TCP endpoint calls on a lower-level protocol module to actually + send and receive information over a network. The two current + standard Internet Protocol (IP) versions layered below TCP are IPv4 + [1] and IPv6 [13]. + + If the lower-level protocol is IPv4, it provides arguments for a type + of service (used within the Differentiated Services field) and for a + time to live. TCP uses the following settings for these parameters: + + Diffserv field: The IP header value for the Diffserv field is given + by the user. This includes the bits of the Diffserv Codepoint + (DSCP). + + Time to Live (TTL): The TTL value used to send TCP segments MUST be + configurable (MUST-49). + + * Note that RFC 793 specified one minute (60 seconds) as a + constant for the TTL because the assumed maximum segment + lifetime was two minutes. This was intended to explicitly ask + that a segment be destroyed if it could not be delivered by the + internet system within one minute. RFC 1122 updated RFC 793 to + require that the TTL be configurable. + + * Note that the Diffserv field is permitted to change during a + connection (Section 4.2.4.2 of RFC 1122). However, the + application interface might not support this ability, and the + application does not have knowledge about individual TCP + segments, so this can only be done on a coarse granularity, at + best. This limitation is further discussed in RFC 7657 + (Sections 5.1, 5.3, and 6) [50]. Generally, an application + SHOULD NOT change the Diffserv field value during the course of + a connection (SHLD-23). + + Any lower-level protocol will have to provide the source address, + destination address, and protocol fields, and some way to determine + the "TCP length", both to provide the functional equivalent service + of IP and to be used in the TCP checksum. + + When received options are passed up to TCP from the IP layer, a TCP + implementation MUST ignore options that it does not understand (MUST- + 50). + + A TCP implementation MAY support the Timestamp (MAY-10) and Record + Route (MAY-11) Options. + +3.9.2.1. Source Routing + + If the lower level is IP (or other protocol that provides this + feature) and source routing is used, the interface must allow the + route information to be communicated. This is especially important + so that the source and destination addresses used in the TCP checksum + be the originating source and ultimate destination. It is also + important to preserve the return route to answer connection requests. + + An application MUST be able to specify a source route when it + actively opens a TCP connection (MUST-51), and this MUST take + precedence over a source route received in a datagram (MUST-52). + + When a TCP connection is OPENed passively and a packet arrives with a + completed IP Source Route Option (containing a return route), TCP + implementations MUST save the return route and use it for all + segments sent on this connection (MUST-53). If a different source + route arrives in a later segment, the later definition SHOULD + override the earlier one (SHLD-24). + +3.9.2.2. ICMP Messages + + TCP implementations MUST act on an ICMP error message passed up from + the IP layer, directing it to the connection that created the error + (MUST-54). The necessary demultiplexing information can be found in + the IP header contained within the ICMP message. + + This applies to ICMPv6 in addition to IPv4 ICMP. + + [35] contains discussion of specific ICMP and ICMPv6 messages + classified as either "soft" or "hard" errors that may bear different + responses. Treatment for classes of ICMP messages is described + below: + + Source Quench + TCP implementations MUST silently discard any received ICMP Source + Quench messages (MUST-55). See [11] for discussion. + + Soft Errors + For IPv4 ICMP, these include: Destination Unreachable -- codes 0, + 1, 5; Time Exceeded -- codes 0, 1; and Parameter Problem. + + For ICMPv6, these include: Destination Unreachable -- codes 0, 3; + Time Exceeded -- codes 0, 1; and Parameter Problem -- codes 0, 1, + 2. + + Since these Unreachable messages indicate soft error conditions, a + TCP implementation MUST NOT abort the connection (MUST-56), and it + SHOULD make the information available to the application (SHLD-25). + + Hard Errors + For ICMP these include Destination Unreachable -- codes 2-4. + + These are hard error conditions, so TCP implementations SHOULD + abort the connection (SHLD-26). [35] notes that some + implementations do not abort connections when an ICMP hard error is + received for a connection that is in any of the synchronized + states. + + Note that [35], Section 4 describes widespread implementation + behavior that treats soft errors as hard errors during connection + establishment. + +3.9.2.3. Source Address Validation + + RFC 1122 requires addresses to be validated in incoming SYN packets: + + | An incoming SYN with an invalid source address MUST be ignored + | either by TCP or by the IP layer [(MUST-63)] (see + | Section 3.2.1.3). + | + | A TCP implementation MUST silently discard an incoming SYN segment + | that is addressed to a broadcast or multicast address [(MUST-57)]. + + This prevents connection state and replies from being erroneously + generated, and implementers should note that this guidance is + applicable to all incoming segments, not just SYNs, as specifically + indicated in RFC 1122. + +3.10. Event Processing + + The processing depicted in this section is an example of one possible + implementation. Other implementations may have slightly different + processing sequences, but they should differ from those in this + section only in detail, not in substance. + + The activity of the TCP endpoint can be characterized as responding + to events. The events that occur can be cast into three categories: + user calls, arriving segments, and timeouts. This section describes + the processing the TCP endpoint does in response to each of the + events. In many cases, the processing required depends on the state + of the connection. + + Events that occur: + + User Calls + + OPEN + + SEND + + RECEIVE + + CLOSE + + ABORT + + STATUS + + Arriving Segments + + SEGMENT ARRIVES + + Timeouts + + USER TIMEOUT + + RETRANSMISSION TIMEOUT + + TIME-WAIT TIMEOUT + + The model of the TCP/user interface is that user commands receive an + immediate return and possibly a delayed response via an event or + pseudo-interrupt. In the following descriptions, the term "signal" + means cause a delayed response. + + Error responses in this document are identified by character strings. + For example, user commands referencing connections that do not exist + receive "error: connection not open". + + Please note in the following that all arithmetic on sequence numbers, + acknowledgment numbers, windows, et cetera, is modulo 2^32 (the size + of the sequence number space). Also note that "=<" means less than + or equal to (modulo 2^32). + + A natural way to think about processing incoming segments is to + imagine that they are first tested for proper sequence number (i.e., + that their contents lie in the range of the expected "receive window" + in the sequence number space) and then that they are generally queued + and processed in sequence number order. + + When a segment overlaps other already received segments, we + reconstruct the segment to contain just the new data and adjust the + header fields to be consistent. + + Note that if no state change is mentioned, the TCP connection stays + in the same state. + +3.10.1. OPEN Call + + CLOSED STATE (i.e., TCB does not exist) + + * Create a new transmission control block (TCB) to hold connection + state information. Fill in local socket identifier, remote + socket, Diffserv field, security/compartment, and user timeout + information. Note that some parts of the remote socket may be + unspecified in a passive OPEN and are to be filled in by the + parameters of the incoming SYN segment. Verify the security and + Diffserv value requested are allowed for this user, if not, return + "error: Diffserv value not allowed" or "error: security/ + compartment not allowed". If passive, enter the LISTEN state and + return. If active and the remote socket is unspecified, return + "error: remote socket unspecified"; if active and the remote + socket is specified, issue a SYN segment. An initial send + sequence number (ISS) is selected. A SYN segment of the form + <SEQ=ISS><CTL=SYN> is sent. Set SND.UNA to ISS, SND.NXT to ISS+1, + enter SYN-SENT state, and return. + + * If the caller does not have access to the local socket specified, + return "error: connection illegal for this process". If there is + no room to create a new connection, return "error: insufficient + resources". + + LISTEN STATE + + * If the OPEN call is active and the remote socket is specified, + then change the connection from passive to active, select an ISS. + Send a SYN segment, set SND.UNA to ISS, SND.NXT to ISS+1. Enter + SYN-SENT state. Data associated with SEND may be sent with SYN + segment or queued for transmission after entering ESTABLISHED + state. The urgent bit if requested in the command must be sent + with the data segments sent as a result of this command. If there + is no room to queue the request, respond with "error: insufficient + resources". If the remote socket was not specified, then return + "error: remote socket unspecified". + + SYN-SENT STATE + + SYN-RECEIVED STATE + + ESTABLISHED STATE + + FIN-WAIT-1 STATE + + FIN-WAIT-2 STATE + + CLOSE-WAIT STATE + + CLOSING STATE + + LAST-ACK STATE + + TIME-WAIT STATE + + * Return "error: connection already exists". + +3.10.2. SEND Call + + CLOSED STATE (i.e., TCB does not exist) + + * If the user does not have access to such a connection, then return + "error: connection illegal for this process". + + * Otherwise, return "error: connection does not exist". + + LISTEN STATE + + * If the remote socket is specified, then change the connection from + passive to active, select an ISS. Send a SYN segment, set SND.UNA + to ISS, SND.NXT to ISS+1. Enter SYN-SENT state. Data associated + with SEND may be sent with SYN segment or queued for transmission + after entering ESTABLISHED state. The urgent bit if requested in + the command must be sent with the data segments sent as a result + of this command. If there is no room to queue the request, + respond with "error: insufficient resources". If the remote + socket was not specified, then return "error: remote socket + unspecified". + + SYN-SENT STATE + + SYN-RECEIVED STATE + + * Queue the data for transmission after entering ESTABLISHED state. + If no space to queue, respond with "error: insufficient + resources". + + ESTABLISHED STATE + + CLOSE-WAIT STATE + + * Segmentize the buffer and send it with a piggybacked + acknowledgment (acknowledgment value = RCV.NXT). If there is + insufficient space to remember this buffer, simply return "error: + insufficient resources". + + * If the URGENT flag is set, then SND.UP <- SND.NXT and set the + urgent pointer in the outgoing segments. + + FIN-WAIT-1 STATE + + FIN-WAIT-2 STATE + + CLOSING STATE + + LAST-ACK STATE + + TIME-WAIT STATE + + * Return "error: connection closing" and do not service request. + +3.10.3. RECEIVE Call + + CLOSED STATE (i.e., TCB does not exist) + + * If the user does not have access to such a connection, return + "error: connection illegal for this process". + + * Otherwise, return "error: connection does not exist". + + LISTEN STATE + + SYN-SENT STATE + + SYN-RECEIVED STATE + + * Queue for processing after entering ESTABLISHED state. If there + is no room to queue this request, respond with "error: + insufficient resources". + + ESTABLISHED STATE + + FIN-WAIT-1 STATE + + FIN-WAIT-2 STATE + + * If insufficient incoming segments are queued to satisfy the + request, queue the request. If there is no queue space to + remember the RECEIVE, respond with "error: insufficient + resources". + + * Reassemble queued incoming segments into receive buffer and return + to user. Mark "push seen" (PUSH) if this is the case. + + * If RCV.UP is in advance of the data currently being passed to the + user, notify the user of the presence of urgent data. + + * When the TCP endpoint takes responsibility for delivering data to + the user, that fact must be communicated to the sender via an + acknowledgment. The formation of such an acknowledgment is + described below in the discussion of processing an incoming + segment. + + CLOSE-WAIT STATE + + * Since the remote side has already sent FIN, RECEIVEs must be + satisfied by data already on hand, but not yet delivered to the + user. If no text is awaiting delivery, the RECEIVE will get an + "error: connection closing" response. Otherwise, any remaining + data can be used to satisfy the RECEIVE. + + CLOSING STATE + + LAST-ACK STATE + + TIME-WAIT STATE + + * Return "error: connection closing". + +3.10.4. CLOSE Call + + CLOSED STATE (i.e., TCB does not exist) + + * If the user does not have access to such a connection, return + "error: connection illegal for this process". + + * Otherwise, return "error: connection does not exist". + + LISTEN STATE + + * Any outstanding RECEIVEs are returned with "error: closing" + responses. Delete TCB, enter CLOSED state, and return. + + SYN-SENT STATE + + * Delete the TCB and return "error: closing" responses to any queued + SENDs, or RECEIVEs. + + SYN-RECEIVED STATE + + * If no SENDs have been issued and there is no pending data to send, + then form a FIN segment and send it, and enter FIN-WAIT-1 state; + otherwise, queue for processing after entering ESTABLISHED state. + + ESTABLISHED STATE + + * Queue this until all preceding SENDs have been segmentized, then + form a FIN segment and send it. In any case, enter FIN-WAIT-1 + state. + + FIN-WAIT-1 STATE + + FIN-WAIT-2 STATE + + * Strictly speaking, this is an error and should receive an "error: + connection closing" response. An "ok" response would be + acceptable, too, as long as a second FIN is not emitted (the first + FIN may be retransmitted, though). + + CLOSE-WAIT STATE + + * Queue this request until all preceding SENDs have been + segmentized; then send a FIN segment, enter LAST-ACK state. + + CLOSING STATE + + LAST-ACK STATE + + TIME-WAIT STATE + + * Respond with "error: connection closing". + +3.10.5. ABORT Call + + CLOSED STATE (i.e., TCB does not exist) + + * If the user should not have access to such a connection, return + "error: connection illegal for this process". + + * Otherwise, return "error: connection does not exist". + + LISTEN STATE + + * Any outstanding RECEIVEs should be returned with "error: + connection reset" responses. Delete TCB, enter CLOSED state, and + return. + + SYN-SENT STATE + + * All queued SENDs and RECEIVEs should be given "connection reset" + notification. Delete the TCB, enter CLOSED state, and return. + + SYN-RECEIVED STATE + + ESTABLISHED STATE + + FIN-WAIT-1 STATE + + FIN-WAIT-2 STATE + + CLOSE-WAIT STATE + + * Send a reset segment: + + <SEQ=SND.NXT><CTL=RST> + + * All queued SENDs and RECEIVEs should be given "connection reset" + notification; all segments queued for transmission (except for the + RST formed above) or retransmission should be flushed. Delete the + TCB, enter CLOSED state, and return. + + CLOSING STATE + + LAST-ACK STATE + + TIME-WAIT STATE + + * Respond with "ok" and delete the TCB, enter CLOSED state, and + return. + +3.10.6. STATUS Call + + CLOSED STATE (i.e., TCB does not exist) + + * If the user should not have access to such a connection, return + "error: connection illegal for this process". + + * Otherwise, return "error: connection does not exist". + + LISTEN STATE + + * Return "state = LISTEN" and the TCB pointer. + + SYN-SENT STATE + + * Return "state = SYN-SENT" and the TCB pointer. + + SYN-RECEIVED STATE + + * Return "state = SYN-RECEIVED" and the TCB pointer. + + ESTABLISHED STATE + + * Return "state = ESTABLISHED" and the TCB pointer. + + FIN-WAIT-1 STATE + + * Return "state = FIN-WAIT-1" and the TCB pointer. + + FIN-WAIT-2 STATE + + * Return "state = FIN-WAIT-2" and the TCB pointer. + + CLOSE-WAIT STATE + + * Return "state = CLOSE-WAIT" and the TCB pointer. + + CLOSING STATE + + * Return "state = CLOSING" and the TCB pointer. + + LAST-ACK STATE + + * Return "state = LAST-ACK" and the TCB pointer. + + TIME-WAIT STATE + + * Return "state = TIME-WAIT" and the TCB pointer. + +3.10.7. SEGMENT ARRIVES + +3.10.7.1. CLOSED STATE + + If the state is CLOSED (i.e., TCB does not exist), then + + all data in the incoming segment is discarded. An incoming + segment containing a RST is discarded. An incoming segment not + containing a RST causes a RST to be sent in response. The + acknowledgment and sequence field values are selected to make the + reset sequence acceptable to the TCP endpoint that sent the + offending segment. + + If the ACK bit is off, sequence number zero is used, + + <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> + + If the ACK bit is on, + + <SEQ=SEG.ACK><CTL=RST> + + Return. + +3.10.7.2. LISTEN STATE + + If the state is LISTEN, then + + First, check for a RST: + + - An incoming RST segment could not be valid since it could not + have been sent in response to anything sent by this incarnation + of the connection. An incoming RST should be ignored. Return. + + Second, check for an ACK: + + - Any acknowledgment is bad if it arrives on a connection still + in the LISTEN state. An acceptable reset segment should be + formed for any arriving ACK-bearing segment. The RST should be + formatted as follows: + + <SEQ=SEG.ACK><CTL=RST> + + - Return. + + Third, check for a SYN: + + - If the SYN bit is set, check the security. If the security/ + compartment on the incoming segment does not exactly match the + security/compartment in the TCB, then send a reset and return. + + <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> + + - Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ, and any other + control or text should be queued for processing later. ISS + should be selected and a SYN segment sent of the form: + + <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> + + - SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection + state should be changed to SYN-RECEIVED. Note that any other + incoming control or data (combined with SYN) will be processed + in the SYN-RECEIVED state, but processing of SYN and ACK should + not be repeated. If the listen was not fully specified (i.e., + the remote socket was not fully specified), then the + unspecified fields should be filled in now. + + Fourth, other data or control: + + - This should not be reached. Drop the segment and return. Any + other control or data-bearing segment (not containing SYN) must + have an ACK and thus would have been discarded by the ACK + processing in the second step, unless it was first discarded by + RST checking in the first step. + +3.10.7.3. SYN-SENT STATE + + If the state is SYN-SENT, then + + First, check the ACK bit: + + - If the ACK bit is set, + + o If SEG.ACK =< ISS or SEG.ACK > SND.NXT, send a reset (unless + the RST bit is set, if so drop the segment and return) + + <SEQ=SEG.ACK><CTL=RST> + + o and discard the segment. Return. + + o If SND.UNA < SEG.ACK =< SND.NXT, then the ACK is acceptable. + Some deployed TCP code has used the check SEG.ACK == SND.NXT + (using "==" rather than "=<"), but this is not appropriate + when the stack is capable of sending data on the SYN because + the TCP peer may not accept and acknowledge all of the data + on the SYN. + + Second, check the RST bit: + + - If the RST bit is set, + + o A potential blind reset attack is described in RFC 5961 [9]. + The mitigation described in that document has specific + applicability explained therein, and is not a substitute for + cryptographic protection (e.g., IPsec or TCP-AO). A TCP + implementation that supports the mitigation described in RFC + 5961 SHOULD first check that the sequence number exactly + matches RCV.NXT prior to executing the action in the next + paragraph. + + o If the ACK was acceptable, then signal to the user "error: + connection reset", drop the segment, enter CLOSED state, + delete TCB, and return. Otherwise (no ACK), drop the + segment and return. + + Third, check the security: + + - If the security/compartment in the segment does not exactly + match the security/compartment in the TCB, send a reset: + + o If there is an ACK, + + <SEQ=SEG.ACK><CTL=RST> + + o Otherwise, + + <SEQ=0><ACK=SEG.SEQ+SEG.LEN><CTL=RST,ACK> + + - If a reset was sent, discard the segment and return. + + Fourth, check the SYN bit: + + - This step should be reached only if the ACK is ok, or there is + no ACK, and the segment did not contain a RST. + + - If the SYN bit is on and the security/compartment is + acceptable, then RCV.NXT is set to SEG.SEQ+1, IRS is set to + SEG.SEQ. SND.UNA should be advanced to equal SEG.ACK (if there + is an ACK), and any segments on the retransmission queue that + are thereby acknowledged should be removed. + + - If SND.UNA > ISS (our SYN has been ACKed), change the + connection state to ESTABLISHED, form an ACK segment + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + - and send it. Data or controls that were queued for + transmission MAY be included. Some TCP implementations + suppress sending this segment when the received segment + contains data that will anyways generate an acknowledgment in + the later processing steps, saving this extra acknowledgment of + the SYN from being sent. If there are other controls or text + in the segment, then continue processing at the sixth step + under Section 3.10.7.4 where the URG bit is checked; otherwise, + return. + + - Otherwise, enter SYN-RECEIVED, form a SYN,ACK segment + + <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> + + - and send it. Set the variables: + + SND.WND <- SEG.WND + + SND.WL1 <- SEG.SEQ + + SND.WL2 <- SEG.ACK + + If there are other controls or text in the segment, queue them + for processing after the ESTABLISHED state has been reached, + return. + + - Note that it is legal to send and receive application data on + SYN segments (this is the "text in the segment" mentioned + above). There has been significant misinformation and + misunderstanding of this topic historically. Some firewalls + and security devices consider this suspicious. However, the + capability was used in T/TCP [21] and is used in TCP Fast Open + (TFO) [48], so is important for implementations and network + devices to permit. + + Fifth, if neither of the SYN or RST bits is set, then drop the + segment and return. + +3.10.7.4. Other States + + Otherwise, + + First, check sequence number: + + - SYN-RECEIVED STATE + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + - CLOSE-WAIT STATE + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o Segments are processed in sequence. Initial tests on + arrival are used to discard old duplicates, but further + processing is done in SEG.SEQ order. If a segment's + contents straddle the boundary between old and new, only the + new parts are processed. + + o In general, the processing of received segments MUST be + implemented to aggregate ACK segments whenever possible + (MUST-58). For example, if the TCP endpoint is processing a + series of queued segments, it MUST process them all before + sending any ACK segments (MUST-59). + + o There are four cases for the acceptability test for an + incoming segment: + + +=========+=========+======================================+ + | Segment | Receive | Test | + | Length | Window | | + +=========+=========+======================================+ + | 0 | 0 | SEG.SEQ = RCV.NXT | + +---------+---------+--------------------------------------+ + | 0 | >0 | RCV.NXT =< SEG.SEQ < | + | | | RCV.NXT+RCV.WND | + +---------+---------+--------------------------------------+ + | >0 | 0 | not acceptable | + +---------+---------+--------------------------------------+ + | >0 | >0 | RCV.NXT =< SEG.SEQ < | + | | | RCV.NXT+RCV.WND | + | | | | + | | | or | + | | | | + | | | RCV.NXT =< SEG.SEQ+SEG.LEN-1 | + | | | < RCV.NXT+RCV.WND | + +---------+---------+--------------------------------------+ + + Table 6: Segment Acceptability Tests + + o In implementing sequence number validation as described + here, please note Appendix A.2. + + o If the RCV.WND is zero, no segments will be acceptable, but + special allowance should be made to accept valid ACKs, URGs, + and RSTs. + + o If an incoming segment is not acceptable, an acknowledgment + should be sent in reply (unless the RST bit is set, if so + drop the segment and return): + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + o After sending the acknowledgment, drop the unacceptable + segment and return. + + o Note that for the TIME-WAIT state, there is an improved + algorithm described in [40] for handling incoming SYN + segments that utilizes timestamps rather than relying on the + sequence number check described here. When the improved + algorithm is implemented, the logic above is not applicable + for incoming SYN segments with Timestamp Options, received + on a connection in the TIME-WAIT state. + + o In the following it is assumed that the segment is the + idealized segment that begins at RCV.NXT and does not exceed + the window. One could tailor actual segments to fit this + assumption by trimming off any portions that lie outside the + window (including SYN and FIN) and only processing further + if the segment then begins at RCV.NXT. Segments with higher + beginning sequence numbers SHOULD be held for later + processing (SHLD-31). + + Second, check the RST bit: + + - RFC 5961 [9], Section 3 describes a potential blind reset + attack and optional mitigation approach. This does not provide + a cryptographic protection (e.g., as in IPsec or TCP-AO) but + can be applicable in situations described in RFC 5961. For + stacks implementing the protection described in RFC 5961, the + three checks below apply; otherwise, processing for these + states is indicated further below. + + 1) If the RST bit is set and the sequence number is outside + the current receive window, silently drop the segment. + + 2) If the RST bit is set and the sequence number exactly + matches the next expected sequence number (RCV.NXT), then + TCP endpoints MUST reset the connection in the manner + prescribed below according to the connection state. + + 3) If the RST bit is set and the sequence number does not + exactly match the next expected sequence value, yet is + within the current receive window, TCP endpoints MUST send + an acknowledgment (challenge ACK): + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + After sending the challenge ACK, TCP endpoints MUST drop + the unacceptable segment and stop processing the incoming + packet further. Note that RFC 5961 and Errata ID 4772 [99] + contain additional considerations for ACK throttling in an + implementation. + + - SYN-RECEIVED STATE + + o If the RST bit is set, + + + If this connection was initiated with a passive OPEN + (i.e., came from the LISTEN state), then return this + connection to LISTEN state and return. The user need not + be informed. If this connection was initiated with an + active OPEN (i.e., came from SYN-SENT state), then the + connection was refused; signal the user "connection + refused". In either case, the retransmission queue + should be flushed. And in the active OPEN case, enter + the CLOSED state and delete the TCB, and return. + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + - CLOSE-WAIT STATE + + o If the RST bit is set, then any outstanding RECEIVEs and + SEND should receive "reset" responses. All segment queues + should be flushed. Users should also receive an unsolicited + general "connection reset" signal. Enter the CLOSED state, + delete the TCB, and return. + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o If the RST bit is set, then enter the CLOSED state, delete + the TCB, and return. + + Third, check security: + + - SYN-RECEIVED STATE + + o If the security/compartment in the segment does not exactly + match the security/compartment in the TCB, then send a reset + and return. + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + - CLOSE-WAIT STATE + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o If the security/compartment in the segment does not exactly + match the security/compartment in the TCB, then send a + reset; any outstanding RECEIVEs and SEND should receive + "reset" responses. All segment queues should be flushed. + Users should also receive an unsolicited general "connection + reset" signal. Enter the CLOSED state, delete the TCB, and + return. + + - Note this check is placed following the sequence check to + prevent a segment from an old connection between these port + numbers with a different security from causing an abort of the + current connection. + + Fourth, check the SYN bit: + + - SYN-RECEIVED STATE + + o If the connection was initiated with a passive OPEN, then + return this connection to the LISTEN state and return. + Otherwise, handle per the directions for synchronized states + below. + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + - CLOSE-WAIT STATE + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o If the SYN bit is set in these synchronized states, it may + be either a legitimate new connection attempt (e.g., in the + case of TIME-WAIT), an error where the connection should be + reset, or the result of an attack attempt, as described in + RFC 5961 [9]. For the TIME-WAIT state, new connections can + be accepted if the Timestamp Option is used and meets + expectations (per [40]). For all other cases, RFC 5961 + provides a mitigation with applicability to some situations, + though there are also alternatives that offer cryptographic + protection (see Section 7). RFC 5961 recommends that in + these synchronized states, if the SYN bit is set, + irrespective of the sequence number, TCP endpoints MUST send + a "challenge ACK" to the remote peer: + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + o After sending the acknowledgment, TCP implementations MUST + drop the unacceptable segment and stop processing further. + Note that RFC 5961 and Errata ID 4772 [99] contain + additional ACK throttling notes for an implementation. + + o For implementations that do not follow RFC 5961, the + original behavior described in RFC 793 follows in this + paragraph. If the SYN is in the window it is an error: send + a reset, any outstanding RECEIVEs and SEND should receive + "reset" responses, all segment queues should be flushed, the + user should also receive an unsolicited general "connection + reset" signal, enter the CLOSED state, delete the TCB, and + return. + + o If the SYN is not in the window, this step would not be + reached and an ACK would have been sent in the first step + (sequence number check). + + Fifth, check the ACK field: + + - if the ACK bit is off, drop the segment and return + + - if the ACK bit is on, + + o RFC 5961 [9], Section 5 describes a potential blind data + injection attack, and mitigation that implementations MAY + choose to include (MAY-12). TCP stacks that implement RFC + 5961 MUST add an input check that the ACK value is + acceptable only if it is in the range of ((SND.UNA - + MAX.SND.WND) =< SEG.ACK =< SND.NXT). All incoming segments + whose ACK value doesn't satisfy the above condition MUST be + discarded and an ACK sent back. The new state variable + MAX.SND.WND is defined as the largest window that the local + sender has ever received from its peer (subject to window + scaling) or may be hard-coded to a maximum permissible + window value. When the ACK value is acceptable, the per- + state processing below applies: + + o SYN-RECEIVED STATE + + + If SND.UNA < SEG.ACK =< SND.NXT, then enter ESTABLISHED + state and continue processing with the variables below + set to: + + SND.WND <- SEG.WND + + SND.WL1 <- SEG.SEQ + + SND.WL2 <- SEG.ACK + + + If the segment acknowledgment is not acceptable, form a + reset segment + + <SEQ=SEG.ACK><CTL=RST> + + + and send it. + + o ESTABLISHED STATE + + + If SND.UNA < SEG.ACK =< SND.NXT, then set SND.UNA <- + SEG.ACK. Any segments on the retransmission queue that + are thereby entirely acknowledged are removed. Users + should receive positive acknowledgments for buffers that + have been SENT and fully acknowledged (i.e., SEND buffer + should be returned with "ok" response). If the ACK is a + duplicate (SEG.ACK =< SND.UNA), it can be ignored. If + the ACK acks something not yet sent (SEG.ACK > SND.NXT), + then send an ACK, drop the segment, and return. + + + If SND.UNA =< SEG.ACK =< SND.NXT, the send window should + be updated. If (SND.WL1 < SEG.SEQ or (SND.WL1 = SEG.SEQ + and SND.WL2 =< SEG.ACK)), set SND.WND <- SEG.WND, set + SND.WL1 <- SEG.SEQ, and set SND.WL2 <- SEG.ACK. + + + Note that SND.WND is an offset from SND.UNA, that SND.WL1 + records the sequence number of the last segment used to + update SND.WND, and that SND.WL2 records the + acknowledgment number of the last segment used to update + SND.WND. The check here prevents using old segments to + update the window. + + o FIN-WAIT-1 STATE + + + In addition to the processing for the ESTABLISHED state, + if the FIN segment is now acknowledged, then enter FIN- + WAIT-2 and continue processing in that state. + + o FIN-WAIT-2 STATE + + + In addition to the processing for the ESTABLISHED state, + if the retransmission queue is empty, the user's CLOSE + can be acknowledged ("ok") but do not delete the TCB. + + o CLOSE-WAIT STATE + + + Do the same processing as for the ESTABLISHED state. + + o CLOSING STATE + + + In addition to the processing for the ESTABLISHED state, + if the ACK acknowledges our FIN, then enter the TIME-WAIT + state; otherwise, ignore the segment. + + o LAST-ACK STATE + + + The only thing that can arrive in this state is an + acknowledgment of our FIN. If our FIN is now + acknowledged, delete the TCB, enter the CLOSED state, and + return. + + o TIME-WAIT STATE + + + The only thing that can arrive in this state is a + retransmission of the remote FIN. Acknowledge it, and + restart the 2 MSL timeout. + + Sixth, check the URG bit: + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + o If the URG bit is set, RCV.UP <- max(RCV.UP,SEG.UP), and + signal the user that the remote side has urgent data if the + urgent pointer (RCV.UP) is in advance of the data consumed. + If the user has already been signaled (or is still in the + "urgent mode") for this continuous sequence of urgent data, + do not signal the user again. + + - CLOSE-WAIT STATE + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o This should not occur since a FIN has been received from the + remote side. Ignore the URG. + + Seventh, process the segment text: + + - ESTABLISHED STATE + + - FIN-WAIT-1 STATE + + - FIN-WAIT-2 STATE + + o Once in the ESTABLISHED state, it is possible to deliver + segment data to user RECEIVE buffers. Data from segments + can be moved into buffers until either the buffer is full or + the segment is empty. If the segment empties and carries a + PUSH flag, then the user is informed, when the buffer is + returned, that a PUSH has been received. + + o When the TCP endpoint takes responsibility for delivering + the data to the user, it must also acknowledge the receipt + of the data. + + o Once the TCP endpoint takes responsibility for the data, it + advances RCV.NXT over the data accepted, and adjusts RCV.WND + as appropriate to the current buffer availability. The + total of RCV.NXT and RCV.WND should not be reduced. + + o A TCP implementation MAY send an ACK segment acknowledging + RCV.NXT when a valid segment arrives that is in the window + but not at the left window edge (MAY-13). + + o Please note the window management suggestions in + Section 3.8. + + o Send an acknowledgment of the form: + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + o This acknowledgment should be piggybacked on a segment being + transmitted if possible without incurring undue delay. + + - CLOSE-WAIT STATE + + - CLOSING STATE + + - LAST-ACK STATE + + - TIME-WAIT STATE + + o This should not occur since a FIN has been received from the + remote side. Ignore the segment text. + + Eighth, check the FIN bit: + + - Do not process the FIN if the state is CLOSED, LISTEN, or SYN- + SENT since the SEG.SEQ cannot be validated; drop the segment + and return. + + - If the FIN bit is set, signal the user "connection closing" and + return any pending RECEIVEs with same message, advance RCV.NXT + over the FIN, and send an acknowledgment for the FIN. Note + that FIN implies PUSH for any segment text not yet delivered to + the user. + + o SYN-RECEIVED STATE + + o ESTABLISHED STATE + + + Enter the CLOSE-WAIT state. + + o FIN-WAIT-1 STATE + + + If our FIN has been ACKed (perhaps in this segment), then + enter TIME-WAIT, start the time-wait timer, turn off the + other timers; otherwise, enter the CLOSING state. + + o FIN-WAIT-2 STATE + + + Enter the TIME-WAIT state. Start the time-wait timer, + turn off the other timers. + + o CLOSE-WAIT STATE + + + Remain in the CLOSE-WAIT state. + + o CLOSING STATE + + + Remain in the CLOSING state. + + o LAST-ACK STATE + + + Remain in the LAST-ACK state. + + o TIME-WAIT STATE + + + Remain in the TIME-WAIT state. Restart the 2 MSL time- + wait timeout. + + and return. + +3.10.8. Timeouts + + USER TIMEOUT + + * For any state if the user timeout expires, flush all queues, + signal the user "error: connection aborted due to user timeout" in + general and for any outstanding calls, delete the TCB, enter the + CLOSED state, and return. + + RETRANSMISSION TIMEOUT + + * For any state if the retransmission timeout expires on a segment + in the retransmission queue, send the segment at the front of the + retransmission queue again, reinitialize the retransmission timer, + and return. + + TIME-WAIT TIMEOUT + + * If the time-wait timeout expires on a connection, delete the TCB, + enter the CLOSED state, and return. + +4. Glossary + + ACK + A control bit (acknowledge) occupying no sequence space, + which indicates that the acknowledgment field of this segment + specifies the next sequence number the sender of this segment + is expecting to receive, hence acknowledging receipt of all + previous sequence numbers. + + connection + A logical communication path identified by a pair of sockets. + + datagram + A message sent in a packet-switched computer communications + network. + + Destination Address + The network-layer address of the endpoint intended to receive + a segment. + + FIN + A control bit (finis) occupying one sequence number, which + indicates that the sender will send no more data or control + occupying sequence space. + + flush + To remove all of the contents (data or segments) from a store + (buffer or queue). + + fragment + A portion of a logical unit of data. In particular, an + internet fragment is a portion of an internet datagram. + + header + Control information at the beginning of a message, segment, + fragment, packet, or block of data. + + host + A computer. In particular, a source or destination of + messages from the point of view of the communication network. + + Identification + An Internet Protocol field. This identifying value assigned + by the sender aids in assembling the fragments of a datagram. + + internet address + A network-layer address. + + internet datagram + A unit of data exchanged between internet hosts, together + with the internet header that allows the datagram to be + routed from source to destination. + + internet fragment + A portion of the data of an internet datagram with an + internet header. + + IP + Internet Protocol. See [1] and [13]. + + IRS + The Initial Receive Sequence number. The first sequence + number used by the sender on a connection. + + ISN + The Initial Sequence Number. The first sequence number used + on a connection (either ISS or IRS). Selected in a way that + is unique within a given period of time and is unpredictable + to attackers. + + ISS + The Initial Send Sequence number. The first sequence number + used by the sender on a connection. + + left sequence + This is the next sequence number to be acknowledged by the + data-receiving TCP endpoint (or the lowest currently + unacknowledged sequence number) and is sometimes referred to + as the left edge of the send window. + + module + An implementation, usually in software, of a protocol or + other procedure. + + MSL + Maximum Segment Lifetime, the time a TCP segment can exist in + the internetwork system. Arbitrarily defined to be 2 + minutes. + + octet + An eight-bit byte. + + Options + An Option field may contain several options, and each option + may be several octets in length. + + packet + A package of data with a header that may or may not be + logically complete. More often a physical packaging than a + logical packaging of data. + + port + The portion of a connection identifier used for + demultiplexing connections at an endpoint. + + process + A program in execution. A source or destination of data from + the point of view of the TCP endpoint or other host-to-host + protocol. + + PUSH + A control bit occupying no sequence space, indicating that + this segment contains data that must be pushed through to the + receiving user. + + RCV.NXT + receive next sequence number + + RCV.UP + receive urgent pointer + + RCV.WND + receive window + + receive next sequence number + This is the next sequence number the local TCP endpoint is + expecting to receive. + + receive window + This represents the sequence numbers the local (receiving) + TCP endpoint is willing to receive. Thus, the local TCP + endpoint considers that segments overlapping the range + RCV.NXT to RCV.NXT + RCV.WND - 1 carry acceptable data or + control. Segments containing sequence numbers entirely + outside this range are considered duplicates or injection + attacks and discarded. + + RST + A control bit (reset), occupying no sequence space, + indicating that the receiver should delete the connection + without further interaction. The receiver can determine, + based on the sequence number and acknowledgment fields of the + incoming segment, whether it should honor the reset command + or ignore it. In no case does receipt of a segment + containing RST give rise to a RST in response. + + SEG.ACK + segment acknowledgment + + SEG.LEN + segment length + + SEG.SEQ + segment sequence + + SEG.UP + segment urgent pointer field + + SEG.WND + segment window field + + segment + A logical unit of data. In particular, a TCP segment is the + unit of data transferred between a pair of TCP modules. + + segment acknowledgment + The sequence number in the acknowledgment field of the + arriving segment. + + segment length + The amount of sequence number space occupied by a segment, + including any controls that occupy sequence space. + + segment sequence + The number in the sequence field of the arriving segment. + + send sequence + This is the next sequence number the local (sending) TCP + endpoint will use on the connection. It is initially + selected from an initial sequence number curve (ISN) and is + incremented for each octet of data or sequenced control + transmitted. + + send window + This represents the sequence numbers that the remote + (receiving) TCP endpoint is willing to receive. It is the + value of the window field specified in segments from the + remote (data-receiving) TCP endpoint. The range of new + sequence numbers that may be emitted by a TCP implementation + lies between SND.NXT and SND.UNA + SND.WND - 1. + (Retransmissions of sequence numbers between SND.UNA and + SND.NXT are expected, of course.) + + SND.NXT + send sequence + + SND.UNA + left sequence + + SND.UP + send urgent pointer + + SND.WL1 + segment sequence number at last window update + + SND.WL2 + segment acknowledgment number at last window update + + SND.WND + send window + + socket (or socket number, or socket address, or socket identifier) + An address that specifically includes a port identifier, that + is, the concatenation of an Internet Address with a TCP port. + + Source Address + The network-layer address of the sending endpoint. + + SYN + A control bit in the incoming segment, occupying one sequence + number, used at the initiation of a connection to indicate + where the sequence numbering will start. + + TCB + Transmission control block, the data structure that records + the state of a connection. + + TCP + Transmission Control Protocol: a host-to-host protocol for + reliable communication in internetwork environments. + + TOS + Type of Service, an obsoleted IPv4 field. The same header + bits currently are used for the Differentiated Services field + [4] containing the Differentiated Services Codepoint (DSCP) + value and the 2-bit ECN codepoint [6]. + + Type of Service + See "TOS". + + URG + A control bit (urgent), occupying no sequence space, used to + indicate that the receiving user should be notified to do + urgent processing as long as there is data to be consumed + with sequence numbers less than the value indicated by the + urgent pointer. + + urgent pointer + A control field meaningful only when the URG bit is on. This + field communicates the value of the urgent pointer that + indicates the data octet associated with the sending user's + urgent call. + +5. Changes from RFC 793 + + This document obsoletes RFC 793 as well as RFCs 6093 and 6528, which + updated 793. In all cases, only the normative protocol specification + and requirements have been incorporated into this document, and some + informational text with background and rationale may not have been + carried in. The informational content of those documents is still + valuable in learning about and understanding TCP, and they are valid + Informational references, even though their normative content has + been incorporated into this document. + + The main body of this document was adapted from RFC 793's Section 3, + titled "FUNCTIONAL SPECIFICATION", with an attempt to keep formatting + and layout as close as possible. + + The collection of applicable RFC errata that have been reported and + either accepted or held for an update to RFC 793 were incorporated + (Errata IDs: 573 [73], 574 [74], 700 [75], 701 [76], 1283 [77], 1561 + [78], 1562 [79], 1564 [80], 1571 [81], 1572 [82], 2297 [83], 2298 + [84], 2748 [85], 2749 [86], 2934 [87], 3213 [88], 3300 [89], 3301 + [90], 6222 [91]). Some errata were not applicable due to other + changes (Errata IDs: 572 [92], 575 [93], 1565 [94], 1569 [95], 2296 + [96], 3305 [97], 3602 [98]). + + Changes to the specification of the urgent pointer described in RFCs + 1011, 1122, and 6093 were incorporated. See RFC 6093 for detailed + discussion of why these changes were necessary. + + The discussion of the RTO from RFC 793 was updated to refer to RFC + 6298. The text on the RTO in RFC 1122 originally replaced the text + in RFC 793; however, RFC 2988 should have updated RFC 1122 and has + subsequently been obsoleted by RFC 6298. + + RFC 1011 [18] contains a number of comments about RFC 793, including + some needed changes to the TCP specification. These are expanded in + RFC 1122, which contains a collection of other changes and + clarifications to RFC 793. The normative items impacting the + protocol have been incorporated here, though some historically useful + implementation advice and informative discussion from RFC 1122 is not + included here. The present document, which is now the TCP + specification rather than RFC 793, updates RFC 1011, and the comments + noted in RFC 1011 have been incorporated. + + RFC 1122 contains more than just TCP requirements, so this document + can't obsolete RFC 1122 entirely. It is only marked as "updating" + RFC 1122; however, it should be understood to effectively obsolete + all of the material on TCP found in RFC 1122. + + The more secure initial sequence number generation algorithm from RFC + 6528 was incorporated. See RFC 6528 for discussion of the attacks + that this mitigates, as well as advice on selecting PRF algorithms + and managing secret key data. + + A note based on RFC 6429 was added to explicitly clarify that system + resource management concerns allow connection resources to be + reclaimed. RFC 6429 is obsoleted in the sense that the clarification + it describes has been reflected within this base TCP specification. + + The description of congestion control implementation was added based + on the set of documents that are IETF BCP or Standards Track on the + topic and the current state of common implementations. + +6. IANA Considerations + + In the "Transmission Control Protocol (TCP) Header Flags" registry, + IANA has made several changes as described in this section. + + RFC 3168 originally created this registry but only populated it with + the new bits defined in RFC 3168, neglecting the other bits that had + previously been described in RFC 793 and other documents. Bit 7 has + since also been updated by RFC 8311 [54]. + + The "Bit" column has been renamed below as the "Bit Offset" column + because it references each header flag's offset within the 16-bit + aligned view of the TCP header in Figure 1. The bits in offsets 0 + through 3 are the TCP segment Data Offset field, and not header + flags. + + IANA has added a column for "Assignment Notes". + + IANA has assigned values as indicated below. + + +========+===================+===========+====================+ + | Bit | Name | Reference | Assignment Notes | + | Offset | | | | + +========+===================+===========+====================+ + | 4 | Reserved for | RFC 9293 | | + | | future use | | | + +--------+-------------------+-----------+--------------------+ + | 5 | Reserved for | RFC 9293 | | + | | future use | | | + +--------+-------------------+-----------+--------------------+ + | 6 | Reserved for | RFC 9293 | | + | | future use | | | + +--------+-------------------+-----------+--------------------+ + | 7 | Reserved for | RFC 8311 | Previously used by | + | | future use | | Historic RFC 3540 | + | | | | as NS (Nonce Sum). | + +--------+-------------------+-----------+--------------------+ + | 8 | CWR (Congestion | RFC 3168 | | + | | Window Reduced) | | | + +--------+-------------------+-----------+--------------------+ + | 9 | ECE (ECN-Echo) | RFC 3168 | | + +--------+-------------------+-----------+--------------------+ + | 10 | Urgent pointer | RFC 9293 | | + | | field is | | | + | | significant (URG) | | | + +--------+-------------------+-----------+--------------------+ + | 11 | Acknowledgment | RFC 9293 | | + | | field is | | | + | | significant (ACK) | | | + +--------+-------------------+-----------+--------------------+ + | 12 | Push function | RFC 9293 | | + | | (PSH) | | | + +--------+-------------------+-----------+--------------------+ + | 13 | Reset the | RFC 9293 | | + | | connection (RST) | | | + +--------+-------------------+-----------+--------------------+ + | 14 | Synchronize | RFC 9293 | | + | | sequence numbers | | | + | | (SYN) | | | + +--------+-------------------+-----------+--------------------+ + | 15 | No more data from | RFC 9293 | | + | | sender (FIN) | | | + +--------+-------------------+-----------+--------------------+ + + Table 7: TCP Header Flags + + The "TCP Header Flags" registry has also been moved to a subregistry + under the global "Transmission Control Protocol (TCP) Parameters" + registry <https://www.iana.org/assignments/tcp-parameters/>. + + The registry's Registration Procedure remains Standards Action, but + the Reference has been updated to this document, and the Note has + been removed. + +7. Security and Privacy Considerations + + The TCP design includes only rudimentary security features that + improve the robustness and reliability of connections and application + data transfer, but there are no built-in cryptographic capabilities + to support any form of confidentiality, authentication, or other + typical security functions. Non-cryptographic enhancements (e.g., + [9]) have been developed to improve robustness of TCP connections to + particular types of attacks, but the applicability and protections of + non-cryptographic enhancements are limited (e.g., see Section 1.1 of + [9]). Applications typically utilize lower-layer (e.g., IPsec) and + upper-layer (e.g., TLS) protocols to provide security and privacy for + TCP connections and application data carried in TCP. Methods based + on TCP Options have been developed as well, to support some security + capabilities. + + In order to fully provide confidentiality, integrity protection, and + authentication for TCP connections (including their control flags), + IPsec is the only current effective method. For integrity protection + and authentication, the TCP Authentication Option (TCP-AO) [38] is + available, with a proposed extension to also provide confidentiality + for the segment payload. Other methods discussed in this section may + provide confidentiality or integrity protection for the payload, but + for the TCP header only cover either a subset of the fields (e.g., + tcpcrypt [57]) or none at all (e.g., TLS). Other security features + that have been added to TCP (e.g., ISN generation, sequence number + checks, and others) are only capable of partially hindering attacks. + + Applications using long-lived TCP flows have been vulnerable to + attacks that exploit the processing of control flags described in + earlier TCP specifications [33]. TCP-MD5 was a commonly implemented + TCP Option to support authentication for some of these connections, + but had flaws and is now deprecated. TCP-AO provides a capability to + protect long-lived TCP connections from attacks and has superior + properties to TCP-MD5. It does not provide any privacy for + application data or for the TCP headers. + + The "tcpcrypt" [57] experimental extension to TCP provides the + ability to cryptographically protect connection data. Metadata + aspects of the TCP flow are still visible, but the application stream + is well protected. Within the TCP header, only the urgent pointer + and FIN flag are protected through tcpcrypt. + + The TCP Roadmap [49] includes notes about several RFCs related to TCP + security. Many of the enhancements provided by these RFCs have been + integrated into the present document, including ISN generation, + mitigating blind in-window attacks, and improving handling of soft + errors and ICMP packets. These are all discussed in greater detail + in the referenced RFCs that originally described the changes needed + to earlier TCP specifications. Additionally, see RFC 6093 [39] for + discussion of security considerations related to the urgent pointer + field, which also discourages new applications from using the urgent + pointer. + + Since TCP is often used for bulk transfer flows, some attacks are + possible that abuse the TCP congestion control logic. An example is + "ACK-division" attacks. Updates that have been made to the TCP + congestion control specifications include mechanisms like Appropriate + Byte Counting (ABC) [29] that act as mitigations to these attacks. + + Other attacks are focused on exhausting the resources of a TCP + server. Examples include SYN flooding [32] or wasting resources on + non-progressing connections [41]. Operating systems commonly + implement mitigations for these attacks. Some common defenses also + utilize proxies, stateful firewalls, and other technologies outside + the end-host TCP implementation. + + The concept of a protocol's "wire image" is described in RFC 8546 + [56], which describes how TCP's cleartext headers expose more + metadata to nodes on the path than is strictly required to route the + packets to their destination. On-path adversaries may be able to + leverage this metadata. Lessons learned in this respect from TCP + have been applied in the design of newer transports like QUIC [60]. + Additionally, based partly on experiences with TCP and its + extensions, there are considerations that might be applicable for + future TCP extensions and other transports that the IETF has + documented in RFC 9065 [61], along with IAB recommendations in RFC + 8558 [58] and [67]. + + There are also methods of "fingerprinting" that can be used to infer + the host TCP implementation (operating system) version or platform + information. These collect observations of several aspects, such as + the options present in segments, the ordering of options, the + specific behaviors in the case of various conditions, packet timing, + packet sizing, and other aspects of the protocol that are left to be + determined by an implementer, and can use those observations to + identify information about the host and implementation. + + Since ICMP message processing also can interact with TCP connections, + there is potential for ICMP-based attacks against TCP connections. + These are discussed in RFC 5927 [100], along with mitigations that + have been implemented. + +8. References + +8.1. Normative References + + [1] Postel, J., "Internet Protocol", STD 5, RFC 791, + DOI 10.17487/RFC0791, September 1981, + <https://www.rfc-editor.org/info/rfc791>. + + [2] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, + DOI 10.17487/RFC1191, November 1990, + <https://www.rfc-editor.org/info/rfc1191>. + + [3] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + <https://www.rfc-editor.org/info/rfc2119>. + + [4] Nichols, K., Blake, S., Baker, F., and D. Black, + "Definition of the Differentiated Services Field (DS + Field) in the IPv4 and IPv6 Headers", RFC 2474, + DOI 10.17487/RFC2474, December 1998, + <https://www.rfc-editor.org/info/rfc2474>. + + [5] Floyd, S., "Congestion Control Principles", BCP 41, + RFC 2914, DOI 10.17487/RFC2914, September 2000, + <https://www.rfc-editor.org/info/rfc2914>. + + [6] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition + of Explicit Congestion Notification (ECN) to IP", + RFC 3168, DOI 10.17487/RFC3168, September 2001, + <https://www.rfc-editor.org/info/rfc3168>. + + [7] Floyd, S. and M. Allman, "Specifying New Congestion + Control Algorithms", BCP 133, RFC 5033, + DOI 10.17487/RFC5033, August 2007, + <https://www.rfc-editor.org/info/rfc5033>. + + [8] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion + Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, + <https://www.rfc-editor.org/info/rfc5681>. + + [9] Ramaiah, A., Stewart, R., and M. Dalal, "Improving TCP's + Robustness to Blind In-Window Attacks", RFC 5961, + DOI 10.17487/RFC5961, August 2010, + <https://www.rfc-editor.org/info/rfc5961>. + + [10] Paxson, V., Allman, M., Chu, J., and M. Sargent, + "Computing TCP's Retransmission Timer", RFC 6298, + DOI 10.17487/RFC6298, June 2011, + <https://www.rfc-editor.org/info/rfc6298>. + + [11] Gont, F., "Deprecation of ICMP Source Quench Messages", + RFC 6633, DOI 10.17487/RFC6633, May 2012, + <https://www.rfc-editor.org/info/rfc6633>. + + [12] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, <https://www.rfc-editor.org/info/rfc8174>. + + [13] Deering, S. and R. Hinden, "Internet Protocol, Version 6 + (IPv6) Specification", STD 86, RFC 8200, + DOI 10.17487/RFC8200, July 2017, + <https://www.rfc-editor.org/info/rfc8200>. + + [14] McCann, J., Deering, S., Mogul, J., and R. Hinden, Ed., + "Path MTU Discovery for IP version 6", STD 87, RFC 8201, + DOI 10.17487/RFC8201, July 2017, + <https://www.rfc-editor.org/info/rfc8201>. + + [15] Allman, M., "Requirements for Time-Based Loss Detection", + BCP 233, RFC 8961, DOI 10.17487/RFC8961, November 2020, + <https://www.rfc-editor.org/info/rfc8961>. + +8.2. Informative References + + [16] Postel, J., "Transmission Control Protocol", STD 7, + RFC 793, DOI 10.17487/RFC0793, September 1981, + <https://www.rfc-editor.org/info/rfc793>. + + [17] Nagle, J., "Congestion Control in IP/TCP Internetworks", + RFC 896, DOI 10.17487/RFC0896, January 1984, + <https://www.rfc-editor.org/info/rfc896>. + + [18] Reynolds, J. and J. Postel, "Official Internet protocols", + RFC 1011, DOI 10.17487/RFC1011, May 1987, + <https://www.rfc-editor.org/info/rfc1011>. + + [19] Braden, R., Ed., "Requirements for Internet Hosts - + Communication Layers", STD 3, RFC 1122, + DOI 10.17487/RFC1122, October 1989, + <https://www.rfc-editor.org/info/rfc1122>. + + [20] Almquist, P., "Type of Service in the Internet Protocol + Suite", RFC 1349, DOI 10.17487/RFC1349, July 1992, + <https://www.rfc-editor.org/info/rfc1349>. + + [21] Braden, R., "T/TCP -- TCP Extensions for Transactions + Functional Specification", RFC 1644, DOI 10.17487/RFC1644, + July 1994, <https://www.rfc-editor.org/info/rfc1644>. + + [22] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP + Selective Acknowledgment Options", RFC 2018, + DOI 10.17487/RFC2018, October 1996, + <https://www.rfc-editor.org/info/rfc2018>. + + [23] Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, + J., Heavens, I., Lahey, K., Semke, J., and B. Volz, "Known + TCP Implementation Problems", RFC 2525, + DOI 10.17487/RFC2525, March 1999, + <https://www.rfc-editor.org/info/rfc2525>. + + [24] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", + RFC 2675, DOI 10.17487/RFC2675, August 1999, + <https://www.rfc-editor.org/info/rfc2675>. + + [25] Xiao, X., Hannan, A., Paxson, V., and E. Crabbe, "TCP + Processing of the IPv4 Precedence Field", RFC 2873, + DOI 10.17487/RFC2873, June 2000, + <https://www.rfc-editor.org/info/rfc2873>. + + [26] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An + Extension to the Selective Acknowledgement (SACK) Option + for TCP", RFC 2883, DOI 10.17487/RFC2883, July 2000, + <https://www.rfc-editor.org/info/rfc2883>. + + [27] Lahey, K., "TCP Problems with Path MTU Discovery", + RFC 2923, DOI 10.17487/RFC2923, September 2000, + <https://www.rfc-editor.org/info/rfc2923>. + + [28] Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. + Sooriyabandara, "TCP Performance Implications of Network + Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, + December 2002, <https://www.rfc-editor.org/info/rfc3449>. + + [29] Allman, M., "TCP Congestion Control with Appropriate Byte + Counting (ABC)", RFC 3465, DOI 10.17487/RFC3465, February + 2003, <https://www.rfc-editor.org/info/rfc3465>. + + [30] Fenner, B., "Experimental Values In IPv4, IPv6, ICMPv4, + ICMPv6, UDP, and TCP Headers", RFC 4727, + DOI 10.17487/RFC4727, November 2006, + <https://www.rfc-editor.org/info/rfc4727>. + + [31] Mathis, M. and J. Heffner, "Packetization Layer Path MTU + Discovery", RFC 4821, DOI 10.17487/RFC4821, March 2007, + <https://www.rfc-editor.org/info/rfc4821>. + + [32] Eddy, W., "TCP SYN Flooding Attacks and Common + Mitigations", RFC 4987, DOI 10.17487/RFC4987, August 2007, + <https://www.rfc-editor.org/info/rfc4987>. + + [33] Touch, J., "Defending TCP Against Spoofing Attacks", + RFC 4953, DOI 10.17487/RFC4953, July 2007, + <https://www.rfc-editor.org/info/rfc4953>. + + [34] Culley, P., Elzur, U., Recio, R., Bailey, S., and J. + Carrier, "Marker PDU Aligned Framing for TCP + Specification", RFC 5044, DOI 10.17487/RFC5044, October + 2007, <https://www.rfc-editor.org/info/rfc5044>. + + [35] Gont, F., "TCP's Reaction to Soft Errors", RFC 5461, + DOI 10.17487/RFC5461, February 2009, + <https://www.rfc-editor.org/info/rfc5461>. + + [36] StJohns, M., Atkinson, R., and G. Thomas, "Common + Architecture Label IPv6 Security Option (CALIPSO)", + RFC 5570, DOI 10.17487/RFC5570, July 2009, + <https://www.rfc-editor.org/info/rfc5570>. + + [37] Sandlund, K., Pelletier, G., and L-E. Jonsson, "The RObust + Header Compression (ROHC) Framework", RFC 5795, + DOI 10.17487/RFC5795, March 2010, + <https://www.rfc-editor.org/info/rfc5795>. + + [38] Touch, J., Mankin, A., and R. Bonica, "The TCP + Authentication Option", RFC 5925, DOI 10.17487/RFC5925, + June 2010, <https://www.rfc-editor.org/info/rfc5925>. + + [39] Gont, F. and A. Yourtchenko, "On the Implementation of the + TCP Urgent Mechanism", RFC 6093, DOI 10.17487/RFC6093, + January 2011, <https://www.rfc-editor.org/info/rfc6093>. + + [40] Gont, F., "Reducing the TIME-WAIT State Using TCP + Timestamps", BCP 159, RFC 6191, DOI 10.17487/RFC6191, + April 2011, <https://www.rfc-editor.org/info/rfc6191>. + + [41] Bashyam, M., Jethanandani, M., and A. Ramaiah, "TCP Sender + Clarification for Persist Condition", RFC 6429, + DOI 10.17487/RFC6429, December 2011, + <https://www.rfc-editor.org/info/rfc6429>. + + [42] Gont, F. and S. Bellovin, "Defending against Sequence + Number Attacks", RFC 6528, DOI 10.17487/RFC6528, February + 2012, <https://www.rfc-editor.org/info/rfc6528>. + + [43] Borman, D., "TCP Options and Maximum Segment Size (MSS)", + RFC 6691, DOI 10.17487/RFC6691, July 2012, + <https://www.rfc-editor.org/info/rfc6691>. + + [44] Touch, J., "Updated Specification of the IPv4 ID Field", + RFC 6864, DOI 10.17487/RFC6864, February 2013, + <https://www.rfc-editor.org/info/rfc6864>. + + [45] Touch, J., "Shared Use of Experimental TCP Options", + RFC 6994, DOI 10.17487/RFC6994, August 2013, + <https://www.rfc-editor.org/info/rfc6994>. + + [46] McPherson, D., Oran, D., Thaler, D., and E. Osterweil, + "Architectural Considerations of IP Anycast", RFC 7094, + DOI 10.17487/RFC7094, January 2014, + <https://www.rfc-editor.org/info/rfc7094>. + + [47] Borman, D., Braden, B., Jacobson, V., and R. + Scheffenegger, Ed., "TCP Extensions for High Performance", + RFC 7323, DOI 10.17487/RFC7323, September 2014, + <https://www.rfc-editor.org/info/rfc7323>. + + [48] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP + Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, + <https://www.rfc-editor.org/info/rfc7413>. + + [49] Duke, M., Braden, R., Eddy, W., Blanton, E., and A. + Zimmermann, "A Roadmap for Transmission Control Protocol + (TCP) Specification Documents", RFC 7414, + DOI 10.17487/RFC7414, February 2015, + <https://www.rfc-editor.org/info/rfc7414>. + + [50] Black, D., Ed. and P. Jones, "Differentiated Services + (Diffserv) and Real-Time Communication", RFC 7657, + DOI 10.17487/RFC7657, November 2015, + <https://www.rfc-editor.org/info/rfc7657>. + + [51] Fairhurst, G. and M. Welzl, "The Benefits of Using + Explicit Congestion Notification (ECN)", RFC 8087, + DOI 10.17487/RFC8087, March 2017, + <https://www.rfc-editor.org/info/rfc8087>. + + [52] Fairhurst, G., Ed., Trammell, B., Ed., and M. Kuehlewind, + Ed., "Services Provided by IETF Transport Protocols and + Congestion Control Mechanisms", RFC 8095, + DOI 10.17487/RFC8095, March 2017, + <https://www.rfc-editor.org/info/rfc8095>. + + [53] Welzl, M., Tuexen, M., and N. Khademi, "On the Usage of + Transport Features Provided by IETF Transport Protocols", + RFC 8303, DOI 10.17487/RFC8303, February 2018, + <https://www.rfc-editor.org/info/rfc8303>. + + [54] Black, D., "Relaxing Restrictions on Explicit Congestion + Notification (ECN) Experimentation", RFC 8311, + DOI 10.17487/RFC8311, January 2018, + <https://www.rfc-editor.org/info/rfc8311>. + + [55] Chown, T., Loughney, J., and T. Winters, "IPv6 Node + Requirements", BCP 220, RFC 8504, DOI 10.17487/RFC8504, + January 2019, <https://www.rfc-editor.org/info/rfc8504>. + + [56] Trammell, B. and M. Kuehlewind, "The Wire Image of a + Network Protocol", RFC 8546, DOI 10.17487/RFC8546, April + 2019, <https://www.rfc-editor.org/info/rfc8546>. + + [57] Bittau, A., Giffin, D., Handley, M., Mazieres, D., Slack, + Q., and E. Smith, "Cryptographic Protection of TCP Streams + (tcpcrypt)", RFC 8548, DOI 10.17487/RFC8548, May 2019, + <https://www.rfc-editor.org/info/rfc8548>. + + [58] Hardie, T., Ed., "Transport Protocol Path Signals", + RFC 8558, DOI 10.17487/RFC8558, April 2019, + <https://www.rfc-editor.org/info/rfc8558>. + + [59] Ford, A., Raiciu, C., Handley, M., Bonaventure, O., and C. + Paasch, "TCP Extensions for Multipath Operation with + Multiple Addresses", RFC 8684, DOI 10.17487/RFC8684, March + 2020, <https://www.rfc-editor.org/info/rfc8684>. + + [60] Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based + Multiplexed and Secure Transport", RFC 9000, + DOI 10.17487/RFC9000, May 2021, + <https://www.rfc-editor.org/info/rfc9000>. + + [61] Fairhurst, G. and C. Perkins, "Considerations around + Transport Header Confidentiality, Network Operations, and + the Evolution of Internet Transport Protocols", RFC 9065, + DOI 10.17487/RFC9065, July 2021, + <https://www.rfc-editor.org/info/rfc9065>. + + [62] IANA, "Transmission Control Protocol (TCP) Parameters", + <https://www.iana.org/assignments/tcp-parameters/>. + + [63] Gont, F., "Processing of IP Security/Compartment and + Precedence Information by TCP", Work in Progress, + Internet-Draft, draft-gont-tcpm-tcp-seccomp-prec-00, 29 + March 2012, <https://datatracker.ietf.org/doc/html/draft- + gont-tcpm-tcp-seccomp-prec-00>. + + [64] Gont, F. and D. Borman, "On the Validation of TCP Sequence + Numbers", Work in Progress, Internet-Draft, draft-gont- + tcpm-tcp-seq-validation-04, 11 March 2019, + <https://datatracker.ietf.org/doc/html/draft-gont-tcpm- + tcp-seq-validation-04>. + + [65] Touch, J. and W. M. Eddy, "TCP Extended Data Offset + Option", Work in Progress, Internet-Draft, draft-ietf- + tcpm-tcp-edo-12, 15 April 2022, + <https://datatracker.ietf.org/doc/html/draft-ietf-tcpm- + tcp-edo-12>. + + [66] McQuistin, S., Band, V., Jacob, D., and C. Perkins, + "Describing Protocol Data Units with Augmented Packet + Header Diagrams", Work in Progress, Internet-Draft, draft- + mcquistin-augmented-ascii-diagrams-10, 7 March 2022, + <https://datatracker.ietf.org/doc/html/draft-mcquistin- + augmented-ascii-diagrams-10>. + + [67] Thomson, M. and T. Pauly, "Long-Term Viability of Protocol + Extension Mechanisms", RFC 9170, DOI 10.17487/RFC9170, + December 2021, <https://www.rfc-editor.org/info/rfc9170>. + + [68] Minshall, G., "A Suggested Modification to Nagle's + Algorithm", Work in Progress, Internet-Draft, draft- + minshall-nagle-01, 18 June 1999, + <https://datatracker.ietf.org/doc/html/draft-minshall- + nagle-01>. + + [69] Dalal, Y. and C. Sunshine, "Connection Management in + Transport Protocols", Computer Networks, Vol. 2, No. 6, + pp. 454-473, DOI 10.1016/0376-5075(78)90053-3, December + 1978, <https://doi.org/10.1016/0376-5075(78)90053-3>. + + [70] Faber, T., Touch, J., and W. Yui, "The TIME-WAIT state in + TCP and Its Effect on Busy Servers", Proceedings of IEEE + INFOCOM, pp. 1573-1583, DOI 10.1109/INFCOM.1999.752180, + March 1999, <https://doi.org/10.1109/INFCOM.1999.752180>. + + [71] Postel, J., "Comments on Action Items from the January + Meeting", IEN 177, March 1981, + <https://www.rfc-editor.org/ien/ien177.txt>. + + [72] "Segmentation Offloads", The Linux Kernel Documentation, + <https://www.kernel.org/doc/html/latest/networking/ + segmentation-offloads.html>. + + [73] RFC Errata, Erratum ID 573, RFC 793, + <https://www.rfc-editor.org/errata/eid573>. + + [74] RFC Errata, Erratum ID 574, RFC 793, + <https://www.rfc-editor.org/errata/eid574>. + + [75] RFC Errata, Erratum ID 700, RFC 793, + <https://www.rfc-editor.org/errata/eid700>. + + [76] RFC Errata, Erratum ID 701, RFC 793, + <https://www.rfc-editor.org/errata/eid701>. + + [77] RFC Errata, Erratum ID 1283, RFC 793, + <https://www.rfc-editor.org/errata/eid1283>. + + [78] RFC Errata, Erratum ID 1561, RFC 793, + <https://www.rfc-editor.org/errata/eid1561>. + + [79] RFC Errata, Erratum ID 1562, RFC 793, + <https://www.rfc-editor.org/errata/eid1562>. + + [80] RFC Errata, Erratum ID 1564, RFC 793, + <https://www.rfc-editor.org/errata/eid1564>. + + [81] RFC Errata, Erratum ID 1571, RFC 793, + <https://www.rfc-editor.org/errata/eid1571>. + + [82] RFC Errata, Erratum ID 1572, RFC 793, + <https://www.rfc-editor.org/errata/eid1572>. + + [83] RFC Errata, Erratum ID 2297, RFC 793, + <https://www.rfc-editor.org/errata/eid2297>. + + [84] RFC Errata, Erratum ID 2298, RFC 793, + <https://www.rfc-editor.org/errata/eid2298>. + + [85] RFC Errata, Erratum ID 2748, RFC 793, + <https://www.rfc-editor.org/errata/eid2748>. + + [86] RFC Errata, Erratum ID 2749, RFC 793, + <https://www.rfc-editor.org/errata/eid2749>. + + [87] RFC Errata, Erratum ID 2934, RFC 793, + <https://www.rfc-editor.org/errata/eid2934>. + + [88] RFC Errata, Erratum ID 3213, RFC 793, + <https://www.rfc-editor.org/errata/eid3213>. + + [89] RFC Errata, Erratum ID 3300, RFC 793, + <https://www.rfc-editor.org/errata/eid3300>. + + [90] RFC Errata, Erratum ID 3301, RFC 793, + <https://www.rfc-editor.org/errata/eid3301>. + + [91] RFC Errata, Erratum ID 6222, RFC 793, + <https://www.rfc-editor.org/errata/eid6222>. + + [92] RFC Errata, Erratum ID 572, RFC 793, + <https://www.rfc-editor.org/errata/eid572>. + + [93] RFC Errata, Erratum ID 575, RFC 793, + <https://www.rfc-editor.org/errata/eid575>. + + [94] RFC Errata, Erratum ID 1565, RFC 793, + <https://www.rfc-editor.org/errata/eid1565>. + + [95] RFC Errata, Erratum ID 1569, RFC 793, + <https://www.rfc-editor.org/errata/eid1569>. + + [96] RFC Errata, Erratum ID 2296, RFC 793, + <https://www.rfc-editor.org/errata/eid2296>. + + [97] RFC Errata, Erratum ID 3305, RFC 793, + <https://www.rfc-editor.org/errata/eid3305>. + + [98] RFC Errata, Erratum ID 3602, RFC 793, + <https://www.rfc-editor.org/errata/eid3602>. + + [99] RFC Errata, Erratum ID 4772, RFC 5961, + <https://www.rfc-editor.org/errata/eid4772>. + + [100] Gont, F., "ICMP Attacks against TCP", RFC 5927, + DOI 10.17487/RFC5927, July 2010, + <https://www.rfc-editor.org/info/rfc5927>. + +Appendix A. Other Implementation Notes + + This section includes additional notes and references on TCP + implementation decisions that are currently not a part of the RFC + series or included within the TCP standard. These items can be + considered by implementers, but there was not yet a consensus to + include them in the standard. + +A.1. IP Security Compartment and Precedence + + The IPv4 specification [1] includes a precedence value in the (now + obsoleted) Type of Service (TOS) field. It was modified in [20] and + then obsoleted by the definition of Differentiated Services + (Diffserv) [4]. Setting and conveying TOS between the network layer, + TCP implementation, and applications is obsolete and is replaced by + Diffserv in the current TCP specification. + + RFC 793 required checking the IP security compartment and precedence + on incoming TCP segments for consistency within a connection and with + application requests. Each of these aspects of IP have become + outdated, without specific updates to RFC 793. The issues with + precedence were fixed by [25], which is Standards Track, and so this + present TCP specification includes those changes. However, the state + of IP security options that may be used by Multi-Level Secure (MLS) + systems is not as apparent in the IETF currently. + + Resetting connections when incoming packets do not meet expected + security compartment or precedence expectations has been recognized + as a possible attack vector [63], and there has been discussion about + amending the TCP specification to prevent connections from being + aborted due to nonmatching IP security compartment and Diffserv + codepoint values. + +A.1.1. Precedence + + In Diffserv, the former precedence values are treated as Class + Selector codepoints, and methods for compatible treatment are + described in the Diffserv architecture. The RFC TCP specification + defined by RFCs 793 and 1122 included logic intending to have + connections use the highest precedence requested by either endpoint + application, and to keep the precedence consistent throughout a + connection. This logic from the obsolete TOS is not applicable to + Diffserv and should not be included in TCP implementations, though + changes to Diffserv values within a connection are discouraged. For + discussion of this, see RFC 7657 (Sections 5.1, 5.3, and 6) [50]. + + The obsoleted TOS processing rules in TCP assumed bidirectional (or + symmetric) precedence values used on a connection, but the Diffserv + architecture is asymmetric. Problems with the old TCP logic in this + regard were described in [25], and the solution described is to + ignore IP precedence in TCP. Since RFC 2873 is a Standards Track + document (although not marked as updating RFC 793), current + implementations are expected to be robust in these conditions. Note + that the Diffserv field value used in each direction is a part of the + interface between TCP and the network layer, and values in use can be + indicated both ways between TCP and the application. + +A.1.2. MLS Systems + + The IP Security Option (IPSO) and compartment defined in [1] was + refined in RFC 1038, which was later obsoleted by RFC 1108. The + Commercial IP Security Option (CIPSO) is defined in FIPS-188 + (withdrawn by NIST in 2015) and is supported by some vendors and + operating systems. RFC 1108 is now Historic, though RFC 791 itself + has not been updated to remove the IP Security Option. For IPv6, a + similar option (Common Architecture Label IPv6 Security Option + (CALIPSO)) has been defined [36]. RFC 793 includes logic that + includes the IP security/compartment information in treatment of TCP + segments. References to the IP "security/compartment" in this + document may be relevant for Multi-Level Secure (MLS) system + implementers but can be ignored for non-MLS implementations, + consistent with running code on the Internet. See Appendix A.1 for + further discussion. Note that RFC 5570 describes some MLS networking + scenarios where IPSO, CIPSO, or CALIPSO may be used. In these + special cases, TCP implementers should see Section 7.3.1 of RFC 5570 + and follow the guidance in that document. + +A.2. Sequence Number Validation + + There are cases where the TCP sequence number validation rules can + prevent ACK fields from being processed. This can result in + connection issues, as described in [64], which includes descriptions + of potential problems in conditions of simultaneous open, self- + connects, simultaneous close, and simultaneous window probes. The + document also describes potential changes to the TCP specification to + mitigate the issue by expanding the acceptable sequence numbers. + + In Internet usage of TCP, these conditions rarely occur. Common + operating systems include different alternative mitigations, and the + standard has not been updated yet to codify one of them, but + implementers should consider the problems described in [64]. + +A.3. Nagle Modification + + In common operating systems, both the Nagle algorithm and delayed + acknowledgments are implemented and enabled by default. TCP is used + by many applications that have a request-response style of + communication, where the combination of the Nagle algorithm and + delayed acknowledgments can result in poor application performance. + A modification to the Nagle algorithm is described in [68] that + improves the situation for these applications. + + This modification is implemented in some common operating systems and + does not impact TCP interoperability. Additionally, many + applications simply disable Nagle since this is generally supported + by a socket option. The TCP standard has not been updated to include + this Nagle modification, but implementers may find it beneficial to + consider. + +A.4. Low Watermark Settings + + Some operating system kernel TCP implementations include socket + options that allow specifying the number of bytes in the buffer until + the socket layer will pass sent data to TCP (SO_SNDLOWAT) or to the + application on receiving (SO_RCVLOWAT). + + In addition, another socket option (TCP_NOTSENT_LOWAT) can be used to + control the amount of unsent bytes in the write queue. This can help + a sending TCP application to avoid creating large amounts of buffered + data (and corresponding latency). As an example, this may be useful + for applications that are multiplexing data from multiple upper-level + streams onto a connection, especially when streams may be a mix of + interactive/real-time and bulk data transfer. + +Appendix B. TCP Requirement Summary + + This section is adapted from RFC 1122. + + Note that there is no requirement related to PLPMTUD in this list, + but that PLPMTUD is recommended. + + +=================+=========+======+========+=====+========+======+ + | Feature | ReqID | MUST | SHOULD | MAY | SHOULD | MUST | + | | | | | | NOT | NOT | + +=================+=========+======+========+=====+========+======+ + | PUSH flag | + +=================+=========+======+========+=====+========+======+ + | Aggregate or | MAY-16 | | | X | | | + | queue un-pushed | | | | | | | + | data | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Sender collapse | SHLD-27 | | X | | | | + | successive PSH | | | | | | | + | bits | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | SEND call can | MAY-15 | | | X | | | + | specify PUSH | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * If cannot: | MUST-60 | | | | | X | + | sender | | | | | | | + | buffer | | | | | | | + | indefinitely | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * If cannot: | MUST-61 | X | | | | | + | PSH last | | | | | | | + | segment | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Notify | MAY-17 | | | X | | | + | receiving ALP^1 | | | | | | | + | of PSH | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Send max size | SHLD-28 | | X | | | | + | segment when | | | | | | | + | possible | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Window | + +=================+=========+======+========+=====+========+======+ + | Treat as | MUST-1 | X | | | | | + | unsigned number | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Handle as | REC-1 | | X | | | | + | 32-bit number | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Shrink window | SHLD-14 | | | | X | | + | from right | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Send new | SHLD-15 | | | | X | | + | data when | | | | | | | + | window | | | | | | | + | shrinks | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Retransmit | SHLD-16 | | X | | | | + | old unacked | | | | | | | + | data within | | | | | | | + | window | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Time out | SHLD-17 | | | | X | | + | conn for | | | | | | | + | data past | | | | | | | + | right edge | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Robust against | MUST-34 | X | | | | | + | shrinking | | | | | | | + | window | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Receiver's | MAY-8 | | | X | | | + | window closed | | | | | | | + | indefinitely | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Use standard | MUST-35 | X | | | | | + | probing logic | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Sender probe | MUST-36 | X | | | | | + | zero window | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * First probe | SHLD-29 | | X | | | | + | after RTO | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Exponential | SHLD-30 | | X | | | | + | backoff | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Allow window | MUST-37 | X | | | | | + | stay zero | | | | | | | + | indefinitely | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Retransmit old | MAY-7 | | | X | | | + | data beyond | | | | | | | + | SND.UNA+SND.WND | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Process RST and | MUST-66 | X | | | | | + | URG even with | | | | | | | + | zero window | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Urgent Data | + +=================+=========+======+========+=====+========+======+ + | Include support | MUST-30 | X | | | | | + | for urgent | | | | | | | + | pointer | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Pointer | MUST-62 | X | | | | | + | indicates first | | | | | | | + | non-urgent | | | | | | | + | octet | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Arbitrary | MUST-31 | X | | | | | + | length urgent | | | | | | | + | data sequence | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Inform ALP^1 | MUST-32 | X | | | | | + | asynchronously | | | | | | | + | of urgent data | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP^1 can learn | MUST-33 | X | | | | | + | if/how much | | | | | | | + | urgent data Q'd | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP employ the | SHLD-13 | | | | X | | + | urgent | | | | | | | + | mechanism | | | | | | | + +=================+=========+======+========+=====+========+======+ + | TCP Options | + +=================+=========+======+========+=====+========+======+ + | Support the | MUST-4 | X | | | | | + | mandatory | | | | | | | + | option set | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Receive TCP | MUST-5 | X | | | | | + | Option in any | | | | | | | + | segment | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Ignore | MUST-6 | X | | | | | + | unsupported | | | | | | | + | options | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Include length | MUST-68 | X | | | | | + | for all options | | | | | | | + | except EOL+NOP | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Cope with | MUST-7 | X | | | | | + | illegal option | | | | | | | + | length | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Process options | MUST-64 | X | | | | | + | regardless of | | | | | | | + | word alignment | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Implement | MUST-14 | X | | | | | + | sending & | | | | | | | + | receiving MSS | | | | | | | + | Option | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | IPv4 Send MSS | SHLD-5 | | X | | | | + | Option unless | | | | | | | + | 536 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | IPv6 Send MSS | SHLD-5 | | X | | | | + | Option unless | | | | | | | + | 1220 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Send MSS Option | MAY-3 | | | X | | | + | always | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | IPv4 Send-MSS | MUST-15 | X | | | | | + | default is 536 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | IPv6 Send-MSS | MUST-15 | X | | | | | + | default is 1220 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Calculate | MUST-16 | X | | | | | + | effective send | | | | | | | + | seg size | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | MSS accounts | SHLD-6 | | X | | | | + | for varying MTU | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | MSS not sent on | MUST-65 | | | | | X | + | non-SYN | | | | | | | + | segments | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | MSS value based | MUST-67 | X | | | | | + | on MMS_R | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Pad with zero | MUST-69 | X | | | | | + +=================+=========+======+========+=====+========+======+ + | TCP Checksums | + +=================+=========+======+========+=====+========+======+ + | Sender compute | MUST-2 | X | | | | | + | checksum | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Receiver check | MUST-3 | X | | | | | + | checksum | | | | | | | + +=================+=========+======+========+=====+========+======+ + | ISN Selection | + +=================+=========+======+========+=====+========+======+ + | Include a | MUST-8 | X | | | | | + | clock-driven | | | | | | | + | ISN generator | | | | | | | + | component | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Secure ISN | SHLD-1 | | X | | | | + | generator with | | | | | | | + | a PRF component | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | PRF computable | MUST-9 | | | | | X | + | from outside | | | | | | | + | the host | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Opening Connections | + +=================+=========+======+========+=====+========+======+ + | Support | MUST-10 | X | | | | | + | simultaneous | | | | | | | + | open attempts | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | SYN-RECEIVED | MUST-11 | X | | | | | + | remembers last | | | | | | | + | state | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Passive OPEN | MUST-41 | | | | | X | + | call interfere | | | | | | | + | with others | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Function: | MUST-42 | X | | | | | + | simultaneously | | | | | | | + | LISTENs for | | | | | | | + | same port | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Ask IP for src | MUST-44 | X | | | | | + | address for SYN | | | | | | | + | if necessary | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Otherwise, | MUST-45 | X | | | | | + | use local | | | | | | | + | addr of | | | | | | | + | connection | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | OPEN to | MUST-46 | | | | | X | + | broadcast/ | | | | | | | + | multicast IP | | | | | | | + | address | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Silently | MUST-57 | X | | | | | + | discard seg to | | | | | | | + | bcast/mcast | | | | | | | + | addr | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Closing Connections | + +=================+=========+======+========+=====+========+======+ + | RST can contain | SHLD-2 | | X | | | | + | data | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Inform | MUST-12 | X | | | | | + | application of | | | | | | | + | aborted conn | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Half-duplex | MAY-1 | | | X | | | + | close | | | | | | | + | connections | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Send RST to | SHLD-3 | | X | | | | + | indicate | | | | | | | + | data lost | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | In TIME-WAIT | MUST-13 | X | | | | | + | state for 2MSL | | | | | | | + | seconds | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Accept SYN | MAY-2 | | | X | | | + | from TIME- | | | | | | | + | WAIT state | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Use | SHLD-4 | | X | | | | + | Timestamps | | | | | | | + | to reduce | | | | | | | + | TIME-WAIT | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Retransmissions | + +=================+=========+======+========+=====+========+======+ + | Implement | MUST-19 | X | | | | | + | exponential | | | | | | | + | backoff, slow | | | | | | | + | start, and | | | | | | | + | congestion | | | | | | | + | avoidance | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Retransmit with | MAY-4 | | | X | | | + | same IP | | | | | | | + | identity | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Karn's | MUST-18 | X | | | | | + | algorithm | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Generating ACKs | + +=================+=========+======+========+=====+========+======+ + | Aggregate | MUST-58 | X | | | | | + | whenever | | | | | | | + | possible | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Queue out-of- | SHLD-31 | | X | | | | + | order segments | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Process all Q'd | MUST-59 | X | | | | | + | before send ACK | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Send ACK for | MAY-13 | | | X | | | + | out-of-order | | | | | | | + | segment | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Delayed ACKs | SHLD-18 | | X | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Delay < 0.5 | MUST-40 | X | | | | | + | seconds | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Every 2nd | SHLD-19 | | X | | | | + | full-sized | | | | | | | + | segment or | | | | | | | + | 2*RMSS ACK'd | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Receiver SWS- | MUST-39 | X | | | | | + | Avoidance | | | | | | | + | Algorithm | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Sending Data | + +=================+=========+======+========+=====+========+======+ + | Configurable | MUST-49 | X | | | | | + | TTL | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Sender SWS- | MUST-38 | X | | | | | + | Avoidance | | | | | | | + | Algorithm | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Nagle algorithm | SHLD-7 | | X | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Application | MUST-17 | X | | | | | + | can disable | | | | | | | + | Nagle | | | | | | | + | algorithm | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Connection Failures | + +=================+=========+======+========+=====+========+======+ + | Negative advice | MUST-20 | X | | | | | + | to IP on R1 | | | | | | | + | retransmissions | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Close | MUST-20 | X | | | | | + | connection on | | | | | | | + | R2 | | | | | | | + | retransmissions | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP^1 can set | MUST-21 | X | | | | | + | R2 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Inform ALP of | SHLD-9 | | X | | | | + | R1<=retxs<R2 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Recommended | SHLD-10 | | X | | | | + | value for R1 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Recommended | SHLD-11 | | X | | | | + | value for R2 | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Same mechanism | MUST-22 | X | | | | | + | for SYNs | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * R2 at least | MUST-23 | X | | | | | + | 3 minutes | | | | | | | + | for SYN | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Send Keep-alive Packets | + +=================+=========+======+========+=====+========+======+ + | Send Keep-alive | MAY-5 | | X | | | | + | Packets: | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Application | MUST-24 | X | | | | | + | can request | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Default is | MUST-25 | X | | | | | + | "off" | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Only send if | MUST-26 | X | | | | | + | idle for | | | | | | | + | interval | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Interval | MUST-27 | X | | | | | + | configurable | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Default at | MUST-28 | X | | | | | + | least 2 hrs. | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Tolerant of | MUST-29 | X | | | | | + | lost ACKs | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Send with no | SHLD-12 | | X | | | | + | data | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Configurable | MAY-6 | | | X | | | + | to send | | | | | | | + | garbage | | | | | | | + | octet | | | | | | | + +=================+=========+======+========+=====+========+======+ + | IP Options | + +=================+=========+======+========+=====+========+======+ + | Ignore options | MUST-50 | X | | | | | + | TCP doesn't | | | | | | | + | understand | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Timestamp | MAY-10 | | X | | | | + | support | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Record Route | MAY-11 | | X | | | | + | support | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Source Route: | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * ALP^1 can | MUST-51 | X | | | | | + | specify | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Overrides | MUST-52 | X | | | | | + | src route | | | | | | | + | in | | | | | | | + | datagram | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Build return | MUST-53 | X | | | | | + | route from | | | | | | | + | src route | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Later src | SHLD-24 | | X | | | | + | route | | | | | | | + | overrides | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Receiving ICMP Messages from IP | + +=================+=========+======+========+=====+========+======+ + | Receiving ICMP | MUST-54 | X | | | | | + | messages from | | | | | | | + | IP | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Dest Unreach | SHLD-25 | X | | | | | + | (0,1,5) => | | | | | | | + | inform ALP | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Abort on | MUST-56 | | | | | X | + | Dest Unreach | | | | | | | + | (0,1,5) | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Dest Unreach | SHLD-26 | | X | | | | + | (2-4) => | | | | | | | + | abort conn | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Source | MUST-55 | X | | | | | + | Quench => | | | | | | | + | silent | | | | | | | + | discard | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Abort on | MUST-56 | | | | | X | + | Time | | | | | | | + | Exceeded | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Abort on | MUST-56 | | | | | X | + | Param | | | | | | | + | Problem | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Address Validation | + +=================+=========+======+========+=====+========+======+ + | Reject OPEN | MUST-46 | X | | | | | + | call to invalid | | | | | | | + | IP address | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Reject SYN from | MUST-63 | X | | | | | + | invalid IP | | | | | | | + | address | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Silently | MUST-57 | X | | | | | + | discard SYN to | | | | | | | + | bcast/mcast | | | | | | | + | addr | | | | | | | + +=================+=========+======+========+=====+========+======+ + | TCP/ALP Interface Services | + +=================+=========+======+========+=====+========+======+ + | Error Report | MUST-47 | X | | | | | + | mechanism | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP can disable | SHLD-20 | | X | | | | + | Error Report | | | | | | | + | Routine | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP can specify | MUST-48 | X | | | | | + | Diffserv field | | | | | | | + | for sending | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | * Passed | SHLD-22 | | X | | | | + | unchanged to | | | | | | | + | IP | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP can change | SHLD-21 | | X | | | | + | Diffserv field | | | | | | | + | during | | | | | | | + | connection | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | ALP generally | SHLD-23 | | | | X | | + | changing | | | | | | | + | Diffserv during | | | | | | | + | conn. | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Pass received | MAY-9 | | | X | | | + | Diffserv field | | | | | | | + | up to ALP | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + | FLUSH call | MAY-14 | | | X | | | + +-----------------+---------+------+--------+-----+--------+------+ + | Optional local | MUST-43 | X | | | | | + | IP addr param | | | | | | | + | in OPEN | | | | | | | + +=================+=========+======+========+=====+========+======+ + | RFC 5961 Support | + +=================+=========+======+========+=====+========+======+ + | Implement data | MAY-12 | | | X | | | + | injection | | | | | | | + | protection | | | | | | | + +=================+=========+======+========+=====+========+======+ + | Explicit Congestion Notification | + +=================+=========+======+========+=====+========+======+ + | Support ECN | SHLD-8 | | X | | | | + +=================+=========+======+========+=====+========+======+ + | Alternative Congestion Control | + +=================+=========+======+========+=====+========+======+ + | Implement | MAY-18 | | | X | | | + | alternative | | | | | | | + | conformant | | | | | | | + | algorithm(s) | | | | | | | + +-----------------+---------+------+--------+-----+--------+------+ + + Table 8: TCP Requirements Summary + + FOOTNOTES: (1) "ALP" means Application-Layer Program. + +Acknowledgments + + This document is largely a revision of RFC 793, of which Jon Postel + was the editor. Due to his excellent work, it was able to last for + three decades before we felt the need to revise it. + + Andre Oppermann was a contributor and helped to edit the first + revision of this document. + + We are thankful for the assistance of the IETF TCPM working group + chairs over the course of work on this document: + + Michael Scharf + + + Yoshifumi Nishida + + + Pasi Sarolahti + + + Michael Tüxen + + + During the discussions of this work on the TCPM mailing list, in + working group meetings, and via area reviews, helpful comments, + critiques, and reviews were received from (listed alphabetically by + last name): Praveen Balasubramanian, David Borman, Mohamed Boucadair, + Bob Briscoe, Neal Cardwell, Yuchung Cheng, Martin Duke, Francis + Dupont, Ted Faber, Gorry Fairhurst, Fernando Gont, Rodney Grimes, Yi + Huang, Rahul Jadhav, Markku Kojo, Mike Kosek, Juhamatti Kuusisaari, + Kevin Lahey, Kevin Mason, Matt Mathis, Stephen McQuistin, Jonathan + Morton, Matt Olson, Tommy Pauly, Tom Petch, Hagen Paul Pfeifer, Kyle + Rose, Anthony Sabatini, Michael Scharf, Greg Skinner, Joe Touch, + Michael Tüxen, Reji Varghese, Bernie Volz, Tim Wicinski, Lloyd Wood, + and Alex Zimmermann. + + Joe Touch provided additional help in clarifying the description of + segment size parameters and PMTUD/PLPMTUD recommendations. Markku + Kojo helped put together the text in the section on TCP Congestion + Control. + + This document includes content from errata that were reported by + (listed chronologically): Yin Shuming, Bob Braden, Morris M. Keesan, + Pei-chun Cheng, Constantin Hagemeier, Vishwas Manral, Mykyta + Yevstifeyev, EungJun Yi, Botong Huang, Charles Deng, Merlin Buge. + +Author's Address + + Wesley M. Eddy (editor) + MTI Systems + United States of America + Email: wes@mti-systems.com |