From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc1323.txt | 2075 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2075 insertions(+) create mode 100644 doc/rfc/rfc1323.txt (limited to 'doc/rfc/rfc1323.txt') diff --git a/doc/rfc/rfc1323.txt b/doc/rfc/rfc1323.txt new file mode 100644 index 0000000..356eaa8 --- /dev/null +++ b/doc/rfc/rfc1323.txt @@ -0,0 +1,2075 @@ + + + + + + +Network Working Group V. Jacobson +Request for Comments: 1323 LBL +Obsoletes: RFC 1072, RFC 1185 R. Braden + ISI + D. Borman + Cray Research + May 1992 + + + TCP Extensions for High Performance + +Status of This Memo + + This RFC specifies an IAB standards track protocol for the Internet + community, and requests discussion and suggestions for improvements. + Please refer to the current edition of the "IAB Official Protocol + Standards" for the standardization state and status of this protocol. + Distribution of this memo is unlimited. + +Abstract + + This memo presents a set of TCP extensions to improve performance + over large bandwidth*delay product paths and to provide reliable + operation over very high-speed paths. It defines new TCP options for + scaled windows and timestamps, which are designed to provide + compatible interworking with TCP's that do not implement the + extensions. The timestamps are used for two distinct mechanisms: + RTTM (Round Trip Time Measurement) and PAWS (Protect Against Wrapped + Sequences). Selective acknowledgments are not included in this memo. + + This memo combines and supersedes RFC-1072 and RFC-1185, adding + additional clarification and more detailed specification. Appendix C + summarizes the changes from the earlier RFCs. + +TABLE OF CONTENTS + + 1. Introduction ................................................. 2 + 2. TCP Window Scale Option ...................................... 8 + 3. 
RTTM -- Round-Trip Time Measurement .......................... 11 + 4. PAWS -- Protect Against Wrapped Sequence Numbers ............. 17 + 5. Conclusions and Acknowledgments .............................. 25 + 6. References ................................................... 25 + APPENDIX A: Implementation Suggestions ........................... 27 + APPENDIX B: Duplicates from Earlier Connection Incarnations ...... 27 + APPENDIX C: Changes from RFC-1072, RFC-1185 ...................... 30 + APPENDIX D: Summary of Notation .................................. 31 + APPENDIX E: Event Processing ..................................... 32 + Security Considerations .......................................... 37 + + + +Jacobson, Braden, & Borman [Page 1] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + Authors' Addresses ............................................... 37 + +1. INTRODUCTION + + The TCP protocol [Postel81] was designed to operate reliably over + almost any transmission medium regardless of transmission rate, + delay, corruption, duplication, or reordering of segments. + Production TCP implementations currently adapt to transfer rates in + the range of 100 bps to 10**7 bps and round-trip delays in the range + 1 ms to 100 seconds. Recent work on TCP performance has shown that + TCP can work well over a variety of Internet paths, ranging from 800 + Mbit/sec I/O channels to 300 bit/sec dial-up modems [Jacobson88a]. + + The introduction of fiber optics is resulting in ever-higher + transmission speeds, and the fastest paths are moving out of the + domain for which TCP was originally engineered. This memo defines a + set of modest extensions to TCP to extend the domain of its + application to match this increasing network capability. It is based + upon and obsoletes RFC-1072 [Jacobson88b] and RFC-1185 [Jacobson90b]. + + There is no one-line answer to the question: "How fast can TCP go?". 
+ There are two separate kinds of issues, performance and reliability, + and each depends upon different parameters. We discuss each in turn. + + 1.1 TCP Performance + + TCP performance depends not upon the transfer rate itself, but + rather upon the product of the transfer rate and the round-trip + delay. This "bandwidth*delay product" measures the amount of data + that would "fill the pipe"; it is the buffer space required at + sender and receiver to obtain maximum throughput on the TCP + connection over the path, i.e., the amount of unacknowledged data + that TCP must handle in order to keep the pipeline full. TCP + performance problems arise when the bandwidth*delay product is + large. We refer to an Internet path operating in this region as a + "long, fat pipe", and a network containing this path as an "LFN" + (pronounced "elephan(t)"). + + High-capacity packet satellite channels (e.g., DARPA's Wideband + Net) are LFN's. For example, a DS1-speed satellite channel has a + bandwidth*delay product of 10**6 bits or more; this corresponds to + 100 outstanding TCP segments of 1200 bytes each. Terrestrial + fiber-optical paths will also fall into the LFN class; for + example, a cross-country delay of 30 ms at a DS3 bandwidth + (45Mbps) also exceeds 10**6 bits. + + There are three fundamental performance problems with the current + TCP over LFN paths: + + + +Jacobson, Braden, & Borman [Page 2] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + (1) Window Size Limit + + The TCP header uses a 16 bit field to report the receive + window size to the sender. Therefore, the largest window + that can be used is 2**16 = 65K bytes. + + To circumvent this problem, Section 2 of this memo defines a + new TCP option, "Window Scale", to allow windows larger than + 2**16. This option defines an implicit scale factor, which + is used to multiply the window size value found in a TCP + header to obtain the true window size. 
+ + (2) Recovery from Losses + + Packet losses in an LFN can have a catastrophic effect on + throughput. Until recently, properly-operating TCP + implementations would cause the data pipeline to drain with + every packet loss, and require a slow-start action to + recover. Recently, the Fast Retransmit and Fast Recovery + algorithms [Jacobson90c] have been introduced. Their + combined effect is to recover from one packet loss per + window, without draining the pipeline. However, more than + one packet loss per window typically results in a + retransmission timeout and the resulting pipeline drain and + slow start. + + Expanding the window size to match the capacity of an LFN + results in a corresponding increase of the probability of + more than one packet per window being dropped. This could + have a devastating effect upon the throughput of TCP over an + LFN. In addition, if a congestion control mechanism based + upon some form of random dropping were introduced into + gateways, randomly spaced packet drops would become common, + possibly increasing the probability of dropping more than one + packet per window. + + To generalize the Fast Retransmit/Fast Recovery mechanism to + handle multiple packets dropped per window, selective + acknowledgments are required. Unlike the normal cumulative + acknowledgments of TCP, selective acknowledgments give the + sender a complete picture of which segments are queued at the + receiver and which have not yet arrived. Some evidence in + favor of selective acknowledgments has been published + [NBS85], and selective acknowledgments have been included in + a number of experimental Internet protocols -- VMTP + [Cheriton88], NETBLT [Clark87], and RDP [Velten84], and + proposed for OSI TP4 [NBS85]. 
However, in the non-LFN + regime, selective acknowledgments reduce the number of + + + +Jacobson, Braden, & Borman [Page 3] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + packets retransmitted but do not otherwise improve + performance, making their complexity of questionable value. + However, selective acknowledgments are expected to become + much more important in the LFN regime. + + RFC-1072 defined a new TCP "SACK" option to send a selective + acknowledgment. However, there are important technical + issues to be worked out concerning both the format and + semantics of the SACK option. Therefore, SACK has been + omitted from this package of extensions. It is hoped that + SACK can "catch up" during the standardization process. + + (3) Round-Trip Measurement + + TCP implements reliable data delivery by retransmitting + segments that are not acknowledged within some retransmission + timeout (RTO) interval. Accurate dynamic determination of an + appropriate RTO is essential to TCP performance. RTO is + determined by estimating the mean and variance of the + measured round-trip time (RTT), i.e., the time interval + between sending a segment and receiving an acknowledgment for + it [Jacobson88a]. + + Section 3 introduces a new TCP option, "Timestamps", and then + defines a mechanism using this option that allows nearly + every segment, including retransmissions, to be timed at + negligible computational cost. We use the mnemonic RTTM + (Round Trip Time Measurement) for this mechanism, to + distinguish it from other uses of the Timestamps option. + + + 1.2 TCP Reliability + + Now we turn from performance to reliability. High transfer rate + enters TCP performance through the bandwidth*delay product. + However, high transfer rate alone can threaten TCP reliability by + violating the assumptions behind the TCP mechanism for duplicate + detection and sequencing. 
+ + An especially serious kind of error may result from an accidental + reuse of TCP sequence numbers in data segments. Suppose that an + "old duplicate segment", e.g., a duplicate data segment that was + delayed in Internet queues, is delivered to the receiver at the + wrong moment, so that its sequence numbers fall somewhere within + the current window. There would be no checksum failure to warn of + the error, and the result could be an undetected corruption of the + data. Reception of an old duplicate ACK segment at the + transmitter could be only slightly less serious: it is likely to + + + +Jacobson, Braden, & Borman [Page 4] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + lock up the connection so that no further progress can be made, + forcing an RST on the connection. + + TCP reliability depends upon the existence of a bound on the + lifetime of a segment: the "Maximum Segment Lifetime" or MSL. An + MSL is generally required by any reliable transport protocol, + since every sequence number field must be finite, and therefore + any sequence number may eventually be reused. In the Internet + protocol suite, the MSL bound is enforced by an IP-layer + mechanism, the "Time-to-Live" or TTL field. + + Duplication of sequence numbers might happen in either of two + ways: + + (1) Sequence number wrap-around on the current connection + + A TCP sequence number contains 32 bits. At a high enough + transfer rate, the 32-bit sequence space may be "wrapped" + (cycled) within the time that a segment is delayed in queues. + + (2) Earlier incarnation of the connection + + Suppose that a connection terminates, either by a proper + close sequence or due to a host crash, and the same + connection (i.e., using the same pair of sockets) is + immediately reopened. A delayed segment from the terminated + connection could fall within the current window for the new + incarnation and be accepted as valid. 
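As an illustration of case (1), the following Python sketch shows how a delayed segment can land inside the receive window once the 32-bit sequence space has cycled. The helper names (`seq_add`, `in_window`), the window size, and the byte counts are assumptions chosen for this example; they do not come from the memo or from any particular TCP implementation.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap modulo 2**32


def seq_add(seq, nbytes):
    """Advance a sequence number by nbytes, wrapping at 2**32."""
    return (seq + nbytes) % SEQ_MOD


def in_window(seq, rcv_nxt, rcv_wnd):
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd), computed modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd


# Hypothetical scenario: a segment with sequence number 1000 is delayed in a queue.
old_dup_seq = 1000
rcv_wnd = 8192

# Shortly afterwards the receiver has moved on by 1 MB: the duplicate is rejected.
rcv_nxt_soon = seq_add(old_dup_seq, 10 ** 6)
print(in_window(old_dup_seq, rcv_nxt_soon, rcv_wnd))     # False

# After almost a full cycle of the sequence space (2**32 - 500 bytes later),
# the left window edge sits at 500 and the stale segment looks valid again.
rcv_nxt_wrapped = seq_add(old_dup_seq, SEQ_MOD - 500)
print(rcv_nxt_wrapped)                                   # 500
print(in_window(old_dup_seq, rcv_nxt_wrapped, rcv_wnd))  # True
```

The receiver's window test alone cannot tell the two situations apart, which is why a bound on segment lifetime (or an additional mechanism) is needed.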
+ + Duplicates from earlier incarnations, Case (2), are avoided by + enforcing the current fixed MSL of the TCP spec, as explained in + Section 4.3 and Appendix B. However, case (1), avoiding the + reuse of sequence numbers within the same connection, requires an + MSL bound that depends upon the transfer rate, and at high enough + rates, a new mechanism is required. + + More specifically, if the maximum effective bandwidth at which TCP + is able to transmit over a particular path is B bytes per second, + then the following constraint must be satisfied for error-free + operation: + + 2**31 / B > MSL (secs) [1] + + The following table shows the value for Twrap = 2**31/B in + seconds, for some important values of the bandwidth B: + + + + + + +Jacobson, Braden, & Borman [Page 5] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + Network B*8 B Twrap + bits/sec bytes/sec secs + _______ _______ ______ ______ + + ARPANET 56kbps 7KBps 3*10**5 (~3.6 days) + + DS1 1.5Mbps 190KBps 10**4 (~3 hours) + + Ethernet 10Mbps 1.25MBps 1700 (~30 mins) + + DS3 45Mbps 5.6MBps 380 + + FDDI 100Mbps 12.5MBps 170 + + Gigabit 1Gbps 125MBps 17 + + + It is clear that wrap-around of the sequence space is not a + problem for 56kbps packet switching or even 10Mbps Ethernets. On + the other hand, at DS3 and FDDI speeds, Twrap is comparable to the + 2 minute MSL assumed by the TCP specification [Postel81]. Moving + towards gigabit speeds, Twrap becomes too small for reliable + enforcement by the Internet TTL mechanism. + + The 16-bit window field of TCP limits the effective bandwidth B to + 2**16/RTT, where RTT is the round-trip time in seconds + [McKenzie89]. If the RTT is large enough, this limits B to a + value that meets the constraint [1] for a large MSL value. For + example, consider a transcontinental backbone with an RTT of 60ms + (set by the laws of physics). 
With the bandwidth*delay product + limited to 64KB by the TCP window size, B is then limited to + 1.1MBps, no matter how high the theoretical transfer rate of the + path. This corresponds to cycling the sequence number space in + Twrap= 2000 secs, which is safe in today's Internet. + + It is important to understand that the culprit is not the larger + window but rather the high bandwidth. For example, consider a + (very large) FDDI LAN with a diameter of 10km. Using the speed of + light, we can compute the RTT across the ring as + (2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth + product is then 833 bytes. A TCP connection across this LAN using + a window of only 833 bytes will run at the full 100mbps and can + wrap the sequence space in about 3 minutes, very close to the MSL + of TCP. Thus, high speed alone can cause a reliability problem + with sequence number wrap-around, even without extended windows. + + Watson's Delta-T protocol [Watson81] includes network-layer + mechanisms for precise enforcement of an MSL. In contrast, the IP + + + +Jacobson, Braden, & Borman [Page 6] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + mechanism for MSL enforcement is loosely defined and even more + loosely implemented in the Internet. Therefore, it is unwise to + depend upon active enforcement of MSL for TCP connections, and it + is unrealistic to imagine setting MSL's smaller than the current + values (e.g., 120 seconds specified for TCP). + + A possible fix for the problem of cycling the sequence space would + be to increase the size of the TCP sequence number field. For + example, the sequence number field (and also the acknowledgment + field) could be expanded to 64 bits. This could be done either by + changing the TCP header or by means of an additional option. 
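The wrap-time figures quoted above follow directly from Twrap = 2**31/B and B <= 2**16/RTT. The short Python sketch below recomputes them; the function names and the `RATES` dictionary are introduced only for this illustration, and the byte rates are the rounded values from the table, so the results agree with the memo only to about two significant digits.

```python
HALF_SEQ_SPACE = 2 ** 31   # bytes, the numerator of constraint [1]
MAX_UNSCALED_WINDOW = 2 ** 16  # bytes, limit of the 16-bit window field


def twrap_seconds(rate_bytes_per_sec):
    """Twrap = 2**31 / B, the time to consume half of the sequence space."""
    return HALF_SEQ_SPACE / rate_bytes_per_sec


def window_limited_rate(rtt_seconds, window_bytes=MAX_UNSCALED_WINDOW):
    """Largest B (bytes/sec) that a given window allows: window / RTT."""
    return window_bytes / rtt_seconds


# B in bytes/sec, taken from the table (rounded values).
RATES = {
    "ARPANET": 7e3,
    "DS1": 190e3,
    "Ethernet": 1.25e6,
    "DS3": 5.6e6,
    "FDDI": 12.5e6,
    "Gigabit": 125e6,
}

for name, rate in RATES.items():
    print(f"{name:9s} Twrap = {twrap_seconds(rate):9.0f} s")

# Transcontinental example: RTT = 60 ms with an unscaled 64KB window.
b_limit = window_limited_rate(0.060)
print(f"B limit = {b_limit / 1e6:.2f} MBps, Twrap = {twrap_seconds(b_limit):.0f} s")

# FDDI LAN example: the delay*bandwidth product for a 67 microsecond RTT.
fddi_rtt = (2 * 10 ** 4) / (3 * 10 ** 8)
print(f"FDDI ring BDP = {RATES['FDDI'] * fddi_rtt:.0f} bytes")
```

Comparing each Twrap value against an MSL of 120 seconds shows where constraint [1] stops holding with a comfortable margin.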
+ + Section 4 presents a different mechanism, which we call PAWS + (Protect Against Wrapped Sequence numbers), to extend TCP + reliability to transfer rates well beyond the foreseeable upper + limit of network bandwidths. PAWS uses the TCP Timestamps option + defined in Section 3 to protect against old duplicates from the + same connection. + + 1.3 Using TCP options + + The extensions defined in this memo all use new TCP options. We + must address two possible issues concerning the use of TCP + options: (1) compatibility and (2) overhead. + + We must pay careful attention to compatibility, i.e., to + interoperation with existing implementations. The only TCP option + defined previously, MSS, may appear only on a SYN segment. Every + implementation should (and we expect that most will) ignore + unknown options on SYN segments. However, some buggy TCP + implementation might be crashed by the first appearance of an + option on a non-SYN segment. Therefore, for each of the + extensions defined below, TCP options will be sent on non-SYN + segments only when an exchange of options on the SYN segments has + indicated that both sides understand the extension. Furthermore, + an extension option will be sent in a <SYN,ACK> segment only if + the corresponding option was received in the initial <SYN> + segment. + + A question may be raised about the bandwidth and processing + overhead for TCP options. Those options that occur on SYN + segments are not likely to cause a performance concern. Opening a + TCP connection requires execution of significant special-case + code, and the processing of options is unlikely to increase that + cost significantly. + + On the other hand, a Timestamps option may appear in any data or + ACK segment, adding 12 bytes to the 20-byte TCP header. 
We + + + +Jacobson, Braden, & Borman [Page 7] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + believe that the bandwidth saved by reducing unnecessary + retransmissions will more than pay for the extra header bandwidth. + + There is also an issue about the processing overhead for parsing + the variable byte-aligned format of options, particularly with a + RISC-architecture CPU. To meet this concern, Appendix A contains + a recommended layout of the options in TCP headers to achieve + reasonable data field alignment. In the spirit of Header + Prediction, a TCP can quickly test for this layout and if it is + verified then use a fast path. Hosts that use this canonical + layout will effectively use the options as a set of fixed-format + fields appended to the TCP header. However, to retain the + philosophical and protocol framework of TCP options, a TCP must be + prepared to parse an arbitrary options field, albeit with less + efficiency. + + Finally, we observe that most of the mechanisms defined in this + memo are important for LFN's and/or very high-speed networks. For + low-speed networks, it might be a performance optimization to NOT + use these mechanisms. A TCP vendor concerned about optimal + performance over low-speed paths might consider turning these + extensions off for low-speed paths, or allow a user or + installation manager to disable them. + + +2. TCP WINDOW SCALE OPTION + + 2.1 Introduction + + The window scale extension expands the definition of the TCP + window to 32 bits and then uses a scale factor to carry this 32- + bit value in the 16-bit Window field of the TCP header (SEG.WND in + RFC-793). The scale factor is carried in a new TCP option, Window + Scale. This option is sent only in a SYN segment (a segment with + the SYN bit on), hence the window scale is fixed in each direction + when a connection is opened. (Another design choice would be to + specify the window scale in every TCP segment. 
It would be + incorrect to send a window scale option only when the scale factor + changed, since a TCP option in an acknowledgement segment will not + be delivered reliably (unless the ACK happens to be piggy-backed + on data in the other direction). Fixing the scale when the + connection is opened has the advantage of lower overhead but the + disadvantage that the scale factor cannot be changed during the + connection.) + + The maximum receive window, and therefore the scale factor, is + determined by the maximum receive buffer space. In a typical + modern implementation, this maximum buffer space is set by default + + + +Jacobson, Braden, & Borman [Page 8] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + but can be overridden by a user program before a TCP connection is + opened. This determines the scale factor, and therefore no new + user interface is needed for window scaling. + + 2.2 Window Scale Option + + The three-byte Window Scale option may be sent in a SYN segment by + a TCP. It has two purposes: (1) indicate that the TCP is prepared + to do both send and receive window scaling, and (2) communicate a + scale factor to be applied to its receive window. Thus, a TCP + that is prepared to scale windows should send the option, even if + its own scale factor is 1. The scale factor is limited to a power + of two and encoded logarithmically, so it may be implemented by + binary shift operations. + + + TCP Window Scale Option (WSopt): + + Kind: 3 Length: 3 bytes + + +---------+---------+---------+ + | Kind=3 |Length=3 |shift.cnt| + +---------+---------+---------+ + + + This option is an offer, not a promise; both sides must send + Window Scale options in their SYN segments to enable window + scaling in either direction. If window scaling is enabled, + then the TCP that sent this option will right-shift its true + receive-window values by 'shift.cnt' bits for transmission in + SEG.WND. 
The value 'shift.cnt' may be zero (offering to scale, + while applying a scale factor of 1 to the receive window). + + This option may be sent in an initial <SYN> segment (i.e., a + segment with the SYN bit on and the ACK bit off). It may also + be sent in a <SYN,ACK> segment, but only if a Window Scale op- + tion was received in the initial <SYN> segment. A Window Scale + option in a segment without a SYN bit should be ignored. + + The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment + itself is never scaled. + + 2.3 Using the Window Scale Option + + A model implementation of window scaling is as follows, using the + notation of RFC-793 [Postel81]: + + * All windows are treated as 32-bit quantities for storage in + + + +Jacobson, Braden, & Borman [Page 9] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + the connection control block and for local calculations. + This includes the send-window (SND.WND) and the receive- + window (RCV.WND) values, as well as the congestion window. + + * The connection state is augmented by two window shift counts, + Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the + incoming and outgoing window fields, respectively. + + * If a TCP receives a <SYN> segment containing a Window Scale + option, it sends its own Window Scale option in the <SYN,ACK> + segment. + + * The Window Scale option is sent with shift.cnt = R, where R + is the value that the TCP would like to use for its receive + window. + + * Upon receiving a SYN segment with a Window Scale option + containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and + sets Rcv.Wind.Scale to R; otherwise, it sets both + Snd.Wind.Scale and Rcv.Wind.Scale to zero. + + * The window field (SEG.WND) in the header of every incoming + segment, with the exception of SYN segments, is left-shifted + by Snd.Wind.Scale bits before updating SND.WND: + + SND.WND = SEG.WND << Snd.Wind.Scale + + (assuming the other conditions of RFC793 are met, and using + the "C" notation "<<" for left-shift). 
+ + * The window field (SEG.WND) of every outgoing segment, with + the exception of SYN segments, is right-shifted by + Rcv.Wind.Scale bits: + + SEG.WND = RCV.WND >> Rcv.Wind.Scale. + + + TCP determines if a data segment is "old" or "new" by testing + whether its sequence number is within 2**31 bytes of the left edge + of the window, and if it is not, discarding the data as "old". To + ensure that new data is never mistakenly considered old and vice- + versa, the left edge of the sender's window has to be at most + 2**31 away from the right edge of the receiver's window. + Similarly with the sender's right edge and receiver's left edge. + Since the right and left edges of either the sender's or + receiver's window differ by the window size, and since the sender + and receiver windows can be out of phase by at most the window + size, the above constraints imply that 2 * the max window size + + + +Jacobson, Braden, & Borman [Page 10] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + must be less than 2**31, or + + max window < 2**30 + + Since the max window is 2**S (where S is the scaling shift count) + times at most 2**16 - 1 (the maximum unscaled window), the maximum + window is guaranteed to be < 2**30 if S <= 14. Thus, the shift + count must be limited to 14 (which allows windows of 2**30 = 1 + Gbyte). If a Window Scale option is received with a shift.cnt + value exceeding 14, the TCP should log the error but use 14 + instead of the specified value. + + The scale factor applies only to the Window field as transmitted + in the TCP header; each TCP using extended windows will maintain + the window values locally as 32-bit numbers. For example, the + "congestion window" computed by Slow Start and Congestion + Avoidance is not affected by the scale factor, so window scaling + will not introduce quantization into the congestion window. + +3. 
RTTM: ROUND-TRIP TIME MEASUREMENT + + 3.1 Introduction + + Accurate and current RTT estimates are necessary to adapt to + changing traffic conditions and to avoid an instability known as + "congestion collapse" [Nagle84] in a busy network. However, + accurate measurement of RTT may be difficult both in theory and in + implementation. + + Many TCP implementations base their RTT measurements upon a sample + of only one packet per window. While this yields an adequate + approximation to the RTT for small windows, it results in an + unacceptably poor RTT estimate for an LFN. If we look at RTT + estimation as a signal processing problem (which it is), a data + signal at some frequency, the packet rate, is being sampled at a + lower frequency, the window rate. This lower sampling frequency + violates Nyquist's criteria and may therefore introduce "aliasing" + artifacts into the estimated RTT [Hamming77]. + + A good RTT estimator with a conservative retransmission timeout + calculation can tolerate aliasing when the sampling frequency is + "close" to the data frequency. For example, with a window of 8 + packets, the sample rate is 1/8 the data frequency -- less than an + order of magnitude different. However, when the window is tens or + hundreds of packets, the RTT estimator may be seriously in error, + resulting in spurious retransmissions. + + If there are dropped packets, the problem becomes worse. Zhang + + + +Jacobson, Braden, & Borman [Page 11] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is + not possible to accumulate reliable RTT estimates if retransmitted + segments are included in the estimate. Since a full window of + data will have been transmitted prior to a retransmission, all of + the segments in that window will have to be ACKed before the next + RTT sample can be taken. 
This means at least an additional + window's worth of time between RTT measurements and, as the error + rate approaches one per window of data (e.g., 10**-6 errors per + bit for the Wideband satellite network), it becomes effectively + impossible to obtain a valid RTT measurement. + + A solution to these problems, which actually simplifies the sender + substantially, is as follows: using TCP options, the sender places + a timestamp in each data segment, and the receiver reflects these + timestamps back in ACK segments. Then a single subtract gives the + sender an accurate RTT measurement for every ACK segment (which + will correspond to every other data segment, with a sensible + receiver). We call this the RTTM (Round-Trip Time Measurement) + mechanism. + + It is vitally important to use the RTTM mechanism with big + windows; otherwise, the door is opened to some dangerous + instabilities due to aliasing. Furthermore, the option is + probably useful for all TCP's, since it simplifies the sender. + + 3.2 TCP Timestamps Option + + TCP is a symmetric protocol, allowing data to be sent at any time + in either direction, and therefore timestamp echoing may occur in + either direction. For simplicity and symmetry, we specify that + timestamps always be sent and echoed in both directions. For + efficiency, we combine the timestamp and timestamp reply fields + into a single TCP Timestamps Option. + + + + + + + + + + + + + + + + + + +Jacobson, Braden, & Borman [Page 12] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + TCP Timestamps Option (TSopt): + + Kind: 8 + + Length: 10 bytes + + +-------+-------+---------------------+---------------------+ + |Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)| + +-------+-------+---------------------+---------------------+ + 1 1 4 4 + + The Timestamps option carries two four-byte timestamp fields. + The Timestamp Value field (TSval) contains the current value of + the timestamp clock of the TCP sending the option. 
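The 10-byte layout shown above can be made concrete with a small Python sketch that packs and parses a Timestamps option. The constant and function names (`TCPOPT_TIMESTAMP`, `pack_tsopt`, `unpack_tsopt`) are illustrative assumptions, not names defined by this memo; the only facts taken from the text are Kind=8, Length=10, and the two four-byte fields, carried in network byte order like other TCP header fields.

```python
import struct

TCPOPT_TIMESTAMP = 8     # Kind value from the option diagram
TCPOLEN_TIMESTAMP = 10   # Length: 1 (kind) + 1 (length) + 4 (TSval) + 4 (TSecr)

# "!" selects network byte order with no padding: two bytes, two 32-bit words.
_TSOPT = struct.Struct("!BBII")


def pack_tsopt(tsval, tsecr):
    """Encode a Timestamps option; both values are reduced modulo 2**32."""
    return _TSOPT.pack(TCPOPT_TIMESTAMP, TCPOLEN_TIMESTAMP,
                       tsval & 0xFFFFFFFF, tsecr & 0xFFFFFFFF)


def unpack_tsopt(data):
    """Decode a Timestamps option and return (TSval, TSecr)."""
    kind, length, tsval, tsecr = _TSOPT.unpack(data[:TCPOLEN_TIMESTAMP])
    if kind != TCPOPT_TIMESTAMP or length != TCPOLEN_TIMESTAMP:
        raise ValueError("not a Timestamps option")
    return tsval, tsecr


encoded = pack_tsopt(1, 120)
print(encoded.hex())          # 080a0000000100000078
print(unpack_tsopt(encoded))  # (1, 120)
```

A real implementation would locate the option while walking the options field of the TCP header; this sketch assumes the ten bytes have already been isolated.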
+ + The Timestamp Echo Reply field (TSecr) is only valid if the ACK + bit is set in the TCP header; if it is valid, it echoes a times- + tamp value that was sent by the remote TCP in the TSval field + of a Timestamps option. When TSecr is not valid, its value + must be zero. The TSecr value will generally be from the most + recent Timestamp option that was received; however, there are + exceptions that are explained below. + + A TCP may send the Timestamps option (TSopt) in an initial + <SYN> segment (i.e., segment containing a SYN bit and no ACK + bit), and may send a TSopt in other segments only if it re- + ceived a TSopt in the initial <SYN> segment for the connection. + + 3.3 The RTTM Mechanism + + The timestamp value to be sent in TSval is to be obtained from a + (virtual) clock that we call the "timestamp clock". Its values + must be at least approximately proportional to real time, in order + to measure actual RTT. + + The following example illustrates a one-way data flow with + segments arriving in sequence without loss. Here A, B, C... + represent data blocks occupying successive blocks of sequence + numbers, and ACK(A),... represent the corresponding cumulative + acknowledgments. The two timestamp fields of the Timestamps + option are shown symbolically as <TSval=x,TSecr=y>. Each TSecr + field contains the value most recently received in a TSval field. + + + + + + + + + +Jacobson, Braden, & Borman [Page 13] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + + TCP A TCP B + + <A,TSval=1,TSecr=120> ------> + + <---- <ACK(A),TSval=127,TSecr=1> + + <B,TSval=5,TSecr=127> ------> + + <---- <ACK(B),TSval=131,TSecr=5> + + . . . . . . . . . . . . . . . . . . . . . . + + <C,TSval=65,TSecr=131> ------> + + <---- <ACK(C),TSval=191,TSecr=65> + + (etc) + + + The dotted line marks a pause (60 time units long) in which A had + nothing to send. Note that this pause inflates the RTT which B + could infer from receiving TSecr=131 in data segment C. Thus, in + one-way data flows, RTTM in the reverse direction measures a value + that is inflated by gaps in sending data. 
However, the following + rule prevents a resulting inflation of the measured RTT: + + A TSecr value received in a segment is used to update the + averaged RTT measurement only if the segment acknowledges + some new data, i.e., only if it advances the left edge of the + send window. + + Since TCP B is not sending data, the data segment C does not + acknowledge any new data when it arrives at B. Thus, the inflated + RTTM measurement is not used to update B's RTTM measurement. + + 3.4 Which Timestamp to Echo + + If more than one Timestamps option is received before a reply + segment is sent, the TCP must choose only one of the TSvals to + echo, ignoring the others. To minimize the state kept in the + receiver (i.e., the number of unprocessed TSvals), the receiver + should be required to retain at most one timestamp in the + connection control block. + + + + + + + +Jacobson, Braden, & Borman [Page 14] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + There are three situations to consider: + + (A) Delayed ACKs. + + Many TCP's acknowledge only every Kth segment out of a group + of segments arriving within a short time interval; this + policy is known generally as "delayed ACKs". The data-sender + TCP must measure the effective RTT, including the additional + time due to delayed ACKs, or else it will retransmit + unnecessarily. Thus, when delayed ACKs are in use, the + receiver should reply with the TSval field from the earliest + unacknowledged segment. + + (B) A hole in the sequence space (segment(s) have been lost). + + The sender will continue sending until the window is filled, + and the receiver may be generating ACKs as these out-of-order + segments arrive (e.g., to aid "fast retransmit"). + + The lost segment is probably a sign of congestion, and in + that situation the sender should be conservative about + retransmission. Furthermore, it is better to overestimate + than underestimate the RTT. 
An ACK for an out-of-order + segment should therefore contain the timestamp from the most + recent segment that advanced the window. + + The same situation occurs if segments are re-ordered by the + network. + + (C) A filled hole in the sequence space. + + The segment that fills the hole represents the most recent + measurement of the network characteristics. On the other + hand, an RTT computed from an earlier segment would probably + include the sender's retransmit time-out, badly biasing the + sender's average RTT estimate. Thus, the timestamp from the + latest segment (which filled the hole) must be echoed. + + An algorithm that covers all three cases is described in the + following rules for Timestamps option processing on a synchronized + connection: + + (1) The connection state is augmented with two 32-bit slots: + TS.Recent holds a timestamp to be echoed in TSecr whenever a + segment is sent, and Last.ACK.sent holds the ACK field from + the last segment sent. Last.ACK.sent will equal RCV.NXT + except when ACKs have been delayed. + + + + +Jacobson, Braden, & Borman [Page 15] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + (2) If Last.ACK.sent falls within the range of sequence numbers + of an incoming segment: + + SEG.SEQ <= Last.ACK.sent < SEG.SEQ + SEG.LEN + + then the TSval from the segment is copied to TS.Recent; + otherwise, the TSval is ignored. + + (3) When a TSopt is sent, its TSecr field is set to the current + TS.Recent value. + + The following examples illustrate these rules. Here A, B, C... + represent data segments occupying successive blocks of sequence + numbers, and ACK(A),... represent the corresponding + acknowledgment segments. Note that ACK(A) has the same sequence + number as B. We show only one direction of timestamp echoing, for + clarity. + + + o Packets arrive in sequence, and some of the ACKs are delayed. + + By Case (A), the timestamp from the oldest unacknowledged + segment is echoed. 
+ + TS.Recent + <A, TSval=1> -------------------> + 1 + <B, TSval=2> -------------------> + 1 + <C, TSval=3> -------------------> + 1 + <---- <ACK(C), TSecr=1> + (etc) + + o Packets arrive out of order, and every packet is + acknowledged. + + By Case (B), the timestamp from the last segment that + advanced the left window edge is echoed, until the missing + segment arrives; it is echoed according to Case (C). The + same sequence would occur if segments B and D were lost and + retransmitted. + + TS.Recent + <A, TSval=1> -------------------> + 1 + <---- <ACK(A), TSecr=1> + 1 + <C, TSval=3> -------------------> + 1 + <---- <ACK(A), TSecr=1> + 1 + <B, TSval=2> -------------------> + 2 + <---- <ACK(C), TSecr=2> + 2 + <E, TSval=5> -------------------> + 2 + <---- <ACK(C), TSecr=2> + 2 + <D, TSval=4> -------------------> + 4 + <---- <ACK(E), TSecr=4> + (etc) + + + + +4. PAWS: PROTECT AGAINST WRAPPED SEQUENCE NUMBERS + + 4.1 Introduction + + Section 4.2 describes a simple mechanism to reject old duplicate + segments that might corrupt an open TCP connection; we call this + mechanism PAWS (Protect Against Wrapped Sequence numbers). PAWS + operates within a single TCP connection, using state that is saved + in the connection control block. Section 4.3 and Appendix B + discuss the implications of the PAWS mechanism for avoiding old + duplicates from previous incarnations of the same connection. + + 4.2 The PAWS Mechanism + + PAWS uses the same TCP Timestamps option as the RTTM mechanism + described earlier, and assumes that every received TCP segment + (including data and ACK segments) contains a timestamp SEG.TSval + whose values are monotone non-decreasing in time. The basic idea + is that a segment can be discarded as an old duplicate if it is + received with a timestamp SEG.TSval less than some timestamp + recently received on this connection. + + In both the PAWS and the RTTM mechanism, the "timestamps" are + 32-bit unsigned integers in a modular 32-bit space.
Thus, "less + than" is defined the same way it is for TCP sequence numbers, and + the same implementation techniques apply. If s and t are + timestamp values, s < t if 0 < (t - s) < 2**31, computed in + unsigned 32-bit arithmetic. + + The choice of incoming timestamps to be saved for this comparison + must guarantee a value that is monotone increasing. For example, + we might save the timestamp from the segment that last advanced + the left edge of the receive window, i.e., the most recent in- + sequence segment. Instead, we choose the value TS.Recent + introduced in Section 3.4 for the RTTM mechanism, since using a + common value for both PAWS and RTTM simplifies the implementation + of both. As Section 3.4 explained, TS.Recent differs from the + timestamp from the last in-sequence segment only in the case of + delayed ACKs, and therefore by less than one window. Either + choice will therefore protect against sequence number wrap-around. + + RTTM was specified in a symmetrical manner, so that TSval + timestamps are carried in both data and ACK segments and are + echoed in TSecr fields carried in returning ACK or data segments. + PAWS submits all incoming segments to the same test, and therefore + protects against duplicate ACK segments as well as data segments. + (An alternative un-symmetric algorithm would protect against old + duplicate ACKs: the sender of data would reject incoming ACK + segments whose TSecr values were less than the TSecr saved from + the last segment whose ACK field advanced the left edge of the + send window. This algorithm was deemed to lack economy of + mechanism and symmetry.) + + TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to + initialize PAWS. PAWS protects against old duplicate non-SYN + segments, and duplicate SYN segments received while there is a + synchronized connection. 
Duplicate {SYN} and {SYN,ACK} segments + received when there is no connection will be discarded by the + normal 3-way handshake and sequence number checks of TCP. + + It is recommended that RST segments NOT carry timestamps, and that + RST segments be acceptable regardless of their timestamp. Old + duplicate RST segments should be exceedingly unlikely, and their + cleanup function should take precedence over timestamps. + + 4.2.1 Basic PAWS Algorithm + + The PAWS algorithm requires the following processing to be + performed on all incoming segments for a synchronized + connection: + + + + +Jacobson, Braden, & Borman [Page 18] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + R1) If there is a Timestamps option in the arriving segment + and SEG.TSval < TS.Recent and if TS.Recent is valid (see + later discussion), then treat the arriving segment as not + acceptable: + + Send an acknowledgement in reply as specified in + RFC-793 page 69 and drop the segment. + + Note: it is necessary to send an ACK segment in order + to retain TCP's mechanisms for detecting and + recovering from half-open connections. For example, + see Figure 10 of RFC-793. + + R2) If the segment is outside the window, reject it (normal + TCP processing) + + R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent + (see Section 3.4), then record its timestamp in TS.Recent. + + R4) If an arriving segment is in-sequence (i.e., at the left + window edge), then accept it normally. + + R5) Otherwise, treat the segment as a normal in-window, out- + of-sequence TCP segment (e.g., queue it for later delivery + to the user). + + Steps R2, R4, and R5 are the normal TCP processing steps + specified by RFC-793. + + It is important to note that the timestamp is checked only when + a segment first arrives at the receiver, regardless of whether + it is in-sequence or it must be queued for later delivery. + Consider the following example. 
+ + Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has + been sent, where the letter indicates the sequence number + and the digit represents the timestamp. Suppose also that + segment B.1 has been lost. The timestamp in TS.Recent is + 1 (from A.1), so C.1, ..., Z.1 are considered acceptable + and are queued. When B is retransmitted as segment B.2 + (using the latest timestamp), it fills the hole and causes + all the segments through Z to be acknowledged and passed + to the user. The timestamps of the queued segments are + *not* inspected again at this time, since they have + already been accepted. When B.2 is accepted, TS.Recent is + set to 2. + + This rule allows reasonable performance under loss. A full + window of data is in transit at all times, and after a loss a + full window less one packet will show up out-of-sequence to be + queued at the receiver (e.g., up to ~2**30 bytes of data); the + timestamp option must not result in discarding this data. + + In certain unlikely circumstances, the algorithm of rules R1-R5 + could lead to discarding some segments unnecessarily, as shown + in the following example: + + Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have + been sent in sequence and that segment B.1 has been lost. + Furthermore, suppose delivery of some of C.1, ... Z.1 is + delayed until AFTER the retransmission B.2 arrives at the + receiver. These delayed segments will be discarded + unnecessarily when they do arrive, since their timestamps + are now out of date. + + This case is very unlikely to occur. If the retransmission was + triggered by a timeout, some of the segments C.1, ... Z.1 must + have been delayed longer than the RTO time. This is presumably + an unlikely event, or there would be many spurious timeouts and + retransmissions.
If B's retransmission was triggered by the + "fast retransmit" algorithm, i.e., by duplicate ACKs, then the + queued segments that caused these ACKs must have been received + already. + + Even if a segment were delayed past the RTO, the Fast + Retransmit mechanism [Jacobson90c] will cause the delayed + packets to be retransmitted at the same time as B.2, avoiding + an extra RTT and therefore causing a very small performance + penalty. + + We know of no case with a significant probability of occurrence + in which timestamps will cause performance degradation by + unnecessarily discarding segments. + + 4.2.2 Timestamp Clock + + It is important to understand that the PAWS algorithm does not + require clock synchronization between sender and receiver. The + sender's timestamp clock is used to stamp the segments, and the + sender uses the echoed timestamp to measure RTT's. However, + the receiver treats the timestamp as simply a monotone- + increasing serial number, without any necessary connection to + its clock. From the receiver's viewpoint, the timestamp is + acting as a logical extension of the high-order bits of the + sequence number. + + + + +Jacobson, Braden, & Borman [Page 20] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + The receiver algorithm does place some requirements on the + frequency of the timestamp clock. + + (a) The timestamp clock must not be "too slow". + + It must tick at least once for each 2**31 bytes sent. In + fact, in order to be useful to the sender for round trip + timing, the clock should tick at least once per window's + worth of data, and even with the RFC-1072 window + extension, 2**31 bytes must be at least two windows. + + To make this more quantitative, any clock faster than 1 + tick/sec will reject old duplicate segments for link + speeds of ~8 Gbps. A 1ms timestamp clock will work at + link speeds up to 8 Tbps (8*10**12) bps! + + (b) The timestamp clock must not be "too fast". 
+ + Its recycling time must be greater than MSL seconds. + Since the clock (timestamp) is 32 bits and the worst-case + MSL is 255 seconds, the maximum acceptable clock frequency + is one tick every 59 ns. + + However, it is desirable to establish a much longer + recycle period, in order to handle outdated timestamps on + idle connections (see Section 4.2.3), and to relax the MSL + requirement for preventing sequence number wrap-around. + With a 1 ms timestamp clock, the 32-bit timestamp will + wrap its sign bit in 24.8 days. Thus, it will reject old + duplicates on the same connection if MSL is 24.8 days or + less. This appears to be a very safe figure; an MSL of + 24.8 days or longer can probably be assumed by the gateway + system without requiring precise MSL enforcement by the + TTL value in the IP layer. + + Based upon these considerations, we choose a timestamp clock + frequency in the range 1 ms to 1 sec per tick. This range also + matches the requirements of the RTTM mechanism, which does not + need much more resolution than the granularity of the + retransmit timer, e.g., tens or hundreds of milliseconds. + + The PAWS mechanism also puts a strong monotonicity requirement + on the sender's timestamp clock. The method of implementation + of the timestamp clock to meet this requirement depends upon + the system hardware and software. + + * Some hosts have a hardware clock that is guaranteed to be + monotonic between hardware resets. + + + +Jacobson, Braden, & Borman [Page 21] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + * A clock interrupt may be used to simply increment a binary + integer by 1 periodically. + + * The timestamp clock may be derived from a system clock + that is subject to being abruptly changed, by adding a + variable offset value. This offset is initialized to + zero. 
When a new timestamp clock value is needed, the + offset can be adjusted as necessary to make the new value + equal to or larger than the previous value (which was + saved for this purpose). + + + 4.2.3 Outdated Timestamps + + If a connection remains idle long enough for the timestamp + clock of the other TCP to wrap its sign bit, then the value + saved in TS.Recent will become too old; as a result, the PAWS + mechanism will cause all subsequent segments to be rejected, + freezing the connection (until the timestamp clock wraps its + sign bit again). + + With the chosen range of timestamp clock frequencies (1 sec to + 1 ms), the time to wrap the sign bit will be between 24.8 days + and 24800 days. A TCP connection that is idle for more than 24 + days and then comes to life is exceedingly unusual. However, + it is undesirable in principle to place any limitation on TCP + connection lifetimes. + + We therefore require that an implementation of PAWS include a + mechanism to "invalidate" the TS.Recent value when a connection + is idle for more than 24 days. (An alternative solution to the + problem of outdated timestamps would be to send keepalive + segments at a very low rate, but still more often than the + wrap-around time for timestamps, e.g., once a day. This would + impose negligible overhead. However, the TCP specification has + never included keepalives, so the solution based upon + invalidation was chosen.) + + Note that a TCP does not know the frequency, and therefore, the + wraparound time, of the other TCP, so it must assume the worst. + The validity of TS.Recent needs to be checked only if the basic + PAWS timestamp check fails, i.e., only if SEG.TSval < + TS.Recent. If TS.Recent is found to be invalid, then the + segment is accepted, regardless of the failure of the timestamp + check, and rule R3 updates TS.Recent with the TSval from the + new segment. 
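The check described above (rule R1, the TS.Recent validity test, and the R3 update) can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the specification: the names `Connection`, `paws_check`, `ts_lt`, `seq_le`, and `ts_recent_set_at` are hypothetical, the local clock `now` is assumed to be in seconds, and the window check of R2 is left out.

```python
MOD = 1 << 32            # timestamps and sequence numbers are modular 32-bit values
HALF = 1 << 31
IDLE_LIMIT = 24 * 86400  # seconds; the "idle for more than 24 days" threshold


def ts_lt(s, t):
    """s < t as defined in Section 4.2: 0 < (t - s) < 2**31, modulo 2**32."""
    return 0 < ((t - s) % MOD) < HALF


def seq_le(a, b):
    """a <= b for 32-bit sequence numbers, using the same modular technique."""
    return ((b - a) % MOD) < HALF


class Connection:
    """Hypothetical per-connection state; fields mirror the memo's variables."""

    def __init__(self, ts_recent, last_ack_sent, now):
        self.ts_recent = ts_recent          # TS.Recent
        self.last_ack_sent = last_ack_sent  # Last.ACK.sent
        self.ts_recent_set_at = now         # local time of the last TS.Recent update


def paws_check(conn, seg_seq, seg_tsval, now):
    """Return False if the segment should be ACKed and dropped (R1), else True."""
    if ts_lt(seg_tsval, conn.ts_recent):
        # TS.Recent validity is examined only when the basic check fails.
        if now - conn.ts_recent_set_at <= IDLE_LIMIT:
            return False
        # Otherwise TS.Recent is considered invalid and the segment is accepted.
    # (R2, the normal window check, is omitted from this sketch.)
    if seq_le(seg_seq, conn.last_ack_sent):
        # R3: record the timestamp and remember when it was recorded.
        conn.ts_recent = seg_tsval
        conn.ts_recent_set_at = now
    return True
```

Recording the local time alongside TS.Recent is one way to measure idleness; an implementation could equally invalidate TS.Recent from a timer.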
+ + To detect how long the connection has been idle, the TCP may + + + +Jacobson, Braden, & Borman [Page 22] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + update a clock or timestamp value associated with the + connection whenever TS.Recent is updated, for example. The + details will be implementation-dependent. + + 4.2.4 Header Prediction + + "Header prediction" [Jacobson90a] is a high-performance + transport protocol implementation technique that is most + important for high-speed links. This technique optimizes the + code for the most common case, receiving a segment correctly + and in order. Using header prediction, the receiver asks the + question, "Is this segment the next in sequence?" This + question can be answered in fewer machine instructions than the + question, "Is this segment within the window?" + + Adding header prediction to our timestamp procedure leads to + the following recommended sequence for processing an arriving + TCP segment: + + H1) Check timestamp (same as step R1 above) + + H2) Do header prediction: if segment is next in sequence and + if there are no special conditions requiring additional + processing, accept the segment, record its timestamp, and + skip H3. + + H3) Process the segment normally, as specified in RFC-793. + This includes dropping segments that are outside the win- + dow and possibly sending acknowledgments, and queueing + in-window, out-of-sequence segments. + + Another possibility would be to interchange steps H1 and H2, + i.e., to perform the header prediction step H2 FIRST, and + perform H1 and H3 only when header prediction fails. This + could be a performance improvement, since the timestamp check + in step H1 is very unlikely to fail, and it requires interval + arithmetic on a finite field, a relatively expensive operation. + To perform this check on every single segment is contrary to + the philosophy of header prediction. 
We believe that this + change might reduce CPU time for TCP protocol processing by up + to 5-10% on high-speed networks. + + However, putting H2 first would create a hazard: a segment from + 2**32 bytes in the past might arrive at exactly the wrong time + and be accepted mistakenly by the header-prediction step. The + following reasoning has been introduced [Jacobson90b] to show + that the probability of this failure is negligible. + + + + +Jacobson, Braden, & Borman [Page 23] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + If all segments are equally likely to show up as old + duplicates, then the probability of an old duplicate + exactly matching the left window edge is the maximum + segment size (MSS) divided by the size of the sequence + space. This ratio must be less than 2**-16, since MSS + must be < 2**16; for example, it will be (2**12)/(2**32) = + 2**-20 for an FDDI link. However, the older a segment is, + the less likely it is to be retained in the Internet, and + under any reasonable model of segment lifetime the + probability of an old duplicate exactly at the left window + edge must be much smaller than 2**-16. + + The 16 bit TCP checksum also allows a basic unreliability + of one part in 2**16. A protocol mechanism whose + reliability exceeds the reliability of the TCP checksum + should be considered "good enough", i.e., it won't + contribute significantly to the overall error rate. We + therefore believe we can ignore the problem of an old + duplicate being accepted by doing header prediction before + checking the timestamp. + + However, this probabilistic argument is not universally + accepted, and the consensus at present is that the performance + gain does not justify the hazard in the general case. It is + therefore recommended that H2 follow H1. + + 4.3. Duplicates from Earlier Incarnations of Connection + + The PAWS mechanism protects against errors due to sequence number + wrap-around on high-speed connection. 
Segments from an earlier + incarnation of the same connection are also a potential cause of + old duplicate errors. In both cases, the TCP mechanisms to + prevent such errors depend upon the enforcement of a maximum + segment lifetime (MSL) by the Internet (IP) layer (see Appendix of + RFC-1185 for a detailed discussion). Unlike the case of sequence + space wrap-around, the MSL required to prevent old duplicate + errors from earlier incarnations does not depend upon the transfer + rate. If the IP layer enforces the recommended 2 minute MSL of + TCP, and if the TCP rules are followed, TCP connections will be + safe from earlier incarnations, no matter how high the network + speed. Thus, the PAWS mechanism is not required for this case. + + We may still ask whether the PAWS mechanism can provide additional + security against old duplicates from earlier connections, allowing + us to relax the enforcement of MSL by the IP layer. Appendix B + explores this question, showing that further assumptions and/or + mechanisms are required, beyond those of PAWS. This is not part + of the current extension. + + + +Jacobson, Braden, & Borman [Page 24] + +RFC 1323 TCP Extensions for High Performance May 1992 + + +5. CONCLUSIONS AND ACKNOWLEDGMENTS + + This memo presented a set of extensions to TCP to provide efficient + operation over large-bandwidth*delay-product paths and reliable + operation over very high-speed paths. These extensions are designed + to provide compatible interworking with TCP's that do not implement + the extensions. + + These mechanisms are implemented using new TCP options for scaled + windows and timestamps. The timestamps are used for two distinct + mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect + Against Wrapped Sequences). + + The Window Scale option was originally suggested by Mike St. Johns of + USAF/DCA. 
The present form of the option was suggested by Mike + Karels of UC Berkeley in response to a more cumbersome scheme defined + by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism + description in RFC-1185. + + Finally, much of this work originated as the result of discussions + within the End-to-End Task Force on the theoretical limitations of + transport protocols in general and TCP in particular. More recently, + task force members and others on the end2end-interest list have made + valuable contributions by pointing out flaws in the algorithms and + the documentation. The authors are grateful for all these + contributions. + +6. REFERENCES + + [Clark87] Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk + Data Transfer Protocol", RFC 998, MIT, March 1987. + + [Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in + Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop + on Distributed Data Management and Computer Networks, May 1977. + + [Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4, + Prentice Hall, Englewood Cliffs, N.J., 1977. + + [Cheriton88] Cheriton, D., "VMTP: Versatile Message Transaction + Protocol", RFC 1045, Stanford University, February 1988. + + [Jacobson88a] Jacobson, V., "Congestion Avoidance and Control", + SIGCOMM '88, Stanford, CA., August 1988. + + [Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for + Long-Delay Paths", RFC-1072, LBL and USC/Information Sciences + Institute, October 1988. + + [Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM + Computer Communication Review, April 1990. + + [Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP + Extension for High-Speed Paths", RFC-1185, LBL and USC/Information + Sciences Institute, October 1990.
+ + [Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance + algorithm", Message to end2end-interest mailing list, April 1990. + + [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet + Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm., + Scottsdale, Arizona, March 1986. + + [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times + in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT, + August 1987. + + [McKenzie89] McKenzie, A., "A Problem with the TCP Big Window + Option", RFC 1110, BBN STC, August 1989. + + [Nagle84] Nagle, J., "Congestion Control in IP/TCP + Internetworks", RFC 896, FACC, January 1984. + + [NBS85] Colella, R., Aronoff, R., and K. Mills, "Performance + Improvements for ISO Transport", Ninth Data Comm Symposium, + published in ACM SIGCOMM Comp Comm Review, vol. 15, no. 5, + September 1985. + + [Postel81] Postel, J., "Transmission Control Protocol - DARPA + Internet Program Protocol Specification", RFC 793, DARPA, + September 1981. + + [Velten84] Velten, D., Hinden, R., and J. Sax, "Reliable Data + Protocol", RFC 908, BBN, July 1984. + + [Watson81] Watson, R., "Timer-based Mechanisms in Reliable + Transport Protocol Connection Management", Computer Networks, Vol. + 5, 1981. + + [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. + SIGCOMM '86, Stowe, Vt., August 1986. + + + + + + + + + +Jacobson, Braden, & Borman [Page 26] + +RFC 1323 TCP Extensions for High Performance May 1992 + + +APPENDIX A: IMPLEMENTATION SUGGESTIONS + + The following layouts are recommended for sending options on non-SYN + segments, to achieve maximum feasible alignment of 32-bit and 64-bit + machines. 
+ + + +--------+--------+--------+--------+ + | NOP | NOP | TSopt | 10 | + +--------+--------+--------+--------+ + | TSval timestamp | + +--------+--------+--------+--------+ + | TSecr timestamp | + +--------+--------+--------+--------+ + + +APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS + + There are two cases to be considered: (1) a system crashing (and + losing connection state) and restarting, and (2) the same connection + being closed and reopened without a loss of host state. These will + be described in the following two sections. + + B.1 System Crash with Loss of State + + TCP's quiet time of one MSL upon system startup handles the loss + of connection state in a system crash/restart. For an + explanation, see for example "When to Keep Quiet" in the TCP + protocol specification [Postel81]. The MSL that is required here + does not depend upon the transfer speed. The current TCP MSL of 2 + minutes seems acceptable as an operational compromise, as many + host systems take this long to boot after a crash. + + However, the timestamp option may be used to ease the MSL + requirements (or to provide additional security against data + corruption). If timestamps are being used and if the timestamp + clock can be guaranteed to be monotonic over a system + crash/restart, i.e., if the first value of the sender's timestamp + clock after a crash/restart can be guaranteed to be greater than + the last value before the restart, then a quiet time will be + unnecessary. + + To dispense totally with the quiet time would require that the + host clock be synchronized to a time source that is stable over + the crash/restart period, with an accuracy of one timestamp clock + tick or better. We can back off from this strict requirement to + take advantage of approximate clock synchronization. 
Suppose that + the clock is always re-synchronized to within N timestamp clock + ticks and that booting (extended with a quiet time, if necessary) + takes more than N ticks. This will guarantee monotonicity of the + timestamps, which can then be used to reject old duplicates even + without an enforced MSL. + + B.2 Closing and Reopening a Connection + + When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT + state ties up the socket pair for 4 minutes (see Section 3.5 of + [Postel81]). Applications built upon TCP that close one connection + and open a new one (e.g., an FTP data transfer connection using + Stream mode) must choose a new socket pair each time. The TIME- + WAIT delay serves two different purposes: + + (a) Implement the full-duplex reliable close handshake of TCP. + + The proper time to delay the final close step is not really + related to the MSL; it depends instead upon the RTO for the + FIN segments and therefore upon the RTT of the path. (It + could be argued that the side that is sending a FIN knows + what degree of reliability it needs, and therefore it should + be able to determine the length of the TIME-WAIT delay for + the FIN's recipient. This could be accomplished with an + appropriate TCP option in FIN segments.) + + Although there is no formal upper-bound on RTT, common + network engineering practice makes an RTT greater than 1 + minute very unlikely. Thus, the 4 minute delay in TIME-WAIT + state works satisfactorily to provide a reliable full-duplex + TCP close. Note again that this is independent of MSL + enforcement and network speed. + + The TIME-WAIT state could cause an indirect performance + problem if an application needed to repeatedly close one + connection and open another at a very high frequency, since + the number of available TCP ports on a host is less than + 2**16.
However, high network speeds are not the major + contributor to this problem; the RTT is the limiting factor + in how quickly connections can be opened and closed. + Therefore, this problem will be no worse at high transfer + speeds. + + (b) Allow old duplicate segments to expire. + + To replace this function of TIME-WAIT state, a mechanism + would have to operate across connections. PAWS is defined + strictly within a single connection; the last timestamp, + TS.Recent, is kept in the connection control block, and + discarded when a connection is closed. + + An additional mechanism could be added to the TCP, a per-host + cache of the last timestamp received from any connection. + This value could then be used in the PAWS mechanism to reject + old duplicate segments from earlier incarnations of the + connection, if the timestamp clock can be guaranteed to have + ticked at least once since the old connection was open. This + would require that the TIME-WAIT delay plus the RTT together + must be at least one tick of the sender's timestamp clock. + Such an extension is not part of the proposal of this RFC. + + Note that this is a variant on the mechanism proposed by + Garlick, Rom, and Postel [Garlick77], which required each + host to maintain connection records containing the highest + sequence numbers on every connection. Using timestamps + instead, it is only necessary to keep one quantity per remote + host, regardless of the number of simultaneous connections to + that host. + + +APPENDIX C: CHANGES FROM RFC-1072, RFC-1185 + + The protocol extensions defined in this document differ in several + important ways from those defined in RFC-1072 and RFC-1185. + + (a) SACK has been deferred to a later memo.
+ + (b) The detailed rules for sending timestamp replies (see Section + 3.4) differ in important ways. The earlier rules could result + in an under-estimate of the RTT in certain cases (packets + dropped or out of order). + + (c) The same value TS.Recent is now shared by the two distinct + mechanisms RTTM and PAWS. This simplification became possible + because of change (b). + + (d) An ambiguity in RFC-1185 was resolved in favor of putting + timestamps on ACK as well as data segments. This supports the + symmetry of the underlying TCP protocol. + + (e) The echo and echo reply options of RFC-1072 were combined into a + single Timestamps option, to reflect the symmetry and to + simplify processing. + + (f) The problem of outdated timestamps on long-idle connections, + discussed in Section 4.2.3, was realized and resolved. + + (g) RFC-1185 recommended that header prediction take precedence over + the timestamp check. Based upon some scepticism about the + probabilistic arguments given in Section 4.2.4, it was decided + to recommend that the timestamp check be performed first. + + (h) The spec was modified so that the extended options will be sent + on <SYN,ACK> segments only when they are received in the + corresponding <SYN> segments. This provides the most + conservative possible conditions for interoperation with + implementations without the extensions. + + In addition to these substantive changes, the present RFC attempts to + specify the algorithms unambiguously by presenting modifications to + the Event Processing rules of RFC-793; see Appendix E. + + +APPENDIX D: SUMMARY OF NOTATION + + The following notation has been used in this document. + + Options + + WSopt: TCP Window Scale Option + TSopt: TCP Timestamps Option + + Option Fields + + shift.cnt: Window scale byte in WSopt. + TSval: 32-bit Timestamp Value field in TSopt.
+ TSecr: 32-bit Timestamp Echo Reply field in TSopt. + + Option Fields in Current Segment + + SEG.TSval: TSval field from TSopt in current segment. + SEG.TSecr: TSecr field from TSopt in current segment. + SEG.WSopt: 8-bit value in WSopt + + Clock Values + + my.TSclock: Local source of 32-bit timestamp values + my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec). + + Per-Connection State Variables + + TS.Recent: Latest received Timestamp + Last.ACK.sent: Last ACK field sent + + Snd.TS.OK: 1-bit flag + Snd.WS.OK: 1-bit flag + + Rcv.Wind.Scale: Receive window scale power + Snd.Wind.Scale: Send window scale power + + + + + + + + + + + + + + + +Jacobson, Braden, & Borman [Page 31] + +RFC 1323 TCP Extensions for High Performance May 1992 + + +APPENDIX E: EVENT PROCESSING + + +Event Processing + + OPEN Call + + ... + An initial send sequence number (ISS) is selected. Send a SYN + segment of the form: + + <SEQ=ISS><CTL=SYN><TSval=my.TSclock><WSopt=Rcv.Wind.Scale> + + ... + + SEND Call + + CLOSED STATE (i.e., TCB does not exist) + + ... + + LISTEN STATE + + If the foreign socket is specified, then change the connection + from passive to active, select an ISS. Send a SYN segment + containing the options: <TSval=my.TSclock> and + <WSopt=Rcv.Wind.Scale>. Set SND.UNA to ISS, SND.NXT to ISS+1. + Enter SYN-SENT state. ... + + SYN-SENT STATE + SYN-RECEIVED STATE + + ... + + ESTABLISHED STATE + CLOSE-WAIT STATE + + Segmentize the buffer and send it with a piggybacked + acknowledgment (acknowledgment value = RCV.NXT). ... + + If the urgent flag is set ... + + If the Snd.TS.OK flag is set, then include the TCP Timestamps + option <TSval=my.TSclock,TSecr=TS.Recent> in each data segment. + + Scale the receive window for transmission in the segment header: + + SEG.WND = (RCV.WND >> Rcv.Wind.Scale). + + + +Jacobson, Braden, & Borman [Page 32] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + SEGMENT ARRIVES + + ... + + If the state is LISTEN then + + first check for an RST + + ... + + second check for an ACK + + ... + + third check for a SYN + + if the SYN bit is set, check the security. If the ... + + ...
+ + If the SEG.PRC is less than the TCB.PRC then continue. + + Check for a Window Scale option (WSopt); if one is found, save + SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on. + Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to zero + and clear Snd.WS.OK flag. + + Check for a TSopt option; if one is found, save SEG.TSval in the + variable TS.Recent and turn on the Snd.TS.OK bit. + + Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other + control or text should be queued for processing later. ISS + should be selected and a SYN segment sent of the form: + + <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> + + If the Snd.WS.OK bit is on, include a WSopt option + <WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK bit is + on, include a TSopt <TSval=my.TSclock,TSecr=TS.Recent> in this + segment. Last.ACK.sent is set to RCV.NXT. + + SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection + state should be changed to SYN-RECEIVED. Note that any other + incoming control or data (combined with SYN) will be processed + in the SYN-RECEIVED state, but processing of SYN and ACK should + not be repeated. If the listen was not fully specified (i.e., + the foreign socket was not fully specified), then the + unspecified fields should be filled in now. + + + +Jacobson, Braden, & Borman [Page 33] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + fourth other text or control + + ... + + If the state is SYN-SENT then + + first check the ACK bit + + ... + + fourth check the SYN bit + + ... + + If the SYN bit is on and the security/compartment and precedence + are acceptable then, RCV.NXT is set to SEG.SEQ+1, IRS is set to + SEG.SEQ, and any segments on the retransmission queue + which are thereby acknowledged should be removed. + + Check for a Window Scale option (WSopt); if one is found, save + SEG.WSopt in Snd.Wind.Scale; otherwise, set both Snd.Wind.Scale + and Rcv.Wind.Scale to zero. + + Check for a TSopt option; if one is found, save SEG.TSval in + variable TS.Recent and turn on the Snd.TS.OK bit in the + connection control block.
If the ACK bit is set, use my.TSclock + - SEG.TSecr as the initial RTT estimate. + + If SND.UNA > ISS (our SYN has been ACKed), change the connection + state to ESTABLISHED, form an ACK segment: + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + and send it. If the Snd.TS.OK bit is on, include a TSopt + option <TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment. + Last.ACK.sent is set to RCV.NXT. + + Data or controls which were queued for transmission may be + included. If there are other controls or text in the segment + then continue processing at the sixth step below where the URG + bit is checked, otherwise return. + + Otherwise enter SYN-RECEIVED, form a SYN,ACK segment: + + <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> + + and send it. If the Snd.TS.OK bit is on, include a TSopt + option <TSval=my.TSclock,TSecr=TS.Recent> in this segment. If + + + +Jacobson, Braden, & Borman [Page 34] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + the Snd.WS.OK bit is on, include a WSopt option + <WSopt=Rcv.Wind.Scale> in this segment. Last.ACK.sent is set to + RCV.NXT. + + If there are other controls or text in the segment, queue them + for processing after the ESTABLISHED state has been reached, + return. + + fifth, if neither of the SYN or RST bits is set then drop the + segment and return. + + + Otherwise, + + First, check sequence number + + SYN-RECEIVED STATE + ESTABLISHED STATE + FIN-WAIT-1 STATE + FIN-WAIT-2 STATE + CLOSE-WAIT STATE + CLOSING STATE + LAST-ACK STATE + TIME-WAIT STATE + + Segments are processed in sequence. Initial tests on arrival + are used to discard old duplicates, but further processing is + done in SEG.SEQ order. If a segment's contents straddle the + boundary between old and new, only the new parts should be + processed. + + Rescale the received window field: + + TrueWindow = SEG.WND << Snd.Wind.Scale, + + and use "TrueWindow" in place of SEG.WND in the following steps. + + Check whether the segment contains a Timestamps option and bit + Snd.TS.OK is on.
If so: + + If SEG.TSval < TS.Recent, then test whether connection has + been idle less than 24 days; if both are true, then the + segment is not acceptable; follow steps below for an + unacceptable segment. + + If SEG.SEQ is equal to Last.ACK.sent, then save SEG.TSval in + variable TS.Recent. + + + + +Jacobson, Braden, & Borman [Page 35] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + There are four cases for the acceptability test for an incoming + segment: + + ... + + If an incoming segment is not acceptable, an acknowledgment + should be sent in reply (unless the RST bit is set, if so drop + the segment and return): + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + If the Snd.TS.OK bit is on, include the Timestamps option + <TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment. Set + Last.ACK.sent to SEG.ACK of the acknowledgment and send the ACK + segment. After sending the acknowledgment, drop the + unacceptable segment and return. + + ... + + fifth check the ACK field. + + if the ACK bit is off drop the segment and return. + + if the ACK bit is on + + ... + + ESTABLISHED STATE + + If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <- SEG.ACK. + Also compute a new estimate of round-trip time. If Snd.TS.OK + bit is on, use my.TSclock - SEG.TSecr; otherwise use the + elapsed time since the first segment in the retransmission + queue was sent. Any segments on the retransmission queue + which are thereby entirely acknowledged... + + ... + + Seventh, process the segment text. + + ESTABLISHED STATE + FIN-WAIT-1 STATE + FIN-WAIT-2 STATE + + ... + + Send an acknowledgment of the form: + + + +Jacobson, Braden, & Borman [Page 36] + +RFC 1323 TCP Extensions for High Performance May 1992 + + + <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> + + If the Snd.TS.OK bit is on, include Timestamps option + <TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment. Set + Last.ACK.sent to SEG.ACK of the acknowledgment, and send it. + This acknowledgment should be piggy-backed on a segment being + transmitted if possible without incurring undue delay. + + + ...
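   Illustrative sketch (not part of the RFC text): the receive-side
   steps of Appendix E above (rescaling the window, the PAWS timestamp
   test, updating TS.Recent, and taking an RTT sample) could be written
   roughly as follows in C.  The struct layout and the function names
   ts_before, true_window, paws_accept and rtt_sample are hypothetical,
   chosen only to mirror the notation of Appendix D.  The sketch
   assumes that timestamps are compared in modular 32-bit arithmetic,
   that the value saved into TS.Recent is SEG.TSval, and that the
   caller supplies the "idle less than 24 days" determination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; names mirror Appendix D. */
struct tcb {
    uint32_t ts_recent;      /* TS.Recent: latest received timestamp */
    uint32_t last_ack_sent;  /* Last.ACK.sent: last ACK field sent   */
    uint8_t  snd_wind_scale; /* Snd.Wind.Scale                       */
    uint8_t  rcv_wind_scale; /* Rcv.Wind.Scale                       */
    bool     snd_ts_ok;      /* Snd.TS.OK                            */
};

/* Assumption: "a < b" on timestamps means modular 32-bit comparison. */
static bool ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* TrueWindow = SEG.WND << Snd.Wind.Scale (shift count is at most 14). */
static uint32_t true_window(const struct tcb *t, uint16_t seg_wnd)
{
    assert(t->snd_wind_scale <= 14);
    return (uint32_t)seg_wnd << t->snd_wind_scale;
}

/*
 * Timestamp check on an arriving segment that carries a TSopt.
 * Returns false when the segment is not acceptable; the caller would
 * then send an ACK and drop the segment.
 */
static bool paws_accept(struct tcb *t, uint32_t seg_seq,
                        uint32_t seg_tsval, bool idle_under_24_days)
{
    if (!t->snd_ts_ok)
        return true;                 /* timestamps not negotiated */
    if (ts_before(seg_tsval, t->ts_recent) && idle_under_24_days)
        return false;                /* old duplicate: reject     */
    if (seg_seq == t->last_ack_sent)
        t->ts_recent = seg_tsval;    /* save for the next TSecr   */
    return true;
}

/* RTT sample in timestamp clock ticks: my.TSclock - SEG.TSecr. */
static uint32_t rtt_sample(uint32_t my_tsclock, uint32_t seg_tsecr)
{
    return my_tsclock - seg_tsecr;
}
```

   A caller would run paws_accept() before the ordinary sequence-number
   acceptability test and feed rtt_sample() into its RTT estimator only
   for segments that advance SND.UNA.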
+ + +Security Considerations + + Security issues are not discussed in this memo. + +Authors' Addresses + + Van Jacobson + University of California + Lawrence Berkeley Laboratory + Mail Stop 46A + Berkeley, CA 94720 + + Phone: (415) 486-6411 + EMail: van@CSAM.LBL.GOV + + + Bob Braden + University of Southern California + Information Sciences Institute + 4676 Admiralty Way + Marina del Rey, CA 90292 + + Phone: (310) 822-1511 + EMail: Braden@ISI.EDU + + + Dave Borman + Cray Research + 655-E Lone Oak Drive + Eagan, MN 55121 + + Phone: (612) 683-5571 + Email: dab@cray.com + + + + + + +Jacobson, Braden, & Borman [Page 37] + \ No newline at end of file -- cgit v1.2.3