authorThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committerThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
treee3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8985.txt
parentea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8985.txt')
-rw-r--r--  doc/rfc/rfc8985.txt  1531
1 files changed, 1531 insertions, 0 deletions
diff --git a/doc/rfc/rfc8985.txt b/doc/rfc/rfc8985.txt
new file mode 100644
index 0000000..98532ea
--- /dev/null
+++ b/doc/rfc/rfc8985.txt
@@ -0,0 +1,1531 @@
+
+
+
+
+Internet Engineering Task Force (IETF) Y. Cheng
+Request for Comments: 8985 N. Cardwell
+Category: Standards Track N. Dukkipati
+ISSN: 2070-1721 P. Jha
+ Google, Inc.
+ February 2021
+
+
+ The RACK-TLP Loss Detection Algorithm for TCP
+
+Abstract
+
+ This document presents the RACK-TLP loss detection algorithm for TCP.
+ RACK-TLP uses per-segment transmit timestamps and selective
+ acknowledgments (SACKs) and has two parts. Recent Acknowledgment
+ (RACK) starts fast recovery quickly using time-based inferences
+ derived from acknowledgment (ACK) feedback, and Tail Loss Probe (TLP)
+ leverages RACK and sends a probe packet to trigger ACK feedback to
+ avoid retransmission timeout (RTO) events. Compared to the widely
+ used duplicate acknowledgment (DupAck) threshold approach, RACK-TLP
+ detects losses more efficiently when there are application-limited
+ flights of data, lost retransmissions, or data packet reordering
+ events. It is intended to be an alternative to the DupAck threshold
+ approach.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ https://www.rfc-editor.org/info/rfc8985.
+
+Copyright Notice
+
+ Copyright (c) 2021 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (https://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction
+ 1.1. Background
+ 1.2. Motivation
+ 2. Terminology
+ 3. RACK-TLP High-Level Design
+ 3.1. RACK: Time-Based Loss Inferences from ACKs
+ 3.2. TLP: Sending One Segment to Probe Losses Quickly with RACK
+ 3.3. RACK-TLP: Reordering Resilience with a Time Threshold
+ 3.3.1. Reordering Design Rationale
+ 3.3.2. Reordering Window Adaptation
+ 3.4. An Example of RACK-TLP in Action: Fast Recovery
+ 3.5. An Example of RACK-TLP in Action: RTO
+ 3.6. Design Summary
+ 4. Requirements
+ 5. Definitions
+ 5.1. Terms
+ 5.2. Per-Segment Variables
+ 5.3. Per-Connection Variables
+ 5.4. Per-Connection Timers
+ 6. RACK Algorithm Details
+ 6.1. Upon Transmitting a Data Segment
+ 6.2. Upon Receiving an ACK
+ 6.3. Upon RTO Expiration
+ 7. TLP Algorithm Details
+ 7.1. Initializing State
+ 7.2. Scheduling a Loss Probe
+ 7.3. Sending a Loss Probe upon PTO Expiration
+ 7.4. Detecting Losses Using the ACK of the Loss Probe
+ 7.4.1. General Case: Detecting Packet Losses Using RACK
+ 7.4.2. Special Case: Detecting a Single Loss Repaired by the
+ Loss Probe
+ 8. Managing RACK-TLP Timers
+ 9. Discussion
+ 9.1. Advantages and Disadvantages
+ 9.2. Relationships with Other Loss Recovery Algorithms
+ 9.3. Interaction with Congestion Control
+ 9.4. TLP Recovery Detection with Delayed ACKs
+ 9.5. RACK-TLP for Other Transport Protocols
+ 10. Security Considerations
+ 11. IANA Considerations
+ 12. References
+ 12.1. Normative References
+ 12.2. Informative References
+ Acknowledgments
+ Authors' Addresses
+
+1. Introduction
+
+ This document presents RACK-TLP, a TCP loss detection algorithm that
+ improves upon the widely implemented duplicate acknowledgment
+ (DupAck) counting approach described in [RFC5681] and [RFC6675]; it
+ is RECOMMENDED as an alternative to that earlier approach. RACK-TLP
+ has two parts. Recent Acknowledgment (RACK) detects losses quickly
+ using time-based inferences derived from ACK feedback. Tail Loss
+ Probe (TLP) triggers ACK feedback by quickly sending a probe segment
+ to avoid retransmission timeout (RTO) events.
+
+1.1. Background
+
+ In traditional TCP loss recovery algorithms [RFC5681] [RFC6675], a
+ sender starts fast recovery when the number of DupAcks received
+ reaches a threshold (DupThresh) that defaults to 3 (this approach is
+ referred to as "DupAck counting" in the rest of the document). The
+ sender also halves the congestion window during the recovery. The
+ rationale behind the partial window reduction is that congestion does
+ not seem severe since ACK clocking is still maintained. The time
+ elapsed in fast recovery can be just one round trip, e.g., if the
+ sender uses SACK-based recovery [RFC6675] and the number of lost
+ segments is small.
+
+ If fast recovery is not triggered or is triggered but fails to repair
+ all the losses, then the sender resorts to RTO recovery. The RTO
+ timer interval is conservatively the smoothed RTT (SRTT) plus four
+ times the RTT variation, and is lower bounded to 1 second [RFC6298].
+ Upon RTO timer expiration, the sender retransmits the first
+ unacknowledged segment and resets the congestion window to the loss
+ window value (by default, 1 full-sized segment [RFC5681]). The
+ rationale behind the congestion window reset is that an entire flight
+ of data and the ACK clock were lost, so this deserves a cautious
+ response. The sender then retransmits the rest of the data following
+ the slow start algorithm [RFC5681]. The time elapsed in RTO recovery
+ is one RTO interval plus the number of round trips needed to repair
+ all the losses.
+
+1.2. Motivation
+
+ Fast recovery is the preferred form of loss recovery because it can
+ potentially recover all losses in the timescale of a single round
+ trip, with only a fractional congestion window reduction. RTO
+ recovery and congestion window reset should ideally be the last
+ resort and should ideally be used only when the entire flight is
+ lost. However, in addition to losing an entire flight of data, the
+ following situations can unnecessarily resort to RTO recovery with
+ traditional TCP loss recovery algorithms [RFC5681] [RFC6675]:
+
+ 1. Packet drops for short flows or at the end of an application data
+ flight. When the sender is limited by the application (e.g.,
+ structured request/response traffic), segments lost at the end of
+ the application data transfer often can only be recovered by RTO.
+ Consider an example where only the last segment in a flight of
+ 100 segments is lost. Lacking any DupAck, the sender RTO
+ expires, reduces the congestion window to 1, and raises the
+ congestion window to just 2 after the loss repair is
+ acknowledged. In contrast, any single segment loss occurring
+ between the first and the 97th segment would result in fast
+ recovery, which would only cut the window in half.
+
+ 2. Lost retransmissions. Heavy congestion or traffic policers can
+ cause retransmissions to be lost. Lost retransmissions cause a
+ resort to RTO recovery since DupAck counting does not detect the
+ loss of the retransmissions. Then the slow start after RTO
+ recovery could cause burst losses again, which severely degrades
+ performance [POLICER16].
+
+ 3. Packet reordering. In this document, "reordering" refers to the
+ events where segments are delivered at the TCP receiver in a
+ chronological order different from their chronological
+ transmission order. Link-layer protocols (e.g., 802.11 block
+ ACK), link bonding, or routers' internal load balancing (e.g.,
+ ECMP) can deliver TCP segments out of order. The degree of such
+ reordering is usually within the order of the path round-trip
+ time. If the reordering degree is beyond DupThresh, DupAck
+ counting can cause a spurious fast recovery and unnecessary
+ congestion window reduction. To mitigate the issue, Non-
+ Congestion Robustness (NCR) for TCP [RFC4653] increases the
+ DupThresh from the current fixed value of three duplicate ACKs
+ [RFC5681] to approximate a congestion window of data having left
+ the network.
+
+2. Terminology
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
+ "OPTIONAL" in this document are to be interpreted as described in
+ BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
+ capitals, as shown here.
+
+3. RACK-TLP High-Level Design
+
+ RACK-TLP allows senders to recover losses more effectively in all
+ three scenarios described in the previous section. There are two
+ design principles behind RACK-TLP. The first principle is to detect
+ losses via ACK events as much as possible, to repair losses at round-
+ trip timescales. The second principle is to gently probe the network
+ to solicit additional ACK feedback, to avoid RTO expiration and
+ subsequent congestion window reset. At a high level, the two
+ principles are implemented in RACK and TLP, respectively.
+
+3.1. RACK: Time-Based Loss Inferences from ACKs
+
+ The rationale behind RACK is that if a segment is delivered out of
+ order, then the segments sent chronologically before that were either
+ lost or reordered. This concept is not fundamentally different from
+ those described in [RFC5681], [RFC6675], or [FACK]. RACK's key
+ innovation is using per-segment transmission timestamps and widely
+ deployed SACK [RFC2018] options to conduct time-based inferences
+ instead of inferring losses by counting ACKs or SACKed sequences.
+ Time-based inferences are more robust than DupAck counting approaches
+ because they do not depend on flight size and thus are effective for
+ application-limited traffic.
+
+ Conceptually, RACK keeps a virtual timer for every data segment sent
+ (including retransmissions). Each timer expires dynamically based on
+ the latest RTT measurements plus an additional delay budget to
+ accommodate potential packet reordering (called the "reordering
+ window"). When a segment's timer expires, RACK marks the
+ corresponding segment as lost for retransmission.
+
+ In reality, as an algorithm, RACK does not arm a timer for every
+ segment sent because it's not necessary. Instead, the sender records
+ the most recent transmission time of every data segment sent,
+ including retransmissions. For each ACK received, the sender
+ calculates the latest RTT measurement (if eligible) and adjusts the
+ expiration time of every segment sent but not yet delivered. If a
+ segment has expired, RACK marks it as lost.
+
+ Since the time-based logic of RACK applies equally to retransmissions
+ and original transmissions, it can detect lost retransmissions as
+ well. If a segment has been retransmitted but its most recent
+ (re)transmission timestamp has expired, then, after a reordering
+ window, it's marked as lost.
+
+3.2. TLP: Sending One Segment to Probe Losses Quickly with RACK
+
+ RACK infers losses from ACK feedback; however, in some cases, ACKs
+ are sparse, particularly when the inflight is small or when the
+ losses are high. In some challenging cases, the last few segments in
+ a flight are lost. With the operations described in [RFC5681] or
+ [RFC6675], the sender's RTO would expire and reset the congestion
+ window when, in reality, most of the flight has been delivered.
+
+ Consider an example where a sender with a large congestion window
+ transmits 100 new data segments after an application write and only
+ the last three segments are lost. Without RACK-TLP, the RTO expires,
+ the sender retransmits the first unacknowledged segment, and the
+ congestion window slow starts from 1. After all the retransmits are
+ acknowledged, the congestion window is increased to 4. The total
+ delivery time for this application transfer is three RTTs plus one
+ RTO, a steep cost given that only a tiny fraction of the flight was
+ lost. If instead the losses had occurred three segments sooner in
+ the flight, then fast recovery would have recovered all losses within
+ one round trip and would have avoided resetting the congestion
+ window.
+
+ Fast recovery would be preferable in such scenarios; TLP is designed
+ to trigger the feedback RACK needed to enable that. After the last
+ (100th) segment was originally sent, TLP sends the next available
+ (new) segment or retransmits the last (highest-sequenced) segment in
+ two round trips to probe the network, hence the name "Tail Loss
+ Probe". The successful delivery of the probe would solicit an ACK.
+ RACK uses this ACK to detect that the 98th and 99th segments were
+ lost, trigger fast recovery, and retransmit both successfully. The
+ total recovery time is four RTTs, and the congestion window is only
+ partially reduced instead of being fully reset. If the probe was
+ also lost, then the sender would invoke RTO recovery, resetting the
+ congestion window.
+
+3.3. RACK-TLP: Reordering Resilience with a Time Threshold
+
+3.3.1. Reordering Design Rationale
+
+ Upon receiving an ACK indicating a SACKed segment, a sender cannot
+ tell immediately whether that was a result of reordering or loss. It
+ can only distinguish between the two in hindsight if the missing
+ sequence ranges are filled in later without retransmission. Thus, a
+ loss detection algorithm needs to budget some wait time -- a
+ reordering window -- to try to disambiguate packet reordering from
+ packet loss.
+
+ The reordering window in the DupAck counting approach is implicitly
+ defined as the elapsed time to receive DupThresh SACKed segments or
+ duplicate acknowledgments. This approach is effective if the network
+ reordering degree (in sequence distance) is smaller than DupThresh
+   and at least DupThresh segments after the loss are acknowledged.  For
+ cases where the reordering degree is larger than the default
+ DupThresh of 3 packets, one alternative is to dynamically adapt
+ DupThresh based on the FlightSize (e.g., the sender adjusts the
+ DupThresh to half of the FlightSize). However, this does not work
+ well with the following two types of reordering:
+
+ 1. Application-limited flights where the last non-full-sized segment
+ is delivered first and then the remaining full-sized segments in
+ the flight are delivered in order. This reordering pattern can
+ occur when segments traverse parallel forwarding paths. In such
+ scenarios, the degree of reordering in packet distance is one
+ segment less than the flight size.
+
+ 2. A flight of segments that are delivered partially out of order.
+ One cause for this pattern is wireless link-layer retransmissions
+ with an inadequate reordering buffer at the receiver. In such
+ scenarios, the wireless sender sends the data packets in order
+ initially, but some are lost and then recovered by link-layer
+ retransmissions; the wireless receiver delivers the TCP data
+ packets in the order they are received due to the inadequate
+ reordering buffer. The random wireless transmission errors in
+ such scenarios cause the reordering degree, expressed in packet
+ distance, to have highly variable values up to the flight size.
+
+ In the above two cases, the degree of reordering in packet distance
+ is highly variable. This makes the DupAck counting approach
+ ineffective, including dynamic adaptation variants as in [RFC4653].
+ Instead, the degree of reordering in time difference in such cases is
+ usually within a single round-trip time. This is because the packets
+ either traverse disjoint paths with similar propagation delays or are
+ repaired quickly by the local access technology. Hence, using a time
+ threshold instead of a packet threshold strikes a middle ground,
+ allowing a bounded degree of reordering resilience while still
+ allowing fast recovery. This is the rationale behind the RACK-TLP
+ reordering resilience design.
+
+ Specifically, RACK-TLP introduces a new dynamic reordering window
+ parameter in time units, and the sender considers a data segment S
+ lost if both of these conditions are met:
+
+ 1. Another data segment sent later than S has been delivered.
+
+ 2. S has not been delivered after the estimated round-trip time plus
+ the reordering window.
+
+ Note that condition (1) implies at least one round trip of time has
+ elapsed since S has been sent.
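+   The two conditions can be expressed as a small Python predicate (a
+   sketch, not RFC pseudocode; the argument names are assumptions
+   rather than the state variables defined later in this document):

```python
def rack_marks_lost(seg_xmit_ts, seg_delivered, rack_xmit_ts,
                    now, rtt, reo_wnd):
    # Condition 1: a segment sent later than S has been delivered.
    later_delivery_seen = rack_xmit_ts > seg_xmit_ts
    # Condition 2: S is still undelivered after the estimated
    # round-trip time plus the reordering window.
    overdue = now - seg_xmit_ts > rtt + reo_wnd
    return (not seg_delivered) and later_delivery_seen and overdue
```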
+
+3.3.2. Reordering Window Adaptation
+
+ The RACK reordering window adapts to the measured duration of
+ reordering events within reasonable and specific bounds to
+ disincentivize excessive reordering. More specifically, the sender
+ sets the reordering window as follows:
+
+ 1. The reordering window SHOULD be set to zero if no reordering has
+ been observed on the connection so far, and either (a) three
+ segments have been SACKed since the last recovery or (b) the
+ sender is already in fast or RTO recovery. Otherwise, the
+ reordering window SHOULD start from a small fraction of the
+ round-trip time or zero if no round-trip time estimate is
+ available.
+
+ 2. The RACK reordering window SHOULD adaptively increase (using the
+ algorithm in "Step 4: Update RACK reordering window" below) if
+ the sender receives a Duplicate Selective Acknowledgment (DSACK)
+ option [RFC2883]. Receiving a DSACK suggests the sender made a
+ spurious retransmission, which may have been due to the
+ reordering window being too small.
+
+ 3. The RACK reordering window MUST be bounded, and this bound SHOULD
+ be SRTT.
+
+ Rules 2 and 3 are required to adapt to reordering caused by dynamics
+ such as the prolonged link-layer loss recovery episodes described
+ earlier. Each increase in the reordering window requires a new round
+ trip where the sender receives a DSACK; thus, depending on the extent
+ of reordering, it may take multiple round trips to fully adapt.
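+   Under the rules above, one plausible sketch of the adaptation
+   (modeled on the detailed algorithm in Section 6.2, Step 4; the
+   quarter-min-RTT step size and the multiplier handling are
+   assumptions here, not normative values from this section):

```python
def update_reo_wnd(reo_wnd_mult, min_rtt, srtt, dsack_seen):
    # Rule 2: grow the window by one step per round trip with a DSACK.
    if dsack_seen:
        reo_wnd_mult += 1
    # Start from a small fraction of the RTT; rule 3: bound by SRTT.
    reo_wnd = min(reo_wnd_mult * (min_rtt / 4), srtt)
    return reo_wnd_mult, reo_wnd
```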
+
+ For short flows, the low initial reordering window helps recover
+ losses quickly, at the risk of spurious retransmissions. The
+ rationale is that spurious retransmissions for short flows are not
+ expected to produce excessive additional network traffic. For long
+ flows, the design tolerates reordering within a round trip. This
+ handles reordering in small timescales (reordering within the round-
+ trip time of the shortest path).
+
+ However, the fact that the initial reordering window is low and the
+ reordering window's adaptive growth is bounded means that there will
+ continue to be a cost to reordering that disincentivizes excessive
+ reordering.
+
+3.4. An Example of RACK-TLP in Action: Fast Recovery
+
+ The following example in Figure 1 illustrates the RACK-TLP algorithm
+ in action:
+
+ Event TCP DATA SENDER TCP DATA RECEIVER
+ _____ ____________________________________________________________
+ 1. Send P0, P1, P2, P3 -->
+ [P1, P2, P3 dropped by network]
+
+ 2. <-- Receive P0, ACK P0
+
+ 3a. 2RTTs after (2), TLP timer fires
+ 3b. TLP: retransmits P3 -->
+
+ 4. <-- Receive P3, SACK P3
+
+ 5a. Receive SACK for P3
+ 5b. RACK: marks P1, P2 lost
+ 5c. Retransmit P1, P2 -->
+ [P1 retransmission dropped by network]
+
+ 6. <-- Receive P2, SACK P2 & P3
+
+ 7a. RACK: marks P1 retransmission lost
+ 7b. Retransmit P1 -->
+
+ 8. <-- Receive P1, ACK P3
+
+ Figure 1: RACK-TLP Protocol Example
+
+ Figure 1 illustrates a sender sending four segments (P0, P1, P2, P3)
+ and losing the last three segments. After two round trips, TLP sends
+ a loss probe, retransmitting the last segment, P3, to solicit SACK
+ feedback and restore the ACK clock (Event 3). The delivery of P3
+ enables RACK to infer (Event 5b) that P1 and P2 were likely lost
+ because they were sent before P3. The sender then retransmits P1 and
+ P2. Unfortunately, the retransmission of P1 is lost again. However,
+ the delivery of the retransmission of P2 allows RACK to infer that
+ the retransmission of P1 was likely lost (Event 7a); hence, P1 should
+ be retransmitted (Event 7b). Note that [RFC5681] mandates a
+ principle that loss in two successive windows of data or the loss of
+ a retransmission must be taken as two indications of congestion and
+ therefore results in two separate congestion control reactions.
+
+3.5. An Example of RACK-TLP in Action: RTO
+
+ In addition to enhancing fast recovery, RACK improves the accuracy of
+ RTO recovery by reducing spurious retransmissions.
+
+ Without RACK, upon RTO timer expiration, the sender marks all the
+ unacknowledged segments as lost. This approach can lead to spurious
+ retransmissions. For example, consider a simple case where one
+ segment was sent with an RTO of 1 second and then the application
+ writes more data, causing a second and third segment to be sent right
+ before the RTO of the first segment expires. Suppose none of the
+ segments were lost. Without RACK, if there is a spurious RTO, then
+ the sender marks all three segments as lost and retransmits the first
+ segment. If the ACK for the original copy of the first segment
+ arrives right after the spurious RTO retransmission, then the sender
+ continues slow start and spuriously retransmits the second and third
+ segments since it (erroneously) presumed they are lost.
+
+ With RACK, upon RTO timer expiration, the only segment automatically
+ marked as lost is the first segment (since it was sent an RTO ago);
+ for all the other segments, RACK only marks the segment as lost if at
+ least one round trip has elapsed since the segment was transmitted.
+ Consider the previous example scenario, but this time with RACK.
+ With RACK, when the RTO expires, the sender only marks the first
+ segment as lost and retransmits that segment. The other two very
+ recently sent segments are not marked as lost because they were sent
+ less than one round trip ago and there were no ACKs providing
+ evidence that they were lost. Upon receiving the ACK for the RTO
+ retransmission, the RACK sender would not yet retransmit the second
+ or third segment, but rather would re-arm the RTO timer and wait for
+ a new RTO interval to elapse before marking the second or third
+ segment as lost.
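+   A sketch of this RTO behavior in Python (illustrative only; the
+   segment representation and the exact elapsed-time test are
+   assumptions based on the description above):

```python
def rto_mark_lost(segments, snd_una, now, rtt):
    # On RTO expiration, mark the first unacknowledged segment as lost
    # unconditionally; mark any other segment as lost only if at least
    # one round trip has elapsed since it was transmitted.
    for seg in segments:
        if seg["start_seq"] == snd_una or now - seg["xmit_ts"] >= rtt:
            seg["lost"] = True
    return segments
```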
+
+3.6. Design Summary
+
+ To summarize, RACK-TLP aims to adapt to small time-varying degrees of
+ reordering, quickly recover most losses within one to two round
+ trips, and avoid costly RTO recoveries. In the presence of
+ reordering, the adaptation algorithm can impose sometimes needless
+ delays when it waits to disambiguate loss from reordering, but the
+ penalty for waiting is bounded to one round trip, and such delays are
+ confined to flows long enough to have observed reordering.
+
+4. Requirements
+
+ The reader is expected to be familiar with the definitions given in
+ the TCP congestion control [RFC5681], selective acknowledgment
+ [RFC2018], and loss recovery [RFC6675] RFCs. RACK-TLP has the
+ following requirements:
+
+ 1. The connection MUST use selective acknowledgment (SACK) options
+ [RFC2018], and the sender MUST keep SACK scoreboard information
+ on a per-connection basis ("SACK scoreboard" has the same meaning
+ here as in [RFC6675], Section 3).
+
+ 2. For each data segment sent, the sender MUST store its most recent
+ transmission time with a timestamp whose granularity is finer
+ than 1/4 of the minimum RTT of the connection. At the time of
+ writing, microsecond resolution is suitable for intra-data center
+ traffic, and millisecond granularity or finer is suitable for the
+ Internet. Note that RACK-TLP can be implemented with TSO (TCP
+ Segmentation Offload) support by having multiple segments in a
+ TSO aggregate share the same timestamp.
+
+ 3. RACK DSACK-based reordering window adaptation is RECOMMENDED but
+ is not required.
+
+ 4. TLP requires RACK.
+
+5. Definitions
+
+ The reader is expected to be familiar with the variables SND.UNA,
+ SND.NXT, SEG.ACK, and SEG.SEQ in [RFC793]; Sender Maximum Segment
+ Size (SMSS) and FlightSize in [RFC5681]; DupThresh in [RFC6675]; and
+ RTO and SRTT in [RFC6298]. A RACK-TLP implementation uses several
+ new terms and needs to store new per-segment and per-connection
+ state, described below.
+
+5.1. Terms
+
+ These terms are used to explain the variables and algorithms below:
+
+ RACK.segment
+ Among all the segments that have been either selectively or
+ cumulatively acknowledged, the term "RACK.segment" denotes the
+ segment that was sent most recently (including retransmissions).
+
+ RACK.ack_ts
+ Denotes the time when the full sequence range of RACK.segment was
+ selectively or cumulatively acknowledged.
+
+5.2. Per-Segment Variables
+
+ These variables indicate the status of the most recent transmission
+ of a data segment:
+
+ Segment.lost
+ True if the most recent (re)transmission of the segment has been
+ marked as lost and needs to be retransmitted. False otherwise.
+
+ Segment.retransmitted
+ True if the segment has ever been retransmitted. False otherwise.
+
+ Segment.xmit_ts
+ The time of the last transmission of a data segment, including
+ retransmissions, if any, with a clock granularity specified in the
+ "Requirements" section. A maximum value INFINITE_TS indicates an
+ invalid timestamp that represents that the segment is not
+ currently in flight.
+
+ Segment.end_seq
+ The next sequence number after the last sequence number of the
+ data segment.
+
+5.3. Per-Connection Variables
+
+ RACK.xmit_ts
+ The latest transmission timestamp of RACK.segment.
+
+ RACK.end_seq
+ The Segment.end_seq of RACK.segment.
+
+ RACK.segs_sacked
+      The total number of segments selectively acknowledged in
+ the SACK scoreboard.
+
+ RACK.fack
+ The highest selectively or cumulatively acknowledged sequence
+ (i.e., forward acknowledgment).
+
+ RACK.min_RTT
+ The estimated minimum round-trip time (RTT) of the connection.
+
+ RACK.rtt
+ The RTT of the most recently delivered segment on the connection
+ (either cumulatively acknowledged or selectively acknowledged)
+ that was not marked as invalid as a possible spurious
+ retransmission.
+
+ RACK.reordering_seen
+ Indicates whether the sender has detected data segment reordering
+ event(s).
+
+ RACK.reo_wnd
+ A reordering window computed in the unit of time used for
+ recording segment transmission times. It is used to defer the
+ moment at which RACK marks a segment as lost.
+
+ RACK.dsack_round
+ Indicates if a DSACK option has been received in the latest round
+ trip.
+
+ RACK.reo_wnd_mult
+ The multiplier applied to adjust RACK.reo_wnd.
+
+ RACK.reo_wnd_persist
+ The number of loss recoveries before resetting RACK.reo_wnd.
+
+ TLP.is_retrans
+ A boolean indicating whether there is an unacknowledged TLP
+ retransmission.
+
+ TLP.end_seq
+ The value of SND.NXT at the time of sending a TLP probe.
+
+   TLP.max_ack_delay
+ The sender's budget for the maximum delayed ACK interval.
+
+5.4. Per-Connection Timers
+
+ RACK reordering timer
+ A timer that allows RACK to wait for reordering to resolve in
+ order to try to disambiguate reordering from loss when some
+ segments are marked as SACKed.
+
+ TLP PTO
+ A timer event indicating that an ACK is overdue and the sender
+ should transmit a TLP segment to solicit SACK or ACK feedback.
+
+ These timers augment the existing timers maintained by a sender,
+ including the RTO timer [RFC6298]. A RACK-TLP sender arms one of
+ these three timers -- RACK reordering timer, TLP PTO timer, or RTO
+ timer -- when it has unacknowledged segments in flight. The
+ implementation can simplify managing all three timers by multiplexing
+ a single timer among them with an additional variable to indicate the
+ event to invoke upon the next timer expiration.
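+   The multiplexing suggestion can be sketched as a single timer plus
+   an event tag (illustrative Python; the event names are assumptions):

```python
class RackTlpTimer:
    # One timer multiplexed among the RACK reordering timer, the TLP
    # PTO timer, and the RTO timer; 'event' records which action to
    # invoke when the deadline passes.
    def __init__(self):
        self.deadline = None
        self.event = None  # "REORDER", "PTO", or "RTO"

    def arm(self, now, delay, event):
        self.deadline = now + delay
        self.event = event

    def fire(self, now):
        # Return the pending event if the deadline has passed,
        # clearing the timer; otherwise return None.
        if self.deadline is not None and now >= self.deadline:
            event, self.event, self.deadline = self.event, None, None
            return event
        return None
```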
+
+6. RACK Algorithm Details
+
+6.1. Upon Transmitting a Data Segment
+
+ Upon transmitting a new segment or retransmitting an old segment,
+ record the time in Segment.xmit_ts and set Segment.lost to FALSE.
+ Upon retransmitting a segment, set Segment.retransmitted to TRUE.
+
+ RACK_transmit_new_data(Segment):
+ Segment.xmit_ts = Now()
+ Segment.lost = FALSE
+
+ RACK_retransmit_data(Segment):
+ Segment.retransmitted = TRUE
+ Segment.xmit_ts = Now()
+ Segment.lost = FALSE
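+   These routines can be mirrored in runnable Python (a sketch; the
+   Segment class follows the per-segment variables in Section 5.2, and
+   the helper names are illustrative):

```python
from dataclasses import dataclass

INFINITE_TS = float("inf")  # invalid timestamp: segment not in flight

@dataclass
class Segment:
    end_seq: int
    xmit_ts: float = INFINITE_TS
    lost: bool = False
    retransmitted: bool = False

def transmit_new_data(seg, now):
    # Mirrors RACK_transmit_new_data above.
    seg.xmit_ts = now
    seg.lost = False

def retransmit_data(seg, now):
    # Mirrors RACK_retransmit_data above.
    seg.retransmitted = True
    seg.xmit_ts = now
    seg.lost = False
```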
+
+6.2. Upon Receiving an ACK
+
+ Step 1: Update RACK.min_RTT.
+
+ Use the RTT measurements obtained via [RFC6298] or [RFC7323] to
+ update the estimated minimum RTT in RACK.min_RTT. The sender SHOULD
+ track a windowed min-filtered estimate of recent RTT measurements
+ that can adapt when migrating to significantly longer paths rather
+ than tracking a simple global minimum of all RTT measurements.
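+   One way to realize such a windowed min filter (a sketch; real stacks
+   commonly filter over a time window rather than the fixed sample
+   count assumed here):

```python
from collections import deque

class WindowedMinRTT:
    # Track the minimum RTT over the most recent samples so the
    # estimate can rise again after migrating to a longer path,
    # unlike a simple global minimum.
    def __init__(self, window=100):
        self.samples = deque(maxlen=window)

    def update(self, rtt):
        self.samples.append(rtt)
        return min(self.samples)
```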
+
+ Step 2: Update the state for the most recently sent segment that has
+ been delivered.
+
+ In this step, RACK updates the state that tracks the most recently
+ sent segment that has been delivered: RACK.segment. RACK maintains
+ its latest transmission timestamp in RACK.xmit_ts and its highest
+ sequence number in RACK.end_seq. These two variables are used in
+ later steps to estimate if some segments not yet delivered were
+ likely lost. Given the information provided in an ACK, each segment
+ cumulatively ACKed or SACKed is marked as delivered in the
+ scoreboard. Because an ACK can also acknowledge retransmitted data
+ segments and because retransmissions can be spurious, the sender
+ needs to take care to avoid spurious inferences. For example, if the
+ sender were to use timing information from a spurious retransmission,
+ the RACK.rtt could be vastly underestimated.
+
+ To avoid spurious inferences, ignore a segment as invalid if any of
+ its sequence range has been retransmitted before and if either of two
+ conditions is true:
+
+ 1. The Timestamp Echo Reply field (TSecr) of the ACK's timestamp
+ option [RFC7323], if available, indicates the ACK was not
+ acknowledging the last retransmission of the segment.
+
+ 2. The segment was last retransmitted less than RACK.min_rtt ago.
+
+ The second check is a heuristic when the TCP Timestamp option is not
+ available or when the round-trip time is less than the TCP Timestamp
+ clock granularity.
+
+ Among all the segments newly ACKed or SACKed by this ACK that pass
+ the checks above, update the RACK.rtt to be the RTT sample calculated
+ using this ACK. Furthermore, record the most recent Segment.xmit_ts
+ in RACK.xmit_ts if it is ahead of RACK.xmit_ts. If Segment.xmit_ts
+ equals RACK.xmit_ts (e.g., due to clock granularity limits), then
+ compare Segment.end_seq and RACK.end_seq to break the tie when
+ deciding whether to update the RACK.segment's associated state.
+
+ Step 2 may be summarized in pseudocode as:
+
+ RACK_sent_after(t1, seq1, t2, seq2):
+ If t1 > t2:
+ Return true
+ Else if t1 == t2 AND seq1 > seq2:
+ Return true
+ Else:
+ Return false
+
+ RACK_update():
+ For each Segment newly acknowledged, cumulatively or selectively,
+ in ascending order of Segment.xmit_ts:
+ rtt = Now() - Segment.xmit_ts
+ If Segment.retransmitted is TRUE:
+ If ACK.ts_option.echo_reply < Segment.xmit_ts:
+ Continue
+ If rtt < RACK.min_rtt:
+ Continue
+
+ RACK.rtt = rtt
+         If RACK_sent_after(Segment.xmit_ts, Segment.end_seq,
+                            RACK.xmit_ts, RACK.end_seq):
+ RACK.xmit_ts = Segment.xmit_ts
+ RACK.end_seq = Segment.end_seq
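+   The pseudocode above can be exercised as a runnable Python
+   translation (illustrative only; the dictionary representations of
+   segments and RACK state, and the tsecr argument standing in for the
+   Timestamp Echo Reply, are assumptions):

```python
def rack_sent_after(t1, seq1, t2, seq2):
    # Later transmission time wins; ties break on ending sequence.
    return t1 > t2 or (t1 == t2 and seq1 > seq2)

def rack_update(rack, newly_acked, now, tsecr=None):
    # Skip RTT samples that may come from spurious retransmissions,
    # then advance RACK state to the most recently sent delivered
    # segment.
    for seg in sorted(newly_acked, key=lambda s: s["xmit_ts"]):
        rtt = now - seg["xmit_ts"]
        if seg["retransmitted"]:
            if tsecr is not None and tsecr < seg["xmit_ts"]:
                continue
            if rtt < rack["min_rtt"]:
                continue
        rack["rtt"] = rtt
        if rack_sent_after(seg["xmit_ts"], seg["end_seq"],
                           rack["xmit_ts"], rack["end_seq"]):
            rack["xmit_ts"] = seg["xmit_ts"]
            rack["end_seq"] = seg["end_seq"]
    return rack
```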
+
+ Step 3: Detect data segment reordering.
+
+ To detect reordering, the sender looks for original data segments
+ being delivered out of order. To detect such cases, the sender
+ tracks the highest sequence selectively or cumulatively acknowledged
+ in the RACK.fack variable. ".fack" stands for the most "Forward ACK"
+ (this term is adopted from [FACK]). If a never-retransmitted segment
+ that's below RACK.fack is (selectively or cumulatively) acknowledged,
+ it has been delivered out of order. The sender sets
+ RACK.reordering_seen to TRUE if such a segment is identified.
+
+ RACK_detect_reordering():
+ For each Segment newly acknowledged, cumulatively or selectively,
+ in ascending order of Segment.end_seq:
+ If Segment.end_seq > RACK.fack:
+ RACK.fack = Segment.end_seq
+ Else if Segment.end_seq < RACK.fack AND
+ Segment.retransmitted is FALSE:
+ RACK.reordering_seen = TRUE
+
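+ A non-normative Python sketch of this step, with segments again
+ modeled as hypothetical dicts:
+
```python
def rack_detect_reordering(rack, newly_acked):
    """Set rack['reordering_seen'] when a never-retransmitted segment
    is (S)ACKed below the forward-most sequence seen so far."""
    for seg in sorted(newly_acked, key=lambda s: s['end_seq']):
        if seg['end_seq'] > rack['fack']:
            rack['fack'] = seg['end_seq']
        elif seg['end_seq'] < rack['fack'] and not seg['retransmitted']:
            rack['reordering_seen'] = True
```
+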
+ Step 4: Update the RACK reordering window.
+
+ The RACK reordering window, RACK.reo_wnd, serves as an adaptive
+ allowance for settling time before marking a segment as lost. This
+ step documents a detailed algorithm that follows the principles
+ outlined in the "Reordering Window Adaptation" section.
+
+ If no reordering has been observed based on the previous step, then
+ one way the sender can enter fast recovery is when the number of
+ SACKed segments matches or exceeds DupThresh (similar to [RFC6675]).
+ Furthermore, when no reordering has been observed, the RACK.reo_wnd
+ is set to 0 both upon entering and during fast recovery or RTO
+ recovery.
+
+ Otherwise, if some reordering has been observed, then RACK does not
+ trigger fast recovery based on DupThresh.
+
+ Whether or not reordering has been observed, RACK uses the reordering
+ window to assess whether any segments can be marked as lost. As a
+ consequence, the sender also enters fast recovery when there are any
+ number of SACKed segments, as long as the reorder window has passed
+ for some non-SACKed segments.
+
+ When the reordering window is not set to 0, it starts with a
+ conservative RACK.reo_wnd of RACK.min_RTT/4. This value was chosen
+ because Linux TCP used the same factor in its implementation to delay
+ Early Retransmit [RFC5827] to reduce spurious loss detections in the
+ presence of reordering, and experience showed this worked reasonably
+ well [DMCG11].
+
+ However, the reordering detection in the previous step, Step 3, has a
+ self-reinforcing drawback when the reordering window is too small to
+ cope with the actual reordering. When that happens, RACK could
+ spuriously mark reordered segments as lost, causing them to be
+ retransmitted. In turn, the retransmissions can prevent the
+ necessary conditions for Step 3 to detect reordering since this
+ mechanism requires ACKs or SACKs only for segments that have never
+ been retransmitted. In some cases, such scenarios can persist,
+ causing RACK to continue to spuriously mark segments as lost without
+ realizing the reordering window is too small.
+
+ To avoid the issue above, RACK dynamically adapts to higher degrees
+ of reordering using DSACK options from the receiver. Receiving an
+ ACK with a DSACK option indicates a possible spurious retransmission,
+ suggesting that RACK.reo_wnd may be too small. The RACK.reo_wnd
+ increases linearly for every round trip in which the sender receives
+ some DSACK option so that after N round trips in which a DSACK is
+ received, the RACK.reo_wnd becomes (N+1) * min_RTT / 4, with an
+ upper-bound of SRTT.
+
+ If the reordering is temporary, then a large adapted reordering
+ window would unnecessarily delay loss recovery later. Therefore,
+ RACK persists using the inflated RACK.reo_wnd for up to 16 loss
+ recoveries, after which it resets RACK.reo_wnd to its starting value,
+ min_RTT / 4. The downside of resetting the reordering window is the
+ risk of triggering spurious fast recovery episodes if the reordering
+ remains high. The rationale for this approach is to bound such
+ spurious recoveries to approximately once every 16 recoveries (less
+ than 7%).
+
+ To track the linear scaling factor for the adaptive reordering
+ window, RACK uses the variable RACK.reo_wnd_mult, which is
+ initialized to 1 and adapts with the observed reordering.
+
+ The following pseudocode implements the above algorithm for updating
+ the RACK reordering window:
+
+ RACK_update_reo_wnd():
+
+ /* DSACK-based reordering window adaptation */
+ If RACK.dsack_round is not None AND
+ SND.UNA >= RACK.dsack_round:
+ RACK.dsack_round = None
+ /* Grow the reordering window per round that sees DSACK.
+ Reset the window after 16 DSACK-free recoveries */
+ If RACK.dsack_round is None AND
+ any DSACK option is present on latest received ACK:
+ RACK.dsack_round = SND.NXT
+ RACK.reo_wnd_mult += 1
+ RACK.reo_wnd_persist = 16
+ Else if exiting Fast or RTO recovery:
+ RACK.reo_wnd_persist -= 1
+ If RACK.reo_wnd_persist <= 0:
+ RACK.reo_wnd_mult = 1
+
+ If RACK.reordering_seen is FALSE:
+ If in Fast or RTO recovery:
+ Return 0
+ Else if RACK.segs_sacked >= DupThresh:
+ Return 0
+ Return min(RACK.reo_wnd_mult * RACK.min_RTT / 4, SRTT)
+
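+ A non-normative Python sketch of the same logic follows; state lives
+ in a hypothetical dict, all times share one unit, and dup_thresh
+ mirrors the DupThresh of 3 from [RFC6675]:
+
```python
def rack_update_reo_wnd(rack, snd_una, snd_nxt, dsack_seen,
                        exiting_recovery, in_recovery, dup_thresh=3):
    """Return the current reordering window (illustrative only)."""
    # Close the current DSACK round once its data is cumulatively ACKed.
    if rack['dsack_round'] is not None and snd_una >= rack['dsack_round']:
        rack['dsack_round'] = None
    # Grow the multiplier at most once per round trip that sees a
    # DSACK; decay it after 16 DSACK-free recoveries.
    if rack['dsack_round'] is None and dsack_seen:
        rack['dsack_round'] = snd_nxt
        rack['reo_wnd_mult'] += 1
        rack['reo_wnd_persist'] = 16
    elif exiting_recovery:
        rack['reo_wnd_persist'] -= 1
        if rack['reo_wnd_persist'] <= 0:
            rack['reo_wnd_mult'] = 1
    if not rack['reordering_seen']:
        if in_recovery or rack['segs_sacked'] >= dup_thresh:
            return 0
    return min(rack['reo_wnd_mult'] * rack['min_rtt'] / 4, rack['srtt'])
```
+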
+ Step 5: Detect losses.
+
+ For each segment that has not been SACKed, RACK considers that
+ segment lost if another segment that was sent later has been
+ delivered and the reordering window has passed. RACK considers the
+ reordering window to have passed if the RACK.segment was sent a
+ sufficient time after the segment in question, if a sufficient time
+ has elapsed since the RACK.segment was S/ACKed, or some combination
+ of the two. More precisely, RACK marks a segment as lost if:
+
+ RACK.xmit_ts >= Segment.xmit_ts
+ AND
+ RACK.xmit_ts - Segment.xmit_ts + (now - RACK.ack_ts) >= RACK.reo_wnd
+
+ Solving this second condition for "now", the moment at which a
+ segment is marked as lost, yields:
+
+ now >= Segment.xmit_ts + RACK.reo_wnd + (RACK.ack_ts - RACK.xmit_ts)
+
+ Then (RACK.ack_ts - RACK.xmit_ts) is the round-trip time of the most
+ recently (re)transmitted segment that's been delivered. When
+ segments are delivered in order, the most recently (re)transmitted
+ segment that's been delivered is also the most recently delivered;
+ hence, RACK.rtt == RACK.ack_ts - RACK.xmit_ts. But if segments were
+ reordered, then the segment delivered most recently was sent before
+ the most recently (re)transmitted segment. Hence, RACK.rtt >
+ (RACK.ack_ts - RACK.xmit_ts).
+
+ Since RACK.rtt >= (RACK.ack_ts - RACK.xmit_ts), the previous equation
+ reduces to saying that the sender can declare a segment lost when:
+
+ now >= Segment.xmit_ts + RACK.reo_wnd + RACK.rtt
+
+ In turn, that is equivalent to stating that a RACK sender should
+ declare a segment lost when:
+
+ Segment.xmit_ts + RACK.rtt + RACK.reo_wnd - now <= 0
+
+ Note that if the value on the left-hand side is positive, it
+ represents the remaining wait time before the segment is deemed lost.
+ But this risks a timeout (RTO) if no more ACKs come back (e.g., due
+ to losses or application-limited transmissions) to trigger the
+ marking. For timely loss detection, it is RECOMMENDED that the
+ sender install a reordering timer. This timer expires at the
+ earliest moment when RACK would conclude that all the unacknowledged
+ segments within the reordering window were lost.
+
+ The following pseudocode implements the algorithm above. When an ACK
+ is received or the RACK reordering timer expires, call
+ RACK_detect_loss_and_arm_timer(). The algorithm breaks timestamp
+ ties by using the TCP sequence space since high-speed networks often
+ have multiple segments with identical timestamps.
+
+ RACK_detect_loss():
+ timeout = 0
+ RACK.reo_wnd = RACK_update_reo_wnd()
+ For each segment, Segment, not acknowledged yet:
+ If RACK_sent_after(RACK.xmit_ts, RACK.end_seq,
+ Segment.xmit_ts, Segment.end_seq):
+ remaining = Segment.xmit_ts + RACK.rtt +
+ RACK.reo_wnd - Now()
+ If remaining <= 0:
+ Segment.lost = TRUE
+ Segment.xmit_ts = INFINITE_TS
+ Else:
+ timeout = max(remaining, timeout)
+ Return timeout
+
+ RACK_detect_loss_and_arm_timer():
+ timeout = RACK_detect_loss()
+ If timeout != 0:
+ Arm the RACK timer to call
+ RACK_detect_loss_and_arm_timer() after timeout
+
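+ A non-normative Python sketch of the loss detection scan, with the
+ same hypothetical segment dicts as in the earlier sketches:
+
```python
def rack_sent_after(t1, seq1, t2, seq2):
    return t1 > t2 or (t1 == t2 and seq1 > seq2)

def rack_detect_loss(rack, segments, now):
    """Mark un-SACKed segments lost once rack['rtt'] + rack['reo_wnd']
    has elapsed since they were sent; return the delay until the last
    in-window segment would be declared lost (0 if none remain)."""
    timeout = 0
    for seg in segments:
        if seg['sacked'] or seg['lost']:
            continue
        if rack_sent_after(rack['xmit_ts'], rack['end_seq'],
                           seg['xmit_ts'], seg['end_seq']):
            remaining = seg['xmit_ts'] + rack['rtt'] + rack['reo_wnd'] - now
            if remaining <= 0:
                seg['lost'] = True
            else:
                timeout = max(remaining, timeout)
    return timeout
```
+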
+ As an optimization, an implementation can choose to check only
+ segments that have been sent before RACK.xmit_ts. This can be more
+ efficient than scanning the entire SACK scoreboard, especially when
+ there are many segments in flight. The implementation can use a
+ separate doubly linked list ordered by Segment.xmit_ts, insert a
+ segment at the tail of the list when it is (re)transmitted, and
+ remove a segment from the list when it is delivered or marked as
+ lost. In Linux TCP, this optimization improved CPU usage by orders
+ of magnitude during some fast recovery episodes on high-speed WAN
+ networks.
+
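+ The core of that optimization may be sketched as follows; this is a
+ simplified, non-normative illustration that keys the early exit on
+ the loss deadline alone and omits the RACK_sent_after() guard and
+ removal of delivered segments:
+
```python
from collections import deque

class XmitQueue:
    """Segments kept in (re)transmission order, oldest at the head."""

    def __init__(self):
        self.q = deque()

    def on_transmit(self, seg):
        self.q.append(seg)          # (re)transmissions go to the tail

    def pop_lost(self, rack_rtt, reo_wnd, now):
        """Pop and return segments whose loss deadline has passed.
        Because xmit_ts is nondecreasing along the queue, the first
        unexpired segment ends the scan."""
        lost = []
        while self.q and self.q[0]['xmit_ts'] + rack_rtt + reo_wnd <= now:
            lost.append(self.q.popleft())
        return lost
```
+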
+6.3. Upon RTO Expiration
+
+ Upon RTO timer expiration, RACK marks the first outstanding segment
+ as lost (since it was sent an RTO ago); for all the other segments,
+ RACK only marks the segment as lost if the time elapsed since the
+ segment was transmitted is at least the sum of the recent RTT and the
+ reordering window.
+
+ RACK_mark_losses_on_RTO():
+ For each segment, Segment, not acknowledged yet:
+ If SEG.SEQ == SND.UNA OR
+ Segment.xmit_ts + RACK.rtt + RACK.reo_wnd - Now() <= 0:
+ Segment.lost = TRUE
+
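+ A non-normative Python sketch of this rule (hypothetical names; SEG
+ boundaries are carried as a start_seq field on each segment dict):
+
```python
def rack_mark_losses_on_rto(rack, segments, snd_una, now):
    """On RTO, the first outstanding segment is marked lost outright;
    later ones only once rtt + reo_wnd has elapsed since their send."""
    for seg in segments:
        if seg['sacked'] or seg['lost']:
            continue
        if (seg['start_seq'] == snd_una or
                seg['xmit_ts'] + rack['rtt'] + rack['reo_wnd'] <= now):
            seg['lost'] = True
```
+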
+7. TLP Algorithm Details
+
+7.1. Initializing State
+
+ Reset TLP.is_retrans and TLP.end_seq when initiating a connection,
+ fast recovery, or RTO recovery.
+
+ TLP_init():
+ TLP.end_seq = None
+ TLP.is_retrans = false
+
+7.2. Scheduling a Loss Probe
+
+ The sender schedules a loss probe timeout (PTO) to transmit a segment
+ during the normal transmission process. The sender SHOULD start or
+ restart a loss probe PTO timer after transmitting new data (that was
+ not itself a loss probe) or upon receiving an ACK that cumulatively
+ acknowledges new data unless it is already in fast recovery, RTO
+ recovery, or segments have been SACKed (i.e., RACK.segs_sacked is not
+ zero). These conditions are excluded because they are addressed by
+ similar mechanisms, like Limited Transmit [RFC3042], the RACK
+ reordering timer, and Forward RTO-Recovery (F-RTO) [RFC5682].
+
+ The sender calculates the PTO interval by taking into account a
+ number of factors.
+
+ First, the default PTO interval is 2*SRTT. By that time, it is
+ prudent to declare that an ACK is overdue since under normal
+ circumstances, i.e., no losses, an ACK typically arrives in one SRTT.
+ Choosing the PTO to be exactly an SRTT would risk causing spurious
+ probes given that network and end-host delay variance can cause an
+ ACK to be delayed beyond the SRTT. Hence, the PTO is conservatively
+ chosen to be the next integral multiple of SRTT.
+
+ Second, when there is no SRTT estimate available, the PTO SHOULD be 1
+ second. This conservative value corresponds to the RTO value when no
+ SRTT is available, per [RFC6298].
+
+ Third, when the FlightSize is one segment, the sender MAY inflate the
+ PTO by TLP.max_ack_delay to accommodate a potentially delayed
+ acknowledgment and reduce the risk of spurious retransmissions. The
+ actual value of TLP.max_ack_delay is implementation specific.
+
+ Finally, if the time at which an RTO would fire (here denoted as
+ "TCP_RTO_expiration()") is sooner than the computed time for the PTO,
+ then the sender schedules a TLP to be sent at that RTO time.
+
+ Summarizing these considerations in pseudocode form, a sender SHOULD
+ use the following logic to select the duration of a PTO:
+
+ TLP_calc_PTO():
+ If SRTT is available:
+ PTO = 2 * SRTT
+ If FlightSize is one segment:
+ PTO += TLP.max_ack_delay
+ Else:
+ PTO = 1 sec
+
+ If Now() + PTO > TCP_RTO_expiration():
+ PTO = TCP_RTO_expiration() - Now()
+
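+ A non-normative Python sketch of the computation; times are in
+ seconds, and the 40 ms default for max_ack_delay is an assumed,
+ implementation-specific value rather than anything mandated here:
+
```python
def tlp_calc_pto(srtt, flight_size, rto_expiration, now,
                 max_ack_delay=0.040):
    if srtt is not None:
        pto = 2 * srtt
        if flight_size == 1:
            # Allow for a potentially delayed ACK on a lone segment.
            pto += max_ack_delay
    else:
        pto = 1.0                   # conservative default per [RFC6298]
    # Never schedule the probe later than the RTO would fire.
    if now + pto > rto_expiration:
        pto = rto_expiration - now
    return pto
```
+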
+7.3. Sending a Loss Probe upon PTO Expiration
+
+ When the PTO timer expires, the sender MUST check whether both of the
+ following conditions are met before sending a loss probe:
+
+ 1. First, there is no other previous loss probe still in flight.
+ This ensures that, at any given time, the sender has at most one
+ additional packet in flight beyond the congestion window limit.
+ This invariant is maintained using the state variable
+ TLP.end_seq, which indicates the latest unacknowledged TLP loss
+ probe's ending sequence. It is reset when the loss probe has
+ been acknowledged or is deemed lost or irrelevant.
+
+ 2. Second, the sender has obtained an RTT measurement since the last
+ loss probe transmission or the start of the connection, whichever
+ was later. This condition ensures that loss probe
+ retransmissions do not prevent taking the RTT samples necessary
+ to adapt SRTT to an increase in path RTT.
+
+ If either one of these two conditions is not met, then the sender
+ MUST skip sending a loss probe and MUST proceed to re-arm the RTO
+ timer, as specified at the end of this section.
+
+ If both conditions are met, then the sender SHOULD transmit a
+ previously unsent data segment, if one exists and the receive window
+ allows, and increment the FlightSize accordingly. Note that the
+ FlightSize could be one packet greater than the congestion window
+ temporarily until the next ACK arrives.
+
+ If such an unsent segment is not available, then the sender SHOULD
+ retransmit the highest-sequence segment sent so far and set
+ TLP.is_retrans to true. This segment is chosen to deal with the
+ retransmission ambiguity problem in TCP. Suppose a sender sends N
+ segments and then retransmits the last segment (segment N) as a loss
+ probe, after which the sender receives a SACK for segment N. As long
+ as the sender waits for the RACK reordering window to expire, it
+ doesn't matter if that SACK was for the original transmission of
+ segment N or the TLP retransmission; in either case, the arrival of
+ the SACK for segment N provides evidence that the N-1 segments
+ preceding segment N were likely lost.
+
+ In a case where there is only one original outstanding segment of
+ data (N=1), the same logic (trivially) applies: an ACK for a single
+ outstanding segment tells the sender that the N-1=0 segments
+ preceding that segment were lost. Furthermore, whether there are N>1
+ or N=1 outstanding segments, there is a question about whether the
+ original last segment or its TLP retransmission were lost; the sender
+ estimates whether there was such a loss using TLP recovery detection
+ (see below).
+
+ The sender MUST follow the RACK transmission procedures in the "Upon
+ Transmitting a Data Segment" section upon sending either a
+ retransmission or a new data loss probe. This is critical for
+ detecting losses using the ACK for the loss probe.
+
+ After attempting to send a loss probe, regardless of whether a loss
+ probe was sent, the sender MUST re-arm the RTO timer, not the PTO
+ timer, if the FlightSize is not zero. This ensures RTO recovery
+ remains the last resort if TLP fails. The following pseudocode
+ summarizes the operations.
+
+ TLP_send_probe():
+
+ If TLP.end_seq is None and
+ Sender has taken a new RTT sample since last probe or
+ the start of connection:
+ TLP.is_retrans = false
+ Segment = send buffer segment starting at SND.NXT
+ If Segment exists and fits the peer receive window limit:
+ /* Transmit the lowest-sequence unsent Segment */
+ Transmit Segment
+ RACK_transmit_data(Segment)
+ TLP.end_seq = SND.NXT
+ Increase FlightSize by Segment length
+ Else:
+ /* Retransmit the highest-sequence Segment sent */
+ Segment = send buffer segment ending at SND.NXT
+ Transmit Segment
+ RACK_retransmit_data(Segment)
+ TLP.end_seq = SND.NXT
+ TLP.is_retrans = true
+
+ If FlightSize is not zero:
+ Rearm RTO timer to fire at timeout = now + RTO
+
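+ The gating and bookkeeping in TLP_send_probe() can be condensed into
+ a non-normative Python sketch that decides which kind of probe, if
+ any, to send; the buffer and window checks are abstracted into the
+ boolean parameters, which are hypothetical:
+
```python
def tlp_send_probe(tlp, snd_nxt, has_new_rtt_sample, unsent_fits_rwnd):
    """Return 'new', 'rtx', or None per the Section 7.3 conditions:
    at most one probe in flight, and only after a fresh RTT sample."""
    if tlp['end_seq'] is not None or not has_new_rtt_sample:
        return None
    tlp['end_seq'] = snd_nxt
    if unsent_fits_rwnd:
        tlp['is_retrans'] = False   # probe with new data if possible
        return 'new'
    tlp['is_retrans'] = True        # else retransmit the highest segment
    return 'rtx'
```
+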
+7.4. Detecting Losses Using the ACK of the Loss Probe
+
+ When there is packet loss in a flight ending with a loss probe, the
+ feedback solicited by a loss probe will reveal one of two scenarios,
+ depending on the pattern of losses.
+
+7.4.1. General Case: Detecting Packet Losses Using RACK
+
+ If the loss probe and the ACK that acknowledges the probe are
+ delivered successfully, RACK-TLP uses this ACK -- just as it would
+ with any other ACK -- to detect if any segments sent prior to the
+ probe were dropped. RACK would typically infer that any
+ unacknowledged data segments sent before the loss probe were lost,
+ since they were sent sufficiently far in the past (where at least one
+ PTO has elapsed, plus one round trip for the loss probe to be ACKed).
+ More specifically, RACK_detect_loss() (Step 5) would mark those
+ earlier segments as lost. Then the sender would trigger a fast
+ recovery to recover those losses.
+
+7.4.2. Special Case: Detecting a Single Loss Repaired by the Loss Probe
+
+ If the TLP retransmission repairs all the lost in-flight sequence
+ ranges (i.e., only the last segment in the flight was lost), the ACK
+ for the loss probe appears to be a regular cumulative ACK, which
+ would not normally trigger the congestion control response to this
+ packet loss event. The following TLP recovery detection mechanism
+ examines ACKs to detect this special case to make congestion control
+ respond properly [RFC5681].
+
+ After a TLP retransmission, the sender checks for this special case
+ of a single loss that is recovered by the loss probe itself. To
+ accomplish this, the sender checks for a duplicate ACK or DSACK
+ indicating that both the original segment and TLP retransmission
+ arrived at the receiver, which means there was no loss. If the TLP
+ sender does not receive such an indication, then it MUST assume that
+ the original data segment, the TLP retransmission, or a corresponding
+ ACK was lost for congestion control purposes.
+
+ If the TLP retransmission is spurious, a receiver that uses DSACK
+ would return an ACK that covers TLP.end_seq with a DSACK option (Case
+ 1). If the receiver does not support DSACK, it would return a DupAck
+ without any SACK option (Case 2). If the sender receives an ACK
+ matching either case, then the sender estimates that the receiver
+ received both the original data segment and the TLP probe
+ retransmission. The sender considers the TLP episode to be done and
+ records that fact by setting TLP.end_seq to None.
+
+ Upon receiving an ACK that covers some sequence number after
+ TLP.end_seq, the sender should have received any ACKs for the
+ original segment and TLP probe retransmission segment. At that time,
+ if the TLP.end_seq is still set and thus indicates that the TLP probe
+ retransmission remains unacknowledged, then the sender should presume
+ that at least one of its data segments was lost. The sender then
+ SHOULD invoke a congestion control response equivalent to a fast
+ recovery.
+
+ More precisely, on each ACK, the sender executes the following:
+
+ TLP_process_ack(ACK):
+ If TLP.end_seq is not None AND ACK's ack. number >= TLP.end_seq:
+ If not TLP.is_retrans:
+ TLP.end_seq = None /* TLP of new data delivered */
+ Else if ACK has a DSACK option matching TLP.end_seq:
+ TLP.end_seq = None /* Case 1, above */
+ Else If ACK's ack. number > TLP.end_seq:
+ TLP.end_seq = None /* Repaired the single loss */
+ (Invoke congestion control to react to
+ the loss event the probe has repaired)
+ Else If ACK is a DupAck without any SACK option:
+ TLP.end_seq = None /* Case 2, above */
+
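+ A non-normative Python sketch of the same logic; ACKs are modeled as
+ dicts, and dsack_blocks holding the right edges of DSACK ranges is a
+ hypothetical simplification of the option's contents:
+
```python
def tlp_process_ack(tlp, ack):
    """Return True when congestion control must react to a loss that
    the probe itself repaired (the single-loss special case)."""
    if tlp['end_seq'] is None or ack['ack_seq'] < tlp['end_seq']:
        return False
    if not tlp['is_retrans']:
        tlp['end_seq'] = None       # TLP of new data delivered
    elif tlp['end_seq'] in ack['dsack_blocks']:
        tlp['end_seq'] = None       # Case 1: DSACK shows a spurious probe
    elif ack['ack_seq'] > tlp['end_seq']:
        tlp['end_seq'] = None       # the probe repaired the single loss
        return True
    elif ack['is_dupack'] and not ack['sack_blocks']:
        tlp['end_seq'] = None       # Case 2: DupAck without any SACK
    return False
```
+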
+8. Managing RACK-TLP Timers
+
+ The RACK reordering timer, the TLP PTO timer, the RTO, and Zero
+ Window Probe (ZWP) timer [RFC793] are mutually exclusive and are used
+ in different scenarios. When arming a RACK reordering timer or TLP
+ PTO timer, the sender SHOULD cancel any other pending timers. An
+ implementation is expected to have one timer with an additional state
+ variable indicating the type of the timer.
+
+9. Discussion
+
+9.1. Advantages and Disadvantages
+
+ The biggest advantage of RACK-TLP is that every data segment, whether
+ it is an original data transmission or a retransmission, can be used
+ to detect losses of the segments sent chronologically prior to it.
+ This enables RACK-TLP to use fast recovery in cases with application-
+ limited flights of data, lost retransmissions, or data segment
+ reordering events. Consider the following examples:
+
+ 1. Packet drops at the end of an application data flight: Consider a
+ sender that transmits an application-limited flight of three data
+ segments (P1, P2, P3), and P1 and P3 are lost. Suppose the
+ transmission of each segment is at least RACK.reo_wnd after the
+ transmission of the previous segment. RACK will mark P1 as lost
+ when the SACK of P2 is received, and this will trigger the
+ retransmission of P1 as R1. When R1 is cumulatively
+ acknowledged, RACK will mark P3 as lost, and the sender will
+ retransmit P3 as R3. This example illustrates how RACK is able
+ to repair certain drops at the tail of a transaction without an
+ RTO recovery. Notice that neither the conventional duplicate ACK
+ threshold [RFC5681], nor the loss recovery algorithm [RFC6675],
+ nor the Forward Acknowledgment [FACK] algorithm can detect such
+ losses because of the required segment or sequence count.
+
+ 2. Lost retransmission: Consider a flight of three data segments
+ (P1, P2, P3) that are sent; P1 and P2 are dropped. Suppose the
+ transmission of each segment is at least RACK.reo_wnd after the
+ transmission of the previous segment. When P3 is SACKed, RACK
+ will mark P1 and P2 as lost, and they will be retransmitted as R1
+ and R2. Suppose R1 is lost again but R2 is SACKed; RACK will
+ mark R1 as lost and trigger retransmission again. Again, neither
+ the conventional three-duplicate ACK threshold approach, nor the
+ loss recovery algorithm [RFC6675], nor the Forward Acknowledgment
+ [FACK] algorithm can detect such losses. And such a lost
+ retransmission can happen when TCP is being rate-limited,
+ particularly by token bucket policers with a large bucket depth
+ and low rate limit; in such cases, retransmissions are often lost
+ repeatedly because standard congestion control requires multiple
+ round trips to reduce the rate below the policed rate.
+
+ 3. Packet reordering: Consider a simple reordering event where a
+ flight of segments are sent as (P1, P2, P3). P1 and P2 carry a
+ full payload of Maximum Segment Size (MSS) octets, but P3 has only
+ a 1-octet payload. Suppose the sender has detected reordering
+ previously and thus RACK.reo_wnd is min_RTT/4. Now P3 is
+ reordered and delivered first, before P1 and P2. As long as P1
+ and P2 are delivered within min_RTT/4, RACK will not consider P1
+ and P2 lost. But if P1 and P2 are delivered outside the
+ reordering window, then RACK will still spuriously mark P1 and P2
+ as lost.
+
+ The examples above show that RACK-TLP is particularly useful when the
+ sender is limited by the application, which can happen with
+ interactive or request/response traffic. Similarly, RACK still works
+ when the sender is limited by the receive window, which can happen
+ with applications that use the receive window to throttle the sender.
+
+ RACK-TLP works more efficiently with TCP Segmentation Offload (TSO)
+ compared to DupAck counting. RACK always marks the entire TSO
+ aggregate as lost because the segments in the same TSO aggregate have
+ the same transmission timestamp. By contrast, the algorithms based
+ on sequence counting (e.g., [RFC6675], [RFC5681]) may mark only a
+ subset of segments in the TSO aggregate as lost, forcing the stack to
+ perform expensive fragmentation of the TSO aggregate or to
+ selectively tag individual segments as lost in the scoreboard.
+
+ The main drawback of RACK-TLP is the additional state required
+ compared to DupAck counting. RACK requires the sender to record the
+ transmission time of each segment sent at a clock granularity that is
+ finer than 1/4 of the minimum RTT of the connection. TCP
+ implementations that already record this for RTT estimation do not
+ require any new per-packet state. But implementations that are not
+ yet recording segment transmission times will need to add per-packet
+ internal state (expected to be either 4 or 8 octets per segment or
+ TSO aggregate) to track transmission times. In contrast, the loss
+ detection approach described in [RFC6675] does not require any per-
+ packet state beyond the SACK scoreboard; this is particularly useful
+ on ultra-low RTT networks where the RTT may be less than the sender
+ TCP clock granularity (e.g., inside data centers). Another
+ disadvantage is that the reordering timer may expire prematurely
+ (like any other retransmission timer) and cause a higher rate of
+ spurious retransmissions, especially if DSACK is not supported.
+
+9.2. Relationships with Other Loss Recovery Algorithms
+
+ The primary motivation of RACK-TLP is to provide a general
+ alternative to some of the standard loss recovery algorithms
+ [RFC5681] [RFC6675] [RFC5827] [RFC4653]. In particular, the SACK
+ loss recovery algorithm for TCP [RFC6675] is not designed to handle
+ lost retransmissions, so its NextSeg() does not work for lost
+ retransmissions, and it does not specify the corresponding required
+ additional congestion response. Therefore, the algorithm [RFC6675]
+ MUST NOT be used with RACK-TLP; instead, a modified recovery
+ algorithm that carefully addresses such a case is needed.
+
+ The Early Retransmit mechanism [RFC5827] and NCR for TCP [RFC4653]
+ dynamically adjust the duplicate ACK threshold based on the current
+ or previous flight sizes. RACK-TLP takes a different approach by
+ using a time-based reordering window. RACK-TLP can be seen as an
+ extended Early Retransmit [RFC5827] without a FlightSize limit but
+ with an additional reordering window. [FACK] considers an original
+ segment to be lost when its sequence range is sufficiently far below
+ the highest SACKed sequence. In some sense, RACK-TLP can be seen as
+ a generalized form of FACK that operates in time space instead of
+ sequence space, enabling it to better handle reordering, application-
+ limited traffic, and lost retransmissions.
+
+ RACK-TLP is compatible with the standard RTO [RFC6298], RTO Restart
+ [RFC7765], F-RTO [RFC5682], and Eifel algorithms [RFC3522]. This is
+ because RACK-TLP only detects loss by using ACK events. It neither
+ changes the RTO timer calculation nor detects spurious RTOs. RACK-
+ TLP slightly changes the behavior of [RFC6298] by preceding the RTO
+ with a TLP and reducing potential spurious retransmissions after RTO.
+
+9.3. Interaction with Congestion Control
+
+ RACK-TLP intentionally decouples loss detection from congestion
+ control. RACK-TLP only detects losses; it does not modify the
+ congestion control algorithm [RFC5681] [RFC6937]. A segment marked
+ as lost by RACK-TLP MUST NOT be retransmitted until congestion
+ control deems this appropriate. As mentioned in the paragraph
+ following Figure 1 (Section 3.4, Paragraph 3), [RFC5681] mandates a
+ principle that loss in two successive windows of data or the loss of
+ a retransmission must be taken as two indications of congestion and
+ therefore trigger two separate reactions. The Proportional Rate
+ Reduction (PRR) algorithm [RFC6937] is RECOMMENDED for the specific
+ congestion control actions taken upon the losses detected by RACK-
+ TLP. In the absence of PRR [RFC6937], when RACK-TLP detects a lost
+ retransmission, the congestion control MUST trigger an additional
+ congestion response per the aforementioned principle in [RFC5681].
+ If multiple original transmissions or retransmissions were lost in a
+ window, the congestion control specified in [RFC5681] only reacts
+ once per window. The congestion control implementer is advised to
+ carefully consider this subtle situation introduced by RACK-TLP.
+
+ The only exception -- the only way in which RACK-TLP modulates the
+ congestion control algorithm -- is that one outstanding loss probe
+ can be sent even if the congestion window is fully used. However,
+ this temporary overcommit is accounted for and credited in the in-
+ flight data tracked for congestion control, so that congestion
+ control will erase the overcommit upon the next ACK.
+
+ If packet losses happen after reordering has been observed, RACK-TLP
+ may take longer to detect losses than the pure DupAck counting
+ approach. In this case, TCP may continue to increase the congestion
+ window upon receiving ACKs during this time, making the sender more
+ aggressive.
+
+ The following simple example compares how RACK-TLP and non-RACK-TLP
+ loss detection interact with congestion control: suppose a sender has
+ a congestion window (cwnd) of 20 segments on a SACK-enabled
+ connection. It sends 10 data segments, and all of them are lost.
+
+ Without RACK-TLP, the sender would time out, reset cwnd to 1, and
+ retransmit the first segment. It would take four round trips (1 + 2
+ + 4 + 3 = 10) to retransmit all the 10 lost segments using slow
+ start. The recovery latency would be RTO + 4*RTT, with an ending
+ cwnd of 4 segments due to congestion window validation.
+
+ With RACK-TLP, a sender would send the TLP after 2*RTT and get a
+ DupAck, enabling RACK to detect the losses and trigger fast recovery.
+ If the sender implements Proportional Rate Reduction [RFC6937], it
+ would slow start to retransmit the remaining 9 lost segments since
+ the number of segments in flight (0) is lower than the slow start
+ threshold (10). The slow start would again take four round trips (1
+ + 2 + 4 + 3 = 10) to retransmit all the lost segments. The recovery
+ latency would be 2*RTT + 4*RTT, with an ending cwnd set to the slow-
+ start threshold of 10 segments.
+
+ The difference in recovery latency (RTO + 4*RTT vs 6*RTT) can be
+ significant if the RTT is much smaller than the minimum RTO (1 second
+ in [RFC6298]) or if the RTT is large. The former case can happen in
+ local area networks, data center networks, or content distribution
+ networks with deep deployments. The latter case can happen in
+ developing regions with highly congested and/or high-latency
+ networks.
+
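+ The arithmetic in the comparison above can be checked with a short,
+ non-normative Python sketch; the function names and the assumption
+ that the last slow-start round sends only the remaining segments are
+ illustrative:
+
```python
def slow_start_rounds(lost, cwnd=1):
    """Round trips needed to retransmit `lost` segments in slow start,
    doubling the per-round allowance (1 + 2 + 4 + 3 = 10 for ten)."""
    rounds, sent = 0, 0
    while sent < lost:
        sent += min(cwnd, lost - sent)
        cwnd *= 2
        rounds += 1
    return rounds

def recovery_latency(rtt, rto=None):
    """Seconds until ten losses are repaired: RTO + 4*RTT after a
    timeout, versus (2 + 4)*RTT when a TLP triggers fast recovery."""
    rounds = slow_start_rounds(10)
    return (rto + rounds * rtt) if rto is not None else (2 + rounds) * rtt
```
+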
+9.4. TLP Recovery Detection with Delayed ACKs
+
+ Delayed or stretched ACKs complicate the detection of repairs done by
+ TLP since, with such ACKs, the sender takes a longer time to receive
+ fewer ACKs than would normally be expected. To mitigate this
+ complication, before sending a TLP loss probe retransmission, the
+ sender should attempt to wait long enough that the receiver has sent
+ any delayed ACKs that it is withholding. The sender algorithm
+ described above features such a delay in the form of
+ TLP.max_ack_delay. Furthermore, if the receiver supports DSACK,
+ then, in the case of a delayed ACK, the sender's TLP recovery
+ detection mechanism (see above) can use the DSACK information to
+ infer that the original and TLP retransmission both arrived at the
+ receiver.
+
+ If there is ACK loss or a delayed ACK without a DSACK, then this
+ algorithm is conservative because the sender will reduce the
+ congestion window when, in fact, there was no packet loss. In
+ practice, this is acceptable and potentially even desirable: if there
+ is reverse path congestion, then reducing the congestion window can
+ be prudent.
+
+9.5. RACK-TLP for Other Transport Protocols
+
+ RACK-TLP can be implemented in other transport protocols (e.g.,
+ [QUIC-LR]). The [SPROUT] loss detection algorithm was also
+ independently designed to use a 10 ms reordering window to improve
+ its loss detection similar to RACK.
+
+10. Security Considerations
+
+ RACK-TLP algorithm behavior is based on information conveyed in SACK
+ options, so it has security considerations similar to those described
+ in the Security Considerations section of [RFC6675].
+
+ Additionally, RACK-TLP has a lower risk profile than the loss
+ recovery algorithm [RFC6675] because it is not vulnerable to ACK-
+ splitting attacks [SCWA99]: for an MSS-sized segment sent, the
+ receiver or the attacker might send MSS ACKs that selectively or
+ cumulatively acknowledge one additional byte per ACK. This would not
+ fool RACK. In such a scenario, RACK.xmit_ts would not advance
+ because all the sequence ranges within the segment were transmitted
+ at the same time and thus carry the same transmission timestamp. In
+ other words, SACKing only one byte of a segment or SACKing the
+ segment in its entirety has the same effect with RACK.
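The property above can be sketched as follows. This is an illustrative toy model, not the document's normative pseudocode: all sequence ranges within one transmitted segment share a single transmit timestamp, so an attacker splitting one segment's acknowledgment into many one-byte SACKs advances the RACK state no further than a single ACK covering the whole segment would.

```python
def rack_latest_xmit_ts(rack_xmit_ts, sacked_ranges, segments):
    """Return the latest transmit timestamp among newly SACKed data.

    segments      -- {(start_seq, end_seq): xmit_ts} for sent segments
    sacked_ranges -- [(lo, hi), ...] byte ranges acknowledged via SACK
    """
    for lo, hi in sacked_ranges:
        for (start, end), xmit_ts in segments.items():
            if lo < end and hi > start:  # SACK range overlaps segment
                # Every byte of the segment carries the same timestamp,
                # so partial and full SACKs advance the state equally.
                rack_xmit_ts = max(rack_xmit_ts, xmit_ts)
    return rack_xmit_ts

# One MSS-sized segment sent at (hypothetical) time 100:
segments = {(0, 1460): 100}
one_byte = rack_latest_xmit_ts(0, [(0, 1)], segments)
whole_seg = rack_latest_xmit_ts(0, [(0, 1460)], segments)
assert one_byte == whole_seg == 100  # ACK splitting gains nothing
```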
+
+11. IANA Considerations
+
+ This document has no IANA actions.
+
+12. References
+
+12.1. Normative References
+
+ [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
+ Selective Acknowledgment Options", RFC 2018,
+ DOI 10.17487/RFC2018, October 1996,
+ <https://www.rfc-editor.org/info/rfc2018>.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119,
+ DOI 10.17487/RFC2119, March 1997,
+ <https://www.rfc-editor.org/info/rfc2119>.
+
+ [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
+ Extension to the Selective Acknowledgement (SACK) Option
+ for TCP", RFC 2883, DOI 10.17487/RFC2883, July 2000,
+ <https://www.rfc-editor.org/info/rfc2883>.
+
+ [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
+ Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
+ <https://www.rfc-editor.org/info/rfc5681>.
+
+ [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent,
+ "Computing TCP's Retransmission Timer", RFC 6298,
+ DOI 10.17487/RFC6298, June 2011,
+ <https://www.rfc-editor.org/info/rfc6298>.
+
+ [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
+ and Y. Nishida, "A Conservative Loss Recovery Algorithm
+ Based on Selective Acknowledgment (SACK) for TCP",
+ RFC 6675, DOI 10.17487/RFC6675, August 2012,
+ <https://www.rfc-editor.org/info/rfc6675>.
+
+ [RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
+ Scheffenegger, Ed., "TCP Extensions for High Performance",
+ RFC 7323, DOI 10.17487/RFC7323, September 2014,
+ <https://www.rfc-editor.org/info/rfc7323>.
+
+ [RFC793] Postel, J., "Transmission Control Protocol", STD 7,
+ RFC 793, DOI 10.17487/RFC0793, September 1981,
+ <https://www.rfc-editor.org/info/rfc793>.
+
+ [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
+ 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
+ May 2017, <https://www.rfc-editor.org/info/rfc8174>.
+
+12.2. Informative References
+
+ [DMCG11] Dukkipati, N., Mathis, M., Cheng, Y., and M. Ghobadi,
+ "Proportional Rate Reduction for TCP", Proceedings of the
+ 2011 ACM SIGCOMM Conference on Internet Measurement
+ Conference pp. 155-170, DOI 10.1145/2068816.2068832,
+ November 2011, <https://doi.org/10.1145/2068816.2068832>.
+
+ [FACK] Mathis, M. and J. Mahdavi, "Forward acknowledgement:
+ refining TCP congestion control", ACM SIGCOMM Computer
+ Communication Review Volume 26, Issue 4,
+ DOI 10.1145/248157.248181, August 1996,
+ <https://doi.org/10.1145/248157.248181>.
+
+ [POLICER16]
+ Flach, T., Papageorge, P., Terzis, A., Pedrosa, L., Cheng,
+ Y., Karim, T., Katz-Bassett, E., and R. Govindan, "An
+ Internet-Wide Analysis of Traffic Policing", Proceedings
+ of the 2016 ACM SIGCOMM Conference pp. 468-482,
+ DOI 10.1145/2934872.2934873, August 2016,
+ <https://doi.org/10.1145/2934872.2934873>.
+
+ [QUIC-LR] Iyengar, J. and I. Swett, "QUIC Loss Detection and
+ Congestion Control", Work in Progress, Internet-Draft,
+ draft-ietf-quic-recovery-34, 14 January 2021,
+ <https://tools.ietf.org/html/draft-ietf-quic-recovery-34>.
+
+ [RFC3042] Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
+ TCP's Loss Recovery Using Limited Transmit", RFC 3042,
+ DOI 10.17487/RFC3042, January 2001,
+ <https://www.rfc-editor.org/info/rfc3042>.
+
+ [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
+ for TCP", RFC 3522, DOI 10.17487/RFC3522, April 2003,
+ <https://www.rfc-editor.org/info/rfc3522>.
+
+ [RFC4653] Bhandarkar, S., Reddy, A. L. N., Allman, M., and E.
+ Blanton, "Improving the Robustness of TCP to Non-
+ Congestion Events", RFC 4653, DOI 10.17487/RFC4653, August
+ 2006, <https://www.rfc-editor.org/info/rfc4653>.
+
+ [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
+ "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
+ Spurious Retransmission Timeouts with TCP", RFC 5682,
+ DOI 10.17487/RFC5682, September 2009,
+ <https://www.rfc-editor.org/info/rfc5682>.
+
+ [RFC5827] Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
+ P. Hurtig, "Early Retransmit for TCP and Stream Control
+ Transmission Protocol (SCTP)", RFC 5827,
+ DOI 10.17487/RFC5827, May 2010,
+ <https://www.rfc-editor.org/info/rfc5827>.
+
+ [RFC6937] Mathis, M., Dukkipati, N., and Y. Cheng, "Proportional
+ Rate Reduction for TCP", RFC 6937, DOI 10.17487/RFC6937,
+ May 2013, <https://www.rfc-editor.org/info/rfc6937>.
+
+ [RFC7765] Hurtig, P., Brunstrom, A., Petlund, A., and M. Welzl, "TCP
+ and Stream Control Transmission Protocol (SCTP) RTO
+ Restart", RFC 7765, DOI 10.17487/RFC7765, February 2016,
+ <https://www.rfc-editor.org/info/rfc7765>.
+
+ [SCWA99] Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
+ "TCP congestion control with a misbehaving receiver", ACM
+ Computer Communication Review 29(5),
+ DOI 10.1145/505696.505704, October 1999,
+ <https://doi.org/10.1145/505696.505704>.
+
+ [SPROUT] Winstein, K., Sivaraman, A., and H. Balakrishnan,
+ "Stochastic Forecasts Achieve High Throughput and Low
+ Delay over Cellular Networks", 10th USENIX Symposium on
+ Networked Systems Design and Implementation (NSDI '13),
+ 2013.
+
+Acknowledgments
+
+ The authors thank Matt Mathis for his insights in FACK and Michael
+ Welzl for his per-packet timer idea that inspired this work. Eric
+ Dumazet, Randy Stewart, Van Jacobson, Ian Swett, Rick Jones, Jana
+ Iyengar, Hiren Panchasara, Praveen Balasubramanian, Yoshifumi
+ Nishida, Bob Briscoe, Felix Weinrank, Michael Tüxen, Martin Duke,
+ Ilpo Jarvinen, Theresa Enghardt, Mirja Kühlewind, Gorry Fairhurst,
+ Markku Kojo, and Yi Huang contributed to this document or the
+ implementations in Linux, FreeBSD, Windows, and QUIC.
+
+Authors' Addresses
+
+ Yuchung Cheng
+ Google, Inc.
+
+ Email: ycheng@google.com
+
+
+ Neal Cardwell
+ Google, Inc.
+
+ Email: ncardwell@google.com
+
+
+ Nandita Dukkipati
+ Google, Inc.
+
+ Email: nanditad@google.com
+
+
+ Priyaranjan Jha
+ Google, Inc.
+
+ Email: priyarjha@google.com