diff options
author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
---|---|---|
committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc9332.txt | |
parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc9332.txt')
-rw-r--r-- | doc/rfc/rfc9332.txt | 2977 |
1 files changed, 2977 insertions, 0 deletions
diff --git a/doc/rfc/rfc9332.txt b/doc/rfc/rfc9332.txt new file mode 100644 index 0000000..d3a6b3c --- /dev/null +++ b/doc/rfc/rfc9332.txt @@ -0,0 +1,2977 @@ + + + + +Internet Engineering Task Force (IETF) K. De Schepper +Request for Comments: 9332 Nokia Bell Labs +Category: Experimental B. Briscoe, Ed. +ISSN: 2070-1721 Independent + G. White + CableLabs + January 2023 + + + Dual-Queue Coupled Active Queue Management (AQM) for Low Latency, Low + Loss, and Scalable Throughput (L4S) + +Abstract + + This specification defines a framework for coupling the Active Queue + Management (AQM) algorithms in two queues intended for flows with + different responses to congestion. This provides a way for the + Internet to transition from the scaling problems of standard TCP- + Reno-friendly ('Classic') congestion controls to the family of + 'Scalable' congestion controls. These are designed for consistently + very low queuing latency, very low congestion loss, and scaling of + per-flow throughput by using Explicit Congestion Notification (ECN) + in a modified way. Until the Coupled Dual Queue (DualQ), these + Scalable L4S congestion controls could only be deployed where a + clean-slate environment could be arranged, such as in private data + centres. + + This specification first explains how a Coupled DualQ works. It then + gives the normative requirements that are necessary for it to work + well. All this is independent of which two AQMs are used, but + pseudocode examples of specific AQMs are given in appendices. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for examination, experimental implementation, and + evaluation. + + This document defines an Experimental Protocol for the Internet + community. This document is a product of the Internet Engineering + Task Force (IETF). It represents the consensus of the IETF + community. It has received public review and has been approved for + publication by the Internet Engineering Steering Group (IESG). Not + all documents approved by the IESG are candidates for any level of + Internet Standard; see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc9332. + +Copyright Notice + + Copyright (c) 2023 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Revised BSD License text as described in Section 4.e of the + Trust Legal Provisions and are provided without warranty as described + in the Revised BSD License. + +Table of Contents + + 1. Introduction + 1.1. Outline of the Problem + 1.2. Context, Scope, and Applicability + 1.3. Terminology + 1.4. Features + 2. DualQ Coupled AQM + 2.1. Coupled AQM + 2.2. Dual Queue + 2.3. Traffic Classification + 2.4. Overall DualQ Coupled AQM Structure + 2.5. Normative Requirements for a DualQ Coupled AQM + 2.5.1. Functional Requirements + 2.5.1.1. Requirements in Unexpected Cases + 2.5.2. Management Requirements + 2.5.2.1. Configuration + 2.5.2.2. Monitoring + 2.5.2.3. Anomaly Detection + 2.5.2.4. Deployment, Coexistence, and Scaling + 3. IANA Considerations + 4. Security Considerations + 4.1. Low Delay without Requiring Per-flow Processing + 4.2. Handling Unresponsive Flows and Overload + 4.2.1. Unresponsive Traffic without Overload + 4.2.2. Avoiding Short-Term Classic Starvation: Sacrifice L4S + Throughput or Delay? + 4.2.3. L4S ECN Saturation: Introduce Drop or Delay? + 4.2.3.1. Protecting against Overload by Unresponsive + ECN-Capable Traffic + 5. References + 5.1. Normative References + 5.2. Informative References + Appendix A. Example DualQ Coupled PI2 Algorithm + A.1. Pass #1: Core Concepts + A.2. Pass #2: Edge-Case Details + Appendix B. Example DualQ Coupled Curvy RED Algorithm + B.1. Curvy RED in Pseudocode + B.2. Efficient Implementation of Curvy RED + Appendix C. Choice of Coupling Factor, k + C.1. RTT-Dependence + C.2. Guidance on Controlling Throughput Equivalence + Acknowledgements + Contributors + Authors' Addresses + +1. Introduction + + This document specifies a framework for DualQ Coupled AQMs, which can + serve as the network part of the L4S architecture [RFC9330]. A DualQ + Coupled AQM consists of two queues: L4S and Classic. The L4S queue + is intended for Scalable congestion controls that can maintain very + low queuing latency (sub-millisecond on average) and high throughput + at the same time. The Coupled DualQ acts like a semi-permeable + membrane: the L4S queue isolates the sub-millisecond average queuing + delay of L4S from Classic latency, while the coupling between the + queues pools the capacity between both queues so that ad hoc numbers + of capacity-seeking applications all sharing the same capacity can + have roughly equivalent throughput per flow, whichever queue they + use. The DualQ achieves this indirectly, without having to inspect + transport-layer flow identifiers and without compromising the + performance of the Classic traffic, relative to a single queue. The + DualQ design has low complexity and requires no configuration for the + public Internet. + +1.1. Outline of the Problem + + Latency is becoming the critical performance factor for many (perhaps + most) applications on the public Internet, e.g., interactive web, web + services, voice, conversational video, interactive video, interactive + remote presence, instant messaging, online gaming, remote desktop, + cloud-based applications, and video-assisted remote control of + machinery and industrial processes. Once access network bitrates + reach levels now common in the developed world, further increases + offer diminishing returns unless latency is also addressed + [Dukkipati06]. In the last decade or so, much has been done to + reduce propagation time by placing caches or servers closer to users. + However, queuing remains a major intermittent component of latency. + + Previously, very low latency has only been available for a few + selected low-rate applications, that confine their sending rate + within a specially carved-off portion of capacity, which is + prioritized over other traffic, e.g., Diffserv Expedited Forwarding + (EF) [RFC3246]. Up to now, it has not been possible to allow any + number of low-latency, high throughput applications to seek to fully + utilize available capacity, because the capacity-seeking process + itself causes too much queuing delay. + + To reduce this queuing delay caused by the capacity-seeking process, + changes either to the network alone or to end systems alone are in + progress. L4S involves a recognition that both approaches are + yielding diminishing returns: + + * Recent state-of-the-art AQM in the network, e.g., Flow Queue CoDel + [RFC8290], Proportional Integral controller Enhanced (PIE) + [RFC8033], and Adaptive Random Early Detection (ARED) [ARED01]), + has reduced queuing delay for all traffic, not just a select few + applications. However, no matter how good the AQM, the capacity- + seeking (sawtoothing) rate of TCP-like congestion controls + represents a lower limit that will cause either the queuing delay + to vary or the link to be underutilized. These AQMs are tuned to + allow a typical capacity-seeking TCP-Reno-friendly flow to induce + an average queue that roughly doubles the base round-trip time + (RTT), adding 5-15 ms of queuing on average for a mix of long- + running flows and web traffic (cf. 500 microseconds with L4S for + the same traffic mix [L4Seval22]). However, for many + applications, low delay is not useful unless it is consistently + low. With these AQMs, 99th percentile queuing delay is 20-30 ms + (cf. 2 ms with the same traffic over L4S). + + * Similarly, recent research into using end-to-end congestion + control without needing an AQM in the network (e.g., Bottleneck + Bandwidth and Round-trip propagation time (BBR) [BBR-CC]) seems to + have hit a similar queuing delay floor of about 20 ms on average, + but there are also regular 25 ms delay spikes due to bandwidth + probes and 60 ms spikes due to flow-starts. + + L4S learns from the experience of Data Center TCP (DCTCP) [RFC8257], + which shows the power of complementary changes both in the network + and on end systems. DCTCP teaches us that two small but radical + changes to congestion control are needed to cut the two major + outstanding causes of queuing delay variability: + + 1. Far smaller rate variations (sawteeth) than Reno-friendly + congestion controls. + + 2. A shift of smoothing and hence smoothing delay from network to + sender. + + Without the former, a 'Classic' (e.g., Reno-friendly) flow's RTT + varies between roughly 1 and 2 times the base RTT between the + machines in question. Without the latter, a 'Classic' flow's + response to changing events is delayed by a worst-case + (transcontinental) RTT, which could be hundreds of times the actual + smoothing delay needed for the RTT of typical traffic from localized + Content Delivery Networks (CDNs). + + These changes are the two main features of the family of so-called + 'Scalable' congestion controls (which include DCTCP, Prague, and + Self-Clocked Rate Adaptation for Multimedia (SCReAM)). Both of these + changes only reduce delay in combination with a complementary change + in the network, and they are both only feasible with ECN, not drop, + for the signalling: + + 1. The smaller sawteeth allow an extremely shallow ECN packet- + marking threshold in the queue. + + 2. No smoothing in the network means that every fluctuation of the + queue is signalled immediately. + + Without ECN, either of these would lead to very high loss levels. In + contrast, with ECN, the resulting high marking levels are just + signals, not impairments. (Note that BBRv2 [BBRv2] combines the best + of both worlds -- it works as a Scalable congestion control when ECN + is available, but it also aims to minimize delay when ECN is absent.) + + However, until now, Scalable congestion controls (like DCTCP) did not + coexist well in a shared ECN-capable queue with existing Classic + (e.g., Reno [RFC5681] or CUBIC [RFC8312]) congestion controls -- + Scalable controls are so aggressive that these 'Classic' algorithms + would drive themselves to a small capacity share. Therefore, until + now, L4S controls could only be deployed where a clean-slate + environment could be arranged, such as in private data centres (hence + the name DCTCP). + + One way to solve the problem of coexistence between Scalable and + Classic flows is to use a per-flow-queuing (FQ) approach such as FQ- + CoDel [RFC8290]. It classifies packets by flow identifier into + separate queues in order to isolate sparse flows from the higher + latency in the queues assigned to heavier flows. However, if a + Classic flow needs both low delay and high throughput, having a queue + to itself does not isolate it from the harm it causes to itself. + Also FQ approaches need to inspect flow identifiers, which is not + always practical. + + In summary, Scalable congestion controls address the root cause of + the latency, loss and scaling problems with Classic congestion + controls. Both FQ and DualQ AQMs can be enablers for this smooth + low-latency scalable behaviour. The DualQ approach is particularly + useful because identifying flows is sometimes not practical or + desirable. + +1.2. Context, Scope, and Applicability + + L4S involves complementary changes in the network and on end systems: + + Network: + A DualQ Coupled AQM (defined in the present document) or a + modification to flow queue AQMs (described in paragraph "b" in + Section 4.2 of the L4S architecture [RFC9330]). + + End system: + A Scalable congestion control (defined in Section 4 of the L4S ECN + protocol spec [RFC9331]). + + Packet identifier: + The network and end-system parts of L4S can be deployed + incrementally, because they both identify L4S packets using the + experimentally assigned ECN codepoints in the IP header: ECT(1) + and CE [RFC8311] [RFC9331]. + + DCTCP [RFC8257] is an example of a Scalable congestion control for + controlled environments that has been deployed for some time in + Linux, Windows, and FreeBSD operating systems. During the progress + of this document through the IETF, a number of other Scalable + congestion controls were implemented, e.g., Prague over TCP and QUIC + [PRAGUE-CC] [PragueLinux], BBRv2 [BBRv2] [BBR-CC], and the L4S + variant of SCReAM for real-time media [SCReAM-L4S] [RFC8298]. + + The focus of this specification is to enable deployment of the + network part of the L4S service. Then, without any management + intervention, applications can exploit this new network capability as + the applications or their operating systems migrate to Scalable + congestion controls, which can then evolve _while_ their benefits are + being enjoyed by everyone on the Internet. + + The DualQ Coupled AQM framework can incorporate any AQM designed for + a single queue that generates a statistical or deterministic mark/ + drop probability driven by the queue dynamics. Pseudocode examples + of two different DualQ Coupled AQMs are given in the appendices. In + many cases the framework simplifies the basic control algorithm and + requires little extra processing. Therefore, it is believed the + Coupled AQM would be applicable and easy to deploy in all types of + buffers such as buffers in cost-reduced mass-market residential + equipment; buffers in end-system stacks; buffers in carrier-scale + equipment including remote access servers, routers, firewalls, and + Ethernet switches; buffers in network interface cards; buffers in + virtualized network appliances, hypervisors; and so on. + + For the public Internet, nearly all the benefit will typically be + achieved by deploying the Coupled AQM into either end of the access + link between a 'site' and the Internet, which is invariably the + bottleneck (see Section 6.4 of [RFC9330] about deployment, which also + defines the term 'site' to mean a home, an office, a campus, or + mobile user equipment). + + Latency is not the only concern of L4S: + + * The 'Low Loss' part of the name denotes that L4S generally + achieves zero congestion loss (which would otherwise cause + retransmission delays), due to its use of ECN. + + * The 'Scalable throughput' part of the name denotes that the per- + flow throughput of Scalable congestion controls should scale + indefinitely, avoiding the imminent scaling problems with 'TCP- + Friendly' congestion control algorithms [RFC3649]. + + The former is clearly in scope of this AQM document. However, the + latter is an outcome of the end-system behaviour and is therefore + outside the scope of this AQM document, even though the AQM is an + enabler. + + The overall L4S architecture [RFC9330] gives more detail, including + on wider deployment aspects such as backwards compatibility of + Scalable congestion controls in bottlenecks where a DualQ Coupled AQM + has not been deployed. The supporting papers [L4Seval22], + [DualPI2Linux], [PI2], and [PI2param] give the full rationale for the + AQM design, both discursively and in more precise mathematical form, + as well as the results of performance evaluations. The main results + have been validated independently when using the Prague congestion + control [Boru20] (experiments are run using Prague and DCTCP, but + only the former is relevant for validation, because Prague fixes a + number of problems with the Linux DCTCP code that make it unsuitable + for the public Internet). + +1.3. Terminology + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all + capitals, as shown here. + + The DualQ Coupled AQM uses two queues for two services: + + Classic Service/Queue: The Classic service is intended for all the + congestion control behaviours that coexist with Reno [RFC5681] + (e.g., Reno itself, CUBIC [RFC8312], and TFRC [RFC5348]). The + term 'Classic queue' means a queue providing the Classic service. + + Low Latency, Low Loss, and Scalable throughput (L4S) Service/ + Queue: The 'L4S' service is intended for traffic from Scalable + congestion control algorithms, such as the Prague congestion + control [PRAGUE-CC], which was derived from Data Center TCP + [RFC8257]. The L4S service is for more general traffic than just + Prague -- it allows the set of congestion controls with similar + scaling properties to Prague to evolve, such as the examples + listed below (Relentless, SCReAM, etc.). The term 'L4S queue' + means a queue providing the L4S service. + + Classic Congestion Control: A congestion control behaviour that can + coexist with standard Reno [RFC5681] without causing significantly + negative impact on its flow rate [RFC5033]. With Classic + congestion controls, such as Reno or CUBIC, because flow rate has + scaled since TCP congestion control was first designed in 1988, it + now takes hundreds of round trips (and growing) to recover after a + congestion signal (whether a loss or an ECN mark) as shown in the + examples in Section 5.1 of the L4S architecture [RFC9330] and in + [RFC3649]. Therefore, control of queuing and utilization becomes + very slack, and the slightest disturbances (e.g., from new flows + starting) prevent a high rate from being attained. + + Scalable Congestion Control: A congestion control where the average + time from one congestion signal to the next (the recovery time) + remains invariant as flow rate scales, all other factors being + equal. This maintains the same degree of control over queuing and + utilization whatever the flow rate, as well as ensuring that high + throughput is robust to disturbances. For instance, DCTCP + averages 2 congestion signals per round trip, whatever the flow + rate, as do other recently developed Scalable congestion controls, + e.g., Relentless TCP [RELENTLESS], Prague [PRAGUE-CC] + [PragueLinux], BBRv2 [BBRv2] [BBR-CC], and the L4S variant of + SCReAM for real-time media [SCReAM-L4S] [RFC8298]. For the public + Internet, a Scalable transport has to comply with the requirements + in Section 4 of [RFC9331] (a.k.a. the 'Prague L4S requirements'). + + C: Abbreviation for Classic, e.g., when used as a subscript. + + L: Abbreviation for L4S, e.g., when used as a subscript. + + The terms Classic or L4S can also qualify other nouns, such as + 'codepoint', 'identifier', 'classification', 'packet', and 'flow'. + For example, an L4S packet means a packet with an L4S identifier + sent from an L4S congestion control. + + Both Classic and L4S services can cope with a proportion of + unresponsive or less-responsive traffic as well but, in the L4S + case, its rate has to be smooth enough or low enough to not build + a queue (e.g., DNS, Voice over IP (VoIP), game sync datagrams, + etc.). The DualQ Coupled AQM behaviour is defined to be similar + to a single First-In, First-Out (FIFO) queue with respect to + unresponsive and overload traffic. + + Reno-friendly: The subset of Classic traffic that is friendly to the + standard Reno congestion control defined for TCP in [RFC5681]. + The TFRC spec [RFC5348] indirectly implies that 'friendly' is + defined as "generally within a factor of two of the sending rate + of a TCP flow under the same conditions". 'Reno-friendly' is used + here in place of 'TCP-friendly', given the latter has become + imprecise, because the TCP protocol is now used with so many + different congestion control behaviours, and Reno is used in non- + TCP transports, such as QUIC [RFC9000]. + + DualQ or DualQ AQM: Used loosely as shorthand for a Dual-Queue + Coupled AQM, where the context makes 'Coupled AQM' obvious. + + Classic ECN: The original Explicit Congestion Notification (ECN) + protocol [RFC3168] that requires ECN signals to be treated as + equivalent to drops, both when generated in the network and when + responded to by the sender. + + For L4S, the names used for the four codepoints of the 2-bit IP- + ECN field are unchanged from those defined in the ECN spec + [RFC3168], i.e., Not-ECT, ECT(0), ECT(1), and CE, where ECT stands + for ECN-Capable Transport and CE stands for Congestion + Experienced. A packet marked with the CE codepoint is termed + 'ECN-marked' or sometimes just 'marked' where the context makes + ECN obvious. + +1.4. Features + + The AQM couples marking and/or dropping from the Classic queue to the + L4S queue in such a way that a flow will get roughly the same + throughput whichever it uses. Therefore, both queues can feed into + the full capacity of a link, and no rates need to be configured for + the queues. The L4S queue enables Scalable congestion controls like + DCTCP or Prague to give very low and consistently low latency, + without compromising the performance of competing 'Classic' Internet + traffic. + + Thousands of tests have been conducted in a typical fixed residential + broadband setting. Experiments used a range of base round-trip + delays up to 100 ms and link rates up to 200 Mb/s between the data + centre and home network, with varying amounts of background traffic + in both queues. For every L4S packet, the AQM kept the average + queuing delay below 1 ms (or 2 packets where serialization delay + exceeded 1 ms on slower links), with the 99th percentile being no + worse than 2 ms. No losses at all were introduced by the L4S AQM. + Details of the extensive experiments are available in [L4Seval22] and + [DualPI2Linux]. Subjective testing using very demanding high- + bandwidth low-latency applications over a single shared access link + is also described in [L4Sdemo16] and summarized in Section 6.1 of the + L4S architecture [RFC9330]. + + In all these experiments, the host was connected to the home network + by fixed Ethernet, in order to quantify the queuing delay that can be + achieved by a user who cares about delay. It should be emphasized + that L4S support at the bottleneck link cannot 'undelay' bursts + introduced by another link on the path, for instance by legacy Wi-Fi + equipment. However, if L4S support is added to the queue feeding the + _outgoing_ WAN link of a home gateway, it would be counterproductive + not to also reduce the burstiness of the _incoming_ Wi-Fi. Also, + trials of Wi-Fi equipment with an L4S DualQ Coupled AQM on the + _outgoing_ Wi-Fi interface are in progress, and early results of an + L4S DualQ Coupled AQM in a 5G radio access network testbed with + emulated outdoor cell edge radio fading are given in [L4S_5G]. + + Unlike Diffserv EF, the L4S queue does not have to be limited to a + small proportion of the link capacity in order to achieve low delay. + The L4S queue can be filled with a heavy load of capacity-seeking + flows (Prague, BBRv2, etc.) and still achieve low delay. The L4S + queue does not rely on the presence of other traffic in the Classic + queue that can be 'overtaken'. It gives low latency to L4S traffic + whether or not there is Classic traffic. The tail latency of traffic + served by the Classic AQM is sometimes a little better, sometimes a + little worse, when a proportion of the traffic is L4S. + + The two queues are only necessary because: + + * The large variations (sawteeth) of Classic flows need roughly a + base RTT of queuing delay to ensure full utilization. + + * Scalable flows do not need a queue to keep utilization high, but + they cannot keep latency consistently low if they are mixed with + Classic traffic. + + The L4S queue has latency priority within sub-round-trip timescales, + but over longer periods the coupling from the Classic to the L4S AQM + (explained below) ensures that it does not have bandwidth priority + over the Classic queue. + +2. DualQ Coupled AQM + + There are two main aspects to the DualQ Coupled AQM approach: + + 1. The Coupled AQM that addresses throughput equivalence between + Classic (e.g., Reno or CUBIC) flows and L4S flows (that satisfy + the Prague L4S requirements). + + 2. The Dual-Queue structure that provides latency separation for L4S + flows to isolate them from the typically large Classic queue. + +2.1. Coupled AQM + + In the 1990s, the 'TCP formula' was derived for the relationship + between the steady-state congestion window, cwnd, and the drop + probability, p of standard Reno congestion control [RFC5681]. To a + first-order approximation, the steady-state cwnd of Reno is inversely + proportional to the square root of p. + + The design focuses on Reno as the worst case, because if it does no + harm to Reno, it will not harm CUBIC or any traffic designed to be + friendly to Reno. TCP CUBIC implements a Reno-friendly mode, which + is relevant for typical RTTs under 20 ms as long as the throughput of + a single flow is less than about 350 Mb/s. In such cases, it can be + assumed that CUBIC traffic behaves similarly to Reno. The term + 'Classic' will be used for the collection of Reno-friendly traffic + including CUBIC and potentially other experimental congestion + controls intended not to significantly impact the flow rate of Reno. + + A supporting paper [PI2] includes the derivation of the equivalent + rate equation for DCTCP, for which cwnd is inversely proportional to + p (not the square root), where in this case p is the ECN-marking + probability. DCTCP is not the only congestion control that behaves + like this, so the term 'Scalable' will be used for all similar + congestion control behaviours (see examples in Section 1.2). The + term 'L4S' is used for traffic driven by a Scalable congestion + control that also complies with the additional 'Prague L4S + requirements' [RFC9331]. + + For safe coexistence, under stationary conditions, a Scalable flow + has to run at roughly the same rate as a Reno TCP flow (all other + factors being equal). So the drop or marking probability for Classic + traffic, p_C, has to be distinct from the marking probability for L4S + traffic, p_L. The original ECN spec [RFC3168] required these + probabilities to be the same, but [RFC8311] updates [RFC3168] to + enable experiments in which these probabilities are different. + + Also, to remain stable, Classic sources need the network to smooth + p_C so it changes relatively slowly. It is hard for a network node + to know the RTTs of all the flows, so a Classic AQM adds a _worst- + case_ RTT of smoothing delay (about 100-200 ms). In contrast, L4S + shifts responsibility for smoothing ECN feedback to the sender, which + only delays its response by its _own_ RTT, as well as allowing a more + immediate response if necessary. + + The Coupled AQM achieves safe coexistence by making the Classic drop + probability p_C proportional to the square of the coupled L4S + probability p_CL. p_CL is an input to the instantaneous L4S marking + probability p_L, but it changes as slowly as p_C. This makes the + Reno flow rate roughly equal the DCTCP flow rate, because the + squaring of p_CL counterbalances the square root of p_C in the 'TCP + formula' of Classic Reno congestion control. + + Stating this as a formula, the relation between Classic drop + probability, p_C, and the coupled L4S probability p_CL needs to take + the following form: + + p_C = ( p_CL / k )^2, (1) + + where k is the constant of proportionality, which is termed the + 'coupling factor'. + +2.2. Dual Queue + + Classic traffic needs to build a large queue to prevent + underutilization. Therefore, a separate queue is provided for L4S + traffic, and it is scheduled with priority over the Classic queue. + Priority is conditional to prevent starvation of Classic traffic in + certain conditions (see Section 2.4). + + Nonetheless, coupled marking ensures that giving priority to L4S + traffic still leaves the right amount of spare scheduling time for + Classic flows to each get equivalent throughput to DCTCP flows (all + other factors, such as RTT, being equal). + +2.3. Traffic Classification + + Both the Coupled AQM and DualQ mechanisms need an identifier to + distinguish L4S (L) and Classic (C) packets. Then the coupling + algorithm can achieve coexistence without having to inspect flow + identifiers, because it can apply the appropriate marking or dropping + probability to all flows of each type. A separate specification + [RFC9331] requires the network to treat the ECT(1) and CE codepoints + of the ECN field as this identifier. An additional process document + has proved necessary to make the ECT(1) codepoint available for + experimentation [RFC8311]. + + For policy reasons, an operator might choose to steer certain packets + (e.g., from certain flows or with certain addresses) out of the L + queue, even though they identify themselves as L4S by their ECN + codepoints. In such cases, the L4S ECN protocol [RFC9331] states + that the device "MUST NOT alter the end-to-end L4S ECN identifier" so + that it is preserved end to end. The aim is that each operator can + choose how it treats L4S traffic locally, but an individual operator + does not alter the identification of L4S packets, which would prevent + other operators downstream from making their own choices on how to + treat L4S traffic. + + In addition, an operator could use other identifiers to classify + certain additional packet types into the L queue that it deems will + not risk harm to the L4S service, for instance, addresses of specific + applications or hosts; specific Diffserv codepoints such as EF, + Voice-Admit, or the Non-Queue-Building (NQB) per-hop behaviour; or + certain protocols (e.g., ARP and DNS) (see Section 5.4.1 of + [RFC9331]. Note that [RFC9331] states that "a network node MUST NOT + change Not-ECT or ECT(0) in the IP-ECN field into an L4S identifier." + Thus, the L queue is not solely an L4S queue; it can be considered + more generally as a low-latency queue. + +2.4. Overall DualQ Coupled AQM Structure + + Figure 1 shows the overall structure that any DualQ Coupled AQM is + likely to have. This schematic is intended to aid understanding of + the current designs of DualQ Coupled AQMs. However, it is not + intended to preclude other innovative ways of satisfying the + normative requirements in Section 2.5 that minimally define a DualQ + Coupled AQM. Also, the schematic only illustrates operation under + normally expected circumstances; behaviour under overload or with + operator-specific classifiers is deferred to Section 2.5.1.1. + + The classifier on the left separates incoming traffic between the two + queues (L and C). Each queue has its own AQM that determines the + likelihood of marking or dropping (p_L and p_C). In [PI2], it has + been proved that it is preferable to control load with a linear + controller, then square the output before applying it as a drop + probability to Reno-friendly traffic (because Reno congestion control + decreases its load proportional to the square root of the increase in + drop). So, the AQM for Classic traffic needs to be implemented in + two stages: i) a base stage that outputs an internal probability p' + (pronounced 'p-prime') and ii) a squaring stage that outputs p_C, + where + + p_C = (p')^2. (2) + + Substituting for p_C in equation (1) gives + + p' = p_CL / k. + + So the slow-moving input to ECN marking in the L queue (the coupled + L4S probability) is + + p_CL = k*p'. (3) + + The actual ECN-marking probability p_L that is applied to the L queue + needs to track the immediate L queue delay under L-only congestion + conditions, as well as track p_CL under coupled congestion + conditions. So the L queue uses a 'Native AQM' that calculates a + probability p'_L as a function of the instantaneous L queue delay. + And given the L queue has conditional priority over the C queue, + whenever the L queue grows, the AQM ought to apply marking + probability p'_L, but p_L ought to not fall below p_CL. This + suggests + + p_L = max(p'_L, p_CL), (4) + + which has also been found to work very well in practice. + + The two transformations of p' in equations (2) and (3) implement the + required coupling given in equation (1) earlier. + + The constant of proportionality or coupling factor, k, in equation + (1) determines the ratio between the congestion probabilities (loss + or marking) experienced by L4S and Classic traffic. Thus, k + indirectly determines the ratio between L4S and Classic flow rates, + because flows (assuming they are responsive) adjust their rate in + response to congestion probability. Appendix C.2 gives guidance on + the choice of k and its effect on relative flow rates. + + _________ + | | ,------. + L4S (L) queue | |===>| ECN | + ,'| _______|_| |marker|\ + <' | | `------'\\ + //`' v ^ p_L \\ + // ,-------. | \\ + // |Native |p'_L | \\,. + // | L4S |--->(MAX) < | ___ + ,----------.// | AQM | ^ p_CL `\|.'Cond-`. + | IP-ECN |/ `-------' | / itional \ + ==>|Classifier| ,-------. (k*p') [ priority]==> + | |\ | Base | | \scheduler/ + `----------'\\ | AQM |---->: ,'|`-.___.-' + \\ | |p' | <' | + \\ `-------' (p'^2) //`' + \\ ^ | // + \\,. | v p_C // + < | _________ .------.// + `\| | | | Drop |/ + Classic (C) |queue |===>|/mark | + __|______| `------' + + Legend: ===> traffic flow + ---> control dependency + + Figure 1: DualQ Coupled AQM Schematic + + After the AQMs have applied their dropping or marking, the scheduler + forwards their packets to the link. Even though the scheduler gives + priority to the L queue, it is not as strong as the coupling from the + C queue. This is because, as the C queue grows, the 'Base AQM' + applies more congestion signals to L traffic (as well as to C). As L + flows reduce their rate in response, they use less than the + scheduling share for L traffic. So, because the scheduler is work + preserving, it schedules any C traffic in the gaps. + + Giving priority to the L queue has the benefit of very low L queue + delay, because the L queue is kept empty whenever L traffic is + controlled by the coupling. Also, there only has to be a coupling in + one direction -- from Classic to L4S. Priority has to be conditional + in some way to prevent the C queue from being starved in the short + term (see Section 4.2.2) to give C traffic a means to push in, as + explained next. With normal responsive L traffic, the coupled ECN + marking gives C traffic the ability to push back against even strict + priority, by congestion marking the L traffic to make it yield some + space. However, if there is just a small finite set of C packets + (e.g., a DNS request or an initial window of data), some Classic AQMs + will not induce enough ECN marking in the L queue, no matter how long + the small set of C packets waits. Then, if the L queue happens to + remain busy, the C traffic would never get a scheduling opportunity + from a strict priority scheduler. Ideally, the Classic AQM would be + designed to increase the coupled marking the longer that C packets + have been waiting, but this is not always practical -- hence the need + for L priority to be conditional. Giving a small weight or limited + waiting time for C traffic improves response times for short Classic + messages, such as DNS requests, and improves Classic flow startup + because immediate capacity is available. + + Example DualQ Coupled AQM algorithms called 'DualPI2' and 'Curvy RED' + are given in Appendices A and B. Either example AQM can be used to + couple packet marking and dropping across a DualQ: + + * DualPI2 uses a Proportional Integral (PI) controller as the Base + AQM. Indeed, this Base AQM with just the squared output and no + L4S queue can be used as a drop-in replacement for PIE [RFC8033], + in which case it is just called PI2 [PI2]. PI2 is a principled + simplification of PIE that is both more responsive and more stable + in the face of dynamically varying load. + + * Curvy RED is derived from RED [RED], except its configuration + parameters are delay-based to make them insensitive to link rate, + and it requires fewer operations per packet than RED. However, + DualPI2 is more responsive and stable over a wider range of RTTs + than Curvy RED. As a consequence, at the time of writing, DualPI2 + has attracted more development and evaluation attention than Curvy + RED, leaving the Curvy RED design not so fully evaluated. + + Both AQMs regulate their queue against targets configured in units of + time rather than bytes. As already explained, this ensures + configuration can be invariant for different drain rates. With AQMs + in a DualQ structure this is particularly important because the drain + rate of each queue can vary rapidly as flows for the two queues + arrive and depart, even if the combined link rate is constant. + + It would be possible to control the queues with other alternative + AQMs, as long as the normative requirements (those expressed in + capitals) in Section 2.5 are observed. + + The two queues could optionally be part of a larger queuing + hierarchy, such as the initial example ideas in [L4S-DIFFSERV]. + +2.5. Normative Requirements for a DualQ Coupled AQM + + The following requirements are intended to capture only the essential + aspects of a DualQ Coupled AQM. They are intended to be independent + of the particular AQMs implemented for each queue but to still define + the DualQ framework built around those AQMs. + +2.5.1. Functional Requirements + + A DualQ Coupled AQM implementation MUST comply with the prerequisite + L4S behaviours for any L4S network node (not just a DualQ) as + specified in Section 5 of [RFC9331]. These primarily concern + classification and re-marking as briefly summarized earlier in + Section 2.3. But Section 5.5 of [RFC9331] also gives guidance on + reducing the burstiness of the link technology underlying any L4S + AQM. + + A DualQ Coupled AQM implementation MUST utilize two queues, each with + an AQM algorithm. + + The AQM algorithm for the low-latency (L) queue MUST be able to apply + ECN marking to ECN-capable packets. + + The scheduler draining the two queues MUST give L4S packets priority + over Classic, although priority MUST be bounded in order not to + starve Classic traffic (see Section 4.2.2). The scheduler SHOULD be + work-conserving, or otherwise close to work-conserving. This is + because Classic traffic needs to be able to efficiently fill any + space left by L4S traffic even though the scheduler would otherwise + allocate it to L4S. + + [RFC9331] defines the meaning of an ECN marking on L4S traffic, + relative to drop of Classic traffic. In order to ensure coexistence + of Classic and Scalable L4S traffic, it says, "the likelihood that + the AQM drops a Not-ECT Classic packet (p_C) MUST be roughly + proportional to the square of the likelihood that it would have + marked it if it had been an L4S packet (p_L)." The term 'likelihood' + is used to allow for marking and dropping to be either probabilistic + or deterministic. + + For the current specification, this translates into the following + requirement. A DualQ Coupled AQM MUST apply ECN marking to traffic + in the L queue that is no lower than that derived from the likelihood + of drop (or ECN marking) in the Classic queue using equation (1). + + The constant of proportionality, k, in equation (1) determines the + relative flow rates of Classic and L4S flows when the AQM concerned + is the bottleneck (all other factors being equal). The L4S ECN + protocol [RFC9331] says, "The constant of proportionality (k) does + not have to be standardised for interoperability, but a value of 2 is + RECOMMENDED." + + Assuming Scalable congestion controls for the Internet will be as + aggressive as DCTCP, this will ensure their congestion window will be + roughly the same as that of a Standards Track TCP Reno congestion + control (Reno) [RFC5681] and other Reno-friendly controls, such as + TCP CUBIC in its Reno-friendly mode. + + The choice of k is a matter of operator policy, and operators MAY + choose a different value using the guidelines in Appendix C.2. + + If multiple customers or users share capacity at a bottleneck (e.g., + in the Internet access link of a campus network), the operator's + choice of k will determine capacity sharing between the flows of + different customers. However, on the public Internet, access network + operators typically isolate customers from each other with some form + of Layer 2 multiplexing (OFDM(A) in DOCSIS 3.1, CDMA in 3G, and SC- + FDMA in LTE) or Layer 3 scheduling (Weighted Round Robin (WRR) for + DSL) rather than relying on host congestion controls to share + capacity between customers [RFC0970]. In such cases, the choice of k + will solely affect relative flow rates within each customer's access + capacity, not between customers. Also, k will not affect relative + flow rates at any times when all flows are Classic or all flows are + L4S, and it will not affect the relative throughput of small flows. + + +2.5.1.1. Requirements in Unexpected Cases + + The flexibility to allow operator-specific classifiers (Section 2.3) + leads to the need to specify what the AQM in each queue ought to do + with packets that do not carry the ECN field expected for that queue. + It is expected that the AQM in each queue will inspect the ECN field + to determine what sort of congestion notification to signal, then it + will decide whether to apply congestion notification to this + particular packet, as follows: + + * If a packet that does not carry an ECT(1) or a CE codepoint is + classified into the L queue, then: + + - if the packet is ECT(0), the L AQM SHOULD apply CE marking + using a probability appropriate to Classic congestion control + and appropriate to the target delay in the L queue + + - if the packet is Not-ECT, the appropriate action depends on + whether some other function is protecting the L queue from + misbehaving flows (e.g., per-flow queue protection + [DOCSIS-Q-PROT] or latency policing): + + o if separate queue protection is provided, the L AQM SHOULD + ignore the packet and forward it unchanged, meaning it + should not calculate whether to apply congestion + notification, and it should neither drop nor CE mark the + packet (for instance, the operator might classify EF traffic + that is unresponsive to drop into the L queue, alongside + responsive L4S-ECN traffic) + + o if separate queue protection is not provided, the L AQM + SHOULD apply drop using a drop probability appropriate to + Classic congestion control and to the target delay in the L + queue + + * If a packet that carries an ECT(1) codepoint is classified into + the C queue: + + - the C AQM SHOULD apply CE marking using the Coupled AQM + probability p_CL (= k*p'). + + The above requirements are worded as "SHOULD"s, because operator- + specific classifiers are for flexibility, by definition. Therefore, + alternative actions might be appropriate in the operator's specific + circumstances. An example would be where the operator knows that + certain legacy traffic set to one codepoint actually has a congestion + response associated with another codepoint. + + If the DualQ Coupled AQM has detected overload, it MUST introduce + Classic drop to both types of ECN-capable traffic until the overload + episode has subsided. Introducing drop if ECN marking is + persistently high is recommended in Section 7 of the ECN spec + [RFC3168] and in Section 4.2.1 of the AQM Recommendations [RFC7567]. + +2.5.2. Management Requirements + + +2.5.2.1. Configuration + + By default, a DualQ Coupled AQM SHOULD NOT need any configuration for + use at a bottleneck on the public Internet [RFC7567]. The following + parameters MAY be operator-configurable, e.g., to tune for non- + Internet settings: + + * Optional packet classifier(s) to use in addition to the ECN field + (see Section 2.3). + + * Expected typical RTT, which can be used to determine the queuing + delay of the Classic AQM at its operating point, in order to + prevent typical lone flows from underutilizing capacity. For + example: + + - for the PI2 algorithm (Appendix A), the queuing delay target is + dependent on the typical RTT. + + - for the Curvy RED algorithm (Appendix B), the queuing delay at + the desired operating point of the curvy ramp is configured to + encompass a typical RTT. + + - if another Classic AQM was used, it would be likely to need an + operating point for the queue based on the typical RTT, and if + so, it SHOULD be expressed in units of time. + + An operating point that is manually calculated might be directly + configurable instead, e.g., for links with large numbers of flows + where underutilization by a single flow would be unlikely. + + * Expected maximum RTT, which can be used to set the stability + parameter(s) of the Classic AQM. For example: + + - for the PI2 algorithm (Appendix A), the gain parameters of the + PI algorithm depend on the maximum RTT. + + - for the Curvy RED algorithm (Appendix B), the smoothing + parameter is chosen to filter out transients in the queue + within a maximum RTT. + + Any stability parameter that is manually calculated assuming a + maximum RTT might be directly configurable instead. + + * Coupling factor, k (see Appendix C.2). + + * A limit to the conditional priority of L4S. This is scheduler- + dependent, but it SHOULD be expressed as a relation between the + max delay of a C packet and an L packet. For example: + + - for a WRR scheduler, a weight ratio between L and C of w:1 + means that the maximum delay of a C packet is w times that of + an L packet. + + - for a time-shifted FIFO (TS-FIFO) scheduler (see + Section 4.2.2), a time-shift of tshift means that the maximum + delay to a C packet is tshift greater than that of an L packet. + tshift could be expressed as a multiple of the typical RTT + rather than as an absolute delay. + + * The maximum Classic ECN-marking probability, p_Cmax, before + introducing drop. + +2.5.2.2. Monitoring + + An experimental DualQ Coupled AQM SHOULD allow the operator to + monitor each of the following operational statistics on demand, per + queue and per configurable sample interval, for performance + monitoring and perhaps also for accounting in some cases: + + * bits forwarded, from which utilization can be calculated; + + * total packets in the three categories: arrived, presented to the + AQM, and forwarded. The difference between the first two will + measure any non-AQM tail discard. The difference between the last + two will measure proactive AQM discard; + + * ECN packets marked, non-ECN packets dropped, and ECN packets + dropped, which can be combined with the three total packet counts + above to calculate marking and dropping probabilities; and + + * queue delay (not including serialization delay of the head packet + or medium acquisition delay) -- see further notes below. + + Unlike the other statistics, queue delay cannot be captured in a + simple accumulating counter. Therefore, the type of queue delay + statistics produced (mean, percentiles, etc.) will depend on + implementation constraints. To facilitate comparative evaluation + of different implementations and approaches, an implementation + SHOULD allow mean and 99th percentile queue delay to be derived + (per queue per sample interval). A relatively simple way to do + this would be to store a coarse-grained histogram of queue delay. + This could be done with a small number of bins with configurable + edges that represent contiguous ranges of queue delay. Then, over + a sample interval, each bin would accumulate a count of the number + of packets that had fallen within each range. The maximum queue + delay per queue per interval MAY also be recorded, to aid + diagnosis of faults and anomalous events. + +2.5.2.3. Anomaly Detection + + An experimental DualQ Coupled AQM SHOULD asynchronously report the + following data about anomalous conditions: + + * Start time and duration of overload state. + + A hysteresis mechanism SHOULD be used to prevent flapping in and + out of overload causing an event storm. For instance, exiting + from overload state could trigger one report but also latch a + timer. Then, during that time, if the AQM enters and exits + overload state any number of times, the duration in overload state + is accumulated, but no new report is generated until the first + time the AQM is out of overload once the timer has expired. + +2.5.2.4. Deployment, Coexistence, and Scaling + + [RFC5706] suggests that deployment, coexistence, and scaling should + also be covered as management requirements. The raison d'etre of the + DualQ Coupled AQM is to enable deployment and coexistence of Scalable + congestion controls (as incremental replacements for today's Reno- + friendly controls that do not scale with bandwidth-delay product). + Therefore, there is no need to repeat these motivating issues here + given they are already explained in the Introduction and detailed in + the L4S architecture [RFC9330]. + + The descriptions of specific DualQ Coupled AQM algorithms in the + appendices cover scaling of their configuration parameters, e.g., + with respect to RTT and sampling frequency. + +3. IANA Considerations + + This document has no IANA actions. + +4. Security Considerations + + +4.1. Low Delay without Requiring Per-flow Processing + + The L4S architecture [RFC9330] compares the DualQ and FQ approaches + to L4S. The privacy considerations section in that document + motivates the DualQ on the grounds that users who want to encrypt + application flow identifiers, e.g., in IPsec or other encrypted VPN + tunnels, don't have to sacrifice low delay ([RFC8404] encourages + avoidance of such privacy compromises). + + The security considerations section of the L4S architecture [RFC9330] + also includes subsections on policing of relative flow rates + (Section 8.1) and on policing of flows that cause excessive queuing + delay (Section 8.2). It explains that the interests of users do not + collide in the same way for delay as they do for bandwidth. For + someone to get more of the bandwidth of a shared link, someone else + necessarily gets less (a 'zero-sum game'), whereas queuing delay can + be reduced for everyone, without any need for someone else to lose + out. It also explains that, on the current Internet, scheduling + usually enforces separation of bandwidth between 'sites' (e.g., + households, businesses, or mobile users), but it is not common to + need to schedule or police the bandwidth used by individual + application flows. + + By the above arguments, per-flow rate policing might not be + necessary, and in trusted environments (e.g., private data centres), + it is certainly unlikely to be needed. Therefore, because it is hard + to avoid complexity and unintended side effects with per-flow rate + policing, it needs to be separable from a basic AQM, as an option, + under policy control. On this basis, the DualQ Coupled AQM provides + low delay without prejudging the question of per-flow rate policing. + + Nonetheless, the interests of users or flows might conflict, e.g., in + case of accident or malice. Then per-flow rate control could be + necessary. If per-flow rate control is needed, it can be provided as + a modular addition to a DualQ. And similarly, if protection against + excessive queue delay is needed, a per-flow queue protection option + can be added to a DualQ (e.g., [DOCSIS-Q-PROT]). + +4.2. Handling Unresponsive Flows and Overload + + In the absence of any per-flow control, it is important that the + basic DualQ Coupled AQM gives unresponsive flows no more throughput + advantage than a single-queue AQM would, and that it at least handles + overload situations. Overload means that incoming load significantly + or persistently exceeds output capacity, but it is not intended to be + a precise term -- significant and persistent are matters of degree. + + A trade-off needs to be made between complexity and the risk of + either traffic class harming the other. In overloaded conditions, + the higher priority L4S service will have to sacrifice some aspect of + its performance. Depending on the degree of overload, alternative + solutions may relax a different factor: for example, throughput, + delay, or drop. These choices need to be made either by the + developer or by operator policy, rather than by the IETF. Subsequent + subsections discuss handling different degrees of overload: + + * Unresponsive flows (L and/or C) but not overloaded, i.e., the sum + of unresponsive load before adding any responsive traffic is below + capacity. + + This case is handled by the regular Coupled DualQ (Section 2.1) + but not discussed there. So below, Section 4.2.1 explains the + design goal and how it is achieved in practice. + + * Unresponsive flows (L and/or C) causing persistent overload, i.e., + the sum of unresponsive load even before adding any responsive + traffic persistently exceeds capacity. + + This case is not covered by the regular Coupled DualQ mechanism + (Section 2.1), but the last paragraph in Section 2.5.1.1 sets + out a requirement to handle the case where ECN-capable traffic + could starve non-ECN-capable traffic. Section 4.2.3 below + discusses the general options and gives specific examples. + + * Short-term overload that lies between the 'not overloaded' and + 'persistently overloaded' cases. + + For the period before overload is deemed persistent, + Section 4.2.2 discusses options for more immediate mechanisms + at the scheduler timescale. These prevent short-term + starvation of the C queue by making the priority of the L queue + conditional, as required in Section 2.5.1. + +4.2.1. Unresponsive Traffic without Overload + + When one or more L flows and/or C flows are unresponsive, but their + total load is within the link capacity so that they do not saturate + the coupled marking (below 100%), the goal of a DualQ AQM is to + behave no worse than a single-queue AQM. + + Tests have shown that this is indeed the case with no additional + mechanism beyond the regular Coupled DualQ of Section 2.1 (see the + results of 'overload experiments' in [L4Seval22]). Perhaps + counterintuitively, whether the unresponsive flow classifies itself + into the L or the C queue, the DualQ system behaves as if it has + subtracted from the overall link capacity. Then, the coupling shares + out the remaining capacity between any competing responsive flows (in + either queue). See also Section 4.2.2, which discusses scheduler- + specific details. + +4.2.2. Avoiding Short-Term Classic Starvation: Sacrifice L4S Throughput + or Delay? + + Priority of L4S is required to be conditional (see Sections 2.4 and + 2.5.1) to avoid short-term starvation of Classic. Otherwise, as + explained in Section 2.4, even a lone responsive L4S flow could + temporarily block a small finite set of C packets (e.g., an initial + window or DNS request). The blockage would only be brief, but it + could be longer for certain AQM implementations that can only + increase the congestion signal coupled from the C queue when C + packets are actually being dequeued. There is then the question of + whether to sacrifice L4S throughput or L4S delay (or some other + policy) to make the priority conditional: + + Sacrifice L4S throughput: + By using WRR as the conditional priority scheduler, the L4S + service can sacrifice some throughput during overload. This can + be thought of as guaranteeing either a minimum throughput service + for Classic traffic or a maximum delay for a packet at the head of + the Classic queue. + + | Cautionary note: a WRR scheduler can only guarantee Classic + | throughput if Classic sources are sending enough to use it + | -- congestion signals can undermine scheduling because they + | determine how much responsive traffic of each class arrives + | for scheduling in the first place. This is why scheduling + | is only relied on to handle short-term starvation, until + | congestion signals build up and the sources react. Even + | during long-term overload (discussed more fully in + | Section 4.2.3), it's pragmatic to discard packets from both + | queues, which again thins the traffic before it reaches the + | scheduler. This is because a scheduler cannot be relied on + | to handle long-term overload since the right scheduler + | weight cannot be known for every scenario. + + The scheduling weight of the Classic queue should be small (e.g., + 1/16). In most traffic scenarios, the scheduler will not + interfere and it will not need to, because the coupling mechanism + and the end systems will determine the share of capacity across + both queues as if it were a single pool. However, if L4S traffic + is over-aggressive or unresponsive, the scheduler weight for + Classic traffic will at least be large enough to ensure it does + not starve in the short term. + + Although WRR scheduling is only expected to address short-term + overload, there are (somewhat rare) cases when WRR has an effect + on capacity shares over longer timescales. But its effect is + minor, and it certainly does no harm. Specifically, in cases + where the ratio of L4S to Classic flows (e.g., 19:1) is greater + than the ratio of their scheduler weights (e.g., 15:1), the L4S + flows will get less than an equal share of the capacity, but only + slightly. For instance, with the example numbers given, each L4S + flow will get (15/16)/19 = 4.9% when ideally each would get 1/20 = + 5%. In the rather specific case of an unresponsive flow taking up + just less than the capacity set aside for L4S (e.g., 14/16 in the + above example), using WRR could significantly reduce the capacity + left for any responsive L4S flows. + + The scheduling weight of the Classic queue should not be too + small, otherwise a C packet at the head of the queue could be + excessively delayed by a continually busy L queue. For instance, + if the Classic weight is 1/16, the maximum that a Classic packet + at the head of the queue can be delayed by L traffic is the + serialization delay of 15 MTU-sized packets. + + Sacrifice L4S delay: + The operator could choose to control overload of the Classic queue + by allowing some delay to 'leak' across to the L4S queue. The + scheduler can be made to behave like a single FIFO queue with + different service times by implementing a very simple conditional + priority scheduler that could be called a "time-shifted FIFO" (TS- + FIFO) (see the Modifier Earliest Deadline First (MEDF) scheduler + [MEDF]). This scheduler adds tshift to the queue delay of the + next L4S packet, before comparing it with the queue delay of the + next Classic packet, then it selects the packet with the greater + adjusted queue delay. + + Under regular conditions, the TS-FIFO scheduler behaves just like + a strict priority scheduler. But under moderate or high overload, + it prevents starvation of the Classic queue, because the time- + shift (tshift) defines the maximum extra queuing delay of Classic + packets relative to L4S. This would control milder overload of + responsive traffic by introducing delay to defer invoking the + overload mechanisms in Section 4.2.3, particularly when close to + the maximum congestion signal. + + The example implementations in Appendices A and B could both be + implemented with either policy. + +4.2.3. L4S ECN Saturation: Introduce Drop or Delay? + + This section concerns persistent overload caused by unresponsive L + and/or C flows. To keep the throughput of both L4S and Classic flows + roughly equal over the full load range, a different control strategy + needs to be defined above the point where the L4S AQM persistently + saturates to an ECN marking probability of 100%, leaving no room to + push back the load any harder. L4S ECN marking will saturate first + (assuming the coupling factor k>1), even though saturation could be + caused by the sum of unresponsive traffic in either or both queues + exceeding the link capacity. + + The term 'unresponsive' includes cases where a flow becomes + temporarily unresponsive, for instance, a real-time flow that takes a + while to adapt its rate in response to congestion, or a standard Reno + flow that is normally responsive, but above a certain congestion + level it will not be able to reduce its congestion window below the + allowed minimum of 2 segments [RFC5681], effectively becoming + unresponsive. (Note that L4S traffic ought to remain responsive + below a window of 2 segments. See the L4S requirements [RFC9331].) + + Saturation raises the question of whether to relieve congestion by + introducing some drop into the L4S queue or by allowing delay to grow + in both queues (which could eventually lead to drop due to buffer + exhaustion anyway): + + Drop on Saturation: + Persistent saturation can be defined by a maximum threshold for + coupled L4S ECN marking (assuming k>1) before saturation starts to + make the flow rates of the different traffic types diverge. Above + that, the drop probability of Classic traffic is applied to all + packets of all traffic types. Then experiments have shown that + queuing delay can be kept at the target in any overload situation, + including with unresponsive traffic, and no further measures are + required (Section 4.2.3.1). + + Delay on Saturation: + When L4S marking saturates, instead of introducing L4S drop, the + drop and marking probabilities of both queues could be capped. + Beyond that, delay will grow either solely in the queue with + unresponsive traffic (if WRR is used) or in both queues (if TS- + FIFO is used). In either case, the higher delay ought to control + temporary high congestion. If the overload is more persistent, + eventually the combined DualQ will overflow and tail drop will + control congestion. + + The example implementation in Appendix A solely applies the "drop on + saturation" policy. The DOCSIS specification of a DualQ Coupled AQM + [DOCSIS3.1] also implements the 'drop on saturation' policy with a + very shallow L buffer. However, the addition of DOCSIS per-flow + Queue Protection [DOCSIS-Q-PROT] turns this into 'delay on + saturation' by redirecting some packets of the flow or flows that are + most responsible for L queue overload into the C queue, which has a + higher delay target. If overload continues, this again becomes 'drop + on saturation' as the level of drop in the C queue rises to maintain + the target delay of the C queue. + +4.2.3.1. Protecting against Overload by Unresponsive ECN-Capable + Traffic + + Without a specific overload mechanism, unresponsive traffic would + have a greater advantage if it were also ECN-capable. The advantage + is undetectable at normal low levels of marking. However, it would + become significant with the higher levels of marking typical during + overload, when it could evade a significant degree of drop. This is + an issue whether the ECN-capable traffic is L4S or Classic. + + This raises the question of whether and when to introduce drop of + ECN-capable traffic, as required by both Section 7 of the ECN spec + [RFC3168] and Section 4.2.1 of the AQM recommendations [RFC7567]. + + As an example, experiments with the DualPI2 AQM (Appendix A) have + shown that introducing 'drop on saturation' at 100% coupled L4S + marking addresses this problem with unresponsive ECN, and it also + addresses the saturation problem. At saturation, DualPI2 switches + into overload mode, where the Base AQM is driven by the max delay of + both queues, and it introduces probabilistic drop to both queues + equally. It leaves only a small range of congestion levels just + below saturation where unresponsive traffic gains any advantage from + using the ECN capability (relative to being unresponsive without + ECN), and the advantage is hardly detectable (see [DualQ-Test] and + section IV-G of [L4Seval22]). Also, overload with an unresponsive + ECT(1) flow gets no more bandwidth advantage than with ECT(0). + +5. References + +5.1. Normative References + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + <https://www.rfc-editor.org/info/rfc2119>. + + [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition + of Explicit Congestion Notification (ECN) to IP", + RFC 3168, DOI 10.17487/RFC3168, September 2001, + <https://www.rfc-editor.org/info/rfc3168>. + + [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion + Notification (ECN) Experimentation", RFC 8311, + DOI 10.17487/RFC8311, January 2018, + <https://www.rfc-editor.org/info/rfc8311>. + + [RFC9331] De Schepper, K. and B. Briscoe, Ed., "The Explicit + Congestion Notification (ECN) Protocol for Low Latency, + Low Loss, and Scalable Throughput (L4S)", RFC 9331, + DOI 10.17487/RFC9331, January 2023, + <https://www.rfc-editor.org/info/rfc9331>. + +5.2. Informative References + + [Alizadeh-stability] + Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis + of DCTCP: Stability, Convergence, and Fairness", + SIGMETRICS '11: Proceedings of the ACM SIGMETRICS Joint + International Conference on Measurement and Modeling of + Computer Systems, pp. 73-84, DOI 10.1145/1993744.1993753, + June 2011, <https://dl.acm.org/citation.cfm?id=1993753>. + + [AQMmetrics] + Kwon, M. and S. Fahmy, "A Comparison of Load-based and + Queue-based Active Queue Management Algorithms", Proc. + Int'l Soc. for Optical Engineering (SPIE), Vol. 4866, pp. + 35-46, DOI 10.1117/12.473021, 2002, + <https://www.cs.purdue.edu/homes/fahmy/papers/ldc.pdf>. + + [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An + Algorithm for Increasing the Robustness of RED's Active + Queue Management", ACIRI Technical Report 301, August + 2001, <https://www.icsi.berkeley.edu/icsi/node/2032>. + + [BBR-CC] Cardwell, N., Cheng, Y., Hassas Yeganeh, S., Swett, I., + and V. Jacobson, "BBR Congestion Control", Work in + Progress, Internet-Draft, draft-cardwell-iccrg-bbr- + congestion-control-02, 7 March 2022, + <https://datatracker.ietf.org/doc/html/draft-cardwell- + iccrg-bbr-congestion-control-02>. + + [BBRv2] "TCP BBR v2 Alpha/Preview Release", commit 17700ca, June + 2022, <https://github.com/google/bbr>. + + [Boru20] Boru Oljira, D., Grinnemo, K-J., Brunstrom, A., and J. + Taheri, "Validating the Sharing Behavior and Latency + Characteristics of the L4S Architecture", ACM SIGCOMM + Computer Communication Review, Vol. 50, Issue 2, pp. + 37-44, DOI 10.1145/3402413.3402419, May 2020, + <https://dl.acm.org/doi/abs/10.1145/3402413.3402419>. + + [CCcensus19] + Mishra, A., Sun, X., Jain, A., Pande, S., Joshi, R., and + B. Leong, "The Great Internet TCP Congestion Control + Census", Proceedings of the ACM on Measurement and + Analysis of Computing Systems, Vol. 3, Issue 3, Article + No. 45, pp. 1-24, DOI 10.1145/3366693, December 2019, + <https://doi.org/10.1145/3366693>. + + [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", + ACM Queue, Vol. 10, Issue 5, May 2012, + <https://queue.acm.org/issuedetail.cfm?issue=2208917>. + + [CRED_Insights] + Briscoe, B. and K. De Schepper, "Insights from Curvy RED + (Random Early Detection)", BT Technical Report, TR- + TUB8-2015-003, DOI 10.48550/arXiv.1904.07339, August 2015, + <https://arxiv.org/abs/1904.07339>. + + [DOCSIS-Q-PROT] + Briscoe, B., Ed. and G. White, "The DOCSIS® Queue + Protection to Preserve Low Latency", Work in Progress, + Internet-Draft, draft-briscoe-docsis-q-protection-06, 13 + May 2022, <https://datatracker.ietf.org/doc/html/draft- + briscoe-docsis-q-protection-06>. + + [DOCSIS3.1] + CableLabs, "DOCSIS 3.1 MAC and Upper Layer Protocols + Interface Specification", CM-SP-MULPIv3.1, Data-Over-Cable + Service Interface Specifications DOCSIS 3.1 Version I17 or + later, January 2019, <https://specification- + search.cablelabs.com/CM-SP-MULPIv3>. + + [DualPI2Linux] + Albisser, O., De Schepper, K., Briscoe, B., Tilmans, O., + and H. Steen, "DUALPI2 - Low Latency, Low Loss and + Scalable (L4S) AQM", Proceedings of Linux Netdev 0x13 , + March 2019, <https://www.netdevconf.org/0x13/ + session.html?talk-DUALPI2-AQM>. + + [DualQ-Test] + Steen, H., "Destruction Testing: Ultra-Low Delay using + Dual Queue Coupled Active Queue Management", Master's + Thesis, Department of Informatics, University of Oslo, May + 2017. + + [Dukkipati06] + Dukkipati, N. and N. McKeown, "Why Flow-Completion Time is + the Right Metric for Congestion Control", ACM SIGCOMM + Computer Communication Review, Vol. 36, Issue 1, pp. + 59-62, DOI 10.1145/1111322.1111336, January 2006, + <https://dl.acm.org/doi/10.1145/1111322.1111336>. + + [Heist21] "L4S Tests", commit e21cd91, August 2021, + <https://github.com/heistp/l4s-tests>. + + [L4S-DIFFSERV] + Briscoe, B., "Interactions between Low Latency, Low Loss, + Scalable Throughput (L4S) and Differentiated Services", + Work in Progress, Internet-Draft, draft-briscoe-tsvwg-l4s- + diffserv-02, 4 November 2018, + <https://datatracker.ietf.org/doc/html/draft-briscoe- + tsvwg-l4s-diffserv-02>. + + [L4Sdemo16] + Bondarenko, O., De Schepper, K., Tsang, I., Briscoe, B., + Petlund, A., and C. Griwodz, "Ultra-Low Delay for All: + Live Experience, Live Analysis", Proceedings of the 7th + International Conference on Multimedia Systems, Article + No. 33, pp. 1-4, DOI 10.1145/2910017.2910633, May 2016, + <https://dl.acm.org/citation.cfm?doid=2910017.2910633>. + + [L4Seval22] + De Schepper, K., Albisser, O., Tilmans, O., and B. + Briscoe, "Dual Queue Coupled AQM: Deployable Very Low + Queuing Delay for All", Preprint submitted to IEEE/ACM + Transactions on Networking, DOI 10.48550/arXiv.2209.01078, + September 2022, <https://arxiv.org/abs/2209.01078>. + + [L4S_5G] Willars, P., Wittenmark, E., Ronkainen, H., Östberg, C., + Johansson, I., Strand, J., Lédl, P., and D. Schnieders, + "Enabling time-critical applications over 5G with rate + adaptation", Ericsson - Deutsche Telekom White Paper, + BNEW-21:025455, May 2021, <https://www.ericsson.com/en/ + reports-and-papers/white-papers/enabling-time-critical- + applications-over-5g-with-rate-adaptation>. + + [Labovitz10] + Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, + J., and F. Jahanian, "Internet Inter-Domain Traffic", ACM + SIGCOMM Computer Communication Review, Vol. 40, Issue 4, + pp. 75-86, DOI 10.1145/1851275.1851194, August 2010, + <https://doi.org/10.1145/1851275.1851194>. + + [LLD] White, G., Sundaresan, K., and B. Briscoe, "Low Latency + DOCSIS: Technology Overview", CableLabs White Paper, + February 2019, <https://cablela.bs/low-latency-docsis- + technology-overview-february-2019>. + + [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - A + Simple Scheduling Algorithm for Two Real-Time Transport + Service Classes with Application in the UTRAN", Proc. IEEE + Conference on Computer Communications (INFOCOM'03), Vol. + 2, pp. 1116-1122, DOI 10.1109/INFCOM.2003.1208948, March + 2003, <https://doi.org/10.1109/INFCOM.2003.1208948>. + + [PI2] De Schepper, K., Bondarenko, O., Briscoe, B., and I. + Tsang, "PI2: A Linearized AQM for both Classic and + Scalable TCP", ACM CoNEXT'16, DOI 10.1145/2999572.2999578, + December 2016, + <https://dl.acm.org/doi/10.1145/2999572.2999578>. + + [PI2param] Briscoe, B., "PI2 Parameters", Technical Report, TR-BB- + 2021-001, arXiv:2107.01003 [cs.NI], + DOI 10.48550/arXiv.2107.01003, July 2021, + <https://arxiv.org/abs/2107.01003>. + + [PRAGUE-CC] + De Schepper, K., Tilmans, O., and B. Briscoe, "Prague + Congestion Control", Work in Progress, Internet-Draft, + draft-briscoe-iccrg-prague-congestion-control-01, 11 July + 2022, <https://datatracker.ietf.org/doc/html/draft- + briscoe-iccrg-prague-congestion-control-01>. + + [PragueLinux] + Briscoe, B., De Schepper, K., Albisser, O., Misund, J., + Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing + the 'TCP Prague' Requirements for L4S", Proceedings of + Linux Netdev 0x13, March 2019, + <https://www.netdevconf.org/0x13/session.html?talk-tcp- + prague-l4s>. + + [RED] Floyd, S. and V. Jacobson, "Random Early Detection + Gateways for Congestion Avoidance", IEEE/ACM Transactions + on Networking, Volume 1, Issue 4, pp. 397-413, + DOI 10.1109/90.251892, August 1993, + <https://dl.acm.org/doi/10.1109/90.251892>. + + [RELENTLESS] + Mathis, M., "Relentless Congestion Control", Work in + Progress, Internet-Draft, draft-mathis-iccrg-relentless- + tcp-00, 4 March 2009, + <https://datatracker.ietf.org/doc/html/draft-mathis-iccrg- + relentless-tcp-00>. + + [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", + RFC 970, DOI 10.17487/RFC0970, December 1985, + <https://www.rfc-editor.org/info/rfc970>. + + [RFC2914] Floyd, S., "Congestion Control Principles", BCP 41, + RFC 2914, DOI 10.17487/RFC2914, September 2000, + <https://www.rfc-editor.org/info/rfc2914>. + + [RFC3246] Davie, B., Charny, A., Bennet, J.C.R., Benson, K., Le + Boudec, J.Y., Courtney, W., Davari, S., Firoiu, V., and D. + Stiliadis, "An Expedited Forwarding PHB (Per-Hop + Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, + <https://www.rfc-editor.org/info/rfc3246>. + + [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", + RFC 3649, DOI 10.17487/RFC3649, December 2003, + <https://www.rfc-editor.org/info/rfc3649>. + + [RFC5033] Floyd, S. and M. Allman, "Specifying New Congestion + Control Algorithms", BCP 133, RFC 5033, + DOI 10.17487/RFC5033, August 2007, + <https://www.rfc-editor.org/info/rfc5033>. + + [RFC5348] Floyd, S., Handley, M., Padhye, J., and J. Widmer, "TCP + Friendly Rate Control (TFRC): Protocol Specification", + RFC 5348, DOI 10.17487/RFC5348, September 2008, + <https://www.rfc-editor.org/info/rfc5348>. + + [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion + Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, + <https://www.rfc-editor.org/info/rfc5681>. + + [RFC5706] Harrington, D., "Guidelines for Considering Operations and + Management of New Protocols and Protocol Extensions", + RFC 5706, DOI 10.17487/RFC5706, November 2009, + <https://www.rfc-editor.org/info/rfc5706>. + + [RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF + Recommendations Regarding Active Queue Management", + BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015, + <https://www.rfc-editor.org/info/rfc7567>. + + [RFC8033] Pan, R., Natarajan, P., Baker, F., and G. White, + "Proportional Integral Controller Enhanced (PIE): A + Lightweight Control Scheme to Address the Bufferbloat + Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017, + <https://www.rfc-editor.org/info/rfc8033>. + + [RFC8034] White, G. and R. Pan, "Active Queue Management (AQM) Based + on Proportional Integral Controller Enhanced (PIE) for + Data-Over-Cable Service Interface Specifications (DOCSIS) + Cable Modems", RFC 8034, DOI 10.17487/RFC8034, February + 2017, <https://www.rfc-editor.org/info/rfc8034>. + + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, <https://www.rfc-editor.org/info/rfc8174>. + + [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., + and G. Judd, "Data Center TCP (DCTCP): TCP Congestion + Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, + October 2017, <https://www.rfc-editor.org/info/rfc8257>. + + [RFC8290] Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys, + J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler + and Active Queue Management Algorithm", RFC 8290, + DOI 10.17487/RFC8290, January 2018, + <https://www.rfc-editor.org/info/rfc8290>. + + [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation + for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December + 2017, <https://www.rfc-editor.org/info/rfc8298>. + + [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and + R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", + RFC 8312, DOI 10.17487/RFC8312, February 2018, + <https://www.rfc-editor.org/info/rfc8312>. + + [RFC8404] Moriarty, K., Ed. and A. Morton, Ed., "Effects of + Pervasive Encryption on Operators", RFC 8404, + DOI 10.17487/RFC8404, July 2018, + <https://www.rfc-editor.org/info/rfc8404>. + + [RFC9000] Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based + Multiplexed and Secure Transport", RFC 9000, + DOI 10.17487/RFC9000, May 2021, + <https://www.rfc-editor.org/info/rfc9000>. + + [RFC9330] Briscoe, B., Ed., De Schepper, K., Bagnulo, M., and G. + White, "Low Latency, Low Loss, and Scalable Throughput + (L4S) Internet Service: Architecture", RFC 9330, + DOI 10.17487/RFC9330, January 2023, + <https://www.rfc-editor.org/info/rfc9330>. + + [SCReAM-L4S] + "SCReAM", commit fda6c53, June 2022, + <https://github.com/EricssonResearch/scream>. + + [SigQ-Dyn] Briscoe, B., "Rapid Signalling of Queue Dynamics", + Technical Report, TR-BB-2017-001, + DOI 10.48550/arXiv.1904.07044, September 2017, + <https://arxiv.org/abs/1904.07044>. + +Appendix A. Example DualQ Coupled PI2 Algorithm + + As a first concrete example, the pseudocode below gives the DualPI2 + algorithm. DualPI2 follows the structure of the DualQ Coupled AQM + framework in Figure 1. A simple ramp function (configured in units + of queuing time) with unsmoothed ECN marking is used for the Native + L4S AQM. The ramp can also be configured as a step function. The + PI2 algorithm [PI2] is used for the Classic AQM. PI2 is an improved + variant of the PIE AQM [RFC8033]. + + The pseudocode will be introduced in two passes. The first pass + explains the core concepts, deferring handling of edge-cases like + overload to the second pass. To aid comparison, line numbers are + kept in step between the two passes by using letter suffixes where + the longer code needs extra lines. + + All variables are assumed to be floating point in their basic units + (size in bytes, time in seconds, rates in bytes/second, alpha and + beta in Hz, and probabilities from 0 to 1). Constants expressed in k + (kilo), M (mega), G (giga), u (micro), m (milli), %, and so forth, + are assumed to be converted to their appropriate multiple or fraction + to represent the basic units. A real implementation that wants to + use integer values needs to handle appropriate scaling factors and + allow appropriate resolution of its integer types (including + temporary internal values during calculations). + + A full open source implementation for Linux is available at + <https://github.com/L4STeam/sch_dualpi2_upstream> and explained in + [DualPI2Linux]. The specification of the DualQ Coupled AQM for + DOCSIS cable modems and cable modem termination systems (CMTSs) is + available in [DOCSIS3.1] and explained in [LLD]. + +A.1. Pass #1: Core Concepts + + The pseudocode manipulates three main structures of variables: the + packet (pkt), the L4S queue (lq), and the Classic queue (cq). The + pseudocode consists of the following six functions: + + * The initialization function dualpi2_params_init(...) (Figure 2) + that sets parameter defaults (the API for setting non-default + values is omitted for brevity). + + * The enqueue function dualpi2_enqueue(lq, cq, pkt) (Figure 3). + + * The dequeue function dualpi2_dequeue(lq, cq, pkt) (Figure 4). + + * The recurrence function recur(q, likelihood) for de-randomized ECN + marking (shown at the end of Figure 4). + + * The L4S AQM function laqm(qdelay) (Figure 5) used to calculate the + ECN-marking probability for the L4S queue. + + * The Base AQM function that implements the PI algorithm + dualpi2_update(lq, cq) (Figure 6) used to regularly update the + base probability (p'), which is squared for the Classic AQM as + well as being coupled across to the L4S queue. + + It also uses the following functions that are not shown in full here: + + * scheduler(), which selects between the head packets of the two + queues. The choice of scheduler technology is discussed later. + + * cq.byt() or lq.byt() returns the current length (a.k.a. backlog) + of the relevant queue in bytes. + + * cq.len() or lq.len() returns the current length of the relevant + queue in packets. + + * cq.time() or lq.time() returns the current queuing delay of the + relevant queue in units of time (see Note a below). + + * mark(pkt) and drop(pkt) for ECN marking and dropping a packet. + + In experiments so far (building on experiments with PIE) on broadband + access links ranging from 4 Mb/s to 200 Mb/s with base RTTs from 5 ms + to 100 ms, DualPI2 achieves good results with the default parameters + in Figure 2. The parameters are categorised by whether they relate + to the PI2 AQM, the L4S AQM, or the framework coupling them together. + Constants and variables derived from these parameters are also + included at the end of each category. Each parameter is explained as + it is encountered in the walk-through of the pseudocode below, and + the rationale for the chosen defaults are given so that sensible + values can be used in scenarios other than the regular public + Internet. + + 1: dualpi2_params_init(...) { % Set input parameter defaults + 2: % DualQ Coupled framework parameters + 5: limit = MAX_LINK_RATE * 250 ms % Dual buffer size + 3: k = 2 % Coupling factor + 4: % NOT SHOWN % scheduler-dependent weight or equival't parameter + 6: + 7: % PI2 Classic AQM parameters + 8: target = 15 ms % Queue delay target + 9: RTT_max = 100 ms % Worst case RTT expected + 10: % PI2 constants derived from above PI2 parameters + 11: p_Cmax = min(1/k^2, 1) % Max Classic drop/mark prob + 12: Tupdate = min(target, RTT_max/3) % PI sampling interval + 13: alpha = 0.1 * Tupdate / RTT_max^2 % PI integral gain in Hz + 14: beta = 0.3 / RTT_max % PI proportional gain in Hz + 15: + 16: % L4S ramp AQM parameters + 17: minTh = 800 us % L4S min marking threshold in time units + 18: range = 400 us % Range of L4S ramp in time units + 19: Th_len = 1 pkt % Min L4S marking threshold in packets + 20: % L4S constants + 21: p_Lmax = 1 % Max L4S marking prob + 22: } + + Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM + + The overall goal of the code is to apply the marking and dropping + probabilities for L4S and Classic traffic (p_L and p_C). These are + derived from the underlying base probabilities p'_L and p' driven, + respectively, by the traffic in the L and C queues. The marking + probability for the L queue (p_L) depends on both the base + probability in its own queue (p'_L) and a probability called p_CL, + which is coupled across from p' in the C queue (see Section 2.4 for + the derivation of the specific equations and dependencies). + + The probabilities p_CL and p_C are derived in lines 4 and 5 of the + dualpi2_update() function (Figure 6) then used in the + dualpi2_dequeue() function where p_L is also derived from p_CL at + line 6 (Figure 4). The code walk-through below builds up to + explaining that part of the code eventually, but it starts from + packet arrival. + + 1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq + 2: if ( lq.byt() + cq.byt() + MTU > limit) + 3: drop(pkt) % drop packet if buffer is full + 4: timestamp(pkt) % only needed if using the sojourn technique + 5: % Packet classifier + 6: if ( ecn(pkt) modulo 2 == 1 ) % ECN bits = ECT(1) or CE + 7: lq.enqueue(pkt) + 8: else % ECN bits = not-ECT or ECT(0) + 9: cq.enqueue(pkt) + 10: } + + Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM + + 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues + 2: while ( lq.byt() + cq.byt() > 0 ) { + 3: if ( scheduler() == lq ) { + 4: lq.dequeue(pkt) % Scheduler chooses lq + 5: p'_L = laqm(lq.time()) % Native LAQM + 6: p_L = max(p'_L, p_CL) % Combining function + 7: if ( recur(lq, p_L) ) % Linear marking + 8: mark(pkt) + 9: } else { + 10: cq.dequeue(pkt) % Scheduler chooses cq + 11: if ( recur(cq, p_C) ) { % probability p_C = p'^2 + 12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT + 13: drop(pkt) % squared drop + 14: continue % continue to the top of the while loop + 15: } + 16: mark(pkt) % squared mark + 17: } + 18: } + 19: return(pkt) % return the packet and stop + 20: } + 21: return(NULL) % no packet to dequeue + 22: } + + 23: recur(q, likelihood) { % Returns TRUE with a certain likelihood + 24: q.count += likelihood + 25: if (q.count > 1) { + 26: q.count -= 1 + 27: return TRUE + 28: } + 29: return FALSE + 30: } + + Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM + + When packets arrive, a common queue limit is checked first as shown + in line 2 of the enqueuing pseudocode in Figure 3. This assumes a + shared buffer for the two queues (Note b discusses the merits of + separate buffers). In order to avoid any bias against larger + packets, 1 MTU of space is always allowed, and the limit is + deliberately tested before enqueue. + + If limit is not exceeded, the packet is timestamped in line 4 (only + if the sojourn time technique is being used to measure queue delay; + see Note a below for alternatives). + + At lines 5-9, the packet is classified and enqueued to the Classic or + L4S queue dependent on the least significant bit (LSB) of the ECN + field in the IP header (line 6). Packets with a codepoint having an + LSB of 0 (Not-ECT and ECT(0)) will be enqueued in the Classic queue. + Otherwise, ECT(1) and CE packets will be enqueued in the L4S queue. + Optional additional packet classification flexibility is omitted for + brevity (see the L4S ECN protocol [RFC9331]). + + The dequeue pseudocode (Figure 4) is repeatedly called whenever the + lower layer is ready to forward a packet. It schedules one packet + for dequeuing (or zero if the queue is empty) then returns control to + the caller so that it does not block while that packet is being + forwarded. While making this dequeue decision, it also makes the + necessary AQM decisions on dropping or marking. The alternative of + applying the AQMs at enqueue would shift some processing from the + critical time when each packet is dequeued. However, it would also + add a whole queue of delay to the control signals, making the control + loop sloppier (for a typical RTT, it would double the Classic queue's + feedback delay). + + All the dequeue code is contained within a large while loop so that + if it decides to drop a packet, it will continue until it selects a + packet to schedule. Line 3 of the dequeue pseudocode is where the + scheduler chooses between the L4S queue (lq) and the Classic queue + (cq). Detailed implementation of the scheduler is not shown (see + discussion later). + + * If an L4S packet is scheduled, in lines 7 and 8 the packet is ECN- + marked with likelihood p_L. The recur() function at the end of + Figure 4 is used, which is preferred over random marking because + it avoids delay due to randomization when interpreting congestion + signals, but it still desynchronizes the sawteeth of the flows. + Line 6 calculates p_L as the maximum of the coupled L4S + probability p_CL and the probability from the Native L4S AQM p'_L. + This implements the max() function shown in Figure 1 to couple the + outputs of the two AQMs together. Of the two probabilities input + to p_L in line 6: + + - p'_L is calculated per packet in line 5 by the laqm() function + (see Figure 5), whereas + + - p_CL is maintained by the dualpi2_update() function, which runs + every Tupdate (Tupdate is set in line 12 of Figure 2). + + * If a Classic packet is scheduled, lines 10 to 17 drop or mark the + packet with probability p_C. + + The Native L4S AQM algorithm (Figure 5) is a ramp function, similar + to the RED algorithm, but simplified as follows: + + * The extent of the ramp is defined in units of queuing delay, not + bytes, so that configuration remains invariant as the queue + departure rate varies. + + * It uses instantaneous queuing delay, which avoids the complexity + of smoothing, but also avoids embedding a worst-case RTT of + smoothing delay in the network (see Section 2.1). + + * The ramp rises linearly directly from 0 to 1, not to an + intermediate value of p'_L as RED would, because there is no need + to keep ECN-marking probability low. + + * Marking does not have to be randomized. Determinism is used + instead of randomness to reduce the delay necessary to smooth out + the noise of randomness from the signal. + + The ramp function requires two configuration parameters, the minimum + threshold (minTh) and the width of the ramp (range), both in units of + queuing time, as shown in lines 17 and 18 of the initialization + function in Figure 2. The ramp function can be configured as a step + (see Note c). + + Although the DCTCP paper [Alizadeh-stability] recommends an ECN- + marking threshold of 0.17*RTT_typ, it also shows that the threshold + can be much shallower with hardly any worse underutilization of the + link (because the amplitude of DCTCP's sawteeth is so small). Based + on extensive experiments, for the public Internet the default minimum + ECN-marking threshold (target) in Figure 2 is considered a good + compromise, even though it is a significantly smaller fraction of + RTT_typ. + + 1: laqm(qdelay) { % Returns Native L4S AQM probability + 2: if (qdelay >= maxTh) + 3: return 1 + 4: else if (qdelay > minTh) + 5: return (qdelay - minTh)/range % Divide could use a bit-shift + 6: else + 7: return 0 + 8: } + + Figure 5: Example Pseudocode for the Native L4S AQM + + + 1: dualpi2_update(lq, cq) { % Update p' every Tupdate + 2: curq = cq.time() % use queuing time of first-in Classic packet + 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) + 4: p_CL = k * p' % Coupled L4S prob = base prob * coupling factor + 5: p_C = p'^2 % Classic prob = (base prob)^2 + 6: prevq = curq + 7: } + + Figure 6: Example PI-update Pseudocode for DualQ Coupled PI2 AQM + + (Note: Clamping p' within the range [0,1] omitted for clarity -- + see below.) + + The coupled marking probability p_CL depends on the base probability + (p'), which is kept up to date by executing the core PI algorithm in + Figure 6 every Tupdate. + + Note that p' solely depends on the queuing time in the Classic queue. + In line 2, the current queuing delay (curq) is evaluated from how + long the head packet was in the Classic queue (cq). The function + cq.time() (not shown) subtracts the time stamped at enqueue from the + current time (see Note a below) and implicitly takes the current + queuing delay as 0 if the queue is empty. + + The algorithm centres on line 3, which is a classical PI controller + that alters p' dependent on: a) the error between the current queuing + delay (curq) and the target queuing delay (target) and b) the change + in queuing delay since the last sample. The name 'PI' represents the + fact that the second factor (how fast the queue is growing) is + Proportional to load while the first is the Integral of the load (so + it removes any standing queue in excess of the target). + + The target parameter can be set based on local knowledge, but the aim + is for the default to be a good compromise for anywhere in the + intended deployment environment -- the public Internet. According to + [PI2param], the target queuing delay on line 8 of Figure 2 is related + to the typical base RTT worldwide, RTT_typ, by two factors: target = + RTT_typ * g * f. Below, we summarize the rationale behind these + factors and introduce a further adjustment. The two factors ensure + that, in a large proportion of cases (say 90%), the sawtooth + variations in RTT of a single flow will fit within the buffer without + underutilizing the link. Frankly, these factors are educated + guesses, but with the emphasis closer to 'educated' than to 'guess' + (see [PI2param] for the full background): + + * RTT_typ is taken as 25 ms. This is based on an average CDN + latency measured in each country weighted by the number of + Internet users in that country to produce an overall weighted + average for the Internet [PI2param]. Countries were ranked by + number of Internet users, and once 90% of Internet users were + covered, smaller countries were excluded to avoid small sample + sizes that would be less representative. Also, importantly, the + data for the average CDN latency in China (with the largest number + of Internet users) has been removed, because the CDN latency was a + significant outlier and, on reflection, the experimental technique + seemed inappropriate to the CDN market in China. + + * g is taken as 0.38. The factor g is a geometry factor that + characterizes the shape of the sawteeth of prevalent Classic + congestion controllers. The geometry factor is the fraction of + the amplitude of the sawtooth variability in queue delay that lies + below the AQM's target. For instance, at low bitrates, the + geometry factor of standard Reno is 0.5, but at higher rates, it + tends towards just under 1. According to the census of congestion + controllers conducted by Mishra et al. in Jul-Oct 2019 + [CCcensus19], most Classic TCP traffic uses CUBIC. And, according + to the analysis in [PI2param], if running over a PI2 AQM, a large + proportion of this CUBIC traffic would be in its Reno-friendly + mode, which has a geometry factor of ~0.39 (for all known + implementations). The rest of the CUBIC traffic would be in true + CUBIC mode, which has a geometry factor of ~0.36. Without + modelling the sawtooth profiles from all the other less prevalent + congestion controllers, we estimate a 7:3 weighted average of + these two, resulting in an average geometry factor of 0.38. + + * f is taken as 2. The factor f is a safety factor that increases + the target queue to allow for the distribution of RTT_typ around + its mean. Otherwise, the target queue would only avoid + underutilization for those users below the mean. It also provides + a safety margin for the proportion of paths in use that span + beyond the distance between a user and their local CDN. + Currently, no data is available on the variance of queue delay + around the mean in each region, so there is plenty of room for + this guess to become more educated. + + * [PI2param] recommends target = RTT_typ * g * f = 25 ms * 0.38 * 2 + = 19 ms. However, a further adjustment is warranted, because + target is moving year-on-year. The paper is based on data + collected in 2019, and it mentions evidence from the Speedtest + Global Index that suggests RTT_typ reduced by 17% (fixed) or 12% + (mobile) between 2020 and 2021. Therefore, we recommend a default + of target = 15 ms at the time of writing (2021). + + Operators can always use the data and discussion in [PI2param] to + configure a more appropriate target for their environment. For + instance, an operator might wish to question the assumptions called + out in that paper, such as the goal of no underutilization for a + large majority of single flow transfers (given many large transfers + use multiple flows to avoid the scaling limitations of Classic + flows). + + The two 'gain factors' in line 3 of Figure 6, alpha and beta, + respectively weight how strongly each of the two elements (Integral + and Proportional) alters p'. They are in units of 'per second of + delay' or Hz, because they transform differences in queuing delay + into changes in probability (assuming probability has a value from 0 + to 1). + + Alpha and beta determine how much p' ought to change after each + update interval (Tupdate). For a smaller Tupdate, p' should change + by the same amount per second but in finer more frequent steps. So + alpha depends on Tupdate (see line 13 of the initialization function + in Figure 2). It is best to update p' as frequently as possible, but + Tupdate will probably be constrained by hardware performance. As + shown in line 12, the update interval should be frequent enough to + update at least once in the time taken for the target queue to drain + ('target') as long as it updates at least three times per maximum + RTT. Tupdate defaults to 16 ms in the reference Linux implementation + because it has to be rounded to a multiple of 4 ms. For link rates + from 4 to 200 Mb/s and a maximum RTT of 100 ms, it has been verified + through extensive testing that Tupdate = 16 ms (as also recommended + in the PIE spec [RFC8033]) is sufficient. + + The choice of alpha and beta also determines the AQM's stable + operating range. The AQM ought to change p' as fast as possible in + response to changes in load without overcompensating and therefore + causing oscillations in the queue. Therefore, the values of alpha + and beta also depend on the RTT of the expected worst-case flow + (RTT_max). + + The maximum RTT of a PI controller (RTT_max in line 9 of Figure 2) is + not an absolute maximum, but more instability (more queue + variability) sets in for long-running flows with an RTT above this + value. The propagation delay halfway round the planet and back in + glass fibre is 200 ms. However, hardly any traffic traverses such + extreme paths and, since the significant consolidation of Internet + traffic between 2007 and 2009 [Labovitz10], a high and growing + proportion of all Internet traffic (roughly two-thirds at the time of + writing) has been served from CDNs or 'cloud' services distributed + close to end users. The Internet might change again, but for now, + designing for a maximum RTT of 100 ms is a good compromise between + faster queue control at low RTT and some instability on the occasions + when a longer path is necessary. + + Recommended derivations of the gain constants alpha and beta can be + approximated for Reno over a PI2 AQM as: alpha = 0.1 * Tupdate / + RTT_max^2; beta = 0.3 / RTT_max, as shown in lines 13 and 14 of + Figure 2. These are derived from the stability analysis in [PI2]. + For the default values of Tupdate = 16 ms and RTT_max = 100 ms, they + result in alpha = 0.16; beta = 3.2 (discrepancies are due to + rounding). These defaults have been verified with a wide range of + link rates, target delays, and traffic models with mixed and similar + RTTs, short and long flows, etc. + + In corner cases, p' can overflow the range [0,1] so the resulting + value of p' has to be bounded (omitted from the pseudocode). Then, + as already explained, the coupled and Classic probabilities are + derived from the new p' in lines 4 and 5 of Figure 6 as p_CL = k*p' + and p_C = p'^2. + + Because the coupled L4S marking probability (p_CL) is factored up by + k, the dynamic gain parameters alpha and beta are also inherently + factored up by k for the L4S queue. So, the effective gain factor + for the L4S queue is k*alpha (with defaults alpha = 0.16 Hz and k = + 2, effective L4S alpha = 0.32 Hz). + + Unlike in PIE [RFC8033], alpha and beta do not need to be tuned every + Tupdate dependent on p'. Instead, in PI2, alpha and beta are + independent of p' because the squaring applied to Classic traffic + tunes them inherently. This is explained in [PI2], which also + explains why this more principled approach removes the need for most + of the heuristics that had to be added to PIE. + + Nonetheless, an implementer might wish to add selected details to + either AQM. For instance, the Linux reference DualPI2 implementation + includes the following (not shown in the pseudocode above): + + * Classic and coupled marking or dropping (i.e., based on p_C and + p_CL from the PI controller) is not applied to a packet if the + aggregate queue length in bytes is < 2 MTU (prior to enqueuing the + packet or dequeuing it, depending on whether the AQM is configured + to be applied at enqueue or dequeue); and + + * in the WRR scheduler, the 'credit' indicating which queue should + transmit is only changed if there are packets in both queues + (i.e., if there is actual resource contention). This means that a + properly paced L flow might never be delayed by the WRR. The WRR + credit is reset in favour of the L queue when the link is idle. + + An implementer might also wish to add other heuristics, e.g., burst + protection [RFC8033] or enhanced burst protection [RFC8034]. + + Notes: + + a. The drain rate of the queue can vary if it is scheduled relative + to other queues or if it accommodates fluctuations in a wireless + medium. To auto-adjust to changes in drain rate, the queue needs + to be measured in time, not bytes or packets [AQMmetrics] + [CoDel]. Queuing delay could be measured directly as the sojourn + time (a.k.a. service time) of the queue by storing a per-packet + timestamp as each packet is enqueued and subtracting it from the + system time when the packet is dequeued. If timestamping is not + easy to introduce with certain hardware, queuing delay could be + predicted indirectly by dividing the size of the queue by the + predicted departure rate, which might be known precisely for some + link technologies (see, for example, DOCSIS PIE [RFC8034]). + + However, sojourn time is slow to detect bursts. For instance, if + a burst arrives at an empty queue, the sojourn time only fully + measures the burst's delay when its last packet is dequeued, even + though the queue has known the size of the burst since its last + packet was enqueued -- so it could have signalled congestion + earlier. To remedy this, each head packet can be marked when it + is dequeued based on the expected delay of the tail packet behind + it, as explained below, rather than based on the head packet's + own delay due to the packets in front of it. "Underutilization + with Bursty Traffic" in [Heist21] identifies a specific scenario + where bursty traffic significantly hits utilization of the L + queue. If this effect proves to be more widely applicable, using + the delay behind the head could improve performance. + + The delay behind the head can be implemented by dividing the + backlog at dequeue by the link rate or equivalently multiplying + the backlog by the delay per unit of backlog. The implementation + details will depend on whether the link rate is known; if it is + not, a moving average of the delay per unit backlog can be + maintained. This delay consists of serialization as well as + media acquisition for shared media. So the details will depend + strongly on the specific link technology. This approach should + be less sensitive to timing errors and cost less in operations + and memory than the otherwise equivalent 'scaled sojourn time' + metric, which is the sojourn time of a packet scaled by the ratio + of the queue sizes when the packet departed and arrived + [SigQ-Dyn]. + + b. Line 2 of the dualpi2_enqueue() function (Figure 3) assumes an + implementation where lq and cq share common buffer memory. An + alternative implementation could use separate buffers for each + queue, in which case the arriving packet would have to be + classified first to determine which buffer to check for available + space. The choice is a trade-off; a shared buffer can use less + memory whereas separate buffers isolate the L4S queue from tail + drop due to large bursts of Classic traffic (e.g., a Classic Reno + TCP during slow-start over a long RTT). + + c. There has been some concern that using the step function of DCTCP + for the Native L4S AQM requires end systems to smooth the signal + for an unnecessarily large number of round trips to ensure + sufficient fidelity. A ramp is no worse than a step in initial + experiments with existing DCTCP. Therefore, it is recommended + that a ramp is configured in place of a step, which will allow + congestion control algorithms to investigate faster smoothing + algorithms. + + A ramp is more general than a step, because an operator can + effectively turn the ramp into a step function, as used by DCTCP, + by setting the range to zero. There will not be a divide by zero + problem at line 5 of Figure 5 because, if minTh is equal to + maxTh, the condition for this ramp calculation cannot arise. + +A.2. Pass #2: Edge-Case Details + + This section takes a second pass through the pseudocode to add + details of two edge-cases: low link rate and overload. Figure 7 + repeats the dequeue function of Figure 4, but with details of both + edge-cases added. Similarly, Figure 8 repeats the core PI algorithm + of Figure 6, but with overload details added. The initialization, + enqueue, L4S AQM, and recur functions are unchanged. + + The link rate can be so low that it takes a single packet queue + longer to serialize than the threshold delay at which ECN marking + starts to be applied in the L queue. Therefore, a minimum marking + threshold parameter in units of packets rather than time is necessary + (Th_len, default 1 packet in line 19 of Figure 2) to ensure that the + ramp does not trigger excessive marking on slow links. Where an + implementation knows the link rate, it can set up this minimum at the + time it is configured. For instance, it would divide 1 MTU by the + link rate to convert it into a serialization time, then if the lower + threshold of the Native L AQM ramp was lower than this serialization + time, it could increase the thresholds to shift the bottom of the + ramp to 2 MTU. This is the approach used in DOCSIS [DOCSIS3.1], + because the configured link rate is dedicated to the DualQ. + + The pseudocode given here applies where the link rate is unknown, + which is more common for software implementations that might be + deployed in scenarios where the link is shared with other queues. In + lines 5a to 5d in Figure 7, the native L4S marking probability, p'_L, + is zeroed if the queue is only 1 packet (in the default + configuration). + + | Linux implementation note: In Linux, the check that the queue + | exceeds Th_len before marking with the Native L4S AQM is + | actually at enqueue, not dequeue; otherwise, it would exempt + | the last packet of a burst from being marked. The result of + | the check is conveyed from enqueue to the dequeue function via + | a boolean in the packet metadata. + + Persistent overload is deemed to have occurred when Classic drop/ + marking probability reaches p_Cmax. Above this point, the Classic + drop probability is applied to both the L and C queues, irrespective + of whether any packet is ECN-capable. ECT packets that are not + dropped can still be ECN-marked. + + In line 11 of the initialization function (Figure 2), the maximum + Classic drop probability p_Cmax = min(1/k^2, 1) or 1/4 for the + default coupling factor k = 2. In practice, 25% has been found to be + a good threshold to preserve fairness between ECN-capable and non- + ECN-capable traffic. This protects the queues against both temporary + overload from responsive flows and more persistent overload from any + unresponsive traffic that falsely claims to be responsive to ECN. + + When the Classic ECN-marking probability reaches the p_Cmax threshold + (1/k^2), the marking probability that is coupled to the L4S queue, + p_CL, will always be 100% for any k (by equation (1) in Section 2.1). + So, for readability, the constant p_Lmax is defined as 1 in line 21 + of the initialization function (Figure 2). This is intended to + ensure that the L4S queue starts to introduce dropping once ECN + marking saturates at 100% and can rise no further. The 'Prague L4S + requirements' [RFC9331] state that when an L4S congestion control + detects a drop, it falls back to a response that coexists with + 'Classic' Reno congestion control. So, it is correct that when the + L4S queue drops packets, it drops them proportional to p'^2, as if + they are Classic packets. + + The two queues each test for overload in lines 4b and 12b of the + dequeue function (Figure 7). Lines 8c to 8g drop L4S packets with + probability p'^2. Lines 8h to 8i mark the remaining packets with + probability p_CL. Given p_Lmax = 1, all remaining packets will be + marked because, to have reached the else block at line 8b, p_CL >= 1. + + Line 2a in the core PI algorithm (Figure 8) deals with overload of + the L4S queue when there is little or no Classic traffic. This is + necessary, because the core PI algorithm maintains the appropriate + drop probability to regulate overload, but it depends on the length + of the Classic queue. If there is little or no Classic queue, the + naive PI-update function (Figure 6) would drop nothing, even if the + L4S queue were overloaded -- so tail drop would have to take over + (lines 2 and 3 of Figure 3). + + Instead, line 2a of the full PI-update function (Figure 8) ensures + that the Base PI AQM in line 3 is driven by whichever of the two + queue delays is greater, but line 3 still always uses the same + Classic target (default 15 ms). If L queue delay is greater just + because there is little or no Classic traffic, normally it will still + be well below the Base AQM target. This is because L4S traffic is + also governed by the shallow threshold of its own Native AQM (lines + 5a to 6 of the dequeue algorithm in Figure 7). So the Base AQM will + be driven to zero and not contribute. However, if the L queue is + overloaded by traffic that is unresponsive to its marking, the max() + in line 2a of Figure 8 enables the L queue to smoothly take over + driving the Base AQM into overload mode even if there is little or no + Classic traffic. Then the Base AQM will keep the L queue to the + Classic target (default 15 ms) by shedding L packets. + + 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues + 2: while ( lq.byt() + cq.byt() > 0 ) { + 3: if ( scheduler() == lq ) { + 4a: lq.dequeue(pkt) % L4S scheduled + 4b: if ( p_CL < p_Lmax ) { % Check for overload saturation + 5a: if (lq.len()>Th_len) % >1 packet queued + 5b: p'_L = laqm(lq.time()) % Native LAQM + 5c: else + 5d: p'_L = 0 % Suppress marking 1 pkt queue + 6: p_L = max(p'_L, p_CL) % Combining function + 7: if ( recur(lq, p_L) %Linear marking + 8a: mark(pkt) + 8b: } else { % overload saturation + 8c: if ( recur(lq, p_C) ) { % probability p_C = p'^2 + 8e: drop(pkt) % revert to Classic drop due to overload + 8f: continue % continue to the top of the while loop + 8g: } + 8h: if ( recur(lq, p_CL) ) % probability p_CL = k * p' + 8i: mark(pkt) % linear marking of remaining packets + 8j: } + 9: } else { + 10: cq.dequeue(pkt) % Classic scheduled + 11: if ( recur(cq, p_C) ) { % probability p_C = p'^2 + 12a: if ( (ecn(pkt) == 0) % ECN field = not-ECT + 12b: OR (p_C >= p_Cmax) ) { % Overload disables ECN + 13: drop(pkt) % squared drop, redo loop + 14: continue % continue to the top of the while loop + 15: } + 16: mark(pkt) % squared mark + 17: } + 18: } + 19: return(pkt) % return the packet and stop + 20: } + 21: return(NULL) % no packet to dequeue + 22: } + + Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM + (Including Code for Edge-Cases) + + 1: dualpi2_update(lq, cq) { % Update p' every Tupdate + 2a: curq = max(cq.time(), lq.time()) % use greatest queuing time + 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) + 4: p_CL = p' * k % Coupled L4S prob = base prob * coupling factor + 5: p_C = p'^2 % Classic prob = (base prob)^2 + 6: prevq = curq + 7: } + + Figure 8: Example PI-update Pseudocode for DualQ Coupled PI2 AQM + (Including Overload Code) + + + The choice of scheduler technology is critical to overload protection + (see Section 4.2.2). + + * A well-understood weighted scheduler such as WRR is recommended. + As long as the scheduler weight for Classic is small (e.g., 1/16), + its exact value is unimportant, because it does not normally + determine capacity shares. The weight is only important to + prevent unresponsive L4S traffic starving Classic traffic in the + short term (see Section 4.2.2). This is because capacity sharing + between the queues is normally determined by the coupled + congestion signal, which overrides the scheduler, by making L4S + sources leave roughly equal per-flow capacity available for + Classic flows. + + * Alternatively, a time-shifted FIFO (TS-FIFO) could be used. It + works by selecting the head packet that has waited the longest, + biased against the Classic traffic by a time-shift of tshift. To + implement TS-FIFO, the scheduler() function in line 3 of the + dequeue code would simply be implemented as the scheduler() + function at the bottom of Figure 10 in Appendix B. For the public + Internet, a good value for tshift is 50 ms. For private networks + with smaller diameter, about 4*target would be reasonable. TS- + FIFO is a very simple scheduler, but complexity might need to be + added to address some deficiencies (which is why it is not + recommended over WRR): + + - TS-FIFO does not fully isolate latency in the L4S queue from + uncontrolled bursts in the Classic queue; + + - using sojourn time for TS-FIFO is only appropriate if + timestamping of packets is feasible; and + + - even if timestamping is supported, the sojourn time of the head + packet is always stale, so a more instantaneous measure of + queue delay could be used (see Note a in Appendix A.1). + + * A strict priority scheduler would be inappropriate as discussed in + Section 4.2.2. + +Appendix B. Example DualQ Coupled Curvy RED Algorithm + + As another example of a DualQ Coupled AQM algorithm, the pseudocode + below gives the Curvy-RED-based algorithm. Although the AQM was + designed to be efficient in integer arithmetic, to aid understanding + it is first given using floating point arithmetic (Figure 10). Then, + one possible optimization for integer arithmetic is given, also in + pseudocode (Figure 11). To aid comparison, the line numbers are kept + in step between the two by using letter suffixes where the longer + code needs extra lines. + +B.1. Curvy RED in Pseudocode + + The pseudocode manipulates three main structures of variables: the + packet (pkt), the L4S queue (lq), and the Classic queue (cq). It is + defined and described below in the following three functions: + + * the initialization function cred_params_init(...) (Figure 2) that + sets parameter defaults (the API for setting non-default values is + omitted for brevity); + + * the dequeue function cred_dequeue(lq, cq, pkt) (Figure 4); and + + * the scheduling function scheduler(), which selects between the + head packets of the two queues. + + It also uses the following functions that are either shown elsewhere + or not shown in full here: + + * the enqueue function, which is identical to that used for DualPI2, + dualpi2_enqueue(lq, cq, pkt) in Figure 3; + + * mark(pkt) and drop(pkt) for ECN marking and dropping a packet; + + * cq.byt() or lq.byt() returns the current length (a.k.a. backlog) + of the relevant queue in bytes; and + + * cq.time() or lq.time() returns the current queuing delay of the + relevant queue in units of time (see Note a in Appendix A.1). + + Because Curvy RED was evaluated before DualPI2, certain improvements + introduced for DualPI2 were not evaluated for Curvy RED. In the + pseudocode below, the straightforward improvements have been added on + the assumption they will provide similar benefits, but that has not + been proven experimentally. They are: i) a conditional priority + scheduler instead of strict priority; ii) a time-based threshold for + the Native L4S AQM; and iii) ECN support for the Classic AQM. A + recent evaluation has proved that a minimum ECN-marking threshold + (minTh) greatly improves performance, so this is also included in the + pseudocode. + + Overload protection has not been added to the Curvy RED pseudocode + below so as not to detract from the main features. It would be added + in exactly the same way as in Appendix A.2 for the DualPI2 + pseudocode. The Native L4S AQM uses a step threshold, but a ramp + like that described for DualPI2 could be used instead. The scheduler + uses the simple TS-FIFO algorithm, but it could be replaced with WRR. + + The Curvy RED algorithm has not been maintained or evaluated to the + same degree as the DualPI2 algorithm. In initial experiments on + broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs + from 5 ms to 100 ms, Curvy RED achieved good results with the default + parameters in Figure 9. + + The parameters are categorized by whether they relate to the Classic + AQM, the L4S AQM, or the framework coupling them together. Constants + and variables derived from these parameters are also included at the + end of each category. These are the raw input parameters for the + algorithm. A configuration front-end could accept more meaningful + parameters (e.g., RTT_max and RTT_typ) and convert them into these + raw parameters, as has been done for DualPI2 in Appendix A. Where + necessary, parameters are explained further in the walk-through of + the pseudocode below. + + 1: cred_params_init(...) { % Set input parameter defaults + 2: % DualQ Coupled framework parameters + 3: limit = MAX_LINK_RATE * 250 ms % Dual buffer size + 4: k' = 1 % Coupling factor as a power of 2 + 5: tshift = 50 ms % Time-shift of TS-FIFO scheduler + 6: % Constants derived from Classic AQM parameters + 7: k = 2^k' % Coupling factor from equation (1) + 6: + 7: % Classic AQM parameters + 8: g_C = 5 % EWMA smoothing parameter as a power of 1/2 + 9: S_C = -1 % Classic ramp scaling factor as a power of 2 + 10: minTh = 500 ms % No Classic drop/mark below this queue delay + 11: % Constants derived from Classic AQM parameters + 12: gamma = 2^(-g_C) % EWMA smoothing parameter + 13: range_C = 2^S_C % Range of Classic ramp + 14: + 15: % L4S AQM parameters + 16: T = 1 ms % Queue delay threshold for Native L4S AQM + 17: % Constants derived from above parameters + 18: S_L = S_C - k' % L4S ramp scaling factor as a power of 2 + 19: range_L = 2^S_L % Range of L4S ramp + 20: } + + Figure 9: Example Header Pseudocode for DualQ Coupled Curvy RED AQM + + 1: cred_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues + 2: while ( lq.byt() + cq.byt() > 0 ) { + 3: if ( scheduler() == lq ) { + 4: lq.dequeue(pkt) % L4S scheduled + 5a: p_CL = (Q_C - minTh) / range_L + 5b: if ( ( lq.time() > T ) + 5c: OR ( p_CL > maxrand(U) ) ) + 6: mark(pkt) + 7: } else { + 8: cq.dequeue(pkt) % Classic scheduled + 9a: Q_C = gamma * cq.time() + (1-gamma) * Q_C % Classic Q EWMA + 10a: sqrt_p_C = (Q_C - minTh) / range_C + 10b: if ( sqrt_p_C > maxrand(2*U) ) { + 11: if ( (ecn(pkt) == 0) { % ECN field = not-ECT + 12: drop(pkt) % Squared drop, redo loop + 13: continue % continue to the top of the while loop + 14: } + 15: mark(pkt) + 16: } + 17: } + 18: return(pkt) % return the packet and stop here + 19: } + 20: return(NULL) % no packet to dequeue + 21: } + + 22: maxrand(u) { % return the max of u random numbers + 23: maxr=0 + 24: while (u-- > 0) + 25: maxr = max(maxr, rand()) % 0 <= rand() < 1 + 26: return(maxr) + 27: } + + 28: scheduler() { + 29: if ( lq.time() + tshift >= cq.time() ) + 30: return lq; + 31: else + 32: return cq; + 33: } + + Figure 10: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM + + The dequeue pseudocode (Figure 10) is repeatedly called whenever the + lower layer is ready to forward a packet. It schedules one packet + for dequeuing (or zero if the queue is empty) then returns control to + the caller so that it does not block while that packet is being + forwarded. While making this dequeue decision, it also makes the + necessary AQM decisions on dropping or marking. The alternative of + applying the AQMs at enqueue would shift some processing from the + critical time when each packet is dequeued. However, it would also + add a whole queue of delay to the control signals, making the control + loop very sloppy. + + The code is written assuming the AQMs are applied on dequeue (Note + 1). All the dequeue code is contained within a large while loop so + that if it decides to drop a packet, it will continue until it + selects a packet to schedule. If both queues are empty, the routine + returns NULL at line 20. Line 3 of the dequeue pseudocode is where + the conditional priority scheduler chooses between the L4S queue (lq) + and the Classic queue (cq). The TS-FIFO scheduler is shown at lines + 28-33, which would be suitable if simplicity is paramount (see Note + 2). + + Within each queue, the decision whether to forward, drop, or mark is + taken as follows (to simplify the explanation, it is assumed that U = + 1): + + L4S: + If the test at line 3 determines there is an L4S packet to + dequeue, the tests at lines 5b and 5c determine whether to mark + it. The first is a simple test of whether the L4S queue delay + (lq.time()) is greater than a step threshold T (Note 3). The + second test is similar to the random ECN marking in RED but with + the following differences: i) marking depends on queuing time, not + bytes, in order to scale for any link rate without being + reconfigured; ii) marking of the L4S queue depends on a logical OR + of two tests: one against its own queuing time and one against the + queuing time of the _other_ (Classic) queue; iii) the tests are + against the instantaneous queuing time of the L4S queue but + against a smoothed average of the other (Classic) queue; and iv) + the queue is compared with the maximum of U random numbers (but if + U = 1, this is the same as the single random number used in RED). + + Specifically, in line 5a, the coupled marking probability p_CL is + set to the amount by which the averaged Classic queuing delay Q_C + exceeds the minimum queuing delay threshold (minTh), all divided + by the L4S scaling parameter range_L. range_L represents the + queuing delay (in seconds) added to minTh at which marking + probability would hit 100%. Then, in line 5c (if U = 1), the + result is compared with a uniformly distributed random number + between 0 and 1, which ensures that, over range_L, marking + probability will linearly increase with queuing time. + + Classic: + If the scheduler at line 3 chooses to dequeue a Classic packet and + jumps to line 7, the test at line 10b determines whether to drop + or mark it. But before that, line 9a updates Q_C, which is an + exponentially weighted moving average (Note 4) of the queuing time + of the Classic queue, where cq.time() is the current instantaneous + queuing time of the packet at the head of the Classic queue (zero + if empty), and gamma is the exponentially weighted moving average + (EWMA) constant (default 1/32; see line 12 of the initialization + function). + + Lines 10a and 10b implement the Classic AQM. In line 10a, the + averaged queuing time Q_C is divided by the Classic scaling + parameter range_C, in the same way that queuing time was scaled + for L4S marking. This scaled queuing time will be squared to + compute Classic drop probability. So, before it is squared, it is + effectively the square root of the drop probability; hence, it is + given the variable name sqrt_p_C. The squaring is done by + comparing it with the maximum out of two random numbers (assuming + U = 1). Comparing it with the maximum out of two is the same as + the logical 'AND' of two tests, which ensures drop probability + rises with the square of queuing time. + + The AQM functions in each queue (lines 5c and 10b) are two cases of a + new generalization of RED called 'Curvy RED', motivated as follows. + When the performance of this AQM was compared with FQ-CoDel and PIE, + their goal of holding queuing delay to a fixed target seemed + misguided [CRED_Insights]. As the number of flows increases, if the + AQM does not allow host congestion controllers to increase queuing + delay, it has to introduce abnormally high levels of loss. Then loss + rather than queuing becomes the dominant cause of delay for short + flows, due to timeouts and tail losses. + + Curvy RED constrains delay with a softened target that allows some + increase in delay as load increases. This is achieved by increasing + drop probability on a convex curve relative to queue growth (the + square curve in the Classic queue, if U = 1). Like RED, the curve + hugs the zero axis while the queue is shallow. Then, as load + increases, it introduces a growing barrier to higher delay. But, + unlike RED, it requires only two parameters, not three. The + disadvantage of Curvy RED (compared to a PI controller, for example) + is that it is not adapted to a wide range of RTTs. Curvy RED can be + used as is when the RTT range to be supported is limited; otherwise, + an adaptation mechanism is needed. + + From our limited experiments with Curvy RED so far, recommended + values of these parameters are: S_C = -1; g_C = 5; T = 5 * MTU at the + link rate (about 1 ms at 60 Mb/s) for the range of base RTTs typical + on the public Internet. [CRED_Insights] explains why these + parameters are applicable whatever rate link this AQM implementation + is deployed on and how the parameters would need to be adjusted for a + scenario with a different range of RTTs (e.g., a data centre). The + setting of k depends on policy (see Section 2.5 and Appendix C.2, + respectively, for its recommended setting and guidance on + alternatives). + + There is also a cUrviness parameter, U, which is a small positive + integer. It is likely to take the same hard-coded value for all + implementations, once experiments have determined a good value. Only + U = 1 has been used in experiments so far, but results might be even + better with U = 2 or higher. + + Notes: + + 1. The alternative of applying the AQMs at enqueue would shift some + processing from the critical time when each packet is dequeued. + However, it would also add a whole queue of delay to the control + signals, making the control loop sloppier (for a typical RTT, it + would double the Classic queue's feedback delay). On a platform + where packet timestamping is feasible, e.g., Linux, it is also + easiest to apply the AQMs at dequeue, because that is where + queuing time is also measured. + + 2. WRR better isolates the L4S queue from large delay bursts in the + Classic queue, but it is slightly less simple than TS-FIFO. If + WRR were used, a low default Classic weight (e.g., 1/16) would + need to be configured in place of the time-shift in line 5 of the + initialization function (Figure 9). + + 3. A step function is shown for simplicity. A ramp function (see + Figure 5 and the discussion around it in Appendix A.1) is + recommended, because it is more general than a step and has the + potential to enable L4S congestion controls to converge more + rapidly. + + 4. An EWMA is only one possible way to filter bursts; other more + adaptive smoothing methods could be valid, and it might be + appropriate to decrease the EWMA faster than it increases, e.g., + by using the minimum of the smoothed and instantaneous queue + delays, min(Q_C, qc.time()). + +B.2. Efficient Implementation of Curvy RED + + Although code optimization depends on the platform, the following + notes explain where the design of Curvy RED was particularly + motivated by efficient implementation. + + The Classic AQM at line 10b in Figure 10 calls maxrand(2*U), which + gives twice as much curviness as the call to maxrand(U) in the + marking function at line 5c. This is the trick that implements the + square rule in equation (1) (Section 2.1). This is based on the fact + that, given a number X from 1 to 6, the probability that two dice + throws will both be less than X is the square of the probability that + one throw will be less than X. So, when U = 1, the L4S marking + function is linear and the Classic dropping function is squared. If + U = 2, L4S would be a square function and Classic would be quartic. + And so on. + + The maxrand(u) function in lines 22-27 simply generates u random + numbers and returns the maximum. Typically, maxrand(u) could be run + in parallel out of band. For instance, if U = 1, the Classic queue + would require the maximum of two random numbers. So, instead of + calling maxrand(2*U) in-band, the maximum of every pair of values + from a pseudorandom number generator could be generated out of band + and held in a buffer ready for the Classic queue to consume. + + 1: cred_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues + 2: while ( lq.byt() + cq.byt() > 0 ) { + 3: if ( scheduler() == lq ) { + 4: lq.dequeue(pkt) % L4S scheduled + 5: if ((lq.time() > T) OR (Q_C >> (S_L-2) > maxrand(U))) + 6: mark(pkt) + 7: } else { + 8: cq.dequeue(pkt) % Classic scheduled + 9: Q_C += (qc.ns() - Q_C) >> g_C % Classic Q EWMA + 10: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) { + 11: if ( (ecn(pkt) == 0) { % ECN field = not-ECT + 12: drop(pkt) % Squared drop, redo loop + 13: continue % continue to the top of the while loop + 14: } + 15: mark(pkt) + 16: } + 17: } + 18: return(pkt) % return the packet and stop here + 19: } + 20: return(NULL) % no packet to dequeue + 21: } + + Figure 11: Optimised Example Dequeue Pseudocode for DualQ Coupled + AQM using Integer Arithmetic + + The two ranges, range_L and range_C, are expressed as powers of 2 so + that division can be implemented as a right bit-shift (>>) in lines 5 + and 10 of the integer variant of the pseudocode (Figure 11). + + For the integer variant of the pseudocode, an integer version of the + rand() function used at line 25 of the maxrand() function in + Figure 10 would be arranged to return an integer in the range 0 <= + maxrand() < 2^32 (not shown). This would scale up all the floating + point probabilities in the range [0,1] by 2^32. + + Queuing delays are also scaled up by 2^32, but in two stages: i) in + line 9, queuing time qc.ns() is returned in integer nanoseconds, + making the value about 2^30 times larger than when the units were + seconds, and then ii) in lines 5 and 10, an adjustment of -2 to the + right bit-shift multiplies the result by 2^2, to complete the scaling + by 2^32. + + In line 8 of the initialization function, the EWMA constant gamma is + represented as an integer power of 2, g_C, so that in line 9 of the + integer code (Figure 11), the division needed to weight the moving + average can be implemented by a right bit-shift (>> g_C). + +Appendix C. Choice of Coupling Factor, k + + +C.1. RTT-Dependence + + Where Classic flows compete for the same capacity, their relative + flow rates depend not only on the congestion probability but also on + their end-to-end RTT (= base RTT + queue delay). The rates of Reno + [RFC5681] flows competing over an AQM are roughly inversely + proportional to their RTTs. CUBIC exhibits similar RTT-dependence + when in Reno-friendly mode, but it is less RTT-dependent otherwise. + + Until the early experiments with the DualQ Coupled AQM, the + importance of the reasonably large Classic queue in mitigating RTT- + dependence when the base RTT is low had not been appreciated. + Appendix A.1.6 of the L4S ECN Protocol [RFC9331] uses numerical + examples to explain why bloated buffers had concealed the RTT- + dependence of Classic congestion controls before that time. Then, it + explains why, the more that queuing delays have reduced, the more + that RTT-dependence has surfaced as a potential starvation problem + for long RTT flows, when competing against very short RTT flows. + + Given that congestion control on end systems is voluntary, there is + no reason why it has to be voluntarily RTT-dependent. The RTT- + dependence of existing Classic traffic cannot be 'undeployed'. + Therefore, [RFC9331] requires L4S congestion controls to be + significantly less RTT-dependent than the standard Reno congestion + control [RFC5681], at least at low RTT. Then RTT-dependence ought to + be no worse than it is with appropriately sized Classic buffers. + Following this approach means there is no need for network devices to + address RTT-dependence, although there would be no harm if they did, + which per-flow queuing inherently does. + +C.2. Guidance on Controlling Throughput Equivalence + + The coupling factor, k, determines the balance between L4S and + Classic flow rates (see Section 2.5.2.1 and equation (1) in + Section 2.1). + + For the public Internet, a coupling factor of k = 2 is recommended + and justified below. For scenarios other than the public Internet, a + good coupling factor can be derived by plugging the appropriate + numbers into the same working. + + To summarize the maths below, from equation (7) it can be seen that + choosing k = 1.64 would theoretically make L4S throughput roughly the + same as Classic, _if their actual end-to-end RTTs were the same_. + However, even if the base RTTs are the same, the actual RTTs are + unlikely to be the same, because Classic traffic needs a fairly large + queue to avoid underutilization and excess drop, whereas L4S does + not. + + Therefore, to determine the appropriate coupling factor policy, the + operator needs to decide at what base RTT it wants L4S and Classic + flows to have roughly equal throughput, once the effect of the + additional Classic queue on Classic throughput has been taken into + account. With this approach, a network operator can determine a good + coupling factor without knowing the precise L4S algorithm for + reducing RTT-dependence -- or even in the absence of any algorithm. + + The following additional terminology will be used, with appropriate + subscripts: + + r: Packet rate [pkt/s] + + R: RTT [s/round] + + p: ECN-marking probability [] + + On the Classic side, we consider Reno as the most sensitive and + therefore worst-case Classic congestion control. We will also + consider CUBIC in its Reno-friendly mode ('CReno') as the most + prevalent congestion control, according to the references and + analysis in [PI2param]. In either case, the Classic packet rate in + steady state is given by the well-known square root formula for Reno + congestion control: + + r_C = 1.22 / (R_C * p_C^0.5) (5) + + On the L4S side, we consider the Prague congestion control + [PRAGUE-CC] as the reference for steady-state dependence on + congestion. Prague conforms to the same equation as DCTCP, but we do + not use the equation derived in the DCTCP paper, which is only + appropriate for step marking. The coupled marking, p_CL, is the + appropriate one when considering throughput equivalence with Classic + flows. Unlike step marking, coupled markings are inherently spaced + out, so we use the formula for DCTCP packet rate with probabilistic + marking derived in Appendix A of [PI2]. We use the equation without + RTT-independence enabled, which will be explained later. + + r_L = 2 / (R_L * p_CL) (6) + + For packet rate equivalence, we equate the two packet rates and + rearrange the equation into the same form as equation (1) (copied + from Section 2.1) so the two can be equated and simplified to produce + a formula for a theoretical coupling factor, which we shall call k*: + + r_c = r_L + => p_C = (p_CL/1.64 * R_L/R_C)^2. + + p_C = ( p_CL / k )^2. (1) + + k* = 1.64 * (R_C / R_L). (7) + + We say that this coupling factor is theoretical, because it is in + terms of two RTTs, which raises two practical questions: i) for + multiple flows with different RTTs, the RTT for each traffic class + would have to be derived from the RTTs of all the flows in that class + (actually the harmonic mean would be needed) and ii) a network node + cannot easily know the RTT of the flows anyway. + + RTT-dependence is caused by window-based congestion control, so it + ought to be reversed there, not in the network. Therefore, we use a + fixed coupling factor in the network and reduce RTT-dependence in L4S + senders. We cannot expect Classic senders to all be updated to + reduce their RTT-dependence. But solely addressing the problem in + L4S senders at least makes RTT-dependence no worse -- not just + between L4S senders, but also between L4S and Classic senders. + + Throughput equivalence is defined for flows under comparable + conditions, including with the same base RTT [RFC2914]. So if we + assume the same base RTT, R_b, for comparable flows, we can put both + R_C and R_L in terms of R_b. + + We can approximate the L4S RTT to be hardly greater than the base + RTT, i.e., R_L ~= R_b. And we can replace R_C with (R_b + q_C), + where the Classic queue, q_C, depends on the target queue delay that + the operator has configured for the Classic AQM. + + Taking PI2 as an example Classic AQM, it seems that we could just + take R_C = R_b + target (recommended 15 ms by default in + Appendix A.1). However, target is roughly the queue depth reached by + the tips of the sawteeth of a congestion control, not the average + [PI2param]. That is R_max = R_b + target. + + The position of the average in relation to the max depends on the + amplitude and geometry of the sawteeth. We consider two examples: + Reno [RFC5681], as the most sensitive worst case, and CUBIC [RFC8312] + in its Reno-friendly mode ('CReno') as the most prevalent congestion + control algorithm on the Internet according to the references in + [PI2param]. Both are Additive Increase Multiplicative Decrease + (AIMD), so we will generalize using b as the multiplicative decrease + factor (b_r = 0.5 for Reno, b_c = 0.7 for CReno). Then + + R_C = (R_max + b*R_max) / 2 + = R_max * (1+b)/2. + + R_reno = 0.75 * (R_b + target); R_creno = 0.85 * (R_b + target). + (8) + + Plugging all this into equation (7), at any particular base RTT, R_b, + we get a fixed coupling factor for each: + + k_reno = 1.64*0.75*(R_b+target)/R_b + = 1.23*(1 + target/R_b); k_creno = 1.39 * (1 + target/R_b). + + An operator can then choose the base RTT at which it wants throughput + to be equivalent. For instance, if we recommend that the operator + chooses R_b = 25 ms, as a typical base RTT between Internet users and + CDNs [PI2param], then these coupling factors become: + + k_reno = 1.23 * (1 + 15/25) k_creno = 1.39 * (1 + 15/25) + = 1.97 = 2.22 + ~= 2. ~= 2. (9) + + The approximation is relevant to any of the above example DualQ + Coupled algorithms, which use a coupling factor that is an integer + power of 2 to aid efficient implementation. It also fits best for + the worst case (Reno). + + To check the outcome of this coupling factor, we can express the + ratio of L4S to Classic throughput by substituting from their rate + equations (5) and (6), then also substituting for p_C in terms of + p_CL using equation (1) with k = 2 as just determined for the + Internet: + + r_L / r_C = 2 (R_C * p_C^0.5) / 1.22 (R_L * p_CL) + = (R_C * p_CL) / (1.22 * R_L * p_CL) + = R_C / (1.22 * R_L). (10) + + As an example, we can then consider single competing CReno and Prague + flows, by expressing both their RTTs in (10) in terms of their base + RTTs, R_bC and R_bL. So R_C is replaced by equation (8) for CReno. + And R_L is replaced by the max() function below, which represents the + effective RTT of the current Prague congestion control [PRAGUE-CC] in + its (default) RTT-independent mode, because it sets a floor to the + effective RTT that it uses for additive increase: + + r_L / r_C ~= 0.85 * (R_bC + target) / (1.22 * max(R_bL, R_typ)) + ~= (R_bC + target) / (1.4 * max(R_bL, R_typ)). + + It can be seen that, for base RTTs below target (15 ms), both the + numerator and the denominator plateau, which has the desired effect + of limiting RTT-dependence. + + At the start of the above derivations, an explanation was promised + for why the L4S throughput equation in equation (6) did not need to + model RTT-independence. This is because we only use one point -- at + the typical base RTT where the operator chooses to calculate the + coupling factor. Then throughput equivalence will at least hold at + that chosen point. Nonetheless, assuming Prague senders implement + RTT-independence over a range of RTTs below this, the throughput + equivalence will then extend over that range as well. + + Congestion control designers can choose different ways to reduce RTT- + dependence. And each operator can make a policy choice to decide on + a different base RTT, and therefore a different k, at which it wants + throughput equivalence. Nonetheless, for the Internet, it makes + sense to choose what is believed to be the typical RTT most users + experience, because a Classic AQM's target queuing delay is also + derived from a typical RTT for the Internet. + + As a non-Internet example, for localized traffic from a particular + ISP's data centre, using the measured RTTs, it was calculated that a + value of k = 8 would achieve throughput equivalence, and experiments + verified the formula very closely. + + But, for a typical mix of RTTs across the general Internet, a value + of k = 2 is recommended as a good workable compromise. + +Acknowledgements + + Thanks to Anil Agarwal, Sowmini Varadhan, Gabi Bracha, Nicolas Kuhn, + Greg Skinner, Tom Henderson, David Pullen, Mirja Kühlewind, Gorry + Fairhurst, Pete Heist, Ermin Sakic, and Martin Duke for detailed + review comments, particularly of the appendices, and suggestions on + how to make the explanations clearer. Thanks also to Tom Henderson + for insight on the choice of schedulers and queue delay measurement + techniques. And thanks to the area reviewers Christer Holmberg, Lars + Eggert, and Roman Danyliw. + + The early contributions of Koen De Schepper, Bob Briscoe, Olga + Bondarenko, and Inton Tsang were partly funded by the European + Community under its Seventh Framework Programme through the Reducing + Internet Transport Latency (RITE) project (ICT-317700). + Contributions of Koen De Schepper and Olivier Tilmans were also + partly funded by the 5Growth and DAEMON EU H2020 projects. Bob + Briscoe's contribution was also partly funded by the Comcast + Innovation Fund and the Research Council of Norway through the TimeIn + project. The views expressed here are solely those of the authors. + +Contributors + + The following contributed implementations and evaluations that + validated and helped to improve this specification: + + Olga Albisser <olga@albisser.org> of Simula Research Lab, Norway + (Olga Bondarenko during early draft versions) implemented the + prototype DualPI2 AQM for Linux with Koen De Schepper and conducted + extensive evaluations as well as implementing the live performance + visualization GUI [L4Sdemo16]. + + Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of Nokia Bell + Labs, Belgium prepared and maintains the Linux implementation of + DualPI2 for upstreaming. + + Shravya K.S. wrote a model for the ns-3 simulator based on draft- + ietf-tsvwg-aqm-dualq-coupled-01 (a draft version of this document). + Based on this initial work, Tom Henderson <tomh@tomh.org> updated + that earlier model and created a model for the DualQ variant + specified as part of the Low Latency DOCSIS specification, as well as + conducting extensive evaluations. + + Ing Jyh (Inton) Tsang of Nokia, Belgium built the End-to-End Data + Centre to the Home broadband testbed on which DualQ Coupled AQM + implementations were tested. + +Authors' Addresses + + Koen De Schepper + Nokia Bell Labs + Antwerp + Belgium + Email: koen.de_schepper@nokia.com + URI: https://www.bell-labs.com/about/researcher-profiles/ + koende_schepper/ + + + Bob Briscoe (editor) + Independent + United Kingdom + Email: ietf@bobbriscoe.net + URI: https://bobbriscoe.net/ + + + Greg White + CableLabs + Louisville, CO + United States of America + Email: G.White@CableLabs.com |