author     Thomas Voss <mail@thomasvoss.com>    2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>    2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8289.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8289.txt')
-rw-r--r--  doc/rfc/rfc8289.txt  1403
1 files changed, 1403 insertions, 0 deletions
diff --git a/doc/rfc/rfc8289.txt b/doc/rfc/rfc8289.txt new file mode 100644 index 0000000..dc58d3e --- /dev/null +++ b/doc/rfc/rfc8289.txt @@ -0,0 +1,1403 @@ + + + + + + +Internet Engineering Task Force (IETF) K. Nichols +Request for Comments: 8289 Pollere, Inc. +Category: Experimental V. Jacobson +ISSN: 2070-1721 A. McGregor, Ed. + J. Iyengar, Ed. + Google + January 2018 + + + Controlled Delay Active Queue Management + +Abstract + + This document describes CoDel (Controlled Delay) -- a general + framework that controls bufferbloat-generated excess delay in modern + networking environments. CoDel consists of an estimator, a setpoint, + and a control loop. It requires no configuration in normal Internet + deployments. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for examination, experimental implementation, and + evaluation. + + This document defines an Experimental Protocol for the Internet + community. This document is a product of the Internet Engineering + Task Force (IETF). It represents the consensus of the IETF + community. It has received public review and has been approved for + publication by the Internet Engineering Steering Group (IESG). Not + all documents approved by the IESG are a candidate for any level of + Internet Standard; see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc8289. + + + + + + + + + + + + + + + +Nichols, et al. Experimental [Page 1] + +RFC 8289 CoDel January 2018 + + +Copyright Notice + + Copyright (c) 2018 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 + 2. Conventions and Terms Used in This Document . . . . . . . . . 4 + 3. Understanding the Building Blocks of Queue Management . . . . 5 + 3.1. Estimator . . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.2. Target Setpoint . . . . . . . . . . . . . . . . . . . . . 8 + 3.3. Control Loop . . . . . . . . . . . . . . . . . . . . . . 10 + 4. Overview of the CoDel AQM . . . . . . . . . . . . . . . . . . 13 + 4.1. Non-starvation . . . . . . . . . . . . . . . . . . . . . 14 + 4.2. Setting INTERVAL . . . . . . . . . . . . . . . . . . . . 14 + 4.3. Setting TARGET . . . . . . . . . . . . . . . . . . . . . 14 + 4.4. Use with Multiple Queues . . . . . . . . . . . . . . . . 15 + 4.5. Setting Up CoDel . . . . . . . . . . . . . . . . . . . . 16 + 5. Annotated Pseudocode for CoDel AQM . . . . . . . . . . . . . 16 + 5.1. Data Types . . . . . . . . . . . . . . . . . . . . . . . 17 + 5.2. Per-Queue State (codel_queue_t Instance Variables) . . . 17 + 5.3. Constants . . . . . . . . . . . . . . . . . . . . . . . . 17 + 5.4. Enqueue Routine . . . . . . . . . . . . . . . . . . . . . 18 + 5.5. Dequeue Routine . . . . . . . . . . . . . . . . . . . . . 
18 + 5.6. Helper Routines . . . . . . . . . . . . . . . . . . . . . 19 + 5.7. Implementation Considerations . . . . . . . . . . . . . . 21 + 6. Further Experimentation . . . . . . . . . . . . . . . . . . . 21 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . 21 + 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 + 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 + 9.1. Normative References . . . . . . . . . . . . . . . . . . 22 + 9.2. Informative References . . . . . . . . . . . . . . . . . 22 + Appendix A. Applying CoDel in the Data Center . . . . . . . . . 24 + Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 25 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 25 + + + + + +Nichols, et al. Experimental [Page 2] + +RFC 8289 CoDel January 2018 + + +1. Introduction + + The "persistently full buffer" problem has been discussed in the IETF + community since the early 80s [RFC896]. The IRTF's End-to-End + Research Group called for the deployment of Active Queue Management + (AQM) to solve the problem in 1998 [RFC2309]. Despite this + awareness, the problem has only gotten worse as growth in memory + density per Moore's Law fueled an exponential increase in buffer pool + size. Efforts to deploy AQM have been frustrated by difficult + configuration and negative impact on network utilization. This + "bufferbloat" problem [BLOAT] has become increasingly important + throughout the Internet but particularly at the consumer edge. Queue + management has become more critical due to increased consumer use of + the Internet, mixing large video transactions with time-critical VoIP + and gaming. + + An effective AQM remediates bufferbloat at a bottleneck while "doing + no harm" at hops where buffers are not bloated. However, the + development and deployment of AQM are frequently subject to + misconceptions about the cause of packet queues in network buffers. + Network buffers exist to absorb the packet bursts that occur + naturally in statistically multiplexed networks. Buffers helpfully + absorb the queues created by reasonable packet network behavior such + as short-term mismatches in traffic arrival and departure rates that + arise from upstream resource contention, transport conversation + startup transients, and/or changes in the number of conversations + sharing a link. Unfortunately, other less useful network behaviors + can cause queues to fill, and their effects are not nearly as benign. + Discussion of these issues and the reason why the solution is not + simply "smaller buffers" can be found in [RFC2309], [VANQ2006], + [REDL1998], and [CODEL2012]. To understand queue management, it is + critical to understand the difference between the necessary, useful + "good" queue and the counterproductive "bad" queue. + + Several approaches to AQM have been developed over the past two + decades, but none have been widely deployed due to performance + problems. When designed with the wrong conceptual model for queues, + AQMs have limited operational range, require a lot of configuration + tweaking, and frequently impair rather than improve performance. + Learning from this past history, the CoDel approach is designed to + meet the following goals: + + o Make AQM parameterless for normal operation, with no knobs for + operators, users, or implementers to adjust. + + o Be able to distinguish "good" queue from "bad" queue and treat + them differently, that is, keep delay low while permitting + necessary bursts of traffic. 
+ + + +Nichols, et al. Experimental [Page 3] + +RFC 8289 CoDel January 2018 + + + o Control delay while insensitive (or nearly so) to round-trip + delays, link rates, and traffic loads; this goal is to "do no + harm" to network traffic while controlling delay. + + o Adapt to dynamically changing link rates with no negative impact + on utilization. + + o Allow simple and efficient implementation (can easily span the + spectrum from low-end access points and home routers up to high- + end router hardware). + + CoDel has five major differences from prior AQMs: use of the local + queue minimum to track congestion ("bad" queue), use of an efficient + single state variable representation of that tracked statistic, use + of packet sojourn time as the observed datum (rather than packets, + bytes, or rates), use of a mathematically determined setpoint derived + from maximizing network power [KLEIN81], and a modern state-space + controller. + + CoDel configures itself based on a round-trip time metric that can be + set to 100 ms for the normal, terrestrial Internet. With no changes + to parameters, CoDel is expected to work across a wide range of + conditions, with varying links and the full range of terrestrial + round-trip times. + + CoDel is easily adapted to multiple queue systems as shown by + [RFC8290]. Implementers and users SHOULD use the fq_codel multiple- + queue approach as it deals with many problems beyond the reach of an + AQM on a single queue. + + CoDel was first published in [CODEL2012] and has been implemented in + the Linux kernel. + + Note that while this document refers to dropping packets when + indicated by CoDel, it may be reasonable to ECN-mark packets instead. + +2. Conventions and Terms Used in This Document + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all + capitals, as shown here. + + + + + + + + +Nichols, et al. Experimental [Page 4] + +RFC 8289 CoDel January 2018 + + + The following terms are used in this document and are defined as + follows: + + sojourn time: the amount of time a packet has spent in a particular + buffer, i.e., the time a packet departs the buffer minus the + time the packet arrived at the buffer. A packet can depart a + buffer via transmission or drop. + + standing queue: a queue (in packets, bytes, or time delay) in a + buffer that persists for a "long" time, where "long" is on the + order of the longer round-trip times through the buffer, as + discussed in Section 4.2. A standing queue occurs when the + minimum queue over the "long" time is non-zero and is usually + tolerable and even desirable as long as it does not exceed some + target delay. + + bottleneck bandwidth: the limiting bandwidth along a network path. + +3. Understanding the Building Blocks of Queue Management + + At the heart of queue management is the notion of "good" queue and + "bad" queue and the search for ways to get rid of the "bad" queue + (which only adds delay) while preserving the "good" queue (which + provides for good utilization). This section explains queueing, both + good and bad, and covers the CoDel building blocks that can be used + to manage packet buffers to keep their queues in the "good" range. + + Packet queues form in buffers facing bottleneck links, i.e., where + the line rate goes from high to low or where many links converge. 
+ The well-known bandwidth-delay product (sometimes called "pipe size") + is the bottleneck's bandwidth multiplied by the sender-receiver- + sender round-trip delay; it is the amount of data that has to be in + transit between two hosts in order to run the bottleneck link at 100% + utilization. To explore how queues can form, consider a long-lived + TCP connection with a 25-packet window sending through a connection + with a bandwidth-delay product of 20 packets. After an initial burst + of packets, the connection will settle into a 5-packet (+/-1) + standing queue; this standing queue size is determined by the + mismatch between the window size and the pipe size and is unrelated + to the connection's sending rate. The connection has 25 packets in + flight at all times, but only 20 packets arrive at the destination + over a round-trip time. If the TCP connection has a 30-packet + window, the queue will be 10 packets with no change in sending rate. + Similarly, if the window is 20 packets, there will be no queue, but + the sending rate is the same. Nothing can be inferred about the + sending rate from the queue size, and any queue other than transient + bursts only creates delays in the network. The sender needs to + reduce the number of packets in flight rather than the sending rate. + + + +Nichols, et al. Experimental [Page 5] + +RFC 8289 CoDel January 2018 + + + In the above example, the 5-packet standing queue can be seen to + contribute nothing but delay to the connection and thus is clearly + "bad" queue. If, in our example, there is a single bottleneck link + and it is much slower than the link that feeds it (say, a high-speed + Ethernet link into a limited DSL uplink), then a 20-packet buffer at + the bottleneck might be necessary to temporarily hold the 20 packets + in flight to keep the bottleneck link's utilization high. The burst + of packets should drain completely (to 0 or 1 packets) within a + round-trip time, and this transient queue is "good" queue because it + allows the connection to keep the 20 packets in flight and the + bottleneck link to be fully utilized. In terms of the delay + experienced, the "good" queue goes away in about a round-trip time, + while "bad" queue hangs around for longer, causing delays. + + Effective queue management detects "bad" queue while ignoring "good" + queue and takes action to get rid of the "bad" queue when it is + detected. The goal is a queue controller that accomplishes this + objective. To control a queue, we need three basic components: + + o Estimator - To figure out what we've got. + + o Target setpoint - To know what we want. + + o Control loop - If what we've got isn't what we want, we need a way + to move it there. + +3.1. Estimator + + The estimator both observes the queue and detects when "good" queue + turns to "bad" queue and vice versa. CoDel has two parts to its + estimator: what is observed as an indicator of queue and how the + observations are used to detect "good"/"bad" queue. + + Queue length has been widely used as an observed indicator of + congestion and is frequently conflated with sending rate. Use of + queue length as a metric is sensitive to how and when the length is + observed. A high-speed arrival link to a buffer serviced at a much + lower rate can rapidly build up a queue that might disperse + completely or down to a single packet before a round-trip time has + elapsed. 
If the queue length is monitored at packet arrival (as in + original Random Early Detection (RED)) or departure time, every + packet will see a queue with one possible exception. If the queue + length itself is time sampled (as recommended in [REDL1998]), a truer + picture of the queue's occupancy can be gained at the expense of + considerable implementation complexity. + + + + + + +Nichols, et al. Experimental [Page 6] + +RFC 8289 CoDel January 2018 + + + The use of queue length is further complicated in networks that are + subject to both short- and long-term changes in available link rate + (as in WiFi). Link rate drops can result in a spike in queue length + that should be ignored unless it persists. It is not the queue + length that should be controlled but the amount of excess delay + packets experience due to a persistent or standing queue, which means + that the packet sojourn time in the buffer is exactly what we want to + track. Tracking the packet sojourn times in the buffer observes the + actual delay experienced by each packet. Sojourn time allows queue + management to be independent of link rate, gives superior performance + to use of buffer size, and is directly related to user-visible + performance. It works regardless of line rate changes or link + sharing by multiple queues (which the individual queues may + experience as changing rates). + + Consider a link shared by two queues with different priorities. + Packets that arrive at the high-priority queue are sent as soon as + the link is available, while packets in the other queue have to wait + until the high-priority queue is empty (i.e., a strict priority + scheduler). The number of packets in the high-priority queue might + be large, but the queue is emptied quickly, and the amount of time + each packet spends enqueued (the sojourn time) is not large. The + other queue might have a smaller number of packets, but packet + sojourn times will include the waiting time for the high-priority + packets to be sent. This makes the sojourn time a good sample of the + congestion that each separate queue is experiencing. This example + also shows how the metric of sojourn time is independent of the + number of queues or the service discipline used and is instead + indicative of congestion seen by the individual queues. + + How can observed sojourn time be used to separate "good" queue from + "bad" queue? Although averages, especially of queue length, have + previously been widely used as an indicator of "bad" queue, their + efficacy is questionable. Consider the burst that disperses every + round-trip time. The average queue will be one-half the burst size, + though this might vary depending on when the average is computed and + the timing of arrivals. The average queue sojourn time would be one- + half the time it takes to clear the burst. The average then would + indicate a persistent queue where there is none. Instead of + averages, we recommend tracking the minimum sojourn time; then, if + there is one packet that has a zero sojourn time, there is no + persistent queue. + + A persistent queue can be detected by tracking the (local) minimum + queue delay packets experience. To ensure that this minimum value + does not become stale, it has to have been experienced recently, + i.e., during an appropriate past time interval. This interval is the + maximum amount of time a minimum value is considered to be in effect + + + +Nichols, et al. 
Experimental [Page 7] + +RFC 8289 CoDel January 2018 + + + and is related to the amount of time it takes for the largest + expected burst to drain. Conservatively, this interval SHOULD be at + least a round-trip time to avoid falsely detecting a persistent queue + and not a lot more than a round-trip time to avoid delay in detecting + the persistent queue. This suggests that the appropriate interval + value is the maximum round-trip time of all the connections sharing + the buffer. + + Note that the following key insight makes computation of the local + minimum efficient: it is sufficient to keep a single state variable + that indicates how long the minimum has been above or below the + target value rather than retaining all the local values to compute + the minimum, which leads to both storage and computational savings. + We use this insight in the pseudocode for CoDel later in the + document. + + These two parts, use of sojourn time as the observed value and the + local minimum as the statistic to monitor queue congestion, are key + to CoDel's estimator building block. The local minimum sojourn time + provides an accurate and robust measure of standing queue and has an + efficient implementation. In addition, use of the minimum sojourn + time has important advantages in implementation. The minimum packet + sojourn can only be decreased when a packet is dequeued, which means + that all the work of CoDel can take place when packets are dequeued + for transmission and that no locks are needed in the implementation. + The minimum is the only statistic with this property. + + A more detailed explanation with many pictures can be found in + [TSV84]. + +3.2. Target Setpoint + + Now that we have a robust way of detecting standing queue, we need a + target setpoint that tells us when to act. If the controller is set + to take action as soon as the estimator has a non-zero value, the + average drop rate will be maximized, which minimizes TCP goodput + [MACTCP1997]. Also, this policy results in no backlog over time (no + persistent queue), which negates much of the value of having a + buffer, since it maximizes the bottleneck link bandwidth lost due to + normal stochastic variation in packet interarrival time. We want a + target that maximizes utilization while minimizing delay. Early in + the history of packet networking, Kleinrock developed the analytic + machinery to do this using a quantity he called "power", which is the + ratio of a normalized throughput to a normalized delay [KLEIN81]. + + + + + + + +Nichols, et al. Experimental [Page 8] + +RFC 8289 CoDel January 2018 + + + It is straightforward to derive an analytic expression for the + average goodput of a TCP conversation at a given round-trip time r + and target f (where f is expressed as a fraction of r). Reno TCP, + for example, yields: + + goodput = r (3 + 6f - f^2) / (4 (1+f)) + + Since the peak queue delay is simply the product of f and r, power is + solely a function of f since the r's in the numerator and denominator + cancel: + + power is proportional to (1 + 2f - 1/3 f^2) / (1 + f)^2 + + As Kleinrock observed, the best operating point (in terms of + bandwidth/delay trade-off) is the peak power point, since points off + the peak represent a higher cost (in delay) per unit of bandwidth. + The power vs. f curve for any Additive Increase Multiplicative + Decrease (AIMD) TCP is monotone decreasing. 
But the curve is very + flat for f < 0.1, followed by an increasing curvature with a knee + around f = 0.2, then a steep, almost linear fall off [TSV84]. Since + the previous equation showed that goodput is monotone increasing with + f, the best operating point is near the right edge of the flat top, + since that represents the highest goodput achievable for a negligible + increase in delay. However, since the r in the model is a + conservative upper bound, a target of 0.1r runs the risk of pushing + shorter RTT connections over the knee and giving them higher delay + for no significant goodput increase. Generally, a more conservative + target of 0.05r offers a good utilization vs. delay trade-off while + giving enough headroom to work well with a large variation in real + RTT. + + As the above analysis shows, a very small standing queue gives close + to 100% utilization of the bottleneck link. While this result was + for Reno TCP, the derivation uses only properties that must hold for + any "TCP friendly" transport. We have verified by both analysis and + simulation that this result holds for Reno, Cubic, and Westwood + [TSV84]. This results in a particularly simple form for the target: + the ideal range for the permitted standing queue, or the target + setpoint, is between 5% and 10% of the TCP connection's RTT. + + We used simulation to explore the impact when TCPs are mixed with + other traffic and with connections of different RTTs. Accordingly, + we experimented extensively with values in the 5-10% of RTT range + and, overall, used target values between 1 and 20 milliseconds for + RTTs from 30 to 500 ms and link bandwidths of 64 Kbps to 100 Mbps to + experimentally explore a value for the target that gives consistently + + + + + +Nichols, et al. Experimental [Page 9] + +RFC 8289 CoDel January 2018 + + + high utilization while controlling delay across a range of + bandwidths, RTTs, and traffic loads. Our results were notably + consistent with the mathematics above. + + A congested (but not overloaded) CoDel link with traffic composed + solely or primarily of long-lived TCP flows will have a median delay + through the link that will tend to the target. For bursty traffic + loads and for overloaded conditions (where it is difficult or + impossible for all the arriving flows to be accommodated), the median + queues will be longer than the target. + + The non-starvation drop inhibit feature dominates where the link rate + becomes very small. By inhibiting drops when there is less than an + (outbound link) MTU worth of bytes in the buffer, CoDel adapts to + very low bandwidth links, as shown in [CODEL2012]. + +3.3. Control Loop + + Section 3.1 describes a simple, reliable way to measure "bad" + (persistent) queue. Section 3.2 shows that TCP congestion control + dynamics gives rise to a target setpoint for this measure that's a + provably good balance between enhancing throughput and minimizing + delay. Section 3.2 also shows that this target is a constant + fraction of the same "largest average RTT" interval used to + distinguish persistent from transient queue. The only remaining + building block needed for a basic AQM is a "control loop" algorithm + to effectively drive the queueing system from any "persistent queue + above the target" state to a state where the persistent queue is + below the target. + + Control theory provides a wealth of approaches to the design of + control loops. 
Most of classical control theory deals with the + control of linear, time-invariant, Single-Input-Single-Output (SISO) + systems. Control loops for these systems generally come from a well- + understood class known as Proportional-Integral-Derivative (PID) + controllers. Unfortunately, a queue is not a linear system, and an + AQM operates at the point of maximum non-linearity (where the output + link bandwidth saturates, so increased demand creates delay rather + than higher utilization). Output queues are also not time invariant + since traffic is generally a mix of connections that start and stop + at arbitrary times and that can have radically different behaviors + ranging from "open-loop" UDP audio/video to "closed-loop" congestion- + avoiding TCP. Finally, the constantly changing mix of connections + (which can't be converted to a single "lumped parameter" model + because of their transfer function differences) makes the system + Multi-Input-Multi-Output (MIMO), not SISO. + + + + + +Nichols, et al. Experimental [Page 10] + +RFC 8289 CoDel January 2018 + + + Since queueing systems match none of the prerequisites for a + classical controller, a better approach is a modern state-space + controller with "no persistent queue" and "has persistent queue" + states. Since Internet traffic mixtures change rapidly and + unpredictably, a noise- and error-tolerant adaptation algorithm like + stochastic gradient is a good choice. Since there's essentially no + information in the amount of persistent queue [TSV84], the adaptation + should be driven by how long it has persisted. + + Consider the two extremes of traffic behavior: a single, open-loop + UDP video stream and a single, long-lived TCP bulk data transfer. If + the average bandwidth of the UDP video stream is greater than the + bottleneck link rate, the link's queue will grow, and the controller + will eventually enter "has persistent queue" state and start dropping + packets. Since the video stream is open loop, its arrival rate is + unaffected by drops, so the queue will persist until the average drop + rate is greater than the output bandwidth deficit (= average arrival + rate - average departure rate); the job of the adaptation algorithm + is to discover this rate. For this example, the adaptation could + consist of simply estimating the arrival and departure rates and then + dropping at a rate slightly greater than their difference, but this + class of algorithm won't work at all for the bulk data TCP stream. + TCP runs in closed-loop flow balance [TSV84], so its arrival rate is + almost always exactly equal to the departure rate -- the queue isn't + the result of a rate imbalance but rather a mismatch between the TCP + sender's window and the source-destination-source round-trip path + capacity (i.e., the connection's bandwidth-delay product). The + sender's TCP congestion avoidance algorithm will slowly increase the + send window (one packet per round-trip time) [RFC5681], which will + eventually cause the bottleneck to enter "has persistent queue" + state. But, since the average input rate is the same as the average + output rate, the rate deficit estimation that gave the correct drop + rate for the video stream would compute a drop rate of zero for the + TCP stream. 
However, if the output link drops one packet as it + enters "has persistent queue" state, when the sender discovers this + (via TCP's normal packet loss repair mechanisms), it will reduce its + window by a factor of two [RFC5681]; so, one round-trip time after + the drop, the persistent queue will go away. + + If there were N TCP conversations sharing the bottleneck, the + controller would have to drop O(N) packets (one from each + conversation) to make all the conversations reduce their window to + get rid of the persistent queue. If the traffic mix consists of + short (<= bandwidth-delay product) conversations, the aggregate + behavior becomes more like the open-loop video example since each + conversation is likely to have already sent all its packets by the + time it learns about a drop so each drop has negligible effect on + subsequent traffic. + + + +Nichols, et al. Experimental [Page 11] + +RFC 8289 CoDel January 2018 + + + The controller does not know the number, duration, or kind of + conversations creating its queue, so it has to learn the appropriate + response. Since single drops can have a large effect if the degree + of multiplexing (the number of active conversations) is small, + dropping at too high a rate is likely to have a catastrophic effect + on throughput. Dropping at a low rate (< 1 packet per round-trip + time) and then increasing the drop rate slowly until the persistent + queue goes below the target is unlikely to overdrop and is guaranteed + to eventually dissipate the persistent queue. This stochastic + gradient learning procedure is the core of CoDel's control loop (the + gradient exists because a drop always reduces the (instantaneous) + queue, so an increasing drop rate always moves the system "down" + toward no persistent queue, regardless of traffic mix). + + The "next drop time" is decreased in inverse proportion to the square + root of the number of drops since the drop state was entered, using + the well-known non-linear relationship of drop rate to throughput to + get a linear change in throughput [REDL1998][MACTCP1997]. + + Since the best rate to start dropping is at slightly more than one + packet per RTT, the controller's initial drop rate can be directly + derived from the estimator's interval. When the minimum sojourn time + first crosses the target and CoDel drops a packet, the earliest the + controller could see the effect of the drop is the round-trip time + (interval) + the local queue wait time (the target). If the next + drop happens any earlier than this time (interval + target), CoDel + will overdrop. In practice, the local queue waiting time tends to + vary, so making the initial drop spacing (i.e., the time to the + second drop) be exactly the minimum possible also leads to + overdropping. Analysis of simulation and real-world measured data + shows that the 75th percentile magnitude of this variation is less + than the target, so the initial drop spacing SHOULD be set to the + estimator's interval (i.e., initial drop spacing = interval) to + ensure that the controller has accounted for acceptable congestion + delays. + + Use of the minimum statistic lets the controller be placed in the + dequeue routine with the estimator. This means that the control + signal (the drop) can be sent at the first sign of "bad" queue (as + indicated by the sojourn time) and that the controller can stop + acting as soon as the sojourn time falls below the target. Dropping + at dequeue has both implementation and control advantages. 
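
   As a concrete illustration of the square-root pacing described
   above, the short C++ sketch below prints the spacing between
   successive drops for INTERVAL = 100 ms.  It is an editorial
   illustration only (the program and its loop bound are ours); the
   authoritative pseudocode appears in Section 5.

      #include <cmath>
      #include <cstdio>

      // Illustration only: while the persistent queue stays above
      // TARGET, each drop is scheduled INTERVAL/sqrt(count) after
      // the previous one, where count is the number of drops since
      // the drop state was entered.
      int main() {
          const double interval_ms = 100.0;  // INTERVAL (Section 4.2)
          double t = 0.0;                    // time of the first drop
          for (unsigned count = 1; count <= 8; ++count) {
              double spacing = interval_ms / std::sqrt(count);
              std::printf("drop %u at t = %6.1f ms, next drop in %5.1f ms\n",
                          count, t, spacing);
              t += spacing;
          }
          return 0;
      }

   With INTERVAL = 100 ms, the first few spacings are 100, 70.7,
   57.7, and 50.0 ms, so the drop rate ramps up only gradually,
   giving closed-loop (TCP) traffic time to respond before the next
   drop is taken.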
+ + + + + + + + + +Nichols, et al. Experimental [Page 12] + +RFC 8289 CoDel January 2018 + + +4. Overview of the CoDel AQM + + CoDel was initially designed as a bufferbloat solution for the + consumer network edge. The CoDel building blocks are able to adapt + to different or time-varying link rates, be easily used with multiple + queues, have excellent utilization with low delay, and have a simple + and efficient implementation. + + The CoDel algorithm described in the rest of this document uses two + key variables: TARGET, which is the controller's target setpoint + described in Section 3.2, and INTERVAL, which is the estimator's + interval described in Section 3.3. + + The only setting CoDel requires is the INTERVAL value, and as 100 ms + satisfies that definition for normal Internet usage, CoDel can be + parameter-free for consumer use. To ensure that link utilization is + not adversely affected, CoDel's estimator sets TARGET to one that + optimizes power. CoDel's controller does not drop packets when the + drop would leave the queue empty or with fewer than a Maximum + Transmission Unit (MTU) worth of bytes in the buffer. Section 3.2 + shows that an ideal TARGET is 5-10% of the connection round-trip time + (RTT). In the open terrestrial-based Internet, especially at the + consumer edge, we expect most unbloated RTTs to have a ceiling of 100 + ms [CHARB2007]. Using this RTT gives a minimum TARGET of 5 ms and + INTERVAL of 100 ms. In practice, uncongested links will see sojourn + times below TARGET more often than once per RTT, so the estimator is + not overly sensitive to the value of INTERVAL. + + When the estimator finds a persistent delay above TARGET, the + controller enters the drop state where a packet is dropped, and the + next drop time is set. As discussed in Section 3.3, the initial next + drop spacing is intended to be long enough to give the endpoints time + to react to the single drop and therefore SHOULD be set to a value + equal to INTERVAL. If the estimator's output falls below TARGET, the + controller cancels the next drop and exits the drop state. (The + controller is more sensitive than the estimator to an overly short + INTERVAL value, since an unnecessary drop would occur and lower link + utilization). If the next drop time is reached while the controller + is still in drop state, the packet being dequeued is dropped, and the + next drop time is recalculated. + + Additional logic prevents re-entering the drop state too soon after + exiting it and resumes the drop state at a recent control level, if + one exists. This logic is described more precisely in the pseudocode + below. Additional work is required to determine the frequency and + importance of re-entering the drop state. + + + + + +Nichols, et al. Experimental [Page 13] + +RFC 8289 CoDel January 2018 + + + Note that CoDel AQM only enters its drop state when the local minimum + sojourn delay has exceeded TARGET for a time period long enough for + normal bursts to dissipate, ensuring that a burst of packets that + fits in the pipe will not be dropped. + +4.1. Non-starvation + + CoDel's goals are to control delay with little or no impact on link + utilization and to be deployed on a wide range of link bandwidths, + including variable-rate links, without reconfiguration. To keep from + making drops when it would starve the output link, CoDel makes + another check before dropping to see if at least an MTU worth of + bytes remains in the buffer. 
If not, the packet SHOULD NOT be + dropped; therefore, CoDel exits the drop state. The MTU size can be + set adaptively to the largest packet seen so far or can be read from + the interface driver. + +4.2. Setting INTERVAL + + The INTERVAL value is chosen to give endpoints time to react to a + drop without being so long that response times suffer. CoDel's + estimator, TARGET, and control loop all use INTERVAL. Understanding + their derivation shows that CoDel is the most sensitive to the value + of INTERVAL for single long-lived TCPs with a decreased sensitivity + for traffic mixes. This is fortunate, as RTTs vary across + connections and are not known a priori. The best policy seems to be + to use an INTERVAL value slightly larger than the RTT seen by most of + the connections using a link, a value that can be determined as the + largest RTT seen if the value is not an outlier (use of a 95-99th + percentile value should work). In practice, this value is not known + or measured (however, see Appendix A for an application where + INTERVAL is measured). An INTERVAL setting of 100 ms works well + across a range of RTTs from 10 ms to 1 second (excellent performance + is achieved in the range from 10 ms to 300 ms). For devices intended + for the normal terrestrial Internet, INTERVAL SHOULD have a value of + 100 ms. This will only cause overdropping where a long-lived TCP has + an RTT longer than 100 ms and there is little or no mixing with other + connections through the link. + +4.3. Setting TARGET + + TARGET is the maximum acceptable persistent queue delay above which + CoDel is dropping or preparing to drop and below which CoDel will not + drop. TARGET SHOULD be set to 5 ms for normal Internet traffic. + + The calculations of Section 3.2 show that the best TARGET value is + 5-10% of the RTT, with the low end of 5% preferred. Extensive + simulations exploring the impact of different TARGET values when used + + + +Nichols, et al. Experimental [Page 14] + +RFC 8289 CoDel January 2018 + + + with mixed traffic flows with different RTTs and different bandwidths + show that below a TARGET of 5 ms, utilization suffers for some + conditions and traffic loads; above 5 ms showed very little or no + improvement in utilization. + + Sojourn times must remain above the TARGET for INTERVAL amount of + time in order to enter the drop state. Any packet with a sojourn + time less than TARGET will reset the time that the queue was last + below TARGET. Since Internet traffic has very dynamic + characteristics, the actual sojourn delay experienced by packets + varies greatly and is often less than TARGET unless the overload is + excessive. When a link is not overloaded, it is not a bottleneck, + and packet sojourn times will be small or nonexistent. In the usual + case, there are only one or two places along a path where packets + will encounter a bottleneck (usually at the edge), so the total + amount of queueing delay experienced by a packet should be less than + 10 ms even under extremely congested conditions. This net delay is + substantially lower than common current queueing delays on the + Internet that grow to orders of seconds [NETAL2010] [CHARB2007]. + + Regarding the roles of TARGET and the minimum-tracking INTERVAL, note + that TARGET SHOULD NOT be increased in response to lower layers that + have a bursty nature, where packets are transmitted for short periods + interspersed with idle periods where the link is waiting for + permission to send. 
CoDel's estimator will "see" the effective + transmission rate over an INTERVAL amount of time, and increasing + TARGET only leads to longer queue delays. On the other hand, where a + significant additional delay is added to the intrinsic RTT of most or + all packets due to the waiting time for a transmission, it is + necessary to increase INTERVAL by that extra delay. TARGET SHOULD + NOT be adjusted for such short-term bursts, but INTERVAL MAY need to + be adjusted if the path's intrinsic RTT changes. + +4.4. Use with Multiple Queues + + CoDel is easily adapted to multiple queue systems. With other + approaches, there is always a question of how to account for the fact + that each queue receives less than the full link rate over time and + usually sees a varying rate over time. This is what CoDel excels at: + using a packet's sojourn time in the buffer completely circumvents + this problem. In a multiple-queue setting, a separate CoDel + algorithm runs on each queue, but each CoDel instance uses the packet + sojourn time the same way a single-queue CoDel does. Just as a + single-queue CoDel adapts to changing link bandwidths [CODEL2012], so + does a multiple-queue CoDel system. As an optimization to avoid + queueing more than necessary, when testing for queue occupancy before + dropping, the total occupancy of all queues sharing the same output + link SHOULD be used. This property of CoDel has been exploited in + + + +Nichols, et al. Experimental [Page 15] + +RFC 8289 CoDel January 2018 + + + fq_codel [RFC8290], which hashes on the packet header fields to + determine a specific bin, or sub-queue, for the packet and runs CoDel + on each bin or sub-queue, thus creating a well-mixed output flow and + obviating issues of reverse path flows (including "ack compression"). + +4.5. Setting Up CoDel + + CoDel is set for use in devices in the open Internet. An INTERVAL + setting of 100 ms is used, TARGET is set to 5% of INTERVAL, and the + initial drop spacing is also set to the INTERVAL. These settings + have been chosen so that a device, such as a small WiFi router, can + be sold without the need for any values to be made adjustable, + yielding a parameterless implementation. In addition, CoDel is + useful in environments with significantly different characteristics + from the normal Internet, for example, in switches used as a cluster + interconnect within a data center. Since cluster traffic is entirely + internal to the data center, round-trip latencies are low (typically + <100 us) but bandwidths are high (1-40 Gbps), so it's relatively easy + for the aggregation phase of a distributed computation (e.g., the + Reduce part of a Map/Reduce) to persistently fill and then overflow + the modest per-port buffering available in most high-speed switches. + A CoDel configured for this environment (TARGET and INTERVAL in the + microsecond rather than millisecond range) can minimize drops or + Explicit Congestion Notification (ECN) marks while keeping throughput + high and latency low. + + Devices destined for these environments MAY use a different value for + INTERVAL, where suitable. If appropriate analysis indicates, the + TARGET MAY be set to some other value in the 5-10% of INTERVAL, and + the initial drop spacing MAY be set to a value of 1.0 to 1.2 times + INTERVAL. But these settings will cause problems, such as + overdropping and low throughput, if used on the open Internet, so + devices that allow CoDel to be configured SHOULD default to the + Internet-appropriate values given in this document. 
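
   To make the two deployment profiles discussed above concrete, the
   C++ sketch below collects the relevant constants into one
   structure.  The Internet values are the defaults given in this
   document; the data-center values are illustrative assumptions
   only, obtained by applying the same "TARGET = 5% of INTERVAL" rule
   to a hypothetical 100-microsecond RTT ceiling, and would need the
   analysis described above before being deployed.

      #include <cstdint>

      // Constants corresponding to Section 5.3, in microseconds.
      struct codel_params {
          uint64_t target_us;        // TARGET: acceptable standing delay
          uint64_t interval_us;      // INTERVAL: sliding-minimum window
          uint64_t first_spacing_us; // initial drop spacing
      };

      // Open-Internet defaults from this document.
      constexpr codel_params internet_defaults = {5000, 100000, 100000};

      // Hypothetical data-center profile (assumed 100 us RTT ceiling);
      // illustrative only, not a recommendation of this document.
      constexpr codel_params datacenter_example = {5, 100, 100};

   An implementation that exposes a profile choice would still
   default to the Internet values, per the SHOULD above.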
+ +5. Annotated Pseudocode for CoDel AQM + + What follows is the CoDel algorithm in C++-like pseudocode. Since + CoDel adds relatively little new code to a basic tail-drop FIFO + queue, we have attempted to highlight just these additions by + presenting CoDel as a sub-class of a basic FIFO queue base class. + The reference code is included to aid implementers who wish to apply + CoDel to queue management as described here or to adapt its + principles to other applications. + + + + + + + +Nichols, et al. Experimental [Page 16] + +RFC 8289 CoDel January 2018 + + + Implementors are strongly encouraged to also look at the Linux kernel + version of CoDel -- a well-written, well-tested, real-world, C-based + implementation. As of this writing, it is available at + https://github.com/torvalds/linux/blob/master/net/sched/sch_codel.c. + +5.1. Data Types + + time_t is an integer time value in units convenient for the system. + The code presented here uses 0 as a flag value to indicate "no time + set." + + packet_t* is a pointer to a packet descriptor. We assume it has a + tstamp field capable of holding a time_t and that the field is + available for use by CoDel (it will be set by the enqueue routine and + used by the dequeue routine). + + queue_t is a base class for queue objects (the parent class for + codel_queue_t objects). We assume it has enqueue() and dequeue() + methods that can be implemented in child classes. We assume it has a + bytes() method that returns the current queue size in bytes. This + can be an approximate value. The method is invoked in the dequeue() + method but shouldn't require a lock with the enqueue() method. + + flag_t is a Boolean. + +5.2. Per-Queue State (codel_queue_t Instance Variables) + + time_t first_above_time_ = 0; // Time to declare sojourn time above + // TARGET + time_t drop_next_ = 0; // Time to drop next packet + uint32_t count_ = 0; // Packets dropped in drop state + uint32_t lastcount_ = 0; // Count from previous iteration + flag_t dropping_ = false; // Set to true if in drop state + +5.3. Constants + + time_t TARGET = MS2TIME(5); // 5 ms TARGET queue delay + time_t INTERVAL = MS2TIME(100); // 100 ms sliding-minimum window + u_int maxpacket = 512; // Maximum packet size in bytes + // (SHOULD use interface MTU) + + + + + + + + + + + +Nichols, et al. Experimental [Page 17] + +RFC 8289 CoDel January 2018 + + +5.4. Enqueue Routine + + All the work of CoDel is done in the dequeue routine. The only CoDel + addition to enqueue is putting the current time in the packet's + tstamp field so that the dequeue routine can compute the packet's + sojourn time. Note that packets arriving at a full buffer will be + dropped, but these drops are not counted towards CoDel's + computations. + + void codel_queue_t::enqueue(packet_t* pkt) + { + pkt->tstamp = clock(); + queue_t::enqueue(pkt); + } + +5.5. Dequeue Routine + + This is the heart of CoDel. There are two branches based on whether + the controller is in drop state: (i) if the controller is in drop + state (that is, the minimum packet sojourn time is greater than + TARGET), then the controller checks if it is time to leave drop state + or schedules the next drop(s); or (ii) if the controller is not in + drop state, it determines if it should enter drop state and do the + initial drop. + + packet_t* CoDelQueue::dequeue() + { + time_t now = clock(); + dodequeue_result r = dodequeue(now); + uint32_t delta; + + if (dropping_) { + if (! 
r.ok_to_drop) { + // sojourn time below TARGET - leave drop state + dropping_ = false; + } + // Time for the next drop. Drop current packet and dequeue + // next. If the dequeue doesn't take us out of dropping + // state, schedule the next drop. A large backlog might + // result in drop rates so high that the next drop should + // happen now, hence the 'while' loop. + while (now >= drop_next_ && dropping_) { + drop(r.p); + ++count_; + r = dodequeue(now); + if (! r.ok_to_drop) { + // leave drop state + dropping_ = false; + + + +Nichols, et al. Experimental [Page 18] + +RFC 8289 CoDel January 2018 + + + } else { + // schedule the next drop. + drop_next_ = control_law(drop_next_, count_); + } + } + // If we get here, we're not in drop state. The 'ok_to_drop' + // return from dodequeue means that the sojourn time has been + // above 'TARGET' for 'INTERVAL', so enter drop state. + } else if (r.ok_to_drop) { + drop(r.p); + r = dodequeue(now); + dropping_ = true; + + // If min went above TARGET close to when it last went + // below, assume that the drop rate that controlled the + // queue on the last cycle is a good starting point to + // control it now. ('drop_next' will be at most 'INTERVAL' + // later than the time of the last drop, so 'now - drop_next' + // is a good approximation of the time from the last drop + // until now.) Implementations vary slightly here; this is + // the Linux version, which is more widely deployed and + // tested. + delta = count_ - lastcount_; + count_ = 1; + if ((delta > 1) && (now - drop_next_ < 16*INTERVAL)) + count_ = delta; + + drop_next_ = control_law(now, count_); + lastcount_ = count_; + } + return (r.p); + } + +5.6. Helper Routines + + Since the degree of multiplexing and nature of the traffic sources is + unknown, CoDel acts as a closed-loop servo system that gradually + increases the frequency of dropping until the queue is controlled + (sojourn time goes below TARGET). This is the control law that + governs the servo. It has this form because of the sqrt(p) + dependence of TCP throughput on drop probability. Note that for + embedded systems or kernel implementation, the inverse sqrt can be + computed efficiently using only integer multiplication. + + time_t codel_queue_t::control_law(time_t t, uint32_t count) + { + return t + INTERVAL / sqrt(count); + } + + + +Nichols, et al. Experimental [Page 19] + +RFC 8289 CoDel January 2018 + + + Next is a helper routine that does the actual packet dequeue and + tracks whether the sojourn time is above or below TARGET and, if + above, if it has remained above continuously for at least INTERVAL + amount of time. It returns two values: a Boolean indicating if it is + OK to drop (sojourn time above TARGET for at least INTERVAL) and the + packet dequeued. + + typedef struct { + packet_t* p; + flag_t ok_to_drop; + } dodequeue_result; + + dodequeue_result codel_queue_t::dodequeue(time_t now) + { + dodequeue_result r = { queue_t::dequeue(), false }; + if (r.p == NULL) { + // queue is empty - we can't be above TARGET + first_above_time_ = 0; + return r; + } + + // To span a large range of bandwidths, CoDel runs two + // different AQMs in parallel. One is based on sojourn time + // and takes effect when the time to send an MTU-sized + // packet is less than TARGET. The 1st term of the "if" + // below does this. The other is based on backlog and takes + // effect when the time to send an MTU-sized packet is >= + // TARGET. 
The goal here is to keep the output link + // utilization high by never allowing the queue to get + // smaller than the amount that arrives in a typical + // interarrival time (MTU-sized packets arriving spaced + // by the amount of time it takes to send such a packet on + // the bottleneck). The 2nd term of the "if" does this. + time_t sojourn_time = now - r.p->tstamp; + if (sojourn_time_ < TARGET || bytes() <= maxpacket_) { + // went below - stay below for at least INTERVAL + first_above_time_ = 0; + } else { + if (first_above_time_ == 0) { + // just went above from below. if still above at + // first_above_time, will say it's ok to drop. + first_above_time_ = now + INTERVAL; + } else if (now >= first_above_time_) { + r.ok_to_drop = true; + } + } + return r; + } + + + +Nichols, et al. Experimental [Page 20] + +RFC 8289 CoDel January 2018 + + +5.7. Implementation Considerations + + time_t is an integer time value in units convenient for the system. + Resolution to at least a millisecond is required, and better + resolution is useful up to the minimum possible packet time on the + output link; 64- or 32-bit widths are acceptable but with 32 bits the + resolution should be no finer than 2^{-16} to leave enough dynamic + range to represent a wide range of queue waiting times. Narrower + widths also have implementation issues due to overflow (wrapping) and + underflow (limit cycles because of truncation to zero) that are not + addressed in this pseudocode. + + Since CoDel requires relatively little per-queue state and no direct + communication or state sharing between the enqueue and dequeue + routines, it is relatively simple to add CoDel to almost any packet + processing pipeline, including forwarding engines based on + Application-Specific Integrated Circuits (ASICs) or Network + Processors (NPUs). One issue to consider is dodequeue()'s use of a + 'bytes()' function to determine the current queue size in bytes. + This value does not need to be exact. If the enqueue part of the + pipeline keeps a running count of the total number of bytes it has + put into the queue, and the dequeue routine keeps a running count of + the total bytes it has removed from the queue, 'bytes()' is simply + the difference between these two counters (32-bit counters should be + adequate). Enqueue has to update its counter once per packet queued, + but it does not matter when (before, during, or after the packet has + been added to the queue). The worst that can happen is a slight, + transient underestimate of the queue size, which might cause a drop + to be briefly deferred. + +6. Further Experimentation + + We encourage experimentation with the recommended values of TARGET + and INTERVAL for Internet settings. CoDel provides general, + efficient, parameterless building blocks for queue management that + can be applied to single or multiple queues in a variety of data + networking scenarios. CoDel's settings may be modified for other + special-purpose networking applications. + +7. Security Considerations + + This document describes an active queue management algorithm for + implementation in networked devices. There are no known security + exposures associated with CoDel at this time. + +8. IANA Considerations + + This document does not require actions by IANA. + + + +Nichols, et al. Experimental [Page 21] + +RFC 8289 CoDel January 2018 + + +9. References + +9.1. 
Normative References + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + <https://www.rfc-editor.org/info/rfc2119>. + + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, <https://www.rfc-editor.org/info/rfc8174>. + +9.2. Informative References + + [BLOAT] Gettys, J. and K. Nichols, "Bufferbloat: Dark Buffers in + the Internet", Communications of the ACM, Volume 55, Issue + 1, DOI 10.1145/2063176.2063196, January 2012. + + [CHARB2007] + Dischinger, M., Haeberlen, A., Gummadi, K., and S. Saroiu, + "Characterizing Residential Broadband Networks", + Proceedings of the 7th ACM SIGCOMM Conference on Internet + Measurement, DOI 10.1145/1298306.1298313, October 2007. + + [CODEL2012] + Nichols, K. and V. Jacobson, "Controlling Queue Delay", + ACM Queue, Volume 10, Issue 5, + DOI 10.1145/2208917.2209336, May 2012. + + [KLEIN81] Kleinrock, L. and R. Gail, "An Invariant Property of + Computer Network Power", Proceedings of the International + Conference on Communications, June 1981, + <http://www.lk.cs.ucla.edu/data/files/Gail/power.pdf>. + + [MACTCP1997] + Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The + Macroscopic Behavior of the TCP Congestion Avoidance + Algorithm", ACM SIGCOMM Computer Communications + Review, Volume 27, Issue 3, pp. 67-82, + DOI 10.1145/263932.264023, July 1997. + + [NETAL2010] + Kreibich, C., Weaver, N., Paxson, V., and B. Nechaev, + "Netalyzr: Illuminating the Edge Network", Proceedings of + the 10th ACM SIGCOMM Conference on Internet Measurement, + DOI 10.1145/1879141.1879173, November 2010. + + + + +Nichols, et al. Experimental [Page 22] + +RFC 8289 CoDel January 2018 + + + [REDL1998] Nichols, K., Jacobson, V., and K. Poduri, "RED in a + Different Light", Technical Report, Cisco Systems, + September 1999, <http://citeseerx.ist.psu.edu/viewdoc/ + summary?doi=10.1.1.22.9406>. + + [RFC896] Nagle, J., "Congestion Control in IP/TCP Internetworks", + RFC 896, DOI 10.17487/RFC0896, January 1984, + <https://www.rfc-editor.org/info/rfc896>. + + [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, + S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., + Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, + S., Wroclawski, J., and L. Zhang, "Recommendations on + Queue Management and Congestion Avoidance in the + Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, + <https://www.rfc-editor.org/info/rfc2309>. + + [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion + Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, + <https://www.rfc-editor.org/info/rfc5681>. + + [RFC8290] Hoeiland-Joergensen, T., McKenney, P., Taht, D., + Gettys, J., and E. Dumazet, "The Flow Queue CoDel Packet + Scheduler and Active Queue Management Algorithm", + RFC 8290, DOI 10.17487/RFC8290, January 2018, + <https://www.rfc-editor.org/info/rfc8290>. + + [TSV84] Jacobson, V., "CoDel", IETF 84, Transport Area Open + Meeting, July 2012, + <http://www.ietf.org/proceedings/84/slides/ + slides-84-tsvarea-4.pdf>. + + [VANQ2006] Jacobson, V., "A Rant on Queues", Talk at MIT Lincoln + Labs, Lexington, MA, July 2006, + <http://www.pollere.net/Pdfdocs/QrantJul06.pdf>. + + + + + + + + + + + + + + + + +Nichols, et al. Experimental [Page 23] + +RFC 8289 CoDel January 2018 + + +Appendix A. 
Applying CoDel in the Data Center + + Nandita Dukkipati and her group at Google realized that the CoDel + building blocks could be applied to bufferbloat problems in data- + center servers, not just to Internet routers. The Linux CoDel + queueing discipline (qdisc) was adapted in three ways to tackle this + bufferbloat problem. + + 1. The default CoDel action was modified to be a direct feedback + from qdisc to the TCP layer at dequeue. The direct feedback + simply reduces TCP's congestion window just as congestion control + would do in the event of drop. The scheme falls back to ECN + marking or packet drop if the TCP socket lock could not be + acquired at dequeue. + + 2. Being located in the server makes it possible to monitor the + actual RTT to use as CoDel's interval rather than making a "best + guess" of RTT. The CoDel interval is dynamically adjusted by + using the maximum TCP round-trip time (RTT) of those connections + sharing the same qdisc/bucket. In particular, there is a history + entry of the maximum RTT experienced over the last second. As a + packet is dequeued, the RTT estimate is accessed from its TCP + socket. If the estimate is larger than the current CoDel + interval, the CoDel interval is immediately refreshed to the new + value. If the CoDel interval is not refreshed for over a second, + it is decreased to the history entry, and the process is + repeated. The use of the dynamic TCP RTT estimate allows the + interval to adapt to the actual maximum value currently seen and + thus lets the controller space its drop intervals appropriately. + + 3. Since the mathematics of computing the setpoint are invariant, a + TARGET of 5% of the RTT or CoDel interval was used here. + + Non-data packets were not dropped, as these are typically small and + sometimes critical control packets. Being located on the server, + there is no concern with misbehaving users as there would be on the + public Internet. + + In several data-center workload benchmarks, which are typically + bursty, CoDel reduced the queueing latencies at the qdisc and thereby + improved the mean and 99th-percentile latencies from several tens of + milliseconds to less than one millisecond. The minimum tracking part + of the CoDel framework proved useful in disambiguating "good" queue + versus "bad" queue, which is particularly helpful in controlling + qdisc buffers that are inherently bursty because of TCP Segmentation + Offload (TSO). + + + + + +Nichols, et al. Experimental [Page 24] + +RFC 8289 CoDel January 2018 + + +Acknowledgments + + The authors thank Jim Gettys for the constructive nagging that made + us get the work "out there" before we thought it was ready. We thank + Dave Taht, Eric Dumazet, and the open source community for showing + the value of getting it "out there" and for making it real. We thank + Nandita Dukkipati for contributions to Section 6 and for comments + that helped to substantially improve this document. We thank the AQM + Working Group and the Transport Area Shepherd, Wes Eddy, for + patiently prodding this document all the way to publication as an + RFC. + +Authors' Addresses + + Kathleen Nichols + Pollere, Inc. + PO Box 370201 + Montara, CA 94037 + United States of America + + Email: nichols@pollere.com + + + Van Jacobson + Google + + Email: vanj@google.com + + + Andrew McGregor (editor) + Google + + Email: andrewmcgr@google.com + + + Janardhan Iyengar (editor) + Google + + Email: jri@google.com + + + + + + + + + + + + +Nichols, et al. Experimental [Page 25] + |