summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc8257.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc8257.txt')
-rw-r--r--doc/rfc/rfc8257.txt955
1 files changed, 955 insertions, 0 deletions
diff --git a/doc/rfc/rfc8257.txt b/doc/rfc/rfc8257.txt
new file mode 100644
index 0000000..d9ddb2a
--- /dev/null
+++ b/doc/rfc/rfc8257.txt
@@ -0,0 +1,955 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) S. Bensley
+Request for Comments: 8257 D. Thaler
+Category: Informational P. Balasubramanian
+ISSN: 2070-1721 Microsoft
+ L. Eggert
+ NetApp
+ G. Judd
+ Morgan Stanley
+ October 2017
+
+
+ Data Center TCP (DCTCP): TCP Congestion Control for Data Centers
+
+Abstract
+
+ This Informational RFC describes Data Center TCP (DCTCP): a TCP
+ congestion control scheme for data-center traffic. DCTCP extends the
+ Explicit Congestion Notification (ECN) processing to estimate the
+ fraction of bytes that encounter congestion rather than simply
+ detecting that some congestion has occurred. DCTCP then scales the
+ TCP congestion window based on this estimate. This method achieves
+ high-burst tolerance, low latency, and high throughput with shallow-
+ buffered switches. This memo also discusses deployment issues
+ related to the coexistence of DCTCP and conventional TCP, discusses
+ the lack of a negotiating mechanism between sender and receiver, and
+ presents some possible mitigations. This memo documents DCTCP as
+ currently implemented by several major operating systems. DCTCP, as
+ described in this specification, is applicable to deployments in
+ controlled environments like data centers, but it must not be
+ deployed over the public Internet without additional measures.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Not all documents
+ approved by the IESG are a candidate for any level of Internet
+ Standard; see Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ https://www.rfc-editor.org/info/rfc8257.
+
+
+
+
+
+Bensley, et al. Informational [Page 1]
+
+RFC 8257 DCTCP October 2017
+
+
+Copyright Notice
+
+ Copyright (c) 2017 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (https://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
+ 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4
+ 3. DCTCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 5
+ 3.1. Marking Congestion on the L3 Switches and Routers . . . . 5
+ 3.2. Echoing Congestion Information on the Receiver . . . . . 5
+ 3.3. Processing Echoed Congestion Indications on the Sender . 7
+ 3.4. Handling of Congestion Window Growth . . . . . . . . . . 8
+ 3.5. Handling of Packet Loss . . . . . . . . . . . . . . . . . 8
+ 3.6. Handling of SYN, SYN-ACK, and RST Packets . . . . . . . . 9
+ 4. Implementation Issues . . . . . . . . . . . . . . . . . . . . 9
+ 4.1. Configuration of DCTCP . . . . . . . . . . . . . . . . . 9
+ 4.2. Computation of DCTCP.Alpha . . . . . . . . . . . . . . . 10
+ 5. Deployment Issues . . . . . . . . . . . . . . . . . . . . . . 11
+ 6. Known Issues . . . . . . . . . . . . . . . . . . . . . . . . 12
+ 7. Security Considerations . . . . . . . . . . . . . . . . . . . 12
+ 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 13
+ 9. References . . . . . . . . . . . . . . . . . . . . . . . . . 13
+ 9.1. Normative References . . . . . . . . . . . . . . . . . . 13
+ 9.2. Informative References . . . . . . . . . . . . . . . . . 14
+ Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 16
+ Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 16
+
+
+
+
+
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 2]
+
+RFC 8257 DCTCP October 2017
+
+
+1. Introduction
+
+ Large data centers necessarily need many network switches to
+ interconnect their many servers. Therefore, a data center can
+ greatly reduce its capital expenditure by leveraging low-cost
+ switches. However, such low-cost switches tend to have limited queue
+ capacities; thus, they are more susceptible to packet loss due to
+ congestion.
+
+ Network traffic in a data center is often a mix of short and long
+ flows, where the short flows require low latencies and the long flows
+ require high throughputs. Data centers also experience incast
+ bursts, where many servers send traffic to a single server at the
+ same time. For example, this traffic pattern is a natural
+ consequence of the MapReduce [MAPREDUCE] workload: the worker nodes
+ complete at approximately the same time, and all reply to the master
+ node concurrently.
+
+ These factors place some conflicting demands on the queue occupancy
+ of a switch:
+
+ o The queue must be short enough that it does not impose excessive
+ latency on short flows.
+
+ o The queue must be long enough to buffer sufficient data for the
+ long flows to saturate the path capacity.
+
+ o The queue must be long enough to absorb incast bursts without
+ excessive packet loss.
+
+ Standard TCP congestion control [RFC5681] relies on packet loss to
+ detect congestion. This does not meet the demands described above.
+ First, short flows will start to experience unacceptable latencies
+ before packet loss occurs. Second, by the time TCP congestion
+ control kicks in on the senders, most of the incast burst has already
+ been dropped.
+
+ [RFC3168] describes a mechanism for using Explicit Congestion
+ Notification (ECN) from the switches for detection of congestion.
+ However, this method only detects the presence of congestion, not its
+ extent. In the presence of mild congestion, the TCP congestion
+ window is reduced too aggressively, and this unnecessarily reduces
+ the throughput of long flows.
+
+ Data Center TCP (DCTCP) changes traditional ECN processing by
+ estimating the fraction of bytes that encounter congestion rather
+ than simply detecting that some congestion has occurred. DCTCP then
+ scales the TCP congestion window based on this estimate. This method
+
+
+
+Bensley, et al. Informational [Page 3]
+
+RFC 8257 DCTCP October 2017
+
+
+ achieves high-burst tolerance, low latency, and high throughput with
+ shallow-buffered switches. DCTCP is a modification to the processing
+ of ECN by a conventional TCP and requires that standard TCP
+ congestion control be used for handling packet loss.
+
+ DCTCP should only be deployed in an intra-data-center environment
+ where both endpoints and the switching fabric are under a single
+ administrative domain. DCTCP MUST NOT be deployed over the public
+ Internet without additional measures, as detailed in Section 5.
+
+ The objective of this Informational RFC is to document DCTCP as a new
+ approach (which is known to be widely implemented and deployed) to
+ address TCP congestion control in data centers. The IETF TCPM
+ Working Group reached consensus regarding the fact that a DCTCP
+ standard would require further work. A precise documentation of
+ running code enables follow-up Experimental or Standards Track RFCs
+ through the IETF stream.
+
+ This document describes DCTCP as implemented in Microsoft Windows
+ Server 2012 [WINDOWS]. The Linux [LINUX] and FreeBSD [FREEBSD]
+ operating systems have also implemented support for DCTCP in a way
+ that is believed to follow this document. Deployment experiences
+ with DCTCP have been documented in [MORGANSTANLEY].
+
+2. Terminology
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
+ "OPTIONAL" in this document are to be interpreted as described in
+ BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
+ capitals, as shown here.
+
+ Normative language is used to describe how necessary the various
+ aspects of a DCTCP implementation are for interoperability, but even
+ compliant implementations without the measures in Sections 4-6 would
+ still only be safe to deploy in controlled environments, i.e., not
+ over the public Internet.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 4]
+
+RFC 8257 DCTCP October 2017
+
+
+3. DCTCP Algorithm
+
+ There are three components involved in the DCTCP algorithm:
+
+ o The switches (or other intermediate devices in the network) detect
+ congestion and set the Congestion Encountered (CE) codepoint in
+ the IP header.
+
+ o The receiver echoes the congestion information back to the sender,
+ using the ECN-Echo (ECE) flag in the TCP header.
+
+ o The sender computes a congestion estimate and reacts by reducing
+ the TCP congestion window (cwnd) accordingly.
+
+3.1. Marking Congestion on the L3 Switches and Routers
+
+ The Layer 3 (L3) switches and routers in a data-center fabric
+ indicate congestion to the end nodes by setting the CE codepoint in
+ the IP header as specified in Section 5 of [RFC3168]. For example,
+ the switches may be configured with a congestion threshold. When a
+ packet arrives at a switch and its queue length is greater than the
+ congestion threshold, the switch sets the CE codepoint in the packet.
+ For example, Section 3.4 of [DCTCP10] suggests threshold marking with
+ a threshold of K > (RTT * C)/7, where C is the link rate in packets
+ per second. In typical deployments, the marking threshold is set to
+ be a small value to maintain a short average queueing delay.
+ However, the actual algorithm for marking congestion is an
+ implementation detail of the switch and will generally not be known
+ to the sender and receiver. Therefore, the sender and receiver
+ should not assume that a particular marking algorithm is implemented
+ by the switching fabric.
+
+3.2. Echoing Congestion Information on the Receiver
+
+ According to Section 6.1.3 of [RFC3168], the receiver sets the ECE
+ flag if any of the packets being acknowledged had the CE codepoint
+ set. The receiver then continues to set the ECE flag until it
+ receives a packet with the Congestion Window Reduced (CWR) flag set.
+ However, the DCTCP algorithm requires more-detailed congestion
+ information. In particular, the sender must be able to determine the
+ number of bytes sent that encountered congestion. Thus, the scheme
+ described in [RFC3168] does not suffice.
+
+ One possible solution is to ACK every packet and set the ECE flag in
+ the ACK if and only if the CE codepoint was set in the packet being
+ acknowledged. However, this prevents the use of delayed ACKs, which
+ are an important performance optimization in data centers. If the
+ delayed ACK frequency is n, then an ACK is generated every n packets.
+
+
+
+Bensley, et al. Informational [Page 5]
+
+RFC 8257 DCTCP October 2017
+
+
+ The typical value of n is 2, but it could be affected by ACK
+ throttling or packet-coalescing techniques designed to improve
+ performance.
+
+ Instead, DCTCP introduces a new Boolean TCP state variable, DCTCP
+ Congestion Encountered (DCTCP.CE), which is initialized to false and
+ stored in the Transmission Control Block (TCB). When sending an ACK,
+ the ECE flag MUST be set if and only if DCTCP.CE is true. When
+ receiving packets, the CE codepoint MUST be processed as follows:
+
+ 1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
+ true and send an immediate ACK.
+
+ 2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
+ to false and send an immediate ACK.
+
+ 3. Otherwise, ignore the CE codepoint.
+
+ Since the immediate ACK reflects the new DCTCP.CE state, it may
+ acknowledge any previously unacknowledged packets in the old state.
+ This can lead to an incorrect rate computation at the sender per
+ Section 3.3. To avoid this, an implementation MAY choose to send two
+ ACKs: one for previously unacknowledged packets and another
+ acknowledging the most recently received packet.
+
+ Receiver handling of the CWR bit is also per [RFC3168] (including
+ [Err3639]). That is, on receipt of a segment with both the CE and
+ CWR bits set, CWR is processed first and then CE is processed.
+
+ Send immediate
+ ACK with ECE=0
+ .-----. .--------------. .-----.
+ Send 1 ACK / v v | | \
+ for every | .------------. .------------. | Send 1 ACK
+ n packets | | DCTCP.CE=0 | | DCTCP.CE=1 | | for every
+ with ECE=0 | '------------' '------------' | n packets
+ \ | | ^ ^ / with ECE=1
+ '-----' '--------------' '-----'
+ Send immediate
+ ACK with ECE=1
+
+
+ Figure 1: ACK Generation State Machine
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 6]
+
+RFC 8257 DCTCP October 2017
+
+
+3.3. Processing Echoed Congestion Indications on the Sender
+
+ The sender estimates the fraction of bytes sent that encountered
+ congestion. The current estimate is stored in a new TCP state
+ variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be
+ updated as follows:
+
+ DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M
+
+ where:
+
+ o g is the estimation gain, a real number between 0 and 1. The
+ selection of g is left to the implementation. See Section 4 for
+ further considerations.
+
+ o M is the fraction of bytes sent that encountered congestion during
+ the previous observation window, where the observation window is
+ chosen to be approximately the Round-Trip Time (RTT). In
+ particular, an observation window ends when all bytes in flight at
+ the beginning of the window have been acknowledged.
+
+ In order to update DCTCP.Alpha, the TCP state variables defined in
+ [RFC0793] are used, and three additional TCP state variables are
+ introduced:
+
+ o DCTCP.WindowEnd: the TCP sequence number threshold when one
+ observation window ends and another is to begin; initialized to
+ SND.UNA.
+
+ o DCTCP.BytesAcked: the number of sent bytes acknowledged during the
+ current observation window; initialized to 0.
+
+ o DCTCP.BytesMarked: the number of bytes sent during the current
+ observation window that encountered congestion; initialized to 0.
+
+ The congestion estimator on the sender MUST process acceptable ACKs
+ as follows:
+
+ 1. Compute the bytes acknowledged (TCP Selective Acknowledgment
+ (SACK) options [RFC2018] are ignored for this computation):
+
+ BytesAcked = SEG.ACK - SND.UNA
+
+ 2. Update the bytes sent:
+
+ DCTCP.BytesAcked += BytesAcked
+
+
+
+
+
+Bensley, et al. Informational [Page 7]
+
+RFC 8257 DCTCP October 2017
+
+
+ 3. If the ECE flag is set, update the bytes marked:
+
+ DCTCP.BytesMarked += BytesAcked
+
+ 4. If the acknowledgment number is less than or equal to
+ DCTCP.WindowEnd, stop processing. Otherwise, the end of the
+ observation window has been reached, so proceed to update the
+ congestion estimate as follows:
+
+ 5. Compute the congestion level for the current observation window:
+
+ M = DCTCP.BytesMarked / DCTCP.BytesAcked
+
+ 6. Update the congestion estimate:
+
+ DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M
+
+ 7. Determine the end of the next observation window:
+
+ DCTCP.WindowEnd = SND.NXT
+
+ 8. Reset the byte counters:
+
+ DCTCP.BytesAcked = DCTCP.BytesMarked = 0
+
+ 9. Rather than always halving the congestion window as described in
+ [RFC3168], the sender SHOULD update cwnd as follows:
+
+ cwnd = cwnd * (1 - DCTCP.Alpha / 2)
+
+ Just as specified in [RFC3168], DCTCP does not react to congestion
+ indications more than once for every window of data. The setting of
+ the CWR bit is also as per [RFC3168]. This is required for
+ interoperation with classic ECN receivers due to potential
+ misconfigurations.
+
+3.4. Handling of Congestion Window Growth
+
+ A DCTCP sender grows its congestion window in the same way as
+ conventional TCP. Slow start and congestion avoidance algorithms are
+ handled as specified in [RFC5681].
+
+3.5. Handling of Packet Loss
+
+ A DCTCP sender MUST react to loss episodes in the same way as
+ conventional TCP, including fast retransmit and fast recovery
+ algorithms, as specified in [RFC5681]. For cases where the packet
+ loss is inferred and not explicitly signaled by ECN, the cwnd and
+
+
+
+Bensley, et al. Informational [Page 8]
+
+RFC 8257 DCTCP October 2017
+
+
+ other state variables like ssthresh MUST be changed in the same way
+ that a conventional TCP would have changed them. As with ECN, a
+ DCTCP sender will only reduce the cwnd once per window of data across
+ all loss signals. Just as specified in [RFC5681], upon a timeout,
+ the cwnd MUST be set to no more than the loss window (1 full-sized
+ segment), regardless of previous cwnd reductions in a given window of
+ data.
+
+3.6. Handling of SYN, SYN-ACK, and RST Packets
+
+ If SYN, SYN-ACK, and RST packets for DCTCP connections have the ECN-
+ Capable Transport (ECT) codepoint set in the IP header, they will
+ receive the same treatment as other DCTCP packets when forwarded by a
+ switching fabric under load. Lack of ECT in these packets can result
+ in a higher drop rate, depending on the switching fabric
+ configuration. Hence, for DCTCP connections, the sender SHOULD set
+ ECT for SYN, SYN-ACK, and RST packets. A DCTCP receiver ignores CE
+ codepoints set on any SYN, SYN-ACK, or RST packets.
+
+4. Implementation Issues
+
+4.1. Configuration of DCTCP
+
+ An implementation needs to know when to use DCTCP. Data-center
+ servers may need to communicate with endpoints outside the data
+ center, where DCTCP is unsuitable or unsupported. Thus, a global
+ configuration setting to enable DCTCP will generally not suffice.
+ DCTCP provides no mechanism for negotiating its use. Thus,
+ additional management and configuration functionality is needed to
+ ensure that DCTCP is not used with non-DCTCP endpoints.
+
+ Known solutions rely on either configuration or heuristics.
+ Heuristics need to allow endpoints to individually enable DCTCP to
+ ensure a DCTCP sender is always paired with a DCTCP receiver. One
+ approach is to enable DCTCP based on the IP address of the remote
+ endpoint. Another approach is to detect connections that transmit
+ within the bounds of a data center. For example, an implementation
+ could support automatic selection of DCTCP if the estimated RTT is
+ less than a threshold (like 10 msec) and ECN is successfully
+ negotiated under the assumption that if the RTT is low, then the two
+ endpoints are likely in the same data-center network.
+
+ [RFC3168] forbids the ECN-marking of pure ACK packets because of the
+ inability of TCP to mitigate ACK-path congestion. RFC 3168 also
+ forbids ECN-marking of retransmissions, window probes, and RSTs.
+ However, dropping all these control packets -- rather than ECN-
+ marking them -- has considerable performance disadvantages. It is
+ RECOMMENDED that an implementation provide a configuration knob that
+
+
+
+Bensley, et al. Informational [Page 9]
+
+RFC 8257 DCTCP October 2017
+
+
+ will cause ECT to be set on such control packets, which can be used
+ in environments where such concerns do not apply. See
+ [ECN-EXPERIMENTATION] for details.
+
+ It is useful to implement DCTCP as an additional action on top of an
+ existing congestion control algorithm like Reno [RFC5681]. The DCTCP
+ implementation MAY also allow configuration of resetting the value of
+ DCTCP.Alpha as part of processing any loss episodes.
+
+4.2. Computation of DCTCP.Alpha
+
+ As noted in Section 3.3, the implementation will need to choose a
+ suitable estimation gain. [DCTCP10] provides a theoretical basis for
+ selecting the gain. However, it may be more practical to use
+ experimentation to select a suitable gain for a particular network
+ and workload. A fixed estimation gain of 1/16 is used in some
+ implementations. (It should be noted that values of 0 or 1 for g
+ result in problematic behavior; g=0 fixes DCTCP.Alpha to its initial
+ value, and g=1 sets it to M without any smoothing.)
+
+ The DCTCP.Alpha computation as per the formula in Section 3.3
+ involves fractions. An efficient kernel implementation MAY scale the
+ DCTCP.Alpha value for efficient computation using shift operations.
+ For example, if the implementation chooses g as 1/16, multiplications
+ of DCTCP.Alpha by g become right-shifts by 4. A scaling
+ implementation SHOULD ensure that DCTCP.Alpha is able to reach 0 once
+ it falls below the smallest shifted value (16 in the above example).
+ At the other extreme, a scaled update needs to ensure DCTCP.Alpha
+ does not exceed the scaling factor, which would be equivalent to
+ greater than 100% congestion. So, DCTCP.Alpha MUST be clamped after
+ an update.
+
+ This results in the following computations replacing steps 5 and 6 in
+ Section 3.3, where SCF is the chosen scaling factor (65536 in the
+ example), and SHF is the shift factor (4 in the example):
+
+ 1. Compute the congestion level for the current observation window:
+
+ ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked
+
+ 2. Update the congestion estimate:
+
+ if (DCTCP.Alpha >> SHF) == 0, then DCTCP.Alpha = 0
+
+ DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF)
+
+ if DCTCP.Alpha > SCF, then DCTCP.Alpha = SCF
+
+
+
+
+Bensley, et al. Informational [Page 10]
+
+RFC 8257 DCTCP October 2017
+
+
+5. Deployment Issues
+
+ DCTCP and conventional TCP congestion control do not coexist well in
+ the same network. In typical DCTCP deployments, the marking
+ threshold in the switching fabric is set to a very low value to
+ reduce queueing delay, and a relatively small amount of congestion
+ will exceed the marking threshold. During such periods of
+ congestion, conventional TCP will suffer packet loss and quickly and
+ drastically reduce cwnd. DCTCP, on the other hand, will use the
+ fraction of marked packets to reduce cwnd more gradually. Thus, the
+ rate reduction in DCTCP will be much slower than that of conventional
+ TCP, and DCTCP traffic will gain a larger share of the capacity
+ compared to conventional TCP traffic traversing the same path. If
+ the traffic in the data center is a mix of conventional TCP and
+ DCTCP, it is RECOMMENDED that DCTCP traffic be segregated from
+ conventional TCP traffic. [MORGANSTANLEY] describes a deployment
+ that uses the IP Differentiated Services Codepoint (DSCP) bits to
+ segregate the network such that Active Queue Management (AQM)
+ [RFC7567] is applied to DCTCP traffic, whereas TCP traffic is managed
+ via drop-tail queueing.
+
+ Deployments should take into account segregation of non-TCP traffic
+ as well. Today's commodity switches allow configuration of different
+ marking/drop profiles for non-TCP and non-IP packets. Non-TCP and
+ non-IP packets should be able to pass through such switches, unless
+ they really run out of buffer space.
+
+ Since DCTCP relies on congestion marking by the switches, DCTCP's
+ potential can only be realized in data centers where the entire
+ network infrastructure supports ECN. The switches may also support
+ configuration of the congestion threshold used for marking. The
+ proposed parameterization can be configured with switches that
+ implement Random Early Detection (RED) [RFC2309]. [DCTCP10] provides
+ a theoretical basis for selecting the congestion threshold, but, as
+ with the estimation gain, it may be more practical to rely on
+ experimentation or simply to use the default configuration of the
+ device. DCTCP will revert to loss-based congestion control when
+ packet loss is experienced (e.g., when transiting a congested drop-
+ tail link, or a link with an AQM drop behavior).
+
+ DCTCP requires changes on both the sender and the receiver, so both
+ endpoints must support DCTCP. Furthermore, DCTCP provides no
+ mechanism for negotiating its use, so both endpoints must be
+ configured through some out-of-band mechanism to use DCTCP. A
+ variant of DCTCP that can be deployed unilaterally and that only
+ requires standard ECN behavior has been described in [ODCTCP] and
+ [BSDCAN], but it requires additional experimental evaluation.
+
+
+
+
+Bensley, et al. Informational [Page 11]
+
+RFC 8257 DCTCP October 2017
+
+
+6. Known Issues
+
+ DCTCP relies on the sender's ability to reconstruct the stream of CE
+ codepoints received by the remote endpoint. To accomplish this,
+ DCTCP avoids using a single ACK packet to acknowledge segments
+ received both with and without the CE codepoint set. However, if one
+ or more ACK packets are dropped, it is possible that a subsequent ACK
+ will cumulatively acknowledge a mix of CE and non-CE segments. This
+ will, of course, result in a less-accurate congestion estimate.
+ There are some potential considerations:
+
+ o Even with an inaccurate congestion estimate, DCTCP may still
+ perform better than [RFC3168].
+
+ o If the estimation gain is small relative to the packet loss rate,
+ the estimate may not be too inaccurate.
+
+ o If ACK packet loss mostly occurs under heavy congestion, most
+ drops will occur during an unbroken string of CE packets, and the
+ estimate will be unaffected.
+
+ However, the effect of packet drops on DCTCP under real-world
+ conditions has not been analyzed.
+
+ DCTCP provides no mechanism for negotiating its use. The effect of
+ using DCTCP with a standard ECN endpoint has been analyzed in
+ [ODCTCP] and [BSDCAN]. Furthermore, it is possible that other
+ implementations may also modify behavior in the [RFC3168] style
+ without negotiation, causing further interoperability issues.
+
+ Much like standard TCP, DCTCP is biased against flows with longer
+ RTTs. A method for improving the RTT fairness of DCTCP has been
+ proposed in [ADCTCP], but it requires additional experimental
+ evaluation.
+
+7. Security Considerations
+
+ DCTCP enhances ECN; thus, it inherits the general security
+ considerations discussed in [RFC3168], although additional mitigation
+ options exist due to the limited intra-data-center deployment of
+ DCTCP.
+
+ The processing changes introduced by DCTCP do not exacerbate the
+ considerations in [RFC3168] or introduce new ones. In particular,
+ with either algorithm, the network infrastructure or the remote
+ endpoint can falsely report congestion and, thus, cause the sender to
+ reduce cwnd. However, this is no worse than what can be achieved by
+ simply dropping packets.
+
+
+
+Bensley, et al. Informational [Page 12]
+
+RFC 8257 DCTCP October 2017
+
+
+ [RFC3168] requires that a compliant TCP must not set ECT on SYN or
+ SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets
+ but maintains the restriction of no ECT on SYN packets. Both these
+ RFCs prohibit ECT in SYN packets due to security concerns regarding
+ malicious SYN packets with ECT set. However, these RFCs are intended
+ for general Internet use; they do not directly apply to a controlled
+ data-center environment. The security concerns addressed by both of
+ these RFCs might not apply in controlled environments like data
+ centers, and it might not be necessary to account for the presence of
+ non-ECN servers. Beyond the security considerations related to
+ virtual servers, additional security can be imposed in the physical
+ servers to intercept and drop traffic resembling an attack.
+
+8. IANA Considerations
+
+ This document does not require any IANA actions.
+
+9. References
+
+9.1. Normative References
+
+ [RFC0793] Postel, J., "Transmission Control Protocol", STD 7,
+ RFC 793, DOI 10.17487/RFC0793, September 1981,
+ <https://www.rfc-editor.org/info/rfc793>.
+
+ [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
+ Selective Acknowledgment Options", RFC 2018,
+ DOI 10.17487/RFC2018, October 1996,
+ <https://www.rfc-editor.org/info/rfc2018>.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119,
+ DOI 10.17487/RFC2119, March 1997,
+ <https://www.rfc-editor.org/info/rfc2119>.
+
+ [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
+ of Explicit Congestion Notification (ECN) to IP",
+ RFC 3168, DOI 10.17487/RFC3168, September 2001,
+ <https://www.rfc-editor.org/info/rfc3168>.
+
+ [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K.
+ Ramakrishnan, "Adding Explicit Congestion Notification
+ (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
+ DOI 10.17487/RFC5562, June 2009,
+ <https://www.rfc-editor.org/info/rfc5562>.
+
+
+
+
+
+
+Bensley, et al. Informational [Page 13]
+
+RFC 8257 DCTCP October 2017
+
+
+ [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
+ Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
+ <https://www.rfc-editor.org/info/rfc5681>.
+
+ [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
+ 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
+ May 2017, <https://www.rfc-editor.org/info/rfc8174>.
+
+9.2. Informative References
+
+ [ADCTCP] Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis
+ of DCTCP: Stability, Convergence, and Fairness",
+ DOI 10.1145/1993744.1993753, Proceedings of the ACM
+ SIGMETRICS Joint International Conference on Measurement
+ and Modeling of Computer Systems, June 2011,
+ <https://dl.acm.org/citation.cfm?id=1993753>.
+
+ [BSDCAN] Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and
+ H. Tokuda, "Extensions to FreeBSD Datacenter TCP for
+ Incremental Deployment Support", BSDCan 2015, June 2015,
+ <https://www.bsdcan.org/2015/schedule/events/559.en.html>.
+
+ [DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
+ P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data
+ Center TCP (DCTCP)", DOI 10.1145/1851182.1851192,
+ Proceedings of the ACM SIGCOMM 2010 Conference, August
+ 2010,
+ <http://dl.acm.org/citation.cfm?doid=1851182.1851192>.
+
+ [ECN-EXPERIMENTATION]
+ Black, D., "Explicit Congestion Notification (ECN)
+ Experimentation", Work in Progress, draft-ietf-tsvwg-ecn-
+ experimentation-06, September 2017.
+
+ [Err3639] RFC Errata, Erratum ID 3639, RFC 3168,
+ <https://www.rfc-editor.org/errata/eid3639>.
+
+ [FREEBSD] Kato, M. and H. Panchasara, "DCTCP (Data Center TCP)
+ implementation", January 2015,
+ <https://github.com/freebsd/freebsd/
+ commit/8ad879445281027858a7fa706d13e458095b595f>.
+
+ [LINUX] Borkmann, D., Westphal, F., and Glenn. Judd, "net: tcp:
+ add DCTCP congestion control algorithm", LINUX DCTCP
+ Patch, September 2014, <https://git.kernel.org/cgit/linux/
+ kernel/git/davem/net-next.git/commit/
+ ?id=e3118e8359bb7c59555aca60c725106e6d78c5ce>.
+
+
+
+
+Bensley, et al. Informational [Page 14]
+
+RFC 8257 DCTCP October 2017
+
+
+ [MAPREDUCE]
+ Dean, J. and S. Ghemawat, "MapReduce: Simplified Data
+ Processing on Large Clusters", Proceedings of the 6th
+ ACM/USENIX Symposium on Operating Systems Design and
+ Implementation, October 2004, <https://www.usenix.org/
+ legacy/publications/library/proceedings/osdi04/tech/
+ dean.html>.
+
+ [MORGANSTANLEY]
+ Judd, G., "Attaining the Promise and Avoiding the Pitfalls
+ of TCP in the Datacenter", Proceedings of the 12th USENIX
+ Symposium on Networked Systems Design and Implementation,
+ May 2015, <https://www.usenix.org/conference/nsdi15/
+ technical-sessions/presentation/judd>.
+
+ [ODCTCP] Kato, M., "Improving Transmission Performance with One-
+ Sided Datacenter TCP", M.S. Thesis, Keio University, 2013,
+ <http://eggert.org/students/kato-thesis.pdf>.
+
+ [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
+ S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
+ Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
+ S., Wroclawski, J., and L. Zhang, "Recommendations on
+ Queue Management and Congestion Avoidance in the
+ Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998,
+ <https://www.rfc-editor.org/info/rfc2309>.
+
+ [RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF
+ Recommendations Regarding Active Queue Management",
+ BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015,
+ <https://www.rfc-editor.org/info/rfc7567>.
+
+ [WINDOWS] Microsoft, "Data Center Transmission Control Protocol
+ (DCTCP)", May 2012, <https://technet.microsoft.com/
+ en-us/library/hh997028(v=ws.11).aspx>.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 15]
+
+RFC 8257 DCTCP October 2017
+
+
+Acknowledgments
+
+ The DCTCP algorithm was originally proposed and analyzed in [DCTCP10]
+ by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye,
+ Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari
+ Sridharan.
+
+ We would like to thank Andrew Shewmaker for identifying the problem
+ of clamping DCTCP.Alpha and proposing a solution for it.
+
+ Lars Eggert has received funding from the European Union's Horizon
+ 2020 research and innovation program 2014-2018 under grant agreement
+ No. 644866 ("SSICLOPS"). This document reflects only the authors'
+ views and the European Commission is not responsible for any use that
+ may be made of the information it contains.
+
+Authors' Addresses
+
+ Stephen Bensley
+ Microsoft
+ One Microsoft Way
+ Redmond, WA 98052
+ United States of America
+
+ Phone: +1 425 703 5570
+ Email: sbens@microsoft.com
+
+
+ Dave Thaler
+ Microsoft
+
+ Phone: +1 425 703 8835
+ Email: dthaler@microsoft.com
+
+
+ Praveen Balasubramanian
+ Microsoft
+
+ Phone: +1 425 538 2782
+ Email: pravb@microsoft.com
+
+
+
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 16]
+
+RFC 8257 DCTCP October 2017
+
+
+ Lars Eggert
+ NetApp
+ Sonnenallee 1
+ Kirchheim 85551
+ Germany
+
+ Phone: +49 151 120 55791
+ Email: lars@netapp.com
+ URI: http://eggert.org/
+
+
+ Glenn Judd
+ Morgan Stanley
+
+ Phone: +1 973 979 6481
+ Email: glenn.judd@morganstanley.com
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Bensley, et al. Informational [Page 17]
+