author Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit 4bfd864f10b68b71482b35c818559068ef8d5797
tree e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc3124.txt
parent ea76e11061bda059ae9f9ad130a9895cc85607db
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc3124.txt')
-rw-r--r-- doc/rfc/rfc3124.txt 1235
1 files changed, 1235 insertions, 0 deletions
diff --git a/doc/rfc/rfc3124.txt b/doc/rfc/rfc3124.txt
new file mode 100644
index 0000000..db57bc3
--- /dev/null
+++ b/doc/rfc/rfc3124.txt
@@ -0,0 +1,1235 @@
+
+
+
+
+
+
+Network Working Group                                    H. Balakrishnan
+Request for Comments: 3124                                       MIT LCS
+Category: Standards Track                                      S. Seshan
+                                                                     CMU
+                                                               June 2001
+
+
+ The Congestion Manager
+
+
+Status of this Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2001). All Rights Reserved.
+
+Abstract
+
+ This document describes the Congestion Manager (CM), an end-system
+ module that:
+
+ (i) Enables an ensemble of multiple concurrent streams from a sender
+ destined to the same receiver and sharing the same congestion
+ properties to perform proper congestion avoidance and control, and
+
+ (ii) Allows applications to easily adapt to network congestion.
+
+1. Conventions used in this document:
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in RFC-2119 [Bradner97].
+
+ STREAM
+
+ A group of packets that all share the same source and destination
+ IP address, IP type-of-service, transport protocol, and source and
+ destination transport-layer port numbers.
+
+
+
+
+
+
+
+Balakrishnan, et. al. Standards Track [Page 1]
+
+RFC 3124 The Congestion Manager June 2001
+
+
+ MACROFLOW
+
+ A group of CM-enabled streams that all use the same congestion
+ management and scheduling algorithms, and share congestion state
+ information. Currently, streams destined to different receivers
+ belong to different macroflows. Streams destined to the same
+ receiver MAY belong to different macroflows. When the Congestion
+ Manager is in use, streams that experience identical congestion
+ behavior and use the same congestion control algorithm SHOULD
+ belong to the same macroflow.
+
+ APPLICATION
+
+ Any software module that uses the CM. This includes user-level
+ applications such as Web servers or audio/video servers, as well
+ as in-kernel protocols such as TCP [Postel81] that use the CM for
+ congestion control.
+
+ WELL-BEHAVED APPLICATION
+
+ An application that only transmits when allowed by the CM and
+ accurately accounts for all data that it has sent to the receiver
+ by informing the CM using the CM API.
+
+ PATH MAXIMUM TRANSMISSION UNIT (PMTU)
+
+ The size of the largest packet that the sender can transmit
+ without it being fragmented en route to the receiver. It includes
+ the sizes of all headers and data except the IP header.
+
+ CONGESTION WINDOW (cwnd)
+
+ A CM state variable that modulates the amount of outstanding data
+ between sender and receiver.
+
+ OUTSTANDING WINDOW (ownd)
+
+ The number of bytes that have been transmitted by the source, but
+ not known to have been either received by the destination or lost
+ in the network.
+
+ INITIAL WINDOW (IW)
+
+ The size of the sender's congestion window at the beginning of a
+ macroflow.
+
+ DATA TYPE SYNTAX
+
+ We use "u64" for unsigned 64-bit, "u32" for unsigned 32-bit, "u16"
+ for unsigned 16-bit, "u8" for unsigned 8-bit, "i32" for signed
+ 32-bit, and "i16" for signed 16-bit quantities; "float" denotes IEEE
+ floating point values. The type "void" is used to indicate that
+ no return value is expected from a call. Pointers are referred to
+ using "*" syntax, following C language convention.
+
+ We emphasize that all the API functions described in this document
+ are "abstract" calls and that conformant CM implementations may
+ differ in specific implementation details.
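+
+ As an illustration only, the data type names above could be mapped
+ onto the C99 fixed-width integer types as follows. The typedefs are
+ an assumption of this sketch, not part of the specification.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping of the document's type names to C99 types. */
typedef uint64_t u64;
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t  u8;
typedef int32_t  i32;
typedef int16_t  i16;

/* Compile-time checks that each alias has the advertised width. */
static_assert(sizeof(u64) == 8, "u64 must be 64 bits");
static_assert(sizeof(u32) == 4, "u32 must be 32 bits");
static_assert(sizeof(u16) == 2, "u16 must be 16 bits");
static_assert(sizeof(u8)  == 1, "u8 must be 8 bits");
static_assert(sizeof(i32) == 4, "i32 must be 32 bits");
static_assert(sizeof(i16) == 2, "i16 must be 16 bits");
```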
+
+2. Introduction
+
+ The framework described in this document integrates congestion
+ management across all applications and transport protocols. The CM
+ maintains congestion parameters (available aggregate and per-stream
+ bandwidth, per-receiver round-trip times, etc.) and exports an API
+ that enables applications to learn about network characteristics,
+ pass information to the CM, share congestion information with each
+ other, and schedule data transmissions. This document focuses on
+ applications and transport protocols with their own independent per-
+ byte or per-packet sequence number information, and does not require
+ modifications to the receiver protocol stack. However, the receiving
+ application must provide feedback to the sending application about
+ received packets and losses, and the latter is expected to use the CM
+ API to update CM state. This document does not address networks with
+ reservations or service differentiation.
+
+ The CM is an end-system module that enables an ensemble of multiple
+ concurrent streams to perform stable congestion avoidance and
+ control, and allows applications to easily adapt their transmissions
+ to prevailing network conditions. When the CM is used, all
+ data transmissions subject to the CM must be done with the explicit
+ consent of the CM via this API to ensure proper congestion behavior.
+
+ Systems MAY choose to use CM, and if so they MUST follow this
+ specification.
+
+ This document focuses on applications and networks where the
+ following conditions hold:
+
+ 1. Applications are well-behaved with their own independent
+ per-byte or per-packet sequence number information, and use the
+ CM API to update internal state in the CM.
+
+ 2. Networks are best-effort without service discrimination or
+ reservations. In particular, this document does not address
+ situations where different streams between the same pair of
+ hosts traverse paths with differing characteristics.
+
+ The Congestion Manager framework can be extended to support
+ applications that do not provide their own feedback and to
+ differentially-served networks. These extensions will be addressed
+ in later documents.
+
+ The CM is motivated by two main goals:
+
+ (i) Enable efficient multiplexing. Increasingly, the trend on the
+ Internet is for unicast data senders (e.g., Web servers) to transmit
+ heterogeneous types of data to receivers, ranging from unreliable
+ real-time streaming content to reliable Web pages and applets. As a
+ result, many logically different streams share the same path between
+ sender and receiver. For the Internet to remain stable, each of
+ these streams must incorporate control protocols that safely probe
+ for spare bandwidth and react to congestion. Unfortunately, these
+ concurrent streams typically compete with each other for network
+ resources, rather than share them effectively. Furthermore, they do
+ not learn from each other about the state of the network. Even if
+ they each independently implement congestion control (e.g., a group
+ of TCP connections each implementing the algorithms in [Jacobson88,
+ Allman99]), the ensemble of streams tends to be more aggressive in
+ the face of congestion than a single TCP connection implementing
+ standard TCP congestion control and avoidance [Balakrishnan98].
+
+ (ii) Enable application adaptation to congestion. Increasingly,
+ popular real-time streaming applications run over UDP using their own
+ user-level transport protocols for good application performance, but
+ in most cases today do not adapt or react properly to network
+ congestion. By implementing a stable control algorithm and exposing
+ an adaptation API, the CM enables easy application adaptation to
+ congestion. Applications adapt the data they transmit to the current
+ network conditions.
+
+ The CM framework builds on recent work on TCP control block sharing
+ [Touch97], integrated TCP congestion control (TCP-Int)
+ [Balakrishnan98] and TCP sessions [Padmanabhan98]. [Touch97]
+ advocates the sharing of some of the state in the TCP control block
+ to improve transient transport performance and describes sharing
+ across an ensemble of TCP connections. [Balakrishnan98],
+ [Padmanabhan98], and [Eggert00] describe several experiments that
+ quantify the benefits of sharing congestion state, including improved
+ stability in the face of congestion and better loss recovery.
+ Integrating loss recovery across concurrent connections significantly
+ improves performance because losses on one connection can be detected
+ by noticing that later data sent on another connection has been
+ received and acknowledged. The CM framework extends these ideas in
+ two significant ways: (i) it extends congestion management to non-TCP
+ streams, which are becoming increasingly common and often do not
+ implement proper congestion management, and (ii) it provides an API
+ for applications to adapt their transmissions to current network
+ conditions. For an extended discussion of the motivation for the CM,
+ its architecture, API, and algorithms, see [Balakrishnan99]; for a
+ description of an implementation and performance results, see
+ [Andersen00].
+
+ The resulting end-host protocol architecture at the sender is shown
+ in Figure 1. The CM helps achieve network stability by implementing
+ stable congestion avoidance and control algorithms that are "TCP-
+ friendly" [Mahdavi98] based on algorithms described in [Allman99].
+ However, it does not attempt to enforce proper congestion behavior
+ for all applications (but it does not preclude a policer on the host
+ that performs this task). Note that while the policer at the end-
+ host can use CM, the network has to be protected against compromises
+ to the CM and the policer at the end hosts, a task that requires
+ router machinery [Floyd99a]. We do not address this issue further in
+ this document.
+
+   |--------| |--------| |--------| |--------|       |--------------|
+   |  HTTP  | |  FTP   | |  RTP 1 | |  RTP 2 |       |              |
+   |--------| |--------| |--------| |--------|       |              |
+       |          |         |  ^       |  ^          |              |
+       |          |         |  |       |  |          |   Scheduler  |
+       |          |         |  |       |  |  |---|   |              |
+       |          |         |  |-------|--+->|   |   |              |
+       |          |         |          |     |   |<--|              |
+       v          v         v          v     |   |   |--------------|
+   |--------| |--------|  |-------------|    |   |           ^
+   |  TCP 1 | |  TCP 2 |  |    UDP 1    |    | A |           |
+   |--------| |--------|  |-------------|    |   |           |
+      ^   |      ^   |              |        |   |   |--------------|
+      |   |      |   |              |        | P |-->|              |
+      |   |      |   |              |        |   |   |              |
+      |---|------+---|--------------|------->|   |   |  Congestion  |
+          |          |              |        | I |   |              |
+          v          v              v        |   |   |  Controller  |
+     |-----------------------------------|   |   |   |              |
+     |               IP                  |-->|   |   |              |
+     |-----------------------------------|   |   |   |--------------|
+                                             |---|
+
+                               Figure 1
+
+ The key components of the CM framework are (i) the API, (ii) the
+ congestion controller, and (iii) the scheduler. The API is (in part)
+ motivated by the requirements of application-level framing (ALF)
+ [Clark90], and is described in Section 3. The CM internals (Section
+ 4) include a congestion controller (Section 4.1) and a scheduler to
+ orchestrate data transmissions between concurrent streams in a
+ macroflow (Section 4.2). The congestion controller adjusts the
+ aggregate transmission rate between sender and receiver based on its
+ estimate of congestion in the network. It obtains feedback about its
+ past transmissions from applications themselves via the API. The
+ scheduler apportions available bandwidth amongst the different
+ streams within each macroflow and notifies applications when they are
+ permitted to send data. This document focuses on well-behaved
+ applications; a future one will describe the sender-receiver protocol
+ and header formats that will handle applications that do not
+ incorporate their own feedback to the CM.
+
+3. CM API
+
+ By convention, the IETF does not treat Application Programming
+ Interfaces as standards track. However, it is considered important
+ to have the CM API and CM algorithm requirements in one coherent
+ document. The following section on the CM API uses the terms MUST,
+ SHOULD, etc., but the terms are meant to apply within the context of
+ an implementation of the CM API. The section does not apply to
+ congestion control implementations in general, only to those
+ implementations offering the CM API.
+
+ Using the CM API, streams can determine their share of the available
+ bandwidth, request and have their data transmissions scheduled,
+ inform the CM about successful transmissions, and be informed when
+ the CM's estimate of path bandwidth changes. Thus, the CM frees
+ applications from having to maintain information about the state of
+ congestion and available bandwidth along any path.
+
+ The function prototypes below follow standard C language convention.
+ We emphasize that these API functions are abstract calls and
+ conformant CM implementations may differ in specific details, as long
+ as equivalent functionality is provided.
+
+ When a new stream is created by an application, it passes some
+ information to the CM via the cm_open(stream_info) API call.
+ Currently, stream_info consists of the following information: (i) the
+ source IP address, (ii) the source port, (iii) the destination IP
+ address, (iv) the destination port, and (v) the IP protocol number.
+
+3.1 State maintenance
+
+ 1. Open: All applications MUST call cm_open(stream_info) before
+ using the CM API. This returns a handle, cm_streamid, for the
+ application to use for all further CM API invocations for that
+ stream. If the returned cm_streamid is -1, then the cm_open()
+ failed and that stream cannot use the CM.
+
+ All other calls to the CM for a stream use the cm_streamid
+ returned from the cm_open() call.
+
+ 2. Close: When a stream terminates, the application SHOULD invoke
+ cm_close(cm_streamid) to inform the CM about the termination
+ of the stream.
+
+ 3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of
+ the path between sender and receiver. Internally, this
+ information SHOULD be obtained via path MTU discovery
+ [Mogul90]. It MAY be statically configured in the absence of
+ such a mechanism.
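+
+ The three state-maintenance calls can be sketched as a small
+ in-memory stream table. This is a minimal sketch, not a conformant
+ implementation: the struct layout, the table size, the statically
+ configured PMTU value, and the -1 return from cm_mtu() for an
+ unknown stream are all assumptions made for illustration.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t  u8;

#define CM_MAX_STREAMS 4     /* hypothetical table size */
#define CM_STATIC_PMTU 1460  /* hypothetical statically configured PMTU */

/* The five stream_info fields; IPv4 addresses only, for brevity. */
struct stream_info {
    u32 src_addr;
    u16 src_port;
    u32 dst_addr;
    u16 dst_port;
    u8  protocol;
};

static struct {
    int in_use;
    struct stream_info info;
} cm_streams[CM_MAX_STREAMS];

static int cm_valid(i32 id)
{
    return id >= 0 && id < CM_MAX_STREAMS && cm_streams[id].in_use;
}

/* Returns a cm_streamid handle, or -1 if the stream cannot use the CM. */
i32 cm_open(const struct stream_info *info)
{
    if (info == NULL)
        return -1;
    for (i32 id = 0; id < CM_MAX_STREAMS; id++) {
        if (!cm_streams[id].in_use) {
            cm_streams[id].in_use = 1;
            cm_streams[id].info = *info;
            return id;
        }
    }
    return -1; /* table full */
}

/* Releases the slot so that a later cm_open() can reuse it. */
void cm_close(i32 cm_streamid)
{
    if (cm_valid(cm_streamid))
        cm_streams[cm_streamid].in_use = 0;
}

/* Returns the (here statically configured) PMTU, or -1 if unknown. */
i32 cm_mtu(i32 cm_streamid)
{
    return cm_valid(cm_streamid) ? CM_STATIC_PMTU : -1;
}
```
+
+ A caller is expected to check the handle returned by cm_open()
+ against -1 before making any other call for that stream.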
+
+3.2 Data transmission
+
+ The CM accommodates two types of adaptive senders, enabling
+ applications to dynamically adapt their content based on prevailing
+ network conditions, and supporting ALF-based applications.
+
+ 1. Callback-based transmission. The callback-based transmission API
+ puts the stream in firm control of deciding what to transmit at each
+ point in time. To achieve this, the CM does not buffer any data;
+ instead, it allows streams the opportunity to adapt to unexpected
+ network changes at the last possible instant. Thus, this enables
+ streams to "pull out" and repacketize data upon learning about any
+ rate change, which is hard to do once the data has been buffered.
+ The CM must implement a cm_request(i32 cm_streamid) call for streams
+ wishing to send data in this style. After some time, depending on
+ the rate, the CM MUST invoke a callback using cmapp_send(), which is
+ a grant for the stream to send up to PMTU bytes. The callback-style
+ API is the recommended choice for ALF-based streams. Note that
+ cm_request() does not take the number of bytes or MTU-sized units as
+ an argument; each call to cm_request() is an implicit request for
+ sending up to PMTU bytes. The CM MAY provide an alternate interface,
+ cm_request(int k). The cmapp_send callback for this request is
+ granted the right to send up to k PMTU sized segments. Section 3.3
+ discusses the time duration for which the transmission grant is
+ valid, while Section 4.2 describes how these requests are scheduled
+ and callbacks made.
+
+ 2. Synchronous-style. The above callback-based API accommodates a
+ class of ALF streams that are "asynchronous." Asynchronous
+ transmitters do not transmit based on a periodic clock, but do so
+ triggered by asynchronous events like file reads or captured frames.
+ On the other hand, there are many streams that are "synchronous"
+ transmitters, which transmit periodically based on their own internal
+ timers (e.g., an audio sender that sends at a constant sampling
+ rate). While CM callbacks could be configured to periodically
+ interrupt such transmitters, the transmit loop of such applications
+ is less affected if they retain their original timer-based loop. In
+ addition, it complicates the CM API to have a stream express the
+ periodicity and granularity of its callbacks. Thus, the CM MUST
+ export an API that allows such streams to be informed of changes in
+ rates using the cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
+ callback function, where newrate is the new rate in bits per second
+ for this stream, srtt is the current smoothed round trip time
+ estimate in microseconds, and rttdev is the smoothed linear deviation
+ in the round-trip time estimate calculated using the same algorithm
+ as in TCP [Paxson00]. The newrate value reports an instantaneous
+ rate calculated, for example, by taking the ratio of cwnd and srtt,
+ and multiplying by the fraction of that ratio allocated to the stream.
+
+ In response, the stream MUST adapt its packet size or change its
+ timer interval to conform to (i.e., not exceed) the allowed rate. Of
+ course, it may choose not to use all of this rate. Note that the CM
+ is not on the data path of the actual transmission.
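+
+ One way to compute the newrate value is sketched below. It assumes
+ that cwnd is counted in bytes, that srtt is in microseconds, and
+ that a stream's share is a fraction between 0 and 1 that scales the
+ macroflow rate (cwnd/srtt) down to the per-stream rate. The function
+ name and the unit conversions are assumptions of this sketch.
+
```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/*
 * Hypothetical helper: per-stream rate in bits per second.
 *   cwnd_bytes  congestion window of the macroflow, in bytes
 *   srtt_us     smoothed round-trip time, in microseconds (non-zero)
 *   share       fraction of the macroflow rate given to this stream
 */
u64 cm_stream_rate(u32 cwnd_bytes, u32 srtt_us, double share)
{
    assert(srtt_us > 0);
    assert(share >= 0.0 && share <= 1.0);

    /* bytes per microsecond -> bits per second */
    double macroflow_bps = (double)cwnd_bytes * 8.0 * 1e6 / (double)srtt_us;

    return (u64)(macroflow_bps * share);
}
```
+
+ For example, a 12500-byte window over a 100 ms round trip is
+ 1 Mbit/s for the macroflow, so a stream holding half of the
+ macroflow is notified of 500 kbit/s.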
+
+ To avoid unnecessary cmapp_update() callbacks that the application
+ will only ignore, the CM MUST provide a cm_thresh(float
+ rate_downthresh, float rate_upthresh, float rtt_downthresh, float
+ rtt_upthresh) function that a stream can use at any stage in its
+ execution. In response, the CM SHOULD invoke the callback only when
+ the rate decreases to less than (rate_downthresh * lastrate) or
+ increases to more than (rate_upthresh * lastrate), where lastrate is
+ the rate last notified to the stream, or when the round-trip time
+ changes correspondingly by the requisite thresholds. This
+ information is used as a hint by the CM, in the sense that
+ cmapp_update() can be called even if these conditions are not met.
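+
+ The rate half of the threshold test can be written as a single
+ predicate. The helper name is hypothetical; only the comparison
+ against (rate_downthresh * lastrate) and (rate_upthresh * lastrate)
+ is taken from the text above.
+
```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true when newrate has moved far enough from
 * lastrate (the rate last notified to the stream) that the CM should
 * invoke cmapp_update().
 */
bool cm_rate_crossed(double lastrate, double newrate,
                     float rate_downthresh, float rate_upthresh)
{
    return newrate < rate_downthresh * lastrate ||
           newrate > rate_upthresh * lastrate;
}
```
+
+ With rate_downthresh = 0.5 and rate_upthresh = 2.0, a stream last
+ told 1000 is notified again only once its rate falls below 500 or
+ rises above 2000.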
+
+ The CM MUST implement a cm_query(i32 cm_streamid, i64* rate, i32*
+ srtt, i32* rttdev) to allow an application to query the current CM
+ state. This sets the rate variable to the current rate estimate in
+ bits per second, the srtt variable to the current smoothed round-trip
+ time estimate in microseconds, and rttdev to the mean linear
+ deviation. If the CM does not have valid estimates for the
+ macroflow, it fills in negative values for the rate, srtt, and
+ rttdev.
+
+ Note that a stream can use more than one of the above transmission
+ APIs at the same time. In particular, the knowledge of sustainable
+ rate is useful for asynchronous streams as well as synchronous ones;
+ e.g., an asynchronous Web server disseminating images using TCP may
+ use cmapp_send() to schedule its transmissions and cmapp_update() to
+ decide whether to send a low-resolution or high-resolution image. A
+ TCP implementation using the CM is described in Section 5.1.1, where
+ the benefit of the cm_request() callback API for TCP will become
+ apparent.
+
+ The reader will notice that the basic CM API does not provide an
+ interface for buffered congestion-controlled transmissions. This is
+ intentional, since this transmission mode can be implemented using
+ the callback-based primitive. Section 5.1.2 describes how
+ congestion-controlled UDP sockets may be implemented using the CM
+ API.
+
+3.3 Application notification
+
+ When a stream receives feedback from receivers, it MUST use
+ cm_update(i32 cm_streamid, u32 nrecd, u32 nlost, u8 lossmode, i32
+ rtt) to inform the CM about events such as congestion losses,
+ successful receptions, type of loss (timeout event, Explicit
+ Congestion Notification [Ramakrishnan99], etc.) and round-trip time
+ samples. The nrecd parameter indicates how many bytes were
+ successfully received by the receiver since the last cm_update call,
+ while the nlost parameter identifies how many bytes were lost during
+ the same time period. The rtt value indicates the
+ round-trip time measured during the transmission of these bytes. The
+ rtt value must be set to -1 if no valid round-trip sample was
+ obtained by the application. The lossmode parameter provides an
+ indicator of how a loss was detected. A value of CM_NO_FEEDBACK
+ indicates that the application has received no feedback for all its
+ outstanding data, and is reporting this to the CM. For example, a
+ TCP that has experienced a timeout would use this parameter to inform
+ the CM of this. A value of CM_LOSS_FEEDBACK indicates that the
+ application has experienced some loss, which it believes to be due to
+ congestion, but not all outstanding data has been lost. For example,
+ a TCP segment loss detected using duplicate (selective)
+ acknowledgments or other data-driven techniques fits this category.
+ A value of CM_EXPLICIT_CONGESTION indicates that the receiver echoed
+ an explicit congestion notification message. Finally, a value of
+ CM_NO_CONGESTION indicates that no congestion-related loss has
+ occurred. The lossmode parameter MUST be reported as a bit-vector
+ where the bits correspond to CM_NO_FEEDBACK, CM_LOSS_FEEDBACK,
+ CM_EXPLICIT_CONGESTION, and CM_NO_CONGESTION. Note that over links
+ (paths) that experience losses for reasons other than congestion, an
+ application SHOULD inform the CM of losses, with the CM_NO_CONGESTION
+ field set.
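+
+ The lossmode bit-vector could be encoded as shown below. The bit
+ positions are an assumption of this sketch, since the text names
+ the four flags but does not assign values to them. The helper also
+ encodes one possible reading of the last rule above: a report with
+ CM_NO_CONGESTION set is not treated as a congestion signal, even if
+ a loss bit is set alongside it.
+
```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;

/* Hypothetical bit assignments for the four lossmode flags. */
enum {
    CM_NO_FEEDBACK         = 1 << 0, /* no feedback for any outstanding data */
    CM_LOSS_FEEDBACK       = 1 << 1, /* some loss, believed due to congestion */
    CM_EXPLICIT_CONGESTION = 1 << 2, /* receiver echoed an ECN mark */
    CM_NO_CONGESTION       = 1 << 3  /* no congestion-related loss */
};

/* Returns non-zero when the report should reduce the sending rate. */
int cm_is_congestion_signal(u8 lossmode)
{
    if (lossmode & CM_NO_CONGESTION)
        return 0;
    return (lossmode & (CM_NO_FEEDBACK | CM_LOSS_FEEDBACK |
                        CM_EXPLICIT_CONGESTION)) != 0;
}
```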
+
+ cm_notify(i32 cm_streamid, u32 nsent) MUST be called when data is
+ transmitted from the host (e.g., in the IP output routine) to inform
+ the CM that nsent bytes were just transmitted on a given stream.
+ This allows the CM to update its estimate of the number of
+ outstanding bytes for the macroflow and for the stream.
+
+ A cmapp_send() grant from the CM to an application is valid only for
+ an expiration time, equal to the larger of the round-trip time and an
+ implementation-dependent threshold communicated as an argument to the
+ cmapp_send() callback function. The application MUST NOT send data
+ based on this callback after this time has expired. Furthermore, if
+ the application decides not to send data after receiving this
+ callback, it SHOULD call cm_notify(cm_streamid, 0) to allow the CM to
+ permit other streams in the macroflow to transmit data. The CM
+ congestion controller MUST be robust to applications forgetting to
+ invoke cm_notify(cm_streamid, 0) correctly, or applications that
+ crash or disappear after having made a cm_request() call.
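+
+ The grant lifetime rule can be sketched as follows. The function
+ names, the microsecond time base, and the use of a monotonic clock
+ are assumptions of this sketch.
+
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Lifetime of a grant: the larger of the RTT and the threshold. */
u32 cm_grant_lifetime(u32 rtt_us, u32 threshold_us)
{
    return rtt_us > threshold_us ? rtt_us : threshold_us;
}

/*
 * Hypothetical check an application could make before sending on a
 * cmapp_send() grant. Timestamps come from a monotonic clock, in
 * microseconds.
 */
bool cm_grant_valid(u64 granted_at_us, u64 now_us,
                    u32 rtt_us, u32 threshold_us)
{
    if (now_us < granted_at_us)
        return false; /* clock went backwards: refuse to send */
    return now_us - granted_at_us <=
           cm_grant_lifetime(rtt_us, threshold_us);
}
```
+
+ An application that finds its grant expired sends nothing and, as
+ described above, reports zero bytes through cm_notify().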
+
+3.4 Querying
+
+ If applications wish to learn about per-stream available bandwidth
+ and round-trip time, they can use the CM's cm_query(i32 cm_streamid,
+ i64* rate, i32* srtt, i32* rttdev) call, which fills in the desired
+ quantities. If the CM does not have valid estimates for the
+ macroflow, it fills in negative values for the rate, srtt, and
+ rttdev.
+
+3.5 Sharing granularity
+
+ One of the decisions the CM needs to make is the granularity at which
+ a macroflow is constructed, by deciding which streams belong to the
+ same macroflow and share congestion information. The API provides
+ two functions that allow applications to decide which of their
+ streams ought to belong to the same macroflow.
+
+ cm_getmacroflow(i32 cm_streamid) returns a unique i32 macroflow
+ identifier. cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
+ sets the macroflow of the stream cm_streamid to cm_macroflowid. If
+ the cm_macroflowid that is passed to cm_setmacroflow() is -1, then a
+ new macroflow is constructed and this is returned to the caller.
+ Each call to cm_setmacroflow() overrides the previous macroflow
+ association for the stream, should one exist.
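+
+ A minimal sketch of the two calls, using a fixed-size table that
+ maps each stream to its macroflow. The table size, the choice of
+ macroflow 0 as the initial grouping, the identifier allocation
+ scheme, and the -1 error return are assumptions of this sketch.
+
```c
#include <assert.h>
#include <stdint.h>

typedef int32_t i32;

#define CM_MAX_STREAMS 8 /* hypothetical table size */

/* Zero-initialized: every stream starts out in macroflow 0. */
static i32 stream_macroflow[CM_MAX_STREAMS];
static i32 next_macroflowid = 1;

static int stream_ok(i32 id)
{
    return id >= 0 && id < CM_MAX_STREAMS;
}

i32 cm_getmacroflow(i32 cm_streamid)
{
    return stream_ok(cm_streamid) ? stream_macroflow[cm_streamid] : -1;
}

/* Passing -1 as cm_macroflowid creates a new macroflow and returns it. */
i32 cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
{
    if (!stream_ok(cm_streamid))
        return -1;
    if (cm_macroflowid == -1)
        cm_macroflowid = next_macroflowid++;
    /* Overrides any previous association for this stream. */
    stream_macroflow[cm_streamid] = cm_macroflowid;
    return cm_macroflowid;
}
```
+
+ To place two streams in a fresh macroflow, an application would
+ call cm_setmacroflow(-1, a) and pass the returned identifier to
+ cm_setmacroflow() for stream b.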
+
+ The default suggested aggregation method is to aggregate by
+ destination IP address; i.e., all streams to the same destination
+ address are aggregated to a single macroflow by default. The
+ cm_getmacroflow() and cm_setmacroflow() calls can then be used to
+ change this as needed. We do note that there are some cases where
+ this may not be optimal, even over best-effort networks. For
+ example, when a group of receivers are behind a NAT device, the
+ sender will see them all as one address. If the hosts behind the NAT
+ are in fact connected over different bottleneck links, some of those
+ hosts could see worse performance than before. It is possible to
+ detect such hosts when using delay and loss estimates, although the
+ specific mechanisms for doing so are beyond the scope of this
+ document.
+
+ The objective of this interface is to set up the groups of streams
+ that share congestion state, not the sharing policy (the relative
+ weights of streams within a macroflow). The
+ latter requires the scheduler to provide an interface to set sharing
+ policy. However, because we want to support many different
+ schedulers (each of which may need different information to set
+ policy), we do not specify a complete API to the scheduler (but see
+ Section 4.2). A later guideline document is expected to describe a
+ few simple schedulers (e.g., weighted round-robin, hierarchical
+ scheduling) and the API they export to provide relative
+ prioritization.
+
+4. CM internals
+
+ This section describes the internal components of the CM. It
+ includes a Congestion Controller and a Scheduler, with well-defined,
+ abstract interfaces exported by them.
+
+4.1 Congestion controller
+
+ Associated with each macroflow is a congestion control algorithm; the
+ collection of all these algorithms comprises the congestion
+ controller of the CM. The control algorithm decides when and how
+ much data can be transmitted by a macroflow. It uses application
+ notifications (Section 3.3) from concurrent streams on the same
+ macroflow to build up information about the congestion state of the
+ network path used by the macroflow.
+
+ The congestion controller MUST implement a "TCP-friendly" [Mahdavi98]
+ congestion control algorithm. Several macroflows MAY (and indeed,
+ often will) use the same congestion control algorithm but each
+ macroflow maintains state about the network used by its streams.
+
+ The congestion control module MUST implement the following abstract
+ interfaces. We emphasize that these are not directly visible to
+ applications; they are within the context of a macroflow, and are
+ different from the CM API functions of Section 3.
+
+ - void query(u64 *rate, u32 *srtt, u32 *rttdev): This function
+ fills in the estimated rate (in bits per second), the smoothed
+ round trip time (in microseconds), and the round-trip time
+ deviation for the macroflow.
+
+ - void notify(u32 nsent): This function MUST be used to notify the
+ congestion control module whenever data is sent by an
+ application. The nsent parameter indicates the number of bytes
+ just sent by the application.
+
+ - void update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode): This
+ function is called whenever any of the CM streams associated with
+ a macroflow identifies that data has reached the receiver or has
+ been lost en route. The nrecd parameter indicates the number of
+ bytes that have just arrived at the receiver. The nsent
+ parameter is the sum of the number of bytes just received and the
+ number of bytes identified as lost en route. The rtt parameter is
+ the estimated round trip time in microseconds during the
+ transfer. The lossmode parameter provides an indicator of how a
+ loss was detected (Section 3.3).
+
+ Although these interfaces are not visible to applications, the
+ congestion controller MUST implement these abstract interfaces to
+ provide for modular inter-operability with different separately-
+ developed schedulers.
+
+ The congestion control module MUST also call the associated
+ scheduler's schedule function (Section 4.2) when it believes that the
+ current congestion state allows an MTU-sized packet to be sent.
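+
+ The notify() and update() interfaces can be illustrated with a toy
+ window-based controller. This is a sketch under stated assumptions,
+ not the algorithm required by this document: it halves cwnd when
+ bytes are reported lost and otherwise grows it by roughly one MTU
+ per window, it ignores lossmode, and the struct, the cc_ prefix,
+ and the MTU constant are hypothetical.
+
```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

#define CC_MTU 1460u /* hypothetical segment size, in bytes */

struct cm_cc {
    u32 cwnd; /* congestion window, bytes */
    u32 ownd; /* outstanding window, bytes */
    u32 srtt; /* smoothed RTT, microseconds (0 = no sample yet) */
};

/* notify(): nsent more bytes are now outstanding. */
void cc_notify(struct cm_cc *cc, u32 nsent)
{
    cc->ownd += nsent;
}

/* update(): nsent bytes were resolved, of which nrecd arrived. */
void cc_update(struct cm_cc *cc, u32 nsent, u32 nrecd, u32 rtt,
               u32 lossmode)
{
    (void)lossmode; /* this sketch infers loss from byte counts only */

    cc->ownd = (nsent > cc->ownd) ? 0 : cc->ownd - nsent;

    if (rtt > 0) /* exponentially weighted average, gain 1/8 */
        cc->srtt = cc->srtt ? (u32)((7ull * cc->srtt + rtt) / 8) : rtt;

    if (nsent > nrecd) {
        /* Some bytes were lost: multiplicative decrease. */
        cc->cwnd /= 2;
        if (cc->cwnd < CC_MTU)
            cc->cwnd = CC_MTU;
    } else if (cc->cwnd > 0) {
        /* No loss: additive increase of about one MTU per window. */
        cc->cwnd += (u32)(((u64)nrecd * CC_MTU) / cc->cwnd);
    }
}

/* True when one more MTU-sized packet fits in the window. */
int cc_can_send(const struct cm_cc *cc)
{
    return (u64)cc->ownd + CC_MTU <= cc->cwnd;
}
```
+
+ A controller built this way would call the scheduler's schedule()
+ function whenever cc_can_send() becomes true.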
+
+4.2 Scheduler
+
+ While it is the responsibility of the congestion control module to
+ determine when and how much data can be transmitted, it is the
+ responsibility of a macroflow's scheduler module to determine which
+ of the streams should get the opportunity to transmit data.
+
+ The Scheduler MUST implement the following interfaces:
+
+ - void schedule(u32 num_bytes): When the congestion control module
+ determines that data can be sent, the schedule() routine MUST be
+ called with no more than the number of bytes that can be sent.
+ In turn, the scheduler MAY call the cmapp_send() function that CM
+ applications must provide.
+
+ - float query_share(i32 cm_streamid): This call returns the
+ described stream's share of the total bandwidth available to the
+ macroflow. This call combined with the query call of the
+ congestion controller provides the information to satisfy an
+ application's cm_query() request.
+
+ - void notify(i32 cm_streamid, u32 nsent): This interface is used
+ to notify the scheduler module whenever data is sent by a CM
+ application. The nsent parameter indicates the number of bytes
+ just sent by the application.
+
+ The Scheduler MAY implement many additional interfaces. As
+ experience with CM schedulers increases, future documents may
+ make additions and/or changes to some parts of the scheduler
+ API.
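+
+ As an illustration of the three interfaces, the sketch below is a
+ plain round-robin scheduler that hands out one PMTU-sized grant per
+ pending request. It is one possible scheduler, not a required one.
+ The sched_ prefix, the table size, the equal-share policy, and the
+ one-argument cmapp_send() stand-in are assumptions of this sketch.
+
```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;

#define SCHED_MAX_STREAMS 4     /* hypothetical table size */
#define SCHED_PMTU        1460u /* hypothetical PMTU, in bytes */

static int pending[SCHED_MAX_STREAMS];    /* unserved cm_request() calls */
static u32 bytes_sent[SCHED_MAX_STREAMS]; /* accounting from notify() */
static i32 next_stream;                   /* round-robin cursor */
static i32 last_granted = -1;             /* for demonstration only */

/* Stand-in for the application callback; the real one is app code. */
static void cmapp_send(i32 cm_streamid)
{
    last_granted = cm_streamid;
}

/* Records one cm_request() from a stream. */
void sched_request(i32 cm_streamid)
{
    pending[cm_streamid]++;
}

/* schedule(): grant one PMTU at a time, in round-robin order. */
void schedule(u32 num_bytes)
{
    while (num_bytes >= SCHED_PMTU) {
        int found = 0;
        for (i32 n = 0; n < SCHED_MAX_STREAMS && !found; n++) {
            i32 id = (next_stream + n) % SCHED_MAX_STREAMS;
            if (pending[id] > 0) {
                pending[id]--;
                next_stream = (id + 1) % SCHED_MAX_STREAMS;
                cmapp_send(id);
                num_bytes -= SCHED_PMTU;
                found = 1;
            }
        }
        if (!found)
            return; /* nobody is waiting to send */
    }
}

/* query_share(): equal split among streams with pending requests. */
float query_share(i32 cm_streamid)
{
    int active = 0;
    for (i32 id = 0; id < SCHED_MAX_STREAMS; id++)
        if (pending[id] > 0)
            active++;
    if (cm_streamid < 0 || cm_streamid >= SCHED_MAX_STREAMS ||
        pending[cm_streamid] == 0)
        return 0.0f;
    return 1.0f / (float)active;
}

/* notify(): per-stream accounting of transmitted bytes. */
void sched_notify(i32 cm_streamid, u32 nsent)
{
    bytes_sent[cm_streamid] += nsent;
}
```
+
+ A weighted scheduler would differ only in how schedule() picks the
+ next stream and in the value returned by query_share().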
+
+5. Examples
+
+5.1 Example applications
+
+ This section describes three possible uses of the CM API by
+ applications. We describe two asynchronous applications---an
+ implementation of a TCP sender and an implementation of congestion-
+ controlled UDP sockets, and a synchronous application---a streaming
+ audio server. More details of these applications and CM
+ implementation optimizations for efficient operation are described in
+ [Andersen00].
+
+ All applications that use the CM MUST incorporate feedback from the
+ receiver. For example, an application must periodically (typically
+ once or twice per round trip time) determine how many of its packets
+ arrived at the receiver. When the source gets this feedback, it MUST use
+ cm_update() to inform the CM of this new information. This results
+ in the CM updating ownd and may result in the CM changing its
+ estimates and calling cmapp_update() of the streams of the macroflow.
+
+ The protocols in this section are examples and suggestions for
+ implementation, rather than requirements for any conformant
+ implementation.
+
+5.1.1 TCP
+
+ A TCP implementation that uses CM should use the cmapp_send()
+ callback API. TCP only identifies which data it should send upon the
+ arrival of an acknowledgement or expiration of a timer. As a result,
+ it requires tight control over when and if new data or
+ retransmissions are sent.
+
+ When TCP either connects to or accepts a connection from another
+ host, it performs a cm_open() call to associate the TCP connection
+ with a cm_streamid.
+
+ Once a connection is established, the CM is used to control the
+ transmission of outgoing data. The CM eliminates the need for
+ tracking and reacting to congestion in TCP, because the CM and its
+ transmission API ensure proper congestion behavior. Loss recovery is
+ still performed by TCP based on fast retransmissions and recovery as
+ well as timeouts. TCP is also modified to maintain its own
+ outstanding window (tcp_ownd) estimate. Whenever data segments are
+ sent from its cmapp_send() callback, TCP updates its tcp_ownd value.
+ The tcp_ownd variable is also updated after each cm_update() call. TCP
+ also maintains a count of the number of outstanding segments
+ (pkt_cnt). At any time, TCP can calculate the average packet size
+ (avg_pkt_size) as tcp_ownd/pkt_cnt. The avg_pkt_size is used by TCP
+ to help estimate the amount of outstanding data. Note that this is
+ not needed if the SACK option is used on the connection, since this
+ information is explicitly available.
+
+ The TCP output routines are modified as follows:
+
+ 1. All congestion window (cwnd) checks are removed.
+
+ 2. When application data is available, the TCP output routines
+ perform all non-congestion checks (Nagle algorithm,
+ receiver-advertised window check, etc.). If these checks pass, the
+ output routine queues the data and calls cm_request() for the
+ stream.
+
+ 3. If incoming data or timers result in a loss being detected, the
+ retransmission is also placed in a queue and cm_request() is
+ called for the stream.
+
+ 4. The cmapp_send() callback for TCP is set to an output routine.
+ If any retransmission is enqueued, the routine outputs the
+ retransmission. Otherwise, the routine outputs as much new data
+ as the TCP connection state allows. However, the cmapp_send()
+ callback never sends more than a single segment per call. This
+ routine arranges for the other output computations to be done, such
+ as header and options computations.
+
+ The IP output routine on the host calls cm_notify() when the packets
+ are actually sent out. Because it does not know which cm_streamid is
+ responsible for the packet, cm_notify() takes the stream_info as
+ argument (see Section 4 for what the stream_info should contain).
+ Because cm_notify() reports the IP payload size, TCP keeps track of
+ the total header size and incorporates these updates.
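+
+ The ordering in item 4 of the TCP output changes above can be
+ sketched in C as follows. This is a minimal illustration, not part of
+ the CM API: the tcp_out structure, the tcp_sent enumeration, and
+ tcp_cmapp_send() are hypothetical names, and the two queues are
+ reduced to counters. It shows only that a queued retransmission is
+ sent before new data and that at most one segment is sent per call.
+

```c
#include <stddef.h>

/* Hypothetical per-connection output state; not defined by the CM API. */
struct tcp_out {
    size_t rexmit_queued;  /* segments waiting to be retransmitted */
    size_t new_queued;     /* new segments that passed the non-congestion checks */
};

/* Hypothetical result code so a caller can see what was sent. */
enum tcp_sent { TCP_SENT_NONE, TCP_SENT_REXMIT, TCP_SENT_NEW };

/*
 * Body of a cmapp_send()-style callback: retransmissions take priority
 * over new data, and at most one segment is sent per call.
 */
enum tcp_sent tcp_cmapp_send(struct tcp_out *t)
{
    if (t->rexmit_queued > 0) {
        t->rexmit_queued--;
        return TCP_SENT_REXMIT;
    }
    if (t->new_queued > 0) {
        t->new_queued--;
        return TCP_SENT_NEW;
    }
    return TCP_SENT_NONE;
}
```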
+
+ The TCP input routines are modified as follows:
+
+ 1. RTT estimation is done as normal using either timestamps or
+ Karn's algorithm. Any RTT estimate that is generated is passed to
+ the CM via the cm_update() call.
+
+ 2. All cwnd and slow start threshold (ssthresh) updates are
+ removed.
+
+ 3. Upon the arrival of an ack for new data, TCP computes the value
+ of in_flight (the amount of data in flight) as snd_max-ack-1
+ (i.e., MAX Sequence Sent - Current Ack - 1). TCP then calls
+ cm_update(streamid, tcp_ownd - in_flight, 0, CM_NO_CONGESTION,
+ rtt).
+
+ 4. Upon the arrival of a duplicate acknowledgement, TCP must check
+ its dupack count (dup_acks) to determine its action. If dup_acks
+ < 3, TCP does nothing. If dup_acks == 3, TCP assumes that a
+ packet was lost and that at least 3 packets arrived to generate
+ these duplicate acks. Therefore, it calls cm_update(streamid, 4 *
+ avg_pkt_size, 3 * avg_pkt_size, CM_LOSS_FEEDBACK, rtt). The
+ average packet size is used since the acknowledgments do not
+ indicate exactly how much data has reached the other end. Most
+ TCP implementations interpret a duplicate ACK as an indication
+ that a full MSS has reached its destination. Once a new ACK is
+ received, these TCP sender implementations may resynchronize with
+ the TCP receiver. The CM API does not provide a mechanism for TCP to
+ pass information from this resynchronization. Therefore, TCP can
+ only infer the arrival of an avg_pkt_size amount of data from each
+ duplicate ack. TCP also enqueues a retransmission of the lost
+ segment and calls cm_request(). If dup_acks > 3, TCP assumes that
+ a packet has reached the other end and caused this ack to be sent.
+ As a result, it calls cm_update(streamid, avg_pkt_size,
+ avg_pkt_size, CM_NO_CONGESTION, rtt).
+
+ 5. Upon the arrival of a partial acknowledgment (one that does not
+ exceed the highest segment transmitted at the time the loss
+ occurred, as defined in [Floyd99b]), TCP assumes that a packet was
+ lost and that the retransmitted packet has reached the recipient.
+ Therefore, it calls cm_update(streamid, 2 * avg_pkt_size,
+ avg_pkt_size, CM_NO_CONGESTION, rtt). CM_NO_CONGESTION is used
+ since the loss period has already been reported. TCP also
+ enqueues a retransmission of the lost segment and calls
+ cm_request().
+
+ When the TCP retransmission timer expires, the sender identifies that
+ a segment has been lost and calls cm_update(streamid, avg_pkt_size,
+ 0, CM_NO_FEEDBACK, 0) to signify that no feedback has been received
+ from the receiver and that one segment is sure to have "left the
+ pipe." TCP also enqueues a retransmission of the lost segment and
+ calls cm_request().
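+
+ The cm_update() arguments described above for duplicate acks, partial
+ acks, and retransmission timeouts can be summarized in a short C
+ sketch. The loss-mode names are taken from the text, but their numeric
+ values, the cm_feedback structure, and the on_*() helper functions are
+ assumptions made for illustration. The rtt argument and the enqueueing
+ of retransmissions are omitted.
+

```c
#include <stdint.h>

/* Loss-mode names come from the text; the numeric values are assumed. */
enum cm_lossmode { CM_NO_FEEDBACK, CM_NO_CONGESTION, CM_LOSS_FEEDBACK };

/* Hypothetical container for the nsent, nrecd, and lossmode arguments. */
struct cm_feedback {
    uint32_t nsent;
    uint32_t nrecd;
    enum cm_lossmode lossmode;
    int call_update;  /* 0 when TCP makes no cm_update() call */
};

/* Duplicate ack handling (item 4 of the input routine changes). */
struct cm_feedback on_dup_ack(unsigned dup_acks, uint32_t avg_pkt_size)
{
    struct cm_feedback f = { 0, 0, CM_NO_CONGESTION, 0 };

    if (dup_acks < 3)
        return f;                      /* TCP does nothing */
    f.call_update = 1;
    if (dup_acks == 3) {
        /* One packet lost, three arrived to generate the dupacks. */
        f.nsent = 4 * avg_pkt_size;
        f.nrecd = 3 * avg_pkt_size;
        f.lossmode = CM_LOSS_FEEDBACK;
    } else {
        /* One more packet reached the receiver. */
        f.nsent = avg_pkt_size;
        f.nrecd = avg_pkt_size;
    }
    return f;
}

/* Partial ack handling (item 5): the loss was already reported. */
struct cm_feedback on_partial_ack(uint32_t avg_pkt_size)
{
    struct cm_feedback f = { 2 * avg_pkt_size, avg_pkt_size,
                             CM_NO_CONGESTION, 1 };
    return f;
}

/* Retransmission timer expiry: one segment has left the pipe. */
struct cm_feedback on_rto(uint32_t avg_pkt_size)
{
    struct cm_feedback f = { avg_pkt_size, 0, CM_NO_FEEDBACK, 1 };
    return f;
}
```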
+
+5.1.2 Congestion-controlled UDP
+
+ Congestion-controlled UDP is a useful CM application, which we
+ describe in the context of Berkeley sockets [Stevens94]. These sockets
+ provide the same functionality as standard Berkeley UDP sockets, but
+ instead of immediately sending the data from the kernel packet queue
+ to lower layers for transmission, the buffered socket implementation
+ makes calls to the API exported by the CM inside the kernel and gets
+ callbacks from the CM. When a CM UDP socket is created, it is bound
+ to a particular stream. Later, when data is added to the packet
+ queue, cm_request() is called on the stream associated with the
+ socket. When the CM schedules this stream for transmission, it calls
+ udp_ccappsend() in the UDP module. This function transmits one MTU
+ from the packet queue, and schedules the transmission of any
+ remaining packets. The in-kernel implementation of the CM UDP API
+ should not require any additional data copies and should support all
+ standard UDP options. Modifying existing applications to use
+ congestion-controlled UDP requires the implementation of a new socket
+ option on the socket. To work correctly, the sender must obtain
+ feedback about congestion. This can be done in at least two ways:
+ (i) the UDP receiver application can provide feedback to the sender
+ application, which will inform the CM of network conditions using
+ cm_update(); (ii) the UDP receiver implementation can provide
+ feedback to the sending UDP. Note that this latter alternative
+ requires changes to the receiver's network stack and the sender UDP
+ cannot assume that all receivers support this option without explicit
+ negotiation.
+
+5.1.3 Audio server
+
+ A typical audio application often has access to the sample in a
+ multitude of data rates and qualities. The objective of the
+ application is then to deliver the highest possible quality of audio
+ (typically the highest data rate) to its clients. The selection of
+ which version of audio to transmit should be based on the current
+ congestion state of the network. In addition, the source will want
+ audio delivered to its users at a consistent sampling rate. As a
+ result, it must send data at a regular rate, minimizing delayed
+ transmissions and reducing buffering before playback. To meet these
+ requirements, this application can use the synchronous sender API
+ (Section 4.2).
+
+ When the source first starts, it uses the cm_query() call to get an
+ initial estimate of network bandwidth and delay. If some other
+ streams on that macroflow have already been active, then it gets an
+ initial estimate that is valid; otherwise, it gets negative values,
+ which it ignores. It then chooses an encoding that does not exceed
+ these estimates (or, in the case of an invalid estimate, uses
+ application-specific initial values) and begins transmitting data.
+ The application also implements the cmapp_update() callback. When
+ the CM determines that network characteristics have changed, it calls
+ the application's cmapp_update() function and passes it a new rate
+ and round-trip time estimate. The application must change its choice
+ of audio encoding to ensure that it does not exceed these new
+ estimates.
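+
+ The encoding choice described above can be sketched in C as follows.
+ The function name pick_encoding(), the rates[] table, and the fallback
+ parameter are hypothetical and are not part of the CM API. The sketch
+ assumes the available encodings are listed in ascending order of data
+ rate and treats a negative rate estimate as invalid, as the text
+ describes for cm_query().
+

```c
#include <stddef.h>

/*
 * Return the index of the highest-rate encoding that does not exceed
 * rate_est. rates[] must be sorted in ascending order.
 *
 * A negative estimate is treated as invalid and the application-chosen
 * fallback index is returned. If even the lowest rate exceeds the
 * estimate, index 0 (the lowest rate) is returned.
 */
size_t pick_encoding(const double *rates, size_t n, double rate_est,
                     size_t fallback)
{
    size_t best = 0;

    if (rate_est < 0.0)
        return fallback;
    for (size_t i = 0; i < n; i++) {
        if (rates[i] <= rate_est)
            best = i;
    }
    return best;
}
```

+
+ The same function could be called both after the initial cm_query()
+ and from the application's cmapp_update() callback, so that one rule
+ governs the initial choice and later adaptation.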
+
+5.2 Example congestion control module
+
+ To illustrate the responsibilities of a congestion control module,
+ the following describes some of the actions of a simple TCP-like
+ congestion control module that implements Additive Increase
+ Multiplicative Decrease congestion control (AIMD_CC):
+
+ - query(): AIMD_CC returns the current congestion window (cwnd)
+ divided by the smoothed rtt (srtt) as its bandwidth estimate. It
+ returns srtt as its round-trip time estimate.
+
+ - notify(): AIMD_CC adds the number of bytes sent to its
+ outstanding data window (ownd).
+
+ - update(): AIMD_CC subtracts nsent from ownd. If the value of rtt
+ is non-zero, AIMD_CC updates srtt using the TCP srtt calculation.
+ If the update indicates that data has been lost, AIMD_CC sets
+ cwnd to 1 MTU if the loss_mode is CM_NO_FEEDBACK and to cwnd/2
+ (with a minimum of 1 MTU) if the loss_mode is CM_LOSS_FEEDBACK or
+ CM_EXPLICIT_CONGESTION. AIMD_CC also sets its internal ssthresh
+ variable to cwnd/2. If no loss has occurred, AIMD_CC mimics TCP
+ slow start and linear growth modes. It increments cwnd by nsent
+ when cwnd < ssthresh (bounded by a maximum of ssthresh-cwnd) and
+ by nsent * MTU/cwnd when cwnd >= ssthresh.
+
+ - When cwnd or ownd are updated and indicate that at least one MTU
+ may be transmitted, AIMD_CC calls the CM to schedule a
+ transmission.
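+
+ The window arithmetic of update() can be sketched in C as follows.
+ This is an illustration under stated assumptions rather than a
+ reference implementation: the aimd_cc structure and aimd_update() are
+ hypothetical names, the numeric values of the loss modes are assumed,
+ ssthresh is assumed to be computed from the value of cwnd before it
+ is reduced, ownd is clamped at zero, and the srtt calculation is
+ omitted.
+

```c
#include <stdint.h>

/* Loss-mode names come from the text; the numeric values are assumed. */
enum cm_lossmode {
    CM_NO_FEEDBACK,
    CM_NO_CONGESTION,
    CM_LOSS_FEEDBACK,
    CM_EXPLICIT_CONGESTION
};

/* Hypothetical AIMD_CC state; all quantities are in bytes. */
struct aimd_cc {
    uint32_t mtu;       /* must be non-zero */
    uint32_t cwnd;      /* congestion window, at least one MTU */
    uint32_t ssthresh;  /* slow start threshold */
    uint32_t ownd;      /* outstanding data window */
};

void aimd_update(struct aimd_cc *cc, uint32_t nsent, enum cm_lossmode lossmode)
{
    /* Data reported by update() is no longer outstanding (clamped at 0). */
    cc->ownd = (nsent < cc->ownd) ? cc->ownd - nsent : 0;

    if (lossmode != CM_NO_CONGESTION) {
        /* Assumption: ssthresh uses cwnd as it was before the reduction. */
        uint32_t half = cc->cwnd / 2;

        cc->ssthresh = half;
        if (lossmode == CM_NO_FEEDBACK)
            cc->cwnd = cc->mtu;
        else
            cc->cwnd = (half > cc->mtu) ? half : cc->mtu;
        return;
    }

    if (cc->cwnd < cc->ssthresh) {
        /* Slow start: grow by nsent, but not past ssthresh. */
        uint32_t room = cc->ssthresh - cc->cwnd;

        cc->cwnd += (nsent < room) ? nsent : room;
    } else {
        /* Linear growth: nsent * MTU / cwnd. */
        cc->cwnd += (uint32_t)(((uint64_t)nsent * cc->mtu) / cc->cwnd);
    }
}
```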
+
+5.3 Example Scheduler Module
+
+ To clarify the responsibilities of a scheduler module, the following
+ describes some of the actions of a simple round robin scheduler
+ module (RR_sched):
+
+ - schedule(): RR_sched schedules as many streams as possible in round
+ robin fashion.
+
+ - query_share(): RR_sched returns 1/(number of streams in macroflow).
+
+ - notify(): RR_sched does nothing. Round robin scheduling is not
+ affected by the amount of data sent.
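+
+ A minimal C sketch of RR_sched follows. The rr_sched structure, the
+ rr_*() function names, the mtu and out[] parameters, and the
+ RR_MAX_STREAMS limit are hypothetical. The sketch assumes every
+ stream in the macroflow has a pending request and records which
+ streams would be granted a transmission opportunity; a real scheduler
+ would instead arrange for each stream's cmapp_send() to be called.
+

```c
#include <stdint.h>

#define RR_MAX_STREAMS 8   /* arbitrary limit for this sketch */

struct rr_sched {
    int32_t  streams[RR_MAX_STREAMS];  /* cm_streamids in the macroflow */
    uint32_t nstreams;
    uint32_t next;                     /* index of the next stream to serve */
};

/*
 * schedule(): grant one MTU-sized opportunity per stream, in round
 * robin order, for as long as num_bytes allows. Granted streamids are
 * written to out[] (at most out_cap entries); the count is returned.
 */
uint32_t rr_schedule(struct rr_sched *s, uint32_t num_bytes, uint32_t mtu,
                     int32_t *out, uint32_t out_cap)
{
    uint32_t granted = 0;

    if (s->nstreams == 0 || mtu == 0)
        return 0;
    while (num_bytes >= mtu && granted < out_cap) {
        out[granted++] = s->streams[s->next];
        s->next = (s->next + 1) % s->nstreams;
        num_bytes -= mtu;
    }
    return granted;
}

/* query_share(): every stream gets an equal share of the macroflow. */
float rr_query_share(const struct rr_sched *s, int32_t cm_streamid)
{
    (void)cm_streamid;
    return s->nstreams ? 1.0f / (float)s->nstreams : 0.0f;
}

/* notify(): round robin scheduling ignores the amount of data sent. */
void rr_notify(struct rr_sched *s, int32_t cm_streamid, uint32_t nsent)
{
    (void)s;
    (void)cm_streamid;
    (void)nsent;
}
```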
+
+6. Security Considerations
+
+ The CM provides many of the same services that the congestion control
+ in TCP provides. As such, it is vulnerable to many of the same
+ security problems. For example, incorrect reports of losses and
+ transmissions will give the CM an inaccurate picture of the network's
+ congestion state. By giving CM a high estimate of congestion, an
+ attacker can degrade the performance observed by applications. For
+ example, a stream on a host can arbitrarily slow down any other
+ stream on the same macroflow, a form of denial of service.
+
+ The more dangerous form of attack occurs when an application gives
+ the CM a low estimate of congestion. This would cause CM to be
+ overly aggressive and allow data to be sent much more quickly than
+ sound congestion control policies would allow.
+
+ [Touch97] describes a number of the security problems that arise with
+ congestion information sharing. An additional vulnerability (not
+ covered by [Touch97]) occurs because applications have access
+ through the CM API to control shared state that will affect other
+ applications on the same computer. For instance, a poorly designed,
+ possibly compromised, or intentionally malicious UDP application
+ could misuse cm_update() to cause starvation and/or too-aggressive
+ behavior of others in the macroflow.
+
+7. References
+
+ [Allman99] Allman, M. and Paxson, V., "TCP Congestion
+ Control", RFC 2581, April 1999.
+
+ [Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S., and
+ Balakrishnan, H., "System Support for Bandwidth
+ Management and Content Adaptation in Internet
+ Applications," Proc. 4th Symp. on Operating Systems
+ Design and Implementation, San Diego, CA, October
+ 2000. Available from
+ http://nms.lcs.mit.edu/papers/cm-osdi2000.html
+
+ [Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S.,
+ Stemm, M., and Katz, R., "TCP Behavior of a Busy
+ Web Server: Analysis and Improvements," Proc. IEEE
+ INFOCOM, San Francisco, CA, March 1998.
+
+ [Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An
+ Integrated Congestion Management Architecture for
+ Internet Hosts," Proc. ACM SIGCOMM, Cambridge, MA,
+ September 1999.
+
+ [Bradner96] Bradner, S., "The Internet Standards Process ---
+ Revision 3", BCP 9, RFC 2026, October 1996.
+
+ [Bradner97] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [Clark90] Clark, D. and Tennenhouse, D., "Architectural
+ Consideration for a New Generation of Protocols",
+ Proc. ACM SIGCOMM, Philadelphia, PA, September
+ 1990.
+
+ [Eggert00] Eggert, L., Heidemann, J., and Touch, J., "Effects
+ of Ensemble TCP," ACM Computer Comm. Review,
+ January 2000.
+
+ [Floyd99a] Floyd, S. and Fall, K., "Promoting the Use of
+ End-to-End Congestion Control in the Internet,"
+ IEEE/ACM Trans. on Networking, 7(4), August 1999,
+ pp. 458-472.
+
+ [Floyd99b] Floyd, S. and T. Henderson, "The NewReno
+ Modification to TCP's Fast Recovery Algorithm," RFC
+ 2582, April 1999.
+
+ [Jacobson88] Jacobson, V., "Congestion Avoidance and Control,"
+ Proc. ACM SIGCOMM, Stanford, CA, August 1988.
+
+ [Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly
+ Website,"
+ http://www.psc.edu/networking/tcp_friendly.html
+
+ [Mogul90] Mogul, J. and S. Deering, "Path MTU Discovery," RFC
+ 1191, November 1990.
+
+ [Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of Web
+ Data Transport," PhD thesis, Univ. of California,
+ Berkeley, December 1998.
+
+ [Paxson00] Paxson, V. and M. Allman, "Computing TCP's
+ Retransmission Timer", RFC 2988, November 2000.
+
+ [Postel81] Postel, J., Editor, "Transmission Control
+ Protocol", STD 7, RFC 793, September 1981.
+
+ [Ramakrishnan99] Ramakrishnan, K. and Floyd, S., "A Proposal to Add
+ Explicit Congestion Notification (ECN) to IP," RFC
+ 2481, January 1999.
+
+
+ [Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1.
+ Addison-Wesley, Reading, MA, 1994.
+
+ [Touch97] Touch, J., "TCP Control Block Interdependence", RFC
+ 2140, April 1997.
+
+8. Acknowledgments
+
+ We thank David Andersen, Deepak Bansal, and Dorothy Curtis for their
+ work on the CM design and implementation. We thank Vern Paxson for
+ his detailed comments, feedback, and patience, and Sally Floyd, Mark
+ Handley, and Steven McCanne for useful feedback on the CM
+ architecture. Allison Mankin and Joe Touch provided several useful
+ comments on previous drafts of this document.
+
+9. Authors' Addresses
+
+ Hari Balakrishnan
+ Laboratory for Computer Science
+ 200 Technology Square
+ Massachusetts Institute of Technology
+ Cambridge, MA 02139
+
+ EMail: hari@lcs.mit.edu
+ Web: http://nms.lcs.mit.edu/~hari/
+
+
+ Srinivasan Seshan
+ School of Computer Science
+ Carnegie Mellon University
+ 5000 Forbes Ave.
+ Pittsburgh, PA 15213
+
+ EMail: srini@cmu.edu
+ Web: http://www.cs.cmu.edu/~srini/
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2001). All Rights Reserved.
+
+ This document and translations of it may be copied and furnished to
+ others, and derivative works that comment on or otherwise explain it
+ or assist in its implementation may be prepared, copied, published
+ and distributed, in whole or in part, without restriction of any
+ kind, provided that the above copyright notice and this paragraph are
+ included on all such copies and derivative works. However, this
+ document itself may not be modified in any way, such as by removing
+ the copyright notice or references to the Internet Society or other
+ Internet organizations, except as needed for the purpose of
+ developing Internet standards in which case the procedures for
+ copyrights defined in the Internet Standards process must be
+ followed, or as required to translate it into languages other than
+ English.
+
+ The limited permissions granted above are perpetual and will not be
+ revoked by the Internet Society or its successors or assigns.
+
+ This document and the information contained herein is provided on an
+ "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+ TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+ BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+ HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+ MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+