Network Working Group H. Balakrishnan
Request for Comments: 3124 MIT LCS
Category: Standards Track S. Seshan
CMU
June 2001
The Congestion Manager
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2001). All Rights Reserved.
Abstract
This document describes the Congestion Manager (CM), an end-system
module that:
(i) Enables an ensemble of multiple concurrent streams from a sender
destined to the same receiver and sharing the same congestion
properties to perform proper congestion avoidance and control, and
(ii) Allows applications to easily adapt to network congestion.
1. Conventions used in this document:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [Bradner97].
STREAM
A group of packets that all share the same source and destination
IP address, IP type-of-service, transport protocol, and source and
destination transport-layer port numbers.
Balakrishnan, et. al. Standards Track [Page 1]
RFC 3124 The Congestion Manager June 2001
MACROFLOW
A group of CM-enabled streams that all use the same congestion
management and scheduling algorithms, and share congestion state
information. Currently, streams destined to different receivers
belong to different macroflows. Streams destined to the same
receiver MAY belong to different macroflows. When the Congestion
Manager is in use, streams that experience identical congestion
behavior and use the same congestion control algorithm SHOULD
belong to the same macroflow.
APPLICATION
Any software module that uses the CM. This includes user-level
applications such as Web servers or audio/video servers, as well
as in-kernel protocols such as TCP [Postel81] that use the CM for
congestion control.
WELL-BEHAVED APPLICATION
An application that only transmits when allowed by the CM and
accurately accounts for all data that it has sent to the receiver
by informing the CM using the CM API.
PATH MAXIMUM TRANSMISSION UNIT (PMTU)
The size of the largest packet that the sender can transmit
without it being fragmented en route to the receiver. It includes
the sizes of all headers and data except the IP header.
CONGESTION WINDOW (cwnd)
A CM state variable that modulates the amount of outstanding data
between sender and receiver.
OUTSTANDING WINDOW (ownd)
The number of bytes that has been transmitted by the source, but
not known to have been either received by the destination or lost
in the network.
INITIAL WINDOW (IW)
The size of the sender's congestion window at the beginning of a
macroflow.
DATA TYPE SYNTAX
We use "u64" for unsigned 64-bit, "u32" for unsigned 32-bit, "u16"
for unsigned 16-bit, "u8" for unsigned 8-bit, "i32" for signed
32-bit, "i16" for signed 16-bit quantities, "float" for IEEE
floating point values. The type "void" is used to indicate that
no return value is expected from a call. Pointers are referred to
using "*" syntax, following C language convention.
We emphasize that all the API functions described in this document
are "abstract" calls and that conformant CM implementations may
differ in specific implementation details.
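As an illustration of the data type syntax above, the abbreviations
can be written as C type aliases. This is a minimal sketch: the mapping
onto <stdint.h> is an assumption, since the text fixes only the width
and signedness of each name.

```c
#include <assert.h>
#include <stdint.h>

/* Aliases for the abbreviations used in this document. Mapping them to
 * <stdint.h> types is an assumption made for illustration. */
typedef uint64_t u64;
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t  u8;
typedef int32_t  i32;
typedef int16_t  i16;

/* Compile-time checks that each alias has the stated width in bytes. */
static_assert(sizeof(u64) == 8, "u64 must be 64 bits");
static_assert(sizeof(u32) == 4, "u32 must be 32 bits");
static_assert(sizeof(u16) == 2, "u16 must be 16 bits");
static_assert(sizeof(u8)  == 1, "u8 must be 8 bits");
static_assert(sizeof(i32) == 4, "i32 must be 32 bits");
static_assert(sizeof(i16) == 2, "i16 must be 16 bits");
```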
2. Introduction
The framework described in this document integrates congestion
management across all applications and transport protocols. The CM
maintains congestion parameters (available aggregate and per-stream
bandwidth, per-receiver round-trip times, etc.) and exports an API
that enables applications to learn about network characteristics,
pass information to the CM, share congestion information with each
other, and schedule data transmissions. This document focuses on
applications and transport protocols with their own independent per-
byte or per-packet sequence number information, and does not require
modifications to the receiver protocol stack. However, the receiving
application must provide feedback to the sending application about
received packets and losses, and the latter is expected to use the CM
API to update CM state. This document does not address networks with
reservations or service differentiation.
The CM is an end-system module that enables an ensemble of multiple
concurrent streams to perform stable congestion avoidance and
control, and allows applications to easily adapt their transmissions
to prevailing network conditions. It integrates congestion
management across all applications and transport protocols. It
maintains congestion parameters (available aggregate and per-stream
bandwidth, per-receiver round-trip times, etc.) and exports an API
that enables applications to learn about network characteristics,
pass information to the CM, share congestion information with each
other, and schedule data transmissions. When the CM is used, all
data transmissions subject to the CM must be done with the explicit
consent of the CM via this API to ensure proper congestion behavior.
Systems MAY choose to use CM, and if so they MUST follow this
specification.
This document focuses on applications and networks where the
following conditions hold:
1. Applications are well-behaved with their own independent
per-byte or per-packet sequence number information, and use the
CM API to update internal state in the CM.
2. Networks are best-effort without service discrimination or
reservations. In particular, it does not address situations
where different streams between the same pair of hosts traverse
paths with differing characteristics.
The Congestion Manager framework can be extended to support
applications that do not provide their own feedback and to
differentially-served networks. These extensions will be addressed
in later documents.
The CM is motivated by two main goals:
(i) Enable efficient multiplexing. Increasingly, the trend on the
Internet is for unicast data senders (e.g., Web servers) to transmit
heterogeneous types of data to receivers, ranging from unreliable
real-time streaming content to reliable Web pages and applets. As a
result, many logically different streams share the same path between
sender and receiver. For the Internet to remain stable, each of
these streams must incorporate control protocols that safely probe
for spare bandwidth and react to congestion. Unfortunately, these
concurrent streams typically compete with each other for network
resources, rather than share them effectively. Furthermore, they do
not learn from each other about the state of the network. Even if
they each independently implement congestion control (e.g., a group
of TCP connections each implementing the algorithms in [Jacobson88,
Allman99]), the ensemble of streams tends to be more aggressive in
the face of congestion than a single TCP connection implementing
standard TCP congestion control and avoidance [Balakrishnan98].
(ii) Enable application adaptation to congestion. Increasingly,
popular real-time streaming applications run over UDP using their own
user-level transport protocols for good application performance, but
in most cases today do not adapt or react properly to network
congestion. By implementing a stable control algorithm and exposing
an adaptation API, the CM enables easy application adaptation to
congestion. Applications adapt the data they transmit to the current
network conditions.
The CM framework builds on recent work on TCP control block sharing
[Touch97], integrated TCP congestion control (TCP-Int)
[Balakrishnan98] and TCP sessions [Padmanabhan98]. [Touch97]
advocates the sharing of some of the state in the TCP control block
to improve transient transport performance and describes sharing
across an ensemble of TCP connections. [Balakrishnan98],
[Padmanabhan98], and [Eggert00] describe several experiments that
quantify the benefits of sharing congestion state, including improved
stability in the face of congestion and better loss recovery.
Integrating loss recovery across concurrent connections significantly
improves performance because losses on one connection can be detected
by noticing that later data sent on another connection has been
received and acknowledged. The CM framework extends these ideas in
two significant ways: (i) it extends congestion management to non-TCP
streams, which are becoming increasingly common and often do not
implement proper congestion management, and (ii) it provides an API
for applications to adapt their transmissions to current network
conditions. For an extended discussion of the motivation for the CM,
its architecture, API, and algorithms, see [Balakrishnan99]; for a
description of an implementation and performance results, see
[Andersen00].
The resulting end-host protocol architecture at the sender is shown
in Figure 1. The CM helps achieve network stability by implementing
stable congestion avoidance and control algorithms that are "TCP-
friendly" [Mahdavi98] based on algorithms described in [Allman99].
However, it does not attempt to enforce proper congestion behavior
for all applications (but it does not preclude a policer on the host
that performs this task). Note that while the policer at the end-
host can use CM, the network has to be protected against compromises
to the CM and the policer at the end hosts, a task that requires
router machinery [Floyd99a]. We do not address this issue further in
this document.
   |--------| |--------| |--------| |--------|       |--------------|
   |  HTTP  | |  FTP   | |  RTP 1 | |  RTP 2 |       |              |
   |--------| |--------| |--------| |--------|       |              |
       |          |         |  ^       |  ^          |              |
       |          |         |  |       |  |          |   Scheduler  |
       |          |         |  |       |  |  |---|   |              |
       |          |         |  |-------|--+->|   |   |              |
       |          |         |          |     |   |<--|              |
       v          v         v          v     |   |   |--------------|
   |--------| |--------|  |-------------|    |   |           ^
   |  TCP 1 | |  TCP 2 |  |    UDP 1    |    | A |           |
   |--------| |--------|  |-------------|    |   |           |
      ^   |      ^   |              |        |   |   |--------------|
      |   |      |   |              |        | P |-->|              |
      |   |      |   |              |        |   |   |              |
      |---|------+---|--------------|------->|   |   |  Congestion  |
          |          |              |        | I |   |              |
          v          v              v        |   |   |  Controller  |
     |-----------------------------------|   |   |   |              |
     |               IP                  |-->|   |   |              |
     |-----------------------------------|   |   |   |--------------|
                                             |---|

                                Figure 1
The key components of the CM framework are (i) the API, (ii) the
congestion controller, and (iii) the scheduler. The API is (in part)
motivated by the requirements of application-level framing (ALF)
[Clark90], and is described in Section 3. The CM internals (Section
4) include a congestion controller (Section 4.1) and a scheduler to
orchestrate data transmissions between concurrent streams in a
macroflow (Section 4.2). The congestion controller adjusts the
aggregate transmission rate between sender and receiver based on its
estimate of congestion in the network. It obtains feedback about its
past transmissions from applications themselves via the API. The
scheduler apportions available bandwidth amongst the different
streams within each macroflow and notifies applications when they are
permitted to send data. This document focuses on well-behaved
applications; a future one will describe the sender-receiver protocol
and header formats that will handle applications that do not
incorporate their own feedback to the CM.
3. CM API
By convention, the IETF does not treat Application Programming
Interfaces as standards track. However, it is considered important
to have the CM API and CM algorithm requirements in one coherent
document. The following section on the CM API uses the terms MUST,
SHOULD, etc., but the terms are meant to apply within the context of
an implementation of the CM API. The section does not apply to
congestion control implementations in general, only to those
implementations offering the CM API.
Using the CM API, streams can determine their share of the available
bandwidth, request and have their data transmissions scheduled,
inform the CM about successful transmissions, and be informed when
the CM's estimate of path bandwidth changes. Thus, the CM frees
applications from having to maintain information about the state of
congestion and available bandwidth along any path.
The function prototypes below follow standard C language convention.
We emphasize that these API functions are abstract calls and
conformant CM implementations may differ in specific details, as long
as equivalent functionality is provided.
When a new stream is created by an application, it passes some
information to the CM via the cm_open(stream_info) API call.
Currently, stream_info consists of the following information: (i) the
source IP address, (ii) the source port, (iii) the destination IP
address, (iv) the destination port, and (v) the IP protocol number.
3.1 State maintenance
1. Open: All applications MUST call cm_open(stream_info) before
using the CM API. This returns a handle, cm_streamid, for the
application to use for all further CM API invocations for that
stream. If the returned cm_streamid is -1, then the cm_open()
failed and that stream cannot use the CM.
All other calls to the CM for a stream use the cm_streamid
returned from the cm_open() call.
2. Close: When a stream terminates, the application SHOULD invoke
cm_close(cm_streamid) to inform the CM about the termination
of the stream.
3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of
the path between sender and receiver. Internally, this
information SHOULD be obtained via path MTU discovery
[Mogul90]. It MAY be statically configured in the absence of
such a mechanism.
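The three state maintenance calls can be sketched with a small
in-memory stream table. This is an illustrative sketch only: the
stream_info field names, the IPv4-sized addresses, the table size, the
statically configured PMTU value, and the -1 result of cm_mtu() for an
unknown handle are all assumptions, not part of the API described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t  u8;

#define CM_MAX_STREAMS  8      /* hypothetical table size */
#define CM_DEFAULT_PMTU 1500   /* hypothetical statically configured PMTU */

/* The five items listed for stream_info; field names are illustrative
 * and addresses are assumed to be IPv4. */
struct stream_info {
    u32 src_addr;
    u16 src_port;
    u32 dst_addr;
    u16 dst_port;
    u8  ip_proto;              /* IP protocol number, e.g. 6 for TCP */
};

struct cm_stream {
    int in_use;
    struct stream_info info;
};

static struct cm_stream cm_streams[CM_MAX_STREAMS];

/* Returns a handle (cm_streamid), or -1 if the stream cannot use the CM. */
i32 cm_open(const struct stream_info *info)
{
    assert(info != NULL);
    for (i32 id = 0; id < CM_MAX_STREAMS; id++) {
        if (!cm_streams[id].in_use) {
            cm_streams[id].in_use = 1;
            cm_streams[id].info = *info;
            return id;
        }
    }
    return -1;                 /* table full */
}

/* Releases the handle so that the slot can be reused. */
void cm_close(i32 cm_streamid)
{
    if (cm_streamid >= 0 && cm_streamid < CM_MAX_STREAMS)
        cm_streams[cm_streamid].in_use = 0;
}

/* Returns the PMTU estimate; here always the static default.
 * Returning -1 for an unknown handle is an assumption. */
i32 cm_mtu(i32 cm_streamid)
{
    if (cm_streamid < 0 || cm_streamid >= CM_MAX_STREAMS ||
        !cm_streams[cm_streamid].in_use)
        return -1;
    return CM_DEFAULT_PMTU;
}
```

An application would call cm_open() once per stream, keep the returned
handle for every later call, and call cm_close() when the stream ends.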
3.2 Data transmission
The CM accommodates two types of adaptive senders, enabling
applications to dynamically adapt their content based on prevailing
network conditions, and supporting ALF-based applications.
1. Callback-based transmission. The callback-based transmission API
puts the stream in firm control of deciding what to transmit at each
point in time. To achieve this, the CM does not buffer any data;
instead, it allows streams the opportunity to adapt to unexpected
network changes at the last possible instant. Thus, this enables
streams to "pull out" and repacketize data upon learning about any
rate change, which is hard to do once the data has been buffered.
The CM must implement a cm_request(i32 cm_streamid) call for streams
wishing to send data in this style. After some time, depending on
the rate, the CM MUST invoke a callback using cmapp_send(), which is
a grant for the stream to send up to PMTU bytes. The callback-style
API is the recommended choice for ALF-based streams. Note that
cm_request() does not take the number of bytes or MTU-sized units as
an argument; each call to cm_request() is an implicit request for
sending up to PMTU bytes. The CM MAY provide an alternate interface,
cm_request(int k). The cmapp_send callback for this request is
granted the right to send up to k PMTU-sized segments. Section 3.3
discusses the time duration for which the transmission grant is
valid, while Section 4.2 describes how these requests are scheduled
and callbacks made.
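The request/grant pattern of the callback-based API can be sketched as
follows. The text does not give a prototype for cmapp_send(), so the
callback type, the cm_register_send() helper, and cm_grant_one() (which
stands in for the CM deciding that one PMTU-sized packet may be sent)
are hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;

#define CM_MAX_STREAMS 4               /* hypothetical limit */

/* Hypothetical callback type for cmapp_send(). */
typedef void (*cmapp_send_fn)(i32 cm_streamid);

static cmapp_send_fn send_cb[CM_MAX_STREAMS];
static u32 pending[CM_MAX_STREAMS];    /* outstanding cm_request() calls */

/* Hypothetical helper that tells the CM which callback to invoke. */
void cm_register_send(i32 cm_streamid, cmapp_send_fn fn)
{
    assert(cm_streamid >= 0 && cm_streamid < CM_MAX_STREAMS);
    send_cb[cm_streamid] = fn;
}

/* Each call is an implicit request to send up to PMTU bytes. The CM
 * records the request but buffers no data. */
void cm_request(i32 cm_streamid)
{
    assert(cm_streamid >= 0 && cm_streamid < CM_MAX_STREAMS);
    pending[cm_streamid]++;
}

/* Stand-in for the CM's internal decision that one packet may go out:
 * grants the lowest-numbered stream with a pending request. Returns the
 * stream granted, or -1 if no request was pending. */
i32 cm_grant_one(void)
{
    for (i32 id = 0; id < CM_MAX_STREAMS; id++) {
        if (pending[id] > 0 && send_cb[id] != NULL) {
            pending[id]--;
            send_cb[id](id);   /* the stream chooses what to send now */
            return id;
        }
    }
    return -1;
}

/* Example application callback: counts grants instead of transmitting. */
u32 grants_seen;
void app_send(i32 cm_streamid)
{
    (void)cm_streamid;
    grants_seen++;
}
```

Because the data stays with the application until the callback runs,
the stream can still repacketize or drop stale content at grant time.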
2. Synchronous-style. The above callback-based API accommodates a
class of ALF streams that are "asynchronous." Asynchronous
transmitters do not transmit based on a periodic clock, but do so
triggered by asynchronous events like file reads or captured frames.
On the other hand, there are many streams that are "synchronous"
transmitters, which transmit periodically based on their own internal
timers (e.g., an audio sender that sends at a constant sampling
rate). While CM callbacks could be configured to periodically
interrupt such transmitters, the transmit loop of such applications
is less affected if they retain their original timer-based loop. In
addition, it complicates the CM API to have a stream express the
periodicity and granularity of its callbacks. Thus, the CM MUST
export an API that allows such streams to be informed of changes in
rates using the cmapp_update(u64 newrate, u32 srtt, u32 rttdev)
callback function, where newrate is the new rate in bits per second
for this stream, srtt is the current smoothed round trip time
estimate in microseconds, and rttdev is the smoothed linear deviation
in the round-trip time estimate calculated using the same algorithm
as in TCP [Paxson00]. The newrate value reports an instantaneous
rate calculated, for example, by taking the ratio of cwnd and srtt,
and multiplying by the fraction of that ratio allocated to the stream.
In response, the stream MUST adapt its packet size or change its
timer interval to conform to (i.e., not exceed) the allowed rate. Of
course, it may choose not to use all of this rate. Note that the CM
is not on the data path of the actual transmission.
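A worked sketch of the rate arithmetic may help. The first function
derives a per-stream rate from cwnd and srtt, scaling the aggregate
rate by the stream's share of the macroflow, which is this sketch's
reading of the allocation step. The second shows how a synchronous
sender might turn the reported rate into a timer interval. Both
function names and the share parameter are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Per-stream rate in bits per second, from the macroflow's cwnd (bytes),
 * srtt (microseconds) and the stream's share of the macroflow (0..1). */
u64 stream_rate_bps(u32 cwnd_bytes, u32 srtt_us, double share)
{
    assert(srtt_us > 0 && share >= 0.0 && share <= 1.0);
    double aggregate_bps = (double)cwnd_bytes * 8.0 * 1e6 / (double)srtt_us;
    return (u64)(aggregate_bps * share);
}

/* Smallest timer interval (microseconds) at which a periodic sender can
 * emit packets of pkt_bytes without exceeding newrate_bps. The division
 * rounds up so that the allowed rate is never exceeded. */
u64 min_interval_us(u32 pkt_bytes, u64 newrate_bps)
{
    assert(newrate_bps > 0);
    u64 bit_us = (u64)pkt_bytes * 8u * 1000000u;
    return (bit_us + newrate_bps - 1) / newrate_bps;
}
```

For example, a cwnd of 15000 bytes and an srtt of 100 ms give an
aggregate of 1.2 Mbit/s, and a stream with a half share is reported
600 kbit/s. An audio sender emitting 160-byte packets at 64 kbit/s
needs a timer interval of at least 20 ms.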
To avoid unnecessary cmapp_update() callbacks that the application
will only ignore, the CM MUST provide a cm_thresh(float
rate_downthresh, float rate_upthresh, float rtt_downthresh, float
rtt_upthresh) function that a stream can use at any stage in its
execution. In response, the CM SHOULD invoke the callback only when
the rate decreases to less than (rate_downthresh * lastrate) or
increases to more than (rate_upthresh * lastrate), where lastrate is
the rate last notified to the stream, or when the round-trip time
changes correspondingly by the requisite thresholds. This
information is used as a hint by the CM, in the sense that
cmapp_update() can be called even if these conditions are not met.
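The threshold test described above can be sketched as a single
predicate. The struct and function names are hypothetical, and applying
the round-trip time thresholds in the same multiplicative way as the
rate thresholds is an assumption about what "correspondingly" means.

```c
#include <assert.h>
#include <stdbool.h>

/* Thresholds as passed to cm_thresh(); the struct is illustrative. */
struct cm_thresholds {
    float rate_downthresh;   /* e.g. 0.5: report when the rate halves  */
    float rate_upthresh;     /* e.g. 2.0: report when the rate doubles */
    float rtt_downthresh;
    float rtt_upthresh;
};

/* Returns true when the rate or the round-trip time has moved past a
 * threshold relative to the value last reported to the stream. */
bool cm_should_update(const struct cm_thresholds *t,
                      double lastrate, double newrate,
                      double lastrtt, double newrtt)
{
    assert(t != 0);
    if (newrate < t->rate_downthresh * lastrate) return true;
    if (newrate > t->rate_upthresh * lastrate)   return true;
    if (newrtt  < t->rtt_downthresh * lastrtt)   return true;
    if (newrtt  > t->rtt_upthresh * lastrtt)     return true;
    return false;
}
```

Since the thresholds are only a hint, a CM remains free to invoke
cmapp_update() even when this predicate is false.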
The CM MUST implement a cm_query(i32 cm_streamid, u64* rate, u32*
srtt, u32* rttdev) to allow an application to query the current CM
state. This sets the rate variable to the current rate estimate in
bits per second, the srtt variable to the current smoothed round-trip
time estimate in microseconds, and rttdev to the mean linear
deviation. If the CM does not have valid estimates for the
macroflow, it fills in negative values for the rate, srtt, and
rttdev.
Note that a stream can use more than one of the above transmission
APIs at the same time. In particular, the knowledge of sustainable
rate is useful for asynchronous streams as well as synchronous ones;
e.g., an asynchronous Web server disseminating images using TCP may
use cmapp_send() to schedule its transmissions and cmapp_update() to
decide whether to send a low-resolution or high-resolution image. A
TCP implementation using the CM is described in Section 5.1.1, where
the benefit of the cm_request() callback API for TCP will become
apparent.
The reader will notice that the basic CM API does not provide an
interface for buffered congestion-controlled transmissions. This is
intentional, since this transmission mode can be implemented using
the callback-based primitive. Section 5.1.2 describes how
congestion-controlled UDP sockets may be implemented using the CM
API.
3.3 Application notification
When a stream receives feedback from receivers, it MUST use
cm_update(i32 cm_streamid, u32 nrecd, u32 nlost, u8 lossmode, i32
rtt) to inform the CM about events such as congestion losses,
successful receptions, type of loss (timeout event, Explicit
Congestion Notification [Ramakrishnan99], etc.) and round-trip time
samples. The nrecd parameter indicates how many bytes were
successfully received by the receiver since the last cm_update call,
while the nlost parameter identifies how many bytes were lost during
the same time period. The rtt value indicates the
round-trip time measured during the transmission of these bytes. The
rtt value must be set to -1 if no valid round-trip sample was
obtained by the application. The lossmode parameter provides an
indicator of how a loss was detected. A value of CM_NO_FEEDBACK
indicates that the application has received no feedback for all its
outstanding data, and is reporting this to the CM. For example, a
TCP that has experienced a timeout would use this parameter to inform
the CM of this. A value of CM_LOSS_FEEDBACK indicates that the
application has experienced some loss, which it believes to be due to
congestion, but not all outstanding data has been lost. For example,
a TCP segment loss detected using duplicate (selective)
acknowledgments or other data-driven techniques fits this category.
A value of CM_EXPLICIT_CONGESTION indicates that the receiver echoed
an explicit congestion notification message. Finally, a value of
CM_NO_CONGESTION indicates that no congestion-related loss has
occurred. The lossmode parameter MUST be reported as a bit-vector
where the bits correspond to CM_NO_FEEDBACK, CM_LOSS_FEEDBACK,
CM_EXPLICIT_CONGESTION, and CM_NO_CONGESTION. Note that over links
(paths) that experience losses for reasons other than congestion, an
application SHOULD inform the CM of losses, with the CM_NO_CONGESTION
field set.
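The lossmode bit-vector can be sketched as follows. The text names the
four flags but does not assign them values, so the bit positions here
are assumptions. The cm_classify() helper and its three reaction levels
are also hypothetical: they mirror the TCP analogies given above (a
timeout for CM_NO_FEEDBACK, data-driven loss detection for
CM_LOSS_FEEDBACK) and treat a report carrying CM_NO_CONGESTION as
needing no congestion response.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;

/* Bit positions are an assumption made for this sketch. */
enum {
    CM_NO_FEEDBACK         = 1u << 0,
    CM_LOSS_FEEDBACK       = 1u << 1,
    CM_EXPLICIT_CONGESTION = 1u << 2,
    CM_NO_CONGESTION       = 1u << 3
};

/* Hypothetical reaction levels a congestion controller might use. */
enum cm_reaction { CM_REACT_NONE, CM_REACT_REDUCE, CM_REACT_RESTART };

enum cm_reaction cm_classify(u8 lossmode)
{
    assert((lossmode & ~0x0Fu) == 0);      /* only the four known bits */
    if (lossmode & CM_NO_CONGESTION)
        return CM_REACT_NONE;              /* loss was not congestion */
    if (lossmode & CM_NO_FEEDBACK)
        return CM_REACT_RESTART;           /* like a TCP timeout */
    if (lossmode & (CM_LOSS_FEEDBACK | CM_EXPLICIT_CONGESTION))
        return CM_REACT_REDUCE;            /* like duplicate ACKs or ECN */
    return CM_REACT_NONE;
}
```

A stream on a lossy wireless link that detects a corruption loss could
report CM_LOSS_FEEDBACK | CM_NO_CONGESTION, which this sketch maps to
no window reduction.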
cm_notify(i32 cm_streamid, u32 nsent) MUST be called when data is
transmitted from the host (e.g., in the IP output routine) to inform
the CM that nsent bytes were just transmitted on a given stream.
This allows the CM to update its estimate of the number of
outstanding bytes for the macroflow and for the stream.
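The bookkeeping implied by cm_notify() and cm_update() can be sketched
as a single counter of outstanding bytes. The struct and function names
are hypothetical, and only the macroflow-level counter is shown; a real
CM would keep a per-stream counter as well.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Outstanding bytes (ownd) for one macroflow. */
struct cm_account {
    u32 ownd;
};

/* Effect of cm_notify(): nsent bytes have just left the host. */
void account_notify(struct cm_account *a, u32 nsent)
{
    assert(a != 0);
    a->ownd += nsent;
}

/* Effect of cm_update(): nrecd bytes arrived and nlost bytes were lost,
 * so neither is outstanding any longer. The counter clamps at zero in
 * case an application reports more than it notified. */
void account_update(struct cm_account *a, u32 nrecd, u32 nlost)
{
    assert(a != 0);
    u32 resolved = nrecd + nlost;
    a->ownd = (resolved > a->ownd) ? 0 : a->ownd - resolved;
}

/* Bytes that may still be sent under a congestion window of cwnd. */
u32 account_room(const struct cm_account *a, u32 cwnd)
{
    assert(a != 0);
    return (cwnd > a->ownd) ? cwnd - a->ownd : 0;
}
```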
A cmapp_send() grant from the CM to an application is valid only for
an expiration time, equal to the larger of the round-trip time and an
implementation-dependent threshold communicated as an argument to the
cmapp_send() callback function. The application MUST NOT send data
based on this callback after this time has expired. Furthermore, if
the application decides not to send data after receiving this
callback, it SHOULD call cm_notify(cm_streamid, 0) to allow the CM to
permit other streams in the macroflow to transmit data. The CM
congestion controller MUST be robust to applications forgetting to
invoke cm_notify(cm_streamid, 0) correctly, or applications that
crash or disappear after having made a cm_request() call.
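The grant lifetime rule can be sketched as below. The struct, the
function names, and the use of microsecond timestamps are assumptions;
the text specifies only that the expiration time is the larger of the
round-trip time and an implementation-dependent threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

struct cm_grant {
    u64 issued_us;     /* time at which cmapp_send() was invoked */
    u32 lifetime_us;   /* larger of the RTT and the threshold */
};

/* Builds the grant that accompanies a cmapp_send() callback. */
struct cm_grant cm_make_grant(u64 now_us, u32 rtt_us, u32 threshold_us)
{
    struct cm_grant g;
    g.issued_us = now_us;
    g.lifetime_us = (rtt_us > threshold_us) ? rtt_us : threshold_us;
    return g;
}

/* An application checks this before sending on the grant. */
bool cm_grant_valid(const struct cm_grant *g, u64 now_us)
{
    assert(g != 0 && now_us >= g->issued_us);
    return now_us - g->issued_us <= g->lifetime_us;
}
```

On the CM side, the same lifetime can serve as the point at which an
unused grant is reclaimed, which is one way to stay robust against
applications that never call cm_notify().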
3.4 Querying
If applications wish to learn about per-stream available bandwidth
and round-trip time, they can use the CM's cm_query(i32 cm_streamid,
i64* rate, i32* srtt, i32* rttdev) call, which fills in the desired
quantities. If the CM does not have valid estimates for the
macroflow, it fills in negative values for the rate, srtt, and
rttdev.
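The convention of reporting negative values when no estimate exists can
be sketched as follows, using the signed pointer types shown in this
section. The cm_estimate struct and both function names are
hypothetical; only the negative-value convention comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t i64;
typedef int32_t i32;

/* Hypothetical per-stream estimate held inside the CM. */
struct cm_estimate {
    int valid;       /* nonzero once the macroflow has estimates */
    i64 rate_bps;    /* bits per second */
    i32 srtt_us;     /* microseconds */
    i32 rttdev_us;   /* microseconds */
};

/* Fills the caller's variables, or negative values when the CM has no
 * valid estimates for the macroflow. */
void cm_query_fill(const struct cm_estimate *e,
                   i64 *rate, i32 *srtt, i32 *rttdev)
{
    assert(rate != NULL && srtt != NULL && rttdev != NULL);
    if (e == NULL || !e->valid) {
        *rate = -1;
        *srtt = -1;
        *rttdev = -1;
        return;
    }
    *rate = e->rate_bps;
    *srtt = e->srtt_us;
    *rttdev = e->rttdev_us;
}

/* Caller-side check before using the returned values. */
int cm_estimate_valid(i64 rate, i32 srtt, i32 rttdev)
{
    return rate >= 0 && srtt >= 0 && rttdev >= 0;
}
```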
3.5 Sharing granularity
One of the decisions the CM needs to make is the granularity at which
a macroflow is constructed, by deciding which streams belong to the
same macroflow and share congestion information. The API provides
two functions that allow applications to decide which of their
streams ought to belong to the same macroflow.
cm_getmacroflow(i32 cm_streamid) returns a unique i32 macroflow
identifier. cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
sets the macroflow of the stream cm_streamid to cm_macroflowid. If
the cm_macroflowid that is passed to cm_setmacroflow() is -1, then a
new macroflow is constructed and this is returned to the caller.
Each call to cm_setmacroflow() overrides the previous macroflow
association for the stream, should one exist.
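The two macroflow calls can be sketched with a per-stream table. This
is an illustration under stated assumptions: the table size is
hypothetical, every stream starts in macroflow 0 (as if all streams
shared one destination), and cm_setmacroflow() is assumed to return the
macroflow identifier that ends up in effect.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t i32;

#define CM_MAX_STREAMS 8                  /* hypothetical table size */

/* Macroflow of each stream; zero-initialized, so all streams start in
 * macroflow 0 in this sketch. */
static i32 stream_mflow[CM_MAX_STREAMS];
static i32 next_mflow = 1;                /* next unused identifier */

i32 cm_getmacroflow(i32 cm_streamid)
{
    assert(cm_streamid >= 0 && cm_streamid < CM_MAX_STREAMS);
    return stream_mflow[cm_streamid];
}

/* Moves the stream to cm_macroflowid, overriding any previous
 * association. Passing -1 constructs a new macroflow. Returns the
 * identifier now associated with the stream. */
i32 cm_setmacroflow(i32 cm_macroflowid, i32 cm_streamid)
{
    assert(cm_streamid >= 0 && cm_streamid < CM_MAX_STREAMS);
    assert(cm_macroflowid >= -1);
    if (cm_macroflowid == -1)
        cm_macroflowid = next_mflow++;
    stream_mflow[cm_streamid] = cm_macroflowid;
    return cm_macroflowid;
}
```

To group two streams apart from the default, an application would call
cm_setmacroflow(-1, first) and then pass the returned identifier when
moving the second stream.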
The default suggested aggregation method is to aggregate by
destination IP address; i.e., all streams to the same destination
address are aggregated to a single macroflow by default. The
cm_getmacroflow() and cm_setmacroflow() calls can then be used to
change this as needed. We do note that there are some cases where
this may not be optimal, even over best-effort networks. For
example, when a group of receivers are behind a NAT device, the
sender will see them all as one address. If the hosts behind the NAT
are in fact connected over different bottleneck links, some of those
hosts could see worse performance than before. It is possible to
detect such hosts when using delay and loss estimates, although the
specific mechanisms for doing so are beyond the scope of this
document.
The objective of this interface is to set up the groups that share
congestion state, not the sharing policy (the relative weights of
streams within a macroflow). The
latter requires the scheduler to provide an interface to set sharing
policy. However, because we want to support many different
schedulers (each of which may need different information to set
policy), we do not specify a complete API to the scheduler (but see
Section 4.2). A later guideline document is expected to describe a
few simple schedulers (e.g., weighted round-robin, hierarchical
scheduling) and the API they export to provide relative
prioritization.
4. CM internals
This section describes the internal components of the CM. It
includes a Congestion Controller and a Scheduler, with well-defined,
abstract interfaces exported by them.
4.1 Congestion controller
Associated with each macroflow is a congestion control algorithm; the
collection of all these algorithms comprises the congestion
controller of the CM. The control algorithm decides when and how
much data can be transmitted by a macroflow. It uses application
notifications (Section 3.3) from concurrent streams on the same
macroflow to build up information about the congestion state of the
network path used by the macroflow.
The congestion controller MUST implement a "TCP-friendly" [Mahdavi98]
congestion control algorithm. Several macroflows MAY (and indeed,
often will) use the same congestion control algorithm but each
macroflow maintains state about the network used by its streams.
The congestion control module MUST implement the following abstract
interfaces. We emphasize that these are not directly visible to
applications; they are within the context of a macroflow, and are
different from the CM API functions of Section 3.
- void query(u64 *rate, u32 *srtt, u32 *rttdev): This function
returns the estimated rate (in bits per second) and smoothed
round trip time (in microseconds) for the macroflow.
- void notify(u32 nsent): This function MUST be used to notify the
congestion control module whenever data is sent by an
application. The nsent parameter indicates the number of bytes
just sent by the application.
- void update(u32 nsent, u32 nrecd, u32 rtt, u32 lossmode): This
function is called whenever any of the CM streams associated with
a macroflow identifies that data has reached the receiver or has
been lost en route. The nrecd parameter indicates the number of
bytes that have just arrived at the receiver. The nsent
parameter is the sum of the number of bytes just received and the
number of bytes identified as lost en route. The rtt parameter is
the estimated round trip time in microseconds during the
transfer. The lossmode parameter provides an indicator of how a
loss was detected (Section 3.3).
Although these interfaces are not visible to applications, the
congestion controller MUST implement these abstract interfaces to
provide for modular inter-operability with different separately-
developed schedulers.
The congestion control module MUST also call the associated
scheduler's schedule function (Section 4.2) when it believes that the
current congestion state allows an MTU-sized packet to be sent.
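
The abstract interfaces of this section could be written down as
follows. This is a minimal Python sketch rather than part of the CM
specification: the class name CongestionController, the schedule_cb
hook, and the helper resolved_bytes() are hypothetical names
introduced for illustration, and the pointer arguments of query() are
modeled as a returned tuple.

```python
from abc import ABC, abstractmethod
from typing import Callable, Tuple


class CongestionController(ABC):
    """Hypothetical per-macroflow congestion control module."""

    def __init__(self, schedule_cb: Callable[[int], None]) -> None:
        # schedule_cb stands in for the associated scheduler's schedule()
        # function, to be called when an MTU-sized packet may be sent.
        self.schedule_cb = schedule_cb

    @abstractmethod
    def query(self) -> Tuple[int, int, int]:
        """Return (rate in bits/s, srtt in microseconds, rttdev)."""

    @abstractmethod
    def notify(self, nsent: int) -> None:
        """Record that nsent bytes were just sent by an application."""

    @abstractmethod
    def update(self, nsent: int, nrecd: int, rtt: int, lossmode: int) -> None:
        """Record that nsent bytes were resolved, nrecd of them received."""


def resolved_bytes(nrecd: int, nlost: int) -> int:
    """nsent for update(): bytes just received plus bytes identified as lost."""
    return nrecd + nlost
```

Keeping the three calls behind one abstract type is one way to let
separately developed schedulers work with any controller.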
4.2 Scheduler
While it is the responsibility of the congestion control module to
determine when and how much data can be transmitted, it is the
responsibility of a macroflow's scheduler module to determine which
of the streams should get the opportunity to transmit data.
The Scheduler MUST implement the following interfaces:
- void schedule(u32 num_bytes): When the congestion control module
determines that data can be sent, the schedule() routine MUST be
called with no more than the number of bytes that can be sent.
In turn, the scheduler MAY call the cmapp_send() function that CM
applications must provide.
- float query_share(i32 cm_streamid): This call returns the
described stream's share of the total bandwidth available to the
macroflow. This call combined with the query call of the
congestion controller provides the information to satisfy an
application's cm_query() request.
- void notify(i32 cm_streamid, u32 nsent): This interface is used
to notify the scheduler module whenever data is sent by a CM
application. The nsent parameter indicates the number of bytes
just sent by the application.
The Scheduler MAY implement many additional interfaces. As
experience with CM schedulers increases, future documents may
make additions and/or changes to some parts of the scheduler
API.
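
As an illustration of how query_share() and the congestion
controller's query() might be combined, the sketch below multiplies
the macroflow rate by the stream's share. The text says only that the
two calls together provide the needed information; the multiplication
and the function name stream_rate_estimate() are assumptions made for
this example.

```python
def stream_rate_estimate(macroflow_rate_bps: int, share: float) -> int:
    """Assumed combination: a stream gets its share of the macroflow rate."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be a fraction between 0 and 1")
    return int(macroflow_rate_bps * share)


# A macroflow estimated at 8 Mbit/s whose scheduler gives this stream a
# quarter of the available bandwidth.
print(stream_rate_estimate(8_000_000, 0.25))  # prints 2000000
```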
5. Examples
5.1 Example applications
This section describes three possible uses of the CM API by
applications. We describe two asynchronous applications (an
implementation of a TCP sender and an implementation of
congestion-controlled UDP sockets) and one synchronous application
(a streaming audio server). More details of these applications and CM
implementation optimizations for efficient operation are described in
[Andersen00].
All applications that use the CM MUST incorporate feedback from the
receiver. For example, the source must periodically (typically once or twice
per round trip time) determine how many of its packets arrived at the
receiver. When the source gets this feedback, it MUST use
cm_update() to inform the CM of this new information. This results
in the CM updating ownd and may result in the CM changing its
estimates and calling cmapp_update() of the streams of the macroflow.
The protocols in this section are examples and suggestions for
implementation, rather than requirements for any conformant
implementation.
5.1.1 TCP
A TCP implementation that uses CM should use the cmapp_send()
callback API. TCP only identifies which data it should send upon the
arrival of an acknowledgement or expiration of a timer. As a result,
it requires tight control over when and if new data or
retransmissions are sent.
When TCP either connects to or accepts a connection from another
host, it performs a cm_open() call to associate the TCP connection
with a cm_streamid.
Once a connection is established, the CM is used to control the
transmission of outgoing data. The CM eliminates the need for
tracking and reacting to congestion in TCP, because the CM and its
transmission API ensure proper congestion behavior. Loss recovery is
still performed by TCP based on fast retransmissions and recovery as
well as timeouts. In addition, TCP is modified to maintain its own
outstanding window (tcp_ownd) estimate. Whenever data segments are
sent from its cmapp_send() callback, TCP updates its tcp_ownd value.
The ownd variable is also updated after each cm_update() call. TCP
also maintains a count of the number of outstanding segments
(pkt_cnt). At any time, TCP can calculate the average packet size
(avg_pkt_size) as tcp_ownd/pkt_cnt. The avg_pkt_size is used by TCP
to help estimate the amount of outstanding data. Note that this is
not needed if the SACK option is used on the connection, since this
information is explicitly available.
The TCP output routines are modified as follows:
1. All congestion window (cwnd) checks are removed.
2. When application data is available, the TCP output routines
perform all non-congestion checks (Nagle algorithm,
receiver-advertised window check, etc.). If these checks pass, the
output routine queues the data and calls cm_request() for the stream.
3. If incoming data or timers result in a loss being detected, the
retransmission is also placed in a queue and cm_request() is
called for the stream.
4. The cmapp_send() callback for TCP is set to an output routine.
If any retransmission is enqueued, the routine outputs the
retransmission. Otherwise, the routine outputs as much new data
as the TCP connection state allows. However, the cmapp_send()
never sends more than a single segment per call. This routine
arranges for the other output computations to be done, such as
header and options computations.
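
The output-side steps above can be sketched as follows. This is an
illustrative Python model, not a TCP implementation: the class
TcpSender, its queues, and the wire list are hypothetical, and
cm_request() is a local stand-in that only counts requests.

```python
from collections import deque
from typing import Deque, List


class TcpSender:
    """Hypothetical TCP sender state; only the two queues matter here."""

    def __init__(self) -> None:
        self.retransmit_queue: Deque[bytes] = deque()
        self.new_data_queue: Deque[bytes] = deque()
        self.wire: List[bytes] = []   # stands in for the IP output routine
        self.requests = 0             # number of cm_request() calls made

    def cm_request(self) -> None:
        # Stand-in for cm_request(): ask the CM for a chance to transmit.
        self.requests += 1

    def output(self, segment: bytes) -> None:
        # Step 2: the non-congestion checks are assumed to have passed.
        self.new_data_queue.append(segment)
        self.cm_request()

    def loss_detected(self, segment: bytes) -> None:
        # Step 3: queue the retransmission and request a transmission.
        self.retransmit_queue.append(segment)
        self.cm_request()

    def cmapp_send(self) -> None:
        # Step 4: retransmissions go first; at most one segment per call.
        if self.retransmit_queue:
            self.wire.append(self.retransmit_queue.popleft())
        elif self.new_data_queue:
            self.wire.append(self.new_data_queue.popleft())


tcp = TcpSender()
tcp.output(b"new-1")
tcp.loss_detected(b"rexmit-0")
tcp.cmapp_send()
tcp.cmapp_send()
print(tcp.wire)  # prints [b'rexmit-0', b'new-1']
```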
The IP output routine on the host calls cm_notify() when the packets
are actually sent out. Because it does not know which cm_streamid is
responsible for the packet, cm_notify() takes the stream_info as an
argument (see Section 3 for what the stream_info should contain).
Because cm_notify() reports the IP payload size, TCP keeps track of
the total header size and incorporates these updates.
The TCP input routines are modified as follows:
1. RTT estimation is done as normal using either timestamps or
Karn's algorithm. Any rtt estimate that is generated is passed to
CM via the cm_update call.
2. All cwnd and slow start threshold (ssthresh) updates are
removed.
3. Upon the arrival of an ack for new data, TCP computes the value
of in_flight (the amount of data in flight) as snd_max-ack-1
(i.e., MAX Sequence Sent - Current Ack - 1). TCP then calls
cm_update(streamid, tcp_ownd - in_flight, 0, CM_NO_CONGESTION,
rtt).
4. Upon the arrival of a duplicate acknowledgement, TCP must check
its dupack count (dup_acks) to determine its action. If dup_acks
< 3, TCP does nothing. If dup_acks == 3, TCP assumes that a
packet was lost and that at least 3 packets arrived to generate
these duplicate acks. Therefore, it calls cm_update(streamid, 4 *
avg_pkt_size, 3 * avg_pkt_size, CM_LOSS_FEEDBACK, rtt). The
average packet size is used since the acknowledgments do not
indicate exactly how much data has reached the other end. Most
TCP implementations interpret a duplicate ACK as an indication
that a full MSS has reached its destination. Once a new ACK is
received, these TCP sender implementations may resynchronize with the
TCP receiver. The CM API does not provide a mechanism for TCP to
pass information from this resynchronization. Therefore, TCP can
only infer the arrival of an avg_pkt_size amount of data from each
duplicate ack. TCP also enqueues a retransmission of the lost
segment and calls cm_request(). If dup_acks > 3, TCP assumes that
a packet has reached the other end and caused this ack to be sent.
As a result, it calls cm_update(streamid, avg_pkt_size,
avg_pkt_size, CM_NO_CONGESTION, rtt).
5. Upon the arrival of a partial acknowledgment (one that does not
exceed the highest segment transmitted at the time the loss
occurred, as defined in [Floyd99b]), TCP assumes that a packet was
lost and that the retransmitted packet has reached the recipient.
Therefore, it calls cm_update(streamid, 2 * avg_pkt_size,
avg_pkt_size, CM_NO_CONGESTION, rtt). CM_NO_CONGESTION is used
since the loss period has already been reported. TCP also
enqueues a retransmission of the lost segment and calls
cm_request().
When the TCP retransmission timer expires, the sender identifies that
a segment has been lost and calls cm_update(streamid, avg_pkt_size,
0, CM_NO_FEEDBACK, 0) to signify that no feedback has been received
from the receiver and that one segment is sure to have "left the
pipe." TCP also enqueues a retransmission of the lost segment and
calls cm_request().
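
The input-side rules and the timeout rule above amount to a mapping
from acknowledgement events to cm_update() arguments. The Python
sketch below reproduces that mapping only; each function returns the
four arguments that follow streamid, in the order the text gives
them. The function names, the string stand-ins for the loss-mode
constants, and the use of integer division for avg_pkt_size are
assumptions made for illustration.

```python
from typing import Optional, Tuple

# Symbolic stand-ins for the CM loss-mode constants; the real values are
# defined by the CM API and are not reproduced here.
CM_NO_CONGESTION = "CM_NO_CONGESTION"
CM_LOSS_FEEDBACK = "CM_LOSS_FEEDBACK"
CM_NO_FEEDBACK = "CM_NO_FEEDBACK"

UpdateArgs = Tuple[int, int, str, int]  # arguments after streamid


def avg_pkt_size(tcp_ownd: int, pkt_cnt: int) -> int:
    # tcp_ownd/pkt_cnt; integer division and the zero guard are assumptions.
    return tcp_ownd // pkt_cnt if pkt_cnt else 0


def on_new_ack(tcp_ownd: int, in_flight: int, rtt: int) -> UpdateArgs:
    return (tcp_ownd - in_flight, 0, CM_NO_CONGESTION, rtt)


def on_dup_ack(dup_acks: int, avg: int, rtt: int) -> Optional[UpdateArgs]:
    if dup_acks < 3:
        return None  # TCP does nothing
    if dup_acks == 3:
        # One packet assumed lost, three assumed to have arrived.
        return (4 * avg, 3 * avg, CM_LOSS_FEEDBACK, rtt)
    # Each further duplicate ack implies one more packet arrived.
    return (avg, avg, CM_NO_CONGESTION, rtt)


def on_partial_ack(avg: int, rtt: int) -> UpdateArgs:
    # The loss period was already reported, hence CM_NO_CONGESTION.
    return (2 * avg, avg, CM_NO_CONGESTION, rtt)


def on_timeout(avg: int) -> UpdateArgs:
    # No feedback received; one segment has "left the pipe".
    return (avg, 0, CM_NO_FEEDBACK, 0)
```

In the dup_acks == 3, partial acknowledgement, and timeout cases, TCP
would also enqueue a retransmission and call cm_request(), which this
sketch leaves out.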
5.1.2 Congestion-controlled UDP
Congestion-controlled UDP is a useful CM application, which we
describe in the context of Berkeley sockets [Stevens94].
Congestion-controlled UDP sockets
provide the same functionality as standard Berkeley UDP sockets, but
instead of immediately sending the data from the kernel packet queue
to lower layers for transmission, the buffered socket implementation
makes calls to the API exported by the CM inside the kernel and gets
callbacks from the CM. When a CM UDP socket is created, it is bound
to a particular stream. Later, when data is added to the packet
queue, cm_request() is called on the stream associated with the
socket. When the CM schedules this stream for transmission, it calls
udp_ccappsend() in the UDP module. This function transmits one MTU
from the packet queue, and schedules the transmission of any
remaining packets. The in-kernel implementation of the CM UDP API
should not require any additional data copies and should support all
standard UDP options. Modifying existing applications to use
congestion-controlled UDP requires the implementation of a new socket
option on the socket. To work correctly, the sender must obtain
feedback about congestion. This can be done in at least two ways:
(i) the UDP receiver application can provide feedback to the sender
application, which will inform the CM of network conditions using
cm_update(); (ii) the UDP receiver implementation can provide
feedback to the sending UDP. Note that this latter alternative
requires changes to the receiver's network stack and the sender UDP
cannot assume that all receivers support this option without explicit
negotiation.
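
The buffered-socket behavior described above can be modeled in a few
lines. This Python sketch is a user-space illustration, not the
in-kernel implementation: the class CmUdpSocket, the injected
cm_request and transmit callables, and the pending-request flag are
hypothetical, and one queued datagram is treated as one MTU.

```python
from collections import deque
from typing import Callable, Deque, List


class CmUdpSocket:
    """Hypothetical congestion-controlled UDP socket bound to one stream."""

    def __init__(self, cm_request: Callable[[], None],
                 transmit: Callable[[bytes], None]) -> None:
        self._cm_request = cm_request   # stand-in for cm_request()
        self._transmit = transmit       # stand-in for the lower layers
        self._queue: Deque[bytes] = deque()
        self._pending = False           # assumed: one outstanding request

    def _request_if_needed(self) -> None:
        if self._queue and not self._pending:
            self._pending = True
            self._cm_request()

    def sendto(self, datagram: bytes) -> None:
        # Data is queued instead of being sent immediately.
        self._queue.append(datagram)
        self._request_if_needed()

    def udp_ccappsend(self) -> None:
        # Called when the CM schedules this stream: send one datagram,
        # then ask again if packets remain in the queue.
        self._pending = False
        if self._queue:
            self._transmit(self._queue.popleft())
        self._request_if_needed()


requests: List[str] = []
sent: List[bytes] = []
sock = CmUdpSocket(lambda: requests.append("req"), sent.append)
sock.sendto(b"a")
sock.sendto(b"b")
sock.udp_ccappsend()  # the CM grants the first request
sock.udp_ccappsend()  # the CM grants the follow-up request
print(sent, len(requests))  # prints [b'a', b'b'] 2
```

Tracking a single pending request is a design choice of this sketch;
it avoids issuing one cm_request() per queued datagram in addition to
the follow-up request made from the callback.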
5.1.3 Audio server
A typical audio application often has access to the sample in a
multitude of data rates and qualities. The objective of the
application is then to deliver the highest possible quality of audio
(typically the highest data rate) to its clients. The selection of
which version of audio to transmit should be based on the current
congestion state of the network. In addition, the source will want
audio delivered to its users at a consistent sampling rate. As a
result, it must send data at a regular rate, minimizing delayed
transmissions and reducing buffering before playback. To meet these
requirements, this application can use the synchronous sender API
(Section 3.2).
When the source first starts, it uses the cm_query() call to get an
initial estimate of network bandwidth and delay. If some other
streams on that macroflow have already been active, then it gets an
initial estimate that is valid; otherwise, it gets negative values,
which it ignores. It then chooses an encoding that does not exceed
these estimates (or, in the case of an invalid estimate, uses
application-specific initial values) and begins transmitting data.
The application also implements the cmapp_update() callback. When
the CM determines that network characteristics have changed, it calls
the application's cmapp_update() function and passes it a new rate
and round-trip time estimate. The application must change its choice
of audio encoding to ensure that it does not exceed these new
estimates.
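
The encoding choice described above can be sketched as follows. This
Python example is illustrative only: the names choose_encoding() and
AudioSource, the list of encoding rates, and the fallback to the
lowest rate when no encoding fits are assumptions, and start() takes
the rate that a cm_query() call is assumed to have returned.

```python
from typing import Sequence


def choose_encoding(rates_bps: Sequence[int], estimate_bps: int,
                    initial_bps: int) -> int:
    """Pick the highest encoding rate that does not exceed the estimate."""
    # A negative estimate is invalid and is ignored in favor of the
    # application-specific initial value.
    budget = initial_bps if estimate_bps < 0 else estimate_bps
    eligible = [r for r in rates_bps if r <= budget]
    # Assumed fallback: use the lowest rate when nothing fits the budget.
    return max(eligible) if eligible else min(rates_bps)


class AudioSource:
    """Hypothetical audio server stream."""

    def __init__(self, rates_bps: Sequence[int], initial_bps: int) -> None:
        self.rates_bps = tuple(rates_bps)
        self.initial_bps = initial_bps
        self.current_bps = 0

    def start(self, queried_rate_bps: int) -> None:
        # queried_rate_bps is the rate obtained from cm_query() at startup.
        self.current_bps = choose_encoding(
            self.rates_bps, queried_rate_bps, self.initial_bps)

    def cmapp_update(self, rate_bps: int, srtt_us: int) -> None:
        # Callback from the CM: re-select so the new rate is not exceeded.
        self.current_bps = choose_encoding(
            self.rates_bps, rate_bps, self.initial_bps)


src = AudioSource((16_000, 64_000, 128_000), initial_bps=64_000)
src.start(-1)                      # no valid estimate yet
print(src.current_bps)             # prints 64000
src.cmapp_update(200_000, 80_000)
print(src.current_bps)             # prints 128000
```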
5.2 Example congestion control module
To illustrate the responsibilities of a congestion control module,
the following describes some of the actions of a simple TCP-like
congestion control module that implements Additive Increase
Multiplicative Decrease congestion control (AIMD_CC):
- query(): AIMD_CC returns the current congestion window (cwnd)
divided by the smoothed rtt (srtt) as its bandwidth estimate. It
returns the smoothed rtt estimate as srtt.
- notify(): AIMD_CC adds the number of bytes sent to its
outstanding data window (ownd).
- update(): AIMD_CC subtracts nsent from ownd. If the value of rtt
is non-zero, AIMD_CC updates srtt using the TCP srtt calculation.
If the update indicates that data has been lost, AIMD_CC sets
cwnd to 1 MTU if the lossmode is CM_NO_FEEDBACK and to cwnd/2
(with a minimum of 1 MTU) if the lossmode is CM_LOSS_FEEDBACK or
CM_EXPLICIT_CONGESTION. AIMD_CC also sets its internal ssthresh
variable to cwnd/2. If no loss has occurred, AIMD_CC mimics TCP
slow start and linear growth modes. It increments cwnd by nsent
when cwnd < ssthresh (bounded by a maximum of ssthresh-cwnd) and
by nsent * MTU/cwnd when cwnd > ssthresh.
- When cwnd or ownd are updated and indicate that at least one MTU
may be transmitted, AIMD_CC calls the CM to schedule a
transmission.
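
The AIMD_CC actions above can be sketched in Python as follows. This
is one possible reading under stated assumptions, not a reference
implementation: the numeric loss-mode values, the initial cwnd and
ssthresh, detecting loss from lossmode, taking ssthresh from the
pre-loss cwnd, treating cwnd == ssthresh as linear growth, the 1/8
srtt gain (as in [Paxson00]), and returning a rate of 0 before the
first RTT sample are all choices made for this example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical numeric values for the loss-mode constants.
CM_NO_CONGESTION = 0
CM_NO_FEEDBACK = 1
CM_LOSS_FEEDBACK = 2
CM_EXPLICIT_CONGESTION = 3


class AimdCC:
    def __init__(self, mtu: int = 1500,
                 schedule: Optional[Callable[[int], None]] = None) -> None:
        self.mtu = mtu
        self.cwnd = mtu             # bytes; assumed initial window
        self.ssthresh = 64 * 1024   # bytes; assumed initial threshold
        self.ownd = 0               # bytes outstanding
        self.srtt = 0               # microseconds; 0 means no sample yet
        self.schedule = schedule or (lambda num_bytes: None)

    def query(self) -> Tuple[int, int]:
        # cwnd/srtt, converted from bytes per microsecond to bits/s.
        if self.srtt == 0:
            return (0, 0)
        return (self.cwnd * 8 * 1_000_000 // self.srtt, self.srtt)

    def notify(self, nsent: int) -> None:
        self.ownd += nsent

    def update(self, nsent: int, nrecd: int, rtt: int, lossmode: int) -> None:
        self.ownd -= nsent
        if rtt != 0:
            # First sample initializes srtt; later ones use a 1/8 gain.
            self.srtt = rtt if self.srtt == 0 else (7 * self.srtt + rtt) // 8
        if lossmode != CM_NO_CONGESTION:
            old_cwnd = self.cwnd
            if lossmode == CM_NO_FEEDBACK:
                self.cwnd = self.mtu
            else:
                self.cwnd = max(old_cwnd // 2, self.mtu)
            self.ssthresh = old_cwnd // 2
        elif self.cwnd < self.ssthresh:
            # Slow start, bounded so cwnd does not overshoot ssthresh.
            self.cwnd += min(nsent, self.ssthresh - self.cwnd)
        else:
            # Linear growth: about one MTU per window of data.
            self.cwnd += nsent * self.mtu // self.cwnd
        self._maybe_schedule()

    def _maybe_schedule(self) -> None:
        room = self.cwnd - self.ownd
        if room >= self.mtu:
            self.schedule(room)
```

For example, starting from a cwnd of one 1500-byte MTU, a notify(1500)
followed by update(1500, 1500, 100000, CM_NO_CONGESTION) leaves cwnd
at 3000 bytes and srtt at 100000 microseconds, so query() reports
3000 * 8 / 0.1 = 240000 bits per second.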
5.3 Example Scheduler Module
To clarify the responsibilities of a scheduler module, the following
describes some of the actions of a simple round robin scheduler
module (RR_sched):
- schedule(): RR_sched schedules as many streams as possible in round
robin fashion.
- query_share(): RR_sched returns 1/(number of streams in macroflow).
- notify(): RR_sched does nothing. Round robin scheduling is not
affected by the amount of data sent.
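
A round robin scheduler along these lines might look like the
following Python sketch. The class RRSched, the add_stream() helper,
and the per-stream callables standing in for cmapp_send() are
hypothetical; the sketch also assumes that each grant covers one MTU
and that every registered stream has data waiting, whereas a real
scheduler would track which streams have outstanding requests.

```python
from typing import Callable, Dict, List


class RRSched:
    def __init__(self, mtu: int = 1500) -> None:
        self.mtu = mtu
        self.streams: List[int] = []                  # rotation order
        self.senders: Dict[int, Callable[[], None]] = {}
        self._next = 0                                # index of next stream

    def add_stream(self, cm_streamid: int,
                   cmapp_send: Callable[[], None]) -> None:
        self.streams.append(cm_streamid)
        self.senders[cm_streamid] = cmapp_send

    def schedule(self, num_bytes: int) -> None:
        # Grant one MTU-sized transmission per stream, in rotation, for
        # as many whole MTUs as num_bytes allows.
        if not self.streams:
            return
        for _ in range(num_bytes // self.mtu):
            sid = self.streams[self._next]
            self._next = (self._next + 1) % len(self.streams)
            self.senders[sid]()

    def query_share(self, cm_streamid: int) -> float:
        # Every stream gets an equal share of the macroflow.
        return 1.0 / len(self.streams)

    def notify(self, cm_streamid: int, nsent: int) -> None:
        # Round robin is not affected by the amount of data sent.
        pass


sent: List[int] = []
sched = RRSched(mtu=1500)
for sid in (1, 2, 3):
    sched.add_stream(sid, lambda sid=sid: sent.append(sid))
sched.schedule(4500)
sched.schedule(3000)
print(sent)  # prints [1, 2, 3, 1, 2]
```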
6. Security Considerations
The CM provides many of the same services that the congestion control
in TCP provides. As such, it is vulnerable to many of the same
security problems. For example, incorrect reports of losses and
transmissions will give the CM an inaccurate picture of the network's
congestion state. By giving CM a high estimate of congestion, an
attacker can degrade the performance observed by applications. For
example, a stream on a host can arbitrarily slow down any other
stream on the same macroflow, a form of denial of service.
The more dangerous form of attack occurs when an application gives
the CM a low estimate of congestion. This would cause CM to be
overly aggressive and allow data to be sent much more quickly than
sound congestion control policies would allow.
[Touch97] describes a number of the security problems that arise with
congestion information sharing. An additional vulnerability (not
covered by [Touch97]) occurs because applications have access
through the CM API to control shared state that will affect other
applications on the same computer. For instance, a poorly designed,
possibly compromised, or intentionally malicious UDP application
could misuse cm_update() to cause starvation and/or too-aggressive
behavior of others in the macroflow.
7. References
[Allman99] Allman, M. and Paxson, V., "TCP Congestion
Control", RFC 2581, April 1999.
[Andersen00] Balakrishnan, H., "System Support for Bandwidth
Management and Content Adaptation in Internet
Applications," Proc. 4th Symp. on Operating Systems
Design and Implementation, San Diego, CA, October
2000. Available from
http://nms.lcs.mit.edu/papers/cm-osdi2000.html
[Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S.,
Stemm, M., and Katz, R., "TCP Behavior of a Busy
Web Server: Analysis and Improvements," Proc. IEEE
INFOCOM, San Francisco, CA, March 1998.
[Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An
Integrated Congestion Management Architecture for
Internet Hosts," Proc. ACM SIGCOMM, Cambridge, MA,
September 1999.
[Bradner96] Bradner, S., "The Internet Standards Process ---
Revision 3", BCP 9, RFC 2026, October 1996.
[Bradner97] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[Clark90] Clark, D. and Tennenhouse, D., "Architectural
Consideration for a New Generation of Protocols",
Proc. ACM SIGCOMM, Philadelphia, PA, September
1990.
[Eggert00] Eggert, L., Heidemann, J., and Touch, J., "Effects
of Ensemble TCP," ACM Computer Comm. Review,
January 2000.
[Floyd99a] Floyd, S. and Fall, K., "Promoting the Use of End-
to-End Congestion Control in the Internet,"
IEEE/ACM Trans. on Networking, 7(4), August 1999,
pp. 458-472.
[Floyd99b] Floyd, S. and Henderson, T., "The New Reno
Modification to TCP's Fast Recovery Algorithm," RFC
2582, April 1999.
[Jacobson88] Jacobson, V., "Congestion Avoidance and Control,"
Proc. ACM SIGCOMM, Stanford, CA, August 1988.
[Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly
Website,"
http://www.psc.edu/networking/tcp_friendly.html
[Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery," RFC
1191, November 1990.
[Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of Web
Data Transport," PhD thesis, Univ. of California,
Berkeley, December 1998.
[Paxson00] Paxson, V. and Allman, M., "Computing TCP's
Retransmission Timer", RFC 2988, November 2000.
[Postel81] Postel, J., Editor, "Transmission Control
Protocol", STD 7, RFC 793, September 1981.
[Ramakrishnan99] Ramakrishnan, K. and Floyd, S., "A Proposal to Add
Explicit Congestion Notification (ECN) to IP," RFC
2481, January 1999.
[Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1.
Addison-Wesley, Reading, MA, 1994.
[Touch97] Touch, J., "TCP Control Block Interdependence", RFC
2140, April 1997.
8. Acknowledgments
We thank David Andersen, Deepak Bansal, and Dorothy Curtis for their
work on the CM design and implementation. We thank Vern Paxson for
his detailed comments, feedback, and patience, and Sally Floyd, Mark
Handley, and Steven McCanne for useful feedback on the CM
architecture. Allison Mankin and Joe Touch provided several useful
comments on previous drafts of this document.
9. Authors' Addresses
Hari Balakrishnan
Laboratory for Computer Science
200 Technology Square
Massachusetts Institute of Technology
Cambridge, MA 02139
EMail: hari@lcs.mit.edu
Web: http://nms.lcs.mit.edu/~hari/
Srinivasan Seshan
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213
EMail: srini@cmu.edu
Web: http://www.cs.cmu.edu/~srini/
Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.