doc/rfc/rfc2887.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235

Network Working Group                                         M. Handley
Request for Comments: 2887                                      S. Floyd
Category: Informational                                            ACIRI
                                                              B. Whetten
                                                                Talarian
                                                              R. Kermode
                                                                Motorola
                                                             L. Vicisano
                                                                   Cisco
                                                                 M. Luby
                                                  Digital Fountain, Inc.
                                                             August 2000


       The Reliable Multicast Design Space for Bulk Data Transfer

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

Abstract

   The design space for reliable multicast is rich, with many possible
   solutions having been devised.  However, application requirements
   serve to constrain this design space to a relatively small solution
   space.  This document provides an overview of the design space and
   the ways in which application constraints affect possible solutions.

1.  Introduction

   The term "general purpose reliable multicast protocol" is something
   of an oxymoron.  Different applications have different requirements
   of a reliable multicast protocol, and these requirements constrain
   the design space in ways that two applications with differing
   requirements often cannot share a single solution.  There are however
   many successful reliable multicast protocol designs that serve more
   special purpose requirements well.

   In this document we attempt to review the design space for reliable
   multicast protocols intended for bulk data transfer.  The term bulk
   data transfer should be taken as having broad meaning - the main
   limitations are that the data stream is continuous and long lived -


Handley, et al.              Informational                      [Page 1]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   constraints necessary for the forms of congestion control we
   currently understand.  The purpose of this review is to gather
   together an overview of the field and to make explicit the
   constraints imposed by particular mechanisms. The aim is to provide
   guidance to the standardization process for protocols and protocol
   building blocks.  In doing this, we cluster potential solutions into
   a number of loose categories - real protocols may be composed of
   mechanisms from more than one of these clusters.

   The main constraint on solutions is imposed by the need to scale to
   large receiver sets.  For small receiver sets the design space is
   much less restricted.

2.  Application Constraints

   Application requirements for reliable multicast (RM) are as broad and
   varied as the applications themselves.  However, there are a set of
   requirements that significantly affect the design of an RM protocol.
   A brief list includes:

   o  Does the application need to know that everyone received the data?

   o  Does the application need to constrain differences between
      receivers?

   o  Does the application need to scale to large numbers of receivers?

   o  Does the application need to be totally reliable?

   o  Does the application need ordered data?

   o  Does the application need to provide low-delay delivery?

   o  Does the application need to provide time-bounded delivery?

   o  Does the application need many interacting senders?

   o  Is the application data flow intermittent?

   o  Does the application need to work in the public Internet?

   o  Does the application need to work without a return path (e.g.
      satellite)?

   o  Does the application need to provide secure delivery?


Handley, et al.              Informational                      [Page 2]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   In the context of standardizing bulk data transfer protocols, we can
   rule out applications with multiple interacting senders and
   intermittent data flows.  It is not that these applications are
   unimportant, but that we do not yet have effective congestion control
   for such applications.

2.1.  Did everyone receive the data?

   In many applications a logically defined unit or units of data is to
   be delivered to multiple clients, e.g., a file or a set of files, a
   software package, a stock quote or package of stock quotes, an event
   notification, a set of slides, a frame or block from a video.  An
   application data unit (ADU) is defined to be a logically separable
   unit of data that is useful to the application. In some cases, an
   application data unit may be short enough to fit into a single packet
   (e.g., an event notification or a stock quote), whereas in other
   cases an application data unit may be much longer than a packet
   (e.g., a software package).

   A protocol may optionally provide delivery confirmation to ensure
   reliable delivery, i.e., a mechanism for receivers to inform the
   sender when data has been delivered.  There are two types of
   confirmation, at the application data unit level and at the packet
   level. Application data unit confirmation is useful at the
   application level, e.g., to inform the application about receiver
   progress and to decide when to stop sending packets about a
   particular application data unit.  Packet confirmation is useful at
   the transport level, e.g., to inform the transport level when it can
   release buffer space being used for storing packets for which
   delivery has been confirmed.

   Some applications have a strong requirement for confirmation that all
   the receivers got an ADU, or if not, to be informed of which specific
   receivers failed to receive the entire ADU. Examples include
   applications where receivers pay for data, and reliable file-system
   replication.  Other applications do not have such a requirement.  An
   example is the distribution of free software.

   If the application does need to know that every receiver got the ADU,
   then a positive acknowledgment must be received from every receiver,
   although it may be possible to aggregate these acknowledgments.  If
   the application needs to know precisely which receivers failed to get
   the ADU, additional constraints are placed on acknowledgment
   aggregation.

   It should be noted that different mechanisms can be used for ADU-
   level confirmation and packet-level confirmation in the same
   application.  For example, an ADU-level confirmation mechanism using


Handley, et al.              Informational                      [Page 3]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   positive acknowledgments may sit on top of a packet-level NACK or
   FEC-based transport.  Typically this only makes sense when ADUs are
   significantly larger than a single packet.

2.2.  Constraining differences

   Some applications need to constrain differences between receivers so
   that the data reception characteristics for all receivers falls
   within some range.  An example is a stock price feed, where it is
   unacceptable for a receiver to suffer delivery that is delayed
   significantly more than any other receiver.

   This requirement is difficult to satisfy without harming performance.
   Typically solutions involve not sending more than a limited amount of
   new data until positive acknowledgments have been received from all
   the receivers.  Such a solution does not cope with network and end-
   system failures well.

2.3.  Receiver Set Scaling

   There are many applications for RM that do not need to scale to large
   numbers of receivers.  For such applications, a range of solutions
   may be available that are not available for applications where
   scaling to large receiver sets is a requirement.

   A protocol must achieve good throughput of application data units to
   receivers.  This means that most data that is delivered to receivers
   is useful in recovering the application data unit that they are
   trying to receive. A protocol must also provide good congestion
   control to fairly share the available network resources between all
   applications.  Receiver set scaling is one of the most important
   constraints in meeting these requirements, because it strictly limits
   the mechanisms that can be used to achieve these requirements to
   those that will efficiently scale to a large receiver population.
   Acknowledgement packets have been employed by many systems to achieve
   these goals, but it is important to understand the strength and
   limitations of different ways of using such packets.

   In a very small system, it may be acceptable to have the receivers
   acknowledge every packet.  This approach provides the sender with the
   maximum amount of information about reception conditions at all the
   receivers, information that can be used both to achieve good
   throughput and to achieve congestion control.


Handley, et al.              Informational                      [Page 4]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   For larger systems, such "flat ACK" schemes cause acknowledge
   implosions at the sender.  Attempts have been made to reduce this
   problem by sending aggregate ACKs infrequently [RMWT98, BC94], but it
   is very difficult to incorporate effective congestion control into
   such protocols because of the spareceness of feedback.

   Using negative acknowledgments (NACKs) instead of ACKs reduces this
   problem to one of NACK implosion (only from the receivers missing the
   packets), and because the sender really only needs to know that at
   least one receiver is missing data in order to achieve good
   throughput, various NACK suppression mechanisms can be applied.

   An alternative to NACKs is ACK aggregation, which can be done by
   arranging the receivers into a logical tree, so that each leaf sends
   ACKs to its parent which aggregates them, and passes them on up the
   tree.  Tree-based protocols scale well, but tree formation can be
   problematic.

   Other ACK topologies such as rings are also possible, but are often
   more difficult to form and maintain than trees are.  An alternative
   strategy is to add mechanisms to routers so that they can help out in
   achieving good throughput or in reducing the cost of achieving good
   throughput.

   All these solutions improve receiver set scaling, but they all have
   limits of one form or another.  One class of solutions scales to an
   infinite number of receivers by having no feedback channel whatsoever
   in order to achieve good throughput.  These open-loop solutions take
   the initial data and encode it using an FEC-style mechanism.  This
   encoded data is transmitted in a continuous stream.  Receivers then
   join the session and receive packets until they have sufficient
   packets to decode the original data, at which point they leave the
   session.

   Thus, it is clear that the intended scale of the session constrains
   the possible solutions.  All solutions will work for very small
   sessions, but as the intended receive set increases, the range of
   possible solutions that can be deployed safely decreases.

   It should also be noted that hybrids of these mechanisms are
   possible, and that using one mechanism at the packet-level and a
   different (typically higher overhead) solution at the ADU level may
   also scale reasonably if the ADUs are large compared to packets.


Handley, et al.              Informational                      [Page 5]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


2.4.  Total vs Semi-reliable

   Many applications require delivery of application data units to be
   totally reliable; if any of the application data unit is missing,
   none of the received portion of the application data unit is useful.
   File transfer applications are a good example of applications
   requiring total reliability.

   However, some applications do not need total reliability.  An example
   is audio broadcasting, where missing packets reduce the quality of
   the received audio but do not render it unusable.  Such applications
   can sometimes get by without any additional reliability over native
   IP reliability, but often having a semi-reliable multicast protocol
   is desirable.

2.5.  Time-bounded Delivery

   Many applications just require data to be delivered to the receivers
   as fast as possible.  They have no absolute deadline for delivery.

   However, some applications have hard delivery constraints - if the
   data does not arrive at the receiver by a certain time, there is no
   point in delivering it at all.  Such time-boundedness may be as a
   result of real-time constraints such as with audio or video
   streaming, or as the result of new data superseding old data.  In
   both cases, the requirement is for the application to have a greater
   degree of control over precisely what the application sends at which
   time than might be required with applications such as file transfer.

   Time-bounded delivery usually also implies a semi-reliable protocol,
   but the converse does not necessarily hold.

3.  Network Constraints

   The properties of the network in which the application is being
   deployed may themselves constrain the reliable multicast design
   space.

3.1.  Internet vs Intranet

   In principle the Internet and intranets are the same.  In practice
   however, the fact that an intranet is under one administration might
   allow for solutions to be configured that can not easily be done in
   the public Internet.  Thus, if the data is of very high value, it
   might be appropriate to enhance the routers to provide assistance to
   a reliable multicast transport protocol.  In the public Internet, it
   is less likely that the additional expense required to support this
   state in the routers would be acceptable.


Handley, et al.              Informational                      [Page 6]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


3.2.  Return Path

   In principle, when feedback is required from receivers, this feedback
   can be multicast or unicast.  Multicast feedback has advantages,
   especially in NACK-based protocols where it is valuable for NACK
   suppression.  However, it is not clear at this time whether all ISPs
   will allow all members of a session to send to that session.  If
   multicast feedback is not allowed, then unicast feedback can almost
   always be substituted, although often at the expense of additional
   messages and mechanisms.

   Some networks may not allow any form of feedback however.  The
   primary example of this occurs with satellite broadcasts where the
   back channel may be very narrow or even non-existent.  For such
   networks the solution space is very constrained - only FEC-based
   encodings have any real chance of working.  If the receivers are
   direct satellite receivers, then no congestion control is needed, but
   it is dangerous to make such assumptions because it is possible for a
   satellite hop to feed downstream networks.  Thus, congestion control
   still needs to be considered with solutions that do not have a return
   path.

3.3.  Network Assistance

   A reliable multicast protocol must involve mechanisms running in end
   hosts, and must involve routers forwarding multicast packets.
   However under some circumstances, it is possible to rely on some
   additional degree of assistance from network elements.  Broadly
   speaking we can cluster RM protocols into four classes depending on
   the degree of support received from other network elements.

   No Additional Support
      The routers merely forward packets, and only the sender and
      receivers have any reliable multicast protocol state.

   Layered Approaches
      Data is split across multiple multicast groups.  Receivers join
      appropriate groups to receive only the traffic they require.  This
      may in some cases require fast join or leave functionality from
      the routers, and may require more forwarding state in the routers.

   Server-based Approaches
      Additional nodes are used to assist with data delivery or feedback
      aggregation.  These additional nodes might not be normal senders
      or receivers, and may be present on the distribution or feedback
      tree only to provide assistance to the reliable multicast
      protocol.  They would not otherwise receive the multicast traffic.


Handley, et al.              Informational                      [Page 7]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   Router-based Approaches
      With router-based approaches, routers on the normal data
      distribution tree from the sender to the receivers assist in the
      delivery of data or feedback aggregation or suppression.  As
      routers can directly influence multicast routing, they have more
      control over which traffic goes to which group members than
      server-based approaches.  However routers do not normally have a
      large amount of spare memory or processing power, which restricts
      how much functionality can be placed in the routers.  In addition,
      router code is normally more difficult to upgrade than application
      code, so router-based approaches need to be very general as they
      are more difficult to deploy and to change.

4.  Good Throughput Mechanisms

   Two main concerns that a RM protocol must address are congestion
   control and good throughput.  Packet loss plays a major role with
   respect to both concerns.  The primary symptom of congestion in many
   networks is packet loss. The primary obstacle that must be overcome
   to achieve good throughput is packet loss.  Thus, measuring and
   reacting to packet loss is crucial to address both concerns. RM
   solutions that address these concerns can be roughly categorized as
   using one or more of the following techniques:

   o  Data packet acknowledgment.

   o  Negative acknowledgment of missing data packets.

   o  Redundancy allowing not all packets to be received.

   These techniques themselves can be usefully subdivided, so that we
   can examine the parts of the requirement space in which each
   mechanism can be deployed.  In this section, we focus on using these
   mechanisms for achieving good throughput, and in the next section we
   focus on using these mechanisms for congestion control.

4.1.  ACK-based Mechanisms

   The simplest ACK-based mechanism involves every receiver sending an
   ACK packet for every data packet it receives and resending packets
   that are lost by any receiver.  Such mechanisms are limited to very
   small receiver groups by the implosion of ACKs received at the
   sender, and for this reason they are impractical for most
   applications.


Handley, et al.              Informational                      [Page 8]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   Putting multiple ACKs into a single data packet [RMWT98] reduces the
   implosion problem by a constant amount, allowing slightly larger
   receiver groups.  However a limit is soon reached whereby feedback to
   the sender is too infrequent for sender-based congestion control
   mechanisms to work reliably.

   Arranging the receivers into a ring [WKM94] whereby an "ACK-token" is
   passed around the ring prevents the implosion problem for data.
   However ring creation and maintenance may itself be problematic.
   Also if ring creation does not take into account network topology
   (something which is difficult to achieve in practice), then the
   number of ACK packets crossing the network backbone for each data
   packet sent may increase O(n) with the number of receivers.

4.1.1.  Tree-based ACK Mechanisms

   Arranging the receivers into a tree [MWB+98, KCW98] whereby receivers
   generate ACKs to a parent node, which aggregates those ACKs to its
   parent in turn, is both more robust and more easily configured than a
   ring.  The ACK-tree is typically only used for ACK-aggregation - data
   packets are multicast from the sender to the receivers as normal.
   Trees are easier to construct than rings because more local
   information can be used in their construction.  Also they can be more
   fault tolerant than rings because node failures only affect a subset
   of receivers, each of which can easily and locally decide to by-pass
   its parent and report directly to the node one level higher in the
   tree.  With good ACK-tree formation, tree-based ACK mechanisms have
   the potential to be one of the most scalable RM solutions.

   To be simple to deploy, tree-based protocols must be self-organizing
   - the receivers must form the tree themselves using local information
   in a scalable manner.  Such mechanisms are possible, but are not
   trivial.  The main scaling limitations of tree-based protocols
   therefore come from the tree formation and maintenance mechanisms
   rather than from the use of ACKs.  Without such a scalable and
   automatic tree-formation mechanism, tree-based protocols must rely on
   manual configuration, which significantly limits their applicability
   (often to intranets) and (due to the complexity of configuration)
   their scalability.

   Orthogonal to the issue of tree formation is the issue of subtree
   retransmission.  With appropriate router mechanisms, or the use of
   multiple multicast groups, it is possible to allow the intermediate
   tree nodes to retransmit missing data to the nodes below them in the
   tree rather than relying on the original sender to retransmit the
   data.  This relies on there being a good correlation at the point of
   the intermediate node between the ACK tree and the actual data tree,
   as well as there being a mechanism to constrain the retransmission to


Handley, et al.              Informational                      [Page 9]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   the subtree.  A good automatic tree formation mechanism combined with
   the use of administrative scoped multicast groups might provide such
   a solution. Without such tree formation mechanisms, subtree
   retransmission is difficult to deploy in large groups in the public
   internet.       This could also be solved by the use of transport-
   level router mechanisms to assist or perform retransmission, although
   existing router mechanisms [FLST98] support NACK-based rather than
   ACK-based protocols.

   Another important issue is the nature of the aggregation performed at
   interior nodes on the ACK-tree.  Such nodes could:

   1. aggregate ACKs by sending a single ACK when all their children
      have ACKed,

   2. aggregate ACKs by listing all the children that have ACKed,

   3. send an aggregated ACK with a NACK-like exception list.

   For data packets, 1. is clearly more scalable, and should be
   preferred.  However if the sender needs to know exactly which
   receivers received the data, 2. and 3. provide this information.
   Fortunately, there is usually no need to do this on a per-packet
   basis, but rather on a per-ADU basis.  Doing 1. on a per packet
   basis, and 3. on a per ADU basis is the most scalable solution for
   applications that need this information, and suffers virtually no
   disadvantage compared to the other solutions used on a per-packet
   basis.

4.2.  NACK-based mechanisms

   Instead of sending an ACK for every data packet received, receivers
   can send a negative acknowledgment (NACK) for every data packet they
   discover they did not receive.  This has a number of advantages over
   ACK-based mechanisms:

   o  The sender no longer needs to know exactly how many receivers
      there are.  This removes the topology-building phase needed for
      ring- or tree-style ACK-based algorithms.

   o  Fault-tolerance is made somewhat simpler by making receivers
      responsible for reliability.

   o  Sender state can be significantly reduced because the sender does
      not need to keep track of the receivers state.


Handley, et al.              Informational                     [Page 10]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   o  Only a single NACK is needed from any receiver to indicate a
      packet that is missing by any number of receivers.  Thus NACK
      suppression is possible.

   The disadvantages are that it is more difficult for the sender to
   know that it can free transmission buffers, and that additional
   session level mechanisms are needed if the sender really needs to
   know if a particular receiver actually received all the data.
   However for many applications, neither of these is an issue.

4.2.1.  NACK Suppression

   The key differences between NACK-based protocols is in how NACK-
   suppression is performed.  The goal is for only one NACK to reach the
   sender (or a node that can resend the missing data) as soon as
   possible after the loss is first noticed, and for only one copy of
   the missing data to be received by those nodes needing
   retransmission.

   Different mechanisms come close to satisfying these goals in
   different ways.

   o  SRM [FJM95] uses random timers weighted by the round trip time
      between the sender and each node missing the data.  This is
      effective, but requires computing the RTT to each receiver before
      suppression works properly.

   o  NTE [HC97] uses a sender-triggered mechanism based on random keys
      and sliding masks.  This does not require random timers, and works
      for very large sessions, but makes it difficult to provide the
      constant low-level stream of feedback needed to perform congestion
      control.

   o  AAP [Ha99] uses exponentially distributed random timers and is
      effective for large sessions without needing to compute the RTT to
      each receiver.

   o  PGM [FLST98] and LMS [PPV98] use additional mechanisms in routers
      to suppress duplicate NACKs.  In the case of PGM, router
      assistance suppliments SRM-stype random timers and localizes the
      suppression so that the whole group does not need suppressing.

   The most general of these mechanisms is probably exponentially
   weighted random timers.  Although SRM style timers can reduce
   feedback delay, they are harder to use correctly in situations where
   all the RTTs are not known, or where the number of respondees is


Handley, et al.              Informational                     [Page 11]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   unknown.  In contrast, exponentially weighted random timers work well
   across a large range of session sizes with good worst case delay
   characteristics.

   Either form of random timer based mechanism can be supplemented by
   router-support where it is available.  Sender triggered NACK
   mechanisms (e.g. [HC97]) are more difficult to integrate with
   router-based support mechanisms.

4.3.  Replication

   Some RM protocols can be designed so as to not need explicit
   reliability mechanisms except in comparatively rare cases.  An
   example is in a multicast game, where the position of a moving object
   is continuously multicast.  This positional stream does not require
   additional reliability because a new position superseding the old one
   will be sent before any retransmission could take place.  However,
   when the moving object interacts with other objects or stops moving,
   then an explicit reliability mechanism is required to reliably send
   the interaction information or last position.

   It is not just games that can be built in this manner - the NTE
   shared text editor[HC97] uses just such a mechanism with changes to a
   line of text.  For every change the whole line is sent, and so long
   as the user keeps typing no explicit reliability mechanism is needed.
   The major advantage of replication is that it is not susceptible to
   spatially uncorrelated packet loss.  With a traditional ACK or NACK
   based protocol, the probability of any particular packet being
   received by all the receivers in a large group can be very low.  This
   leads to high retransmission rates.      In contrast, replicated
   streams do not suffer as the size of the receiver group increases -
   different receivers lose different packets, but this does not
   increase network traffic.

4.4.  Packet-level Forward Error Correction

   Forward Error Correction (FEC) is a well known technique for
   protecting data against corruption.  For reliable multicast it is
   most useful in the form of erasure codes.

   The simplest form of packet-level FEC is to take a group of packets
   that is to be sent, and to XOR the packets together to form a
   newpacket which is also sent.  If there were three original packets
   plus the XOR packet sent, then if a receiver is missing any one of
   the original data packets, but receives the XOR packet, then it can
   reproduce the missing original packet.


Handley, et al.              Informational                     [Page 12]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   More general erasure codes exist [BKKKLZ95], [Ri97], [LMSSS97] that
   allow the generation of n encoding packets from k original data
   packets.  In such cases, so long as at least k of the n encoding
   packets are received, then the k original data packets can be
   reproduced.

   To apply FEC the sender groups data packets into rounds, and encoding
   packets are produced based on all the data packets in a round. A
   round may consist of all data packets in an entire application data
   unit in some cases, whereas in other cases it may consist of a group
   of data packets that make up only a small portion of an application
   data unit.

   Using erasure codes to repair packet loss is a significant
   improvement over simple retransmission because the dependency on
   which packets have been lost is removed.  Thus, the amount of repair
   traffic required to repair spatially uncorrelated packet loss is
   considerably lessened.

   We can divide packet-level FEC schemes into two categories: proactive
   FEC and reactive FEC.  The difference between the two is that for
   proactive FEC the sender decides a priori how many encoding packets
   to send for each round of data packets, whereas for reactive FEC the
   sender initially transmits only the original data packets for each
   round.  Then, the sender uses feedback from the receivers to compute
   how many packets were lost by the receiver that experienced the most
   loss in each round, and then only that number of additional encoding
   packets are sent for that round.  These encoding packets will then
   also serve to repair loss at the other receivers that are missing
   fewer packets.  The receivers report via ACKs or NACKs how many
   packets are missing from each round. With NACKs, only the receiver
   missing the most packets need send a NACK for this round, so this is
   used to weight the random timers in the NACK calculation.

   Proactive and reactive FEC can be combined, e.g., a certain amount of
   proactive FEC can be sent for each round and if there are receivers
   that experience more loss than can be overcome by this for some
   rounds then they can request and receive additional encoding packets
   for these rounds.

   FEC is very effective at reducing the repair traffic for packet loss.
   However, it requires that the data to be sent to be grouped into
   rounds, which can add to end-to-end latency.  For bulk-data
   applications this is typically not a problem, but this may be an
   issue for interactive applications where replication may be a better
   solution.


Handley, et al.              Informational                     [Page 13]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


4.5.  Layered FEC

   An alternative use of packet level FEC is possible when data is
   spread across several multicast groups [RVC98], [BLMR98].  In such
   cases, the original k data packets are used to generate n encoding
   packets, where n is much larger than k.  The n encoded packets are
   then striped across multiple multicast groups.  When a receiver
   wishes to receive the original data it joins one or more of the
   multicast groups, and receives the encoding packets.  Once it has
   received k different encoding packets, the receiver can then leave
   all the multicast groups and reconstruct the original data.

   The primary importance of such a layering is that it allows different
   receivers to be able to receive the traffic at different rates
   according to the available capacity.  Such schemes do not require any
   form of feedback from the receivers to the sender to ensure good
   throughput, and therefore the need for good throughput does not
   constrain the size of the receiver set.  However, to perform adequate
   network congestion control using receiver joins and leaves in this
   manner may require coordination between members that are behind the
   same congested link from the sender.  As described in the next
   section, [RVC98] suggests such a layered congestion control scheme.

5.  Congestion Control Mechanisms

   The basic delivery model of the Internet is best-effort service.  No
   guarantees are given as to throughput, delay or packet loss.  End-
   systems are expected to be adaptive, and to reduce their transmission
   rate to a level appropriate for the congestion state of the network.
   Although increasingly the Internet will start to support reserved
   bandwidth and differentiated service classes for specialist
   applications, unless an end-system knows explicitly that it has
   reserved bandwidth, it must still perform congestion control.

   Broadly speaking, there are five classes of single-sender multicast
   congestion control solution:

   o  Sender-controlled, one group.

      A single multicast group is used for data distribution.  Feedback
      from the group members is used to control the rate of this group.
      The goal is to transmit at a rate dictated by the slowest
      receiver.


Handley, et al.              Informational                     [Page 14]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   o  Sender-controlled, multiple groups.

      One initial multicast group is adaptively subdivided into multiple
      subgroups with subdivisions centered on congestion points in the
      network.  Application-level relays buffer data from a group nearer
      the original sender, and retransmit it at a slower rate into a
      group further from the original sender.  In this way, different
      receivers can receiver the data at different rates.  Sender-based
      congestion control takes place between the members of a subgroup
      and their relay.

   o  Receiver-controlled, one group.

      A single multicast group is used for data distribution.  The
      receivers determine if the sender is transmitting too rapidly for
      the current congestion state of the network, and they leave the
      group if this is the case.

   o  Receiver-controlled, layered organization.

      A layered approach for how to combine this scheme with a
      congestion control protocol that requires no receiver feedback is
      described in [RVC98].  The sender stripes data across multiple
      multicast groups simultaneously.  Receivers join and leave these
      layered groups depending on their measurements of the congestion
      state of the network, so that the amount of data being received is
      always appropriate. However, this scheme relies on receivers to
      join and leave the different multicast groups in a coordinated
      fashion behind a bottleneck link, and it has not yet been
      completely confirmed that this approach will scale in practice to
      the Internet.  As a result, more work on this congestion control
      mechanism would be beneficial.

   o  Router-based congestion control.

      It is possible to add additional mechanisms to multicast routers
      to assist in multicast congestion control.  Such mechanisms could
      include:

      o  Conditional joins (a multicast join that specifies a loss rate
         above which it is acceptable for the router to reject the
         join).

      o  Router filtering of traffic that exceeds a reasonable rate.
         This may include mechanisms for filtering traffic at different
         points in the network at different rates depending on local
         congestion conditions [LVS99].


Handley, et al.              Informational                     [Page 15]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


      o  Fair queuing schemes combined with end-to-end adaptation.

      Router-based schemes generally require more state in network
      routers than has traditionally been acceptable for backbone
      routers.  Thus, in the near-term, such schemes are only likely to
      be applicable for intranet solutions.

   For reliable multicast protocols, it is important to consider
   congestion control at the same time as reliability is being
   considered.  The same mechanisms that are used to provide reliability
   will sometimes be used to provide congestion control.

   In the case of receiver-based congestion control, open-loop delivery
   using FEC is the likely choice for achieving good throughput for
   bulk- data transfer.  This is because open-loop delivery requires no
   feedback from receivers, and thus it is a perfect match with a
   receiver-based congestion-control mechanism that operates without
   feedback from receivers.

6.  Security Considerations

   Generally speaking, security considerations have relatively little
   effect on constraining the design space for reliable multicast
   protocols.  The primary issues constraining the design space are all
   related to receiver-set scaling.  For authentication of the source
   and of data integrity, receiver-set scaling is not a significant
   issue.  However, for data encryption, key distribution and
   particularly re-keying may be significantly affected by receiver-set
   scaling.  Tree and graph based re-keying solutions[WHA98,WGL97] would
   appear to be appropriate solutions to these problems.  It is not
   clear however that such re-keying solutions need to directly affect
   the design of the data distribution part of a reliable multicast
   protocol.

   The primary question to consider for the security of reliable
   multicast protocols is the role of third-parties.  If nodes other
   than the original source of the data are allowed to send or resend
   data packets, then the security model for the protocol must take this
   into account.  In particular, it must be clear whether such third
   parties are trusted or untrusted.  A requirement for trusted third
   parties can make protocols difficult to deploy on the Internet.

   Untrusted third parties (such as receivers that retransmit the data)
   may be used so long as the data authentication mechanisms take this
   into account.  Typically this means that the original sender
   digitally signs and timestamps the data, and that the third parties
   resend this signed timestamped payload unmodified.


Handley, et al.              Informational                     [Page 16]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   Unlike unicast protocols, denial-of-service attacks on multicast
   transport state are easy if the protocol design does not take such
   attacks into account.  This is because any receiver can join the
   session, and can then produce feedback that influences the progress
   of a session involving many other receivers.  Hence protection
   against denial-of-service attacks on reliable multicast protocols
   must be carefully considered.  A receiver that requests
   retransmission of every packet, or that refuses to acknowledge
   packets in an ACK-based protocol can potentially bring a reliable
   multicast session to a standstill.  Senders must have appropriate
   policy to deal with such conditions, and if necessary, evict the
   receiver from the group.  A single receiver masquerading as a large
   number of receivers may still be an issue under such circumstances
   with protocols that support NACK-like functionality.  Providing
   unique "keys" to each NACKer when they first NACK using a unicast
   response might potentially prevent such attacks.

   Denial-of-service attacks caused by traffic flooding are however
   somewhat easier to protect against than with unicast.  Unwanted
   senders can simply be pruned from the distribution tree using the
   mechanisms implemented in IGMP v3[CDT99].

7.  Conclusions

   In this document we present an overview of the design space for
   reliable multicast within the context of one-to-many bulk-data
   transfer. Other flavors of multicast application are not considered
   in this document, and hence the overview given should not be
   considered inclusive of the design space for protocols that fall
   outside the context of one-to-many bulk-data transfer. During the
   course of this overview, we have reaffirmed the notion that the
   process of reliable multicast protocol design is affected by a number
   of factors that render the generation of a "one size fits all
   solution" moot. These factors are then described to show how an
   application's needs serve to constrain the set of available
   techniques that may be used to create a reliable multicast protocol.
   We examined a number of basic techniques and to show how well they
   can meet the needs of certain types of applications.

   This document is intended to provide guidance to the IETF community
   regarding the standardization of reliable multicast protocols for
   bulk-data transfer. Given the degree to which application
   requirements constrain reliable multicast solutions, and the diverse
   set of applications that need to be supported, it should be clear
   that any standardization work should take great pains to be future-
   proof.  This would seem to imply not standardizing complete reliable
   multicast transport protocols in one pass, but rather examining the
   degree to which such protocols are separable into functional building


Handley, et al.              Informational                     [Page 17]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   blocks, and standardizing these blocks separately to the maximum
   degree that makes sense.  Such an approach allows for protocol
   evolution, and allows applications with new constraints to be
   supported with maximal reuse of existing and tested mechanisms.

8.  Acknowledgments

   This document represents an overview of the reliable multicast design
   space.  The ideas presented are not those of the authors, but are
   collected from the varied presentations and discussions in the IRTF
   Reliable Multicast Research Group.  Although they are too numerous to
   list here, we thank everyone who has participated in these
   discussions for their contributions.

9.  Authors' Addresses

   Mark Handley
   ATT Center for Internet Research at ICSI,
   International Computer Science Institute,
   1947 Center Street, Suite 600,
   Berkeley, CA 94704, USA

   EMail: mjh@aciri.org


   Sally Floyd
   ATT Center for Internet Research at ICSI,
   International Computer Science Institute,
   1947 Center Street, Suite 600,
   Berkeley, CA 94704, USA

   EMail: floyd@aciri.org


   Brian Whetten
   Talarian Corporation,
   333 Distel Circle,
   Los Altos, CA 94022, USA

   EMail: whetten@talarian.com


Handley, et al.              Informational                     [Page 18]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   Roger Kermode
   Motorola Australian Research Centre
   Level 3, 12 Lord St,
   Botany  NSW  2019,
   Australia

   EMail: Roger.Kermode@motorola.com


   Lorenzo Vicisano
   Cisco Systems,
   170 West Tasman Dr.
   San Jose, CA 95134, USA

   EMail: lorenzo@cisco.com


   Michael Luby
   Digital Fountain, Inc.
   600 Alabama Street
   San Francisco, CA  94110

   EMail: luby@digitalfountain.com

10.  References

   [BC94]     K. Birman, T. Clark.  "Performance of the Isis Distributed
              Computing Toolkit." Technical Report TR-94-1432, Dept. of
              Computer Science, Cornell University.

   [BKKKLZ95] J. Bloemer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, D.
              Zuckerman, "An XOR-based Erasure Resilient Coding Scheme",
              ICSI Technical Report No. TR-95-048, August 1995.

   [BLMR98]   J. Byers, M. Luby, M. Mitzenmacher, A. Rege, "A Digital
              Fountain Approach to Reliable Distribution of Bulk Data",
              Proc ACM SIGCOMM 98.

   [CDT99]    Cain, B., Deering, S., and A. Thyagarajan, "Internet Group
              Management Protocol, Version 3", Work in Progress.

   [FLST98]   Farinacci, D., Lin, S., Speakman, T. and A. Tweedly, "PGM
              reliable transport protocol specification", Work in
              Progress.

   [FJM95]    S. Floyd, V. Jacobson, S. McCanne, "A Reliable Multicast
              Framework for Light-weight Sessions and Application Level
              Framing", Proc ACM SIGCOMM 95, Aug 1995 pp. 342-356.


Handley, et al.              Informational                     [Page 19]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   [Ha99]     Handley, M., "Multicast address allocation protocol
              (AAP)", Work in Progress.

   [HC97]     M. Handley and J. Crowcroft, "Network text editor (NTE) a
              scalable shared text editor for MBone," ACM Computer
              Communication Review, vol. 27, pp. 197-208, Oct. 1997. ACM
              SIGCOMM'97, Sept. 1997.

   [KCW98]    Kadansky, M., Chiu, D. and J. Wesley, "Tree-based reliable
              multicast (TRAM)", Work in Progress.

   [LMSSS97]  M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, V.
              Stemann, "Practical Loss-Resilient Codes", Proc ACM
              Symposium on Theory of Computing, 1997.

   [MWB+98]   Montgomery, T., Whetten, B., Basavaiah, M., Paul, S.,
              Rastogi, N., Conlan, J. and T. Yeh, "THE RMTP-II
              PROTOCOL", Work in Progress.

   [PPV98]    C. Papadopoulos, G. Parulkar, and G. Varghese, "An error
              control scheme for large-scale multicast applications," in
              Proceedings of the Conference on Computer Communications
              (IEEE Infocom), (San Francisco, California), p. 1188,
              March/April 1998.

   [Ri97]     L. Rizzo, "Effective erasure codes for reliable computer
              communication protocols," ACM Computer Communication
              Review, vol.  27, pp. 24-36, Apr. 1997.

   [RV97]     L. Rizzo, L. Vicisano, "A Reliable Multicast data
              Distribution Protocol based on software FEC techniques",
              Proc. of The Fourth IEEE Workshop on the Architecture and
              Implementation of High Performance Communication Systems
              (HPCS'97), Sani Beach, Chalkidiki, Greece June 23-25,
              1997.

   [RVC98]    L. Rizzo, L. Vicisano, J. Crowcroft, "The RLC multicast
              congestion control algorithm", submitted to IEEE Network -
              special issue multicast.

   [RMWT98]   Robertson, K., Miller, K., White, M. and A. Tweedly,
              "StarBurst multicast file transfer protocol (MFTP)
              specification", Work in Progress.

   [WHA98]    Wallner, D., Hardler, E. and R. Agee, "Key Management for
              Multicast: Issues and Architectures", RFC 2627, June 1999.


Handley, et al.              Informational                     [Page 20]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


   [WKM94]    Brian Whetten, Simon Kaplan, and Todd Montgomery, "A high
              performance totally ordered multicast protocol," research
              memorandum, Aug. 1994.

   [WGL97]    C.K. Wong, M. Gouda, S. Lam, "Secure Group Communications
              Using Key Graphs," Technical Report TR 97-23, Department
              of Computer Sciences, The University of Texas at Austin,
              July 1997.


Handley, et al.              Informational                     [Page 21]
^L
RFC 2887     Multicast Design Space for Bulk Data Transfer   August 2000


11.  Full Copyright Statement

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.


Handley, et al.              Informational                     [Page 22]
^L