Internet Engineering Task Force (IETF)                 M. Duckworth, Ed.
Request for Comments: 8845
Category: Standards Track                                   A. Pepperell
ISSN: 2070-1721                                                    Acano
                                                               S. Wenger
                                                                 Tencent
                                                            January 2021
Framework for Telepresence Multi-Streams
Abstract
This document defines a framework for a protocol to enable devices in
a telepresence conference to interoperate. The protocol enables
communication of information about multiple media streams so a
sending system and receiving system can make reasonable decisions
about transmitting, selecting, and rendering the media streams. This
protocol is used in addition to SIP signaling and Session Description
Protocol (SDP) negotiation for setting up a telepresence session.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 7841.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc8845.
Copyright Notice
Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Requirements Language
3. Definitions
4. Overview and Motivation
5. Description of the Framework/Model
6. Spatial Relationships
7. Media Captures and Capture Scenes
7.1. Media Captures
7.1.1. Media Capture Attributes
7.2. Multiple Content Capture
7.2.1. MCC Attributes
7.3. Capture Scene
7.3.1. Capture Scene Attributes
7.3.2. Capture Scene View Attributes
7.4. Global View List
8. Simultaneous Transmission Set Constraints
9. Encodings
9.1. Individual Encodings
9.2. Encoding Group
9.3. Associating Captures with Encoding Groups
10. Consumer's Choice of Streams to Receive from the Provider
10.1. Local Preference
10.2. Physical Simultaneity Restrictions
10.3. Encoding and Encoding Group Limits
11. Extensibility
12. Examples - Using the Framework (Informative)
12.1. Provider Behavior
12.1.1. Three-Screen Endpoint Provider
12.1.2. Encoding Group Example
12.1.3. The MCU Case
12.2. Media Consumer Behavior
12.2.1. One-Screen Media Consumer
12.2.2. Two-Screen Media Consumer Configuring the Example
12.2.3. Three-Screen Media Consumer Configuring the Example
12.3. Multipoint Conference Utilizing Multiple Content Captures
12.3.1. Single Media Captures and MCC in the Same
Advertisement
12.3.2. Several MCCs in the Same Advertisement
12.3.3. Heterogeneous Conference with Switching and
Composition
12.3.4. Heterogeneous Conference with Voice-Activated
Switching
13. IANA Considerations
14. Security Considerations
15. References
15.1. Normative References
15.2. Informative References
Acknowledgements
Authors' Addresses
1. Introduction
Current telepresence systems, though based on open standards such as
RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each
other. A major factor limiting the interoperability of telepresence
systems is the lack of a standardized way to describe and negotiate
the use of multiple audio and video streams comprising the media
flows. This document provides a framework for protocols to enable
interoperability by handling multiple streams in a standardized way.
The framework is intended to support the use cases described in "Use
Cases for Telepresence Multistreams" [RFC7205] and to meet the
requirements in "Requirements for Telepresence Multistreams"
[RFC7262]. This includes cases using multiple media streams that are
not necessarily telepresence.
The basic session setup for the use cases is based on SIP [RFC3261]
and SDP offer/answer [RFC3264]. In addition to basic SIP and SDP
offer/answer, signaling that is ControLling mUltiple streams for
tElepresence (CLUE) specific is required to exchange the information
describing the multiple Media Streams. The motivation for this
framework, an overview of the signaling, and the information required
to be exchanged are described in subsequent sections of this
document. Companion documents describe the signaling details
[RFC8848], the data model [RFC8846], and the protocol [RFC8847].
2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Definitions
The terms defined below are used throughout this document and in
companion documents. Capitalization is used in order to easily
identify a defined term.
Advertisement: A CLUE message a Media Provider sends to a Media
Consumer describing specific aspects of the content of the Media
and any restrictions it has in terms of being able to provide
certain Streams simultaneously.
Audio Capture (AC): Media Capture for audio. Denoted as "ACn" in
the examples in this document.
Capture: Same as Media Capture.
Capture Device: A device that converts physical input, such as
audio, video, or text, into an electrical signal, in most cases to
be fed into a Media encoder.
Capture Encoding: A specific Encoding of a Media Capture, to be sent
by a Media Provider to a Media Consumer via RTP.
Capture Scene: A structure representing a spatial region captured by
one or more Capture Devices, each capturing Media representing a
portion of the region. The spatial region represented by a
Capture Scene may correspond to a real region in physical space,
such as a room. A Capture Scene includes attributes and one or
more Capture Scene Views, with each view including one or more
Media Captures.
Capture Scene View (CSV): A list of Media Captures of the same Media
type that together form one way to represent the entire Capture
Scene.
CLUE: CLUE is an acronym for "ControLling mUltiple streams for
tElepresence", which is the name of the IETF working group in
which this document and certain companion documents have been
developed. Often, CLUE-* refers to something that has been
designed by the CLUE working group; for example, this document may
be called the CLUE-framework document herein and elsewhere.
CLUE-capable device: A device that supports the CLUE data channel
[RFC8850], the CLUE protocol [RFC8847], and the principles of CLUE
negotiation; it also seeks CLUE-enabled calls.
CLUE-enabled call: A call in which two CLUE-capable devices have
successfully negotiated support for a CLUE data channel in SDP
[RFC4566]. A CLUE-enabled call is not necessarily immediately
able to send CLUE-controlled Media; negotiation of the data
channel and of the CLUE protocol must complete first. Calls
between two CLUE-capable devices that have not yet successfully
completed negotiation of support for the CLUE data channel in SDP
are not considered CLUE-enabled.
Conference: Used as defined in "A Framework for Conferencing within
the Session Initiation Protocol (SIP)" [RFC4353].
Configure Message: A CLUE message a Media Consumer sends to a Media
Provider specifying which content and Media Streams it wants to
receive, based on the information in a corresponding Advertisement
message.
Consumer: Short for Media Consumer.
Encoding: Short for Individual Encoding.
Encoding Group: A set of Encoding parameters representing a total
Media Encoding capability to be subdivided across potentially
multiple Individual Encodings.
Endpoint: A CLUE-capable device that is the logical point of final
termination through receiving, decoding, and Rendering, and/or
initiation through capturing, encoding, and sending of Media
Streams. An Endpoint consists of one or more physical devices
that source and sink Media Streams, and exactly one [RFC4353]
Participant (which, in turn, includes exactly one SIP User Agent).
Endpoints can be anything from multiscreen/multicamera rooms to
handheld devices.
Global View: A set of references to one or more CSVs of the same
Media type that are defined within Scenes of the same
Advertisement. A Global View is a suggestion from the Provider to
the Consumer for one set of CSVs that provide a useful
representation of all the Scenes in the Advertisement.
Global View List: A list of Global Views included in an
Advertisement. A Global View List may include Global Views of
different Media types.
Individual Encoding: A set of parameters representing a way to
encode a Media Capture to become a Capture Encoding.
Multipoint Control Unit (MCU): A CLUE-capable device that connects
two or more Endpoints into one single multimedia Conference
[RFC7667]. An MCU includes a Mixer like that described in
[RFC4353], without the requirement of [RFC4353] to send Media to
each participant.
Media: Any data that, after suitable encoding, can be conveyed over
RTP, including audio, video, or timed text.
Media Capture (MC): A source of Media, such as from one or more
Capture Devices or constructed from other Media Streams.
Media Consumer: A CLUE-capable device that intends to receive
Capture Encodings.
Media Provider: A CLUE-capable device that intends to send Capture
Encodings.
Multiple Content Capture (MCC): A Capture that mixes and/or switches
other Captures of a single type (for example, all audio or all
video). Particular Media Captures may or may not be present in
the resultant Capture Encoding, depending on time or space.
Denoted as "MCCn" in the example cases in this document.
Plane of Interest: The spatial plane within a Scene containing the
most-relevant subject matter.
Provider: Same as a Media Provider.
Render: The process of generating a representation from Media, such
as displayed motion video or sound emitted from loudspeakers.
Scene: Same as a Capture Scene.
Simultaneous Transmission Set: A set of Media Captures that can be
transmitted simultaneously from a Media Provider.
Single Media Capture: A Capture that contains Media from a single
source Capture Device, e.g., an Audio Capture from a single
microphone or a Video Capture from a single camera.
Spatial Relation: The arrangement of two objects in space, in
contrast to relation in time or other relationships.
Stream: A Capture Encoding sent from a Media Provider to a Media
Consumer via RTP [RFC3550].
Stream Characteristics: The Media Stream attributes commonly used in
non-CLUE SIP/SDP environments (such as Media codec, bitrate,
resolution, profile/level, etc.) as well as CLUE-specific
attributes, such as the Capture ID or a spatial location.
Video Capture (VC): Media Capture for video. Denoted as "VCn" in the
example cases in this document.
Video Composite: A single image that is formed, normally by an RTP
mixer inside an MCU, by combining visual elements from separate
sources.
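The containment relationships among the terms above (a Capture Scene
holds one or more Capture Scene Views, and each view holds one or more
Media Captures of a single Media type) can be pictured with a small
sketch. This is an illustration only, not the CLUE data model of
[RFC8846]; the class names, field names, and the three-camera room are
assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MediaCapture:
    capture_id: str   # hypothetical ID in the ACn/VCn notation, e.g. "VC0"
    media_type: str   # "audio", "video", or "text"


@dataclass
class CaptureSceneView:
    """One way to represent the entire Capture Scene with one Media type."""
    captures: List[MediaCapture]

    def __post_init__(self) -> None:
        if not self.captures:
            raise ValueError("a view needs at least one Media Capture")
        if len({c.media_type for c in self.captures}) != 1:
            raise ValueError("all Media Captures in a view must share one Media type")


@dataclass
class CaptureScene:
    """A spatial region, such as a room, offered as one or more views."""
    views: List[CaptureSceneView]

    def __post_init__(self) -> None:
        if not self.views:
            raise ValueError("a Capture Scene needs at least one view")


# Hypothetical room: a three-camera view, a single wide-shot view, and audio.
room = CaptureScene(views=[
    CaptureSceneView([MediaCapture("VC0", "video"),
                      MediaCapture("VC1", "video"),
                      MediaCapture("VC2", "video")]),
    CaptureSceneView([MediaCapture("VC3", "video")]),
    CaptureSceneView([MediaCapture("AC0", "audio")]),
])
```

The two video views here are alternatives: each is meant to cover the
whole room on its own, so a Consumer would normally pick one of them
rather than both.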
4. Overview and Motivation
This section provides an overview of the functional elements defined
in this document to represent a telepresence or multistream system.
The motivations for the framework described in this document are also
provided.
Two key concepts introduced in this document are the terms "Media
Provider" and "Media Consumer". A Media Provider represents the
entity that sends the Media and a Media Consumer represents the
entity that receives the Media. A Media Provider provides Media in
the form of RTP packets; a Media Consumer consumes those RTP packets.
Media Providers and Media Consumers can reside in Endpoints or in
Multipoint Control Units (MCUs). A Media Provider in an Endpoint is
usually associated with the generation of Media for Media Captures;
these Media Captures are typically sourced from cameras, microphones,
and the like. Similarly, the Media Consumer in an Endpoint is
usually associated with renderers, such as screens and loudspeakers.
In MCUs, Media Providers and Consumers can have the form of outputs
and inputs, respectively, of RTP mixers, RTP translators, and similar
devices. Typically, telepresence devices, such as Endpoints and
MCUs, would perform as both Media Providers and Media Consumers, the
former being concerned with those devices' transmitted Media and the
latter with those devices' received Media. In a few circumstances, a
CLUE-capable device includes only Consumer or Provider functionality,
such as recorder-type Consumers or webcam-type Providers.
The motivations for the framework outlined in this document include
the following:
(1) Endpoints in telepresence systems typically have multiple Media
Capture and Media Render devices, e.g., multiple cameras and
screens. While previous system designs were able to set up
calls that would capture Media using all cameras and display
Media on all screens, for example, there was no mechanism that
could associate these Media Captures with each other in space
and time, in a cross-vendor interoperable way.
(2) The mere fact that there are multiple Media Capture and Media
Render devices, each of which may be configurable in aspects
such as zoom, leads to the difficulty that a variable number of
such devices can be used to capture different aspects of a
region. The Capture Scene concept allows for the description of
multiple setups for those multiple Media Capture devices that
could represent sensible operation points of the physical
Capture Devices in a room, chosen by the operator. A Consumer
can pick and choose from those configurations based on its
rendering abilities and then inform the Provider about its
choices. Details are provided in Section 7.
(3) In some cases, physical limitations or other reasons disallow
the concurrent use of a device in more than one setup. For
example, the center camera in a typical three-camera conference
room can set its zoom objective to capture either the middle few
seats only or all seats of a room, but not both concurrently.
The Simultaneous Transmission Set concept allows a Provider to
signal such limitations. Simultaneous Transmission Sets are
part of the Capture Scene description and are discussed in
Section 8.
(4) Often, the devices in a room do not have the computational
capacity or connectivity to deal with multiple Encoding
options simultaneously, even if each of these options is
sensible in certain scenarios, and even if the simultaneous
transmission is also sensible (i.e., in case of multicast Media
distribution to multiple Endpoints). Such constraints can be
expressed by the Provider using the Encoding Group concept,
which is described in Section 9.
(5) Due to the potentially large number of RTP Streams required for
a Multimedia Conference involving potentially many Endpoints,
each of which can have many Media Captures and Media renderers,
it has become common to multiplex multiple RTP Streams onto the
same transport address, so as to avoid using the port number as
a multiplexing point and the associated shortcomings such as
NAT/firewall traversal. The large number of possible
permutations of sensible options a Media Provider can make
available to a Media Consumer makes it desirable to have a
mechanism that narrows down the number of options that a
SIP offer/answer exchange has to consider. Such information is
made available using protocol mechanisms specified in this
document and companion documents. The Media Provider and Media
Consumer may use information in CLUE messages to reduce the
complexity of SIP offer/answer messages. Also, there are
aspects of the control of both Endpoints and MCUs that
dynamically change during the progress of a call, such as audio-
level-based screen switching, layout changes, and so on, which
need to be conveyed. Note that these control aspects are
complementary to those specified in traditional SIP-based
conference management, such as Binary Floor Control Protocol
(BFCP). An exemplary call flow can be found in Section 5.
Finally, all this information needs to be conveyed, and the notion of
support for it needs to be established. This is done by the
negotiation of a "CLUE channel", a data channel negotiated early
during the initiation of a call. An Endpoint or MCU that rejects the
establishment of this data channel, by definition, does not support
CLUE-based mechanisms, whereas an Endpoint or MCU that accepts it is
indicating support for CLUE as specified in this document and its
companion documents.
5. Description of the Framework/Model
The CLUE framework specifies how multiple Media Streams are to be
handled in a telepresence Conference.
A Media Provider (transmitting Endpoint or MCU) describes specific
aspects of the content of the Media and the Media Stream Encodings it
can send in an Advertisement; and the Media Consumer responds to the
Media Provider by specifying which content and Media Streams it wants
to receive in a Configure message. The Provider then transmits the
asked-for content in the specified Streams.
This Advertisement and Configure exchange typically occurs during call
initiation, after CLUE has been enabled in a call, but it MAY also
happen at any time throughout the call, whenever there is a change in
what the Consumer wants to receive or (perhaps less commonly) what the
Provider can send.
An Endpoint or MCU typically acts as both Provider and Consumer at
the same time, sending Advertisements and sending Configurations in
response to receiving Advertisements. (It is possible to be just one
or the other.)
The data model [RFC8846] is based around two main concepts: a Capture
and an Encoding. A Media Capture, such as of type audio or video,
has attributes to describe the content a Provider can send. Media
Captures are described in terms of CLUE-defined attributes, such as
Spatial Relationships and purpose of the Capture. Providers tell
Consumers which Media Captures they can provide, described in terms
of the Media Capture attributes.
A Provider organizes its Media Captures into one or more Capture
Scenes, each representing a spatial region, such as a room. A
Consumer chooses which Media Captures it wants to receive from the
Capture Scenes.
In addition, the Provider can send the Consumer a description of the
Individual Encodings it can send in terms of identifiers that relate
to items in SDP [RFC4566].
The Provider can also specify constraints on its ability to provide
Media, and a sensible design choice for a Consumer is to take these
into account when choosing the content and Capture Encodings it
requests in the later offer/answer exchange. Some constraints are
due to the physical limitations of a device; for example, a camera may
not be able to provide zoom and non-zoom views simultaneously. Other
constraints are system based, such as maximum bandwidth.
The following diagram illustrates the information contained in an
Advertisement.
...................................................................
. Provider Advertisement +--------------------+ .
. | Simultaneous Sets | .
. +------------------------+ +--------------------+ .
. | Capture Scene N | +--------------------+ .
. +-+----------------------+ | | Global View List | .
. | Capture Scene 2 | | +--------------------+ .
. +-+----------------------+ | | +----------------------+ .
. | Capture Scene 1 | | | | Encoding Group N | .
. | +---------------+ | | | +-+--------------------+ | .
. | | Attributes | | | | | Encoding Group 2 | | .
. | +---------------+ | | | +-+--------------------+ | | .
. | | | | | Encoding Group 1 | | | .
. | +----------------+ | | | | parameters | | | .
. | | V i e w s | | | | | bandwidth | | | .
. | | +---------+ | | | | | +-------------------+| | | .
. | | |Attribute| | | | | | | V i d e o || | | .
. | | +---------+ | | | | | | E n c o d i n g s || | | .
. | | | | | | | | Encoding 1 || | | .
. | | View 1 | | | | | | || | | .
. | | (list of MCs) | | |-+ | +-------------------+| | | .
. | +----|-|--|------+ |-+ | | | | .
. +---------|-|--|---------+ | +-------------------+| | | .
. | | | | | A u d i o || | | .
. | | | | | E n c o d i n g s || | | .
. v | | | | Encoding 1 || | | .
. +---------|--|--------+ | | || | | .
. | Media Capture N |------>| +-------------------+| | | .
. +-+---------v--|------+ | | | | | .
. | Media Capture 2 | | | | |-+ .
. +-+--------------v----+ |-------->| | | .
. | Media Capture 1 | | | | |-+ .
. | +----------------+ |---------->| | .
. | | Attributes | | |_+ +----------------------+ .
. | +----------------+ |_+ .
. +---------------------+ .
. .
...................................................................
Figure 1: Advertisement Structure
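The containment relationships in Figure 1 can be sketched informally
as plain data structures. The following Python fragment is only an
illustration: the class names, field names, and the choose_captures
helper are assumptions made for this sketch and are not defined by
this document or by the data model in [RFC8846].

```python
from dataclasses import dataclass, field


@dataclass
class MediaCapture:
    capture_id: str   # e.g., "VC1" or "AC1"
    media_type: str   # "audio" or "video"
    attributes: dict = field(default_factory=dict)


@dataclass
class CaptureScene:
    scene_id: str
    # Each view is a list of Media Capture identities.
    views: list = field(default_factory=list)


@dataclass
class Advertisement:
    captures: dict    # capture_id -> MediaCapture
    scenes: list      # list of CaptureScene
    simultaneous_sets: list = field(default_factory=list)
    encoding_groups: dict = field(default_factory=dict)


def choose_captures(adv, wanted_type):
    """Return the identities of advertised Captures of one Media type."""
    return sorted(cid for cid, mc in adv.captures.items()
                  if mc.media_type == wanted_type)


# Hypothetical Advertisement with two Video Captures and one Audio Capture.
adv = Advertisement(
    captures={
        "VC1": MediaCapture("VC1", "video"),
        "VC2": MediaCapture("VC2", "video"),
        "AC1": MediaCapture("AC1", "audio"),
    },
    scenes=[CaptureScene("scene1", views=[["VC1", "VC2"], ["AC1"]])],
)
video_ids = choose_captures(adv, "video")
```

A Consumer would typically apply richer selection criteria (Capture
attributes, Simultaneous Transmission Sets, Encoding Groups) before
building a Configure message; the filter on Media type above only
stands in for that decision.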
Figure 2 illustrates the call flow used by a simple system (two
Endpoints) in compliance with this document. A very brief outline of
the call flow is described in the text that follows.
+-----------+ +-----------+
| Endpoint1 | | Endpoint2 |
+----+------+ +-----+-----+
| INVITE (BASIC SDP+CLUECHANNEL) |
|--------------------------------->|
| 200 OK (BASIC SDP+CLUECHANNEL)|
|<---------------------------------|
| ACK |
|--------------------------------->|
| |
|<################################>|
| BASIC MEDIA SESSION |
|<################################>|
| |
| CONNECT (CLUE CTRL CHANNEL) |
|=================================>|
| ... |
|<================================>|
| CLUE CTRL CHANNEL ESTABLISHED |
|<================================>|
| |
| ADVERTISEMENT 1 |
|*********************************>|
| ADVERTISEMENT 2 |
|<*********************************|
| |
| CONFIGURE 1 |
|<*********************************|
| CONFIGURE 2 |
|*********************************>|
| |
| REINVITE (UPDATED SDP) |
|--------------------------------->|
| 200 OK (UPDATED SDP)|
|<---------------------------------|
| ACK |
|--------------------------------->|
| |
|<################################>|
| UPDATED MEDIA SESSION |
|<################################>|
| |
v v
Figure 2: Basic Information Flow
An initial offer/answer exchange establishes a basic Media session,
for example, audio-only, and a CLUE channel between two Endpoints.
With the establishment of that channel, the Endpoints have consented
to use the CLUE protocol mechanisms and, therefore, MUST adhere to
the CLUE protocol suite as outlined herein.
Over this CLUE channel, the Provider in each Endpoint conveys its
characteristics and capabilities by sending an Advertisement as
specified herein. The Advertisement is typically not sufficient to
set up all Media. The Consumer in the Endpoint receives the
information provided by the Provider and can use it for several
purposes. It uses it, along with information from an offer/answer
exchange, to construct a CLUE Configure message to tell the Provider
what the Consumer wishes to receive. Also, the Consumer may use the
information provided to tailor the SDP it is going to send during any
following SIP offer/answer exchange, and its reaction to SDP it
receives in that step. It is often a sensible implementation choice
to do so. Spatial relationships associated with the Media can be
included in the Advertisement, and it is often sensible for the Media
Consumer to take those spatial relationships into account when
tailoring the SDP. The Consumer can also limit the number of
Encodings it must set up resources to receive, and not waste
resources on unwanted Encodings, because it has the Provider's
Advertisement information ahead of time to determine what it really
wants to receive. The Consumer can also use the Advertisement
information for local rendering decisions.
This initial CLUE exchange is followed by an SDP offer/answer
exchange that not only establishes those aspects of the Media that
have not been "negotiated" over CLUE, but also has the effect of
setting up the Media transmission itself, potentially involving
security exchanges, Interactive Connectivity Establishment (ICE), and
so on. This step is considered "plain vanilla SIP".
During the lifetime of a call, further exchanges MAY occur over the
CLUE channel. In some cases, those further exchanges lead to a
modified system behavior of Provider or Consumer (or both) without
any other protocol activity such as further offer/answer exchanges.
For example, a Configure Message requesting that the Provider place a
different Capture source into a Capture Encoding, signaled over the
CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
re-invites. In other cases, however, after the CLUE negotiation, an
additional offer/answer exchange becomes necessary. For example, if
both sides decide to upgrade the call from one screen to a multi-
screen call, and more bandwidth is required for the additional video
channels compared to what was previously negotiated using offer/
answer, a new offer/answer exchange is required.
One aspect of the protocol outlined herein, and specified in more
detail in companion documents, is that it makes available to the
Consumer information regarding the Provider's capabilities to deliver
Media and attributes related to that Media such as their Spatial
Relationship. The operation of the renderer inside the Consumer is
unspecified in that it can choose to ignore some information provided
by the Provider and/or not Render Media Streams available from the
Provider (although the Consumer follows the CLUE protocol and,
therefore, gracefully receives and responds to the Provider's
information using a Configure operation).
A CLUE-capable device interoperates with a device that does not
support CLUE. The CLUE-capable device can determine, by the result
of the initial offer/answer exchange, if the other device supports
and wishes to use CLUE. The specific mechanism for this is described
in [RFC8848]. If the other device does not use CLUE, then the CLUE-
capable device falls back to behavior that does not require CLUE.
As for the Media, Provider and Consumer have an end-to-end
communication relationship with respect to (RTP-transported) Media;
and the mechanisms described herein and in companion documents do not
change the aspects of setting up those RTP flows and sessions. In
other words, the RTP Media sessions conform to the negotiated SDP
whether or not CLUE is used.
6. Spatial Relationships
In order for a Consumer to perform a proper rendering, it is often
necessary (or at least helpful) for the Consumer to have received
spatial information about the Streams it is receiving. CLUE defines
a coordinate system that allows Media Providers to describe the
Spatial Relationships of their Media Captures to enable proper
scaling and spatially sensible rendering of their Streams. The
coordinate system is based on a few principles:
* Each Capture Scene has a distinct coordinate system, unrelated to
the coordinate systems of other Scenes.
* Simple systems that do not have multiple Media Captures to
associate spatially need not use the coordinate model, although it
can still be useful to provide an Area of Capture.
* Coordinates can either be in real, physical units (millimeters),
have an unknown scale, or have no physical scale. Systems that
know their physical dimensions (for example, professionally
installed Telepresence room systems) MUST provide those real-world
measurements to enable the best user experience for advanced
receiving systems that can utilize this information. Systems that
don't know specific physical dimensions but still know relative
distances MUST use "Unknown Scale". "No Scale" is intended to be
used only where Media Captures from different devices (with
potentially different scales) will be forwarded alongside one
another (e.g., in the case of an MCU).
- "Millimeters" means the scale is in millimeters.
- "Unknown Scale" means the scale is not necessarily in
millimeters, but the scale is the same for every Capture in the
Capture Scene.
- "No Scale" means the scale could be different for each Capture
-- an MCU Provider that advertises two adjacent Captures and
picks sources (which can change quickly) from different
Endpoints might use this value; the scale could be different
and changing for each Capture. But the areas of capture still
represent a Spatial Relation between Captures.
* The coordinate system is right-handed Cartesian X, Y, Z with the
origin at a spatial location of the Provider's choosing. The
Provider MUST use the same coordinate system with the same scale
and origin for all coordinates within the same Capture Scene.
The direction of increasing coordinate values is as follows: X
increases from left to right, from the point of view of an observer
at the front of the room looking toward the back; Y increases from
the front of the room to the back of the room; Z increases from low
to high (i.e., floor to ceiling).
Cameras in a Scene typically point in the direction of increasing Y,
from front to back. But there could be multiple cameras pointing in
different directions. If the physical space does not have a well-
defined front and back, the Provider chooses any direction for X, Y,
and Z consistent with right-handed coordinates.
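As an informal illustration of the right-handedness requirement, a
set of axis directions can be checked with a scalar triple product:
for a right-handed system, (X cross Y) dot Z is positive. The
function names below are assumptions made for this sketch only.

```python
def cross(a, b):
    """Cross product of two 3-D vectors given as (x, y, z) tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-D vectors."""
    return sum(p * q for p, q in zip(a, b))


def is_right_handed(x_axis, y_axis, z_axis):
    """True if the three axis directions form a right-handed system."""
    return dot(cross(x_axis, y_axis), z_axis) > 0


# Axis directions as described in the text: X left to right,
# Y front to back, Z floor to ceiling.
X_AXIS = (1, 0, 0)
Y_AXIS = (0, 1, 0)
Z_AXIS = (0, 0, 1)

conventional = is_right_handed(X_AXIS, Y_AXIS, Z_AXIS)
flipped_z = is_right_handed(X_AXIS, Y_AXIS, (0, 0, -1))
```

A Provider whose room has no obvious front or back could run a check
of this kind on whatever directions it picks before advertising
coordinates.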
7. Media Captures and Capture Scenes
This section describes how Providers can describe the content of
Media to Consumers.
7.1. Media Captures
Media Captures are the fundamental representations of Streams that a
device can transmit. What a Media Capture actually represents is
flexible:
* It can represent the immediate output of a physical source (e.g.,
camera, microphone) or 'synthetic' source (e.g., laptop computer,
DVD player).
* It can represent the output of an audio mixer or video composer.
* It can represent a concept such as 'the loudest speaker'.
* It can represent a conceptual position such as 'the leftmost
Stream'.
To identify and distinguish between multiple Capture instances,
Captures have a unique identity. For instance, VC1, VC2, AC1, and
AC2 (where VC1 and VC2 refer to two different Video Captures and AC1
and AC2 refer to two different Audio Captures).
Some key points about Media Captures:
* A Media Capture is of a single Media type (e.g., audio or video).
* A Media Capture is defined in a Capture Scene and is given an
Advertisement-unique identity. The identity may be referenced
outside the Capture Scene that defines it through an MCC.
* A Media Capture may be associated with one or more CSVs.
* A Media Capture has exactly one set of spatial information.
* A Media Capture can be the source of at most one Capture Encoding.
Each Media Capture can be associated with attributes to describe what
it represents.
7.1.1. Media Capture Attributes
Media Capture attributes describe information about the Captures. A
Provider can use the Media Capture attributes to describe the
Captures for the benefit of the Consumer of the Advertisement
message. All these attributes are optional. Media Capture
attributes include:
* Spatial information, such as Point of Capture, Point on Line of
Capture, and Area of Capture (all of which, in combination,
define the capture field of, for example, a camera).
* Other descriptive information to help the Consumer choose between
Captures (e.g., description, presentation, view, priority,
language, person information, and type).
The subsections below define the Capture attributes.
7.1.1.1. Point of Capture
The Point of Capture attribute is a field with a single Cartesian (X,
Y, Z) point value that describes the spatial location of the
capturing device (such as a camera). For an Audio Capture with
multiple microphones, the Point of Capture defines the nominal
midpoint of the microphones.
7.1.1.2. Point on Line of Capture
The Point on Line of Capture attribute is a field with a single
Cartesian (X, Y, Z) point value that describes a position in space of
a second point on the axis of the capturing device, toward the
direction it is pointing; the first point being the Point of Capture
(see above).
Together, the Point of Capture and Point on Line of Capture define
the direction and axis of the capturing device, for example, the
optical axis of a camera or the axis of a microphone. The Media
Consumer can use this information to adjust how it Renders the
received Media if it so chooses.
For an Audio Capture, the Media Consumer can use this information
along with the Audio Capture Sensitivity Pattern to define a three-
dimensional volume of capture where sounds can be expected to be
picked up by the microphone providing this specific Audio Capture.
If the Consumer wants to associate an Audio Capture with a Video
Capture, it can compare this volume with the Area of Capture for
video Media to provide a check on whether the Audio Capture is indeed
spatially associated with the Video Capture. For example, a video
Area of Capture that fails to intersect at all with the audio volume
of capture, or is at such a long radial distance from the microphone
Point of Capture that the audio level would be very low, would be
inappropriate.
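As a minimal sketch of how a Consumer might derive the capture axis,
the direction can be computed as the unit vector from the Point of
Capture toward the Point on Line of Capture. The function name and
the sample coordinates are assumptions made for illustration.

```python
import math


def capture_direction(point_of_capture, point_on_line):
    """Unit vector from the Point of Capture toward the second point."""
    delta = tuple(b - a for a, b in zip(point_of_capture, point_on_line))
    length = math.sqrt(sum(c * c for c in delta))
    if length == 0:
        # The two points must differ for an axis to be defined.
        raise ValueError("Point on Line of Capture equals Point of Capture")
    return tuple(c / length for c in delta)


# Hypothetical camera at (0, 0, 10) pointing in the direction of
# increasing Y, i.e., from the front of the room toward the back.
direction = capture_direction((0, 0, 10), (0, 5, 10))
```

The same vector could serve as the axis of a microphone when
estimating the volume of capture for an Audio Capture.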
7.1.1.3. Area of Capture
The Area of Capture is a field with a set of four (X, Y, Z) points as
a value that describes the spatial location of what is being
"captured". This attribute applies only to Video Captures, not other
types of Media. By comparing the Area of Capture for different Video
Captures within the same Capture Scene, a Consumer can determine the
Spatial Relationships between them and Render them correctly.
The four points MUST be co-planar, forming a quadrilateral, which
defines the Plane of Interest for the particular Media Capture.
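The co-planarity requirement can be checked with a scalar triple
product: the four points are co-planar when the vector to the fourth
point has no component along the normal of the plane through the
first three. The following sketch uses assumed helper names and a
small tolerance for floating-point input; neither is specified by
this document.

```python
def sub(a, b):
    """Component-wise difference a - b of two 3-D points."""
    return tuple(p - q for p, q in zip(a, b))


def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-D vectors."""
    return sum(p * q for p, q in zip(a, b))


def are_coplanar(p0, p1, p2, p3, tol=1e-9):
    """True if the four points lie in one plane (within tol)."""
    normal = cross(sub(p1, p0), sub(p2, p0))
    return abs(dot(normal, sub(p3, p0))) <= tol


# Illustrative Area of Capture lying in the plane Y = 0.
area = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
ok = are_coplanar(*area)

# Moving one corner out of the plane breaks the requirement.
bad = are_coplanar((0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 1, 9))
```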
If the Area of Capture is not specified, it means the Video Capture
might be spatially related to other Captures in the same Scene, but
there is no detailed information on the relationship. For a switched
Capture that switches between different sections within a larger
area, the Area of Capture MUST use coordinates for the larger
potential area.
7.1.1.4. Mobility of Capture
The Mobility of Capture attribute indicates whether or not the Point
of Capture, Point on Line of Capture, and Area of Capture values stay
the same over time, or are expected to change (potentially
frequently). Possible values are static, dynamic, and highly
dynamic.
An example of "dynamic" is a camera mounted on a stand that is
occasionally hand-carried and placed at different positions in order
to provide the best angle to capture a work task. A camera worn by a
person who moves around the room is an example of "highly dynamic".
In either case, the effect is that the Point of Capture, Capture
Axis, and Area of Capture change with time.
The Point of Capture of a static Capture MUST NOT move for the life
of the CLUE session. The Point of Capture of dynamic Captures is
characterized by a change in position followed by a reasonable period
of stability -- on the order of magnitude of minutes. Highly dynamic
Captures are characterized by a Point of Capture that is constantly
moving. If the Area of Capture, Point of Capture, and Point on Line
of Capture attributes are included with dynamic or highly dynamic
Captures, they indicate spatial information at the time of the
Advertisement.
7.1.1.5. Audio Capture Sensitivity Pattern
The Audio Capture Sensitivity Pattern attribute applies only to Audio
Captures. This attribute gives information about the nominal
sensitivity pattern of the microphone that is the source of the
Capture. Possible values include patterns such as omni, shotgun,
cardioid, and hyper-cardioid.
7.1.1.6. Description
The Description attribute is a human-readable description (which
could be in multiple languages) of the Capture.
7.1.1.7. Presentation
The Presentation attribute indicates that the Capture originates from
a presentation device, that is, one that provides supplementary
information to a Conference through slides, video, still images,
data, etc. Where more information is known about the Capture, it MAY
be expanded hierarchically to indicate the different types of
presentation Media, e.g., presentation.slides, presentation.image,
etc.
Note: It is expected that a number of keywords will be defined that
provide more detail on the type of presentation. Refer to [RFC8846]
for how to extend the model.
7.1.1.8. View
The View attribute is a field with enumerated values, indicating what
type of view the Capture relates to. The Consumer can use this
information to help choose which Media Captures it wishes to receive.
Possible values are as follows:
Room: Captures the entire Scene
Table: Captures the conference table with seated people
Individual: Captures an individual person
Lectern: Captures the region of the lectern including the
presenter, for example, in a classroom-style conference
room
Audience: Captures a region showing the audience in a classroom-
style conference room
7.1.1.9. Language
The Language attribute indicates one or more languages used in the
content of the Media Capture. Captures MAY be offered in different
languages in case of multilingual and/or accessible Conferences. A
Consumer can use this attribute to differentiate between them and
pick the appropriate one.
Note that the Language attribute is defined and meaningful for both
Audio and Video Captures. In the case of Audio Captures, the meaning is
obvious. For a Video Capture, "Language" could, for example, be sign
interpretation or text.
The Language attribute is coded per [RFC5646].
7.1.1.10. Person Information
The Person Information attribute allows a Provider to provide
specific information regarding the people in a Capture (regardless of
whether or not the Capture has a Presentation attribute). The
Provider may gather the information automatically or manually from a
variety of sources; however, the xCard [RFC6351] format is used to
convey the information. This allows various information, such as
Identification information (Section 6.2 of [RFC6350]), Communication
Information (Section 6.4 of [RFC6350]), and Organizational
information (Section 6.6 of [RFC6350]), to be communicated. A
Consumer may then automatically (i.e., via a policy) or manually
select Captures based on information about who is in a Capture. It
also allows a Consumer to Render information regarding the people
participating in the Conference or to use it for further processing.
The Provider may supply a minimal set of information or a larger set
of information. However, it MUST be compliant with [RFC6350] and
supply a "VERSION" and "FN" property. A Provider may supply multiple
xCards per Capture of any KIND (Section 6.1.4 of [RFC6350]).
In order to keep CLUE messages compact, the Provider SHOULD use a URI
to point to any LOGO, PHOTO, or SOUND contained in the xCard rather
than transmitting the LOGO, PHOTO, or SOUND data in a CLUE message.
7.1.1.11. Person Type
The Person Type attribute indicates the type of people contained in
the Capture with respect to the meeting agenda (regardless of whether
or not the Capture has a Presentation attribute). As a Capture may
include multiple people, the attribute may contain multiple values.
However, values MUST NOT be repeated within the attribute.
An Advertiser associates the person type with an individual Capture
when it knows that a particular type is in the Capture. If an
Advertiser cannot link a particular type with some certainty to a
Capture, then it is not included. On reception of a Capture with a
Person Type attribute, a Consumer knows with some certainty that the
Capture contains that person type. The Capture may contain other
person types, but the Advertiser has not been able to determine that
this is the case.
The types of Captured people include:
Chair: the person responsible for running the meeting
according to the agenda.
Vice-Chair: the person responsible for assisting the chair in
running the meeting.
Minute Taker: the person responsible for recording the minutes of
the meeting.
Attendee: the person has no particular responsibilities with
respect to running the meeting.
Observer: an Attendee without the right to influence the
discussion.
Presenter: the person scheduled on the agenda to make a
presentation in the meeting. Note: This is not
related to any "active speaker" functionality.
Translator: the person providing some form of translation or
commentary in the meeting.
Timekeeper: the person responsible for maintaining the meeting
schedule.
Furthermore, the Person Type attribute may contain one or more
strings allowing the Provider to indicate custom meeting-specific
types.
7.1.1.12. Priority
The Priority attribute indicates a relative priority between
different Media Captures. The Provider sets this priority, and the
Consumer MAY use the priority to help decide which Captures it wishes
to receive.
The Priority attribute is an integer that indicates a relative
priority between Captures. For example, it is possible to assign a
priority between two presentation Captures that would allow a remote
Endpoint to determine which presentation is more important. Priority
is assigned at the individual Capture level. It represents the
Provider's view of the relative priority between Captures with a
priority. The same priority number MAY be used across multiple
Captures. It indicates that they are equally important. If no
priority is assigned, no assumptions can be made regarding the
relative importance of the Capture.
7.1.1.13. Embedded Text
The Embedded Text attribute indicates that a Capture provides
embedded textual information. For example, the Video Capture may
contain speech-to-text information composed with the video image.
7.1.1.14. Related To
The Related To attribute indicates the Capture contains additional
complementary information related to another Capture. The value
indicates the identity of the other Capture to which this Capture is
providing additional information.
For example, a Conference can utilize translators or facilitators
that provide an additional audio Stream (i.e., a translation or
description or commentary of the Conference). Where multiple
Captures are available, it may be advantageous for a Consumer to
select a complementary Capture instead of or in addition to a Capture
it relates to.
7.2. Multiple Content Capture
The MCC indicates that one or more Single Media Captures are
multiplexed (temporally and/or spatially) or mixed in one Media
Capture. Only one Capture type (i.e., audio, video, etc.) is allowed
in each MCC instance. The MCC may contain a reference to the Single
Media Captures (which may have their own attributes) as well as
attributes associated with the MCC itself. An MCC may also contain
other MCCs. The MCC MAY reference Captures from within the Capture
Scene that defines it or from other Capture Scenes. No ordering is
implied by the order that Captures appear within an MCC. An MCC MAY
contain no references to other Captures to indicate that the MCC
contains content from multiple sources, but no information regarding
those sources is given. MCCs either contain the referenced Captures
and no others or have no referenced Captures and, therefore, may
contain any Capture.
One or more MCCs may also be specified in a CSV. This allows an
Advertiser to indicate that several MCC Captures are used to
represent a Capture Scene. Table 14 provides an example of this
case.
As outlined in Section 7.1, each instance of the MCC has its own
Capture identity, i.e., MCC1. It allows all the individual Captures
contained in the MCC to be referenced by a single MCC identity.
The example below shows the use of a Multiple Content Capture:
+===================+=========================+
| Capture Scene #1 | |
+===================+=========================+
| VC1 | {MC attributes} |
+-------------------+-------------------------+
| VC2 | {MC attributes} |
+-------------------+-------------------------+
| VC3 | {MC attributes} |
+-------------------+-------------------------+
| MCC1(VC1,VC2,VC3) | {MC and MCC attributes} |
+-------------------+-------------------------+
| CSV(MCC1) | |
+-------------------+-------------------------+
Table 1: Multiple Content Capture Concept
This indicates that MCC1 is a single Capture that contains the
Captures VC1, VC2, and VC3, according to any MCC1 attributes.
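Because an MCC may reference other MCCs, a Consumer that wants the
underlying Single Media Captures has to follow references
recursively. The sketch below assumes a simple mapping from MCC
identity to referenced identities; the mapping, the function name,
and the MCC2 and VC4 identities are hypothetical and appear only for
illustration.

```python
def leaf_captures(capture_id, mcc_refs, _seen=None):
    """Return the Single Media Captures reachable from capture_id.

    mcc_refs maps an MCC identity to the list of Capture identities
    it references. Identities absent from the map are treated as
    Single Media Captures. An MCC with an empty list yields an empty
    set, reflecting that no source information was given.
    """
    seen = set() if _seen is None else _seen
    if capture_id in seen:
        return set()          # already visited; avoids reference loops
    seen.add(capture_id)
    if capture_id not in mcc_refs:
        return {capture_id}
    leaves = set()
    for ref in mcc_refs[capture_id]:
        leaves |= leaf_captures(ref, mcc_refs, seen)
    return leaves


# MCC1 as in Table 1; MCC2 and VC4 are hypothetical additions that
# show an MCC containing another MCC.
mcc_refs = {
    "MCC1": ["VC1", "VC2", "VC3"],
    "MCC2": ["MCC1", "VC4"],
}
from_mcc1 = leaf_captures("MCC1", mcc_refs)
from_mcc2 = leaf_captures("MCC2", mcc_refs)
```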
7.2.1. MCC Attributes
Media Capture attributes may be associated with the MCC instance and
the Single Media Captures that the MCC references. A Provider should
avoid providing conflicting attribute values between the MCC and
Single Media Captures. Where there is a conflict, the attributes of
the MCC override any that may be present in the individual Captures.
A Provider MAY include as much or as little of the original source
Capture information as it requires.
There are MCC-specific attributes that MUST only be used with
Multiple Content Captures. These are described in the sections
below. The attributes described in Section 7.1.1 MAY also be used
with MCCs.
The spatial-related attributes of an MCC indicate its Area of Capture
and Point of Capture within the Scene, just like any other Media
Capture. The spatial information does not imply anything about how
other Captures are composed within an MCC.
For example: a virtual Scene could be constructed for the MCC Capture
with two Video Captures with a MaxCaptures attribute set to 2 and an
Area of Capture attribute provided with an overall area. Each of the
individual Captures could then also include an Area of Capture
attribute with a subset of the overall area. The Consumer would then
know how each Capture is related to others within the Scene, but not
the relative position of the individual Captures within the composed
Capture.
+===============+===================================+
| Capture Scene | |
| #1 | |
+===============+===================================+
| VC1 | AreaofCapture=(0,0,0)(9,0,0) |
| | (0,0,9)(9,0,9) |
+---------------+-----------------------------------+
| VC2 | AreaofCapture=(10,0,0)(19,0,0) |
| | (10,0,9)(19,0,9) |
+---------------+-----------------------------------+
| MCC1(VC1,VC2) | MaxCaptures=2 |
| | AreaofCapture=(0,0,0)(19,0,0) |
| | (0,0,9)(19,0,9) |
+---------------+-----------------------------------+
| CSV(MCC1) | |
+---------------+-----------------------------------+
Table 2: Example of MCC and Single Media Capture
Attributes
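The relationship in Table 2 between the overall Area of Capture and
the areas of the individual Captures can be checked informally. The
sketch below compares axis-aligned bounding boxes, which is a
simplifying assumption that suits the coordinates in Table 2 but not
arbitrary quadrilaterals; the function names are illustrative.

```python
def bounding_box(points):
    """Return (min_corner, max_corner) of a list of (x, y, z) points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def within(inner, outer):
    """True if the bounding box of inner lies inside that of outer."""
    (ilo, ihi) = bounding_box(inner)
    (olo, ohi) = bounding_box(outer)
    return (all(o <= i for o, i in zip(olo, ilo)) and
            all(i <= o for i, o in zip(ihi, ohi)))


# Area of Capture values taken from Table 2.
VC1_AREA = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
VC2_AREA = [(10, 0, 0), (19, 0, 0), (10, 0, 9), (19, 0, 9)]
MCC1_AREA = [(0, 0, 0), (19, 0, 0), (0, 0, 9), (19, 0, 9)]

vc1_inside = within(VC1_AREA, MCC1_AREA)
vc2_inside = within(VC2_AREA, MCC1_AREA)
mcc_inside_vc1 = within(MCC1_AREA, VC1_AREA)
```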
The subsections below describe the MCC-only attributes.
7.2.1.1. MaxCaptures: Maximum Number of Captures within an MCC
The MaxCaptures attribute indicates the maximum number of individual
Captures that may appear in a Capture Encoding at a time. The actual
number at any given time can be less than or equal to this maximum.
It may be used to derive how the Single Media Captures within the MCC
are composed/switched with regard to space and time.
A Provider can indicate that the number of Captures in an MCC Capture
Encoding is equal ("=") to the MaxCaptures value or that there may be
any number of Captures up to and including ("<=") the MaxCaptures
value. This allows a Provider to distinguish between an MCC that
purely represents a composition of sources and an MCC that represents
switched sources or switched and composed sources.
MaxCaptures may be set to one so that only content related to one of
the sources is shown in the MCC Capture Encoding at a time, or it may
be set to any value up to the total number of Source Media Captures
in the MCC.
The bullets below describe how the setting of MaxCaptures versus the
number of Captures in the MCC affects how sources appear in a Capture
Encoding:
* A switched case occurs when MaxCaptures is set to <= 1 and the
number of Captures in the MCC is greater than 1 (or not specified).
Zero or one Captures may be switched into the Capture Encoding.
Note: zero is allowed because of the "<=".
* A switched case occurs when MaxCaptures is set to = 1 and the
number of Captures in the MCC is greater than 1 (or not specified).
Only one Capture source is contained in a Capture Encoding at a
time.
* A switched and composed case occurs when MaxCaptures is set to <=
N (with N > 1) and the number of Captures in the MCC is greater
than N (or not specified). The Capture Encoding may contain
purely switched sources (e.g., <=2 allows for one source on its
own), or it may contain composed and switched sources (e.g., a
composition of two sources switched between the sources).
* A switched and composed case occurs when MaxCaptures is set to = N
(with N > 1) and the number of Captures in the MCC is greater than
N (or not specified). The Capture Encoding contains composed and
switched sources (i.e., a composition of N sources switched
between the sources). It is not possible to have a single source.
* A switched and composed case occurs when MaxCaptures is set to <=
the number of Captures in the MCC. The Capture Encoding may
contain Media switched between any number (up to the MaxCaptures
value) of composed sources.
* A composed case occurs when MaxCaptures is set to = the number of
Captures in the MCC. All the sources are composed into a single
Capture Encoding.
If this attribute is not set, then as a default, it is assumed that
all source Media Capture content can appear concurrently in the
Capture Encoding associated with the MCC.
For example, the use of MaxCaptures equal to 1 on an MCC with three
Video Captures, VC1, VC2, and VC3, would indicate that the Advertiser
switches between VC1, VC2, and VC3 in the Capture Encoding, as at
most one Capture may appear at a time.
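The bullets above can be read as a small decision procedure. The
following Python sketch is illustrative only: the function name
classify_mcc and its arguments are hypothetical and are not part of
CLUE. It assumes the comparison is given as the string "=" or "<="
and that an unspecified number of source Captures is passed as None.

```python
from typing import Optional


def classify_mcc(op: str, max_captures: int, num_captures: Optional[int]) -> str:
    """Return 'switched', 'switched and composed', or 'composed'.

    op is "=" or "<="; num_captures is None when the MCC does not
    specify its source Captures.
    """
    if op not in ("=", "<="):
        raise ValueError("op must be '=' or '<='")
    if max_captures < 1:
        raise ValueError("max_captures must be at least 1")
    if num_captures is not None and max_captures > num_captures:
        raise ValueError("max_captures exceeds the number of Captures in the MCC")

    # MaxCaptures of 1 with several (or unspecified) sources: pure switching.
    if max_captures == 1 and (num_captures is None or num_captures > 1):
        return "switched"

    # MaxCaptures equal to the number of sources: "=" composes everything,
    # "<=" still lets the Provider switch between smaller compositions.
    if num_captures is not None and max_captures == num_captures:
        return "composed" if op == "=" else "switched and composed"

    # 1 < MaxCaptures < number of sources (or the number is unspecified).
    return "switched and composed"
```

For instance, classify_mcc("=", 1, 3) corresponds to the VC1, VC2,
VC3 example above and yields "switched".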
7.2.1.2. Policy
The Policy MCC attribute indicates the criteria that the Provider
uses to determine when and/or where Media content appears in the
Capture Encoding related to the MCC.
The attribute is in the form of a token that indicates the policy and
an index representing an instance of the policy. The same index
value can be used for multiple MCCs.
The tokens are as follows:
SoundLevel: This indicates that the content of the MCC is determined
by a sound-level-detection algorithm. The loudest (active)
speaker (or a previous speaker, depending on the index value) is
contained in the MCC.
RoundRobin: This indicates that the content of the MCC is determined
by a time-based algorithm. For example, the Provider provides
content from a particular source for a period of time and then
provides content from another source, and so on.
An index is used to represent an instance in the policy setting. An
index of 0 represents the most current instance of the policy, i.e.,
the active speaker, 1 represents the previous instance, i.e., the
previous active speaker, and so on.
The following example shows a case where the Provider provides two
Media Streams, one showing the active speaker and a second Stream
showing the previous speaker.
+==================+=====================+
| Capture Scene #1 | |
+==================+=====================+
| VC1 | |
+------------------+---------------------+
| VC2 | |
+------------------+---------------------+
| MCC1(VC1,VC2) | Policy=SoundLevel:0 |
| | MaxCaptures=1 |
+------------------+---------------------+
| MCC2(VC1,VC2) | Policy=SoundLevel:1 |
| | MaxCaptures=1 |
+------------------+---------------------+
| CSV(MCC1,MCC2) | |
+------------------+---------------------+
Table 3: Example Policy MCC Attribute Usage
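As an informal illustration of the token-plus-index form, the sketch
below parses the "Token:index" notation used in Table 3. This textual
notation is only the shorthand of the table; the actual encoding of
the attribute is defined by the CLUE data model, not here. The names
Policy, parse_policy, and describe are hypothetical.

```python
from typing import NamedTuple

KNOWN_TOKENS = ("SoundLevel", "RoundRobin")


class Policy(NamedTuple):
    token: str  # which algorithm selects the content
    index: int  # 0 = most current instance, 1 = previous instance, ...


def parse_policy(text: str) -> Policy:
    """Parse the table shorthand, e.g. 'SoundLevel:0'."""
    token, sep, index = text.partition(":")
    if sep != ":" or token not in KNOWN_TOKENS:
        raise ValueError(f"unrecognized policy: {text!r}")
    if not (index.isascii() and index.isdigit()):
        raise ValueError(f"index must be a non-negative integer: {text!r}")
    return Policy(token, int(index))


def describe(policy: Policy) -> str:
    """Give a human-readable reading of the policy instance."""
    if policy.token == "SoundLevel":
        if policy.index == 0:
            return "active speaker"
        if policy.index == 1:
            return "previous active speaker"
        return f"active speaker {policy.index} instances ago"
    return f"{policy.token} instance {policy.index}"
```

Under these assumptions, MCC1 in Table 3 reads as the active speaker
and MCC2 as the previous active speaker.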
7.2.1.3. SynchronizationID: Synchronization Identity
The SynchronizationID MCC attribute indicates how the individual
Captures in multiple MCC Captures are synchronized. To indicate that
the Capture Encodings associated with MCCs contain Captures from the
same source at the same time, a Provider should set the same
SynchronizationID on each of the concerned MCCs. It is the Provider
that determines what the source for the Captures is, so a Provider
can choose how to group together Single Media Captures into a
combined "source" for the purpose of switching them together to keep
them synchronized according to the SynchronizationID attribute. For
example, when the Provider is in an MCU, it may determine that each
separate CLUE Endpoint is a remote source of Media. The
SynchronizationID may be used across Media types, i.e., to
synchronize audio- and video-related MCCs.
Without this attribute it is assumed that multiple MCCs may provide
content from different sources at any particular point in time.
For example:
+=======================+=====================+
| Capture Scene #1 | |
+=======================+=====================+
| VC1 | Description=Left |
+-----------------------+---------------------+
| VC2 | Description=Center |
+-----------------------+---------------------+
| VC3 | Description=Right |
+-----------------------+---------------------+
| AC1 | Description=Room |
+-----------------------+---------------------+
| CSV(VC1,VC2,VC3) | |
+-----------------------+---------------------+
| CSV(AC1) | |
+=======================+=====================+
| Capture Scene #2 | |
+=======================+=====================+
| VC4 | Description=Left |
+-----------------------+---------------------+
| VC5 | Description=Center |
+-----------------------+---------------------+
| VC6 | Description=Right |
+-----------------------+---------------------+
| AC2 | Description=Room |
+-----------------------+---------------------+
| CSV(VC4,VC5,VC6) | |
+-----------------------+---------------------+
| CSV(AC2) | |
+=======================+=====================+
| Capture Scene #3 | |
+=======================+=====================+
| VC7 | |
+-----------------------+---------------------+
| AC3 | |
+=======================+=====================+
| Capture Scene #4 | |
+=======================+=====================+
| VC8 | |
+-----------------------+---------------------+
| AC4 | |
+=======================+=====================+
| Capture Scene #5 | |
+=======================+=====================+
| MCC1(VC1,VC4,VC7) | SynchronizationID=1 |
| | MaxCaptures=1 |
+-----------------------+---------------------+
| MCC2(VC2,VC5,VC8) | SynchronizationID=1 |
| | MaxCaptures=1 |
+-----------------------+---------------------+
| MCC3(VC3,VC6) | MaxCaptures=1 |
+-----------------------+---------------------+
| MCC4(AC1,AC2,AC3,AC4) | SynchronizationID=1 |
| | MaxCaptures=1 |
+-----------------------+---------------------+
| CSV(MCC1,MCC2,MCC3) | |
+-----------------------+---------------------+
| CSV(MCC4) | |
+-----------------------+---------------------+
Table 4: Example SynchronizationID MCC Attribute Usage
The above Advertisement would indicate that MCC1, MCC2, MCC3, and
MCC4 make up a Capture Scene. There would be four Capture Encodings
(one for each MCC). Because MCC1 and MCC2 have the same
SynchronizationID, each Encoding from MCC1 and MCC2, respectively,
would together have content from only Capture Scene 1 or only Capture
Scene 2 or the combination of VC7 and VC8 at a particular point in
time. In this case, the Provider has decided the sources to be
synchronized are Scene #1, Scene #2, and Scene #3 and #4 together.
The Encoding from MCC3 would not be synchronized with MCC1 or MCC2.
As MCC4 also has the same SynchronizationID as MCC1 and MCC2, the
content of the audio Encoding will be synchronized with the video
content.
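A Consumer might find the synchronized MCCs by grouping them on the
SynchronizationID value. The sketch below does this for the MCCs of
Table 4; the list layout and the function name synchronized_groups
are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (MCC identity, SynchronizationID or None), taken from Table 4.
MCCS = [("MCC1", 1), ("MCC2", 1), ("MCC3", None), ("MCC4", 1)]


def synchronized_groups(
    mccs: Iterable[Tuple[str, Optional[int]]]
) -> Dict[int, List[str]]:
    """Group MCC identities by SynchronizationID.

    MCCs without the attribute are left out: their content may come
    from a different source at any point in time.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for name, sync_id in mccs:
        if sync_id is not None:
            groups[sync_id].append(name)
    return dict(groups)
```

For Table 4 this yields one group holding MCC1, MCC2, and MCC4, with
MCC3 left unsynchronized.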
7.2.1.4. Allow Subset Choice
The Allow Subset Choice MCC attribute is a boolean value, indicating
whether or not the Provider allows the Consumer to choose a specific
subset of the Captures referenced by the MCC. If this attribute is
true, and the MCC references other Captures, then the Consumer MAY
select (in a Configure message) a specific subset of those Captures
to be included in the MCC, and the Provider MUST then include only
that subset. If this attribute is false, or the MCC does not
reference other Captures, then the Consumer MUST NOT select a subset.
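The rule above can be sketched as a check on the Consumer's
selection. The function resolve_mcc_selection is hypothetical, and
treating an empty selection as invalid is an assumption of this
sketch rather than something the text states.

```python
from typing import FrozenSet, Iterable, Optional


def resolve_mcc_selection(
    referenced: Iterable[str],
    allow_subset_choice: bool,
    requested: Optional[Iterable[str]] = None,
) -> FrozenSet[str]:
    """Return the Captures that are candidates for inclusion in the MCC.

    requested is None when the Consumer selects the MCC as a whole.
    """
    referenced_set = frozenset(referenced)
    if requested is None:
        # No subset chosen: the Provider decides among all referenced Captures.
        return referenced_set
    requested_set = frozenset(requested)
    if not allow_subset_choice or not referenced_set:
        raise ValueError("the Consumer MUST NOT select a subset for this MCC")
    if not requested_set or not requested_set <= referenced_set:
        raise ValueError("requested Captures must be a non-empty subset of the MCC")
    return requested_set
```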
7.3. Capture Scene
In order for a Provider's individual Captures to be used effectively
by a Consumer, the Provider organizes the Captures into one or more
Capture Scenes, with the structure and contents of these Capture
Scenes being sent from the Provider to the Consumer in the
Advertisement.
A Capture Scene is a structure representing a spatial region
containing one or more Capture Devices, each capturing Media
representing a portion of the region. A Capture Scene includes one
or more Capture Scene Views (CSVs), with each CSV including one or
more Media Captures of the same Media type. There can also be Media
Captures that are not included in a CSV. A Capture Scene represents,
for example, the video image of a group of people seated next to each
other, along with the sound of their voices, which could be
represented by some number of VCs and ACs in the CSVs. An MCU can
also describe in Capture Scenes what it constructs from Media Streams
it receives.
A Provider MAY advertise one or more Capture Scenes. What
constitutes an entire Capture Scene is up to the Provider. A simple
Provider might typically use one Capture Scene for participant Media
(live video from the room cameras) and another Capture Scene for a
computer-generated presentation. In more-complex systems, the use of
additional Capture Scenes is also sensible. For example, a classroom
may advertise two Capture Scenes involving live video: one including
only the camera capturing the instructor (and associated audio), the
other including camera(s) capturing students (and associated audio).
A Capture Scene MAY (and typically will) include more than one type
of Media. For example, a Capture Scene can include several CSVs for
Video Captures and several CSVs for Audio Captures. A particular
Capture MAY be included in more than one CSV.
A Provider MAY express Spatial Relationships between Captures that
are included in the same Capture Scene. However, there is no Spatial
Relationship between Media Captures from different Capture Scenes.
In other words, Capture Scenes each use their own spatial measurement
system as outlined in Section 6.
A Provider arranges Captures in a Capture Scene to help the Consumer
choose which Captures it wants to Render. The CSVs in a Capture
Scene are different alternatives the Provider is suggesting for
representing the Capture Scene. Each CSV is given an advertisement-
unique identity. The order of CSVs within a Capture Scene has no
significance. The Media Consumer can choose to receive all Media
Captures from one CSV for each Media type (e.g., audio and video), or
it can pick and choose Media Captures regardless of how the Provider
arranges them in CSVs. Different CSVs of the same Media type are not
necessarily mutually exclusive alternatives. Also note that the
presence of multiple CSVs (with potentially multiple Encoding options
in each view) in a given Capture Scene does not necessarily imply
that a Provider is able to serve all the associated Media
simultaneously (although the construction of such an over-rich
Capture Scene is probably not sensible in many cases). What a
Provider can send simultaneously is determined through the
Simultaneous Transmission Set mechanism, described in Section 8.
Captures within the same CSV MUST be of the same Media type -- it is
not possible to mix audio and Video Captures in the same CSV, for
instance. The Provider MUST be capable of encoding and sending all
Captures (that have an Encoding Group) in a single CSV
simultaneously. The order of Captures within a CSV has no
significance. A Consumer can decide to receive all the Captures in a
single CSV, but a Consumer could also decide to receive just a subset
of those Captures. A Consumer can also decide to receive Captures
from different CSVs, all subject to the constraints set by
Simultaneous Transmission Sets, as discussed in Section 8.
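The same-Media-type rule for a CSV lends itself to a simple check. In
the sketch below, the Capture identities and the mapping from Capture
to Media type are illustrative assumptions, as is the function name
csv_media_type.

```python
from typing import Iterable, Mapping


def csv_media_type(csv: Iterable[str], media_type_of: Mapping[str, str]) -> str:
    """Return the single Media type of a CSV, or raise if the rule is broken."""
    types = {media_type_of[capture] for capture in csv}
    if len(types) != 1:
        raise ValueError(
            f"a CSV must hold Captures of exactly one Media type, got {sorted(types)}"
        )
    return types.pop()
```

An empty CSV is rejected as well, since a CSV includes one or more
Media Captures.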
When a Provider advertises a Capture Scene with multiple CSVs, it is
essentially signaling that there are multiple representations of the
same Capture Scene available. In some cases, these multiple views
would be used simultaneously (for instance, a "video view" and an
"audio view"). In some cases, the views would conceptually be
alternatives (for instance, a view consisting of three Video Captures
covering the whole room versus a view consisting of just a single
Video Capture covering only the center of a room). In this latter
example, one sensible choice for a Consumer would be to indicate
(through its Configure and possibly through an additional offer/
answer exchange) the Captures of that CSV that most closely matched
the Consumer's number of display devices or screen layout.
The following is an example of four potential CSVs for an Endpoint-
style Provider:
1. (VC0, VC1, VC2) - left, center, and right camera Video Captures
2. (MCC3) - Video Capture associated with loudest room segment
3. (VC4) - Video Capture zoomed out view of all people in the room
4. (AC0) - main audio
The first view in this Capture Scene example is a list of Video
Captures that have a Spatial Relationship to each other.
Determination of the order of these Captures (VC0, VC1, and VC2) for
rendering purposes is accomplished through use of their Area of
Capture attributes. The second view (MCC3) and the third view (VC4)
are alternative representations of the same room's video, which might
be better suited to some Consumers' rendering capabilities. The
inclusion of the Audio Capture in the same Capture Scene indicates
that AC0 is associated with all of those Video Captures, meaning it
comes from the same spatial region. Therefore, if audio were to be
Rendered at all, this audio would be the correct choice, irrespective
of which Video Captures were chosen.
7.3.1. Capture Scene Attributes
Capture Scene attributes can be applied to Capture Scenes as well as
to individual Media Captures. Attributes specified at this level
apply to all constituent Captures. Capture Scene attributes include
the following:
* Human-readable description of the Capture Scene, which could be in
multiple languages;
* xCard Scene information;
* Scale information ("Millimeters", "Unknown Scale", "No Scale"), as
described in Section 6.
7.3.1.1. Scene Information
The Scene Information attribute provides information regarding the
Capture Scene rather than individual participants. The Provider may
gather the information automatically or manually from a variety of
sources. The Scene Information attribute allows a Provider to
indicate information such as organizational or geographic information
allowing a Consumer to determine which Capture Scenes are of interest
in order to then perform Capture selection. It also allows a
Consumer to Render information regarding the Scene or to use it for
further processing.
As per Section 7.1.1.10, the xCard format is used to convey this
information and the Provider may supply a minimal set of information
or a larger set of information.
In order to keep CLUE messages compact, the Provider SHOULD use a URI
to point to any LOGO, PHOTO, or SOUND contained in the xCard rather
than transmitting the LOGO, PHOTO, or SOUND data in a CLUE message.
7.3.2. Capture Scene View Attributes
A Capture Scene can include one or more CSVs in addition to the
Capture-Scene-wide attributes described above. CSV attributes apply
to the CSV as a whole, i.e., to all Captures that are part of the
CSV.
CSV attributes include the following:
* A human-readable description (which could be in multiple
languages) of the CSV.
7.4. Global View List
An Advertisement can include an optional Global View list. Each item
in this list is a Global View. The Provider can include multiple
Global Views, to allow a Consumer to choose sets of Captures
appropriate to its capabilities or application. The choice of how to
make these suggestions in the Global View list for what represents
all the Scenes for which the Provider can send Media is up to the
Provider. This is very similar to how each CSV represents a
particular Scene.
As an example, suppose an Advertisement has three Scenes, and each
Scene has three CSVs, ranging from one to three Video Captures in
each CSV. The Provider is advertising a total of nine Video Captures
across three Scenes. The Provider can use the Global View list to
suggest alternatives for Consumers that can't receive all nine Video
Captures as separate Media Streams. For accommodating a Consumer
that wants to receive three Video Captures, a Provider might suggest
a Global View containing just a single CSV with three Captures and
nothing from the other two Scenes. Or a Provider might suggest a
Global View containing three different CSVs, one from each Scene,
with a single Video Capture in each.
Some additional rules:
* The ordering of Global Views in the Global View list is
insignificant.
* The ordering of CSVs within each Global View is insignificant.
* A particular CSV may be used in multiple Global Views.
* The Provider must be capable of encoding and sending all Captures
within the CSVs of a given Global View simultaneously.
The following figure shows an example of the structure of Global
Views in a Global View List.
........................................................
. Advertisement .
. .
. +--------------+ +-------------------------+ .
. |Scene 1 | |Global View List | .
. | | | | .
. | CSV1 (v)<----------------- Global View (CSV 1) | .
. | <-------. | | .
. | | *--------- Global View (CSV 1,5) | .
. | CSV2 (v) | | | | .
. | | | | | .
. | CSV3 (v)<---------*------- Global View (CSV 3,5) | .
. | | | | | | .
. | CSV4 (a)<----------------- Global View (CSV 4) | .
. | <-----------. | | .
. +--------------+ | | *----- Global View (CSV 4,6) | .
. | | | | | .
. +--------------+ | | | +-------------------------+ .
. |Scene 2 | | | | .
. | | | | | .
. | CSV5 (v)<-------' | | .
. | <---------' | .
. | | | (v) = video .
. | CSV6 (a)<-----------' (a) = audio .
. | | .
. +--------------+ .
`......................................................'
Figure 3: Global View List Structure
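The structure in Figure 3 can be written down as plain data, which
may help when reasoning about which Scenes a Global View spans. The
dictionary layout and the helper names below are assumptions of this
sketch; only the CSV-to-Scene assignments and the five Global Views
come from the figure.

```python
from typing import Dict, List, Set, Tuple

# CSV identity -> (Scene, Media type), as drawn in Figure 3.
CSV_INFO: Dict[str, Tuple[str, str]] = {
    "CSV1": ("Scene 1", "video"),
    "CSV2": ("Scene 1", "video"),
    "CSV3": ("Scene 1", "video"),
    "CSV4": ("Scene 1", "audio"),
    "CSV5": ("Scene 2", "video"),
    "CSV6": ("Scene 2", "audio"),
}

# Each Global View is a set of CSVs; the ordering carries no meaning.
GLOBAL_VIEWS: List[Tuple[str, ...]] = [
    ("CSV1",),
    ("CSV1", "CSV5"),
    ("CSV3", "CSV5"),
    ("CSV4",),
    ("CSV4", "CSV6"),
]


def scenes_in_view(view: Tuple[str, ...]) -> Set[str]:
    """Return the Scenes that a Global View draws its CSVs from."""
    return {CSV_INFO[csv_id][0] for csv_id in view}


def views_for_media(media: str) -> List[Tuple[str, ...]]:
    """Return the Global Views whose CSVs are all of the given Media type."""
    return [
        view
        for view in GLOBAL_VIEWS
        if all(CSV_INFO[csv_id][1] == media for csv_id in view)
    ]
```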
8. Simultaneous Transmission Set Constraints
In many practical cases, a Provider has constraints or limitations on
its ability to send Captures simultaneously. One type of limitation
is caused by the physical limitations of capture mechanisms; these
constraints are represented by a Simultaneous Transmission Set. The
second type of limitation reflects the encoding resources available,
such as bandwidth or video encoding throughput (macroblocks/second).
This type of constraint is captured by Individual Encodings and
Encoding Groups, discussed below.
Some Endpoints or MCUs can send multiple Captures simultaneously;
however, sometimes there are constraints that limit which Captures
can be sent simultaneously with other Captures. A device may not be
able to be used in different ways at the same time. Provider
Advertisements are made so that the Consumer can choose one of
several possible mutually exclusive usages of the device. This type
of constraint is expressed in a Simultaneous Transmission Set, which
lists all the Captures of a particular Media type (e.g., audio,
video, or text) that can be sent at the same time. There are
different Simultaneous Transmission Sets for each Media type in the
Advertisement. This is easier to show in an example.
Consider the example of a room system where there are three cameras,
each of which can send a separate Capture covering two people each:
VC0, VC1, and VC2. The middle camera can also zoom out (using an
optical zoom lens) and show all six people, VC3. But the middle
camera cannot be used in both modes at the same time; it has to
either show the space where two participants sit or the whole six
seats, but not both at the same time. As a result, VC1 and VC3
cannot be sent simultaneously.
Simultaneous Transmission Sets are expressed as sets of the Media
Captures that the Provider could transmit at the same time (though,
in some cases, it is not intuitive to do so). If a Multiple Content
Capture is included in a Simultaneous Transmission Set, it indicates
that the Capture Encoding associated with it could be transmitted at
the same time as the other Captures within the Simultaneous
Transmission Set. It does not imply that the Single Media Captures
contained in the Multiple Content Capture could all be transmitted at
the same time.
In this example, the two Simultaneous Transmission Sets are shown in
Table 5. If a Provider advertises one or more mutually exclusive
Simultaneous Transmission Sets, then, for each Media type, the
Consumer MUST ensure that it chooses Media Captures that lie wholly
within one of those Simultaneous Transmission Sets.
+===================+
| Simultaneous Sets |
+===================+
| {VC0, VC1, VC2} |
+-------------------+
| {VC0, VC3, VC2} |
+-------------------+
Table 5: Two Simultaneous Transmission Sets
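The Consumer-side rule (chosen Captures of a Media type must lie
wholly within one Simultaneous Transmission Set) amounts to a subset
test. The sketch below applies it to the two video sets of Table 5;
the function name is hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence

# The two Simultaneous Transmission Sets of Table 5 (video only).
SIMULTANEOUS_SETS = [
    frozenset({"VC0", "VC1", "VC2"}),
    frozenset({"VC0", "VC3", "VC2"}),
]


def is_simultaneously_sendable(
    chosen: Iterable[str], simultaneous_sets: Sequence[FrozenSet[str]]
) -> bool:
    """True if the chosen Captures fit wholly within one of the sets."""
    chosen_set = frozenset(chosen)
    return any(chosen_set <= allowed for allowed in simultaneous_sets)
```

With these sets, choosing VC0 and VC3 is acceptable, whereas choosing
VC1 together with VC3 is not, which matches the middle-camera
constraint described above.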
A Provider OPTIONALLY can include the Simultaneous Transmission Sets
in its Advertisement. These constraints apply across all the Capture
Scenes in the Advertisement. It is a syntax-conformance requirement
that the Simultaneous Transmission Sets MUST allow all the Media
Captures in any particular CSV to be used simultaneously. Similarly,
the Simultaneous Transmission Sets MUST reflect the simultaneity
expressed by any Global View.
For shorthand convenience, a Provider MAY describe a Simultaneous
Transmission Set in terms of CSVs and Capture Scenes. If a CSV is
included in a Simultaneous Transmission Set, then all Media Captures
in the CSV are included in the Simultaneous Transmission Set. If a
Capture Scene is included in a Simultaneous Transmission Set, then
all its CSVs (of the corresponding Media type) are included in the
Simultaneous Transmission Set. The end result reduces to a set of
Media Captures, of a particular Media type, in either case.
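The shorthand expansion can be sketched as follows. All identifiers
in the usage note (Scene1, CSV1, and so on) and the three lookup
tables are illustrative assumptions; the sketch only shows how
entries naming Capture Scenes, CSVs, or Captures reduce to a set of
Media Captures of one Media type.

```python
from typing import Iterable, Mapping, Sequence, Set


def expand_simultaneous_set(
    entries: Iterable[str],
    media_type: str,
    capture_media: Mapping[str, str],
    csvs: Mapping[str, Sequence[str]],
    scenes: Mapping[str, Sequence[str]],
) -> Set[str]:
    """Expand Scene, CSV, and Capture identifiers into a set of Captures."""
    captures: Set[str] = set()
    for entry in entries:
        if entry in scenes:
            # A Capture Scene stands for all of its CSVs.
            for csv_id in scenes[entry]:
                captures.update(csvs[csv_id])
        elif entry in csvs:
            # A CSV stands for all of its Media Captures.
            captures.update(csvs[entry])
        elif entry in capture_media:
            captures.add(entry)
        else:
            raise KeyError(f"unknown identifier: {entry}")
    # Keep only the Captures of the Media type this set describes.
    return {c for c in captures if capture_media[c] == media_type}
```

For a hypothetical Scene1 with CSV1 = (VC0, VC1, VC2), CSV2 = (VC3),
and CSV3 = (AC0), listing Scene1 in a video set expands to VC0, VC1,
VC2, and VC3.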
If an Advertisement does not include Simultaneous Transmission Sets,
then the Provider MUST be able to simultaneously provide all the
Captures from any one CSV of each Media type from each Capture Scene.
Likewise, if there are no Simultaneous Transmission Sets and there is
a Global View list, then the Provider MUST be able to simultaneously
provide all the Captures from any particular Global View (of each
Media type) from the Global View list.
If an Advertisement includes multiple CSVs in a Capture Scene, then
the Consumer MAY choose one CSV for each Media type, or it MAY choose
individual Captures based on the Simultaneous Transmission Sets.
9. Encodings
Individual Encodings and Encoding Groups are CLUE's mechanisms
allowing a Provider to signal its limitations for sending Captures,
or combinations of Captures, to a Consumer. Consumers can map the
Captures they want to receive onto the Encodings, with the Encoding
parameters they want. As for the relationship between the CLUE-
specified mechanisms based on Encodings and the SIP offer/answer
exchange, please refer to Section 5.
9.1. Individual Encodings
An Individual Encoding represents a way to encode a Media Capture as
a Capture Encoding, to be sent as an encoded Media Stream from the
Provider to the Consumer. An Individual Encoding has a set of
parameters characterizing how the Media is encoded.
Different Media types have different parameters, and different
encoding algorithms may have different parameters. An Individual
Encoding can be assigned to at most one Capture Encoding at any given
time.
Individual Encoding parameters are represented in SDP [RFC4566], not
in CLUE messages. For example, for a video Encoding using H.26x
compression technologies, these can include parameters such as the
following:
* Maximum bandwidth;
* Maximum picture size in pixels;
* Maximum number of pixels to be processed per second.
The bandwidth parameter is the only one that specifically relates to
a CLUE Advertisement, as it can be further constrained by the maximum
group bandwidth in an Encoding Group.
9.2. Encoding Group
An Encoding Group includes a set of one or more Individual Encodings,
and parameters that apply to the group as a whole. By grouping
multiple Individual Encodings together, an Encoding Group describes
additional constraints on bandwidth for the group. A single Encoding
Group MAY refer to Encodings for different Media types.
The Encoding Group data structure contains:
* Maximum bitrate for all Encodings in the group combined;
* A list of identifiers for the Individual Encodings belonging to
the group.
When the Individual Encodings in a group are instantiated into
Capture Encodings, each Capture Encoding has a bitrate that MUST be
less than or equal to the max bitrate for the particular Individual
Encoding. The "maximum bitrate for all Encodings in the group"
parameter gives the additional restriction that the sum of all the
individual Capture Encoding bitrates MUST be less than or equal to
this group value.
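The two bitrate constraints can be checked together. In the sketch
below, the Encoding identifiers and bitrate figures are made up for
illustration, and bitrate_violations is a hypothetical helper; it
assumes every configured Encoding has a known per-Encoding maximum.

```python
from typing import List, Mapping


def bitrate_violations(
    group_max_bitrate: int,
    encoding_max_bitrate: Mapping[str, int],
    configured_bitrate: Mapping[str, int],
) -> List[str]:
    """List every violated constraint; an empty list means the choice is valid.

    configured_bitrate holds only the Individual Encodings that are
    instantiated into Capture Encodings.
    """
    problems: List[str] = []
    for enc_id, bitrate in configured_bitrate.items():
        limit = encoding_max_bitrate[enc_id]
        if bitrate > limit:
            problems.append(
                f"{enc_id}: {bitrate} exceeds Individual Encoding maximum {limit}"
            )
    total = sum(configured_bitrate.values())
    if total > group_max_bitrate:
        problems.append(f"group: total {total} exceeds group maximum {group_max_bitrate}")
    return problems
```

Note that each Capture Encoding can respect its own maximum while the
sum still exceeds the group value, so both checks are needed.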
The following diagram illustrates one example of the structure of a
Media Provider's Encoding Groups and their contents.
,-------------------------------------------------.
| Media Provider |
| |
| ,--------------------------------------. |
| | ,--------------------------------------. |
| | | ,--------------------------------------. |
| | | | Encoding Group | |
| | | | ,-----------. | |
| | | | | | ,---------. | |
| | | | | | | | ,---------.| |
| | | | | Encoding1 | |Encoding2| |Encoding3|| |
| `.| | | | | | `---------'| |
| `.| `-----------' `---------' | |
| `--------------------------------------' |
`-------------------------------------------------'
Figure 4: Encoding Group Structure
A Provider advertises one or more Encoding Groups. Each Encoding
Group includes one or more Individual Encodings. Each Individual
Encoding can represent a different way of encoding Media. For
example, one Individual Encoding may be 1080p60 video, another could
be 720p30, with a third being 352x288p30, all in, for example, H.264
format.
While a typical three-codec/display system might have one Encoding
Group per "codec box" (physical codec, connected to one camera and
one screen), there are many possibilities for the number of Encoding
Groups a Provider may be able to offer and for the Encoding values in
each Encoding Group.
There is no requirement for all Encodings within an Encoding Group to
be instantiated at the same time.
9.3. Associating Captures with Encoding Groups
Each Media Capture, including MCCs, MAY be associated with one
Encoding Group. To be eligible for configuration, a Media Capture
MUST be associated with one Encoding Group, which is used to
instantiate that Capture into a Capture Encoding. When an MCC is
configured, all the Media Captures referenced by the MCC will appear
in the Capture Encoding according to the attributes of the chosen
Encoding of the MCC. This allows an Advertiser to specify Encoding
attributes associated with the Media Captures without the need to
provide an individual Capture Encoding for each of the inputs.
If an Encoding Group is assigned to a Media Capture referenced by the
MCC, it indicates that this Capture may also have an individual
Capture Encoding.
For example:
+==================+=================+
| Capture Scene #1 | |
+==================+=================+
| VC1 | EncodeGroupID=1 |
+------------------+-----------------+
| VC2 | |
+------------------+-----------------+
| MCC1(VC1,VC2) | EncodeGroupID=2 |
+------------------+-----------------+
| CSV(VC1) | |
+------------------+-----------------+
| CSV(MCC1) | |
+------------------+-----------------+
Table 6: Example Usage of Encoding with MCC and Source Captures
This would indicate that VC1 may be sent as its own Capture Encoding
from EncodeGroupID=1 or that it may be sent as part of a Capture
Encoding from EncodeGroupID=2 along with VC2.
More than one Capture MAY use the same Encoding Group.
The maximum number of Capture Encodings that can result from a
particular Encoding Group constraint is equal to the number of
Individual Encodings in the group. The actual number of Capture
Encodings used at any time MAY be less than this maximum. Any of the
Captures that use a particular Encoding Group can be encoded
according to any of the Individual Encodings in the group.
It is a protocol conformance requirement that the Encoding Groups
MUST allow all the Captures in a particular CSV to be used
simultaneously.
10. Consumer's Choice of Streams to Receive from the Provider
After receiving the Provider's Advertisement message (which includes
Media Captures and associated constraints), the Consumer composes its
reply to the Provider in the form of a Configure message. The
Consumer is free to use the information in the Advertisement as it
chooses, but there are a few obviously sensible design choices, which
are outlined below.
If multiple Providers connect to the same Consumer (i.e., in an
MCU-less multiparty call), it is the responsibility of the Consumer
to compose a Configure for each Provider that fulfills both that
Provider's constraints, as expressed in its Advertisement, and the
Consumer's own capabilities.
In an MCU-based multiparty call, the MCU can logically terminate the
Advertisement/Configure negotiation in that it can hide the
characteristics of the receiving Endpoint and rely on its own
capabilities (transcoding/transrating/etc.) to create Media Streams
that can be decoded at the Endpoint Consumers. The timing of an
MCU's sending of Advertisements (for its outgoing ports) and
Configures (for its incoming ports, in response to Advertisements
received there) is up to the MCU and is implementation dependent.
As a general outline, a Consumer can choose, based on the
Advertisement it has received, which Captures it wishes to receive,
and which Individual Encodings it wants the Provider to use to encode
the Captures.
On receipt of an Advertisement with an MCC, the Consumer treats the
MCC as per other non-MCC Captures with the following differences:
* The Consumer would understand that the MCC is a Capture that
includes the referenced individual Captures (or any Captures, if
none are referenced) and that these individual Captures are
delivered as part of the MCC's Capture Encoding.
* The Consumer may utilize any of the attributes associated with the
referenced individual Captures and any Capture Scene attributes
from where the individual Captures were defined to choose Captures
and for Rendering decisions.
* If the MCC attribute Allow Subset Choice is true, then the
Consumer may or may not choose to receive all the indicated
Captures. It can choose to receive a subset of Captures indicated
by the MCC.
For example, if the Consumer receives:
MCC1(VC1,VC2,VC3){attributes}
A Consumer could choose all the Captures within an MCC; however, if
the Consumer determines that it doesn't want VC3, it can return
MCC1(VC1,VC2). If it wants all the individual Captures, then it
returns only the MCC identity (i.e., MCC1). If the MCC in the
Advertisement does not reference any individual Captures, or the
Allow Subset Choice attribute is false, then the Consumer cannot
choose what is included in the MCC: it is up to the Provider to
decide.
A Configure Message includes a list of Capture Encodings. These are
the Capture Encodings the Consumer wishes to receive from the
Provider. Each Capture Encoding refers to one Media Capture and one
Individual Encoding.
For each Capture the Consumer wants to receive, it configures one of
the Encodings in that Capture's Encoding Group. The Consumer does
this by telling the Provider, in its Configure Message, which
Encoding to use for each chosen Capture. Upon receipt of this
Configure from the Consumer, common knowledge is established between
Provider and Consumer regarding sensible choices for the Media
Streams. The setup of the actual Media channels, at least in the
simplest case, is left to a following offer/answer exchange.
Optimized implementations may speed up the reaction to the offer/
answer exchange by reserving the resources at the time of
finalization of the CLUE handshake.
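The pairing of Captures with Individual Encodings in a Configure can
be sketched as a consistency check. The Capture-to-group assignments
in the usage note follow Table 6, but the Encoding identifiers (ENC1,
ENC2, ENC3), the tuple representation of a Capture Encoding, and the
function name are assumptions of this sketch. It checks that each
chosen Encoding belongs to the Capture's Encoding Group and that no
Individual Encoding is assigned to more than one Capture Encoding.

```python
from typing import List, Mapping, Optional, Sequence, Set, Tuple


def configure_problems(
    capture_encodings: Sequence[Tuple[str, str]],
    capture_group: Mapping[str, Optional[int]],
    group_encodings: Mapping[int, Sequence[str]],
) -> List[str]:
    """Check (Capture, Individual Encoding) pairs from a Configure.

    Returns a list of problems; an empty list means the pairs are consistent.
    """
    problems: List[str] = []
    used: Set[str] = set()
    for capture, encoding in capture_encodings:
        group = capture_group.get(capture)
        if group is None:
            problems.append(f"{capture} has no Encoding Group and cannot be configured")
        elif encoding not in group_encodings.get(group, ()):
            problems.append(f"{encoding} is not in Encoding Group {group} of {capture}")
        if encoding in used:
            problems.append(f"{encoding} is assigned to more than one Capture Encoding")
        used.add(encoding)
    return problems
```

With VC1 in group 1, MCC1 in group 2, and VC2 without a group, a
Configure pairing VC1 with ENC1 and MCC1 with ENC3 passes, given
hypothetical groups 1 = (ENC1, ENC2) and 2 = (ENC3).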
CLUE Advertisements and Configure Messages don't necessarily require
a new SDP offer/answer for every CLUE message exchange. But the
resulting Encodings sent via RTP must conform to the most-recent SDP
offer/answer result.
In order to meaningfully create and send an initial Configure, the
Consumer needs to have received at least one Advertisement, and an
SDP offer defining the Individual Encodings, from the Provider.
In addition, the Consumer can send a Configure at any time during the
call. The Configure MUST be valid according to the most recently
received Advertisement. The Consumer can send a Configure either in
response to a new Advertisement from the Provider or on its own, for
example, because of a local change in conditions (people leaving the
room, connectivity changes, multipoint related considerations).
When choosing which Media Streams to receive from the Provider, and
the encoding characteristics of those Media Streams, the Consumer
advantageously takes several things into account: its local
preference, simultaneity restrictions, and encoding limits.
10.1. Local Preference
A variety of local factors influence the Consumer's choice of Media
Streams to be received from the Provider:
* If the Consumer is an Endpoint, it is likely that it would choose,
where possible, to receive Video and Audio Captures that match the
number of display devices and audio system it has.
* If the Consumer is an MCU, it may choose to receive loudest
speaker Streams (in order to perform its own Media composition)
and avoid pre-composed Video Captures.
* User choice (for instance, selection of a new layout) may result
in a different set of Captures, or different Encoding
characteristics, being required by the Consumer.
10.2. Physical Simultaneity Restrictions
Often there are physical simultaneity constraints of the Provider
that affect the Provider's ability to simultaneously send all of the
Captures the Consumer would wish to receive. For instance, an MCU,
when connected to a multi-camera room system, might prefer to receive
both individual video Streams of the people present in the room and
an overall view of the room from a single camera. Some Endpoint
systems might be able to provide both of these sets of Streams
simultaneously, whereas others might not (if the overall room view
were produced by changing the optical zoom level on the center
camera, for instance).
10.3. Encoding and Encoding Group Limits
Each of the Provider's Encoding Groups has limits on bandwidth, and
the constituent potential Encodings have limits on the bandwidth,
computational complexity, video frame rate, and resolution that can
be provided. When choosing the Captures to be received from a
Provider, a Consumer device MUST ensure that the Encoding
characteristics requested for each individual Capture fit within the
capability of the Encoding it is being configured to use, as well as
ensuring that the combined Encoding characteristics for Captures fit
within the capabilities of their associated Encoding Groups. In some
cases, this could cause an otherwise "preferred" choice of Capture
Encodings to be passed over in favor of different Capture Encodings
-- for instance, if a set of three Captures could only be provided at
a low resolution then a three screen device could switch to favoring
a single, higher quality, Capture Encoding.
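The two-level check described above (per-Encoding limits first, then the Encoding Group total) might be sketched as follows. All names and numbers here are illustrative assumptions, not part of the framework; in particular, the rule that one Encoding carries at most one Capture at a time is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingLimit:
    max_width: int
    max_height: int
    max_frame_rate: int
    max_bandwidth: int  # bit/s


@dataclass(frozen=True)
class Request:
    encoding_id: str
    width: int
    height: int
    frame_rate: int
    bandwidth: int  # bit/s


def fits_encoding(req: Request, limit: EncodingLimit) -> bool:
    """Check one requested Capture Encoding against its Encoding's limits."""
    return (req.width <= limit.max_width
            and req.height <= limit.max_height
            and req.frame_rate <= limit.max_frame_rate
            and req.bandwidth <= limit.max_bandwidth)


def fits_group(requests, limits, max_group_bandwidth) -> bool:
    """Check every request individually, then the combined bandwidth."""
    ids = [r.encoding_id for r in requests]
    if len(ids) != len(set(ids)):
        return False  # assumption: an Encoding carries one Capture at a time
    if not all(fits_encoding(r, limits[r.encoding_id]) for r in requests):
        return False
    return sum(r.bandwidth for r in requests) <= max_group_bandwidth


# Hypothetical group with two Encodings and a 6 Mbit/s group limit.
LIMITS = {
    "E1": EncodingLimit(1920, 1088, 60, 4_000_000),
    "E2": EncodingLimit(1280, 720, 30, 4_000_000),
}
ok = fits_group([Request("E1", 1920, 1080, 30, 4_000_000),
                 Request("E2", 1280, 720, 30, 2_000_000)], LIMITS, 6_000_000)
too_much = fits_group([Request("E1", 1920, 1080, 30, 4_000_000),
                       Request("E2", 1280, 720, 30, 3_000_000)], LIMITS, 6_000_000)
```

In the second call, each request is acceptable on its own, but the sum (7 Mbit/s) exceeds the group limit, which is the situation in which a Consumer would have to pass over its "preferred" choice.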
11. Extensibility
One important characteristic of the Framework is its extensibility.
The standard for interoperability and handling multiple Streams must
be future-proof. The framework itself is inherently extensible
through expanding the data model types. For example:
* Adding more types of Media, such as telemetry, can be done by
defining additional types of Captures in addition to audio and
video.
* Adding new functionalities, such as 3-D Video Captures, may
require additional attributes describing the Captures.
The infrastructure is designed to be extended rather than requiring
new infrastructure elements. Extension comes through adding to
defined types.
12. Examples - Using the Framework (Informative)
This section gives some examples, first from the point of view of the
Provider, then the Consumer, then some multipoint scenarios.
12.1. Provider Behavior
This section shows some examples in more detail of how a Provider can
use the framework to represent a typical case for telepresence rooms.
First, an Endpoint is illustrated, then an MCU case is shown.
12.1.1. Three-Screen Endpoint Provider
Consider an Endpoint with the following description:
Three cameras, three displays, and a six-person table
* Each camera can provide one Capture for each 1/3-section of the
table.
* A single Capture representing the active speaker can be provided
(voice-activity-based camera selection to a given encoder input
port implemented locally in the Endpoint).
* A single Capture representing the active speaker with the other
two Captures shown picture in picture (PiP) within the Stream can
be provided (again, implemented inside the Endpoint).
* A Capture showing a zoomed out view of all six seats in the room
can be provided.
The Video and Audio Captures for this Endpoint can be described as
follows.
Video Captures:
VC0 (the left camera Stream), Encoding Group=EG0, view=table
VC1 (the center camera Stream), Encoding Group=EG1, view=table
VC2 (the right camera Stream), Encoding Group=EG2, view=table
MCC3 (the loudest panel Stream), Encoding Group=EG1, view=table,
MaxCaptures=1, policy=SoundLevel
MCC4 (the loudest panel Stream with PiPs), Encoding Group=EG1,
view=room, MaxCaptures=3, policy=SoundLevel
VC5 (the zoomed out view of all people in the room), Encoding
Group=EG1, view=room
VC6 (presentation Stream), Encoding Group=EG1, presentation
The following diagram is a top view of the room with three cameras,
three displays, and six seats. Each camera captures two people. The
six seats are not all in a straight line.
,-. d
( )`--.__ +---+
`-' / `--.__ | |
,-. | `-.._ |_-+Camera 2 (VC2)
( ).' <--(AC1)-+-''`+-+
`-' |_...---'' | |
,-.c+-..__ +---+
( )| ``--..__ | |
`-' | ``+-..|_-+Camera 1 (VC1)
,-. | <--(AC2)..--'|+-+ ^
( )| __..--' | | |
`-'b|..--' +---+ |X
,-. |``---..___ | | |
( )\ ```--..._|_-+Camera 0 (VC0) |
`-' \ <--(AC0) ..-''`-+ |
,-. \ __.--'' | | <----------+
( ) |..-'' +---+ Y
`-' a (0,0,0) origin is under Camera 1
Figure 5: Room Layout Top View
The two points labeled 'b' and 'c' are intended to be at the midpoint
between the seating positions, and where the fields of view of the
cameras intersect.
The Plane of Interest for VC0 is a vertical plane that intersects
points 'a' and 'b'.
The Plane of Interest for VC1 intersects points 'b' and 'c'. The
plane of interest for VC2 intersects points 'c' and 'd'.
This example uses an area scale of millimeters.
Areas of capture:
      bottom left      bottom right     top left           top right
VC0   (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)   (-673,3000,757)
VC1   (-673,3000,0)    (673,3000,0)     (-673,3000,757)    (673,3000,757)
VC2   (673,3000,0)     (2011,2850,0)    (673,3000,757)     (2011,3000,757)
MCC3  (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)   (2011,3000,757)
MCC4  (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)   (2011,3000,757)
VC5   (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)   (2011,3000,757)
VC6   none
Points of capture:
VC0 (-1678,0,800)
VC1 (0,0,800)
VC2 (1678,0,800)
MCC3 none
MCC4 none
VC5 (0,0,800)
VC6 none
In this example, the right edge of the VC0 area lines up with the
left edge of the VC1 area. It doesn't have to be this way. There
could be a gap or an overlap. One additional thing to note for this
example is the distance from 'a' to 'b' is equal to the distance from
'b' to 'c' and the distance from 'c' to 'd'. All these distances are
1346 mm. This is the planar width of each Area of Capture for VC0,
VC1, and VC2.
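The 1346 mm figure can be checked from the bottom-edge coordinates of the Areas of Capture. The short calculation below is only a worked check; the point names follow Figure 5, and taking 'd' to be the bottom-right corner of VC2 is an assumption of this sketch.

```python
import math

# (x, y) floor coordinates in millimeters, from the Areas of Capture.
a = (-2011, 2850)  # bottom left of VC0
b = (-673, 3000)   # shared edge of VC0 and VC1
c = (673, 3000)    # shared edge of VC1 and VC2
d = (2011, 2850)   # bottom right of VC2 (assumed to be point 'd')


def planar_distance(p, q):
    """Horizontal distance between two points, ignoring height."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


widths = [planar_distance(p, q) for p, q in ((a, b), (b, c), (c, d))]
print([round(w) for w in widths])  # [1346, 1346, 1346]
```

The outer two widths come to about 1346.4 mm before rounding because the side cameras' planes are angled; the center width is exactly 1346 mm.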
Note that the text in parentheses (e.g., "the left camera Stream") is
explanatory text for this example only; it is not part of the model
and is not conveyed with the Media Captures and their attributes.
Also, MCC4 doesn't say anything about how a Capture is composed, so
the Media Consumer can't tell based on this Capture that MCC4 is
composed of a "loudest panel with PiPs".
Audio Captures:
Three ceiling microphones are located between the cameras and the
table, at the same height as the cameras. The microphones point down
at an angle toward the seating positions.
* AC0 (left), Encoding Group=EG3
* AC1 (right), Encoding Group=EG3
* AC2 (center), Encoding Group=EG3
* AC3 being a simple pre-mixed audio Stream from the room (mono),
Encoding Group=EG3
* AC4 audio Stream associated with the presentation video (mono)
Encoding Group=EG3, presentation
      Point of Capture:   Point on Line of Capture:
AC0   (-1342,2000,800)    (-1342,2925,379)
AC1   (1342,2000,800)     (1342,2925,379)
AC2   (0,2000,800)        (0,3000,379)
AC3   (0,2000,800)        (0,3000,379)
AC4   none
The physical simultaneity information is:
Simultaneous Transmission Set #1 {VC0, VC1, VC2, MCC3, MCC4, VC6}
Simultaneous Transmission Set #2 {VC0, VC2, VC5, VC6}
This constraint indicates that it is not possible to use all the VCs
at the same time. VC5 cannot be used at the same time as VC1 or MCC3
or MCC4. Also, using every member in the set simultaneously may not
make sense -- for example, MCC3 (loudest) and MCC4 (loudest with
PiP). In addition, there are Encoding constraints that make choosing
all of the VCs in a set impossible. VC1, MCC3, MCC4, VC5, and VC6
all use EG1 and EG1 has only three ENCs. This constraint shows up in
the Encoding Groups, not in the Simultaneous Transmission Sets.
In this example, there are no restrictions on which Audio Captures
can be sent simultaneously.
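The interaction of the two constraints on Video Captures can be sketched as a simple membership test. The function name and data layout are hypothetical; the set contents, the EG1 membership, and the count of three Encodings are taken from this example. Audio Captures are left out because they are unrestricted here.

```python
# Simultaneous Transmission Sets from the example above.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]

# Captures that share Encoding Group EG1, which has three Encodings.
EG1_CAPTURES = {"VC1", "MCC3", "MCC4", "VC5", "VC6"}
EG1_ENCODINGS = 3


def can_send_together(chosen) -> bool:
    """True if the chosen Video Captures satisfy both constraints."""
    chosen = set(chosen)
    # Physical constraint: all choices must lie within one set.
    in_one_set = any(chosen <= s for s in SIMULTANEOUS_SETS)
    # Encoding constraint: EG1 cannot serve more Captures than it has Encodings.
    eg1_ok = len(chosen & EG1_CAPTURES) <= EG1_ENCODINGS
    return in_one_set and eg1_ok
```

For instance, `can_send_together({"VC1", "VC5"})` is false for physical reasons, while `can_send_together({"VC1", "MCC3", "MCC4", "VC6"})` is false only because EG1 runs out of Encodings.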
Encoding Groups:
This example has three Encoding Groups associated with the Video
Captures. Each group can have three Encodings, but with each
potential Encoding having a progressively lower specification. In
this example, 1080p60 transmission is possible (as ENC0 has a maxPps
value compatible with that). Significantly, as up to three Encodings
are available per group, it is possible to transmit some Video
Captures simultaneously that are not in the same view in the Capture
Scene, for example, VC1 and MCC3 at the same time. The information
below about Encodings is a summary of what would be conveyed in SDP,
not directly in the CLUE Advertisement.
encodeGroupID=EG0, maxGroupBandwidth=6000000
    encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxPps=124416000, maxBandwidth=4000000
    encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                   maxPps=27648000, maxBandwidth=4000000
    encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                   maxPps=15552000, maxBandwidth=4000000
encodeGroupID=EG1, maxGroupBandwidth=6000000
    encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxPps=124416000, maxBandwidth=4000000
    encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                   maxPps=27648000, maxBandwidth=4000000
    encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                   maxPps=15552000, maxBandwidth=4000000
encodeGroupID=EG2, maxGroupBandwidth=6000000
    encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxPps=124416000, maxBandwidth=4000000
    encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                   maxPps=27648000, maxBandwidth=4000000
    encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                   maxPps=15552000, maxBandwidth=4000000
Figure 6: Example Encoding Groups for Video
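The statement that ENC0's maxPps value is compatible with 1080p60 can be verified arithmetically: 1920 x 1080 x 60 = 124416000 pixels per second. The helper below is a sketch with hypothetical names; it treats maxPps as a cap on width x height x frame rate, using the EG0 values from Figure 6.

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    return width * height * fps


# (maxWidth, maxHeight, maxFrameRate, maxPps) for EG0, from Figure 6.
ENCODINGS = {
    "ENC0": (1920, 1088, 60, 124416000),
    "ENC1": (1280, 720, 30, 27648000),
    "ENC2": (960, 544, 30, 15552000),
}


def allowed(enc_id: str, width: int, height: int, fps: int) -> bool:
    """True if the requested format respects all four limits."""
    max_w, max_h, max_fps, max_pps = ENCODINGS[enc_id]
    return (width <= max_w and height <= max_h and fps <= max_fps
            and pixels_per_second(width, height, fps) <= max_pps)


print(pixels_per_second(1920, 1080, 60))  # 124416000
print(allowed("ENC0", 1920, 1080, 60))    # True
print(allowed("ENC0", 1920, 1088, 60))    # False
```

The last line shows that the maxima are not independent: a 1920x1088 picture at 60 frames per second would be 125337600 pixels per second, which exceeds maxPps even though each dimension is within its own limit.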
For audio, there are five potential Encodings available, so all five
Audio Captures can be encoded at the same time.
encodeGroupID=EG3, maxGroupBandwidth=320000
    encodeID=ENC9, maxBandwidth=64000
    encodeID=ENC10, maxBandwidth=64000
    encodeID=ENC11, maxBandwidth=64000
    encodeID=ENC12, maxBandwidth=64000
    encodeID=ENC13, maxBandwidth=64000
Figure 7: Example Encoding Group for Audio
Capture Scenes:
The following table represents the Capture Scenes for this Provider.
Recall that a Capture Scene is composed of alternative CSVs covering
the same spatial region. Capture Scene #1 is for the main people
Captures, and Capture Scene #2 is for presentation.
Each row in the table is a separate CSV.
+==================+
| Capture Scene #1 |
+==================+
| VC0, VC1, VC2    |
+------------------+
| MCC3             |
+------------------+
| MCC4             |
+------------------+
| VC5              |
+------------------+
| AC0, AC1, AC2    |
+------------------+
| AC3              |
+==================+
| Capture Scene #2 |
+==================+
| VC6              |
+------------------+
| AC4              |
+------------------+
Table 7: Example CSVs
Different Capture Scenes are distinct from each other and do not
overlap. A Consumer can choose a view from each Capture Scene. In
this case, the three Captures, VC0, VC1, and VC2, are one way of
representing the video from the Endpoint. These three Captures
should appear adjacent to each other. Alternatively, another way of
representing the Capture Scene is with the Capture MCC3, which
automatically shows the person who is talking; this is the same for
the MCC4 and VC5 alternatives.
As in the video case, the different views of audio in Capture Scene
#1 represent the "same thing", in that one way to receive the audio
is with the three Audio Captures (AC0, AC1, and AC2), and another way
is with the mixed AC3. The Media Consumer can choose an audio CSV it
is capable of receiving.
The spatial ordering is conveyed by the Media Capture attributes
Area of Capture, Point of Capture, and Point on Line of Capture.
A Media Consumer would likely want to choose a CSV to receive,
partially based on how many Streams it can simultaneously receive. A
Consumer that can receive three video Streams would probably prefer
to receive the first view of Capture Scene #1 (VC0, VC1, and VC2) and
not receive the other views. A Consumer that can receive only one
video Stream would probably choose one of the other views.
If the Consumer can receive a presentation Stream too, it would also
choose to receive the only view from Capture Scene #2 (VC6).
12.1.2. Encoding Group Example
This is an example of an Encoding Group to illustrate how it can
express dependencies between Encodings. The information below about
Encodings is a summary of what would be conveyed in SDP, not directly
in the CLUE Advertisement.
encodeGroupID=EG0, maxGroupBandwidth=6000000
    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
                      maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
                      maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
    encodeID=AUDENC0, maxBandwidth=96000
    encodeID=AUDENC1, maxBandwidth=96000
    encodeID=AUDENC2, maxBandwidth=96000
Here, the Encoding Group is EG0. Although the Encoding Group is
capable of transmitting up to 6 Mbit/s, no individual video Encoding
can exceed 4 Mbit/s.
This Encoding Group also allows up to three audio Encodings,
AUDENC<0-2>. It is not required that audio and video Encodings
reside within the same Encoding Group, but if they do, then the group's
maxGroupBandwidth value is a limit on the sum of all audio and
video Encodings configured by the Consumer. A system that does not
wish or need to combine bandwidth limitations in this way should
instead use separate Encoding Groups for audio and video in order for
the bandwidth limitations on audio and video to not interact.
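As a worked illustration of the shared limit in the combined group, the arithmetic below shows how configured audio Encodings reduce what remains for video. The function names and the equal-split policy are assumptions of this sketch; the numeric limits are those of the EG0 example above.

```python
MAX_GROUP_BANDWIDTH = 6_000_000  # EG0 maxGroupBandwidth, bit/s
VIDEO_MAX_BANDWIDTH = 4_000_000  # maxBandwidth of each VIDENCn
AUDIO_MAX_BANDWIDTH = 96_000     # maxBandwidth of each AUDENCn


def video_budget(audio_streams: int, audio_rate: int = AUDIO_MAX_BANDWIDTH) -> int:
    """Bandwidth left for video after the configured audio Encodings."""
    return MAX_GROUP_BANDWIDTH - audio_streams * audio_rate


def per_video_rate(video_streams: int, audio_streams: int) -> int:
    """Equal split of the remaining budget, capped by the per-Encoding limit."""
    share = video_budget(audio_streams) // video_streams
    return min(share, VIDEO_MAX_BANDWIDTH)


print(video_budget(3))       # 5712000
print(per_video_rate(2, 3))  # 2856000
print(per_video_rate(1, 3))  # 4000000
```

With all three audio Encodings at 96 kbit/s, two video Encodings cannot both run at their 4 Mbit/s maximum; a single video Encoding, by contrast, is limited by its own maxBandwidth rather than by the group.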
Audio and video can be expressed in separate Encoding Groups, as in
this illustration.
encodeGroupID=EG0, maxGroupBandwidth=6000000
    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
                      maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
                      maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
encodeGroupID=EG1, maxGroupBandwidth=500000
    encodeID=AUDENC0, maxBandwidth=96000
    encodeID=AUDENC1, maxBandwidth=96000
    encodeID=AUDENC2, maxBandwidth=96000
12.1.3. The MCU Case
This section shows how an MCU might express its Capture Scenes,
intending to offer different choices for Consumers that can handle
different numbers of Streams. Each MCC is for video. A single Audio
Capture is provided for all single and multi-screen configurations
that can be associated (e.g., lip-synced) with any combination of
Video Captures (the MCCs) at the Consumer.
+==========================+==================================+
| Capture Scene #1         |                                  |
+==========================+==================================+
| MCC0                     | for a one-screen Consumer        |
+--------------------------+----------------------------------+
| MCC1, MCC2               | for a two-screen Consumer        |
+--------------------------+----------------------------------+
| MCC3, MCC4, MCC5         | for a three-screen Consumer      |
+--------------------------+----------------------------------+
| MCC6, MCC7, MCC8, MCC9   | for a four-screen Consumer       |
+--------------------------+----------------------------------+
| AC0                      | AC representing all participants |
+--------------------------+----------------------------------+
| CSV(MCC0)                |                                  |
+--------------------------+----------------------------------+
| CSV(MCC1,MCC2)           |                                  |
+--------------------------+----------------------------------+
| CSV(MCC3,MCC4,MCC5)      |                                  |
+--------------------------+----------------------------------+
| CSV(MCC6,MCC7,MCC8,MCC9) |                                  |
+--------------------------+----------------------------------+
| CSV(AC0)                 |                                  |
+--------------------------+----------------------------------+
Table 8: MCU Main Capture Scenes
If/when a presentation Stream becomes active within the Conference,
the MCU might re-advertise the available Media as:
+==================+======================================+
| Capture Scene #2 | Note                                 |
+==================+======================================+
| VC10             | Video Capture for presentation       |
+------------------+--------------------------------------+
| AC1              | Presentation audio to accompany VC10 |
+------------------+--------------------------------------+
| CSV(VC10)        |                                      |
+------------------+--------------------------------------+
| CSV(AC1)         |                                      |
+------------------+--------------------------------------+
Table 9: MCU Presentation Capture Scene
12.2. Media Consumer Behavior
This section gives an example of how a Media Consumer might behave
when deciding how to request Streams from the three-screen Endpoint
described in the previous section.
The receive side of a call needs to balance its requirements (based
on number of screens and speakers), its decoding capabilities,
available bandwidth, and the Provider's capabilities in order to
optimally configure the Provider's Streams. Typically, it would want
to receive and decode Media from each Capture Scene advertised by the
Provider.
A sane, basic algorithm might be for the Consumer to go through each
CSV in turn and find the collection of Video Captures that best
matches the number of screens it has (this might include
consideration of screens dedicated to presentation video display
rather than "people" video) and then decide between alternative views
in the video Capture Scenes based either on hard-coded preferences or
on user choice. Once this choice has been made, the Consumer would
then decide how to configure the Provider's Encoding Groups in order
to make best use of the available network bandwidth and its own
decoding capabilities.
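The first step of such an algorithm, picking the video CSV that best matches the number of screens, might look like the sketch below. It is only one possible policy: the function name is hypothetical, ties are resolved by Advertisement order rather than by user preference, and presentation screens and audio are not considered.

```python
# Video CSVs of Capture Scene #1 from the three-screen Endpoint example.
CAPTURE_SCENE_1_VIDEO = [
    ["VC0", "VC1", "VC2"],
    ["MCC3"],
    ["MCC4"],
    ["VC5"],
]


def pick_csv(csvs, screens: int):
    """Pick the CSV that uses the most screens without exceeding them.

    Among equally sized CSVs the first one advertised wins, since
    max() returns the first maximal element it encounters.
    """
    usable = [csv for csv in csvs if len(csv) <= screens]
    if not usable:
        return None
    return max(usable, key=len)


print(pick_csv(CAPTURE_SCENE_1_VIDEO, 3))  # ['VC0', 'VC1', 'VC2']
print(pick_csv(CAPTURE_SCENE_1_VIDEO, 1))  # ['MCC3']
```

A real Consumer would replace the tie-break with hard-coded preferences or user choice, as described above, before moving on to Encoding selection.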
12.2.1. One-Screen Media Consumer
MCC3, MCC4, and VC5 are all different views by themselves, not
grouped together in a single view, so the receiving device should
choose one of them. The choice would come down to whether
to see the greatest number of participants simultaneously at roughly
equal precedence (VC5), a switched view of just the loudest region
(MCC3), or a switched view with PiPs (MCC4). An Endpoint device with
a small amount of knowledge of these differences could offer a
dynamic choice of these options, in-call, to the user.
12.2.2. Two-Screen Media Consumer Configuring the Example
Mixing systems with an even number of screens, "2n", and those with
"2n+1" cameras (and vice versa) is always likely to be the
problematic case. In this instance, the behavior is likely to be
determined by whether a "two-screen" system is really a "two-decoder"
system, i.e., whether only one received Stream can be displayed per
screen or whether more than two Streams can be received and spread
across the available screen area. Three possible behaviors for the
two-screen system, when it learns that the far end is "ideally"
expressed via three Capture Streams, are:
1. Fall back to receiving just a single Stream (MCC3, MCC4, or VC5
as per the one-screen Consumer case above) and either leave one
screen blank or use it for presentation if/when a presentation
becomes active.
2. Receive three Streams (VC0, VC1, and VC2) and display them across
two screens, either with each Capture being scaled to 2/3 of a
screen and the center Capture being split across two screens, or,
as would be necessary if there were large bezels on the screens,
with each Stream being scaled to 1/2 the screen width and height
and there being a fourth "blank" panel. This fourth panel could
potentially be used for any presentation that became active
during the call.
3. Receive three Streams, decode all three, and use control
information indicating which was the most active to switch
between showing the left and center Streams (one per screen) and
the center and right Streams.
For an Endpoint capable of all three methods of working described
above, again it might be appropriate to offer the user the choice of
display mode.
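The third behavior in the list above can be sketched as a small state update. How the "most active" Capture is signaled is outside this example; the function and constant names are hypothetical, and keeping the current pair when the active Capture is already visible is a design choice of this sketch, not a requirement.

```python
LEFT_PAIR = ("VC0", "VC1")   # left and center Streams, one per screen
RIGHT_PAIR = ("VC1", "VC2")  # center and right Streams, one per screen


def next_pair(current, most_active):
    """Return the pair of Captures to show on the two screens."""
    if most_active not in ("VC0", "VC1", "VC2"):
        raise ValueError(f"unknown Capture: {most_active}")
    if most_active in current:
        return current  # already visible, so avoid a needless switch
    return LEFT_PAIR if most_active == "VC0" else RIGHT_PAIR
```

Because the center Capture belongs to both pairs, activity at the center of the far-end room never causes the display to switch.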
12.2.3. Three-Screen Media Consumer Configuring the Example
This is the most straightforward case: the Media Consumer would look
to identify a set of Streams to receive that best matched its
available screens; so, the VC0 plus VC1 plus VC2 should match
optimally. The spatial ordering would give sufficient information
for the correct Video Capture to be shown on the correct screen. The
Consumer would need to divide a single Encoding Group's capability by
3 either to determine what resolution and frame rate to configure the
Provider with or to configure the individual Video Captures' Encoding
Groups with what makes most sense (taking into account the receive
side decode capabilities, overall call bandwidth, the resolution of
the screens plus any user preferences such as motion vs. sharpness).
12.3. Multipoint Conference Utilizing Multiple Content Captures
The use of MCCs allows the MCU to construct outgoing Advertisements
describing complex Media switching and composition scenarios. The
following sections provide several examples.
Note: In the examples, the identities of the CLUE elements (e.g.,
Captures, Capture Scenes) in the incoming Advertisements overlap.
This is because there is no coordination between the Endpoints. The
MCU is responsible for making these unique in the outgoing
Advertisement.
12.3.1. Single Media Captures and MCC in the Same Advertisement
Four Endpoints are involved in a Conference where CLUE is used. An
MCU acts as a middlebox between the Endpoints with a CLUE channel
between each Endpoint and the MCU. The MCU receives the following
Advertisements.
+==================+================================+
| Capture Scene #1 | Description=AustralianConfRoom |
+==================+================================+
| VC1              | Description=Audience           |
|                  | EncodeGroupID=1                |
+------------------+--------------------------------+
| CSV(VC1)         |                                |
+------------------+--------------------------------+
Table 10: Advertisement Received from Endpoint A
+==================+===========================+
| Capture Scene #1 | Description=ChinaConfRoom |
+==================+===========================+
| VC1              | Description=Speaker       |
|                  | EncodeGroupID=1           |
+------------------+---------------------------+
| VC2              | Description=Audience      |
|                  | EncodeGroupID=1           |
+------------------+---------------------------+
| CSV(VC1, VC2)    |                           |
+------------------+---------------------------+
Table 11: Advertisement Received from Endpoint B
Note: Endpoint B indicates that it sends two Streams.
+==================+=========================+
| Capture Scene #1 | Description=USAConfRoom |
+==================+=========================+
| VC1              | Description=Audience    |
|                  | EncodeGroupID=1         |
+------------------+-------------------------+
| CSV(VC1)         |                         |
+------------------+-------------------------+
Table 12: Advertisement Received from
Endpoint C
If the MCU wanted to provide a Multiple Content Capture containing a
round-robin switched view of the audience from the three Endpoints
and the speaker, it could construct the following Advertisement:
+=======================+================================+
| Capture Scene #1      | Description=AustralianConfRoom |
+=======================+================================+
| VC1                   | Description=Audience           |
+-----------------------+--------------------------------+
| CSV(VC1)              |                                |
+=======================+================================+
| Capture Scene #2      | Description=ChinaConfRoom      |
+=======================+================================+
| VC2                   | Description=Speaker            |
+-----------------------+--------------------------------+
| VC3                   | Description=Audience           |
+-----------------------+--------------------------------+
| CSV(VC2, VC3)         |                                |
+=======================+================================+
| Capture Scene #3      | Description=USAConfRoom        |
+=======================+================================+
| VC4                   | Description=Audience           |
+-----------------------+--------------------------------+
| CSV(VC4)              |                                |
+=======================+================================+
| Capture Scene #4      |                                |
+=======================+================================+
| MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1            |
|                       | MaxCaptures=1                  |
|                       | EncodingGroup=1                |
+-----------------------+--------------------------------+
| CSV(MCC1)             |                                |
+-----------------------+--------------------------------+
Table 13: Advertisement Sent to Endpoint F - One Encoding
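The renumbering that takes the overlapping identities in Tables 10 to 12 to the unique identities in Table 13 can be sketched as below. The function name and the sequential numbering scheme are assumptions of this example; the framework only requires that the MCU make the identities unique, not how.

```python
from itertools import count


def renumber(advertisements):
    """Map (endpoint, local capture id) to an id unique across Endpoints.

    `advertisements` maps an Endpoint name to its Capture ids, in the
    order the MCU wants them to appear in its outgoing Advertisement.
    """
    counter = count(1)
    return {
        (endpoint, local_id): f"VC{next(counter)}"
        for endpoint, captures in advertisements.items()
        for local_id in captures
    }


# Capture ids as received from Endpoints A, B, and C.
incoming = {"A": ["VC1"], "B": ["VC1", "VC2"], "C": ["VC1"]}
mapping = renumber(incoming)
```

With this input, `mapping[("B", "VC1")]` is `"VC2"` (the speaker) and `mapping[("C", "VC1")]` is `"VC4"`, matching the identities used in Table 13.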
Alternatively, if the MCU wanted to provide the speaker as one Media
Stream and the audiences as another, it could assign an Encoding
Group to VC2 in Capture Scene #2 and provide a CSV in Capture Scene #4
as per the example below.
+===================+================================+
| Capture Scene #1  | Description=AustralianConfRoom |
+===================+================================+
| VC1               | Description=Audience           |
+-------------------+--------------------------------+
| CSV(VC1)          |                                |
+===================+================================+
| Capture Scene #2  | Description=ChinaConfRoom      |
+===================+================================+
| VC2               | Description=Speaker            |
|                   | EncodingGroup=1                |
+-------------------+--------------------------------+
| VC3               | Description=Audience           |
+-------------------+--------------------------------+
| CSV(VC2, VC3)     |                                |
+===================+================================+
| Capture Scene #3  | Description=USAConfRoom        |
+===================+================================+
| VC4               | Description=Audience           |
+-------------------+--------------------------------+
| CSV(VC4)          |                                |
+===================+================================+
| Capture Scene #4  |                                |
+===================+================================+
| MCC1(VC1,VC3,VC4) | Policy=RoundRobin:1            |
|                   | MaxCaptures=1                  |
|                   | EncodingGroup=1                |
|                   | AllowSubset=True               |
+-------------------+--------------------------------+
| MCC2(VC2)         | MaxCaptures=1                  |
|                   | EncodingGroup=1                |
+-------------------+--------------------------------+
| CSV2(MCC1,MCC2)   |                                |
+-------------------+--------------------------------+
Table 14: Advertisement Sent to Endpoint F - Two
Encodings
Therefore, a Consumer could choose whether or not to have a separate
speaker-related Stream and could choose which Endpoints to see. If
it wanted the second Stream but not the Australian conference room,
it could indicate the following Captures in the Configure message:
+---------------+----------+
| MCC1(VC3,VC4) | Encoding |
+---------------+----------+
| VC2           | Encoding |
+---------------+----------+
Table 15: MCU Case:
Consumer Response
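A Provider receiving such a Configure would need to check the requested subset against what it advertised. The sketch below shows one way to do so for the MCCs of Table 14; the dictionary layout and function name are hypothetical, and the rule that a subset must be non-empty is an assumption of this example.

```python
# MCCs as advertised in Table 14 (Capture Scene #4).
ADVERTISED = {
    "MCC1": {"sources": {"VC1", "VC3", "VC4"}, "allow_subset": True},
    "MCC2": {"sources": {"VC2"}, "allow_subset": False},
}


def valid_subset(mcc_id: str, requested) -> bool:
    """True if `requested` is an acceptable choice of sources for the MCC."""
    adv = ADVERTISED[mcc_id]
    requested = set(requested)
    if requested == adv["sources"]:
        return True  # the full advertised set is always acceptable
    # A proper subset is acceptable only when the MCC allows it.
    return bool(adv["allow_subset"] and requested and requested < adv["sources"])
```

The choice in Table 15, `valid_subset("MCC1", {"VC3", "VC4"})`, is accepted because MCC1 carries AllowSubset=True, whereas a request naming a Capture outside the advertised list, such as VC2, is not.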
12.3.2. Several MCCs in the Same Advertisement
Multiple MCCs can be used where multiple Streams are used to carry
Media from multiple Endpoints. For example:
A Conference has three Endpoints D, E, and F. Each Endpoint has
three Video Captures covering the left, middle, and right regions of
each conference room. The MCU receives the following Advertisements
from D and E.
+==================+================================+
| Capture Scene #1 | Description=AustralianConfRoom |
+==================+================================+
| VC1              | CaptureArea=Left               |
+------------------+--------------------------------+
|                  | EncodingGroup=1                |
+------------------+--------------------------------+
| VC2              | CaptureArea=Center             |
+------------------+--------------------------------+
|                  | EncodingGroup=1                |
+------------------+--------------------------------+
| VC3              | CaptureArea=Right              |
+------------------+--------------------------------+
|                  | EncodingGroup=1                |
+------------------+--------------------------------+
| CSV(VC1,VC2,VC3) |                                |
+------------------+--------------------------------+
Table 16: Advertisement Received from Endpoint D
+==================+===========================+
| Capture Scene #1 | Description=ChinaConfRoom |
+==================+===========================+
| VC1              | CaptureArea=Left          |
+------------------+---------------------------+
|                  | EncodingGroup=1           |
+------------------+---------------------------+
| VC2              | CaptureArea=Center        |
+------------------+---------------------------+
|                  | EncodingGroup=1           |
+------------------+---------------------------+
| VC3              | CaptureArea=Right         |
+------------------+---------------------------+
|                  | EncodingGroup=1           |
+------------------+---------------------------+
| CSV(VC1,VC2,VC3) |                           |
+------------------+---------------------------+
Table 17: Advertisement Received from Endpoint E
The MCU wants to offer Endpoint F three Capture Encodings. Each
Capture Encoding would contain all the Captures from either Endpoint
D or Endpoint E, depending on the active speaker. The MCU sends the
following Advertisement:
+=====================+================================+
| Capture Scene #1    | Description=AustralianConfRoom |
+=====================+================================+
| VC1                 |                                |
+---------------------+--------------------------------+
| VC2                 |                                |
+---------------------+--------------------------------+
| VC3                 |                                |
+---------------------+--------------------------------+
| CSV(VC1,VC2,VC3)    |                                |
+=====================+================================+
| Capture Scene #2    | Description=ChinaConfRoom      |
+=====================+================================+
| VC4                 |                                |
+---------------------+--------------------------------+
| VC5                 |                                |
+---------------------+--------------------------------+
| VC6                 |                                |
+---------------------+--------------------------------+
| CSV(VC4,VC5,VC6)    |                                |
+=====================+================================+
| Capture Scene #3    |                                |
+=====================+================================+
| MCC1(VC1,VC4)       | CaptureArea=Left               |
|                     | MaxCaptures=1                  |
|                     | SynchronizationID=1            |
|                     | EncodingGroup=1                |
+---------------------+--------------------------------+
| MCC2(VC2,VC5)       | CaptureArea=Center             |
|                     | MaxCaptures=1                  |
|                     | SynchronizationID=1            |
|                     | EncodingGroup=1                |
+---------------------+--------------------------------+
| MCC3(VC3,VC6)       | CaptureArea=Right              |
|                     | MaxCaptures=1                  |
|                     | SynchronizationID=1            |
|                     | EncodingGroup=1                |
+---------------------+--------------------------------+
| CSV(MCC1,MCC2,MCC3) |                                |
+---------------------+--------------------------------+
Table 18: Advertisement Sent to Endpoint F
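Because the three MCCs in Table 18 share SynchronizationID=1, they are expected to switch source Endpoint together. A sketch of the resulting selection follows; the mapping layout and function name are hypothetical, and how the MCU determines the active Endpoint is outside this example.

```python
# For each MCC of Table 18, the source Capture contributed by each Endpoint
# (D is the Australian room, VC1-VC3; E is the China room, VC4-VC6).
MCCS = {
    "MCC1": {"D": "VC1", "E": "VC4"},  # CaptureArea=Left
    "MCC2": {"D": "VC2", "E": "VC5"},  # CaptureArea=Center
    "MCC3": {"D": "VC3", "E": "VC6"},  # CaptureArea=Right
}


def current_sources(active_endpoint: str) -> dict:
    """Source Capture carried by each MCC while `active_endpoint` speaks."""
    return {mcc: by_endpoint[active_endpoint]
            for mcc, by_endpoint in MCCS.items()}
```

Selecting all three sources from a single Endpoint keeps the left, center, and right Capture Encodings showing the same room, so `current_sources("E")` yields VC4, VC5, and VC6.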
12.3.3. Heterogeneous Conference with Switching and Composition
Consider a Conference between Endpoints with the following
characteristics:
Endpoint A - 4 screens, 3 cameras
Endpoint B - 3 screens, 3 cameras
Endpoint C - 3 screens, 3 cameras
Endpoint D - 3 screens, 3 cameras
Endpoint E - 1 screen, 1 camera
Endpoint F - 2 screens, 1 camera
Endpoint G - 1 screen, 1 camera
This example focuses on what the user in one of the three-camera
multi-screen Endpoints sees. Call this person User A, at Endpoint A.
There are four large display screens at Endpoint A. Whenever
somebody at another site is speaking, all the Video Captures from
that Endpoint are shown on the large screens. If the talker is at a
three-camera site, then the video from those three cameras fills
three of the screens. If the person speaking is at a single-camera
site, then video from that camera fills one of the screens, while the
other screens show video from other single-camera Endpoints.
User A hears audio from the four loudest talkers.
User A can also see video from other Endpoints, in addition to the
current person speaking, although much smaller in size. Endpoint A
has four screens, so one of those screens shows up to nine other
Media Captures in a tiled fashion. When video from a three-camera
Endpoint appears in the tiled area, video from all three cameras
appears together across the screen with correct Spatial Relationship
among those three images.
+---+---+---+ +-------------+ +-------------+ +-------------+
|   |   |   | |             | |             | |             |
+---+---+---+ |             | |             | |             |
|   |   |   | |             | |             | |             |
+---+---+---+ |             | |             | |             |
|   |   |   | |             | |             | |             |
+---+---+---+ +-------------+ +-------------+ +-------------+
Figure 8: Endpoint A - Four-Screen Display
User B at Endpoint B sees a similar arrangement, except there are
only three screens, so the nine other Media Captures are spread out
across the bottom of the three displays, in a PiP format. When video
from a three-camera Endpoint appears in the PiP area, video from all
three cameras appears together across one screen with correct Spatial
Relationship.
+-------------+ +-------------+ +-------------+
|             | |             | |             |
|             | |             | |             |
|             | |             | |             |
| +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
| +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
+-------------+ +-------------+ +-------------+
Figure 9: Endpoint B - Three-Screen Display with PiPs
When somebody at a different Endpoint becomes the current speaker,
then User A and User B both see the video from the new person
speaking appear on their large screen area, while the previous
speaker takes one of the smaller tiled or PiP areas. The person who
is the current speaker doesn't see themselves; they see the previous
speaker in their large screen area.
One of the points of this example is that Endpoints A and B each want
to receive three Capture Encodings for their large display areas, and
nine Encodings for their smaller areas. A and B are each able to
send the same Configure message to the MCU, and each receive the same
conceptual Media Captures from the MCU. The differences are in how
they are Rendered and are purely a local matter at A and B.
The Advertisements for such a scenario are described below.
+=====================+========================+
| Capture Scene #1    | Description=Endpoint x |
+=====================+========================+
| VC1                 | EncodingGroup=1        |
+---------------------+------------------------+
| VC2                 | EncodingGroup=1        |
+---------------------+------------------------+
| VC3                 | EncodingGroup=1        |
+---------------------+------------------------+
| AC1                 | EncodingGroup=2        |
+---------------------+------------------------+
| CSV1(VC1, VC2, VC3) |                        |
+---------------------+------------------------+
| CSV2(AC1)           |                        |
+---------------------+------------------------+
Table 19: Advertisement Received at the MCU
from Endpoints A to D
+==================+========================+
| Capture Scene #1 | Description=Endpoint y |
+==================+========================+
| VC1              | EncodingGroup=1        |
+------------------+------------------------+
| AC1              | EncodingGroup=2        |
+------------------+------------------------+
| CSV1(VC1)        |                        |
+------------------+------------------------+
| CSV2(AC1)        |                        |
+------------------+------------------------+
Table 20: Advertisement Received at the
MCU from Endpoints E to G
Rather than considering what is displayed, CLUE concentrates more on
what the MCU sends. The MCU doesn't know anything about the number
of screens an Endpoint has.
As Endpoints A to D each advertise that three Captures make up a
Capture Scene, the MCU offers these in a "site switching" mode. That
is, there are three Multiple Content Captures (and Capture Encodings)
each switching between Endpoints. The MCU switches in the applicable
Media into the Stream based on voice activity. Endpoint A will not
see a Capture from itself.
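The construction of the three site-switched MCCs can be sketched as
follows. This is an illustrative sketch only, not something CLUE
defines: the SOURCES layout and the function build_site_switch_mccs
are hypothetical, and spreading single-camera Endpoints across the
Left, Center, and Right positions in round-robin order is an
assumption chosen to match the MCC source lists in Table 22.

```python
from itertools import cycle

AREAS = ("Left", "Center", "Right")

# Hypothetical input: per-Endpoint Captures as (id, CaptureArea) pairs.
# A CaptureArea of None marks a single-camera Endpoint.
SOURCES = {
    "A": [("VC1", "Left"), ("VC2", "Center"), ("VC3", "Right")],
    "B": [("VC4", "Left"), ("VC5", "Center"), ("VC6", "Right")],
    "C": [("VC7", "Left"), ("VC8", "Center"), ("VC9", "Right")],
    "D": [("VC10", "Left"), ("VC11", "Center"), ("VC12", "Right")],
    "E": [("VC13", None)],
    "F": [("VC14", None)],
    "G": [("VC15", None)],
}


def build_site_switch_mccs(sources, consumer):
    """Return {area: [capture ids]} for the three switched MCCs."""
    mccs = {area: [] for area in AREAS}
    spare_area = cycle(AREAS)  # spreads single-camera sites over the MCCs
    for endpoint, captures in sources.items():
        if endpoint == consumer:
            continue  # a Consumer is never offered its own Captures
        for capture_id, area in captures:
            target = area if area is not None else next(spare_area)
            mccs[target].append(capture_id)
    return mccs


if __name__ == "__main__":
    for area, members in build_site_switch_mccs(SOURCES, "A").items():
        print(area, members)
```

Calling the function with a different consumer (for example "B")
yields the corresponding lists for that Endpoint, with its own
Captures left out.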
Using the MCC concept, the MCU would send the following Advertisement
to Endpoint A:
+=====================+========================+
| Capture Scene #1    | Description=Endpoint B |
+=====================+========================+
| VC4                 | CaptureArea=Left       |
+---------------------+------------------------+
| VC5                 | CaptureArea=Center     |
+---------------------+------------------------+
| VC6                 | CaptureArea=Right      |
+---------------------+------------------------+
| AC1                 |                        |
+---------------------+------------------------+
| CSV(VC4,VC5,VC6)    |                        |
+---------------------+------------------------+
| CSV(AC1)            |                        |
+=====================+========================+
| Capture Scene #2    | Description=Endpoint C |
+=====================+========================+
| VC7                 | CaptureArea=Left       |
+---------------------+------------------------+
| VC8                 | CaptureArea=Center     |
+---------------------+------------------------+
| VC9                 | CaptureArea=Right      |
+---------------------+------------------------+
| AC2                 |                        |
+---------------------+------------------------+
| CSV(VC7,VC8,VC9)    |                        |
+---------------------+------------------------+
| CSV(AC2)            |                        |
+=====================+========================+
| Capture Scene #3    | Description=Endpoint D |
+=====================+========================+
| VC10                | CaptureArea=Left       |
+---------------------+------------------------+
| VC11                | CaptureArea=Center     |
+---------------------+------------------------+
| VC12                | CaptureArea=Right      |
+---------------------+------------------------+
| AC3                 |                        |
+---------------------+------------------------+
| CSV(VC10,VC11,VC12) |                        |
+---------------------+------------------------+
| CSV(AC3)            |                        |
+=====================+========================+
| Capture Scene #4    | Description=Endpoint E |
+=====================+========================+
| VC13                |                        |
+---------------------+------------------------+
| AC4                 |                        |
+---------------------+------------------------+
| CSV(VC13)           |                        |
+---------------------+------------------------+
| CSV(AC4)            |                        |
+=====================+========================+
| Capture Scene #5    | Description=Endpoint F |
+=====================+========================+
| VC14                |                        |
+---------------------+------------------------+
| AC5                 |                        |
+---------------------+------------------------+
| CSV(VC14)           |                        |
+---------------------+------------------------+
| CSV(AC5)            |                        |
+=====================+========================+
| Capture Scene #6    | Description=Endpoint G |
+=====================+========================+
| VC15                |                        |
+---------------------+------------------------+
| AC6                 |                        |
+---------------------+------------------------+
| CSV(VC15)           |                        |
+---------------------+------------------------+
| CSV(AC6)            |                        |
+---------------------+------------------------+
Table 21: Advertisement Sent to Endpoint A -
Source Part
The above part of the Advertisement presents information about the
sources to the MCC. The information is effectively the same as the
received Advertisements, except that there are no Capture Encodings
associated with them and the identities have been renumbered.
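A minimal sketch of the renumbering step, assuming the MCU simply
walks the received Advertisements in order and assigns fresh
identities. The function renumber_sources and the RECEIVED layout are
hypothetical; starting the video identities at 4 only mirrors Table 21
and is not required by CLUE.

```python
# Hypothetical summary of Tables 19 and 20: each Endpoint's own,
# locally numbered Captures as received by the MCU.
RECEIVED = {
    "A": ["VC1", "VC2", "VC3", "AC1"],
    "B": ["VC1", "VC2", "VC3", "AC1"],
    "C": ["VC1", "VC2", "VC3", "AC1"],
    "D": ["VC1", "VC2", "VC3", "AC1"],
    "E": ["VC1", "AC1"],
    "F": ["VC1", "AC1"],
    "G": ["VC1", "AC1"],
}


def renumber_sources(received, consumer, first_video=4, first_audio=1):
    """Map (endpoint, original id) to the id used toward the consumer."""
    mapping = {}
    video, audio = first_video, first_audio
    for endpoint, capture_ids in received.items():
        if endpoint == consumer:
            continue  # the consumer's own Captures are not advertised back
        for original in capture_ids:
            if original.startswith("VC"):
                mapping[(endpoint, original)] = f"VC{video}"
                video += 1
            else:
                mapping[(endpoint, original)] = f"AC{audio}"
                audio += 1
    return mapping


if __name__ == "__main__":
    for key, new_id in renumber_sources(RECEIVED, "A").items():
        print(key, "->", new_id)
```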
In addition to the source Capture information, the MCU advertises
site switching of Endpoints B to G in three Streams.
+=====================+==============================+
| Capture Scene #7    | Description=Output3streammix |
+=====================+==============================+
| MCC1(VC4,VC7,VC10,  | CaptureArea=Left             |
| VC13)               | MaxCaptures=1                |
|                     | SynchronizationID=1          |
|                     | Policy=SoundLevel:0          |
|                     | EncodingGroup=1              |
+---------------------+------------------------------+
| MCC2(VC5,VC8,VC11,  | CaptureArea=Center           |
| VC14)               | MaxCaptures=1                |
|                     | SynchronizationID=1          |
|                     | Policy=SoundLevel:0          |
|                     | EncodingGroup=1              |
+---------------------+------------------------------+
| MCC3(VC6,VC9,VC12,  | CaptureArea=Right            |
| VC15)               | MaxCaptures=1                |
|                     | SynchronizationID=1          |
|                     | Policy=SoundLevel:0          |
|                     | EncodingGroup=1              |
+---------------------+------------------------------+
| MCC4() (for audio)  | CaptureArea=whole Scene      |
|                     | MaxCaptures=1                |
|                     | Policy=SoundLevel:0          |
|                     | EncodingGroup=2              |
+---------------------+------------------------------+
| MCC5() (for audio)  | CaptureArea=whole Scene      |
|                     | MaxCaptures=1                |
|                     | Policy=SoundLevel:1          |
|                     | EncodingGroup=2              |
+---------------------+------------------------------+
| MCC6() (for audio)  | CaptureArea=whole Scene      |
|                     | MaxCaptures=1                |
|                     | Policy=SoundLevel:2          |
|                     | EncodingGroup=2              |
+---------------------+------------------------------+
| MCC7() (for audio)  | CaptureArea=whole Scene      |
|                     | MaxCaptures=1                |
|                     | Policy=SoundLevel:3          |
|                     | EncodingGroup=2              |
+---------------------+------------------------------+
| CSV(MCC1,MCC2,MCC3) |                              |
+---------------------+------------------------------+
| CSV(MCC4,MCC5,MCC6, |                              |
| MCC7)               |                              |
+---------------------+------------------------------+
Table 22: Advertisement Sent to Endpoint A -
Switching Parts
The above part describes the three main switched Streams that relate
to site switching. MaxCaptures=1 indicates that only one Capture
from the MCC is sent at a particular time. SynchronizationID=1
indicates that the MCCs sharing that value are switched in a
synchronized manner. The Provider can
choose to group together VC13, VC14, and VC15 for the purpose of
switching according to the SynchronizationID. Therefore, when the
Provider switches one of them into an MCC, it can also switch the
others even though they are not part of the same Capture Scene.
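The synchronized switching described above might be modeled as shown
below. The grouping table SYNC_GROUPS and the function
switch_synchronized are assumptions made for illustration; CLUE
leaves the actual switching algorithm to the Provider.

```python
# Source lists of the three video MCCs, copied from Table 22.
MCC_SOURCES = {
    "MCC1": ["VC4", "VC7", "VC10", "VC13"],
    "MCC2": ["VC5", "VC8", "VC11", "VC14"],
    "MCC3": ["VC6", "VC9", "VC12", "VC15"],
}

# Hypothetical Provider-side grouping: Captures that are switched in
# together. The last group joins three single-camera Endpoints.
SYNC_GROUPS = [
    {"VC4", "VC5", "VC6"},
    {"VC7", "VC8", "VC9"},
    {"VC10", "VC11", "VC12"},
    {"VC13", "VC14", "VC15"},
]


def switch_synchronized(active_capture):
    """Pick one source per MCC so that all three change together."""
    group = next(g for g in SYNC_GROUPS if active_capture in g)
    selection = {}
    for mcc, candidates in MCC_SOURCES.items():
        # MaxCaptures=1: exactly one member of the group feeds each MCC.
        (chosen,) = [c for c in candidates if c in group]
        selection[mcc] = chosen
    return selection


if __name__ == "__main__":
    print(switch_synchronized("VC8"))
    print(switch_synchronized("VC14"))
```

With this grouping, a talker at Endpoint F (VC14) brings VC13 and
VC15 into the left and right MCCs at the same moment, even though
those Captures belong to other Capture Scenes.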
All the audio for the Conference is included in Scene #7. There
isn't necessarily a one-to-one relation between any Audio Capture and
Video Capture in this Scene. Typically, a change in the loudest
talker will cause the MCU to switch the audio Streams more quickly
than switching video Streams.
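One way to fill the four audio MCCs is sketched below, under the
assumption that the MCU has a current sound level per Audio Capture
and that SoundLevel:0 denotes the loudest source. The function name
assign_audio_mccs and the numeric levels are hypothetical.

```python
def assign_audio_mccs(levels, mcc_ids=("MCC4", "MCC5", "MCC6", "MCC7")):
    """Map each audio MCC to an Audio Capture, loudest first.

    levels maps an Audio Capture id to its measured sound level.
    """
    ranked = sorted(levels, key=levels.get, reverse=True)
    # zip stops at the shorter input, so only the top four are used.
    return dict(zip(mcc_ids, ranked))


if __name__ == "__main__":
    sample = {"AC1": 0.1, "AC2": 0.9, "AC3": 0.4,
              "AC4": 0.0, "AC5": 0.7, "AC6": 0.3}
    print(assign_audio_mccs(sample))
```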
The MCU can also supply nine Media Streams showing the active and
previous eight speakers. It includes the following in the
Advertisement:
+========================+===========================+
| Capture Scene #8       | Description=Output9stream |
+========================+===========================+
| MCC8(VC4,VC5,VC6,VC7,  | MaxCaptures=1             |
| VC8,VC9,VC10,VC11,     | Policy=SoundLevel:0       |
| VC12,VC13,VC14,VC15)   | EncodingGroup=1           |
+------------------------+---------------------------+
| MCC9(VC4,VC5,VC6,VC7,  | MaxCaptures=1             |
| VC8,VC9,VC10,VC11,     | Policy=SoundLevel:1       |
| VC12,VC13,VC14,VC15)   | EncodingGroup=1           |
+========================+===========================+
| to                     | to                        |
+========================+===========================+
| MCC16(VC4,VC5,VC6,VC7, | MaxCaptures=1             |
| VC8,VC9,VC10,VC11,     | Policy=SoundLevel:8       |
| VC12,VC13,VC14,VC15)   | EncodingGroup=1           |
+------------------------+---------------------------+
| CSV(MCC8,MCC9,MCC10,   |                           |
| MCC11,MCC12,MCC13,     |                           |
| MCC14,MCC15,MCC16)     |                           |
+------------------------+---------------------------+
Table 23: Advertisement Sent to Endpoint A -
9 Switched Parts
The above part indicates that there are nine Capture Encodings. Each
of the Capture Encodings may contain any Captures from any source
site with a maximum of one Capture at a time. Which Capture is
present is determined by the policy. The MCCs in this Scene do not
have any spatial attributes.
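The policy for these nine MCCs can be pictured as a
most-recent-speaker list. The sketch below shows one plausible
implementation using the hypothetical class SpeakerHistory; how a
Provider detects a new speaker is outside what it shows.

```python
class SpeakerHistory:
    """Tracks the active speaker and the previous speakers, newest first."""

    def __init__(self, size=9):
        self.size = size
        self.order = []  # index 0 is the current speaker

    def speak(self, capture_id):
        """Record that capture_id has become the active speaker."""
        if capture_id in self.order:
            self.order.remove(capture_id)
        self.order.insert(0, capture_id)
        del self.order[self.size:]  # keep at most `size` entries

    def mcc_contents(self, first_mcc=8):
        """Map MCC8 (SoundLevel:0) onward to the Capture each carries."""
        return {f"MCC{first_mcc + i}": c for i, c in enumerate(self.order)}


if __name__ == "__main__":
    history = SpeakerHistory()
    for capture in ["VC4", "VC13", "VC4", "VC8"]:
        history.speak(capture)
    print(history.mcc_contents())
```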
Note: The Provider alternatively could provide each of the MCCs above
in its own Capture Scene.
If the MCU wanted to provide a composed Capture Encoding containing
all of the nine Captures, it could advertise in addition:
+========================+=======================+
| Capture Scene #9       | Description=NineTiles |
+========================+=======================+
| MCC17(MCC8,MCC9,MCC10, | MaxCaptures=9         |
| MCC11,MCC12,MCC13,     | EncodingGroup=1       |
| MCC14,MCC15,MCC16)     |                       |
+------------------------+-----------------------+
| CSV(MCC17)             |                       |
+------------------------+-----------------------+
Table 24: Advertisement Sent to Endpoint A -
9 Composed Parts
As MaxCaptures is 9, it indicates that the Capture Encoding contains
information from nine sources at a time.
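A Provider composing nine Captures into a single Encoding needs some
tile layout. The following sketch computes a 3x3 grid like the tiled
screen in Figure 8. The function tile_rects and the frame size are
illustrative assumptions, since CLUE does not prescribe a composition
layout.

```python
def tile_rects(width, height, rows=3, cols=3):
    """Return (x, y, w, h) rectangles, row by row, for a tiled frame."""
    tile_w, tile_h = width // cols, height // rows
    return [
        (col * tile_w, row * tile_h, tile_w, tile_h)
        for row in range(rows)
        for col in range(cols)
    ]


if __name__ == "__main__":
    for index, rect in enumerate(tile_rects(1920, 1080)):
        print(index, rect)
```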
The Advertisement to Endpoint B is identical to the above, other than
the fact that Captures from Endpoint A would be added and the
Captures from Endpoint B would be removed. Whether the Captures are
Rendered on a four-screen display or a three-screen display is up to
the Consumer to determine. The Consumer wants to place Video
Captures from the same original source Endpoint together, in the
correct spatial order, but the MCCs do not have spatial attributes.
So, the Consumer needs to associate incoming Media packets with the
original individual Captures in the Advertisement (such as VC4, VC5,
and VC6) in order to know the spatial information it needs for
correct placement on the screens. The Provider can use the RTCP
CaptureId source description (SDES) item and associated RTP header
extension, as described in [RFC8849], to convey this information to
the Consumer.
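The association step can be sketched as a table lookup. The sketch
assumes the CaptureId has already been extracted from the RTP header
extension or SDES item (parsing is not shown), and the CAPTURE_INFO
table and the arrange_tiles function are hypothetical names.

```python
# Built by the Consumer from the source part of the Advertisement
# (Table 21): CaptureId -> (source Endpoint, CaptureArea).
CAPTURE_INFO = {
    "VC4": ("Endpoint B", "Left"),
    "VC5": ("Endpoint B", "Center"),
    "VC6": ("Endpoint B", "Right"),
    "VC7": ("Endpoint C", "Left"),
    "VC8": ("Endpoint C", "Center"),
    "VC9": ("Endpoint C", "Right"),
    "VC10": ("Endpoint D", "Left"),
    "VC11": ("Endpoint D", "Center"),
    "VC12": ("Endpoint D", "Right"),
    "VC13": ("Endpoint E", None),
    "VC14": ("Endpoint F", None),
    "VC15": ("Endpoint G", None),
}

AREA_ORDER = {"Left": 0, "Center": 1, "Right": 2}


def arrange_tiles(received):
    """Group received Captures by source Endpoint, in left-to-right order.

    received maps an MCC id to the CaptureId it currently carries.
    """
    groups = {}
    for capture_id in received.values():
        endpoint, area = CAPTURE_INFO[capture_id]
        position = AREA_ORDER.get(area, 0)  # single-camera sites use 0
        groups.setdefault(endpoint, []).append((position, capture_id))
    return {ep: [cid for _, cid in sorted(items)] for ep, items in groups.items()}


if __name__ == "__main__":
    now = {"MCC8": "VC6", "MCC9": "VC13", "MCC10": "VC4", "MCC11": "VC5"}
    print(arrange_tiles(now))
```

The Consumer can then render each group side by side, whatever order
the switched MCCs happened to deliver the Captures in.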
12.3.4. Heterogeneous Conference with Voice-Activated Switching
This example illustrates how multipoint "voice-activated switching"
behavior can be realized, with an Endpoint making its own decision
about which of its outgoing video Streams is considered the "active
talker" from that Endpoint. Then, an MCU can decide which is the
active talker among the whole Conference.
Consider a Conference between Endpoints with the following
characteristics:
Endpoint A - 3 screens, 3 cameras
Endpoint B - 3 screens, 3 cameras
Endpoint C - 1 screen, 1 camera
This example focuses on what the user at Endpoint C sees. The user
would like to see the Video Capture of the current talker, without
composing it with any other Video Capture. In this example, Endpoint
C is capable of receiving only a single video Stream. The following
tables describe Advertisements from Endpoints A and B to the MCU, and
from the MCU to Endpoint C, that can be used to accomplish this.
+===================+=========================+
| Capture Scene #1  | Description=Endpoint x  |
+===================+=========================+
| VC1               | CaptureArea=Left        |
|                   | EncodingGroup=1         |
+-------------------+-------------------------+
| VC2               | CaptureArea=Center      |
|                   | EncodingGroup=1         |
+-------------------+-------------------------+
| VC3               | CaptureArea=Right       |
|                   | EncodingGroup=1         |
+-------------------+-------------------------+
| MCC1(VC1,VC2,VC3) | MaxCaptures=1           |
|                   | CaptureArea=whole Scene |
|                   | Policy=SoundLevel:0     |
|                   | EncodingGroup=1         |
+-------------------+-------------------------+
| AC1               | CaptureArea=whole Scene |
|                   | EncodingGroup=2         |
+-------------------+-------------------------+
| CSV1(VC1, VC2,    |                         |
| VC3)              |                         |
+-------------------+-------------------------+
| CSV2(MCC1)        |                         |
+-------------------+-------------------------+
| CSV3(AC1)         |                         |
+-------------------+-------------------------+
Table 25: Advertisement Received at the MCU
from Endpoints A and B
Endpoints A and B are advertising each individual Video Capture, and
also a switched Capture MCC1 that switches between the other three
based on who is the active talker. These Endpoints do not advertise
distinct Audio Captures associated with each individual Video
Capture, so it would be impossible for the MCU (as a Media Consumer)
to make its own determination of which Video Capture is the active
talker based just on information in the audio Streams.
+======================+========================+
| Capture Scene #1     | Description=conference |
+======================+========================+
| MCC1()               | CaptureArea=Left       |
|                      | MaxCaptures=1          |
|                      | SynchronizationID=1    |
|                      | Policy=SoundLevel:0    |
|                      | EncodingGroup=1        |
+----------------------+------------------------+
| MCC2()               | CaptureArea=Center     |
|                      | MaxCaptures=1          |
|                      | SynchronizationID=1    |
|                      | Policy=SoundLevel:0    |
|                      | EncodingGroup=1        |
+----------------------+------------------------+
| MCC3()               | CaptureArea=Right      |
|                      | MaxCaptures=1          |
|                      | SynchronizationID=1    |
|                      | Policy=SoundLevel:0    |
|                      | EncodingGroup=1        |
+----------------------+------------------------+
| MCC4()               | CaptureArea=whole      |
|                      | Scene                  |
|                      | MaxCaptures=1          |
|                      | Policy=SoundLevel:0    |
|                      | EncodingGroup=1        |
+----------------------+------------------------+
| MCC5() (for audio)   | CaptureArea=whole      |
|                      | Scene                  |
|                      | MaxCaptures=1          |
|                      | Policy=SoundLevel:0    |
|                      | EncodingGroup=2        |
+----------------------+------------------------+
| MCC6() (for audio)   | CaptureArea=whole      |
|                      | Scene                  |
|                      | MaxCaptures=1          |
|                      | Policy=SoundLevel:1    |
|                      | EncodingGroup=2        |
+----------------------+------------------------+
| CSV1(MCC1,MCC2,MCC3) |                        |
+----------------------+------------------------+
| CSV2(MCC4)           |                        |
+----------------------+------------------------+
| CSV3(MCC5,MCC6)      |                        |
+----------------------+------------------------+
Table 26: Advertisement Sent from the MCU to
Endpoint C
The MCU advertises one Scene, with four video MCCs. Three of them in
CSV1 give a left, center, and right view of the Conference, with site
switching. MCC4 provides a single Video Capture representing a view
of the whole Conference. The MCU intends for MCC4 to be switched
between all the other original source Captures. In this example, the
MCU's Advertisement does not give all the information about the other
Endpoints' Scenes or which of those Captures are included in the MCCs.
The MCU could include all of that if it wants to give the Consumers
more information, but it is not necessary for this example scenario.
The Provider advertises MCC5 and MCC6 for audio. Both are switched
Captures, with different SoundLevel policies indicating they are the
top two dominant talkers. The Provider advertises CSV3 with both
MCCs, suggesting the Consumer should use both if it can.
Endpoint C, in its Configure Message to the MCU, requests to receive
MCC4 for video and MCC5 and MCC6 for audio. In order for the MCU to
get the information it needs to construct MCC4, it has to send
Configure Messages to Endpoints A and B asking to receive MCC1 from
each of them, along with their AC1 audio. Now the MCU can use audio
energy information from the two incoming audio Streams from Endpoints
A and B to determine which of those alternatives is the current
talker. Based on that, the MCU uses either MCC1 from A or MCC1 from
B as the source of MCC4 to send to Endpoint C.
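The MCU's selection for MCC4 could look like the following sketch. It
assumes the MCU already measures an audio energy value for each
Endpoint's AC1 Stream. The function name select_mcc4_source and the
values are hypothetical, and a real MCU would likely add smoothing to
avoid rapid switching.

```python
def select_mcc4_source(audio_energy):
    """Return (endpoint, capture) to forward as MCC4 toward Endpoint C.

    audio_energy maps an Endpoint name to the current energy of its
    AC1 audio Stream. The Endpoint's own MCC1 already follows its
    local active talker, so the MCU only has to pick the Endpoint.
    """
    loudest = max(audio_energy, key=audio_energy.get)
    return (loudest, "MCC1")


if __name__ == "__main__":
    print(select_mcc4_source({"A": 0.2, "B": 0.7}))
```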
13. IANA Considerations
This document has no IANA actions.
14. Security Considerations
There are several potential attacks related to telepresence, and
specifically to the protocols used by CLUE. They stem from the nature
of conferencing sessions, the natural involvement of multiple
Endpoints, and the many, often user-invoked, capabilities provided by
the systems.
An MCU involved in a CLUE session can experience many of the same
attacks as a conferencing system such as the one enabled by the
Conference Information Data Model for Centralized Conferencing (XCON)
framework [RFC5239]. Examples of attacks include the following: an
Endpoint attempting to listen to sessions in which it is not
authorized to participate, an Endpoint attempting to disconnect or
mute other users, and theft of service by an Endpoint attempting
to create telepresence sessions it is not allowed to create. Thus,
it is RECOMMENDED that an MCU implementing the protocols necessary to
support CLUE follow the security recommendations specified in the
conference control protocol documents. In the case of CLUE, SIP is
the conferencing protocol; thus, the security considerations in
[RFC4579] MUST be followed. Other security issues related to MCUs
are discussed in the XCON framework [RFC5239]. The use of xCard with
potentially sensitive information provides another reason to
implement recommendations in Section 11 of [RFC5239].
One primary security concern surrounding the CLUE framework
introduced in this document involves securing the actual protocols
and the associated authorization mechanisms. These concerns apply to
Endpoint-to-Endpoint sessions as well as sessions involving multiple
Endpoints and MCUs. Figure 2 in Section 5 provides a basic flow of
information exchange for CLUE and the protocols involved.
As described in Section 5, CLUE uses SIP/SDP to establish the session
prior to exchanging any CLUE-specific information. Thus, the
security mechanisms recommended for SIP [RFC3261], including user
authentication and authorization, MUST be supported. In addition,
the Media MUST be secured. Datagram Transport Layer Security (DTLS)
/ Secure Real-time Transport Protocol (SRTP) MUST be supported and
SHOULD be used unless the Media, which is based on RTP, is secured by
other means (see [RFC7201] [RFC7202]). Media security is also
discussed in [RFC8848] and [RFC8849]. Note that SIP call setup is
done before any CLUE-specific information is available, so the
authentication and authorization are based on the SIP mechanisms.
The entity that will be authenticated may use the Endpoint identity
or the Endpoint user identity; this is an application issue and not a
CLUE-specific issue.
A separate data channel is established to transport the CLUE protocol
messages. The contents of the CLUE protocol messages are based on
information introduced in this document. The CLUE data model
[RFC8846] defines, through an XML schema, the syntax to be used. One
type of information that could possibly introduce privacy concerns is
the xCard information, as described in Section 7.1.1.10. The
decision about which xCard information to send in the CLUE channel is
an application policy for point-to-point and multipoint calls based
on the authenticated identity that can be the Endpoint identity or
the user of the Endpoint. For example, the telepresence multipoint
application can authenticate a user before starting a CLUE exchange
with the telepresence system and have a policy per user.
In addition, the (text) description field in the Media Capture
attribute (Section 7.1.1.6) could possibly reveal sensitive
information or specific identities. The same would be true for the
descriptions in the Capture Scene (Section 7.3.1) and CSV
(Section 7.3.2) attributes. An implementation SHOULD give users
control over what sensitive information is sent in an Advertisement.
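One way to give users this control is to filter free-text fields
before an Advertisement is built. The sketch below is hypothetical:
the dictionary keys and the redact_captures function are assumptions
made for illustration and do not reflect the CLUE XML schema.

```python
# Hypothetical attribute names that may carry sensitive free text.
SENSITIVE_KEYS = ("description", "xcard")


def redact_captures(captures, allowed=()):
    """Return copies of the capture dicts without disallowed sensitive keys.

    allowed lists the sensitive keys the user has agreed to send.
    The input dictionaries are left unmodified.
    """
    blocked = set(SENSITIVE_KEYS) - set(allowed)
    return [
        {key: value for key, value in capture.items() if key not in blocked}
        for capture in captures
    ]


if __name__ == "__main__":
    captures = [{"id": "VC1", "description": "Board room", "encodingGroup": 1}]
    print(redact_captures(captures))
    print(redact_captures(captures, allowed=("description",)))
```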
One other important consideration for the information in the xCard as
well as the description field in the Media Capture and CSV attributes
is that while the Endpoints involved in the session have been
authenticated, there is no assurance that the information in the
xCard or description fields is authentic. Thus, this information
MUST NOT be used to make any authorization decisions.
While other information in the CLUE protocol messages does not reveal
specific identities, it can reveal characteristics and capabilities
of the Endpoints. That information could possibly uniquely identify
specific Endpoints. It might also be possible for an attacker to
manipulate the information and disrupt the CLUE sessions. It would
also be possible to mount a DoS attack on the CLUE Endpoints if a
malicious agent has access to the data channel. Thus, it MUST be
possible for the Endpoints to establish a channel that is secure
against both message recovery and message modification. Further
details on this are provided in the CLUE data channel solution
document [RFC8850].
There are also security issues associated with the authorization to
perform actions at the CLUE Endpoints to invoke specific capabilities
(e.g., rearranging screens, sharing content, etc.). However, the
policies and security associated with these actions are outside the
scope of this document and the overall CLUE solution.
15. References
15.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261,
DOI 10.17487/RFC3261, June 2002,
<https://www.rfc-editor.org/info/rfc3261>.
[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
with Session Description Protocol (SDP)", RFC 3264,
DOI 10.17487/RFC3264, June 2002,
<https://www.rfc-editor.org/info/rfc3264>.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
July 2003, <https://www.rfc-editor.org/info/rfc3550>.
[RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
July 2006, <https://www.rfc-editor.org/info/rfc4566>.
[RFC4579] Johnston, A. and O. Levin, "Session Initiation Protocol
(SIP) Call Control - Conferencing for User Agents",
BCP 119, RFC 4579, DOI 10.17487/RFC4579, August 2006,
<https://www.rfc-editor.org/info/rfc4579>.
[RFC5239] Barnes, M., Boulton, C., and O. Levin, "A Framework for
Centralized Conferencing", RFC 5239, DOI 10.17487/RFC5239,
June 2008, <https://www.rfc-editor.org/info/rfc5239>.
[RFC5646] Phillips, A., Ed. and M. Davis, Ed., "Tags for Identifying
Languages", BCP 47, RFC 5646, DOI 10.17487/RFC5646,
September 2009, <https://www.rfc-editor.org/info/rfc5646>.
[RFC6350] Perreault, S., "vCard Format Specification", RFC 6350,
DOI 10.17487/RFC6350, August 2011,
<https://www.rfc-editor.org/info/rfc6350>.
[RFC6351] Perreault, S., "xCard: vCard XML Representation",
RFC 6351, DOI 10.17487/RFC6351, August 2011,
<https://www.rfc-editor.org/info/rfc6351>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8846] Presta, R. and S P. Romano, "An XML Schema for the
Controlling Multiple Streams for Telepresence (CLUE) Data
Model", RFC 8846, DOI 10.17487/RFC8846, January 2021,
<https://www.rfc-editor.org/info/rfc8846>.
[RFC8847] Presta, R. and S P. Romano, "Protocol for Controlling
Multiple Streams for Telepresence (CLUE)", RFC 8847,
DOI 10.17487/RFC8847, January 2021,
<https://www.rfc-editor.org/info/rfc8847>.
[RFC8848] Hanton, R., Kyzivat, P., Xiao, L., and C. Groves, "Session
Signaling for Controlling Multiple Streams for
Telepresence (CLUE)", RFC 8848, DOI 10.17487/RFC8848,
January 2021, <https://www.rfc-editor.org/info/rfc8848>.
[RFC8850] Holmberg, C., "Controlling Multiple Streams for
Telepresence (CLUE) Protocol Data Channel", RFC 8850,
DOI 10.17487/RFC8850, January 2021,
<https://www.rfc-editor.org/info/rfc8850>.
15.2. Informative References
[RFC4353] Rosenberg, J., "A Framework for Conferencing with the
Session Initiation Protocol (SIP)", RFC 4353,
DOI 10.17487/RFC4353, February 2006,
<https://www.rfc-editor.org/info/rfc4353>.
[RFC7201] Westerlund, M. and C. Perkins, "Options for Securing RTP
Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
<https://www.rfc-editor.org/info/rfc7201>.
[RFC7202] Perkins, C. and M. Westerlund, "Securing the RTP
Framework: Why RTP Does Not Mandate a Single Media
Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
2014, <https://www.rfc-editor.org/info/rfc7202>.
[RFC7205] Romanow, A., Botzko, S., Duckworth, M., and R. Even, Ed.,
"Use Cases for Telepresence Multistreams", RFC 7205,
DOI 10.17487/RFC7205, April 2014,
<https://www.rfc-editor.org/info/rfc7205>.
[RFC7262] Romanow, A., Botzko, S., and M. Barnes, "Requirements for
Telepresence Multistreams", RFC 7262,
DOI 10.17487/RFC7262, June 2014,
<https://www.rfc-editor.org/info/rfc7262>.
[RFC7667] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 7667,
DOI 10.17487/RFC7667, November 2015,
<https://www.rfc-editor.org/info/rfc7667>.
[RFC8849] Even, R. and J. Lennox, "Mapping RTP Streams to
Controlling Multiple Streams for Telepresence (CLUE) Media
Captures", RFC 8849, DOI 10.17487/RFC8849, January 2021,
<https://www.rfc-editor.org/info/rfc8849>.
Acknowledgements
Allyn Romanow and Brian Baldino were authors of early draft versions.
Mark Gorzynski also contributed much to the initial approach. Many
others also contributed, including Christian Groves, Jonathan Lennox,
Paul Kyzivat, Rob Hanton, Roni Even, Christer Holmberg, Stephen
Botzko, Mary Barnes, John Leslie, and Paul Coverdale.
Authors' Addresses
Mark Duckworth (editor)
Email: mrducky73@outlook.com
Andrew Pepperell
Acano
Uxbridge
United Kingdom
Email: apeppere@gmail.com
Stephan Wenger
Tencent
2747 Park Blvd.
Palo Alto, CA 94306
United States of America
Email: stewe@stewe.org