Internet Engineering Task Force (IETF) H. Song
Request for Comments: 9232 Futurewei
Category: Informational F. Qin
ISSN: 2070-1721 China Mobile
P. Martinez-Julia
NICT
L. Ciavaglia
Rakuten Mobile
A. Wang
China Telecom
May 2022
Network Telemetry Framework
Abstract
Network telemetry is a technology for gaining network insight and
facilitating efficient and automated network management. It
encompasses various techniques for remote data generation,
collection, correlation, and consumption. This document describes an
architectural framework for network telemetry, motivated by
challenges that are encountered as part of the operation of networks
and by the requirements that ensue. This document clarifies the
terminology and classifies the modules and components of a network
telemetry system from different perspectives. The framework and
taxonomy help to set a common ground for the collection of related
work and provide guidance for related technique and standard
developments.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Not all documents
approved by the IESG are candidates for any level of Internet
Standard; see Section 2 of RFC 7841.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc9232.
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Revised BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Revised BSD License.
Table of Contents
1. Introduction
1.1. Applicability Statement
1.2. Glossary
2. Background
2.1. Telemetry Data Coverage
2.2. Use Cases
2.3. Challenges
2.4. Network Telemetry
2.5. The Necessity of a Network Telemetry Framework
3. Network Telemetry Framework
3.1. Top-Level Modules
3.1.1. Management Plane Telemetry
3.1.2. Control Plane Telemetry
3.1.3. Forwarding Plane Telemetry
3.1.4. External Data Telemetry
3.2. Second-Level Function Components
3.3. Data Acquisition Mechanism and Type Abstraction
3.4. Mapping Existing Mechanisms into the Framework
4. Evolution of Network Telemetry Applications
5. Security Considerations
6. IANA Considerations
7. Informative References
Appendix A. A Survey on Existing Network Telemetry Techniques
A.1. Management Plane Telemetry
A.1.1. Push Extensions for NETCONF
A.1.2. gRPC Network Management Interface
A.2. Control Plane Telemetry
A.2.1. BGP Monitoring Protocol
A.3. Data Plane Telemetry
A.3.1. Alternate-Marking (AM) Technology
A.3.2. Dynamic Network Probe
A.3.3. IP Flow Information Export (IPFIX) Protocol
A.3.4. In Situ OAM
A.3.5. Postcard-Based Telemetry
A.3.6. Existing OAM for Specific Data Planes
A.4. External Data and Event Telemetry
A.4.1. Sources of External Events
A.4.2. Connectors and Interfaces
Acknowledgments
Contributors
Authors' Addresses
1. Introduction
Network visibility is the ability of management tools to see the
state and behavior of a network, which is essential for successful
network operation. Network telemetry revolves around network data
that 1) can help provide insights about the current state of the
network, including network devices, forwarding, control, and
management planes; 2) can be generated and obtained through a variety
of techniques, including but not limited to network instrumentation
and measurements; and 3) can be processed for purposes ranging from
service assurance to network security using a wide variety of data
analytical techniques. In this document, network telemetry refers to
both the data itself (i.e., "Network Telemetry Data") and the
techniques and processes used to generate, export, collect, and
consume that data for use by potentially automated management
applications. Network telemetry extends beyond the classical network
Operations, Administration, and Maintenance (OAM) techniques and
is expected to support better flexibility, scalability, accuracy,
coverage, and performance.
However, the term "network telemetry" lacks an unambiguous
definition. Its scope and coverage cause confusion and
misunderstandings. It is beneficial to clarify the concept and
provide a clear architectural framework for network telemetry, so we
can articulate the technical field and better align the related
techniques and standard works.
To fulfill such an undertaking, we first discuss some key
characteristics of network telemetry that set a clear distinction
from the conventional network OAM and show that some conventional OAM
technologies can be considered a subset of the network telemetry
technologies. We then provide an architectural framework for network
telemetry that includes four modules, each associated with a
different category of telemetry data and corresponding procedures.
All the modules are internally structured in the same way, including
components that allow the operator to configure data sources in
regard to what data to generate and how to make that available to
client applications, components that instrument the underlying data
sources, and components that perform the actual rendering, encoding,
and exporting of the generated data. We show how the network
telemetry framework can benefit current and future network
operations. Based on the distinction of modules and function
components, we can map the existing and emerging techniques and
protocols into the framework. The framework can also simplify
designing, maintaining, and understanding a network telemetry system.
In addition, we outline the evolution stages of the network telemetry
system and discuss the potential security concerns.
The purpose of the framework and taxonomy is to set a common ground
for the collection of related work and provide guidance for future
technique and standard developments. To the best of our knowledge,
this document is the first such effort for network telemetry in
industry standards organizations. This document does not define
specific technologies.
1.1. Applicability Statement
Large-scale network data collection is a major threat to user privacy
and may be indistinguishable from pervasive monitoring [RFC7258].
The network telemetry framework presented in this document must not
be applied to generating, exporting, collecting, analyzing, or
retaining individual user data or any data that can identify end
users or characterize their behavior without consent. Based on this
principle, the network telemetry framework is not applicable to
networks whose endpoints represent individual users, such as general-
purpose access networks.
1.2. Glossary
Before further discussion, we list some key terminology and
abbreviations used in this document. There is an intended
differentiation between the terms "network telemetry" and "OAM".
However, it should be understood that there is not a hard-line
distinction between the two concepts. Rather, network telemetry is
considered an extension of OAM. It covers all the existing OAM
protocols but puts more emphasis on the newer and emerging techniques
and protocols concerning all aspects of network data from acquisition
to consumption.
AI: Artificial Intelligence. In the network domain, AI
refers to machine-learning-based technologies for
automated network operation and other tasks.
AM: Alternate Marking. A flow performance measurement
method, as specified in [RFC8321].
BMP: BGP Monitoring Protocol. Specified in [RFC7854].
DPI: Deep Packet Inspection. Refers to the techniques that
examine packets beyond packet L3/L4 headers.
gNMI: gRPC Network Management Interface. A network management
protocol from the OpenConfig Operator Working Group,
mainly contributed by Google. See [gnmi] for details.
GPB: Google Protocol Buffer. An extensible mechanism for
serializing structured data. See [gpb] for details.
gRPC: gRPC Remote Procedure Call. An open-source high-
performance RPC framework that gNMI is based on. See
[grpc] for details.
IPFIX: IP Flow Information Export Protocol. Specified in
[RFC7011].
IOAM: In situ OAM [RFC9197]. A data plane on-path telemetry
technique.
JSON: JavaScript Object Notation. An open standard file format
and data interchange format that uses human-readable text
to store and transmit data objects, as specified in
[RFC8259].
MIB: Management Information Base. A database used for
managing the entities in a network.
NETCONF: Network Configuration Protocol. Specified in [RFC6241].
NetFlow: A Cisco protocol used for flow record collecting, as
described in [RFC3954].
Network Telemetry: The process and instrumentation for acquiring and
utilizing network data remotely for network monitoring
and operation. A general term for a large set of network
visibility techniques and protocols, concerning aspects
like data generation, collection, correlation, and
consumption. Network telemetry addresses current network
operation issues and enables smooth evolution toward
future intent-driven autonomous networks.
NMS: Network Management System. Refers to applications that
allow network administrators to manage a network.
OAM: Operations, Administration, and Maintenance. A group of
network management functions that provide network fault
indication, fault localization, performance information,
and data and diagnosis functions. Most conventional
network monitoring techniques and protocols belong to
network OAM.
PBT: Postcard-Based Telemetry. A data plane on-path telemetry
technique. A representative technique is described in
[IPPM-IOAM-DIRECT-EXPORT].
RESTCONF: An HTTP-based protocol that provides a programmatic
interface for accessing data defined in YANG, using the
datastore concepts defined in NETCONF, as specified in
[RFC8040].
SMIv2: Structure of Management Information Version 2. Defines
MIB objects, as specified in [RFC2578].
SNMP: Simple Network Management Protocol. Versions 1, 2, and 3
are specified in [RFC1157], [RFC3416], and [RFC3411],
respectively.
XML: Extensible Markup Language. A markup language for data
encoding that is both human readable and machine
readable, as specified by W3C [W3C.REC-xml-20081126].
YANG: YANG is a data modeling language for the definition of
data sent over network management protocols such as
NETCONF and RESTCONF. YANG is defined in [RFC6020] and
[RFC7950].
YANG ECA: A YANG model for Event-Condition-Action policies, as
defined in [NETMOD-ECA-POLICY].
YANG-Push: A mechanism that allows subscriber applications to
request a stream of updates from a YANG datastore on a
network device. Details are specified in [RFC8639] and
[RFC8641].
2. Background
The term "big data" is used to describe the extremely large volume of
data sets that can be analyzed computationally to reveal patterns,
trends, and associations. Networks are undoubtedly a source of big
data because of their scale and the volume of network traffic they
forward. When a network's endpoints do not represent individual
users (e.g., in industrial, data-center, and infrastructure
contexts), network operations can often benefit from large-scale data
collection without breaching user privacy.
Today, one can access advanced big data analytics capability through
a plethora of commercial and open-source platforms (e.g., Apache
Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
learning). Thanks to the advance of computing and storage
technologies, network big data analytics give network operators an
opportunity to gain network insights and move towards network
autonomy. Some operators have started to explore the application of
Artificial Intelligence (AI) to make sense of network data. Software
tools can use the network data to detect and react to network faults,
anomalies, and policy violations, as well as to predict future events.
In turn, network policy updates for planning, intrusion prevention,
optimization, and self-healing may be applied.
It is conceivable that an autonomic network [RFC7575] is the logical
next step for network evolution following Software-Defined Networking
(SDN), which aims to reduce (or even eliminate) human labor, make
more efficient use of network resources, and provide better services
more aligned with customer requirements. The IETF ANIMA Working
Group is dedicated to developing and maintaining protocols and
procedures for automated network management and control of
professionally managed networks. The related technique of
Intent-Based Networking (IBN) [NMRG-IBN-CONCEPTS-DEFINITIONS]
requires network visibility and telemetry data in order to ensure
that the network is behaving as intended.
However, while data processing capability has improved and
applications require more data to function better, networks lag
behind in extracting and translating network data into useful and
actionable information in efficient ways. The system bottleneck is
shifting from data consumption to data supply. Both the number of
network nodes and the traffic bandwidth keep increasing at a fast
pace. Network configurations and policies change at shorter time
intervals than before. More subtle events and fine-grained data across
all network planes need to be captured and exported in real time. In
a nutshell, it is a challenge to get enough high-quality data out of
the network in a manner that is efficient, timely, and flexible.
Therefore, we need to survey the existing technologies and protocols
and identify any potential gaps.
In the remainder of this section, we first clarify the scope of
network data (i.e., telemetry data) relevant in this document. Then,
we discuss several key use cases for network operations of today and
the future. Next, we show why the current network OAM techniques and
protocols are insufficient for these use cases. The discussion
underlines the need for new methods, techniques, and protocols, as
well as the extensions of existing ones, which we assign under the
umbrella term "Network Telemetry".
2.1. Telemetry Data Coverage
Any information that can be extracted from networks (including the
data plane, control plane, and management plane) and used to gain
visibility or as a basis for actions is considered telemetry data.
It includes statistics, event records and logs, snapshots of state,
configuration data, etc. It also covers the outputs of any active
and passive measurements [RFC7799]. In some cases, raw data is
processed in the network before being sent to a data consumer. Such
processed data is also considered telemetry data. The value of
telemetry data varies. In some cases, if the cost is acceptable, a
smaller amount of high-quality data is preferred over a large amount
of low-quality data. A classification of telemetry data is provided in
Section 3. To preserve the privacy of end users, no user packet
content should be collected. Specifically, the data objects
generated, exported, and collected by a network telemetry application
should not include any packet payload from traffic associated with
end-user systems.
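
As a minimal sketch of the payload restriction above, the following
Python fragment builds an export record from header-level fields only.
The Packet and FlowRecord types and the to_record helper are
hypothetical names introduced here for illustration; they are not
defined by this document or by any telemetry protocol. The sketch also
assumes a network whose endpoints do not represent individual users, in
line with Section 1.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical in-memory view of an observed packet."""
    src: str
    dst: str
    protocol: int
    length: int
    payload: bytes  # packet content; must never be exported


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical export record holding header-derived metadata only."""
    src: str
    dst: str
    protocol: int
    length: int


def to_record(pkt: Packet) -> FlowRecord:
    # Copy only header-level metadata; the payload is deliberately dropped.
    return FlowRecord(pkt.src, pkt.dst, pkt.protocol, pkt.length)
```

Keeping the payload out of the record type itself, rather than
filtering it at export time, means a later code path cannot leak it by
accident.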
2.2. Use Cases
The following set of use cases is essential for network operations.
While the list is by no means exhaustive, it is enough to highlight
the requirements for data velocity, variety, volume, and veracity,
the attributes of big data, in networks.
* Security: Network intrusion detection and prevention systems need
to monitor network traffic and activities and act upon anomalies.
Given increasingly sophisticated attack vectors coupled with
increasingly severe consequences of security breaches, new tools
and techniques need to be developed, relying on wider and deeper
visibility into networks. The ultimate goal is to achieve
security with no, or only minimal, human intervention and without
disrupting legitimate traffic flows.
* Policy and Intent Compliance: Network policies are the rules that
constrain the services for network access, provide service
differentiation, or enforce specific treatment on the traffic.
For example, a service function chain is a policy that requires
the selected flows to pass through a set of ordered network
functions. Intent, as defined in [NMRG-IBN-CONCEPTS-DEFINITIONS],
is a set of operational goals that a network should meet and
outcomes that a network is supposed to deliver, defined in a
declarative manner without specifying how to achieve or implement
them. An intent requires a complex translation and mapping
process before being applied on networks. While a policy or
intent is enforced, the compliance needs to be verified and
monitored continuously by relying on visibility that is provided
through network telemetry data. Any violation must be reported
immediately; this will alert the network administrator to the
policy or intent violation and will potentially result in updates
to how the policy or intent is applied in the network to ensure
that it remains in force.
* SLA Compliance: A Service Level Agreement (SLA) is a service
contract between a service provider and a client, which includes
the metrics for the service measurement and remedy/penalty
procedures when the service level misses the agreement. Users
need to check if they get the service as promised, and network
operators need to evaluate how they can deliver services that meet
the SLA based on real-time network telemetry data, including data
from network measurements.
* Root Cause Analysis: Many network failures can be the effect of a
sequence of chained events. Troubleshooting and recovery require
quick identification of the root cause of any observable issues.
However, the root cause is not always straightforward to identify,
especially when the failure is sporadic and the number of event
messages, both related and unrelated to the same cause, is
overwhelming. While technologies such as machine learning can be
used for root cause analysis, it is up to the network to sense and
provide the relevant diagnostic data that are either actively fed
into or passively retrieved by the root cause analysis
applications.
* Network Optimization: This covers all short-term and long-term
network optimization techniques, including load balancing, Traffic
Engineering (TE), and network planning. Network operators are
motivated to optimize their network utilization and differentiate
services for better Return on Investment (ROI) or lower Capital
Expenditure (CAPEX). The first step is to know the real-time
network conditions before applying policies for traffic
manipulation. In some cases, microbursts need to be detected in a
very short time frame so that fine-grained traffic control can be
applied to avoid network congestion. Long-term planning of
network capacity and topology requires analysis of real-world
network telemetry data that is obtained over long periods of time.
* Event Tracking and Prediction: The visibility into traffic path
and performance is critical for services and applications that
rely on healthy network operation. Numerous related network
events are of interest to network operators. For example, network
operators want to learn where and why packets are dropped for an
application flow. They also want to be warned of issues in
advance, so proactive actions can be taken to avoid catastrophic
consequences.
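
The SLA compliance use case above can be sketched as a comparison of
measured values against agreed thresholds. This is only an
illustration: the SlaTarget and Measurement types, the two metrics
chosen (latency and loss ratio), and the sla_violations helper are
assumptions made for this example and do not come from any SLA
standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical agreed service levels."""
    max_latency_ms: float
    max_loss_ratio: float


@dataclass(frozen=True)
class Measurement:
    """Hypothetical telemetry sample for one measurement interval."""
    latency_ms: float
    loss_ratio: float


def sla_violations(target: SlaTarget, samples: List[Measurement]) -> List[int]:
    """Return the indices of samples that miss at least one target."""
    return [
        index
        for index, sample in enumerate(samples)
        if sample.latency_ms > target.max_latency_ms
        or sample.loss_ratio > target.max_loss_ratio
    ]
```

Both the customer and the operator could run the same check on their
own telemetry data, which is why agreed metrics and a shared data
representation matter for this use case.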
2.3. Challenges
For a long time, network operators have relied upon SNMP [RFC3416],
Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the
network. Some other OAM techniques as described in [RFC7276] are
also used to facilitate network troubleshooting. These conventional
techniques are not sufficient to support the above use cases for the
following reasons:
* Most use cases need to continuously monitor the network and
dynamically refine the data collection in real time. Poll-based
low-frequency data collection is ill-suited for these
applications. Subscription-based streaming data directly pushed
from the data source (e.g., the forwarding chip) is preferred to
provide sufficient data quantity and precision at scale.
* Comprehensive data is needed, ranging from packet processing
engines to traffic managers, line cards to main control boards,
user flows to control protocol packets, device configurations to
operations, and physical layers to application layers.
Conventional OAM only covers a narrow range of data (e.g., SNMP
only handles data from the Management Information Base (MIB)).
Classical network devices cannot provide all the necessary probes.
More open and programmable network devices are therefore needed.
* Many application scenarios need to correlate network-wide data
from multiple sources (i.e., from distributed network devices,
different components of a network device, or different network
planes). A piecemeal solution often lacks the capability to
consolidate the data from multiple sources. The composition of a
complete solution, as partly proposed by Autonomic Resource
Control Architecture (ARCA) [NMRG-ANTICIPATED-ADAPTATION], will be
empowered and guided by a comprehensive framework.
* Some conventional OAM techniques (e.g., CLI and Syslog) lack a
formal data model. Unstructured data hinders tool automation
and application extensibility. Standardized data models are
essential to support programmable networks.
* Although some conventional OAM techniques support data push (e.g.,
SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the
pushed data are limited to only predefined management plane
warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow).
Network operators require data with arbitrary source,
granularity, and precision, which is beyond the capability of the
existing techniques.
* Conventional passive measurement techniques can either consume
excessive network resources and produce excessive redundant data
or lead to inaccurate results; on the other hand, conventional
active measurement techniques can interfere with the user traffic,
and their results are indirect. Techniques that can collect
direct and on-demand data from user traffic are more favorable.
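
To illustrate the first point in the list with synthetic numbers, the
sketch below compares what a low-frequency poller observes with what a
source-side push would report for the same queue-depth series. The
series, the polling interval, the threshold, and both helper names are
assumptions made up for this illustration; real push mechanisms are
subscription driven and considerably richer.

```python
from typing import List, Tuple


def poll(samples: List[int], interval: int) -> List[int]:
    """Values seen by a collector that reads every `interval` ticks."""
    return samples[::interval]


def push_on_threshold(samples: List[int], threshold: int) -> List[Tuple[int, int]]:
    """(tick, value) pairs a data source would push at or above `threshold`."""
    return [(tick, value) for tick, value in enumerate(samples) if value >= threshold]


# Synthetic queue depth per tick, with a short burst at ticks 3 and 4.
queue_depth = [1, 2, 1, 90, 95, 2, 1, 1, 2, 1]

polled = poll(queue_depth, interval=5)                 # [1, 2]: the burst is missed
pushed = push_on_threshold(queue_depth, threshold=80)  # [(3, 90), (4, 95)]
```

The poller reads ticks 0 and 5 and never sees the burst, while the
source-side push reports both burst samples and nothing else.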
These challenges have been addressed by newer standards and techniques
(e.g., IPFIX/NetFlow, Packet Sampling (PSAMP), IOAM, and YANG-Push),
and more are emerging. These standards and techniques need to be
recognized and accommodated in a new framework.
2.4. Network Telemetry
Network telemetry has emerged as a mainstream technical term to refer
to the network data collection and consumption techniques. Several
network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and
gRPC [grpc]) have been widely deployed. Network telemetry allows
separate entities to acquire data from network devices so that data
can be visualized and analyzed to support network monitoring and
operation. Network telemetry covers the conventional network OAM and
has a wider scope. For instance, it is expected that network
telemetry can provide the necessary network insight for autonomous
networks and address the shortcomings of conventional OAM techniques.
Network telemetry usually assumes machines as data consumers rather
than human operators. Hence, network telemetry can directly trigger
the automated network operation, while in contrast, some conventional
OAM tools were designed and used to help human operators to monitor
and diagnose the networks and guide manual network operations. Such
a proposition leads to very different techniques.
Although new network telemetry techniques are emerging and subject to
continuous evolution, several characteristics of network telemetry
have been well accepted. Note that network telemetry is intended to
be an umbrella term covering a wide spectrum of techniques, so the
following characteristics are not expected to be held by every
specific technique.
* Push and Streaming: Instead of polling data from network devices,
telemetry collectors subscribe to streaming data pushed from data
sources in network devices.
* Volume and Velocity: Telemetry data is intended to be consumed by
machines rather than by human beings. Therefore, the data volume
can be huge, and the processing is optimized for the needs of
automation in real time.
* Normalization and Unification: Telemetry aims to address the
overall network automation needs. Efforts are made to normalize
the data representation and unify the protocols, so as to simplify
data analysis and provide integrated analysis across heterogeneous
devices and data sources across a network.
* Model-Based: Telemetry data is modeled in advance, which allows
applications to configure and consume data with ease.
* Data Fusion: The data for a single application can come from
multiple data sources (e.g., cross-domain, cross-device, and
cross-layer) that are based on a common name/ID and need to be
correlated to take effect.
* Dynamic and Interactive: Since network telemetry is meant to be
used in a closed control loop for network automation, it needs to
run continuously and adapt to the dynamic and interactive queries
from the network operation controller.
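As an illustration of the Data Fusion characteristic above, the
following minimal Python sketch correlates records from two sources
on a common identifier. The record fields (flow_id, hop_latency_us,
route_changes) and the fuse() helper are assumptions made for this
example only; they are not defined by any telemetry standard.

```python
from collections import defaultdict

def fuse(*sources):
    """Merge records from several sources that share a 'flow_id' key."""
    fused = defaultdict(dict)
    for records in sources:
        for record in records:
            # Fields from later sources are added to what is already known.
            fused[record["flow_id"]].update(record)
    return dict(fused)

# Hypothetical forwarding plane and control plane records.
forwarding = [{"flow_id": "f1", "hop_latency_us": 120},
              {"flow_id": "f2", "hop_latency_us": 45}]
control = [{"flow_id": "f1", "route_changes": 3}]

view = fuse(forwarding, control)
```

Only flow "f1" appears in both sources, so only its entry carries
fields from both planes; "f2" keeps the forwarding plane data alone.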
In addition, an ideal network telemetry solution may also have the
following features or properties:
* In-Network Customization: The data that is generated can be
customized in network at runtime to cater to the specific need of
applications. This needs the support of a programmable data
plane, which allows probes with custom functions to be deployed at
flexible locations.
* In-Network Data Aggregation and Correlation: Network devices and
aggregation points can work out which events and what data needs
to be stored, reported, or discarded, thus reducing the load on
the central collection and processing points while still ensuring
that the right information is ready to be processed in a timely
way.
* In-Network Processing: Sometimes it is not necessary or feasible
to gather all information to a central point to be processed and
acted upon. It is possible for the data processing to be done in
network, allowing reactive actions to be taken locally.
* Direct Data Plane Export: The data originated from data plane
forwarding chips can be directly exported to the data consumer for
efficiency, especially when the data bandwidth is large and real-
time processing is required.
* In-Band Data Collection: In addition to the passive and active
data collection approaches, the new hybrid approach allows data to
be collected directly for any target flow along its entire
forwarding path [OPSAWG-IFIT-FRAMEWORK].
It is worth noting that a network telemetry system should avoid the
pitfall of the "observer effect" and not be intrusive to normal
network operations. That is, it should neither change the network
behavior nor affect the forwarding performance. Moreover, high-volume
telemetry traffic may cause network congestion unless proper
isolation or traffic engineering techniques are in place, or
congestion control mechanisms ensure that telemetry traffic backs off
if it exceeds the network capacity. [RFC8084] and [RFC8085] are
relevant Best Current Practices (BCPs) in this space.
Although in many cases a system for network telemetry involves a
remote data collecting and consuming entity, it is important to
understand that there are no inherent assumptions about how a system
should be architected. While a network architecture with a
centralized controller (e.g., SDN) seems to be a natural fit for
network telemetry, network telemetry can work in distributed fashions
as well. For example, telemetry data producers and consumers can
have a peer-to-peer relationship, in which a network node can be the
direct consumer of telemetry data from other nodes.
2.5. The Necessity of a Network Telemetry Framework
Network data analytics (e.g., machine learning) is applied for
network operation automation, relying on abundant and coherent data
from networks. Data acquisition that is limited to a single source
and static in nature will in many cases not be sufficient to meet an
application's telemetry data needs. As a result, multiple data
sources, involving a variety of techniques and standards, will need
to be integrated. It is desirable to have a framework that
classifies and organizes different telemetry data sources and types,
defines different components of a network telemetry system and their
interactions, and helps coordinate and integrate multiple telemetry
approaches across layers. This allows flexible combinations of data
for different applications, while normalizing and simplifying
interfaces. In detail, such a framework would benefit the
development of network operation applications for the following
reasons:
* Future networks, autonomous or otherwise, depend on holistic and
comprehensive network visibility. Use cases and applications are
better when supported uniformly and coherently using an
integrated, converged mechanism and common telemetry data
representations wherever feasible. Therefore, the protocols and
mechanisms should be consolidated into a minimum yet comprehensive
set. A telemetry framework can help to normalize the technique
developments.
* Network visibility presents multiple viewpoints. For example, the
device viewpoint takes the network infrastructure as the
monitoring object from which the network topology and device
status can be acquired, and the traffic viewpoint takes the flows
or packets as the monitoring object from which the traffic quality
and path can be acquired. An application may need to switch its
viewpoint during operation. It may also need to correlate a
service and its impact on user experience (UE) to acquire the
comprehensive information.
* Applications require network telemetry to be elastic in order to
make efficient use of network resources and reduce the impact of
processing related to network telemetry on network performance.
For example, routine network monitoring should cover the entire
network with a low data sampling rate. Only when issues arise or
critical trends emerge should telemetry data sources be modified
and telemetry data rates be boosted as needed.
* Efficient data aggregation is critical for applications to reduce
the overall quantity of data and improve the accuracy of analysis.
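The elasticity described in the list above can be sketched as a
simple control rule. This is a minimal illustration under stated
assumptions: the function name, the "1 packet in N" sampling model,
and the factors (divide by 10 on an anomaly, double otherwise) are
hypothetical choices, not values taken from any specification.

```python
def next_sampling_interval(current, anomaly, baseline=1000, finest=1):
    """Return N for "sample 1 packet in N" for the next interval.

    A smaller N means more telemetry data. The rate is boosted while
    an anomaly is present and relaxes back to the baseline afterwards.
    """
    if anomaly:
        return max(finest, current // 10)   # boost the data rate
    return min(baseline, current * 2)       # decay towards routine monitoring

interval = 1000
history = []
for anomaly in [False, False, True, True, False, False]:
    interval = next_sampling_interval(interval, anomaly)
    history.append(interval)
```

With this input, the interval stays at the baseline until the
anomaly appears, tightens while it lasts, and then relaxes step by
step.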
A telemetry framework collects all the telemetry-related work from
different sources and working groups within the IETF. This makes it
possible to assemble a comprehensive network telemetry system and to
avoid repetitious or redundant work. The framework should cover the
concepts and components from the standardization perspective. This
document describes the modules that make up a network telemetry
framework and decomposes the telemetry system into a set of distinct
components that existing and future work can easily map to.
3. Network Telemetry Framework
The top-level network telemetry framework partitions the network
telemetry into four modules based on the telemetry data object source
and represents their relationship. Once the network operation
applications acquire the data from these modules, they can apply data
analytics and take actions. At the next level, the framework
decomposes each module into separate components. Each of these
modules follows the same underlying structure, with one component
dedicated to the configuration of data subscriptions and data
sources, a second component dedicated to encoding and exporting data,
and a third component instrumenting the generation of telemetry
related to the underlying resources. Throughout the framework, the
same set of abstract data-acquiring mechanisms and data types
(Section 3.3) are applied. The two-level architecture with the
uniform data abstraction helps accurately pinpoint a protocol or
technique to its position in a network telemetry system or
disaggregate a network telemetry system into manageable parts.
3.1. Top-Level Modules
Telemetry can be applied on the forwarding plane, control plane, and
management plane in a network, as well as on other sources out of the
network, as shown in Figure 1. Therefore, we categorize the network
telemetry into four distinct modules (management plane, control
plane, forwarding plane, and external data and event telemetry) with
each having its own interface to network operation applications.
+------------------------------+
|                              |
|       Network Operation      |<-------+
|          Applications        |        |
|                              |        |
+------------------------------+        |
     ^          ^          ^            |
     |          |          |            |
     V          V          |            V
+--------------+-----------|---+  +-----------+
|              | Control   |   |  |           |
|              | Plane     |   |  | External  |
|            <--->         |   |  | Data and  |
|              | Telemetry |   |  | Event     |
|  Management  |       ^   V   |  | Telemetry |
|  Plane       +-------|-------+  |           |
|  Telemetry   |       V       |  +-----------+
|              | Forwarding    |
|              | Plane         |
|            <--->             |
|              | Telemetry     |
|              |               |
+--------------+---------------+
Figure 1: Modules in Layer Category of the Network Telemetry
Framework
The rationale of this partition lies in the different telemetry data
objects that result in different data sources and export locations.
Such differences have profound implications on in-network data
programming and processing capability, data encoding and the
transport protocol, and required data bandwidth and latency. Data
can be sent directly or proxied via the control and management
planes. There are advantages/disadvantages to both approaches.
Note that in some cases, the network controller itself may be the
source of telemetry data that is unique to it or derived from the
telemetry data collected from the network elements. Some of the
principles and taxonomy specific to the control plane and management
plane telemetry could also be applied to the controller when it is
required to provide the telemetry data to network operation
applications hosted outside. The scope of this document is focused
on the network elements telemetry, and further details related to
controllers are thus out of scope.
We summarize the major differences of the four modules in Table 1.
They are compared from six angles:
* Data Object
* Data Export Location
* Data Model
* Data Encoding
* Telemetry Application Protocol
* Data Transport Method
Data Object is the target and source of each module. Because the
data source varies, the location where data is most conveniently
exported also varies. For example, forwarding plane data mainly
originates as data exported from the forwarding Application-Specific
Integrated Circuits (ASICs), while control plane data mainly
originates from the protocol daemons running on the control CPU(s).
For convenience and efficiency, it is preferred to export the data
off the device from locations near the source. Because the locations
that can export data have different capabilities, different choices
of data models, encoding, and transport methods are made to balance
the performance and cost. For example, the forwarding chip has high
throughput but limited capacity for processing complex data and
maintaining state, while the main control CPU is capable of complex
data and state processing but has limited bandwidth for high
throughput data. As a result, the suitable telemetry protocol for
each module can be different. Some representative techniques are
shown in the corresponding table blocks to highlight the technical
diversity of these modules. Note that the selected techniques just
reflect the de facto state of the art and are by no means exhaustive
(e.g., IPFIX can also be implemented over TCP and SCTP, but that is
not recommended for the forwarding plane). The key point is that one
cannot expect to use a universal protocol to cover all the network
telemetry requirements.
+=============+===============+==========+==========+===============+
|Module       |Management     |Control   |Forwarding|External Data  |
|             |Plane          |Plane     |Plane     |               |
+=============+===============+==========+==========+===============+
|Object       |configuration  |control   |flow and  |terminal,      |
|             |and operation  |protocol  |packet    |social, and    |
|             |state          |and       |QoS,      |environmental  |
|             |               |signaling,|traffic   |               |
|             |               |RIB       |stat.,    |               |
|             |               |          |buffer and|               |
|             |               |          |queue     |               |
|             |               |          |stat.,    |               |
|             |               |          |FIB,      |               |
|             |               |          |Access    |               |
|             |               |          |Control   |               |
|             |               |          |List (ACL)|               |
+-------------+---------------+----------+----------+---------------+
|Export       |main control   |main      |forwarding|various        |
|Location     |CPU            |control   |chip or   |               |
|             |               |CPU,      |linecard  |               |
|             |               |linecard  |CPU; main |               |
|             |               |CPU, or   |control   |               |
|             |               |forwarding|CPU       |               |
|             |               |chip      |unlikely  |               |
+-------------+---------------+----------+----------+---------------+
|Data Model   |YANG, MIB,     |YANG,     |YANG,     |YANG, custom   |
|             |syslog         |custom    |custom    |               |
+-------------+---------------+----------+----------+---------------+
|Data Encoding|GPB, JSON, XML |GPB, JSON,|plain text|GPB, JSON, XML,|
|             |               |XML, plain|          |plain text     |
|             |               |text      |          |               |
+-------------+---------------+----------+----------+---------------+
|Application  |gRPC, NETCONF, |gRPC,     |IPFIX,    |gRPC           |
|Protocol     |RESTCONF       |NETCONF,  |traffic   |               |
|             |               |IPFIX,    |mirroring,|               |
|             |               |traffic   |gRPC,     |               |
|             |               |mirroring |NETFLOW   |               |
+-------------+---------------+----------+----------+---------------+
|Data         |HTTP(S), TCP   |HTTP(S),  |UDP       |HTTP(S), TCP,  |
|Transport    |               |TCP, UDP  |          |UDP            |
+-------------+---------------+----------+----------+---------------+
Table 1: Comparison of Data Object Modules
Note that the interaction with the applications that consume network
telemetry data can be indirect. Some in-device data transfer is
possible. For example, in the management plane telemetry, the
management plane will need to acquire data from the data plane. Some
operational states can only be derived from data plane data sources
such as the interface status and statistics. As another example,
obtaining control plane telemetry data may require the ability to
access the Forwarding Information Base (FIB) of the data plane.
On the other hand, an application may involve more than one plane and
interact with multiple planes simultaneously. For example, an SLA
compliance application may require both the data plane telemetry and
the control plane telemetry.
The requirements and challenges for each module are summarized as
follows (note that the requirements may pertain across all telemetry
modules; however, we emphasize those that are most pronounced for a
particular plane).
3.1.1. Management Plane Telemetry
The management plane of network elements interacts with the Network
Management System (NMS) and provides information such as performance
data, network logging data, network warning and defects data, and
network statistics and state data. The management plane includes
many protocols, including the classical SNMP and syslog. Regardless
of the protocol, management plane telemetry must address the following
requirements:
* Convenient Data Subscription: An application should have the
freedom to choose which data is exported (see Section 3.3) and the
means and frequency of how that data is exported (e.g., on-change
or periodic subscription).
* Structured Data: For automatic network operation, machines will
replace humans for network data comprehension. Data modeling
languages, such as YANG, can efficiently describe structured data
and normalize data encoding and transformation.
* High-Speed Data Transport: In order to keep up with the velocity
of information, a data source needs to be able to send large
amounts of data at high frequency. Compact encoding formats or
data compression schemes are needed to reduce the quantity of data
and improve the data transport efficiency. The subscription mode,
by replacing the query mode, reduces the interactions between
clients and servers and helps to improve the data source's
efficiency.
* Network Congestion Avoidance: The application must protect the
network from congestion with congestion control mechanisms or, at
minimum, with circuit breakers. [RFC8084] and [RFC8085] provide
some solutions in this space.
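The difference between the on-change and periodic subscription modes
mentioned in the list above can be sketched as two filters over the
same sample series. This is only an illustration: the function names
and the (time, value) sample format are assumptions for this example
and do not reflect the API of any subscription protocol.

```python
def on_change(samples):
    """Yield (time, value) only when the value differs from the last export."""
    last = object()  # sentinel that compares unequal to every real value
    for t, value in samples:
        if value != last:
            yield (t, value)
            last = value

def periodic(samples, period):
    """Yield the sample taken at every multiple of 'period'."""
    for t, value in samples:
        if t % period == 0:
            yield (t, value)

# Hypothetical interface state sampled once per time unit.
samples = [(0, "up"), (1, "up"), (2, "down"), (3, "down"), (4, "up"), (5, "up")]
changes = list(on_change(samples))
ticks = list(periodic(samples, 3))
```

The on-change filter exports every transition and nothing else,
while the periodic filter exports at fixed times and misses the
transition back to "up" at time 4.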
3.1.2. Control Plane Telemetry
The control plane telemetry refers to the health condition monitoring
of different network control protocols at all layers of the protocol
stack. Keeping track of the operational status of these protocols is
beneficial for detecting, localizing, and even predicting various
network issues, as well as for network optimization, in real time and
with fine granularity. Some particular challenges and issues faced
by the control plane telemetry are as follows:
* How to correlate the End-to-End (E2E) Key Performance Indicators
(KPIs) to a specific layer's KPIs. For example, IPTV users may
describe their UE by the video smoothness and definition. Then in
case of an unusually poor UE KPI or a service disconnection, it is
non-trivial to delimit and pinpoint the issue in the responsible
protocol layer (e.g., the transport layer or the network layer),
the responsible protocol (e.g., IS-IS or BGP at the network
layer), and finally the responsible device(s) with specific
reasons.
* Conventional OAM-based approaches for control plane KPI
measurement include Ping (L3), Traceroute (L3), Y.1731 [y1731]
(L2), and so on. One common issue behind these methods is that
they only measure the KPIs instead of reflecting the actual
running status of these protocols, making them less effective or
efficient for control plane troubleshooting and network
optimization.
* More research is needed beyond the BGP Monitoring Protocol (BMP).
BMP is an example of control plane telemetry; it is currently
used for monitoring BGP routes and enables rich applications, such
as BGP peer analysis, Autonomous System (AS) analysis, prefix
analysis, and security analysis. However, the monitoring of other
layers and protocols and the cross-layer, cross-protocol KPI
correlations are still in their infancy (e.g., IGP monitoring is
not as extensive as BMP) and require further research.
Note that the requirement and solutions for network congestion
avoidance are also applicable to the control plane telemetry.
3.1.3. Forwarding Plane Telemetry
An effective forwarding plane telemetry system relies on the data
that the network device can expose. The quality, quantity, and
timeliness of data must meet some stringent requirements. This
raises some challenges for the network data plane devices where the
first-hand data originates.
* A data plane device's main function is user traffic processing and
forwarding. While supporting network visibility is important, the
telemetry is just an auxiliary function, and it should strive to
not impede normal traffic processing and forwarding (i.e., the
forwarding behavior should not be altered, and the trade-off
between forwarding performance and telemetry should be well-
balanced).
* Network operation applications require end-to-end visibility
across various sources, which can result in a huge volume of data.
However, the sheer quantity of data must not exhaust the network
bandwidth, regardless of the data delivery approach (i.e., whether
through in-band or out-of-band channels).
* The data plane devices must provide timely data with the minimum
possible delay. Long processing, transport, storage, and analysis
delay can impact the effectiveness of the control loop and even
render the data useless.
* The data should be structured, labeled, and easy for applications
to parse and consume. At the same time, the data types needed by
applications can vary significantly. The data plane devices need
to provide enough flexibility and programmability to support the
precise data provision for applications.
* The data plane telemetry should support incremental deployment and
work even if some devices are unaware of the system.
* The requirement and solutions for network congestion avoidance are
also applicable to the forwarding plane telemetry.
Although not specific to the forwarding plane, these challenges are
more difficult for the forwarding plane because of the limited
resources and flexibility. Data plane programmability is essential
to support network telemetry. Newer data plane forwarding chips are
equipped with advanced telemetry features and provide flexibility to
support customized telemetry functions.
Technique Taxonomy: This pertains to how one instruments the
telemetry; there can be multiple possible dimensions to classify the
forwarding plane telemetry techniques.
* Active, Passive, and Hybrid: This dimension pertains to the end-
to-end measurement. Active and passive methods (as well as the
hybrid types) are well documented in [RFC7799]. Passive methods
include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring.
These methods usually have low data coverage, and improving the
coverage incurs a very high bandwidth cost. On the other
hand, active methods include Ping, the One-Way Active Measurement
Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement
Protocol (TWAMP) [RFC5357], the Simple Two-way Active Measurement
Protocol (STAMP) [RFC8762], and Cisco's SLA Protocol [RFC6812].
These methods are intrusive and only provide indirect network
measurements. Hybrid methods, including IOAM [RFC9197], Alternate
Marking (AM) [RFC8321], and Multipoint Alternate Marking
[RFC8889], provide a well-balanced and more flexible approach.
However, these methods are also more complex to implement.
* In-Band and Out-of-Band: Telemetry data carried in user packets
before being exported to a data collector is considered in-band
(e.g., IOAM [RFC9197]). Telemetry data that is directly exported
to a data collector without modifying user packets is considered
out-of-band (e.g., the postcard-based approach described in
Appendix A.3.5). It is also possible to have hybrid methods,
where only the telemetry instruction or partial data is carried by
user packets (e.g., AM [RFC8321]).
* End-to-End and In-Network: End-to-end methods start from, and end
at, the network end hosts (e.g., Ping). In-network methods work
in networks and are transparent to end hosts. However, if needed,
in-network methods can be easily extended into end hosts.
* Data Subject: Depending on the telemetry objective, the methods
can be flow based (e.g., IOAM [RFC9197]), path based (e.g.,
Traceroute), and node based (e.g., IPFIX [RFC7011]). The various
data objects can be packet, flow record, measurement, states, and
signal.
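The in-band versus out-of-band distinction in the taxonomy above can
be illustrated with a toy model. This is a sketch only: the packet
and record layouts are hypothetical and do not represent the IOAM
wire format or any postcard encoding.

```python
def forward_in_band(packet, path):
    """In-band: each node appends its metadata to the user packet."""
    for node in path:
        packet["trace"].append({"node": node, "seq": packet["seq"]})
    # The last node strips the trace and exports it as a single report.
    exported = packet.pop("trace")
    return packet, [exported]

def forward_postcard(packet, path):
    """Out-of-band: each node exports its own record; packet is untouched."""
    exports = [[{"node": node, "seq": packet["seq"]}] for node in path]
    return packet, exports

path = ["A", "B", "C"]
_, in_band = forward_in_band({"seq": 7, "trace": []}, path)
_, postcards = forward_postcard({"seq": 7}, path)
```

The in-band variant produces one export that holds the whole path,
at the cost of growing the user packet at every hop. The postcard
variant leaves the packet unchanged but produces one export per node
that the collector must correlate (here by the "seq" field).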
3.1.4. External Data Telemetry
Events that occur outside the boundaries of the network system are
another important source of network telemetry. Correlating both
internal telemetry data and external events with the requirements of
network systems, as presented in [NMRG-ANTICIPATED-ADAPTATION],
provides a strategic and functional advantage to management
operations.
As with other sources of telemetry information, the data and events
must meet strict requirements, especially in terms of timeliness,
which is essential to properly incorporate external event information
into network management applications. The specific challenges are
described as follows:
* The role of the external event detector can be played by multiple
elements, including hardware (e.g., physical sensors, such as
seismometers) and software (e.g., big data sources that can
analyze streams of information, such as Twitter messages). Thus,
the transmitted data must support different shapes but, at the
same time, follow a common but extensible schema.
* Since the main function of the external event detectors is to
perform the notifications, their timeliness is assumed. However,
once messages have been dispatched, they must be quickly collected
and inserted into the control plane with variable priority, which
is higher for important sources and events and lower for secondary
ones.
* The schema used by external detectors must be easily adopted by
current and future devices and applications. Therefore, it must
be easily mapped to current data models, such as those defined in YANG.
* As the communication with external entities outside the boundary
of a provider network may be realized over the Internet, the risk
of congestion is even more relevant in this context and proper
countermeasures must be taken. Solutions such as network
transport circuit breakers are needed as well.
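The variable-priority handling of external events described in the
list above can be sketched with a priority queue. The EventQueue
class, the numeric priority scale, and the example sources are
assumptions made for illustration only.

```python
import heapq
import itertools

class EventQueue:
    """Deliver external events by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal priorities

    def push(self, priority, event):
        # A lower number means a more important source or event.
        heapq.heappush(self._heap, (priority, next(self._order), event))

    def pop(self):
        return heapq.heappop(self._heap)[2]

queue = EventQueue()
queue.push(5, {"source": "social-feed", "type": "outage-rumor"})
queue.push(1, {"source": "seismometer", "type": "earthquake"})
queue.push(5, {"source": "weather", "type": "storm-warning"})
drained = [queue.pop()["source"] for _ in range(3)]
```

The seismometer event is delivered first even though it arrived
second; the two lower-priority events keep their arrival order.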
Organizing both internal and external telemetry information together
will be key for the general exploitation of the management
possibilities of current and future network systems, as reflected in
the incorporation of cognitive capabilities to new hardware and
software (virtual) elements.
3.2. Second-Level Function Components
The telemetry module at each plane can be further partitioned into
five distinct conceptual components:
* Data Query, Analysis, and Storage: This component works at the
network operation application block in Figure 1. It is normally a
part of the network management system at the receiver side. On
one hand, it is responsible for issuing data requirements. The
data of interest can be modeled data through configuration or
custom data through programming. The data requirements can be
queries for one-shot data or subscriptions for events or streaming
data. On the other hand, it receives, stores, and processes the
returned data from network devices. Data analysis can be
interactive to initiate further data queries. This component can
reside in either network devices or remote controllers. It can be
centralized or distributed and involve one or more instances.
* Data Configuration and Subscription: This component manages data
queries on devices. It determines the protocol and channel for
applications to acquire desired data. This component is also
responsible for configuring the desired data that might not be
directly available from data sources. The subscription data can
be described by models, templates, or programs.
* Data Encoding and Export: This component determines how telemetry
data is delivered to the data analysis and storage component with
access control. The data encoding and the transport protocol may
vary due to the data export location.
* Data Generation and Processing: The requested data needs to be
captured, filtered, processed, and formatted in network devices
from raw data sources. This may involve in-network computing and
processing on either the fast path or the slow path in network
devices.
* Data Object and Source: This component determines the monitoring
objects and original data sources provisioned in the device. A
data source usually just provides raw data that needs further
processing. Each data source can be considered a probe. Some
data sources can be dynamically installed, while others will be
more static.
  +----------------------------------------+
+----------------------------------------+ |
|                                        | |
|    Data Query, Analysis, & Storage     | |
|                                        | +
+-------+++ -----------------------------+
        |||                   ^^^
        |||                   |||
        ||V                   |||
     +--+V--------------------+++------------+
  +-----V---------------------+------------+ |
+---------------------+-------+----------+ | |
| Data Configuration  |                  | | |
| & Subscription      | Data Encoding    | | |
| (model, template,   | & Export         | | |
| & program)          |                  | | |
+---------------------+------------------| | |
|                                        | | |
|           Data Generation              | | |
|           & Processing                 | | |
|                                        | | |
+----------------------------------------| | |
|                                        | | |
|         Data Object and Source         | |-+
|                                        |-+
+----------------------------------------+
Figure 2: Components in the Network Telemetry Framework
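To make the division of labor among the five components concrete,
the following minimal Python sketch maps each one to a function or
class. Everything here is hypothetical: the counter names, the
prefix-based subscription, and the choice of JSON as the encoding
are assumptions for illustration, not part of the framework.

```python
import json

# Data Object and Source: raw counters exposed by a hypothetical device.
RAW_SOURCE = {"if0.rx_packets": 1200, "if0.tx_packets": 900, "if1.rx_packets": 15}

def configure(subscription):
    """Data Configuration and Subscription: map a request to source keys."""
    prefix = subscription["object"]
    return [key for key in RAW_SOURCE if key.startswith(prefix)]

def generate(keys):
    """Data Generation and Processing: capture and filter the raw data."""
    return {key: RAW_SOURCE[key] for key in sorted(keys)}

def export(data):
    """Data Encoding and Export: encode the data for transport."""
    return json.dumps(data, sort_keys=True)

class Collector:
    """Data Query, Analysis, and Storage: issue requests and keep results."""

    def __init__(self):
        self.store = []

    def request(self, obj):
        message = export(generate(configure({"object": obj})))
        self.store.append(json.loads(message))
        return self.store[-1]

collector = Collector()
result = collector.request("if0.")
```

In a real system the collector and the device-side components run
in different places and communicate over a transport protocol; the
direct function calls here only show the order of responsibilities.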
3.3. Data Acquisition Mechanism and Type Abstraction
Broadly speaking, network data can be acquired through subscription
(push) and query (poll). A subscription is a contract between
publisher and subscriber. After initial setup, the subscribed data
is automatically delivered to registered subscribers until the
subscription expires. There are two variations of subscription. The
subscriptions can be predefined, or the subscribers are allowed to
configure and tailor the published data to their specific needs.
In contrast, queries are used when a client expects immediate and
one-off feedback from network devices. The queried data may be
directly extracted from some specific data source or synthesized and
processed from raw data. Queries work well for interactive network
telemetry applications.
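The contrast between subscription (push) and query (poll) described
above can be sketched as follows. The Publisher class, its method
names, and the path string are assumptions for this example;
subscription expiry and tailoring of the published data are
deliberately left out to keep the sketch short.

```python
class Publisher:
    """Toy data source that supports both subscription and query."""

    def __init__(self):
        self._state = {}
        self._subscribers = {}  # path -> list of callbacks

    def subscribe(self, path, callback):
        """Set up once; later updates to 'path' are pushed to 'callback'."""
        self._subscribers.setdefault(path, []).append(callback)

    def query(self, path):
        """Return the current value as immediate, one-off feedback."""
        return self._state.get(path)

    def update(self, path, value):
        """Record a new value and push it to every subscriber of 'path'."""
        self._state[path] = value
        for callback in self._subscribers.get(path, []):
            callback(path, value)

publisher = Publisher()
pushed = []
publisher.subscribe("if0/oper-state", lambda path, value: pushed.append(value))
publisher.update("if0/oper-state", "up")
publisher.update("if0/oper-state", "down")
polled = publisher.query("if0/oper-state")
```

The subscriber receives both state changes without asking again,
whereas the query sees only the value that is current when it is
issued.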
In general, data can be pulled (i.e., queried) whenever needed, but
in many cases, pushing the data (i.e., subscription) is more
efficient, and it can reduce the latency of a client detecting a
change. From the data consumer point of view, there are four types
of data from network devices that a telemetry data consumer can
subscribe to or query:
* Simple Data: Data that are steadily available from some datastore
or static probes in network devices.
* Derived Data: Data that need to be synthesized or processed in the
network from raw data from one or more network devices. The data
processing function can be statically or dynamically loaded into
network devices.
* Event-triggered Data: Data that are conditionally acquired based
on the occurrence of some events. An example of event-triggered
data could be an interface changing operational state between up
and down. Such data can be actively pushed through subscription
or passively polled through query. There are many ways to model
events, including using Finite State Machine (FSM) or Event
Condition Action (ECA) [NETMOD-ECA-POLICY].
* Streaming Data: Data that are continuously generated. They can be a
time series or the dump of databases. For example, an interface
packet counter is exported every second. The streaming data
reflect real-time network states and metrics and require large
bandwidth and processing power. The streaming data are always
actively pushed to the subscribers.
The above telemetry data types are not mutually exclusive. Rather,
they are often composite. Derived data is composed of simple data;
event-triggered data can be simple or derived; and streaming data can
be based on some recurring event. The relationships of these data
types are illustrated in Figure 3.
+----------------------+     +-----------------+
| Event-Triggered Data |<----+ Streaming Data  |
+-------+---+----------+     +-----+---+-------+
        |   |                      |   |
        |   |                      |   |
        |   |   +--------------+   |   |
        |   +-->| Derived Data |<--+   |
        |       +------+-------+       |
        |              |               |
        |              V               |
        |       +--------------+       |
        +------>| Simple Data  |<------+
                +--------------+
Figure 3: Data Type Relationship
Subscription usually deals with event-triggered data and streaming
data, and query usually deals with simple data and derived data.
However, the other combinations are also possible. Advanced network
telemetry techniques are designed mainly for event-triggered or
streaming data subscription and derived data query.
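The way the four data types build on one another can be shown with
a small Python sketch. The datastore contents and function names are
hypothetical, and the time index stands in for a real clock.

```python
# Simple data: values read directly from a hypothetical datastore,
# one entry per sampling interval.
datastore = {
    "rx_bytes": [1000, 1500, 1500, 4000],
    "oper_state": ["up", "up", "down", "up"],
}

def simple(name, t):
    """Simple data: the stored value of 'name' at interval t."""
    return datastore[name][t]

def derived_rate(t):
    """Derived data: bytes received during interval t, from simple data."""
    return simple("rx_bytes", t) - simple("rx_bytes", t - 1)

def event_triggered():
    """Event-triggered data: emitted only when the state changes."""
    for t in range(1, len(datastore["oper_state"])):
        if simple("oper_state", t) != simple("oper_state", t - 1):
            yield (t, simple("oper_state", t))

def streaming():
    """Streaming data: a continuous series driven by a recurring event."""
    for t in range(1, len(datastore["rx_bytes"])):
        yield (t, derived_rate(t))

events = list(event_triggered())
stream = list(streaming())
```

Here the stream carries derived data at every interval, while the
event-triggered output carries simple data only at the two state
transitions.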
3.4. Mapping Existing Mechanisms into the Framework
The following table shows how the existing mechanisms (mainly
published by the IETF, with an emphasis on the latest technologies)
are positioned in the framework. Given the vast body of existing
work, we cannot provide an exhaustive list, so the mechanisms in the
table should be considered as just examples.
Also, some comprehensive protocols and techniques may cover multiple
aspects or modules of the framework, so a name in a block only
emphasizes one particular characteristic of it. More details about
some listed mechanisms can be found in Appendix A.
+===============+=================+================+============+
|               | Management      | Control Plane  | Forwarding |
|               | Plane           |                | Plane      |
+===============+=================+================+============+
| data          | gNMI, NETCONF,  | gNMI, NETCONF, | NETCONF,   |
| configuration | RESTCONF, SNMP, | RESTCONF,      | RESTCONF,  |
| and subscribe | YANG-Push       | YANG-Push      | YANG-Push  |
+---------------+-----------------+----------------+------------+
| data          | MIB, YANG       | YANG           | IOAM,      |
| generation    |                 |                | PSAMP,     |
| and process   |                 |                | PBT, AM    |
+---------------+-----------------+----------------+------------+
| data encoding | gRPC, HTTP, TCP | BMP, TCP       | IPFIX, UDP |
| and export    |                 |                |            |
+---------------+-----------------+----------------+------------+
Table 2: Existing Work Mapping
Although the framework is generally suitable for any network
environment, multi-domain telemetry has some unique challenges that
deserve further architectural consideration, which is out of the
scope of this document.
4. Evolution of Network Telemetry Applications
Network telemetry is an evolving technical area. As networks move
toward automated operation, network telemetry applications undergo
several stages of evolution, each of which adds a new layer of
requirements to the underlying network telemetry techniques. Each
stage builds upon the techniques adopted by the previous stages and
adds some new requirements.
Stage 0 - Static Telemetry: The telemetry data source and type are
determined at design time. The network operator can only
configure how to use it with limited flexibility.
Stage 1 - Dynamic Telemetry: The custom telemetry data can be
dynamically programmed or configured at runtime without
interrupting the network operation, allowing a trade-off among
resource, performance, flexibility, and coverage.
Stage 2 - Interactive Telemetry: The network operator can
continuously customize and fine tune the telemetry data in real
time to reflect the network operation's visibility requirements.
Compared with Stage 1, the changes are frequent based on the real-
time feedback. At this stage, some tasks can be automated, but
human operators still need to sit in the middle to make decisions.
Stage 3 - Closed-Loop Telemetry: The telemetry is free from the
intervention of human operators, except for generating the
reports. The intelligent network operation engine automatically
issues the telemetry data requests, analyzes the data, and updates
the network operations in closed control loops.
Existing technologies are ready for Stages 0 and 1. Individual
applications for Stages 2 and 3 are also possible now. However,
future autonomic networks may need a comprehensive operation
management system that works at Stages 2 and 3 to cover all the
network operation tasks. A well-defined network telemetry framework
is the first step in this direction.
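To make the Stage 3 behavior more concrete, the following Python
sketch shows one iteration of a closed control loop: the engine
requests telemetry data, analyzes it, and updates the network without
a human in the middle. It is a toy under stated assumptions: the
"network" is a plain dictionary, and the function name, the
utilization threshold, and the rebalancing action are all
hypothetical.

```python
def closed_loop_step(network: dict, max_util: float = 0.8) -> dict:
    """One iteration of a hypothetical closed control loop.

    1. Request telemetry data (here: read link utilization).
    2. Analyze it (compare against a threshold).
    3. Update the network operation (here: shift load to a backup link).
    """
    report = {"actions": []}
    # sorted() copies the items, so updating the dict in the loop is safe.
    for link, util in sorted(network["utilization"].items()):
        if util > max_util:
            excess = util - max_util
            network["utilization"][link] = max_util
            network["backup_load"] += excess
            report["actions"].append(f"rebalanced {link}")
    # The report is the only artifact handed to human operators.
    return report


net = {"utilization": {"link-a": 0.95, "link-b": 0.40}, "backup_load": 0.0}
report = closed_loop_step(net)
```

In Stage 2, by contrast, the decision in step 3 would be presented to
a human operator instead of being applied automatically.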
5. Security Considerations
The complexity of network telemetry raises significant security
implications. For example, telemetry data can be manipulated to
exhaust various network resources at each plane as well as the data
consumer; falsified or tampered data can mislead the decision-making
process and paralyze networks; and wrong configuration and
programming for telemetry is equally harmful. Telemetry data is
highly sensitive because it exposes a lot of information about the
network and its configuration. Some of that information can make
designing attacks against the network much easier (e.g., exact
details of what software and patches have been installed) and can
allow an attacker to determine whether a device may be subject to
unprotected security vulnerabilities.
Given that this document has proposed a framework for network
telemetry and the telemetry mechanisms discussed are more extensive
(in both message frequency and traffic amount) than the conventional
network OAM concepts, we must also anticipate that new security
considerations may arise. A number of techniques already exist for
securing the forwarding plane, control plane, and management plane in
a network, but it is important to consider whether any new threat
vectors are now being enabled via the use of network telemetry
procedures and mechanisms.
This document proposes a conceptual architecture for collecting,
transporting, and analyzing a wide variety of data sources in support
of network applications. The protocols, data formats, and
configurations chosen to implement this framework will dictate the
specific security considerations. These considerations may include:
* Telemetry framework trust and policy models;
* Role management and access control for enabling and disabling
telemetry capabilities;
* Protocol transport used for telemetry data and its inherent
security capabilities;
* Telemetry data stores, storage encryption, methods of access, and
retention practices;
* Tracking telemetry events and any abnormalities that might
identify malicious attacks using telemetry interfaces;
* Authentication and integrity protection of telemetry data to make
data more trustworthy; and
* Segregating the telemetry data traffic from the data traffic
carried over the network (e.g., historically management access and
management data may be carried via an independent management
network).
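As one illustration of the authentication and integrity protection
item in the list above, the sketch below attaches an HMAC-SHA256 tag
to a telemetry record and verifies it at the collector. It uses only
the Python standard library. The record fields, the function names,
and the pre-shared key are assumptions made for this example; a real
deployment would rely on the security capabilities of the chosen
transport and on proper key management.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 tag computed over a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "tag": tag}


def verify_record(message: dict, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, message["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


key = b"demo-shared-secret"  # assumption: pre-shared key, illustration only
msg = sign_record({"device": "r1", "if": "eth0", "rx_packets": 1234}, key)
ok = verify_record(msg, key)

# A record modified in transit no longer matches its tag.
tampered = dict(msg, payload=msg["payload"].replace("1234", "9999"))
bad = verify_record(tampered, key)
```

The tag protects integrity and authenticates the sender to anyone
holding the key; it does not provide confidentiality.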
Some security considerations highlighted above may be minimized or
negated with policy management of network telemetry. In a network
telemetry deployment, it would be advantageous to separate telemetry
capabilities into different classes of policies, i.e., Role-Based
Access Control and Event-Condition-Action policies. Also, potential
conflicts between network telemetry mechanisms must be detected
accurately and resolved quickly to avoid unnecessary network
telemetry traffic propagation escalating into an unintended or
intended denial-of-service attack.
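The two policy classes mentioned above can be sketched as follows.
This is a minimal illustration, not a policy model defined by this
document: the roles, operations, event names, thresholds, and action
strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical role-to-permission mapping (Role-Based Access Control).
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "operator": {"subscribe", "query"},
    "admin": {"subscribe", "query", "configure"},
}


def is_allowed(role: str, operation: str) -> bool:
    """Return True if the role may perform the telemetry operation."""
    return operation in ROLE_PERMISSIONS.get(role, set())


@dataclass
class EcaPolicy:
    """Event-Condition-Action policy: on `event`, if `condition`, do `action`."""
    event: str
    condition: Callable[[dict], bool]
    action: str


def evaluate(policies: List[EcaPolicy], event: str, data: dict) -> List[str]:
    """Return the actions of all policies matching the event and condition."""
    return [p.action for p in policies if p.event == event and p.condition(data)]


policies = [
    EcaPolicy("export-rate-sample", lambda d: d["pps"] > 10_000, "throttle-export"),
    EcaPolicy("export-rate-sample", lambda d: d["pps"] > 50_000, "suspend-subscription"),
]
actions = evaluate(policies, "export-rate-sample", {"pps": 20_000})
```

Policies of this kind could, for example, throttle telemetry export
before it escalates into a denial-of-service condition.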
Further study of the security issues will be required, and it is
expected that security mechanisms and protocols will be developed and
deployed along with a network telemetry system.
6. IANA Considerations
This document has no IANA actions.
7. Informative References
[gnmi] Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
C., and C. Marrow, "gRPC Network Management Interface",
IETF 98, March 2017,
<https://datatracker.ietf.org/meeting/98/materials/slides-
98-rtgwg-gnmi-intro-draft-openconfig-rtgwg-gnmi-spec-00>.
[gpb] Google Developers, "Protocol Buffers",
<https://developers.google.com/protocol-buffers>.
[grpc] gRPC, "gRPC: A high performance, open source universal RPC
framework", <https://grpc.io>.
[IPPM-IOAM-DIRECT-EXPORT]
Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F.,
Bhandari, S., Ed., Sivakolundu, R., and T. Mizrahi, Ed.,
"In-situ OAM Direct Exporting", Work in Progress,
Internet-Draft, draft-ietf-ippm-ioam-direct-export-07, 13
October 2021, <https://datatracker.ietf.org/doc/html/
draft-ietf-ippm-ioam-direct-export-07>.
[IPPM-POSTCARD-BASED-TELEMETRY]
Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou,
T., Li, Z., Mishra, G., Shin, J., and K. Lee, "In-Situ OAM
Marking-based Direct Export", Work in Progress, Internet-
Draft, draft-song-ippm-postcard-based-telemetry-12, 12 May
2022, <https://datatracker.ietf.org/doc/html/draft-song-
ippm-postcard-based-telemetry-12>.
[NETCONF-DISTRIB-NOTIF]
Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois,
"Subscription to Distributed Notifications", Work in
Progress, Internet-Draft, draft-ietf-netconf-distributed-
notif-03, 10 January 2022,
<https://datatracker.ietf.org/doc/html/draft-ietf-netconf-
distributed-notif-03>.
[NETCONF-UDP-NOTIF]
Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H.,
and P. Lucente, "UDP-based Transport for Configured
Subscriptions", Work in Progress, Internet-Draft, draft-
ietf-netconf-udp-notif-05, 4 March 2022,
<https://datatracker.ietf.org/doc/html/draft-ietf-netconf-
udp-notif-05>.
[NETMOD-ECA-POLICY]
Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise,
"A YANG Data model for ECA Policy Management", Work in
Progress, Internet-Draft, draft-ietf-netmod-eca-policy-01,
19 February 2021, <https://datatracker.ietf.org/doc/html/
draft-ietf-netmod-eca-policy-01>.
[NMRG-ANTICIPATED-ADAPTATION]
Martinez-Julia, P., Ed., "Exploiting External Event
Detectors to Anticipate Resource Requirements for the
Elastic Adaptation of SDN/NFV Systems", Work in Progress,
Internet-Draft, draft-pedro-nmrg-anticipated-adaptation-
02, 29 June 2018, <https://datatracker.ietf.org/doc/html/
draft-pedro-nmrg-anticipated-adaptation-02>.
[NMRG-IBN-CONCEPTS-DEFINITIONS]
Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
Tantsura, "Intent-Based Networking - Concepts and
Definitions", Work in Progress, Internet-Draft, draft-
irtf-nmrg-ibn-concepts-definitions-09, 24 March 2022,
<https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-
ibn-concepts-definitions-09>.
[OPSAWG-DNP4IQ]
Song, H., Ed. and J. Gong, "Requirements for Interactive
Query with Dynamic Network Probes", Work in Progress,
Internet-Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017,
<https://datatracker.ietf.org/doc/html/draft-song-opsawg-
dnp4iq-01>.
[OPSAWG-IFIT-FRAMEWORK]
Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "A
Framework for In-situ Flow Information Telemetry", Work in
Progress, Internet-Draft, draft-song-opsawg-ifit-
framework-17, 22 February 2022,
<https://datatracker.ietf.org/doc/html/draft-song-opsawg-
ifit-framework-17>.
[RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin,
"Simple Network Management Protocol (SNMP)", RFC 1157,
DOI 10.17487/RFC1157, May 1990,
<https://www.rfc-editor.org/info/rfc1157>.
[RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J.
Schoenwaelder, Ed., "Structure of Management Information
Version 2 (SMIv2)", STD 58, RFC 2578,
DOI 10.17487/RFC2578, April 1999,
<https://www.rfc-editor.org/info/rfc2578>.
[RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981,
DOI 10.17487/RFC2981, October 2000,
<https://www.rfc-editor.org/info/rfc2981>.
[RFC3176] Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's
sFlow: A Method for Monitoring Traffic in Switched and
Routed Networks", RFC 3176, DOI 10.17487/RFC3176,
September 2001, <https://www.rfc-editor.org/info/rfc3176>.
[RFC3411] Harrington, D., Presuhn, R., and B. Wijnen, "An
Architecture for Describing Simple Network Management
Protocol (SNMP) Management Frameworks", STD 62, RFC 3411,
DOI 10.17487/RFC3411, December 2002,
<https://www.rfc-editor.org/info/rfc3411>.
[RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations
for the Simple Network Management Protocol (SNMP)",
STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002,
<https://www.rfc-editor.org/info/rfc3416>.
[RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management
Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
September 2004, <https://www.rfc-editor.org/info/rfc3877>.
[RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export
Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004,
<https://www.rfc-editor.org/info/rfc3954>.
[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
Zekauskas, "A One-way Active Measurement Protocol
(OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006,
<https://www.rfc-editor.org/info/rfc4656>.
[RFC5085] Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual
Circuit Connectivity Verification (VCCV): A Control
Channel for Pseudowires", RFC 5085, DOI 10.17487/RFC5085,
December 2007, <https://www.rfc-editor.org/info/rfc5085>.
[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
RFC 5357, DOI 10.17487/RFC5357, October 2008,
<https://www.rfc-editor.org/info/rfc5357>.
[RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424,
DOI 10.17487/RFC5424, March 2009,
<https://www.rfc-editor.org/info/rfc5424>.
[RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for
the Network Configuration Protocol (NETCONF)", RFC 6020,
DOI 10.17487/RFC6020, October 2010,
<https://www.rfc-editor.org/info/rfc6020>.
[RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
and A. Bierman, Ed., "Network Configuration Protocol
(NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
<https://www.rfc-editor.org/info/rfc6241>.
[RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
S., and E. Yedavalli, "Cisco Service-Level Assurance
Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013,
<https://www.rfc-editor.org/info/rfc6812>.
[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
"Specification of the IP Flow Information Export (IPFIX)
Protocol for the Exchange of Flow Information", STD 77,
RFC 7011, DOI 10.17487/RFC7011, September 2013,
<https://www.rfc-editor.org/info/rfc7011>.
[RFC7258] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an
Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May
2014, <https://www.rfc-editor.org/info/rfc7258>.
[RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
Weingarten, "An Overview of Operations, Administration,
and Maintenance (OAM) Tools", RFC 7276,
DOI 10.17487/RFC7276, June 2014,
<https://www.rfc-editor.org/info/rfc7276>.
[RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
DOI 10.17487/RFC7540, May 2015,
<https://www.rfc-editor.org/info/rfc7540>.
[RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
Networking: Definitions and Design Goals", RFC 7575,
DOI 10.17487/RFC7575, June 2015,
<https://www.rfc-editor.org/info/rfc7575>.
[RFC7799] Morton, A., "Active and Passive Metrics and Methods (with
Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
May 2016, <https://www.rfc-editor.org/info/rfc7799>.
[RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
Monitoring Protocol (BMP)", RFC 7854,
DOI 10.17487/RFC7854, June 2016,
<https://www.rfc-editor.org/info/rfc7854>.
[RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
RFC 7950, DOI 10.17487/RFC7950, August 2016,
<https://www.rfc-editor.org/info/rfc7950>.
[RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
<https://www.rfc-editor.org/info/rfc8040>.
[RFC8084] Fairhurst, G., "Network Transport Circuit Breakers",
BCP 208, RFC 8084, DOI 10.17487/RFC8084, March 2017,
<https://www.rfc-editor.org/info/rfc8084>.
[RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
March 2017, <https://www.rfc-editor.org/info/rfc8085>.
[RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
Interchange Format", STD 90, RFC 8259,
DOI 10.17487/RFC8259, December 2017,
<https://www.rfc-editor.org/info/rfc8259>.
[RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
"Alternate-Marking Method for Passive and Hybrid
Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
January 2018, <https://www.rfc-editor.org/info/rfc8321>.
[RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard,
E., and A. Tripathy, "Subscription to YANG Notifications",
RFC 8639, DOI 10.17487/RFC8639, September 2019,
<https://www.rfc-editor.org/info/rfc8639>.
[RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications
for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
September 2019, <https://www.rfc-editor.org/info/rfc8641>.
[RFC8671] Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S.
Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring
Protocol (BMP)", RFC 8671, DOI 10.17487/RFC8671, November
2019, <https://www.rfc-editor.org/info/rfc8671>.
[RFC8762] Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple
Two-Way Active Measurement Protocol", RFC 8762,
DOI 10.17487/RFC8762, March 2020,
<https://www.rfc-editor.org/info/rfc8762>.
[RFC8889] Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto,
"Multipoint Alternate-Marking Method for Passive and
Hybrid Performance Monitoring", RFC 8889,
DOI 10.17487/RFC8889, August 2020,
<https://www.rfc-editor.org/info/rfc8889>.
[RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan,
R., and A. Ghanwani, "Service Function Chaining (SFC)
Operations, Administration, and Maintenance (OAM)
Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020,
<https://www.rfc-editor.org/info/rfc8924>.
[RFC9069] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
"Support for Local RIB in the BGP Monitoring Protocol
(BMP)", RFC 9069, DOI 10.17487/RFC9069, February 2022,
<https://www.rfc-editor.org/info/rfc9069>.
[RFC9197] Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi,
Ed., "Data Fields for In Situ Operations, Administration,
and Maintenance (IOAM)", RFC 9197, DOI 10.17487/RFC9197,
May 2022, <https://www.rfc-editor.org/info/rfc9197>.
[W3C.REC-xml-20081126]
Bray, T., Paoli, J., Sperberg-McQueen, M., Maler, E., and
F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
Edition)", World Wide Web Consortium Recommendation REC-
xml-20081126, November 2008,
<https://www.w3.org/TR/2008/REC-xml-20081126>.
[y1731] ITU-T, "Operations, administration and maintenance (OAM)
functions and mechanisms for Ethernet-based networks",
ITU-T Recommendation G.8013/Y.1731, August 2015,
<https://www.itu.int/rec/T-REC-Y.1731/en>.
Appendix A. A Survey on Existing Network Telemetry Techniques
In this non-normative appendix, we provide an overview of some
existing techniques and standard proposals for each network telemetry
module.
A.1. Management Plane Telemetry
A.1.1. Push Extensions for NETCONF
NETCONF [RFC6241] is a popular network management protocol
recommended by the IETF. Its core strength is managing
configuration, but it can also be used for data collection.
YANG-Push [RFC8639] [RFC8641] extends NETCONF and enables subscriber
applications to request a continuous, customized stream of updates
from a YANG datastore. Providing such visibility into changes made
upon YANG configuration and operational objects enables new
capabilities based on the remote mirroring of configuration and
operational state. Moreover, a distributed data collection mechanism
[NETCONF-DISTRIB-NOTIF] via a UDP-based publication channel
[NETCONF-UDP-NOTIF] provides enhanced efficiency for the NETCONF-
based telemetry.
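The on-change style of subscription can be illustrated with a toy
in-memory datastore. The sketch below is not the NETCONF or YANG-Push
protocol and does not use any NETCONF library: the ToyDatastore
class, its methods, and the path-prefix filter are hypothetical
stand-ins that only show the push behavior (updates are sent when a
subscribed value changes, not when it is polled).

```python
from typing import Callable, Dict, List, Tuple

Update = Tuple[str, object]


class ToyDatastore:
    """In-memory stand-in for a YANG datastore (hypothetical, not a NETCONF API)."""

    def __init__(self) -> None:
        self._data: Dict[str, object] = {}
        self._subs: List[Tuple[str, Callable[[Update], None]]] = []

    def subscribe_on_change(self, prefix: str, callback: Callable[[Update], None]) -> None:
        """Register interest in every node whose path starts with `prefix`."""
        self._subs.append((prefix, callback))

    def set(self, path: str, value: object) -> None:
        """Write a value and push an update only if the value actually changed."""
        if path in self._data and self._data[path] == value:
            return
        self._data[path] = value
        for prefix, callback in self._subs:
            if path.startswith(prefix):
                callback((path, value))


received: List[Update] = []
ds = ToyDatastore()
ds.subscribe_on_change("/interfaces/", received.append)
ds.set("/interfaces/eth0/oper-status", "up")
ds.set("/interfaces/eth0/oper-status", "up")    # unchanged: nothing is pushed
ds.set("/system/hostname", "r1")                # outside the subscription filter
ds.set("/interfaces/eth0/oper-status", "down")
```

Only the two real state changes under the subscribed subtree reach
the subscriber, which is what makes on-change subscription cheaper
than periodic polling for rarely changing state.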
A.1.2. gRPC Network Management Interface
gRPC Network Management Interface (gNMI) [gnmi] is a network
management protocol based on the gRPC [grpc] Remote Procedure Call
(RPC) framework. With a single gRPC service definition, both
configuration and telemetry can be covered. gRPC is an open-source
micro-service communication framework based on HTTP/2 [RFC7540]. It
provides a number of capabilities that are well-suited for network
telemetry, including:
* A full-duplex streaming transport model; when combined with a
binary encoding mechanism, it provides good telemetry efficiency.
* A higher-level feature consistency across platforms that common
HTTP/2 libraries typically do not provide. This characteristic is
especially valuable because telemetry data collectors normally
reside on a large variety of platforms.
* A built-in load-balancing and failover mechanism.
A.2. Control Plane Telemetry
A.2.1. BGP Monitoring Protocol
BMP [RFC7854] is used to monitor BGP sessions and is intended to
provide a convenient interface for obtaining route views.
BGP routing information is sent from the monitored device(s) to the
BMP monitoring station over a BMP TCP session. The BGP peers are
monitored by the BMP Peer Up and Peer Down notifications. The BGP
routes (including Adj-RIB-In [RFC7854], Adj-RIB-Out [RFC8671], and
Local RIB [RFC9069]) are encapsulated in the BMP Route Monitoring
Message and the BMP Route Mirroring Message, providing both an
initial table dump and real-time route updates. In addition, BGP
statistics are reported through the BMP Stats Report Message, which
can be either timer-triggered or event-driven. Future BMP
extensions could further enrich BGP monitoring applications.
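As a small illustration of how a monitoring station might begin to
process the BMP TCP stream, the sketch below decodes the 6-octet BMP
common header (version, message length, message type) described in
[RFC7854]. The function name and the returned dictionary layout are
choices made for this example; per-peer headers and message bodies
are not handled.

```python
import struct

# Message types from the BMP common header [RFC7854].
BMP_MESSAGE_TYPES = {
    0: "Route Monitoring",
    1: "Statistics Report",
    2: "Peer Down Notification",
    3: "Peer Up Notification",
    4: "Initiation",
    5: "Termination",
    6: "Route Mirroring",
}

# Network byte order: 1-octet version, 4-octet length, 1-octet type.
COMMON_HEADER = struct.Struct("!BIB")


def parse_bmp_common_header(data: bytes) -> dict:
    """Decode the common header at the start of a BMP message."""
    if len(data) < COMMON_HEADER.size:
        raise ValueError("truncated BMP common header")
    version, length, msg_type = COMMON_HEADER.unpack_from(data)
    if version != 3:
        raise ValueError(f"unsupported BMP version {version}")
    return {
        "length": length,  # total message length, including this header
        "type": BMP_MESSAGE_TYPES.get(msg_type, f"unknown ({msg_type})"),
    }


# A header-only sample: version 3, length 6, type 4 (Initiation).
sample = COMMON_HEADER.pack(3, 6, 4)
header = parse_bmp_common_header(sample)
```

A collector would read `length` octets from the session for each
message and dispatch on the decoded type.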
A.3. Data Plane Telemetry
A.3.1. Alternate-Marking (AM) Technology
The Alternate-Marking method enables efficient measurements of packet
loss, delay, and jitter in both IP and overlay networks, as presented
in [RFC8321] and [RFC8889].
This technique can be applied to point-to-point and multipoint-to-
multipoint flows. Alternate Marking creates batches of packets by
alternating the value of 1 bit (or a label) of the packet header.
These batches of packets are unambiguously recognized over the
network, and the comparison of packet counters for each batch allows
the packet loss calculation. The same idea can be applied to delay
measurement by selecting ad hoc packets with a marking bit dedicated
for delay measurements.
The Alternate-Marking method needs two counters per marking period
for each monitored flow. For instance, with n measurement points and
m monitored flows, the number of packet counters for each time
interval is on the order of n*m*2 (1 per color).
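The arithmetic behind the method can be sketched in Python. The
helper names below are hypothetical, and the example assumes that the
marking color simply alternates every fixed period and that the
upstream and downstream counters have already been collected for the
same batches.

```python
from typing import List


def color_of(timestamp: float, period: float) -> int:
    """Marking bit for a packet: alternates between 0 and 1 every period."""
    return int(timestamp // period) % 2


def loss_per_batch(upstream: List[int], downstream: List[int]) -> List[int]:
    """Packet loss per marking period: packets counted upstream minus
    packets counted downstream for the same batch."""
    return [tx - rx for tx, rx in zip(upstream, downstream)]


def counters_per_interval(points: int, flows: int, colors: int = 2) -> int:
    """Order of magnitude of packet counters: n * m * 2 (one per color)."""
    return points * flows * colors


colors = [color_of(t, period=1.0) for t in (0.5, 1.5, 2.1)]

tx_counts = [1000, 1000, 1000, 1000]   # per-batch counters at the upstream point
rx_counts = [1000, 997, 1000, 990]     # per-batch counters at the downstream point
losses = loss_per_batch(tx_counts, rx_counts)

total = counters_per_interval(points=4, flows=50)
```

Because a batch is only compared after its marking period has ended,
the two measurement points do not need to exchange per-packet state.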
Since networks offer rich sets of network performance measurement
data (e.g., packet counters), conventional approaches run into
limitations. The bottleneck is the generation and export of the data
and the amount of data that can be reasonably collected from the
network. In addition, management tasks related to determining and
configuring which data to generate lead to significant deployment
challenges.
The Multipoint Alternate-Marking approach, described in [RFC8889],
aims to resolve this issue and make the performance monitoring more
flexible in case a detailed analysis is not needed.
An application orchestrates network performance measurement tasks
across the network to allow for optimized monitoring. The
application can choose how roughly or precisely to configure
measurement points depending on the application's requirements.
Using Alternate Marking, it is possible to monitor a Multipoint
Network without in-depth examination by using Network Clustering
(subnetworks that are portions of the entire network that preserve
the same property of the entire network, called clusters). So in the
case where there is packet loss or the delay is too high, the
specific filtering criteria could be applied to gather a more
detailed analysis by using a different combination of clusters up to
a per-flow measurement as described in the Alternate-Marking document
[RFC8321].
In summary, an application can configure end-to-end network
monitoring. If the network does not experience issues, this
approximate monitoring is good enough and is very cheap in terms of
network resources. However, in case of problems, the application
becomes aware of the issues from this approximate monitoring and, in
order to localize the portion of the network that has issues,
configures the measurement points more extensively, allowing more
detailed monitoring to be performed. After the detection and
resolution of the problem, the initial approximate monitoring can be
used again.
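The coarse-to-fine behavior summarized above can be sketched as
follows. The data layout and function names are assumptions for this
example: each cluster is represented by a pair of ingress and egress
packet counters, and `measure_links` is a hypothetical callback that
stands for reconfiguring the measurement points inside one cluster
and returning the per-link loss.

```python
from typing import Callable, Dict, List, Tuple


def coarse_then_fine(
    cluster_counters: Dict[str, Tuple[int, int]],
    measure_links: Callable[[str], Dict[str, int]],
) -> Dict[str, Dict[str, int]]:
    """Return per-link loss only for clusters whose coarse counters show loss."""
    findings: Dict[str, Dict[str, int]] = {}
    for cluster, (pkts_in, pkts_out) in sorted(cluster_counters.items()):
        if pkts_in - pkts_out > 0:
            # Loss seen at cluster level: enable detailed measurement points.
            link_loss = measure_links(cluster)
            findings[cluster] = {link: n for link, n in link_loss.items() if n > 0}
    return findings


probed: List[str] = []


def fake_measure(cluster: str) -> Dict[str, int]:
    """Stand-in for detailed monitoring inside one cluster."""
    probed.append(cluster)
    return {"l1": 0, "l2": 12}


result = coarse_then_fine(
    {"core": (5000, 5000), "edge": (3000, 2988)},
    fake_measure,
)
```

Only the cluster that shows loss is probed in detail, so healthy
parts of the network keep the cheap approximate monitoring.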
A.3.2. Dynamic Network Probe
A hardware-based Dynamic Network Probe (DNP) [OPSAWG-DNP4IQ] provides
a programmable means to customize the data that an application
collects from the data plane. A direct benefit of DNP is the
reduction of the exported data. A full DNP solution covers several
components including data source, data subscription, and data
generation. The data subscription needs to define the derived data
that can be composed and derived from raw data sources. The data
generation takes advantage of the moderate in-network computing to
produce the desired data.
While DNP can introduce unforeseeable flexibility to the data plane
telemetry, it also faces some challenges. It requires a flexible
data plane that can be dynamically reprogrammed at runtime. The
programming Application Programming Interface (API) is yet to be
defined.
A.3.3. IP Flow Information Export (IPFIX) Protocol
Traffic on a network can be seen as a set of flows passing through
network elements. IPFIX [RFC7011] provides a means of transmitting
traffic flow information for administrative or other purposes. A
typical IPFIX-enabled system includes a pool of Metering Processes
that collect data packets at one or more Observation Points,
optionally filter them, and aggregate information about these
packets. An Exporter then gathers each of the Observation Points
together into an Observation Domain and sends this information via
the IPFIX protocol to a Collector.
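On the Collector side, processing starts with the 16-octet IPFIX
Message Header defined in [RFC7011] (version, length, export time,
sequence number, and Observation Domain ID). The sketch below decodes
only that header; the function name and the dictionary keys are
choices made for this example, and Sets and Templates are not parsed.

```python
import struct

# Network byte order: version (16 bits), length (16 bits), export time,
# sequence number, and Observation Domain ID (32 bits each).
IPFIX_HEADER = struct.Struct("!HHIII")


def parse_ipfix_header(data: bytes) -> dict:
    """Decode the Message Header at the start of an IPFIX Message."""
    if len(data) < IPFIX_HEADER.size:
        raise ValueError("truncated IPFIX message header")
    version, length, export_time, sequence, domain_id = IPFIX_HEADER.unpack_from(data)
    if version != 10:
        raise ValueError(f"not an IPFIX message (version {version})")
    return {
        "length": length,                    # whole message, header included
        "export_time": export_time,          # seconds since the UNIX epoch
        "sequence": sequence,
        "observation_domain_id": domain_id,
    }


# A header-only sample message with illustrative values.
sample = IPFIX_HEADER.pack(10, 16, 1_700_000_000, 42, 7)
header = parse_ipfix_header(sample)
```

The Observation Domain ID lets the Collector tell apart the domains
that one Exporter reports on.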
A.3.4. In Situ OAM
Classical passive and active monitoring and measurement techniques
are either inaccurate or resource consuming. It is preferable to
directly acquire data associated with a flow's packets when the
packets pass through a network. IOAM [RFC9197], a data generation
technique, embeds a new instruction header into user packets, and the
instruction directs the network nodes to add the requested data to
the packets. Thus, at the path's end, the packet's experience gained
on the entire forwarding path can be collected. Such firsthand data
is invaluable to many network OAM applications.
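The passport-style collection can be sketched as a toy simulation.
This is not the IOAM wire format of [RFC9197]: the trace is a plain
Python list, the only recorded fields are a node name and a
timestamp, and the per-hop delays are made-up inputs.

```python
from typing import Dict, List


def forward_with_trace(path: List[str], hop_delay_us: Dict[str, int]) -> List[dict]:
    """Carry a trace in the 'packet' and let every node append its data."""
    trace: List[dict] = []   # stands in for the trace data carried in the packet
    clock_us = 0
    for node in path:
        clock_us += hop_delay_us[node]
        trace.append({"node": node, "timestamp_us": clock_us})
    return trace             # collected at the path's end


def per_hop_latency(trace: List[dict]) -> Dict[str, int]:
    """Derive per-hop latency from consecutive timestamps in the trace."""
    latencies: Dict[str, int] = {}
    previous = 0
    for record in trace:
        latencies[record["node"]] = record["timestamp_us"] - previous
        previous = record["timestamp_us"]
    return latencies


trace = forward_with_trace(["A", "B", "C"], {"A": 10, "B": 250, "C": 15})
latency = per_hop_latency(trace)
slowest = max(latency, key=latency.get)
```

The trace grows with every hop, which is one source of the overhead
concerns associated with carrying data in user packets.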
However, IOAM also faces some challenges. Issues of performance
impact, security, scalability and overhead limits, encapsulation
difficulties in some protocols, and cross-domain deployment need to
be addressed.
A.3.5. Postcard-Based Telemetry
Postcard-Based Telemetry (PBT), as embodied in IOAM Direct Export
(DEX) [IPPM-IOAM-DIRECT-EXPORT] and IOAM Marking
[IPPM-POSTCARD-BASED-TELEMETRY], is a technique complementary to
passport-based IOAM [RFC9197]. PBT directly exports data at each
node through an independent packet. At the cost of higher bandwidth
overhead and the need for data correlation, PBT shows several unique
advantages. It can also help to identify packet drop location in
case a packet is dropped on its forwarding path.
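The data correlation step and the drop localization can be sketched
as follows. The postcard fields (packet_id, hop, node) and both
function names are hypothetical, and the sketch assumes that the
expected path is known and that postcards from the nodes a packet
did traverse are not themselves lost.

```python
from typing import Dict, List, Optional, Tuple


def correlate(postcards: List[dict]) -> Dict[int, List[str]]:
    """Group postcards by packet identifier and order them by hop index."""
    by_packet: Dict[int, List[Tuple[int, str]]] = {}
    for card in postcards:
        by_packet.setdefault(card["packet_id"], []).append((card["hop"], card["node"]))
    return {pid: [node for _, node in sorted(hops)] for pid, hops in by_packet.items()}


def drop_segment(path: List[str], seen: List[str]) -> Optional[Tuple[Optional[str], str]]:
    """Return (last reporting node, next expected node), or None if the
    packet was reported by every node on the path."""
    if len(seen) >= len(path):
        return None
    last = seen[-1] if seen else None
    return (last, path[len(seen)])


path = ["A", "B", "C"]
postcards = [   # postcards arrive at the collector in arbitrary order
    {"packet_id": 1, "hop": 0, "node": "A"},
    {"packet_id": 2, "hop": 1, "node": "B"},
    {"packet_id": 1, "hop": 2, "node": "C"},
    {"packet_id": 2, "hop": 0, "node": "A"},
    {"packet_id": 1, "hop": 1, "node": "B"},
]
seen = correlate(postcards)
delivered = drop_segment(path, seen[1])
dropped = drop_segment(path, seen[2])
```

In this example, packet 2 was last reported by node B and never by
node C, so the drop is localized between those two nodes.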
A.3.6. Existing OAM for Specific Data Planes
Various data planes raise unique OAM requirements. The IETF has
published OAM technique and framework documents (e.g., [RFC8924] and
[RFC5085]) targeting different data planes such as Multiprotocol
Label Switching (MPLS), L2 Virtual Private Network (VPN), Network
Virtualization over Layer 3 (NVO3), Virtual Extensible LAN (VXLAN),
Bit Index Explicit Replication (BIER), Service Function Chaining
(SFC), Segment Routing (SR), and Deterministic Networking (DETNET).
The aforementioned data plane telemetry techniques can be used to
enhance the OAM capability on such data planes.
A.4. External Data and Event Telemetry
A.4.1. Sources of External Events
To ensure that the information provided by external event detectors
and used by the network management solutions is meaningful for
management purposes, the network telemetry framework must ensure that
such detectors (sources) are easily connected to the management
solutions (sinks). This requires the specification of a list of
potential external data sources that could be of interest in network
management and matching it to the connectors and/or interfaces
required to connect them.
Categories of external event sources that may be of interest to
network management include:
* Smart objects and sensors. With the consolidation of the Internet
of Things (IoT), any network system will have many smart objects
attached to its physical surroundings and logical operation
environments. Most of these objects will be essentially based on
sensors of many kinds (e.g., temperature, humidity, and presence),
and the information they provide can be very useful for the
management of the network, even when they are not specifically
deployed for such purpose. Elements of this source type will
usually provide a specific protocol for interaction, especially
one of the protocols related to IoT, such as the Constrained
Application Protocol (CoAP).
* Online news reporters. Several online news services have the
ability to provide an enormous quantity of information about
different events occurring in the world. Some of those events can
have an impact on the network system managed by a specific
framework; therefore, such information may be of interest to the
management solution. For instance, diverse security reports, such
as Common Vulnerabilities and Exposures (CVEs), can be issued by
the corresponding authority and used by the management solution to
update the managed system, if needed. Instead of a specific
protocol and data format, the sources of this kind of information
usually follow a relaxed but structured format. This format will
be part of both the ontology and information model of the
telemetry framework.
* Global event analyzers. The advance of big data analyzers
provides a huge amount of information and, more interestingly, the
identification of events detected by analyzing many data streams
from different origins. In contrast with the other types of
sources, which are focused on specific events, the detectors of
this source type will detect generic events. For example, during
a sports event, some unexpected movement makes it fascinating, and
many people connect to sites that are reporting on the event. The
underlying networks supporting the services that cover the event
can be affected by such situation, so their management solutions
should be aware of it. In contrast with the other source types, a
new information model, format, and reporting protocol are required
to integrate the detectors of this type with the management
solution.
Additional detector types can be added to the system, but generally
they will be the result of composing the properties offered by these
main classes.
A.4.2. Connectors and Interfaces
To allow external event detectors to be properly integrated with
other management solutions, both elements must expose interfaces and
protocols suited to their particular objectives. Since external
event detectors will be focused on providing their information to
their main consumers, which generally will not be limited to the
network management solutions, the framework must include the
definition of the connectors required to ensure that the
interconnection between detectors (sources) and their consumers
within the management systems (sinks) is effective.
In some situations, the interconnection between external event
detectors and the management system is via the management plane. For
those situations, there will be a special connector that provides the
typical interfaces found in most other elements connected to the
management plane. For instance, the interfaces could accomplish this
with a specific data model (YANG) and specific telemetry protocol,
such as NETCONF, YANG-Push, or gRPC.
Acknowledgments
We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe
Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe
Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra,
Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin
Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Éric
Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many
others who have provided helpful comments and suggestions to improve
this document.
Contributors
The other contributors of this document are Tianran Zhou, Zhenbin Li,
Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm.
Authors' Addresses
Haoyu Song
Futurewei
United States of America
Email: haoyu.song@futurewei.com
Fengwei Qin
China Mobile
China
Email: qinfengwei@chinamobile.com
Pedro Martinez-Julia
NICT
Japan
Email: pedro@nict.go.jp
Laurent Ciavaglia
Rakuten Mobile
France
Email: laurent.ciavaglia@rakuten.com
Aijun Wang
China Telecom
China
Email: wangaj3@chinatelecom.cn