From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001
From: Thomas Voss
Date: Wed, 27 Nov 2024 20:54:24 +0100
Subject: doc: Add RFC documents

---
 doc/rfc/rfc9232.txt | 1922 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1922 insertions(+)
 create mode 100644 doc/rfc/rfc9232.txt

diff --git a/doc/rfc/rfc9232.txt b/doc/rfc/rfc9232.txt
new file mode 100644
index 0000000..61383ff
--- /dev/null
+++ b/doc/rfc/rfc9232.txt
@@ -0,0 +1,1922 @@
+
+
+
+
+Internet Engineering Task Force (IETF)                           H. Song
+Request for Comments: 9232                                     Futurewei
+Category: Informational                                           F. Qin
+ISSN: 2070-1721                                             China Mobile
+                                                       P. Martinez-Julia
+                                                                    NICT
+                                                            L. Ciavaglia
+                                                          Rakuten Mobile
+                                                                 A. Wang
+                                                           China Telecom
+                                                                May 2022
+
+
+                       Network Telemetry Framework
+
+Abstract
+
+   Network telemetry is a technology for gaining network insight and
+   facilitating efficient and automated network management.  It
+   encompasses various techniques for remote data generation,
+   collection, correlation, and consumption.  This document describes an
+   architectural framework for network telemetry, motivated by
+   challenges that are encountered as part of the operation of networks
+   and by the requirements that ensue.  This document clarifies the
+   terminology and classifies the modules and components of a network
+   telemetry system from different perspectives.  The framework and
+   taxonomy help to set a common ground for the collection of related
+   work and provide guidance for related technique and standard
+   developments.
+
+Status of This Memo
+
+   This document is not an Internet Standards Track specification; it is
+   published for informational purposes.
+
+   This document is a product of the Internet Engineering Task Force
+   (IETF).  It represents the consensus of the IETF community.  It has
+   received public review and has been approved for publication by the
+   Internet Engineering Steering Group (IESG).
Not all documents + approved by the IESG are candidates for any level of Internet + Standard; see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc9232. + +Copyright Notice + + Copyright (c) 2022 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Revised BSD License text as described in Section 4.e of the + Trust Legal Provisions and are provided without warranty as described + in the Revised BSD License. + +Table of Contents + + 1. Introduction + 1.1. Applicability Statement + 1.2. Glossary + 2. Background + 2.1. Telemetry Data Coverage + 2.2. Use Cases + 2.3. Challenges + 2.4. Network Telemetry + 2.5. The Necessity of a Network Telemetry Framework + 3. Network Telemetry Framework + 3.1. Top-Level Modules + 3.1.1. Management Plane Telemetry + 3.1.2. Control Plane Telemetry + 3.1.3. Forwarding Plane Telemetry + 3.1.4. External Data Telemetry + 3.2. Second-Level Function Components + 3.3. Data Acquisition Mechanism and Type Abstraction + 3.4. Mapping Existing Mechanisms into the Framework + 4. Evolution of Network Telemetry Applications + 5. Security Considerations + 6. IANA Considerations + 7. Informative References + Appendix A. A Survey on Existing Network Telemetry Techniques + A.1. Management Plane Telemetry + A.1.1. Push Extensions for NETCONF + A.1.2. gRPC Network Management Interface + A.2. Control Plane Telemetry + A.2.1. BGP Monitoring Protocol + A.3. Data Plane Telemetry + A.3.1. 
Alternate-Marking (AM) Technology
+       A.3.2.  Dynamic Network Probe
+       A.3.3.  IP Flow Information Export (IPFIX) Protocol
+       A.3.4.  In Situ OAM
+       A.3.5.  Postcard-Based Telemetry
+       A.3.6.  Existing OAM for Specific Data Planes
+     A.4.  External Data and Event Telemetry
+       A.4.1.  Sources of External Events
+       A.4.2.  Connectors and Interfaces
+   Acknowledgments
+   Contributors
+   Authors' Addresses
+
+1.  Introduction
+
+   Network visibility is the ability of management tools to see the
+   state and behavior of a network, which is essential for successful
+   network operation.  Network telemetry revolves around network data
+   that 1) can help provide insights about the current state of the
+   network, including network devices, forwarding, control, and
+   management planes; 2) can be generated and obtained through a variety
+   of techniques, including but not limited to network instrumentation
+   and measurements; and 3) can be processed for purposes ranging from
+   service assurance to network security using a wide variety of data
+   analytical techniques.  In this document, network telemetry refers to
+   both the data itself (i.e., "Network Telemetry Data") and the
+   techniques and processes used to generate, export, collect, and
+   consume that data for use by potentially automated management
+   applications.  Network telemetry extends beyond the classical network
+   Operations, Administration, and Maintenance (OAM) techniques and
+   is expected to support better flexibility, scalability, accuracy,
+   coverage, and performance.
+
+   However, the term "network telemetry" lacks an unambiguous
+   definition.  Its scope and coverage cause confusion and
+   misunderstandings.  It is beneficial to clarify the concept and
+   provide a clear architectural framework for network telemetry, so we
+   can articulate the technical field and better align the related
+   techniques and standards work.
+ + To fulfill such an undertaking, we first discuss some key + characteristics of network telemetry that set a clear distinction + from the conventional network OAM and show that some conventional OAM + technologies can be considered a subset of the network telemetry + technologies. We then provide an architectural framework for network + telemetry that includes four modules, each associated with a + different category of telemetry data and corresponding procedures. + All the modules are internally structured in the same way, including + components that allow the operator to configure data sources in + regard to what data to generate and how to make that available to + client applications, components that instrument the underlying data + sources, and components that perform the actual rendering, encoding, + and exporting of the generated data. We show how the network + telemetry framework can benefit current and future network + operations. Based on the distinction of modules and function + components, we can map the existing and emerging techniques and + protocols into the framework. The framework can also simplify + designing, maintaining, and understanding a network telemetry system. + In addition, we outline the evolution stages of the network telemetry + system and discuss the potential security concerns. + + The purpose of the framework and taxonomy is to set a common ground + for the collection of related work and provide guidance for future + technique and standard developments. To the best of our knowledge, + this document is the first such effort for network telemetry in + industry standards organizations. This document does not define + specific technologies. + +1.1. Applicability Statement + + Large-scale network data collection is a major threat to user privacy + and may be indistinguishable from pervasive monitoring [RFC7258]. 
+ The network telemetry framework presented in this document must not + be applied to generating, exporting, collecting, analyzing, or + retaining individual user data or any data that can identify end + users or characterize their behavior without consent. Based on this + principle, the network telemetry framework is not applicable to + networks whose endpoints represent individual users, such as general- + purpose access networks. + +1.2. Glossary + + Before further discussion, we list some key terminology and + abbreviations used in this document. There is an intended + differentiation between the terms of network telemetry and OAM. + However, it should be understood that there is not a hard-line + distinction between the two concepts. Rather, network telemetry is + considered an extension of OAM. It covers all the existing OAM + protocols but puts more emphasis on the newer and emerging techniques + and protocols concerning all aspects of network data from acquisition + to consumption. + + AI: Artificial Intelligence. In the network domain, AI + refers to machine-learning-based technologies for + automated network operation and other tasks. + + AM: Alternate Marking. A flow performance measurement + method, as specified in [RFC8321]. + + BMP: BGP Monitoring Protocol. Specified in [RFC7854]. + + DPI: Deep Packet Inspection. Refers to the techniques that + examine packets beyond packet L3/L4 headers. + + gNMI: gRPC Network Management Interface. A network management + protocol from the OpenConfig Operator Working Group, + mainly contributed by Google. See [gnmi] for details. + + GPB: Google Protocol Buffer. An extensible mechanism for + serializing structured data. See [gpb] for details. + + gRPC: gRPC Remote Procedure Call. An open-source high- + performance RPC framework that gNMI is based on. See + [grpc] for details. + + IPFIX: IP Flow Information Export Protocol. Specified in + [RFC7011]. + + IOAM: In situ OAM [RFC9197]. 
A data plane on-path telemetry + technique. + + JSON: JavaScript Object Notation. An open standard file format + and data interchange format that uses human-readable text + to store and transmit data objects, as specified in + [RFC8259]. + + MIB: Management Information Base. A database used for + managing the entities in a network. + + NETCONF: Network Configuration Protocol. Specified in [RFC6241]. + + NetFlow: A Cisco protocol used for flow record collecting, as + described in [RFC3954]. + + Network Telemetry: The process and instrumentation for acquiring and + utilizing network data remotely for network monitoring + and operation. A general term for a large set of network + visibility techniques and protocols, concerning aspects + like data generation, collection, correlation, and + consumption. Network telemetry addresses current network + operation issues and enables smooth evolution toward + future intent-driven autonomous networks. + + NMS: Network Management System. Refers to applications that + allow network administrators to manage a network. + + OAM: Operations, Administration, and Maintenance. A group of + network management functions that provide network fault + indication, fault localization, performance information, + and data and diagnosis functions. Most conventional + network monitoring techniques and protocols belong to + network OAM. + + PBT: Postcard-Based Telemetry. A data plane on-path telemetry + technique. A representative technique is described in + [IPPM-IOAM-DIRECT-EXPORT]. + + RESTCONF: An HTTP-based protocol that provides a programmatic + interface for accessing data defined in YANG, using the + datastore concepts defined in NETCONF, as specified in + [RFC8040]. + + SMIv2: Structure of Management Information Version 2. Defines + MIB objects, as specified in [RFC2578]. + + SNMP: Simple Network Management Protocol. Versions 1, 2, and 3 + are specified in [RFC1157], [RFC3416], and [RFC3411], + respectively. 
+
+   XML:       Extensible Markup Language.  A markup language for data
+              encoding that is both human readable and machine
+              readable, as specified by W3C [W3C.REC-xml-20081126].
+
+   YANG:      YANG is a data modeling language for the definition of
+              data sent over network management protocols such as
+              NETCONF and RESTCONF.  YANG is defined in [RFC6020] and
+              [RFC7950].
+
+   YANG ECA:  A YANG model for Event-Condition-Action policies, as
+              defined in [NETMOD-ECA-POLICY].
+
+   YANG-Push: A mechanism that allows subscriber applications to
+              request a stream of updates from a YANG datastore on a
+              network device.  Details are specified in [RFC8639] and
+              [RFC8641].
+
+2.  Background
+
+   The term "big data" is used to describe the extremely large volume of
+   data sets that can be analyzed computationally to reveal patterns,
+   trends, and associations.  Networks are undoubtedly a source of big
+   data because of their scale and the volume of network traffic they
+   forward.  When a network's endpoints do not represent individual
+   users (e.g., in industrial, data-center, and infrastructure
+   contexts), network operations can often benefit from large-scale data
+   collection without breaching user privacy.
+
+   Today, one can access advanced big data analytics capability through
+   a plethora of commercial and open-source platforms (e.g., Apache
+   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
+   learning).  Thanks to advances in computing and storage
+   technologies, network big data analytics gives network operators an
+   opportunity to gain network insights and move towards network
+   autonomy.  Some operators have started to explore the application of
+   Artificial Intelligence (AI) to make sense of network data.  Software
+   tools can use the network data to detect and react to network faults,
+   anomalies, and policy violations, as well as predict future events.
+   In turn, network policy updates for planning, intrusion
+   prevention, optimization, and self-healing may be applied.
+ + It is conceivable that an autonomic network [RFC7575] is the logical + next step for network evolution following Software-Defined Networking + (SDN), which aims to reduce (or even eliminate) human labor, make + more efficient use of network resources, and provide better services + more aligned with customer requirements. The IETF ANIMA Working + Group is dedicated to developing and maintaining protocols and + procedures for automated network management and control of + professionally managed networks. The related technique of + Intent-Based Networking (IBN) [NMRG-IBN-CONCEPTS-DEFINITIONS] + requires network visibility and telemetry data in order to ensure + that the network is behaving as intended. + + However, while the data processing capability is improved and + applications require more data to function better, the networks lag + behind in extracting and translating network data into useful and + actionable information in efficient ways. The system bottleneck is + shifting from data consumption to data supply. Both the number of + network nodes and the traffic bandwidth keep increasing at a fast + pace. The network configuration and policy change at smaller time + slots than before. More subtle events and fine-grained data through + all network planes need to be captured and exported in real time. In + a nutshell, it is a challenge to get enough high-quality data out of + the network in a manner that is efficient, timely, and flexible. + Therefore, we need to survey the existing technologies and protocols + and identify any potential gaps. + + In the remainder of this section, we first clarify the scope of + network data (i.e., telemetry data) relevant in this document. Then, + we discuss several key use cases for network operations of today and + the future. Next, we show why the current network OAM techniques and + protocols are insufficient for these use cases. 
The discussion + underlines the need for new methods, techniques, and protocols, as + well as the extensions of existing ones, which we assign under the + umbrella term "Network Telemetry". + +2.1. Telemetry Data Coverage + + Any information that can be extracted from networks (including the + data plane, control plane, and management plane) and used to gain + visibility or as a basis for actions is considered telemetry data. + It includes statistics, event records and logs, snapshots of state, + configuration data, etc. It also covers the outputs of any active + and passive measurements [RFC7799]. In some cases, raw data is + processed in network before being sent to a data consumer. Such + processed data is also considered telemetry data. The value of + telemetry data varies. In some cases, if the cost is acceptable, + less but higher-quality data are preferred rather than a lot of low- + quality data. A classification of telemetry data is provided in + Section 3. To preserve the privacy of end users, no user packet + content should be collected. Specifically, the data objects + generated, exported, and collected by a network telemetry application + should not include any packet payload from traffic associated with + end-user systems. + +2.2. Use Cases + + The following set of use cases is essential for network operations. + While the list is by no means exhaustive, it is enough to highlight + the requirements for data velocity, variety, volume, and veracity, + the attributes of big data, in networks. + + * Security: Network intrusion detection and prevention systems need + to monitor network traffic and activities and act upon anomalies. + Given increasingly sophisticated attack vectors coupled with + increasingly severe consequences of security breaches, new tools + and techniques need to be developed, relying on wider and deeper + visibility into networks. 
The ultimate goal is to achieve + security with no, or only minimal, human intervention and without + disrupting legitimate traffic flows. + + * Policy and Intent Compliance: Network policies are the rules that + constrain the services for network access, provide service + differentiation, or enforce specific treatment on the traffic. + For example, a service function chain is a policy that requires + the selected flows to pass through a set of ordered network + functions. Intent, as defined in [NMRG-IBN-CONCEPTS-DEFINITIONS], + is a set of operational goals that a network should meet and + outcomes that a network is supposed to deliver, defined in a + declarative manner without specifying how to achieve or implement + them. An intent requires a complex translation and mapping + process before being applied on networks. While a policy or + intent is enforced, the compliance needs to be verified and + monitored continuously by relying on visibility that is provided + through network telemetry data. Any violation must be reported + immediately - this will alert the network administrator to the + policy or intent violation and will potentially result in updates + to how the policy or intent is applied in the network to ensure + that it remains in force. + + * SLA Compliance: A Service Level Agreement (SLA) is a service + contract between a service provider and a client, which includes + the metrics for the service measurement and remedy/penalty + procedures when the service level misses the agreement. Users + need to check if they get the service as promised, and network + operators need to evaluate how they can deliver services that meet + the SLA based on real-time network telemetry data, including data + from network measurements. + + * Root Cause Analysis: Many network failures can be the effect of a + sequence of chained events. Troubleshooting and recovery require + quick identification of the root cause of any observable issues. 
+ However, the root cause is not always straightforward to identify, + especially when the failure is sporadic and the number of event + messages, both related and unrelated to the same cause, is + overwhelming. While technologies such as machine learning can be + used for root cause analysis, it is up to the network to sense and + provide the relevant diagnostic data that are either actively fed + into or passively retrieved by the root cause analysis + applications. + + * Network Optimization: This covers all short-term and long-term + network optimization techniques, including load balancing, Traffic + Engineering (TE), and network planning. Network operators are + motivated to optimize their network utilization and differentiate + services for better Return on Investment (ROI) or lower Capital + Expenditure (CAPEX). The first step is to know the real-time + network conditions before applying policies for traffic + manipulation. In some cases, microbursts need to be detected in a + very short time frame so that fine-grained traffic control can be + applied to avoid network congestion. Long-term planning of + network capacity and topology requires analysis of real-world + network telemetry data that is obtained over long periods of time. + + * Event Tracking and Prediction: The visibility into traffic path + and performance is critical for services and applications that + rely on healthy network operation. Numerous related network + events are of interest to network operators. For example, network + operators want to learn where and why packets are dropped for an + application flow. They also want to be warned of issues in + advance, so proactive actions can be taken to avoid catastrophic + consequences. + +2.3. Challenges + + For a long time, network operators have relied upon SNMP [RFC3416], + Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the + network. 
Some other OAM techniques as described in [RFC7276] are + also used to facilitate network troubleshooting. These conventional + techniques are not sufficient to support the above use cases for the + following reasons: + + * Most use cases need to continuously monitor the network and + dynamically refine the data collection in real time. Poll-based + low-frequency data collection is ill-suited for these + applications. Subscription-based streaming data directly pushed + from the data source (e.g., the forwarding chip) is preferred to + provide sufficient data quantity and precision at scale. + + * Comprehensive data is needed, ranging from packet processing + engines to traffic managers, line cards to main control boards, + user flows to control protocol packets, device configurations to + operations, and physical layers to application layers. + Conventional OAM only covers a narrow range of data (e.g., SNMP + only handles data from the Management Information Base (MIB)). + Classical network devices cannot provide all the necessary probes. + More open and programmable network devices are therefore needed. + + * Many application scenarios need to correlate network-wide data + from multiple sources (i.e., from distributed network devices, + different components of a network device, or different network + planes). A piecemeal solution is often lacking the capability to + consolidate the data from multiple sources. The composition of a + complete solution, as partly proposed by Autonomic Resource + Control Architecture (ARCA) [NMRG-ANTICIPATED-ADAPTATION], will be + empowered and guided by a comprehensive framework. + + * Some conventional OAM techniques (e.g., CLI and Syslog) lack a + formal data model. The unstructured data hinder the tool + automation and application extensibility. Standardized data + models are essential to support the programmable networks. 
+
+   *  Although some conventional OAM techniques support data push (e.g.,
+      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the
+      pushed data are limited to only predefined management plane
+      warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow).
+      Network operators require data of arbitrary source,
+      granularity, and precision, which is beyond the capability of the
+      existing techniques.
+
+   *  Conventional passive measurement techniques can either consume
+      excessive network resources and produce excessive redundant data
+      or lead to inaccurate results; on the other hand, conventional
+      active measurement techniques can interfere with the user traffic,
+      and their results are indirect.  Techniques that can collect
+      direct and on-demand data from user traffic are more favorable.
+
+   These challenges were addressed by newer standards and techniques
+   (e.g., IPFIX/NetFlow, Packet Sampling (PSAMP), IOAM, and YANG-Push),
+   and more are emerging.  These standards and techniques need to be
+   recognized and accommodated in a new framework.
+
+2.4.  Network Telemetry
+
+   Network telemetry has emerged as a mainstream technical term to refer
+   to the network data collection and consumption techniques.  Several
+   network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and
+   gRPC [grpc]) have been widely deployed.  Network telemetry allows
+   separate entities to acquire data from network devices so that data
+   can be visualized and analyzed to support network monitoring and
+   operation.  Network telemetry covers the conventional network OAM and
+   has a wider scope.  For instance, it is expected that network
+   telemetry can provide the necessary network insight for autonomous
+   networks and address the shortcomings of conventional OAM techniques.
+
+   Network telemetry usually assumes machines as data consumers rather
+   than human operators.
+   Hence, network telemetry can directly trigger
+   the automated network operation, while in contrast, some conventional
+   OAM tools were designed and used to help human operators to monitor
+   and diagnose the networks and guide manual network operations.  This
+   difference in the intended consumer leads to very different
+   techniques.
+
+   Although new network telemetry techniques are emerging and subject to
+   continuous evolution, several characteristics of network telemetry
+   have been well accepted.  Note that network telemetry is intended to
+   be an umbrella term covering a wide spectrum of techniques, so the
+   following characteristics are not expected to be held by every
+   specific technique.
+
+   *  Push and Streaming: Instead of polling data from network devices,
+      telemetry collectors subscribe to streaming data pushed from data
+      sources in network devices.
+
+   *  Volume and Velocity: Telemetry data is intended to be consumed by
+      machines rather than by human beings.  Therefore, the data volume
+      can be huge, and the processing is optimized for the needs of
+      automation in real time.
+
+   *  Normalization and Unification: Telemetry aims to address the
+      overall network automation needs.  Efforts are made to normalize
+      the data representation and unify the protocols, so as to simplify
+      data analysis and provide integrated analysis across heterogeneous
+      devices and data sources across a network.
+
+   *  Model-Based: Telemetry data is modeled in advance, which allows
+      applications to configure and consume data with ease.
+
+   *  Data Fusion: The data for a single application can come from
+      multiple data sources (e.g., cross-domain, cross-device, and
+      cross-layer) that are based on a common name/ID and need to be
+      correlated to take effect.
+
+   *  Dynamic and Interactive: Because network telemetry is meant to be
+      used in a closed control loop for network automation, it needs to
+      run continuously and adapt to the dynamic and interactive queries
+      from the network operation controller.
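
The push, streaming, and data fusion characteristics listed above can be illustrated with a minimal Python sketch. It is not drawn from any protocol named in this document: the Sample, DataSource, and Collector names, the flow ID "flow-42", and the metric names are all hypothetical, chosen only to show a collector that subscribes to pushed data (rather than polling) and correlates records from two sources by a common ID.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """One pushed telemetry record (hypothetical shape, not from the RFC)."""
    source: str    # which module produced the record
    flow_id: str   # the common name/ID used for correlation
    metric: str
    value: float


class DataSource:
    """A data source that pushes samples to subscribers instead of being polled."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Callable[[Sample], None]] = []

    def subscribe(self, callback: Callable[[Sample], None]) -> None:
        # Register a consumer; every later sample is pushed to it.
        self._subscribers.append(callback)

    def emit(self, flow_id: str, metric: str, value: float) -> None:
        # Push one sample to every subscriber as soon as it is generated.
        sample = Sample(self.name, flow_id, metric, value)
        for callback in self._subscribers:
            callback(sample)


class Collector:
    """Fuses samples from several sources by their shared flow ID."""

    def __init__(self) -> None:
        self.by_flow: Dict[str, Dict[str, float]] = defaultdict(dict)

    def on_sample(self, sample: Sample) -> None:
        # Key each value by source and metric so records do not collide.
        key = f"{sample.source}.{sample.metric}"
        self.by_flow[sample.flow_id][key] = sample.value


# Two hypothetical sources feed one collector.
forwarding = DataSource("forwarding-plane")
control = DataSource("control-plane")
collector = Collector()
forwarding.subscribe(collector.on_sample)
control.subscribe(collector.on_sample)

forwarding.emit("flow-42", "latency_us", 180.0)
control.emit("flow-42", "route_changes", 2.0)

# The fused view holds data from both sources under the common ID.
fused = dict(collector.by_flow["flow-42"])
print(fused)
```

The callback-based subscription stands in for a real streaming transport; a deployed system would replace it with an actual telemetry protocol and a modeled data schema.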
+
+   In addition, an ideal network telemetry solution may also have the
+   following features or properties:
+
+   *  In-Network Customization: The data that is generated can be
+      customized in network at runtime to cater to the specific need of
+      applications.  This requires the support of a programmable data
+      plane, which allows probes with custom functions to be deployed at
+      flexible locations.
+
+   *  In-Network Data Aggregation and Correlation: Network devices and
+      aggregation points can work out which events and what data needs
+      to be stored, reported, or discarded, thus reducing the load on
+      the central collection and processing points while still ensuring
+      that the right information is ready to be processed in a timely
+      way.
+
+   *  In-Network Processing: Sometimes it is not necessary or feasible
+      to gather all information at a central point to be processed and
+      acted upon.  It is possible for the data processing to be done in
+      network, allowing reactive actions to be taken locally.
+
+   *  Direct Data Plane Export: Data originating from data plane
+      forwarding chips can be directly exported to the data consumer for
+      efficiency, especially when the data bandwidth is large and
+      real-time processing is required.
+
+   *  In-Band Data Collection: In addition to the passive and active
+      data collection approaches, the new hybrid approach allows data
+      to be collected directly for any target flow along its entire
+      forwarding path [OPSAWG-IFIT-FRAMEWORK].
+
+   It is worth noting that a network telemetry system should not be
+   intrusive to normal network operations; it should avoid the pitfall
+   of the "observer effect".  That is, it should not change the network
+   behavior or affect the forwarding performance.  Moreover,
+   high-volume telemetry traffic may cause network congestion unless
+   proper isolation or traffic engineering techniques are in place, or
+   congestion control mechanisms ensure that telemetry traffic backs off
+   if it exceeds the network capacity.
[RFC8084] and [RFC8085] are + relevant Best Current Practices (BCPs) in this space. + + Although in many cases a system for network telemetry involves a + remote data collecting and consuming entity, it is important to + understand that there are no inherent assumptions about how a system + should be architected. While a network architecture with a + centralized controller (e.g., SDN) seems to be a natural fit for + network telemetry, network telemetry can work in distributed fashions + as well. For example, telemetry data producers and consumers can + have a peer-to-peer relationship, in which a network node can be the + direct consumer of telemetry data from other nodes. + +2.5. The Necessity of a Network Telemetry Framework + + Network data analytics (e.g., machine learning) is applied for + network operation automation, relying on abundant and coherent data + from networks. Data acquisition that is limited to a single source + and static in nature will in many cases not be sufficient to meet an + application's telemetry data needs. As a result, multiple data + sources, involving a variety of techniques and standards, will need + to be integrated. It is desirable to have a framework that + classifies and organizes different telemetry data sources and types, + defines different components of a network telemetry system and their + interactions, and helps coordinate and integrate multiple telemetry + approaches across layers. This allows flexible combinations of data + for different applications, while normalizing and simplifying + interfaces. In detail, such a framework would benefit the + development of network operation applications for the following + reasons: + + * Future networks, autonomous or otherwise, depend on holistic and + comprehensive network visibility. Use cases and applications are + better when supported uniformly and coherently using an + integrated, converged mechanism and common telemetry data + representations wherever feasible. 
Therefore, the protocols and + mechanisms should be consolidated into a minimum yet comprehensive + set. A telemetry framework can help to normalize the technique + developments. + + * Network visibility presents multiple viewpoints. For example, the + device viewpoint takes the network infrastructure as the + monitoring object from which the network topology and device + status can be acquired, and the traffic viewpoint takes the flows + or packets as the monitoring object from which the traffic quality + and path can be acquired. An application may need to switch its + viewpoint during operation. It may also need to correlate a + service and its impact on user experience (UE) to acquire the + comprehensive information. + + * Applications require network telemetry to be elastic in order to + make efficient use of network resources and reduce the impact of + processing related to network telemetry on network performance. + For example, routine network monitoring should cover the entire + network with a low data sampling rate. Only when issues arise or + critical trends emerge should telemetry data sources be modified + and telemetry data rates be boosted as needed. + + * Efficient data aggregation is critical for applications to reduce + the overall quantity of data and improve the accuracy of analysis. + + A telemetry framework collects all the telemetry-related works from + different sources and working groups within the IETF. This makes it + possible to assemble a comprehensive network telemetry system and to + avoid repetitious or redundant work. The framework should cover the + concepts and components from the standardization perspective. This + document describes the modules that make up a network telemetry + framework and decomposes the telemetry system into a set of distinct + components that existing and future work can easily map to. + +3. 
Network Telemetry Framework + + The top-level network telemetry framework partitions the network + telemetry into four modules based on the telemetry data object source + and represents their relationship. Once the network operation + applications acquire the data from these modules, they can apply data + analytics and take actions. At the next level, the framework + decomposes each module into separate components. Each of these + modules follows the same underlying structure, with one component + dedicated to the configuration of data subscriptions and data + sources, a second component dedicated to encoding and exporting data, + and a third component instrumenting the generation of telemetry + related to the underlying resources. Throughout the framework, the + same set of abstract data-acquiring mechanisms and data types + (Section 3.3) are applied. The two-level architecture with the + uniform data abstraction helps accurately pinpoint a protocol or + technique to its position in a network telemetry system or + disaggregates a network telemetry system into manageable parts. + +3.1. Top-Level Modules + + Telemetry can be applied on the forwarding plane, control plane, and + management plane in a network, as well as on other sources out of the + network, as shown in Figure 1. Therefore, we categorize the network + telemetry into four distinct modules (management plane, control + plane, forwarding plane, and external data and event telemetry) with + each having its own interface to network operation applications. 
+ + +------------------------------+ + | | + | Network Operation |<-------+ + | Applications | | + | | | + +------------------------------+ | + ^ ^ ^ | + | | | | + V V | V + +--------------+-----------|---+ +-----------+ + | | Control | | | | + | | Plane | | | External | + | <---> | | | Data and | + | | Telemetry | | | Event | + | Management | ^ V | | Telemetry | + | Plane +-------|-------+ | | + | Telemetry | V | +-----------+ + | | Forwarding | + | | Plane | + | <---> | + | | Telemetry | + | | | + +--------------+---------------+ + + Figure 1: Modules in Layer Category of the Network Telemetry + Framework + + The rationale of this partition lies in the different telemetry data + objects that result in different data sources and export locations. + Such differences have profound implications on in-network data + programming and processing capability, data encoding and the + transport protocol, and required data bandwidth and latency. Data + can be sent directly or proxied via the control and management + planes. There are advantages/disadvantages to both approaches. + + Note that in some cases, the network controller itself may be the + source of telemetry data that is unique to it or derived from the + telemetry data collected from the network elements. Some of the + principles and taxonomy specific to the control plane and management + plane telemetry could also be applied to the controller when it is + required to provide the telemetry data to network operation + applications hosted outside. The scope of this document is focused + on the network elements telemetry, and further details related to + controllers are thus out of scope. + + We summarize the major differences of the four modules in Table 1. + They are compared from six angles: + + * Data Object + + * Data Export Location + + * Data Model + + * Data Encoding + + * Telemetry Application Protocol + + * Data Transport Method + + Data Object is the target and source of each module. 
Because the + data source varies, the location where data is most conveniently + exported also varies. For example, forwarding plane data mainly + originates as data exported from the forwarding Application-Specific + Integrated Circuits (ASICs), while control plane data mainly + originates from the protocol daemons running on the control CPU(s). + For convenience and efficiency, it is preferred to export the data + off the device from locations near the source. Because the locations + that can export data have different capabilities, different choices + of data models, encoding, and transport methods are made to balance + the performance and cost. For example, the forwarding chip has high + throughput but limited capacity for processing complex data and + maintaining state, while the main control CPU is capable of complex + data and state processing but has limited bandwidth for high + throughput data. As a result, the suitable telemetry protocol for + each module can be different. Some representative techniques are + shown in the corresponding table blocks to highlight the technical + diversity of these modules. Note that the selected techniques just + reflect the de facto state of the art and are by no means exhaustive + (e.g., IPFIX can also be implemented over TCP and SCTP, but that is + not recommended for the forwarding plane). The key point is that one + cannot expect to use a universal protocol to cover all the network + telemetry requirements.
+ + +=============+===============+==========+==========+===============+ + |Module |Management |Control |Forwarding|External Data | + | |Plane |Plane |Plane | | + +=============+===============+==========+==========+===============+ + |Object |configuration |control |flow and |terminal, | + | |and operation |protocol |packet |social, and | + | |state |and |QoS, |environmental | + | | |signaling,|traffic | | + | | |RIB |stat., | | + | | | |buffer and| | + | | | |queue | | + | | | |stat., | | + | | | |FIB, | | + | | | |Access | | + | | | |Control | | + | | | |List (ACL)| | + +-------------+---------------+----------+----------+---------------+ + |Export |main control |main |forwarding|various | + |Location |CPU |control |chip or | | + | | |CPU, |linecard | | + | | |linecard |CPU; main | | + | | |CPU, or |control | | + | | |forwarding|CPU | | + | | |chip |unlikely | | + +-------------+---------------+----------+----------+---------------+ + |Data Model |YANG, MIB, |YANG, |YANG, |YANG, custom | + | |syslog |custom |custom | | + +-------------+---------------+----------+----------+---------------+ + |Data Encoding|GPB, JSON, XML |GPB, JSON,|plain text|GPB, JSON, XML,| + | | |XML, plain| |plain text | + | | |text | | | + +-------------+---------------+----------+----------+---------------+ + |Application |gRPC, NETCONF, |gRPC, |IPFIX, |gRPC | + |Protocol |RESTCONF |NETCONF, |traffic | | + | | |IPFIX, |mirroring,| | + | | |traffic |gRPC, | | + | | |mirroring |NETFLOW | | + +-------------+---------------+----------+----------+---------------+ + |Data |HTTP(S), TCP |HTTP(S), |UDP |HTTP(S), TCP, | + |Transport | |TCP, UDP | |UDP | + +-------------+---------------+----------+----------+---------------+ + + Table 1: Comparison of Data Object Modules + + Note that the interaction with the applications that consume network + telemetry data can be indirect. Some in-device data transfer is + possible. 
For example, in the management plane telemetry, the + management plane will need to acquire data from the data plane. Some + operational states can only be derived from data plane data sources + such as the interface status and statistics. As another example, + obtaining control plane telemetry data may require the ability to + access the Forwarding Information Base (FIB) of the data plane. + + On the other hand, an application may involve more than one plane and + interact with multiple planes simultaneously. For example, an SLA + compliance application may require both the data plane telemetry and + the control plane telemetry. + + The requirements and challenges for each module are summarized as + follows (note that the requirements may pertain across all telemetry + modules; however, we emphasize those that are most pronounced for a + particular plane). + +3.1.1. Management Plane Telemetry + + The management plane of network elements interacts with the Network + Management System (NMS) and provides information such as performance + data, network logging data, network warning and defects data, and + network statistics and state data. The management plane includes + many protocols, including the classical SNMP and syslog. Regardless of + the protocol, management plane telemetry must address the following + requirements: + + * Convenient Data Subscription: An application should have the + freedom to choose which data is exported (see Section 3.3) and the + means and frequency of how that data is exported (e.g., on-change + or periodic subscription). + + * Structured Data: For automatic network operation, machines will + replace humans for network data comprehension. Data modeling + languages, such as YANG, can efficiently describe structured data + and normalize data encoding and transformation. + + * High-Speed Data Transport: In order to keep up with the velocity + of information, a data source needs to be able to send large + amounts of data at high frequency.
Compact encoding formats or + data compression schemes are needed to reduce the quantity of data + and improve the data transport efficiency. The subscription mode, + by replacing the query mode, reduces the interactions between + clients and servers and helps to improve the data source's + efficiency. + + * Network Congestion Avoidance: The application must protect the + network from congestion with congestion control mechanisms or, at + minimum, with circuit breakers. [RFC8084] and [RFC8085] provide + some solutions in this space. + +3.1.2. Control Plane Telemetry + + The control plane telemetry refers to the health condition monitoring + of different network control protocols at all layers of the protocol + stack. Keeping track of the operational status of these protocols is + beneficial for detecting, localizing, and even predicting various + network issues, as well as for network optimization, in real time and + with fine granularity. Some particular challenges and issues faced + by the control plane telemetry are as follows: + + * How to correlate the End-to-End (E2E) Key Performance Indicators + (KPIs) to a specific layer's KPIs. For example, IPTV users may + describe their UE by the video smoothness and definition. Then in + case of an unusually poor UE KPI or a service disconnection, it is + non-trivial to delimit and pinpoint the issue in the responsible + protocol layer (e.g., the transport layer or the network layer), + the responsible protocol (e.g., IS-IS or BGP at the network + layer), and finally the responsible device(s) with specific + reasons. + + * Conventional OAM-based approaches for control plane KPI + measurement, which include Ping (L3), Traceroute (L3), Y.1731 + [y1731] (L2), and so on. One common issue behind these methods is + that they only measure the KPIs instead of reflecting the actual + running status of these protocols, making them less effective or + efficient for control plane troubleshooting and network + optimization. 
+ + * How to extend control plane monitoring beyond the BGP Monitoring + Protocol (BMP). + BMP is an example of the control plane telemetry; it is currently + used for monitoring BGP routes and enables rich applications, such + as BGP peer analysis, Autonomous System (AS) analysis, prefix + analysis, and security analysis. However, the monitoring of other + layers, protocols, and the cross-layer, cross-protocol KPI + correlations are still in their infancy (e.g., IGP monitoring is + not as extensive as BMP), which requires further research. + + Note that the requirement and solutions for network congestion + avoidance are also applicable to the control plane telemetry. + +3.1.3. Forwarding Plane Telemetry + + An effective forwarding plane telemetry system relies on the data + that the network device can expose. The quality, quantity, and + timeliness of data must meet some stringent requirements. This + raises some challenges for the network data plane devices where the + first-hand data originates. + + * A data plane device's main function is user traffic processing and + forwarding. While supporting network visibility is important, the + telemetry is just an auxiliary function, and it should strive to + not impede normal traffic processing and forwarding (i.e., the + forwarding behavior should not be altered, and the trade-off + between forwarding performance and telemetry should be well- + balanced). + + * Network operation applications require end-to-end visibility + across various sources, which can result in a huge volume of data. + However, the sheer quantity of data must not exhaust the network + bandwidth, regardless of the data delivery approach (i.e., whether + through in-band or out-of-band channels). + + * The data plane devices must provide timely data with the minimum + possible delay. Long processing, transport, storage, and analysis + delay can impact the effectiveness of the control loop and even + render the data useless.
+ + * The data should be structured, labeled, and easy for applications + to parse and consume. At the same time, the data types needed by + applications can vary significantly. The data plane devices need + to provide enough flexibility and programmability to support the + precise data provision for applications. + + * The data plane telemetry should support incremental deployment and + work even if some devices are unaware of the system. + + * The requirement and solutions for network congestion avoidance are + also applicable to the forwarding plane telemetry. + + Although not specific to the forwarding plane, these challenges are + more difficult for the forwarding plane because of the limited + resources and flexibility. Data plane programmability is essential + to support network telemetry. Newer data plane forwarding chips are + equipped with advanced telemetry features and provide flexibility to + support customized telemetry functions. + + Technique Taxonomy: This pertains to how one instruments the + telemetry; there can be multiple possible dimensions to classify the + forwarding plane telemetry techniques. + + * Active, Passive, and Hybrid: This dimension pertains to the end- + to-end measurement. Active and passive methods (as well as the + hybrid types) are well documented in [RFC7799]. Passive methods + include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring. + These methods usually have low data coverage. Improving the data + coverage comes at a very high bandwidth cost. On the other + hand, active methods include Ping, the One-Way Active Measurement + Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement + Protocol (TWAMP) [RFC5357], the Simple Two-way Active Measurement + Protocol (STAMP) [RFC8762], and Cisco's SLA Protocol [RFC6812]. + These methods are intrusive and only provide indirect network + measurements.
Hybrid methods, including IOAM [RFC9197], Alternate + Marking (AM) [RFC8321], and Multipoint Alternate Marking + [RFC8889], provide a well-balanced and more flexible approach. + However, these methods are also more complex to implement. + + * In-Band and Out-of-Band: Telemetry data carried in user packets + before being exported to a data collector is considered in-band + (e.g., IOAM [RFC9197]). Telemetry data that is directly exported + to a data collector without modifying user packets is considered + out-of-band (e.g., the postcard-based approach described in + Appendix A.3.5). It is also possible to have hybrid methods, + where only the telemetry instruction or partial data is carried by + user packets (e.g., AM [RFC8321]). + + * End-to-End and In-Network: End-to-end methods start from, and end + at, the network end hosts (e.g., Ping). In-network methods work + in networks and are transparent to end hosts. However, if needed, + in-network methods can be easily extended into end hosts. + + * Data Subject: Depending on the telemetry objective, the methods + can be flow based (e.g., IOAM [RFC9197]), path based (e.g., + Traceroute), and node based (e.g., IPFIX [RFC7011]). The various + data objects can be packet, flow record, measurement, states, and + signal. + +3.1.4. External Data Telemetry + + Events that occur outside the boundaries of the network system are + another important source of network telemetry. Correlating both + internal telemetry data and external events with the requirements of + network systems, as presented in [NMRG-ANTICIPATED-ADAPTATION], + provides a strategic and functional advantage to management + operations. + + As with other sources of telemetry information, the data and events + must meet strict requirements, especially in terms of timeliness, + which is essential to properly incorporate external event information + into network management applications. 
The specific challenges are + described as follows: + + * The role of the external event detector can be played by multiple + elements, including hardware (e.g., physical sensors, such as + seismometers) and software (e.g., big data sources that can + analyze streams of information, such as Twitter messages). Thus, + the transmitted data must support different shapes but, at the + same time, follow a common but extensible schema. + + * Since the main function of the external event detectors is to + perform the notifications, their timeliness is assumed. However, + once messages have been dispatched, they must be quickly collected + and inserted into the control plane with variable priority, which + is higher for important sources and events and lower for secondary + ones. + + * The schema used by external detectors must be easily adopted by + current and future devices and applications. Therefore, it must + be easily mapped to current data models, such as in terms of YANG. + + * As the communication with external entities outside the boundary + of a provider network may be realized over the Internet, the risk + of congestion is even more relevant in this context and proper + countermeasures must be taken. Solutions such as network + transport circuit breakers are needed as well. + + Organizing both internal and external telemetry information together + will be key for the general exploitation of the management + possibilities of current and future network systems, as reflected in + the incorporation of cognitive capabilities to new hardware and + software (virtual) elements. + +3.2. Second-Level Function Components + + The telemetry module at each plane can be further partitioned into + five distinct conceptual components: + + * Data Query, Analysis, and Storage: This component works at the + network operation application block in Figure 1. It is normally a + part of the network management system at the receiver side. 
On + one hand, it is responsible for issuing data requirements. The + data of interest can be modeled data through configuration or + custom data through programming. The data requirements can be + queries for one-shot data or subscriptions for events or streaming + data. On the other hand, it receives, stores, and processes the + returned data from network devices. Data analysis can be + interactive to initiate further data queries. This component can + reside in either network devices or remote controllers. It can be + centralized and distributed and involve one or more instances. + + * Data Configuration and Subscription: This component manages data + queries on devices. It determines the protocol and channel for + applications to acquire desired data. This component is also + responsible for configuring the desired data that might not be + directly available from data sources. The subscription data can + be described by models, templates, or programs. + + * Data Encoding and Export: This component determines how telemetry + data is delivered to the data analysis and storage component with + access control. The data encoding and the transport protocol may + vary due to the data export location. + + * Data Generation and Processing: The requested data needs to be + captured, filtered, processed, and formatted in network devices + from raw data sources. This may involve in-network computing and + processing on either the fast path or the slow path in network + devices. + + * Data Object and Source: This component determines the monitoring + objects and original data sources provisioned in the device. A + data source usually just provides raw data that needs further + processing. Each data source can be considered a probe. Some + data sources can be dynamically installed, while others will be + more static. 
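The five components above can be sketched as a small pipeline. The following is a minimal illustration only, not an API defined by this document: the names Subscription, RAW_SOURCES, generate, export, and collect, the "queue-depth" source, and the use of JSON are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical; the framework describes these
# components conceptually and does not prescribe an interface.

# Data Object and Source: raw data provisioned in the device (a "probe").
RAW_SOURCES: Dict[str, List[int]] = {
    "queue-depth": [3, 9, 4, 12],
}


@dataclass
class Subscription:
    """Data Configuration and Subscription: what to collect and how."""
    source: str                               # which data source to read
    transform: Callable[[List[int]], float]   # processing applied on the device
    encoding: str = "json"                    # requested export encoding


def generate(sub: Subscription) -> float:
    """Data Generation and Processing: capture raw data and process it."""
    return sub.transform(RAW_SOURCES[sub.source])


def export(sub: Subscription, value: float) -> bytes:
    """Data Encoding and Export: encode the processed data for delivery."""
    if sub.encoding != "json":
        raise ValueError(f"unsupported encoding: {sub.encoding}")
    return json.dumps({"source": sub.source, "value": value}).encode()


def collect(sub: Subscription) -> dict:
    """Data Query, Analysis, and Storage: issue the request, decode the reply."""
    return json.loads(export(sub, generate(sub)))


# The application asks for the maximum queue depth seen by the probe.
record = collect(Subscription(source="queue-depth", transform=max))
```

In this sketch the application only sees the decoded record; the choice of source, processing function, and encoding is carried by the subscription, which mirrors how the configuration component decouples applications from raw data sources.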
+ + +----------------------------------------+ + +----------------------------------------+ | + | | | + | Data Query, Analysis, & Storage | | + | | + + +-------+++------------------------------+ + ||| ^^^ + ||| ||| + ||V ||| + +--+V--------------------+++------------+ + +-----V---------------------+------------+ | + +---------------------+-------+----------+ | | + | Data Configuration | | | | + | & Subscription | Data Encoding | | | + | (model, template, | & Export | | | + | & program) | | | | + +---------------------+------------------| | | + | | | | + | Data Generation | | | + | & Processing | | | + | | | | + +----------------------------------------| | | + | | | | + | Data Object and Source | |-+ + | |-+ + +----------------------------------------+ + + Figure 2: Components in the Network Telemetry Framework + +3.3. Data Acquisition Mechanism and Type Abstraction + + Broadly speaking, network data can be acquired through subscription + (push) and query (poll). A subscription is a contract between + publisher and subscriber. After initial setup, the subscribed data + is automatically delivered to registered subscribers until the + subscription expires. There are two variations of subscription. The + subscriptions can be predefined, or the subscribers are allowed to + configure and tailor the published data to their specific needs. + + In contrast, queries are used when a client expects immediate and + one-off feedback from network devices. The queried data may be + directly extracted from some specific data source or synthesized and + processed from raw data. Queries work well for interactive network + telemetry applications. + + In general, data can be pulled (i.e., queried) whenever needed, but + in many cases, pushing the data (i.e., subscription) is more + efficient, and it can reduce the latency of a client detecting a + change.
From the data consumer point of view, there are four types + of data from network devices that a telemetry data consumer can + subscribe to or query: + + * Simple Data: Data that are steadily available from some datastore + or static probes in network devices. + + * Derived Data: Data that need to be synthesized or processed in the + network from raw data from one or more network devices. The data + processing function can be statically or dynamically loaded into + network devices. + + * Event-triggered Data: Data that are conditionally acquired based + on the occurrence of some events. An example of event-triggered + data could be an interface changing operational state between up + and down. Such data can be actively pushed through subscription + or passively polled through query. There are many ways to model + events, including using a Finite State Machine (FSM) or Event- + Condition-Action (ECA) [NETMOD-ECA-POLICY]. + + * Streaming Data: Data that are continuously generated. It can be a + time series or the dump of databases. For example, an interface + packet counter is exported every second. The streaming data + reflect real-time network states and metrics and require large + bandwidth and processing power. The streaming data are always + actively pushed to the subscribers. + + The above telemetry data types are not mutually exclusive. Rather, + they are often composite. Derived data is composed of simple data; + event-triggered data can be simple or derived; and streaming data can + be based on some recurring event. The relationships of these data + types are illustrated in Figure 3.
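The two acquisition mechanisms and the four data types can be made concrete with a small in-memory model. This is a sketch under stated assumptions, not a definition from this document: the Device class, its method names, the "rx-packets" counter, and the update dictionaries are all hypothetical.

```python
from typing import Callable, Dict, List


class Device:
    """Hypothetical device exposing both query (poll) and subscription (push)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {"rx-packets": 0}
        self.oper_up = True
        self._subscribers: List[Callable[[dict], None]] = []

    # Query (poll): immediate, one-off feedback to the client.
    def query(self, name: str) -> int:
        """Simple data: read directly from a datastore."""
        return self.counters[name]

    def query_derived(self, name: str, interval_s: int) -> float:
        """Derived data: synthesized on the device from simple data."""
        return self.counters[name] / interval_s

    # Subscription (push): data is delivered after the initial setup.
    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self._subscribers.append(callback)

    def _publish(self, update: dict) -> None:
        for callback in self._subscribers:
            callback(update)

    def set_oper_state(self, up: bool) -> None:
        """Event-triggered data: pushed only when the state changes."""
        if up != self.oper_up:
            self.oper_up = up
            self._publish({"type": "event", "oper-up": up})

    def tick(self, packets: int) -> None:
        """Streaming data: pushed on every reporting interval."""
        self.counters["rx-packets"] += packets
        self._publish({"type": "stream",
                       "rx-packets": self.counters["rx-packets"]})


received: List[dict] = []
dev = Device()
dev.subscribe(received.append)
dev.tick(100)
dev.tick(50)
dev.set_oper_state(True)    # no state change, so nothing is pushed
dev.set_oper_state(False)   # state change, so one event is pushed
rate = dev.query_derived("rx-packets", interval_s=2)
```

Note that the streaming update is built from a simple counter and the derived query is computed from the same counter, which reflects the composite nature of the data types described above.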
+ + +----------------------+ +-----------------+ + | Event-Triggered Data |<----+ Streaming Data | + +-------+---+----------+ +-----+---+-------+ + | | | | + | | | | + | | +--------------+ | | + | +-->| Derived Data |<--+ | + | +------+-------+ | + | | | + | V | + | +--------------+ | + +------>| Simple Data |<------+ + +--------------+ + + Figure 3: Data Type Relationship + + Subscription usually deals with event-triggered data and streaming + data, and query usually deals with simple data and derived data. However, + the reverse pairings are also possible. Advanced network telemetry + techniques are designed mainly for event-triggered or streaming data + subscription and derived data query. + +3.4. Mapping Existing Mechanisms into the Framework + + The following table shows how the existing mechanisms (mainly + published in the IETF and with the emphasis on the latest + technologies) are positioned in the framework. Given the vast body + of existing work, we cannot provide an exhaustive list, so the + mechanisms in the tables should be considered as just examples. + Also, some comprehensive protocols and techniques may cover multiple + aspects or modules of the framework, so a name in a block only + emphasizes one particular characteristic of it. More details about + some listed mechanisms can be found in Appendix A.
+ + +===============+=================+================+============+ + | | Management | Control Plane | Forwarding | + | | Plane | | Plane | + +===============+=================+================+============+ + | data | gNMI, NETCONF, | gNMI, NETCONF, | NETCONF, | + | configuration | RESTCONF, SNMP, | RESTCONF, | RESTCONF, | + | and subscribe | YANG-Push | YANG-Push | YANG-Push | + +---------------+-----------------+----------------+------------+ + | data | MIB, YANG | YANG | IOAM, | + | generation | | | PSAMP, | + | and process | | | PBT, AM | + +---------------+-----------------+----------------+------------+ + | data encoding | gRPC, HTTP, TCP | BMP, TCP | IPFIX, UDP | + | and export | | | | + +---------------+-----------------+----------------+------------+ + + Table 2: Existing Work Mapping + + Although the framework is generally suitable for any network + environments, the multi-domain telemetry has some unique challenges + that deserve further architectural consideration, which is out of the + scope of this document. + +4. Evolution of Network Telemetry Applications + + Network telemetry is an evolving technical area. As the network + moves towards the automated operation, network telemetry applications + undergo several stages of evolution, which add a new layer of + requirements to the underlying network telemetry techniques. Each + stage is built upon the techniques adopted by the previous stages + plus some new requirements. + + Stage 0 - Static Telemetry: The telemetry data source and type are + determined at design time. The network operator can only + configure how to use it with limited flexibility. + + Stage 1 - Dynamic Telemetry: The custom telemetry data can be + dynamically programmed or configured at runtime without + interrupting the network operation, allowing a trade-off among + resource, performance, flexibility, and coverage. 
+ + Stage 2 - Interactive Telemetry: The network operator can + continuously customize and fine tune the telemetry data in real + time to reflect the network operation's visibility requirements. + Compared with Stage 1, the changes are frequent based on the real- + time feedback. At this stage, some tasks can be automated, but + human operators still need to sit in the middle to make decisions. + + Stage 3 - Closed-Loop Telemetry: The telemetry is free from the + interference of human operators, except for generating the + reports. The intelligent network operation engine automatically + issues the telemetry data requests, analyzes the data, and updates + the network operations in closed control loops. + + Existing technologies are ready for Stages 0 and 1. Individual + applications for Stages 2 and 3 are also possible now. However, the + future autonomic networks may need a comprehensive operation + management system that works at Stages 2 and 3 to cover all the + network operation tasks. A well-defined network telemetry framework + is the first step towards this direction. + +5. Security Considerations + + The complexity of network telemetry raises significant security + implications. For example, telemetry data can be manipulated to + exhaust various network resources at each plane as well as the data + consumer; falsified or tampered data can mislead the decision-making + process and paralyze networks; and wrong configuration and + programming for telemetry is equally harmful. The telemetry data is + highly sensitive, which exposes a lot of information about the + network and its configuration. Some of that information can make + designing attacks against the network much easier (e.g., exact + details of what software and patches have been installed) and allows + an attacker to determine whether a device may be subject to + unprotected security vulnerabilities. 
+ + Given that this document has proposed a framework for network + telemetry and the telemetry mechanisms discussed are more extensive + (in both message frequency and traffic amount) than the conventional + network OAM concepts, we must also anticipate that new security + considerations may arise. A number of techniques already + exist for securing the forwarding plane, control plane, and + management plane in a network, but it is important to consider if any + new threat vectors are now being enabled via the use of network + telemetry procedures and mechanisms. + + This document proposes a conceptual architecture for collecting, + transporting, and analyzing a wide variety of data sources in support + of network applications. The protocols, data formats, and + configurations chosen to implement this framework will dictate the + specific security considerations. These considerations may include: + + * Telemetry framework trust and policy models; + + * Role management and access control for enabling and disabling + telemetry capabilities; + + * Protocol transport used for telemetry data and its inherent + security capabilities; + + * Telemetry data stores, storage encryption, methods of access, and + retention practices; + + * Tracking telemetry events and any abnormalities that might + identify malicious attacks using telemetry interfaces; + + * Authentication and integrity protection of telemetry data to make + data more trustworthy; and + + * Segregating the telemetry data traffic from the data traffic + carried over the network (e.g., historically management access and + management data may be carried via an independent management + network). + + Some security considerations highlighted above may be minimized or + negated with policy management of network telemetry.
In a network + telemetry deployment, it would be advantageous to separate telemetry + capabilities into different classes of policies, i.e., Role-Based + Access Control and Event-Condition-Action policies. Also, potential + conflicts between network telemetry mechanisms must be detected + accurately and resolved quickly to avoid unnecessary network + telemetry traffic propagation escalating into an unintended or + intended denial-of-service attack. + + Further study of the security issues will be required, and it is + expected that the security mechanisms and protocols will be developed and + deployed along with a network telemetry system. + +6. IANA Considerations + + This document has no IANA actions. + +7. Informative References + + [gnmi] Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, + C., and C. Marrow, "gRPC Network Management Interface", + IETF 98, March 2017, + . + + [gpb] Google Developers, "Protocol Buffers", + . + + [grpc] gRPC, "gRPC: A high performance, open source universal RPC + framework", . + + [IPPM-IOAM-DIRECT-EXPORT] + Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., + Bhandari, S., Ed., Sivakolundu, R., and T. Mizrahi, Ed., + "In-situ OAM Direct Exporting", Work in Progress, + Internet-Draft, draft-ietf-ippm-ioam-direct-export-07, 13 + October 2021, . + + [IPPM-POSTCARD-BASED-TELEMETRY] + Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, + T., Li, Z., Mishra, G., Shin, J., and K. Lee, "In-Situ OAM + Marking-based Direct Export", Work in Progress, Internet- + Draft, draft-song-ippm-postcard-based-telemetry-12, 12 May + 2022, . + + [NETCONF-DISTRIB-NOTIF] + Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, + "Subscription to Distributed Notifications", Work in + Progress, Internet-Draft, draft-ietf-netconf-distributed- + notif-03, 10 January 2022, + . + + [NETCONF-UDP-NOTIF] + Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., + and P.
Lucente, "UDP-based Transport for Configured + Subscriptions", Work in Progress, Internet-Draft, draft- + ietf-netconf-udp-notif-05, 4 March 2022, + . + + [NETMOD-ECA-POLICY] + Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, + "A YANG Data model for ECA Policy Management", Work in + Progress, Internet-Draft, draft-ietf-netmod-eca-policy-01, + 19 February 2021, . + + [NMRG-ANTICIPATED-ADAPTATION] + Martinez-Julia, P., Ed., "Exploiting External Event + Detectors to Anticipate Resource Requirements for the + Elastic Adaptation of SDN/NFV Systems", Work in Progress, + Internet-Draft, draft-pedro-nmrg-anticipated-adaptation- + 02, 29 June 2018, . + + [NMRG-IBN-CONCEPTS-DEFINITIONS] + Clemm, A., Ciavaglia, L., Granville, L. Z., and J. + Tantsura, "Intent-Based Networking - Concepts and + Definitions", Work in Progress, Internet-Draft, draft- + irtf-nmrg-ibn-concepts-definitions-09, 24 March 2022, + . + + [OPSAWG-DNP4IQ] + Song, H., Ed. and J. Gong, "Requirements for Interactive + Query with Dynamic Network Probes", Work in Progress, + Internet-Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017, + . + + [OPSAWG-IFIT-FRAMEWORK] + Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "A + Framework for In-situ Flow Information Telemetry", Work in + Progress, Internet-Draft, draft-song-opsawg-ifit- + framework-17, 22 February 2022, + . + + [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, + "Simple Network Management Protocol (SNMP)", RFC 1157, + DOI 10.17487/RFC1157, May 1990, + . + + [RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. + Schoenwaelder, Ed., "Structure of Management Information + Version 2 (SMIv2)", STD 58, RFC 2578, + DOI 10.17487/RFC2578, April 1999, + . + + [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, + DOI 10.17487/RFC2981, October 2000, + . + + [RFC3176] Phaal, P., Panchen, S., and N. 
McKee, "InMon Corporation's + sFlow: A Method for Monitoring Traffic in Switched and + Routed Networks", RFC 3176, DOI 10.17487/RFC3176, + September 2001, . + + [RFC3411] Harrington, D., Presuhn, R., and B. Wijnen, "An + Architecture for Describing Simple Network Management + Protocol (SNMP) Management Frameworks", STD 62, RFC 3411, + DOI 10.17487/RFC3411, December 2002, + . + + [RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations + for the Simple Network Management Protocol (SNMP)", + STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002, + . + + [RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management + Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, + September 2004, . + + [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export + Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004, + . + + [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. + Zekauskas, "A One-way Active Measurement Protocol + (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, + . + + [RFC5085] Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual + Circuit Connectivity Verification (VCCV): A Control + Channel for Pseudowires", RFC 5085, DOI 10.17487/RFC5085, + December 2007, . + + [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. + Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", + RFC 5357, DOI 10.17487/RFC5357, October 2008, + . + + [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, + DOI 10.17487/RFC5424, March 2009, + . + + [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for + the Network Configuration Protocol (NETCONF)", RFC 6020, + DOI 10.17487/RFC6020, October 2010, + . + + [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., + and A. Bierman, Ed., "Network Configuration Protocol + (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, + . + + [RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, + S., and E. 
Yedavalli, "Cisco Service-Level Assurance + Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013, + . + + [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, + "Specification of the IP Flow Information Export (IPFIX) + Protocol for the Exchange of Flow Information", STD 77, + RFC 7011, DOI 10.17487/RFC7011, September 2013, + . + + [RFC7258] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an + Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May + 2014, . + + [RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. + Weingarten, "An Overview of Operations, Administration, + and Maintenance (OAM) Tools", RFC 7276, + DOI 10.17487/RFC7276, June 2014, + . + + [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext + Transfer Protocol Version 2 (HTTP/2)", RFC 7540, + DOI 10.17487/RFC7540, May 2015, + . + + [RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., + Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic + Networking: Definitions and Design Goals", RFC 7575, + DOI 10.17487/RFC7575, June 2015, + . + + [RFC7799] Morton, A., "Active and Passive Metrics and Methods (with + Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, + May 2016, . + + [RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP + Monitoring Protocol (BMP)", RFC 7854, + DOI 10.17487/RFC7854, June 2016, + . + + [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", + RFC 7950, DOI 10.17487/RFC7950, August 2016, + . + + [RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF + Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017, + . + + [RFC8084] Fairhurst, G., "Network Transport Circuit Breakers", + BCP 208, RFC 8084, DOI 10.17487/RFC8084, March 2017, + . + + [RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage + Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085, + March 2017, . 
+ + [RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data + Interchange Format", STD 90, RFC 8259, + DOI 10.17487/RFC8259, December 2017, + . + + [RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, + L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, + "Alternate-Marking Method for Passive and Hybrid + Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, + January 2018, . + + [RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, + E., and A. Tripathy, "Subscription to YANG Notifications", + RFC 8639, DOI 10.17487/RFC8639, September 2019, + . + + [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications + for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, + September 2019, . + + [RFC8671] Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S. + Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring + Protocol (BMP)", RFC 8671, DOI 10.17487/RFC8671, November + 2019, . + + [RFC8762] Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple + Two-Way Active Measurement Protocol", RFC 8762, + DOI 10.17487/RFC8762, March 2020, + . + + [RFC8889] Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto, + "Multipoint Alternate-Marking Method for Passive and + Hybrid Performance Monitoring", RFC 8889, + DOI 10.17487/RFC8889, August 2020, + . + + [RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, + R., and A. Ghanwani, "Service Function Chaining (SFC) + Operations, Administration, and Maintenance (OAM) + Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020, + . + + [RFC9069] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, + "Support for Local RIB in the BGP Monitoring Protocol + (BMP)", RFC 9069, DOI 10.17487/RFC9069, February 2022, + . + + [RFC9197] Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi, + Ed., "Data Fields for In Situ Operations, Administration, + and Maintenance (IOAM)", RFC 9197, DOI 10.17487/RFC9197, + May 2022, . 
+ + [W3C.REC-xml-20081126] + Bray, T., Paoli, J., Sperberg-McQueen, M., Maler, E., and + F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth + Edition)", World Wide Web Consortium Recommendation REC- + xml-20081126, November 2008, + . + + [y1731] ITU-T, "Operations, administration and maintenance (OAM) + functions and mechanisms for Ethernet-based networks", + ITU-T Recommendation G.8013/Y.1731, August 2015, + . + +Appendix A. A Survey on Existing Network Telemetry Techniques + + In this non-normative appendix, we provide an overview of some + existing techniques and standard proposals for each network telemetry + module. + +A.1. Management Plane Telemetry + +A.1.1. Push Extensions for NETCONF + + NETCONF [RFC6241] is a popular network management protocol + recommended by IETF. Its core strength is for managing + configuration, but it can also be used for data collection. + YANG-Push [RFC8639] [RFC8641] extends NETCONF and enables subscriber + applications to request a continuous, customized stream of updates + from a YANG datastore. Providing such visibility into changes made + upon YANG configuration and operational objects enables new + capabilities based on the remote mirroring of configuration and + operational state. Moreover, a distributed data collection mechanism + [NETCONF-DISTRIB-NOTIF] via a UDP-based publication channel + [NETCONF-UDP-NOTIF] provides enhanced efficiency for the NETCONF- + based telemetry. + +A.1.2. gRPC Network Management Interface + + gRPC Network Management Interface (gNMI) [gnmi] is a network + management protocol based on the gRPC [grpc] Remote Procedure Call + (RPC) framework. With a single gRPC service definition, both + configuration and telemetry can be covered. gRPC is an open-source + micro-service communication framework based on HTTP/2 [RFC7540]. 
It + provides a number of capabilities that are well suited to network + telemetry, including: + + * A full-duplex streaming transport model; when combined with a + binary encoding mechanism, it provides good telemetry efficiency. + + * Higher-level feature consistency across platforms that common + HTTP/2 libraries typically do not provide. This characteristic is + especially valuable because telemetry data collectors + normally reside on a large variety of platforms. + + * A built-in load-balancing and failover mechanism. + +A.2. Control Plane Telemetry + +A.2.1. BGP Monitoring Protocol + + BMP [RFC7854] is used to monitor BGP sessions and is intended to + provide a convenient interface for obtaining route views. + + BGP routing information is sent from the monitored device(s) to + the BMP monitoring station over a BMP TCP session. The + BGP peers are monitored by the BMP Peer Up and Peer Down + notifications. The BGP routes (including Adj-RIB-In [RFC7854], + Adj-RIB-Out [RFC8671], and Loc-RIB [RFC9069]) are encapsulated in + the BMP Route Monitoring Message and the BMP Route Mirroring Message, + providing both an initial table dump and real-time route updates. In + addition, BGP statistics are reported through the BMP Stats Report + Message, which could be either timer-triggered or event-driven. + Future BMP extensions could further enrich BGP monitoring + applications. + +A.3. Data Plane Telemetry + +A.3.1. Alternate-Marking (AM) Technology + + The Alternate-Marking method enables efficient measurements of packet + loss, delay, and jitter in both IP and overlay networks, as presented + in [RFC8321] and [RFC8889]. + + This technique can be applied to point-to-point and multipoint-to- + multipoint flows. Alternate Marking creates batches of packets by + alternating the value of 1 bit (or a label) of the packet header.
+ These batches of packets are unambiguously recognized over the + network, and comparing the packet counters for each batch allows + packet loss to be calculated. The same idea can be applied to delay + measurement by selecting ad hoc packets with a marking bit dedicated + to delay measurements. + + The Alternate-Marking method needs two counters per marking period + for each monitored flow. For instance, with n + measurement points and m monitored flows, the number of + packet counters for each time interval is on the order of n*m*2 (1 per color). + + Since networks offer rich sets of network performance measurement + data (e.g., packet counters), conventional approaches run into + limitations. The bottleneck is the generation and export of the data + and the amount of data that can be reasonably collected from the + network. In addition, management tasks related to determining and + configuring which data to generate lead to significant deployment + challenges. + + The Multipoint Alternate-Marking approach, described in [RFC8889], + aims to resolve this issue and make performance monitoring more + flexible when a detailed analysis is not needed. + + An application orchestrates network performance measurement tasks + across the network to allow for optimized monitoring. The + application can choose how roughly or precisely to configure + measurement points depending on its requirements. + + Using Alternate Marking, it is possible to monitor a Multipoint + Network without in-depth examination by using Network Clustering + (subnetworks that are portions of the entire network that preserve + the same property of the entire network, called clusters).
So in the + case where there is packet loss or the delay is too high, the + specific filtering criteria could be applied to gather a more + detailed analysis by using a different combination of clusters up to + a per-flow measurement as described in the Alternate-Marking document + [RFC8321]. + + In summary, an application can configure end-to-end network + monitoring. If the network does not experience issues, this + approximate monitoring is good enough and is very cheap in terms of + network resources. However, in case of problems, the application + becomes aware of the issues from this approximate monitoring and, in + order to localize the portion of the network that has issues, + configures the measurement points more extensively, allowing more + detailed monitoring to be performed. After the detection and + resolution of the problem, the initial approximate monitoring can be + used again. + +A.3.2. Dynamic Network Probe + + A hardware-based Dynamic Network Probe (DNP) [OPSAWG-DNP4IQ] provides + a programmable means to customize the data that an application + collects from the data plane. A direct benefit of DNP is the + reduction of the exported data. A full DNP solution covers several + components including data source, data subscription, and data + generation. The data subscription needs to define the derived data + that can be composed and derived from raw data sources. The data + generation takes advantage of the moderate in-network computing to + produce the desired data. + + While DNP can introduce unforeseeable flexibility to the data plane + telemetry, it also faces some challenges. It requires a flexible + data plane that can be dynamically reprogrammed at runtime. The + programming Application Programming Interface (API) is yet to be + defined. + +A.3.3. IP Flow Information Export (IPFIX) Protocol + + Traffic on a network can be seen as a set of flows passing through + network elements. 
IPFIX [RFC7011] provides a means of transmitting + traffic flow information for administrative or other purposes. A + typical IPFIX-enabled system includes a pool of Metering Processes + that collect data packets at one or more Observation Points, + optionally filter them, and aggregate information about these + packets. An Exporter then gathers each of the Observation Points + together into an Observation Domain and sends this information via + the IPFIX protocol to a Collector. + +A.3.4. In Situ OAM + + Classical passive and active monitoring and measurement techniques + are either inaccurate or resource-consuming. It is preferable to + directly acquire data associated with a flow's packets when the + packets pass through a network. IOAM [RFC9197], a data generation + technique, embeds a new instruction header into user packets, and the + instruction directs the network nodes to add the requested data to + the packets. Thus, at the path's end, the packet's experience gained + on the entire forwarding path can be collected. Such firsthand data + is invaluable to many network OAM applications. + + However, IOAM also faces some challenges. Issues of performance + impact, security, scalability and overhead limits, encapsulation + difficulties in some protocols, and cross-domain deployment need to + be addressed. + +A.3.5. Postcard-Based Telemetry + + Postcard-Based Telemetry (PBT), as embodied in IOAM Direct Export (DEX) + [IPPM-IOAM-DIRECT-EXPORT] and IOAM Marking + [IPPM-POSTCARD-BASED-TELEMETRY], is a complementary technique to the + passport-based IOAM [RFC9197]. PBT directly exports data at each + node through an independent packet. At the cost of higher bandwidth + overhead and the need for data correlation, PBT shows several unique + advantages. It can also help identify where a packet is dropped + on its forwarding path. + +A.3.6. Existing OAM for Specific Data Planes + + Various data planes raise unique OAM requirements.
IETF has + published OAM technique and framework documents (e.g., [RFC8924] and + [RFC5085]) targeting different data planes such as Multiprotocol + Label Switching (MPLS), L2 Virtual Private Network (VPN), Network + Virtualization over Layer 3 (NVO3), Virtual Extensible LAN (VXLAN), + Bit Index Explicit Replication (BIER), Service Function Chaining + (SFC), Segment Routing (SR), and Deterministic Networking (DETNET). + The aforementioned data plane telemetry techniques can be used to + enhance the OAM capability on such data planes. + +A.4. External Data and Event Telemetry + +A.4.1. Sources of External Events + + To ensure that the information provided by external event detectors + and used by the network management solutions is meaningful for + management purposes, the network telemetry framework must ensure that + such detectors (sources) are easily connected to the management + solutions (sinks). This requires the specification of a list of + potential external data sources that could be of interest in network + management and matching it to the connectors and/or interfaces + required to connect them. + + Categories of external event sources that may be of interest to + network management include: + + * Smart objects and sensors. With the consolidation of the Internet + of Things (IoT), any network system will have many smart objects + attached to its physical surroundings and logical operation + environments. Most of these objects will be essentially based on + sensors of many kinds (e.g., temperature, humidity, and presence), + and the information they provide can be very useful for the + management of the network, even when they are not specifically + deployed for such purpose. Elements of this source type will + usually provide a specific protocol for interaction, especially + one of the protocols related to IoT, such as the Constrained + Application Protocol (CoAP). + + * Online news reporters. 
Several online news services have the + ability to provide an enormous quantity of information about + different events occurring in the world. Some of those events can + have an impact on the network system managed by a specific + framework; therefore, such information may be of interest to the + management solution. For instance, diverse security reports, such + as Common Vulnerabilities and Exposures (CVEs), can be issued by + the corresponding authority and used by the management solution to + update the managed system, if needed. Instead of a specific + protocol and data format, the sources of this kind of information + usually follow a relaxed but structured format. This format will + be part of both the ontology and information model of the + telemetry framework. + + * Global event analyzers. The advance of big data analyzers + provides a huge amount of information and, more interestingly, the + identification of events detected by analyzing many data streams + from different origins. In contrast with the other types of + sources, which are focused on specific events, the detectors of + this source type will detect generic events. For example, during + a sports event, some unexpected movement makes it fascinating, and + many people connect to sites that are reporting on the event. The + underlying networks supporting the services that cover the event + can be affected by such a situation, so their management solutions + should be aware of it. In contrast with the other source types, a + new information model, format, and reporting protocol are required + to integrate the detectors of this type with the management + solution. + + Additional detector types can be added to the system, but generally + they will be the result of composing the properties offered by these + main classes. + +A.4.2.
Connectors and Interfaces + + To allow external event detectors to be properly integrated with + other management solutions, both elements must expose interfaces and + protocols that are appropriate to their particular objectives. Since + external event detectors will be focused on providing their + information to their main consumers, which generally will not be + limited to the network management solutions, the framework must + include the definition of the required connectors to ensure that the + interconnection between detectors (sources) and their consumers + within the management systems (sinks) is effective. + + In some situations, the interconnection between external event + detectors and the management system is via the management plane. For + those situations, there will be a special connector that provides the + typical interfaces found in most other elements connected to the + management plane. For instance, the interfaces could accomplish this + with a specific data model (YANG) and a specific telemetry protocol, + such as NETCONF, YANG-Push, or gRPC. + +Acknowledgments + + We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe + Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe + Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, + Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin + Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Éric + Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many + others who have provided helpful comments and suggestions to improve + this document. + +Contributors + + The other contributors to this document are Tianran Zhou, Zhenbin Li, + Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm.
+ +Authors' Addresses + + Haoyu Song + Futurewei + United States of America + Email: haoyu.song@futurewei.com + + + Fengwei Qin + China Mobile + China + Email: qinfengwei@chinamobile.com + + + Pedro Martinez-Julia + NICT + Japan + Email: pedro@nict.go.jp + + + Laurent Ciavaglia + Rakuten Mobile + France + Email: laurent.ciavaglia@rakuten.com + + + Aijun Wang + China Telecom + China + Email: wangaj3@chinatelecom.cn -- cgit v1.2.3