diff options
Diffstat (limited to 'doc/rfc/rfc9574.txt')
-rw-r--r-- | doc/rfc/rfc9574.txt | 1580 |
1 files changed, 1580 insertions, 0 deletions
diff --git a/doc/rfc/rfc9574.txt b/doc/rfc/rfc9574.txt new file mode 100644 index 0000000..5f7e3b3 --- /dev/null +++ b/doc/rfc/rfc9574.txt @@ -0,0 +1,1580 @@ + + + + +Internet Engineering Task Force (IETF) J. Rabadan, Ed. +Request for Comments: 9574 S. Sathappan +Category: Standards Track Nokia +ISSN: 2070-1721 W. Lin + Juniper Networks + M. Katiyar + Versa Networks + A. Sajassi + Cisco Systems + May 2024 + + + Optimized Ingress Replication Solution for Ethernet VPNs (EVPNs) + +Abstract + + Network Virtualization Overlay (NVO) networks using Ethernet VPNs + (EVPNs) as their control plane may use trees based on ingress + replication or Protocol Independent Multicast (PIM) to convey the + overlay Broadcast, Unknown Unicast, or Multicast (BUM) traffic. PIM + provides an efficient solution that prevents sending multiple copies + of the same packet over the same physical link; however, it may not + always be deployed in the NVO network core. Ingress replication + avoids the dependency on PIM in the NVO network core. While ingress + replication provides a simple multicast transport, some NVO networks + with demanding multicast applications require a more efficient + solution without PIM in the core. This document describes a solution + to optimize the efficiency of ingress replication trees. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc9574. + +Copyright Notice + + Copyright (c) 2024 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Revised BSD License text as described in Section 4.e of the + Trust Legal Provisions and are provided without warranty as described + in the Revised BSD License. + +Table of Contents + + 1. Introduction + 2. Terminology and Conventions + 3. Solution Requirements + 4. EVPN BGP Attributes for Optimized Ingress Replication + 5. Non-selective Assisted Replication (AR) Solution Description + 5.1. Non-selective AR-REPLICATOR Procedures + 5.2. Non-selective AR-LEAF Procedures + 5.3. RNVE Procedures + 6. Selective Assisted Replication (AR) Solution Description + 6.1. Selective AR-REPLICATOR Procedures + 6.2. Selective AR-LEAF Procedures + 7. Pruned Flooding Lists (PFLs) + 7.1. Example of a Pruned Flooding List + 8. AR Procedures for Single-IP AR-REPLICATORS + 9. AR Procedures and EVPN All-Active Multihoming Split-Horizon + 9.1. Ethernet Segments on AR-LEAF Nodes + 9.2. Ethernet Segments on AR-REPLICATOR Nodes + 10. Security Considerations + 11. IANA Considerations + 12. References + 12.1. Normative References + 12.2. Informative References + Acknowledgements + Contributors + Authors' Addresses + +1. Introduction + + Ethernet Virtual Private Networks (EVPNs) may be used as the control + plane for a Network Virtualization Overlay (NVO) network [RFC8365]. + Network Virtualization Edge (NVE) and Provider Edge (PE) devices that + are part of the same EVPN Broadcast Domain (BD) use Ingress + Replication (IR) or PIM-based trees to transport the tenant's + Broadcast, Unknown Unicast, or Multicast (BUM) traffic. + + In the ingress replication approach, the ingress NVE receiving a BUM + frame from the Tenant System (TS) will create as many copies of the + frame as the number of remote NVEs/PEs that are attached to the BD. + Each of those copies will be encapsulated into an IP packet where the + outer IP Destination Address (IP DA) identifies the loopback of the + egress NVE/PE. The IP fabric core nodes (also known as spines) will + simply route the IP-encapsulated BUM frames based on the outer IP DA. + If PIM-based trees are used instead of ingress replication, the NVEs/ + PEs attached to the same BD will join a PIM-based tree. The ingress + NVE receiving a BUM frame will send a single copy of the frame, + encapsulated into an IP packet where the outer IP DA is the multicast + address that represents the PIM-based tree. The IP fabric core nodes + are part of the PIM tree and keep multicast state for the multicast + group, so that IP-encapsulated BUM frames can be routed to all the + NVEs/PEs that joined the tree. + + The two approaches are illustrated in Figure 1. On the left-hand + side of the diagram, NVE1 uses ingress replication to send a BUM + frame (originated from Tenant System TS1) to the remote nodes + attached to the BD, i.e., NVE2, NVE3, and PE1. On the right-hand + side, the same example is depicted but using a PIM-based tree, i.e., + (S1,G1), instead of ingress replication. While a single copy of the + tunneled BUM frame is generated in the latter approach, all the + routers in the fabric need to keep multicast state, e.g., the spine + keeps a PIM routing entry for (S1,G1) with an Incoming Interface + (IIF) and three Outgoing Interfaces (OIFs). + + To WAN To WAN + ^ ^ + | | + +-----+ +-----+ + +----------| PE1 |-----------+ +----------| PE1 |-----------+ + | +--^--+ | | +--^--+ | + | | IP Fabric | | | IP Fabric | + | PE | | (S1,G1) |OIF to G1 | + | +----PE->+-----+ No State | | IIF +-----+ OIF to G1 | + | | +---2->|Spine|------+ | | +------>Spine|------+ | + | | | +-3->+-----+ | | | | +-----+ | | + | | | | 2 3 | | |PIM |OIF to G1| | + | | | |IR | | | | |tree | | | + |+-----+ +--v--+ +--v--+ | |+-----+ +--v--+ +--v--+ | + +| NVE1|---| NVE2|---| NVE3|-+ +| NVE1|---| NVE2|---| NVE3|-+ + +--^--+ +-----+ +-----+ +--^--+ +-----+ +-----+ + | | | | | | + | v v | v v + TS1 TS2 TS3 TS1 TS2 TS3 + + Figure 1: Ingress Replication vs. PIM-Based Trees in NVO Networks + + In NVO networks where PIM-based trees cannot be used, ingress + replication is the only option. Examples of these situations are NVO + networks where the core nodes do not support PIM or the network + operator does not want to run PIM in the core. + + In some use cases, the amount of replication for BUM traffic is kept + under control on the NVEs due to the following fairly common + assumptions: + + a. Broadcast traffic is greatly reduced due to the proxy Address + Resolution Protocol (ARP) and proxy Neighbor Discovery (ND) + capabilities supported by EVPNs [RFC9161] on the NVEs. Some NVEs + can even provide Dynamic Host Configuration Protocol (DHCP) + server functions for the attached TSs, reducing the broadcast + traffic even further. + + b. Unknown unicast traffic is greatly reduced in NVO networks where + all the Media Access Control (MAC) and IP addresses from the TSs + are learned in the control plane. + + c. Multicast applications are not used. + + If the above assumptions are true for a given NVO network, then + ingress replication provides a simple solution for multi-destination + traffic. However, statement c. above is not always true, and + multicast applications are required in many use cases. + + When the multicast sources are attached to NVEs residing in + hypervisors or low-performance-replication Top-of-Rack (ToR) + switches, the ingress replication of a large amount of multicast + traffic to a significant number of remote NVEs/PEs can seriously + degrade the performance of the NVE and impact the application. + + This document describes a solution that makes use of two ingress + replication optimizations: + + 1. Assisted Replication (AR) + + 2. Pruned Flooding Lists (PFLs) + + Assisted Replication consists of a set of procedures that allows the + ingress NVE/PE to send a single copy of a broadcast or multicast + frame received from a TS to the BD without the need for PIM in the + underlay. Assisted Replication defines the roles of AR-REPLICATOR + and AR-LEAF routers. The AR-LEAF is the ingress NVE/PE attached to + the TS. The AR-LEAF sends a single copy of a broadcast or multicast + packet to a selected AR-REPLICATOR that replicates the packet + multiple times to remote AR-LEAF or AR-REPLICATOR routers and is + therefore "assisting" the ingress AR-LEAF in delivering the broadcast + or multicast traffic to the remote NVEs/PEs attached to the same BD. + Assisted Replication can use a single AR-REPLICATOR or two AR- + REPLICATOR routers in the path between the ingress AR-LEAF and the + remote destination NVEs/PEs. The procedures that use a single AR- + REPLICATOR (the non-selective Assisted Replication solution) are + specified in Section 5, whereas Section 6 describes how multi-stage + replication, i.e., two AR-REPLICATOR routers in the path between the + ingress AR-LEAF and destination NVEs/PEs, is accomplished (the + selective Assisted Replication solution). The procedures for + Assisted Replication do not impact unknown unicast traffic, which + follows the same forwarding procedures as known unicast traffic so + that packet reordering does not occur. + + PFLs provide a method for the ingress NVE/PE to prune or remove + certain destination NVEs/PEs from a flooding list, depending on the + interest of those NVEs/PEs in receiving BUM traffic. As specified in + [RFC8365], an NVE/PE builds a flooding list for BUM traffic based on + the next hops of the received EVPN Inclusive Multicast Ethernet Tag + routes for the BD. While [RFC8365] states that the flooding list is + used for all BUM traffic, this document allows pruning certain next + hops from the list. As an example, suppose an ingress NVE creates a + flooding list with next hops PE1, PE2, and PE3. If PE2 and PE3 did + not signal any interest in receiving unknown unicast traffic in their + Inclusive Multicast Ethernet Tag routes, when the ingress NVE + receives an unknown unicast frame from a TS, it will replicate it + only to PE1. That is, PE2 and PE3 are "pruned" from the NVE's + flooding list for unknown unicast traffic. PFLs can be used with + ingress replication or Assisted Replication and are described in + Section 7. + + Both optimizations -- Assisted Replication and PFLs -- may be used + together or independently so that the performance and efficiency of + the network to transport multicast can be improved. Both solutions + require some extensions to the BGP attributes used in [RFC7432]; see + Section 4 for details. + + The Assisted Replication solution described in this document is + focused on NVO networks (hence its use of IP tunnels). MPLS + transport networks are out of scope for this document. The PFLs + solution MAY be used in NVO and MPLS transport networks. + + Section 3 lists the requirements of the combined optimized ingress + replication solution, whereas Sections 5 and 6 describe the Assisted + Replication solution for non-selective and selective procedures, + respectively. Section 7 provides the PFLs solution. + +2. Terminology and Conventions + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and + "OPTIONAL" in this document are to be interpreted as described in + BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all + capitals, as shown here. + + The following terminology is used throughout this document: + + AR-IP: Assisted Replication - IP. Refers to an IP address owned by + the AR-REPLICATOR and used to differentiate the incoming traffic + that must follow the AR procedures. The AR-IP is also used in the + Tunnel Identifier and Next Hop fields of the Replicator-AR route. + + AR-LEAF: Assisted Replication - LEAF. Refers to an NVE/PE that + sends all the BM traffic to an AR-REPLICATOR that can replicate + the traffic further on its behalf. An AR-LEAF is typically an + NVE/PE with poor replication performance capabilities. + + AR-REPLICATOR: Assisted Replication - REPLICATOR. Refers to an NVE/ + PE that can replicate broadcast or multicast traffic received on + overlay tunnels to other overlay tunnels and local Attachment + Circuits (ACs). This document defines the control and data plane + procedures that an AR-REPLICATOR needs to follow. + + AR-VNI: Assisted Replication - VNI. Refers to a Virtual eXtensible + Local Area Network (VXLAN) Network Identifier (VNI) advertised by + the AR-REPLICATOR along with the Replicator-AR route. It is used + to identify the incoming packets that must follow the AR + procedures ONLY in the single-IP AR-REPLICATOR case (see + Section 8). + + Assisted Replication forwarding mode: In the case of an AR-LEAF, + sending an AC Broadcast and Multicast (BM) packet to a single AR- + REPLICATOR with a tunnel destination address AR-IP. In the case + of an AR-REPLICATOR, this means sending a BM packet to a selected + number of, or all of, the overlay tunnels when the packet was + previously received from an overlay tunnel. + + BD: Broadcast Domain, as defined in [RFC7432]. + + BD label: Defined as the MPLS label that identifies the BD and is + advertised in Regular-IR or Replicator-AR routes, when the + encapsulation is MPLS over GRE (MPLSoGRE) or MPLS over UDP + (MPLSoUDP). + + BM traffic: Refers to broadcast and multicast frames (excluding + unknown unicast frames). + + DF and NDF: Designated Forwarder and Non-Designated Forwarder. + These are roles defined in NVEs/PEs attached to multihomed TSs, as + per [RFC7432] and [RFC8365]. + + ES and ESI: Ethernet Segment and Ethernet Segment Identifier. EVPN + multihoming concepts as specified in [RFC7432]. + + EVI: EVPN Instance. A group of Provider Edge (PE) devices + participating in the same EVPN service, as specified in [RFC7432]. + + GRE: Generic Routing Encapsulation [RFC4023]. + + Ingress Replication forwarding mode: Refers to the ingress + replication behavior explained in [RFC7432]. In this mode, an AC + BM packet copy is sent to each remote PE/NVE in the BD, and an + overlay BM packet is sent only to the ACs and not to other overlay + tunnels. + + IR-IP: Ingress Replication - IP. Refers to the local IP address of + an NVE/PE that is used for the ingress replication signaling and + procedures provided in [RFC7432]. Encapsulated incoming traffic + with an outer destination IP address matching the IR-IP will + follow the procedures for ingress replication and not the + procedures for Assisted Replication. The IR-IP is also used in + the Tunnel Identifier and Next Hop fields of the Regular-IR route. + + IR-VNI: Ingress Replication - VNI. Refers to a VNI advertised along + with the Inclusive Multicast Ethernet Tag route for the ingress + replication tunnel type. + + MPLS: Multi-Protocol Label Switching. + + NVE: Network Virtualization Edge [RFC8365]. + + NVGRE: Network virtualization using Generic Routing Encapsulation + [RFC7637]. + + PE: Provider Edge. + + PMSI: P-Multicast Service Interface. A conceptual interface for a + PE to send customer multicast traffic to all or some PEs in the + same VPN [RFC6513]. + + RD: Route Distinguisher. + + Regular-IR route: An EVPN Inclusive Multicast Ethernet Tag route + [RFC7432] that uses the ingress replication tunnel type. + + Replicator-AR route: An EVPN Inclusive Multicast Ethernet Tag route + that is advertised by an AR-REPLICATOR to signal its capabilities, + as described in Section 4. + + RNVE: Regular NVE. Refers to an NVE that supports the procedures + provided in [RFC8365] and does not support the procedures provided + in this document. However, this document defines procedures to + interoperate with RNVEs. + + ToR switch: Top-of-Rack switch. + + TS and VM: Tenant System and Virtual Machine. In this document, TSs + and VMs are the devices connected to the ACs of the PEs and NVEs. + + VNI: VXLAN Network Identifier. Used in VXLAN tunnels. + + VSID: Virtual Segment Identifier. Used in NVGRE tunnels. + + VXLAN: Virtual eXtensible Local Area Network [RFC7348]. + +3. Solution Requirements + + The ingress replication optimization solution specified in this + document meets the following requirements: + + a. The solution provides an ingress replication optimization for BM + traffic without the need for PIM while preserving the packet + order for unicast applications, i.e., unknown unicast traffic + should follow the same path as known unicast traffic. This + optimization is required in low-performance NVEs. + + b. The solution reduces the flooded traffic in NVO networks where + some NVEs do not need broadcast/multicast and/or unknown unicast + traffic. + + c. The solution is compatible with [RFC7432] and [RFC8365] and has + no impact on the Customer Edge (CE) procedures for BM traffic. + In particular, the solution supports the following EVPN + functions: + + * All-active multihoming, including the split-horizon and DF + functions. + + * Single-active multihoming, including the DF function. + + * Handling of multi-destination traffic and processing of BM + traffic as per [RFC7432]. + + d. The solution is backward compatible with existing NVEs using a + non-optimized version of ingress replication. A given BD can + have NVEs/PEs supporting regular ingress replication and + optimized ingress replication. + + e. The solution is independent of the NVO-specific data plane + encapsulation and the virtual identifiers being used, e.g., VXLAN + VNIs, NVGRE VSIDs, or MPLS labels, as long as the tunnel is IP + based. + +4. EVPN BGP Attributes for Optimized Ingress Replication + + The ingress replication optimization solution specified in this + document extends the Inclusive Multicast Ethernet Tag routes and + attributes described in [RFC7432] so that an NVE/PE can signal its + optimized ingress replication capabilities. + + The Network Layer Reachability Information (NLRI) of the Inclusive + Multicast Ethernet Tag route [RFC7432] is shown in Figure 2 and is + used in this document without any modifications to its format. The + PMSI Tunnel Attribute's general format as provided in [RFC7432] + (which takes it from [RFC6514]) is used in this document; only a new + tunnel type and new flags are specified, as shown in Figure 3. + + +------------------------------------+ + | RD (8 octets) | + +------------------------------------+ + | Ethernet Tag ID (4 octets) | + +------------------------------------+ + | IP Address Length (1 octet) | + +------------------------------------+ + | Originating Router's IP Address | + | (4 or 16 octets) | + +------------------------------------+ + + Figure 2: EVPN Inclusive Multicast Ethernet Tag Route's NLRI + + 0 1 2 3 4 5 6 7 + +---------------------------------+ +--+--+--+--+--+--+--+--+ + | Flags (1 octet) | -> |x |E |x | T |BM|U |L | + +---------------------------------+ +--+--+--+--+--+--+--+--+ + | Tunnel Type (1 octet) | T = Assisted Replication Type + +---------------------------------+ BM = Broadcast and Multicast + | MPLS Label (3 octets) | U = Unknown (unknown unicast) + +---------------------------------+ x = unassigned + | Tunnel Identifier (variable) | + +---------------------------------+ + + Figure 3: PMSI Tunnel Attribute + + The Flags field in Figure 3 is 8 bits long as per [RFC7902]. The + Extension (E) flag was allocated by [RFC7902], and the Leaf + Information Required (L) flag was allocated by [RFC6514]. This + document defines the use of 4 bits of this Flags field: + + * Bits 3 and 4, which together form the Assisted Replication Type + (T) field + + * Bit 5, called the Broadcast and Multicast (BM) flag + + * Bit 6, called the Unknown (U) flag + + Bits 5 and 6 are collectively referred to as the Pruned Flooding + Lists (PFLs) flags. + + The T field and PFLs flags are defined as follows: + + * T is the Assisted Replication Type field (2 bits), which defines + the AR role of the advertising router: + + - 00 (decimal 0) = RNVE (non-AR support) + + - 01 (decimal 1) = AR-REPLICATOR + + - 10 (decimal 2) = AR-LEAF + + - 11 (decimal 3) = RESERVED + + * The PFLs flags define the desired behavior of the advertising + router for the different types of traffic: + + - Broadcast and Multicast (BM) flag. BM = 1 means "prune me from + the BM flooding list". BM = 0 indicates regular behavior. + + - Unknown (U) flag. U = 1 means "prune me from the Unknown + flooding list". U = 0 indicates regular behavior. + + * The L flag (bit 7) is defined in [RFC6514] and will be used only + in the selective AR solution. + + Please refer to Section 11 for the IANA considerations related to the + PMSI Tunnel Attribute flags. + + In this document, the above Inclusive Multicast Ethernet Tag route + (Figure 2) and PMSI Tunnel Attribute (Figure 3) can be used in two + different modes for the same BD: + + Regular-IR route: In this route, Originating Router's IP Address, + Tunnel Type (0x06), MPLS Label, and Tunnel Identifier MUST be used + as described in [RFC7432] when ingress replication is in use. The + NVE/PE that advertises the route will set the Next Hop to an IP + address that we denominate IR-IP in this document. When + advertised by an AR-LEAF node, the Regular-IR route MUST be + advertised with the T field set to 10 (AR-LEAF). + + Replicator-AR route: This route is used by the AR-REPLICATOR to + advertise its AR capabilities, with the fields set as follows: + + * Originating Router's IP Address MUST be set to an IP address of + the advertising router that is common to all the EVIs on the PE + (usually this is a loopback address of the PE). + + - The Tunnel Identifier and Next Hop fields SHOULD be set to + the same IP address as the Originating Router's IP Address + field when the NVE/PE originates the route -- that is, when + the NVE/PE is not an ASBR; see Section 10.2 of [RFC8365]. + Irrespective of the values in the Tunnel Identifier and + Originating Router's IP Address fields, the ingress NVE/PE + will process the received Replicator-AR route and will use + the IP address setting in the Next Hop field to create IP + tunnels to the AR-REPLICATOR. + + - The Next Hop address is referred to as the AR-IP and MUST be + different from the IR-IP for a given PE/NVE, unless the + procedures provided in Section 8 are followed. + + * Tunnel Type MUST be set to Assisted Replication Tunnel. + Section 11 provides the allocated type value. + + * T (Assisted Replication type) MUST be set to 01 (AR- + REPLICATOR). + + * L (Leaf Information Required) MUST be set to 0 for non- + selective AR and MUST be set to 1 for selective AR. + + An NVE/PE configured as an AR-REPLICATOR for a BD MUST advertise a + Replicator-AR route for the BD and MAY advertise a Regular-IR route. + The advertisement of the Replicator-AR route will indicate to the AR- + LEAFs which outer IP DA, i.e., which AR-IP, they need to use for IP- + encapsulated BM frames that use Assisted Replication forwarding mode. + The AR-REPLICATOR will forward an IP-encapsulated BM frame in + Assisted Replication forwarding mode if the outer IP DA matches its + AR-IP but will forward in Ingress Replication forwarding mode if the + outer IP DA matches its IR-IP. + + In addition, this document also uses the Leaf Auto-Discovery (Leaf + A-D) route defined in [RFC9572] in cases where the selective AR mode + is used. An AR-LEAF MAY send a Leaf A-D route in response to + reception of a Replicator-AR route whose L flag is set. The Leaf A-D + route is only used for selective AR, and the fields of such a route + are set as follows: + + * Originating Router's IP Address is set to the advertising router's + IP address (the same IP address used by the AR-LEAF in Regular-IR + routes). The Next Hop address is set to the IR-IP, which SHOULD + be the same IP address as the advertising router's IP address, + when the NVE/PE originates the route, i.e., when the NVE/PE is not + an ASBR; see Section 10.2 of [RFC8365]. + + * Route Key [RFC9572] is the "Route Type Specific" NLRI of the + Replicator-AR route for which this Leaf A-D route is generated. + + * The AR-LEAF constructs an IP-address-specific Route Target, + analogously to [RFC9572], by placing the IP address carried in the + Next Hop field of the received Replicator-AR route in the Global + Administrator field of the extended community, with the Local + Administrator field of this extended community set to 0, and + setting the Extended Communities attribute of the Leaf A-D route + to that extended community. The same IP-address-specific import + Route Target is auto-configured by the AR-REPLICATOR that sent the + Replicator-AR route, in order to control the acceptance of the + Leaf A-D routes. + + * The Leaf A-D route MUST include the PMSI Tunnel Attribute with + Tunnel Type set to Assisted Replication Tunnel (Section 11), T + (Assisted Replication type) set to AR-LEAF, and Tunnel Identifier + set to the IP address of the advertising AR-LEAF. The PMSI Tunnel + Attribute MUST carry a downstream-assigned MPLS label or VNI that + is used by the AR-REPLICATOR to send traffic to the AR-LEAF. + + Each AR-enabled node understands and processes the T (Assisted + Replication type) field in the PMSI Tunnel Attribute (Flags field) of + the routes and MUST signal the corresponding type (AR-REPLICATOR or + AR-LEAF type) according to its administrative choice. An NVE/PE + following this specification is not expected to set the Assisted + Replication Type field to decimal 3 (which is a RESERVED value). If + a route with the Assisted Replication Type field set to decimal 3 is + received by an AR-REPLICATOR or AR-LEAF, the router will process the + route as a Regular-IR route advertised by an RNVE. + + Each node attached to the BD may understand and process the BM/U + flags (PFLs flags). Note that these BM/U flags may be used to + optimize the delivery of multi-destination traffic; their use SHOULD + be an administrative choice and independent of the AR role. When the + PFL capability is enabled, the BM/U flags can be used with the + Regular-IR, Replicator-AR, and Leaf A-D routes. + + Non-optimized ingress replication NVEs/PEs will be unaware of the new + PMSI Tunnel Attribute flag definition as well as the new tunnel type + (AR), i.e., non-upgraded NVEs/PEs will ignore the information + contained in the Flags field or an unknown tunnel type (type AR in + this case) for any Inclusive Multicast Ethernet Tag route. + +5. Non-selective Assisted Replication (AR) Solution Description + + Figure 4 illustrates an example NVO network where the non-selective + AR function is enabled. Three different roles are defined for a + given BD: AR-REPLICATOR, AR-LEAF, and RNVE. The solution is called + "non-selective" because the chosen AR-REPLICATOR for a given flow + MUST replicate the BM traffic to all the NVEs/PEs in the BD except + for the source NVE/PE. NVO tunnels, i.e., IP tunnels, exist among + all the PEs and NVEs in the diagram. The PEs and NVEs in the diagram + have TSs or VMs connected to their ACs. + + ( ) + (_ WAN _) + +---(_ _)----+ + | (_ _) | + PE1 | PE2 | + +------+----+ +----+------+ + TS1--+ (BD-1) | | (BD-1) +--TS2 + |REPLICATOR | |REPLICATOR | + +--------+--+ +--+--------+ + | | + +--+----------------+--+ + | | + | | + +----+ VXLAN/NVGRE/MPLSoGRE +----+ + | | IP Fabric | | + | | | | + NVE1 | +-----------+----------+ | NVE3 + Hypervisor| ToR | NVE2 |Hypervisor + +---------+-+ +-----+-----+ +-+---------+ + | (BD-1) | | (BD-1) | | (BD-1) | + | LEAF | | RNVE | | LEAF | + +--+-----+--+ +--+-----+--+ +--+-----+--+ + | | | | | | + VM11 VM12 TS3 TS4 VM31 VM32 + + Figure 4: Non-selective AR Scenario + + In AR BDs, such as BD-1 in Figure 4, BM traffic between two NVEs may + follow a different path than unicast traffic. This solution + recommends the replication of BM traffic through the AR-REPLICATOR + node, whereas unknown/known unicast traffic will be delivered + directly from the source node to the destination node without being + replicated by any intermediate node. + + Note that known unicast forwarding is not impacted by this solution, + i.e., unknown unicast traffic SHALL follow the same path as known + unicast traffic. + +5.1. Non-selective AR-REPLICATOR Procedures + + An AR-REPLICATOR is defined as an NVE/PE capable of replicating + incoming BM traffic received on an overlay tunnel to other overlay + tunnels and local ACs. The AR-REPLICATOR signals its role in the + control plane and understands where the other roles (AR-LEAF nodes, + RNVEs, and other AR-REPLICATORs) are located. A given AR-enabled BD + service may have zero, one, or more AR-REPLICATORs. In our example + in Figure 4, PE1 and PE2 are defined as AR-REPLICATORs. The + following considerations apply to the AR-REPLICATOR role: + + a. The AR-REPLICATOR role SHOULD be an administrative choice in any + NVE/PE that is part of an AR-enabled BD. This administrative + option to enable AR-REPLICATOR capabilities MAY be implemented as + a system-level option as opposed to a per-BD option. + + b. An AR-REPLICATOR MUST advertise a Replicator-AR route and MAY + advertise a Regular-IR route. The AR-REPLICATOR MUST NOT + generate a Regular-IR route if it does not have local ACs. If + the Regular-IR route is advertised, the Assisted Replication Type + field of the Regular-IR route MUST be set to 0. + + c. The Replicator-AR and Regular-IR routes are generated according + to Section 4. The AR-IP and IR-IP are different IP addresses + owned by the AR-REPLICATOR. + + d. When a node defined as an AR-REPLICATOR receives a BM packet on + an overlay tunnel, it will do a tunnel destination IP address + lookup and apply the following procedures: + + * If the destination IP address is the AR-REPLICATOR IR-IP + address, the node will process the packet normally as + discussed in [RFC7432]. + + * If the destination IP address is the AR-REPLICATOR AR-IP + address, the node MUST replicate the packet to local ACs and + overlay tunnels (excluding the overlay tunnel to the source of + the packet). When replicating to remote AR-REPLICATORs, the + tunnel destination IP address will be an IR-IP. This will + indicate to the remote AR-REPLICATOR that it MUST NOT + replicate to overlay tunnels. The tunnel source IP address + used by the AR-REPLICATOR MUST be its IR-IP when replicating + to AR-REPLICATOR or AR-LEAF nodes. + + An AR-REPLICATOR MUST follow a data path implementation compatible + with the following rules: + + * The AR-REPLICATORs will build a flooding list composed of ACs and + overlay tunnels to remote nodes in the BD. Some of those overlay + tunnels MAY be flagged as non-BM receivers based on the BM flag + received from the remote nodes in the BD. + + * When an AR-REPLICATOR receives a BM packet on an AC, it will + forward the BM packet to its flooding list (including local ACs + and remote NVEs/PEs), skipping the non-BM overlay tunnels. + + * When an AR-REPLICATOR receives a BM packet on an overlay tunnel, + it will check the destination IP address of the underlay IP header + and: + + - If the destination IP address matches its IR-IP, the AR- + REPLICATOR will skip all the overlay tunnels from the flooding + list, i.e., it will only replicate to local ACs. This is the + regular ingress replication behavior described in [RFC7432]. + + - If the destination IP address matches its AR-IP, the AR- + REPLICATOR MUST forward the BM packet to its flooding list (ACs + and overlay tunnels), excluding the non-BM overlay tunnels. + The AR-REPLICATOR will ensure that the traffic is not sent back + to the originating AR-LEAF. + + - If the encapsulation is MPLSoGRE or MPLSoUDP and the received + BD label that the AR-REPLICATOR advertised in the Replicator-AR + route is not at the bottom of the stack, the AR-REPLICATOR MUST + copy all the labels below the BD label and propagate them when + forwarding the packet to the egress overlay tunnels. + + * The AR-REPLICATOR/LEAF nodes will build an unknown unicast + flooding list composed of ACs and overlay tunnels to the IR-IP + addresses of the remote nodes in the BD. Some of those overlay + tunnels MAY be flagged as non-U (unknown unicast) receivers based + on the U flag received from the remote nodes in the BD. + + - When an AR-REPLICATOR/LEAF receives an unknown unicast packet + on an AC, it will forward the unknown unicast packet to its + flooding list, skipping the non-U overlay tunnels. + + - When an AR-REPLICATOR/LEAF receives an unknown unicast packet + on an overlay tunnel, it will forward the unknown unicast + packet to its local ACs and never to an overlay tunnel. This + is the regular ingress replication behavior described in + [RFC7432]. + +5.2. Non-selective AR-LEAF Procedures + + An AR-LEAF is defined as an NVE/PE that, given its poor replication + performance, sends all the BM traffic to an AR-REPLICATOR that can + replicate the traffic further on its behalf. It MAY signal its AR- + LEAF capability in the control plane and understands where the other + roles are located (AR-REPLICATORs and RNVEs). A given service can + have zero, one, or more AR-LEAF nodes. In Figure 4, NVE1 and NVE3 + (both residing in hypervisors) act as AR-LEAF nodes. The following + considerations apply to the AR-LEAF role: + + a. The AR-LEAF role SHOULD be an administrative choice in any NVE/PE + that is part of an AR-enabled BD. This administrative option to + enable AR-LEAF capabilities MAY be implemented as a system-level + option as opposed to a per-BD option. + + b. In this non-selective AR solution, the AR-LEAF MUST advertise a + single Regular-IR Inclusive Multicast Ethernet Tag route as + described in [RFC7432]. The AR-LEAF SHOULD set the Assisted + Replication Type field to AR-LEAF. Note that although this field + does not affect the remote nodes when creating an EVPN + destination to the AR-LEAF, this field is useful from the + standpoint of ease of operation and troubleshooting of the BD. + + c. In a BD where there are no AR-REPLICATORs due to the AR- + REPLICATORs being down or reconfigured, the AR-LEAF MUST use + regular ingress replication based on the remote Regular-IR + Inclusive Multicast Ethernet Tag routes as described in + [RFC7432]. This may happen in the following cases: + + * The AR-LEAF has a list of AR-REPLICATORs for the BD, but it + detects that all the AR-REPLICATORs for the BD are down (via + next-hop tracking in the IGP or some other detection + mechanism). + + * The AR-LEAF receives updates from all the former AR- + REPLICATORs containing a non-REPLICATOR AR type in the + Inclusive Multicast Ethernet Tag routes. + + * The AR-LEAF never discovered an AR-REPLICATOR for the BD. + + d. In a service where there are one or more AR-REPLICATORs (based on + the received Replicator-AR routes for the BD), the AR-LEAF can + locally select which AR-REPLICATOR it sends the BM traffic to: + + * A single AR-REPLICATOR MAY be selected for all the BM packets + received on the AR-LEAF ACs for a given BD. This selection is + a local decision and does not have to match other AR-LEAFs' + selections within the same BD. + + * An AR-LEAF MAY select more than one AR-REPLICATOR and do + either per-flow or per-BD load balancing. + + * In the case of failure of the selected AR-REPLICATOR, another + AR-REPLICATOR SHOULD be selected by the AR-LEAF. + + * When an AR-REPLICATOR is selected for a given flow or BD, the + AR-LEAF MUST send all the BM packets targeted to that AR- + REPLICATOR using the forwarding information given by the + Replicator-AR route for the chosen AR-REPLICATOR, with Tunnel + Type = 0x0A (AR tunnel). The underlay destination IP address + MUST be the AR-IP advertised by the AR-REPLICATOR in the + Replicator-AR route. + + * An AR-LEAF MAY change the selection of AR-REPLICATOR(s) + dynamically due to an administrative or policy configuration + change. + + * AR-LEAF nodes SHALL send service-level BM control plane + packets, following the procedures for regular ingress + replication. An example would be IGMP, Multicast Listener + Discovery (MLD), or PIM packets, and, in general, any packets + using link-local scope multicast IPv4 or IPv6 packets. The + AR-REPLICATORs MUST NOT replicate these control plane packets + to other overlay tunnels, since they will use the IR-IP + address. + + e. The use of an AR-REPLICATOR-activation-timer (in seconds, with a + default value of 3) on the AR-LEAF nodes is RECOMMENDED. Upon + receiving a new Replicator-AR route where the AR-REPLICATOR is + selected, the AR-LEAF will run a timer before programming the new + AR-REPLICATOR. In the case of a newly added AR-REPLICATOR or if + an AR-REPLICATOR reboots, this timer will give the AR-REPLICATOR + some time to program the AR-LEAF nodes before the AR-LEAF sends + BM traffic. The AR-REPLICATOR-activation-timer SHOULD be + configurable in seconds, and its value needs to account for the + time it takes for the AR-LEAF Regular-IR Inclusive Multicast + Ethernet Tag route to get to the AR-REPLICATOR and be programmed. + While the AR-REPLICATOR-activation-timer is running, the AR-LEAF + node will use regular ingress replication. + + f. If the AR-LEAF has selected an AR-REPLICATOR, whether or not to + change to a new preferred AR-REPLICATOR for the existing BM + traffic flows is a matter of local policy. + + An AR-LEAF MUST follow a data path implementation compatible with the + following rules: + + * The AR-LEAF nodes will build two flooding lists: + + Flooding list #1: Composed of ACs and an AR-REPLICATOR-set of + overlay tunnels. The AR-REPLICATOR-set is defined as one or + more overlay tunnels to the AR-IP addresses of the remote AR- + REPLICATOR(s) in the BD. The selection of more than one AR- + REPLICATOR is described in item d. above and is a local AR-LEAF + decision. + + Flooding list #2: Composed of ACs and overlay tunnels to the + remote IR-IP addresses. + + * When an AR-LEAF receives a BM packet on an AC, it will check the + AR-REPLICATOR-set: + + - If the AR-REPLICATOR-set is empty, the AR-LEAF MUST send the + packet to flooding list #2. + + - If the AR-REPLICATOR-set is NOT empty, the AR-LEAF MUST send + the packet to flooding list #1, where only one of the overlay + tunnels of the AR-REPLICATOR-set is used. + + * When an AR-LEAF receives a BM packet on an overlay tunnel, it will + forward the BM packet to its local ACs and never to an overlay + tunnel. This is the regular ingress replication behavior + described in [RFC7432]. + + * AR-LEAF nodes process unknown unicast traffic in the same way AR- + REPLICATORS do, as described in Section 5.1. + +5.3. RNVE Procedures + + An RNVE is defined as an NVE/PE without AR-REPLICATOR or AR-LEAF + capabilities that does ingress replication as described in [RFC7432]. + The RNVE does not signal any AR role and is unaware of the AR- + REPLICATOR/LEAF roles in the BD. The RNVE will ignore the flags in + the Regular-IR routes and will ignore the Replicator-AR routes (due + to an unknown tunnel type in the PMSI Tunnel Attribute) and the Leaf + A-D routes (due to the IP-address-specific Route Target). + + This role provides EVPNs with the backward compatibility required in + optimized ingress replication BDs. In Figure 4, NVE2 acts as an + RNVE. + +6. Selective Assisted Replication (AR) Solution Description + + Figure 5 is used to describe the selective AR solution. + + ( ) + (_ WAN _) + +---(_ _)----+ + | (_ _) | + PE1 | PE2 | + +------+----+ +----+------+ + TS1--+ (BD-1) | | (BD-1) +--TS2 + |REPLICATOR | |REPLICATOR | + +--------+--+ +--+--------+ + | | + +--+----------------+--+ + | | + | | + +----+ VXLAN/NVGRE/MPLSoGRE +----+ + | | IP Fabric | | + | | | | + NVE1 | +-----------+----------+ | NVE3 + Hypervisor| ToR | NVE2 |Hypervisor + +---------+-+ +-----+-----+ +-+---------+ + | (BD-1) | | (BD-1) | | (BD-1) | + |LEAF-set-1 | |LEAF-set-1 | |LEAF-set-2 | + +--+-----+--+ +--+-----+--+ +--+-----+--+ + | | | | | | + VM11 VM12 TS3 TS4 VM31 VM32 + + Figure 5: Selective AR Scenario + + The solution is called "selective" because a given AR-REPLICATOR MUST + replicate the BM traffic to only the AR-LEAFs that requested the + replication (as opposed to all the AR-LEAF nodes) and MUST replicate + the BM traffic to the RNVEs (if there are any). The same AR roles as + those defined in Sections 4 and 5 are used here; however, the + procedures are different. + + The selective AR procedures create multiple AR-LEAF-sets in the EVPN + BD and build single-hop trees among AR-LEAFs of the same set (AR- + LEAF->AR-REPLICATOR->AR-LEAF) and two-hop trees among AR-LEAFs of + different sets (AR-LEAF->AR-REPLICATOR->AR-REPLICATOR->AR-LEAF). + Compared to the selective solution, the non-selective AR method + assumes that all the AR-LEAFs of the BD are in the same set and + always creates single-hop trees among AR-LEAFs. While the selective + solution is more efficient than the non-selective solution in multi- + stage IP fabrics, the trade-off is additional signaling and an + additional outer source IP address lookup. + + The following subsections describe the differences in the procedures + for AR-REPLICATORs/LEAFs compared to the non-selective AR solution. + There are no changes applicable to RNVEs. + +6.1. Selective AR-REPLICATOR Procedures + + In our example in Figure 5, PE1 and PE2 are defined as selective AR- + REPLICATORs. The following considerations apply to the selective AR- + REPLICATOR role: + + a. The selective AR-REPLICATOR role SHOULD be an administrative + choice in any NVE/PE that is part of an AR-enabled BD. This + administrative option MAY be implemented as a system-level option + as opposed to a per-BD option. + + b. Each AR-REPLICATOR will build a list of AR-REPLICATOR, AR-LEAF, + and RNVE nodes. In spite of the "selective" administrative + option, an AR-REPLICATOR MUST NOT behave as a selective AR- + REPLICATOR if at least one of the AR-REPLICATORs has the L flag + NOT set. If at least one AR-REPLICATOR sends a Replicator-AR + route with L = 0 (in the BD context), the rest of the AR- + REPLICATORs will fall back to non-selective AR mode. + + c. The selective AR-REPLICATOR MUST follow the procedures described + in Section 5.1, except for the following differences: + + * The AR-REPLICATOR MUST have the L flag set to 1 when + advertising the Replicator-AR route. This flag is used by the + AR-REPLICATORs to advertise their "selective" AR-REPLICATOR + capabilities. In addition, the AR-REPLICATOR auto-configures + its IP-address-specific import Route Target as described in + the third bullet of the procedures for Leaf A-D routes in + Section 4. + + * The AR-REPLICATOR will build a "selective" AR-LEAF-set with + the list of nodes that requested replication to its own AR-IP. + For instance, assuming that NVE1 and NVE2 advertise a Leaf A-D + route with PE1's IP-address-specific Route Target and NVE3 + advertises a Leaf A-D route with PE2's IP-address-specific + Route Target, PE1 will only add NVE1/NVE2 to its selective AR- + LEAF-set for BD-1 and exclude NVE3. Likewise, PE2 will only + add NVE3 to its selective AR-LEAF-set for BD-1 and exclude + NVE1/NVE2. + + * When a node defined and operating as a selective AR-REPLICATOR + receives a packet on an overlay tunnel, it will do a tunnel + destination IP lookup, and if the destination IP address is + the AR-REPLICATOR AR-IP address, the node MUST replicate the + packet to: + + - Local ACs. + + - Overlay tunnels in the selective AR-LEAF-set, excluding the + overlay tunnel to the source AR-LEAF. + + - Overlay tunnels to the RNVEs if the tunnel source IP + address is the IR-IP of an AR-LEAF. In any other case, the + AR-REPLICATOR MUST NOT replicate the BM traffic to remote + RNVEs. In other words, only the first-hop selective AR- + REPLICATOR will replicate to all the RNVEs. + + - Overlay tunnels to the remote selective AR-REPLICATORs if + the tunnel source IP address (of the encapsulated packet + that arrived on the overlay tunnel) is an IR-IP of its own + AR-LEAF-set. In any other case, the AR-REPLICATOR MUST NOT + replicate the BM traffic to remote AR-REPLICATORs. When + doing this replication, the tunnel destination IP address + is the AR-IP of the remote selective AR-REPLICATOR. The + tunnel destination address AR-IP will indicate to the + remote selective AR-REPLICATOR that the packet needs + further replication to its AR-LEAFs. + + A selective AR-REPLICATOR data path implementation MUST be compatible + with the following rules: + + * The selective AR-REPLICATORs will build two flooding lists: + + Flooding list #1: Composed of ACs and overlay tunnels to the + remote nodes in the BD, always using the IR-IPs in the tunnel + destination IP addresses. + + Flooding list #2: Composed of ACs, a selective AR-LEAF-set, and a + selective AR-REPLICATOR-set, where: + + - The selective AR-LEAF-set is composed of the overlay tunnels + to the AR-LEAFs that advertise a Leaf A-D route for the + local AR-REPLICATOR. This set is updated with every Leaf + A-D route received/withdrawn from a new AR-LEAF. + + - The selective AR-REPLICATOR-set is composed of the overlay + tunnels to all the AR-REPLICATORs that send a Replicator-AR + route with L = 1. The AR-IP addresses are used as tunnel + destination IP addresses. + + * Some of the overlay tunnels in the flooding lists MAY be flagged + as non-BM receivers based on the BM flag received from the remote + nodes in the routes. + + * When a selective AR-REPLICATOR receives a BM packet on an AC, it + MUST forward the BM packet to its flooding list #1, skipping the + non-BM overlay tunnels. + + * When a selective AR-REPLICATOR receives a BM packet on an overlay + tunnel, it will check the destination and source IPs of the + underlay IP header and: + + - If the destination IP address matches its AR-IP and the source + IP address matches an IP of its own selective AR-LEAF-set, the + AR-REPLICATOR MUST forward the BM packet to its flooding list + #2, unless some AR-REPLICATOR within the BD has advertised L = + 0. In the latter case, the node reverts to Non-selective mode, + and flooding list #1 MUST be used. Non-BM overlay tunnels are + skipped when sending BM packets. + + - If the destination IP address matches its AR-IP and the source + IP address does not match any IP address of its selective AR- + LEAF-set, the AR-REPLICATOR MUST forward the BM packet to + flooding list #2, skipping the AR-REPLICATOR-set. Non-BM + overlay tunnels are skipped when sending BM packets. + + - If the destination IP address matches its IR-IP, the AR- + REPLICATOR MUST use flooding list #1 but MUST skip all the + overlay tunnels from the flooding list, i.e., it will only + replicate to local ACs. This is the regular ingress + replication behavior described in [RFC7432]. Non-BM overlay + tunnels are skipped when sending BM packets. + + * In any case, the AR-REPLICATOR ensures that the traffic is not + sent back to the originating source. If the encapsulation is + MPLSoGRE or MPLSoUDP and the received BD label (the label that the + AR-REPLICATOR advertised in the Replicator-AR route) is not at the + bottom of the stack, the AR-REPLICATOR MUST copy the rest of the + labels when forwarding them to the egress overlay tunnels. + +6.2. Selective AR-LEAF Procedures + + A selective AR-LEAF chooses a single selective AR-REPLICATOR per BD + and: + + * Sends all the BD's BM traffic to that AR-REPLICATOR and + + * Expects to receive all the BM traffic for a given BD from the same + AR-REPLICATOR (except for the BM traffic from the RNVEs, which + comes directly from the RNVEs) + + In the example in Figure 5, we consider NVE1/NVE2/NVE3 as selective + AR-LEAFs. NVE1 selects PE1 as its selective AR-REPLICATOR. If that + is so, NVE1 will send all its BM traffic for BD-1 to PE1. If other + AR-LEAFs/REPLICATORs send BM traffic, NVE1 will receive that traffic + from PE1. A selective AR-LEAF and a non-selective AR-LEAF behave + differently, as follows: + + a. The selective AR-LEAF role SHOULD be an administrative choice in + any NVE/PE that is part of an AR-enabled BD. This administrative + option to enable AR-LEAF capabilities MAY be implemented as a + system-level option as opposed to a per-BD option. + + b. The AR-LEAF MAY advertise a Regular-IR route if there are RNVEs + in the BD. The selective AR-LEAF MUST advertise a Leaf A-D route + after receiving a Replicator-AR route with L = 1. It is + RECOMMENDED that the selective AR-LEAF wait for a period + specified by an AR-LEAF-join-wait-timer (in seconds, with a + default value of 3) before sending the Leaf A-D route, so that + the AR-LEAF can collect all the Replicator-AR routes for the BD + before advertising the Leaf A-D route. If the Replicator-AR + route with L = 1 is withdrawn, the corresponding Leaf A-D route + is withdrawn too. + + c. In a service where there is more than one selective AR- + REPLICATOR, the selective AR-LEAF MUST locally select a single + selective AR-REPLICATOR for the BD. Once selected: + + * The selective AR-LEAF MUST send a Leaf A-D route, including + the route key and IP-address-specific Route Target of the + selected AR-REPLICATOR. + + * The selective AR-LEAF MUST send all the BM packets received on + the ACs for a given BD to that AR-REPLICATOR. + + * In the case of failure of the selected AR-REPLICATOR (detected + when the Replicator-AR route becomes infeasible as a result of + any of the underlying BGP mechanisms), another AR-REPLICATOR + will be selected and a new Leaf A-D update will be issued for + the new AR-REPLICATOR. This new route will update the + selective list in the new selective AR-REPLICATOR. In the + case of failure of the active selective AR-REPLICATOR, it is + RECOMMENDED that the selective AR-LEAF revert to ingress + replication behavior for an AR-REPLICATOR-activation-timer (in + seconds, with a default value of 3) to mitigate the traffic + impact. When the timer expires, the selective AR-LEAF will + resume its AR mode with the new selective AR-REPLICATOR. The + AR-REPLICATOR-activation-timer MAY be the same configurable + parameter as the parameter discussed in Section 5.2. + + * A selective AR-LEAF MAY change the selection of AR- + REPLICATOR(s) dynamically due to an administrative or policy + configuration change. + + All the AR-LEAFs in a BD are expected to be configured as either + selective or non-selective. A mix of selective and non-selective AR- + LEAFs SHOULD NOT coexist in the same BD. If a non-selective AR-LEAF + is present, its BM traffic sent to a selective AR-REPLICATOR will not + be replicated to other AR-LEAFs that are not in its selective AR- + LEAF-set. + + A selective AR-LEAF MUST follow a data path implementation compatible + with the following rules: + + * The selective AR-LEAF nodes will build two flooding lists: + + Flooding list #1: Composed of ACs and the overlay tunnel to the + selected AR-REPLICATOR (using the AR-IP as the tunnel + destination IP address). + + Flooding list #2: Composed of ACs and overlay tunnels to the + remote IR-IP addresses. + + * Some of the overlay tunnels in the flooding lists MAY be flagged + as non-BM receivers based on the BM flag received from the remote + nodes in the routes. + + * When an AR-LEAF receives a BM packet on an AC, it will check to + see if an AR-REPLICATOR was selected; if one is found, flooding + list #1 MUST be used. Otherwise, flooding list #2 MUST be used. + Non-BM overlay tunnels are skipped when sending BM packets. + + * When an AR-LEAF receives a BM packet on an overlay tunnel, it MUST + forward the BM packet to its local ACs and never to an overlay + tunnel. This is the regular ingress replication behavior + described in [RFC7432]. + +7. Pruned Flooding Lists (PFLs) + + In addition to AR, the second optimization supported by the ingress + replication optimization solution specified in this document is the + ability of all the BD nodes to signal PFLs. As described in + Section 4, an EVPN node can signal a given value for the BM and U + PFLs flags in the Regular-IR, Replicator-AR, or Leaf A-D routes, + where: + + * BM is the Broadcast and Multicast flag. BM = 1 means "prune me + from the BM flooding list". BM = 0 indicates regular behavior. + + * U is the Unknown flag. U = 1 means "prune me from the Unknown + flooding list". U = 0 indicates regular behavior. + + The ability to signal and process these PFLs flags SHOULD be an + administrative choice. If a node is configured to process the PFLs + flags, upon receiving a non-zero PFLs flag for a route, an NVE/PE + will add the corresponding flag to the created overlay tunnel in the + flooding list. When replicating a BM packet in the context of a + flooding list, the NVE/PE will skip the overlay tunnels marked with + the flag BM = 1, since the NVEs/PEs at the end of those tunnels are + not expecting BM packets. Similarly, when replicating unknown + unicast packets, the NVE/PE will skip the overlay tunnels marked with + U = 1. + + An NVE/PE not following this document or not configured for this + optimization will ignore any of the received PFLs flags. An AR-LEAF + or RNVE receiving BUM traffic on an overlay tunnel MUST replicate the + traffic to its local ACs, regardless of the BM/U flags on the overlay + tunnels. + + This optimization MAY be used along with the Assisted Replication + solution. + +7.1. Example of a Pruned Flooding List + + In order to illustrate the use of the PFLs solution, we will assume + that BD-1 in Figure 4 is optimized ingress replication enabled and: + + * PE1 and PE2 are administratively configured as AR-REPLICATORs due + to their high-performance replication capabilities. PE1 and PE2 + will send a Replicator-AR route with BM/U flags = 00. + + * NVE1 and NVE3 are administratively configured as AR-LEAF nodes due + to their low-performance software-based replication capabilities. + They will advertise a Regular-IR route with type AR-LEAF. + Assuming that both NVEs advertise all of the attached VMs' MAC and + IP addresses in EVPNs as soon as they come up and these NVEs do + not have any VMs interested in multicast applications, they will + be configured to signal BM/U flags = 11 for BD-1. That is, + neither NVE1 nor NVE3 is interested in receiving BM or unknown + unicast traffic, since: + + - Their attached VMs (VM11, VM12, VM31, VM32) do not support + multicast applications. + + - Their attached VMs will not receive ARP Requests. Proxy ARP + [RFC9161] on the remote NVEs/PEs will reply to ARP Requests + locally, and no other broadcast traffic is expected. + + - Their attached VMs will not receive unknown unicast traffic, + since the VMs' MAC and IP addresses are always advertised by + EVPNs as long as the VMs are active. + + * NVE2 is optimized ingress replication unaware; therefore, it takes + on the RNVE role in BD-1. + + Based on the above assumptions, the following forwarding behavior + will take place: + + 1. Any BM packets sent from VM11 will be sent to VM12 and PE1. PE1 + will then forward the BM packets on to TS1, the WAN link, PE2, + and NVE2 but not to NVE3. PE2 and NVE2 will replicate the BM + packets to their local ACs, but NVE3 will be prevented from + having to replicate those BM packets to VM31 and VM32 + unnecessarily. + + 2. Any BM packets received on PE2 from the WAN will be sent to PE1 + and NVE2 but not to NVE1 and NVE3, sparing the two hypervisors + from replicating unnecessarily to their local VMs. PE1 and NVE2 + will replicate to their local ACs only. + + 3. Any unknown unicast packet sent from VM31 will be forwarded by + NVE3 to NVE2, PE1, and PE2 but not to NVE1. The solution + prevents unnecessary replication to NVE1, since the destination + of the unknown traffic cannot be NVE1. + + 4. Any unknown unicast packet sent from TS1 will be forwarded by PE1 + to the WAN link, PE2, and NVE2 but not to NVE1 and NVE3, since + the target of the unknown traffic cannot be NVE1 or NVE3. + +8. AR Procedures for Single-IP AR-REPLICATORS + + The procedures explained in Sections 5 and 6 assume that the AR- + REPLICATOR can use two local routable IP addresses to terminate and + originate NVO tunnels, i.e., IR-IP and AR-IP addresses. This is + usually the case for PE-based AR-REPLICATOR nodes. + + In some cases, the AR-REPLICATOR node does not support more than one + IP address to terminate and originate NVO tunnels, i.e., the IR-IP + and AR-IP are the same IP addresses. This may be the case in some + software-based or low-end AR-REPLICATOR nodes. If this is the case, + the procedures provided in Sections 5 and 6 MUST be modified in the + following way: + + * The Replicator-AR routes generated by the AR-REPLICATOR use an AR- + IP that will match its IR-IP. In order to differentiate the data + plane packets that need to use ingress replication from the + packets that must use Assisted Replication forwarding mode, the + Replicator-AR route MUST advertise a different VNI/VSID than the + one used by the Regular-IR route. For instance, the AR-REPLICATOR + will advertise an AR-VNI along with the Replicator-AR route and an + IR-VNI along with the Regular-IR route. Since both routes have + the same key, different Route Distinguishers are needed in each + route. + + * An AR-REPLICATOR will perform Ingress Replication forwarding mode + or Assisted Replication forwarding mode for the incoming overlay + packets based on an ingress VNI lookup as opposed to the tunnel IP + DA lookup. Note that when replicating to remote AR-REPLICATOR + nodes, the use of the IR-VNI or AR-VNI advertised by the egress + node will determine whether Ingress Replication forwarding mode or + Assisted Replication forwarding mode is used at the subsequent AR- + REPLICATOR. + + The rest of the procedures will follow those described in Sections 5 + and 6. + +9. AR Procedures and EVPN All-Active Multihoming Split-Horizon + + This section extends the procedures for the cases where two or more + AR-LEAF nodes are attached to the same ES and two or more AR- + REPLICATOR nodes are attached to the same ES in the BD. The mixed + case -- where an AR-LEAF node and an AR-REPLICATOR node are attached + to the same ES -- would require extended procedures that are out of + scope for this document. + +9.1. Ethernet Segments on AR-LEAF Nodes + + If a VXLAN or NVGRE is used and if the split-horizon is based on the + tunnel source IP address and "local bias" as described in [RFC8365], + the split-horizon check will not work if an ES is shared between two + AR-LEAF nodes, and the AR-REPLICATOR replaces the tunnel source IP + address of the packets with its own AR-IP. + + In order to be compatible with the source IP address split-horizon + check, the AR-REPLICATOR MAY keep the original received tunnel source + IP address when replicating packets to a remote AR-LEAF or RNVE. + This will allow AR-LEAF nodes to apply split-horizon check procedures + for BM packets before sending them to the local ES. Even if the AR- + LEAF's source IP address is preserved when replicating to AR-LEAFs or + RNVEs, the AR-REPLICATOR MUST always use its IR-IP as the source IP + address when replicating to other AR-REPLICATORs. + + When EVPNs are used for MPLSoGRE or MPLSoUDP, the ESI-label-based + split-horizon procedure provided in [RFC7432] will not work for + multihomed ESs defined on AR-LEAF nodes. Local bias is recommended + in this case, as it is in the case of a VXLAN or NVGRE as explained + above. The local-bias and tunnel source IP address preservation + mechanisms provide the required split-horizon behavior in non- + selective or selective AR. + + Note that if the AR-REPLICATOR implementation keeps the received + tunnel source IP address, the use of unicast Reverse Path Forwarding + (uRPF) checks in the IP fabric based on the tunnel source IP address + MUST be disabled. + +9.2. Ethernet Segments on AR-REPLICATOR Nodes + + AR-REPLICATOR nodes attached to the same all-active ES will follow + local-bias procedures [RFC8365] as follows: + + a. For BUM traffic received on a local AR-REPLICATOR's AC, local- + bias procedures as provided in [RFC8365] MUST be followed. + + b. For BUM traffic received on an AR-REPLICATOR overlay tunnel with + AR-IP as the IP DA, local bias MUST also be followed. That is, + traffic received with AR-IP as the IP DA will be treated as + though it had been received on a local AC that is part of the ES + and will be forwarded to all local ESs, irrespective of their DF + or NDF state. + + c. BUM traffic received on an AR-REPLICATOR overlay tunnel with IR- + IP as the IP DA will follow regular local-bias rules [RFC8365] + and will not be forwarded to local ESs that are shared with the + AR-LEAF or AR-REPLICATOR originating the traffic. + + d. In cases where the AR-REPLICATOR supports a single IP address, + the IR-IP and the AR-IP are the same IP address, as discussed in + Section 8. The received BUM traffic will be treated as specified + in item b above if the received VNI is the AR-VNI and as + specified in item c if the VNI is the IR-VNI. + +10. Security Considerations + + The security considerations in [RFC7432] and [RFC8365] apply to this + document. The security considerations related to the Leaf A-D route + in [RFC9572] apply too. + + In addition, the Assisted Replication method introduced by this + document may introduce some new risks that could affect the + successful delivery of BM traffic. Unicast traffic is not affected + by Assisted Replication (although unknown unicast traffic is affected + by the procedures for PFLs). The forwarding of BM traffic is + modified, and BM traffic from the AR-LEAF nodes will be drawn toward + AR-REPLICATORs in the BD. An AR-LEAF will forward BM traffic to its + selected AR-REPLICATOR; therefore, an attack on the AR-REPLICATOR + could impact the delivery of the BM traffic using that node. Also, + an attack on the AR-REPLICATOR and any change to the advertised AR + type will modify the selections made by the AR-LEAF nodes. If no + other AR-REPLICATOR is selected, the AR-LEAF nodes will be forced to + use Ingress Replication forwarding mode, which will impact their + performance, since the AR-LEAF nodes are usually NVEs/PEs with poor + replication performance. + + This document introduces the ability of the AR-REPLICATOR to forward + traffic received on an overlay tunnel to another overlay tunnel. The + reader may determine that this introduces the risk of BM loops -- + that is, an AR-LEAF receiving a BM-encapsulated packet that the AR- + LEAF originated in the first place due to one or two AR-REPLICATORs + "looping" the BM traffic back to the AR-LEAF. Following the + procedures provided in this document will prevent these BM loops, + since the AR-REPLICATOR will always forward the BM traffic using the + correct tunnel IP DA (or the correct VNI in the case of single-IP AR- + REPLICATORs), which instructs the remote nodes regarding how to + forward the traffic. This is true for both the Non-selective and + Selective modes defined in this document. However, incorrect + implementation of the procedures provided in this document may lead + to those unexpected BM loops. + + The Selective mode provides a multi-stage replication solution, where + proper configuration of all the AR-REPLICATORs will prevent any + issues. A mix of mistakenly configured selective and non-selective + AR-REPLICATORs in the same BD could theoretically create packet + duplication in some AR-LEAFs; however, this document specifies a + fallback solution -- falling back to Non-selective mode in cases + where the AR-REPLICATORs advertised an inconsistent AR mode. + + This document allows the AR-REPLICATOR to preserve the tunnel source + IP address of the AR-LEAF (as an option) when forwarding BM packets + from an overlay tunnel to another overlay tunnel. Preserving the AR- + LEAF source IP address makes the local-bias filtering procedures + possible for AR-LEAF nodes that are attached to the same ES. If the + AR-REPLICATOR does not preserve the AR-LEAF source IP address, AR- + LEAF nodes attached to all-active ESs will cause packet duplication + on the multihomed CE. + + The AR-REPLICATOR nodes are, by design, using more bandwidth than PEs + [RFC7432] or NVEs [RFC8365] would use. Certain network events or + unexpected low performance may exceed the AR-REPLICATOR's local + bandwidth and cause service disruption. + + Finally, PFLs (Section 7) should be used with care. Intentional or + unintentional misconfiguration of the BDs on a given leaf node may + result in the leaf not receiving the required BM or unknown unicast + traffic. + +11. IANA Considerations + + IANA has allocated the following Border Gateway Protocol (BGP) + parameters: + + * Allocation in the "P-Multicast Service Interface Tunnel (PMSI + Tunnel) Tunnel Types" registry: + + +=======+=============================+===========+ + | Value | Meaning | Reference | + +=======+=============================+===========+ + | 0x0A | Assisted Replication Tunnel | RFC 9574 | + +-------+-----------------------------+-----------+ + + Table 1 + + * Allocations in the "P-Multicast Service Interface (PMSI) Tunnel + Attribute Flags" registry: + + +=======+===============================+===========+ + | Value | Name | Reference | + +=======+===============================+===========+ + | 3-4 | Assisted Replication Type (T) | RFC 9574 | + +-------+-------------------------------+-----------+ + | 5 | Broadcast and Multicast (BM) | RFC 9574 | + +-------+-------------------------------+-----------+ + | 6 | Unknown (U) | RFC 9574 | + +-------+-------------------------------+-----------+ + + Table 2 + +12. References + +12.1. Normative References + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + <https://www.rfc-editor.org/info/rfc2119>. + + [RFC6513] Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in MPLS/ + BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513, February + 2012, <https://www.rfc-editor.org/info/rfc6513>. + + [RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP + Encodings and Procedures for Multicast in MPLS/BGP IP + VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012, + <https://www.rfc-editor.org/info/rfc6514>. + + [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., + Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based + Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February + 2015, <https://www.rfc-editor.org/info/rfc7432>. + + [RFC7902] Rosen, E. and T. Morin, "Registry and Extensions for + P-Multicast Service Interface Tunnel Attribute Flags", + RFC 7902, DOI 10.17487/RFC7902, June 2016, + <https://www.rfc-editor.org/info/rfc7902>. + + [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC + 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, + May 2017, <https://www.rfc-editor.org/info/rfc8174>. + + [RFC8365] Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., + Uttaro, J., and W. Henderickx, "A Network Virtualization + Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, + DOI 10.17487/RFC8365, March 2018, + <https://www.rfc-editor.org/info/rfc8365>. + + [RFC9572] Zhang, Z., Lin, W., Rabadan, J., Patel, K., and A. + Sajassi, "Updates to EVPN Broadcast, Unknown Unicast, or + Multicast (BUM) Procedures", RFC 9572, + DOI 10.17487/RFC9572, May 2024, + <https://www.rfc-editor.org/info/rfc9572>. + +12.2. Informative References + + [RFC4023] Worster, T., Rekhter, Y., and E. Rosen, Ed., + "Encapsulating MPLS in IP or Generic Routing Encapsulation + (GRE)", RFC 4023, DOI 10.17487/RFC4023, March 2005, + <https://www.rfc-editor.org/info/rfc4023>. + + [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, + L., Sridhar, T., Bursell, M., and C. Wright, "Virtual + eXtensible Local Area Network (VXLAN): A Framework for + Overlaying Virtualized Layer 2 Networks over Layer 3 + Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, + <https://www.rfc-editor.org/info/rfc7348>. + + [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network + Virtualization Using Generic Routing Encapsulation", + RFC 7637, DOI 10.17487/RFC7637, September 2015, + <https://www.rfc-editor.org/info/rfc7637>. + + [RFC9161] Rabadan, J., Ed., Sathappan, S., Nagaraj, K., Hankins, G., + and T. King, "Operational Aspects of Proxy ARP/ND in + Ethernet Virtual Private Networks", RFC 9161, + DOI 10.17487/RFC9161, January 2022, + <https://www.rfc-editor.org/info/rfc9161>. + +Acknowledgements + + The authors would like to thank Neil Hart, David Motz, Dai Truong, + Thomas Morin, Jeffrey Zhang, Shankar Murthy, and Krzysztof Szarkowicz + for their valuable feedback and contributions. Also, thanks to John + Scudder for his thorough review, which improved the quality of the + document significantly. + +Contributors + + In addition to the authors listed on the front page, the following + people also contributed to this document and should be considered + coauthors: + + Wim Henderickx + Nokia + + + Kiran Nagaraj + Nokia + + + Ravi Shekhar + Juniper Networks + + + Nischal Sheth + Juniper Networks + + + Aldrin Isaac + Juniper + + + Mudassir Tufail + Citibank + + +Authors' Addresses + + Jorge Rabadan (editor) + Nokia + 777 Middlefield Road + Mountain View, CA 94043 + United States of America + Email: jorge.rabadan@nokia.com + + + Senthil Sathappan + Nokia + Email: senthil.sathappan@nokia.com + + + Wen Lin + Juniper Networks + Email: wlin@juniper.net + + + Mukul Katiyar + Versa Networks + Email: mukul@versa-networks.com + + + Ali Sajassi + Cisco Systems + Email: sajassi@cisco.com |