author     Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8670.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8670.txt')
-rw-r--r--  doc/rfc/rfc8670.txt  971
1 files changed, 971 insertions, 0 deletions
diff --git a/doc/rfc/rfc8670.txt b/doc/rfc/rfc8670.txt
new file mode 100644
index 0000000..4c6980f
--- /dev/null
+++ b/doc/rfc/rfc8670.txt
@@ -0,0 +1,971 @@
+
+
+
+
+Internet Engineering Task Force (IETF) C. Filsfils, Ed.
+Request for Comments: 8670 S. Previdi
+Category: Informational Cisco Systems, Inc.
+ISSN: 2070-1721 G. Dawra
+ LinkedIn
+ E. Aries
+ Arrcus, Inc.
+ P. Lapukhov
+ Facebook
+ December 2019
+
+
+ BGP Prefix Segment in Large-Scale Data Centers
+
+Abstract
+
+ This document describes the motivation for, and benefits of, applying
+ Segment Routing (SR) in BGP-based large-scale data centers. It
+ describes the design to deploy SR in those data centers for both the
+ MPLS and IPv6 data planes.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Not all documents
+ approved by the IESG are candidates for any level of Internet
+ Standard; see Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ https://www.rfc-editor.org/info/rfc8670.
+
+Copyright Notice
+
+ Copyright (c) 2019 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (https://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction
+ 2. Large-Scale Data-Center Network Design Summary
+ 2.1. Reference Design
+ 3. Some Open Problems in Large Data-Center Networks
+ 4. Applying Segment Routing in the DC with MPLS Data Plane
+ 4.1. BGP Prefix Segment (BGP Prefix-SID)
+ 4.2. EBGP Labeled Unicast (RFC 8277)
+ 4.2.1. Control Plane
+ 4.2.2. Data Plane
+ 4.2.3. Network Design Variation
+ 4.2.4. Global BGP Prefix Segment through the Fabric
+ 4.2.5. Incremental Deployments
+ 4.3. IBGP Labeled Unicast (RFC 8277)
+ 5. Applying Segment Routing in the DC with IPv6 Data Plane
+ 6. Communicating Path Information to the Host
+ 7. Additional Benefits
+ 7.1. MPLS Data Plane with Operational Simplicity
+ 7.2. Minimizing the FIB Table
+ 7.3. Egress Peer Engineering
+ 7.4. Anycast
+ 8. Preferred SRGB Allocation
+ 9. IANA Considerations
+ 10. Manageability Considerations
+ 11. Security Considerations
+ 12. References
+ 12.1. Normative References
+ 12.2. Informative References
+ Acknowledgements
+ Contributors
+ Authors' Addresses
+
+1. Introduction
+
+ Segment Routing (SR), as described in [RFC8402], leverages the
+ source-routing paradigm. A node steers a packet through an ordered
+ list of instructions called "segments". A segment can represent any
+ instruction, topological or service based. A segment can have a
+ local semantic to an SR node or a global semantic within an SR
+   domain. SR allows the enforcement of a flow through any topological
+   path while maintaining per-flow state only at the ingress node of
+   the SR domain. SR can be applied to the MPLS and IPv6 data planes.
+
+ The use cases described in this document should be considered in the
+ context of the BGP-based large-scale data-center (DC) design
+   described in [RFC7938]. This document extends that design by
+   applying SR with both the MPLS and IPv6 data planes.
+
+2. Large-Scale Data-Center Network Design Summary
+
+ This section provides a brief summary of the Informational RFC
+ [RFC7938], which outlines a practical network design suitable for
+ data centers of various scales:
+
+ * Data-center networks have highly symmetric topologies with
+ multiple parallel paths between two server-attachment points. The
+ well-known Clos topology is most popular among the operators (as
+ described in [RFC7938]). In a Clos topology, the minimum number
+ of parallel paths between two elements is determined by the
+ "width" of the "Tier-1" stage. See Figure 1 for an illustration
+ of the concept.
+
+ * Large-scale data centers commonly use a routing protocol, such as
+ BGP-4 [RFC4271], in order to provide endpoint connectivity.
+ Therefore, recovery after a network failure is driven either by
+ local knowledge of directly available backup paths or by
+ distributed signaling between the network devices.
+
+   * Within data-center networks, traffic is load shared using the
+     Equal Cost Multipath (ECMP) mechanism. With ECMP, every network
+     device implements a pseudorandom decision, mapping packets to one
+     of the parallel paths by means of a hash function calculated over
+     certain parts of the packet, typically a combination of various
+     packet header fields; a sketch of this per-flow decision follows
+     this list.
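+
+   As an illustration only (this sketch is not part of [RFC7938]), the
+   per-flow decision described above can be modeled in a few lines of
+   Python. The 5-tuple fields and the CRC32-based hash are assumptions
+   made for this sketch, not a statement about any particular
+   implementation.
+
+      import zlib
+
+      def ecmp_select(paths, src_ip, dst_ip, proto, sport, dport):
+          """Map a flow (5-tuple) to one of the parallel paths.
+
+          Every packet of the same flow hashes to the same path; the
+          concrete hash (CRC32 here) is an assumption of this sketch.
+          """
+          key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
+          return paths[zlib.crc32(key) % len(paths)]
+
+      # Example: Node1 mapping a HostA-to-HostZ flow to Node3 or Node4
+      path = ecmp_select(["Node3", "Node4"],
+                         "HostA", "HostZ", 6, 49152, 80)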
+
+ The following is a schematic of a five-stage Clos topology with four
+ devices in the "Tier-1" stage. Notice that the number of paths
+ between Node1 and Node12 equals four; the paths have to cross all of
+ the Tier-1 devices. At the same time, the number of paths between
+ Node1 and Node2 equals two, and the paths only cross Tier-2 devices.
+ Other topologies are possible, but for simplicity, only the
+ topologies that have a single path from Tier-1 to Tier-3 are
+ considered below. The rest could be treated similarly, with a few
+ modifications to the logic.
+
+2.1. Reference Design
+
+ Tier-1
+ +-----+
+ |NODE |
+ +->| 5 |--+
+ | +-----+ |
+ Tier-2 | | Tier-2
+ +-----+ | +-----+ | +-----+
+ +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
+ | +-----| 3 |--+ | 6 | +--| 9 |-----+ |
+ | | +-----+ +-----+ +-----+ | |
+ | | | |
+ | | +-----+ +-----+ +-----+ | |
+ | +-----+---->|NODE |--+ |NODE | +--|NODE |-----+-----+ |
+ | | | +---| 4 |--+->| 7 |--+--| 10 |---+ | | |
+ | | | | +-----+ | +-----+ | +-----+ | | | |
+ | | | | | | | | | |
+ +-----+ +-----+ | +-----+ | +-----+ +-----+
+ |NODE | |NODE | Tier-3 +->|NODE |--+ Tier-3 |NODE | |NODE |
+ | 1 | | 2 | | 8 | | 11 | | 12 |
+ +-----+ +-----+ +-----+ +-----+ +-----+
+ | | | | | | | |
+ A O B O <- Servers -> Z O O O
+
+ Figure 1: 5-Stage Clos Topology
+
+   In the reference topology illustrated in Figure 1, the following is
+   assumed:
+
+ * Each node is its own autonomous system (AS) (Node X has AS X).
+ 4-byte AS numbers are recommended ([RFC6793]).
+
+ - For simple and efficient route propagation filtering, Node5,
+ Node6, Node7, and Node8 use the same AS; Node3 and Node4 use
+ the same AS; and Node9 and Node10 use the same AS.
+
+ - In the case in which 2-byte autonomous system numbers are used
+ for efficient usage of the scarce 2-byte Private Use AS pool,
+ different Tier-3 nodes might use the same AS.
+
+      - Without loss of generality, these details will be simplified in
+        this document. It is assumed that each node has its own AS.
+
+ * Each node peers with its neighbors with a BGP session. If not
+ specified, external BGP (EBGP) is assumed. In a specific use
+ case, internal BGP (IBGP) will be used, but this will be called
+ out explicitly in that case.
+
+ * Each node originates the IPv4 address of its loopback interface
+ into BGP and announces it to its neighbors.
+
+ - The loopback of Node X is 192.0.2.x/32.
+
+ In this document, the Tier-1, Tier-2, and Tier-3 nodes are referred
+ to as "Spine", "Leaf", and "ToR" (top of rack) nodes, respectively.
+ When a ToR node acts as a gateway to the "outside world", it is
+ referred to as a "border node".
+
+3. Some Open Problems in Large Data-Center Networks
+
+   The data-center-network design summarized above provides the means
+   for moving traffic between hosts with reasonable efficiency. There
+   are a few open performance and reliability problems that arise in
+   such a design:
+
+   * ECMP routing is most commonly realized per flow. This means that
+     large, long-lived "elephant" flows may affect the performance of
+     smaller, short-lived "mouse" flows and may reduce the efficiency
+     of per-flow load sharing. In other words, per-flow ECMP does not
+     perform efficiently when the flow-lifetime distribution is heavy
+     tailed. Furthermore, due to hash-function inefficiencies, it is
+     possible to have frequent flow collisions in which
+     disproportionately many flows are placed on one path.
+
+ * Shortest-path routing with ECMP implements an oblivious routing
+ model that is not aware of the network imbalances. If the network
+ symmetry is broken, for example, due to link failures, utilization
+ hotspots may appear. For example, if a link fails between Tier-1
+ and Tier-2 devices (e.g., Node5 and Node9), Tier-3 devices Node1
+ and Node2 will not be aware of that since there are other paths
+ available from the perspective of Node3. They will continue
+ sending roughly equal traffic to Node3 and Node4 as if the failure
+ didn't exist, which may cause a traffic hotspot.
+
+ * Isolating faults in the network with multiple parallel paths and
+ ECMP-based routing is nontrivial due to lack of determinism.
+ Specifically, the connections from HostA to HostB may take a
+ different path every time a new connection is formed, thus making
+ consistent reproduction of a failure much more difficult. This
+ complexity scales linearly with the number of parallel paths in
+ the network and stems from the random nature of path selection by
+ the network devices.
+
+4. Applying Segment Routing in the DC with MPLS Data Plane
+
+4.1. BGP Prefix Segment (BGP Prefix-SID)
+
+ A BGP Prefix Segment is a segment associated with a BGP prefix. A
+ BGP Prefix Segment is a network-wide instruction to forward the
+ packet along the ECMP-aware best path to the related prefix.
+
+ The BGP Prefix Segment is defined as the BGP Prefix-SID Attribute in
+ [RFC8669], which contains an index. Throughout this document, the
+ BGP Prefix Segment Attribute is referred to as the "BGP Prefix-SID"
+ and the encoded index as the label index.
+
+ In this document, the network design decision has been made to assume
+ that all the nodes are allocated the same SRGB (Segment Routing
+ Global Block), e.g., [16000, 23999]. This provides operational
+ simplification as explained in Section 8, but this is not a
+ requirement.
+
+ For illustration purposes, when considering an MPLS data plane, it is
+ assumed that the label index allocated to prefix 192.0.2.x/32 is X.
+ As a result, a local label (16000+x) is allocated for prefix
+ 192.0.2.x/32 by each node throughout the DC fabric.
+
+ When the IPv6 data plane is considered, it is assumed that Node X is
+ allocated IPv6 address (segment) 2001:DB8::X.
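+
+   For concreteness, the label and segment derivation assumed above can
+   be expressed as the following Python sketch. The SRGB bounds and the
+   addressing pattern are the conventions of this section; nothing here
+   is mandated by [RFC8669].
+
+      SRGB_BASE, SRGB_END = 16000, 23999   # same SRGB on all nodes
+
+      def mpls_label(label_index):
+          """Local label for a label index: SRGB base + index."""
+          label = SRGB_BASE + label_index
+          if label > SRGB_END:
+              raise ValueError("label index outside the SRGB")
+          return label
+
+      def node_segments(x):
+          """Loopback, label, and IPv6 segment assumed for Node X."""
+          return {"loopback": f"192.0.2.{x}/32",
+                  "mpls_label": mpls_label(x),    # 16011 for Node11
+                  "ipv6_segment": f"2001:DB8::{x}"}
+
+      assert node_segments(11)["mpls_label"] == 16011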
+
+4.2. EBGP Labeled Unicast (RFC 8277)
+
+ Referring to Figure 1 and [RFC7938], the following design
+ modifications are introduced:
+
+ * Each node peers with its neighbors via an EBGP session with
+ extensions defined in [RFC8277] (named "EBGP8277" throughout this
+ document) and with the BGP Prefix-SID attribute extension as
+ defined in [RFC8669].
+
+ * The forwarding plane at Tier-2 and Tier-1 is MPLS.
+
+ * The forwarding plane at Tier-3 is either IP2MPLS (if the host
+ sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
+ encapsulated traffic).
+
+ Figure 2 zooms into a path from ServerA to ServerZ within the
+ topology of Figure 1.
+
+ +-----+ +-----+ +-----+
+ +---------->|NODE | |NODE | |NODE |
+ | | 4 |--+->| 7 |--+--| 10 |---+
+ | +-----+ +-----+ +-----+ |
+ | |
+ +-----+ +-----+
+ |NODE | |NODE |
+ | 1 | | 11 |
+ +-----+ +-----+
+ | |
+ A <- Servers -> Z
+
+ Figure 2: Path from A to Z via Nodes 1, 4, 7, 10, and 11
+
+   Referring to Figures 1 and 2, and assuming the IP address, AS, and
+   label-index allocation previously described, the following sections
+   detail the control-plane operation and the data-plane states for the
+   prefix 192.0.2.11/32 (loopback of Node11).
+
+4.2.1. Control Plane
+
+   Node11 originates 192.0.2.11/32 in BGP and allocates to it a BGP
+   Prefix-SID with label index 11 [RFC8669].
+
+ Node11 sends the following EBGP8277 update to Node10:
+
+ IP Prefix: 192.0.2.11/32
+
+ Label: Implicit NULL
+
+ Next hop: Node11's interface address on the link to Node10
+
+ AS Path: {11}
+
+ BGP Prefix-SID: Label-Index 11
+
+ Node10 receives the above update. As it is SR capable, Node10 is
+ able to interpret the BGP Prefix-SID; therefore, it understands that
+ it should allocate the label from its own SRGB block, offset by the
+ label index received in the BGP Prefix-SID (16000+11, hence, 16011)
+ to the Network Layer Reachability Information (NLRI) instead of
+ allocating a nondeterministic label out of a dynamically allocated
+ portion of the local label space. The implicit NULL label in the
+ NLRI tells Node10 that it is the penultimate hop and that it must pop
+ the top label on the stack before forwarding traffic for this prefix
+ to Node11.
+
+ Then, Node10 sends the following EBGP8277 update to Node7:
+
+ IP Prefix: 192.0.2.11/32
+
+ Label: 16011
+
+ Next hop: Node10's interface address on the link to Node7
+
+ AS Path: {10, 11}
+
+ BGP Prefix-SID: Label-Index 11
+
+ Node7 receives the above update. As it is SR capable, Node7 is able
+ to interpret the BGP Prefix-SID; therefore, it allocates the local
+ (incoming) label 16011 (16000 + 11) to the NLRI (instead of
+ allocating a "dynamic" local label from its label manager). Node7
+ uses the label in the received EBGP8277 NLRI as the outgoing label
+ (the index is only used to derive the local/incoming label).
+
+ Node7 sends the following EBGP8277 update to Node4:
+
+ IP Prefix: 192.0.2.11/32
+
+ Label: 16011
+
+ Next hop: Node7's interface address on the link to Node4
+
+ AS Path: {7, 10, 11}
+
+ BGP Prefix-SID: Label-Index 11
+
+ Node4 receives the above update. As it is SR capable, Node4 is able
+ to interpret the BGP Prefix-SID; therefore, it allocates the local
+ (incoming) label 16011 to the NLRI (instead of allocating a "dynamic"
+ local label from its label manager). Node4 uses the label in the
+ received EBGP8277 NLRI as an outgoing label (the index is only used
+ to derive the local/incoming label).
+
+ Node4 sends the following EBGP8277 update to Node1:
+
+ IP Prefix: 192.0.2.11/32
+
+ Label: 16011
+
+ Next hop: Node4's interface address on the link to Node1
+
+ AS Path: {4, 7, 10, 11}
+
+ BGP Prefix-SID: Label-Index 11
+
+ Node1 receives the above update. As it is SR capable, Node1 is able
+ to interpret the BGP Prefix-SID; therefore, it allocates the local
+ (incoming) label 16011 to the NLRI (instead of allocating a "dynamic"
+ local label from its label manager). Node1 uses the label in the
+ received EBGP8277 NLRI as an outgoing label (the index is only used
+ to derive the local/incoming label).
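+
+   The per-hop behavior walked through above can be summarized with the
+   Python sketch below. It is illustrative only: the update is modeled
+   as a dictionary, and the field names are assumptions of this sketch.
+   The two rules it encodes are the ones described in this section: an
+   SR-capable node derives its local (incoming) label from its SRGB and
+   the received label index, and it always uses the label received in
+   the EBGP8277 NLRI as the outgoing label.
+
+      SRGB_BASE = 16000
+      IMPLICIT_NULL = "Implicit NULL"
+
+      def process_ebgp8277_update(update, my_as, my_nexthop):
+          """Install state for a received update and build the update
+          to re-advertise upstream (EBGP with next hop set to self)."""
+          index = update["prefix_sid_index"]
+          incoming = SRGB_BASE + index          # e.g., 16000 + 11
+          outgoing = update["label"]            # label from the NLRI
+          fib_entry = {
+              "incoming_label": incoming,
+              "prefix": update["prefix"],
+              # Implicit NULL from the originator means "pop" (PHP).
+              "action": "pop" if outgoing == IMPLICIT_NULL else "swap",
+              "outgoing_label": outgoing,
+              "nexthop": update["nexthop"]}
+          advertised = {
+              "prefix": update["prefix"],
+              "label": incoming,                # advertise our own label
+              "nexthop": my_nexthop,
+              "as_path": [my_as] + update["as_path"],
+              "prefix_sid_index": index}        # propagated unmodified
+          return fib_entry, advertised
+
+      # Node10 processing the update originated by Node11:
+      fib, adv = process_ebgp8277_update(
+          {"prefix": "192.0.2.11/32", "label": IMPLICIT_NULL,
+           "nexthop": "Node11-link", "as_path": [11],
+           "prefix_sid_index": 11},
+          my_as=10, my_nexthop="Node10-link")
+      assert fib["incoming_label"] == 16011 and adv["label"] == 16011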
+
+4.2.2. Data Plane
+
+ Referring to Figure 1, and assuming all nodes apply the same
+ advertisement rules described above and all nodes have the same SRGB
+ (16000-23999), here are the IP/MPLS forwarding tables for prefix
+ 192.0.2.11/32 at Node1, Node4, Node7, and Node10.
+
+ +----------------------------------+----------------+------------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+------------+
+ | 16011 | 16011 | ECMP{3, 4} |
+ +----------------------------------+----------------+------------+
+ | 192.0.2.11/32 | 16011 | ECMP{3, 4} |
+ +----------------------------------+----------------+------------+
+
+ Table 1: Node1 Forwarding Table
+
+ +----------------------------------+----------------+------------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+------------+
+ | 16011 | 16011 | ECMP{7, 8} |
+ +----------------------------------+----------------+------------+
+ | 192.0.2.11/32 | 16011 | ECMP{7, 8} |
+ +----------------------------------+----------------+------------+
+
+ Table 2: Node4 Forwarding Table
+
+ +----------------------------------+----------------+-----------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+-----------+
+ | 16011 | 16011 | 10 |
+ +----------------------------------+----------------+-----------+
+ | 192.0.2.11/32 | 16011 | 10 |
+ +----------------------------------+----------------+-----------+
+
+ Table 3: Node7 Forwarding Table
+
+ +----------------------------------+----------------+-----------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+-----------+
+ | 16011 | POP | 11 |
+ +----------------------------------+----------------+-----------+
+ | 192.0.2.11/32 | N/A | 11 |
+ +----------------------------------+----------------+-----------+
+
+ Table 4: Node10 Forwarding Table
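+
+   The following sketch only illustrates the swap/pop decision captured
+   by Tables 1-4; it is not an implementation, and the ECMP selection
+   is left abstract (a random choice here).
+
+      import random
+
+      def forward_mpls(label_stack, fib):
+          """Forward a labeled packet using a table such as Tables 1-4.
+
+          'fib' maps an incoming label to an outgoing label ("POP"
+          for penultimate-hop popping) and a list of equal-cost next
+          hops.
+          """
+          out_label, nexthops = fib[label_stack[0]]
+          if out_label == "POP":
+              label_stack.pop(0)           # PHP: expose next label or IP
+          else:
+              label_stack[0] = out_label   # swap (here 16011 -> 16011)
+          return random.choice(nexthops)   # abstract ECMP decision
+
+      # Node1's entry for label 16011 (towards Node11), per Table 1:
+      node1_fib = {16011: (16011, ["Node3", "Node4"])}
+      nexthop = forward_mpls([16011], node1_fib)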
+
+4.2.3. Network Design Variation
+
+ A network design choice could consist of switching all the traffic
+ through Tier-1 and Tier-2 as MPLS traffic. In this case, one could
+ filter away the IP entries at Node4, Node7, and Node10. This might
+ be beneficial in order to optimize the forwarding table size.
+
+ A network design choice could consist of allowing the hosts to send
+ MPLS-encapsulated traffic based on the Egress Peer Engineering (EPE)
+ use case as defined in [SR-CENTRAL-EPE]. For example, applications
+ at HostA would send their Z-destined traffic to Node1 with an MPLS
+ label stack where the top label is 16011 and the next label is an EPE
+ peer segment ([SR-CENTRAL-EPE]) at Node11 directing the traffic to Z.
+
+4.2.4. Global BGP Prefix Segment through the Fabric
+
+ When the previous design is deployed, the operator enjoys global BGP
+ Prefix-SID and label allocation throughout the DC fabric.
+
+ A few examples follow:
+
+   * Normal forwarding to Node11: A packet with top label 16011
+     received by any node in the fabric will be forwarded along the
+     ECMP-aware BGP best path towards Node11, and the label 16011 is
+     penultimate popped at Node10 (or at Node9).
+
+   * Traffic-engineered path to Node11: An application on a host behind
+     Node1 might want to restrict its traffic to paths via the Spine
+     node Node5. The application achieves this by sending its packets
+     with a label stack of {16005, 16011}. BGP Prefix-SID 16005 directs
+     the packet up to Node5 along the path (Node1, Node3, Node5). BGP
+     Prefix-SID 16011 then directs the packet down to Node11 along the
+     path (Node5, Node9, Node11). A sketch of this stack computation
+     follows this list.
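+
+   Because the SRGB is assumed to be the same fabric-wide, a host can
+   compute such a stack from the node indexes alone. A minimal sketch
+   under that assumption:
+
+      SRGB_BASE = 16000
+
+      def steered_stack(*node_indexes):
+          """Label stack steering a packet through the listed nodes."""
+          return [SRGB_BASE + i for i in node_indexes]
+
+      # Stack for "via Node5, then to Node11":
+      assert steered_stack(5, 11) == [16005, 16011]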
+
+4.2.5. Incremental Deployments
+
+ The design previously described can be deployed incrementally. Let
+ us assume that Node7 does not support the BGP Prefix-SID, and let us
+ show how the fabric connectivity is preserved.
+
+ From a signaling viewpoint, nothing would change; even though Node7
+ does not support the BGP Prefix-SID, it does propagate the attribute
+ unmodified to its neighbors.
+
+ From a label-allocation viewpoint, the only difference is that Node7
+ would allocate a dynamic (random) label to the prefix 192.0.2.11/32
+   (e.g., 12345) instead of the "hinted" label as instructed by the BGP
+ Prefix-SID. The neighbors of Node7 adapt automatically as they
+ always use the label in the BGP8277 NLRI as an outgoing label.
+
+ Node4 does understand the BGP Prefix-SID; therefore, it allocates the
+ indexed label in the SRGB (16011) for 192.0.2.11/32.
+
+   As a result, all the data-plane entries across the network would be
+   unchanged except the entries at Node7 and its neighbor Node4, as
+   shown in Tables 5 and 6 below.
+
+ The key point is that the end-to-end Label Switched Path (LSP) is
+ preserved because the outgoing label is always derived from the
+ received label within the BGP8277 NLRI. The index in the BGP Prefix-
+ SID is only used as a hint on how to allocate the local label (the
+ incoming label) but never for the outgoing label.
+
+ +----------------------------------+----------------+-----------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+-----------+
+ | 12345 | 16011 | 10 |
+ +----------------------------------+----------------+-----------+
+
+ Table 5: Node7 Forwarding Table
+
+ +----------------------------------+----------------+-----------+
+ | Incoming Label or IP Destination | Outgoing Label | Outgoing |
+ | | | Interface |
+ +----------------------------------+----------------+-----------+
+ | 16011 | 12345 | 7 |
+ +----------------------------------+----------------+-----------+
+
+ Table 6: Node4 Forwarding Table
+
+ The BGP Prefix-SID can thus be deployed incrementally, i.e., one node
+ at a time.
+
+ When deployed together with a homogeneous SRGB (the same SRGB across
+ the fabric), the operator incrementally enjoys the global prefix
+ segment benefits as the deployment progresses through the fabric.
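+
+   The mixed deployment above can be mimicked with a few lines of
+   Python; the dynamic label value 12345 is the example value used in
+   Tables 5 and 6, and everything else is an assumption of this sketch.
+
+      SRGB_BASE = 16000
+
+      def incoming_label(index, sr_capable, dynamic_label=12345):
+          """Local label for a prefix with Prefix-SID 'index'."""
+          return SRGB_BASE + index if sr_capable else dynamic_label
+
+      # Node7 (no Prefix-SID support) allocates 12345 and advertises it
+      # to Node4; Node4 allocates 16011 and uses the received 12345 as
+      # its outgoing label, so the end-to-end LSP is preserved.
+      node7_in = incoming_label(11, sr_capable=False)   # 12345 (Table 5)
+      node4_in = incoming_label(11, sr_capable=True)    # 16011 (Table 6)
+      node4_out = node7_in                              # from the NLRI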
+
+4.3. IBGP Labeled Unicast (RFC 8277)
+
+   Exactly the same design as for EBGP8277 is used, with the following
+   modifications:
+
+ * All nodes use the same AS number.
+
+ * Each node peers with its neighbors via an internal BGP session
+ (IBGP) with extensions defined in [RFC8277] (named "IBGP8277"
+ throughout this document).
+
+   * Each node acts as a route reflector for each of its neighbors and
+     uses the next-hop-self option. Next-hop-self is a well-known
+     operational feature that consists of rewriting the next hop of a
+     BGP update prior to sending it to the neighbor. It is common
+     practice to apply next-hop-self behavior towards IBGP peers for
+     EBGP-learned routes. In the case outlined in this section, it is
+     proposed to apply the next-hop-self mechanism also to IBGP-
+     learned routes.
+
+ Cluster-1
+ +-----------+
+ | Tier-1 |
+ | +-----+ |
+ | |NODE | |
+ | | 5 | |
+ Cluster-2 | +-----+ | Cluster-3
+ +---------+ | | +---------+
+ | Tier-2 | | | | Tier-2 |
+ | +-----+ | | +-----+ | | +-----+ |
+ | |NODE | | | |NODE | | | |NODE | |
+ | | 3 | | | | 6 | | | | 9 | |
+ | +-----+ | | +-----+ | | +-----+ |
+ | | | | | |
+ | | | | | |
+ | +-----+ | | +-----+ | | +-----+ |
+ | |NODE | | | |NODE | | | |NODE | |
+ | | 4 | | | | 7 | | | | 10 | |
+ | +-----+ | | +-----+ | | +-----+ |
+ +---------+ | | +---------+
+ | |
+ | +-----+ |
+ | |NODE | |
+ Tier-3 | | 8 | | Tier-3
+ +-----+ +-----+ | +-----+ | +-----+ +-----+
+ |NODE | |NODE | +-----------+ |NODE | |NODE |
+ | 1 | | 2 | | 11 | | 12 |
+ +-----+ +-----+ +-----+ +-----+
+
+ Figure 3: IBGP Sessions with Reflection and Next-Hop-Self
+
+ * For simple and efficient route propagation filtering and as
+ illustrated in Figure 3:
+
+ - Node5, Node6, Node7, and Node8 use the same Cluster ID
+ (Cluster-1).
+
+ - Node3 and Node4 use the same Cluster ID (Cluster-2).
+
+ - Node9 and Node10 use the same Cluster ID (Cluster-3).
+
+ * The control-plane behavior is mostly the same as described in the
+ previous section; the only difference is that the EBGP8277 path
+ propagation is simply replaced by an IBGP8277 path reflection with
+ next hop changed to self.
+
+ * The data-plane tables are exactly the same.
+
+5. Applying Segment Routing in the DC with IPv6 Data Plane
+
+   The design described in [RFC7938] is reused with a single
+ modification. It is highlighted using the example of the
+ reachability to Node11 via Spine node Node5.
+
+ Node5 originates 2001:DB8::5/128 with the attached BGP Prefix-SID for
+ IPv6 packets destined to segment 2001:DB8::5 ([RFC8402]).
+
+ Node11 originates 2001:DB8::11/128 with the attached BGP Prefix-SID
+ advertising the support of the Segment Routing Header (SRH) for IPv6
+ packets destined to segment 2001:DB8::11.
+
+ The control-plane and data-plane processing of all the other nodes in
+ the fabric is unchanged. Specifically, the routes to 2001:DB8::5 and
+ 2001:DB8::11 are installed in the FIB along the EBGP best path to
+ Node5 (Spine node) and Node11 (ToR node) respectively.
+
+ An application on HostA that needs to send traffic to HostZ via only
+ Node5 (Spine node) can do so by sending IPv6 packets with a Segment
+ Routing Header (SRH, [IPv6-SRH]). The destination address and active
+   segment are set to 2001:DB8::5. The next and last segment is set to
+ 2001:DB8::11.
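+
+   The relationship between the segment list, the Segments Left field,
+   and the destination address can be illustrated with the following
+   plain-Python sketch (no packet library is used; the dictionary
+   layout is an assumption of this sketch, while the reversed encoding
+   of the segment list follows [IPv6-SRH]).
+
+      def build_srv6_packet(segments, payload):
+          """Model an IPv6 packet carrying an SRH for 'segments'.
+
+          'segments' is given in traversal order, e.g.
+          ["2001:DB8::5", "2001:DB8::11"]. The SRH stores them in
+          reverse order, Segments Left points at the active segment,
+          and the IPv6 destination address is the active segment.
+          """
+          srh = {"segment_list": list(reversed(segments)),
+                 "segments_left": len(segments) - 1}
+          return {"dst": segments[0], "srh": srh, "payload": payload}
+
+      pkt = build_srv6_packet(["2001:DB8::5", "2001:DB8::11"], b"data")
+      assert pkt["dst"] == "2001:DB8::5"
+      assert pkt["srh"]["segments_left"] == 1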
+
+ The application must only use IPv6 addresses that have been
+   advertised as capable of SRv6 segment processing (e.g., for which
+ the BGP Prefix Segment capability has been advertised). How
+ applications learn this (e.g., centralized controller and
+ orchestration) is outside the scope of this document.
+
+6. Communicating Path Information to the Host
+
+ There are two general methods for communicating path information to
+   the end-hosts: "proactive" and "reactive", also known as "push" and
+   "pull" models. There are multiple ways to implement either of these
+ methods. Here, it is noted that one way could be using a centralized
+ controller: the controller either tells the hosts of the prefix-to-
+ path mappings beforehand and updates them as needed (network event
+ driven push) or responds to the hosts making requests for a path to a
+ specific destination (host event driven pull). It is also possible
+ to use a hybrid model, i.e., pushing some state from the controller
+ in response to particular network events, while the host pulls other
+ state on demand.
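+
+   A toy illustration of the two models follows; the class and method
+   names are invented for this sketch and do not correspond to any
+   real controller API.
+
+      class PathController:
+          """Toy controller holding prefix-to-segment-list mappings."""
+
+          def __init__(self):
+              self.mappings = {}      # prefix -> segment list
+              self.subscribers = []   # hosts registered for pushes
+
+          def push(self, prefix, segment_list):
+              """Proactive model: publish a mapping to all hosts."""
+              self.mappings[prefix] = segment_list
+              for host in self.subscribers:
+                  host.install(prefix, segment_list)
+
+          def pull(self, prefix):
+              """Reactive model: answer an on-demand host request;
+              None means "use the default ECMP best path"."""
+              return self.mappings.get(prefix)
+
+      ctrl = PathController()
+      ctrl.push("192.0.2.11/32", [16011])
+      assert ctrl.pull("192.0.2.11/32") == [16011]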
+
+ Note also that when disseminating network-related data to the end-
+ hosts, a trade-off is made to balance the amount of information vs.
+ the level of visibility in the network state. This applies to both
+ push and pull models. In the extreme case, the host would request
+ path information on every flow and keep no local state at all. On
+ the other end of the spectrum, information for every prefix in the
+ network along with available paths could be pushed and continuously
+ updated on all hosts.
+
+7. Additional Benefits
+
+7.1. MPLS Data Plane with Operational Simplicity
+
+ As required by [RFC7938], no new signaling protocol is introduced.
+ The BGP Prefix-SID is a lightweight extension to BGP Labeled Unicast
+ [RFC8277]. It applies either to EBGP- or IBGP-based designs.
+
+ Specifically, LDP and RSVP-TE are not used. These protocols would
+ drastically impact the operational complexity of the data center and
+ would not scale. This is in line with the requirements expressed in
+ [RFC7938].
+
+ Provided the same SRGB is configured on all nodes, all nodes use the
+ same MPLS label for a given IP prefix. This is simpler from an
+   operational standpoint, as discussed in Section 8.
+
+7.2. Minimizing the FIB Table
+
+ The designer may decide to switch all the traffic at Tier-1 and
+ Tier-2 based on MPLS, thereby drastically decreasing the IP table
+ size at these nodes.
+
+ This is easily accomplished by encapsulating the traffic either
+ directly at the host or at the source ToR node. The encapsulation is
+ done by pushing the BGP Prefix-SID of the destination ToR for intra-
+   DC traffic, or by pushing the BGP Prefix-SID of the border node for
+ inter-DC or DC-to-outside-world traffic.
+
+7.3. Egress Peer Engineering
+
+ It is straightforward to combine the design illustrated in this
+ document with the Egress Peer Engineering (EPE) use case described in
+ [SR-CENTRAL-EPE].
+
+ In such a case, the operator is able to engineer its outbound traffic
+ on a per-host-flow basis, without incurring any additional state at
+ intermediate points in the DC fabric.
+
+ For example, the controller only needs to inject a per-flow state on
+ the HostA to force it to send its traffic destined to a specific
+ Internet destination D via a selected border node (say Node12 in
+ Figure 1 instead of another border node, Node11) and a specific
+ egress peer of Node12 (say peer AS 9999 of local PeerNode segment
+ 9999 at Node12 instead of any other peer that provides a path to the
+ destination D). Any packet matching this state at HostA would be
+ encapsulated with SR segment list (label stack) {16012, 9999}. 16012
+ would steer the flow through the DC fabric, leveraging any ECMP,
+ along the best path to border node Node12. Once the flow gets to
+ border node Node12, the active segment is 9999 (because of
+ Penultimate Hop Popping (PHP) on the upstream neighbor of Node12).
+ This EPE PeerNode segment forces border node Node12 to forward the
+ packet to peer AS 9999 without any IP lookup at the border node.
+ There is no per-flow state for this engineered flow in the DC fabric.
+ A benefit of SR is that the per-flow state is only required at the
+ source.
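+
+   A sketch of the per-flow state that the controller would install at
+   HostA follows; the matching key (destination address and port) and
+   the data structure are assumptions of this illustration, and
+   198.51.100.1 stands in for the Internet destination D.
+
+      # Traffic for destination D is encapsulated with {16012, 9999}:
+      # "reach border node Node12, then exit via EPE peer segment 9999".
+      host_a_policies = {
+          ("198.51.100.1", 443): [16012, 9999],
+      }
+
+      def label_stack_for(flow, policies, default=None):
+          """Stack to push for 'flow'; 'default' (None) means the plain
+          ECMP-aware best path through the fabric."""
+          return policies.get(flow, default)
+
+      assert label_stack_for(("198.51.100.1", 443),
+                             host_a_policies) == [16012, 9999]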
+
+ As well as allowing full traffic-engineering control, such a design
+ also offers FIB table-minimization benefits as the Internet-scale FIB
+ at border node Node12 is not required if all FIB lookups are avoided
+ there by using EPE.
+
+7.4. Anycast
+
+ The design presented in this document preserves the availability and
+ load-balancing properties of the base design presented in [RFC8402].
+
+ For example, one could assign an anycast loopback 192.0.2.20/32 and
+ associate segment index 20 to it on the border nodes Node11 and
+ Node12 (in addition to their node-specific loopbacks). Doing so, the
+ EPE controller could express a default "go-to-the-Internet via any
+ border node" policy as segment list {16020}. Indeed, from any host in
+ the DC fabric or from any ToR node, 16020 steers the packet towards
+ the border nodes Node11 or Node12 leveraging ECMP where available
+ along the best paths to these nodes.
+
+8. Preferred SRGB Allocation
+
+   In the MPLS case, it is recommended to use the same SRGB on all
+   nodes.
+
+ Different SRGBs in each node likely increase the complexity of the
+ solution both from an operational viewpoint and from a controller
+ viewpoint.
+
+ From an operational viewpoint, it is much simpler to have the same
+ global label at every node for the same destination (the MPLS
+ troubleshooting is then similar to the IPv6 troubleshooting where
+ this global property is a given).
+
+ From a controller viewpoint, this allows us to construct simple
+ policies applicable across the fabric.
+
+ Let us consider two applications, A and B, respectively connected to
+ Node1 and Node2 (ToR nodes). Application A has two flows, FA1 and
+ FA2, destined to Z. B has two flows, FB1 and FB2, destined to Z.
+ The controller wants FA1 and FB1 to be load shared across the fabric
+ while FA2 and FB2 must be respectively steered via Node5 and Node8.
+
+ Assuming a consistent unique SRGB across the fabric as described in
+ this document, the controller can simply do it by instructing A and B
+ to use {16011} respectively for FA1 and FB1 and by instructing A and
+ B to use {16005 16011} and {16008 16011} respectively for FA2 and
+ FB2.
+
+ Let us assume a design where the SRGB is different at every node and
+ where the SRGB of each node is advertised using the Originator SRGB
+ TLV of the BGP Prefix-SID as defined in [RFC8669]: SRGB of Node K
+ starts at value K*1000, and the SRGB length is 1000 (e.g., Node1's
+ SRGB is [1000, 1999], Node2's SRGB is [2000, 2999], ...).
+
+ In this case, the controller would need to collect and store all of
+ these different SRGBs (e.g., through the Originator SRGB TLV of the
+ BGP Prefix-SID); furthermore, it would also need to adapt the policy
+ for each host. Indeed, the controller would instruct A to use {1011}
+ for FA1 while it would have to instruct B to use {2011} for FB1
+ (while with the same SRGB, both policies are the same {16011}).
+
+   Even worse, the controller would instruct A to use {1005, 5011} for
+   FA2 while it would instruct B to use {2008, 8011} for FB2 (while with
+   the same SRGB, the second segment is the same across both policies:
+   16011). When combining segments to create a policy, one needs to
+   carefully update the label of each segment. This is obviously more
+   error prone, more complex, and more difficult to troubleshoot.
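+
+   The difference between the two allocation models can be made
+   concrete with a short sketch. The per-node rule (Node K's SRGB
+   starts at K*1000) is the hypothetical allocation used above; the
+   function names are assumptions of this sketch.
+
+      def segment_list(node_indexes, srgb_base_of, first_hop):
+          """Translate a path given as node indexes into labels.
+
+          Each segment must be encoded in the SRGB of the node that
+          will look it up: the first hop for the first segment, then
+          the node identified by the previous segment.
+          """
+          labels, encoder = [], first_hop
+          for node in node_indexes:
+              labels.append(srgb_base_of(encoder) + node)
+              encoder = node
+          return labels
+
+      def homogeneous(node):      # same SRGB base everywhere
+          return 16000
+
+      def per_node(node):         # hypothetical: Node K SRGB = K*1000
+          return node * 1000
+
+      # FA2 (A behind Node1, via Node5) and FB2 (B behind Node2, via
+      # Node8), both terminating at Node11:
+      assert segment_list([5, 11], homogeneous, 1) == [16005, 16011]
+      assert segment_list([8, 11], homogeneous, 2) == [16008, 16011]
+      assert segment_list([5, 11], per_node, 1) == [1005, 5011]
+      assert segment_list([8, 11], per_node, 2) == [2008, 8011]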
+
+9. IANA Considerations
+
+ This document has no IANA actions.
+
+10. Manageability Considerations
+
+ The design and deployment guidelines described in this document are
+ based on the network design described in [RFC7938].
+
+ The deployment model assumed in this document is based on a single
+ domain where the interconnected DCs are part of the same
+ administrative domain (which, of course, is split into different
+ autonomous systems). The operator has full control of the whole
+ domain, and the usual operational and management mechanisms and
+ procedures are used in order to prevent any information related to
+   internal prefixes and topology from being leaked outside the domain.
+
+ As recommended in [RFC8402], the same SRGB should be allocated in all
+ nodes in order to facilitate the design, deployment, and operations
+ of the domain.
+
+ When EPE ([SR-CENTRAL-EPE]) is used (as explained in Section 7.3),
+ the same operational model is assumed. EPE information is originated
+ and propagated throughout the domain towards an internal server, and
+ unless explicitly configured by the operator, no EPE information is
+ leaked outside the domain boundaries.
+
+11. Security Considerations
+
+ This document proposes to apply SR to a well-known scalability
+ requirement expressed in [RFC7938] using the BGP Prefix-SID as
+ defined in [RFC8669].
+
+   It has to be noted, as described in Section 10, that the designs
+ illustrated in [RFC7938] and in this document refer to a deployment
+ model where all nodes are under the same administration. In this
+ context, it is assumed that the operator doesn't want to leak outside
+ of the domain any information related to internal prefixes and
+ topology. The internal information includes Prefix-SID and EPE
+ information. In order to prevent such leaking, the standard BGP
+ mechanisms (filters) are applied on the boundary of the domain.
+
+ Therefore, the solution proposed in this document does not introduce
+ any additional security concerns from what is expressed in [RFC7938]
+ and [RFC8669]. It is assumed that the security and confidentiality
+ of the prefix and topology information is preserved by outbound
+ filters at each peering point of the domain as described in
+ Section 10.
+
+12. References
+
+12.1. Normative References
+
+ [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
+ Border Gateway Protocol 4 (BGP-4)", RFC 4271,
+ DOI 10.17487/RFC4271, January 2006,
+ <https://www.rfc-editor.org/info/rfc4271>.
+
+ [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
+ BGP for Routing in Large-Scale Data Centers", RFC 7938,
+ DOI 10.17487/RFC7938, August 2016,
+ <https://www.rfc-editor.org/info/rfc7938>.
+
+ [RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address
+ Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
+ <https://www.rfc-editor.org/info/rfc8277>.
+
+ [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
+ Decraene, B., Litkowski, S., and R. Shakir, "Segment
+ Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
+ July 2018, <https://www.rfc-editor.org/info/rfc8402>.
+
+ [RFC8669] Previdi, S., Filsfils, C., Lindem, A., Ed., Sreekantiah,
+ A., and H. Gredler, "Segment Routing Prefix Segment
+ Identifier Extensions for BGP", RFC 8669,
+ DOI 10.17487/RFC8669, December 2019,
+ <https://www.rfc-editor.org/info/rfc8669>.
+
+12.2. Informative References
+
+ [IPv6-SRH] Filsfils, C., Dukes, D., Previdi, S., Leddy, J.,
+ Matsushima, S., and D. Voyer, "IPv6 Segment Routing Header
+ (SRH)", Work in Progress, Internet-Draft, draft-ietf-6man-
+ segment-routing-header-26, 22 October 2019,
+ <https://tools.ietf.org/html/draft-ietf-6man-segment-
+ routing-header-26>.
+
+ [RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet
+ Autonomous System (AS) Number Space", RFC 6793,
+ DOI 10.17487/RFC6793, December 2012,
+ <https://www.rfc-editor.org/info/rfc6793>.
+
+ [SR-CENTRAL-EPE]
+ Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D.
+ Afanasiev, "Segment Routing Centralized BGP Egress Peer
+ Engineering", Work in Progress, Internet-Draft, draft-
+ ietf-spring-segment-routing-central-epe-10, 21 December
+ 2017, <https://tools.ietf.org/html/draft-ietf-spring-
+ segment-routing-central-epe-10>.
+
+Acknowledgements
+
+ The authors would like to thank Benjamin Black, Arjun Sreekantiah,
+ Keyur Patel, Acee Lindem, and Anoop Ghanwani for their comments and
+ review of this document.
+
+Contributors
+
+ Gaya Nagarajan
+ Facebook
+ United States of America
+
+ Email: gaya@fb.com
+
+ Gaurav Dawra
+ Cisco Systems
+ United States of America
+
+ Email: gdawra.ietf@gmail.com
+
+ Dmitry Afanasiev
+ Yandex
+ Russian Federation
+
+ Email: fl0w@yandex-team.ru
+
+ Tim Laberge
+ Cisco
+ United States of America
+
+ Email: tlaberge@cisco.com
+
+ Edet Nkposong
+ Salesforce.com Inc.
+ United States of America
+
+ Email: enkposong@salesforce.com
+
+ Mohan Nanduri
+ Microsoft
+ United States of America
+
+ Email: mohan.nanduri@oracle.com
+
+ James Uttaro
+ ATT
+ United States of America
+
+ Email: ju1738@att.com
+
+ Saikat Ray
+ Unaffiliated
+ United States of America
+
+ Email: raysaikat@gmail.com
+
+ Jon Mitchell
+ Unaffiliated
+ United States of America
+
+ Email: jrmitche@puck.nether.net
+
+Authors' Addresses
+
+ Clarence Filsfils (editor)
+ Cisco Systems, Inc.
+ Brussels
+ Belgium
+
+ Email: cfilsfil@cisco.com
+
+
+ Stefano Previdi
+ Cisco Systems, Inc.
+ Italy
+
+ Email: stefano@previdi.net
+
+
+ Gaurav Dawra
+ LinkedIn
+ United States of America
+
+ Email: gdawra.ietf@gmail.com
+
+
+ Ebben Aries
+ Arrcus, Inc.
+ 2077 Gateway Place, Suite #400
+ San Jose, CA 95119
+ United States of America
+
+ Email: exa@arrcus.com
+
+
+ Petr Lapukhov
+ Facebook
+ United States of America
+
+ Email: petr@fb.com