diff options
Diffstat (limited to 'doc/rfc/rfc8604.txt')
-rw-r--r-- | doc/rfc/rfc8604.txt | 619 |
1 files changed, 619 insertions, 0 deletions
diff --git a/doc/rfc/rfc8604.txt b/doc/rfc/rfc8604.txt new file mode 100644 index 0000000..48f7bab --- /dev/null +++ b/doc/rfc/rfc8604.txt @@ -0,0 +1,619 @@ + + + + + + +Independent Submission C. Filsfils, Ed. +Request for Comments: 8604 Cisco Systems, Inc. +Category: Informational S. Previdi +ISSN: 2070-1721 Huawei Technologies + G. Dawra, Ed. + LinkedIn + W. Henderickx + Nokia + D. Cooper + CenturyLink + June 2019 + + + Interconnecting Millions of Endpoints with Segment Routing + +Abstract + + This document describes an application of Segment Routing to scale + the network to support hundreds of thousands of network nodes, and + tens of millions of physical underlay endpoints. This use case can + be applied to the interconnection of massive-scale Data Centers (DCs) + and/or large aggregation networks. Forwarding tables of midpoint and + leaf nodes only require a few tens of thousands of entries. This may + be achieved by the inherently scaleable nature of Segment Routing and + the design proposed in this document. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This is a contribution to the RFC Series, independently of any other + RFC stream. The RFC Editor has chosen to publish this document at + its discretion and makes no statement about its value for + implementation or deployment. Documents approved for publication by + the RFC Editor are not candidates for any level of Internet Standard; + see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc8604. + + + + + + + + + + +Filsfils, et al. Informational [Page 1] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +Copyright Notice + + Copyright (c) 2019 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. + +Table of Contents + + 1. Introduction ....................................................3 + 2. Terminology .....................................................3 + 3. Reference Design ................................................3 + 4. Control Plane ...................................................5 + 5. Illustration of the Scale .......................................5 + 6. Design Options ..................................................6 + 6.1. Segment Routing Global Block (SRGB) Size ...................6 + 6.2. Redistribution of Routes for Agg Nodes .....................7 + 6.3. Sizing and Hierarchy .......................................7 + 6.4. Local Segments to Hosts/Servers ............................7 + 6.5. Compressed SRTE Policies ...................................7 + 7. Deployment Model ................................................8 + 8. Benefits ........................................................8 + 8.1. Simplified Operations ......................................8 + 8.2. Inter-domain SLAs ..........................................8 + 8.3. Scale ......................................................9 + 8.4. ECMP .......................................................9 + 9. IANA Considerations .............................................9 + 10. Manageability Considerations ...................................9 + 11. Security Considerations ........................................9 + 12. Informative References .........................................9 + Acknowledgements ..................................................10 + Contributors ......................................................10 + Authors' Addresses ................................................11 + + + + + + + + + + + + + +Filsfils, et al. Informational [Page 2] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +1. Introduction + + This document describes how Segment Routing (SR) can be used to + interconnect millions of endpoints. + +2. Terminology + + The following terms and abbreviations are used in this document: + + Term Definition + ------------------------------------------------------------- + Agg Aggregation + BGP Border Gateway Protocol + DC Data Center + DCI Data Center Interconnect + ECMP Equal-Cost Multipath + FIB Forwarding Information Base + LDP Label Distribution Protocol + LFIB Label Forwarding Information Base + MPLS Multiprotocol Label Switching + PCE Path Computation Element + PCEP Path Computation Element Communication Protocol + PW Pseudowire + SLA Service Level Agreement + SR Segment Routing + SRTE Policy Segment Routing Traffic Engineering Policy + TE Traffic Engineering + TI-LFA Topology Independent Loop-Free Alternate + +3. Reference Design + + The network diagram below illustrates the reference network topology + used in this document: + + +-------+ +--------+ +--------+ +-------+ +-------+ + A DCI1 Agg1 Agg3 DCI3 Z + | DC1 | | M1 | | C | | M2 | | DC2 | + | DCI2 Agg2 Agg4 DCI4 | + +-------+ +--------+ +--------+ +-------+ +-------+ + + Figure 1: Reference Topology + + The following apply to the reference topology above: + + o Independent ISIS-OSPF/SR instance in core (C) region. + + o Independent ISIS-OSPF/SR instance in Metro1 (M1) region. + + + + +Filsfils, et al. Informational [Page 3] + +RFC 8604 Large-Scale Segment Routing June 2019 + + + o Independent ISIS-OSPF/SR instance in Metro2 (M2) region. + + o BGP/SR in DC1. + + o BGP/SR in DC2. + + o Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M + (M1 and M2) and from M to DC domains. + + o No other route is advertised or redistributed between regions. + + o The same homogeneous Segment Routing Global Block (SRGB) is used + throughout the domains (e.g., 16000-23999). + + o Unique SRGB sub-ranges are allocated to each metro (M) and core + (C) domain: + + * The 16000-16999 range is allocated to the core (C) + domain/region. + + * The 17000-17999 range is allocated to the M1 domain/region. + + * The 18000-18999 range is allocated to the M2 domain/region. + + * Specifically, the Agg1 router has Segment Identifier (SID) + 16001 allocated, and the Agg2 router has SID 16002 allocated. + + * Specifically, the Agg3 router has SID 16003 allocated, and the + anycast SID for Agg3 and Agg4 is 16006. + + * Specifically, the DCI3 router has SID 18003 allocated, and the + anycast SID for DCI3 and DCI4 is 18006. + + * Specifically, at the Agg1 router, the binding SID 4001 leads to + DCI pair (DCI3, DCI4) via a specific low-latency path {16002, + 16003, 18006}. + + o The same SRGB sub-range is reused within each DC (DC1 and DC2) + region for each DC (e.g., 20000-23999). Specifically, nodes A + and Z both have SID 20001 allocated to them. + + + + + + + + + + + +Filsfils, et al. Informational [Page 4] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +4. Control Plane + + This section provides a high-level description of how a control plane + could be implemented using protocol components already defined in + other RFCs. + + The mechanism through which SRTE Policies are defined, computed, and + programmed in the source nodes is outside the scope of this document. + + Typically, a controller or a service orchestration system programs + node A with a PW to a remote next-hop node Z with a given SLA + contract (e.g., low-latency path, disjointness from a specific core + plane, disjointness from a different PW service). + + Node A automatically detects that node Z is not reachable. It then + automatically sends a PCEP request to an SR PCE for an SRTE policy + that provides reachability information for node Z with the + requested SLA. + + The SR PCE [RFC4655] is made of two components: a multi-domain + topology and a computation engine. The multi-domain topology is + continuously refreshed through BGP - Link State (BGP-LS) feeds + [RFC7752] from each domain. The computation engine is designed to + implement TE algorithms and provide output in SR Path format. Upon + receiving the PCEP request [RFC5440], the SR PCE computes the + requested path. The path is expressed through a list of segments + (e.g., {16003, 18006, 20001}) and provided to node A. + + The SR PCE logs the request as a stateful query and hence is able to + recompute the path at each network topology change. + + Node A receives the PCEP reply with the path (expressed as a segment + list). Node A installs the received SRTE policy in the data plane. + Node A then automatically steers the PW into that SRTE policy. + +5. Illustration of the Scale + + According to the reference topology shown in Figure 1, the following + assumptions are made: + + o There is one core domain, and there are 100 leaf (metro) domains. + + o The core domain includes 200 nodes. + + o Two nodes connect each leaf (metro) domain. Each node connecting + a leaf domain has a SID allocated. Each pair of nodes connecting + a leaf domain also has a common anycast SID. This yields up to + 300 prefix segments in total. + + + +Filsfils, et al. Informational [Page 5] + +RFC 8604 Large-Scale Segment Routing June 2019 + + + o A core node connects only one leaf domain. + + o Each leaf domain has 6,000 leaf-node segments. Each leaf node has + 500 endpoints attached and thus 500 adjacency segments. This + yields a total of 3 million endpoints for a leaf domain. + + Based on the above, the network scaling numbers are as follows: + + o 6,000 leaf-node segments multiplied by 100 leaf domains: + 600,000 nodes. + + o 600,000 nodes multiplied by 500 endpoints: 300 million endpoints. + + The node scaling numbers are as follows: + + o Leaf-node segment scale: 6,000 leaf-node segments + 300 core-node + segments + 500 adjacency segments = 6,800 segments. + + o Core-node segment scale: 6,000 leaf-domain segments + + 300 core-domain segments = 6,300 segments. + + In the above calculations, the link-adjacency segments are not taken + into account. These are local segments and, typically, less than 100 + per node. + + It has to be noted that, depending on leaf-node FIB capabilities, + leaf domains could be split into multiple smaller domains. In the + above example, the leaf domains could be split into six smaller + domains so that each leaf node only needs to learn 1,000 leaf-node + segments + 300 core-node segments + 500 adjacency segments, yielding + a total of 1,800 segments. + +6. Design Options + + This section describes multiple design options to illustrate scale as + described in the previous section. + +6.1. Segment Routing Global Block (SRGB) Size + + In the simplified illustrations in this document, we picked a small + homogeneous SRGB range of 16000-23999. In practice, a large-scale + design would use a bigger range, such as 16000-80000 or even larger. + A larger range provides allocations for various TE applications + within a given domain. + + + + + + + +Filsfils, et al. Informational [Page 6] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +6.2. Redistribution of Routes for Agg Nodes + + The operator might choose to not redistribute the routes for Agg + nodes into the Metro/DC domains. In that case, more segments are + required in order to express an inter-domain path. + + For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3, + DCI3, Z} in order to reach Z instead of {Agg3, DCI3, Z} in the + reference design. + +6.3. Sizing and Hierarchy + + The operator is free to choose among a small number of larger leaf + domains, a large number of small leaf domains, or a mix of small and + large core/leaf domains. + + The operator is free to use a two-tier (Core/Metro) or three-tier + (Core/Metro/DC) design. + +6.4. Local Segments to Hosts/Servers + + Local segments can be programmed at any leaf node (e.g., node Z) in + order to identify locally attached hosts (or Virtual Machines (VMs)). + For example, if node Z has bound a local segment 40001 to a local + host ZH1, then node A uses the following SRTE Policy in order to + reach that host: {16006, 18006, 20001, 40001}. Such a local segment + could represent the NID (Network Interface Device) in the context of + the service provider access network, or a VM in the context of the DC + network. + +6.5. Compressed SRTE Policies + + As an example and according to Section 3, we assume that node A can + reach node Z (e.g., with a low-latency SLA contract) via the SRTE + policy that consists of the path Agg1, Agg2, Agg3, DCI3/4(anycast), + Z. The path is represented by the segment list {16001, 16002, 16003, + 18006, 20001}. + + It is clear that the control-plane solution can install an SRTE + Policy {16002, 16003, 18006} at Agg1, collect the binding SID + allocated by Agg1 to that policy (e.g., 4001), and hence program + node A with the compressed SRTE Policy {16001, 4001, 20001}. + + From node A, 16001 leads to Agg1. Once at Agg1, 4001 leads to the + DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003, + 18006}. Once at that DCI pair, 20001 leads to Z. + + + + + +Filsfils, et al. Informational [Page 7] + +RFC 8604 Large-Scale Segment Routing June 2019 + + + Binding SIDs allocated to "intermediate" SRTE Policies achieve the + compression of end-to-end SRTE Policies. + + The segment list {16001, 4001, 20001} expresses the same path as + {16001, 16002, 16003, 18006, 20001} but with two less segments. + + The binding SID also provides for inherent churn protection. + + When the core topology changes, the control plane can update the + low-latency SRTE Policy from Agg1 to the DCI pair to DC2 without + updating the SRTE Policy from A to Z. + +7. Deployment Model + + It is expected that this design will be used in "green field" + deployments as well as interworking ("brown field") deployments with + an MPLS design across multiple domains. + +8. Benefits + + The design options illustrated in this document allow + interconnections on a very large scale. Millions of endpoints across + different domains can be interconnected. + +8.1. Simplified Operations + + Two control-plane protocols not needed in this design are LDP and + RSVP-TE. No new protocol has been introduced. The design leverages + the core IP protocols ISIS, OSPF, BGP, and PCEP with straightforward + SR extensions. + +8.2. Inter-domain SLAs + + Fast reroute and resiliency are provided by TI-LFA with sub-50-ms + fast reroute upon failure of a link, node, or Shared Risk Link Group + (SRLG). TI-LFA is described in [SR-TI-LFA]. + + The use of anycast SIDs also provides improved availability and + resiliency. + + Inter-domain SLAs can be delivered (e.g., latency vs. cost-optimized + paths, disjointness from backbone planes, disjointness from other + services, disjointness between primary and backup paths). + + Existing inter-domain solutions do not provide any support for SLA + contracts. They just provide best-effort reachability across + domains. + + + + +Filsfils, et al. Informational [Page 8] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +8.3. Scale + + In addition to having eliminated the need for LDP and RSVP-TE, + per-service midpoint states have also been removed from the network. + +8.4. ECMP + + Each policy (intra-domain or inter-domain, with or without TE) is + expressed as a list of segments. Since each segment is optimized for + ECMP, the entire policy is optimized for ECMP. The benefit of an + anycast prefix segment optimized for ECMP should also be considered + (e.g., 16001 load-shares across any gateway from the M1 leaf domain + to the Core and 16002 load-shares across any gateway from the Core to + the M1 leaf domain). + +9. IANA Considerations + + This document has no IANA actions. + +10. Manageability Considerations + + This document describes an application of SR over the MPLS data + plane. SR does not introduce any changes in the MPLS data plane. + The manageability considerations described in [RFC8402] apply to the + MPLS data plane when used with SR. + +11. Security Considerations + + This document does not introduce additional security requirements and + mechanisms other than those described in [RFC8402]. + +12. Informative References + + [RFC4655] Farrel, A., Vasseur, J.-P., and J. Ash, "A Path + Computation Element (PCE)-Based Architecture", RFC 4655, + DOI 10.17487/RFC4655, August 2006, + <https://www.rfc-editor.org/info/rfc4655>. + + [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation + Element (PCE) Communication Protocol (PCEP)", RFC 5440, + DOI 10.17487/RFC5440, March 2009, + <https://www.rfc-editor.org/info/rfc5440>. + + [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and + S. Ray, "North-Bound Distribution of Link-State and + Traffic Engineering (TE) Information Using BGP", RFC 7752, + DOI 10.17487/RFC7752, March 2016, + <https://www.rfc-editor.org/info/rfc7752>. + + + +Filsfils, et al. Informational [Page 9] + +RFC 8604 Large-Scale Segment Routing June 2019 + + + [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., + Decraene, B., Litkowski, S., and R. Shakir, "Segment + Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, + July 2018, <https://www.rfc-editor.org/info/rfc8402>. + + [SR-TI-LFA] + Litkowski, S., Bashandy, A., Filsfils, C., + Decraene, B., Francois, P., Voyer, D., Clad, F., and + P. Camarillo, "Topology Independent Fast Reroute + using Segment Routing", Work in Progress, + draft-ietf-rtgwg-segment-routing-ti-lfa-01, March 2019. + +Acknowledgements + + We would like to thank Giles Heron, Alexander Preusche, Steve + Braaten, and Francis Ferguson for their contributions to the content + of this document. + +Contributors + + The following people substantially contributed to the editing of this + document: + + Dennis Cai + Individual + + Tim Laberge + Individual + + Steven Lin + Google Inc. + + Bruno Decraene + Orange + + Luay Jalil + Verizon + + Jeff Tantsura + Individual + + Rob Shakir + Google Inc. + + + + + + + + +Filsfils, et al. Informational [Page 10] + +RFC 8604 Large-Scale Segment Routing June 2019 + + +Authors' Addresses + + Clarence Filsfils (editor) + Cisco Systems, Inc. + Brussels + Belgium + + Email: cfilsfil@cisco.com + + + Stefano Previdi + Huawei Technologies + + Email: stefano@previdi.net + + + Gaurav Dawra (editor) + LinkedIn + United States of America + + Email: gdawra.ietf@gmail.com + + + Wim Henderickx + Nokia + Copernicuslaan 50 + Antwerp 2018 + Belgium + + Email: wim.henderickx@nokia.com + + + Dave Cooper + CenturyLink + + Email: Dave.Cooper@centurylink.com + + + + + + + + + + + + + + + +Filsfils, et al. Informational [Page 11] + |