| author | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
|---|---|---|
| committer | Thomas Voss <mail@thomasvoss.com> | 2024-11-27 20:54:24 +0100 |
| commit | 4bfd864f10b68b71482b35c818559068ef8d5797 (patch) | |
| tree | e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc7938.txt | |
| parent | ea76e11061bda059ae9f9ad130a9895cc85607db (diff) | |
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc7938.txt')
-rw-r--r-- | doc/rfc/rfc7938.txt | 1963 |
1 file changed, 1963 insertions, 0 deletions
diff --git a/doc/rfc/rfc7938.txt b/doc/rfc/rfc7938.txt new file mode 100644 index 0000000..30d544b --- /dev/null +++ b/doc/rfc/rfc7938.txt @@ -0,0 +1,1963 @@ + + + + + + +Internet Engineering Task Force (IETF) P. Lapukhov +Request for Comments: 7938 Facebook +Category: Informational A. Premji +ISSN: 2070-1721 Arista Networks + J. Mitchell, Ed. + August 2016 + + + Use of BGP for Routing in Large-Scale Data Centers + +Abstract + + Some network operators build and operate data centers that support + over one hundred thousand servers. In this document, such data + centers are referred to as "large-scale" to differentiate them from + smaller infrastructures. Environments of this scale have a unique + set of network requirements with an emphasis on operational + simplicity and network stability. This document summarizes + operational experience in designing and operating large-scale data + centers using BGP as the only routing protocol. The intent is to + report on a proven and stable routing design that could be leveraged + by others in the industry. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Not all documents + approved by the IESG are a candidate for any level of Internet + Standard; see Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc7938. + + + + + + + + + + + + + +Lapukhov, et al. Informational [Page 1] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +Copyright Notice + + Copyright (c) 2016 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 + 2. Network Design Requirements . . . . . . . . . . . . . . . . . 4 + 2.1. Bandwidth and Traffic Patterns . . . . . . . . . . . . . 4 + 2.2. CAPEX Minimization . . . . . . . . . . . . . . . . . . . 4 + 2.3. OPEX Minimization . . . . . . . . . . . . . . . . . . . . 5 + 2.4. Traffic Engineering . . . . . . . . . . . . . . . . . . . 5 + 2.5. Summarized Requirements . . . . . . . . . . . . . . . . . 6 + 3. Data Center Topologies Overview . . . . . . . . . . . . . . . 6 + 3.1. Traditional DC Topology . . . . . . . . . . . . . . . . . 6 + 3.2. Clos Network Topology . . . . . . . . . . . . . . . . . . 7 + 3.2.1. Overview . . . . . . . . . . . . . . . . . . . . . . 7 + 3.2.2. Clos Topology Properties . . . . . . . . . . . . . . 8 + 3.2.3. Scaling the Clos Topology . . . . . . . . . . . . . . 9 + 3.2.4. Managing the Size of Clos Topology Tiers . . . . . . 10 + 4. Data Center Routing Overview . . . . . . . . . . 
. . . . . . 11 + 4.1. L2-Only Designs . . . . . . . . . . . . . . . . . . . . . 11 + 4.2. Hybrid L2/L3 Designs . . . . . . . . . . . . . . . . . . 12 + 4.3. L3-Only Designs . . . . . . . . . . . . . . . . . . . . . 12 + 5. Routing Protocol Design . . . . . . . . . . . . . . . . . . . 13 + 5.1. Choosing EBGP as the Routing Protocol . . . . . . . . . . 13 + 5.2. EBGP Configuration for Clos Topology . . . . . . . . . . 15 + 5.2.1. EBGP Configuration Guidelines and Example ASN Scheme 15 + 5.2.2. Private Use ASNs . . . . . . . . . . . . . . . . . . 16 + 5.2.3. Prefix Advertisement . . . . . . . . . . . . . . . . 17 + 5.2.4. External Connectivity . . . . . . . . . . . . . . . . 18 + 5.2.5. Route Summarization at the Edge . . . . . . . . . . . 19 + 6. ECMP Considerations . . . . . . . . . . . . . . . . . . . . . 20 + 6.1. Basic ECMP . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.2. BGP ECMP over Multiple ASNs . . . . . . . . . . . . . . . 21 + 6.3. Weighted ECMP . . . . . . . . . . . . . . . . . . . . . . 21 + 6.4. Consistent Hashing . . . . . . . . . . . . . . . . . . . 22 + + + +Lapukhov, et al. Informational [Page 2] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + 7. Routing Convergence Properties . . . . . . . . . . . . . . . 22 + 7.1. Fault Detection Timing . . . . . . . . . . . . . . . . . 22 + 7.2. Event Propagation Timing . . . . . . . . . . . . . . . . 23 + 7.3. Impact of Clos Topology Fan-Outs . . . . . . . . . . . . 24 + 7.4. Failure Impact Scope . . . . . . . . . . . . . . . . . . 24 + 7.5. Routing Micro-Loops . . . . . . . . . . . . . . . . . . . 26 + 8. Additional Options for Design . . . . . . . . . . . . . . . . 26 + 8.1. Third-Party Route Injection . . . . . . . . . . . . . . . 26 + 8.2. Route Summarization within Clos Topology . . . . . . . . 27 + 8.2.1. Collapsing Tier 1 Devices Layer . . . . . . . . . . . 27 + 8.2.2. Simple Virtual Aggregation . . . . . . . . . . . . . 29 + 8.3. ICMP Unreachable Message Masquerading . . . . . . . . . . 29 + 9. Security Considerations . . . . . . . . . . . . . . . . . . . 30 + 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 + 10.1. Normative References . . . . . . . . . . . . . . . . . . 30 + 10.2. Informative References . . . . . . . . . . . . . . . . . 31 + Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 35 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 35 + +1. Introduction + + This document describes a practical routing design that can be used + in a large-scale data center (DC) design. Such data centers, also + known as "hyper-scale" or "warehouse-scale" data centers, have a + unique attribute of supporting over a hundred thousand servers. In + order to accommodate networks of this scale, operators are revisiting + networking designs and platforms to address this need. + + The design presented in this document is based on operational + experience with data centers built to support large-scale distributed + software infrastructure, such as a web search engine. The primary + requirements in such an environment are operational simplicity and + network stability so that a small group of people can effectively + support a significantly sized network. + + Experimentation and extensive testing have shown that External BGP + (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for + these types of data center applications. 
This is in contrast with + more traditional DC designs, which may use simple tree topologies and + rely on extending Layer 2 (L2) domains across multiple network + devices. This document elaborates on the requirements that led to + this design choice and presents details of the EBGP routing design as + well as exploring ideas for further enhancements. + + This document first presents an overview of network design + requirements and considerations for large-scale data centers. Then, + traditional hierarchical data center network topologies are + contrasted with Clos networks [CLOS1953] that are horizontally scaled + + + +Lapukhov, et al. Informational [Page 3] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + out. This is followed by arguments for selecting EBGP with a Clos + topology as the most appropriate routing protocol to meet the + requirements and the proposed design is described in detail. + Finally, this document reviews some additional considerations and + design options. A thorough understanding of BGP is assumed by a + reader planning on deploying the design described within the + document. + +2. Network Design Requirements + + This section describes and summarizes network design requirements for + large-scale data centers. + +2.1. Bandwidth and Traffic Patterns + + The primary requirement when building an interconnection network for + a large number of servers is to accommodate application bandwidth and + latency requirements. Until recently it was quite common to see the + majority of traffic entering and leaving the data center, commonly + referred to as "north-south" traffic. Traditional "tree" topologies + were sufficient to accommodate such flows, even with high + oversubscription ratios between the layers of the network. If more + bandwidth was required, it was added by "scaling up" the network + elements, e.g., by upgrading the device's linecards or fabrics or + replacing the device with one with higher port density. + + Today many large-scale data centers host applications generating + significant amounts of server-to-server traffic, which does not + egress the DC, commonly referred to as "east-west" traffic. Examples + of such applications could be computer clusters such as Hadoop + [HADOOP], massive data replication between clusters needed by certain + applications, or virtual machine migrations. Scaling traditional + tree topologies to match these bandwidth demands becomes either too + expensive or impossible due to physical limitations, e.g., port + density in a switch. + +2.2. CAPEX Minimization + + The Capital Expenditures (CAPEX) associated with the network + infrastructure alone constitutes about 10-15% of total data center + expenditure (see [GREENBERG2009]). However, the absolute cost is + significant, and hence there is a need to constantly drive down the + cost of individual network elements. This can be accomplished in two + ways: + + o Unifying all network elements, preferably using the same hardware + type or even the same device. This allows for volume pricing on + bulk purchases and reduced maintenance and inventory costs. + + + +Lapukhov, et al. Informational [Page 4] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + o Driving costs down using competitive pressures, by introducing + multiple network equipment vendors. + + In order to allow for good vendor diversity, it is important to + minimize the software feature requirements for the network elements. 
+ This strategy provides maximum flexibility of vendor equipment + choices while enforcing interoperability using open standards. + +2.3. OPEX Minimization + + Operating large-scale infrastructure can be expensive as a larger + amount of elements will statistically fail more often. Having a + simpler design and operating using a limited software feature set + minimizes software issue-related failures. + + An important aspect of Operational Expenditure (OPEX) minimization is + reducing the size of failure domains in the network. Ethernet + networks are known to be susceptible to broadcast or unicast traffic + storms that can have a dramatic impact on network performance and + availability. The use of a fully routed design significantly reduces + the size of the data-plane failure domains, i.e., limits them to the + lowest level in the network hierarchy. However, such designs + introduce the problem of distributed control-plane failures. This + observation calls for simpler and less control-plane protocols to + reduce protocol interaction issues, reducing the chance of a network + meltdown. Minimizing software feature requirements as described in + the CAPEX section above also reduces testing and training + requirements. + +2.4. Traffic Engineering + + In any data center, application load balancing is a critical function + performed by network devices. Traditionally, load balancers are + deployed as dedicated devices in the traffic forwarding path. The + problem arises in scaling load balancers under growing traffic + demand. A preferable solution would be able to scale the load- + balancing layer horizontally, by adding more of the uniform nodes and + distributing incoming traffic across these nodes. In situations like + this, an ideal choice would be to use network infrastructure itself + to distribute traffic across a group of load balancers. The + combination of anycast prefix advertisement [RFC4786] and Equal Cost + Multipath (ECMP) functionality can be used to accomplish this goal. + To allow for more granular load distribution, it is beneficial for + the network to support the ability to perform controlled per-hop + traffic engineering. For example, it is beneficial to directly + control the ECMP next-hop set for anycast prefixes at every level of + the network hierarchy. + + + + +Lapukhov, et al. Informational [Page 5] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +2.5. Summarized Requirements + + This section summarizes the list of requirements outlined in the + previous sections: + + o REQ1: Select a topology that can be scaled "horizontally" by + adding more links and network devices of the same type without + requiring upgrades to the network elements themselves. + + o REQ2: Define a narrow set of software features/protocols supported + by a multitude of networking equipment vendors. + + o REQ3: Choose a routing protocol that has a simple implementation + in terms of programming code complexity and ease of operational + support. + + o REQ4: Minimize the failure domain of equipment or protocol issues + as much as possible. + + o REQ5: Allow for some traffic engineering, preferably via explicit + control of the routing prefix next hop using built-in protocol + mechanics. + +3. Data Center Topologies Overview + + This section provides an overview of two general types of data center + designs -- hierarchical (also known as "tree-based") and Clos-based + network designs. + +3.1. 
Traditional DC Topology + + In the networking industry, a common design choice for data centers + typically looks like an (upside down) tree with redundant uplinks and + three layers of hierarchy namely; core, aggregation/distribution, and + access layers (see Figure 1). To accommodate bandwidth demands, each + higher layer, from the server towards DC egress or WAN, has higher + port density and bandwidth capacity where the core functions as the + "trunk" of the tree-based design. To keep terminology uniform and + for comparison with other designs, in this document these layers will + be referred to as Tier 1, Tier 2 and Tier 3 "tiers", instead of core, + aggregation, or access layers. + + + + + + + + + + +Lapukhov, et al. Informational [Page 6] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + +------+ +------+ + | | | | + | |--| | Tier 1 + | | | | + +------+ +------+ + | | | | + +---------+ | | +----------+ + | +-------+--+------+--+-------+ | + | | | | | | | | + +----+ +----+ +----+ +----+ + | | | | | | | | + | |-----| | | |-----| | Tier 2 + | | | | | | | | + +----+ +----+ +----+ +----+ + | | | | + | | | | + | +-----+ | | +-----+ | + +-| |-+ +-| |-+ Tier 3 + +-----+ +-----+ + | | | | | | + <- Servers -> <- Servers -> + + Figure 1: Typical DC Network Topology + + Unfortunately, as noted previously, it is not possible to scale a + tree-based design to a large enough degree for handling large-scale + designs due to the inability to be able to acquire Tier 1 devices + with a large enough port density to sufficiently scale Tier 2. Also, + continuous upgrades or replacement of the upper-tier devices are + required as deployment size or bandwidth requirements increase, which + is operationally complex. For this reason, REQ1 is in place, + eliminating this type of design from consideration. + +3.2. Clos Network Topology + + This section describes a common design for horizontally scalable + topology in large-scale data centers in order to meet REQ1. + +3.2.1. Overview + + A common choice for a horizontally scalable topology is a folded Clos + topology, sometimes called "fat-tree" (for example, [INTERCON] and + [ALFARES2008]). This topology features an odd number of stages + (sometimes known as "dimensions") and is commonly made of uniform + elements, e.g., network switches with the same port count. + Therefore, the choice of folded Clos topology satisfies REQ1 and + + + + + +Lapukhov, et al. Informational [Page 7] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + facilitates REQ2. 
See Figure 2 below for an example of a folded + 3-stage Clos topology (3 stages counting Tier 2 stage twice, when + tracing a packet flow): + + +-------+ + | |----------------------------+ + | |------------------+ | + | |--------+ | | + +-------+ | | | + +-------+ | | | + | |--------+---------+-------+ | + | |--------+-------+ | | | + | |------+ | | | | | + +-------+ | | | | | | + +-------+ | | | | | | + | |------+-+-------+-+-----+ | | + | |------+-+-----+ | | | | | + | |----+ | | | | | | | | + +-------+ | | | | | | ---------> M links + Tier 1 | | | | | | | | | + +-------+ +-------+ +-------+ + | | | | | | + | | | | | | Tier 2 + | | | | | | + +-------+ +-------+ +-------+ + | | | | | | | | | + | | | | | | ---------> N Links + | | | | | | | | | + O O O O O O O O O Servers + + Figure 2: 3-Stage Folded Clos Topology + + This topology is often also referred to as a "Leaf and Spine" + network, where "Spine" is the name given to the middle stage of the + Clos topology (Tier 1) and "Leaf" is the name of input/output stage + (Tier 2). For uniformity, this document will refer to these layers + using the "Tier n" notation. + +3.2.2. Clos Topology Properties + + The following are some key properties of the Clos topology: + + o The topology is fully non-blocking, or more accurately non- + interfering, if M >= N and oversubscribed by a factor of N/M + otherwise. Here M and N is the uplink and downlink port count + respectively, for a Tier 2 switch as shown in Figure 2. + + + + + +Lapukhov, et al. Informational [Page 8] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + o Utilizing this topology requires control and data-plane support + for ECMP with a fan-out of M or more. + + o Tier 1 switches have exactly one path to every server in this + topology. This is an important property that makes route + summarization dangerous in this topology (see Section 8.2 below). + + o Traffic flowing from server to server is load balanced over all + available paths using ECMP. + +3.2.3. Scaling the Clos Topology + + A Clos topology can be scaled either by increasing network element + port density or by adding more stages, e.g., moving to a 5-stage + Clos, as illustrated in Figure 3 below: + + Tier 1 + +-----+ + Cluster | | + +----------------------------+ +--| |--+ + | | | +-----+ | + | Tier 2 | | | Tier 2 + | +-----+ | | +-----+ | +-----+ + | +-------------| DEV |------+--| |--+--| |-------------+ + | | +-----| C |------+ | | +--| |-----+ | + | | | +-----+ | +-----+ +-----+ | | + | | | | | | + | | | +-----+ | +-----+ +-----+ | | + | | +-----------| DEV |------+ | | +--| |-----------+ | + | | | | +---| D |------+--| |--+--| |---+ | | | + | | | | | +-----+ | | +-----+ | +-----+ | | | | + | | | | | | | | | | | | + | +-----+ +-----+ | | +-----+ | +-----+ +-----+ + | | DEV | | DEV | | +--| |--+ | | | | + | | A | | B | Tier 3 | | | Tier 3 | | | | + | +-----+ +-----+ | +-----+ +-----+ +-----+ + | | | | | | | | | | + | O O O O | O O O O + | Servers | Servers + +----------------------------+ + + Figure 3: 5-Stage Clos Topology + + The small example of topology in Figure 3 is built from devices with + a port count of 4. In this document, one set of directly connected + Tier 2 and Tier 3 devices along with their attached servers will be + referred to as a "cluster". For example, DEV A, B, C, D, and the + servers that connect to DEV A and B, on Figure 3 form a cluster. The + + + +Lapukhov, et al. 
Informational [Page 9] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + concept of a cluster may also be a useful concept as a single + deployment or maintenance unit that can be operated on at a different + frequency than the entire topology. + + In practice, Tier 3 of the network, which is typically Top-of-Rack + switches (ToRs), is where oversubscription is introduced to allow for + packaging of more servers in the data center while meeting the + bandwidth requirements for different types of applications. The main + reason to limit oversubscription at a single layer of the network is + to simplify application development that would otherwise need to + account for multiple bandwidth pools: within rack (Tier 3), between + racks (Tier 2), and between clusters (Tier 1). Since + oversubscription does not have a direct relationship to the routing + design, it is not discussed further in this document. + +3.2.4. Managing the Size of Clos Topology Tiers + + If a data center network size is small, it is possible to reduce the + number of switches in Tier 1 or Tier 2 of a Clos topology by a factor + of two. To understand how this could be done, take Tier 1 as an + example. Every Tier 2 device connects to a single group of Tier 1 + devices. If half of the ports on each of the Tier 1 devices are not + being used, then it is possible to reduce the number of Tier 1 + devices by half and simply map two uplinks from a Tier 2 device to + the same Tier 1 device that were previously mapped to different Tier + 1 devices. This technique maintains the same bandwidth while + reducing the number of elements in Tier 1, thus saving on CAPEX. The + tradeoff, in this example, is the reduction of maximum DC size in + terms of overall server count by half. + + In this example, Tier 2 devices will be using two parallel links to + connect to each Tier 1 device. If one of these links fails, the + other will pick up all traffic of the failed link, possibly resulting + in heavy congestion and quality of service degradation if the path + determination procedure does not take bandwidth amount into account, + since the number of upstream Tier 1 devices is likely wider than two. + To avoid this situation, parallel links can be grouped in link + aggregation groups (LAGs), e.g., [IEEE8023AD], with widely available + implementation settings that take the whole "bundle" down upon a + single link failure. Equivalent techniques that enforce "fate + sharing" on the parallel links can be used in place of LAGs to + achieve the same effect. As a result of such fate-sharing, traffic + from two or more failed links will be rebalanced over the multitude + of remaining paths that equals the number of Tier 1 devices. This + example is using two links for simplicity, having more links in a + bundle will have less impact on capacity upon a member-link failure. + + + + + +Lapukhov, et al. Informational [Page 10] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +4. Data Center Routing Overview + + This section provides an overview of three general types of data + center protocol designs -- Layer 2 only, Hybrid Layer L2/L3, and + Layer 3 only. + +4.1. L2-Only Designs + + Originally, most data center designs used Spanning Tree Protocol + (STP) originally defined in [IEEE8021D-1990] for loop-free topology + creation, typically utilizing variants of the traditional DC topology + described in Section 3.1. 
At the time, many DC switches either did + not support Layer 3 routing protocols or supported them with + additional licensing fees, which played a part in the design choice. + Although many enhancements have been made through the introduction of + Rapid Spanning Tree Protocol (RSTP) in the latest revision of + [IEEE8021D-2004] and Multiple Spanning Tree Protocol (MST) specified + in [IEEE8021Q] that increase convergence, stability, and load- + balancing in larger topologies, many of the fundamentals of the + protocol limit its applicability in large-scale DCs. STP and its + newer variants use an active/standby approach to path selection, and + are therefore hard to deploy in horizontally scaled topologies as + described in Section 3.2. Further, operators have had many + experiences with large failures due to issues caused by improper + cabling, misconfiguration, or flawed software on a single device. + These failures regularly affected the entire spanning-tree domain and + were very hard to troubleshoot due to the nature of the protocol. + For these reasons, and since almost all DC traffic is now IP, + therefore requiring a Layer 3 routing protocol at the network edge + for external connectivity, designs utilizing STP usually fail all of + the requirements of large-scale DC operators. Various enhancements + to link-aggregation protocols such as [IEEE8023AD], generally known + as Multi-Chassis Link-Aggregation (M-LAG) made it possible to use + Layer 2 designs with active-active network paths while relying on STP + as the backup for loop prevention. The major downsides of this + approach are the lack of ability to scale linearly past two in most + implementations, lack of standards-based implementations, and the + added failure domain risk of syncing state between the devices. + + It should be noted that building large, horizontally scalable, + L2-only networks without STP is possible recently through the + introduction of the Transparent Interconnection of Lots of Links + (TRILL) protocol in [RFC6325]. TRILL resolves many of the issues STP + has for large-scale DC design however, due to the limited number of + implementations, and often the requirement for specific equipment + that supports it, this has limited its applicability and increased + the cost of such designs. + + + + +Lapukhov, et al. Informational [Page 11] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + Finally, neither the base TRILL specification nor the M-LAG approach + totally eliminate the problem of the shared broadcast domain that is + so detrimental to the operations of any Layer 2, Ethernet-based + solution. Later TRILL extensions have been proposed to solve the + this problem statement, primarily based on the approaches outlined in + [RFC7067], but this even further limits the number of available + interoperable implementations that can be used to build a fabric. + Therefore, TRILL-based designs have issues meeting REQ2, REQ3, and + REQ4. + +4.2. Hybrid L2/L3 Designs + + Operators have sought to limit the impact of data-plane faults and + build large-scale topologies through implementing routing protocols + in either the Tier 1 or Tier 2 parts of the network and dividing the + Layer 2 domain into numerous, smaller domains. This design has + allowed data centers to scale up, but at the cost of complexity in + managing multiple network protocols. 
For the following reasons, + operators have retained Layer 2 in either the access (Tier 3) or both + access and aggregation (Tier 3 and Tier 2) parts of the network: + + o Supporting legacy applications that may require direct Layer 2 + adjacency or use non-IP protocols. + + o Seamless mobility for virtual machines that require the + preservation of IP addresses when a virtual machine moves to a + different Tier 3 switch. + + o Simplified IP addressing = less IP subnets are required for the + data center. + + o Application load balancing may require direct Layer 2 reachability + to perform certain functions such as Layer 2 Direct Server Return + (DSR). See [L3DSR]. + + o Continued CAPEX differences between L2- and L3-capable switches. + +4.3. L3-Only Designs + + Network designs that leverage IP routing down to Tier 3 of the + network have gained popularity as well. The main benefit of these + designs is improved network stability and scalability, as a result of + confining L2 broadcast domains. Commonly, an Interior Gateway + Protocol (IGP) such as Open Shortest Path First (OSPF) [RFC2328] is + used as the primary routing protocol in such a design. As data + centers grow in scale, and server count exceeds tens of thousands, + such fully routed designs have become more attractive. + + + + +Lapukhov, et al. Informational [Page 12] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + Choosing a L3-only design greatly simplifies the network, + facilitating the meeting of REQ1 and REQ2, and has widespread + adoption in networks where large Layer 2 adjacency and larger size + Layer 3 subnets are not as critical compared to network scalability + and stability. Application providers and network operators continue + to develop new solutions to meet some of the requirements that + previously had driven large Layer 2 domains by using various overlay + or tunneling techniques. + +5. Routing Protocol Design + + In this section, the motivations for using External BGP (EBGP) as the + single routing protocol for data center networks having a Layer 3 + protocol design and Clos topology are reviewed. Then, a practical + approach for designing an EBGP-based network is provided. + +5.1. Choosing EBGP as the Routing Protocol + + REQ2 would give preference to the selection of a single routing + protocol to reduce complexity and interdependencies. While it is + common to rely on an IGP in this situation, sometimes with either the + addition of EBGP at the device bordering the WAN or Internal BGP + (IBGP) throughout, this document proposes the use of an EBGP-only + design. + + Although EBGP is the protocol used for almost all Inter-Domain + Routing in the Internet and has wide support from both vendor and + service provider communities, it is not generally deployed as the + primary routing protocol within the data center for a number of + reasons (some of which are interrelated): + + o BGP is perceived as a "WAN-only, protocol-only" and not often + considered for enterprise or data center applications. + + o BGP is believed to have a "much slower" routing convergence + compared to IGPs. + + o Large-scale BGP deployments typically utilize an IGP for BGP next- + hop resolution as all nodes in the IBGP topology are not directly + connected. + + o BGP is perceived to require significant configuration overhead and + does not support neighbor auto-discovery. + + + + + + + + +Lapukhov, et al. 
Informational [Page 13] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + This document discusses some of these perceptions, especially as + applicable to the proposed design, and highlights some of the + advantages of using the protocol such as: + + o BGP has less complexity in parts of its protocol design -- + internal data structures and state machine are simpler as compared + to most link-state IGPs such as OSPF. For example, instead of + implementing adjacency formation, adjacency maintenance and/or + flow-control, BGP simply relies on TCP as the underlying + transport. This fulfills REQ2 and REQ3. + + o BGP information flooding overhead is less when compared to link- + state IGPs. Since every BGP router calculates and propagates only + the best-path selected, a network failure is masked as soon as the + BGP speaker finds an alternate path, which exists when highly + symmetric topologies, such as Clos, are coupled with an EBGP-only + design. In contrast, the event propagation scope of a link-state + IGP is an entire area, regardless of the failure type. In this + way, BGP better meets REQ3 and REQ4. It is also worth mentioning + that all widely deployed link-state IGPs feature periodic + refreshes of routing information while BGP does not expire routing + state, although this rarely impacts modern router control planes. + + o BGP supports third-party (recursively resolved) next hops. This + allows for manipulating multipath to be non-ECMP-based or + forwarding-based on application-defined paths, through + establishment of a peering session with an application + "controller" that can inject routing information into the system, + satisfying REQ5. OSPF provides similar functionality using + concepts such as "Forwarding Address", but with more difficulty in + implementation and far less control of information propagation + scope. + + o Using a well-defined Autonomous System Number (ASN) allocation + scheme and standard AS_PATH loop detection, "BGP path hunting" + (see [JAKMA2008]) can be controlled and complex unwanted paths + will be ignored. See Section 5.2 for an example of a working ASN + allocation scheme. In a link-state IGP, accomplishing the same + goal would require multi-(instance/topology/process) support, + typically not available in all DC devices and quite complex to + configure and troubleshoot. Using a traditional single flooding + domain, which most DC designs utilize, under certain failure + conditions may pick up unwanted lengthy paths, e.g., traversing + multiple Tier 2 devices. + + + + + + + +Lapukhov, et al. Informational [Page 14] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + o EBGP configuration that is implemented with minimal routing policy + is easier to troubleshoot for network reachability issues. In + most implementations, it is straightforward to view contents of + the BGP Loc-RIB and compare it to the router's Routing Information + Base (RIB). Also, in most implementations, an operator can view + every BGP neighbors Adj-RIB-In and Adj-RIB-Out structures, and + therefore incoming and outgoing Network Layer Reachability + Information (NLRI) information can be easily correlated on both + sides of a BGP session. Thus, BGP satisfies REQ3. + +5.2. EBGP Configuration for Clos Topology + + Clos topologies that have more than 5 stages are very uncommon due to + the large numbers of interconnects required by such a design. + Therefore, the examples below are made with reference to the 5-stage + Clos topology (in unfolded state). + +5.2.1. 
EBGP Configuration Guidelines and Example ASN Scheme + + The diagram below illustrates an example of an ASN allocation scheme. + The following is a list of guidelines that can be used: + + o EBGP single-hop sessions are established over direct point-to- + point links interconnecting the network nodes, no multi-hop or + loopback sessions are used, even in the case of multiple links + between the same pair of nodes. + + o Private Use ASNs from the range 64512-65534 are used to avoid ASN + conflicts. + + o A single ASN is allocated to all of the Clos topology's Tier 1 + devices. + + o A unique ASN is allocated to each set of Tier 2 devices in the + same cluster. + + o A unique ASN is allocated to every Tier 3 device (e.g., ToR) in + this topology. + + + + + + + + + + + + + +Lapukhov, et al. Informational [Page 15] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + ASN 65534 + +---------+ + | +-----+ | + | | | | + +-|-| |-|-+ + | | +-----+ | | + ASN 646XX | | | | ASN 646XX + +---------+ | | | | +---------+ + | +-----+ | | | +-----+ | | | +-----+ | + +-----------|-| |-|-+-|-| |-|-+-|-| |-|-----------+ + | +---|-| |-|-+ | | | | +-|-| |-|---+ | + | | | +-----+ | | +-----+ | | +-----+ | | | + | | | | | | | | | | + | | | | | | | | | | + | | | +-----+ | | +-----+ | | +-----+ | | | + | +-----+---|-| |-|-+ | | | | +-|-| |-|---+-----+ | + | | | +-|-| |-|-+-|-| |-|-+-|-| |-|-+ | | | + | | | | | +-----+ | | | +-----+ | | | +-----+ | | | | | + | | | | +---------+ | | | | +---------+ | | | | + | | | | | | | | | | | | + +-----+ +-----+ | | +-----+ | | +-----+ +-----+ + | ASN | | | +-|-| |-|-+ | | | | + |65YYY| | ... | | | | | | ... | | ... | + +-----+ +-----+ | +-----+ | +-----+ +-----+ + | | | | +---------+ | | | | + O O O O <- Servers -> O O O O + + Figure 4: BGP ASN Layout for 5-Stage Clos + +5.2.2. Private Use ASNs + + The original range of Private Use ASNs [RFC6996] limited operators to + 1023 unique ASNs. Since it is quite likely that the number of + network devices may exceed this number, a workaround is required. + One approach is to re-use the ASNs assigned to the Tier 3 devices + across different clusters. For example, Private Use ASNs 65001, + 65002 ... 65032 could be used within every individual cluster and + assigned to Tier 3 devices. + + To avoid route suppression due to the AS_PATH loop detection + mechanism in BGP, upstream EBGP sessions on Tier 3 devices must be + configured with the "Allowas-in" feature [ALLOWASIN] that allows + accepting a device's own ASN in received route advertisements. + Although this feature is not standardized, it is widely available + across multiple vendors implementations. Introducing this feature + does not make routing loops more likely in the design since the + AS_PATH is being added to by routers at each of the topology tiers + and AS_PATH length is an early tie breaker in the BGP path selection + + + +Lapukhov, et al. Informational [Page 16] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + process. Further loop protection is still in place at the Tier 1 + device, which will not accept routes with a path including its own + ASN. Tier 2 devices do not have direct connectivity with each other. + + Another solution to this problem would be to use Four-Octet ASNs + ([RFC6793]), where there are additional Private Use ASNs available, + see [IANA.AS]. Use of Four-Octet ASNs puts additional protocol + complexity in the BGP implementation and should be balanced against + the complexity of re-use when considering REQ3 and REQ4. 
Perhaps + more importantly, they are not yet supported by all BGP + implementations, which may limit vendor selection of DC equipment. + When supported, ensure that deployed implementations are able to + remove the Private Use ASNs when external connectivity + (Section 5.2.4) to these ASNs is required. + +5.2.3. Prefix Advertisement + + A Clos topology features a large number of point-to-point links and + associated prefixes. Advertising all of these routes into BGP may + create Forwarding Information Base (FIB) overload in the network + devices. Advertising these links also puts additional path + computation stress on the BGP control plane for little benefit. + There are two possible solutions: + + o Do not advertise any of the point-to-point links into BGP. Since + the EBGP-based design changes the next-hop address at every + device, distant networks will automatically be reachable via the + advertising EBGP peer and do not require reachability to these + prefixes. However, this may complicate operations or monitoring: + e.g., using the popular "traceroute" tool will display IP + addresses that are not reachable. + + o Advertise point-to-point links, but summarize them on every + device. This requires an address allocation scheme such as + allocating a consecutive block of IP addresses per Tier 1 and Tier + 2 device to be used for point-to-point interface addressing to the + lower layers (Tier 2 uplinks will be allocated from Tier 1 address + blocks and so forth). + + Server subnets on Tier 3 devices must be announced into BGP without + using route summarization on Tier 2 and Tier 1 devices. Summarizing + subnets in a Clos topology results in route black-holing under a + single link failure (e.g., between Tier 2 and Tier 3 devices), and + hence must be avoided. The use of peer links within the same tier to + resolve the black-holing problem by providing "bypass paths" is + undesirable due to O(N^2) complexity of the peering-mesh and waste of + ports on the devices. An alternative to the full mesh of peer links + would be to use a simpler bypass topology, e.g., a "ring" as + + + +Lapukhov, et al. Informational [Page 17] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + described in [FB4POST], but such a topology adds extra hops and has + limited bandwidth. It may require special tweaks to make BGP routing + work, e.g., splitting every device into an ASN of its own. Later in + this document, Section 8.2 introduces a less intrusive method for + performing a limited form of route summarization in Clos networks and + discusses its associated tradeoffs. + +5.2.4. External Connectivity + + A dedicated cluster (or clusters) in the Clos topology could be used + for the purpose of connecting to the Wide Area Network (WAN) edge + devices, or WAN Routers. Tier 3 devices in such a cluster would be + replaced with WAN routers, and EBGP peering would be used again, + though WAN routers are likely to belong to a public ASN if Internet + connectivity is required in the design. The Tier 2 devices in such a + dedicated cluster will be referred to as "Border Routers" in this + document. These devices have to perform a few special functions: + + o Hide network topology information when advertising paths to WAN + routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH + attribute. 
This is typically done to avoid ASN number collisions + between different data centers and also to provide a uniform + AS_PATH length to the WAN for purposes of WAN ECMP to anycast + prefixes originated in the topology. An implementation-specific + BGP feature typically called "Remove Private AS" is commonly used + to accomplish this. Depending on implementation, the feature + should strip a contiguous sequence of Private Use ASNs found in an + AS_PATH attribute prior to advertising the path to a neighbor. + This assumes that all ASNs used for intra data center numbering + are from the Private Use ranges. The process for stripping the + Private Use ASNs is not currently standardized, see [REMOVAL]. + However, most implementations at least follow the logic described + in this vendor's document [VENDOR-REMOVE-PRIVATE-AS], which is + enough for the design specified. + + o Originate a default route to the data center devices. This is the + only place where a default route can be originated, as route + summarization is risky for the unmodified Clos topology. + Alternatively, Border Routers may simply relay the default route + learned from WAN routers. Advertising the default route from + Border Routers requires that all Border Routers be fully connected + to the WAN Routers upstream, to provide resistance to a single- + link failure causing the black-holing of traffic. To prevent + black-holing in the situation when all of the EBGP sessions to the + WAN routers fail simultaneously on a given device, it is more + desirable to readvertise the default route rather than originating + the default route via complicated conditional route origination + schemes provided by some implementations [CONDITIONALROUTE]. + + + +Lapukhov, et al. Informational [Page 18] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +5.2.5. Route Summarization at the Edge + + It is often desirable to summarize network reachability information + prior to advertising it to the WAN network due to the high amount of + IP prefixes originated from within the data center in a fully routed + network design. For example, a network with 2000 Tier 3 devices will + have at least 2000 servers subnets advertised into BGP, along with + the infrastructure prefixes. However, as discussed in Section 5.2.3, + the proposed network design does not allow for route summarization + due to the lack of peer links inside every tier. + + However, it is possible to lift this restriction for the Border + Routers by devising a different connectivity model for these devices. + There are two options possible: + + o Interconnect the Border Routers using a full-mesh of physical + links or using any other "peer-mesh" topology, such as ring or + hub-and-spoke. Configure BGP accordingly on all Border Leafs to + exchange network reachability information, e.g., by adding a mesh + of IBGP sessions. The interconnecting peer links need to be + appropriately sized for traffic that will be present in the case + of a device or link failure in the mesh connecting the Border + Routers. + + o Tier 1 devices may have additional physical links provisioned + toward the Border Routers (which are Tier 2 devices from the + perspective of Tier 1). Specifically, if protection from a single + link or node failure is desired, each Tier 1 device would have to + connect to at least two Border Routers. 
This puts additional + requirements on the port count for Tier 1 devices and Border + Routers, potentially making it a nonuniform, larger port count, + device compared with the other devices in the Clos. This also + reduces the number of ports available to "regular" Tier 2 + switches, and hence the number of clusters that could be + interconnected via Tier 1. + + If any of the above options are implemented, it is possible to + perform route summarization at the Border Routers toward the WAN + network core without risking a routing black-hole condition under a + single link failure. Both of the options would result in nonuniform + topology as additional links have to be provisioned on some network + devices. + + + + + + + + + +Lapukhov, et al. Informational [Page 19] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +6. ECMP Considerations + + This section covers the Equal Cost Multipath (ECMP) functionality for + Clos topology and discusses a few special requirements. + +6.1. Basic ECMP + + ECMP is the fundamental load-sharing mechanism used by a Clos + topology. Effectively, every lower-tier device will use all of its + directly attached upper-tier devices to load-share traffic destined + to the same IP prefix. The number of ECMP paths between any two Tier + 3 devices in Clos topology is equal to the number of the devices in + the middle stage (Tier 1). For example, Figure 5 illustrates a + topology where Tier 3 device A has four paths to reach servers X and + Y, via Tier 2 devices B and C and then Tier 1 devices 1, 2, 3, and 4, + respectively. + + Tier 1 + +-----+ + | DEV | + +->| 1 |--+ + | +-----+ | + Tier 2 | | Tier 2 + +-----+ | +-----+ | +-----+ + +------------>| DEV |--+->| DEV |--+--| |-------------+ + | +-----| B |--+ | 2 | +--| |-----+ | + | | +-----+ +-----+ +-----+ | | + | | | | + | | +-----+ +-----+ +-----+ | | + | +-----+---->| DEV |--+ | DEV | +--| |-----+-----+ | + | | | +---| C |--+->| 3 |--+--| |---+ | | | + | | | | +-----+ | +-----+ | +-----+ | | | | + | | | | | | | | | | + +-----+ +-----+ | +-----+ | +-----+ +-----+ + | DEV | | | Tier 3 +->| DEV |--+ Tier 3 | | | | + | A | | | | 4 | | | | | + +-----+ +-----+ +-----+ +-----+ +-----+ + | | | | | | | | + O O O O <- Servers -> X Y O O + + Figure 5: ECMP Fan-Out Tree from A to X and Y + + The ECMP requirement implies that the BGP implementation must support + multipath fan-out for up to the maximum number of devices directly + attached at any point in the topology in the upstream or downstream + direction. Normally, this number does not exceed half of the ports + found on a device in the topology. For example, an ECMP fan-out of + 32 would be required when building a Clos network using 64-port + + + +Lapukhov, et al. Informational [Page 20] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + devices. The Border Routers may need to have wider fan-out to be + able to connect to a multitude of Tier 1 devices if route + summarization at Border Router level is implemented as described in + Section 5.2.5. If a device's hardware does not support wider ECMP, + logical link-grouping (link-aggregation at Layer 2) could be used to + provide "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) + to compensate for fan-out limitations. However, this approach + increases the risk of flow polarization, as less entropy will be + available at the second stage of ECMP. 
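   As a rough, non-normative sketch of the arithmetic above, the Python
   fragment below sizes a non-blocking 3-stage folded Clos built from
   identical K-port switches and reports the ECMP fan-out a BGP
   implementation would need to support.  The function name and the
   returned field names are illustrative assumptions, not part of the
   design itself.

      def clos_3stage_sizing(ports_per_switch: int) -> dict:
          """Dimensions of a non-blocking 3-stage folded Clos (Figure 2)."""
          k = ports_per_switch
          uplinks = downlinks = k // 2   # M = N = K/2 when non-blocking
          tier1_count = uplinks          # one Tier 1 device per Tier 2 uplink
          tier2_count = k                # each Tier 1 port reaches a distinct Tier 2
          return {
              "tier1_devices": tier1_count,
              "tier2_devices": tier2_count,
              "max_servers": tier2_count * downlinks,
              "required_ecmp_fanout": tier1_count,      # half the port count
              "oversubscription": downlinks / uplinks,  # N/M, 1.0 here
          }

      # 64-port switches -> ECMP fan-out of 32, matching the example above.
      print(clos_3stage_sizing(64))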
+ + Most BGP implementations declare paths to be equal from an ECMP + perspective if they match up to and including step (e) in + Section 9.1.2.2 of [RFC4271]. In the proposed network design there + is no underlying IGP, so all IGP costs are assumed to be zero or + otherwise the same value across all paths and policies may be applied + as necessary to equalize BGP attributes that vary in vendor defaults, + such as the MULTI_EXIT_DISC (MED) attribute and origin code. For + historical reasons, it is also useful to not use 0 as the equalized + MED value; this and some other useful BGP information is available in + [RFC4277]. Routing loops are unlikely due to the BGP best-path + selection process (which prefers shorter AS_PATH length), and longer + paths through the Tier 1 devices (which don't allow their own ASN in + the path) are not possible. + +6.2. BGP ECMP over Multiple ASNs + + For application load-balancing purposes, it is desirable to have the + same prefix advertised from multiple Tier 3 devices. From the + perspective of other devices, such a prefix would have BGP paths with + different AS_PATH attribute values, while having the same AS_PATH + attribute lengths. Therefore, BGP implementations must support load- + sharing over the above-mentioned paths. This feature is sometimes + known as "multipath relax" or "multipath multiple-AS" and effectively + allows for ECMP to be done across different neighboring ASNs if all + other attributes are equal as already described in the previous + section. + +6.3. Weighted ECMP + + It may be desirable for the network devices to implement "weighted" + ECMP, to be able to send more traffic over some paths in ECMP fan- + out. This could be helpful to compensate for failures in the network + and send more traffic over paths that have more capacity. The + prefixes that require weighted ECMP would have to be injected using + remote BGP speaker (central agent) over a multi-hop session as + described further in Section 8.1. If support in implementations is + available, weight distribution for multiple BGP paths could be + signaled using the technique described in [LINK]. + + + +Lapukhov, et al. Informational [Page 21] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +6.4. Consistent Hashing + + It is often desirable to have the hashing function used for ECMP to + be consistent (see [CONS-HASH]), to minimize the impact on flow to + next-hop affinity changes when a next hop is added or removed to an + ECMP group. This could be used if the network device is used as a + load balancer, mapping flows toward multiple destinations -- in this + case, losing or adding a destination will not have a detrimental + effect on currently established flows. One particular recommendation + on implementing consistent hashing is provided in [RFC2992], though + other implementations are possible. This functionality could be + naturally combined with weighted ECMP, with the impact of the next + hop changes being proportional to the weight of the given next hop. + The downside of consistent hashing is increased load on hardware + resource utilization, as typically more resources (e.g., Ternary + Content-Addressable Memory (TCAM) space) are required to implement a + consistent-hashing function. + +7. Routing Convergence Properties + + This section reviews routing convergence properties in the proposed + design. 
A case is made that sub-second convergence is achievable if + the implementation supports fast EBGP peering session deactivation + and timely RIB and FIB updates upon failure of the associated link. + +7.1. Fault Detection Timing + + BGP typically relies on an IGP to route around link/node failures + inside an AS, and implements either a polling-based or an event- + driven mechanism to obtain updates on IGP state changes. The + proposed routing design does not use an IGP, so the remaining + mechanisms that could be used for fault detection are BGP keep-alive + time-out (or any other type of keep-alive mechanism) and link-failure + triggers. + + Relying solely on BGP keep-alive packets may result in high + convergence delays, on the order of multiple seconds (on many BGP + implementations the minimum configurable BGP hold timer value is + three seconds). However, many BGP implementations can shut down + local EBGP peering sessions in response to the "link down" event for + the outgoing interface used for BGP peering. This feature is + sometimes called "fast fallover". Since links in modern data centers + are predominantly point-to-point fiber connections, a physical + interface failure is often detected in milliseconds and subsequently + triggers a BGP reconvergence. + + + + + + +Lapukhov, et al. Informational [Page 22] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + Ethernet links may support failure signaling or detection standards + such as Connectivity Fault Management (CFM) as described in + [IEEE8021Q]; this may make failure detection more robust. + Alternatively, some platforms may support Bidirectional Forwarding + Detection (BFD) [RFC5880] to allow for sub-second failure detection + and fault signaling to the BGP process. However, the use of either + of these presents additional requirements to vendor software and + possibly hardware, and may contradict REQ1. Until recently with + [RFC7130], BFD also did not allow detection of a single member link + failure on a LAG, which would have limited its usefulness in some + designs. + +7.2. Event Propagation Timing + + In the proposed design, the impact of the BGP + MinRouteAdvertisementIntervalTimer (MRAI timer), as specified in + Section 9.2.1.1 of [RFC4271], should be considered. Per the + standard, it is required for BGP implementations to space out + consecutive BGP UPDATE messages by at least MRAI seconds, which is + often a configurable value. The initial BGP UPDATE messages after an + event carrying withdrawn routes are commonly not affected by this + timer. The MRAI timer may present significant convergence delays + when a BGP speaker "waits" for the new path to be learned from its + peers and has no local backup path information. + + In a Clos topology, each EBGP speaker typically has either one path + (Tier 2 devices don't accept paths from other Tier 2 in the same + cluster due to same ASN) or N paths for the same prefix, where N is a + significantly large number, e.g., N=32 (the ECMP fan-out to the next + tier). Therefore, if a link fails to another device from which a + path is received there is either no backup path at all (e.g., from + the perspective of a Tier 2 switch losing the link to a Tier 3 + device), or the backup is readily available in BGP Loc-RIB (e.g., + from the perspective of a Tier 2 device losing the link to a Tier 1 + switch). In the former case, the BGP withdrawal announcement will + propagate without delay and trigger reconvergence on affected + devices. 
In the latter case, the best path will be re-evaluated, and + the local ECMP group corresponding to the new next-hop set will be + changed. If the BGP path was the best path selected previously, an + "implicit withdraw" will be sent via a BGP UPDATE message as + described as Option b in Section 3.1 of [RFC4271] due to the BGP + AS_PATH attribute changing. + + + + + + + + + +Lapukhov, et al. Informational [Page 23] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +7.3. Impact of Clos Topology Fan-Outs + + Clos topology has large fan-outs, which may impact the "Up->Down" + convergence in some cases, as described in this section. In a + situation when a link between Tier 3 and Tier 2 device fails, the + Tier 2 device will send BGP UPDATE messages to all upstream Tier 1 + devices, withdrawing the affected prefixes. The Tier 1 devices, in + turn, will relay these messages to all downstream Tier 2 devices + (except for the originator). Tier 2 devices other than the one + originating the UPDATE should then wait for ALL upstream Tier 1 + devices to send an UPDATE message before removing the affected + prefixes and sending corresponding UPDATE downstream to connected + Tier 3 devices. If the original Tier 2 device or the relaying Tier 1 + devices introduce some delay into their UPDATE message announcements, + the result could be UPDATE message "dispersion", that could be as + long as multiple seconds. In order to avoid such a behavior, BGP + implementations must support "update groups". The "update group" is + defined as a collection of neighbors sharing the same outbound policy + -- the local speaker will send BGP updates to the members of the + group synchronously. + + The impact of such "dispersion" grows with the size of topology fan- + out and could also grow under network convergence churn. Some + operators may be tempted to introduce "route flap dampening" type + features that vendors include to reduce the control-plane impact of + rapidly flapping prefixes. However, due to issues described with + false positives in these implementations especially under such + "dispersion" events, it is not recommended to enable this feature in + this design. More background and issues with "route flap dampening" + and possible implementation changes that could affect this are well + described in [RFC7196]. + +7.4. Failure Impact Scope + + A network is declared to converge in response to a failure once all + devices within the failure impact scope are notified of the event and + have recalculated their RIBs and consequently updated their FIBs. + Larger failure impact scope typically means slower convergence since + more devices have to be notified, and results in a less stable + network. In this section, we describe BGP's advantages over link- + state routing protocols in reducing failure impact scope for a Clos + topology. + + BGP behaves like a distance-vector protocol in the sense that only + the best path from the point of view of the local router is sent to + neighbors. As such, some failures are masked if the local node can + immediately find a backup path and does not have to send any updates + further. Notice that in the worst case, all devices in a data center + + + +Lapukhov, et al. Informational [Page 24] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + topology have to either withdraw a prefix completely or update the + ECMP groups in their FIBs. However, many failures will not result in + such a wide impact. 
There are two main failure types where impact + scope is reduced: + + o Failure of a link between Tier 2 and Tier 1 devices: In this case, + a Tier 2 device will update the affected ECMP groups, removing the + failed link. There is no need to send new information to + downstream Tier 3 devices, unless the path was selected as best by + the BGP process, in which case only an "implicit withdraw" needs + to be sent and this should not affect forwarding. The affected + Tier 1 device will lose the only path available to reach a + particular cluster and will have to withdraw the associated + prefixes. Such a prefix withdrawal process will only affect Tier + 2 devices directly connected to the affected Tier 1 device. The + Tier 2 devices receiving the BGP UPDATE messages withdrawing + prefixes will simply have to update their ECMP groups. The Tier 3 + devices are not involved in the reconvergence process. + + o Failure of a Tier 1 device: In this case, all Tier 2 devices + directly attached to the failed node will have to update their + ECMP groups for all IP prefixes from a non-local cluster. The + Tier 3 devices are once again not involved in the reconvergence + process, but may receive "implicit withdraws" as described above. + + Even in the case of such failures where multiple IP prefixes will + have to be reprogrammed in the FIB, it is worth noting that all of + these prefixes share a single ECMP group on a Tier 2 device. + Therefore, in the case of implementations with a hierarchical FIB, + only a single change has to be made to the FIB. "Hierarchical FIB" + here means FIB structure where the next-hop forwarding information is + stored separately from the prefix lookup table, and the latter only + stores pointers to the respective forwarding information. See + [BGP-PIC] for discussion of FIB hierarchies and fast convergence. + + Even though BGP offers reduced failure scope for some cases, further + reduction of the fault domain using summarization is not always + possible with the proposed design, since using this technique may + create routing black-holes as mentioned previously. Therefore, the + worst failure impact scope on the control plane is the network as a + whole -- for instance, in the case of a link failure between Tier 2 + and Tier 3 devices. The amount of impacted prefixes in this case + would be much less than in the case of a failure in the upper layers + of a Clos network topology. The property of having such large + failure scope is not a result of choosing EBGP in the design but + rather a result of using the Clos topology. + + + + + +Lapukhov, et al. Informational [Page 25] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +7.5. Routing Micro-Loops + + When a downstream device, e.g., Tier 2 device, loses all paths for a + prefix, it normally has the default route pointing toward the + upstream device -- in this case, the Tier 1 device. As a result, it + is possible to get in the situation where a Tier 2 switch loses a + prefix, but a Tier 1 switch still has the path pointing to the Tier 2 + device; this results in a transient micro-loop, since the Tier 1 + switch will keep passing packets to the affected prefix back to the + Tier 2 device, and the Tier 2 will bounce them back again using the + default route. This micro-loop will last for the time it takes the + upstream device to fully update its forwarding tables. 
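+
+   To make the sequence above concrete, the following short sketch (in
+   Python, not part of the original design) simulates two forwarding
+   tables during such a convergence event; the device names, prefixes,
+   and simplified longest-prefix-match logic are purely illustrative.
+   It also shows the "discard route" mitigation discussed in the next
+   paragraph:
+
+      import ipaddress
+
+      def plen(prefix):
+          return ipaddress.ip_network(prefix).prefixlen
+
+      def lookup(fib, dst):
+          """Longest-prefix match over a dict of {prefix: next_hop}."""
+          addr = ipaddress.ip_address(dst)
+          matches = [p for p in fib
+                     if addr in ipaddress.ip_network(p)]
+          return fib[max(matches, key=plen)] if matches else None
+
+      # Tier 2 has lost all paths to 10.1.1.0/24 and retains only the
+      # default route pointing up; Tier 1 has not yet processed the
+      # withdrawal and still points the /24 down at Tier 2.
+      tier2_fib = {"0.0.0.0/0": "tier1"}
+      tier1_fib = {"0.0.0.0/0": "wan", "10.1.1.0/24": "tier2"}
+      fibs = {"tier2": tier2_fib, "tier1": tier1_fib}
+
+      def trace(start, dst, max_hops=4):
+          node, path = start, [start]
+          for _ in range(max_hops):
+              nh = lookup(fibs[node], dst)
+              if nh not in fibs:   # forwarded out of the model
+                  return path + [nh]
+              node = nh
+              path.append(node)
+          return path + ["...loop"]
+
+      print(trace("tier2", "10.1.1.1"))
+      # ['tier2', 'tier1', 'tier2', 'tier1', 'tier2', '...loop']
+
+      # A static discard route covering the Tier 2 device's server
+      # summary is more specific than the default and stops the loop:
+      tier2_fib["10.1.0.0/16"] = "discard"
+      print(trace("tier2", "10.1.1.1"))
+      # ['tier2', 'discard']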
+ + To minimize impact of such micro-loops, Tier 2 and Tier 1 switches + can be configured with static "discard" or "null" routes that will be + more specific than the default route for prefixes missing during + network convergence. For Tier 2 switches, the discard route should + be a summary route, covering all server subnets of the underlying + Tier 3 devices. For Tier 1 devices, the discard route should be a + summary covering the server IP address subnets allocated for the + whole data center. Those discard routes will only take precedence + for the duration of network convergence, until the device learns a + more specific prefix via a new path. + +8. Additional Options for Design + +8.1. Third-Party Route Injection + + BGP allows for a "third-party", i.e., a directly attached BGP + speaker, to inject routes anywhere in the network topology, meeting + REQ5. This can be achieved by peering via a multi-hop BGP session + with some or even all devices in the topology. Furthermore, BGP + diverse path distribution [RFC6774] could be used to inject multiple + BGP next hops for the same prefix to facilitate load balancing, or + using the BGP ADD-PATH capability [RFC7911] if supported by the + implementation. Unfortunately, in many implementations, ADD-PATH has + been found to only support IBGP properly in the use cases for which + it was originally optimized; this limits the "third-party" peering to + IBGP only. + + To implement route injection in the proposed design, a third-party + BGP speaker may peer with Tier 3 and Tier 1 switches, injecting the + same prefix, but using a special set of BGP next hops for Tier 1 + devices. Those next hops are assumed to resolve recursively via BGP, + and could be, for example, IP addresses on Tier 3 devices. The + resulting forwarding table programming could provide desired traffic + proportion distribution among different clusters. + + + + +Lapukhov, et al. Informational [Page 26] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +8.2. Route Summarization within Clos Topology + + As mentioned previously, route summarization is not possible within + the proposed Clos topology since it makes the network susceptible to + route black-holing under single link failures. The main problem is + the limited number of redundant paths between network elements, e.g., + there is only a single path between any pair of Tier 1 and Tier 3 + devices. However, some operators may find route aggregation + desirable to improve control-plane stability. + + If any technique to summarize within the topology is planned, + modeling of the routing behavior and potential for black-holing + should be done not only for single or multiple link failures, but + also for fiber pathway failures or optical domain failures when the + topology extends beyond a physical location. Simple modeling can be + done by checking the reachability on devices doing summarization + under the condition of a link or pathway failure between a set of + devices in every tier as well as to the WAN routers when external + connectivity is present. + + Route summarization would be possible with a small modification to + the network topology, though the tradeoff would be reduction of the + total size of the network as well as network congestion under + specific failures. This approach is very similar to the technique + described above, which allows Border Routers to summarize the entire + data center address space. + +8.2.1. 
Collapsing Tier 1 Devices Layer + + In order to add more paths between Tier 1 and Tier 3 devices, group + Tier 2 devices into pairs, and then connect the pairs to the same + group of Tier 1 devices. This is logically equivalent to + "collapsing" Tier 1 devices into a group of half the size, merging + the links on the "collapsed" devices. The result is illustrated in + Figure 6. For example, in this topology DEV C and DEV D connect to + the same set of Tier 1 devices (DEV 1 and DEV 2), whereas before they + were connecting to different groups of Tier 1 devices. + + + + + + + + + + + + + + +Lapukhov, et al. Informational [Page 27] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + Tier 2 Tier 1 Tier 2 + +-----+ +-----+ +-----+ + +-------------| DEV |------| DEV |------| |-------------+ + | +-----| C |--++--| 1 |--++--| |-----+ | + | | +-----+ || +-----+ || +-----+ | | + | | || || | | + | | +-----+ || +-----+ || +-----+ | | + | +-----+-----| DEV |--++--| DEV |--++--| |-----+-----+ | + | | | +---| D |------| 2 |------| |---+ | | | + | | | | +-----+ +-----+ +-----+ | | | | + | | | | | | | | + +-----+ +-----+ +-----+ +-----+ + | DEV | | DEV | | | | | + | A | | B | Tier 3 Tier 3 | | | | + +-----+ +-----+ +-----+ +-----+ + | | | | | | | | + O O O O <- Servers -> O O O O + + Figure 6: 5-Stage Clos Topology + + Having this design in place, Tier 2 devices may be configured to + advertise only a default route down to Tier 3 devices. If a link + between Tier 2 and Tier 3 fails, the traffic will be re-routed via + the second available path known to a Tier 2 switch. It is still not + possible to advertise a summary route covering prefixes for a single + cluster from Tier 2 devices since each of them has only a single path + down to this prefix. It would require dual-homed servers to + accomplish that. Also note that this design is only resilient to + single link failures. It is possible for a double link failure to + isolate a Tier 2 device from all paths toward a specific Tier 3 + device, thus causing a routing black-hole. + + A result of the proposed topology modification would be a reduction + of the port capacity of Tier 1 devices. This limits the maximum + number of attached Tier 2 devices, and therefore will limit the + maximum DC network size. A larger network would require different + Tier 1 devices that have higher port density to implement this + change. + + Another problem is traffic rebalancing under link failures. Since + there are two paths from Tier 1 to Tier 3, a failure of the link + between Tier 1 and Tier 2 switch would result in all traffic that was + taking the failed link to switch to the remaining path. This will + result in doubling the link utilization on the remaining link. + + + + + + + +Lapukhov, et al. Informational [Page 28] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +8.2.2. Simple Virtual Aggregation + + A completely different approach to route summarization is possible, + provided that the main goal is to reduce the FIB size, while allowing + the control plane to disseminate full routing information. Firstly, + it could be easily noted that in many cases multiple prefixes, some + of which are less specific, share the same set of the next hops (same + ECMP group). For example, from the perspective of Tier 3 devices, + all routes learned from upstream Tier 2 devices, including the + default route, will share the same set of BGP next hops, provided + that there are no failures in the network. 
This makes it possible to + use the technique similar to that described in [RFC6769] and only + install the least specific route in the FIB, ignoring more specific + routes if they share the same next-hop set. For example, under + normal network conditions, only the default route needs to be + programmed into the FIB. + + Furthermore, if the Tier 2 devices are configured with summary + prefixes covering all of their attached Tier 3 device's prefixes, the + same logic could be applied in Tier 1 devices as well and, by + induction to Tier 2/Tier 3 switches in different clusters. These + summary routes should still allow for more specific prefixes to leak + to Tier 1 devices, to enable detection of mismatches in the next-hop + sets if a particular link fails, thus changing the next-hop set for a + specific prefix. + + Restating once again, this technique does not reduce the amount of + control-plane state (i.e., BGP UPDATEs, BGP Loc-RIB size), but only + allows for more efficient FIB utilization, by detecting more specific + prefixes that share their next-hop set with a subsuming less specific + prefix. + +8.3. ICMP Unreachable Message Masquerading + + This section discusses some operational aspects of not advertising + point-to-point link subnets into BGP, as previously identified as an + option in Section 5.2.3. The operational impact of this decision + could be seen when using the well-known "traceroute" tool. + Specifically, IP addresses displayed by the tool will be the link's + point-to-point addresses, and hence will be unreachable for + management connectivity. This makes some troubleshooting more + complicated. + + One way to overcome this limitation is by using the DNS subsystem to + create the "reverse" entries for these point-to-point IP addresses + pointing to the same name as the loopback address. The connectivity + then can be made by resolving this name to the "primary" IP address + + + + +Lapukhov, et al. Informational [Page 29] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + of the device, e.g., its Loopback interface, which is always + advertised into BGP. However, this creates a dependency on the DNS + subsystem, which may be unavailable during an outage. + + Another option is to make the network device perform IP address + masquerading, that is, rewriting the source IP addresses of the + appropriate ICMP messages sent by the device with the "primary" IP + address of the device. Specifically, the ICMP Destination + Unreachable Message (type 3) code 3 (port unreachable) and ICMP Time + Exceeded (type 11) code 0 are required for correct operation of the + "traceroute" tool. With this modification, the "traceroute" probes + sent to the devices will always be sent back with the "primary" IP + address as the source, allowing the operator to discover the + "reachable" IP address of the box. This has the downside of hiding + the address of the "entry point" into the device. If the devices + support [RFC5837], this may allow the best of both worlds by + providing the information about the incoming interface even if the + return address is the "primary" IP address. + +9. Security Considerations + + The design does not introduce any additional security concerns. + General BGP security considerations are discussed in [RFC4271] and + [RFC4272]. Since a DC is a single-operator domain, this document + assumes that edge filtering is in place to prevent attacks against + the BGP sessions themselves from outside the perimeter of the DC. 
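+
+   The edge-filtering assumption above can be illustrated with a short
+   sketch (in Python, not part of the original design); the data
+   center aggregate, the port check, and the filter logic below are
+   hypothetical and greatly simplified:
+
+      import ipaddress
+
+      DC_AGGREGATE = ipaddress.ip_network("10.0.0.0/8")  # example only
+      BGP_PORT = 179
+
+      def permit_at_border(src, dst, dport):
+          """Drop BGP sessions crossing the DC perimeter."""
+          if dport != BGP_PORT:
+              return True   # not BGP; other policy applies
+          return (ipaddress.ip_address(src) in DC_AGGREGATE and
+                  ipaddress.ip_address(dst) in DC_AGGREGATE)
+
+      print(permit_at_border("10.1.1.1", "10.2.2.2", 179))    # True
+      print(permit_at_border("192.0.2.10", "10.2.2.2", 179))  # False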
+ This may be a more feasible option for most deployments than having + to deal with key management for TCP MD5 as described in [RFC2385] or + dealing with the lack of implementations of the TCP Authentication + Option [RFC5925] available at the time of publication of this + document. The Generalized TTL Security Mechanism [RFC5082] could + also be used to further reduce the risk of BGP session spoofing. + +10. References + +10.1. Normative References + + [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A + Border Gateway Protocol 4 (BGP-4)", RFC 4271, + DOI 10.17487/RFC4271, January 2006, + <http://www.rfc-editor.org/info/rfc4271>. + + [RFC6996] Mitchell, J., "Autonomous System (AS) Reservation for + Private Use", BCP 6, RFC 6996, DOI 10.17487/RFC6996, July + 2013, <http://www.rfc-editor.org/info/rfc6996>. + + + + + + +Lapukhov, et al. Informational [Page 30] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +10.2. Informative References + + [ALFARES2008] + Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, + Commodity Data Center Network Architecture", + DOI 10.1145/1402958.1402967, August 2008, + <http://dl.acm.org/citation.cfm?id=1402967>. + + [ALLOWASIN] + Cisco Systems, "Allowas-in Feature in BGP Configuration + Example", February 2015, + <http://www.cisco.com/c/en/us/support/docs/ip/ + border-gateway-protocol-bgp/112236-allowas-in-bgp-config- + example.html>. + + [BGP-PIC] Bashandy, A., Ed., Filsfils, C., and P. Mohapatra, "BGP + Prefix Independent Convergence", Work in Progress, + draft-ietf-rtgwg-bgp-pic-02, August 2016. + + [CLOS1953] Clos, C., "A Study of Non-Blocking Switching Networks", + The Bell System Technical Journal, Vol. 32(2), + DOI 10.1002/j.1538-7305.1953.tb01433.x, March 1953. + + [CONDITIONALROUTE] + Cisco Systems, "Configuring and Verifying the BGP + Conditional Advertisement Feature", August 2005, + <http://www.cisco.com/c/en/us/support/docs/ip/ + border-gateway-protocol-bgp/16137-cond-adv.html>. + + [CONS-HASH] + Wikipedia, "Consistent Hashing", July 2016, + <https://en.wikipedia.org/w/ + index.php?title=Consistent_hashing&oldid=728825684>. + + [FB4POST] Farrington, N. and A. Andreyev, "Facebook's Data Center + Network Architecture", May 2013, + <http://nathanfarrington.com/papers/facebook-oic13.pdf>. + + [GREENBERG2009] + Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a + Cloud: Research Problems in Data Center Networks", + DOI 10.1145/1496091.1496103, January 2009, + <http://dl.acm.org/citation.cfm?id=1496103>. + + [HADOOP] Apache, "Apache Hadoop", April 2016, + <https://hadoop.apache.org/>. + + + + + +Lapukhov, et al. Informational [Page 31] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + [IANA.AS] IANA, "Autonomous System (AS) Numbers", + <http://www.iana.org/assignments/as-numbers>. + + [IEEE8021D-1990] + IEEE, "IEEE Standard for Local and Metropolitan Area + Networks: Media Access Control (MAC) Bridges", IEEE + Std 802.1D, DOI 10.1109/IEEESTD.1991.101050, 1991, + <http://ieeexplore.ieee.org/servlet/opac?punumber=2255>. + + [IEEE8021D-2004] + IEEE, "IEEE Standard for Local and Metropolitan Area + Networks: Media Access Control (MAC) Bridges", IEEE + Std 802.1D, DOI 10.1109/IEEESTD.2004.94569, June 2004, + <http://ieeexplore.ieee.org/servlet/opac?punumber=9155>. + + [IEEE8021Q] + IEEE, "IEEE Standard for Local and Metropolitan Area + Networks: Bridges and Bridged Networks", IEEE Std 802.1Q, + DOI 10.1109/IEEESTD.2014.6991462, + <http://ieeexplore.ieee.org/servlet/ + opac?punumber=6991460>. 
+ + [IEEE8023AD] + IEEE, "Amendment to Carrier Sense Multiple Access With + Collision Detection (CSMA/CD) Access Method and Physical + Layer Specifications - Aggregation of Multiple Link + Segments", IEEE Std 802.3ad, + DOI 10.1109/IEEESTD.2000.91610, October 2000, + <http://ieeexplore.ieee.org/servlet/opac?punumber=6867>. + + [INTERCON] Dally, W. and B. Towles, "Principles and Practices of + Interconnection Networks", ISBN 978-0122007514, January + 2004, <http://dl.acm.org/citation.cfm?id=995703>. + + [JAKMA2008] + Jakma, P., "BGP Path Hunting", 2008, + <https://blogs.oracle.com/paulj/entry/bgp_path_hunting>. + + [L3DSR] Schaumann, J., "L3DSR - Overcoming Layer 2 Limitations of + Direct Server Return Load Balancing", 2011, + <https://www.nanog.org/meetings/nanog51/presentations/ + Monday/NANOG51.Talk45.nanog51-Schaumann.pdf>. + + [LINK] Mohapatra, P. and R. Fernando, "BGP Link Bandwidth + Extended Community", Work in Progress, draft-ietf-idr- + link-bandwidth-06, January 2013. + + + + + +Lapukhov, et al. Informational [Page 32] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + [REMOVAL] Mitchell, J., Rao, D., and R. Raszuk, "Private Autonomous + System (AS) Removal Requirements", Work in Progress, + draft-mitchell-grow-remove-private-as-04, April 2015. + + [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, + DOI 10.17487/RFC2328, April 1998, + <http://www.rfc-editor.org/info/rfc2328>. + + [RFC2385] Heffernan, A., "Protection of BGP Sessions via the TCP MD5 + Signature Option", RFC 2385, DOI 10.17487/RFC2385, August + 1998, <http://www.rfc-editor.org/info/rfc2385>. + + [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path + Algorithm", RFC 2992, DOI 10.17487/RFC2992, November 2000, + <http://www.rfc-editor.org/info/rfc2992>. + + [RFC4272] Murphy, S., "BGP Security Vulnerabilities Analysis", + RFC 4272, DOI 10.17487/RFC4272, January 2006, + <http://www.rfc-editor.org/info/rfc4272>. + + [RFC4277] McPherson, D. and K. Patel, "Experience with the BGP-4 + Protocol", RFC 4277, DOI 10.17487/RFC4277, January 2006, + <http://www.rfc-editor.org/info/rfc4277>. + + [RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast + Services", BCP 126, RFC 4786, DOI 10.17487/RFC4786, + December 2006, <http://www.rfc-editor.org/info/rfc4786>. + + [RFC5082] Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C. + Pignataro, "The Generalized TTL Security Mechanism + (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007, + <http://www.rfc-editor.org/info/rfc5082>. + + [RFC5837] Atlas, A., Ed., Bonica, R., Ed., Pignataro, C., Ed., Shen, + N., and JR. Rivers, "Extending ICMP for Interface and + Next-Hop Identification", RFC 5837, DOI 10.17487/RFC5837, + April 2010, <http://www.rfc-editor.org/info/rfc5837>. + + [RFC5880] Katz, D. and D. Ward, "Bidirectional Forwarding Detection + (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010, + <http://www.rfc-editor.org/info/rfc5880>. + + [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP + Authentication Option", RFC 5925, DOI 10.17487/RFC5925, + June 2010, <http://www.rfc-editor.org/info/rfc5925>. + + + + + + +Lapukhov, et al. Informational [Page 33] + +RFC 7938 BGP Routing in Data Centers August 2016 + + + [RFC6325] Perlman, R., Eastlake 3rd, D., Dutt, D., Gai, S., and A. + Ghanwani, "Routing Bridges (RBridges): Base Protocol + Specification", RFC 6325, DOI 10.17487/RFC6325, July 2011, + <http://www.rfc-editor.org/info/rfc6325>. + + [RFC6769] Raszuk, R., Heitz, J., Lo, A., Zhang, L., and X. 
Xu, + "Simple Virtual Aggregation (S-VA)", RFC 6769, + DOI 10.17487/RFC6769, October 2012, + <http://www.rfc-editor.org/info/rfc6769>. + + [RFC6774] Raszuk, R., Ed., Fernando, R., Patel, K., McPherson, D., + and K. Kumaki, "Distribution of Diverse BGP Paths", + RFC 6774, DOI 10.17487/RFC6774, November 2012, + <http://www.rfc-editor.org/info/rfc6774>. + + [RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet + Autonomous System (AS) Number Space", RFC 6793, + DOI 10.17487/RFC6793, December 2012, + <http://www.rfc-editor.org/info/rfc6793>. + + [RFC7067] Dunbar, L., Eastlake 3rd, D., Perlman, R., and I. + Gashinsky, "Directory Assistance Problem and High-Level + Design Proposal", RFC 7067, DOI 10.17487/RFC7067, November + 2013, <http://www.rfc-editor.org/info/rfc7067>. + + [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., + Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional + Forwarding Detection (BFD) on Link Aggregation Group (LAG) + Interfaces", RFC 7130, DOI 10.17487/RFC7130, February + 2014, <http://www.rfc-editor.org/info/rfc7130>. + + [RFC7196] Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O. + Maennel, "Making Route Flap Damping Usable", RFC 7196, + DOI 10.17487/RFC7196, May 2014, + <http://www.rfc-editor.org/info/rfc7196>. + + [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, + "Advertisement of Multiple Paths in BGP", RFC 7911, + DOI 10.17487/RFC7911, July 2016, + <http://www.rfc-editor.org/info/rfc7911>. + + [VENDOR-REMOVE-PRIVATE-AS] + Cisco Systems, "Removing Private Autonomous System Numbers + in BGP", August 2005, + <http://www.cisco.com/en/US/tech/tk365/ + technologies_tech_note09186a0080093f27.shtml>. + + + + + +Lapukhov, et al. Informational [Page 34] + +RFC 7938 BGP Routing in Data Centers August 2016 + + +Acknowledgements + + This publication summarizes the work of many people who participated + in developing, testing, and deploying the proposed network design, + some of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet + Nkposong, Robert Toomey, and Lihua Yuan. The authors would also like + to thank Linda Dunbar, Anoop Ghanwani, Susan Hares, Danny McPherson, + Robert Raszuk, and Russ White for reviewing this document and + providing valuable feedback, and Mary Mitchell for initial grammar + and style suggestions. + +Authors' Addresses + + Petr Lapukhov + Facebook + 1 Hacker Way + Menlo Park, CA 94025 + United States of America + + Email: petr@fb.com + + + Ariff Premji + Arista Networks + 5453 Great America Parkway + Santa Clara, CA 95054 + United States of America + + Email: ariff@arista.com + URI: http://arista.com/ + + + Jon Mitchell (editor) + + Email: jrmitche@puck.nether.net + + + + + + + + + + + + + + + + +Lapukhov, et al. Informational [Page 35] + |