diff --git a/doc/rfc/rfc6820.txt b/doc/rfc/rfc6820.txt
new file mode 100644
index 0000000..16923a5
--- /dev/null
+++ b/doc/rfc/rfc6820.txt
@@ -0,0 +1,955 @@
+
+Internet Engineering Task Force (IETF) T. Narten
+Request for Comments: 6820 IBM Corporation
+Category: Informational M. Karir
+ISSN: 2070-1721 Merit Network Inc.
+ I. Foo
+ Huawei Technologies
+ January 2013
+
+
+ Address Resolution Problems in Large Data Center Networks
+
+Abstract
+
+ This document examines address resolution issues related to the
+ scaling of data centers with a very large number of hosts. The scope
+ of this document is relatively narrow, focusing on address resolution
+ (the Address Resolution Protocol (ARP) in IPv4 and Neighbor Discovery
+ (ND) in IPv6) within a data center.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Not all documents
+ approved by the IESG are a candidate for any level of Internet
+ Standard; see Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc6820.
+
+Copyright Notice
+
+ Copyright (c) 2013 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+   1.  Introduction
+   2.  Terminology
+   3.  Background
+   4.  Address Resolution in IPv4
+   5.  Address Resolution in IPv6
+   6.  Generalized Data Center Design
+       6.1.  Access Layer
+       6.2.  Aggregation Layer
+       6.3.  Core
+       6.4.  L3/L2 Topological Variations
+             6.4.1.  L3 to Access Switches
+             6.4.2.  L3 to Aggregation Switches
+             6.4.3.  L3 in the Core Only
+             6.4.4.  Overlays
+       6.5.  Factors That Affect Data Center Design
+             6.5.1.  Traffic Patterns
+             6.5.2.  Virtualization
+             6.5.3.  Summary
+   7.  Problem Itemization
+       7.1.  ARP Processing on Routers
+       7.2.  IPv6 Neighbor Discovery
+       7.3.  MAC Address Table Size Limitations in Switches
+   8.  Summary
+   9.  Acknowledgments
+   10. Security Considerations
+   11. Informative References
+
+1. Introduction
+
+ This document examines issues related to the scaling of large data
+ centers. Specifically, this document focuses on address resolution
+ (ARP in IPv4 and Neighbor Discovery in IPv6) within the data center.
+   Although strictly speaking the scope of address resolution is
+   confined to a single L2 broadcast domain (i.e., ARP runs at the L2
+   layer below IP), the issue is complicated by routers that have many
+   interfaces on which address resolution must be performed and by the
+   presence of IEEE 802.1Q VLANs, where individual VLANs effectively
+   form their own L2 broadcast domains.  Thus, the scope of address
+   resolution spans both the L2 links and the devices attached to
+   those links.
+
+ This document identifies potential issues associated with address
+ resolution in data centers with a large number of hosts. The scope
+ of this document is intentionally relatively narrow, as it mirrors
+ the Address Resolution for Massive numbers of hosts in the Data
+ center (ARMD) WG charter. This document lists "pain points" that are
+ being experienced in current data centers. The goal of this document
+ is to focus on address resolution issues and not other broader issues
+ that might arise in data centers.
+
+2. Terminology
+
+ Address Resolution: The process of determining the link-layer
+ address corresponding to a given IP address. In IPv4, address
+ resolution is performed by ARP [RFC0826]; in IPv6, it is provided
+ by Neighbor Discovery (ND) [RFC4861].
+
+ Application: Software that runs on either a physical or virtual
+ machine, providing a service (e.g., web server, database server,
+ etc.).
+
+   L2 Broadcast Domain: The set of all links, repeaters, and switches
+      that an L2 broadcast frame traverses, i.e., the set of nodes
+      that can reach one another directly at L2.  In IEEE 802.1Q
+      networks, a broadcast domain corresponds to a single VLAN.
+
+ Host (or server): A computer system on the network.
+
+ Hypervisor: Software running on a host that allows multiple VMs to
+ run on the same host.
+
+ Virtual Machine (VM): A software implementation of a physical
+ machine that runs programs as if they were executing on a
+ physical, non-virtualized machine. Applications (generally) do
+ not know they are running on a VM as opposed to running on a
+
+ "bare" host or server, though some systems provide a
+ paravirtualization environment that allows an operating system or
+ application to be aware of the presence of virtualization for
+ optimization purposes.
+
+ ToR: Top-of-Rack Switch. A switch placed in a single rack to
+ aggregate network connectivity to and from hosts in that rack.
+
+ EoR: End-of-Row Switch. A switch used to aggregate network
+ connectivity from multiple racks. EoR switches are the next level
+ of switching above ToR switches.
+
+3. Background
+
+ Large, flat L2 networks have long been known to have scaling
+ problems. As the size of an L2 broadcast domain increases, the level
+ of broadcast traffic from protocols like ARP increases. Large
+ amounts of broadcast traffic pose a particular burden because every
+ device (switch, host, and router) must process and possibly act on
+ such traffic. In extreme cases, "broadcast storms" can occur where
+ the quantity of broadcast traffic reaches a level that effectively
+ brings down part or all of a network. For example, poor
+ implementations of loop detection and prevention or misconfiguration
+ errors can create conditions that lead to broadcast storms as network
+ conditions change. The conventional wisdom for addressing such
+ problems has been to say "don't do that". That is, split large L2
+ networks into multiple smaller L2 networks, each operating as its own
+ L3/IP subnet. Numerous data center networks have been designed with
+ this principle, e.g., with each rack placed within its own L3 IP
+ subnet. By doing so, the broadcast domain (and address resolution)
+ is confined to one ToR switch, which works well from a scaling
+ perspective. Unfortunately, this conflicts in some ways with the
+ current trend towards dynamic workload shifting in data centers and
+ increased virtualization, as discussed below.
+
+ Workload placement has become a challenging task within data centers.
+   Ideally, operators would like to be able to dynamically reassign
+   workloads within a data center in order to optimize server
+   utilization, add more servers in response to increased demand, etc.
+   However, servers are often pre-configured to run with a given set
+   of IP addresses.  Placement of such servers is then constrained by
+   the IP addressing scheme of the data center.  For example, servers
+   configured with addresses from a particular subnet can only be
+   placed where they connect to the IP subnet corresponding to those
+   addresses.  If each ToR switch acts as the gateway for its own
+   subnet, a server can only be connected to that one ToR switch; this
+   gateway switch represents the L2/L3 boundary.  A similar constraint
+   occurs in virtualized environments, as discussed next.
+
+ Server virtualization is fast becoming the norm in data centers.
+ With server virtualization, each physical server supports multiple
+ virtual machines, each running its own operating system, middleware,
+ and applications. Virtualization is a key enabler of workload
+ agility, i.e., allowing any server to host any application (on its
+ own VM) and providing the flexibility of adding, shrinking, or moving
+ VMs within the physical infrastructure. Server virtualization
+ provides numerous benefits, including higher utilization, increased
+ data security, reduced user downtime, and even significant power
+ conservation, along with the promise of a more flexible and dynamic
+ computing environment.
+
+ The discussion below focuses on VM placement and migration. Keep in
+ mind, however, that even in a non-virtualized environment, many of
+ the same issues apply to individual workloads running on standalone
+ machines. For example, when increasing the number of servers running
+ a particular workload to meet demand, placement of those workloads
+ may be constrained by IP subnet numbering considerations, as
+ discussed earlier.
+
+ The greatest flexibility in VM and workload management occurs when it
+ is possible to place a VM (or workload) anywhere in the data center
+ regardless of what IP addresses the VM uses and how the physical
+ network is laid out. In practice, movement of VMs within a data
+ center is easiest when VM placement and movement do not conflict with
+ the IP subnet boundaries of the data center's network, so that the
+ VM's IP address need not be changed to reflect its actual point of
+ attachment on the network from an L3/IP perspective. In contrast, if
+ a VM moves to a new IP subnet, its address must change, and clients
+ will need to be made aware of that change. From a VM management
+ perspective, management is simplified if all servers are on a single
+ large L2 network.
+
+ With virtualization, it is not uncommon to have a single physical
+ server host ten or more VMs, each having its own IP (and Media Access
+ Control (MAC)) addresses. Consequently, the number of addresses per
+ machine (and hence per subnet) is increasing, even when the number of
+ physical machines stays constant. In a few years, the numbers will
+ likely be even higher.
+
+ In the past, applications were static in the sense that they tended
+ to stay in one physical place. An application installed on a
+ physical machine would stay on that machine because the cost of
+ moving an application elsewhere was generally high. Moreover,
+ physical servers hosting applications would tend to be placed in such
+ a way as to facilitate communication locality. That is, applications
+ running on servers would be physically located near the servers
+ hosting the applications they communicated with most heavily. The
+ network traffic patterns in such environments could thus be
+ optimized, in some cases keeping significant traffic local to one
+ network segment. In these more static and carefully managed
+ environments, it was possible to build networks that approached
+ scaling limitations but did not actually cross the threshold.
+
+ Today, with the proliferation of VMs, traffic patterns are becoming
+ more diverse and less predictable. In particular, there can easily
+ be less locality of network traffic as VMs hosting applications are
+ moved for such reasons as reducing overall power usage (by
+ consolidating VMs and powering off idle machines) or moving a VM to a
+ physical server with more capacity or a lower load. In today's
+ changing environments, it is becoming more difficult to engineer
+ networks as traffic patterns continually shift as VMs move around.
+
+ In summary, both the size and density of L2 networks are increasing.
+ In addition, increasingly dynamic workloads and the increased usage
+ of VMs are creating pressure for ever-larger L2 networks. Today,
+ there are already data centers with over 100,000 physical machines
+ and many times that number of VMs. This number will only increase
+ going forward. In addition, traffic patterns within a data center
+ are also constantly changing. Ultimately, the issues described in
+ this document might be observed at any scale, depending on the
+ particular design of the data center.
+
+4. Address Resolution in IPv4
+
+ In IPv4 over Ethernet, ARP provides the function of address
+ resolution. To determine the link-layer address of a given IP
+ address, a node broadcasts an ARP Request. The request is delivered
+ to all portions of the L2 network, and the node with the requested IP
+ address responds with an ARP Reply. ARP is an old protocol and, by
+ current standards, is sparsely documented. For example, there are no
+ clear requirements for retransmitting ARP Requests in the absence of
+ replies. Consequently, implementations vary in the details of what
+ they actually implement [RFC0826][RFC1122].
+
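+   As a concrete, non-normative illustration of the exchange described
+   above, the following Python sketch builds the Ethernet broadcast
+   frame carrying an ARP Request as defined in [RFC0826]; the MAC and
+   IP addresses used are placeholders.
+
+      import struct
+
+      def build_arp_request(sender_mac: bytes, sender_ip: bytes,
+                            target_ip: bytes) -> bytes:
+          # Ethernet header: broadcast destination, EtherType 0x0806
+          eth = b"\xff" * 6 + sender_mac + struct.pack("!H", 0x0806)
+          # ARP payload: htype=1 (Ethernet), ptype=0x0800 (IPv4),
+          # hlen=6, plen=4, oper=1 (Request)
+          arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
+          arp += sender_mac + sender_ip    # sender hardware/proto addr
+          arp += b"\x00" * 6 + target_ip   # target MAC is still unknown
+          return eth + arp
+
+      # Placeholder addresses, for illustration only.
+      frame = build_arp_request(bytes.fromhex("020000000001"),
+                                bytes([192, 0, 2, 1]),
+                                bytes([192, 0, 2, 10]))
+      assert len(frame) == 14 + 28   # Ethernet header + ARP payload
+
+   The frame is flooded to every member of the L2 broadcast domain;
+   only the node that owns the target IP address answers with an ARP
+   Reply.
+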
+ From a scaling perspective, there are a number of problems with ARP.
+ First, it uses broadcast, and any network with a large number of
+ attached hosts will see a correspondingly large amount of broadcast
+ ARP traffic. The second problem is that it is not feasible to change
+ host implementations of ARP -- current implementations are too widely
+ entrenched, and any changes to host implementations of ARP would take
+ years to become sufficiently deployed to matter. That said, it may
+ be possible to change ARP implementations in hypervisors, L2/L3
+ boundary routers, and/or ToR access switches, to leverage such
+ techniques as Proxy ARP. Finally, ARP implementations need to take
+ steps to flush out stale or otherwise invalid entries.
+ Unfortunately, existing standards do not provide clear implementation
+ guidelines for how to do this. Consequently, implementations vary
+ significantly, and some implementations are "chatty" in that they
+ just periodically flush caches every few minutes and send new ARP
+ queries.
+
+5. Address Resolution in IPv6
+
+ Broadly speaking, from the perspective of address resolution, IPv6's
+ Neighbor Discovery (ND) behaves much like ARP, with a few notable
+ differences. First, ARP uses broadcast, whereas ND uses multicast.
+ When querying for a target IP address, ND maps the target address
+ into an IPv6 Solicited Node multicast address. Using multicast
+ rather than broadcast has the benefit that the multicast frames do
+ not necessarily need to be sent to all parts of the network, i.e.,
+ the frames can be sent only to segments where listeners for the
+ Solicited Node multicast address reside. In the case where multicast
+ frames are delivered to all parts of the network, sending to a
+ multicast address still has the advantage that most (if not all)
+ nodes will filter out the (unwanted) multicast query via filters
+ installed in the Network Interface Card (NIC) rather than burdening
+ host software with the need to process such packets. Thus, whereas
+ all nodes must process every ARP query, ND queries are processed only
+   by the nodes for which they are intended.  In cases where multicast
+ filtering can't effectively be implemented in the NIC (e.g., as on
+ hypervisors supporting virtualization), filtering would need to be
+ done in software (e.g., in the hypervisor's vSwitch).
+
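+   To make the multicast mapping concrete, the following Python sketch
+   (informative only) derives the Solicited-Node multicast address for
+   a target IPv6 address, along with the Ethernet multicast MAC
+   address that frames sent to that group carry; the example address
+   is taken from the documentation prefix.
+
+      import ipaddress
+
+      def solicited_node(target: str) -> ipaddress.IPv6Address:
+          # ff02::1:ff00:0/104 plus the low-order 24 bits of the target
+          low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
+          base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
+          return ipaddress.IPv6Address(base | low24)
+
+      def multicast_mac(group: ipaddress.IPv6Address) -> str:
+          # An IPv6 multicast address maps to an Ethernet MAC address
+          # of 33:33 followed by its low-order 32 bits.
+          low32 = int(group) & 0xFFFFFFFF
+          return "33:33:" + ":".join(
+              f"{(low32 >> s) & 0xff:02x}" for s in (24, 16, 8, 0))
+
+      group = solicited_node("2001:db8::20e:8c6c")
+      print(group)                   # ff02::1:ff0e:8c6c
+      print(multicast_mac(group))    # 33:33:ff:0e:8c:6c
+
+   A NIC that has joined only its own Solicited-Node groups can
+   discard all other ND queries in hardware, which is the filtering
+   benefit noted above.
+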
+6. Generalized Data Center Design
+
+ There are many different ways in which data center networks might be
+ designed. The designs are usually engineered to suit the particular
+ workloads that are being deployed in the data center. For example, a
+ large web server farm might be engineered in a very different way
+ than a general-purpose multi-tenant cloud hosting service. However,
+ in most cases the designs can be abstracted into a typical three-
+ layer model consisting of an access layer, an aggregation layer, and
+ the Core. The access layer generally refers to the switches that are
+ closest to the physical or virtual servers; the aggregation layer
+ serves to interconnect multiple access-layer devices. The Core
+ switches connect the aggregation switches to the larger network core.
+
+
+ Figure 1 shows a generalized data center design, which captures the
+ essential elements of various alternatives.
+
+ +-----+-----+ +-----+-----+
+ | Core0 | | Core1 | Core
+ +-----+-----+ +-----+-----+
+ / \ / /
+ / \----------\ /
+ / /---------/ \ /
+ +-------+ +------+
+ +/------+ | +/-----+ |
+ | Aggr11| + --------|AggrN1| + Aggregation Layer
+ +---+---+/ +------+/
+ / \ / \
+ / \ / \
+ +---+ +---+ +---+ +---+
+ |T11|... |T1x| |TN1| |TNy| Access Layer
+ +---+ +---+ +---+ +---+
+ | | | | | | | |
+ +---+ +---+ +---+ +---+
+ | |... | | | | | |
+ +---+ +---+ +---+ +---+ Server Racks
+ | |... | | | | | |
+ +---+ +---+ +---+ +---+
+ | |... | | | | | |
+ +---+ +---+ +---+ +---+
+
+ Typical Layered Architecture in a Data Center
+
+ Figure 1
+
+6.1. Access Layer
+
+ The access switches provide connectivity directly to/from physical
+ and virtual servers. The access layer may be implemented by wiring
+ the servers within a rack to a ToR switch or, less commonly, the
+ servers could be wired directly to an EoR switch. A server rack may
+ have a single uplink to one access switch or may have dual uplinks to
+ two different access switches.
+
+6.2. Aggregation Layer
+
+ In a typical data center, aggregation switches interconnect many ToR
+ switches. Usually, there are multiple parallel aggregation switches,
+   serving the same group of ToRs to achieve load sharing.  In large
+   data centers, it is not unusual to see aggregation switches
+   interconnecting hundreds of ToR switches.
+
+6.3. Core
+
+ Core switches provide connectivity between aggregation switches and
+ the main data center network. Core switches interconnect different
+ sets of racks and provide connectivity to data center gateways
+ leading to external networks.
+
+6.4. L3/L2 Topological Variations
+
+6.4.1. L3 to Access Switches
+
+ In this scenario, the L3 domain is extended all the way from the core
+ network to the access switches. Each rack enclosure consists of a
+ single L2 domain, which is confined to the rack. In general, there
+ are no significant ARP/ND scaling issues in this scenario, as the L2
+ domain cannot grow very large. Such a topology has benefits in
+ scenarios where servers attached to a particular access switch
+ generally run VMs that are confined to using a single subnet. These
+ VMs and the applications they host aren't moved (migrated) to other
+ racks that might be attached to different access switches (and
+ different IP subnets). A small server farm or very static compute
+ cluster might be well served via this design.
+
+6.4.2. L3 to Aggregation Switches
+
+   When the L3 domain extends only to the aggregation switches, hosts
+   in any of the IP subnets configured on the aggregation switches are
+   reachable at L2 through any access switch, provided the access
+   switches enable all of the VLANs.  Such a topology allows a greater
+   level of
+ flexibility, as servers attached to any access switch can run any VMs
+ that have been provisioned with IP addresses configured on the
+ aggregation switches. In such an environment, VMs can migrate
+ between racks without IP address changes. The drawback of this
+ design, however, is that multiple VLANs have to be enabled on all
+ access switches and all access-facing ports on aggregation switches.
+ Even though L2 traffic is still partitioned by VLANs, the fact that
+ all VLANs are enabled on all ports can lead to broadcast traffic on
+ all VLANs that traverse all links and ports, which has the same
+ effect as one big L2 domain on the access-facing side of the
+ aggregation switch. In addition, the internal traffic itself might
+ have to cross different L2 boundaries, resulting in significant
+ ARP/ND load at the aggregation switches. This design provides a good
+ tradeoff between flexibility and L2 domain size. A moderate-sized
+ data center might utilize this approach to provide high-availability
+ services at a single location.
+
+6.4.3. L3 in the Core Only
+
+ In some cases, where a wider range of VM mobility is desired (i.e., a
+ greater number of racks among which VMs can move without IP address
+ changes), the L3 routed domain might be terminated at the core
+ routers themselves. In this case, VLANs can span multiple groups of
+ aggregation switches, which allows hosts to be moved among a greater
+ number of server racks without IP address changes. This scenario
+ results in the largest ARP/ND performance impact, as explained later.
+ A data center with very rapid workload shifting may consider this
+ kind of design.
+
+6.4.4. Overlays
+
+ There are several approaches where overlay networks can be used to
+ build very large L2 networks to enable VM mobility. Overlay networks
+ using various L2 or L3 mechanisms allow interior switches/routers to
+ mask host addresses. In addition, L3 overlays can help the data
+ center designer control the size of the L2 domain and also enhance
+ the ability to provide multi-tenancy in data center networks.
+   However, the use of overlays does not eliminate the traffic
+   associated with address resolution; it simply carries that traffic
+   as regular data traffic within the overlay.  That is, address
+   resolution is implemented in the overlay and is not directly
+   visible to the switches of the data center network.
+
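+   As a purely illustrative sketch (no particular overlay protocol is
+   implied, and all names and addresses are hypothetical), an overlay
+   edge device might hold a mapping from a tenant's inner addresses to
+   an outer tunnel endpoint, so that interior switches and routers
+   only ever see the endpoint addresses:
+
+      # Hypothetical mapping table held by an overlay edge device:
+      # inner (VM) IP -> (inner MAC, outer tunnel endpoint IP).
+      OVERLAY_MAP = {
+          "10.1.1.5": ("02:00:00:00:01:05", "203.0.113.10"),
+          "10.1.1.9": ("02:00:00:00:01:09", "203.0.113.22"),
+      }
+
+      def encapsulate(inner_dst_ip: str, payload: bytes):
+          # The lookup replaces ARP/ND in the underlay; the VM's MAC
+          # and IP stay inside the tunnel payload, invisible to the
+          # core switches.
+          inner_mac, endpoint_ip = OVERLAY_MAP[inner_dst_ip]
+          return endpoint_ip, inner_mac, payload
+
+   Populating and synchronizing such a table is itself a form of
+   address resolution, which is why the overlay approach moves rather
+   than removes the resolution load.
+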
+   A potential problem in a large data center arises when a large
+   number of hosts communicate with peers in different subnets:
+   because the traffic flows are generally bidirectional, all of these
+   hosts send (and receive) data packets through their respective
+   L2/L3 boundary nodes, which can further expose any scaling
+   problems.  These boundary nodes have to process ARP/ND requests
+   sent from the originating subnets and resolve physical (MAC)
+   addresses in the target subnets.  Therefore, for maximum
+   flexibility in managing the data center workload, it is often
+   desirable to use overlays to place related groups of hosts in the
+   same topological subnet and thereby avoid the L2/L3 boundary
+   translation.  The use of overlays in the data center network can be
+   a useful design mechanism for managing a potential bottleneck at
+   the L2/L3 boundary by redefining where that boundary exists.
+
+6.5. Factors That Affect Data Center Design
+
+6.5.1. Traffic Patterns
+
+ Expected traffic patterns play an important role in designing
+ appropriately sized access, aggregation, and core networks. Traffic
+ patterns also vary based on the expected use of the data center.
+
+ Broadly speaking, it is desirable to keep as much traffic as possible
+ on the access layer in order to minimize the bandwidth usage at the
+ aggregation layer. If the expected use of the data center is to
+ serve as a large web server farm, where thousands of nodes are doing
+ similar things and the traffic pattern is largely in and out of a
+ large data center, an access layer with EoR switches might be used,
+ as it minimizes complexity, allows for servers and databases to be
+ located in the same L2 domain, and provides for maximum density.
+
+ A data center that is expected to host a multi-tenant cloud hosting
+   service might have quite different requirements.  In order to
+ isolate inter-customer traffic, smaller L2 domains might be
+ preferred, and though the size of the overall data center might be
+ comparable to the previous example, the multi-tenant nature of the
+ cloud hosting application requires a smaller and more
+ compartmentalized access layer. A multi-tenant environment might
+ also require the use of L3 all the way to the access-layer ToR
+ switch.
+
+ Yet another example of a workload with a unique traffic pattern is a
+ high-performance compute cluster, where most of the traffic is
+ expected to stay within the cluster but at the same time there is a
+ high degree of crosstalk between the nodes. This would once again
+ call for a large access layer in order to minimize the requirements
+ at the aggregation layer.
+
+6.5.2. Virtualization
+
+ Using virtualization in the data center further serves to increase
+ the possible densities that can be achieved. However, virtualization
+ also further complicates the requirements on the access layer, as
+ virtualization restricts the scope of server placement in the event
+ of server failover resulting from hardware failures or server
+ migration for load balancing or other reasons.
+
+ Virtualization also can place additional requirements on the
+ aggregation switches in terms of address resolution table size and
+ the scalability of any address-learning protocols that might be used
+ on those switches. The use of virtualization often also requires the
+ use of additional VLANs for high-availability beaconing, which would
+ need to span the entire virtualized infrastructure. This would
+ require the access layer to also span the entire virtualized
+ infrastructure.
+
+6.5.3. Summary
+
+ The designs described in this section have a number of tradeoffs.
+ The "L3 to access switches" design described in Section 6.4.1 is the
+ only design that constrains L2 domain size in a fashion that avoids
+ ARP/ND scaling problems. However, that design has limitations and
+ does not address some of the other requirements that lead to
+ configurations that make use of larger L2 domains. Consequently,
+ ARP/ND scaling issues are a real problem in practice.
+
+7. Problem Itemization
+
+ This section articulates some specific problems or "pain points" that
+ are related to large data centers.
+
+7.1. ARP Processing on Routers
+
+ One pain point with large L2 broadcast domains is that the routers
+ connected to the L2 domain may need to process a significant amount
+   of ARP traffic.  In particular, environments with a very large
+   aggregate level of ARP traffic place a heavy ARP load on routers.
+   Even though the vast majority of ARP traffic may
+ not be aimed at that router, the router still has to process enough
+ of the ARP Request to determine whether it can safely be ignored.
+ The ARP algorithm specifies that a recipient must update its ARP
+ cache if it receives an ARP query from a source for which it has an
+ entry [RFC0826].
+
+ ARP processing in routers is commonly handled in a "slow path"
+ software processor, rather than directly by a hardware Application-
+ Specific Integrated Circuit (ASIC) as is the case when forwarding
+ packets. Such a design significantly limits the rate at which ARP
+ traffic can be processed compared to the rate at which ASICs can
+ forward traffic. Current implementations at the time of this writing
+ can support ARP processing in the low thousands of ARP packets per
+ second. In some deployments, limitations on the rate of ARP
+ processing have been cited as being a problem.
+
+ To further reduce the ARP load, some routers have implemented
+ additional optimizations in their forwarding ASIC paths. For
+ example, some routers can be configured to discard ARP Requests for
+ target addresses other than those assigned to the router. That way,
+ the router's software processor only receives ARP Requests for
+ addresses it owns and must respond to. This can significantly reduce
+ the number of ARP Requests that must be processed by the router.
+
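+   The following Python fragment sketches the kind of filter described
+   above (purely illustrative; real routers implement this in the
+   forwarding ASIC): only ARP Requests whose target address is owned
+   by the router are handed to the slow-path software processor.
+
+      # Addresses configured on the router's interfaces (placeholders)
+      ROUTER_OWNED = {"192.0.2.1", "198.51.100.1"}
+
+      def punt_to_cpu(arp_target_ip: str) -> bool:
+          # Requests for other hosts' addresses are discarded in
+          # hardware instead of burdening the software processor.
+          return arp_target_ip in ROUTER_OWNED
+
+      assert punt_to_cpu("192.0.2.1")          # router must reply
+      assert not punt_to_cpu("192.0.2.55")     # host-to-host; ignore
+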
+ Another optimization concerns reducing the number of ARP queries
+ targeted at routers, whether for address resolution or to validate
+ existing cache entries. Some routers can be configured to broadcast
+ periodic gratuitous ARPs [RFC5227]. Upon receipt of a gratuitous
+ ARP, implementations mark the associated entry as "fresh", resetting
+ the aging timer to its maximum setting. Consequently, sending out
+ periodic gratuitous ARPs can effectively prevent nodes from needing
+ to send ARP Requests intended to revalidate stale entries for a
+ router. The net result is an overall reduction in the number of ARP
+ queries routers receive. Gratuitous ARPs, broadcast to all nodes in
+ the L2 broadcast domain, may in some cases also pre-populate ARP
+ caches on neighboring devices, further reducing ARP traffic. But it
+ is not believed that pre-population of ARP entries is supported by
+ most implementations, as the ARP specification [RFC0826] recommends
+ only that pre-existing ARP entries be updated upon receipt of ARP
+ messages; it does not call for the creation of new entries when none
+ already exist.
+
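+   A minimal cache model illustrating the refresh behavior described
+   above might look like the following (the timer value and structure
+   are illustrative and are not taken from any specification):
+
+      import time
+
+      ENTRY_LIFETIME = 300.0   # illustrative aging interval, seconds
+
+      class ArpCache:
+          def __init__(self):
+              self._entries = {}   # ip -> (mac, expiry timestamp)
+
+          def add(self, ip: str, mac: str) -> None:
+              # Called when this node itself resolves ip, i.e., after
+              # sending its own ARP Request and receiving a Reply.
+              self._entries[ip] = (mac, time.time() + ENTRY_LIFETIME)
+
+          def observe(self, ip: str, mac: str) -> None:
+              # Any received ARP message, including a gratuitous ARP,
+              # refreshes an existing entry but does not create a new
+              # one [RFC0826].
+              if ip in self._entries:
+                  self._entries[ip] = (mac,
+                                       time.time() + ENTRY_LIFETIME)
+
+          def lookup(self, ip: str):
+              entry = self._entries.get(ip)
+              if entry and entry[1] > time.time():
+                  return entry[0]
+              return None   # missing or stale: resolution is needed
+
+   Under this model, a periodic gratuitous ARP from the router keeps
+   the router's entry perpetually fresh in every neighbor that already
+   has one, which is exactly the effect the optimization relies on.
+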
+ Finally, another area concerns the overhead of processing IP packets
+ for which no ARP entry exists. Existing standards specify that one
+ or more IP packets for which no ARP entries exist should be queued
+ pending successful completion of the address resolution process
+ [RFC1122] [RFC1812]. Once an ARP query has been resolved, any queued
+ packets can be forwarded on. Again, the processing of such packets
+ is handled in the "slow path", effectively limiting the rate at which
+ a router can process ARP "cache misses", and is viewed as a problem
+ in some deployments today. Additionally, if no response is received,
+ the router may send the ARP/ND query multiple times. If no response
+ is received after a number of ARP/ND requests, the router needs to
+ drop any queued data packets and may send an ICMP destination
+ unreachable message as well [RFC0792]. This entire process can be
+ CPU intensive.
+
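+   The queue-and-retry behavior described above could be modeled as in
+   the following sketch (the retry count and queue depth are arbitrary
+   illustrative values; [RFC1122] and [RFC1812] leave many of these
+   details to the implementation):
+
+      from collections import deque
+
+      MAX_ARP_RETRIES = 3      # illustrative; not mandated anywhere
+      MAX_QUEUED_PACKETS = 4
+
+      class PendingResolution:
+          def __init__(self, target_ip: str):
+              self.target_ip = target_ip
+              self.retries = 0
+              self.queue = deque(maxlen=MAX_QUEUED_PACKETS)
+
+          def enqueue(self, packet: bytes) -> None:
+              self.queue.append(packet)   # held until ARP completes
+
+          def on_timeout(self, send_query, drop_with_icmp_unreach):
+              if self.retries < MAX_ARP_RETRIES:
+                  self.retries += 1
+                  send_query(self.target_ip)   # more slow-path work
+              else:
+                  # Resolution failed: drop the queued packets and
+                  # optionally send ICMP Destination Unreachable
+                  # [RFC0792].
+                  while self.queue:
+                      drop_with_icmp_unreach(self.queue.popleft())
+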
+ Although address resolution traffic remains local to one L2 network,
+ some data center designs terminate L2 domains at individual
+ aggregation switches/routers (e.g., see Section 6.4.2). Such routers
+ can be connected to a large number of interfaces (e.g., 100 or more).
+ While the address resolution traffic on any one interface may be
+ manageable, the aggregate address resolution traffic across all
+ interfaces can become problematic.
+
+ Another variant of the above issue has individual routers servicing a
+ relatively small number of interfaces, with the individual interfaces
+ themselves serving very large subnets. Once again, it is the
+ aggregate quantity of ARP traffic seen across all of the router's
+ interfaces that can be problematic. This pain point is essentially
+ the same as the one discussed above, the only difference being
+ whether a given number of hosts are spread across a few large IP
+ subnets or many smaller ones.
+
+ When hosts in two different subnets under the same L2/L3 boundary
+ router need to communicate with each other, the L2/L3 router not only
+ has to initiate ARP/ND requests to the target's subnet, it also has
+ to process the ARP/ND requests from the originating subnet. This
+ process further adds to the overall ARP processing load.
+
+7.2. IPv6 Neighbor Discovery
+
+ Though IPv6's Neighbor Discovery behaves much like ARP, there are
+ several notable differences that result in a different set of
+ potential issues. From an L2 perspective, an important difference is
+ that ND address resolution requests are sent via multicast, which
+ results in ND queries only being processed by the nodes for which
+ they are intended. Compared with broadcast ARPs, this reduces the
+ total number of ND packets that an implementation will receive.
+
+ Another key difference concerns revalidating stale ND entries. ND
+ requires that nodes periodically revalidate any entries they are
+ using, to ensure that bad entries are timed out quickly enough that
+ TCP does not terminate a connection. Consequently, some
+ implementations will send out "probe" ND queries to validate in-use
+ ND entries as frequently as every 35 seconds [RFC4861]. Such probes
+ are sent via unicast (unlike in the case of ARP). However, on larger
+ networks, such probes can result in routers receiving many such
+ queries (i.e., many more than with ARP, which does not specify such
+ behavior). Unfortunately, the IPv4 mitigation technique of sending
+ gratuitous ARPs (as described in Section 7.1) does not work in IPv6.
+ The ND specification specifically states that gratuitous ND "updates"
+ cannot cause an ND entry to be marked "valid". Rather, such entries
+ are marked "probe", which causes the receiving node to (eventually)
+ generate a probe back to the sender, which in this case is precisely
+ the behavior that the router is trying to prevent!
+
+ Routers implementing Neighbor Unreachability Discovery (NUD) (for
+ neighboring destinations) will need to process neighbor cache state
+ changes such as transitioning entries from REACHABLE to STALE. How
+ this capability is implemented may impact the scalability of ND on a
+ router. For example, one possible implementation is to have the
+ forwarding operation detect when an ND entry is referenced that needs
+ to transition from REACHABLE to STALE, by signaling an event that
+ would need to be processed by the software processor. Such an
+ implementation could increase the load on the service processor in
+ much the same way that high rates of ARP requests have led to
+ problems on some routers.
+
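+   As a rough illustration of the REACHABLE-to-STALE transition
+   discussed above (the state names follow [RFC4861], but the timer
+   handling is greatly simplified), consider:
+
+      import time
+
+      REACHABLE_TIME = 30.0   # simplified; [RFC4861] randomizes this
+
+      class NeighborEntry:
+          def __init__(self, mac: str):
+              self.mac = mac
+              self.state = "REACHABLE"
+              self.confirmed_at = time.time()
+
+          def on_forward(self) -> str:
+              # Called from the forwarding path whenever the entry is
+              # used.  Detecting the REACHABLE -> STALE transition here
+              # is the event that may have to be punted to the software
+              # processor.
+              if (self.state == "REACHABLE" and
+                      time.time() - self.confirmed_at > REACHABLE_TIME):
+                  self.state = "STALE"
+              return self.mac
+
+          def on_reachability_confirmation(self) -> None:
+              # E.g., a solicited Neighbor Advertisement was received.
+              self.state = "REACHABLE"
+              self.confirmed_at = time.time()
+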
+ It should be noted that ND does not require the sending of probes in
+ all cases. Section 7.3.1 of [RFC4861] describes a technique whereby
+ hints from TCP can be used to verify that an existing ND entry is
+ working fine and does not need to be revalidated.
+
+ Finally, IPv6 and IPv4 are often run simultaneously and in parallel
+ on the same network, i.e., in dual-stack mode. In such environments,
+ the IPv4 and IPv6 issues enumerated above compound each other.
+
+7.3. MAC Address Table Size Limitations in Switches
+
+ L2 switches maintain L2 MAC address forwarding tables for all sources
+ and destinations traversing the switch. These tables are populated
+ through learning and are used to forward L2 frames to their correct
+ destination. The larger the L2 domain, the larger the tables have to
+ be. While in theory a switch only needs to keep track of addresses
+ it is actively using (sometimes called "conversational learning"),
+ switches flood broadcast frames (e.g., from ARP), multicast frames
+ (e.g., from Neighbor Discovery), and unicast frames to unknown
+ destinations. Switches add entries for the source addresses of such
+ flooded frames to their forwarding tables. Consequently, MAC address
+ table size can become a problem as the size of the L2 domain
+ increases. The table size problem is made worse with VMs, where a
+ single physical machine now hosts many VMs (in the 10's today, but
+ growing rapidly as the number of cores per CPU increases), since each
+ VM has its own MAC address that is visible to switches.
+
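+   A minimal model of the learning behavior described above is
+   sketched below (the table size and the reaction to a full table are
+   illustrative only):
+
+      MAX_MAC_ENTRIES = 16384   # illustrative hardware table limit
+
+      class MacTable:
+          def __init__(self):
+              self.entries = {}   # MAC address -> egress port
+
+          def learn(self, src_mac: str, in_port: int) -> None:
+              if (src_mac not in self.entries and
+                      len(self.entries) >= MAX_MAC_ENTRIES):
+                  return          # table full: new sources not learned
+              self.entries[src_mac] = in_port
+
+          def forward(self, dst_mac: str, ports, in_port: int):
+              out = self.entries.get(dst_mac)
+              if out is not None:
+                  return [out]
+              # Broadcast, multicast, or unknown unicast destination:
+              # flood on every port except the ingress port.
+              return [p for p in ports if p != in_port]
+
+   Every flooded ARP or ND frame contributes a source-address entry
+   somewhere in the domain, and frames toward addresses that could not
+   be learned are themselves flooded, which is why table size becomes
+   a problem as the L2 domain grows.
+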
+ When L3 extends all the way to access switches (see Section 6.4.1),
+ the size of MAC address tables in switches is not generally a
+ problem. When L3 extends only to aggregation switches (see
+ Section 6.4.2), however, MAC table size limitations can be a real
+ issue.
+
+8. Summary
+
+ This document has outlined a number of issues related to address
+ resolution in large data centers. In particular, this document has
+ described different scenarios where such issues might arise and what
+ these potential issues are, along with outlining fundamental factors
+ that cause them. It is hoped that describing specific pain points
+ will facilitate a discussion as to whether they should be addressed
+ and how best to address them.
+
+9. Acknowledgments
+
+ This document has been significantly improved by comments from Manav
+ Bhatia, David Black, Stewart Bryant, Ralph Droms, Linda Dunbar,
+ Donald Eastlake, Wesley Eddy, Anoop Ghanwani, Joel Halpern, Sue
+ Hares, Pete Resnick, Benson Schliesser, T. Sridhar, and Lucy Yong.
+ Igor Gashinsky deserves additional credit for highlighting some of
+ the ARP-related pain points and for clarifying the difference between
+ what the standards require and what some router vendors have actually
+ implemented in response to operator requests.
+
+10. Security Considerations
+
+   This document itself has no security implications.  The security
+   vulnerabilities in ARP
+ are well known, and this document does not change or mitigate them in
+ any way. Security considerations for Neighbor Discovery are
+ discussed in [RFC4861] and [RFC6583].
+
+11. Informative References
+
+ [RFC0792] Postel, J., "Internet Control Message Protocol", STD 5,
+ RFC 792, September 1981.
+
+ [RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or
+ converting network protocol addresses to 48.bit Ethernet
+ address for transmission on Ethernet hardware", STD 37,
+ RFC 826, November 1982.
+
+ [RFC1122] Braden, R., "Requirements for Internet Hosts -
+ Communication Layers", STD 3, RFC 1122, October 1989.
+
+ [RFC1812] Baker, F., "Requirements for IP Version 4 Routers",
+ RFC 1812, June 1995.
+
+ [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
+ "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
+ September 2007.
+
+ [RFC5227] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227,
+ July 2008.
+
+ [RFC6583] Gashinsky, I., Jaeggli, J., and W. Kumari, "Operational
+ Neighbor Discovery Problems", RFC 6583, March 2012.
+
+Authors' Addresses
+
+ Thomas Narten
+ IBM Corporation
+ 3039 Cornwallis Ave.
+ PO Box 12195
+ Research Triangle Park, NC 27709-2195
+ USA
+
+ EMail: narten@us.ibm.com
+
+
+ Manish Karir
+ Merit Network Inc.
+
+ EMail: mkarir@merit.edu
+
+
+ Ian Foo
+ Huawei Technologies
+
+ EMail: Ian.Foo@huawei.com