Diffstat (limited to 'doc/rfc/rfc6820.txt')
-rw-r--r-- | doc/rfc/rfc6820.txt | 955
1 files changed, 955 insertions, 0 deletions
diff --git a/doc/rfc/rfc6820.txt b/doc/rfc/rfc6820.txt new file mode 100644 index 0000000..16923a5 --- /dev/null +++ b/doc/rfc/rfc6820.txt @@ -0,0 +1,955 @@ + + + + + + +Internet Engineering Task Force (IETF) T. Narten +Request for Comments: 6820 IBM Corporation +Category: Informational M. Karir +ISSN: 2070-1721 Merit Network Inc. + I. Foo + Huawei Technologies + January 2013 + + + Address Resolution Problems in Large Data Center Networks + +Abstract + + This document examines address resolution issues related to the + scaling of data centers with a very large number of hosts. The scope + of this document is relatively narrow, focusing on address resolution + (the Address Resolution Protocol (ARP) in IPv4 and Neighbor Discovery + (ND) in IPv6) within a data center. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Not all documents + approved by the IESG are a candidate for any level of Internet + Standard; see Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc6820. + + + + + + + + + + + + + + + + + +Narten, et al. Informational [Page 1] + +RFC 6820 ARMD-Problems January 2013 + + +Copyright Notice + + Copyright (c) 2013 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction ....................................................3 + 2. Terminology .....................................................3 + 3. Background ......................................................4 + 4. Address Resolution in IPv4 ......................................6 + 5. Address Resolution in IPv6 ......................................7 + 6. Generalized Data Center Design ..................................7 + 6.1. Access Layer ...............................................8 + 6.2. Aggregation Layer ..........................................8 + 6.3. Core .......................................................9 + 6.4. L3/L2 Topological Variations ...............................9 + 6.4.1. L3 to Access Switches ...............................9 + 6.4.2. L3 to Aggregation Switches ..........................9 + 6.4.3. L3 in the Core Only ................................10 + 6.4.4. Overlays ...........................................10 + 6.5. Factors That Affect Data Center Design ....................11 + 6.5.1. Traffic Patterns ...................................11 + 6.5.2. Virtualization .....................................11 + 6.5.3. Summary ............................................12 + 7. 
Problem Itemization ............................................12 + 7.1. ARP Processing on Routers .................................12 + 7.2. IPv6 Neighbor Discovery ...................................14 + 7.3. MAC Address Table Size Limitations in Switches ............15 + 8. Summary ........................................................15 + 9. Acknowledgments ................................................16 + 10. Security Considerations .......................................16 + 11. Informative References ........................................16 + + + + + + + + +Narten, et al. Informational [Page 2] + +RFC 6820 ARMD-Problems January 2013 + + +1. Introduction + + This document examines issues related to the scaling of large data + centers. Specifically, this document focuses on address resolution + (ARP in IPv4 and Neighbor Discovery in IPv6) within the data center. + Although strictly speaking the scope of address resolution is + confined to a single L2 broadcast domain (i.e., ARP runs at the L2 + layer below IP), the issue is complicated by routers having many + interfaces on which address resolution must be performed or with the + presence of IEEE 802.1Q domains, where individual VLANs effectively + form their own L2 broadcast domains. Thus, the scope of address + resolution spans both the L2 link and the devices attached to those + links. + + This document identifies potential issues associated with address + resolution in data centers with a large number of hosts. The scope + of this document is intentionally relatively narrow, as it mirrors + the Address Resolution for Massive numbers of hosts in the Data + center (ARMD) WG charter. This document lists "pain points" that are + being experienced in current data centers. The goal of this document + is to focus on address resolution issues and not other broader issues + that might arise in data centers. + +2. Terminology + + Address Resolution: The process of determining the link-layer + address corresponding to a given IP address. In IPv4, address + resolution is performed by ARP [RFC0826]; in IPv6, it is provided + by Neighbor Discovery (ND) [RFC4861]. + + Application: Software that runs on either a physical or virtual + machine, providing a service (e.g., web server, database server, + etc.). + + L2 Broadcast Domain: The set of all links, repeaters, and switches + that are traversed to reach all nodes that are members of a given + L2 broadcast domain. In IEEE 802.1Q networks, a broadcast domain + corresponds to a single VLAN. + + Host (or server): A computer system on the network. + + Hypervisor: Software running on a host that allows multiple VMs to + run on the same host. + + Virtual Machine (VM): A software implementation of a physical + machine that runs programs as if they were executing on a + physical, non-virtualized machine. Applications (generally) do + not know they are running on a VM as opposed to running on a + + + +Narten, et al. Informational [Page 3] + +RFC 6820 ARMD-Problems January 2013 + + + "bare" host or server, though some systems provide a + paravirtualization environment that allows an operating system or + application to be aware of the presence of virtualization for + optimization purposes. + + ToR: Top-of-Rack Switch. A switch placed in a single rack to + aggregate network connectivity to and from hosts in that rack. + + EoR: End-of-Row Switch. A switch used to aggregate network + connectivity from multiple racks. EoR switches are the next level + of switching above ToR switches. + +3. 
Background + + Large, flat L2 networks have long been known to have scaling + problems. As the size of an L2 broadcast domain increases, the level + of broadcast traffic from protocols like ARP increases. Large + amounts of broadcast traffic pose a particular burden because every + device (switch, host, and router) must process and possibly act on + such traffic. In extreme cases, "broadcast storms" can occur where + the quantity of broadcast traffic reaches a level that effectively + brings down part or all of a network. For example, poor + implementations of loop detection and prevention or misconfiguration + errors can create conditions that lead to broadcast storms as network + conditions change. The conventional wisdom for addressing such + problems has been to say "don't do that". That is, split large L2 + networks into multiple smaller L2 networks, each operating as its own + L3/IP subnet. Numerous data center networks have been designed with + this principle, e.g., with each rack placed within its own L3 IP + subnet. By doing so, the broadcast domain (and address resolution) + is confined to one ToR switch, which works well from a scaling + perspective. Unfortunately, this conflicts in some ways with the + current trend towards dynamic workload shifting in data centers and + increased virtualization, as discussed below. + + Workload placement has become a challenging task within data centers. + Ideally, it is desirable to be able to dynamically reassign workloads + within a data center in order to optimize server utilization, add + more servers in response to increased demand, etc. However, servers + are often pre-configured to run with a given set of IP addresses. + Placement of such servers is then subject to constraints of the IP + addressing restrictions of the data center. For example, servers + configured with addresses from a particular subnet could only be + placed where they connect to the IP subnet corresponding to their IP + addresses. If each ToR switch is acting as a gateway for its own + subnet, a server can only be connected to the one ToR switch. This + gateway switch represents the L2/L3 boundary. A similar constraint + occurs in virtualized environments, as discussed next. + + + +Narten, et al. Informational [Page 4] + +RFC 6820 ARMD-Problems January 2013 + + + Server virtualization is fast becoming the norm in data centers. + With server virtualization, each physical server supports multiple + virtual machines, each running its own operating system, middleware, + and applications. Virtualization is a key enabler of workload + agility, i.e., allowing any server to host any application (on its + own VM) and providing the flexibility of adding, shrinking, or moving + VMs within the physical infrastructure. Server virtualization + provides numerous benefits, including higher utilization, increased + data security, reduced user downtime, and even significant power + conservation, along with the promise of a more flexible and dynamic + computing environment. + + The discussion below focuses on VM placement and migration. Keep in + mind, however, that even in a non-virtualized environment, many of + the same issues apply to individual workloads running on standalone + machines. For example, when increasing the number of servers running + a particular workload to meet demand, placement of those workloads + may be constrained by IP subnet numbering considerations, as + discussed earlier. 
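
   To make the subnet-numbering constraint concrete, the following
   sketch (not part of the original RFC text; the rack-to-subnet
   mapping and the addresses are purely hypothetical) shows the
   placement check an operator or orchestration system effectively
   performs when each ToR switch routes its own IP subnet:

      import ipaddress

      # Hypothetical mapping of racks (ToR switches) to the IPv4
      # subnets they route when the L2/L3 boundary sits at the ToR.
      RACK_SUBNETS = {
          "rack-1": ipaddress.ip_network("192.0.2.0/26"),
          "rack-2": ipaddress.ip_network("192.0.2.64/26"),
      }

      def can_place(workload_ip, rack):
          """A workload keeping this IP address can only be placed in
          a rack whose ToR switch routes the matching subnet."""
          return ipaddress.ip_address(workload_ip) in RACK_SUBNETS[rack]

      print(can_place("192.0.2.10", "rack-1"))  # True
      print(can_place("192.0.2.10", "rack-2"))  # False: renumber first

   Moving the workload to the second rack would require renumbering it
   (and updating its clients), which is precisely the restriction that
   the larger L2 domains discussed below are intended to remove.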
+ + The greatest flexibility in VM and workload management occurs when it + is possible to place a VM (or workload) anywhere in the data center + regardless of what IP addresses the VM uses and how the physical + network is laid out. In practice, movement of VMs within a data + center is easiest when VM placement and movement do not conflict with + the IP subnet boundaries of the data center's network, so that the + VM's IP address need not be changed to reflect its actual point of + attachment on the network from an L3/IP perspective. In contrast, if + a VM moves to a new IP subnet, its address must change, and clients + will need to be made aware of that change. From a VM management + perspective, management is simplified if all servers are on a single + large L2 network. + + With virtualization, it is not uncommon to have a single physical + server host ten or more VMs, each having its own IP (and Media Access + Control (MAC)) addresses. Consequently, the number of addresses per + machine (and hence per subnet) is increasing, even when the number of + physical machines stays constant. In a few years, the numbers will + likely be even higher. + + In the past, applications were static in the sense that they tended + to stay in one physical place. An application installed on a + physical machine would stay on that machine because the cost of + moving an application elsewhere was generally high. Moreover, + physical servers hosting applications would tend to be placed in such + a way as to facilitate communication locality. That is, applications + running on servers would be physically located near the servers + hosting the applications they communicated with most heavily. The + + + +Narten, et al. Informational [Page 5] + +RFC 6820 ARMD-Problems January 2013 + + + network traffic patterns in such environments could thus be + optimized, in some cases keeping significant traffic local to one + network segment. In these more static and carefully managed + environments, it was possible to build networks that approached + scaling limitations but did not actually cross the threshold. + + Today, with the proliferation of VMs, traffic patterns are becoming + more diverse and less predictable. In particular, there can easily + be less locality of network traffic as VMs hosting applications are + moved for such reasons as reducing overall power usage (by + consolidating VMs and powering off idle machines) or moving a VM to a + physical server with more capacity or a lower load. In today's + changing environments, it is becoming more difficult to engineer + networks as traffic patterns continually shift as VMs move around. + + In summary, both the size and density of L2 networks are increasing. + In addition, increasingly dynamic workloads and the increased usage + of VMs are creating pressure for ever-larger L2 networks. Today, + there are already data centers with over 100,000 physical machines + and many times that number of VMs. This number will only increase + going forward. In addition, traffic patterns within a data center + are also constantly changing. Ultimately, the issues described in + this document might be observed at any scale, depending on the + particular design of the data center. + +4. Address Resolution in IPv4 + + In IPv4 over Ethernet, ARP provides the function of address + resolution. To determine the link-layer address of a given IP + address, a node broadcasts an ARP Request. 
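
   As an illustration only (this sketch is not part of the original
   text; it assumes the Scapy packet-crafting library, which this
   document does not reference, and uses a documentation address),
   such a broadcast query can be generated and its reply observed as
   follows:

      from scapy.all import ARP, Ether, srp  # requires root privileges

      # "Who has 192.0.2.10?" sent to the Ethernet broadcast address
      # so that every node in the L2 broadcast domain receives it.
      query = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(op="who-has",
                                                   pdst="192.0.2.10")
      answered, _ = srp(query, timeout=2, verbose=False)
      for _, reply in answered:
          print(reply[ARP].psrc, "is-at", reply[ARP].hwsrc)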
The request is delivered + to all portions of the L2 network, and the node with the requested IP + address responds with an ARP Reply. ARP is an old protocol and, by + current standards, is sparsely documented. For example, there are no + clear requirements for retransmitting ARP Requests in the absence of + replies. Consequently, implementations vary in the details of what + they actually implement [RFC0826][RFC1122]. + + From a scaling perspective, there are a number of problems with ARP. + First, it uses broadcast, and any network with a large number of + attached hosts will see a correspondingly large amount of broadcast + ARP traffic. The second problem is that it is not feasible to change + host implementations of ARP -- current implementations are too widely + entrenched, and any changes to host implementations of ARP would take + years to become sufficiently deployed to matter. That said, it may + be possible to change ARP implementations in hypervisors, L2/L3 + boundary routers, and/or ToR access switches, to leverage such + techniques as Proxy ARP. Finally, ARP implementations need to take + steps to flush out stale or otherwise invalid entries. + + + +Narten, et al. Informational [Page 6] + +RFC 6820 ARMD-Problems January 2013 + + + Unfortunately, existing standards do not provide clear implementation + guidelines for how to do this. Consequently, implementations vary + significantly, and some implementations are "chatty" in that they + just periodically flush caches every few minutes and send new ARP + queries. + +5. Address Resolution in IPv6 + + Broadly speaking, from the perspective of address resolution, IPv6's + Neighbor Discovery (ND) behaves much like ARP, with a few notable + differences. First, ARP uses broadcast, whereas ND uses multicast. + When querying for a target IP address, ND maps the target address + into an IPv6 Solicited Node multicast address. Using multicast + rather than broadcast has the benefit that the multicast frames do + not necessarily need to be sent to all parts of the network, i.e., + the frames can be sent only to segments where listeners for the + Solicited Node multicast address reside. In the case where multicast + frames are delivered to all parts of the network, sending to a + multicast address still has the advantage that most (if not all) + nodes will filter out the (unwanted) multicast query via filters + installed in the Network Interface Card (NIC) rather than burdening + host software with the need to process such packets. Thus, whereas + all nodes must process every ARP query, ND queries are processed only + by the nodes to which they are intended. In cases where multicast + filtering can't effectively be implemented in the NIC (e.g., as on + hypervisors supporting virtualization), filtering would need to be + done in software (e.g., in the hypervisor's vSwitch). + +6. Generalized Data Center Design + + There are many different ways in which data center networks might be + designed. The designs are usually engineered to suit the particular + workloads that are being deployed in the data center. For example, a + large web server farm might be engineered in a very different way + than a general-purpose multi-tenant cloud hosting service. However, + in most cases the designs can be abstracted into a typical three- + layer model consisting of an access layer, an aggregation layer, and + the Core. 
The access layer generally refers to the switches that are + closest to the physical or virtual servers; the aggregation layer + serves to interconnect multiple access-layer devices. The Core + switches connect the aggregation switches to the larger network core. + + + + + + + + + + +Narten, et al. Informational [Page 7] + +RFC 6820 ARMD-Problems January 2013 + + + Figure 1 shows a generalized data center design, which captures the + essential elements of various alternatives. + + +-----+-----+ +-----+-----+ + | Core0 | | Core1 | Core + +-----+-----+ +-----+-----+ + / \ / / + / \----------\ / + / /---------/ \ / + +-------+ +------+ + +/------+ | +/-----+ | + | Aggr11| + --------|AggrN1| + Aggregation Layer + +---+---+/ +------+/ + / \ / \ + / \ / \ + +---+ +---+ +---+ +---+ + |T11|... |T1x| |TN1| |TNy| Access Layer + +---+ +---+ +---+ +---+ + | | | | | | | | + +---+ +---+ +---+ +---+ + | |... | | | | | | + +---+ +---+ +---+ +---+ Server Racks + | |... | | | | | | + +---+ +---+ +---+ +---+ + | |... | | | | | | + +---+ +---+ +---+ +---+ + + Typical Layered Architecture in a Data Center + + Figure 1 + +6.1. Access Layer + + The access switches provide connectivity directly to/from physical + and virtual servers. The access layer may be implemented by wiring + the servers within a rack to a ToR switch or, less commonly, the + servers could be wired directly to an EoR switch. A server rack may + have a single uplink to one access switch or may have dual uplinks to + two different access switches. + +6.2. Aggregation Layer + + In a typical data center, aggregation switches interconnect many ToR + switches. Usually, there are multiple parallel aggregation switches, + serving the same group of ToRs to achieve load sharing. It is no + longer uncommon to see aggregation switches interconnecting hundreds + of ToR switches in large data centers. + + + + +Narten, et al. Informational [Page 8] + +RFC 6820 ARMD-Problems January 2013 + + +6.3. Core + + Core switches provide connectivity between aggregation switches and + the main data center network. Core switches interconnect different + sets of racks and provide connectivity to data center gateways + leading to external networks. + +6.4. L3/L2 Topological Variations + +6.4.1. L3 to Access Switches + + In this scenario, the L3 domain is extended all the way from the core + network to the access switches. Each rack enclosure consists of a + single L2 domain, which is confined to the rack. In general, there + are no significant ARP/ND scaling issues in this scenario, as the L2 + domain cannot grow very large. Such a topology has benefits in + scenarios where servers attached to a particular access switch + generally run VMs that are confined to using a single subnet. These + VMs and the applications they host aren't moved (migrated) to other + racks that might be attached to different access switches (and + different IP subnets). A small server farm or very static compute + cluster might be well served via this design. + +6.4.2. L3 to Aggregation Switches + + When the L3 domain extends only to aggregation switches, hosts in any + of the IP subnets configured on the aggregation switches can be + reachable via L2 through any access switches if access switches + enable all the VLANs. Such a topology allows a greater level of + flexibility, as servers attached to any access switch can run any VMs + that have been provisioned with IP addresses configured on the + aggregation switches. 
In such an environment, VMs can migrate + between racks without IP address changes. The drawback of this + design, however, is that multiple VLANs have to be enabled on all + access switches and all access-facing ports on aggregation switches. + Even though L2 traffic is still partitioned by VLANs, the fact that + all VLANs are enabled on all ports can lead to broadcast traffic on + all VLANs that traverse all links and ports, which has the same + effect as one big L2 domain on the access-facing side of the + aggregation switch. In addition, the internal traffic itself might + have to cross different L2 boundaries, resulting in significant + ARP/ND load at the aggregation switches. This design provides a good + tradeoff between flexibility and L2 domain size. A moderate-sized + data center might utilize this approach to provide high-availability + services at a single location. + + + + + + +Narten, et al. Informational [Page 9] + +RFC 6820 ARMD-Problems January 2013 + + +6.4.3. L3 in the Core Only + + In some cases, where a wider range of VM mobility is desired (i.e., a + greater number of racks among which VMs can move without IP address + changes), the L3 routed domain might be terminated at the core + routers themselves. In this case, VLANs can span multiple groups of + aggregation switches, which allows hosts to be moved among a greater + number of server racks without IP address changes. This scenario + results in the largest ARP/ND performance impact, as explained later. + A data center with very rapid workload shifting may consider this + kind of design. + +6.4.4. Overlays + + There are several approaches where overlay networks can be used to + build very large L2 networks to enable VM mobility. Overlay networks + using various L2 or L3 mechanisms allow interior switches/routers to + mask host addresses. In addition, L3 overlays can help the data + center designer control the size of the L2 domain and also enhance + the ability to provide multi-tenancy in data center networks. + However, the use of overlays does not eliminate traffic associated + with address resolution; it simply moves it to regular data traffic. + That is, address resolution is implemented in the overlay and is not + directly visible to the switches of the data center network. + + A potential problem that arises in a large data center is that when a + large number of hosts communicate with their peers in different + subnets, all these hosts send (and receive) data packets to their + respective L2/L3 boundary nodes, as the traffic flows are generally + bidirectional. This has the potential to further highlight any + scaling problems. These L2/L3 boundary nodes have to process ARP/ND + requests sent from originating subnets and resolve physical (MAC) + addresses in the target subnets for what are generally bidirectional + flows. Therefore, for maximum flexibility in managing the data + center workload, it is often desirable to use overlays to place + related groups of hosts in the same topological subnet to avoid the + L2/L3 boundary translation. The use of overlays in the data center + network can be a useful design mechanism to help manage a potential + bottleneck at the L2/L3 boundary by redefining where that boundary + exists. + + + + + + + + + + + +Narten, et al. Informational [Page 10] + +RFC 6820 ARMD-Problems January 2013 + + +6.5. Factors That Affect Data Center Design + +6.5.1. 
Traffic Patterns + + Expected traffic patterns play an important role in designing + appropriately sized access, aggregation, and core networks. Traffic + patterns also vary based on the expected use of the data center. + + Broadly speaking, it is desirable to keep as much traffic as possible + on the access layer in order to minimize the bandwidth usage at the + aggregation layer. If the expected use of the data center is to + serve as a large web server farm, where thousands of nodes are doing + similar things and the traffic pattern is largely in and out of a + large data center, an access layer with EoR switches might be used, + as it minimizes complexity, allows for servers and databases to be + located in the same L2 domain, and provides for maximum density. + + A data center that is expected to host a multi-tenant cloud hosting + service might have some completely unique requirements. In order to + isolate inter-customer traffic, smaller L2 domains might be + preferred, and though the size of the overall data center might be + comparable to the previous example, the multi-tenant nature of the + cloud hosting application requires a smaller and more + compartmentalized access layer. A multi-tenant environment might + also require the use of L3 all the way to the access-layer ToR + switch. + + Yet another example of a workload with a unique traffic pattern is a + high-performance compute cluster, where most of the traffic is + expected to stay within the cluster but at the same time there is a + high degree of crosstalk between the nodes. This would once again + call for a large access layer in order to minimize the requirements + at the aggregation layer. + +6.5.2. Virtualization + + Using virtualization in the data center further serves to increase + the possible densities that can be achieved. However, virtualization + also further complicates the requirements on the access layer, as + virtualization restricts the scope of server placement in the event + of server failover resulting from hardware failures or server + migration for load balancing or other reasons. + + Virtualization also can place additional requirements on the + aggregation switches in terms of address resolution table size and + the scalability of any address-learning protocols that might be used + on those switches. The use of virtualization often also requires the + use of additional VLANs for high-availability beaconing, which would + + + +Narten, et al. Informational [Page 11] + +RFC 6820 ARMD-Problems January 2013 + + + need to span the entire virtualized infrastructure. This would + require the access layer to also span the entire virtualized + infrastructure. + +6.5.3. Summary + + The designs described in this section have a number of tradeoffs. + The "L3 to access switches" design described in Section 6.4.1 is the + only design that constrains L2 domain size in a fashion that avoids + ARP/ND scaling problems. However, that design has limitations and + does not address some of the other requirements that lead to + configurations that make use of larger L2 domains. Consequently, + ARP/ND scaling issues are a real problem in practice. + +7. Problem Itemization + + This section articulates some specific problems or "pain points" that + are related to large data centers. + +7.1. ARP Processing on Routers + + One pain point with large L2 broadcast domains is that the routers + connected to the L2 domain may need to process a significant amount + of ARP traffic in some cases. 
In particular, environments where the + aggregate level of ARP traffic is very large may lead to a heavy ARP + load on routers. Even though the vast majority of ARP traffic may + not be aimed at that router, the router still has to process enough + of the ARP Request to determine whether it can safely be ignored. + The ARP algorithm specifies that a recipient must update its ARP + cache if it receives an ARP query from a source for which it has an + entry [RFC0826]. + + ARP processing in routers is commonly handled in a "slow path" + software processor, rather than directly by a hardware Application- + Specific Integrated Circuit (ASIC) as is the case when forwarding + packets. Such a design significantly limits the rate at which ARP + traffic can be processed compared to the rate at which ASICs can + forward traffic. Current implementations at the time of this writing + can support ARP processing in the low thousands of ARP packets per + second. In some deployments, limitations on the rate of ARP + processing have been cited as being a problem. + + To further reduce the ARP load, some routers have implemented + additional optimizations in their forwarding ASIC paths. For + example, some routers can be configured to discard ARP Requests for + target addresses other than those assigned to the router. That way, + the router's software processor only receives ARP Requests for + + + + +Narten, et al. Informational [Page 12] + +RFC 6820 ARMD-Problems January 2013 + + + addresses it owns and must respond to. This can significantly reduce + the number of ARP Requests that must be processed by the router. + + Another optimization concerns reducing the number of ARP queries + targeted at routers, whether for address resolution or to validate + existing cache entries. Some routers can be configured to broadcast + periodic gratuitous ARPs [RFC5227]. Upon receipt of a gratuitous + ARP, implementations mark the associated entry as "fresh", resetting + the aging timer to its maximum setting. Consequently, sending out + periodic gratuitous ARPs can effectively prevent nodes from needing + to send ARP Requests intended to revalidate stale entries for a + router. The net result is an overall reduction in the number of ARP + queries routers receive. Gratuitous ARPs, broadcast to all nodes in + the L2 broadcast domain, may in some cases also pre-populate ARP + caches on neighboring devices, further reducing ARP traffic. But it + is not believed that pre-population of ARP entries is supported by + most implementations, as the ARP specification [RFC0826] recommends + only that pre-existing ARP entries be updated upon receipt of ARP + messages; it does not call for the creation of new entries when none + already exist. + + Finally, another area concerns the overhead of processing IP packets + for which no ARP entry exists. Existing standards specify that one + or more IP packets for which no ARP entries exist should be queued + pending successful completion of the address resolution process + [RFC1122] [RFC1812]. Once an ARP query has been resolved, any queued + packets can be forwarded on. Again, the processing of such packets + is handled in the "slow path", effectively limiting the rate at which + a router can process ARP "cache misses", and is viewed as a problem + in some deployments today. Additionally, if no response is received, + the router may send the ARP/ND query multiple times. 
If no response + is received after a number of ARP/ND requests, the router needs to + drop any queued data packets and may send an ICMP destination + unreachable message as well [RFC0792]. This entire process can be + CPU intensive. + + Although address resolution traffic remains local to one L2 network, + some data center designs terminate L2 domains at individual + aggregation switches/routers (e.g., see Section 6.4.2). Such routers + can be connected to a large number of interfaces (e.g., 100 or more). + While the address resolution traffic on any one interface may be + manageable, the aggregate address resolution traffic across all + interfaces can become problematic. + + Another variant of the above issue has individual routers servicing a + relatively small number of interfaces, with the individual interfaces + themselves serving very large subnets. Once again, it is the + aggregate quantity of ARP traffic seen across all of the router's + + + +Narten, et al. Informational [Page 13] + +RFC 6820 ARMD-Problems January 2013 + + + interfaces that can be problematic. This pain point is essentially + the same as the one discussed above, the only difference being + whether a given number of hosts are spread across a few large IP + subnets or many smaller ones. + + When hosts in two different subnets under the same L2/L3 boundary + router need to communicate with each other, the L2/L3 router not only + has to initiate ARP/ND requests to the target's subnet, it also has + to process the ARP/ND requests from the originating subnet. This + process further adds to the overall ARP processing load. + +7.2. IPv6 Neighbor Discovery + + Though IPv6's Neighbor Discovery behaves much like ARP, there are + several notable differences that result in a different set of + potential issues. From an L2 perspective, an important difference is + that ND address resolution requests are sent via multicast, which + results in ND queries only being processed by the nodes for which + they are intended. Compared with broadcast ARPs, this reduces the + total number of ND packets that an implementation will receive. + + Another key difference concerns revalidating stale ND entries. ND + requires that nodes periodically revalidate any entries they are + using, to ensure that bad entries are timed out quickly enough that + TCP does not terminate a connection. Consequently, some + implementations will send out "probe" ND queries to validate in-use + ND entries as frequently as every 35 seconds [RFC4861]. Such probes + are sent via unicast (unlike in the case of ARP). However, on larger + networks, such probes can result in routers receiving many such + queries (i.e., many more than with ARP, which does not specify such + behavior). Unfortunately, the IPv4 mitigation technique of sending + gratuitous ARPs (as described in Section 7.1) does not work in IPv6. + The ND specification specifically states that gratuitous ND "updates" + cannot cause an ND entry to be marked "valid". Rather, such entries + are marked "probe", which causes the receiving node to (eventually) + generate a probe back to the sender, which in this case is precisely + the behavior that the router is trying to prevent! + + Routers implementing Neighbor Unreachability Discovery (NUD) (for + neighboring destinations) will need to process neighbor cache state + changes such as transitioning entries from REACHABLE to STALE. How + this capability is implemented may impact the scalability of ND on a + router. 
For example, one possible implementation is to have the + forwarding operation detect when an ND entry is referenced that needs + to transition from REACHABLE to STALE, by signaling an event that + would need to be processed by the software processor. Such an + implementation could increase the load on the service processor in + + + + +Narten, et al. Informational [Page 14] + +RFC 6820 ARMD-Problems January 2013 + + + much the same way that high rates of ARP requests have led to + problems on some routers. + + It should be noted that ND does not require the sending of probes in + all cases. Section 7.3.1 of [RFC4861] describes a technique whereby + hints from TCP can be used to verify that an existing ND entry is + working fine and does not need to be revalidated. + + Finally, IPv6 and IPv4 are often run simultaneously and in parallel + on the same network, i.e., in dual-stack mode. In such environments, + the IPv4 and IPv6 issues enumerated above compound each other. + +7.3. MAC Address Table Size Limitations in Switches + + L2 switches maintain L2 MAC address forwarding tables for all sources + and destinations traversing the switch. These tables are populated + through learning and are used to forward L2 frames to their correct + destination. The larger the L2 domain, the larger the tables have to + be. While in theory a switch only needs to keep track of addresses + it is actively using (sometimes called "conversational learning"), + switches flood broadcast frames (e.g., from ARP), multicast frames + (e.g., from Neighbor Discovery), and unicast frames to unknown + destinations. Switches add entries for the source addresses of such + flooded frames to their forwarding tables. Consequently, MAC address + table size can become a problem as the size of the L2 domain + increases. The table size problem is made worse with VMs, where a + single physical machine now hosts many VMs (in the 10's today, but + growing rapidly as the number of cores per CPU increases), since each + VM has its own MAC address that is visible to switches. + + When L3 extends all the way to access switches (see Section 6.4.1), + the size of MAC address tables in switches is not generally a + problem. When L3 extends only to aggregation switches (see + Section 6.4.2), however, MAC table size limitations can be a real + issue. + +8. Summary + + This document has outlined a number of issues related to address + resolution in large data centers. In particular, this document has + described different scenarios where such issues might arise and what + these potential issues are, along with outlining fundamental factors + that cause them. It is hoped that describing specific pain points + will facilitate a discussion as to whether they should be addressed + and how best to address them. + + + + + + +Narten, et al. Informational [Page 15] + +RFC 6820 ARMD-Problems January 2013 + + +9. Acknowledgments + + This document has been significantly improved by comments from Manav + Bhatia, David Black, Stewart Bryant, Ralph Droms, Linda Dunbar, + Donald Eastlake, Wesley Eddy, Anoop Ghanwani, Joel Halpern, Sue + Hares, Pete Resnick, Benson Schliesser, T. Sridhar, and Lucy Yong. + Igor Gashinsky deserves additional credit for highlighting some of + the ARP-related pain points and for clarifying the difference between + what the standards require and what some router vendors have actually + implemented in response to operator requests. + +10. 
Security Considerations + + This document does not create any security implications nor does it + have any security implications. The security vulnerabilities in ARP + are well known, and this document does not change or mitigate them in + any way. Security considerations for Neighbor Discovery are + discussed in [RFC4861] and [RFC6583]. + +11. Informative References + + [RFC0792] Postel, J., "Internet Control Message Protocol", STD 5, + RFC 792, September 1981. + + [RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or + converting network protocol addresses to 48.bit Ethernet + address for transmission on Ethernet hardware", STD 37, + RFC 826, November 1982. + + [RFC1122] Braden, R., "Requirements for Internet Hosts - + Communication Layers", STD 3, RFC 1122, October 1989. + + [RFC1812] Baker, F., "Requirements for IP Version 4 Routers", + RFC 1812, June 1995. + + [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, + "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, + September 2007. + + [RFC5227] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227, + July 2008. + + [RFC6583] Gashinsky, I., Jaeggli, J., and W. Kumari, "Operational + Neighbor Discovery Problems", RFC 6583, March 2012. + + + + + + + +Narten, et al. Informational [Page 16] + +RFC 6820 ARMD-Problems January 2013 + + +Authors' Addresses + + Thomas Narten + IBM Corporation + 3039 Cornwallis Ave. + PO Box 12195 + Research Triangle Park, NC 27709-2195 + USA + + EMail: narten@us.ibm.com + + + Manish Karir + Merit Network Inc. + + EMail: mkarir@merit.edu + + + Ian Foo + Huawei Technologies + + EMail: Ian.Foo@huawei.com + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Narten, et al. Informational [Page 17] + |