diff options
Diffstat (limited to 'doc/rfc/rfc7424.txt')
-rw-r--r-- | doc/rfc/rfc7424.txt | 1627 |
1 files changed, 1627 insertions, 0 deletions
diff --git a/doc/rfc/rfc7424.txt b/doc/rfc/rfc7424.txt new file mode 100644 index 0000000..d96cfa1 --- /dev/null +++ b/doc/rfc/rfc7424.txt @@ -0,0 +1,1627 @@ + + + + + + +Internet Engineering Task Force (IETF) R. Krishnan +Request for Comments: 7424 Brocade Communications +Category: Informational L. Yong +ISSN: 2070-1721 Huawei USA + A. Ghanwani + Dell + N. So + Vinci Systems + B. Khasnabish + ZTE Corporation + January 2015 + + + Mechanisms for Optimizing Link Aggregation Group (LAG) and + Equal-Cost Multipath (ECMP) Component Link Utilization in Networks + +Abstract + + Demands on networking infrastructure are growing exponentially due to + bandwidth-hungry applications such as rich media applications and + inter-data-center communications. In this context, it is important + to optimally use the bandwidth in wired networks that extensively use + link aggregation groups and equal-cost multipaths as techniques for + bandwidth scaling. This document explores some of the mechanisms + useful for achieving this. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Not all documents + approved by the IESG are a candidate for any level of Internet + Standard; see Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc7424. + + + + + + + + + + +Krishnan, et al. Informational [Page 1] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +Copyright Notice + + Copyright (c) 2015 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Krishnan, et al. Informational [Page 2] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +Table of Contents + + 1. Introduction ....................................................4 + 1.1. Acronyms ...................................................4 + 1.2. Terminology ................................................5 + 2. Flow Categorization .............................................6 + 3. Hash-Based Load Distribution in LAG/ECMP ........................6 + 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization ...8 + 4.1. Differences in LAG vs. ECMP ................................9 + 4.2. Operational Overview ......................................10 + 4.3. Large Flow Recognition ....................................11 + 4.3.1. Flow Identification ................................11 + 4.3.2. Criteria and Techniques for Large Flow + Recognition ........................................12 + 4.3.3. Sampling Techniques ................................12 + 4.3.4. Inline Data Path Measurement .......................14 + 4.3.5. Use of Multiple Methods for Large Flow + Recognition ........................................15 + 4.4. Options for Load Rebalancing ..............................15 + 4.4.1. Alternative Placement of Large Flows ...............15 + 4.4.2. Redistributing Small Flows .........................16 + 4.4.3. Component Link Protection Considerations ...........16 + 4.4.4. Algorithms for Load Rebalancing ....................17 + 4.4.5. Example of Load Rebalancing ........................17 + 5. Information Model for Flow Rebalancing .........................18 + 5.1. Configuration Parameters for Flow Rebalancing .............18 + 5.2. System Configuration and Identification Parameters ........19 + 5.3. Information for Alternative Placement of Large Flows ......20 + 5.4. Information for Redistribution of Small Flows .............21 + 5.5. Export of Flow Information ................................21 + 5.6. Monitoring Information ....................................21 + 5.6.1. Interface (Link) Utilization .......................21 + 5.6.2. Other Monitoring Information .......................22 + 6. Operational Considerations .....................................23 + 6.1. Rebalancing Frequency .....................................23 + 6.2. Handling Route Changes ....................................23 + 6.3. Forwarding Resources ......................................23 + 7. Security Considerations ........................................23 + 8. References .....................................................24 + 8.1. Normative References ......................................24 + 8.2. Informative References ....................................25 + Appendix A. Internet Traffic Analysis and Load-Balancing + Simulation ...........................................28 + Acknowledgements ..................................................28 + Contributors ......................................................28 + Authors' Addresses ................................................29 + + + + + +Krishnan, et al. Informational [Page 3] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +1. Introduction + + Networks extensively use link aggregation groups (LAGs) [802.1AX] and + equal-cost multipaths (ECMPs) [RFC2991] as techniques for capacity + scaling. For the problems addressed by this document, network + traffic can be predominantly categorized into two traffic types: + long-lived large flows and other flows. These other flows, which + include long-lived small flows, short-lived small flows, and short- + lived large flows, are referred to as "small flows" in this document. + Long-lived large flows are simply referred to as "large flows". + + Stateless hash-based techniques [ITCOM] [RFC2991] [RFC2992] [RFC6790] + are often used to distribute both large flows and small flows over + the component links in a LAG/ECMP. However, the traffic may not be + evenly distributed over the component links due to the traffic + pattern. + + This document describes mechanisms for optimizing LAG/ECMP component + link utilization when using hash-based techniques. The mechanisms + comprise the following steps: 1) recognizing large flows in a router, + and 2) assigning the large flows to specific LAG/ECMP component links + or redistributing the small flows when a component link on the router + is congested. + + It is useful to keep in mind that in typical use cases for these + mechanisms, the large flows consume a significant amount of bandwidth + on a link, e.g., greater than 5% of link bandwidth. The number of + such flows would necessarily be fairly small, e.g., on the order of + 10s or 100s per LAG/ECMP. In other words, the number of large flows + is NOT expected to be on the order of millions of flows. Examples of + such large flows would be IPsec tunnels in service provider backbone + networks or storage backup traffic in data center networks. + +1.1. Acronyms + + DoS: Denial of Service + + ECMP: Equal-Cost Multipath + + GRE: Generic Routing Encapsulation + + IPFIX: IP Flow Information Export + + LAG: Link Aggregation Group + + MPLS: Multiprotocol Label Switching + + NVGRE: Network Virtualization using Generic Routing Encapsulation + + + +Krishnan, et al. Informational [Page 4] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + PBR: Policy-Based Routing + + QoS: Quality of Service + + STT: Stateless Transport Tunneling + + VXLAN: Virtual eXtensible LAN + +1.2. Terminology + + Central management entity: + An entity that is capable of monitoring information about link + utilization and flows in routers across the network and may be + capable of making traffic-engineering decisions for placement of + large flows. It may include the functions of a collector + [RFC7011]. + + ECMP component link: + An individual next hop within an ECMP group. An ECMP component + link may itself comprise a LAG. + + ECMP table: + A table that is used as the next hop of an ECMP route that + comprises the set of ECMP component links and the weights + associated with each of those ECMP component links. The input for + looking up the table is the hash value for the packet, and the + weights are used to determine which values of the hash function + map to a given ECMP component link. + + Flow (large or small): + A sequence of packets for which ordered delivery should be + maintained, e.g., packets belonging to the same TCP connection. + + LAG component link: + An individual link within a LAG. A LAG component link is + typically a physical link. + + LAG table: + A table that is used as the output port, which is a LAG, that + comprises the set of LAG component links and the weights + associated with each of those component links. The input for + looking up the table is the hash value for the packet, and the + weights are used to determine which values of the hash function + map to a given LAG component link. + + Large flow(s): + Refers to long-lived large flow(s). + + + + +Krishnan, et al. Informational [Page 5] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Small flow(s): + Refers to any of, or a combination of, long-lived small flow(s), + short-lived small flows, and short-lived large flow(s). + +2. Flow Categorization + + In general, based on the size and duration, a flow can be categorized + into any one of the following four types, as shown in Figure 1: + + o short-lived large flow (SLLF), + + o short-lived small flow (SLSF), + + o long-lived large flow (LLLF), and + + o long-lived small flow (LLSF). + + Flow Bandwidth + ^ + |--------------------|--------------------| + | | | + Large | SLLF | LLLF | + Flow | | | + |--------------------|--------------------| + | | | + Small | SLSF | LLSF | + Flow | | | + +--------------------+--------------------+-->Flow Duration + Short-Lived Long-Lived + Flow Flow + + Figure 1: Flow Categorization + + In this document, as mentioned earlier, we categorize long-lived + large flows as "large flows", and all of the others (long-lived small + flows, short-lived small flows, and short-lived large flows) as + "small flows". + +3. Hash-Based Load Distribution in LAG/ECMP + + Hash-based techniques are often used for load balancing of traffic to + select among multiple available paths within a LAG/ECMP group. The + advantages of hash-based techniques for load distribution are the + preservation of the packet sequence in a flow and the real-time + distribution without maintaining per-flow state in the router. Hash- + based techniques use a combination of fields in the packet's headers + + + + + +Krishnan, et al. Informational [Page 6] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + to identify a flow, and the hash function computed using these fields + is used to generate a unique number that identifies a link/path in a + LAG/ECMP group. The result of the hashing procedure is a many-to-one + mapping of flows to component links. + + Hash-based techniques produce good results with respect to + utilization of the individual component links if: + + o the traffic mix constitutes flows such that the result of the hash + function across these flows is fairly uniform so that a similar + number of flows is mapped to each component link, + + o the individual flow rates are much smaller as compared to the link + capacity, and + + o the differences in flow rates are not dramatic. + + However, if one or more of these conditions are not met, hash-based + techniques may result in imbalance in the loads on individual + component links. + + An example is illustrated in Figure 2. As shown, there are two + routers, R1 and R2, and there is a LAG between them that has three + component links (1), (2), and (3). A total of ten flows need to be + distributed across the links in this LAG. The result of applying the + hash-based technique is as follows: + + o Component link (1) has three flows (two small flows and one large + flow), and the link utilization is normal. + + o Component link (2) has three flows (three small flows and no large + flows), and the link utilization is light. + + - The absence of any large flow causes the component link to be + underutilized. + + o Component link (3) has four flows (two small flows and two large + flows), and the link capacity is exceeded resulting in congestion. + + - The presence of two large flows causes congestion on this + component link. + + + + + + + + + + +Krishnan, et al. Informational [Page 7] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + +-----------+ -> +-----------+ + | | -> | | + | | ===> | | + | (1)|--------|(1) | + | | -> | | + | | -> | | + | (R1) | -> | (R2) | + | (2)|--------|(2) | + | | -> | | + | | -> | | + | | ===> | | + | | ===> | | + | (3)|--------|(3) | + | | | | + +-----------+ +-----------+ + + Where: -> small flow + ===> large flow + + Figure 2: Unevenly Utilized Component Links + + This document presents mechanisms for addressing the imbalance in + load distribution resulting from commonly used hash-based techniques + for LAG/ECMP that are shown in the above example. The mechanisms use + large flow awareness to compensate for the imbalance in load + distribution. + +4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization + + The suggested mechanisms in this document are local optimization + solutions; they are local in the sense that both the identification + of large flows and rebalancing of the load can be accomplished + completely within individual routers in the network without the need + for interaction with other routers. + + This approach may not yield a global optimization of the placement of + large flows across multiple routers in a network, which may be + desirable in some networks. On the other hand, a local approach may + be adequate for some environments for the following reasons: + + 1) Different links within a network experience different levels of + utilization; thus, a "targeted" solution is needed for those hot + spots in the network. An example is the utilization of a LAG + between two routers that needs to be optimized. + + + + + + + +Krishnan, et al. Informational [Page 8] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + 2) Some networks may lack end-to-end visibility, e.g., when a + certain network, under the control of a given operator, is a + transit network for traffic from other networks that are not + under the control of the same operator. + +4.1. Differences in LAG vs. ECMP + + While the mechanisms explained herein are applicable to both LAGs and + ECMP groups, it is useful to note that there are some key differences + between the two that may impact how effective the mechanisms are. + This relates, in part, to the localized information with which the + mechanisms are intended to operate. + + A LAG is usually established across links that are between two + adjacent routers. As a result, the scope of the problem of + optimizing the bandwidth utilization on the component links is fairly + narrow. It simply involves rebalancing the load across the component + links between these two routers, and there is no impact whatsoever to + other parts of the network. The scheme works equally well for + unicast and multicast flows. + + On the other hand, with ECMP, redistributing the load across + component links that are part of the ECMP group may impact traffic + patterns at all of the routers that are downstream of the given + router between itself and the destination. The local optimization + may result in congestion at a downstream node. (In its simplest + form, an ECMP group may be used to distribute traffic on component + links that are between two adjacent routers, and in that case, the + ECMP group is no different than a LAG for the purpose of this + discussion. It should be noted that an ECMP component link may + itself comprise a LAG, in which case the scheme may be further + applied to the component links within the LAG.) + + To demonstrate the limitations of local optimization, consider a two- + level Clos network topology as shown in Figure 3 with three leaf + routers (L1, L2, and L3) and two spine routers (S1 and S2). Assume + all of the links are 10 Gbps. + + Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one + flow of 7 Gbps also towards L3. If L1 balances the load optimally + between S1 and S2, and L2 sends the flow via S1, then the downlink + from S1 to L3 would get congested, resulting in packet discards. On + the other hand, if L1 had sent both its flows towards S1 and L2 had + sent its flow towards S2, there would have been no congestion at + either S1 or S2. + + + + + + +Krishnan, et al. Informational [Page 9] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + +-----+ +-----+ + | S1 | | S2 | + +-----+ +-----+ + / \ \ / /\ + / +---------+ / \ + / / \ \ / \ + / / \ +------+ \ + / / \ / \ \ + +-----+ +-----+ +-----+ + | L1 | | L2 | | L3 | + +-----+ +-----+ +-----+ + + Figure 3: Two-Level Clos Network + + The other issue with applying this scheme to ECMP groups is that it + may not apply equally to unicast and multicast traffic because of the + way multicast trees are constructed. + + Finally, it is possible for a single physical link to participate as + a component link in multiple ECMP groups, whereas with LAGs, a link + can participate as a component link of only one LAG. + +4.2. Operational Overview + + The various steps in optimizing LAG/ECMP component link utilization + in networks are detailed below: + + Step 1: + This step involves recognizing large flows in routers and + maintaining the mapping for each large flow to the component link + that it uses. Recognition of large flows is explained in Section + 4.3. + + Step 2: + The egress component links are periodically scanned for link + utilization, and the imbalance for the LAG/ECMP group is + monitored. If the imbalance exceeds a certain threshold, then + rebalancing is triggered. Measurement of the imbalance is + discussed further in Section 5.1. In addition to the imbalance, + further criteria (such as the maximum utilization of any of the + component links) may also be used to determine whether or not to + trigger rebalancing. The use of sampling techniques for the + measurement of egress component link utilization, including the + issues of depending on ingress sampling for these measurements, + are discussed in Section 4.3.3. + + + + + + +Krishnan, et al. Informational [Page 10] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Step 3: + As a part of rebalancing, the operator can choose to rebalance the + large flows by placing them on lightly loaded component links of + the LAG/ECMP group, redistribute the small flows on the congested + link to other component links of the group, or a combination of + both. + + All of the steps identified above can be done locally within the + router itself or could involve the use of a central management + entity. + + Providing large flow information to a central management entity + provides the capability to globally optimize flow distribution as + described in Section 4.1. Consider the following example. A router + may have three ECMP next hops that lead down paths P1, P2, and P3. A + couple of hops downstream on path P1, there may be a congested link, + while paths P2 and P3 may be underutilized. This is something that + the local router does not have visibility into. With the help of a + central management entity, the operator could redistribute some of + the flows from P1 to P2 and/or P3, resulting in a more optimized flow + of traffic. + + The steps described above are especially useful when bundling links + of different bandwidths, e.g., 10 Gbps and 100 Gbps as described in + [RFC7226]. + +4.3. Large Flow Recognition + +4.3.1. Flow Identification + + Flows are typically identified using one or more fields from the + packet header, for example: + + o Layer 2: Source Media Access Control (MAC) address, destination + MAC address, VLAN ID. + + o IP header: IP protocol, IP source address, IP destination address, + flow label (IPv6 only). + + o Transport protocol header: Source port number, destination port + number. These apply to protocols such as TCP, UDP, and the Stream + Control Transmission Protocol (SCTP). + + o MPLS labels. + + For tunneling protocols like Generic Routing Encapsulation (GRE) + [RFC2784], Virtual eXtensible LAN (VXLAN) [RFC7348], Network + Virtualization using Generic Routing Encapsulation (NVGRE) [NVGRE], + + + +Krishnan, et al. Informational [Page 11] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling Protocol + (L2TP) [RFC3931], etc., flow identification is possible based on + inner and/or outer headers as well as fields introduced by the tunnel + header, as any or all such fields may be used for load balancing + decisions [RFC5640]. + + The above list is not exhaustive. + + The mechanisms described in this document are agnostic to the fields + that are used for flow identification. + + This method of flow identification is consistent with that of IPFIX + [RFC7011]. + +4.3.2. Criteria and Techniques for Large Flow Recognition + + From the perspective of bandwidth and time duration, in order to + recognize large flows, we define an observation interval and measure + the bandwidth of the flow over that interval. A flow that exceeds a + certain minimum bandwidth threshold over that observation interval + would be considered a large flow. + + The two parameters -- the observation interval and the minimum + bandwidth threshold over that observation interval -- should be + programmable to facilitate handling of different use cases and + traffic characteristics. For example, a flow that is at or above 10% + of link bandwidth for a time period of at least one second could be + declared a large flow [DEVOFLOW]. + + In order to avoid excessive churn in the rebalancing, once a flow has + been recognized as a large flow, it should continue to be recognized + as a large flow for as long as the traffic received during an + observation interval exceeds some fraction of the bandwidth + threshold, for example, 80% of the bandwidth threshold. + + Various techniques to recognize a large flow are described in + Sections 4.3.3, 4.3.4, and 4.3.5. + +4.3.3. Sampling Techniques + + A number of routers support sampling techniques such as sFlow + [sFlow-v5] [sFlow-LAG], Packet Sampling (PSAMP) [RFC5475], and + NetFlow Sampling [RFC3954]. For the purpose of large flow + recognition, sampling needs to be enabled on all of the egress ports + in the router where such measurements are desired. + + + + + + +Krishnan, et al. Informational [Page 12] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Using sFlow as an example, processing in an sFlow collector can + provide an approximate indication of the mapping of large flows to + each of the component links in each LAG/ECMP group. Assuming + sufficient control plane resources are available, it is possible to + implement this part of the collector function in the control plane of + the router to reduce dependence on a central management entity. + + If egress sampling is not available, ingress sampling can suffice + since the central management entity used by the sampling technique + typically has visibility across multiple routers in a network and can + use the samples from an immediately downstream router to make + measurements for egress traffic at the local router. + + The option of using ingress sampling for this purpose may not be + available if the downstream router is under the control of a + different operator or if the downstream device does not support + sampling. + + Alternatively, since sampling techniques require that the sample be + annotated with the packet's egress port information, ingress sampling + may suffice. However, this means that sampling would have to be + enabled on all ports, rather than only on those ports where such + monitoring is desired. There is one situation in which this approach + may not work. If there are tunnels that originate from the given + router and if the resulting tunnel comprises the large flow, then + this cannot be deduced from ingress sampling at the given router. + Instead, for this scenario, if egress sampling is unavailable, then + ingress sampling from the downstream router must be used. + + + To illustrate the use of ingress versus egress sampling, we refer to + Figure 2. Since we are looking at rebalancing flows at R1, we would + need to enable egress sampling on ports (1), (2), and (3) on R1. If + egress sampling is not available and if R2 is also under the control + of the same administrator, enabling ingress sampling on R2's ports + (1), (2), and (3) would also work, but it would necessitate the + involvement of a central management entity in order for R1 to obtain + large flow information for each of its links. Finally, R1 can only + enable ingress sampling on all of its ports (not just the ports that + are part of the LAG/ECMP group being monitored), and that would + suffice if the sampling technique annotates the samples with the + egress port information. + + + + + + + + + +Krishnan, et al. Informational [Page 13] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + The advantages and disadvantages of sampling techniques are as + follows. + + Advantages: + + o Supported in most existing routers. + + o Requires minimal router resources. + + Disadvantage: + + o In order to minimize the error inherent in sampling, there is a + minimum delay for the recognition time of large flows, and in the + time that it takes to react to this information. + + With sampling, the detection of large flows can be done on the order + of one second [DEVOFLOW]. A discussion on determining the + appropriate sampling frequency is available in [SAMP-BASIC]. + +4.3.4. Inline Data Path Measurement + + Implementations may perform recognition of large flows by performing + measurements on traffic in the data path of a router. Such an + approach would be expected to operate at the interface speed on every + interface, accounting for all packets processed by the data path of + the router. An example of such an approach is described in IPFIX + [RFC5470]. + + Using inline data path measurement, a faster and more accurate + indication of large flows mapped to each of the component links in a + LAG/ECMP group may be possible (as compared to the sampling-based + approach). + + The advantages and disadvantages of inline data path measurement are + as follows: + + Advantages: + + o As link speeds get higher, sampling rates are typically reduced to + keep the number of samples manageable, which places a lower bound + on the detection time. With inline data path measurement, large + flows can be recognized in shorter windows on higher link speeds + since every packet is accounted for [NDTM]. + + o Inline data path measurement eliminates the potential dependence + on a central management entity for large flow recognition. + + + + + +Krishnan, et al. Informational [Page 14] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Disadvantage: + + o Inline data path measurement is more resource intensive in terms + of the table sizes required for monitoring all flows. + + As mentioned earlier, the observation interval for determining a + large flow and the bandwidth threshold for classifying a flow as a + large flow should be programmable parameters in a router. + + The implementation details of inline data path measurement of large + flows is vendor dependent and beyond the scope of this document. + +4.3.5. Use of Multiple Methods for Large Flow Recognition + + It is possible that a router may have line cards that support a + sampling technique while other line cards support inline data path + measurement. As long as there is a way for the router to reliably + determine the mapping of large flows to component links of a LAG/ECMP + group, it is acceptable for the router to use more than one method + for large flow recognition. + + If both methods are supported, inline data path measurement may be + preferable because of its speed of detection [FLOW-ACC]. + +4.4. Options for Load Rebalancing + + The following subsections describe suggested techniques for load + balancing. Equipment vendors may implement more than one technique, + including those not described in this document, and allow the + operator to choose between them. + + Note that regardless of the method used, perfect rebalancing of large + flows may not be possible since flows arrive and depart at different + times. Also, any flows that are moved from one component link to + another may experience momentary packet reordering. + +4.4.1. Alternative Placement of Large Flows + + Within a LAG/ECMP group, member component links with the least + average link utilization are identified. Some large flow(s) from the + heavily loaded component links are then moved to those lightly loaded + member component links using a PBR rule in the ingress processing + element(s) in the routers. + + With this approach, only certain large flows are subjected to + momentary flow reordering. + + + + + +Krishnan, et al. Informational [Page 15] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Moving a large flow will increase the utilization of the link that it + is moved to, potentially once again creating an imbalance in the + utilization across the component links. Therefore, when moving a + large flow, care must be taken to account for the existing load and + the future load after the large flow has been moved. Further, the + appearance of new large flows may require a rearrangement of the + placement of existing flows. + + Consider a case where there is a LAG compromising four 10 Gbps + component links and there are four large flows, each of 1 Gbps. + These flows are each placed on one of the component links. + Subsequently, a fifth large flow of 2 Gbps is recognized, and to + maintain equitable load distribution, it may require placement of one + of the existing 1 Gbps flow to a different component link. This + would still result in some imbalance in the utilization across the + component links. + +4.4.2. Redistributing Small Flows + + Some large flows may consume the entire bandwidth of the component + link(s). In this case, it would be desirable for the small flows to + not use the congested component link(s). + + o The LAG/ECMP table is modified to include only non-congested + component link(s). Small flows hash into this table to be mapped + to a destination component link. Alternatively, if certain + component links are heavily loaded but not congested, the output + of the hash function can be adjusted to account for large flow + loading on each of the component links. + + o The PBR rules for large flows (refer to Section 4.4.1) must have + strict precedence over the LAG/ECMP table lookup result. + + This method works on some existing router hardware. The idea is to + prevent, or reduce the probability, that a small flow hashes into the + congested component link(s). + + With this approach, the small flows that are moved would be subject + to reordering. + +4.4.3. Component Link Protection Considerations + + If desired, certain component links may be reserved for link + protection. These reserved component links are not used for any + flows in the absence of any failures. When there is a failure of one + or more component links, all the flows on the failed component + link(s) are moved to the reserved component link(s). The mapping + table of large flows to component links simply replaces the failed + + + +Krishnan, et al. Informational [Page 16] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + component link with the reserved component link. Likewise, the + LAG/ECMP table replaces the failed component link with the reserved + component link. + +4.4.4. Algorithms for Load Rebalancing + + Specific algorithms for placement of large flows are out of the scope + of this document. One possibility is to formulate the problem for + large flow placement as the well-known bin-packing problem and make + use of the various heuristics that are available for that problem + [BIN-PACK]. + +4.4.5. Example of Load Rebalancing + + Optimizing LAG/ECMP component utilization for the use case in Figure + 2 is depicted below in Figure 4. The large flow rebalancing + explained in Section 4.4.1 is used. The improved link utilization is + as follows: + + o Component link (1) has three flows (two small flows and one large + flow), and the link utilization is normal. + + o Component link (2) has four flows (three small flows and one large + flow), and the link utilization is normal now. + + o Component link (3) has three flows (two small flows and one large + flow), and the link utilization is normal now. + + + + + + + + + + + + + + + + + + + + + + + + +Krishnan, et al. Informational [Page 17] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + +-----------+ -> +-----------+ + | | -> | | + | | ===> | | + | (1)|--------|(1) | + | | | | + | | ===> | | + | | -> | | + | | -> | | + | (R1) | -> | (R2) | + | (2)|--------|(2) | + | | | | + | | -> | | + | | -> | | + | | ===> | | + | (3)|--------|(3) | + | | | | + +-----------+ +-----------+ + + Where: -> small flow + ===> large flow + + Figure 4: Evenly Utilized Composite Links + + Basically, the use of the mechanisms described in Section 4.4.1 + resulted in a rebalancing of flows where one of the large flows on + component link (3), which was previously congested, was moved to + component link (2), which was previously underutilized. + +5. Information Model for Flow Rebalancing + + In order to support flow rebalancing in a router from an external + system, the exchange of some information is necessary between the + router and the external system. This section provides an exemplary + information model covering the various components needed for this + purpose. The model is intended to be informational and may be used + as a guide for the development of a data model. + +5.1. Configuration Parameters for Flow Rebalancing + + The following parameters are required for configuration of this + feature: + + o Large flow recognition parameters: + + - Observation interval: The observation interval is the time + period in seconds over which packet arrivals are observed for + the purpose of large flow recognition. + + + + +Krishnan, et al. Informational [Page 18] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + - Minimum bandwidth threshold: The minimum bandwidth threshold + would be configured as a percentage of link speed and + translated into a number of bytes over the observation + interval. A flow for which the number of bytes received over a + given observation interval exceeds this number would be + recognized as a large flow. + + - Minimum bandwidth threshold for large flow maintenance: The + minimum bandwidth threshold for large flow maintenance is used + to provide hysteresis for large flow recognition. Once a flow + is recognized as a large flow, it continues to be recognized as + a large flow until it falls below this threshold. This is also + configured as a percentage of link speed and is typically lower + than the minimum bandwidth threshold defined above. + + o Imbalance threshold: A measure of the deviation of the component + link utilizations from the utilization of the overall LAG/ECMP + group. Since component links can be different speeds, the + imbalance can be computed as follows. Let the utilization of each + component link in a LAG/ECMP group with n links of speed b_1, b_2 + .. b_n be u_1, u_2 .. u_n. The mean utilization is computed as + + u_ave = [ (u_1 * b_1) + (u_2 * b_2) + .. + (u_n * b_n) ] / + [b_1 + b_2 + .. + b_n]. + + The imbalance is then computed as + + max_{i=1..n} | u_i - u_ave |. + + o Rebalancing interval: The minimum amount of time between + rebalancing events. This parameter ensures that rebalancing is + not invoked too frequently as it impacts packet ordering. + + These parameters may be configured on a system-wide basis or may + apply to an individual LAG/ECMP group. They may be applied to an + ECMP group, provided that the component links are not shared with any + other ECMP group. + +5.2. System Configuration and Identification Parameters + + The following parameters are useful for router configuration and + operation when using the mechanisms in this document. + + o IP address: The IP address of a specific router that the feature + is being configured on or that the large flow placement is being + applied to. + + + + + +Krishnan, et al. Informational [Page 19] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + o LAG ID: Identifies the LAG on a given router. The LAG ID may be + required when configuring this feature (to apply a specific set of + large flow identification parameters to the LAG) and will be + required when specifying flow placement to achieve the desired + rebalancing. + + o Component Link ID: Identifies the component link within a LAG or + ECMP group. This is required when specifying flow placement to + achieve the desired rebalancing. + + o Component Link Weight: The relative weight to be applied to + traffic for a given component link when using hash-based + techniques for load distribution. + + o ECMP group: Identifies a particular ECMP group. The ECMP group + may be required when configuring this feature (to apply a specific + set of large flow identification parameters to the ECMP group) and + will be required when specifying flow placement to achieve the + desired rebalancing. We note that multiple ECMP groups can share + an overlapping set (or non-overlapping subset) of component links. + This document does not deal with the complexity of addressing such + configurations. + + The feature may be configured globally for all LAGs and/or for all + ECMP groups, or it may be configured specifically for a given LAG or + ECMP group. + +5.3. Information for Alternative Placement of Large Flows + + In cases where large flow recognition is handled by a central + management entity (see Section 4.3.3), an information model for flows + is required to allow the import of large flow information to the + router. + + Typical fields used for identifying large flows were discussed in + Section 4.3.1. The IPFIX information model [RFC7012] can be + leveraged for large flow identification. + + Large flow placement is achieved by specifying the relevant flow + information along with the following: + + o For LAG: router's IP address, LAG ID, LAG component link ID. + + o For ECMP: router's IP address, ECMP group, ECMP component link ID. + + In the case where the ECMP component link itself comprises a LAG, we + would have to specify the parameters for both the ECMP group as well + as the LAG to which the large flow is being directed. + + + +Krishnan, et al. Informational [Page 20] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +5.4. Information for Redistribution of Small Flows + + Redistribution of small flows is done using the following: + + o For LAG: The LAG ID and the component link IDs along with the + relative weight of traffic to be assigned to each component link + ID are required. + + o For ECMP: The ECMP group and the ECMP next hop along with the + relative weight of traffic to be assigned to each ECMP next hop + are required. + + It is possible to have an ECMP next hop that itself comprises a LAG. + In that case, we would have to specify the new weights for both the + ECMP component links and the LAG component links. + + In the case where an ECMP component link itself comprises a LAG, we + would have to specify new weights for both the component links within + the ECMP group as well as the component links within the LAG. + +5.5. Export of Flow Information + + Exporting large flow information is required when large flow + recognition is being done on a router but the decision to rebalance + is being made in a central management entity. Large flow information + includes flow identification and the component link ID that the flow + is currently assigned to. Other information such as flow QoS and + bandwidth may be exported too. + + The IPFIX information model [RFC7012] can be leveraged for large flow + identification. + +5.6. Monitoring Information + +5.6.1. Interface (Link) Utilization + + The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and + interface speed (ifSpeed) can be obtained, for example, from the + Interfaces table (ifTable) in the MIB module defined in [RFC1213]. + + The link utilization can then be computed as follows: + + Incoming link utilization = (delta_ifInOctets * 8) / (ifSpeed * T) + + Outgoing link utilization = (delta_ifOutOctets * 8) / (ifSpeed * T) + + + + + + +Krishnan, et al. Informational [Page 21] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + Where T is the interval over which the utilization is being measured, + delta_ifInOctets is the change in ifInOctets over that interval, and + delta_ifOutOctets is the change in ifOutOctets over that interval. + + For high-speed Ethernet links, the etherStatsHighCapacityTable in the + MIB module defined in [RFC3273] can be used. + + Similar results may be achieved using the corresponding objects of + other interface management data models such as YANG [RFC7223] if + those are used instead of MIBs. + + For scalability, it is recommended to use the counter push mechanism + in [sFlow-v5] for the interface counters. Doing so would help avoid + counter polling through the MIB interface. + + The outgoing link utilization of the component links within a + LAG/ECMP group can be used to compute the imbalance (see Section 5.1) + for the LAG/ECMP group. + +5.6.2. Other Monitoring Information + + Additional monitoring information that is useful includes: + + o Number of times rebalancing was done. + + o Time since the last rebalancing event. + + o The number of large flows currently rebalanced by the scheme. + + o A list of the large flows that have been rebalanced including + + - the rate of each large flow at the time of the last rebalancing + for that flow, + + - the time that rebalancing was last performed for the given + large flow, and + + - the interfaces that the large flows was (re)directed to. + + o The settings for the weights of the interfaces within a LAG/ECMP + group used by the small flows that depend on hashing. + + + + + + + + + + +Krishnan, et al. Informational [Page 22] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +6. Operational Considerations + +6.1. Rebalancing Frequency + + Flows should be rebalanced only when the imbalance in the utilization + across component links exceeds a certain threshold. Frequent + rebalancing to achieve precise equitable utilization across component + links could be counterproductive as it may result in moving flows + back and forth between the component links, impacting packet ordering + and system stability. This applies regardless of whether large flows + or small flows are redistributed. It should be noted that reordering + is a concern for TCP flows with even a few packets because three out- + of-order packets would trigger sufficient duplicate ACKs to the + sender, resulting in a retransmission [RFC5681]. + + The operator would have to experiment with various values of the + large flow recognition parameters (minimum bandwidth threshold, + minimum bandwidth threshold for large flow maintenance, and + observation interval) and the imbalance threshold across component + links to tune the solution for their environment. + +6.2. Handling Route Changes + + Large flow rebalancing must be aware of any changes to the Forwarding + Information Base (FIB). In cases where the next hop of a route no + longer to points to the LAG or to an ECMP group, any PBR entries + added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in + order to avoid the creation of forwarding loops. + +6.3. Forwarding Resources + + Hash-based techniques used for load balancing with LAG/ECMP are + usually stateless. The mechanisms described in this document require + additional resources in the forwarding plane of routers for creating + PBR rules that are capable of overriding the forwarding decision from + the hash-based approach. These resources may limit the number of + flows that can be rebalanced and may also impact the latency + experienced by packets due to the additional lookups that are + required. + +7. Security Considerations + + This document does not directly impact the security of the Internet + infrastructure or its applications. In fact, it could help if there + is a DoS attack pattern that causes a hash imbalance resulting in + heavy overloading of large flows to certain LAG/ECMP component links. + + + + + +Krishnan, et al. Informational [Page 23] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + An attacker with knowledge of the large flow recognition algorithm + and any stateless distribution method can generate flows that are + distributed in a way that overloads a specific path. This could be + used to cause the creation of PBR rules that exhaust the available + PBR rule capacity on routers in the network. If PBR rules are + consequently discarded, this could result in congestion on the + attacker-selected path. Alternatively, tracking large numbers of PBR + rules could result in performance degradation. + +8. References + +8.1. Normative References + + [802.1AX] IEEE, "IEEE Standard for Local and metropolitan area + networks - Link Aggregation", IEEE Std 802.1AX-2008, + 2008. + + [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast + and Multicast Next-Hop Selection", RFC 2991, November + 2000, <http://www.rfc-editor.org/info/rfc2991>. + + [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, + "Specification of the IP Flow Information Export (IPFIX) + Protocol for the Exchange of Flow Information", STD 77, + RFC 7011, September 2013, + <http://www.rfc-editor.org/info/rfc7011>. + + [RFC7012] Claise, B., Ed., and B. Trammell, Ed., "Information + Model for IP Flow Information Export (IPFIX)", RFC 7012, + September 2013, + <http://www.rfc-editor.org/info/rfc7012>. + + + + + + + + + + + + + + + + + + + + +Krishnan, et al. Informational [Page 24] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +8.2. Informative References + + [BIN-PACK] Coffman, Jr., E., Garey, M., and D. Johnson. + "Approximation Algorithms for Bin-Packing -- An Updated + Survey" (in "Algorithm Design for Computer System + Design"), Springer, 1984. + + [CAIDA] "Caida Traffic Analysis Research", + <http://www.caida.org/research/traffic-analysis/>. + + [DEVOFLOW] Mogul, J., Tourrilhes, J., Yalagandula, P., Sharma, P., + Curtis, R., and S. Banerjee, "DevoFlow: Cost-Effective + Flow Management for High Performance Enterprise + Networks", Proceedings of the ACM SIGCOMM, 2010. + + [FLOW-ACC] Zseby, T., Hirsch, T., and B. Claise, "Packet Sampling + for Flow Accounting: Challenges and Limitations", + Proceedings of the 9th international Passive and Active + Measurement Conference, 2008. + + [ITCOM] Jo, J., Kim, Y., Chao, H., and F. Merat, "Internet + traffic load balancing using dynamic hashing with flow + volume", SPIE ITCOM, 2002. + + [NDTM] Estan, C. and G. Varghese, "New Directions in Traffic + Measurement and Accounting", Proceedings of ACM SIGCOMM, + August 2002. + + [NVGRE] Garg, P. and Y. Wang, "NVGRE: Network Virtualization + using Generic Routing Encapsulation", Work in Progress, + draft-sridharan-virtualization-nvgre-07, November 2014. + + [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. + Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, + March 2000, <http://www.rfc-editor.org/info/rfc2784>. + + [RFC6790] Kompella, K., Drake, J., Amante, S., Henderickx, W., and + L. Yong, "The Use of Entropy Labels in MPLS Forwarding", + RFC 6790, November 2012, + <http://www.rfc-editor.org/info/rfc6790>. + + [RFC1213] McCloghrie, K. and M. Rose, "Management Information Base + for Network Management of TCP/IP-based internets: + MIB-II", STD 17, RFC 1213, March 1991, + <http://www.rfc-editor.org/info/rfc1213>. + + + + + + +Krishnan, et al. Informational [Page 25] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path + Algorithm", RFC 2992, November 2000, + <http://www.rfc-editor.org/info/rfc2992>. + + [RFC3273] Waldbusser, S., "Remote Network Monitoring Management + Information Base for High Capacity Networks", RFC 3273, + July 2002, <http://www.rfc-editor.org/info/rfc3273>. + + [RFC3931] Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed., + "Layer Two Tunneling Protocol - Version 3 (L2TPv3)", RFC + 3931, March 2005, + <http://www.rfc-editor.org/info/rfc3931>. + + [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export + Version 9", RFC 3954, October 2004, + <http://www.rfc-editor.org/info/rfc3954>. + + [RFC5470] Sadasivan, G., Brownlee, N., Claise, B., and J. Quittek, + "Architecture for IP Flow Information Export", RFC 5470, + March 2009, <http://www.rfc-editor.org/info/rfc5470>. + + [RFC5475] Zseby, T., Molina, M., Duffield, N., Niccolini, S., and + F. Raspall, "Sampling and Filtering Techniques for IP + Packet Selection", RFC 5475, March 2009, + <http://www.rfc-editor.org/info/rfc5475>. + + [RFC5640] Filsfils, C., Mohapatra, P., and C. Pignataro, "Load- + Balancing for Mesh Softwires", RFC 5640, August 2009, + <http://www.rfc-editor.org/info/rfc5640>. + + [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion + Control", RFC 5681, September 2009, + <http://www.rfc-editor.org/info/rfc5681>. + + [RFC7223] Bjorklund, M., "A YANG Data Model for Interface + Management", RFC 7223, May 2014, + <http://www.rfc-editor.org/info/rfc7223>. + + [RFC7226] Villamizar, C., Ed., McDysan, D., Ed., Ning, S., Malis, + A., and L. Yong, "Requirements for Advanced Multipath in + MPLS Networks", RFC 7226, May 2014, + <http://www.rfc-editor.org/info/rfc7226>. + + [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics", + <http://www.sflow.org/packetSamplingBasics/>. + + + + + + +Krishnan, et al. Informational [Page 26] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + + [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004, + <http://www.sflow.org/sflow_version_5.txt>. + + [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG Counters + Structure", September 2012, + <http://www.sflow.org/sflow_lag.txt>. + + [STT] Davie, B., Ed., and J. Gross, "A Stateless Transport + Tunneling Protocol for Network Virtualization (STT)", + Work in Progress, draft-davie-stt-06, April 2014. + + [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., + Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, + "Virtual eXtensible Local Area Network (VXLAN): A + Framework for Overlaying Virtualized Layer 2 Networks + over Layer 3 Networks", RFC 7348, August 2014, + <http://www.rfc-editor.org/info/rfc7348>. + + [YONG] Yong, L. and P. Yang, "Enhanced ECMP and Large Flow + Aware Transport", Work in Progress, + draft-yong-pwe3-enhance-ecmp-lfat-01, March 2010. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Krishnan, et al. Informational [Page 27] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +Appendix A. Internet Traffic Analysis and Load-Balancing Simulation + + Internet traffic [CAIDA] has been analyzed to obtain flow statistics + such as the number of packets in a flow and the flow duration. The + 5-tuple in the packet header (IP source address, IP destination + address, transport protocol source port number, transport protocol + destination port number, and IP protocol) is used for flow + identification. The analysis indicates that < ~2% of the flows take + ~30% of total traffic volume while the rest of the flows (> ~98%) + contributes ~70% [YONG]. + + The simulation has shown that, given Internet traffic patterns, the + hash-based technique does not evenly distribute flows over ECMP + paths. Some paths may be > 90% loaded while others are < 40% loaded. + The greater the number of ECMP paths, the more severe is the + imbalance in the load distribution. This implies that hash-based + distribution can cause some paths to become congested while other + paths are underutilized [YONG]. + + The simulation also shows substantial improvement by using the large + flow-aware, hash-based distribution technique described in this + document. In using the same simulated traffic, the improved + rebalancing can achieve < 10% load differences among the paths. It + proves how large flow-aware, hash-based distribution can effectively + compensate the uneven load balancing caused by hashing and the + traffic characteristics [YONG]. + +Acknowledgements + + The authors would like to thank the following individuals for their + review and valuable feedback on earlier versions of this document: + Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian + Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh + Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter + Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George Yum, + and Weifeng Zhang. As a part of the IETF Last Call process, valuable + comments were received from Martin Thomson and Carlos Pignataro. + +Contributors + + Sanjay Khanna + Cisco Systems + EMail: sanjakha@gmail.com + + + + + + + + +Krishnan, et al. Informational [Page 28] + +RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015 + + +Authors' Addresses + + Ram Krishnan + Brocade Communications + San Jose, CA 95134 + United States + Phone: +1-408-406-7890 + EMail: ramkri123@gmail.com + + + Lucy Yong + Huawei USA + 5340 Legacy Drive + Plano, TX 75025 + United States + Phone: +1-469-277-5837 + EMail: lucy.yong@huawei.com + + + Anoop Ghanwani + Dell + 5450 Great America Pkwy + Santa Clara, CA 95054 + United States + Phone: +1-408-571-3228 + EMail: anoop@alumni.duke.edu + + + Ning So + Vinci Systems + 2613 Fairbourne Cir + Plano, TX 75093 + United States + EMail: ningso@yahoo.com + + + Bhumip Khasnabish + ZTE Corporation + New Jersey 07960 + United States + Phone: +1-781-752-8003 + EMail: vumip1@gmail.com + + + + + + + + + +Krishnan, et al. Informational [Page 29] + |