Diffstat (limited to 'doc/rfc/rfc7424.txt')
-rw-r--r-- doc/rfc/rfc7424.txt 1627
1 files changed, 1627 insertions, 0 deletions
diff --git a/doc/rfc/rfc7424.txt b/doc/rfc/rfc7424.txt
new file mode 100644
index 0000000..d96cfa1
--- /dev/null
+++ b/doc/rfc/rfc7424.txt
@@ -0,0 +1,1627 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                       R. Krishnan
+Request for Comments: 7424                        Brocade Communications
+Category: Informational                                          L. Yong
+ISSN: 2070-1721                                               Huawei USA
+                                                             A. Ghanwani
+                                                                    Dell
+                                                                   N. So
+                                                           Vinci Systems
+                                                           B. Khasnabish
+                                                         ZTE Corporation
+                                                            January 2015
+
+
+ Mechanisms for Optimizing Link Aggregation Group (LAG) and
+ Equal-Cost Multipath (ECMP) Component Link Utilization in Networks
+
+Abstract
+
+ Demands on networking infrastructure are growing exponentially due to
+ bandwidth-hungry applications such as rich media applications and
+ inter-data-center communications. In this context, it is important
+ to optimally use the bandwidth in wired networks that extensively use
+ link aggregation groups and equal-cost multipaths as techniques for
+ bandwidth scaling. This document explores some of the mechanisms
+ useful for achieving this.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Not all documents
+ approved by the IESG are a candidate for any level of Internet
+ Standard; see Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc7424.
+
+
+
+
+
+
+
+
+
+
+Krishnan, et al. Informational [Page 1]
+
+RFC 7424 Optimizing Load Distribution over LAG/ECMP January 2015
+
+
+Copyright Notice
+
+ Copyright (c) 2015 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction ....................................................4
+ 1.1. Acronyms ...................................................4
+ 1.2. Terminology ................................................5
+ 2. Flow Categorization .............................................6
+ 3. Hash-Based Load Distribution in LAG/ECMP ........................6
+ 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization ...8
+ 4.1. Differences in LAG vs. ECMP ................................9
+ 4.2. Operational Overview ......................................10
+ 4.3. Large Flow Recognition ....................................11
+ 4.3.1. Flow Identification ................................11
+ 4.3.2. Criteria and Techniques for Large Flow
+ Recognition ........................................12
+ 4.3.3. Sampling Techniques ................................12
+ 4.3.4. Inline Data Path Measurement .......................14
+ 4.3.5. Use of Multiple Methods for Large Flow
+ Recognition ........................................15
+ 4.4. Options for Load Rebalancing ..............................15
+ 4.4.1. Alternative Placement of Large Flows ...............15
+ 4.4.2. Redistributing Small Flows .........................16
+ 4.4.3. Component Link Protection Considerations ...........16
+ 4.4.4. Algorithms for Load Rebalancing ....................17
+ 4.4.5. Example of Load Rebalancing ........................17
+ 5. Information Model for Flow Rebalancing .........................18
+ 5.1. Configuration Parameters for Flow Rebalancing .............18
+ 5.2. System Configuration and Identification Parameters ........19
+ 5.3. Information for Alternative Placement of Large Flows ......20
+ 5.4. Information for Redistribution of Small Flows .............21
+ 5.5. Export of Flow Information ................................21
+ 5.6. Monitoring Information ....................................21
+ 5.6.1. Interface (Link) Utilization .......................21
+ 5.6.2. Other Monitoring Information .......................22
+ 6. Operational Considerations .....................................23
+ 6.1. Rebalancing Frequency .....................................23
+ 6.2. Handling Route Changes ....................................23
+ 6.3. Forwarding Resources ......................................23
+ 7. Security Considerations ........................................23
+ 8. References .....................................................24
+ 8.1. Normative References ......................................24
+ 8.2. Informative References ....................................25
+ Appendix A. Internet Traffic Analysis and Load-Balancing
+ Simulation ...........................................28
+ Acknowledgements ..................................................28
+ Contributors ......................................................28
+ Authors' Addresses ................................................29
+
+1. Introduction
+
+ Networks extensively use link aggregation groups (LAGs) [802.1AX] and
+ equal-cost multipaths (ECMPs) [RFC2991] as techniques for capacity
+ scaling. For the problems addressed by this document, network
+ traffic can be predominantly categorized into two traffic types:
+ long-lived large flows and other flows. These other flows, which
+ include long-lived small flows, short-lived small flows, and short-
+ lived large flows, are referred to as "small flows" in this document.
+ Long-lived large flows are simply referred to as "large flows".
+
+ Stateless hash-based techniques [ITCOM] [RFC2991] [RFC2992] [RFC6790]
+ are often used to distribute both large flows and small flows over
+ the component links in a LAG/ECMP. However, the traffic may not be
+ evenly distributed over the component links due to the traffic
+ pattern.
+
+ This document describes mechanisms for optimizing LAG/ECMP component
+ link utilization when using hash-based techniques. The mechanisms
+ comprise the following steps: 1) recognizing large flows in a router,
+ and 2) assigning the large flows to specific LAG/ECMP component links
+ or redistributing the small flows when a component link on the router
+ is congested.
+
+ It is useful to keep in mind that in typical use cases for these
+ mechanisms, the large flows consume a significant amount of bandwidth
+ on a link, e.g., greater than 5% of link bandwidth. The number of
+ such flows would necessarily be fairly small, e.g., on the order of
+ 10s or 100s per LAG/ECMP. In other words, the number of large flows
+ is NOT expected to be on the order of millions of flows. Examples of
+ such large flows would be IPsec tunnels in service provider backbone
+ networks or storage backup traffic in data center networks.
+
+1.1. Acronyms
+
+ DoS: Denial of Service
+
+ ECMP: Equal-Cost Multipath
+
+ GRE: Generic Routing Encapsulation
+
+ IPFIX: IP Flow Information Export
+
+ LAG: Link Aggregation Group
+
+ MPLS: Multiprotocol Label Switching
+
+ NVGRE: Network Virtualization using Generic Routing Encapsulation
+
+ PBR: Policy-Based Routing
+
+ QoS: Quality of Service
+
+ STT: Stateless Transport Tunneling
+
+ VXLAN: Virtual eXtensible LAN
+
+1.2. Terminology
+
+ Central management entity:
+ An entity that is capable of monitoring information about link
+ utilization and flows in routers across the network and may be
+ capable of making traffic-engineering decisions for placement of
+ large flows. It may include the functions of a collector
+ [RFC7011].
+
+ ECMP component link:
+ An individual next hop within an ECMP group. An ECMP component
+ link may itself comprise a LAG.
+
+ ECMP table:
+ A table that is used as the next hop of an ECMP route that
+ comprises the set of ECMP component links and the weights
+ associated with each of those ECMP component links. The input for
+ looking up the table is the hash value for the packet, and the
+ weights are used to determine which values of the hash function
+ map to a given ECMP component link.
+
+ Flow (large or small):
+ A sequence of packets for which ordered delivery should be
+ maintained, e.g., packets belonging to the same TCP connection.
+
+ LAG component link:
+ An individual link within a LAG. A LAG component link is
+ typically a physical link.
+
+ LAG table:
+ A table that is used as the output port, which is a LAG, that
+ comprises the set of LAG component links and the weights
+ associated with each of those component links. The input for
+ looking up the table is the hash value for the packet, and the
+ weights are used to determine which values of the hash function
+ map to a given LAG component link.
+
+ Large flow(s):
+ Refers to long-lived large flow(s).
+
+ Small flow(s):
+ Refers to any of, or a combination of, long-lived small flow(s),
+ short-lived small flows, and short-lived large flow(s).
+
+2. Flow Categorization
+
+ In general, based on the size and duration, a flow can be categorized
+ into any one of the following four types, as shown in Figure 1:
+
+ o short-lived large flow (SLLF),
+
+ o short-lived small flow (SLSF),
+
+ o long-lived large flow (LLLF), and
+
+ o long-lived small flow (LLSF).
+
+        Flow Bandwidth
+            ^
+            |--------------------|--------------------|
+            |                    |                    |
+      Large |      SLLF          |       LLLF         |
+      Flow  |                    |                    |
+            |--------------------|--------------------|
+            |                    |                    |
+      Small |      SLSF          |       LLSF         |
+      Flow  |                    |                    |
+            +--------------------+--------------------+-->Flow Duration
+                 Short-Lived            Long-Lived
+                 Flow                   Flow
+
+ Figure 1: Flow Categorization
+
+ In this document, as mentioned earlier, we categorize long-lived
+ large flows as "large flows", and all of the others (long-lived small
+ flows, short-lived small flows, and short-lived large flows) as
+ "small flows".
+
+3. Hash-Based Load Distribution in LAG/ECMP
+
+ Hash-based techniques are often used for load balancing of traffic to
+ select among multiple available paths within a LAG/ECMP group. The
+ advantages of hash-based techniques for load distribution are the
+ preservation of the packet sequence in a flow and the real-time
+ distribution without maintaining per-flow state in the router. Hash-
+ based techniques use a combination of fields in the packet's headers
+ to identify a flow, and the hash value computed over these fields is
+ used to select a link/path in a LAG/ECMP group. The result of the
+ hashing procedure is a many-to-one mapping of flows to component
+ links.
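+
+ As a minimal sketch of such a hash-based selection (an illustration
+ only, not taken from any router implementation), the following Python
+ fragment hashes an assumed 5-tuple and reduces it modulo the number
+ of component links. The names FlowKey and select_component_link and
+ the use of SHA-256 are hypothetical choices made for readability;
+ real data paths typically use much cheaper hash functions.
+
```python
import hashlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Header fields used to identify a flow (an illustrative 5-tuple)."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def select_component_link(flow: FlowKey, num_links: int) -> int:
    """Map a flow to a component link index in [0, num_links)."""
    # Serialize the header fields and hash them. SHA-256 is an assumption
    # chosen because it gives the same result on every run.
    data = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(data).digest()
    hash_value = int.from_bytes(digest[:4], "big")
    # Many flows share each link, so the mapping is many-to-one.
    return hash_value % num_links


flow = FlowKey("192.0.2.1", "198.51.100.7", 6, 49152, 443)
link = select_component_link(flow, num_links=3)
# Every packet of the flow carries the same fields, so it maps to the
# same link each time without any per-flow state being kept.
assert link == select_component_link(flow, num_links=3)
```
+
+ Because the function keeps no state and depends only on header
+ fields, packets of one flow stay in order, but nothing prevents
+ several large flows from landing on the same link.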
+
+ Hash-based techniques produce good results with respect to
+ utilization of the individual component links if:
+
+ o the traffic mix consists of flows such that the result of the hash
+ function across these flows is fairly uniform so that a similar
+ number of flows is mapped to each component link,
+
+ o the individual flow rates are much smaller than the link capacity,
+ and
+
+ o the differences in flow rates are not dramatic.
+
+ However, if one or more of these conditions are not met, hash-based
+ techniques may result in imbalance in the loads on individual
+ component links.
+
+ An example is illustrated in Figure 2. As shown, there are two
+ routers, R1 and R2, and there is a LAG between them that has three
+ component links (1), (2), and (3). A total of ten flows need to be
+ distributed across the links in this LAG. The result of applying the
+ hash-based technique is as follows:
+
+ o Component link (1) has three flows (two small flows and one large
+ flow), and the link utilization is normal.
+
+ o Component link (2) has three flows (three small flows and no large
+ flows), and the link utilization is light.
+
+ - The absence of any large flow causes the component link to be
+ underutilized.
+
+ o Component link (3) has four flows (two small flows and two large
+ flows), and the link capacity is exceeded resulting in congestion.
+
+ - The presence of two large flows causes congestion on this
+ component link.
+
+          +-----------+   ->   +-----------+
+          |           |   ->   |           |
+          |           |  ===>  |           |
+          |        (1)|--------|(1)        |
+          |           |   ->   |           |
+          |           |   ->   |           |
+          |   (R1)    |   ->   |   (R2)    |
+          |        (2)|--------|(2)        |
+          |           |   ->   |           |
+          |           |   ->   |           |
+          |           |  ===>  |           |
+          |           |  ===>  |           |
+          |        (3)|--------|(3)        |
+          |           |        |           |
+          +-----------+        +-----------+
+
+   Where: ->   small flow
+          ===> large flow
+
+ Figure 2: Unevenly Utilized Component Links
+
+ This document presents mechanisms for addressing the imbalance in
+ load distribution that results from commonly used hash-based
+ techniques for LAG/ECMP, as shown in the above example. The
+ mechanisms use large flow awareness to compensate for the imbalance
+ in load distribution.
+
+4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization
+
+ The suggested mechanisms in this document are local optimization
+ solutions; they are local in the sense that both the identification
+ of large flows and rebalancing of the load can be accomplished
+ completely within individual routers in the network without the need
+ for interaction with other routers.
+
+ This approach may not yield a global optimization of the placement of
+ large flows across multiple routers in a network, which may be
+ desirable in some networks. On the other hand, a local approach may
+ be adequate for some environments for the following reasons:
+
+ 1) Different links within a network experience different levels of
+ utilization; thus, a "targeted" solution is needed for those hot
+ spots in the network. An example is the utilization of a LAG
+ between two routers that needs to be optimized.
+
+ 2) Some networks may lack end-to-end visibility, e.g., when a
+ certain network, under the control of a given operator, is a
+ transit network for traffic from other networks that are not
+ under the control of the same operator.
+
+4.1. Differences in LAG vs. ECMP
+
+ While the mechanisms explained herein are applicable to both LAGs and
+ ECMP groups, it is useful to note that there are some key differences
+ between the two that may impact how effective the mechanisms are.
+ This relates, in part, to the localized information with which the
+ mechanisms are intended to operate.
+
+ A LAG is usually established across links that are between two
+ adjacent routers. As a result, the scope of the problem of
+ optimizing the bandwidth utilization on the component links is fairly
+ narrow. It simply involves rebalancing the load across the component
+ links between these two routers, and there is no impact whatsoever to
+ other parts of the network. The scheme works equally well for
+ unicast and multicast flows.
+
+ On the other hand, with ECMP, redistributing the load across
+ component links that are part of the ECMP group may impact traffic
+ patterns at all of the routers that are downstream of the given
+ router between itself and the destination. The local optimization
+ may result in congestion at a downstream node. (In its simplest
+ form, an ECMP group may be used to distribute traffic on component
+ links that are between two adjacent routers, and in that case, the
+ ECMP group is no different than a LAG for the purpose of this
+ discussion. It should be noted that an ECMP component link may
+ itself comprise a LAG, in which case the scheme may be further
+ applied to the component links within the LAG.)
+
+ To demonstrate the limitations of local optimization, consider a two-
+ level Clos network topology as shown in Figure 3 with three leaf
+ routers (L1, L2, and L3) and two spine routers (S1 and S2). Assume
+ all of the links are 10 Gbps.
+
+ Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one
+ flow of 7 Gbps also towards L3. If L1 balances the load optimally
+ between S1 and S2, and L2 sends the flow via S1, then the downlink
+ from S1 to L3 would get congested, resulting in packet discards. On
+ the other hand, if L1 had sent both its flows towards S1 and L2 had
+ sent its flow towards S2, there would have been no congestion at
+ either S1 or S2.
+
+                    +-----+     +-----+
+                    | S1  |     | S2  |
+                    +-----+     +-----+
+                    / \ \       / /\
+                   / +---------+ /  \
+                  / /   \ \     /    \
+                 / /     \ +------+   \
+                / /       \   /    \   \
+             +-----+     +-----+   +-----+
+             | L1  |     | L2  |   | L3  |
+             +-----+     +-----+   +-----+
+
+ Figure 3: Two-Level Clos Network
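+
+ The arithmetic of this example can be checked with a short Python
+ sketch. The function name downlink_load_to_l3 and the way the
+ placements are written down are assumptions made for illustration;
+ only the flow rates and the 10 Gbps link speed come from the text.
+
```python
LINK_CAPACITY_GBPS = 10


def downlink_load_to_l3(placements):
    """Sum the rates of flows destined to L3, per spine router.

    placements: list of (spine, rate_gbps) pairs, one per flow.
    """
    load = {"S1": 0, "S2": 0}
    for spine, rate_gbps in placements:
        load[spine] += rate_gbps
    return load


# L1 balances its two 4 Gbps flows across S1 and S2, and L2 happens to
# send its 7 Gbps flow via S1.
local_choice = downlink_load_to_l3([("S1", 4), ("S2", 4), ("S1", 7)])
# L1 sends both of its flows via S1, and L2 sends its flow via S2.
coordinated_choice = downlink_load_to_l3([("S1", 4), ("S1", 4), ("S2", 7)])

for name, load in (("local", local_choice),
                   ("coordinated", coordinated_choice)):
    congested = [s for s, gbps in load.items() if gbps > LINK_CAPACITY_GBPS]
    print(name, load, "congested downlinks:", congested)
```
+
+ In the first case, the S1-to-L3 downlink is offered 11 Gbps against
+ 10 Gbps of capacity; in the second case, the downlinks carry 8 Gbps
+ and 7 Gbps, so neither is congested.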
+
+ The other issue with applying this scheme to ECMP groups is that it
+ may not apply equally to unicast and multicast traffic because of the
+ way multicast trees are constructed.
+
+ Finally, it is possible for a single physical link to participate as
+ a component link in multiple ECMP groups, whereas with LAGs, a link
+ can participate as a component link of only one LAG.
+
+4.2. Operational Overview
+
+ The various steps in optimizing LAG/ECMP component link utilization
+ in networks are detailed below:
+
+ Step 1:
+ This step involves recognizing large flows in routers and
+ maintaining the mapping for each large flow to the component link
+ that it uses. Recognition of large flows is explained in Section
+ 4.3.
+
+ Step 2:
+ The egress component links are periodically scanned for link
+ utilization, and the imbalance for the LAG/ECMP group is
+ monitored. If the imbalance exceeds a certain threshold, then
+ rebalancing is triggered. Measurement of the imbalance is
+ discussed further in Section 5.1. In addition to the imbalance,
+ further criteria (such as the maximum utilization of any of the
+ component links) may also be used to determine whether or not to
+ trigger rebalancing. The use of sampling techniques for the
+ measurement of egress component link utilization, including the
+ issues of depending on ingress sampling for these measurements, is
+ discussed in Section 4.3.3.
+
+ Step 3:
+ As a part of rebalancing, the operator can choose to rebalance the
+ large flows by placing them on lightly loaded component links of
+ the LAG/ECMP group, redistribute the small flows on the congested
+ link to other component links of the group, or a combination of
+ both.
+
+ All of the steps identified above can be done locally within the
+ router itself or could involve the use of a central management
+ entity.
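+
+ The trigger in Step 2 might be sketched as follows. The imbalance
+ metric used here (the largest deviation of any component link's
+ utilization from the speed-weighted mean), the default thresholds,
+ the choice to require both conditions, and the function names are
+ all assumptions for illustration.
+
```python
def imbalance(utilizations, speeds_gbps):
    """Largest deviation of a link's utilization from the weighted mean.

    utilizations: per-link utilization as a fraction (0.0 to 1.0).
    speeds_gbps:  per-link speed, used to weight the mean.
    """
    total_speed = sum(speeds_gbps)
    mean = sum(u * b for u, b in zip(utilizations, speeds_gbps)) / total_speed
    return max(abs(u - mean) for u in utilizations)


def should_rebalance(utilizations, speeds_gbps,
                     imbalance_threshold=0.2, min_peak_utilization=0.8):
    """Trigger only when the group is imbalanced AND some link runs hot.

    The second condition is the optional "further criteria" from Step 2;
    combining the two with a logical AND is an assumption of this sketch.
    """
    return (imbalance(utilizations, speeds_gbps) > imbalance_threshold
            and max(utilizations) >= min_peak_utilization)


# Three 10 Gbps component links, loosely modeled on Figure 2:
# normal, light, and congested utilization.
print(should_rebalance([0.6, 0.3, 1.0], [10, 10, 10]))
# An evenly loaded group does not trigger rebalancing.
print(should_rebalance([0.6, 0.6, 0.6], [10, 10, 10]))
```
+
+ A router or a central management entity would evaluate such a check
+ on each periodic scan of the egress component links.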
+
+ Providing large flow information to a central management entity
+ provides the capability to globally optimize flow distribution as
+ described in Section 4.1. Consider the following example. A router
+ may have three ECMP next hops that lead down paths P1, P2, and P3. A
+ couple of hops downstream on path P1, there may be a congested link,
+ while paths P2 and P3 may be underutilized. This is something that
+ the local router does not have visibility into. With the help of a
+ central management entity, the operator could redistribute some of
+ the flows from P1 to P2 and/or P3, resulting in a more optimized flow
+ of traffic.
+
+ The steps described above are especially useful when bundling links
+ of different bandwidths, e.g., 10 Gbps and 100 Gbps as described in
+ [RFC7226].
+
+4.3. Large Flow Recognition
+
+4.3.1. Flow Identification
+
+ Flows are typically identified using one or more fields from the
+ packet header, for example:
+
+ o Layer 2: Source Media Access Control (MAC) address, destination
+ MAC address, VLAN ID.
+
+ o IP header: IP protocol, IP source address, IP destination address,
+ flow label (IPv6 only).
+
+ o Transport protocol header: Source port number, destination port
+ number. These apply to protocols such as TCP, UDP, and the Stream
+ Control Transmission Protocol (SCTP).
+
+ o MPLS labels.
+
+ For tunneling protocols like Generic Routing Encapsulation (GRE)
+ [RFC2784], Virtual eXtensible LAN (VXLAN) [RFC7348], Network
+ Virtualization using Generic Routing Encapsulation (NVGRE) [NVGRE],
+ Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling Protocol
+ (L2TP) [RFC3931], etc., flow identification is possible based on
+ inner and/or outer headers as well as fields introduced by the tunnel
+ header, as any or all such fields may be used for load balancing
+ decisions [RFC5640].
+
+ The above list is not exhaustive.
+
+ The mechanisms described in this document are agnostic to the fields
+ that are used for flow identification.
+
+ This method of flow identification is consistent with that of IPFIX
+ [RFC7011].
+
+4.3.2. Criteria and Techniques for Large Flow Recognition
+
+ From the perspective of bandwidth and time duration, in order to
+ recognize large flows, we define an observation interval and measure
+ the bandwidth of the flow over that interval. A flow that exceeds a
+ certain minimum bandwidth threshold over that observation interval
+ would be considered a large flow.
+
+ The two parameters -- the observation interval and the minimum
+ bandwidth threshold over that observation interval -- should be
+ programmable to facilitate handling of different use cases and
+ traffic characteristics. For example, a flow that is at or above 10%
+ of link bandwidth for a time period of at least one second could be
+ declared a large flow [DEVOFLOW].
+
+ In order to avoid excessive churn in the rebalancing, once a flow has
+ been recognized as a large flow, it should continue to be recognized
+ as a large flow for as long as the traffic received during an
+ observation interval exceeds some fraction of the bandwidth
+ threshold, for example, 80% of the bandwidth threshold.
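+
+ A minimal sketch of this recognition logic, including the lower
+ threshold that keeps an already recognized flow from flapping, is
+ shown below. The class name LargeFlowRecognizer, its parameters,
+ and the per-interval byte counters are hypothetical; the 10% and 80%
+ figures are the example values from the text.
+
```python
class LargeFlowRecognizer:
    """Classify flows once per observation interval, with hysteresis."""

    def __init__(self, link_bps, threshold_fraction=0.10,
                 retain_fraction=0.80, interval_s=1.0):
        self.interval_s = interval_s
        # Bandwidth a flow must reach to be declared a large flow.
        self.enter_bps = threshold_fraction * link_bps
        # Lower bar that an already recognized large flow must exceed.
        self.stay_bps = retain_fraction * self.enter_bps
        self.large_flows = set()

    def end_of_interval(self, bytes_per_flow):
        """Update the large-flow set from one interval's byte counts."""
        still_large = set()
        for flow_id, byte_count in bytes_per_flow.items():
            rate_bps = byte_count * 8 / self.interval_s
            if flow_id in self.large_flows:
                is_large = rate_bps > self.stay_bps
            else:
                is_large = rate_bps >= self.enter_bps
            if is_large:
                still_large.add(flow_id)
        # Flows that sent nothing in this interval drop out as well.
        self.large_flows = still_large
        return set(still_large)


recognizer = LargeFlowRecognizer(link_bps=10_000_000_000)
first = recognizer.end_of_interval({"a": 150_000_000, "b": 50_000_000})
second = recognizer.end_of_interval({"a": 112_500_000, "b": 50_000_000})
third = recognizer.end_of_interval({"a": 75_000_000, "b": 50_000_000})
print(first, second, third)
```
+
+ On a 10 Gbps link, flow "a" is recognized at 1.2 Gbps, is retained
+ at 0.9 Gbps (above 80% of the 1 Gbps threshold), and is dropped at
+ 0.6 Gbps; flow "b" at 0.4 Gbps is never recognized.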
+
+ Various techniques to recognize a large flow are described in
+ Sections 4.3.3, 4.3.4, and 4.3.5.
+
+4.3.3. Sampling Techniques
+
+ A number of routers support sampling techniques such as sFlow
+ [sFlow-v5] [sFlow-LAG], Packet Sampling (PSAMP) [RFC5475], and
+ NetFlow Sampling [RFC3954]. For the purpose of large flow
+ recognition, sampling needs to be enabled on all of the egress ports
+ in the router where such measurements are desired.
+
+ Using sFlow as an example, processing in an sFlow collector can
+ provide an approximate indication of the mapping of large flows to
+ each of the component links in each LAG/ECMP group. Assuming
+ sufficient control plane resources are available, it is possible to
+ implement this part of the collector function in the control plane of
+ the router to reduce dependence on a central management entity.
+
+ If egress sampling is not available, ingress sampling can suffice
+ since the central management entity used by the sampling technique
+ typically has visibility across multiple routers in a network and can
+ use the samples from an immediately downstream router to make
+ measurements for egress traffic at the local router.
+
+ The option of using ingress sampling for this purpose may not be
+ available if the downstream router is under the control of a
+ different operator or if the downstream device does not support
+ sampling.
+
+ Alternatively, since sampling techniques require that the sample be
+ annotated with the packet's egress port information, ingress sampling
+ may suffice. However, this means that sampling would have to be
+ enabled on all ports, rather than only on those ports where such
+ monitoring is desired. There is one situation in which this approach
+ may not work. If there are tunnels that originate from the given
+ router and if the resulting tunnel comprises the large flow, then
+ this cannot be deduced from ingress sampling at the given router.
+ Instead, for this scenario, if egress sampling is unavailable, then
+ ingress sampling from the downstream router must be used.
+
+
+ To illustrate the use of ingress versus egress sampling, we refer to
+ Figure 2. Since we are looking at rebalancing flows at R1, we would
+ need to enable egress sampling on ports (1), (2), and (3) on R1. If
+ egress sampling is not available and if R2 is also under the control
+ of the same administrator, enabling ingress sampling on R2's ports
+ (1), (2), and (3) would also work, but it would necessitate the
+ involvement of a central management entity in order for R1 to obtain
+ large flow information for each of its links. Finally, R1 can
+ enable only ingress sampling, on all of its ports (not just the ports
+ that are part of the LAG/ECMP group being monitored), and that would
+ suffice if the sampling technique annotates the samples with the
+ egress port information.
+
+ The advantages and disadvantages of sampling techniques are as
+ follows.
+
+ Advantages:
+
+ o Supported in most existing routers.
+
+ o Requires minimal router resources.
+
+ Disadvantage:
+
+ o In order to minimize the error inherent in sampling, there is a
+ minimum delay for the recognition time of large flows, and in the
+ time that it takes to react to this information.
+
+ With sampling, the detection of large flows can be done on the order
+ of one second [DEVOFLOW]. A discussion on determining the
+ appropriate sampling frequency is available in [SAMP-BASIC].
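+
+ As a rough sketch of how a collector might turn packet samples into
+ per-flow bandwidth estimates, the fragment below scales the sampled
+ bytes by a 1-in-N sampling rate. The record layout (flow identifier,
+ egress port, frame length) and the function name are assumptions and
+ do not reflect the sFlow, PSAMP, or NetFlow record formats.
+
```python
from collections import defaultdict


def estimate_flow_rates(samples, sampling_rate, interval_s):
    """Estimate bits per second for each (flow, egress port) pair.

    samples:       iterable of (flow_id, egress_port, frame_bytes).
    sampling_rate: N, when one packet in every N is sampled.
    interval_s:    length of the observation interval in seconds.
    """
    sampled_bytes = defaultdict(int)
    for flow_id, egress_port, frame_bytes in samples:
        sampled_bytes[(flow_id, egress_port)] += frame_bytes
    # Each sample stands in for roughly N packets of similar size.
    return {key: total * sampling_rate * 8 / interval_s
            for key, total in sampled_bytes.items()}


samples = [("backup", 3, 1500)] * 100 + [("web", 1, 500)] * 4
rates = estimate_flow_rates(samples, sampling_rate=1000, interval_s=1.0)
print(rates)
```
+
+ The estimate for the flow with 100 samples is far more trustworthy
+ than the one built from 4 samples, which is why a minimum
+ observation time is needed before a flow is declared large.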
+
+4.3.4. Inline Data Path Measurement
+
+ Implementations may perform recognition of large flows by performing
+ measurements on traffic in the data path of a router. Such an
+ approach would be expected to operate at the interface speed on every
+ interface, accounting for all packets processed by the data path of
+ the router. An example of such an approach is described in IPFIX
+ [RFC5470].
+
+ Using inline data path measurement, a faster and more accurate
+ indication of large flows mapped to each of the component links in a
+ LAG/ECMP group may be possible (as compared to the sampling-based
+ approach).
+
+ The advantages and disadvantages of inline data path measurement are
+ as follows:
+
+ Advantages:
+
+ o As link speeds get higher, sampling rates are typically reduced to
+ keep the number of samples manageable, which places a lower bound
+ on the detection time. With inline data path measurement, large
+ flows can be recognized in shorter windows on higher link speeds
+ since every packet is accounted for [NDTM].
+
+ o Inline data path measurement eliminates the potential dependence
+ on a central management entity for large flow recognition.
+
+ Disadvantage:
+
+ o Inline data path measurement is more resource intensive in terms
+ of the table sizes required for monitoring all flows.
+
+ As mentioned earlier, the observation interval for determining a
+ large flow and the bandwidth threshold for classifying a flow as a
+ large flow should be programmable parameters in a router.
+
+ The implementation details of inline data path measurement of large
+ flows are vendor dependent and beyond the scope of this document.
+
+4.3.5. Use of Multiple Methods for Large Flow Recognition
+
+ It is possible that a router may have line cards that support a
+ sampling technique while other line cards support inline data path
+ measurement. As long as there is a way for the router to reliably
+ determine the mapping of large flows to component links of a LAG/ECMP
+ group, it is acceptable for the router to use more than one method
+ for large flow recognition.
+
+ If both methods are supported, inline data path measurement may be
+ preferable because of its speed of detection [FLOW-ACC].
+
+4.4. Options for Load Rebalancing
+
+ The following subsections describe suggested techniques for load
+ rebalancing. Equipment vendors may implement more than one technique,
+ including those not described in this document, and allow the
+ operator to choose between them.
+
+ Note that regardless of the method used, perfect rebalancing of large
+ flows may not be possible since flows arrive and depart at different
+ times. Also, any flows that are moved from one component link to
+ another may experience momentary packet reordering.
+
+4.4.1. Alternative Placement of Large Flows
+
+ Within a LAG/ECMP group, member component links with the least
+ average link utilization are identified. Some large flow(s) from the
+ heavily loaded component links are then moved to those lightly loaded
+ member component links using a PBR rule in the ingress processing
+ element(s) in the routers.
+
+ With this approach, only certain large flows are subjected to
+ momentary flow reordering.
+
+ Moving a large flow will increase the utilization of the link that it
+ is moved to, potentially once again creating an imbalance in the
+ utilization across the component links. Therefore, when moving a
+ large flow, care must be taken to account for the existing load and
+ the future load after the large flow has been moved. Further, the
+ appearance of new large flows may require a rearrangement of the
+ placement of existing flows.
+
+ Consider a case where there is a LAG comprising four 10 Gbps
+ component links and there are four large flows, each of 1 Gbps.
+ These flows are each placed on one of the component links.
+ Subsequently, a fifth large flow of 2 Gbps is recognized, and to
+ maintain equitable load distribution, one of the existing 1 Gbps
+ flows may need to be moved to a different component link. This
+ would still result in some imbalance in the utilization across the
+ component links.
+
+4.4.2. Redistributing Small Flows
+
+ Some large flows may consume the entire bandwidth of the component
+ link(s). In this case, it would be desirable for the small flows to
+ not use the congested component link(s).
+
+ o The LAG/ECMP table is modified to include only non-congested
+ component link(s). Small flows hash into this table to be mapped
+ to a destination component link. Alternatively, if certain
+ component links are heavily loaded but not congested, the output
+ of the hash function can be adjusted to account for large flow
+ loading on each of the component links.
+
+ o The PBR rules for large flows (refer to Section 4.4.1) must have
+ strict precedence over the LAG/ECMP table lookup result.
+
+ This method works on some existing router hardware. The idea is to
+ prevent a small flow from hashing into the congested component
+ link(s), or at least to reduce the probability that it does.
+
+ With this approach, the small flows that are moved would be subject
+ to reordering.
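+
+ A minimal sketch of the two rules above is given here, assuming a
+ weight-expanded lookup table and a CRC32 hash over the 5-tuple. The
+ names build_table() and select_link(), the weights, and the choice
+ of hash are illustrative assumptions; real routers use
+ hardware-specific tables and hash functions.
+
```python
import zlib

def build_table(weights):
    """Expand per-link integer weights into a LAG/ECMP lookup table.

    A weight of 0 removes a (congested) link from the table entirely.
    """
    table = []
    for link_id in sorted(weights):
        table.extend([link_id] * weights[link_id])
    return table

def select_link(five_tuple, table, pbr_rules):
    """Pick the outgoing component link for a packet's flow."""
    # PBR rules for large flows take strict precedence over the hash.
    if five_tuple in pbr_rules:
        return pbr_rules[five_tuple]
    key = "|".join(str(field) for field in five_tuple).encode()
    return table[zlib.crc32(key) % len(table)]

# Link 3 is filled by a large flow, so small flows get weight 0 there.
table = build_table({1: 2, 2: 2, 3: 0})
large_flow = ("10.0.0.1", "10.0.0.2", 5000, 80, 6)
pbr_rules = {large_flow: 3}

small_flows = [("10.0.1.%d" % i, "10.0.2.1", 1024 + i, 443, 6)
               for i in range(50)]
small_links = {select_link(f, table, pbr_rules) for f in small_flows}
```
+
+ In this sketch, a heavily loaded but not congested link could be
+ given a smaller nonzero weight instead of being removed.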
+
+4.4.3. Component Link Protection Considerations
+
+ If desired, certain component links may be reserved for link
+ protection. These reserved component links are not used for any
+ flows in the absence of any failures. When there is a failure of one
+ or more component links, all the flows on the failed component
+ link(s) are moved to the reserved component link(s). The mapping
+ table of large flows to component links simply replaces the failed
+ component link with the reserved component link. Likewise, the
+ LAG/ECMP table replaces the failed component link with the reserved
+ component link.
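+
+ The substitution described above can be sketched as follows. The
+ function fail_over() and the table representations (a list for the
+ LAG/ECMP table, a dictionary for the large flow mapping) are
+ assumptions made for illustration.
+
```python
def fail_over(lag_table, large_flow_map, failed_link, reserved_link):
    """Replace a failed component link with a reserved one.

    Returns new copies of the LAG/ECMP table and of the mapping of
    large flows to component links; the inputs are not modified.
    """
    def swap(link):
        return reserved_link if link == failed_link else link

    new_table = [swap(link) for link in lag_table]
    new_map = {flow: swap(link) for flow, link in large_flow_map.items()}
    return new_table, new_map

# Links 1..3 carry traffic; link 4 is held in reserve for protection.
lag_table = [1, 2, 3, 1, 2, 3]
large_flow_map = {"flow-a": 2, "flow-b": 3}

new_table, new_map = fail_over(lag_table, large_flow_map,
                               failed_link=2, reserved_link=4)
```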
+
+4.4.4. Algorithms for Load Rebalancing
+
+ Specific algorithms for placement of large flows are out of the scope
+ of this document. One possibility is to formulate the problem for
+ large flow placement as the well-known bin-packing problem and make
+ use of the various heuristics that are available for that problem
+ [BIN-PACK].
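+
+ Purely as an illustration of that possibility, the sketch below
+ uses one simple greedy heuristic: flows are considered in order of
+ decreasing rate, and each is placed on the component link with the
+ most spare capacity. This is not an algorithm specified by this
+ document, and the function name and data layout are hypothetical.
+
```python
def place_large_flows(flow_rates, link_capacity):
    """Greedy placement: largest flow first, emptiest link first.

    flow_rates:    {flow_name: rate}
    link_capacity: {link_id: capacity available to large flows}
    Returns (placement, residual capacity per link).
    """
    residual = dict(link_capacity)
    placement = {}
    # Sort by decreasing rate; break ties by name for determinism.
    ordered = sorted(flow_rates.items(), key=lambda kv: (-kv[1], kv[0]))
    for flow, rate in ordered:
        # max() returns the first maximum, so ties go to the lowest ID.
        link = max(sorted(residual), key=lambda l: residual[l])
        if residual[link] < rate:
            raise ValueError("no component link can carry %s" % flow)
        placement[flow] = link
        residual[link] -= rate
    return placement, residual

# The example from Section 4.4.1: four 10 Gbps links, five large flows.
flows = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0, "f5": 2.0}
links = {1: 10.0, 2: 10.0, 3: 10.0, 4: 10.0}
placement, residual = place_large_flows(flows, links)
```
+
+ A real implementation would also need to weigh the reordering cost
+ of moving flows that are already placed, which this sketch ignores.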
+
+4.4.5. Example of Load Rebalancing
+
+ Optimizing LAG/ECMP component utilization for the use case in Figure
+ 2 is depicted below in Figure 4. The large flow rebalancing
+ explained in Section 4.4.1 is used. The improved link utilization is
+ as follows:
+
+ o Component link (1) has three flows (two small flows and one large
+ flow), and the link utilization is normal.
+
+ o Component link (2) has four flows (three small flows and one large
+ flow), and the link utilization is normal now.
+
+ o Component link (3) has three flows (two small flows and one large
+ flow), and the link utilization is normal now.
+
+ +-----------+ ->     +-----------+
+ |           | ->     |           |
+ |           | ===>   |           |
+ |        (1)|--------|(1)        |
+ |           |        |           |
+ |           | ===>   |           |
+ |           | ->     |           |
+ |           | ->     |           |
+ |   (R1)    | ->     |   (R2)    |
+ |        (2)|--------|(2)        |
+ |           |        |           |
+ |           | ->     |           |
+ |           | ->     |           |
+ |           | ===>   |           |
+ |        (3)|--------|(3)        |
+ |           |        |           |
+ +-----------+        +-----------+
+
+ Where: ->   small flow
+        ===> large flow
+
+ Figure 4: Evenly Utilized Composite Links
+
+ Basically, the use of the mechanisms described in Section 4.4.1
+ resulted in a rebalancing of flows where one of the large flows on
+ component link (3), which was previously congested, was moved to
+ component link (2), which was previously underutilized.
+
+5. Information Model for Flow Rebalancing
+
+ In order to support flow rebalancing in a router from an external
+ system, the exchange of some information is necessary between the
+ router and the external system. This section provides an exemplary
+ information model covering the various components needed for this
+ purpose. The model is intended to be informational and may be used
+ as a guide for the development of a data model.
+
+5.1. Configuration Parameters for Flow Rebalancing
+
+ The following parameters are required for configuration of this
+ feature:
+
+ o Large flow recognition parameters:
+
+ - Observation interval: The observation interval is the time
+ period in seconds over which packet arrivals are observed for
+ the purpose of large flow recognition.
+
+ - Minimum bandwidth threshold: The minimum bandwidth threshold
+ would be configured as a percentage of link speed and
+ translated into a number of bytes over the observation
+ interval. A flow for which the number of bytes received over a
+ given observation interval exceeds this number would be
+ recognized as a large flow.
+
+ - Minimum bandwidth threshold for large flow maintenance: The
+ minimum bandwidth threshold for large flow maintenance is used
+ to provide hysteresis for large flow recognition. Once a flow
+ is recognized as a large flow, it continues to be recognized as
+ a large flow until it falls below this threshold. This is also
+ configured as a percentage of link speed and is typically lower
+ than the minimum bandwidth threshold defined above.
+
+ o Imbalance threshold: A measure of the deviation of the component
+ link utilizations from the utilization of the overall LAG/ECMP
+ group. Since component links can have different speeds, the
+ imbalance can be computed as follows. Let the utilization of each
+ component link in a LAG/ECMP group with n links of speed b_1, b_2
+ .. b_n be u_1, u_2 .. u_n. The mean utilization is computed as
+
+ u_ave = [ (u_1 * b_1) + (u_2 * b_2) + .. + (u_n * b_n) ] /
+ [b_1 + b_2 + .. + b_n].
+
+ The imbalance is then computed as
+
+ max_{i=1..n} | u_i - u_ave |.
+
+ o Rebalancing interval: The minimum amount of time between
+ rebalancing events. This parameter ensures that rebalancing is
+ not invoked too frequently as it impacts packet ordering.
+
+ These parameters may be configured on a system-wide basis or may
+ apply to an individual LAG/ECMP group. They may be applied to an
+ ECMP group, provided that the component links are not shared with any
+ other ECMP group.
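+
+ The parameters above can be exercised with a short sketch. The
+ function names and the sample values are assumptions made for
+ illustration; only the imbalance formula and the threshold
+ semantics are taken from the text of this section.
+
```python
def threshold_bytes(percent_of_link_speed, link_speed_bps, interval_s):
    """Translate a bandwidth threshold (% of link speed) into bytes
    observed over one observation interval."""
    return percent_of_link_speed / 100.0 * link_speed_bps * interval_s / 8.0

def is_large_flow(bytes_seen, was_large, recognize_bytes, maintain_bytes):
    """Apply large flow recognition with hysteresis."""
    if was_large:
        # Stays large until it falls below the maintenance threshold.
        return bytes_seen >= maintain_bytes
    # Becomes large only when it exceeds the recognition threshold.
    return bytes_seen > recognize_bytes

def imbalance(utilizations, speeds):
    """max_i |u_i - u_ave|, with u_ave weighted by link speed."""
    u_ave = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    return max(abs(u - u_ave) for u in utilizations)

# Assumed settings: 10 Gbps link, 1 s interval, 10% / 5% thresholds.
recognize = threshold_bytes(10, 10e9, 1)
maintain = threshold_bytes(5, 10e9, 1)

# Four equal-speed links with unequal utilization.
example_imbalance = imbalance([0.9, 0.5, 0.4, 0.2], [10, 10, 10, 10])
```
+
+ With these sample values, a flow that carried 100 MB in the last
+ interval is not newly recognized as large, but it remains large if
+ it had already been recognized, which is the intended hysteresis.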
+
+5.2. System Configuration and Identification Parameters
+
+ The following parameters are useful for router configuration and
+ operation when using the mechanisms in this document.
+
+ o IP address: The IP address of a specific router that the feature
+ is being configured on or that the large flow placement is being
+ applied to.
+
+ o LAG ID: Identifies the LAG on a given router. The LAG ID may be
+ required when configuring this feature (to apply a specific set of
+ large flow identification parameters to the LAG) and will be
+ required when specifying flow placement to achieve the desired
+ rebalancing.
+
+ o Component Link ID: Identifies the component link within a LAG or
+ ECMP group. This is required when specifying flow placement to
+ achieve the desired rebalancing.
+
+ o Component Link Weight: The relative weight to be applied to
+ traffic for a given component link when using hash-based
+ techniques for load distribution.
+
+ o ECMP group: Identifies a particular ECMP group. The ECMP group
+ may be required when configuring this feature (to apply a specific
+ set of large flow identification parameters to the ECMP group) and
+ will be required when specifying flow placement to achieve the
+ desired rebalancing. We note that multiple ECMP groups can share
+ an overlapping set (or non-overlapping subset) of component links.
+ This document does not deal with the complexity of addressing such
+ configurations.
+
+ The feature may be configured globally for all LAGs and/or for all
+ ECMP groups, or it may be configured specifically for a given LAG or
+ ECMP group.
+
+5.3. Information for Alternative Placement of Large Flows
+
+ In cases where large flow recognition is handled by a central
+ management entity (see Section 4.3.3), an information model for flows
+ is required to allow the import of large flow information to the
+ router.
+
+ Typical fields used for identifying large flows were discussed in
+ Section 4.3.1. The IPFIX information model [RFC7012] can be
+ leveraged for large flow identification.
+
+ Large flow placement is achieved by specifying the relevant flow
+ information along with the following:
+
+ o For LAG: router's IP address, LAG ID, LAG component link ID.
+
+ o For ECMP: router's IP address, ECMP group, ECMP component link ID.
+
+ In the case where the ECMP component link itself comprises a LAG, we
+ would have to specify the parameters for both the ECMP group as well
+ as the LAG to which the large flow is being directed.
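+
+ One possible shape for such a placement record is sketched below.
+ The class name, field names, and types are hypothetical; this
+ document defines an information model only and does not define a
+ data model or an encoding.
+
```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class LargeFlowPlacement:
    """Placement of one large flow on a component link of a router."""
    router_ip: str
    flow_key: Tuple  # e.g., the 5-tuple identifying the flow
    lag_id: Optional[str] = None
    lag_component_link_id: Optional[int] = None
    ecmp_group: Optional[str] = None
    ecmp_component_link_id: Optional[int] = None

    def is_valid(self) -> bool:
        """A record needs a complete LAG part, a complete ECMP part,
        or both (when the ECMP component link is itself a LAG)."""
        lag = (self.lag_id, self.lag_component_link_id)
        ecmp = (self.ecmp_group, self.ecmp_component_link_id)
        lag_set = [v is not None for v in lag]
        ecmp_set = [v is not None for v in ecmp]
        # Reject half-filled parts, then require at least one part.
        if any(lag_set) != all(lag_set) or any(ecmp_set) != all(ecmp_set):
            return False
        return all(lag_set) or all(ecmp_set)

flow = ("192.0.2.10", "198.51.100.20", 40000, 443, 6)
lag_only = LargeFlowPlacement("192.0.2.1", flow, lag_id="lag-1",
                              lag_component_link_id=2)
ecmp_over_lag = LargeFlowPlacement("192.0.2.1", flow, lag_id="lag-1",
                                   lag_component_link_id=2,
                                   ecmp_group="ecmp-7",
                                   ecmp_component_link_id=0)
incomplete = LargeFlowPlacement("192.0.2.1", flow, lag_id="lag-1")
```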
+
+5.4. Information for Redistribution of Small Flows
+
+ Redistribution of small flows is done using the following:
+
+ o For LAG: The LAG ID and the component link IDs along with the
+ relative weight of traffic to be assigned to each component link
+ ID are required.
+
+ o For ECMP: The ECMP group and the ECMP next hop along with the
+ relative weight of traffic to be assigned to each ECMP next hop
+ are required.
+
+ It is possible to have an ECMP next hop (component link) that itself
+ comprises a LAG. In that case, we would have to specify the new
+ weights for both the component links within the ECMP group and the
+ component links within the LAG.
+
+5.5. Export of Flow Information
+
+ Exporting large flow information is required when large flow
+ recognition is being done on a router but the decision to rebalance
+ is being made in a central management entity. Large flow information
+ includes flow identification and the component link ID that the flow
+ is currently assigned to. Other information such as flow QoS and
+ bandwidth may be exported too.
+
+ The IPFIX information model [RFC7012] can be leveraged for large flow
+ identification.
+
+5.6. Monitoring Information
+
+5.6.1. Interface (Link) Utilization
+
+ The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and
+ interface speed (ifSpeed) can be obtained, for example, from the
+ Interfaces table (ifTable) in the MIB module defined in [RFC1213].
+
+ The link utilization can then be computed as follows:
+
+ Incoming link utilization = (delta_ifInOctets * 8) / (ifSpeed * T)
+
+ Outgoing link utilization = (delta_ifOutOctets * 8) / (ifSpeed * T)
+
+ Where T is the interval over which the utilization is being measured,
+ delta_ifInOctets is the change in ifInOctets over that interval, and
+ delta_ifOutOctets is the change in ifOutOctets over that interval.
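+
+ The computation can be sketched as follows, assuming two counter
+ readings taken T seconds apart. The function name is hypothetical,
+ and the modulo step, which tolerates a single wrap of a 32-bit
+ counter between readings, is an addition made for this sketch.
+
```python
def link_utilization(octets_start, octets_end, if_speed_bps, interval_s,
                     counter_bits=32):
    """Utilization = (delta_octets * 8) / (ifSpeed * T).

    octets_start / octets_end: ifInOctets (or ifOutOctets) readings
    if_speed_bps:              ifSpeed, in bits per second
    interval_s:                T, the measurement interval in seconds
    """
    # The modulo makes the delta correct across one counter wrap.
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return (delta_octets * 8) / (if_speed_bps * interval_s)

# 125,000,000 octets in 10 s on a 1 Gbps link.
util = link_utilization(0, 125_000_000, 1_000_000_000, 10)

# The counter wrapped between readings: 100 + 900 = 1000 octets.
wrapped = link_utilization(2**32 - 100, 900, 1_000_000, 1)
```
+
+ If the counter can wrap more than once within T, as may happen on
+ high-speed links with 32-bit counters, this delta is no longer
+ reliable.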
+
+ For high-speed Ethernet links, the etherStatsHighCapacityTable in the
+ MIB module defined in [RFC3273] can be used.
+
+ Similar results may be achieved using the corresponding objects of
+ other interface management data models such as YANG [RFC7223] if
+ those are used instead of MIBs.
+
+ For scalability, it is recommended to use the counter push mechanism
+ in [sFlow-v5] for the interface counters. Doing so would help avoid
+ counter polling through the MIB interface.
+
+ The outgoing link utilization of the component links within a
+ LAG/ECMP group can be used to compute the imbalance (see Section 5.1)
+ for the LAG/ECMP group.
+
+5.6.2. Other Monitoring Information
+
+ Additional monitoring information that is useful includes:
+
+ o Number of times rebalancing was done.
+
+ o Time since the last rebalancing event.
+
+ o The number of large flows currently rebalanced by the scheme.
+
+ o A list of the large flows that have been rebalanced including
+
+ - the rate of each large flow at the time of the last rebalancing
+ for that flow,
+
+ - the time that rebalancing was last performed for the given
+ large flow, and
+
+ - the interface(s) that the large flow was (re)directed to.
+
+ o The settings for the weights of the interfaces within a LAG/ECMP
+ group used by the small flows that depend on hashing.
+
+6. Operational Considerations
+
+6.1. Rebalancing Frequency
+
+ Flows should be rebalanced only when the imbalance in the utilization
+ across component links exceeds a certain threshold. Frequent
+ rebalancing to achieve precise equitable utilization across component
+ links could be counterproductive as it may result in moving flows
+ back and forth between the component links, impacting packet ordering
+ and system stability. This applies regardless of whether large flows
+ or small flows are redistributed. It should be noted that reordering
+ is a concern for TCP flows with even a few packets because three out-
+ of-order packets would trigger sufficient duplicate ACKs to the
+ sender, resulting in a retransmission [RFC5681].
+
+ The operator would have to experiment with various values of the
+ large flow recognition parameters (minimum bandwidth threshold,
+ minimum bandwidth threshold for large flow maintenance, and
+ observation interval) and the imbalance threshold across component
+ links to tune the solution for their environment.
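+
+ The gating logic implied by the imbalance threshold and the
+ rebalancing interval of Section 5.1 can be sketched as follows.
+ The function name and the sample values are assumptions, and
+ suitable settings depend on the operator's environment.
+
```python
def should_rebalance(current_imbalance, imbalance_threshold,
                     now_s, last_rebalance_s, rebalancing_interval_s):
    """Return True only if the imbalance is above the threshold and
    the minimum time between rebalancing events has elapsed."""
    if current_imbalance <= imbalance_threshold:
        return False
    return (now_s - last_rebalance_s) >= rebalancing_interval_s

# Assumed settings: 0.2 imbalance threshold, 60 s rebalancing interval.
too_soon = should_rebalance(0.4, 0.2, now_s=130, last_rebalance_s=100,
                            rebalancing_interval_s=60)
allowed = should_rebalance(0.4, 0.2, now_s=170, last_rebalance_s=100,
                           rebalancing_interval_s=60)
balanced = should_rebalance(0.1, 0.2, now_s=170, last_rebalance_s=100,
                            rebalancing_interval_s=60)
```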
+
+6.2. Handling Route Changes
+
+ Large flow rebalancing must be aware of any changes to the Forwarding
+ Information Base (FIB). In cases where the next hop of a route no
+ longer points to the LAG or to an ECMP group, any PBR entries
+ added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in
+ order to avoid the creation of forwarding loops.
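+
+ A minimal sketch of this check is shown below, assuming that each
+ PBR entry records the route prefix and the LAG/ECMP group it was
+ installed for, and that the FIB is visible as a mapping from prefix
+ to next-hop group. All names and structures are hypothetical.
+
```python
def withdraw_stale_pbr(pbr_entries, fib):
    """Split PBR entries into those to keep and those to withdraw.

    pbr_entries: {flow: {"prefix": str, "group": str, "link": int}}
    fib:         {prefix: next-hop group currently used by the route}
    """
    kept, withdrawn = {}, []
    for flow, entry in pbr_entries.items():
        if fib.get(entry["prefix"]) == entry["group"]:
            kept[flow] = entry
        else:
            # The route no longer points to this LAG/ECMP group.
            withdrawn.append(flow)
    return kept, withdrawn

pbr_entries = {
    "flow-a": {"prefix": "203.0.113.0/24", "group": "lag-1", "link": 2},
    "flow-b": {"prefix": "198.51.100.0/24", "group": "ecmp-7", "link": 0},
}
# After a route change, 198.51.100.0/24 now resolves to another group.
fib = {"203.0.113.0/24": "lag-1", "198.51.100.0/24": "lag-9"}

kept, withdrawn = withdraw_stale_pbr(pbr_entries, fib)
```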
+
+6.3. Forwarding Resources
+
+ Hash-based techniques used for load balancing with LAG/ECMP are
+ usually stateless. The mechanisms described in this document require
+ additional resources in the forwarding plane of routers for creating
+ PBR rules that are capable of overriding the forwarding decision from
+ the hash-based approach. These resources may limit the number of
+ flows that can be rebalanced and may also impact the latency
+ experienced by packets due to the additional lookups that are
+ required.
+
+7. Security Considerations
+
+ This document does not directly impact the security of the Internet
+ infrastructure or its applications. In fact, it could help if there
+ is a DoS attack pattern that causes a hash imbalance resulting in
+ heavy overloading of large flows to certain LAG/ECMP component links.
+
+ An attacker with knowledge of the large flow recognition algorithm
+ and any stateless distribution method can generate flows that are
+ distributed in a way that overloads a specific path. This could be
+ used to cause the creation of PBR rules that exhaust the available
+ PBR rule capacity on routers in the network. If PBR rules are
+ consequently discarded, this could result in congestion on the
+ attacker-selected path. Alternatively, tracking large numbers of PBR
+ rules could result in performance degradation.
+
+8. References
+
+8.1. Normative References
+
+ [802.1AX] IEEE, "IEEE Standard for Local and metropolitan area
+ networks - Link Aggregation", IEEE Std 802.1AX-2008,
+ 2008.
+
+ [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast
+ and Multicast Next-Hop Selection", RFC 2991, November
+ 2000, <http://www.rfc-editor.org/info/rfc2991>.
+
+ [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
+ "Specification of the IP Flow Information Export (IPFIX)
+ Protocol for the Exchange of Flow Information", STD 77,
+ RFC 7011, September 2013,
+ <http://www.rfc-editor.org/info/rfc7011>.
+
+ [RFC7012] Claise, B., Ed., and B. Trammell, Ed., "Information
+ Model for IP Flow Information Export (IPFIX)", RFC 7012,
+ September 2013,
+ <http://www.rfc-editor.org/info/rfc7012>.
+
+8.2. Informative References
+
+ [BIN-PACK] Coffman, Jr., E., Garey, M., and D. Johnson.
+ "Approximation Algorithms for Bin-Packing -- An Updated
+ Survey" (in "Algorithm Design for Computer System
+ Design"), Springer, 1984.
+
+ [CAIDA] "Caida Traffic Analysis Research",
+ <http://www.caida.org/research/traffic-analysis/>.
+
+ [DEVOFLOW] Mogul, J., Tourrilhes, J., Yalagandula, P., Sharma, P.,
+ Curtis, R., and S. Banerjee, "DevoFlow: Cost-Effective
+ Flow Management for High Performance Enterprise
+ Networks", Proceedings of the ACM SIGCOMM, 2010.
+
+ [FLOW-ACC] Zseby, T., Hirsch, T., and B. Claise, "Packet Sampling
+ for Flow Accounting: Challenges and Limitations",
+ Proceedings of the 9th international Passive and Active
+ Measurement Conference, 2008.
+
+ [ITCOM] Jo, J., Kim, Y., Chao, H., and F. Merat, "Internet
+ traffic load balancing using dynamic hashing with flow
+ volume", SPIE ITCOM, 2002.
+
+ [NDTM] Estan, C. and G. Varghese, "New Directions in Traffic
+ Measurement and Accounting", Proceedings of ACM SIGCOMM,
+ August 2002.
+
+ [NVGRE] Garg, P. and Y. Wang, "NVGRE: Network Virtualization
+ using Generic Routing Encapsulation", Work in Progress,
+ draft-sridharan-virtualization-nvgre-07, November 2014.
+
+ [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
+ Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
+ March 2000, <http://www.rfc-editor.org/info/rfc2784>.
+
+ [RFC6790] Kompella, K., Drake, J., Amante, S., Henderickx, W., and
+ L. Yong, "The Use of Entropy Labels in MPLS Forwarding",
+ RFC 6790, November 2012,
+ <http://www.rfc-editor.org/info/rfc6790>.
+
+ [RFC1213] McCloghrie, K. and M. Rose, "Management Information Base
+ for Network Management of TCP/IP-based internets:
+ MIB-II", STD 17, RFC 1213, March 1991,
+ <http://www.rfc-editor.org/info/rfc1213>.
+
+ [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
+ Algorithm", RFC 2992, November 2000,
+ <http://www.rfc-editor.org/info/rfc2992>.
+
+ [RFC3273] Waldbusser, S., "Remote Network Monitoring Management
+ Information Base for High Capacity Networks", RFC 3273,
+ July 2002, <http://www.rfc-editor.org/info/rfc3273>.
+
+ [RFC3931] Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed.,
+ "Layer Two Tunneling Protocol - Version 3 (L2TPv3)", RFC
+ 3931, March 2005,
+ <http://www.rfc-editor.org/info/rfc3931>.
+
+ [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export
+ Version 9", RFC 3954, October 2004,
+ <http://www.rfc-editor.org/info/rfc3954>.
+
+ [RFC5470] Sadasivan, G., Brownlee, N., Claise, B., and J. Quittek,
+ "Architecture for IP Flow Information Export", RFC 5470,
+ March 2009, <http://www.rfc-editor.org/info/rfc5470>.
+
+ [RFC5475] Zseby, T., Molina, M., Duffield, N., Niccolini, S., and
+ F. Raspall, "Sampling and Filtering Techniques for IP
+ Packet Selection", RFC 5475, March 2009,
+ <http://www.rfc-editor.org/info/rfc5475>.
+
+ [RFC5640] Filsfils, C., Mohapatra, P., and C. Pignataro, "Load-
+ Balancing for Mesh Softwires", RFC 5640, August 2009,
+ <http://www.rfc-editor.org/info/rfc5640>.
+
+ [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
+ Control", RFC 5681, September 2009,
+ <http://www.rfc-editor.org/info/rfc5681>.
+
+ [RFC7223] Bjorklund, M., "A YANG Data Model for Interface
+ Management", RFC 7223, May 2014,
+ <http://www.rfc-editor.org/info/rfc7223>.
+
+ [RFC7226] Villamizar, C., Ed., McDysan, D., Ed., Ning, S., Malis,
+ A., and L. Yong, "Requirements for Advanced Multipath in
+ MPLS Networks", RFC 7226, May 2014,
+ <http://www.rfc-editor.org/info/rfc7226>.
+
+ [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics",
+ <http://www.sflow.org/packetSamplingBasics/>.
+
+ [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004,
+ <http://www.sflow.org/sflow_version_5.txt>.
+
+ [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG Counters
+ Structure", September 2012,
+ <http://www.sflow.org/sflow_lag.txt>.
+
+ [STT] Davie, B., Ed., and J. Gross, "A Stateless Transport
+ Tunneling Protocol for Network Virtualization (STT)",
+ Work in Progress, draft-davie-stt-06, April 2014.
+
+ [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P.,
+ Kreeger, L., Sridhar, T., Bursell, M., and C. Wright,
+ "Virtual eXtensible Local Area Network (VXLAN): A
+ Framework for Overlaying Virtualized Layer 2 Networks
+ over Layer 3 Networks", RFC 7348, August 2014,
+ <http://www.rfc-editor.org/info/rfc7348>.
+
+ [YONG] Yong, L. and P. Yang, "Enhanced ECMP and Large Flow
+ Aware Transport", Work in Progress,
+ draft-yong-pwe3-enhance-ecmp-lfat-01, March 2010.
+
+Appendix A. Internet Traffic Analysis and Load-Balancing Simulation
+
+ Internet traffic [CAIDA] has been analyzed to obtain flow statistics
+ such as the number of packets in a flow and the flow duration. The
+ 5-tuple in the packet header (IP source address, IP destination
+ address, transport protocol source port number, transport protocol
+ destination port number, and IP protocol) is used for flow
+ identification. The analysis indicates that < ~2% of the flows take
+ ~30% of total traffic volume while the rest of the flows (> ~98%)
+ contribute ~70% [YONG].
+
+ The simulation has shown that, given Internet traffic patterns, the
+ hash-based technique does not evenly distribute flows over ECMP
+ paths. Some paths may be > 90% loaded while others are < 40% loaded.
+ The greater the number of ECMP paths, the more severe is the
+ imbalance in the load distribution. This implies that hash-based
+ distribution can cause some paths to become congested while other
+ paths are underutilized [YONG].
+
+ The simulation also shows substantial improvement by using the large
+ flow-aware, hash-based distribution technique described in this
+ document. In using the same simulated traffic, the improved
+ rebalancing can achieve < 10% load differences among the paths. This
+ shows how large flow-aware, hash-based distribution can effectively
+ compensate for the uneven load balancing caused by hashing and the
+ traffic characteristics [YONG].
+
+Acknowledgements
+
+ The authors would like to thank the following individuals for their
+ review and valuable feedback on earlier versions of this document:
+ Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian
+ Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh
+ Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter
+ Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George Yum,
+ and Weifeng Zhang. As a part of the IETF Last Call process, valuable
+ comments were received from Martin Thomson and Carlos Pignataro.
+
+Contributors
+
+ Sanjay Khanna
+ Cisco Systems
+ EMail: sanjakha@gmail.com
+
+Authors' Addresses
+
+ Ram Krishnan
+ Brocade Communications
+ San Jose, CA 95134
+ United States
+ Phone: +1-408-406-7890
+ EMail: ramkri123@gmail.com
+
+
+ Lucy Yong
+ Huawei USA
+ 5340 Legacy Drive
+ Plano, TX 75025
+ United States
+ Phone: +1-469-277-5837
+ EMail: lucy.yong@huawei.com
+
+
+ Anoop Ghanwani
+ Dell
+ 5450 Great America Pkwy
+ Santa Clara, CA 95054
+ United States
+ Phone: +1-408-571-3228
+ EMail: anoop@alumni.duke.edu
+
+
+ Ning So
+ Vinci Systems
+ 2613 Fairbourne Cir
+ Plano, TX 75093
+ United States
+ EMail: ningso@yahoo.com
+
+
+ Bhumip Khasnabish
+ ZTE Corporation
+ New Jersey 07960
+ United States
+ Phone: +1-781-752-8003
+ EMail: vumip1@gmail.com
+