diff --git a/doc/rfc/rfc4297.txt b/doc/rfc/rfc4297.txt
new file mode 100644
index 0000000..3ba5312
--- /dev/null
+++ b/doc/rfc/rfc4297.txt
@@ -0,0 +1,1123 @@
+
+
+
+
+
+
+Network Working Group A. Romanow
+Request for Comments: 4297 Cisco
+Category: Informational J. Mogul
+ HP
+ T. Talpey
+ NetApp
+ S. Bailey
+ Sandburst
+ December 2005
+
+
+ Remote Direct Memory Access (RDMA) over IP Problem Statement
+
+Status of This Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ Overhead due to the movement of user data in the end-system network
+ I/O processing path at high speeds is significant, and has limited
+ the use of Internet protocols in interconnection networks, and the
+ Internet itself -- especially where high bandwidth, low latency,
+ and/or low overhead are required by the hosted application.
+
+ This document examines this overhead, and addresses an architectural,
+ IP-based "copy avoidance" solution for its elimination, by enabling
+ Remote Direct Memory Access (RDMA).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 1]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Table of Contents
+
+ 1. Introduction ....................................................2
+ 2. The High Cost of Data Movement Operations in Network I/O ........4
+ 2.1. Copy avoidance improves processing overhead. ...............5
+ 3. Memory bandwidth is the root cause of the problem. ..............6
+ 4. High copy overhead is problematic for many key Internet
+ applications. ...................................................8
+ 5. Copy Avoidance Techniques ......................................10
+ 5.1. A Conceptual Framework: DDP and RDMA ......................11
+ 6. Conclusions ....................................................12
+ 7. Security Considerations ........................................12
+ 8. Terminology ....................................................14
+ 9. Acknowledgements ...............................................14
+ 10. Informative References ........................................15
+
+1. Introduction
+
+ This document considers the problem of high host processing overhead
+ associated with the movement of user data to and from the network
+ interface under high speed conditions. This problem is often
+ referred to as the "I/O bottleneck" [CT90]. More specifically, the
+ source of high overhead that is of interest here is data movement
+ operations, i.e., copying. The throughput of a system may therefore
+ be limited by the overhead of this copying. This issue is not to be
+ confused with TCP offload, which is not addressed here. High speed
+ refers to conditions where the network link speed is high, relative
+ to the bandwidths of the host CPU and memory. With today's computer
+ systems, one Gigabit per second (Gbits/s) and over is considered high
+ speed.
+
+ High costs associated with copying are an issue primarily for large
+ scale systems. Although smaller systems such as rack-mounted PCs and
+ small workstations would benefit from a reduction in copying
+ overhead, the benefit to smaller machines will be primarily in the
+ next few years as they scale the amount of bandwidth they handle.
+ Today, it is large system machines with high bandwidth feeds, usually
+ multiprocessors and clusters, that are adversely affected by copying
+ overhead. Examples of such machines include all varieties of
+ servers: database servers, storage servers, application servers for
+ transaction processing, for e-commerce, and web serving, content
+ distribution, video distribution, backups, data mining and decision
+ support, and scientific computing.
+
+ Note that such servers almost exclusively service many concurrent
+ sessions (transport connections), which, in aggregate, are
+ responsible for > 1 Gbits/s of communication. Nonetheless, the cost
+
+
+
+
+Romanow, et al. Informational [Page 2]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ of copying overhead for a particular load is the same whether from
+ few or many sessions.
+
+ The I/O bottleneck, and the role of data movement operations, have
+ been widely studied in research and industry over the last
+ approximately 14 years, and we draw freely on these results.
+ Historically, the I/O bottleneck has received attention whenever new
+ networking technology has substantially increased line rates: 100
+ Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data
+ Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], 1
+ Gbits/s Ethernet. In earlier speed transitions, the availability of
+ memory bandwidth allowed the I/O bottleneck issue to be deferred.
+ Now, however, this is no longer the case.  While the I/O problem is
+ significant at 1 Gbits/s, it is the introduction of 10 Gbits/s
+ Ethernet which is motivating an upsurge of activity in industry and
+ research [IB, VI, CGY01, Ma02, MAF+02].
+
+ Because of high overhead of end-host processing in current
+ implementations, the TCP/IP protocol stack is not used for high speed
+ transfer. Instead, special purpose network fabrics, using a
+ technology generally known as Remote Direct Memory Access (RDMA),
+ have been developed and are widely used. RDMA is a set of mechanisms
+ that allow the network adapter, under control of the application, to
+ steer data directly into and out of application buffers. Examples of
+ such interconnection fabrics include Fibre Channel [FIBRE] for block
+ storage transfer, Virtual Interface Architecture [VI] for database
+ clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and
+ Quadrics [QUAD] for System Area Networks. These link level
+ technologies limit application scaling in both distance and size,
+ meaning that the number of nodes cannot be arbitrarily large.
+
+ This problem statement substantiates the claim that in network I/O
+ processing, high overhead results from data movement operations,
+ specifically copying; and that copy avoidance significantly decreases
+ this processing overhead. It describes when and why the high
+ processing overheads occur, explains why the overhead is problematic,
+ and points out which applications are most affected.
+
+ The document goes on to discuss why the problem is relevant to the
+ Internet and to Internet-based applications. Applications that
+ store, manage, and distribute the information of the Internet are
+ well suited to applying the copy avoidance solution. They will
+ benefit by avoiding high processing overheads, which removes limits
+ to the available scaling of tiered end-systems. Copy avoidance also
+ eliminates latency for these systems, which can further benefit
+ effective distributed processing.
+
+
+
+
+
+Romanow, et al. Informational [Page 3]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ In addition, this document introduces an architectural approach to
+ solving the problem, which is developed in detail in [BT05]. It also
+ discusses how the proposed technology may introduce security concerns
+ and how they should be addressed.
+
+ Finally, this document includes a Terminology section to aid as a
+ reference for several new terms introduced by RDMA.
+
+2. The High Cost of Data Movement Operations in Network I/O
+
+ A wealth of data from research and industry shows that copying is
+ responsible for substantial amounts of processing overhead. It
+ further shows that even in carefully implemented systems, eliminating
+ copies significantly reduces the overhead, as referenced below.
+
+ Clark et al. [CJRS89] in 1989 shows that TCP [Po81] overhead
+ processing is attributable to both operating system costs (such as
+ interrupts, context switches, process management, buffer management,
+ timer management) and the costs associated with processing individual
+ bytes (specifically, computing the checksum and moving data in
+ memory). They found that moving data in memory is the more important
+ of the costs, and their experiments show that memory bandwidth is the
+ greatest source of limitation. In the data presented [CJRS89], 64%
+ of the measured microsecond overhead was attributable to data
+ touching operations, and 48% was accounted for by copying. The
+ system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet
+ packets.
+
+ In a well-implemented system, copying can occur between the network
+ interface and the kernel, and between the kernel and application
+ buffers; there are two copies, each of which requires two memory bus
+ crossings, a read and a write.  Although in certain circumstances it
+ is possible to do better, usually two copies are required on receive.
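+
+ To make the copies and their bus crossings concrete, the following
+ user-space sketch models the two-copy receive path described above.
+ It is illustrative only: the buffer names, the single fixed packet
+ size, and the use of memcpy() in place of real driver and kernel code
+ are assumptions of the sketch, not measurements or interfaces from
+ the studies cited here.
+
+    /* Illustrative model of a two-copy receive path. */
+    #include <stdio.h>
+    #include <string.h>
+
+    #define PKT_SIZE 1460
+
+    static char nic_buffer[PKT_SIZE];     /* data deposited by the NIC  */
+    static char kernel_buffer[PKT_SIZE];  /* kernel network buffer      */
+    static char app_buffer[PKT_SIZE];     /* application receive buffer */
+
+    int main(void)
+    {
+        memset(nic_buffer, 'x', PKT_SIZE);  /* stand-in for arriving data */
+
+        /* Copy 1: network interface buffer -> kernel buffer.  The CPU
+         * reads each byte and writes it back: two bus crossings. */
+        memcpy(kernel_buffer, nic_buffer, PKT_SIZE);
+
+        /* Copy 2: kernel buffer -> application buffer, e.g., at read().
+         * Again one read and one write per byte: two more crossings. */
+        memcpy(app_buffer, kernel_buffer, PKT_SIZE);
+
+        printf("%d bytes delivered; each byte crossed the memory bus"
+               " 4 times in copies alone\n", PKT_SIZE);
+        return 0;
+    }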
+
+ Subsequent work has consistently shown the same phenomenon as the
+ earlier Clark study. A number of studies report results that data-
+ touching operations, checksumming and data movement, dominate the
+ processing costs for messages longer than 128 Bytes [BS96, CGY01,
+ Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet
+ overheads dominate [KP96, CGY01].
+
+ The percentage of overhead due to data-touching operations increases
+ with packet size, since time spent on per-byte operations scales
+ linearly with message size [KP96]. For example, Chu [Ch96] reported
+ substantial per-byte latency costs as a percentage of total
+ networking software costs for an MTU size packet on a SPARCstation/20
+
+
+
+
+
+Romanow, et al. Informational [Page 4]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ running memory-to-memory TCP tests over networks with 3 different MTU
+ sizes. The percentage of total software costs attributable to
+ per-byte operations were:
+
+ 1500 Byte Ethernet 18-25%
+ 4352 Byte FDDI 35-50%
+ 9180 Byte ATM 55-65%
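+
+ A simple cost model makes this trend concrete: if per-packet
+ processing costs a fixed amount p and data-touching operations cost b
+ per byte, the data-touching share of an n-byte packet is
+ b*n / (p + b*n), which grows with n.  The sketch below evaluates that
+ share at the three MTU sizes above; the two constants are arbitrary
+ values chosen only to illustrate the shape of the curve, not
+ measurements from [Ch96] or [KP96].
+
+    /* Illustrative per-packet vs. per-byte cost model; the constants
+     * are arbitrary and not taken from the cited studies. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const double per_packet_cost = 80.0;  /* assumed fixed cost, us    */
+        const double per_byte_cost   = 0.02;  /* assumed per-byte cost, us */
+        const int mtu[] = { 1500, 4352, 9180 };
+
+        for (int i = 0; i < 3; i++) {
+            double touching = per_byte_cost * mtu[i];
+            double total    = per_packet_cost + touching;
+            printf("MTU %5d: data-touching share = %4.1f%%\n",
+                   mtu[i], 100.0 * touching / total);
+        }
+        return 0;
+    }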
+
+ Although many studies report results for data-touching operations,
+ including checksumming and data movement together, much work has
+ focused just on copying [BS96, Br99, Ch96, TK95]. For example,
+ [KP96] reports results that separate processing times for checksum
+ from data movement operations. For the 1500 Byte Ethernet size, 20%
+ of total processing overhead time is attributable to copying. The
+ study used 2 DECstations 5000/200 connected by an FDDI network. (In
+ this study, checksum accounts for 30% of the processing time.)
+
+2.1. Copy avoidance improves processing overhead.
+
+ A number of studies show that eliminating copies substantially
+ reduces overhead. For example, results from copy-avoidance in the
+ IO-Lite system [PDZ99], which aimed at improving web server
+ performance, show a throughput increase of 43% over an optimized web
+ server, and 137% improvement over an Apache server. The system was
+ implemented in a 4.4BSD-derived UNIX kernel, and the experiments used
+ a server system based on a 333MHz Pentium II PC connected to a
+ switched 100 Mbits/s Fast Ethernet.
+
+ There are many other examples where elimination of copying using a
+ variety of different approaches showed significant improvement in
+ system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We
+ will discuss the results of one of these studies in detail in order
+ to clarify the significant degree of improvement produced by copy
+ avoidance [Ch02].
+
+ Recent work by Chase et al. [CGY01], measuring CPU utilization, shows
+ that avoiding copies reduces CPU time spent on data access from 24%
+ to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation
+ XP1000 and a Myrinet adapter [BCF+95]. This is an absolute
+ improvement of 9% due to copy avoidance.
+
+ The total CPU utilization was 35%, with data access accounting for
+ 24%. Thus, the relative importance of reducing copies is 26%. At
+ 370 Mbits/s, the system is not very heavily loaded. The relative
+ improvement in achievable bandwidth is 34%. This is the improvement
+ we would see if copy avoidance were added when the machine was
+ saturated by network I/O.
+
+
+
+
+Romanow, et al. Informational [Page 5]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Note that improvement from the optimization becomes more important if
+ the overhead it targets is a larger share of the total cost. This is
+ what happens if other sources of overhead, such as checksumming, are
+ eliminated. In [CGY01], after removing checksum overhead, copy
+ avoidance reduces CPU utilization from 26% to 10%. This is a 16%
+ absolute reduction, a 61% relative reduction, and a 160% relative
+ improvement in achievable bandwidth.
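+
+ The percentages above follow directly from the published utilization
+ numbers.  The sketch below simply restates that arithmetic; it
+ introduces no data beyond the figures already quoted from [CGY01],
+ and the small differences from the quoted values are rounding.
+
+    /* Re-derives the reductions and bandwidth gains quoted above
+     * from the [CGY01] utilization figures. */
+    #include <stdio.h>
+
+    static void report(const char *label, double before, double after)
+    {
+        double absolute = before - after;
+        printf("%s:\n", label);
+        printf("  absolute CPU reduction:        %.0f%%\n", absolute);
+        printf("  relative CPU reduction:        %.1f%%\n",
+               100.0 * absolute / before);
+        /* Achievable bandwidth scales inversely with CPU cost per byte. */
+        printf("  achievable bandwidth increase: %.1f%%\n",
+               100.0 * (before / after - 1.0));
+    }
+
+    int main(void)
+    {
+        /* Checksumming retained: total CPU 35%, falling to 26% when
+         * data access drops from 24% to 15%. */
+        report("Copy avoidance, checksum retained", 35.0, 26.0);
+
+        /* Checksum overhead removed as well: 26% falling to 10%. */
+        report("Copy avoidance, checksum removed", 26.0, 10.0);
+        return 0;
+    }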
+
+ In fact, today's network interface hardware commonly offloads the
+ checksum, which removes the other source of per-byte overhead.  It
+ also coalesces interrupts to reduce per-packet costs.  Thus, today
+ copying costs account for a relatively larger part of CPU utilization
+ than previously, and therefore relatively more benefit is to be
+ gained in reducing them. (Of course this argument would be specious
+ if the amount of overhead were insignificant, but it has been shown
+ to be substantial. [BS96, Br99, Ch96, KP96, TK95])
+
+3. Memory bandwidth is the root cause of the problem.
+
+ Data movement operations are expensive because memory bandwidth is
+ scarce relative to network bandwidth and CPU bandwidth [PAC+97].
+ This trend existed in the past and is expected to continue into the
+ future [HP97, STREAM], especially in large multiprocessor systems.
+
+ With copies crossing the bus twice per copy, network processing
+ overhead is high whenever network bandwidth is large in comparison to
+ CPU and memory bandwidths. Generally, with today's end-systems, the
+ effects are observable at network speeds over 1 Gbits/s. In fact,
+ with multiple bus crossings it is possible to see the bus bandwidth
+ being the limiting factor for throughput. This prevents such an
+ end-system from simultaneously achieving full network bandwidth and
+ full application performance.
+
+ A common question is whether an increase in CPU processing power
+ alleviates the problem of high processing costs of network I/O. The
+ answer is no, it is the memory bandwidth that is the issue. Faster
+ CPUs do not help if the CPU spends most of its time waiting for
+ memory [CGY01].
+
+ The widening gap between microprocessor performance and memory
+ performance has long been a widely recognized and well-understood
+ problem [PAC+97]. Hennessy [HP97] shows microprocessor performance
+ grew from 1980-1998 at 60% per year, while the access time to DRAM
+ improved at 10% per year, giving rise to an increasing "processor-
+ memory performance gap".
+
+
+
+
+
+
+Romanow, et al. Informational [Page 6]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Another source of relevant data is the STREAM Benchmark Reference
+ Information website, which provides information on the STREAM
+ benchmark [STREAM]. The benchmark is a simple synthetic benchmark
+ program that measures sustainable memory bandwidth (in MBytes/s) and
+ the corresponding computation rate for simple vector kernels measured
+ in MFLOPS. The website tracks information on sustainable memory
+ bandwidth for hundreds of machines and all major vendors.
+
+ Results show measured system performance statistics. Processing
+ performance from 1985-2001 increased at 50% per year on average, and
+ sustainable memory bandwidth from 1975 to 2001 increased at 35% per
+ year, on average, over all the systems measured. A similar 15% per
+ year lead of processing bandwidth over memory bandwidth shows up in
+ another statistic, machine balance [Mc95], a measure of the relative
+ rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory
+ ops/cycle) [STREAM].
+
+ Network bandwidth has been increasing about 10-fold roughly every 8
+ years, which is a 40% per year growth rate.
+
+ A typical example illustrates that the memory bandwidth compares
+ unfavorably with link speed. The STREAM benchmark shows that a
+ modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will
+ move the data 3 times in doing a receive operation: once for the
+ network interface to deposit the data in memory, and twice for the
+ CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning
+ one read or one write, the machine could handle approximately 2.67
+ Gbits/s of network bandwidth, one third the copy bandwidth. But this
+ assumes 100% utilization, which is not possible, and more importantly
+ the machine would be totally consumed! (A rule of thumb for
+ databases is that 20% of the machine should be required to service
+ I/O, leaving 80% for the database application. And, the less, the
+ better.)
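+
+ The 2.67 Gbits/s figure is simply the stated memory bandwidth divided
+ by the number of times each received byte transits memory; the sketch
+ below re-derives it from the assumptions given above and adds nothing
+ new.
+
+    /* Re-derives the best-case receive rate for the example above:
+     * three memory transits per byte against 1 GByte/s of memory
+     * bandwidth. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const double mem_bw_gbits = 1.0 * 8.0;  /* 1 GByte/s = 8 Gbits/s */
+        const int    transits     = 3;  /* NIC deposit + copy read + write */
+
+        printf("Best-case receive rate: %.2f Gbits/s"
+               " (at 100%% memory utilization)\n",
+               mem_bw_gbits / transits);
+        return 0;
+    }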
+
+ In 2001, 1 Gbits/s links were common. An application server may
+ typically have two 1 Gbits/s connections: one connection backend to a
+ storage server and one front-end, say for serving HTTP [FGM+99].
+ Thus, the communications could use 2 Gbits/s. In our typical
+ example, the machine could handle 2.7 Gbits/s at its theoretical
+ maximum while doing nothing else. This means that the machine
+ basically could not keep up with the communication demands in 2001;
+ with the relative growth trends, the situation only gets worse.
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 7]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+4. High copy overhead is problematic for many key Internet
+ applications.
+
+ If a significant portion of resources on an application machine is
+ consumed in network I/O rather than in application processing, it
+ becomes difficult for the application to scale, i.e., to handle more
+ clients or to offer more services.
+
+ Several years ago the most affected applications were streaming
+ multimedia, parallel file systems, and supercomputing on clusters
+ [BS96]. In addition, today the applications that suffer from copying
+ overhead are more central in Internet computing -- they store,
+ manage, and distribute the information of the Internet and the
+ enterprise. They include database applications doing transaction
+ processing, e-commerce, web serving, decision support, content
+ distribution, video distribution, and backups. Clusters are
+ typically used for this category of application, since they have
+ advantages of availability and scalability.
+
+ Today these applications, which provide and manage Internet and
+ corporate information, are typically run in data centers that are
+ organized into three logical tiers. One tier is typically a set of
+ web servers connecting to the WAN. The second tier is a set of
+ application servers that run the specific applications usually on
+ more powerful machines, and the third tier is backend databases.
+ Physically, the first two tiers -- web server and application server
+ -- are usually combined [Pi01]. For example, an e-commerce server
+ communicates with a database server and with a customer site, or a
+ content distribution server connects to a server farm, or an OLTP
+ server connects to a database and a customer site.
+
+ When network I/O uses too much memory bandwidth, performance on
+ network paths between tiers can suffer. (There might also be
+ performance issues on Storage Area Network paths used either by the
+ database tier or the application tier.) The high overhead from
+ network-related memory copies diverts system resources from other
+ application processing. It also can create bottlenecks that limit
+ total system performance.
+
+ There is high motivation to maximize the processing capacity of each
+ CPU because scaling by adding CPUs, one way or another, has
+ drawbacks. For example, adding CPUs to a multiprocessor will not
+ necessarily help because a multiprocessor improves performance only
+ when the memory bus has additional bandwidth to spare. Clustering
+ can add additional complexity to handling the applications.
+
+ In order to scale a cluster or multiprocessor system, one must
+ proportionately scale the interconnect bandwidth. Interconnect
+
+
+
+Romanow, et al. Informational [Page 8]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ bandwidth governs the performance of communication-intensive parallel
+ applications; if this (often expressed in terms of "bisection
+ bandwidth") is too low, adding additional processors cannot improve
+ system throughput. Interconnect latency can also limit the
+ performance of applications that frequently share data between
+ processors.
+
+ So, excessive overheads on network paths in a "scalable" system can
+ both require the use of more processors than optimal and reduce the
+ marginal utility of those additional processors.
+
+ Copy avoidance scales a machine upwards by removing at least two-
+ thirds of the bus bandwidth load from the "very best" 1-copy (on
+ receive) implementations, and at least 80% of the bandwidth overhead
+ from the 2-copy implementations.
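+
+ The two-thirds and 80% figures follow from counting bus crossings per
+ received byte under the model used earlier in this document: one
+ crossing for the network interface to deposit the data, and two (a
+ read and a write) for each CPU copy.  The sketch below re-derives
+ them; the crossing counts are that model, not new measurements.
+
+    /* Bus crossings per received byte for the receive paths above:
+     * one crossing for the NIC deposit, two for each CPU copy. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const int nic_deposit = 1;
+        const int per_copy    = 2;
+
+        int two_copy  = nic_deposit + 2 * per_copy;  /* 5 crossings */
+        int one_copy  = nic_deposit + 1 * per_copy;  /* 3 crossings */
+        int zero_copy = nic_deposit;                 /* 1 crossing  */
+
+        printf("2-copy: %d crossings; copy avoidance removes %.0f%%\n",
+               two_copy, 100.0 * (two_copy - zero_copy) / two_copy);
+        printf("1-copy: %d crossings; copy avoidance removes %.0f%%\n",
+               one_copy, 100.0 * (one_copy - zero_copy) / one_copy);
+        return 0;
+    }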
+
+ The removal of bus bandwidth requirements, in turn, removes
+ bottlenecks from the network processing path and increases the
+ throughput of the machine. On a machine with limited bus bandwidth,
+ the advantages of removing this load are immediately evident, as the
+ host can attain full network bandwidth. Even on a machine with bus
+ bandwidth adequate to sustain full network bandwidth, removal of bus
+ bandwidth load serves to increase the availability of the machine for
+ the processing of user applications, in some cases dramatically.
+
+ An example showing poor performance with copies and improved scaling
+ with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows
+ higher server throughput servicing more clients using a zero-copy
+ system. In an experiment designed to mimic real world web conditions
+ by simulating the effect of TCP WAN connections on the server, the
+ performance of 3 servers was compared. One server was Apache,
+ another was an optimized server called Flash, and the third was the
+ Flash server running IO-Lite, called Flash-Lite with zero copy. The
+ measurement was of throughput in requests/second as a function of the
+ number of slow background clients that could be served. As the table
+ shows, Flash-Lite has better throughput, especially as the number of
+ clients increases.
+
+    #Clients      Apache       Flash    Flash-Lite
+                 (reqs/s)    (reqs/s)    (reqs/s)
+    --------     --------    --------   ----------
+
+        0           520         610         890
+       16           390         490         890
+       32           360         490         850
+       64           360         490         890
+      128           310         450         880
+      256           310         440         820
+
+
+
+Romanow, et al. Informational [Page 9]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Traditional Web servers (which mostly send data and can keep most of
+ their content in the file cache) are not the worst case for copy
+ overhead. Web proxies (which often receive as much data as they
+ send) and complex Web servers based on System Area Networks or
+ multi-tier systems will suffer more from copy overheads than in the
+ example above.
+
+5. Copy Avoidance Techniques
+
+ There have been extensive research investigation and industry
+ experience with two main alternative approaches to eliminating data
+ movement overhead, often along with improving other Operating System
+ processing costs. In one approach, hardware and/or software changes
+ within a single host reduce processing costs. In another approach,
+ memory-to-memory networking [MAF+02], the exchange of explicit data
+ placement information between hosts allows them to reduce processing
+ costs.
+
+ The single host approaches range from new hardware and software
+ architectures [KSZ95, Wa97, DWB+93] to new or modified software
+ systems [BS96, Ch96, TK95, DP93, PDZ99]. In the approach based on
+ using a networking protocol to exchange information, the network
+ adapter, under control of the application, places data directly into
+ and out of application buffers, reducing the need for data movement.
+ Commonly this approach is called RDMA, Remote Direct Memory Access.
+
+ As discussed below, research and industry experience has shown that
+ copy avoidance techniques within the receiver processing path alone
+ have proven to be problematic. The research special purpose host
+ adapter systems had good performance and can be seen as precursors
+ for the commercial RDMA-based adapters [KSZ95, DWB+93]. In software,
+ many implementations have successfully achieved zero-copy transmit,
+ but few have accomplished zero-copy receive. And those that have
+ done so make strict alignment and no-touch requirements on the
+ application, greatly reducing the portability and usefulness of the
+ implementation.
+
+ In contrast, experience has proven satisfactory with memory-to-memory
+ systems that permit RDMA; performance has been good and there have
+ not been system or networking difficulties. RDMA is a single
+ solution. Once implemented, it can be used with any OS and machine
+ architecture, and it does not need to be revised when either of these
+ are changed.
+
+ In early work, one goal of the software approaches was to show that
+ TCP could go faster with appropriate OS support [CJRS89, CFF+94].
+ While this goal was achieved, further investigation and experience
+ showed that, though possible to craft software solutions, specific
+
+
+
+Romanow, et al. Informational [Page 10]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ system optimizations have been complex, fragile, extremely
+ interdependent with other system parameters in complex ways, and
+ often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93,
+ KSZ95, PDZ99]. The network I/O system interacts with other aspects
+ of the Operating System such as machine architecture and file I/O,
+ and disk I/O [Br99, Ch96, DP93].
+
+ For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
+ page remapping, shows that the results are highly interdependent with
+ other systems, such as the file system, and that the particular
+ optimizations are specific for particular architectures, meaning that
+ for each variation in architecture, optimizations must be re-crafted
+ [Ch96].
+
+ With RDMA, application I/O buffers are mapped directly, and the
+ authorized peer may access them without incurring additional
+ processing overhead.  When RDMA is implemented in hardware, arbitrary
+ data movement can be performed without involving the host CPU at all.
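+
+ The following toy, in-process model illustrates the direct placement
+ idea.  The function names, the steering-tag scheme, and the use of
+ memcpy() to stand in for adapter DMA are inventions of this sketch;
+ they are not a real RDMA API or wire protocol.
+
+    /* Toy model of direct data placement into a registered buffer. */
+    #include <stdio.h>
+    #include <string.h>
+
+    #define MAX_REGIONS 4
+
+    struct region { void *base; size_t len; int enabled; };
+    static struct region regions[MAX_REGIONS];
+
+    /* The application exposes a buffer and receives a steering tag. */
+    static int register_region(void *base, size_t len)
+    {
+        for (int stag = 0; stag < MAX_REGIONS; stag++) {
+            if (regions[stag].base == NULL) {
+                regions[stag] = (struct region){ base, len, 1 };
+                return stag;
+            }
+        }
+        return -1;
+    }
+
+    /* Models the adapter placing peer data directly into the
+     * registered application buffer -- no intermediate copy. */
+    static int rdma_write(int stag, size_t off, const void *src, size_t len)
+    {
+        if (stag < 0 || stag >= MAX_REGIONS || !regions[stag].enabled)
+            return -1;                         /* no consent: refuse    */
+        if (off + len > regions[stag].len)
+            return -1;                         /* out of bounds: refuse */
+        memcpy((char *)regions[stag].base + off, src, len);
+        return 0;
+    }
+
+    int main(void)
+    {
+        char app_buffer[64] = { 0 };
+        int stag = register_region(app_buffer, sizeof(app_buffer));
+
+        const char payload[] = "placed directly";
+        if (rdma_write(stag, 0, payload, sizeof(payload)) == 0)
+            printf("application buffer now holds: %s\n", app_buffer);
+        return 0;
+    }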
+
+ A number of research projects and industry products have been based
+ on the memory-to-memory approach to copy avoidance. These include
+ U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB],
+ Winsock Direct [Pi01]. Several memory-to-memory systems have been
+ widely used and have generally been found to be robust, to have good
+ performance, and to be relatively simple to implement. These include
+ VI [VI], Myrinet [BCF+95], Quadrics [QUAD], Compaq/Tandem Servernet
+ [SRVNET]. Networks based on these memory-to-memory architectures
+ have been used widely in scientific applications and in data centers
+ for block storage, file system access, and transaction processing.
+
+ By exporting direct memory access "across the wire", applications may
+ direct the network stack to manage all data directly from application
+ buffers.  A large and growing class of applications that takes
+ advantage of such capabilities has already emerged.  It includes all
+ the major databases, as well as network protocols such as Sockets
+ Direct [SDP].
+
+5.1. A Conceptual Framework: DDP and RDMA
+
+ An RDMA solution can be usefully viewed as being comprised of two
+ distinct components: "direct data placement (DDP)" and "remote direct
+ memory access (RDMA) semantics". They are distinct in purpose and
+ also in practice -- they may be implemented as separate protocols.
+
+ The more fundamental of the two is the direct data placement
+ facility. This is the means by which memory is exposed to the remote
+ peer in an appropriate fashion, and the means by which the peer may
+ access it, for instance, reading and writing.
+
+
+
+Romanow, et al. Informational [Page 11]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ The RDMA control functions are semantically layered atop direct data
+ placement. Included are operations that provide "control" features,
+ such as connection and termination, and the ordering of operations
+ and signaling their completions. A "send" facility is provided.
+
+ While the functions (and potentially protocols) are distinct,
+ historically both aspects taken together have been referred to as
+ "RDMA". The facilities of direct data placement are useful in and of
+ themselves, and may be employed by other upper layer protocols to
+ facilitate data transfer. Therefore, it is often useful to refer to
+ DDP as the data placement functionality and RDMA as the control
+ aspect.
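+
+ A small set of type definitions can make the split concrete.  The
+ structures below are a conceptual sketch of the division of labor
+ only; they are invented for illustration and are not the wire formats
+ of any DDP or RDMA protocol specification.
+
+    /* Conceptual sketch: DDP carries placement information, RDMA
+     * layers control semantics on top.  Illustrative types only. */
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* DDP: what is needed to place a payload directly in a buffer. */
+    struct ddp_placement {
+        uint32_t steering_tag;   /* names an advertised buffer          */
+        uint64_t target_offset;  /* where in that buffer the data lands */
+        uint32_t length;         /* how much data is being placed       */
+    };
+
+    /* RDMA: control semantics layered atop DDP placement. */
+    enum rdma_operation {
+        RDMA_WRITE,         /* place data in a peer buffer            */
+        RDMA_READ_REQUEST,  /* ask the peer to place its data in ours */
+        RDMA_SEND,          /* untagged message into a posted buffer  */
+    };
+
+    struct rdma_message {
+        enum rdma_operation  op;
+        struct ddp_placement placement;      /* reuses the DDP facility */
+        int                  signal_completion;  /* ordering/completion */
+    };
+
+    int main(void)
+    {
+        struct rdma_message m = { RDMA_WRITE, { 0x1234, 0, 4096 }, 1 };
+        printf("op=%d stag=0x%x len=%u completion=%d\n",
+               m.op, (unsigned)m.placement.steering_tag,
+               (unsigned)m.placement.length, m.signal_completion);
+        return 0;
+    }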
+
+ [BT05] develops an architecture for DDP and RDMA atop the Internet
+ Protocol Suite, and is a companion document to this problem
+ statement.
+
+6. Conclusions
+
+ This Problem Statement concludes that an IP-based, general solution
+ for reducing processing overhead in end-hosts is desirable.
+
+ It has shown that high overhead of the processing of network data
+ leads to end-host bottlenecks. These bottlenecks are in large part
+ attributable to the copying of data. The bus bandwidth of machines
+ has historically been limited, and the bandwidth of high-speed
+ interconnects taxes it heavily.
+
+ An architectural solution to alleviate these bottlenecks best
+ addresses the issue.  Further, the high speed of today's
+ interconnects and the deployment of these hosts on Internet
+ Protocol-based networks leads to the desirability of layering such a
+ solution on the Internet Protocol Suite. The architecture described
+ in [BT05] is such a proposal.
+
+7. Security Considerations
+
+ Solutions to the problem of reducing copying overhead in high
+ bandwidth transfers may introduce new security concerns. Any
+ proposed solution must be analyzed for security vulnerabilities and
+ any such vulnerabilities addressed. Potential security weaknesses --
+ due to resource issues that might lead to denial-of-service attacks,
+ overwrites and other concurrent operations, the ordering of
+ completions as required by the RDMA protocol, the granularity of
+ transfer, and any other identified vulnerabilities -- need to be
+ examined, described, and an adequate resolution to them found.
+
+
+
+
+
+Romanow, et al. Informational [Page 12]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Layered atop Internet transport protocols, the RDMA protocols will
+ gain leverage from and must permit integration with Internet security
+ standards, such as IPsec and TLS [IPSEC, TLS]. However, there may be
+ implementation ramifications for certain security approaches with
+ respect to RDMA, due to its copy avoidance.
+
+ IPsec, operating to secure the connection on a packet-by-packet
+ basis, seems to be a natural fit to securing RDMA placement, which
+ operates in conjunction with transport. Because RDMA enables an
+ implementation to avoid buffering, it is preferable to perform all
+ applicable security protection prior to processing of each segment by
+ the transport and RDMA layers. Such a layering enables the most
+ efficient secure RDMA implementation.
+
+ The TLS record protocol, on the other hand, is layered on top of
+ reliable transports and cannot provide such security assurance until
+ an entire record is available, which may require the buffering and/or
+ assembly of several distinct messages prior to TLS processing. This
+ defers RDMA processing and introduces overheads that RDMA is designed
+ to avoid. Therefore, TLS is viewed as potentially a less natural fit
+ for protecting the RDMA protocols.
+
+ It is necessary to guarantee properties such as confidentiality,
+ integrity, and authentication on an RDMA communications channel.
+ However, these properties cannot defend against all attacks from
+ properly authenticated peers, which might be malicious, compromised,
+ or buggy. Therefore, the RDMA design must address protection against
+ such attacks. For example, an RDMA peer should not be able to read
+ or write memory regions without prior consent.
+
+ Further, it must not be possible to evade memory consistency checks
+ at the recipient. The RDMA design must allow the recipient to rely
+ on its consistent memory contents by explicitly controlling peer
+ access to memory regions at appropriate times.
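+
+ The kind of control this implies can be sketched as a per-region
+ access check that is consulted before any placement occurs.  The
+ names and flags below are hypothetical, chosen only to illustrate the
+ requirement that peer access be explicitly granted, bounded, and
+ revocable; they are not drawn from any RDMA protocol definition.
+
+    /* Hypothetical per-region access control: a peer operation is
+     * honored only while the application has enabled it, only for
+     * the authorized direction, and only within the region. */
+    #include <stdio.h>
+
+    enum { PEER_READ = 1, PEER_WRITE = 2 };
+
+    struct protected_region {
+        unsigned long len;
+        int           peer_access;  /* 0 until the application consents */
+    };
+
+    static int access_allowed(const struct protected_region *r,
+                              unsigned long off, unsigned long len, int op)
+    {
+        if (!(r->peer_access & op)) return 0;  /* not authorized     */
+        if (off + len > r->len)     return 0;  /* outside the region */
+        return 1;
+    }
+
+    int main(void)
+    {
+        struct protected_region r = { 4096, 0 };
+
+        printf("before consent: %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        r.peer_access = PEER_WRITE;            /* application enables */
+        printf("after consent:  %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        r.peer_access = 0;                     /* and later revokes   */
+        printf("after revoke:   %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        return 0;
+    }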
+
+ Peer connections that do not pass authentication and authorization
+ checks by upper layers must not be permitted to begin processing in
+ RDMA mode with an inappropriate endpoint. Once associated, peer
+ accesses to memory regions must be authenticated and made subject to
+ authorization checks in the context of the association and connection
+ on which they are to be performed, prior to any transfer operation or
+ data being accessed.
+
+ The RDMA protocols must ensure that these region protections be under
+ strict application control. Remote access to local memory by a
+ network peer is particularly important in the Internet context, where
+ such access can be exported globally.
+
+
+
+
+Romanow, et al. Informational [Page 13]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+8. Terminology
+
+ This section contains general terminology definitions for this
+ document and for Remote Direct Memory Access in general.
+
+ Remote Direct Memory Access (RDMA)
+ A method of accessing memory on a remote system in which the
+ local system specifies the location of the data to be
+ transferred.
+
+ RDMA Protocol
+ A protocol that supports RDMA Operations to transfer data
+ between systems.
+
+ Fabric
+ The collection of links, switches, and routers that connect a
+ set of systems.
+
+ Storage Area Network (SAN)
+ A network where disks, tapes, and other storage devices are made
+ available to one or more end-systems via a fabric.
+
+ System Area Network
+ A network where clustered systems share services, such as
+ storage and interprocess communication, via a fabric.
+
+ Fibre Channel (FC)
+ An ANSI standard link layer with associated protocols, typically
+ used to implement Storage Area Networks. [FIBRE]
+
+ Virtual Interface Architecture (VI, VIA)
+ An RDMA interface definition developed by an industry group and
+ implemented with a variety of differing wire protocols. [VI]
+
+ Infiniband (IB)
+ An RDMA interface, protocol suite and link layer specification
+ defined by an industry trade association. [IB]
+
+9. Acknowledgements
+
+ Jeff Chase generously provided many useful insights and information.
+ Thanks to Jim Pinkerton for many helpful discussions.
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 14]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+10. Informative References
+
+ [ATM] The ATM Forum, "Asynchronous Transfer Mode Physical Layer
+ Specification" af-phy-0015.000, etc. available from
+ http://www.atmforum.com/standards/approved.html.
+
+ [BCF+95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C.
+ L. Seitz, J. N. Seizovic, and W. Su. "Myrinet - A
+ gigabit-per-second local-area network", IEEE Micro,
+ February 1995.
+
+ [BJM+96] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J.
+ Wilkes, "An implementation of the Hamlyn send-managed
+ interface architecture", in Proceedings of the Second
+ Symposium on Operating Systems Design and Implementation,
+ USENIX Assoc., October 1996.
+
+ [BLA+94] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W.
+ Felten, "A virtual memory mapped network interface for the
+ SHRIMP multicomputer", in Proceedings of the 21st Annual
+ Symposium on Computer Architecture, April 1994, pp. 142-
+ 153.
+
+ [Br99] J. C. Brustoloni, "Interoperation of copy avoidance in
+ network and file I/O", Proceedings of IEEE Infocom, 1999,
+ pp. 534-542.
+
+ [BS96] J. C. Brustoloni, P. Steenkiste, "Effects of buffering
+ semantics on I/O performance", Proceedings OSDI'96,
+ USENIX, Seattle, WA October 1996, pp. 277-291.
+
+ [BT05] Bailey, S. and T. Talpey, "The Architecture of Direct Data
+ Placement (DDP) And Remote Direct Memory Access (RDMA) On
+ Internet Protocols", RFC 4296, December 2005.
+
+ [CFF+94] C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
+ Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde,
+ "High-performance TCP/IP and UDP/IP networking in DEC
+ OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE
+ Symposium on High Performance Distributed Computing,
+ August 1994, pp. 36-42.
+
+ [CGY01] J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system
+ optimizations for high-speed TCP", IEEE Communications
+ Magazine, Volume: 39, Issue: 4 , April 2001, pp 68-74.
+ http://www.cs.duke.edu/ari/publications/end-
+ system.{ps,pdf}.
+
+
+
+
+Romanow, et al. Informational [Page 15]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [Ch96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX
+ 1996 Annual Technical Conference, San Diego, CA, January
+ 1996.
+
+ [Ch02] Jeffrey Chase, Personal communication.
+
+ [CJRS89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An
+ analysis of TCP processing overhead", IEEE Communications
+ Magazine, volume: 27, Issue: 6, June 1989, pp 23-29.
+
+ [CT90] D. D. Clark, D. Tennenhouse, "Architectural considerations
+ for a new generation of protocols", Proceedings of the ACM
+ SIGCOMM Conference, 1990.
+
+ [DAPP93] P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson,
+ "Network subsystem design", IEEE Network, July 1993, pp.
+ 8-17.
+
+ [DP93] P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth
+ cross-domain transfer facility", Proceedings of the 14th
+ ACM Symposium of Operating Systems Principles, December
+ 1993.
+
+ [DWB+93] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards,
+ J. Lumley, "Afterburner: architectural support for high-
+ performance protocols", Technical Report, HP Laboratories
+ Bristol, HPL-93-46, July 1993.
+
+ [EBBV95] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A
+ user-level network interface for parallel and distributed
+ computing", Proc. of the 15th ACM Symposium on Operating
+ Systems Principles, Copper Mountain, Colorado, December
+ 3-6, 1995.
+
+ [FDDI] International Standards Organization, "Fibre Distributed
+ Data Interface", ISO/IEC 9314, committee drafts available
+ from http://www.iso.org.
+
+ [FGM+99] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+ Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
+ Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
+
+ [FIBRE] ANSI Technical Committee T10, "Fibre Channel Protocol
+ (FCP)" (and as revised and updated), ANSI X3.269:1996
+ [R2001], committee draft available from
+ http://www.t10.org/drafts.htm#FibreChannel
+
+
+
+
+
+Romanow, et al. Informational [Page 16]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [HP97] J. L. Hennessy, D. A. Patterson, Computer Organization and
+ Design, 2nd Edition, San Francisco: Morgan Kaufmann
+ Publishers, 1997.
+
+ [IB] InfiniBand Trade Association, "InfiniBand Architecture
+ Specification, Volumes 1 and 2", Release 1.1, November
+ 2002, available from http://www.infinibandta.org/specs.
+
+ [IPSEC] Kent, S. and R. Atkinson, "Security Architecture for the
+ Internet Protocol", RFC 2401, November 1998.
+
+ [KP96] J. Kay, J. Pasquale, "Profiling and reducing processing
+ overheads in TCP/IP", IEEE/ACM Transactions on Networking,
+ Vol 4, No. 6, pp.817-828, December 1996.
+
+ [KSZ95] K. Kleinpaste, P. Steenkiste, B. Zill, "Software support
+ for outboard buffering and checksumming", SIGCOMM'95.
+
+ [Ma02] K. Magoutis, "Design and Implementation of a Direct Access
+ File System (DAFS) Kernel Server for FreeBSD", in
+ Proceedings of USENIX BSDCon 2002 Conference, San
+ Francisco, CA, February 11-14, 2002.
+
+ [MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J.
+ S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E.
+ Gabber, "Structure and Performance of the Direct Access
+ File System (DAFS)", in Proceedings of the 2002 USENIX
+ Annual Technical Conference, Monterey, CA, June 9-14,
+ 2002.
+
+ [Mc95] J. D. McCalpin, "A Survey of memory bandwidth and machine
+ balance in current high performance computers", IEEE TCCA
+ Newsletter, December 1995.
+
+ [PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K.
+ Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for
+ intelligent RAM: IRAM", IEEE Micro, April 1997.
+
+ [PDZ99] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified
+ I/O buffering and caching system", Proc. of the 3rd
+ Symposium on Operating Systems Design and Implementation,
+ New Orleans, LA, February 1999.
+
+ [Pi01] J. Pinkerton, "Winsock Direct: The Value of System Area
+ Networks", May 2001, available from
+ http://www.microsoft.com/windows2000/techinfo/
+ howitworks/communications/winsock.asp.
+
+
+
+
+Romanow, et al. Informational [Page 17]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [Po81] Postel, J., "Transmission Control Protocol", STD 7, RFC
+ 793, September 1981.
+
+ [QUAD] Quadrics Ltd., Quadrics QSNet product information,
+ available from
+ http://www.quadrics.com/website/pages/02qsn.html.
+
+ [SDP] InfiniBand Trade Association, "Sockets Direct Protocol
+ v1.0", Annex A of InfiniBand Architecture Specification
+ Volume 1, Release 1.1, November 2002, available from
+ http://www.infinibandta.org/specs.
+
+ [SRVNET] R. Horst, "TNet: A reliable system area network", IEEE
+ Micro, pp. 37-45, February 1995.
+
+ [STREAM] J. D. McCalpin, The STREAM Benchmark Reference Information,
+ http://www.cs.virginia.edu/stream/.
+
+ [TK95] M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O
+ framework for UNIX", Technical Report, SMLI TR-95-39, May
+ 1995.
+
+ [TLS] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
+ RFC 2246, January 1999.
+
+ [VI] D. Cameron and G. Regnier, "The Virtual Interface
+ Architecture", ISBN 0971288704, Intel Press, April 2002,
+ more info at http://www.intel.com/intelpress/via/.
+
+ [Wa97] J. R. Walsh, "DART: Fast application-level networking via
+ data-copy avoidance", IEEE Network, July/August 1997, pp.
+ 28-38.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 18]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Authors' Addresses
+
+ Stephen Bailey
+ Sandburst Corporation
+ 600 Federal Street
+ Andover, MA 01810 USA
+
+ Phone: +1 978 689 1614
+ EMail: steph@sandburst.com
+
+
+ Jeffrey C. Mogul
+ HP Labs
+ Hewlett-Packard Company
+ 1501 Page Mill Road, MS 1117
+ Palo Alto, CA 94304 USA
+
+ Phone: +1 650 857 2206 (EMail preferred)
+ EMail: JeffMogul@acm.org
+
+
+ Allyn Romanow
+ Cisco Systems, Inc.
+ 170 W. Tasman Drive
+ San Jose, CA 95134 USA
+
+ Phone: +1 408 525 8836
+ EMail: allyn@cisco.com
+
+
+ Tom Talpey
+ Network Appliance
+ 1601 Trapelo Road
+ Waltham, MA 02451 USA
+
+ Phone: +1 781 768 5329
+ EMail: thomas.talpey@netapp.com
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 19]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 20]
+