diff --git a/doc/rfc/rfc4297.txt b/doc/rfc/rfc4297.txt
new file mode 100644
index 0000000..3ba5312
--- /dev/null
+++ b/doc/rfc/rfc4297.txt
@@ -0,0 +1,1123 @@
+
+
+
+
+
+
+Network Working Group A. Romanow
+Request for Comments: 4297 Cisco
+Category: Informational J. Mogul
+ HP
+ T. Talpey
+ NetApp
+ S. Bailey
+ Sandburst
+ December 2005
+
+
+ Remote Direct Memory Access (RDMA) over IP Problem Statement
+
+Status of This Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ Overhead due to the movement of user data in the end-system network
+ I/O processing path at high speeds is significant, and has limited
+ the use of Internet protocols in interconnection networks, and the
+ Internet itself -- especially where high bandwidth, low latency,
+ and/or low overhead are required by the hosted application.
+
+ This document examines this overhead, and addresses an architectural,
+ IP-based "copy avoidance" solution for its elimination, by enabling
+ Remote Direct Memory Access (RDMA).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 1]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Table of Contents
+
+ 1. Introduction ....................................................2
+ 2. The High Cost of Data Movement Operations in Network I/O ........4
+ 2.1. Copy avoidance improves processing overhead. ...............5
+ 3. Memory bandwidth is the root cause of the problem. ..............6
+ 4. High copy overhead is problematic for many key Internet
+ applications. ...................................................8
+ 5. Copy Avoidance Techniques ......................................10
+ 5.1. A Conceptual Framework: DDP and RDMA ......................11
+ 6. Conclusions ....................................................12
+ 7. Security Considerations ........................................12
+ 8. Terminology ....................................................14
+ 9. Acknowledgements ...............................................14
+ 10. Informative References ........................................15
+
+1. Introduction
+
+ This document considers the problem of high host processing overhead
+ associated with the movement of user data to and from the network
+ interface under high speed conditions. This problem is often
+ referred to as the "I/O bottleneck" [CT90]. More specifically, the
+ source of high overhead that is of interest here is data movement
+ operations, i.e., copying. The throughput of a system may therefore
+ be limited by the overhead of this copying. This issue is not to be
+ confused with TCP offload, which is not addressed here. High speed
+ refers to conditions where the network link speed is high, relative
+ to the bandwidths of the host CPU and memory. With today's computer
+ systems, one Gigabit per second (Gbits/s) and over is considered high
+ speed.
+
+ High costs associated with copying are an issue primarily for large
+ scale systems. Although smaller systems such as rack-mounted PCs and
+ small workstations would benefit from a reduction in copying
+ overhead, the benefit to smaller machines will be primarily in the
+ next few years as they scale the amount of bandwidth they handle.
+ Today, it is large system machines with high bandwidth feeds, usually
+ multiprocessors and clusters, that are adversely affected by copying
+ overhead. Examples of such machines include all varieties of
+ servers: database servers, storage servers, application servers for
+ transaction processing, for e-commerce, and web serving, content
+ distribution, video distribution, backups, data mining and decision
+ support, and scientific computing.
+
+ Note that such servers almost exclusively service many concurrent
+ sessions (transport connections), which, in aggregate, are
+ responsible for > 1 Gbits/s of communication. Nonetheless, the cost
+
+
+
+
+Romanow, et al. Informational [Page 2]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ of copying overhead for a particular load is the same whether from
+ few or many sessions.
+
+ The I/O bottleneck, and the role of data movement operations, have
+ been widely studied in research and industry over the last
+ approximately 14 years, and we draw freely on these results.
+ Historically, the I/O bottleneck has received attention whenever new
+ networking technology has substantially increased line rates: 100
+ Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data
+ Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], 1
+ Gbits/s Ethernet. In earlier speed transitions, the availability of
+ memory bandwidth allowed the I/O bottleneck issue to be deferred.
+ Now, however, this is no longer the case.  While the I/O problem is
+ significant at 1 Gbits/s, it is the introduction of 10 Gbits/s
+ Ethernet which is motivating an upsurge of activity in industry and
+ research [IB, VI, CGY01, Ma02, MAF+02].
+
+ Because of high overhead of end-host processing in current
+ implementations, the TCP/IP protocol stack is not used for high speed
+ transfer. Instead, special purpose network fabrics, using a
+ technology generally known as Remote Direct Memory Access (RDMA),
+ have been developed and are widely used. RDMA is a set of mechanisms
+ that allow the network adapter, under control of the application, to
+ steer data directly into and out of application buffers. Examples of
+ such interconnection fabrics include Fibre Channel [FIBRE] for block
+ storage transfer, Virtual Interface Architecture [VI] for database
+ clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and
+ Quadrics [QUAD] for System Area Networks. These link level
+ technologies limit application scaling in both distance and size,
+ meaning that the number of nodes cannot be arbitrarily large.
+
+ This problem statement substantiates the claim that in network I/O
+ processing, high overhead results from data movement operations,
+ specifically copying; and that copy avoidance significantly decreases
+ this processing overhead. It describes when and why the high
+ processing overheads occur, explains why the overhead is problematic,
+ and points out which applications are most affected.
+
+ The document goes on to discuss why the problem is relevant to the
+ Internet and to Internet-based applications. Applications that
+ store, manage, and distribute the information of the Internet are
+ well suited to applying the copy avoidance solution. They will
+ benefit by avoiding high processing overheads, which removes limits
+ to the available scaling of tiered end-systems. Copy avoidance also
+ eliminates latency for these systems, which can further benefit
+ effective distributed processing.
+
+
+
+
+
+Romanow, et al. Informational [Page 3]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ In addition, this document introduces an architectural approach to
+ solving the problem, which is developed in detail in [BT05]. It also
+ discusses how the proposed technology may introduce security concerns
+ and how they should be addressed.
+
+ Finally, this document includes a Terminology section to aid as a
+ reference for several new terms introduced by RDMA.
+
+2. The High Cost of Data Movement Operations in Network I/O
+
+ A wealth of data from research and industry shows that copying is
+ responsible for substantial amounts of processing overhead. It
+ further shows that even in carefully implemented systems, eliminating
+ copies significantly reduces the overhead, as referenced below.
+
+ Clark et al. [CJRS89] in 1989 shows that TCP [Po81] overhead
+ processing is attributable to both operating system costs (such as
+ interrupts, context switches, process management, buffer management,
+ timer management) and the costs associated with processing individual
+ bytes (specifically, computing the checksum and moving data in
+ memory). They found that moving data in memory is the more important
+ of the costs, and their experiments show that memory bandwidth is the
+ greatest source of limitation. In the data presented [CJRS89], 64%
+ of the measured microsecond overhead was attributable to data
+ touching operations, and 48% was accounted for by copying. The
+ system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet
+ packets.
+
+ In a well-implemented system, copying can occur between the network
+ interface and the kernel, and between the kernel and application
+ buffers; there are two copies, each of which requires two memory bus
+ crossings, a read and a write.  Although in certain circumstances it
+ is possible to do better, usually two copies are required on receive.
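+
+ To make the copies and their bus crossings concrete, the following
+ user-space sketch models the two-copy receive path described above.
+ It is illustrative only: the buffer names, the single fixed packet
+ size, and the use of memcpy() in place of real driver and kernel code
+ are assumptions of the sketch, not measurements or interfaces from
+ the studies cited here.
+
+    /* Illustrative model of a two-copy receive path. */
+    #include <stdio.h>
+    #include <string.h>
+
+    #define PKT_SIZE 1460
+
+    static char nic_buffer[PKT_SIZE];     /* data deposited by the NIC  */
+    static char kernel_buffer[PKT_SIZE];  /* kernel network buffer      */
+    static char app_buffer[PKT_SIZE];     /* application receive buffer */
+
+    int main(void)
+    {
+        memset(nic_buffer, 'x', PKT_SIZE);  /* stand-in for arriving data */
+
+        /* Copy 1: network interface buffer -> kernel buffer.  The CPU
+         * reads each byte and writes it back: two bus crossings. */
+        memcpy(kernel_buffer, nic_buffer, PKT_SIZE);
+
+        /* Copy 2: kernel buffer -> application buffer, e.g., at read().
+         * Again one read and one write per byte: two more crossings. */
+        memcpy(app_buffer, kernel_buffer, PKT_SIZE);
+
+        printf("%d bytes delivered; each byte crossed the memory bus"
+               " 4 times in copies alone\n", PKT_SIZE);
+        return 0;
+    }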
+
+ Subsequent work has consistently shown the same phenomenon as the
+ earlier Clark study. A number of studies report results that data-
+ touching operations, checksumming and data movement, dominate the
+ processing costs for messages longer than 128 Bytes [BS96, CGY01,
+ Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet
+ overheads dominate [KP96, CGY01].
+
+ The percentage of overhead due to data-touching operations increases
+ with packet size, since time spent on per-byte operations scales
+ linearly with message size [KP96]. For example, Chu [Ch96] reported
+ substantial per-byte latency costs as a percentage of total
+ networking software costs for an MTU size packet on a SPARCstation/20
+
+
+
+
+
+Romanow, et al. Informational [Page 4]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ running memory-to-memory TCP tests over networks with 3 different MTU
+ sizes. The percentage of total software costs attributable to
+ per-byte operations were:
+
+ 1500 Byte Ethernet 18-25%
+ 4352 Byte FDDI 35-50%
+ 9180 Byte ATM 55-65%
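+
+ A simple cost model makes this trend concrete: if per-packet
+ processing costs a fixed amount p and data-touching operations cost b
+ per byte, the data-touching share of an n-byte packet is
+ b*n / (p + b*n), which grows with n.  The sketch below evaluates that
+ share at the three MTU sizes above; the two constants are arbitrary
+ values chosen only to illustrate the shape of the curve, not
+ measurements from [Ch96] or [KP96].
+
+    /* Illustrative per-packet vs. per-byte cost model; the constants
+     * are arbitrary and not taken from the cited studies. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const double per_packet_cost = 80.0;  /* assumed fixed cost, us    */
+        const double per_byte_cost   = 0.02;  /* assumed per-byte cost, us */
+        const int mtu[] = { 1500, 4352, 9180 };
+
+        for (int i = 0; i < 3; i++) {
+            double touching = per_byte_cost * mtu[i];
+            double total    = per_packet_cost + touching;
+            printf("MTU %5d: data-touching share = %4.1f%%\n",
+                   mtu[i], 100.0 * touching / total);
+        }
+        return 0;
+    }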
+
+ Although many studies report results for data-touching operations,
+ including checksumming and data movement together, much work has
+ focused just on copying [BS96, Br99, Ch96, TK95]. For example,
+ [KP96] reports results that separate processing times for checksum
+ from data movement operations. For the 1500 Byte Ethernet size, 20%
+ of total processing overhead time is attributable to copying. The
+ study used 2 DECstations 5000/200 connected by an FDDI network. (In
+ this study, checksum accounts for 30% of the processing time.)
+
+2.1. Copy avoidance improves processing overhead.
+
+ A number of studies show that eliminating copies substantially
+ reduces overhead. For example, results from copy-avoidance in the
+ IO-Lite system [PDZ99], which aimed at improving web server
+ performance, show a throughput increase of 43% over an optimized web
+ server, and 137% improvement over an Apache server. The system was
+ implemented in a 4.4BSD-derived UNIX kernel, and the experiments used
+ a server system based on a 333MHz Pentium II PC connected to a
+ switched 100 Mbits/s Fast Ethernet.
+
+ There are many other examples where elimination of copying using a
+ variety of different approaches showed significant improvement in
+ system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We
+ will discuss the results of one of these studies in detail in order
+ to clarify the significant degree of improvement produced by copy
+ avoidance [Ch02].
+
+ Recent work by Chase et al. [CGY01], measuring CPU utilization, shows
+ that avoiding copies reduces CPU time spent on data access from 24%
+ to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation
+ XP1000 and a Myrinet adapter [BCF+95]. This is an absolute
+ improvement of 9% due to copy avoidance.
+
+ The total CPU utilization was 35%, with data access accounting for
+ 24%. Thus, the relative importance of reducing copies is 26%. At
+ 370 Mbits/s, the system is not very heavily loaded. The relative
+ improvement in achievable bandwidth is 34%. This is the improvement
+ we would see if copy avoidance were added when the machine was
+ saturated by network I/O.
+
+
+
+
+Romanow, et al. Informational [Page 5]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Note that improvement from the optimization becomes more important if
+ the overhead it targets is a larger share of the total cost. This is
+ what happens if other sources of overhead, such as checksumming, are
+ eliminated. In [CGY01], after removing checksum overhead, copy
+ avoidance reduces CPU utilization from 26% to 10%. This is a 16%
+ absolute reduction, a 61% relative reduction, and a 160% relative
+ improvement in achievable bandwidth.
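+
+ The percentages above follow directly from the published utilization
+ numbers.  The sketch below simply restates that arithmetic; it
+ introduces no data beyond the figures already quoted from [CGY01],
+ and the small differences from the quoted values are rounding.
+
+    /* Re-derives the reductions and bandwidth gains quoted above
+     * from the [CGY01] utilization figures. */
+    #include <stdio.h>
+
+    static void report(const char *label, double before, double after)
+    {
+        double absolute = before - after;
+        printf("%s:\n", label);
+        printf("  absolute CPU reduction:        %.0f%%\n", absolute);
+        printf("  relative CPU reduction:        %.1f%%\n",
+               100.0 * absolute / before);
+        /* Achievable bandwidth scales inversely with CPU cost per byte. */
+        printf("  achievable bandwidth increase: %.1f%%\n",
+               100.0 * (before / after - 1.0));
+    }
+
+    int main(void)
+    {
+        /* Checksumming retained: total CPU 35%, falling to 26% when
+         * data access drops from 24% to 15%. */
+        report("Copy avoidance, checksum retained", 35.0, 26.0);
+
+        /* Checksum overhead removed as well: 26% falling to 10%. */
+        report("Copy avoidance, checksum removed", 26.0, 10.0);
+        return 0;
+    }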
+
+ In fact, today's network interface hardware commonly offloads the
+ checksum, which removes the other source of per-byte overhead.  It
+ also coalesces interrupts to reduce per-packet costs.  Thus, today
+ copying costs account for a relatively larger part of CPU utilization
+ than previously, and therefore relatively more benefit is to be
+ gained in reducing them. (Of course this argument would be specious
+ if the amount of overhead were insignificant, but it has been shown
+ to be substantial. [BS96, Br99, Ch96, KP96, TK95])
+
+3. Memory bandwidth is the root cause of the problem.
+
+ Data movement operations are expensive because memory bandwidth is
+ scarce relative to network bandwidth and CPU bandwidth [PAC+97].
+ This trend existed in the past and is expected to continue into the
+ future [HP97, STREAM], especially in large multiprocessor systems.
+
+ With copies crossing the bus twice per copy, network processing
+ overhead is high whenever network bandwidth is large in comparison to
+ CPU and memory bandwidths. Generally, with today's end-systems, the
+ effects are observable at network speeds over 1 Gbits/s. In fact,
+ with multiple bus crossings it is possible to see the bus bandwidth
+ being the limiting factor for throughput. This prevents such an
+ end-system from simultaneously achieving full network bandwidth and
+ full application performance.
+
+ A common question is whether an increase in CPU processing power
+ alleviates the problem of high processing costs of network I/O. The
+ answer is no, it is the memory bandwidth that is the issue. Faster
+ CPUs do not help if the CPU spends most of its time waiting for
+ memory [CGY01].
+
+ The widening gap between microprocessor performance and memory
+ performance has long been a widely recognized and well-understood
+ problem [PAC+97]. Hennessy [HP97] shows microprocessor performance
+ grew from 1980-1998 at 60% per year, while the access time to DRAM
+ improved at 10% per year, giving rise to an increasing "processor-
+ memory performance gap".
+
+
+
+
+
+
+Romanow, et al. Informational [Page 6]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Another source of relevant data is the STREAM Benchmark Reference
+ Information website, which provides information on the STREAM
+ benchmark [STREAM]. The benchmark is a simple synthetic benchmark
+ program that measures sustainable memory bandwidth (in MBytes/s) and
+ the corresponding computation rate for simple vector kernels measured
+ in MFLOPS. The website tracks information on sustainable memory
+ bandwidth for hundreds of machines and all major vendors.
+
+ Results show measured system performance statistics. Processing
+ performance from 1985-2001 increased at 50% per year on average, and
+ sustainable memory bandwidth from 1975 to 2001 increased at 35% per
+ year, on average, over all the systems measured. A similar 15% per
+ year lead of processing bandwidth over memory bandwidth shows up in
+ another statistic, machine balance [Mc95], a measure of the relative
+ rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory
+ ops/cycle) [STREAM].
+
+ Network bandwidth has been increasing about 10-fold roughly every 8
+ years, which is a 40% per year growth rate.
+
+ A typical example illustrates that the memory bandwidth compares
+ unfavorably with link speed. The STREAM benchmark shows that a
+ modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will
+ move the data 3 times in doing a receive operation: once for the
+ network interface to deposit the data in memory, and twice for the
+ CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning
+ one read or one write, the machine could handle approximately 2.67
+ Gbits/s of network bandwidth, one third the copy bandwidth. But this
+ assumes 100% utilization, which is not possible, and more importantly
+ the machine would be totally consumed! (A rule of thumb for
+ databases is that 20% of the machine should be required to service
+ I/O, leaving 80% for the database application. And, the less, the
+ better.)
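+
+ The 2.67 Gbits/s figure is simply the stated memory bandwidth divided
+ by the number of times each received byte transits memory; the sketch
+ below re-derives it from the assumptions given above and adds nothing
+ new.
+
+    /* Re-derives the best-case receive rate for the example above:
+     * three memory transits per byte against 1 GByte/s of memory
+     * bandwidth. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const double mem_bw_gbits = 1.0 * 8.0;  /* 1 GByte/s = 8 Gbits/s */
+        const int    transits     = 3;  /* NIC deposit + copy read + write */
+
+        printf("Best-case receive rate: %.2f Gbits/s"
+               " (at 100%% memory utilization)\n",
+               mem_bw_gbits / transits);
+        return 0;
+    }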
+
+ In 2001, 1 Gbits/s links were common. An application server may
+ typically have two 1 Gbits/s connections: one connection backend to a
+ storage server and one front-end, say for serving HTTP [FGM+99].
+ Thus, the communications could use 2 Gbits/s. In our typical
+ example, the machine could handle 2.7 Gbits/s at its theoretical
+ maximum while doing nothing else. This means that the machine
+ basically could not keep up with the communication demands in 2001;
+ with the relative growth trends, the situation only gets worse.
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 7]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+4. High copy overhead is problematic for many key Internet
+ applications.
+
+ If a significant portion of resources on an application machine is
+ consumed in network I/O rather than in application processing, it
+ becomes difficult for the application to scale, i.e., to handle more
+ clients or to offer more services.
+
+ Several years ago the most affected applications were streaming
+ multimedia, parallel file systems, and supercomputing on clusters
+ [BS96]. In addition, today the applications that suffer from copying
+ overhead are more central in Internet computing -- they store,
+ manage, and distribute the information of the Internet and the
+ enterprise. They include database applications doing transaction
+ processing, e-commerce, web serving, decision support, content
+ distribution, video distribution, and backups. Clusters are
+ typically used for this category of application, since they have
+ advantages of availability and scalability.
+
+ Today these applications, which provide and manage Internet and
+ corporate information, are typically run in data centers that are
+ organized into three logical tiers. One tier is typically a set of
+ web servers connecting to the WAN. The second tier is a set of
+ application servers that run the specific applications usually on
+ more powerful machines, and the third tier is backend databases.
+ Physically, the first two tiers -- web server and application server
+ -- are usually combined [Pi01]. For example, an e-commerce server
+ communicates with a database server and with a customer site, or a
+ content distribution server connects to a server farm, or an OLTP
+ server connects to a database and a customer site.
+
+ When network I/O uses too much memory bandwidth, performance on
+ network paths between tiers can suffer. (There might also be
+ performance issues on Storage Area Network paths used either by the
+ database tier or the application tier.) The high overhead from
+ network-related memory copies diverts system resources from other
+ application processing. It also can create bottlenecks that limit
+ total system performance.
+
+ There is high motivation to maximize the processing capacity of each
+ CPU because scaling by adding CPUs, one way or another, has
+ drawbacks. For example, adding CPUs to a multiprocessor will not
+ necessarily help because a multiprocessor improves performance only
+ when the memory bus has additional bandwidth to spare. Clustering
+ can add additional complexity to handling the applications.
+
+ In order to scale a cluster or multiprocessor system, one must
+ proportionately scale the interconnect bandwidth. Interconnect
+
+
+
+Romanow, et al. Informational [Page 8]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ bandwidth governs the performance of communication-intensive parallel
+ applications; if this (often expressed in terms of "bisection
+ bandwidth") is too low, adding additional processors cannot improve
+ system throughput. Interconnect latency can also limit the
+ performance of applications that frequently share data between
+ processors.
+
+ So, excessive overheads on network paths in a "scalable" system can
+ both require the use of more processors than optimal and reduce the
+ marginal utility of those additional processors.
+
+ Copy avoidance scales a machine upwards by removing at least two-
+ thirds of the bus bandwidth load from the "very best" 1-copy (on
+ receive) implementations, and at least 80% of the bandwidth overhead
+ from the 2-copy implementations.
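+
+ The two-thirds and 80% figures follow from counting bus crossings per
+ received byte under the model used earlier in this document: one
+ crossing for the network interface to deposit the data, and two (a
+ read and a write) for each CPU copy.  The sketch below re-derives
+ them; the crossing counts are that model, not new measurements.
+
+    /* Bus crossings per received byte for the receive paths above:
+     * one crossing for the NIC deposit, two for each CPU copy. */
+    #include <stdio.h>
+
+    int main(void)
+    {
+        const int nic_deposit = 1;
+        const int per_copy    = 2;
+
+        int two_copy  = nic_deposit + 2 * per_copy;  /* 5 crossings */
+        int one_copy  = nic_deposit + 1 * per_copy;  /* 3 crossings */
+        int zero_copy = nic_deposit;                 /* 1 crossing  */
+
+        printf("2-copy: %d crossings; copy avoidance removes %.0f%%\n",
+               two_copy, 100.0 * (two_copy - zero_copy) / two_copy);
+        printf("1-copy: %d crossings; copy avoidance removes %.0f%%\n",
+               one_copy, 100.0 * (one_copy - zero_copy) / one_copy);
+        return 0;
+    }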
+
+ The removal of bus bandwidth requirements, in turn, removes
+ bottlenecks from the network processing path and increases the
+ throughput of the machine. On a machine with limited bus bandwidth,
+ the advantages of removing this load are immediately evident, as the
+ host can attain full network bandwidth. Even on a machine with bus
+ bandwidth adequate to sustain full network bandwidth, removal of bus
+ bandwidth load serves to increase the availability of the machine for
+ the processing of user applications, in some cases dramatically.
+
+ An example showing poor performance with copies and improved scaling
+ with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows
+ higher server throughput servicing more clients using a zero-copy
+ system. In an experiment designed to mimic real world web conditions
+ by simulating the effect of TCP WAN connections on the server, the
+ performance of 3 servers was compared. One server was Apache,
+ another was an optimized server called Flash, and the third was the
+ Flash server running IO-Lite, called Flash-Lite with zero copy. The
+ measurement was of throughput in requests/second as a function of the
+ number of slow background clients that could be served. As the table
+ shows, Flash-Lite has better throughput, especially as the number of
+ clients increases.
+
+    #Clients      Apache       Flash    Flash-Lite
+                 (reqs/s)    (reqs/s)    (reqs/s)
+    --------     --------    --------   ----------
+
+        0           520         610         890
+       16           390         490         890
+       32           360         490         850
+       64           360         490         890
+      128           310         450         880
+      256           310         440         820
+
+
+
+Romanow, et al. Informational [Page 9]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Traditional Web servers (which mostly send data and can keep most of
+ their content in the file cache) are not the worst case for copy
+ overhead. Web proxies (which often receive as much data as they
+ send) and complex Web servers based on System Area Networks or
+ multi-tier systems will suffer more from copy overheads than in the
+ example above.
+
+5. Copy Avoidance Techniques
+
+ There have been extensive research investigation and industry
+ experience with two main alternative approaches to eliminating data
+ movement overhead, often along with improving other Operating System
+ processing costs. In one approach, hardware and/or software changes
+ within a single host reduce processing costs. In another approach,
+ memory-to-memory networking [MAF+02], the exchange of explicit data
+ placement information between hosts allows them to reduce processing
+ costs.
+
+ The single host approaches range from new hardware and software
+ architectures [KSZ95, Wa97, DWB+93] to new or modified software
+ systems [BS96, Ch96, TK95, DP93, PDZ99]. In the approach based on
+ using a networking protocol to exchange information, the network
+ adapter, under control of the application, places data directly into
+ and out of application buffers, reducing the need for data movement.
+ Commonly this approach is called RDMA, Remote Direct Memory Access.
+
+ As discussed below, research and industry experience has shown that
+ copy avoidance techniques within the receiver processing path alone
+ have proven to be problematic. The research special purpose host
+ adapter systems had good performance and can be seen as precursors
+ for the commercial RDMA-based adapters [KSZ95, DWB+93]. In software,
+ many implementations have successfully achieved zero-copy transmit,
+ but few have accomplished zero-copy receive. And those that have
+ done so make strict alignment and no-touch requirements on the
+ application, greatly reducing the portability and usefulness of the
+ implementation.
+
+ In contrast, experience has proven satisfactory with memory-to-memory
+ systems that permit RDMA; performance has been good and there have
+ not been system or networking difficulties. RDMA is a single
+ solution. Once implemented, it can be used with any OS and machine
+ architecture, and it does not need to be revised when either of these
+ are changed.
+
+ In early work, one goal of the software approaches was to show that
+ TCP could go faster with appropriate OS support [CJRS89, CFF+94].
+ While this goal was achieved, further investigation and experience
+ showed that, though possible to craft software solutions, specific
+
+
+
+Romanow, et al. Informational [Page 10]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ system optimizations have been complex, fragile, extremely
+ interdependent with other system parameters in complex ways, and
+ often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93,
+ KSZ95, PDZ99]. The network I/O system interacts with other aspects
+ of the Operating System such as machine architecture and file I/O,
+ and disk I/O [Br99, Ch96, DP93].
+
+ For example, the Solaris Zero-Copy TCP work [Ch96], which relies on
+ page remapping, shows that the results are highly interdependent with
+ other systems, such as the file system, and that the particular
+ optimizations are specific for particular architectures, meaning that
+ for each variation in architecture, optimizations must be re-crafted
+ [Ch96].
+
+ With RDMA, application I/O buffers are mapped directly, and the
+ authorized peer may access them without incurring additional
+ processing overhead.  When RDMA is implemented in hardware, arbitrary
+ data movement can be performed without involving the host CPU at all.
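+
+ The following toy, in-process model illustrates the direct placement
+ idea.  The function names, the steering-tag scheme, and the use of
+ memcpy() to stand in for adapter DMA are inventions of this sketch;
+ they are not a real RDMA API or wire protocol.
+
+    /* Toy model of direct data placement into a registered buffer. */
+    #include <stdio.h>
+    #include <string.h>
+
+    #define MAX_REGIONS 4
+
+    struct region { void *base; size_t len; int enabled; };
+    static struct region regions[MAX_REGIONS];
+
+    /* The application exposes a buffer and receives a steering tag. */
+    static int register_region(void *base, size_t len)
+    {
+        for (int stag = 0; stag < MAX_REGIONS; stag++) {
+            if (regions[stag].base == NULL) {
+                regions[stag] = (struct region){ base, len, 1 };
+                return stag;
+            }
+        }
+        return -1;
+    }
+
+    /* Models the adapter placing peer data directly into the
+     * registered application buffer -- no intermediate copy. */
+    static int rdma_write(int stag, size_t off, const void *src, size_t len)
+    {
+        if (stag < 0 || stag >= MAX_REGIONS || !regions[stag].enabled)
+            return -1;                         /* no consent: refuse    */
+        if (off + len > regions[stag].len)
+            return -1;                         /* out of bounds: refuse */
+        memcpy((char *)regions[stag].base + off, src, len);
+        return 0;
+    }
+
+    int main(void)
+    {
+        char app_buffer[64] = { 0 };
+        int stag = register_region(app_buffer, sizeof(app_buffer));
+
+        const char payload[] = "placed directly";
+        if (rdma_write(stag, 0, payload, sizeof(payload)) == 0)
+            printf("application buffer now holds: %s\n", app_buffer);
+        return 0;
+    }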
+
+ A number of research projects and industry products have been based
+ on the memory-to-memory approach to copy avoidance. These include
+ U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB],
+ Winsock Direct [Pi01]. Several memory-to-memory systems have been
+ widely used and have generally been found to be robust, to have good
+ performance, and to be relatively simple to implement. These include
+ VI [VI], Myrinet [BCF+95], Quadrics [QUAD], Compaq/Tandem Servernet
+ [SRVNET]. Networks based on these memory-to-memory architectures
+ have been used widely in scientific applications and in data centers
+ for block storage, file system access, and transaction processing.
+
+ By exporting direct memory access "across the wire", applications may
+ direct the network stack to manage all data directly from application
+ buffers.  A large and growing class of applications that takes
+ advantage of such capabilities has already emerged.  It includes all
+ the major databases, as well as network protocols such as Sockets
+ Direct [SDP].
+
+5.1. A Conceptual Framework: DDP and RDMA
+
+ An RDMA solution can be usefully viewed as being comprised of two
+ distinct components: "direct data placement (DDP)" and "remote direct
+ memory access (RDMA) semantics". They are distinct in purpose and
+ also in practice -- they may be implemented as separate protocols.
+
+ The more fundamental of the two is the direct data placement
+ facility. This is the means by which memory is exposed to the remote
+ peer in an appropriate fashion, and the means by which the peer may
+ access it, for instance, reading and writing.
+
+
+
+Romanow, et al. Informational [Page 11]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ The RDMA control functions are semantically layered atop direct data
+ placement. Included are operations that provide "control" features,
+ such as connection and termination, and the ordering of operations
+ and signaling their completions. A "send" facility is provided.
+
+ While the functions (and potentially protocols) are distinct,
+ historically both aspects taken together have been referred to as
+ "RDMA". The facilities of direct data placement are useful in and of
+ themselves, and may be employed by other upper layer protocols to
+ facilitate data transfer. Therefore, it is often useful to refer to
+ DDP as the data placement functionality and RDMA as the control
+ aspect.
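+
+ A small set of type definitions can make the split concrete.  The
+ structures below are a conceptual sketch of the division of labor
+ only; they are invented for illustration and are not the wire formats
+ of any DDP or RDMA protocol specification.
+
+    /* Conceptual sketch: DDP carries placement information, RDMA
+     * layers control semantics on top.  Illustrative types only. */
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* DDP: what is needed to place a payload directly in a buffer. */
+    struct ddp_placement {
+        uint32_t steering_tag;   /* names an advertised buffer          */
+        uint64_t target_offset;  /* where in that buffer the data lands */
+        uint32_t length;         /* how much data is being placed       */
+    };
+
+    /* RDMA: control semantics layered atop DDP placement. */
+    enum rdma_operation {
+        RDMA_WRITE,         /* place data in a peer buffer            */
+        RDMA_READ_REQUEST,  /* ask the peer to place its data in ours */
+        RDMA_SEND,          /* untagged message into a posted buffer  */
+    };
+
+    struct rdma_message {
+        enum rdma_operation  op;
+        struct ddp_placement placement;      /* reuses the DDP facility */
+        int                  signal_completion;  /* ordering/completion */
+    };
+
+    int main(void)
+    {
+        struct rdma_message m = { RDMA_WRITE, { 0x1234, 0, 4096 }, 1 };
+        printf("op=%d stag=0x%x len=%u completion=%d\n",
+               m.op, (unsigned)m.placement.steering_tag,
+               (unsigned)m.placement.length, m.signal_completion);
+        return 0;
+    }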
+
+ [BT05] develops an architecture for DDP and RDMA atop the Internet
+ Protocol Suite, and is a companion document to this problem
+ statement.
+
+6. Conclusions
+
+ This Problem Statement concludes that an IP-based, general solution
+ for reducing processing overhead in end-hosts is desirable.
+
+ It has shown that high overhead of the processing of network data
+ leads to end-host bottlenecks. These bottlenecks are in large part
+ attributable to the copying of data. The bus bandwidth of machines
+ has historically been limited, and the bandwidth of high-speed
+ interconnects taxes it heavily.
+
+ An architectural solution to alleviate these bottlenecks best
+ addresses the issue.  Further, the high speed of today's
+ interconnects and the deployment of these hosts on Internet
+ Protocol-based networks leads to the desirability of layering such a
+ solution on the Internet Protocol Suite. The architecture described
+ in [BT05] is such a proposal.
+
+7. Security Considerations
+
+ Solutions to the problem of reducing copying overhead in high
+ bandwidth transfers may introduce new security concerns. Any
+ proposed solution must be analyzed for security vulnerabilities and
+ any such vulnerabilities addressed. Potential security weaknesses --
+ due to resource issues that might lead to denial-of-service attacks,
+ overwrites and other concurrent operations, the ordering of
+ completions as required by the RDMA protocol, the granularity of
+ transfer, and any other identified vulnerabilities -- need to be
+ examined, described, and an adequate resolution to them found.
+
+
+
+
+
+Romanow, et al. Informational [Page 12]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ Layered atop Internet transport protocols, the RDMA protocols will
+ gain leverage from and must permit integration with Internet security
+ standards, such as IPsec and TLS [IPSEC, TLS]. However, there may be
+ implementation ramifications for certain security approaches with
+ respect to RDMA, due to its copy avoidance.
+
+ IPsec, operating to secure the connection on a packet-by-packet
+ basis, seems to be a natural fit to securing RDMA placement, which
+ operates in conjunction with transport. Because RDMA enables an
+ implementation to avoid buffering, it is preferable to perform all
+ applicable security protection prior to processing of each segment by
+ the transport and RDMA layers. Such a layering enables the most
+ efficient secure RDMA implementation.
+
+ The TLS record protocol, on the other hand, is layered on top of
+ reliable transports and cannot provide such security assurance until
+ an entire record is available, which may require the buffering and/or
+ assembly of several distinct messages prior to TLS processing. This
+ defers RDMA processing and introduces overheads that RDMA is designed
+ to avoid. Therefore, TLS is viewed as potentially a less natural fit
+ for protecting the RDMA protocols.
+
+ It is necessary to guarantee properties such as confidentiality,
+ integrity, and authentication on an RDMA communications channel.
+ However, these properties cannot defend against all attacks from
+ properly authenticated peers, which might be malicious, compromised,
+ or buggy. Therefore, the RDMA design must address protection against
+ such attacks. For example, an RDMA peer should not be able to read
+ or write memory regions without prior consent.
+
+ Further, it must not be possible to evade memory consistency checks
+ at the recipient. The RDMA design must allow the recipient to rely
+ on its consistent memory contents by explicitly controlling peer
+ access to memory regions at appropriate times.
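+
+ The kind of control this implies can be sketched as a per-region
+ access check that is consulted before any placement occurs.  The
+ names and flags below are hypothetical, chosen only to illustrate the
+ requirement that peer access be explicitly granted, bounded, and
+ revocable; they are not drawn from any RDMA protocol definition.
+
+    /* Hypothetical per-region access control: a peer operation is
+     * honored only while the application has enabled it, only for
+     * the authorized direction, and only within the region. */
+    #include <stdio.h>
+
+    enum { PEER_READ = 1, PEER_WRITE = 2 };
+
+    struct protected_region {
+        unsigned long len;
+        int           peer_access;  /* 0 until the application consents */
+    };
+
+    static int access_allowed(const struct protected_region *r,
+                              unsigned long off, unsigned long len, int op)
+    {
+        if (!(r->peer_access & op)) return 0;  /* not authorized     */
+        if (off + len > r->len)     return 0;  /* outside the region */
+        return 1;
+    }
+
+    int main(void)
+    {
+        struct protected_region r = { 4096, 0 };
+
+        printf("before consent: %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        r.peer_access = PEER_WRITE;            /* application enables */
+        printf("after consent:  %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        r.peer_access = 0;                     /* and later revokes   */
+        printf("after revoke:   %d\n",
+               access_allowed(&r, 0, 512, PEER_WRITE));
+        return 0;
+    }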
+
+ Peer connections that do not pass authentication and authorization
+ checks by upper layers must not be permitted to begin processing in
+ RDMA mode with an inappropriate endpoint. Once associated, peer
+ accesses to memory regions must be authenticated and made subject to
+ authorization checks in the context of the association and connection
+ on which they are to be performed, prior to any transfer operation or
+ data being accessed.
+
+ The RDMA protocols must ensure that these region protections be under
+ strict application control. Remote access to local memory by a
+ network peer is particularly important in the Internet context, where
+ such access can be exported globally.
+
+
+
+
+Romanow, et al. Informational [Page 13]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+8. Terminology
+
+ This section contains general terminology definitions for this
+ document and for Remote Direct Memory Access in general.
+
+ Remote Direct Memory Access (RDMA)
+ A method of accessing memory on a remote system in which the
+ local system specifies the location of the data to be
+ transferred.
+
+ RDMA Protocol
+ A protocol that supports RDMA Operations to transfer data
+ between systems.
+
+ Fabric
+ The collection of links, switches, and routers that connect a
+ set of systems.
+
+ Storage Area Network (SAN)
+ A network where disks, tapes, and other storage devices are made
+ available to one or more end-systems via a fabric.
+
+ System Area Network
+ A network where clustered systems share services, such as
+ storage and interprocess communication, via a fabric.
+
+ Fibre Channel (FC)
+ An ANSI standard link layer with associated protocols, typically
+ used to implement Storage Area Networks. [FIBRE]
+
+ Virtual Interface Architecture (VI, VIA)
+ An RDMA interface definition developed by an industry group and
+ implemented with a variety of differing wire protocols. [VI]
+
+ Infiniband (IB)
+ An RDMA interface, protocol suite and link layer specification
+ defined by an industry trade association. [IB]
+
+9. Acknowledgements
+
+ Jeff Chase generously provided many useful insights and information.
+ Thanks to Jim Pinkerton for many helpful discussions.
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 14]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+10. Informative References
+
+ [ATM] The ATM Forum, "Asynchronous Transfer Mode Physical Layer
+ Specification" af-phy-0015.000, etc. available from
+ http://www.atmforum.com/standards/approved.html.
+
+ [BCF+95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C.
+ L. Seitz, J. N. Seizovic, and W. Su. "Myrinet - A
+ gigabit-per-second local-area network", IEEE Micro,
+ February 1995.
+
+ [BJM+96] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J.
+ Wilkes, "An implementation of the Hamlyn send-managed
+ interface architecture", in Proceedings of the Second
+ Symposium on Operating Systems Design and Implementation,
+ USENIX Assoc., October 1996.
+
+ [BLA+94] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W.
+ Felten, "A virtual memory mapped network interface for the
+ SHRIMP multicomputer", in Proceedings of the 21st Annual
+ Symposium on Computer Architecture, April 1994, pp. 142-
+ 153.
+
+ [Br99] J. C. Brustoloni, "Interoperation of copy avoidance in
+ network and file I/O", Proceedings of IEEE Infocom, 1999,
+ pp. 534-542.
+
+ [BS96] J. C. Brustoloni, P. Steenkiste, "Effects of buffering
+ semantics on I/O performance", Proceedings OSDI'96,
+ USENIX, Seattle, WA October 1996, pp. 277-291.
+
+ [BT05] Bailey, S. and T. Talpey, "The Architecture of Direct Data
+ Placement (DDP) And Remote Direct Memory Access (RDMA) On
+ Internet Protocols", RFC 4296, December 2005.
+
+ [CFF+94] C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A.
+ Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde,
+ "High-performance TCP/IP and UDP/IP networking in DEC
+ OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE
+ Symposium on High Performance Distributed Computing,
+ August 1994, pp. 36-42.
+
+ [CGY01] J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system
+ optimizations for high-speed TCP", IEEE Communications
+ Magazine, Volume: 39, Issue: 4 , April 2001, pp 68-74.
+ http://www.cs.duke.edu/ari/publications/end-
+ system.{ps,pdf}.
+
+
+
+
+Romanow, et al. Informational [Page 15]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [Ch96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX
+ 1996 Annual Technical Conference, San Diego, CA, January
+ 1996.
+
+ [Ch02] Jeffrey Chase, Personal communication.
+
+ [CJRS89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An
+ analysis of TCP processing overhead", IEEE Communications
+ Magazine, volume: 27, Issue: 6, June 1989, pp 23-29.
+
+ [CT90] D. D. Clark, D. Tennenhouse, "Architectural considerations
+ for a new generation of protocols", Proceedings of the ACM
+ SIGCOMM Conference, 1990.
+
+ [DAPP93] P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson,
+ "Network subsystem design", IEEE Network, July 1993, pp.
+ 8-17.
+
+ [DP93] P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth
+ cross-domain transfer facility", Proceedings of the 14th
+ ACM Symposium of Operating Systems Principles, December
+ 1993.
+
+ [DWB+93] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards,
+ J. Lumley, "Afterburner: architectural support for high-
+ performance protocols", Technical Report, HP Laboratories
+ Bristol, HPL-93-46, July 1993.
+
+ [EBBV95] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A
+ user-level network interface for parallel and distributed
+ computing", Proc. of the 15th ACM Symposium on Operating
+ Systems Principles, Copper Mountain, Colorado, December
+ 3-6, 1995.
+
+ [FDDI] International Standards Organization, "Fibre Distributed
+ Data Interface", ISO/IEC 9314, committee drafts available
+ from http://www.iso.org.
+
+ [FGM+99] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
+ Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
+ Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
+
+ [FIBRE] ANSI Technical Committee T10, "Fibre Channel Protocol
+ (FCP)" (and as revised and updated), ANSI X3.269:1996
+ [R2001], committee draft available from
+ http://www.t10.org/drafts.htm#FibreChannel
+
+
+
+
+
+Romanow, et al. Informational [Page 16]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [HP97] J. L. Hennessy, D. A. Patterson, Computer Organization and
+ Design, 2nd Edition, San Francisco: Morgan Kaufmann
+ Publishers, 1997.
+
+ [IB] InfiniBand Trade Association, "InfiniBand Architecture
+ Specification, Volumes 1 and 2", Release 1.1, November
+ 2002, available from http://www.infinibandta.org/specs.
+
+ [IPSEC] Kent, S. and R. Atkinson, "Security Architecture for the
+ Internet Protocol", RFC 2401, November 1998.
+
+ [KP96] J. Kay, J. Pasquale, "Profiling and reducing processing
+ overheads in TCP/IP", IEEE/ACM Transactions on Networking,
+ Vol 4, No. 6, pp.817-828, December 1996.
+
+ [KSZ95] K. Kleinpaste, P. Steenkiste, B. Zill, "Software support
+ for outboard buffering and checksumming", SIGCOMM'95.
+
+ [Ma02] K. Magoutis, "Design and Implementation of a Direct Access
+ File System (DAFS) Kernel Server for FreeBSD", in
+ Proceedings of USENIX BSDCon 2002 Conference, San
+ Francisco, CA, February 11-14, 2002.
+
+ [MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J.
+ S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E.
+ Gabber, "Structure and Performance of the Direct Access
+ File System (DAFS)", in Proceedings of the 2002 USENIX
+ Annual Technical Conference, Monterey, CA, June 9-14,
+ 2002.
+
+ [Mc95] J. D. McCalpin, "A Survey of memory bandwidth and machine
+ balance in current high performance computers", IEEE TCCA
+ Newsletter, December 1995.
+
+ [PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K.
+ Keeton, C. Kozyrakis, R. Thomas, K. Yelick, "A case for
+ intelligent RAM: IRAM", IEEE Micro, April 1997.
+
+ [PDZ99] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified
+ I/O buffering and caching system", Proc. of the 3rd
+ Symposium on Operating Systems Design and Implementation,
+ New Orleans, LA, February 1999.
+
+ [Pi01] J. Pinkerton, "Winsock Direct: The Value of System Area
+ Networks", May 2001, available from
+ http://www.microsoft.com/windows2000/techinfo/
+ howitworks/communications/winsock.asp.
+
+
+
+
+Romanow, et al. Informational [Page 17]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+ [Po81] Postel, J., "Transmission Control Protocol", STD 7, RFC
+ 793, September 1981.
+
+ [QUAD] Quadrics Ltd., Quadrics QSNet product information,
+ available from
+ http://www.quadrics.com/website/pages/02qsn.html.
+
+ [SDP] InfiniBand Trade Association, "Sockets Direct Protocol
+ v1.0", Annex A of InfiniBand Architecture Specification
+ Volume 1, Release 1.1, November 2002, available from
+ http://www.infinibandta.org/specs.
+
+ [SRVNET] R. Horst, "TNet: A reliable system area network", IEEE
+ Micro, pp. 37-45, February 1995.
+
+ [STREAM] J. D. McCalpin, The STREAM Benchmark Reference Information,
+ http://www.cs.virginia.edu/stream/.
+
+ [TK95] M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O
+ framework for UNIX", Technical Report, SMLI TR-95-39, May
+ 1995.
+
+ [TLS] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0",
+ RFC 2246, January 1999.
+
+ [VI] D. Cameron and G. Regnier, "The Virtual Interface
+ Architecture", ISBN 0971288704, Intel Press, April 2002,
+ more info at http://www.intel.com/intelpress/via/.
+
+ [Wa97] J. R. Walsh, "DART: Fast application-level networking via
+ data-copy avoidance", IEEE Network, July/August 1997, pp.
+ 28-38.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 18]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Authors' Addresses
+
+ Stephen Bailey
+ Sandburst Corporation
+ 600 Federal Street
+ Andover, MA 01810 USA
+
+ Phone: +1 978 689 1614
+ EMail: steph@sandburst.com
+
+
+ Jeffrey C. Mogul
+ HP Labs
+ Hewlett-Packard Company
+ 1501 Page Mill Road, MS 1117
+ Palo Alto, CA 94304 USA
+
+ Phone: +1 650 857 2206 (EMail preferred)
+ EMail: JeffMogul@acm.org
+
+
+ Allyn Romanow
+ Cisco Systems, Inc.
+ 170 W. Tasman Drive
+ San Jose, CA 95134 USA
+
+ Phone: +1 408 525 8836
+ EMail: allyn@cisco.com
+
+
+ Tom Talpey
+ Network Appliance
+ 1601 Trapelo Road
+ Waltham, MA 02451 USA
+
+ Phone: +1 781 768 5329
+ EMail: thomas.talpey@netapp.com
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 19]
+
+RFC 4297 RDMA over IP Problem Statement December 2005
+
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.
+
+
+
+
+
+
+
+Romanow, et al. Informational [Page 20]
+