Diffstat (limited to 'doc/rfc/rfc4297.txt')
-rw-r--r-- | doc/rfc/rfc4297.txt | 1123 |
1 file changed, 1123 insertions, 0 deletions
diff --git a/doc/rfc/rfc4297.txt b/doc/rfc/rfc4297.txt new file mode 100644 index 0000000..3ba5312 --- /dev/null +++ b/doc/rfc/rfc4297.txt @@ -0,0 +1,1123 @@ + + + + + + +Network Working Group A. Romanow +Request for Comments: 4297 Cisco +Category: Informational J. Mogul + HP + T. Talpey + NetApp + S. Bailey + Sandburst + December 2005 + + + Remote Direct Memory Access (RDMA) over IP Problem Statement + +Status of This Memo + + This memo provides information for the Internet community. It does + not specify an Internet standard of any kind. Distribution of this + memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2005). + +Abstract + + Overhead due to the movement of user data in the end-system network + I/O processing path at high speeds is significant, and has limited + the use of Internet protocols in interconnection networks, and the + Internet itself -- especially where high bandwidth, low latency, + and/or low overhead are required by the hosted application. + + This document examines this overhead, and addresses an architectural, + IP-based "copy avoidance" solution for its elimination, by enabling + Remote Direct Memory Access (RDMA). + + + + + + + + + + + + + + + + + +Romanow, et al. Informational [Page 1] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +Table of Contents + + 1. Introduction ....................................................2 + 2. The High Cost of Data Movement Operations in Network I/O ........4 + 2.1. Copy avoidance improves processing overhead. ...............5 + 3. Memory bandwidth is the root cause of the problem. ..............6 + 4. High copy overhead is problematic for many key Internet + applications. ...................................................8 + 5. Copy Avoidance Techniques ......................................10 + 5.1. A Conceptual Framework: DDP and RDMA ......................11 + 6. Conclusions ....................................................12 + 7. Security Considerations ........................................12 + 8. Terminology ....................................................14 + 9. Acknowledgements ...............................................14 + 10. Informative References ........................................15 + +1. Introduction + + This document considers the problem of high host processing overhead + associated with the movement of user data to and from the network + interface under high speed conditions. This problem is often + referred to as the "I/O bottleneck" [CT90]. More specifically, the + source of high overhead that is of interest here is data movement + operations, i.e., copying. The throughput of a system may therefore + be limited by the overhead of this copying. This issue is not to be + confused with TCP offload, which is not addressed here. High speed + refers to conditions where the network link speed is high, relative + to the bandwidths of the host CPU and memory. With today's computer + systems, one Gigabit per second (Gbits/s) and over is considered high + speed. + + High costs associated with copying are an issue primarily for large + scale systems. Although smaller systems such as rack-mounted PCs and + small workstations would benefit from a reduction in copying + overhead, the benefit to smaller machines will be primarily in the + next few years as they scale the amount of bandwidth they handle. + Today, it is large system machines with high bandwidth feeds, usually + multiprocessors and clusters, that are adversely affected by copying + overhead. 
Examples of such machines include all varieties of + servers: database servers, storage servers, application servers for + transaction processing, for e-commerce, and web serving, content + distribution, video distribution, backups, data mining and decision + support, and scientific computing. + + Note that such servers almost exclusively service many concurrent + sessions (transport connections), which, in aggregate, are + responsible for > 1 Gbits/s of communication. Nonetheless, the cost + + + + +Romanow, et al. Informational [Page 2] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + of copying overhead for a particular load is the same whether from + few or many sessions. + + The I/O bottleneck, and the role of data movement operations, have + been widely studied in research and industry over the last + approximately 14 years, and we draw freely on these results. + Historically, the I/O bottleneck has received attention whenever new + networking technology has substantially increased line rates: 100 + Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data + Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], 1 + Gbits/s Ethernet. In earlier speed transitions, the availability of + memory bandwidth allowed the I/O bottleneck issue to be deferred. + Now however, this is no longer the case. While the I/O problem is + significant at 1 Gbits/s, it is the introduction of 10 Gbits/s + Ethernet which is motivating an upsurge of activity in industry and + research [IB, VI, CGY01, Ma02, MAF+02]. + + Because of high overhead of end-host processing in current + implementations, the TCP/IP protocol stack is not used for high speed + transfer. Instead, special purpose network fabrics, using a + technology generally known as Remote Direct Memory Access (RDMA), + have been developed and are widely used. RDMA is a set of mechanisms + that allow the network adapter, under control of the application, to + steer data directly into and out of application buffers. Examples of + such interconnection fabrics include Fibre Channel [FIBRE] for block + storage transfer, Virtual Interface Architecture [VI] for database + clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and + Quadrics [QUAD] for System Area Networks. These link level + technologies limit application scaling in both distance and size, + meaning that the number of nodes cannot be arbitrarily large. + + This problem statement substantiates the claim that in network I/O + processing, high overhead results from data movement operations, + specifically copying; and that copy avoidance significantly decreases + this processing overhead. It describes when and why the high + processing overheads occur, explains why the overhead is problematic, + and points out which applications are most affected. + + The document goes on to discuss why the problem is relevant to the + Internet and to Internet-based applications. Applications that + store, manage, and distribute the information of the Internet are + well suited to applying the copy avoidance solution. They will + benefit by avoiding high processing overheads, which removes limits + to the available scaling of tiered end-systems. Copy avoidance also + eliminates latency for these systems, which can further benefit + effective distributed processing. + + + + + +Romanow, et al. 
Informational [Page 3] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + In addition, this document introduces an architectural approach to + solving the problem, which is developed in detail in [BT05]. It also + discusses how the proposed technology may introduce security concerns + and how they should be addressed. + + Finally, this document includes a Terminology section to aid as a + reference for several new terms introduced by RDMA. + +2. The High Cost of Data Movement Operations in Network I/O + + A wealth of data from research and industry shows that copying is + responsible for substantial amounts of processing overhead. It + further shows that even in carefully implemented systems, eliminating + copies significantly reduces the overhead, as referenced below. + + Clark et al. [CJRS89] in 1989 shows that TCP [Po81] overhead + processing is attributable to both operating system costs (such as + interrupts, context switches, process management, buffer management, + timer management) and the costs associated with processing individual + bytes (specifically, computing the checksum and moving data in + memory). They found that moving data in memory is the more important + of the costs, and their experiments show that memory bandwidth is the + greatest source of limitation. In the data presented [CJRS89], 64% + of the measured microsecond overhead was attributable to data + touching operations, and 48% was accounted for by copying. The + system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet + packets. + + In a well-implemented system, copying can occur between the network + interface and the kernel, and between the kernel and application + buffers; there are two copies, each of which are two memory bus + crossings, for read and write. Although in certain circumstances it + is possible to do better, usually two copies are required on receive. + + Subsequent work has consistently shown the same phenomenon as the + earlier Clark study. A number of studies report results that data- + touching operations, checksumming and data movement, dominate the + processing costs for messages longer than 128 Bytes [BS96, CGY01, + Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet + overheads dominate [KP96, CGY01]. + + The percentage of overhead due to data-touching operations increases + with packet size, since time spent on per-byte operations scales + linearly with message size [KP96]. For example, Chu [Ch96] reported + substantial per-byte latency costs as a percentage of total + networking software costs for an MTU size packet on a SPARCstation/20 + + + + + +Romanow, et al. Informational [Page 4] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + running memory-to-memory TCP tests over networks with 3 different MTU + sizes. The percentage of total software costs attributable to + per-byte operations were: + + 1500 Byte Ethernet 18-25% + 4352 Byte FDDI 35-50% + 9180 Byte ATM 55-65% + + Although many studies report results for data-touching operations, + including checksumming and data movement together, much work has + focused just on copying [BS96, Br99, Ch96, TK95]. For example, + [KP96] reports results that separate processing times for checksum + from data movement operations. For the 1500 Byte Ethernet size, 20% + of total processing overhead time is attributable to copying. The + study used 2 DECstations 5000/200 connected by an FDDI network. (In + this study, checksum accounts for 30% of the processing time.) + +2.1. 
Copy avoidance improves processing overhead. + + A number of studies show that eliminating copies substantially + reduces overhead. For example, results from copy-avoidance in the + IO-Lite system [PDZ99], which aimed at improving web server + performance, show a throughput increase of 43% over an optimized web + server, and 137% improvement over an Apache server. The system was + implemented in a 4.4BSD-derived UNIX kernel, and the experiments used + a server system based on a 333MHz Pentium II PC connected to a + switched 100 Mbits/s Fast Ethernet. + + There are many other examples where elimination of copying using a + variety of different approaches showed significant improvement in + system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We + will discuss the results of one of these studies in detail in order + to clarify the significant degree of improvement produced by copy + avoidance [Ch02]. + + Recent work by Chase et al. [CGY01], measuring CPU utilization, shows + that avoiding copies reduces CPU time spent on data access from 24% + to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation + XP1000 and a Myrinet adapter [BCF+95]. This is an absolute + improvement of 9% due to copy avoidance. + + The total CPU utilization was 35%, with data access accounting for + 24%. Thus, the relative importance of reducing copies is 26%. At + 370 Mbits/s, the system is not very heavily loaded. The relative + improvement in achievable bandwidth is 34%. This is the improvement + we would see if copy avoidance were added when the machine was + saturated by network I/O. + + + + +Romanow, et al. Informational [Page 5] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + Note that improvement from the optimization becomes more important if + the overhead it targets is a larger share of the total cost. This is + what happens if other sources of overhead, such as checksumming, are + eliminated. In [CGY01], after removing checksum overhead, copy + avoidance reduces CPU utilization from 26% to 10%. This is a 16% + absolute reduction, a 61% relative reduction, and a 160% relative + improvement in achievable bandwidth. + + In fact, today's network interface hardware commonly offloads the + checksum, which removes the other source of per-byte overhead. They + also coalesce interrupts to reduce per-packet costs. Thus, today + copying costs account for a relatively larger part of CPU utilization + than previously, and therefore relatively more benefit is to be + gained in reducing them. (Of course this argument would be specious + if the amount of overhead were insignificant, but it has been shown + to be substantial. [BS96, Br99, Ch96, KP96, TK95]) + +3. Memory bandwidth is the root cause of the problem. + + Data movement operations are expensive because memory bandwidth is + scarce relative to network bandwidth and CPU bandwidth [PAC+97]. + This trend existed in the past and is expected to continue into the + future [HP97, STREAM], especially in large multiprocessor systems. + + With copies crossing the bus twice per copy, network processing + overhead is high whenever network bandwidth is large in comparison to + CPU and memory bandwidths. Generally, with today's end-systems, the + effects are observable at network speeds over 1 Gbits/s. In fact, + with multiple bus crossings it is possible to see the bus bandwidth + being the limiting factor for throughput. This prevents such an + end-system from simultaneously achieving full network bandwidth and + full application performance. 
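
   A rough model makes this arithmetic concrete.  The sketch below is a
   minimal, illustrative calculation rather than a measurement: it
   assumes one bus crossing for the adapter's deposit of received data
   into memory, two crossings (a read and a write) per software copy,
   and a nominal 1 GBytes/s of memory bandwidth.  Under those
   assumptions it bounds the network rate the memory system can sustain
   for zero-, one-, and two-copy receive paths.

      # Illustrative bound on receive throughput when every received byte
      # must cross the memory bus several times.  The figures here are
      # assumptions for the example, not measurements of any machine.

      MEM_BW_GBYTES_PER_S = 1.0   # nominal memory bandwidth (one read or write)

      def max_receive_gbits(bus_crossings_per_byte):
          # Each received byte consumes `bus_crossings_per_byte` units of
          # memory bandwidth, so the sustainable network rate is the memory
          # bandwidth divided by that count, converted to Gbits/s.
          gbytes_per_s = MEM_BW_GBYTES_PER_S / bus_crossings_per_byte
          return gbytes_per_s * 8

      # 2-copy receive: adapter write + 2 x (read + write)  -> 5 crossings
      # 1-copy receive: adapter write + read + write        -> 3 crossings
      # 0-copy (RDMA-style placement): adapter write only   -> 1 crossing
      for label, crossings in (("2-copy", 5), ("1-copy", 3), ("0-copy", 1)):
          print(f"{label}: {max_receive_gbits(crossings):.2f} Gbits/s")

   With these assumed figures, the one-copy receive path tops out near
   2.7 Gbits/s and the two-copy path near 1.6 Gbits/s, which is
   consistent with the uniprocessor example and with the two-thirds and
   80% bus-load reductions discussed later in this document.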
+ + A common question is whether an increase in CPU processing power + alleviates the problem of high processing costs of network I/O. The + answer is no, it is the memory bandwidth that is the issue. Faster + CPUs do not help if the CPU spends most of its time waiting for + memory [CGY01]. + + The widening gap between microprocessor performance and memory + performance has long been a widely recognized and well-understood + problem [PAC+97]. Hennessy [HP97] shows microprocessor performance + grew from 1980-1998 at 60% per year, while the access time to DRAM + improved at 10% per year, giving rise to an increasing "processor- + memory performance gap". + + + + + + +Romanow, et al. Informational [Page 6] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + Another source of relevant data is the STREAM Benchmark Reference + Information website, which provides information on the STREAM + benchmark [STREAM]. The benchmark is a simple synthetic benchmark + program that measures sustainable memory bandwidth (in MBytes/s) and + the corresponding computation rate for simple vector kernels measured + in MFLOPS. The website tracks information on sustainable memory + bandwidth for hundreds of machines and all major vendors. + + Results show measured system performance statistics. Processing + performance from 1985-2001 increased at 50% per year on average, and + sustainable memory bandwidth from 1975 to 2001 increased at 35% per + year, on average, over all the systems measured. A similar 15% per + year lead of processing bandwidth over memory bandwidth shows up in + another statistic, machine balance [Mc95], a measure of the relative + rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory + ops/cycle) [STREAM]. + + Network bandwidth has been increasing about 10-fold roughly every 8 + years, which is a 40% per year growth rate. + + A typical example illustrates that the memory bandwidth compares + unfavorably with link speed. The STREAM benchmark shows that a + modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will + move the data 3 times in doing a receive operation: once for the + network interface to deposit the data in memory, and twice for the + CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning + one read or one write, the machine could handle approximately 2.67 + Gbits/s of network bandwidth, one third the copy bandwidth. But this + assumes 100% utilization, which is not possible, and more importantly + the machine would be totally consumed! (A rule of thumb for + databases is that 20% of the machine should be required to service + I/O, leaving 80% for the database application. And, the less, the + better.) + + In 2001, 1 Gbits/s links were common. An application server may + typically have two 1 Gbits/s connections: one connection backend to a + storage server and one front-end, say for serving HTTP [FGM+99]. + Thus, the communications could use 2 Gbits/s. In our typical + example, the machine could handle 2.7 Gbits/s at its theoretical + maximum while doing nothing else. This means that the machine + basically could not keep up with the communication demands in 2001; + with the relative growth trends, the situation only gets worse. + + + + + + + + + +Romanow, et al. Informational [Page 7] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +4. High copy overhead is problematic for many key Internet + applications. 
+ + If a significant portion of resources on an application machine is + consumed in network I/O rather than in application processing, it + makes it difficult for the application to scale, i.e., to handle more + clients, to offer more services. + + Several years ago the most affected applications were streaming + multimedia, parallel file systems, and supercomputing on clusters + [BS96]. In addition, today the applications that suffer from copying + overhead are more central in Internet computing -- they store, + manage, and distribute the information of the Internet and the + enterprise. They include database applications doing transaction + processing, e-commerce, web serving, decision support, content + distribution, video distribution, and backups. Clusters are + typically used for this category of application, since they have + advantages of availability and scalability. + + Today these applications, which provide and manage Internet and + corporate information, are typically run in data centers that are + organized into three logical tiers. One tier is typically a set of + web servers connecting to the WAN. The second tier is a set of + application servers that run the specific applications usually on + more powerful machines, and the third tier is backend databases. + Physically, the first two tiers -- web server and application server + -- are usually combined [Pi01]. For example, an e-commerce server + communicates with a database server and with a customer site, or a + content distribution server connects to a server farm, or an OLTP + server connects to a database and a customer site. + + When network I/O uses too much memory bandwidth, performance on + network paths between tiers can suffer. (There might also be + performance issues on Storage Area Network paths used either by the + database tier or the application tier.) The high overhead from + network-related memory copies diverts system resources from other + application processing. It also can create bottlenecks that limit + total system performance. + + There is high motivation to maximize the processing capacity of each + CPU because scaling by adding CPUs, one way or another, has + drawbacks. For example, adding CPUs to a multiprocessor will not + necessarily help because a multiprocessor improves performance only + when the memory bus has additional bandwidth to spare. Clustering + can add additional complexity to handling the applications. + + In order to scale a cluster or multiprocessor system, one must + proportionately scale the interconnect bandwidth. Interconnect + + + +Romanow, et al. Informational [Page 8] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + bandwidth governs the performance of communication-intensive parallel + applications; if this (often expressed in terms of "bisection + bandwidth") is too low, adding additional processors cannot improve + system throughput. Interconnect latency can also limit the + performance of applications that frequently share data between + processors. + + So, excessive overheads on network paths in a "scalable" system both + can require the use of more processors than optimal, and can reduce + the marginal utility of those additional processors. + + Copy avoidance scales a machine upwards by removing at least two- + thirds of the bus bandwidth load from the "very best" 1-copy (on + receive) implementations, and removes at least 80% of the bandwidth + overhead from the 2-copy implementations. 
+ + The removal of bus bandwidth requirements, in turn, removes + bottlenecks from the network processing path and increases the + throughput of the machine. On a machine with limited bus bandwidth, + the advantages of removing this load is immediately evident, as the + host can attain full network bandwidth. Even on a machine with bus + bandwidth adequate to sustain full network bandwidth, removal of bus + bandwidth load serves to increase the availability of the machine for + the processing of user applications, in some cases dramatically. + + An example showing poor performance with copies and improved scaling + with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows + higher server throughput servicing more clients using a zero-copy + system. In an experiment designed to mimic real world web conditions + by simulating the effect of TCP WAN connections on the server, the + performance of 3 servers was compared. One server was Apache, + another was an optimized server called Flash, and the third was the + Flash server running IO-Lite, called Flash-Lite with zero copy. The + measurement was of throughput in requests/second as a function of the + number of slow background clients that could be served. As the table + shows, Flash-Lite has better throughput, especially as the number of + clients increases. + + Apache Flash Flash-Lite + ------ ----- ---------- + #Clients Throughput reqs/s Throughput Throughput + + 0 520 610 890 + 16 390 490 890 + 32 360 490 850 + 64 360 490 890 + 128 310 450 880 + 256 310 440 820 + + + +Romanow, et al. Informational [Page 9] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + Traditional Web servers (which mostly send data and can keep most of + their content in the file cache) are not the worst case for copy + overhead. Web proxies (which often receive as much data as they + send) and complex Web servers based on System Area Networks or + multi-tier systems will suffer more from copy overheads than in the + example above. + +5. Copy Avoidance Techniques + + There have been extensive research investigation and industry + experience with two main alternative approaches to eliminating data + movement overhead, often along with improving other Operating System + processing costs. In one approach, hardware and/or software changes + within a single host reduce processing costs. In another approach, + memory-to-memory networking [MAF+02], the exchange of explicit data + placement information between hosts allows them to reduce processing + costs. + + The single host approaches range from new hardware and software + architectures [KSZ95, Wa97, DWB+93] to new or modified software + systems [BS96, Ch96, TK95, DP93, PDZ99]. In the approach based on + using a networking protocol to exchange information, the network + adapter, under control of the application, places data directly into + and out of application buffers, reducing the need for data movement. + Commonly this approach is called RDMA, Remote Direct Memory Access. + + As discussed below, research and industry experience has shown that + copy avoidance techniques within the receiver processing path alone + have proven to be problematic. The research special purpose host + adapter systems had good performance and can be seen as precursors + for the commercial RDMA-based adapters [KSZ95, DWB+93]. In software, + many implementations have successfully achieved zero-copy transmit, + but few have accomplished zero-copy receive. 
And those that have + done so make strict alignment and no-touch requirements on the + application, greatly reducing the portability and usefulness of the + implementation. + + In contrast, experience has proven satisfactory with memory-to-memory + systems that permit RDMA; performance has been good and there have + not been system or networking difficulties. RDMA is a single + solution. Once implemented, it can be used with any OS and machine + architecture, and it does not need to be revised when either of these + are changed. + + In early work, one goal of the software approaches was to show that + TCP could go faster with appropriate OS support [CJRS89, CFF+94]. + While this goal was achieved, further investigation and experience + showed that, though possible to craft software solutions, specific + + + +Romanow, et al. Informational [Page 10] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + system optimizations have been complex, fragile, extremely + interdependent with other system parameters in complex ways, and + often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, + KSZ95, PDZ99]. The network I/O system interacts with other aspects + of the Operating System such as machine architecture and file I/O, + and disk I/O [Br99, Ch96, DP93]. + + For example, the Solaris Zero-Copy TCP work [Ch96], which relies on + page remapping, shows that the results are highly interdependent with + other systems, such as the file system, and that the particular + optimizations are specific for particular architectures, meaning that + for each variation in architecture, optimizations must be re-crafted + [Ch96]. + + With RDMA, application I/O buffers are mapped directly, and the + authorized peer may access it without incurring additional processing + overhead. When RDMA is implemented in hardware, arbitrary data + movement can be performed without involving the host CPU at all. + + A number of research projects and industry products have been based + on the memory-to-memory approach to copy avoidance. These include + U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], + Winsock Direct [Pi01]. Several memory-to-memory systems have been + widely used and have generally been found to be robust, to have good + performance, and to be relatively simple to implement. These include + VI [VI], Myrinet [BCF+95], Quadrics [QUAD], Compaq/Tandem Servernet + [SRVNET]. Networks based on these memory-to-memory architectures + have been used widely in scientific applications and in data centers + for block storage, file system access, and transaction processing. + + By exporting direct memory access "across the wire", applications may + direct the network stack to manage all data directly from application + buffers. A large and growing class that takes advantage of such + capabilities of applications has already emerged. It includes all + the major databases, as well as network protocols such as Sockets + Direct [SDP]. + +5.1. A Conceptual Framework: DDP and RDMA + + An RDMA solution can be usefully viewed as being comprised of two + distinct components: "direct data placement (DDP)" and "remote direct + memory access (RDMA) semantics". They are distinct in purpose and + also in practice -- they may be implemented as separate protocols. + + The more fundamental of the two is the direct data placement + facility. 
This is the means by which memory is exposed to the remote + peer in an appropriate fashion, and the means by which the peer may + access it, for instance, reading and writing. + + + +Romanow, et al. Informational [Page 11] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + The RDMA control functions are semantically layered atop direct data + placement. Included are operations that provide "control" features, + such as connection and termination, and the ordering of operations + and signaling their completions. A "send" facility is provided. + + While the functions (and potentially protocols) are distinct, + historically both aspects taken together have been referred to as + "RDMA". The facilities of direct data placement are useful in and of + themselves, and may be employed by other upper layer protocols to + facilitate data transfer. Therefore, it is often useful to refer to + DDP as the data placement functionality and RDMA as the control + aspect. + + [BT05] develops an architecture for DDP and RDMA atop the Internet + Protocol Suite, and is a companion document to this problem + statement. + +6. Conclusions + + This Problem Statement concludes that an IP-based, general solution + for reducing processing overhead in end-hosts is desirable. + + It has shown that high overhead of the processing of network data + leads to end-host bottlenecks. These bottlenecks are in large part + attributable to the copying of data. The bus bandwidth of machines + has historically been limited, and the bandwidth of high-speed + interconnects taxes it heavily. + + An architectural solution to alleviate these bottlenecks best + satisfies the issue. Further, the high speed of today's + interconnects and the deployment of these hosts on Internet + Protocol-based networks leads to the desirability of layering such a + solution on the Internet Protocol Suite. The architecture described + in [BT05] is such a proposal. + +7. Security Considerations + + Solutions to the problem of reducing copying overhead in high + bandwidth transfers may introduce new security concerns. Any + proposed solution must be analyzed for security vulnerabilities and + any such vulnerabilities addressed. Potential security weaknesses -- + due to resource issues that might lead to denial-of-service attacks, + overwrites and other concurrent operations, the ordering of + completions as required by the RDMA protocol, the granularity of + transfer, and any other identified vulnerabilities -- need to be + examined, described, and an adequate resolution to them found. + + + + + +Romanow, et al. Informational [Page 12] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + Layered atop Internet transport protocols, the RDMA protocols will + gain leverage from and must permit integration with Internet security + standards, such as IPsec and TLS [IPSEC, TLS]. However, there may be + implementation ramifications for certain security approaches with + respect to RDMA, due to its copy avoidance. + + IPsec, operating to secure the connection on a packet-by-packet + basis, seems to be a natural fit to securing RDMA placement, which + operates in conjunction with transport. Because RDMA enables an + implementation to avoid buffering, it is preferable to perform all + applicable security protection prior to processing of each segment by + the transport and RDMA layers. Such a layering enables the most + efficient secure RDMA implementation. 
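
   A small sketch, using hypothetical names rather than any real IPsec,
   TLS, or RDMA implementation's interfaces, may help contrast
   per-segment protection with the record-level approach discussed
   next: when each segment is protected individually, it can be
   decrypted and its payload placed directly on arrival, whereas
   record-level protection forces segments to be staged until the whole
   record can be verified.

      # Hypothetical illustration only; decrypt_segment and decrypt_record
      # stand in for whatever security processing applies, and are assumed
      # to return (offset, payload) for the destination buffer.

      def place_per_segment(segments, decrypt_segment, app_buffer):
          # Per-packet protection (IPsec-style): protect, then place each
          # arriving segment directly at the offset its placement
          # information names; no staging buffer is needed.
          for seg in segments:
              offset, payload = decrypt_segment(seg)
              app_buffer[offset:offset + len(payload)] = payload

      def place_per_record(segments, decrypt_record, app_buffer):
          # Record-level protection (TLS-style): the record must be
          # reassembled in a staging buffer before it can be verified and
          # decrypted, reintroducing the buffering and extra data movement
          # that RDMA seeks to avoid.
          staging = b"".join(segments)
          offset, payload = decrypt_record(staging)
          app_buffer[offset:offset + len(payload)] = payload

   In this sketch, app_buffer stands for a pre-registered application
   buffer (for example, a bytearray) into which placement occurs.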
+ + The TLS record protocol, on the other hand, is layered on top of + reliable transports and cannot provide such security assurance until + an entire record is available, which may require the buffering and/or + assembly of several distinct messages prior to TLS processing. This + defers RDMA processing and introduces overheads that RDMA is designed + to avoid. Therefore, TLS is viewed as potentially a less natural fit + for protecting the RDMA protocols. + + It is necessary to guarantee properties such as confidentiality, + integrity, and authentication on an RDMA communications channel. + However, these properties cannot defend against all attacks from + properly authenticated peers, which might be malicious, compromised, + or buggy. Therefore, the RDMA design must address protection against + such attacks. For example, an RDMA peer should not be able to read + or write memory regions without prior consent. + + Further, it must not be possible to evade memory consistency checks + at the recipient. The RDMA design must allow the recipient to rely + on its consistent memory contents by explicitly controlling peer + access to memory regions at appropriate times. + + Peer connections that do not pass authentication and authorization + checks by upper layers must not be permitted to begin processing in + RDMA mode with an inappropriate endpoint. Once associated, peer + accesses to memory regions must be authenticated and made subject to + authorization checks in the context of the association and connection + on which they are to be performed, prior to any transfer operation or + data being accessed. + + The RDMA protocols must ensure that these region protections be under + strict application control. Remote access to local memory by a + network peer is particularly important in the Internet context, where + such access can be exported globally. + + + + +Romanow, et al. Informational [Page 13] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +8. Terminology + + This section contains general terminology definitions for this + document and for Remote Direct Memory Access in general. + + Remote Direct Memory Access (RDMA) + A method of accessing memory on a remote system in which the + local system specifies the location of the data to be + transferred. + + RDMA Protocol + A protocol that supports RDMA Operations to transfer data + between systems. + + Fabric + The collection of links, switches, and routers that connect a + set of systems. + + Storage Area Network (SAN) + A network where disks, tapes, and other storage devices are made + available to one or more end-systems via a fabric. + + System Area Network + A network where clustered systems share services, such as + storage and interprocess communication, via a fabric. + + Fibre Channel (FC) + An ANSI standard link layer with associated protocols, typically + used to implement Storage Area Networks. [FIBRE] + + Virtual Interface Architecture (VI, VIA) + An RDMA interface definition developed by an industry group and + implemented with a variety of differing wire protocols. [VI] + + Infiniband (IB) + An RDMA interface, protocol suite and link layer specification + defined by an industry trade association. [IB] + +9. Acknowledgements + + Jeff Chase generously provided many useful insights and information. + Thanks to Jim Pinkerton for many helpful discussions. + + + + + + + + + +Romanow, et al. Informational [Page 14] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +10. 
Informative References + + [ATM] The ATM Forum, "Asynchronous Transfer Mode Physical Layer + Specification" af-phy-0015.000, etc. available from + http://www.atmforum.com/standards/approved.html. + + [BCF+95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. + L. Seitz, J. N. Seizovic, and W. Su. "Myrinet - A + gigabit-per-second local-area network", IEEE Micro, + February 1995. + + [BJM+96] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. + Wilkes, "An implementation of the Hamlyn send-managed + interface architecture", in Proceedings of the Second + Symposium on Operating Systems Design and Implementation, + USENIX Assoc., October 1996. + + [BLA+94] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. + Felten, "A virtual memory mapped network interface for the + SHRIMP multicomputer", in Proceedings of the 21st Annual + Symposium on Computer Architecture, April 1994, pp. 142- + 153. + + [Br99] J. C. Brustoloni, "Interoperation of copy avoidance in + network and file I/O", Proceedings of IEEE Infocom, 1999, + pp. 534-542. + + [BS96] J. C. Brustoloni, P. Steenkiste, "Effects of buffering + semantics on I/O performance", Proceedings OSDI'96, + USENIX, Seattle, WA October 1996, pp. 277-291. + + [BT05] Bailey, S. and T. Talpey, "The Architecture of Direct Data + Placement (DDP) And Remote Direct Memory Access (RDMA) On + Internet Protocols", RFC 4296, December 2005. + + [CFF+94] C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. + Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, + "High-performance TCP/IP and UDP/IP networking in DEC + OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE + Symposium on High Performance Distributed Computing, + August 1994, pp. 36-42. + + [CGY01] J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system + optimizations for high-speed TCP", IEEE Communications + Magazine, Volume: 39, Issue: 4 , April 2001, pp 68-74. + http://www.cs.duke.edu/ari/publications/end- + system.{ps,pdf}. + + + + +Romanow, et al. Informational [Page 15] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + [Ch96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX + 1996 Annual Technical Conference, San Diego, CA, January + 1996. + + [Ch02] Jeffrey Chase, Personal communication. + + [CJRS89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An + analysis of TCP processing overhead", IEEE Communications + Magazine, volume: 27, Issue: 6, June 1989, pp 23-29. + + [CT90] D. D. Clark, D. Tennenhouse, "Architectural considerations + for a new generation of protocols", Proceedings of the ACM + SIGCOMM Conference, 1990. + + [DAPP93] P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, + "Network subsystem design", IEEE Network, July 1993, pp. + 8-17. + + [DP93] P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth + cross-domain transfer facility", Proceedings of the 14th + ACM Symposium of Operating Systems Principles, December + 1993. + + [DWB+93] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, + J. Lumley, "Afterburner: architectural support for high- + performance protocols", Technical Report, HP Laboratories + Bristol, HPL-93-46, July 1993. + + [EBBV95] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A + user-level network interface for parallel and distributed + computing", Proc. of the 15th ACM Symposium on Operating + Systems Principles, Copper Mountain, Colorado, December + 3-6, 1995. 
+ + [FDDI] International Standards Organization, "Fibre Distributed + Data Interface", ISO/IEC 9314, committee drafts available + from http://www.iso.org. + + [FGM+99] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., + Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext + Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. + + [FIBRE] ANSI Technical Committee T10, "Fibre Channel Protocol + (FCP)" (and as revised and updated), ANSI X3.269:1996 + [R2001], committee draft available from + http://www.t10.org/drafts.htm#FibreChannel + + + + + +Romanow, et al. Informational [Page 16] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + [HP97] J. L. Hennessy, D. A. Patterson, Computer Organization and + Design, 2nd Edition, San Francisco: Morgan Kaufmann + Publishers, 1997. + + [IB] InfiniBand Trade Association, "InfiniBand Architecture + Specification, Volumes 1 and 2", Release 1.1, November + 2002, available from http://www.infinibandta.org/specs. + + [IPSEC] Kent, S. and R. Atkinson, "Security Architecture for the + Internet Protocol", RFC 2401, November 1998. + + [KP96] J. Kay, J. Pasquale, "Profiling and reducing processing + overheads in TCP/IP", IEEE/ACM Transactions on Networking, + Vol 4, No. 6, pp.817-828, December 1996. + + [KSZ95] K. Kleinpaste, P. Steenkiste, B. Zill, "Software support + for outboard buffering and checksumming", SIGCOMM'95. + + [Ma02] K. Magoutis, "Design and Implementation of a Direct Access + File System (DAFS) Kernel Server for FreeBSD", in + Proceedings of USENIX BSDCon 2002 Conference, San + Francisco, CA, February 11-14, 2002. + + [MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. + S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. + Gabber, "Structure and Performance of the Direct Access + File System (DAFS)", in Proceedings of the 2002 USENIX + Annual Technical Conference, Monterey, CA, June 9-14, + 2002. + + [Mc95] J. D. McCalpin, "A Survey of memory bandwidth and machine + balance in current high performance computers", IEEE TCCA + Newsletter, December 1995. + + [PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. + Keeton, C. Kozyrakis, R. Thomas, K. Yelick , "A case for + intelligient RAM: IRAM", IEEE Micro, April 1997. + + [PDZ99] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified + I/O buffering and caching system", Proc. of the 3rd + Symposium on Operating Systems Design and Implementation, + New Orleans, LA, February 1999. + + [Pi01] J. Pinkerton, "Winsock Direct: The Value of System Area + Networks", May 2001, available from + http://www.microsoft.com/windows2000/techinfo/ + howitworks/communications/winsock.asp. + + + + +Romanow, et al. Informational [Page 17] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + + [Po81] Postel, J., "Transmission Control Protocol", STD 7, RFC + 793, September 1981. + + [QUAD] Quadrics Ltd., Quadrics QSNet product information, + available from + http://www.quadrics.com/website/pages/02qsn.html. + + [SDP] InfiniBand Trade Association, "Sockets Direct Protocol + v1.0", Annex A of InfiniBand Architecture Specification + Volume 1, Release 1.1, November 2002, available from + http://www.infinibandta.org/specs. + + [SRVNET] R. Horst, "TNet: A reliable system area network", IEEE + Micro, pp. 37-45, February 1995. + + [STREAM] J. D. McAlpin, The STREAM Benchmark Reference Information, + http://www.cs.virginia.edu/stream/. + + [TK95] M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O + framework for UNIX", Technical Report, SMLI TR-95-39, May + 1995. 
+ + [TLS] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0", + RFC 2246, January 1999. + + [VI] D. Cameron and G. Regnier, "The Virtual Interface + Architecture", ISBN 0971288704, Intel Press, April 2002, + more info at http://www.intel.com/intelpress/via/. + + [Wa97] J. R. Walsh, "DART: Fast application-level networking via + data-copy avoidance", IEEE Network, July/August 1997, pp. + 28-38. + + + + + + + + + + + + + + + + + + + +Romanow, et al. Informational [Page 18] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +Authors' Addresses + + Stephen Bailey + Sandburst Corporation + 600 Federal Street + Andover, MA 01810 USA + + Phone: +1 978 689 1614 + EMail: steph@sandburst.com + + + Jeffrey C. Mogul + HP Labs + Hewlett-Packard Company + 1501 Page Mill Road, MS 1117 + Palo Alto, CA 94304 USA + + Phone: +1 650 857 2206 (EMail preferred) + EMail: JeffMogul@acm.org + + + Allyn Romanow + Cisco Systems, Inc. + 170 W. Tasman Drive + San Jose, CA 95134 USA + + Phone: +1 408 525 8836 + EMail: allyn@cisco.com + + + Tom Talpey + Network Appliance + 1601 Trapelo Road + Waltham, MA 02451 USA + + Phone: +1 781 768 5329 + EMail: thomas.talpey@netapp.com + + + + + + + + + + + + + + +Romanow, et al. Informational [Page 19] + +RFC 4297 RDMA over IP Problem Statement December 2005 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2005). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET + ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, + INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE + INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the procedures with respect to rights in RFC documents can be + found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at ietf- + ipr@ietf.org. + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. + + + + + + + +Romanow, et al. Informational [Page 20] + |