Diffstat (limited to 'doc/rfc/rfc5667.txt')
-rw-r--r--  doc/rfc/rfc5667.txt  563
1 file changed, 563 insertions(+), 0 deletions(-)
diff --git a/doc/rfc/rfc5667.txt b/doc/rfc/rfc5667.txt
new file mode 100644
index 0000000..56ca060
--- /dev/null
+++ b/doc/rfc/rfc5667.txt
@@ -0,0 +1,563 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF)                         T. Talpey
+Request for Comments: 5667                                   Unaffiliated
+Category: Standards Track                                    B. Callaghan
+ISSN: 2070-1721                                                     Apple
+                                                             January 2010
+
+
+ Network File System (NFS) Direct Data Placement
+
+Abstract
+
+ This document defines the bindings of the various Network File System
+ (NFS) versions to the Remote Direct Memory Access (RDMA) operations
+ supported by the RPC/RDMA transport protocol. It describes the use
+ of direct data placement by means of server-initiated RDMA operations
+ into client-supplied buffers for implementations of NFS versions 2,
+ 3, 4, and 4.1 over such an RDMA transport.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5667.
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 1]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+ This document may contain material from IETF Documents or IETF
+ Contributions published or made publicly available before November
+ 10, 2008. The person(s) controlling the copyright in some of this
+ material may not have granted the IETF Trust the right to allow
+ modifications of such material outside the IETF Standards Process.
+ Without obtaining an adequate license from the person(s) controlling
+ the copyright in such materials, this document may not be modified
+ outside the IETF Standards Process, and derivative works of it may
+ not be created outside the IETF Standards Process, except to format
+ it for publication as an RFC or to translate it into languages other
+ than English.
+
+Table of Contents
+
+ 1. Introduction ....................................................2
+ 1.1. Requirements Language ......................................2
+ 2. Transfers from NFS Client to NFS Server .........................3
+ 3. Transfers from NFS Server to NFS Client .........................3
+ 4. NFS Versions 2 and 3 Mapping ....................................4
+ 5. NFS Version 4 Mapping ...........................................6
+ 5.1. NFS Version 4 Callbacks ....................................7
+ 6. Port Usage Considerations .......................................8
+ 7. Security Considerations .........................................9
+ 8. Acknowledgments .................................................9
+ 9. References ......................................................9
+ 9.1. Normative References .......................................9
+ 9.2. Informative References ....................................10
+
+1. Introduction
+
+ The Remote Direct Memory Access (RDMA) Transport for Remote Procedure
+ Call (RPC) [RFC5666] allows an RPC client application to post buffers
+ in a Chunk list for specific arguments and results from an RPC call.
+ The RDMA transport header conveys this list of client buffer
+ addresses to the server where the application can associate them with
+ client data and use RDMA operations to transfer the results directly
+ to and from the posted buffers on the client. The client and server
+   must agree on a consistent mapping of posted buffers to specific
+   RPC arguments and results.  This
+ document details the mapping for each version of the NFS protocol
+ [RFC1094] [RFC1813] [RFC3530] [RFC5661].
+
+1.1. Requirements Language
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in [RFC2119].
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 2]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+2. Transfers from NFS Client to NFS Server
+
+ The RDMA Read list, in the RDMA transport header, allows an RPC
+ client to marshal RPC call data selectively. Large chunks of data,
+ such as the file data of an NFS WRITE request, MAY be referenced by
+ an RDMA Read list and be moved efficiently and directly placed by an
+ RDMA Read operation initiated by the server.
+
+ The process of identifying these chunks for the RDMA Read list can be
+ implemented entirely within the RPC layer. It is transparent to the
+ upper-level protocol, such as NFS. For instance, the file data
+ portion of an NFS WRITE request can be selected as an RDMA "chunk"
+ within the eXternal Data Representation (XDR) marshaling code of RPC
+ based on a size criterion, independently of the NFS protocol layer.
+ The XDR unmarshaling on the receiving system can identify the
+ correspondence between Read chunks and protocol elements via the XDR
+ position value encoded in the Read chunk entry.
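+
+   As an illustration, the size-based selection might look like the
+   following C sketch.  This is not part of the specification: the
+   INLINE_THRESHOLD value and the register_memory() helper are
+   assumptions standing in for an implementation's own choices.
+
+      /* Hypothetical sketch: choose inline vs. RDMA Read chunk for
+       * an opaque argument during XDR marshaling. */
+      #include <stdbool.h>
+      #include <stddef.h>
+      #include <stdint.h>
+
+      #define INLINE_THRESHOLD 1024     /* example value only */
+
+      struct read_chunk {
+          uint32_t position;   /* XDR stream offset of the chunk */
+          uint32_t handle;     /* registered memory handle (STag) */
+          uint32_t length;     /* length of the chunk in bytes */
+          uint64_t offset;     /* memory address or offset */
+      };
+
+      /* Provided by the RDMA layer; name is illustrative. */
+      extern uint32_t register_memory(const void *addr, size_t len);
+
+      /* Returns true if the data was referenced as a Read chunk
+       * rather than marshaled inline. */
+      static bool
+      select_read_chunk(const void *data, size_t len,
+                        uint32_t xdr_pos, struct read_chunk *out)
+      {
+          if (len < INLINE_THRESHOLD)
+              return false;             /* small: marshal inline */
+          out->position = xdr_pos;      /* ties chunk to XDR element */
+          out->handle   = register_memory(data, len);
+          out->length   = (uint32_t)len;
+          out->offset   = (uint64_t)(uintptr_t)data;
+          return true;                  /* server will RDMA Read it */
+      }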
+
+ RPC RDMA Read chunks are employed by this NFS mapping to convey
+   specific NFS data to the server in a form that permits direct
+   placement.  The following sections describe this mapping for
+   versions of
+ the NFS protocol.
+
+3. Transfers from NFS Server to NFS Client
+
+ The RDMA Write list, in the RDMA transport header, allows the client
+ to post one or more buffers into which the server will RDMA Write
+ designated result chunks directly. If the client sends a null Write
+ list, then results from the RPC call will be returned either as an
+ inline reply, as chunks in an RDMA Read list of server-posted
+ buffers, or in a client-posted reply buffer.
+
+ Each posted buffer in a Write list is represented as an array of
+ memory segments. This allows the client some flexibility in
+ submitting discontiguous memory segments into which the server will
+ scatter the result. Each segment is described by a triplet
+ consisting of the segment handle or steering tag (STag), segment
+ length, and memory address or offset.
+
+ struct xdr_rdma_segment {
+ uint32 handle; /* Registered memory handle */
+ uint32 length; /* Length of the chunk in bytes */
+ uint64 offset; /* Chunk virtual address or offset */
+ };
+
+ struct xdr_write_chunk {
+ struct xdr_rdma_segment target<>;
+ };
+
+
+
+Talpey & Callaghan Standards Track [Page 3]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+ struct xdr_write_list {
+ struct xdr_write_chunk entry;
+ struct xdr_write_list *next;
+ };
+
+ The sum of the segment lengths yields the total size of the buffer,
+ which MUST be large enough to accept the result. If the buffer is
+ too small, the server MUST return an XDR encode error. The server
+ MUST return the result data for a posted buffer by progressively
+ filling its segments, perhaps leaving some trailing segments unfilled
+ or partially full if the size of the result is less than the total
+ size of the buffer segments.
+
+ The server returns the RDMA Write list to the client with the segment
+ length fields overwritten to indicate the amount of data RDMA written
+ to each segment. Results returned by direct placement MUST NOT be
+ returned by other methods, e.g., by Read chunk list or inline. If no
+ result data at all is returned for the element, the server places no
+ data in the buffer(s), but does return zeros in the segment length
+ fields corresponding to the result.
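+
+   The capacity and fill rules above can be sketched in C.  This is
+   illustrative only: the C rendering of the XDR segment and the
+   rdma_write() helper are assumptions, not part of this
+   specification.
+
+      #include <stdint.h>
+
+      struct rdma_segment {        /* C form of xdr_rdma_segment */
+          uint32_t handle;
+          uint32_t length;
+          uint64_t offset;
+      };
+
+      /* Issues one RDMA Write; name and signature are illustrative. */
+      extern void rdma_write(uint32_t handle, uint64_t offset,
+                             const uint8_t *src, uint32_t len);
+
+      /* Total capacity of a posted buffer: sum of segment lengths.
+       * The result MUST fit within this many bytes. */
+      static uint64_t
+      chunk_capacity(const struct rdma_segment *seg, int nsegs)
+      {
+          uint64_t total = 0;
+          for (int i = 0; i < nsegs; i++)
+              total += seg[i].length;
+          return total;
+      }
+
+      /* Progressively fill segments, then overwrite each length
+       * field with the number of bytes actually written; trailing
+       * segments are left empty (length zero). */
+      static void
+      scatter_result(struct rdma_segment *seg, int nsegs,
+                     const uint8_t *result, uint64_t resid)
+      {
+          for (int i = 0; i < nsegs; i++) {
+              uint32_t n = seg[i].length < resid
+                               ? seg[i].length : (uint32_t)resid;
+              if (n > 0)
+                  rdma_write(seg[i].handle, seg[i].offset, result, n);
+              seg[i].length = n;
+              result += n;
+              resid  -= n;
+          }
+      }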
+
+ The RDMA Write list allows the client to provide multiple result
+ buffers -- each buffer maps to a specific result in the reply. The
+   NFS client and server implementations must agree on the mapping
+   of results to buffers for each RPC procedure.  The following sections
+ describe this mapping for versions of the NFS protocol.
+
+ Through the use of RDMA Write lists in NFS requests, it is not
+ necessary to employ the RDMA Read lists in the NFS replies, as
+ described in the RPC/RDMA protocol. This enables more efficient
+ operation, by avoiding the need for the server to expose buffers for
+ RDMA, and also avoiding "RDMA_DONE" exchanges. Clients MAY
+ additionally employ RDMA Reply chunks to receive entire messages, as
+ described in [RFC5666].
+
+4. NFS Versions 2 and 3 Mapping
+
+ A single RDMA Write list entry MAY be posted by the client to receive
+ either the opaque file data from a READ request or the pathname from
+ a READLINK request. The server MUST ignore a Write list for any
+ other NFS procedure, as well as any Write list entries beyond the
+ first in the list.
+
+ Similarly, a single RDMA Read list entry MAY be posted by the client
+ to supply the opaque file data for a WRITE request or the pathname
+ for a SYMLINK request. The server MUST ignore any Read list for
+ other NFS procedures, as well as additional Read list entries beyond
+ the first in the list.
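+
+   The resulting per-procedure eligibility test is small enough to
+   show directly.  The C sketch below uses the NFS version 3
+   procedure numbers from [RFC1813]; only the first entry of the
+   corresponding list is honored, as stated above.
+
+      #include <stdbool.h>
+
+      #define NFSPROC3_READLINK   5
+      #define NFSPROC3_READ       6
+      #define NFSPROC3_WRITE      7
+      #define NFSPROC3_SYMLINK   10
+
+      /* Server -> client results eligible for the Write list. */
+      static bool uses_write_list(unsigned int proc)
+      {
+          return proc == NFSPROC3_READ || proc == NFSPROC3_READLINK;
+      }
+
+      /* Client -> server arguments eligible for the Read list. */
+      static bool uses_read_list(unsigned int proc)
+      {
+          return proc == NFSPROC3_WRITE || proc == NFSPROC3_SYMLINK;
+      }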
+
+
+
+Talpey & Callaghan Standards Track [Page 4]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+ Because there are no NFS version 2 or 3 requests that transfer bulk
+ data in both directions, it is not necessary to post requests
+ containing both Write and Read lists. Any unneeded Read or Write
+ lists are ignored by the server.
+
+ In the case where the outgoing request or expected incoming reply is
+ larger than the maximum size supported on the connection, it is
+ possible for the RPC layer to post the entire message or result in a
+ special "RDMA_NOMSG" message type that is transferred entirely by
+ RDMA. This is implemented in RPC, below NFS, and therefore has no
+ effect on the message contents.
+
+   Non-RDMA (inline) WRITE transfers MAY employ the
+ "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
+ appropriate value for the server is known to the client. Padding
+ allows the opaque file data to arrive at the server in an aligned
+ fashion, which may improve server performance.
+
+ The NFS version 2 and 3 protocols are frequently limited in practice
+ to requests containing less than or equal to 8 kilobytes and 32
+ kilobytes of data, respectively. In these cases, it is often
+ practical to support basic operation without employing a
+ configuration exchange as discussed in [RFC5666]. The server MUST
+ post buffers large enough to receive the largest possible incoming
+   message (approximately 12 KB for NFS version 2, or 36 KB for NFS
+   version 3, is more than sufficient), and the client can post
+ buffers large enough to receive replies based on the "rsize" it is
+ using to the server, plus a fixed overhead for the RPC and NFS
+ headers. Because the server MUST NOT return data in excess of this
+ size, the client can be assured of the adequacy of its posted buffer
+ sizes.
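+
+   The client-side sizing rule amounts to simple arithmetic.  The
+   header allowance below is an assumption for illustration; a real
+   client would derive it from its RPC and NFS encoding limits.
+
+      /* Reply buffer: rsize bytes of data plus header overhead. */
+      #define RPC_NFS_HDR_ALLOWANCE 1024u   /* assumed value */
+
+      static unsigned int reply_buffer_size(unsigned int rsize)
+      {
+          return rsize + RPC_NFS_HDR_ALLOWANCE;
+      }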
+
+ Flow control is handled dynamically by the RPC RDMA protocol, and
+ write padding is OPTIONAL and therefore MAY remain unused.
+
+   Alternatively, if the server is administratively configured with
+   values appropriate for all its clients, the same assurance of
+   interoperability within the domain is obtained.
+
+   The use of a configuration protocol with NFS versions 2 and 3 is
+   therefore OPTIONAL.  Employing a configuration exchange may still
+   benefit server resource management: it permits accurate sizing of
+   buffers, lets the server know exactly how many RDMA Reads may be
+   in progress at once on the client connection, and enables client
+   write padding, which may be desirable for certain servers when RDMA
+ Read is impractical.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 5]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+5. NFS Version 4 Mapping
+
+ This specification applies to the first minor version of NFS version
+ 4 (NFSv4.0) and any subsequent minor versions that do not override
+ this mapping.
+
+ The Write list MUST be considered only for the COMPOUND procedure.
+ This procedure returns results from a sequence of operations. Only
+ the opaque file data from an NFS READ operation and the pathname from
+ a READLINK operation MUST utilize entries from the Write list.
+
+ If there is no Write list, i.e., the list is null, then any READ or
+ READLINK operations in the COMPOUND MUST return their data inline.
+ The NFSv4.0 client MUST ensure in this case that any result of its
+ READ and READLINK requests will fit within its receive buffers, in
+ order to avoid a resulting RDMA transport error upon transfer. The
+ server is not required to detect this.
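+
+   A client can enforce this with a simple check when constructing
+   the COMPOUND, as in the following C sketch; the header allowance
+   parameter is an assumption for illustration.
+
+      #include <stdbool.h>
+
+      /* True if a READ of 'count' bytes may safely return inline
+       * within a receive buffer of 'recv_size' bytes, reserving
+       * 'hdr_allowance' bytes for the RPC/NFS reply headers. */
+      static bool
+      read_fits_inline(unsigned int count, unsigned int recv_size,
+                       unsigned int hdr_allowance)
+      {
+          if (hdr_allowance > recv_size)
+              return false;
+          return count <= recv_size - hdr_allowance;
+      }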
+
+ The first entry in the Write list MUST be used by the first READ or
+ READLINK in the COMPOUND request. The next Write list entry is used
+ by the next READ or READLINK, and so on. If there are more READ or
+ READLINK operations than Write list entries, then any remaining
+ operations MUST return their results inline.
+
+ If a Write list entry is presented, then the corresponding READ or
+ READLINK MUST return its data via an RDMA Write to the buffer
+ indicated by the Write list entry. If the Write list entry has zero
+ RDMA segments, or if the total size of the segments is zero, then the
+ corresponding READ or READLINK operation MUST return its result
+ inline.
+
+ The following example shows an RDMA Write list with three posted
+ buffers A, B, and C. The designated operations in the compound
+ request, READ and READLINK, consume the posted buffers by writing
+ their results back to each buffer.
+
+ RDMA Write list:
+
+ A --> B --> C
+
+
+ Compound request:
+
+
+ PUTFH LOOKUP READ PUTFH LOOKUP READLINK PUTFH LOOKUP READ
+ | | |
+ v v v
+ A B C
+
+
+
+Talpey & Callaghan Standards Track [Page 6]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+   If the client does not want the READLINK result returned
+   directly, then it provides either a zero-length array of segment
+   triplets for buffer B or a segment triplet for buffer B with zero
+   values; the READLINK result MUST then be returned inline.
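+
+   The Write list consumption rule can be sketched in C as follows.
+   The compound_op structure, the list representation, and the
+   chunk_capacity_of() helper are hypothetical; the operation codes
+   are those assigned by [RFC3530].
+
+      #include <stddef.h>
+      #include <stdint.h>
+
+      #define OP_READ      25    /* [RFC3530] */
+      #define OP_READLINK  27    /* [RFC3530] */
+
+      struct write_chunk;        /* decoded xdr_write_chunk */
+
+      /* Sum of segment lengths; zero segments yield zero. */
+      extern uint64_t chunk_capacity_of(const struct write_chunk *wc);
+
+      struct compound_op {
+          int opcode;
+          const struct write_chunk *wc;  /* NULL => reply inline */
+      };
+
+      /* Assign Write list entries to READ/READLINK operations in
+       * order; remaining operations return results inline. */
+      static void
+      assign_write_chunks(struct compound_op *ops, size_t nops,
+                          const struct write_chunk **wl, size_t nwl)
+      {
+          size_t next = 0;
+          for (size_t i = 0; i < nops; i++) {
+              if (ops[i].opcode != OP_READ &&
+                  ops[i].opcode != OP_READLINK)
+                  continue;
+              if (next >= nwl) {
+                  ops[i].wc = NULL;      /* list exhausted: inline */
+                  continue;
+              }
+              /* Entry is consumed even if it forces inline return. */
+              const struct write_chunk *c = wl[next++];
+              ops[i].wc = chunk_capacity_of(c) > 0 ? c : NULL;
+          }
+      }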
+
+ The situation is similar for RDMA Read lists sent by the client and
+   applies to the NFSv4.0 WRITE and SYMLINK operations as for v3.
+ Additionally, inline segments too large to fit in posted buffers MAY
+ be transferred in special "RDMA_NOMSG" messages.
+
+   Non-RDMA (inline) WRITE transfers MAY employ the
+ "RDMA_MSGP" padding method described in the RPC/RDMA protocol, if the
+ appropriate value for the server is known to the client. Padding
+ allows the opaque file data to arrive at the server in an aligned
+ fashion, which may improve server performance. In order to ensure
+ accurate alignment for all data, it is likely that the client will
+ restrict its use of OPTIONAL padding to COMPOUND requests containing
+ only a single WRITE operation.
+
+ Unlike NFS versions 2 and 3, the maximum size of an NFS version 4
+ COMPOUND is not bounded, even when RDMA chunks are in use. While it
+ might appear that a configuration protocol exchange (such as the one
+ described in [RFC5666]) would help, in fact the layering issues
+ involved in building COMPOUNDs by NFS make such a mechanism
+ unworkable.
+
+ However, typical NFS version 4 clients rarely issue such problematic
+   requests.  In practice, they behave in much more predictable ways;
+   in fact, most still support the traditional rsize/wsize mount
+   parameters.
+ Therefore, most NFS version 4 clients function over RPC/RDMA in the
+ same way as NFS versions 2 and 3, operationally.
+
+   There are, however, advantages to allowing both client and server
+   to operate with prearranged size constraints, for example, using the
+   sizes to better manage the server's response cache.  An extension to
+ NFS version 4 supporting a more comprehensive exchange of upper-layer
+ parameters is part of [RFC5661].
+
+5.1. NFS Version 4 Callbacks
+
+ The NFS version 4 protocols support server-initiated callbacks to
+   selected clients, in order to notify them of events such as
+   recalled delegations.  These callbacks pose no particular framing
+   issue over RPC/RDMA, since such callbacks do not carry bulk
+ data such as NFS READ or NFS WRITE. They MAY be transmitted inline
+   via RDMA_MSG, or, if the callback message or its reply overflows the
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 7]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+ negotiated buffer sizes for a callback connection, they MAY be
+ transferred via the RDMA_NOMSG method as described above for other
+ exchanges.
+
+ One special case is noteworthy: in NFS version 4.1, the callback
+ channel is optionally negotiated to be on the same connection as one
+ used for client requests. In this case, and because the transaction
+ ID (XID) is present in the RPC/RDMA header, the client MUST ascertain
+ whether the message is in fact an RPC REPLY, and therefore a reply to
+ a prior request and carrying its XID, before processing it as such.
+ By the same token, the server MUST ascertain whether an incoming
+ message on such a callback-eligible connection is an RPC CALL, before
+ optionally processing the XID.
+
+   In the callback case, the XID present in the RPC/RDMA header may
+   take any value and may collide with an XID used by the client for
+   a previous or future request.  The client
+ and server MUST inspect the RPC component of the message to determine
+ its potential disposition as either an RPC CALL or RPC REPLY, prior
+ to processing this XID, and MUST NOT reject or accept it without also
+ determining the proper context.
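+
+   A minimal dispatch sketch in C follows.  The msg_type values are
+   those defined by [RFC5531] (CALL = 0, REPLY = 1); the lookup and
+   handler functions are assumptions for illustration.
+
+      #include <stdbool.h>
+      #include <stdint.h>
+
+      enum rpc_msg_type { RPC_CALL = 0, RPC_REPLY = 1 };
+
+      /* Illustrative helpers, assumed to exist in the RPC layer. */
+      extern bool xid_is_outstanding(uint32_t xid);
+      extern void complete_request(uint32_t xid);
+      extern void handle_callback(uint32_t xid);
+      extern void drop_message(uint32_t xid);
+
+      static void
+      dispatch_message(uint32_t xid, enum rpc_msg_type mtype)
+      {
+          if (mtype == RPC_REPLY && xid_is_outstanding(xid))
+              complete_request(xid);   /* reply to a prior call */
+          else if (mtype == RPC_CALL)
+              handle_callback(xid);    /* server-initiated callback */
+          else
+              drop_message(xid);       /* unmatched reply: discard */
+      }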
+
+6. Port Usage Considerations
+
+ NFS use of direct data placement introduces a need for an additional
+ NFS port number assignment for networks that share traditional UDP
+ and TCP port spaces with RDMA services. The iWARP [RFC5041]
+ [RFC5040] protocol is such an example (InfiniBand is not).
+
+ NFS servers for versions 2 and 3 [RFC1094] [RFC1813] traditionally
+ listen for clients on UDP and TCP port 2049, and additionally, they
+ register these with the portmapper and/or rpcbind [RFC1833] service.
+ However, [RFC3530] requires NFS servers for version 4 to listen on
+ TCP port 2049, and they are not required to register.
+
+ An NFS version 2 or version 3 server supporting RPC/RDMA on such a
+ network and registering itself with the RPC portmapper MAY choose an
+ arbitrary port, or MAY use the alternative well-known port number for
+ its RPC/RDMA service. The chosen port MAY be registered with the RPC
+ portmapper under the netid assigned by the requirement in [RFC5666].
+
+ An NFS version 4 server supporting RPC/RDMA on such a network MUST
+ use the alternative well-known port number for its RPC/RDMA service.
+ Clients SHOULD connect to this well-known port without consulting the
+ RPC portmapper (as for NFSv4/TCP).
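+
+   A client port-selection sketch in C follows.  The numeric value
+   shown is the "nfsrdma" assignment in the IANA registry at the
+   time of writing; implementations should consult the registry
+   [RFC3232] rather than rely on this constant.
+
+      /* NFSv4 over RPC/RDMA: connect directly to the well-known
+       * port; no portmapper query is needed (see above). */
+      #define NFS_RDMA_PORT 20049      /* IANA "nfsrdma" */
+
+      static unsigned short nfs4_rdma_port(void)
+      {
+          return NFS_RDMA_PORT;
+      }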
+
+ The port number assigned to an NFS service over an RPC/RDMA transport
+ is available from the IANA port registry [RFC3232].
+
+
+
+Talpey & Callaghan Standards Track [Page 8]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+7. Security Considerations
+
+ The RDMA transport for RPC [RFC5666] supports all RPC [RFC5531]
+ security models, including RPCSEC_GSS [RFC2203] security and link-
+   level security.  The use of RDMA Read and RDMA Write to transfer RPC
+   arguments and results, respectively, does not affect this, since it
+   only changes the method of data transfer.  Specifically, the
+ requirements of [RFC5666] ensure that this choice does not introduce
+ new vulnerabilities.
+
+ Because this document defines only the binding of the NFS protocols
+   atop [RFC5666], all relevant security considerations are therefore
+   addressed at that layer.
+
+8. Acknowledgments
+
+ The authors would like to thank Dave Noveck and Chet Juszczak for
+ their contributions to this document.
+
+9. References
+
+9.1. Normative References
+
+ [RFC1094] Sun Microsystems, "NFS: Network File System Protocol
+ specification", RFC 1094, March 1989.
+
+ [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
+ Version 3 Protocol Specification", RFC 1813, June 1995.
+
+ [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
+ RFC 1833, August 1995.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
+ Specification", RFC 2203, September 1997.
+
+ [RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
+ Beame, C., Eisler, M., and D. Noveck, "Network File System
+ (NFS) version 4 Protocol", RFC 3530, April 2003.
+
+ [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
+ Specification Version 2", RFC 5531, May 2009.
+
+ [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+ "Network File System (NFS) Version 4 Minor Version 1
+ Protocol", RFC 5661, January 2010.
+
+
+
+Talpey & Callaghan Standards Track [Page 9]
+
+RFC 5667 NFS Direct Data Placement January 2010
+
+
+9.2. Informative References
+
+ [RFC3232] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced
+ by an On-line Database", RFC 3232, January 2002.
+
+ [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
+ Garcia, "A Remote Direct Memory Access Protocol
+ Specification", RFC 5040, October 2007.
+
+ [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
+ Data Placement over Reliable Transports", RFC 5041,
+ October 2007.
+
+ [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
+ Transport for Remote Procedure Call", RFC 5666, January
+ 2010.
+
+Authors' Addresses
+
+ Tom Talpey
+ 170 Whitman St.
+ Stow, MA 01775 USA
+
+ EMail: tmtalpey@gmail.com
+
+ Brent Callaghan
+ Apple Computer, Inc.
+ MS: 302-4K
+ 2 Infinite Loop
+ Cupertino, CA 95014 USA
+
+ EMail: brentc@apple.com
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 10]
+