Diffstat (limited to 'doc/rfc/rfc5666.txt')
-rw-r--r--  doc/rfc/rfc5666.txt  1907
1 files changed, 1907 insertions, 0 deletions
diff --git a/doc/rfc/rfc5666.txt b/doc/rfc/rfc5666.txt
new file mode 100644
index 0000000..f696417
--- /dev/null
+++ b/doc/rfc/rfc5666.txt
@@ -0,0 +1,1907 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) T. Talpey
+Request for Comments: 5666 Unaffiliated
+Category: Standards Track B. Callaghan
+ISSN: 2070-1721 Apple
+ January 2010
+
+
+ Remote Direct Memory Access Transport for Remote Procedure Call
+
+Abstract
+
+ This document describes a protocol providing Remote Direct Memory
+ Access (RDMA) as a new transport for Remote Procedure Call (RPC).
+ The RDMA transport binding conveys the benefits of efficient, bulk-
+ data transport over high-speed networks, while providing for minimal
+ change to RPC applications and with no required revision of the
+ application RPC protocol, or the RPC protocol itself.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5666.
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 1]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ This document may contain material from IETF Documents or IETF
+ Contributions published or made publicly available before November
+ 10, 2008. The person(s) controlling the copyright in some of this
+ material may not have granted the IETF Trust the right to allow
+ modifications of such material outside the IETF Standards Process.
+ Without obtaining an adequate license from the person(s) controlling
+ the copyright in such materials, this document may not be modified
+ outside the IETF Standards Process, and derivative works of it may
+ not be created outside the IETF Standards Process, except to format
+ it for publication as an RFC or to translate it into languages other
+ than English.
+
+Table of Contents
+
+ 1. Introduction ....................................................3
+ 1.1. Requirements Language ......................................4
+ 2. Abstract RDMA Requirements ......................................4
+ 3. Protocol Outline ................................................5
+ 3.1. Short Messages .............................................6
+ 3.2. Data Chunks ................................................6
+ 3.3. Flow Control ...............................................7
+ 3.4. XDR Encoding with Chunks ...................................8
+ 3.5. XDR Decoding with Read Chunks .............................11
+ 3.6. XDR Decoding with Write Chunks ............................12
+ 3.7. XDR Roundup and Chunks ....................................13
+ 3.8. RPC Call and Reply ........................................14
+ 3.9. Padding ...................................................17
+ 4. RPC RDMA Message Layout ........................................18
+ 4.1. RPC-over-RDMA Header ......................................18
+ 4.2. RPC-over-RDMA Header Errors ...............................20
+ 4.3. XDR Language Description ..................................20
+ 5. Long Messages ..................................................22
+ 5.1. Message as an RDMA Read Chunk .............................23
+ 5.2. RDMA Write of Long Replies (Reply Chunks) .................24
+ 6. Connection Configuration Protocol ..............................25
+ 6.1. Initial Connection State ..................................26
+ 6.2. Protocol Description ......................................26
+ 7. Memory Registration Overhead ...................................28
+ 8. Errors and Error Recovery ......................................28
+ 9. Node Addressing ................................................28
+ 10. RPC Binding ...................................................29
+ 11. Security Considerations .......................................30
+ 12. IANA Considerations ...........................................31
+ 13. Acknowledgments ...............................................32
+ 14. References ....................................................33
+ 14.1. Normative References .....................................33
+ 14.2. Informative References ...................................33
+
+
+
+
+Talpey & Callaghan Standards Track [Page 2]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+1. Introduction
+
+ Remote Direct Memory Access (RDMA) [RFC5040, RFC5041], [IB] is a
+ technique for efficient movement of data between end nodes, which
+ becomes increasingly compelling over high-speed transports. By
+ directing data into destination buffers as it is sent on a network,
+ and placing it via direct memory access by hardware, the double
+ benefit of faster transfers and reduced host overhead is obtained.
+
+ Open Network Computing Remote Procedure Call (ONC RPC, or simply,
+ RPC) [RFC5531] is a remote procedure call protocol that has been run
+ over a variety of transports. Most RPC implementations today use UDP
+ or TCP. RPC messages are defined in terms of an eXternal Data
+ Representation (XDR) [RFC4506], which provides a canonical data
+ representation across a variety of host architectures. An XDR data
+ stream is conveyed differently on each type of transport. On UDP,
+ RPC messages are encapsulated inside datagrams, while on a TCP byte
+ stream, RPC messages are delineated by a record marking protocol. An
+ RDMA transport also conveys RPC messages in a unique fashion that
+ must be fully described if client and server implementations are to
+ interoperate.
+
+ RDMA transports present new semantics unlike the behaviors of either
+ UDP or TCP alone. They retain message delineations like UDP while
+ also providing a reliable, sequenced data transfer like TCP. Also,
+ they provide the new efficient, bulk-transfer service of RDMA. RDMA
+ transports are therefore naturally viewed as a new transport type by
+ RPC.
+
+ RDMA as a transport will benefit the performance of RPC protocols
+ that move large "chunks" of data, since RDMA hardware excels at
+ moving data efficiently between host memory and a high-speed network
+ with little or no host CPU involvement. In this context, the Network
+ File System (NFS) protocol, in all its versions [RFC1094] [RFC1813]
+ [RFC3530] [RFC5661], is an obvious beneficiary of RDMA. A complete
+ problem statement is discussed in [RFC5532], and related NFSv4 issues
+ are discussed in [RFC5661]. Many other RPC-based protocols will also
+ benefit.
+
+ Although the RDMA transport described here provides relatively
+ transparent support for any RPC application, the proposal goes
+ further in describing mechanisms that can optimize the use of RDMA
+ with more active participation by the RPC application.
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 3]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+1.1. Requirements Language
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in [RFC2119].
+
+2. Abstract RDMA Requirements
+
+ An RPC transport is responsible for conveying an RPC message from a
+ sender to a receiver. An RPC message is either an RPC call from a
+ client to a server, or an RPC reply from the server back to the
+ client. An RPC message contains an RPC call header followed by
+ arguments if the message is an RPC call, or an RPC reply header
+ followed by results if the message is an RPC reply. The call header
+ contains a transaction ID (XID) followed by the program and procedure
+ number as well as a security credential. An RPC reply header begins
+ with an XID that matches that of the RPC call message, followed by a
+ security verifier and results. All data in an RPC message is XDR
+ encoded. For a complete description of the RPC protocol and XDR
+ encoding, see [RFC5531] and [RFC4506].
+
+ This protocol assumes the following abstract model for RDMA
+ transports. These terms, common in the RDMA lexicon, are used in
+ this document. A more complete glossary of RDMA terms can be found
+ in [RFC5040].
+
+ o Registered Memory
+ All data moved via tagged RDMA operations is resident in
+ registered memory at its destination. This protocol assumes
+ that each segment of registered memory MUST be identified with a
+ steering tag of no more than 32 bits and memory addresses of up
+ to 64 bits in length.
+
+ o RDMA Send
+ The RDMA provider supports an RDMA Send operation with
+ completion signaled at the receiver when data is placed in a
+ pre-posted buffer. The amount of transferred data is limited
+ only by the size of the receiver's buffer. Sends complete at
+ the receiver in the order they were issued at the sender.
+
+ o RDMA Write
+ The RDMA provider supports an RDMA Write operation to directly
+ place data in the receiver's buffer. An RDMA Write is initiated
+ by the sender and completion is signaled at the sender. No
+ completion is signaled at the receiver. The sender uses a
+ steering tag, memory address, and length of the remote
+ destination buffer. RDMA Writes are not necessarily ordered
+ with respect to one another, but are ordered with respect to
+
+
+
+Talpey & Callaghan Standards Track [Page 4]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ RDMA Sends; a subsequent RDMA Send completion obtained at the
+ receiver guarantees that prior RDMA Write data has been
+ successfully placed in the receiver's memory.
+
+ o RDMA Read
+ The RDMA provider supports an RDMA Read operation to directly
+ place peer source data in the requester's buffer. An RDMA Read
+ is initiated by the receiver and completion is signaled at the
+ receiver. The receiver provides steering tags, memory
+ addresses, and a length for the remote source and local
+ destination buffers. Since the peer at the data source receives
+ no notification of RDMA Read completion, there is an assumption
+ that on receiving the data, the receiver will signal completion
+ with an RDMA Send message, so that the peer can free the source
+ buffers and the associated steering tags.
+
+ This protocol is designed to be carried over all RDMA transports
+ meeting the stated requirements. This protocol conveys to the RPC
+ peer information sufficient for that RPC peer to direct an RDMA layer
+ to perform transfers containing RPC data and to communicate their
+ result(s). For example, it is readily carried over RDMA transports
+ such as Internet Wide Area RDMA Protocol (iWARP) [RFC5040, RFC5041],
+ or InfiniBand [IB].
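+
+ The C fragment below is an illustrative sketch only; it is not part
+ of any standard RDMA API, and all type and function names are
+ hypothetical. It restates the addressing triple (steering tag,
+ length, offset) and the three abstract operations assumed above.
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* A registered memory segment: a steering tag of no more than
+       32 bits, a length, and an address or offset of up to 64 bits. */
+    struct rdma_segment {
+        uint32_t handle;      /* steering tag */
+        uint32_t length;      /* length of the segment in bytes */
+        uint64_t offset;      /* memory address or offset */
+    };
+
+    /* Hypothetical provider interface assumed by this protocol. */
+    struct rdma_ops {
+        /* Send: completes at the receiver into a pre-posted buffer. */
+        int (*send)(void *conn, const void *buf, size_t len);
+
+        /* Write: places data directly at the remote segment; no
+           completion is signaled at the receiver. */
+        int (*write)(void *conn, const struct rdma_segment *dst,
+                     const void *src, size_t len);
+
+        /* Read: pulls data from the remote segment into local
+           registered memory; completion is signaled only at the
+           initiator. */
+        int (*read)(void *conn, void *dst,
+                    const struct rdma_segment *src, size_t len);
+    };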
+
+3. Protocol Outline
+
+ An RPC message can be conveyed in identical fashion, whether it is a
+ call or reply message. In each case, the transmission of the message
+ proper is preceded by transmission of a transport-specific header for
+ use by RPC-over-RDMA transports. This header is analogous to the
+ record marking used for RPC over TCP, but is more extensive, since
+ RDMA transports support several modes of data transfer; it is
+ important to allow the upper-layer protocol to specify the most
+ efficient mode for each of the segments in a message. Multiple
+ segments of a message may thereby be transferred in different ways to
+ different remote memory destinations.
+
+ All transfers of a call or reply begin with an RDMA Send that
+ transfers at least the RPC-over-RDMA header, usually with the call or
+ reply message appended, or at least some part thereof. Because the
+ size of what may be transmitted via RDMA Send is limited by the size
+ of the receiver's pre-posted buffer, the RPC-over-RDMA transport
+ provides a number of methods to reduce the amount transferred by
+ means of the RDMA Send, when necessary, by transferring various parts
+ of the message using RDMA Read and RDMA Write.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 5]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ RPC-over-RDMA framing replaces all other RPC framing (such as TCP
+ record marking) when used atop an RPC/RDMA association, even though
+ the underlying RDMA protocol may itself be layered atop a protocol
+ with a defined RPC framing (such as TCP). It is however possible for
+ RPC/RDMA to be dynamically enabled, in the course of negotiating the
+ use of RDMA via an upper-layer exchange. Because RPC framing
+ delimits an entire RPC request or reply, the resulting shift in
+ framing must occur between distinct RPC messages, and in concert with
+ the transport.
+
+3.1. Short Messages
+
+ Many RPC messages are quite short. For example, the NFS version 3
+ GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a
+ 32-byte file handle argument and 4 bytes of length. The reply to
+ this common request is about 100 bytes.
+
+ There is no benefit in transferring such small messages with an RDMA
+ Read or Write operation. The overhead in transferring steering tags
+ and memory addresses is justified only by large transfers. The
+ critical message size that justifies RDMA transfer will vary
+ depending on the RDMA implementation and network, but is typically of
+ the order of a few kilobytes. It is appropriate to transfer a short
+ message with an RDMA Send to a pre-posted buffer. The RPC-over-RDMA
+ header with the short message (call or reply) immediately following
+ is transferred using a single RDMA Send operation.
+
+ Short RPC messages over an RDMA transport:
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+3.2. Data Chunks
+
+ Some protocols, like NFS, have RPC procedures that can transfer very
+ large chunks of data in the RPC call or reply and would cause the
+ maximum send size to be exceeded if one tried to transfer them as
+ part of the RDMA Send. These large chunks typically range from a
+ kilobyte to a megabyte or more. An RDMA transport can transfer large
+ chunks of data more efficiently via the direct placement of an RDMA
+ Read or RDMA Write operation. Using direct placement instead of
+ inline transfer not only avoids expensive data copies, but provides
+ correct data alignment at the destination.
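+
+ As a rough sketch of the decision this section describes (the
+ threshold name and value below are illustrative assumptions, not
+ protocol constants), an implementation might choose between inline
+ transfer and a chunk as follows:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    /* Hypothetical inline threshold: typically a few kilobytes,
+       bounded by the size of the peer's posted receive buffers. */
+    #define RPCRDMA_INLINE_THRESHOLD 1024
+
+    /* Return true if an XDR data item of this size is better moved
+       as a read or write chunk than carried inline in the Send. */
+    static bool worth_chunking(size_t xdr_item_size)
+    {
+        return xdr_item_size >= RPCRDMA_INLINE_THRESHOLD;
+    }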
+
+
+
+
+Talpey & Callaghan Standards Track [Page 6]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+3.3. Flow Control
+
+ It is critical to provide RDMA Send flow control for an RDMA
+ connection. RDMA receive operations will fail if a pre-posted
+ receive buffer is not available to accept an incoming RDMA Send, and
+ repeated occurrences of such errors can be fatal to the connection.
+ This is a departure from conventional TCP/IP networking where buffers
+ are allocated dynamically on an as-needed basis, and where
+ pre-posting is not required.
+
+ It is not practical to provide for fixed credit limits at the RPC
+ server. Fixed limits scale poorly, since posted buffers are
+ dedicated to the associated connection until consumed by receive
+ operations. Additionally, for protocol correctness, the RPC server
+ must always be able to reply to client requests, whether or not new
+ buffers have been posted to accept future receives. (Note that the
+ RPC server may in fact be a client at some other layer. For example,
+ NFSv4 callbacks are processed by the NFSv4 client, acting as an RPC
+ server. The credit discussions apply equally in either case.)
+
+ Flow control for RDMA Send operations is implemented as a simple
+ request/grant protocol in the RPC-over-RDMA header associated with
+ each RPC message. The RPC-over-RDMA header for RPC call messages
+ contains a requested credit value for the RPC server, which MAY be
+ dynamically adjusted by the caller to match its expected needs. The
+ RPC-over-RDMA header for the RPC reply messages provides the granted
+ result, which MAY have any value except it MUST NOT be zero when no
+ in-progress operations are present at the server, since such a value
+ would result in deadlock. The value MAY be adjusted up or down at
+ each opportunity to match the server's needs or policies.
+
+ The RPC client MUST NOT send unacknowledged requests in excess of
+ this granted RPC server credit limit. If the limit is exceeded, the
+ RDMA layer may signal an error, possibly terminating the connection.
+ Even if an error does not occur, it is OPTIONAL that the server
+ handle the excess request(s), and it MAY return an RPC error to the
+ client. Also note that the never-zero requirement implies that an
+ RPC server MUST always provide at least one credit to each connected
+ RPC client from which no requests are outstanding. The client would
+ deadlock otherwise, unable to send another request.
+
+ While RPC calls complete in any order, the current flow control limit
+ at the RPC server is known to the RPC client from the Send ordering
+ properties. It is always the most recent server-granted credit value
+ minus the number of requests in flight.
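+
+ The client-side bookkeeping implied by the preceding paragraphs can
+ be sketched as follows; the structure and field names are assumptions
+ of this example, not part of the protocol:
+
+    #include <stdint.h>
+
+    struct rpcrdma_credits {
+        uint32_t granted;     /* most recent server-granted credits */
+        uint32_t in_flight;   /* requests sent but not yet replied to */
+    };
+
+    /* Number of further requests the client may send right now
+       without exceeding the server's granted credit limit. */
+    static uint32_t credits_available(const struct rpcrdma_credits *c)
+    {
+        return (c->in_flight >= c->granted)
+                   ? 0 : c->granted - c->in_flight;
+    }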
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 7]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Certain RDMA implementations may impose additional flow control
+ restrictions, such as limits on RDMA Read operations in progress at
+ the responder. Because these operations are outside the scope of
+ this protocol, they are not addressed and SHOULD be provided for by
+ other layers. For example, a simple upper-layer RPC consumer might
+ perform single-issue RDMA Read requests, while a more sophisticated,
+ multithreaded RPC consumer might implement its own First In, First
+ Out (FIFO) queue of such operations. For further discussion of
+ possible protocol implementations capable of negotiating these
+ values, see Section 6 "Connection Configuration Protocol" of this
+ document, or [RFC5661].
+
+3.4. XDR Encoding with Chunks
+
+ The data comprising an RPC call or reply message is marshaled or
+ serialized into a contiguous stream by an XDR routine. XDR data
+ types such as integers, strings, arrays, and linked lists are
+ commonly implemented over two very simple functions that encode
+ either an XDR data unit (32 bits) or an array of bytes.
+
+ Normally, the separate data items in an RPC call or reply are encoded
+ as a contiguous sequence of bytes for network transmission over UDP
+ or TCP. However, in the case of an RDMA transport, local routines
+ such as XDR encode can determine that (for instance) an opaque byte
+ array is large enough to be more efficiently moved via an RDMA data
+ transfer operation like RDMA Read or RDMA Write.
+
+ Semantically speaking, the protocol has no restriction regarding data
+ types that may or may not be represented by a read or write chunk.
+ In practice however, efficiency considerations lead to the conclusion
+ that certain data types are not generally "chunkable". Typically,
+ only those opaque and aggregate data types that may attain
+ substantial size are considered to be eligible. With today's
+ hardware, this size may be a kilobyte or more. However, any object
+ MAY be chosen for chunking in any given message.
+
+ The eligibility of XDR data items to be candidates for being moved as
+ data chunks (as opposed to being marshaled inline) is not specified
+ by the RPC-over-RDMA protocol. Chunk eligibility criteria MUST be
+ determined by each upper-layer in order to provide for an
+ interoperable specification. One such example with rationale, for
+ the NFS protocol family, is provided in [RFC5667].
+
+ The interface by which an upper-layer implementation communicates the
+ eligibility of a data item locally to RPC for chunking is out of
+ scope for this specification. In many implementations, it is
+ possible to implement a transparent RPC chunking facility. However,
+ such implementations may lead to inefficiencies, either because they
+
+
+
+Talpey & Callaghan Standards Track [Page 8]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ require the RPC layer to perform expensive registration and
+ de-registration of memory "on the fly", or they may require using
+ RDMA chunks in reply messages, along with the resulting additional
+ handshaking with the RPC-over-RDMA peer. However, these issues are
+ internal and generally confined to the local interface between RPC
+ and its upper layers, one in which implementations are free to
+ innovate. The only requirement is that the resulting RPC RDMA
+ protocol sent to the peer is valid for the upper layer. See, for
+ example, [RFC5667].
+
+ When sending any message (request or reply) that contains an eligible
+ large data chunk, the XDR encoding routine avoids moving the data
+ into the XDR stream. Instead, it does not encode the data portion,
+ but records the address and size of each chunk in a separate "read
+ chunk list" encoded within RPC RDMA transport-specific headers. Such
+ chunks will be transferred via RDMA Read operations initiated by the
+ receiver.
+
+ When the read chunks are to be moved via RDMA, the memory for each
+ chunk is registered. This registration may take place within XDR
+ itself, providing for full transparency to upper layers, or it may be
+ performed by any other specific local implementation.
+
+ Additionally, when making an RPC call that can result in bulk data
+ transferred in the reply, write chunks MAY be provided to accept the
+ data directly via RDMA Write. These write chunks will therefore be
+ pre-filled by the RPC server prior to responding, and XDR decode of
+ the data at the client will not be required. These chunks undergo a
+ similar registration and advertisement via "write chunk lists" built
+ as a part of XDR encoding.
+
+ Some RPC client implementations are not able to determine where an
+ RPC call's results reside during the "encode" phase. This makes it
+ difficult or impossible for the RPC client layer to encode the write
+ chunk list at the time of building the request. In this case, it is
+ difficult for the RPC implementation to provide transparency to the
+ RPC consumer, which may require recoding to provide result
+ information at this earlier stage.
+
+ Therefore, if the RPC client does not make a write chunk list
+ available to receive the result, then the RPC server MAY return data
+ inline in the reply, or if the upper-layer specification permits, it
+ MAY be returned via a read chunk list. It is NOT RECOMMENDED that
+ upper-layer RPC client protocol specifications omit write chunk lists
+ for eligible replies, due to the lower performance of the additional
+ handshaking to perform data transfer, and the requirement that the
+ RPC server must expose (and preserve) the reply data for a period of
+
+
+
+
+Talpey & Callaghan Standards Track [Page 9]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ time. In the absence of a server-provided read chunk list in the
+ reply, if the encoded reply overflows the posted receive buffer, the
+ RPC will fail with an RDMA transport error.
+
+ When any data within a message is provided via either read or write
+ chunks, the chunk itself refers only to the data portion of the XDR
+ stream element. In particular, for counted fields (e.g., a "<>"
+ encoding) the byte count that is encoded as part of the field remains
+ in the XDR stream, and is also encoded in the chunk list. The data
+ portion is however elided from the encoded XDR stream, and is
+ transferred as part of chunk list processing. It is important to
+ maintain upper-layer implementation compatibility -- both the count
+ and the data must be transferred as part of the logical XDR stream.
+ While the chunk list processing results in the data being available
+ to the upper-layer peer for XDR decoding, the length present in the
+ chunk list entries is not. Any byte count in the XDR stream MUST
+ match the sum of the byte counts present in the corresponding read or
+ write chunk list. If they do not agree, an RPC protocol encoding
+ error results.
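+
+ For example, a receiver might check this requirement as sketched
+ below; the segment structure mirrors the xdr_rdma_segment defined in
+ Section 4.3, and the function name is illustrative:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct rdma_segment {
+        uint32_t handle;     /* steering tag */
+        uint32_t length;     /* segment length in bytes */
+        uint64_t offset;     /* address or offset */
+    };
+
+    /* The byte count carried in the XDR stream MUST equal the sum
+       of the segment lengths in the corresponding chunk. */
+    static bool chunk_length_matches(uint32_t xdr_count,
+                                     const struct rdma_segment *segs,
+                                     size_t nsegs)
+    {
+        uint64_t sum = 0;
+        for (size_t i = 0; i < nsegs; i++)
+            sum += segs[i].length;
+        return sum == xdr_count;
+    }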
+
+ The following items are contained in a chunk list entry.
+
+ Handle
+ Steering tag or handle obtained when the chunk memory is
+ registered for RDMA.
+
+ Length
+ The length of the chunk in bytes.
+
+ Offset
+ The offset or beginning memory address of the chunk. In order
+ to support the widest array of RDMA implementations, as well as
+ the most general steering tag scheme, this field is
+ unconditionally included in each chunk list entry.
+
+ While zero-based offset schemes are available in many RDMA
+ implementations, their use by RPC requires individual
+ registration of each read or write chunk. On many such
+ implementations, this can be a significant overhead. By
+ providing an offset in each chunk, many pre-registration or
+ region-based registrations can be readily supported, and by
+ using a single, universal chunk representation, the RPC RDMA
+ protocol implementation is simplified to its most general form.
+
+ Position
+ For data that is to be encoded, the position in the XDR stream
+ where the chunk would normally reside. Note that the chunk
+ therefore inserts its data into the XDR stream at this position,
+
+
+
+Talpey & Callaghan Standards Track [Page 10]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ but its transfer is no longer "inline". Also note therefore
+ that all chunks belonging to a single RPC argument or result
+ will have the same position. For data that is to be decoded, no
+ position is used.
+
+ When XDR marshaling is complete, the chunk list is XDR encoded, then
+ sent to the receiver prepended to the RPC message. Any source data
+ for a read chunk, or the destination of a write chunk, remain behind
+ in the sender's registered memory, and their actual payload is not
+ marshaled into the request or reply.
+
+ +----------------+----------------+-------------
+ | RPC-over-RDMA | |
+ | header w/ | RPC Header | Non-chunk args/results
+ | chunks | |
+ +----------------+----------------+-------------
+
+ Read chunk lists and write chunk lists are structured somewhat
+ differently. This is due to the different usage -- read chunks are
+ decoded and indexed by their argument's or result's position in the
+ XDR data stream; their size is always known. Write chunks, on the
+ other hand, are used only for results, and have neither a preassigned
+ offset in the XDR stream nor a size until the results are produced,
+ since the buffers may be only partially filled, or may not be used
+ for results at all. Their presence in the XDR stream is therefore
+ not known until the reply is processed. The mapping of write chunks
+ onto designated NFS procedures and their results is described in
+ [RFC5667].
+
+ Therefore, read chunks are encoded into a read chunk list as a single
+ array, with each entry tagged by its (known) size and its argument's
+ or result's position in the XDR stream. Write chunks are encoded as
+ a list of arrays of RDMA buffers, with each list element (an array)
+ providing buffers for a separate result. Individual write chunk list
+ elements MAY thereby result in being partially or fully filled, or in
+ fact not being filled at all. Unused write chunks, or unused bytes
+ in write chunk buffer lists, are not returned as results, and their
+ memory is returned to the upper layer as part of RPC completion.
+ However, the RPC layer MUST NOT assume that the buffers have not been
+ modified.
+
+3.5. XDR Decoding with Read Chunks
+
+ The XDR decode process moves data from an XDR stream into a data
+ structure provided by the RPC client or server application. Where
+ elements of the destination data structure are buffers or strings,
+ the RPC application can either pre-allocate storage to receive the
+
+
+
+
+Talpey & Callaghan Standards Track [Page 11]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ data or leave the string or buffer fields null and allow the XDR
+ decode stage of RPC processing to automatically allocate storage of
+ sufficient size.
+
+ When decoding a message from an RDMA transport, the receiver first
+ XDR decodes the chunk lists from the RPC-over-RDMA header, then
+ proceeds to decode the body of the RPC message (arguments or
+ results). Whenever the XDR offset in the decode stream matches that
+ of a chunk in the read chunk list, the XDR routine initiates an RDMA
+ Read to bring over the chunk data into locally registered memory for
+ the destination buffer.
+
+ When processing an RPC request, the RPC receiver (RPC server)
+ acknowledges its completion of use of the source buffers by simply
+ replying to the RPC sender (client), and the peer may then free all
+ source buffers advertised by the request.
+
+ When processing an RPC reply, after completing such a transfer, the
+ RPC receiver (client) MUST issue an RDMA_DONE message (described in
+ Section 3.8) to notify the peer (server) that the source buffers can
+ be freed.
+
+ The read chunk list is constructed and used entirely within the
+ RPC/XDR layer. Other than specifying the minimum chunk size, the
+ management of the read chunk list is automatic and transparent to an
+ RPC application.
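+
+ The decode-time check described above might look like the following
+ sketch; the rdma_read() stub and all names are assumptions of this
+ example rather than a defined interface:
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct read_chunk {
+        uint32_t position;   /* XDR stream position of the chunk */
+        uint32_t handle;     /* remote steering tag */
+        uint32_t length;     /* chunk length in bytes */
+        uint64_t offset;     /* remote address or offset */
+    };
+
+    /* Stub standing in for the provider's RDMA Read operation. */
+    static int rdma_read(uint32_t handle, uint64_t offset,
+                         void *dst, uint32_t len)
+    {
+        (void)handle; (void)offset; (void)dst; (void)len;
+        return 0;
+    }
+
+    /* While decoding, whenever the current XDR offset matches the
+       position of the next read chunk, fetch that chunk's data into
+       locally registered memory via RDMA Read. */
+    static int maybe_fetch_chunk(uint32_t xdr_offset,
+                                 const struct read_chunk *rc,
+                                 void *dstbuf)
+    {
+        if (rc != NULL && rc->position == xdr_offset)
+            return rdma_read(rc->handle, rc->offset,
+                             dstbuf, rc->length);
+        return 0;   /* nothing to fetch at this offset */
+    }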
+
+3.6. XDR Decoding with Write Chunks
+
+ When a write chunk list is provided for the results of the RPC call,
+ the RPC server MUST provide any corresponding data via RDMA Write to
+ the memory referenced in the chunk list entries. The RPC reply
+ conveys this by returning the write chunk list to the client with the
+ lengths rewritten to match the actual transfer. The XDR decode of
+ the reply therefore performs no local data transfer but merely
+ returns the length obtained from the reply.
+
+ Each decoded result consumes one entry in the write chunk list, which
+ in turn consists of an array of RDMA segments. The length is
+ therefore the sum of all returned lengths in all segments comprising
+ the corresponding list entry. As each list entry is decoded, the
+ entire entry is consumed.
+
+ The write chunk list is constructed and used by the RPC application.
+ The RPC/XDR layer simply conveys the list between client and server
+ and initiates the RDMA Writes back to the client. The mapping of
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 12]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ write chunk list entries to procedure arguments MUST be determined
+ for each protocol. An example of a mapping is described in
+ [RFC5667].
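+
+ A server-side sketch of filling one write chunk list entry follows.
+ It assumes a hypothetical rdma_write() stub and illustrative names;
+ the point is that each segment's length field is rewritten to the
+ number of bytes actually placed, which is what the reply reports:
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct rdma_segment {
+        uint32_t handle;
+        uint32_t length;   /* capacity on input; bytes written on reply */
+        uint64_t offset;
+    };
+
+    /* Stub standing in for the provider's RDMA Write operation. */
+    static int rdma_write(const struct rdma_segment *dst,
+                          const void *src, uint32_t len)
+    {
+        (void)dst; (void)src; (void)len;
+        return 0;
+    }
+
+    /* Place 'resultlen' bytes of result data into one write chunk
+       (an array of segments), filling segments sequentially without
+       gaps and recording the actual length written in each. */
+    static int fill_write_chunk(struct rdma_segment *segs, size_t nsegs,
+                                const uint8_t *result, uint32_t resultlen)
+    {
+        uint32_t remaining = resultlen;
+        for (size_t i = 0; i < nsegs; i++) {
+            uint32_t n = (remaining < segs[i].length)
+                             ? remaining : segs[i].length;
+            if (n != 0 && rdma_write(&segs[i], result, n) != 0)
+                return -1;
+            segs[i].length = n;   /* actual transfer length */
+            result += n;
+            remaining -= n;
+        }
+        return (remaining != 0) ? -1 : 0;  /* -1: entry too small */
+    }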
+
+3.7. XDR Roundup and Chunks
+
+ The XDR protocol requires 4-byte alignment of each new encoded
+ element in any XDR stream. This requirement is for efficiency and
+ ease of decode/unmarshaling at the receiver -- if the XDR stream
+ buffer begins on a native machine boundary, then the XDR elements
+ will lie on similarly predictable offsets in memory.
+
+ Within XDR, when non-4-byte encodes (such as an odd-length string or
+ bulk data) are marshaled, their length is encoded literally, while
+ their data is padded to begin the next element at a 4-byte boundary
+ in the XDR stream. For TCP or RDMA inline encoding, this minimal
+ overhead is required because the transport-specific framing relies on
+ the fact that the relative offset of the elements in the XDR stream
+ from the start of the message determines the XDR position during
+ decode.
+
+ On the other hand, RPC/RDMA Read chunks carry the XDR position of
+ each chunked element and length of the Chunk segment, and can be
+ placed by the receiver exactly where they belong in the receiver's
+ memory without regard to the alignment of their position in the XDR
+ stream. Since any rounded-up data is not actually part of the upper
+ layer's message, the receiver will not reference it, and there is no
+ reason to set it to any particular value in the receiver's memory.
+
+ When roundup is present at the end of a sequence of chunks, the
+ length of the sequence will terminate it at a non-4-byte XDR
+ position. When the receiver proceeds to decode the remaining part of
+ the XDR stream, it inspects the XDR position indicated by the next
+ chunk. Because this position will not match (else roundup would not
+ have occurred), the receiver decoding will fall back to inspecting
+ the remaining inline portion. If in turn, no data remains to be
+ decoded from the inline portion, then the receiver MUST conclude that
+ roundup is present, and therefore it advances the XDR decode position
+ to that indicated by the next chunk (if any). In this way, roundup
+ is passed without ever actually transferring additional XDR bytes.
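+
+ As a small worked sketch of the receiver's arithmetic (the macro and
+ function names are illustrative), the decode position past a chunked
+ counted field advances by the 4-byte length plus the rounded-up data
+ length, even though the roundup bytes are never transferred:
+
+    #include <stdint.h>
+
+    /* XDR rounds each element up to a 4-byte boundary. */
+    #define XDR_ALIGN(len)  (((len) + 3u) & ~3u)
+
+    /* Advance the decode position past a chunked counted field whose
+       4-byte length was decoded inline at position 'pos'. */
+    static uint32_t skip_chunked_field(uint32_t pos, uint32_t len)
+    {
+        return pos + 4u + XDR_ALIGN(len);
+    }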
+
+ Some protocol operations over RPC/RDMA, for instance NFS writes of
+ data encountered at the end of a file or in direct I/O situations,
+ commonly yield these roundups within RDMA Read Chunks. Because any
+ roundup bytes are not actually present in the data buffers being
+ written, memory for these bytes would come from noncontiguous
+ buffers, either as an additional memory registration segment or as an
+ additional Chunk. The overhead of these operations can be
+
+
+
+Talpey & Callaghan Standards Track [Page 13]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ significant for the sender, which must marshal them, and even higher
+ for the receiver, to which they must be transferred. Senders SHOULD therefore avoid
+ encoding individual RDMA Read Chunks for roundup whenever possible.
+ It is acceptable, but not necessary, to include roundup data in an
+ existing RDMA Read Chunk, but only if it is already present in the
+ XDR stream to carry upper-layer data.
+
+ Note that there is no exposure of additional data at the sender due
+ to eliding roundup data from the XDR stream, since any additional
+ sender buffers are never exposed to the peer. The data is literally
+ not there to be transferred.
+
+ For RDMA Write Chunks, a simpler encoding method applies. Again,
+ roundup bytes are not transferred, instead the chunk length sent to
+ the receiver in the reply is simply increased to include any roundup.
+ Because of the requirement that the RDMA Write Chunks are filled
+ sequentially without gaps, this situation can only occur on the final
+ chunk receiving data. Therefore, there is no opportunity for roundup
+ data to insert misalignment or positional gaps into the XDR stream.
+
+3.8. RPC Call and Reply
+
+ The RDMA transport for RPC provides three methods of moving data
+ between RPC client and server:
+
+ Inline
+ Data is moved between RPC client and server within an RDMA Send.
+
+ RDMA Read
+ Data is moved between RPC client and server via an RDMA Read
+ operation via steering tag; address and offset obtained from a
+ read chunk list.
+
+ RDMA Write
+ Result data is moved from RPC server to client via an RDMA Write
+ operation via steering tag; address and offset obtained from a
+ write chunk list or reply chunk in the client's RPC call
+ message.
+
+ These methods of data movement may occur in combinations within a
+ single RPC. For instance, an RPC call may contain some inline data
+ along with some large chunks to be transferred via RDMA Read to the
+ server. The reply to that call may have some result chunks that the
+ server RDMA Writes back to the client. The following protocol
+ interactions illustrate RPC calls that use these methods to move RPC
+ message data:
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 14]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ An RPC with write chunks in the call message:
+
+ RPC Client RPC Server
+ | RPC Call + Write Chunk list |
+ Send | ------------------------------> |
+ | |
+ | Chunk 1 |
+ | <------------------------------ | Write
+ | : |
+ | Chunk n |
+ | <------------------------------ | Write
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+ In the presence of write chunks, RDMA ordering provides the guarantee
+ that all data in the RDMA Write operations has been placed in memory
+ prior to the client's RPC reply processing.
+
+ An RPC with read chunks in the call message:
+
+ RPC Client RPC Server
+ | RPC Call + Read Chunk list |
+ Send | ------------------------------> |
+ | |
+ | Chunk 1 |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | : |
+ | Chunk n |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 15]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ An RPC with read chunks in the reply message:
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC Reply + Read Chunk list |
+ | <------------------------------ | Send
+ | |
+ | Chunk 1 |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | : |
+ | Chunk n |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | |
+ | Done |
+ Send | ------------------------------> |
+
+ The final Done message allows the RPC client to signal the server
+ that it has received the chunks, so the server can de-register and
+ free the memory holding the chunks. A Done completion is not
+ necessary for an RPC call, since the RPC reply Send is itself a
+ receive completion notification. In the event that the client fails
+ to return the Done message within some timeout period, the server MAY
+ conclude that a protocol violation has occurred and close the RPC
+ connection, or it MAY proceed with a de-register and free its chunk
+ buffers. This may result in a fatal RDMA error if the client later
+ attempts to perform an RDMA Read operation, which amounts to the same
+ thing.
+
+ The use of read chunks in RPC reply messages is much less efficient
+ than providing write chunks in the originating RPC calls, due to the
+ additional message exchanges, the need for the RPC server to
+ advertise buffers to the peer, the necessity of the server
+ maintaining a timer for the purpose of recovery from misbehaving
+ clients, and the need for additional memory registration. Their use
+ is NOT RECOMMENDED by upper layers where efficiency is a primary
+ concern [RFC5667]. However, they MAY be employed by upper-layer
+ protocol bindings that are primarily concerned with transparency,
+ since they can frequently be implemented completely within the RPC
+ lower layers.
+
+ It is important to note that the Done message consumes a credit at
+ the RPC server. The RPC server SHOULD provide sufficient credits to
+ the client to allow the Done message to be sent without deadlock
+ (driving the outstanding credit count to zero). The RPC client MUST
+
+
+
+Talpey & Callaghan Standards Track [Page 16]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ account for its required Done messages to the server in its
+ accounting of available credits, and the server SHOULD replenish any
+ credit consumed by its use of such exchanges at its earliest
+ opportunity.
+
+ Finally, it is possible to conceive of RPC exchanges that involve any
+ or all combinations of write chunks in the RPC call, read chunks in
+ the RPC call, and read chunks in the RPC reply. Support for such
+ exchanges is straightforward from a protocol perspective, but in
+ practice such exchanges would be quite rare, limited to upper-layer
+ protocol exchanges that transferred bulk data in both the call and
+ corresponding reply.
+
+3.9. Padding
+
+ Alignment of specific opaque data enables certain scatter/gather
+ optimizations. Padding leverages the useful property that RDMA
+ transfers preserve alignment of data, even when they are placed into
+ pre-posted receive buffers by Sends.
+
+ Many servers can make good use of such padding. Padding allows the
+ chaining of RDMA receive buffers such that any data transferred by
+ RDMA on behalf of RPC requests will be placed into appropriately
+ aligned buffers on the system that receives the transfer. In this
+ way, the need for servers to perform RDMA Read to satisfy all but the
+ largest client writes is obviated.
+
+ The effect of padding is demonstrated below showing prior bytes on an
+ XDR stream ("XXX" in the figure below) followed by an opaque field
+ consisting of four length bytes ("LLLL") followed by data bytes
+ ("DDD"). The receiver of the RDMA Send has posted two chained
+ receive buffers. Without padding, the opaque data is split across
+ the two buffers. With the addition of padding bytes ("ppp") prior to
+ the first data byte, the data can be forced to align correctly in the
+ second buffer.
+
+ Buffer 1 Buffer 2
+ Unpadded -------------- --------------
+
+
+ XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD
+
+
+ Padded
+
+
+ XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD
+
+
+
+
+Talpey & Callaghan Standards Track [Page 17]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Padding is implemented completely within the RDMA transport encoding,
+ flagged with a specific message type. Where padding is applied, two
+ values are passed to the peer: an "rdma_align", which is the padding
+ value used, and "rdma_thresh", which is the opaque data size at or
+ above which padding is applied. For instance, if the server is using
+ chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
+ could be used to achieve alignment of the data. The XDR routine at
+ the peer MUST consult these values when decoding opaque values.
+ Where the decoded length exceeds the rdma_thresh, the XDR decode MUST
+ skip over the appropriate padding as indicated by rdma_align and the
+ current XDR stream position.
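+
+ One plausible reading of this computation is sketched below. It
+ assumes that the pad brings the first data byte of a qualifying
+ opaque field to an rdma_align boundary in the XDR stream; the
+ function name is illustrative, and an implementation may compute the
+ skip differently, provided both peers agree:
+
+    #include <stdint.h>
+
+    /* Pad bytes preceding an opaque field of length 'len' whose data
+       would begin at XDR stream offset 'xdr_off', given the
+       rdma_align and rdma_thresh values from an RDMA_MSGP header. */
+    static uint32_t rpcrdma_pad_bytes(uint32_t xdr_off, uint32_t len,
+                                      uint32_t rdma_align,
+                                      uint32_t rdma_thresh)
+    {
+        if (rdma_align == 0 || len < rdma_thresh)
+            return 0;   /* padding applies only at or above threshold */
+        return (rdma_align - (xdr_off % rdma_align)) % rdma_align;
+    }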
+
+4. RPC RDMA Message Layout
+
+ RPC call and reply messages are conveyed across an RDMA transport
+ with a prepended RPC-over-RDMA header. The RPC-over-RDMA header
+ includes data for RDMA flow control credits, padding parameters, and
+ lists of addresses that provide direct data placement via RDMA Read
+ and Write operations. The layout of the RPC message itself is
+ unchanged from that described in [RFC5531] except for the possible
+ exclusion of large data chunks that will be moved by RDMA Read or
+ Write operations. If the RPC message (along with the RPC-over-RDMA
+ header) is too long for the posted receive buffer (even after any
+ large chunks are removed), then the entire RPC message MAY be moved
+ separately as a chunk, leaving just the RPC-over-RDMA header in the
+ RDMA Send.
+
+4.1. RPC-over-RDMA Header
+
+ The RPC-over-RDMA header begins with four 32-bit fields that are
+ always present and that control the RDMA interaction including RDMA-
+ specific flow control. These are then followed by a number of items
+ such as chunk lists and padding that MAY or MUST NOT be present
+ depending on the type of transmission. The four fields that are
+ always present are:
+
+ 1. Transaction ID (XID).
+ The XID generated for the RPC call and reply. Having the XID at
+ the beginning of the message makes it easy to establish the
+ message context. This XID MUST be the same as the XID in the RPC
+ header. The receiver MAY perform its processing based solely on
+ the XID in the RPC-over-RDMA header, and thereby ignore the XID in
+ the RPC header, if it so chooses.
+
+ 2. Version number.
+ This version of the RPC RDMA message protocol is 1. The version
+ number MUST be increased by 1 whenever the format of the RPC RDMA
+ messages is changed.
+
+
+
+Talpey & Callaghan Standards Track [Page 18]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ 3. Flow control credit value.
+ When sent in an RPC call message, the requested value is provided.
+ When sent in an RPC reply message, the granted value is returned.
+ RPC calls SHOULD NOT be sent in excess of the currently granted
+ limit.
+
+ 4. Message type.
+
+ o RDMA_MSG = 0 indicates that chunk lists and RPC message follow.
+
+ o RDMA_NOMSG = 1 indicates that after the chunk lists there is no
+ RPC message. In this case, the chunk lists provide information
+ to allow the message proper to be transferred using RDMA Read
+ or Write; the message is thus not appended to the RPC-over-RDMA header.
+
+ o RDMA_MSGP = 2 indicates that a chunk list and RPC message with
+ some padding follow.
+
+ o RDMA_DONE = 3 indicates that the message signals the completion
+ of a chunk transfer via RDMA Read.
+
+ o RDMA_ERROR = 4 is used to signal any detected error(s) in the
+ RPC RDMA chunk encoding.
+
+ Because the version number is encoded as part of this header, and the
+ RDMA_ERROR message type is used to indicate errors, these first four
+ fields and the start of the following message body MUST always remain
+ aligned at these fixed offsets for all versions of the RPC-over-RDMA
+ header.
+
+ For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
+ chunk lists follow. If the Read chunk list is null (a 32-bit word of
+ zeros), then there are no chunks to be transferred separately and the
+ RPC message follows in its entirety. If non-null, then it's the
+ beginning of an XDR encoded sequence of Read chunk list entries. If
+ the Write chunk list is non-null, then an XDR encoded sequence of
+ Write chunk entries follows.
+
+ If the message type is RDMA_MSGP, then two additional fields that
+ specify the padding alignment and threshold are inserted prior to the
+ Read and Write chunk lists.
+
+ A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
+ the RPC call or RPC reply message body, beginning with the XID. The
+ XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 19]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ +--------+---------+---------+-----------+-------------+----------
+ | | | | Message | NULLs | RPC Call
+ | XID | Version | Credits | Type | or | or
+ | | | | | Chunk Lists | Reply Msg
+ +--------+---------+---------+-----------+-------------+----------
+
+ Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
+ RPC message follows. As an implementation hint: a gather operation
+ on the Send of the RDMA RPC message can be used to marshal the
+ initial header, the chunk list, and the RPC message itself.
+
+4.2. RPC-over-RDMA Header Errors
+
+ When a peer receives an RPC RDMA message, it MUST perform the
+ following basic validity checks on the header and chunk contents. If
+ such errors are detected in the request, an RDMA_ERROR reply MUST be
+ generated.
+
+ Two types of errors are defined, version mismatch and invalid chunk
+ format. When the peer detects an RPC-over-RDMA header version that
+ it does not support (currently this document defines only version 1),
+ it replies with an error code of ERR_VERS, and provides the low and
+ high inclusive version numbers it does, in fact, support. The
+ version number in this reply MUST be a value otherwise valid at the
+ receiver. When other decoding errors are detected in the header or
+ chunks, either an RPC decode error MAY be returned or the RPC/RDMA
+ error code ERR_CHUNK MUST be returned.
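+
+ A minimal sketch of the version check and the resulting ERR_VERS
+ reply body follows; it reuses names from the XDR in Section 4.3,
+ while the helper function itself is purely illustrative:
+
+    #include <stdint.h>
+
+    enum { RPCRDMA_VERSION = 1 };
+    enum rpc_rdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };
+
+    struct rpc_rdma_error {
+        enum rpc_rdma_errcode err;
+        uint32_t rdma_vers_low;    /* meaningful only for ERR_VERS */
+        uint32_t rdma_vers_high;
+    };
+
+    /* Validate the version field of a received RPC-over-RDMA header.
+       On mismatch, build an ERR_VERS body advertising the inclusive
+       range of versions this receiver supports (here, only 1). */
+    static int check_rdma_version(uint32_t rdma_vers,
+                                  struct rpc_rdma_error *e)
+    {
+        if (rdma_vers == RPCRDMA_VERSION)
+            return 0;
+        e->err = ERR_VERS;
+        e->rdma_vers_low = RPCRDMA_VERSION;
+        e->rdma_vers_high = RPCRDMA_VERSION;
+        return -1;
+    }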
+
+4.3. XDR Language Description
+
+ Here is the message layout in XDR language.
+
+ struct xdr_rdma_segment {
+ uint32 handle; /* Registered memory handle */
+ uint32 length; /* Length of the chunk in bytes */
+ uint64 offset; /* Chunk virtual address or offset */
+ };
+
+ struct xdr_read_chunk {
+ uint32 position; /* Position in XDR stream */
+ struct xdr_rdma_segment target;
+ };
+
+ struct xdr_read_list {
+ struct xdr_read_chunk entry;
+ struct xdr_read_list *next;
+ };
+
+
+
+
+Talpey & Callaghan Standards Track [Page 20]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ struct xdr_write_chunk {
+ struct xdr_rdma_segment target<>;
+ };
+
+ struct xdr_write_list {
+ struct xdr_write_chunk entry;
+ struct xdr_write_list *next;
+ };
+
+ struct rdma_msg {
+ uint32 rdma_xid; /* Mirrors the RPC header xid */
+ uint32 rdma_vers; /* Version of this protocol */
+ uint32 rdma_credit; /* Buffers requested/granted */
+ rdma_body rdma_body;
+ };
+
+ enum rdma_proc {
+ RDMA_MSG=0, /* An RPC call or reply msg */
+ RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
+ RDMA_MSGP=2, /* An RPC call or reply msg with padding */
+ RDMA_DONE=3, /* Client signals reply completion */
+ RDMA_ERROR=4 /* An RPC RDMA encoding error */
+ };
+
+ union rdma_body switch (rdma_proc proc) {
+ case RDMA_MSG:
+ rpc_rdma_header rdma_msg;
+ case RDMA_NOMSG:
+ rpc_rdma_header_nomsg rdma_nomsg;
+ case RDMA_MSGP:
+ rpc_rdma_header_padded rdma_msgp;
+ case RDMA_DONE:
+ void;
+ case RDMA_ERROR:
+ rpc_rdma_error rdma_error;
+ };
+
+ struct rpc_rdma_header {
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+ /* rpc body follows */
+ };
+
+ struct rpc_rdma_header_nomsg {
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+
+
+
+Talpey & Callaghan Standards Track [Page 21]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ };
+
+ struct rpc_rdma_header_padded {
+ uint32 rdma_align; /* Padding alignment */
+ uint32 rdma_thresh; /* Padding threshold */
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+ /* rpc body follows */
+ };
+
+ enum rpc_rdma_errcode {
+ ERR_VERS = 1,
+ ERR_CHUNK = 2
+ };
+
+ union rpc_rdma_error switch (rpc_rdma_errcode err) {
+ case ERR_VERS:
+ uint32 rdma_vers_low;
+ uint32 rdma_vers_high;
+ case ERR_CHUNK:
+ void;
+ default:
+ uint32 rdma_extra[8];
+ };
+
+5. Long Messages
+
+ The receiver of RDMA Send messages is required by RDMA to have
+ previously posted one or more adequately sized buffers. The RPC
+ client can inform the server of the maximum size of its RDMA Send
+ messages via the Connection Configuration Protocol described later in
+ this document.
+
+ Since RPC messages are frequently small, memory savings can be
+ achieved by posting small buffers. Even large messages like NFS READ
+ or WRITE will be quite small once the chunks are removed from the
+ message. However, there may be large messages that would demand a
+ very large buffer be posted, where the contents of the buffer may not
+ be a chunkable XDR element. A good example is an NFS READDIR reply,
+ which may contain a large number of small filename strings. Also,
+ the NFS version 4 protocol [RFC3530] features COMPOUND request and
+ reply messages of unbounded length.
+
+ Ideally, each upper layer will negotiate these limits. However, it
+ is frequently necessary to provide a transparent solution.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 22]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+5.1. Message as an RDMA Read Chunk
+
+ One relatively simple method is to have the client identify any RPC
+ message that exceeds the RPC server's posted buffer size and move it
+ separately as a chunk, i.e., reference it as the first entry in the
+ read chunk list with an XDR position of zero.
+
+ Normal Message
+
+ +--------+---------+---------+------------+-------------+----------
+ | | | | | | RPC Call
+ | XID | Version | Credits | RDMA_MSG | Chunk Lists | or
+ | | | | | | Reply Msg
+ +--------+---------+---------+------------+-------------+----------
+
+ Long Message
+
+ +--------+---------+---------+------------+-------------+
+ | | | | | |
+ | XID | Version | Credits | RDMA_NOMSG | Chunk Lists |
+ | | | | | |
+ +--------+---------+---------+------------+-------------+
+ |
+ | +----------
+ | | Long RPC Call
+ +->| or
+ | Reply Message
+ +----------
+
+ If the receiver gets an RPC-over-RDMA header with a message type of
+ RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
+ position, it allocates a registered buffer and issues an RDMA Read of
+ the long RPC message into it. The receiver then proceeds to XDR
+ decode the RPC message as if it had received it inline with the Send
+ data. Further decoding may issue additional RDMA Reads to bring over
+ additional chunks.
+
+ Although the handling of long messages requires one extra network
+ turnaround, in practice these messages will be rare if the posted
+ receive buffers are correctly sized, and of course they will be
+ non-existent for RDMA-aware upper layers.
+
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 23]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ A long call RPC with request supplied via RDMA Read
+
+ RPC Client RPC Server
+ | RPC-over-RDMA Header |
+ Send | ------------------------------> |
+ | |
+ | Long RPC Call Msg |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | |
+ | RPC-over-RDMA Reply |
+ | <------------------------------ | Send
+
+ An RPC with long reply returned via RDMA Read
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC-over-RDMA Header |
+ | <------------------------------ | Send
+ | |
+ | Long RPC Reply Msg |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | |
+ | Done |
+ Send | ------------------------------> |
+
+ It is possible for a single RPC procedure to employ both a long call
+ for its arguments and a long reply for its results. However, such an
+ operation is atypical, as few upper layers define such exchanges.
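+
+ The receiver-side test for a long message, as described at the start
+ of this section, can be sketched as follows; the structure and
+ function names are assumptions of this example:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    enum rdma_proc { RDMA_MSG = 0, RDMA_NOMSG = 1, RDMA_MSGP = 2,
+                     RDMA_DONE = 3, RDMA_ERROR = 4 };
+
+    struct read_chunk {
+        uint32_t position;   /* XDR position; zero for a long message */
+        uint32_t handle;
+        uint32_t length;
+        uint64_t offset;
+    };
+
+    /* True if the header carries no inline RPC body (RDMA_NOMSG) and
+       the first read chunk sits at XDR position zero; the receiver
+       then RDMA Reads the entire RPC message from that chunk. */
+    static bool is_long_message(enum rdma_proc proc,
+                                const struct read_chunk *first)
+    {
+        return proc == RDMA_NOMSG &&
+               first != NULL && first->position == 0;
+    }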
+
+5.2. RDMA Write of Long Replies (Reply Chunks)
+
+ A superior method of handling long RPC replies is to have the RPC
+ client post a large buffer into which the server can write a large
+ RPC reply. This has the advantage that an RDMA Write may be slightly
+ faster in network latency than an RDMA Read, and does not require the
+ server to wait for the completion as it must for RDMA Read.
+ Additionally, for a reply it removes the need for an RDMA_DONE
+ message if the large reply is returned as a Read chunk.
+
+ This protocol supports direct return of a large reply via the
+ inclusion of an OPTIONAL rdma_reply write chunk after the read chunk
+ list and the write chunk list. The client allocates a buffer sized
+ to receive a large reply and enters its steering tag, address and
+ length in the rdma_reply write chunk. If the reply message is too
+
+
+
+Talpey & Callaghan Standards Track [Page 24]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ long to return inline with an RDMA Send (exceeds the size of the
+ client's posted receive buffer), even with read chunks removed, then
+ the RPC server performs an RDMA Write of the RPC reply message into
+ the buffer indicated by the rdma_reply chunk. If the client doesn't
+ provide an rdma_reply chunk, or if it's too small, then if the upper-
+ layer specification permits, the message MAY be returned as a Read
+ chunk.
+
+ An RPC with long reply returned via RDMA Write
+
+
+ RPC Client RPC Server
+ | RPC Call with rdma_reply |
+ Send | ------------------------------> |
+ | |
+ | Long RPC Reply Msg |
+ | <------------------------------ | Write
+ | |
+ | RPC-over-RDMA Header |
+ | <------------------------------ | Send
+
+ The use of RDMA Write to return long replies requires that the client
+ applications anticipate a long reply and have some knowledge of its
+ size so that an adequately sized buffer can be allocated. This is
+ certainly true of NFS READDIR replies, where the client already
+ provides an upper bound on the size of the encoded directory fragment
+ to be returned by the server.
+
+ The use of these "reply chunks" is highly efficient and convenient
+ for both RPC client and server. Their use is encouraged for eligible
+ RPC operations such as NFS READDIR, which would otherwise require
+ extensive chunk management within the results or use of RDMA Read and
+ a Done message [RFC5667].
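+
+ The server's choice among the reply methods discussed in this
+ section can be sketched as below; the enum and parameter names are
+ illustrative, and the inline limit stands for the client's posted
+ receive buffer size less the RPC-over-RDMA header:
+
+    #include <stddef.h>
+
+    enum reply_method {
+        REPLY_INLINE,        /* fits in a single RDMA Send */
+        REPLY_WRITE_CHUNK,   /* RDMA Write into the rdma_reply chunk */
+        REPLY_READ_CHUNK     /* only if the upper layer permits */
+    };
+
+    /* reply_chunk_len is the capacity advertised by the client in
+       rdma_reply, or 0 if no reply chunk was provided. */
+    static enum reply_method choose_reply_method(size_t reply_len,
+                                                 size_t inline_limit,
+                                                 size_t reply_chunk_len)
+    {
+        if (reply_len <= inline_limit)
+            return REPLY_INLINE;
+        if (reply_len <= reply_chunk_len)
+            return REPLY_WRITE_CHUNK;
+        return REPLY_READ_CHUNK;
+    }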
+
+6. Connection Configuration Protocol
+
+ RDMA Send operations require the receiver to post one or more buffers
+ at the RDMA connection endpoint, each large enough to receive the
+ largest Send message. Buffers are consumed as Send messages are
+ received. If a buffer is too small, or if there are no buffers
+ posted, the RDMA transport MAY return an error and break the RDMA
+ connection. The receiver MUST post sufficient, adequately sized buffers to
+ avoid buffer overrun or capacity errors.
+
+ The protocol described above includes only a mechanism for managing
+ the number of such receive buffers and no explicit features to allow
+ the RPC client and server to provision or control buffer sizing, nor
+ any other session parameters.
+
+
+
+Talpey & Callaghan Standards Track [Page 25]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ In the past, this type of connection management has not been
+ necessary for RPC. RPC over UDP or TCP does not have a protocol to
+ negotiate the link. The server can get a rough idea of the maximum
+ size of messages from the server protocol code. However, a protocol
+ to negotiate transport features on a more dynamic basis is desirable.
+
+ The Connection Configuration Protocol allows the client to pass its
+ connection requirements to the server, and allows the server to
+ inform the client of its connection limits.
+
+ Use of the Connection Configuration Protocol by an upper layer is
+ OPTIONAL.
+
+6.1. Initial Connection State
+
+ This protocol MAY be used for connection setup prior to the use of
+ another RPC protocol that uses the RDMA transport. It operates
+ in-band, i.e., it uses the connection itself to negotiate the
+ connection parameters. To provide a basis for connection
+ negotiation, the connection is assumed to provide a basic level of
+ interoperability: the ability to exchange at least one RPC message at
+ a time that is at least 1 KB in size. The server MAY exceed this
+ basic level of configuration, but the client MUST NOT assume more
+ than one, and MUST receive a valid reply from the server carrying the
+ actual number of available receive messages, prior to sending its
+ next request.
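+
+   The client-side state assumed before the server's reply arrives can
+   be summarized by the following non-normative C sketch; the structure
+   and field names are illustrative only.
+
+      #include <stddef.h>
+
+      struct rdma_conn_state {
+          size_t       inline_send_max;  /* bytes per inline message  */
+          unsigned int recv_credits;     /* messages peer will accept */
+      };
+
+      /* Basic level of interoperability assumed until the server's  */
+      /* configuration reply has been received.                      */
+      static const struct rdma_conn_state initial_state = {
+          .inline_send_max = 1024,   /* at least 1 KB                 */
+          .recv_credits    = 1       /* MUST NOT assume more than one */
+      };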
+
+6.2. Protocol Description
+
+ Version 1 of the Connection Configuration Protocol consists of a
+ single procedure that allows the client to inform the server of its
+ connection requirements and the server to return connection
+ information to the client.
+
+ The maxcall_sendsize argument is the maximum size of an RPC call
+ message that the client MAY send inline in an RDMA Send message to
+ the server. The server MAY return a maxcall_sendsize value that is
+ smaller or larger than the client's request. The client MUST NOT
+ send an inline call message larger than what the server will accept.
+ The maxcall_sendsize limits only the size of inline RPC calls. It
+ does not limit the size of long RPC messages transferred as an
+ initial chunk in the Read chunk list.
+
+ The maxreply_sendsize is the maximum size of an inline RPC message
+ that the client will accept from the server.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 26]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ The maxrdmaread is the maximum number of RDMA Reads that may be
+   active at the peer.  This number correlates to the incoming RDMA
+ Read count ("IRD") configured into each originating endpoint by the
+ client or server. If more than this number of RDMA Read operations
+ by the connected peer are issued simultaneously, connection loss or
+ suboptimal flow control may result; therefore, the value SHOULD be
+ observed at all times. The peers' values need not be equal. If
+ zero, the peer MUST NOT issue requests that require RDMA Read to
+ satisfy, as no transfer will be possible.
+
+ The align value is the value recommended by the server for opaque
+ data values such as strings and counted byte arrays. The client MAY
+ use this value to compute the number of prepended pad bytes when XDR
+ encoding opaque values in the RPC call message.
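+
+   As a non-normative illustration, the number of prepended pad bytes
+   for a given XDR stream offset and recommended alignment may be
+   computed as follows.
+
+      #include <stddef.h>
+
+      /* Pad bytes to prepend so that the opaque data lands on the  */
+      /* server's recommended alignment boundary.                   */
+      size_t
+      xdr_pad_bytes(size_t xdr_offset, size_t align)
+      {
+          if (align == 0)
+              return 0;               /* no recommendation in effect */
+          return (align - (xdr_offset % align)) % align;
+      }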
+
+ typedef unsigned int uint32;
+
+ struct config_rdma_req {
+ uint32 maxcall_sendsize;
+ /* max size of inline RPC call */
+ uint32 maxreply_sendsize;
+ /* max size of inline RPC reply */
+ uint32 maxrdmaread;
+ /* max active RDMA Reads at client */
+ };
+
+ struct config_rdma_reply {
+ uint32 maxcall_sendsize;
+ /* max call size accepted by server */
+ uint32 align;
+ /* server's receive buffer alignment */
+ uint32 maxrdmaread;
+ /* max active RDMA Reads at server */
+ };
+
+ program CONFIG_RDMA_PROG {
+ version VERS1 {
+ /*
+ * Config call/reply
+ */
+ config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
+ } = 1;
+ } = 100417;
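+
+   The following non-normative sketch shows how a client might apply
+   the server's reply to its local connection limits.  The C structures
+   mirror the XDR above, and the local_send_capacity parameter is an
+   assumed client-side bound.
+
+      #include <stdint.h>
+
+      struct config_rdma_reply_c {
+          uint32_t maxcall_sendsize;   /* accepted by server        */
+          uint32_t align;              /* server's buffer alignment */
+          uint32_t maxrdmaread;        /* active Reads at server    */
+      };
+
+      struct client_limits {
+          uint32_t inline_call_max;    /* largest inline call       */
+          uint32_t rdma_read_max;      /* Reads allowed at server   */
+          uint32_t pad_align;          /* alignment for opaque pads */
+      };
+
+      void
+      apply_config_reply(struct client_limits *lim,
+                         const struct config_rdma_reply_c *rep,
+                         uint32_t local_send_capacity)
+      {
+          /* Never send an inline call larger than the server       */
+          /* accepts or than the local implementation can build.    */
+          lim->inline_call_max =
+              rep->maxcall_sendsize < local_send_capacity
+                  ? rep->maxcall_sendsize : local_send_capacity;
+          lim->rdma_read_max = rep->maxrdmaread; /* zero forbids Reads */
+          lim->pad_align     = rep->align;
+      }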
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 27]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+7. Memory Registration Overhead
+
+ RDMA requires that all data be transferred between registered memory
+ regions at the source and destination. All protocol headers as well
+ as separately transferred data chunks use registered memory. Since
+ the cost of registering and de-registering memory can be a large
+ proportion of the RDMA transaction cost, it is important to minimize
+ registration activity. This is easily achieved within RPC controlled
+ memory by allocating chunk list data and RPC headers in a reusable
+ way from pre-registered pools.
+
+ The data chunks transferred via RDMA MAY occupy memory that persists
+ outside the bounds of the RPC transaction. Hence, the default
+ behavior of an RPC-over-RDMA transport is to register and de-register
+ these chunks on every transaction. However, this is not a limitation
+ of the protocol -- only of the existing local RPC API. The API is
+ easily extended through such functions as rpc_control(3) to change
+ the default behavior so that the application can assume
+ responsibility for controlling memory registration through an RPC-
+ provided registered memory allocator.
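+
+   A non-normative sketch of such a pre-registered pool follows; the
+   register_region() routine and the handle field stand in for the
+   local provider's memory registration interface.
+
+      #include <stdlib.h>
+
+      /* Assumed helper: register [addr, addr+len) with the RDMA    */
+      /* provider and return a steering tag in *handle.             */
+      extern int register_region(void *addr, size_t len,
+                                 unsigned long *handle);
+
+      struct reg_buf {
+          void           *addr;
+          size_t          len;
+          unsigned long   handle;    /* steering tag   */
+          struct reg_buf *next;      /* free-list link */
+      };
+
+      /* Build a free list of nbufs buffers, each registered exactly */
+      /* once; callers reuse them for headers and chunk data.        */
+      struct reg_buf *
+      pool_create(size_t nbufs, size_t bufsize)
+      {
+          struct reg_buf *head = NULL;
+
+          for (size_t i = 0; i < nbufs; i++) {
+              struct reg_buf *b = malloc(sizeof(*b));
+
+              if (b == NULL)
+                  break;
+              b->addr = malloc(bufsize);
+              b->len  = bufsize;
+              if (b->addr == NULL ||
+                  register_region(b->addr, bufsize, &b->handle) != 0) {
+                  free(b->addr);
+                  free(b);
+                  break;
+              }
+              b->next = head;
+              head    = b;
+          }
+          return head;       /* possibly shorter than requested */
+      }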
+
+8. Errors and Error Recovery
+
+ RPC RDMA protocol errors are described in Section 4. RPC errors and
+ RPC error recovery are not affected by the protocol, and proceed as
+ for any RPC error condition. RDMA transport error reporting and
+ recovery are outside the scope of this protocol.
+
+ It is assumed that the link itself will provide some degree of error
+ detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
+ (when used over TCP), Stream Control Transmission Protocol (SCTP), as
+ well as the InfiniBand link layer all provide Cyclic Redundancy Check
+ (CRC) protection of the RDMA payload, and CRC-class protection is a
+ general attribute of such transports. Additionally, the RPC layer
+ itself can accept errors from the link level and recover via
+ retransmission. RPC recovery can handle complete loss and
+ re-establishment of the link.
+
+ See Section 11 for further discussion of the use of RPC-level
+ integrity schemes to detect errors and related efficiency issues.
+
+9. Node Addressing
+
+ In setting up a new RDMA connection, the first action by an RPC
+ client will be to obtain a transport address for the server. The
+   mechanism used to obtain this address, and to open an RDMA
+   connection, is dependent on the type of RDMA transport, and is the
+   responsibility of each RPC protocol binding and its local
+   implementation.
+
+
+
+Talpey & Callaghan Standards Track [Page 28]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+10. RPC Binding
+
+ RPC services normally register with a portmap or rpcbind [RFC1833]
+ service, which associates an RPC program number with a service
+ address. (In the case of UDP or TCP, the service address for NFS is
+ normally port 2049.) This policy is no different with RDMA
+ interconnects, although it may require the allocation of port numbers
+ appropriate to each upper-layer binding that uses the RPC framing
+ defined here.
+
+ When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
+ IP port addressing due to its layering on TCP and/or SCTP, port
+ mapping is trivial and consists merely of issuing the port in the
+ connection process. The NFS/RDMA protocol service address has been
+ assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
+
+ When mapped atop InfiniBand [IB], which uses a Group Identifier
+ (GID)-based service endpoint naming scheme, a translation MUST be
+ employed. One such translation is defined in the InfiniBand Port
+ Addressing Annex [IBPORT], which is appropriate for translating IP
+ port addressing to the InfiniBand network. Therefore, in this case,
+ IP port addressing may be readily employed by the upper layer.
+
+ When a mapping standard or convention exists for IP ports on an RDMA
+ interconnect, there are several possibilities for each upper layer to
+ consider:
+
+ One possibility is to have an upper-layer server register its
+ mapped IP port with the rpcbind service, under the netid (or
+   netids) defined here.  An RPC/RDMA-aware client can then resolve
+ its desired service to a mappable port, and proceed to connect.
+ This is the most flexible and compatible approach, for those upper
+ layers that are defined to use the rpcbind service.
+
+ A second possibility is to have the server's portmapper register
+ itself on the RDMA interconnect at a "well known" service address.
+ (On UDP or TCP, this corresponds to port 111.) A client could
+ connect to this service address and use the portmap protocol to
+ obtain a service address in response to a program number, e.g., an
+ iWARP port number, or an InfiniBand GID.
+
+ Alternatively, the client could simply connect to the mapped well-
+ known port for the service itself, if it is appropriately defined.
+ By convention, the NFS/RDMA service, when operating atop such an
+ InfiniBand fabric, will use the same 20049 assignment as for
+ iWARP.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 29]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Historically, different RPC protocols have taken different approaches
+ to their port assignment; therefore, the specific method is left to
+ each RPC/RDMA-enabled upper-layer binding, and not addressed here.
+
+ In Section 12, "IANA Considerations", this specification defines two
+ new "netid" values, to be used for registration of upper layers atop
+ iWARP [RFC5040, RFC5041] and (when a suitable port translation
+ service is available) InfiniBand [IB]. Additional RDMA-capable
+ networks MAY define their own netids, or if they provide a port
+ translation, MAY share the one defined here.
+
+11. Security Considerations
+
+ RPC provides its own security via the RPCSEC_GSS framework [RFC2203].
+ RPCSEC_GSS can provide message authentication, integrity checking,
+ and privacy. This security mechanism will be unaffected by the RDMA
+ transport. The data integrity and privacy features alter the body of
+ the message, presenting it as a single chunk. For large messages the
+ chunk may be large enough to qualify for RDMA Read transfer.
+ However, there is much data movement associated with computation and
+ verification of integrity, or encryption/decryption, so certain
+ performance advantages may be lost.
+
+ For efficiency, a more appropriate security mechanism for RDMA links
+ may be link-level protection, such as certain configurations of
+ IPsec, which may be co-located in the RDMA hardware. The use of
+ link-level protection MAY be negotiated through the use of the new
+ RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the
+ Channel Binding mechanism [RFC5056] and IPsec Channel Connection
+ Latching [RFC5660]. Use of such mechanisms is REQUIRED where
+ integrity and/or privacy is desired, and where efficiency is
+ required.
+
+ An additional consideration is the protection of the integrity and
+ privacy of local memory by the RDMA transport itself. The use of
+ RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
+ contents, or to memory owned by user processes. These protections
+ are provided by the RDMA layer specifications, and specifically their
+ security models. It is REQUIRED that any RDMA provider used for RPC
+ transport be conformant to the requirements of [RFC5042] in order to
+ satisfy these protections.
+
+ Once delivered securely by the RDMA provider, any RDMA-exposed
+ addresses will contain only RPC payloads in the chunk lists,
+ transferred under the protection of RPCSEC_GSS integrity and privacy.
+ By these means, the data will be protected end-to-end, as required by
+ the RPC layer security model.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 30]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Where upper-layer protocols choose to supply results to the requester
+ via read chunks, a server resource deficit can arise if the client
+ does not promptly acknowledge their status via the RDMA_DONE message.
+ This can potentially lead to a denial-of-service situation, with a
+ single client unfairly (and unnecessarily) consuming server RDMA
+ resources. Servers for such upper-layer protocols MUST protect
+ against this situation, originating from one or many clients. For
+   example, a time-based window of buffer availability may be offered;
+ if the client fails to obtain the data within the window, it will
+ simply retry using ordinary RPC retry semantics. Or, a more severe
+ method would be for the server to simply close the client's RDMA
+ connection, freeing the RDMA resources and allowing the server to
+ reclaim them.
+
+ A fairer and more useful method is provided by the protocol itself.
+ The server MAY use the rdma_credit value to limit the number of
+ outstanding requests for each client. By including the number of
+ outstanding RDMA_DONE completions in the computation of available
+ client credits, the server can limit its exposure to each client, and
+ therefore provide uninterrupted service as its resources permit.
+
+ However, the server must ensure that it does not decrease the credit
+ count to zero with this method, since the RDMA_DONE message is not
+ acknowledged. If the credit count were to drop to zero solely due to
+ outstanding RDMA_DONE messages, the client would deadlock since it
+ would never obtain a new credit with which to continue. Therefore,
+ if the server adjusts credits to zero for outstanding RDMA_DONE, it
+ MUST withhold its reply to at least one message in order to provide
+ the next credit. The time-based window (or any other appropriate
+ method) SHOULD be used by the server to recover resources in the
+ event that the client never returns.
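+
+   The credit computation described above can be sketched as follows
+   (non-normative); the parameter names are illustrative.
+
+      #include <stdint.h>
+
+      /* Compute credits to advertise, counting unacknowledged      */
+      /* RDMA_DONE messages against the server's limit.  The result */
+      /* is clamped to at least one so the client is never left     */
+      /* without a credit (cf. the deadlock discussion above).      */
+      uint32_t
+      grant_credits(uint32_t server_max, uint32_t outstanding_done)
+      {
+          uint32_t credits = 0;
+
+          if (outstanding_done < server_max)
+              credits = server_max - outstanding_done;
+          if (credits == 0)
+              credits = 1;
+          return credits;
+      }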
+
+ The Connection Configuration Protocol, when used, MUST be protected
+ by an appropriate RPC security flavor, to ensure it is not attacked
+ in the process of initiating an RPC/RDMA connection.
+
+12. IANA Considerations
+
+ Three new assignments are specified by this document:
+
+ - A new set of RPC "netids" for resolving RPC/RDMA services
+
+ - Optional service port assignments for upper-layer bindings
+
+ - An RPC program number assignment for the configuration protocol
+
+   These assignments have been established as described below.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 31]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ The new RPC transport has been assigned an RPC "netid", which is an
+ rpcbind [RFC1833] string used to describe the underlying protocol in
+ order for RPC to select the appropriate transport framing, as well as
+ the format of the service addresses and ports.
+
+ The following "Netid" registry strings are defined for this purpose:
+
+ NC_RDMA "rdma"
+ NC_RDMA6 "rdma6"
+
+ These netids MAY be used for any RDMA network satisfying the
+ requirements of Section 2, and able to identify service endpoints
+ using IP port addressing, possibly through use of a translation
+ service as described above in Section 10, "RPC Binding". The "rdma"
+ netid is to be used when IPv4 addressing is employed by the
+ underlying transport, and "rdma6" for IPv6 addressing.
+
+ The netid assignment policy and registry are defined in [RFC5665].
+
+ As a new RPC transport, this protocol has no effect on RPC program
+ numbers or existing registered port numbers. However, new port
+ numbers MAY be registered for use by RPC/RDMA-enabled services, as
+ appropriate to the new networks over which the services will operate.
+
+ For example, the NFS/RDMA service defined in [RFC5667] has been
+ assigned the port 20049, in the IANA registry:
+
+ nfsrdma 20049/tcp Network File System (NFS) over RDMA
+ nfsrdma 20049/udp Network File System (NFS) over RDMA
+ nfsrdma 20049/sctp Network File System (NFS) over RDMA
+
+ The OPTIONAL Connection Configuration Protocol described herein
+ requires an RPC program number assignment. The value "100417" has
+ been assigned:
+
+ rdmaconfig 100417 rpc.rdmaconfig
+
+ The RPC program number assignment policy and registry are defined in
+ [RFC5531].
+
+13. Acknowledgments
+
+ The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
+ Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
+ Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
+ Robinson, and Mallikarjun Chadalapaka for their contributions to this
+ document.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 32]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+14. References
+
+14.1. Normative References
+
+ [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
+ RFC 1833, August 1995.
+
+ [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
+ Specification", RFC 2203, September 1997.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC4506] Eisler, M., Ed., "XDR: External Data Representation
+ Standard", STD 67, RFC 4506, May 2006.
+
+ [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
+ Protocol (DDP) / Remote Direct Memory Access Protocol
+ (RDMAP) Security", RFC 5042, October 2007.
+
+ [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
+ Channels", RFC 5056, November 2007.
+
+ [RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February
+ 2009.
+
+ [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
+ Specification Version 2", RFC 5531, May 2009.
+
+ [RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC
+ 5660, October 2009.
+
+ [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call
+ (RPC) Network Identifiers and Universal Address Formats",
+ RFC 5665, January 2010.
+
+14.2. Informative References
+
+ [RFC1094] Sun Microsystems, "NFS: Network File System Protocol
+ specification", RFC 1094, March 1989.
+
+ [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
+ Version 3 Protocol Specification", RFC 1813, June 1995.
+
+ [RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
+ Beame, C., Eisler, M., and D. Noveck, "Network File System
+ (NFS) version 4 Protocol", RFC 3530, April 2003.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 33]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
+ Garcia, "A Remote Direct Memory Access Protocol
+ Specification", RFC 5040, October 2007.
+
+ [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
+ Data Placement over Reliable Transports", RFC 5041,
+ October 2007.
+
+ [RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS)
+ Remote Direct Memory Access (RDMA) Problem Statement", RFC
+ 5532, May 2009.
+
+ [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+ "Network File System Version 4 Minor Version 1 Protocol",
+ RFC 5661, January 2010.
+
+ [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS)
+ Direct Data Placement", RFC 5667, January 2010.
+
+ [IB] InfiniBand Trade Association, InfiniBand Architecture
+ Specifications, available from
+ http://www.infinibandta.org.
+
+ [IBPORT] InfiniBand Trade Association, "IP Addressing Annex",
+ available from http://www.infinibandta.org.
+
+Authors' Addresses
+
+ Tom Talpey
+ 170 Whitman St.
+ Stow, MA 01775 USA
+
+ EMail: tmtalpey@gmail.com
+
+
+ Brent Callaghan
+ Apple Computer, Inc.
+ MS: 302-4K
+ 2 Infinite Loop
+ Cupertino, CA 95014 USA
+
+ EMail: brentc@apple.com
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 34]
+