Diffstat (limited to 'doc/rfc/rfc5666.txt')
-rw-r--r--  doc/rfc/rfc5666.txt  1907
1 files changed, 1907 insertions, 0 deletions
diff --git a/doc/rfc/rfc5666.txt b/doc/rfc/rfc5666.txt
new file mode 100644
index 0000000..f696417
--- /dev/null
+++ b/doc/rfc/rfc5666.txt
@@ -0,0 +1,1907 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) T. Talpey
+Request for Comments: 5666 Unaffiliated
+Category: Standards Track B. Callaghan
+ISSN: 2070-1721 Apple
+ January 2010
+
+
+ Remote Direct Memory Access Transport for Remote Procedure Call
+
+Abstract
+
+ This document describes a protocol providing Remote Direct Memory
+ Access (RDMA) as a new transport for Remote Procedure Call (RPC).
+ The RDMA transport binding conveys the benefits of efficient, bulk-
+ data transport over high-speed networks, while providing for minimal
+ change to RPC applications and with no required revision of the
+ application RPC protocol, or the RPC protocol itself.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc5666.
+
+Copyright Notice
+
+ Copyright (c) 2010 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 1]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ This document may contain material from IETF Documents or IETF
+ Contributions published or made publicly available before November
+ 10, 2008. The person(s) controlling the copyright in some of this
+ material may not have granted the IETF Trust the right to allow
+ modifications of such material outside the IETF Standards Process.
+ Without obtaining an adequate license from the person(s) controlling
+ the copyright in such materials, this document may not be modified
+ outside the IETF Standards Process, and derivative works of it may
+ not be created outside the IETF Standards Process, except to format
+ it for publication as an RFC or to translate it into languages other
+ than English.
+
+Table of Contents
+
+ 1. Introduction ....................................................3
+ 1.1. Requirements Language ......................................4
+ 2. Abstract RDMA Requirements ......................................4
+ 3. Protocol Outline ................................................5
+ 3.1. Short Messages .............................................6
+ 3.2. Data Chunks ................................................6
+ 3.3. Flow Control ...............................................7
+ 3.4. XDR Encoding with Chunks ...................................8
+ 3.5. XDR Decoding with Read Chunks .............................11
+ 3.6. XDR Decoding with Write Chunks ............................12
+ 3.7. XDR Roundup and Chunks ....................................13
+ 3.8. RPC Call and Reply ........................................14
+ 3.9. Padding ...................................................17
+ 4. RPC RDMA Message Layout ........................................18
+ 4.1. RPC-over-RDMA Header ......................................18
+ 4.2. RPC-over-RDMA Header Errors ...............................20
+ 4.3. XDR Language Description ..................................20
+ 5. Long Messages ..................................................22
+ 5.1. Message as an RDMA Read Chunk .............................23
+ 5.2. RDMA Write of Long Replies (Reply Chunks) .................24
+ 6. Connection Configuration Protocol ..............................25
+ 6.1. Initial Connection State ..................................26
+ 6.2. Protocol Description ......................................26
+ 7. Memory Registration Overhead ...................................28
+ 8. Errors and Error Recovery ......................................28
+ 9. Node Addressing ................................................28
+ 10. RPC Binding ...................................................29
+ 11. Security Considerations .......................................30
+ 12. IANA Considerations ...........................................31
+ 13. Acknowledgments ...............................................32
+ 14. References ....................................................33
+ 14.1. Normative References .....................................33
+ 14.2. Informative References ...................................33
+
+
+
+
+Talpey & Callaghan Standards Track [Page 2]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+1. Introduction
+
+ Remote Direct Memory Access (RDMA) [RFC5040, RFC5041], [IB] is a
+ technique for efficient movement of data between end nodes, which
+ becomes increasingly compelling over high-speed transports. By
+ directing data into destination buffers as it is sent on a network,
+ and placing it via direct memory access by hardware, the double
+ benefit of faster transfers and reduced host overhead is obtained.
+
+ Open Network Computing Remote Procedure Call (ONC RPC, or simply,
+ RPC) [RFC5531] is a remote procedure call protocol that has been run
+ over a variety of transports. Most RPC implementations today use UDP
+ or TCP. RPC messages are defined in terms of an eXternal Data
+ Representation (XDR) [RFC4506], which provides a canonical data
+ representation across a variety of host architectures. An XDR data
+ stream is conveyed differently on each type of transport. On UDP,
+ RPC messages are encapsulated inside datagrams, while on a TCP byte
+ stream, RPC messages are delineated by a record marking protocol. An
+ RDMA transport also conveys RPC messages in a unique fashion that
+ must be fully described if client and server implementations are to
+ interoperate.
+
+ RDMA transports present new semantics unlike the behaviors of either
+ UDP or TCP alone. They retain message delineations like UDP while
+ also providing a reliable, sequenced data transfer like TCP. Also,
+ they provide the new efficient, bulk-transfer service of RDMA. RDMA
+ transports are therefore naturally viewed as a new transport type by
+ RPC.
+
+ RDMA as a transport will benefit the performance of RPC protocols
+ that move large "chunks" of data, since RDMA hardware excels at
+ moving data efficiently between host memory and a high-speed network
+ with little or no host CPU involvement. In this context, the Network
+ File System (NFS) protocol, in all its versions [RFC1094] [RFC1813]
+ [RFC3530] [RFC5661], is an obvious beneficiary of RDMA. A complete
+ problem statement is discussed in [RFC5532], and related NFSv4 issues
+ are discussed in [RFC5661]. Many other RPC-based protocols will also
+ benefit.
+
+ Although the RDMA transport described here provides relatively
+ transparent support for any RPC application, the proposal goes
+ further in describing mechanisms that can optimize the use of RDMA
+ with more active participation by the RPC application.
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 3]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+1.1. Requirements Language
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in [RFC2119].
+
+2. Abstract RDMA Requirements
+
+ An RPC transport is responsible for conveying an RPC message from a
+ sender to a receiver. An RPC message is either an RPC call from a
+ client to a server, or an RPC reply from the server back to the
+ client. An RPC message contains an RPC call header followed by
+ arguments if the message is an RPC call, or an RPC reply header
+ followed by results if the message is an RPC reply. The call header
+ contains a transaction ID (XID) followed by the program and procedure
+ number as well as a security credential. An RPC reply header begins
+ with an XID that matches that of the RPC call message, followed by a
+ security verifier and results. All data in an RPC message is XDR
+ encoded. For a complete description of the RPC protocol and XDR
+ encoding, see [RFC5531] and [RFC4506].
+
+ This protocol assumes the following abstract model for RDMA
+ transports. These terms, common in the RDMA lexicon, are used in
+ this document. A more complete glossary of RDMA terms can be found
+ in [RFC5040].
+
+ o Registered Memory
+ All data moved via tagged RDMA operations is resident in
+ registered memory at its destination. This protocol assumes
+ that each segment of registered memory MUST be identified with a
+ steering tag of no more than 32 bits and memory addresses of up
+ to 64 bits in length.
+
+ o RDMA Send
+ The RDMA provider supports an RDMA Send operation with
+ completion signaled at the receiver when data is placed in a
+ pre-posted buffer. The amount of transferred data is limited
+ only by the size of the receiver's buffer. Sends complete at
+ the receiver in the order they were issued at the sender.
+
+ o RDMA Write
+ The RDMA provider supports an RDMA Write operation to directly
+ place data in the receiver's buffer. An RDMA Write is initiated
+ by the sender and completion is signaled at the sender. No
+ completion is signaled at the receiver. The sender uses a
+ steering tag, memory address, and length of the remote
+ destination buffer. RDMA Writes are not necessarily ordered
+ with respect to one another, but are ordered with respect to
+
+
+
+Talpey & Callaghan Standards Track [Page 4]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ RDMA Sends; a subsequent RDMA Send completion obtained at the
+ receiver guarantees that prior RDMA Write data has been
+ successfully placed in the receiver's memory.
+
+ o RDMA Read
+ The RDMA provider supports an RDMA Read operation to directly
+ place peer source data in the requester's buffer. An RDMA Read
+ is initiated by the receiver and completion is signaled at the
+ receiver. The receiver provides steering tags, memory
+ addresses, and a length for the remote source and local
+ destination buffers. Since the peer at the data source receives
+ no notification of RDMA Read completion, there is an assumption
+ that on receiving the data, the receiver will signal completion
+ with an RDMA Send message, so that the peer can free the source
+ buffers and the associated steering tags.
+
+ This protocol is designed to be carried over all RDMA transports
+ meeting the stated requirements. This protocol conveys to the RPC
+ peer information sufficient for that RPC peer to direct an RDMA layer
+ to perform transfers containing RPC data and to communicate their
+ result(s). For example, it is readily carried over RDMA transports
+ such as Internet Wide Area RDMA Protocol (iWARP) [RFC5040, RFC5041],
+ or InfiniBand [IB].
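+
+ The C fragment below is an illustrative sketch only; it is not part
+ of any standard RDMA API, and all type and function names are
+ hypothetical. It restates the addressing triple (steering tag,
+ length, offset) and the three abstract operations assumed above.
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    /* A registered memory segment: a steering tag of no more than
+       32 bits, a length, and an address or offset of up to 64 bits. */
+    struct rdma_segment {
+        uint32_t handle;      /* steering tag */
+        uint32_t length;      /* length of the segment in bytes */
+        uint64_t offset;      /* memory address or offset */
+    };
+
+    /* Hypothetical provider interface assumed by this protocol. */
+    struct rdma_ops {
+        /* Send: completes at the receiver into a pre-posted buffer. */
+        int (*send)(void *conn, const void *buf, size_t len);
+
+        /* Write: places data directly at the remote segment; no
+           completion is signaled at the receiver. */
+        int (*write)(void *conn, const struct rdma_segment *dst,
+                     const void *src, size_t len);
+
+        /* Read: pulls data from the remote segment into local
+           registered memory; completion is signaled only at the
+           initiator. */
+        int (*read)(void *conn, void *dst,
+                    const struct rdma_segment *src, size_t len);
+    };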
+
+3. Protocol Outline
+
+ An RPC message can be conveyed in identical fashion, whether it is a
+ call or reply message. In each case, the transmission of the message
+ proper is preceded by transmission of a transport-specific header for
+ use by RPC-over-RDMA transports. This header is analogous to the
+ record marking used for RPC over TCP, but is more extensive, since
+ RDMA transports support several modes of data transfer; it is
+ important to allow the upper-layer protocol to specify the most
+ efficient mode for each of the segments in a message. Multiple
+ segments of a message may thereby be transferred in different ways to
+ different remote memory destinations.
+
+ All transfers of a call or reply begin with an RDMA Send that
+ transfers at least the RPC-over-RDMA header, usually with the call or
+ reply message appended, or at least some part thereof. Because the
+ size of what may be transmitted via RDMA Send is limited by the size
+ of the receiver's pre-posted buffer, the RPC-over-RDMA transport
+ provides a number of methods to reduce the amount transferred by
+ means of the RDMA Send, when necessary, by transferring various parts
+ of the message using RDMA Read and RDMA Write.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 5]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ RPC-over-RDMA framing replaces all other RPC framing (such as TCP
+ record marking) when used atop an RPC/RDMA association, even though
+ the underlying RDMA protocol may itself be layered atop a protocol
+ with a defined RPC framing (such as TCP). It is however possible for
+ RPC/RDMA to be dynamically enabled, in the course of negotiating the
+ use of RDMA via an upper-layer exchange. Because RPC framing
+ delimits an entire RPC request or reply, the resulting shift in
+ framing must occur between distinct RPC messages, and in concert with
+ the transport.
+
+3.1. Short Messages
+
+ Many RPC messages are quite short. For example, the NFS version 3
+ GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a
+ 32-byte file handle argument and 4 bytes of length. The reply to
+ this common request is about 100 bytes.
+
+ There is no benefit in transferring such small messages with an RDMA
+ Read or Write operation. The overhead in transferring steering tags
+ and memory addresses is justified only by large transfers. The
+ critical message size that justifies RDMA transfer will vary
+ depending on the RDMA implementation and network, but is typically of
+ the order of a few kilobytes. It is appropriate to transfer a short
+ message with an RDMA Send to a pre-posted buffer. The RPC-over-RDMA
+ header with the short message (call or reply) immediately following
+ is transferred using a single RDMA Send operation.
+
+ Short RPC messages over an RDMA transport:
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+3.2. Data Chunks
+
+ Some protocols, like NFS, have RPC procedures that can transfer very
+ large chunks of data in the RPC call or reply and would cause the
+ maximum send size to be exceeded if one tried to transfer them as
+ part of the RDMA Send. These large chunks typically range from a
+ kilobyte to a megabyte or more. An RDMA transport can transfer large
+ chunks of data more efficiently via the direct placement of an RDMA
+ Read or RDMA Write operation. Using direct placement instead of
+ inline transfer not only avoids expensive data copies, but provides
+ correct data alignment at the destination.
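+
+ As a rough sketch of the decision this section describes (the
+ threshold name and value below are illustrative assumptions, not
+ protocol constants), an implementation might choose between inline
+ transfer and a chunk as follows:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    /* Hypothetical inline threshold: typically a few kilobytes,
+       bounded by the size of the peer's posted receive buffers. */
+    #define RPCRDMA_INLINE_THRESHOLD 1024
+
+    /* Return true if an XDR data item of this size is better moved
+       as a read or write chunk than carried inline in the Send. */
+    static bool worth_chunking(size_t xdr_item_size)
+    {
+        return xdr_item_size >= RPCRDMA_INLINE_THRESHOLD;
+    }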
+
+
+
+
+Talpey & Callaghan Standards Track [Page 6]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+3.3. Flow Control
+
+ It is critical to provide RDMA Send flow control for an RDMA
+ connection. RDMA receive operations will fail if a pre-posted
+ receive buffer is not available to accept an incoming RDMA Send, and
+ repeated occurrences of such errors can be fatal to the connection.
+ This is a departure from conventional TCP/IP networking where buffers
+ are allocated dynamically on an as-needed basis, and where
+ pre-posting is not required.
+
+ It is not practical to provide for fixed credit limits at the RPC
+ server. Fixed limits scale poorly, since posted buffers are
+ dedicated to the associated connection until consumed by receive
+ operations. Additionally, for protocol correctness, the RPC server
+ must always be able to reply to client requests, whether or not new
+ buffers have been posted to accept future receives. (Note that the
+ RPC server may in fact be a client at some other layer. For example,
+ NFSv4 callbacks are processed by the NFSv4 client, acting as an RPC
+ server. The credit discussions apply equally in either case.)
+
+ Flow control for RDMA Send operations is implemented as a simple
+ request/grant protocol in the RPC-over-RDMA header associated with
+ each RPC message. The RPC-over-RDMA header for RPC call messages
+ contains a requested credit value for the RPC server, which MAY be
+ dynamically adjusted by the caller to match its expected needs. The
+ RPC-over-RDMA header for the RPC reply messages provides the granted
+ result, which MAY have any value except it MUST NOT be zero when no
+ in-progress operations are present at the server, since such a value
+ would result in deadlock. The value MAY be adjusted up or down at
+ each opportunity to match the server's needs or policies.
+
+ The RPC client MUST NOT send unacknowledged requests in excess of
+ this granted RPC server credit limit. If the limit is exceeded, the
+ RDMA layer may signal an error, possibly terminating the connection.
+ Even if an error does not occur, it is OPTIONAL that the server
+ handle the excess request(s), and it MAY return an RPC error to the
+ client. Also note that the never-zero requirement implies that an
+ RPC server MUST always provide at least one credit to each connected
+ RPC client from which no requests are outstanding. The client would
+ deadlock otherwise, unable to send another request.
+
+ While RPC calls complete in any order, the current flow control limit
+ at the RPC server is known to the RPC client from the Send ordering
+ properties. It is always the most recent server-granted credit value
+ minus the number of requests in flight.
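+
+ The client-side bookkeeping implied by the preceding paragraphs can
+ be sketched as follows; the structure and field names are assumptions
+ of this example, not part of the protocol:
+
+    #include <stdint.h>
+
+    struct rpcrdma_credits {
+        uint32_t granted;     /* most recent server-granted credits */
+        uint32_t in_flight;   /* requests sent but not yet replied to */
+    };
+
+    /* Number of further requests the client may send right now
+       without exceeding the server's granted credit limit. */
+    static uint32_t credits_available(const struct rpcrdma_credits *c)
+    {
+        return (c->in_flight >= c->granted)
+                   ? 0 : c->granted - c->in_flight;
+    }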
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 7]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Certain RDMA implementations may impose additional flow control
+ restrictions, such as limits on RDMA Read operations in progress at
+ the responder. Because these operations are outside the scope of
+ this protocol, they are not addressed and SHOULD be provided for by
+ other layers. For example, a simple upper-layer RPC consumer might
+ perform single-issue RDMA Read requests, while a more sophisticated,
+ multithreaded RPC consumer might implement its own First In, First
+ Out (FIFO) queue of such operations. For further discussion of
+ possible protocol implementations capable of negotiating these
+ values, see Section 6 "Connection Configuration Protocol" of this
+ document, or [RFC5661].
+
+3.4. XDR Encoding with Chunks
+
+ The data comprising an RPC call or reply message is marshaled or
+ serialized into a contiguous stream by an XDR routine. XDR data
+ types such as integers, strings, arrays, and linked lists are
+ commonly implemented over two very simple functions that encode
+ either an XDR data unit (32 bits) or an array of bytes.
+
+ Normally, the separate data items in an RPC call or reply are encoded
+ as a contiguous sequence of bytes for network transmission over UDP
+ or TCP. However, in the case of an RDMA transport, local routines
+ such as XDR encode can determine that (for instance) an opaque byte
+ array is large enough to be more efficiently moved via an RDMA data
+ transfer operation like RDMA Read or RDMA Write.
+
+ Semantically speaking, the protocol has no restriction regarding data
+ types that may or may not be represented by a read or write chunk.
+ In practice however, efficiency considerations lead to the conclusion
+ that certain data types are not generally "chunkable". Typically,
+ only those opaque and aggregate data types that may attain
+ substantial size are considered to be eligible. With today's
+ hardware, this size may be a kilobyte or more. However, any object
+ MAY be chosen for chunking in any given message.
+
+ The eligibility of XDR data items to be candidates for being moved as
+ data chunks (as opposed to being marshaled inline) is not specified
+ by the RPC-over-RDMA protocol. Chunk eligibility criteria MUST be
+ determined by each upper-layer in order to provide for an
+ interoperable specification. One such example with rationale, for
+ the NFS protocol family, is provided in [RFC5667].
+
+ The interface by which an upper-layer implementation communicates the
+ eligibility of a data item locally to RPC for chunking is out of
+ scope for this specification. In many implementations, it is
+ possible to implement a transparent RPC chunking facility. However,
+ such implementations may lead to inefficiencies, either because they
+
+
+
+Talpey & Callaghan Standards Track [Page 8]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ require the RPC layer to perform expensive registration and
+ de-registration of memory "on the fly", or they may require using
+ RDMA chunks in reply messages, along with the resulting additional
+ handshaking with the RPC-over-RDMA peer. However, these issues are
+ internal and generally confined to the local interface between RPC
+ and its upper layers, one in which implementations are free to
+ innovate. The only requirement is that the resulting RPC RDMA
+ protocol sent to the peer is valid for the upper layer. See, for
+ example, [RFC5667].
+
+ When sending any message (request or reply) that contains an eligible
+ large data chunk, the XDR encoding routine avoids moving the data
+ into the XDR stream. Instead, it does not encode the data portion,
+ but records the address and size of each chunk in a separate "read
+ chunk list" encoded within RPC RDMA transport-specific headers. Such
+ chunks will be transferred via RDMA Read operations initiated by the
+ receiver.
+
+ When the read chunks are to be moved via RDMA, the memory for each
+ chunk is registered. This registration may take place within XDR
+ itself, providing for full transparency to upper layers, or it may be
+ performed by any other specific local implementation.
+
+ Additionally, when making an RPC call that can result in bulk data
+ transferred in the reply, write chunks MAY be provided to accept the
+ data directly via RDMA Write. These write chunks will therefore be
+ pre-filled by the RPC server prior to responding, and XDR decode of
+ the data at the client will not be required. These chunks undergo a
+ similar registration and advertisement via "write chunk lists" built
+ as a part of XDR encoding.
+
+ Some RPC client implementations are not able to determine where an
+ RPC call's results reside during the "encode" phase. This makes it
+ difficult or impossible for the RPC client layer to encode the write
+ chunk list at the time of building the request. In this case, it is
+ difficult for the RPC implementation to provide transparency to the
+ RPC consumer, which may require recoding to provide result
+ information at this earlier stage.
+
+ Therefore, if the RPC client does not make a write chunk list
+ available to receive the result, then the RPC server MAY return data
+ inline in the reply, or if the upper-layer specification permits, it
+ MAY be returned via a read chunk list. It is NOT RECOMMENDED that
+ upper-layer RPC client protocol specifications omit write chunk lists
+ for eligible replies, due to the lower performance of the additional
+ handshaking to perform data transfer, and the requirement that the
+ RPC server must expose (and preserve) the reply data for a period of
+
+
+
+
+Talpey & Callaghan Standards Track [Page 9]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ time. In the absence of a server-provided read chunk list in the
+ reply, if the encoded reply overflows the posted receive buffer, the
+ RPC will fail with an RDMA transport error.
+
+ When any data within a message is provided via either read or write
+ chunks, the chunk itself refers only to the data portion of the XDR
+ stream element. In particular, for counted fields (e.g., a "<>"
+ encoding) the byte count that is encoded as part of the field remains
+ in the XDR stream, and is also encoded in the chunk list. The data
+ portion is however elided from the encoded XDR stream, and is
+ transferred as part of chunk list processing. It is important to
+ maintain upper-layer implementation compatibility -- both the count
+ and the data must be transferred as part of the logical XDR stream.
+ While the chunk list processing results in the data being available
+ to the upper-layer peer for XDR decoding, the length present in the
+ chunk list entries is not. Any byte count in the XDR stream MUST
+ match the sum of the byte counts present in the corresponding read or
+ write chunk list. If they do not agree, an RPC protocol encoding
+ error results.
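+
+ For example, a receiver might check this requirement as sketched
+ below; the segment structure mirrors the xdr_rdma_segment defined in
+ Section 4.3, and the function name is illustrative:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct rdma_segment {
+        uint32_t handle;     /* steering tag */
+        uint32_t length;     /* segment length in bytes */
+        uint64_t offset;     /* address or offset */
+    };
+
+    /* The byte count carried in the XDR stream MUST equal the sum
+       of the segment lengths in the corresponding chunk. */
+    static bool chunk_length_matches(uint32_t xdr_count,
+                                     const struct rdma_segment *segs,
+                                     size_t nsegs)
+    {
+        uint64_t sum = 0;
+        for (size_t i = 0; i < nsegs; i++)
+            sum += segs[i].length;
+        return sum == xdr_count;
+    }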
+
+ The following items are contained in a chunk list entry.
+
+ Handle
+ Steering tag or handle obtained when the chunk memory is
+ registered for RDMA.
+
+ Length
+ The length of the chunk in bytes.
+
+ Offset
+ The offset or beginning memory address of the chunk. In order
+ to support the widest array of RDMA implementations, as well as
+ the most general steering tag scheme, this field is
+ unconditionally included in each chunk list entry.
+
+ While zero-based offset schemes are available in many RDMA
+ implementations, their use by RPC requires individual
+ registration of each read or write chunk. On many such
+ implementations, this can be a significant overhead. By
+ providing an offset in each chunk, many pre-registration or
+ region-based registrations can be readily supported, and by
+ using a single, universal chunk representation, the RPC RDMA
+ protocol implementation is simplified to its most general form.
+
+ Position
+ For data that is to be encoded, the position in the XDR stream
+ where the chunk would normally reside. Note that the chunk
+ therefore inserts its data into the XDR stream at this position,
+
+
+
+Talpey & Callaghan Standards Track [Page 10]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ but its transfer is no longer "inline". Also note therefore
+ that all chunks belonging to a single RPC argument or result
+ will have the same position. For data that is to be decoded, no
+ position is used.
+
+ When XDR marshaling is complete, the chunk list is XDR encoded, then
+ sent to the receiver prepended to the RPC message. Any source data
+ for a read chunk, or the destination of a write chunk, remain behind
+ in the sender's registered memory, and their actual payload is not
+ marshaled into the request or reply.
+
+ +----------------+----------------+-------------
+ | RPC-over-RDMA | |
+ | header w/ | RPC Header | Non-chunk args/results
+ | chunks | |
+ +----------------+----------------+-------------
+
+ Read chunk lists and write chunk lists are structured somewhat
+ differently. This is due to the different usage -- read chunks are
+ decoded and indexed by their argument's or result's position in the
+ XDR data stream; their size is always known. Write chunks, on the
+ other hand, are used only for results, and have neither a preassigned
+ offset in the XDR stream nor a size until the results are produced,
+ since the buffers may be only partially filled, or may not be used
+ for results at all. Their presence in the XDR stream is therefore
+ not known until the reply is processed. The mapping of write chunks
+ onto designated NFS procedures and their results is described in
+ [RFC5667].
+
+ Therefore, read chunks are encoded into a read chunk list as a single
+ array, with each entry tagged by its (known) size and its argument's
+ or result's position in the XDR stream. Write chunks are encoded as
+ a list of arrays of RDMA buffers, with each list element (an array)
+ providing buffers for a separate result. Individual write chunk list
+ elements MAY thereby result in being partially or fully filled, or in
+ fact not being filled at all. Unused write chunks, or unused bytes
+ in write chunk buffer lists, are not returned as results, and their
+ memory is returned to the upper layer as part of RPC completion.
+ However, the RPC layer MUST NOT assume that the buffers have not been
+ modified.
+
+3.5. XDR Decoding with Read Chunks
+
+ The XDR decode process moves data from an XDR stream into a data
+ structure provided by the RPC client or server application. Where
+ elements of the destination data structure are buffers or strings,
+ the RPC application can either pre-allocate storage to receive the
+
+
+
+
+Talpey & Callaghan Standards Track [Page 11]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ data or leave the string or buffer fields null and allow the XDR
+ decode stage of RPC processing to automatically allocate storage of
+ sufficient size.
+
+ When decoding a message from an RDMA transport, the receiver first
+ XDR decodes the chunk lists from the RPC-over-RDMA header, then
+ proceeds to decode the body of the RPC message (arguments or
+ results). Whenever the XDR offset in the decode stream matches that
+ of a chunk in the read chunk list, the XDR routine initiates an RDMA
+ Read to bring over the chunk data into locally registered memory for
+ the destination buffer.
+
+ When processing an RPC request, the RPC receiver (RPC server)
+ acknowledges its completion of use of the source buffers by simply
+ replying to the RPC sender (client), and the peer may then free all
+ source buffers advertised by the request.
+
+ When processing an RPC reply, after completing such a transfer, the
+ RPC receiver (client) MUST issue an RDMA_DONE message (described in
+ Section 3.8) to notify the peer (server) that the source buffers can
+ be freed.
+
+ The read chunk list is constructed and used entirely within the
+ RPC/XDR layer. Other than specifying the minimum chunk size, the
+ management of the read chunk list is automatic and transparent to an
+ RPC application.
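+
+ The decode-time check described above might look like the following
+ sketch; the rdma_read() stub and all names are assumptions of this
+ example rather than a defined interface:
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct read_chunk {
+        uint32_t position;   /* XDR stream position of the chunk */
+        uint32_t handle;     /* remote steering tag */
+        uint32_t length;     /* chunk length in bytes */
+        uint64_t offset;     /* remote address or offset */
+    };
+
+    /* Stub standing in for the provider's RDMA Read operation. */
+    static int rdma_read(uint32_t handle, uint64_t offset,
+                         void *dst, uint32_t len)
+    {
+        (void)handle; (void)offset; (void)dst; (void)len;
+        return 0;
+    }
+
+    /* While decoding, whenever the current XDR offset matches the
+       position of the next read chunk, fetch that chunk's data into
+       locally registered memory via RDMA Read. */
+    static int maybe_fetch_chunk(uint32_t xdr_offset,
+                                 const struct read_chunk *rc,
+                                 void *dstbuf)
+    {
+        if (rc != NULL && rc->position == xdr_offset)
+            return rdma_read(rc->handle, rc->offset,
+                             dstbuf, rc->length);
+        return 0;   /* nothing to fetch at this offset */
+    }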
+
+3.6. XDR Decoding with Write Chunks
+
+ When a write chunk list is provided for the results of the RPC call,
+ the RPC server MUST provide any corresponding data via RDMA Write to
+ the memory referenced in the chunk list entries. The RPC reply
+ conveys this by returning the write chunk list to the client with the
+ lengths rewritten to match the actual transfer. The XDR decode of
+ the reply therefore performs no local data transfer but merely
+ returns the length obtained from the reply.
+
+ Each decoded result consumes one entry in the write chunk list, which
+ in turn consists of an array of RDMA segments. The length is
+ therefore the sum of all returned lengths in all segments comprising
+ the corresponding list entry. As each list entry is decoded, the
+ entire entry is consumed.
+
+ The write chunk list is constructed and used by the RPC application.
+ The RPC/XDR layer simply conveys the list between client and server
+ and initiates the RDMA Writes back to the client. The mapping of
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 12]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ write chunk list entries to procedure arguments MUST be determined
+ for each protocol. An example of a mapping is described in
+ [RFC5667].
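+
+ A server-side sketch of filling one write chunk list entry follows.
+ It assumes a hypothetical rdma_write() stub and illustrative names;
+ the point is that each segment's length field is rewritten to the
+ number of bytes actually placed, which is what the reply reports:
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct rdma_segment {
+        uint32_t handle;
+        uint32_t length;   /* capacity on input; bytes written on reply */
+        uint64_t offset;
+    };
+
+    /* Stub standing in for the provider's RDMA Write operation. */
+    static int rdma_write(const struct rdma_segment *dst,
+                          const void *src, uint32_t len)
+    {
+        (void)dst; (void)src; (void)len;
+        return 0;
+    }
+
+    /* Place 'resultlen' bytes of result data into one write chunk
+       (an array of segments), filling segments sequentially without
+       gaps and recording the actual length written in each. */
+    static int fill_write_chunk(struct rdma_segment *segs, size_t nsegs,
+                                const uint8_t *result, uint32_t resultlen)
+    {
+        uint32_t remaining = resultlen;
+        for (size_t i = 0; i < nsegs; i++) {
+            uint32_t n = (remaining < segs[i].length)
+                             ? remaining : segs[i].length;
+            if (n != 0 && rdma_write(&segs[i], result, n) != 0)
+                return -1;
+            segs[i].length = n;   /* actual transfer length */
+            result += n;
+            remaining -= n;
+        }
+        return (remaining != 0) ? -1 : 0;  /* -1: entry too small */
+    }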
+
+3.7. XDR Roundup and Chunks
+
+ The XDR protocol requires 4-byte alignment of each new encoded
+ element in any XDR stream. This requirement is for efficiency and
+ ease of decode/unmarshaling at the receiver -- if the XDR stream
+ buffer begins on a native machine boundary, then the XDR elements
+ will lie on similarly predictable offsets in memory.
+
+ Within XDR, when non-4-byte encodes (such as an odd-length string or
+ bulk data) are marshaled, their length is encoded literally, while
+ their data is padded to begin the next element at a 4-byte boundary
+ in the XDR stream. For TCP or RDMA inline encoding, this minimal
+ overhead is required because the transport-specific framing relies on
+ the fact that the relative offset of the elements in the XDR stream
+ from the start of the message determines the XDR position during
+ decode.
+
+ On the other hand, RPC/RDMA Read chunks carry the XDR position of
+ each chunked element and length of the Chunk segment, and can be
+ placed by the receiver exactly where they belong in the receiver's
+ memory without regard to the alignment of their position in the XDR
+ stream. Since any rounded-up data is not actually part of the upper
+ layer's message, the receiver will not reference it, and there is no
+ reason to set it to any particular value in the receiver's memory.
+
+ When roundup is present at the end of a sequence of chunks, the
+ length of the sequence will terminate it at a non-4-byte XDR
+ position. When the receiver proceeds to decode the remaining part of
+ the XDR stream, it inspects the XDR position indicated by the next
+ chunk. Because this position will not match (else roundup would not
+ have occurred), the receiver decoding will fall back to inspecting
+ the remaining inline portion. If in turn, no data remains to be
+ decoded from the inline portion, then the receiver MUST conclude that
+ roundup is present, and therefore it advances the XDR decode position
+ to that indicated by the next chunk (if any). In this way, roundup
+ is passed without ever actually transferring additional XDR bytes.
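+
+ As a small worked sketch of the receiver's arithmetic (the macro and
+ function names are illustrative), the decode position past a chunked
+ counted field advances by the 4-byte length plus the rounded-up data
+ length, even though the roundup bytes are never transferred:
+
+    #include <stdint.h>
+
+    /* XDR rounds each element up to a 4-byte boundary. */
+    #define XDR_ALIGN(len)  (((len) + 3u) & ~3u)
+
+    /* Advance the decode position past a chunked counted field whose
+       4-byte length was decoded inline at position 'pos'. */
+    static uint32_t skip_chunked_field(uint32_t pos, uint32_t len)
+    {
+        return pos + 4u + XDR_ALIGN(len);
+    }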
+
+ Some protocol operations over RPC/RDMA, for instance NFS writes of
+ data encountered at the end of a file or in direct I/O situations,
+ commonly yield these roundups within RDMA Read Chunks. Because any
+ roundup bytes are not actually present in the data buffers being
+ written, memory for these bytes would come from noncontiguous
+ buffers, either as an additional memory registration segment or as an
+ additional Chunk. The overhead of these operations can be
+
+
+
+Talpey & Callaghan Standards Track [Page 13]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ significant for the sender, which must marshal them, and even higher
+ for the receiver, to which they must be transferred. Senders SHOULD therefore avoid
+ encoding individual RDMA Read Chunks for roundup whenever possible.
+ It is acceptable, but not necessary, to include roundup data in an
+ existing RDMA Read Chunk, but only if it is already present in the
+ XDR stream to carry upper-layer data.
+
+ Note that there is no exposure of additional data at the sender due
+ to eliding roundup data from the XDR stream, since any additional
+ sender buffers are never exposed to the peer. The data is literally
+ not there to be transferred.
+
+ For RDMA Write Chunks, a simpler encoding method applies. Again,
+ roundup bytes are not transferred, instead the chunk length sent to
+ the receiver in the reply is simply increased to include any roundup.
+ Because of the requirement that the RDMA Write Chunks are filled
+ sequentially without gaps, this situation can only occur on the final
+ chunk receiving data. Therefore, there is no opportunity for roundup
+ data to insert misalignment or positional gaps into the XDR stream.
+
+3.8. RPC Call and Reply
+
+ The RDMA transport for RPC provides three methods of moving data
+ between RPC client and server:
+
+ Inline
+ Data is moved between RPC client and server within an RDMA Send.
+
+ RDMA Read
+ Data is moved between RPC client and server via an RDMA Read
+ operation via steering tag; address and offset obtained from a
+ read chunk list.
+
+ RDMA Write
+ Result data is moved from RPC server to client via an RDMA Write
+ operation via steering tag; address and offset obtained from a
+ write chunk list or reply chunk in the client's RPC call
+ message.
+
+ These methods of data movement may occur in combinations within a
+ single RPC. For instance, an RPC call may contain some inline data
+ along with some large chunks to be transferred via RDMA Read to the
+ server. The reply to that call may have some result chunks that the
+ server RDMA Writes back to the client. The following protocol
+ interactions illustrate RPC calls that use these methods to move RPC
+ message data:
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 14]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ An RPC with write chunks in the call message:
+
+ RPC Client RPC Server
+ | RPC Call + Write Chunk list |
+ Send | ------------------------------> |
+ | |
+ | Chunk 1 |
+ | <------------------------------ | Write
+ | : |
+ | Chunk n |
+ | <------------------------------ | Write
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+ In the presence of write chunks, RDMA ordering provides the guarantee
+ that all data in the RDMA Write operations has been placed in memory
+ prior to the client's RPC reply processing.
+
+ An RPC with read chunks in the call message:
+
+ RPC Client RPC Server
+ | RPC Call + Read Chunk list |
+ Send | ------------------------------> |
+ | |
+ | Chunk 1 |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | : |
+ | Chunk n |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | |
+ | RPC Reply |
+ | <------------------------------ | Send
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 15]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ An RPC with read chunks in the reply message:
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC Reply + Read Chunk list |
+ | <------------------------------ | Send
+ | |
+ | Chunk 1 |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | : |
+ | Chunk n |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | |
+ | Done |
+ Send | ------------------------------> |
+
+ The final Done message allows the RPC client to signal the server
+ that it has received the chunks, so the server can de-register and
+ free the memory holding the chunks. A Done completion is not
+ necessary for an RPC call, since the RPC reply Send is itself a
+ receive completion notification. In the event that the client fails
+ to return the Done message within some timeout period, the server MAY
+ conclude that a protocol violation has occurred and close the RPC
+ connection, or it MAY proceed with a de-register and free its chunk
+ buffers. This may result in a fatal RDMA error if the client later
+ attempts to perform an RDMA Read operation, which amounts to the same
+ thing.
+
+ The use of read chunks in RPC reply messages is much less efficient
+ than providing write chunks in the originating RPC calls, due to the
+ additional message exchanges, the need for the RPC server to
+ advertise buffers to the peer, the necessity of the server
+ maintaining a timer for the purpose of recovery from misbehaving
+ clients, and the need for additional memory registration. Their use
+ is NOT RECOMMENDED by upper layers where efficiency is a primary
+ concern [RFC5667]. However, they MAY be employed by upper-layer
+ protocol bindings that are primarily concerned with transparency,
+ since they can frequently be implemented completely within the RPC
+ lower layers.
+
+ It is important to note that the Done message consumes a credit at
+ the RPC server. The RPC server SHOULD provide sufficient credits to
+ the client to allow the Done message to be sent without deadlock
+ (driving the outstanding credit count to zero). The RPC client MUST
+
+
+
+Talpey & Callaghan Standards Track [Page 16]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ account for its required Done messages to the server in its
+ accounting of available credits, and the server SHOULD replenish any
+ credit consumed by its use of such exchanges at its earliest
+ opportunity.
+
+ Finally, it is possible to conceive of RPC exchanges that involve any
+ or all combinations of write chunks in the RPC call, read chunks in
+ the RPC call, and read chunks in the RPC reply. Support for such
+ exchanges is straightforward from a protocol perspective, but in
+ practice such exchanges would be quite rare, limited to upper-layer
+ protocol exchanges that transferred bulk data in both the call and
+ corresponding reply.
+
+3.9. Padding
+
+ Alignment of specific opaque data enables certain scatter/gather
+ optimizations. Padding leverages the useful property that RDMA
+ transfers preserve alignment of data, even when they are placed into
+ pre-posted receive buffers by Sends.
+
+ Many servers can make good use of such padding. Padding allows the
+ chaining of RDMA receive buffers such that any data transferred by
+ RDMA on behalf of RPC requests will be placed into appropriately
+ aligned buffers on the system that receives the transfer. In this
+ way, the need for servers to perform RDMA Read to satisfy all but the
+ largest client writes is obviated.
+
+ The effect of padding is demonstrated below showing prior bytes on an
+ XDR stream ("XXX" in the figure below) followed by an opaque field
+ consisting of four length bytes ("LLLL") followed by data bytes
+ ("DDD"). The receiver of the RDMA Send has posted two chained
+ receive buffers. Without padding, the opaque data is split across
+ the two buffers. With the addition of padding bytes ("ppp") prior to
+ the first data byte, the data can be forced to align correctly in the
+ second buffer.
+
+ Buffer 1 Buffer 2
+ Unpadded -------------- --------------
+
+
+ XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD
+
+
+ Padded
+
+
+ XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD
+
+
+
+
+Talpey & Callaghan Standards Track [Page 17]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Padding is implemented completely within the RDMA transport encoding,
+ flagged with a specific message type. Where padding is applied, two
+ values are passed to the peer: an "rdma_align", which is the padding
+ value used, and "rdma_thresh", which is the opaque data size at or
+ above which padding is applied. For instance, if the server is using
+ chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
+ could be used to achieve alignment of the data. The XDR routine at
+ the peer MUST consult these values when decoding opaque values.
+ Where the decoded length exceeds the rdma_thresh, the XDR decode MUST
+ skip over the appropriate padding as indicated by rdma_align and the
+ current XDR stream position.
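+
+ One plausible reading of this computation is sketched below. It
+ assumes that the pad brings the first data byte of a qualifying
+ opaque field to an rdma_align boundary in the XDR stream; the
+ function name is illustrative, and an implementation may compute the
+ skip differently, provided both peers agree:
+
+    #include <stdint.h>
+
+    /* Pad bytes preceding an opaque field of length 'len' whose data
+       would begin at XDR stream offset 'xdr_off', given the
+       rdma_align and rdma_thresh values from an RDMA_MSGP header. */
+    static uint32_t rpcrdma_pad_bytes(uint32_t xdr_off, uint32_t len,
+                                      uint32_t rdma_align,
+                                      uint32_t rdma_thresh)
+    {
+        if (rdma_align == 0 || len < rdma_thresh)
+            return 0;   /* padding applies only at or above threshold */
+        return (rdma_align - (xdr_off % rdma_align)) % rdma_align;
+    }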
+
+4. RPC RDMA Message Layout
+
+ RPC call and reply messages are conveyed across an RDMA transport
+ with a prepended RPC-over-RDMA header. The RPC-over-RDMA header
+ includes data for RDMA flow control credits, padding parameters, and
+ lists of addresses that provide direct data placement via RDMA Read
+ and Write operations. The layout of the RPC message itself is
+ unchanged from that described in [RFC5531] except for the possible
+ exclusion of large data chunks that will be moved by RDMA Read or
+ Write operations. If the RPC message (along with the RPC-over-RDMA
+ header) is too long for the posted receive buffer (even after any
+ large chunks are removed), then the entire RPC message MAY be moved
+ separately as a chunk, leaving just the RPC-over-RDMA header in the
+ RDMA Send.
+
+4.1. RPC-over-RDMA Header
+
+ The RPC-over-RDMA header begins with four 32-bit fields that are
+ always present and that control the RDMA interaction including RDMA-
+ specific flow control. These are then followed by a number of items
+ such as chunk lists and padding that MAY or MUST NOT be present
+ depending on the type of transmission. The four fields that are
+ always present are:
+
+ 1. Transaction ID (XID).
+ The XID generated for the RPC call and reply. Having the XID at
+ the beginning of the message makes it easy to establish the
+ message context. This XID MUST be the same as the XID in the RPC
+ header. The receiver MAY perform its processing based solely on
+ the XID in the RPC-over-RDMA header, and thereby ignore the XID in
+ the RPC header, if it so chooses.
+
+ 2. Version number.
+ This version of the RPC RDMA message protocol is 1. The version
+ number MUST be increased by 1 whenever the format of the RPC RDMA
+ messages is changed.
+
+
+
+Talpey & Callaghan Standards Track [Page 18]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ 3. Flow control credit value.
+ When sent in an RPC call message, the requested value is provided.
+ When sent in an RPC reply message, the granted value is returned.
+ RPC calls SHOULD NOT be sent in excess of the currently granted
+ limit.
+
+ 4. Message type.
+
+ o RDMA_MSG = 0 indicates that chunk lists and RPC message follow.
+
+ o RDMA_NOMSG = 1 indicates that after the chunk lists there is no
+ RPC message. In this case, the chunk lists provide information
+ to allow the message proper to be transferred using RDMA Read
+ or Write; the message is thus not appended to the RPC-over-RDMA header.
+
+ o RDMA_MSGP = 2 indicates that a chunk list and RPC message with
+ some padding follow.
+
+ o RDMA_DONE = 3 indicates that the message signals the completion
+ of a chunk transfer via RDMA Read.
+
+ o RDMA_ERROR = 4 is used to signal any detected error(s) in the
+ RPC RDMA chunk encoding.
+
+ Because the version number is encoded as part of this header, and the
+ RDMA_ERROR message type is used to indicate errors, these first four
+ fields and the start of the following message body MUST always remain
+ aligned at these fixed offsets for all versions of the RPC-over-RDMA
+ header.
+
+ For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
+ chunk lists follow. If the Read chunk list is null (a 32-bit word of
+ zeros), then there are no chunks to be transferred separately and the
+ RPC message follows in its entirety. If non-null, then it's the
+ beginning of an XDR encoded sequence of Read chunk list entries. If
+ the Write chunk list is non-null, then an XDR encoded sequence of
+ Write chunk entries follows.
+
+ If the message type is RDMA_MSGP, then two additional fields that
+ specify the padding alignment and threshold are inserted prior to the
+ Read and Write chunk lists.
+
+ A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by
+ the RPC call or RPC reply message body, beginning with the XID. The
+ XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 19]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ +--------+---------+---------+-----------+-------------+----------
+ | | | | Message | NULLs | RPC Call
+ | XID | Version | Credits | Type | or | or
+ | | | | | Chunk Lists | Reply Msg
+ +--------+---------+---------+-----------+-------------+----------
+
+ Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or
+ RPC message follows. As an implementation hint: a gather operation
+ on the Send of the RDMA RPC message can be used to marshal the
+ initial header, the chunk list, and the RPC message itself.
+
+4.2. RPC-over-RDMA Header Errors
+
+ When a peer receives an RPC RDMA message, it MUST perform the
+ following basic validity checks on the header and chunk contents. If
+ such errors are detected in the request, an RDMA_ERROR reply MUST be
+ generated.
+
+ Two types of errors are defined, version mismatch and invalid chunk
+ format. When the peer detects an RPC-over-RDMA header version that
+ it does not support (currently this document defines only version 1),
+ it replies with an error code of ERR_VERS, and provides the low and
+ high inclusive version numbers it does, in fact, support. The
+ version number in this reply MUST be a value otherwise valid at the
+ receiver. When other decoding errors are detected in the header or
+ chunks, either an RPC decode error MAY be returned or the RPC/RDMA
+ error code ERR_CHUNK MUST be returned.
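+
+ A minimal sketch of the version check and the resulting ERR_VERS
+ reply body follows; it reuses names from the XDR in Section 4.3,
+ while the helper function itself is purely illustrative:
+
+    #include <stdint.h>
+
+    enum { RPCRDMA_VERSION = 1 };
+    enum rpc_rdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };
+
+    struct rpc_rdma_error {
+        enum rpc_rdma_errcode err;
+        uint32_t rdma_vers_low;    /* meaningful only for ERR_VERS */
+        uint32_t rdma_vers_high;
+    };
+
+    /* Validate the version field of a received RPC-over-RDMA header.
+       On mismatch, build an ERR_VERS body advertising the inclusive
+       range of versions this receiver supports (here, only 1). */
+    static int check_rdma_version(uint32_t rdma_vers,
+                                  struct rpc_rdma_error *e)
+    {
+        if (rdma_vers == RPCRDMA_VERSION)
+            return 0;
+        e->err = ERR_VERS;
+        e->rdma_vers_low = RPCRDMA_VERSION;
+        e->rdma_vers_high = RPCRDMA_VERSION;
+        return -1;
+    }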
+
+4.3. XDR Language Description
+
+ Here is the message layout in XDR language.
+
+ struct xdr_rdma_segment {
+ uint32 handle; /* Registered memory handle */
+ uint32 length; /* Length of the chunk in bytes */
+ uint64 offset; /* Chunk virtual address or offset */
+ };
+
+ struct xdr_read_chunk {
+ uint32 position; /* Position in XDR stream */
+ struct xdr_rdma_segment target;
+ };
+
+ struct xdr_read_list {
+ struct xdr_read_chunk entry;
+ struct xdr_read_list *next;
+ };
+
+
+
+
+Talpey & Callaghan Standards Track [Page 20]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ struct xdr_write_chunk {
+ struct xdr_rdma_segment target<>;
+ };
+
+ struct xdr_write_list {
+ struct xdr_write_chunk entry;
+ struct xdr_write_list *next;
+ };
+
+ struct rdma_msg {
+ uint32 rdma_xid; /* Mirrors the RPC header xid */
+ uint32 rdma_vers; /* Version of this protocol */
+ uint32 rdma_credit; /* Buffers requested/granted */
+ rdma_body rdma_body;
+ };
+
+ enum rdma_proc {
+ RDMA_MSG=0, /* An RPC call or reply msg */
+ RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
+ RDMA_MSGP=2, /* An RPC call or reply msg with padding */
+ RDMA_DONE=3, /* Client signals reply completion */
+ RDMA_ERROR=4 /* An RPC RDMA encoding error */
+ };
+
+ union rdma_body switch (rdma_proc proc) {
+ case RDMA_MSG:
+ rpc_rdma_header rdma_msg;
+ case RDMA_NOMSG:
+ rpc_rdma_header_nomsg rdma_nomsg;
+ case RDMA_MSGP:
+ rpc_rdma_header_padded rdma_msgp;
+ case RDMA_DONE:
+ void;
+ case RDMA_ERROR:
+ rpc_rdma_error rdma_error;
+ };
+
+ struct rpc_rdma_header {
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+ /* rpc body follows */
+ };
+
+ struct rpc_rdma_header_nomsg {
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+
+
+
+Talpey & Callaghan Standards Track [Page 21]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ };
+
+ struct rpc_rdma_header_padded {
+ uint32 rdma_align; /* Padding alignment */
+ uint32 rdma_thresh; /* Padding threshold */
+ struct xdr_read_list *rdma_reads;
+ struct xdr_write_list *rdma_writes;
+ struct xdr_write_chunk *rdma_reply;
+ /* rpc body follows */
+ };
+
+ enum rpc_rdma_errcode {
+ ERR_VERS = 1,
+ ERR_CHUNK = 2
+ };
+
+ union rpc_rdma_error switch (rpc_rdma_errcode err) {
+ case ERR_VERS:
+ uint32 rdma_vers_low;
+ uint32 rdma_vers_high;
+ case ERR_CHUNK:
+ void;
+ default:
+ uint32 rdma_extra[8];
+ };
+
+5. Long Messages
+
+ The receiver of RDMA Send messages is required by RDMA to have
+ previously posted one or more adequately sized buffers. The RPC
+ client can inform the server of the maximum size of its RDMA Send
+ messages via the Connection Configuration Protocol described later in
+ this document.
+
+ Since RPC messages are frequently small, memory savings can be
+ achieved by posting small buffers. Even large messages like NFS READ
+ or WRITE will be quite small once the chunks are removed from the
+ message. However, there may be large messages that would demand a
+ very large buffer be posted, where the contents of the buffer may not
+ be a chunkable XDR element. A good example is an NFS READDIR reply,
+ which may contain a large number of small filename strings. Also,
+ the NFS version 4 protocol [RFC3530] features COMPOUND request and
+ reply messages of unbounded length.
+
+ Ideally, each upper layer will negotiate these limits. However, it
+ is frequently necessary to provide a transparent solution.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 22]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+5.1. Message as an RDMA Read Chunk
+
+ One relatively simple method is to have the client identify any RPC
+ message that exceeds the RPC server's posted buffer size and move it
+ separately as a chunk, i.e., reference it as the first entry in the
+ read chunk list with an XDR position of zero.
+
+ Normal Message
+
+ +--------+---------+---------+------------+-------------+----------
+ | | | | | | RPC Call
+ | XID | Version | Credits | RDMA_MSG | Chunk Lists | or
+ | | | | | | Reply Msg
+ +--------+---------+---------+------------+-------------+----------
+
+ Long Message
+
+ +--------+---------+---------+------------+-------------+
+ | | | | | |
+ | XID | Version | Credits | RDMA_NOMSG | Chunk Lists |
+ | | | | | |
+ +--------+---------+---------+------------+-------------+
+ |
+ | +----------
+ | | Long RPC Call
+ +->| or
+ | Reply Message
+ +----------
+
+ If the receiver gets an RPC-over-RDMA header with a message type of
+ RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
+ position, it allocates a registered buffer and issues an RDMA Read of
+ the long RPC message into it. The receiver then proceeds to XDR
+ decode the RPC message as if it had received it inline with the Send
+ data. Further decoding may issue additional RDMA Reads to bring over
+ additional chunks.
+
+ Although the handling of long messages requires one extra network
+ turnaround, in practice these messages will be rare if the posted
+ receive buffers are correctly sized, and of course they will be
+ non-existent for RDMA-aware upper layers.
+
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 23]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ A long call RPC with request supplied via RDMA Read
+
+ RPC Client RPC Server
+ | RPC-over-RDMA Header |
+ Send | ------------------------------> |
+ | |
+ | Long RPC Call Msg |
+ | +------------------------------ | Read
+ | v-----------------------------> |
+ | |
+ | RPC-over-RDMA Reply |
+ | <------------------------------ | Send
+
+ An RPC with long reply returned via RDMA Read
+
+ RPC Client RPC Server
+ | RPC Call |
+ Send | ------------------------------> |
+ | |
+ | RPC-over-RDMA Header |
+ | <------------------------------ | Send
+ | |
+ | Long RPC Reply Msg |
+ Read | ------------------------------+ |
+ | <-----------------------------v |
+ | |
+ | Done |
+ Send | ------------------------------> |
+
+ It is possible for a single RPC procedure to employ both a long call
+ for its arguments and a long reply for its results. However, such an
+ operation is atypical, as few upper layers define such exchanges.
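+
+ The receiver-side test for a long message, as described at the start
+ of this section, can be sketched as follows; the structure and
+ function names are assumptions of this example:
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    enum rdma_proc { RDMA_MSG = 0, RDMA_NOMSG = 1, RDMA_MSGP = 2,
+                     RDMA_DONE = 3, RDMA_ERROR = 4 };
+
+    struct read_chunk {
+        uint32_t position;   /* XDR position; zero for a long message */
+        uint32_t handle;
+        uint32_t length;
+        uint64_t offset;
+    };
+
+    /* True if the header carries no inline RPC body (RDMA_NOMSG) and
+       the first read chunk sits at XDR position zero; the receiver
+       then RDMA Reads the entire RPC message from that chunk. */
+    static bool is_long_message(enum rdma_proc proc,
+                                const struct read_chunk *first)
+    {
+        return proc == RDMA_NOMSG &&
+               first != NULL && first->position == 0;
+    }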
+
+5.2. RDMA Write of Long Replies (Reply Chunks)
+
+ A superior method of handling long RPC replies is to have the RPC
+ client post a large buffer into which the server can write a large
+ RPC reply. This has the advantage that an RDMA Write may be slightly
+ faster in network latency than an RDMA Read, and does not require the
+ server to wait for the completion as it must for RDMA Read.
+ Additionally, for a reply it removes the need for an RDMA_DONE
+ message if the large reply is returned as a Read chunk.
+
+ This protocol supports direct return of a large reply via the
+ inclusion of an OPTIONAL rdma_reply write chunk after the read chunk
+ list and the write chunk list. The client allocates a buffer sized
+ to receive a large reply and enters its steering tag, address and
+ length in the rdma_reply write chunk. If the reply message is too
+
+
+
+Talpey & Callaghan Standards Track [Page 24]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ long to return inline with an RDMA Send (exceeds the size of the
+ client's posted receive buffer), even with read chunks removed, then
+ the RPC server performs an RDMA Write of the RPC reply message into
+ the buffer indicated by the rdma_reply chunk. If the client doesn't
+ provide an rdma_reply chunk, or if it's too small, then if the upper-
+ layer specification permits, the message MAY be returned as a Read
+ chunk.
+
+ An RPC with long reply returned via RDMA Write
+
+
+ RPC Client RPC Server
+ | RPC Call with rdma_reply |
+ Send | ------------------------------> |
+ | |
+ | Long RPC Reply Msg |
+ | <------------------------------ | Write
+ | |
+ | RPC-over-RDMA Header |
+ | <------------------------------ | Send
+
+ The use of RDMA Write to return long replies requires that the client
+ applications anticipate a long reply and have some knowledge of its
+ size so that an adequately sized buffer can be allocated. This is
+ certainly true of NFS READDIR replies, where the client already
+ provides an upper bound on the size of the encoded directory fragment
+ to be returned by the server.
+
+ The use of these "reply chunks" is highly efficient and convenient
+ for both RPC client and server. Their use is encouraged for eligible
+ RPC operations such as NFS READDIR, which would otherwise require
+ extensive chunk management within the results or use of RDMA Read and
+ a Done message [RFC5667].
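+
+ The server's choice among the reply methods discussed in this
+ section can be sketched as below; the enum and parameter names are
+ illustrative, and the inline limit stands for the client's posted
+ receive buffer size less the RPC-over-RDMA header:
+
+    #include <stddef.h>
+
+    enum reply_method {
+        REPLY_INLINE,        /* fits in a single RDMA Send */
+        REPLY_WRITE_CHUNK,   /* RDMA Write into the rdma_reply chunk */
+        REPLY_READ_CHUNK     /* only if the upper layer permits */
+    };
+
+    /* reply_chunk_len is the capacity advertised by the client in
+       rdma_reply, or 0 if no reply chunk was provided. */
+    static enum reply_method choose_reply_method(size_t reply_len,
+                                                 size_t inline_limit,
+                                                 size_t reply_chunk_len)
+    {
+        if (reply_len <= inline_limit)
+            return REPLY_INLINE;
+        if (reply_len <= reply_chunk_len)
+            return REPLY_WRITE_CHUNK;
+        return REPLY_READ_CHUNK;
+    }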
+
+6. Connection Configuration Protocol
+
+ RDMA Send operations require the receiver to post one or more buffers
+ at the RDMA connection endpoint, each large enough to receive the
+ largest Send message. Buffers are consumed as Send messages are
+ received. If a buffer is too small, or if there are no buffers
+ posted, the RDMA transport MAY return an error and break the RDMA
+ connection. The receiver MUST post sufficient, adequately sized buffers to
+ avoid buffer overrun or capacity errors.
+
+ The protocol described above includes only a mechanism for managing
+ the number of such receive buffers and no explicit features to allow
+ the RPC client and server to provision or control buffer sizing, nor
+ any other session parameters.
+
+
+
+Talpey & Callaghan Standards Track [Page 25]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ In the past, this type of connection management has not been
+ necessary for RPC. RPC over UDP or TCP does not have a protocol to
+ negotiate the link. The server can get a rough idea of the maximum
+ size of messages from the server protocol code. However, a protocol
+ to negotiate transport features on a more dynamic basis is desirable.
+
+ The Connection Configuration Protocol allows the client to pass its
+ connection requirements to the server, and allows the server to
+ inform the client of its connection limits.
+
+ Use of the Connection Configuration Protocol by an upper layer is
+ OPTIONAL.
+
+6.1. Initial Connection State
+
+ This protocol MAY be used for connection setup prior to the use of
+ another RPC protocol that uses the RDMA transport. It operates
+ in-band, i.e., it uses the connection itself to negotiate the
+ connection parameters. To provide a basis for connection
+ negotiation, the connection is assumed to provide a basic level of
+ interoperability: the ability to exchange at least one RPC message at
+ a time that is at least 1 KB in size. The server MAY exceed this
+ basic level of configuration, but the client MUST NOT assume more
+ than one, and MUST receive a valid reply from the server carrying the
+ actual number of available receive messages, prior to sending its
+ next request.
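+
+   The client-side state assumed before the server's reply arrives can
+   be summarized by the following non-normative C sketch; the structure
+   and field names are illustrative only.
+
+      #include <stddef.h>
+
+      struct rdma_conn_state {
+          size_t       inline_send_max;  /* bytes per inline message  */
+          unsigned int recv_credits;     /* messages peer will accept */
+      };
+
+      /* Basic level of interoperability assumed until the server's  */
+      /* configuration reply has been received.                      */
+      static const struct rdma_conn_state initial_state = {
+          .inline_send_max = 1024,   /* at least 1 KB                 */
+          .recv_credits    = 1       /* MUST NOT assume more than one */
+      };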
+
+6.2. Protocol Description
+
+ Version 1 of the Connection Configuration Protocol consists of a
+ single procedure that allows the client to inform the server of its
+ connection requirements and the server to return connection
+ information to the client.
+
+ The maxcall_sendsize argument is the maximum size of an RPC call
+ message that the client MAY send inline in an RDMA Send message to
+ the server. The server MAY return a maxcall_sendsize value that is
+ smaller or larger than the client's request. The client MUST NOT
+ send an inline call message larger than what the server will accept.
+ The maxcall_sendsize limits only the size of inline RPC calls. It
+ does not limit the size of long RPC messages transferred as an
+ initial chunk in the Read chunk list.
+
+ The maxreply_sendsize is the maximum size of an inline RPC message
+ that the client will accept from the server.
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 26]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ The maxrdmaread is the maximum number of RDMA Reads that may be
+   active at the peer.  This number correlates to the incoming RDMA
+ Read count ("IRD") configured into each originating endpoint by the
+ client or server. If more than this number of RDMA Read operations
+ by the connected peer are issued simultaneously, connection loss or
+ suboptimal flow control may result; therefore, the value SHOULD be
+ observed at all times. The peers' values need not be equal. If
+ zero, the peer MUST NOT issue requests that require RDMA Read to
+ satisfy, as no transfer will be possible.
+
+ The align value is the value recommended by the server for opaque
+ data values such as strings and counted byte arrays. The client MAY
+ use this value to compute the number of prepended pad bytes when XDR
+ encoding opaque values in the RPC call message.
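+
+   As a non-normative illustration, the number of prepended pad bytes
+   for a given XDR stream offset and recommended alignment may be
+   computed as follows.
+
+      #include <stddef.h>
+
+      /* Pad bytes to prepend so that the opaque data lands on the  */
+      /* server's recommended alignment boundary.                   */
+      size_t
+      xdr_pad_bytes(size_t xdr_offset, size_t align)
+      {
+          if (align == 0)
+              return 0;               /* no recommendation in effect */
+          return (align - (xdr_offset % align)) % align;
+      }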
+
+ typedef unsigned int uint32;
+
+ struct config_rdma_req {
+ uint32 maxcall_sendsize;
+ /* max size of inline RPC call */
+ uint32 maxreply_sendsize;
+ /* max size of inline RPC reply */
+ uint32 maxrdmaread;
+ /* max active RDMA Reads at client */
+ };
+
+ struct config_rdma_reply {
+ uint32 maxcall_sendsize;
+ /* max call size accepted by server */
+ uint32 align;
+ /* server's receive buffer alignment */
+ uint32 maxrdmaread;
+ /* max active RDMA Reads at server */
+ };
+
+ program CONFIG_RDMA_PROG {
+ version VERS1 {
+ /*
+ * Config call/reply
+ */
+ config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
+ } = 1;
+ } = 100417;
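+
+   The following non-normative sketch shows how a client might apply
+   the server's reply to its local connection limits.  The C structures
+   mirror the XDR above, and the local_send_capacity parameter is an
+   assumed client-side bound.
+
+      #include <stdint.h>
+
+      struct config_rdma_reply_c {
+          uint32_t maxcall_sendsize;   /* accepted by server        */
+          uint32_t align;              /* server's buffer alignment */
+          uint32_t maxrdmaread;        /* active Reads at server    */
+      };
+
+      struct client_limits {
+          uint32_t inline_call_max;    /* largest inline call       */
+          uint32_t rdma_read_max;      /* Reads allowed at server   */
+          uint32_t pad_align;          /* alignment for opaque pads */
+      };
+
+      void
+      apply_config_reply(struct client_limits *lim,
+                         const struct config_rdma_reply_c *rep,
+                         uint32_t local_send_capacity)
+      {
+          /* Never send an inline call larger than the server       */
+          /* accepts or than the local implementation can build.    */
+          lim->inline_call_max =
+              rep->maxcall_sendsize < local_send_capacity
+                  ? rep->maxcall_sendsize : local_send_capacity;
+          lim->rdma_read_max = rep->maxrdmaread; /* zero forbids Reads */
+          lim->pad_align     = rep->align;
+      }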
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 27]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+7. Memory Registration Overhead
+
+ RDMA requires that all data be transferred between registered memory
+ regions at the source and destination. All protocol headers as well
+ as separately transferred data chunks use registered memory. Since
+ the cost of registering and de-registering memory can be a large
+ proportion of the RDMA transaction cost, it is important to minimize
+ registration activity. This is easily achieved within RPC controlled
+ memory by allocating chunk list data and RPC headers in a reusable
+ way from pre-registered pools.
+
+ The data chunks transferred via RDMA MAY occupy memory that persists
+ outside the bounds of the RPC transaction. Hence, the default
+ behavior of an RPC-over-RDMA transport is to register and de-register
+ these chunks on every transaction. However, this is not a limitation
+ of the protocol -- only of the existing local RPC API. The API is
+ easily extended through such functions as rpc_control(3) to change
+ the default behavior so that the application can assume
+ responsibility for controlling memory registration through an RPC-
+ provided registered memory allocator.
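+
+   A non-normative sketch of such a pre-registered pool follows; the
+   register_region() routine and the handle field stand in for the
+   local provider's memory registration interface.
+
+      #include <stdlib.h>
+
+      /* Assumed helper: register [addr, addr+len) with the RDMA    */
+      /* provider and return a steering tag in *handle.             */
+      extern int register_region(void *addr, size_t len,
+                                 unsigned long *handle);
+
+      struct reg_buf {
+          void           *addr;
+          size_t          len;
+          unsigned long   handle;    /* steering tag   */
+          struct reg_buf *next;      /* free-list link */
+      };
+
+      /* Build a free list of nbufs buffers, each registered exactly */
+      /* once; callers reuse them for headers and chunk data.        */
+      struct reg_buf *
+      pool_create(size_t nbufs, size_t bufsize)
+      {
+          struct reg_buf *head = NULL;
+
+          for (size_t i = 0; i < nbufs; i++) {
+              struct reg_buf *b = malloc(sizeof(*b));
+
+              if (b == NULL)
+                  break;
+              b->addr = malloc(bufsize);
+              b->len  = bufsize;
+              if (b->addr == NULL ||
+                  register_region(b->addr, bufsize, &b->handle) != 0) {
+                  free(b->addr);
+                  free(b);
+                  break;
+              }
+              b->next = head;
+              head    = b;
+          }
+          return head;       /* possibly shorter than requested */
+      }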
+
+8. Errors and Error Recovery
+
+ RPC RDMA protocol errors are described in Section 4. RPC errors and
+ RPC error recovery are not affected by the protocol, and proceed as
+ for any RPC error condition. RDMA transport error reporting and
+ recovery are outside the scope of this protocol.
+
+ It is assumed that the link itself will provide some degree of error
+ detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
+ (when used over TCP), Stream Control Transmission Protocol (SCTP), as
+ well as the InfiniBand link layer all provide Cyclic Redundancy Check
+ (CRC) protection of the RDMA payload, and CRC-class protection is a
+ general attribute of such transports. Additionally, the RPC layer
+ itself can accept errors from the link level and recover via
+ retransmission. RPC recovery can handle complete loss and
+ re-establishment of the link.
+
+ See Section 11 for further discussion of the use of RPC-level
+ integrity schemes to detect errors and related efficiency issues.
+
+9. Node Addressing
+
+ In setting up a new RDMA connection, the first action by an RPC
+ client will be to obtain a transport address for the server. The
+   mechanism used to obtain this address, and to open an RDMA
+   connection, is dependent on the type of RDMA transport, and is the
+   responsibility of each RPC protocol binding and its local
+   implementation.
+
+
+
+Talpey & Callaghan Standards Track [Page 28]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+10. RPC Binding
+
+ RPC services normally register with a portmap or rpcbind [RFC1833]
+ service, which associates an RPC program number with a service
+ address. (In the case of UDP or TCP, the service address for NFS is
+ normally port 2049.) This policy is no different with RDMA
+ interconnects, although it may require the allocation of port numbers
+ appropriate to each upper-layer binding that uses the RPC framing
+ defined here.
+
+ When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses
+ IP port addressing due to its layering on TCP and/or SCTP, port
+ mapping is trivial and consists merely of issuing the port in the
+ connection process. The NFS/RDMA protocol service address has been
+ assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
+
+ When mapped atop InfiniBand [IB], which uses a Group Identifier
+ (GID)-based service endpoint naming scheme, a translation MUST be
+ employed. One such translation is defined in the InfiniBand Port
+ Addressing Annex [IBPORT], which is appropriate for translating IP
+ port addressing to the InfiniBand network. Therefore, in this case,
+ IP port addressing may be readily employed by the upper layer.
+
+ When a mapping standard or convention exists for IP ports on an RDMA
+ interconnect, there are several possibilities for each upper layer to
+ consider:
+
+ One possibility is to have an upper-layer server register its
+ mapped IP port with the rpcbind service, under the netid (or
+   netids) defined here.  An RPC/RDMA-aware client can then resolve
+ its desired service to a mappable port, and proceed to connect.
+ This is the most flexible and compatible approach, for those upper
+ layers that are defined to use the rpcbind service.
+
+ A second possibility is to have the server's portmapper register
+ itself on the RDMA interconnect at a "well known" service address.
+ (On UDP or TCP, this corresponds to port 111.) A client could
+ connect to this service address and use the portmap protocol to
+ obtain a service address in response to a program number, e.g., an
+ iWARP port number, or an InfiniBand GID.
+
+ Alternatively, the client could simply connect to the mapped well-
+ known port for the service itself, if it is appropriately defined.
+ By convention, the NFS/RDMA service, when operating atop such an
+ InfiniBand fabric, will use the same 20049 assignment as for
+ iWARP.
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 29]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Historically, different RPC protocols have taken different approaches
+ to their port assignment; therefore, the specific method is left to
+ each RPC/RDMA-enabled upper-layer binding, and not addressed here.
+
+ In Section 12, "IANA Considerations", this specification defines two
+ new "netid" values, to be used for registration of upper layers atop
+ iWARP [RFC5040, RFC5041] and (when a suitable port translation
+ service is available) InfiniBand [IB]. Additional RDMA-capable
+ networks MAY define their own netids, or if they provide a port
+ translation, MAY share the one defined here.
+
+11. Security Considerations
+
+ RPC provides its own security via the RPCSEC_GSS framework [RFC2203].
+ RPCSEC_GSS can provide message authentication, integrity checking,
+ and privacy. This security mechanism will be unaffected by the RDMA
+ transport. The data integrity and privacy features alter the body of
+ the message, presenting it as a single chunk. For large messages the
+ chunk may be large enough to qualify for RDMA Read transfer.
+ However, there is much data movement associated with computation and
+ verification of integrity, or encryption/decryption, so certain
+ performance advantages may be lost.
+
+ For efficiency, a more appropriate security mechanism for RDMA links
+ may be link-level protection, such as certain configurations of
+ IPsec, which may be co-located in the RDMA hardware. The use of
+ link-level protection MAY be negotiated through the use of the new
+ RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the
+ Channel Binding mechanism [RFC5056] and IPsec Channel Connection
+ Latching [RFC5660]. Use of such mechanisms is REQUIRED where
+ integrity and/or privacy is desired, and where efficiency is
+ required.
+
+ An additional consideration is the protection of the integrity and
+ privacy of local memory by the RDMA transport itself. The use of
+ RDMA by RPC MUST NOT introduce any vulnerabilities to system memory
+ contents, or to memory owned by user processes. These protections
+ are provided by the RDMA layer specifications, and specifically their
+ security models. It is REQUIRED that any RDMA provider used for RPC
+ transport be conformant to the requirements of [RFC5042] in order to
+ satisfy these protections.
+
+ Once delivered securely by the RDMA provider, any RDMA-exposed
+ addresses will contain only RPC payloads in the chunk lists,
+ transferred under the protection of RPCSEC_GSS integrity and privacy.
+ By these means, the data will be protected end-to-end, as required by
+ the RPC layer security model.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 30]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ Where upper-layer protocols choose to supply results to the requester
+ via read chunks, a server resource deficit can arise if the client
+ does not promptly acknowledge their status via the RDMA_DONE message.
+ This can potentially lead to a denial-of-service situation, with a
+ single client unfairly (and unnecessarily) consuming server RDMA
+ resources. Servers for such upper-layer protocols MUST protect
+ against this situation, originating from one or many clients. For
+   example, a time-based window of buffer availability may be offered;
+ if the client fails to obtain the data within the window, it will
+ simply retry using ordinary RPC retry semantics. Or, a more severe
+ method would be for the server to simply close the client's RDMA
+ connection, freeing the RDMA resources and allowing the server to
+ reclaim them.
+
+ A fairer and more useful method is provided by the protocol itself.
+ The server MAY use the rdma_credit value to limit the number of
+ outstanding requests for each client. By including the number of
+ outstanding RDMA_DONE completions in the computation of available
+ client credits, the server can limit its exposure to each client, and
+ therefore provide uninterrupted service as its resources permit.
+
+ However, the server must ensure that it does not decrease the credit
+ count to zero with this method, since the RDMA_DONE message is not
+ acknowledged. If the credit count were to drop to zero solely due to
+ outstanding RDMA_DONE messages, the client would deadlock since it
+ would never obtain a new credit with which to continue. Therefore,
+ if the server adjusts credits to zero for outstanding RDMA_DONE, it
+ MUST withhold its reply to at least one message in order to provide
+ the next credit. The time-based window (or any other appropriate
+ method) SHOULD be used by the server to recover resources in the
+ event that the client never returns.
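+
+   The credit computation described above can be sketched as follows
+   (non-normative); the parameter names are illustrative.
+
+      #include <stdint.h>
+
+      /* Compute credits to advertise, counting unacknowledged      */
+      /* RDMA_DONE messages against the server's limit.  The result */
+      /* is clamped to at least one so the client is never left     */
+      /* without a credit (cf. the deadlock discussion above).      */
+      uint32_t
+      grant_credits(uint32_t server_max, uint32_t outstanding_done)
+      {
+          uint32_t credits = 0;
+
+          if (outstanding_done < server_max)
+              credits = server_max - outstanding_done;
+          if (credits == 0)
+              credits = 1;
+          return credits;
+      }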
+
+ The Connection Configuration Protocol, when used, MUST be protected
+ by an appropriate RPC security flavor, to ensure it is not attacked
+ in the process of initiating an RPC/RDMA connection.
+
+12. IANA Considerations
+
+ Three new assignments are specified by this document:
+
+ - A new set of RPC "netids" for resolving RPC/RDMA services
+
+ - Optional service port assignments for upper-layer bindings
+
+ - An RPC program number assignment for the configuration protocol
+
+   These assignments have been established as described below.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 31]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ The new RPC transport has been assigned an RPC "netid", which is an
+ rpcbind [RFC1833] string used to describe the underlying protocol in
+ order for RPC to select the appropriate transport framing, as well as
+ the format of the service addresses and ports.
+
+ The following "Netid" registry strings are defined for this purpose:
+
+ NC_RDMA "rdma"
+ NC_RDMA6 "rdma6"
+
+ These netids MAY be used for any RDMA network satisfying the
+ requirements of Section 2, and able to identify service endpoints
+ using IP port addressing, possibly through use of a translation
+ service as described above in Section 10, "RPC Binding". The "rdma"
+ netid is to be used when IPv4 addressing is employed by the
+ underlying transport, and "rdma6" for IPv6 addressing.
+
+ The netid assignment policy and registry are defined in [RFC5665].
+
+ As a new RPC transport, this protocol has no effect on RPC program
+ numbers or existing registered port numbers. However, new port
+ numbers MAY be registered for use by RPC/RDMA-enabled services, as
+ appropriate to the new networks over which the services will operate.
+
+ For example, the NFS/RDMA service defined in [RFC5667] has been
+ assigned the port 20049, in the IANA registry:
+
+ nfsrdma 20049/tcp Network File System (NFS) over RDMA
+ nfsrdma 20049/udp Network File System (NFS) over RDMA
+ nfsrdma 20049/sctp Network File System (NFS) over RDMA
+
+ The OPTIONAL Connection Configuration Protocol described herein
+ requires an RPC program number assignment. The value "100417" has
+ been assigned:
+
+ rdmaconfig 100417 rpc.rdmaconfig
+
+ The RPC program number assignment policy and registry are defined in
+ [RFC5531].
+
+13. Acknowledgments
+
+ The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
+ Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
+ Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David
+ Robinson, and Mallikarjun Chadalapaka for their contributions to this
+ document.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 32]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+14. References
+
+14.1. Normative References
+
+ [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
+ RFC 1833, August 1995.
+
+ [RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
+ Specification", RFC 2203, September 1997.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC4506] Eisler, M., Ed., "XDR: External Data Representation
+ Standard", STD 67, RFC 4506, May 2006.
+
+ [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
+ Protocol (DDP) / Remote Direct Memory Access Protocol
+ (RDMAP) Security", RFC 5042, October 2007.
+
+ [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
+ Channels", RFC 5056, November 2007.
+
+ [RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February
+ 2009.
+
+ [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
+ Specification Version 2", RFC 5531, May 2009.
+
+ [RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC
+ 5660, October 2009.
+
+ [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call
+ (RPC) Network Identifiers and Universal Address Formats",
+ RFC 5665, January 2010.
+
+14.2. Informative References
+
+ [RFC1094] Sun Microsystems, "NFS: Network File System Protocol
+ specification", RFC 1094, March 1989.
+
+ [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
+ Version 3 Protocol Specification", RFC 1813, June 1995.
+
+ [RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
+ Beame, C., Eisler, M., and D. Noveck, "Network File System
+ (NFS) version 4 Protocol", RFC 3530, April 2003.
+
+
+
+
+Talpey & Callaghan Standards Track [Page 33]
+
+RFC 5666 RDMA Transport for RPC January 2010
+
+
+ [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
+ Garcia, "A Remote Direct Memory Access Protocol
+ Specification", RFC 5040, October 2007.
+
+ [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
+ Data Placement over Reliable Transports", RFC 5041,
+ October 2007.
+
+ [RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS)
+ Remote Direct Memory Access (RDMA) Problem Statement", RFC
+ 5532, May 2009.
+
+ [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+ "Network File System Version 4 Minor Version 1 Protocol",
+ RFC 5661, January 2010.
+
+ [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS)
+ Direct Data Placement", RFC 5667, January 2010.
+
+ [IB] InfiniBand Trade Association, InfiniBand Architecture
+ Specifications, available from
+ http://www.infinibandta.org.
+
+ [IBPORT] InfiniBand Trade Association, "IP Addressing Annex",
+ available from http://www.infinibandta.org.
+
+Authors' Addresses
+
+ Tom Talpey
+ 170 Whitman St.
+ Stow, MA 01775 USA
+
+ EMail: tmtalpey@gmail.com
+
+
+ Brent Callaghan
+ Apple Computer, Inc.
+ MS: 302-4K
+ 2 Infinite Loop
+ Cupertino, CA 95014 USA
+
+ EMail: brentc@apple.com
+
+
+
+
+
+
+
+
+
+Talpey & Callaghan Standards Track [Page 34]
+