author Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committer Thomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit 4bfd864f10b68b71482b35c818559068ef8d5797
tree e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8166.txt
parent ea76e11061bda059ae9f9ad130a9895cc85607db
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8166.txt')
-rw-r--r-- doc/rfc/rfc8166.txt 3083
1 files changed, 3083 insertions, 0 deletions
diff --git a/doc/rfc/rfc8166.txt b/doc/rfc/rfc8166.txt
new file mode 100644
index 0000000..d2c8584
--- /dev/null
+++ b/doc/rfc/rfc8166.txt
@@ -0,0 +1,3083 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) C. Lever, Ed.
+Request for Comments: 8166 Oracle
+Obsoletes: 5666 W. Simpson
+Category: Standards Track Red Hat
+ISSN: 2070-1721 T. Talpey
+ Microsoft
+ June 2017
+
+
+ Remote Direct Memory Access Transport for
+ Remote Procedure Call Version 1
+
+Abstract
+
+ This document specifies a protocol for conveying Remote Procedure
+ Call (RPC) messages on physical transports capable of Remote Direct
+ Memory Access (RDMA). This protocol is referred to as the RPC-over-
+ RDMA version 1 protocol in this document. It requires no revision to
+ application RPC protocols or the RPC protocol itself. This document
+ obsoletes RFC 5666.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 7841.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc8166.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lever, et al. Standards Track [Page 1]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+Copyright Notice
+
+ Copyright (c) 2017 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+ This document may contain material from IETF Documents or IETF
+ Contributions published or made publicly available before November
+ 10, 2008. The person(s) controlling the copyright in some of this
+ material may not have granted the IETF Trust the right to allow
+ modifications of such material outside the IETF Standards Process.
+ Without obtaining an adequate license from the person(s) controlling
+ the copyright in such materials, this document may not be modified
+ outside the IETF Standards Process, and derivative works of it may
+ not be created outside the IETF Standards Process, except to format
+ it for publication as an RFC or to translate it into languages other
+ than English.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lever, et al. Standards Track [Page 2]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+Table of Contents
+
+ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
+ 1.1. RPCs on RDMA Transports . . . . . . . . . . . . . . . . . 4
+ 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
+ 2.1. Requirements Language . . . . . . . . . . . . . . . . . . 5
+ 2.2. RPCs . . . . . . . . . . . . . . . . . . . . . . . . . . 5
+ 2.3. RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . 8
+ 3. RPC-over-RDMA Protocol Framework . . . . . . . . . . . . . . 10
+ 3.1. Transfer Models . . . . . . . . . . . . . . . . . . . . . 10
+ 3.2. Message Framing . . . . . . . . . . . . . . . . . . . . . 11
+ 3.3. Managing Receiver Resources . . . . . . . . . . . . . . . 11
+ 3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . 14
+ 3.5. Message Size . . . . . . . . . . . . . . . . . . . . . . 19
+ 4. RPC-over-RDMA in Operation . . . . . . . . . . . . . . . . . 23
+ 4.1. XDR Protocol Definition . . . . . . . . . . . . . . . . . 23
+ 4.2. Fixed Header Fields . . . . . . . . . . . . . . . . . . . 28
+ 4.3. Chunk Lists . . . . . . . . . . . . . . . . . . . . . . . 30
+ 4.4. Memory Registration . . . . . . . . . . . . . . . . . . . 33
+ 4.5. Error Handling . . . . . . . . . . . . . . . . . . . . . 34
+ 4.6. Protocol Elements No Longer Supported . . . . . . . . . . 37
+ 4.7. XDR Examples . . . . . . . . . . . . . . . . . . . . . . 38
+ 5. RPC Bind Parameters . . . . . . . . . . . . . . . . . . . . . 39
+ 6. ULB Specifications . . . . . . . . . . . . . . . . . . . . . 41
+ 6.1. DDP-Eligibility . . . . . . . . . . . . . . . . . . . . . 41
+ 6.2. Maximum Reply Size . . . . . . . . . . . . . . . . . . . 43
+ 6.3. Additional Considerations . . . . . . . . . . . . . . . . 43
+ 6.4. ULP Extensions . . . . . . . . . . . . . . . . . . . . . 43
+ 7. Protocol Extensibility . . . . . . . . . . . . . . . . . . . 44
+ 7.1. Conventional Extensions . . . . . . . . . . . . . . . . . 44
+ 8. Security Considerations . . . . . . . . . . . . . . . . . . . 44
+ 8.1. Memory Protection . . . . . . . . . . . . . . . . . . . . 44
+ 8.2. RPC Message Security . . . . . . . . . . . . . . . . . . 46
+ 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 49
+ 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 50
+ 10.1. Normative References . . . . . . . . . . . . . . . . . . 50
+ 10.2. Informative References . . . . . . . . . . . . . . . . . 51
+ Appendix A. Changes from RFC 5666 . . . . . . . . . . . . . . . 53
+ A.1. Changes to the Specification . . . . . . . . . . . . . . 53
+ A.2. Changes to the Protocol . . . . . . . . . . . . . . . . . 53
+ Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 54
+ Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 55
+
+
+
+
+
+
+
+
+
+Lever, et al. Standards Track [Page 3]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+1. Introduction
+
+ This document specifies the RPC-over-RDMA version 1 protocol, based
+ on existing implementations of RFC 5666 and experience gained through
+ deployment. This document obsoletes RFC 5666.
+
+ This specification clarifies text that was subject to multiple
+ interpretations and removes support for unimplemented RPC-over-RDMA
+ version 1 protocol elements. It clarifies the role of Upper-Layer
+ Bindings (ULBs) and describes what they are to contain.
+
+ In addition, this document describes current practice using
+ RPCSEC_GSS [RFC7861] on RDMA transports.
+
+ The protocol version number has not been changed because the protocol
+ specified in this document fully interoperates with implementations
+ of the RPC-over-RDMA version 1 protocol specified in [RFC5666].
+
+1.1. RPCs on RDMA Transports
+
+ RDMA [RFC5040] [RFC5041] [IBARCH] is a technique for moving data
+ efficiently between end nodes. By directing data into destination
+ buffers as it is sent on a network, and placing it via direct memory
+ access by hardware, the benefits of faster transfers and reduced host
+ overhead are obtained.
+
+ Open Network Computing Remote Procedure Call (ONC RPC, often
+ shortened in NFSv4 documents to RPC) [RFC5531] is a remote procedure
+ call protocol that runs over a variety of transports. Most RPC
+ implementations today use UDP [RFC768] or TCP [RFC793]. On UDP, RPC
+ messages are encapsulated inside datagrams, while on a TCP byte
+ stream, RPC messages are delineated by a record marking protocol. An
+ RDMA transport also conveys RPC messages in a specific fashion that
+ must be fully described if RPC implementations are to interoperate.
+
+ RDMA transports present semantics that differ from either UDP or TCP.
+ They retain message delineations like UDP but provide reliable and
+ sequenced data transfer like TCP. They also provide an offloaded
+ bulk transfer service not provided by UDP or TCP. RDMA transports
+ are therefore appropriately viewed as a new transport type by RPC.
+
+ In this context, the Network File System (NFS) protocols, as
+ described in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and future
+ NFSv4 minor versions, are all obvious beneficiaries of RDMA
+ transports. A complete problem statement is presented in [RFC5532].
+ Many other RPC-based protocols can also benefit.
+
+
+
+
+
+Lever, et al. Standards Track [Page 4]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ Although the RDMA transport described herein can provide relatively
+ transparent support for any RPC application, this document also
+ describes mechanisms that can optimize data transfer even further,
+ when RPC applications are willing to exploit awareness of RDMA as the
+ transport.
+
+2. Terminology
+
+2.1. Requirements Language
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
+ "OPTIONAL" in this document are to be interpreted as described in BCP
+ 14 [RFC2119] [RFC8174] when, and only when, they appear in all
+ capitals, as shown here.
+
+2.2. RPCs
+
+ This section highlights key elements of the RPC [RFC5531] and
+ External Data Representation (XDR) [RFC4506] protocols, upon which
+ RPC-over-RDMA version 1 is constructed. A strong grounding in these
+ protocols is recommended before reading this document.
+
+2.2.1. Upper-Layer Protocols
+
+ RPCs are an abstraction used to implement the operations of an Upper-
+ Layer Protocol (ULP). "ULP" refers to an RPC Program and Version
+ tuple, which is a versioned set of procedure calls that comprise a
+ single well-defined API. One example of a ULP is the Network File
+ System Version 4.0 [RFC7530].
+
+ In this document, the term "RPC consumer" refers to an implementation
+ of a ULP running on an RPC client endpoint.
+
+2.2.2. Requesters and Responders
+
+ Like a local procedure call, every RPC procedure has a set of
+ "arguments" and a set of "results". A calling context invokes a
+ procedure, passing arguments to it, and the procedure subsequently
+ returns a set of results. Unlike a local procedure call, the called
+ procedure is executed remotely rather than in the local application's
+ execution context.
+
+ The RPC protocol as described in [RFC5531] is fundamentally a
+ message-passing protocol between one or more clients (where RPC
+ consumers are running) and a server (where a remote execution context
+ is available to process RPC transactions on behalf of those
+ consumers).
+
+
+
+Lever, et al. Standards Track [Page 5]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ ONC RPC transactions are made up of two types of messages:
+
+ CALL
+ An "RPC Call message" requests that work be done. This type of
+ message is designated by the value zero (0) in the message's
+ msg_type field. An arbitrary unique value is placed in the
+ message's XID field in order to match this RPC Call message to a
+ corresponding RPC Reply message.
+
+ REPLY
+ An "RPC Reply message" reports the results of work requested by an
+ RPC Call message. An RPC Reply message is designated by the value
+ one (1) in the message's msg_type field. The value contained in
+ an RPC Reply message's XID field is copied from the RPC Call
+ message whose results are being reported.
+
+ The RPC client endpoint acts as a "Requester". It serializes the
+ procedure's arguments and conveys them to a server endpoint via an
+ RPC Call message. This message contains an RPC protocol header, a
+ header describing the requested upper-layer operation, and all
+ arguments.
+
+ The RPC server endpoint acts as a "Responder". It deserializes the
+ arguments and processes the requested operation. It then serializes
+ the operation's results into another byte stream. This byte stream
+ is conveyed back to the Requester via an RPC Reply message. This
+ message contains an RPC protocol header, a header describing the
+ upper-layer reply, and all results.
+
+ The Requester deserializes the results and allows the original caller
+ to proceed. At this point, the RPC transaction designated by the XID
+ in the RPC Call message is complete, and the XID is retired.
+
+ In summary, RPC Call messages are sent by Requesters to Responders to
+ initiate RPC transactions. RPC Reply messages are sent by Responders
+ to Requesters to complete the processing of an RPC transaction.
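
The XID matching described above can be sketched as follows. This is a minimal illustration, not part of the protocol specification: it encodes only the first two fields of an RPC message (the XID and msg_type, each a big-endian 32-bit unsigned integer), and the `PendingCalls` class and its method names are hypothetical.

```python
import struct
from typing import Dict, Tuple

CALL = 0   # msg_type value designating an RPC Call message
REPLY = 1  # msg_type value designating an RPC Reply message


def encode_prefix(xid: int, msg_type: int) -> bytes:
    """Encode the XID and msg_type fields as XDR unsigned integers."""
    return struct.pack(">II", xid, msg_type)


def decode_prefix(message: bytes) -> Tuple[int, int]:
    """Return (xid, msg_type) read from the start of an RPC message."""
    return struct.unpack_from(">II", message, 0)


class PendingCalls:
    """Hypothetical Requester-side table of outstanding transactions."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}

    def sent(self, xid: int, label: str) -> bytes:
        """Remember an outstanding call and return its encoded prefix."""
        self._pending[xid] = label
        return encode_prefix(xid, CALL)

    def received(self, message: bytes) -> str:
        """Match a reply to its call; the XID is retired on completion."""
        xid, msg_type = decode_prefix(message)
        if msg_type != REPLY:
            raise ValueError("not an RPC Reply message")
        return self._pending.pop(xid)
```

A Responder copies the XID from the call into its reply, so a reply whose XID is not in the table raises `KeyError` in this sketch.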
+
+2.2.3. RPC Transports
+
+ The role of an "RPC transport" is to mediate the exchange of RPC
+ messages between Requesters and Responders. An RPC transport bridges
+ the gap between the RPC message abstraction and the native operations
+ of a particular network transport.
+
+ RPC-over-RDMA is a connection-oriented RPC transport. When a
+ connection-oriented transport is used, clients initiate transport
+ connections, while servers wait passively for incoming connection
+ requests.
+
+
+
+Lever, et al. Standards Track [Page 6]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+2.2.4. External Data Representation
+
+ One cannot assume that all Requesters and Responders represent data
+ objects the same way internally. RPC uses External Data
+ Representation (XDR) to translate native data types and serialize
+ arguments and results [RFC4506].
+
+ The XDR protocol encodes data independently of the endianness or size
+ of host-native data types, allowing unambiguous decoding of data on
+ the receiving end. RPC Programs are specified by writing an XDR
+ definition of their procedures, argument data types, and result data
+ types.
+
+ XDR assumes that the number of bits in a byte (octet) and their order
+ are the same on both endpoints and on the physical network. The
+ smallest indivisible unit of XDR encoding is a group of four octets.
+ XDR also flattens lists, arrays, and other complex data types so they
+ can be conveyed as a stream of bytes.
+
+ A serialized stream of bytes that is the result of XDR encoding is
+ referred to as an "XDR stream". A sending endpoint encodes native
+ data into an XDR stream and then transmits that stream to a receiver.
+ A receiving endpoint decodes incoming XDR byte streams into its
+ native data representation format.
+
+2.2.4.1. XDR Opaque Data
+
+ Sometimes, a data item must be transferred as is: without encoding or
+ decoding. The contents of such a data item are referred to as
+ "opaque data". XDR encoding places the content of opaque data items
+ directly into an XDR stream without altering it in any way. ULPs or
+ applications perform any needed data translation in this case.
+ Examples of opaque data items include the content of files or generic
+ byte strings.
+
+2.2.4.2. XDR Roundup
+
+ The number of octets in a variable-length data item precedes that
+ item in an XDR stream. If the size of an encoded data item is not a
+ multiple of four octets, octets containing zero are added after the
+ end of the item so that the next encoded data item in the XDR stream
+ starts on a four-octet boundary. The encoded size
+ of the item is not changed by the addition of the extra octets.
+ These extra octets are never exposed to ULPs.
+
+ This technique is referred to as "XDR roundup", and the extra octets
+ are referred to as "XDR roundup padding".
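
As a worked illustration of XDR roundup, the sketch below encodes a variable-length opaque data item: a four-octet big-endian length, the data itself, then zero octets up to the next four-octet boundary. The function names are chosen for this example only.

```python
import struct


def xdr_pad_len(length: int) -> int:
    """Number of zero octets needed to reach a four-octet boundary."""
    return (4 - length % 4) % 4


def encode_opaque(data: bytes) -> bytes:
    """Encode variable-length opaque data with XDR roundup padding.

    The length field records len(data) only; the padding octets are
    not counted in the encoded size of the item.
    """
    padding = b"\x00" * xdr_pad_len(len(data))
    return struct.pack(">I", len(data)) + data + padding
```

For a five-octet item, the length field holds 5 and three padding octets follow, so the item occupies twelve octets of the XDR stream in total.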
+
+
+
+
+Lever, et al. Standards Track [Page 7]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+2.3. RDMA
+
+ RPC Requesters and Responders can be made more efficient if large RPC
+ messages are transferred by a third party, such as intelligent
+ network-interface hardware (data movement offload), and placed in the
+ receiver's memory so that no additional adjustment of data alignment
+ has to be made (direct data placement or "DDP"). RDMA transports
+ enable both optimizations.
+
+2.3.1. DDP
+
+ Typically, RPC implementations copy the contents of RPC messages into
+ a buffer before sending them. An efficient RPC implementation sends
+ bulk data without copying it into a separate send buffer first.
+
+ However, socket-based RPC implementations are often unable to receive
+ data directly into its final place in memory. Receivers often need
+ to copy incoming data to finish an RPC operation: sometimes, only to
+ adjust data alignment.
+
+ In this document, "RDMA" refers to the physical mechanism an RDMA
+ transport utilizes when moving data. Although this may not be
+ efficient, before an RDMA transfer, a sender may copy data into an
+ intermediate buffer. After an RDMA transfer, a receiver may copy
+ that data again to its final destination.
+
+ In this document, the term "DDP" refers to any optimized data
+ transfer where it is unnecessary for a receiving host's CPU to copy
+ transferred data to another location after it has been received.
+
+ Just as [RFC5666] did, this document focuses on the use of RDMA Read
+ and Write operations to achieve both data movement offload and DDP.
+ However, not all RDMA-based data transfer qualifies as DDP, and DDP
+ can be achieved using non-RDMA mechanisms.
+
+2.3.2. RDMA Transport Requirements
+
+ To achieve good performance during receive operations, RDMA
+ transports require that RDMA consumers provision resources in advance
+ to receive incoming messages.
+
+ An RDMA consumer might provide Receive buffers in advance by posting
+ an RDMA Receive Work Request for every expected RDMA Send from a
+ remote peer. These buffers are provided before the remote peer posts
+ RDMA Send Work Requests; thus, this is often referred to as "pre-
+ posting" buffers.
+
+
+
+
+
+Lever, et al. Standards Track [Page 8]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ An RDMA Receive Work Request remains outstanding until hardware
+ matches it to an inbound Send operation. The resources associated
+ with that Receive must be retained in host memory, or "pinned", until
+ the Receive completes.
+
+ Given these basic tenets of RDMA transport operation, the RPC-over-
+ RDMA version 1 protocol assumes each transport provides the following
+ abstract operations. A more complete discussion of these operations
+ is found in [RFC5040].
+
+ Registered Memory
+ Registered memory is a region of memory that is assigned a
+ steering tag that temporarily permits access by the RDMA provider
+ to perform data-transfer operations. The RPC-over-RDMA version 1
+ protocol assumes that each region of registered memory MUST be
+ identified with a steering tag of no more than 32 bits and memory
+ addresses of up to 64 bits in length.
+
+ RDMA Send
+ The RDMA provider supports an RDMA Send operation, with completion
+ signaled on the receiving peer after data has been placed in a
+ pre-posted buffer. Sends complete at the receiver in the order
+ they were issued at the sender. The amount of data transferred by
+ a single RDMA Send operation is limited by the size of the remote
+ peer's pre-posted buffers.
+
+ RDMA Receive
+ The RDMA provider supports an RDMA Receive operation to receive
+ data conveyed by incoming RDMA Send operations. To reduce the
+ amount of memory that must remain pinned awaiting incoming Sends,
+ the amount of pre-posted memory is limited. Flow control to
+ prevent overrunning receiver resources is provided by the RDMA
+ consumer (in this case, the RPC-over-RDMA version 1 protocol).
+
+ RDMA Write
+ The RDMA provider supports an RDMA Write operation to place data
+ directly into a remote memory region. The local host initiates an
+ RDMA Write, and completion is signaled there. No completion is
+ signaled on the remote peer. The local host provides a steering
+ tag, memory address, and length of the remote peer's memory
+ region.
+
+ RDMA Writes are not ordered with respect to one another, but are
+ ordered with respect to RDMA Sends. A subsequent RDMA Send
+ completion obtained at the write initiator guarantees that prior
+ RDMA Write data has been successfully placed in the remote peer's
+ memory.
+
+
+
+
+Lever, et al. Standards Track [Page 9]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ RDMA Read
+ The RDMA provider supports an RDMA Read operation to place peer
+ source data directly into the read initiator's memory. The local
+ host initiates an RDMA Read, and completion is signaled there. No
+ completion is signaled on the remote peer. The local host
+ provides steering tags, memory addresses, and a length for the
+ remote source and local destination memory region.
+
+ The local host signals Read completion to the remote peer as part
+ of a subsequent RDMA Send message. The remote peer can then
+ release steering tags and subsequently free associated source
+ memory regions.
+
+ The RPC-over-RDMA version 1 protocol is designed to be carried over
+ RDMA transports that support the above abstract operations. This
+ protocol conveys information sufficient for an RPC peer to direct an
+ RDMA provider to perform transfers containing RPC data and to
+ communicate their result(s).
+
+3. RPC-over-RDMA Protocol Framework
+
+3.1. Transfer Models
+
+ A "transfer model" designates which endpoint exposes its memory and
+ which is responsible for initiating the transfer of data. To enable
+ RDMA Read and Write operations, for example, an endpoint first
+ exposes regions of its memory to a remote endpoint, which initiates
+ these operations against the exposed memory.
+
+ Read-Read
+ Requesters expose their memory to the Responder, and the Responder
+ exposes its memory to Requesters. The Responder reads, or pulls,
+ RPC arguments or whole RPC calls from each Requester. Requesters
+ pull RPC results or whole RPC replies from the Responder.
+
+ Write-Write
+ Requesters expose their memory to the Responder, and the Responder
+ exposes its memory to Requesters. Requesters write, or push, RPC
+ arguments or whole RPC calls to the Responder. The Responder
+ pushes RPC results or whole RPC replies to each Requester.
+
+ Read-Write
+ Requesters expose their memory to the Responder, but the Responder
+ does not expose its memory. The Responder pulls RPC arguments or
+ whole RPC calls from each Requester. The Responder pushes RPC
+ results or whole RPC replies to each Requester.
+
+
+
+
+
+Lever, et al. Standards Track [Page 10]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ Write-Read
+ The Responder exposes its memory to Requesters, but Requesters do
+ not expose their memory. Requesters push RPC arguments or whole
+ RPC calls to the Responder. Requesters pull RPC results or whole
+ RPC replies from the Responder.
+
+3.2. Message Framing
+
+ On an RPC-over-RDMA transport, each RPC message is encapsulated by an
+ RPC-over-RDMA message. An RPC-over-RDMA message consists of two XDR
+ streams.
+
+ RPC Payload Stream
+ The "Payload stream" contains the encapsulated RPC message being
+ transferred by this RPC-over-RDMA message. This stream always
+ begins with the Transaction ID (XID) field of the encapsulated RPC
+ message.
+
+ Transport Stream
+ The "Transport stream" contains a header that describes and
+ controls the transfer of the Payload stream in this RPC-over-RDMA
+ message. This header is analogous to the record marking used for
+ RPC on TCP sockets but is more extensive, since RDMA transports
+ support several modes of data transfer.
+
+ In its simplest form, an RPC-over-RDMA message consists of a
+ Transport stream followed immediately by a Payload stream conveyed
+ together in a single RDMA Send. To transmit large RPC messages, a
+ combination of one RDMA Send operation and one or more other RDMA
+ operations is employed.
+
+ RPC-over-RDMA framing replaces all other RPC framing (such as TCP
+ record marking) when used atop an RPC-over-RDMA association, even
+ when the underlying RDMA protocol may itself be layered atop a
+ transport with a defined RPC framing (such as TCP).
+
+ However, it is possible for RPC-over-RDMA to be dynamically enabled
+ in the course of negotiating the use of RDMA via a ULP exchange.
+ Because RPC framing delimits an entire RPC request or reply, the
+ resulting shift in framing must occur between distinct RPC messages,
+ and in concert with the underlying transport.
+
+3.3. Managing Receiver Resources
+
+ It is critical to provide RDMA Send flow control for an RDMA
+ connection. If any pre-posted Receive buffer on the connection is
+ not large enough to accept an incoming RDMA Send, or if a pre-posted
+ Receive buffer is not available to accept an incoming RDMA Send, the
+
+
+
+Lever, et al. Standards Track [Page 11]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ RDMA connection can be terminated. This differs from
+ conventional TCP/IP networking, in which buffers are allocated
+ dynamically as messages are received.
+
+ The longevity of an RDMA connection mandates that sending endpoints
+ respect the resource limits of peer receivers. To ensure messages
+ can be sent and received reliably, there are two operational
+ parameters for each connection.
+
+3.3.1. RPC-over-RDMA Credits
+
+ Flow control for RDMA Send operations directed to the Responder is
+ implemented as a simple request/grant protocol in the RPC-over-RDMA
+ header associated with each RPC message.
+
+ An RPC-over-RDMA version 1 credit is the capability to handle one
+ RPC-over-RDMA transaction. Each RPC-over-RDMA message sent from
+ Requester to Responder requests a number of credits from the
+ Responder. Each RPC-over-RDMA message sent from Responder to
+ Requester informs the Requester how many credits the Responder has
+ granted. The requested and granted values are carried in each RPC-
+ over-RDMA message's rdma_credit field (see Section 4.2.3).
+
+ Practically speaking, the critical value is the granted value. A
+ Requester MUST NOT send unacknowledged requests in excess of the
+ Responder's granted credit limit. If the granted value is exceeded,
+ the RDMA layer may signal an error, possibly terminating the
+ connection. The granted value MUST NOT be zero, since such a value
+ would result in deadlock.
+
+ RPC calls complete in any order, but the current granted credit limit
+ at the Responder is known to the Requester from RDMA Send ordering
+ properties. The number of allowed new requests the Requester may
+ send is then the lower of the current requested and granted credit
+ values, minus the number of requests in flight. Advertised credit
+ values are not altered when individual RPCs are started or completed.
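
The credit arithmetic in the preceding paragraph can be sketched as below. The function name is hypothetical, and clamping the result at zero is an assumption made for this example to cover the case where the granted value has dropped below the number of requests already in flight.

```python
def sendable_requests(requested: int, granted: int, in_flight: int) -> int:
    """Number of new requests a Requester may send right now.

    This is the lower of the requested and granted credit values,
    minus the requests already in flight. The result is clamped at
    zero (an assumption of this sketch).
    """
    if granted < 1:
        # A granted value of zero would result in deadlock.
        raise ValueError("granted credit value must not be zero")
    return max(0, min(requested, granted) - in_flight)
```

For example, with 32 credits requested, 16 granted, and 10 requests in flight, the Requester may send 6 more requests.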
+
+ The requested and granted credit values MAY be adjusted to match the
+ needs or policies in effect on either peer. For instance, a
+ Responder may reduce the granted credit value to accommodate the
+ available resources in a Shared Receive Queue. The Responder MUST
+ ensure that an increase in receive resources is effected before the
+ next RPC Reply message is sent.
+
+ A Requester MUST maintain enough receive resources to accommodate
+ expected replies. Responders have to be prepared for there to be no
+ receive resources available on Requesters with no pending RPC
+ transactions.
+
+
+
+Lever, et al. Standards Track [Page 12]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+ Certain RDMA implementations may impose additional flow-control
+ restrictions, such as limits on RDMA Read operations in progress at
+ the Responder. Accommodation of such restrictions is considered the
+ responsibility of each RPC-over-RDMA version 1 implementation.
+
+3.3.2. Inline Threshold
+
+ An "inline threshold" value is the largest message size (in octets)
+ that can be conveyed in one direction between peer implementations
+ using RDMA Send and Receive. The inline threshold value is the
+ smaller of the largest number of bytes the sender can post via a
+ single RDMA Send operation and the largest number of bytes the
+ receiver can accept via a single RDMA Receive operation. Each
+ connection has two inline threshold values: one for messages flowing
+ from Requester-to-Responder (referred to as the "call inline
+ threshold") and one for messages flowing from Responder-to-Requester
+ (referred to as the "reply inline threshold").
+
+ Unlike credit limits, inline threshold values are not advertised to
+ peers via the RPC-over-RDMA version 1 protocol, and there is no
+ provision for inline threshold values to change during the lifetime
+ of an RPC-over-RDMA version 1 connection.
+
+3.3.3. Initial Connection State
+
+ When a connection is first established, peers might not know how many
+ receive resources the other has, nor how large the other peer's
+ inline thresholds are.
+
+ As a basis for an initial exchange of RPC requests, each RPC-over-
+ RDMA version 1 connection provides the ability to exchange at least
+ one RPC message at a time, whose RPC Call and Reply messages are no
+ more than 1024 bytes in size. A Responder MAY exceed this basic
+ level of configuration, but a Requester MUST NOT assume more than one
+ credit is available and MUST receive a valid reply from the Responder
+ carrying the actual number of available credits, prior to sending its
+ next request.
+
+ Receiver implementations MUST support inline thresholds of 1024 bytes
+ but MAY support larger inline threshold values. An independent
+ mechanism for discovering a peer's inline thresholds before a
+ connection is established may be used to optimize the use of RDMA
+ Send and Receive operations. In the absence of such a mechanism,
+ senders and receivers MUST assume the inline thresholds are 1024
+ bytes.
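
Sections 3.3.2 and 3.3.3 together suggest the following calculation for one direction of a connection. This is a sketch only: the function and parameter names are hypothetical, and `peer_max_recv` being `None` stands for the absence of any independent mechanism for discovering the peer's limit.

```python
from typing import Optional

DEFAULT_INLINE_THRESHOLD = 1024  # bytes, assumed when nothing is known


def inline_threshold(local_max_send: int,
                     peer_max_recv: Optional[int] = None) -> int:
    """Inline threshold for messages flowing from this sender.

    It is the smaller of what the sender can post in a single RDMA
    Send and what the receiver can accept in a single RDMA Receive.
    An unknown receiver limit is taken to be 1024 bytes.
    """
    if peer_max_recv is None:
        peer_max_recv = DEFAULT_INLINE_THRESHOLD
    return min(local_max_send, peer_max_recv)


def fits_inline(message_len: int, threshold: int) -> bool:
    """True if a message can be conveyed using RDMA Send alone."""
    return message_len <= threshold
```

Because each connection has two inline thresholds, a peer would evaluate this once for the call direction and once for the reply direction.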
+
+
+
+
+
+
+Lever, et al. Standards Track [Page 13]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+3.4. XDR Encoding with Chunks
+
+ When a DDP capability is available, the transport places the contents
+ of one or more XDR data items directly into the receiver's memory,
+ separately from the transfer of other parts of the containing XDR
+ stream.
+
+3.4.1. Reducing an XDR Stream
+
+ RPC-over-RDMA version 1 provides a mechanism for moving part of an
+ RPC message via a data transfer distinct from an RDMA Send/Receive
+ pair. The sender removes one or more XDR data items from the Payload
+ stream. They are conveyed via other mechanisms, such as one or more
+ RDMA Read or Write operations. As the receiver decodes an incoming
+ message, it skips over directly placed data items.
+
+ The portion of an XDR stream that is split out and moved separately
+ is referred to as a "chunk". In some contexts, data in an RPC-over-
+ RDMA header that describes these split out regions of memory may also
+ be referred to as a "chunk".
+
+ A Payload stream after chunks have been removed is referred to as a
+ "reduced" Payload stream. Likewise, a data item that has been
+ removed from a Payload stream to be transferred separately is
+ referred to as a "reduced" data item.
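
The byte accounting behind reduction can be illustrated as follows. This sketch copies data purely to show which octets end up in the reduced Payload stream and which in the chunk; an actual implementation would avoid such copies, since that is the purpose of DDP. The function names and the use of a byte offset to locate the data item are assumptions of this example.

```python
from typing import Tuple


def reduce_stream(stream: bytes, start: int, length: int) -> Tuple[bytes, bytes]:
    """Remove one data item from a Payload stream.

    Returns (reduced_stream, chunk), where chunk holds the octets to
    be conveyed separately.
    """
    chunk = stream[start:start + length]
    reduced = stream[:start] + stream[start + length:]
    return reduced, chunk


def reconstitute(reduced: bytes, start: int, chunk: bytes) -> bytes:
    """Receiver-side view: place the chunk back at its position."""
    return reduced[:start] + chunk + reduced[start:]
```

Applying `reconstitute` to the output of `reduce_stream` with the same offset yields the original Payload stream.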
+
+3.4.2. DDP-Eligibility
+
+ Not all XDR data items benefit from DDP. For example, small data
+ items or data items that require XDR unmarshaling by the receiver do
+ not benefit from DDP. In addition, it is impractical for receivers
+ to prepare for every possible XDR data item in a protocol to be
+ transferred in a chunk.
+
+ To maintain interoperability on an RPC-over-RDMA transport, a
+ determination must be made of which few XDR data items in each ULP
+ are allowed to use DDP.
+
+ This is done by additional specifications that describe how ULPs
+ employ DDP. A "ULB specification" identifies which specific
+ individual XDR data items in a ULP MAY be transferred via DDP. Such
+ data items are referred to as "DDP-eligible". All other XDR data
+ items MUST NOT be reduced.
+
+ Detailed requirements for ULBs are provided in Section 6.
+
+
+
+
+
+
+Lever, et al. Standards Track [Page 14]
+
+RFC 8166 RPC-over-RDMA Version 1 June 2017
+
+
+3.4.3. RDMA Segments
+
+ When encoding a Payload stream that contains a DDP-eligible data
+ item, a sender may choose to reduce that data item. When it chooses
+ to do so, the sender does not place the item into the Payload stream.
+ Instead, the sender records in the RPC-over-RDMA header the location
+ and size of the memory region containing that data item.
+
+ The Requester provides location information for DDP-eligible data
+ items in both RPC Call and Reply messages. The Responder uses this
+ information to retrieve arguments contained in the specified region
+ of the Requester's memory or place results in that memory region.
+
+ An "RDMA segment", or "plain segment", is an RPC-over-RDMA Transport
+ header data object that contains the precise coordinates of a
+ contiguous memory region that is to be conveyed separately from the
+ Payload stream. Plain segments contain the following information:
+
+ Handle
+ Steering tag (STag) or R_key generated by registering this memory
+ with the RDMA provider.
+
+ Length
+ The length of the RDMA segment's memory region, in octets. An
+ "empty segment" is an RDMA segment with the value zero (0) in its
+ length field.
+
+ Offset
+ The offset or beginning memory address of the RDMA segment's
+ memory region.
+
+ See [RFC5040] for further discussion.
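+
+ As an illustration only, the following sketch packs and unpacks a
+ plain segment using the field order of xdr_rdma_segment in
+ Section 4.1.2: a 32-bit handle, a 32-bit length, and a 64-bit
+ offset, each big-endian as XDR requires. The PlainSegment class and
+ its method names are hypothetical; they are not defined by this
+ specification.
+
```python
import struct
from typing import NamedTuple

# XDR is big-endian: uint32 handle, uint32 length, uint64 offset.
_SEGMENT = struct.Struct(">IIQ")


class PlainSegment(NamedTuple):
    handle: int  # STag or R_key from memory registration
    length: int  # length of the memory region, in octets
    offset: int  # offset or starting address of the region

    def encode(self) -> bytes:
        """Return the 16-octet XDR encoding of this segment."""
        return _SEGMENT.pack(self.handle, self.length, self.offset)

    @classmethod
    def decode(cls, data: bytes) -> "PlainSegment":
        """Decode a segment from the first 16 octets of data."""
        return cls(*_SEGMENT.unpack_from(data, 0))

    def is_empty(self) -> bool:
        """An "empty segment" has zero in its length field."""
        return self.length == 0
```
+
+ Each encoded plain segment occupies exactly 16 octets, which keeps
+ the lists of segments in the Transport stream easy to walk.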
+
+3.4.4. Chunks
+
+ In RPC-over-RDMA version 1, a "chunk" refers to a portion of the
+ Payload stream that is moved independently of the RPC-over-RDMA
+ Transport header and Payload stream. Chunk data is removed from the
+ sender's Payload stream, transferred via separate operations, and
+ then reinserted into the receiver's Payload stream to form a complete
+ RPC message.
+
+ Each chunk is composed of RDMA segments. Each RDMA segment
+ represents a single contiguous piece of that chunk. A Requester MAY
+ divide a chunk into RDMA segments using any boundaries that are
+ convenient. The length of a chunk is the sum of the lengths of the
+ RDMA segments that make up the chunk.
+
+ The RPC-over-RDMA version 1 transport protocol does not place a limit
+ on chunk size. However, each ULP may cap the amount of data that can
+ be transferred by a single RPC (for example, NFS has "rsize" and
+ "wsize", which restrict the payload size of NFS READ and WRITE
+ operations). The Responder can use such limits to sanity check chunk
+ sizes before using them in RDMA operations.
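+
+ The chunk length rule and the sanity check described above can be
+ sketched as follows. This is a non-normative illustration: segments
+ are modeled as (handle, length, offset) tuples, and the function
+ names and the ulp_limit parameter are assumptions made for the
+ example.
+
```python
from typing import Sequence, Tuple

Segment = Tuple[int, int, int]  # (handle, length, offset)


def chunk_length(segments: Sequence[Segment]) -> int:
    """Length of a chunk: the sum of its segment lengths."""
    return sum(length for _handle, length, _offset in segments)


def chunk_within_limit(segments: Sequence[Segment], ulp_limit: int) -> bool:
    """Sanity check a chunk against a ULP-imposed cap before using it."""
    return chunk_length(segments) <= ulp_limit
```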
+
+3.4.4.1. Counted Arrays
+
+ If a chunk contains a counted array data type, the count of array
+ elements MUST remain in the Payload stream, while the array elements
+ MUST be moved to the chunk. For example, when encoding an opaque
+ byte array as a chunk, the count of bytes stays in the Payload
+ stream, while the bytes in the array are removed from the Payload
+ stream and transferred within the chunk.
+
+ Individual array elements appear in a chunk in their entirety. For
+ example, when encoding an array of arrays as a chunk, the count of
+ items in the enclosing array stays in the Payload stream, but each
+ enclosed array, including its item count, is transferred as part of
+ the chunk.
+
+3.4.4.2. Optional-Data
+
+ If a chunk contains an optional-data data type, the "is present"
+ field MUST remain in the Payload stream, while the data, if present,
+ MUST be moved to the chunk.
+
+3.4.4.3. XDR Unions
+
+ A union data type MUST NOT be made DDP-eligible, but one or more of
+ its arms MAY be DDP-eligible, subject to the other requirements in
+ this section.
+
+3.4.4.4. Chunk Roundup
+
+ Except in special cases (covered in Section 3.5.3), a chunk MUST
+ contain exactly one XDR data item. This makes it straightforward to
+ reduce variable-length data items without affecting the XDR alignment
+ of data items in the Payload stream.
+
+ When a variable-length XDR data item is reduced, the sender MUST
+ remove XDR roundup padding for that data item from the Payload stream
+ so that data items remaining in the Payload stream begin on four-byte
+ alignment.
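+
+ A minimal sketch of reducing a variable-length opaque data item
+ follows, combining the rules in Sections 3.4.4.1 and 3.4.4.4. It
+ assumes the Payload stream is already XDR encoded in memory and
+ that the caller knows the offset of the item's length word; the
+ function name and its return convention are hypothetical.
+
```python
import struct
from typing import Tuple


def reduce_opaque(payload: bytes, count_offset: int) -> Tuple[bytes, bytes, int]:
    """Split a variable-length opaque out of an XDR-encoded Payload stream.

    count_offset is the offset of the opaque's 4-byte length word.
    Returns (reduced_payload, chunk_data, position).
    """
    (count,) = struct.unpack_from(">I", payload, count_offset)
    position = count_offset + 4          # data begins after the count
    padded = (count + 3) & ~3            # XDR rounds up to four bytes
    chunk_data = payload[position:position + count]
    # The count stays; the data bytes and their roundup padding are removed.
    reduced = payload[:position] + payload[position + padded:]
    return reduced, chunk_data, position
```
+
+ The returned position is the offset, in the unreduced stream, of the
+ first data byte of the reduced item.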
+
+3.4.5. Read Chunks
+
+ A "Read chunk" represents an XDR data item that is to be pulled from
+ the Requester to the Responder.
+
+ A Read chunk is a list of one or more RDMA read segments. An RDMA
+ read segment consists of a Position field followed by a plain
+ segment. See Section 4.1.2 for details.
+
+ Position
+ The byte offset in the unreduced Payload stream where the receiver
+ reinserts the data item conveyed in a chunk. The Position value
+ MUST be computed from the beginning of the unreduced Payload
+ stream, which begins at Position zero. All RDMA read segments
+ belonging to the same Read chunk have the same value in their
+ Position field.
+
+ While constructing an RPC Call message, a Requester registers memory
+ regions that contain data to be transferred via RDMA Read operations.
+ It advertises the coordinates of these regions in the RPC-over-RDMA
+ Transport header of the RPC Call message.
+
+ After receiving an RPC Call message sent via an RDMA Send operation,
+ a Responder transfers the chunk data from the Requester using RDMA
+ Read operations. The Responder reconstructs the transferred chunk
+ data by concatenating the contents of each RDMA segment, in list
+ order, into the received Payload stream at the Position value
+ recorded in that RDMA segment.
+
+ Put another way, the Responder inserts the first RDMA segment in a
+ Read chunk into the Payload stream at the byte offset indicated by
+ its Position field. RDMA segments whose Position field value matches
+ this offset are concatenated afterwards, until there are no more RDMA
+ segments at that Position value.
+
+ The Position field in a read segment indicates where the containing
+ Read chunk starts in the Payload stream. The value in this field
+ MUST be a multiple of four. All segments in the same Read chunk
+ share the same Position value, even if one or more of the RDMA
+ segments have a non-four-byte-aligned length.
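+
+ The reconstruction described above might be sketched as follows.
+ The sketch assumes the RDMA Read operations have already completed,
+ so each read segment is modeled as a (position, data) pair. It also
+ assumes the receiver restores any XDR roundup padding that the
+ sender removed from the Payload stream and omitted from the chunk.
+ The function name is hypothetical.
+
```python
from typing import Dict, Iterable, Tuple


def reassemble_call(reduced: bytes,
                    pulled: Iterable[Tuple[int, bytes]]) -> bytes:
    """Rebuild an unreduced Payload stream from (position, data) pairs."""
    chunks: Dict[int, bytes] = {}
    for position, data in pulled:
        # Segments sharing a Position belong to one chunk, in list order.
        chunks[position] = chunks.get(position, b"") + data
    stream = bytearray(reduced)
    # Positions refer to the unreduced stream, so insert in ascending order.
    for position in sorted(chunks):
        data = chunks[position]
        padding = b"\x00" * (-len(data) % 4)   # restore roundup if absent
        stream[position:position] = data + padding
    return bytes(stream)
```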
+
+3.4.5.1. Decoding Read Chunks
+
+ While decoding a received Payload stream, whenever the XDR offset in
+ the Payload stream matches that of a Read chunk, the Responder
+ initiates an RDMA Read to pull the chunk's data content into
+ registered local memory.
+
+ The Responder acknowledges its completion of use of Read chunk source
+ buffers when it sends an RPC Reply message to the Requester. The
+ Requester may then release Read chunks advertised in the request.
+
+3.4.5.2. Read Chunk Roundup
+
+ When reducing a variable-length argument data item, the Requester
+ SHOULD NOT include the data item's XDR roundup padding in the chunk.
+ The length of a Read chunk is determined as follows:
+
+ o If the Requester chooses to include roundup padding in a Read
+ chunk, the chunk's total length MUST be the sum of the encoded
+ length of the data item and the length of the roundup padding.
+ The length of the data item that was encoded into the Payload
+ stream remains unchanged.
+
+ The sender can increase the length of the chunk by adding another
+ RDMA segment containing only the roundup padding, or it can do so
+ by extending the final RDMA segment in the chunk.
+
+ o If the sender chooses not to include roundup padding in the chunk,
+ the chunk's total length MUST be the same as the encoded length of
+ the data item.
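+
+ The two cases above reduce to simple arithmetic, sketched here for
+ illustration. The function names are hypothetical.
+
```python
def xdr_roundup_padding(length: int) -> int:
    """Number of pad octets XDR adds after `length` octets of data."""
    return -length % 4


def read_chunk_length(item_length: int, include_padding: bool = False) -> int:
    """Total Read chunk length for a data item of the given encoded length."""
    if include_padding:
        return item_length + xdr_roundup_padding(item_length)
    return item_length
```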
+
+3.4.6. Write Chunks
+
+ While constructing an RPC Call message, a Requester prepares memory
+ regions in which to receive DDP-eligible result data items. A "Write
+ chunk" represents an XDR data item that is to be pushed from a
+ Responder to a Requester. It is made up of an array of zero or more
+ plain segments.
+
+ Write chunks are provisioned by a Requester long before the Responder
+ has prepared the reply Payload stream. A Requester often does not
+ know the actual length of the result data items to be returned, since
+ the result does not yet exist. Thus, it MUST register Write chunks
+ long enough to accommodate the maximum possible size of each returned
+ data item.
+
+ In addition, the XDR position of DDP-eligible data items in the
+ reply's Payload stream is not predictable when a Requester constructs
+ an RPC Call message. Therefore, RDMA segments in a Write chunk do
+ not have a Position field.
+
+ For each Write chunk provided by a Requester, the Responder pushes
+ one data item to the Requester, filling the chunk contiguously and in
+ segment array order until that data item has been completely written
+ to the Requester. The Responder MUST copy the segment count and all
+ segments from the Requester-provided Write chunk into the RPC Reply
+ message's Transport header. As it does so, the Responder updates
+ each segment length field to reflect the actual amount of data that
+ is being returned in that segment. The Responder then sends the RPC
+ Reply message via an RDMA Send operation.
+
+ An "empty Write chunk" is a Write chunk with a zero segment count.
+ By definition, the length of an empty Write chunk is zero. An
+ "unused Write chunk" has a non-zero segment count, but all of its
+ segments are empty segments.
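+
+ The following non-normative sketch shows one way a Responder could
+ compute the segment lengths it returns for a Write chunk, and how a
+ receiver could tell an empty Write chunk from an unused one.
+ Segments are modeled as (handle, length, offset) tuples; all
+ function names are assumptions made for the example.
+
```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, int]  # (handle, length, offset)


def fill_write_chunk(segments: Sequence[Segment],
                     item_length: int) -> List[Segment]:
    """Return the segments as echoed in the Reply, with lengths updated."""
    capacity = sum(length for _h, length, _o in segments)
    if item_length > capacity:
        raise ValueError("result does not fit in the provided Write chunk")
    returned = []
    remaining = item_length
    for handle, length, offset in segments:
        used = min(length, remaining)   # fill contiguously, in array order
        returned.append((handle, used, offset))
        remaining -= used
    return returned


def is_empty_write_chunk(segments: Sequence[Segment]) -> bool:
    """An empty Write chunk has a zero segment count."""
    return len(segments) == 0


def is_unused_write_chunk(segments: Sequence[Segment]) -> bool:
    """An unused Write chunk has segments, all of them empty."""
    return len(segments) > 0 and all(length == 0 for _h, length, _o in segments)
```
+
+ Note that the segment count is preserved in every case; only the
+ length fields change.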
+
+3.4.6.1. Decoding Write Chunks
+
+ After receiving the RPC Reply message, the Requester reconstructs the
+ transferred data by concatenating the contents of each segment, in
+ array order, into the RPC Reply message's XDR stream at the known XDR
+ position of the associated DDP-eligible result data item.
+
+3.4.6.2. Write Chunk Roundup
+
+ When provisioning a Write chunk for a variable-length result data
+ item, the Requester SHOULD NOT include additional space for XDR
+ roundup padding. A Responder MUST NOT write XDR roundup padding into
+ a Write chunk, even if the Requester made space available for it.
+ Therefore, when returning a single variable-length result data item,
+ a returned Write chunk's total length MUST be the same as the encoded
+ length of the result data item.
+
+3.5. Message Size
+
+ A receiver of RDMA Send operations is required by RDMA to have
+ previously posted one or more adequately sized buffers. Memory
+ savings are achieved on both Requesters and Responders by posting
+ small Receive buffers. However, not all RPC messages are small.
+ RPC-over-RDMA version 1 provides several mechanisms that allow
+ messages of any size to be conveyed efficiently.
+
+3.5.1. Short Messages
+
+ RPC messages are frequently smaller than typical inline thresholds.
+ For example, the NFS version 3 GETATTR operation is only 56 bytes: 20
+ bytes of RPC header, a 32-byte file handle argument, and 4 bytes for
+ its length. The reply to this common request is about 100 bytes.
+
+ Since all RPC messages conveyed via RPC-over-RDMA require an RDMA
+ Send operation, the most efficient way to send an RPC message that is
+ smaller than the inline threshold is to append the Payload stream
+ directly to the Transport stream. An RPC-over-RDMA header with a
+ small RPC Call or Reply message immediately following is transferred
+ using a single RDMA Send operation. No other operations are needed.
+
+ An RPC-over-RDMA transaction using Short Messages:
+
+        Requester                             Responder
+            |        RDMA Send (RDMA_MSG)         |
+       Call |   ------------------------------>   |
+            |                                     |
+            |                                     | Processing
+            |                                     |
+            |        RDMA Send (RDMA_MSG)         |
+            |   <------------------------------   | Reply
+
+3.5.2. Chunked Messages
+
+ If DDP-eligible data items are present in a Payload stream, a sender
+ MAY reduce some or all of these items by removing them from the
+ Payload stream. The sender uses a separate mechanism to transfer the
+ reduced data items. The Transport stream with the reduced Payload
+ stream immediately following is then transferred using a single RDMA
+ Send operation.
+
+ After receiving the Transport and Payload streams of an RPC Call
+ message accompanied by Read chunks, the Responder uses RDMA Read
+ operations to move reduced data items in Read chunks. Before sending
+ the Transport and Payload streams of an RPC Reply message containing
+ Write chunks, the Responder uses RDMA Write operations to move
+ reduced data items in Write and Reply chunks.
+
+ An RPC-over-RDMA transaction with a Read chunk:
+
+        Requester                             Responder
+            |        RDMA Send (RDMA_MSG)         |
+       Call |   ------------------------------>   |
+            |              RDMA Read              |
+            |   <------------------------------   |
+            |      RDMA Response (arg data)       |
+            |   ------------------------------>   |
+            |                                     |
+            |                                     | Processing
+            |                                     |
+            |        RDMA Send (RDMA_MSG)         |
+            |   <------------------------------   | Reply
+
+ An RPC-over-RDMA transaction with a Write chunk:
+
+        Requester                             Responder
+            |        RDMA Send (RDMA_MSG)         |
+       Call |   ------------------------------>   |
+            |                                     |
+            |                                     | Processing
+            |                                     |
+            |      RDMA Write (result data)       |
+            |   <------------------------------   |
+            |        RDMA Send (RDMA_MSG)         |
+            |   <------------------------------   | Reply
+
+3.5.3. Long Messages
+
+ When a Payload stream is larger than the receiver's inline threshold,
+ the Payload stream is reduced by removing DDP-eligible data items and
+ placing them in chunks to be moved separately. If there are no DDP-
+ eligible data items in the Payload stream, or the Payload stream is
+ still too large after it has been reduced, the RDMA transport MUST
+ use RDMA Read or Write operations to convey the Payload stream
+ itself. This mechanism is referred to as a "Long Message".
+
+ To transmit a Long Message, the sender conveys only the Transport
+ stream with an RDMA Send operation. The Payload stream is not
+ included in the Send buffer in this instance. Instead, the Requester
+ provides chunks that the Responder uses to move the Payload stream.
+
+ Long Call
+ To send a Long Call message, the Requester provides a special Read
+ chunk that contains the RPC Call message's Payload stream. Every
+ RDMA read segment in this chunk MUST contain zero in its Position
+ field. Thus, this chunk is known as a "Position Zero Read chunk".
+
+ Long Reply
+ To send a Long Reply, the Requester provides a single special
+ Write chunk in advance, known as the "Reply chunk", that will
+ contain the RPC Reply message's Payload stream. The Requester
+ sizes the Reply chunk to accommodate the maximum expected reply
+ size for that upper-layer operation.
+
+ Though the purpose of a Long Message is to handle large RPC messages,
+ Requesters MAY use a Long Message at any time to convey an RPC Call
+ message.
+
+ A Responder chooses which form of reply to use based on the chunks
+ provided by the Requester. If Write chunks were provided and the
+ Responder has a DDP-eligible result, it first reduces the reply
+ Payload stream. If a Reply chunk was provided and the reduced
+ Payload stream is larger than the reply inline threshold, the
+ Responder MUST use the Requester-provided Reply chunk for the reply.
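+
+ The Responder's choice can be sketched as below. This is a
+ simplified illustration: it compares only the reduced Payload stream
+ length with the inline threshold, as the paragraph above does, and
+ it raises an exception for the case in which no adequate Reply chunk
+ is available, since handling of that case is not covered here.
+ The function and parameter names are hypothetical.
+
```python
from typing import Optional


def choose_reply_proc(reduced_payload_len: int,
                      inline_threshold: int,
                      reply_chunk_len: Optional[int]) -> str:
    """Return "RDMA_MSG" for an inline reply or "RDMA_NOMSG" for a Long Reply.

    reply_chunk_len is None when the Requester provided no Reply chunk.
    """
    if reduced_payload_len <= inline_threshold:
        return "RDMA_MSG"
    if reply_chunk_len is None or reply_chunk_len < reduced_payload_len:
        # Assumption: error handling for this case lives elsewhere.
        raise ValueError("reply cannot be conveyed with the provided chunks")
    return "RDMA_NOMSG"
```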
+
+ XDR data items may appear in these special chunks without regard to
+ their DDP-eligibility. As these chunks contain a Payload stream,
+ such chunks MUST include appropriate XDR roundup padding to maintain
+ proper XDR alignment of their contents.
+
+ An RPC-over-RDMA transaction using a Long Call:
+
+        Requester                             Responder
+            |       RDMA Send (RDMA_NOMSG)        |
+       Call |   ------------------------------>   |
+            |              RDMA Read              |
+            |   <------------------------------   |
+            |      RDMA Response (RPC call)       |
+            |   ------------------------------>   |
+            |                                     |
+            |                                     | Processing
+            |                                     |
+            |        RDMA Send (RDMA_MSG)         |
+            |   <------------------------------   | Reply
+
+ An RPC-over-RDMA transaction using a Long Reply:
+
+        Requester                             Responder
+            |        RDMA Send (RDMA_MSG)         |
+       Call |   ------------------------------>   |
+            |                                     |
+            |                                     | Processing
+            |                                     |
+            |       RDMA Write (RPC reply)        |
+            |   <------------------------------   |
+            |       RDMA Send (RDMA_NOMSG)        |
+            |   <------------------------------   | Reply
+
+4. RPC-over-RDMA in Operation
+
+ Every RPC-over-RDMA version 1 message has a header that includes a
+ copy of the message's transaction ID, data for managing RDMA flow-
+ control credits, and lists of RDMA segments describing chunks. All
+ RPC-over-RDMA header content is contained in the Transport stream;
+ thus, it MUST be XDR encoded.
+
+ RPC message layout is unchanged from that described in [RFC5531]
+ except for the possible reduction of data items that are moved by
+ separate operations.
+
+ The RPC-over-RDMA protocol passes RPC messages without regard to
+ their type (CALL or REPLY). Apart from restrictions imposed by ULBs,
+ each endpoint of a connection MAY send RDMA_MSG or RDMA_NOMSG message
+ header types at any time (subject to credit limits).
+
+4.1. XDR Protocol Definition
+
+ This section contains a description of the core features of the RPC-
+ over-RDMA version 1 protocol, expressed in the XDR language
+ [RFC4506].
+
+ This description is provided in a way that makes it simple to extract
+ into ready-to-compile form. The reader can apply the following shell
+ script to this document to produce a machine-readable XDR description
+ of the RPC-over-RDMA version 1 protocol.
+
+ <CODE BEGINS>
+
+ #!/bin/sh
+ grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'
+
+ <CODE ENDS>
+
+ That is, if the above script is stored in a file called "extract.sh"
+ and this document is in a file called "spec.txt", then the reader can
+ do the following to extract an XDR description file:
+
+ <CODE BEGINS>
+
+ sh extract.sh < spec.txt > rpcrdma_corev1.x
+
+ <CODE ENDS>
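+
+ For readers without a POSIX shell, the same extraction can be
+ approximated in Python. This is a sketch, not part of the
+ specification; the function name extract_xdr is hypothetical, and
+ the patterns used tolerate any amount of leading indentation
+ before the "///" marker.
+
```python
import re

_SELECT = re.compile(r"^ *///")
_STRIP_PREFIX = re.compile(r"^ */// ")
_STRIP_BARE = re.compile(r"^ *///$")


def extract_xdr(text: str) -> str:
    """Return the XDR description embedded in an RFC-style document."""
    out = []
    for line in text.splitlines():
        if not _SELECT.match(line):
            continue                      # keep only "///" lines
        line = _STRIP_PREFIX.sub("", line, count=1)
        line = _STRIP_BARE.sub("", line, count=1)
        out.append(line + "\n")
    return "".join(out)
```
+
+ A caller would typically read the document text from a file and
+ write the returned string to a file such as rpcrdma_corev1.x.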
+
+4.1.1. Code Component License
+
+ Code components extracted from this document must include the
+ following license text. When the extracted XDR code is combined with
+ other complementary XDR code, which itself has an identical license,
+ only a single copy of the license text need be preserved.
+
+ <CODE BEGINS>
+
+ /// /*
+ /// * Copyright (c) 2010-2017 IETF Trust and the persons
+ /// * identified as authors of the code. All rights reserved.
+ /// *
+ /// * The authors of the code are:
+ /// * B. Callaghan, T. Talpey, and C. Lever
+ /// *
+ /// * Redistribution and use in source and binary forms, with
+ /// * or without modification, are permitted provided that the
+ /// * following conditions are met:
+ /// *
+ /// * - Redistributions of source code must retain the above
+ /// * copyright notice, this list of conditions and the
+ /// * following disclaimer.
+ /// *
+ /// * - Redistributions in binary form must reproduce the above
+ /// * copyright notice, this list of conditions and the
+ /// * following disclaimer in the documentation and/or other
+ /// * materials provided with the distribution.
+ /// *
+ /// * - Neither the name of Internet Society, IETF or IETF
+ /// * Trust, nor the names of specific contributors, may be
+ /// * used to endorse or promote products derived from this
+ /// * software without specific prior written permission.
+ /// *
+ /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
+ /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
+ /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
+ /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
+ /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ /// */
+ ///
+
+ <CODE ENDS>
+
+4.1.2. RPC-over-RDMA Version 1 XDR
+
+ XDR data items defined in this section encode the Transport Header
+ Stream in each RPC-over-RDMA version 1 message. Comments identify
+ items that cannot be changed in subsequent versions.
+
+ <CODE BEGINS>
+
+ /// /*
+ ///  * Plain RDMA segment (Section 3.4.3)
+ ///  */
+ /// struct xdr_rdma_segment {
+ ///    uint32 handle;           /* Registered memory handle */
+ ///    uint32 length;           /* Length of the segment in bytes */
+ ///    uint64 offset;           /* Segment virtual address or offset */
+ /// };
+ ///
+ /// /*
+ ///  * RDMA read segment (Section 3.4.5)
+ ///  */
+ /// struct xdr_read_chunk {
+ ///    uint32 position;         /* Position in XDR stream */
+ ///    struct xdr_rdma_segment target;
+ /// };
+ ///
+ /// /*
+ ///  * Read list (Section 4.3.1)
+ ///  */
+ /// struct xdr_read_list {
+ ///    struct xdr_read_chunk entry;
+ ///    struct xdr_read_list  *next;
+ /// };
+ ///
+ /// /*
+ ///  * Write chunk (Section 3.4.6)
+ ///  */
+ /// struct xdr_write_chunk {
+ ///    struct xdr_rdma_segment target<>;
+ /// };
+ ///
+ /// /*
+ ///  * Write list (Section 4.3.2)
+ ///  */
+ /// struct xdr_write_list {
+ ///    struct xdr_write_chunk entry;
+ ///    struct xdr_write_list  *next;
+ /// };
+ ///
+ /// /*
+ ///  * Chunk lists (Section 4.3)
+ ///  */
+ /// struct rpc_rdma_header {
+ ///    struct xdr_read_list   *rdma_reads;
+ ///    struct xdr_write_list  *rdma_writes;
+ ///    struct xdr_write_chunk *rdma_reply;
+ ///    /* rpc body follows */
+ /// };
+ ///
+ /// struct rpc_rdma_header_nomsg {
+ ///    struct xdr_read_list   *rdma_reads;
+ ///    struct xdr_write_list  *rdma_writes;
+ ///    struct xdr_write_chunk *rdma_reply;
+ /// };
+ ///
+ /// /* Not to be used */
+ /// struct rpc_rdma_header_padded {
+ ///    uint32                 rdma_align;
+ ///    uint32                 rdma_thresh;
+ ///    struct xdr_read_list   *rdma_reads;
+ ///    struct xdr_write_list  *rdma_writes;
+ ///    struct xdr_write_chunk *rdma_reply;
+ ///    /* rpc body follows */
+ /// };
+ ///
+ /// /*
+ ///  * Error handling (Section 4.5)
+ ///  */
+ /// enum rpc_rdma_errcode {
+ ///    ERR_VERS = 1,       /* Value fixed for all versions */
+ ///    ERR_CHUNK = 2
+ /// };
+ ///
+ /// /* Structure fixed for all versions */
+ /// struct rpc_rdma_errvers {
+ ///    uint32 rdma_vers_low;
+ ///    uint32 rdma_vers_high;
+ /// };
+ ///
+ /// union rpc_rdma_error switch (rpc_rdma_errcode err) {
+ ///    case ERR_VERS:
+ ///      rpc_rdma_errvers range;
+ ///    case ERR_CHUNK:
+ ///      void;
+ /// };
+ ///
+ /// /*
+ ///  * Procedures (Section 4.2.4)
+ ///  */
+ /// enum rdma_proc {
+ ///    RDMA_MSG = 0,     /* Value fixed for all versions */
+ ///    RDMA_NOMSG = 1,   /* Value fixed for all versions */
+ ///    RDMA_MSGP = 2,    /* Not to be used */
+ ///    RDMA_DONE = 3,    /* Not to be used */
+ ///    RDMA_ERROR = 4    /* Value fixed for all versions */
+ /// };
+ ///
+ /// /* The position of the proc discriminator field is
+ ///  * fixed for all versions */
+ /// union rdma_body switch (rdma_proc proc) {
+ ///    case RDMA_MSG:
+ ///      rpc_rdma_header rdma_msg;
+ ///    case RDMA_NOMSG:
+ ///      rpc_rdma_header_nomsg rdma_nomsg;
+ ///    case RDMA_MSGP:   /* Not to be used */
+ ///      rpc_rdma_header_padded rdma_msgp;
+ ///    case RDMA_DONE:   /* Not to be used */
+ ///      void;
+ ///    case RDMA_ERROR:
+ ///      rpc_rdma_error rdma_error;
+ /// };
+ ///
+ /// /*
+ ///  * Fixed header fields (Section 4.2)
+ ///  */
+ /// struct rdma_msg {
+ ///    uint32    rdma_xid;      /* Position fixed for all versions */
+ ///    uint32    rdma_vers;     /* Position fixed for all versions */
+ ///    uint32    rdma_credit;   /* Position fixed for all versions */
+ ///    rdma_body rdma_body;
+ /// };
+
+ <CODE ENDS>
+
+4.2. Fixed Header Fields
+
+ The RPC-over-RDMA header begins with four fixed 32-bit fields that
+ control the RDMA interaction.
+
+ The first three words are individual fields in the rdma_msg
+ structure. The fourth word is the first word of the rdma_body union,
+ which acts as the discriminator for the switched union. The contents
+ of this field are described in Section 4.2.4.
+
+ These four fields must remain with the same meanings and in the same
+ positions in all subsequent versions of the RPC-over-RDMA protocol.
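+
+ As a non-normative illustration, the four fixed fields can be
+ decoded with a single big-endian unpack. The FixedHeader type and
+ parse_fixed_header function are hypothetical names; the constant
+ values are those of the rdma_proc enum in Section 4.1.2.
+
```python
import struct
from typing import NamedTuple

RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE, RDMA_ERROR = range(5)

_FIXED = struct.Struct(">IIII")  # four 32-bit big-endian words


class FixedHeader(NamedTuple):
    xid: int     # copy of the RPC transaction ID
    vers: int    # RPC-over-RDMA protocol version
    credit: int  # requested or granted credit value
    proc: int    # discriminator of the rdma_body union


def parse_fixed_header(transport: bytes) -> FixedHeader:
    """Decode the fixed fields at the start of a Transport stream."""
    if len(transport) < _FIXED.size:
        raise ValueError("Transport stream shorter than the fixed header")
    return FixedHeader(*_FIXED.unpack_from(transport, 0))
```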
+
+4.2.1. Transaction ID (XID)
+
+ This field carries the XID generated for the RPC Call and Reply
+ messages. Having the XID at a fixed location in the header makes it
+ easy for the receiver to establish context as soon as each
+ RPC-over-RDMA message arrives.
+ This XID MUST be the same as the XID in the RPC message. The
+ receiver MAY perform its processing based solely on the XID in the
+ RPC-over-RDMA header, and thereby ignore the XID in the RPC message,
+ if it so chooses.
+
+4.2.2. Version Number
+
+ For RPC-over-RDMA version 1, this field MUST contain the value one
+ (1). Rules regarding changes to this transport protocol version
+ number can be found in Section 7.
+
+4.2.3. Credit Value
+
+ In an RPC Call message, this field carries the requested credit
+ value. In an RPC Reply message, it carries the granted credit
+ value. Further discussion of how the credit value is
+ determined can be found in Section 3.3.
+
+4.2.4. Procedure Number
+
+ RDMA_MSG = 0      indicates that chunk lists and a Payload stream
+                   follow. The format of the chunk lists is
+                   discussed below.
+
+ RDMA_NOMSG = 1    indicates that after the chunk lists there is no
+                   Payload stream. In this case, the chunk lists
+                   provide information to allow the Responder to
+                   transfer the Payload stream using explicit RDMA
+                   operations.
+
+ RDMA_MSGP = 2     is reserved.
+
+ RDMA_DONE = 3     is reserved.
+
+ RDMA_ERROR = 4    is used to signal an encoding error in the
+                   RPC-over-RDMA header.
+
+ An RDMA_MSG procedure conveys the Transport stream and the Payload
+ stream via an RDMA Send operation. The Transport stream contains the
+ four fixed fields followed by the Read and Write lists and the Reply
+ chunk, though any or all three MAY be marked as not present. The
+ Payload stream then follows, beginning with its XID field. If a Read
+ or Write chunk list is present, a portion of the Payload stream has
+ been reduced and is conveyed via separate operations.
+
+ An RDMA_NOMSG procedure conveys the Transport stream via an RDMA Send
+ operation. The Transport stream contains the four fixed fields
+ followed by the Read and Write chunk lists and the Reply chunk.
+ Though any of these MAY be marked as not present, one MUST be present
+ and MUST hold the Payload stream for this RPC-over-RDMA message. If
+ a Read or Write chunk list is present, a portion of the Payload
+ stream has been excised and is conveyed via separate operations.
+
+ An RDMA_ERROR procedure conveys the Transport stream via an RDMA Send
+ operation. The Transport stream contains the four fixed fields
+ followed by formatted error information. No Payload stream is
+ conveyed in this type of RPC-over-RDMA message.
+
+ A Requester MUST NOT send an RPC-over-RDMA header with the RDMA_ERROR
+ procedure. A Responder MUST silently discard RDMA_ERROR procedures.
+
+ The Transport stream and Payload stream can be constructed in
+ separate buffers. However, the total length of the gathered buffers
+ cannot exceed the inline threshold.
+
+4.3. Chunk Lists
+
+ The chunk lists in an RPC-over-RDMA version 1 header are three XDR
+ optional-data fields that follow the fixed header fields in RDMA_MSG
+ and RDMA_NOMSG procedures. Read Section 4.19 of [RFC4506] carefully
+ to understand how optional-data fields work. Examples of XDR-encoded
+ chunk lists are provided in Section 4.7 as an aid to understanding.
+
+ Often, an RPC-over-RDMA message has no associated chunks. In this
+ case, the Read list, Write list, and Reply chunk are all marked "not
+ present".
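+
+ The optional-data encoding of the three chunk lists might be
+ sketched as follows. Each present list entry is preceded by a
+ 32-bit word containing one, and each list (or an absent Reply
+ chunk) ends with a word containing zero. Segments are modeled as
+ (handle, length, offset) tuples, and the function name is
+ hypothetical; this is an illustration, not a reference encoder.
+
```python
import struct
from typing import Optional, Sequence, Tuple

Segment = Tuple[int, int, int]           # (handle, length, offset)
ReadSegment = Tuple[int, Segment]        # (position, plain segment)


def _u32(value: int) -> bytes:
    return struct.pack(">I", value)


def _counted_segments(chunk: Sequence[Segment]) -> bytes:
    # A Write chunk is a counted array: segment count, then segments.
    body = b"".join(struct.pack(">IIQ", *seg) for seg in chunk)
    return _u32(len(chunk)) + body


def encode_chunk_lists(read_list: Sequence[ReadSegment],
                       write_list: Sequence[Sequence[Segment]],
                       reply_chunk: Optional[Sequence[Segment]] = None
                       ) -> bytes:
    """Encode the Read list, Write list, and Reply chunk, in that order."""
    out = bytearray()
    for position, seg in read_list:
        out += _u32(1) + _u32(position) + struct.pack(">IIQ", *seg)
    out += _u32(0)                       # end of Read list
    for chunk in write_list:
        out += _u32(1) + _counted_segments(chunk)
    out += _u32(0)                       # end of Write list
    if reply_chunk is None:
        out += _u32(0)                   # Reply chunk not present
    else:
        out += _u32(1) + _counted_segments(reply_chunk)
    return bytes(out)
```
+
+ With no chunks at all, the result is three zero words (12 octets),
+ which is the common case described above.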
+
+4.3.1. Read List
+
+ Each RDMA_MSG or RDMA_NOMSG procedure has one "Read list". The Read
+ list is a list of zero or more RDMA read segments, provided by the
+ Requester, that are grouped by their Position fields into Read
+ chunks. Each Read chunk advertises the location of argument data the
+ Responder is to pull from the Requester. The Requester has reduced
+ the data items in these chunks from the call's Payload stream.
+
+ A Requester may transmit the Payload stream of an RPC Call message
+ using a Position Zero Read chunk. If the RPC Call message has no
+ argument data that is DDP-eligible and the Position Zero Read chunk
+ is not being used, the Requester leaves the Read list empty.
+
+ Responders MUST leave the Read list empty in all replies.
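+
+ Grouping a decoded Read list into Read chunks can be sketched as
+ below. Read list entries are modeled as (position, segment) pairs,
+ and the function name is hypothetical. The check that Position is
+ a multiple of four reflects the requirement in Section 3.4.5.
+
```python
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[int, int, int]  # (handle, length, offset)


def group_read_list(read_list: Iterable[Tuple[int, Segment]]
                    ) -> Dict[int, List[Segment]]:
    """Group RDMA read segments into Read chunks keyed by Position."""
    chunks: Dict[int, List[Segment]] = {}
    for position, segment in read_list:
        if position % 4:
            raise ValueError("Position must be a multiple of four")
        # Segments keep their list order within each chunk.
        chunks.setdefault(position, []).append(segment)
    return chunks
```
+
+ A key of zero in the result identifies a Position Zero Read chunk.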
+
+4.3.1.1. Matching Read Chunks to Arguments
+
+ When reducing a DDP-eligible argument data item, a Requester records
+ the XDR stream offset of that data item in the Read chunk's Position
+ field. The Responder can then tell unambiguously where that chunk is
+ to be reinserted into the received Payload stream to form a complete
+ RPC Call message.
+
+4.3.2. Write List
+
+ Each RDMA_MSG or RDMA_NOMSG procedure has one "Write list". The
+ Write list is a list of zero or more Write chunks, provided by the
+ Requester. Each Write chunk is an array of plain segments; thus, the
+ Write list is a list of counted arrays.
+
+ If an RPC Reply message has no possible DDP-eligible result data
+ items, the Requester leaves the Write list empty. When a Requester
+ provides a Write list, the Responder MUST push data corresponding to
+ DDP-eligible result data items to Requester memory referenced in the
+ Write list. The Responder removes these data items from the reply's
+ Payload stream.
+
+4.3.2.1. Matching Write Chunks to Results
+
+ A Requester constructs the Write list for an RPC transaction before
+ the Responder has formulated its reply. When there is only one DDP-
+ eligible result data item, the Requester inserts only a single Write
+ chunk in the Write list. If the returned Write chunk is not an
+ unused Write chunk, the Requester knows with certainty which result
+ data item is contained in it.
+
+ When a Requester has provided multiple Write chunks, the Responder
+ fills in each Write chunk with one DDP-eligible result until there
+ are either no more DDP-eligible results or no more Write chunks.
+
+ The Requester might not be able to predict in advance which DDP-
+ eligible data item goes in which chunk. Thus, the Requester is
+ responsible for allocating and registering Write chunks large enough
+ to accommodate the largest result data item that might be associated
+ with each chunk in the Write list.
+
+ As a Requester decodes a reply Payload stream, it is clear from the
+ contents of the RPC Reply message which Write chunk contains which
+ result data item.
+
+4.3.2.2. Unused Write Chunks
+
+ There are occasions when a Requester provides a non-empty Write chunk
+ but the Responder is not able to use it. For example, a ULP may
+ define a union result where some arms of the union contain a DDP-
+ eligible data item while other arms do not. The Responder is
+ required to use Requester-provided Write chunks in this case, but if
+ the Responder returns a result that uses an arm of the union that has
+ no DDP-eligible data item, that Write chunk remains unconsumed.
+
+ If there is a subsequent DDP-eligible result data item in the RPC
+ Reply message, it MUST be placed in that unconsumed Write chunk.
+ Therefore, the Requester MUST provision each Write chunk so it can be
+ filled with the largest DDP-eligible data item that can be placed in
+ it.
+
+ If this is the last or only Write chunk available and it remains
+ unconsumed, the Responder MUST return this Write chunk as an unused
+ Write chunk (see Section 3.4.6). The Responder sets the segment
+ count to a value matching the Requester-provided Write chunk, but
+ returns only empty segments in that Write chunk.
+
+ Unused Write chunks, or unused bytes in Write chunk segments, are
+ returned to the RPC consumer as part of RPC completion. Even if a
+ Responder indicates that a Write chunk is not consumed, the Responder
+ may have written data into one or more segments before choosing not
+ to return that data item. The Requester MUST NOT assume that the
+ memory regions backing a Write chunk have not been modified.
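+
+ The shape of an unused Write chunk can be sketched as follows. This
+ Python fragment is illustrative only. It assumes that an empty
+ segment is one whose length field is zero and that the handle and
+ offset are carried through unchanged; the helper names and the
+ (handle, length, offset) tuple layout are hypothetical.
+
```python
def make_unused(chunk):
    """Return a chunk with the same segment count but zero-length segments."""
    return [(handle, 0, offset) for handle, _, offset in chunk]


def is_unused(chunk):
    """True if the chunk has at least one segment and all are empty."""
    return len(chunk) > 0 and all(length == 0 for _, length, _ in chunk)
```
+
+ Even when is_unused() reports True for a returned chunk, the
+ Requester still treats the backing memory as possibly modified.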
+
+4.3.2.3. Empty Write Chunks
+
+ To force a Responder to return a DDP-eligible result inline, a
+ Requester employs the following mechanism:
+
+ o When there is only one DDP-eligible result item in an RPC Reply
+ message, the Requester provides an empty Write list.
+
+ o When there are multiple DDP-eligible result data items and a
+ Requester prefers that a data item is returned inline, the
+ Requester provides an empty Write chunk for that item (see
+ Section 3.4.6). The Responder MUST return the corresponding
+ result data item inline and MUST return an empty Write chunk in
+ that Write list position in the RPC Reply message.
+
+ As always, a Requester and Responder must prepare for a Long Reply to
+ be used if the resulting RPC Reply might be too large to be conveyed
+ in an RDMA Send.
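+
+ The two cases in this section can be sketched from the Requester's
+ point of view. The Python fragment below is a non-normative
+ illustration: the function name is hypothetical, None is used here to
+ mean "return this item inline", and an empty Write chunk is assumed
+ to be a chunk with a segment count of zero.
+
```python
def build_write_list(items):
    """Build a Write list from one entry per DDP-eligible result.

    Each entry is either a list of (handle, length, offset) segments or
    None when the Requester prefers that result to be returned inline.
    """
    if len(items) == 1 and items[0] is None:
        return []  # single result item: provide an empty Write list
    # Multiple items: an empty Write chunk holds the inline item's position.
    return [[] if segments is None else list(segments) for segments in items]
```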
+
+4.3.3. Reply Chunk
+
+ Each RDMA_MSG or RDMA_NOMSG procedure has one "Reply chunk" slot. A
+ Requester MUST provide a Reply chunk whenever the maximum possible
+ size of the RPC Reply message's Transport and Payload streams is
+ larger than the inline threshold for messages from Responder to
+ Requester. Otherwise, the Requester marks the Reply chunk as not
+ present.
+
+ If the Transport stream and Payload stream together are smaller than
+ the reply inline threshold, the Responder MAY return the RPC Reply
+ message as a Short message rather than using the Requester-provided
+ Reply chunk.
+
+ When a Requester provides a Reply chunk in an RPC Call message, the
+ Responder MUST copy that chunk into the Transport header of the RPC
+ Reply message. As with Write chunks, the Responder modifies the
+ copied Reply chunk in the RPC Reply message to reflect the actual
+ amount of data that is being returned in the Reply chunk.
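+
+ The size comparisons in this section can be written out as a short,
+ non-normative Python sketch. The function and parameter names are
+ assumptions made for illustration, and the 1024-byte threshold in the
+ usage note is an arbitrary example value.
+
```python
def needs_reply_chunk(max_transport_bytes, max_payload_bytes,
                      reply_inline_threshold):
    """Requester side: is a Reply chunk required for this RPC?"""
    largest_reply = max_transport_bytes + max_payload_bytes
    return largest_reply > reply_inline_threshold


def may_send_short(transport_bytes, payload_bytes, reply_inline_threshold):
    """Responder side: may the actual reply be sent as a Short message?"""
    return transport_bytes + payload_bytes < reply_inline_threshold
```
+
+ For example, with a threshold of 1024 bytes, a reply that could reach
+ 128 + 4096 bytes needs a Reply chunk, yet the Responder may still
+ answer with a Short message if the actual reply is 28 + 500 bytes.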
+
+4.4. Memory Registration
+
+ The cost of registering and invalidating memory can be a significant
+ proportion of the cost of an RPC-over-RDMA transaction. Thus, an
+ important implementation consideration is how to minimize
+ registration activity without exposing system memory needlessly.
+
+4.4.1. Registration Longevity
+
+ Data transferred via RDMA Read and Write can reside in a memory
+ allocation not in the control of the RPC-over-RDMA transport. These
+ memory allocations can persist outside the bounds of an RPC
+ transaction. They are registered and invalidated as needed, as part
+ of each RPC transaction.
+
+ The Requester endpoint must ensure that memory regions associated
+ with each RPC transaction are protected from Responder access before
+ allowing upper-layer access to the data contained in them. Moreover,
+ the Requester must not access these memory regions while the
+ Responder has access to them.
+
+ This includes memory regions that are associated with canceled RPCs.
+ A Responder cannot know that the Requester is no longer waiting for a
+ reply, and it might proceed to read or even update memory that the
+ Requester might have released for other use.
+
+4.4.2. Communicating DDP-Eligibility
+
+ The interface by which a ULP implementation communicates the
+ eligibility of a data item to its local RPC-over-RDMA endpoint is
+ not described by this specification.
+
+ Depending on the implementation and constraints imposed by ULBs, it
+ is possible to implement reduction transparently to upper layers.
+ Such implementations may lead to inefficiencies, either because they
+ require the RPC layer to perform expensive registration and
+ invalidation of memory "on the fly", or they may require using RDMA
+ chunks in RPC Reply messages, along with the resulting additional
+ handshaking with the RPC-over-RDMA peer.
+
+ However, these issues are internal and generally confined to the
+ local interface between RPC and its upper layers, one in which
+ implementations are free to innovate. The only requirement, beyond
+ constraints imposed by the ULB, is that the resulting RPC-over-RDMA
+ protocol sent to the peer be valid for the upper layer.
+
+4.4.3. Registration Strategies
+
+ The choice of which memory registration strategies to employ is left
+ to Requester and Responder implementers. To support the widest array
+ of RDMA implementations, as well as the most general steering tag
+ scheme, an Offset field is included in each RDMA segment.
+
+ While zero-based offset schemes are available in many RDMA
+ implementations, their use by RPC requires individual registration of
+ each memory region. For such implementations, this can be a
+ significant overhead. By providing an offset in each chunk, many
+ pre-registration or region-based registrations can be readily
+ supported.
+
+4.5. Error Handling
+
+ A receiver performs basic validity checks on the RPC-over-RDMA header
+ and chunk contents before it passes the RPC message to the RPC layer.
+ If an incoming RPC-over-RDMA message is not as long as a minimal size
+ RPC-over-RDMA header (28 bytes), the receiver cannot trust the value
+ of the XID field; therefore, it MUST silently discard the message
+ before performing any parsing. If other errors are detected in the
+ RPC-over-RDMA header of an RPC Call message, a Responder MUST send an
+ RDMA_ERROR message back to the Requester. If errors are detected in
+ the RPC-over-RDMA header of an RPC Reply message, a Requester MUST
+ silently discard the message.
+
+ To form an RDMA_ERROR procedure:
+
+ o The rdma_xid field MUST contain the same XID that was in the
+ rdma_xid field in the failing request;
+
+ o The rdma_vers field MUST contain the same version that was in the
+ rdma_vers field in the failing request;
+
+ o The rdma_proc field MUST contain the value RDMA_ERROR; and
+
+ o The rdma_err field contains a value that reflects the type of
+ error that occurred, as described below.
+
+ An RDMA_ERROR procedure indicates a permanent error. Receipt of this
+ procedure completes the RPC transaction associated with the XID in
+ the rdma_xid field. A receiver MUST silently discard an RDMA_ERROR
+ procedure that it cannot decode.
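+
+ The steps for forming an RDMA_ERROR procedure can be sketched in
+ Python. This is a non-normative illustration: the function name is
+ hypothetical, the numeric values of RDMA_ERROR and ERR_CHUNK are
+ assumed to match the XDR definition given elsewhere in this document,
+ and the credit value placed in the third header word is supplied by
+ the caller.
+
```python
import struct

RDMA_ERROR = 4       # assumed rdma_proc value for RDMA_ERROR
ERR_CHUNK = 2        # assumed rdma_err value for ERR_CHUNK
MIN_HEADER_LEN = 28  # minimal RPC-over-RDMA header size, in bytes


def build_err_chunk(failing_request, credits):
    """Return an RDMA_ERROR/ERR_CHUNK message, or None to discard."""
    if len(failing_request) < MIN_HEADER_LEN:
        return None  # XID cannot be trusted: silently discard
    # Echo the XID and version found in the failing request.
    xid, vers = struct.unpack_from(">II", failing_request, 0)
    return struct.pack(">IIIII", xid, vers, credits, RDMA_ERROR, ERR_CHUNK)
```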
+
+4.5.1. Header Version Mismatch
+
+ When a Responder detects an RPC-over-RDMA header version that it does
+ not support (currently this document defines only version 1), it MUST
+ reply with an RDMA_ERROR procedure and set the rdma_err value to
+ ERR_VERS, also providing the low and high inclusive version numbers
+ it does, in fact, support.
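+
+ A version check of this kind might look like the following Python
+ sketch. It is illustrative only: the function name is hypothetical,
+ the numeric values of RDMA_ERROR and ERR_VERS are assumed to match
+ the XDR definition given elsewhere in this document, and the
+ supported range is fixed at version 1.
+
```python
import struct

RDMA_ERROR = 4      # assumed rdma_proc value for RDMA_ERROR
ERR_VERS = 1        # assumed rdma_err value for ERR_VERS
SUPPORTED_LOW = 1   # lowest version this sketch supports
SUPPORTED_HIGH = 1  # highest version this sketch supports


def check_version(request, credits):
    """Return None if the version is supported, else an ERR_VERS reply."""
    xid, vers = struct.unpack_from(">II", request, 0)
    if SUPPORTED_LOW <= vers <= SUPPORTED_HIGH:
        return None
    # Echo XID and version, then report the inclusive supported range.
    return struct.pack(">IIIIIII", xid, vers, credits, RDMA_ERROR,
                       ERR_VERS, SUPPORTED_LOW, SUPPORTED_HIGH)
```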
+
+4.5.2. XDR Errors
+
+ A receiver might encounter an XDR parsing error that prevents it from
+ processing the incoming Transport stream. Examples of such errors
+ include an invalid value in the rdma_proc field; an RDMA_NOMSG
+ message where the Read list, Write list, and Reply chunk are marked
+ not present; or the value of the rdma_xid field does not match the
+ value of the XID field in the accompanying RPC message. If the
+ rdma_vers field contains a recognized value, but an XDR parsing error
+ occurs, the Responder MUST reply with an RDMA_ERROR procedure and set
+ the rdma_err value to ERR_CHUNK.
+
+ When a Responder receives a valid RPC-over-RDMA header but the
+ Responder's ULP implementation cannot parse the RPC arguments in the
+ RPC Call message, the Responder SHOULD return an RPC Reply message
+ with status GARBAGE_ARGS, using an RDMA_MSG procedure. This type of
+ parsing failure might be due to mismatches between chunk sizes or
+ offsets and the contents of the Payload stream, for example.
+
+4.5.3. Responder RDMA Operational Errors
+
+ In RPC-over-RDMA version 1, the Responder initiates RDMA Read and
+ Write operations that target the Requester's memory. Problems might
+ arise as the Responder attempts to use Requester-provided resources
+ for RDMA operations. For example:
+
+ o Usually, chunks can be validated only by using their contents to
+ perform data transfers. If chunk contents are invalid (e.g., a
+ memory region is no longer registered or a chunk length exceeds
+ the end of the registered memory region), a Remote Access Error
+ occurs.
+
+ o If a Requester's Receive buffer is too small, the Responder's Send
+ operation completes with a Local Length Error.
+
+ o If the Requester-provided Reply chunk is too small to accommodate
+ a large RPC Reply message, a Remote Access Error occurs. A
+ Responder might detect this problem before attempting to write
+ past the end of the Reply chunk.
+
+ RDMA operational errors are typically fatal to the connection. To
+ avoid a retransmission loop and repeated connection loss that
+ deadlocks the connection, once the Requester has re-established a
+ connection, the Responder should send an RDMA_ERROR reply with an
+ rdma_err value of ERR_CHUNK to indicate that no RPC-level reply is
+ possible for that XID.
+
+4.5.4. Other Operational Errors
+
+ While a Requester is constructing an RPC Call message, an
+ unrecoverable problem might occur that prevents the Requester from
+ posting further RDMA Work Requests on behalf of that message. As
+ with other transports, if a Requester is unable to construct and
+ transmit an RPC Call message, the associated RPC transaction fails
+ immediately.
+
+ After a Requester has received a reply, if it is unable to invalidate
+ a memory region due to an unrecoverable problem, the Requester MUST
+ close the connection to protect that memory from Responder access
+ before the associated RPC transaction is complete.
+
+ While a Responder is constructing an RPC Reply message or error
+ message, an unrecoverable problem might occur that prevents the
+ Responder from posting further RDMA Work Requests on behalf of that
+ message. If a Responder is unable to construct and transmit an RPC
+ Reply or RPC-over-RDMA error message, the Responder MUST close the
+ connection to signal to the Requester that a reply was lost.
+
+4.5.5. RDMA Transport Errors
+
+ The RDMA connection and physical link provide some degree of error
+ detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer
+ (when used over TCP), the Stream Control Transmission Protocol
+ (SCTP), as well as the InfiniBand [IBARCH] link layer all provide
+ Cyclic Redundancy Check (CRC) protection of the RDMA payload, and
+ CRC-class protection is a general attribute of such transports.
+
+ Additionally, the RPC layer itself can accept errors from the
+ transport and recover via retransmission. RPC recovery can handle
+ complete loss and re-establishment of a transport connection.
+
+ The details of reporting and recovery from RDMA link-layer errors are
+ described in specific link-layer APIs and operational specifications
+ and are outside the scope of this protocol specification. See
+ Section 8 for further discussion of the use of RPC-level integrity
+ schemes to detect errors.
+
+4.6. Protocol Elements No Longer Supported
+
+ The following protocol elements are no longer supported in RPC-over-
+ RDMA version 1. Related enum values and structure definitions remain
+ in the RPC-over-RDMA version 1 protocol for backwards compatibility.
+
+4.6.1. RDMA_MSGP
+
+ The specification of RDMA_MSGP in Section 3.9 of [RFC5666] is
+ incomplete. To fully specify RDMA_MSGP would require:
+
+ o Updating the definition of DDP-eligibility to include data items
+ that may be transferred, with padding, via RDMA_MSGP procedures
+
+ o Adding full operational descriptions of the alignment and
+ threshold fields
+
+ o Discussing how alignment preferences are communicated between two
+ peers without using CCP
+
+ o Describing the treatment of RDMA_MSGP procedures that convey Read
+ or Write chunks
+
+ The RDMA_MSGP message type is beneficial only when the padded data
+ payload is at the end of an RPC message's argument or result list.
+ This is not typical for NFSv4 COMPOUND RPCs, which often include a
+ GETATTR operation as the final element of the compound operation
+ array.
+
+ Without a full specification of RDMA_MSGP, there has been no fully
+ implemented prototype of it. Without a complete prototype of
+ RDMA_MSGP support, it is difficult to assess whether this protocol
+ element has benefit or can even be made to work interoperably.
+
+ Therefore, senders MUST NOT send RDMA_MSGP procedures. When
+ receiving an RDMA_MSGP procedure, Responders SHOULD reply with an
+ RDMA_ERROR procedure, setting the rdma_err field to ERR_CHUNK;
+ Requesters MUST silently discard the message.
+
+4.6.2. RDMA_DONE
+
+ Because no implementation of RPC-over-RDMA version 1 uses the Read-
+ Read transfer model, there is never a need to send an RDMA_DONE
+ procedure.
+
+ Therefore, senders MUST NOT send RDMA_DONE messages. Receivers MUST
+ silently discard RDMA_DONE messages.
+
+4.7. XDR Examples
+
+ RPC-over-RDMA chunk lists are complex data types. In this section,
+ illustrations are provided to help readers grasp how chunk lists are
+ represented inside an RPC-over-RDMA header.
+
+ A plain segment is the simplest component, being made up of a 32-bit
+ handle (H), a 32-bit length (L), and 64 bits of offset (OO). Once
+ flattened into an XDR stream, plain segments appear as
+
+ HLOO
+
+ An RDMA read segment has an additional 32-bit position field (P).
+ RDMA read segments appear as
+
+ PHLOO
+
+ A Read chunk is a list of RDMA read segments. Each RDMA read segment
+ is preceded by a 32-bit word containing a one if a segment follows or
+ a zero if there are no more segments in the list. In XDR form, this
+ would look like
+
+ 1 PHLOO 1 PHLOO 1 PHLOO 0
+
+ where P would hold the same value for each RDMA read segment
+ belonging to the same Read chunk.
+
+ The Read list is also a list of RDMA read segments. In XDR form,
+ this would look like a Read chunk, except that the P values could
+ vary across the list. An empty Read list is encoded as a single
+ 32-bit zero.
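+
+ The Read list layout described above can be reproduced with a few
+ lines of Python. This sketch is illustrative only; the function name
+ and the (position, handle, length, offset) tuple order are
+ assumptions chosen to mirror the PHLOO notation.
+
```python
import struct


def encode_read_list(segments):
    """Encode RDMA read segments as an XDR optional-data list."""
    out = b""
    for position, handle, length, offset in segments:
        # A one (item follows), then P, H, L as 32-bit words and OO as 64 bits.
        out += struct.pack(">IIIIQ", 1, position, handle, length, offset)
    return out + struct.pack(">I", 0)  # a zero terminates the list
```
+
+ Each "1 PHLOO" group occupies 24 bytes, so a list of three segments
+ encodes to 76 bytes including the terminating zero.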
+
+ One Write chunk is a counted array of plain segments. In XDR form,
+ the count would appear as the first 32-bit word, followed by an HLOO
+ for each element of the array. For instance, a Write chunk with
+ three elements would look like
+
+ 3 HLOO HLOO HLOO
+
+ The Write list is a list of counted arrays. In XDR form, this is a
+ combination of optional-data and counted arrays. To represent a
+ Write list containing a Write chunk with three segments and a Write
+ chunk with two segments, XDR would encode
+
+ 1 3 HLOO HLOO HLOO 1 2 HLOO HLOO 0
+
+ An empty Write list is encoded as a single 32-bit zero.
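+
+ The Write list encoding can be sketched the same way. The Python
+ fragment below is a non-normative illustration; the function names
+ and the (handle, length, offset) tuple layout are assumptions.
+
```python
import struct


def encode_write_chunk(chunk):
    """Encode one Write chunk: a segment count followed by HLOO entries."""
    out = struct.pack(">I", len(chunk))
    for handle, length, offset in chunk:
        out += struct.pack(">IIQ", handle, length, offset)
    return out


def encode_write_list(chunks):
    """Encode a Write list: each chunk is preceded by a one, then a zero."""
    out = b""
    for chunk in chunks:
        out += struct.pack(">I", 1) + encode_write_chunk(chunk)
    return out + struct.pack(">I", 0)
```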
+
+ The Reply chunk is a Write chunk. However, since it is an optional-
+ data field, there is a 32-bit field in front of it that contains a
+ one if the Reply chunk is present or a zero if it is not. After
+ encoding, a Reply chunk with two segments would look like
+
+ 1 2 HLOO HLOO
+
+ Frequently, a Requester does not provide any chunks. In that case,
+ after the four fixed fields in the RPC-over-RDMA header, there are
+ simply three 32-bit fields that contain zero.
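+
+ As a closing illustration, the Python sketch below encodes a Reply
+ chunk and a complete header that carries no chunks. It is not
+ normative: the function names are hypothetical, and the value zero
+ for RDMA_MSG is assumed to match the XDR definition given elsewhere
+ in this document.
+
```python
import struct

RDMA_MSG = 0  # assumed rdma_proc value for RDMA_MSG


def encode_reply_chunk(chunk):
    """Encode the optional Reply chunk; None means not present."""
    if chunk is None:
        return struct.pack(">I", 0)
    out = struct.pack(">II", 1, len(chunk))
    for handle, length, offset in chunk:
        out += struct.pack(">IIQ", handle, length, offset)
    return out


def encode_header_no_chunks(xid, credits, vers=1, proc=RDMA_MSG):
    """Four fixed fields, then empty Read list, Write list, Reply chunk."""
    fixed = struct.pack(">IIII", xid, vers, credits, proc)
    return fixed + struct.pack(">III", 0, 0, 0)
```
+
+ The header produced by encode_header_no_chunks() is 28 bytes long,
+ which matches the minimal header size cited in Section 4.5.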
+
+5. RPC Bind Parameters
+
+ In setting up a new RDMA connection, the first action by a Requester
+ is to obtain a transport address for the Responder. The means used
+ to obtain this address, and to open an RDMA connection, is dependent
+ on the type of RDMA transport and is the responsibility of each RPC
+ protocol binding and its local implementation.
+
+ RPC services normally register with a portmap or rpcbind service
+ [RFC1833], which associates an RPC Program number with a service
+ address. This policy is no different with RDMA transports. However,
+ a different and distinct service address (port number) might
+ sometimes be required for ULP operation with RPC-over-RDMA.
+
+ When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses
+ IP port addressing due to its layering on TCP and/or SCTP, port
+ mapping is trivial and consists merely of issuing the port in the
+ connection process. The NFS/RDMA protocol service address has been
+ assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP
+ [RFC5667].
+
+ When mapped atop InfiniBand [IBARCH], which uses a service endpoint
+ naming scheme based on a Global Identifier (GID), a translation MUST
+ be employed. One such translation is described in Annexes A3
+ (Application Specific Identifiers), A4 (Sockets Direct Protocol
+ (SDP)), and A11 (RDMA IP CM Service) of [IBARCH], which is
+ appropriate for translating IP port addressing to the InfiniBand
+ network. Therefore, in this case, IP port addressing may be readily
+ employed by the upper layer.
+
+ When a mapping standard or convention exists for IP ports on an RDMA
+ interconnect, there are several possibilities for each upper layer to
+ consider:
+
+ o One possibility is to have the Responder register its mapped IP
+ port with the rpcbind service under the netid (or netids) defined
+ here. An RPC-over-RDMA-aware Requester can then resolve its
+ desired service to a mappable port and proceed to connect. This
+ is the most flexible and compatible approach, for those upper
+ layers that are defined to use the rpcbind service.
+
+ o A second possibility is to have the Responder's portmapper
+ register itself on the RDMA interconnect at a "well-known" service
+ address (on UDP or TCP, this corresponds to port 111). A
+ Requester could connect to this service address and use the
+ portmap protocol to obtain a service address in response to a
+ program number, e.g., an iWARP port number or an InfiniBand GID.
+
+ o Alternatively, the Requester could simply connect to the mapped
+ well-known port for the service itself, if it is appropriately
+ defined. By convention, the NFS/RDMA service, when operating atop
+ such an InfiniBand fabric, uses the same 20049 assignment as for
+ iWARP.
+
+ Historically, different RPC protocols have taken different approaches
+ to their port assignment. Therefore, the specific method is left to
+ each RPC-over-RDMA-enabled ULB and is not addressed in this document.
+
+ In Section 9, this specification defines two new netid values, to be
+ used for registration of upper layers atop iWARP [RFC5040] [RFC5041]
+ and (when a suitable port translation service is available)
+ InfiniBand [IBARCH]. Additional RDMA-capable networks MAY define
+ their own netids, or if they provide a port translation, they MAY
+ share the netids defined in this document.
+
+6. ULB Specifications
+
+ A ULP is typically defined independently of any particular RPC
+ transport. An Upper-Layer Binding (ULB) specification provides
+ guidance that helps the ULP interoperate correctly and efficiently
+ over a particular transport. For RPC-over-RDMA version 1, a ULB may
+ provide:
+
+ o A taxonomy of XDR data items that are eligible for DDP
+
+ o Constraints on which upper-layer procedures may be reduced and on
+ how many chunks may appear in a single RPC request
+
+ o A method for determining the maximum size of the reply Payload
+ stream for all procedures in the ULP
+
+ o An rpcbind port assignment for operation of the RPC Program and
+ Version on an RPC-over-RDMA transport
+
+ Each RPC Program and Version tuple that utilizes RPC-over-RDMA
+ version 1 needs to have a ULB specification.
+
+6.1. DDP-Eligibility
+
+ A ULB designates some XDR data items as eligible for DDP. As an
+ RPC-over-RDMA message is formed, DDP-eligible data items can be
+ removed from the Payload stream and placed directly in the receiver's
+ memory.
+
+ An XDR data item should be considered for DDP-eligibility if there is
+ a clear benefit to moving the contents of the item directly from the
+ sender's memory to the receiver's memory. Criteria for DDP-
+ eligibility include:
+
+ o The XDR data item is frequently sent or received, and its size is
+ often much larger than typical inline thresholds.
+
+ o If the XDR data item is a result, its maximum size must be
+ predictable in advance by the Requester.
+
+ o Transport-level processing of the XDR data item is not needed.
+ For example, the data item is an opaque byte array, which requires
+ no XDR encoding and decoding of its content.
+
+ o The content of the XDR data item is sensitive to address
+ alignment. For example, a data copy operation would be required
+ on the receiver to enable the message to be parsed correctly, or
+ to enable the data item to be accessed.
+
+ o The XDR data item does not contain DDP-eligible data items.
+
+ In addition to defining the set of data items that are DDP-eligible,
+ a ULB may also limit the use of chunks to particular upper-layer
+ procedures. If more than one data item in a procedure is DDP-
+ eligible, the ULB may also limit the number of chunks that a
+ Requester can provide for a particular upper-layer procedure.
+
+ Senders MUST NOT reduce data items that are not DDP-eligible. Such
+ data items MAY, however, be moved as part of a Position Zero Read
+ chunk or a Reply chunk.
+
+ The programming interface by which an upper-layer implementation
+ indicates the DDP-eligibility of a data item to the RPC transport is
+ not described by this specification. The only requirements are that
+ the receiver can re-assemble the transmitted RPC-over-RDMA message
+ into a valid XDR stream, and that DDP-eligibility rules specified by
+ the ULB are respected.
+
+ There is no provision to express DDP-eligibility within the XDR
+ language. The only definitive specification of DDP-eligibility is a
+ ULB.
+
+ In general, a DDP-eligibility violation occurs when:
+
+ o A Requester reduces a non-DDP-eligible argument data item. The
+ Responder MUST NOT process this RPC Call message and MUST report
+ the violation as described in Section 4.5.2.
+
+ o A Responder reduces a non-DDP-eligible result data item. The
+ Requester MUST terminate the pending RPC transaction and report an
+ appropriate permanent error to the RPC consumer.
+
+ o A Responder does not reduce a DDP-eligible result data item into
+ an available Write chunk. The Requester MUST terminate the
+ pending RPC transaction and report an appropriate permanent error
+ to the RPC consumer.
+
+6.2. Maximum Reply Size
+
+ A Requester provides resources for both an RPC Call message and its
+ matching RPC Reply message. A Requester forms the RPC Call message
+ itself; thus, the Requester can compute the exact resources needed.
+
+ A Requester must allocate resources for the RPC Reply message (an
+ RPC-over-RDMA credit, a Receive buffer, and possibly a Write list and
+ Reply chunk) before the Responder has formed the actual reply. To
+ accommodate all possible replies for the procedure in the RPC Call
+ message, a Requester must allocate reply resources based on the
+ maximum possible size of the expected RPC Reply message.
+
+ If there are procedures in the ULP for which there is no clear reply
+ size maximum, the ULB needs to specify a dependable means for
+ determining the maximum.
+
+6.3. Additional Considerations
+
+ There may be other details provided in a ULB.
+
+ o A ULB may recommend inline threshold values or other transport-
+ related parameters for RPC-over-RDMA version 1 connections bearing
+ that ULP.
+
+ o A ULP may provide a means to communicate these transport-related
+ parameters between peers. Note that RPC-over-RDMA version 1 does
+ not specify any mechanism for changing any transport-related
+ parameter after a connection has been established.
+
+ o Multiple ULPs may share a single RPC-over-RDMA version 1
+ connection when their ULBs allow the use of RPC-over-RDMA version
+ 1 and the rpcbind port assignments for the Protocols allow
+ connection sharing. In this case, the same transport parameters
+ (such as inline threshold) apply to all Protocols using that
+ connection.
+
+ Each ULB needs to be designed to allow correct interoperation without
+ regard to the transport parameters actually in use. Furthermore,
+ implementations of ULPs must be designed to interoperate correctly
+ regardless of the connection parameters in effect on a connection.
+
+6.4. ULP Extensions
+
+ An RPC Program and Version tuple may be extensible. For instance,
+ there may be a minor versioning scheme that is not reflected in the
+ RPC version number, or the ULP may allow additional features to be
+ specified after the original RPC Program specification was ratified.
+
+ ULBs are provided for interoperable RPC Programs and Versions by
+ extending existing ULBs to reflect the changes made necessary by each
+ addition to the existing XDR.
+
+7. Protocol Extensibility
+
+ The RPC-over-RDMA header format is specified using XDR, unlike the
+ message header used with RPC-over-TCP. To maintain a high degree of
+ interoperability among implementations of RPC-over-RDMA, any change
+ to this XDR requires a protocol version number change. New versions
+ of RPC-over-RDMA may be published as separate protocol specifications
+ without updating this document.
+
+ The first four fields in every RPC-over-RDMA header must remain
+ aligned at the same fixed offsets for all versions of the RPC-over-
+ RDMA protocol. The version number must be in a fixed place to enable
+ implementations to detect protocol version mismatches.
+
+ For version mismatches to be reported in a fashion that all future
+ version implementations can reliably decode, the rdma_proc field must
+ remain in a fixed place, the value of ERR_VERS must always remain the
+ same, and the field placement in struct rpc_rdma_errvers must always
+ remain the same.
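+
+ Because the first four fields stay at fixed offsets, a receiver can
+ inspect them before it knows which protocol version it is handling.
+ The following Python sketch is illustrative only; the function name
+ is hypothetical, and the name rdma_credit for the third field is
+ taken from the XDR definition given elsewhere in this document.
+
```python
import struct


def peek_fixed_fields(message):
    """Read the four fixed 32-bit header fields without further parsing."""
    if len(message) < 16:
        raise ValueError("message shorter than the four fixed fields")
    xid, vers, credit, proc = struct.unpack_from(">IIII", message, 0)
    return {"rdma_xid": xid, "rdma_vers": vers,
            "rdma_credit": credit, "rdma_proc": proc}
```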
+
+7.1. Conventional Extensions
+
+ Introducing new capabilities to RPC-over-RDMA version 1 is limited to
+ the adoption of conventions that make use of existing XDR (defined in
+ this document) and allowed abstract RDMA operations. Because no
+ mechanism for detecting optional features exists in RPC-over-RDMA
+ version 1, implementations must rely on ULPs to communicate the
+ existence of such extensions.
+
+ Such extensions must be specified in a Standards Track RFC with
+ appropriate review by the NFSv4 Working Group and the IESG. An
+ example of a conventional extension to RPC-over-RDMA version 1 is the
+ specification of backward direction message support to enable NFSv4.1
+ callback operations, described in [RFC8167].
+
+8. Security Considerations
+
+8.1. Memory Protection
+
+ A primary consideration is the protection of the integrity and
+ confidentiality of local memory by an RPC-over-RDMA transport. The
+ use of an RPC-over-RDMA transport protocol MUST NOT introduce
+ vulnerabilities to system memory contents nor to memory owned by user
+ processes.
+
+ It is REQUIRED that any RDMA provider used for RPC transport be
+ conformant to the requirements of [RFC5042] in order to satisfy these
+ protections. These protections are provided by the RDMA layer
+ specifications, and in particular, their security models.
+
+8.1.1. Protection Domains
+
+ The use of Protection Domains to limit the exposure of memory regions
+ to a single connection is critical. Any attempt by an endpoint not
+ participating in that connection to reuse memory handles needs to
+ result in immediate failure of that connection. Because ULP security
+ mechanisms rely on this aspect of Reliable Connection behavior,
+ strong authentication of remote endpoints is recommended.
+
+8.1.2. Handle Predictability
+
+ Unpredictable memory handles should be used for any operation
+ requiring advertised memory regions. Advertising a continuously
+ registered memory region allows a remote host to read or write to
+ that region even when an RPC involving that memory is not under way.
+ Therefore, implementations should avoid advertising persistently
+ registered memory.
+
+8.1.3. Memory Protection
+
+ Requesters should register memory regions for remote access only when
+ they are about to be the target of an RPC operation that involves an
+ RDMA Read or Write.
+
+ Registered memory regions should be invalidated as soon as related
+ RPC operations are complete. Invalidation and DMA unmapping of
+ memory regions should be complete before message integrity checking
+ is done and before the RPC consumer is allowed to continue execution
+ and use or alter the contents of a memory region.
+
+ An RPC transaction on a Requester might be terminated before a reply
+ arrives if the RPC consumer exits unexpectedly (for example, it is
+ signaled or a segmentation fault occurs). When an RPC terminates
+ abnormally, memory regions associated with that RPC should be
+ invalidated appropriately before the regions are released to be
+ reused for other purposes on the Requester.
+
+8.1.4. Denial of Service
+
+ A detailed discussion of denial-of-service exposures that can result
+ from the use of an RDMA transport is found in Section 6.4 of
+ [RFC5042].
+
+ A Responder is not obliged to pull Read chunks that are unreasonably
+ large. The Responder can use an RDMA_ERROR response to terminate
+ RPCs with unreadable Read chunks. If a Responder transmits more data
+ than a Requester is prepared to receive in a Write or Reply chunk,
+ the RDMA Network Interface Cards (RNICs) typically terminate the
+ connection. For further discussion, see Section 4.5. Such repeated
+ chunk errors can deny service to other users sharing the connection
+ from the errant Requester.
+
+ An RPC-over-RDMA transport implementation is not responsible for
+ throttling the RPC request rate, other than to keep the number of
+ concurrent RPC transactions at or under the number of credits granted
+ per connection. This is explained in Section 3.3.1. A sender can
+ trigger a self denial of service by exceeding the credit grant
+ repeatedly.
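+
+ One way a sender might stay within its credit grant is sketched
+ below in Python. This is a non-normative illustration: the class and
+ method names are hypothetical, and the sketch assumes that each reply
+ carries an updated credit grant, as described in Section 3.3.1.
+
```python
class CreditGate:
    """Track in-flight RPCs against the credits granted on a connection."""

    def __init__(self, granted):
        self.granted = granted  # credits currently granted by the peer
        self.in_flight = 0      # RPCs sent but not yet answered

    def try_send(self):
        """Reserve a credit; return False if the grant is exhausted."""
        if self.in_flight >= self.granted:
            return False
        self.in_flight += 1
        return True

    def on_reply(self, new_grant):
        """Release a credit and adopt the grant carried in the reply."""
        self.in_flight -= 1
        self.granted = new_grant
```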
+
+ When an RPC has been canceled due to a signal or premature exit of an
+ application process, a Requester may invalidate the RPC's Write and
+ Reply chunks. Invalidation prevents the subsequent arrival of the
+ Responder's reply from altering the memory regions associated with
+ those chunks after the memory has been reused.
+
+ On the Requester, a malfunctioning application or a malicious user
+ can create a situation where RPCs are continuously initiated and then
+ aborted, resulting in Responder replies that terminate the underlying
+ RPC-over-RDMA connection repeatedly. Such situations can deny
+ service to other users sharing the connection from that Requester.
+
+8.2. RPC Message Security
+
+ ONC RPC provides cryptographic security via the RPCSEC_GSS framework
+ [RFC7861]. RPCSEC_GSS implements message authentication
+ (rpc_gss_svc_none), per-message integrity checking
+ (rpc_gss_svc_integrity), and per-message confidentiality
+ (rpc_gss_svc_privacy) in the layer above RPC-over-RDMA. The latter
+ two services require significant computation and movement of data on
+ each endpoint host. Some performance benefits enabled by RDMA
+ transports can be lost.
+
+8.2.1. RPC-over-RDMA Protection at Lower Layers
+
+ For any RPC transport, utilizing RPCSEC_GSS integrity or privacy
+ services has performance implications. Protection below the RPC
+ transport is often more appropriate in performance-sensitive
+ deployments, especially if it, too, can be offloaded. Certain
+ configurations of IPsec can be co-located in RDMA hardware, for
+ example, without change to RDMA consumers and little loss of data
+ movement efficiency. Such arrangements can also provide a higher
+ degree of privacy by hiding endpoint identity or altering the
+ frequency at which messages are exchanged, at a performance cost.
+
+ The use of protection in a lower layer MAY be negotiated through the
+ use of an RPCSEC_GSS security flavor defined in [RFC7861] in
+ conjunction with the Channel Binding mechanism [RFC5056] and IPsec
+ Channel Connection Latching [RFC5660]. Use of such mechanisms is
+ REQUIRED where integrity or confidentiality is desired and where
+ efficiency is required.
+
+8.2.2. RPCSEC_GSS on RPC-over-RDMA Transports
+
+ Not all RDMA devices and fabrics support the above protection
+ mechanisms. Also, per-message authentication is still required on
+ NFS clients where multiple users access NFS files. In these cases,
+ RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA
+ connections.
+
+ RPCSEC_GSS extends the ONC RPC protocol [RFC5531] without changing
+ the format of RPC messages. By observing the conventions described
+ in this section, an RPC-over-RDMA transport can convey RPCSEC_GSS-
+ protected RPC messages interoperably.
+
+ As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that
+ appear in the Payload stream of an RPC-over-RDMA message (such as
+ control messages exchanged as part of establishing or destroying a
+ security context or data items that are part of RPCSEC_GSS
+ authentication material) MUST NOT be reduced.
+
+8.2.2.1. RPCSEC_GSS Context Negotiation
+
+ Some NFS client implementations use a separate connection to
+ establish a Generic Security Service (GSS) context for NFS operation.
+ These clients use TCP and the standard NFS port (2049) for context
+ establishment. To enable the use of RPCSEC_GSS with NFS/RDMA, an NFS
+ server MUST also provide a TCP-based NFS service on port 2049.
+
+8.2.2.2. RPC-over-RDMA with RPCSEC_GSS Authentication
+
+ The RPCSEC_GSS authentication service has no impact on the DDP-
+ eligibility of data items in a ULP.
+
+ However, RPCSEC_GSS authentication material appearing in an RPC
+ message header can be larger than, say, an AUTH_SYS authenticator.
+ In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester
+ needs to accommodate a larger RPC credential when marshaling RPC Call
+ messages and needs to provide for a maximum size RPCSEC_GSS verifier
+ when allocating reply buffers and Reply chunks.
+
+ RPC messages, and thus Payload streams, are made larger as a result.
+ ULP operations that fit in a Short Message when a simpler form of
+ authentication is in use might need to be reduced, or conveyed via a
+ Long Message, when RPCSEC_GSS authentication is in use. It is more
+ likely that a Requester provides both a Read list and a Reply chunk
+ in the same RPC-over-RDMA header to convey a Long Call and provision
+ a receptacle for a Long Reply. More frequent use of Long Messages
+ can impact transport efficiency.
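+
+ The effect of larger authentication material on message form can be
+ sketched as follows. All sizes are assumptions made for illustration:
+ a 1024-octet inline threshold, a 28-octet Transport header with empty
+ chunk lists, 24 octets of fixed RPC Call fields, and the two
+ authenticator sizes used in the usage note. None of these values is
+ taken from this section.

```python
# Assumed sizes, in octets (illustrative only).
RPCRDMA_MIN_HEADER = 28          # xid, vers, credit, proc, three empty lists
RPC_CALL_FIXED = 24              # xid, msg type, rpcvers, prog, vers, proc
ASSUMED_INLINE_THRESHOLD = 1024  # hypothetical connection setting


def call_message_len(auth_len: int, args_len: int) -> int:
    """Size of a Call sent inline: Transport header plus Payload stream."""
    return RPCRDMA_MIN_HEADER + RPC_CALL_FIXED + auth_len + args_len


def fits_short_message(auth_len: int, args_len: int,
                       inline_threshold: int = ASSUMED_INLINE_THRESHOLD) -> bool:
    """True if the whole Call fits in a single inline message."""
    return call_message_len(auth_len, args_len) <= inline_threshold
```

+ With 900 octets of arguments, a hypothetical 60-octet credential and
+ verifier yields a 1012-octet message that fits under the assumed
+ threshold, while 200 octets of RPCSEC_GSS material yields 1152 octets
+ and would call for reduction or a Long Message.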
+
+8.2.2.3. RPC-over-RDMA with RPCSEC_GSS Integrity or Privacy
+
+ The RPCSEC_GSS integrity service enables endpoints to detect
+ modification of RPC messages in flight. The RPCSEC_GSS privacy
+ service prevents all but the intended recipient from viewing the
+ cleartext content of RPC arguments and results. RPCSEC_GSS integrity
+ and privacy services are end-to-end. They protect RPC arguments and
+ results from application to server endpoint, and back.
+
+ The RPCSEC_GSS integrity and encryption services operate on whole RPC
+ messages after they have been XDR encoded for transmit, and before
+ they have been XDR decoded after receipt. Both sender and receiver
+ endpoints use intermediate buffers to prevent exposure of encrypted
+ data or unverified cleartext data to RPC consumers. After
+ verification, encryption, and message wrapping have been performed,
+ the transport layer MAY use RDMA data transfer between these
+ intermediate buffers.
+
+ The process of reducing a DDP-eligible data item removes the data
+ item and its XDR padding from the encoded XDR stream. XDR padding of
+ a reduced data item is not transferred in an RPC-over-RDMA message.
+ After reduction, the Payload stream contains fewer octets than the
+ whole XDR stream did beforehand. XDR padding octets are often zero
+ bytes, but they do not have to be. Thus, reducing DDP-eligible items
+ affects the result of message integrity verification or encryption.
+
+ Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS
+ integrity or encryption services are in use. Effectively, no data
+ item is DDP-eligible in this situation, and Chunked Messages cannot
+ be used. In this mode, an RPC-over-RDMA transport operates in the
+ same manner as a transport that does not support DDP.
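+
+ The reasoning above can be illustrated with a small sketch. It is not
+ a definitive implementation: SHA-256 stands in for whatever checksum
+ the GSS mechanism actually computes, the 4-octet "HEAD" and "TAIL"
+ strings stand in for the rest of the XDR stream, and all function
+ names are hypothetical.

```python
import hashlib
from enum import Enum
from typing import Optional


class GssService(Enum):
    NONE = "rpc_gss_svc_none"
    INTEGRITY = "rpc_gss_svc_integrity"
    PRIVACY = "rpc_gss_svc_privacy"


def xdr_pad_len(length: int) -> int:
    """Octets of padding needed to reach the next 4-octet XDR boundary."""
    return (4 - length % 4) % 4


def whole_stream(head: bytes, item: bytes, tail: bytes, pad: int = 0) -> bytes:
    """XDR stream with the data item and its padding left inline."""
    return head + item + bytes([pad]) * xdr_pad_len(len(item)) + tail


def reduced_stream(head: bytes, tail: bytes) -> bytes:
    """Payload stream after the item and its padding are removed."""
    return head + tail


def may_reduce(service: Optional[GssService], ddp_eligible: bool) -> bool:
    """No item may be reduced under the integrity or privacy service."""
    if service in (GssService.INTEGRITY, GssService.PRIVACY):
        return False
    return ddp_eligible


item = b"x" * 13  # 13 octets of data need 3 octets of padding
zero_padded = whole_stream(b"HEAD", item, b"TAIL")
odd_padded = whole_stream(b"HEAD", item, b"TAIL", pad=0xFF)
payload = reduced_stream(b"HEAD", b"TAIL")

# A checksum over the whole stream depends on the padding octets,
# which the reduced Payload stream no longer carries.
digest_zero = hashlib.sha256(zero_padded).hexdigest()
digest_odd = hashlib.sha256(odd_padded).hexdigest()
```

+ Here the two whole streams differ only in their padding octets, yet
+ their digests differ, and the reduced stream is 16 octets shorter than
+ either. A receiver that sees only the reduced stream cannot recompute
+ the sender's checksum, which is why may_reduce() returns False for the
+ integrity and privacy services.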
+
+ When an RPCSEC_GSS integrity or privacy service is in use, a
+ Requester provides both a Read list and a Reply chunk in the same
+ RPC-over-RDMA header to convey a Long Call and provision a receptacle
+ for a Long Reply.
+
+8.2.2.4. Protecting RPC-over-RDMA Transport Headers
+
+ Like the base fields in an ONC RPC message (XID, call direction, and
+ so on), the contents of an RPC-over-RDMA message's Transport stream
+ are not protected by RPCSEC_GSS. This exposes XIDs, connection
+ credit limits, and chunk lists (but not the content of the data items
+ they refer to) to malicious behavior, which could redirect data that
+ is transferred by the RPC-over-RDMA message, result in spurious
+ retransmits, or trigger connection loss.
+
+ In particular, if an attacker alters the information contained in the
+ chunk lists of an RPC-over-RDMA header, data contained in those
+ chunks can be redirected to other registered memory regions on
+ Requesters. An attacker might alter the arguments of RDMA Read and
+ RDMA Write operations on the wire to similar effect. If such
+ alterations occur, the use of RPCSEC_GSS integrity or privacy
+ services enables a Requester to detect unexpected material in a
+ received RPC message.
+
+ Encryption at lower layers, as described in Section 8.2.1, protects
+ the content of the Transport stream. To address attacks on RDMA
+ protocols themselves, RDMA transport implementations should conform
+ to [RFC5042].
+
+9. IANA Considerations
+
+ A set of RPC netids for resolving RPC-over-RDMA services is specified
+ by this document. This is unchanged from [RFC5666].
+
+ The RPC-over-RDMA transport has been assigned an RPC netid, which is
+ an rpcbind [RFC1833] string used to describe the underlying protocol
+ in order for RPC to select the appropriate transport framing, as well
+ as the format of the service addresses and ports.
+
+ The following netid registry strings are defined for this purpose:
+
+ NC_RDMA "rdma"
+ NC_RDMA6 "rdma6"
+
+ The "rdma" netid is to be used when IPv4 addressing is employed by
+ the underlying transport, and "rdma6" for IPv6 addressing. The netid
+ assignment policy and registry are defined in [RFC5665].
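+
+ As a minimal sketch of the selection rule just stated, the netid can
+ be derived from the address family of the service address. The
+ function name is hypothetical; the sketch uses only the Python
+ standard library.

```python
import ipaddress


def rdma_netid(address: str) -> str:
    """Return "rdma" for an IPv4 address and "rdma6" for an IPv6 address."""
    ip = ipaddress.ip_address(address)  # raises ValueError if malformed
    return "rdma" if ip.version == 4 else "rdma6"
```

+ For example, rdma_netid("192.0.2.1") selects "rdma", and
+ rdma_netid("2001:db8::1") selects "rdma6".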
+
+ These netids MAY be used for any RDMA network that satisfies the
+ requirements of Section 2.3.2 and that is able to identify service
+ endpoints using IP port addressing, possibly through use of a
+ translation service as described in Section 5.
+
+ The use of the RPC-over-RDMA protocol has no effect on RPC Program
+ numbers or existing registered port numbers. However, new port
+ numbers MAY be registered for use by RPC-over-RDMA-enabled services,
+ as appropriate to the new networks over which the services will
+ operate.
+
+ For example, the NFS/RDMA service defined in [RFC5667] has been
+ assigned the port 20049 in the "Service Name and Transport Protocol
+ Port Number Registry". This is distinct from the port number defined
+ for NFS on TCP, which is assigned the port 2049 in the same registry.
+ NFS clients use the same RPC Program number for NFS (100003) when
+ using either transport [RFC5531] (see the "Remote Procedure Call
+ (RPC) Program Numbers" registry).
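+
+ The port assignments above can be tied to rpcbind service addresses
+ with a short sketch. It assumes the IPv4 universal address convention
+ of [RFC5665], in which the port is appended to the host address as two
+ decimal octets (high octet first). The constant and function names
+ are hypothetical, and only the IPv4 case is shown.

```python
import ipaddress

NFS_PROGRAM = 100003    # same RPC Program number on either transport
NFS_TCP_PORT = 2049     # NFS on TCP
NFS_RDMA_PORT = 20049   # NFS/RDMA


def universal_address_v4(host: str, port: int) -> str:
    """Format an IPv4 host and port as "h1.h2.h3.h4.p1.p2"."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    ip = ipaddress.IPv4Address(host)  # raises ValueError if malformed
    return f"{ip}.{port >> 8}.{port & 0xFF}"
```

+ Under that assumption, an NFS/RDMA service at 192.0.2.1 is advertised
+ as "192.0.2.1.78.81" with the "rdma" netid, while the TCP service on
+ the same host is "192.0.2.1.8.1".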
+
+10. References
+
+10.1. Normative References
+
+ [RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2",
+ RFC 1833, DOI 10.17487/RFC1833, August 1995,
+ <http://www.rfc-editor.org/info/rfc1833>.
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119,
+ DOI 10.17487/RFC2119, March 1997,
+ <http://www.rfc-editor.org/info/rfc2119>.
+
+ [RFC4506] Eisler, M., Ed., "XDR: External Data Representation
+ Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
+ 2006, <http://www.rfc-editor.org/info/rfc4506>.
+
+ [RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
+ Protocol (DDP) / Remote Direct Memory Access Protocol
+ (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
+ 2007, <http://www.rfc-editor.org/info/rfc5042>.
+
+ [RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
+ Channels", RFC 5056, DOI 10.17487/RFC5056, November 2007,
+ <http://www.rfc-editor.org/info/rfc5056>.
+
+ [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
+ Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
+ May 2009, <http://www.rfc-editor.org/info/rfc5531>.
+
+ [RFC5660] Williams, N., "IPsec Channels: Connection Latching",
+ RFC 5660, DOI 10.17487/RFC5660, October 2009,
+ <http://www.rfc-editor.org/info/rfc5660>.
+
+ [RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call
+ (RPC) Network Identifiers and Universal Address Formats",
+ RFC 5665, DOI 10.17487/RFC5665, January 2010,
+ <http://www.rfc-editor.org/info/rfc5665>.
+
+ [RFC7861] Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
+ Security Version 3", RFC 7861, DOI 10.17487/RFC7861,
+ November 2016, <http://www.rfc-editor.org/info/rfc7861>.
+
+ [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
+ 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
+ May 2017, <http://www.rfc-editor.org/info/rfc8174>.
+
+10.2. Informative References
+
+ [IBARCH] InfiniBand Trade Association, "InfiniBand Architecture
+ Specification Volume 1", Release 1.3, March 2015,
+ <http://www.infinibandta.org/content/
+ pages.php?pg=technology_download>.
+
+ [RFC768] Postel, J., "User Datagram Protocol", STD 6, RFC 768,
+ DOI 10.17487/RFC0768, August 1980,
+ <http://www.rfc-editor.org/info/rfc768>.
+
+ [RFC793] Postel, J., "Transmission Control Protocol", STD 7,
+ RFC 793, DOI 10.17487/RFC0793, September 1981,
+ <http://www.rfc-editor.org/info/rfc793>.
+
+ [RFC1094] Nowicki, B., "NFS: Network File System Protocol
+ specification", RFC 1094, DOI 10.17487/RFC1094, March
+ 1989, <http://www.rfc-editor.org/info/rfc1094>.
+
+ [RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
+ Version 3 Protocol Specification", RFC 1813,
+ DOI 10.17487/RFC1813, June 1995,
+ <http://www.rfc-editor.org/info/rfc1813>.
+
+ [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
+ Garcia, "A Remote Direct Memory Access Protocol
+ Specification", RFC 5040, DOI 10.17487/RFC5040, October
+ 2007, <http://www.rfc-editor.org/info/rfc5040>.
+
+ [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct
+ Data Placement over Reliable Transports", RFC 5041,
+ DOI 10.17487/RFC5041, October 2007,
+ <http://www.rfc-editor.org/info/rfc5041>.
+
+ [RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS)
+ Remote Direct Memory Access (RDMA) Problem Statement",
+ RFC 5532, DOI 10.17487/RFC5532, May 2009,
+ <http://www.rfc-editor.org/info/rfc5532>.
+
+ [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+ "Network File System (NFS) Version 4 Minor Version 1
+ Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
+ <http://www.rfc-editor.org/info/rfc5661>.
+
+ [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+ "Network File System (NFS) Version 4 Minor Version 1
+ External Data Representation Standard (XDR) Description",
+ RFC 5662, DOI 10.17487/RFC5662, January 2010,
+ <http://www.rfc-editor.org/info/rfc5662>.
+
+ [RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
+ Transport for Remote Procedure Call", RFC 5666,
+ DOI 10.17487/RFC5666, January 2010,
+ <http://www.rfc-editor.org/info/rfc5666>.
+
+ [RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS)
+ Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667,
+ January 2010, <http://www.rfc-editor.org/info/rfc5667>.
+
+ [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System
+ (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
+ March 2015, <http://www.rfc-editor.org/info/rfc7530>.
+
+ [RFC8167] Lever, C., "Bidirectional Remote Procedure Call on RPC-
+ over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
+ June 2017, <http://www.rfc-editor.org/info/rfc8167>.
+
+Appendix A. Changes from RFC 5666
+
+A.1. Changes to the Specification
+
+ The following alterations have been made to the RPC-over-RDMA version
+ 1 specification. The section numbers below refer to [RFC5666].
+
+ o Section 2 has been expanded to introduce and explain key RPC
+ [RFC5531], XDR [RFC4506], and RDMA [RFC5040] terminology. These
+ terms are now used consistently throughout the specification.
+
+ o Section 3 has been reorganized and split into subsections to help
+ readers locate specific requirements and definitions.
+
+ o Sections 4 and 5 have been combined to improve the organization of
+ this information.
+
+ o The optional Connection Configuration Protocol has never been
+ implemented. The specification of CCP has been deleted from this
+ specification.
+
+ o A section consolidating requirements for ULBs has been added.
+
+ o An XDR extraction mechanism is provided, along with full
+ copyright, matching the approach used in [RFC5662].
+
+ o The "Security Considerations" section has been expanded to include
+ a discussion of how RPC-over-RDMA security depends on features of
+ the underlying RDMA transport.
+
+ o A subsection describing the use of RPCSEC_GSS [RFC7861] with RPC-
+ over-RDMA version 1 has been added.
+
+A.2. Changes to the Protocol
+
+ Although the protocol described herein interoperates with existing
+ implementations of [RFC5666], the following changes have been made
+ relative to the protocol described in that document:
+
+ o Support for the Read-Read transfer model has been removed. Read-
+ Read is a slower transfer model than Read-Write. As a result,
+ implementers have chosen not to support it. Removal of Read-Read
+ simplifies explanatory text, and the RDMA_DONE procedure is no
+ longer part of the protocol.
+
+ o The specification of RDMA_MSGP in [RFC5666] is not adequate,
+ although some incomplete implementations exist. Even if an
+ adequate specification were provided and an implementation were
+ produced, benefit for protocols such as NFSv4.0 [RFC7530] is
+ doubtful. Therefore, the RDMA_MSGP message type is no longer
+ supported.
+
+ o Technical issues with regard to handling RPC-over-RDMA header
+ errors have been corrected.
+
+ o Specific requirements related to implicit XDR roundup and complex
+ XDR data types have been added.
+
+ o Explicit guidance is provided related to sizing Write chunks,
+ managing multiple chunks in the Write list, and handling unused
+ Write chunks.
+
+ o Clear guidance about Send and Receive buffer sizes has been
+ introduced. This enables better decisions about when a Reply
+ chunk must be provided.
+
+Acknowledgments
+
+ The editor gratefully acknowledges the work of Brent Callaghan and
+ Tom Talpey on the original RPC-over-RDMA Version 1 specification
+ [RFC5666].
+
+ Dave Noveck provided excellent review, constructive suggestions, and
+ consistent navigational guidance throughout the process of drafting
+ this document. Dave also contributed much of the organization and
+ content of Section 7 and helped the authors understand the
+ complexities of XDR extensibility.
+
+ The comments and contributions of Karen Deitke, Dai Ngo, Chunli
+ Zhang, Dominique Martinet, and Mahesh Siddheshwar are accepted with
+ great thanks. The editor also wishes to thank Bill Baker, Greg
+ Marsden, and Matt Benjamin for their support of this work.
+
+ The extract.sh shell script and formatting conventions were first
+ described by the authors of the NFSv4.1 XDR specification [RFC5662].
+
+ Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
+ Working Group Chair and Document Shepherd Spencer Shepler, and NFSV4
+ Working Group Secretary Thomas Haynes for their support.
+
+Authors' Addresses
+
+ Charles Lever (editor)
+ Oracle Corporation
+ 1015 Granger Avenue
+ Ann Arbor, MI 48104
+ United States of America
+
+ Phone: +1 248 816 6463
+ Email: chuck.lever@oracle.com
+
+
+ William Allen Simpson
+ Red Hat
+ 1384 Fontaine
+ Madison Heights, MI 48071
+ United States of America
+
+ Email: william.allen.simpson@gmail.com
+
+
+ Tom Talpey
+ Microsoft Corp.
+ One Microsoft Way
+ Redmond, WA 98052
+ United States of America
+
+ Phone: +1 425 704-9945
+ Email: ttalpey@microsoft.com