From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc6581.txt | 1403 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1403 insertions(+) create mode 100644 doc/rfc/rfc6581.txt diff --git a/doc/rfc/rfc6581.txt b/doc/rfc/rfc6581.txt new file mode 100644 index 0000000..6a6ddd0 --- /dev/null +++ b/doc/rfc/rfc6581.txt @@ -0,0 +1,1403 @@ + + + + + + +Internet Engineering Task Force (IETF) A. Kanevsky, Ed. +Request for Comments: 6581 Dell Inc. +Updates: 5043, 5044 C. Bestler, Ed. +Category: Standards Track Nexenta Systems +ISSN: 2070-1721 R. Sharp + Intel + S. Wise + Open Grid Computing + April 2012 + + + Enhanced Remote Direct Memory Access (RDMA) + Connection Establishment + +Abstract + + This document updates RFC 5043 and RFC 5044 by extending Marker + Protocol Data Unit (PDU) Aligned Framing (MPA) negotiation for Remote + Direct Memory Access (RDMA) connection establishment. The first + enhancement extends RFC 5044, enabling peer-to-peer connection + establishment over MPA / Transmission Control Protocol (TCP). The + second enhancement extends both RFC 5043 and RFC 5044, by providing + an option for standardized exchange of RDMA-layer connection + configuration. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by + the Internet Engineering Steering Group (IESG). Further + information on Internet Standards is available in Section 2 of + RFC 5741. + + Information about the current status of this document, any + errata, and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc6581. + + + + + + + + + + + + +Kanevsky, et al.
Standards Track [Page 1] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +Copyright Notice + + Copyright (c) 2012 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction ....................................................3 + 1.1. Summary of Changes Affecting RFC 5044 ......................4 + 1.2. Summary of Changes Affecting RFC 5043 ......................4 + 2. Requirements Language ...........................................4 + 3. Definitions .....................................................4 + 4. Motivations .....................................................7 + 4.1. Standardization of RDMA Read Parameter Configuration .......7 + 4.2. Enabling MPA Mode ..........................................9 + 4.3. Lack of Explicit RTR in MPA Request/Reply Exchange ........10 + 4.4. Limitations on ULP Workaround .............................11 + 4.4.1. Transport Neutral APIs .............................11 + 4.4.2. Work/Completion Queue Accounting ...................11 + 4.4.3. Host-based Implementation of MPA Fencing ...........12 + 5. Enhanced MPA Connection Establishment ..........................13 + 6. Enhanced MPA Request/Reply Frames ..............................14 + 7. Enhanced SCTP Session Control Chunks ...........................15 + 8. MPA Error Reporting ............................................16 + 9. 
Enhanced RDMA Connection Establishment Data ....................17 + 9.1. IRD and ORD Negotiation ...................................18 + 9.2. Peer-to-Peer Connection Negotiation .......................20 + 9.3. Enhanced Connection Negotiation Flow ......................21 + 10. Interoperability ..............................................21 + 11. IANA Considerations ...........................................22 + 12. Security Considerations .......................................23 + 13. Acknowledgements ..............................................23 + 14. References ....................................................23 + 14.1. Normative References .....................................23 + 14.2. Informative References ...................................24 + + + + + + +Kanevsky, et al. Standards Track [Page 2] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +1. Introduction + + When used over the Transmission Control Protocol (TCP), the current + Remote Direct Data Placement (RDDP) [RFC5041] suite of protocols + relies on the MPA [RFC5044] protocol for both connection + establishment and for markers for TCP layering. + + A typical model for establishing an RDMA connection has the following + steps: + + o The passive side (responder) Upper Layer Protocol (ULP) listens + for connection requests. + + o The active side (initiator) ULP submits a connection request using + an RDMA endpoint, the desired destination, and the parameters to + be used for the connection. Those parameters include both RDMA- + layer characteristics, such as the number of simultaneous RDMA + Read Requests to be allowed, and application-specific data. + + o The passive side ULP receives a connection request that includes + the identity of the active side and the requested connection + characteristics. The passive side ULP uses this information to + decide whether to accept the connection, and if it is to be + accepted, how to create and/or configure the local RDMA endpoint. 
+ + o If accepting, the responder submits its acceptance of the + connection request, which in turn generates the accept message to + the initiator. This responder accept operation includes the RDMA + endpoint to be used and the connection characteristics (both the + RDMA configuration and any application-specific Private Data to be + transferred to the initiator). + + o The active side receives confirmation that the connection has been + accepted, what the configured connection characteristics are, and + any application-supplied Private Data. + + Currently, MPA only supports a client-server model for connection + establishment, forcing peer-to-peer applications to interact as + though they had a client-server relationship. In addition, + negotiation of some parameters specific to the Remote Direct Memory + Access Protocol (RDMAP) [RFC5040] is left to the ULP. + Providing an optional ULP-independent format for exchanging these + parameters would be of benefit to transport-neutral RDMA + applications. + + + + + + + +Kanevsky, et al. Standards Track [Page 3] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +1.1. Summary of Changes Affecting RFC 5044 + + This document enhances the MPA connection setup protocol [RFC5044]. + First, it adds exchange and negotiation of the parameters necessary + to support RDMA Read Requests. Second, it adds a message that serves + as a Ready to Receive (RTR) indication from the initiator to the + responder as the last message of connection establishment and adds + negotiation of which type of message to use for carrying the RTR + indication into MPA Request/Reply Frames. + + RTR indications are optional and are carried by existing RDMA message + types, specifically a zero-length FULPDU Send message, a zero-length + RDMA Read message, or a zero-length RDMA Write message. The presence + vs.
absence of the RTR indication and the type of RDMA message to use + are negotiated by control flags in Enhanced RDMA connection + establishment data specified by this document (see Section 9). RDMA + implementations are often tightly integrated with application + libraries and hardware, hence the flexibility to use more than one + type of RDMA message enables implementations to choose message types + that are less disruptive to the implementation structure. When an + RTR indication is used, and MPA connection setup negotiation + indicates support for multiple RDMA message types as RTR indications + by both the initiator and responder, the initiator selects one of the + supported RDMA message types as the RTR indication at the initiator's + sole discretion. + +1.2. Summary of Changes Affecting RFC 5043 + + This document enhances [RFC5043] by adding new Enhanced Session + Control Chunks that extend the currently defined Chunks with the + addition of Inbound RDMA Read Queue Depth (IRD) and Outbound RDMA + Read Queue Depth (ORD) negotiation. + +2. Requirements Language + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in [RFC2119]. + +3. Definitions + + Active Side: See Initiator. + + Consumer: The ULPs or applications that lie above MPA and Direct + Data Placement (DDP). The Consumer is responsible for making TCP + or Stream Control Transmission Protocol (SCTP) connections, + starting MPA and DDP connections, and generally controlling + operations. See [RFC5044] and [RFC5043]. + + + +Kanevsky, et al. Standards Track [Page 4] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + CRC: Cyclic Redundancy Check + + Completion Queue (CQ): A Consumer-accessible queue where the RDMA + device reports completions of Work Requests. 
A Consumer is able + to reap completions from a CQ without requiring per-transaction + support from the kernel or other privileged entity. See [RDMAC]. + + Completion Queue Entry (CQE): Transport- and device-specific + representation of a Work Completion. A CQ holds CQEs. See + [RDMAC]. + + FULPDU: Framed Upper Layer Protocol PDU. See FPDU of [RFC5044]. + + Inbound RDMA Read Request Queue (IRRQ): A queue that is associated + with an RDMA connection that tracks active incoming simultaneous + RDMA Read Request Messages. See [RDMAC]. + + Inbound RDMA Read Queue Depth (IRD): The maximum number of incoming + simultaneous RDMA Read Request Messages an RDMA connection can + handle. See [RDMAC]. + + Initiator: The endpoint of a connection that sends the MPA Request + Frame. The initiator is the active side of the connection + establishment. See [RFC5044]. + + IRD: See Inbound RDMA Read Queue Depth. + + MPA Fencing: MPA responder connection establishment logic that + ensures that no ULP messages will be transferred until the + initiator's first message has been received. + + MPA Request Frame: Data sent from the MPA initiator to the MPA + responder during the Startup Phase. See [RFC5044]. + + MPA Reply Frame: Data sent from the MPA responder to the MPA + initiator during the Startup Phase. See [RFC5044]. + + ORD: See Outbound RDMA Read Queue Depth. + + Outbound RDMA Read Queue Depth (ORD): The maximum number of + simultaneous RDMA Read Requests that can be issued for the RDMA + connection. This should be less than or equal to the peer's IRD. + See [RDMAC]. + + Passive Side: See Responder. + + Private Data: A block of data exchanged between MPA endpoints during + initial connection setup. See [RFC5044]. + + + +Kanevsky, et al. Standards Track [Page 5] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Queue Pair (QP): A Queue Pair is the set of Work Queues associated + exclusively with a single Endpoint (first defined in [VIA]). 
The + Send Queue (SQ), Receive Queue (RQ), and Inbound RDMA Read Queue + (IRQ) are considered to be part of the Queue Pair. The + potentially shared Completion Queue (CQ) and Shared Receive Queue + (SRQ) are not. See [RDMAC]. + + Remote Peer: The MPA protocol implementation on the opposite end of + the connection. Used to refer to the remote entity when + describing protocol exchanges or other interactions between two + nodes. See [RFC5044]. + + Responder: The connection endpoint that responds to an incoming MPA + connection request (the MPA Request Frame). The responder is the + passive side of the connection establishment. See [RFC5044]. + + Ready to Receive (RTR): RTR is an indication provided by the last + connection establishment message sent from the initiator to the + responder. An RTR indicates that the initiator is ready to + receive messages and that connection establishment is completed. + + Startup Phase: The initial exchanges of an MPA connection that + serve to more fully identify MPA endpoints to each other and pass + connection-specific setup information to each other. See + [RFC5044]. + + Shared Receive Queue (SRQ): A shared pool of Receive Work Requests + posted by the Consumer that can be allocated by multiple RDMA + endpoints (QP). See [RDMAC]. + + Tagged (DDP) Message: A DDP Message that targets a Tagged Buffer + that is explicitly advertised to the Remote Peer through exchange + of an STag (memory handle), offset in the memory region identified + by STag, and length [RFC5040]. + + Untagged (DDP) Message: A DDP Message that targets an Untagged + Buffer associated with a queue specified by the Queue Number (QN) + [RFC5040]. + + Work Queue: An element of a QP that allows user-space applications + to submit Work Requests directly to network hardware (first + defined in [VIA]).
Specific Work Queues include the Send Queue + (SQ) for transmit requests, Receive Queue (RQ) for receive + requests specific to a single endpoint, and Shared Receive Queues + (SRQs) for receive requests that can be allocated by one or more + endpoints. See [RDMAC]. + + + + + +Kanevsky, et al. Standards Track [Page 6] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Work Queue Element (WQE): Transport- and device-specific + representation of a Work Request. See [RDMAC]. + + Work Request: An elementary object used by Consumers to enqueue a + requested operation (WQEs) onto a Work Queue. See [RDMAC]. + +4. Motivations + + The goal of this document is two-fold. The first is to extend + support from the current client-server model for RDMA connection + setup to a peer-to-peer model. The second is to add negotiation of + the RDMA Read Queue size for both sides of an RDMA connection. + +4.1. Standardization of RDMA Read Parameter Configuration + + Most RDMA applications are developed using a transport-neutral + Application Programming Interface (API) to access RDMA services based + on a "Queue Pair" paradigm as originally defined by the Virtual + Interface Architecture [VIA], refined by the Direct Access + Programming Library [DAPL], and most commonly deployed with the + OpenFabrics API [OFA]. + + These transport-neutral APIs seek to provide a common set of RDMA + services whether the underlying transport is, for example, RDDP over + MPA, RDDP over SCTP, or InfiniBand. + + The common model for establishing an RDMA connection has the + following steps: + + o The passive side ULP listens for connection requests. + + o The active side ULP submits a connection request using an RDMA + endpoint ("Queue Pair"), the desired destination, and the + parameters to be used for the connection. 
Those parameters + include both RDMA-layer characteristics, such as the number of + simultaneous RDMA Read Requests to be allowed, and application- + specific data (typically referred to as "Private Data"). + + o The passive side ULP receives a connection request, which includes + the identity of the active side and the requested connection + characteristics. The passive side ULP uses this information to + decide whether to accept the connection, and if it is to be + accepted, how to create and/or configure the RDMA endpoint. + + + + + + + + +Kanevsky, et al. Standards Track [Page 7] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + o If accepting, the passive side ULP submits its acceptance of the + connection request. This local accept operation includes the RDMA + endpoint to be used and the connection characteristics (both the + RDMA configuration and any application-specific Private Data to be + returned). + + o The active side receives confirmation that the connection has been + accepted, what the configured connection characteristics are, and + any application-supplied Private Data. + + As currently defined, DDP connection establishment requires the ULP + to encode the RDMA configuration in the application-specific Private + Data. This results in undesirable duplication of logic to cover RDMA + characteristics of both InfiniBand and RDDP for each ULP, and to + specify for InfiniBand and RDDP the extraction of the RDMA + characteristics for each ULP. + + Both RDDP and InfiniBand support an initial Private Data exchange; + therefore, a standard definition of the RDMA characteristics within + the Private Data section would enable common connection establishment + APIs to format the RDMA characteristics based on the same API + information used when establishing either protocol to form the + connection. 
The application would then only have to indicate that it + was using this standard format to enable common connection + establishment procedures to apply common code to properly parse these + fields and configure the RDMA endpoints accordingly. Exchange of + parameters necessary to perform RDMA Read operations is a common + usage of the initial Private Data exchange. + + One of the RDMA operations that is defined in [RDMAC] is an RDMA + Read. RDMA Read operations are performed using an untagged message + sent from a Queue Pair (QP) on the local endpoint to a QP on the + remote endpoint targeting the Inbound RDMA Read Request Queue (QN=1 + or Inbound RDMA Read Request Queue (IRRQ)) associated with the + connection. RDMA Read responses transfer data associated with each + RDMA Read Request from the remote endpoint to the local endpoint + using tagged messages. An inbound RDMA Read Request remains on the + IRRQ from the time that it is received until the time that the last + tagged message associated with the RDMA request is acknowledged. The + IRRQ is associated with a QP but is not a Work Queue. Instead, the + IRRQ is a stand-alone queue that is used to manage RDMA Read Requests + associated with a QP. See [RDMAC], Section 6 for more information + regarding QPs and IRRQ. One of the characteristics that must be + configured for a QP is the size of the IRRQ. This parameter is + called the Inbound RDMA Read Queue Depth (IRD). Another + characteristic of a QP that must be configured is a local limit on + the number of simultaneous outbound RDMA Read Requests based on the + size of the remote endpoint QP's IRRQ. This parameter is called the + + + +Kanevsky, et al. Standards Track [Page 8] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Outbound RDMA Read Queue Depth (ORD). ORD is used to limit the + number of simultaneous RDMA Read Requests such that the local + endpoint does not overrun the remote endpoint's IRRQ depth or IRD.
+ Note that outbound RDMA Reads are submitted to a QP's Send Queue at + the local peer, not to a separate outbound RDMA Read Request queue on + the local peer. The local endpoint uses ORD to strictly limit + simultaneous Read Requests so that IRRQ overruns do not occur at the + remote endpoint. + + Determination of the values of the ORD and IRD are left to the ULP by + the current RDDP suite of protocols and also by [RDMAC]. Since this + negotiation of ORD and IRD is typical, it is desirable to provide a + common mechanism as described in this document. + +4.2. Enabling MPA Mode + + MPA defines encoding of DDP Segments in Framed Upper Layer Protocol + PDUs (FULPDUs). Generation of FULPDUs requires the ability to + periodically insert MPA Markers and to generate the MPA CRC-32c for + each frame. Reception may require parsing/removing the markers after + using them to identify MPA Frame boundaries and validation of the + MPA-CRC32c. + + A major design objective for MPA was to ensure that the resulting TCP + stream would be fully compliant for any and all TCP-aware + middleboxes. The challenge is that while only some TCP payload + streams are a valid stream of MPA FULPDUs, any sequence of bytes is a + valid TCP payload stream. The determination that a given stream is + in a specific MPA mode cannot be made at the MPA or TCP layer. + Therefore, enabling of MPA mode is handled by the ULP. + + The MPA protocol can be viewed as having two parts: + + o a specification of generation and reception of MPA FULPDUs. This + is unchanged by enhanced RDMA connection establishment. + + o a pre-MPA exchange of messages to enable a specific MPA mode for + the TCP connection. Enhanced RDMA connection establishment + extends this protocol with two new features. + + In typical implementations, generation and reception of MPA FULPDUs + is handled by hardware. The exchange of the MPA Request and Reply + Frames is then handled by host software. 
As will be explained, this + implementation split impedes applications that are not compatible + with the client-server assumptions in the current MPA Request/Reply + exchange. + + + + + +Kanevsky, et al. Standards Track [Page 9] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +4.3. Lack of Explicit RTR in MPA Request/Reply Exchange + + The exchange of MPA Request and Reply messages to place a TCP + connection in MPA mode is specified in [RFC5044]. This protocol + provides many benefits to the design of MPA FULPDU hardware: + + o The ULP is responsible for specifying the exact MPA Mode (Markers + enabled or disabled, CRC-32c enabled or suppressed) and the point + in the TCP streams (inbound and outbound) where MPA Frames will + begin. + + o Before the first MPA Frame is transmitted, all pre-MPA mode TCP + payloads will have been acknowledged by the peer. Therefore, it + is never necessary to generate a retransmission that mixes pre-MPA + and MPA payload. + + o Before MPA reception is enabled, all incoming pre-MPA mode TCP + payloads will have been acknowledged. Therefore, the host will + never receive a TCP segment that mixes pre-MPA and MPA payload. + + The limitation of the current MPA Request/Reply exchange is that it + does not define a Ready to Receive (RTR) indication that the active + side would send, so that the passive side can know that the last non- + MPA payload (the MPA Reply) had been received. + + Instead, the role of an RTR indication is piggybacked on the first + MPA FULPDU sent by the active side. This is actually a valuable + optimization for all applications that fit the classic client-server + model. The client only initiates the connection when it has a + request to send to the server, and the server has nothing to send + until it has received and processed the client request. 
+ + Even applications where the server sends some configuration data + immediately can easily send the same information as application + Private Data in the MPA Reply. So the currently defined exchange + works for almost all applications. + + Many peer-to-peer applications, especially those involving cluster + calculations (frequently using Message Passing Interface (MPI) + [UsingMPI] or [RDS]), have no natural client or server roles ([PPMPI] + [OpenMP]). Typically, one member of the cluster is arbitrarily + selected to initiate the connection when the distributed task is + launched, while the other accepts it. At startup time, however, + there is no way to predict which node will have the first message to + actually send. Immediately establishing the connections is valuable + because it reduces latency once results are ready to transmit and it + validates connectivity throughout the cluster. + + + + +Kanevsky, et al. Standards Track [Page 10] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + The lack of an explicit RTR indication in the MPA Request/Reply + exchange forces all applications to have a first message from the + connection initiator, whether or not this matches the application + communication model. + +4.4. Limitations on ULP Workaround + + The requirement that the RDMA connection initiator sends the first + message does not appear to be onerous on first examination. The + natural question is why the application layer would not simply + generate a dummy message when there is no other message to submit. + + There are three factors that make this workaround unsuitable for many + peer-to-peer applications: + + o Transport-Neutral APIs. + + o Work/Completion Queue Accounting. + + o Host-based implementation of MPA Fencing. + +4.4.1. Transport-Neutral APIs + + Many of these applications access RDMA services using a transport- + neutral API such as [DAPL] or [OFA]. Only RDDP over TCP [RFC5044] + has a first message requirement. 
Other RDMA transports, including + RDDP over SCTP (see [RFC5043]) and InfiniBand (see [IBTA]), do not. + + Application or middleware communications can be expressed as + transport-neutral RDMA operations, allowing lower software layers to + translate to transport and device specifics. Having a distinct extra + message that is required only for one transport undermines the + application's goal of being transport neutral. + +4.4.2. Work/Completion Queue Accounting + + RDMA local APIs conventionally use Work Queues to submit requests + (Work Queue elements or WQEs) and to asynchronously receive + completions (in Completion Queues or CQs). + + Each Work Request can generate a Completion Queue Entry (CQE). + Completions for successful transmit Work Requests are frequently + suppressed, but the CQ capacity must account for the possibility that + each will complete in error. A CQ can receive completions from + multiple Work Queues. + + + + + + +Kanevsky, et al. Standards Track [Page 11] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + CQs are defined to allow hardware RDMA implementations to generate + CQEs directly to a user-space-mapped buffer. This enables a user- + space RDMA Consumer to reap completions without requiring kernel + intervention. + + A hardware RDMA implementation cannot reasonably wait for an + available slot in the CQ. The queue must be sized such that an + overflow will not occur. When an overflow does occur, it is + considered a catastrophic error and will typically require tearing + down all RDMA connections using that CQ. + + This style of interface is very efficient, but places a burden on the + application to properly size each CQ to match the Work Queues that + feed it. + + While the format of both WQEs and CQEs is transport and device + dependent, a transport-neutral API can deal with WQEs and CQEs as + abstract transport- and device-neutral objects. 
Therefore, the + number of WQEs and CQEs required for an application can be transport + and device neutral. + + The capacity of the Work Queues and CQs can be calculated in an + abstract transport- and device-neutral fashion. If a dummy operation + approach is used, it would require lower layers to know the usage + model, and would disrupt the calculations by inserting a dummy + "operation" Work Request and filtering out the matching completion. + The lower layer does not know the usage model on which the queue + sizes are built, nor does it know how frequently an insertion will be + required. + +4.4.3. Host-based Implementation of MPA Fencing + + Many hardware implementations of RDDP using MPA/TCP do not handle the + MPA Request/Reply exchange in hardware, rather they are handled by + the host processor in software. With such designs, it is common for + the MPA Fencing to be implemented in the user-space, device-specific + library (commonly referred to as a 'User Verbs' library or module). + + When the generation and reception of MPA FULPDUs are already + dedicated to hardware, a Work Completion can only be generated by an + untagged message, since arrival of a message for a tagged buffer does + not necessarily generate a completion and is done without any + interaction with ULP [RFC5040]. + + + + + + + + +Kanevsky, et al. Standards Track [Page 12] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +5. Enhanced MPA Connection Establishment + + Below we provide an overview of Enhanced Connection Setup. The goal + is to allow standard negotiation of the ORD/IRD setting on both sides + of the RDMA connection and/or to negotiate the initial data transfer + operation by the initiator when the existing 'client sends first' + rule does not match application requirements. + + The RDMA connection initiator sends an MPA Request, as specified in + [RFC5044]; the new format defined here allows for: + + o Standardized negotiation of ORD and IRD. 
+ + o Negotiation of RTR functionality and the RDMA message type to use + as the RTR indication. + + The RDMA connection responder processes the MPA Request and generates + an MPA Reply, as specified in [RFC5044]; the new format completes the + negotiation. + + The local interface needs to provide a way for a ULP to request the + use of explicit RTR indication on a per-application or per-connection + basis when an explicit RTR indication will be required. Piggybacking + the RTR on a Client's first message is a valuable optimization for + most connections. + + The RDMA connection initiator MUST NOT allow any later FULPDUs to be + transmitted before the RTR indication. One method to achieve this is + to delay notifying the ULP that the RDMA connection has been + established until after any required RTR indication has been + transmitted. + + All MPA exchanges are performed via TCP prior to RDMA establishment, + and are therefore signaled via TCP and not via RDMA completion. + + + + + + + + + + + + + + + + + +Kanevsky, et al. Standards Track [Page 13] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +6. Enhanced MPA Request/Reply Frames + + Enhanced RDMA connection establishment uses an alternate format for + MPA Requests and Replies as follows: + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 0 | | + + Key (16 bytes containing "MPA ID Req Frame") + + 4 | (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65) | + + Or (16 bytes containing "MPA ID Rep Frame") + + 8 | (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65) | + + + + 12 | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 16 |M|C|R|S| Res | Rev | PD_Length | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | + ~ ~ + ~ Private Data ~ + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Key: Unchanged from [RFC5044]. 
+ + M: Unchanged from [RFC5044]. + + C: Unchanged from [RFC5044]. + + R: Unchanged from [RFC5044]. + + S: 1 if the Private Data begins with the enhanced RDMA connection + establishment data; 0 otherwise. + + Res: One bit smaller than in [RFC5044]; otherwise unchanged. In + [RFC5044], the 'Res' field, in which the newly defined 'S' bit + resides, is reserved for future use. [RFC5044] specifies that + 'Res' MUST be set to zero when sending and MUST NOT be checked on + reception, making use of the 'S' bit backward compatible with the + original MPA Frame format. When the 'S' bit is set to zero, no + additional Private Data is used for enhanced RDMA connection + establishment; therefore, the resulting MPA Request and Reply + Frames are identical to the unenhanced protocol. + + + + + +Kanevsky, et al. Standards Track [Page 14] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Rev: This field contains the revision of MPA. To use any enhanced + connection establishment feature, this MUST be set to two or + higher. If no enhanced connection establishment features are + desired, it MAY be set to one. A host accepting MPA connections + MUST continue to accept MPA Requests with version one, even if it + supports version two. + + PD_Length: Unchanged from [RFC5044]. This is the total length of + the Private Data field, including the enhanced RDMA connection + establishment data, if present. + + Private Data: Unchanged from [RFC5044]. However, if the 'S' flag is + set, Private Data MUST begin with enhanced RDMA connection + establishment data (see Section 9). + +7. Enhanced SCTP Session Control Chunks + + Enhanced RDMA connection establishment uses the first 32 bits of the + Private Data field for IRD and ORD negotiation in the "DDP Stream + Session Initiate" and "DDP Stream Session Accept" SCTP Session + Control Chunks. + + The type of the SCTP Session Control Chunk is defined by a Function + Code (see [RFC4960]).
[RFC5043] already defines codes for 'DDP + Stream Session Initiate' and 'DDP Stream Session Accept', which are + equivalent to an MPA Request Frame and an accepting MPA Reply Frame. + + Enhanced RDMA connection establishment requires three additional + function codes listed below: + + Enhanced DDP Stream Session Initiate: 0x005 + + Enhanced DDP Stream Session Accept: 0x006 + + Enhanced DDP Stream Session Reject: 0x007 + + The Enhanced Reject function code MUST be used to indicate rejection + of enhanced DDP stream session for a configuration that would have + been accepted for unenhanced DDP stream session negotiation. + + The enhanced DDP stream session establishment follows the same rules + as the standard DDP stream session establishment as defined in + [RFC5043]. ULP-supplied Private Data MUST be included for Enhanced + DDP Stream Session Initiate, Enhanced DDP Stream Session Accept, and + Enhanced DDP Stream Session Reject messages, and MUST follow the + enhanced RDMA connection establishment data in the DDP Stream Session + Initiate and the Enhanced DDP Stream Session Accept messages. + + + + +Kanevsky, et al. Standards Track [Page 15] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Private Data length MUST NOT exceed 512 bytes in any message, + including enhanced RDMA connection establishment data. + + Private Data MUST NOT be included in the DDP Stream Session TERM + message. + + Received Extended DDP Stream Session Control messages SHOULD be + reported to the ULP. If reported, any supplied Private Data MUST be + available for the ULP to examine. For example, a received Extended + DDP Stream Session Control message is not reported to ULP if none of + the requested RTR indication types are supported by the receiver. In + this case, the Provider MAY generate a reject reply message + indicating which RTR indication types it supports. 
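The Private Data rules above can be illustrated with a short Python sketch. It is not part of the specification: `EnhancedFunctionCode` and `build_private_data` are hypothetical names chosen for this example, and only the three function codes, the ordering rule (enhanced data first, ULP data after), and the 512-byte limit come from the text.

```python
from enum import IntEnum

MAX_PRIVATE_DATA = 512   # upper bound on Private Data, in bytes
ENHANCED_DATA_LEN = 4    # the leading 32 bits of enhanced establishment data


class EnhancedFunctionCode(IntEnum):
    """Function codes added for enhanced DDP Stream Session control."""
    INITIATE = 0x005
    ACCEPT = 0x006
    REJECT = 0x007


def build_private_data(code: EnhancedFunctionCode,
                       enhanced_data: bytes,
                       ulp_data: bytes = b"") -> bytes:
    """Return the Private Data for an Enhanced Initiate or Accept chunk.

    Hypothetical helper: places the 4-byte enhanced data first, appends
    the ULP-supplied bytes, and enforces the 512-byte limit.
    """
    if code not in (EnhancedFunctionCode.INITIATE, EnhancedFunctionCode.ACCEPT):
        raise ValueError("layout is only described for Initiate and Accept")
    if len(enhanced_data) != ENHANCED_DATA_LEN:
        raise ValueError("enhanced data must be exactly 4 bytes")
    private_data = enhanced_data + ulp_data
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 bytes")
    return private_data
```

Because the enhanced data counts toward the limit, a ULP using this helper can supply at most 508 bytes of its own.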
+ + The enhanced DDP stream management MUST use the DDP stream session + termination function code to terminate a stream established using + enhanced DDP stream session function codes. + + [RFC5043] already supports either side sending the first DDP Message + since the Payload Protocol Identifier (PPID) already distinguishes + between Session Establishment and DDP Segments. The enhanced RDMA + connection establishment provides the ULP a transport-independent way + to support the peer-to-peer model. + + The following additional Legal Sequences of DDP Stream Session + messages are defined: + + o Enhanced Active/Passive Session Accepted: as with Section 6.2 of + [RFC5043], but with the extended opcodes as defined in this + document. + + o Enhanced Active/Passive Session Rejected: as with Section 6.3 of + [RFC5043], but with the extended opcodes as defined in this + document. + + o Enhanced Active/Passive Session Non-ULP Rejected: as with Section + 6.4 of [RFC5043], but with the extended opcodes as defined in this + document. + +8. MPA Error Reporting + + The RDMA connection establishment protocol is layered upon the + protocols defined in [RFC5040] and [RFC5041]. Any enhanced RDMA + connection establishment error generates an MPA termination message + to a peer. [RFC5040] defines a triplet of protocol layers, error + types, and error codes for error specification. MPA negotiation for + RDMA connection establishment uses the following layer and error type + for MPA error reporting: + + + +Kanevsky, et al. Standards Track [Page 16] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + Layer: 0x2 - LLP Error Type: 0x0 - MPA + + While [RFC5044] defines four error codes, [RFC5043] does not define + any. Enhanced RDMA connection establishment extends the error codes + defined in [RFC5044] by adding three new error codes. Thus, enhanced + RDMA connection establishment is backward compatible with both + [RFC5043] and [RFC5044]. 
+
+   The following error codes are defined for enhanced RDMA connection
+   establishment negotiation:
+
+   Error Code     Description
+   --------------------------------------------------------
+   0x05           Local catastrophic
+   0x06           Insufficient IRD resources
+   0x07           No matching RTR option
+
+9.  Enhanced RDMA Connection Establishment Data
+
+   Enhanced RDMA connection establishment places the following 32 bits
+   at the beginning of the Private Data field of the MPA Request and
+   Reply Frames or the "DDP Stream Session Initiate" and "DDP Stream
+   Session Accept" SCTP Session Control Chunks. ULP-specified Private
+   Data follows this field. The maximum amount of ULP-specified Private
+   Data is therefore reduced by 4 bytes. Note that this field MUST be
+   sent in network byte order, with the IRD and ORD encoded as 14-bit
+   unsigned integers.
+
+       0                   1                   2                   3
+       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   0  |A|B|            IRD            |C|D|            ORD            |
+   4  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+   IRD: Inbound RDMA Read Queue Depth.
+
+   ORD: Outbound RDMA Read Queue Depth.
+
+   A: Control Flag for connection model.
+
+   B: Control Flag for use of a zero-length FULPDU (Send) RTR
+      indication.
+
+   C: Control Flag for use of a zero-length RDMA Write RTR indication.
+
+   D: Control Flag for use of a zero-length RDMA Read RTR indication.
+
+9.1.  IRD and ORD Negotiation
+
+   The IRD and ORD are used for negotiation of Inbound RDMA Read Request
+   Queue depths for both endpoints of the RDMA connection. The IRD is
+   used to configure the depth of the Inbound RDMA Read Request Queue
+   (IRRQ) on each endpoint. ORD is used to limit the number of
+   simultaneous outbound RDMA Read Requests allowed at any given point
+   in time in order to avoid IRRQ overruns at the remote endpoint.
In + order to describe the negotiation of both local endpoint and remote + endpoint ORD and IRD values, four terms are defined: + + Initiator IRD: The IRD value sent in the MPA Request or "DDP Stream + Session Initiate" SCTP Session Control Chunk. This is the value + of the initiator's IRD at the time of the MPA Request generation. + The responder sets its local ORD value to this value or less. The + initiator IRD is the maximum number of simultaneous inbound RDMA + Read Requests that the initiator can support for the requested + connection. + + Initiator ORD: The ORD value in the MPA Request or "DDP Stream + Session Initiate" SCTP Session Control Chunk. This is the initial + value of the initiator's ORD at the time of the MPA Request + generation and also a request to the responder to support a + responder IRD of at least this value. The initiator ORD is the + maximum number of simultaneous outbound RDMA Read operations that + the initiator desires the responder to support for the requested + connection. + + Responder IRD: The IRD value returned in the MPA Reply or "DDP + Stream Session Accept" SCTP Session Control Chunk. This is the + actual value that the responder sets for its local IRD. This + value is greater than or equal to the initiator ORD for successful + negotiations. The responder IRD is the maximum number of + simultaneous inbound RDMA Read Requests that the responder + actually can support for the requested connection. + + Responder ORD: The ORD value returned in the MPA Reply or "DDP + Stream Session Accept" SCTP Session Control Chunk. This is the + actual value that the responder used for ORD and is less than or + equal to the initiator IRD for successful negotiations. The + responder ORD is the maximum number of simultaneous outbound RDMA + Read operations that the responder will allow for the requested + connection. + + + + + + + + +Kanevsky, et al. 
Standards Track [Page 18] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + The relationships between these parameters after a successful + negotiation is complete are the following: + + initiator ORD <= responder IRD + + responder ORD <= initiator IRD + + The responder and initiator MUST pass the peer's provided IRD and ORD + values to the ULP, in addition to using the values as calculated by + the preceding rules. + + The responder ORD SHOULD be set to a value less than or equal to the + initiator IRD. If the initiator ORD is insufficient to support the + selected connection model, the responder IRD MAY be increased; for + example, if the initiator ORD is 0 (RDMA Reads will not be used by + the ULP) and the responder supports use of a zero-length RDMA Read + RTR indication, then the responder IRD can be set to 1. The + responder MUST set its ORD at most to the initiator IRD. The + responder MAY reject the connection request if the initiator IRD is + not sufficient for the ULP-required ORD and specify the required ORD + in the MPA Reject Frame responder ORD. Thus, the TERM message MUST + contain Layer 2, Error Type 0, Error Code 6. + + Upon receiving the MPA Accept Frame from the responder, the initiator + MUST set its IRD at least to the responder ORD and its ORD at most to + the responder IRD. If the initiator does not have sufficient + resources for the required IRD, it MUST send a TERM message to the + responder indicating insufficient resources and terminate the + connection due to insufficient resources. Thus, the TERM message + MUST contain Layer 2, Error Type 0, Error Code 6. + + The initiator MUST pass the responder provided IRD and ORD to the ULP + for both MPA Accept and Reject messages. The initiator ULP can + decide its course of action. For example, the initiator ULP may + terminate the established connection and renegotiate the responder + ORD. 
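As an illustration only, the 32-bit field from Section 9 and the initiator-side rules above might be sketched in Python as follows. `EnhancedData`, `NegotiationError`, and `initiator_finalize` are hypothetical names, not part of any RDMA API. The sketch assumes bit 0 of the diagram is the most significant bit of each 16-bit half, picks the smallest IRD that satisfies "at least the responder ORD", and does not handle the all-ones (0x3FFF) value.

```python
import struct
from dataclasses import dataclass

DEPTH_MASK = 0x3FFF   # IRD and ORD are 14-bit unsigned integers
FLAG_BIT0 = 0x8000    # first bit of a 16-bit half: flag A or flag C
FLAG_BIT1 = 0x4000    # second bit of a 16-bit half: flag B or flag D
TERM_INSUFFICIENT_IRD = (0x2, 0x0, 0x06)  # layer, error type, error code


@dataclass
class EnhancedData:
    ird: int
    ord: int
    a: bool = False   # peer-to-peer connection model
    b: bool = False   # zero-length FULPDU (Send) RTR
    c: bool = False   # zero-length RDMA Write RTR
    d: bool = False   # zero-length RDMA Read RTR

    def pack(self) -> bytes:
        """Encode the field as 4 bytes in network byte order."""
        if not (0 <= self.ird <= DEPTH_MASK and 0 <= self.ord <= DEPTH_MASK):
            raise ValueError("IRD and ORD must fit in 14 bits")
        first = self.ird | (FLAG_BIT0 if self.a else 0) | (FLAG_BIT1 if self.b else 0)
        second = self.ord | (FLAG_BIT0 if self.c else 0) | (FLAG_BIT1 if self.d else 0)
        return struct.pack("!HH", first, second)

    @classmethod
    def unpack(cls, data: bytes) -> "EnhancedData":
        """Decode the first 4 bytes of Private Data."""
        first, second = struct.unpack("!HH", data[:4])
        return cls(ird=first & DEPTH_MASK, ord=second & DEPTH_MASK,
                   a=bool(first & FLAG_BIT0), b=bool(first & FLAG_BIT1),
                   c=bool(second & FLAG_BIT0), d=bool(second & FLAG_BIT1))


class NegotiationError(Exception):
    """Carries the (layer, error type, error code) triplet for a TERM message."""
    def __init__(self, term):
        super().__init__(f"TERM {term}")
        self.term = term


def initiator_finalize(reply: EnhancedData, local_max_ird: int, requested_ord: int):
    """Apply the responder's reply; return (local IRD, local ORD)."""
    # The initiator IRD must be at least the responder ORD.
    if reply.ord > local_max_ird:
        raise NegotiationError(TERM_INSUFFICIENT_IRD)
    local_ird = reply.ord
    # The initiator ORD must be at most the responder IRD.
    local_ord = min(requested_ord, reply.ird)
    return local_ird, local_ord
```

For example, a request with IRD 16, ORD 8, and flags A and B set packs to the bytes C0 10 00 08, and a reply of IRD 4, ORD 16 leaves an initiator that asked for ORD 8 with a local IRD of 16 and a local ORD of 4.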
+ + An all ones value (0x3FFF) indicates that automatic negotiation of + the IRD or ORD is not desired, and that the ULP will be responsible + for it. The responder MUST respond to an initiator ORD value of + 0x3FFF by leaving its local endpoint IRD value unchanged and setting + the IRD to 0x3FFF in its reply message. The initiator MUST leave its + local endpoint ORD value unchanged upon receiving a responder IRD + value of 0x3FFF. The responder MUST respond to an initiator IRD + value of 0x3FFF by leaving its local endpoint ORD value unchanged, + and setting ORD to 0x3FFF in its reply message. The initiator MUST + leave its local endpoint IRD value unchanged upon receiving a + responder ORD value of 0x3FFF. + + + +Kanevsky, et al. Standards Track [Page 19] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +9.2. Peer-to-Peer Connection Negotiation + + Control Flag A value 1 indicates that a peer-to-peer connection model + is being performed, and value 0 indicates a client-server model. + Control Flag B value 1 indicates that a zero-length FULPDU (Send) RTR + indication is requested for the initiator and supported by the + responder, respectively, 0 otherwise. Control Flag C value 1 + indicates that a zero-length RDMA Write RTR indication is requested + for the initiator and supported by the responder, respectively, 0 + otherwise. Control Flag D value 1 indicates that a zero-length RDMA + Read RTR indication is requested for the initiator and supported by + the responder, respectively, 0 otherwise. The initiator MUST set + Control Flag A to 1 for the peer-to-peer model. The initiator MUST + set each Control Flag B, C, and D to 1 for each of the options it + supports, if Control Flag A is set to 1. + + The responder MUST support at least one RTR indication option if it + supports Enhanced RDMA connection establishment. If Control Flag A + is 1 in the MPA Request message, then the responder MUST set Control + Flag A to 1 in the MPA reply message. 
For each initiator-supported + RTR indication option, the responder SHOULD set the corresponding + Control Flag if the responder can support that option in an MPA + reply. The responder is not required to specify all RTR indication + options it supports. The responder MUST set at least one RTR + indication option if it supports more than one initiator-specified + RTR indication option. The responder MAY include additional RTR + indication options it supports, even if not requested by any + initiator specified RTR indication options. If the responder does + not support any of the initiator-specified RTR indication options, + then the responder MUST set at least one RTR indication type option + it supports. + + Upon receiving the MPA Accept Frame with Control Flag A set to 1, the + initiator MUST generate one of the negotiated RTR indications. If + the initiator is not able to generate any of the responder-supported + RTR indications, then it MUST send a TERM message to the responder + indicating failure to negotiate a mutually compatible connection + model or RTR option, and terminate the connection. Thus, the TERM + message MUST contain Layer 2, Error Type 0, Error Code 7. The ULP + can negotiate a ULP-level RTR indication when a Provider-level RTR + indication cannot be negotiated. + + The initiator MUST set Control Flag A to 0 for the client-server + model. The responder MUST set Control Flag A to 0 if Control Flag A + is 0 in the request. If Control Flag A is set to 0, then Control + Flags B, C, and D MUST also be set to 0. On reception, if Control + Flag A is set to 0, then Control Flags B, C, and D MUST be ignored. + + + + +Kanevsky, et al. Standards Track [Page 20] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +9.3. 
Enhanced Connection Negotiation Flow + + The RTR indication type and ORD/IRD negotiation follows the following + order: + + initiator (MPA Request) --> The initiator sets Control Flag A to 1 + to indicate the peer-to-peer connection model and sets its initial + IRD/ORD on the local endpoint of the connection. The initiator + also sets Control Flags B, C, and D to 1 for each initiator- + supported option of RTR indication. + + responder (MPA Reply) <-- The responder matches the initiator's + Control Flag A value and sets ORD/IRD to its local endpoint values + based upon the initiator's initial ORD/IRD values and the number + of simultaneous RDMA Read Requests required by the ULP. The + responder sets Control Flags B, C, and D to 1 for each responder- + supported option of RTR indication options for the peer-to-peer + connection model. The responder also sets its IRD/ORD to actual + values. + + initiator (First RDMA Message) --> After the initiator modifies its + ORD/IRD to match the responder's values as stated above, the + initiator sends the first message of the negotiated RTR indication + option. If no matching RTR indication option exists, then the + initiator sends a TERM message. + + The initiator or responder MUST generate the TERM message that + contains Layer 2, Error Type 0, Error Code 5 when it encounters + any error locally for which the special Error Code is not defined + in Section 8 before resetting the connection. + +10. Interoperability + + The initiator requests enhanced RDMA connection establishment by + sending an enhanced RDMA establishment request; an enhanced responder + is REQUIRED to respond with an enhanced RDMA connection establishment + response, whereas an unenhanced responder treats the enhanced request + as incorrectly formatted and closes the TCP connection. All + responders are REQUIRED to issue unenhanced RDMA connection + establishment responses in response to unenhanced RDMA connection + establishment requests. 
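The responder-side decision described above might look like the following Python sketch. It is a simplified illustration under stated assumptions: `parse_header` and `choose_response` are hypothetical names, the 20-byte header layout (16-byte Key, one octet of flags, Rev, PD_Length) follows the frame diagram in Section 6, the 'S' bit is taken to be 0x10 in the flags octet, and any case not spelled out in the text (such as 'S' set with Rev below two) is treated here as unenhanced.

```python
import struct

MPA_KEY_REQ = b"MPA ID Req Frame"
FLAG_S = 0x10      # fourth bit of the flags octet, after M, C, and R
HEADER_LEN = 20    # 16-byte Key + flags + Rev + 2-byte PD_Length


def parse_header(frame: bytes):
    """Split an MPA Request header into (flags, rev, pd_length)."""
    if len(frame) < HEADER_LEN:
        raise ValueError("frame shorter than MPA header")
    key, flags, rev, pd_length = struct.unpack("!16sBBH", frame[:HEADER_LEN])
    if key != MPA_KEY_REQ:
        raise ValueError("not an MPA Request Frame")
    return flags, rev, pd_length


def choose_response(frame: bytes, supports_enhanced: bool) -> str:
    """Return 'enhanced', 'unenhanced', or 'close' for an incoming request."""
    flags, rev, _pd_length = parse_header(frame)
    enhanced_request = rev >= 2 and bool(flags & FLAG_S)
    if not supports_enhanced:
        # An unenhanced responder sees a revision it does not know as
        # incorrectly formatted and closes the TCP connection.
        return "close" if rev > 1 else "unenhanced"
    # An enhanced responder answers in kind.
    return "enhanced" if enhanced_request else "unenhanced"
```

An initiator that receives a TCP close in the third case can retry with the unenhanced format if its ULP is able to handle ORD, IRD, and RTR negotiation on its own.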
+ + The initiator MUST NOT use the enhanced RDMA connection establishment + formats or function codes when no enhanced functionality is desired. + + The responder MUST continue to accept unenhanced connection requests. + + + + + +Kanevsky, et al. Standards Track [Page 21] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + + There are three initiator/responder cases that involve enhanced MPA: + both the initiator and responder, only the responder, and only the + initiator. The enhanced MPA Frame is defined by field 'S' set to 1. + + Enhanced MPA initiator and responder: If the responder receives an + enhanced MPA message, it MUST respond with an enhanced MPA + message. + + Enhanced MPA responder only: If the responder receives an unenhanced + MPA message ('S' is set to 0), it MUST respond with an unenhanced + MPA message. + + Enhanced MPA initiator only: If the responder receives an enhanced + MPA message and it does not support enhanced RDMA connection + establishment, it MUST close the TCP connection and exit MPA. + From a standard RDMA connection establishment point of view, the + enhanced MPA Frame is improperly formatted as stated in [RFC5044]. + Thus, both the initiator and responder report TCP connection + termination to an application locally. In this case, the + initiator MAY attempt to establish an RDMA connection using the + unenhanced MPA protocol as defined in [RFC5044] if this protocol + is compatible with the application, and let the ULP deal with ORD + and IRD and peer-to-peer negotiations. + + A note for potential future enhancements for connection establishment + negotiation: It is possible to further extend formatting of Private + Data of the MPA Request and Reply Frames and to use other bits from + the "Res" field to indicate additional Private Data formatting. + +11. 
IANA Considerations + + IANA has added the following entries to the "SCTP Function Codes for + DDP Session Control" registry created by Section 3.5 of [RFC6580]: + + 0x0005, Enhanced DDP Stream Session Initiate, [RFC6581] + + 0x0006, Enhanced DDP Stream Session Accept, [RFC6581] + + 0x0007, Enhanced DDP Stream Session Reject, [RFC6581] + + IANA has added the following entries to the "MPA Errors" registry + created by Section 3.3 of [RFC6580]: + + 0x2/0x0/0x05, - MPA Error / Local catastrophic error, [RFC6581] + + 0x2/0x0/0x06 - MPA Error / Insufficient IRD resources, [RFC6581] + + 0x2/0x0/0x07 - MPA Error / No matching RTR option, [RFC6581] + + + +Kanevsky, et al. Standards Track [Page 22] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +12. Security Considerations + + The security considerations from RFC 5044 and RFC 5043 apply and the + changes in this document do not introduce new security + considerations. However, it is recommended that implementations do + sanity checking for the input parameters, including ORD, IRD, and the + control flags used for RTR indication option negotiation. + +13. Acknowledgements + + The authors wish to thank Sean Hefty, Dave Minturn, Tom Talpey, David + Black, and David Harrington for their valuable contributions and + reviews of this document. + +14. References + +14.1. Normative References + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, March 1997. + + [RFC4960] Stewart, R., "Stream Control Transmission Protocol", RFC + 4960, September 2007. + + [RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. + Garcia, "A Remote Direct Memory Access Protocol + Specification", RFC 5040, October 2007. + + [RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct + Data Placement over Reliable Transports", RFC 5041, + October 2007. + + [RFC5043] Bestler, C. and R. 
Stewart, "Stream Control Transmission + Protocol (SCTP) Direct Data Placement (DDP) Adaptation", + RFC 5043, October 2007. + + [RFC5044] Culley, P., Elzur, U., Recio, R., Bailey, S., and J. + Carrier, "Marker PDU Aligned Framing for TCP + Specification", RFC 5044, October 2007. + + [RFC6580] Ko, M. and D. Black, "IANA Registries for the Remote + Direct Data Placement (RDDP) Protocols", RFC 6580, April + 2012. + + + + + + + + +Kanevsky, et al. Standards Track [Page 23] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +14.2. Informative References + + [DAPL] "Direct Access Programming Library", + . + + [IBTA] "InfiniBand Architecture Specification Release 1.2.1", + . + + [OFA] "OFA verbs & APIs", . + + [OpenMP] McGraw-Hill, "Parallel Programming in C with MPI and + OpenMP", 2003. + + [PPMPI] Morgan Kaufmann Publishers Inc., "Parallel Programming + with MPI", 2008. + + [RDMAC] "RDMA Protocol Verbs Specification (Version 1.0)", + . + + [RDS] Open Fabrics Association, "Reliable Datagram Socket", + 2008, + . + + [UsingMPI] MIT Press, "Using MPI-2: Advanced Features of the Message + Passing Interface", 1999. + + [VIA] Cameron, Don and Greg Regnier, "Virtual Interface + Architecture", Intel, April 2002. + + + + + + + + + + + + + + + + + + + + + + +Kanevsky, et al. Standards Track [Page 24] + +RFC 6581 Enhanced RDMA Connection Establishment April 2012 + + +Authors' Addresses + + Arkady Kanevsky (editor) + Dell Inc. 
+   One Dell Way, MS PS2-47
+   Round Rock, TX 78682
+   USA
+
+   Phone: +1-512-728-0000
+   EMail: arkady.kanevsky@gmail.com
+
+
+   Caitlin Bestler (editor)
+   Nexenta Systems
+   555 E El Camino Real #104
+   Sunnyvale, CA 94087
+   USA
+
+   Phone: +1-949-528-3085
+   EMail: Caitlin.Bestler@nexenta.com
+
+
+   Robert Sharp
+   Intel
+   LAD High Performance Message Passing, Mailstop: AN1-WTR1
+   1501 South Mopac, Suite 400
+   Austin, TX 78746
+   USA
+
+   Phone: +1-512-493-3242
+   EMail: robert.o.sharp@intel.com
+
+
+   Steve Wise
+   Open Grid Computing
+   4030 Braker Lane STE 130
+   Austin, TX 78759
+   USA
+
+   Phone: +1-512-343-9196 x101
+   EMail: swise@opengridcomputing.com