Diffstat (limited to 'doc/rfc/rfc5044.txt')
-rw-r--r-- | doc/rfc/rfc5044.txt | 4147 |
1 files changed, 4147 insertions, 0 deletions
diff --git a/doc/rfc/rfc5044.txt b/doc/rfc/rfc5044.txt new file mode 100644 index 0000000..075c8d5 --- /dev/null +++ b/doc/rfc/rfc5044.txt @@ -0,0 +1,4147 @@ + + + + + + +Network Working Group P. Culley +Request for Comments: 5044 Hewlett-Packard Company +Category: Standards Track U. Elzur + Broadcom Corporation + R. Recio + IBM Corporation + S. Bailey + Sandburst Corporation + J. Carrier + Cray Inc. + October 2007 + + + Marker PDU Aligned Framing for TCP Specification + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Abstract + + Marker PDU Aligned Framing (MPA) is designed to work as an + "adaptation layer" between TCP and the Direct Data Placement protocol + (DDP) as described in RFC 5041. It preserves the reliable, in-order + delivery of TCP, while adding the preservation of higher-level + protocol record boundaries that DDP requires. MPA is fully compliant + with applicable TCP RFCs and can be utilized with existing TCP + implementations. MPA also supports integrated implementations that + combine TCP, MPA and DDP to reduce buffering requirements in the + implementation and improve performance at the system level. + + + + + + + + + + + + + + + + + +Culley, et al. Standards Track [Page 1] + +RFC 5044 MPA Framing for TCP October 2007 + + +Table of Contents + + 1. Introduction ....................................................4 + 1.1. Motivation .................................................4 + 1.2. Protocol Overview ..........................................5 + 2. Glossary ........................................................8 + 3. MPA's Interactions with DDP ....................................11 + 4. 
MPA Full Operation Phase .......................................13 + 4.1. FPDU Format ...............................................13 + 4.2. Marker Format .............................................14 + 4.3. MPA Markers ...............................................14 + 4.4. CRC Calculation ...........................................16 + 4.5. FPDU Size Considerations ..................................21 + 5. MPA's interactions with TCP ....................................22 + 5.1. MPA transmitters with a standard layered TCP ..............22 + 5.2. MPA receivers with a standard layered TCP .................23 + 6. MPA Receiver FPDU Identification ...............................24 + 7. Connection Semantics ...........................................24 + 7.1. Connection Setup ..........................................24 + 7.1.1. MPA Request and Reply Frame Format .................26 + 7.1.2. Connection Startup Rules ...........................28 + 7.1.3. Example Delayed Startup Sequence ...................30 + 7.1.4. Use of Private Data ................................33 + 7.1.4.1. Motivation ................................33 + 7.1.4.2. Example Immediate Startup Using + Private Data ..............................35 + 7.1.5. "Dual Stack" Implementations .......................37 + 7.2. Normal Connection Teardown ................................38 + 8. Error Semantics ................................................39 + 9. Security Considerations ........................................40 + 9.1. Protocol-Specific Security Considerations .................40 + 9.1.1. Spoofing ...........................................40 + 9.1.1.1. Impersonation .............................41 + 9.1.1.2. Stream Hijacking ..........................41 + 9.1.1.3. Man-in-the-Middle Attack ..................41 + 9.1.2. Eavesdropping ......................................42 + 9.2. Introduction to Security Options ..........................42 + 9.3. 
Using IPsec with MPA ......................................43 + 9.4. Requirements for IPsec Encapsulation of MPA/DDP ...........43 + 10. IANA Considerations ...........................................44 + Appendix A. Optimized MPA-Aware TCP Implementations ...............45 + A.1. Optimized MPA/TCP Transmitters ............................46 + A.2. Effects of Optimized MPA/TCP Segmentation .................46 + A.3. Optimized MPA/TCP Receivers ...............................48 + A.4. Re-segmenting Middleboxes and Non-Optimized MPA/TCP + Senders ...................................................49 + A.5. Receiver Implementation ...................................50 + A.5.1. Network Layer Reassembly Buffers ...................51 + + + +Culley, et al. Standards Track [Page 2] + +RFC 5044 MPA Framing for TCP October 2007 + + + A.5.2. TCP Reassembly Buffers .............................52 + Appendix B. Analysis of MPA over TCP Operations ...................52 + B.1. Assumptions ...............................................53 + B.1.1. MPA Is Layered beneath DDP .........................53 + B.1.2. MPA Preserves DDP Message Framing ..................53 + B.1.3. The Size of the ULPDU Passed to MPA Is Less Than + EMSS Under Normal Conditions .......................53 + B.1.4. Out-of-Order Placement but NO Out-of-Order Delivery.54 + B.2. The Value of FPDU Alignment ...............................54 + B.2.1. Impact of Lack of FPDU Alignment on the Receiver + Computational Load and Complexity ..................56 + B.2.2. FPDU Alignment Effects on TCP Wire Protocol ........60 + Appendix C. IETF Implementation Interoperability with RDMA + Consortium Protocols ..................................62 + C.1. Negotiated Parameters ......................................63 + C.2. RDMAC RNIC and Non-Permissive IETF RNIC ....................64 + C.2.1. RDMAC RNIC Initiator ................................65 + C.2.2. Non-Permissive IETF RNIC Initiator ..................65 + C.2.3. 
RDMAC RNIC and Permissive IETF RNIC .................65 + C.2.4. RDMAC RNIC Initiator ................................66 + C.2.5. Permissive IETF RNIC Initiator ......................67 + C.3. Non-Permissive IETF RNIC and Permissive IETF RNIC ..........67 + Normative References ..............................................68 + Informative References ............................................68 + Contributors ......................................................70 + +Table of Figures + + Figure 1: ULP MPA TCP Layering .....................................5 + Figure 2: FPDU Format .............................................13 + Figure 3: Marker Format ...........................................14 + Figure 4: Example FPDU Format with Marker .........................16 + Figure 5: Annotated Hex Dump of an FPDU ...........................19 + Figure 6: Annotated Hex Dump of an FPDU with Marker ...............20 + Figure 7: Fully Layered Implementation ............................22 + Figure 8: MPA Request/Reply Frame .................................26 + Figure 9: Example Delayed Startup Negotiation .....................31 + Figure 10: Example Immediate Startup Negotiation ..................35 + Figure 11: Optimized MPA/TCP Implementation .......................45 + Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream .....56 + Figure 13: Aligned FPDU Placed Immediately after TCP Header .......58 + Figure 14: Connection Parameters for the RNIC Types ...............63 + Figure 15: MPA Negotiation between an RDMAC RNIC and a + Non-Permissive IETF RNIC ...............................65 + Figure 16: MPA Negotiation between an RDMAC RNIC and a Permissive + IETF RNIC ..............................................66 + Figure 17: MPA Negotiation between a Non-Permissive IETF RNIC and + a Permissive IETF RNIC .................................67 + + + +Culley, et al. Standards Track [Page 3] + +RFC 5044 MPA Framing for TCP October 2007 + + +1. 
Introduction + + This section discusses the reason for creating MPA on TCP and a + general overview of the protocol. + +1.1. Motivation + + The Direct Data Placement protocol [DDP], when used with TCP + [RFC793], requires a mechanism to detect record boundaries. The DDP + records are referred to as Upper Layer Protocol Data Units by this + document. The ability to locate the Upper Layer Protocol Data Unit + (ULPDU) boundary is useful to a hardware network adapter that uses + DDP to directly place the data in the application buffer based on the + control information carried in the ULPDU header. This may be done + without requiring that the packets arrive in order. Potential + benefits of this capability are the avoidance of the memory copy + overhead and a smaller memory requirement for handling out-of-order + or dropped packets. + + Many approaches have been proposed for a generalized framing + mechanism. Some are probabilistic in nature and others are + deterministic. An example probabilistic approach is characterized by + a detectable value embedded in the octet stream, with no method of + preventing that value elsewhere within user data. It is + probabilistic because under some conditions the receiver may + incorrectly interpret application data as the detectable value. + Under these conditions, the protocol may fail with unacceptable + frequency. One deterministic approach is characterized by embedded + controls at known locations in the octet stream. Because the + receiver can guarantee it will only examine the data stream at + locations that are known to contain the embedded control, the + protocol can never misinterpret application data as being embedded + control data. For unambiguous handling of an out-of-order packet, a + deterministic approach is preferred. + + The MPA protocol provides a framing mechanism for DDP running over + TCP using the deterministic approach. 
It allows the location of the + ULPDU to be determined in the TCP stream even if the TCP segments + arrive out of order. + + + + + + + + + + + + +Culley, et al. Standards Track [Page 4] + +RFC 5044 MPA Framing for TCP October 2007 + + +1.2. Protocol Overview + + The layering of PDUs with MPA is shown in Figure 1, below. + + +------------------+ + | ULP client | + +------------------+ <- Consumer messages + | DDP | + +------------------+ <- ULPDUs + | MPA* | + +------------------+ <- FPDUs (containing ULPDUs) + | TCP* | + +------------------+ <- TCP Segments (containing FPDUs) + | IP etc. | + +------------------+ + * These may be fully layered or optimized together. + + Figure 1: ULP MPA TCP Layering + + MPA is described as an extra layer above TCP and below DDP. The + operation sequence is: + + 1. A TCP connection is established by ULP action. This is done + using methods not described by this specification. The ULP may + exchange some amount of data in streaming mode prior to starting + MPA, but is not required to do so. + + 2. The Consumer negotiates the use of DDP and MPA at both ends of a + connection. The mechanisms to do this are not described in this + specification. The negotiation may be done in streaming mode, or + by some other mechanism (such as a pre-arranged port number). + + 3. The ULP activates MPA on each end in the Startup Phase, either as + an Initiator or a Responder, as determined by the ULP. This mode + verifies the usage of MPA, specifies the use of CRC and Markers, + and allows the ULP to communicate some additional data via a + Private Data exchange. See Section 7.1, Connection Setup, for + more details on the startup process. + + 4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into + Full Operation and begins sending DDP data as further described + below. In this document, DDP data chunks are called ULPDUs. For + a description of the DDP data, see [DDP]. + + + + + + + + +Culley, et al. 
Standards Track [Page 5] + +RFC 5044 MPA Framing for TCP October 2007 + + + Following is a description of data transfer when MPA is in Full + Operation. + + 1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA + for this value. MPA derives this information from TCP or IP, + when it is available, or chooses a reasonable value. + + 2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to + MPA at the sender. + + 3. MPA creates a Framed Protocol Data Unit (FPDU) by prepending a + header, optionally inserting Markers, and appending a CRC field + after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP. + + 4. The TCP sender puts the FPDUs into the TCP stream. If the sender + is optimized MPA/TCP, it segments the TCP stream in such a way + that a TCP Segment boundary is also the boundary of an FPDU. TCP + then passes each segment to the IP layer for transmission. + + 5. The receiver may or may not be optimized. If it is optimized + MPA/TCP, it may separate passing the TCP payload to MPA from + passing the TCP payload ordering information to MPA. In either + case, RFC-compliant TCP wire behavior is observed at both the + sender and receiver. + + 6. The MPA receiver locates and assembles complete FPDUs within the + stream, verifies their integrity, and removes MPA Markers (when + present), ULPDU_Length, PAD, and the CRC field. + + 7. MPA then provides the complete ULPDUs to DDP. MPA may also + separate passing MPA payload to DDP from passing the MPA payload + ordering information. + + A fully layered MPA on TCP is implemented as a data stream ULP for + TCP and is therefore RFC compliant. + + An optimized DDP/MPA/TCP uses a TCP layer that potentially contains + some additional behaviors as suggested in this document. 
When + DDP/MPA/TCP are cross-layer optimized, the behavior of TCP + (especially sender segmentation) may change from that of the un- + optimized implementation, but the changes are within the bounds + permitted by the TCP RFC specifications, and will interoperate with + an un-optimized TCP. The additional behaviors are described in + Appendix A and are not normative; they are described at a TCP + interface layer as a convenience. Implementations may achieve the + described functionality using any method, including cross-layer + optimizations between TCP, MPA, and DDP. + + + + +Culley, et al. Standards Track [Page 6] + +RFC 5044 MPA Framing for TCP October 2007 + + + An optimized DDP/MPA/TCP sender is able to segment the data stream + such that TCP segments begin with FPDUs (FPDU Alignment). This has + significant advantages for receivers. When segments arrive with + aligned FPDUs, the receiver usually need not buffer any portion of + the segment, allowing DDP to place it in its destination memory + immediately, thus avoiding copies from intermediate buffers (DDP's + reason for existence). + + An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation + to locate the start of ULPDUs that may be received out of order. It + also allows the implementation to determine if the entire ULPDU has + been received. As a result, MPA can pass out-of-order ULPDUs to DDP + for immediate use. This enables a DDP on MPA implementation to save + a significant amount of intermediate storage by placing the ULPDUs in + the right locations in the application buffers when they arrive, + rather than waiting until full ordering can be restored. + + The ability of a receiver to recover out-of-order ULPDUs is optional + and declared to the transmitter during startup. When the receiver + declares that it does not support out-of-order recovery, the + transmitter does not add the control information to the data stream + needed for out-of-order recovery. 
+ + If the receiver is fully layered, then MPA receives a strictly + ordered stream of data and does not deal with out-of-order ULPDUs. + In this case, MPA passes each ULPDU to DDP when the last bytes arrive + from TCP, along with the indication that they are in order. + + MPA implementations that support recovery of out-of-order ULPDUs MUST + support a mechanism to indicate the ordering of ULPDUs as the sender + transmitted them and indicate when missing intermediate segments + arrive. These mechanisms allow DDP to reestablish record ordering + and report Delivery of complete messages (groups of records). + + MPA also addresses enhanced data integrity. Some users of TCP have + noted that the TCP checksum is not as strong as could be desired (see + [CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum + indicates segments in error at a much higher rate than the underlying + link characteristics would indicate. With these higher error rates, + the chance that an error will escape detection, when using only the + TCP checksum for data integrity, becomes a concern. A stronger + integrity check can reduce the chance of data errors being missed. + + MPA includes a CRC check to increase the ULPDU data integrity to the + level provided by other modern protocols, such as SCTP [RFC4960]. It + is possible to disable this CRC check; however, CRCs MUST be enabled + unless it is clear that the end-to-end connection through the network + has data integrity at least as good as an MPA with CRC enabled (for + + + +Culley, et al. Standards Track [Page 7] + +RFC 5044 MPA Framing for TCP October 2007 + + + example, when IPsec is implemented end to end). DDP's ULP expects + this level of data integrity and therefore the ULP does not have to + provide its own duplicate data integrity and error recovery for lost + data. + +2. 
Glossary + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in [RFC2119]. + + Consumer - the ULPs or applications that lie above MPA and DDP. The + Consumer is responsible for making TCP connections, starting MPA + and DDP connections, and generally controlling operations. + + CRC - Cyclic Redundancy Check. + + Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as + the process of informing DDP that a particular PDU is ordered for + use. A PDU is Delivered in the exact order that it was sent by + the original sender; MPA uses TCP's byte stream ordering to + determine when Delivery is possible. This is specifically + different from "passing the PDU to DDP", which may generally + occur in any order, while the order of Delivery is strictly + defined. + + EMSS - Effective Maximum Segment Size. EMSS is the smaller of the + TCP maximum segment size (MSS) as defined in RFC 793 [RFC793], + and the current path Maximum Transmission Unit (MTU) [RFC1191]. + + FPDU - Framed Protocol Data Unit. The unit of data created by an MPA + sender. + + FPDU Alignment - The property that an FPDU is Header Aligned with the + TCP segment, and the TCP segment includes an integer number of + FPDUs. A TCP segment with an FPDU Alignment allows immediate + processing of the contained FPDUs without waiting on other TCP + segments to arrive or combining with prior segments. + + FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate + the beginning of an FPDU. + + Full Operation (Full Operation Phase) - After the completion of the + Startup Phase, MPA begins exchanging FPDUs. + + + + + + + +Culley, et al. Standards Track [Page 8] + +RFC 5044 MPA Framing for TCP October 2007 + + + Header Alignment - The property that a TCP segment begins with an + FPDU. 
The FPDU is Header Aligned when the FPDU header is exactly + at the start of the TCP segment (right behind the TCP headers on + the wire). + + Initiator - The endpoint of a connection that sends the MPA Request + Frame, i.e., the first to actually send data (which may not be + the one that sends the TCP SYN). + + Marker - A four-octet field that is placed in the MPA data stream at + fixed octet intervals (every 512 octets). + + MPA-aware TCP - A TCP implementation that is aware of the receiver + efficiencies of MPA FPDU Alignment and is capable of sending TCP + segments that begin with an FPDU. + + MPA-enabled - MPA is enabled if the MPA protocol is visible on the + wire. When the sender is MPA-enabled, it is inserting framing + and Markers. When the receiver is MPA-enabled, it is + interpreting framing and Markers. + + MPA Request Frame - Data sent from the MPA Initiator to the MPA + Responder during the Startup Phase. + + MPA Reply Frame - Data sent from the MPA Responder to the MPA + Initiator during the Startup Phase. + + MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This + document defines the MPA protocol. + + MULPDU - Maximum ULPDU. The current maximum size of the record that + is acceptable for DDP to pass to MPA for transmission. + + Node - A computing device attached to one or more links of a network. + A Node in this context does not refer to a specific application + or protocol instantiation running on the computer. A Node may + consist of one or more MPA on TCP devices installed in a host + computer. + + PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact + modulo 4 size. + + PDU - Protocol data unit + + Private Data - A block of data exchanged between MPA endpoints during + initial connection setup. + + + + + +Culley, et al. 
Standards Track [Page 9] + +RFC 5044 MPA Framing for TCP October 2007 + + + Protection Domain - An RDMA concept (see [VERBS-RDMA] and [RDMASEC]) + that ties use of various endpoint resources (memory access, etc.) + to the specific RDMA/DDP/MPA connection. + + RDDP - A suite of protocols including MPA, [DDP], [RDMAP], an overall + security document [RDMASEC], a problem statement [RFC4297], an + architecture document [RFC4296], and an applicability document + [APPL]. + + RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA + to enable applications to transfer data directly from memory + buffers. See [RDMAP]. + + Remote Peer - The MPA protocol implementation on the opposite end of + the connection. Used to refer to the remote entity when + describing protocol exchanges or other interactions between two + Nodes. + + Responder - The connection endpoint that responds to an incoming MPA + connection request (the MPA Request Frame). This may not be the + endpoint that awaited the TCP SYN. + + Startup Phase - The initial exchanges of an MPA connection that + serve to more fully identify MPA endpoints to each other and + pass connection-specific setup information to each other. + + ULP - Upper Layer Protocol. The protocol layer above the protocol + layer currently being referenced. The ULP for MPA is DDP [DDP]. + + ULPDU - Upper Layer Protocol Data Unit. The data record defined by + the layer above MPA (DDP). ULPDU corresponds to DDP's DDP + segment. + + ULPDU_Length - A field in the FPDU describing the length of the + included ULPDU. + + + + + + + + + + + + + + + + +Culley, et al. Standards Track [Page 10] + +RFC 5044 MPA Framing for TCP October 2007 + + +3. MPA's Interactions with DDP + + DDP requires MPA to maintain DDP record boundaries from the sender to + the receiver. When using MPA on TCP to send data, DDP provides + records (ULPDUs) to MPA. 
MPA will use the reliable transmission + abilities of TCP to transmit the data, and will insert appropriate + additional information into the TCP stream to allow the MPA receiver + to locate the record boundary information. + + As such, MPA accepts complete records (ULPDUs) from DDP at the sender + and returns them to DDP at the receiver. + + MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU + contained in one FPDU. + + MPA over a standard TCP stack can usually provide FPDU Alignment with + the TCP Header if the FPDU is equal to TCP's EMSS. An optimized + MPA/TCP stack can also maintain alignment as long as the FPDU is less + than or equal to TCP's EMSS. Since FPDU Alignment is generally + desired by the receiver, DDP cooperates with MPA to ensure FPDUs' + lengths do not exceed the EMSS under normal conditions. This is done + with the MULPDU mechanism. + + MPA MUST provide information to DDP on the current maximum size of + the record that is acceptable to send (MULPDU). DDP SHOULD limit + each record size to MULPDU. The range of MULPDU values MUST be + between 128 octets and 64768 octets, inclusive. + + The sending DDP MUST NOT post a ULPDU larger than 64768 octets to + MPA. DDP MAY post a ULPDU of any size between one and 64768 octets; + however, MPA is not REQUIRED to support a ULPDU Length that is + greater than the current MULPDU. + + While the maximum theoretical length supported by the MPA header + ULPDU_Length field is 65535, TCP over IP requires the IP datagram + maximum length to be 65535 octets. To enable MPA to support FPDU + Alignment, the maximum size of the FPDU must fit within an IP + datagram. Thus, the ULPDU limit of 64768 octets was derived by + taking the maximum IP datagram length, subtracting from it the + maximum total length of the sum of the IPv4 header, TCP header, IPv4 + options, TCP options, and the worst-case MPA overhead, and then + rounding the result down to a 128-octet boundary. 
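The derivation of the 64768-octet limit can be reproduced with a short calculation. The following Python sketch is illustrative only: the text above states the method and the result, not the individual figures. The breakdown assumes the maximum IPv4 and TCP header sizes of 60 octets each (20-octet base header plus 40 octets of options), and it assumes the worst-case MPA overhead is the 2-octet ULPDU_Length field, 3 pad octets, the 4-octet CRC, and one 4-octet Marker per 512 octets of TCP payload. The name max_ulpdu_limit is hypothetical.

```python
# Sketch of the ULPDU limit derivation. The per-item overhead figures are
# assumptions made for illustration; only the method and the final value
# come from the specification text.

MAX_IP_DATAGRAM = 65535   # maximum IPv4 datagram length, in octets
MAX_IPV4_HEADER = 60      # 20-octet base header + 40 octets of options
MAX_TCP_HEADER = 60       # 20-octet base header + 40 octets of options

MPA_HEADER = 2            # ULPDU_Length field (16 bits)
MPA_MAX_PAD = 3           # PAD is 0-3 octets
MPA_CRC = 4               # CRC field (32 bits)
MARKER_SIZE = 4           # each Marker is four octets
MARKER_INTERVAL = 512     # Markers recur every 512 octets


def max_ulpdu_limit() -> int:
    """Return the largest ULPDU size, rounded down to a 128-octet boundary."""
    tcp_payload = MAX_IP_DATAGRAM - MAX_IPV4_HEADER - MAX_TCP_HEADER
    # Assumed worst case: one Marker per 512 octets of payload, rounded up.
    max_markers = -(-tcp_payload // MARKER_INTERVAL)
    mpa_overhead = (MPA_HEADER + MPA_MAX_PAD + MPA_CRC
                    + max_markers * MARKER_SIZE)
    available = tcp_payload - mpa_overhead
    return (available // 128) * 128


print(max_ulpdu_limit())
```

Under these assumptions the TCP payload is 65415 octets, the MPA overhead is 521 octets, and rounding 64894 down to a multiple of 128 gives 64768, which matches the limit stated above.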
+ + Note that MULPDU will be significantly smaller than the theoretical + maximum in most implementations for most circumstances, due to link + MTUs, use of extra headers such as required for IPsec, etc. + + + + + +Culley, et al. Standards Track [Page 11] + +RFC 5044 MPA Framing for TCP October 2007 + + + On receive, MPA MUST pass each ULPDU with its length to DDP when it + has been validated. + + If an MPA implementation supports passing out-of-order ULPDUs to DDP, + the MPA implementation SHOULD: + + * Pass each ULPDU with its length to DDP as soon as it has been + fully received and validated. + + * Provide a mechanism to indicate the ordering of ULPDUs as the + sender transmitted them. One possible mechanism might be + providing the TCP sequence number for each ULPDU. + + * Provide a mechanism to indicate when a given ULPDU (and prior + ULPDUs) are complete (Delivered to DDP). One possible mechanism + might be to allow DDP to see the current outgoing TCP ACK + sequence number. + + * Provide an indication to DDP that the TCP has closed or has begun + to close the connection (e.g., received a FIN). + + MPA MUST provide the protocol version negotiated with its peer to + DDP. DDP will use this version to set the version in its header and + to report the version to [RDMAP]. + + + + + + + + + + + + + + + + + + + + + + + + + + + +Culley, et al. Standards Track [Page 12] + +RFC 5044 MPA Framing for TCP October 2007 + + +4. MPA Full Operation Phase + + The following sections describe the main semantics of the Full + Operation Phase of MPA. + +4.1. FPDU Format + + MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown + below MUST be used for all MPA FPDUs. For purposes of clarity, + Markers are not shown in Figure 2. 
+ + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | ULPDU_Length | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + | | + ~ ~ + ~ ULPDU ~ + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | PAD (0-3 octets) | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | CRC | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 2: FPDU Format + + ULPDU_Length: 16 bits (unsigned integer). This is the number of + octets of the contained ULPDU. It does not include the length of the + FPDU header itself, the pad, the CRC, or of any Markers that fall + within the ULPDU. The 16-bit ULPDU Length field is large enough to + support the largest IP datagrams for IPv4 or IPv6. + + PAD: The PAD field trails the ULPDU and contains between 0 and 3 + octets of data. The pad data MUST be set to zero by the sender and + ignored by the receiver (except for CRC checking). The length of the + pad is set so as to make the size of the FPDU an integral multiple of + four. + + CRC: 32 bits. When CRCs are enabled, this field contains a CRC32c + check value, which is used to verify the entire contents of the FPDU, + using CRC32c. See Section 4.4, CRC Calculation. When CRCs are not + enabled, this field is still present, may contain any value, and MUST + NOT be checked. + + + + + + +Culley, et al. Standards Track [Page 13] + +RFC 5044 MPA Framing for TCP October 2007 + + + The FPDU adds a minimum of 6 octets to the length of the ULPDU. In + addition, the total length of the FPDU will include the length of any + Markers and from 0 to 3 pad octets added to round-up the ULPDU size. + +4.2. 
Marker Format + + The format of a Marker MUST be as specified in Figure 3: + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | RESERVED | FPDUPTR | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 3: Marker Format + + RESERVED: The Reserved field MUST be set to zero on transmit and + ignored on receive (except for CRC calculation). + + FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long, + interpreted as an unsigned integer that indicates the number of + octets in the TCP stream from the beginning of the ULPDU Length field + to the first octet of the entire Marker. The least significant two + bits MUST always be set to zero at the transmitter, and the receivers + MUST always treat these as zero for calculations. + +4.3. MPA Markers + + MPA Markers are used to identify the start of FPDUs when packets are + received out of order. This is done by locating the Markers at fixed + intervals in the data stream (which is correlated to the TCP sequence + number) and using the Marker value to locate the preceding FPDU + start. + + All MPA Markers are included in the containing FPDU CRC calculation + (when both CRCs and Markers are in use). + + The MPA receiver's ability to locate out-of-order FPDUs and pass the + ULPDUs to DDP is implementation dependent. MPA/DDP allows those + receivers that are able to deal with out-of-order FPDUs in this way + to require the insertion of Markers in the data stream. When the + receiver cannot deal with out-of-order FPDUs in this way, it may + disable the insertion of Markers at the sender. All MPA senders MUST + be able to generate Markers when their use is declared by the + opposing receiver (see Section 7.1, Connection Setup). + + + + + + +Culley, et al. 
Standards Track [Page 14] + +RFC 5044 MPA Framing for TCP October 2007 + + + When Markers are enabled, MPA senders MUST insert a Marker into the + data stream at a 512-octet periodic interval in the TCP Sequence + Number Space. The Marker contains a 16-bit unsigned integer referred + to as the FPDUPTR (FPDU Pointer). + + If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16-bit + relative back-pointer. FPDUPTR MUST contain the number of octets in + the TCP stream from the beginning of the ULPDU Length field to the + first octet of the Marker, unless the Marker falls between FPDUs. + Thus, the location of the first octet of the previous FPDU header can + be determined by subtracting the value of the given Marker from the + current octet-stream sequence number (i.e., TCP sequence number) of + the first octet of the Marker. Note that this computation MUST take + into account that the TCP sequence number could have wrapped between + the Marker and the header. + + An FPDUPTR value of 0x0000 is a special case -- it is used when the + Marker falls exactly between FPDUs (between the preceding FPDU CRC + field and the next FPDU's ULPDU Length field). In this case, the + Marker is considered to be contained in the following FPDU; the + Marker MUST be included in the CRC calculation of the FPDU following + the Marker (if CRCs are being generated or checked). Thus, an + FPDUPTR value of 0x0000 means that immediately following the Marker + is an FPDU header (the ULPDU Length field). + + Since all FPDUs are integral multiples of 4 octets, the bottom two + bits of the FPDUPTR as calculated by the sender are zero. MPA + reserves these bits so they MUST be treated as zero for computation + at the receiver. + + When Markers are enabled (see Section 7.1, Connection Setup), the MPA + Markers MUST be inserted immediately preceding the first FPDU of Full + Operation Phase, and at every 512th octet of the TCP octet stream + thereafter. 
+   As a result, the first Marker has an FPDUPTR value of
+   0x0000. If the first Marker begins at octet sequence number
+   SeqStart, then Markers are inserted such that the first octet of the
+   Marker is at octet sequence number SeqNum if the remainder of (SeqNum
+   - SeqStart) mod 512 is zero. Note that SeqNum can wrap.
+
+   For example, if the TCP sequence number were used to calculate the
+   insertion point of the Marker, the starting TCP sequence number is
+   unlikely to be zero, and 512-octet multiples are unlikely to fall on
+   a modulo 512 of zero. If the MPA connection is started at TCP
+   sequence number 11, then the 1st Marker will begin at 11, and
+   subsequent Markers will begin at 523, 1035, etc.
+
+   If an FPDU is large enough to contain multiple Markers, they MUST all
+   point to the same point in the TCP stream: the first octet of the
+   ULPDU Length field for the FPDU.
+
+   If a Marker interval contains multiple FPDUs (the FPDUs are small),
+   the Marker MUST point to the start of the ULPDU Length field for the
+   FPDU containing the Marker unless the Marker falls between FPDUs, in
+   which case the Marker MUST be zero.
+
+   The following example shows an FPDU containing a Marker.
+
+    0                   1                   2                   3
+    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |       ULPDU Length (0x0010)   |                               |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
+   |                                                               |
+   +                                                               +
+   |                         ULPDU (octets 0-9)                    |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |            (0x0000)           |        FPDU ptr (0x000C)      |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                        ULPDU (octets 10-15)                   |
+   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                               |          PAD (2 octets:0,0)   |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                              CRC                              |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                Figure 4: Example FPDU Format with Marker
+
+   MPA Receivers MUST preserve ULPDU boundaries when passing data to
+   DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
+   DDP and not the Markers, headers, and CRC.
+
+4.4. CRC Calculation
+
+   An MPA implementation MUST implement CRC support and MUST either:
+
+   (1) always use CRCs; the MPA provider is not REQUIRED to support an
+       administrator's request that CRCs not be used.
+
+   or
+
+   (2a) only indicate a preference not to use CRCs on the explicit
+        request of the system administrator, via an interface not
+        defined in this spec. The default configuration for a
+        connection MUST be to use CRCs.
+
+   (2b) disable CRC checking (and possibly generation) if both the local
+        and remote endpoints indicate preference not to use CRCs.
+
+   An administrative decision to have a host request CRC suppression
+   SHOULD NOT be made unless there is assurance that the TCP connection
+   involved provides protection from undetected errors that is at least
+   as strong as an end-to-end CRC32c.
+   End-to-end usage of an IPsec
+   cryptographic integrity check is among the ways to provide such
+   protection, and the use of channel bindings [NFSv4CHANNEL] by the ULP
+   can provide a high level of assurance that the IPsec protection scope
+   is end-to-end with respect to the ULP.
+
+   The process MUST be invisible to the ULP.
+
+   After receipt of an MPA startup declaration indicating that its peer
+   requires CRCs, an MPA instance MUST continue generating and checking
+   CRCs until the connection terminates. If an MPA instance has
+   declared that it does not require CRCs, it MUST turn off CRC checking
+   immediately after receipt of an MPA mode declaration indicating that
+   its peer also does not require CRCs. It MAY continue generating
+   CRCs. See Section 7.1, Connection Setup, for details on the MPA
+   startup.
+
+   When sending an FPDU, the sender MUST include a CRC field. When CRCs
+   are enabled, the CRC field in the MPA FPDU MUST be computed using the
+   CRC32c polynomial in the manner described in the iSCSI Protocol
+   [iSCSI] document for Header and Data Digests.
+
+   The fields which MUST be included in the CRC calculation when sending
+   an FPDU are as follows:
+
+   1)  If a Marker does not immediately precede the ULPDU Length field,
+       the CRC-32c is calculated from the first octet of the ULPDU
+       Length field, through all the ULPDU and Markers (if present), to
+       the last octet of the PAD (if present), inclusive. If there is a
+       Marker immediately following the PAD, the Marker is included in
+       the CRC calculation for this FPDU.
+
+   2)  If a Marker immediately precedes the first octet of the ULPDU
+       Length field of the FPDU, (i.e., the Marker fell between FPDUs,
+       and thus is required to be included in the second FPDU), the
+       CRC-32c is calculated from the first octet of the Marker, through
+       the ULPDU Length header, through all the ULPDU and Markers (if
+       present), to the last octet of the PAD (if present), inclusive.
+
+   3)  After calculating the CRC-32c, the resultant value is placed into
+       the CRC field at the end of the FPDU.
+
+   When an FPDU is received, and CRC checking is enabled, the receiver
+   MUST first perform the following:
+
+   1)  Calculate the CRC of the incoming FPDU in the same fashion as
+       defined above.
+
+   2)  Verify that the calculated CRC-32c value is the same as the
+       received CRC-32c value found in the FPDU CRC field. If not, the
+       receiver MUST treat the FPDU as an invalid FPDU.
+
+   The procedure for handling invalid FPDUs is covered in Section 8,
+   Error Semantics.
+
+   The following is an annotated hex dump of an example FPDU sent as the
+   first FPDU on the stream. As such, it starts with a Marker. The
+   FPDU contains a 42 octet ULPDU (an example DDP segment) which in turn
+   contains 24 octets of the contained ULPDU, which is a data load that
+   is all zeros. The CRC32c has been correctly calculated and can be
+   used as a reference. See the [DDP] and [RDMAP] specifications for
+   definitions of the DDP Control field, Queue, MSN, MO, and Send Data.
+
+      Octet Contents  Annotation
+      Count
+
+      0000    00      Marker: Reserved
+      0001    00
+      0002    00      Marker: FPDUPTR
+      0003    00
+      0004    00      ULPDU Length
+      0005    2a
+      0006    41      DDP Control Field, Send with Last flag set
+      0007    43
+      0008    00      Reserved (DDP STag position with no STag)
+      0009    00
+      000a    00
+      000b    00
+      000c    00      DDP Queue = 0
+      000d    00
+      000e    00
+      000f    00
+      0010    00      DDP MSN = 1
+      0011    00
+      0012    00
+      0013    01
+      0014    00      DDP MO = 0
+      0015    00
+      0016    00
+      0017    00
+      0018    00      DDP Send Data (24 octets of zeros)
+      ...
+      002f    00
+      0030    52      CRC32c
+      0031    23
+      0032    99
+      0033    83
+
+                 Figure 5: Annotated Hex Dump of an FPDU
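
A sender-side sketch of the framing rules in Sections 4.3 and 4.4 follows. It is an illustration only, not part of the specification: the names build_fpdu, crc32c, and the stream_offset parameter (octets sent since the first Marker position) are hypothetical, and writing the CRC32c remainder in little-endian octet order is an assumption taken from the iSCSI digest convention that Section 4.4 refers to.

```python
import struct

MARKER_INTERVAL = 512  # octets between Marker starts (Section 4.3)


def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC32c (reflected Castagnoli polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes, stream_offset: int, markers: bool = True) -> bytes:
    """Frame one ULPDU as an FPDU.

    stream_offset (hypothetical parameter) is the number of octets sent since
    the first Marker position, so a Marker starts wherever offset % 512 == 0.
    It must be a multiple of 4, which holds when every earlier FPDU was
    produced by this function.
    """
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += bytes(-len(body) % 4)  # PAD up to a 4-octet multiple

    out = bytearray()
    pos = stream_offset
    fpdu_start = None  # offset of the ULPDU Length field, once known

    def maybe_marker():
        nonlocal pos
        if markers and pos % MARKER_INTERVAL == 0:
            # 0x0000 when the Marker sits between FPDUs, else a back-pointer.
            fpduptr = 0 if fpdu_start is None else pos - fpdu_start
            out.extend(struct.pack("!HH", 0, fpduptr))
            pos += 4

    for i in range(0, len(body), 4):
        maybe_marker()
        if fpdu_start is None:
            fpdu_start = pos
        out.extend(body[i:i + 4])
        pos += 4
    maybe_marker()  # a Marker right after the PAD still belongs to this FPDU

    # The CRC covers everything emitted so far, Markers included.
    # Assumption: little-endian octet order, as in the iSCSI digest examples.
    out.extend(struct.pack("<I", crc32c(bytes(out))))
    return bytes(out)
```

Called with the 42-octet ULPDU of Figure 5 and a stream_offset of 0, this produces a 52-octet frame that begins with the 0x00000000 Marker; its last four octets can be compared against the CRC32c octets listed in the figure.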
+
+   The following is an example sent as the second FPDU of the stream
+   where the first FPDU (which is not shown here) had a length of 492
+   octets and was also a Send to Queue 0 with Last Flag set. This
+   example contains a Marker.
+
+      Octet Contents  Annotation
+      Count
+
+      01ec    00      Length
+      01ed    2a
+      01ee    41      DDP Control Field: Send with Last Flag set
+      01ef    43
+      01f0    00      Reserved (DDP STag position with no STag)
+      01f1    00
+      01f2    00
+      01f3    00
+      01f4    00      DDP Queue = 0
+      01f5    00
+      01f6    00
+      01f7    00
+      01f8    00      DDP MSN = 2
+      01f9    00
+      01fa    00
+      01fb    02
+      01fc    00      DDP MO = 0
+      01fd    00
+      01fe    00
+      01ff    00
+      0200    00      Marker: Reserved
+      0201    00
+      0202    00      Marker: FPDUPTR
+      0203    14
+      0204    00      DDP Send Data (24 octets of zeros)
+      ...
+      021b    00
+      021c    84      CRC32c
+      021d    92
+      021e    58
+      021f    98
+
+           Figure 6: Annotated Hex Dump of an FPDU with Marker
+
+4.5. FPDU Size Considerations
+
+   MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
+   the size of the largest ULPDU fitting in an FPDU. For an empty TCP
+   Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
+   space for Markers and pad octets.
+
+      The maximum ULPDU Length for a single ULPDU when Markers are
+      present MUST be computed as:
+
+      MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)
+
+   The formula above accounts for the worst-case number of Markers.
+
+      The maximum ULPDU Length for a single ULPDU when Markers are NOT
+      present MUST be computed as:
+
+      MULPDU = EMSS - (6 + EMSS mod 4)
+
+   As a further optimization of the wire efficiency an MPA
+   implementation MAY dynamically adjust the MULPDU (see Section 5 for
+   latency and wire efficiency trade-offs). When one or more FPDUs are
+   already packed into a TCP Segment, MULPDU MAY be reduced accordingly.
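
The two MULPDU formulas above can be written as one small function. This is a sketch for illustration; the function name mulpdu and the constant names are hypothetical and do not come from the specification.

```python
FPDU_OVERHEAD = 6       # 2-octet ULPDU Length field + 4-octet CRC
MARKER_INTERVAL = 512   # octets between Markers
MARKER_SIZE = 4         # octets per Marker


def mulpdu(emss: int, markers: bool) -> int:
    """Largest ULPDU that fits in an empty TCP segment of size emss."""
    overhead = FPDU_OVERHEAD + emss % 4
    if markers:
        # Ceiling(EMSS / 512) Markers in the worst case, using integer math.
        overhead += MARKER_SIZE * -(-emss // MARKER_INTERVAL)
    return emss - overhead


if __name__ == "__main__":
    for emss in (1460, 1448, 536):
        print(emss, mulpdu(emss, markers=True), mulpdu(emss, markers=False))
```

For an EMSS of 1460 octets, this gives a MULPDU of 1442 with Markers (three Markers in the worst case) and 1454 without.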
+
+   DDP SHOULD provide ULPDUs that are as large as possible, but less
+   than or equal to MULPDU.
+
+   If the TCP implementation needs to adjust EMSS to support MTU changes
+   or changing TCP options, the MULPDU value is changed accordingly.
+
+   In certain rare situations, the EMSS may shrink below 128 octets in
+   size. If this occurs, the MPA on TCP sender MUST NOT shrink the
+   MULPDU below 128 octets and is not required to follow the
+   segmentation rules in Section 5.1 and Appendix A.
+
+   If one or more FPDUs are already packed into a TCP segment, such that
+   the remaining room is less than 128 octets, MPA MUST NOT provide a
+   MULPDU smaller than 128. In this case, MPA would typically provide a
+   MULPDU for the next full sized segment, but may still pack the next
+   FPDU into the small remaining room, provided that the next FPDU is
+   small enough to fit.
+
+   The value 128 is chosen so as to allow DDP designers room for the DDP
+   Header and some user data.
+
+5. MPA's interactions with TCP
+
+   The following sections describe MPA's interactions with TCP. This
+   section discusses using a standard layered TCP stack with MPA
+   attached above a TCP socket. Discussion of using an optimized
+   MPA-aware TCP with an MPA implementation that takes advantage of the
+   extra optimizations is done in Appendix A.
+
+       +-----------------------------------+
+       | +-----+       +-----------------+ |
+       | | MPA |       | Other Protocols | |
+       | +-----+       +-----------------+ |
+       |    ||                  ||         |
+       |  ----- socket API --------------  |
+       |                  ||               |
+       |               +-----+             |
+       |               | TCP |             |
+       |               +-----+             |
+       |                  ||               |
+       |               +-----+             |
+       |               | IP  |             |
+       |               +-----+             |
+       +-----------------------------------+
+
+              Figure 7: Fully Layered Implementation
+
+   The Fully layered implementation is described for completeness;
+   however, the user is cautioned that the reduced probability of FPDU
+   alignment when transmitting with this implementation will tend to
+   introduce a higher overhead at optimized receivers. In addition, the
+   lack of out-of-order receive processing will significantly reduce the
+   value of DDP/MPA by imposing higher buffering and copying overhead in
+   the local receiver.
+
+5.1. MPA transmitters with a standard layered TCP
+
+   MPA transmitters SHOULD calculate a MULPDU as described in Section
+   4.5. If the TCP implementation allows EMSS to be determined by MPA,
+   that value should be used. If the transmit side TCP implementation
+   is not able to report the EMSS, MPA SHOULD use the current MTU value
+   to establish a likely FPDU size, taking into account the various
+   expected header sizes.
+
+   MPA transmitters SHOULD also use whatever facilities the TCP stack
+   presents to cause the TCP transmitter to start TCP segments at FPDU
+   boundaries. Multiple FPDUs MAY be packed into a single TCP segment
+   as determined by the EMSS calculation as long as they are entirely
+   contained in the TCP segment.
+
+   For example, passing FPDU buffers sized to the current EMSS to the
+   TCP socket and using the TCP_NODELAY socket option to disable the
+   Nagle [RFC896] algorithm will usually result in many of the segments
+   starting with an FPDU.
+
+   It is recognized that various effects can cause an FPDU Alignment to
+   be lost.
+   Following are a few of the effects:
+
+   *  ULPDUs that are smaller than the MULPDU. If these are sent in a
+      continuous stream, FPDU Alignment will be lost. Note that
+      careful use of a dynamic MULPDU can help in this case; the MULPDU
+      for future FPDUs can be adjusted to re-establish alignment with
+      the segments based on the current EMSS.
+
+   *  Sending enough data that the TCP receive window limit is reached.
+      TCP may send a smaller segment to exactly fill the receive
+      window.
+
+   *  Sending data when TCP is operating up against the congestion
+      window. If TCP is not tracking the congestion window in
+      segments, it may transmit a smaller segment to exactly fill the
+      congestion window.
+
+   *  Changes in EMSS due to varying TCP options, or changes in MTU.
+
+   If FPDU Alignment with TCP segments is lost for any reason, the
+   alignment is regained after a break in transmission where the TCP
+   send buffers are emptied. Many usage models for DDP/MPA will include
+   such breaks.
+
+   MPA receivers are REQUIRED to be able to operate correctly even if
+   alignment is lost (see Section 6).
+
+5.2. MPA receivers with a standard layered TCP
+
+   MPA receivers will get TCP data in the usual ordered stream. The
+   receivers MUST identify FPDU boundaries by using the ULPDU_LENGTH
+   field, as described in Section 6. Receivers MAY utilize markers to
+   check for FPDU boundary consistency, but they are NOT required to
+   examine the markers to determine the FPDU boundaries.
+
+6. MPA Receiver FPDU Identification
+
+   An MPA receiver MUST first verify the FPDU before passing the ULPDU
+   to DDP. To do this, the receiver MUST:
+
+   *  locate the start of the FPDU unambiguously,
+
+   *  verify its CRC (if CRC checking is enabled).
+
+   If the above conditions are true, the MPA receiver passes the ULPDU
+   to DDP.
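
The two receiver steps above (locate the FPDU start, then verify it) can be sketched as follows. This is an illustration under stated assumptions, not a definitive receiver: all function names are hypothetical, split_in_order handles only a stream with Markers disabled, and the crc_ok callback stands in for the CRC check of Section 4.4 (its default accepts every FPDU, which corresponds to CRC checking being disabled).

```python
import struct

SEQ_MODULUS = 1 << 32  # TCP sequence numbers wrap at 2**32


def fpdu_start_from_marker(marker_seq: int, marker: bytes) -> int:
    """TCP sequence number of the ULPDU Length field a Marker points to.

    marker_seq is the sequence number of the Marker's first octet.
    """
    _reserved, fpduptr = struct.unpack("!HH", marker)
    fpduptr &= 0xFFFC  # the two low bits are treated as zero
    if fpduptr == 0:
        # Marker fell between FPDUs: the header follows it immediately.
        return (marker_seq + 4) % SEQ_MODULUS
    return (marker_seq - fpduptr) % SEQ_MODULUS  # modulo handles wrap


def fpdu_octets(ulpdu_length: int) -> int:
    """Octets in an FPDU, not counting any Markers inside it."""
    return ((2 + ulpdu_length + 3) & ~3) + 4  # header + ULPDU + PAD, then CRC


def split_in_order(stream: bytes, crc_ok=lambda fpdu: True):
    """Return (ulpdus, consumed) for a Marker-free, in-order octet stream."""
    ulpdus, pos = [], 0
    while pos + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, pos)
        end = pos + fpdu_octets(length)
        if end > len(stream):
            break  # the entire FPDU is not present yet
        fpdu = stream[pos:end]
        if not crc_ok(fpdu):
            raise ValueError("invalid FPDU")
        ulpdus.append(fpdu[2:2 + length])
        pos = end
    return ulpdus, pos
```

With the Marker of Figure 6 (first octet at 0x0200, FPDUPTR 0x0014), fpdu_start_from_marker returns 0x01ec, the position of that FPDU's Length field.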
+
+   To detect the start of the FPDU unambiguously one of the following
+   MUST be used:
+
+   1:  In an ordered TCP stream, the ULPDU Length field in the current
+       FPDU, when that FPDU has a valid CRC, can be used to identify the
+       beginning of the next FPDU.
+
+   2:  For optimized MPA/TCP receivers that support out-of-order
+       reception of FPDUs (see Section 4.3, MPA Markers) a Marker can
+       always be used to locate the beginning of an FPDU (in FPDUs with
+       valid CRCs). Since the location of the Marker is known in the
+       octet stream (sequence number space), the Marker can always be
+       found.
+
+   3:  Having found an FPDU by means of a Marker, an optimized MPA/TCP
+       receiver can find following contiguous FPDUs by using the ULPDU
+       Length fields (from FPDUs with valid CRCs) to establish the next
+       FPDU boundary.
+
+   The ULPDU Length field (see Section 4) MUST be used to determine if
+   the entire FPDU is present before forwarding the ULPDU to DDP.
+
+   CRC calculation is discussed in Section 4.4 above.
+
+7. Connection Semantics
+
+7.1. Connection Setup
+
+   MPA requires that the Consumer MUST activate MPA, and any TCP
+   enhancements for MPA, on a TCP half connection at the same location
+   in the octet stream at both the sender and the receiver. This is
+   required in order for the Marker scheme to correctly locate the
+   Markers (if enabled) and to correctly locate the first FPDU.
+
+   MPA, and any TCP enhancements for MPA are enabled by the ULP in both
+   directions at once at an endpoint.
+
+   This can be accomplished several ways, and is left up to DDP's ULP:
+
+   *  DDP's ULP MAY require DDP on MPA startup immediately after TCP
+      connection setup. This has the advantage that no streaming mode
+      negotiation is needed. An example of such a protocol is shown in
+      Figure 10: Example Immediate Startup negotiation.
+
+      This may be accomplished by using a well-known port, or a service
+      locator protocol to locate an appropriate port on which DDP on
+      MPA is expected to operate.
+
+   *  DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
+      normal TCP startup, using TCP streaming data exchanges on the
+      same connection. The exchange establishes that DDP on MPA (as
+      well as other ULPs) will be used, and exactly locates the point
+      in the octet stream where MPA is to begin operation. Note that
+      such a negotiation protocol is outside the scope of this
+      specification. A simplified example of such a protocol is shown
+      in Figure 9: Example Delayed Startup Negotiation.
+
+   An MPA endpoint operates in two distinct phases.
+
+   The Startup Phase is used to verify correct MPA setup, exchange CRC
+   and Marker configuration, and optionally pass Private Data between
+   endpoints prior to completing a DDP connection. During this phase,
+   specifically formatted frames are exchanged as TCP byte streams
+   without using CRCs or Markers. During this phase a DDP endpoint need
+   not be "bound" to the MPA connection. In fact, the choice of DDP
+   endpoint and its operating parameters may not be known until the
+   Consumer supplied Private Data (if any) has been examined by the
+   Consumer.
+
+   The second distinct phase is Full Operation during which FPDUs are
+   sent using all the rules that pertain (CRCs, Markers, MULPDU
+   restrictions, etc.). A DDP endpoint MUST be "bound" to the MPA
+   connection at entry to this phase.
+
+   When Private Data is passed between ULPs in the Startup Phase, the
+   ULP is responsible for interpreting that data, and then placing MPA
+   into Full Operation.
+
+   Note: The following text differentiates the two endpoints by calling
+         them Initiator and Responder. This is quite arbitrary and is NOT
+         related to the TCP startup (SYN, SYN/ACK sequence). The
+         Initiator is the side that sends first in the MPA startup
+         sequence (the MPA Request Frame).
+
+   Note: The possibility that both endpoints would be allowed to make a
+         connection at the same time, sometimes called an active/active
+         connection, was considered by the work group and rejected. There
+         were several motivations for this decision. One was that
+         applications needing this facility were few (none other than
+         theoretical at the time of this document). Another was that the
+         facility created some implementation difficulties, particularly
+         with the "dual stack" designs described later on. A last issue
+         was that dealing with rejected connections at startup would have
+         required at least an additional frame type, and more recovery
+         actions, complicating the protocol. While none of these issues
+         was overwhelming, the group and implementers were not motivated
+         to do the work to resolve these issues. The protocol includes a
+         method of detecting these active/active startup attempts so that
+         they can be rejected and an error reported.
+
+   The ULP is responsible for determining which side is Initiator or
+   Responder. For client/server type ULPs, this is easy. For peer-peer
+   ULPs (which might utilize a TCP style active/active startup), some
+   mechanism (not defined by this specification) must be established, or
+   some streaming mode data exchanged prior to MPA startup to determine
+   which side starts in Initiator and which starts in Responder MPA
+   mode.
+
+7.1.1. MPA Request and Reply Frame Format
+
+       0                   1                   2                   3
+       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   0  |                                                               |
+      +         Key (16 bytes containing "MPA ID Req Frame")          +
+   4  |       (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)       |
+      +         Or  (16 bytes containing "MPA ID Rep Frame")          +
+   8  |       (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)       |
+      +                                                               +
+   12 |                                                               |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   16 |M|C|R| Res     |     Rev       |          PD_Length            |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                                                               |
+      ~                                                               ~
+      ~                         Private Data                          ~
+      |                                                               |
+      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                               |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                   Figure 8: MPA Request/Reply Frame
+
+   Key: This field contains the "key" used to validate that the sender
+      is an MPA sender. Initiator mode senders MUST set this field to
+      the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
+      49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
+      mode receivers MUST check this field for the same value, and
+      close the connection and report an error locally if any other
+      value is detected. Responder mode senders MUST set this field to
+      the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
+      49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
+      mode receivers MUST check this field for the same value, and
+      close the connection and report an error locally if any other
+      value is detected.
+
+   M: This bit declares an endpoint's REQUIRED Marker usage. When this
+      bit is '1' in an MPA Request Frame, the Initiator declares that
+      Markers are REQUIRED in FPDUs sent from the Responder. When set
+      to '1' in an MPA Reply Frame, this bit declares that Markers are
+      REQUIRED in FPDUs sent from the Initiator.
+      When in a received
+      MPA Request Frame or MPA Reply Frame and the value is '0',
+      Markers MUST NOT be added to the data stream by that endpoint.
+      When '1' Markers MUST be added as described in Section 4.3, MPA
+      Markers.
+
+   C: This bit declares an endpoint's preferred CRC usage. When this
+      field is '0' in the MPA Request Frame and the MPA Reply Frame,
+      CRCs MUST NOT be checked and need not be generated by either
+      endpoint. When this bit is '1' in either the MPA Request Frame
+      or MPA Reply Frame, CRCs MUST be generated and checked by both
+      endpoints. Note that even when not in use, the CRC field remains
+      present in the FPDU. When CRCs are not in use, the CRC field
+      MUST be considered valid for FPDU checking regardless of its
+      contents.
+
+   R: This bit is set to zero, and not checked on reception in the MPA
+      Request Frame. In the MPA Reply Frame, this bit is the Rejected
+      Connection bit, set by the Responder's ULP to indicate acceptance
+      '0', or rejection '1', of the connection parameters provided in
+      the Private Data.
+
+   Res: This field is reserved for future use. It MUST be set to zero
+      when sending, and not checked on reception.
+
+   Rev: This field contains the revision of MPA. For this version of
+      the specification, senders MUST set this field to one. MPA
+      receivers compliant with this version of the specification MUST
+      check this field. If the MPA receiver cannot interoperate with
+      the received version, then it MUST close the connection and
+      report an error locally. Otherwise, the MPA receiver should
+      report the received version to the ULP.
+
+   PD_Length: This field MUST contain the length in octets of the
+      Private Data field. A value of zero indicates that there is no
+      Private Data field present at all.
+      If the receiver detects that
+      the PD_Length field does not match the length of the Private Data
+      field, or if the length of the Private Data field exceeds 512
+      octets, the receiver MUST close the connection and report an
+      error locally. Otherwise, the MPA receiver should pass the
+      PD_Length value and Private Data to the ULP.
+
+   Private Data: This field may contain any value defined by ULPs or may
+      not be present. The Private Data field MUST be between 0 and 512
+      octets in length. ULPs define how to size, set, and validate
+      this field within these limits. Private Data usage is further
+      discussed in Section 7.1.4.
+
+7.1.2. Connection Startup Rules
+
+   The following rules apply to MPA connection Startup Phase:
+
+   1.  When MPA is started in the Initiator mode, the MPA implementation
+       MUST send a valid MPA Request Frame. The MPA Request Frame MAY
+       include ULP-supplied Private Data.
+
+   2.  When MPA is started in the Responder mode, the MPA implementation
+       MUST wait until an MPA Request Frame is received and validated
+       before entering Full MPA/DDP Operation.
+
+       If the MPA Request Frame is improperly formatted, the
+       implementation MUST close the TCP connection and exit MPA.
+
+       If the MPA Request Frame is properly formatted but the Private
+       Data is not acceptable, the implementation SHOULD return an MPA
+       Reply Frame with the Rejected Connection bit set to '1'; the MPA
+       Reply Frame MAY include ULP-supplied Private Data; the
+       implementation MUST exit MPA, leaving the TCP connection open.
+       The ULP may close TCP or use the connection for other purposes.
+
+       If the MPA Request Frame is properly formatted and the Private
+       Data is acceptable, the implementation SHOULD return an MPA Reply
+       Frame with the Rejected Connection bit set to '0'; the MPA Reply
+       Frame MAY include ULP-supplied Private Data; and the Responder
+       SHOULD prepare to interpret any data received as FPDUs and pass
+       any received ULPDUs to DDP.
+
+       Note: Since the receiver's ability to deal with Markers is
+             unknown until the Request and Reply Frames have been
+             received, sending FPDUs before this occurs is not possible.
+
+       Note: The requirement to wait on a Request Frame before sending a
+             Reply Frame is a design choice. It makes for a well-ordered
+             sequence of events at each end, and avoids having to specify
+             how to deal with situations where both ends start at the same
+             time.
+
+   3.  MPA Initiator mode implementations MUST receive and validate an
+       MPA Reply Frame.
+
+       If the MPA Reply Frame is improperly formatted, the
+       implementation MUST close the TCP connection and exit MPA.
+
+       If the MPA Reply Frame is properly formatted but the Private
+       Data is not acceptable, or if the Rejected Connection bit is set
+       to '1', the implementation MUST exit MPA, leaving the TCP
+       connection open. The ULP may close TCP or use the connection for
+       other purposes.
+
+       If the MPA Reply Frame is properly formatted and the Private Data
+       is acceptable, and the Rejected Connection bit is set to '0', the
+       implementation SHOULD enter Full MPA/DDP Operation Phase;
+       interpreting any received data as FPDUs and sending DDP ULPDUs as
+       FPDUs.
+
+   4.  MPA Responder mode implementations MUST receive and validate at
+       least one FPDU before sending any FPDUs or Markers.
+
+       Note: This requirement is present to allow the Initiator time to
+             get its receiver into Full Operation before an FPDU arrives,
+             avoiding potential race conditions at the Initiator. This
+             was also subject to some debate in the work group before
+             rough consensus was reached. Eliminating this requirement
+             would allow faster startup in some types of applications.
+             However, that would also make certain implementations
+             (particularly "dual stack") much harder.
+
+   5.  If a received "Key" does not match the expected value (see
+       Section 7.1.1, MPA Request and Reply Frame Format) the TCP/DDP
+       connection MUST be closed, and an error returned to the ULP.
+
+   6.  The received Private Data fields may be used by Consumers at
+       either end to further validate the connection and set up DDP or
+       other ULP parameters. The Initiator ULP MAY close the
+       TCP/MPA/DDP connection as a result of validating the Private Data
+       fields. The Responder SHOULD return an MPA Reply Frame with the
+       "Rejected Connection" bit set to '1' if the validation of the
+       Private Data is not acceptable to the ULP.
+
+   7.  When the first FPDU is to be sent, then if Markers are enabled,
+       the first octets sent are the special Marker 0x00000000, followed
+       by the start of the FPDU (the FPDU's ULPDU Length field). If
+       Markers are not enabled, the first octets sent are the start of
+       the FPDU (the FPDU's ULPDU Length field).
+
+   8.  MPA implementations MUST use the difference between the MPA
+       Request Frame and the MPA Reply Frame to check for incorrect
+       "Initiator/Initiator" startups. Implementations SHOULD put a
+       timeout on waiting for the MPA Request Frame when started in
+       Responder mode, to detect incorrect "Responder/Responder"
+       startups.
+
+   9.  MPA implementations MUST validate the PD_Length field. The
+       buffer that receives the Private Data field MUST be large enough
+       to receive that data; the amount of Private Data MUST NOT exceed
+       the PD_Length or the application buffer. If any of the above
+       fails, the startup frame MUST be considered improperly formatted.
+
+   10. MPA implementations SHOULD implement a reasonable timeout while
+       waiting for the entire set of startup frames; this prevents
+       certain denial-of-service attacks.
+       ULPs SHOULD implement a
+       reasonable timeout while waiting for FPDUs, ULPDUs, and
+       application level messages to guard against application failures
+       and certain denial-of-service attacks.
+
+7.1.3. Example Delayed Startup Sequence
+
+   A variety of startup sequences are possible when using MPA on TCP.
+   Following is an example of an MPA/DDP startup that occurs after TCP
+   has been running for a while and has exchanged some amount of
+   streaming data. This example does not use any Private Data (an
+   example that does is shown later in Section 7.1.4.2, Example
+   Immediate Startup Using Private Data), although it is perfectly legal
+   to include the Private Data. Note that since the example does not
+   use any Private Data, there are no ULP interactions shown between
+   receiving "startup frames" and putting MPA into Full Operation.
+
+         Initiator                                 Responder
+
+   +---------------------------+
+   |ULP streaming mode         |
+   |  <Hello> request to       |
+   |  transition to DDP/MPA    |           +---------------------------+
+   |  mode (optional).         | --------> |ULP gets request;          |
+   +---------------------------+           |  enables MPA Responder    |
+                                           |  mode with last (optional)|
+                                           |  streaming mode           |
+                                           |  <Hello Ack> for MPA to   |
+                                           |  send.                    |
+   +---------------------------+           |MPA waits for incoming     |
+   |ULP receives streaming     | <-------- |  <MPA Request Frame>.     |
+   |  <Hello Ack>;             |           +---------------------------+
+   |Enters MPA Initiator mode; |
+   |MPA sends                  |
+   |  <MPA Request Frame>;     |
+   |MPA waits for incoming     |           +---------------------------+
+   |  <MPA Reply Frame>.       | - - - - > |MPA receives               |
+   +---------------------------+           |  <MPA Request Frame>.     |
+                                           |Consumer binds DDP to MPA; |
+                                           |MPA sends the              |
+                                           |  <MPA Reply Frame>.       |
+                                           |DDP/MPA enables FPDU       |
+   +---------------------------+           |  decoding, but does not   |
+   |MPA receives the           | < - - - - |  send any FPDUs.          |
+   |  <MPA Reply Frame>        |           +---------------------------+
+   |Consumer binds DDP to MPA; |
+   |DDP/MPA begins Full        |
+   |  Operation.               |
+   |MPA sends first FPDU (as   |           +---------------------------+
+   |  DDP ULPDUs become        | ========> |MPA receives first FPDU.   |
+   |  available).              |           |MPA sends first FPDU (as   |
+   +---------------------------+           |  DDP ULPDUs become        |
+                                   <====== |  available).              |
+                                           +---------------------------+
+
+             Figure 9: Example Delayed Startup Negotiation
+
+   An example Delayed Startup sequence is described below:
+
+   *  Active and passive sides start up a TCP connection in the
+      usual fashion, probably using sockets APIs. They exchange
+      some amount of streaming mode data. At some point, one side
+      (the MPA Initiator) sends streaming mode data that
+      effectively says "Hello, let's go into MPA/DDP mode".
+
+   *  When the remote side (the MPA Responder) gets this streaming mode
+      message, the Consumer would send a last streaming mode message
+      that effectively says "I acknowledge your Hello, and am now in
+      MPA Responder mode". The exchange of these messages establishes
+      the exact point in the TCP stream where MPA is enabled. The
+      Responding Consumer enables MPA in the Responder mode and waits
+      for the initial MPA startup message.
+
+   *  The Initiating Consumer would enable MPA startup in the
+      Initiator mode which then sends the MPA Request Frame. It is
+      assumed that no Private Data messages are needed for this
+      example, although it is possible to do so. The Initiating
+      MPA (and Consumer) would also wait for the MPA connection to
+      be accepted.
+
+   *  The Responding MPA would receive the initial MPA Request Frame
+      and would inform the Consumer that this message arrived. The
+      Consumer can then accept the MPA/DDP connection or close the TCP
+      connection.
+ + * To accept the connection request, the Responding Consumer would + use an appropriate API to bind the TCP/MPA connections to a DDP + endpoint, thus enabling MPA/DDP into Full Operation. In the + process of going to Full Operation, MPA sends the MPA Reply + Frame. MPA/DDP waits for the first incoming FPDU before sending + any FPDUs. + + * If the initial TCP data was not a properly formatted MPA Request + Frame, MPA will close or reset the TCP connection immediately. + + * The Initiating MPA would receive the MPA Reply Frame and + would report this message to the Consumer. The Consumer can + then accept the MPA/DDP connection, or close or reset the TCP + connection to abort the process. + + * On determining that the connection is acceptable, the + Initiating Consumer would use an appropriate API to bind the + TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP + into Full Operation. MPA/DDP would begin sending DDP + messages as MPA FPDUs. + + + +Culley, et al. Standards Track [Page 32] + +RFC 5044 MPA Framing for TCP October 2007 + + +7.1.4. Use of Private Data + + This section is advisory in nature, in that it suggests a method by + which a ULP can deal with pre-DDP connection information exchange. + +7.1.4.1. Motivation + + Prior RDMA protocols have been developed that provide Private Data + via out-of-band mechanisms. As a result, many applications now + expect some form of Private Data to be available for application use + prior to setting up the DDP/RDMA connection. Following are some + examples of the use of Private Data. + + An RDMA endpoint (referred to as a Queue Pair, or QP, in InfiniBand + and the [VERBS-RDMA]) must be associated with a Protection Domain. + No receive operations may be posted to the endpoint before it is + associated with a Protection Domain. Indeed under both the + InfiniBand and proposed RDMA/DDP verbs [VERBS-RDMA] an endpoint/QP is + created within a Protection Domain. 
+ + There are some applications where the choice of Protection Domain is + dependent upon the identity of the remote ULP client. For example, + if a user session requires multiple connections, it is highly + desirable for all of those connections to use a single Protection + Domain. Note: Use of Protection Domains is further discussed in + [RDMASEC]. + + InfiniBand, the DAT APIs [DAT-API], and the IT-API [IT-API] all + provide for the active-side ULP to provide Private Data when + requesting a connection. This data is passed to the ULP to allow it + to determine whether to accept the connection, and if so with which + endpoint (and implicitly which Protection Domain). + + The Private Data can also be used to ensure that both ends of the + connection have configured their RDMA endpoints compatibly on such + matters as the RDMA Read capacity (see [RDMAP]). Further ULP- + specific uses are also presumed, such as establishing the identity of + the client. + + Private Data is also allowed for when accepting the connection, to + allow completion of any negotiation on RDMA resources and for other + ULP reasons. + + There are several potential ways to exchange this Private Data. For + example, the InfiniBand specification includes a connection + management protocol that allows a small amount of Private Data to be + exchanged using datagrams before actually starting the RDMA + connection. + + + +Culley, et al. Standards Track [Page 33] + +RFC 5044 MPA Framing for TCP October 2007 + + + This document allows for small amounts of Private Data to be + exchanged as part of the MPA startup sequence. The actual Private + Data fields are carried in the MPA Request Frame and the MPA Reply + Frame. + + If larger amounts of Private Data or more negotiation is necessary, + TCP streaming mode messages may be exchanged prior to enabling MPA. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Culley, et al. 
Standards Track [Page 34] + +RFC 5044 MPA Framing for TCP October 2007 + + +7.1.4.2. Example Immediate Startup Using Private Data + + Initiator Responder + + +---------------------------+ + |TCP SYN sent. | +--------------------------+ + +---------------------------+ --------> |TCP gets SYN packet; | + +---------------------------+ | sends SYN-Ack. | + |TCP gets SYN-Ack | <-------- +--------------------------+ + | sends Ack. | + +---------------------------+ --------> +--------------------------+ + +---------------------------+ |Consumer enables MPA | + |Consumer enables MPA | |Responder mode, waits for | + |Initiator mode with | | <MPA Request frame>. | + |Private Data; MPA sends | +--------------------------+ + | <MPA Request Frame>; | + |MPA waits for incoming | +--------------------------+ + | <MPA Reply Frame>. | - - - - > |MPA receives | + +---------------------------+ | <MPA Request Frame>. | + |Consumer examines Private | + |Data, provides MPA with | + |return Private Data, | + |binds DDP to MPA, and | + |enables MPA to send an | + | <MPA Reply Frame>. | + |DDP/MPA enables FPDU | + +---------------------------+ |decoding, but does not | + |MPA receives the | < - - - - |send any FPDUs. | + | <MPA Reply Frame>. | +--------------------------+ + |Consumer examines Private | + |Data, binds DDP to MPA, | + |and enables DDP/MPA to | + |begin Full Operation. | + |MPA sends first FPDU (as | +--------------------------+ + |DDP ULPDUs become | ========> |MPA receives first FPDU. | + |available). | |MPA sends first FPDU (as | + +---------------------------+ |DDP ULPDUs become | + <====== |available). | + +--------------------------+ + + Figure 10: Example Immediate Startup Negotiation + + Note: The exact order of when MPA is started in the TCP connection + sequence is implementation dependent; the above diagram shows one + possible sequence. 
Also, the Initiator "Ack" to the Responder's + "SYN-Ack" may be combined into the same TCP segment containing + the MPA Request Frame (as is allowed by TCP RFCs). + + The example immediate startup sequence is described below: + + * The passive side (Responding Consumer) would listen on the TCP + destination port, to indicate its readiness to accept a + connection. + + * The active side (Initiating Consumer) would request a + connection from a TCP endpoint (that expected to upgrade to + MPA/DDP/RDMA and expected the Private Data) to a destination + address and port. + + * The Initiating Consumer would initiate a TCP connection to + the destination port. Acceptance/rejection of the connection + would proceed as per normal TCP connection establishment. + + * The passive side (Responding Consumer) would receive the TCP + connection request as usual, allowing normal TCP gatekeepers, such + as INETD and TCPserver, to exercise their normal + safeguard/logging functions. On acceptance of the TCP + connection, the Responding Consumer would enable MPA in the + Responder mode and wait for the initial MPA startup message. + + * The Initiating Consumer would enable MPA startup in the + Initiator mode to send an initial MPA Request Frame with its + included Private Data message. The Initiating MPA + (and Consumer) would also wait for the MPA connection to be + accepted, and for any returned Private Data. + + * The Responding MPA would receive the initial MPA Request Frame + with the Private Data message and would pass the Private Data + through to the Consumer. The Consumer can then accept the + MPA/DDP connection, close the TCP connection, or reject the MPA + connection with a return message. + + * To accept the connection request, the Responding Consumer would + use an appropriate API to bind the TCP/MPA connections to a DDP + endpoint, thus enabling MPA/DDP into Full Operation.
In the + process of going to Full Operation, MPA sends the MPA Reply + Frame, which includes the Consumer-supplied Private Data + containing any appropriate Consumer response. MPA/DDP waits for + the first incoming FPDU before sending any FPDUs. + + * If the initial TCP data was not a properly formatted MPA Request + Frame, MPA will close or reset the TCP connection immediately. + + + + + + + +Culley, et al. Standards Track [Page 36] + +RFC 5044 MPA Framing for TCP October 2007 + + + * To reject the MPA connection request, the Responding Consumer + would send an MPA Reply Frame with any ULP-supplied Private Data + (with reason for rejection), with the "Rejected Connection" bit + set to '1', and may close the TCP connection. + + * The Initiating MPA would receive the MPA Reply Frame with the + Private Data message and would report this message to the + Consumer, including the supplied Private Data. + + If the "Rejected Connection" bit is set to a '1', MPA will + close the TCP connection and exit. + + If the "Rejected Connection" bit is set to a '0', and on + determining from the MPA Reply Frame Private Data that the + connection is acceptable, the Initiating Consumer would use + an appropriate API to bind the TCP/MPA connections to a DDP + endpoint thus enabling MPA/DDP into Full Operation. MPA/DDP + would begin sending DDP messages as MPA FPDUs. + +7.1.5. "Dual Stack" Implementations + + MPA/DDP implementations are commonly expected to be implemented as + part of a "dual stack" architecture. One stack is the traditional + TCP stack, usually with a sockets interface API (Application + Programming Interface). The second stack is the MPA/DDP stack with + its own API, and potentially separate code or hardware to deal with + the MPA/DDP data. Of course, implementations may vary, so the + following comments are of an advisory nature only. + + The use of the two stacks offers advantages: + + TCP connection setup is usually done with the TCP stack. 
This + allows use of the usual naming and addressing mechanisms. It + also means that any mechanisms used to "harden" the connection + setup against security threats are also used when starting + MPA/DDP. + + Some applications may have been originally designed for TCP, but + are "enhanced" to utilize MPA/DDP after a negotiation reveals the + capability to do so. The negotiation process takes place in + TCP's streaming mode, using the usual TCP APIs. + + Some new applications, designed for RDMA or DDP, still need to + exchange some data prior to starting MPA/DDP. This exchange can + be of arbitrary length or complexity, but often consists of only + a small amount of Private Data, perhaps only a single message. + Using the TCP streaming mode for this exchange allows this to be + done using well-understood methods. + + + +Culley, et al. Standards Track [Page 37] + +RFC 5044 MPA Framing for TCP October 2007 + + + The main disadvantage of using two stacks is the conversion of an + active TCP connection between them. This process must be done with + care to prevent loss of data. + + To avoid some of the problems when using a "dual stack" architecture, + the following additional restrictions may be required by the + implementation: + + 1. Enabling the DDP/MPA stack SHOULD be done only when no incoming + stream data is expected. This is typically managed by the ULP + protocol. When following the recommended startup sequence, the + Responder side enters DDP/MPA mode, sends the last streaming mode + data, and then waits for the MPA Request Frame. No additional + streaming mode data is expected. The Initiator side ULP receives + the last streaming mode data, and then enters DDP/MPA mode. + Again, no additional streaming mode data is expected. + + 2. The DDP/MPA MAY provide the ability to send a "last streaming + message" as part of its Responder DDP/MPA enable function. 
This + allows the DDP/MPA stack to more easily manage the conversion to + DDP/MPA mode (and avoid problems with a very fast return of the + MPA Request Frame from the Initiator side). + + Note: Regardless of the "stack" architecture used, TCP's rules MUST + be followed. For example, if network data is lost, re-segmented, + or re-ordered, TCP MUST recover appropriately even when this + occurs while switching stacks. + +7.2. Normal Connection Teardown + + Each half connection of MPA terminates when DDP closes the + corresponding TCP half connection. + + A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware + that a graceful close of the TCP connection has been received by the + TCP (e.g., FIN is received). + +8. Error Semantics + + The following errors MUST be detected by MPA and the codes SHOULD be + provided to DDP or other Consumer: + + Code Error + + 1 TCP connection closed, terminated, or lost. This includes loss + by timeout, too many retries, RST received, or FIN received. + + 2 Received MPA CRC does not match the calculated value for the + FPDU. + + 3 In the event that the CRC is valid, received MPA Marker (if + enabled) and ULPDU Length fields do not agree on the start of an + FPDU. If the FPDU start determined from previous ULPDU Length + fields does not match the MPA Marker position, MPA SHOULD + deliver an error to DDP. It may not be possible to make this + check as a segment arrives, but the check SHOULD be made when a + gap creating an out-of-order sequence is closed and any time a + Marker points to an already identified FPDU. It is OPTIONAL for + a receiver to check each Marker, if multiple Markers are present + in an FPDU, or if the segment is received in order. + + 4 Invalid MPA Request Frame or MPA Reply Frame received. In + this case, the TCP connection MUST be immediately closed.
DDP + and other ULPs should treat this similarly to code 1, above. + + When conditions 2 or 3 above are detected, an optimized MPA/TCP + implementation MAY choose to silently drop the TCP segment rather + than reporting the error to DDP. In this case, the sending TCP will + retry the segment, usually correcting the error, unless the problem + was at the source. In that case, the source will usually exceed the + number of retries and terminate the connection. + + Once MPA delivers an error of any type, it MUST NOT pass or deliver + any additional FPDUs on that half connection. + + For error codes 2 and 3, MPA MUST NOT close the TCP connection + following a reported error. Closing the connection is the + responsibility of DDP's ULP. + + Note that since MPA will not Deliver any FPDUs on a half + connection following an error detected on the receive side of + that connection, DDP's ULP is expected to tear down the + connection. This may not occur until after one or more last + messages are transmitted on the opposite half connection. This + allows a diagnostic error message to be sent. + +9. Security Considerations + + This section discusses the security considerations for MPA. + +9.1. Protocol-Specific Security Considerations + + The vulnerabilities of MPA to third-party attacks are no greater than + those of any other protocol running over TCP. A third party, by sending + packets into the network that are delivered to an MPA receiver, could + launch a variety of attacks that take advantage of how MPA operates. + For example, a third party could send random packets that are valid + for TCP, but contain no FPDU headers. An MPA receiver reports an + error to DDP when any packet arrives that cannot be validated as an + FPDU when properly located on an FPDU boundary. A third party could + also send packets that are valid for TCP, MPA, and DDP, but do not + target valid buffers.
These types of attacks ultimately result in + loss of connection and thus become a type of DoS (Denial of Service) + attack. Communication security mechanisms such as IPsec [RFC2401, + RFC4301] may be used to prevent such attacks. + + Independent of how MPA operates, a third party could use ICMP + messages to reduce the path MTU to such a small size that performance + would likewise be severely impacted. Range checking on path MTU + sizes in ICMP packets may be used to prevent such attacks. + + [RDMAP] and [DDP] are used to control, read, and write data buffers + over IP networks. Therefore, the control and the data packets of + these protocols are vulnerable to the spoofing, tampering, and + information disclosure attacks listed below. In addition, connection + to/from an unauthorized or unauthenticated endpoint is a potential + problem with most applications using RDMA, DDP, and MPA. + +9.1.1. Spoofing + + Spoofing attacks can be launched by the Remote Peer or by a + network-based attacker. A network-based spoofing attack applies to all + Remote Peers. Because the MPA Stream requires a TCP Stream in the + ESTABLISHED state, certain traditional forms of wire attacks + do not apply -- an end-to-end handshake must have occurred to + establish the MPA Stream. So, the only form of spoofing that applies + is one in which a remote node can both send and receive packets. Yet + even with this limitation the Stream is still exposed to the + following spoofing attacks. + +9.1.1.1. Impersonation + + A network-based attacker can impersonate a legal MPA/DDP/RDMAP peer + (by spoofing a legal IP address) and establish an MPA/DDP/RDMAP + Stream with the victim. End-to-end authentication (i.e., IPsec or + ULP authentication) provides protection against this attack. + +9.1.1.2.
Stream Hijacking + + Stream hijacking happens when a network-based attacker follows the + Stream establishment phase, and waits until the authentication phase + (if such a phase exists) is completed successfully. The attacker can + then spoof the IP address and redirect the Stream from the victim to + the attacker's own machine. For example, an attacker can wait until an iSCSI + authentication is completed successfully, and hijack the iSCSI + Stream. + + The best protection against this form of attack is end-to-end + integrity protection and authentication, such as IPsec, to prevent + spoofing. Another option is to provide physical security. + Discussion of physical security is out of scope for this document. + +9.1.1.3. Man-in-the-Middle Attack + + If a network-based attacker has the ability to delete, inject, + replay, or modify packets that will still be accepted by MPA (e.g., + TCP sequence number is correct, FPDU is valid), then the Stream + can be exposed to a man-in-the-middle attack. The attacker could + potentially use the services of [DDP] and [RDMAP] to read the + contents of the associated Data Buffer, to modify the contents of the + associated Data Buffer, or to disable further access to the buffer. + Other attacks on the connection setup sequence and even on TCP can be + used to cause denial of service. The only countermeasure for this + form of attack is to either secure the MPA/DDP/RDMAP Stream (i.e., + integrity-protect it) or attempt to provide physical security to prevent + man-in-the-middle attacks. + + The best protection against this form of attack is end-to-end + integrity protection and authentication, such as IPsec, to prevent + spoofing or tampering. If Stream-level or session-level authentication and + integrity protection are not used, then a man-in-the-middle attack + can occur, enabling spoofing and tampering.
+ + Another approach is to restrict access to only the local subnet/link + and provide some mechanism to limit access, such as physical security + or 802.1X. This model is an extremely limited deployment scenario + and will not be further examined here. + +9.1.2. Eavesdropping + + Generally speaking, Stream confidentiality protects against + eavesdropping. Stream and/or session authentication and integrity + protection are a countermeasure against various spoofing and + tampering attacks. The effectiveness of authentication and integrity + against a specific attack depends on whether the authentication is + machine-level authentication (such as that provided by IPsec) or ULP + authentication. + +9.2. Introduction to Security Options + + The following security services can be applied to an MPA/DDP/RDMAP + Stream: + + 1. Session confidentiality - protects against eavesdropping. + + 2. Per-packet data source authentication - protects against the + following spoofing attacks: network-based impersonation, Stream + hijacking, and man-in-the-middle. + + 3. Per-packet integrity - protects against tampering done by + network-based modification of FPDUs (indirectly affecting buffer + content through DDP services). + + 4. Packet sequencing - protects against replay attacks, which are a + special case of the above tampering attack. + + If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks, + or Stream hijacking attacks, it is recommended that the Stream be + authenticated, integrity protected, and protected from replay + attacks. It may use confidentiality protection to protect from + eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public + network). + + IPsec is capable of providing the above security services for IP and + TCP traffic. + + ULP protocols may be able to provide part of the above security + services.
See [NFSv4CHAN] for additional information on a promising + approach called "channel binding". From [NFSv4CHAN]: + + "The concept of channel bindings allows applications to prove + that the end-points of two secure channels at different network + layers are the same by binding authentication at one channel to + the session protection at the other channel. The use of channel + + + + + +Culley, et al. Standards Track [Page 42] + +RFC 5044 MPA Framing for TCP October 2007 + + + bindings allows applications to delegate session protection to + lower layers, which may significantly improve performance for + some applications." + +9.3. Using IPsec with MPA + + IPsec can be used to protect against the packet injection attacks + outlined above. Because IPsec is designed to secure individual IP + packets, MPA can run above IPsec without change. IPsec packets are + processed (e.g., integrity checked and decrypted) in the order they + are received, and an MPA receiver will process the decrypted FPDUs + contained in these packets in the same manner as FPDUs contained in + unsecured IP packets. + + MPA implementations MUST implement IPsec as described in Section 9.4 + below. The use of IPsec is up to ULPs and administrators. + +9.4. Requirements for IPsec Encapsulation of MPA/DDP + + The IP Storage working group has spent significant time and effort to + define the normative IPsec requirements for IP storage [RFC3723]. + Portions of that specification are applicable to a wide variety of + protocols, including the RDDP protocol suite. In order not to + replicate this effort, an MPA on TCP implementation MUST follow the + requirements defined in RFC 3723, Sections 2.3 and 5, including the + associated normative references for those sections. 
+ + Additionally, since IPsec acceleration hardware may only be able to + handle a limited number of active Internet Key Exchange Protocol + (IKE) Phase 2 security associations (SAs), Phase 2 delete messages + MAY be sent for idle SAs, as a means of keeping the number of active + Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 delete + message MUST NOT be interpreted as a reason for tearing down a + DDP/RDMA Stream. Rather, it is preferable to leave the Stream up, + and if additional traffic is sent on it, to bring up another IKE + Phase 2 SA to protect it. This avoids the potential for continually + bringing Streams up and down. + + The IPsec requirements for RDDP are based on the version of IPsec + specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC + 3723 [RFC3723], despite the existence of a newer version of IPsec + specified in RFC 4301 [RFC4301] and related RFCs. One of the + important early applications of the RDDP protocols is their use with + iSCSI [iSER]; RDDP's IPsec requirements follow those of iSCSI in + order to facilitate that usage by allowing a common profile of IPsec + to be used with iSCSI and the RDDP protocols. In the future, RFC + 3723 may be updated to the newer version of IPsec; the IPsec security + requirements of any such update should apply uniformly to iSCSI and + the RDDP protocols. + + Note that there are serious security issues if IPsec is not + implemented end-to-end. For example, if IPsec is implemented as a + tunnel in the middle of the network, any hosts between the peer and + the IPsec tunneling device can freely attack the unprotected Stream. + +10. IANA Considerations + + No IANA actions are required by this document. + + If a well-known port is chosen as the mechanism to identify DDP on + MPA on TCP, the well-known port must be registered with IANA.
+ Because the use of the port is DDP specific, registration of the port + with IANA is left to DDP. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Culley, et al. Standards Track [Page 44] + +RFC 5044 MPA Framing for TCP October 2007 + + +Appendix A. Optimized MPA-Aware TCP Implementations + + This appendix is for information only and is NOT part of the + standard. + + This appendix covers some Optimized MPA-aware TCP implementation + guidance to implementers. It is intended for those implementations + that want to send/receive as much traffic as possible in an aligned + and zero-copy fashion. + + +-----------------------------------+ + | +-----------+ +-----------------+ | + | | Optimized | | Other Protocols | | + | | MPA/TCP | +-----------------+ | + | +-----------+ || | + | \\ --- socket API --- | + | \\ || | + | \\ +-----+ | + | \\ | TCP | | + | \\ +-----+ | + | \\ // | + | +-------+ | + | | IP | | + | +-------+ | + +-----------------------------------+ + + Figure 11: Optimized MPA/TCP Implementation + + The diagram above shows a block diagram of a potential + implementation. The network sub-system in the diagram can support + traditional sockets-based connections using the normal API as shown + on the right side of the diagram. Connections for DDP/MPA/TCP are + run using the facilities shown on the left side of the diagram. + + The DDP/MPA/TCP connections can be started using the facilities shown + on the left side using some suitable API, or they can be initiated + using the facilities shown on the right side and transitioned to the + left side at the point in the connection setup where MPA goes to + "Full MPA/DDP Operation Phase" as described in Section 7.1.2. + + The optimized MPA/TCP implementations (left side of diagram and + described below) are only applicable to MPA. All other TCP + applications continue to use the standard TCP stacks and interfaces + shown in the right side of the diagram. + + + + + + + +Culley, et al. 
Standards Track [Page 45] + +RFC 5044 MPA Framing for TCP October 2007 + + +A.1. Optimized MPA/TCP Transmitters + + The various TCP RFCs allow considerable choice in segmenting a TCP + stream. In order to optimize FPDU recovery at the MPA receiver, an + optimized MPA/TCP implementation uses additional segmentation rules. + + To provide optimum performance, an optimized MPA/TCP transmit side + implementation should be enabled to: + + * With an EMSS large enough to contain the FPDU(s), segment the + outgoing TCP stream such that every TCP segment begins with the + first octet of an FPDU. Multiple FPDUs may be packed into a + single TCP segment as long as they are entirely contained in the + TCP segment. + + * Report the current EMSS from the TCP to the MPA transmit layer. + + There are exceptions to the above rule. Once a ULPDU is provided to + MPA, the MPA/TCP sender transmits it or fails the connection; it + cannot be repudiated. As a result, during changes in MTU and EMSS, + or when TCP's Receive Window size (RWIN) becomes too small, it may be + necessary to send FPDUs that do not conform to the segmentation rule + above. + + A possible, but less desirable, alternative is to use IP + fragmentation on accepted FPDUs to deal with MTU reductions or + extremely small EMSS. + + Even when alignment with TCP segments is lost, the sender still + formats the FPDU according to the FPDU format shown in Figure 2. + + On a retransmission, TCP does not necessarily preserve original TCP + segmentation boundaries. This can lead to the loss of FPDU Alignment + and containment within a TCP segment during TCP retransmissions. An + optimized MPA/TCP sender should try to preserve original TCP + segmentation boundaries on a retransmission. + +A.2. Effects of Optimized MPA/TCP Segmentation + + Optimized MPA/TCP senders will fill TCP segments to the EMSS with a + single FPDU when a DDP message is large enough.
Since the DDP + message may not exactly fit into TCP segments, a "message tail" often + occurs that results in an FPDU that is smaller than a single TCP + segment. Additionally, some DDP messages may be considerably shorter + than the EMSS. If a small FPDU is sent in a single TCP segment, the + result is a "short" TCP segment. + + + + + +Culley, et al. Standards Track [Page 46] + +RFC 5044 MPA Framing for TCP October 2007 + + + Applications expected to see strong advantages from Direct Data + Placement include transaction-based applications and throughput + applications. Request/response protocols typically send one FPDU per + TCP segment and then wait for a response. Under these conditions, + these "short" TCP segments are an appropriate and expected effect of + the segmentation. + + Another possibility is that the application might be sending multiple + messages (FPDUs) to the same endpoint before waiting for a response. + In this case, the segmentation policy would tend to reduce the + available connection bandwidth by under-filling the TCP segments. + + Standard TCP implementations often utilize the Nagle [RFC896] + algorithm to ensure that segments are filled to the EMSS whenever the + round-trip latency is large enough that the source stream can fully + fill segments before ACKs arrive. The algorithm does this by + delaying the transmission of TCP segments until a ULP can fill a + segment, or until an ACK arrives from the far side. The algorithm + thus allows for smaller segments when latencies are shorter to keep + the ULP's end-to-end latency to reasonable levels. + + The Nagle algorithm is not mandatory to use [RFC1122]. + + When used with optimized MPA/TCP stacks, Nagle and similar algorithms + can result in the "packing" of multiple FPDUs into TCP segments. + + If a "message tail", small DDP messages, or the start of a larger DDP + message are available, MPA may pack multiple FPDUs into TCP segments. 
+ When this is done, the TCP segments can be more fully utilized, but, + due to the size constraints of FPDUs, segments may not be filled to + the EMSS. A dynamic MULPDU that informs DDP of the size of the + remaining TCP segment space makes filling the TCP segment more + effective. + + Note that MPA receivers do more processing of a TCP segment that + contains multiple FPDUs; this may affect the performance of some + receiver implementations. + + It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note + that many of the applications expected to take advantage of MPA/DDP + prefer to avoid the extra delays caused by Nagle. In such scenarios, + it is anticipated that there will be minimal opportunity for packing at + the transmitter, and receivers may choose to optimize their + performance for this anticipated behavior. + + Therefore, the application is expected to set TCP parameters such + that it can trade off latency and wire efficiency. Implementations + should provide a connection option that disables Nagle for MPA/TCP + similar to the way the TCP_NODELAY socket option is provided for a + traditional sockets interface. + + When latency is not critical, the application is expected to leave Nagle + enabled. In this case, the TCP implementation may pack any available + FPDUs into TCP segments so that the segments are filled to the EMSS. + If the amount of data available is not enough to fill the TCP segment + when it is prepared for transmission, TCP can send the segment partly + filled, or use the Nagle algorithm to wait for the ULP to post more + data. + +A.3.
Optimized MPA/TCP Receivers + + When an MPA receive implementation and the MPA-aware receive side TCP + implementation support handling out-of-order ULPDUs, the TCP receive + implementation performs the following functions: + + 1) The implementation passes incoming TCP segments to MPA as soon as + they have been received and validated, even if not received in + order. The TCP layer commits to keeping each segment before it + can be passed to the MPA. This means that the segment must have + passed the TCP, IP, and lower layer data integrity validation + (i.e., checksum), must be in the receive window, must be part of + the same epoch (if timestamps are used to verify this), and must + have passed any other checks required by TCP RFCs. + + This is not to imply that the data must be completely ordered + before use. An implementation can accept out-of-order segments, + SACK them [RFC2018], and pass them to MPA immediately, before the + reception of the segments needed to fill in the gaps. MPA + expects to utilize these segments when they are complete FPDUs or + can be combined into complete FPDUs to allow the passing of + ULPDUs to DDP when they arrive, independent of ordering. DDP + uses the passed ULPDU to "place" the DDP segments (see [DDP] for + more details). + + Since MPA performs a CRC calculation and other checks on received + FPDUs, the MPA/TCP implementation ensures that any TCP segments + that duplicate data already received and processed (as can happen + during TCP retries) do not overwrite already received and + processed FPDUs. This avoids the possibility that duplicate data + may corrupt already validated FPDUs. + + + + + + +Culley, et al. Standards Track [Page 48] + +RFC 5044 MPA Framing for TCP October 2007 + + + 2) The implementation provides a mechanism to indicate the ordering + of TCP segments as the sender transmitted them. One possible + mechanism might be attaching the TCP sequence number to each + segment. 
+ + 3) The implementation also provides a mechanism to indicate when a + given TCP segment (and the prior TCP stream) is complete. One + possible mechanism might be to utilize the leading (left) edge of + the TCP Receive Window. + + MPA uses the ordering and completion indications to inform DDP + when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses + the indications to "deliver" its messages to the DDP consumer + (see [DDP] for more details). + + DDP on MPA utilizes the above two mechanisms to establish the + Delivery semantics that DDP's consumers agree to. These + semantics are described fully in [DDP]. These include + requirements on DDP's consumer to respect ownership of buffers + prior to the time that DDP delivers them to the Consumer. + + The use of SACK [RFC2018] significantly improves network utilization + and performance and is therefore recommended. When combined with the + out-of-order passing of segments to MPA and DDP, significant + buffering and copying of received data can be avoided. + +A.4. Re-Segmenting Middleboxes and Non-Optimized MPA/TCP Senders + + Since MPA senders often start FPDUs on TCP segment boundaries, a + receiving optimized MPA/TCP implementation may be able to optimize + the reception of data in various ways. + + However, MPA receivers MUST NOT depend on FPDU Alignment on TCP + segment boundaries. + + Some MPA senders may be unable to conform to the sender requirements + because their implementation of TCP is not designed with MPA in mind. + Even for optimized MPA/TCP senders, the network may contain + "middleboxes" which modify the TCP stream by changing the + segmentation. This is generally interoperable with TCP and its users + and MPA must be no exception. + + The presence of Markers in MPA (when enabled) allows an optimized + MPA/TCP receiver to recover the FPDUs despite these obstacles, + although it may be necessary to utilize additional buffering at the + receiver to do so. + + + + + +Culley, et al. 
Standards Track [Page 49] + +RFC 5044 MPA Framing for TCP October 2007 + + + Some of the cases that a receiver may have to contend with are listed + below as a reminder to the implementer: + + * A single aligned and complete FPDU, either in order or out of + order: This can be passed to DDP as soon as validated, and + Delivered when ordering is established. + + * Multiple FPDUs in a TCP segment, aligned and fully contained, + either in order or out of order: These can be passed to DDP as + soon as validated, and Delivered when ordering is established. + + * Incomplete FPDU: The receiver should buffer until the remainder + of the FPDU arrives. If the remainder of the FPDU is already + available, this can be passed to DDP as soon as validated, and + Delivered when ordering is established. + + * Unaligned FPDU start: The partial FPDU must be combined with its + preceding portion(s). If the preceding parts are already + available, and the whole FPDU is present, this can be passed to + DDP as soon as validated, and Delivered when ordering is + established. If the whole FPDU is not available, the receiver + should buffer until the remainder of the FPDU arrives. + + * Combinations of unaligned or incomplete FPDUs (and potentially + other complete FPDUs) in the same TCP segment: If any FPDU is + present in its entirety, or can be completed with portions + already available, it can be passed to DDP as soon as validated, + and Delivered when ordering is established. + +A.5. Receiver Implementation + + Transport & Network Layer Reassembly Buffers: + + The use of reassembly buffers (either TCP reassembly buffers or IP + fragmentation reassembly buffers) is implementation dependent. When + MPA is enabled, reassembly buffers are needed if out-of-order packets + arrive and Markers are not enabled. Buffers are also needed if FPDU + alignment is lost or if IP fragmentation occurs. 
This is because the + incoming out-of-order segment may not contain enough information for + MPA to process all of the FPDU. For cases where a re-segmenting + middlebox is present, or where the TCP sender is not optimized, the + presence of Markers significantly reduces the amount of buffering + needed. + + Recovery from IP fragmentation is transparent to the MPA Consumers. + + + + + + +Culley, et al. Standards Track [Page 50] + +RFC 5044 MPA Framing for TCP October 2007 + + +A.5.1 Network Layer Reassembly Buffers + + The MPA/TCP implementation should set the IP Don't Fragment bit at + the IP layer. Thus, upon a path MTU change, intermediate devices + drop the IP datagram if it is too large and reply with an ICMP + message that tells the source TCP that the path MTU has changed. + This causes TCP to emit segments conformant with the new path MTU + size. Thus, IP fragments under most conditions should never occur at + the receiver. But it is possible. + + There are several options for implementation of network layer + reassembly buffers: + + 1. drop any IP fragments, and reply with an ICMP message according + to [RFC792] (fragmentation needed and DF set) to tell the Remote + Peer to resize its TCP segment. + + 2. support an IP reassembly buffer, but have it of limited size + (possibly the same size as the local link's MTU). The end node + would normally never Advertise a path MTU larger than the local + link MTU. It is recommended that a dropped IP fragment cause an + ICMP message to be generated according to RFC 792. + + 3. multiple IP reassembly buffers, of effectively unlimited size. + + 4. support an IP reassembly buffer for the largest IP datagram (64 + KB). + + 5. support for a large IP reassembly buffer that could span multiple + IP datagrams. + + An implementation should support at least 2 or 3 above, to avoid + dropping packets that have traversed the entire fabric. 
+ + There is no end-to-end ACK for IP reassembly buffers, so there is no + flow control on the buffer. The only end-to-end ACK is a TCP ACK, + which can only occur when a complete IP datagram is delivered to TCP. + Because of this, under worst case, pathological scenarios, the + largest IP reassembly buffer is the TCP receive window (to buffer + multiple IP datagrams that have all been fragmented). + + Note that if the Remote Peer does not implement re-segmentation of + the data stream upon receiving the ICMP reply updating the path MTU, + it is possible to halt forward progress because the opposite peer + would continue to retransmit using a transport segment size that is + too large. This deadlock scenario is no different than if the fabric + MTU (not last-hop MTU) was reduced after connection setup, and the + remote node's behavior is not compliant with [RFC1122]. + + + +Culley, et al. Standards Track [Page 51] + +RFC 5044 MPA Framing for TCP October 2007 + + +A.5.2 TCP Reassembly Buffers + + A TCP reassembly buffer is also needed. TCP reassembly buffers are + needed if FPDU Alignment is lost when using TCP with MPA or when the + MPA FPDU spans multiple TCP segments. Buffers are also needed if + Markers are disabled and out-of-order packets arrive. + + Since lost FPDU Alignment often means that FPDUs are incomplete, an + MPA on TCP implementation must have a reassembly buffer large enough + to recover an FPDU that is less than or equal to the MTU of the + locally attached link (this should be the largest possible Advertised + TCP path MTU). If the MTU is smaller than 140 octets, a buffer of at + least 140 octets long is needed to support the minimum FPDU size. + The 140 octets allow for the minimum MULPDU of 128, 2 octets of pad, + 2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As + usual, additional buffering is likely to provide better performance. 
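The 140-octet floor quoted above can be checked with a little arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, and it assumes a 4-octet Marker and padding of the FPDU to a 4-octet boundary, which is what makes the quoted figures add up to 140.

```python
# Illustrative sketch only; all names here are hypothetical.
MIN_MULPDU = 128         # minimum MULPDU in octets (from the text above)
ULPDU_LENGTH_OCTETS = 2  # ULPDU_Length field
CRC_OCTETS = 4           # CRC field
MARKER_OCTETS = 4        # assumed size of one Marker


def pad_octets(ulpdu_len: int) -> int:
    """Pad that brings ULPDU_Length + ULPDU to a 4-octet boundary (assumed rule)."""
    return -(ULPDU_LENGTH_OCTETS + ulpdu_len) % 4


def min_fpdu_buffer() -> int:
    """Octets needed to hold the smallest-MULPDU FPDU plus one Marker."""
    return (ULPDU_LENGTH_OCTETS + MIN_MULPDU + pad_octets(MIN_MULPDU)
            + CRC_OCTETS + MARKER_OCTETS)


def reassembly_buffer_size(link_mtu: int) -> int:
    """Smallest TCP reassembly buffer suggested by the paragraph above."""
    return max(link_mtu, min_fpdu_buffer())
```

With these assumptions, a link MTU below 140 octets still yields a 140-octet buffer, while a larger MTU simply sets the buffer size; as the text notes, extra buffering beyond this minimum is likely to perform better.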
+ + Note that if the TCP segments were not stored, it would be possible + to deadlock the MPA algorithm. If the path MTU is reduced, FPDU + Alignment requires the source TCP to re-segment the data stream to + the new path MTU. The source MPA will detect this condition and + reduce the MPA segment size, but any FPDUs already posted to the + source TCP will be re-segmented and lose FPDU Alignment. If the + destination does not support a TCP reassembly buffer, these segments + can never be successfully transmitted and the protocol deadlocks. + + When a complete FPDU is received, processing continues normally. + +Appendix B. Analysis of MPA over TCP Operations + + This appendix is for information only and is NOT part of the + standard. + + This appendix is an analysis of MPA on TCP and why it is useful to + integrate MPA with TCP (with modifications to typical TCP + implementations) to reduce overall system buffering and overhead. + + One of MPA's high-level goals is to provide enough information, when + combined with the Direct Data Placement Protocol [DDP], to enable + out-of-order placement of DDP payload into the final Upper Layer + Protocol (ULP) Buffer. Note that DDP separates the act of placing + data into a ULP Buffer from that of notifying the ULP that the ULP + Buffer is available for use. In DDP terminology, the former is + defined as "Placement", and the latter is defined as "Delivery". MPA + supports in-order Delivery of the data to the ULP, including support + for Direct Data Placement in the final ULP Buffer location when TCP + segments arrive out of order. Effectively, the goal is to use the + pre-posted ULP Buffers as the TCP receive buffer, where the + reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and + DDP) is done in place, in the ULP Buffer, with no data copies. 
+ + This appendix walks through the advantages and disadvantages of the + TCP sender modifications proposed by MPA: + + 1) that the TCP sender do Header Alignment, where + a TCP segment should begin with an MPA Framing Protocol Data Unit + (FPDU) (if there is payload present). + + 2) that there be an integral number of FPDUs in a TCP segment (under + conditions where the path MTU is not changing). + + This appendix concludes that the scaling advantages of FPDU Alignment + are strong, based primarily on a fairly drastic reduction in TCP + receive buffer requirements and simplified receive handling. The analysis + also shows that there is little effect on TCP wire behavior. + +B.1. Assumptions + +B.1.1. MPA Is Layered beneath DDP + + MPA is an adaptation layer between DDP and TCP. DDP requires + preservation of DDP segment boundaries and a CRC32c digest covering + the DDP header and data. MPA adds these features to the TCP stream + so that DDP over TCP has the same basic properties as DDP over SCTP. + +B.1.2. MPA Preserves DDP Message Framing + + MPA was designed as a framing layer specifically for DDP and was not + intended as a general-purpose framing layer for any other ULP using + TCP. + + A framing layer allows ULPs using it to receive indications from the + transport layer only when complete ULPDUs are present. As a framing + layer, MPA is not aware of the content of the DDP PDU, only that it + has received and, if necessary, reassembled a complete PDU for + Delivery to the DDP. + +B.1.3. The Size of the ULPDU Passed to MPA Is Less Than EMSS under + Normal Conditions + + To make reception of a complete DDP PDU on every received segment + possible, DDP passes to MPA a PDU that is no larger than the EMSS of + the underlying fabric. Each FPDU that MPA creates contains + sufficient information for the receiver to directly place the ULP + payload in the correct location in the correct receive buffer. + + + +Culley, et al. 
Standards Track [Page 53] + +RFC 5044 MPA Framing for TCP October 2007 + + + Edge cases when this condition does not occur are dealt with, but do + not need to be on the fast path. + +B.1.4. Out-of-Order Placement but NO Out-of-Order Delivery + + DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the + information necessary to place its ULP payload directly in the + correct location in host memory. + + Because each DDP segment is self-describing, it is possible for DDP + segments received out of order to have their ULP payload placed + immediately in the ULP receive buffer. + + Data delivery to the ULP is guaranteed to be in the order the data + was sent. DDP only indicates data delivery to the ULP after TCP has + acknowledged the complete byte stream. + +B.2. The Value of FPDU Alignment + + Significant receiver optimizations can be achieved when Header + Alignment and complete FPDUs are the common case. The optimizations + allow utilizing significantly fewer buffers on the receiver and less + computation per FPDU. The net effect is the ability to build a + "flow-through" receiver that enables TCP-based solutions to scale to + 10G and beyond in an economical way. The optimizations are + especially relevant to hardware implementations of receivers that + process multiple protocol layers -- Data Link Layer (e.g., Ethernet), + Network and Transport Layer (e.g., TCP/IP), and even some ULP on top + of TCP (e.g., MPA/DDP). As network speed increases, there is an + increasing desire to use a hardware-based receiver in order to + achieve an efficient high performance solution. + + A TCP receiver, under worst-case conditions, has to allocate buffers + (BufferSizeTCP) whose capacities are a function of the bandwidth- + delay product. Thus: + + BufferSizeTCP = K * bandwidth [octets/second] * Delay [seconds]. 
+ + Where bandwidth is the end-to-end bandwidth of the connection, delay + is the round-trip delay of the connection, and K is an + implementation-dependent constant. + + Thus, BufferSizeTCP scales with the end-to-end bandwidth (10x more + buffers for a 10x increase in end-to-end bandwidth). As this + buffering approach may scale poorly for hardware and software + implementations alike, several approaches allow reduction in the + amount of buffering required for high-speed TCP communication. + + The MPA/DDP approach is to enable the ULP's Buffer to be used as the + TCP receive buffer. If the application pre-posts a sufficient amount + of buffering, and each TCP segment has sufficient information to + place the payload into the right application buffer, when an + out-of-order TCP segment arrives it could potentially be placed directly in + the ULP Buffer. However, placement can only be done when a complete + FPDU with the placement information is available to the receiver, and + the FPDU contents contain enough information to place the data into + the correct ULP Buffer (e.g., there is a DDP header available). + + For the case when the FPDU is not aligned with the TCP segment, it + may take, on average, 2 TCP segments to assemble one FPDU. + Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size, + Non-Aligned FPDU) octets: + + BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS + + Where K1 and K2 are implementation-dependent constants and EMSS is + the effective maximum segment size. + + For example, a 1 GB/sec link with 10,000 connections and an EMSS of + 1500 B would require 15 MB of memory (taking K1 = 1 and neglecting + the K2 term). Often the number of + connections used scales with the network speed, aggravating the + situation for higher speeds. 
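The two sizing formulas above are easy to evaluate directly. The sketch below is a minimal illustration, not part of the specification: the function names are hypothetical, and the constants K, K1, and K2 are implementation dependent, so the values passed in are assumptions chosen only to reproduce the 15 MB example.

```python
# Illustrative sketch only; function names and constant values are assumptions.

def buffer_size_tcp(k, bandwidth_octets_per_s, delay_s):
    """BufferSizeTCP = K * bandwidth [octets/second] * delay [seconds]."""
    return k * bandwidth_octets_per_s * delay_s


def buffer_size_naf(k1, k2, emss, number_of_connections):
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * number_of_connections + k2 * emss


# The worked example from the text, assuming K1 = 1 and K2 = 0.
naf_octets = buffer_size_naf(k1=1, k2=0, emss=1500, number_of_connections=10_000)
```

Under those assumed constants `naf_octets` comes to 15,000,000 octets, matching the 15 MB figure, and the first term makes it plain that the requirement grows linearly with the number of connections.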
+ + FPDU Alignment would allow the receiver to allocate BufferSizeAF + (Buffer Size, Aligned FPDU) octets: + + BufferSizeAF = K2 * EMSS + + for the same conditions. An FPDU Aligned receiver may require memory + in the range of ~100s of KB -- which is feasible for an on-chip + memory and enables a "flow-through" design, in which the data flows + through the network interface card (NIC) and is placed directly in + the destination buffer. Assuming most of the connections support + FPDU Alignment, the receiver buffers no longer scale with number of + connections. + + Additional optimizations can be achieved in a balanced I/O sub-system + -- where the system interface of the network controller provides + ample bandwidth as compared with the network bandwidth. For almost + twenty years this has been the case and the trend is expected to + continue. While Ethernet speeds have scaled by 1000 (from 10 + megabit/sec to 10 gigabit/sec), I/O bus bandwidth of volume CPU + architectures has scaled from ~2 MB/sec to ~2 GB/sec (PC-XT bus to + PCI-X DDR). Under these conditions, the FPDU Alignment approach + allows BufferSizeAF to be indifferent to network speed. It is + primarily a function of the local processing time for a given frame. + + + +Culley, et al. Standards Track [Page 55] + +RFC 5044 MPA Framing for TCP October 2007 + + + Thus, when the FPDU Alignment approach is used, receive buffering is + expected to scale gracefully (i.e., less than linear scaling) as + network speed is increased. + +B.2.1. Impact of Lack of FPDU Alignment on the Receiver Computational + Load and Complexity + + The receiver must perform IP and TCP processing, and then perform + FPDU CRC checks, before it can trust the FPDU header placement + information. For simplicity of the description, the assumption is + that an FPDU is carried in no more than 2 TCP segments. In reality, + with no FPDU Alignment, an FPDU can be carried by more than 2 TCP + segments (e.g., if the path MTU was reduced). 
+ + ----++-----------------------------++-----------------------++----- + +---||---------------+ +--------||--------+ +----------||----+ + | TCP Seg X-1 | | TCP Seg X | | TCP Seg X+1 | + +---||---------------+ +--------||--------+ +----------||----+ + ----++-----------------------------++-----------------------++----- + FPDU #N-1 FPDU #N + + Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream + + The receiver algorithm for processing TCP segments (e.g., TCP segment + #X in Figure 12) carrying non-aligned FPDUs (in order or out of + order) includes: + + Data Link Layer processing (whole frame) -- typically including a CRC + calculation. + + 1. Network Layer processing (assuming not an IP fragment, the + whole Data Link Layer frame contains one IP datagram. IP + fragments should be reassembled in a local buffer. This is + not a performance optimization goal.) + + 2. Transport Layer processing -- TCP protocol processing, header + and checksum checks. + + a. Classify incoming TCP segment using the 5 tuple (IP SRC, + IP DST, TCP SRC Port, TCP DST Port, protocol). + + + + + + + + + + + +Culley, et al. Standards Track [Page 56] + +RFC 5044 MPA Framing for TCP October 2007 + + + 3. Find FPDU message boundaries. + + a. Get MPA state information for the connection. + + If the TCP segment is in order, use the receiver-managed + MPA state information to calculate where the previous + FPDU message (#N-1) ends in the current TCP segment X. + (previously, when the MPA receiver processed the first + part of FPDU #N-1, it calculated the number of bytes + remaining to complete FPDU #N-1 by using the MPA Length + field). + + Get the stored partial CRC for FPDU #N-1. + + Complete CRC calculation for FPDU #N-1 data (first + portion of TCP segment #X). + + Check CRC calculation for FPDU #N-1. + + If no FPDU CRC errors, placement is allowed. 
+ + Locate the local buffer for the first portion of + FPDU #N-1, CopyData(local buffer of first portion + of FPDU #N-1, host buffer address, length). + + Compute host buffer address for second portion of + FPDU #N-1. + + CopyData(local buffer of second portion of FPDU #N-1, + host buffer address for second portion, + length). + + Calculate the octet offset into the TCP segment for + the next FPDU #N. + + Start calculation of CRC for available data for FPDU + #N. + + Store partial CRC results for FPDU #N. + + Store local buffer address of first portion of FPDU + #N. + + No further action is possible on FPDU #N, before it + is completely received. + + If the TCP segment is out of order, the receiver must + buffer the data until at least one complete FPDU is + received. Typically, buffering for more than one TCP + segment per connection is required. Use the MPA-based + Markers to calculate where FPDU boundaries are. + + When a complete FPDU is available, a similar + procedure to the in-order algorithm above is used. + There is additional complexity, though, because when + the missing segment arrives, this TCP segment must be + run through the CRC engine after the CRC is + calculated for the missing segment. + + If we assume FPDU Alignment, the following diagram and the algorithm + below apply. Note that when using MPA, the receiver is assumed to + actively detect presence or loss of FPDU Alignment for every TCP + segment received. + + +--------------------------+ +--------------------------+ + +--|--------------------------+ +--|--------------------------+ + | | TCP Seg X | | | TCP Seg X+1 | + +--|--------------------------+ +--|--------------------------+ + +--------------------------+ +--------------------------+ + FPDU #N FPDU #N+1 + + Figure 13: Aligned FPDU Placed Immediately after TCP Header + + + +Culley, et al. 
Standards Track [Page 58] + +RFC 5044 MPA Framing for TCP October 2007 + + + The receiver algorithm for FPDU Aligned frames (in order or out of + order) includes: + + 1) Data Link Layer processing (whole frame) -- typically + including a CRC calculation. + + 2) Network Layer processing (assuming not an IP fragment, the + whole Data Link Layer frame contains one IP datagram. IP + fragments should be reassembled in a local buffer. This is + not a performance optimization goal.) + + 3) Transport Layer processing -- TCP protocol processing, header + and checksum checks. + + a. Classify incoming TCP segment using the 5 tuple (IP SRC, + IP DST, TCP SRC Port, TCP DST Port, protocol). + + 4) Check for Header Alignment (described in detail in Section + 6). Assuming Header Alignment for the rest of the algorithm + below. + + a. If the header is not aligned, see the algorithm defined + in the prior section. + + 5) If TCP segment is in order or out of order, the MPA header is + at the beginning of the current TCP payload. Get the FPDU + length from the FPDU header. + + 6) Calculate CRC over FPDU. + + 7) Check CRC calculation for FPDU #N. + + 8) If no FPDU CRC errors, placement is allowed. + + 9) CopyData(TCP segment #X, host buffer address, length). + + 10) Loop to #5 until all the FPDUs in the TCP segment are + consumed in order to handle FPDU packing. + + Implementation note: In both cases, the receiver has to classify the + incoming TCP segment and associate it with one of the flows it + maintains. In the case of no FPDU Alignment, the receiver is forced + to classify incoming traffic before it can calculate the FPDU CRC. + In the case of FPDU Alignment, the operations order is left to the + implementer. + + + + + + +Culley, et al. Standards Track [Page 59] + +RFC 5044 MPA Framing for TCP October 2007 + + + The FPDU Aligned receiver algorithm is significantly simpler. There + is no need to locally buffer portions of FPDUs. 
Accessing state + information is also substantially simplified -- the normal case does + not require retrieving information to find out where an FPDU starts + and ends or retrieval of a partial CRC before the CRC calculation can + commence. This avoids adding internal latencies, having multiple + data passes through the CRC machine, or scheduling multiple commands + for moving the data to the host buffer. + + The aligned FPDU approach is useful for in-order and out-of-order + reception. The receiver can use the same mechanisms for data storage + in both cases, and only needs to account for when all the TCP + segments have arrived to enable Delivery. The Header Alignment, + along with the high probability that at least one complete FPDU is + found with every TCP segment, allows the receiver to perform data + placement for out-of-order TCP segments with no need for intermediate + buffering. Essentially, the TCP receive buffer has been eliminated + and TCP reassembly is done in place within the ULP Buffer. + + In case FPDU Alignment is not found, the receiver should follow the + algorithm for non-aligned FPDU reception, which may be slower and + less efficient. + +B.2.2. FPDU Alignment Effects on TCP Wire Protocol + + In an optimized MPA/TCP implementation, TCP exposes its EMSS to MPA. + MPA uses the EMSS to calculate its MULPDU, which it then exposes to + DDP, its ULP. DDP uses the MULPDU to segment its payload so that + each FPDU sent by MPA fits completely into one TCP segment. This has + no impact on wire protocol, and exposing this information is already + supported on many TCP implementations, including all modern flavors + of BSD networking, through the TCP_MAXSEG socket option. + + In the common case, the ULP (i.e., DDP over MPA) messages provided to + the TCP layer are segmented to MULPDU size. It is assumed that the + ULP message size is bounded by MULPDU, such that a single ULP message + can be encapsulated in a single TCP segment. 
Therefore, in the + common case, there is no increase in the number of TCP segments + emitted. For smaller ULP messages, the sender can also apply + packing, i.e., the sender packs as many complete FPDUs as possible + into one TCP segment. The requirement to always have a complete FPDU + may increase the number of TCP segments emitted. Typically, a ULP + message size varies from a few bytes to multiple EMSSs (e.g., 64 + Kbytes). In some cases, the ULP may post more than one message at a + time for transmission, giving the sender an opportunity for packing. + In the case where more than one FPDU is available for transmission + and the FPDUs are encapsulated into a TCP segment and there is no + room in the TCP segment to include the next complete FPDU, another + + + +Culley, et al. Standards Track [Page 60] + +RFC 5044 MPA Framing for TCP October 2007 + + + TCP segment is sent. In this corner case, some of the TCP segments + are not full size. In the worst-case scenario, the ULP may choose an + FPDU size that is EMSS/2 +1 and has multiple messages available for + transmission. For this poor choice of FPDU size, the average TCP + segment size is therefore about 1/2 of the EMSS and the number of TCP + segments emitted is approaching 2x of what is possible without the + requirement to encapsulate an integer number of complete FPDUs in + every TCP segment. This is a dynamic situation that only lasts for + the duration where the sender ULP has multiple non-optimal messages + for transmission and this causes a minor impact on the wire + utilization. + + However, it is not expected that requiring FPDU Alignment will have a + measurable impact on wire behavior of most applications. Throughput + applications with large I/Os are expected to take full advantage of + the EMSS. Another class of applications with many small outstanding + buffers (as compared to EMSS) is expected to use packing when + applicable. Transaction-oriented applications are also optimal. 
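The worst case described above can be made concrete by counting segments. The following Python sketch is illustrative only: the function names are hypothetical, the EMSS of 1460 octets is an assumed value, and the model ignores everything except how many fixed-size FPDUs fit in one segment.

```python
# Illustrative sketch only; names and the EMSS value are assumptions.

def segments_with_whole_fpdus(fpdu_size, n_fpdus, emss):
    """Segments emitted when each segment carries an integral number of FPDUs."""
    per_segment = emss // fpdu_size
    if per_segment == 0:
        raise ValueError("FPDU does not fit in one TCP segment")
    return -(-n_fpdus // per_segment)  # ceiling division


def segments_as_byte_stream(fpdu_size, n_fpdus, emss):
    """Segments emitted if FPDUs could be split freely across segments."""
    return -(-(fpdu_size * n_fpdus) // emss)  # ceiling division


EMSS = 1460                # assumed effective maximum segment size
WORST_FPDU = EMSS // 2 + 1  # the "EMSS/2 + 1" case from the text

aligned = segments_with_whole_fpdus(WORST_FPDU, 1000, EMSS)
unaligned = segments_as_byte_stream(WORST_FPDU, 1000, EMSS)
```

For 1000 FPDUs of 731 octets, the integral-FPDU rule emits one FPDU per segment, roughly twice the count of the unconstrained byte stream. Shrinking the FPDU by a single octet to EMSS/2 lets two FPDUs share each segment and removes the penalty, which is why the text calls this a poor choice of FPDU size rather than an inherent cost.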
+ + TCP retransmission is another area that can affect sender behavior. + TCP supports retransmission of the exact, originally transmitted + segment (see [RFC793], Sections 2.6 and 3.7 (under "Managing the + Window") and [RFC1122], Section 4.2.2.15). In the unlikely event + that part of the original segment has been received and acknowledged + by the Remote Peer (e.g., a re-segmenting middlebox, as documented in + Appendix A.4, Re-Segmenting Middleboxes and Non-Optimized MPA/TCP + Senders), a better available bandwidth utilization may be possible by + retransmitting only the missing octets. If an optimized MPA/TCP + retransmits complete FPDUs, there may be some marginal bandwidth + loss. + + Another area where a change in the TCP segment number may have impact + is that of slow start and congestion avoidance. Slow-start + exponential increase is measured in segments per second, as the + algorithm focuses on the overhead per segment at the source for + congestion that eventually results in dropped segments. Slow-start + exponential bandwidth growth for optimized MPA/TCP is similar to any + TCP implementation. Congestion avoidance allows for a linear growth + in available bandwidth when recovering after a packet drop. Similar + to the analysis for slow start, optimized MPA/TCP doesn't change the + behavior of the algorithm. Therefore, the average size of the + segment versus EMSS is not a major factor in the assessment of the + bandwidth growth for a sender. Both slow start and congestion + avoidance for an optimized MPA/TCP will behave similarly to any TCP + sender and allow an optimized MPA/TCP to enjoy the theoretical + performance limits of the algorithms. + + + + + +Culley, et al. 
Standards Track [Page 61] + +RFC 5044 MPA Framing for TCP October 2007 + + + In summary, the ULP messages generated at the sender (e.g., the + number of messages grouped for every transmission request) and + the message size distribution have the most significant impact on the + number of TCP segments emitted. The worst-case effect for certain + ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by an + increase of up to 2x in the number of TCP segments and acknowledgments. + In reality, the effect is expected to be marginal. + +Appendix C. IETF Implementation Interoperability with RDMA Consortium + Protocols + + This appendix is for information only and is NOT part of the + standard. + + This appendix covers methods of making MPA implementations + interoperate with both IETF and RDMA Consortium versions of the + protocols. + + The RDMA Consortium created early specifications of the MPA/DDP/RDMA + protocols, and some manufacturers created implementations of those + protocols before the IETF versions were finalized. These protocols + are very similar to the IETF versions, making it possible for + implementations to be created or modified to support either set of + specifications. + + For those interested, the RDMA Consortium protocol documents + (draft-culley-iwarp-mpa-v1.0.pdf [RDMA-MPA], draft-shah-iwarp-ddp-v1.0.pdf + [RDMA-DDP], and draft-recio-iwarp-rdmac-v1.0.pdf [RDMA-RDMAC]) can be + obtained at http://www.rdmaconsortium.org/home. + + In this section, implementations of MPA/DDP/RDMA that conform to the + RDMAC specifications are called RDMAC RNICs. Implementations of + MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs. + + Without the exchange of MPA Request/Reply Frames, there is no + standard mechanism for enabling RDMAC RNICs to interoperate with IETF + RNICs. 
Even if a ULP uses a well-known port to start an IETF RNIC + immediately in RDMA mode (i.e., without exchanging the MPA + Request/Reply messages), there is no reason to believe an IETF RNIC + will interoperate with an RDMAC RNIC because of the differences in + the version number in the DDP and RDMAP headers on the wire. + + Therefore, the ULP or other supporting entity at the RDMAC RNIC must + implement MPA Request/Reply Frames on behalf of the RNIC in order to + negotiate the connection parameters. The following section describes + the results following the exchange of the MPA Request/Reply Frames + before the conversion from streaming to RDMA mode. + + + + +Culley, et al. Standards Track [Page 62] + +RFC 5044 MPA Framing for TCP October 2007 + + +C.1. Negotiated Parameters + + Three types of RNICs are considered: + + Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols that + has a ULP or other supporting entity that exchanges the MPA + Request/Reply Frames in streaming mode before the conversion to RDMA + mode. + + Non-permissive IETF RNIC - an RNIC implementing the IETF protocols + that is not capable of implementing the RDMAC protocols. Such an + RNIC can only interoperate with other IETF RNICs. + + Permissive IETF RNIC - an RNIC implementing the IETF protocols that + is capable of implementing the RDMAC protocols on a per-connection + basis. + + The Permissive IETF RNIC is recommended for those implementers that + want maximum interoperability with other RNIC implementations. + + The values used by these three RNIC types for the MPA, DDP, and RDMAP + versions as well as MPA Markers and CRC are summarized in Figure 14. 
+
+   +----------------++-----------+-----------+-----------+-----------+
+   | RNIC TYPE      || DDP/RDMAP |    MPA    |    MPA    |    MPA    |
+   |                ||  Version  | Revision  |  Markers  |    CRC    |
+   +----------------++-----------+-----------+-----------+-----------+
+   +----------------++-----------+-----------+-----------+-----------+
+   | RDMAC          ||     0     |     0     |     1     |     1     |
+   |                ||           |           |           |           |
+   +----------------++-----------+-----------+-----------+-----------+
+   | IETF           ||     1     |     1     |  0 or 1   |  0 or 1   |
+   | Non-permissive ||           |           |           |           |
+   +----------------++-----------+-----------+-----------+-----------+
+   | IETF           ||  1 or 0   |  1 or 0   |  0 or 1   |  0 or 1   |
+   | permissive     ||           |           |           |           |
+   +----------------++-----------+-----------+-----------+-----------+
+
+   Figure 14: Connection Parameters for the RNIC Types
+   for MPA Markers and MPA CRC, enabled=1, disabled=0.
+
+   It is assumed there is no mixing of versions allowed between MPA,
+   DDP, and RDMAP.  The RNIC either generates the RDMAC protocols on the
+   wire (version is zero) or uses the IETF protocols (version is one).
+
+   During the exchange of the MPA Request/Reply Frames, each peer
+   provides its MPA Revision, Marker preference (M: 0=disabled,
+   1=enabled), and CRC preference.  The MPA Revision provided in the MPA
+   Request Frame and the MPA Reply Frame may differ.
+
+   From the information in the MPA Request/Reply Frames, each side sets
+   the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
+   well as the state of the Markers for each half connection.  Between
+   DDP and RDMAP, no mixing of versions is allowed.  Moreover, the DDP
+   and RDMAP version MUST be identical in the two directions.  The RNIC
+   either generates the RDMAC protocols on the wire (version is zero) or
+   uses the IETF protocols (version is one).
+
+   In the following sections, the figures do not discuss CRC negotiation
+   because there is no interoperability issue for CRCs.  Since the RDMAC
+   RNIC will always request CRC use, according to the IETF MPA
+   specification, both peers MUST generate and check CRCs.
+
+C.2.  RDMAC RNIC and Non-Permissive IETF RNIC
+
+   Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
+   with an RDMAC RNIC, despite the fact that both peers exchange MPA
+   Request/Reply Frames.  For a Non-permissive IETF RNIC, the MPA
+   negotiation has no effect on the DDP/RDMAP version and it is unable
+   to interoperate with the RDMAC RNIC.
+
+   The rows in the figure show the state of the Marker field in the MPA
+   Request Frame sent by the MPA Initiator.  The columns show the state
+   of the Marker field in the MPA Reply Frame sent by the MPA Responder.
+   Each type of RNIC is shown as an Initiator and a Responder.  The
+   connection results are shown in the lower right corner, at the
+   intersection of the different RNIC types, where V=0 is the RDMAC
+   DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
+   Markers are disabled, and M=1 means MPA Markers are enabled.  The
+   negotiated Marker state is shown as X/Y, for the receive direction of
+   the Initiator/Responder.
+
+   +---------------------------++-----------------------+
+   |            MPA            ||          MPA          |
+   |          CONNECT          ||       Responder       |
+   |   MODE  +-----------------++-------+---------------+
+   |         |       RNIC      || RDMAC |     IETF      |
+   |         |       TYPE      ||       | Non-permissive|
+   |         |          +------++-------+-------+-------+
+   |         |          |MARKER||  M=1  |  M=0  |  M=1  |
+   +---------+----------+------++-------+-------+-------+
+   +---------+----------+------++-------+-------+-------+
+   |         |   RDMAC  | M=1  ||  V=0  | close | close |
+   |         |          |      || M=1/1 |       |       |
+   |         +----------+------++-------+-------+-------+
+   |   MPA   |          | M=0  || close |  V=1  |  V=1  |
+   |Initiator|   IETF   |      ||       | M=0/0 | M=0/1 |
+   |         |Non-perms.+------++-------+-------+-------+
+   |         |          | M=1  || close |  V=1  |  V=1  |
+   |         |          |      ||       | M=1/0 | M=1/1 |
+   +---------+----------+------++-------+-------+-------+
+
+   Figure 15: MPA Negotiation between an RDMAC RNIC and
+   a Non-Permissive IETF RNIC
+
+C.2.1.  RDMAC RNIC Initiator
+
+   If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
+   Frame with Rev field set to zero and the M and C bits set to one.
+   Because the Non-permissive IETF RNIC cannot dynamically downgrade the
+   version number it uses for DDP and RDMAP, it would send an MPA Reply
+   Frame with the Rev field equal to one and then gracefully close the
+   connection.
+
+C.2.2.  Non-Permissive IETF RNIC Initiator
+
+   If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
+   Request Frame with Rev field equal to one.  The ULP or supporting
+   entity for the RDMAC RNIC responds with an MPA Reply Frame that has
+   the Rev field equal to zero and the M bit set to one.  The
+   Non-permissive IETF RNIC will gracefully close the connection after
+   it reads the incompatible Rev field in the MPA Reply Frame.
+
+C.2.3.  RDMAC RNIC and Permissive IETF RNIC
+
+   Figure 16 shows that a Permissive IETF RNIC can interoperate with an
+   RDMAC RNIC regardless of its Marker preference.  The figure uses the
+   same format as shown with the Non-permissive IETF RNIC.
+
+   +---------------------------++-----------------------+
+   |            MPA            ||          MPA          |
+   |          CONNECT          ||       Responder       |
+   |   MODE  +-----------------++-------+---------------+
+   |         |       RNIC      || RDMAC |     IETF      |
+   |         |       TYPE      ||       |  Permissive   |
+   |         |          +------++-------+-------+-------+
+   |         |          |MARKER||  M=1  |  M=0  |  M=1  |
+   +---------+----------+------++-------+-------+-------+
+   +---------+----------+------++-------+-------+-------+
+   |         |   RDMAC  | M=1  ||  V=0  |  N/A  |  V=0  |
+   |         |          |      || M=1/1 |       | M=1/1 |
+   |         +----------+------++-------+-------+-------+
+   |   MPA   |          | M=0  ||  V=0  |  V=1  |  V=1  |
+   |Initiator|   IETF   |      || M=1/1 | M=0/0 | M=0/1 |
+   |         |Permissive+------++-------+-------+-------+
+   |         |          | M=1  ||  V=0  |  V=1  |  V=1  |
+   |         |          |      || M=1/1 | M=1/0 | M=1/1 |
+   +---------+----------+------++-------+-------+-------+
+
+   Figure 16: MPA Negotiation between an RDMAC RNIC and
+   a Permissive IETF RNIC
+
+   A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
+   Rev field of the MPA Request/Reply Frames and then adjust its receive
+   Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC.  As
+   a result, as an MPA Responder, the Permissive IETF RNIC will never
+   return an MPA Reply Frame with the M bit set to zero.  This case is
+   shown as not applicable (N/A) in Figure 16.
+
+C.2.4.  RDMAC RNIC Initiator
+
+   When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
+   entity prepares an MPA Request message and sets the revision to zero
+   and the M bit and C bit to one.
+
+   The Permissive IETF Responder receives the MPA Request message and
+   checks the revision field.  Since it is capable of generating RDMAC
+   DDP/RDMAP headers, it sends an MPA Reply message with revision set to
+   zero and the M and C bits set to one.  The Responder must inform its
+   ULP that it is generating version zero DDP/RDMAP messages.
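
The outcomes tabulated in Figures 15 and 16 can be condensed into a
small decision function.  The following Python sketch is illustrative
only and is not part of the specification: the names RnicType, Peer,
Outcome, and negotiate are hypothetical, and the sketch models only the
resulting DDP/RDMAP version and receive-side Marker state, not the MPA
Request/Reply Frames themselves.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RnicType(Enum):
    """The three RNIC types of Section C.1 (names are hypothetical)."""
    RDMAC = "Upgraded RDMAC"
    IETF_NON_PERMISSIVE = "Non-permissive IETF"
    IETF_PERMISSIVE = "Permissive IETF"


@dataclass(frozen=True)
class Peer:
    kind: RnicType
    # Marker preference (M bit). An RDMAC RNIC always uses M=1, so this
    # field is ignored whenever an RDMAC RNIC is part of the connection.
    markers: bool = True


@dataclass(frozen=True)
class Outcome:
    version: int                 # V: 0 = RDMAC DDP/RDMAP, 1 = IETF DDP/RDMAP
    initiator_rx_markers: bool   # X in the figures' "M=X/Y" notation
    responder_rx_markers: bool   # Y in the figures' "M=X/Y" notation


def negotiate(initiator: Peer, responder: Peer) -> Optional[Outcome]:
    """Return the negotiated parameters, or None if the connection closes."""
    kinds = {initiator.kind, responder.kind}
    if RnicType.RDMAC in kinds:
        if RnicType.IETF_NON_PERMISSIVE in kinds:
            # Figure 15: the Non-permissive IETF RNIC cannot downgrade.
            return None
        # RDMAC with RDMAC, or with a Permissive IETF RNIC (Figure 16):
        # version zero with Markers enabled in both directions.
        return Outcome(0, True, True)
    # Both peers are IETF RNICs: version one, and each receive direction
    # follows the M bit that side placed in its own frame.
    return Outcome(1, initiator.markers, responder.markers)


if __name__ == "__main__":
    rdmac = Peer(RnicType.RDMAC)
    permissive = Peer(RnicType.IETF_PERMISSIVE, markers=False)
    strict = Peer(RnicType.IETF_NON_PERMISSIVE, markers=True)
    print(negotiate(rdmac, permissive))
    print(negotiate(rdmac, strict))
    print(negotiate(strict, permissive))
```

The sketch deliberately ignores CRC negotiation, for the reason given
in Section C.1: an RDMAC RNIC always requests CRC use, so CRCs raise no
interoperability issue.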
+
+C.2.5.  Permissive IETF RNIC Initiator
+
+   If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
+   Request Frame setting the Rev field to one.  Regardless of the value
+   of the M bit in the MPA Request Frame, the ULP or other supporting
+   entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
+   equal to zero and the M bit set to one.
+
+   When the Initiator reads the Rev field of the MPA Reply Frame and
+   finds that its peer is an RDMAC RNIC, it must inform its ULP that it
+   should generate version zero DDP/RDMAP messages and enable MPA
+   Markers and CRC.
+
+C.3.  Non-Permissive IETF RNIC and Permissive IETF RNIC
+
+   For completeness, Figure 17 below shows the results of MPA
+   negotiation between a Non-permissive IETF RNIC and a Permissive IETF
+   RNIC.  The important point from this figure is that an IETF RNIC
+   cannot detect whether its peer is a Permissive or Non-permissive
+   RNIC.
+
+   +---------------------------++-------------------------------+
+   |            MPA            ||              MPA              |
+   |          CONNECT          ||           Responder           |
+   |   MODE  +-----------------++---------------+---------------+
+   |         |       RNIC      ||     IETF      |     IETF      |
+   |         |       TYPE      || Non-permissive|  Permissive   |
+   |         |          +------++-------+-------+-------+-------+
+   |         |          |MARKER||  M=0  |  M=1  |  M=0  |  M=1  |
+   +---------+----------+------++-------+-------+-------+-------+
+   +---------+----------+------++-------+-------+-------+-------+
+   |         |          | M=0  ||  V=1  |  V=1  |  V=1  |  V=1  |
+   |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
+   |         |Non-perms.+------++-------+-------+-------+-------+
+   |         |          | M=1  ||  V=1  |  V=1  |  V=1  |  V=1  |
+   |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+   |   MPA   +----------+------++-------+-------+-------+-------+
+   |Initiator|          | M=0  ||  V=1  |  V=1  |  V=1  |  V=1  |
+   |         |   IETF   |      || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
+   |         |Permissive+------++-------+-------+-------+-------+
+   |         |          | M=1  ||  V=1  |  V=1  |  V=1  |  V=1  |
+   |         |          |      || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+   +---------+----------+------++-------+-------+-------+-------+
+
+   Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
+   Permissive IETF RNIC.
+
+Normative References
+
+   [iSCSI]     Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M.,
+               and E. Zeidner, "Internet Small Computer Systems
+               Interface (iSCSI)", RFC 3720, April 2004.
+
+   [RFC1191]   Mogul, J. and S. Deering, "Path MTU discovery", RFC
+               1191, November 1990.
+
+   [RFC2018]   Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
+               Selective Acknowledgment Options", RFC 2018, October
+               1996.
+
+   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
+               Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+   [RFC2401]   Kent, S. and R. Atkinson, "Security Architecture for the
+               Internet Protocol", RFC 2401, November 1998.
+
+   [RFC3723]   Aboba, B., Tseng, J., Walker, J., Rangan, V., and F.
+               Travostino, "Securing Block Storage Protocols over IP",
+               RFC 3723, April 2004.
+
+   [RFC793]    Postel, J., "Transmission Control Protocol", STD 7, RFC
+               793, September 1981.
+
+   [RDMASEC]   Pinkerton, J. and E. Deleganes, "Direct Data Placement
+               Protocol (DDP) / Remote Direct Memory Access Protocol
+               (RDMAP) Security", RFC 5042, October 2007.
+
+Informative References
+
+   [APPL]       Bestler, C. and L. Coene, "Applicability of Remote
+                Direct Memory Access Protocol (RDMA) and Direct Data
+                Placement (DDP)", RFC 5045, October 2007.
+
+   [CRCTCP]     Stone, J. and C. Partridge, "When the CRC and TCP
+                checksum disagree", ACM Sigcomm, Sept. 2000.
+
+   [DAT-API]    DAT Collaborative, "kDAPL (Kernel Direct Access
+                Programming Library) and uDAPL (User Direct Access
+                Programming Library)", Http://www.datcollaborative.org.
+
+   [DDP]        Shah, H., Pinkerton, J., Recio, R., and P. Culley,
+                "Direct Data Placement over Reliable Transports", RFC
+                5041, October 2007.
+
+   [iSER]       Ko, M., Chadalapaka, M., Hufferd, J., Elzur, U., Shah,
+                H., and P. Thaler, "Internet Small Computer System
+                Interface (iSCSI) Extensions for Remote Direct Memory
+                Access (RDMA)", RFC 5046, October 2007.
+
+   [IT-API]     The Open Group, "Interconnect Transport API (IT-API)",
+                Version 2.1, http://www.opengroup.org.
+
+   [NFSv4CHAN]  Williams, N., "On the Use of Channel Bindings to Secure
+                Channels", Work in Progress, June 2006.
+
+   [RDMA-DDP]   "Direct Data Placement over Reliable Transports (Version
+                1.0)", RDMA Consortium, October 2002,
+                <http://www.rdmaconsortium.org/home/draft-shah-iwarp-ddp-v1.0.pdf>.
+
+   [RDMA-MPA]   "Marker PDU Aligned Framing for TCP Specification
+                (Version 1.0)", RDMA Consortium, October 2002,
+                <http://www.rdmaconsortium.org/home/draft-culley-iwarp-mpa-v1.0.pdf>.
+
+   [RDMA-RDMAC] "An RDMA Protocol Specification (Version 1.0)", RDMA
+                Consortium, October 2002,
+                <http://www.rdmaconsortium.org/home/draft-recio-iwarp-rdmac-v1.0.pdf>.
+
+   [RDMAP]      Recio, R., Culley, P., Garcia, D., Hilland, J., and B.
+                Metzler, "A Remote Direct Memory Access Protocol
+                Specification", RFC 5040, October 2007.
+
+   [RFC792]     Postel, J., "Internet Control Message Protocol", STD 5,
+                RFC 792, September 1981.
+
+   [RFC896]     Nagle, J., "Congestion control in IP/TCP internetworks",
+                RFC 896, January 1984.
+
+   [RFC1122]    Braden, R., "Requirements for Internet Hosts -
+                Communication Layers", STD 3, RFC 1122, October 1989.
+
+   [RFC4960]    Stewart, R., Ed., "Stream Control Transmission
+                Protocol", RFC 4960, September 2007.
+
+   [RFC4296]    Bailey, S. and T. Talpey, "The Architecture of Direct
+                Data Placement (DDP) and Remote Direct Memory Access
+                (RDMA) on Internet Protocols", RFC 4296, December 2005.
+
+   [RFC4297]    Romanow, A., Mogul, J., Talpey, T., and S. Bailey,
+                "Remote Direct Memory Access (RDMA) over IP Problem
+                Statement", RFC 4297, December 2005.
+
+   [RFC4301]    Kent, S. and K. Seo, "Security Architecture for the
+                Internet Protocol", RFC 4301, December 2005.
+
+   [VERBS-RMDA] "RDMA Protocol Verbs Specification", RDMA Consortium
+                standard, April 2003,
+                <http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf>.
+
+Contributors
+
+   Dwight Barron
+   Hewlett-Packard Company
+   20555 SH 249
+   Houston, TX 77070-2698 USA
+   Phone: 281-514-2769
+   EMail: dwight.barron@hp.com
+
+   Jeff Chase
+   Department of Computer Science
+   Duke University
+   Durham, NC 27708-0129 USA
+   Phone: +1 919 660 6559
+   EMail: chase@cs.duke.edu
+
+   Ted Compton
+   EMC Corporation
+   Research Triangle Park, NC 27709 USA
+   Phone: 919-248-6075
+   EMail: compton_ted@emc.com
+
+   Dave Garcia
+   24100 Hutchinson Rd.
+   Los Gatos, CA 95033
+   Phone: 831 247 4464
+   EMail: Dave.Garcia@StanfordAlumni.org
+
+   Hari Ghadia
+   Gen10 Technology, Inc.
+   1501 W Shady Grove Road
+   Grand Prairie, TX 75050
+   Phone: (972) 301 3630
+   EMail: hghadia@gen10technology.com
+
+   Howard C. Herbert
+   Intel Corporation
+   MS CH7-404
+   5000 West Chandler Blvd.
+   Chandler, AZ 85226
+   Phone: 480-554-3116
+   EMail: howard.c.herbert@intel.com
+
+   Jeff Hilland
+   Hewlett-Packard Company
+   20555 SH 249
+   Houston, TX 77070-2698 USA
+   Phone: 281-514-9489
+   EMail: jeff.hilland@hp.com
+
+   Mike Ko
+   IBM
+   650 Harry Rd.
+   San Jose, CA 95120
+   Phone: (408) 927-2085
+   EMail: mako@us.ibm.com
+
+   Mike Krause
+   Hewlett-Packard Corporation, 43LN
+   19410 Homestead Road
+   Cupertino, CA 95014 USA
+   Phone: +1 (408) 447-3191
+   EMail: krause@cup.hp.com
+
+   Dave Minturn
+   Intel Corporation
+   MS JF1-210
+   5200 North East Elam Young Parkway
+   Hillsboro, Oregon 97124
+   Phone: 503-712-4106
+   EMail: dave.b.minturn@intel.com
+
+   Jim Pinkerton
+   Microsoft, Inc.
+   One Microsoft Way
+   Redmond, WA 98052 USA
+   EMail: jpink@microsoft.com
+
+   Hemal Shah
+   Broadcom Corporation
+   5300 California Avenue
+   Irvine, CA 92617 USA
+   Phone: +1 (949) 926-6941
+   EMail: hemal@broadcom.com
+
+   Allyn Romanow
+   Cisco Systems
+   170 W Tasman Drive
+   San Jose, CA 95134 USA
+   Phone: +1 408 525 8836
+   EMail: allyn@cisco.com
+
+   Tom Talpey
+   Network Appliance
+   1601 Trapelo Road #16
+   Waltham, MA 02451 USA
+   Phone: +1 (781) 768-5329
+   EMail: thomas.talpey@netapp.com
+
+   Patricia Thaler
+   Broadcom
+   16215 Alton Parkway
+   Irvine, CA 92618
+   Phone: 916 570 2707
+   EMail: pthaler@broadcom.com
+
+   Jim Wendt
+   Hewlett Packard Corporation
+   8000 Foothills Boulevard MS 5668
+   Roseville, CA 95747-5668 USA
+   Phone: +1 916 785 5198
+   EMail: jim_wendt@hp.com
+
+   Jim Williams
+   Emulex Corporation
+   580 Main Street
+   Bolton, MA 01740 USA
+   Phone: +1 978 779 7224
+   EMail: jim.williams@emulex.com
+
+Authors' Addresses
+
+   Paul R. Culley
+   Hewlett-Packard Company
+   20555 SH 249
+   Houston, TX 77070-2698 USA
+   Phone: 281-514-5543
+   EMail: paul.culley@hp.com
+
+   Uri Elzur
+   5300 California Avenue
+   Irvine, CA 92617, USA
+   Phone: 949.926.6432
+   EMail: uri@broadcom.com
+
+   Renato J Recio
+   IBM
+   Internal Zip 9043
+   11400 Burnett Road
+   Austin, Texas 78759
+   Phone: 512-838-3685
+   EMail: recio@us.ibm.com
+
+   Stephen Bailey
+   Sandburst Corporation
+   600 Federal Street
+   Andover, MA 01810 USA
+   Phone: +1 978 689 1614
+   EMail: steph@sandburst.com
+
+   John Carrier
+   Cray Inc.
+   411 First Avenue S, Suite 600
+   Seattle, WA 98104-2860
+   Phone: 206-701-2090
+   EMail: carrier@cray.com
+
+Full Copyright Statement
+
+   Copyright (C) The IETF Trust (2007).
+
+   This document is subject to the rights, licenses and restrictions
+   contained in BCP 78, and except as set forth therein, the authors
+   retain all their rights.
+
+   This document and the information contained herein are provided on an
+   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
+   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
+   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
+   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+   The IETF takes no position regarding the validity or scope of any
+   Intellectual Property Rights or other rights that might be claimed to
+   pertain to the implementation or use of the technology described in
+   this document or the extent to which any license under such rights
+   might or might not be available; nor does it represent that it has
+   made any independent effort to identify any such rights.  Information
+   on the procedures with respect to rights in RFC documents can be
+   found in BCP 78 and BCP 79.
+
+   Copies of IPR disclosures made to the IETF Secretariat and any
+   assurances of licenses to be made available, or the result of an
+   attempt made to obtain a general license or permission for the use of
+   such proprietary rights by implementers or users of this
+   specification can be obtained from the IETF on-line IPR repository at
+   http://www.ietf.org/ipr.
+
+   The IETF invites any interested party to bring to its attention any
+   copyrights, patents or patent applications, or other proprietary
+   rights that may cover technology that may be required to implement
+   this standard.  Please address the information to the IETF at
+   ietf-ipr@ietf.org.