summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc7609.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc7609.txt')
-rw-r--r--doc/rfc/rfc7609.txt8011
1 files changed, 8011 insertions, 0 deletions
diff --git a/doc/rfc/rfc7609.txt b/doc/rfc/rfc7609.txt
new file mode 100644
index 0000000..4abff4e
--- /dev/null
+++ b/doc/rfc/rfc7609.txt
@@ -0,0 +1,8011 @@
+
+
+
+
+
+
+Independent Submission M. Fox
+Request for Comments: 7609 C. Kassimis
+Category: Informational J. Stevens
+ISSN: 2070-1721 IBM
+ August 2015
+
+
+ IBM's Shared Memory Communications over RDMA (SMC-R) Protocol
+
+Abstract
+
+ This document describes IBM's Shared Memory Communications over RDMA
+ (SMC-R) protocol. This protocol provides Remote Direct Memory Access
+ (RDMA) communications to TCP endpoints in a manner that is
+ transparent to socket applications. It further provides for dynamic
+ discovery of partner RDMA capabilities and dynamic setup of RDMA
+ connections, as well as transparent high availability and load
+ balancing when redundant RDMA network paths are available. It
+ maintains many of the traditional TCP/IP qualities of service such as
+ filtering that enterprise users demand, as well as TCP socket
+ semantics such as urgent data.
+
+Status of This Memo
+
+ This document is not an Internet Standards Track specification; it is
+ published for informational purposes.
+
+ This is a contribution to the RFC Series, independently of any other
+ RFC stream. The RFC Editor has chosen to publish this document at
+ its discretion and makes no statement about its value for
+ implementation or deployment. Documents approved for publication by
+ the RFC Editor are not a candidate for any level of Internet
+ Standard; see Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc7609.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 1]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+Copyright Notice
+
+ Copyright (c) 2015 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document.
+
+Table of Contents
+
+ 1. Introduction ....................................................5
+ 1.1. Protocol Overview ..........................................6
+ 1.1.1. Hardware Requirements ...............................8
+ 1.2. Definition of Common Terms .................................8
+ 1.3. Conventions Used in This Document .........................11
+ 2. Link Architecture ..............................................11
+ 2.1. Remote Memory Buffers (RMBs) ..............................12
+ 2.2. SMC-R Link Groups .........................................18
+ 2.2.1. Link Group Types ...................................18
+ 2.2.2. Maximum Number of Links in Link Group ..............21
+ 2.2.3. Forming and Managing Link Groups ...................23
+ 2.2.4. SMC-R Link Identifiers .............................24
+ 2.3. SMC-R Resilience and Load Balancing .......................24
+ 3. SMC-R Rendezvous Architecture ..................................26
+ 3.1. TCP Options ...............................................26
+ 3.2. Connection Layer Control (CLC) Messages ...................27
+ 3.3. LLC Messages ..............................................27
+ 3.4. CDC Messages ..............................................29
+ 3.5. Rendezvous Flows ..........................................29
+ 3.5.1. First Contact ......................................29
+ 3.5.1.1. Pre-negotiation of TCP Options ............29
+ 3.5.1.2. Client Proposal ...........................30
+ 3.5.1.3. Server Acceptance .........................32
+ 3.5.1.4. Client Confirmation .......................32
+ 3.5.1.5. Link (QP) Confirmation ....................32
+ 3.5.1.6. Second SMC-R Link Setup ...................35
+ 3.5.1.6.1. Client Processing of ADD LINK
+ LLC Message from Server ........35
+ 3.5.1.6.2. Server Processing of ADD LINK
+ Reply LLC Message from Client ..36
+ 3.5.1.6.3. Exchange of RKeys on
+ Second SMC-R Link ..............38
+ 3.5.1.6.4. Aborting SMC-R and
+ Falling Back to IP .............38
+
+
+
+Fox, et al. Informational [Page 2]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 3.5.2. Subsequent Contact .................................38
+ 3.5.2.1. SMC-R Proposal ............................39
+ 3.5.2.2. SMC-R Acceptance ..........................40
+ 3.5.2.3. SMC-R Confirmation ........................41
+ 3.5.2.4. TCP Data Flow Race with SMC
+ Confirm CLC Message .......................41
+ 3.5.3. First Contact Variation: Creating a
+ Parallel Link Group ................................42
+ 3.5.4. Normal SMC-R Link Termination ......................43
+ 3.5.5. Link Group Management Flows ........................44
+ 3.5.5.1. Adding and Deleting Links in an
+ SMC-R Link Group ..........................44
+ 3.5.5.1.1. Server-Initiated ADD
+ LINK Processing ................45
+ 3.5.5.1.2. Client-Initiated ADD
+ LINK Processing ................45
+ 3.5.5.1.3. Server-Initiated DELETE
+ LINK Processing ................46
+ 3.5.5.1.4. Client-Initiated DELETE
+ LINK Request ...................48
+ 3.5.5.2. Managing Multiple RKeys over
+ Multiple SMC-R Links in a Link Group ......49
+ 3.5.5.2.1. Adding a New RMB to an
+ SMC-R Link Group ...............50
+ 3.5.5.2.2. Deleting an RMB from an
+ SMC-R Link Group ...............53
+ 3.5.5.2.3. Adding a New SMC-R Link to a
+ Link Group with Multiple RMBs ..54
+ 3.5.5.3. Serialization of LLC Exchanges,
+ and Collisions ............................56
+ 3.5.5.3.1. Collisions with ADD
+ LINK / CONFIRM LINK Exchange ...57
+ 3.5.5.3.2. Collisions during
+ DELETE LINK Exchange ...........58
+ 3.5.5.3.3. Collisions during
+ CONFIRM RKEY Exchange ..........59
+ 4. SMC-R Memory-Sharing Architecture ..............................60
+ 4.1. RMB Element Allocation Considerations .....................60
+ 4.2. RMB and RMBE Format .......................................60
+ 4.3. RMBE Control Information ..................................60
+ 4.4. Use of RMBEs ..............................................61
+ 4.4.1. Initializing and Accessing RMBEs ...................61
+ 4.4.2. RMB Element Reuse and Conflict Resolution ..........62
+ 4.5. SMC-R Protocol Considerations .............................63
+ 4.5.1. SMC-R Protocol Optimized Window Size Updates .......63
+ 4.5.2. Small Data Sends ...................................64
+ 4.5.3. TCP Keepalive Processing ...........................65
+
+
+
+
+Fox, et al. Informational [Page 3]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 4.6. TCP Connection Failover between SMC-R Links ...............67
+ 4.6.1. Validating Data Integrity ..........................67
+ 4.6.2. Resuming the TCP Connection on a New SMC-R Link ....68
+ 4.7. RMB Data Flows ............................................69
+ 4.7.1. Scenario 1: Send Flow, Window Size Unconstrained ...69
+ 4.7.2. Scenario 2: Send/Receive Flow, Window Size
+ Unconstrained ......................................71
+ 4.7.3. Scenario 3: Send Flow, Window Size Constrained .....72
+ 4.7.4. Scenario 4: Large Send, Flow Control, Full
+ Window Size Writes .................................74
+ 4.7.5. Scenario 5: Send Flow, Urgent Data, Window
+ Size Unconstrained .................................77
+ 4.7.6. Scenario 6: Send Flow, Urgent Data, Window
+ Size Closed ........................................79
+ 4.8. Connection Termination ....................................81
+ 4.8.1. Normal SMC-R Connection Termination Flows ..........81
+ 4.8.2. Abnormal SMC-R Connection Termination Flows ........86
+ 4.8.3. Other SMC-R Connection Termination Conditions ......88
+ 5. Security Considerations ........................................89
+ 5.1. VLAN Considerations .......................................89
+ 5.2. Firewall Considerations ...................................89
+ 5.3. Host-Based IP Filters .....................................89
+ 5.4. Intrusion Detection Services ..............................90
+ 5.5. IP Security (IPsec) .......................................90
+ 5.6. TLS/SSL ...................................................90
+ 6. IANA Considerations ............................................90
+ 7. Normative References ...........................................91
+ Appendix A. Formats ...............................................92
+ A.1. TCP Option .................................................92
+ A.2. CLC Messages ...............................................92
+ A.2.1. Peer ID Format ......................................93
+ A.2.2. SMC Proposal CLC Message Format .....................94
+ A.2.3. SMC Accept CLC Message Format .......................98
+ A.2.4. SMC Confirm CLC Message Format .....................102
+ A.2.5. SMC Decline CLC Message Format .....................105
+ A.3. LLC Messages ..............................................106
+ A.3.1. CONFIRM LINK LLC Message Format ....................107
+ A.3.2. ADD LINK LLC Message Format ........................109
+ A.3.3. ADD LINK CONTINUATION LLC Message Format ...........112
+ A.3.4. DELETE LINK LLC Message Format .....................115
+ A.3.5. CONFIRM RKEY LLC Message Format ....................117
+ A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format .......120
+ A.3.7. DELETE RKEY LLC Message Format .....................122
+ A.3.8. TEST LINK LLC Message Format .......................124
+ A.4. Connection Data Control (CDC) Message Format ..............125
+
+
+
+
+
+
+Fox, et al. Informational [Page 4]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Appendix B. Socket API Considerations ............................129
+ B.1. setsockopt() / getsockopt() Considerations ................130
+ Appendix C. Rendezvous Error Scenarios ...........................131
+ C.1. SMC Decline during CLC Negotiation ........................131
+ C.2. SMC Decline during LLC Negotiation ........................131
+ C.3. The SMC Decline Window ....................................133
+ C.4. Out-of-Sync Conditions during SMC-R Negotiation ...........133
+ C.5. Timeouts during CLC Negotiation ...........................134
+ C.6. Protocol Errors during CLC Negotiation ....................134
+ C.7. Timeouts during LLC Negotiation ...........................135
+ C.7.1. Recovery Actions for LLC Timeouts and Failures .....136
+ C.8. Failure to Add Second SMC-R Link to a Link Group ..........142
+ Authors' Addresses ...............................................143
+
+1. Introduction
+
+ This document specifies IBM's Shared Memory Communications over RDMA
+ (SMC-R) protocol. SMC-R is a protocol for Remote Direct Memory
+ Access (RDMA) communication between TCP socket endpoints. SMC-R runs
+ over networks that support RDMA over Converged Ethernet (RoCE). It
+ is designed to permit existing TCP applications to benefit from RDMA
+ without requiring modifications to the applications or predefinition
+ of RDMA partners.
+
+ SMC-R provides dynamic discovery of the RDMA capabilities of TCP
+ peers and automatic setup of RDMA connections that those peers can
+ use. SMC-R also provides transparent high availability and
+ load-balancing capabilities that are demanded by enterprise
+ installations but are missing from current RDMA protocols. If
+ redundant RoCE-capable hardware such as RDMA-capable Network
+ Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R
+ can load-balance over that redundant hardware and can also
+ non-disruptively move TCP traffic from failed paths to surviving
+ paths, all seamlessly to the application and the sockets layer.
+ Because SMC-R preserves socket semantics and the TCP three-way
+ handshake, many TCP qualities of service such as filtering, load
+ balancing, and Secure Socket Layer (SSL) encryption are preserved, as
+ are TCP features such as urgent data.
+
+ Because of the dynamic discovery and setup of SMC-R connectivity
+ between peers, no RDMA connection manager (RDMA-CM) is required.
+ This also means that support for Unreliable Datagram (UD) Queue Pairs
+ (QPs) is also not required.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 5]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ It is recommended that the SMC-R services be implemented in kernel
+ space, which enables optimizations such as resource-sharing between
+ connections across multiple processes and also permits applications
+ using SMC-R to spawn multiple processes (e.g., fork) without losing
+ SMC-R functionality. A user-space implementation is compatible with
+ this architecture, but it may not support spawned processes (e.g.,
+ fork), which limits sharing and resource optimization to TCP
+ connections that originate from the same process. This might be an
+ appropriate design choice if the use case is a system that hosts a
+ large single process application that creates many TCP connections to
+ a peer host, or in implementations where a kernel-space
+ implementation is not possible or introduces excessive overhead for
+ "kernel space to user space" context switches.
+
+1.1. Protocol Overview
+
+ SMC-R defines the concept of the SMC-R link, which is a logical
+ point-to-point link using reliably connected queue pairs between
+ TCP/IP stack peers over a RoCE fabric. An SMC-R link is bound to a
+ specific hardware path, meaning a specific RNIC on each peer. SMC-R
+ links are created and maintained by an SMC-R layer, which may reside
+ in kernel space or user space, depending upon operating system and
+ implementation requirements. The SMC-R layer resides below the
+ sockets layer and directs data traffic for TCP connections between
+ connected peers over the RoCE fabric using RDMA rather than over a
+ TCP connection. The TCP/IP stack, with its requirements for
+ fragmentation, packetization, etc., is bypassed, and the application
+ data is moved between peers using RDMA.
+
+ Multiple SMC-R links between the same two TCP/IP stack peers are also
+ supported. A set of SMC-R links called a link group can be logically
+ bonded together to provide redundant connectivity. If there is
+ redundant hardware -- for example, two RNICs on each peer -- separate
+ SMC-R links are created between the peers to exploit that redundant
+ hardware. The link group architecture with redundant links provides
+ load balancing and increased bandwidth, as well as seamless failover.
+
+ Each SMC-R link group is associated with an area of memory called
+ Remote Memory Buffers (RMBs), which are areas of memory that are
+ available for SMC-R peers to write into using RDMA writes. Multiple
+ TCP connections between peers may be multiplexed over a single SMC-R
+ link, in which case the SMC-R layer manages the partitioning of the
+ RMBs between the TCP connections. This multiplexing reduces the RDMA
+ resources, such as QPs and RMBs, that are required to support
+ multiple connections between peers, and it also reduces the
+ processing and delays related to setting up QPs, pinning memory, and
+ other RDMA setup tasks when new TCP connections are created. In a
+ kernel-space SMC-R implementation in which the RMBs reside in kernel
+
+
+
+Fox, et al. Informational [Page 6]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ storage, this sharing and optimization works across multiple
+ processes executing on the same host. In a user-space SMC-R
+ implementation in which the RMBs reside in user space, this sharing
+ and optimization is limited to multiple TCP connections created by a
+ single process, as separate RMBs and QPs will be required for each
+ process.
+
+ SMC-R also introduces a rendezvous protocol that is used to
+ dynamically discover the RDMA capabilities of TCP connection partners
+ and exchange credentials necessary to exploit that capability if
+ present. TCP connections are set up using the normal TCP three-way
+ handshake [RFC793], with the addition of a new TCP option that
+ indicates SMC-R capability. If both partners indicate SMC-R
+ capability, then at the completion of the three-way TCP handshake the
+ SMC-R layers in each peer take control of the TCP connection and use
+ it to exchange additional Connection Layer Control (CLC) messages to
+ negotiate SMC-R credentials such as QP information; addressability
+ over the RoCE fabric; RMB buffer sizes; and keys and addresses for
+ accessing RMBs over RDMA. If at any time during this negotiation a
+ failure or decline occurs, the TCP connection falls back to using the
+ IP fabric.
+
+ If the SMC-R negotiation succeeds and either a new SMC-R link is set
+ up or an existing SMC-R link is chosen for the TCP connection, then
+ the SMC-R layers open the sockets to the applications and the
+ applications use the sockets as normal. The SMC-R layer intercepts
+ the socket reads and writes and moves the TCP connection data over
+ the SMC-R link, "out of band" to the TCP connection, which remains
+ open and idle over the IP fabric, except for termination flows and
+ possible keepalive flows. Regular TCP sequence numbering methods are
+ used for the TCP flows that do occur; data flowing over RDMA does not
+ use or affect TCP sequence numbers.
+
+ This architecture does not support fallback of active SMC-R
+ connections to IP. Once connection data has completed the switch to
+ RDMA, a TCP connection cannot be switched back to IP and will reset
+ if RDMA becomes unusable.
+
+ The SMC-R protocol defines the format of the RMBs that are used to
+ receive TCP connection data written over RDMA, as well as the
+ semantics for managing and writing to these buffers using Connection
+ Data Control (CDC) messages.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 7]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Finally, SMC-R defines Link Layer Control (LLC) messages that are
+ exchanged over the RoCE fabric between peer SMC-R layers to manage
+ the SMC-R links and link groups. These include messages to test and
+ confirm connectivity over an SMC-R link, add and delete SMC-R links
+ to or from the link group, and exchange RMB addressability
+ information.
+
+1.1.1. Hardware Requirements
+
+ SMC-R does not require full Converged Enhanced Ethernet switch
+ functionality. SMC-R functions over standard Ethernet fabrics,
+ provided that endpoint RNICs are provided and IEEE 802.3x Global
+ Pause Frame is supported and enabled in the switch fabric.
+
+ While SMC-R as specified in this document is designed to operate over
+ RoCE fabrics, adjustments to the rendezvous methods could enable it
+ to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.
+
+1.2. Definition of Common Terms
+
+ This section provides definitions of terms that have a specific
+ meaning to the SMC-R protocol and are used throughout this document.
+
+ SMC-R Link
+
+ An SMC-R link is a logical point-to-point connection over the RoCE
+ fabric via specific physical adapters (Media Access Control /
+ Global Identifier (MAC/GID)). The link is formed during the
+ "first contact" sequence of the TCP/IP three-way handshake
+ sequence that occurs over the IP fabric. During this handshake,
+ an RDMA reliably connected queue pair (RC-QP) connection is formed
+ between the two peer SMC hosts and is defined as the SMC-R link.
+ The SMC-R link can then support multiple TCP connections between
+ the two peers. An SMC-R link is associated with a single LAN (or
+ VLAN) segment and is not routable.
+
+ SMC-R Link Group
+
+ An SMC-R link group is a group of SMC-R links between the same two
+ SMC-R peers, typically with each link over unique RoCE adapters.
+ Each link in the link group has equal characteristics, such as the
+ same VLAN ID (if VLANs are in use), access to the same RMB(s), and
+ access to the same TCP server/client.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 8]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ SMC-R Peer
+
+ The SMC-R peer is the peer software stack within the peer
+ operating system with respect to the Shared Memory Communications
+ (messaging) protocol.
+
+ SMC-R Rendezvous
+
+ SMC-R Rendezvous is the SMC-R peer discovery and handshake
+ sequence that occurs transparently over the IP (Ethernet) fabric
+ during and immediately after the TCP connection three-way
+ handshake by exchanging the SMC-R capabilities and credentials
+ using experimental TCP option and CLC messages.
+
+ RoCE SendMsg
+
+ RoCE SendMsg is a send operation posted to a reliably connected
+ queue pair with inline data, for the purpose of transferring
+ control information between peers.
+
+ TCP Client
+
+ The TCP client is the TCP socket-based peer that initiates a TCP
+ connection.
+
+ TCP Server
+
+ The TCP server is the TCP socket-based peer that accepts a TCP
+ connection.
+
+ CLC Messages
+
+ The SMC-R protocol defines a set of Connection Layer Control
+ messages that flow over the TCP connection that are used to manage
+ SMC-R link rendezvous at TCP connection setup time. This
+ mechanism is analogous to SSL setup messages.
+
+ LLC Commands
+
+ The SMC-R protocol defines a set of RoCE Link Layer Control
+ commands that flow over the RoCE fabric using RoCE SendMsg, that
+ are used to manage SMC-R links, SMC-R link groups, and SMC-R
+ link group RMB expansion and contraction.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 9]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ CDC Message
+
+ The SMC-R protocol defines a Connection Data Control message that
+ flows over the RoCE fabric using RoCE SendMsg that is used to
+ manage the SMC-R connection data. This message provides
+ information about data being transferred over the out-of-band RDMA
+ connection, such as data cursors, sequence numbers, and data flags
+ (for example, urgent data). The receipt of this message also
+ provides an interrupt to inform the receiver that it has received
+ RDMA data.
+
+ RMB
+
+ A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
+ allocated in each of the peer hosts for a TCP (via SMC-R)
+ connection. The RMB is registered to the RNIC and allows remote
+ access by the remote peer using RDMA semantics. Each host is
+ passed the peer's RMB-specific access information (RMB Key (RKey)
+ and RMB element offset) during the SMC-R Rendezvous process. The
+ host stores socket application user data directly into the peer's
+ RMB using RDMA over RoCE.
+
+ RToken
+
+ The RToken is the combination of an RMB's RKey and RDMA virtual
+ address. An RToken provides RMB addressability information to an
+ RDMA peer.
+
+ RMBE
+
+ The Remote Memory Buffer Element (RMBE) is an area of an RMB that
+ is allocated to a specific TCP connection. The RMBE contains data
+ for the TCP connection. The RMBE represents the TCP receive
+ buffer, whereby the remote peer writes into the RMBE and the local
+ peer reads from the local RMBE. The alert token resolves to a
+ specific RMBE.
+
+ Alert Token
+
+ The SMC-R alert token is a 4-byte value that uniquely identifies
+ the TCP connection over an SMC-R connection. The alert token
+ allows the SMC peer to quickly identify the target TCP connection
+ that now has new work. The format of the token is defined by the
+ owning SMC-R endpoint and is considered opaque to the remote peer.
+ However, the token should not simply be an index to an RMBE; it
+ should reference a TCP connection and be able to be validated to
+ avoid reading data from stale connections.
+
+
+
+
+Fox, et al. Informational [Page 10]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ RNIC
+
+ The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC
+ that supports RDMA semantics and verbs using RoCE.
+
+ First Contact
+
+ "First contact" describes an SMC-R negotiation to set up the first
+ link in a link group.
+
+ Subsequent Contact
+
+ "Subsequent contact" describes an SMC-R negotiation between peers
+ who are using an already-existing SMC-R link group.
+
+1.3. Conventions Used in This Document
+
+ In the rendezvous flow diagrams, dashed lines (----) are used to
+ indicate flows over the TCP/IP fabric and dotted lines (....) are
+ used to indicate flows over the RoCE fabric.
+
+ In the data transfer ladder diagrams, dashed lines (----) are used to
+ indicate RDMA write operations and dotted lines (....) are used to
+ indicate CDC messages, which are RDMA messages with inline data that
+ contain control information for the connection.
+
+2. Link Architecture
+
+ An SMC-R link is based on reliably connected queue pairs (QPs) that
+ form a "logical point-to-point link" between the two SMC-R peers over
+ a RoCE fabric. An SMC-R link extends from SMC-R peer to SMC-R peer,
+ where typically each peer would be a TCP/IP stack and would reside on
+ separate hosts.
+
+ ,,.--..,_
+ +----+ _-`` `-, +-----+
+ |QP 8| - RoCE ', |QP 64|
+ | | / VLAN M . | |
+ +----+--------+/ \+-------+-----+
+ | RNIC 1 | SMC-R Link | RNIC 2 |
+ | |<--------------------->| |
+ +------------+ , /+------------+
+ MAC A (GID A) MAC B (GID B)
+ . .`
+ `', ,-`
+ ``''--''``
+
+ Figure 1: SMC-R Link Overview
+
+
+
+Fox, et al. Informational [Page 11]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Figure 1 illustrates an overview of the basic concepts of SMC-R peer-
+ to-peer connectivity; this is called the SMC-R link. The SMC-R link
+ forms a logical point-to-point connection between two SMC-R peers via
+ RoCE. The SMC-R link is defined and identified by the following
+ attributes:
+
+ SMC-R link = RC QPs
+ (source VMAC GID QP + target VMAC GID QP + VLAN ID)
+
+ The SMC-R link can optionally be associated with a VLAN ID. If VLANs
+ are in use for the associated IP (LAN) connection, then the VLAN
+ attribute is carried over on the SMC-R link. When VLANs are in use,
+ each SMC-R link group is associated with a single and specific VLAN.
+ The RoCE fabric is the same physical Ethernet LAN used for standard
+ TCP/IP-over-Ethernet communications, with switches as described in
+ Section 1.1.1.
+
+ An SMC-R link is designed to support multiple TCP connections between
+ the same two peers. An SMC-R link is intended to be long lived,
+ while the underlying TCP connections can dynamically come and go.
+ The associated RMBs can also be dynamically added and removed from
+ the link as needed. The first TCP connection between the peers
+ establishes the SMC-R link. Subsequent TCP connections then use the
+ previously established link. When the last TCP connection
+ terminates, the link can then be terminated, typically after an
+ implementation-defined idle timeout period has elapsed. The TCP
+ server is responsible for initiating and terminating the SMC-R link.
+
+2.1. Remote Memory Buffers (RMBs)
+
+ Figure 2 shows the hosts -- Hosts X and Y -- and their associated
+ RMBs within each host. With the SMC-R link, and the associated RKeys
+ and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can
+ remotely access its peer's RMBs using RDMA. The RKeys and virtual
+ addresses are exchanged during the rendezvous processing when the
+ link is established. The combination of the RKey and the virtual
+ address is the RToken. Note that the SMC-R link ends at the QP
+ providing access to the RMB (via the link + RToken).
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 12]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X Host Y
+ +-------------------+ ,.--.,_ +-------------------+
+ | | .'` '. | |
+ | Protection | ,' `, | Protection |
+ | Domain X | / \ | Domain Y |
+ | +------+ / \ +------+ |
+ | QP 8 |RNIC 1| | SMC-R Link | |RNIC 2| QP 64 |
+ | | | |<-------------------->| | | |
+ | | | || || | | |
+ | | +------+| VLAN A |+------+ | |
+ | | || || | |
+ | | | | RoCE | | | |
+ | |RToken X | \ / |RToken Y | |
+ | | | \ / | | |
+ | V | `. ,' | V |
+ | +--------+ | '._ ,' | +--------+ |
+ | | | | `''-'`` | | | |
+ | | RMB | | | | RMB | |
+ | | | | | | | |
+ | +--------+ | | +--------+ |
+ +-------------------+ +-------------------+
+
+ Figure 2: SMC-R Link and RMBs
+
+ An SMC-R link can support multiple RMBs that are independently
+ managed by each peer. The number and the size of RMBs are managed by
+ the peers based on the host's unique memory management requirements;
+ however, the maximum number of RMBs that can be associated to a link
+ group on one peer is 255. The QP has a single protection domain, but
+ each RMB has a unique RToken. All RTokens must be exchanged with the
+ peer.
+
+ Each peer manages the RMBs in its local memory for its remote SMC-R
+ peer by sharing access to the RMBs via RTokens with its peers. The
+ remote peer writes into the RMBs via RDMA, and the local peer (RMB
+ owner) then reads from the RMBs.
+
+ When two peers decide to use SMC-R for a given TCP connection, they
+ each allocate a local RMB element for the TCP connection and
+ communicate the location of this local RMB element during rendezvous
+ processing. To that end, RMB elements are created in pairs, with one
+ RMB element allocated locally on each peer of the SMC-R link.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 13]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ --- +------------+---------------+
+ /\ |Eye Catcher | |
+ | +------------+ |
+ | | |
+ RMB Element 1 | |
+ | | Receive Buffer |
+ | | |
+ | | |
+ \/ | |
+ --- +------------+---------------+
+ /\ |Eye Catcher | |
+ | +------------+ |
+ | | |
+ RMB Element 2 | |
+ | | Receive Buffer |
+ | | |
+ | | |
+ \/ | |
+ --- +----------------------------+
+ | . |
+ | . |
+ | . |
+ | . |
+ | (up to 255 elements) |
+ +----------------------------+
+
+ Figure 3: RMB Format
+
+ Figure 3 illustrates the basic format of an RMB. The RMB is a
+ virtual memory buffer whose backing real memory is pinned, which can
+ support up to 255 TCP connections to exactly one remote SMC-R peer.
+ Each RMB is therefore associated with the SMC-R links within a link
+ group for the two peers and a specific RoCE Protection Domain. Other
+ than the two peers identified by the SMC-R link, no other SMC-R peers
+ can have RDMA access to an RMB; this requires a unique Protection
+ Domain for every SMC-R link. This is critical to ensure integrity of
+ SMC-R communications.
+
+ RMBs are subdivided into multiple elements for efficiency, with each
+ RMB Element (RMBE) associated with a single TCP connection.
+ Therefore, multiple TCP connections across an SMC-R link group can
+ share the same memory for RDMA purposes, reducing the overhead of
+ having to register additional memory with the RNIC for every new TCP
+ connection. The number of elements in an RMB and the size of each
+ RMBE are entirely governed by the owning peer, subject to the SMC-R
+ architecture rules; however, all RMB elements within a given RMB must
+ be the same size. Each peer can decide the level of resource-sharing
+ that is desirable across TCP connections based on local constraints,
+
+
+
+Fox, et al. Informational [Page 14]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ such as available system memory. An RMB element is identified to the
+ remote SMC-R peer via an RMB Element Token, which consists of the
+ following:
+
+ o RMB RToken: The combination of the RKey and virtual address
+ provided by the RNIC that identifies the start of the RMB for RDMA
+ operations.
+
+ o RMB Index: Identifies the RMB element index in the RMB. Used to
+ locate a specific RMB element within an RMB. Valid value range is
+ 1-255.
+
+ o RMB Element Length: The length of the RMB element's eye catcher
+ plus the length of the receive buffer. This length is equal for
+ all RMB elements in a given RMB. This length can be variable
+ across different RMBs.
+
+ Multiple RMBs can be associated to an SMC-R link group, and each peer
+ in an SMC-R link group manages allocation of its RMBs. RMB
+ allocation can be asymmetric. For example, Server X can allocate two
+ RMBs to an SMC-R link group while Server Y allocates five. This
+ provides maximum implementation flexibility to allow hosts to
+ optimize RMB management for their own local requirements. The
+ maximum number of RMBs that can be allocated on one peer to a link
+ group is 255. If more RMBs are required, the peer may fall back to
+ IP for subsequent connections or, if the peer is the server, create a
+ parallel link group.
+
+ One use case for multiple RMBs is multiple receive buffer sizes.
+ Since every element in an RMB must be the same size, multiple RMBs
+ with different element sizes can be allocated if varying receive
+ buffer sizes are required.
+
+ Also, since the maximum number of TCP connections whose receive
+ buffers can be allocated to an RMB is 255, multiple RMBs may be
+ required to provide capacity for large numbers of TCP connections
+ between two peers.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 15]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Separately from the RMB, the TCP/IP stack that owns each RMB
+ maintains control data for each RMB element within its local control
+ structures. The control data contains flags for maintaining the
+ state of the TCP data (for example, urgent data indicator) and, most
+ importantly, the following two cursors, which are illustrated below
+ in Figure 4:
+
+ o The peer producer cursor: This is a wrapping offset into the
+ RMB element's receive buffer that points to the next byte of data
+ to be written by the remote peer. This cursor is provided by the
+ remote peer in a Connection Data Control (CDC) message, which is
+ sent using RoCE SendMsg processing, and tells the local peer how
+ far it can consume data in the RMBE buffer.
+
+ o The peer consumer cursor: This is a wrapping offset into the
+ remote peer's RMB element's receive buffer that points to the next
+ byte of data to be consumed by the remote peer in its own RMBE.
+ The local peer cannot write into the remote peer's RMBE beyond
+ this point without causing data loss. This cursor is also
+ provided by the peer using a Connection Data Control message.
+
+ Each TCP connection peer maintains its cursors for a TCP connection's
+ RMBE in its local control structures. In other words, the peer who
+ writes into a remote peer's RMBE provides its producer cursor to the
+ peer whose RMBE it has written into. The peer who reads from its
+ RMBE provides its consumer cursor to the writing peer. In this
+ manner, the reads and writes between peers are kept coordinated.
+
+ For example, referring to Figure 4, Peer B writes the hashed data
+ into the receive buffer of Peer A's RMBE. After that write
+ completes, Peer B uses a CDC message to update its producer cursor to
+ Peer A, to indicate to Peer A how much data is available for Peer A
+ to consume. The CDC message that Peer B sends to Peer A wakes up
+ Peer A and notifies it that there is data to be consumed.
+
+ Similarly, when Peer A consumes data written by Peer B, it uses a CDC
+ message to update its consumer cursor to Peer B to let Peer B know
+ how much data it has consumed, so Peer B knows how much space is
+ available for further writes. If Peer B were to write enough data to
+ Peer A that it would wrap the RMBE receive buffer and exceed the
+ consumer cursor, data loss would result.
+
+ Note that this is a simplistic description of the control flows, and
+ they are optimized to minimize the number of CDC messages required,
+ as described in Section 4.7 ("RMB Data Flows").
+
+
+
+
+
+
+Fox, et al. Informational [Page 16]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Peer A's RMBE Control Info Peer B's RMBE Control Info
+ +--------------------------+ +--------------------------+
+ | | | |
+ /----Peer producer cursor | +-----+-Peer consumer cursor |
+ /| | | | |
+ | +--------------------------+ | +--------------------------+
+ | Peer A's RMBE |
+ | +--------------------------+ |
+ | | +------------------+
+ | | | |
+ | | \/ |
+ | | +------------|
+ | |-------------+/////////// |
+ | |//RDMA data written by ///|
+ | |/// Peer B that is ////// |
+ | |/available to be consumed/|
+ | |///////////////////////// |
+ | |///////// +---------------|
+ | |----------+/\ |
+ | | | |
+ \| | |
+ \ / |
+ |\---------/ |
+ | |
+ | |
+
+ Figure 4: RMBE Cursors
+
+ Additional flags and indicators are communicated between peers. In
+ all cases, these flags and indicators are updated by the peer using
+ CDC messages, which are sent using RoCE SendMsg. More details on
+ these additional flags and indicators are described in Section 4.3
+ ("RMBE Control Information").
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 17]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+2.2. SMC-R Link Groups
+
+ SMC-R links are logically grouped together to form an SMC-R link
+ group. The purpose of the link group is for supporting multiple
+ links between the same two peers to provide for:
+
+ o Resilience: Provides transparent and dynamic switching of the link
+ used by existing TCP connections during link failures, typically
+ hardware related. TCP traffic using the failing link can be
+ switched to an active link within the link group, thereby avoiding
+ disruptions to application workloads.
+
+ o Link utilization: Provides an active/active link usage model
+ allowing TCP traffic to be balanced across the links, which
+ increases bandwidth and also avoids hardware imbalances and
+ bottlenecks. Note that both adapter and switch utilization can
+ become potential resource constraint issues.
+
+ SMC-R link group support is required. Resilience is not optional.
+ However, the user can elect to provision a single RNIC (on one or
+ both hosts).
+
+ Multiple links that are formed between the same two peers fall into
+ two distinct categories:
+
+ 1. Equal Links: Links providing equal access to the same RMB(s) at
+ both endpoints, whereby all TCP connections associated with the
+ links must have the same VLAN ID and have the same TCP server and
+ TCP client roles or relationship.
+
+ 2. Unequal Links: Links providing access to unique, unrelated and
+ isolated RMB(s) (i.e., for unique VLANs or unique and isolated
+ application workloads, etc.) or having unique TCP server or client
+ roles.
+
+ Links that are logically grouped together forming an SMC-R link group
+ must be equal links.
+
+2.2.1. Link Group Types
+
+ Equal links within a link group also have another "Link Group Type"
+ attribute based on the link's associated underlying physical path.
+ The following SMC-R link types are defined:
+
+ 1. Single link: the only active link within a link group
+
+ 2. Parallel link: not allowed -- SMC-R links having the same physical
+ RNIC at both hosts
+
+
+
+Fox, et al. Informational [Page 18]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 3. Asymmetric link: links that have unique RNIC adapters at one host
+ but share a single adapter at the peer host
+
+ 4. Symmetric link: links that have unique RNIC adapters at both hosts
+
+ These link group types are further explained in the following figures
+ and descriptions.
+
+ Figure 2 above shows the single-link case. The single link
+ illustrated in Figure 2 also establishes the SMC-R link group. Link
+ groups are supposed to have multiple links, but when only one RNIC is
+ available at both hosts then only a single link can be created. This
+ is expected to be a transient case.
+
+ Figure 5 shows the symmetric-link case. Both hosts have unique and
+ redundant RNIC adapters. This configuration meets the objectives for
+ providing full RoCE redundancy required to provide the level of
+ resilience required for high availability for SMC-R. While this
+ configuration is not required, it is a strongly recommended "best
+ practice" for the exploitation of SMC-R. Single and asymmetric links
+ must be supported but are intended to provide for short-term
+ transient conditions -- for example, during a temporary outage or
+ recycle of an RNIC.
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | | | |
+ | Protection | | Protection |
+ | Domain X | | Domain Y |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ |RToken X| | |<-------------------->| | | |
+ | | | | | | |RToken Y|
+ | \/ +------+ +------+ \/ |
+ |+--------+ | | +--------+ |
+ || | | | | | |
+ || RMB | | | | RMB | |
+ || | | | | | |
+ |+--------+ | | +--------+ |
+ | /\ +------+ +------+ /\ |
+ |RToken Z| | | SMC-R Link 2 | | |RToken W|
+ | | |RNIC 3|<-------------------->|RNIC 4| | |
+ | QP 9 | | | | QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ Figure 5: Symmetric SMC-R Links
+
+
+
+
+Fox, et al. Informational [Page 19]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | | | |
+ | Protection | | Protection |
+ | Domain X | | Domain Y |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ |RToken X| | |<-------------------->| | | |
+ | | | | .->| | |RToken Y|
+ | \/ +------+ .` +------+ \/ |
+ |+--------+ | .` | +--------+ |
+ || | | .` | | | |
+ || RMB | | .` | | RMB | |
+ || | | .`SMC-R | | | |
+ |+--------+ | .` Link 2 | +--------+ |
+ | /\ +------+ .` +------+ |
+ |RToken Z| | | .` | |down or |
+ | | |RNIC 3|<-` |RNIC 4|unavailable |
+ | QP 9 | | | | |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ Figure 6: Asymmetric SMC-R Links
+
+ In the example provided by Figure 6, Host X has two RNICs but Host Y
+ only has one RNIC because RNIC 4 is not available. This
+ configuration allows for the creation of an asymmetric link. While
+ an asymmetric link will provide some resilience (for example, when
+ RNIC 1 fails), ideally each host should provide two redundant RNICs.
+ This should be a transient case, and when RNIC 4 becomes available,
+ this configuration must transition to a symmetric-link configuration.
+ This transition is accomplished by first creating the new symmetric
+ link and then deleting the asymmetric link with reason code
+ "Asymmetric link no longer needed" specified in the DELETE LINK LLC
+ message.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 20]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | | | |
+ | Protection | | Protection |
+ | Domain X | | Domain Y |
+ | +------+ SMC-R Link 1 +------+ |
+ | QP 8 |RNIC 1|<-------------------->|RNIC 2| QP 64 |
+ |RToken X| | | | | | |
+ | | | |<-------------------->| | |RToken Y|
+ | \/ +------+ SMC-R Link 2 +------+ \/ |
+ |+--------+ QP 9 | | QP 65 +--------+ |
+ || | | | | | | | |
+ || RMB |<-- + | | +---->| RMB | |
+ || | | | | | |
+ |+--------+ | | +--------+ |
+ | +------+ +------+ |
+ | down or| | | |down or |
+ | unavailable|RNIC 3| |RNIC 4|unavailable |
+ | | | | | |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ Figure 7: SMC-R Parallel Links (Not Supported)
+
+ Figure 7 shows parallel links, which are two links in the link group
+ that use the same hardware. This configuration is not permitted.
+ Because SMC-R multiplexes multiple TCP connections over an SMC-R link
+ and both links are using the exact same hardware, there is no
+ additional redundancy or capacity benefit obtained from this
+ configuration. In addition to providing no real benefit, this
+ configuration adds the unnecessary overhead of additional queue
+ pairs, generation of additional RKeys, etc.
+
+2.2.2. Maximum Number of Links in Link Group
+
+ The SMC-R protocol defines a maximum of eight symmetric SMC-R links
+ within a single SMC-R link group. This allows for support for up to
+ eight unique physical paths between peer hosts. However, in terms of
+ meeting the basic requirements for redundancy, support for at least
+ two symmetric links must be implemented. Supporting more than two
+ links also simplifies implementation for practical matters relating
+ to dynamically adding and removing links -- for example, starting a
+ third SMC-R link prior to taking down one of the two existing links.
+ Recall that all links within a link group must have equal access to
+ all associated RMBs.
+
+
+
+
+
+
+Fox, et al. Informational [Page 21]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The SMC-R protocol allows an implementation to assign an
+ implementation-specific and appropriate value for maximum symmetric
+ links. The implementation value must not exceed the architecture
+ limit of 8; also, the value must not be lower than 2, because the
+ SMC-R protocol requires redundancy. This does not mean that two
+ RNICs are physically required to enable SMC-R connectivity, but at
+ least two RNICs for redundancy are strongly recommended.
+
+ The SMC-R peers exchange their implementation maximum link values
+ during the link group establishment using the defined maximum link
+ value in the CONFIRM LINK LLC command. Once the initial exchange
+ completes, the value is set for the life of the link group. The
+ maximum link value can be provided by both the server and client.
+ The server must supply a value, whereas the client maximum link value
+ is optional. When the client does not supply a value, it indicates
+ that the client accepts the server-supplied maximum value. If the
+ client provides a value, it cannot exceed the server-supplied maximum
+ value. If the client passes a lower value, this lower value then
+ becomes the final negotiated maximum number of symmetric links for
+ this link group. Again, the minimum value is 2.
+
+ During run time, the client must never request that the server add a
+ symmetric link to a link group that would exceed the negotiated
+ maximum link value. Likewise, the server must never attempt to add a
+ symmetric link to a link group that would exceed the negotiated
+ maximum value.
+
+ In terms of counting the number of active links within a link group,
+ the initial link (or the only/last) link is always counted as 1.
+ Then, as additional links are added, they are either symmetric or
+ asymmetric links.
+
+ With regards to enforcing the maximum link rules, asymmetric links
+ are an exception having a unique set of rules:
+
+ o Asymmetric links are always limited to one asymmetric link allowed
+ per link group.
+
+ o Asymmetric links must not be counted in the maximum symmetric-link
+ count calculation. When tracking the current count or enforcing
+ the negotiated maximum number of links, an asymmetric link is not
+ to be counted.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 22]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+2.2.3. Forming and Managing Link Groups
+
+ SMC-R link groups are self-defining. The first SMC-R link in a link
+ group is created using TCP option flows on the TCP three-way
+ handshake followed by CLC message flows over the TCP connection.
+ Subsequent SMC-R links in the link group are created by sending LLC
+ messages over an SMC-R link that already exists in the link group.
+ Once an SMC-R link group is created, no additional SMC-R links in
+ that group are created using TCP and CLC negotiation. Because
+ subsequent SMC-R links are created exclusively by sending LLC
+ messages over an existing SMC-R link in a link group, the membership
+ of SMC-R links in a link group is self-defining.
+
+ This architecture does not define a specific identifier for an SMC-R
+ link group. This identification may be useful for network management
+ and may be assigned in a platform-specific manner, or in an extension
+ to this architecture.
+
+ In each SMC-R link group, one peer is the server for all TCP
+ connections and the other peer is the client. If there are
+ additional TCP connections between the peers that use SMC-R and have
+ the client and server roles reversed, another SMC-R link group is set
+ up between them with the opposite client-server relationship.
+
+ This is required because there are specific responsibilities divided
+ between the client and server in the management of an SMC-R link
+ group.
+
+ In this architecture, the decision of whether to use an existing
+ SMC-R link group or create a new SMC-R link group for a TCP
+ connection is made exclusively by the server.
+
+ Management of the links in an SMC-R link group is also a server
+ responsibility. The server is responsible for adding and deleting
+ links in a link group. The client may request that the server take
+ certain actions, but the final responsibility is the server's.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 23]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+2.2.4. SMC-R Link Identifiers
+
+ This architecture defines multiple identifiers to identify SMC-R
+ links and peers.
+
+ o Link number: This is a 1-byte value that identifies an SMC-R link
+ within a link group. Both the server and the client use this
+ number to distinguish an SMC-R link from other links within the
+ same link group. It is only unique within a link group. In order
+ to prevent timing windows that may occur when a server creates a
+ new link while the client is still cleaning up a previously
+ existing link, link numbers cannot be reused until the entire link
+ numbering space has been exhausted.
+
+ o Link user ID: This is an architecturally opaque 4-byte value that
+ a peer uses to uniquely define an SMC-R link within its own space.
+ This means that a link user ID is unique within one peer only.
+ Each peer defines its own link user ID for a link. The peers
+ exchange this information once during link setup, and it is never
+ used architecturally again. The purpose of this identifier is for
+ network management, display, and debugging. For example, an
+ operator on a client could provide the operator on the server with
+ the server's link user ID if he requires the server's operator to
+ check on the operation of a link that the client is having trouble
+ with.
+
+ o Peer ID: The SMC-R peer ID uniquely identifies a specific instance
+ of a specific TCP/IP stack. It is required because in clustered
+ and load-balancing environments, an IP address does not uniquely
+ identify a TCP/IP stack. An RNIC's MAC/GID also doesn't uniquely
+ or reliably identify a TCP/IP stack, because RNICs can go up and
+ down and even be redeployed to other TCP/IP stacks in a
+ multiple-partitioned or virtualized environment. The peer ID is
+ not only unique per TCP/IP stack but is also unique per instance
+ of a TCP/IP stack, meaning that if a TCP/IP stack is restarted,
+ its peer ID changes.
+
+2.3. SMC-R Resilience and Load Balancing
+
+ The SMC-R multilink architecture provides resilience for network high
+ availability via failover capability to an alternate RoCE adapter.
+
+ The SMC-R multilink architecture does not define primary, secondary,
+ or alternate roles to the links. Instead, there are multiple active
+ links representing multiple redundant RoCE paths over the same LAN.
+
+
+
+
+
+
+Fox, et al. Informational [Page 24]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Assignment of TCP connections to links is unidirectional and
+ asymmetric. This means that the client and server may each choose a
+ separate link for their RDMA writes associated with a specific TCP
+ connection.
+
+ If a hardware failure occurs or a QP failure associated with an
+ individual link occurs, then the TCP connections that were associated
+ with the failing link are dynamically and transparently switched to
+ use another available link. The server or the client can detect a
+ failure, immediately move their TCP connections, and then notify
+ their peer via the DELETE LINK LLC command. While the client can
+ notify the server of an apparent link failure with the DELETE LINK
+ LLC command, the server performs the actual link deletion.
+
+ The movement of TCP connections to another link can be accomplished
+ with minimal coordination between the peers. The TCP connection
+ movement is also transparent to, and non-disruptive to, the TCP
+ socket application workloads for most failure scenarios. After a
+ failure, the surviving links and all associated hardware must handle
+ the link group's workload.
+
+ As each SMC-R peer begins to move active TCP connections to another
+ link, all current RDMA write operations must be allowed to complete.
+ The moving peer then sends a signal to verify receipt of the last
+ successful write by its peer. If this verification fails, the TCP
+ connection must be reset. Once this verification is complete, all
+ writes that failed may then be retried, in order, over the new link.
+ Any data writes or CDC messages for which the sender did not receive
+ write completion must be replayed before any subsequent data or CDC
+ write operations are sent. LLC messages are not retried over the new
+ link, because they are dependent on a known link configuration, which
+ has just changed because of the failure. The initiator of an LLC
+ message exchange that fails will be responsible for retrying once the
+ link group configuration stabilizes.
+
+ When a new link becomes available and is re-added to the link group,
+ each peer is free to rebalance its current TCP connections as needed
+ or only assign new TCP connections to the newly added link. Both the
+ server and client are free to manage TCP connections across the link
+ group as needed. TCP connection movement does not have to be
+ stimulated by a link failure.
+
+ The SMC-R architecture also defines orderly versus disorderly
+ failover. The type of failover is communicated in the LLC
+ DELETE LINK command and is simply a means to indicate that the link
+ has terminated (disorderly) or link termination is imminent
+ (orderly). The orderly link deletion could be initiated via operator
+ command or programmatically to bring down an idle link. For example,
+
+
+
+Fox, et al. Informational [Page 25]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ an operator command could initiate orderly shutdown of an adapter for
+ service. Implementation of the two types is based on implementation
+ requirements and is beyond the scope of the SMC-R architecture.
+
+3. SMC-R Rendezvous Architecture
+
+ "Rendezvous" is the process that SMC-R-capable peers use to
+ dynamically discover each others' capabilities, negotiate SMC-R
+ connections, set up SMC-R links and link groups, and manage those
+ link groups. A key aspect of SMC-R Rendezvous is that it occurs
+ dynamically and automatically, without requiring SMC-R link
+ configuration to be defined by an administrator.
+
+ SMC-R Rendezvous starts with the TCP/IP three-way handshake, during
+ which connection peers use TCP options to announce their SMC-R
+ capabilities. If both endpoints are SMC-R capable, then Connection
+ Layer Control (CLC) messages are exchanged between the peers' SMC-R
+ layers over the newly established TCP connection to negotiate SMC-R
+ credentials. The CLC message mechanism is analogous to the messages
+ exchanged by SSL for its handshake processing.
+
+ If a new SMC-R link is being set up, Link Layer Control (LLC)
+ messages are used to confirm RDMA connectivity. LLC messages are
+ also used by the SMC-R layers at each peer to manage the links and
+ link groups.
+
+ Once an SMC-R link is set up or agreed to by the peers, the TCP
+ sockets are passed to the peer applications, which use them as
+ normal. The SMC-R layer, which resides under the sockets layer,
+ transmits the socket data between peers over RDMA using the SMC-R
+ protocol, bypassing the TCP/IP stack.
+
+3.1. TCP Options
+
+ During the TCP/IP three-way handshake, the client and server indicate
+ their support for SMC-R by including experimental TCP option 254 on
+ the three-way handshake flows, in accordance with [RFC6994] ("Shared
+ Use of Experimental TCP Options"). The Experiment Identifier (ExID)
+ value used is the string "SMCR" in EBCDIC (IBM-1047) encoding
+ (0xE2D4C3D9). This ExID has been registered in the "TCP Experimental
+ Option Experiment Identifiers (TCP ExIDs)" registry maintained
+ by IANA.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 26]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ After completion of the three-way TCP handshake, each peer queries
+ its peer's options. If both peers set the TCP option on the
+ three-way handshake, inline SMC-R negotiation occurs using CLC
+ messages. If neither peer, or only one peer, sets the TCP option,
+ SMC-R cannot be used for the TCP connection, and the TCP connection
+ completes the setup using the IP fabric.
+
+3.2. Connection Layer Control (CLC) Messages
+
+ CLC messages are sent as data payload over the IP network using the
+ TCP connection between SMC-R layers at the peers. They are analogous
+ to the messages used to exchange parameters for SSL.
+
+ The use of CLC messages is detailed in the following sections. The
+ following list provides a summary of the defined CLC messages and
+ their purposes:
+
+ o SMC Proposal: Sent from the client to propose that this TCP
+ connection is eligible to be moved to SMC-R. The client
+ identifies itself and its subnet to the server and passes the
+ SMC-R elements for a suggested RoCE path via the MAC and GID.
+
+ o SMC Accept: Sent from the server to accept the client's TCP
+ connection SMC Proposal. The server responds to the client's
+ proposal by identifying itself to the client and passing the
+ elements of a RoCE path that the client can use to perform RDMA
+ writes to the server. This consists of such SMC-R link elements
+ as RoCE MAC, GID, and RMB information.
+
+ o SMC Confirm: Sent from the client to confirm the server's
+ acceptance of the SMC connection. The client responds to the
+ server's acceptance by passing the elements of a RoCE path that
+ the server can use to perform RDMA writes to the client. This
+ consists of such SMC-R link elements as RoCE MAC, GID, and RMB
+ information.
+
+ o SMC Decline: Sent from either the server or the client to reject
+ the SMC connection, indicating the reason the peer must decline
+ the SMC Proposal and allowing the TCP connection to revert back to
+ IP connectivity.
+
+3.3. LLC Messages
+
+ Link Layer Control (LLC) messages are sent between peer SMC-R layers
+ over an SMC-R link to manage the link or the link group. LLC
+ messages are sent using RoCE SendMsg and are 44 bytes long. The
+ 44-byte size is based on what can fit into a RoCE Work Queue Element
+ (WQE) without requiring the posting of receive buffers.
+
+
+
+Fox, et al. Informational [Page 27]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ LLC messages generally follow a request-reply semantic. Each message
+ has a request flavor and a reply flavor, and each request must be
+ confirmed with a reply, except where otherwise noted. The use of LLC
+ messages is detailed in the following sections. The following list
+ provides a summary of the defined LLC messages and their purposes:
+
+ o ADD LINK: Used to add a new link to a link group. Sent from the
+ server to the client to initiate addition of a new link to the
+ link group, or from the client to the server to request that the
+ server initiate addition of a new link.
+
+ o ADD LINK CONTINUATION: A continuation of ADD LINK that allows the
+ ADD LINK to span multiple commands, because all of the link
+ information cannot be contained in a single ADD LINK message.
+
+ o CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
+ created SMC-R link is working correctly. Initiated by the server.
+ Both this message and its reply must flow over the SMC-R link
+ being confirmed.
+
+ o DELETE LINK: When initiated by the server, deletes a specific link
+ from the link group or deletes the entire link group. When
+ initiated by the client, requests that the server delete a
+ specific link or the entire link group.
+
+ o CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
+ of an RMB to the link group.
+
+ o CONFIRM RKEY CONTINUATION: A continuation of CONFIRM RKEY that
+ allows the CONFIRM RKEY to span multiple commands, in the event
+ that all of the information cannot be contained in a single
+ CONFIRM RKEY message.
+
+ o DELETE RKEY: Informs the peer on the SMC-R link of the deletion of
+ one or more RMBs from the link group.
+
+ o TEST LINK: Verifies that an already-active SMC-R link is active
+ and healthy.
+
+ o Optional LLC message: Any LLC message in which the two high-order
+ bits of the opcode are b'10'. This optional message must be
+ silently discarded by a receiving peer that does not support the
+ opcode. No such messages are defined in this version of the
+ architecture; however, the concept is defined to allow for
+ toleration of possible advanced, optional functions.
+
+
+
+
+
+
+Fox, et al. Informational [Page 28]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ CONFIRM LINK and TEST LINK are sensitive to which link they flow on
+ and must flow on the link being confirmed or tested. The other flows
+ may flow over any active link in the link group. When there are
+ multiple links in a link group, a response to an LLC message must
+ flow over the same link that the original message flowed over, with
+ the following exceptions:
+
+ o ADD LINK request from a server in response to an ADD LINK from a
+ client.
+
+ o DELETE LINK request from a server in response to a DELETE LINK
+ from a client.
+
+3.4. CDC Messages
+
+ Connection Data Control (CDC) messages are sent over the RoCE fabric
+ between peers using RoCE SendMsg and are 44 bytes long. The 44-byte
+ size is based on the size that can fit into a RoCE WQE without
+ requiring the posting of receive buffers. CDC messages are used to
+ describe the socket application data passed via RDMA write
+ operations, as well as TCP connection state information, including
+ producer cursors and consumer cursors, RMBE state information, and
+ failover data validation.
+
+3.5. Rendezvous Flows
+
+ Rendezvous information for SMC-R is exchanged as TCP options on the
+ TCP three-way handshake flows to indicate capability, followed by
+ inline TCP negotiation messages to actually do the SMC-R setup.
+ Formats of all rendezvous options and messages discussed in this
+ section are detailed in Appendix A.
+
+3.5.1. First Contact
+
+ First contact between RoCE peers occurs when a new SMC-R link group
+ is being set up. This could be because no SMC-R links already exist
+ between the peers, or the server decides to create a new SMC-R link
+ group in parallel with an existing one.
+
+3.5.1.1. Pre-negotiation of TCP Options
+
+ The client and server indicate their SMC-R capability to each other
+ using TCP option 254 on the TCP three-way handshake flows.
+
+ A client who wishes to do SMC-R will include TCP option 254 using an
+ ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on
+ its SYN flow.
+
+
+
+
+Fox, et al. Informational [Page 29]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ A server that supports SMC-R will include TCP option 254 with the
+ ExID value of EBCDIC "SMCR" on its SYN-ACK flow. Because the server
+ is listening for connections and does not know where client
+ connections will come from, the server implementation may choose to
+ unconditionally include this TCP option if it supports SMC-R. This
+ may be required for server implementations where extensions to the
+ TCP stack are not practical. For server implementations that can add
+ code to examine and react to packets during the three-way handshake,
+ the server should only include the SMC-R TCP option on the SYN-ACK if
+ the client included it on its SYN packet.
+
+ A client who supports SMC-R and meets the three conditions outlined
+ above may optionally include the TCP option for SMC-R on its ACK
+ flow, regardless of whether or not the server included it on its
+ SYN-ACK flow. Some TCP/IP stacks may have to include it if the SMC-R
+ layer cannot modify the options on the socket until the three-way
+ handshake completes. Proprietary servers should not include this
+ option on the ACK flow, since including it on the SYN flow was
+ sufficient to indicate the client's capabilities.
+
+ Once the initial three-way TCP handshake is completed, each peer
+ examines the socket options. SMC-R implementations may do this by
+ examining what was actually provided on the SYN and SYN-ACK packets
+ or by performing a getsockopt() operation to determine the options
+ sent by the peer. If neither peer, or only one peer, specified the
+ TCP option for SMC-R, then SMC-R cannot be used on this connection
+ and it proceeds using normal IP flows and processing.
+
+ If both peers specified the TCP option for SMC-R, then the TCP
+ connection is not started yet and the peers proceed to SMC-R
+ negotiation using inline data flows. The socket is not yet turned
+ over to the applications; instead, the respective SMC layers exchange
+ CLC messages over the newly formed TCP connection.
+
+3.5.1.2. Client Proposal
+
+ If SMC-R is supported by both peers, the client sends an SMC Proposal
+ CLC message to the server. It is not immediately apparent on this
+ flow from client to server whether this is a new or existing SMC-R
+ link, because in clustered environments a single IP address may
+ represent multiple hosts. This type of cluster virtual IP address
+ can be owned by a network-based or host-based Layer 4 load balancer
+ that distributes incoming TCP connections across a cluster of
+ servers/hosts. For purposes of high availability, other clustered
+ environments may also support the movement of a virtual IP address
+ dynamically from one host in the cluster to another. In summary, the
+ client cannot predetermine that a connection is targeting the same
+ host by simply matching the destination IP address for outgoing TCP
+
+
+
+Fox, et al. Informational [Page 30]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ connections. Therefore, it cannot predetermine the SMC-R link that
+ will be used for a new TCP connection. This information will be
+ dynamically learned, and the appropriate actions will be taken as the
+ SMC-R negotiation handshake unfolds.
+
+ In the SMC-R proposal message, the initiator (client) proposes the
+ use of SMC-R by including its peer ID, GID, and MAC addresses, as
+ well as the IP subnet number of the outgoing interface (if IPv4) or
+ the IP prefix list for the network over which the proposal is sent
+ (if IPv6). At this point in the flow, the client makes no local
+ commitments of resources for SMC-R.
+
+ When the server receives the SMC Proposal CLC message, it uses the
+ peer ID provided by the client, plus subnet or prefix information
+ provided by the client, to determine if it already has a usable SMC-R
+ link with this SMC-R peer. If there are one or more existing SMC-R
+ links with this SMC-R peer, the server then decides which SMC-R link
+ it will use for this TCP connection. See Sections 3.5.2 and 3.5.3
+ for the cases of reusing an existing SMC-R link or creating a
+ parallel SMC-R link group between SMC-R peers.
+
+ If this is a first contact between SMC-R peers, the server must
+ validate that it is on the same LAN as the client before continuing.
+ For IPv4, the server does this by verifying that it has an interface
+ with an IP subnet number that matches the subnet number sent by the
+ client in the SMC Proposal. For IPv6, it does this by verifying that
+ it is directly attached to at least one IP prefix that was listed by
+ the client in its SMC Proposal message.
+
+ If the server agrees to use SMC-R, the server begins the setup of a
+ new SMC-R link by allocating local QP and RMB resources (setting its
+ QP state to INIT) and providing its full SMC-R information in an SMC
+ Accept CLC message to the client over the TCP connection, along with
+ a flag set indicating that this is a first contact flow. While the
+ SMC Accept message could flow over any IP route back to the client
+ depending upon Layer 3 IP routing, the SMC-R credentials provided
+ must be for the common subnet or prefix between the server and
+ client, as determined above. If the server cannot or does not want
+ to do SMC-R with the client, it sends an SMC Decline CLC message to
+ the client, and the connection data may begin flowing using normal
+ TCP/IP flows.
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 31]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.1.3. Server Acceptance
+
+ When the client receives the SMC Accept from the server, it
+ determines whether this is a new or existing SMC-R link, using the
+ combination of the following: the first contact flag, its MAC/GID and
+ the MAC/GID returned by the server, the VLAN over which the
+ connection is setting up, and the QP number provided by the server.
+
+ If it is an existing SMC-R link and the client agrees to use that
+ link for the TCP connection, see Section 3.5.2 ("Subsequent Contact")
+ below. If it is a new SMC-R link between peers that already have an
+ SMC-R link, then the server is starting a new SMC-R link group.
+
+ Assuming that either (1) this is a first contact between peers or
+ (2) the server is starting a new SMC-R link group, the client now
+ allocates local QP and RMB resources for the SMC-R link (setting the
+ QP state to RTR (ready to receive)), associates them with the server
+ QP as learned from the SMC Accept CLC message, and sends an SMC
+ Confirm CLC message to the server over the TCP connection with its
+ SMC-R link information included. The client also starts a timer to
+ wait for the server to confirm the reliably connected queue pair, as
+ described below.
+
+3.5.1.4. Client Confirmation
+
+ Upon receipt of the client's SMC Confirm CLC message, the server
+ associates its QP for this SMC-R link with the client's QP as learned
+ from the SMC Confirm CLC message and sets its QP state to RTS (ready
+ to send). The client and the server now have reliably connected
+ queue pairs.
+
+3.5.1.5. Link (QP) Confirmation
+
+ Since setting up the SMC-R link and its QPs did not require any
+ network flows on the RoCE fabric, the client and server must now
+ confirm connectivity over the RoCE fabric. To accomplish this, the
+ server will send a CONFIRM LINK Link Layer Control (LLC) message to
+ the client over the newly created SMC-R link, using the RoCE fabric.
+ The CONFIRM LINK LLC message will provide the server's MAC, GID, and
+ QP information for the connection, allow each partner to communicate
+ the maximum number of links it can tolerate in this link group (the
+ "link limit"), and will additionally provide two link IDs:
+
+ o a 1-byte server-assigned link number that is used by both peers to
+ identify the link within the link group and is only unique within
+ a link group.
+
+
+
+
+
+Fox, et al. Informational [Page 32]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o a 4-byte link user ID. This opaque value is assigned by the
+ server for the server's local use and is provided to the client
+ for management purposes -- for example, to use in network
+ management displays and products.
+
+ When the server sends this message, it will set a timer for receiving
+ confirmation from the client.
+
+ When the client receives the server's confirmation in the form of a
+ CONFIRM LINK LLC message, it will cancel the confirmation timer it
+ set when it sent the SMC Confirm message. The client will also
+ advance its QP state to RTS and respond over the RoCE fabric with a
+ CONFIRM LINK response LLC message that (1) provides its MAC, GID,
+ QP number, and link limit, (2) confirms the 1-byte link number sent
+ by the server, and (3) provides its own 4-byte link user ID to the
+ server.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 33]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X -- Server Host Y -- Client
+ +-------------------+ +-------------------+
+ | Peer ID = PS1 | | Peer ID = PC1 |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| |RNIC 2| QP 64 |
+ |RToken X| |MAC MA| |MAC MB| | |
+ | | |GID GA| |GID GB| |RToken Y|
+ | \/ +------+ (Subnet S1) +------+ \/ |
+ |+--------+ | | +--------+ |
+ || RMB | | | | RMB | |
+ |+--------+ | | +--------+ |
+ | +------+ +------+ |
+ | |RNIC 3| |RNIC 4| |
+ | |MAC MC| |MAC MD| |
+ | |GID GC| |GID GD| |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ SYN TCP options(254,"SMCR")
+ <---------------------------------------------------------
+
+ SYN-ACK TCP options(254,"SMCR")
+ --------------------------------------------------------->
+
+ ACK [TCP options(254,"SMCR")]
+ <--------------------------------------------------------
+
+ SMC Proposal(PC1,MB,GB,S1)
+ <--------------------------------------------------------
+
+ SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem index)
+ --------------------------------------------------------->
+
+ SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y,RMB element index)
+ <--------------------------------------------------------
+
+ CONFIRM LINK(MA,GA,QP8, link lim, server link user ID, linknum)
+ .........................................................>
+
+ CONFIRM LINK rsp(MB,GB,QP64, link lim, client link user ID, linknum)
+ <........................................................
+
+ Legend:
+ ------------ TCP/IP and CLC flows
+ ............ RoCE (LLC) flows
+ Square brackets ("[ ]") indicate optional information
+
+ Figure 8: First Contact Rendezvous Flows
+
+
+
+Fox, et al. Informational [Page 34]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Technically, the data for the TCP connection could now flow over the
+ RoCE path. However, if this is a first contact, there is no
+ alternate for this recently established RoCE path. Since in the
+ current architecture there is no failover from RoCE to IP once
+ connection data starts flowing, this means that a failure of this
+ path would disrupt the TCP connection, meaning that the level of
+ redundancy and failover is less than that provided by IP. If the
+ network has alternate RoCE paths available, they would not be usable
+ at this point. This situation would be unacceptable.
+
+3.5.1.6. Second SMC-R Link Setup
+
+ Because of the unacceptable situation described above, TCP data will
+ not be allowed to flow on the newly established SMC-R link until a
+ second path has been set up, or at least attempted.
+
+ If the server has a second RNIC available on the same LAN, it
+ attempts to set up the second SMC-R link over that second RNIC. If
+ it only has one RNIC available on the LAN, it will attempt to set up
+ the second SMC-R link over that one RNIC. In the latter case, the
+ server is attempting to set up an asymmetric link, in case the client
+ does have a second RNIC on the LAN.
+
+ In either case, the server allocates a new QP over the RNIC it is
+ attempting to use for the second link and assigns a link number to
+ the new link; the server also creates an RToken for the RMB over this
+ second QP (note that this means that the first and second QP each
+ have their own RToken to represent the same RMB). The server
+ provides this information, as well as the MAC and GID of the RNIC
+ over which it is attempting to set up the second link, in an ADD LINK
+ LLC message that it sends to the client over the SMC-R link that is
+ already set up.
+
+3.5.1.6.1. Client Processing of ADD LINK LLC Message from Server
+
+ When the client receives the server's ADD LINK LLC message, it
+ examines the GID and MAC provided by the server to determine whether
+ the server is attempting to use the same server-side RNIC as the
+ existing SMC-R link or a different one.
+
+ If the server is attempting to use the same server-side RNIC as the
+ existing SMC-R link, then the client verifies that it has a second
+ RNIC on the same LAN. If it does not, the client rejects the
+ ADD LINK request from the server, because the resulting link would be
+ a parallel link, which is not supported within a link group. If the
+ client does have a second RNIC on the same LAN, it accepts the
+ request, and an asymmetric link will be set up.
+
+
+
+
+Fox, et al. Informational [Page 35]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ If the server is using a different server-side RNIC from the existing
+ SMC-R link, then the client will accept the request and a second
+ SMC-R link will be set up in this SMC-R link group. If the client
+ has a second RNIC on the same LAN, that second RNIC will be used for
+ the second SMC-R link, creating symmetric links. If the client does
+ not have a second RNIC on the same LAN, it will use the same RNIC as
+ was used for the initial SMC-R link, resulting in the setup of an
+ asymmetric link in the SMC-R link group.
+
+ In either case, when the client accepts the server's ADD LINK
+ request, it allocates a new QP on the chosen RNIC and creates an RKey
+ over that new QP for the client-side RMB for the SMC-R link group,
+ then sends an ADD LINK reply LLC message to the server providing that
+ information as well as echoing the link number that was sent by the
+ server.
+
+ If the client rejects the server's ADD LINK request, it sends an ADD
+ LINK reply LLC message to the server with the reason code for the
+ rejection.
+
+3.5.1.6.2. Server Processing of ADD LINK Reply LLC Message from Client
+
+ If the client sends a negative response to the server or no reply is
+ received, the server frees the RoCE resources it had allocated for
+ the new link. Having a single link in an SMC-R link group is
+ undesirable. The server's recovery is detailed in Appendix C.8
+ ("Failure to Add Second SMC-R Link to a Link Group").
+
+ If the client sends a positive reply to the server with
+ MAC/GID/QP/RKey information, the server associates its QP for the new
+ SMC-R link to the QP that the client provided. Now, the new SMC-R
+ link is in the same situation that the first was in after the client
+ sent its ACK packet -- there is a reliably connected queue pair over
+ the new RoCE path, but there have been no RoCE flows to confirm that
+ it's actually usable. So, at this point, the client and server will
+ exchange CONFIRM LINK LLC messages just like they did on the first
+ SMC-R link.
+
+ If either peer receives a failure during this second CONFIRM LINK LLC
+ exchange (either an immediate failure -- which implies that the
+ message did not reach the partner -- or a timeout), it sends a DELETE
+ LINK LLC message to the partner over the first (and now only) link in
+ the link group. This DELETE LINK LLC message must be acknowledged
+ before data can flow on the single link in the link group.
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 36]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X -- Server Host Y -- Client
+ +-------------------+ +-------------------+
+ | Peer ID = PS1 | | Peer ID = PC1 |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ |RToken X| |MAC MA|<-------------------->|MAC MB| | |
+ | | |GID GA| |GID GB| |RToken Y|
+ | \/ +------+ +------+ \/ |
+ |+--------+ | | +--------+ |
+ || | | | | | |
+ || RMB | | | | RMB | |
+ || | | | | | |
+ |+--------+ | | +--------+ |
+ | /\ +------+ +------+ /\ |
+ | | |RNIC 3| SMC-R Link 2 |RNIC 4| | |
+ |RToken Z| |MAC MC|<-------------------->|MAC MD| |RToken W |
+ | QP 9 |GID GC| (being added) |GID GD| QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ First SMC-R link setup as shown in Figure 8
+ <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->
+
+ ADD LINK request(QP9,MC,GC, link number = 2)
+ ............................................>
+
+ ADD LINK response(QP65,MD,GD, link number = 2)
+ <............................................
+
+ ADD LINK CONTINUATION request(RToken=Z)
+ ............................................>
+
+ ADD LINK CONTINUATION response(RToken=W)
+ <............................................
+
+ CONFIRM LINK(MC,GC,QP9, link number = 2, link user ID)
+ .............................................>
+
+ CONFIRM LINK response(MD,GD,QP65, link number = 2, link user ID)
+ <.............................................
+
+ Legend:
+ ------------ TCP/IP and CLC flows
+ ............ RoCE (LLC) flows
+
+ Figure 9: First Contact, Second Link Setup
+
+
+
+
+
+Fox, et al. Informational [Page 37]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.1.6.3. Exchange of RKeys on Second SMC-R Link
+
+ Note that in the scenario described here -- first contact -- there is
+ only one RMB RKey to exchange on the second SMC-R link, and it is
+ exchanged in the ADD LINK CONTINUATION request and reply. In
+ scenarios other than first contact -- for example, adding a new SMC-R
+ link to a longstanding link group with multiple RMBs -- additional
+ flows will be required to exchange additional RMB RKeys. See
+ Section 3.5.5.2.3 ("Adding a New SMC-R Link to a Link Group with
+ Multiple RMBs") for more details on these flows.
+
+3.5.1.6.4. Aborting SMC-R and Falling Back to IP
+
+ If both partners don't provide the SMC-R TCP option during the
+ three-way TCP handshake, the connection falls back to normal TCP/IP.
+ During the SMC-R negotiation that occurs after the three-way TCP
+ handshake, either partner may break off SMC-R by sending an SMC
+ Decline CLC message. The SMC Decline CLC message may be sent in
+ place of any expected message and may also be sent during the CONFIRM
+ LINK LLC exchange if there is a failure before any application data
+ has flowed over the RoCE fabric. For more details on exactly when an
+ SMC Decline can flow during link group setup, see Appendices C.1
+ ("SMC Decline during CLC Negotiation") and C.2 ("SMC Decline during
+ LLC Negotiation").
+
+ If this fallback to IP happens while setting up a new SMC-R link
+ group, the RoCE resources allocated for this SMC-R link group
+ relationship are torn down, and it will be retried as a new SMC-R
+ link group next time a connection starts between these peers with
+ SMC-R proposed. Note that if this happens because one side doesn't
+ support SMC-R, there will be very little to tear down, as the TCP
+ option will have failed to flow on either the initial SYN or the
+ SYN-ACK before either side had reserved any local RoCE resources.
+
+3.5.2. Subsequent Contact
+
+ "Subsequent contact" means setting up a new TCP connection between
+ two peers that already have an SMC-R link group between them and
+ reusing the existing SMC-R link group. In this case, it is not
+ necessary to allocate new QPs. However, it is possible that a new
+ RMB has been allocated for this TCP connection, if the previous TCP
+ connection used the last element available in the previously used
+ RMB, or for any other implementation-dependent reason. For this
+ reason, and for convenience and error checking, the same TCP
+ option 254, followed by the inline negotiation method described for
+ initial contact, will be used for subsequent contact, but the
+ processing differs in some ways. That processing is described below.
+
+
+
+
+Fox, et al. Informational [Page 38]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.2.1. SMC-R Proposal
+
+ When the client begins the inline negotiation with the server, it
+ does not know if this is a first contact or a subsequent contact.
+ The client cannot know this information until it sees the server's
+ peer ID, to determine whether or not it already has an SMC-R link
+ with this peer that it can use. There are several reasons why it is
+ not sufficient to use the partner IP address, subnet, VLAN, or other
+ IP information to make this determination. The most obvious reason
+ is distributed systems: if the server IP address is actually a
+ virtual IP address representing a distributed cluster, the actual
+ host serving this TCP connection may not be the same as the host that
+ served the last TCP connection to this same IP address.
+
+ After the TCP three-way handshake, assuming that both partners
+ indicate SMC-R capability, the client builds and sends the
+ SMC Proposal CLC message to the server in exactly the same manner as
+ it does in the "first contact" case, and in fact at this point
+ doesn't know if it's a first contact or a subsequent contact. As in
+ the "first contact" case, the client sends its peer ID value,
+ suggested RNIC MAC/GID, and IP subnet or prefix information.
+
+ Upon receiving the client's proposal, the server looks up the
+ provided peer ID to determine if it already has a usable SMC-R
+ link group with this peer. If it does already have a usable SMC-R
+ link group, the server then needs to decide whether it will use the
+ existing SMC-R link group or create a new link group. For the case
+ of the new link group, see Section 3.5.3 ("First Contact Variation:
+ Creating a Parallel Link Group") below.
+
+ For this discussion, assume that the server decides to use the
+ existing SMC-R link group for the TCP connection, which is expected
+ to be the most common case. The server is responsible for making
+ this decision. The server then needs to communicate that information
+ to the client, but it is not necessary to allocate, associate, and
+ confirm QPs for the chosen SMC-R link. All that remains to be done
+ is to set up RMB space for this TCP connection.
+
+ If one of the RMBs already in use for this SMC-R link group has an
+ available element that uses the appropriate buffer size, the server
+ merely chooses one for this TCP connection and then sends an SMC
+ Accept CLC message providing the full RoCE information for the chosen
+ SMC-R link to the client, using the same format as the SMC Accept CLC
+ message described in Section 3.5.1 ("First Contact") above.
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 39]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The server may choose to use the SMC-R link that matches the
+ suggested MAC/GID provided by the client in the SMC Proposal for its
+ RDMA writes but is not obligated to do so. The final decision on
+ which specific SMC-R link to assign a TCP connection to is an
+ independent server and client decision.
+
+ It may be necessary for the server to allocate a new RMB for this
+ connection. The reasons for this are implementation dependent and
+ could include the following:
+
+ o no available space in existing RMB or RMBs, or
+
+ o desire to allocate a new RMB that uses a different buffer size
+ from the ones already created, or
+
+ o any other implementation-dependent reason
+
+ In this case, the server will allocate the new RMB and then perform
+ the flows described in Section 3.5.5.2.1 ("Adding a New RMB to an
+ SMC-R Link Group"). Once that processing is complete, the server
+ then provides the full RoCE information, including the new RKey, for
+ this connection in an SMC Confirm CLC message to the client.
+
+3.5.2.2. SMC-R Acceptance
+
+ Upon receiving the SMC Accept CLC message from the server, the client
+ examines the RoCE information provided by the server to determine
+ whether this is a first contact for a new SMC-R link group or a
+ subsequent contact for an existing SMC-R link group. It is a
+ subsequent contact if the server-side peer ID, GID, MAC, and QP
+ number provided in the packet match a known SMC-R link, and the first
+ contact flag is not set. If this is not the case -- for example, the
+ GID and MAC match but the QP is new -- then the server is creating a
+ new, parallel SMC-R link group, and this is treated as a first
+ contact.
+
+ A different RMB RToken does not indicate a first contact, as the
+ server may have allocated a new RMB or may be using several RMBs for
+ this SMC-R link. The client needs the server's RMB information only
+ for its RDMA writes to the server, and since there is no requirement
+ for symmetric RMBs, this information is simply control information
+ for the RDMA writes on this SMC-R link.
+
+ The client must validate that the RMB element being provided by the
+ server is not in use by another TCP connection on this SMC-R link
+ group. This validation must validate the new <rtoken, index> across
+
+
+
+
+
+Fox, et al. Informational [Page 40]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ all known <rtoken, index> on this link group. See Section 4.4.2
+ ("RMB Element Reuse and Conflict Resolution") for the case in which
+ the server tries to use an RMB element that is already in use on this
+ link group.
+
+ Once the client has determined that this TCP connection is a
+ subsequent contact over an existing SMC-R link, it performs an RMB
+ allocation process similar to what the server did: it either
+ (1) allocates an element from an RMB already associated with this
+ SMC-R link or (2) allocates a new RMB, associates it with this SMC-R
+ link, and then chooses an element out of it.
+
+ If the client allocates a new RMB for this TCP connection, it
+ performs the processing described in Section 3.5.5.2.1 ("Adding a New
+ RMB to an SMC-R Link Group"). Once that processing is complete, the
+ client provides its full RoCE information for this TCP connection in
+ an SMC Confirm CLC message.
+
+ Because an SMC-R link with a verified connected QP already exists and
+ is being reused, there is no need for verification or alternate QP
+ selection flows or timers.
+
+3.5.2.3. SMC-R Confirmation
+
+ When the server receives the client's SMC Confirm CLC message on a
+ subsequent contact, it verifies the following:
+
+ o The RMB element provided by the client is not already in use by
+ another TCP connection on this SMC-R link group (see Section 4.4.2
+ ("RMB Element Reuse and Conflict Resolution") for the case in
+ which it is).
+
+ o The MAC/GID/QP information provided by the client matches an
+ active link within the link group. The client is free to select
+ any valid/active link. The client is not required to select the
+ same link as the server.
+
+ If this validation passes, the server stores the client's RMB
+ information for this connection, and the RoCE setup of the TCP
+ connection is complete.
+
+3.5.2.4. TCP Data Flow Race with SMC Confirm CLC Message
+
+ On a subsequent contact TCP/IP connection, a peer may send data as
+ soon as it has received the peer RMB information for the connection.
+ There are no additional RoCE confirmation flows, since the QPs on the
+ SMC-R link are already reliably connected and verified.
+
+
+
+
+Fox, et al. Informational [Page 41]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ In the majority of cases, the first data will flow from the client to
+ the server. The client must send the SMC Confirm CLC message before
+ sending any connection data over the chosen SMC-R link; however, the
+ client need not wait for confirmation of this message, and in fact
+ there will be no such confirmation. Since the server is required to
+ have the RMB fully set up and ready to receive data from the client
+ before sending an SMC Accept CLC message, the client can begin
+ sending data over the SMC-R link immediately upon completing the send
+ of the SMC Confirm CLC message.
+
+ It is possible that data from the client will arrive at the
+ server-side RMB before the SMC Confirm CLC message from the client
+ has been processed. In this case, the server must handle this race
+ condition and not provide the arrived TCP data to the socket
+ application until the SMC Confirm CLC message has been received and
+ fully processed, opening the socket.
+
+ If the server has initial data to send to the client that is not a
+ response to the client (this case should be rare), it can send the
+ data immediately upon receiving and processing the SMC Confirm CLC
+ message from the client. The client must have opened the TCP socket
+ to the client application upon sending the SMC Confirm CLC message so
+ the client will be ready to process data from the server.
+
+3.5.3. First Contact Variation: Creating a Parallel Link Group
+
+ Recall that parallel SMC-R links within an SMC-R link group are not
+ supported. These are multiple SMC-R links within a link group that
+ use the same network path. However, multiple SMC-R link groups
+ between the same peers are supported. This means that if multiple
+ SMC-R links over the same RoCE path are desired, it is necessary to
+ use multiple SMC-R link groups. While not a recommended practice,
+ this could be done for platform-specific reasons, like QP separation
+ of different workloads. Only the server can drive the creation of
+ multiple SMC-R link groups between peers.
+
+ At a high level, when the server decides to create an additional
+ SMC-R link group with a client with which it already has an SMC-R
+ link group, the flows are basically the same as the normal
+ "first contact" case described above. The following text provides
+ more detail and clarification of processing in this case.
+
+ When the server receives the SMC Proposal CLC message from the client
+ and, using the MAC/GID information, determines that it already has an
+ SMC-R link group with this client, the server can either reuse the
+ existing SMC-R link group (detailed in Section 3.5.2 ("Subsequent
+ Contact") above) or create a new SMC-R link group in addition to the
+ existing one.
+
+
+
+Fox, et al. Informational [Page 42]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ If the server decides to create a new SMC-R link group, it does the
+ same processing it would have done for first contact: allocate QP and
+ RMB resources as well as alternate QP resources, and communicate the
+ QP and RMB information to the client in the SMC Accept CLC message
+ with the first contact flag set.
+
+ When the client receives the server's SMC Accept CLC message with the
+ new QP information and the first contact flag set, it knows that the
+ server is creating a new SMC-R link group even though it already has
+ an SMC-R link group with the server. In this case, the client will
+ also allocate a new QP for this new SMC-R link, allocate an RMB for
+ it, and generate an RKey for it.
+
+ Note that multiple SMC-R link groups between the same peers must
+ access different RMB resources, so new RMBs will be required. Using
+ the same RMBs that are in use in another SMC-R link group is not
+ permitted.
+
+ The client then associates its new QP with the server's new QP and
+ sends its SMC Confirm CLC message back to the server providing the
+ new QP/RMB information, and then sets its confirmation timer for the
+ new SMC-R link.
+
+ When the server receives the client's SMC Confirm CLC message, it
+ associates its QP with the client's QP as learned from the SMC
+ Confirm CLC message and sends a confirmation LLC message. The rest
+ of the flow, with the confirmation QP and setup of additional SMC-R
+ links, unfolds just like the "first contact" case.
+
+3.5.4. Normal SMC-R Link Termination
+
+ The normal socket API trigger points are used by the SMC-R layer to
+ initiate SMC-R connection termination flows. The main design point
+ for SMC-R normal connection flows is to use the SMC-R protocol to
+ first shut down the SMC-R connection and free up any SMC-R RDMA
+ resources, and then allow the normal TCP connection termination
+ protocol (i.e., FIN processing) to drive cleanup of the TCP
+ connection that exists on the IP fabric. This design point is very
+ important in ensuring that RDMA resources such as the RMBEs are only
+ freed and reused when both SMC-R endpoints are completely done with
+ their RDMA write operations to the partner's RMBE.
+
+ When the last TCP connection over an SMC-R link group terminates, the
+ link group can be terminated. Similar to creation of SMC-R links and
+ link groups, the primary responsibility for determining that normal
+ termination is needed and initiating it lies with the server.
+
+
+
+
+
+Fox, et al. Informational [Page 43]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Implementations may opt to set timers to keep SMC-R link groups up
+ for a specified time after the last TCP connection ends, to avoid
+ churn in cases where TCP connections come and go regularly.
+
+ The link or link group may also be terminated as a result of a
+ command initiated by the operator. This command can be entered at
+ either the client or the server. If entered at the client, the
+ client requests that the server perform link or link group
+ termination, and the responsibility for doing so ultimately lies with
+ the server.
+
+ When the server determines that the SMC-R link group is to be
+ terminated, it sends a DELETE LINK LLC message to the client, with a
+ flag set indicating that all links in the link group are to be
+ terminated. After receiving confirmation from the adapter that the
+ DELETE LINK LLC message has been sent, the server can clean up its
+ end of the link group (QPs, RMBs, etc.). Upon receipt of the DELETE
+ LINK message from the server, the client must immediately comply and
+ clean up its end of the link group. Any TCP connections that the
+ client believes to be active on the link group must be immediately
+ terminated.
+
+ The client can request that the server delete the link group as well.
+ The client does this by sending a DELETE LINK message to the server,
+ indicating that cleanup of all links is requested. The server must
+ comply by sending a DELETE LINK to the client and processing as
+ described in the previous paragraph. If there are TCP connections
+ active on the link group when the server receives this request, they
+ are immediately terminated by sending a RST flow over the IP fabric.
+
+3.5.5. Link Group Management Flows
+
+3.5.5.1. Adding and Deleting Links in an SMC-R Link Group
+
+ The server has the lead role in managing the composition of the link
+ group. Links are added to the link group by the server. The client
+ may notify the server of new conditions that may result in the server
+ adding a new link, but the server is ultimately responsible. In
+ general, links are deleted from the link group by the server;
+ however, in certain error cases the client may inform the server that
+ a link must be deleted and treat it as deleted without waiting for
+ action from the server. These flows are detailed in the sections
+ that follow.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 44]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.1.1. Server-Initiated ADD LINK Processing
+
+ As described in previous sections, the server initiates an ADD LINK
+ exchange to create redundancy in a newly created link group. Once a
+ link group is established, the server may also initiate ADD LINK for
+ other reasons, including:
+
+ o Availability of additional resources on the server host to support
+ an additional SMC-R link. This may include the provisioning of an
+ additional RNIC, more storage becoming available to support
+ additional QP resources, operator command, or any other
+ implementation-dependent reason. Note that in order to be
+ available for an existing link group a new RNIC must be attached
+ to the same RoCE LAN that the link group is using.
+
+ o Receipt of notification from the client that additional resources
+ on the client are available to support an additional SMC-R link.
+ See Section 3.5.5.1.2 ("Client-Initiated ADD LINK Processing").
+
+ Server-initiated ADD LINK processing in an established SMC-R link
+ group is the same as the ADD LINK processing described in
+ Section 3.5.1.6 ("Second SMC-R Link Setup"), with the following
+ changes:
+
+ o If an asymmetric SMC-R link already exists in the link group, a
+ second asymmetric link will not be created. Only one asymmetric
+ link is permitted in a link group.
+
+ o TCP data flow on already-existing link(s) in the link group is not
+ halted or otherwise affected during the process of setting up the
+ additional link.
+
+ The server will not initiate ADD LINK processing if the link group
+ already has the maximum number of links negotiated by the partners.
+
+3.5.5.1.2. Client-Initiated ADD LINK Processing
+
+ If an additional RNIC becomes available for an existing SMC-R link
+ group on the client's side, the client notifies the server by sending
+ an ADD LINK request LLC message to the server. Unlike an ADD LINK
+ request sent by the server to the client, this ADD LINK request
+ merely informs the server that the client has a new RNIC. If the
+ link group lacks redundancy or has redundancy only on an asymmetric
+ link with a single RNIC on the client side, the server must initiate
+ an ADD LINK exchange in response to this message, to create or
+ improve the link group's redundancy.
+
+
+
+
+
+Fox, et al. Informational [Page 45]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ If the link group already has symmetric-link redundancy but has fewer
+ than the negotiated maximum number of links, the server may respond
+ by initiating an ADD LINK exchange to create a new link using the
+ client's new resource but is not required to do so.
+
+ If the link group already has the negotiated maximum number of links,
+ the server must ignore the client's ADD LINK request LLC message.
+
+ Because the server is not required to respond to the client's
+ ADD LINK LLC message in all cases, the client must not wait for a
+ response or throw an error if one does not come.
+
+3.5.5.1.3. Server-Initiated DELETE LINK Processing
+
+ Reasons that a server may delete a link include the following:
+
+ o The link has not been used for TCP connections for an
+ implementation-defined time interval, and deleting the link will
+ not cause the link group to lack redundancy.
+
+ o Errors in resources supporting the link occur. These errors may
+ include, but are not limited to, RNIC errors, QP errors, and
+ software errors.
+
+ o The RNIC supporting this SMC-R link is being taken down, either
+ because of an error case or because of an operator or software
+ command.
+
+ If a link being deleted is supporting TCP connections and there are
+ one or more surviving links in the link group, the TCP connections
+ are moved to the surviving links. For more information on this
+ processing, see Section 2.3 ("SMC-R Resilience and Load Balancing").
+
+ The server deletes a link from the link group by sending a
+ DELETE LINK request LLC message to the client over any of the usable
+ links in the link group. Because the DELETE LINK LLC message
+ specifies which link is to be deleted, it may flow over any link in
+ the link group. The server must not clean up its RoCE resources for
+ the link until the client responds.
+
+ The client responds to the server's DELETE LINK request LLC message
+ by sending the server a DELETE LINK response LLC message. The client
+ must respond positively; it cannot decline to delete the link. Once
+ the server has received the client's DELETE LINK response, both sides
+ may clean up their resources for the link.
+
+
+
+
+
+
+Fox, et al. Informational [Page 46]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Either a positive write completion or some other indication from the
+ RNIC on the client's side is sufficient to indicate to the client
+ that the server has received the DELETE LINK response.
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 |
+ |RToken X| |Failed|<--X----X----X----X-->| | |
+ | | | | | | |
+ | \/ +------+ +------+ |
+ |+--------+ | | |
+ || Deleted| | | |
+ || RMB | | | |
+ || | | | |
+ |+--------+ | | |
+ | /\ +------+ +------+ |
+ |RToken Z| | | SMC-R Link 2 | | |
+ | | |RNIC 3|<-------------------->|RNIC 4| |
+ | QP 64| | | | QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ DELETE LINK(request, link number = 1,
+ ................................................>
+ reason code = RNIC failure)
+
+ DELETE LINK(response, link number = 1)
+ <................................................
+
+ (Note: Architecturally, this exchange can flow over either
+ SMC-R link but most likely flows over Link 2, since
+ the RNIC for Link 1 has failed.)
+
+ Figure 10: Server-Initiated DELETE LINK Flow
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 47]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.1.4. Client-Initiated DELETE LINK Request
+
+ The client may request that the server delete a link for the same
+ reasons that the server may delete a link, except for inactivity
+ timeout.
+
+ Because the client depends on the server to delete links, there are
+ two types of delete requests from client to server:
+
+ o Orderly: The client is requesting that the server delete the link
+ when able. This would result from an operator command to bring
+ down the RNIC or some other nonfatal reason. In this case, the
+ server is required to delete the link but may not do it right
+ away.
+
+ o Disorderly: The server must delete the link right away, because
+ the client has experienced a fatal error with the link.
+
+ In either case, the server responds by initiating a DELETE LINK
+ exchange with the client, as described in the previous section. The
+ difference between the two is whether the server must do so
+ immediately or can delay for an opportunity to gracefully delete the
+ link.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 48]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 |
+ |RToken X| | |<---X--X--X--X--X--X->|Failed| |
+ | | | | | | |
+ | \/ +------+ +------+ |
+ |+--------+ | | |
+ || Deleted| | | |
+ || RMB | | | |
+ || | | | |
+ |+--------+ | | |
+ | /\ +------+ +------+ |
+ |RToken Z| | | SMC-R Link 2 | | |
+ | | |RNIC 3|<-------------------->|RNIC 4| |
+ | QP 64| | | | QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ DELETE LINK(request, link number = 1, disorderly,
+ <...............................................
+ reason code = RNIC failure)
+
+ DELETE LINK(request, link number = 1,
+ ................................................>
+ reason code = RNIC failure)
+
+ DELETE LINK(response, link number = 1)
+ <................................................
+
+ (Note: Architecturally, this exchange can flow over either
+ SMC-R link but most likely flows over Link 2, since
+ the RNIC for Link 1 has failed.)
+
+ Figure 11: Client-Initiated DELETE LINK Flow
+
+3.5.5.2. Managing Multiple RKeys over Multiple SMC-R Links in a
+ Link Group
+
+ After the initial contact sequence completes and the number of TCP
+ connections increases, it is possible that the SMC peers could add
+ more RMBs to the link group. Recall that each peer independently
+ manages its RMBs. Also recall that an RMB's RToken is specific to a
+ QP, which means that when there are multiple SMC-R links in a link
+ group, each RMB accessed with the link group requires a separate
+ RToken for each SMC-R link in the group.
+
+
+
+
+
+Fox, et al. Informational [Page 49]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Each RMB that is added to a link must be added to all links within
+ the link group. The set of RMBs created for the link is called the
+ "RToken set". The RTokens must be exchanged with the peer. As RMBs
+ are added and deleted, the RToken set must remain in sync.
+
+3.5.5.2.1. Adding a New RMB to an SMC-R Link Group
+
+ A new RMB can be added to an SMC-R link group on either the client
+ side or the server side. When an additional RMB is added to an
+ existing SMC-R link group, that RMB must be associated with the QPs
+ for each link in the link group. Therefore, when an RMB is added to
+ an SMC-R link group, its RMB RToken for each SMC-R link's QP must be
+ communicated to the peer.
+
+ The tokens for a new RMB added to an existing SMC-R link group are
+ communicated using CONFIRM RKEY LLC messages, as shown in Figure 12.
+ The RToken set is specified as pairs: an SMC-R link number, paired
+ with the new RMB's RToken over that SMC-R link. To preserve failover
+ capability, any TCP connection that uses a newly added RMB cannot go
+ active until all RTokens for the RMB have been communicated for all
+ of the links in the link group.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 50]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 |
+ |RToken X| | |<-------------------->| | |
+ | | | | | | |
+ | \/ +------+ +------+ |
+ |+--------+ | | |
+ || New | | | |
+ || RMB | | | |
+ || | | | |
+ |+--------+ | | |
+ | /\ +------+ +------+ |
+ |RToken Z| | | SMC-R Link 2 | | |
+ | | |RNIC 3|<-------------------->|RNIC 4| |
+ | QP 64| | | | QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ CONFIRM RKEY(request, Add,
+ ................................................>
+ RToken set((Link 1,RToken X),(Link 2,RToken Z)))
+
+ CONFIRM RKEY(response, Add,
+ <................................................
+ RToken set((Link 1,RToken X),(Link 2,RToken Z)))
+
+ (Note: This exchange can flow over either SMC-R link.)
+
+ Figure 12: Add RMB to Existing Link Group
+
+ Implementations may choose to proactively add RMBs to link groups in
+ anticipation of need. For example, an implementation may add a new
+ RMB when a certain usage threshold (e.g., percentage used) for all of
+ its existing RMBs has been exceeded.
+
+ A new RMB may also be added to an existing link group on an as-needed
+ basis -- for example, when a new TCP connection is added to the link
+ group but there are no available RMB elements. In this case, the CLC
+ exchange is paused while the peer that requires the new RMB adds it.
+ An example of this is illustrated in Figure 13.
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 51]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X -- Server Host Y -- Client
+ +-------------------+ +--------------------+
+ | Peer ID = PS1 | | Peer ID = PC1 |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ |RToken X| |MAC MA|<-------------------->|MAC MB| | |
+ | | |GID GA| |GID GB| |RToken Y2|
+ | \/ +------+ +------+ \/ |
+ |+--------+ | | +--------+ |
+ || | | Subnet S1 | | New | |
+ || RMB | | | | RMB | |
+ |+--------+ | | +--------+ |
+ | /\ +------+ +------+ /\ |
+ | | |RNIC 3| SMC-R Link 2 |RNIC 4| |RToken W2|
+ | | |MAC MC|<-------------------->|MAC MD| | |
+ | QP 9 |GID GC| |GID GD| QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +--------------------+
+
+ SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
+ <--------------------------------------------------------->
+
+ SMC Proposal(PC1,MB,GB,S1)
+ <--------------------------------------------------------
+
+ SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
+ --------------------------------------------------------->
+
+ CONFIRM RKEY(request, Add,
+ <........................................................
+ RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
+
+ CONFIRM RKEY(response, Add,
+ ........................................................>
+ RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
+
+ SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
+ <--------------------------------------------------------
+
+ Legend:
+ ------------ TCP/IP and CLC flows
+ ............ RoCE (LLC) flows
+
+ Figure 13: Client Adds RMB during TCP Connection Setup
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 52]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.2.2. Deleting an RMB from an SMC-R Link Group
+
+ Either peer can delete one or more of its RMBs as long as it is not
+ being used for any TCP connections. Ideally, an SMC-R peer would use
+ a timer to avoid freeing an RMB immediately after the last TCP
+ connection stops using it, to keep the RMB available for later TCP
+ connections and avoid thrashing with addition and deletion of RMBs.
+ Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY
+ LLC message to its peer. It can then free the RMB once it receives
+ a response from the peer. Multiple RMBs can be deleted in a
+ DELETE RKEY exchange.
+
+ Note that in a DELETE RKEY message, it is not necessary to specify
+ the full RToken for a deleted RMB. The RMB's RKey over one link in
+ the link group is sufficient to specify which RMB is being deleted.
+
+ Host X Host Y
+ +-------------------+ +-------------------+
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 9 |
+ |RToken X| | |<-------------------->| | |
+ | | | | | | |
+ | \/ +------+ +------+ |
+ |+--------+ | | |
+ || Deleted| | | |
+ || RMB | | | |
+ || | | | |
+ |+--------+ | | |
+ | /\ +------+ +------+ |
+ |RToken Z| | | SMC-R Link 2 | | |
+ | | |RNIC 3|<-------------------->|RNIC 4| |
+ | QP 9 | | | | |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ DELETE RKEY(request, RKey list(RKey X))
+ ................................................>
+
+ DELETE RKEY(response, RKey list(RKey X))
+ <................................................
+
+ (Note: This exchange can flow over either SMC-R link.)
+
+ Figure 14: Delete RMB from SMC-R Link Group
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 53]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.2.3. Adding a New SMC-R Link to a Link Group with Multiple RMBs
+
+ When a new SMC-R link is added to an existing link group, there could
+ be multiple RMBs on each side already associated with the link group.
+ There could also be a different number of RMBs on one side than on
+ the other, because each peer manages its RMBs independently. Each of
+ these RMBs will require a new RToken to be used on the new SMC-R
+ link, and those new RTokens must then be communicated to the peer.
+ This requires two-way communication, as the server will have to
+ communicate its RTokens to the client and vice versa.
+
+ RTokens are communicated between peers in pairs. Each RToken pair
+ consists of:
+
+ o The RToken for the RMB, as is already known on an existing SMC-R
+ link in the link group.
+
+ o The RToken for the same RMB, to be used on the new SMC-R link.
+
+ These pairs are required to ensure that each peer knows which RTokens
+ across QPs are equivalent.
+
+ The ADD LINK request and response LLC messages do not have enough
+ space to contain any RToken pairs. ADD LINK CONTINUATION LLC
+ messages are used to communicate these pairs, as shown in Figure 15.
+ The ADD LINK CONTINUATION LLC messages are sent on the same SMC-R
+ link that the ADD LINK LLC messages were sent over, and in both the
+ ADD LINK and ADD LINK CONTINUATION LLC messages the first RToken in
+ each RToken pair will be the RToken for the RMB as known on the SMC-R
+ link over which the LLC message is being sent.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 54]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Host X -- Server Host Y -- Client
+ +-------------------+ +-------------------+
+ | Peer ID = PS1 | | Peer ID = PC1 |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ |RKey set| |MAC MA|<-------------------->|MAC MB| |RKey set|
+ |X,Y,Z | |GID GA| |GID GB| |Q,R,S,T |
+ | \/ +------+ +------+ \/ |
+ |+--------+ | | +--------+ |
+ || 3 RMBs | | | | 4 RMBs | |
+ |+--------+ | | +--------+ |
+ | /\ +------+ +------+ /\ |
+ |RKey set| |RNIC 3| SMC-R Link 2 |RNIC 4| | RKey set|
+ |U,V,W | |MAC MC|<-------------------->|MAC MD| | L,M,N,P |
+ | QP 9 |GID GC| (being added) |GID GD| QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ ADD LINK request (QP9,MC,GC, link number = 2)
+ ............................................>
+
+ ADD LINK response (QP65,MD,GD, link number = 2)
+ <............................................
+
+ ADD LINK CONTINUATION req(RToken pairs=((X,U),(Y,V),(Z,W)))
+ ............................................>
+
+ ADD LINK CONTINUATION rsp(RToken pairs=((Q,L),(R,M),(S,N),(T,P)))
+ <.............................................
+
+ CONFIRM LINK req/rsp exchange on Link 2
+ <.............................................>
+
+
+ Legend:
+ ------------ TCP/IP and CLC flows
+ ............ RoCE (LLC) flows
+
+ Figure 15: Exchanging RKeys when a New Link Is Added to a Link Group
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 55]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.3. Serialization of LLC Exchanges, and Collisions
+
+ LLC flows can be divided into two main groups for serialization
+ considerations.
+
+ The first group is LLC messages that are independent and can flow at
+ any time. These are one-time, unsolicited messages that either do
+ not have a required response or have a simple response that does not
+ interfere with the operations of another group of messages. These
+ messages are as follows:
+
+ o TEST LINK from either the client or the server: This message
+ requires a TEST LINK response to be returned but does not affect
+ the configuration of the link group or the RKeys.
+
+ o ADD LINK from the client to the server: This message is provided
+ as an "FYI" to the server to let it know that the client has an
+ additional RNIC available. The server is not required to act upon
+ or respond to this message.
+
+ o DELETE LINK from the client to the server: This message informs
+ the server that either (1) the client has experienced an error or
+ problem that requires a link or link group to be terminated or
+ (2) an operator has commanded that a link or link group be
+ terminated. The server does not respond directly to the message;
+ rather, it initiates a DELETE LINK exchange as a result of
+ receiving it.
+
+ o DELETE LINK from the server to the client, with the "delete entire
+ link group" flag set: This message informs the client that the
+ entire link group is being deleted.
+
+ The second group is LLC messages that are part of an exchange of LLC
+ messages that affects link group configuration; this exchange must
+ complete before another exchange of LLC messages that affects link
+ group configuration can be processed. When a peer knows that one of
+ these exchanges is in progress, it must not start another exchange.
+ These exchanges are as follows:
+
+ o ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
+ CONTINUATION response / CONFIRM LINK / CONFIRM LINK response: This
+ exchange, by adding a new link, changes the configuration of the
+ link group.
+
+ o DELETE LINK / DELETE LINK response initiated by the server,
+ without the "delete entire link group" flag set: This exchange, by
+ deleting a link, changes the configuration of the link group.
+
+
+
+
+Fox, et al. Informational [Page 56]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
+ response: This exchange changes the RMB configuration of the link
+ group. RKeys cannot change while links are being added or deleted
+ (while an ADD LINK or DELETE LINK is in progress). However,
+ CONFIRM RKEY and DELETE RKEY are unique in that both the client
+ and server can independently manage (add or remove) their own
+ RMBs. This allows each peer to concurrently change their RKeys
+ and therefore concurrently send CONFIRM RKEY or DELETE RKEY
+ requests. The concurrent CONFIRM RKEY or DELETE RKEY requests can
+ be independently processed and do not represent a collision.
+
+ Because the server is in control of the configuration of the link
+ group, many timing windows and collisions are avoided, but there are
+ still some that must be handled.
+
+3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK Exchange
+
+ Colliding LLC message: TEST LINK
+
+ Action to resolve: Send immediate TEST LINK reply.
+
+ Colliding LLC message: ADD LINK from client to server
+
+ Action to resolve: Server ignores the ADD LINK message. When
+ client receives server's ADD LINK, client will consider that
+ message to be in response to its ADD LINK message and the flow
+ works. Since both client and server know not to start this
+ exchange if an ADD LINK operation is already underway, this can
+ only occur if the client sends this message before receiving the
+ server's ADD LINK and this message crosses with the server's ADD
+ LINK message; therefore, the server's ADD LINK arrives at the
+ client immediately after the client sent this message.
+
+ Colliding LLC message: DELETE LINK from client to server, specific
+ link specified
+
+ Action to resolve: Server queues the DELETE LINK message and
+ processes it after the ADD LINK exchange completes. If it is an
+ orderly link termination, it can wait until after this exchange
+ continues. If it is disorderly and the link affected is the one
+ that the current exchange is using, the server will discover the
+ outage when a message in this exchange fails.
+
+ Colliding LLC message: DELETE LINK from client to server, entire link
+ group to be deleted
+
+ Action to resolve: Immediately clean up the link group.
+
+
+
+
+Fox, et al. Informational [Page 57]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Colliding LLC message: CONFIRM RKEY from client
+
+ Action to resolve: Send a negative CONFIRM RKEY response to the
+ client. Once the current exchange finishes, client will have to
+ recompute its RKey set to include the new link and then start a
+ new CONFIRM RKEY exchange.
+
+3.5.5.3.2. Collisions during DELETE LINK Exchange
+
+ Colliding LLC message: TEST LINK from either peer
+
+ Action to resolve: Send immediate TEST LINK response.
+
+ Colliding LLC message: ADD LINK from client to server
+
+ Action to resolve: Server queues the ADD LINK and processes it
+ after the current exchange completes.
+
+ Colliding LLC message: DELETE LINK from client to server (specific
+ link)
+
+ Action to resolve: Server queues the DELETE LINK message and
+ processes it after the current exchange completes. If it is an
+ orderly link termination, it can wait until after this exchange
+ continues. If it is disorderly and the link affected is the one
+ that the current exchange is using, the server will discover the
+ outage when a message in this exchange fails.
+
+ Colliding LLC message: DELETE LINK from either client or server,
+ deleting the entire link group
+
+ Action to resolve: Immediately clean up the link group.
+
+ Colliding LLC message: CONFIRM RKEY from client to server
+
+ Action to resolve: Send a negative CONFIRM RKEY response to the
+ client. Once the current exchange finishes, client will have to
+ recompute its RKey set to include the new link and then start a
+ new CONFIRM RKEY exchange.
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 58]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+3.5.5.3.3. Collisions during CONFIRM RKEY Exchange
+
+ Colliding LLC message: TEST LINK
+
+ Action to resolve: Send immediate TEST LINK reply.
+
+ Colliding LLC message: ADD LINK from client to server
+
+ Action to resolve: Queue the ADD LINK, and process it after the
+ current exchange completes.
+
+ Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
+ exchange was initiated by the client, and it crossed with the server
+ initiating an ADD LINK exchange)
+
+ Action to resolve: Process the ADD LINK. Client will receive a
+ negative CONFIRM RKEY from the server and will have to redo this
+ CONFIRM RKEY exchange after the ADD LINK exchange completes.
+
+ Colliding LLC message: DELETE LINK from client to server, specific
+ link to be deleted (CONFIRM RKEY exchange was initiated by the
+ server, and it crossed with the client's DELETE LINK request)
+
+ Action to resolve: Server queues the DELETE LINK message and
+ processes it after the CONFIRM RKEY exchange completes. If it is
+ an orderly link termination, it can wait until after this exchange
+ continues. If it is disorderly and the link affected is the one
+ that the current exchange is using, the server will discover the
+ outage when a message in this exchange fails.
+
+ Colliding LLC message: DELETE LINK from server to client, specific
+ link deleted (CONFIRM RKEY exchange was initiated by the client, and
+ it crossed with the server's DELETE LINK)
+
+ Action to resolve: Process the DELETE LINK. Client will receive a
+ negative CONFIRM RKEY from the server and will have to redo this
+ CONFIRM RKEY exchange after the ADD LINK exchange completes.
+
+ Colliding LLC message: DELETE LINK from either client or server,
+ entire link group deleted
+
+ Action to resolve: Immediately clean up the link group.
+
+ Colliding LLC message: CONFIRM LINK from the peer that did not start
+ the current CONFIRM LINK exchange
+
+ Action to resolve: Queue the request, and process it after the
+ current exchange completes.
+
+
+
+Fox, et al. Informational [Page 59]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4. SMC-R Memory-Sharing Architecture
+
+4.1. RMB Element Allocation Considerations
+
+ Each TCP connection using SMC-R must be allocated an RMBE by each
+ SMC-R peer. This allocation is performed by each endpoint
+ independently to allow each endpoint to select an RMBE that best
+ matches the characteristics on its TCP socket endpoint. The RMBE
+ associated with a TCP socket endpoint must have a receive buffer that
+ is at least as large as the TCP receive buffer size in effect for
+ that connection. The receive buffer size can be determined by what
+ is specified explicitly by the application using setsockopt() or
+ implicitly via the system-configured default value. This will allow
+ sufficient data to be RDMA-written by the SMC-R peer to fill an
+ entire receive buffer size's worth of data on a given data flow.
+ Given that each RMB must have fixed-length RMBEs, this implies that
+ an SMC-R endpoint may need to maintain multiple RMBs of various sizes
+ for SMC-R connections on a given SMC-R link and can then select an
+ RMBE that most closely fits a connection.
+
+4.2. RMB and RMBE Format
+
+ An RMB is a virtual memory buffer whose backing real memory is
+ pinned. The RMB is subdivided into a whole number of equal-sized RMB
+ Elements (RMBEs). Each RMBE begins with a 4-byte eye catcher for
+ diagnostic and service purposes, followed by the receive data buffer.
+ The contents of this diagnostic eye catcher are implementation
+ dependent and should be used by the local SMC-R peer to check for
+ overlay errors by verifying an intact eye catcher with every RMBE
+ access.
+
+ The RMBE is a wrapping receive buffer for receiving RDMA writes from
+ the peer. Cursors, as described below, are exchanged between peers
+ to manage and track RDMA writes and local data reads from the RMBE
+ for a TCP connection.
+
+4.3. RMBE Control Information
+
+ RMBE control information consists of consumer cursors, producer
+ cursors, wrap counts, CDC message sequence numbers, control flags
+ such as urgent data and "writer blocked" indicators, and TCP
+ connection information such as termination flags. This information
+ is exchanged between SMC-R peers using CDC messages, which are passed
+ using RoCE SendMsg. A TCP/IP stack implementing SMC-R must receive
+ and store this information in its internal data structures, as it is
+ used to manage the RMBE and its data buffer.
+
+
+
+
+
+Fox, et al. Informational [Page 60]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The format and contents of the CDC message are described in detail in
+ Appendix A.4 ("Connection Data Control (CDC) Message Format"). The
+ following is a high-level description of what this control
+ information contains.
+
+ o Connection state flags such as sending done, connection closed,
+ failover data validation, and abnormal close.
+
+ o A sequence number that is managed by the sender. This sequence
+ number starts at 1, is increased each send, and wraps to 0. This
+ sequence number tracks the CDC message sent and is not related to
+ the number of bytes sent. It is used for failover data
+ validation.
+
+ o Producer cursor: a wrapping offset into the receiver's RMBE data
+ area. Set by the peer that is writing into the RMBE, it points to
+ where the writing peer will write the next byte of data into an
+ RMBE. This cursor is accompanied by a wrap sequence number to
+ help the RMBE owner (the receiver) identify full window size
+ wrapping writes. Note that this cursor must account for (i.e.,
+ skip over) the RMBE eye catcher that is in the beginning of the
+ data area.
+
+ o Consumer cursor: a wrapping offset into the receiver's RMBE data
+ area. Set by the owner of the RMBE (the peer that is reading from
+ it), this cursor points to the offset of the next byte of data to
+ be consumed by the peer in its own RMBE. The sender cannot write
+ beyond this cursor into the receiver's RMBE without causing data
+ loss. Like the producer cursor, this is accompanied by a wrap
+ count to help the writer identify full window size wrapping reads.
+ Note that this cursor must account for (i.e., skip over) the RMBE
+ eye catcher that is in the beginning of the data area.
+
+ o Data flags such as urgent data, writer blocked indicator, and
+ cursor update requests.
+
+4.4. Use of RMBEs
+
+4.4.1. Initializing and Accessing RMBEs
+
+ The RMBE eye catcher is initialized by the RMB owner prior to
+ assigning it to a specific TCP connection and communicating its RMB
+ index to the SMC-R partner. After an RMBE index is communicated to
+ the SMC-R partner, the RMBE can only be referenced in "read-only
+ mode" by the owner, and all updates to it are performed by the remote
+ SMC-R partner via RDMA write operations.
+
+
+
+
+
+Fox, et al. Informational [Page 61]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Initialization of an RMBE must include the following:
+
+ o Zeroing out the entire RMBE receive buffer, which helps minimize
+ data integrity issues (e.g., data from a previous connection
+ somehow being presented to the current connection).
+
+ o Setting the beginning RMBE eye catcher. This eye catcher plays an
+ important role in helping detect accidental overlays of the RMBE.
+ The RMB owner should always validate these eye catchers before
+ each new reference to the RMBE. If the eye catchers are found to
+ be corrupted, the local host must reset the TCP connection
+ associated with this RMBE and log the appropriate diagnostic
+ information.
+
+4.4.2. RMB Element Reuse and Conflict Resolution
+
+ RMB elements can be reused once their associated TCP and SMC-R
+ connections are terminated. Under normal and abnormal SMC-R
+ connection termination processing, both SMC-R peers must explicitly
+ acknowledge that they are done using an RMBE before that element can
+ be freed and reassigned to another SMC-R connection instance. For
+ more details on SMC-R connection termination, refer to Section 4.8.
+
+ However, there are some error scenarios where this two-way explicit
+ acknowledgment may not be completed. In these scenarios, an RMBE
+ owner may choose to reassign this RMBE to a new SMC-R connection
+ instance on this SMC-R link group. When this occurs, the partner
+ SMC-R peer must detect this condition during SMC-R Rendezvous
+ processing when presented with an RMBE that it believes is already in
+ use for a different SMC-R connection. In this case, the SMC-R peer
+ must abort the existing SMC-R connection associated with this RMBE.
+ The abort processing resets the TCP connection (if it is still
+ active), but it must not attempt to perform any RDMA writes to this
+ RMBE and must also ignore any data sitting in the local RMBE
+ associated with the existing connection. It then proceeds to free up
+ the local RMBE and notify the local application that the connection
+ is being abnormally reset.
+
+ The remote SMC-R peer then proceeds to normal processing for this new
+ SMC-R connection.
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 62]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.5. SMC-R Protocol Considerations
+
+ The following sections describe considerations for the SMC-R protocol
+ as compared to TCP.
+
+4.5.1. SMC-R Protocol Optimized Window Size Updates
+
+ An SMC-R receiver host sends its consumer cursor information to the
+ sender to convey the progress that the receiving application has made
+ in consuming the sent data. The difference between the writer's
+ producer cursor and the associated receiver's consumer cursor
+ indicates the window size available for the sender to write into.
+ This is somewhat similar to TCP window update processing and
+ therefore has some similar considerations, such as silly window
+ syndrome avoidance, whereby TCP has an optimization that minimizes
+ the overhead of very small, unproductive window size updates
+ associated with suboptimal socket applications consuming very small
+ amounts of data on every receive() invocation. For SMC-R, the
+ receiver only updates its consumer cursor via a unique CDC message
+ under the following conditions:
+
+ o The current window size (from a sender's perspective) is less than
+ half of the receive buffer space, and the consumer cursor update
+ will result in a minimum increase in the window size of 10% of the
+ receive buffer space. Some examples:
+
+ a. Receive buffer size: 64K, current window size (from a sender's
+ perspective): 50K. No need to update the consumer cursor.
+ Plenty of space is available for the sender.
+
+ b. Receive buffer size: 64K, current window size (from a sender's
+ perspective): 30K, current window size from a receiver's
+ perspective: 31K. No need to update the consumer cursor; even
+ though the sender's window size is < 1/2 of the 64K, the window
+ update would only increase that by 1K, which is < 1/10th of the
+ 64K buffer size.
+
+ c. Receive buffer size: 64K, current window size (from a sender's
+ perspective): 30K, current window size from a receiver's
+ perspective: 64K. The receiver updates the consumer cursor
+ (sender's window size is < 1/2 of the 64K; the window update
+ would increase that by > 6.4K).
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 63]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o The receiver must always include a consumer cursor update whenever
+ it sends a CDC message to the partner for another flow (i.e., send
+ flow in the opposite direction). This allows the window size
+ update to be delivered with no additional overhead. This is
+ somewhat similar to TCP DelayAck processing and quite effective
+ for request/response data patterns.
+
+ o If a peer has set the B-bit in a CDC message, then any consumption
+ of data by the receiver causes a CDC message to be sent, updating
+ the consumer cursor until a CDC message with that bit cleared is
+ received from the peer.
+
+ o The optimized window size updates are overridden when the sender
+ sets the Consumer Cursor Update Requested flag in a CDC message to
+ the receiver. When this indicator is on, the consumer must send a
+ consumer cursor update immediately when data is consumed by the
+ local application or if the cursor has not been updated for a
+ while (i.e., local copy of the consumer cursor does not match the
+ last consumer cursor value sent to the partner). This allows the
+ sender to perform optional diagnostics for detecting a stalled
+ receiver application (data has been sent but not consumed). It is
+ recommended that the Consumer Cursor Update Requested flag only be
+ sent for diagnostic procedures, as it may result in non-optimal
+ data path performance.
+
+4.5.2. Small Data Sends
+
+ The SMC-R protocol makes no special provisions for handling small
+ data segments sent across a stream socket. Data is always sent if
+ sufficient window space is available. In contrast to the TCP Nagle
+ algorithm, there are no special provisions in SMC-R for coalescing
+ small data segments.
+
+ An implementation of SMC-R can be configured to optimize its sending
+ processing by coalescing outbound data for a given SMC-R connection
+ so that it can reduce the number of RDMA write operations it
+ performs, in a fashion similar to Nagle's algorithm. However, any
+ such coalescing would require a timer on the sending host that would
+ ensure that data was eventually sent. Also, the sending host would
+ have to opt out of this processing if Nagle's algorithm had been
+ disabled (programmatically or via system configuration).
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 64]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.5.3. TCP Keepalive Processing
+
+ TCP keepalive processing allows applications to direct the local
+ TCP/IP host to periodically "test" the viability of an idle TCP
+ connection. Since SMC-R connections have a TCP representation along
+ with an SMC-R representation, there are unique keepalive processing
+ considerations:
+
+ o SMC-R-layer keepalive processing: If keepalive is enabled for an
+ SMC-R connection, the local host maintains a keepalive timer that
+ reflects how long an SMC-R connection has been idle. The local
+ host also maintains a timestamp of last activity for each SMC-R
+ link (for any SMC-R connection on that link). When it is
+ determined that an SMC-R connection has been idle longer than the
+ keepalive interval, the host checks to see whether or not the
+ SMC-R link has been idle for a duration longer than the keepalive
+ timeout. If both conditions are met, the local host then performs
+ a TEST LINK LLC command to test the viability of the SMC-R link
+ over the RoCE fabric (RC-QPs). If a TEST LINK LLC command
+ response is received within a reasonable amount of time, then the
+ link is considered viable, and all connections using this link are
+ considered viable as well. If, however, a response is not
+ received in a reasonable amount of time or there's a failure in
+ sending the TEST LINK LLC command, then this is considered a
+ failure in the SMC-R link, and failover processing to an alternate
+ SMC-R link must be triggered. If no alternate SMC-R link exists
+ in the SMC-R link group, then all of the SMC-R connections on this
+ link are abnormally terminated by resetting the TCP connections
+ represented by these SMC-R connections. Given that multiple SMC-R
+ connections can share the same SMC-R link, implementing an SMC-R
+ link-level probe using the TEST LINK LLC command will help reduce
+ the amount of unproductive keepalive traffic for SMC-R
+ connections; as long as some SMC-R connections on a given SMC-R
+ link are active (i.e., have had I/O activity within the keepalive
+ interval), then there is no need to perform additional link
+ viability testing.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 65]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o TCP-layer keepalive processing: Traditional TCP "keepalive"
+ packets are not as relevant for SMC-R connections, given that the
+ TCP path is not used for these connections once the SMC-R
+ Rendezvous processing is completed. All SMC-R connections by
+ default have associated TCP connections that are idle. Are TCP
+ keepalive probes still needed for these connections? There are
+ two main scenarios to consider:
+
+ 1. TCP keepalives that are used to determine whether or not the
+ peer TCP endpoint is still active. This is not needed for
+ SMC-R connections, as the SMC-R-level keepalives mentioned
+ above will determine whether or not the remote endpoint
+ connections are still active.
+
+ 2. TCP keepalives that are used to ensure that TCP connections
+ traversing an intermediate proxy maintain an active state. For
+ example, stateful firewalls typically maintain state
+ representing every valid TCP connection that traverses the
+ firewall. These types of firewalls are known to expire idle
+ connections by removing their state in the firewall to conserve
+ memory. TCP keepalives are often used in this scenario to
+ prevent firewalls from timing out otherwise idle connections.
+ When using SMC-R, both endpoints must reside in the same
+ Layer 2 network (i.e., the same subnet). As a result,
+ firewalls cannot be injected in the path between two SMC-R
+ endpoints. However, other intermediate proxies, such as
+ TCP/IP-layer load balancers, may be injected in the path of two
+ SMC-R endpoints. These types of load balancers also maintain
+ connection state so that they can forward TCP connection
+ traffic to the appropriate cluster endpoint. When using SMC-R,
+ these TCP connections will appear to be completely idle, making
+ them susceptible to potential timeouts at the load-balancing
+ proxy. As a result, for this scenario, TCP keepalives may
+ still be relevant.
+
+ The following are the TCP-level keepalive processing requirements for
+ SMC-R-enabled hosts:
+
+ o SMC-R peers should allow TCP keepalives to flow on the TCP path of
+ SMC-R connections based on existing TCP keepalive configuration
+ and programming options. However, it is strongly recommended that
+ platforms provide the ability to specify very granular keepalive
+ timers (for example, single-digit-second timers) and should
+ consider providing a configuration option that limits the minimum
+ keepalive timer that will be used for TCP-layer keepalives on
+ SMC-R connections. This is important to minimize the amount of
+ TCP keepalive packets transmitted in the network for SMC-R
+ connections.
+
+
+
+Fox, et al. Informational [Page 66]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o SMC-R peers must always respond to inbound TCP-layer keepalives
+ (by sending ACKs for these packets) even if the connection is
+ using SMC-R. Typically, once a TCP connection has completed the
+ SMC-R Rendezvous processing and is using SMC-R for data flows, no
+ new inbound TCP segments are expected on that TCP connection,
+ other than TCP termination segments (FIN, RST, etc.). TCP
+ keepalives are the one exception that must be supported. Also,
+ since TCP keepalive probes do not carry any application-layer
+ data, this has no adverse impact on the application's inbound data
+ stream.
+
+4.6. TCP Connection Failover between SMC-R Links
+
+ A peer may change which SMC-R link within a link group it sends its
+ writes over in the event of a link failure. Since each peer
+ independently chooses which link to send writes over for a specific
+ TCP connection, this process is done independently by each peer.
+
+4.6.1. Validating Data Integrity
+
+ Even though RoCE is a reliable transport, there is a small subset of
+ failure modes that could cause unrecoverable loss of data. When an
+ RNIC acknowledges receipt of an RDMA write to its peer, that creates
+ a write completion event to the sending peer, which allows the sender
+ to release any buffers it is holding for that write. In normal
+ operation and in most failures, this operation is reliable.
+
+ However, there are failure modes possible in which a receiving RNIC
+ has acknowledged an RDMA write but then was not able to place the
+ received data into its host memory -- for example, a sudden,
+ disorderly failure of the interface between the RNIC and the host.
+ While rare, these types of events must be guarded against to ensure
+ data integrity. The process for switching SMC-R links during
+ failover, as described in this section, guards against this
+ possibility and is mandatory.
+
+ Each peer must track the current state of the CDC sequence numbers
+ for a TCP connection. The sender must keep track of the sequence
+ number of the CDC message that described the last write acknowledged
+ by the peer RNIC, or Sequence Sent (SS). In other words, SS
+ describes the last write that the sender believes its peer has
+ successfully received. The receiver must keep track of the sequence
+ number of the CDC message that described the last write that it has
+ successfully received (i.e., the data has been successfully placed
+ into an RMBE), or Sequence Received (SR).
+
+
+
+
+
+
+Fox, et al. Informational [Page 67]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ When an RNIC fails and the sender changes SMC-R links, the sender
+ must first send a CDC message with the F-bit (failover validation
+ indicator; see Appendix A.4) set over the new SMC-R link. This is
+ the failover data validation message. The sequence number in this
+ CDC message is equal to SS. The CDC message key, the length, and the
+ SMC-R alert token are the only other fields in this CDC message that
+ are significant. No reply is expected from this validation message,
+ and once the sender has sent it, the sender may resume sending on the
+ new SMC-R link as described in Section 4.6.2.
+
+ Upon receipt of the failover validation message, the receiver must
+ verify that its SR value for the TCP connection is equal to or
+ greater than the sequence number in the failover validation message.
+ If so, no further action is required, and the TCP connection resumes
+ on the new SMC-R link. If SR is less than the sequence number value
+ in the validation message, data has been lost, and the receiver must
+ immediately reset the TCP connection.
+
+4.6.2. Resuming the TCP Connection on a New SMC-R Link
+
+ When a connection is moved to a new SMC-R link and the failover
+ validation message has been sent, the sender can immediately resume
+ normal transmission. In order to preserve the application message
+ stream, the sender must replay any RDMA writes (and their associated
+ CDC messages) that were in progress or failed when the previous SMC-R
+ link failed, before sending new data on the new SMC-R link. The
+ sender has two options for accomplishing this:
+
+ o Preserve the sequence numbers "as is": Retry all failed and
+ pending operations as they were originally done, including
+ reposting all associated RDMA write operations and their
+ associated CDC messages without making any changes. Then resume
+ sending new data using new sequence numbers.
+
+ o Combine pending messages and possibly add new data: Combine failed
+ and pending messages into a single new write with a new sequence
+ number. This allows the sender to combine pending messages into
+ fewer operations. As a further optimization, this write can also
+ include new data, as long as all failed and pending data are also
+ included. If this approach is taken, the sequence number must be
+ increased beyond the last failed or pending sequence number.
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 68]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.7. RMB Data Flows
+
+ The following sections describe the RDMA wire flows for the SMC-R
+ protocol after a TCP connection has switched into SMC-R mode (i.e.,
+ SMC-R Rendezvous processing is complete and a pair of RMB elements
+ has been assigned and communicated by the SMC-R peers). The ladder
+ diagrams below include the following:
+
+ o RMBE control information kept by each peer. Only a subset of the
+ information is depicted, specifically only the fields that reflect
+ the stream of data written by Host A and read by Host B.
+
+ o Time line 0-x, which shows the wire flows in a time-relative
+ fashion.
+
+ o Note that RMBE control information is only shown in a time
+ interval if its value changed (otherwise, assume that the value is
+ unchanged from the previously depicted value).
+
+ o The local copy of the producer cursors and consumer cursors that
+ is maintained by each host is not depicted in these figures. Note
+ that the cursor values in the diagram reflect the necessity of
+ skipping over the eye catcher in the RMBE data area. They start
+ and wrap at 4, not 0.
+
+4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flags
+ 4 0 0 0 4 0 0
+ 0 0 1 ---------------> 1 0 0 0
+ RDMA-WR Data
+ (4:1003)
+ 4 0 2 ...............> 2 1004 0 0
+ CDC Message
+
+ Figure 16: Scenario 1: Send Flow, Window Size Unconstrained
+
+ Scenario assumptions:
+
+ o Kernel implementation.
+
+ o New SMC-R connection; no data has been sent on the connection.
+
+
+
+
+
+
+Fox, et al. Informational [Page 69]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o Host A: Application issues send for 1000 bytes to Host B.
+
+ o Host B: RMBE receive buffer size is 10,000; application has issued
+ a recv for 10,000 bytes.
+
+ Flow description:
+
+ 1. The application issues a send() for 1000 bytes; the SMC-R layer
+ copies data into a kernel send buffer. It then schedules an RDMA
+ write operation to move the data into the peer's RMBE receive
+ buffer, at relative position 4-1003 (to skip the 4-byte
+ eye catcher in the RMBE data area). Note that no immediate data
+ or alert (i.e., interrupt) is provided to Host B for this RDMA
+ operation.
+
+ 2. Host A sends a CDC message to update the producer cursor to
+ byte 1004. This CDC message will deliver an interrupt to Host B.
+ At this point, the SMC-R layer can return control back to the
+ application. Host B, once notified of the completion of the
+ previous RDMA operation, locates the RMBE associated with the RMBE
+ alert token that was included in the message and proceeds to
+ perform normal receive-side processing, waking up the suspended
+ application read thread, copying the data into the application's
+ receive buffer, etc. It will use the producer cursor as an
+ indicator of how much data is available to be delivered to the
+ local application. After this processing is complete, the SMC-R
+ layer will also update its local consumer cursor to match the
+ producer cursor (i.e., indicating that all data has been
+ consumed). Note that a message to the peer updating the consumer
+ cursor is not needed at this time, as the window size is
+ unconstrained (> 1/2 of the receive buffer size). The window size
+ is calculated by taking the difference between the producer cursor
+ and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 70]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flags
+ 4 0 0 0 4 0 0
+ 0 0 1 ---------------> 1 0 0 0
+ RDMA-WR Data
+ (4:1003)
+ 4 0 2 ...............> 2 1004 0 0
+ CDC Message
+
+ 0 0 3 <-------------- 3 1004 0 0
+ RDMA-WR Data
+ (4:503)
+ 1004 0 4 <.............. 4 1004 0 0
+ CDC Message
+
+ Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained
+
+ Scenario assumptions:
+
+ o New SMC-R connection; no data has been sent on the connection.
+
+ o Host A: Application issues send for 1000 bytes to Host B.
+
+ o Host B: RMBE receive buffer size is 10,000; application has
+ already issued a recv for 10,000 bytes. Once the receive is
+ completed, the application sends a 500-byte response to Host A.
+
+ Flow description:
+
+ 1. The application issues a send() for 1000 bytes; the SMC-R layer
+ copies data into a kernel send buffer. It then schedules an RDMA
+ write operation to move the data into the peer's RMBE receive
+ buffer, at relative position 4-1003. Note that no immediate data
+ or alert (i.e., interrupt) is provided to Host B for this RDMA
+ operation.
+
+ 2. Host A sends a CDC message to update the producer cursor to
+ byte 1004. This CDC message will deliver an interrupt to Host B.
+ At this point, the SMC-R layer can return control back to the
+ application.
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 71]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 3. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token and proceeds
+ to perform normal receive-side processing, waking up the suspended
+ application read thread, copying the data into the application's
+ receive buffer, etc. After this processing is complete, the SMC-R
+ layer will also update its local consumer cursor to match the
+ producer cursor (i.e., indicating that all data has been
+ consumed). Note that an update of the consumer cursor to the peer
+ is not needed at this time, as the window size is unconstrained
+ (> 1/2 of the receive buffer size). The application then performs
+ a send() for 500 bytes to Host A. The SMC-R layer will copy the
+ data into a kernel buffer and then schedule an RDMA write into the
+ partner's RMBE receive buffer. Note that this RDMA write
+ operation includes no immediate data or notification to Host A.
+
+ 4. Host B sends a CDC message to update the partner's RMBE control
+ information with the latest producer cursor (set to 503 and not
+ shown in the diagram above) and to also inform the peer that the
+ consumer cursor value is now 1004. It also updates the local
+ current consumer cursor and the last sent consumer cursor to 1004.
+ This CDC message includes notification, since we are updating our
+ producer cursor; this requires attention by the peer host.
+
+4.7.3. Scenario 3: Send Flow, Window Size Constrained
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flags
+ 4 0 0 0 4 0 0
+ 4 0 1 ---------------> 1 4 0 0
+ RDMA-WR Data
+ (4:3003)
+ 4 0 2 ...............> 2 3004 0 0
+ CDC Message
+ 4 0 3 3 3004 0 0
+ 4 0 4 ---------------> 4 3004 0 0
+ RDMA-WR Data
+ (3004:7003)
+ 4 0 5 ................> 5 7004 0 0
+ CDC Message
+ 7004 0 6 <................ 6 7004 0 0
+ CDC Message
+
+ Figure 18: Scenario 3: Send Flow, Window Size Constrained
+
+
+
+
+
+
+Fox, et al. Informational [Page 72]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Scenario assumptions:
+
+ o New SMC-R connection; no data has been sent on this connection.
+
+ o Host A: Application issues send for 3000 bytes to Host B and then
+ another send for 4000 bytes.
+
+ o Host B: RMBE receive buffer size is 10,000. Application has
+ already issued a recv for 10,000 bytes.
+
+ Flow description:
+
+ 1. The application issues a send() for 3000 bytes; the SMC-R layer
+ copies data into a kernel send buffer. It then schedules an RDMA
+ write operation to move the data into the peer's RMBE receive
+ buffer, at relative position 4-3003. Note that no immediate data
+ or alert (i.e., interrupt) is provided to Host B for this RDMA
+ operation.
+
+ 2. Host A sends a CDC message to update its producer cursor to
+ byte 3003. This CDC message will deliver an interrupt to Host B.
+ At this point, the SMC-R layer can return control back to the
+ application.
+
+ 3. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token and proceeds
+ to perform normal receive-side processing, waking up the suspended
+ application read thread, copying the data into the application's
+ receive buffer, etc. After this processing is complete, the SMC-R
+ layer will also update its local consumer cursor to match the
+ producer cursor (i.e., indicating that all data has been
+ consumed). It will not, however, update the partner with this
+ information, as the window size is not constrained
+ (10,000 - 3000 = 7000 bytes of available space). The application
+ on Host B also issues a new recv() for 10,000 bytes.
+
+ 4. On Host A, the application issues a send() for 4000 bytes. The
+ SMC-R layer copies the data into a kernel buffer and schedules an
+ async RDMA write into the peer's RMBE receive buffer at relative
+ position 3003-7004. Note that no alert is provided to Host B for
+ this flow.
+
+ 5. Host A sends a CDC message to update the producer cursor to
+ byte 7004. This CDC message will deliver an interrupt to Host B.
+ At this point, the SMC-R layer can return control back to the
+ application.
+
+
+
+
+
+Fox, et al. Informational [Page 73]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 6. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token and proceeds
+ to perform normal receive-side processing, waking up the suspended
+ application read thread, copying the data into the application's
+ receive buffer, etc. After this processing is complete, the SMC-R
+ layer will also update its local consumer cursor to match the
+ producer cursor (i.e., indicating that all data has been
+ consumed). It will then determine whether or not it needs to
+ update the consumer cursor to the peer. The available window size
+ is now 3000 (10,000 - (producer cursor - last sent consumer
+ cursor)), which is < 1/2 of the receive buffer size
+ (10,000/2 = 5000), and the advance of the window size is > 10% of
+ the window size (1000). Therefore, a CDC message is issued to
+ update the consumer cursor to Peer A.
+
+4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flags
+ 1004 1 0 0 1004 1 0
+ 1004 1 1 ---------------> 1 1004 1 0
+ RDMA-WR Data
+ (1004:9999)
+ 1004 1 2 ---------------> 2 1004 1 0
+ RDMA-WR Data
+ (4:1003)
+ 1004 1 3 ...............> 3 1004 2 Wrt
+ CDC Message Blk
+
+ 1004 2 4 <............... 4 1004 2 Wrt
+ CDC Message Blk
+
+ 1004 2 5 ---------------> 5 1004 2 Wrt
+ RDMA-WR Data Blk
+ (1004:9999)
+ 1004 2 6 ---------------> 6 1004 2 Wrt
+ RDMA-WR Data Blk
+ (4:1003)
+ 1004 2 7 ...............> 7 1004 3 Wrt
+ CDC Message Blk
+
+ 1004 3 8 <............... 8 1004 3 Wrt
+ CDC Message Blk
+
+ Figure 19: Scenario 4: Large Send, Flow Control,
+ Full Window Size Writes
+
+
+
+Fox, et al. Informational [Page 74]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Scenario assumptions:
+
+ o Kernel implementation.
+
+ o Existing SMC-R connection, Host B's receive window size is fully
+ open (peer consumer cursor = peer producer cursor).
+
+ o Host A: Application issues send for 20,000 bytes to Host B.
+
+ o Host B: RMBE receive buffer size is 10,000; application has issued
+ a recv for 10,000 bytes.
+
+ Flow description:
+
+ 1. The application issues a send() for 20,000 bytes; the SMC-R layer
+ copies data into a kernel send buffer (assumes that send buffer
+ space of 20,000 is available for this connection). It then
+ schedules an RDMA write operation to move the data into the peer's
+ RMBE receive buffer, at relative position 1004-9999. Note that no
+ immediate data or alert (i.e., interrupt) is provided to Host B
+ for this RDMA operation.
+
+ 2. Host A then schedules an RDMA write operation to fill the
+ remaining 1000 bytes of available space in the peer's RMBE receive
+ buffer, at relative position 4-1003. Note that no immediate data
+ or alert (i.e., interrupt) is provided to Host B for this RDMA
+ operation. Also note that an implementation of SMC-R may optimize
+ this processing by combining steps 1 and 2 into a single
+ RDMA write operation (with two different data sources).
+
+ 3. Host A sends a CDC message to update the producer cursor to
+ byte 1004. Since the entire receive buffer space is filled, the
+ producer writer blocked flag (the "Wrt Blk" indicator (flag) in
+ Figure 19) is set and the producer cursor wrap sequence number
+ (the producer "Wrap Seq#" in Figure 19) is incremented. This CDC
+ message will deliver an interrupt to Host B. At this point, the
+ SMC-R layer can return control back to the application.
+
+ 4. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token and proceeds
+ to perform normal receive-side processing, waking up the suspended
+ application read thread, copying the data into the application's
+ receive buffer, etc. In this scenario, Host B notices that the
+ producer cursor has not been advanced (same value as the consumer
+ cursor); however, it notices that the producer cursor wrap
+ sequence number is different from its local value (1), indicating
+ that a full window of new data is available. All of the data in
+ the receive buffer can be processed, with the first segment
+
+
+
+Fox, et al. Informational [Page 75]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ (1004-9999) followed by the second segment (4-1003). Because the
+ producer writer blocked indicator was set, Host B schedules a CDC
+ message to update its latest information to the peer: consumer
+ cursor (1004), consumer cursor wrap sequence number (the current
+ value of 2 is used).
+
+ 5. Host A, upon receipt of the CDC message, locates the TCP
+ connection associated with the alert token and, upon examining the
+ control information provided, notices that Host B has consumed all
+ of the data (based on the consumer cursor and the consumer cursor
+ wrap sequence number) and initiates the next RDMA write to fill
+ the receive buffer at offset 1003-9999.
+
+ 6. Host A then moves the next 1000 bytes into the beginning of the
+ receive buffer (4-1003) by scheduling an RDMA write operation.
+ Note that at this point there are still 8 bytes remaining to be
+ written.
+
+ 7. Host A then sends a CDC message to set the producer writer blocked
+ indicator and to increment the producer cursor wrap sequence
+ number (3).
+
+ 8. Host B, upon notification, completes the same processing as step 4
+ above, including sending a CDC message to update the peer to
+ indicate that all data has been consumed. At this point, Host A
+ can write the final 8 bytes to Host B's RMBE into
+ positions 1004-1011 (not shown).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 76]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flag
+ 1000 1 0 0 1000 1 0
+ 1000 1 1 ---------------> 1 1000 1 0
+ RDMA-WR Data
+ (1000:1499)
+ 1000 1 2 ...............> 2 1500 1 UrgP
+ CDC Message UrgA
+
+ 1500 1 3 <............... 3 1500 1 UrgP
+ CDC Message UrgA
+
+ 1500 1 4 ---------------> 4 1500 1 UrgP
+ RDMA-WR Data UrgA
+ (1500:2499)
+ 1500 1 5 ...............> 5 2500 1 0
+ CDC Message
+
+ Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open
+
+ Scenario assumptions:
+
+ o Kernel implementation.
+
+ o Existing SMC-R connection; window size open (unconstrained); all
+ data has been consumed by receiver.
+
+ o Host A: Application issues send for 500 bytes with urgent data
+ indicator (out of band) to Host B, then sends 1000 bytes of
+ normal data.
+
+ o Host B: RMBE receive buffer size is 10,000; application has issued
+ a recv for 10,000 bytes and is also monitoring the socket for
+ urgent data.
+
+ Flow description:
+
+ 1. The application issues a send() for 500 bytes of urgent data; the
+ SMC-R layer copies data into a kernel send buffer. It then
+ schedules an RDMA write operation to move the data into the peer's
+ RMBE receive buffer, at relative position 1000-1499. Note that no
+ immediate data or alert (i.e., interrupt) is provided to Host B
+ for this RDMA operation.
+
+
+
+
+Fox, et al. Informational [Page 77]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 2. Host A sends a CDC message to update its producer cursor to
+ byte 1500 and to turn on the producer Urgent Data Pending (UrgP)
+ and Urgent Data Present (UrgA) flags. This CDC message will
+ deliver an interrupt to Host B. At this point, the SMC-R layer
+ can return control back to the application.
+
+ 3. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token, notices
+ that the Urgent Data Pending flag is on, and proceeds with out-of-
+ band socket API notification -- for example, satisfying any
+ outstanding select() or poll() requests on the socket by
+ indicating that urgent data is pending (i.e., by setting the
+ exception bit on). The urgent data present indicator allows
+ Host B to also determine the position of the urgent data (the
+ producer cursor points 1 byte beyond the last byte of urgent
+ data). Host B can then perform normal receive-side processing
+ (including specific urgent data processing), copying the data into
+ the application's receive buffer, etc. Host B then sends a CDC
+ message to update the partner's RMBE control area with its latest
+ consumer cursor (1500). Note that this CDC message must occur,
+ regardless of the current local window size that is available.
+ The partner host (Host A) cannot initiate any additional RDMA
+ writes until it receives acknowledgment that the urgent data has
+ been processed (or at least processed/remembered at the SMC-R
+ layer).
+
+ 4. Upon receipt of the message, Host A wakes up, sees that the peer
+ consumed all data up to and including the last byte of urgent
+ data, and now resumes sending any pending data. In this case, the
+ application had previously issued a send for 1000 bytes of normal
+ data, which would have been copied in the send buffer, and control
+ would have been returned to the application. Host A now initiates
+ an RDMA write to move that data to the peer's receive buffer at
+ position 1500-2499.
+
+ 5. Host A then sends a CDC message to update its producer cursor
+ value (2500) and to turn off the Urgent Data Pending and Urgent
+ Data Present flags. Host B wakes up, processes the new data
+ (resumes application, copies data into the application receive
+ buffer), and then proceeds to update the local current consumer
+ cursor (2500). Given that the window size is unconstrained, there
+ is no need for a consumer cursor update in the peer's RMBE.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 78]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed
+
+ SMC Host A SMC Host B
+ RMBE A Info RMBE B Info
+ (Consumer Cursors) (Producer Cursors)
+ Cursor Wrap Seq# Time Time Cursor Wrap Seq# Flag
+ 1000 1 0 0 1000 2 Wrt
+ Blk
+
+ 1000 1 1 ...............> 1 1000 2 Wrt
+ CDC Message Blk
+ UrgP
+
+ 1000 2 2 <............... 2 1000 2 Wrt
+ CDC Message Blk
+ UrgP
+
+ 1000 2 3 ---------------> 3 1000 2 Wrt
+ RDMA-WR Data Blk
+ (1000:1499) UrgP
+
+ 1000 2 4 ...............> 4 1500 2 UrgP
+ CDC Message UrgA
+
+ 1500 2 5 <............... 5 1500 2 UrgP
+ CDC Message UrgA
+
+ 1500 2 6 ---------------> 6 1500 2 UrgP
+ RDMA-WR Data UrgA
+ (1500:2499)
+ 1000 2 7 ...............> 7 2500 2 0
+ CDC Message
+
+ Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed
+
+ Scenario assumptions:
+
+ o Kernel implementation.
+
+ o Existing SMC-R connection; window size closed; writer is blocked.
+
+ o Host A: Application issues send for 500 bytes with urgent data
+ indicator (out of band) to Host B, then sends 1000 bytes of
+ normal data.
+
+ o Host B: RMBE receive buffer size is 10,000; application has no
+ outstanding recv() (for normal data) and is monitoring the socket
+ for urgent data.
+
+
+
+Fox, et al. Informational [Page 79]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Flow description:
+
+ 1. The application issues a send() for 500 bytes of urgent data; the
+ SMC-R layer copies data into a kernel send buffer (if available).
+ Since the writer is blocked (window size closed), it cannot send
+ the data immediately. It then sends a CDC message to notify the
+ peer of the Urgent Data Pending (UrgP) indicator (the writer
+ blocked indicator remains on as well). This serves as a signal to
+ Host B that urgent data is pending in the stream. Control is also
+ returned to the application at this point.
+
+ 2. Host B, once notified of the receipt of the previous CDC message,
+ locates the RMBE associated with the RMBE alert token, notices
+ that the Urgent Data Pending flag is on, and proceeds with out-of-
+ band socket API notification -- for example, satisfying any
+ outstanding select() or poll() requests on the socket by
+ indicating that urgent data is pending (i.e., by setting the
+ exception bit on). At this point, it is expected that the
+ application will enter urgent data mode processing, expeditiously
+ processing all normal data (by issuing recv API calls) so that it
+ can get to the urgent data byte. Whether the application has this
+ urgent mode processing or not, at some point, the application will
+ consume some or all of the pending data in the receive buffer.
+ When this occurs, Host B will also send a CDC message to update
+ its consumer cursor and consumer cursor wrap sequence number to
+ the peer. In the example above, a full window's worth of data was
+ consumed.
+
+ 3. Host A, once awakened by the message, will notice that the window
+ size is now open on this connection (based on the consumer cursor
+ and the consumer cursor wrap sequence number, which now matches
+ the producer cursor wrap sequence number) and resume sending of
+ the urgent data segment by scheduling an RDMA write into relative
+ position 1000-1499.
+
+ 4. Host A then sends a CDC message to advance its producer cursor
+ (1500) and to also notify Host B of the Urgent Data Present (UrgA)
+ indicator (and turn off the writer blocked indicator). This
+ signals to Host B that the urgent data is now in the local receive
+ buffer and that the producer cursor points to the last byte of
+ urgent data.
+
+ 5. Host B wakes up, processes the urgent data, and, once the urgent
+ data is consumed, sends a CDC message to update its consumer
+ cursor (1500).
+
+
+
+
+
+
+Fox, et al. Informational [Page 80]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 6. Host A wakes up, sees that Host B has consumed the sequence number
+ associated with the urgent data, and then initiates the next RDMA
+ write operation to move the 1000 bytes associated with the next
+ send() of normal data into the peer's receive buffer at
+ position 1500-2499. Note that the send API would have likely
+ completed earlier in the process by copying the 1000 bytes into a
+ send buffer and returning back to the application, even though we
+ could not send any new data until the urgent data was processed
+ and acknowledged by Host B.
+
+ 7. Host A sends a CDC message to advance its producer cursor to 2500
+ and to reset the Urgent Data Pending and Urgent Data Present
+ flags. Host B wakes up and processes the inbound data.
+
+4.8. Connection Termination
+
+ Just as SMC-R connections are established using a combination of TCP
+ connection establishment flows and SMC-R protocol flows, the
+ termination of SMC-R connections also uses a similar combination of
+ SMC-R protocol termination flows and normal TCP connection
+ termination flows. The following sections describe the SMC-R
+ protocol normal and abnormal connection termination flows.
+
+4.8.1. Normal SMC-R Connection Termination Flows
+
+ Normal SMC-R connection flows are triggered via the normal stream
+ socket API semantics, namely by the application issuing a close() or
+ shutdown() API. Most applications, after consuming all incoming data
+ and after sending any outbound data, will then issue a close() API to
+ indicate that they are done both sending and receiving data. Some
+ applications, typically a small percentage, make use of the
+ shutdown() API that allows them to indicate that the application is
+ done sending data, receiving data, or both sending and receiving
+ data. The main use of this API is scenarios where a TCP application
+ wants to alert its partner endpoint that it is done sending data but
+ is still receiving data on its socket (shutdown for write). Issuing
+ shutdown() for both sending and receiving data is really no different
+ than issuing a close() and can therefore be treated in a similar
+ fashion. Shutdown for read is typically not a very useful operation
+ and in normal circumstances does not trigger any network flows to
+ notify the partner TCP endpoint of this operation.
+
+ These same trigger points will be used by the SMC-R layer to initiate
+ SMC-R connection termination flows. The main design point for SMC-R
+ normal connection flows is to use the SMC-R protocol to first shut
+ down the SMC-R connection and free up any SMC-R RDMA resources, and
+ then allow the normal TCP connection termination protocol (i.e., FIN
+ processing) to drive cleanup of the TCP connection. This design
+
+
+
+Fox, et al. Informational [Page 81]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ point is very important in ensuring that RDMA resources such as
+ the RMBEs are only freed and reused when both SMC-R endpoints
+ are completely done with their RDMA write operations to the
+ partner's RMBE.
+
+ 1
+ +-----------------+
+ |-------------->| CLOSED |<-------------|
+ 3D | | | | 4D
+ | +-----------------+ |
+ | | |
+ | 2 | |
+ | V |
+ +----------------+ +-----------------+ +----------------+
+ |AppFinCloseWait | | ACTIVE | |PeerFinCloseWait|
+ | | | | | |
+ +----------------+ +-----------------+ +----------------+
+ | | | |
+ | Active Close | 3A | 4A | Passive Close |
+ | V | V |
+ | +--------------+ | +-------------+ |
+ |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
+ 3C | | | | | | | 4C
+ | +--------------+ | +-------------+ |
+ | | | | |
+ | | 3B | 4B | |
+ | V | V |
+ | +--------------+ | +-------------+ |
+ |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
+ | | | | |
+ +--------------+ | +-------------+
+ |
+ |
+
+ Figure 22: SMC-R Connection States
+
+ Figure 22 describes the states that an SMC-R connection typically
+ goes through. Note that there are variations to these states that
+ can occur when an SMC-R connection is abnormally terminated, similar
+ in a way to when a TCP connection is reset. The following are the
+ high-level state transitions for an SMC-R connection:
+
+ 1. An SMC-R connection begins in the Closed state. This state is
+ meant to reflect an RMBE that is not currently in use (was
+ previously in use but no longer is, or was never allocated).
+
+
+
+
+
+
+Fox, et al. Informational [Page 82]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 2. An SMC-R connection progresses to the Active state once the SMC-R
+ Rendezvous processing has successfully completed, RMB element
+ indices have been exchanged, and SMC-R links have been activated.
+ In this state, the TCP connection is fully established, rendezvous
+ processing has been completed, and SMC-R peers can begin the
+ exchange of data via RDMA.
+
+ 3. Active close processing (on the SMC-R peer that is initiating the
+ connection termination).
+
+ A. When an application on one of the SMC-R connection peers issues
+ a close(), a shutdown() for write, or a shutdown() for both
+ read and write, the SMC-R layer on that host will initiate
+ SMC-R connection termination processing. First, if a close()
+ or shutdown(both) is issued, it will check to see that there's
+ no data in the local RMB element that has not been read by the
+ application. If unread data is detected, the SMC-R connection
+ must be abnormally reset; for more details on this, refer to
+ Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows").
+ If no unread data is pending, it then checks to see whether or
+ not any outstanding data is waiting to be written to the peer,
+ or if any outstanding RDMA writes for this SMC-R connection
+ have not yet completed. If either of these two scenarios is
+ true, an indicator that this connection is in a pending close
+ state is saved in internal data structures representing this
+ SMC-R connection, and control is returned to the application.
+ If all data to be written to the partner has completed, this
+ peer will send a CDC message to notify the peer of either the
+ PeerConnectionClosed indicator (close or shutdown for both was
+ issued) or the PeerDoneWriting indicator. This will provide an
+ interrupt to inform that partner SMC-R peer that the connection
+ is terminating. At this point, the local side of the SMC-R
+ connection transitions in the PeerCloseWait1 state, and control
+ can be returned to the application. If this process could not
+ be completed synchronously (the pending close condition
+ mentioned above), it is completed when all RDMA writes for data
+ and control cursors have been completed.
+
+ B. At some point, the SMC-R peer application (passive close) will
+ consume all incoming data, realize that that partner is done
+ sending data on this connection, and proceed to initiate its
+ own close of the connection once it has completed sending all
+ data from its end. The partner application can initiate this
+ connection termination processing via close() or shutdown()
+ APIs. If the application does so by issuing a shutdown() for
+ write, then the partner SMC-R layer will send a CDC message to
+ notify the peer (the active close side) of the PeerDoneWriting
+ indicator. When the "active close" SMC-R peer wakes up as a
+
+
+
+Fox, et al. Informational [Page 83]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ result of the previous CDC message, it will notice that the
+ PeerDoneWriting indicator is now on and transition to the
+ PeerCloseWait2 state. This state indicates that the peer is
+ done sending data and may still be reading data. At this
+ point, the "active close" peer will also need to ensure that
+ any outstanding recv() calls for this socket are woken up and
+ remember that no more data is forthcoming on this connection
+ (in case the local connection was shutdown() for write only).
+
+ C. This flow is a common transition from 3A or 3B above. When the
+ SMC-R peer (passive close) consumes all data and updates all
+ necessary cursors to the peer, and the application closes its
+ socket (close or shutdown for both), it will send a CDC message
+ to the peer (the active close side) with the
+ PeerConnectionClosed indicator set. At this point, the
+ connection can transition back to the Closed state if the local
+ application has already closed (or issued shutdown for both)
+ the socket. Once in the Closed state, the RMBE can now be
+ safely reused for a new SMC-R connection. When the
+ PeerConnectionClosed indicator is turned on, the SMC-R peer is
+ indicating that it is done updating the partner's RMBE.
+
+ D. Conditional state: If the local application has not yet issued
+ a close() or shutdown(both), we need to wait until the
+ application does so. Once it does, the local host will send a
+ CDC message to notify the peer of the PeerConnectionClosed
+ indicator and then transition to the Closed state.
+
+ 4. Passive close processing (on the SMC-R peer that receives an
+ indication that the partner is closing the connection).
+
+ A. Upon receipt of a CDC message, the SMC-R layer will detect that
+ the PeerConnectionClosed indicator or PeerDoneWriting indicator
+ is on. If any outstanding recv() calls are pending, they are
+ completed with an indicator that the partner has closed the
+ connection (zero-length data presented to the application). If
+ there is any pending data to be written and
+ PeerConnectionClosed is on, then an SMC-R connection reset must
+ be performed. The connection then enters the AppCloseWait1
+ state on the passive close side waiting for the local
+ application to initiate its own close processing.
+
+ B. If the local application issues a shutdown() for writing, then
+ the SMC-R layer will send a CDC message to notify the partner
+ of the PeerDoneWriting indicator and then transition the local
+ side of the SMC-R connection to the AppCloseWait2 state.
+
+
+
+
+
+Fox, et al. Informational [Page 84]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ C. When the application issues a close() or shutdown() for both,
+ the local SMC-R peer will send a message informing the peer of
+ the PeerConnectionClosed indicator and transition to the Closed
+ state if the remote peer has also sent the local peer the
+ PeerConnectionClosed indicator. If the peer has not sent the
+ PeerConnectionClosed indicator, we transition into the
+ PeerFinCloseWait state.
+
+ D. The local SMC-R connection stays in this state until the peer
+ sends the PeerConnectionClosed indicator in a CDC message.
+ When the indicator is sent, we transition to the Closed state
+ and are then free to reuse this RMBE.
+
+ Note that each SMC-R peer needs to provide some logic that will
+ prevent being stranded in a termination state indefinitely. For
+ example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
+ state waiting for the remote SMC-R peer to update its connection
+ termination status, it needs to provide a timer that will prevent it
+ from waiting in that state indefinitely should the remote SMC-R peer
+ not respond to this termination request. This could occur in error
+ scenarios -- for example, if the remote SMC-R peer suffered a failure
+ prior to being able to respond to the termination request or the
+ remote application is not responding to this connection termination
+ request by closing its own socket. This latter scenario is similar
+ to the TCP FINWAIT2 state, which has been known to sometimes cause
+ issues when remote TCP/IP hosts lose track of established connections
+ and neglect to close them. Even though the TCP standards do not
+ mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
+ implementations assign a timeout for this state. A similar timeout
+ will be required for SMC-R connections. When this timeout occurs,
+ the local SMC-R peer performs TCP reset processing for this
+ connection. However, no additional RDMA writes to the partner RMBE
+ can occur at this point (we have already indicated that we are done
+ updating the peer's RMBE). After the TCP connection is reset, the
+ RMBE can be returned to the free pool for reallocation. See
+ Section 4.4.2 for more details.
+
+ Also note that it is possible to have two SMC-R endpoints initiate an
+ Active close concurrently. In that scenario, the flows above still
+ apply; however, both endpoints follow the active close path (path 3).
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 85]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+4.8.2. Abnormal SMC-R Connection Termination Flows
+
+ Abnormal SMC-R connection termination can occur for a variety of
+ reasons, including the following:
+
+ o The TCP connection associated with an SMC-R connection is reset.
+ In TCP, either endpoint can send a RST segment to abort an
+ existing TCP connection when error conditions are detected for the
+ connection or the application overtly requests that the connection
+ be reset.
+
+ o Normal SMC-R connection termination processing has unexpectedly
+ stalled for a given connection. When the stall is detected
+ (connection termination timeout condition), an abnormal SMC-R
+ connection termination flow is initiated.
+
+ In these scenarios, it is very important that resources associated
+ with the affected SMC-R connections are properly cleaned up to ensure
+ that there are no orphaned resources and that resources can reliably
+ be reused for new SMC-R connections. Given that SMC-R relies heavily
+ on the RDMA write processing, special care needs to be taken to
+ ensure that an RMBE is no longer being used by an SMC-R peer before
+ logically reassigning that RMBE to a new SMC-R connection.
+
+ When an SMC-R peer initiates a TCP connection reset, it also
+ initiates an SMC-R abnormal connection flow at the same time. The
+ SMC-R peers explicitly signal their intent to abnormally terminate an
+ SMC-R connection and await explicit acknowledgment that the peer has
+ received this notification and has also completed abnormal connection
+ termination on its end. Note that TCP connection reset processing
+ can occur in parallel to these flows.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 86]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ +-----------------+
+ |-------------->| CLOSED |<-------------|
+ | | | |
+ | +-----------------+ |
+ | |
+ | |
+ | |
+ | +-----------------------+ |
+ | | Any state | |
+ |1B | (before setting | 2B|
+ | | PeerConnectionClosed | |
+ | | indicator in | |
+ | | peer's RMBE) | |
+ | +-----------------------+ |
+ | 1A | | 2A |
+ | Active Abort | | Passive Abort |
+ | V V |
+ | +--------------+ +--------------+ |
+ |-------|PeerAbortWait | | Process Abort|------|
+ | | | |
+ +--------------+ +--------------+
+
+ Figure 23: SMC-R Abnormal Connection Termination State Diagram
+
+ Figure 23 above shows the SMC-R abnormal connection termination state
+ diagram:
+
+ 1. Active abort designates the SMC-R peer that is initiating the TCP
+ RST processing. At the time that the TCP RST is sent, the active
+ abort side must also do the following:
+
+ A. Send the PeerConnAbort indicator to the partner in a CDC
+ message, and then transition to the PeerAbortWait state.
+ During this state, it will monitor this SMC-R connection
+ waiting for the peer to send its corresponding PeerConnAbort
+ indicator but will ignore any other activity in this connection
+ (i.e., new incoming data). It will also generate an
+ appropriate error to any socket API calls issued against this
+ socket (e.g., ECONNABORTED, ECONNRESET).
+
+ B. Once the peer sends the PeerConnAbort indicator to the local
+ host, the local host can transition this SMC-R connection to
+ the Closed state and reuse this RMBE. Note that the SMC-R peer
+ that goes into the active abort state must provide some
+ protection against staying in that state indefinitely should
+ the remote SMC-R peer not respond by sending its own
+ PeerConnAbort indicator to the local host. While this should
+ be a rare scenario, it could occur if the remote SMC-R peer
+
+
+
+Fox, et al. Informational [Page 87]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ (passive abort) suffered a failure right after the local SMC-R
+ peer (active abort) sent the PeerConnAbort indicator. To
+ protect against these types of failures, a timer can be set
+ after entering the PeerAbortWait state, and if that timer pops
+ before the peer has sent its local PeerConnAbort indicator (to
+ the active abort side), this RMBE can be returned to the free
+ pool for possible reallocation. See Section 4.4.2 for more
+ details.
+
+ 2. Passive abort designates the SMC-R peer that is the recipient of
+ an SMC-R abort from the peer designated by the PeerConnAbort
+ indicator being sent by the peer in a CDC message. Upon receiving
+ this request, the local peer must do the following:
+
+ A. Using the appropriate error codes, indicate to the socket
+ application that this connection has been aborted, and then
+ purge all in-flight data for this connection that is waiting to
+ be read or waiting to be sent.
+
+ B. Send a CDC message to notify the peer of the PeerConnAbort
+ indicator and, once that is completed, transition this RMBE to
+ the Closed state.
+
+ If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
+ also initiates SMC-R abnormal connection termination processing if it
+ has not already been notified (via the PeerConnAbort indicator) that
+ the partner is severing the connection. It is possible to have two
+ SMC-R endpoints concurrently be in an active abort role for a given
+ connection. In that scenario, the flows above still apply but both
+ endpoints take the active abort path (path 1).
+
+4.8.3. Other SMC-R Connection Termination Conditions
+
+ The following are additional conditions that have implications for
+ SMC-R connection termination:
+
+ o An SMC-R peer being gracefully shut down. If an SMC-R peer
+ supports a graceful shutdown operation, it should attempt to
+ terminate all SMC-R connections as part of shutdown processing.
+ This could be accomplished via LLC DELETE LINK requests on all
+ active SMC-R links.
+
+ o Abnormal termination of an SMC-R peer. In this example, there may
+ be no opportunity for the host to perform any SMC-R cleanup
+ processing. In this scenario, it is up to the remote peer to
+ detect a RoCE communications failure with the failing host. This
+
+
+
+
+
+Fox, et al. Informational [Page 88]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ could trigger SMC-R link switchover, but that would also generate
+ RoCE errors, causing the remote host to eventually terminate all
+ existing SMC-R connections to this peer.
+
+ o Loss of RoCE connectivity between two SMC-R peers. If two peers
+ are no longer reachable across any links in their SMC-R link
+ group, then both peers perform a TCP reset for the connections,
+ generate an error to the local applications, and free up all QP
+ resources associated with the link group.
+
+5. Security Considerations
+
+5.1. VLAN Considerations
+
+ The concepts and access control of virtual LANs (VLANs) must be
+ extended to also cover the RoCE network traffic flowing across the
+ Ethernet.
+
+ The RoCE VLAN configuration and access permissions must mirror the IP
+ VLAN configuration and access permissions over the Converged Enhanced
+ Ethernet fabric. This means that hosts, routers, and switches that
+ have access to specific VLANs on the IP fabric must also have the
+ same VLAN access across the RoCE fabric. In other words, the SMC-R
+ connectivity will follow the same virtual network access permissions
+ as normal TCP/IP traffic.
+
+5.2. Firewall Considerations
+
+ As mentioned above, the RoCE fabric inherits the same VLAN
+ topology/access as the IP fabric. RoCE is a Layer 2 protocol that
+ requires both endpoints to reside in the same Layer 2 network (i.e.,
+ VLAN). RoCE traffic cannot traverse multiple VLANs, as there is no
+ support for routing RoCE traffic beyond a single VLAN. As a result,
+ SMC-R communications will also be confined to peers that are members
+ of the same VLAN. IP-based firewalls are typically inserted between
+ VLANs (or physical LANs) and rely on normal IP routing to insert
+ themselves in the data path. Since RoCE (and by extension SMC-R) is
+ not routable beyond the local VLAN, there is no ability to insert a
+ firewall in the network path of two SMC-R peers.
+
+5.3. Host-Based IP Filters
+
+ Because SMC-R maintains the TCP three-way handshake for connection
+ setup before switching to RoCE out of band, existing IP filters that
+ control connection setup flows remain effective in an SMC-R
+ environment. IP filters that operate on traffic flowing in an active
+ TCP connection are not supported, because the connection data does
+ not flow over IP.
+
+
+
+Fox, et al. Informational [Page 89]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+5.4. Intrusion Detection Services
+
+ Similar to IP filters, intrusion detection services that operate on
+ TCP connection setups are compatible with SMC-R with no changes
+ required. However, once the TCP connection has switched to RoCE out
+ of band, packets are not available for examination.
+
+5.5. IP Security (IPsec)
+
+ IP security is not compatible with SMC-R, because there are no IP
+ packets on which to operate. TCP connections that require IP
+ security must opt out of SMC-R.
+
+5.6. TLS/SSL
+
+ Transport Layer Security/Secure Socket Layer (TLS/SSL) is preserved
+ in an SMC-R environment. The TLS/SSL layer resides above the SMC-R
+ layer, and outgoing connection data is encrypted before being passed
+ down to the SMC-R layer for RDMA write. Similarly, incoming
+ connection data goes through the SMC-R layer encrypted and is
+ decrypted by the TLS/SSL layer as it is today.
+
+ The TLS/SSL handshake messages flow over the TCP connection after the
+ connection has switched to SMC-R, and so they are exchanged using
+ RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.
+
+6. IANA Considerations
+
+ The scarcity of TCP option codes available for assignment is
+ understood, and this architecture uses experimental TCP options
+ following the conventions of [RFC6994] ("Shared Use of Experimental
+ TCP Options").
+
+ TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment
+ Identifier. See Section 3.1.
+
+ If this protocol achieves wide acceptance, a discrete option code may
+ be requested by subsequent versions of this protocol.
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 90]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+7. Normative References
+
+ [RFC793] Postel, J., "Transmission Control Protocol", STD 7,
+ RFC 793, DOI 10.17487/RFC0793, September 1981,
+ <http://www.rfc-editor.org/info/rfc793>.
+
+ [RFC6994] Touch, J., "Shared Use of Experimental TCP Options",
+ RFC 6994, DOI 10.17487/RFC6994, August 2013,
+ <http://www.rfc-editor.org/info/rfc6994>.
+
+ [RoCE] InfiniBand, "RDMA over Converged Ethernet specification",
+ <https://cw.infinibandta.org/wg/Members/documentRevision/
+ download/7149>.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 91]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+Appendix A. Formats
+
+A.1. TCP Option
+
+ The SMC-R TCP option is formatted in accordance with [RFC6994]
+ ("Shared Use of Experimental TCP Options"). The ExID value is
+ IBM-1047 (EBCDIC) encoding for "SMCR".
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Kind = 254 | Length = 6 | x'E2' | x'D4' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 24: SMC-R TCP Option Format
+
+A.2. CLC Messages
+
+ The following rules apply to all CLC messages:
+
+ General rules on formats:
+
+ o Reserved fields must be set to zero and not validated.
+
+ o Each message has an eye catcher at the start and another
+ eye catcher at the end. These must both be validated by the
+ receiver.
+
+ o SMC version indicator: The only SMC-R version defined in this
+ architecture is version 1. In the future, if peers have a
+ mismatch of versions, the lowest common version number is used.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 92]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.2.1. Peer ID Format
+
+ All CLC messages contain a peer ID that uniquely identifies an
+ instance of a TCP/IP stack. This peer ID is required to be
+ universally unique across TCP/IP stacks and instances (including
+ restarts) of TCP/IP stacks.
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Instance ID | RoCE MAC (first 2 bytes) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | RoCE MAC (last 4 bytes) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 25: Peer ID Format
+
+ Instance ID
+
+ A 2-byte instance count that ensures that if the same RNIC MAC is
+ later used in the peer ID for a different TCP/IP stack -- for
+ example, if an RNIC is redeployed to another stack -- the values
+ are unique. It also ensures that if a TCP/IP stack is restarted,
+ the instance ID changes. The value is implementation defined,
+ with one suggestion being 2 bytes of the system clock.
+
+ RoCE MAC
+
+ The RoCE MAC address for one of the peer's RNICs. Note that in a
+ virtualized environment this will be the virtual MAC of one of the
+ peer's RNICs.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 93]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.2.2. SMC Proposal CLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 1 | Length |Version| Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Client's Peer ID -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- Client's preferred GID -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Client's preferred RoCE |
+ +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |Offset to mask/prefix area (0) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ . .
+ . Area for future growth .
+ . .
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | IPv4 Subnet Mask |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | IPv4 Mask Lgth| Reserved |Num IPv6 prfx |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ : :
+ : Array of IPv6 prefixes (variable length) :
+ : :
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 26: SMC Proposal CLC Message Format
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 94]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The fields present in the SMC Proposal CLC message are:
+
+ Eye catchers
+
+ Like all CLC messages, the SMC Proposal has beginning and ending
+ eye catchers to aid with verification and parsing. The hex digits
+ spell "SMCR" in IBM-1047 (EBCDIC).
+
+ Type
+
+ CLC message Type 1 indicates SMC Proposal.
+
+ Length
+
+ The length of this CLC message. If this is an IPv4 flow, this
+ value is 52. Otherwise, it is variable, depending upon how many
+ prefixes are listed.
+
+ Version
+
+ Version of the SMC-R protocol. Version 1 is the only currently
+ defined value.
+
+ Client's Peer ID
+
+ As described in Appendix A.2.1 above.
+
+ Client's preferred RoCE GID
+
+ The IPv6 address of the client's preferred RNIC on the RoCE
+ fabric.
+
+ Client's preferred RoCE MAC address
+
+ The MAC address of the client's preferred RNIC on the RoCE fabric.
+ It is required, as some operating systems do not have neighbor
+ discovery or ARP support for RoCE RNICs.
+
+ Offset to mask/prefix area
+
+ Provides the number of bytes that must be skipped after this
+ field, to access the IPv4 Subnet Mask field and the fields that
+ follow it. Allows for future growth of this signal. In this
+ version of the architecture, this value is always zero.
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 95]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Area for future growth
+
+ In this version of the architecture, this field does not exist.
+ This indicates where additional information may be inserted into
+ the signal in the future. The "Offset to mask/prefix area" field
+ must be used to skip over this area.
+
+ IPv4 Subnet Mask
+
+ If this message is flowing over an IPv4 TCP connection, the value
+ of the subnet mask associated with the interface over which the
+ client sent this message. If this is an IPv6 flow, this field is
+ all zeros.
+
+ This field, along with all fields that follow it in this signal,
+ must be accessed by skipping the number of bytes listed in the
+ "Offset to mask/prefix area" field after the end of that field.
+
+ IPv4 Mask Lgth
+
+ If this message is flowing over an IPv4 TCP connection, the number
+ of significant bits in the IPv4 Subnet Mask field. If this is an
+ IPv6 flow, this field is zero.
+
+ Num IPv6 prfx
+
+ If this message is flowing over an IPv6 TCP connection, the number
+ of IPv6 prefixes that follow, with a maximum value of 8. If this
+ is an IPv4 flow, this field is zero and is immediately followed by
+ the ending eye catcher.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 96]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Array of IPv6 prefixes
+
+ For IPv6 TCP connections, a list of the IPv6 prefixes associated
+ with the network over which the client sent this message, up to a
+ maximum of eight prefixes.
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ + +
+ | |
+ + IPv6 prefix value +
+ | |
+ + +
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Prefix Length |
+ +-+-+-+-+-+-+-+-+
+
+ Figure 27: Format for IPv6 Prefix Array Element
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 97]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.2.3. SMC Accept CLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 2 | Length = 68 |Version|F|Rsrvd|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Server's Peer ID -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- Server's RoCE GID -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Server's RoCE |
+ +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Server QP (bytes 1-2) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+
+ |Srvr QP byte 3 | Server RMB RKey (bytes 1-3) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |Srvr RMB byte 4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Srvr RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Server's RMB virtual address -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved | Server's initial packet sequence number |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 28: SMC Accept CLC Message Format
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 98]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The fields present in the SMC Accept CLC message are:
+
+ Eye catchers
+
+ Like all CLC messages, the SMC Accept has beginning and ending
+ eye catchers to aid with verification and parsing. The hex digits
+ spell "SMCR" in IBM-1047 (EBCDIC).
+
+ Type
+
+ CLC message Type 2 indicates SMC Accept.
+
+ Length
+
+ The SMC Accept CLC message is 68 bytes long.
+
+ Version
+
+ Version of the SMC-R protocol. Version 1 is the only currently
+ defined value.
+
+ F-bit
+
+ First contact flag: A 1-bit flag that indicates that the server
+ believes this TCP connection is the first SMC-R contact for this
+ link group.
+
+ Server's Peer ID
+
+ As described in Appendix A.2.1 above.
+
+ Server's RoCE GID
+
+ The IPv6 address of the RNIC that the server chose for this SMC-R
+ link.
+
+ Server's RoCE MAC address
+
+ The MAC address of the server's RNIC for the SMC-R link. It is
+ required, as some operating systems do not have neighbor discovery
+ or ARP support for RoCE RNICs.
+
+ Server's QP number
+
+ The number for the reliably connected queue pair that the server
+ created for this SMC-R link.
+
+
+
+
+
+Fox, et al. Informational [Page 99]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Server's RMB RKey
+
+ The RDMA RKey for the RMB that the server created or chose for
+ this TCP connection.
+
+ Server's RMB element index
+
+ Indexes which element within the server's RMB will represent this
+ TCP connection.
+
+ Server's RMB element alert token
+
+ A platform-defined, architecturally opaque token that identifies
+ this TCP connection. Added by the client as immediate data on
+ RDMA writes from the client to the server to inform the server
+ that there is data for this connection to retrieve from the
+ RMB element.
+
+ Bsize:
+
+ Server's RMB element buffer size in 4-bit compressed notation:
+ x = 4 bits. Actual buffer size value is (2^(x + 4)) * 1K.
+ Smallest possible value is 16K. Largest size supported by this
+ architecture is 512K.
+
+ MTU
+
+ An enumerated value indicating this peer's QP MTU size. The two
+ peers exchange their MTU values, and whichever value is smaller
+ will be used for the QP. This field should only be validated in
+ the first contact exchange.
+
+ The enumerated MTU values are:
+
+ 0: reserved
+
+ 1: 256
+
+ 2: 512
+
+ 3: 1024
+
+ 4: 2048
+
+ 5: 4096
+
+ 6-15: reserved
+
+
+
+
+Fox, et al. Informational [Page 100]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Server's RMB virtual address
+
+ The virtual address of the server's RMB as assigned by the
+ server's RNIC.
+
+ Server's initial packet sequence number
+
+ The starting packet sequence number that this peer will use when
+ sending to the other peer, so that the other peer can prepare its
+ QP for the sequence number to expect.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 101]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.2.4. SMC Confirm CLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 3 | Length = 68 |Version| Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Client's Peer ID -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- Client's RoCE GID -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Client's RoCE |
+ +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | Client QP (bytes 1-2) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+
+ |Clnt QP byte 3 | Client RMB RKey (bytes 1-3) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |Clnt RMB byte 4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Clnt RMB alert tkn (bytes 3-4)|Bsize | MTU | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Client's RMB Virtual Address -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved | Client's initial packet sequence number |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 29: SMC Confirm CLC Message Format
+
+ The SMC Confirm CLC message is nearly identical to the SMC Accept,
+ except that it contains client information and lacks a first contact
+ flag.
+
+
+
+
+
+
+Fox, et al. Informational [Page 102]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The fields present in the SMC Confirm CLC message are:
+
+ Eye catchers
+
+ Like all CLC messages, the SMC Confirm has beginning and ending
+ eye catchers to aid with verification and parsing. The hex digits
+ spell "SMCR" in IBM-1047 (EBCDIC).
+
+ Type
+
+ CLC message Type 3 indicates SMC Confirm.
+
+ Length
+
+ The SMC Confirm CLC message is 68 bytes long.
+
+ Version
+
+ Version of the SMC-R protocol. Version 1 is the only currently
+ defined value.
+
+ Client's Peer ID
+
+ As described in Appendix A.2.1 above.
+
+ Client's RoCE GID
+
+ The IPv6 address of the RNIC that the client chose for this SMC-R
+ link.
+
+ Client's RoCE MAC address
+
+ The MAC address of the client's RNIC for the SMC-R link. It is
+ required, as some operating systems do not have neighbor discovery
+ or ARP support for RoCE RNICs.
+
+ Client's QP number
+
+ The number for the reliably connected queue pair that the client
+ created for this SMC-R link.
+
+ Client's RMB RKey
+
+ The RDMA RKey for the RMB that the client created or chose for
+ this TCP connection.
+
+
+
+
+
+
+Fox, et al. Informational [Page 103]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Client's RMB element index
+
+ Indexes which element within the client's RMB will represent this
+ TCP connection.
+
+ Client's RMB element alert token
+
+ A platform-defined, architecturally opaque token that identifies
+ this TCP connection. Added by the server as immediate data on
+ RDMA writes from the server to the client to inform the client
+ that there is data for this connection to retrieve from the
+ RMB element.
+
+ Bsize:
+
+ Client's RMB element buffer size in 4-bit compressed notation:
+ x = 4 bits. Actual buffer size value is (2^(x + 4)) * 1K.
+ Smallest possible value is 16K. Largest size supported by this
+ architecture is 512K.
+
+ MTU
+
+ An enumerated value indicating this peer's QP MTU size. The two
+ peers exchange their MTU values, and whichever value is smaller
+ will be used for the QP. The values are enumerated in
+ Appendix A.2.3. This value should only be validated in the first
+ contact exchange.
+
+ Client's RMB Virtual Address
+
+ The virtual address of the client's RMB as assigned by the
+ server's RNIC.
+
+ Client's initial packet sequence number
+
+ The starting packet sequence number that this peer will use when
+ sending to the other peer, so that the other peer can prepare its
+ QP for the sequence number to expect.
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 104]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.2.5. SMC Decline CLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 4 | Length = 28 |Version|S|Rsrvd|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Sender's Peer ID -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Peer Diagnosis Information |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | x'E2' | x'D4' | x'C3' | x'D9' |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 30: SMC Decline CLC Message Format
+
+ The fields present in the SMC Decline CLC message are:
+
+ Eye catchers
+
+ Like all CLC messages, the SMC Decline has beginning and ending
+ eye catchers to aid with verification and parsing. The hex digits
+ spell "SMCR" in IBM-1047 (EBCDIC).
+
+ Type
+
+ CLC message Type 4 indicates SMC Decline.
+
+ Length
+
+ The SMC Decline CLC message is 28 bytes long.
+
+ Version
+
+ Version of the SMC-R protocol. Version 1 is the only currently
+ defined value.
+
+ S-bit
+
+ Sync Bit. Indicates that the link group is out of sync and the
+ receiving peer must clean up its representation of the link group.
+
+
+
+
+Fox, et al. Informational [Page 105]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Sender's Peer ID
+
+ As described in Appendix A.2.1 above.
+
+ Peer Diagnosis Information
+
+ 4 bytes of diagnosis information provided by the peer. These
+ values are defined by the individual peers, and it is necessary to
+ consult the peer's system documentation to interpret the results.
+
+A.3. LLC Messages
+
+ LLC messages are sent over an existing SMC-R link using RoCE SendMsg
+ and are always 44 bytes long so that they fit into the space
+ available in a single WQE without requiring the receiver to post
+ receive buffers. If all 44 bytes are not needed, they are padded out
+ with zeros. LLC messages are in a request/response format. The
+ message type is the same for request and response, and a flag
+ indicates whether a message is flowing as a request or a response.
+
+ The two high-order bits of an LLC message opcode indicate how it is
+ to be handled by a peer that does not support the opcode.
+
+ If the high-order bits of the opcode are b'00', then the peer must
+ support the LLC message and indicate a protocol error if it does not.
+
+ If the high-order bits of the opcode are b'10', then the peer must
+ silently discard the LLC message if it does not support the opcode.
+ This requirement is included to allow for toleration of advanced, but
+ optional, functionality.
+
+ High-order bits of b'11' indicate a Connection Data Control (CDC)
+ message as described in Appendix A.4.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 106]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.1. CONFIRM LINK LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 1 | Length = 44 | Reserved |R| Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Sender's RoCE |
+ +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
+ | |
+ +- -+
+ | Sender's RoCE GID |
+ +- -+
+ | |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |Sender's QP number, bytes 1-2 |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |Sender QP byte3| Link number |Sender's link userID, bytes 1-2|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |Sender's link userID, bytes 3-4| Max links | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- Reserved -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 31: CONFIRM LINK LLC Message Format
+
+ The CONFIRM LINK LLC message is required to be exchanged between the
+ server and client over a newly created SMC-R link to complete the
+ setup of an SMC-R link. Its purpose is to confirm that the RoCE path
+ is actually usable.
+
+ On first contact, this message flows after the server receives the
+ SMC Confirm CLC message from the client over the IP connection. For
+ additional links added to an SMC-R link group, it flows after the
+ ADD LINK and ADD LINK CONTINUATION exchange. This flow provides
+ confirmation that the queue pair is in fact usable. Each peer echoes
+ its RoCE information back to the other.
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 107]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The contents of the CONFIRM LINK LLC message are:
+
+ Type
+
+ Type 1 indicates CONFIRM LINK.
+
+ Length
+
+ The CONFIRM LINK LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a CONFIRM LINK
+ reply.
+
+ Sender's RoCE MAC address
+
+ The MAC address of the sender's RNIC for the SMC-R link. It is
+ required, as some operating systems do not have neighbor discovery
+ or ARP support for RoCE RNICs.
+
+ Sender's RoCE GID
+
+ The IPv6 address of the RNIC that the sender is using for this
+ SMC-R link.
+
+ Sender's QP number
+
+ The number for the reliably connected queue pair that the sender
+ created for this SMC-R link.
+
+ Link number
+
+ An identifier assigned by the server that uniquely identifies the
+ link within the link group. This identifier is ONLY unique within
+ a link group. Provided by the server and echoed back by the
+ client.
+
+ Link user ID
+
+ An opaque, implementation-defined identifier assigned by the
+ sender and provided to the receiver solely for purposes of
+ display, diagnosis, network management, etc. The link user ID
+ should be unique across the sender's entire software space,
+ including all other link groups.
+
+
+
+
+
+
+Fox, et al. Informational [Page 108]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Max links
+
+ The maximum number of links the sender can support in a link
+ group. The maximum for this link group is the smaller of the
+ values provided by the two peers.
+
+A.3.2. ADD LINK LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 2 | Length = 44 | Rsrvd |RsnCode|R|Z| Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Sender's RoCE |
+ +- MAC address +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
+ | |
+ +- -+
+ | Sender's RoCE GID |
+ +- -+
+ | |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |Sender's QP number, bytes 1-2 |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |Sender QP byte3| Link number |Rsrvd | MTU |Initial PSN |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Initial PSN (continued) | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
+ | Reserved |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 32: ADD LINK LLC Message Format
+
+ The ADD LINK LLC message is sent over an existing link in the link
+ group when a peer wishes to add an SMC-R link to an existing SMC-R
+ link group. It is sent by the server to add a new SMC-R link to the
+ group, or by the client to request that the server add a new link --
+ for example, when a new RNIC becomes active. When sent from the
+ client to the server, it represents a request that the server
+ initiate an ADD LINK exchange.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 109]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ This message is sent immediately after the initial SMC-R link in the
+ group completes, as described in Section 3.5.1 ("First Contact"). It
+ can also be sent over an existing SMC-R link group at any time as new
+ RNICs are added and become available. Therefore, there can be as few
+ as one new RMB RToken to be communicated, or several. RTokens will
+ be communicated using ADD LINK CONTINUATION messages.
+
+ The contents of the ADD LINK LLC message are:
+
+ Type
+
+ Type 2 indicates ADD LINK.
+
+ Length
+
+ The ADD LINK LLC message is 44 bytes long.
+
+ RsnCode
+
+ If the Z (rejection) flag is set, this field provides the reason
+ code. Values can be:
+
+ X'1' - no alternate path available: set when the server
+ provides the same MAC/GID as an existing SMC-R link in
+ the group, and the client does not have any additional
+ RNICs available (i.e., the server is attempting to set
+ up an asymmetric link but none is available).
+
+ X'2' - Invalid MTU value specified.
+
+ R
+
+ Reply flag. When set, indicates that this is an ADD LINK reply.
+
+ Z
+
+ Rejection flag. When set on reply, indicates that the server's
+ ADD LINK was rejected by the client. When this flag is set, the
+ reason code will also be set.
+
+ Sender's RoCE MAC address
+
+ The MAC address of the sender's RNIC for the new SMC-R link. It
+ is required, as some operating systems do not have neighbor
+ discovery or ARP support for RoCE RNICs.
+
+
+
+
+
+
+Fox, et al. Informational [Page 110]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Sender's RoCE GID
+
+ The IPv6 address of the RNIC that the sender is using for the new
+ SMC-R link.
+
+ Sender's QP number
+
+ The number for the reliably connected queue pair that the sender
+ created for the new SMC-R link.
+
+ Link number
+
+ An identifier for the new SMC-R link. This is assigned by the
+ server and uniquely identifies the link within the link group.
+ This identifier is ONLY unique within a link group. Provided by
+ the server and echoed back by the client.
+
+ MTU
+
+ An enumerated value indicating this peer's QP MTU size. The two
+ peers exchange their MTU values, and whichever value is smaller
+ will be used for the QP. The values are enumerated in
+ Appendix A.2.3.
+
+ Initial PSN
+
+ The starting packet sequence number (PSN) that this peer will use
+ when sending to the other peer, so that the other peer can prepare
+ its QP for the sequence number to expect.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 111]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.3. ADD LINK CONTINUATION LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 3 | Length = 44 | Reserved |R| Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Linknum | NumRTokens | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- RKey/RToken pair -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- RKey/RToken pair or zeros -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 33: ADD LINK CONTINUATION LLC Message Format
+
+ When a new SMC-R link is added to an SMC-R link group, it is
+ necessary to communicate the new link's RTokens for the RMBs that the
+ SMC-R link group can access. This message follows the ADD LINK and
+ provides the RTokens.
+
+ The server kicks off this exchange by sending the first ADD LINK
+ CONTINUATION LLC message, and the server controls the exchange as
+ described below.
+
+ o If the client and the server require the same number of ADD LINK
+ CONTINUATION messages to communicate their RTokens, the server
+ starts the exchange by sending the first ADD LINK CONTINUATION
+ request to the client with its (the server's) RTokens. The client
+ then responds with an ADD LINK CONTINUATION response with its
+ RTokens, and so on until the exchange is completed.
+
+
+
+
+
+
+Fox, et al. Informational [Page 112]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ o If the server requires more ADD LINK CONTINUATION messages than
+ the client, then after the client has communicated all of its
+ RTokens, the server continues to send ADD LINK CONTINUATION
+ request messages to the client. The client continues to respond,
+ using empty (number of RTokens to be communicated = 0) ADD LINK
+ CONTINUATION response messages.
+
+ o If the client requires more ADD LINK CONTINUATION messages than
+ the server, then after communicating all of its RTokens, the
+ server will continue to send empty ADD LINK CONTINUATION messages
+ to the client to solicit replies with the client's RTokens, until
+ all have been communicated.
+
+ The contents of the ADD LINK CONTINUATION LLC message are:
+
+ Type
+
+ Type 3 indicates ADD LINK CONTINUATION.
+
+ Length
+
+ The ADD LINK CONTINUATION LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is an ADD LINK
+ CONTINUATION reply.
+
+ LinkNum
+
+ The link number of the new link within the SMC-R link group for
+ which RKeys are being communicated.
+
+ NumRTokens
+
+ Number of RTokens remaining to be communicated (including the ones
+ in this message). If the value is less than or equal to 2, this
+ is the last message. If it is greater than 2, another
+ continuation message will be required, and its value will be the
+ value in this message minus 2, and so on until all RKeys are
+ communicated. The maximum value for this field is 255.
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 113]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ RKey/RToken pairs (two or less)
+
+ These consist of an RKey for an RMB that is known on the SMC-R
+ link over which this message was sent (the reference RKey), paired
+ with the same RMB's RToken over the new SMC-R link. A full RToken
+ is not required for the reference, because it is only being used
+ to distinguish which RMB it applies to, not address it.
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reference RKey |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | New RKey |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- New Virtual Address -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 34: RKey/RToken Pair Format
+
+ The contents of the RKey/RToken pair are:
+
+ Reference RKey
+
+ The RKey of the RMB as it is already known on the SMC-R link over
+ which this message is being sent. Required so that the peer knows
+ with which RMB to associate the new RToken.
+
+ New RKey
+
+ The RKey of this RMB as it is known over the new SMC-R link.
+
+ New Virtual Address
+
+ The virtual address of this RMB as it is known over the new
+ SMC-R link.
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 114]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.4. DELETE LINK LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 4 | Length = 44 | Reserved |R|A|O| Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Linknum | reason code (bytes 1-3) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |RsnCode byte 4 | |
+ +-+-+-+-+-+-+-+-+ -+
+ | |
+ +- -+
+ | |
+ +- -+
+ | |
+ +- Reserved -+
+ | |
+ +- -+
+ | |
+ +- -+
+ | |
+ +- -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 35: DELETE LINK LLC Message Format
+
+ When the client or server detects that a QP or SMC-R link goes down
+ or needs to come down, it sends this message over one of the other
+ links in the link group.
+
+ When the DELETE LINK is sent from the client, it only serves as a
+ notification, and the client expects the server to respond by sending
+ a DELETE LINK request. To avoid races, only the server will initiate
+ the actual DELETE LINK request and response sequence that results
+ from notification from the client.
+
+ The server can also initiate the DELETE LINK without notification
+ from the client if it detects an error or if orderly link termination
+ was initiated.
+
+ The client may also request termination of the entire link group, and
+ the server may terminate the entire link group using this message.
+
+
+
+
+
+Fox, et al. Informational [Page 115]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The contents of the DELETE LINK LLC message are:
+
+ Type
+
+ Type 4 indicates DELETE LINK.
+
+ Length
+
+ The DELETE LINK LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a DELETE LINK reply.
+
+ A
+
+ "All" flag. When set, indicates that all links in the link group
+ are to be terminated. This terminates the link group.
+
+ O
+
+ Orderly flag. Indicates orderly termination. Orderly termination
+ is generally caused by an operator command rather than an error on
+ the link. When the client requests orderly termination, the
+ server may wait to complete other work before terminating.
+
+ LinkNum
+
+ The link number of the link to be terminated. If the A flag is
+ set, this field has no meaning and is set to 0.
+
+ RsnCode
+
+ The termination reason code. Currently defined reason codes are:
+
+ Request reason codes:
+
+ X'00010000' = Lost path
+
+ X'00020000' = Operator initiated termination
+
+ X'00030000' = Program initiated termination (link inactivity)
+
+ X'00040000' = LLC protocol violation
+
+ X'00050000' = Asymmetric link no longer needed
+
+
+
+
+
+Fox, et al. Informational [Page 116]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Response reason code:
+
+ X'00100000' = Unknown link ID (no link)
+
+A.3.5. CONFIRM RKEY LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 6 | Length = 44 | Reserved |R|0|Z|C|Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | NumTkns | New RMB RKey for this link (bytes 1-3) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |ThisLink byte 4| |
+ +-+-+-+-+-+-+-+-+ -+
+ | New RMB virtual address for this link |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+ -+
+ | |
+ +- Other link RMB specification or zeros -+
+ | |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
+ | |
+ +- -+
+ | Other link RMB specification or zeros |
+ +- +-+-+-+-+-+-+-+-+
+ | | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 36: CONFIRM RKEY LLC Message Format
+
+ The CONFIRM RKEY flow can be sent at any time from either the client
+ or the server, to inform the peer that an RMB has been created or
+ deleted. The creator of a new RMB must inform its peer of the new
+ RMB's RToken for all SMC-R links in the SMC-R link group.
+
+ For RMB creation, the creator sends this message over the SMC-R link
+ that the first TCP connection that uses the new RMB is using. This
+ message contains the new RMB RToken for the SMC-R link over which
+ the message is sent. It then lists the sender's SMC-R links in the
+ link group paired with the new RToken for the new RMB for that link.
+ This message can communicate the new RTokens for three QPs: the QP
+ for the link over which this message is sent, and two others. If
+ there are more than three links in the SMC-R link group, a
+ CONFIRM RKEY CONTINUATION will be required.
+
+
+
+Fox, et al. Informational [Page 117]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The peer responds by simply echoing the message with the response
+ flag set. If the response is a negative response, the sender must
+ recalculate the RToken set and start a new CONFIRM RKEY exchange from
+ the beginning. The timing of this retry is controlled by the C flag,
+ as described below.
+
+ The contents of the CONFIRM RKEY LLC message are:
+
+ Type
+
+ Type 6 indicates CONFIRM RKEY.
+
+ Length
+
+ The CONFIRM RKEY LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a CONFIRM RKEY
+ reply.
+
+ 0
+
+ Reserved bit.
+
+ Z
+
+ Negative response flag.
+
+ C
+
+ Configuration Retry bit. If this is a negative response and this
+ flag is set, the originator should recalculate the RKey set and
+ retry this exchange as soon as the current configuration change is
+ completed. If this flag is not set on a negative response, the
+ originator must wait for the next natural stimulus (for example, a
+ new TCP connection started that requires a new RMB) before
+ retrying.
+
+ NumTkns
+
+ The number of other link/RToken pairs, including those provided in
+ this message, to be communicated. Note that this value does not
+ include the RToken for the link on which this message was sent
+ (i.e., the maximum value is 2). If this value is 3 or less, this
+ is the only message in the exchange. If this value is greater
+ than 3, a CONFIRM RKEY CONTINUATION message will be required.
+
+
+
+
+Fox, et al. Informational [Page 118]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Note: In this version of the architecture, eight is the maximum
+ number of links supported in a link group.
+
+ New RMB RKey for this link
+
+ The new RMB's RKey as assigned on the link over which this message
+ is being sent.
+
+ New RMB virtual address for this link
+
+ The new RMB's virtual address as assigned on the link over which
+ this message is being sent.
+
+ Other link RMB specification
+
+ The new RMB's specification on the other links in the link group,
+ as shown in Figure 37.
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Link number | RMB's RKey for the specified link (bytes 1-3) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |New RKey byte 4| |
+ +-+-+-+-+-+-+-+-+ -+
+ | RMB's virtual address for the specified link |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +-+-+-+-+-+-+-+-+
+
+ Figure 37: Format of Link Number/RKey Pairs
+
+ Link number
+
+ The link number for a link in the link group.
+
+ RMB's RKey for the specified link
+
+ The RKey used to reach the RMB over the link whose number was
+ specified in the Link number field.
+
+ RMB's virtual address for the specified link
+
+ The virtual address used to reach the RMB over the link whose
+ number was specified in the Link number field.
+
+
+
+
+
+
+Fox, et al. Informational [Page 119]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 8 | Length = 44 | Reserved |R|0|Z| Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | NumTknsLeft | |
+ +-+-+-+-+-+-+-+-+ -+
+ | |
+ +- Other link RMB specification -+
+ | |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+ -+
+ | |
+ +- Other link RMB specification or zeros -+
+ | |
+ +- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -+
+ | |
+ +- -+
+ | Other link RMB specification or zeros |
+ +- +-+-+-+-+-+-+-+-+
+ | | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 38: CONFIRM RKEY CONTINUATION LLC Message Format
+
+ The CONFIRM RKEY CONTINUATION LLC message is used to communicate any
+ additional RMB RTokens that did not fit into the CONFIRM RKEY
+ message. Each of these messages can hold up to three RMB RTokens.
+ The NumTknsLeft field indicates how many RMB RTokens are to be
+ communicated, including the ones in this message. If the value is 3
+ or less, this is the last message of the group. If the value is 4 or
+ higher, additional CONFIRM RKEY CONTINUATION messages will follow,
+ and the NumTknsLeft value will be a countdown until all are
+ communicated.
+
+ Like the CONFIRM RKEY message, the peer responds by echoing the
+ message back with the reply flag set.
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 120]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The contents of the CONFIRM RKEY CONTINUATION LLC message are:
+
+ Type
+
+ Type 8 indicates CONFIRM RKEY CONTINUATION.
+
+ Length
+
+ The CONFIRM RKEY CONTINUATION LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a CONFIRM RKEY
+ CONTINUATION reply.
+
+ 0
+
+ Reserved bit.
+
+ Z
+
+ Negative response flag.
+
+ NumTknsLeft
+
+ The number of link/RToken pairs, including those provided in this
+ message, that are remaining to be communicated. If this value is
+ 3 or less, this is the last message in the exchange. If this
+ value is greater than 3, another CONFIRM RKEY CONTINUATION message
+ will be required. Note that in this version of the architecture,
+ eight is the maximum number of links supported in a link group.
+
+ Other link RMB specification
+
+ The new RMB's specification on other links in the link group, as
+ shown in Figure 37.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 121]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.7. DELETE RKEY LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 9 | Length = 44 | Reserved |R|0|Z| Rsrvd |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Count | Error Mask | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | First deleted RKey |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Second deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Third deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Fourth deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Fifth deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Sixth deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Seventh deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Eighth deleted RKey or zeros |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 39: DELETE RKEY LLC Message Format
+
+ The DELETE RKEY flow can be sent at any time from either the client
+ or the server, to inform the peer that one or more RMBs have been
+ deleted. Because the peer already knows every RMB's RKey on each
+ link in the link group, this message only specifies one RKey for each
+ RMB being deleted. The RKey provided for each deleted RMB will be
+ its RKey as known on the SMC-R link over which this message is sent.
+
+ It is not necessary to provide the entire RToken. The RKey alone is
+ sufficient for identifying an existing RMB.
+
+ The peer responds by simply echoing the message with the response
+ flag set. If the peer did not recognize an RKey, a negative response
+ flag will be set; however, no aggressive recovery action beyond
+ logging the error will be taken.
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 122]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The contents of the DELETE RKEY LLC message are:
+
+ Type
+
+ Type 9 indicates DELETE RKEY.
+
+ Length
+
+ The DELETE RKEY LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a DELETE RKEY reply.
+
+ 0
+
+ Reserved bit.
+
+ Z
+
+ Negative response flag.
+
+ Count
+
+ Number of RMBs being deleted by this message. Maximum value is 8.
+
+ Error Mask
+
+ If this is a negative response, indicates which RMBs were not
+ successfully deleted. Each bit corresponds to a listed RMB; for
+ example, b'01010000' indicates that the second and fourth RKeys
+ weren't successfully deleted.
+
+ Deleted RKeys
+
+ A list of Count RKeys. Provided on the request flow and echoed
+ back on the response flow. Each RKey is valid on the link over
+ which this message is sent and represents a deleted RMB. Up to
+ eight RMBs can be deleted in this message.
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 123]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+A.3.8. TEST LINK LLC Message Format
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = 7 | Length = 44 | Reserved |R| Reserved |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- User Data -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ +- -+
+ | |
+ +- -+
+ | Reserved |
+ +- -+
+ | |
+ +- -+
+ | |
+ +- -+
+ | |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 40: TEST LINK LLC Message Format
+
+ The TEST LINK request can be sent from either peer to the other on an
+ existing SMC-R link at any time to test that the SMC-R link is active
+ and healthy at the software level. A peer that receives a TEST LINK
+ LLC message immediately sends back a TEST LINK reply, echoing back
+ the user data. Refer also to Section 4.5.3 ("TCP Keepalive
+ Processing").
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 124]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The contents of the TEST LINK LLC message are:
+
+ Type
+
+ Type 7 indicates TEST LINK.
+
+ Length
+
+ The TEST LINK LLC message is 44 bytes long.
+
+ R
+
+ Reply flag. When set, indicates that this is a TEST LINK reply.
+
+ User Data
+
+ The receiver of this message echoes the sender's data back in a
+ TEST LINK response LLC message.
+
+A.4. Connection Data Control (CDC) Message Format
+
+ The RMBE control data is communicated using Connection Data Control
+ (CDC) messages, which use RoCE SendMsg, similar to LLC messages.
+ Also, as with LLC messages, CDC messages are 44 bytes long to ensure
+ that they can fit into private data areas of receive WQEs without
+ requiring the receiver to post receive buffers.
+
+ Unlike LLC messages, this data is integral to the data path, so its
+ processing must be prioritized and optimized similarly to other data
+ path processing. While LLC messages may be processed on a slower
+ path than data, these messages cannot be.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 125]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ 0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Type = x'FE' | Length = 44 | Sequence number |
+ 4 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | SMC-R alert token |
+ 8 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved | Producer cursor wrap seqno |
+ 12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Producer Cursor |
+ 16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Reserved | Consumer cursor wrap seqno |
+ 20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Consumer Cursor |
+ 24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |B|P|U|R|F|Rsrvd|D|C|A| Reserved |
+ 28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | |
+ 32 +- -+
+ | |
+ 36 +- Reserved -+
+ | |
+ 40 +- -+
+ | |
+ 44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 41: Connection Data Control (CDC) Message Format
+
+ Type = x'FE'
+
+ This type number has the two high-order bits turned on to enable
+ processing to quickly distinguish it from an LLC message.
+
+ Length = 44
+
+ The length of inline data that does not require the posting of a
+ receive buffer.
+
+ Sequence number
+
+ A 2-byte unsigned integer that represents a wrapping sequence
+ number. The initial value is 1, and this value can wrap to 0.
+ Incremented with every control message sent, except for the
+ failover data validation message, and used to guard against
+ processing an old control message out of sequence. Also used in
+ failover data validation. In normal usage, if this number is less
+
+
+
+
+
+Fox, et al. Informational [Page 126]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ than the last received value, discard this message. If greater,
+ process this message. Old control messages can be lost with no
+ ill effect but cannot be processed after newer ones.
+
+ If this is a failover validation CDC message (F flag set), then
+ the receiver must verify that it has received and fully processed
+ the RDMA write that was described by the CDC message with the
+ sequence number in this message. If not, the TCP connection must
+ be reset to guard against data loss. Details of this processing
+ are provided in Section 4.6.1.
+
+ SMC-R alert token
+
+ The endpoint-assigned alert token that identifies to which TCP
+ connection on the link group this control message refers.
+
+ Producer cursor wrap seqno
+
+ A 2-byte unsigned integer that represents a wrapping counter
+ incremented by the producer whenever the data written into this
+ RMBE receive buffer causes a wrap (i.e., the producer cursor
+ wraps). This is used by the receiver to determine when new data
+ is available even though the cursors appear unchanged, such as
+ when a full window size write is completed (producer cursor of
+ this RMBE sent by peer = local consumer cursor) or in scenarios
+ where the producer cursor sent for this RMBE < local consumer
+ cursor.
+
+ Producer Cursor
+
+ A 4-byte unsigned integer that is a wrapping offset into the RMBE
+ data area. Points to the next byte of data to be written by the
+ sender. Can advance up to the receiver's consumer cursor as known
+ by the sender. When the urgent data present indicator is on,
+ points 1 byte beyond the last byte of urgent data. When computing
+ this cursor, the presence of the eye catcher in the RMBE data area
+ must be accounted for. The first writable data location in the
+ RMBE is at offset 4, so this cursor begins at 4 and wraps to 4.
+
+ Consumer cursor wrap seqno
+
+ A 2-byte unsigned integer that mirrors the value of the producer
+ cursor wrap sequence number when the last read from this RMBE
+ occurred. Used as an indicator of how far along the consumer is
+ in reading data (i.e., processed last wrap point or not). The
+ producer side can use this indicator to detect whether or not more
+ data can be written to the partner in full window write scenarios
+ (where the producer cursor = consumer cursor as known on the
+
+
+
+Fox, et al. Informational [Page 127]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ remote RMBE). In this scenario, if the consumer sequence number
+ equals the local producer sequence number, the producer knows that
+ more data can be written.
+
+ Consumer Cursor
+
+ A 4-byte unsigned integer that is a wrapping offset into the
+ sender's RMBE data area. Points to the offset of the next byte of
+ data to be consumed by the peer in its own RMBE. When computing
+ this cursor, the presence of the eye catcher in the RMBE data area
+ must be accounted for. The first writable data location in the
+ RMBE is at offset 4, so this cursor begins at 4 and wraps to 4.
+ The sender cannot write beyond this cursor into the peer's RMBE
+ without causing data loss.
+
+ B-bit
+
+ Writer blocked indicator: Sender is blocked for writing. If this
+ bit is set, sender will require explicit notification when receive
+ buffer space is available.
+
+ P-bit
+
+ Urgent data pending: Sender has urgent data pending for this
+ connection.
+
+ U-bit
+
+ Urgent data present: Indicates that urgent data is present in the
+ RMBE data area, and the producer cursor points to 1 byte beyond
+ the last byte of urgent data.
+
+ R-bit
+
+ Request for consumer cursor update: Indicates that an immediate
+ consumer cursor update is requested, regardless of whether or not
+ one is warranted according to the window size optimization
+ algorithm described in Section 4.5.1.
+
+ F-bit
+
+ Failover validation indicator: Sent by a peer to guard against
+ data loss during failover when the TCP connection is being moved
+ to another SMC-R link in the link group. When this bit is set,
+ the only other fields in the CDC message that are significant are
+ the Type, Length, SMC-R alert token, and Sequence number fields.
+ The receiver must validate that it has fully processed the RDMA
+ write described by the previous CDC message bearing the same
+
+
+
+Fox, et al. Informational [Page 128]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ sequence number as this validation message. If it has, no further
+ action is required. If it has not, the TCP connection must be
+ reset. This processing is described in detail in Section 4.6.1.
+
+ D-bit
+
+ Sending done indicator: Sent by a peer when it is done writing new
+ data into the receiver's RMBE data area.
+
+ C-bit
+
+ PeerConnectionClosed indicator: Sent by a peer when it is
+ completely done with this connection and will no longer be making
+ any updates to the receiver's RMBE or sending any more control
+ messages.
+
+ A-bit
+
+ Abnormal close indicator: Sent by a peer when the connection is
+ abnormally terminated (for example, the TCP connection was reset).
+ When sent, it indicates that the peer is completely done with this
+ connection and will no longer be making any updates to this RMBE
+ or sending any more control messages. It also indicates that the
+ RMBE owner must flush any remaining data on this connection and
+ generate an error return code to any outstanding socket APIs on
+ this connection (same processing as receiving a RST segment on a
+ TCP connection).
+
+Appendix B. Socket API Considerations
+
+ A key design goal for SMC-R is to require no application changes for
+ exploitation. It is confined to socket applications using stream
+ (i.e., TCP) sockets over IPv4 or IPv6. By virtue of the fact that
+ the switch to the SMC-R protocol occurs after a TCP connection is
+ established, no changes are required in a socket address family or in
+ the IP addresses and ports that the socket applications are using.
+ Existing socket APIs that allow applications to retrieve local and
+ remote socket address structures for an established TCP connection
+ (for example, getsockname() and getpeername()) will continue to
+ function as they have before. Existing DNS setup and APIs for
+ resolving hostnames to IP addresses and vice versa also continue to
+ function without any changes. In general, all of the usual socket
+ APIs that are used for TCP communications (send APIs, recv APIs,
+ etc.) will continue to function as they do today, even if SMC-R is
+ used as the underlying protocol.
+
+
+
+
+
+
+Fox, et al. Informational [Page 129]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Each SMC-R-enabled implementation does, however, need to pay special
+ attention to any socket APIs that have a reliance on the underlying
+ TCP and IP protocols and also ensure that their behavior in an SMC-R
+ environment is reasonable and minimizes impact on the application.
+ While the basic socket API set is fairly similar across different
+ operating systems, there is more variability when it comes to
+ advanced socket API options. Each implementation needs to perform a
+ detailed analysis of its API options, any possible impact that SMC-R
+ may have, and any resultant implications. As part of that step, a
+ discussion or review with other implementations supporting SMC-R
+ would be useful to ensure consistent implementation.
+
+B.1. setsockopt() / getsockopt() Considerations
+
+ These APIs allow socket applications to manipulate socket, transport
+ (TCP/UDP), and IP-level options associated with a given socket.
+ Typically, a platform restricts the number of IP options available to
+ stream (TCP) socket applications, given their connection-oriented
+ nature. The general guideline here is to continue processing these
+ APIs in a manner that allows for application compatibility. Some
+ options will be relevant to the SMC-R protocol and will require
+ special processing "under the covers". For example, the ability to
+ manipulate TCP send and receive buffer sizes is still valid for
+ SMC-R. However, other options may have no meaning for SMC-R. For
+ example, if an application enabled the TCP_NODELAY socket option to
+ disable Nagle's algorithm, it should have no real effect on SMC-R
+ communications, as there is no notion of Nagle's algorithm with this
+ new protocol. But the implementation must accept the TCP_NODELAY
+ option as it does today and save it so that it can be later extracted
+ via getsockopt() processing. Note that any TCP or IP-level options
+ will still have an effect on any TCP/IP packets flowing for an SMC-R
+ connection (i.e., as part of TCP/IP connection establishment and
+ TCP/IP connection termination packet flows).
+
+ Under the covers, manipulation of the TCP options will also include
+ the SMC-layer setting, as well as reading the SMC-R experimental
+ option before and after completion of the three-way TCP handshake.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 130]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+Appendix C. Rendezvous Error Scenarios
+
+ This section discusses error scenarios for setting up and managing
+ SMC-R links.
+
+C.1. SMC Decline during CLC Negotiation
+
+ A peer to the SMC-R CLC negotiation can send an SMC Decline in lieu
+ of any expected CLC message to decline SMC and force the TCP
+ connection back to the IP fabric. There can be several reasons for
+ an SMC Decline during the CLC negotiation, including the following:
+
+ o RNIC went down
+
+ o SMC-R forbidden by local policy
+
+ o subnet (IPv4) or prefix (IPv6) doesn't match
+
+ o lack of resources to perform SMC-R
+
+ In all cases, when an SMC Decline is sent in lieu of an expected CLC
+ message, no confirmation is required, and the TCP connection
+ immediately falls back to using the IP fabric.
+
+ To prevent ambiguity between CLC messages and application data, an
+ SMC Decline cannot "chase" another CLC message. An SMC Decline can
+ only be sent in lieu of an expected CLC message. For example, if the
+ client sends an SMC Proposal and then its RNIC goes down, it must
+ wait for the SMC Accept from the server and then reply to the
+ SMC Accept with an SMC Decline.
+
+ This "no chase" rule means that if this TCP connection is not a first
+ contact between RoCE peers, a server cannot send an SMC Decline after
+ sending an SMC Accept -- it can only either break the TCP connection
+ or fail over if a problem arises in the RoCE fabric after it has sent
+ the SMC Accept. Similarly, once the client sends an SMC Confirm on a
+ TCP connection that isn't a first contact, it is committed to SMC-R
+ for this TCP connection and cannot fall back to IP.
+
+C.2. SMC Decline during LLC Negotiation
+
+ For a TCP connection that represents a first contact between RoCE
+ pairs, it is possible for SMC to fall back to IP during the LLC
+ negotiation. This is possible until the first contact SMC-R link is
+ confirmed. For example, see Figure 42. After a first contact SMC-R
+ link is confirmed, fallback to IP is no longer possible. This
+ translates to the following rule: a first contact peer can send an
+
+
+
+
+Fox, et al. Informational [Page 131]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ SMC Decline at any time during LLC negotiation until it has
+ successfully sent its CONFIRM LINK (request or response) flow. After
+ that point, it cannot fall back to IP.
+
+ Host X -- Server Host Y -- Client
+ +-------------------+ +-------------------+
+ | Peer ID = PS1 | | Peer ID = PC1 |
+ | +------+ +------+ |
+ | QP 8 |RNIC 1| SMC-R Link 1 |RNIC 2| QP 64 |
+ | RKey X | |MAC MA|<-------------------->|MAC MB| | |
+ | | |GID GA| attempted setup |GID GB| | RKey Y2|
+ | \/ +------+ +------+ \/ |
+ |+--------+ | | +--------+ |
+ || RMB | | | | RMB | |
+ |+--------+ | | +--------+ |
+ | /\ +------+ +------+ /\ |
+ | | |RNIC 3| |RNIC 4| | RKey W2|
+ | | |MAC MC| |MAC MD| | |
+ | QP 9 |GID GC| |GID GD| QP 65 |
+ | +------+ +------+ |
+ +-------------------+ +-------------------+
+
+ SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
+ <--------------------------------------------------------->
+
+ SMC Proposal / SMC Accept / SMC Confirm exchange
+ <-------------------------------------------------------->
+
+ CONFIRM LINK(request, Link 1)
+ .........................................................>
+
+ CONFIRM LINK(response, Link 1)
+ X...................................
+ :
+ : RoCE write failure
+ :.................................>
+
+ SMC Decline(PC1, reason code)
+ <--------------------------------------------------------
+
+ Connection data flows over IP fabric
+ <------------------------------------------------------->
+
+ Legend:
+ ------------ TCP/IP and CLC flows
+ ............ RoCE (LLC) flows
+
+ Figure 42: SMC Decline during LLC Negotiation
+
+
+
+Fox, et al. Informational [Page 132]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+C.3. The SMC Decline Window
+
+ Because SMC-R does not support fallback to IP for a TCP connection
+ that is already using RDMA, there are specific rules on when the
+ SMC Decline CLC message, which signals a fallback to IP because of an
+ error or problem with the RoCE fabric, can be sent during TCP
+ connection setup. There is a "point of no return" after which a
+ connection cannot fall back to IP, and RoCE errors that occur after
+ this point require the connection to be broken with a RST flow in the
+ IP fabric.
+
+ For a first contact, that point of no return is after the ADD LINK
+ LLC message has been successfully sent for the second SMC-R link.
+ Specifically, the server cannot fall back to IP after receiving
+ either (1) a positive write completion indication for the ADD LINK
+ request or (2) the ADD LINK response from the client, whichever comes
+ first. The client cannot fall back to IP after sending a negative
+ ADD LINK response, receiving a positive write complete on a positive
+ ADD LINK response, or receiving a CONFIRM LINK for the second SMC-R
+ link from the server, whichever comes first.
+
+ For a subsequent contact, that point of no return is after the last
+ send of the CLC negotiation completes. This, in combination with the
+ rule that error "chasers" are not allowed during CLC negotiation,
+ means that the server cannot send an SMC Decline after sending an SMC
+ Accept, and the client cannot send an SMC Decline after sending an
+ SMC Confirm.
+
+C.4. Out-of-Sync Conditions during SMC-R Negotiation
+
+ The SMC Accept CLC message contains a first contact flag that
+ indicates to the client whether the server believes it is setting up
+ a new link group or using an existing link group. This flag is used
+ to detect an out-of-sync condition between the client and the server.
+ The scenario for such a condition is as follows: there is a single
+ existing SMC-R link between the peers. After the client sends the
+ SMC Proposal CLC message, the existing SMC-R link between the client
+ and the server fails. The client cannot chase the SMC Proposal CLC
+ message with an SMC Decline CLC message in this case, because the
+ client does not yet know that the server would have wanted to choose
+ the SMC-R link that just crashed. The QP that failed recovers before
+ the server returns its SMC Accept CLC message. This means that there
+ is a QP but no SMC-R link. Since the server had not yet learned of
+ the SMC-R link failure when it sent the SMC Accept CLC message, it
+ attempts to reuse the SMC-R link that just failed. This means that
+ the server would not set the first contact flag, indicating to the
+ client that the server thinks it is reusing an SMC-R link. However,
+ the client does not have an SMC-R link that matches the server's
+
+
+
+Fox, et al. Informational [Page 133]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ specification. Because the first contact flag is off, the client
+ realizes it is out of sync with the server and sends an SMC Decline
+ to cause the connection to fall back to IP.
+
+C.5. Timeouts during CLC Negotiation
+
+ Because the SMC-R negotiation flows as TCP data, there are built-in
+ timeouts and retransmits at the TCP layer for individual messages.
+ Implementations also must protect the overall TCP/CLC handshake with
+ a timer or timers to prevent connections from hanging indefinitely
+ due to SMC-R processing. This can be done with individual timers for
+ individual CLC messages or an overall timer for the entire exchange,
+ which may include the TCP handshake and the CLC handshake under one
+ timer or separate timers. This decision is implementation dependent.
+
+ If the TCP and/or CLC handshakes time out, the TCP connection must be
+ terminated as it would be in a legacy IP environment when connection
+ setup doesn't complete in a timely manner. Because the CLC flows are
+ TCP messages, if they cannot be sent and received in a timely
+ fashion, the TCP connection is not healthy and would not work if
+ fallback to IP were attempted.
+
+C.6. Protocol Errors during CLC Negotiation
+
+ Protocol errors occur during CLC negotiation when a message is
+ received that is not expected. For example, a peer that is expecting
+ a CLC message but instead receives application data has experienced a
+ protocol error; this also indicates a likely software error, as the
+ two sides are out of sync. When application data is expected, this
+ data is not parsed to ensure that it's not a CLC message.
+
+ When a peer is expecting a CLC negotiation message, any parsing error
+ except a bad enumerated value in that message must be treated as
+ application data. The CLC negotiation messages are designed with
+ beginning and ending eye catchers to help verify that a CLC
+ negotiation message is actually the expected message. If other
+ parsing errors in an expected CLC message occur, such as incorrect
+ length fields or incorrectly formatted fields, the message must be
+ treated as application data.
+
+ All protocol errors, with the exception of bad enumerated values,
+ must result in termination of the TCP connection. No fallback to IP
+ is allowed in the case of a protocol error, because if the protocols
+ are out of sync, mismatched, or corrupted, then data and security
+ integrity cannot be ensured.
+
+
+
+
+
+
+Fox, et al. Informational [Page 134]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The exception to this rule is enumerated values -- for example, the
+ QP MTU values on SMC Accept and SMC Confirm. If a reserved value is
+ received, the proper error response is to send an SMC Decline and
+ fall back to IP; this is because the use of a reserved enumerated
+ value indicates that the other partner likely has additional support
+ that the receiving partner does not have. This indicated mismatch of
+ SMC-R capabilities is not an integrity problem but indicates that
+ SMC-R cannot be used for this connection.
+
+C.7. Timeouts during LLC Negotiation
+
+ Whenever a peer sends an LLC message to which a reply is expected, it
+ sets a timer after the send posts to wait for the reply. An expected
+ response may be a reply flavor of the LLC message (for example, a
+ CONFIRM LINK reply) or a new LLC message (for example, an ADD LINK
+ CONTINUATION expected from the server by the client if there are more
+ RKeys to be communicated).
+
+ On LLC flows that are part of a first contact setup of a link group,
+ the value of the timer is implementation dependent but should be long
+ enough to allow the other peer to have a write complete timeout and
+ 2-3 retransmits of an SMC Decline on the TCP fabric. For LLC flows
+ that are maintaining the link group and are not part of a first
+ contact setup of a link group, the timers may be shorter. Upon
+ receipt of an expected reply, the timer is cancelled. If a timer
+ pops without a reply having been received, the sender must initiate a
+ recovery action.
+
+ During first contact processing, failure of an LLC verification timer
+ is a "should-not-occur" that indicates a problem with one of the
+ endpoints; this is because if there is a "routine" failure in the
+ RoCE fabric that causes an LLC verification send to fail, the sender
+ will get a write completion failure and will then send an SMC Decline
+ to the partner. The only time an LLC verification timer will expire
+ on a first contact is when the sender thinks the send succeeded but
+ it actually didn't. Because of the reliably connected nature of QP
+ connections on the RoCE fabric, this indicates a problem with one of
+ the peers, not with the RoCE fabric.
+
+ After the reliably connected queue pair for the first SMC-R link in a
+ link group is set up on initial contact, the client sets a timer to
+ wait for a RoCE verification message from the server that the QP is
+ actually connected and usable. If the server experiences a failure
+ sending its QP confirmation message, it will send an SMC Decline,
+ which should arrive at the client before the client's verification
+ timer expires. If the client's timer expires without receiving
+ either an SMC Decline or a RoCE message confirmation from the server,
+
+
+
+
+Fox, et al. Informational [Page 135]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ there is a problem with either the server or the TCP fabric. In
+ either case, the client must break the TCP connection and clean up
+ the SMC-R link.
+
+ There are two scenarios in which the client's response to the QP
+ verification message fails to reach the server. The main difference
+ is whether or not the client has successfully completed the send of
+ the CONFIRM LINK response.
+
+ In the normal case of a problem with the RoCE path, the client will
+ learn of the failure by getting a write completion failure, before
+ the server's timer expires. In this case, the client sends an SMC
+ Decline CLC message to the server, and the TCP connection falls back
+ to IP.
+
+ If the client's send of the confirmation message receives a positive
+ return code but for some reason still does not reach the server, or
+ the client's SMC Decline CLC message fails to reach the server after
+ the client fails to send its RoCE confirmation message, then the
+ server's timer will time out and the server must break the TCP
+ connection by sending a RST. This is expected to be a very rare
+ case, because if the client cannot send its CONFIRM LINK response LLC
+ message, the client should get a negative return code and initiate
+ fallback to IP. A client receiving a positive return code on a send
+ that fails to reach the server should also be an extremely rare case.
+
+C.7.1. Recovery Actions for LLC Timeouts and Failures
+
+ The following list describes recovery actions for LLC timeouts. A
+ write completion failure or other indication of send failure for an
+ LLC command is treated the same as a timeout.
+
+ LLC message: CONFIRM LINK from server (first contact, first link in
+ the link group)
+
+ Timer waits for: CONFIRM LINK reply from client.
+
+ Recovery action: Break the TCP connection by sending a RST, and
+ clean up the link. The server should have received an SMC Decline
+ from the client by now if the client had an LLC send failure.
+
+ LLC message: CONFIRM LINK from server (first contact, second link in
+ the link group)
+
+ Timer waits for: CONFIRM LINK reply from client.
+
+
+
+
+
+
+Fox, et al. Informational [Page 136]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ Recovery action: The second link was not successfully set up.
+ Send a DELETE LINK to the client. Connection data cannot flow in
+ the first link in the link group, until the reply to this DELETE
+ LINK is received, to prevent the peers from being out of sync on
+ the state of the link group.
+
+ LLC message: CONFIRM LINK from server (not first contact)
+
+ Timer waits for: CONFIRM LINK reply from client.
+
+ Recovery action: Clean up the new link, and set a timer to retry.
+ Send a DELETE LINK to the client, in case the client has a longer
+ timer interval, so the client can stop waiting.
+
+ LLC message: CONFIRM LINK reply from client (first contact)
+
+ Timer waits for: ADD LINK from server.
+
+ Recovery action: Clean up the SMC-R link, and break the TCP
+ connection by sending a RST over the IP fabric. There is a
+ problem with the server. If the server had a send failure, it
+ should have sent an SMC Decline by now.
+
+ LLC message: ADD LINK from server (first contact)
+
+ Timer waits for: ADD LINK reply from client.
+
+ Recovery action: Break the TCP connection with a RST, and clean up
+ RoCE resources. The connection is past the point where the server
+ can fall back to IP, and if the client had a send problem it
+ should have sent an SMC Decline by now.
+
+ LLC message: ADD LINK from server (not first contact)
+
+ Timer waits for: ADD LINK reply from client.
+
+ Recovery action: Clean up resources (QP, RKeys, etc.) for the new
+ link, and treat the link over which the ADD LINK was sent as if it
+ had failed. If there is another link available to resend the
+ ADD LINK and the link group still needs another link, retry the
+ ADD LINK over another link in the link group.
+
+ LLC message: ADD LINK reply from client (and there are more RKeys to
+ be communicated)
+
+ Timer waits for: ADD LINK CONTINUATION from server.
+
+ Recovery action: Treat the same as ADD LINK timer failure.
+
+
+
+Fox, et al. Informational [Page 137]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
+ client (and there are no more RKeys to be communicated, for the
+ second link in a first contact scenario)
+
+ Timer waits for: CONFIRM LINK from the server, over the new link.
+
+ Recovery action: The setup of the new link failed. Send a
+ DELETE LINK to the server. Do not consider the socket opened to
+ the client application until receiving confirmation from the
+ server in the form of a DELETE LINK request for this link and
+ sending the reply (to prevent the partners from being out of sync
+ on the state of the link group).
+
+ Set a timer to send another ADD LINK to the server if there is
+ still an unused RNIC on the client side.
+
+ LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
+ client (and there are no more RKeys to be communicated)
+
+ Timer waits for: CONFIRM LINK from the server, over the new link.
+
+ Recovery action: Send a DELETE LINK to the server for the new
+ link, then clean up any resource allocated for the new link and
+ set a timer to send an ADD LINK to the server if there is still an
+ unused RNIC on the client side. The setup of the new link failed,
+ but the link over which the ADD LINK exchange occurred is
+ unaffected.
+
+ LLC message: ADD LINK CONTINUATION from server
+
+ Timer waits for: ADD LINK CONTINUATION reply from client.
+
+ Recovery action: Treat the same as ADD LINK timer failure.
+
+ LLC message: ADD LINK CONTINUATION reply from client (first contact,
+ and RMB count fields indicate that the server owes more ADD LINK
+ CONTINUATION messages)
+
+ Timer waits for: ADD LINK CONTINUATION from server.
+
+ Recovery action: Clean up the SMC-R link, and break the TCP
+ connection by sending a RST. There is a problem with the server.
+
+ If the server had a send failure, it should have sent an
+ SMC Decline by now.
+
+
+
+
+
+
+Fox, et al. Informational [Page 138]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ LLC message: ADD LINK CONTINUATION reply from client (not first
+ contact, and RMB count fields indicate that the server owes more
+ ADD LINK CONTINUATION messages)
+
+ Timer waits for: ADD LINK CONTINUATION from server.
+
+ Recovery action: Treat as if client detected link failure on the
+ link that the ADD LINK exchange is using. Send a DELETE LINK to
+ the server over another active link if one exists; otherwise,
+ clean up the link group.
+
+ LLC message: DELETE LINK from client
+
+ Timer waits for: DELETE LINK request from server.
+
+ Recovery action: If the scope of the request is to delete a single
+ link, the surviving link over which the client sent the
+ DELETE LINK is no longer usable either. If this is the last link
+ in the link group, end TCP connections over the link group by
+ sending RST packets. If there are other surviving links in the
+ link group, resend over a surviving link. Also send a DELETE LINK
+ over a surviving link for the link over which the client attempted
+ to send the initial DELETE LINK message. If the scope of the
+ request is to delete the entire link group, try resending on other
+ links in the link group until success is achieved. If all sends
+ fail, tear down the link group and any TCP connections that exist
+ on it.
+
+ LLC message: DELETE LINK from server (scope: entire link group)
+
+ Timer waits for: Confirmation from the adapter that the message
+ was delivered.
+
+ Recovery action: Tear down the link group and any TCP connections
+ that exist on it.
+
+ LLC message: DELETE LINK from server (scope: single link)
+
+ Timer waits for: DELETE LINK reply from client.
+
+ Recovery action: The link over which the server sent the
+ DELETE LINK is no longer usable either. If this is the last link
+ in the link group, end TCP connections over the link group by
+ sending RST packets. If there are other surviving links in the
+ link group, resend over a surviving link. Also send a DELETE LINK
+ over a surviving link for the link over which the server attempted
+ to send the initial DELETE LINK message. If the scope of the
+ request is to delete the entire link group, try resending on other
+
+
+
+Fox, et al. Informational [Page 139]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ links in the link group until success is achieved. If all sends
+ fail, tear down the link group and any TCP connections that exist
+ on it.
+
+ LLC message: CONFIRM RKEY from client
+
+ Timer waits for: CONFIRM RKEY reply from server.
+
+ Recovery action: Perform normal client procedures for detection of
+ failed link. The link over which the message was sent has failed.
+
+ LLC message: CONFIRM RKEY from server
+
+ Timer waits for: CONFIRM RKEY reply from client.
+
+ Recovery action: Perform normal server procedures for detection of
+ failed link. The link over which the message was sent has failed.
+
+ LLC message: TEST LINK from client
+
+ Timer waits for: TEST LINK reply from server.
+
+ Recovery action: Perform normal client procedures for detection of
+ failed link. The link over which the message was sent has failed.
+
+ LLC message: TEST LINK from server
+
+ Timer waits for: TEST LINK reply from client.
+
+ Recovery action: Perform normal server procedures for detection of
+ failed link. The link over which the message was sent has failed.
+
+ The following list describes recovery actions for invalid LLC
+ messages. These could be misformatted or contain out-of-sync data.
+
+ LLC message received: CONFIRM LINK from server
+
+ What it indicates: Incorrect link information.
+
+ Recovery action: Protocol error. The link must be brought down by
+ sending a DELETE LINK for the link over another link in the link
+ group if one exists. If this is a first contact, fall back to IP
+ by sending an SMC Decline to the server.
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 140]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ LLC message received: ADD LINK
+
+ What it indicates: Undefined enumerated MTU value.
+
+ Recovery action: Send a negative ADD LINK reply with reason
+ code x'2'.
+
+ LLC message received: ADD LINK reply from client
+
+ What it indicates: Client-side link information that would result
+ in a parallel link being set up.
+
+ Recovery action: Parallel links are not permitted. Delete the
+ link by sending a DELETE LINK to the client over another link in
+ the link group.
+
+ LLC message received: Any link group command from the server, except
+ DELETE LINK for the entire link group
+
+ What it indicates: Client has sent a DELETE LINK for the link on
+ which the message was received.
+
+ Recovery action: Ignore the LLC message. Worst case: the server
+ will time out. Best case: the DELETE LINK crosses with the
+ command from the server, and the server realizes it failed.
+
+ LLC message received: ADD LINK CONTINUATION from server or ADD LINK
+ CONTINUATION reply from client
+
+ What it indicates: Number of RMBs provided doesn't match count
+ given on initial ADD LINK or ADD LINK reply message.
+
+ Recovery action: Protocol error. Treat as if detected link
+ outage.
+
+ LLC message received: DELETE LINK from client
+
+ What it indicates: Link indicated doesn't exist.
+
+ Recovery action: If the link is in the process of being cleaned
+ up, assume timing window and ignore message. Otherwise, send a
+ DELETE LINK reply with reason code 1.
+
+ LLC message received: DELETE LINK from server
+
+ What it indicates: Link indicated doesn't exist.
+
+ Recovery action: Send a DELETE LINK reply with reason code 1.
+
+
+
+Fox, et al. Informational [Page 141]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ LLC message received: CONFIRM RKEY from either client or server
+
+ What it indicates: No RKey provided for one or more of the links
+ in the link group.
+
+ Recovery action: Treat as if detected failure of the link(s) for
+ which no RKey was provided.
+
+ LLC message received: DELETE RKEY
+
+ What it indicates: Specified RKey doesn't exist.
+
+ Recovery action: Send a negative DELETE RKEY response.
+
+ LLC message received: TEST LINK reply
+
+ What it indicates: User data doesn't match what was sent in the
+ TEST LINK request.
+
+ Recovery action: Treat as if detected that the link has gone down.
+ This is a protocol error.
+
+ LLC message received: Unknown LLC type with high-order bits of opcode
+ equal to b'10'
+
+ What it indicates: This is an optional LLC message that the
+ receiver does not support.
+
+ Recovery action: Ignore (silently discard) the message.
+
+ LLC message received: Any unambiguously incorrect or out-of-sync LLC
+ message
+
+ What it indicates: Link is out of sync.
+
+ Recovery action: Treat as if detected that the link has gone down.
+ Note that an unsupported or unknown LLC opcode whose two
+ high-order bits are b'10' is not an error and must be silently
+ discarded. Any other unknown or unsupported LLC opcode is an
+ error.
+
+C.8. Failure to Add Second SMC-R Link to a Link Group
+
+ When there is any failure in setting up the second SMC-R link in an
+ SMC-R link group, including confirmation timer expiration, the SMC-R
+ link group is allowed to continue without available failover.
+ However, this situation is extremely undesirable, and the server must
+ endeavor to correct it as soon as it can.
+
+
+
+Fox, et al. Informational [Page 142]
+
+RFC 7609 IBM's Shared Memory Communications over RDMA August 2015
+
+
+ The server peer in the SMC-R link group must set a timer to drive it
+ to retry setup of a failed additional SMC-R link. The server will
+ immediately retry the SMC-R link setup when the first of the
+ following events occurs:
+
+ o The retry timer expires.
+
+ o A new RNIC becomes available to the server, on the same LAN as the
+ SMC-R link group.
+
+ o An ADD LINK LLC request message is received from the client; this
+ indicates the availability of a new RNIC on the client side.
+
+Authors' Addresses
+
+ Mike Fox
+ IBM
+ 3039 Cornwallis Rd.
+ Research Triangle Park, NC 27709
+ United States
+
+ Email: mjfox@us.ibm.com
+
+
+ Constantinos (Gus) Kassimis
+ IBM
+ 3039 Cornwallis Rd.
+ Research Triangle Park, NC 27709
+ United States
+
+ Email: kassimis@us.ibm.com
+
+
+ Jerry Stevens
+ IBM
+ 3039 Cornwallis Rd.
+ Research Triangle Park, NC 27709
+ United States
+
+ Email: sjerry@us.ibm.com
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al. Informational [Page 143]
+