1 files changed, 8011 insertions, 0 deletions
diff --git a/doc/rfc/rfc7609.txt b/doc/rfc/rfc7609.txt
new file mode 100644
index 0000000..4abff4e
--- /dev/null
+++ b/doc/rfc/rfc7609.txt
@@ -0,0 +1,8011 @@
+
+
+
+
+
+
+Independent Submission                                            M. Fox
+Request for Comments: 7609                                   C. Kassimis
+Category: Informational                                       J. Stevens
+ISSN: 2070-1721                                                      IBM
+                                                             August 2015
+
+
+     IBM's Shared Memory Communications over RDMA (SMC-R) Protocol
+
+Abstract
+
+   This document describes IBM's Shared Memory Communications over RDMA
+   (SMC-R) protocol.  This protocol provides Remote Direct Memory Access
+   (RDMA) communications to TCP endpoints in a manner that is
+   transparent to socket applications.  It further provides for dynamic
+   discovery of partner RDMA capabilities and dynamic setup of RDMA
+   connections, as well as transparent high availability and load
+   balancing when redundant RDMA network paths are available.  It
+   maintains many of the traditional TCP/IP qualities of service such as
+   filtering that enterprise users demand, as well as TCP socket
+   semantics such as urgent data.
+
+Status of This Memo
+
+   This document is not an Internet Standards Track specification; it is
+   published for informational purposes.
+
+   This is a contribution to the RFC Series, independently of any other
+   RFC stream.  The RFC Editor has chosen to publish this document at
+   its discretion and makes no statement about its value for
+   implementation or deployment.  Documents approved for publication by
+   the RFC Editor are not a candidate for any level of Internet
+   Standard; see Section 2 of RFC 5741.
+
+   Information about the current status of this document, any errata,
+   and how to provide feedback on it may be obtained at
+   http://www.rfc-editor.org/info/rfc7609.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Fox, et al.                   Informational                     [Page 1]
+
+RFC 7609      IBM's Shared Memory Communications over RDMA   August 2015
+
+
+Copyright Notice
+
+   Copyright (c) 2015 IETF Trust and the persons identified as the
+   document authors.  All rights reserved.
+
+   This document is subject to BCP 78 and the IETF Trust's Legal
+   Provisions Relating to IETF Documents
+   (http://trustee.ietf.org/license-info) in effect on the date of
+   publication of this document.  Please review these documents
+   carefully, as they describe your rights and restrictions with respect
+   to this document.
+
+Table of Contents
+
+   1. Introduction ....................................................5
+      1.1. Protocol Overview ..........................................6
+           1.1.1. Hardware Requirements ...............................8
+      1.2. Definition of Common Terms .................................8
+      1.3. Conventions Used in This Document .........................11
+   2. Link Architecture ..............................................11
+      2.1. Remote Memory Buffers (RMBs) ..............................12
+      2.2. SMC-R Link Groups .........................................18
+           2.2.1. Link Group Types ...................................18
+           2.2.2. Maximum Number of Links in Link Group ..............21
+           2.2.3. Forming and Managing Link Groups ...................23
+           2.2.4. SMC-R Link Identifiers .............................24
+      2.3. SMC-R Resilience and Load Balancing .......................24
+   3. SMC-R Rendezvous Architecture ..................................26
+      3.1. TCP Options ...............................................26
+      3.2. Connection Layer Control (CLC) Messages ...................27
+      3.3. LLC Messages ..............................................27
+      3.4. CDC Messages ..............................................29
+      3.5. Rendezvous Flows ..........................................29
+           3.5.1. First Contact ......................................29
+                  3.5.1.1. Pre-negotiation of TCP Options ............29
+                  3.5.1.2. Client Proposal ...........................30
+                  3.5.1.3. Server Acceptance .........................32
+                  3.5.1.4. Client Confirmation .......................32
+                  3.5.1.5. Link (QP) Confirmation ....................32
+                  3.5.1.6. Second SMC-R Link Setup ...................35
+                           3.5.1.6.1. Client Processing of ADD LINK
+                                      LLC Message from Server ........35
+                           3.5.1.6.2. Server Processing of ADD LINK
+                                      Reply LLC Message from Client ..36
+                           3.5.1.6.3. Exchange of RKeys on
+                                      Second SMC-R Link ..............38
+                           3.5.1.6.4. Aborting SMC-R and
+                                      Falling Back to IP .............38
+
+           3.5.2. Subsequent Contact .................................38
+                  3.5.2.1. SMC-R Proposal ............................39
+                  3.5.2.2. SMC-R Acceptance ..........................40
+                  3.5.2.3. SMC-R Confirmation ........................41
+                  3.5.2.4. TCP Data Flow Race with SMC
+                           Confirm CLC Message .......................41
+           3.5.3. First Contact Variation: Creating a
+                  Parallel Link Group ................................42
+           3.5.4. Normal SMC-R Link Termination ......................43
+           3.5.5. Link Group Management Flows ........................44
+                  3.5.5.1. Adding and Deleting Links in an
+                           SMC-R Link Group ..........................44
+                           3.5.5.1.1. Server-Initiated ADD
+                                      LINK Processing ................45
+                           3.5.5.1.2. Client-Initiated ADD
+                                      LINK Processing ................45
+                           3.5.5.1.3. Server-Initiated DELETE
+                                      LINK Processing ................46
+                           3.5.5.1.4. Client-Initiated DELETE
+                                      LINK Request ...................48
+                  3.5.5.2. Managing Multiple RKeys over
+                           Multiple SMC-R Links in a Link Group ......49
+                           3.5.5.2.1. Adding a New RMB to an
+                                      SMC-R Link Group ...............50
+                           3.5.5.2.2. Deleting an RMB from an
+                                      SMC-R Link Group ...............53
+                           3.5.5.2.3. Adding a New SMC-R Link to a
+                                      Link Group with Multiple RMBs ..54
+                  3.5.5.3. Serialization of LLC Exchanges,
+                           and Collisions ............................56
+                           3.5.5.3.1. Collisions with ADD
+                                      LINK / CONFIRM LINK Exchange ...57
+                           3.5.5.3.2. Collisions during
+                                      DELETE LINK Exchange ...........58
+                           3.5.5.3.3. Collisions during
+                                      CONFIRM RKEY Exchange ..........59
+   4. SMC-R Memory-Sharing Architecture ..............................60
+      4.1. RMB Element Allocation Considerations .....................60
+      4.2. RMB and RMBE Format .......................................60
+      4.3. RMBE Control Information ..................................60
+      4.4. Use of RMBEs ..............................................61
+           4.4.1. Initializing and Accessing RMBEs ...................61
+           4.4.2. RMB Element Reuse and Conflict Resolution ..........62
+      4.5. SMC-R Protocol Considerations .............................63
+           4.5.1. SMC-R Protocol Optimized Window Size Updates .......63
+           4.5.2. Small Data Sends ...................................64
+           4.5.3. TCP Keepalive Processing ...........................65
+
+      4.6. TCP Connection Failover between SMC-R Links ...............67
+           4.6.1. Validating Data Integrity ..........................67
+           4.6.2. Resuming the TCP Connection on a New SMC-R Link ....68
+      4.7. RMB Data Flows ............................................69
+           4.7.1. Scenario 1: Send Flow, Window Size Unconstrained ...69
+           4.7.2. Scenario 2: Send/Receive Flow, Window Size
+                  Unconstrained ......................................71
+           4.7.3. Scenario 3: Send Flow, Window Size Constrained .....72
+           4.7.4. Scenario 4: Large Send, Flow Control, Full
+                  Window Size Writes .................................74
+           4.7.5. Scenario 5: Send Flow, Urgent Data, Window
+                  Size Unconstrained .................................77
+           4.7.6. Scenario 6: Send Flow, Urgent Data, Window
+                  Size Closed ........................................79
+      4.8. Connection Termination ....................................81
+           4.8.1. Normal SMC-R Connection Termination Flows ..........81
+           4.8.2. Abnormal SMC-R Connection Termination Flows ........86
+           4.8.3. Other SMC-R Connection Termination Conditions ......88
+   5. Security Considerations ........................................89
+      5.1. VLAN Considerations .......................................89
+      5.2. Firewall Considerations ...................................89
+      5.3. Host-Based IP Filters .....................................89
+      5.4. Intrusion Detection Services ..............................90
+      5.5. IP Security (IPsec) .......................................90
+      5.6. TLS/SSL ...................................................90
+   6. IANA Considerations ............................................90
+   7. Normative References ...........................................91
+   Appendix A. Formats ...............................................92
+     A.1. TCP Option .................................................92
+     A.2. CLC Messages ...............................................92
+          A.2.1. Peer ID Format ......................................93
+          A.2.2. SMC Proposal CLC Message Format .....................94
+          A.2.3. SMC Accept CLC Message Format .......................98
+          A.2.4. SMC Confirm CLC Message Format .....................102
+          A.2.5. SMC Decline CLC Message Format .....................105
+     A.3. LLC Messages ..............................................106
+          A.3.1. CONFIRM LINK LLC Message Format ....................107
+          A.3.2. ADD LINK LLC Message Format ........................109
+          A.3.3. ADD LINK CONTINUATION LLC Message Format ...........112
+          A.3.4. DELETE LINK LLC Message Format .....................115
+          A.3.5. CONFIRM RKEY LLC Message Format ....................117
+          A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format .......120
+          A.3.7. DELETE RKEY LLC Message Format .....................122
+          A.3.8. TEST LINK LLC Message Format .......................124
+     A.4. Connection Data Control (CDC) Message Format ..............125
+   Appendix B. Socket API Considerations ............................129
+     B.1. setsockopt() / getsockopt() Considerations ................130
+   Appendix C. Rendezvous Error Scenarios ...........................131
+     C.1. SMC Decline during CLC Negotiation ........................131
+     C.2. SMC Decline during LLC Negotiation ........................131
+     C.3. The SMC Decline Window ....................................133
+     C.4. Out-of-Sync Conditions during SMC-R Negotiation ...........133
+     C.5. Timeouts during CLC Negotiation ...........................134
+     C.6. Protocol Errors during CLC Negotiation ....................134
+     C.7. Timeouts during LLC Negotiation ...........................135
+          C.7.1. Recovery Actions for LLC Timeouts and Failures .....136
+     C.8. Failure to Add Second SMC-R Link to a Link Group ..........142
+   Authors' Addresses ...............................................143
+
+1.  Introduction
+
+   This document specifies IBM's Shared Memory Communications over RDMA
+   (SMC-R) protocol.  SMC-R is a protocol for Remote Direct Memory
+   Access (RDMA) communication between TCP socket endpoints.  SMC-R runs
+   over networks that support RDMA over Converged Ethernet (RoCE).  It
+   is designed to permit existing TCP applications to benefit from RDMA
+   without requiring modifications to the applications or predefinition
+   of RDMA partners.
+
+   SMC-R provides dynamic discovery of the RDMA capabilities of TCP
+   peers and automatic setup of RDMA connections that those peers can
+   use.  SMC-R also provides transparent high availability and
+   load-balancing capabilities that are demanded by enterprise
+   installations but are missing from current RDMA protocols.  If
+   redundant RoCE-capable hardware such as RDMA-capable Network
+   Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R
+   can load-balance over that redundant hardware and can also
+   non-disruptively move TCP traffic from failed paths to surviving
+   paths, all seamlessly to the application and the sockets layer.
+   Because SMC-R preserves socket semantics and the TCP three-way
+   handshake, many TCP qualities of service such as filtering, load
+   balancing, and Secure Sockets Layer (SSL) encryption are preserved,
+   as are TCP features such as urgent data.
+
+   Because of the dynamic discovery and setup of SMC-R connectivity
+   between peers, no RDMA connection manager (RDMA-CM) is required.
+   This also means that support for Unreliable Datagram (UD) Queue Pairs
+   (QPs) is not required.
+
+   It is recommended that the SMC-R services be implemented in kernel
+   space, which enables optimizations such as resource-sharing between
+   connections across multiple processes and also permits applications
+   using SMC-R to spawn multiple processes (e.g., fork) without losing
+   SMC-R functionality.  A user-space implementation is compatible with
+   this architecture, but it may not support spawned processes (e.g.,
+   fork), which limits sharing and resource optimization to TCP
+   connections that originate from the same process.  This might be an
+   appropriate design choice if the use case is a system that hosts a
+   large single process application that creates many TCP connections to
+   a peer host, or in implementations where a kernel-space
+   implementation is not possible or introduces excessive overhead for
+   "kernel space to user space" context switches.
+
+1.1.  Protocol Overview
+
+   SMC-R defines the concept of the SMC-R link, which is a logical
+   point-to-point link using reliably connected queue pairs between
+   TCP/IP stack peers over a RoCE fabric.  An SMC-R link is bound to a
+   specific hardware path, meaning a specific RNIC on each peer.  SMC-R
+   links are created and maintained by an SMC-R layer, which may reside
+   in kernel space or user space, depending upon operating system and
+   implementation requirements.  The SMC-R layer resides below the
+   sockets layer and directs data traffic for TCP connections between
+   connected peers over the RoCE fabric using RDMA rather than over a
+   TCP connection.  The TCP/IP stack, with its requirements for
+   fragmentation, packetization, etc., is bypassed, and the application
+   data is moved between peers using RDMA.
+
+   Multiple SMC-R links between the same two TCP/IP stack peers are also
+   supported.  A set of SMC-R links called a link group can be logically
+   bonded together to provide redundant connectivity.  If there is
+   redundant hardware -- for example, two RNICs on each peer -- separate
+   SMC-R links are created between the peers to exploit that redundant
+   hardware.  The link group architecture with redundant links provides
+   load balancing and increased bandwidth, as well as seamless failover.
+
+   Each SMC-R link group is associated with one or more Remote Memory
+   Buffers (RMBs), which are areas of memory that are available for
+   SMC-R peers to write into using RDMA writes.  Multiple
+   TCP connections between peers may be multiplexed over a single SMC-R
+   link, in which case the SMC-R layer manages the partitioning of the
+   RMBs between the TCP connections.  This multiplexing reduces the RDMA
+   resources, such as QPs and RMBs, that are required to support
+   multiple connections between peers, and it also reduces the
+   processing and delays related to setting up QPs, pinning memory, and
+   other RDMA setup tasks when new TCP connections are created.  In a
+   kernel-space SMC-R implementation in which the RMBs reside in kernel
+   storage, this sharing and optimization works across multiple
+   processes executing on the same host.  In a user-space SMC-R
+   implementation in which the RMBs reside in user space, this sharing
+   and optimization is limited to multiple TCP connections created by a
+   single process, as separate RMBs and QPs will be required for each
+   process.
+
+   SMC-R also introduces a rendezvous protocol that is used to
+   dynamically discover the RDMA capabilities of TCP connection partners
+   and exchange credentials necessary to exploit that capability if
+   present.  TCP connections are set up using the normal TCP three-way
+   handshake [RFC793], with the addition of a new TCP option that
+   indicates SMC-R capability.  If both partners indicate SMC-R
+   capability, then at the completion of the three-way TCP handshake the
+   SMC-R layers in each peer take control of the TCP connection and use
+   it to exchange additional Connection Layer Control (CLC) messages to
+   negotiate SMC-R credentials such as QP information; addressability
+   over the RoCE fabric; RMB buffer sizes; and keys and addresses for
+   accessing RMBs over RDMA.  If at any time during this negotiation a
+   failure or decline occurs, the TCP connection falls back to using the
+   IP fabric.
+
+   If the SMC-R negotiation succeeds and either a new SMC-R link is set
+   up or an existing SMC-R link is chosen for the TCP connection, then
+   the SMC-R layers open the sockets to the applications and the
+   applications use the sockets as normal.  The SMC-R layer intercepts
+   the socket reads and writes and moves the TCP connection data over
+   the SMC-R link, "out of band" to the TCP connection, which remains
+   open and idle over the IP fabric, except for termination flows and
+   possible keepalive flows.  Regular TCP sequence numbering methods are
+   used for the TCP flows that do occur; data flowing over RDMA does not
+   use or affect TCP sequence numbers.
+
+   This architecture does not support fallback of active SMC-R
+   connections to IP.  Once connection data has completed the switch to
+   RDMA, a TCP connection cannot be switched back to IP and will reset
+   if RDMA becomes unusable.
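+
+   As a non-normative illustration of the decisions described above,
+   the following Python sketch models which data path a connection
+   ends up on.  The names Transport, choose_transport, and
+   on_rdma_failure are hypothetical and are not defined by this
+   protocol, and the sketch assumes that the outcome of the CLC
+   exchange is already known as a simple boolean.
+
```python
from enum import Enum


class Transport(Enum):
    TCP_IP = "tcp/ip"   # ordinary TCP over the IP fabric
    SMC_R = "smc-r"     # data moved by RDMA over the RoCE fabric
    RESET = "reset"     # connection has been reset


def choose_transport(client_smc_option: bool,
                     server_smc_option: bool,
                     clc_negotiation_ok: bool) -> Transport:
    """Pick the data path once the TCP three-way handshake is done.

    Hypothetical helper: both peers must have sent the SMC-R TCP
    option, and the CLC negotiation must finish without a failure
    or decline; otherwise the connection stays on the IP fabric.
    """
    if not (client_smc_option and server_smc_option):
        return Transport.TCP_IP
    if not clc_negotiation_ok:
        return Transport.TCP_IP
    return Transport.SMC_R


def on_rdma_failure(current: Transport) -> Transport:
    """Model RDMA becoming unusable for an established connection.

    A connection that has switched to RDMA cannot fall back to IP,
    so it is reset; a connection still on TCP/IP is unaffected.
    """
    return Transport.RESET if current is Transport.SMC_R else current
```
+
+   In this model, fallback to IP is possible only before the switch to
+   RDMA completes; afterwards the only modeled outcome of an unusable
+   RDMA path is a reset.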
+
+   The SMC-R protocol defines the format of the RMBs that are used to
+   receive TCP connection data written over RDMA, as well as the
+   semantics for managing and writing to these buffers using Connection
+   Data Control (CDC) messages.
+
+   Finally, SMC-R defines Link Layer Control (LLC) messages that are
+   exchanged over the RoCE fabric between peer SMC-R layers to manage
+   the SMC-R links and link groups.  These include messages to test and
+   confirm connectivity over an SMC-R link, add and delete SMC-R links
+   to or from the link group, and exchange RMB addressability
+   information.
+
+1.1.1.  Hardware Requirements
+
+   SMC-R does not require full Converged Enhanced Ethernet switch
+   functionality.  SMC-R functions over standard Ethernet fabrics,
+   provided that endpoint RNICs are present and IEEE 802.3x Global
+   Pause Frame is supported and enabled in the switch fabric.
+
+   While SMC-R as specified in this document is designed to operate over
+   RoCE fabrics, adjustments to the rendezvous methods could enable it
+   to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.
+
+1.2.  Definition of Common Terms
+
+   This section provides definitions of terms that have a specific
+   meaning to the SMC-R protocol and are used throughout this document.
+
+   SMC-R Link
+
+      An SMC-R link is a logical point-to-point connection over the RoCE
+      fabric via specific physical adapters (Media Access Control /
+      Global Identifier (MAC/GID)).  The link is formed during the
+      "first contact" sequence of the TCP/IP three-way handshake
+      that occurs over the IP fabric.  During this handshake,
+      an RDMA reliably connected queue pair (RC-QP) connection is formed
+      between the two peer SMC hosts and is defined as the SMC-R link.
+      The SMC-R link can then support multiple TCP connections between
+      the two peers.  An SMC-R link is associated with a single LAN (or
+      VLAN) segment and is not routable.
+
+   SMC-R Link Group
+
+      An SMC-R link group is a group of SMC-R links between the same two
+      SMC-R peers, typically with each link over unique RoCE adapters.
+      Each link in the link group has equal characteristics, such as the
+      same VLAN ID (if VLANs are in use), access to the same RMB(s), and
+      access to the same TCP server/client.
+
+   SMC-R Peer
+
+      The SMC-R peer is the peer software stack within the peer
+      operating system with respect to the Shared Memory Communications
+      (messaging) protocol.
+
+   SMC-R Rendezvous
+
+      SMC-R Rendezvous is the SMC-R peer discovery and handshake
+      sequence that occurs transparently over the IP (Ethernet) fabric
+      during and immediately after the TCP connection three-way
+      handshake by exchanging the SMC-R capabilities and credentials
+      using an experimental TCP option and CLC messages.
+
+   RoCE SendMsg
+
+      RoCE SendMsg is a send operation posted to a reliably connected
+      queue pair with inline data, for the purpose of transferring
+      control information between peers.
+
+   TCP Client
+
+      The TCP client is the TCP socket-based peer that initiates a TCP
+      connection.
+
+   TCP Server
+
+      The TCP server is the TCP socket-based peer that accepts a TCP
+      connection.
+
+   CLC Messages
+
+      The SMC-R protocol defines a set of Connection Layer Control
+      messages that flow over the TCP connection that are used to manage
+      SMC-R link rendezvous at TCP connection setup time.  This
+      mechanism is analogous to SSL setup messages.
+
+   LLC Commands
+
+      The SMC-R protocol defines a set of RoCE Link Layer Control
+      commands that flow over the RoCE fabric using RoCE SendMsg, that
+      are used to manage SMC-R links, SMC-R link groups, and SMC-R
+      link group RMB expansion and contraction.
+
+   CDC Message
+
+      The SMC-R protocol defines a Connection Data Control message that
+      flows over the RoCE fabric using RoCE SendMsg that is used to
+      manage the SMC-R connection data.  This message provides
+      information about data being transferred over the out-of-band RDMA
+      connection, such as data cursors, sequence numbers, and data flags
+      (for example, urgent data).  The receipt of this message also
+      provides an interrupt to inform the receiver that it has received
+      RDMA data.
+
+   RMB
+
+      A Remote (RDMA) Memory Buffer is a fixed or pinned buffer
+      allocated in each of the peer hosts for a TCP (via SMC-R)
+      connection.  The RMB is registered to the RNIC and allows remote
+      access by the remote peer using RDMA semantics.  Each host is
+      passed the peer's RMB-specific access information (RMB Key (RKey)
+      and RMB element offset) during the SMC-R Rendezvous process.  The
+      host stores socket application user data directly into the peer's
+      RMB using RDMA over RoCE.
+
+   RToken
+
+      The RToken is the combination of an RMB's RKey and RDMA virtual
+      address.  An RToken provides RMB addressability information to an
+      RDMA peer.
+
+   RMBE
+
+      The Remote Memory Buffer Element (RMBE) is an area of an RMB that
+      is allocated to a specific TCP connection.  The RMBE contains data
+      for the TCP connection.  The RMBE represents the TCP receive
+      buffer, whereby the remote peer writes into the RMBE and the local
+      peer reads from the local RMBE.  The alert token resolves to a
+      specific RMBE.
+
+   Alert Token
+
+      The SMC-R alert token is a 4-byte value that uniquely identifies
+      the TCP connection over an SMC-R connection.  The alert token
+      allows the SMC peer to quickly identify the target TCP connection
+      that now has new work.  The format of the token is defined by the
+      owning SMC-R endpoint and is considered opaque to the remote peer.
+      However, the token should not simply be an index to an RMBE; it
+      should reference a TCP connection and be able to be validated to
+      avoid reading data from stale connections.
+
+   RNIC
+
+      The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC
+      that supports RDMA semantics and verbs using RoCE.
+
+   First Contact
+
+      "First contact" describes an SMC-R negotiation to set up the first
+      link in a link group.
+
+   Subsequent Contact
+
+      "Subsequent contact" describes an SMC-R negotiation between peers
+      who are using an already-existing SMC-R link group.
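+
+   The alert token definition above asks for a 4-byte value that is
+   more than a bare RMBE index and that can be validated.  One possible
+   way to meet that, shown purely as an assumption, is to pack a slot
+   number together with a generation counter.  The AlertTokenTable
+   class, its 16-bit/16-bit layout, and the connection names are all
+   hypothetical; the protocol leaves the token format to the owning
+   endpoint.
+
```python
import struct
from typing import Dict, Optional


class AlertTokenTable:
    """Hypothetical token layout: 16-bit slot + 16-bit generation."""

    def __init__(self) -> None:
        self._generation: Dict[int, int] = {}   # slot -> latest generation
        self._connection: Dict[int, str] = {}   # slot -> live connection

    def assign(self, slot: int, connection: str) -> bytes:
        """Bind a connection to a slot and return its 4-byte token."""
        gen = (self._generation.get(slot, 0) + 1) & 0xFFFF
        self._generation[slot] = gen
        self._connection[slot] = connection
        return struct.pack("!HH", slot, gen)

    def release(self, slot: int) -> None:
        """Drop the connection; the generation is kept for validation."""
        self._connection.pop(slot, None)

    def resolve(self, token: bytes) -> Optional[str]:
        """Return the connection for a token, or None if it is stale."""
        slot, gen = struct.unpack("!HH", token)
        if self._generation.get(slot) != gen:
            return None
        return self._connection.get(slot)
```
+
+   Because the generation changes each time a slot is reassigned, a
+   token issued for an earlier connection no longer resolves after the
+   slot is reused.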
+
+1.3.  Conventions Used in This Document
+
+   In the rendezvous flow diagrams, dashed lines (----) are used to
+   indicate flows over the TCP/IP fabric and dotted lines (....) are
+   used to indicate flows over the RoCE fabric.
+
+   In the data transfer ladder diagrams, dashed lines (----) are used to
+   indicate RDMA write operations and dotted lines (....) are used to
+   indicate CDC messages, which are RDMA messages with inline data that
+   contain control information for the connection.
+
+2.  Link Architecture
+
+   An SMC-R link is based on reliably connected queue pairs (QPs) that
+   form a "logical point-to-point link" between the two SMC-R peers over
+   a RoCE fabric.  An SMC-R link extends from SMC-R peer to SMC-R peer,
+   where typically each peer would be a TCP/IP stack and would reside on
+   separate hosts.
+
+                            ,,.--..,_
+     +----+             _-``         `-,           +-----+
+     |QP 8|            -   RoCE         ',         |QP 64|
+     |    |          /     VLAN M         .        |     |
+     +----+--------+/                     \+-------+-----+
+      | RNIC 1     |    SMC-R Link         | RNIC 2     |
+      |            |<--------------------->|            |
+      +------------+ ,                    /+------------+
+              MAC A (GID A)             MAC B (GID B)
+                       .                .`
+                        `',          ,-`
+                           ``''--''``
+
+                       Figure 1: SMC-R Link Overview
+
+   Figure 1 illustrates an overview of the basic concepts of SMC-R peer-
+   to-peer connectivity; this is called the SMC-R link.  The SMC-R link
+   forms a logical point-to-point connection between two SMC-R peers via
+   RoCE.  The SMC-R link is defined and identified by the following
+   attributes:
+
+      SMC-R link = RC QPs
+         (source VMAC GID QP + target VMAC GID QP + VLAN ID)
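+
+   The attribute list above can be read as a composite key.  The
+   following Python sketch is one way to represent it; the class and
+   field names (QpEndpoint, SmcrLinkId, mac, gid, qp, vlan_id) are
+   hypothetical, and the string and integer field types are chosen
+   only for readability.
+
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QpEndpoint:
    """One end of the RC queue pair (hypothetical representation)."""
    mac: str   # (virtual) MAC address of the RNIC
    gid: str   # RoCE Global Identifier
    qp: int    # queue pair number


@dataclass(frozen=True)
class SmcrLinkId:
    """Identity of an SMC-R link: both QP endpoints plus the VLAN."""
    source: QpEndpoint
    target: QpEndpoint
    vlan_id: Optional[int] = None   # None when VLANs are not in use
```
+
+   Frozen dataclasses compare and hash by value, so two identifiers
+   built from the same endpoints and VLAN ID are equal, while changing
+   any one attribute yields a different link identity.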
+
+   The SMC-R link can optionally be associated with a VLAN ID.  If VLANs
+   are in use for the associated IP (LAN) connection, then the VLAN
+   attribute is carried over on the SMC-R link.  When VLANs are in use,
+   each SMC-R link group is associated with a single and specific VLAN.
+   The RoCE fabric is the same physical Ethernet LAN used for standard
+   TCP/IP-over-Ethernet communications, with switches as described in
+   Section 1.1.1.
+
+   An SMC-R link is designed to support multiple TCP connections between
+   the same two peers.  An SMC-R link is intended to be long lived,
+   while the underlying TCP connections can dynamically come and go.
+   The associated RMBs can also be dynamically added and removed from
+   the link as needed.  The first TCP connection between the peers
+   establishes the SMC-R link.  Subsequent TCP connections then use the
+   previously established link.  When the last TCP connection
+   terminates, the link can then be terminated, typically after an
+   implementation-defined idle timeout period has elapsed.  The TCP
+   server is responsible for initiating and terminating the SMC-R link.
+
+2.1.  Remote Memory Buffers (RMBs)
+
+   Figure 2 shows the hosts -- Hosts X and Y -- and their associated
+   RMBs within each host.  With the SMC-R link, and the associated RKeys
+   and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can
+   remotely access its peer's RMBs using RDMA.  The RKeys and virtual
+   addresses are exchanged during the rendezvous processing when the
+   link is established.  The combination of the RKey and the virtual
+   address is the RToken.  Note that the SMC-R link ends at the QP
+   providing access to the RMB (via the link + RToken).
+
+          Host X                                     Host Y
+     +-------------------+        ,.--.,_       +-------------------+
+     |                   |     .'`       '.     |                   |
+     | Protection        |   ,'            `,   |    Protection     |
+     | Domain X          |  /                \  |    Domain Y       |
+     |            +------+ /                  \ +------+            |
+     |       QP 8 |RNIC 1| |   SMC-R Link     | |RNIC 2|  QP 64     |
+     |        |   |      |<-------------------->|      |   |        |
+     |        |   |      ||                    ||      |   |        |
+     |        |   +------+|    VLAN A          |+------+   |        |
+     |        |          ||                    ||          |        |
+     |        |          | |   RoCE           | |          |        |
+     |        |RToken X  | \                  / |RToken Y  |        |
+     |        |          |  \                /  |          |        |
+     |        V          |   `.            ,'   |          V        |
+     | +--------+        |     '._       ,'     |        +--------+ |
+     | |        |        |        `''-'``       |        |        | |
+     | | RMB    |        |                      |        | RMB    | |
+     | |        |        |                      |        |        | |
+     | +--------+        |                      |        +--------+ |
+     +-------------------+                      +-------------------+
+
+                       Figure 2: SMC-R Link and RMBs
+
+   An SMC-R link can support multiple RMBs that are independently
+   managed by each peer.  The number and the size of RMBs are managed by
+   the peers based on the host's unique memory management requirements;
+   however, the maximum number of RMBs that can be associated to a link
+   group on one peer is 255.  The QP has a single protection domain, but
+   each RMB has a unique RToken.  All RTokens must be exchanged with the
+   peer.
+
+   Each peer manages the RMBs in its local memory on behalf of its
+   remote SMC-R peer and shares access to them with that peer via
+   RTokens.  The remote peer writes into the RMBs via RDMA, and the
+   local peer (RMB owner) then reads from the RMBs.
+
+   When two peers decide to use SMC-R for a given TCP connection, they
+   each allocate a local RMB element for the TCP connection and
+   communicate the location of this local RMB element during rendezvous
+   processing.  To that end, RMB elements are created in pairs, with one
+   RMB element allocated locally on each peer of the SMC-R link.
+
+
+                  ---  +------------+---------------+
+                  /\   |Eye Catcher |               |
+                   |   +------------+               |
+                   |   |                            |
+         RMB Element 1 |                            |
+                   |   |   Receive Buffer           |
+                   |   |                            |
+                   |   |                            |
+                  \/   |                            |
+                  ---  +------------+---------------+
+                  /\   |Eye Catcher |               |
+                   |   +------------+               |
+                   |   |                            |
+         RMB Element 2 |                            |
+                   |   |   Receive Buffer           |
+                   |   |                            |
+                   |   |                            |
+                  \/   |                            |
+                  ---  +----------------------------+
+                       |            .               |
+                       |            .               |
+                       |            .               |
+                       |            .               |
+                       |    (up to 255 elements)    |
+                       +----------------------------+
+
+                           Figure 3: RMB Format
+
+   Figure 3 illustrates the basic format of an RMB.  The RMB is a
+   virtual memory buffer whose backing real memory is pinned.  Each RMB
+   can support up to 255 TCP connections to exactly one remote SMC-R
+   peer.
+   Each RMB is therefore associated with the SMC-R links within a link
+   group for the two peers and a specific RoCE Protection Domain.  Other
+   than the two peers identified by the SMC-R link, no other SMC-R peers
+   can have RDMA access to an RMB; this requires a unique Protection
+   Domain for every SMC-R link.  This is critical to ensure integrity of
+   SMC-R communications.
+
+   RMBs are subdivided into multiple elements for efficiency, with each
+   RMB Element (RMBE) associated with a single TCP connection.
+   Therefore, multiple TCP connections across an SMC-R link group can
+   share the same memory for RDMA purposes, reducing the overhead of
+   having to register additional memory with the RNIC for every new TCP
+   connection.  The number of elements in an RMB and the size of each
+   RMBE are entirely governed by the owning peer, subject to the SMC-R
+   architecture rules; however, all RMB elements within a given RMB must
+   be the same size.  Each peer can decide the level of resource-sharing
+   that is desirable across TCP connections based on local constraints,
+   such as available system memory.  An RMB element is identified to the
+   remote SMC-R peer via an RMB Element Token, which consists of the
+   following:
+
+   o  RMB RToken: The combination of the RKey and virtual address
+      provided by the RNIC that identifies the start of the RMB for RDMA
+      operations.
+
+   o  RMB Index: Identifies the RMB element index in the RMB.  Used to
+      locate a specific RMB element within an RMB.  Valid value range is
+      1-255.
+
+   o  RMB Element Length: The length of the RMB element's eye catcher
+      plus the length of the receive buffer.  This length is equal for
+      all RMB elements in a given RMB.  This length can be variable
+      across different RMBs.
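+
+   As an illustration only, the three token fields above can be
+   modeled as follows.  The class name, the field names, and in
+   particular the address calculation are assumptions made for this
+   sketch: it assumes that elements are laid out contiguously, as
+   drawn in Figure 3, and that index 1 refers to the first element.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RmbElementToken:
    """Illustrative container for the three token fields listed above."""

    rkey: int             # RKey half of the RMB RToken
    virtual_address: int  # virtual-address half of the RToken (start of RMB)
    index: int            # RMB Index, valid range 1-255
    element_length: int   # eye catcher plus receive buffer, in bytes

    def __post_init__(self):
        if not 1 <= self.index <= 255:
            raise ValueError("RMB Index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("RMB Element Length must be positive")

    def element_address(self) -> int:
        # ASSUMPTION: elements are contiguous and index 1 is the first one.
        return self.virtual_address + (self.index - 1) * self.element_length
```
+
+   Under these assumptions, index 3 in an RMB that starts at 0x10000
+   and uses 4096-byte elements resolves to address 0x12000.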
+
+   Multiple RMBs can be associated to an SMC-R link group, and each peer
+   in an SMC-R link group manages allocation of its RMBs.  RMB
+   allocation can be asymmetric.  For example, Server X can allocate two
+   RMBs to an SMC-R link group while Server Y allocates five.  This
+   provides maximum implementation flexibility to allow hosts to
+   optimize RMB management for their own local requirements.  The
+   maximum number of RMBs that can be allocated on one peer to a link
+   group is 255.  If more RMBs are required, the peer may fall back to
+   IP for subsequent connections or, if the peer is the server, create a
+   parallel link group.
+
+   One use case for multiple RMBs is multiple receive buffer sizes.
+   Since every element in an RMB must be the same size, multiple RMBs
+   with different element sizes can be allocated if varying receive
+   buffer sizes are required.
+
+   Also, since the maximum number of TCP connections whose receive
+   buffers can be allocated to an RMB is 255, multiple RMBs may be
+   required to provide capacity for large numbers of TCP connections
+   between two peers.
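+
+   The capacity arithmetic implied by the two 255 limits can be
+   sketched as below.  The function names are hypothetical, and the
+   sketch assumes that every RMB is filled to its 255-element limit,
+   which a real implementation may choose not to do.
+
```python
MAX_ELEMENTS_PER_RMB = 255     # TCP connections (RMBEs) per RMB
MAX_RMBS_PER_LINK_GROUP = 255  # RMBs per peer per link group


def rmbs_needed(connections: int) -> int:
    """Minimum number of RMBs for the given number of TCP connections."""
    if connections < 0:
        raise ValueError("connection count cannot be negative")
    # Integer ceiling division.
    return (connections + MAX_ELEMENTS_PER_RMB - 1) // MAX_ELEMENTS_PER_RMB


def fits_in_one_link_group(connections: int) -> bool:
    """True if one peer can host all receive buffers in one link group."""
    return rmbs_needed(connections) <= MAX_RMBS_PER_LINK_GROUP
```
+
+   For example, 1000 connections need at least four RMBs, and a
+   single link group tops out at 255 * 255 = 65025 connections per
+   peer under these assumptions.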
+
+   Separately from the RMB, the TCP/IP stack that owns each RMB
+   maintains control data for each RMB element within its local control
+   structures.  The control data contains flags for maintaining the
+   state of the TCP data (for example, urgent data indicator) and, most
+   importantly, the following two cursors, which are illustrated below
+   in Figure 4:
+
+   o  The peer producer cursor: This is a wrapping offset into the
+      RMB element's receive buffer that points to the next byte of data
+      to be written by the remote peer.  This cursor is provided by the
+      remote peer in a Connection Data Control (CDC) message, which is
+      sent using RoCE SendMsg processing, and tells the local peer how
+      far it can consume data in the RMBE buffer.
+
+   o  The peer consumer cursor: This is a wrapping offset into the
+      remote peer's RMB element's receive buffer that points to the next
+      byte of data to be consumed by the remote peer in its own RMBE.
+      The local peer cannot write into the remote peer's RMBE beyond
+      this point without causing data loss.  This cursor is also
+      provided by the peer using a Connection Data Control message.
+
+   Each TCP connection peer maintains its cursors for a TCP connection's
+   RMBE in its local control structures.  In other words, the peer who
+   writes into a remote peer's RMBE provides its producer cursor to the
+   peer whose RMBE it has written into.  The peer who reads from its
+   RMBE provides its consumer cursor to the writing peer.  In this
+   manner, the reads and writes between peers are kept coordinated.
+
+   For example, referring to Figure 4, Peer B writes the data shown as
+   the hatched area into the receive buffer of Peer A's RMBE.  After
+   that write completes, Peer B sends a CDC message carrying its updated
+   producer cursor to Peer A, which tells Peer A how much data is
+   available to consume.  That CDC message also wakes up Peer A and
+   notifies it that there is data to be consumed.
+
+   Similarly, when Peer A consumes data written by Peer B, it uses a CDC
+   message to update its consumer cursor to Peer B to let Peer B know
+   how much data it has consumed, so Peer B knows how much space is
+   available for further writes.  If Peer B were to write enough data to
+   Peer A that it would wrap the RMBE receive buffer and exceed the
+   consumer cursor, data loss would result.
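+
+   The cursor coordination described above can be sketched with a toy
+   model.  This is a single-process simplification, not the protocol
+   itself: in SMC-R the two cursors live on different hosts and are
+   exchanged in CDC messages.  The class name and the "unread" counter
+   (used here only to tell a full buffer from an empty one) are
+   assumptions of the sketch.
+
```python
class RmbeReceiveBuffer:
    """Toy model of one RMBE receive buffer and its two cursors."""

    def __init__(self, size: int):
        self.size = size
        self.buf = bytearray(size)
        self.producer = 0  # wrapping offset of the next byte to be written
        self.consumer = 0  # wrapping offset of the next byte to be consumed
        self.unread = 0    # ASSUMPTION: disambiguates full from empty

    def writable(self) -> int:
        """Space the writer may use without passing the consumer cursor."""
        return self.size - self.unread

    def write(self, data: bytes) -> None:
        """Writer side: refuse any write that would overrun unread data."""
        if len(data) > self.writable():
            raise BufferError("write would pass the consumer cursor")
        for byte in data:
            self.buf[self.producer] = byte
            self.producer = (self.producer + 1) % self.size
        self.unread += len(data)

    def read(self, count: int) -> bytes:
        """Owner side: consume up to count bytes, advancing the cursor."""
        count = min(count, self.unread)
        out = bytearray()
        for _ in range(count):
            out.append(self.buf[self.consumer])
            self.consumer = (self.consumer + 1) % self.size
        self.unread -= count
        return bytes(out)
```
+
+   In this model the BufferError stands in for the data loss that the
+   text warns about: the writer must stop at the consumer cursor.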
+
+   Note that this is a simplistic description of the control flows, and
+   they are optimized to minimize the number of CDC messages required,
+   as described in Section 4.7 ("RMB Data Flows").
+
+      Peer A's RMBE Control Info            Peer B's RMBE Control Info
+     +--------------------------+          +--------------------------+
+     |                          |          |                          |
+      /----Peer producer cursor |    +-----+-Peer consumer cursor     |
+    /|                          |    |     |                          |
+   | +--------------------------+    |     +--------------------------+
+   |  Peer A's RMBE                  |
+   | +--------------------------+    |
+   | |            +------------------+
+   | |            |             |
+   | |            \/            |
+   | |             +------------|
+   | |-------------+/////////// |
+   | |//RDMA data written by ///|
+   | |/// Peer B that is ////// |
+   | |/available to be consumed/|
+   | |///////////////////////// |
+   | |///////// +---------------|
+   | |----------+/\             |
+   | |            |             |
+    \|            |             |
+     \           /              |
+     |\---------/               |
+     |                          |
+     |                          |
+
+                          Figure 4: RMBE Cursors
+
+   Additional flags and indicators are communicated between peers.  In
+   all cases, these flags and indicators are updated by the peer using
+   CDC messages, which are sent using RoCE SendMsg.  More details on
+   these additional flags and indicators are described in Section 4.3
+   ("RMBE Control Information").
+
+2.2.  SMC-R Link Groups
+
+   SMC-R links are logically grouped together to form an SMC-R link
+   group.  The purpose of the link group is to support multiple links
+   between the same two peers in order to provide:
+
+   o  Resilience: Provides transparent and dynamic switching of the link
+      used by existing TCP connections during link failures, typically
+      hardware related.  TCP traffic using the failing link can be
+      switched to an active link within the link group, thereby avoiding
+      disruptions to application workloads.
+
+   o  Link utilization: Provides an active/active link usage model
+      allowing TCP traffic to be balanced across the links, which
+      increases bandwidth and also avoids hardware imbalances and
+      bottlenecks.  Note that both adapter and switch utilization can
+      become potential resource constraint issues.
+
+   SMC-R link group support is required.  Resilience is not optional.
+   However, the user can elect to provision a single RNIC (on one or
+   both hosts).
+
+   Multiple links that are formed between the same two peers fall into
+   two distinct categories:
+
+   1. Equal Links: Links providing equal access to the same RMB(s) at
+      both endpoints, whereby all TCP connections associated with the
+      links must have the same VLAN ID and have the same TCP server and
+      TCP client roles or relationship.
+
+   2. Unequal Links: Links providing access to unique, unrelated and
+      isolated RMB(s) (i.e., for unique VLANs or unique and isolated
+      application workloads, etc.) or having unique TCP server or client
+      roles.
+
+   Links that are logically grouped together forming an SMC-R link group
+   must be equal links.
+
+2.2.1.  Link Group Types
+
+   Equal links within a link group also have another "Link Group Type"
+   attribute based on the link's associated underlying physical path.
+   The following SMC-R link types are defined:
+
+   1. Single link: the only active link within a link group
+
+   2. Parallel link: not allowed -- SMC-R links having the same physical
+      RNIC at both hosts
+
+   3. Asymmetric link: links that have unique RNIC adapters at one host
+      but share a single adapter at the peer host
+
+   4. Symmetric link: links that have unique RNIC adapters at both hosts
+
+   These link group types are further explained in the following figures
+   and descriptions.
+
+   Figure 2 above shows the single-link case.  The single link
+   illustrated in Figure 2 also establishes the SMC-R link group.  Link
+   groups are supposed to have multiple links, but when only one RNIC is
+   available at each host, only a single link can be created.  This
+   is expected to be a transient case.
+
+   Figure 5 shows the symmetric-link case.  Both hosts have unique and
+   redundant RNIC adapters.  This configuration provides the full RoCE
+   redundancy needed for the level of resilience that SMC-R high
+   availability requires.  While this
+   configuration is not required, it is a strongly recommended "best
+   practice" for the exploitation of SMC-R.  Single and asymmetric links
+   must be supported but are intended to provide for short-term
+   transient conditions -- for example, during a temporary outage or
+   recycle of an RNIC.
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |                   |                      |                   |
+     | Protection        |                      |    Protection     |
+     | Domain X          |                      |    Domain Y       |
+     |            +------+                      +------+            |
+     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
+     |RToken X|   |      |<-------------------->|      |   |        |
+     |        |   |      |                      |      |   |RToken Y|
+     |       \/   +------+                      +------+  \/        |
+     |+--------+         |                      |        +--------+ |
+     ||        |         |                      |        |        | |
+     || RMB    |         |                      |        | RMB    | |
+     ||        |         |                      |        |        | |
+     |+--------+         |                      |        +--------+ |
+     |       /\   +------+                      +------+  /\        |
+     |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
+     |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
+     |       QP 9 |      |                      |      |  QP 65     |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+                      Figure 5: Symmetric SMC-R Links
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |                   |                      |                   |
+     | Protection        |                      |    Protection     |
+     | Domain X          |                      |    Domain Y       |
+     |            +------+                      +------+            |
+     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
+     |RToken X|   |      |<-------------------->|      |   |        |
+     |        |   |      |                   .->|      |   |RToken Y|
+     |       \/   +------+                 .`   +------+  \/        |
+     |+--------+         |               .`     |        +--------+ |
+     ||        |         |             .`       |        |        | |
+     || RMB    |         |           .`         |        | RMB    | |
+     ||        |         |         .`SMC-R      |        |        | |
+     |+--------+         |       .` Link 2      |        +--------+ |
+     |       /\   +------+     .`               +------+            |
+     |RToken Z|   |      |   .`                 |      |down or     |
+     |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
+     |       QP 9 |      |                      |      |            |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+                     Figure 6: Asymmetric SMC-R Links
+
+   In the example provided by Figure 6, Host X has two RNICs but Host Y
+   only has one RNIC because RNIC 4 is not available.  This
+   configuration allows for the creation of an asymmetric link.  While
+   an asymmetric link will provide some resilience (for example, when
+   RNIC 1 fails), ideally each host should provide two redundant RNICs.
+   This should be a transient case, and when RNIC 4 becomes available,
+   this configuration must transition to a symmetric-link configuration.
+   This transition is accomplished by first creating the new symmetric
+   link and then deleting the asymmetric link with reason code
+   "Asymmetric link no longer needed" specified in the DELETE LINK LLC
+   message.
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |                   |                      |                   |
+     | Protection        |                      |    Protection     |
+     | Domain X          |                      |    Domain Y       |
+     |            +------+  SMC-R Link 1        +------+            |
+     |       QP 8 |RNIC 1|<-------------------->|RNIC 2|  QP 64     |
+     |RToken X|   |      |                      |      |   |        |
+     |        |   |      |<-------------------->|      |   |RToken Y|
+     |       \/   +------+  SMC-R Link 2        +------+  \/        |
+     |+--------+   QP 9  |                      | QP 65  +--------+ |
+     ||        |    |    |                      |  |     |        | |
+     || RMB    |<-- +    |                      |  +---->| RMB    | |
+     ||        |         |                      |        |        | |
+     |+--------+         |                      |        +--------+ |
+     |            +------+                      +------+            |
+     |     down or|      |                      |      |down or     |
+     | unavailable|RNIC 3|                      |RNIC 4|unavailable |
+     |            |      |                      |      |            |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+              Figure 7: SMC-R Parallel Links (Not Supported)
+
+   Figure 7 shows parallel links, which are two links in the link group
+   that use the same hardware.  This configuration is not permitted.
+   Because SMC-R multiplexes multiple TCP connections over an SMC-R link
+   and both links are using the exact same hardware, there is no
+   additional redundancy or capacity benefit obtained from this
+   configuration.  In addition to providing no real benefit, this
+   configuration adds the unnecessary overhead of additional queue
+   pairs, generation of additional RKeys, etc.
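+
+   The relationship between two links in a link group, as shown in
+   Figures 5 through 7, can be sketched as a simple comparison of the
+   RNICs at each end.  Representing a link as a pair of RNIC names is
+   an assumption made for illustration only.
+
```python
from typing import Tuple

# Hypothetical representation: (RNIC used on Host X, RNIC used on Host Y).
Link = Tuple[str, str]


def classify(first: Link, second: Link) -> str:
    """Classify how a second link relates to an existing first link."""
    same_on_x = first[0] == second[0]
    same_on_y = first[1] == second[1]
    if same_on_x and same_on_y:
        return "parallel"    # same hardware at both hosts: not permitted
    if same_on_x or same_on_y:
        return "asymmetric"  # unique RNICs at one host, shared at the other
    return "symmetric"       # unique RNICs at both hosts
```
+
+   For the figures above, the pairs (RNIC 1, RNIC 2) and (RNIC 3,
+   RNIC 4) are symmetric, (RNIC 1, RNIC 2) and (RNIC 3, RNIC 2) are
+   asymmetric, and two links over (RNIC 1, RNIC 2) are parallel.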
+
+2.2.2.  Maximum Number of Links in Link Group
+
+   The SMC-R protocol defines a maximum of eight symmetric SMC-R links
+   within a single SMC-R link group.  This allows for support for up to
+   eight unique physical paths between peer hosts.  However, in terms of
+   meeting the basic requirements for redundancy, support for at least
+   two symmetric links must be implemented.  Supporting more than two
+   links also simplifies implementation for practical matters relating
+   to dynamically adding and removing links -- for example, starting a
+   third SMC-R link prior to taking down one of the two existing links.
+   Recall that all links within a link group must have equal access to
+   all associated RMBs.
+
+   The SMC-R protocol allows an implementation to assign an
+   implementation-specific and appropriate value for maximum symmetric
+   links.  The implementation value must not exceed the architecture
+   limit of 8; also, the value must not be lower than 2, because the
+   SMC-R protocol requires redundancy.  This does not mean that two
+   RNICs are physically required to enable SMC-R connectivity, but at
+   least two RNICs for redundancy are strongly recommended.
+
+   The SMC-R peers exchange their implementation maximum link values
+   during the link group establishment using the defined maximum link
+   value in the CONFIRM LINK LLC command.  Once the initial exchange
+   completes, the value is set for the life of the link group.  The
+   maximum link value can be provided by both the server and client.
+   The server must supply a value, whereas the client maximum link value
+   is optional.  When the client does not supply a value, it indicates
+   that the client accepts the server-supplied maximum value.  If the
+   client provides a value, it cannot exceed the server-supplied maximum
+   value.  If the client passes a lower value, this lower value then
+   becomes the final negotiated maximum number of symmetric links for
+   this link group.  Again, the minimum value is 2.
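+
+   The negotiation rules in this section can be summarized in a short
+   sketch.  The function name is hypothetical, and raising an error
+   for an out-of-range value is an assumption: this section states
+   the constraints but not how a violation is handled.
+
```python
from typing import Optional

ARCH_MIN_LINKS = 2  # redundancy requires at least two symmetric links
ARCH_MAX_LINKS = 8  # architecture limit


def negotiate_max_links(server_max: int,
                        client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links."""
    if not ARCH_MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server value must be between 2 and 8")
    if client_max is None:
        # The client accepts the server-supplied maximum.
        return server_max
    if not ARCH_MIN_LINKS <= client_max <= server_max:
        # ASSUMPTION: a violation is reported as an error.
        raise ValueError("client value must be between 2 and server value")
    # A lower (or equal) client value becomes the final maximum.
    return client_max
```
+
+   For example, a server value of 8 with no client value yields 8,
+   while a server value of 8 and a client value of 4 yields 4.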
+
+   During run time, the client must never request that the server add a
+   symmetric link to a link group that would exceed the negotiated
+   maximum link value.  Likewise, the server must never attempt to add a
+   symmetric link to a link group that would exceed the negotiated
+   maximum value.
+
+   In terms of counting the number of active links within a link group,
+   the initial (or the only/last) link is always counted as 1.
+   Then, as additional links are added, they are either symmetric or
+   asymmetric links.
+
+   With regard to enforcing the maximum link rules, asymmetric links
+   are an exception and have their own set of rules:
+
+   o  At most one asymmetric link is allowed per link group.
+
+   o  Asymmetric links must not be counted in the maximum symmetric-link
+      count calculation.  When tracking the current count or enforcing
+      the negotiated maximum number of links, an asymmetric link is not
+      to be counted.
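+
+   One possible reading of these counting rules is sketched below.
+   The function and parameter names are hypothetical; the count passed
+   in is assumed to include the initial link and every symmetric link
+   but never the asymmetric link.
+
```python
def can_add_link(counted_links: int, has_asymmetric: bool,
                 negotiated_max: int, new_is_asymmetric: bool) -> bool:
    """Return True if adding the new link respects the rules above.

    counted_links includes the initial link and every symmetric link;
    an asymmetric link is never included in it.
    """
    if new_is_asymmetric:
        # At most one asymmetric link per link group, and it is not
        # checked against the negotiated maximum.
        return not has_asymmetric
    # Symmetric links are bounded by the negotiated maximum.
    return counted_links < negotiated_max
```
+
+   Under this reading, a link group that has already reached its
+   negotiated maximum can still accept one asymmetric link.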
+
+2.2.3.  Forming and Managing Link Groups
+
+   SMC-R link groups are self-defining.  The first SMC-R link in a link
+   group is created using TCP option flows on the TCP three-way
+   handshake followed by CLC message flows over the TCP connection.
+   Subsequent SMC-R links in the link group are created by sending LLC
+   messages over an SMC-R link that already exists in the link group.
+   Once an SMC-R link group is created, no additional SMC-R links in
+   that group are created using TCP and CLC negotiation.  Because
+   subsequent SMC-R links are created exclusively by sending LLC
+   messages over an existing SMC-R link in a link group, the membership
+   of SMC-R links in a link group is self-defining.
+
+   This architecture does not define a specific identifier for an SMC-R
+   link group.  This identification may be useful for network management
+   and may be assigned in a platform-specific manner, or in an extension
+   to this architecture.
+
+   In each SMC-R link group, one peer is the server for all TCP
+   connections and the other peer is the client.  If there are
+   additional TCP connections between the peers that use SMC-R and have
+   the client and server roles reversed, another SMC-R link group is set
+   up between them with the opposite client-server relationship.
+
+   This is required because there are specific responsibilities divided
+   between the client and server in the management of an SMC-R link
+   group.
+
+   In this architecture, the decision of whether to use an existing
+   SMC-R link group or create a new SMC-R link group for a TCP
+   connection is made exclusively by the server.
+
+   Management of the links in an SMC-R link group is also a server
+   responsibility.  The server is responsible for adding and deleting
+   links in a link group.  The client may request that the server take
+   certain actions, but the final responsibility is the server's.
+
+2.2.4.  SMC-R Link Identifiers
+
+   This architecture defines multiple identifiers to identify SMC-R
+   links and peers.
+
+   o  Link number: This is a 1-byte value that identifies an SMC-R link
+      within a link group.  Both the server and the client use this
+      number to distinguish an SMC-R link from other links within the
+      same link group.  It is only unique within a link group.  In order
+      to prevent timing windows that may occur when a server creates a
+      new link while the client is still cleaning up a previously
+      existing link, link numbers cannot be reused until the entire link
+      numbering space has been exhausted.
+
+   o  Link user ID: This is an architecturally opaque 4-byte value that
+      a peer uses to uniquely define an SMC-R link within its own space.
+      This means that a link user ID is unique within one peer only.
+      Each peer defines its own link user ID for a link.  The peers
+      exchange this information once during link setup, and it is never
+      used architecturally again.  The purpose of this identifier is for
+      network management, display, and debugging.  For example, when a
+      client is having trouble with a link, the operator on the client
+      could give the server's link user ID to the operator on the
+      server, who can then check on the operation of that link.
+
+   o  Peer ID: The SMC-R peer ID uniquely identifies a specific instance
+      of a specific TCP/IP stack.  It is required because in clustered
+      and load-balancing environments, an IP address does not uniquely
+      identify a TCP/IP stack.  An RNIC's MAC/GID also doesn't uniquely
+      or reliably identify a TCP/IP stack, because RNICs can go up and
+      down and even be redeployed to other TCP/IP stacks in a
+      multiple-partitioned or virtualized environment.  The peer ID is
+      not only unique per TCP/IP stack but is also unique per instance
+      of a TCP/IP stack, meaning that if a TCP/IP stack is restarted,
+      its peer ID changes.
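+
+   The link number reuse rule above can be sketched as an allocator
+   that always moves forward through the numbering space and wraps
+   only after reaching the end.  The class is hypothetical, and the
+   default range of 1-255 is an assumption: this section says only
+   that the link number is a 1-byte value.
+
```python
class LinkNumberAllocator:
    """Hands out link numbers without early reuse (illustrative only)."""

    def __init__(self, first: int = 1, last: int = 255):
        # ASSUMPTION: usable numbers run from first to last inclusive.
        self.first = first
        self.last = last
        self.next = first
        self.in_use = set()

    def allocate(self) -> int:
        span = self.last - self.first + 1
        for _ in range(span):
            candidate = self.next
            # Always advance, so a released number is not handed out
            # again until the rest of the space has been walked.
            self.next = self.first + (self.next - self.first + 1) % span
            if candidate not in self.in_use:
                self.in_use.add(candidate)
                return candidate
        raise RuntimeError("no free link numbers")

    def release(self, number: int) -> None:
        self.in_use.discard(number)
```
+
+   With a deliberately tiny range of 1-4, allocating 1 and 2 and then
+   releasing 1 causes the next allocations to return 3 and 4 before
+   1 is handed out again.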
+
+2.3.  SMC-R Resilience and Load Balancing
+
+   The SMC-R multilink architecture provides resilience for network high
+   availability via failover capability to an alternate RoCE adapter.
+
+   The SMC-R multilink architecture does not define primary, secondary,
+   or alternate roles to the links.  Instead, there are multiple active
+   links representing multiple redundant RoCE paths over the same LAN.
+
+   Assignment of TCP connections to links is unidirectional and
+   asymmetric.  This means that the client and server may each choose a
+   separate link for their RDMA writes associated with a specific TCP
+   connection.
+
+   If a hardware failure or a QP failure associated with an individual
+   link occurs, the TCP connections that were associated with the
+   failing link are dynamically and transparently switched to another
+   available link.  The server or the client can detect a
+   failure, immediately move their TCP connections, and then notify
+   their peer via the DELETE LINK LLC command.  While the client can
+   notify the server of an apparent link failure with the DELETE LINK
+   LLC command, the server performs the actual link deletion.
+
+   The movement of TCP connections to another link can be accomplished
+   with minimal coordination between the peers.  The TCP connection
+   movement is also transparent to, and non-disruptive to, the TCP
+   socket application workloads for most failure scenarios.  After a
+   failure, the surviving links and all associated hardware must handle
+   the link group's workload.
+
+   As each SMC-R peer begins to move active TCP connections to another
+   link, all current RDMA write operations must be allowed to complete.
+   The moving peer then sends a signal to verify receipt of the last
+   successful write by its peer.  If this verification fails, the TCP
+   connection must be reset.  Once this verification is complete, all
+   writes that failed may then be retried, in order, over the new link.
+   Any data writes or CDC messages for which the sender did not receive
+   write completion must be replayed before any subsequent data or CDC
+   write operations are sent.  LLC messages are not retried over the new
+   link, because they are dependent on a known link configuration, which
+   has just changed because of the failure.  The initiator of an LLC
+   message exchange that fails will be responsible for retrying once the
+   link group configuration stabilizes.
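+
+   The replay rule in the preceding paragraph can be sketched as a
+   filter over the operations that were outstanding on the failed
+   link.  The record type, its field names, and the local sequence
+   number are assumptions of this sketch, and the step that verifies
+   receipt of the last successful write is not modeled.
+
```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingOp:
    seq: int         # hypothetical local number preserving send order
    kind: str        # "data", "cdc", or "llc"
    completed: bool  # True once write completion was received


def ops_to_replay(pending: List[PendingOp]) -> List[PendingOp]:
    """Select what must be replayed, in order, over the new link."""
    return sorted(
        (op for op in pending
         # Data writes and CDC messages without completion are replayed;
         # LLC messages are never retried over the new link.
         if not op.completed and op.kind in ("data", "cdc")),
        key=lambda op: op.seq,
    )
```
+
+   New data or CDC writes would be queued behind the returned list,
+   so that nothing is sent ahead of the replayed operations.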
+
+   When a new link becomes available and is re-added to the link group,
+   each peer is free to rebalance its current TCP connections as needed
+   or only assign new TCP connections to the newly added link.  Both the
+   server and client are free to manage TCP connections across the link
+   group as needed.  TCP connection movement does not have to be
+   stimulated by a link failure.
+
+   The SMC-R architecture also defines orderly versus disorderly
+   failover.  The type of failover is communicated in the LLC
+   DELETE LINK command and is simply a means to indicate that the link
+   has terminated (disorderly) or link termination is imminent
+   (orderly).  The orderly link deletion could be initiated via operator
+   command or programmatically to bring down an idle link.  For example,
+   an operator command could initiate orderly shutdown of an adapter for
+   service.  Implementation of the two types is based on implementation
+   requirements and is beyond the scope of the SMC-R architecture.
+
+3.  SMC-R Rendezvous Architecture
+
+   "Rendezvous" is the process that SMC-R-capable peers use to
+   dynamically discover each other's capabilities, negotiate SMC-R
+   connections, set up SMC-R links and link groups, and manage those
+   link groups.  A key aspect of SMC-R Rendezvous is that it occurs
+   dynamically and automatically, without requiring SMC-R link
+   configuration to be defined by an administrator.
+
+   SMC-R Rendezvous starts with the TCP/IP three-way handshake, during
+   which connection peers use TCP options to announce their SMC-R
+   capabilities.  If both endpoints are SMC-R capable, then Connection
+   Layer Control (CLC) messages are exchanged between the peers' SMC-R
+   layers over the newly established TCP connection to negotiate SMC-R
+   credentials.  The CLC message mechanism is analogous to the messages
+   exchanged by SSL for its handshake processing.
+
+   If a new SMC-R link is being set up, Link Layer Control (LLC)
+   messages are used to confirm RDMA connectivity.  LLC messages are
+   also used by the SMC-R layers at each peer to manage the links and
+   link groups.
+
+   Once an SMC-R link is set up or agreed to by the peers, the TCP
+   sockets are passed to the peer applications, which use them as
+   normal.  The SMC-R layer, which resides under the sockets layer,
+   transmits the socket data between peers over RDMA using the SMC-R
+   protocol, bypassing the TCP/IP stack.
+
+3.1.  TCP Options
+
+   During the TCP/IP three-way handshake, the client and server indicate
+   their support for SMC-R by including experimental TCP option 254 on
+   the three-way handshake flows, in accordance with [RFC6994] ("Shared
+   Use of Experimental TCP Options").  The Experiment Identifier (ExID)
+   value used is the string "SMCR" in EBCDIC (IBM-1047) encoding
+   (0xE2D4C3D9).  This ExID has been registered in the "TCP Experimental
+   Option Experiment Identifiers (TCP ExIDs)" registry maintained
+   by IANA.
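+
+   As an illustration only, the option bytes described above could be
+   built and recognized as follows.  The function names are
+   hypothetical.  Python's standard library has no IBM-1047 codec, so
+   this sketch uses cp037, which assigns the same code points to the
+   four uppercase letters involved; the Kind/Length/ExID layout follows
+   [RFC6994] with a 4-byte ExID.
+
```python
import struct

TCP_OPT_EXPERIMENTAL = 254   # experimental option kind used per RFC 6994

# Assumption: cp037 stands in for IBM-1047; both encode "SMCR" identically.
SMCR_EXID = "SMCR".encode("cp037")


def build_smcr_option():
    """Return Kind, Length, and the 4-byte ExID as raw option bytes."""
    length = 2 + len(SMCR_EXID)   # kind byte + length byte + ExID
    return struct.pack("!BB", TCP_OPT_EXPERIMENTAL, length) + SMCR_EXID


def has_smcr_option(options):
    """Scan a raw TCP options field for option 254 with the SMC-R ExID."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                 # End of Option List
            break
        if kind == 1:                 # No-Operation is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                     # truncated option header
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                     # malformed option; stop parsing
        body = options[i + 2:i + length]
        if kind == TCP_OPT_EXPERIMENTAL and body[:4] == SMCR_EXID:
            return True
        i += length
    return False
```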
+
+   After completion of the three-way TCP handshake, each peer queries
+   its peer's options.  If both peers set the TCP option on the
+   three-way handshake, inline SMC-R negotiation occurs using CLC
+   messages.  If neither peer, or only one peer, sets the TCP option,
+   SMC-R cannot be used for the TCP connection, and the TCP connection
+   completes the setup using the IP fabric.
+
+3.2.  Connection Layer Control (CLC) Messages
+
+   CLC messages are sent as data payload over the IP network using the
+   TCP connection between SMC-R layers at the peers.  They are analogous
+   to the messages used to exchange parameters for SSL.
+
+   The use of CLC messages is detailed in the following sections.  The
+   following list provides a summary of the defined CLC messages and
+   their purposes:
+
+   o  SMC Proposal: Sent from the client to propose that this TCP
+      connection is eligible to be moved to SMC-R.  The client
+      identifies itself and its subnet to the server and passes the
+      SMC-R elements for a suggested RoCE path via the MAC and GID.
+
+   o  SMC Accept: Sent from the server to accept the client's TCP
+      connection SMC Proposal.  The server responds to the client's
+      proposal by identifying itself to the client and passing the
+      elements of a RoCE path that the client can use to perform RDMA
+      writes to the server.  This consists of such SMC-R link elements
+      as RoCE MAC, GID, and RMB information.
+
+   o  SMC Confirm: Sent from the client to confirm the server's
+      acceptance of the SMC connection.  The client responds to the
+      server's acceptance by passing the elements of a RoCE path that
+      the server can use to perform RDMA writes to the client.  This
+      consists of such SMC-R link elements as RoCE MAC, GID, and RMB
+      information.
+
+   o  SMC Decline: Sent from either the server or the client to reject
+      the SMC connection, indicating the reason the peer must decline
+      the SMC Proposal and allowing the TCP connection to revert to
+      IP connectivity.
+
+3.3.  LLC Messages
+
+   Link Layer Control (LLC) messages are sent between peer SMC-R layers
+   over an SMC-R link to manage the link or the link group.  LLC
+   messages are sent using RoCE SendMsg and are 44 bytes long.  The
+   44-byte size is based on what can fit into a RoCE Work Queue Element
+   (WQE) without requiring the posting of receive buffers.
+
+   LLC messages generally follow a request-reply semantic.  Each message
+   has a request flavor and a reply flavor, and each request must be
+   confirmed with a reply, except where otherwise noted.  The use of LLC
+   messages is detailed in the following sections.  The following list
+   provides a summary of the defined LLC messages and their purposes:
+
+   o  ADD LINK: Used to add a new link to a link group.  Sent from the
+      server to the client to initiate addition of a new link to the
+      link group, or from the client to the server to request that the
+      server initiate addition of a new link.
+
+   o  ADD LINK CONTINUATION: A continuation of ADD LINK that allows the
+      ADD LINK to span multiple commands, because all of the link
+      information cannot be contained in a single ADD LINK message.
+
+   o  CONFIRM LINK: Used to confirm that RoCE connectivity over a newly
+      created SMC-R link is working correctly.  Initiated by the server.
+      Both this message and its reply must flow over the SMC-R link
+      being confirmed.
+
+   o  DELETE LINK: When initiated by the server, deletes a specific link
+      from the link group or deletes the entire link group.  When
+      initiated by the client, requests that the server delete a
+      specific link or the entire link group.
+
+   o  CONFIRM RKEY: Informs the peer on the SMC-R link of the addition
+      of an RMB to the link group.
+
+   o  CONFIRM RKEY CONTINUATION: A continuation of CONFIRM RKEY that
+      allows the CONFIRM RKEY to span multiple commands, in the event
+      that all of the information cannot be contained in a single
+      CONFIRM RKEY message.
+
+   o  DELETE RKEY: Informs the peer on the SMC-R link of the deletion of
+      one or more RMBs from the link group.
+
+   o  TEST LINK: Verifies that an already-active SMC-R link is active
+      and healthy.
+
+   o  Optional LLC message: Any LLC message in which the two high-order
+      bits of the opcode are b'10'.  This optional message must be
+      silently discarded by a receiving peer that does not support the
+      opcode.  No such messages are defined in this version of the
+      architecture; however, the concept is defined to allow for
+      toleration of possible advanced, optional functions.
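+
+   The optional-message rule in the last list item can be sketched as a
+   bit test.  This example assumes a 1-byte opcode; the function names,
+   the "supported" set, and the handling of unknown opcodes that are
+   not optional are illustrative assumptions, not part of the
+   architecture.
+
```python
OPTIONAL_MASK = 0b1100_0000   # two high-order bits of the 1-byte opcode
OPTIONAL_BITS = 0b1000_0000   # b'10' in those two bits


def is_optional_llc(opcode):
    """True if the two high-order opcode bits are b'10'."""
    return (opcode & OPTIONAL_MASK) == OPTIONAL_BITS


def dispatch_llc(opcode, supported):
    """Return 'handle', 'discard', or 'error' for a received LLC opcode.

    `supported` is the set of opcodes this hypothetical peer implements.
    """
    if opcode in supported:
        return "handle"
    if is_optional_llc(opcode):
        return "discard"   # unknown optional message: silently ignored
    return "error"         # assumed handling; not specified in this text
```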
+
+   CONFIRM LINK and TEST LINK are sensitive to which link they flow on
+   and must flow on the link being confirmed or tested.  All other LLC
+   messages may flow over any active link in the link group.  When
+   there are multiple links in a link group, a response to an LLC
+   message must flow over the same link that the original message
+   flowed over, with the following exceptions:
+
+   o  ADD LINK request from a server in response to an ADD LINK from a
+      client.
+
+   o  DELETE LINK request from a server in response to a DELETE LINK
+      from a client.
+
+3.4.  CDC Messages
+
+   Connection Data Control (CDC) messages are sent over the RoCE fabric
+   between peers using RoCE SendMsg and are 44 bytes long.  The 44-byte
+   size is based on the size that can fit into a RoCE WQE without
+   requiring the posting of receive buffers.  CDC messages are used to
+   describe the socket application data passed via RDMA write
+   operations, as well as TCP connection state information, including
+   producer cursors and consumer cursors, RMBE state information, and
+   failover data validation.
+
+3.5.  Rendezvous Flows
+
+   Rendezvous information for SMC-R is exchanged as TCP options on the
+   TCP three-way handshake flows to indicate capability, followed by
+   inline TCP negotiation messages to actually do the SMC-R setup.
+   Formats of all rendezvous options and messages discussed in this
+   section are detailed in Appendix A.
+
+3.5.1.  First Contact
+
+   First contact between RoCE peers occurs when a new SMC-R link group
+   is being set up.  This could be because no SMC-R links already exist
+   between the peers, or the server decides to create a new SMC-R link
+   group in parallel with an existing one.
+
+3.5.1.1.  Pre-negotiation of TCP Options
+
+   The client and server indicate their SMC-R capability to each other
+   using TCP option 254 on the TCP three-way handshake flows.
+
+   A client who wishes to do SMC-R will include TCP option 254 using an
+   ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on
+   its SYN flow.
+
+   A server that supports SMC-R will include TCP option 254 with the
+   ExID value of EBCDIC "SMCR" on its SYN-ACK flow.  Because the server
+   is listening for connections and does not know where client
+   connections will come from, the server implementation may choose to
+   unconditionally include this TCP option if it supports SMC-R.  This
+   may be required for server implementations where extensions to the
+   TCP stack are not practical.  For server implementations that can add
+   code to examine and react to packets during the three-way handshake,
+   the server should only include the SMC-R TCP option on the SYN-ACK if
+   the client included it on its SYN packet.
+
+   A client who supports SMC-R and meets the three conditions outlined
+   above may optionally include the TCP option for SMC-R on its ACK
+   flow, regardless of whether or not the server included it on its
+   SYN-ACK flow.  Some TCP/IP stacks may have to include it if the SMC-R
+   layer cannot modify the options on the socket until the three-way
+   handshake completes.  Proprietary servers should not include this
+   option on the ACK flow, since including it on the SYN flow was
+   sufficient to indicate the client's capabilities.
+
+   Once the initial three-way TCP handshake is completed, each peer
+   examines the socket options.  SMC-R implementations may do this by
+   examining what was actually provided on the SYN and SYN-ACK packets
+   or by performing a getsockopt() operation to determine the options
+   sent by the peer.  If neither peer, or only one peer, specified the
+   TCP option for SMC-R, then SMC-R cannot be used on this connection
+   and it proceeds using normal IP flows and processing.
+
+   If both peers specified the TCP option for SMC-R, then the TCP
+   connection is not started yet and the peers proceed to SMC-R
+   negotiation using inline data flows.  The socket is not yet turned
+   over to the applications; instead, the respective SMC layers exchange
+   CLC messages over the newly formed TCP connection.
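+
+   The decisions in this section can be summarized in a short sketch.
+   Both function names and their boolean parameters are hypothetical
+   and exist only to restate the rules above.
+
```python
def server_synack_includes_option(supports_smcr, can_inspect_syn,
                                  client_sent_on_syn):
    """Decide whether the server puts the SMC-R option on its SYN-ACK."""
    if not supports_smcr:
        return False
    if not can_inspect_syn:
        return True               # stack cannot react to the SYN
    return client_sent_on_syn     # preferred: mirror the client's SYN


def next_step(client_set_option, server_set_option):
    """Return what happens after the three-way handshake completes."""
    if client_set_option and server_set_option:
        return "CLC negotiation"  # socket not yet given to applications
    return "normal IP processing"
```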
+
+3.5.1.2.  Client Proposal
+
+   If SMC-R is supported by both peers, the client sends an SMC Proposal
+   CLC message to the server.  It is not immediately apparent on this
+   flow from client to server whether this is a new or existing SMC-R
+   link, because in clustered environments a single IP address may
+   represent multiple hosts.  This type of cluster virtual IP address
+   can be owned by a network-based or host-based Layer 4 load balancer
+   that distributes incoming TCP connections across a cluster of
+   servers/hosts.  For purposes of high availability, other clustered
+   environments may also support the movement of a virtual IP address
+   dynamically from one host in the cluster to another.  In summary, the
+   client cannot predetermine that a connection is targeting the same
+   host by simply matching the destination IP address for outgoing TCP
+   connections.  Therefore, it cannot predetermine the SMC-R link that
+   will be used for a new TCP connection.  This information will be
+   dynamically learned, and the appropriate actions will be taken as the
+   SMC-R negotiation handshake unfolds.
+
+   In the SMC-R proposal message, the initiator (client) proposes the
+   use of SMC-R by including its peer ID, GID, and MAC addresses, as
+   well as the IP subnet number of the outgoing interface (if IPv4) or
+   the IP prefix list for the network over which the proposal is sent
+   (if IPv6).  At this point in the flow, the client makes no local
+   commitments of resources for SMC-R.
+
+   When the server receives the SMC Proposal CLC message, it uses the
+   peer ID provided by the client, plus subnet or prefix information
+   provided by the client, to determine if it already has a usable SMC-R
+   link with this SMC-R peer.  If there are one or more existing SMC-R
+   links with this SMC-R peer, the server then decides which SMC-R link
+   it will use for this TCP connection.  See Sections 3.5.2 and 3.5.3
+   for the cases of reusing an existing SMC-R link or creating a
+   parallel SMC-R link group between SMC-R peers.
+
+   If this is a first contact between SMC-R peers, the server must
+   validate that it is on the same LAN as the client before continuing.
+   For IPv4, the server does this by verifying that it has an interface
+   with an IP subnet number that matches the subnet number sent by the
+   client in the SMC Proposal.  For IPv6, it does this by verifying that
+   it is directly attached to at least one IP prefix that was listed by
+   the client in its SMC Proposal message.
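+
+   A minimal sketch of this same-LAN check, using Python's standard
+   ipaddress module, is shown below.  The function names and the way
+   the server's interfaces and prefixes are passed in are assumptions
+   for illustration; the addresses in the usage note are documentation
+   ranges.
+
```python
import ipaddress


def same_lan_ipv4(server_interfaces, client_subnet):
    """True if any server interface lies in the client's IPv4 subnet.

    server_interfaces: strings such as "192.0.2.10/24"
    client_subnet: the subnet from the SMC Proposal, e.g. "192.0.2.0/24"
    """
    wanted = ipaddress.ip_network(client_subnet)
    return any(ipaddress.ip_interface(i).network == wanted
               for i in server_interfaces)


def same_lan_ipv6(server_prefixes, client_prefixes):
    """True if the server is attached to at least one listed prefix."""
    attached = {ipaddress.ip_network(p) for p in server_prefixes}
    return any(ipaddress.ip_network(p) in attached
               for p in client_prefixes)
```
+
+   For example, same_lan_ipv4(["192.0.2.10/24"], "192.0.2.0/24") is
+   true in this sketch, whereas a server whose only interface is
+   198.51.100.1/24 would fail the check for that subnet.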
+
+   If the server agrees to use SMC-R, the server begins the setup of a
+   new SMC-R link by allocating local QP and RMB resources (setting its
+   QP state to INIT) and providing its full SMC-R information in an SMC
+   Accept CLC message to the client over the TCP connection, along with
+   a flag set indicating that this is a first contact flow.  While the
+   SMC Accept message could flow over any IP route back to the client
+   depending upon Layer 3 IP routing, the SMC-R credentials provided
+   must be for the common subnet or prefix between the server and
+   client, as determined above.  If the server cannot or does not want
+   to do SMC-R with the client, it sends an SMC Decline CLC message to
+   the client, and the connection data may begin flowing using normal
+   TCP/IP flows.
+
+3.5.1.3.  Server Acceptance
+
+   When the client receives the SMC Accept from the server, it
+   determines whether this is a new or existing SMC-R link, using the
+   combination of the following: the first contact flag, its MAC/GID and
+   the MAC/GID returned by the server, the VLAN over which the
+   connection is setting up, and the QP number provided by the server.
+
+   If it is an existing SMC-R link and the client agrees to use that
+   link for the TCP connection, see Section 3.5.2 ("Subsequent Contact")
+   below.  If it is a new SMC-R link between peers that already have an
+   SMC-R link, then the server is starting a new SMC-R link group.
+
+   Assuming that either (1) this is a first contact between peers or
+   (2) the server is starting a new SMC-R link group, the client now
+   allocates local QP and RMB resources for the SMC-R link (setting the
+   QP state to RTR (ready to receive)), associates them with the server
+   QP as learned from the SMC Accept CLC message, and sends an SMC
+   Confirm CLC message to the server over the TCP connection with its
+   SMC-R link information included.  The client also starts a timer to
+   wait for the server to confirm the reliably connected queue pair, as
+   described below.
+
+3.5.1.4.  Client Confirmation
+
+   Upon receipt of the client's SMC Confirm CLC message, the server
+   associates its QP for this SMC-R link with the client's QP as learned
+   from the SMC Confirm CLC message and sets its QP state to RTS (ready
+   to send).  The client and the server now have reliably connected
+   queue pairs.
+
+3.5.1.5.  Link (QP) Confirmation
+
+   Since setting up the SMC-R link and its QPs did not require any
+   network flows on the RoCE fabric, the client and server must now
+   confirm connectivity over the RoCE fabric.  To accomplish this, the
+   server will send a CONFIRM LINK Link Layer Control (LLC) message to
+   the client over the newly created SMC-R link, using the RoCE fabric.
+   The CONFIRM LINK LLC message will provide the server's MAC, GID, and
+   QP information for the connection, allow each partner to communicate
+   the maximum number of links it can tolerate in this link group (the
+   "link limit"), and will additionally provide two link IDs:
+
+   o  a 1-byte server-assigned link number that is used by both peers to
+      identify the link within the link group and is only unique within
+      a link group.
+
+   o  a 4-byte link user ID.  This opaque value is assigned by the
+      server for the server's local use and is provided to the client
+      for management purposes -- for example, to use in network
+      management displays and products.
+
+   When the server sends this message, it will set a timer for receiving
+   confirmation from the client.
+
+   When the client receives the server's confirmation in the form of a
+   CONFIRM LINK LLC message, it will cancel the confirmation timer it
+   set when it sent the SMC Confirm message.  The client will also
+   advance its QP state to RTS and respond over the RoCE fabric with a
+   CONFIRM LINK response LLC message that (1) provides its MAC, GID,
+   QP number, and link limit, (2) confirms the 1-byte link number sent
+   by the server, and (3) provides its own 4-byte link user ID to the
+   server.
+
+       Host X -- Server                           Host Y -- Client
+    +-------------------+                      +-------------------+
+    | Peer ID = PS1     |                      |   Peer ID = PC1   |
+    |            +------+                      +------+            |
+    |       QP 8 |RNIC 1|                      |RNIC 2|  QP 64     |
+    |RToken X|   |MAC MA|                      |MAC MB|   |        |
+    |        |   |GID GA|                      |GID GB|   |RToken Y|
+    |       \/   +------+      (Subnet S1)     +------+  \/        |
+    |+--------+         |                      |        +--------+ |
+    || RMB    |         |                      |        | RMB    | |
+    |+--------+         |                      |        +--------+ |
+    |            +------+                      +------+            |
+    |            |RNIC 3|                      |RNIC 4|            |
+    |            |MAC MC|                      |MAC MD|            |
+    |            |GID GC|                      |GID GD|            |
+    |            +------+                      +------+            |
+    +-------------------+                      +-------------------+
+
+                     SYN TCP options(254,"SMCR")
+        <---------------------------------------------------------
+
+                     SYN-ACK TCP options(254,"SMCR")
+        --------------------------------------------------------->
+
+                     ACK [TCP options(254,"SMCR")]
+        <--------------------------------------------------------
+
+                    SMC Proposal(PC1,MB,GB,S1)
+        <--------------------------------------------------------
+
+    SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem index)
+        --------------------------------------------------------->
+
+         SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y,RMB element index)
+         <--------------------------------------------------------
+
+       CONFIRM LINK(MA,GA,QP8, link lim, server link user ID, linknum)
+        .........................................................>
+
+    CONFIRM LINK rsp(MB,GB,QP64, link lim, client link user ID, linknum)
+        <........................................................
+
+                           Legend:
+                    ------------   TCP/IP and CLC flows
+                    ............   RoCE (LLC) flows
+           Square brackets ("[ ]") indicate optional information
+
+                 Figure 8: First Contact Rendezvous Flows
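+
+   The QP state changes that accompany the flows in Figure 8 can be
+   traced with a small sketch.  The generator name and the event
+   strings are illustrative; the states (INIT, RTR, RTS) and the points
+   at which they are set are taken from Sections 3.5.1.2 through
+   3.5.1.5.
+
```python
def first_contact_qp_states():
    """Yield (event, server_qp_state, client_qp_state) tuples."""
    server, client = None, None      # no QP allocated on either side
    yield ("SMC Proposal sent", server, client)
    server = "INIT"                  # server allocates QP and RMB
    yield ("SMC Accept sent", server, client)
    client = "RTR"                   # client allocates QP, ready to receive
    yield ("SMC Confirm sent", server, client)
    server = "RTS"                   # server associates QPs, ready to send
    yield ("CONFIRM LINK sent", server, client)
    client = "RTS"                   # client confirms over the RoCE fabric
    yield ("CONFIRM LINK response sent", server, client)
```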
+
+   Technically, the data for the TCP connection could now flow over the
+   RoCE path.  However, if this is a first contact, there is no
+   alternate for this recently established RoCE path.  Because the
+   current architecture has no failover from RoCE to IP once
+   connection data starts flowing, a failure of this path would disrupt
+   the TCP connection, leaving less redundancy and failover capability
+   than IP provides.  Even if the network has alternate RoCE paths
+   available, they would not be usable at this point.  This situation
+   would be unacceptable.
+
+3.5.1.6.  Second SMC-R Link Setup
+
+   Because of the unacceptable situation described above, TCP data will
+   not be allowed to flow on the newly established SMC-R link until a
+   second path has been set up, or at least attempted.
+
+   If the server has a second RNIC available on the same LAN, it
+   attempts to set up the second SMC-R link over that second RNIC.  If
+   it only has one RNIC available on the LAN, it will attempt to set up
+   the second SMC-R link over that one RNIC.  In the latter case, the
+   server is attempting to set up an asymmetric link, in case the client
+   does have a second RNIC on the LAN.
+
+   In either case, the server allocates a new QP over the RNIC it is
+   attempting to use for the second link and assigns a link number to
+   the new link; the server also creates an RToken for the RMB over this
+   second QP (note that this means that the first and second QP each
+   have their own RToken to represent the same RMB).  The server
+   provides this information, as well as the MAC and GID of the RNIC
+   over which it is attempting to set up the second link, in an ADD LINK
+   LLC message that it sends to the client over the SMC-R link that is
+   already set up.
+
+3.5.1.6.1.  Client Processing of ADD LINK LLC Message from Server
+
+   When the client receives the server's ADD LINK LLC message, it
+   examines the GID and MAC provided by the server to determine whether
+   the server is attempting to use the same server-side RNIC as the
+   existing SMC-R link or a different one.
+
+   If the server is attempting to use the same server-side RNIC as the
+   existing SMC-R link, then the client verifies that it has a second
+   RNIC on the same LAN.  If it does not, the client rejects the
+   ADD LINK request from the server, because the resulting link would be
+   a parallel link, which is not supported within a link group.  If the
+   client does have a second RNIC on the same LAN, it accepts the
+   request, and an asymmetric link will be set up.
+
+   If the server is using a different server-side RNIC from the existing
+   SMC-R link, then the client will accept the request and a second
+   SMC-R link will be set up in this SMC-R link group.  If the client
+   has a second RNIC on the same LAN, that second RNIC will be used for
+   the second SMC-R link, creating symmetric links.  If the client does
+   not have a second RNIC on the same LAN, it will use the same RNIC as
+   was used for the initial SMC-R link, resulting in the setup of an
+   asymmetric link in the SMC-R link group.
+
+   In either case, when the client accepts the server's ADD LINK
+   request, it allocates a new QP on the chosen RNIC and creates an RKey
+   over that new QP for the client-side RMB for the SMC-R link group,
+   then sends an ADD LINK reply LLC message to the server providing that
+   information as well as echoing the link number that was sent by the
+   server.
+
+   If the client rejects the server's ADD LINK request, it sends an ADD
+   LINK reply LLC message to the server with the reason code for the
+   rejection.
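+
+   The client's decision described in this section reduces to two
+   inputs.  The following sketch restates it; the function name, its
+   parameters, and the returned labels are hypothetical.
+
```python
def client_add_link_decision(server_uses_same_rnic, client_has_second_rnic):
    """Return (accepted, link_shape) for a server ADD LINK request."""
    if server_uses_same_rnic:
        if not client_has_second_rnic:
            # Same RNIC on both sides would be a parallel link, which
            # is not supported within a link group.
            return (False, "parallel")
        return (True, "asymmetric")
    if client_has_second_rnic:
        return (True, "symmetric")
    return (True, "asymmetric")      # client reuses its first RNIC
```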
+
+3.5.1.6.2.  Server Processing of ADD LINK Reply LLC Message from Client
+
+   If the client sends a negative response to the server or no reply is
+   received, the server frees the RoCE resources it had allocated for
+   the new link.  Having a single link in an SMC-R link group is
+   undesirable.  The server's recovery is detailed in Appendix C.8
+   ("Failure to Add Second SMC-R Link to a Link Group").
+
+   If the client sends a positive reply to the server with
+   MAC/GID/QP/RKey information, the server associates its QP for the new
+   SMC-R link to the QP that the client provided.  Now, the new SMC-R
+   link is in the same situation that the first was in after the client
+   sent its ACK packet -- there is a reliably connected queue pair over
+   the new RoCE path, but there have been no RoCE flows to confirm that
+   it's actually usable.  So, at this point, the client and server will
+   exchange CONFIRM LINK LLC messages just like they did on the first
+   SMC-R link.
+
+   If either peer receives a failure during this second CONFIRM LINK LLC
+   exchange (either an immediate failure -- which implies that the
+   message did not reach the partner -- or a timeout), it sends a DELETE
+   LINK LLC message to the partner over the first (and now only) link in
+   the link group.  This DELETE LINK LLC message must be acknowledged
+   before data can flow on the single link in the link group.
+
+       Host X -- Server                           Host Y -- Client
+    +-------------------+                      +-------------------+
+    | Peer ID = PS1     |                      |   Peer ID = PC1   |
+    |            +------+                      +------+            |
+    |       QP 8 |RNIC 1|      SMC-R Link 1    |RNIC 2|  QP 64     |
+    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
+    |        |   |GID GA|                      |GID GB|   |RToken Y|
+    |       \/   +------+                      +------+  \/        |
+    |+--------+         |                      |        +--------+ |
+    ||        |         |                      |        |        | |
+    || RMB    |         |                      |        | RMB    | |
+    ||        |         |                      |        |        | |
+    |+--------+         |                      |        +--------+ |
+    |       /\   +------+                      +------+  /\        |
+    |        |   |RNIC 3|      SMC-R Link 2    |RNIC 4|  |         |
+    |RToken Z|   |MAC MC|<-------------------->|MAC MD|  |RToken W |
+    |       QP 9 |GID GC|      (being added)   |GID GD| QP 65      |
+    |            +------+                      +------+            |
+    +-------------------+                      +-------------------+
+
+                First SMC-R link setup as shown in Figure 8
+            <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->
+
+            ADD LINK request(QP9,MC,GC, link number = 2)
+            ............................................>
+
+            ADD LINK response(QP65,MD,GD, link number = 2)
+            <............................................
+
+            ADD LINK CONTINUATION request(RToken=Z)
+            ............................................>
+
+           ADD LINK CONTINUATION response(RToken=W)
+            <............................................
+
+         CONFIRM LINK(MC,GC,QP9, link number = 2, link user ID)
+            .............................................>
+
+      CONFIRM LINK response(MD,GD,QP65, link number = 2, link user ID)
+            <.............................................
+
+                          Legend:
+                   ------------   TCP/IP and CLC flows
+                   ............   RoCE (LLC) flows
+
+                Figure 9: First Contact, Second Link Setup
+
+3.5.1.6.3.  Exchange of RKeys on Second SMC-R Link
+
+   Note that in the scenario described here -- first contact -- there is
+   only one RMB RKey to exchange on the second SMC-R link, and it is
+   exchanged in the ADD LINK CONTINUATION request and reply.  In
+   scenarios other than first contact -- for example, adding a new SMC-R
+   link to a longstanding link group with multiple RMBs -- additional
+   flows will be required to exchange additional RMB RKeys.  See
+   Section 3.5.5.2.3 ("Adding a New SMC-R Link to a Link Group with
+   Multiple RMBs") for more details on these flows.
+
+3.5.1.6.4.  Aborting SMC-R and Falling Back to IP
+
+   If either partner does not provide the SMC-R TCP option during the
+   three-way TCP handshake, the connection falls back to normal TCP/IP.
+   During the SMC-R negotiation that occurs after the three-way TCP
+   handshake, either partner may break off SMC-R by sending an SMC
+   Decline CLC message.  The SMC Decline CLC message may be sent in
+   place of any expected message and may also be sent during the CONFIRM
+   LINK LLC exchange if there is a failure before any application data
+   has flowed over the RoCE fabric.  For more details on exactly when an
+   SMC Decline can flow during link group setup, see Appendices C.1
+   ("SMC Decline during CLC Negotiation") and C.2 ("SMC Decline during
+   LLC Negotiation").
+
+   If this fallback to IP happens while setting up a new SMC-R link
+   group, the RoCE resources allocated for this SMC-R link group
+   relationship are torn down, and it will be retried as a new SMC-R
+   link group next time a connection starts between these peers with
+   SMC-R proposed.  Note that if this happens because one side doesn't
+   support SMC-R, there will be very little to tear down, as the TCP
+   option will have failed to flow on either the initial SYN or the
+   SYN-ACK before either side had reserved any local RoCE resources.
+
+3.5.2.  Subsequent Contact
+
+   "Subsequent contact" means setting up a new TCP connection between
+   two peers that already have an SMC-R link group between them and
+   reusing the existing SMC-R link group.  In this case, it is not
+   necessary to allocate new QPs.  However, it is possible that a new
+   RMB has been allocated for this TCP connection, if the previous TCP
+   connection used the last element available in the previously used
+   RMB, or for any other implementation-dependent reason.  For this
+   reason, and for convenience and error checking, the same TCP
+   option 254, followed by the inline negotiation method described for
+   initial contact, will be used for subsequent contact, but the
+   processing differs in some ways.  That processing is described below.
+
+3.5.2.1.  SMC-R Proposal
+
+   When the client begins the inline negotiation with the server, it
+   does not know if this is a first contact or a subsequent contact.
+   The client cannot know this information until it sees the server's
+   peer ID, to determine whether or not it already has an SMC-R link
+   with this peer that it can use.  There are several reasons why it is
+   not sufficient to use the partner IP address, subnet, VLAN, or other
+   IP information to make this determination.  The most obvious reason
+   is distributed systems: if the server IP address is actually a
+   virtual IP address representing a distributed cluster, the actual
+   host serving this TCP connection may not be the same as the host that
+   served the last TCP connection to this same IP address.
+
+   After the TCP three-way handshake, assuming that both partners
+   indicate SMC-R capability, the client builds and sends the
+   SMC Proposal CLC message to the server in exactly the same manner as
+   it does in the "first contact" case, and in fact at this point
+   doesn't know if it's a first contact or a subsequent contact.  As in
+   the "first contact" case, the client sends its peer ID value,
+   suggested RNIC MAC/GID, and IP subnet or prefix information.
+
+   Upon receiving the client's proposal, the server looks up the
+   provided peer ID to determine if it already has a usable SMC-R
+   link group with this peer.  If it does already have a usable SMC-R
+   link group, the server then needs to decide whether it will use the
+   existing SMC-R link group or create a new link group.  For the case
+   of the new link group, see Section 3.5.3 ("First Contact Variation:
+   Creating a Parallel Link Group") below.
+
+   For this discussion, assume that the server decides to use the
+   existing SMC-R link group for the TCP connection, which is expected
+   to be the most common case.  The server is responsible for making
+   this decision.  The server then needs to communicate that information
+   to the client, but it is not necessary to allocate, associate, and
+   confirm QPs for the chosen SMC-R link.  All that remains to be done
+   is to set up RMB space for this TCP connection.
+
+   If one of the RMBs already in use for this SMC-R link group has an
+   available element that uses the appropriate buffer size, the server
+   merely chooses one for this TCP connection and then sends an SMC
+   Accept CLC message providing the full RoCE information for the chosen
+   SMC-R link to the client, using the same format as the SMC Accept CLC
+   message described in Section 3.5.1 ("First Contact") above.
+
+   The server may choose to use the SMC-R link that matches the
+   suggested MAC/GID provided by the client in the SMC Proposal for its
+   RDMA writes but is not obligated to do so.  The final decision on
+   which specific SMC-R link to assign a TCP connection to is an
+   independent server and client decision.
+
+   It may be necessary for the server to allocate a new RMB for this
+   connection.  The reasons for this are implementation dependent and
+   could include the following:
+
+   o  no available space in existing RMB or RMBs, or
+
+   o  desire to allocate a new RMB that uses a different buffer size
+      from the ones already created, or
+
+   o  any other implementation-dependent reason
+
+   In this case, the server will allocate the new RMB and then perform
+   the flows described in Section 3.5.5.2.1 ("Adding a New RMB to an
+   SMC-R Link Group").  Once that processing is complete, the server
+   then provides the full RoCE information, including the new RKey, for
+   this connection in an SMC Accept CLC message to the client.
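+
+   The server-side choice described above can be sketched in Python as
+   follows.  This is a minimal illustration, not part of the protocol:
+   the Rmb and LinkGroup classes, the handle_proposal() function, the
+   generated RToken strings, and the fixed number of elements per RMB
+   are all assumptions made for the example.  The sketch also assumes
+   that the server always reuses an existing link group, and the flows
+   required after allocating a new RMB are only noted in a comment.
+
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rmb:
    """One RMB divided into elements (hypothetical layout)."""
    rtoken: str
    buf_size: int
    in_use: List[bool]

    def take_free_element(self) -> Optional[int]:
        """Mark the first free element as used and return its index."""
        for index, used in enumerate(self.in_use):
            if not used:
                self.in_use[index] = True
                return index
        return None


@dataclass
class LinkGroup:
    peer_id: str
    rmbs: List[Rmb] = field(default_factory=list)


def handle_proposal(groups: Dict[str, LinkGroup], peer_id: str,
                    buf_size: int, elements_per_rmb: int = 4) -> dict:
    """Decide how to answer an SMC Proposal from peer_id."""
    group = groups.get(peer_id)
    if group is None:
        # No usable link group with this peer: first contact processing.
        return {"first_contact": True}

    # Prefer a free element in an existing RMB of the right buffer size.
    for rmb in group.rmbs:
        if rmb.buf_size == buf_size:
            index = rmb.take_free_element()
            if index is not None:
                return {"first_contact": False, "rtoken": rmb.rtoken,
                        "index": index, "new_rmb": False}

    # Otherwise allocate a new RMB.  A real server would complete the
    # flows for adding an RMB to the link group before sending Accept.
    rmb = Rmb(rtoken=f"rtoken-{len(group.rmbs)}", buf_size=buf_size,
              in_use=[False] * elements_per_rmb)
    group.rmbs.append(rmb)
    return {"first_contact": False, "rtoken": rmb.rtoken,
            "index": rmb.take_free_element(), "new_rmb": True}
```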
+
+3.5.2.2.  SMC-R Acceptance
+
+   Upon receiving the SMC Accept CLC message from the server, the client
+   examines the RoCE information provided by the server to determine
+   whether this is a first contact for a new SMC-R link group or a
+   subsequent contact for an existing SMC-R link group.  It is a
+   subsequent contact if the server-side peer ID, GID, MAC, and QP
+   number provided in the packet match a known SMC-R link, and the first
+   contact flag is not set.  If this is not the case -- for example, the
+   GID and MAC match but the QP is new -- then the server is creating a
+   new, parallel SMC-R link group, and this is treated as a first
+   contact.
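+
+   The test the client applies to the SMC Accept can be sketched as
+   follows.  The LinkId tuple and is_subsequent_contact() are
+   hypothetical names used only for illustration; the sketch assumes
+   the client keeps a set of the server-side identities of its known
+   SMC-R links.  The RMB RToken is not part of the comparison.
+
```python
from typing import NamedTuple, Set


class LinkId(NamedTuple):
    """Server-side identity of an SMC-R link as seen by the client."""
    peer_id: str
    gid: str
    mac: str
    qp: int


def is_subsequent_contact(known_links: Set[LinkId], accept_link: LinkId,
                          first_contact_flag: bool) -> bool:
    """True if the SMC Accept refers to an existing SMC-R link.

    Any other outcome (flag set, or an unknown peer ID/GID/MAC/QP
    combination) is treated as a first contact.
    """
    return (not first_contact_flag) and accept_link in known_links
```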
+
+   A different RMB RToken does not indicate a first contact, as the
+   server may have allocated a new RMB or may be using several RMBs for
+   this SMC-R link.  The client needs the server's RMB information only
+   for its RDMA writes to the server, and since there is no requirement
+   for symmetric RMBs, this information is simply control information
+   for the RDMA writes on this SMC-R link.
+
+   The client must validate that the RMB element being provided by the
+   server is not in use by another TCP connection on this SMC-R link
+   group.  This validation must validate the new <rtoken, index> across
+   all known <rtoken, index> on this link group.  See Section 4.4.2
+   ("RMB Element Reuse and Conflict Resolution") for the case in which
+   the server tries to use an RMB element that is already in use on this
+   link group.
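+
+   The uniqueness check on <rtoken, index> pairs amounts to a set
+   membership test.  A minimal sketch, assuming the client tracks the
+   pairs already in use on the link group in a Python set; the name
+   claim_rmb_element() is hypothetical, and the handling of a detected
+   conflict is left to the caller.
+
```python
from typing import Set, Tuple


def claim_rmb_element(in_use: Set[Tuple[str, int]],
                      rtoken: str, index: int) -> bool:
    """Record the element and return True if it is free, else False."""
    element = (rtoken, index)
    if element in in_use:
        # Already used by another TCP connection on this link group.
        return False
    in_use.add(element)
    return True
```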
+
+   Once the client has determined that this TCP connection is a
+   subsequent contact over an existing SMC-R link, it performs an RMB
+   allocation process similar to what the server did: it either
+   (1) allocates an element from an RMB already associated with this
+   SMC-R link or (2) allocates a new RMB, associates it with this SMC-R
+   link, and then chooses an element out of it.
+
+   If the client allocates a new RMB for this TCP connection, it
+   performs the processing described in Section 3.5.5.2.1 ("Adding a New
+   RMB to an SMC-R Link Group").  Once that processing is complete, the
+   client provides its full RoCE information for this TCP connection in
+   an SMC Confirm CLC message.
+
+   Because an SMC-R link with a verified connected QP already exists and
+   is being reused, there is no need for verification or alternate QP
+   selection flows or timers.
+
+3.5.2.3.  SMC-R Confirmation
+
+   When the server receives the client's SMC Confirm CLC message on a
+   subsequent contact, it verifies the following:
+
+   o  The RMB element provided by the client is not already in use by
+      another TCP connection on this SMC-R link group (see Section 4.4.2
+      ("RMB Element Reuse and Conflict Resolution") for the case in
+      which it is).
+
+   o  The MAC/GID/QP information provided by the client matches an
+      active link within the link group.  The client is free to select
+      any valid/active link.  The client is not required to select the
+      same link as the server.
+
+   If this validation passes, the server stores the client's RMB
+   information for this connection, and the RoCE setup of the TCP
+   connection is complete.
+
+3.5.2.4.  TCP Data Flow Race with SMC Confirm CLC Message
+
+   On a subsequent contact TCP/IP connection, a peer may send data as
+   soon as it has received the peer RMB information for the connection.
+   There are no additional RoCE confirmation flows, since the QPs on the
+   SMC-R link are already reliably connected and verified.
+
+   In the majority of cases, the first data will flow from the client to
+   the server.  The client must send the SMC Confirm CLC message before
+   sending any connection data over the chosen SMC-R link; however, the
+   client need not wait for confirmation of this message, and in fact
+   there will be no such confirmation.  Since the server is required to
+   have the RMB fully set up and ready to receive data from the client
+   before sending an SMC Accept CLC message, the client can begin
+   sending data over the SMC-R link immediately upon completing the send
+   of the SMC Confirm CLC message.
+
+   It is possible that data from the client will arrive at the
+   server-side RMB before the SMC Confirm CLC message from the client
+   has been processed.  In this case, the server must handle this race
+   condition and not provide the arrived TCP data to the socket
+   application until the SMC Confirm CLC message has been received and
+   fully processed, opening the socket.
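+
+   One way to handle this race is to hold early data until the
+   SMC Confirm CLC message has been processed.  The sketch below is
+   an assumption about how an implementation might do this; the
+   ServerConnection class and its method names are hypothetical, and
+   a Python list stands in for delivery to the socket application.
+
```python
class ServerConnection:
    """Server-side view of one TCP connection on a subsequent contact."""

    def __init__(self) -> None:
        self.confirm_processed = False
        self._pending = []    # data that arrived before SMC Confirm
        self.delivered = []   # stands in for the socket application

    def on_rmb_data(self, data: bytes) -> None:
        """Called when data from the client lands in the RMB element."""
        if self.confirm_processed:
            self.delivered.append(data)
        else:
            self._pending.append(data)

    def on_smc_confirm(self) -> None:
        """Called once the SMC Confirm CLC message is fully processed."""
        self.confirm_processed = True
        # Release held data in arrival order, then deliver directly.
        self.delivered.extend(self._pending)
        self._pending.clear()
```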
+
+   If the server has initial data to send to the client that is not a
+   response to the client (this case should be rare), it can send the
+   data immediately upon receiving and processing the SMC Confirm CLC
+   message from the client.  The client must have opened the TCP socket
+   to the client application upon sending the SMC Confirm CLC message so
+   the client will be ready to process data from the server.
+
+3.5.3.  First Contact Variation: Creating a Parallel Link Group
+
+   Recall that parallel SMC-R links within an SMC-R link group are not
+   supported.  These are multiple SMC-R links within a link group that
+   use the same network path.  However, multiple SMC-R link groups
+   between the same peers are supported.  This means that if multiple
+   SMC-R links over the same RoCE path are desired, it is necessary to
+   use multiple SMC-R link groups.  While not a recommended practice,
+   this could be done for platform-specific reasons, like QP separation
+   of different workloads.  Only the server can drive the creation of
+   multiple SMC-R link groups between peers.
+
+   At a high level, when the server decides to create an additional
+   SMC-R link group with a client with which it already has an SMC-R
+   link group, the flows are basically the same as the normal
+   "first contact" case described above.  The following text provides
+   more detail and clarification of processing in this case.
+
+   When the server receives the SMC Proposal CLC message from the client
+   and, using the MAC/GID information, determines that it already has an
+   SMC-R link group with this client, the server can either reuse the
+   existing SMC-R link group (detailed in Section 3.5.2 ("Subsequent
+   Contact") above) or create a new SMC-R link group in addition to the
+   existing one.
+
+   If the server decides to create a new SMC-R link group, it does the
+   same processing it would have done for first contact: allocate QP and
+   RMB resources as well as alternate QP resources, and communicate the
+   QP and RMB information to the client in the SMC Accept CLC message
+   with the first contact flag set.
+
+   When the client receives the server's SMC Accept CLC message with the
+   new QP information and the first contact flag set, it knows that the
+   server is creating a new SMC-R link group even though it already has
+   an SMC-R link group with the server.  In this case, the client will
+   also allocate a new QP for this new SMC-R link, allocate an RMB for
+   it, and generate an RKey for it.
+
+   Note that multiple SMC-R link groups between the same peers must
+   access different RMB resources, so new RMBs will be required.  Using
+   the same RMBs that are in use in another SMC-R link group is not
+   permitted.
+
+   The client then associates its new QP with the server's new QP and
+   sends its SMC Confirm CLC message back to the server providing the
+   new QP/RMB information, and then sets its confirmation timer for the
+   new SMC-R link.
+
+   When the server receives the client's SMC Confirm CLC message, it
+   associates its QP with the client's QP as learned from the SMC
+   Confirm CLC message and sends a confirmation LLC message.  The rest
+   of the flow, with the confirmation QP and setup of additional SMC-R
+   links, unfolds just like the "first contact" case.
+
+3.5.4.  Normal SMC-R Link Termination
+
+   The normal socket API trigger points are used by the SMC-R layer to
+   initiate SMC-R connection termination flows.  The main design point
+   for SMC-R normal connection flows is to use the SMC-R protocol to
+   first shut down the SMC-R connection and free up any SMC-R RDMA
+   resources, and then allow the normal TCP connection termination
+   protocol (i.e., FIN processing) to drive cleanup of the TCP
+   connection that exists on the IP fabric.  This design point is very
+   important in ensuring that RDMA resources such as the RMBEs are only
+   freed and reused when both SMC-R endpoints are completely done with
+   their RDMA write operations to the partner's RMBE.
+
+   When the last TCP connection over an SMC-R link group terminates, the
+   link group can be terminated.  Similar to creation of SMC-R links and
+   link groups, the primary responsibility for determining that normal
+   termination is needed and initiating it lies with the server.
+
+   Implementations may opt to set timers to keep SMC-R link groups up
+   for a specified time after the last TCP connection ends, to avoid
+   churn in cases where TCP connections come and go regularly.
+
+   The link or link group may also be terminated as a result of a
+   command initiated by the operator.  This command can be entered at
+   either the client or the server.  If entered at the client, the
+   client requests that the server perform link or link group
+   termination, and the responsibility for doing so ultimately lies with
+   the server.
+
+   When the server determines that the SMC-R link group is to be
+   terminated, it sends a DELETE LINK LLC message to the client, with a
+   flag set indicating that all links in the link group are to be
+   terminated.  After receiving confirmation from the adapter that the
+   DELETE LINK LLC message has been sent, the server can clean up its
+   end of the link group (QPs, RMBs, etc.).  Upon receipt of the DELETE
+   LINK message from the server, the client must immediately comply and
+   clean up its end of the link group.  Any TCP connections that the
+   client believes to be active on the link group must be immediately
+   terminated.
+
+   The client can request that the server delete the link group as well.
+   The client does this by sending a DELETE LINK message to the server,
+   indicating that cleanup of all links is requested.  The server must
+   comply by sending a DELETE LINK to the client and processing as
+   described in the previous paragraph.  If there are TCP connections
+   active on the link group when the server receives this request, they
+   are immediately terminated by sending a RST flow over the IP fabric.
+
+3.5.5.  Link Group Management Flows
+
+3.5.5.1.  Adding and Deleting Links in an SMC-R Link Group
+
+   The server has the lead role in managing the composition of the link
+   group.  Links are added to the link group by the server.  The client
+   may notify the server of new conditions that may result in the server
+   adding a new link, but the server is ultimately responsible.  In
+   general, links are deleted from the link group by the server;
+   however, in certain error cases the client may inform the server that
+   a link must be deleted and treat it as deleted without waiting for
+   action from the server.  These flows are detailed in the sections
+   that follow.
+
+3.5.5.1.1.  Server-Initiated ADD LINK Processing
+
+   As described in previous sections, the server initiates an ADD LINK
+   exchange to create redundancy in a newly created link group.  Once a
+   link group is established, the server may also initiate ADD LINK for
+   other reasons, including:
+
+   o  Availability of additional resources on the server host to support
+      an additional SMC-R link.  This may include the provisioning of an
+      additional RNIC, more storage becoming available to support
+      additional QP resources, operator command, or any other
+      implementation-dependent reason.  Note that in order to be
+      available for an existing link group a new RNIC must be attached
+      to the same RoCE LAN that the link group is using.
+
+   o  Receipt of notification from the client that additional resources
+      on the client are available to support an additional SMC-R link.
+      See Section 3.5.5.1.2 ("Client-Initiated ADD LINK Processing").
+
+   Server-initiated ADD LINK processing in an established SMC-R link
+   group is the same as the ADD LINK processing described in
+   Section 3.5.1.6 ("Second SMC-R Link Setup"), with the following
+   changes:
+
+   o  If an asymmetric SMC-R link already exists in the link group, a
+      second asymmetric link will not be created.  Only one asymmetric
+      link is permitted in a link group.
+
+   o  TCP data flow on already-existing link(s) in the link group is not
+      halted or otherwise affected during the process of setting up the
+      additional link.
+
+   The server will not initiate ADD LINK processing if the link group
+   already has the maximum number of links negotiated by the partners.
+
+3.5.5.1.2.  Client-Initiated ADD LINK Processing
+
+   If an additional RNIC becomes available for an existing SMC-R link
+   group on the client's side, the client notifies the server by sending
+   an ADD LINK request LLC message to the server.  Unlike an ADD LINK
+   request sent by the server to the client, this ADD LINK request
+   merely informs the server that the client has a new RNIC.  If the
+   link group lacks redundancy or has redundancy only on an asymmetric
+   link with a single RNIC on the client side, the server must initiate
+   an ADD LINK exchange in response to this message, to create or
+   improve the link group's redundancy.
+
+   If the link group already has symmetric-link redundancy but has fewer
+   than the negotiated maximum number of links, the server may respond
+   by initiating an ADD LINK exchange to create a new link using the
+   client's new resource but is not required to do so.
+
+   If the link group already has the negotiated maximum number of links,
+   the server must ignore the client's ADD LINK request LLC message.
+
+   Because the server is not required to respond to the client's
+   ADD LINK LLC message in all cases, the client must not wait for a
+   response or throw an error if one does not come.
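+
+   The server's three possible reactions to a client ADD LINK request
+   can be summarized as a small decision function.  This is a sketch
+   under stated assumptions: the Redundancy and Action enumerations
+   and on_client_add_link() are hypothetical names, and the sketch
+   checks the negotiated maximum first, an ordering that the text
+   above does not spell out.
+
```python
from enum import Enum


class Redundancy(Enum):
    NONE = "none"
    ASYMMETRIC_CLIENT_SINGLE_RNIC = "asymmetric"
    SYMMETRIC = "symmetric"


class Action(Enum):
    MUST_ADD_LINK = "must"
    MAY_ADD_LINK = "may"
    IGNORE = "ignore"


def on_client_add_link(link_count: int, max_links: int,
                       redundancy: Redundancy) -> Action:
    """Classify the server's obligation for a client ADD LINK request."""
    if link_count >= max_links:
        # Negotiated maximum reached: the request is ignored.
        return Action.IGNORE
    if redundancy is Redundancy.SYMMETRIC:
        # Already redundant: adding another link is optional.
        return Action.MAY_ADD_LINK
    # No redundancy, or only asymmetric redundancy with a single RNIC
    # on the client side: the server must start an ADD LINK exchange.
    return Action.MUST_ADD_LINK
```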
+
+3.5.5.1.3.  Server-Initiated DELETE LINK Processing
+
+   Reasons that a server may delete a link include the following:
+
+   o  The link has not been used for TCP connections for an
+      implementation-defined time interval, and deleting the link will
+      not cause the link group to lack redundancy.
+
+   o  Errors in resources supporting the link occur.  These errors may
+      include, but are not limited to, RNIC errors, QP errors, and
+      software errors.
+
+   o  The RNIC supporting this SMC-R link is being taken down, either
+      because of an error case or because of an operator or software
+      command.
+
+   If a link being deleted is supporting TCP connections and there are
+   one or more surviving links in the link group, the TCP connections
+   are moved to the surviving links.  For more information on this
+   processing, see Section 2.3 ("SMC-R Resilience and Load Balancing").
+
+   The server deletes a link from the link group by sending a
+   DELETE LINK request LLC message to the client over any of the usable
+   links in the link group.  Because the DELETE LINK LLC message
+   specifies which link is to be deleted, it may flow over any link in
+   the link group.  The server must not clean up its RoCE resources for
+   the link until the client responds.
+
+   The client responds to the server's DELETE LINK request LLC message
+   by sending the server a DELETE LINK response LLC message.  The client
+   must respond positively; it cannot decline to delete the link.  Once
+   the server has received the client's DELETE LINK response, both sides
+   may clean up their resources for the link.
+
+   Either a positive write completion or some other indication from the
+   RNIC on the client's side is sufficient to indicate to the client
+   that the server has received the DELETE LINK response.
+
+         Host X                                     Host Y
+    +-------------------+                      +-------------------+
+    |            +------+                      +------+            |
+    |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
+    |RToken X|   |Failed|<--X----X----X----X-->|      |            |
+    |        |   |      |                      |      |            |
+    |       \/   +------+                      +------+            |
+    |+--------+         |                      |                   |
+    || Deleted|         |                      |                   |
+    || RMB    |         |                      |                   |
+    ||        |         |                      |                   |
+    |+--------+         |                      |                   |
+    |       /\   +------+                      +------+            |
+    |RToken Z|   |      |     SMC-R Link 2     |      |            |
+    |        |   |RNIC 3|<-------------------->|RNIC 4|            |
+    |       QP 64|      |                      |      | QP 65      |
+    |            +------+                      +------+            |
+    +-------------------+                      +-------------------+
+
+          DELETE LINK(request, link number = 1,
+                ................................................>
+                       reason code = RNIC failure)
+
+          DELETE LINK(response, link number = 1)
+               <................................................
+
+           (Note: Architecturally, this exchange can flow over either
+                  SMC-R link but most likely flows over Link 2, since
+                  the RNIC for Link 1 has failed.)
+
+               Figure 10: Server-Initiated DELETE LINK Flow
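+
+   The ordering rule in this section (the server keeps its resources
+   for the link until the client's response arrives) can be sketched
+   as follows.  The Server class is hypothetical, connections are
+   moved to the first surviving link without any load balancing, and
+   a Python list records the LLC messages that would be sent.
+
```python
from typing import Dict, List


class Server:
    """Tracks links, their TCP connections, and pending deletions."""

    def __init__(self, links: Dict[int, List[str]]) -> None:
        self.links = {num: list(conns) for num, conns in links.items()}
        self.pending_delete = set()
        self.sent = []  # LLC messages that would flow to the client

    def delete_link(self, link_number: int) -> None:
        """Move connections off the link and request its deletion."""
        survivors = [n for n in self.links
                     if n != link_number and n not in self.pending_delete]
        if survivors:
            self.links[survivors[0]].extend(self.links[link_number])
            self.links[link_number] = []
        self.pending_delete.add(link_number)
        self.sent.append(("DELETE LINK request", link_number))

    def on_delete_link_response(self, link_number: int) -> None:
        """Clean up the link only after the client has responded."""
        if link_number in self.pending_delete:
            self.pending_delete.discard(link_number)
            del self.links[link_number]
```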
+
+3.5.5.1.4.  Client-Initiated DELETE LINK Request
+
+   The client may request that the server delete a link for the same
+   reasons that the server may delete a link, except for inactivity
+   timeout.
+
+   Because the client depends on the server to delete links, there are
+   two types of delete requests from client to server:
+
+   o  Orderly: The client is requesting that the server delete the link
+      when able.  This would result from an operator command to bring
+      down the RNIC or some other nonfatal reason.  In this case, the
+      server is required to delete the link but need not do so right
+      away.
+
+   o  Disorderly: The server must delete the link right away, because
+      the client has experienced a fatal error with the link.
+
+   In either case, the server responds by initiating a DELETE LINK
+   exchange with the client, as described in the previous section.  The
+   difference between the two is whether the server must do so
+   immediately or can delay for an opportunity to gracefully delete the
+   link.
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |            +------+                      +------+            |
+     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
+     |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
+     |        |   |      |                      |      |            |
+     |       \/   +------+                      +------+            |
+     |+--------+         |                      |                   |
+     || Deleted|         |                      |                   |
+     || RMB    |         |                      |                   |
+     ||        |         |                      |                   |
+     |+--------+         |                      |                   |
+     |       /\   +------+                      +------+            |
+     |RToken Z|   |      |     SMC-R Link 2     |      |            |
+     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
+     |       QP 64|      |                      |      | QP 65      |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+           DELETE LINK(request, link number = 1, disorderly,
+                <...............................................
+                       reason code = RNIC failure)
+
+           DELETE LINK(request, link number = 1,
+                 ................................................>
+                        reason code = RNIC failure)
+
+           DELETE LINK(response, link number = 1)
+                <................................................
+
+           (Note: Architecturally, this exchange can flow over either
+                  SMC-R link but most likely flows over Link 2, since
+                  the RNIC for Link 1 has failed.)
+
+               Figure 11: Client-Initiated DELETE LINK Flow
+
+3.5.5.2.  Managing Multiple RKeys over Multiple SMC-R Links in a
+          Link Group
+
+   After the initial contact sequence completes and the number of TCP
+   connections increases, it is possible that the SMC peers could add
+   more RMBs to the link group.  Recall that each peer independently
+   manages its RMBs.  Also recall that an RMB's RToken is specific to a
+   QP, which means that when there are multiple SMC-R links in a link
+   group, each RMB accessed with the link group requires a separate
+   RToken for each SMC-R link in the group.
+
+   Each RMB that is added to a link must be added to all links within
+   the link group.  The set of RTokens created for an RMB, one per link,
+   is called the "RToken set".  The RTokens must be exchanged with the
+   peer.  As RMBs are added and deleted, the RToken set must remain in
+   sync.
+
+3.5.5.2.1.  Adding a New RMB to an SMC-R Link Group
+
+   A new RMB can be added to an SMC-R link group on either the client
+   side or the server side.  When an additional RMB is added to an
+   existing SMC-R link group, that RMB must be associated with the QPs
+   for each link in the link group.  Therefore, when an RMB is added to
+   an SMC-R link group, its RMB RToken for each SMC-R link's QP must be
+   communicated to the peer.
+
+   The tokens for a new RMB added to an existing SMC-R link group are
+   communicated using CONFIRM RKEY LLC messages, as shown in Figure 12.
+   The RToken set is specified as pairs: an SMC-R link number, paired
+   with the new RMB's RToken over that SMC-R link.  To preserve failover
+   capability, any TCP connection that uses a newly added RMB cannot go
+   active until all RTokens for the RMB have been communicated for all
+   of the links in the link group.
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |            +------+                      +------+            |
+     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
+     |RToken X|   |      |<-------------------->|      |            |
+     |        |   |      |                      |      |            |
+     |       \/   +------+                      +------+            |
+     |+--------+         |                      |                   |
+     || New    |         |                      |                   |
+     || RMB    |         |                      |                   |
+     ||        |         |                      |                   |
+     |+--------+         |                      |                   |
+     |       /\   +------+                      +------+            |
+     |RToken Z|   |      |     SMC-R Link 2     |      |            |
+     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
+     |       QP 64|      |                      |      | QP 65      |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+           CONFIRM RKEY(request, Add,
+                 ................................................>
+                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))
+
+           CONFIRM RKEY(response, Add,
+                <................................................
+                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))
+
+            (Note: This exchange can flow over either SMC-R link.)
+
+                 Figure 12: Add RMB to Existing Link Group
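+
+   The RToken set carried in Figure 12, and the rule that a new RMB
+   cannot be used until an RToken has been communicated for every
+   link, can be sketched as follows.  The function names and the use
+   of plain strings for RTokens are assumptions for illustration; a
+   real CONFIRM RKEY message carries binary fields.
+
```python
from typing import Dict, Iterable, List, Tuple

RTokenSet = List[Tuple[int, str]]  # (SMC-R link number, RToken) pairs


def build_rtoken_set(rtoken_by_link: Dict[int, str]) -> RTokenSet:
    """Pair each link number with the new RMB's RToken on that link."""
    return sorted(rtoken_by_link.items())


def rmb_usable(link_numbers: Iterable[int], confirmed: RTokenSet) -> bool:
    """True once an RToken has been communicated for every link."""
    confirmed_links = {link for link, _ in confirmed}
    return set(link_numbers) <= confirmed_links
```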
+
+   Implementations may choose to proactively add RMBs to link groups in
+   anticipation of need.  For example, an implementation may add a new
+   RMB when a certain usage threshold (e.g., percentage used) for all of
+   its existing RMBs has been exceeded.
+
+   A new RMB may also be added to an existing link group on an as-needed
+   basis -- for example, when a new TCP connection is added to the link
+   group but there are no available RMB elements.  In this case, the CLC
+   exchange is paused while the peer that requires the new RMB adds it.
+   An example of this is illustrated in Figure 13.
+
+       Host X -- Server                            Host Y -- Client
+    +-------------------+                      +--------------------+
+    | Peer ID = PS1     |                      |   Peer ID = PC1    |
+    |            +------+                      +------+             |
+    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64      |
+    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |         |
+    |        |   |GID GA|                      |GID GB|   |RToken Y2|
+    |       \/   +------+                      +------+  \/         |
+    |+--------+         |                      |        +--------+  |
+    ||        |         |   Subnet S1          |        | New    |  |
+    || RMB    |         |                      |        | RMB    |  |
+    |+--------+         |                      |        +--------+  |
+    |       /\   +------+                      +------+  /\         |
+    |        |   |RNIC 3|    SMC-R Link 2      |RNIC 4|   |RToken W2|
+    |        |   |MAC MC|<-------------------->|MAC MD|   |         |
+    |       QP 9 |GID GC|                      |GID GD|  QP 65      |
+    |            +------+                      +------+             |
+    +-------------------+                      +--------------------+
+
+           SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
+        <--------------------------------------------------------->
+
+                    SMC Proposal(PC1,MB,GB,S1)
+        <--------------------------------------------------------
+
+      SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
+        --------------------------------------------------------->
+
+          CONFIRM RKEY(request, Add,
+        <........................................................
+                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
+
+          CONFIRM RKEY(response, Add,
+         ........................................................>
+                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
+
+          SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
+        <--------------------------------------------------------
+
+                         Legend:
+                  ------------   TCP/IP and CLC flows
+                  ............   RoCE (LLC) flows
+
+          Figure 13: Client Adds RMB during TCP Connection Setup
+
+3.5.5.2.2.  Deleting an RMB from an SMC-R Link Group
+
+   Either peer can delete one or more of its RMBs as long as they are
+   not being used for any TCP connections.  Ideally, an SMC-R peer would
+   use a timer to avoid freeing an RMB immediately after the last TCP
+   connection stops using it, to keep the RMB available for later TCP
+   connections and avoid thrashing with addition and deletion of RMBs.
+   Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY
+   LLC message to its peer.  It can then free the RMB once it receives a
+   response from the peer.  Multiple RMBs can be deleted in a single
+   DELETE RKEY exchange.
+
+   Note that in a DELETE RKEY message, it is not necessary to specify
+   the full RToken for a deleted RMB.  The RMB's RKey over one link in
+   the link group is sufficient to specify which RMB is being deleted.
+
+          Host X                                     Host Y
+     +-------------------+                      +-------------------+
+     |            +------+                      +------+            |
+     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
+     |RToken X|   |      |<-------------------->|      |            |
+     |        |   |      |                      |      |            |
+     |       \/   +------+                      +------+            |
+     |+--------+         |                      |                   |
+     || Deleted|         |                      |                   |
+     || RMB    |         |                      |                   |
+     ||        |         |                      |                   |
+     |+--------+         |                      |                   |
+     |       /\   +------+                      +------+            |
+     |RToken Z|   |      |     SMC-R Link 2     |      |            |
+     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
+     |       QP 9 |      |                      |      |            |
+     |            +------+                      +------+            |
+     +-------------------+                      +-------------------+
+
+           DELETE RKEY(request, RKey list(RKey X))
+                 ................................................>
+
+           DELETE RKEY(response, RKey list(RKey X))
+                <................................................
+
+           (Note: This exchange can flow over either SMC-R link.)
+
+                Figure 14: Delete RMB from SMC-R Link Group
+
+3.5.5.2.3.  Adding a New SMC-R Link to a Link Group with Multiple RMBs
+
+   When a new SMC-R link is added to an existing link group, there could
+   be multiple RMBs on each side already associated with the link group.
+   There could also be a different number of RMBs on one side than on
+   the other, because each peer manages its RMBs independently.  Each of
+   these RMBs will require a new RToken to be used on the new SMC-R
+   link, and those new RTokens must then be communicated to the peer.
+   This requires two-way communication, as the server will have to
+   communicate its RTokens to the client and vice versa.
+
+   RTokens are communicated between peers in pairs.  Each RToken pair
+   consists of:
+
+   o  The RToken for the RMB, as is already known on an existing SMC-R
+      link in the link group.
+
+   o  The RToken for the same RMB, to be used on the new SMC-R link.
+
+   These pairs are required to ensure that each peer knows which RTokens
+   across QPs are equivalent.
+
+   The ADD LINK request and response LLC messages do not have enough
+   space to contain any RToken pairs.  ADD LINK CONTINUATION LLC
+   messages are used to communicate these pairs, as shown in Figure 15.
+   The ADD LINK CONTINUATION LLC messages are sent on the same SMC-R
+   link that the ADD LINK LLC messages were sent over, and in both the
+   ADD LINK and ADD LINK CONTINUATION LLC messages the first RToken in
+   each RToken pair will be the RToken for the RMB as known on the SMC-R
+   link over which the LLC message is being sent.
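+
+   As a rough illustration only (not part of the protocol definition),
+   the pairing rule above can be sketched in Python.  The RMB names, the
+   function names, and the pairs_per_message parameter are hypothetical;
+   the number of pairs that one ADD LINK CONTINUATION message can carry
+   is fixed by the LLC message format, not by this sketch.
+
```python
from typing import Dict, List, Tuple

# (RToken on the link carrying the LLC message, RToken on the new link)
RTokenPair = Tuple[str, str]


def build_rtoken_pairs(existing_link: Dict[str, str],
                       new_link: Dict[str, str]) -> List[RTokenPair]:
    """Pair each RMB's RToken on the existing link with its new-link RToken.

    Both dicts map a hypothetical local RMB name to that RMB's RToken.
    """
    missing = sorted(set(existing_link) - set(new_link))
    if missing:
        raise ValueError(f"no new-link RToken for RMBs: {missing}")
    # The existing-link RToken comes first in every pair.
    return [(existing_link[rmb], new_link[rmb]) for rmb in existing_link]


def split_into_messages(pairs: List[RTokenPair],
                        pairs_per_message: int) -> List[List[RTokenPair]]:
    """Chunk the pairs, one chunk per ADD LINK CONTINUATION message."""
    if pairs_per_message < 1:
        raise ValueError("pairs_per_message must be at least 1")
    return [pairs[i:i + pairs_per_message]
            for i in range(0, len(pairs), pairs_per_message)]


# Server side of Figure 15: three RMBs, known as X, Y, Z on Link 1
# and as U, V, W on the link being added.
server_pairs = build_rtoken_pairs(
    {"rmb_a": "X", "rmb_b": "Y", "rmb_c": "Z"},
    {"rmb_a": "U", "rmb_b": "V", "rmb_c": "W"},
)
```
+
+   Each peer builds its own list independently, so the two directions
+   may carry different numbers of pairs (three and four in Figure 15).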
+
+       Host X -- Server                           Host Y -- Client
+    +-------------------+                      +-------------------+
+    | Peer ID = PS1     |                      |   Peer ID = PC1   |
+    |            +------+                      +------+            |
+    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
+    |RKey set|   |MAC MA|<-------------------->|MAC MB|   |RKey set|
+    |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
+    |       \/   +------+                      +------+  \/        |
+    |+--------+         |                      |        +--------+ |
+    || 3 RMBs |         |                      |        | 4 RMBs | |
+    |+--------+         |                      |        +--------+ |
+    |       /\   +------+                      +------+  /\        |
+    |RKey set|   |RNIC 3|    SMC-R Link 2      |RNIC 4|  | RKey set|
+    |U,V,W   |   |MAC MC|<-------------------->|MAC MD|  | L,M,N,P |
+    |       QP 9 |GID GC|    (being added)     |GID GD| QP 65      |
+    |            +------+                      +------+            |
+    +-------------------+                      +-------------------+
+
+            ADD LINK request (QP9,MC,GC, link number = 2)
+            ............................................>
+
+            ADD LINK response (QP65,MD,GD, link number = 2)
+            <............................................
+
+    ADD LINK CONTINUATION req(RToken pairs=((X,U),(Y,V),(Z,W)))
+             ............................................>
+
+    ADD LINK CONTINUATION rsp(RToken pairs=((Q,L),(R,M),(S,N),(T,P)))
+             <.............................................
+
+           CONFIRM LINK req/rsp exchange on Link 2
+            <.............................................>
+
+
+                          Legend:
+                   ------------   TCP/IP and CLC flows
+                   ............   RoCE (LLC) flows
+
+   Figure 15: Exchanging RKeys when a New Link Is Added to a Link Group
+
+3.5.5.3.  Serialization of LLC Exchanges, and Collisions
+
+   LLC flows can be divided into two main groups for serialization
+   considerations.
+
+   The first group is LLC messages that are independent and can flow at
+   any time.  These are one-time, unsolicited messages that either do
+   not have a required response or have a simple response that does not
+   interfere with the operations of another group of messages.  These
+   messages are as follows:
+
+   o  TEST LINK from either the client or the server: This message
+      requires a TEST LINK response to be returned but does not affect
+      the configuration of the link group or the RKeys.
+
+   o  ADD LINK from the client to the server: This message is provided
+      as an "FYI" to the server to let it know that the client has an
+      additional RNIC available.  The server is not required to act upon
+      or respond to this message.
+
+   o  DELETE LINK from the client to the server: This message informs
+      the server that either (1) the client has experienced an error or
+      problem that requires a link or link group to be terminated or
+      (2) an operator has commanded that a link or link group be
+      terminated.  The server does not respond directly to the message;
+      rather, it initiates a DELETE LINK exchange as a result of
+      receiving it.
+
+   o  DELETE LINK from the server to the client, with the "delete entire
+      link group" flag set: This message informs the client that the
+      entire link group is being deleted.
+
+   The second group is LLC messages that are part of an exchange of LLC
+   messages that affects link group configuration; this exchange must
+   complete before another exchange of LLC messages that affects link
+   group configuration can be processed.  When a peer knows that one of
+   these exchanges is in progress, it must not start another exchange.
+   These exchanges are as follows:
+
+   o  ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK
+      CONTINUATION response / CONFIRM LINK / CONFIRM LINK response: This
+      exchange, by adding a new link, changes the configuration of the
+      link group.
+
+   o  DELETE LINK / DELETE LINK response initiated by the server,
+      without the "delete entire link group" flag set: This exchange, by
+      deleting a link, changes the configuration of the link group.
+
+   o  CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY
+      response: This exchange changes the RMB configuration of the link
+      group.  RKeys cannot change while links are being added or deleted
+      (while an ADD LINK or DELETE LINK is in progress).  However,
+      CONFIRM RKEY and DELETE RKEY are unique in that both the client
+      and server can independently manage (add or remove) their own
+      RMBs.  This allows each peer to concurrently change their RKeys
+      and therefore concurrently send CONFIRM RKEY or DELETE RKEY
+      requests.  The concurrent CONFIRM RKEY or DELETE RKEY requests can
+      be independently processed and do not represent a collision.
+
+   Because the server is in control of the configuration of the link
+   group, many timing windows and collisions are avoided, but there are
+   still some that must be handled.
+
+3.5.5.3.1.  Collisions with ADD LINK / CONFIRM LINK Exchange
+
+   Colliding LLC message: TEST LINK
+
+      Action to resolve: Send immediate TEST LINK reply.
+
+   Colliding LLC message: ADD LINK from client to server
+
+      Action to resolve: Server ignores the ADD LINK message.  When
+      client receives server's ADD LINK, client will consider that
+      message to be in response to its ADD LINK message and the flow
+      works.  Since both client and server know not to start this
+      exchange if an ADD LINK operation is already underway, this can
+      only occur if the client sends this message before receiving the
+      server's ADD LINK and this message crosses with the server's ADD
+      LINK message; therefore, the server's ADD LINK arrives at the
+      client immediately after the client sent this message.
+
+   Colliding LLC message: DELETE LINK from client to server, specific
+   link specified
+
+      Action to resolve: Server queues the DELETE LINK message and
+      processes it after the ADD LINK exchange completes.  If it is an
+      orderly link termination, it can wait until after this exchange
+      completes.  If it is disorderly and the link affected is the one
+      that the current exchange is using, the server will discover the
+      outage when a message in this exchange fails.
+
+   Colliding LLC message: DELETE LINK from client to server, entire link
+   group to be deleted
+
+      Action to resolve: Immediately clean up the link group.
+
+   Colliding LLC message: CONFIRM RKEY from client
+
+      Action to resolve: Send a negative CONFIRM RKEY response to the
+      client.  Once the current exchange finishes, client will have to
+      recompute its RKey set to include the new link and then start a
+      new CONFIRM RKEY exchange.
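+
+   The resolutions listed in this section lend themselves to a simple
+   lookup table.  The following Python sketch shows one possible
+   encoding from the server's point of view while an ADD LINK / CONFIRM
+   LINK exchange is in progress.  The message labels and the Action
+   names are hypothetical and do not appear on the wire.
+
```python
from enum import Enum, auto


class Action(Enum):
    REPLY_IMMEDIATELY = auto()          # answer the TEST LINK right away
    IGNORE = auto()                     # drop the message silently
    QUEUE_UNTIL_EXCHANGE_DONE = auto()  # process after the current exchange
    CLEAN_UP_LINK_GROUP = auto()        # tear down the whole link group
    SEND_NEGATIVE_RESPONSE = auto()     # peer must retry later


# Hypothetical labels for the colliding message, mapped to the
# resolution given in the text above.
ADD_LINK_COLLISIONS = {
    "TEST_LINK": Action.REPLY_IMMEDIATELY,
    "ADD_LINK_FROM_CLIENT": Action.IGNORE,
    "DELETE_LINK_SPECIFIC_LINK": Action.QUEUE_UNTIL_EXCHANGE_DONE,
    "DELETE_LINK_ENTIRE_GROUP": Action.CLEAN_UP_LINK_GROUP,
    "CONFIRM_RKEY_FROM_CLIENT": Action.SEND_NEGATIVE_RESPONSE,
}


def resolve_collision(message: str) -> Action:
    """Return the action for a message that collides with ADD LINK."""
    try:
        return ADD_LINK_COLLISIONS[message]
    except KeyError:
        raise ValueError(f"no collision rule for {message!r}") from None
```
+
+   A table keeps the rules in one place, so an implementation can hold
+   one such table per type of in-progress exchange.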
+
+3.5.5.3.2.  Collisions during DELETE LINK Exchange
+
+   Colliding LLC message: TEST LINK from either peer
+
+      Action to resolve: Send immediate TEST LINK response.
+
+   Colliding LLC message: ADD LINK from client to server
+
+      Action to resolve: Server queues the ADD LINK and processes it
+      after the current exchange completes.
+
+   Colliding LLC message: DELETE LINK from client to server (specific
+   link)
+
+      Action to resolve: Server queues the DELETE LINK message and
+      processes it after the current exchange completes.  If it is an
+      orderly link termination, it can wait until after this exchange
+      completes.  If it is disorderly and the link affected is the one
+      that the current exchange is using, the server will discover the
+      outage when a message in this exchange fails.
+
+   Colliding LLC message: DELETE LINK from either client or server,
+   deleting the entire link group
+
+      Action to resolve: Immediately clean up the link group.
+
+   Colliding LLC message: CONFIRM RKEY from client to server
+
+      Action to resolve: Send a negative CONFIRM RKEY response to the
+      client.  Once the current exchange finishes, client will have to
+      recompute its RKey set to reflect the deleted link and then start
+      a new CONFIRM RKEY exchange.
+
+3.5.5.3.3.  Collisions during CONFIRM RKEY Exchange
+
+   Colliding LLC message: TEST LINK
+
+      Action to resolve: Send immediate TEST LINK reply.
+
+   Colliding LLC message: ADD LINK from client to server
+
+      Action to resolve: Queue the ADD LINK, and process it after the
+      current exchange completes.
+
+   Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY
+   exchange was initiated by the client, and it crossed with the server
+   initiating an ADD LINK exchange)
+
+      Action to resolve: Process the ADD LINK.  Client will receive a
+      negative CONFIRM RKEY from the server and will have to redo this
+      CONFIRM RKEY exchange after the ADD LINK exchange completes.
+
+   Colliding LLC message: DELETE LINK from client to server, specific
+   link to be deleted (CONFIRM RKEY exchange was initiated by the
+   server, and it crossed with the client's DELETE LINK request)
+
+      Action to resolve: Server queues the DELETE LINK message and
+      processes it after the CONFIRM RKEY exchange completes.  If it is
+      an orderly link termination, it can wait until after this exchange
+      completes.  If it is disorderly and the link affected is the one
+      that the current exchange is using, the server will discover the
+      outage when a message in this exchange fails.
+
+   Colliding LLC message: DELETE LINK from server to client, specific
+   link deleted (CONFIRM RKEY exchange was initiated by the client, and
+   it crossed with the server's DELETE LINK)
+
+      Action to resolve: Process the DELETE LINK.  Client will receive a
+      negative CONFIRM RKEY from the server and will have to redo this
+      CONFIRM RKEY exchange after the DELETE LINK exchange completes.
+
+   Colliding LLC message: DELETE LINK from either client or server,
+   entire link group deleted
+
+      Action to resolve: Immediately clean up the link group.
+
+   Colliding LLC message: CONFIRM RKEY from the peer that did not start
+   the current CONFIRM RKEY exchange
+
+      Action to resolve: Queue the request, and process it after the
+      current exchange completes.
+
+4.  SMC-R Memory-Sharing Architecture
+
+4.1.  RMB Element Allocation Considerations
+
+   Each TCP connection using SMC-R must be allocated an RMBE by each
+   SMC-R peer.  This allocation is performed by each endpoint
+   independently to allow each endpoint to select an RMBE that best
+   matches the characteristics of its TCP socket endpoint.  The RMBE
+   associated with a TCP socket endpoint must have a receive buffer that
+   is at least as large as the TCP receive buffer size in effect for
+   that connection.  The receive buffer size can be determined by what
+   is specified explicitly by the application using setsockopt() or
+   implicitly via the system-configured default value.  This will allow
+   sufficient data to be RDMA-written by the SMC-R peer to fill an
+   entire receive buffer size's worth of data on a given data flow.
+   Given that each RMB must have fixed-length RMBEs, this implies that
+   an SMC-R endpoint may need to maintain multiple RMBs of various sizes
+   for SMC-R connections on a given SMC-R link and can then select an
+   RMBE that most closely fits a connection.
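+
+   A minimal sketch of the "closest fit" selection described above,
+   assuming the endpoint keeps a list of the RMBE receive buffer sizes
+   offered by its existing RMBs.  The function name and the use of None
+   to signal "no suitable RMB" are illustrative choices, not part of
+   the protocol.
+
```python
from typing import Optional, Sequence


def select_rmbe_size(available_sizes: Sequence[int],
                     tcp_rcvbuf: int) -> Optional[int]:
    """Pick the smallest RMBE receive buffer that still fits tcp_rcvbuf.

    Returns None when no existing RMB has elements that are large enough.
    """
    candidates = [size for size in available_sizes if size >= tcp_rcvbuf]
    return min(candidates) if candidates else None
```
+
+   For example, select_rmbe_size([16384, 65536, 262144], 40000) picks
+   the 65536-byte elements.  A None result would be the point at which
+   an implementation adds a new RMB to the link group.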
+
+4.2.  RMB and RMBE Format
+
+   An RMB is a virtual memory buffer whose backing real memory is
+   pinned.  The RMB is subdivided into a whole number of equal-sized RMB
+   Elements (RMBEs).  Each RMBE begins with a 4-byte eye catcher for
+   diagnostic and service purposes, followed by the receive data buffer.
+   The contents of this diagnostic eye catcher are implementation
+   dependent and should be used by the local SMC-R peer to check for
+   overlay errors by verifying an intact eye catcher with every RMBE
+   access.
+
+   The RMBE is a wrapping receive buffer for receiving RDMA writes from
+   the peer.  Cursors, as described below, are exchanged between peers
+   to manage and track RDMA writes and local data reads from the RMBE
+   for a TCP connection.
+
+4.3.  RMBE Control Information
+
+   RMBE control information consists of consumer cursors, producer
+   cursors, wrap counts, CDC message sequence numbers, control flags
+   such as urgent data and "writer blocked" indicators, and TCP
+   connection information such as termination flags.  This information
+   is exchanged between SMC-R peers using CDC messages, which are passed
+   using RoCE SendMsg.  A TCP/IP stack implementing SMC-R must receive
+   and store this information in its internal data structures, as it is
+   used to manage the RMBE and its data buffer.
+
+   The format and contents of the CDC message are described in detail in
+   Appendix A.4 ("Connection Data Control (CDC) Message Format").  The
+   following is a high-level description of what this control
+   information contains.
+
+   o  Connection state flags such as sending done, connection closed,
+      failover data validation, and abnormal close.
+
+   o  A sequence number that is managed by the sender.  This sequence
+      number starts at 1, is incremented on each send, and wraps to 0.
+      It counts CDC messages sent and is not related to the number of
+      bytes sent.  It is used for failover data validation.
+
+   o  Producer cursor: a wrapping offset into the receiver's RMBE data
+      area.  Set by the peer that is writing into the RMBE, it points to
+      where the writing peer will write the next byte of data into an
+      RMBE.  This cursor is accompanied by a wrap sequence number to
+      help the RMBE owner (the receiver) identify full window size
+      wrapping writes.  Note that this cursor must account for (i.e.,
+      skip over) the RMBE eye catcher that is in the beginning of the
+      data area.
+
+   o  Consumer cursor: a wrapping offset into the receiver's RMBE data
+      area.  Set by the owner of the RMBE (the peer that is reading from
+      it), this cursor points to the offset of the next byte of data to
+      be consumed by the peer in its own RMBE.  The sender cannot write
+      beyond this cursor into the receiver's RMBE without causing data
+      loss.  Like the producer cursor, this is accompanied by a wrap
+      count to help the writer identify full window size wrapping reads.
+      Note that this cursor must account for (i.e., skip over) the RMBE
+      eye catcher that is in the beginning of the data area.
+
+   o  Data flags such as urgent data, writer blocked indicator, and
+      cursor update requests.
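+
+   The cursor arithmetic can be sketched as follows.  This is a
+   simplified Python model, not the wire format: the Cursor class, the
+   16-bit wrap-count width, and the convention that offsets are
+   relative to the start of the receive data buffer (that is, after
+   the eye catcher) are all assumptions made for the example.
+
```python
from dataclasses import dataclass

WRAP_MODULUS = 1 << 16  # assumed wrap-count width; check the CDC format


@dataclass(frozen=True)
class Cursor:
    wrap: int    # wrap count
    offset: int  # byte offset into the receive data buffer


def advance(cursor: Cursor, nbytes: int, buf_len: int) -> Cursor:
    """Move a cursor forward, bumping the wrap count at the buffer end."""
    total = cursor.offset + nbytes
    return Cursor((cursor.wrap + total // buf_len) % WRAP_MODULUS,
                  total % buf_len)


def unconsumed_bytes(producer: Cursor, consumer: Cursor, buf_len: int) -> int:
    """Bytes written by the peer but not yet read by the RMBE owner."""
    wraps = (producer.wrap - consumer.wrap) % WRAP_MODULUS
    return wraps * buf_len + producer.offset - consumer.offset


def writable_window(producer: Cursor, consumer: Cursor, buf_len: int) -> int:
    """Space the writer may still fill without overwriting unread data."""
    return buf_len - unconsumed_bytes(producer, consumer, buf_len)
```
+
+   The wrap counts are what distinguish an empty buffer from a full
+   one: in both cases the two offsets are equal, but the producer's
+   wrap count is one ahead when the buffer is full.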
+
+4.4.  Use of RMBEs
+
+4.4.1.  Initializing and Accessing RMBEs
+
+   The RMBE eye catcher is initialized by the RMB owner prior to
+   assigning the RMBE to a specific TCP connection and communicating
+   its RMBE index to the SMC-R partner.  After an RMBE index is
+   communicated to the SMC-R partner, the RMBE can only be referenced in
+   "read-only mode" by the owner, and all updates to it are performed by
+   the remote SMC-R partner via RDMA write operations.
+
+   Initialization of an RMBE must include the following:
+
+   o  Zeroing out the entire RMBE receive buffer, which helps minimize
+      data integrity issues (e.g., data from a previous connection
+      somehow being presented to the current connection).
+
+   o  Setting the beginning RMBE eye catcher.  This eye catcher plays an
+      important role in helping detect accidental overlays of the RMBE.
+      The RMB owner should always validate this eye catcher before
+      each new reference to the RMBE.  If the eye catcher is found to
+      be corrupted, the local host must reset the TCP connection
+      associated with this RMBE and log the appropriate diagnostic
+      information.
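+
+   The initialization and validation steps above can be sketched in
+   Python, using a bytearray to stand in for the pinned RMB.  The eye
+   catcher value, the 0-based element index, and the convention that
+   rmbe_size includes the 4-byte eye catcher are assumptions of this
+   sketch; the actual eye catcher contents are implementation
+   dependent.
+
```python
EYE_CATCHER_LEN = 4      # bytes, per Section 4.2
EYE_CATCHER = b"RMBE"    # hypothetical value, chosen for this example


def rmbe_offset(index: int, rmbe_size: int) -> int:
    """Byte offset of RMBE `index` (0-based in this sketch) in the RMB."""
    return index * rmbe_size


def init_rmb(rmbe_count: int, rmbe_size: int) -> bytearray:
    """Create a zeroed RMB and stamp every RMBE with the eye catcher."""
    rmb = bytearray(rmbe_count * rmbe_size)  # bytearray() zero-fills
    for index in range(rmbe_count):
        start = rmbe_offset(index, rmbe_size)
        rmb[start:start + EYE_CATCHER_LEN] = EYE_CATCHER
    return rmb


def eye_catcher_intact(rmb: bytearray, index: int, rmbe_size: int) -> bool:
    """Check for an overlay before the owner references the RMBE."""
    start = rmbe_offset(index, rmbe_size)
    return bytes(rmb[start:start + EYE_CATCHER_LEN]) == EYE_CATCHER
```
+
+   A False result corresponds to the corrupted case above, where the
+   local host resets the associated TCP connection and logs diagnostic
+   information.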
+
+4.4.2.  RMB Element Reuse and Conflict Resolution
+
+   RMB elements can be reused once their associated TCP and SMC-R
+   connections are terminated.  Under normal and abnormal SMC-R
+   connection termination processing, both SMC-R peers must explicitly
+   acknowledge that they are done using an RMBE before that element can
+   be freed and reassigned to another SMC-R connection instance.  For
+   more details on SMC-R connection termination, refer to Section 4.8.
+
+   However, there are some error scenarios where this two-way explicit
+   acknowledgment may not be completed.  In these scenarios, an RMBE
+   owner may choose to reassign this RMBE to a new SMC-R connection
+   instance on this SMC-R link group.  When this occurs, the partner
+   SMC-R peer must detect this condition during SMC-R Rendezvous
+   processing when presented with an RMBE that it believes is already in
+   use for a different SMC-R connection.  In this case, the SMC-R peer
+   must abort the existing SMC-R connection associated with this RMBE.
+   The abort processing resets the TCP connection (if it is still
+   active), but it must not attempt to perform any RDMA writes to this
+   RMBE and must also ignore any data sitting in the local RMBE
+   associated with the existing connection.  It then proceeds to free up
+   the local RMBE and notify the local application that the connection
+   is being abnormally reset.
+
+   The remote SMC-R peer then proceeds to normal processing for this new
+   SMC-R connection.
+
+4.5.  SMC-R Protocol Considerations
+
+   The following sections describe considerations for the SMC-R protocol
+   as compared to TCP.
+
+4.5.1.  SMC-R Protocol Optimized Window Size Updates
+
+   An SMC-R receiver host sends its consumer cursor information to the
+   sender to convey the progress that the receiving application has made
+   in consuming the sent data.  The difference between the writer's
+   producer cursor and the associated receiver's consumer cursor
+   indicates the window size available for the sender to write into.
+   This is somewhat similar to TCP window update processing and
+   therefore has some similar considerations, such as silly window
+   syndrome avoidance, whereby TCP has an optimization that minimizes
+   the overhead of very small, unproductive window size updates
+   associated with suboptimal socket applications consuming very small
+   amounts of data on every receive() invocation.  For SMC-R, the
+   receiver only updates its consumer cursor via a unique CDC message
+   under the following conditions:
+
+   o  The current window size (from a sender's perspective) is less than
+      half of the receive buffer space, and the consumer cursor update
+      will result in a minimum increase in the window size of 10% of the
+      receive buffer space.  Some examples:
+
+      a. Receive buffer size: 64K, current window size (from a sender's
+         perspective): 50K.  No need to update the consumer cursor.
+         Plenty of space is available for the sender.
+
+      b. Receive buffer size: 64K, current window size (from a sender's
+         perspective): 30K, current window size from a receiver's
+         perspective: 31K.  No need to update the consumer cursor; even
+         though the sender's window size is < 1/2 of the 64K, the window
+         update would only increase that by 1K, which is < 1/10th of the
+         64K buffer size.
+
+      c. Receive buffer size: 64K, current window size (from a sender's
+         perspective): 30K, current window size from a receiver's
+         perspective: 64K.  The receiver updates the consumer cursor
+         (sender's window size is < 1/2 of the 64K; the window update
+         would increase that by > 6.4K).
+
+   o  The receiver must always include a consumer cursor update whenever
+      it sends a CDC message to the partner for another flow (i.e., send
+      flow in the opposite direction).  This allows the window size
+      update to be delivered with no additional overhead.  This is
+      somewhat similar to TCP DelayAck processing and quite effective
+      for request/response data patterns.
+
+   o  If a peer has set the B-bit in a CDC message, then any consumption
+      of data by the receiver causes a CDC message to be sent, updating
+      the consumer cursor until a CDC message with that bit cleared is
+      received from the peer.
+
+   o  The optimized window size updates are overridden when the sender
+      sets the Consumer Cursor Update Requested flag in a CDC message to
+      the receiver.  When this indicator is on, the consumer must send a
+      consumer cursor update immediately when data is consumed by the
+      local application or if the cursor has not been updated for a
+      while (i.e., local copy of the consumer cursor does not match the
+      last consumer cursor value sent to the partner).  This allows the
+      sender to perform optional diagnostics for detecting a stalled
+      receiver application (data has been sent but not consumed).  It is
+      recommended that the Consumer Cursor Update Requested flag only be
+      sent for diagnostic procedures, as it may result in non-optimal
+      data path performance.
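+
+   The decision of whether to send a dedicated consumer cursor update
+   can be sketched as a single predicate.  This Python example is an
+   illustration under stated assumptions: the function and parameter
+   names are hypothetical, and the rule that updates are always
+   piggybacked on CDC messages sent for other reasons is not modeled
+   because it needs no decision.
+
```python
def should_send_cursor_update(buf_size: int,
                              sender_window: int,
                              receiver_window: int,
                              writer_blocked: bool = False,
                              update_requested: bool = False) -> bool:
    """Decide whether a dedicated consumer cursor update is warranted.

    sender_window:   window as the sender currently sees it
    receiver_window: window the sender would see after the update
    writer_blocked:  the peer has set the B-bit
    update_requested: the peer set Consumer Cursor Update Requested
    """
    gain = receiver_window - sender_window
    if gain <= 0:
        return False  # nothing new to report
    if writer_blocked or update_requested:
        return True   # these flags override the optimization
    # Window below half the buffer AND update worth at least 10% of it.
    return 2 * sender_window < buf_size and 10 * gain >= buf_size


K = 1024
example_a = should_send_cursor_update(64 * K, 50 * K, 60 * K)
example_b = should_send_cursor_update(64 * K, 30 * K, 31 * K)
example_c = should_send_cursor_update(64 * K, 30 * K, 64 * K)
```
+
+   The three example values mirror cases a, b, and c above; the 60K
+   receiver-side window used for case a is an arbitrary choice, since
+   the text does not give one.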
+
+4.5.2.  Small Data Sends
+
+   The SMC-R protocol makes no special provisions for handling small
+   data segments sent across a stream socket.  Data is always sent if
+   sufficient window space is available.  In contrast to the TCP Nagle
+   algorithm, there are no special provisions in SMC-R for coalescing
+   small data segments.
+
+   An implementation of SMC-R can be configured to optimize its sending
+   processing by coalescing outbound data for a given SMC-R connection
+   so that it can reduce the number of RDMA write operations it
+   performs, in a fashion similar to Nagle's algorithm.  However, any
+   such coalescing would require a timer on the sending host that would
+   ensure that data was eventually sent.  Also, the sending host would
+   have to opt out of this processing if Nagle's algorithm had been
+   disabled (programmatically or via system configuration).
+
+4.5.3.  TCP Keepalive Processing
+
+   TCP keepalive processing allows applications to direct the local
+   TCP/IP host to periodically "test" the viability of an idle TCP
+   connection.  Since SMC-R connections have a TCP representation along
+   with an SMC-R representation, there are unique keepalive processing
+   considerations:
+
+   o  SMC-R-layer keepalive processing: If keepalive is enabled for an
+      SMC-R connection, the local host maintains a keepalive timer that
+      reflects how long an SMC-R connection has been idle.  The local
+      host also maintains a timestamp of last activity for each SMC-R
+      link (for any SMC-R connection on that link).  When it is
+      determined that an SMC-R connection has been idle longer than the
+      keepalive interval, the host checks to see whether or not the
+      SMC-R link has been idle for a duration longer than the keepalive
+      timeout.  If both conditions are met, the local host then performs
+      a TEST LINK LLC command to test the viability of the SMC-R link
+      over the RoCE fabric (RC-QPs).  If a TEST LINK LLC command
+      response is received within a reasonable amount of time, then the
+      link is considered viable, and all connections using this link are
+      considered viable as well.  If, however, a response is not
+      received in a reasonable amount of time or there's a failure in
+      sending the TEST LINK LLC command, then this is considered a
+      failure in the SMC-R link, and failover processing to an alternate
+      SMC-R link must be triggered.  If no alternate SMC-R link exists
+      in the SMC-R link group, then all of the SMC-R connections on this
+      link are abnormally terminated by resetting the TCP connections
+      represented by these SMC-R connections.  Given that multiple SMC-R
+      connections can share the same SMC-R link, implementing an SMC-R
+      link-level probe using the TEST LINK LLC command will help reduce
+      the amount of unproductive keepalive traffic for SMC-R
+      connections; as long as some SMC-R connections on a given SMC-R
+      link are active (i.e., have had I/O activity within the keepalive
+      interval), then there is no need to perform additional link
+      viability testing.
+
+   o  TCP-layer keepalive processing: Traditional TCP "keepalive"
+      packets are not as relevant for SMC-R connections, given that the
+      TCP path is not used for these connections once the SMC-R
+      Rendezvous processing is completed.  All SMC-R connections by
+      default have associated TCP connections that are idle.  Are TCP
+      keepalive probes still needed for these connections?  There are
+      two main scenarios to consider:
+
+      1. TCP keepalives that are used to determine whether or not the
+         peer TCP endpoint is still active.  This is not needed for
+         SMC-R connections, as the SMC-R-level keepalives mentioned
+         above will determine whether or not the remote endpoint
+         connections are still active.
+
+      2. TCP keepalives that are used to ensure that TCP connections
+         traversing an intermediate proxy maintain an active state.  For
+         example, stateful firewalls typically maintain state
+         representing every valid TCP connection that traverses the
+         firewall.  These types of firewalls are known to expire idle
+         connections by removing their state in the firewall to conserve
+         memory.  TCP keepalives are often used in this scenario to
+         prevent firewalls from timing out otherwise idle connections.
+         When using SMC-R, both endpoints must reside in the same
+         Layer 2 network (i.e., the same subnet).  As a result,
+         firewalls cannot be injected in the path between two SMC-R
+         endpoints.  However, other intermediate proxies, such as
+         TCP/IP-layer load balancers, may be injected in the path of two
+         SMC-R endpoints.  These types of load balancers also maintain
+         connection state so that they can forward TCP connection
+         traffic to the appropriate cluster endpoint.  When using SMC-R,
+         these TCP connections will appear to be completely idle, making
+         them susceptible to potential timeouts at the load-balancing
+         proxy.  As a result, for this scenario, TCP keepalives may
+         still be relevant.
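+
+   The SMC-R-layer keepalive check in the first bullet above can be
+   sketched as follows.  This is an illustration only: the function
+   and enumeration names are hypothetical, and the sketch assumes that
+   the connection's keepalive interval and the link's keepalive
+   timeout are the same value.
+
```python
from enum import Enum


class Action(Enum):
    NONE = "no probe needed"
    TEST_LINK = "send TEST LINK LLC"
    LINK_VIABLE = "link and its connections are viable"
    FAILOVER = "fail over to an alternate SMC-R link"
    RESET = "reset the TCP connections on this link"


def keepalive_check(now: float, conn_last_activity: float,
                    link_last_activity: float, interval: float) -> Action:
    """Decide whether an idle connection warrants a link-level probe."""
    conn_idle = now - conn_last_activity
    link_idle = now - link_last_activity
    # Probe only if both the connection and its link have been idle too long.
    if conn_idle > interval and link_idle > interval:
        return Action.TEST_LINK
    return Action.NONE


def probe_result(response_received: bool, alternate_link: bool) -> Action:
    """Map the outcome of a TEST LINK probe to the follow-up action."""
    if response_received:
        return Action.LINK_VIABLE
    return Action.FAILOVER if alternate_link else Action.RESET


# Connection idle for 90 s, but another connection used the link 5 s ago.
busy_link = keepalive_check(now=100.0, conn_last_activity=10.0,
                            link_last_activity=95.0, interval=60.0)
# Both the connection and the link have been idle longer than the interval.
idle_link = keepalive_check(now=100.0, conn_last_activity=10.0,
                            link_last_activity=20.0, interval=60.0)
```
+
+   Because link activity is shared by all connections on the link, the
+   first call yields no probe even though its connection is idle.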
+
+   The following are the TCP-level keepalive processing requirements for
+   SMC-R-enabled hosts:
+
+   o  SMC-R peers should allow TCP keepalives to flow on the TCP path of
+      SMC-R connections based on existing TCP keepalive configuration
+      and programming options.  However, it is strongly recommended that
+      platforms provide the ability to specify very granular keepalive
+      timers (for example, single-digit-second timers), and platforms
+      should consider providing a configuration option that limits the
+      minimum keepalive timer that will be used for TCP-layer keepalives
+      on SMC-R connections.  This is important to minimize the number of
+      TCP keepalive packets transmitted in the network for SMC-R
+      connections.
+
+   o  SMC-R peers must always respond to inbound TCP-layer keepalives
+      (by sending ACKs for these packets) even if the connection is
+      using SMC-R.  Typically, once a TCP connection has completed the
+      SMC-R Rendezvous processing and is using SMC-R for data flows, no
+      new inbound TCP segments are expected on that TCP connection,
+      other than TCP termination segments (FIN, RST, etc.).  TCP
+      keepalives are the one exception that must be supported.  Also,
+      since TCP keepalive probes do not carry any application-layer
+      data, this has no adverse impact on the application's inbound data
+      stream.
+
+4.6.  TCP Connection Failover between SMC-R Links
+
+   A peer may change which SMC-R link within a link group it sends its
+   writes over in the event of a link failure.  Since each peer
+   independently chooses which link to send writes over for a specific
+   TCP connection, this process is done independently by each peer.
+
+4.6.1.  Validating Data Integrity
+
+   Even though RoCE is a reliable transport, there is a small subset of
+   failure modes that could cause unrecoverable loss of data.  When an
+   RNIC acknowledges receipt of an RDMA write to its peer, that creates
+   a write completion event to the sending peer, which allows the sender
+   to release any buffers it is holding for that write.  In normal
+   operation and in most failures, this operation is reliable.
+
+   However, there are failure modes possible in which a receiving RNIC
+   has acknowledged an RDMA write but then was not able to place the
+   received data into its host memory -- for example, a sudden,
+   disorderly failure of the interface between the RNIC and the host.
+   While rare, these types of events must be guarded against to ensure
+   data integrity.  The process for switching SMC-R links during
+   failover, as described in this section, guards against this
+   possibility and is mandatory.
+
+   Each peer must track the current state of the CDC sequence numbers
+   for a TCP connection.  The sender must keep track of the sequence
+   number of the CDC message that described the last write acknowledged
+   by the peer RNIC, or Sequence Sent (SS).  In other words, SS
+   describes the last write that the sender believes its peer has
+   successfully received.  The receiver must keep track of the sequence
+   number of the CDC message that described the last write that it has
+   successfully received (i.e., the data has been successfully placed
+   into an RMBE), or Sequence Received (SR).
+
+   When an RNIC fails and the sender changes SMC-R links, the sender
+   must first send a CDC message with the F-bit (failover validation
+   indicator; see Appendix A.4) set over the new SMC-R link.  This is
+   the failover data validation message.  The sequence number in this
+   CDC message is equal to SS.  The CDC message key, the length, and the
+   SMC-R alert token are the only other fields in this CDC message that
+   are significant.  No reply is expected to this validation message,
+   and once the sender has sent it, the sender may resume sending on the
+   new SMC-R link as described in Section 4.6.2.
+
+   Upon receipt of the failover validation message, the receiver must
+   verify that its SR value for the TCP connection is equal to or
+   greater than the sequence number in the failover validation message.
+   If so, no further action is required, and the TCP connection resumes
+   on the new SMC-R link.  If SR is less than the sequence number value
+   in the validation message, data has been lost, and the receiver must
+   immediately reset the TCP connection.
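+
+   The receiver-side comparison can be sketched as shown below.  The
+   FailoverValidation type is a hypothetical stand-in for a CDC
+   message with the F-bit set, and the sketch assumes plain integer
+   sequence numbers, ignoring wraparound of the sequence number field.
+
```python
from dataclasses import dataclass


@dataclass
class FailoverValidation:
    """Hypothetical stand-in for a CDC message with the F-bit set."""
    seq: int            # set by the sender to its Sequence Sent (SS)
    f_bit: bool = True  # failover validation indicator


def build_validation(ss: int) -> FailoverValidation:
    """Sender side: the first message sent on the new SMC-R link."""
    return FailoverValidation(seq=ss)


def receiver_check(sr: int, msg: FailoverValidation) -> str:
    """Receiver side: compare Sequence Received (SR) with the message."""
    # Data is intact if everything the sender believes was delivered
    # (up to SS) has actually been placed into the RMBE (up to SR).
    return "resume" if sr >= msg.seq else "reset"


ok = receiver_check(sr=42, msg=build_validation(ss=42))
lost = receiver_check(sr=41, msg=build_validation(ss=42))
```
+
+   Here the second call models a write that the peer RNIC
+   acknowledged but never placed into host memory, so the connection
+   is reset.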
+
+4.6.2.  Resuming the TCP Connection on a New SMC-R Link
+
+   When a connection is moved to a new SMC-R link and the failover
+   validation message has been sent, the sender can immediately resume
+   normal transmission.  In order to preserve the application message
+   stream, the sender must replay any RDMA writes (and their associated
+   CDC messages) that were in progress or failed when the previous SMC-R
+   link failed, before sending new data on the new SMC-R link.  The
+   sender has two options for accomplishing this:
+
+   o  Preserve the sequence numbers "as is": Retry all failed and
+      pending operations as they were originally done, including
+      reposting all associated RDMA write operations and their
+      associated CDC messages without making any changes.  Then resume
+      sending new data using new sequence numbers.
+
+   o  Combine pending messages and possibly add new data: Combine failed
+      and pending messages into a single new write with a new sequence
+      number.  This allows the sender to combine pending messages into
+      fewer operations.  As a further optimization, this write can also
+      include new data, as long as all failed and pending data are also
+      included.  If this approach is taken, the sequence number must be
+      increased beyond the last failed or pending sequence number.
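+
+   The two replay options can be sketched as follows.  PendingWrite
+   and both function names are hypothetical, each pending entry
+   stands for one RDMA write together with its CDC message, and
+   sequence number wraparound is ignored for simplicity.
+
```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingWrite:
    seq: int     # CDC sequence number originally assigned
    data: bytes  # payload of the associated RDMA write


def replay_as_is(pending: List[PendingWrite]) -> List[PendingWrite]:
    """Option 1: repost every operation unchanged, in sequence order."""
    return sorted(pending, key=lambda w: w.seq)


def replay_combined(pending: List[PendingWrite],
                    new_data: bytes = b"") -> PendingWrite:
    """Option 2: one write carrying all pending data, then any new data."""
    ordered = sorted(pending, key=lambda w: w.seq)
    payload = b"".join(w.data for w in ordered) + new_data
    # The new sequence number must exceed the last failed or pending one.
    return PendingWrite(seq=ordered[-1].seq + 1, data=payload)


pending = [PendingWrite(7, b"abc"), PendingWrite(8, b"de")]
same = replay_as_is(pending)
merged = replay_combined(pending, new_data=b"fg")
```
+
+   With the first option, new data would continue at sequence
+   number 9; with the second, the combined write itself carries
+   sequence number 9.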
+
+4.7.  RMB Data Flows
+
+   The following sections describe the RDMA wire flows for the SMC-R
+   protocol after a TCP connection has switched into SMC-R mode (i.e.,
+   SMC-R Rendezvous processing is complete and a pair of RMB elements
+   has been assigned and communicated by the SMC-R peers).  The ladder
+   diagrams below include the following:
+
+   o  RMBE control information kept by each peer.  Only a subset of the
+      information is depicted, specifically only the fields that reflect
+      the stream of data written by Host A and read by Host B.
+
+   o  Time line 0-x, which shows the wire flows in a time-relative
+      fashion.
+
+   o  Note that RMBE control information is only shown in a time
+      interval if its value changed (otherwise, assume that the value is
+      unchanged from the previously depicted value).
+
+   o  The local copy of the producer cursors and consumer cursors that
+      is maintained by each host is not depicted in these figures.  Note
+      that the cursor values in the diagram reflect the necessity of
+      skipping over the eye catcher in the RMBE data area.  They start
+      and wrap at 4, not 0.
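+
+   The cursor arithmetic implied by the last bullet can be sketched
+   as follows.  The helper name is hypothetical, the 4-byte
+   eye catcher length and the 10,000-byte buffer size are taken from
+   the scenarios in this section, and wraparound of the wrap sequence
+   number itself is ignored.
+
```python
EYE_CATCHER_LEN = 4  # bytes skipped at the start of the RMBE data area


def advance(cursor: int, wrap_seq: int, nbytes: int, buf_size: int):
    """Return (cursor, wrap_seq) after writing or consuming nbytes."""
    usable = buf_size - EYE_CATCHER_LEN
    # Work in offsets relative to the first usable byte, then map back.
    offset = (cursor - EYE_CATCHER_LEN) + nbytes
    return (EYE_CATCHER_LEN + offset % usable, wrap_seq + offset // usable)


# 1000 bytes written into an empty buffer: cursor moves from 4 to 1004.
first = advance(4, 0, 1000, buf_size=10_000)
# A full window (9996 bytes) returns to the same cursor, one wrap later.
full = advance(1004, 1, 9996, buf_size=10_000)
```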
+
+4.7.1.  Scenario 1: Send Flow, Window Size Unconstrained
+
+            SMC Host A                             SMC Host B
+           RMBE A Info                            RMBE B Info
+       (Consumer Cursors)                      (Producer Cursors)
+   Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
+   4        0         0                  0    4        0          0
+   4        0         1 ---------------> 1    4        0          0
+                        RDMA-WR Data
+                          (4:1003)
+   4        0         2 ...............> 2    1004     0          0
+                        CDC Message
+
+        Figure 16: Scenario 1: Send Flow, Window Size Unconstrained
+
+   Scenario assumptions:
+
+   o  Kernel implementation.
+
+   o  New SMC-R connection; no data has been sent on the connection.
+
+   o  Host A: Application issues send for 1000 bytes to Host B.
+
+   o  Host B: RMBE receive buffer size is 10,000; application has issued
+      a recv for 10,000 bytes.
+
+   Flow description:
+
+   1. The application issues a send() for 1000 bytes; the SMC-R layer
+      copies data into a kernel send buffer.  It then schedules an RDMA
+      write operation to move the data into the peer's RMBE receive
+      buffer, at relative position 4-1003 (to skip the 4-byte
+      eye catcher in the RMBE data area).  Note that no immediate data
+      or alert (i.e., interrupt) is provided to Host B for this RDMA
+      operation.
+
+   2. Host A sends a CDC message to update the producer cursor to
+      byte 1004.  This CDC message will deliver an interrupt to Host B.
+      At this point, the SMC-R layer can return control back to the
+      application.  Host B, once notified of the completion of the
+      previous RDMA operation, locates the RMBE associated with the RMBE
+      alert token that was included in the message and proceeds to
+      perform normal receive-side processing, waking up the suspended
+      application read thread, copying the data into the application's
+      receive buffer, etc.  It will use the producer cursor as an
+      indicator of how much data is available to be delivered to the
+      local application.  After this processing is complete, the SMC-R
+      layer will also update its local consumer cursor to match the
+      producer cursor (i.e., indicating that all data has been
+      consumed).  Note that a message to the peer updating the consumer
+      cursor is not needed at this time, as the window size is
+      unconstrained (> 1/2 of the receive buffer size).  The window size
+      is calculated by taking the difference between the producer cursor
+      and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
+
+4.7.2.  Scenario 2: Send/Receive Flow, Window Size Unconstrained
+
+             SMC Host A                             SMC Host B
+            RMBE A Info                            RMBE B Info
+        (Consumer Cursors)                      (Producer Cursors)
+    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
+    4        0         0                  0    4        0          0
+    4        0         1 ---------------> 1    4        0          0
+                         RDMA-WR Data
+                           (4:1003)
+    4        0         2 ...............> 2    1004     0          0
+                         CDC Message
+
+    4        0         3 <--------------  3    1004     0          0
+                         RDMA-WR Data
+                           (4:503)
+    1004     0         4 <..............  4    1004     0          0
+                          CDC Message
+
+    Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained
+
+   Scenario assumptions:
+
+   o  New SMC-R connection; no data has been sent on the connection.
+
+   o  Host A: Application issues send for 1000 bytes to Host B.
+
+   o  Host B: RMBE receive buffer size is 10,000; application has
+      already issued a recv for 10,000 bytes.  Once the receive is
+      completed, the application sends a 500-byte response to Host A.
+
+   Flow description:
+
+   1. The application issues a send() for 1000 bytes; the SMC-R layer
+      copies data into a kernel send buffer.  It then schedules an RDMA
+      write operation to move the data into the peer's RMBE receive
+      buffer, at relative position 4-1003.  Note that no immediate data
+      or alert (i.e., interrupt) is provided to Host B for this RDMA
+      operation.
+
+   2. Host A sends a CDC message to update the producer cursor to
+      byte 1004.  This CDC message will deliver an interrupt to Host B.
+      At this point, the SMC-R layer can return control back to the
+      application.
+
+   3. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token and proceeds
+      to perform normal receive-side processing, waking up the suspended
+      application read thread, copying the data into the application's
+      receive buffer, etc.  After this processing is complete, the SMC-R
+      layer will also update its local consumer cursor to match the
+      producer cursor (i.e., indicating that all data has been
+      consumed).  Note that an update of the consumer cursor to the peer
+      is not needed at this time, as the window size is unconstrained
+      (> 1/2 of the receive buffer size).  The application then performs
+      a send() for 500 bytes to Host A.  The SMC-R layer will copy the
+      data into a kernel buffer and then schedule an RDMA write into the
+      partner's RMBE receive buffer.  Note that this RDMA write
+      operation includes no immediate data or notification to Host A.
+
+   4. Host B sends a CDC message to update the partner's RMBE control
+      information with the latest producer cursor (set to 504 and not
+      shown in the diagram above) and to also inform the peer that the
+      consumer cursor value is now 1004.  It also updates the local
+      current consumer cursor and the last sent consumer cursor to 1004.
+      This CDC message includes notification, since we are updating our
+      producer cursor; this requires attention by the peer host.
+
+4.7.3.  Scenario 3: Send Flow, Window Size Constrained
+
+             SMC Host A                             SMC Host B
+            RMBE A Info                            RMBE B Info
+        (Consumer Cursors)                      (Producer Cursors)
+    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
+    4        0         0                  0    4        0          0
+    4        0         1 ---------------> 1    4        0          0
+                         RDMA-WR Data
+                           (4:3003)
+    4        0         2 ...............> 2    3004     0          0
+                         CDC Message
+    4        0         3                  3    3004     0          0
+    4        0         4 ---------------> 4    3004     0          0
+                         RDMA-WR Data
+                           (3004:7003)
+    4        0         5 ...............> 5    7004     0          0
+                         CDC Message
+    7004     0         6 <............... 6    7004     0          0
+                         CDC Message
+
+         Figure 18: Scenario 3: Send Flow, Window Size Constrained
+
+   Scenario assumptions:
+
+   o  New SMC-R connection; no data has been sent on this connection.
+
+   o  Host A: Application issues send for 3000 bytes to Host B and then
+      another send for 4000 bytes.
+
+   o  Host B: RMBE receive buffer size is 10,000.  Application has
+      already issued a recv for 10,000 bytes.
+
+   Flow description:
+
+   1. The application issues a send() for 3000 bytes; the SMC-R layer
+      copies data into a kernel send buffer.  It then schedules an RDMA
+      write operation to move the data into the peer's RMBE receive
+      buffer, at relative position 4-3003.  Note that no immediate data
+      or alert (i.e., interrupt) is provided to Host B for this RDMA
+      operation.
+
+   2. Host A sends a CDC message to update its producer cursor to
+      byte 3004.  This CDC message will deliver an interrupt to Host B.
+      At this point, the SMC-R layer can return control back to the
+      application.
+
+   3. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token and proceeds
+      to perform normal receive-side processing, waking up the suspended
+      application read thread, copying the data into the application's
+      receive buffer, etc.  After this processing is complete, the SMC-R
+      layer will also update its local consumer cursor to match the
+      producer cursor (i.e., indicating that all data has been
+      consumed).  It will not, however, update the partner with this
+      information, as the window size is not constrained
+      (10,000 - 3000 = 7000 bytes of available space).  The application
+      on Host B also issues a new recv() for 10,000 bytes.
+
+   4. On Host A, the application issues a send() for 4000 bytes.  The
+      SMC-R layer copies the data into a kernel buffer and schedules an
+      async RDMA write into the peer's RMBE receive buffer at relative
+      position 3004-7003.  Note that no alert is provided to Host B for
+      this flow.
+
+   5. Host A sends a CDC message to update the producer cursor to
+      byte 7004.  This CDC message will deliver an interrupt to Host B.
+      At this point, the SMC-R layer can return control back to the
+      application.
+
+   6. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token and proceeds
+      to perform normal receive-side processing, waking up the suspended
+      application read thread, copying the data into the application's
+      receive buffer, etc.  After this processing is complete, the SMC-R
+      layer will also update its local consumer cursor to match the
+      producer cursor (i.e., indicating that all data has been
+      consumed).  It will then determine whether or not it needs to
+      update the consumer cursor to the peer.  The available window size
+      is now 3000 (10,000 - (producer cursor - last sent consumer
+      cursor)), which is < 1/2 of the receive buffer size
+      (10,000/2 = 5000), and the advance of the window size is > 10% of
+      the window size (1000).  Therefore, a CDC message is issued to
+      update the consumer cursor to Host A.
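+
+   The decision made in steps 3 and 6 can be sketched as follows,
+   using the window formula from step 6.  The function names are
+   hypothetical, the 1/2 and 10% thresholds are the ones quoted in
+   this scenario, and cursor wraparound is ignored for simplicity.
+
```python
def available_window(buf_size: int, producer: int,
                     last_sent_consumer: int) -> int:
    """Space the peer believes is free, given what it was last told."""
    return buf_size - (producer - last_sent_consumer)


def needs_consumer_update(buf_size: int, producer: int,
                          local_consumer: int,
                          last_sent_consumer: int) -> bool:
    """True when a CDC message updating the consumer cursor is warranted."""
    window = available_window(buf_size, producer, last_sent_consumer)
    advance = local_consumer - last_sent_consumer
    # Update only if the window is constrained (< 1/2 of the buffer) and
    # the update would open it by more than 10% of the buffer size.
    return window < buf_size // 2 and advance > buf_size // 10


# Step 3: 3000 bytes consumed; the peer still sees 7000 bytes of space.
step3 = needs_consumer_update(10_000, producer=3004,
                              local_consumer=3004, last_sent_consumer=4)
# Step 6: 7000 bytes consumed; the peer sees only 3000 bytes of space.
step6 = needs_consumer_update(10_000, producer=7004,
                              local_consumer=7004, last_sent_consumer=4)
```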
+
+4.7.4.  Scenario 4: Large Send, Flow Control, Full Window Size Writes
+
+             SMC Host A                             SMC Host B
+            RMBE A Info                            RMBE B Info
+        (Consumer Cursors)                      (Producer Cursors)
+    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
+    1004     1         0                  0    1004     1          0
+    1004     1         1 ---------------> 1    1004     1          0
+                         RDMA-WR Data
+                           (1004:9999)
+    1004     1         2 ---------------> 2    1004     1          0
+                         RDMA-WR Data
+                           (4:1003)
+    1004     1         3 ...............> 3    1004     2          Wrt
+                         CDC Message                               Blk
+
+    1004     2         4 <............... 4    1004     2          Wrt
+                         CDC Message                               Blk
+
+    1004     2         5 ---------------> 5    1004     2          Wrt
+                         RDMA-WR Data                              Blk
+                           (1004:9999)
+    1004     2         6 ---------------> 6    1004     2          Wrt
+                         RDMA-WR Data                              Blk
+                          (4:1003)
+    1004     2         7 ...............> 7    1004     3          Wrt
+                         CDC Message                               Blk
+
+    1004     3         8 <............... 8    1004     3          Wrt
+                         CDC Message                               Blk
+
+             Figure 19: Scenario 4: Large Send, Flow Control,
+                          Full Window Size Writes
+
+   Scenario assumptions:
+
+   o  Kernel implementation.
+
+   o  Existing SMC-R connection, Host B's receive window size is fully
+      open (peer consumer cursor = peer producer cursor).
+
+   o  Host A: Application issues send for 20,000 bytes to Host B.
+
+   o  Host B: RMBE receive buffer size is 10,000; application has issued
+      a recv for 10,000 bytes.
+
+   Flow description:
+
+   1. The application issues a send() for 20,000 bytes; the SMC-R layer
+      copies data into a kernel send buffer (assumes that send buffer
+      space of 20,000 is available for this connection).  It then
+      schedules an RDMA write operation to move the data into the peer's
+      RMBE receive buffer, at relative position 1004-9999.  Note that no
+      immediate data or alert (i.e., interrupt) is provided to Host B
+      for this RDMA operation.
+
+   2. Host A then schedules an RDMA write operation to fill the
+      remaining 1000 bytes of available space in the peer's RMBE receive
+      buffer, at relative position 4-1003.  Note that no immediate data
+      or alert (i.e., interrupt) is provided to Host B for this RDMA
+      operation.  Also note that an implementation of SMC-R may optimize
+      this processing by combining steps 1 and 2 into a single
+      RDMA write operation (with two different data sources).
+
+   3. Host A sends a CDC message to update the producer cursor to
+      byte 1004.  Since the entire receive buffer space is filled, the
+      producer writer blocked flag (the "Wrt Blk" indicator (flag) in
+      Figure 19) is set and the producer cursor wrap sequence number
+      (the producer "Wrap Seq#" in Figure 19) is incremented.  This CDC
+      message will deliver an interrupt to Host B.  At this point, the
+      SMC-R layer can return control back to the application.
+
+   4. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token and proceeds
+      to perform normal receive-side processing, waking up the suspended
+      application read thread, copying the data into the application's
+      receive buffer, etc.  In this scenario, Host B notices that the
+      producer cursor has not been advanced (same value as the consumer
+      cursor); however, it notices that the producer cursor wrap
+      sequence number is different from its local value (1), indicating
+      that a full window of new data is available.  All of the data in
+      the receive buffer can be processed, with the first segment
+      (1004-9999) followed by the second segment (4-1003).  Because the
+      producer writer blocked indicator was set, Host B schedules a CDC
+      message to update its latest information to the peer: consumer
+      cursor (1004), consumer cursor wrap sequence number (the current
+      value of 2 is used).
+
+   5. Host A, upon receipt of the CDC message, locates the TCP
+      connection associated with the alert token and, upon examining the
+      control information provided, notices that Host B has consumed all
+      of the data (based on the consumer cursor and the consumer cursor
+      wrap sequence number) and initiates the next RDMA write to fill
+      the receive buffer at offset 1004-9999.
+
+   6. Host A then moves the next 1000 bytes into the beginning of the
+      receive buffer (4-1003) by scheduling an RDMA write operation.
+      Note that at this point there are still 8 bytes remaining to be
+      written.
+
+   7. Host A then sends a CDC message to set the producer writer blocked
+      indicator and to increment the producer cursor wrap sequence
+      number (3).
+
+   8. Host B, upon notification, completes the same processing as step 4
+      above, including sending a CDC message to update the peer to
+      indicate that all data has been consumed.  At this point, Host A
+      can write the final 8 bytes to Host B's RMBE into
+      positions 1004-1011 (not shown).
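+
+   The way the 20,000-byte send is split can be reproduced with the
+   sketch below.  The function name is hypothetical, and the sketch
+   assumes that the receive window is fully open at the start of each
+   round, as in this scenario.  Each inner list is one round of RDMA
+   writes that is followed by a CDC message.
+
```python
EYE_CATCHER_LEN = 4  # bytes skipped at the start of the RMBE data area


def plan_send(cursor: int, total: int, buf_size: int):
    """Split a send into rounds of (first, last) inclusive write ranges."""
    usable = buf_size - EYE_CATCHER_LEN
    rounds = []
    remaining = total
    while remaining > 0:
        chunk = min(remaining, usable)  # at most one full window per round
        writes = []
        left = chunk
        while left > 0:
            # Write up to the end of the buffer, then wrap past the
            # eye catcher and continue from position 4.
            n = min(left, buf_size - cursor)
            writes.append((cursor, cursor + n - 1))
            cursor += n
            if cursor == buf_size:
                cursor = EYE_CATCHER_LEN
            left -= n
        rounds.append(writes)
        remaining -= chunk
    return rounds


rounds = plan_send(cursor=1004, total=20_000, buf_size=10_000)
```
+
+   For these inputs, the first two rounds each consist of the ranges
+   1004-9999 and 4-1003 (9996 bytes), and the third round writes the
+   final 8 bytes into 1004-1011.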
+
+4.7.5.  Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
+
+             SMC Host A                             SMC Host B
+            RMBE A Info                            RMBE B Info
+        (Consumer Cursors)                      (Producer Cursors)
+    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
+    1000     1         0                  0    1000     1          0
+    1000     1         1 ---------------> 1    1000     1          0
+                         RDMA-WR Data
+                           (1000:1499)
+    1000     1         2 ...............> 2    1500     1          UrgP
+                         CDC Message                               UrgA
+
+    1500     1         3 <............... 3    1500     1          UrgP
+                         CDC Message                               UrgA
+
+    1500     1         4 ---------------> 4    1500     1          UrgP
+                         RDMA-WR Data                              UrgA
+                           (1500:2499)
+    1500     1         5 ...............> 5    2500     1          0
+                         CDC Message
+
+      Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open
+
+   Scenario assumptions:
+
+   o  Kernel implementation.
+
+   o  Existing SMC-R connection; window size open (unconstrained); all
+      data has been consumed by receiver.
+
+   o  Host A: Application issues send for 500 bytes with urgent data
+      indicator (out of band) to Host B, then sends 1000 bytes of
+      normal data.
+
+   o  Host B: RMBE receive buffer size is 10,000; application has issued
+      a recv for 10,000 bytes and is also monitoring the socket for
+      urgent data.
+
+   Flow description:
+
+   1. The application issues a send() for 500 bytes of urgent data; the
+      SMC-R layer copies data into a kernel send buffer.  It then
+      schedules an RDMA write operation to move the data into the peer's
+      RMBE receive buffer, at relative position 1000-1499.  Note that no
+      immediate data or alert (i.e., interrupt) is provided to Host B
+      for this RDMA operation.
+
+   2. Host A sends a CDC message to update its producer cursor to
+      byte 1500 and to turn on the producer Urgent Data Pending (UrgP)
+      and Urgent Data Present (UrgA) flags.  This CDC message will
+      deliver an interrupt to Host B.  At this point, the SMC-R layer
+      can return control back to the application.
+
+   3. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token, notices
+      that the Urgent Data Pending flag is on, and proceeds with out-of-
+      band socket API notification -- for example, satisfying any
+      outstanding select() or poll() requests on the socket by
+      indicating that urgent data is pending (i.e., by setting the
+      exception bit on).  The urgent data present indicator allows
+      Host B to also determine the position of the urgent data (the
+      producer cursor points 1 byte beyond the last byte of urgent
+      data).  Host B can then perform normal receive-side processing
+      (including specific urgent data processing), copying the data into
+      the application's receive buffer, etc.  Host B then sends a CDC
+      message to update the partner's RMBE control area with its latest
+      consumer cursor (1500).  Note that this CDC message must occur,
+      regardless of the current local window size that is available.
+      The partner host (Host A) cannot initiate any additional RDMA
+      writes until it receives acknowledgment that the urgent data has
+      been processed (or at least processed/remembered at the SMC-R
+      layer).
+
+   4. Upon receipt of the message, Host A wakes up, sees that the peer
+      consumed all data up to and including the last byte of urgent
+      data, and now resumes sending any pending data.  In this case, the
+      application had previously issued a send for 1000 bytes of normal
+      data, which would have been copied in the send buffer, and control
+      would have been returned to the application.  Host A now initiates
+      an RDMA write to move that data to the peer's receive buffer at
+      position 1500-2499.
+
+   5. Host A then sends a CDC message to update its producer cursor
+      value (2500) and to turn off the Urgent Data Pending and Urgent
+      Data Present flags.  Host B wakes up, processes the new data
+      (resumes application, copies data into the application receive
+      buffer), and then proceeds to update the local current consumer
+      cursor (2500).  Given that the window size is unconstrained, there
+      is no need for a consumer cursor update in the peer's RMBE.
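+
+   The five steps above can be replayed with a toy, single-process
+   model.  The Python sketch below is illustrative only: the names Cdc,
+   HostB, and scenario_5 are assumptions of this example, buffer wrap
+   is ignored, and method calls stand in for the RDMA writes and CDC
+   messages that a real implementation would use.
+
```python
from dataclasses import dataclass


@dataclass
class Cdc:
    """Simplified stand-in for a CDC message (not the wire format)."""
    producer_cursor: int
    urg_pending: bool = False
    urg_present: bool = False


class HostB:
    """Receiver side: an RMBE plus the cursors Host B tracks."""

    def __init__(self, size=10_000, start=1000):
        self.rmbe = bytearray(size)
        self.producer = start   # last producer cursor announced by Host A
        self.consumer = start   # everything before this has been consumed
        self.urgent_at = None   # offset of the last urgent byte, once known

    def rdma_write(self, offset, data):
        # Steps 1 and 4: data lands in the RMBE without any interrupt.
        self.rmbe[offset:offset + len(data)] = data

    def on_cdc(self, msg):
        # Steps 3 and 5: the CDC message is what wakes up the receiver.
        self.producer = msg.producer_cursor
        if msg.urg_present:
            # The producer cursor points 1 byte beyond the urgent data.
            self.urgent_at = msg.producer_cursor - 1
        data = bytes(self.rmbe[self.consumer:self.producer])
        self.consumer = self.producer
        # With urgent data, the consumer cursor must always be sent back;
        # otherwise (window open) no update is needed.
        ack = self.consumer if msg.urg_present else None
        return data, ack


def scenario_5():
    host_b = HostB()
    log = []

    # Steps 1-2: 500 urgent bytes at 1000-1499, then CDC with UrgP + UrgA.
    host_b.rdma_write(1000, b"U" * 500)
    data, ack = host_b.on_cdc(Cdc(1500, urg_pending=True, urg_present=True))
    log.append((len(data), ack))

    # Step 4: Host A resumes only after the urgent data is acknowledged.
    if ack != 1500:
        raise RuntimeError("urgent data not yet consumed by Host B")
    host_b.rdma_write(1500, b"N" * 1000)

    # Step 5: CDC with producer cursor 2500 and both urgent flags off.
    data, ack = host_b.on_cdc(Cdc(2500))
    log.append((len(data), ack))
    return host_b, log
```
+
+   Here scenario_5() returns the receiver state together with, for each
+   of the two CDC messages, the number of bytes consumed and the
+   consumer cursor that Host B sends back (if any).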
+
+4.7.6.  Scenario 6: Send Flow, Urgent Data, Window Size Closed
+
+             SMC Host A                             SMC Host B
+            RMBE A Info                            RMBE B Info
+        (Consumer Cursors)                      (Producer Cursors)
+    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
+    1000     1         0                  0    1000     2          Wrt
+                                                                   Blk
+
+    1000     1         1 ...............> 1    1000     2          Wrt
+                         CDC Message                               Blk
+                                                                   UrgP
+
+    1000     2         2 <............... 2    1000     2          Wrt
+                         CDC Message                               Blk
+                                                                   UrgP
+
+    1000     2         3 ---------------> 3    1000     2          Wrt
+                         RDMA-WR Data                              Blk
+                           (1000:1499)                             UrgP
+
+    1000     2         4 ...............> 4    1500     2          UrgP
+                         CDC Message                               UrgA
+
+    1500     2         5 <............... 5    1500     2          UrgP
+                         CDC Message                               UrgA
+
+    1500     2         6 ---------------> 6    1500     2          UrgP
+                         RDMA-WR Data                              UrgA
+                           (1500:2499)
+    1500     2         7 ...............> 7    2500     2          0
+                         CDC Message
+
+     Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed
+
+   Scenario assumptions:
+
+   o  Kernel implementation.
+
+   o  Existing SMC-R connection; window size closed; writer is blocked.
+
+   o  Host A: Application issues send for 500 bytes with urgent data
+      indicator (out of band) to Host B, then sends 1000 bytes of
+      normal data.
+
+   o  Host B: RMBE receive buffer size is 10,000; application has no
+      outstanding recv() (for normal data) and is monitoring the socket
+      for urgent data.
+
+   Flow description:
+
+   1. The application issues a send() for 500 bytes of urgent data; the
+      SMC-R layer copies data into a kernel send buffer (if available).
+      Since the writer is blocked (window size closed), it cannot send
+      the data immediately.  It then sends a CDC message to notify the
+      peer of the Urgent Data Pending (UrgP) indicator (the writer
+      blocked indicator remains on as well).  This serves as a signal to
+      Host B that urgent data is pending in the stream.  Control is also
+      returned to the application at this point.
+
+   2. Host B, once notified of the receipt of the previous CDC message,
+      locates the RMBE associated with the RMBE alert token, notices
+      that the Urgent Data Pending flag is on, and proceeds with out-of-
+      band socket API notification -- for example, satisfying any
+      outstanding select() or poll() requests on the socket by
+      indicating that urgent data is pending (i.e., by setting the
+      exception bit on).  At this point, it is expected that the
+      application will enter urgent data mode processing, expeditiously
+      processing all normal data (by issuing recv API calls) so that it
+      can get to the urgent data byte.  Whether the application has this
+      urgent mode processing or not, at some point, the application will
+      consume some or all of the pending data in the receive buffer.
+      When this occurs, Host B will also send a CDC message to update
+      its consumer cursor and consumer cursor wrap sequence number to
+      the peer.  In the example above, a full window's worth of data was
+      consumed.
+
+   3. Host A, once awakened by the message, will notice that the window
+      size is now open on this connection (based on the consumer cursor
+      and the consumer cursor wrap sequence number, which now matches
+      the producer cursor wrap sequence number) and resume sending of
+      the urgent data segment by scheduling an RDMA write into relative
+      position 1000-1499.
+
+   4. Host A then sends a CDC message to advance its producer cursor
+      (1500) and to also notify Host B of the Urgent Data Present (UrgA)
+      indicator (and turn off the writer blocked indicator).  This
+      signals to Host B that the urgent data is now in the local receive
+      buffer and that the producer cursor points 1 byte beyond the last
+      byte of urgent data.
+
+   5. Host B wakes up, processes the urgent data, and, once the urgent
+      data is consumed, sends a CDC message to update its consumer
+      cursor (1500).
+
+   6. Host A wakes up, sees that Host B has consumed the sequence number
+      associated with the urgent data, and then initiates the next RDMA
+      write operation to move the 1000 bytes associated with the next
+      send() of normal data into the peer's receive buffer at
+      position 1500-2499.  Note that the send API would have likely
+      completed earlier in the process by copying the 1000 bytes into a
+      send buffer and returning back to the application, even though we
+      could not send any new data until the urgent data was processed
+      and acknowledged by Host B.
+
+   7. Host A sends a CDC message to advance its producer cursor to 2500
+      and to reset the Urgent Data Pending and Urgent Data Present
+      flags.  Host B wakes up and processes the inbound data.
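+
+   The window check that Host A performs in steps 1 and 3 can be
+   sketched from the cursor and wrap sequence number pairs shown in
+   Figure 21.  The Python helper below is a minimal illustration; the
+   function name free_space and the assumption that the producer is
+   never more than one wrap ahead of the consumer are not taken from
+   this document.
+
```python
def free_space(size, prod_cursor, prod_wrap, cons_cursor, cons_wrap):
    """Bytes the producer may still write into the peer's RMBE.

    Assumes the producer is at most one buffer wrap ahead of the consumer.
    """
    if prod_wrap == cons_wrap:
        unread = prod_cursor - cons_cursor
    else:
        # The producer has wrapped once more than the consumer.
        unread = size - cons_cursor + prod_cursor
    return size - unread


SIZE = 10_000  # RMBE receive buffer size used in the scenario

# Time 0: producer (1000, wrap 2), consumer (1000, wrap 1): window closed.
closed = free_space(SIZE, 1000, 2, 1000, 1)

# Time 2: consumer wrap number now matches the producer's: window open.
opened = free_space(SIZE, 1000, 2, 1000, 2)

# Time 4: 500 urgent bytes written, producer cursor advanced to 1500.
after_urgent = free_space(SIZE, 1500, 2, 1000, 2)
```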
+
+4.8.  Connection Termination
+
+   Just as SMC-R connections are established using a combination of TCP
+   connection establishment flows and SMC-R protocol flows, the
+   termination of SMC-R connections also uses a similar combination of
+   SMC-R protocol termination flows and normal TCP connection
+   termination flows.  The following sections describe the SMC-R
+   protocol normal and abnormal connection termination flows.
+
+4.8.1.  Normal SMC-R Connection Termination Flows
+
+   Normal SMC-R connection termination flows are triggered via the
+   normal stream socket API semantics, namely by the application issuing
+   a close() or shutdown() API.  Most applications, after consuming all
+   incoming data and after sending any outbound data, will then issue a
+   close() API to indicate that they are done both sending and receiving
+   data.  Some applications, typically a small percentage, make use of
+   the shutdown() API, which allows them to indicate that the
+   application is done sending data, receiving data, or both sending and
+   receiving data.  The main use of this API is in scenarios where a TCP
+   application wants to alert its partner endpoint that it is done
+   sending data but is still receiving data on its socket (shutdown for
+   write).  Issuing shutdown() for both sending and receiving data is
+   really no different from issuing a close() and can therefore be
+   treated in a similar fashion.  Shutdown for read is typically not a
+   very useful operation and in normal circumstances does not trigger
+   any network flows to notify the partner TCP endpoint of this
+   operation.
+
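+   The shutdown-for-write case can be observed with ordinary stream
+   sockets.  The following Python sketch uses a local socket pair from
+   the standard library rather than an SMC-R connection (an assumption
+   made so that the example is self-contained); because SMC-R preserves
+   socket semantics, the application-visible behavior is intended to be
+   the same.  The function name half_close_demo is illustrative.
+
```python
import socket


def half_close_demo():
    a, b = socket.socketpair()
    try:
        a.sendall(b"request")
        a.shutdown(socket.SHUT_WR)   # "done sending", still receiving
        seen = b.recv(1024)          # data sent before the shutdown
        eof = b.recv(1024)           # b"" means the peer is done writing
        b.sendall(b"reply")          # the other direction remains usable
        reply = a.recv(1024)
        return seen, eof, reply
    finally:
        a.close()
        b.close()
```
+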
+   These same trigger points will be used by the SMC-R layer to initiate
+   SMC-R connection termination flows.  The main design point for normal
+   SMC-R connection termination flows is to first use the SMC-R protocol
+   to shut down the SMC-R connection and free up any SMC-R RDMA
+   resources, and then allow the normal TCP connection termination
+   protocol (i.e., FIN processing) to drive cleanup of the TCP
+   connection.  This design point is very important in ensuring that
+   RDMA resources such as the RMBEs are only freed and reused when both
+   SMC-R endpoints are completely done with their RDMA write operations
+   to the partner's RMBE.
+
+                                      1
+                            +-----------------+
+            |-------------->|     CLOSED      |<-------------|
+        3D  |               |                 |              |  4D
+            |               +-----------------+              |
+            |                       |                        |
+            |                     2 |                        |
+            |                       V                        |
+    +----------------+     +-----------------+     +----------------+
+    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
+    |                |     |                 |     |                |
+    +----------------+     +-----------------+     +----------------+
+            |                   |         |                   |
+            |     Active Close  | 3A | 4A |  Passive Close    |
+            |                   V    |    V                   |
+            |       +--------------+ | +-------------+        |
+            |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
+        3C  |       |              | | |             |        |  4C
+            |       +--------------+ | +-------------+        |
+            |             |          |         |              |
+            |             | 3B       |     4B  |              |
+            |             V          |         V              |
+            |       +--------------+ | +-------------+        |
+            |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
+                    |              | | |             |
+                    +--------------+ | +-------------+
+                                     |
+                                     |
+
+                    Figure 22: SMC-R Connection States
+
+   Figure 22 describes the states that an SMC-R connection typically
+   goes through.  Note that there are variations to these states that
+   can occur when an SMC-R connection is abnormally terminated, much as
+   when a TCP connection is reset.  The following are the high-level
+   state transitions for an SMC-R connection:
+
+   1. An SMC-R connection begins in the Closed state.  This state is
+      meant to reflect an RMBE that is not currently in use (was
+      previously in use but no longer is, or was never allocated).
+
+   2. An SMC-R connection progresses to the Active state once the SMC-R
+      Rendezvous processing has successfully completed, RMB element
+      indices have been exchanged, and SMC-R links have been activated.
+      In this state, the TCP connection is fully established, rendezvous
+      processing has been completed, and SMC-R peers can begin the
+      exchange of data via RDMA.
+
+   3. Active close processing (on the SMC-R peer that is initiating the
+      connection termination).
+
+      A. When an application on one of the SMC-R connection peers issues
+         a close(), a shutdown() for write, or a shutdown() for both
+         read and write, the SMC-R layer on that host will initiate
+         SMC-R connection termination processing.  First, if a close()
+         or shutdown(both) is issued, it will check to see that there's
+         no data in the local RMB element that has not been read by the
+         application.  If unread data is detected, the SMC-R connection
+         must be abnormally reset; for more details on this, refer to
+         Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows").
+         If no unread data is pending, it then checks to see whether or
+         not any outstanding data is waiting to be written to the peer,
+         or if any outstanding RDMA writes for this SMC-R connection
+         have not yet completed.  If either of these two scenarios is
+         true, an indicator that this connection is in a pending close
+         state is saved in internal data structures representing this
+         SMC-R connection, and control is returned to the application.
+         If all data to be written to the partner has completed, this
+         peer will send a CDC message to notify the peer of either the
+         PeerConnectionClosed indicator (close or shutdown for both was
+         issued) or the PeerDoneWriting indicator.  This will provide an
+         interrupt to inform the partner SMC-R peer that the connection
+         is terminating.  At this point, the local side of the SMC-R
+         connection transitions to the PeerCloseWait1 state, and control
+         can be returned to the application.  If this process could not
+         be completed synchronously (the pending close condition
+         mentioned above), it is completed when all RDMA writes for data
+         and control cursors have been completed.
+
+      B. At some point, the SMC-R peer application (passive close) will
+         consume all incoming data, realize that the partner is done
+         sending data on this connection, and proceed to initiate its
+         own close of the connection once it has completed sending all
+         data from its end.  The partner application can initiate this
+         connection termination processing via close() or shutdown()
+         APIs.  If the application does so by issuing a shutdown() for
+         write, then the partner SMC-R layer will send a CDC message to
+         notify the peer (the active close side) of the PeerDoneWriting
+         indicator.  When the "active close" SMC-R peer wakes up as a
+         result of the previous CDC message, it will notice that the
+         PeerDoneWriting indicator is now on and transition to the
+         PeerCloseWait2 state.  This state indicates that the peer is
+         done sending data and may still be reading data.  At this
+         point, the "active close" peer will also need to ensure that
+         any outstanding recv() calls for this socket are woken up and
+         remember that no more data is forthcoming on this connection
+         (in case the local connection was shutdown() for write only).
+
+      C. This flow is a common transition from 3A or 3B above.  When the
+         SMC-R peer (passive close) consumes all data and updates all
+         necessary cursors to the peer, and the application closes its
+         socket (close or shutdown for both), it will send a CDC message
+         to the peer (the active close side) with the
+         PeerConnectionClosed indicator set.  At this point, the
+         connection can transition back to the Closed state if the local
+         application has already closed (or issued shutdown for both)
+         the socket.  Once in the Closed state, the RMBE can now be
+         safely reused for a new SMC-R connection.  When the
+         PeerConnectionClosed indicator is turned on, the SMC-R peer is
+         indicating that it is done updating the partner's RMBE.
+
+      D. Conditional state: If the local application has not yet issued
+         a close() or shutdown(both), we need to wait until the
+         application does so.  Once it does, the local host will send a
+         CDC message to notify the peer of the PeerConnectionClosed
+         indicator and then transition to the Closed state.
+
+   4. Passive close processing (on the SMC-R peer that receives an
+      indication that the partner is closing the connection).
+
+      A. Upon receipt of a CDC message, the SMC-R layer will detect that
+         the PeerConnectionClosed indicator or PeerDoneWriting indicator
+         is on.  If any outstanding recv() calls are pending, they are
+         completed with an indicator that the partner has closed the
+         connection (zero-length data presented to the application).  If
+         there is any pending data to be written and
+         PeerConnectionClosed is on, then an SMC-R connection reset must
+         be performed.  The connection then enters the AppCloseWait1
+         state on the passive close side waiting for the local
+         application to initiate its own close processing.
+
+      B. If the local application issues a shutdown() for writing, then
+         the SMC-R layer will send a CDC message to notify the partner
+         of the PeerDoneWriting indicator and then transition the local
+         side of the SMC-R connection to the AppCloseWait2 state.
+
+      C. When the application issues a close() or shutdown() for both,
+         the local SMC-R peer will send a message informing the peer of
+         the PeerConnectionClosed indicator and transition to the Closed
+         state if the remote peer has also sent the local peer the
+         PeerConnectionClosed indicator.  If the peer has not sent the
+         PeerConnectionClosed indicator, we transition into the
+         PeerFinCloseWait state.
+
+      D. The local SMC-R connection stays in the PeerFinCloseWait state
+         until the peer sends the PeerConnectionClosed indicator in a
+         CDC message.  When the indicator is received, we transition to
+         the Closed state and are then free to reuse this RMBE.
+
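+   The transitions listed above can be summarized as a small state
+   machine.  The Python sketch below is one possible reading of
+   Figure 22, not a normative implementation: the class, method, and
+   flag names are assumptions of this example, and data transfer,
+   pending-close handling, and resets are left out.
+
```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


PEER_WAIT = (State.PEER_CLOSE_WAIT1, State.PEER_CLOSE_WAIT2)
APP_WAIT = (State.APP_CLOSE_WAIT1, State.APP_CLOSE_WAIT2)


class Conn:
    def __init__(self):
        self.state = State.CLOSED
        self.local_closed = False  # close() or shutdown(both) was issued
        self.peer_closed = False   # PeerConnectionClosed was received
        self.sent = []             # indicators sent in CDC messages

    def established(self):                       # transition 2
        if self.state is State.CLOSED:
            self.state = State.ACTIVE

    def local_shutdown_write(self):              # 3A or 4B
        if self.state is State.ACTIVE:
            self.sent.append("PeerDoneWriting")
            self.state = State.PEER_CLOSE_WAIT1
        elif self.state is State.APP_CLOSE_WAIT1:
            self.sent.append("PeerDoneWriting")
            self.state = State.APP_CLOSE_WAIT2

    def local_close(self):                       # 3A, 3D, or 4C
        if self.local_closed or self.state is State.CLOSED:
            return
        self.local_closed = True
        self.sent.append("PeerConnectionClosed")
        if self.state is State.ACTIVE:
            self.state = State.PEER_CLOSE_WAIT1
        elif self.state is State.APP_FIN_CLOSE_WAIT:
            self.state = State.CLOSED
        elif self.state in APP_WAIT:
            self.state = (State.CLOSED if self.peer_closed
                          else State.PEER_FIN_CLOSE_WAIT)
        # In PeerCloseWait1/2 only the flag changes (an interpretation).

    def peer_done_writing(self):                 # 3B or 4A
        if self.state is State.ACTIVE:
            self.state = State.APP_CLOSE_WAIT1
        elif self.state is State.PEER_CLOSE_WAIT1:
            self.state = State.PEER_CLOSE_WAIT2

    def peer_connection_closed(self):            # 3C, 4A, or 4D
        self.peer_closed = True
        if self.state is State.ACTIVE:
            self.state = State.APP_CLOSE_WAIT1
        elif self.state in PEER_WAIT:
            self.state = (State.CLOSED if self.local_closed
                          else State.APP_FIN_CLOSE_WAIT)
        elif self.state is State.PEER_FIN_CLOSE_WAIT:
            self.state = State.CLOSED
```
+
+   For example, calling established(), local_close(), and then
+   peer_connection_closed() walks the active close path (2, 3A, 3C) and
+   ends in the Closed state.
+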
+   Note that each SMC-R peer needs to provide some logic that will
+   prevent being stranded in a termination state indefinitely.  For
+   example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
+   state waiting for the remote SMC-R peer to update its connection
+   termination status, it needs to provide a timer that will prevent it
+   from waiting in that state indefinitely should the remote SMC-R peer
+   not respond to this termination request.  This could occur in error
+   scenarios -- for example, if the remote SMC-R peer suffered a failure
+   prior to being able to respond to the termination request or the
+   remote application is not responding to this connection termination
+   request by closing its own socket.  This latter scenario is similar
+   to the TCP FINWAIT2 state, which has been known to sometimes cause
+   issues when remote TCP/IP hosts lose track of established connections
+   and neglect to close them.  Even though the TCP standards do not
+   mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
+   implementations assign a timeout for this state.  A similar timeout
+   will be required for SMC-R connections.  When this timeout occurs,
+   the local SMC-R peer performs TCP reset processing for this
+   connection.  However, no additional RDMA writes to the partner RMBE
+   can occur at this point (we have already indicated that we are done
+   updating the peer's RMBE).  After the TCP connection is reset, the
+   RMBE can be returned to the free pool for reallocation.  See
+   Section 4.4.2 for more details.
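+
+   A termination timer of this kind might be sketched as follows.  The
+   Python class below is an illustration under stated assumptions: the
+   name CloseTimer, the set of guarded states, and the 30-second value
+   in the usage lines are hypothetical, since this document does not
+   specify a timeout value.  On expiry, the caller would perform the
+   TCP reset processing described above.
+
```python
import time

# States in which the local side waits on the peer (assumed set).
WAIT_STATES = {"PeerCloseWait1", "PeerCloseWait2", "PeerFinCloseWait"}


class CloseTimer:
    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so the timer can be tested
        self.deadline = None

    def on_state(self, state):
        """Arm on first entry to a wait state; disarm when leaving."""
        if state in WAIT_STATES:
            if self.deadline is None:
                self.deadline = self.clock() + self.timeout_s
        else:
            self.deadline = None

    def expired(self):
        return self.deadline is not None and self.clock() >= self.deadline


# Usage with a fake clock; 30 seconds is an arbitrary example value.
now = [0.0]
timer = CloseTimer(30.0, clock=lambda: now[0])
timer.on_state("PeerCloseWait1")
now[0] = 29.0
timer.on_state("PeerCloseWait2")   # deadline is not extended
still_waiting = not timer.expired()
now[0] = 30.0
timed_out = timer.expired()
```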
+
+   Also note that it is possible to have two SMC-R endpoints initiate an
+   Active close concurrently.  In that scenario, the flows above still
+   apply; however, both endpoints follow the active close path (path 3).
+
+4.8.2.  Abnormal SMC-R Connection Termination Flows
+
+   Abnormal SMC-R connection termination can occur for a variety of
+   reasons, including the following:
+
+   o  The TCP connection associated with an SMC-R connection is reset.
+      In TCP, either endpoint can send a RST segment to abort an
+      existing TCP connection when error conditions are detected for the
+      connection or the application overtly requests that the connection
+      be reset.
+
+   o  Normal SMC-R connection termination processing has unexpectedly
+      stalled for a given connection.  When the stall is detected
+      (connection termination timeout condition), an abnormal SMC-R
+      connection termination flow is initiated.
+
+   In these scenarios, it is very important that resources associated
+   with the affected SMC-R connections are properly cleaned up to ensure
+   that there are no orphaned resources and that resources can reliably
+   be reused for new SMC-R connections.  Given that SMC-R relies heavily
+   on the RDMA write processing, special care needs to be taken to
+   ensure that an RMBE is no longer being used by an SMC-R peer before
+   logically reassigning that RMBE to a new SMC-R connection.
+
+   When an SMC-R peer initiates a TCP connection reset, it also
+   initiates an SMC-R abnormal connection termination flow at the same
+   time.  The SMC-R peers explicitly signal their intent to abnormally
+   terminate an SMC-R connection and await explicit acknowledgment that
+   the peer has received this notification and has also completed
+   abnormal connection termination on its end.  Note that TCP connection
+   reset processing can occur in parallel to these flows.
+
+                            +-----------------+
+            |-------------->|     CLOSED      |<-------------|
+            |               |                 |              |
+            |               +-----------------+              |
+            |                                                |
+            |                                                |
+            |                                                |
+            |           +-----------------------+            |
+            |           |     Any state         |            |
+            |1B         | (before setting       |          2B|
+            |           |  PeerConnectionClosed |            |
+            |           |  indicator in         |            |
+            |           |  peer's RMBE)         |            |
+            |           +-----------------------+            |
+            |         1A        |         |      2A          |
+            |     Active Abort  |         |  Passive Abort   |
+            |                   V         V                  |
+            |       +--------------+   +--------------+      |
+            |-------|PeerAbortWait |   | Process Abort|------|
+                    |              |   |              |
+                    +--------------+   +--------------+
+
+      Figure 23: SMC-R Abnormal Connection Termination State Diagram
+
+   Figure 23 above shows the SMC-R abnormal connection termination state
+   diagram:
+
+   1. Active abort designates the SMC-R peer that is initiating the TCP
+      RST processing.  At the time that the TCP RST is sent, the active
+      abort side must also do the following:
+
+      A. Send the PeerConnAbort indicator to the partner in a CDC
+         message, and then transition to the PeerAbortWait state.
+         During this state, it will monitor this SMC-R connection
+         waiting for the peer to send its corresponding PeerConnAbort
+         indicator but will ignore any other activity in this connection
+         (i.e., new incoming data).  It will also generate an
+         appropriate error to any socket API calls issued against this
+         socket (e.g., ECONNABORTED, ECONNRESET).
+
+      B. Once the peer sends the PeerConnAbort indicator to the local
+         host, the local host can transition this SMC-R connection to
+         the Closed state and reuse this RMBE.  Note that the SMC-R peer
+         that enters the PeerAbortWait state must provide some
+         protection against staying in that state indefinitely should
+         the remote SMC-R peer not respond by sending its own
+         PeerConnAbort indicator to the local host.  While this should
+         be a rare scenario, it could occur if the remote SMC-R peer
+         (passive abort) suffered a failure right after the local SMC-R
+         peer (active abort) sent the PeerConnAbort indicator.  To
+         protect against these types of failures, a timer can be set
+         after entering the PeerAbortWait state, and if that timer pops
+         before the peer has sent its local PeerConnAbort indicator (to
+         the active abort side), this RMBE can be returned to the free
+         pool for possible reallocation.  See Section 4.4.2 for more
+         details.
+
+   2. Passive abort designates the SMC-R peer that receives an SMC-R
+      abort request, signaled by the PeerConnAbort indicator that the
+      partner sends in a CDC message.  Upon receiving this request, the
+      local peer must do the following:
+
+      A. Using the appropriate error codes, indicate to the socket
+         application that this connection has been aborted, and then
+         purge all in-flight data for this connection that is waiting to
+         be read or waiting to be sent.
+
+      B. Send a CDC message to notify the peer of the PeerConnAbort
+         indicator and, once that is completed, transition this RMBE to
+         the Closed state.
+
+   If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
+   also initiates SMC-R abnormal connection termination processing if it
+   has not already been notified (via the PeerConnAbort indicator) that
+   the partner is severing the connection.  It is possible to have two
+   SMC-R endpoints concurrently be in an active abort role for a given
+   connection.  In that scenario, the flows above still apply but both
+   endpoints take the active abort path (path 1).
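+
+   The abort handshake of Figure 23 can be sketched in the same spirit.
+   The Python class below is a toy model: the class and method names,
+   the choice of which errno value each side reports, and the use of
+   plain strings for states are assumptions of this example.
+
```python
import errno


class AbortableConn:
    """Toy model of one endpoint in Figure 23."""

    def __init__(self):
        self.state = "Active"      # any state before PeerConnectionClosed
        self.sent = []             # indicators sent to the peer
        self.socket_error = None   # error reported to socket API calls
        self.ignored_bytes = 0

    def local_abort(self):
        # Path 1A: sent along with the TCP RST.
        if self.state in ("Closed", "PeerAbortWait"):
            return
        self.sent.append("PeerConnAbort")
        self.socket_error = errno.ECONNABORTED
        self.state = "PeerAbortWait"

    def incoming_data(self, nbytes):
        # In PeerAbortWait, activity other than PeerConnAbort is ignored.
        if self.state == "PeerAbortWait":
            self.ignored_bytes += nbytes

    def peer_abort(self):
        if self.state == "PeerAbortWait":
            # Path 1B: the peer's indicator completes the active abort.
            self.state = "Closed"
        elif self.state != "Closed":
            # Paths 2A and 2B: report the error, purge, answer, close.
            self.socket_error = errno.ECONNRESET
            self.sent.append("PeerConnAbort")
            self.state = "Closed"

    def abort_timer_expired(self):
        # Protection against a peer that never answers.
        if self.state == "PeerAbortWait":
            self.state = "Closed"
```
+
+   When both endpoints call local_abort() before seeing the other
+   side's indicator, each one sends PeerConnAbort exactly once and both
+   reach the Closed state through path 1.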
+
+4.8.3.  Other SMC-R Connection Termination Conditions
+
+   The following are additional conditions that have implications for
+   SMC-R connection termination:
+
+   o  An SMC-R peer being gracefully shut down.  If an SMC-R peer
+      supports a graceful shutdown operation, it should attempt to
+      terminate all SMC-R connections as part of shutdown processing.
+      This could be accomplished via LLC DELETE LINK requests on all
+      active SMC-R links.
+
+   o  Abnormal termination of an SMC-R peer.  In this case, there may
+      be no opportunity for the host to perform any SMC-R cleanup
+      processing.  In this scenario, it is up to the remote peer to
+      detect a RoCE communications failure with the failing host.  This
+      could trigger SMC-R link switchover, but that would also generate
+      RoCE errors, causing the remote host to eventually terminate all
+      existing SMC-R connections to this peer.
+
+   o  Loss of RoCE connectivity between two SMC-R peers.  If two peers
+      are no longer reachable across any links in their SMC-R link
+      group, then both peers perform a TCP reset for the connections,
+      generate an error to the local applications, and free up all QP
+      resources associated with the link group.
+
+5.  Security Considerations
+
+5.1.  VLAN Considerations
+
+   The concepts and access control of virtual LANs (VLANs) must be
+   extended to also cover the RoCE network traffic flowing across the
+   Ethernet.
+
+   The RoCE VLAN configuration and access permissions must mirror the IP
+   VLAN configuration and access permissions over the Converged Enhanced
+   Ethernet fabric.  This means that hosts, routers, and switches that
+   have access to specific VLANs on the IP fabric must also have the
+   same VLAN access across the RoCE fabric.  In other words, the SMC-R
+   connectivity will follow the same virtual network access permissions
+   as normal TCP/IP traffic.
+
+5.2.  Firewall Considerations
+
+   As mentioned above, the RoCE fabric inherits the same VLAN
+   topology/access as the IP fabric.  RoCE is a Layer 2 protocol that
+   requires both endpoints to reside in the same Layer 2 network (i.e.,
+   VLAN).  RoCE traffic cannot traverse multiple VLANs, as there is no
+   support for routing RoCE traffic beyond a single VLAN.  As a result,
+   SMC-R communications will also be confined to peers that are members
+   of the same VLAN.  IP-based firewalls are typically inserted between
+   VLANs (or physical LANs) and rely on normal IP routing to insert
+   themselves in the data path.  Since RoCE (and by extension SMC-R) is
+   not routable beyond the local VLAN, there is no ability to insert a
+   firewall in the network path of two SMC-R peers.
+
+5.3.  Host-Based IP Filters
+
+   Because SMC-R maintains the TCP three-way handshake for connection
+   setup before switching to RoCE out of band, existing IP filters that
+   control connection setup flows remain effective in an SMC-R
+   environment.  IP filters that operate on traffic flowing in an active
+   TCP connection are not supported, because the connection data does
+   not flow over IP.
+
+5.4.  Intrusion Detection Services
+
+   Similar to IP filters, intrusion detection services that operate on
+   TCP connection setups are compatible with SMC-R with no changes
+   required.  However, once the TCP connection has switched to RoCE out
+   of band, packets are not available for examination.
+
+5.5.  IP Security (IPsec)
+
+   IP security is not compatible with SMC-R, because there are no IP
+   packets on which to operate.  TCP connections that require IP
+   security must opt out of SMC-R.
+
+5.6.  TLS/SSL
+
+   Transport Layer Security/Secure Sockets Layer (TLS/SSL) is preserved
+   in an SMC-R environment.  The TLS/SSL layer resides above the SMC-R
+   layer, and outgoing connection data is encrypted before being passed
+   down to the SMC-R layer for RDMA write.  Similarly, incoming
+   connection data goes through the SMC-R layer encrypted and is
+   decrypted by the TLS/SSL layer, just as it is over TCP/IP today.
+
+   The TLS/SSL handshake messages flow over the TCP connection after the
+   connection has switched to SMC-R, and so they are exchanged using
+   RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.
+
+6.  IANA Considerations
+
+   The scarcity of TCP option codes available for assignment is
+   understood, and this architecture uses experimental TCP options
+   following the conventions of [RFC6994] ("Shared Use of Experimental
+   TCP Options").
+
+   TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment
+   Identifier.  See Section 3.1.
+
+   If this protocol achieves wide acceptance, a discrete option code may
+   be requested by subsequent versions of this protocol.
+
+7.  Normative References
+
+   [RFC793]   Postel, J., "Transmission Control Protocol", STD 7,
+              RFC 793, DOI 10.17487/RFC0793, September 1981,
+              <http://www.rfc-editor.org/info/rfc793>.
+
+   [RFC6994]  Touch, J., "Shared Use of Experimental TCP Options",
+              RFC 6994, DOI 10.17487/RFC6994, August 2013,
+              <http://www.rfc-editor.org/info/rfc6994>.
+
+   [RoCE]     InfiniBand, "RDMA over Converged Ethernet specification",
+              <https://cw.infinibandta.org/wg/Members/documentRevision/
+              download/7149>.
+
+Appendix A.  Formats
+
+A.1.  TCP Option
+
+   The SMC-R TCP option is formatted in accordance with [RFC6994]
+   ("Shared Use of Experimental TCP Options").  The ExID value is
+   IBM-1047 (EBCDIC) encoding for "SMCR".
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   Kind = 254  | Length = 6    |   x'E2'       |   x'D4'       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |    x'C3'      |    x'D9'      |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                    Figure 24: SMC-R TCP Option Format
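Reviewer's sketch (not part of the RFC text): the option in Figure 24 can be built and recognized with a few lines of Python. The helper names `build_smcr_option` and `is_smcr_option` are hypothetical. The ExID bytes are hard-coded from the figure; the standard `cp037` codec is used only as a cross-check, on the assumption that it encodes these four letters the same way as IBM-1047.

```python
import struct

TCP_OPT_KIND_EXPERIMENTAL = 254               # Kind value shown in Figure 24
SMCR_EXID = bytes([0xE2, 0xD4, 0xC3, 0xD9])   # "SMCR" in EBCDIC, from the figure


def build_smcr_option() -> bytes:
    """Return the 6-byte SMC-R experimental TCP option (Kind, Length, ExID)."""
    return struct.pack("!BB4s", TCP_OPT_KIND_EXPERIMENTAL, 6, SMCR_EXID)


def is_smcr_option(option: bytes) -> bool:
    """Return True if `option` is exactly the option laid out in Figure 24."""
    if len(option) != 6:
        return False
    kind, length, exid = struct.unpack("!BB4s", option)
    return (kind == TCP_OPT_KIND_EXPERIMENTAL
            and length == 6
            and exid == SMCR_EXID)
```

A receiver would apply `is_smcr_option` to each option it finds in a SYN or SYN-ACK; how options are extracted from the TCP header is outside this sketch.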
+
+A.2.  CLC Messages
+
+   The following rules apply to all CLC messages:
+
+   General rules on formats:
+
+   o  Reserved fields must be set to zero by the sender and must not be
+      validated by the receiver.
+
+   o  Each message has an eye catcher at the start and another
+      eye catcher at the end.  These must both be validated by the
+      receiver.
+
+   o  SMC version indicator: The only SMC-R version defined in this
+      architecture is version 1.  In the future, if peers have a
+      mismatch of versions, the lowest common version number is used.
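Reviewer's sketch (not part of the RFC text): the eye-catcher rule above amounts to a framing check on every received CLC message. The function name `has_valid_eye_catchers` is hypothetical, and the 8-byte minimum is only the size of the two eye catchers, not a limit taken from the RFC.

```python
EYE_CATCHER = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in EBCDIC


def has_valid_eye_catchers(msg: bytes) -> bool:
    """Check that a CLC message starts and ends with the SMCR eye catcher."""
    return (len(msg) >= 2 * len(EYE_CATCHER)
            and msg[:4] == EYE_CATCHER
            and msg[-4:] == EYE_CATCHER)
```

Reserved fields are deliberately not inspected here, in line with the rule that receivers do not validate them.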
+
+A.2.1.  Peer ID Format
+
+   All CLC messages contain a peer ID that uniquely identifies an
+   instance of a TCP/IP stack.  This peer ID is required to be
+   universally unique across TCP/IP stacks and instances (including
+   restarts) of TCP/IP stacks.
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |          Instance ID          |    RoCE MAC (first 2 bytes)   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                    RoCE MAC (last 4 bytes)                    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                         Figure 25: Peer ID Format
+
+   Instance ID
+
+      A 2-byte instance count that ensures that if the same RNIC MAC is
+      later used in the peer ID for a different TCP/IP stack -- for
+      example, if an RNIC is redeployed to another stack -- the values
+      are unique.  It also ensures that if a TCP/IP stack is restarted,
+      the instance ID changes.  The value is implementation defined,
+      with one suggestion being 2 bytes of the system clock.
+
+   RoCE MAC
+
+      The RoCE MAC address for one of the peer's RNICs.  Note that in a
+      virtualized environment this will be the virtual MAC of one of the
+      peer's RNICs.
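Reviewer's sketch (not part of the RFC text): Figure 25 describes an 8-byte value made of a 2-byte instance ID followed by the 6-byte RoCE MAC. The helpers `make_peer_id` and `split_peer_id` are hypothetical names, and the example MAC is made up.

```python
import struct
from typing import Tuple


def make_peer_id(instance_id: int, roce_mac: str) -> bytes:
    """Pack a 2-byte instance ID and a 6-byte RoCE MAC into a peer ID."""
    mac = bytes.fromhex(roce_mac.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("RoCE MAC must be 6 bytes")
    if not 0 <= instance_id <= 0xFFFF:
        raise ValueError("instance ID must fit in 2 bytes")
    return struct.pack("!H6s", instance_id, mac)


def split_peer_id(peer_id: bytes) -> Tuple[int, str]:
    """Return (instance ID, MAC as colon-separated hex) from a peer ID."""
    instance_id, mac = struct.unpack("!H6s", peer_id)
    return instance_id, ":".join(f"{b:02x}" for b in mac)
```

How the instance ID is chosen is left to the implementation; the text only suggests 2 bytes of the system clock as one option.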
+
+A.2.2.  SMC Proposal CLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 1     |           Length              |Version| Rsrvd |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                       Client's Peer ID                      -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                Client's preferred GID                       -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Client's preferred RoCE                                      |
+     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |Offset to mask/prefix area (0) |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     .                                                               .
+     .                  Area for future growth                       .
+     .                                                               .
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                         IPv4 Subnet Mask                      |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | IPv4 Mask Lgth|           Reserved            |Num IPv6 prfx  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     :                                                               :
+     :           Array of IPv6 prefixes (variable length)            :
+     :                                                               :
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                Figure 26: SMC Proposal CLC Message Format
+
+   The fields present in the SMC Proposal CLC message are:
+
+   Eye catchers
+
+      Like all CLC messages, the SMC Proposal has beginning and ending
+      eye catchers to aid with verification and parsing.  The hex digits
+      spell "SMCR" in IBM-1047 (EBCDIC).
+
+   Type
+
+      CLC message Type 1 indicates SMC Proposal.
+
+   Length
+
+      The length of this CLC message.  If this is an IPv4 flow, this
+      value is 52.  Otherwise, it is variable, depending upon how many
+      prefixes are listed.
+
+   Version
+
+      Version of the SMC-R protocol.  Version 1 is the only currently
+      defined value.
+
+   Client's Peer ID
+
+      As described in Appendix A.2.1 above.
+
+   Client's preferred RoCE GID
+
+      The IPv6 address of the client's preferred RNIC on the RoCE
+      fabric.
+
+   Client's preferred RoCE MAC address
+
+      The MAC address of the client's preferred RNIC on the RoCE fabric.
+      It is required, as some operating systems do not have neighbor
+      discovery or ARP support for RoCE RNICs.
+
+   Offset to mask/prefix area
+
+      Provides the number of bytes that must be skipped after this
+      field, to access the IPv4 Subnet Mask field and the fields that
+      follow it.  Allows for future growth of this signal.  In this
+      version of the architecture, this value is always zero.
+
+   Area for future growth
+
+      In this version of the architecture, this field does not exist.
+      This indicates where additional information may be inserted into
+      the signal in the future.  The "Offset to mask/prefix area" field
+      must be used to skip over this area.
+
+   IPv4 Subnet Mask
+
+      If this message is flowing over an IPv4 TCP connection, the value
+      of the subnet mask associated with the interface over which the
+      client sent this message.  If this is an IPv6 flow, this field is
+      all zeros.
+
+      This field, along with all fields that follow it in this signal,
+      must be accessed by skipping the number of bytes listed in the
+      "Offset to mask/prefix area" field after the end of that field.
+
+   IPv4 Mask Lgth
+
+      If this message is flowing over an IPv4 TCP connection, the number
+      of significant bits in the IPv4 Subnet Mask field.  If this is an
+      IPv6 flow, this field is zero.
+
+   Num IPv6 prfx
+
+      If this message is flowing over an IPv6 TCP connection, the number
+      of IPv6 prefixes that follow, with a maximum value of 8.  If this
+      is an IPv4 flow, this field is zero and is immediately followed by
+      the ending eye catcher.
+
+   Array of IPv6 prefixes
+
+      For IPv6 TCP connections, a list of the IPv6 prefixes associated
+      with the network over which the client sent this message, up to a
+      maximum of eight prefixes.
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +                                                               +
+     |                                                               |
+     +                  IPv6 prefix value                            +
+     |                                                               |
+     +                                                               +
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Prefix Length |
+     +-+-+-+-+-+-+-+-+
+
+              Figure 27: Format for IPv6 Prefix Array Element
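Reviewer's sketch (not part of the RFC text): one way to assemble the SMC Proposal of Figures 26 and 27. The names `build_proposal` and `mask_area_offset` are hypothetical. Two things are assumptions read off the figures rather than stated in the text: multi-byte fields are in network byte order, and each IPv6 prefix array element occupies 17 bytes (16-byte prefix plus 1-byte length), so an IPv6 message is 52 + 17 * n bytes long.

```python
import ipaddress
import struct

EYE = bytes([0xE2, 0xD4, 0xC3, 0xD9])   # "SMCR" in EBCDIC
CLC_PROPOSAL = 1
FIXED_LEN = 52                           # length of an IPv4 proposal per the text


def build_proposal(peer_id, gid, mac, ipv4_net=None, ipv6_prefixes=()):
    """Build an SMC Proposal; give ipv4_net for IPv4 or ipv6_prefixes for IPv6."""
    if ipv4_net is not None and ipv6_prefixes:
        raise ValueError("a proposal is either IPv4 or IPv6, not both")
    if len(ipv6_prefixes) > 8:
        raise ValueError("at most 8 IPv6 prefixes")
    if len(peer_id) != 8 or len(gid) != 16 or len(mac) != 6:
        raise ValueError("peer ID, GID, and MAC must be 8, 16, and 6 bytes")
    if ipv4_net is not None:
        net = ipaddress.IPv4Network(ipv4_net)
        mask, mask_len = net.netmask.packed, net.prefixlen
    else:
        mask, mask_len = bytes(4), 0         # all zeros on an IPv6 flow
    prefixes = b""
    for prefix in ipv6_prefixes:
        net6 = ipaddress.IPv6Network(prefix)
        prefixes += net6.network_address.packed + bytes([net6.prefixlen])
    length = FIXED_LEN + len(prefixes)
    return (
        EYE
        + struct.pack("!BHB", CLC_PROPOSAL, length, 1 << 4)  # version 1, high nibble
        + peer_id + gid + mac
        + struct.pack("!H", 0)               # offset to mask/prefix area: zero
        + mask
        + struct.pack("!BHB", mask_len, 0, len(ipv6_prefixes))
        + prefixes
        + EYE
    )


def mask_area_offset(msg: bytes) -> int:
    """Index of the IPv4 Subnet Mask field, honoring the offset field."""
    (skip,) = struct.unpack_from("!H", msg, 38)
    return 40 + skip
```

Reading the mask area through `mask_area_offset` rather than a fixed index keeps a parser tolerant of a future non-zero "Area for future growth".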
+
+A.2.3.  SMC Accept CLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 2     |    Length = 68                |Version|F|Rsrvd|
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                       Server's Peer ID                      -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                Server's RoCE GID                            -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Server's RoCE                                                |
+     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |     Server QP (bytes 1-2)     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Srvr QP byte 3 |         Server RMB RKey (bytes 1-3)           |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Srvr RKey byte4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)|
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Srvr RMB alert tkn (bytes 3-4)|Bsize  | MTU   |   Reserved    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                     Server's RMB virtual address            -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Reserved      |    Server's initial packet sequence number    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                 Figure 28: SMC Accept CLC Message Format
+
+   The fields present in the SMC Accept CLC message are:
+
+   Eye catchers
+
+      Like all CLC messages, the SMC Accept has beginning and ending
+      eye catchers to aid with verification and parsing.  The hex digits
+      spell "SMCR" in IBM-1047 (EBCDIC).
+
+   Type
+
+      CLC message Type 2 indicates SMC Accept.
+
+   Length
+
+      The SMC Accept CLC message is 68 bytes long.
+
+   Version
+
+      Version of the SMC-R protocol.  Version 1 is the only currently
+      defined value.
+
+   F-bit
+
+      First contact flag: A 1-bit flag that indicates that the server
+      believes this TCP connection is the first SMC-R contact for this
+      link group.
+
+   Server's Peer ID
+
+      As described in Appendix A.2.1 above.
+
+   Server's RoCE GID
+
+      The IPv6 address of the RNIC that the server chose for this SMC-R
+      link.
+
+   Server's RoCE MAC address
+
+      The MAC address of the server's RNIC for the SMC-R link.  It is
+      required, as some operating systems do not have neighbor discovery
+      or ARP support for RoCE RNICs.
+
+   Server's QP number
+
+      The number for the reliably connected queue pair that the server
+      created for this SMC-R link.
+
+   Server's RMB RKey
+
+      The RDMA RKey for the RMB that the server created or chose for
+      this TCP connection.
+
+   Server's RMB element index
+
+      Indexes which element within the server's RMB will represent this
+      TCP connection.
+
+   Server's RMB element alert token
+
+      A platform-defined, architecturally opaque token that identifies
+      this TCP connection.  Added by the client as immediate data on
+      RDMA writes from the client to the server to inform the server
+      that there is data for this connection to retrieve from the
+      RMB element.
+
+   Bsize
+
+      Server's RMB element buffer size in 4-bit compressed notation.
+      If x is the 4-bit field value, the actual buffer size is
+      (2^(x + 4)) * 1K.  The smallest possible value is 16K (x = 0).
+      The largest size supported by this architecture is 512K (x = 5).
+
+   MTU
+
+      An enumerated value indicating this peer's QP MTU size.  The two
+      peers exchange their MTU values, and whichever value is smaller
+      will be used for the QP.  This field should only be validated in
+      the first contact exchange.
+
+      The enumerated MTU values are:
+
+         0:  reserved
+
+         1:  256
+
+         2:  512
+
+         3:  1024
+
+         4:  2048
+
+         5:  4096
+
+         6-15: reserved
+
+   Server's RMB virtual address
+
+      The virtual address of the server's RMB as assigned by the
+      server's RNIC.
+
+   Server's initial packet sequence number
+
+      The starting packet sequence number that this peer will use when
+      sending to the other peer, so that the other peer can prepare its
+      QP for the sequence number to expect.
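Reviewer's sketch (not part of the RFC text): a parser for the 68-byte SMC Accept of Figure 28. The `Accept` tuple and `parse_accept` are hypothetical names. Byte offsets are counted from the figure; network byte order for multi-byte fields and the position of the F-bit (the bit immediately after the 4-bit version) are assumptions drawn from the figure, not from explicit text.

```python
import struct
from typing import NamedTuple

EYE = bytes([0xE2, 0xD4, 0xC3, 0xD9])   # "SMCR" in EBCDIC


class Accept(NamedTuple):
    version: int
    first_contact: bool
    peer_id: bytes
    gid: bytes
    mac: bytes
    qp_number: int
    rmb_rkey: int
    rmb_index: int
    alert_token: int
    bsize_code: int
    mtu_code: int
    rmb_vaddr: int
    initial_psn: int


def parse_accept(msg: bytes) -> Accept:
    """Decode an SMC Accept CLC message; raise ValueError if malformed."""
    if len(msg) != 68 or msg[:4] != EYE or msg[-4:] != EYE:
        raise ValueError("not a well-framed 68-byte CLC message")
    msg_type, length, flags = struct.unpack_from("!BHB", msg, 4)
    if msg_type != 2 or length != 68:
        raise ValueError("not an SMC Accept")
    # RKey (4 bytes), RMB index (1), alert token (4), Bsize/MTU nibbles (1)
    rkey, index, token, sizes = struct.unpack_from("!IBIB", msg, 41)
    (vaddr,) = struct.unpack_from("!Q", msg, 52)
    return Accept(
        version=flags >> 4,
        first_contact=bool(flags & 0x08),
        peer_id=msg[8:16],
        gid=msg[16:32],
        mac=msg[32:38],
        qp_number=int.from_bytes(msg[38:41], "big"),
        rmb_rkey=rkey,
        rmb_index=index,
        alert_token=token,
        bsize_code=sizes >> 4,
        mtu_code=sizes & 0x0F,
        rmb_vaddr=vaddr,
        initial_psn=int.from_bytes(msg[61:64], "big"),
    )
```

The reserved bytes at offsets 51 and 60 are skipped without inspection, matching the rule that reserved fields are not validated.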
+
+A.2.4.  SMC Confirm CLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 3     |    Length = 68                |Version| Rsrvd |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                       Client's Peer ID                      -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                Client's RoCE GID                            -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Client's RoCE                                                |
+     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |     Client QP (bytes 1-2)     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Clnt QP byte 3 |         Client RMB RKey (bytes 1-3)           |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Clnt RKey byte4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)|
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Clnt RMB alert tkn (bytes 3-4)|Bsize  | MTU   |   Reserved    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                  Client's RMB Virtual Address               -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Reserved      |    Client's initial packet sequence number    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                 Figure 29: SMC Confirm CLC Message Format
+
+   The SMC Confirm CLC message is nearly identical to the SMC Accept,
+   except that it contains client information and lacks a first contact
+   flag.
+
+   The fields present in the SMC Confirm CLC message are:
+
+   Eye catchers
+
+      Like all CLC messages, the SMC Confirm has beginning and ending
+      eye catchers to aid with verification and parsing.  The hex digits
+      spell "SMCR" in IBM-1047 (EBCDIC).
+
+   Type
+
+      CLC message Type 3 indicates SMC Confirm.
+
+   Length
+
+      The SMC Confirm CLC message is 68 bytes long.
+
+   Version
+
+      Version of the SMC-R protocol.  Version 1 is the only currently
+      defined value.
+
+   Client's Peer ID
+
+      As described in Appendix A.2.1 above.
+
+   Client's RoCE GID
+
+      The IPv6 address of the RNIC that the client chose for this SMC-R
+      link.
+
+   Client's RoCE MAC address
+
+      The MAC address of the client's RNIC for the SMC-R link.  It is
+      required, as some operating systems do not have neighbor discovery
+      or ARP support for RoCE RNICs.
+
+   Client's QP number
+
+      The number for the reliably connected queue pair that the client
+      created for this SMC-R link.
+
+   Client's RMB RKey
+
+      The RDMA RKey for the RMB that the client created or chose for
+      this TCP connection.
+
+   Client's RMB element index
+
+      Indexes which element within the client's RMB will represent this
+      TCP connection.
+
+   Client's RMB element alert token
+
+      A platform-defined, architecturally opaque token that identifies
+      this TCP connection.  Added by the server as immediate data on
+      RDMA writes from the server to the client to inform the client
+      that there is data for this connection to retrieve from the
+      RMB element.
+
+   Bsize
+
+      Client's RMB element buffer size in 4-bit compressed notation.
+      If x is the 4-bit field value, the actual buffer size is
+      (2^(x + 4)) * 1K.  The smallest possible value is 16K (x = 0).
+      The largest size supported by this architecture is 512K (x = 5).
+
+   MTU
+
+      An enumerated value indicating this peer's QP MTU size.  The two
+      peers exchange their MTU values, and whichever value is smaller
+      will be used for the QP.  The values are enumerated in
+      Appendix A.2.3.  This value should only be validated in the first
+      contact exchange.
+
+   Client's RMB Virtual Address
+
+      The virtual address of the client's RMB as assigned by the
+      client's RNIC.
+
+   Client's initial packet sequence number
+
+      The starting packet sequence number that this peer will use when
+      sending to the other peer, so that the other peer can prepare its
+      QP for the sequence number to expect.
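Reviewer's sketch (not part of the RFC text): decoding the Bsize and MTU nibbles that appear in both the SMC Accept and SMC Confirm. The names `rmb_buffer_size` and `negotiated_mtu` are hypothetical. Rejecting reserved MTU codes and Bsize codes above the 512K limit is this sketch's choice; the text does not say how a receiver should react to them.

```python
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}  # other codes are reserved
MAX_RMB_BYTES = 512 * 1024                                # largest size in this architecture


def rmb_buffer_size(bsize_code: int) -> int:
    """Expand the 4-bit compressed Bsize code: (2^(x + 4)) * 1K bytes."""
    if not 0 <= bsize_code <= 15:
        raise ValueError("Bsize is a 4-bit field")
    size = (1 << (bsize_code + 4)) * 1024
    if size > MAX_RMB_BYTES:
        raise ValueError("buffer size exceeds 512K")
    return size


def negotiated_mtu(local_code: int, peer_code: int) -> int:
    """Return the QP MTU in bytes: the smaller of the two peers' values."""
    for code in (local_code, peer_code):
        if code not in MTU_BYTES:
            raise ValueError(f"reserved MTU code {code}")
    return MTU_BYTES[min(local_code, peer_code)]
```

Because the enumeration is ordered by size, taking the smaller code is equivalent to taking the smaller MTU.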
+
+A.2.5.  SMC Decline CLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 4     |    Length = 28                |Version|S|Rsrvd|
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                       Sender's Peer ID                      -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |              Peer Diagnosis Information                       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                 Figure 30: SMC Decline CLC Message Format
+
+   The fields present in the SMC Decline CLC message are:
+
+   Eye catchers
+
+      Like all CLC messages, the SMC Decline has beginning and ending
+      eye catchers to aid with verification and parsing.  The hex digits
+      spell "SMCR" in IBM-1047 (EBCDIC).
+
+   Type
+
+      CLC message Type 4 indicates SMC Decline.
+
+   Length
+
+      The SMC Decline CLC message is 28 bytes long.
+
+   Version
+
+      Version of the SMC-R protocol.  Version 1 is the only currently
+      defined value.
+
+   S-bit
+
+      Sync Bit.  Indicates that the link group is out of sync and the
+      receiving peer must clean up its representation of the link group.
+
+   Sender's Peer ID
+
+      As described in Appendix A.2.1 above.
+
+   Peer Diagnosis Information
+
+      4 bytes of diagnosis information provided by the peer.  These
+      values are defined by the individual peers, and it is necessary to
+      consult the peer's system documentation to interpret the results.
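Reviewer's sketch (not part of the RFC text): building the 28-byte SMC Decline of Figure 30. The name `build_decline` is hypothetical, and the diagnosis value in the usage note is made up. Two points are assumptions read from the figure: the S-bit is the bit immediately after the 4-bit version, and the unlabeled 4-byte row after the diagnosis field is treated as reserved and set to zero.

```python
import struct

EYE = bytes([0xE2, 0xD4, 0xC3, 0xD9])   # "SMCR" in EBCDIC
CLC_DECLINE = 4
DECLINE_LEN = 28


def build_decline(peer_id: bytes, diagnosis: int, out_of_sync: bool = False) -> bytes:
    """Build an SMC Decline; set out_of_sync to raise the S-bit."""
    if len(peer_id) != 8:
        raise ValueError("peer ID must be 8 bytes")
    flags = (1 << 4) | (0x08 if out_of_sync else 0)   # version 1, optional S-bit
    return (
        EYE
        + struct.pack("!BHB", CLC_DECLINE, DECLINE_LEN, flags)
        + peer_id
        + struct.pack("!I", diagnosis)   # peer-defined diagnosis information
        + bytes(4)                       # unlabeled row, assumed reserved
        + EYE
    )
```

For example, `build_decline(peer_id, 0x03010000, out_of_sync=True)` would tell the receiver to clean up its view of the link group; the meaning of the diagnosis word is defined by the sending peer.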
+
+A.3.  LLC Messages
+
+   LLC messages are sent over an existing SMC-R link using RoCE SendMsg
+   and are always 44 bytes long so that they fit into the space
+   available in a single WQE without requiring the receiver to post
+   receive buffers.  If a message needs fewer than 44 bytes, the
+   remainder is padded with zeros.  LLC messages are in a
+   request/response format.  The message type is the same for request
+   and response, and a flag indicates whether a message is flowing as a
+   request or a response.
+
+   The two high-order bits of an LLC message opcode indicate how it is
+   to be handled by a peer that does not support the opcode.
+
+   If the high-order bits of the opcode are b'00', then the peer must
+   support the LLC message and indicate a protocol error if it does not.
+
+   If the high-order bits of the opcode are b'10', then the peer must
+   silently discard the LLC message if it does not support the opcode.
+   This requirement is included to allow for toleration of advanced, but
+   optional, functionality.
+
+   High-order bits of b'11' indicate a Connection Data Control (CDC)
+   message as described in Appendix A.4.
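Reviewer's sketch (not part of the RFC text): the handling rule for unsupported opcodes reduces to a test on the two high-order bits. The function name and the returned labels are hypothetical, and the opcode is assumed to be the 1-byte Type field. The pattern b'01' is not described in this section, so the sketch reports it as unspecified instead of guessing.

```python
def action_for_unsupported(opcode: int) -> str:
    """Say how a peer treats a message whose opcode it does not support."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode is assumed to be a single byte")
    high_bits = opcode >> 6
    if high_bits == 0b00:
        return "protocol_error"   # support is mandatory
    if high_bits == 0b10:
        return "discard"          # optional function: drop silently
    if high_bits == 0b11:
        return "cdc"              # CDC message, see Appendix A.4
    return "unspecified"          # b'01' is not covered by this section
```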
+
+A.3.1.  CONFIRM LINK LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 1     |  Length = 44  |   Reserved    |R|  Reserved   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Sender's RoCE                                                |
+     +-   MAC address                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
+     |                                                               |
+     +-                                                             -+
+     |                 Sender's RoCE GID                             |
+     +-                                                             -+
+     |                                                               |
+     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |Sender's QP number, bytes 1-2  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Sender QP byte3| Link number   |Sender's link userID, bytes 1-2|
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Sender's link userID, bytes 3-4| Max links     |  Reserved     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                         Reserved                            -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                Figure 31: CONFIRM LINK LLC Message Format
+
+   The CONFIRM LINK LLC message must be exchanged between the server
+   and client over a newly created SMC-R link to complete the setup of
+   that link.  Its purpose is to confirm that the RoCE path is actually
+   usable.
+
+   On first contact, this message flows after the server receives the
+   SMC Confirm CLC message from the client over the IP connection.  For
+   additional links added to an SMC-R link group, it flows after the
+   ADD LINK and ADD LINK CONTINUATION exchange.  This flow provides
+   confirmation that the queue pair is in fact usable.  Each peer echoes
+   its RoCE information back to the other.
+
+   The contents of the CONFIRM LINK LLC message are:
+
+   Type
+
+      Type 1 indicates CONFIRM LINK.
+
+   Length
+
+      The CONFIRM LINK LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a CONFIRM LINK
+      reply.
+
+   Sender's RoCE MAC address
+
+      The MAC address of the sender's RNIC for the SMC-R link.  It is
+      required, as some operating systems do not have neighbor discovery
+      or ARP support for RoCE RNICs.
+
+   Sender's RoCE GID
+
+      The IPv6 address of the RNIC that the sender is using for this
+      SMC-R link.
+
+   Sender's QP number
+
+      The number for the reliably connected queue pair that the sender
+      created for this SMC-R link.
+
+   Link number
+
+      An identifier assigned by the server that uniquely identifies the
+      link within the link group.  This identifier is ONLY unique within
+      a link group.  Provided by the server and echoed back by the
+      client.
+
+   Link user ID
+
+      An opaque, implementation-defined identifier assigned by the
+      sender and provided to the receiver solely for purposes of
+      display, diagnosis, network management, etc.  The link user ID
+      should be unique across the sender's entire software space,
+      including all other link groups.
+
+   Max links
+
+      The maximum number of links the sender can support in a link
+      group.  The maximum for this link group is the smaller of the
+      values provided by the two peers.
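+
+   As a rough illustration of the layout in Figure 31, the following
+   Python sketch packs a CONFIRM LINK message.  The function names are
+   hypothetical.  The two bytes that follow Length are passed through
+   as opaque flag bytes, because the bit position of the R flag is not
+   assumed here.
+

```python
import struct

LLC_LEN = 44  # every LLC message described here is 44 bytes long


def pack_confirm_link(flags: bytes, mac: bytes, gid: bytes, qp_num: int,
                      link_num: int, user_id: int, max_links: int) -> bytes:
    """Pack a CONFIRM LINK message (hypothetical helper, not from the RFC)."""
    if len(flags) != 2 or len(mac) != 6 or len(gid) != 16:
        raise ValueError("flags=2 bytes, MAC=6 bytes, GID=16 bytes expected")
    msg = struct.pack(
        ">BB2s6s16s3sBIB",
        1,                          # Type 1 = CONFIRM LINK
        LLC_LEN,                    # Length
        flags,                      # reserved/flag bytes (opaque here)
        mac,                        # Sender's RoCE MAC address
        gid,                        # Sender's RoCE GID
        qp_num.to_bytes(3, "big"),  # Sender's QP number (3 bytes)
        link_num,                   # Link number
        user_id,                    # Sender's link user ID (4 bytes)
        max_links,                  # Max links
    )
    return msg.ljust(LLC_LEN, b"\x00")  # trailing reserved bytes are zero


def negotiated_max_links(mine: int, peer: int) -> int:
    """The link group uses the smaller of the two advertised values."""
    return min(mine, peer)
```
+
+   For example, negotiated_max_links(8, 3) yields 3, matching the rule
+   that the smaller of the two peers' values applies.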
+
+A.3.2.  ADD LINK LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 2     |  Length = 44  | Rsrvd |RsnCode|R|Z| Reserved  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Sender's RoCE                                                |
+     +-   MAC address                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
+     |                                                               |
+     +-                                                             -+
+     |                 Sender's RoCE GID                             |
+     +-                                                             -+
+     |                                                               |
+     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |Sender's QP number, bytes 1-2  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |Sender QP byte3| Link number   |Rsrvd  |  MTU  |Initial PSN    |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Initial PSN (continued)      |                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
+     |                          Reserved                             |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                  Figure 32: ADD LINK LLC Message Format
+
+   The ADD LINK LLC message is sent over an existing link in the link
+   group when a peer wishes to add an SMC-R link to an existing SMC-R
+   link group.  It is sent by the server to add a new SMC-R link to the
+   group, or by the client to request that the server add a new link --
+   for example, when a new RNIC becomes active.  When sent from the
+   client to the server, it represents a request that the server
+   initiate an ADD LINK exchange.
+
+   This message is sent immediately after the initial SMC-R link in the
+   group completes, as described in Section 3.5.1 ("First Contact").  It
+   can also be sent over an existing SMC-R link group at any time as new
+   RNICs are added and become available.  Therefore, there can be as few
+   as one new RMB RToken to be communicated, or several.  RTokens will
+   be communicated using ADD LINK CONTINUATION messages.
+
+   The contents of the ADD LINK LLC message are:
+
+   Type
+
+      Type 2 indicates ADD LINK.
+
+   Length
+
+      The ADD LINK LLC message is 44 bytes long.
+
+   RsnCode
+
+      If the Z (rejection) flag is set, this field provides the reason
+      code.  Values can be:
+
+         X'1' - no alternate path available: set when the server
+                provides the same MAC/GID as an existing SMC-R link in
+                the group, and the client does not have any additional
+                RNICs available (i.e., the server is attempting to set
+                up an asymmetric link but none is available).
+
+         X'2' - Invalid MTU value specified.
+
+   R
+
+      Reply flag.  When set, indicates that this is an ADD LINK reply.
+
+   Z
+
+      Rejection flag.  When set on reply, indicates that the server's
+      ADD LINK was rejected by the client.  When this flag is set, the
+      reason code will also be set.
+
+   Sender's RoCE MAC address
+
+      The MAC address of the sender's RNIC for the new SMC-R link.  It
+      is required, as some operating systems do not have neighbor
+      discovery or ARP support for RoCE RNICs.
+
+   Sender's RoCE GID
+
+      The IPv6 address of the RNIC that the sender is using for the new
+      SMC-R link.
+
+   Sender's QP number
+
+      The number for the reliably connected queue pair that the sender
+      created for the new SMC-R link.
+
+   Link number
+
+      An identifier for the new SMC-R link.  This is assigned by the
+      server and uniquely identifies the link within the link group.
+      This identifier is ONLY unique within a link group.  Provided by
+      the server and echoed back by the client.
+
+   MTU
+
+      An enumerated value indicating this peer's QP MTU size.  The two
+      peers exchange their MTU values, and whichever value is smaller
+      will be used for the QP.  The values are enumerated in
+      Appendix A.2.3.
+
+   Initial PSN
+
+      The starting packet sequence number (PSN) that this peer will use
+      when sending to the other peer, so that the other peer can prepare
+      its QP for the sequence number to expect.
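+
+   The ADD LINK layout in Figure 32 can be sketched in Python as
+   follows.  This is an illustration only: the helper names are
+   hypothetical, and the flag masks assume that bit 0 in the figure is
+   the most significant bit of each byte.
+

```python
import struct

LLC_LEN = 44


def pack_add_link(mac: bytes, gid: bytes, qp_num: int, link_num: int,
                  mtu: int, initial_psn: int, reply: bool = False,
                  reject: bool = False, rsn_code: int = 0) -> bytes:
    """Pack an ADD LINK message (hypothetical helper)."""
    flags = (0x80 if reply else 0) | (0x40 if reject else 0)  # R, Z
    msg = struct.pack(
        ">BBBB6s16s3sBB3s",
        2,                               # Type 2 = ADD LINK
        LLC_LEN,                         # Length
        rsn_code & 0x0F,                 # Rsrvd (4 bits) | RsnCode (4 bits)
        flags,                           # R | Z | Reserved (6 bits)
        mac,                             # Sender's RoCE MAC address
        gid,                             # Sender's RoCE GID
        qp_num.to_bytes(3, "big"),       # Sender's QP number
        link_num,                        # Link number
        mtu & 0x0F,                      # Rsrvd (4 bits) | MTU (4 bits)
        initial_psn.to_bytes(3, "big"),  # Initial PSN (3 bytes)
    )
    return msg.ljust(LLC_LEN, b"\x00")   # remaining bytes are reserved


def parse_add_link(msg: bytes) -> dict:
    """Unpack the fields written by pack_add_link."""
    if len(msg) != LLC_LEN or msg[0] != 2:
        raise ValueError("not an ADD LINK message")
    return {
        "rsn_code": msg[2] & 0x0F,
        "reply": bool(msg[3] & 0x80),
        "reject": bool(msg[3] & 0x40),
        "mac": msg[4:10],
        "gid": msg[10:26],
        "qp_num": int.from_bytes(msg[26:29], "big"),
        "link_num": msg[29],
        "mtu": msg[30] & 0x0F,
        "initial_psn": int.from_bytes(msg[31:34], "big"),
    }


def negotiated_mtu(mine: int, peer: int) -> int:
    """The smaller of the two enumerated MTU values is used for the QP."""
    return min(mine, peer)
```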
+
+A.3.3.  ADD LINK CONTINUATION LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 3     |  Length = 44  |  Reserved     |R|  Reserved   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   Linknum     | NumRTokens    |         Reserved              |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                  RKey/RToken pair                           -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                  RKey/RToken pair or zeros                  -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                        Reserved                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+            Figure 33: ADD LINK CONTINUATION LLC Message Format
+
+   When a new SMC-R link is added to an SMC-R link group, it is
+   necessary to communicate the new link's RTokens for the RMBs that the
+   SMC-R link group can access.  This message follows the ADD LINK and
+   provides the RTokens.
+
+   The server kicks off this exchange by sending the first ADD LINK
+   CONTINUATION LLC message, and the server controls the exchange as
+   described below.
+
+   o  If the client and the server require the same number of ADD LINK
+      CONTINUATION messages to communicate their RTokens, the server
+      starts the exchange by sending the first ADD LINK CONTINUATION
+      request to the client with its (the server's) RTokens.  The client
+      then responds with an ADD LINK CONTINUATION response with its
+      RTokens, and so on until the exchange is completed.
+
+   o  If the server requires more ADD LINK CONTINUATION messages than
+      the client, then after the client has communicated all of its
+      RTokens, the server continues to send ADD LINK CONTINUATION
+      request messages to the client.  The client continues to respond,
+      using empty (number of RTokens to be communicated = 0) ADD LINK
+      CONTINUATION response messages.
+
+   o  If the client requires more ADD LINK CONTINUATION messages than
+      the server, then after communicating all of its RTokens, the
+      server will continue to send empty ADD LINK CONTINUATION messages
+      to the client to solicit replies with the client's RTokens, until
+      all have been communicated.
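+
+   The three cases above can be sketched as a small Python simulation.
+   The function name is hypothetical, and the model assumes each
+   message carries at most two RTokens and reports the number still to
+   be communicated, including its own.
+

```python
def add_link_cont_exchange(server_tokens: int, client_tokens: int,
                           per_msg: int = 2) -> list:
    """Return the ordered (sender, NumRTokens) flow of one exchange.

    The server always sends first; the client answers each request.
    A side that has nothing left sends NumRTokens = 0 (an empty message).
    """
    flow = []
    s_left, c_left = server_tokens, client_tokens
    while True:
        flow.append(("server", s_left))      # request
        s_left = max(0, s_left - per_msg)
        flow.append(("client", c_left))      # response
        c_left = max(0, c_left - per_msg)
        if s_left == 0 and c_left == 0:
            return flow
```
+
+   With three server RTokens and one client RToken, the sketch yields
+   a server request (3), a client response (1), a second server
+   request (1), and an empty client response (0).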
+
+   The contents of the ADD LINK CONTINUATION LLC message are:
+
+   Type
+
+      Type 3 indicates ADD LINK CONTINUATION.
+
+   Length
+
+      The ADD LINK CONTINUATION LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is an ADD LINK
+      CONTINUATION reply.
+
+   LinkNum
+
+      The link number of the new link within the SMC-R link group for
+      which RTokens are being communicated.
+
+   NumRTokens
+
+      Number of RTokens remaining to be communicated (including the ones
+      in this message).  If the value is less than or equal to 2, this
+      is the last message.  If it is greater than 2, another
+      continuation message will be required, and its value will be the
+      value in this message minus 2, and so on until all RTokens are
+      communicated.  The maximum value for this field is 255.
+
+   RKey/RToken pairs (two or fewer)
+
+      These consist of an RKey for an RMB that is known on the SMC-R
+      link over which this message was sent (the reference RKey), paired
+      with the same RMB's RToken over the new SMC-R link.  A full RToken
+      is not required for the reference, because it is only being used
+      to distinguish which RMB it applies to, not address it.
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                         Reference RKey                        |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                            New RKey                           |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                       New Virtual Address                   -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                    Figure 34: RKey/RToken Pair Format
+
+   The contents of the RKey/RToken pair are:
+
+   Reference RKey
+
+      The RKey of the RMB as it is already known on the SMC-R link over
+      which this message is being sent.  Required so that the peer knows
+      with which RMB to associate the new RToken.
+
+   New RKey
+
+      The RKey of this RMB as it is known over the new SMC-R link.
+
+   New Virtual Address
+
+      The virtual address of this RMB as it is known over the new
+      SMC-R link.
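+
+   A minimal Python sketch of Figures 33 and 34 follows.  The helper
+   name is hypothetical, and the R flag mask assumes that bit 0 in the
+   figure is the most significant bit of its byte.
+

```python
import struct

# Reference RKey (4 bytes), New RKey (4 bytes), New Virtual Address (8 bytes)
PAIR = struct.Struct(">IIQ")


def pack_add_link_cont(link_num: int, num_rtokens: int, pairs: list,
                       reply: bool = False) -> bytes:
    """Pack an ADD LINK CONTINUATION message (hypothetical helper).

    pairs holds up to two (reference_rkey, new_rkey, new_vaddr) tuples;
    an unused pair slot is left as zeros.
    """
    if len(pairs) > 2:
        raise ValueError("at most two RKey/RToken pairs fit in one message")
    flags = 0x80 if reply else 0
    header = struct.pack(">BBBBBB2x", 3, 44, 0, flags, link_num, num_rtokens)
    body = b"".join(PAIR.pack(*p) for p in pairs).ljust(2 * PAIR.size, b"\x00")
    return header + body + bytes(4)  # final 4 bytes are reserved
```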
+
+A.3.4.  DELETE LINK LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 4     |  Length = 44  |  Reserved     |R|A|O| Rsrvd   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   Linknum     |         reason code (bytes 1-3)               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |RsnCode byte 4 |                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                          Reserved                           -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                 Figure 35: DELETE LINK LLC Message Format
+
+   When the client or server detects that a QP or SMC-R link has gone
+   down or needs to be brought down, it sends this message over one of
+   the other links in the link group.
+
+   When the DELETE LINK is sent from the client, it only serves as a
+   notification, and the client expects the server to respond by sending
+   a DELETE LINK request.  To avoid races, only the server will initiate
+   the actual DELETE LINK request and response sequence that results
+   from notification from the client.
+
+   The server can also initiate the DELETE LINK without notification
+   from the client if it detects an error or if orderly link termination
+   was initiated.
+
+   The client may also request termination of the entire link group, and
+   the server may terminate the entire link group using this message.
+
+   The contents of the DELETE LINK LLC message are:
+
+   Type
+
+      Type 4 indicates DELETE LINK.
+
+   Length
+
+      The DELETE LINK LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a DELETE LINK reply.
+
+   A
+
+      "All" flag.  When set, indicates that all links in the link group
+      are to be terminated.  This terminates the link group.
+
+   O
+
+      Orderly flag.  Indicates orderly termination.  Orderly termination
+      is generally caused by an operator command rather than an error on
+      the link.  When the client requests orderly termination, the
+      server may wait to complete other work before terminating.
+
+   LinkNum
+
+      The link number of the link to be terminated.  If the A flag is
+      set, this field has no meaning and is set to 0.
+
+   RsnCode
+
+      The termination reason code.  Currently defined reason codes are:
+
+      Request reason codes:
+
+         X'00010000' = Lost path
+
+         X'00020000' = Operator initiated termination
+
+         X'00030000' = Program initiated termination (link inactivity)
+
+         X'00040000' = LLC protocol violation
+
+         X'00050000' = Asymmetric link no longer needed
+
+      Response reason code:
+
+         X'00100000' = Unknown link ID (no link)
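+
+   The DELETE LINK layout in Figure 35 and the reason codes above can
+   be sketched in Python as follows.  The names are hypothetical, and
+   the flag masks assume that bit 0 in the figure is the most
+   significant bit of its byte.
+

```python
import struct

# Reason codes as listed in the field description
LOST_PATH = 0x00010000
OPERATOR_INITIATED = 0x00020000
PROGRAM_INITIATED = 0x00030000
LLC_PROTOCOL_VIOLATION = 0x00040000
ASYMMETRIC_LINK_NOT_NEEDED = 0x00050000
UNKNOWN_LINK_ID = 0x00100000  # response reason code


def pack_delete_link(link_num: int, reason: int, reply: bool = False,
                     all_links: bool = False, orderly: bool = False) -> bytes:
    """Pack a DELETE LINK message (hypothetical helper)."""
    flags = ((0x80 if reply else 0)        # R
             | (0x40 if all_links else 0)  # A
             | (0x20 if orderly else 0))   # O
    if all_links:
        link_num = 0  # LinkNum has no meaning when the A flag is set
    # header (4) + LinkNum (1) + RsnCode (4) + reserved (35) = 44 bytes
    return struct.pack(">BBBBBI35x", 4, 44, 0, flags, link_num, reason)
```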
+
+A.3.5.  CONFIRM RKEY LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 6     |  Length = 44  |   Reserved    |R|0|Z|C|Rsrvd  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |   NumTkns     |  New RMB RKey for this link (bytes 1-3)       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |ThisLink byte 4|                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |           New RMB virtual address for this link               |
+     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |               |                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |                                                               |
+     +-   Other link RMB specification or zeros                     -+
+     |                                                               |
+     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
+     |                                                               |
+     +-                                                             -+
+     |      Other link RMB specification or zeros                    |
+     +-                                              +-+-+-+-+-+-+-+-+
+     |                                               |  Reserved     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                Figure 36: CONFIRM RKEY LLC Message Format
+
+   The CONFIRM RKEY flow can be sent at any time from either the client
+   or the server, to inform the peer that an RMB has been created or
+   deleted.  The creator of a new RMB must inform its peer of the new
+   RMB's RToken for all SMC-R links in the SMC-R link group.
+
+   For RMB creation, the creator sends this message over the SMC-R link
+   used by the first TCP connection that uses the new RMB.  This
+   message contains the new RMB's RToken for the SMC-R link over which
+   the message is sent.  It then lists the sender's other SMC-R links in
+   the link group, each paired with the new RMB's RToken for that link.
+   This message can communicate the new RTokens for three QPs: the QP
+   for the link over which this message is sent, and two others.  If
+   there are more than three links in the SMC-R link group, a
+   CONFIRM RKEY CONTINUATION will be required.
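+
+   The capacity rule above can be sketched as a short Python
+   calculation.  The function name is hypothetical, and the figure of
+   three RTokens per CONFIRM RKEY CONTINUATION message is taken from
+   Appendix A.3.6.
+

```python
def confirm_rkey_messages(num_links: int) -> tuple:
    """Return (other_links, continuation_messages) for a new RMB.

    The CONFIRM RKEY message covers the sending link plus two others;
    each CONFIRM RKEY CONTINUATION message covers up to three more.
    """
    if num_links < 1:
        raise ValueError("a link group has at least one link")
    others = num_links - 1
    remaining = others - min(others, 2)
    continuations = -(-remaining // 3)  # ceiling division
    return others, continuations
```
+
+   For a link group of four links, the sketch reports three other
+   links and one continuation message.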
+
+   The peer responds by simply echoing the message with the reply flag
+   set.  If the response is negative, the sender must recalculate the
+   RToken set and start a new CONFIRM RKEY exchange from the beginning.
+   The timing of this retry is controlled by the C flag, as described
+   below.
+
+   The contents of the CONFIRM RKEY LLC message are:
+
+   Type
+
+      Type 6 indicates CONFIRM RKEY.
+
+   Length
+
+      The CONFIRM RKEY LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a CONFIRM RKEY
+      reply.
+
+   0
+
+      Reserved bit.
+
+   Z
+
+      Negative response flag.
+
+   C
+
+      Configuration Retry bit.  If this is a negative response and this
+      flag is set, the originator should recalculate the RKey set and
+      retry this exchange as soon as the current configuration change is
+      completed.  If this flag is not set on a negative response, the
+      originator must wait for the next natural stimulus (for example, a
+      new TCP connection started that requires a new RMB) before
+      retrying.
+
+   NumTkns
+
+      The number of other link/RToken pairs, including those provided in
+      this message, to be communicated.  Note that this value does not
+      include the RToken for the link on which this message was sent,
+      and that at most 2 other link/RToken pairs fit in this message.
+      If this value is 2 or less, this is the only message in the
+      exchange.  If this value is greater than 2, a CONFIRM RKEY
+      CONTINUATION message will be required.
+
+      Note: In this version of the architecture, eight is the maximum
+      number of links supported in a link group.
+
+   New RMB RKey for this link
+
+      The new RMB's RKey as assigned on the link over which this message
+      is being sent.
+
+   New RMB virtual address for this link
+
+      The new RMB's virtual address as assigned on the link over which
+      this message is being sent.
+
+   Other link RMB specification
+
+      The new RMB's specification on the other links in the link group,
+      as shown in Figure 37.
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     | Link number   | RMB's RKey for the specified link (bytes 1-3) |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |New RKey byte 4|                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |           RMB's virtual address for the specified link        |
+     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |               |
+     +-+-+-+-+-+-+-+-+
+
+                Figure 37: Format of Link Number/RKey Pairs
+
+   Link number
+
+      The link number for a link in the link group.
+
+   RMB's RKey for the specified link
+
+      The RKey used to reach the RMB over the link whose number was
+      specified in the Link number field.
+
+   RMB's virtual address for the specified link
+
+      The virtual address used to reach the RMB over the link whose
+      number was specified in the Link number field.
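+
+   As an illustration of Figures 36 and 37, the following Python
+   sketch packs a CONFIRM RKEY message.  The helper name is
+   hypothetical, and the flag masks assume that bit 0 in the figure is
+   the most significant bit of its byte.
+

```python
import struct

# One 13-byte entry: 1-byte number, 4-byte RKey, 8-byte virtual address.
# The first entry carries NumTkns in the 1-byte slot; later entries
# carry a link number there (Figure 37).
SPEC = struct.Struct(">BIQ")


def pack_confirm_rkey(num_tkns: int, rkey: int, vaddr: int, others: list,
                      reply: bool = False, negative: bool = False,
                      retry: bool = False) -> bytes:
    """Pack a CONFIRM RKEY message (hypothetical helper).

    others holds up to two (link_number, rkey, vaddr) tuples.
    """
    if len(others) > 2:
        raise ValueError("only two other-link entries fit in this message")
    flags = ((0x80 if reply else 0)       # R
             | (0x20 if negative else 0)  # Z
             | (0x10 if retry else 0))    # C
    header = struct.pack(">BBBB", 6, 44, 0, flags)
    this_link = SPEC.pack(num_tkns, rkey, vaddr)
    rest = b"".join(SPEC.pack(*o) for o in others).ljust(2 * SPEC.size, b"\x00")
    return header + this_link + rest + b"\x00"  # last byte is reserved
```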
+
+A.3.6.  CONFIRM RKEY CONTINUATION LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 8     |  Length = 44  |   Reserved    |R|0|Z|  Rsrvd  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  NumTknsLeft  |                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |                                                               |
+     +-          Other link RMB specification                       -+
+     |                                                               |
+     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |               |                                               |
+     +-+-+-+-+-+-+-+-+                                              -+
+     |                                                               |
+     +-   Other link RMB specification or zeros                     -+
+     |                                                               |
+     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                               |                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
+     |                                                               |
+     +-                                                             -+
+     |      Other link RMB specification or zeros                    |
+     +-                                              +-+-+-+-+-+-+-+-+
+     |                                               |  Reserved     |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+          Figure 38: CONFIRM RKEY CONTINUATION LLC Message Format
+
+   The CONFIRM RKEY CONTINUATION LLC message is used to communicate any
+   additional RMB RTokens that did not fit into the CONFIRM RKEY
+   message.  Each of these messages can hold up to three RMB RTokens.
+   The NumTknsLeft field indicates how many RMB RTokens are to be
+   communicated, including the ones in this message.  If the value is 3
+   or less, this is the last message of the group.  If the value is 4 or
+   higher, additional CONFIRM RKEY CONTINUATION messages will follow,
+   and the NumTknsLeft value will be a countdown until all are
+   communicated.
+
+   Like the CONFIRM RKEY message, the peer responds by echoing the
+   message back with the reply flag set.
+
+   The contents of the CONFIRM RKEY CONTINUATION LLC message are:
+
+   Type
+
+      Type 8 indicates CONFIRM RKEY CONTINUATION.
+
+   Length
+
+      The CONFIRM RKEY CONTINUATION LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a CONFIRM RKEY
+      CONTINUATION reply.
+
+   0
+
+      Reserved bit.
+
+   Z
+
+      Negative response flag.
+
+   NumTknsLeft
+
+      The number of link/RToken pairs, including those provided in this
+      message, that are remaining to be communicated.  If this value is
+      3 or less, this is the last message in the exchange.  If this
+      value is greater than 3, another CONFIRM RKEY CONTINUATION message
+      will be required.  Note that in this version of the architecture,
+      eight is the maximum number of links supported in a link group.
+
+   Other link RMB specification
+
+      The new RMB's specification on other links in the link group, as
+      shown in Figure 37.
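+
+   The NumTknsLeft countdown can be sketched in Python as follows.
+   The function name is hypothetical, and the sketch assumes that each
+   message is filled with three entries before the next one is sent,
+   so the value drops by three per message.
+

```python
def num_tkns_left_sequence(remaining: int, per_msg: int = 3) -> list:
    """Return the NumTknsLeft value carried by each continuation message."""
    values = []
    while remaining > 0:
        values.append(remaining)  # count includes this message's entries
        remaining -= per_msg
    return values
```
+
+   With five entries left after the CONFIRM RKEY message, the sketch
+   gives the values 5 and 2, so the second continuation message is the
+   last one.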
+
+A.3.7.  DELETE RKEY LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 9     |  Length = 44  |   Reserved    |R|0|Z|  Rsrvd  |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |     Count     | Error Mask    |        Reserved               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                First deleted RKey                             |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Second deleted RKey or zeros                       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Third deleted RKey or zeros                        |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Fourth deleted RKey or zeros                       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Fifth deleted RKey or zeros                        |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Sixth deleted RKey or zeros                        |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Seventh deleted RKey or zeros                      |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |            Eighth deleted RKey or zeros                       |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                       Reserved                                |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                 Figure 39: DELETE RKEY LLC Message Format
+
+   The DELETE RKEY flow can be sent at any time from either the client
+   or the server, to inform the peer that one or more RMBs have been
+   deleted.  Because the peer already knows every RMB's RKey on each
+   link in the link group, this message only specifies one RKey for each
+   RMB being deleted.  The RKey provided for each deleted RMB will be
+   its RKey as known on the SMC-R link over which this message is sent.
+
+   It is not necessary to provide the entire RToken.  The RKey alone is
+   sufficient for identifying an existing RMB.
+
+   The peer responds by simply echoing the message with the reply flag
+   set.  If the peer did not recognize an RKey, the negative response
+   flag will be set; however, no aggressive recovery action beyond
+   logging the error will be taken.
+
+   The contents of the DELETE RKEY LLC message are:
+
+   Type
+
+      Type 9 indicates DELETE RKEY.
+
+   Length
+
+      The DELETE RKEY LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a DELETE RKEY reply.
+
+   0
+
+      Reserved bit.
+
+   Z
+
+      Negative response flag.
+
+   Count
+
+      Number of RMBs being deleted by this message.  Maximum value is 8.
+
+   Error Mask
+
+      If this is a negative response, indicates which RMBs were not
+      successfully deleted.  Each bit corresponds to a listed RMB,
+      with the leftmost bit corresponding to the first RKey; for
+      example, b'01010000' indicates that the second and fourth RKeys
+      were not successfully deleted.
+
+   Deleted RKeys
+
+      A list of Count RKeys.  Provided on the request flow and echoed
+      back on the response flow.  Each RKey is valid on the link over
+      which this message is sent and represents a deleted RMB.  Up to
+      eight RMBs can be deleted in this message.
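+
+   The Error Mask encoding above can be illustrated with a short Python
+   sketch.  It assumes that the mask occupies a single byte and that
+   its high-order bit corresponds to the first listed RKey, which is
+   consistent with the b'01010000' example; the function name is
+   hypothetical and not part of the protocol.
+
```python
def failed_rkey_positions(error_mask, count):
    """Return the 1-based positions of RKeys flagged in the Error Mask.

    Assumption: the mask is one byte and its high-order bit maps to the
    first listed RKey, matching the b'01010000' example in the text.
    """
    if not 0 <= count <= 8:
        raise ValueError("Count must be between 0 and 8")
    if not 0 <= error_mask <= 0xFF:
        raise ValueError("Error Mask must fit in one byte")
    return [pos for pos in range(1, count + 1)
            if error_mask & (0x80 >> (pos - 1))]


# The example from the text: the second and fourth RKeys failed.
print(failed_rkey_positions(0b01010000, 4))
```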
+
+A.3.8.  TEST LINK LLC Message Format
+
+      0                   1                   2                   3
+      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |  Type = 7     |  Length = 44  |   Reserved    |R|  Reserved   |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                         User Data                           -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                          Reserved                             |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-                                                             -+
+     |                                                               |
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+                  Figure 40: TEST LINK LLC Message Format
+
+   The TEST LINK request can be sent from either peer to the other on an
+   existing SMC-R link at any time to test that the SMC-R link is active
+   and healthy at the software level.  A peer that receives a TEST LINK
+   LLC message immediately sends back a TEST LINK reply, echoing back
+   the user data.  Refer also to Section 4.5.3 ("TCP Keepalive
+   Processing").
+
+   The contents of the TEST LINK LLC message are:
+
+   Type
+
+      Type 7 indicates TEST LINK.
+
+   Length
+
+      The TEST LINK LLC message is 44 bytes long.
+
+   R
+
+      Reply flag.  When set, indicates that this is a TEST LINK reply.
+
+   User Data
+
+      The receiver of this message echoes the sender's data back in a
+      TEST LINK response LLC message.
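+
+   A minimal Python sketch of the request/reply echo described above
+   follows.  The field offsets are read from Figure 40 (16 bytes of
+   user data followed by 24 reserved bytes, with the R flag as the
+   high-order bit of the fourth byte); network byte order and the
+   function names are assumptions made for illustration.
+
```python
import struct

LLC_TEST_LINK = 7     # Type field value from Figure 40
LLC_MSG_LEN = 44      # Length field value from Figure 40
REPLY_FLAG = 0x80     # R bit: high-order bit of the fourth byte
USER_DATA_LEN = 16    # four 32-bit words of user data in Figure 40

# Type, Length, Reserved, flags byte, User Data, then 24 reserved bytes.
_TEST_LINK = struct.Struct("!BBBB16s24x")


def build_test_link(user_data, reply=False):
    """Build a 44-byte TEST LINK message (short user data is zero padded)."""
    if len(user_data) > USER_DATA_LEN:
        raise ValueError("user data is limited to 16 bytes")
    flags = REPLY_FLAG if reply else 0
    return _TEST_LINK.pack(LLC_TEST_LINK, LLC_MSG_LEN, 0, flags, user_data)


def build_test_link_reply(request):
    """Echo the user data of a TEST LINK request back with the R flag set."""
    msg_type, length, _reserved, flags, user_data = _TEST_LINK.unpack(request)
    if msg_type != LLC_TEST_LINK or length != LLC_MSG_LEN:
        raise ValueError("not a TEST LINK message")
    if flags & REPLY_FLAG:
        raise ValueError("message is already a reply")
    return build_test_link(user_data, reply=True)


request = build_test_link(b"probe-0001")
reply = build_test_link_reply(request)
print(len(request), hex(reply[3]), reply[4:20] == request[4:20])
```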
+
+A.4.  Connection Data Control (CDC) Message Format
+
+   The RMBE control data is communicated using Connection Data Control
+   (CDC) messages, which use RoCE SendMsg, similar to LLC messages.
+   Also, as with LLC messages, CDC messages are 44 bytes long to ensure
+   that they can fit into private data areas of receive WQEs without
+   requiring the receiver to post receive buffers.
+
+   Unlike LLC messages, this data is integral to the data path, so its
+   processing must be prioritized and optimized similarly to other data
+   path processing.  While LLC messages may be processed on a slower
+   path than data, CDC messages cannot be.
+
+       0                   1                   2                   3
+       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+   0  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      | Type = x'FE'  | Length = 44   |      Sequence number          |
+   4  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                       SMC-R alert token                       |
+   8  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |         Reserved              | Producer cursor wrap seqno    |
+   12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                       Producer Cursor                         |
+   16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |         Reserved              | Consumer cursor wrap seqno    |
+   20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                       Consumer Cursor                         |
+   24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |B|P|U|R|F|Rsrvd|D|C|A|             Reserved                    |
+   28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                                                               |
+   32 +-                                                             -+
+      |                                                               |
+   36 +-                         Reserved                            -+
+      |                                                               |
+   40 +-                                                             -+
+      |                                                               |
+   44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+          Figure 41: Connection Data Control (CDC) Message Format
+
+   Type = x'FE'
+
+      This type number has the two high-order bits turned on to enable
+      processing to quickly distinguish it from an LLC message.
+
+   Length = 44
+
+      The length of inline data that does not require the posting of a
+      receive buffer.
+
+   Sequence number
+
+      A 2-byte unsigned integer that represents a wrapping sequence
+      number.  The initial value is 1, and this value can wrap to 0.
+      Incremented with every control message sent, except for the
+      failover data validation message, and used to guard against
+      processing an old control message out of sequence.  Also used in
+      failover data validation.  In normal usage, if this number is less
+      than the last received value, discard this message.  If greater,
+      process this message.  Old control messages can be lost with no
+      ill effect but cannot be processed after newer ones.
+
+      If this is a failover validation CDC message (F flag set), then
+      the receiver must verify that it has received and fully processed
+      the RDMA write that was described by the CDC message with the
+      sequence number in this message.  If not, the TCP connection must
+      be reset to guard against data loss.  Details of this processing
+      are provided in Section 4.6.1.
+
+   SMC-R alert token
+
+      The endpoint-assigned alert token that identifies to which TCP
+      connection on the link group this control message refers.
+
+   Producer cursor wrap seqno
+
+      A 2-byte unsigned integer that represents a wrapping counter
+      incremented by the producer whenever the data written into this
+      RMBE receive buffer causes a wrap (i.e., the producer cursor
+      wraps).  This is used by the receiver to determine when new data
+      is available even though the cursors appear unchanged, such as
+      when a full window size write is completed (producer cursor of
+      this RMBE sent by peer = local consumer cursor) or in scenarios
+      where the producer cursor sent for this RMBE < local consumer
+      cursor.
+
+   Producer Cursor
+
+      A 4-byte unsigned integer that is a wrapping offset into the RMBE
+      data area.  Points to the next byte of data to be written by the
+      sender.  Can advance up to the receiver's consumer cursor as known
+      by the sender.  When the urgent data present indicator is on,
+      points 1 byte beyond the last byte of urgent data.  When computing
+      this cursor, the presence of the eye catcher in the RMBE data area
+      must be accounted for.  The first writable data location in the
+      RMBE is at offset 4, so this cursor begins at 4 and wraps to 4.
+
+   Consumer cursor wrap seqno
+
+      A 2-byte unsigned integer that mirrors the value of the producer
+      cursor wrap sequence number when the last read from this RMBE
+      occurred.  Used as an indicator of how far along the consumer is
+      in reading data (i.e., processed last wrap point or not).  The
+      producer side can use this indicator to detect whether or not more
+      data can be written to the partner in full window write scenarios
+      (where the producer cursor = consumer cursor as known on the
+      remote RMBE).  In this scenario, if the consumer sequence number
+      equals the local producer sequence number, the producer knows that
+      more data can be written.
+
+   Consumer Cursor
+
+      A 4-byte unsigned integer that is a wrapping offset into the
+      sender's RMBE data area.  Points to the offset of the next byte of
+      data to be consumed by the peer in its own RMBE.  When computing
+      this cursor, the presence of the eye catcher in the RMBE data area
+      must be accounted for.  The first writable data location in the
+      RMBE is at offset 4, so this cursor begins at 4 and wraps to 4.
+      The sender cannot write beyond this cursor into the peer's RMBE
+      without causing data loss.
+
+   B-bit
+
+      Writer blocked indicator: Sender is blocked for writing.  If this
+      bit is set, sender will require explicit notification when receive
+      buffer space is available.
+
+   P-bit
+
+      Urgent data pending: Sender has urgent data pending for this
+      connection.
+
+   U-bit
+
+      Urgent data present: Indicates that urgent data is present in the
+      RMBE data area, and the producer cursor points to 1 byte beyond
+      the last byte of urgent data.
+
+   R-bit
+
+      Request for consumer cursor update: Indicates that an immediate
+      consumer cursor update is requested, regardless of whether or not
+      one is warranted according to the window size optimization
+      algorithm described in Section 4.5.1.
+
+   F-bit
+
+      Failover validation indicator: Sent by a peer to guard against
+      data loss during failover when the TCP connection is being moved
+      to another SMC-R link in the link group.  When this bit is set,
+      the only other fields in the CDC message that are significant are
+      the Type, Length, SMC-R alert token, and Sequence number fields.
+      The receiver must validate that it has fully processed the RDMA
+      write described by the previous CDC message bearing the same
+      sequence number as this validation message.  If it has, no further
+      action is required.  If it has not, the TCP connection must be
+      reset.  This processing is described in detail in Section 4.6.1.
+
+   D-bit
+
+      Sending done indicator: Sent by a peer when it is done writing new
+      data into the receiver's RMBE data area.
+
+   C-bit
+
+      PeerConnectionClosed indicator: Sent by a peer when it is
+      completely done with this connection and will no longer be making
+      any updates to the receiver's RMBE or sending any more control
+      messages.
+
+   A-bit
+
+      Abnormal close indicator: Sent by a peer when the connection is
+      abnormally terminated (for example, the TCP connection was reset).
+      When sent, it indicates that the peer is completely done with this
+      connection and will no longer be making any updates to this RMBE
+      or sending any more control messages.  It also indicates that the
+      RMBE owner must flush any remaining data on this connection and
+      generate an error return code to any outstanding socket APIs on
+      this connection (same processing as receiving a RST segment on a
+      TCP connection).
+
+Appendix B.  Socket API Considerations
+
+   A key design goal for SMC-R is that applications can use it without
+   being changed.  SMC-R is confined to socket applications using stream
+   (i.e., TCP) sockets over IPv4 or IPv6.  By virtue of the fact that
+   the switch to the SMC-R protocol occurs after a TCP connection is
+   established, no changes are required in a socket address family or in
+   the IP addresses and ports that the socket applications are using.
+   Existing socket APIs that allow applications to retrieve local and
+   remote socket address structures for an established TCP connection
+   (for example, getsockname() and getpeername()) will continue to
+   function as they have before.  Existing DNS setup and APIs for
+   resolving hostnames to IP addresses and vice versa also continue to
+   function without any changes.  In general, all of the usual socket
+   APIs that are used for TCP communications (send APIs, recv APIs,
+   etc.) will continue to function as they do today, even if SMC-R is
+   used as the underlying protocol.
+
+   Each SMC-R-enabled implementation does, however, need to pay special
+   attention to any socket APIs that have a reliance on the underlying
+   TCP and IP protocols and also ensure that their behavior in an SMC-R
+   environment is reasonable and minimizes impact on the application.
+   While the basic socket API set is fairly similar across different
+   operating systems, there is more variability when it comes to
+   advanced socket API options.  Each implementation needs to perform a
+   detailed analysis of its API options, any possible impact that SMC-R
+   may have, and any resultant implications.  As part of that step, a
+   discussion or review with other implementations supporting SMC-R
+   would be useful to ensure consistent implementation.
+
+B.1.  setsockopt() / getsockopt() Considerations
+
+   These APIs allow socket applications to manipulate socket, transport
+   (TCP/UDP), and IP-level options associated with a given socket.
+   Typically, a platform restricts the number of IP options available to
+   stream (TCP) socket applications, given their connection-oriented
+   nature.  The general guideline here is to continue processing these
+   APIs in a manner that allows for application compatibility.  Some
+   options will be relevant to the SMC-R protocol and will require
+   special processing "under the covers".  For example, the ability to
+   manipulate TCP send and receive buffer sizes is still valid for
+   SMC-R.  However, other options may have no meaning for SMC-R.  For
+   example, if an application enabled the TCP_NODELAY socket option to
+   disable Nagle's algorithm, it should have no real effect on SMC-R
+   communications, as there is no notion of Nagle's algorithm with this
+   new protocol.  But the implementation must accept the TCP_NODELAY
+   option as it does today and save it so that it can be later extracted
+   via getsockopt() processing.  Note that any TCP or IP-level options
+   will still have an effect on any TCP/IP packets flowing for an SMC-R
+   connection (i.e., as part of TCP/IP connection establishment and
+   TCP/IP connection termination packet flows).
+
+   Under the covers, manipulation of the TCP options will also involve
+   the SMC layer setting and reading the SMC-R experimental option
+   before and after completion of the three-way TCP handshake.
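+
+   From the application's point of view, the TCP_NODELAY behavior
+   described above looks the same as it does for plain TCP.  The Python
+   sketch below uses only the standard socket API on an ordinary TCP
+   socket; whether SMC-R is in use underneath depends on the platform
+   and is not visible in this code.
+
```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # The application sets the option exactly as it would for plain TCP.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # The saved value remains retrievable through getsockopt().
    nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
    print("TCP_NODELAY enabled:", bool(nodelay))
```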
+
+Appendix C.  Rendezvous Error Scenarios
+
+   This section discusses error scenarios for setting up and managing
+   SMC-R links.
+
+C.1.  SMC Decline during CLC Negotiation
+
+   A peer to the SMC-R CLC negotiation can send an SMC Decline in lieu
+   of any expected CLC message to decline SMC and force the TCP
+   connection back to the IP fabric.  There can be several reasons for
+   an SMC Decline during the CLC negotiation, including the following:
+
+   o  RNIC went down
+
+   o  SMC-R forbidden by local policy
+
+   o  subnet (IPv4) or prefix (IPv6) doesn't match
+
+   o  lack of resources to perform SMC-R
+
+   In all cases, when an SMC Decline is sent in lieu of an expected CLC
+   message, no confirmation is required, and the TCP connection
+   immediately falls back to using the IP fabric.
+
+   To prevent ambiguity between CLC messages and application data, an
+   SMC Decline cannot "chase" another CLC message.  An SMC Decline can
+   only be sent in lieu of an expected CLC message.  For example, if the
+   client sends an SMC Proposal and then its RNIC goes down, it must
+   wait for the SMC Accept from the server and then reply to the
+   SMC Accept with an SMC Decline.
+
+   This "no chase" rule means that if this TCP connection is not a first
+   contact between RoCE peers, a server cannot send an SMC Decline after
+   sending an SMC Accept -- it can only either break the TCP connection
+   or fail over if a problem arises in the RoCE fabric after it has sent
+   the SMC Accept.  Similarly, once the client sends an SMC Confirm on a
+   TCP connection that isn't a first contact, it is committed to SMC-R
+   for this TCP connection and cannot fall back to IP.
+
+C.2.  SMC Decline during LLC Negotiation
+
+   For a TCP connection that represents a first contact between RoCE
+   peers, it is possible for SMC to fall back to IP during the LLC
+   negotiation.  This is possible until the first contact SMC-R link is
+   confirmed.  For example, see Figure 42.  After a first contact SMC-R
+   link is confirmed, fallback to IP is no longer possible.  This
+   translates to the following rule: a first contact peer can send an
+   SMC Decline at any time during LLC negotiation until it has
+   successfully sent its CONFIRM LINK (request or response) flow.  After
+   that point, it cannot fall back to IP.
+
+       Host X -- Server                           Host Y -- Client
+    +-------------------+                      +-------------------+
+    | Peer ID = PS1     |                      |   Peer ID = PC1   |
+    |            +------+                      +------+            |
+    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
+    | RKey X |   |MAC MA|<-------------------->|MAC MB|   |        |
+    |        |   |GID GA|   attempted setup    |GID GB|   | RKey Y2|
+    |       \/   +------+                      +------+  \/        |
+    |+--------+         |                      |        +--------+ |
+    || RMB    |         |                      |        | RMB    | |
+    |+--------+         |                      |        +--------+ |
+    |       /\   +------+                      +------+  /\        |
+    |        |   |RNIC 3|                      |RNIC 4|   | RKey W2|
+    |        |   |MAC MC|                      |MAC MD|   |        |
+    |       QP 9 |GID GC|                      |GID GD|  QP 65     |
+    |            +------+                      +------+            |
+    +-------------------+                      +-------------------+
+
+          SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
+         <--------------------------------------------------------->
+
+            SMC Proposal / SMC Accept / SMC Confirm exchange
+         <-------------------------------------------------------->
+
+           CONFIRM LINK(request, Link 1)
+         .........................................................>
+
+                           CONFIRM LINK(response, Link 1)
+                              X...................................
+                                :
+                                : RoCE write failure
+                                :.................................>
+
+           SMC Decline(PC1, reason code)
+          <--------------------------------------------------------
+
+              Connection data flows over IP fabric
+          <------------------------------------------------------->
+
+                          Legend:
+                   ------------   TCP/IP and CLC flows
+                   ............   RoCE (LLC) flows
+
+               Figure 42: SMC Decline during LLC Negotiation
+
+C.3.  The SMC Decline Window
+
+   Because SMC-R does not support fallback to IP for a TCP connection
+   that is already using RDMA, there are specific rules on when the
+   SMC Decline CLC message, which signals a fallback to IP because of an
+   error or problem with the RoCE fabric, can be sent during TCP
+   connection setup.  There is a "point of no return" after which a
+   connection cannot fall back to IP, and RoCE errors that occur after
+   this point require the connection to be broken with a RST flow in the
+   IP fabric.
+
+   For a first contact, that point of no return is after the ADD LINK
+   LLC message has been successfully sent for the second SMC-R link.
+   Specifically, the server cannot fall back to IP after receiving
+   either (1) a positive write completion indication for the ADD LINK
+   request or (2) the ADD LINK response from the client, whichever comes
+   first.  The client cannot fall back to IP after sending a negative
+   ADD LINK response, receiving a positive write complete on a positive
+   ADD LINK response, or receiving a CONFIRM LINK for the second SMC-R
+   link from the server, whichever comes first.
+
+   For a subsequent contact, that point of no return is after the last
+   send of the CLC negotiation completes.  This, in combination with the
+   rule that error "chasers" are not allowed during CLC negotiation,
+   means that the server cannot send an SMC Decline after sending an SMC
+   Accept, and the client cannot send an SMC Decline after sending an
+   SMC Confirm.
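+
+   The point-of-no-return rules in this section can be summarized in a
+   small Python sketch.  The event names are hypothetical labels for
+   the conditions listed above, and the sketch covers only the rules of
+   this section; it does not model the "no chase" requirement that a
+   CLC message must currently be expected.
+
```python
# Events that commit each role to SMC-R on a first contact.
FIRST_CONTACT_COMMIT = {
    "server": {"add_link_write_complete", "add_link_response_received"},
    "client": {"negative_add_link_response_sent",
               "positive_add_link_response_write_complete",
               "confirm_link_second_link_received"},
}

# On a subsequent contact, the last CLC send of each role commits it.
SUBSEQUENT_CONTACT_COMMIT = {
    "server": {"smc_accept_sent"},
    "client": {"smc_confirm_sent"},
}


def may_fall_back_to_ip(role, first_contact, events):
    """Return True while `role` has not passed the point of no return."""
    if first_contact:
        committing = FIRST_CONTACT_COMMIT[role]
    else:
        committing = SUBSEQUENT_CONTACT_COMMIT[role]
    return not (committing & set(events))


print(may_fall_back_to_ip("server", False, {"smc_accept_sent"}))
print(may_fall_back_to_ip("client", True, {"smc_confirm_sent"}))
```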
+
+C.4.  Out-of-Sync Conditions during SMC-R Negotiation
+
+   The SMC Accept CLC message contains a first contact flag that
+   indicates to the client whether the server believes it is setting up
+   a new link group or using an existing link group.  This flag is used
+   to detect an out-of-sync condition between the client and the server.
+   The scenario for such a condition is as follows: there is a single
+   existing SMC-R link between the peers.  After the client sends the
+   SMC Proposal CLC message, the existing SMC-R link between the client
+   and the server fails.  The client cannot chase the SMC Proposal CLC
+   message with an SMC Decline CLC message in this case, because the
+   client does not yet know that the server would have wanted to choose
+   the SMC-R link that just crashed.  The QP that failed recovers before
+   the server returns its SMC Accept CLC message.  This means that there
+   is a QP but no SMC-R link.  Since the server had not yet learned of
+   the SMC-R link failure when it sent the SMC Accept CLC message, it
+   attempts to reuse the SMC-R link that just failed.  This means that
+   the server would not set the first contact flag, indicating to the
+   client that the server thinks it is reusing an SMC-R link.  However,
+   the client does not have an SMC-R link that matches the server's
+   specification.  Because the first contact flag is off, the client
+   realizes it is out of sync with the server and sends an SMC Decline
+   to cause the connection to fall back to IP.
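+
+   The client-side check in this scenario reduces to a single
+   condition, sketched here in Python.  The function and parameter
+   names are hypothetical, and only the case described above is
+   modeled.
+
```python
def is_out_of_sync(first_contact_flag, has_matching_link):
    """Detect the out-of-sync condition on receipt of an SMC Accept.

    The server believes it is reusing an SMC-R link (flag off), but the
    client has no link matching the server's specification.
    """
    return (not first_contact_flag) and (not has_matching_link)


# Flag off and no matching link: the client sends an SMC Decline.
print(is_out_of_sync(first_contact_flag=False, has_matching_link=False))
```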
+
+C.5.  Timeouts during CLC Negotiation
+
+   Because the SMC-R negotiation flows as TCP data, there are built-in
+   timeouts and retransmits at the TCP layer for individual messages.
+   Implementations also must protect the overall TCP/CLC handshake with
+   a timer or timers to prevent connections from hanging indefinitely
+   due to SMC-R processing.  This can be done with individual timers for
+   individual CLC messages or an overall timer for the entire exchange,
+   which may include the TCP handshake and the CLC handshake under one
+   timer or separate timers.  This decision is implementation dependent.
+
+   If the TCP and/or CLC handshakes time out, the TCP connection must be
+   terminated as it would be in a legacy IP environment when connection
+   setup doesn't complete in a timely manner.  Because the CLC flows are
+   TCP messages, if they cannot be sent and received in a timely
+   fashion, the TCP connection is not healthy and would not work if
+   fallback to IP were attempted.
+
+C.6.  Protocol Errors during CLC Negotiation
+
+   Protocol errors occur during CLC negotiation when a message is
+   received that is not expected.  For example, a peer that is expecting
+   a CLC message but instead receives application data has experienced a
+   protocol error; this also indicates a likely software error, as the
+   two sides are out of sync.  When application data is expected, the
+   data is not parsed to check whether it is actually a CLC message.
+
+   When a peer is expecting a CLC negotiation message, a message with
+   any parsing error except a bad enumerated value must be treated as
+   application data.  The CLC negotiation messages are designed with
+   beginning and ending eye catchers to help verify that a CLC
+   negotiation message is actually the expected message.  If other
+   parsing errors in an expected CLC message occur, such as incorrect
+   length fields or incorrectly formatted fields, the message must be
+   treated as application data.
+
+   All protocol errors, with the exception of bad enumerated values,
+   must result in termination of the TCP connection.  No fallback to IP
+   is allowed in the case of a protocol error, because if the protocols
+   are out of sync, mismatched, or corrupted, then data and security
+   integrity cannot be ensured.
+
+   The exception to this rule is enumerated values -- for example, the
+   QP MTU values on SMC Accept and SMC Confirm.  If a reserved value is
+   received, the proper error response is to send an SMC Decline and
+   fall back to IP; this is because the use of a reserved enumerated
+   value indicates that the other partner likely has additional support
+   that the receiving partner does not have.  This indicated mismatch of
+   SMC-R capabilities is not an integrity problem but indicates that
+   SMC-R cannot be used for this connection.
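+
+   The error handling rules of this section can be sketched as a
+   simple dispatch in Python.  The enumeration and function names are
+   hypothetical; the sketch only maps the outcome of parsing an
+   expected CLC message to the response required above.
+
```python
from enum import Enum


class ParseResult(Enum):
    OK = "ok"
    BAD_ENUMERATED_VALUE = "reserved enumerated value (e.g., QP MTU)"
    OTHER_PARSE_ERROR = "bad eye catcher, length, or field format"


class Action(Enum):
    CONTINUE = "continue CLC negotiation"
    DECLINE_AND_FALL_BACK = "send SMC Decline and fall back to IP"
    TERMINATE_TCP = "terminate the TCP connection"


def action_for_expected_clc(result):
    """Map the parse outcome of an expected CLC message to an action."""
    if result is ParseResult.OK:
        return Action.CONTINUE
    if result is ParseResult.BAD_ENUMERATED_VALUE:
        # A capability mismatch, not an integrity problem.
        return Action.DECLINE_AND_FALL_BACK
    # Any other parsing error means the bytes are treated as application
    # data, which is a protocol error while a CLC message is expected.
    return Action.TERMINATE_TCP


for outcome in ParseResult:
    print(outcome.name, "->", action_for_expected_clc(outcome).value)
```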
+
+C.7.  Timeouts during LLC Negotiation
+
+   Whenever a peer sends an LLC message to which a reply is expected, it
+   sets a timer after the send posts to wait for the reply.  An expected
+   response may be a reply flavor of the LLC message (for example, a
+   CONFIRM LINK reply) or a new LLC message (for example, an ADD LINK
+   CONTINUATION expected from the server by the client if there are more
+   RKeys to be communicated).
+
+   On LLC flows that are part of a first contact setup of a link group,
+   the value of the timer is implementation dependent but should be long
+   enough to allow the other peer to have a write complete timeout and
+   2-3 retransmits of an SMC Decline on the TCP fabric.  For LLC flows
+   that are maintaining the link group and are not part of a first
+   contact setup of a link group, the timers may be shorter.  Upon
+   receipt of an expected reply, the timer is cancelled.  If a timer
+   expires without a reply having been received, the sender must
+   initiate a recovery action.
+
+   During first contact processing, expiration of an LLC verification
+   timer is a "should-not-occur" condition that indicates a problem
+   with one of the endpoints; this is because if there is a "routine"
+   failure in the RoCE fabric that causes an LLC verification send to
+   fail, the sender will get a write completion failure and will then
+   send an SMC Decline to the partner.  The only time an LLC
+   verification timer will expire on a first contact is when the sender
+   thinks the send succeeded but it actually didn't.  Because of the
+   reliably connected nature of QP connections on the RoCE fabric, this
+   indicates a problem with one of the peers, not with the RoCE fabric.
+
+   After the reliably connected queue pair for the first SMC-R link in a
+   link group is set up on initial contact, the client sets a timer to
+   wait for a RoCE verification message from the server that the QP is
+   actually connected and usable.  If the server experiences a failure
+   sending its QP confirmation message, it will send an SMC Decline,
+   which should arrive at the client before the client's verification
+   timer expires.  If the client's timer expires without receiving
+   either an SMC Decline or a RoCE message confirmation from the server,
+   there is a problem with either the server or the TCP fabric.  In
+   either case, the client must break the TCP connection and clean up
+   the SMC-R link.
+
+   There are two scenarios in which the client's response to the QP
+   verification message fails to reach the server.  The main difference
+   is whether or not the client has successfully completed the send of
+   the CONFIRM LINK response.
+
+   In the normal case of a problem with the RoCE path, the client will
+   learn of the failure by getting a write completion failure, before
+   the server's timer expires.  In this case, the client sends an SMC
+   Decline CLC message to the server, and the TCP connection falls back
+   to IP.
+
+   If the client's send of the confirmation message receives a positive
+   return code but for some reason still does not reach the server, or
+   the client's SMC Decline CLC message fails to reach the server after
+   the client fails to send its RoCE confirmation message, then the
+   server's timer will time out and the server must break the TCP
+   connection by sending a RST.  This is expected to be a very rare
+   case, because if the client cannot send its CONFIRM LINK response LLC
+   message, the client should get a negative return code and initiate
+   fallback to IP.  A client receiving a positive return code on a send
+   that fails to reach the server should also be an extremely rare case.
+
+C.7.1.  Recovery Actions for LLC Timeouts and Failures
+
+   The following list describes recovery actions for LLC timeouts.  A
+   write completion failure or other indication of send failure for an
+   LLC command is treated the same as a timeout.
+
+   LLC message: CONFIRM LINK from server (first contact, first link in
+   the link group)
+
+      Timer waits for: CONFIRM LINK reply from client.
+
+      Recovery action: Break the TCP connection by sending a RST, and
+      clean up the link.  The server should have received an SMC Decline
+      from the client by now if the client had an LLC send failure.
+
+   LLC message: CONFIRM LINK from server (first contact, second link in
+   the link group)
+
+      Timer waits for: CONFIRM LINK reply from client.
+
+      Recovery action: The second link was not successfully set up.
+      Send a DELETE LINK to the client.  To keep the peers from getting
+      out of sync on the state of the link group, connection data
+      cannot flow over the first link in the link group until the reply
+      to this DELETE LINK is received.
+
+   LLC message: CONFIRM LINK from server (not first contact)
+
+      Timer waits for: CONFIRM LINK reply from client.
+
+      Recovery action: Clean up the new link, and set a timer to retry.
+      Send a DELETE LINK to the client, in case the client has a longer
+      timer interval, so the client can stop waiting.
+
+   LLC message: CONFIRM LINK reply from client (first contact)
+
+      Timer waits for: ADD LINK from server.
+
+      Recovery action: Clean up the SMC-R link, and break the TCP
+      connection by sending a RST over the IP fabric.  There is a
+      problem with the server.  If the server had a send failure, it
+      should have sent an SMC Decline by now.
+
+   LLC message: ADD LINK from server (first contact)
+
+      Timer waits for: ADD LINK reply from client.
+
+      Recovery action: Break the TCP connection with a RST, and clean up
+      RoCE resources.  The connection is past the point where the server
+      can fall back to IP, and if the client had a send problem it
+      should have sent an SMC Decline by now.
+
+   LLC message: ADD LINK from server (not first contact)
+
+      Timer waits for: ADD LINK reply from client.
+
+      Recovery action: Clean up resources (QP, RKeys, etc.) for the new
+      link, and treat the link over which the ADD LINK was sent as if it
+      had failed.  If there is another link available to resend the
+      ADD LINK and the link group still needs another link, retry the
+      ADD LINK over another link in the link group.
+
+   LLC message: ADD LINK reply from client (and there are more RKeys to
+   be communicated)
+
+      Timer waits for: ADD LINK CONTINUATION from server.
+
+      Recovery action: Treat the same as ADD LINK timer failure.
+
+   LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
+   client (and there are no more RKeys to be communicated, for the
+   second link in a first contact scenario)
+
+      Timer waits for: CONFIRM LINK from the server, over the new link.
+
+      Recovery action: The setup of the new link failed.  Send a
+      DELETE LINK to the server.  Do not consider the socket opened to
+      the client application until receiving confirmation from the
+      server in the form of a DELETE LINK request for this link and
+      sending the reply (to prevent the partners from being out of sync
+      on the state of the link group).
+
+      Set a timer to send another ADD LINK to the server if there is
+      still an unused RNIC on the client side.
+
+   LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from
+   client (and there are no more RKeys to be communicated)
+
+      Timer waits for: CONFIRM LINK from the server, over the new link.
+
+      Recovery action: Send a DELETE LINK to the server for the new
+      link, then clean up any resources allocated for the new link and
+      set a timer to send an ADD LINK to the server if there is still an
+      unused RNIC on the client side.  The setup of the new link failed,
+      but the link over which the ADD LINK exchange occurred is
+      unaffected.
+
+   LLC message: ADD LINK CONTINUATION from server
+
+      Timer waits for: ADD LINK CONTINUATION reply from client.
+
+      Recovery action: Treat the same as ADD LINK timer failure.
+
+   LLC message: ADD LINK CONTINUATION reply from client (first contact,
+   and RMB count fields indicate that the server owes more ADD LINK
+   CONTINUATION messages)
+
+      Timer waits for: ADD LINK CONTINUATION from server.
+
+      Recovery action: Clean up the SMC-R link, and break the TCP
+      connection by sending a RST.  There is a problem with the server.
+
+      If the server had a send failure, it should have sent an
+      SMC Decline by now.
+
+   LLC message: ADD LINK CONTINUATION reply from client (not first
+   contact, and RMB count fields indicate that the server owes more
+   ADD LINK CONTINUATION messages)
+
+      Timer waits for: ADD LINK CONTINUATION from server.
+
+      Recovery action: Treat as if the client detected a link failure
+      on the link that the ADD LINK exchange is using.  Send a
+      DELETE LINK to the server over another active link if one exists;
+      otherwise, clean up the link group.
+
+   LLC message: DELETE LINK from client
+
+      Timer waits for: DELETE LINK request from server.
+
+      Recovery action: If the scope of the request is to delete a single
+      link, the surviving link over which the client sent the
+      DELETE LINK is no longer usable either.  If this is the last link
+      in the link group, end TCP connections over the link group by
+      sending RST packets.  If there are other surviving links in the
+      link group, resend over a surviving link.  Also send a DELETE LINK
+      over a surviving link for the link over which the client attempted
+      to send the initial DELETE LINK message.  If the scope of the
+      request is to delete the entire link group, try resending on other
+      links in the link group until success is achieved.  If all sends
+      fail, tear down the link group and any TCP connections that exist
+      on it.
+
+   LLC message: DELETE LINK from server (scope: entire link group)
+
+      Timer waits for: Confirmation from the adapter that the message
+      was delivered.
+
+      Recovery action: Tear down the link group and any TCP connections
+      that exist on it.
+
+   LLC message: DELETE LINK from server (scope: single link)
+
+      Timer waits for: DELETE LINK reply from client.
+
+      Recovery action: The link over which the server sent the
+      DELETE LINK is no longer usable either.  If this is the last link
+      in the link group, end TCP connections over the link group by
+      sending RST packets.  If there are other surviving links in the
+      link group, resend over a surviving link.  Also send a DELETE LINK
+      over a surviving link for the link over which the server attempted
+      to send the initial DELETE LINK message.  If the scope of the
+      request is to delete the entire link group, try resending on other
+      links in the link group until success is achieved.  If all sends
+      fail, tear down the link group and any TCP connections that exist
+      on it.
+
+   LLC message: CONFIRM RKEY from client
+
+      Timer waits for: CONFIRM RKEY reply from server.
+
+      Recovery action: Perform normal client procedures for detection of
+      failed link.  The link over which the message was sent has failed.
+
+   LLC message: CONFIRM RKEY from server
+
+      Timer waits for: CONFIRM RKEY reply from client.
+
+      Recovery action: Perform normal server procedures for detection of
+      failed link.  The link over which the message was sent has failed.
+
+   LLC message: TEST LINK from client
+
+      Timer waits for: TEST LINK reply from server.
+
+      Recovery action: Perform normal client procedures for detection of
+      failed link.  The link over which the message was sent has failed.
+
+   LLC message: TEST LINK from server
+
+      Timer waits for: TEST LINK reply from client.
+
+      Recovery action: Perform normal server procedures for detection of
+      failed link.  The link over which the message was sent has failed.
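+
+   One way to organize the timeout rules above in an implementation
+   is a lookup table keyed by the LLC message, its sender, and the
+   context.  The sketch below is illustrative only: the table holds a
+   subset of the entries above, and the names Event, RECOVERY, and
+   recovery_for are hypothetical.  It also shows the rule that a send
+   failure is treated the same as a timeout.
+
```python
from enum import Enum, auto
from typing import Optional


class Event(Enum):
    REPLY_RECEIVED = auto()
    TIMEOUT = auto()
    SEND_FAILURE = auto()  # e.g., a write completion failure


# Partial table: (LLC message, sender, context) -> recovery summary.
RECOVERY = {
    ("CONFIRM LINK", "server", "first contact, first link"):
        "break TCP connection with RST; clean up the link",
    ("CONFIRM LINK", "server", "first contact, second link"):
        "send DELETE LINK to client; hold data until its reply arrives",
    ("CONFIRM LINK", "server", "not first contact"):
        "clean up new link; set retry timer; send DELETE LINK to client",
    ("ADD LINK", "server", "first contact"):
        "break TCP connection with RST; clean up RoCE resources",
    ("TEST LINK", "client", "any"):
        "link has failed; run client failed-link procedures",
    ("TEST LINK", "server", "any"):
        "link has failed; run server failed-link procedures",
}


def recovery_for(message: str, sender: str, context: str,
                 event: Event) -> Optional[str]:
    """Return the recovery summary, or None if the reply arrived."""
    if event is Event.REPLY_RECEIVED:
        return None
    # TIMEOUT and SEND_FAILURE share one recovery action.
    return RECOVERY[(message, sender, context)]
```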
+
+   The following list describes recovery actions for invalid LLC
+   messages.  These could be misformatted or contain out-of-sync data.
+
+   LLC message received: CONFIRM LINK from server
+
+      What it indicates: Incorrect link information.
+
+      Recovery action: Protocol error.  The link must be brought down by
+      sending a DELETE LINK for the link over another link in the link
+      group if one exists.  If this is a first contact, fall back to IP
+      by sending an SMC Decline to the server.
+
+   LLC message received: ADD LINK
+
+      What it indicates: Undefined enumerated MTU value.
+
+      Recovery action: Send a negative ADD LINK reply with reason
+      code x'2'.
+
+   LLC message received: ADD LINK reply from client
+
+      What it indicates: Client-side link information that would result
+      in a parallel link being set up.
+
+      Recovery action: Parallel links are not permitted.  Delete the
+      link by sending a DELETE LINK to the client over another link in
+      the link group.
+
+   LLC message received: Any link group command from the server, except
+   DELETE LINK for the entire link group
+
+      What it indicates: Client has sent a DELETE LINK for the link on
+      which the message was received.
+
+      Recovery action: Ignore the LLC message.  Worst case: the server
+      will time out.  Best case: the DELETE LINK crosses with the
+      command from the server, and the server realizes it failed.
+
+   LLC message received: ADD LINK CONTINUATION from server or ADD LINK
+   CONTINUATION reply from client
+
+      What it indicates: Number of RMBs provided doesn't match count
+      given on initial ADD LINK or ADD LINK reply message.
+
+      Recovery action: Protocol error.  Treat as if a link outage had
+      been detected.
+
+   LLC message received: DELETE LINK from client
+
+      What it indicates: Link indicated doesn't exist.
+
+      Recovery action: If the link is in the process of being cleaned
+      up, assume a timing window and ignore the message.  Otherwise,
+      send a DELETE LINK reply with reason code 1.
+
+   LLC message received: DELETE LINK from server
+
+      What it indicates: Link indicated doesn't exist.
+
+      Recovery action: Send a DELETE LINK reply with reason code 1.
+
+   LLC message received: CONFIRM RKEY from either client or server
+
+      What it indicates: No RKey provided for one or more of the links
+      in the link group.
+
+      Recovery action: Treat as if detected failure of the link(s) for
+      which no RKey was provided.
+
+   LLC message received: DELETE RKEY
+
+      What it indicates: Specified RKey doesn't exist.
+
+      Recovery action: Send a negative DELETE RKEY response.
+
+   LLC message received: TEST LINK reply
+
+      What it indicates: User data doesn't match what was sent in the
+      TEST LINK request.
+
+      Recovery action: Treat as if detected that the link has gone down.
+      This is a protocol error.
+
+   LLC message received: Unknown LLC type with high-order bits of opcode
+   equal to b'10'
+
+      What it indicates: This is an optional LLC message that the
+      receiver does not support.
+
+      Recovery action: Ignore (silently discard) the message.
+
+   LLC message received: Any unambiguously incorrect or out-of-sync LLC
+   message
+
+      What it indicates: Link is out of sync.
+
+      Recovery action: Treat as if detected that the link has gone down.
+      Note that an unsupported or unknown LLC opcode whose two
+      high-order bits are b'10' is not an error and must be silently
+      discarded.  Any other unknown or unsupported LLC opcode is an
+      error.
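+
+   The rule for unknown or unsupported LLC opcodes reduces to a check
+   of the two high-order bits.  The sketch below assumes a one-byte
+   opcode and is meant to be called only for opcodes the receiver
+   does not recognize.  The function name and return strings are
+   hypothetical.
+
```python
OPTIONAL_PREFIX = 0b10  # two high-order bits marking an optional message


def classify_unknown_opcode(opcode: int) -> str:
    """Classify an LLC opcode that the receiver does not support.

    Returns "discard" when the two high-order bits are b'10' (an
    optional message that must be silently discarded) and "link_down"
    for any other unknown opcode (treated as a failed link).
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode is assumed to fit in one byte")
    if (opcode >> 6) == OPTIONAL_PREFIX:
        return "discard"
    return "link_down"
```
+
+   Under this assumption, unknown opcodes x'80' through x'BF' are
+   discarded, and any other unknown opcode brings the link down.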
+
+C.8.  Failure to Add Second SMC-R Link to a Link Group
+
+   When there is any failure in setting up the second SMC-R link in an
+   SMC-R link group, including confirmation timer expiration, the SMC-R
+   link group is allowed to continue without available failover.
+   However, this situation is extremely undesirable, and the server must
+   endeavor to correct it as soon as it can.
+
+   The server peer in the SMC-R link group must set a timer to drive it
+   to retry setup of a failed additional SMC-R link.  The server will
+   immediately retry the SMC-R link setup when the first of the
+   following events occurs:
+
+   o  The retry timer expires.
+
+   o  A new RNIC becomes available to the server, on the same LAN as the
+      SMC-R link group.
+
+   o  An ADD LINK LLC request message is received from the client; this
+      indicates the availability of a new RNIC on the client side.
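+
+   The three retry triggers can be combined into a single predicate,
+   as in the sketch below.  It is illustrative only: the class and
+   parameter names are hypothetical, and the caller supplies the
+   timer deadline, since no timer value is given here.
+
```python
from dataclasses import dataclass


@dataclass
class SecondLinkRetry:
    """Server-side state for retrying setup of a failed second link."""
    deadline: float  # time at which the retry timer expires (seconds)

    def should_retry(self, now: float,
                     new_rnic_on_same_lan: bool = False,
                     add_link_from_client: bool = False) -> bool:
        """True as soon as the first of the three events has occurred."""
        timer_expired = now >= self.deadline
        return timer_expired or new_rnic_on_same_lan or add_link_from_client
```
+
+   A server would evaluate should_retry() whenever one of the events
+   is reported and start a new ADD LINK exchange as soon as it
+   returns True.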
+
+Authors' Addresses
+
+   Mike Fox
+   IBM
+   3039 Cornwallis Rd.
+   Research Triangle Park, NC  27709
+   United States
+
+   Email: mjfox@us.ibm.com
+
+
+   Constantinos (Gus) Kassimis
+   IBM
+   3039 Cornwallis Rd.
+   Research Triangle Park, NC  27709
+   United States
+
+   Email: kassimis@us.ibm.com
+
+
+   Jerry Stevens
+   IBM
+   3039 Cornwallis Rd.
+   Research Triangle Park, NC  27709
+   United States
+
+   Email: sjerry@us.ibm.com