Diffstat (limited to 'doc/rfc/rfc4755.txt')
-rw-r--r-- | doc/rfc/rfc4755.txt | 731 |
1 files changed, 731 insertions, 0 deletions
diff --git a/doc/rfc/rfc4755.txt b/doc/rfc/rfc4755.txt new file mode 100644 index 0000000..b2e1557 --- /dev/null +++ b/doc/rfc/rfc4755.txt @@ -0,0 +1,731 @@ + + + + + + +Network Working Group V. Kashyap +Request for Comments: 4755 IBM +Category: Standards Track December 2006 + + + IP over InfiniBand: Connected Mode + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (C) The IETF Trust (2006). + +Abstract + + This document specifies transmission of IPv4/IPv6 packets and address + resolution over the connected modes of InfiniBand. + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Kashyap Standards Track [Page 1] + +RFC 4755 Connected Mode IPoIB December 2006 + + +Table of Contents + + 1. Introduction ....................................................2 + 2. IPoIB-connected Mode ............................................3 + 2.1. Multicasting ...............................................3 + 2.2. Outline of Address Resolution ..............................4 + 2.3. Outline of Connection Setup ................................4 + 3. Address Resolution ..............................................4 + 3.1. Link-layer Address .........................................4 + 3.2. IB Connection Setup ........................................6 + 3.3. Simultaneous IB Connections ................................6 + 3.4. IPoIB-CM IB Connection Teardown ............................7 + 3.5. Service-ID .................................................7 + 4. Frame Format ....................................................8 + 5. Maximum Transmission Unit .......................................8 + 5.1. 
Per-Connection MTU .........................................9 + 6. Private-Data Format .............................................9 + 7. IPoIB-CM Considerations ........................................10 + 7.1. A Cautionary Note on IPoIB-RC .............................10 + 7.2. IPoIB-CM Per-Destination MTU ..............................10 + 8. Security Considerations ........................................11 + 9. IANA Considerations ............................................11 + 10. Acknowledgements ..............................................11 + 11. Normative References ..........................................11 + 12. Informative References ........................................11 + +1. Introduction + + The InfiniBand specification [IB_ARCH] can be found at + www.infinibandta.org. The document [RFC4392] provides a short + overview of InfiniBand architecture along with consideration for + specifying IP over InfiniBand networks. + + The InfiniBand Architecture (IBA) defines multiple modes of + transports. Of these the unreliable datagram (UD) transport method + best matches the needs of IP. IP over InfiniBand (IPoIB) over UD is + described in [RFC4391]. This document describes IP transmission over + the connected modes of IBA. + + IBA defines two connected modes: + + 1. Reliable Connected (RC) + 2. Unreliable Connected (UC) + + As is evident from the nomenclature, the two modes differ mainly in + providing reliability of data delivery across the connection. This + document applies equally to both the connected modes. IPoIB over + these two modes is referred to as IPoIB-CM (connected mode) in this + + + +Kashyap Standards Track [Page 2] + +RFC 4755 Connected Mode IPoIB December 2006 + + + document. For clarity, IPoIB over the unreliable datagram mode as + described in [RFC4391] is referred to as IPoIB-UD. + + IBA requires that all Host Channel Adapters (HCAs) support the + reliable and unreliable connected modes [IB_ARCH]. 
It is optional + for Target Channel Adapters (TCAs) to support the connected modes. + + The connected modes offer link MTUs of up to 2^31 octets in length. + Thus, the use of connected modes can offer significant benefits by + supporting reasonably large MTUs. The datagram modes of InfiniBand + Architecture (IBA) are limited to 4096 octets. + + Reliability is also enhanced if the underlying feature of "automatic + path migration" supported by the connected modes is utilized. + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [RFC2119]. + +2. IPoIB-connected Mode + + IPoIB over connected mode is an OPTIONAL extension to IPoIB-UD. + Every IPoIB implementation MUST support [RFC4391] and MAY support the + extensions described in this document. + + Therefore, IP encapsulation, default MTU, link-layer address format, + and the IPv6 stateless autoconfiguration mechanism apply to IPoIB-CM + exactly as described in [RFC4391]. + +2.1. Multicasting + + The connected modes of IBA define a non-broadcast, multiple-access + network. The connected modes of IBA do not support multicasting + though every node can communicate with every other node if desired. + + This requires that multicasting be emulated in some form by the + network. However, in the case of an InfiniBand network, instead of + an emulation, an unreliable datagram (UD) queue pair (QP) can be used + to support multicasting while the connected mode QP is used for + unicast traffic. Since every IPoIB implementation is required to + support the UD mode, every implementation supporting IPoIB-CM will be + able to utilize the pre-existing IPoIB-UD QP for all + broadcast/multicast communications. Multicast mapping, transmission, + and reception of multicast packets and multicast routing MUST use the + UD QP associated with the IPoIB interface. 
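The split described in Section 2.1 (multicast and broadcast on the UD QP, unicast on a connected-mode QP) can be sketched as below. This is a minimal illustration, not part of the RFC: the `Interface` class, the `select_qp` function, and the fallback to the UD QP when no connection exists yet are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Interface:
    """Hypothetical model of one IPoIB-CM interface."""
    ud_qp: int                                    # the single UD QP of the interface
    connected_qps: Dict[bytes, int] = field(default_factory=dict)  # peer address -> CM QP


def select_qp(iface: Interface, peer: bytes, is_multicast: bool) -> int:
    """Pick the QP used to send one packet.

    Multicast and broadcast always go out on the UD QP. Unicast uses
    the connected-mode QP for that peer when one has been set up, and
    otherwise falls back to the UD QP (an assumption of this sketch).
    """
    if is_multicast:
        return iface.ud_qp
    return iface.connected_qps.get(peer, iface.ud_qp)


iface = Interface(ud_qp=0x000100, connected_qps={b"peer-a": 0x000200})
print(hex(select_qp(iface, b"peer-a", is_multicast=False)))
print(hex(select_qp(iface, b"peer-a", is_multicast=True)))
```

Keeping the UD QP as the default means a peer that never advertises a connected mode is still reachable, which matches the requirement that every implementation supports IPoIB-UD.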
+ + + + + + +Kashyap Standards Track [Page 3] + +RFC 4755 Connected Mode IPoIB December 2006 + + +2.2. Outline of Address Resolution + + Every IPoIB-CM interface MUST have two sets of QPs associated with + it: + + 1) One unreliable datagram QP + 2) One or more connected mode QPs + + [RFC4391] describes the address resolution method to determine the + link address of the peer. This response is received on the UD QP + associated with the IPoIB interface. + +2.3. Outline of Connection Setup + + Once the link address of the remote node is known, an IB connection + must be set up between the nodes before any IP communication may + occur. + + To make a connection, the sender must know the service-ID to use in + the request to make a connection [IB_ARCH]. It must also supply the + "connection mode" queue pair to the remote node. The peer replies + with its queue pair. Each IB connection is peer to peer and uses one + connected mode QP at each end. + + Though the address resolution occurs at an individual IP address + level, the connection between the nodes is at the IB layer. + Therefore, every individual address resolution does not imply a new + connection between the peers. + +3. Address Resolution + + Address resolution queries are sent out on the "broadcast-GID" + (Broadcast-Group Identifier) over the UD QP associated with the IPoIB + interface [RFC4391]. A unicast reply is received on the UD QP. + +3.1. Link-layer Address + + IPoIB encapsulation [RFC4391] describes the link-layer address as + follows: + + <1 octet reserved>:QP: GID + + This document extends the link-layer address as follows: + + <Flags>:QPN:GID + + + + + + +Kashyap Standards Track [Page 4] + +RFC 4755 Connected Mode IPoIB December 2006 + + + Flags: + + This is a single-octet field. The bits indicate the connected + modes supported by the interface. + + Bit 0 specifies the support for the "reliable connected" (RC) + mode. Bit 1 indicates the support for the "unreliable connected" + (UC) mode. 
All other bits in the octet are reserved and MUST be + set to 0 on transmits and ignored on receives. The format of the + flags is as follows: + + +--+--+--+--+--+--+--+--+ + |RC|UC| 0| 0| 0| 0| 0| 0| + +--+--+--+--+--+--+--+--+ + + Both the RC and UC MAY be set at the same time if the interface + supports both the modes. Since the IPoIB-UD mode is always + supported, there are no flags to indicate IPoIB-UD support. + + If IPoIB-CM is not supported, i.e., if the implementation only + supports IPoIB-UD, then the implementation MUST ignore the <Flags> + on reception. It MUST set the <Flags> octet to all zeros on + transmission as specified in [RFC4391]. + + QPN: + + The queue-pair number (QPN) on which the unicast address + resolution replies will be received [RFC4391]. An IPoIB interface + has only one UD QP associated with it whether or not it supports + this extension. + + The QPN also serves another purpose. It is used to form the + Service-ID that is used to set up the IB connection. + + On receiving the multicast/broadcast address resolution request, the + receiver replies with its own link address, including the associated + UD QPN and the appropriate flags. + + The receiver's reply is unicast back to the sender after the receiver + has, as in the case of IPoIB-UD, resolved the GID to the Local + Identifier (LID), and determined other required parameters [RFC4391]. + Once the address resolution is completed, the underlying IB + connection on the supported connection modes can be set up. An + implementation is NOT REQUIRED to set up a connection merely because + the peer indicates the capability. The decision to make such a + connection is left to the implementation. + + + + + +Kashyap Standards Track [Page 5] + +RFC 4755 Connected Mode IPoIB December 2006 + + +3.2. IB Connection Setup + + Once the address resolution is complete, the IB connection can be set + up by either of the peers. 
To set up a connection, IB Management + Datagrams (MADs) are directed to the peer's communication manager + (CM). The connection request always contains a Service-ID for the + peer to associate the request with the appropriate service. If the + request is accepted, the peer returns the relevant connected mode QPN + in the response MAD. The format of the CM connection messages and + the IB connection setup process is described in [IB_ARCH]. The + overall handshake is of the form: + + REQ ----> + <---- REP [or REJ(reject)] + RTA ----> + [or REJ(reject)] + + The CM messages include, among other parameters, the Service-ID, + Local connection-mode QPN, and the payload size to use over the + connection. + + Note: The IB connection is set up using the Service-ID as defined in + Section 3.5 below. The node MUST keep a record of IB + connections it is participating in. The node MAY attempt + another connection to the remote peer using the same Service-ID + as used for an existing IB connection. Similarly, the receiver + of such a connection MAY drop the request with a suitable error + indication in the CM response. The decision to accept or + initiate multiple connections from or to an IPoIB interface is + left to the implementation. + + The node that initiated the connection is aware of the target node's + IP address as described above. The node receiving the IB connection + request, however, cannot determine the initiating node's link + address. To enable this determination, every CM message exchanged in + setting up the IB connection MUST include the sender's IPoIB-UD QPN + in the "private data" [IB_ARCH] field. The IPoIB-UD QPN MUST be + included in all "REJ" [IB_ARCH] messages too. + +3.3. 
Simultaneous IB Connections + + To ensure that two IB connections are not set up between the peers + due to REQ crossing, the following rules MUST be followed: + + The receiver forms the remote node's link-layer address using the + UD QPN received in the "private data" field of the "REQ" message + and the GID of the sender included in the "REQ" message. The + link-layer address is used to determine if there is already an + + + +Kashyap Standards Track [Page 6] + +RFC 4755 Connected Mode IPoIB December 2006 + + + outstanding connection request "REQ" sent by the local interface + to the given received link-layer address. If such an outstanding + request is determined, then the two link-layer addresses (local + and remote) are numerically compared. If the local link-layer + address is numerically smaller, then the connection is accepted, + otherwise rejected. The error code in "REJ" MAD is set to + "Consumer Reject" [IB_ARCH]. + + Note: The link-layer addresses formed for comparison zero out the + connection mode flags specified in Section 3.1. The + comparison is performed from the most significant octet to + the least significant octet of the link-layer address. + + The above holds even if the receiver supports multiple IB + connections from the same peer. This is to ensure that only one + more connection is set up when the "REQ" messages cross. + +3.4. IPoIB-CM IB Connection Teardown + + IB connections created through IPoIB-CM are considered part of an + IPoIB interface. As such, they SHOULD be torn down when the IPoIB + interfaces they are associated with are torn down. + + Furthermore, the IB connection between two peers MAY be torn down by + either peer whenever the address resolution entry expires. An + implementation is free to implement alternative policies for tearing + down of IB connections between peers. + +3.5. Service-ID + + The InfiniBand specification defines a block of Service-IDs for IETF + use. 
The InfiniBand specification has left the definition and + management of this block to the IETF [IB_ARCH]. The 64-bit block is + as follows: + + +--------+--------+--------+--------+-------+--------+--------+------+ + |00000001|<-------------------IETF use------------------------------>| + +--------+--------+--------+--------+-------+--------+--------+------+ + + + + + + + + + + + + + +Kashyap Standards Track [Page 7] + +RFC 4755 Connected Mode IPoIB December 2006 + + + The Service-IDs used by IPoIB will be in the following format: + + +--------+--------+--------+--------+-------+-------+--------+-------+ + |00000001| Type | Reserved | QPN | + +--------+--------+--------+--------+-------+-------+--------+-------+ + + The "Type" field MUST be set to 0. + + The "Reserved" field MUST be set to zeros. + + The QPN MUST be the UD QP exchanged during address resolution. + +4. Frame Format + + All IP datagrams transported over InfiniBand are prefixed by a + 4-octet encapsulation header as described in [RFC4391]. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | | + | Type | Reserved | + | | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + The type field SHALL indicate the encapsulated protocol as per + the following table. + + +----------+-------------+ + | Type | Protocol | + |------------------------| + | 0x800 | IPv4 | + |------------------------| + | 0x86DD | IPv6 | + +------------------------+ + + These values are taken from the "ETHER TYPE" numbers assigned by + Internet Assigned Numbers Authority (IANA). Other network protocols, + identified by different values of "ETHER TYPE", may use the + encapsulation format defined herein, but such use is outside of the + scope of this document. + +5. 
Maximum Transmission Unit + + The IB connection setup might be used for both IPv4 and IPv6 or it + could be used for only one of them while a different connection is + used for the other. The link MTU MUST be able to support the minimum + MTU required by the protocols. + + + +Kashyap Standards Track [Page 8] + +RFC 4755 Connected Mode IPoIB December 2006 + + + The default MTU of the IPoIB-CM interface is 2044 octets, i.e., + 2048-octet IPoIB-link MTU minus the 4-octet encapsulation header. + + However, connected modes of InfiniBand allow message sizes up to 2^31 + octets. Therefore, IPoIB-CM can use a much larger MTU for unicast + communication between any two endpoints. The maximum and/or optimal + payload that can be received or sent over an InfiniBand connection is + dependent on the implementation, IB Channel Adapter, and the + resources configured. + + An implementation MAY utilize the following mechanism to exchange the + optimal message size across the IB connection. + +5.1. Per-Connection MTU + + Every IB connection setup message includes a "private data" field + [IB_ARCH]. The "private data" field in the connection setup message + (CM REQ) MUST include the "Receive MTU". This indicates the maximum + packet size the requester can accept. The requester MUST be able to + accept smaller MTU sizes as well. + + It is up to the implementation to utilize this mechanism for setting + the per-IB connection MTU. To calculate the resultant IPoIB MTU over + the connection the smaller of the two IB "Receive MTU" values is used + by both the peers. The IPoIB interface must also account for the 4- + octet encapsulation header and so the IPoIB MTU over the connection + will be further reduced by that amount. + +6. Private-Data Format + + The "private data" field in every CM message for connection + establishment must include the following values: + + 1. UD QPN of the sender + 2. 
Receive MTU supported by the sender + + The format of the "private data" field MUST be as follows: + + 0 7 15 23 31 + +--------+--------+--------+--------+ + |Reserved| UD QPN | + +--------+--------+--------+--------+ + | Receive MTU | + +--------+--------+--------+--------+ + + The Reserved value MUST be set to zero on transmit and ignored on + receive. + + + + +Kashyap Standards Track [Page 9] + +RFC 4755 Connected Mode IPoIB December 2006 + + +7. IPoIB-CM Considerations + + Every IPoIB interface supports IPoIB-UD. It may additionally support + one or both of the IPoIB-CM modes. Therefore, there can be multiple + methods of communicating between any two peers. This implies that an + interface MAY transmit/receive a packet over any of the RC, UC, or UD + modes depending on the modes supported between it and the peer. It + further follows that every IPoIB implementation compliant with this + document MUST accept all IP unicast transmissions over any of the + IPoIB modes it supports. Multicast and broadcast packets by their + nature will always be transmitted and received over the IPoIB-UD QP. + Additionally, all address resolution responses (ARP or Neighbor + Discovery) MUST always be encapsulated in a UD mode packet. + +7.1. A Cautionary Note on IPoIB-RC + + The RC mode of InfiniBand guarantees in-order delivery of packets. + Every message transmitted over the RC connection is broken into + physical MTU-sized packets by the RC connection. If any packet is + lost, it is retransmitted until the complete message is exchanged. + Therefore, there is a possibility of an upper transport layer + experiencing a timeout, while the RC layer is still in the process of + transferring the complete message. TCP will view the timeout as an + indicator of congestion and enter slow-start thereby affecting + throughput drastically [RFC2581]. Other upper-layer protocols might + insert retransmissions into the fabric, adding to the already + existing congestion. 
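The timing concern in the preceding paragraph can be made concrete with a little arithmetic. The sketch below is an illustration, not text from the RFC. It assumes the InfiniBand convention that an RC acknowledgement timeout is encoded as 4.096 microseconds times 2 to the power of a small exponent, and that a message is abandoned after the configured number of retries. The function names and the 1-second upper-layer timer are hypothetical values picked for the comparison.

```python
def rc_ack_timeout_seconds(exponent: int) -> float:
    """Per-attempt RC timeout, assuming the 4.096 us * 2**exponent encoding."""
    return 4.096e-6 * (2 ** exponent)


def rc_worst_case_seconds(exponent: int, retry_count: int) -> float:
    """Rough upper bound: the first attempt plus every retry times out."""
    return rc_ack_timeout_seconds(exponent) * (retry_count + 1)


def rc_may_outlast(upper_timer_seconds: float, exponent: int, retry_count: int) -> bool:
    """True if RC could still be retrying when the upper-layer timer fires."""
    return rc_worst_case_seconds(exponent, retry_count) > upper_timer_seconds


# Hypothetical settings compared against a 1-second upper-layer timer.
for exponent in (14, 18):
    total = rc_worst_case_seconds(exponent, retry_count=7)
    print(exponent, round(total, 3), rc_may_outlast(1.0, exponent, retry_count=7))
```

With these assumed numbers, exponent 14 stays near half a second in total, while exponent 18 runs for several seconds and would outlast the upper-layer timer.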
+ + The applicability of Infiniband reliability is on a fabric with short + latencies (not wide area). Therefore, the RC timer values should be + short compared with the starting minimum time values used by the + upper end-to-end transports. In addition, because the RC mode does + not have measurement-based reliable transmission, its use over + fabrics with long latency or very dynamic latency may be a concern + for congestion-aware traffic traversing those fabrics. + +7.2. IPoIB-CM Per-Destination MTU + + As described above, interfaces on the same subnet may support + different link MTUs based on the negotiated value or due to the link + type (UD or connected mode). Therefore, an implementation might + choose to define a large IP MTU, which is reduced based on the MTU to + the destination. The relevant MTU may be stored in a suitable per- + destination object, such as a route cache or a neighbor cache. The + per-destination MTU is known to the IPoIB-CM interface as described + in Section 5. + + + + + +Kashyap Standards Track [Page 10] + +RFC 4755 Connected Mode IPoIB December 2006 + + + Implementations might choose not to support differing MTU values and + always support an MTU equal to the IPoIB-UD MTU determined from the + broadcast GID. + +8. Security Considerations + + An impostor may return a false set of flags to an IPOIB interface. + This may cause unnecessary attempts and some delay/disruption in + IPoIB communication. The same is the case if wrong/spurious QPN + values are provided during address resolution broadcast/multicast. + +9. IANA Considerations + + Future uses of the reserved bits and octets in the link-layer address + (Section 3.1), Service-ID (Section 3.5), and "Private-Data Format" + (Section 6) MUST be published as RFCs. This document requires that + the reserved bits be set to zero on sends. + +10. Acknowledgements + + The author thanks the IPoIB Working Group for the various comments + and suggestions. 
A special thanks to Bernie King-Smith and Dror + Goldenberg for the detailed review and suggestions. + +11. Normative References + + [IB_ARCH] InfiniBand Architecture Specification, version 1.2 + www.infinibandta.org + + [RFC4392] Kashyap, V., "IP over InfiniBand (IPoIB) Architecture", + RFC 4392, April 2006. + + [RFC4391] Chu, J. and V. Kashyap, "Transmission of IP over + InfiniBand (IPoIB)", RFC 4391, April 2006. + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, March 1997. + +12. Informative References + + [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion + Control ", RFC 2581, April 1999. + + + + + + + + + +Kashyap Standards Track [Page 11] + +RFC 4755 Connected Mode IPoIB December 2006 + + +Author's Address + + Vivek Kashyap + 15350, SW Koll Parkway + Beaverton + OR 97006 + + Phone: +1 503 578 3422 + EMail: vivk@us.ibm.com + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Kashyap Standards Track [Page 12] + +RFC 4755 Connected Mode IPoIB December 2006 + + +Full Copyright Statement + + Copyright (C) The IETF Trust (2006). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST, + AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT + THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY + IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR + PURPOSE. 
+ +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the procedures with respect to rights in RFC documents can be + found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at + ietf-ipr@ietf.org. + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. + + + + + + +Kashyap Standards Track [Page 13] + |
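As an illustrative companion to the file added above, the following sketch models the bit layouts of Sections 3.1, 3.5, and 6, the MTU rule of Section 5.1, and the REQ-crossing rule of Section 3.3. It is a reading of the diagrams under stated assumptions, not a reference implementation. All function names are hypothetical. The leftmost bit or octet of each diagram is treated as the most significant, so RC is 0x80 and UC is 0x40. The QPN is taken to be 24 bits wide, which makes the Service-ID "Reserved" field 24 bits. Deployed implementations may encode the Service-ID prefix differently.

```python
import struct

RC_FLAG = 0x80  # bit 0 (leftmost) of the Flags octet, per the Section 3.1 diagram
UC_FLAG = 0x40  # bit 1


def link_layer_address(flags: int, ud_qpn: int, gid: bytes) -> bytes:
    """Build <Flags>:QPN:GID as 1 + 3 + 16 octets."""
    if len(gid) != 16 or not 0 <= ud_qpn < (1 << 24) or not 0 <= flags < 256:
        raise ValueError("expected 8-bit flags, 24-bit QPN, 16-octet GID")
    return bytes([flags]) + ud_qpn.to_bytes(3, "big") + gid


def common_modes(local_flags: int, peer_flags: int) -> int:
    """Connected modes supported by both ends; reserved bits are ignored."""
    return local_flags & peer_flags & (RC_FLAG | UC_FLAG)


def service_id(ud_qpn: int) -> int:
    """64-bit Service-ID: 0x01 prefix, Type 0, Reserved 0, then the UD QPN."""
    if not 0 <= ud_qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    return (0x01 << 56) | ud_qpn


def pack_private_data(ud_qpn: int, receive_mtu: int) -> bytes:
    """Reserved octet (zero), 24-bit UD QPN, then the 32-bit Receive MTU."""
    if not 0 <= ud_qpn < (1 << 24):
        raise ValueError("QPN must fit in 24 bits")
    return struct.pack("!II", ud_qpn, receive_mtu)


def unpack_private_data(data: bytes):
    """Return (ud_qpn, receive_mtu); the Reserved octet is ignored on receive."""
    word, receive_mtu = struct.unpack("!II", data[:8])
    return word & 0xFFFFFF, receive_mtu


def ipoib_mtu(local_receive_mtu: int, remote_receive_mtu: int) -> int:
    """Smaller of the two Receive MTUs minus the 4-octet encapsulation header."""
    return min(local_receive_mtu, remote_receive_mtu) - 4


def accept_crossed_req(local_lla: bytes, remote_lla: bytes) -> bool:
    """REQ-crossing rule: accept only if the local address is numerically smaller.

    The Flags octet is zeroed on both sides first. Comparing equal-length
    byte strings runs from the most significant octet to the least.
    """
    return (b"\x00" + local_lla[1:]) < (b"\x00" + remote_lla[1:])


print(hex(service_id(0x123456)))
print(pack_private_data(0x123456, 65536).hex())
print(ipoib_mtu(65536, 32768))
```

The byte-string comparison works because both addresses are 20 octets long, so lexicographic order and numeric order coincide once the flags are cleared.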