1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
|
Network Working Group J. Chu
Request for Comments: 4391 Sun Microsystems
Category: Standards Track V. Kashyap
IBM
April 2006
Transmission of IP over InfiniBand (IPoIB)
Status of This Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2006).
Abstract
This document specifies a method for encapsulating and transmitting
IPv4/IPv6 and Address Resolution Protocol (ARP) packets over
InfiniBand (IB). It describes the link-layer address to be used when
resolving the IP addresses in IP over InfiniBand (IPoIB) subnets.
The document also describes the mapping from IP multicast addresses
to InfiniBand multicast addresses. In addition, this document
defines the setup and configuration of IPoIB links.
Table of Contents
1. Introduction ....................................................2
2. IP over UD Mode .................................................2
3. InfiniBand Datalink .............................................3
4. Multicast Mapping ...............................................3
4.1. Broadcast-GID Parameters ...................................5
5. Setting Up an IPoIB Link ........................................6
6. Frame Format ....................................................6
7. Maximum Transmission Unit .......................................8
8. IPv6 Stateless Autoconfiguration ................................8
8.1. IPv6 Link-Local Address ....................................9
9. Address Mapping - Unicast .......................................9
9.1. Link Information ...........................................9
9.1.1. Link-Layer Address/Hardware Address ................11
9.1.2. Auxiliary Link Information .........................12
Chu & Kashyap Standards Track [Page 1]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
9.2. Address Resolution in IPv4 Subnets ........................13
9.3. Address Resolution in IPv6 Subnets ........................14
9.4. Cautionary Note on QPN Caching ............................14
10. Sending and Receiving IP Multicast Packets ....................14
11. IP Multicast Routing ..........................................16
12. New Types of Vulnerability in IB Multicast ....................17
13. Security Considerations .......................................17
14. IANA Considerations ...........................................18
15. Acknowledgements ..............................................18
16. References ....................................................18
16.1. Normative References .....................................18
16.2. Informative References ...................................19
1. Introduction
The InfiniBand specification [IBTA] can be found at
http://www.infinibandta.org. The document [RFC4392] provides a short
overview of InfiniBand architecture (IBA) along with considerations
for specifying IP over InfiniBand networks.
IBA defines multiple modes of transport over which IP may be
implemented. The Unreliable Datagram (UD) transport mode best
matches the needs of IP and the need for universality as described in
[RFC4392].
This document specifies IPoIB over IB's UD mode. The implementation
of IP subnets over IB's other transport mechanisms is out of scope of
this document.
This document describes the necessary steps required in order to lay
out an IP network on top of an IB network. It describes all the
elements of an IPoIB link, how to configure its associated
attributes, and how to set up basic broadcast and multicast services
for it.
It further describes IP address resolution and the encapsulation of
IP and Address Resolution Protocol (ARP) packets in InfiniBand frame.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. IP over UD Mode
The unreliable datagram mode of communication is supported by all IB
elements be they IB routers, Host Channel Adapters (HCAs), or Target
Channel Adapters (TCAs). In addition to being the only universal
transmission method, it supports multicasting, partitioning, and a
Chu & Kashyap Standards Track [Page 2]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
32-bit Cyclic Redundancy Check (CRC) [IBTA]. Though multicasting
support is optional in IB fabrics, IPoIB architecture requires the
participating components to support it.
All IPoIB implementations MUST support IP over the UD transport mode
of IBA.
3. InfiniBand Datalink
An IB subnet is formed by a network of IB nodes interconnected either
directly or via IB switches. IB subnets may be connected using IB
routers to form a fabric made of multiple IB subnets. Nodes residing
in different IB subnets can communicate directly with one another
through IB routers at the IB network layer. Multiple IP subnets may
be overlaid over this IB network.
An IP subnet is configured over a communication facility or medium
over which nodes can communicate at the "link" layer [IPV6]. For
example, an ethernet segment is a link formed by interconnected
switches/hubs/bridges. The segment is therefore defined by the
physical topology of the network. This is not the case with IPoIB.
IPoIB subnets are built over an abstract "link". The link is defined
by its members and common characteristics such as the P_Key, link
MTU, and the Q_Key.
Any two ports using UD communication mode in an IB fabric can
communicate only if they are in the same partition (i.e., have the
same P_Key and the same Q_Key) [RFC4392]. The link MTU provides a
limit to the size of the payload that may be used. The packet
transmission and routing within the IB fabric are also affected by
additional parameters such as the traffic class (TClass), hop limit
(HopLimit), service level (SL), and the flow label (FlowLabel)
[RFC4392]. The determination and use of these values for IPoIB
communication are described in the following sections.
4. Multicast Mapping
IB identifies multicast groups by the Multicast Global Identifiers
(MGIDs), which follow the same rules as IPv6 multicast addresses.
Hence the MGIDs follow the same rules regarding the transient
addresses and scope bits albeit in the context of the IB fabric. The
resultant address therefore resembles IPv6 multicast addresses. The
documents [IBTA, RFC4392] give a detailed description of IB
multicast.
The IPoIB multicast mapping is depicted in figure 1. The same
mapping function is used for both IPv4 and IPv6 except for the IPoIB
signature field.
Chu & Kashyap Standards Track [Page 3]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
Unless explicitly stated, all addresses and fields in the protocol
headers in this document are stored in the network byte order.
| 8 | 4 | 4 | 16 bits | 16 bits | 80 bits |
+------ -+----+----+-----------------+---------+-------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >| group ID |
+--------+----+----+-----------------+---------+-------------------+
Figure 1
Since an MGID allocated for transporting IP multicast datagrams is
considered only a transient link-layer multicast address [RFC4392],
all IB MGIDs allocated for IPoIB purpose MUST set T-flag to 1 [IBTA].
A special signature is embedded to identify the MGID for IPoIB use
only. For IPv4 over IB, the signature MUST be "0x401B". For IPv6
over IB, the signature MUST be "0x601B".
The IP multicast address is used together with a given IPoIB link
P_Key to form the MGID of the IB multicast group. For IPv6 the lower
80-bit of the group ID is used directly in the lower 80-bit of the
MGID. For IPv4, the group ID is only 28-bit long, and is placed
directly in the lower 28 bits of the MGID. The rest of the group ID
bits in the MGID are filled with 0.
E.g., on an IPoIB link that is fully contained within a single IB
subnet with a P_Key of 0x8000, the MGIDs for the all-router multicast
group with group ID 2 [AARCH, IGMP3] are:
FF12:401B:8000::2, for IPv4 in compressed format, and
FF12:601B:8000::2, for IPv6 in compressed format.
A special case exists for the IPv4 limited broadcast address
"255.255.255.255" [HOSTS]. The address SHALL be mapped to the
"broadcast-GID", which is defined as follows:
| 8 | 4 | 4 | 16 bits | 16 bits | 48 bits | 32 bits |
+--------+----+----+----------------+---------+----------+---------+
|11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
+--------+----+----+----------------+---------+----------+---------+
Figure 2
All MGIDs used in the IPoIB subnet MUST use the same scop bits as in
the corresponding broadcast-GID.
Chu & Kashyap Standards Track [Page 4]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
4.1. Broadcast-GID Parameters
The broadcast-GID is set up with the following attributes:
1. P_Key
A "Full Membership" P_Key (high-order bit is set to 1) MUST be
used so that all members may communicate with one another.
2. Q_Key
It is RECOMMENDED that a controlled Q_Key be used with the
high-order bit set. This is to prevent non-privileged
software from fabricating and sending out bogus IP datagrams.
3. IB MTU
The value assigned to the broadcast-GID must not be greater
than any physical link MTU spanned by the IPoIB subnet.
The following attributes are required in multicast transmissions and
also in unicast transmissions if an IPoIB link covers more than a
single IB subnet.
4. Other parameters
The selection of TClass, FlowLabel, and HopLimit values is
implementation dependent. But it must take into account the
topology of IB subnets comprising the IPoIB link in order to
allow successful communication between any two nodes in the
same IPoIB link.
An SL also needs to be assigned to the broadcast-GID. This SL
is used in all multicast communication in the subnet.
The broadcast-GID's scope bits need to be set based on whether
the IPoIB link is confined within an IB subnet or the IPoIB
link spans multiple IB subnets. A default of local-subnet
scope (i.e., 0x2) is RECOMMENDED. A node might determine the
scope bits to use by interactively searching for a broadcast-
GID of ever greater scope by first starting with the local-
scope. Or, an implementation might include the scope bits as
a configuration parameter.
Chu & Kashyap Standards Track [Page 5]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
5. Setting Up an IPoIB Link
The broadcast-GID, as defined in the previous section, MUST be set up
for an IPoIB subnet to be formed. Every IPoIB interface MUST
"FullMember" join the IB multicast group defined by the broadcast-
GID. This multicast group will henceforth be referred to as the
broadcast group. The join operation returns the MTU, the Q_Key, and
other parameters associated with the broadcast group. The node then
associates the parameters received as a result of the join operation
with its IPoIB interface. The broadcast group also serves to provide
a link-layer broadcast service for protocols like ARP, net-directed,
subnet-directed, and all-subnets-directed broadcasts in IPv4 over IB
networks.
The join operation is successful only if the Subnet Manager (SM)
determines that the joining node can support the MTU registered with
the broadcast group [RFC4392] ensuring support for a common link MTU.
The SM also ensures that all the nodes joining the broadcast-GID have
paths to one another and can therefore send and receive unicast
packets. It further ensures that all the nodes do indeed form a
multicast tree that allows packets sent from any member to be
replicated to every other member. Thus, the IPoIB link is formed by
the IPoIB nodes joining the broadcast group. There is no physical
demarcation of the IPoIB link other than that determined by the
broadcast group membership.
The P_Key is a configuration parameter that must be known before the
broadcast-GID can be formed. For a node to join a partition, one of
its ports must be assigned the relevant P_Key by the SM [RFC4392].
The method of creation of the broadcast group and the
assignment/choice of its parameters are up to the implementation
and/or the administrator of the IPoIB subnet. The broadcast group
may be created by the first IPoIB node to be initialized, or it can
be created administratively before the IPoIB subnet is set up. It is
RECOMMENDED that the creation and deletion of the broadcast group be
under administrative control.
InfiniBand multicast management, which includes the creation,
joining, and leaving of IB multicast groups by IB nodes, is described
in [RFC4392].
6. Frame Format
All IP and ARP datagrams transported over InfiniBand are prefixed by
a 4-octet encapsulation header as illustrated below.
Chu & Kashyap Standards Track [Page 6]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | |
| Type | Reserved |
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3
The "Reserved" field MUST be set to zero on send and ignored on
receive unless specified differently in a future document.
The "Type" field SHALL indicate the encapsulated protocol as per the
following table.
+----------+-------------+
| Type | Protocol |
|------------------------|
| 0x800 | IPv4 |
|------------------------|
| 0x806 | ARP |
|------------------------|
| 0x8035 | RARP |
|------------------------|
| 0x86DD | IPv6 |
+------------------------+
Table 1
These values are taken from the "ETHER TYPE" numbers assigned by
Internet Assigned Numbers Authority (IANA) [IANA]. Other network
protocols, identified by different values of "ETHER TYPE", may use
the encapsulation format defined herein, but such use is outside of
the scope of this document.
|<------ IB Frame headers -------->|<- Payload ->|<- IB trailers ->|
+-------+------+---------+---------+-------------+---------+-------+
|Local | |Base |Datagram | 4-octet | | |
|Routing| GRH* |Transport|Extended | header |Invariant|Variant|
|Header |Header|Header |Transport| + | CRC | CRC |
| | | |Header | IP/ARP | | |
+-------+------+---------+---------+-------------+---------+-------+
Figure 4
Figure 4 depicts the IB frame encapsulating an IP/ARP datagram. The
InfiniBand specification requires the use of Global Routing Header
Chu & Kashyap Standards Track [Page 7]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
(GRH) [RFC4392] when multicasting or when an InfiniBand packet
traverses from one IB subnet to another through an IB router. Its
use is optional when used for unicast transmission between nodes
within an IB subnet. The IPoIB implementation MUST be able to handle
packets received with or without the use of GRH.
7. Maximum Transmission Unit
IB MTU: The IB components, that is, IB links, switches, Channel
Adapters (CAs), and IB routers, may support maximum payloads of
256, 512, 1024, 2048, or 4096 octets. The maximum IB payload
supported by the IB components in any IB path is the IB MTU for
the path.
IPoIB-Link MTU: The IPoIB-link MTU is the MTU value associated with
the broadcast group. The IPoIB-link MTU can be set to any value
up to the smallest IB MTU supported by the IB components
comprising the IPoIB link.
In order to reduce problems with fragmentation and path-MTU
discovery, this document requires that all IPoIB implementations
support an MTU of 2044 octets, that is, a 2048-octet IPoIB-link MTU
minus the 4-octet encapsulation overhead. Larger and smaller MTUs
MAY be supported subject to other existing MTU requirements [IPV6],
but the default configuration must support an MTU of 2044 octets.
8. IPv6 Stateless Autoconfiguration
IB architecture associates an EUI-64 identifier termed the Globally
Unique Identifier (GUID) [RFC4392, IBTA] with each port. The Local
Identifier (LID) is unique within an IB subnet only.
The interface identifier may be chosen from the following:
1) The EUI-64-compliant GUID assigned by the manufacturer.
2) If the IPoIB subnet is fully contained within an IB subnet, any
of the unique 16-bit LIDs of the port associated with the IPoIB
interface.
The LID values of a port may change after a reboot/power-cycle
of the IB node. Therefore, if a persistent value is desired,
it would be prudent not to use the LID to form the interface
identifier.
On the other hand, the LID provides an identifier that can be
used to create a more anonymous IPv6 address since the LID is
not globally unique and is subject to change over time.
Chu & Kashyap Standards Track [Page 8]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
It is RECOMMENDED that the link-local address be constructed from the
port's EUI-64 identifier as given below.
[AARCH] requires that the interface identifier be created in the
"Modified EUI-64" format when derived from an EUI-64 identifier.
[IBTA] is unclear if the GUID should use IEEE EUI-64 format or the
"Modified EUI-64" format. Therefore, when creating an interface
identifier from the GUID, an implementation MUST do the following:
=> Determine if the GUID is a modified EUI-64 identifier ("u" bit
is toggled) as defined by [AARCH]
=> If the GUID is a modified EUI-64 identifier, then the "u" bit
MUST NOT be toggled when creating the interface identifier
=> If the GUID is an unmodified EUI-64 identifier, then the "u"
bit MUST be toggled in compliance with [AARCH]
8.1. IPv6 Link-Local Address
The IPv6 link-local address for an IPoIB interface is formed as
described in [AARCH] using the interface identifier as described in
the previous section.
9. Address Mapping - Unicast
Address resolution in IPv4 subnets is accomplished through Address
Resolution Protocol (ARP) [ARP]. It is accomplished in IPv6 subnets
using the Neighbor Discovery protocol [DISC].
9.1. Link Information
An InfiniBand packet over the UD mode includes multiple headers such
as the LRH (local route header), GRH (global route header), BTH (base
transport header), DETH (datagram extended transport header) as
depicted in figure 4 and specified in the InfiniBand architecture
[IBTA]. All these headers comprise the link-layer in an IPoIB link.
The parameters needed in these IBA headers constitute the link-layer
information that needs to be determined before an IP packet may be
transmitted across the IPoIB link.
Chu & Kashyap Standards Track [Page 9]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
The parameters that need to be determined are as follows:
a) LID
The LID is always needed. A packet always includes the LRH
that is targeted at the remote node's LID, or an IB router's
LID to get to the remote node in another IB subnet.
b) Global Identifier (GID)
The GID is not needed when exchanging information within an IB
subnet though it may be included in any packet. It is an
absolute necessity when transmitting across the IB subnet since
the IB routers use the GID to correctly forward the packets.
The source and destination GIDs are fields included in the GRH.
The GID, if formed using the GUID, can be used to unambiguously
identify an endpoint.
c) Queue Pair Number (QPN)
Every unicast UD communication is always directed to a
particular queue pair (QP) at the peer.
d) Q_Key
A Q_Key is associated with each Unreliable Datagram QPN. The
received packets must contain a Q_Key that matches the QP's
Q_Key to be accepted.
e) P_Key
A successful communication between two IB nodes using UD mode
can occur only if the two nodes have compatible P_Keys. This
is referred to as being in the same partition [IBTA].
f) SL
Every IBA packet contains an SL value. A path in IBA is
defined by the three-tuple (source LID, destination LID, SL).
The SL in turns is mapped to a virtual lane (VL) at every CA,
switch that sends/forwards the packet [RFC4392]. Multiple SLs
may be used between two endpoints to provide for load
balancing. SLs may be used for providing a Quality of Service
(QoS) infrastructure, or may be used to avoid deadlocks in the
IBA fabric.
Chu & Kashyap Standards Track [Page 10]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
Another auxiliary piece of information, not included in the IBA
headers, is the following:
g) Path rate
IBA defines multiple link speeds. A higher-speed transmitter
can swamp switches and the CAs. To avoid such congestion,
every source transmitting at greater than 1x speeds is required
to determine the "path rate" before the data may be transmitted
[IBTA].
9.1.1. Link-Layer Address/Hardware Address
Though the list of information required for a successful transmittal
of an IPoIB packet is large, not all the information need be
determined during the IP address resolution process.
The 20-octet IPoIB link-layer address used in the source/target
link-layer address option in IPv6 and the "hardware address" in
IPv4/ARP has the same format.
The format is as described below:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Reserved | Queue Pair Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+ +
| |
+ GID +
| |
+ +
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 5
a) Reserved Flags
These 8 bits are reserved for future use. These bits MUST be
set to zero on send and ignored on receive unless specified
differently in a future document.
Chu & Kashyap Standards Track [Page 11]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
b) QPN
Every unicast communication in IB architecture is directed to a
specific QP [IBTA]. This QP number is included in the link
description. All IP communication to the relevant IPoIB
interface MUST be directed to this QPN. In the case of IPv4
subnets, the Address Resolution Protocol (ARP) reply packets
are also directed to the same QPN.
The choice of the QPN for IP/ARP communication is up to the
implementation.
c) GID
This is one of the GIDs of the port associated with the IPoIB
interface [IBTA]. IB associates multiple GIDs with a port. It
is RECOMMENDED that the GID formed by the combination of the IB
subnet prefix and the port's "Port GUID" [IBTA] be included in
the link-layer/hardware address.
9.1.2. Auxiliary Link Information
The rest of the parameters are determined as follows:
a) LID
The method of determining the peer's LID is not defined in this
document. It is up to the implementation to use any of the
IBA-approved methods to determine the destination LID. One
such method is to use the GID determined during the address
resolution, to retrieve the associated LID from the IB routing
infrastructure or the Subnet Administrator (SA).
It is the responsibility of the administrator to ensure that
the IB subnet(s) have unicast connectivity between the IPoIB
nodes. The GID exchanged between two endpoints in a multicast
message (ARP/ND) does not guarantee the existence of a unicast
path between the two.
There may be multiple LIDs, and hence paths, between the
endpoints. The criteria for selection of the LIDs are beyond
the scope of this document.
Chu & Kashyap Standards Track [Page 12]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
b) Q_Key
The Q_Key received on joining the broadcast group MUST be used
for all IPoIB communication over the particular IPoIB link.
c) P_Key
The P_Key to be used in the IP subnet is not discovered but is
a configuration parameter.
d) SL
The method of determining the SL is not defined in this
document. The SL is determined by any of the IBA-approved
methods.
e) Path rate
The implementation must leverage IB methods to determine the
path rate as required.
9.2. Address Resolution in IPv4 Subnets
The ARP packet header is as defined in [ARP]. The hardware type is
set to 32 (decimal) as specified by IANA [IANA]. The rest of the
fields are used as per [ARP].
16 bits: hardware type
16 bits: protocol
8 bits: length of hardware address
8 bits: length of protocol address
16 bits: ARP operation
The remaining fields in the packet hold the sender/target hardware
and protocol addresses.
[ sender hardware address ]
[ sender protocol address ]
[ target hardware address ]
[ target protocol address ]
The hardware address included in the ARP packet will be as specified
in section 9.1.1 and depicted in figure 5.
The length of the hardware address used in ARP packet header
therefore is 20.
Chu & Kashyap Standards Track [Page 13]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
9.3. Address Resolution in IPv6 Subnets
The Source/Target Link-layer address option is used in Router
Solicit, Router advertisements, Redirect, Neighbor Solicitation, and
Neighbor Advertisement messages when such messages are transmitted on
InfiniBand networks.
The source/target address option is specified as follows:
Type:
Source Link-layer address 1
Target Link-layer address 2
Length: 3
Link-layer address:
The link-layer address is as specified in section 9.1.1 and
depicted in figure 5.
[DISC] specifies the length of source/target option in
number of 8-octets as indicated by a length of '3' above.
Since the IPoIB link-layer address is only 20 octets long,
two octets of zero MUST be prepended to fill the total
option length of 24 octets.
9.4. Cautionary Note on QPN Caching
The link-layer address for IPoIB includes the QPN, which might not be
constant across reboots or even across network interface resets.
Cached QPN entries, such as in static ARP entries or in Reverse
Address Resolution Protocol (RARP) servers, will only work if the
implementation(s) using these options ensure that the QPN associated
with an interface is invariant across reboots/network resets.
It is RECOMMENDED that implementations revalidate ARP caches
periodically due to the aforementioned QPN-induced volatility of
IPoIB link-layer addresses.
10. Sending and Receiving IP Multicast Packets
Multicast in InfiniBand differs in a number of ways from multicast in
ethernet. This adds some complexity to an IPoIB implementation when
supporting IP multicast over IB.
A) An IB multicast group must be explicitly created through the SA
before it can be used.
Chu & Kashyap Standards Track [Page 14]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
This implies that in order to send a packet destined for an IP
multicast address, the IPoIB implementation must check with the
SA on the outbound link first for a "MCMemberRecord" that
matches the MGID. If one does exist, the Multicast Local
Identifier (MLID) associated with the multicast group is used
as the Destination Local Identifier (DLID) for the packet.
Otherwise, it implies no member exists on the local link. If
the scope of the IP multicast group is beyond link-local, the
packet must be sent to the on-link routers through the use of
the all-router multicast group or the broadcast group. This is
to allow local routers to forward the packet to multicast
listeners on remote networks. The all-router multicast group
is preferred over the broadcast group for better efficiency.
If the all-router multicast group does not exist, the sender
can assume that there are no routers on the local link; hence
the packet can be safely dropped.
B) A multicast sender must join the target multicast group before
outgoing multicast messages from it can be successfully routed.
The "SendOnlyNonMember" join is different from the regular
"FullMember" join in two aspects. First, both types of joins
enable multicast packets to be routed FROM the local port, but
only the "FullMember" join causes multicast packets to be
routed TO the port. Second, the sender port of a
"SendOnlyNonMember" join will not be counted as a member of the
multicast group for purposes of group creation and deletion.
The following code snippet demonstrates the steps in a typical
implementation when processing an egress multicast packet.
if the egress port is already a "SendOnlyNonMember", or a
"FullMember"
=> send the packet
else if the target multicast group exists
=> do "SendOnlyNonMember" join
=> send the packet
else if scope > link-local AND the all-router multicast group exists
=> send the packet to all routers
else
=> drop the packet
Implementations should cache the information about the existence of
an IB multicast group, its MLID and other attributes. This is to
avoid expensive SA calls on every outgoing multicast packet. Senders
MUST subscribe to the multicast group create and delete traps in
Chu & Kashyap Standards Track [Page 15]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
order to monitor the status of specific IB multicast groups. For
example, multicast packets directed to the all-router multicast group
due to a lack of listener on the local subnet must be forwarded to
the right multicast group if the group is created later. This
happens when a listener shows up on the local subnet.
A node joining an IP multicast group must first construct an MGID
according to the rule described in section 4 above. Once the correct
MGID is calculated, the node must call the SA of the outbound link to
attempt a "FullMember" join of the IB multicast group corresponding
to the MGID. If the IB multicast group does not already exist, one
must be created first with the IPoIB link MTU. The MGID MUST use the
same P_Key, Q_Key, SL, MTU, and HopLimit as those used in the
broadcast-GID. The rest of attributes SHOULD follow the values used
in the broadcast-GID as well.
The join request will cause the local port to be added to the
multicast group. It also enables the SM to program IB switches and
routers with the new multicast information to ensure the correct
forwarding of multicast packets for the group.
When a node leaves an IP multicast group, it SHOULD make a
"FullMember" leave request to the SA. This gives the SM an
opportunity to update relevant forwarding information, to delete an
IB multicast group if the local port is the last FullMember to leave,
and to free up the MLID allocated for it. The specific algorithm is
implementation-dependent and is out of the scope of this document.
Note that for an IPoIB link that spans more than one IB subnet
connected by IB routers, an adequate multicast forwarding support at
the IB level is required for multicast packets to reach listeners on
a remote IB subnet. The specific mechanism for this is beyond the
scope of IPoIB.
11. IP Multicast Routing
IP multicast routing requires each interface over which the router is
operating to be configured to listen to all link-layer multicast
addresses generated by IP [IPMULT, IP6MLD]. For an Ethernet
interface, this is often achieved by turning on the promiscuous
multicast mode on the interface.
IBA does not provide any hardware support for promiscuous multicast
mode. Fortunately, a promiscuous multicast mode can be emulated in
the software running on a router through the following steps:
A) Obtain a list of all active IB multicast groups from the local
SA.
Chu & Kashyap Standards Track [Page 16]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
B) Make a "NonMember" join request to the SA for every group that
has a signature in its MGID matching the one for either IPv4 or
IPv6.
C) Subscribe to the IB multicast group creation events using a
wildcarded MGID so that the router can "NonMember" join all IB
multicast groups created subsequently for IPv4 or IPv6.
The "NonMember" join has the same effect as a "FullMember" join
except that the former will not be counted as a member of the
multicast group for purposes of group creation or deletion. That is,
when the last "FullMember" leaves a multicast group, the group can be
safely deleted by the SA without concerning any "NonMember" routers.
12. New Types of Vulnerability in IB Multicast
Many IB multicast functions are subject to failures due to a number
of possible resource constraints. These include the creation of IB
multicast groups, the join calls ("SendOnlyNonMember", "FullMember",
and "NonMember"), and the attaching of a QP to a multicast group.
In general, the occurrence of these failure conditions is highly
implementation-dependent, and is believed to be rare. Usually, a
failed multicast operation at the IB level can be propagated back to
the IP level, causing the original operation to fail and the
initiator of the operation to be notified. But some IB multicast
functions are not tied to any foreground operation, making their
failures hard to detect. For example, if an IP multicast router
attempts to "NonMember" join a newly created multicast group in the
local subnet, but the join call fails, packet forwarding for that
particular multicast group will likely fail silently, that is,
without the attention of local multicast senders. This type of
problem can add more vulnerability to the already unreliable IP
multicast operations.
Implementations SHOULD log error messages upon any failure from an IB
multicast operation. Network administrators should be aware of this
vulnerability, and preserve enough multicast resources at the points
where IP multicast will be used heavily. For example, HCAs with
ample multicast resources should be used at any IP multicast router.
13. Security Considerations
This document specifies IP transmission over a multicast network.
Any network of this kind is vulnerable to a sender claiming another's
identity and forging traffic or eavesdropping. It is the
responsibility of the higher layers or applications to implement
suitable countermeasures if this is a problem.
Chu & Kashyap Standards Track [Page 17]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
Successful transmission of IP packets depends on the correct setup of
the IPoIB link, creation of the broadcast-GID, creation of the QP and
its attachment to the broadcast-GID, and the correct determination of
various link parameters such as the LID, service level, and path
rate. These operations, many of which involve interactions with the
SM/SA, MUST be protected by the underlying operating system. This is
to prevent malicious, non-privileged software from hijacking
important resources and configurations.
Controlled Q_Keys SHOULD be used in all transmissions. This is to
prevent non-privileged software from fabricating IP datagrams.
14. IANA Considerations
To support ARP over InfiniBand, a value for the Address Resolution
Parameter "Number Hardware Type (hrd)" is required. IANA has
assigned the number "32" to indicate InfiniBand [IANA_ARP].
Future uses of the reserved bits in the frame format (Figure 3) and
link-layer address (Figure 5) MUST be published as RFCs. This
document requires that the reserved bits be set to zero on send and
ignored on receive.
15. Acknowledgements
The authors would like to thank Bruce Beukema, David Brean, Dan
Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
Erik Nordmark, Greg Pfister, Jim Pinkerton, Renato Recio, Kevin
Reilly, Kanoj Sarcar, Satya Sharma, Madhu Talluri, and David L.
Stevens for their suggestions and many clarifications on the IBA
specification.
16. References
16.1. Normative References
[AARCH] Hinden, R. and S. Deering, "Internet Protocol Version 6
(IPv6) Addressing Architecture", RFC 3513, April 2003.
[ARP] Plummer, David C., "Ethernet Address Resolution
Protocol: Or converting network protocol addresses to
48.bit Ethernet address for transmission on Ethernet
hardware ", STD 37, RFC 826, November 1982.
[DISC] Narten, T., Nordmark, E., and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461, December
1998.
Chu & Kashyap Standards Track [Page 18]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
[IANA] Internet Assigned Numbers Authority, URL
http://www.iana.org
[IANA_ARP] URL http://www.iana.org/assignments/arp-parameters
[IBTA] InfiniBand Architecture Specification, URL
http://www.infinibandta.org/specs
[RFC4392] Kashyap, V., "IP over InfiniBand (IPoIB) Architecture",
RFC 4392, April 2006.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
16.2. Informative References
[HOSTS] Braden, R., "Requirements for Internet Hosts -
Communication Layers", STD 3, RFC 1122, October 1989.
[IGMP3] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A.
Thyagarajan, "Internet Group Management Protocol,
Version 3", RFC 3376, October 2002.
[IP6MLD] Deering, S., Fenner, W., and B. Haberman, "Multicast
Listener Discovery (MLD) for IPv6", RFC 2710, October
1999.
[IPMULT] Deering, S., "Host extensions for IP multicasting", STD
5, RFC 1112, August 1989.
[IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6
(IPv6) Specification", RFC 2460, December 1998.
Chu & Kashyap Standards Track [Page 19]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
Authors' Addresses
H.K. Jerry Chu
17 Network Circle, UMPK17-201
Menlo Park, CA 94025
USA
Phone: +1 650 786 5146
EMail: jerry.chu@sun.com
Vivek Kashyap
15350, SW Koll Parkway
Beaverton, OR 97006
USA
Phone: +1 503 578 3422
EMail: vivk@us.ibm.com
Chu & Kashyap Standards Track [Page 20]
^L
RFC 4391 IP over InfiniBand (IPoIB) April 2006
Full Copyright Statement
Copyright (C) The Internet Society (2006).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is provided by the IETF
Administrative Support Activity (IASA).
Chu & Kashyap Standards Track [Page 21]
^L
|