Network Working Group P. Culley
Request for Comments: 5044 Hewlett-Packard Company
Category: Standards Track U. Elzur
Broadcom Corporation
R. Recio
IBM Corporation
S. Bailey
Sandburst Corporation
J. Carrier
Cray Inc.
October 2007
Marker PDU Aligned Framing for TCP Specification
Status of This Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Abstract
Marker PDU Aligned Framing (MPA) is designed to work as an
"adaptation layer" between TCP and the Direct Data Placement protocol
(DDP) as described in RFC 5041. It preserves the reliable, in-order
delivery of TCP, while adding the preservation of higher-level
protocol record boundaries that DDP requires. MPA is fully compliant
with applicable TCP RFCs and can be utilized with existing TCP
implementations. MPA also supports integrated implementations that
combine TCP, MPA and DDP to reduce buffering requirements in the
implementation and improve performance at the system level.
Culley, et al. Standards Track [Page 1]
RFC 5044 MPA Framing for TCP October 2007
Table of Contents
1. Introduction ....................................................4
1.1. Motivation .................................................4
1.2. Protocol Overview ..........................................5
2. Glossary ........................................................8
3. MPA's Interactions with DDP ....................................11
4. MPA Full Operation Phase .......................................13
4.1. FPDU Format ...............................................13
4.2. Marker Format .............................................14
4.3. MPA Markers ...............................................14
4.4. CRC Calculation ...........................................16
4.5. FPDU Size Considerations ..................................21
5. MPA's interactions with TCP ....................................22
5.1. MPA transmitters with a standard layered TCP ..............22
5.2. MPA receivers with a standard layered TCP .................23
6. MPA Receiver FPDU Identification ...............................24
7. Connection Semantics ...........................................24
7.1. Connection Setup ..........................................24
7.1.1. MPA Request and Reply Frame Format .................26
7.1.2. Connection Startup Rules ...........................28
7.1.3. Example Delayed Startup Sequence ...................30
7.1.4. Use of Private Data ................................33
7.1.4.1. Motivation ................................33
7.1.4.2. Example Immediate Startup Using
Private Data ..............................35
7.1.5. "Dual Stack" Implementations .......................37
7.2. Normal Connection Teardown ................................38
8. Error Semantics ................................................39
9. Security Considerations ........................................40
9.1. Protocol-Specific Security Considerations .................40
9.1.1. Spoofing ...........................................40
9.1.1.1. Impersonation .............................41
9.1.1.2. Stream Hijacking ..........................41
9.1.1.3. Man-in-the-Middle Attack ..................41
9.1.2. Eavesdropping ......................................42
9.2. Introduction to Security Options ..........................42
9.3. Using IPsec with MPA ......................................43
9.4. Requirements for IPsec Encapsulation of MPA/DDP ...........43
10. IANA Considerations ...........................................44
Appendix A. Optimized MPA-Aware TCP Implementations ...............45
A.1. Optimized MPA/TCP Transmitters ............................46
A.2. Effects of Optimized MPA/TCP Segmentation .................46
A.3. Optimized MPA/TCP Receivers ...............................48
A.4. Re-segmenting Middleboxes and Non-Optimized MPA/TCP
Senders ...................................................49
A.5. Receiver Implementation ...................................50
A.5.1. Network Layer Reassembly Buffers ...................51
A.5.2. TCP Reassembly Buffers .............................52
Appendix B. Analysis of MPA over TCP Operations ...................52
B.1. Assumptions ...............................................53
B.1.1. MPA Is Layered beneath DDP .........................53
B.1.2. MPA Preserves DDP Message Framing ..................53
B.1.3. The Size of the ULPDU Passed to MPA Is Less Than
EMSS Under Normal Conditions .......................53
B.1.4. Out-of-Order Placement but NO Out-of-Order Delivery.54
B.2. The Value of FPDU Alignment ...............................54
B.2.1. Impact of Lack of FPDU Alignment on the Receiver
Computational Load and Complexity ..................56
B.2.2. FPDU Alignment Effects on TCP Wire Protocol ........60
Appendix C. IETF Implementation Interoperability with RDMA
Consortium Protocols ..................................62
C.1. Negotiated Parameters ......................................63
C.2. RDMAC RNIC and Non-Permissive IETF RNIC ....................64
C.2.1. RDMAC RNIC Initiator ................................65
C.2.2. Non-Permissive IETF RNIC Initiator ..................65
C.2.3. RDMAC RNIC and Permissive IETF RNIC .................65
C.2.4. RDMAC RNIC Initiator ................................66
C.2.5. Permissive IETF RNIC Initiator ......................67
C.3. Non-Permissive IETF RNIC and Permissive IETF RNIC ..........67
Normative References ..............................................68
Informative References ............................................68
Contributors ......................................................70
Table of Figures
Figure 1: ULP MPA TCP Layering .....................................5
Figure 2: FPDU Format .............................................13
Figure 3: Marker Format ...........................................14
Figure 4: Example FPDU Format with Marker .........................16
Figure 5: Annotated Hex Dump of an FPDU ...........................19
Figure 6: Annotated Hex Dump of an FPDU with Marker ...............20
Figure 7: Fully Layered Implementation ............................22
Figure 8: MPA Request/Reply Frame .................................26
Figure 9: Example Delayed Startup Negotiation .....................31
Figure 10: Example Immediate Startup Negotiation ..................35
Figure 11: Optimized MPA/TCP Implementation .......................45
Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream .....56
Figure 13: Aligned FPDU Placed Immediately after TCP Header .......58
Figure 14: Connection Parameters for the RNIC Types ...............63
Figure 15: MPA Negotiation between an RDMAC RNIC and a
Non-Permissive IETF RNIC ...............................65
Figure 16: MPA Negotiation between an RDMAC RNIC and a Permissive
IETF RNIC ..............................................66
Figure 17: MPA Negotiation between a Non-Permissive IETF RNIC and
a Permissive IETF RNIC .................................67
1. Introduction
This section explains the reason for creating MPA on TCP and gives a
general overview of the protocol.
1.1. Motivation
The Direct Data Placement protocol [DDP], when used with TCP
[RFC793], requires a mechanism to detect record boundaries. This
document refers to the DDP records as Upper Layer Protocol Data Units
(ULPDUs). The ability to locate the ULPDU boundary
is useful to a hardware network adapter that uses
DDP to directly place the data in the application buffer based on the
control information carried in the ULPDU header. This may be done
without requiring that the packets arrive in order. Potential
benefits of this capability are the avoidance of the memory copy
overhead and a smaller memory requirement for handling out-of-order
or dropped packets.
Many approaches have been proposed for a generalized framing
mechanism. Some are probabilistic in nature and others are
deterministic. An example probabilistic approach is characterized by
a detectable value embedded in the octet stream, with no method of
preventing that value from appearing elsewhere within user data. It is
probabilistic because under some conditions the receiver may
incorrectly interpret application data as the detectable value.
Under these conditions, the protocol may fail with unacceptable
frequency. One deterministic approach is characterized by embedded
controls at known locations in the octet stream. Because the
receiver can guarantee it will only examine the data stream at
locations that are known to contain the embedded control, the
protocol can never misinterpret application data as being embedded
control data. For unambiguous handling of an out-of-order packet, a
deterministic approach is preferred.
The MPA protocol provides a framing mechanism for DDP running over
TCP using the deterministic approach. It allows the location of the
ULPDU to be determined in the TCP stream even if the TCP segments
arrive out of order.
1.2. Protocol Overview
The layering of PDUs with MPA is shown in Figure 1, below.
+------------------+
|    ULP client    |
+------------------+  <- Consumer messages
|       DDP        |
+------------------+  <- ULPDUs
|       MPA*       |
+------------------+  <- FPDUs (containing ULPDUs)
|       TCP*       |
+------------------+  <- TCP Segments (containing FPDUs)
|     IP etc.      |
+------------------+
* These may be fully layered or optimized together.
Figure 1: ULP MPA TCP Layering
MPA is described as an extra layer above TCP and below DDP. The
operation sequence is:
1. A TCP connection is established by ULP action. This is done
using methods not described by this specification. The ULP may
exchange some amount of data in streaming mode prior to starting
MPA, but is not required to do so.
2. The Consumer negotiates the use of DDP and MPA at both ends of a
connection. The mechanisms to do this are not described in this
specification. The negotiation may be done in streaming mode, or
by some other mechanism (such as a pre-arranged port number).
3. The ULP activates MPA on each end in the Startup Phase, either as
an Initiator or a Responder, as determined by the ULP. This mode
verifies the usage of MPA, specifies the use of CRC and Markers,
and allows the ULP to communicate some additional data via a
Private Data exchange. See Section 7.1, Connection Setup, for
more details on the startup process.
4. At the end of the Startup Phase, the ULP puts MPA (and DDP) into
Full Operation and begins sending DDP data as further described
below. In this document, DDP data chunks are called ULPDUs. For
a description of the DDP data, see [DDP].
Following is a description of data transfer when MPA is in Full
Operation.
1. DDP determines the Maximum ULPDU (MULPDU) size by querying MPA
for this value. MPA derives this information from TCP or IP,
when it is available, or chooses a reasonable value.
2. DDP creates ULPDUs of MULPDU size or smaller, and hands them to
MPA at the sender.
3. MPA creates a Framed Protocol Data Unit (FPDU) by prepending a
header, optionally inserting Markers, and appending a CRC field
after the ULPDU and PAD (if any). MPA delivers the FPDU to TCP.
4. The TCP sender puts the FPDUs into the TCP stream. If the sender
is optimized MPA/TCP, it segments the TCP stream in such a way
that a TCP Segment boundary is also the boundary of an FPDU. TCP
then passes each segment to the IP layer for transmission.
5. The receiver may or may not be optimized. If it is optimized
MPA/TCP, it may separate passing the TCP payload to MPA from
passing the TCP payload ordering information to MPA. In either
case, RFC-compliant TCP wire behavior is observed at both the
sender and receiver.
6. The MPA receiver locates and assembles complete FPDUs within the
stream, verifies their integrity, and removes MPA Markers (when
present), ULPDU_Length, PAD, and the CRC field.
7. MPA then provides the complete ULPDUs to DDP. MPA may also
separate passing MPA payload to DDP from passing the MPA payload
ordering information.
A fully layered MPA on TCP is implemented as a data stream ULP for
TCP and is therefore RFC compliant.
An optimized DDP/MPA/TCP uses a TCP layer that potentially contains
some additional behaviors as suggested in this document. When
DDP/MPA/TCP are cross-layer optimized, the behavior of TCP
(especially sender segmentation) may change from that of the un-
optimized implementation, but the changes are within the bounds
permitted by the TCP RFC specifications, and will interoperate with
an un-optimized TCP. The additional behaviors are described in
Appendix A and are not normative; they are described at a TCP
interface layer as a convenience. Implementations may achieve the
described functionality using any method, including cross-layer
optimizations between TCP, MPA, and DDP.
An optimized DDP/MPA/TCP sender is able to segment the data stream
such that TCP segments begin with FPDUs (FPDU Alignment). This has
significant advantages for receivers. When segments arrive with
aligned FPDUs, the receiver usually need not buffer any portion of
the segment, allowing DDP to place it in its destination memory
immediately, thus avoiding copies from intermediate buffers (DDP's
reason for existence).
An optimized DDP/MPA/TCP receiver allows a DDP on MPA implementation
to locate the start of ULPDUs that may be received out of order. It
also allows the implementation to determine if the entire ULPDU has
been received. As a result, MPA can pass out-of-order ULPDUs to DDP
for immediate use. This enables a DDP on MPA implementation to save
a significant amount of intermediate storage by placing the ULPDUs in
the right locations in the application buffers when they arrive,
rather than waiting until full ordering can be restored.
The ability of a receiver to recover out-of-order ULPDUs is optional
and declared to the transmitter during startup. When the receiver
declares that it does not support out-of-order recovery, the
transmitter does not add the control information to the data stream
needed for out-of-order recovery.
If the receiver is fully layered, then MPA receives a strictly
ordered stream of data and does not deal with out-of-order ULPDUs.
In this case, MPA passes each ULPDU to DDP when the last bytes arrive
from TCP, along with the indication that they are in order.
MPA implementations that support recovery of out-of-order ULPDUs MUST
support a mechanism to indicate the ordering of ULPDUs as the sender
transmitted them and indicate when missing intermediate segments
arrive. These mechanisms allow DDP to reestablish record ordering
and report Delivery of complete messages (groups of records).
MPA also addresses enhanced data integrity. Some users of TCP have
noted that the TCP checksum is not as strong as could be desired (see
[CRCTCP]). Studies such as [CRCTCP] have shown that the TCP checksum
indicates segments in error at a much higher rate than the underlying
link characteristics would indicate. With these higher error rates,
the chance that an error will escape detection, when using only the
TCP checksum for data integrity, becomes a concern. A stronger
integrity check can reduce the chance of data errors being missed.
MPA includes a CRC check to increase the ULPDU data integrity to the
level provided by other modern protocols, such as SCTP [RFC4960]. It
is possible to disable this CRC check; however, CRCs MUST be enabled
unless it is clear that the end-to-end connection through the network
has data integrity at least as good as an MPA with CRC enabled (for
example, when IPsec is implemented end to end). DDP's ULP expects
this level of data integrity and therefore the ULP does not have to
provide its own duplicate data integrity and error recovery for lost
data.
2. Glossary
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Consumer - the ULPs or applications that lie above MPA and DDP. The
Consumer is responsible for making TCP connections, starting MPA
and DDP connections, and generally controlling operations.
CRC - Cyclic Redundancy Check.
Delivery - (Delivered, Delivers) - For MPA, Delivery is defined as
the process of informing DDP that a particular PDU is ordered for
use. A PDU is Delivered in the exact order that it was sent by
the original sender; MPA uses TCP's byte stream ordering to
determine when Delivery is possible. This is specifically
different from "passing the PDU to DDP", which may generally
occur in any order, while the order of Delivery is strictly
defined.
EMSS - Effective Maximum Segment Size. EMSS is the smaller of the
TCP maximum segment size (MSS) as defined in RFC 793 [RFC793],
and the current path Maximum Transmission Unit (MTU) [RFC1191].
FPDU - Framed Protocol Data Unit. The unit of data created by an MPA
sender.
FPDU Alignment - The property that an FPDU is Header Aligned with the
TCP segment, and the TCP segment includes an integer number of
FPDUs. A TCP segment with an FPDU Alignment allows immediate
processing of the contained FPDUs without waiting on other TCP
segments to arrive or combining with prior segments.
FPDU Pointer (FPDUPTR) - This field of the Marker is used to indicate
the beginning of an FPDU.
Full Operation (Full Operation Phase) - After the completion of the
Startup Phase, MPA begins exchanging FPDUs.
Header Alignment - The property that a TCP segment begins with an
FPDU. The FPDU is Header Aligned when the FPDU header is exactly
at the start of the TCP segment (right behind the TCP headers on
the wire).
Initiator - The endpoint of a connection that sends the MPA Request
Frame, i.e., the first to actually send data (which may not be
the one that sends the TCP SYN).
Marker - A four-octet field that is placed in the MPA data stream at
fixed octet intervals (every 512 octets).
MPA-aware TCP - A TCP implementation that is aware of the receiver
efficiencies of MPA FPDU Alignment and is capable of sending TCP
segments that begin with an FPDU.
MPA-enabled - MPA is enabled if the MPA protocol is visible on the
wire. When the sender is MPA-enabled, it is inserting framing
and Markers. When the receiver is MPA-enabled, it is
interpreting framing and Markers.
MPA Request Frame - Data sent from the MPA Initiator to the MPA
Responder during the Startup Phase.
MPA Reply Frame - Data sent from the MPA Responder to the MPA
Initiator during the Startup Phase.
MPA - Marker-based ULP PDU Aligned Framing for TCP protocol. This
document defines the MPA protocol.
MULPDU - Maximum ULPDU. The current maximum size of the record that
is acceptable for DDP to pass to MPA for transmission.
Node - A computing device attached to one or more links of a network.
A Node in this context does not refer to a specific application
or protocol instantiation running on the computer. A Node may
consist of one or more MPA on TCP devices installed in a host
computer.
PAD - A 1-3 octet group of zeros used to fill an FPDU to an exact
modulo 4 size.
PDU - Protocol Data Unit.
Private Data - A block of data exchanged between MPA endpoints during
initial connection setup.
Protection Domain - An RDMA concept (see [VERBS-RDMA] and [RDMASEC])
that ties use of various endpoint resources (memory access, etc.)
to the specific RDMA/DDP/MPA connection.
RDDP - A suite of protocols including MPA, [DDP], [RDMAP], an overall
security document [RDMASEC], a problem statement [RFC4297], an
architecture document [RFC4296], and an applicability document
[APPL].
RDMA - Remote Direct Memory Access; a protocol that uses DDP and MPA
to enable applications to transfer data directly from memory
buffers. See [RDMAP].
Remote Peer - The MPA protocol implementation on the opposite end of
the connection. Used to refer to the remote entity when
describing protocol exchanges or other interactions between two
Nodes.
Responder - The connection endpoint that responds to an incoming MPA
connection request (the MPA Request Frame). This may not be the
endpoint that awaited the TCP SYN.
Startup Phase - The initial exchanges of an MPA connection that
serve to more fully identify MPA endpoints to each other and
pass connection-specific setup information to each other.
ULP - Upper Layer Protocol. The protocol layer above the protocol
layer currently being referenced. The ULP for MPA is DDP [DDP].
ULPDU - Upper Layer Protocol Data Unit. The data record defined by
the layer above MPA (DDP). ULPDU corresponds to DDP's DDP
segment.
ULPDU_Length - A field in the FPDU describing the length of the
included ULPDU.
3. MPA's Interactions with DDP
DDP requires MPA to maintain DDP record boundaries from the sender to
the receiver. When using MPA on TCP to send data, DDP provides
records (ULPDUs) to MPA. MPA will use the reliable transmission
abilities of TCP to transmit the data, and will insert appropriate
additional information into the TCP stream to allow the MPA receiver
to locate the record boundary information.
As such, MPA accepts complete records (ULPDUs) from DDP at the sender
and returns them to DDP at the receiver.
MPA MUST encapsulate the ULPDU such that there is exactly one ULPDU
contained in one FPDU.
MPA over a standard TCP stack can usually provide FPDU Alignment with
the TCP Header if the FPDU is equal to TCP's EMSS. An optimized
MPA/TCP stack can also maintain alignment as long as the FPDU is less
than or equal to TCP's EMSS. Since FPDU Alignment is generally
desired by the receiver, DDP cooperates with MPA to ensure FPDUs'
lengths do not exceed the EMSS under normal conditions. This is done
with the MULPDU mechanism.
MPA MUST provide information to DDP on the current maximum size of
the record that is acceptable to send (MULPDU). DDP SHOULD limit
each record size to MULPDU. The range of MULPDU values MUST be
between 128 octets and 64768 octets, inclusive.
The sending DDP MUST NOT post a ULPDU larger than 64768 octets to
MPA. DDP MAY post a ULPDU of any size between one and 64768 octets;
however, MPA is not REQUIRED to support a ULPDU Length that is
greater than the current MULPDU.
While the maximum theoretical length supported by the MPA header
ULPDU_Length field is 65535, TCP over IP requires the IP datagram
maximum length to be 65535 octets. To enable MPA to support FPDU
Alignment, the maximum size of the FPDU must fit within an IP
datagram. Thus, the ULPDU limit of 64768 octets was derived by
taking the maximum IP datagram length, subtracting from it the
maximum total length of the sum of the IPv4 header, TCP header, IPv4
options, TCP options, and the worst-case MPA overhead, and then
rounding the result down to a 128-octet boundary.
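The arithmetic behind the 64768-octet limit can be reproduced as a
sketch. The individual header sizes are not stated in this section;
the values below are assumptions (a 60-octet maximum IPv4 header
including options and a 60-octet maximum TCP header including
options), and the worst-case MPA overhead is estimated with the
Marker formula of Section 4.5 plus 3 pad octets.

```python
import math

IP_DATAGRAM_MAX = 65535   # maximum IPv4 datagram length in octets
IPV4_HEADER_MAX = 60      # assumption: 20-octet header + 40 octets of options
TCP_HEADER_MAX = 60       # assumption: 20-octet header + 40 octets of options


def ulpdu_limit() -> int:
    """Reconstruct the ULPDU size limit under the assumptions above."""
    tcp_payload = IP_DATAGRAM_MAX - IPV4_HEADER_MAX - TCP_HEADER_MAX
    markers = 4 * math.ceil(tcp_payload / 512)   # worst-case Marker octets
    mpa_overhead = 6 + markers + 3               # header + CRC, Markers, max PAD
    # Round the remaining room down to a 128-octet boundary.
    return (tcp_payload - mpa_overhead) // 128 * 128
```

Under these assumptions the function returns 64768, matching the
limit stated above.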
Note that MULPDU will be significantly smaller than the theoretical
maximum in most implementations for most circumstances, due to link
MTUs, use of extra headers such as required for IPsec, etc.
On receive, MPA MUST pass each ULPDU with its length to DDP when it
has been validated.
If an MPA implementation supports passing out-of-order ULPDUs to DDP,
the MPA implementation SHOULD:
* Pass each ULPDU with its length to DDP as soon as it has been
fully received and validated.
* Provide a mechanism to indicate the ordering of ULPDUs as the
sender transmitted them. One possible mechanism might be
providing the TCP sequence number for each ULPDU.
* Provide a mechanism to indicate when a given ULPDU (and prior
ULPDUs) are complete (Delivered to DDP). One possible mechanism
might be to allow DDP to see the current outgoing TCP ACK
sequence number.
* Provide an indication to DDP that the TCP has closed or has begun
to close the connection (e.g., received a FIN).
MPA MUST provide the protocol version negotiated with its peer to
DDP. DDP will use this version to set the version in its header and
to report the version to [RDMAP].
4. MPA Full Operation Phase
The following sections describe the main semantics of the Full
Operation Phase of MPA.
4.1. FPDU Format
MPA senders create FPDUs out of ULPDUs. The format of an FPDU shown
below MUST be used for all MPA FPDUs. For purposes of clarity,
Markers are not shown in Figure 2.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          ULPDU_Length         |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
~                                                               ~
~                             ULPDU                             ~
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |        PAD (0-3 octets)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              CRC                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2: FPDU Format
ULPDU_Length: 16 bits (unsigned integer). This is the number of
octets of the contained ULPDU. It does not include the length of the
FPDU header itself, the pad, the CRC, or of any Markers that fall
within the ULPDU. The 16-bit ULPDU Length field is large enough to
support the largest IP datagrams for IPv4 or IPv6.
PAD: The PAD field trails the ULPDU and contains between 0 and 3
octets of data. The pad data MUST be set to zero by the sender and
ignored by the receiver (except for CRC checking). The length of the
pad is set so as to make the size of the FPDU an integral multiple of
four.
CRC: 32 bits. When CRCs are enabled, this field contains a CRC32c
check value, which is used to verify the entire contents of the FPDU,
using CRC32c. See Section 4.4, CRC Calculation. When CRCs are not
enabled, this field is still present, may contain any value, and MUST
NOT be checked.
The FPDU adds a minimum of 6 octets to the length of the ULPDU. In
addition, the total length of the FPDU will include the length of any
Markers and from 0 to 3 pad octets added to round-up the ULPDU size.
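As an illustration of the size rules above, the following sketch
computes the pad length and the total FPDU length when no Markers
fall inside the FPDU. The function names are hypothetical and are
not part of this specification.

```python
ULPDU_LENGTH_FIELD = 2   # octets in the ULPDU_Length header
CRC_FIELD = 4            # octets in the CRC field


def pad_length(ulpdu_length: int) -> int:
    """Pad octets (0-3) needed to make the FPDU a multiple of four."""
    return -(ULPDU_LENGTH_FIELD + ulpdu_length) % 4


def fpdu_length(ulpdu_length: int) -> int:
    """Total FPDU size in octets, excluding any Markers."""
    return (ULPDU_LENGTH_FIELD + ulpdu_length
            + pad_length(ulpdu_length) + CRC_FIELD)
```

For a 42-octet ULPDU no pad is needed and the FPDU occupies 48
octets; for a 16-octet ULPDU two pad octets are added and the FPDU
occupies 24 octets.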
4.2. Marker Format
The format of a Marker MUST be as specified in Figure 3:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           RESERVED            |            FPDUPTR            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3: Marker Format
RESERVED: The Reserved field MUST be set to zero on transmit and
ignored on receive (except for CRC calculation).
FPDUPTR: The FPDU Pointer is a relative pointer, 16 bits long,
interpreted as an unsigned integer that indicates the number of
octets in the TCP stream from the beginning of the ULPDU Length field
to the first octet of the entire Marker. The least significant two
bits MUST always be set to zero at the transmitter, and the receivers
MUST always treat these as zero for calculations.
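A minimal sketch of encoding and decoding a Marker follows. It
assumes network byte order for both 16-bit fields, which is
consistent with the annotated hex dumps in Figures 5 and 6; the
function names are hypothetical.

```python
import struct


def build_marker(fpduptr: int) -> bytes:
    """Encode a Marker: 16 reserved zero bits followed by FPDUPTR."""
    if not 0 <= fpduptr <= 0xFFFF:
        raise ValueError("FPDUPTR must fit in 16 bits")
    if fpduptr & 0x3:
        raise ValueError("the two low-order bits of FPDUPTR must be zero")
    return struct.pack("!HH", 0, fpduptr)


def parse_marker(marker: bytes) -> int:
    """Return FPDUPTR, ignoring RESERVED and treating the low two bits as zero."""
    _reserved, fpduptr = struct.unpack("!HH", marker)
    return fpduptr & 0xFFFC
```

The receiver-side mask reflects the rule that the two least
significant bits are treated as zero regardless of what was
received.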
4.3. MPA Markers
MPA Markers are used to identify the start of FPDUs when packets are
received out of order. This is done by locating the Markers at fixed
intervals in the data stream (which is correlated to the TCP sequence
number) and using the Marker value to locate the preceding FPDU
start.
All MPA Markers are included in the containing FPDU CRC calculation
(when both CRCs and Markers are in use).
The MPA receiver's ability to locate out-of-order FPDUs and pass the
ULPDUs to DDP is implementation dependent. MPA/DDP allows those
receivers that are able to deal with out-of-order FPDUs in this way
to require the insertion of Markers in the data stream. When the
receiver cannot deal with out-of-order FPDUs in this way, it may
disable the insertion of Markers at the sender. All MPA senders MUST
be able to generate Markers when their use is declared by the
opposing receiver (see Section 7.1, Connection Setup).
When Markers are enabled, MPA senders MUST insert a Marker into the
data stream at a 512-octet periodic interval in the TCP Sequence
Number Space. The Marker contains a 16-bit unsigned integer referred
to as the FPDUPTR (FPDU Pointer).
If the FPDUPTR's value is non-zero, the FPDU Pointer is a 16-bit
relative back-pointer. FPDUPTR MUST contain the number of octets in
the TCP stream from the beginning of the ULPDU Length field to the
first octet of the Marker, unless the Marker falls between FPDUs.
Thus, the location of the first octet of the previous FPDU header can
be determined by subtracting the value of the given Marker from the
current octet-stream sequence number (i.e., TCP sequence number) of
the first octet of the Marker. Note that this computation MUST take
into account that the TCP sequence number could have wrapped between
the Marker and the header.
An FPDUPTR value of 0x0000 is a special case -- it is used when the
Marker falls exactly between FPDUs (between the preceding FPDU CRC
field and the next FPDU's ULPDU Length field). In this case, the
Marker is considered to be contained in the following FPDU; the
Marker MUST be included in the CRC calculation of the FPDU following
the Marker (if CRCs are being generated or checked). Thus, an
FPDUPTR value of 0x0000 means that immediately following the Marker
is an FPDU header (the ULPDU Length field).
Since all FPDUs are integral multiples of 4 octets, the bottom two
bits of the FPDUPTR as calculated by the sender are zero. MPA
reserves these bits so they MUST be treated as zero for computation
at the receiver.
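The back-pointer computation described above, including sequence
number wrap and the 0x0000 special case, can be sketched as follows.
The function name is hypothetical, and the sketch assumes 32-bit TCP
sequence numbers.

```python
SEQ_MODULUS = 2 ** 32   # TCP sequence numbers wrap at 2^32


def fpdu_header_seq(marker_seq: int, fpduptr: int) -> int:
    """Sequence number of the ULPDU Length field located by a Marker.

    marker_seq is the sequence number of the first octet of the Marker.
    """
    ptr = fpduptr & 0xFFFC              # low two bits are treated as zero
    if ptr == 0:
        # Marker fell between FPDUs: the header immediately follows it.
        return (marker_seq + 4) % SEQ_MODULUS
    # Non-zero: step back from the Marker, allowing for wrap.
    return (marker_seq - ptr) % SEQ_MODULUS
```

For the FPDU shown in Figure 6, a Marker at octet 0x200 carrying
FPDUPTR 0x14 locates the ULPDU Length field at octet 0x1ec.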
When Markers are enabled (see Section 7.1, Connection Setup), the MPA
Markers MUST be inserted immediately preceding the first FPDU of Full
Operation Phase, and at every 512th octet of the TCP octet stream
thereafter. As a result, the first Marker has an FPDUPTR value of
0x0000. If the first Marker begins at octet sequence number
SeqStart, then Markers are inserted such that the first octet of the
Marker is at octet sequence number SeqNum if the remainder of (SeqNum
- SeqStart) mod 512 is zero. Note that SeqNum can wrap.
For example, if the TCP sequence number were used to calculate the
insertion point of the Marker, the starting TCP sequence number is
unlikely to be zero, and 512-octet multiples are unlikely to fall on
a modulo 512 of zero. If the MPA connection is started at TCP
sequence number 11, then the 1st Marker will begin at 11, and
subsequent Markers will begin at 523, 1035, etc.
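The placement rule can be sketched as a pair of helper functions.
The names are hypothetical; the arithmetic is modulo 2^32 so that
the test remains valid when SeqNum wraps.

```python
SEQ_MODULUS = 2 ** 32
MARKER_INTERVAL = 512


def is_marker_position(seq_num: int, seq_start: int) -> bool:
    """True if a Marker begins at seq_num, given the first Marker at seq_start."""
    return ((seq_num - seq_start) % SEQ_MODULUS) % MARKER_INTERVAL == 0


def next_marker_position(seq_num: int, seq_start: int) -> int:
    """Sequence number of the first Marker at or after seq_num."""
    offset = (seq_num - seq_start) % SEQ_MODULUS
    return (seq_num + (-offset % MARKER_INTERVAL)) % SEQ_MODULUS
```

With a starting sequence number of 11, the positions 11, 523, and
1035 satisfy is_marker_position, as in the example above.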
If an FPDU is large enough to contain multiple Markers, they MUST all
point to the same point in the TCP stream: the first octet of the
ULPDU Length field for the FPDU.
If a Marker interval contains multiple FPDUs (the FPDUs are small),
the Marker MUST point to the start of the ULPDU Length field for the
FPDU containing the Marker unless the Marker falls between FPDUs, in
which case the Marker MUST be zero.
The following example shows an FPDU containing a Marker.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       ULPDU Length (0x0010)   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
+                                                               +
|                         ULPDU (octets 0-9)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            (0x0000)           |        FPDU ptr (0x000C)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        ULPDU (octets 10-15)                   |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |          PAD (2 octets:0,0)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              CRC                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 4: Example FPDU Format with Marker
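One way a sender might interleave Markers with an FPDU is sketched
below. This is an illustration only, not a required implementation.
The function name is hypothetical, stream_off is assumed to be the
octet offset from the first Marker position (SeqStart), and the body
passed in is the ULPDU Length field, ULPDU, and PAD without the CRC,
since the CRC is computed after Markers are inserted.

```python
import struct

MARKER_INTERVAL = 512


def insert_markers(body: bytes, stream_off: int) -> bytes:
    """Interleave Markers with header + ULPDU + PAD (CRC not included).

    stream_off is the offset, relative to the first Marker, at which
    this FPDU's first octet will be written to the stream.
    """
    if len(body) % 4 or stream_off % 4:
        raise ValueError("FPDU body and stream offset must be multiples of 4")
    out = bytearray()
    pos = stream_off
    fpdu_start = None                     # position of the ULPDU Length field
    i = 0
    while True:
        if pos % MARKER_INTERVAL == 0:
            # A Marker ahead of the header belongs to this FPDU and is zero.
            fpduptr = 0 if fpdu_start is None else pos - fpdu_start
            out += struct.pack("!HH", 0, fpduptr)
            pos += 4
        if i >= len(body):
            break
        if fpdu_start is None:
            fpdu_start = pos
        out += body[i:i + 4]
        i += 4
        pos += 4
    return bytes(out)
```

The caller would then compute the CRC over the returned octets,
append the 4-octet CRC field, and advance its stream offset by the
total number of octets written. For the layout in Figure 4, an FPDU
body that starts 12 octets before a Marker position receives a
Marker with FPDUPTR 0x000C after its first 12 octets.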
MPA Receivers MUST preserve ULPDU boundaries when passing data to
DDP. MPA Receivers MUST pass the ULPDU data and the ULPDU Length to
DDP and not the Markers, headers, and CRC.
4.4. CRC Calculation
An MPA implementation MUST implement CRC support and MUST either:
(1) always use CRCs; the MPA provider is not REQUIRED to support an
administrator's request that CRCs not be used.
or
(2a) only indicate a preference not to use CRCs on the explicit
request of the system administrator, via an interface not
defined in this spec. The default configuration for a
connection MUST be to use CRCs.
(2b) disable CRC checking (and possibly generation) if both the local
and remote endpoints indicate preference not to use CRCs.
An administrative decision to have a host request CRC suppression
SHOULD NOT be made unless there is assurance that the TCP connection
involved provides protection from undetected errors that is at least
as strong as an end-to-end CRC32c. End-to-end usage of an IPsec
cryptographic integrity check is among the ways to provide such
protection, and the use of channel bindings [NFSv4CHANNEL] by the ULP
can provide a high level of assurance that the IPsec protection scope
is end-to-end with respect to the ULP.
The process MUST be invisible to the ULP.
After receipt of an MPA startup declaration indicating that its peer
requires CRCs, an MPA instance MUST continue generating and checking
CRCs until the connection terminates. If an MPA instance has
declared that it does not require CRCs, it MUST turn off CRC checking
immediately after receipt of an MPA mode declaration indicating that
its peer also does not require CRCs. It MAY continue generating
CRCs. See Section 7.1, Connection Setup, for details on the MPA
startup.
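The outcome of the CRC declarations can be summarized in a short
sketch. The function and parameter names are hypothetical; the
actual declarations are carried in the MPA startup frames described
in Section 7.1.

```python
def crc_checking_enabled(local_requires_crc: bool,
                         peer_requires_crc: bool) -> bool:
    """CRC checking is turned off only when both endpoints decline CRCs."""
    return local_requires_crc or peer_requires_crc
```

Note that the CRC field is present in every FPDU in either case;
only generation of a meaningful value and checking are affected.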
When sending an FPDU, the sender MUST include a CRC field. When CRCs
are enabled, the CRC field in the MPA FPDU MUST be computed using the
CRC32c polynomial in the manner described in the iSCSI Protocol
[iSCSI] document for Header and Data Digests.
The fields which MUST be included in the CRC calculation when sending
an FPDU are as follows:
1) If a Marker does not immediately precede the ULPDU Length field,
the CRC-32c is calculated from the first octet of the ULPDU
Length field, through all the ULPDU and Markers (if present), to
the last octet of the PAD (if present), inclusive. If there is a
Marker immediately following the PAD, the Marker is included in
the CRC calculation for this FPDU.
2) If a Marker immediately precedes the first octet of the ULPDU
Length field of the FPDU, (i.e., the Marker fell between FPDUs,
and thus is required to be included in the second FPDU), the
CRC-32c is calculated from the first octet of the Marker, through
the ULPDU Length header, through all the ULPDU and Markers (if
present), to the last octet of the PAD (if present), inclusive.
3) After calculating the CRC-32c, the resultant value is placed into
the CRC field at the end of the FPDU.
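The coverage rules above can be sketched with a simple bitwise
CRC-32c routine. This is an illustrative, unoptimized
implementation using the reflected form of the CRC32c polynomial
(0x82F63B78); the function names are hypothetical, and the octet
order in which the result is written to the CRC field follows
[iSCSI] and is not restated here.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32c over data, returned as a 32-bit integer."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def crc_coverage_start(header_off: int, marker_precedes_header: bool) -> int:
    """Offset of the first octet covered by the CRC for an FPDU.

    header_off is the offset of the ULPDU Length field. If a Marker
    immediately precedes that field, coverage begins at the Marker.
    """
    return header_off - 4 if marker_precedes_header else header_off
```

Coverage then runs from the returned offset through the last octet
before the CRC field. As a sanity check of the routine itself,
crc32c(b"123456789") yields 0xE3069283, the standard check value
for CRC-32c.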
When an FPDU is received, and CRC checking is enabled, the receiver
MUST first perform the following:
1) Calculate the CRC of the incoming FPDU in the same fashion as
defined above.
2) Verify that the calculated CRC-32c value is the same as the
received CRC-32c value found in the FPDU CRC field. If not, the
receiver MUST treat the FPDU as an invalid FPDU.
The procedure for handling invalid FPDUs is covered in Section 8,
Error Semantics.
The following is an annotated hex dump of an example FPDU sent as the
first FPDU on the stream. As such, it starts with a Marker. The
FPDU contains a 42-octet ULPDU (an example DDP segment), which in turn
carries 24 octets of Send Data that are all zeros. The CRC32c has been
correctly calculated and can be used as a reference. See the [DDP] and
[RDMAP] specifications for definitions of the DDP Control field, Queue,
MSN, MO, and Send Data.
Octet Contents Annotation
Count
0000 00 Marker: Reserved
0001 00
0002 00 Marker: FPDUPTR
0003 00
0004 00 ULPDU Length
0005 2a
0006 41 DDP Control Field, Send with Last flag set
0007 43
0008 00 Reserved (DDP STag position with no STag)
0009 00
000a 00
000b 00
000c 00 DDP Queue = 0
000d 00
000e 00
000f 00
0010 00 DDP MSN = 1
0011 00
0012 00
0013 01
0014 00 DDP MO = 0
0015 00
0016 00
0017 00
0018 00 DDP Send Data (24 octets of zeros)
...
002f 00
0030 52 CRC32c
0031 23
0032 99
0033 83
Figure 5: Annotated Hex Dump of an FPDU
The following is an example sent as the second FPDU of the stream
where the first FPDU (which is not shown here) had a length of 492
octets and was also a Send to Queue 0 with Last Flag set. This
example contains a Marker.
Octet Contents Annotation
Count
01ec 00 Length
01ed 2a
01ee 41 DDP Control Field: Send with Last Flag set
01ef 43
01f0 00 Reserved (DDP STag position with no STag)
01f1 00
01f2 00
01f3 00
01f4 00 DDP Queue = 0
01f5 00
01f6 00
01f7 00
01f8 00 DDP MSN = 2
01f9 00
01fa 00
01fb 02
01fc 00 DDP MO = 0
01fd 00
01fe 00
01ff 00
0200 00 Marker: Reserved
0201 00
0202 00 Marker: FPDUPTR
0203 14
0204 00 DDP Send Data (24 octets of zeros)
...
021b 00
021c 84 CRC32c
021d 92
021e 58
021f 98
Figure 6: Annotated Hex Dump of an FPDU with Marker
4.5. FPDU Size Considerations
MPA defines the Maximum Upper Layer Protocol Data Unit (MULPDU) as
the size of the largest ULPDU fitting in an FPDU. For an empty TCP
Segment, MULPDU is EMSS minus the FPDU overhead (6 octets) minus
space for Markers and pad octets.
The maximum ULPDU Length for a single ULPDU when Markers are
present MUST be computed as:
MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)
The formula above accounts for the worst-case number of Markers.
The maximum ULPDU Length for a single ULPDU when Markers are NOT
present MUST be computed as:
MULPDU = EMSS - (6 + EMSS mod 4)
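The two formulas translate directly into code. The sketch below
uses hypothetical function names and integer arithmetic for the
Ceiling operation.

```python
def mulpdu_with_markers(emss: int) -> int:
    """MULPDU = EMSS - (6 + 4 * Ceiling(EMSS / 512) + EMSS mod 4)."""
    marker_octets = 4 * -(-emss // 512)     # integer ceiling of emss / 512
    return emss - (6 + marker_octets + emss % 4)


def mulpdu_without_markers(emss: int) -> int:
    """MULPDU = EMSS - (6 + EMSS mod 4)."""
    return emss - (6 + emss % 4)
```

For example, with an EMSS of 1460 octets, the results are 1442
octets with Markers and 1454 octets without.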
As a further optimization of the wire efficiency an MPA
implementation MAY dynamically adjust the MULPDU (see Section 5 for
latency and wire efficiency trade-offs). When one or more FPDUs are
already packed into a TCP Segment, MULPDU MAY be reduced accordingly.
DDP SHOULD provide ULPDUs that are as large as possible, but less
than or equal to MULPDU.
If the TCP implementation needs to adjust EMSS to support MTU changes
or changing TCP options, the MULPDU value is changed accordingly.
In certain rare situations, the EMSS may shrink below 128 octets in
size. If this occurs, the MPA on TCP sender MUST NOT shrink the
MULPDU below 128 octets and is not required to follow the
segmentation rules in Section 5.1 and Appendix A.
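A sketch of applying the lower bound follows. The upper bound of
64768 octets is taken from Section 3; the function name is
hypothetical.

```python
MULPDU_MIN = 128      # octets; MPA must not advertise less than this
MULPDU_MAX = 64768    # octets; upper end of the range given in Section 3


def advertised_mulpdu(computed_mulpdu: int) -> int:
    """Clamp a computed MULPDU into the permitted range."""
    return max(MULPDU_MIN, min(MULPDU_MAX, computed_mulpdu))
```

When the clamp raises the value to 128, the resulting FPDUs may not
fit in a single TCP segment, which is why the segmentation rules are
relaxed in that case.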
If one or more FPDUs are already packed into a TCP segment, such that
the remaining room is less than 128 octets, MPA MUST NOT provide a
MULPDU smaller than 128. In this case, MPA would typically provide a
MULPDU for the next full-sized segment, but may still pack the next
FPDU into the small remaining room, provided that the next FPDU is
small enough to fit.
The value 128 is chosen so as to allow DDP designers room for the DDP
Header and some user data.
5. MPA's interactions with TCP
The following sections describe MPA's interactions with TCP. This
section discusses using a standard layered TCP stack with MPA
attached above a TCP socket. Discussion of using an optimized MPA-
aware TCP with an MPA implementation that takes advantage of the
extra optimizations is done in Appendix A.
+-----------------------------------+
| +-----+       +-----------------+ |
| | MPA |       | Other Protocols | |
| +-----+       +-----------------+ |
|   ||                  ||          |
|  ----- socket API --------------  |
|            ||                     |
|         +-----+                   |
|         | TCP |                   |
|         +-----+                   |
|            ||                     |
|         +-----+                   |
|         | IP  |                   |
|         +-----+                   |
+-----------------------------------+
Figure 7: Fully Layered Implementation
The Fully layered implementation is described for completeness;
however, the user is cautioned that the reduced probability of FPDU
alignment when transmitting with this implementation will tend to
introduce a higher overhead at optimized receivers. In addition, the
lack of out-of-order receive processing will significantly reduce the
value of DDP/MPA by imposing higher buffering and copying overhead in
the local receiver.
5.1. MPA transmitters with a standard layered TCP
MPA transmitters SHOULD calculate a MULPDU as described in Section
4.5. If the TCP implementation allows EMSS to be determined by MPA,
that value should be used. If the transmit side TCP implementation
is not able to report the EMSS, MPA SHOULD use the current MTU value
to establish a likely FPDU size, taking into account the various
expected header sizes.
MPA transmitters SHOULD also use whatever facilities the TCP stack
presents to cause the TCP transmitter to start TCP segments at FPDU
boundaries. Multiple FPDUs MAY be packed into a single TCP segment
as determined by the EMSS calculation as long as they are entirely
contained in the TCP segment.
For example, passing FPDU buffers sized to the current EMSS to the
TCP socket and using the TCP_NODELAY socket option to disable the
Nagle [RFC896] algorithm will usually result in many of the segments
starting with an FPDU.
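On a sockets-based stack, the approach in the example above might
look like the following sketch. The helper names are hypothetical;
the sketch only disables the Nagle algorithm and hands one complete
FPDU to TCP per call, which encourages but does not guarantee that
segments start with an FPDU.

```python
import socket


def make_mpa_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def send_fpdu(sock: socket.socket, fpdu: bytes) -> None:
    """Hand one complete FPDU (ideally sized to the EMSS) to TCP."""
    sock.sendall(fpdu)
```

The socket returned by make_mpa_socket still has to be connected by
the Consumer before send_fpdu is used.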
It is recognized that various effects can cause an FPDU Alignment to
be lost. Following are a few of the effects:
* ULPDUs that are smaller than the MULPDU. If these are sent in a
continuous stream, FPDU Alignment will be lost. Note that
careful use of a dynamic MULPDU can help in this case; the MULPDU
for future FPDUs can be adjusted to re-establish alignment with
the segments based on the current EMSS.
* Sending enough data that the TCP receive window limit is reached.
TCP may send a smaller segment to exactly fill the receive
window.
* Sending data when TCP is operating up against the congestion
window. If TCP is not tracking the congestion window in
segments, it may transmit a smaller segment to exactly fill the
congestion window.
* Changes in EMSS due to varying TCP options, or changes in MTU.
If FPDU Alignment with TCP segments is lost for any reason, the
alignment is regained after a break in transmission where the TCP
send buffers are emptied. Many usage models for DDP/MPA will include
such breaks.
MPA receivers are REQUIRED to be able to operate correctly even if
alignment is lost (see Section 6).
5.2. MPA receivers with a standard layered TCP
MPA receivers will get TCP data in the usual ordered stream. The
receivers MUST identify FPDU boundaries by using the ULPDU_Length
field, as described in Section 6. Receivers MAY utilize Markers to
check for FPDU boundary consistency, but they are NOT required to
examine the Markers to determine the FPDU boundaries.
6. MPA Receiver FPDU Identification
An MPA receiver MUST first verify the FPDU before passing the ULPDU
to DDP. To do this, the receiver MUST:
* locate the start of the FPDU unambiguously,
* verify its CRC (if CRC checking is enabled).
If the above conditions are true, the MPA receiver passes the ULPDU
to DDP.
To detect the start of the FPDU unambiguously one of the following
MUST be used:
1: In an ordered TCP stream, the ULPDU Length field in the current
FPDU, when the FPDU has a valid CRC, can be used to identify the
beginning of the next FPDU.
2: For optimized MPA/TCP receivers that support out-of-order
reception of FPDUs (see Section 4.3, MPA Markers) a Marker can
always be used to locate the beginning of an FPDU (in FPDUs with
valid CRCs). Since the location of the Marker is known in the
octet stream (sequence number space), the Marker can always be
found.
3: Having found an FPDU by means of a Marker, an optimized MPA/TCP
receiver can find following contiguous FPDUs by using the ULPDU
Length fields (from FPDUs with valid CRCs) to establish the next
FPDU boundary.
The ULPDU Length field (see Section 4) MUST be used to determine if
the entire FPDU is present before forwarding the ULPDU to DDP.
CRC calculation is discussed in Section 4.4 above.
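As an illustration of method 1 above, the following Python sketch walks
an in-order octet stream and uses each ULPDU Length field to find the
next FPDU boundary. It is a sketch under stated assumptions, not a
definitive implementation: it assumes the FPDU layout of Section 4 (a
2-octet ULPDU Length, the ULPDU, 0-3 pad octets up to a 4-octet
boundary, and a 4-octet CRC), it assumes Markers are not in use, and
the names fpdu_total_length, iter_ulpdus, and crc_ok are hypothetical.
The CRC check is left to a caller-supplied function so that the sketch
takes no position on the CRC details of Section 4.4.

```python
import struct
from typing import Callable, Iterator, Optional

HDR_LEN = 2  # ULPDU Length field, in octets
CRC_LEN = 4  # CRC field, in octets (present even when CRCs are not in use)


def fpdu_total_length(ulpdu_length: int) -> int:
    """Octets occupied by one FPDU: header + ULPDU + pad + CRC (no Markers)."""
    unpadded = HDR_LEN + ulpdu_length
    pad = (-unpadded) % 4  # 0-3 pad octets up to a 4-octet boundary
    return unpadded + pad + CRC_LEN


def iter_ulpdus(
    stream: bytes,
    crc_ok: Optional[Callable[[bytes, bytes], bool]] = None,
) -> Iterator[bytes]:
    """Yield each complete ULPDU found in an in-order octet stream.

    crc_ok(covered_octets, crc_octets) is a hypothetical caller-supplied
    check; pass None when CRC checking was not negotiated.
    """
    offset = 0
    while offset + HDR_LEN <= len(stream):
        (ulpdu_length,) = struct.unpack_from("!H", stream, offset)
        total = fpdu_total_length(ulpdu_length)
        if offset + total > len(stream):
            break  # the entire FPDU is not present yet; wait for more data
        fpdu = stream[offset:offset + total]
        if crc_ok is not None and not crc_ok(fpdu[:-CRC_LEN], fpdu[-CRC_LEN:]):
            raise ValueError("FPDU CRC mismatch at stream offset %d" % offset)
        yield fpdu[HDR_LEN:HDR_LEN + ulpdu_length]
        offset += total  # the next FPDU starts right after this one
```

A receiver built this way stops at the first incomplete FPDU, which
mirrors the requirement that the ULPDU Length field be used to confirm
that the entire FPDU is present before the ULPDU is forwarded to DDP.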
7. Connection Semantics
7.1. Connection Setup
MPA requires that the Consumer MUST activate MPA, and any TCP
enhancements for MPA, on a TCP half connection at the same location
in the octet stream at both the sender and the receiver. This is
required in order for the Marker scheme to correctly locate the
Markers (if enabled) and to correctly locate the first FPDU.
MPA, and any TCP enhancements for MPA, are enabled by the ULP in both
directions at once at an endpoint.
This can be accomplished in several ways, and is left up to DDP's ULP:
* DDP's ULP MAY require DDP on MPA startup immediately after TCP
connection setup. This has the advantage that no streaming mode
negotiation is needed. An example of such a protocol is shown in
Figure 10: Example Immediate Startup negotiation.
This may be accomplished by using a well-known port, or a service
locator protocol to locate an appropriate port on which DDP on
MPA is expected to operate.
* DDP's ULP MAY negotiate the start of DDP on MPA sometime after a
normal TCP startup, using TCP streaming data exchanges on the
same connection. The exchange establishes that DDP on MPA (as
well as other ULPs) will be used, and exactly locates the point
in the octet stream where MPA is to begin operation. Note that
such a negotiation protocol is outside the scope of this
specification. A simplified example of such a protocol is shown
in Figure 9: Example Delayed Startup Negotiation.
An MPA endpoint operates in two distinct phases.
The Startup Phase is used to verify correct MPA setup, exchange CRC
and Marker configuration, and optionally pass Private Data between
endpoints prior to completing a DDP connection. During this phase,
specifically formatted frames are exchanged as TCP byte streams
without using CRCs or Markers. During this phase a DDP endpoint need
not be "bound" to the MPA connection. In fact, the choice of DDP
endpoint and its operating parameters may not be known until the
Consumer supplied Private Data (if any) has been examined by the
Consumer.
The second distinct phase is Full Operation during which FPDUs are
sent using all the rules that pertain (CRCs, Markers, MULPDU
restrictions, etc.). A DDP endpoint MUST be "bound" to the MPA
connection at entry to this phase.
When Private Data is passed between ULPs in the Startup Phase, the
ULP is responsible for interpreting that data, and then placing MPA
into Full Operation.
Note: The following text differentiates the two endpoints by calling
them Initiator and Responder. This is quite arbitrary and is NOT
related to the TCP startup (SYN, SYN/ACK sequence). The
Initiator is the side that sends first in the MPA startup
sequence (the MPA Request Frame).
Note: The possibility that both endpoints would be allowed to make a
connection at the same time, sometimes called an active/active
connection, was considered by the work group and rejected. There
were several motivations for this decision. One was that
applications needing this facility were few (none other than
theoretical at the time of this document). Another was that the
facility created some implementation difficulties, particularly
with the "dual stack" designs described later on. A last issue
was that dealing with rejected connections at startup would have
required at least an additional frame type, and more recovery
actions, complicating the protocol. While none of these issues
was overwhelming, the group and implementers were not motivated
to do the work to resolve these issues. The protocol includes a
method of detecting these active/active startup attempts so that
they can be rejected and an error reported.
The ULP is responsible for determining which side is Initiator or
Responder. For client/server type ULPs, this is easy. For peer-peer
ULPs (which might utilize a TCP style active/active startup), some
mechanism (not defined by this specification) must be established, or
some streaming mode data exchanged prior to MPA startup to determine
which side starts in Initiator and which starts in Responder MPA
mode.
7.1.1. MPA Request and Reply Frame Format
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0  |                                                               |
   +         Key (16 bytes containing "MPA ID Req Frame")          +
4  |      (4D 50 41 20 49 44 20 52 65 71 20 46 72 61 6D 65)        |
   +         Or  (16 bytes containing "MPA ID Rep Frame")          +
8  |      (4D 50 41 20 49 44 20 52 65 70 20 46 72 61 6D 65)        |
   +                                                               +
12 |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 |M|C|R| Res     |     Rev       |          PD_Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                                                               ~
   ~                          Private Data                         ~
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                 Figure 8: MPA Request/Reply Frame
Key: This field contains the "key" used to validate that the sender
is an MPA sender. Initiator mode senders MUST set this field to
the fixed value "MPA ID Req Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 71 20 46 72 61 6D 65 (in hexadecimal). Responder
mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other
value is detected. Responder mode senders MUST set this field to
the fixed value "MPA ID Rep Frame" or (in byte order) 4D 50 41 20
49 44 20 52 65 70 20 46 72 61 6D 65 (in hexadecimal). Initiator
mode receivers MUST check this field for the same value, and
close the connection and report an error locally if any other
value is detected.
M: This bit declares an endpoint's REQUIRED Marker usage. When this
bit is '1' in an MPA Request Frame, the Initiator declares that
Markers are REQUIRED in FPDUs sent from the Responder. When set
to '1' in an MPA Reply Frame, this bit declares that Markers are
REQUIRED in FPDUs sent from the Initiator. When the value in a
received MPA Request Frame or MPA Reply Frame is '0', the
receiving endpoint MUST NOT add Markers to the data stream it
sends. When the value is '1', Markers MUST be added as described
in Section 4.3, MPA Markers.
C: This bit declares an endpoint's preferred CRC usage. When this
field is '0' in the MPA Request Frame and the MPA Reply Frame,
CRCs MUST NOT be checked and need not be generated by either
endpoint. When this bit is '1' in either the MPA Request Frame
or MPA Reply Frame, CRCs MUST be generated and checked by both
endpoints. Note that even when not in use, the CRC field remains
present in the FPDU. When CRCs are not in use, the CRC field
MUST be considered valid for FPDU checking regardless of its
contents.
R: This bit is set to zero, and not checked on reception in the MPA
Request Frame. In the MPA Reply Frame, this bit is the Rejected
Connection bit, set by the Responder's ULP to indicate acceptance
'0', or rejection '1', of the connection parameters provided in
the Private Data.
Res: This field is reserved for future use. It MUST be set to zero
when sending, and not checked on reception.
Rev: This field contains the revision of MPA. For this version of
the specification, senders MUST set this field to one. MPA
receivers compliant with this version of the specification MUST
check this field. If the MPA receiver cannot interoperate with
the received version, then it MUST close the connection and
report an error locally. Otherwise, the MPA receiver should
report the received version to the ULP.
PD_Length: This field MUST contain the length in octets of the
Private Data field. A value of zero indicates that there is no
Private Data field present at all. If the receiver detects that
the PD_Length field does not match the length of the Private Data
field, or if the length of the Private Data field exceeds 512
octets, the receiver MUST close the connection and report an
error locally. Otherwise, the MPA receiver should pass the
PD_Length value and Private Data to the ULP.
Private Data: This field may contain any value defined by ULPs or may
not be present. The Private Data field MUST be between 0 and 512
octets in length. ULPs define how to size, set, and validate
this field within these limits. Private Data usage is further
discussed in Section 7.1.4.
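For illustration only, the frame layout in Figure 8 and the field
checks described above could be packed and validated as in the
following Python sketch. The names StartupFrame, build_frame, and
parse_frame are hypothetical and are not defined by this
specification. The flag masks assume that bit 0 in Figure 8 is the
most significant bit of the octet at offset 16, and the sketch assumes
an implementation that can interoperate only with revision 1.

```python
import struct
from dataclasses import dataclass

REQ_KEY = b"MPA ID Req Frame"  # 16 octets, sent by the Initiator
REP_KEY = b"MPA ID Rep Frame"  # 16 octets, sent by the Responder
MPA_REVISION = 1
MAX_PRIVATE_DATA = 512
HEADER = struct.Struct("!16sBBH")  # Key, M/C/R/Res octet, Rev, PD_Length

# Assumed masks: bit 0 of Figure 8 taken as the most significant bit.
FLAG_M, FLAG_C, FLAG_R = 0x80, 0x40, 0x20


@dataclass
class StartupFrame:
    markers: bool
    crc: bool
    rejected: bool
    revision: int
    private_data: bytes


def build_frame(key: bytes, *, markers: bool = False, crc: bool = False,
                rejected: bool = False, private_data: bytes = b"") -> bytes:
    if len(private_data) > MAX_PRIVATE_DATA:
        raise ValueError("Private Data exceeds 512 octets")
    flags = (FLAG_M if markers else 0) | (FLAG_C if crc else 0) \
        | (FLAG_R if rejected else 0)  # Res bits stay zero
    return HEADER.pack(key, flags, MPA_REVISION, len(private_data)) + private_data


def parse_frame(data: bytes, expected_key: bytes) -> StartupFrame:
    """Validate a complete startup frame; ValueError means 'close and report'."""
    if len(data) < HEADER.size:
        raise ValueError("truncated MPA startup frame")
    key, flags, revision, pd_length = HEADER.unpack_from(data)
    if key != expected_key:
        raise ValueError("unexpected Key")
    if revision != MPA_REVISION:
        raise ValueError("cannot interoperate with revision %d" % revision)
    private_data = data[HEADER.size:]
    if pd_length > MAX_PRIVATE_DATA or pd_length != len(private_data):
        raise ValueError("PD_Length does not match Private Data")
    # Res bits are deliberately not checked on reception.
    return StartupFrame(bool(flags & FLAG_M), bool(flags & FLAG_C),
                        bool(flags & FLAG_R), revision, private_data)
```

A Responder would call parse_frame with REQ_KEY and an Initiator with
REP_KEY, so that a frame carrying the wrong Key is treated as an
error in either direction.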
7.1.2. Connection Startup Rules
The following rules apply to MPA connection Startup Phase:
1. When MPA is started in the Initiator mode, the MPA implementation
MUST send a valid MPA Request Frame. The MPA Request Frame MAY
include ULP-supplied Private Data.
2. When MPA is started in the Responder mode, the MPA implementation
MUST wait until an MPA Request Frame is received and validated
before entering Full MPA/DDP Operation.
If the MPA Request Frame is improperly formatted, the
implementation MUST close the TCP connection and exit MPA.
If the MPA Request Frame is properly formatted but the Private
Data is not acceptable, the implementation SHOULD return an MPA
Reply Frame with the Rejected Connection bit set to '1'; the MPA
Reply Frame MAY include ULP-supplied Private Data; the
implementation MUST exit MPA, leaving the TCP connection open.
The ULP may close TCP or use the connection for other purposes.
If the MPA Request Frame is properly formatted and the Private
Data is acceptable, the implementation SHOULD return an MPA Reply
Frame with the Rejected Connection bit set to '0'; the MPA Reply
Frame MAY include ULP-supplied Private Data; and the Responder
SHOULD prepare to interpret any data received as FPDUs and pass
any received ULPDUs to DDP.
Note: Since the receiver's ability to deal with Markers is
unknown until the Request and Reply Frames have been
received, sending FPDUs before this occurs is not possible.
Note: The requirement to wait on a Request Frame before sending a
Reply Frame is a design choice. It makes for a well-ordered
sequence of events at each end, and avoids having to specify
how to deal with situations where both ends start at the same
time.
3. MPA Initiator mode implementations MUST receive and validate an
MPA Reply Frame.
If the MPA Reply Frame is improperly formatted, the
implementation MUST close the TCP connection and exit MPA.
If the MPA Reply Frame is properly formatted but the Private
Data is not acceptable, or if the Rejected Connection bit is set
to '1', the implementation MUST exit MPA, leaving the TCP
connection open. The ULP may close TCP or use the connection for
other purposes.
If the MPA Reply Frame is properly formatted and the Private Data
is acceptable, and the Rejected Connection bit is set to '0', the
implementation SHOULD enter Full MPA/DDP Operation Phase;
interpreting any received data as FPDUs and sending DDP ULPDUs as
FPDUs.
4. MPA Responder mode implementations MUST receive and validate at
least one FPDU before sending any FPDUs or Markers.
Note: This requirement is present to allow the Initiator time to
get its receiver into Full Operation before an FPDU arrives,
avoiding potential race conditions at the Initiator. This
was also subject to some debate in the work group before
rough consensus was reached. Eliminating this requirement
would allow faster startup in some types of applications.
However, that would also make certain implementations
(particularly "dual stack") much harder.
5. If a received "Key" does not match the expected value (see
Section 7.1.1, MPA Request and Reply Frame Format) the TCP/DDP
connection MUST be closed, and an error returned to the ULP.
6. The received Private Data fields may be used by Consumers at
either end to further validate the connection and set up DDP or
other ULP parameters. The Initiator ULP MAY close the
TCP/MPA/DDP connection as a result of validating the Private Data
fields. The Responder SHOULD return an MPA Reply Frame with the
"Rejected Connection" bit set to '1' if the Private Data is not
acceptable to the ULP.
7. When the first FPDU is to be sent, then if Markers are enabled,
the first octets sent are the special Marker 0x00000000, followed
by the start of the FPDU (the FPDU's ULPDU Length field). If
Markers are not enabled, the first octets sent are the start of
the FPDU (the FPDU's ULPDU Length field).
8. MPA implementations MUST use the difference between the MPA
Request Frame and the MPA Reply Frame to check for incorrect
"Initiator/Initiator" startups. Implementations SHOULD put a
timeout on waiting for the MPA Request Frame when started in
Responder mode, to detect incorrect "Responder/Responder"
startups.
9. MPA implementations MUST validate the PD_Length field. The
buffer that receives the Private Data field MUST be large enough
to receive that data; the amount of Private Data MUST NOT exceed
the PD_Length or the application buffer. If any of the above
fails, the startup frame MUST be considered improperly formatted.
10. MPA implementations SHOULD implement a reasonable timeout while
waiting for the entire set of startup frames; this prevents
certain denial-of-service attacks. ULPs SHOULD implement a
reasonable timeout while waiting for FPDUs, ULPDUs, and
application level messages to guard against application failures
and certain denial-of-service attacks.
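The outcome of the startup exchange can be summarized in a short
Python sketch. It derives the Marker and CRC settings from the M and C
bits as described in Section 7.1.1, and maps rule 3 onto the three
possible Initiator outcomes. All names (StartupFlags, Negotiated,
negotiate, initiator_action, and the returned action strings) are
hypothetical and serve only to illustrate the rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StartupFlags:
    markers: bool           # M bit
    crc: bool               # C bit
    rejected: bool = False  # R bit, meaningful in the Reply Frame only


@dataclass(frozen=True)
class Negotiated:
    initiator_sends_markers: bool
    responder_sends_markers: bool
    crc_enabled: bool


def negotiate(request: StartupFlags, reply: StartupFlags) -> Negotiated:
    return Negotiated(
        # M in the Reply Frame asks for Markers in FPDUs from the Initiator.
        initiator_sends_markers=reply.markers,
        # M in the Request Frame asks for Markers in FPDUs from the Responder.
        responder_sends_markers=request.markers,
        # CRCs are used by both endpoints if either side sets C.
        crc_enabled=request.crc or reply.crc,
    )


def initiator_action(reply: Optional[StartupFlags],
                     private_data_acceptable: bool) -> str:
    """Rule 3: reply is None when the Reply Frame was improperly formatted."""
    if reply is None:
        return "close_tcp_and_exit_mpa"
    if reply.rejected or not private_data_acceptable:
        return "exit_mpa_leave_tcp_open"
    return "enter_full_operation"
```

For example, a Request Frame with M set and a Reply Frame with only C
set yields Markers in the Responder-to-Initiator direction only, with
CRCs generated and checked by both endpoints.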
7.1.3. Example Delayed Startup Sequence
A variety of startup sequences are possible when using MPA on TCP.
Following is an example of an MPA/DDP startup that occurs after TCP
has been running for a while and has exchanged some amount of
streaming data. This example does not use any Private Data (an
example that does is shown later in Section 7.1.4.2, Example
Immediate Startup Using Private Data), although it is perfectly legal
to include the Private Data. Note that since the example does not
use any Private Data, there are no ULP interactions shown between
receiving "startup frames" and putting MPA into Full Operation.
          Initiator                               Responder
+---------------------------+
|ULP streaming mode         |
|  <Hello> request to       |
|  transition to DDP/MPA    |           +---------------------------+
|  mode (optional).         | --------> |ULP gets request;          |
+---------------------------+           |  enables MPA Responder    |
                                        |  mode with last (optional)|
                                        |  streaming mode           |
                                        |  <Hello Ack> for MPA to   |
                                        |  send.                    |
+---------------------------+           |MPA waits for incoming     |
|ULP receives streaming     | <-------- |  <MPA Request Frame>.     |
|  <Hello Ack>;             |           +---------------------------+
|Enters MPA Initiator mode; |
|MPA sends                  |
|  <MPA Request Frame>;     |
|MPA waits for incoming     |           +---------------------------+
|  <MPA Reply Frame>.       | - - - - > |MPA receives               |
+---------------------------+           |  <MPA Request Frame>.     |
                                        |Consumer binds DDP to MPA; |
                                        |MPA sends the              |
                                        |  <MPA Reply Frame>.       |
                                        |DDP/MPA enables FPDU       |
+---------------------------+           |  decoding, but does not   |
|MPA receives the           | < - - - - |  send any FPDUs.          |
|  <MPA Reply Frame>        |           +---------------------------+
|Consumer binds DDP to MPA; |
|DDP/MPA begins Full        |
|  Operation.               |
|MPA sends first FPDU (as   |           +---------------------------+
|  DDP ULPDUs become        | ========> |MPA receives first FPDU.   |
|  available).              |           |MPA sends first FPDU (as   |
+---------------------------+           |  DDP ULPDUs become        |
                                <====== |  available).              |
                                        +---------------------------+
             Figure 9: Example Delayed Startup Negotiation
An example Delayed Startup sequence is described below:
* Active and passive sides start up a TCP connection in the
usual fashion, probably using sockets APIs. They exchange
some amount of streaming mode data. At some point, one side
(the MPA Initiator) sends streaming mode data that
effectively says "Hello, let's go into MPA/DDP mode".
* When the remote side (the MPA Responder) gets this streaming mode
message, the Consumer would send a last streaming mode message
that effectively says "I acknowledge your Hello, and am now in
MPA Responder mode". The exchange of these messages establishes
the exact point in the TCP stream where MPA is enabled. The
Responding Consumer enables MPA in the Responder mode and waits
for the initial MPA startup message.
* The Initiating Consumer would enable MPA startup in the
Initiator mode which then sends the MPA Request Frame. It is
assumed that no Private Data messages are needed for this
example, although it is possible to do so. The Initiating
MPA (and Consumer) would also wait for the MPA connection to
be accepted.
* The Responding MPA would receive the initial MPA Request Frame
and would inform the Consumer that this message arrived. The
Consumer can then accept the MPA/DDP connection or close the TCP
connection.
* To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into Full Operation. In the
process of going to Full Operation, MPA sends the MPA Reply
Frame. MPA/DDP waits for the first incoming FPDU before sending
any FPDUs.
* If the initial TCP data was not a properly formatted MPA Request
Frame, MPA will close or reset the TCP connection immediately.
* The Initiating MPA would receive the MPA Reply Frame and
would report this message to the Consumer. The Consumer can
then accept the MPA/DDP connection, or close or reset the TCP
connection to abort the process.
* On determining that the connection is acceptable, the
Initiating Consumer would use an appropriate API to bind the
TCP/MPA connections to a DDP endpoint thus enabling MPA/DDP
into Full Operation. MPA/DDP would begin sending DDP
messages as MPA FPDUs.
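The Responder side of the sequence above can be viewed as a small
state machine. The following Python sketch is one possible way to
model it; the state names, the Responder class, and the returned
action strings are hypothetical. It assumes, as in this example, that
no Private Data is exchanged, so the Consumer either accepts the
connection or closes TCP.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_REQUEST = auto()    # Responder mode enabled, no Request Frame yet
    RECEIVE_ONLY = auto()    # Reply Frame sent; decoding FPDUs, not sending
    FULL_OPERATION = auto()  # first FPDU received; sending is allowed
    EXITED = auto()          # MPA startup abandoned


class Responder:
    def __init__(self) -> None:
        self.state = State.WAIT_REQUEST

    def on_request_frame(self, well_formed: bool, consumer_accepts: bool) -> str:
        """Return the action to take once the initial TCP data has arrived."""
        if self.state is not State.WAIT_REQUEST:
            raise RuntimeError("unexpected MPA Request Frame")
        if not well_formed:
            self.state = State.EXITED
            return "close_tcp"  # not a properly formatted Request Frame
        if not consumer_accepts:
            self.state = State.EXITED
            return "consumer_closes_tcp"
        self.state = State.RECEIVE_ONLY
        return "bind_ddp_and_send_reply_frame"

    def on_fpdu_received(self) -> None:
        if self.state is State.RECEIVE_ONLY:
            self.state = State.FULL_OPERATION  # first incoming FPDU seen
        elif self.state is not State.FULL_OPERATION:
            raise RuntimeError("FPDU received before startup completed")

    def may_send_fpdu(self) -> bool:
        # The Responder waits for the first incoming FPDU before sending any.
        return self.state is State.FULL_OPERATION
```

The RECEIVE_ONLY state captures the interval in which the Responder
has enabled FPDU decoding but is not yet permitted to send FPDUs.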
7.1.4. Use of Private Data
This section is advisory in nature, in that it suggests a method by
which a ULP can deal with pre-DDP connection information exchange.
7.1.4.1. Motivation
Prior RDMA protocols have been developed that provide Private Data
via out-of-band mechanisms. As a result, many applications now
expect some form of Private Data to be available for application use
prior to setting up the DDP/RDMA connection. Following are some
examples of the use of Private Data.
An RDMA endpoint (referred to as a Queue Pair, or QP, in InfiniBand
and the [VERBS-RDMA]) must be associated with a Protection Domain.
No receive operations may be posted to the endpoint before it is
associated with a Protection Domain. Indeed under both the
InfiniBand and proposed RDMA/DDP verbs [VERBS-RDMA] an endpoint/QP is
created within a Protection Domain.
There are some applications where the choice of Protection Domain is
dependent upon the identity of the remote ULP client. For example,
if a user session requires multiple connections, it is highly
desirable for all of those connections to use a single Protection
Domain. Note: Use of Protection Domains is further discussed in
[RDMASEC].
InfiniBand, the DAT APIs [DAT-API], and the IT-API [IT-API] all
provide for the active-side ULP to provide Private Data when
requesting a connection. This data is passed to the ULP to allow it
to determine whether to accept the connection, and if so with which
endpoint (and implicitly which Protection Domain).
The Private Data can also be used to ensure that both ends of the
connection have configured their RDMA endpoints compatibly on such
matters as the RDMA Read capacity (see [RDMAP]). Further ULP-
specific uses are also presumed, such as establishing the identity of
the client.
Private Data may also be supplied when accepting the connection, to
allow completion of any negotiation on RDMA resources and for other
ULP reasons.
There are several potential ways to exchange this Private Data. For
example, the InfiniBand specification includes a connection
management protocol that allows a small amount of Private Data to be
exchanged using datagrams before actually starting the RDMA
connection.
This document allows for small amounts of Private Data to be
exchanged as part of the MPA startup sequence. The actual Private
Data fields are carried in the MPA Request Frame and the MPA Reply
Frame.
If larger amounts of Private Data or more negotiation is necessary,
TCP streaming mode messages may be exchanged prior to enabling MPA.
7.1.4.2. Example Immediate Startup Using Private Data
          Initiator                               Responder
+---------------------------+
|TCP SYN sent.              |           +--------------------------+
+---------------------------+ --------> |TCP gets SYN packet;      |
+---------------------------+           |  sends SYN-Ack.          |
|TCP gets SYN-Ack           | <-------- +--------------------------+
|  sends Ack.               |
+---------------------------+ --------> +--------------------------+
+---------------------------+           |Consumer enables MPA      |
|Consumer enables MPA       |           |Responder mode, waits for |
|Initiator mode with        |           |  <MPA Request frame>.    |
|Private Data; MPA sends    |           +--------------------------+
|  <MPA Request Frame>;     |
|MPA waits for incoming     |           +--------------------------+
|  <MPA Reply Frame>.       | - - - - > |MPA receives              |
+---------------------------+           |  <MPA Request Frame>.    |
                                        |Consumer examines Private |
                                        |Data, provides MPA with   |
                                        |return Private Data,      |
                                        |binds DDP to MPA, and     |
                                        |enables MPA to send an    |
                                        |  <MPA Reply Frame>.      |
                                        |DDP/MPA enables FPDU      |
+---------------------------+           |decoding, but does not    |
|MPA receives the           | < - - - - |send any FPDUs.           |
|  <MPA Reply Frame>.       |           +--------------------------+
|Consumer examines Private  |
|Data, binds DDP to MPA,    |
|and enables DDP/MPA to     |
|begin Full Operation.      |
|MPA sends first FPDU (as   |           +--------------------------+
|DDP ULPDUs become          | ========> |MPA receives first FPDU.  |
|available).                |           |MPA sends first FPDU (as  |
+---------------------------+           |DDP ULPDUs become         |
                                <====== |available).               |
                                        +--------------------------+
            Figure 10: Example Immediate Startup Negotiation
Note: The exact order of when MPA is started in the TCP connection
sequence is implementation dependent; the above diagram shows one
possible sequence. Also, the Initiator "Ack" to the Responder's
"SYN-Ack" may be combined into the same TCP segment containing
the MPA Request Frame (as is allowed by TCP RFCs).
The example immediate startup sequence is described below:
* The passive side (Responding Consumer) would listen on the TCP
destination port, to indicate its readiness to accept a
connection.
* The active side (Initiating Consumer) would request a
connection from a TCP endpoint (that expected to upgrade to
MPA/DDP/RDMA and expected the Private Data) to a destination
address and port.
* The Initiating Consumer would initiate a TCP connection to
the destination port. Acceptance/rejection of the connection
would proceed as per normal TCP connection establishment.
* The passive side (Responding Consumer) would receive the TCP
connection request as usual allowing normal TCP gatekeepers, such
as INETD and TCPserver, to exercise their normal
safeguard/logging functions. On acceptance of the TCP
connection, the Responding Consumer would enable MPA in the
Responder mode and wait for the initial MPA startup message.
* The Initiating Consumer would enable MPA startup in the
Initiator mode to send an initial MPA Request Frame that
includes its Private Data message. The Initiating MPA
(and Consumer) would also wait for the MPA connection to be
accepted, and any returned Private Data.
* The Responding MPA would receive the initial MPA Request Frame
with the Private Data message and would pass the Private Data
through to the Consumer. The Consumer can then accept the
MPA/DDP connection, close the TCP connection, or reject the MPA
connection with a return message.
* To accept the connection request, the Responding Consumer would
use an appropriate API to bind the TCP/MPA connections to a DDP
endpoint, thus enabling MPA/DDP into Full Operation. In the
process of going to Full Operation, MPA sends the MPA Reply
Frame, which includes the Consumer-supplied Private Data
containing any appropriate Consumer response. MPA/DDP waits for
the first incoming FPDU before sending any FPDUs.
* If the initial TCP data was not a properly formatted MPA Request
Frame, MPA will close or reset the TCP connection immediately.
* To reject the MPA connection request, the Responding Consumer
would send an MPA Reply Frame with any ULP-supplied Private Data
(with reason for rejection), with the "Rejected Connection" bit
set to '1', and may close the TCP connection.
* The Initiating MPA would receive the MPA Reply Frame with the
Private Data message and would report this message to the
Consumer, including the supplied Private Data.
If the "Rejected Connection" bit is set to a '1', MPA will
close the TCP connection and exit.
If the "Rejected Connection" bit is set to a '0', and on
determining from the MPA Reply Frame Private Data that the
connection is acceptable, the Initiating Consumer would use
an appropriate API to bind the TCP/MPA connections to a DDP
endpoint thus enabling MPA/DDP into Full Operation. MPA/DDP
would begin sending DDP messages as MPA FPDUs.
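The exchange of Private Data described above can be exercised over an
ordinary stream socket. The following Python sketch is an assumption-
laden illustration rather than an implementation of MPA: it runs both
ends in one process over a local socket pair, always sends M and C as
'0', and uses hypothetical names (recv_exact, send_frame, recv_frame,
responder, run_startup). The consumer argument stands in for the
Responding Consumer's examination of the Private Data and returns an
accept/reject decision plus the Private Data to return.

```python
import socket
import struct

HEADER = struct.Struct("!16sBBH")  # Key, flag bits, Rev, PD_Length
REQ_KEY = b"MPA ID Req Frame"
REP_KEY = b"MPA ID Rep Frame"
FLAG_R = 0x20  # Rejected Connection bit, assuming bit 0 is the MSB


def recv_exact(sock, count):
    """Read exactly count octets from a stream socket."""
    buf = bytearray()
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("peer closed during MPA startup")
        buf += chunk
    return bytes(buf)


def send_frame(sock, key, flags, private_data=b""):
    sock.sendall(HEADER.pack(key, flags, 1, len(private_data)) + private_data)


def recv_frame(sock, expected_key):
    key, flags, rev, pd_length = HEADER.unpack(recv_exact(sock, HEADER.size))
    if key != expected_key or rev != 1 or pd_length > 512:
        raise ValueError("improperly formatted MPA startup frame")
    return flags, recv_exact(sock, pd_length)


def responder(sock, consumer):
    """Receive the Request Frame, ask the Consumer, send the Reply Frame."""
    _, request_pd = recv_frame(sock, REQ_KEY)
    accept, reply_pd = consumer(request_pd)
    send_frame(sock, REP_KEY, 0 if accept else FLAG_R, reply_pd)
    return accept


def run_startup(request_pd, consumer):
    """Drive both ends locally; returns (accepted, reply Private Data)."""
    initiator_sock, responder_sock = socket.socketpair()
    with initiator_sock, responder_sock:
        for s in (initiator_sock, responder_sock):
            s.settimeout(5.0)  # startup timeout, in the spirit of rule 10
        send_frame(initiator_sock, REQ_KEY, 0, request_pd)
        responder(responder_sock, consumer)
        flags, reply_pd = recv_frame(initiator_sock, REP_KEY)
        return not (flags & FLAG_R), reply_pd
```

A call such as run_startup(b"session=42", lambda pd: (pd ==
b"session=42", b"welcome")) models an accepted connection; a Consumer
that returns False models a rejection whose reason travels back in
the Reply Frame's Private Data.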
7.1.5. "Dual Stack" Implementations
MPA/DDP is commonly expected to be implemented as
part of a "dual stack" architecture. One stack is the traditional
TCP stack, usually with a sockets interface API (Application
Programming Interface). The second stack is the MPA/DDP stack with
its own API, and potentially separate code or hardware to deal with
the MPA/DDP data. Of course, implementations may vary, so the
following comments are of an advisory nature only.
The use of the two stacks offers advantages:
TCP connection setup is usually done with the TCP stack. This
allows use of the usual naming and addressing mechanisms. It
also means that any mechanisms used to "harden" the connection
setup against security threats are also used when starting
MPA/DDP.
Some applications may have been originally designed for TCP, but
are "enhanced" to utilize MPA/DDP after a negotiation reveals the
capability to do so. The negotiation process takes place in
TCP's streaming mode, using the usual TCP APIs.
Some new applications, designed for RDMA or DDP, still need to
exchange some data prior to starting MPA/DDP. This exchange can
be of arbitrary length or complexity, but often consists of only
a small amount of Private Data, perhaps only a single message.
Using the TCP streaming mode for this exchange allows this to be
done using well-understood methods.
The main disadvantage of using two stacks is the conversion of an
active TCP connection between them. This process must be done with
care to prevent loss of data.
To avoid some of the problems when using a "dual stack" architecture,
the following additional restrictions may be required by the
implementation:
1. Enabling the DDP/MPA stack SHOULD be done only when no incoming
stream data is expected. This is typically managed by the ULP
protocol. When following the recommended startup sequence, the
Responder side enters DDP/MPA mode, sends the last streaming mode
data, and then waits for the MPA Request Frame. No additional
streaming mode data is expected. The Initiator side ULP receives
the last streaming mode data, and then enters DDP/MPA mode.
Again, no additional streaming mode data is expected.
2. The DDP/MPA MAY provide the ability to send a "last streaming
message" as part of its Responder DDP/MPA enable function. This
allows the DDP/MPA stack to more easily manage the conversion to
DDP/MPA mode (and avoid problems with a very fast return of the
MPA Request Frame from the Initiator side).
Note: Regardless of the "stack" architecture used, TCP's rules MUST
be followed. For example, if network data is lost, re-segmented,
or re-ordered, TCP MUST recover appropriately even when this
occurs while switching stacks.
7.2. Normal Connection Teardown
Each half connection of MPA terminates when DDP closes the
corresponding TCP half connection.
A mechanism SHOULD be provided by MPA to DDP for DDP to be made aware
that a graceful close of the TCP connection has been received by the
TCP (e.g., FIN is received).
8. Error Semantics
The following errors MUST be detected by MPA and the codes SHOULD be
provided to DDP or other Consumer:
Code  Error
1     TCP connection closed, terminated, or lost. This includes lost
      by timeout, too many retries, RST received, or FIN received.
2     Received MPA CRC does not match the calculated value for the
      FPDU.
3     In the event that the CRC is valid, received MPA Marker (if
      enabled) and ULPDU Length fields do not agree on the start of
      an FPDU. If the FPDU start determined from previous ULPDU
      Length fields does not match with the MPA Marker position, MPA
      SHOULD deliver an error to DDP. It may not be possible to make
      this check as a segment arrives, but the check SHOULD be made
      when a gap creating an out-of-order sequence is closed and any
      time a Marker points to an already identified FPDU. It is
      OPTIONAL for a receiver to check each Marker, if multiple
      Markers are present in an FPDU, or if the segment is received
      in order.
4     Invalid MPA Request Frame or MPA Reply Frame received. In
      this case, the TCP connection MUST be immediately closed. DDP
      and other ULPs should treat this similarly to code 1, above.
When conditions 2 or 3 above are detected, an optimized MPA/TCP
implementation MAY choose to silently drop the TCP segment rather
than reporting the error to DDP. In this case, the sending TCP will
retry the segment, usually correcting the error, unless the problem
was at the source. In that case, the source will usually exceed the
number of retries and terminate the connection.
Once MPA delivers an error of any type, it MUST NOT pass or deliver
any additional FPDUs on that half connection.
For Error codes 2 and 3, MPA MUST NOT close the TCP connection
following a reported error. Closing the connection is the
responsibility of DDP's ULP.
Note that since MPA will not Deliver any FPDUs on a half
connection following an error detected on the receive side of
that connection, DDP's ULP is expected to tear down the
connection. This may not occur until after one or more last
messages are transmitted on the opposite half connection. This
allows a diagnostic error message to be sent.
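The error rules of this section can be sketched in Python as follows.
The numeric codes come from the table above; the member names, the
ReceiveHalf class, and its three callbacks are hypothetical. The
sketch reports only the first error on a half connection, which is an
assumption of this example rather than a requirement stated here.

```python
from enum import IntEnum
from typing import Callable


class MpaError(IntEnum):
    TCP_CONNECTION_LOST = 1
    CRC_MISMATCH = 2
    MARKER_LENGTH_MISMATCH = 3
    INVALID_STARTUP_FRAME = 4


class ReceiveHalf:
    """Receive side of one MPA half connection."""

    def __init__(self,
                 deliver: Callable[[bytes], None],
                 report: Callable[[MpaError], None],
                 close_tcp: Callable[[], None]) -> None:
        self._deliver = deliver
        self._report = report
        self._close_tcp = close_tcp
        self.failed = False

    def on_ulpdu(self, ulpdu: bytes) -> bool:
        """Deliver a validated ULPDU unless an error was already delivered."""
        if self.failed:
            return False  # MUST NOT deliver after any reported error
        self._deliver(ulpdu)
        return True

    def on_error(self, code: MpaError) -> None:
        if self.failed:
            return  # assumption of this sketch: report the first error only
        self.failed = True
        self._report(code)
        if code == MpaError.INVALID_STARTUP_FRAME:
            self._close_tcp()  # code 4: close the TCP connection at once
        # Codes 2 and 3: MPA leaves TCP open; DDP's ULP closes it later.
```

Leaving the TCP connection open for codes 2 and 3 is what allows
DDP's ULP to send a diagnostic message on the opposite half connection
before tearing the connection down.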
9. Security Considerations
This section discusses the security considerations for MPA.
9.1. Protocol-Specific Security Considerations
The vulnerabilities of MPA to third-party attacks are no greater than
any other protocol running over TCP. A third party, by sending
packets into the network that are delivered to an MPA receiver, could
launch a variety of attacks that take advantage of how MPA operates.
For example, a third party could send random packets that are valid
for TCP, but contain no FPDU headers. An MPA receiver reports an
error to DDP when any packet arrives that cannot be validated as an
FPDU when properly located on an FPDU boundary. A third party could
also send packets that are valid for TCP, MPA, and DDP, but do not
target valid buffers. These types of attacks ultimately result in
loss of connection and thus become a type of DoS (Denial of Service)
attack. Communication security mechanisms such as IPsec [RFC2401,
RFC4301] may be used to prevent such attacks.
Independent of how MPA operates, a third party could use ICMP
messages to reduce the path MTU to such a small size that performance
would likewise be severely impacted. Range checking on path MTU
sizes in ICMP packets may be used to prevent such attacks.
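The range check mentioned above can be illustrated with a minimal sketch. The function name, the floor value of 576 octets, and the policy of clamping rather than discarding are assumptions made for this example; they are not requirements of this specification.

```python
def accept_icmp_mtu(reported_mtu: int, current_pmtu: int, floor: int = 576) -> int:
    """Return the path MTU to use after an ICMP 'fragmentation needed' report.

    `floor` is a hypothetical local policy value, not taken from this document.
    """
    # A report may only lower the path MTU; ignore anything that would raise it.
    if reported_mtu >= current_pmtu:
        return current_pmtu
    # Clamp implausibly small values to the policy floor so that a forged
    # ICMP message cannot shrink segments arbitrarily.
    if reported_mtu < floor:
        return min(current_pmtu, floor)
    return reported_mtu
```

A floor that is set too high can black-hole traffic on a path whose true MTU is smaller, so the value is a trade-off between robustness against forged reports and reachability.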
[RDMAP] and [DDP] are used to control, read, and write data buffers
over IP networks. Therefore, the control and the data packets of
these protocols are vulnerable to the spoofing, tampering, and
information disclosure attacks listed below. In addition, connection
to/from an unauthorized or unauthenticated endpoint is a potential
problem with most applications using RDMA, DDP, and MPA.
9.1.1. Spoofing
Spoofing attacks can be launched by the Remote Peer or by a
network-based attacker. A network-based spoofing attack applies to all
Remote Peers. Because the MPA Stream requires a TCP Stream in the
ESTABLISHED state, certain traditional forms of wire attack do not
apply -- an end-to-end handshake must have occurred to establish the
MPA Stream. So, the only form of spoofing that applies is one in
which a remote node can both send and receive packets. Yet even with
this limitation, the Stream is still exposed to the following
spoofing attacks.
9.1.1.1. Impersonation
A network-based attacker can impersonate a legal MPA/DDP/RDMAP peer
(by spoofing a legal IP address) and establish an MPA/DDP/RDMAP
Stream with the victim. End-to-end authentication (i.e., IPsec or
ULP authentication) provides protection against this attack.
9.1.1.2. Stream Hijacking
Stream hijacking happens when a network-based attacker follows the
Stream establishment phase, and waits until the authentication phase
(if such a phase exists) is completed successfully. The attacker can
then spoof the IP address and redirect the Stream from the victim to
the attacker's own machine. For example, an attacker can wait until an iSCSI
authentication is completed successfully, and hijack the iSCSI
Stream.
The best protection against this form of attack is end-to-end
integrity protection and authentication, such as IPsec, to prevent
spoofing. Another option is to provide physical security.
Discussion of physical security is out of scope for this document.
9.1.1.3. Man-in-the-Middle Attack
If a network-based attacker has the ability to delete, inject,
replay, or modify packets that will still be accepted by MPA (e.g.,
TCP sequence number is correct, FPDU is valid, etc.), then the Stream
can be exposed to a man-in-the-middle attack. The attacker could
potentially use the services of [DDP] and [RDMAP] to read the
contents of the associated Data Buffer, to modify the contents of the
associated Data Buffer, or to disable further access to the buffer.
Other attacks on the connection setup sequence and even on TCP can be
used to cause denial of service. The only countermeasure for this
form of attack is either to secure the MPA/DDP/RDMAP Stream (i.e.,
integrity protect it) or to attempt to provide physical security to
prevent man-in-the-middle attacks.
The best protection against this form of attack is end-to-end
integrity protection and authentication, such as IPsec, to prevent
spoofing or tampering. If Stream or session level authentication and
integrity protection are not used, then a man-in-the-middle attack
can occur, enabling spoofing and tampering.
Another approach is to restrict access to only the local subnet/link
and provide some mechanism to limit access, such as physical security
or IEEE 802.1X. This model is an extremely limited deployment scenario
and will not be further examined here.
9.1.2. Eavesdropping
Generally speaking, Stream confidentiality protects against
eavesdropping. Stream and/or session authentication and integrity
protection are countermeasures against various spoofing and
tampering attacks. The effectiveness of authentication and integrity
protection against a specific attack depends on whether the
authentication is machine-level authentication (such as that provided
by IPsec) or ULP authentication.
9.2. Introduction to Security Options
The following security services can be applied to an MPA/DDP/RDMAP
Stream:
1. Session confidentiality - protects against eavesdropping.
2. Per-packet data source authentication - protects against the
following spoofing attacks: network-based impersonation, Stream
hijacking, and man-in-the-middle.
3. Per-packet integrity - protects against tampering done by
network-based modification of FPDUs (indirectly affecting buffer
content through DDP services).
4. Packet sequencing - protects against replay attacks, which is a
special case of the above tampering attack.
If an MPA/DDP/RDMAP Stream may be subject to impersonation attacks,
or Stream hijacking attacks, it is recommended that the Stream be
authenticated, integrity protected, and protected from replay
attacks. It may use confidentiality protection to protect from
eavesdropping (in case the MPA/DDP/RDMAP Stream traverses a public
network).
IPsec is capable of providing the above security services for IP and
TCP traffic.
ULPs may be able to provide part of the above security
services. See [NFSv4CHAN] for additional information on a promising
approach called "channel binding". From [NFSv4CHAN]:
"The concept of channel bindings allows applications to prove
that the end-points of two secure channels at different network
layers are the same by binding authentication at one channel to
the session protection at the other channel. The use of channel
bindings allows applications to delegate session protection to
lower layers, which may significantly improve performance for
some applications."
9.3. Using IPsec with MPA
IPsec can be used to protect against the packet injection attacks
outlined above. Because IPsec is designed to secure individual IP
packets, MPA can run above IPsec without change. IPsec packets are
processed (e.g., integrity checked and decrypted) in the order they
are received, and an MPA receiver will process the decrypted FPDUs
contained in these packets in the same manner as FPDUs contained in
unsecured IP packets.
MPA implementations MUST implement IPsec as described in Section 9.4
below. The use of IPsec is up to ULPs and administrators.
9.4. Requirements for IPsec Encapsulation of MPA/DDP
The IP Storage working group has spent significant time and effort to
define the normative IPsec requirements for IP storage [RFC3723].
Portions of that specification are applicable to a wide variety of
protocols, including the RDDP protocol suite. In order not to
replicate this effort, an MPA on TCP implementation MUST follow the
requirements defined in RFC 3723, Sections 2.3 and 5, including the
associated normative references for those sections.
Additionally, since IPsec acceleration hardware may only be able to
handle a limited number of active Internet Key Exchange Protocol
(IKE) Phase 2 security associations (SAs), Phase 2 delete messages
MAY be sent for idle SAs, as a means of keeping the number of active
Phase 2 SAs to a minimum. The receipt of an IKE Phase 2 delete
message MUST NOT be interpreted as a reason for tearing down a
DDP/RDMA Stream. Rather, it is preferable to leave the Stream up,
and if additional traffic is sent on it, to bring up another IKE
Phase 2 SA to protect it. This avoids the potential for continually
bringing Streams up and down.
The IPsec requirements for RDDP are based on the version of IPsec
specified in RFC 2401 [RFC2401] and related RFCs, as profiled by RFC
3723 [RFC3723], despite the existence of a newer version of IPsec
specified in RFC 4301 [RFC4301] and related RFCs. One of the
important early applications of the RDDP protocols is their use with
iSCSI [iSER]; RDDP's IPsec requirements follow those of iSCSI in
order to facilitate that usage by allowing a common profile of IPsec
to be used with iSCSI and the RDDP protocols. In the future, RFC
3723 may be updated to the newer version of IPsec; the IPsec security
requirements of any such update should apply uniformly to iSCSI and
the RDDP protocols.
Note that there are serious security issues if IPsec is not
implemented end-to-end. For example, if IPsec is implemented as a
tunnel in the middle of the network, any hosts between the peer and
the IPsec tunneling device can freely attack the unprotected Stream.
10. IANA Considerations
No IANA actions are required by this document.
If a well-known port is chosen as the mechanism to identify a DDP on
MPA on TCP, the well-known port must be registered with IANA.
Because the use of the port is DDP specific, registration of the port
with IANA is left to DDP.
Appendix A. Optimized MPA-Aware TCP Implementations
This appendix is for information only and is NOT part of the
standard.
This appendix offers guidance to implementers of Optimized MPA-aware
TCP. It is intended for those implementations that want to
send/receive as much traffic as possible in an aligned and zero-copy
fashion.
+-----------------------------------+
| +-----------+ +-----------------+ |
| | Optimized | | Other Protocols | |
| | MPA/TCP | +-----------------+ |
| +-----------+ || |
| \\ --- socket API --- |
| \\ || |
| \\ +-----+ |
| \\ | TCP | |
| \\ +-----+ |
| \\ // |
| +-------+ |
| | IP | |
| +-------+ |
+-----------------------------------+
Figure 11: Optimized MPA/TCP Implementation
Figure 11 shows a block diagram of a potential
implementation. The network sub-system in the diagram can support
traditional sockets-based connections using the normal API as shown
on the right side of the diagram. Connections for DDP/MPA/TCP are
run using the facilities shown on the left side of the diagram.
The DDP/MPA/TCP connections can be started using the facilities shown
on the left side using some suitable API, or they can be initiated
using the facilities shown on the right side and transitioned to the
left side at the point in the connection setup where MPA goes to
"Full MPA/DDP Operation Phase" as described in Section 7.1.2.
The optimized MPA/TCP implementations (left side of diagram and
described below) are only applicable to MPA. All other TCP
applications continue to use the standard TCP stacks and interfaces
shown in the right side of the diagram.
A.1. Optimized MPA/TCP Transmitters
The various TCP RFCs allow considerable choice in segmenting a TCP
stream. In order to optimize FPDU recovery at the MPA receiver, an
optimized MPA/TCP implementation uses additional segmentation rules.
To provide optimum performance, an optimized MPA/TCP transmit side
implementation should be enabled to:
* With an EMSS large enough to contain the FPDU(s), segment the
outgoing TCP stream such that the first octet of every TCP
segment begins with an FPDU. Multiple FPDUs may be packed into a
single TCP segment as long as they are entirely contained in the
TCP segment.
* Report the current EMSS from the TCP to the MPA transmit layer.
There are exceptions to the above rule. Once a ULPDU is provided to
MPA, the MPA/TCP sender transmits it or fails the connection; it
cannot be repudiated. As a result, during changes in MTU and EMSS,
or when TCP's Receive Window size (RWIN) becomes too small, it may be
necessary to send FPDUs that do not conform to the segmentation rule
above.
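The segmentation rule for the normal case can be sketched as follows. This is an illustrative model only: the function name and the representation of FPDUs as a list of lengths are assumptions, and the exception cases described above (MTU or EMSS changes, a small RWIN) are deliberately not modeled.

```python
from typing import List

def pack_fpdus(fpdu_lengths: List[int], emss: int) -> List[List[int]]:
    """Group FPDU lengths into TCP segments so that every segment starts
    with an FPDU and contains only whole FPDUs."""
    segments: List[List[int]] = []
    current: List[int] = []
    used = 0
    for length in fpdu_lengths:
        if length > emss:
            # Alignment cannot be preserved; outside the scope of this sketch.
            raise ValueError("FPDU larger than EMSS")
        if used + length > emss:
            # The next FPDU would straddle a segment boundary: close the segment.
            segments.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        segments.append(current)
    return segments
```

For example, with an EMSS of 1460, FPDUs of 1000, 400, 200, and 1460 octets are grouped as [1000, 400], [200], and [1460]. The middle segment is "short" because the following FPDU would not fit in the remaining space.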
A possible, but less desirable, alternative is to use IP
fragmentation on accepted FPDUs to deal with MTU reductions or
extremely small EMSS.
Even when alignment with TCP segments is lost, the sender still
formats the FPDU according to FPDU format as shown in Figure 2.
On a retransmission, TCP does not necessarily preserve original TCP
segmentation boundaries. This can lead to the loss of FPDU Alignment
and containment within a TCP segment during TCP retransmissions. An
optimized MPA/TCP sender should try to preserve original TCP
segmentation boundaries on a retransmission.
A.2. Effects of Optimized MPA/TCP Segmentation
Optimized MPA/TCP senders will fill TCP segments to the EMSS with a
single FPDU when a DDP message is large enough. Since the DDP
message may not exactly fit into TCP segments, a "message tail" often
occurs that results in an FPDU that is smaller than a single TCP
segment. Additionally, some DDP messages may be considerably shorter
than the EMSS. If a small FPDU is sent in a single TCP segment, the
result is a "short" TCP segment.
Applications expected to see strong advantages from Direct Data
Placement include transaction-based applications and throughput
applications. Request/response protocols typically send one FPDU per
TCP segment and then wait for a response. Under these conditions,
these "short" TCP segments are an appropriate and expected effect of
the segmentation.
Another possibility is that the application might be sending multiple
messages (FPDUs) to the same endpoint before waiting for a response.
In this case, the segmentation policy would tend to reduce the
available connection bandwidth by under-filling the TCP segments.
Standard TCP implementations often utilize the Nagle [RFC896]
algorithm to ensure that segments are filled to the EMSS whenever the
round-trip latency is large enough that the source stream can fully
fill segments before ACKs arrive. The algorithm does this by
delaying the transmission of TCP segments until a ULP can fill a
segment, or until an ACK arrives from the far side. The algorithm
thus allows for smaller segments when latencies are shorter to keep
the ULP's end-to-end latency to reasonable levels.
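The decision that the Nagle algorithm makes can be summarized in a simplified sketch. The function and parameter names are assumptions chosen for illustration, and real TCP stacks apply additional conditions (for example, window limits) that are omitted here.

```python
def nagle_can_send(queued_octets: int, emss: int, unacked_octets: int) -> bool:
    """Simplified Nagle test: a full segment is sent at once; a partial
    segment is sent only when no earlier data is still awaiting an ACK."""
    if queued_octets <= 0:
        return False          # nothing to send
    if queued_octets >= emss:
        return True           # a full segment is available
    return unacked_octets == 0  # otherwise wait for the outstanding ACK
```

With a short round-trip time, ACKs arrive quickly and small segments go out with little delay; with a long round-trip time, the sender has time to accumulate a full segment.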
Use of the Nagle algorithm is not mandatory [RFC1122].
When used with optimized MPA/TCP stacks, Nagle and similar algorithms
can result in the "packing" of multiple FPDUs into TCP segments.
If a "message tail", small DDP messages, or the start of a larger DDP
message are available, MPA may pack multiple FPDUs into TCP segments.
When this is done, the TCP segments can be more fully utilized, but,
due to the size constraints of FPDUs, segments may not be filled to
the EMSS. A dynamic MULPDU that informs DDP of the size of the
remaining TCP segment space makes filling the TCP segment more
effective.
Note that MPA receivers do more processing of a TCP segment that
contains multiple FPDUs; this may affect the performance of some
receiver implementations.
It is up to the ULP to decide if Nagle is useful with DDP/MPA. Note
that many of the applications expected to take advantage of MPA/DDP
prefer to avoid the extra delays caused by Nagle. In such scenarios,
it is anticipated there will be minimal opportunity for packing at
the transmitter and receivers may choose to optimize their
performance for this anticipated behavior.
Therefore, the application is expected to set TCP parameters such
that it can trade off latency and wire efficiency. Implementations
should provide a connection option that disables Nagle for MPA/TCP
similar to the way the TCP_NODELAY socket option is provided for a
traditional sockets interface.
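For comparison, the following sketch shows how Nagle is disabled on a traditional sockets interface using the TCP_NODELAY option. The helper name is an assumption; an optimized MPA/TCP implementation would expose an analogous option through its own interface, which this document does not define.

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the stack to send small segments without delay.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```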
When latency is not critical, the application is expected to leave
Nagle enabled. In this case, the TCP implementation may pack any
available FPDUs into TCP segments so that the segments are filled to
the EMSS.
If the amount of data available is not enough to fill the TCP segment
when it is prepared for transmission, TCP can send the segment partly
filled, or use the Nagle algorithm to wait for the ULP to post more
data.
A.3. Optimized MPA/TCP Receivers
When an MPA receive implementation and the MPA-aware receive side TCP
implementation support handling out-of-order ULPDUs, the TCP receive
implementation performs the following functions:
1) The implementation passes incoming TCP segments to MPA as soon as
they have been received and validated, even if not received in
order. The TCP layer commits to keeping each segment before it
can be passed to the MPA. This means that the segment must have
passed the TCP, IP, and lower layer data integrity validation
(i.e., checksum), must be in the receive window, must be part of
the same epoch (if timestamps are used to verify this), and must
have passed any other checks required by TCP RFCs.
This is not to imply that the data must be completely ordered
before use. An implementation can accept out-of-order segments,
SACK them [RFC2018], and pass them to MPA immediately, before the
reception of the segments needed to fill in the gaps. MPA
expects to utilize these segments when they are complete FPDUs or
can be combined into complete FPDUs to allow the passing of
ULPDUs to DDP when they arrive, independent of ordering. DDP
uses the passed ULPDU to "place" the DDP segments (see [DDP] for
more details).
Since MPA performs a CRC calculation and other checks on received
FPDUs, the MPA/TCP implementation ensures that any TCP segments
that duplicate data already received and processed (as can happen
during TCP retries) do not overwrite already received and
processed FPDUs. This avoids the possibility that duplicate data
may corrupt already validated FPDUs.
2) The implementation provides a mechanism to indicate the ordering
of TCP segments as the sender transmitted them. One possible
mechanism might be attaching the TCP sequence number to each
segment.
3) The implementation also provides a mechanism to indicate when a
given TCP segment (and the prior TCP stream) is complete. One
possible mechanism might be to utilize the leading (left) edge of
the TCP Receive Window.
MPA uses the ordering and completion indications to inform DDP
when a ULPDU is complete; MPA Delivers the FPDU to DDP. DDP uses
the indications to "deliver" its messages to the DDP consumer
(see [DDP] for more details).
DDP on MPA utilizes the above two mechanisms to establish the
Delivery semantics that DDP's consumers agree to. These
semantics are described fully in [DDP]. These include
requirements on DDP's consumer to respect ownership of buffers
prior to the time that DDP delivers them to the Consumer.
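The three functions above can be illustrated with a small model. It is a sketch under stated assumptions: the class and method names are hypothetical, segments are identified by their starting TCP sequence number, and sequence-number wraparound and partially overlapping segments are ignored for brevity.

```python
from typing import Dict, List

class OutOfOrderTracker:
    """Tracks validated TCP segments keyed by starting sequence number."""

    def __init__(self, initial_seq: int) -> None:
        self.left_edge = initial_seq        # next in-order octet expected
        self.pending: Dict[int, int] = {}   # start seq -> length, passed up but not yet complete

    def on_segment(self, seq: int, length: int) -> List[int]:
        """Record a segment; return the start seqs that became complete, in order."""
        # Ignore duplicates and data below the left edge so that a TCP retry
        # cannot overwrite data that has already been processed.
        if seq < self.left_edge or seq in self.pending:
            return []
        self.pending[seq] = length          # segment is passed to MPA immediately
        completed: List[int] = []
        # Advance the left edge over every contiguous segment now available.
        while self.left_edge in self.pending:
            start = self.left_edge
            self.left_edge += self.pending.pop(start)
            completed.append(start)
        return completed
```

In this model, a segment arriving ahead of a gap is recorded (and could be placed) at once, but it is reported as complete only when the left edge passes it, which mirrors the distinction between passing a ULPDU to DDP and Delivering it.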
The use of SACK [RFC2018] significantly improves network utilization
and performance and is therefore recommended. When combined with the
out-of-order passing of segments to MPA and DDP, significant
buffering and copying of received data can be avoided.
A.4. Re-Segmenting Middleboxes and Non-Optimized MPA/TCP Senders
Since MPA senders often start FPDUs on TCP segment boundaries, a
receiving optimized MPA/TCP implementation may be able to optimize
the reception of data in various ways.
However, MPA receivers MUST NOT depend on FPDU Alignment on TCP
segment boundaries.
Some MPA senders may be unable to conform to the sender requirements
because their implementation of TCP is not designed with MPA in mind.
Even for optimized MPA/TCP senders, the network may contain
"middleboxes" which modify the TCP stream by changing the
segmentation. This is generally interoperable with TCP and its users
and MPA must be no exception.
The presence of Markers in MPA (when enabled) allows an optimized
MPA/TCP receiver to recover the FPDUs despite these obstacles,
although it may be necessary to utilize additional buffering at the
receiver to do so.
Some of the cases that a receiver may have to contend with are listed
below as a reminder to the implementer:
* A single aligned and complete FPDU, either in order or out of
order: This can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Multiple FPDUs in a TCP segment, aligned and fully contained,
either in order or out of order: These can be passed to DDP as
soon as validated, and Delivered when ordering is established.
* Incomplete FPDU: The receiver should buffer until the remainder
of the FPDU arrives. If the remainder of the FPDU is already
available, this can be passed to DDP as soon as validated, and
Delivered when ordering is established.
* Unaligned FPDU start: The partial FPDU must be combined with its
preceding portion(s). If the preceding parts are already
available, and the whole FPDU is present, this can be passed to
DDP as soon as validated, and Delivered when ordering is
established. If the whole FPDU is not available, the receiver
should buffer until the remainder of the FPDU arrives.
* Combinations of unaligned or incomplete FPDUs (and potentially
other complete FPDUs) in the same TCP segment: If any FPDU is
present in its entirety, or can be completed with portions
already available, it can be passed to DDP as soon as validated,
and Delivered when ordering is established.
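The cases listed above can be expressed as a classification of one TCP segment against known FPDU boundaries. This sketch assumes the boundaries are already known as (stream offset, length) pairs; a real receiver has to discover them from the ULPDU_Length field and, when enabled, from Markers. All names are hypothetical.

```python
from typing import List, Tuple

def classify_segment(fpdus: List[Tuple[int, int]], seg_start: int, seg_len: int):
    """Return (aligned, whole, partial) for one TCP segment.

    aligned: True if the segment begins on an FPDU boundary.
    whole:   start offsets of FPDUs entirely contained in the segment.
    partial: start offsets of FPDUs only partly contained in the segment.
    """
    seg_end = seg_start + seg_len
    aligned = any(start == seg_start for start, _ in fpdus)
    whole, partial = [], []
    for start, length in fpdus:
        end = start + length
        if end <= seg_start or start >= seg_end:
            continue  # this FPDU does not overlap the segment
        if start >= seg_start and end <= seg_end:
            whole.append(start)    # can be passed to DDP once validated
        else:
            partial.append(start)  # must be buffered until the rest arrives
    return aligned, whole, partial
```

A segment covering offsets 500 to 2199 of a stream whose FPDUs start at 0, 1000, and 2000 is unaligned, contains the FPDU at 1000 in its entirety, and contains only parts of the FPDUs at 0 and 2000.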
A.5. Receiver Implementation
Transport & Network Layer Reassembly Buffers:
The use of reassembly buffers (either TCP reassembly buffers or IP
fragmentation reassembly buffers) is implementation dependent. When
MPA is enabled, reassembly buffers are needed if out-of-order packets
arrive and Markers are not enabled. Buffers are also needed if FPDU
alignment is lost or if IP fragmentation occurs. This is because the
incoming out-of-order segment may not contain enough information for
MPA to process all of the FPDU. For cases where a re-segmenting
middlebox is present, or where the TCP sender is not optimized, the
presence of Markers significantly reduces the amount of buffering
needed.
Recovery from IP fragmentation is transparent to the MPA Consumers.
A.5.1. Network Layer Reassembly Buffers
The MPA/TCP implementation should set the IP Don't Fragment bit at
the IP layer. Thus, upon a path MTU change, intermediate devices
drop the IP datagram if it is too large and reply with an ICMP
message that tells the source TCP that the path MTU has changed.
This causes TCP to emit segments conformant with the new path MTU
size. Thus, under most conditions, IP fragments should not occur at
the receiver, although this remains possible.
There are several options for implementation of network layer
reassembly buffers:
1. drop any IP fragments, and reply with an ICMP message according
to [RFC792] (fragmentation needed and DF set) to tell the Remote
Peer to resize its TCP segment.
2. support an IP reassembly buffer, but have it of limited size
(possibly the same size as the local link's MTU). The end node
would normally never Advertise a path MTU larger than the local
link MTU. It is recommended that a dropped IP fragment cause an
ICMP message to be generated according to RFC 792.
3. support multiple IP reassembly buffers, of effectively unlimited size.
4. support an IP reassembly buffer for the largest IP datagram (64
KB).
5. support a large IP reassembly buffer that could span multiple
IP datagrams.
An implementation should support at least 2 or 3 above, to avoid
dropping packets that have traversed the entire fabric.
There is no end-to-end ACK for IP reassembly buffers, so there is no
flow control on the buffer. The only end-to-end ACK is a TCP ACK,
which can only occur when a complete IP datagram is delivered to TCP.
Because of this, under worst case, pathological scenarios, the
largest IP reassembly buffer is the TCP receive window (to buffer
multiple IP datagrams that have all been fragmented).
Note that if the Remote Peer does not implement re-segmentation of
the data stream upon receiving the ICMP reply updating the path MTU,
it is possible to halt forward progress because the opposite peer
would continue to retransmit using a transport segment size that is
too large. This deadlock scenario is no different from one in which
the fabric MTU (not last-hop MTU) is reduced after connection setup,
and the remote node's behavior is not compliant with [RFC1122].
A.5.2. TCP Reassembly Buffers
A TCP reassembly buffer is also needed. TCP reassembly buffers are
needed if FPDU Alignment is lost when using TCP with MPA or when the
MPA FPDU spans multiple TCP segments. Buffers are also needed if
Markers are disabled and out-of-order packets arrive.
Since lost FPDU Alignment often means that FPDUs are incomplete, an
MPA on TCP implementation must have a reassembly buffer large enough
to recover an FPDU that is less than or equal to the MTU of the
locally attached link (this should be the largest possible Advertised
TCP path MTU). If the MTU is smaller than 140 octets, a buffer of at
least 140 octets long is needed to support the minimum FPDU size.
The 140 octets allow for the minimum MULPDU of 128, 2 octets of pad,
2 of ULPDU_Length, 4 of CRC, and space for a possible Marker. As
usual, additional buffering is likely to provide better performance.
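The 140-octet figure can be checked with a short calculation. The constant names and the helper function are assumptions for illustration; the 4-octet Marker allowance is inferred from the total given above.

```python
MIN_MULPDU = 128        # minimum MULPDU, in octets
PAD = 2                 # pad octets
ULPDU_LENGTH_FIELD = 2  # ULPDU_Length field
CRC = 4                 # CRC field
MARKER = 4              # space for one possible Marker

MIN_FPDU_BUFFER = MIN_MULPDU + PAD + ULPDU_LENGTH_FIELD + CRC + MARKER  # 140

def min_reassembly_buffer(link_mtu: int) -> int:
    """Smallest TCP reassembly buffer, in octets, for a given local link MTU."""
    # The buffer must hold an FPDU up to the link MTU, but never less than
    # the minimum FPDU size.
    return max(link_mtu, MIN_FPDU_BUFFER)
```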
Note that if the TCP segments were not stored, it would be possible
to deadlock the MPA algorithm. If the path MTU is reduced, FPDU
Alignment requires the source TCP to re-segment the data stream to
the new path MTU. The source MPA will detect this condition and
reduce the MPA segment size, but any FPDUs already posted to the
source TCP will be re-segmented and lose FPDU Alignment. If the
destination does not support a TCP reassembly buffer, these segments
can never be successfully transmitted and the protocol deadlocks.
When a complete FPDU is received, processing continues normally.
Appendix B. Analysis of MPA over TCP Operations
This appendix is for information only and is NOT part of the
standard.
This appendix is an analysis of MPA on TCP and why it is useful to
integrate MPA with TCP (with modifications to typical TCP
implementations) to reduce overall system buffering and overhead.
One of MPA's high-level goals is to provide enough information, when
combined with the Direct Data Placement Protocol [DDP], to enable
out-of-order placement of DDP payload into the final Upper Layer
Protocol (ULP) Buffer. Note that DDP separates the act of placing
data into a ULP Buffer from that of notifying the ULP that the ULP
Buffer is available for use. In DDP terminology, the former is
defined as "Placement", and the latter is defined as "Delivery". MPA
supports in-order Delivery of the data to the ULP, including support
for Direct Data Placement in the final ULP Buffer location when TCP
segments arrive out of order. Effectively, the goal is to use the
pre-posted ULP Buffers as the TCP receive buffer, where the
reassembly of the ULP Protocol Data Unit (PDU) by TCP (with MPA and
DDP) is done in place, in the ULP Buffer, with no data copies.
This appendix walks through the advantages and disadvantages of the
TCP sender modifications proposed by MPA:
1) that the TCP sender preferably do Header Alignment, where
a TCP segment should begin with an MPA Framing Protocol Data Unit
(FPDU) (if there is payload present).
2) that there be an integral number of FPDUs in a TCP segment (under
conditions where the path MTU is not changing).
This appendix concludes that the scaling advantages of FPDU Alignment
are strong, based primarily on fairly drastic TCP receive buffer
reduction requirements and simplified receive handling. The analysis
also shows that there is little effect on TCP wire behavior.
B.1. Assumptions
B.1.1. MPA Is Layered beneath DDP
MPA is an adaptation layer between DDP and TCP. DDP requires
preservation of DDP segment boundaries and a CRC32c digest covering
the DDP header and data. MPA adds these features to the TCP stream
so that DDP over TCP has the same basic properties as DDP over SCTP.
B.1.2. MPA Preserves DDP Message Framing
MPA was designed as a framing layer specifically for DDP and was not
intended as a general-purpose framing layer for any other ULP using
TCP.
A framing layer allows ULPs using it to receive indications from the
transport layer only when complete ULPDUs are present. As a framing
layer, MPA is not aware of the content of the DDP PDU, only that it
has received and, if necessary, reassembled a complete PDU for
Delivery to the DDP.
B.1.3. The Size of the ULPDU Passed to MPA Is Less Than EMSS under
Normal Conditions
To make reception of a complete DDP PDU on every received segment
possible, DDP passes to MPA a PDU that is no larger than the EMSS of
the underlying fabric. Each FPDU that MPA creates contains
sufficient information for the receiver to directly place the ULP
payload in the correct location in the correct receive buffer.
Edge cases in which this condition does not hold are dealt with, but
do not need to be on the fast path.
B.1.4. Out-of-Order Placement but NO Out-of-Order Delivery
DDP receives complete DDP PDUs from MPA. Each DDP PDU contains the
information necessary to place its ULP payload directly in the
correct location in host memory.
Because each DDP segment is self-describing, it is possible for DDP
segments received out of order to have their ULP payload placed
immediately in the ULP receive buffer.
Data delivery to the ULP is guaranteed to be in the order the data
was sent. DDP only indicates data delivery to the ULP after TCP has
acknowledged the complete byte stream.
B.2. The Value of FPDU Alignment
Significant receiver optimizations can be achieved when Header
Alignment and complete FPDUs are the common case. The optimizations
allow utilizing significantly fewer buffers on the receiver and less
computation per FPDU. The net effect is the ability to build a
"flow-through" receiver that enables TCP-based solutions to scale to
10G and beyond in an economical way. The optimizations are
especially relevant to hardware implementations of receivers that
process multiple protocol layers -- Data Link Layer (e.g., Ethernet),
Network and Transport Layer (e.g., TCP/IP), and even some ULP on top
of TCP (e.g., MPA/DDP). As network speed increases, there is an
increasing desire to use a hardware-based receiver in order to
achieve an efficient high performance solution.
A TCP receiver, under worst-case conditions, has to allocate buffers
(BufferSizeTCP) whose capacities are a function of the bandwidth-
delay product. Thus:
BufferSizeTCP = K * bandwidth [octets/second] * delay [seconds].
Where bandwidth is the end-to-end bandwidth of the connection, delay
is the round-trip delay of the connection, and K is an
implementation-dependent constant.
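As a worked illustration of the formula, the following sketch evaluates BufferSizeTCP for given inputs. The function name, the default K of 1, and the example figures (10 gigabit/sec, 1 ms round trip) are assumptions chosen for the example, not values from this analysis.

```python
def buffer_size_tcp(bandwidth_octets_per_s: float, delay_s: float, k: float = 1.0) -> float:
    """BufferSizeTCP = K * bandwidth * delay, in octets."""
    return k * bandwidth_octets_per_s * delay_s

# 10 gigabit/sec is 1.25e9 octets/sec; with a 1 ms round-trip delay and
# K = 1 this is roughly 1.25 million octets per connection.
example = buffer_size_tcp(1.25e9, 0.001)
```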
Thus, BufferSizeTCP scales with the end-to-end bandwidth (10x more
buffers for a 10x increase in end-to-end bandwidth). As this
buffering approach may scale poorly for hardware or software
implementations alike, several approaches allow reduction in the
amount of buffering required for high-speed TCP communication.
The MPA/DDP approach is to enable the ULP's Buffer to be used as the
TCP receive buffer. If the application pre-posts a sufficient amount
of buffering, and each TCP segment has sufficient information to
place the payload into the right application buffer, when an out-of-
order TCP segment arrives it could potentially be placed directly in
the ULP Buffer. However, placement can only be done when a complete
FPDU with the placement information is available to the receiver, and
the FPDU contents contain enough information to place the data into
the correct ULP Buffer (e.g., there is a DDP header available).
For the case when the FPDU is not aligned with the TCP segment, it
may take, on average, 2 TCP segments to assemble one FPDU.
Therefore, the receiver has to allocate BufferSizeNAF (Buffer Size,
Non-Aligned FPDU) octets:
BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS
where K1 and K2 are implementation-dependent constants and EMSS is
the effective maximum segment size.
For example, a 1 GB/sec link with 10,000 connections and an EMSS of
1500 B would require about 15 MB of memory for the first term alone
(10,000 * 1500 octets, taking K1 = 1). Often the number of
connections used scales with the network speed, aggravating the
situation for higher speeds.
FPDU Alignment would allow the receiver to allocate BufferSizeAF
(Buffer Size, Aligned FPDU) octets:
BufferSizeAF = K2 * EMSS
for the same conditions. An FPDU Aligned receiver may require memory
in the range of ~100s of KB -- which is feasible for an on-chip
memory and enables a "flow-through" design, in which the data flows
through the network interface card (NIC) and is placed directly in
the destination buffer. Assuming most of the connections support
FPDU Alignment, the receiver buffers no longer scale with number of
connections.
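The two buffer-size formulas can be compared with a short sketch.
The function names and the values K1 = 1 and K2 = 100 are
assumptions chosen for illustration; K2 = 100 is picked only because
it lands in the "~100s of KB" range mentioned above.

```python
def buffer_size_naf(emss: int, connections: int, k1: int, k2: int) -> int:
    """BufferSizeNAF = K1 * EMSS * number_of_connections + K2 * EMSS."""
    return k1 * emss * connections + k2 * emss


def buffer_size_af(emss: int, k2: int) -> int:
    """BufferSizeAF = K2 * EMSS (independent of the connection count)."""
    return k2 * emss


EMSS = 1500            # octets, as in the example in the text
CONNECTIONS = 10_000   # as in the example in the text
K1, K2 = 1, 100        # assumed implementation-dependent constants

naf = buffer_size_naf(EMSS, CONNECTIONS, K1, K2)
af = buffer_size_af(EMSS, K2)
print(f"non-aligned: {naf:,} octets")
print(f"aligned:     {af:,} octets")
```

Only the non-aligned figure changes when the number of connections
is varied, which is the point of the comparison.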
Additional optimizations can be achieved in a balanced I/O sub-system
-- where the system interface of the network controller provides
ample bandwidth as compared with the network bandwidth. For almost
twenty years this has been the case and the trend is expected to
continue. While Ethernet speeds have scaled by a factor of 1000
(from 10 megabit/sec to 10 gigabit/sec), the I/O bus bandwidth of
volume CPU architectures has scaled by a similar factor, from
~2 MB/sec to ~2 GB/sec (PC-XT bus to PCI-X DDR). Under these
conditions, the FPDU Alignment approach allows BufferSizeAF to be
indifferent to network speed. It is primarily a function of the
local processing time for a given frame.
Thus, when the FPDU Alignment approach is used, receive buffering is
expected to scale gracefully (i.e., less than linear scaling) as
network speed is increased.
B.2.1. Impact of Lack of FPDU Alignment on the Receiver Computational
Load and Complexity
The receiver must perform IP and TCP processing, and then perform
FPDU CRC checks, before it can trust the FPDU header placement
information. For simplicity of the description, the assumption is
that an FPDU is carried in no more than 2 TCP segments. In reality,
with no FPDU Alignment, an FPDU can be carried by more than 2 TCP
segments (e.g., if the path MTU was reduced).
----++-----------------------------++-----------------------++-----
+---||---------------+    +--------||--------+   +----------||----+
|   TCP Seg X-1      |    |     TCP Seg X    |   |  TCP Seg X+1   |
+---||---------------+    +--------||--------+   +----------||----+
----++-----------------------------++-----------------------++-----
                FPDU #N-1                  FPDU #N
Figure 12: Non-Aligned FPDU Freely Placed in TCP Octet Stream
The receiver algorithm for processing TCP segments (e.g., TCP segment
#X in Figure 12) carrying non-aligned FPDUs (in order or out of
order) includes:
Data Link Layer processing (whole frame) -- typically including a CRC
calculation.
1. Network Layer processing (assuming not an IP fragment, the
   whole Data Link Layer frame contains one IP datagram. IP
   fragments should be reassembled in a local buffer. This is
   not a performance optimization goal.)
2. Transport Layer processing -- TCP protocol processing, header
   and checksum checks.
   a. Classify incoming TCP segment using the 5 tuple (IP SRC,
      IP DST, TCP SRC Port, TCP DST Port, protocol).
3. Find FPDU message boundaries.
   a. Get MPA state information for the connection.
      If the TCP segment is in order, use the receiver-managed
      MPA state information to calculate where the previous
      FPDU message (#N-1) ends in the current TCP segment #X.
      (Previously, when the MPA receiver processed the first
      part of FPDU #N-1, it calculated the number of bytes
      remaining to complete FPDU #N-1 by using the MPA Length
      field.)
         Get the stored partial CRC for FPDU #N-1.
         Complete CRC calculation for FPDU #N-1 data (first
         portion of TCP segment #X).
         Check CRC calculation for FPDU #N-1.
         If no FPDU CRC errors, placement is allowed.
         Locate the local buffer for the first portion of
         FPDU #N-1, then CopyData(local buffer of first portion
         of FPDU #N-1, host buffer address, length).
         Compute host buffer address for second portion of
         FPDU #N-1.
         CopyData(local buffer of second portion of FPDU #N-1,
         host buffer address for second portion, length).
         Calculate the octet offset into the TCP segment for
         the next FPDU #N.
         Start calculation of CRC for available data for
         FPDU #N.
         Store partial CRC results for FPDU #N.
         Store local buffer address of first portion of
         FPDU #N.
         No further action is possible on FPDU #N before it
         is completely received.
      If the TCP segment is out of order, the receiver must
      buffer the data until at least one complete FPDU is
      received. Typically, buffering for more than one TCP
      segment per connection is required. Use the MPA-based
      Markers to calculate where FPDU boundaries are.
         When a complete FPDU is available, a similar
         procedure to the in-order algorithm above is used.
         There is additional complexity, though, because when
         the missing segment arrives, this TCP segment must be
         run through the CRC engine after the CRC is
         calculated for the missing segment.
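The in-order steps above (store a partial CRC, complete it when the
next segment arrives, then copy both portions) can be sketched as
follows. This is a minimal illustration under stated assumptions,
not a receiver implementation: PartialFpdu and complete_fpdu are
hypothetical names, the expected CRC is passed in rather than parsed
from the FPDU trailer, and zlib.crc32 (the IEEE CRC-32) stands in
for the CRC that MPA actually uses, only because it is available in
the Python standard library and supports resuming from a running
value.

```python
import zlib


class PartialFpdu:
    """Per-connection state kept while an FPDU straddles two segments."""

    def __init__(self, first_portion: bytes, remaining: int):
        self.local_buffer = first_portion              # buffered first portion
        self.partial_crc = zlib.crc32(first_portion)   # stored partial CRC
        self.remaining = remaining                     # octets still to arrive


def complete_fpdu(state: PartialFpdu, segment: bytes, expected_crc: int):
    """Finish the CRC using the head of the next in-order TCP segment.

    Returns (fpdu_data, rest_of_segment) when the CRC matches.
    """
    second_portion = segment[:state.remaining]
    # Resume the CRC from the stored partial value.
    crc = zlib.crc32(second_portion, state.partial_crc)
    if crc != expected_crc:
        raise ValueError("FPDU CRC error: placement is not allowed")
    # Two copies are needed: the buffered portion and the new portion.
    return state.local_buffer + second_portion, segment[state.remaining:]


# Usage: a 200-octet FPDU body split 120/80 across two segments.
fpdu = bytes(range(200))
state = PartialFpdu(fpdu[:120], remaining=80)
data, rest = complete_fpdu(state, fpdu[120:] + b"next", zlib.crc32(fpdu))
print(len(data), rest)
```

The per-connection state object is exactly what the aligned case
avoids having to keep.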
If we assume FPDU Alignment, the following diagram and the algorithm
below apply. Note that when using MPA, the receiver is assumed to
actively detect presence or loss of FPDU Alignment for every TCP
segment received.
   +--------------------------+    +--------------------------+
+--|--------------------------+ +--|--------------------------+
|  |        TCP Seg X         | |  |       TCP Seg X+1        |
+--|--------------------------+ +--|--------------------------+
   +--------------------------+    +--------------------------+
             FPDU #N                        FPDU #N+1
Figure 13: Aligned FPDU Placed Immediately after TCP Header
The receiver algorithm for FPDU Aligned frames (in order or out of
order) includes:
1) Data Link Layer processing (whole frame) -- typically
   including a CRC calculation.
2) Network Layer processing (assuming not an IP fragment, the
   whole Data Link Layer frame contains one IP datagram. IP
   fragments should be reassembled in a local buffer. This is
   not a performance optimization goal.)
3) Transport Layer processing -- TCP protocol processing, header
   and checksum checks.
   a. Classify incoming TCP segment using the 5 tuple (IP SRC,
      IP DST, TCP SRC Port, TCP DST Port, protocol).
4) Check for Header Alignment (described in detail in Section
   6). The rest of the algorithm below assumes Header Alignment.
   a. If the header is not aligned, see the algorithm for
      non-aligned FPDUs given earlier in this section.
5) Whether the TCP segment is in order or out of order, the MPA
   header is at the beginning of the current TCP payload. Get
   the FPDU length from the FPDU header.
6) Calculate CRC over FPDU.
7) Check CRC calculation for FPDU #N.
8) If no FPDU CRC errors, placement is allowed.
9) CopyData(TCP segment #X, host buffer address, length).
10) Loop to step 5 until all the FPDUs in the TCP segment are
    consumed, in order to handle FPDU packing.
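Steps 5 through 10 of the aligned algorithm can be sketched as a
loop over one TCP payload. Several simplifications are assumed here
and are not taken from this appendix: Markers are disabled, each
FPDU is laid out as a 2-octet length, the ULPDU, padding to a
4-octet boundary, and a 4-octet CRC in network byte order, and
zlib.crc32 stands in for the real MPA CRC. The names build_fpdu and
place_aligned_segment are hypothetical, and the place callback
stands in for CopyData to the host buffer.

```python
import struct
import zlib

LEN_FIELD = 2   # length field, octets
CRC_FIELD = 4   # CRC field, octets


def build_fpdu(ulpdu: bytes) -> bytes:
    """Build a marker-less FPDU: length, ULPDU, pad to 4 octets, CRC."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", zlib.crc32(body))


def place_aligned_segment(payload: bytes, place) -> int:
    """Walk every FPDU packed into one aligned TCP payload."""
    offset = count = 0
    while offset < len(payload):
        if offset + LEN_FIELD > len(payload):
            raise ValueError("truncated FPDU header")
        (ulpdu_len,) = struct.unpack_from("!H", payload, offset)  # step 5
        body_len = LEN_FIELD + ulpdu_len
        body_len += -body_len % 4                                 # padding
        end = offset + body_len + CRC_FIELD
        if end > len(payload):
            raise ValueError("segment does not hold a complete FPDU")
        body = payload[offset:offset + body_len]
        (wire_crc,) = struct.unpack_from("!I", payload, offset + body_len)
        if zlib.crc32(body) != wire_crc:                          # steps 6-7
            raise ValueError("FPDU CRC error: placement is not allowed")
        place(body[LEN_FIELD:LEN_FIELD + ulpdu_len])              # steps 8-9
        offset, count = end, count + 1                            # step 10
    return count


# Usage: two FPDUs packed into one segment, placed into a list.
placed = []
segment = build_fpdu(b"hello") + build_fpdu(b"x" * 10)
print(place_aligned_segment(segment, placed.append), placed)
```

No per-connection partial-CRC state is consulted anywhere in the
loop, in contrast to the non-aligned case.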
Implementation note: In both cases, the receiver has to classify the
incoming TCP segment and associate it with one of the flows it
maintains. In the case of no FPDU Alignment, the receiver is forced
to classify incoming traffic before it can calculate the FPDU CRC.
In the case of FPDU Alignment, the operations order is left to the
implementer.
The FPDU Aligned receiver algorithm is significantly simpler. There
is no need to locally buffer portions of FPDUs. Accessing state
information is also substantially simplified -- the normal case does
not require retrieving information to find out where an FPDU starts
and ends or retrieval of a partial CRC before the CRC calculation can
commence. This avoids adding internal latencies, having multiple
data passes through the CRC machine, or scheduling multiple commands
for moving the data to the host buffer.
The aligned FPDU approach is useful for in-order and out-of-order
reception. The receiver can use the same mechanisms for data storage
in both cases, and only needs to account for when all the TCP
segments have arrived to enable Delivery. The Header Alignment,
along with the high probability that at least one complete FPDU is
found with every TCP segment, allows the receiver to perform data
placement for out-of-order TCP segments with no need for intermediate
buffering. Essentially, the TCP receive buffer has been eliminated
and TCP reassembly is done in place within the ULP Buffer.
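The idea of reassembling in place within the ULP Buffer can be
sketched as below. The class UlpBuffer and the notion that each FPDU
names its own buffer offset are assumptions for illustration (in
practice the placement information comes from the DDP header), and
the sketch assumes no duplicate or overlapping placements.

```python
class UlpBuffer:
    """A ULP Buffer that doubles as the reassembly buffer."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.placed = 0   # octets placed so far (duplicates not handled)

    def place(self, offset: int, payload: bytes) -> None:
        """Place one FPDU payload at the offset its header names."""
        end = offset + len(payload)
        if offset < 0 or end > len(self.data):
            raise ValueError("placement outside the ULP Buffer")
        self.data[offset:end] = payload
        self.placed += len(payload)

    def deliverable(self) -> bool:
        """Delivery is allowed only once every octet has arrived."""
        return self.placed == len(self.data)


# Usage: the second half arrives first and is placed immediately.
buf = UlpBuffer(8)
buf.place(4, b"EFGH")          # out-of-order segment
print(buf.deliverable())       # still waiting for the first half
buf.place(0, b"ABCD")          # missing segment arrives
print(buf.deliverable(), bytes(buf.data))
```

Placement happens as each segment arrives; only the Delivery
decision waits for the missing data.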
In case FPDU Alignment is not found, the receiver should follow the
algorithm for non-aligned FPDU reception, which may be slower and
less efficient.
B.2.2. FPDU Alignment Effects on TCP Wire Protocol
In an optimized MPA/TCP implementation, TCP exposes its EMSS to MPA.
MPA uses the EMSS to calculate its MULPDU, which it then exposes to
DDP, its ULP. DDP uses the MULPDU to segment its payload so that
each FPDU sent by MPA fits completely into one TCP segment. This has
no impact on wire protocol, and exposing this information is already
supported on many TCP implementations, including all modern flavors
of BSD networking, through the TCP_MAXSEG socket option.
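A minimal sketch of this information flow is shown below. The helper
names are hypothetical. The sizing rule assumes Markers are disabled
and that an FPDU consists of a 2-octet length, the ULPDU, padding to
a 4-octet boundary, and a 4-octet CRC; it is not the general MULPDU
computation. The value a given stack reports for TCP_MAXSEG may also
differ from the EMSS actually in effect, so it should be treated as
an approximation.

```python
import socket


def read_emss(sock: socket.socket) -> int:
    """Ask the TCP stack for the segment size of this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def max_ulpdu_without_markers(emss: int) -> int:
    """Largest ULPDU whose FPDU fits in one segment, Markers disabled.

    The FPDU must be a multiple of 4 octets, so round EMSS down to a
    4-octet boundary, then subtract the length (2) and CRC (4) fields.
    """
    return (emss - emss % 4) - 6


for emss in (1460, 1461, 8960):
    print(emss, max_ulpdu_without_markers(emss))
```

On a connected socket, read_emss(sock) would supply the value that
is hard-coded in the loop above.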
In the common case, the ULP (i.e., DDP over MPA) messages provided to
the TCP layer are segmented to MULPDU size. It is assumed that the
ULP message size is bounded by MULPDU, such that a single ULP message
can be encapsulated in a single TCP segment. Therefore, in the
common case, there is no increase in the number of TCP segments
emitted. For smaller ULP messages, the sender can also apply
packing, i.e., the sender packs as many complete FPDUs as possible
into one TCP segment. The requirement to always have a complete FPDU
may increase the number of TCP segments emitted. Typically, a ULP
message size varies from a few bytes to multiple EMSSs (e.g., 64
Kbytes). In some cases, the ULP may post more than one message at a
time for transmission, giving the sender an opportunity for packing.
In the case where more than one FPDU is available for transmission
and the FPDUs are encapsulated into a TCP segment and there is no
room in the TCP segment to include the next complete FPDU, another
TCP segment is sent. In this corner case, some of the TCP segments
are not full size. In the worst-case scenario, the ULP may choose an
FPDU size of EMSS/2 + 1 and have multiple messages available for
transmission. For this poor choice of FPDU size, the average TCP
segment size is therefore about 1/2 of the EMSS and the number of TCP
segments emitted approaches 2x of what is possible without the
requirement to encapsulate an integer number of complete FPDUs in
every TCP segment. This is a dynamic situation that lasts only as
long as the sender ULP has multiple non-optimal messages queued for
transmission, and it has a minor impact on wire utilization.
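The worst case described above can be checked with a small packing
sketch. The function segments_emitted is a hypothetical helper that
greedily packs complete FPDUs into segments; FPDU sizes are given
directly in octets and alignment padding is ignored for simplicity.

```python
import math


def segments_emitted(fpdu_sizes, emss: int) -> int:
    """Greedily pack complete FPDUs into TCP segments; return the count."""
    segments, room = 0, 0
    for size in fpdu_sizes:
        if size > emss:
            raise ValueError("an FPDU larger than EMSS cannot stay aligned")
        if size > room:      # next FPDU does not fit: start a new segment
            segments += 1
            room = emss
        room -= size
    return segments


EMSS = 1460
worst = [EMSS // 2 + 1] * 100             # 100 FPDUs of EMSS/2 + 1 octets
aligned = segments_emitted(worst, EMSS)
unaligned = math.ceil(sum(worst) / EMSS)  # plain octet stream, no alignment
print(aligned, unaligned)                 # 100 versus 51 segments
print(segments_emitted([EMSS // 2] * 100, EMSS))  # 50: two FPDUs per segment
```

One octet more than EMSS/2 is enough to prevent packing two FPDUs
per segment, which is why the ratio approaches 2x.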
However, it is not expected that requiring FPDU Alignment will have a
measurable impact on wire behavior of most applications. Throughput
applications with large I/Os are expected to take full advantage of
the EMSS. Another class of applications with many small outstanding
buffers (as compared to EMSS) is expected to use packing when
applicable. Transaction-oriented applications are also optimal.
TCP retransmission is another area that can affect sender behavior.
TCP supports retransmission of the exact, originally transmitted
segment (see [RFC793], Sections 2.6 and 3.7 (under "Managing the
Window") and [RFC1122], Section 4.2.2.15). In the unlikely event
that part of the original segment has been received and acknowledged
by the Remote Peer (e.g., a re-segmenting middlebox, as documented in
Appendix A.4, Re-Segmenting Middleboxes and Non-Optimized MPA/TCP
Senders), a better available bandwidth utilization may be possible by
retransmitting only the missing octets. If an optimized MPA/TCP
retransmits complete FPDUs, there may be some marginal bandwidth
loss.
Another area where a change in the number of TCP segments may have an
impact is that of slow start and congestion avoidance. Slow-start
exponential increase is measured in segments per second, as the
algorithm focuses on the overhead per segment at the source for
congestion that eventually results in dropped segments. Slow-start
exponential bandwidth growth for optimized MPA/TCP is similar to any
TCP implementation. Congestion avoidance allows for a linear growth
in available bandwidth when recovering after a packet drop. Similar
to the analysis for slow start, optimized MPA/TCP doesn't change the
behavior of the algorithm. Therefore, the average size of the
segment versus EMSS is not a major factor in the assessment of the
bandwidth growth for a sender. Both slow start and congestion
avoidance for an optimized MPA/TCP will behave similarly to any TCP
sender and allow an optimized MPA/TCP to enjoy the theoretical
performance limits of the algorithms.
In summary, the ULP messages generated at the sender (e.g., the
number of messages grouped for every transmission request) and the
message size distribution have the most significant impact on the
number of TCP segments emitted. The worst-case effect for certain
ULPs (with average message size of EMSS/2+1 to EMSS) is bounded by an
increase of up to 2x in the number of TCP segments and
acknowledgments. In reality, the effect is expected to be marginal.
Appendix C. IETF Implementation Interoperability with RDMA Consortium
Protocols
This appendix is for information only and is NOT part of the
standard.
This appendix covers methods of making MPA implementations
interoperate with both IETF and RDMA Consortium versions of the
protocols.
The RDMA Consortium created early specifications of the MPA/DDP/RDMA
protocols, and some manufacturers created implementations of those
protocols before the IETF versions were finalized. These protocols
are very similar to the IETF versions, making it possible for
implementations to be created or modified to support either set of
specifications.
For those interested, the RDMA Consortium protocol documents (draft-
culley-iwarp-mpa-v1.0.pdf [RDMA-MPA], draft-shah-iwarp-ddp-v1.0.pdf
[RDMA-DDP], and draft-recio-iwarp-rdmac-v1.0.pdf [RDMA-RDMAC]) can be
obtained at http://www.rdmaconsortium.org/home.
In this section, implementations of MPA/DDP/RDMA that conform to the
RDMAC specifications are called RDMAC RNICs. Implementations of
MPA/DDP/RDMA that conform to the IETF RFCs are called IETF RNICs.
Without the exchange of MPA Request/Reply Frames, there is no
standard mechanism for enabling RDMAC RNICs to interoperate with IETF
RNICs. Even if a ULP uses a well-known port to start an IETF RNIC
immediately in RDMA mode (i.e., without exchanging the MPA
Request/Reply messages), there is no reason to believe an IETF RNIC
will interoperate with an RDMAC RNIC because of the differences in
the version number in the DDP and RDMAP headers on the wire.
Therefore, the ULP or other supporting entity at the RDMAC RNIC must
implement MPA Request/Reply Frames on behalf of the RNIC in order to
negotiate the connection parameters. The following section describes
the results following the exchange of the MPA Request/Reply Frames
before the conversion from streaming to RDMA mode.
C.1. Negotiated Parameters
Three types of RNICs are considered:
Upgraded RDMAC RNIC - an RNIC implementing the RDMAC protocols that
has a ULP or other supporting entity that exchanges the MPA
Request/Reply Frames in streaming mode before the conversion to RDMA
mode.
Non-permissive IETF RNIC - an RNIC implementing the IETF protocols
that is not capable of implementing the RDMAC protocols. Such an
RNIC can only interoperate with other IETF RNICs.
Permissive IETF RNIC - an RNIC implementing the IETF protocols that
is capable of implementing the RDMAC protocols on a per-connection
basis.
The Permissive IETF RNIC is recommended for those implementers that
want maximum interoperability with other RNIC implementations.
The values used by these three RNIC types for the MPA, DDP, and RDMAP
versions as well as MPA Markers and CRC are summarized in Figure 14.
+----------------++-----------+-----------+-----------+-----------+
| RNIC TYPE || DDP/RDMAP | MPA | MPA | MPA |
| || Version | Revision | Markers | CRC |
+----------------++-----------+-----------+-----------+-----------+
+----------------++-----------+-----------+-----------+-----------+
| RDMAC || 0 | 0 | 1 | 1 |
| || | | | |
+----------------++-----------+-----------+-----------+-----------+
| IETF || 1 | 1 | 0 or 1 | 0 or 1 |
| Non-permissive || | | | |
+----------------++-----------+-----------+-----------+-----------+
| IETF || 1 or 0 | 1 or 0 | 0 or 1 | 0 or 1 |
| permissive || | | | |
+----------------++-----------+-----------+-----------+-----------+
Figure 14: Connection Parameters for the RNIC Types
for MPA Markers and MPA CRC, enabled=1, disabled=0.
It is assumed there is no mixing of versions allowed between MPA,
DDP, and RDMAP. The RNIC either generates the RDMAC protocols on the
wire (version is zero) or uses the IETF protocols (version is one).
During the exchange of the MPA Request/Reply Frames, each peer
provides its MPA Revision, Marker preference (M: 0=disabled,
1=enabled), and CRC preference. The MPA Revision provided in the MPA
Request Frame and the MPA Reply Frame may differ.
From the information in the MPA Request/Reply Frames, each side sets
the Version field (V: 0=RDMAC, 1=IETF) of the DDP/RDMAP protocols as
well as the state of the Markers for each half connection. Between
DDP and RDMAP, no mixing of versions is allowed. Moreover, the DDP
and RDMAP version MUST be identical in the two directions. The RNIC
either generates the RDMAC protocols on the wire (version is zero) or
uses the IETF protocols (version is one).
In the following sections, the figures do not discuss CRC negotiation
because there is no interoperability issue for CRCs. Since the RDMAC
RNIC will always request CRC use, then, according to the IETF MPA
specification, both peers MUST generate and check CRCs.
C.2. RDMAC RNIC and Non-Permissive IETF RNIC
Figure 15 shows that a Non-permissive IETF RNIC cannot interoperate
with an RDMAC RNIC, despite the fact that both peers exchange MPA
Request/Reply Frames. For a Non-permissive IETF RNIC, the MPA
negotiation has no effect on the DDP/RDMAP version and it is unable
to interoperate with the RDMAC RNIC.
The rows in the figure show the state of the Marker field in the MPA
Request Frame sent by the MPA Initiator. The columns show the state
of the Marker field in the MPA Reply Frame sent by the MPA Responder.
Each type of RNIC is shown as an Initiator and a Responder. The
connection results are shown in the lower right corner, at the
intersection of the different RNIC types, where V=0 is the RDMAC
DDP/RDMAP version, V=1 is the IETF DDP/RDMAP version, M=0 means MPA
Markers are disabled, and M=1 means MPA Markers are enabled. The
negotiated Marker state is shown as X/Y, for the receive direction of
the Initiator/Responder.
+---------------------------++-----------------------+
| MPA || MPA |
| CONNECT || Responder |
| MODE +-----------------++-------+---------------+
| | RNIC || RDMAC | IETF |
| | TYPE || | Non-permissive|
| | +------++-------+-------+-------+
| | |MARKER|| M=1 | M=0 | M=1 |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
| | RDMAC | M=1 || V=0 | close | close |
| | | || M=1/1 | | |
| +----------+------++-------+-------+-------+
| MPA | | M=0 || close | V=1 | V=1 |
|Initiator| IETF | || | M=0/0 | M=0/1 |
| |Non-perms.+------++-------+-------+-------+
| | | M=1 || close | V=1 | V=1 |
| | | || | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+
Figure 15: MPA Negotiation between an RDMAC RNIC and
a Non-Permissive IETF RNIC
C.2.1. RDMAC RNIC Initiator
If the RDMAC RNIC is the MPA Initiator, its ULP sends an MPA Request
Frame with Rev field set to zero and the M and C bits set to one.
Because the Non-permissive IETF RNIC cannot dynamically downgrade the
version number it uses for DDP and RDMAP, it would send an MPA Reply
Frame with the Rev field equal to one and then gracefully close the
connection.
C.2.2. Non-Permissive IETF RNIC Initiator
If the Non-permissive IETF RNIC is the MPA Initiator, it sends an MPA
Request Frame with Rev field equal to one. The ULP or supporting
entity for the RDMAC RNIC responds with an MPA Reply Frame that has
the Rev field equal to zero and the M bit set to one. The Non-
permissive IETF RNIC will gracefully close the connection after it
reads the incompatible Rev field in the MPA Reply Frame.
C.2.3. RDMAC RNIC and Permissive IETF RNIC
Figure 16 shows that a Permissive IETF RNIC can interoperate with an
RDMAC RNIC regardless of its Marker preference. The figure uses the
same format as shown with the Non-permissive IETF RNIC.
+---------------------------++-----------------------+
| MPA || MPA |
| CONNECT || Responder |
| MODE +-----------------++-------+---------------+
| | RNIC || RDMAC | IETF |
| | TYPE || | Permissive |
| | +------++-------+-------+-------+
| | |MARKER|| M=1 | M=0 | M=1 |
+---------+----------+------++-------+-------+-------+
+---------+----------+------++-------+-------+-------+
| | RDMAC | M=1 || V=0 | N/A | V=0 |
| | | || M=1/1 | | M=1/1 |
| +----------+------++-------+-------+-------+
| MPA | | M=0 || V=0 | V=1 | V=1 |
|Initiator| IETF | || M=1/1 | M=0/0 | M=0/1 |
| |Permissive+------++-------+-------+-------+
| | | M=1 || V=0 | V=1 | V=1 |
| | | || M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+
Figure 16: MPA Negotiation between an RDMAC RNIC and
a Permissive IETF RNIC
A truly Permissive IETF RNIC will recognize an RDMAC RNIC from the
Rev field of the MPA Request/Reply Frames and then adjust its receive
Marker state and DDP/RDMAP version to accommodate the RDMAC RNIC. As
a result, as an MPA Responder to an RDMAC Initiator, the Permissive
IETF RNIC will never return an MPA Reply Frame with the M bit set to
zero. This case is shown as not applicable (N/A) in Figure 16.
C.2.4. RDMAC RNIC Initiator
When the RDMAC RNIC is the MPA Initiator, its ULP or other supporting
entity prepares an MPA Request message and sets the revision to zero
and the M bit and C bit to one.
The Permissive IETF Responder receives the MPA Request message and
checks the revision field. Since it is capable of generating RDMAC
DDP/RDMAP headers, it sends an MPA Reply message with revision set to
zero and the M and C bits set to one. The Responder must inform its
ULP that it is generating version zero DDP/RDMAP messages.
C.2.5. Permissive IETF RNIC Initiator
If the Permissive IETF RNIC is the MPA Initiator, it prepares the MPA
Request Frame setting the Rev field to one. Regardless of the value
of the M bit in the MPA Request Frame, the ULP or other supporting
entity for the RDMAC RNIC will create an MPA Reply Frame with Rev
equal to zero and the M bit set to one.
When the Initiator reads the Rev field of the MPA Reply Frame and
finds that its peer is an RDMAC RNIC, it must inform its ULP that it
should generate version zero DDP/RDMAP messages and enable MPA
Markers and CRC.
C.3. Non-Permissive IETF RNIC and Permissive IETF RNIC
For completeness, Figure 17 below shows the results of MPA
negotiation between a Non-permissive IETF RNIC and a Permissive IETF
RNIC. The important point from this figure is that an IETF RNIC
cannot detect whether its peer is a Permissive or Non-permissive
RNIC.
+---------------------------++-------------------------------+
| MPA || MPA |
| CONNECT || Responder |
| MODE +-----------------++---------------+---------------+
| | RNIC || IETF | IETF |
| | TYPE || Non-permissive| Permissive |
| | +------++-------+-------+-------+-------+
| | |MARKER|| M=0 | M=1 | M=0 | M=1 |
+---------+----------+------++-------+-------+-------+-------+
+---------+----------+------++-------+-------+-------+-------+
| | | M=0 || V=1 | V=1 | V=1 | V=1 |
| | IETF | || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
| |Non-perms.+------++-------+-------+-------+-------+
| | | M=1 || V=1 | V=1 | V=1 | V=1 |
| | | || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
| MPA +----------+------++-------+-------+-------+-------+
|Initiator| | M=0 || V=1 | V=1 | V=1 | V=1 |
| | IETF | || M=0/0 | M=0/1 | M=0/0 | M=0/1 |
| |Permissive+------++-------+-------+-------+-------+
| | | M=1 || V=1 | V=1 | V=1 | V=1 |
| | | || M=1/0 | M=1/1 | M=1/0 | M=1/1 |
+---------+----------+------++-------+-------+-------+-------+
Figure 17: MPA negotiation between a Non-permissive IETF RNIC and a
Permissive IETF RNIC.
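The outcomes in Figures 15, 16, and 17 can be summarized by a small
lookup function. This is a sketch of the tables only, not of the MPA
Request/Reply exchange itself; the names Rnic and negotiate are
hypothetical, and the M arguments are the Marker preferences each
peer starts with rather than bits read from the wire.

```python
from enum import Enum


class Rnic(Enum):
    RDMAC = "RDMAC"
    IETF_NON_PERMISSIVE = "IETF non-permissive"
    IETF_PERMISSIVE = "IETF permissive"


def negotiate(initiator: Rnic, init_m: int, responder: Rnic, resp_m: int):
    """Return (V, initiator_rx_markers, responder_rx_markers), or None
    when the connection is closed.

    init_m and resp_m are Marker preferences (1=enabled, 0=disabled).
    An RDMAC RNIC always uses M=1, so its preference is ignored, and a
    Permissive IETF RNIC adapts to an RDMAC peer.
    """
    pair = {initiator, responder}
    if Rnic.RDMAC in pair:
        if Rnic.IETF_NON_PERMISSIVE in pair:
            return None          # Figure 15: incompatible, close
        return (0, 1, 1)         # Figures 15 and 16: V=0, M=1/1
    return (1, init_m, resp_m)   # Figure 17: V=1, M=X/Y


# Usage: a few cells from the figures.
print(negotiate(Rnic.RDMAC, 1, Rnic.IETF_NON_PERMISSIVE, 0))
print(negotiate(Rnic.IETF_PERMISSIVE, 0, Rnic.RDMAC, 1))
print(negotiate(Rnic.IETF_NON_PERMISSIVE, 1, Rnic.IETF_PERMISSIVE, 0))
```

The N/A cell of Figure 16 does not appear as a separate case because
the Permissive Responder's preference is overridden once it sees an
RDMAC Initiator.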
Normative References
[iSCSI] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M.,
and E. Zeidner, "Internet Small Computer Systems
Interface (iSCSI)", RFC 3720, April 2004.
[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC
1191, November 1990.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
Selective Acknowledgment Options", RFC 2018, October
1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2401] Kent, S. and R. Atkinson, "Security Architecture for the
Internet Protocol", RFC 2401, November 1998.
[RFC3723] Aboba, B., Tseng, J., Walker, J., Rangan, V., and F.
Travostino, "Securing Block Storage Protocols over IP",
RFC 3723, April 2004.
[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC
793, September 1981.
[RDMASEC] Pinkerton, J. and E. Deleganes, "Direct Data Placement
Protocol (DDP) / Remote Direct Memory Access Protocol
(RDMAP) Security", RFC 5042, October 2007.
Informative References
[APPL] Bestler, C. and L. Coene, "Applicability of Remote
Direct Memory Access Protocol (RDMA) and Direct Data
Placement (DDP)", RFC 5045, October 2007.
[CRCTCP] Stone, J. and C. Partridge, "When the CRC and TCP checksum
disagree", ACM Sigcomm, Sept. 2000.
[DAT-API] DAT Collaborative, "kDAPL (Kernel Direct Access
Programming Library) and uDAPL (User Direct Access
Programming Library)", http://www.datcollaborative.org.
[DDP] Shah, H., Pinkerton, J., Recio, R., and P. Culley,
"Direct Data Placement over Reliable Transports", RFC
5041, October 2007.
[iSER] Ko, M., Chadalapaka, M., Hufferd, J., Elzur, U., Shah,
H., and P. Thaler, "Internet Small Computer System
Interface (iSCSI) Extensions for Remote Direct Memory
Access (RDMA)", RFC 5046, October 2007.
[IT-API] The Open Group, "Interconnect Transport API (IT-API)",
Version 2.1, http://www.opengroup.org.
[NFSv4CHAN] Williams, N., "On the Use of Channel Bindings to Secure
Channels", Work in Progress, June 2006.
[RDMA-DDP] "Direct Data Placement over Reliable Transports (Version
1.0)", RDMA Consortium, October 2002,
<http://www.rdmaconsortium.org/home/draft-shah-iwarp-
ddp-v1.0.pdf>.
[RDMA-MPA] "Marker PDU Aligned Framing for TCP Specification
(Version 1.0)", RDMA Consortium, October 2002,
<http://www.rdmaconsortium.org/home/draft-culley-iwarp-
mpa-v1.0.pdf>.
[RDMA-RDMAC] "An RDMA Protocol Specification (Version 1.0)", RDMA
Consortium, October 2002,
<http://www.rdmaconsortium.org/home/draft-recio-iwarp-
rdmac-v1.0.pdf>.
[RDMAP] Recio, R., Culley, P., Garcia, D., Hilland, J., and B.
Metzler, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, October 2007.
[RFC792] Postel, J., "Internet Control Message Protocol", STD 5,
RFC 792, September 1981.
[RFC896] Nagle, J., "Congestion control in IP/TCP internetworks",
RFC 896, January 1984.
[RFC1122] Braden, R., "Requirements for Internet Hosts -
Communication Layers", STD 3, RFC 1122, October 1989.
[RFC4960] Stewart, R., Ed., "Stream Control Transmission
Protocol", RFC 4960, September 2007.
[RFC4296] Bailey, S. and T. Talpey, "The Architecture of Direct
Data Placement (DDP) and Remote Direct Memory Access
(RDMA) on Internet Protocols", RFC 4296, December 2005.
[RFC4297] Romanow, A., Mogul, J., Talpey, T., and S. Bailey,
"Remote Direct Memory Access (RDMA) over IP Problem
Statement", RFC 4297, December 2005.
[RFC4301] Kent, S. and K. Seo, "Security Architecture for the
Internet Protocol", RFC 4301, December 2005.
[VERBS-RMDA] "RDMA Protocol Verbs Specification", RDMA Consortium
standard, April 2003, <http://www.rdmaconsortium.org/
home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf>.
Contributors
Dwight Barron
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: 281-514-2769
EMail: dwight.barron@hp.com
Jeff Chase
Department of Computer Science
Duke University
Durham, NC 27708-0129 USA
Phone: +1 919 660 6559
EMail: chase@cs.duke.edu
Ted Compton
EMC Corporation
Research Triangle Park, NC 27709 USA
Phone: 919-248-6075
EMail: compton_ted@emc.com
Dave Garcia
24100 Hutchinson Rd.
Los Gatos, CA 95033
Phone: 831 247 4464
EMail: Dave.Garcia@StanfordAlumni.org
Hari Ghadia
Gen10 Technology, Inc.
1501 W Shady Grove Road
Grand Prairie, TX 75050
Phone: (972) 301 3630
EMail: hghadia@gen10technology.com
Howard C. Herbert
Intel Corporation
MS CH7-404
5000 West Chandler Blvd.
Chandler, AZ 85226
Phone: 480-554-3116
EMail: howard.c.herbert@intel.com
Jeff Hilland
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: 281-514-9489
EMail: jeff.hilland@hp.com
Mike Ko
IBM
650 Harry Rd.
San Jose, CA 95120
Phone: (408) 927-2085
EMail: mako@us.ibm.com
Mike Krause
Hewlett-Packard Corporation, 43LN
19410 Homestead Road
Cupertino, CA 95014 USA
Phone: +1 (408) 447-3191
EMail: krause@cup.hp.com
Dave Minturn
Intel Corporation
MS JF1-210
5200 North East Elam Young Parkway
Hillsboro, Oregon 97124
Phone: 503-712-4106
EMail: dave.b.minturn@intel.com
Jim Pinkerton
Microsoft, Inc.
One Microsoft Way
Redmond, WA 98052 USA
EMail: jpink@microsoft.com
Hemal Shah
Broadcom Corporation
5300 California Avenue
Irvine, CA 92617 USA
Phone: +1 (949) 926-6941
EMail: hemal@broadcom.com
Allyn Romanow
Cisco Systems
170 W Tasman Drive
San Jose, CA 95134 USA
Phone: +1 408 525 8836
EMail: allyn@cisco.com
Tom Talpey
Network Appliance
1601 Trapelo Road #16
Waltham, MA 02451 USA
Phone: +1 (781) 768-5329
EMail: thomas.talpey@netapp.com
Patricia Thaler
Broadcom
16215 Alton Parkway
Irvine, CA 92618
Phone: 916 570 2707
EMail: pthaler@broadcom.com
Jim Wendt
Hewlett Packard Corporation
8000 Foothills Boulevard MS 5668
Roseville, CA 95747-5668 USA
Phone: +1 916 785 5198
EMail: jim_wendt@hp.com
Jim Williams
Emulex Corporation
580 Main Street
Bolton, MA 01740 USA
Phone: +1 978 779 7224
EMail: jim.williams@emulex.com
Authors' Addresses
Paul R. Culley
Hewlett-Packard Company
20555 SH 249
Houston, TX 77070-2698 USA
Phone: 281-514-5543
EMail: paul.culley@hp.com
Uri Elzur
5300 California Avenue
Irvine, CA 92617, USA
Phone: 949.926.6432
EMail: uri@broadcom.com
Renato J Recio
IBM
Internal Zip 9043
11400 Burnett Road
Austin, Texas 78759
Phone: 512-838-3685
EMail: recio@us.ibm.com
Stephen Bailey
Sandburst Corporation
600 Federal Street
Andover, MA 01810 USA
Phone: +1 978 689 1614
EMail: steph@sandburst.com
John Carrier
Cray Inc.
411 First Avenue S, Suite 600
Seattle, WA 98104-2860
Phone: 206-701-2090
EMail: carrier@cray.com
Full Copyright Statement
Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.