1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
|
Network Working Group T. Talpey
Request for Comments: 5532 C. Juszczak
Category: Informational May 2009
Network File System (NFS) Remote Direct Memory Access (RDMA)
Problem Statement
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Abstract
This document addresses enabling the use of Remote Direct Memory
Access (RDMA) by the Network File System (NFS) protocols. NFS
implementations historically incur significant overhead due to data
copies on end-host systems, as well as other processing overhead.
This document explores the potential benefits of RDMA to these
implementations and evaluates the reasons why RDMA is especially
well-suited to NFS and network file protocols in general.
Talpey & Juszczak Informational [Page 1]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
Table of Contents
1. Introduction ....................................................2
1.1. Background .................................................3
2. Problem Statement ...............................................4
3. File Protocol Architecture ......................................5
4. Sources of Overhead .............................................7
4.1. Savings from TOE ...........................................8
4.2. Savings from RDMA ..........................................9
5. Application of RDMA to NFS .....................................10
6. Conclusions ....................................................10
7. Security Considerations ........................................11
8. Acknowledgments ................................................12
9. References .....................................................12
9.1. Normative References ......................................12
9.2. Informative References ....................................13
1. Introduction
The Network File System (NFS) protocol (as described in [RFC1094],
[RFC1813], and [RFC3530]) is one of several remote file access
protocols used in the class of processing architecture sometimes
called Network-Attached Storage (NAS).
Historically, remote file access has proven to be a convenient,
cost-effective way to share information over a network, a concept
proven over time by the popularity of the NFS protocol. However,
there are issues in such a deployment.
As compared to a local (direct-attached) file access architecture,
NFS removes the overhead of managing the local on-disk file system
state and its metadata, but interposes at least a transport network
and two network endpoints between an application process and the
files it is accessing. To date, this trade-off has usually resulted
in a net performance loss as a result of reduced bandwidth, increased
application server CPU utilization, and other overheads.
Several classes of applications, including those directly supporting
enterprise activities in high-performance domains such as database
applications and shared clusters, have therefore encountered issues
with moving to NFS architectures. While this has been due
principally to the performance costs of NFS versus direct-attached
files, other reasons are relevant, such as the lack of strong
consistency guarantees being provided by NFS implementations.
Replication of local file access performance on NAS using traditional
network protocol stacks has proven difficult, not because of protocol
processing overheads, but because of data copy costs in the network
Talpey & Juszczak Informational [Page 2]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
endpoints. This is especially true since host buses are now often
the main bottleneck in NAS architectures [MOG03] [CHA+01].
The External Data Representation [RFC4506] employed beneath NFS and
the Remote Procedure Call (RPC) [RFC5531] can add more data copies,
exacerbating the problem.
Data copy-avoidance designs have not been widely adopted for a
variety of reasons. [BRU99] points out that "many copy avoidance
techniques for network I/O are not applicable or may even backfire if
applied to file I/O". Other designs that eliminate unnecessary
copies, such as [PAI+00], are incompatible with existing APIs and
therefore force application changes.
In recent years, an effort to standardize a set of protocols for
Remote Direct Memory Access (RDMA) over the standard Internet
Protocol Suite has been chartered [RDDP]. A complete IP-based RDMA
protocol suite is available in the published Standards Track
specifications.
RDMA is a general solution to the problem of CPU overhead incurred
due to data copies, primarily at the receiver. Substantial research
has addressed this and has borne out the efficacy of the approach.
An overview of this is the "Remote Direct Memory Access (RDMA) over
IP Problem Statement" [RFC4297].
In addition to the per-byte savings of offloading data copies, RDMA-
enabled NICs (RNICS) offload the underlying protocol layers as well
(e.g., TCP), further reducing CPU overhead due to NAS processing.
1.1. Background
The RDDP Problem Statement [RFC4297] asserts:
High costs associated with copying are an issue primarily for
large scale systems ... with high bandwidth feeds, usually
multiprocessors and clusters, that are adversely affected by
copying overhead. Examples of such machines include all varieties
of servers: database servers, storage servers, application servers
for transaction processing, for e-commerce, and web serving,
content distribution, video distribution, backups, data mining and
decision support, and scientific computing.
Note that such servers almost exclusively service many concurrent
sessions (transport connections), which, in aggregate, are
responsible for > 1 Gbits/s of communication. Nonetheless, the
cost of copying overhead for a particular load is the same whether
from few or many sessions.
Talpey & Juszczak Informational [Page 3]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
Note that each of the servers listed above could be accessing their
file data as an NFS client, or as NFS serving the data to such
clients, or acting as both.
The CPU overhead of the NFS and TCP/IP protocol stacks (including
data copies or reduced copy workarounds) becomes a significant matter
in these clients and servers. File access using locally attached
disks imposes relatively low overhead due to the highly optimized I/O
path and direct memory access afforded to the storage controller.
This is not the case with NFS, which must pass data to, and
especially from, the network and network processing stack to the NFS
stack. Frequently, data copies are imposed on this transfer; in some
cases, several such copies are imposed in each direction.
Copies are potentially encountered in an NFS implementation
exchanging data to and from user address spaces, within kernel buffer
caches, in eXternal Data Representation (XDR) marshalling and
unmarshalling, and within network stacks and network drivers. Other
overheads such as serialization among multiple threads of execution
sharing a single NFS mount point and transport connection are
additionally encountered.
Numerous upper-layer protocols achieve extremely high bandwidth and
low overhead through the use of RDMA. [MAF+02] shows that the RDMA-
based Direct Access File System (with a user-level implementation of
the file system client) can outperform even a zero-copy
implementation of NFS [CHA+01] [CHA+99] [GAL+99] [KM02]. Also, file
data access implies the use of large Unequal Loss Protection (ULP)
messages. These large messages tend to amortize any increase in
per-message costs due to the offload of protocol processing incurred
when using RNICs while gaining the benefits of reduced per-byte
costs. Finally, the direct memory addressing afforded by RDMA avoids
many sources of contention on network resources.
2. Problem Statement
The principal performance problem encountered by NFS implementations
is the CPU overhead required to implement the protocol. Primary
among the sources of this overhead is the movement of data from NFS
protocol messages to its eventual destination in user buffers or
aligned kernel buffers. Due to the nature of the RPC and XDR
protocols, the NFS data payload arrives at arbitrary alignment,
necessitating a copy at the receiver, and the NFS requests are
completed in an arbitrary sequence.
The data copies consume system bus bandwidth and CPU time, reducing
the available system capacity for applications [RFC4297]. To date,
achieving zero-copy with NFS has required sophisticated, version-
Talpey & Juszczak Informational [Page 4]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
specific "header cracking" hardware and/or extensive platform-
specific virtual memory mapping tricks. Such approaches become even
more difficult for NFS version 4 due to the existence of the COMPOUND
operation and presence of Kerberos and other security information,
which further reduce alignment and greatly complicate ULP offload.
Furthermore, NFS is challenged by high-speed network fabrics such as
10 Gbits/s Ethernet. Performing even raw network I/O such as TCP is
an issue at such speeds with today's hardware. The problem is
fundamental in nature and has led the IETF to explore RDMA [RFC4297].
Zero-copy techniques benefit file protocols extensively, as they
enable direct user I/O, reduce the overhead of protocol stacks,
provide perfect alignment into caches, etc. Many studies have
already shown the performance benefits of such techniques [SKE+01]
[DCK+03] [FJNFS] [FJDAFS] [KM02] [MAF+02].
RDMA is compelling here for another reason; hardware-offloaded
networking support in itself does not avoid data copies, without
resorting to implementing part of the NFS protocol in the Network
Interface Card (NIC). Support of RDMA by NFS enables the highest
performance at the architecture level rather than by implementation;
this enables ubiquitous and interoperable solutions.
By providing file access performance equivalent to that of local file
systems, NFS over RDMA will enable applications running on a set of
client machines to interact through an NFS file system, just as
applications running on a single machine might interact through a
local file system.
3. File Protocol Architecture
NFS runs as an Open Network Computing (ONC) RPC [RFC5531]
application. Being a file access protocol, NFS is very "rich" in
data content (versus control information).
NFS messages can range from very small (under 100 bytes) to very
large (from many kilobytes to a megabyte or more). They are all
contained within an RPC message and follow a variable-length RPC
header. This layout provides an alignment challenge for the data
items contained in an NFS call (request) or reply (response) message.
In addition to the control information in each NFS call or reply
message, sometimes there are large "chunks" of application file data,
for example, read and write requests. With NFS version 4 (due to the
existence of the COMPOUND operation), there can be several of these
data chunks interspersed with control information.
Talpey & Juszczak Informational [Page 5]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
ONC RPC is a remote procedure call protocol that has been run over a
variety of transports. Most implementations today use UDP or TCP.
RPC messages are defined in terms of an eXternal Data Representation
(XDR) [RFC4506], which provides a canonical data representation
across a variety of host architectures. An XDR data stream is
conveyed differently on each type of transport. On UDP, RPC messages
are encapsulated inside datagrams, while on a TCP byte stream, RPC
messages are delineated by a record-marking protocol. An RDMA
transport also conveys RPC messages in a unique fashion that must be
fully described if client and server implementations are to
interoperate.
The RPC transport is responsible for conveying an RPC message from a
sender to a receiver. An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client. An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply. The call header
contains a transaction ID (XID) followed by the program and procedure
number as well as a security credential. An RPC reply header begins
with an XID that matches that of the RPC call message, followed by a
security verifier and results. All data in an RPC message is XDR
encoded.
The encoding of XDR data into transport buffers is referred to as
"marshalling", and the decoding of XDR data contained within
transport buffers and into destination RPC procedure result buffers,
is referred to as "unmarshalling". Therefore, the process of
marshalling takes place at the sender of any particular message, be
it an RPC request or an RPC response. Unmarshalling, of course,
takes place at the receiver.
Normally, any bulk data is moved (copied) as a result of the
unmarshalling process, because the destination address is not known
until the RPC code receives control and subsequently invokes the XDR
unmarshalling routine. In other words, XDR-encoded data is not
self-describing, and it carries no placement information. This
results in a data copy in most NFS implementations.
One mechanism by which the RPC layer may overcome this is for each
request to include placement information, to be used for direct
placement during XDR encode. This "write chunk" can avoid sending
bulk data inline in an RPC message and generally results in one or
more RDMA Write operations.
Similarly, a "read chunk", where placement information referring to
bulk data that may be directly fetched via one or more RDMA Read
operations during XDR decode, may be conveyed. The "read chunk" will
Talpey & Juszczak Informational [Page 6]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
therefore be useful in both RPC calls and replies, while the "write
chunk" is used solely in replies.
These "chunks" are the key concept in an existing proposal [RPCRDMA].
They convey what are effectively pointers to remote memory across the
network. They allow cooperating peers to exchange data outside of
XDR encodings but still use XDR for describing the data to be
transferred. And, finally, through use of XDR they maintain a large
degree of on-the-wire compatibility.
The central concept of the RDMA transport is to provide the
additional encoding conventions to convey this placement information
in transport-specific encoding, and to modify the XDR handling of
bulk data.
Block Diagram
+------------------------+-----------------------------------+
| NFS | NFS + RDMA |
+------------------------+----------------------+------------+
| Operations / Procedures | |
+-----------------------------------------------+ |
| RPC/XDR | |
+--------------------------------+--------------+ |
| Stream Transport | RDMA Transport |
+--------------------------------+---------------------------+
4. Sources of Overhead
Network and file protocol costs can be categorized as follows:
o per-byte costs - data touching costs such as checksum or data
copy. Today's network interface hardware commonly offloads the
checksum, which leaves the other major source of per-byte
overhead, data copy.
o per-packet costs - interrupts and lower-layer processing (LLP).
Today's network interface hardware also commonly coalesce
interrupts to reduce per-packet costs.
o per-message (request or response) costs - LLP and ULP processing.
Improvement from optimization becomes more important if the overhead
it targets is a larger share of the total cost. As other sources of
overhead, such as the checksumming and interrupt handling above are
eliminated, the remaining overheads (primarily data copy) loom
larger.
Talpey & Juszczak Informational [Page 7]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
With copies crossing the bus twice per copy, network processing
overhead is high whenever network bandwidth is large in comparison to
CPU and memory bandwidths. Generally, with today's end-systems, the
effects are observable at network speeds at or above 1 Gbit/s.
A common question is whether an increase in CPU processing power
alleviates the problem of high processing costs of network I/O. The
answer is no, it is the memory bandwidth that is the issue. Faster
CPUs do not help if the CPU spends most of its time waiting for
memory [RFC4297].
TCP offload engine (TOE) technology aims to offload the CPU by moving
TCP/IP protocol processing to the NIC. However, TOE technology by
itself does nothing to avoid necessary data copies within upper-layer
protocols. [MOG03] provides a description of the role TOE can play
in reducing per-packet and per-message costs. Beyond the offloads
commonly provided by today's network interface hardware, TOE alone
(without RDMA) helps in protocol header processing, but this has been
shown to be a minority component of the total protocol processing
overhead. [CHA+01]
Numerous software approaches to the optimization of network
throughput have been made. Experience has shown that network I/O
interacts with other aspects of system processing such as file I/O
and disk I/O [BRU99] [CHU96]. Zero-copy optimizations based on page
remapping [CHU96] can be dependent upon machine architecture, and are
not scalable to multi-processor architectures. Correct buffer
alignment and sizing together are needed to optimize the performance
of zero-copy movement mechanisms [SKE+01]. The NFS message layout
described above does not facilitate the splitting of headers from
data nor does it facilitate providing correct data buffer alignment.
4.1. Savings from TOE
The expected improvement of TOE specifically for NFS protocol
processing can be quantified and shown to be fundamentally limited.
[SHI+03] presents a set of "LAWS" parameters that serve to illustrate
the issues. In the TOE case, the copy cost can be viewed as part of
the application processing "a". Application processing increases the
LAWS "gamma", which is shown by the paper to result in a diminished
benefit for TOE.
For example, if the overhead is 20% TCP/IP, 30% copy, and 50% real
application work, then gamma is 80/20 or 4, which means the maximum
benefit of TOE is 1/gamma, or only 25%.
For RDMA (with embedded TOE) and the same example, the "overhead" (o)
offloaded or eliminated is 50% (20% + 30%). Therefore, in the RDMA
Talpey & Juszczak Informational [Page 8]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
case, gamma is 50/50 or 1, and the inverse gives the potential
benefit of 1 (100%), a factor of two.
CPU Overhead Reduction Factor
No Offload TCP Offload RDMA Offload
-----------+-------------+-------------
1.00x 1.25x 2.00x
The analysis in the paper shows that RDMA could improve throughput by
the same factor of two, even when the host is (just) powerful enough
to drive the full network bandwidth without RDMA. It can also be
shown that the speedup may be higher if network bandwidth grows
faster than Moore's Law, although the higher benefits will apply to a
narrow range of applications.
4.2. Savings from RDMA
Performance measurements directly comparing an NFS-over-RDMA
prototype with conventional network-based NFS processing are
described in [CAL+03]. Comparisons of Read throughput and CPU
overhead were performed on two types of Gigabit Ethernet adapters,
one type being a conventional adapter, and another type with RDMA
capability. The prototype RDMA protocol performed all transfers via
RDMA Read. The NFS layer in the study was measured while performing
read transfers, varying the transfer size and readahead depth across
ranges used by typical NFS deployments.
In these results, conventional network-based throughput was severely
limited by the client's CPU being saturated at 100% for all
transfers. Read throughput reached no more than 60 MBytes/s.
I/O Type Size Read Throughput CPU Utilization
Conventional 2 KB 20 MB/s 100%
Conventional 16 KB 40 MB/s 100%
Conventional 256 KB 60 MB/s 100%
However, over RDMA, throughput rose to the theoretical maximum
throughput of the platform, while saturating the single-CPU system
only at maximum throughput.
I/O Type Size Read Throughput CPU Utilization
RDMA 2 KB 10 MB/s 45%
RDMA 16 KB 40 MB/s 70%
RDMA 256 KB 100 MB/s 100%
The lower relative throughput of the RDMA prototype at the small
blocksize may be attributable to the RDMA Read imposed by the
Talpey & Juszczak Informational [Page 9]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
prototype protocol, which reduced the operation rate since it
introduces additional latency. As well, it may reflect the relative
increase of per-packet setup costs within the DMA portion of the
transfer.
5. Application of RDMA to NFS
Efficient file protocols require efficient data positioning and
movement. The client system knows the client memory address where
the application has data to be written or wants read data deposited.
The server system knows the server memory address where the local
file system will accept write data or has data to be read. Neither
peer however is aware of the others' data destination in the current
NFS, RPC, or XDR protocols. Existing NFS implementations have
struggled with the performance costs of data copies when using
traditional Ethernet transports.
With the onset of faster networks, the network I/O bottleneck will
worsen. Fortunately, new transports that support RDMA have emerged.
RDMA excels at bulk transfer efficiency; it is an efficient way to
deliver direct data placement and remove a major part of the problem:
data copies. RDMA also addresses other overheads, e.g., underlying
protocol offload, and offers separation of control information from
data.
The current NFS message layout provides the performance-enhancing
opportunity for an NFS-over-RDMA protocol that separates the control
information from data chunks while meeting the alignment needs of
both. The data chunks can be copied "directly" between the client
and server memory addresses above (with a single occurrence on each
memory bus) while the control information can be passed "inline".
[RPCRDMA] describes such a protocol.
6. Conclusions
NFS version 4 [RFC3530] has been granted "Proposed Standard" status.
The NFSv4 protocol was developed along several design points,
important among them: effective operation over wide-area networks,
including the Internet itself; strong security integrated into the
protocol; extensive cross-platform interoperability including
integrated locking semantics compatible with multiple operating
systems; and (this is key), protocol extension.
NFS version 4 is an excellent base on which to add the needed
performance enhancements and improved semantics described above. The
minor versioning support defined in NFS version 4 was designed to
support protocol improvements without disruption to the installed
base [NFSv4.1]. Evolutionary improvement of the protocol via minor
Talpey & Juszczak Informational [Page 10]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
versioning is a conservative and cautious approach to current and
future problems and shortcomings.
Many arguments can be made as to the efficacy of the file abstraction
in meeting the future needs of enterprise data service and the
Internet. Fine grained Quality of Service (QoS) policies (e.g., data
delivery, retention, availability, security, etc.) are high among
them.
It is vital that the NFS protocol continue to provide these benefits
to a wide range of applications, without its usefulness being
compromised by concerns about performance and semantic inadequacies.
This can reasonably be addressed in the existing NFS protocol
framework. A cautious evolutionary improvement of performance and
semantics allows building on the value already present in the NFS
protocol, while addressing new requirements that have arisen from the
application of networking technology.
7. Security Considerations
The NFS protocol, in conjunction with its layering on RPC, provides a
rich and widely interoperable security model to applications and
systems. Any layering of NFS-over-RDMA transports must address the
NFS security requirements, and additionally must ensure that no new
vulnerabilities are introduced. For RDMA, the integrity, and any
privacy, of the data stream are of particular importance.
The core goals of an NFS-to-RDMA binding are to reduce overhead and
to enable high performance. To support these goals while maintaining
required NFS security protection presents a special challenge.
Historically, the provision of integrity and privacy have been
implemented within the RPC layer, and their operation requires local
processing of messages exchanged with the RPC peer. This processing
imposes memory and processing overhead on a per-message basis,
exactly the overhead that RDMA is designed to avoid.
Therefore, it is a requirement that the RDMA transport binding
provide a means to delegate the integrity and privacy processing to
the RDMA hardware, in order to maintain the high level of performance
desired from the approach, while simultaneously providing the
existing highest levels of security required by the NFS protocol.
This in turn requires a means by which the RPC layer may invoke these
services from the RDMA provider, and for the NFS layer to negotiate
their use end-to-end.
The "Channel Binding" concept [RFC5056] together with "IPsec Channel
Connection Latching" [BTNSLATCH] provide a means by which the RPC and
NFS layers may delegate their session protection to the lower RDMA
Talpey & Juszczak Informational [Page 11]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
layers. An extension to the RPCSEC_GSS protocol [RFC5403] may be
employed to negotiate the use of these bindings, and to establish the
shared secrets necessary to protect the sessions.
The protocol described in [RPCRDMA] specifies the use of these
mechanisms, and they are required to implement the protocol.
An additional consideration is protection of the integrity and
privacy of local memory by the RDMA transport itself. The use of
RDMA by NFS must not introduce any vulnerabilities to system memory
contents, or to memory owned by user processes. These protections
are provided by the RDMA layer specifications, and specifically their
security models. It is required that any RDMA provider used for NFS
transport be conformant to the requirements of [RFC5042] in order to
satisfy these protections.
8. Acknowledgments
The authors wish to thank Jeff Chase who provided many useful
suggestions.
9. References
9.1. Normative References
[RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
Beame, C., Eisler, M., and D. Noveck, "Network File
System (NFS) version 4 Protocol", RFC 3530, April 2003.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, May 2009.
[RFC4506] Eisler, M., Ed., "XDR: External Data Representation
Standard", STD 67, RFC 4506, May 2006.
[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
Version 3 Protocol Specification", RFC 1813, June 1995.
[RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February
2009.
[RFC5056] Williams, N., "On the Use of Channel Bindings to Secure
Channels", RFC 5056, November 2007.
[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
Protocol (DDP) / Remote Direct Memory Access Protocol
(RDMAP) Security", RFC 5042, October 2007.
Talpey & Juszczak Informational [Page 12]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
9.2. Informative References
[BRU99] J. Brustoloni, "Interoperation of copy avoidance in
network and file I/O", in Proc. INFOCOM '99, pages 534-
542, New York, NY, Mar. 1999., IEEE. Also available from
http://www.cs.pitt.edu/~jcb/publs.html.
[BTNSLATCH] Williams, N., "IPsec Channels: Connection Latching", Work
in Progress, November 2008.
[CAL+03] B. Callaghan, T. Lingutla-Raj, A. Chiu, P. Staubach, O.
Asad, "NFS over RDMA", in Proceedings of ACM SIGCOMM
Summer 2003 NICELI Workshop.
[CHA+01] J. S. Chase, A. J. Gallatin, K. G. Yocum, "Endsystem
optimizations for high-speed TCP", IEEE Communications,
39(4):68-74, April 2001.
[CHA+99] J. S. Chase, D. C. Anderson, A. J. Gallatin, A. R.
Lebeck, K. G. Yocum, "Network I/O with Trapeze", in 1999
Hot Interconnects Symposium, August 1999.
[CHU96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX
1996 Annual Technical Conference, San Diego, CA, January
1996.
[DCK+03] M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D.
Noveck, T. Talpey, M. Wittle, "The Direct Access File
System", in Proceedings of 2nd USENIX Conference on File
and Storage Technologies (FAST '03), San Francisco, CA,
March 31 - April 2, 2003.
[FJDAFS] Fujitsu Prime Software Technologies, "Meet the DAFS
Performance with DAFS/VI Kernel Implementation using
cLAN", available from
http://www.pst.fujitsu.com/english/dafsdemo/index.html,
2001.
[FJNFS] Fujitsu Prime Software Technologies, "An Adaptation of
VIA to NFS on Linux", available from
http://www.pst.fujitsu.com/english/nfs/index.html, 2000.
[GAL+99] A. Gallatin, J. Chase, K. Yocum, "Trapeze/IP: TCP/IP at
Near-Gigabit Speeds", 1999 USENIX Technical Conference
(Freenix Track), June 1999.
Talpey & Juszczak Informational [Page 13]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
[KM02] K. Magoutis, "Design and Implementation of a Direct
Access File System (DAFS) Kernel Server for FreeBSD", in
Proceedings of USENIX BSDCon 2002 Conference, San
Francisco, CA, February 11-14, 2002.
[MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J.
Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E.
Gabber, "Structure and Performance of the Direct Access
File System (DAFS)", in Proceedings of 2002 USENIX Annual
Technical Conference, Monterey, CA, June 9-14, 2002.
[MOG03] J. Mogul, "TCP offload is a dumb idea whose time has
come", 9th Workshop on Hot Topics in Operating Systems
(HotOS IX), Lihue, HI, May 2003. USENIX.
[NFSv4.1] Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor
Version 1", Work in Progress, September 2008.
[PAI+00] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a
unified I/O buffering and caching system", ACM Trans.
Computer Systems, 18(1):37-66, Feb. 2000.
[RDDP] RDDP Working Group charter,
http://www.ietf.org/html.charters/rddpcharter.html.
[RFC4297] Romanow, A., Mogul, J., Talpey, T., and S. Bailey,
"Remote Direct Memory Access (RDMA) over IP Problem
Statement", RFC 4297, December 2005.
[RFC1094] Sun Microsystems, "NFS: Network File System Protocol
specification", RFC 1094, March 1989.
[RPCRDMA] Talpey, T. and B. Callaghan, "Remote Direct Memory Access
Transport for Remote Procedure Call", Work in Progress,
April 2008.
[SHI+03] P. Shivam, J. Chase, "On the Elusive Benefits of Protocol
Offload", Proceedings of ACM SIGCOMM Summer 2003 NICELI
Workshop, also available from
http://issg.cs.duke.edu/publications/niceli03.pdf.
[SKE+01] K.-A. Skevik, T. Plagemann, V. Goebel, P. Halvorsen,
"Evaluation of a Zero-Copy Protocol Implementation", in
Proceedings of the 27th Euromicro Conference - Multimedia
and Telecommunications Track (MTT'2001), Warsaw, Poland,
September 2001.
Talpey & Juszczak Informational [Page 14]
^L
RFC 5532 NFS RDMA Problem Statement May 2009
Authors' Addresses
Tom Talpey
170 Whitman St.
Stow, MA 01775 USA
Phone: +1 978 821-8577
EMail: tmtalpey@gmail.com
Chet Juszczak
P.O. Box 1467
Merrimack, NH 03054
Phone: +1 603 253-6602
EMail: chetnh@earthlink.net
Talpey & Juszczak Informational [Page 15]
^L
|