From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc5663.txt | 1571 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1571 insertions(+) create mode 100644 doc/rfc/rfc5663.txt (limited to 'doc/rfc/rfc5663.txt') diff --git a/doc/rfc/rfc5663.txt b/doc/rfc/rfc5663.txt new file mode 100644 index 0000000..55b403a --- /dev/null +++ b/doc/rfc/rfc5663.txt @@ -0,0 +1,1571 @@ + + + + + + +Internet Engineering Task Force (IETF) D. Black +Request for Comments: 5663 S. Fridella +Category: Standards Track EMC Corporation +ISSN: 2070-1721 J. Glasgow + Google + January 2010 + + + Parallel NFS (pNFS) Block/Volume Layout + +Abstract + + Parallel NFS (pNFS) extends Network File Sharing version 4 (NFSv4) to + allow clients to directly access file data on the storage used by the + NFSv4 server. This ability to bypass the server for data access can + increase both performance and parallelism, but requires additional + client functionality for data access, some of which is dependent on + the class of storage used. The main pNFS operations document + specifies storage-class-independent extensions to NFS; this document + specifies the additional extensions (primarily data structures) for + use of pNFS with block- and volume-based storage. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc5663. + + + + + + + + + + + + + + + + +Black, et al. Standards Track [Page 1] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +Copyright Notice + + Copyright (c) 2010 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Black, et al. Standards Track [Page 2] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +Table of Contents + + 1. Introduction ....................................................4 + 1.1. Conventions Used in This Document ..........................4 + 1.2. General Definitions ........................................5 + 1.3. Code Components Licensing Notice ...........................5 + 1.4. XDR Description ............................................5 + 2. Block Layout Description ........................................7 + 2.1. Background and Architecture ................................7 + 2.2. GETDEVICELIST and GETDEVICEINFO ............................9 + 2.2.1. 
Volume Identification ...............................9 + 2.2.2. Volume Topology ....................................10 + 2.2.3. GETDEVICELIST and GETDEVICEINFO deviceid4 ..........12 + 2.3. Data Structures: Extents and Extent Lists .................12 + 2.3.1. Layout Requests and Extent Lists ...................15 + 2.3.2. Layout Commits .....................................16 + 2.3.3. Layout Returns .....................................16 + 2.3.4. Client Copy-on-Write Processing ....................17 + 2.3.5. Extents are Permissions ............................18 + 2.3.6. End-of-file Processing .............................20 + 2.3.7. Layout Hints .......................................20 + 2.3.8. Client Fencing .....................................21 + 2.4. Crash Recovery Issues .....................................23 + 2.5. Recalling Resources: CB_RECALL_ANY ........................23 + 2.6. Transient and Permanent Errors ............................24 + 3. Security Considerations ........................................24 + 4. Conclusions ....................................................26 + 5. IANA Considerations ............................................26 + 6. Acknowledgments ................................................26 + 7. References .....................................................27 + 7.1. Normative References ......................................27 + 7.2. Informative References ....................................27 + + + + + + + + + + + + + + + + + + + +Black, et al. Standards Track [Page 3] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +1. Introduction + + Figure 1 shows the overall architecture of a Parallel NFS (pNFS) + system: + + +-----------+ + |+-----------+ +-----------+ + ||+-----------+ | | + ||| | NFSv4.1 + pNFS | | + +|| Clients |<------------------------------>| Server | + +| | | | + +-----------+ | | + ||| +-----------+ + ||| | + ||| | + ||| Storage +-----------+ | + ||| Protocol |+-----------+ | + ||+----------------||+-----------+ Control | + |+-----------------||| | Protocol| + +------------------+|| Storage |------------+ + +| Systems | + +-----------+ + + Figure 1: pNFS Architecture + + The overall approach is that pNFS-enhanced clients obtain sufficient + information from the server to enable them to access the underlying + storage (on the storage systems) directly. See the pNFS portion of + [NFSv4.1] for more details. This document is concerned with access + from pNFS clients to storage systems over storage protocols based on + blocks and volumes, such as the Small Computer System Interface + (SCSI) protocol family (e.g., parallel SCSI, Fibre Channel Protocol + (FCP) for Fibre Channel, Internet SCSI (iSCSI), Serial Attached SCSI + (SAS), and Fibre Channel over Ethernet (FCoE)). This class of + storage is referred to as block/volume storage. While the Server to + Storage System protocol, called the "Control Protocol", is not of + concern for interoperability here, it will typically also be a + block/volume protocol when clients use block/ volume protocols. + +1.1. Conventions Used in This Document + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [RFC2119]. + + + + + + + +Black, et al. Standards Track [Page 4] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +1.2. 
General Definitions + + The following definitions are provided for the purpose of providing + an appropriate context for the reader. + + Byte + + This document defines a byte as an octet, i.e., a datum exactly 8 + bits in length. + + Client + + The "client" is the entity that accesses the NFS server's + resources. The client may be an application that contains the + logic to access the NFS server directly. The client may also be + the traditional operating system client that provides remote file + system services for a set of applications. + + Server + + The "server" is the entity responsible for coordinating client + access to a set of file systems and is identified by a server + owner. + +1.3. Code Components Licensing Notice + + The external data representation (XDR) description and scripts for + extracting the XDR description are Code Components as described in + Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. + These Code Components are licensed according to the terms of Section + 4 of "Legal Provisions Relating to IETF Documents". + +1.4. XDR Description + + This document contains the XDR ([XDR]) description of the NFSv4.1 + block layout protocol. The XDR description is embedded in this + document in a way that makes it simple for the reader to extract into + a ready-to-compile form. The reader can feed this document into the + following shell script to produce the machine readable XDR + description of the NFSv4.1 block layout: + + #!/bin/sh + grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' + + + + + + + + +Black, et al. Standards Track [Page 5] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + That is, if the above script is stored in a file called "extract.sh", + and this document is in a file called "spec.txt", then the reader can + do: + + sh extract.sh < spec.txt > nfs4_block_layout_spec.x + + The effect of the script is to remove both leading white space and a + sentinel sequence of "///" from each matching line. + + The embedded XDR file header follows, with subsequent pieces embedded + throughout the document: + + /// /* + /// * This code was derived from RFC 5663. + /// * Please reproduce this note if possible. + /// */ + /// /* + /// * Copyright (c) 2010 IETF Trust and the persons identified + /// * as the document authors. All rights reserved. + /// * + /// * Redistribution and use in source and binary forms, with + /// * or without modification, are permitted provided that the + /// * following conditions are met: + /// * + /// * - Redistributions of source code must retain the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer. + /// * + /// * - Redistributions in binary form must reproduce the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer in the documentation and/or other + /// * materials provided with the distribution. + /// * + /// * - Neither the name of Internet Society, IETF or IETF + /// * Trust, nor the names of specific contributors, may be + /// * used to endorse or promote products derived from this + /// * software without specific prior written permission. + /// * + /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS + /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO + /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + + + +Black, et al. Standards Track [Page 6] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING + /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + /// */ + /// + /// /* + /// * nfs4_block_layout_prot.x + /// */ + /// + /// %#include "nfsv41.h" + /// + + The XDR code contained in this document depends on types from the + nfsv41.x file. This includes both nfs types that end with a 4, such + as offset4, length4, etc., as well as more generic types such as + uint32_t and uint64_t. + +2. Block Layout Description + +2.1. Background and Architecture + + The fundamental storage abstraction supported by block/volume storage + is a storage volume consisting of a sequential series of fixed-size + blocks. This can be thought of as a logical disk; it may be realized + by the storage system as a physical disk, a portion of a physical + disk, or something more complex (e.g., concatenation, striping, RAID, + and combinations thereof) involving multiple physical disks or + portions thereof. + + A pNFS layout for this block/volume class of storage is responsible + for mapping from an NFS file (or portion of a file) to the blocks of + storage volumes that contain the file. The blocks are expressed as + extents with 64-bit offsets and lengths using the existing NFSv4 + offset4 and length4 types. Clients must be able to perform I/O to + the block extents without affecting additional areas of storage + (especially important for writes); therefore, extents MUST be aligned + to 512-byte boundaries, and writable extents MUST be aligned to the + block size used by the NFSv4 server in managing the actual file + system (4 kilobytes and 8 kilobytes are common block sizes). This + block size is available as the NFSv4.1 layout_blksize attribute. + [NFSv4.1]. Readable extents SHOULD be aligned to the block size used + by the NFSv4 server, but in order to support legacy file systems with + fragments, alignment to 512-byte boundaries is acceptable. + + + + + + +Black, et al. Standards Track [Page 7] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + The pNFS operation for requesting a layout (LAYOUTGET) includes the + "layoutiomode4 loga_iomode" argument, which indicates whether the + requested layout is for read-only use or read-write use. A read-only + layout may contain holes that are read as zero, whereas a read-write + layout will contain allocated, but un-initialized storage in those + holes (read as zero, can be written by client). This document also + supports client participation in copy-on-write (e.g., for file + systems with snapshots) by providing both read-only and un- + initialized storage for the same range in a layout. Reads are + initially performed on the read-only storage, with writes going to + the un-initialized storage. After the first write that initializes + the un-initialized storage, all reads are performed to that now- + initialized writable storage, and the corresponding read-only storage + is no longer used. 
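
   [Editor's sketch, not part of RFC 5663]  A minimal C sketch of the
   copy-on-write read path just described may help: reads are served
   from the read-only storage until the corresponding un-initialized
   storage has been written, after which all reads come from the
   now-initialized writable storage.  The extent fields loosely mirror
   the pnfs_block_extent4 structure defined later in Section 2.3; the
   enum, struct, and function names are hypothetical client-side
   constructs, not protocol XDR.

      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical client-side view of an extent; fields mirror
       * bex_file_offset, bex_length, bex_storage_offset, bex_state. */
      enum bex_state {
          READ_WRITE_DATA,   /* initialized, readable and writable */
          READ_DATA,         /* read-only (copy-on-write source)   */
          INVALID_DATA,      /* allocated but un-initialized       */
          NONE_DATA          /* hole: no physical storage          */
      };

      struct extent {
          uint64_t file_offset;     /* bex_file_offset    */
          uint64_t length;          /* bex_length         */
          uint64_t storage_offset;  /* bex_storage_offset */
          enum bex_state state;     /* bex_state          */
      };

      /* Pick the extent a read must be served from when a writable
       * extent overlaps a read-only (copy-on-write) extent: once the
       * writable storage has been initialized, all reads use it;
       * until then, reads come from the read-only extent, or are
       * zero-filled if no read-only extent covers the range. */
      static const struct extent *
      cow_read_source(const struct extent *writable,
                      const struct extent *readonly)
      {
          if (writable->state == READ_WRITE_DATA)
              return writable;      /* already written: new data   */
          if (readonly != NULL && readonly->state == READ_DATA)
              return readonly;      /* untouched: original data    */
          return NULL;              /* INVALID_DATA hole: zeroes   */
      }
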
+ + The block/volume layout solution expands the security + responsibilities of the pNFS clients, and there are a number of + environments where the mandatory to implement security properties for + NFS cannot be satisfied. The additional security responsibilities of + the client follow, and a full discussion is present in Section 3, + "Security Considerations". + + o Typically, storage area network (SAN) disk arrays and SAN + protocols provide access control mechanisms (e.g., Logical Unit + Number (LUN) mapping and/or masking), which operate at the + granularity of individual hosts, not individual blocks. For this + reason, block-based protection must be provided by the client + software. + + o Similarly, SAN disk arrays and SAN protocols typically are not + able to validate NFS locks that apply to file regions. For + instance, if a file is covered by a mandatory read-only lock, the + server can ensure that only readable layouts for the file are + granted to pNFS clients. However, it is up to each pNFS client to + ensure that the readable layout is used only to service read + requests, and not to allow writes to the existing parts of the + file. + + Since block/volume storage systems are generally not capable of + enforcing such file-based security, in environments where pNFS + clients cannot be trusted to enforce such policies, pNFS block/volume + storage layouts SHOULD NOT be used. + + + + + + + + + +Black, et al. Standards Track [Page 8] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +2.2. GETDEVICELIST and GETDEVICEINFO + +2.2.1. Volume Identification + + Storage systems such as storage arrays can have multiple physical + network ports that need not be connected to a common network, + resulting in a pNFS client having simultaneous multipath access to + the same storage volumes via different ports on different networks. + + The networks may not even be the same technology -- for example, + access to the same volume via both iSCSI and Fibre Channel is + possible, hence network addresses are difficult to use for volume + identification. For this reason, this pNFS block layout identifies + storage volumes by content, for example providing the means to match + (unique portions of) labels used by volume managers. Volume + identification is performed by matching one or more opaque byte + sequences to specific parts of the stored data. Any block pNFS + system using this layout MUST support a means of content-based unique + volume identification that can be employed via the data structure + given here. + + /// struct pnfs_block_sig_component4 { /* disk signature component */ + /// int64_t bsc_sig_offset; /* byte offset of component + /// on volume*/ + /// opaque bsc_contents<>; /* contents of this component + /// of the signature */ + /// }; + /// + + Note that the opaque "bsc_contents" field in the + "pnfs_block_sig_component4" structure MUST NOT be interpreted as a + zero-terminated string, as it may contain embedded zero-valued bytes. + There are no restrictions on alignment (e.g., neither bsc_sig_offset + nor the length are required to be multiples of 4). The + bsc_sig_offset is a signed quantity, which, when positive, represents + an byte offset from the start of the volume, and when negative + represents an byte offset from the end of the volume. + + Negative offsets are permitted in order to simplify the client + implementation on systems where the device label is found at a fixed + offset from the end of the volume. 
If the server uses negative + offsets to describe the signature, then the client and server MUST + NOT see different volume sizes. Negative offsets SHOULD NOT be used + in systems that dynamically resize volumes unless care is taken to + ensure that the device label is always present at the offset from the + end of the volume as seen by the clients. + + + + + +Black, et al. Standards Track [Page 9] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + A signature is an array of up to "PNFS_BLOCK_MAX_SIG_COMP" (defined + below) signature components. The client MUST NOT assume that all + signature components are co-located within a single sector on a block + device. + + The pNFS client block layout driver uses this volume identification + to map pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE deviceid4s to + its local view of a LUN. + +2.2.2. Volume Topology + + The pNFS block server volume topology is expressed as an arbitrary + combination of base volume types enumerated in the following data + structures. The individual components of the topology are contained + in an array and components may refer to other components by using + array indices. + + /// enum pnfs_block_volume_type4 { + /// PNFS_BLOCK_VOLUME_SIMPLE = 0, /* volume maps to a single + /// LU */ + /// PNFS_BLOCK_VOLUME_SLICE = 1, /* volume is a slice of + /// another volume */ + /// PNFS_BLOCK_VOLUME_CONCAT = 2, /* volume is a + /// concatenation of + /// multiple volumes */ + /// PNFS_BLOCK_VOLUME_STRIPE = 3 /* volume is striped across + /// multiple volumes */ + /// }; + /// + /// const PNFS_BLOCK_MAX_SIG_COMP = 16;/* maximum components per + /// signature */ + /// struct pnfs_block_simple_volume_info4 { + /// pnfs_block_sig_component4 bsv_ds; + /// /* disk signature */ + /// }; + /// + /// + /// struct pnfs_block_slice_volume_info4 { + /// offset4 bsv_start; /* offset of the start of the + /// slice in bytes */ + /// length4 bsv_length; /* length of slice in bytes */ + /// uint32_t bsv_volume; /* array index of sliced + /// volume */ + /// }; + /// + /// struct pnfs_block_concat_volume_info4 { + /// uint32_t bcv_volumes<>; /* array indices of volumes + /// which are concatenated */ + + + +Black, et al. Standards Track [Page 10] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + /// }; + /// + /// struct pnfs_block_stripe_volume_info4 { + /// length4 bsv_stripe_unit; /* size of stripe in bytes */ + /// uint32_t bsv_volumes<>; /* array indices of volumes + /// which are striped across -- + /// MUST be same size */ + /// }; + /// + /// union pnfs_block_volume4 switch (pnfs_block_volume_type4 type) { + /// case PNFS_BLOCK_VOLUME_SIMPLE: + /// pnfs_block_simple_volume_info4 bv_simple_info; + /// case PNFS_BLOCK_VOLUME_SLICE: + /// pnfs_block_slice_volume_info4 bv_slice_info; + /// case PNFS_BLOCK_VOLUME_CONCAT: + /// pnfs_block_concat_volume_info4 bv_concat_info; + /// case PNFS_BLOCK_VOLUME_STRIPE: + /// pnfs_block_stripe_volume_info4 bv_stripe_info; + /// }; + /// + /// /* block layout specific type for da_addr_body */ + /// struct pnfs_block_deviceaddr4 { + /// pnfs_block_volume4 bda_volumes<>; /* array of volumes */ + /// }; + /// + + The "pnfs_block_deviceaddr4" data structure is a structure that + allows arbitrarily complex nested volume structures to be encoded. + The types of aggregations that are allowed are stripes, + concatenations, and slices. Note that the volume topology expressed + in the pnfs_block_deviceaddr4 data structure will always resolve to a + set of pnfs_block_volume_type4 PNFS_BLOCK_VOLUME_SIMPLE. 
The array + of volumes is ordered such that the root of the volume hierarchy is + the last element of the array. Concat, slice, and stripe volumes + MUST refer to volumes defined by lower indexed elements of the array. + + The "pnfs_block_device_addr4" data structure is returned by the + server as the storage-protocol-specific opaque field da_addr_body in + the "device_addr4" structure by a successful GETDEVICEINFO operation + [NFSv4.1]. + + As noted above, all device_addr4 structures eventually resolve to a + set of volumes of type PNFS_BLOCK_VOLUME_SIMPLE. These volumes are + each uniquely identified by a set of signature components. + Complicated volume hierarchies may be composed of dozens of volumes + each with several signature components; thus, the device address may + require several kilobytes. The client SHOULD be prepared to allocate + a large buffer to contain the result. In the case of the server + + + +Black, et al. Standards Track [Page 11] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + returning NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at + least gdir_mincount_bytes to contain the expected result and retry + the GETDEVICEINFO request. + +2.2.3. GETDEVICELIST and GETDEVICEINFO deviceid4 + + The server in response to a GETDEVICELIST request typically will + return a single "deviceid4" in the gdlr_deviceid_list array. This is + because the deviceid4 when passed to GETDEVICEINFO will return a + "device_addr4", which encodes the entire volume hierarchy. In the + case of copy-on-write file systems, the "gdlr_deviceid_list" array + may contain two deviceid4's, one referencing the read-only volume + hierarchy, and one referencing the writable volume hierarchy. There + is no required ordering of the readable and writable IDs in the array + as the volumes are uniquely identified by their deviceid4, and are + referred to by layouts using the deviceid4. Another example of the + server returning multiple device items occurs when the file handle + represents the root of a namespace spanning multiple physical file + systems on the server, each with a different volume hierarchy. In + this example, a server implementation may return either a list of + device IDs used by each of the physical file systems, or it may + return an empty list. + + Each deviceid4 returned by a successful GETDEVICELIST operation is a + shorthand id used to reference the whole volume topology. These + device IDs, as well as device IDs returned in extents of a LAYOUTGET + operation, can be used as input to the GETDEVICEINFO operation. + Decoding the "pnfs_block_deviceaddr4" results in a flat ordering of + data blocks mapped to PNFS_BLOCK_VOLUME_SIMPLE volumes. Combined + with the mapping to a client LUN described in Section 2.2.1 "Volume + Identification", a logical volume offset can be mapped to a block on + a pNFS client LUN [NFSv4.1]. + +2.3. Data Structures: Extents and Extent Lists + + A pNFS block layout is a list of extents within a flat array of data + blocks in a logical volume. The details of the volume topology can + be determined by using the GETDEVICEINFO operation (see discussion of + volume identification, Section 2.2 above). The block layout + describes the individual block extents on the volume that make up the + file. The offsets and length contained in an extent are specified in + units of bytes. + + + + + + + + + +Black, et al. 
Standards Track [Page 12] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + /// enum pnfs_block_extent_state4 { + /// PNFS_BLOCK_READ_WRITE_DATA = 0,/* the data located by this + /// extent is valid + /// for reading and writing. */ + /// PNFS_BLOCK_READ_DATA = 1, /* the data located by this + /// extent is valid for reading + /// only; it may not be + /// written. */ + /// PNFS_BLOCK_INVALID_DATA = 2, /* the location is valid; the + /// data is invalid. It is a + /// newly (pre-) allocated + /// extent. There is physical + /// space on the volume. */ + /// PNFS_BLOCK_NONE_DATA = 3 /* the location is invalid. + /// It is a hole in the file. + /// There is no physical space + /// on the volume. */ + /// }; + + + /// + /// struct pnfs_block_extent4 { + /// deviceid4 bex_vol_id; /* id of logical volume on + /// which extent of file is + /// stored. */ + /// offset4 bex_file_offset; /* the starting byte offset in + /// the file */ + /// length4 bex_length; /* the size in bytes of the + /// extent */ + /// offset4 bex_storage_offset; /* the starting byte offset + /// in the volume */ + /// pnfs_block_extent_state4 bex_state; + /// /* the state of this extent */ + /// }; + /// + /// /* block layout specific type for loc_body */ + /// struct pnfs_block_layout4 { + /// pnfs_block_extent4 blo_extents<>; + /// /* extents which make up this + /// layout. */ + /// }; + /// + + The block layout consists of a list of extents that map the logical + regions of the file to physical locations on a volume. The + "bex_storage_offset" field within each extent identifies a location + on the logical volume specified by the "bex_vol_id" field in the + extent. The bex_vol_id itself is shorthand for the whole topology of + + + +Black, et al. Standards Track [Page 13] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + the logical volume on which the file is stored. The client is + responsible for translating this logical offset into an offset on the + appropriate underlying SAN logical unit. In most cases, all extents + in a layout will reside on the same volume and thus have the same + bex_vol_id. In the case of copy-on-write file systems, the + PNFS_BLOCK_READ_DATA extents may have a different bex_vol_id from the + writable extents. + + Each extent maps a logical region of the file onto a portion of the + specified logical volume. The bex_file_offset, bex_length, and + bex_state fields for an extent returned from the server are valid for + all extents. In contrast, the interpretation of the + bex_storage_offset field depends on the value of bex_state as follows + (in increasing order): + + o PNFS_BLOCK_READ_WRITE_DATA means that bex_storage_offset is valid, + and points to valid/initialized data that can be read and written. + + o PNFS_BLOCK_READ_DATA means that bex_storage_offset is valid and + points to valid/ initialized data that can only be read. Write + operations are prohibited; the client may need to request a + read-write layout. + + o PNFS_BLOCK_INVALID_DATA means that bex_storage_offset is valid, + but points to invalid un-initialized data. This data must not be + physically read from the disk until it has been initialized. A + read request for a PNFS_BLOCK_INVALID_DATA extent must fill the + user buffer with zeros, unless the extent is covered by a + PNFS_BLOCK_READ_DATA extent of a copy-on-write file system. Write + requests must write whole server-sized blocks to the disk; bytes + not initialized by the user must be set to zero. 
Any write to + storage in a PNFS_BLOCK_INVALID_DATA extent changes the written + portion of the extent to PNFS_BLOCK_READ_WRITE_DATA; the pNFS + client is responsible for reporting this change via LAYOUTCOMMIT. + + o PNFS_BLOCK_NONE_DATA means that bex_storage_offset is not valid, + and this extent may not be used to satisfy write requests. Read + requests may be satisfied by zero-filling as for + PNFS_BLOCK_INVALID_DATA. PNFS_BLOCK_NONE_DATA extents may be + returned by requests for readable extents; they are never returned + if the request was for a writable extent. + + An extent list contains all relevant extents in increasing order of + the bex_file_offset of each extent; any ties are broken by increasing + order of the extent state (bex_state). + + + + + + +Black, et al. Standards Track [Page 14] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +2.3.1. Layout Requests and Extent Lists + + Each request for a layout specifies at least three parameters: file + offset, desired size, and minimum size. If the status of a request + indicates success, the extent list returned must meet the following + criteria: + + o A request for a readable (but not writable) layout returns only + PNFS_BLOCK_READ_DATA or PNFS_BLOCK_NONE_DATA extents (but not + PNFS_BLOCK_INVALID_DATA or PNFS_BLOCK_READ_WRITE_DATA extents). + + o A request for a writable layout returns PNFS_BLOCK_READ_WRITE_DATA + or PNFS_BLOCK_INVALID_DATA extents (but not PNFS_BLOCK_NONE_DATA + extents). It may also return PNFS_BLOCK_READ_DATA extents only + when the offset ranges in those extents are also covered by + PNFS_BLOCK_INVALID_DATA extents to permit writes. + + o The first extent in the list MUST contain the requested starting + offset. + + o The total size of extents within the requested range MUST cover at + least the minimum size. One exception is allowed: the total size + MAY be smaller if only readable extents were requested and EOF is + encountered. + + o Extents in the extent list MUST be logically contiguous for a + read-only layout. For a read-write layout, the set of writable + extents (i.e., excluding PNFS_BLOCK_READ_DATA extents) MUST be + logically contiguous. Every PNFS_BLOCK_READ_DATA extent in a + read-write layout MUST be covered by one or more + PNFS_BLOCK_INVALID_DATA extents. This overlap of + PNFS_BLOCK_READ_DATA and PNFS_BLOCK_INVALID_DATA extents is the + only permitted extent overlap. + + o Extents MUST be ordered in the list by starting offset, with + PNFS_BLOCK_READ_DATA extents preceding PNFS_BLOCK_INVALID_DATA + extents in the case of equal bex_file_offsets. + + If the minimum requested size, loga_minlength, is zero, this is an + indication to the metadata server that the client desires any layout + at offset loga_offset or less that the metadata server has "readily + available". Readily is subjective, and depends on the layout type + and the pNFS server implementation. For block layout servers, + readily available SHOULD be interpreted such that readable layouts + are always available, even if some extents are in the + PNFS_BLOCK_NONE_DATA state. When processing requests for writable + layouts, a layout is readily available if extents can be returned in + the PNFS_BLOCK_READ_WRITE_DATA state. + + + +Black, et al. Standards Track [Page 15] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +2.3.2. 
Layout Commits + + /// /* block layout specific type for lou_body */ + /// struct pnfs_block_layoutupdate4 { + /// pnfs_block_extent4 blu_commit_list<>; + /// /* list of extents which + /// * now contain valid data. + /// */ + /// }; + /// + + The "pnfs_block_layoutupdate4" structure is used by the client as the + block-protocol specific argument in a LAYOUTCOMMIT operation. The + "blu_commit_list" field is an extent list covering regions of the + file layout that were previously in the PNFS_BLOCK_INVALID_DATA + state, but have been written by the client and should now be + considered in the PNFS_BLOCK_READ_WRITE_DATA state. The bex_state + field of each extent in the blu_commit_list MUST be set to + PNFS_BLOCK_READ_WRITE_DATA. The extents in the commit list MUST be + disjoint and MUST be sorted by bex_file_offset. The + bex_storage_offset field is unused. Implementors should be aware + that a server may be unable to commit regions at a granularity + smaller than a file-system block (typically 4 KB or 8 KB). As noted + above, the block-size that the server uses is available as an NFSv4 + attribute, and any extents included in the "blu_commit_list" MUST be + aligned to this granularity and have a size that is a multiple of + this granularity. If the client believes that its actions have moved + the end-of-file into the middle of a block being committed, the + client MUST write zeroes from the end-of-file to the end of that + block before committing the block. Failure to do so may result in + junk (un-initialized data) appearing in that area if the file is + subsequently extended by moving the end-of-file. + +2.3.3. Layout Returns + + The LAYOUTRETURN operation is done without any block layout specific + data. When the LAYOUTRETURN operation specifies a + LAYOUTRETURN4_FILE_return type, then the layoutreturn_file4 data + structure specifies the region of the file layout that is no longer + needed by the client. The opaque "lrf_body" field of the + "layoutreturn_file4" data structure MUST have length zero. A + LAYOUTRETURN operation represents an explicit release of resources by + the client, usually done for the purpose of avoiding unnecessary + CB_LAYOUTRECALL operations in the future. The client may return + disjoint regions of the file by using multiple LAYOUTRETURN + operations within a single COMPOUND operation. + + + + + +Black, et al. Standards Track [Page 16] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + Note that the block/volume layout supports unilateral layout + revocation. When a layout is unilaterally revoked by the server, + usually due to the client's lease time expiring, or a delegation + being recalled, or the client failing to return a layout in a timely + manner, it is important for the sake of correctness that any in- + flight I/Os that the client issued before the layout was revoked are + rejected at the storage. For the block/volume protocol, this is + possible by fencing a client with an expired layout timer from the + physical storage. Note, however, that the granularity of this + operation can only be at the host/logical-unit level. Thus, if one + of a client's layouts is unilaterally revoked by the server, it will + effectively render useless *all* of the client's layouts for files + located on the storage units comprising the logical volume. This may + render useless the client's layouts for files in other file systems. + +2.3.4. Client Copy-on-Write Processing + + Copy-on-write is a mechanism used to support file and/or file system + snapshots. 
When writing to unaligned regions, or to regions smaller + than a file system block, the writer must copy the portions of the + original file data to a new location on disk. This behavior can + either be implemented on the client or the server. The paragraphs + below describe how a pNFS block layout client implements access to a + file that requires copy-on-write semantics. + + Distinguishing the PNFS_BLOCK_READ_WRITE_DATA and + PNFS_BLOCK_READ_DATA extent types in combination with the allowed + overlap of PNFS_BLOCK_READ_DATA extents with PNFS_BLOCK_INVALID_DATA + extents allows copy-on-write processing to be done by pNFS clients. + In classic NFS, this operation would be done by the server. Since + pNFS enables clients to do direct block access, it is useful for + clients to participate in copy-on-write operations. All block/volume + pNFS clients MUST support this copy-on-write processing. + + When a client wishes to write data covered by a PNFS_BLOCK_READ_DATA + extent, it MUST have requested a writable layout from the server; + that layout will contain PNFS_BLOCK_INVALID_DATA extents to cover all + the data ranges of that layout's PNFS_BLOCK_READ_DATA extents. More + precisely, for any bex_file_offset range covered by one or more + PNFS_BLOCK_READ_DATA extents in a writable layout, the server MUST + include one or more PNFS_BLOCK_INVALID_DATA extents in the layout + that cover the same bex_file_offset range. When performing a write + to such an area of a layout, the client MUST effectively copy the + data from the PNFS_BLOCK_READ_DATA extent for any partial blocks of + bex_file_offset and range, merge in the changes to be written, and + write the result to the PNFS_BLOCK_INVALID_DATA extent for the blocks + for that bex_file_offset and range. That is, if entire blocks of + data are to be overwritten by an operation, the corresponding + + + +Black, et al. Standards Track [Page 17] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + PNFS_BLOCK_READ_DATA blocks need not be fetched, but any partial- + block writes must be merged with data fetched via + PNFS_BLOCK_READ_DATA extents before storing the result via + PNFS_BLOCK_INVALID_DATA extents. For the purposes of this + discussion, "entire blocks" and "partial blocks" refer to the + server's file-system block size. Storing of data in a + PNFS_BLOCK_INVALID_DATA extent converts the written portion of the + PNFS_BLOCK_INVALID_DATA extent to a PNFS_BLOCK_READ_WRITE_DATA + extent; all subsequent reads MUST be performed from this extent; the + corresponding portion of the PNFS_BLOCK_READ_DATA extent MUST NOT be + used after storing data in a PNFS_BLOCK_INVALID_DATA extent. If a + client writes only a portion of an extent, the extent may be split at + block aligned boundaries. + + When a client wishes to write data to a PNFS_BLOCK_INVALID_DATA + extent that is not covered by a PNFS_BLOCK_READ_DATA extent, it MUST + treat this write identically to a write to a file not involved with + copy-on-write semantics. Thus, data must be written in at least + block-sized increments, aligned to multiples of block-sized offsets, + and unwritten portions of blocks must be zero filled. + + In the LAYOUTCOMMIT operation that normally sends updated layout + information back to the server, for writable data, some + PNFS_BLOCK_INVALID_DATA extents may be committed as + PNFS_BLOCK_READ_WRITE_DATA extents, signifying that the storage at + the corresponding bex_storage_offset values has been stored into and + is now to be considered as valid data to be read. 
+ PNFS_BLOCK_READ_DATA extents are not committed to the server. For + extents that the client receives via LAYOUTGET as + PNFS_BLOCK_INVALID_DATA and returns via LAYOUTCOMMIT as + PNFS_BLOCK_READ_WRITE_DATA, the server will understand that the + PNFS_BLOCK_READ_DATA mapping for that extent is no longer valid or + necessary for that file. + +2.3.5. Extents are Permissions + + Layout extents returned to pNFS clients grant permission to read or + write; PNFS_BLOCK_READ_DATA and PNFS_BLOCK_NONE_DATA are read-only + (PNFS_BLOCK_NONE_DATA reads as zeroes), PNFS_BLOCK_READ_WRITE_DATA + and PNFS_BLOCK_INVALID_DATA are read/write, (PNFS_BLOCK_INVALID_DATA + reads as zeros, any write converts it to PNFS_BLOCK_READ_WRITE_DATA). + This is the only means a client has of obtaining permission to + perform direct I/O to storage devices; a pNFS client MUST NOT perform + direct I/O operations that are not permitted by an extent held by the + client. Client adherence to this rule places the pNFS server in + control of potentially conflicting storage device operations, + enabling the server to determine what does conflict and how to avoid + conflicts by granting and recalling extents to/from clients. + + + +Black, et al. Standards Track [Page 18] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + Block/volume class storage devices are not required to perform read + and write operations atomically. Overlapping concurrent read and + write operations to the same data may cause the read to return a + mixture of before-write and after-write data. Overlapping write + operations can be worse, as the result could be a mixture of data + from the two write operations; data corruption can occur if the + underlying storage is striped and the operations complete in + different orders on different stripes. When there are multiple + clients who wish to access the same data, a pNFS server can avoid + these conflicts by implementing a concurrency control policy of + single writer XOR multiple readers. This policy MUST be implemented + when storage devices do not provide atomicity for concurrent + read/write and write/write operations to the same data. + + If a client makes a layout request that conflicts with an existing + layout delegation, the request will be rejected with the error + NFS4ERR_LAYOUTTRYLATER. This client is then expected to retry the + request after a short interval. During this interval, the server + SHOULD recall the conflicting portion of the layout delegation from + the client that currently holds it. This reject-and-retry approach + does not prevent client starvation when there is contention for the + layout of a particular file. For this reason, a pNFS server SHOULD + implement a mechanism to prevent starvation. One possibility is that + the server can maintain a queue of rejected layout requests. Each + new layout request can be checked to see if it conflicts with a + previous rejected request, and if so, the newer request can be + rejected. Once the original requesting client retries its request, + its entry in the rejected request queue can be cleared, or the entry + in the rejected request queue can be removed when it reaches a + certain age. + + NFSv4 supports mandatory locks and share reservations. These are + mechanisms that clients can use to restrict the set of I/O operations + that are permissible to other clients. Since all I/O operations + ultimately arrive at the NFSv4 server for processing, the server is + in a position to enforce these restrictions. 
However, with pNFS + layouts, I/Os will be issued from the clients that hold the layouts + directly to the storage devices that host the data. These devices + have no knowledge of files, mandatory locks, or share reservations, + and are not in a position to enforce such restrictions. For this + reason the NFSv4 server MUST NOT grant layouts that conflict with + mandatory locks or share reservations. Further, if a conflicting + mandatory lock request or a conflicting open request arrives at the + server, the server MUST recall the part of the layout in conflict + with the request before granting the request. + + + + + + +Black, et al. Standards Track [Page 19] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +2.3.6. End-of-file Processing + + The end-of-file location can be changed in two ways: implicitly as + the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file, + or explicitly as the result of a SETATTR request. Typically, when a + file is truncated by an NFSv4 client via the SETATTR call, the server + frees any disk blocks belonging to the file that are beyond the new + end-of-file byte, and MUST write zeros to the portion of the new + end-of-file block beyond the new end-of-file byte. These actions + render any pNFS layouts that refer to the blocks that are freed or + written semantically invalid. Therefore, the server MUST recall from + clients the portions of any pNFS layouts that refer to blocks that + will be freed or written by the server before processing the truncate + request. These recalls may take time to complete; as explained in + [NFSv4.1], if the server cannot respond to the client SETATTR request + in a reasonable amount of time, it SHOULD reply to the client with + the error NFS4ERR_DELAY. + + Blocks in the PNFS_BLOCK_INVALID_DATA state that lie beyond the new + end-of-file block present a special case. The server has reserved + these blocks for use by a pNFS client with a writable layout for the + file, but the client has yet to commit the blocks, and they are not + yet a part of the file mapping on disk. The server MAY free these + blocks while processing the SETATTR request. If so, the server MUST + recall any layouts from pNFS clients that refer to the blocks before + processing the truncate. If the server does not free the + PNFS_BLOCK_INVALID_DATA blocks while processing the SETATTR request, + it need not recall layouts that refer only to the PNFS_BLOCK_INVALID + DATA blocks. + + When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond + the current end-of-file, or extended explicitly by a SETATTR request, + the server need not recall any portions of any pNFS layouts. + +2.3.7. Layout Hints + + The SETATTR operation supports a layout hint attribute [NFSv4.1]. + When the client sets a layout hint (data type layouthint4) with a + layout type of LAYOUT4_BLOCK_VOLUME (the loh_type field), the + loh_body field contains a value of data type pnfs_block_layouthint4. + + /// /* block layout specific type for loh_body */ + /// struct pnfs_block_layouthint4 { + /// uint64_t blh_maximum_io_time; /* maximum i/o time in seconds + /// */ + /// }; + /// + + + + +Black, et al. Standards Track [Page 20] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + The block layout client uses the layout hint data structure to + communicate to the server the maximum time that it may take an I/O to + execute on the client. Clients using block layouts MUST set the + layout hint attribute before using LAYOUTGET operations. + +2.3.8. 
Client Fencing + + The pNFS block protocol must handle situations in which a system + failure, typically a network connectivity issue, requires the server + to unilaterally revoke extents from one client in order to transfer + the extents to another client. The pNFS server implementation MUST + ensure that when resources are transferred to another client, they + are not used by the client originally owning them, and this must be + ensured against any possible combination of partitions and delays + among all of the participants to the protocol (server, storage and + client). Two approaches to guaranteeing this isolation are possible + and are discussed below. + + One implementation choice for fencing the block client from the block + storage is the use of LUN masking or mapping at the storage systems + or storage area network to disable access by the client to be + isolated. This requires server access to a management interface for + the storage system and authorization to perform LUN masking and + management operations. For example, the Storage Management + Initiative Specification (SMI-S) [SMIS] provides a means to discover + and mask LUNs, including a means of associating clients with the + necessary World Wide Names or Initiator names to be masked. + + In the absence of support for LUN masking, the server has to rely on + the clients to implement a timed-lease I/O fencing mechanism. + Because clients do not know if the server is using LUN masking, in + all cases, the client MUST implement timed-lease fencing. In timed- + lease fencing, we define two time periods, the first, "lease_time" is + the length of a lease as defined by the server's lease_time attribute + (see [NFSv4.1]), and the second, "blh_maximum_io_time" is the maximum + time it can take for a client I/O to the storage system to either + complete or fail; this value is often 30 seconds or 60 seconds, but + may be longer in some environments. If the maximum client I/O time + cannot be bounded, the client MUST use a value of all 1s as the + blh_maximum_io_time. + + After a new client ID is established, the client MUST use SETATTR + with a layout hint of type LAYOUT4_BLOCK_VOLUME to inform the server + of its maximum I/O time prior to issuing the first LAYOUTGET + operation. While the maximum I/O time hint is a per-file attribute, + it is actually a per-client characteristic. Thus, the server MUST + maintain the last maximum I/O time hint sent separately for each + client. Each time the maximum I/O time changes, the server MUST + + + +Black, et al. Standards Track [Page 21] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + apply it to all files for which the client has a layout. If the + client does not specify this attribute on a file for which a block + layout is requested, the server SHOULD use the most recent value + provided by the same client for any file; if that client has not + provided a value for this attribute, the server SHOULD reject the + layout request with the error NFS4ERR_LAYOUTUNAVAILABLE. The client + SHOULD NOT send a SETATTR of the layout hint with every LAYOUTGET. A + server that implements fencing via LUN masking SHOULD accept any + maximum I/O time value from a client. A server that does not + implement fencing may return an error NFS4ERR_INVAL to the SETATTR + operation. Such a server SHOULD return NFS4ERR_INVAL when a client + sends an unbounded maximum I/O time (all 1s), or when the maximum I/O + time is significantly greater than that of other clients using block + layouts with pNFS. 
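
   [Editor's sketch, not part of RFC 5663]  The timed-lease fencing
   rules of this section reduce to two time comparisons, sketched
   below in C.  The structure and helper names are hypothetical;
   "lease_time" is the server's lease_time attribute and
   "blh_maximum_io_time" is the layout hint described above, and the
   server-side wait of lease_time plus blh_maximum_io_time is the rule
   specified later in this section.

      #include <stdbool.h>
      #include <stdint.h>
      #include <time.h>

      /* Hypothetical client-side bookkeeping for timed-lease fencing. */
      struct pnfs_lease_state {
          time_t   last_renewal;        /* when the last renewing op was sent */
          uint32_t lease_time;          /* server lease_time, in seconds      */
          uint64_t blh_maximum_io_time; /* client max I/O time, in seconds    */
      };

      /* A client may issue direct I/O under a layout only while fewer
       * than lease_time seconds have elapsed since the operation that
       * last renewed the lease. */
      static bool
      layout_still_usable(const struct pnfs_lease_state *ls, time_t now)
      {
          return (now - ls->last_renewal) < (time_t)ls->lease_time;
      }

      /* Absent two-way communication on the fore channel, a server
       * must wait at least lease_time + blh_maximum_io_time seconds,
       * measured from the last lease-renewing operation it received,
       * before transferring the client's extents to another client. */
      static uint64_t
      server_fence_wait(const struct pnfs_lease_state *ls)
      {
          return (uint64_t)ls->lease_time + ls->blh_maximum_io_time;
      }
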
+ + When a client receives the error NFS4ERR_INVAL in response to the + SETATTR operation for a layout hint, the client MUST NOT use the + LAYOUTGET operation. After responding with NFS4ERR_INVAL to the + SETATTR for layout hint, the server MUST return the error + NFS4ERR_LAYOUTUNAVAILABLE to all subsequent LAYOUTGET operations from + that client. Thus, the server, by returning either NFS4ERR_INVAL or + NFS4_OK determines whether or not a client with a large, or an + unbounded-maximum I/O time may use pNFS. + + Using the lease time and the maximum I/O time values, we specify the + behavior of the client and server as follows. + + When a client receives layout information via a LAYOUTGET operation, + those layouts are valid for at most "lease_time" seconds from when + the server granted them. A layout is renewed by any successful + SEQUENCE operation, or whenever a new stateid is created or updated + (see the section "Lease Renewal" of [NFSv4.1]). If the layout lease + is not renewed prior to expiration, the client MUST cease to use the + layout after "lease_time" seconds from when it either sent the + original LAYOUTGET command or sent the last operation renewing the + lease. In other words, the client may not issue any I/O to blocks + specified by an expired layout. In the presence of large + communication delays between the client and server, it is even + possible for the lease to expire prior to the server response + arriving at the client. In such a situation, the client MUST NOT use + the expired layouts, and SHOULD revert to using standard NFSv41 READ + and WRITE operations. Furthermore, the client must be configured + such that I/O operations complete within the "blh_maximum_io_time" + even in the presence of multipath drivers that will retry I/Os via + multiple paths. + + + + + + +Black, et al. Standards Track [Page 22] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + As stated in the "Dealing with Lease Expiration on the Client" + section of [NFSv4.1], if any SEQUENCE operation is successful, but + sr_status_flag has SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, + SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or + SEQ4_STATUS_ADMIN_STATE_REVOKED is set, the client MUST immediately + cease to use all layouts and device ID to device address mappings + associated with the corresponding server. + + In the absence of known two-way communication between the client and + the server on the fore channel, the server must wait for at least the + time period "lease_time" plus "blh_maximum_io_time" before + transferring layouts from the original client to any other client. + The server, like the client, must take a conservative approach, and + start the lease expiration timer from the time that it received the + operation that last renewed the lease. + +2.4. Crash Recovery Issues + + A critical requirement in crash recovery is that both the client and + the server know when the other has failed. Additionally, it is + required that a client sees a consistent view of data across server + restarts. These requirements and a full discussion of crash recovery + issues are covered in the "Crash Recovery" section of the NFSv41 + specification [NFSv4.1]. This document contains additional crash + recovery material specific only to the block/volume layout. + + When the server crashes while the client holds a writable layout, and + the client has written data to blocks covered by the layout, and the + blocks are still in the PNFS_BLOCK_INVALID_DATA state, the client has + two options for recovery. 
If the data that has been written to these + blocks is still cached by the client, the client can simply re-write + the data via NFSv4, once the server has come back online. However, + if the data is no longer in the client's cache, the client MUST NOT + attempt to source the data from the data servers. Instead, it should + attempt to commit the blocks in question to the server during the + server's recovery grace period, by sending a LAYOUTCOMMIT with the + "loca_reclaim" flag set to true. This process is described in detail + in Section 18.42.4 of [NFSv4.1]. + +2.5. Recalling Resources: CB_RECALL_ANY + + The server may decide that it cannot hold all of the state for + layouts without running out of resources. In such a case, it is free + to recall individual layouts using CB_LAYOUTRECALL to reduce the + load, or it may choose to request that the client return any layout. + + + + + + +Black, et al. Standards Track [Page 23] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + The NFSv4.1 spec [NFSv4.1] defines the following types: + + const RCA4_TYPE_MASK_BLK_LAYOUT = 4; + + struct CB_RECALL_ANY4args { + uint32_t craa_objects_to_keep; + bitmap4 craa_type_mask; + }; + + When the server sends a CB_RECALL_ANY request to a client specifying + the RCA4_TYPE_MASK_BLK_LAYOUT bit in craa_type_mask, the client + should immediately respond with NFS4_OK, and then asynchronously + return complete file layouts until the number of files with layouts + cached on the client is less than craa_object_to_keep. + +2.6. Transient and Permanent Errors + + The server may respond to LAYOUTGET with a variety of error statuses. + These errors can convey transient conditions or more permanent + conditions that are unlikely to be resolved soon. + + The transient errors, NFS4ERR_RECALLCONFLICT and NFS4ERR_TRYLATER, + are used to indicate that the server cannot immediately grant the + layout to the client. In the former case, this is because the server + has recently issued a CB_LAYOUTRECALL to the requesting client, + whereas in the case of NFS4ERR_TRYLATER, the server cannot grant the + request possibly due to sharing conflicts with other clients. In + either case, a reasonable approach for the client is to wait several + milliseconds and retry the request. The client SHOULD track the + number of retries, and if forward progress is not made, the client + SHOULD send the READ or WRITE operation directly to the server. + + The error NFS4ERR_LAYOUTUNAVAILABLE may be returned by the server if + layouts are not supported for the requested file or its containing + file system. The server may also return this error code if the + server is the progress of migrating the file from secondary storage, + or for any other reason that causes the server to be unable to supply + the layout. As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the + client SHOULD send future READ and WRITE requests directly to the + server. It is expected that a client will not cache the file's + layoutunavailable state forever, particular if the file is closed, + and thus eventually, the client MAY reissue a LAYOUTGET operation. + +3. Security Considerations + + Typically, SAN disk arrays and SAN protocols provide access control + mechanisms (e.g., LUN mapping and/or masking) that operate at the + granularity of individual hosts. The functionality provided by such + + + +Black, et al. 
Standards Track [Page 24] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + mechanisms makes it possible for the server to "fence" individual + client machines from certain physical disks -- that is to say, to + prevent individual client machines from reading or writing to certain + physical disks. Finer-grained access control methods are not + generally available. For this reason, certain security + responsibilities are delegated to pNFS clients for block/volume + layouts. Block/volume storage systems generally control access at a + volume granularity, and hence pNFS clients have to be trusted to only + perform accesses allowed by the layout extents they currently hold + (e.g., and not access storage for files on which a layout extent is + not held). In general, the server will not be able to prevent a + client that holds a layout for a file from accessing parts of the + physical disk not covered by the layout. Similarly, the server will + not be able to prevent a client from accessing blocks covered by a + layout that it has already returned. This block-based level of + protection must be provided by the client software. + + An alternative method of block/volume protocol use is for the storage + devices to export virtualized block addresses, which do reflect the + files to which blocks belong. These virtual block addresses are + exported to pNFS clients via layouts. This allows the storage device + to make appropriate access checks, while mapping virtual block + addresses to physical block addresses. In environments where the + security requirements are such that client-side protection from + access to storage outside of the authorized layout extents is not + sufficient, pNFS block/volume storage layouts SHOULD NOT be used + unless the storage device is able to implement the appropriate access + checks, via use of virtualized block addresses or other means. In + contrast, an environment where client-side protection may suffice + consists of co-located clients, server and storage systems in a data + center with a physically isolated SAN under control of a single + system administrator or small group of system administrators. + + This also has implications for some NFSv4 functionality outside pNFS. + For instance, if a file is covered by a mandatory read-only lock, the + server can ensure that only readable layouts for the file are granted + to pNFS clients. However, it is up to each pNFS client to ensure + that the readable layout is used only to service read requests, and + not to allow writes to the existing parts of the file. Similarly, + block/volume storage devices are unable to validate NFS Access + Control Lists (ACLs) and file open modes, so the client must enforce + the policies before sending a READ or WRITE request to the storage + device. Since block/volume storage systems are generally not capable + of enforcing such file-based security, in environments where pNFS + clients cannot be trusted to enforce such policies, pNFS block/volume + storage layouts SHOULD NOT be used. + + + + + +Black, et al. Standards Track [Page 25] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + Access to block/volume storage is logically at a lower layer of the + I/O stack than NFSv4, and hence NFSv4 security is not directly + applicable to protocols that access such storage directly. Depending + on the protocol, some of the security mechanisms provided by NFSv4 + (e.g., encryption, cryptographic integrity) may not be available or + may be provided via different means. 
At one extreme, pNFS with + block/volume storage can be used with storage access protocols (e.g., + parallel SCSI) that provide essentially no security functionality. + At the other extreme, pNFS may be used with storage protocols such as + iSCSI that can provide significant security functionality. It is the + responsibility of those administering and deploying pNFS with a + block/volume storage access protocol to ensure that appropriate + protection is provided to that protocol (physical security is a + common means for protocols not based on IP). In environments where + the security requirements for the storage protocol cannot be met, + pNFS block/volume storage layouts SHOULD NOT be used. + + When security is available for a storage protocol, it is generally at + a different granularity and with a different notion of identity than + NFSv4 (e.g., NFSv4 controls user access to files, iSCSI controls + initiator access to volumes). The responsibility for enforcing + appropriate correspondences between these security layers is placed + upon the pNFS client. As with the issues in the first paragraph of + this section, in environments where the security requirements are + such that client-side protection from access to storage outside of + the layout is not sufficient, pNFS block/volume storage layouts + SHOULD NOT be used. + +4. Conclusions + + This document specifies the block/volume layout type for pNFS and + associated functionality. + +5. IANA Considerations + + There are no IANA considerations in this document. All pNFS IANA + Considerations are covered in [NFSv4.1]. + +6. Acknowledgments + + This document draws extensively on the authors' familiarity with the + mapping functionality and protocol in EMC's Multi-Path File System + (MPFS) (previously named HighRoad) system [MPFS]. The protocol used + by MPFS is called FMP (File Mapping Protocol); it is an add-on + protocol that runs in parallel with file system protocols such as + NFSv3 to provide pNFS-like functionality for block/volume storage. + While drawing on FMP, the data structures and functional + considerations in this document differ in significant ways, based on + + + +Black, et al. Standards Track [Page 26] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + + lessons learned and the opportunity to take advantage of NFSv4 + features such as COMPOUND operations. The design to support pNFS + client participation in copy-on-write is based on text and ideas + contributed by Craig Everhart. + + Andy Adamson, Ben Campbell, Richard Chandler, Benny Halevy, Fredric + Isaman, and Mario Wurzl all helped to review versions of this + specification. + +7. References + +7.1. Normative References + + [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", + http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf, + November 2008. + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, March 1997. + + [NFSv4.1] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., + "Network File System (NFS) Version 4 Minor Version 1 + Protocol", RFC 5661, January 2010. + + [XDR] Eisler, M., Ed., "XDR: External Data Representation + Standard", STD 67, RFC 4506, May 2006. + +7.2. Informative References + + [MPFS] EMC Corporation, "EMC Celerra Multi-Path File System + (MPFS)", EMC Data Sheet, + http://www.emc.com/collateral/software/data-sheet/ + h2006-celerra-mpfs-mpfsi.pdf. 
+ + [SMIS] SNIA, "Storage Management Initiative Specification (SMI-S) + v1.4", http://www.snia.org/tech_activities/standards/ + curr_standards/smi/SMI-S_Technical_Position_v1.4.0r4.zip. + + + + + + + + + + + + + + +Black, et al. Standards Track [Page 27] + +RFC 5663 pNFS Block/Volume Layout January 2010 + + +Authors' Addresses + + David L. Black + EMC Corporation + 176 South Street + Hopkinton, MA 01748 + + Phone: +1 (508) 293-7953 + EMail: black_david@emc.com + + + Stephen Fridella + Nasuni Inc + 313 Speen St + Natick MA 01760 + + EMail: stevef@nasuni.com + + Jason Glasgow + Google + 5 Cambridge Center + Cambridge, MA 02142 + + Phone: +1 (617) 575 1599 + EMail: jglasgow@aya.yale.edu + + + + + + + + + + + + + + + + + + + + + + + + + + +Black, et al. Standards Track [Page 28] + -- cgit v1.2.3