diff --git a/doc/rfc/rfc8154.txt b/doc/rfc/rfc8154.txt new file mode 100644 index 0000000..8e797c5 --- /dev/null +++ b/doc/rfc/rfc8154.txt @@ -0,0 +1,1683 @@ + + + + + + +Internet Engineering Task Force (IETF) C. Hellwig +Request for Comments: 8154 May 2017 +Category: Standards Track +ISSN: 2070-1721 + + + Parallel NFS (pNFS) Small Computer System Interface (SCSI) Layout + +Abstract + + The Parallel Network File System (pNFS) allows a separation between + the metadata (onto a metadata server) and data (onto a storage + device) for a file. The Small Computer System Interface (SCSI) + layout type is defined in this document as an extension to pNFS to + allow the use of SCSI-based block storage devices. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc8154. + +Copyright Notice + + Copyright (c) 2017 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. 
+ + + + + + + +Hellwig Standards Track [Page 1] + +RFC 8154 pNFS SCSI Layout May 2017 + + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.1. Conventions Used in This Document . . . . . . . . . . . . 4 + 1.2. General Definitions . . . . . . . . . . . . . . . . . . . 4 + 1.3. Code Components Licensing Notice . . . . . . . . . . . . 5 + 1.4. XDR Description . . . . . . . . . . . . . . . . . . . . . 5 + 2. SCSI Layout Description . . . . . . . . . . . . . . . . . . . 7 + 2.1. Background and Architecture . . . . . . . . . . . . . . . 7 + 2.2. layouttype4 . . . . . . . . . . . . . . . . . . . . . . . 8 + 2.3. GETDEVICEINFO . . . . . . . . . . . . . . . . . . . . . . 8 + 2.3.1. Volume Identification . . . . . . . . . . . . . . . . 8 + 2.3.2. Volume Topology . . . . . . . . . . . . . . . . . . . 10 + 2.4. Data Structures: Extents and Extent Lists . . . . . . . . 12 + 2.4.1. Layout Requests and Extent Lists . . . . . . . . . . 15 + 2.4.2. Layout Commits . . . . . . . . . . . . . . . . . . . 16 + 2.4.3. Layout Returns . . . . . . . . . . . . . . . . . . . 17 + 2.4.4. Layout Revocation . . . . . . . . . . . . . . . . . . 17 + 2.4.5. Client Copy-on-Write Processing . . . . . . . . . . . 17 + 2.4.6. Extents Are Permissions . . . . . . . . . . . . . . . 18 + 2.4.7. Partial-Block Updates . . . . . . . . . . . . . . . . 19 + 2.4.8. End-of-File Processing . . . . . . . . . . . . . . . 20 + 2.4.9. Layout Hints . . . . . . . . . . . . . . . . . . . . 20 + 2.4.10. Client Fencing . . . . . . . . . . . . . . . . . . . 21 + 2.5. Crash Recovery Issues . . . . . . . . . . . . . . . . . . 22 + 2.6. Recalling Resources: CB_RECALL_ANY . . . . . . . . . . . 23 + 2.7. Transient and Permanent Errors . . . . . . . . . . . . . 23 + 2.8. Volatile Write Caches . . . . . . . . . . . . . . . . . . 24 + 3. Enforcing NFSv4 Semantics . . . . . . . . . . . . . . . . . . 24 + 3.1. Use of Open Stateids . . . . . . . . . . . . . . . . . . 25 + 3.2. 
Enforcing Security Restrictions . . . . . . . . . . . . . 26 + 3.3. Enforcing Locking Restrictions . . . . . . . . . . . . . 26 + 4. Security Considerations . . . . . . . . . . . . . . . . . . . 27 + 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 + 6. Normative References . . . . . . . . . . . . . . . . . . . . 28 + Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 29 + Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 30 + + + + + + + + + + + + + + +Hellwig Standards Track [Page 2] + +RFC 8154 pNFS SCSI Layout May 2017 + + +1. Introduction + + Figure 1 shows the overall architecture of a Parallel NFS (pNFS) + system: + + +-----------+ + |+-----------+ +-----------+ + ||+-----------+ | | + ||| | NFSv4.1 + pNFS | | + +|| Clients |<------------------------------>| Server | + +| | | | + +-----------+ | | + ||| +-----------+ + ||| | + ||| | + ||| Storage +-----------+ | + ||| Protocol |+-----------+ | + ||+----------------||+-----------+ Control | + |+-----------------||| | Protocol| + +------------------+|| Storage |------------+ + +| Systems | + +-----------+ + + Figure 1 + + The overall approach is that pNFS-enhanced clients obtain sufficient + information from the server to enable them to access the underlying + storage (on the storage systems) directly. See Section 12 of + [RFC5661] for more details. This document is concerned with access + from pNFS clients to storage devices over block storage protocols + based on the SCSI Architecture Model [SAM-5], e.g., the Fibre Channel + Protocol (FCP), Internet SCSI (iSCSI), or Serial Attached SCSI (SAS). + pNFS SCSI layout requires block-based SCSI command sets, for example, + SCSI Block Commands [SBC3]. While SCSI command sets for non-block- + based access exist, these are not supported by the SCSI layout type, + and all future references to SCSI storage devices will imply a block- + based SCSI command set. 
+ + The Server to Storage System protocol, called the "Control Protocol", + is not of concern for interoperability, although it will typically be + the same SCSI-based storage protocol. + + This document is based on [RFC5663] and makes changes to the block + layout type to provide a better pNFS layout protocol for SCSI-based + storage devices. Despite these changes, [RFC5663] remains the + defining document for the existing block layout type. pNFS Block Disk + Protection [RFC6688] is unnecessary in the context of the SCSI layout + type because the new layout type provides mandatory disk access + + + +Hellwig Standards Track [Page 3] + +RFC 8154 pNFS SCSI Layout May 2017 + + + protection as part of the layout type definition. In contrast to + [RFC5663], this document uses SCSI protocol features to provide + reliable fencing by using SCSI persistent reservations, and it can + provide reliable and efficient device discovery by using SCSI device + identifiers instead of having to rely on probing all devices + potentially attached to a client. This new layout type also + optimizes the Input/Output (I/O) path by reducing the size of the + LAYOUTCOMMIT payload. + + The above two paragraphs summarize the major functional differences + from [RFC5663]. There are other minor differences, e.g., the "base" + volume type in this specification is used instead of the "simple" + volume type in [RFC5663], but there are no significant differences in + the data structures that describe the volume topology above this + level (Section 2.3.2) or in the data structures that describe extents + (Section 2.4). + +1.1. Conventions Used in This Document + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in [RFC2119]. + +1.2. General Definitions + + The following definitions are provided for the purpose of providing + an appropriate context for the reader. 
+ + Byte: an octet, i.e., a datum exactly 8 bits in length. + + Client: the entity that accesses the NFS server's resources. The + client may be an application that contains the logic to access the + NFS server directly. The client may also be the traditional + operating system client that provides remote file system services + for a set of applications. + + Server: the entity responsible for coordinating client access to a + set of file systems and is identified by a server owner. + + Metadata Server (MDS): a pNFS server that provides metadata + information for a file system object. It also is responsible for + generating layouts for file system objects. Note that the MDS is + also responsible for directory-based operations. + + + + + + + + +Hellwig Standards Track [Page 4] + +RFC 8154 pNFS SCSI Layout May 2017 + + +1.3. Code Components Licensing Notice + + The external data representation (XDR) description and scripts for + extracting the XDR description are Code Components as described in + Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. + These Code Components are licensed according to the terms of + Section 4 of "Legal Provisions Relating to IETF Documents". + +1.4. XDR Description + + This document contains the XDR [RFC4506] description of the NFSv4.1 + SCSI layout protocol. The XDR description is embedded in this + document in a way that makes it simple for the reader to extract into + a ready-to-compile form. The reader can feed this document into the + following shell script to produce the machine-readable XDR + description of the NFSv4.1 SCSI layout: + + #!/bin/sh + grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' + + That is, if the above script is stored in a file called "extract.sh", + and this document is in a file called "spec.txt", then the reader can + do: + + sh extract.sh < spec.txt > scsi_prot.x + + The effect of the script is to remove leading white space from each + line, plus a sentinel sequence of "///". 
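The same filter can also be sketched in Python. This is an editorial illustration, not part of the specification; the regular expression mirrors the grep/sed pipeline shown above:

```python
import re

# Leading blanks, the "///" sentinel, and one optional following space,
# matching what grep '^ *///' selects and the two sed passes strip.
SENTINEL = re.compile(r"^ */// ?")

def extract_xdr(text):
    """Keep only the lines carrying the "///" sentinel and strip the
    leading white space plus the sentinel itself."""
    return "\n".join(SENTINEL.sub("", line)
                     for line in text.splitlines()
                     if SENTINEL.match(line))
```

Feeding this document's text through extract_xdr yields the same machine-readable XDR as the shell script.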
+ + The embedded XDR file header follows. Subsequent XDR descriptions + with the sentinel sequence are embedded throughout the document. + + Note that the XDR code contained in this document depends on types + from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both NFS + types that end with a 4, such as offset4, length4, etc., as well as + more generic types such as uint32_t and uint64_t. + + /// /* + /// * This code was derived from RFC 8154. + /// * Please reproduce this note if possible. + /// */ + /// /* + /// * Copyright (c) 2017 IETF Trust and the persons + /// * identified as authors of the code. All rights reserved. + /// * + + + + + + +Hellwig Standards Track [Page 5] + +RFC 8154 pNFS SCSI Layout May 2017 + + + /// * Redistribution and use in source and binary forms, with + /// * or without modification, are permitted provided that the + /// * following conditions are met: + /// * + /// * - Redistributions of source code must retain the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer. + /// * + /// * - Redistributions in binary form must reproduce the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer in the documentation and/or other + /// * materials provided with the distribution. + /// * + /// * - Neither the name of Internet Society, IETF or IETF + /// * Trust, nor the names of specific contributors, may be + /// * used to endorse or promote products derived from this + /// * software without specific prior written permission. + /// * + /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS + /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO + /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING + /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + /// */ + /// + /// /* + /// * nfs4_scsi_layout_prot.x + /// */ + /// + /// %#include "nfsv41.h" + /// + + + + + + + + + + +Hellwig Standards Track [Page 6] + +RFC 8154 pNFS SCSI Layout May 2017 + + +2. SCSI Layout Description + +2.1. Background and Architecture + + The fundamental storage model supported by SCSI storage devices is a + logical unit (LU) consisting of a sequential series of fixed-size + blocks. Logical units used as devices for NFS SCSI layouts, and the + SCSI initiators used for the pNFS metadata server and clients, MUST + support SCSI persistent reservations as defined in [SPC4]. + + A pNFS layout for this SCSI class of storage is responsible for + mapping from an NFS file (or portion of a file) to the blocks of + storage volumes that contain the file. The blocks are expressed as + extents with 64-bit offsets and lengths using the existing NFSv4 + offset4 and length4 types. Clients MUST be able to perform I/O to + the block extents without affecting additional areas of storage + (especially important for writes); therefore, extents MUST be aligned + to logical block size boundaries of the underlying logical units + (typically 512 or 4096 bytes). For complex volume topologies, the + server MUST ensure extents are aligned to the logical block size + boundaries of the largest logical block size in the volume topology. 
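As an illustration of the alignment requirement above, a sanity check over a set of extents might look like the following sketch. The helper name and the (offset, length) pair representation are assumptions of this example, not part of the protocol:

```python
def extents_aligned(extents, logical_block_size):
    """Return True when every (offset, length) extent both starts and
    ends on a boundary of the given logical block size (typically 512
    or 4096 bytes; for a complex volume topology, the largest logical
    block size in the topology)."""
    return all(offset % logical_block_size == 0 and
               length % logical_block_size == 0
               for offset, length in extents)
```

A server granting layouts would apply a check of this kind before handing extents to clients, since misaligned extents would let client writes touch storage outside the granted range.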
+ + The pNFS operation for requesting a layout (LAYOUTGET) includes the + "layoutiomode4 loga_iomode" argument, which indicates whether the + requested layout is for read-only use or read-write use. A read-only + layout may contain holes that are read as zero, whereas a read-write + layout will contain allocated but uninitialized storage in those + holes (read as zero, can be written by client). This document also + supports client participation in copy-on-write (e.g., for file + systems with snapshots) by providing both read-only and uninitialized + storage for the same range in a layout. Reads are initially + performed on the read-only storage, with writes going to the + uninitialized storage. After the first write that initializes the + uninitialized storage, all reads are performed to that now- + initialized writable storage, and the corresponding read-only storage + is no longer used. + + The SCSI layout solution expands the security responsibilities of the + pNFS clients, and there are a number of environments where the + mandatory-to-implement security properties for NFS cannot be + satisfied. The additional security responsibilities of the client + follow, and a full discussion is present in Section 4 ("Security + Considerations"). + + + + + + + + +Hellwig Standards Track [Page 7] + +RFC 8154 pNFS SCSI Layout May 2017 + + + o Typically, SCSI storage devices provide access control mechanisms + (e.g., Logical Unit Number (LUN) mapping and/or masking), which + operate at the granularity of individual hosts, not individual + blocks. For this reason, block-based protection must be provided + by the client software. + + o Similarly, SCSI storage devices typically are not able to validate + NFS locks that apply to file regions. For instance, if a file is + covered by a mandatory read-only lock, the server can ensure that + only readable layouts for the file are granted to pNFS clients. 
+ However, it is up to each pNFS client to ensure that the readable + layout is used only to service read requests and not to allow + writes to the existing parts of the file. + + Since SCSI storage devices are generally not capable of enforcing + such file-based security, in environments where pNFS clients cannot + be trusted to enforce such policies, pNFS SCSI layouts MUST NOT be + used. + +2.2. layouttype4 + + The layout4 type defined in [RFC5662] is extended with a new value as + follows: + + enum layouttype4 { + LAYOUT4_NFSV4_1_FILES = 1, + LAYOUT4_OSD2_OBJECTS = 2, + LAYOUT4_BLOCK_VOLUME = 3, + LAYOUT4_SCSI = 5 + }; + + This document defines the structure associated with the layouttype4 + value LAYOUT4_SCSI. [RFC5661] specifies the loc_body structure as an + XDR type "opaque". The opaque layout is uninterpreted by the generic + pNFS client layers but obviously must be interpreted by the layout + type implementation. + +2.3. GETDEVICEINFO + +2.3.1. Volume Identification + + SCSI targets implementing [SPC4] export unique LU names for each LU + through the Device Identification Vital Product Data (VPD) page (page + code 0x83), which can be obtained using the INQUIRY command with the + Enable VPD (EVPD) bit set to one. This document uses a subset of + this information to identify LUs backing pNFS SCSI layouts. The + + + + + +Hellwig Standards Track [Page 8] + +RFC 8154 pNFS SCSI Layout May 2017 + + + Device Identification VPD page descriptors used to identify LUs for + use with pNFS SCSI layouts must adhere to the following restrictions: + + 1. The "ASSOCIATION" MUST be set to 0 (The "DESIGNATOR" field is + associated with the addressed logical unit). + + 2. 
The "DESIGNATOR TYPE" MUST be set to one of four values that are + required for the mandatory logical unit name in Section 7.7.3 of + [SPC4], as explicitly listed in the "pnfs_scsi_designator_type" + enumeration: + + PS_DESIGNATOR_T10 - based on T10 vendor ID + + PS_DESIGNATOR_EUI64 - based on EUI-64 + + PS_DESIGNATOR_NAA - Network Address Authority (NAA) + + PS_DESIGNATOR_NAME - SCSI name string + + 3. Any other association or designator type MUST NOT be used. Use + of T10 vendor IDs is discouraged when one of the other types can + be used. + + The "CODE SET" VPD page field is stored in the "sbv_code_set" field + of the "pnfs_scsi_base_volume_info4" data structure, the "DESIGNATOR + TYPE" is stored in "sbv_designator_type", and the DESIGNATOR is + stored in "sbv_designator". Due to the use of an XDR array, the + "DESIGNATOR LENGTH" field does not need to be set separately. Only + certain combinations of "sbv_code_set" and "sbv_designator_type" are + valid; please refer to [SPC4] for details, and note that ASCII MAY be + used as the code set for UTF-8 text that contains only printable + ASCII characters. Note that a Device Identification VPD page MAY + contain multiple descriptors with the same association, code set, and + designator type. Thus, NFS clients MUST check all the descriptors + for a possible match to "sbv_code_set", "sbv_designator_type", and + "sbv_designator". + + Storage devices such as storage arrays can have multiple physical + network interfaces that need not be connected to a common network, + resulting in a pNFS client having simultaneous multipath access to + the same storage volumes via different ports on different networks. + Selection of one or multiple ports to access the storage device is + left up to the client. + + Additionally, the server returns a persistent reservation key in the + "sbv_pr_key" field. See Section 2.4.10 for more details on the use + of persistent reservations. 
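The descriptor-matching rule above can be sketched as follows. All names other than the sbv_* field names are hypothetical, and parsing of the INQUIRY response into descriptors is omitted:

```python
from dataclasses import dataclass

@dataclass
class VpdDescriptor:
    """One descriptor from the Device Identification VPD page (0x83)."""
    association: int      # 0 == designator describes the addressed LU
    code_set: int         # compared against sbv_code_set
    designator_type: int  # compared against sbv_designator_type
    designator: bytes     # compared against sbv_designator

def lu_matches(descriptors, sbv_code_set, sbv_designator_type,
               sbv_designator):
    """Check every descriptor on the page, since a Device
    Identification VPD page MAY contain multiple descriptors with the
    same association, code set, and designator type."""
    return any(d.association == 0 and
               d.code_set == sbv_code_set and
               d.designator_type == sbv_designator_type and
               d.designator == sbv_designator
               for d in descriptors)
```

A client resolving a PNFS_SCSI_VOLUME_BASE entry from GETDEVICEINFO would run a check of this kind against each discovered SCSI logical unit.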
+ + + + +Hellwig Standards Track [Page 9] + +RFC 8154 pNFS SCSI Layout May 2017 + + +2.3.2. Volume Topology + + The pNFS SCSI layout volume topology is expressed in terms of the + volume types described below. The individual components of the + topology are contained in an array, and components MAY refer to other + components by using array indices. + + /// enum pnfs_scsi_volume_type4 { + /// PNFS_SCSI_VOLUME_SLICE = 1, /* volume is a slice of + /// another volume */ + /// PNFS_SCSI_VOLUME_CONCAT = 2, /* volume is a + /// concatenation of + /// multiple volumes */ + /// PNFS_SCSI_VOLUME_STRIPE = 3 /* volume is striped across + /// multiple volumes */ + /// PNFS_SCSI_VOLUME_BASE = 4, /* volume maps to a single + /// LU */ + /// }; + /// + + /// /* + /// * Code sets from SPC-4. + /// */ + /// enum pnfs_scsi_code_set { + /// PS_CODE_SET_BINARY = 1, + /// PS_CODE_SET_ASCII = 2, + /// PS_CODE_SET_UTF8 = 3 + /// }; + /// + /// /* + /// * Designator types taken from SPC-4. + /// * + /// * Other values are allocated in SPC-4 but are not mandatory to + /// * implement or aren't logical unit names. + /// */ + /// enum pnfs_scsi_designator_type { + /// PS_DESIGNATOR_T10 = 1, + /// PS_DESIGNATOR_EUI64 = 2, + /// PS_DESIGNATOR_NAA = 3, + /// PS_DESIGNATOR_NAME = 8 + /// }; + /// + /// /* + /// * Logical unit name + reservation key. 
+ /// */ + /// struct pnfs_scsi_base_volume_info4 { + /// pnfs_scsi_code_set sbv_code_set; + /// pnfs_scsi_designator_type sbv_designator_type; + + + +Hellwig Standards Track [Page 10] + +RFC 8154 pNFS SCSI Layout May 2017 + + + /// opaque sbv_designator<>; + /// uint64_t sbv_pr_key; + /// }; + /// + + /// struct pnfs_scsi_slice_volume_info4 { + /// offset4 ssv_start; /* offset of the start of + /// the slice in bytes */ + /// length4 ssv_length; /* length of slice in + /// bytes */ + /// uint32_t ssv_volume; /* array index of sliced + /// volume */ + /// }; + /// + + /// + /// struct pnfs_scsi_concat_volume_info4 { + /// uint32_t scv_volumes<>; /* array indices of volumes + /// that are concatenated */ + /// }; + + /// + /// struct pnfs_scsi_stripe_volume_info4 { + /// length4 ssv_stripe_unit; /* size of stripe in bytes */ + /// uint32_t ssv_volumes<>; /* array indices of + /// volumes that are striped + /// across -- MUST be same + /// size */ + /// }; + + /// + /// union pnfs_scsi_volume4 switch (pnfs_scsi_volume_type4 type) { + /// case PNFS_SCSI_VOLUME_BASE: + /// pnfs_scsi_base_volume_info4 sv_simple_info; + /// case PNFS_SCSI_VOLUME_SLICE: + /// pnfs_scsi_slice_volume_info4 sv_slice_info; + /// case PNFS_SCSI_VOLUME_CONCAT: + /// pnfs_scsi_concat_volume_info4 sv_concat_info; + /// case PNFS_SCSI_VOLUME_STRIPE: + /// pnfs_scsi_stripe_volume_info4 sv_stripe_info; + /// }; + /// + + + + + + + + + +Hellwig Standards Track [Page 11] + +RFC 8154 pNFS SCSI Layout May 2017 + + + /// /* SCSI layout-specific type for da_addr_body */ + /// struct pnfs_scsi_deviceaddr4 { + /// pnfs_scsi_volume4 sda_volumes<>; /* array of volumes */ + /// }; + /// + + The "pnfs_scsi_deviceaddr4" data structure is a structure that allows + arbitrarily complex nested volume structures to be encoded. The + types of aggregations that are allowed are stripes, concatenations, + and slices. 
Note that the volume topology expressed in the + "pnfs_scsi_deviceaddr4" data structure will always resolve to a set + of "pnfs_scsi_volume_type4" PNFS_SCSI_VOLUME_BASE. The array of + volumes is ordered such that the root of the volume hierarchy is the + last element of the array. Concat, slice, and stripe volumes MUST + refer to volumes defined by lower indexed elements of the array. + + The "pnfs_scsi_deviceaddr4" data structure is returned by the server + as the storage-protocol-specific opaque field "da_addr_body" in the + "device_addr4" data structure by a successful GETDEVICEINFO operation + [RFC5661]. + + As noted above, all "device_addr4" data structures eventually resolve + to a set of volumes of type PNFS_SCSI_VOLUME_BASE. Complicated + volume hierarchies may be composed of dozens of volumes, each with + several components; thus, the device address may require several + kilobytes. The client SHOULD be prepared to allocate a large buffer + to contain the result. In the case of the server returning + NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at least + gdir_mincount_bytes to contain the expected result and retry the + GETDEVICEINFO request. + +2.4. Data Structures: Extents and Extent Lists + + A pNFS SCSI layout is a list of extents within a flat array of data + blocks in a volume. The details of the volume topology can be + determined by using the GETDEVICEINFO operation. The SCSI layout + describes the individual block extents on the volume that make up the + file. The offsets and length contained in an extent are specified in + units of bytes. + + + + + + + + + + + + +Hellwig Standards Track [Page 12] + +RFC 8154 pNFS SCSI Layout May 2017 + + + /// enum pnfs_scsi_extent_state4 { + /// PNFS_SCSI_READ_WRITE_DATA = 0, /* the data located by + /// this extent is valid + /// for reading and + /// writing. */ + /// PNFS_SCSI_READ_DATA = 1, /* the data located by this + /// extent is valid for + /// reading only; it may not + /// be written. 
*/ + /// PNFS_SCSI_INVALID_DATA = 2, /* the location is valid; the + /// data is invalid. It is a + /// newly (pre-)allocated + /// extent. The client MUST + /// NOT read from this + /// space. */ + /// PNFS_SCSI_NONE_DATA = 3 /* the location is invalid. + /// It is a hole in the file. + /// The client MUST NOT read + /// from or write to this + /// space. */ + /// }; + + /// + /// struct pnfs_scsi_extent4 { + /// deviceid4 se_vol_id; /* id of the volume on + /// which extent of file is + /// stored */ + /// offset4 se_file_offset; /* starting byte offset + /// in the file */ + /// length4 se_length; /* size in bytes of the + /// extent */ + /// offset4 se_storage_offset; /* starting byte offset + /// in the volume */ + /// pnfs_scsi_extent_state4 se_state; + /// /* state of this extent */ + /// }; + /// + + /// /* SCSI layout-specific type for loc_body */ + /// struct pnfs_scsi_layout4 { + /// pnfs_scsi_extent4 sl_extents<>; + /// /* extents that make up this + /// layout */ + /// }; + /// + + + + + + +Hellwig Standards Track [Page 13] + +RFC 8154 pNFS SCSI Layout May 2017 + + + The SCSI layout consists of a list of extents that map the regions of + the file to locations on a volume. The "se_storage_offset" field + within each extent identifies a location on the volume specified by + the "se_vol_id" field in the extent. The "se_vol_id" itself is + shorthand for the whole topology of the volume on which the file is + stored. The client is responsible for translating this volume- + relative offset into an offset on the appropriate underlying SCSI LU. + + Each extent maps a region of the file onto a portion of the specified + LU. The "se_file_offset", "se_length", and "se_state" fields for an + extent returned from the server are valid for all extents.
In + contrast, the interpretation of the "se_storage_offset" field depends + on the value of "se_state" as follows (in increasing order): + + PNFS_SCSI_READ_WRITE_DATA + "se_storage_offset" is valid and points to valid/initialized data + that can be read and written. + + PNFS_SCSI_READ_DATA + "se_storage_offset" is valid and points to valid/initialized data + that can only be read. Write operations are prohibited. + + PNFS_SCSI_INVALID_DATA + "se_storage_offset" is valid but points to invalid, uninitialized + data. This data MUST NOT be read from the disk until it has been + initialized. A read request for a PNFS_SCSI_INVALID_DATA extent + MUST fill the user buffer with zeros, unless the extent is covered + by a PNFS_SCSI_READ_DATA extent of a copy-on-write file system. + Write requests MUST write whole server-sized blocks to the disk; + bytes not initialized by the user MUST be set to zero. Any write + to storage in a PNFS_SCSI_INVALID_DATA extent changes the written + portion of the extent to PNFS_SCSI_READ_WRITE_DATA; the pNFS + client is responsible for reporting this change via LAYOUTCOMMIT. + + PNFS_SCSI_NONE_DATA + "se_storage_offset" is not valid, and this extent MAY not be used + to satisfy write requests. Read requests MAY be satisfied by + zero-filling as for PNFS_SCSI_INVALID_DATA. PNFS_SCSI_NONE_DATA + extents MAY be returned by requests for readable extents; they are + never returned if the request was for a writable extent. + + An extent list contains all relevant extents in increasing order of + the se_file_offset of each extent; any ties are broken by increasing + order of the extent state (se_state). + + + + + + + +Hellwig Standards Track [Page 14] + +RFC 8154 pNFS SCSI Layout May 2017 + + +2.4.1. Layout Requests and Extent Lists + + Each request for a layout specifies at least three parameters: file + offset, desired size, and minimum size.
If the status of a request + indicates success, the extent list returned MUST meet the following + criteria: + + o A request for a readable (but not writable) layout MUST return + either PNFS_SCSI_READ_DATA or PNFS_SCSI_NONE_DATA extents. It + SHALL NOT return PNFS_SCSI_INVALID_DATA or + PNFS_SCSI_READ_WRITE_DATA extents. + + o A request for a writable layout MUST return + PNFS_SCSI_READ_WRITE_DATA or PNFS_SCSI_INVALID_DATA extents, and + it MAY return additional PNFS_SCSI_READ_DATA extents for ranges + covered by PNFS_SCSI_INVALID_DATA extents to allow client-side + copy-on-write operations. A request for a writable layout SHALL + NOT return PNFS_SCSI_NONE_DATA extents. + + o The first extent in the list MUST contain the requested starting + offset. + + o The total size of extents within the requested range MUST cover at + least the minimum size. One exception is allowed: the total size + MAY be smaller if only readable extents were requested and EOF is + encountered. + + o Extents in the extent list MUST be logically contiguous for a + read-only layout. For a read-write layout, the set of writable + extents (i.e., excluding PNFS_SCSI_READ_DATA extents) MUST be + logically contiguous. Every PNFS_SCSI_READ_DATA extent in a read- + write layout MUST be covered by one or more PNFS_SCSI_INVALID_DATA + extents. This overlap of PNFS_SCSI_READ_DATA and + PNFS_SCSI_INVALID_DATA extents is the only permitted extent + overlap. + + o Extents MUST be ordered in the list by starting offset, with + PNFS_SCSI_READ_DATA extents preceding PNFS_SCSI_INVALID_DATA + extents in the case of equal se_file_offsets. + + According to [RFC5661], if the minimum requested size, + loga_minlength, is zero, this is an indication to the metadata server + that the client desires any layout at offset loga_offset or less that + the metadata server has "readily available". 
Given the lack of a + clear definition of this phrase, in the context of the SCSI layout + + + + + + +Hellwig Standards Track [Page 15] + +RFC 8154 pNFS SCSI Layout May 2017 + + + type, when loga_minlength is zero, the metadata server SHOULD do the + following: + + o when processing requests for readable layouts, return all such + layouts, even if some extents are in the PNFS_SCSI_NONE_DATA + state. + + o when processing requests for writable layouts, return extents that + can be returned in the PNFS_SCSI_READ_WRITE_DATA state. + +2.4.2. Layout Commits + + /// + /// /* SCSI layout-specific type for lou_body */ + /// + /// struct pnfs_scsi_range4 { + /// offset4 sr_file_offset; /* starting byte offset + /// in the file */ + /// length4 sr_length; /* size in bytes */ + /// }; + /// + /// struct pnfs_scsi_layoutupdate4 { + /// pnfs_scsi_range4 slu_commit_list<>; + /// /* list of extents that + /// * now contain valid data. + /// */ + /// }; + + The "pnfs_scsi_layoutupdate4" data structure is used by the client as + the SCSI layout-specific argument in a LAYOUTCOMMIT operation. The + "slu_commit_list" field is a list covering regions of the file layout + that were previously in the PNFS_SCSI_INVALID_DATA state but have + been written by the client and SHOULD now be considered in the + PNFS_SCSI_READ_WRITE_DATA state. The extents in the commit list MUST + be disjoint and MUST be sorted by sr_file_offset. Implementors + should be aware that a server MAY be unable to commit regions at a + granularity smaller than a file system block (typically 4 KB or 8 + KB). As noted above, the block size that the server uses is + available as an NFSv4 attribute, and any extents included in the + "slu_commit_list" MUST be aligned to this granularity and have a size + that is a multiple of this granularity. Since the block in question + is in state PNFS_SCSI_INVALID_DATA, byte ranges not written SHOULD be + filled with zeros. 
This applies even if it appears that the area + being written is beyond what the client believes to be the end of + file. + + + + + + +Hellwig Standards Track [Page 16] + +RFC 8154 pNFS SCSI Layout May 2017 + + +2.4.3. Layout Returns + + A LAYOUTRETURN operation represents an explicit release of resources + by the client. This MAY be done in response to a CB_LAYOUTRECALL or + before any recall, in order to avoid a future CB_LAYOUTRECALL. When + the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return + type, then the "layoutreturn_file4" data structure specifies the + region of the file layout that is no longer needed by the client. + + The LAYOUTRETURN operation is done without any data specific to the + SCSI layout. The opaque "lrf_body" field of the "layoutreturn_file4" + data structure MUST have length zero. + +2.4.4. Layout Revocation + + Layouts MAY be unilaterally revoked by the server due to the client's + lease time expiring or the client failing to return a layout that has + been recalled in a timely manner. For the SCSI layout type, this is + accomplished by fencing off the client from access to storage as + described in Section 2.4.10. When this is done, it is necessary that + all I/Os issued by the fenced-off client be rejected by the storage. + This includes any in-flight I/Os that the client issued before the + layout was revoked. + + Note that the granularity of this operation can only be at the host/ + LU level. Thus, if one of a client's layouts is unilaterally revoked + by the server, it will effectively render useless *all* of the + client's layouts for files located on the storage units comprising + the volume. This may render useless the client's layouts for files + in other file systems. See Section 2.4.10.5 for a discussion of + recovery from fencing. + +2.4.5. Client Copy-on-Write Processing + + Copy-on-write is a mechanism used to support file and/or file system + snapshots. 
When writing to unaligned regions, or to regions smaller + than a file system block, the writer MUST copy the portions of the + original file data to a new location on disk. This behavior can be + implemented either on the client or the server. The paragraphs below + describe how a pNFS SCSI layout client implements access to a file + that requires copy-on-write semantics. + + Distinguishing the PNFS_SCSI_READ_WRITE_DATA and PNFS_SCSI_READ_DATA + extent types in combination with the allowed overlap of + PNFS_SCSI_READ_DATA extents with PNFS_SCSI_INVALID_DATA extents + allows copy-on-write processing to be done by pNFS clients. In + classic NFS, this operation would be done by the server. Since pNFS + enables clients to do direct block access, it is useful for clients + + + +Hellwig Standards Track [Page 17] + +RFC 8154 pNFS SCSI Layout May 2017 + + + to participate in copy-on-write operations. All SCSI pNFS clients + MUST support this copy-on-write processing. + + When a client wishes to write data covered by a PNFS_SCSI_READ_DATA + extent, it MUST have requested a writable layout from the server; + that layout will contain PNFS_SCSI_INVALID_DATA extents to cover all + the data ranges of that layout's PNFS_SCSI_READ_DATA extents. More + precisely, for any se_file_offset range covered by one or more + PNFS_SCSI_READ_DATA extents in a writable layout, the server MUST + include one or more PNFS_SCSI_INVALID_DATA extents in the layout that + cover the same se_file_offset range. When performing a write to such + an area of a layout, the client MUST effectively copy the data from + the PNFS_SCSI_READ_DATA extent for any partial blocks of + se_file_offset and range, merge in the changes to be written, and + write the result to the PNFS_SCSI_INVALID_DATA extent for the blocks + for that se_file_offset and range. 
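The copy-merge-write sequence just described can be sketched as follows. This is an illustrative sketch: read_old() stands in for reads through the PNFS_SCSI_READ_DATA extent, write_new() for writes through the PNFS_SCSI_INVALID_DATA extent, and the 4 KB block size is an assumed example:

```python
# Illustrative client-side copy-on-write: only partial blocks at the
# edges of the write are fetched from the old (PNFS_SCSI_READ_DATA)
# location; the merged result is stored in whole blocks at the new
# (PNFS_SCSI_INVALID_DATA) location.
BLOCK = 4096

def cow_write(offset, data, read_old, write_new, block=BLOCK):
    start = (offset // block) * block                    # first block
    end = -(-(offset + len(data)) // block) * block      # past last block
    buf = bytearray(end - start)
    if offset % block:                                   # leading partial block
        buf[0:block] = read_old(start, block)
    if (offset + len(data)) % block and end - block != start:
        buf[-block:] = read_old(end - block, block)      # trailing partial block
    buf[offset - start:offset - start + len(data)] = data  # merge new bytes
    write_new(start, bytes(buf))
```

Note that a write covering only entire blocks fetches nothing from the old location, matching the rule stated in the text.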
That is, if entire blocks of data + are to be overwritten by an operation, the corresponding + PNFS_SCSI_READ_DATA blocks need not be fetched, but any partial- + block writes MUST be merged with data fetched via PNFS_SCSI_READ_DATA + extents before storing the result via PNFS_SCSI_INVALID_DATA extents. + For the purposes of this discussion, "entire blocks" and "partial + blocks" refer to the block size of the server's file system. Storing + of data in a PNFS_SCSI_INVALID_DATA extent converts the written + portion of the PNFS_SCSI_INVALID_DATA extent to a + PNFS_SCSI_READ_WRITE_DATA extent; all subsequent reads MUST be + performed from this extent; the corresponding portion of the + PNFS_SCSI_READ_DATA extent MUST NOT be used after storing data in a + PNFS_SCSI_INVALID_DATA extent. If a client writes only a portion of + an extent, the extent MAY be split at block-aligned boundaries. + + When a client wishes to write data to a PNFS_SCSI_INVALID_DATA extent + that is not covered by a PNFS_SCSI_READ_DATA extent, it MUST treat + this write identically to a write to a file not involved with copy- + on-write semantics. Thus, data MUST be written in at least block- + sized increments and aligned to multiples of block-sized offsets, and + unwritten portions of blocks MUST be zero filled. + +2.4.6. Extents Are Permissions + + Layout extents returned to pNFS clients grant permission to read or + write; PNFS_SCSI_READ_DATA and PNFS_SCSI_NONE_DATA are read-only + (PNFS_SCSI_NONE_DATA reads as zeros), and PNFS_SCSI_READ_WRITE_DATA + and PNFS_SCSI_INVALID_DATA are read-write (PNFS_SCSI_INVALID_DATA + reads as zeros; any write converts it to PNFS_SCSI_READ_WRITE_DATA). + This is the only means a client has of obtaining permission to + perform direct I/O to storage devices; a pNFS client MUST NOT perform + direct I/O operations that are not permitted by an extent held by the + client. 
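That prohibition amounts to a per-I/O check along the following lines (a sketch only; the extent bookkeeping and the (state, offset, length) tuple layout are hypothetical, with constants mirroring the extent states above):

```python
# Sketch of the "extents are permissions" check: a direct SCSI I/O is
# allowed only if held extents of a suitable state cover the whole
# byte range of the I/O.
READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA = range(4)
WRITABLE = {READ_WRITE_DATA, INVALID_DATA}
READABLE = {READ_WRITE_DATA, READ_DATA, INVALID_DATA, NONE_DATA}

def io_permitted(extents, offset, length, is_write):
    """True if [offset, offset+length) is fully covered by extents
    whose state permits this kind of I/O."""
    ok_states = WRITABLE if is_write else READABLE
    pos, end = offset, offset + length
    for state, ext_off, ext_len in sorted(extents, key=lambda e: e[1]):
        if state in ok_states and ext_off <= pos < ext_off + ext_len:
            pos = ext_off + ext_len
            if pos >= end:
                return True
    return False
```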
Client adherence to this rule places the pNFS server in + + + +Hellwig Standards Track [Page 18] + +RFC 8154 pNFS SCSI Layout May 2017 + + + control of potentially conflicting storage device operations, + enabling the server to determine what does conflict and how to avoid + conflicts by granting and recalling extents to/from clients. + + If a client makes a layout request that conflicts with an existing + layout delegation, the request will be rejected with the error + NFS4ERR_LAYOUTTRYLATER. This client is then expected to retry the + request after a short interval. During this interval, the server + SHOULD recall the conflicting portion of the layout delegation from + the client that currently holds it. This reject-and-retry approach + does not prevent client starvation when there is contention for the + layout of a particular file. For this reason, a pNFS server SHOULD + implement a mechanism to prevent starvation. One possibility is that + the server can maintain a queue of rejected layout requests. Each + new layout request can be checked to see if it conflicts with a + previous rejected request, and if so, the newer request can be + rejected. Once the original requesting client retries its request, + its entry in the rejected request queue can be cleared, or the entry + in the rejected request queue can be removed when it reaches a + certain age. + + NFSv4 supports mandatory locks and share reservations. These are + mechanisms that clients can use to restrict the set of I/O operations + that are permissible to other clients. Since all I/O operations + ultimately arrive at the NFSv4 server for processing, the server is + in a position to enforce these restrictions. However, with pNFS + layouts, I/Os will be issued from the clients that hold the layouts + directly to the storage devices that host the data. These devices + have no knowledge of files, mandatory locks, or share reservations, + and they are not in a position to enforce such restrictions. 
For + this reason, the NFSv4 server MUST NOT grant layouts that conflict + with mandatory locks or share reservations. Further, if a + conflicting mandatory lock request or a conflicting OPEN request + arrives at the server, the server MUST recall the part of the layout + in conflict with the request before granting the request. + +2.4.7. Partial-Block Updates + + SCSI storage devices do not provide byte granularity access and can + only perform read and write operations atomically on a block + granularity. Writes to SCSI storage devices thus require read- + modify-write cycles to write data that is smaller than the block size + or that is otherwise not block aligned. Write operations from + multiple clients to the same block can thus lead to data corruption + even if the byte range written by the applications does not overlap. + When there are multiple clients who wish to access the same block, a + + + + + +Hellwig Standards Track [Page 19] + +RFC 8154 pNFS SCSI Layout May 2017 + + + pNFS server MUST avoid these conflicts by implementing a concurrency + control policy of single writer XOR multiple readers for a given data + block. + +2.4.8. End-of-File Processing + + The end-of-file location can be changed in two ways: implicitly as + the result of a WRITE or LAYOUTCOMMIT beyond the current end of file + or explicitly as the result of a SETATTR request. Typically, when a + file is truncated by an NFSv4 client via the SETATTR call, the server + frees any disk blocks belonging to the file that are beyond the new + end-of-file byte and MUST write zeros to the portion of the new end- + of-file block beyond the new end-of-file byte. These actions render + semantically invalid any pNFS layouts that refer to the blocks that + are freed or written. Therefore, the server MUST recall from clients + the portions of any pNFS layouts that refer to blocks that will be + freed or written by the server before effecting the file truncation. 
+ These recalls may take time to complete; as explained in [RFC5661], + if the server cannot respond to the client SETATTR request in a + reasonable amount of time, it SHOULD reply to the client with the + error NFS4ERR_DELAY. + + Blocks in the PNFS_SCSI_INVALID_DATA state that lie beyond the new + end-of-file block present a special case. The server has reserved + these blocks for use by a pNFS client with a writable layout for the + file, but the client has yet to commit the blocks, and they are not + yet a part of the file mapping on disk. The server MAY free these + blocks while processing the SETATTR request. If so, the server MUST + recall any layouts from pNFS clients that refer to the blocks before + processing the truncate. If the server does not free the + PNFS_SCSI_INVALID_DATA blocks while processing the SETATTR request, + it need not recall layouts that refer only to the + PNFS_SCSI_INVALID_DATA blocks. + + When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond + the current end of file, or extended explicitly by a SETATTR request, + the server need not recall any portions of any pNFS layouts. + +2.4.9. Layout Hints + + The layout hint attribute specified in [RFC5661] is not supported by + the SCSI layout, and the pNFS server MUST reject setting a layout + hint attribute with a loh_type value of LAYOUT4_SCSI_VOLUME during + OPEN or SETATTR operations. On a file system only supporting the + SCSI layout, a server MUST NOT report the layout_hint attribute in + the supported_attrs attribute. + + + + + +Hellwig Standards Track [Page 20] + +RFC 8154 pNFS SCSI Layout May 2017 + + +2.4.10. Client Fencing + + The pNFS SCSI protocol must handle situations in which a system + failure, typically a network connectivity issue, requires the server + to unilaterally revoke extents from a client after the client fails + to respond to a CB_LAYOUTRECALL request. 
This is implemented by
   fencing off a non-responding client from access to the storage
   device.

   The pNFS SCSI protocol implements fencing using persistent
   reservations (PRs), similar to the fencing method used by existing
   shared disk file systems.  By placing a PR of type "Exclusive Access
   - Registrants Only" on each SCSI LU exported to pNFS clients, the MDS
   prevents access from any client that does not have an outstanding
   device ID that gives the client a reservation key to access the LU
   and allows the MDS to revoke access to the logical unit at any time.

2.4.10.1.  PRs -- Key Generation

   To allow fencing individual systems, each system MUST use a unique
   persistent reservation key.  [SPC4] does not specify a way to
   generate keys.  This document assigns the burden to generate unique
   keys to the MDS, which MUST generate a key for itself before
   exporting a volume and a key for each client that accesses SCSI
   layout volumes.  Individual keys for each volume that a client can
   access are permitted but not required.

2.4.10.2.  PRs -- MDS Registration and Reservation

   Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
   MDS needs to prepare the volume for fencing using PRs.  This is done
   by registering the reservation key generated for the MDS with the
   device using the "PERSISTENT RESERVE OUT" command with a service
   action of "REGISTER", followed by a "PERSISTENT RESERVE OUT" command
   with a service action of "RESERVE" and the "TYPE" field set to 8h
   (Exclusive Access - Registrants Only).  To make sure all I_T nexuses
   (see Section 3.1.45 of [SAM-5]) are registered, the MDS SHOULD set
   the "All Target Ports" (ALL_TG_PT) bit when registering the key or
   otherwise ensure the registration is performed for each target port,
   and it MUST perform registration for each initiator port.

2.4.10.3.
PRs -- Client Registration

   Before performing the first I/O to a device returned from a
   GETDEVICEINFO operation, the client will register the reservation
   key returned in sbv_pr_key with the storage device by issuing a
   "PERSISTENT RESERVE OUT" command with a service action of REGISTER
   with the "SERVICE ACTION RESERVATION KEY" set to the reservation key


Hellwig                      Standards Track                   [Page 21]

RFC 8154                    pNFS SCSI Layout                    May 2017


   returned in sbv_pr_key.  To make sure all I_T nexuses are registered,
   the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when
   registering the key or otherwise ensure the registration is performed
   for each target port, and it MUST perform registration for each
   initiator port.

   When a client stops using a device earlier returned by GETDEVICEINFO,
   it MUST unregister the earlier registered key by issuing a
   "PERSISTENT RESERVE OUT" command with a service action of "REGISTER"
   with the "RESERVATION KEY" set to the earlier registered reservation
   key.

2.4.10.4.  PRs -- Fencing Action

   In case of a non-responding client, the MDS fences the client by
   issuing a "PERSISTENT RESERVE OUT" command with the service action
   set to "PREEMPT" or "PREEMPT AND ABORT", the "RESERVATION KEY" field
   set to the server's reservation key, the service action "RESERVATION
   KEY" field set to the reservation key associated with the
   non-responding client, and the "TYPE" field set to 8h (Exclusive
   Access - Registrants Only).

   After the MDS preempts a client, all client I/O to the LU fails.  The
   client SHOULD at this point return any layout that refers to the
   device ID that points to the LU.  Note that the client can
   distinguish I/O errors due to fencing from other errors based on the
   "RESERVATION CONFLICT" SCSI status.  Refer to [SPC4] for details.

2.4.10.5.
Client Recovery after a Fence Action + + A client that detects a "RESERVATION CONFLICT" SCSI status (I/O + error) on the storage devices MUST commit all layouts that use the + storage device through the MDS, return all outstanding layouts for + the device, forget the device ID, and unregister the reservation key. + Future GETDEVICEINFO calls MAY refer to the storage device again, in + which case the client will perform a new registration based on the + key provided (via sbv_pr_key) at that time. + +2.5. Crash Recovery Issues + + A critical requirement in crash recovery is that both the client and + the server know when the other has failed. Additionally, it is + required that a client sees a consistent view of data across server + restarts. These requirements and a full discussion of crash recovery + issues are covered in Section 8.4 ("Crash Recovery") of the NFSv4.1 + specification [RFC5661]. This document contains additional crash + recovery material specific only to the SCSI layout. + + + + +Hellwig Standards Track [Page 22] + +RFC 8154 pNFS SCSI Layout May 2017 + + + When the server crashes while the client holds a writable layout, the + client has written data to blocks covered by the layout, and the + blocks are still in the PNFS_SCSI_INVALID_DATA state, the client has + two options for recovery. If the data that has been written to these + blocks is still cached by the client, the client can simply re-write + the data via NFSv4 once the server has come back online. However, if + the data is no longer in the client's cache, the client MUST NOT + attempt to source the data from the data servers. Instead, it SHOULD + attempt to commit the blocks in question to the server during the + server's recovery grace period by sending a LAYOUTCOMMIT with the + "loca_reclaim" flag set to true. This process is described in detail + in Section 18.42.4 of [RFC5661]. + +2.6. 
Recalling Resources: CB_RECALL_ANY

   The server MAY decide that it cannot hold all of the state for
   layouts without running out of resources.  In such a case, it is free
   to recall individual layouts using CB_LAYOUTRECALL to reduce the
   load, or it MAY choose to request that the client return any layout.

   The NFSv4.1 specification [RFC5661] defines the following types:

      const RCA4_TYPE_MASK_BLK_LAYOUT = 4;

      struct CB_RECALL_ANY4args {
          uint32_t  craa_objects_to_keep;
          bitmap4   craa_type_mask;
      };

   When the server sends a CB_RECALL_ANY request to a client specifying
   the RCA4_TYPE_MASK_BLK_LAYOUT bit in craa_type_mask, the client
   SHOULD immediately respond with NFS4_OK and then asynchronously
   return complete file layouts until the number of files with layouts
   cached on the client is less than craa_objects_to_keep.

2.7.  Transient and Permanent Errors

   The server may respond to LAYOUTGET with a variety of error statuses.
   These errors can convey transient conditions or more permanent
   conditions that are unlikely to be resolved soon.

   The error NFS4ERR_RECALLCONFLICT indicates that the server has
   recently issued a CB_LAYOUTRECALL to the requesting client, making it
   necessary for the client to respond to the recall before processing
   the layout request.  A client can wait for that recall to be received
   and processed, or it can retry as for NFS4ERR_TRYLATER, as described
   below.


Hellwig                      Standards Track                   [Page 23]

RFC 8154                    pNFS SCSI Layout                    May 2017


   The error NFS4ERR_TRYLATER is used to indicate that the server cannot
   immediately grant the layout to the client.  This may be due to
   constraints on writable sharing of blocks by multiple clients or to a
   conflict with a recallable lock (e.g., a delegation).  In either
   case, a reasonable approach for the client is to wait several
   milliseconds and retry the request.
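The wait-and-retry approach just described might be realized as a bounded loop like the following sketch (illustrative only; error codes appear as strings, the LAYOUTGET machinery is stubbed out, and a None result means the client gives up and performs its READs and WRITEs through the server):

```python
# Sketch of a bounded retry loop for LAYOUTGET: transient errors are
# retried after a short delay; anything else (e.g.
# NFS4ERR_LAYOUTUNAVAILABLE), or running out of retries, makes the
# client fall back to I/O through the MDS.
import time

RETRYABLE = {"NFS4ERR_TRYLATER", "NFS4ERR_RECALLCONFLICT"}

def get_layout_or_fallback(layoutget, max_retries=5, delay_s=0.005):
    for _ in range(max_retries):
        status, layout = layoutget()
        if status == "NFS4_OK":
            return layout
        if status not in RETRYABLE:
            break                   # permanent condition; stop retrying
        time.sleep(delay_s)         # wait several milliseconds, then retry
    return None                     # fall back to READ/WRITE via the MDS
```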
The client SHOULD track the + number of retries, and if forward progress is not made, the client + SHOULD abandon the attempt to get a layout and perform READ and WRITE + operations by sending them to the server. + + The error NFS4ERR_LAYOUTUNAVAILABLE MAY be returned by the server if + layouts are not supported for the requested file or its containing + file system. The server MAY also return this error code if the + server is in the process of migrating the file from secondary + storage, there is a conflicting lock that would prevent the layout + from being granted, or any other reason causes the server to be + unable to supply the layout. As a result of receiving + NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD abandon the attempt to + get a layout and perform READ and WRITE operations by sending them to + the MDS. It is expected that a client will not cache the file's + layoutunavailable state forever. In particular, when the file is + closed or opened by the client, issuing a new LAYOUTGET is + appropriate. + +2.8. Volatile Write Caches + + Many storage devices implement volatile write caches that require an + explicit flush to persist the data from write operations to stable + storage. Storage devices implementing [SBC3] should indicate a + volatile write cache by setting the Write Cache Enable (WCE) bit to 1 + in the Caching mode page. When a volatile write cache is used, the + pNFS server MUST ensure the volatile write cache has been committed + to stable storage before the LAYOUTCOMMIT operation returns by using + one of the SYNCHRONIZE CACHE commands. + +3. Enforcing NFSv4 Semantics + + The functionality provided by SCSI persistent reservations makes it + possible for the MDS to control access by individual client machines + to specific LUs. Individual client machines may be allowed to or + prevented from reading or writing to certain block devices. Finer- + grained access control methods are not generally available. 
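For concreteness, the REGISTER operations described in Section 2.4.10 carry the basic 24-byte PERSISTENT RESERVE OUT parameter list of [SPC4]; the sketch below builds it (field offsets follow SPC-4, but the helper itself is hypothetical, and actually issuing the command is out of scope here):

```python
# Hypothetical construction of the 24-byte parameter list carried by
# PERSISTENT RESERVE OUT for the REGISTER service action.  For an
# initial registration the RESERVATION KEY is zero and the SERVICE
# ACTION RESERVATION KEY is the new key; for unregistration the roles
# are reversed.
ALL_TG_PT = 0x04        # byte 20, bit 2: apply to all target ports

def pr_out_register_params(reservation_key, service_action_key,
                           all_tg_pt=True):
    buf = bytearray(24)
    buf[0:8] = reservation_key.to_bytes(8, "big")       # RESERVATION KEY
    buf[8:16] = service_action_key.to_bytes(8, "big")   # SA RESERVATION KEY
    if all_tg_pt:
        buf[20] |= ALL_TG_PT
    return bytes(buf)
```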
+ + For this reason, certain responsibilities for enforcing NFSv4 + semantics, including security and locking, are delegated to pNFS + clients when SCSI layouts are being used. The metadata server's role + is to only grant layouts appropriately, and the pNFS clients have to + be trusted to only perform accesses allowed by the layout extents + + + +Hellwig Standards Track [Page 24] + +RFC 8154 pNFS SCSI Layout May 2017 + + + they currently hold (e.g., not access storage for files on which a + layout extent is not held). In general, the server will not be able + to prevent a client that holds a layout for a file from accessing + parts of the physical disk not covered by the layout. Similarly, the + server will not be able to prevent a client from accessing blocks + covered by a layout that it has already returned. The pNFS client + must respect the layout model for this mapping type to appropriately + respect NFSv4 semantics. + + Furthermore, there is no way for the storage to determine the + specific NFSv4 entity (principal, openowner, lockowner) on whose + behalf the I/O operation is being done. This fact may limit the + functionality to be supported and require the pNFS client to + implement server policies other than those describable by layouts. + In cases in which layouts previously granted become invalid, the + server has the option of recalling them. In situations in which + communication difficulties prevent this from happening, layouts may + be revoked by the server. This revocation is accompanied by changes + in persistent reservation that have the effect of preventing SCSI + access to the LUs in question by the client. + +3.1. Use of Open Stateids + + The effective implementation of these NFSv4 semantic constraints is + complicated by the different granularities of the actors for the + different types of the functionality to be enforced: + + o To enforce security constraints for particular principals. 

   o  To enforce locking constraints for particular owners (openowners
      and lockowners).

   Fundamental to enforcing both of these sorts of constraints is the
   principle that a pNFS client must not issue a SCSI I/O operation
   unless it possesses both:

   o  A valid open stateid for the file in question that allows I/O of
      the type being performed and that is associated with the openowner
      and principal on whose behalf the I/O is to be done.

   o  A valid layout stateid for the file in question that covers the
      byte range on which the I/O is to be done and that allows I/O of
      that type to be done.

   As a result, if the equivalent of I/O with an anonymous or write-
   bypass stateid is to be done, it MUST NOT be done using the pNFS SCSI
   layout type.  The client MAY attempt such I/O using READs and WRITEs
   that do not use pNFS and are directed to the MDS.


Hellwig                      Standards Track                   [Page 25]

RFC 8154                    pNFS SCSI Layout                    May 2017


   When open stateids are revoked, due to lease expiration or any form
   of administrative revocation, the server MUST recall all layouts that
   allow I/O to be done on any of the files for which open revocation
   happens.  When there is a failure to successfully return those
   layouts, the client MUST be fenced.

3.2.  Enforcing Security Restrictions

   The restriction noted above provides adequate enforcement of
   appropriate security restrictions when the principal issuing the I/O
   is the same as that opening the file.  The server is responsible for
   checking that the I/O mode requested by the OPEN is allowed for the
   principal doing the OPEN.  If the correct sort of I/O is done on
   behalf of the same principal, then the security restriction is
   thereby enforced.

   If I/O is done by a principal different from the one that opened the
   file, the client SHOULD send the I/O to be performed by the metadata
   server rather than doing it directly to the storage device.

3.3.
Enforcing Locking Restrictions

   Mandatory enforcement of whole-file locking by means of share
   reservations is provided when the pNFS client obeys the requirement
   set forth in Section 3.1.  Since performing I/O requires a valid open
   stateid, an I/O that violates an existing share reservation would
   only be possible when the server allows conflicting open stateids to
   exist.

   The nature of the SCSI layout type makes the implementation and
   enforcement of mandatory byte-range locks very difficult.  Given
   that layouts are granted to clients rather than owners, the pNFS
   client is in no position to successfully arbitrate among multiple
   lockowners on the same client.  Suppose lockowner A is doing a write
   and, while the I/O is pending, lockowner B requests a mandatory
   byte-range lock for a byte range potentially overlapping the pending
   I/O.  In such a situation, the lock request cannot be granted while
   the I/O is pending.  In a non-pNFS environment, the server would have
   to wait for pending I/O before granting the mandatory byte-range
   lock.  In the pNFS environment, the server does not issue the I/O and
   is thus in no position to wait for its completion.  The server may
   recall such layouts, but in doing so, it has no way of distinguishing
   those being used by lockowners A and B, making it difficult to allow
   B to perform I/O while forbidding A from doing so.  Given this fact,
   the MDS needs to successfully recall all layouts that overlap the
   range being locked before returning a successful response to the LOCK
   request.  While the lock is in effect, the server SHOULD respond to
   requests for layouts that overlap a currently locked area with


Hellwig                      Standards Track                   [Page 26]

RFC 8154                    pNFS SCSI Layout                    May 2017


   NFS4ERR_LAYOUTUNAVAILABLE.  To simplify the required logic, a server
   MAY do this for all layout requests on the file in question as long
   as there are any byte-range locks in effect.
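The recall obligation stated above reduces to a byte-range overlap test; a minimal sketch (the MDS bookkeeping shown as (client, offset, length) tuples is hypothetical):

```python
# Sketch: before granting a LOCK request, the MDS must recall every
# layout that overlaps the byte range being locked, since it cannot
# tell which lockowner on a client is using a given layout.
def layouts_to_recall(layouts, lock_offset, lock_length):
    lock_end = lock_offset + lock_length
    return [(client, off, ln) for client, off, ln in layouts
            if off < lock_end and lock_offset < off + ln]
```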

   Given these difficulties, servers that support mandatory byte-range
   locks may find it hard to also support SCSI layouts.  Servers can
   support advisory byte-range locks instead.  The NFSv4 protocol
   currently has no way of determining whether byte-range lock support
   on a particular file system will be mandatory or advisory, except by
   trying operations that would conflict if mandatory locking is in
   effect.  Therefore, to avoid confusion, servers SHOULD NOT switch
   between mandatory and advisory byte-range locking based on whether
   any SCSI layouts have been obtained or whether a client that has
   obtained a SCSI layout has requested a byte-range lock.

4.  Security Considerations

   Access to SCSI storage devices is logically at a lower layer of the
   I/O stack than NFSv4; hence, NFSv4 security is not directly
   applicable to protocols that access such storage directly.  Depending
   on the protocol, some of the security mechanisms provided by NFSv4
   (e.g., encryption and cryptographic integrity) may not be available
   or may be provided via different means.  At one extreme, pNFS with
   SCSI layouts can be used with storage access protocols (e.g., Serial
   Attached SCSI [SAS3]) that provide essentially no security
   functionality.  At the other extreme, pNFS may be used with storage
   protocols such as iSCSI [RFC7143] that can provide significant
   security functionality.  It is the responsibility of those
   administering and deploying pNFS with a SCSI storage access protocol
   to ensure that appropriate protection is provided to that protocol
   (physical security is a common means for protocols not based on IP).
   In environments where the security requirements for the storage
   protocol cannot be met, pNFS SCSI layouts SHOULD NOT be used.

   When using IP-based storage protocols such as iSCSI, IPsec should be
   used as outlined in [RFC3723] and updated in [RFC7146].
+ + When security is available for a storage protocol, it is generally at + a different granularity and with a different notion of identity than + NFSv4 (e.g., NFSv4 controls user access to files, and iSCSI controls + initiator access to volumes). The responsibility for enforcing + appropriate correspondences between these security layers is placed + upon the pNFS client. As with the issues in the first paragraph of + this section, in environments where the security requirements are + such that client-side protection from access to storage outside of + the layout is not sufficient, pNFS SCSI layouts SHOULD NOT be used. + + + + + +Hellwig Standards Track [Page 27] + +RFC 8154 pNFS SCSI Layout May 2017 + + +5. IANA Considerations + + IANA has assigned a new pNFS layout type in the "pNFS Layout Types + Registry" as follows: + + Layout Type Name: LAYOUT4_SCSI + Value: 0x00000005 + RFC: RFC 8154 + How: L + Minor Versions: 1 + +6. Normative References + + [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", + March 2015, <http://trustee.ietf.org/docs/ + IETF-Trust-License-Policy.pdf>. + + [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate + Requirement Levels", BCP 14, RFC 2119, + DOI 10.17487/RFC2119, March 1997, + <http://www.rfc-editor.org/info/rfc2119>. + + [RFC3723] Aboba, B., Tseng, J., Walker, J., Rangan, V., and F. + Travostino, "Securing Block Storage Protocols over IP", + RFC 3723, DOI 10.17487/RFC3723, April 2004, + <http://www.rfc-editor.org/info/rfc3723>. + + [RFC4506] Eisler, M., Ed., "XDR: External Data Representation + Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May + 2006, <http://www.rfc-editor.org/info/rfc4506>. + + [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., + "Network File System (NFS) Version 4 Minor Version 1 + Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, + <http://www.rfc-editor.org/info/rfc5661>. + + [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, DOI 10.17487/RFC5662, January 2010,
              <http://www.rfc-editor.org/info/rfc5662>.

   [RFC5663]  Black, D., Fridella, S., and J. Glasgow, "Parallel NFS
              (pNFS) Block/Volume Layout", RFC 5663,
              DOI 10.17487/RFC5663, January 2010,
              <http://www.rfc-editor.org/info/rfc5663>.


Hellwig                      Standards Track                   [Page 28]

RFC 8154                    pNFS SCSI Layout                    May 2017


   [RFC6688]  Black, D., Ed., Glasgow, J., and S. Faibish, "Parallel NFS
              (pNFS) Block Disk Protection", RFC 6688,
              DOI 10.17487/RFC6688, July 2012,
              <http://www.rfc-editor.org/info/rfc6688>.

   [RFC7143]  Chadalapaka, M., Satran, J., Meth, K., and D. Black,
              "Internet Small Computer System Interface (iSCSI) Protocol
              (Consolidated)", RFC 7143, DOI 10.17487/RFC7143, April
              2014, <http://www.rfc-editor.org/info/rfc7143>.

   [RFC7146]  Black, D. and P. Koning, "Securing Block Storage Protocols
              over IP: RFC 3723 Requirements Update for IPsec v3",
              RFC 7146, DOI 10.17487/RFC7146, April 2014,
              <http://www.rfc-editor.org/info/rfc7146>.

   [SAM-5]    INCITS Technical Committee T10, "Information Technology -
              SCSI Architecture Model - 5 (SAM-5)", ANSI
              INCITS 515-2016, 2016.

   [SAS3]     INCITS Technical Committee T10, "Information technology -
              Serial Attached SCSI-3 (SAS-3)", ANSI INCITS 519-2014,
              ISO/IEC 14776-154, 2014.

   [SBC3]     INCITS Technical Committee T10, "Information Technology -
              SCSI Block Commands - 3 (SBC-3)", ANSI INCITS 514-2014,
              ISO/IEC 14776-323, 2014.

   [SPC4]     INCITS Technical Committee T10, "Information Technology -
              SCSI Primary Commands - 4 (SPC-4)", ANSI INCITS 513-2015,
              2015.

Acknowledgments

   Large parts of this document were copied verbatim from [RFC5663], and
   some parts were inspired by it.  Thanks to David Black, Stephen
   Fridella, and Jason Glasgow for their work on the pNFS block/volume
   layout protocol.
+ + David Black, Robert Elliott, and Tom Haynes provided a thorough + review of drafts of this document, and their input led to the current + form of the document. + + David Noveck provided ample feedback to various drafts of this + document, wrote the section on enforcing NFSv4 semantics, and rewrote + various sections to better catch the intent. + + + + + + +Hellwig Standards Track [Page 29] + +RFC 8154 pNFS SCSI Layout May 2017 + + +Author's Address + + Christoph Hellwig + + Email: hch@lst.de + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Hellwig Standards Track [Page 30] + |