Diffstat (limited to 'doc/rfc/rfc5664.txt')
-rw-r--r-- | doc/rfc/rfc5664.txt | 1963 |
1 files changed, 1963 insertions, 0 deletions
diff --git a/doc/rfc/rfc5664.txt b/doc/rfc/rfc5664.txt new file mode 100644 index 0000000..0ee5b3e --- /dev/null +++ b/doc/rfc/rfc5664.txt @@ -0,0 +1,1963 @@ + + + + + + +Internet Engineering Task Force (IETF) B. Halevy +Request for Comments: 5664 B. Welch +Category: Standards Track J. Zelenka +ISSN: 2070-1721 Panasas + January 2010 + + + Object-Based Parallel NFS (pNFS) Operations + +Abstract + + Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to + allow clients to directly access file data on the storage used by the + NFSv4 server. This ability to bypass the server for data access can + increase both performance and parallelism, but requires additional + client functionality for data access, some of which is dependent on + the class of storage used, a.k.a. the Layout Type. The main pNFS + operations and data types in NFSv4 Minor version 1 specify a layout- + type-independent layer; layout-type-specific information is conveyed + using opaque data structures whose internal structure is further + defined by the particular layout type specification. This document + specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to + the main NFSv4 Minor version 1 specification. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc5664. + + + + + + + + + + + + + + +Halevy, et al. 
Standards Track [Page 1] + +RFC 5664 pNFS Objects January 2010 + + +Copyright Notice + + Copyright (c) 2010 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + +Table of Contents + + 1. Introduction ....................................................3 + 1.1. Requirements Language ......................................4 + 2. XDR Description of the Objects-Based Layout Protocol ............4 + 2.1. Code Components Licensing Notice ...........................4 + 3. Basic Data Type Definitions .....................................6 + 3.1. pnfs_osd_objid4 ............................................6 + 3.2. pnfs_osd_version4 ..........................................6 + 3.3. pnfs_osd_object_cred4 ......................................7 + 3.4. pnfs_osd_raid_algorithm4 ...................................8 + 4. Object Storage Device Addressing and Discovery ..................8 + 4.1. pnfs_osd_targetid_type4 ...................................10 + 4.2. pnfs_osd_deviceaddr4 ......................................10 + 4.2.1. SCSI Target Identifier .............................11 + 4.2.2. Device Network Address .............................11 + 5. Object-Based Layout ............................................12 + 5.1. pnfs_osd_data_map4 ........................................13 + 5.2. pnfs_osd_layout4 ..........................................14 + 5.3. 
Data Mapping Schemes ......................................14 + 5.3.1. Simple Striping ....................................15 + 5.3.2. Nested Striping ....................................16 + 5.3.3. Mirroring ..........................................17 + 5.4. RAID Algorithms ...........................................18 + 5.4.1. PNFS_OSD_RAID_0 ....................................18 + 5.4.2. PNFS_OSD_RAID_4 ....................................18 + 5.4.3. PNFS_OSD_RAID_5 ....................................18 + 5.4.4. PNFS_OSD_RAID_PQ ...................................19 + 5.4.5. RAID Usage and Implementation Notes ................19 + 6. Object-Based Layout Update .....................................20 + 6.1. pnfs_osd_deltaspaceused4 ..................................20 + 6.2. pnfs_osd_layoutupdate4 ....................................21 + 7. Recovering from Client I/O Errors ..............................21 + + + +Halevy, et al. Standards Track [Page 2] + +RFC 5664 pNFS Objects January 2010 + + + 8. Object-Based Layout Return .....................................22 + 8.1. pnfs_osd_errno4 ...........................................23 + 8.2. pnfs_osd_ioerr4 ...........................................24 + 8.3. pnfs_osd_layoutreturn4 ....................................24 + 9. Object-Based Creation Layout Hint ..............................25 + 9.1. pnfs_osd_layouthint4 ......................................25 + 10. Layout Segments ...............................................26 + 10.1. CB_LAYOUTRECALL and LAYOUTRETURN .........................27 + 10.2. LAYOUTCOMMIT .............................................27 + 11. Recalling Layouts .............................................27 + 11.1. CB_RECALL_ANY ............................................28 + 12. Client Fencing ................................................29 + 13. Security Considerations .......................................29 + 13.1. 
OSD Security Data Types ..................................30 + 13.2. The OSD Security Protocol ................................30 + 13.3. Protocol Privacy Requirements ............................32 + 13.4. Revoking Capabilities ....................................32 + 14. IANA Considerations ...........................................33 + 15. References ....................................................33 + 15.1. Normative References .....................................33 + 15.2. Informative References ...................................34 + Appendix A. Acknowledgments ......................................35 + +1. Introduction + + In pNFS, the file server returns typed layout structures that + describe where file data is located. There are different layouts for + different storage systems and methods of arranging data on storage + devices. This document describes the layouts used with object-based + storage devices (OSDs) that are accessed according to the OSD storage + protocol standard (ANSI INCITS 400-2004 [1]). + + An "object" is a container for data and attributes, and files are + stored in one or more objects. The OSD protocol specifies several + operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, + SET ATTRIBUTES, CREATE, and DELETE. However, using the object-based + layout the client only uses the READ, WRITE, GET ATTRIBUTES, and + FLUSH commands. The other commands are only used by the pNFS server. + + An object-based layout for pNFS includes object identifiers, + capabilities that allow clients to READ or WRITE those objects, and + various parameters that control how file data is striped across their + component objects. The OSD protocol has a capability-based security + scheme that allows the pNFS server to control what operations and + what objects can be used by clients. This scheme is described in + more detail in the "Security Considerations" section (Section 13). + + + + + +Halevy, et al. 
Standards Track [Page 3] + +RFC 5664 pNFS Objects January 2010 + + +1.1. Requirements Language + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [2]. + +2. XDR Description of the Objects-Based Layout Protocol + + This document contains the external data representation (XDR [3]) + description of the NFSv4.1 objects layout protocol. The XDR + description is embedded in this document in a way that makes it + simple for the reader to extract into a ready-to-compile form. The + reader can feed this document into the following shell script to + produce the machine readable XDR description of the NFSv4.1 objects + layout protocol: + + #!/bin/sh + grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' + + That is, if the above script is stored in a file called "extract.sh", + and this document is in a file called "spec.txt", then the reader can + do: + + sh extract.sh < spec.txt > pnfs_osd_prot.x + + The effect of the script is to remove leading white space from each + line, plus a sentinel sequence of "///". + + The embedded XDR file header follows. Subsequent XDR descriptions, + with the sentinel sequence are embedded throughout the document. + + Note that the XDR code contained in this document depends on types + from the NFSv4.1 nfs4_prot.x file ([4]). This includes both nfs + types that end with a 4, such as offset4, length4, etc., as well as + more generic types such as uint32_t and uint64_t. + +2.1. Code Components Licensing Notice + + The XDR description, marked with lines beginning with the sequence + "///", as well as scripts for extracting the XDR description are Code + Components as described in Section 4 of "Legal Provisions Relating to + IETF Documents" [5]. These Code Components are licensed according to + the terms of Section 4 of "Legal Provisions Relating to IETF + Documents". 
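The extraction pipeline from Section 2 can also be sketched in Python for readers who do not have grep and sed at hand. This is an illustrative equivalent, not part of the RFC; the function name extract_xdr is an assumption made here, and the sketch applies the same three steps (select sentinel lines, strip the sentinel plus one space, blank out bare sentinels).

```python
import re


def extract_xdr(text):
    """Return the XDR description embedded in `text`, with sentinels removed.

    Mirrors: grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'
    (extract_xdr is a hypothetical helper name, not defined by the RFC.)
    """
    out = []
    for line in text.splitlines():
        # grep step: keep only lines that start with optional spaces and "///"
        if not re.match(r" *///", line):
            continue
        # first sed step: drop leading spaces, the sentinel, and one space
        line = re.sub(r"^ */// ", "", line, count=1)
        # second sed step: a line holding only the sentinel becomes empty
        line = re.sub(r"^ *///$", "", line)
        out.append(line)
    return "\n".join(out)
```

Indentation that follows the sentinel and its single separating space is preserved, so the extracted structure definitions keep their layout.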
+ + + + + + + +Halevy, et al. Standards Track [Page 4] + +RFC 5664 pNFS Objects January 2010 + + + /// /* + /// * Copyright (c) 2010 IETF Trust and the persons identified + /// * as authors of the code. All rights reserved. + /// * + /// * Redistribution and use in source and binary forms, with + /// * or without modification, are permitted provided that the + /// * following conditions are met: + /// * + /// * o Redistributions of source code must retain the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer. + /// * + /// * o Redistributions in binary form must reproduce the above + /// * copyright notice, this list of conditions and the + /// * following disclaimer in the documentation and/or other + /// * materials provided with the distribution. + /// * + /// * o Neither the name of Internet Society, IETF or IETF + /// * Trust, nor the names of specific contributors, may be + /// * used to endorse or promote products derived from this + /// * software without specific prior written permission. + /// * + /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS + /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED + /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO + /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE + /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING + /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF + /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + /// * + /// * This code was derived from RFC 5664. 
+ /// * Please reproduce this note if possible. + /// */ + /// + /// /* + /// * pnfs_osd_prot.x + /// */ + /// + /// %#include <nfs4_prot.x> + /// + + + +Halevy, et al. Standards Track [Page 5] + +RFC 5664 pNFS Objects January 2010 + + +3. Basic Data Type Definitions + + The following sections define basic data types and constants used by + the Object-Based Layout protocol. + +3.1. pnfs_osd_objid4 + + An object is identified by a number, somewhat like an inode number. + The object storage model has a two-level scheme, where the objects + within an object storage device are grouped into partitions. + + /// struct pnfs_osd_objid4 { + /// deviceid4 oid_device_id; + /// uint64_t oid_partition_id; + /// uint64_t oid_object_id; + /// }; + /// + + The pnfs_osd_objid4 type is used to identify an object within a + partition on a specified object storage device. "oid_device_id" + selects the object storage device from the set of available storage + devices. The device is identified with the deviceid4 type, which is + an index into addressing information about that device returned by + the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data + type is defined in NFSv4.1 [6]. Within an OSD, a partition is + identified with a 64-bit number, "oid_partition_id". Within a + partition, an object is identified with a 64-bit number, + "oid_object_id". Creation and management of partitions is outside + the scope of this document, and is a facility provided by the object- + based storage file system. + +3.2. pnfs_osd_version4 + + /// enum pnfs_osd_version4 { + /// PNFS_OSD_MISSING = 0, + /// PNFS_OSD_VERSION_1 = 1, + /// PNFS_OSD_VERSION_2 = 2 + /// }; + /// + + pnfs_osd_version4 is used to indicate the OSD protocol version or + whether an object is missing (i.e., unavailable). 
Some of the + object-based layout-supported RAID algorithms encode redundant + information and can compensate for missing components, but the data + placement algorithm needs to know what parts are missing. + + + + + + +Halevy, et al. Standards Track [Page 6] + +RFC 5664 pNFS Objects January 2010 + + + At this time, the OSD standard is at version 1.0, and we anticipate a + version 2.0 of the standard (SNIA T10/1729-D [14]). The second + generation OSD protocol has additional proposed features to support + more robust error recovery, snapshots, and byte-range capabilities. + Therefore, the OSD version is explicitly called out in the + information returned in the layout. (This information can also be + deduced by looking inside the capability type at the format field, + which is the first byte. The format value is 0x1 for an OSD v1 + capability. However, it seems most robust to call out the version + explicitly.) + +3.3. pnfs_osd_object_cred4 + + /// enum pnfs_osd_cap_key_sec4 { + /// PNFS_OSD_CAP_KEY_SEC_NONE = 0, + /// PNFS_OSD_CAP_KEY_SEC_SSV = 1 + /// }; + /// + /// struct pnfs_osd_object_cred4 { + /// pnfs_osd_objid4 oc_object_id; + /// pnfs_osd_version4 oc_osd_version; + /// pnfs_osd_cap_key_sec4 oc_cap_key_sec; + /// opaque oc_capability_key<>; + /// opaque oc_capability<>; + /// }; + /// + + The pnfs_osd_object_cred4 structure is used to identify each + component comprising the file. The "oc_object_id" identifies the + component object, the "oc_osd_version" represents the osd protocol + version, or whether that component is unavailable, and the + "oc_capability" and "oc_capability_key", along with the + "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD + security credentials needed to access that object. The + "oc_cap_key_sec" value denotes the method used to secure the + oc_capability_key (see Section 13.1 for more details). 
+
+   To comply with the OSD security requirements, the capability key
+   SHOULD be transferred securely to prevent eavesdropping (see
+   Section 13). Therefore, a client SHOULD either issue the LAYOUTGET
+   or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service
+   or previously establish a secret state verifier (SSV) for the
+   sessions via the NFSv4.1 SET_SSV operation. The
+   pnfs_osd_cap_key_sec4 type is used to identify the method used by the
+   server to secure the capability key.
+
+   o  PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is
+      not encrypted, in which case the client SHOULD issue the LAYOUTGET
+      or GETDEVICEINFO operations with RPCSEC_GSS with the privacy
+      service, or the NFSv4.1 transport should be secured by methods
+      that are external to NFSv4.1, such as the use of IPsec [15] for
+      transporting the NFSv4.1 protocol.
+
+   o  PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key
+      contents are encrypted using the SSV GSS context and the
+      capability key as inputs to the GSS_Wrap() function (see GSS-API
+      [7]) with the conf_req_flag set to TRUE. The client MUST use the
+      secret SSV key as part of the client's GSS context to decrypt the
+      capability key using the value of the oc_capability_key field as
+      the input_message to the GSS_Unwrap() function. Note that to
+      prevent eavesdropping of the SSV key, the client SHOULD issue
+      SET_SSV via RPCSEC_GSS with the privacy service.
+
+   The actual method chosen depends on whether the client established an
+   SSV key with the server and whether it issued the operation with the
+   RPCSEC_GSS privacy method. Naturally, if the client did not
+   establish an SSV key via SET_SSV, the server MUST use the
+   PNFS_OSD_CAP_KEY_SEC_NONE method.
Otherwise, if the operation was + not issued with the RPCSEC_GSS privacy method, the server SHOULD + secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV + method. The server MAY use the PNFS_OSD_CAP_KEY_SEC_SSV method also + when the operation was issued with the RPCSEC_GSS privacy method. + +3.4. pnfs_osd_raid_algorithm4 + + /// enum pnfs_osd_raid_algorithm4 { + /// PNFS_OSD_RAID_0 = 1, + /// PNFS_OSD_RAID_4 = 2, + /// PNFS_OSD_RAID_5 = 3, + /// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */ + /// }; + /// + + pnfs_osd_raid_algorithm4 represents the data redundancy algorithm + used to protect the file's contents. See Section 5.4 for more + details. + +4. Object Storage Device Addressing and Discovery + + Data operations to an OSD require the client to know the "address" of + each OSD's root object. The root object is synonymous with the Small + Computer System Interface (SCSI) logical unit. The client specifies + SCSI logical units to its SCSI protocol stack using a representation + + + + +Halevy, et al. Standards Track [Page 8] + +RFC 5664 pNFS Objects January 2010 + + + local to the client. Because these representations are local, + GETDEVICEINFO must return information that can be used by the client + to select the correct local representation. + + In the block world, a set offset (logical block number or track/ + sector) contains a disk label. This label identifies the disk + uniquely. In contrast, an OSD has a standard set of attributes on + its root object. For device identification purposes, the OSD System + ID (root information attribute number 3) and the OSD Name (root + information attribute number 9) are used as the label. These appear + in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and + "oda_osdname" fields. + + In some situations, SCSI target discovery may need to be driven based + on information contained in the GETDEVICEINFO response. 
One example + of this is Internet SCSI (iSCSI) targets that are not known to the + client until a layout has been requested. The information provided + as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the + pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows + the client to probe a specific device given its network address and + optionally its iSCSI Name (see iSCSI [8]), or when the device network + address is omitted, allows it to discover the object storage device + using the provided device name or SCSI Device Identifier (see SPC-3 + [9].) + + The oda_systemid is implicitly used by the client, by using the + object credential signing key to sign each request with the request + integrity check value. This method protects the client from + unintentionally accessing a device if the device address mapping was + changed (or revoked). The server computes the capability key using + its own view of the systemid associated with the respective deviceid + present in the credential. If the client's view of the deviceid + mapping is stale, the client will use the wrong systemid (which must + be system-wide unique) and the I/O request to the OSD will fail to + pass the integrity check verification. + + To recover from this condition the client should report the error and + return the layout using LAYOUTRETURN, and invalidate all the device + address mappings associated with this layout. The client can then + ask for a new layout if it wishes using LAYOUTGET and resolve the + referenced deviceids using GETDEVICEINFO or GETDEVICELIST. + + The server MUST provide the oda_systemid and SHOULD also provide the + oda_osdname. When the OSD name is present, the client SHOULD get the + root information attributes whenever it establishes communication + with the OSD and verify that the OSD name it got from the OSD matches + the one sent by the metadata server. To do so, the client uses the + root_obj_cred credentials. + + + +Halevy, et al. 
Standards Track [Page 9] + +RFC 5664 pNFS Objects January 2010 + + +4.1. pnfs_osd_targetid_type4 + + The following enum specifies the manner in which a SCSI target can be + specified. The target can be specified as a SCSI Name, or as an SCSI + Device Identifier. + + /// enum pnfs_osd_targetid_type4 { + /// OBJ_TARGET_ANON = 1, + /// OBJ_TARGET_SCSI_NAME = 2, + /// OBJ_TARGET_SCSI_DEVICE_ID = 3 + /// }; + /// + +4.2. pnfs_osd_deviceaddr4 + + The specification for an object device address is as follows: + +/// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { +/// case OBJ_TARGET_SCSI_NAME: +/// string oti_scsi_name<>; +/// +/// case OBJ_TARGET_SCSI_DEVICE_ID: +/// opaque oti_scsi_device_id<>; +/// +/// default: +/// void; +/// }; +/// +/// union pnfs_osd_targetaddr4 switch (bool ota_available) { +/// case TRUE: +/// netaddr4 ota_netaddr; +/// case FALSE: +/// void; +/// }; +/// +/// struct pnfs_osd_deviceaddr4 { +/// pnfs_osd_targetid4 oda_targetid; +/// pnfs_osd_targetaddr4 oda_targetaddr; +/// opaque oda_lun[8]; +/// opaque oda_systemid<>; +/// pnfs_osd_object_cred4 oda_root_obj_cred; +/// opaque oda_osdname<>; +/// }; +/// + + + + + + + +Halevy, et al. Standards Track [Page 10] + +RFC 5664 pNFS Objects January 2010 + + +4.2.1. SCSI Target Identifier + + When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the + "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as + specified in iSCSI [8] and [10]. Note that the specification of the + oti_scsi_name string format is outside the scope of this document. + Parsing the string is based on the string prefix, e.g., "iqn.", + "eui.", or "naa." and more formats MAY be specified in the future in + accordance with iSCSI Names properties. + + Currently, the iSCSI Name provides for naming the target device using + a string formatted as an iSCSI Qualified Name (IQN) or as an Extended + Unique Identifier (EUI) [11] string. 
Those are typically used to
+   identify iSCSI or SCSI RDMA Protocol (SRP) [16] devices. The
+   Network Address Authority (NAA) string format (see [10]) provides for
+   naming the device using globally unique identifiers, as defined in
+   Fibre Channel Framing and Signaling (FC-FS) [17]. These are
+   typically used to identify Fibre Channel or Serial Attached SCSI
+   (SAS) [18] devices, in particular devices that are dual-attached
+   both over Fibre Channel or SAS and over iSCSI.
+
+   When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the
+   "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
+   Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3,
+   "Device Identification VPD Page"). If the Device Identifier is
+   identical to the OSD System ID, as given by oda_systemid, the server
+   SHOULD provide a zero-length oti_scsi_device_id opaque value. Note
+   that, as with "oti_scsi_name", the specification of the
+   oti_scsi_device_id opaque contents is outside the scope of this
+   document and more formats MAY be specified in the future in
+   accordance with SPC-3.
+
+   The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing
+   no target identification. In this case, only the OSD System ID, and
+   optionally the provided network address, are used to locate the
+   device.
+
+4.2.2. Device Network Address
+
+   The optional "oda_targetaddr" field MAY be provided by the server as
+   a hint to accelerate device discovery over, e.g., the iSCSI transport
+   protocol. The network address is given with the netaddr4 type, which
+   specifies a TCP/IP based endpoint (as specified in NFSv4.1 [6]).
+   When given, the client SHOULD use it to probe for the SCSI device at
+   the given network address. The client MAY still use other discovery
+   mechanisms such as Internet Storage Name Service (iSNS) [12] to
+   locate the device using the oda_targetid. In particular, such an
Standards Track [Page 11] + +RFC 5664 pNFS Objects January 2010 + + + external name service SHOULD be used when the devices may be attached + to the network using multiple connections, and/or multiple storage + fabrics (e.g., Fibre-Channel and iSCSI). + + The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, + formatted in accordance with SAM-3 [13]. The client uses the Logical + Unit Number to communicate with the specific OSD Logical Unit. Its + use is defined in detail by the SCSI transport protocol, e.g., iSCSI + [8]. + +5. Object-Based Layout + + The layout4 type is defined in the NFSv4.1 [6] as follows: + + enum layouttype4 { + LAYOUT4_NFSV4_1_FILES = 1, + LAYOUT4_OSD2_OBJECTS = 2, + LAYOUT4_BLOCK_VOLUME = 3 + }; + + struct layout_content4 { + layouttype4 loc_type; + opaque loc_body<>; + }; + + struct layout4 { + offset4 lo_offset; + length4 lo_length; + layoutiomode4 lo_iomode; + layout_content4 lo_content; + }; + + + This document defines structure associated with the layouttype4 + value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 [6] specifies the loc_body + structure as an XDR type "opaque". The opaque layout is + uninterpreted by the generic pNFS client layers, but obviously must + be interpreted by the object storage layout driver. This section + defines the structure of this opaque value, pnfs_osd_layout4. + + + + + + + + + + + + +Halevy, et al. Standards Track [Page 12] + +RFC 5664 pNFS Objects January 2010 + + +5.1. pnfs_osd_data_map4 + + /// struct pnfs_osd_data_map4 { + /// uint32_t odm_num_comps; + /// length4 odm_stripe_unit; + /// uint32_t odm_group_width; + /// uint32_t odm_group_depth; + /// uint32_t odm_mirror_cnt; + /// pnfs_osd_raid_algorithm4 odm_raid_algorithm; + /// }; + /// + + The pnfs_osd_data_map4 structure parameterizes the algorithm that + maps a file's contents over the component objects. 
Instead of
+   limiting the system to a simple striping scheme where loss of a
+   single component object results in data loss, the map parameters
+   support mirroring and more complicated schemes that protect against
+   loss of a component object.
+
+   "odm_num_comps" is the number of component objects the file is
+   striped over. The server MAY grow the file by adding more components
+   to the stripe while clients hold valid layouts until the file has
+   reached its final stripe width. The file length in this case MUST be
+   limited to the number of bytes in a full stripe.
+
+   The "odm_stripe_unit" is the number of bytes placed on one component
+   before advancing to the next one in the list of components. The
+   number of bytes in a full stripe is odm_stripe_unit times the number
+   of components. In some RAID schemes, a stripe includes redundant
+   information (i.e., parity) that lets the system recover from loss or
+   damage to a component object.
+
+   The "odm_group_width" and "odm_group_depth" parameters allow a nested
+   striping pattern (see Section 5.3.2 for details). If there is no
+   nesting, then odm_group_width and odm_group_depth MUST be zero. The
+   size of the components array MUST be a multiple of odm_group_width.
+
+   The "odm_mirror_cnt" is used to replicate a file by replicating its
+   component objects. If there is no mirroring, then odm_mirror_cnt
+   MUST be 0. If odm_mirror_cnt is greater than zero, then the size of
+   the component array MUST be a multiple of (odm_mirror_cnt+1).
+
+   See Section 5.3 for more details.
+
+5.2. pnfs_osd_layout4
+
+   /// struct pnfs_osd_layout4 {
+   ///     pnfs_osd_data_map4      olo_map;
+   ///     uint32_t                olo_comps_index;
+   ///     pnfs_osd_object_cred4   olo_components<>;
+   /// };
+   ///
+
+   The pnfs_osd_layout4 structure specifies a layout over a set of
+   component objects.
The "olo_components" field is an array of object + identifiers and security credentials that grant access to each + object. The organization of the data is defined by the + pnfs_osd_data_map4 type that specifies how the file's data is mapped + onto the component objects (i.e., the striping pattern). The data + placement algorithm that maps file data onto component objects + assumes that each component object occurs exactly once in the array + of components. Therefore, component objects MUST appear in the + olo_components array only once. The components array may represent + all objects comprising the file, in which case "olo_comps_index" is + set to zero and the number of entries in the olo_components array is + equal to olo_map.odm_num_comps. The server MAY return fewer + components than odm_num_comps, provided that the returned components + are sufficient to access any byte in the layout's data range (e.g., a + sub-stripe of "odm_group_width" components). In this case, + olo_comps_index represents the position of the returned components + array within the full array of components that comprise the file. + + Note that the layout depends on the file size, which the client + learns from the generic return parameters of LAYOUTGET, by doing + GETATTR commands to the metadata server. The client uses the file + size to decide if it should fill holes with zeros or return a short + read. Striping patterns can cause cases where component objects are + shorter than other components because a hole happens to correspond to + the last part of the component object. + +5.3. Data Mapping Schemes + + This section describes the different data mapping schemes in detail. + The object layout always uses a "dense" layout as described in + NFSv4.1 [6]. This means that the second stripe unit of the file + starts at offset 0 of the second component, rather than at offset + stripe_unit bytes. 
After a full stripe has been written, the next
+   stripe unit is appended to the first component object in the list
+   without any holes in the component objects.
+
+5.3.1. Simple Striping
+
+   The mapping from the logical offset within a file (L) to the
+   component object C and object-specific offset O is defined by the
+   following equations:
+
+      L = logical offset into the file
+      W = total number of components
+      S = W * stripe_unit
+      N = L / S
+      C = (L-(N*S)) / stripe_unit
+      O = (N*stripe_unit)+(L%stripe_unit)
+
+   In these equations, S is the number of bytes in a full stripe, and N
+   is the stripe number. C is an index into the array of components, so
+   it selects a particular object storage device. Both N and C count
+   from zero. O is the offset within the object that corresponds to the
+   file offset. The divisions are integer divisions that discard the
+   remainder. Note that this computation does not accommodate the
+   same object appearing in the olo_components array multiple times.
+
+   For example, consider an object striped over four devices, <D0 D1 D2
+   D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 *
+   4096 = 16384.
+
+      Offset 0:
+         N = 0 / 16384 = 0
+         C = (0-(0*16384)) / 4096 = 0 (D0)
+         O = (0*4096)+(0%4096) = 0
+
+      Offset 4096:
+         N = 4096 / 16384 = 0
+         C = (4096-(0*16384)) / 4096 = 1 (D1)
+         O = (0*4096)+(4096%4096) = 0
+
+      Offset 9000:
+         N = 9000 / 16384 = 0
+         C = (9000-(0*16384)) / 4096 = 2 (D2)
+         O = (0*4096)+(9000%4096) = 808
+
+      Offset 132000:
+         N = 132000 / 16384 = 8
+         C = (132000-(8*16384)) / 4096 = 0 (D0)
+         O = (8*4096)+(132000%4096) = 33696
+
+5.3.2. Nested Striping
+
+   The odm_group_width and odm_group_depth parameters allow a nested
+   striping pattern.
odm_group_width defines the width of a data stripe + and odm_group_depth defines how many stripes are written before + advancing to the next group of components in the list of component + objects for the file. The math used to map from a file offset to a + component object and offset within that object is shown below. The + computations map from the logical offset L to the component index C + and offset relative O within that component object. + + L = logical offset into the file + W = total number of components + S = stripe_unit * group_depth * W + T = stripe_unit * group_depth * group_width + U = stripe_unit * group_width + M = L / S + G = (L - (M * S)) / T + H = (L - (M * S)) % T + N = H / U + C = (H - (N * U)) / stripe_unit + G * group_width + O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit + + In these equations, S is the number of bytes striped across all + component objects before the pattern repeats. T is the number of + bytes striped within a group of component objects before advancing to + the next group. U is the number of bytes in a stripe within a group. + M is the "major" (i.e., across all components) stripe number, and N + is the "minor" (i.e., across the group) stripe number. G counts the + groups from the beginning of the major stripe, and H is the byte + offset within the group. + + For example, consider an object striped over 100 devices with a + group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB. + In this scheme, 500 MB are written to the first 10 components, and + 5000 MB are written before the pattern wraps back around to the first + component in the array. + + + + + + + + + + + + + + +Halevy, et al. 
Standards Track [Page 16] + +RFC 5664 pNFS Objects January 2010 + + + Offset 0: + W = 100 + S = 1 MB * 50 * 100 = 5000 MB + T = 1 MB * 50 * 10 = 500 MB + U = 1 MB * 10 = 10 MB + M = 0 / 5000 MB = 0 + G = (0 - (0 * 5000 MB)) / 500 MB = 0 + H = (0 - (0 * 5000 MB)) % 500 MB = 0 + N = 0 / 10 MB = 0 + C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0 + O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0 + + Offset 27 MB: + M = 27 MB / 5000 MB = 0 + G = (27 MB - (0 * 5000 MB)) / 500 MB = 0 + H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB + N = 27 MB / 10 MB = 2 + C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7 + O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB + + Offset 7232 MB: + M = 7232 MB / 5000 MB = 1 + G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4 + H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB + N = 232 MB / 10 MB = 23 + C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42 + O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB + +5.3.3. Mirroring + + The odm_mirror_cnt is used to replicate a file by replicating its + component objects. If there is no mirroring, then odm_mirror_cnt + MUST be 0. If odm_mirror_cnt is greater than zero, then the size of + the olo_components array MUST be a multiple of (odm_mirror_cnt+1). + Thus, for a classic mirror on two objects, odm_mirror_cnt is one. + Note that mirroring can be defined over any RAID algorithm and + striping pattern (either simple or nested). If odm_group_width is + also non-zero, then the size of the olo_components array MUST be a + multiple of odm_group_width * (odm_mirror_cnt+1). Replicas are + adjacent in the olo_components array, and the value C produced by the + above equations is not a direct index into the olo_components array. + Instead, the following equations determine the replica component + index RCi, where i ranges from 0 to odm_mirror_cnt. + + C = component index for striping or two-level striping + i ranges from 0 to odm_mirror_cnt, inclusive + RCi = C * (odm_mirror_cnt+1) + i + + + + +Halevy, et al. 
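The striping and mirroring equations of Sections 5.3.1 through 5.3.3 can be checked with a short Python sketch. This is an illustration only, not part of the protocol definition. The function names (simple_stripe, nested_stripe, replica_indices) and the constant MB are hypothetical and were chosen for this example; the variable names inside each function follow the equations above, and "//" is Python's integer division.

```python
MB = 1 << 20  # illustrative unit; any consistent byte unit works


def simple_stripe(L, W, stripe_unit):
    """Map file offset L to (C, O) using the Section 5.3.1 equations."""
    S = W * stripe_unit                 # bytes in one full stripe
    N = L // S                          # stripe number
    C = (L - N * S) // stripe_unit      # component index
    O = N * stripe_unit + L % stripe_unit
    return C, O


def nested_stripe(L, W, stripe_unit, group_width, group_depth):
    """Map file offset L to (C, O) using the Section 5.3.2 equations."""
    S = stripe_unit * group_depth * W            # bytes before pattern repeats
    T = stripe_unit * group_depth * group_width  # bytes per group
    U = stripe_unit * group_width                # bytes per stripe in a group
    M = L // S                                   # major stripe number
    G = (L - M * S) // T                         # group number
    H = (L - M * S) % T                          # byte offset within the group
    N = H // U                                   # minor stripe number
    C = (H - N * U) // stripe_unit + G * group_width
    O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit
    return C, O


def replica_indices(C, mirror_cnt):
    """Return RC0..RCmirror_cnt for component index C (Section 5.3.3)."""
    return [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]
```

With the parameters of the worked examples, simple_stripe(9000, 4, 4096) yields (2, 808) and nested_stripe(7232 * MB, 100, 1 * MB, 10, 50) yields (42, 73 * MB), matching the values listed in the text. For a classic two-way mirror (odm_mirror_cnt of one), replica_indices(2, 1) yields [4, 5].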
Standards Track [Page 17] + +RFC 5664 pNFS Objects January 2010 + + +5.4. RAID Algorithms + + pnfs_osd_raid_algorithm4 determines the algorithm and placement of + redundant data. This section defines the different redundancy + algorithms. Note: The term "RAID" (Redundant Array of Independent + Disks) is used in this document to represent an array of component + objects that store data for an individual file. The objects are + stored on independent object-based storage devices. File data is + encoded and striped across the array of component objects using + algorithms developed for block-based RAID systems. + +5.4.1. PNFS_OSD_RAID_0 + + PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the + component objects are data bytes located by the above equations for C + and O. If a component object is marked as PNFS_OSD_MISSING, the pNFS + client MUST either return an I/O error if this component is attempted + to be read or, alternatively, it can retry the READ against the pNFS + server. + +5.4.2. PNFS_OSD_RAID_4 + + PNFS_OSD_RAID_4 means that the last component object, or the last in + each group (if odm_group_width is greater than zero), contains parity + information computed over the rest of the stripe with an XOR + operation. If a component object is unavailable, the client can read + the rest of the stripe units in the damaged stripe and recompute the + missing stripe unit by XORing the other stripe units in the stripe. + Or the client can replay the READ against the pNFS server that will + presumably perform the reconstructed read on the client's behalf. + + When parity is present in the file, then there is an additional + computation to map from the file offset L to the offset that accounts + for embedded parity, L'. First compute L', and then use L' in the + above equations for C and O. 
+
+      L = file offset, not accounting for parity
+      P = number of parity devices in each stripe
+      W = group_width, if not zero, else size of olo_components array
+      N = L / ((W-P) * stripe_unit)
+      L' = N * (W * stripe_unit) +
+           (L % ((W-P) * stripe_unit))
+
+5.4.3.  PNFS_OSD_RAID_5
+
+   PNFS_OSD_RAID_5 means that the position of the parity data is rotated
+   on each stripe or each group (if odm_group_width is greater than
+   zero).  In the first stripe, the last component holds the parity.  In
+   the second stripe, the next-to-last component holds the parity, and
+   so on.  In this scheme, all stripe units are rotated so that I/O is
+   evenly spread across objects as the file is read sequentially.  The
+   rotated parity layout is illustrated here, with numbers indicating
+   the stripe unit.
+
+      0 1 2 P
+      4 5 P 3
+      8 P 6 7
+      P 9 a b
+
+   To compute the component object C, first compute the offset that
+   accounts for parity L' and use that to compute C.  Then rotate C to
+   get C'.  Finally, increase C' by one if the parity information comes
+   at or before C' within that stripe.  The following equations
+   illustrate this by computing I, which is the index of the component
+   that contains parity for a given stripe.
+
+      L = file offset, not accounting for parity
+      W = odm_group_width, if not zero, else size of olo_components array
+      N = L / ((W-1) * stripe_unit)
+      (Compute L' as described above)
+      (Compute C based on L' as described above)
+      C' = (C - (N%W)) % W
+      I = W - (N%W) - 1
+      if (C' <= I) {
+        C'++
+      }
+
+5.4.4.  PNFS_OSD_RAID_PQ
+
+   PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
+   P+Q encoding scheme [19].  In this layout, the last two component
+   objects hold the P and Q data, respectively.  P is parity computed
+   with XOR, and Q is computed with a more complex equation that is not
+   described here.
The equations given above for embedded parity can be used to + map a file offset to the correct component object by setting the + number of parity components to 2 instead of 1 for RAID4 or RAID5. + Clients may simply choose to read data through the metadata server if + two components are missing or damaged. + +5.4.5. RAID Usage and Implementation Notes + + RAID layouts with redundant data in their stripes require additional + serialization of updates to ensure correct operation. Otherwise, if + two clients simultaneously write to the same logical range of an + object, the result could include different data in the same ranges of + mirrored tuples, or corrupt parity information. It is the + + + +Halevy, et al. Standards Track [Page 19] + +RFC 5664 pNFS Objects January 2010 + + + responsibility of the metadata server to enforce serialization + requirements such as this. For example, the metadata server may do + so by not granting overlapping write layouts within mirrored objects. + +6. Object-Based Layout Update + + layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates + to the layout and additional information to the metadata server. It + is defined in the NFSv4.1 [6] as follows: + + struct layoutupdate4 { + layouttype4 lou_type; + opaque lou_body<>; + }; + + The layoutupdate4 type is an opaque value at the generic pNFS client + level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the + lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. + + Object-Based pNFS clients are not allowed to modify the layout. + Therefore, the information passed in pnfs_osd_layoutupdate4 is used + only to update the file's attributes. 
In addition to the generic + information the client can pass to the metadata server in + LAYOUTCOMMIT such as the highest offset the client wrote to and the + last time it modified the file, the client MAY use + pnfs_osd_layoutupdate4 to convey the capacity consumed (or released) + by writes using the layout, and to indicate that I/O errors were + encountered by such writes. + +6.1. pnfs_osd_deltaspaceused4 + + /// union pnfs_osd_deltaspaceused4 switch (bool dsu_valid) { + /// case TRUE: + /// int64_t dsu_delta; + /// case FALSE: + /// void; + /// }; + /// + + pnfs_osd_deltaspaceused4 is used to convey space utilization + information at the time of LAYOUTCOMMIT. For the file system to + properly maintain capacity-used information, it needs to track how + much capacity was consumed by WRITE operations performed by the + client. In this protocol, the OSD returns the capacity consumed by a + write (*), which can be different than the number of bytes written + because of internal overhead like block-level allocation and indirect + blocks, and the client reflects this back to the pNFS server so it + can accurately track quota. The pNFS server can choose to trust this + + + +Halevy, et al. Standards Track [Page 20] + +RFC 5664 pNFS Objects January 2010 + + + information coming from the clients and therefore avoid querying the + OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain + this information from the OSD, it simply returns invalid + olu_delta_space_used. + +6.2. pnfs_osd_layoutupdate4 + + /// struct pnfs_osd_layoutupdate4 { + /// pnfs_osd_deltaspaceused4 olu_delta_space_used; + /// bool olu_ioerr_flag; + /// }; + /// + + "olu_delta_space_used" is used to convey capacity usage information + back to the metadata server. + + The "olu_ioerr_flag" is used when I/O errors were encountered while + writing the file. The client MUST report the errors using the + pnfs_osd_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time. 
+ + If the client updated the file successfully before hitting the I/O + errors, it MAY use LAYOUTCOMMIT to update the metadata server as + described above. Typically, in the error-free case, the server MAY + turn around and update the file's attributes on the storage devices. + However, if I/O errors were encountered, the server better not + attempt to write the new attributes on the storage devices until it + receives the I/O error report; therefore, the client MUST set the + olu_ioerr_flag to true. Note that in this case, the client SHOULD + send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same + COMPOUND RPC. + +7. Recovering from Client I/O Errors + + The pNFS client may encounter errors when directly accessing the + object storage devices. However, it is the responsibility of the + metadata server to handle the I/O errors. When the + LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the + I/O errors to the server at LAYOUTRETURN time using the + pnfs_osd_ioerr4 structure (see Section 8.1). + + The metadata server analyzes the error and determines the required + recovery operations such as repairing any parity inconsistencies, + recovering media failures, or reconstructing missing objects. + + + + + + + + +Halevy, et al. Standards Track [Page 21] + +RFC 5664 pNFS Objects January 2010 + + + The metadata server SHOULD recall any outstanding layouts to allow it + exclusive write access to the stripes being recovered and to prevent + other clients from hitting the same error condition. In these cases, + the server MUST complete recovery before handing out any new layouts + to the affected byte ranges. 
+ + Although it MAY be acceptable for the client to propagate a + corresponding error to the application that initiated the I/O + operation and drop any unwritten data, the client SHOULD attempt to + retry the original I/O operation by requesting a new layout using + LAYOUTGET and retry the I/O operation(s) using the new layout, or the + client MAY just retry the I/O operation(s) using regular NFS READ or + WRITE operations via the metadata server. The client SHOULD attempt + to retrieve a new layout and retry the I/O operation using OSD + commands first and only if the error persists, retry the I/O + operation via the metadata server. + +8. Object-Based Layout Return + + layoutreturn_file4 is used in the LAYOUTRETURN operation to convey + layout-type specific information to the server. It is defined in the + NFSv4.1 [6] as follows: + + struct layoutreturn_file4 { + offset4 lrf_offset; + length4 lrf_length; + stateid4 lrf_stateid; + /* layouttype4 specific data */ + opaque lrf_body<>; + }; + + union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { + case LAYOUTRETURN4_FILE: + layoutreturn_file4 lr_layout; + default: + void; + }; + + struct LAYOUTRETURN4args { + /* CURRENT_FH: file */ + bool lora_reclaim; + layoutreturn_stateid lora_recallstateid; + layouttype4 lora_layout_type; + layoutiomode4 lora_iomode; + layoutreturn4 lora_layoutreturn; + }; + + + + + +Halevy, et al. Standards Track [Page 22] + +RFC 5664 pNFS Objects January 2010 + + + If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the + lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type. + + The pnfs_osd_layoutreturn4 type allows the client to report I/O error + information back to the metadata server as defined below. + +8.1. 
pnfs_osd_errno4 + + /// enum pnfs_osd_errno4 { + /// PNFS_OSD_ERR_EIO = 1, + /// PNFS_OSD_ERR_NOT_FOUND = 2, + /// PNFS_OSD_ERR_NO_SPACE = 3, + /// PNFS_OSD_ERR_BAD_CRED = 4, + /// PNFS_OSD_ERR_NO_ACCESS = 5, + /// PNFS_OSD_ERR_UNREACHABLE = 6, + /// PNFS_OSD_ERR_RESOURCE = 7 + /// }; + /// + + pnfs_osd_errno4 is used to represent error types when read/write + errors are reported to the metadata server. The error codes serve as + hints to the metadata server that may help it in diagnosing the exact + reason for the error and in repairing it. + + o PNFS_OSD_ERR_EIO indicates the operation failed because the object + storage device experienced a failure trying to access the object. + The most common source of these errors is media errors, but other + internal errors might cause this as well. In this case, the + metadata server should go examine the broken object more closely; + hence, it should be used as the default error code. + + o PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object + that does not exist on the object storage device. + + o PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the + object storage device ran out of free capacity during the + operation. + + o PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not + valid. The primary cause of this is that the capability has + expired, or the access policy tag (a.k.a., capability version + number) has been changed to revoke capabilities. The client will + need to return the layout and get a new one with fresh + capabilities. + + + + + + + +Halevy, et al. Standards Track [Page 23] + +RFC 5664 pNFS Objects January 2010 + + + o PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the + requested operation. This should not occur in normal operation + because the metadata server should give out correct capabilities, + or none at all. 
+ + o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the + I/O operation at the object storage device due to a communication + failure. Whether or not the I/O operation was executed by the OSD + is undetermined. + + o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O + operation due to a local problem on the initiator (i.e., client) + side, e.g., when running out of memory. The client MUST guarantee + that the OSD command was never dispatched to the OSD. + +8.2. pnfs_osd_ioerr4 + + /// struct pnfs_osd_ioerr4 { + /// pnfs_osd_objid4 oer_component; + /// length4 oer_comp_offset; + /// length4 oer_comp_length; + /// bool oer_iswrite; + /// pnfs_osd_errno4 oer_errno; + /// }; + /// + + The pnfs_osd_ioerr4 structure is used to return error indications for + objects that generated errors during data transfers. These are hints + to the metadata server that there are problems with that object. For + each error, "oer_component", "oer_comp_offset", and "oer_comp_length" + represent the object and byte range within the component object in + which the error occurred; "oer_iswrite" is set to "true" if the + failed OSD operation was data modifying, and "oer_errno" represents + the type of error. + + Component byte ranges in the optional pnfs_osd_ioerr4 structure are + used for recovering the object and MUST be set by the client to cover + all failed I/O operations to the component. + +8.3. pnfs_osd_layoutreturn4 + + /// struct pnfs_osd_layoutreturn4 { + /// pnfs_osd_ioerr4 olr_ioerr_report<>; + /// }; + /// + + + + + + +Halevy, et al. Standards Track [Page 24] + +RFC 5664 pNFS Objects January 2010 + + + When OSD I/O operations failed, "olr_ioerr_report<>" is used to + report these errors to the metadata server as an array of elements of + type pnfs_osd_ioerr4. Each element in the array represents an error + that occurred on the object specified by oer_component. 
If no errors + are to be reported, the size of the olr_ioerr_report<> array is set + to zero. + +9. Object-Based Creation Layout Hint + + The layouthint4 type is defined in the NFSv4.1 [6] as follows: + + struct layouthint4 { + layouttype4 loh_type; + opaque loh_body<>; + }; + + The layouthint4 structure is used by the client to pass a hint about + the type of layout it would like created for a particular file. If + the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body + opaque value is defined by the pnfs_osd_layouthint4 type. + +9.1. pnfs_osd_layouthint4 + + /// union pnfs_osd_max_comps_hint4 switch (bool omx_valid) { + /// case TRUE: + /// uint32_t omx_max_comps; + /// case FALSE: + /// void; + /// }; + /// + /// union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) { + /// case TRUE: + /// length4 osu_stripe_unit; + /// case FALSE: + /// void; + /// }; + /// + /// union pnfs_osd_group_width_hint4 switch (bool ogw_valid) { + /// case TRUE: + /// uint32_t ogw_group_width; + /// case FALSE: + /// void; + /// }; + /// + /// union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) { + /// case TRUE: + /// uint32_t ogd_group_depth; + /// case FALSE: + + + +Halevy, et al. 
Standards Track [Page 25] + +RFC 5664 pNFS Objects January 2010 + + + /// void; + /// }; + /// + /// union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) { + /// case TRUE: + /// uint32_t omc_mirror_cnt; + /// case FALSE: + /// void; + /// }; + /// + /// union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) { + /// case TRUE: + /// pnfs_osd_raid_algorithm4 ora_raid_algorithm; + /// case FALSE: + /// void; + /// }; + /// + /// struct pnfs_osd_layouthint4 { + /// pnfs_osd_max_comps_hint4 olh_max_comps_hint; + /// pnfs_osd_stripe_unit_hint4 olh_stripe_unit_hint; + /// pnfs_osd_group_width_hint4 olh_group_width_hint; + /// pnfs_osd_group_depth_hint4 olh_group_depth_hint; + /// pnfs_osd_mirror_cnt_hint4 olh_mirror_cnt_hint; + /// pnfs_osd_raid_algorithm_hint4 olh_raid_algorithm_hint; + /// }; + /// + + This type conveys hints for the desired data map. All parameters are + optional so the client can give values for only the parameters it + cares about, e.g. it can provide a hint for the desired number of + mirrored components, regardless of the RAID algorithm selected for + the file. The server should make an attempt to honor the hints, but + it can ignore any or all of them at its own discretion and without + failing the respective CREATE operation. + + The "olh_max_comps_hint" can be used to limit the total number of + component objects comprising the file. All other hints correspond + directly to the different fields of pnfs_osd_data_map4. + +10. Layout Segments + + The pnfs layout operations operate on logical byte ranges. There is + no requirement in the protocol for any relationship between byte + ranges used in LAYOUTGET to acquire layouts and byte ranges used in + CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD + byte-range capabilities poses limitations on these operations since + + + + + +Halevy, et al. 
+   the capabilities associated with layout segments cannot be merged or
+   split.  The following guidelines should be followed for proper
+   operation of object-based layouts.
+
+10.1.  CB_LAYOUTRECALL and LAYOUTRETURN
+
+   In general, the object-based layout driver should keep track of each
+   layout segment it got, keeping record of the segment's iomode,
+   offset, and length.  The server should allow the client to get
+   multiple overlapping layout segments but is free to recall the layout
+   to prevent overlap.
+
+   In response to CB_LAYOUTRECALL, the client should return all layout
+   segments matching the given iomode and overlapping with the recalled
+   range.  When returning the layouts for this byte range with
+   LAYOUTRETURN, the client MUST NOT return a sub-range of a layout
+   segment it has; each LAYOUTRETURN sent MUST completely cover at least
+   one outstanding layout segment.
+
+   The server, in turn, should release any segment that exactly matches
+   the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
+   exact match is found, then the server should release all layout
+   segments that match the clientid and iomode and are fully
+   contained in the returned byte range.  If none are found and the byte
+   range is a subset of an outstanding layout segment for the same
+   clientid and iomode, then the client can be considered malfunctioning
+   and the server SHOULD recall all layouts from this client to reset
+   its state.  If this behavior repeats, the server SHOULD deny all
+   LAYOUTGETs from this client.
+
+10.2.  LAYOUTCOMMIT
+
+   LAYOUTCOMMIT is only used by object-based pNFS to convey hints about
+   modified attributes and/or to report the presence of I/O errors to the
+   metadata server (MDS).  Therefore, the offset and length in
+   LAYOUTCOMMIT4args are reserved for future use and should be set to 0.
+
+11.
Recalling Layouts + + The object-based metadata server should recall outstanding layouts in + the following cases: + + o When the file's security policy changes, i.e., Access Control + Lists (ACLs) or permission mode bits are set. + + o When the file's aggregation map changes, rendering outstanding + layouts invalid. + + + + +Halevy, et al. Standards Track [Page 27] + +RFC 5664 pNFS Objects January 2010 + + + o When there are sharing conflicts. For example, the server will + issue stripe-aligned layout segments for RAID-5 objects. To + prevent corruption of the file's parity, multiple clients must not + hold valid write layouts for the same stripes. An outstanding + READ/WRITE (RW) layout should be recalled when a conflicting + LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW + and for a byte range overlapping with the outstanding layout + segment. + +11.1. CB_RECALL_ANY + + The metadata server can use the CB_RECALL_ANY callback operation to + notify the client to return some or all of its layouts. The NFSv4.1 + [6] defines the following types: + + const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; + const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; + + struct CB_RECALL_ANY4args { + uint32_t craa_objects_to_keep; + bitmap4 craa_type_mask; + }; + + Typically, CB_RECALL_ANY will be used to recall client state when the + server needs to reclaim resources. The craa_type_mask bitmap + specifies the type of resources that are recalled and the + craa_objects_to_keep value specifies how many of the recalled objects + the client is allowed to keep. The object-based layout type mask + flags are defined as follows. They represent the iomode of the + recalled layouts. In response, the client SHOULD return layouts of + the recalled iomode that it needs the least, keeping at most + craa_objects_to_keep object-based layouts. 
+ + /// enum pnfs_osd_cb_recall_any_mask { + /// PNFS_OSD_RCA4_TYPE_MASK_READ = 8, + /// PNFS_OSD_RCA4_TYPE_MASK_RW = 9 + /// }; + /// + + The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return + layouts of iomode LAYOUTIOMODE4_READ. Similarly, the + PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts + of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client + is notified to return layouts of either iomode. + + + + + + + +Halevy, et al. Standards Track [Page 28] + +RFC 5664 pNFS Objects January 2010 + + +12. Client Fencing + + In cases where clients are uncommunicative and their lease has + expired or when clients fail to return recalled layouts within a + lease period at the least (see "Recalling a Layout"[6]), the server + MAY revoke client layouts and/or device address mappings and reassign + these resources to other clients. To avoid data corruption, the + metadata server MUST fence off the revoked clients from the + respective objects as described in Section 13.4. + +13. Security Considerations + + The pNFS extension partitions the NFSv4 file system protocol into two + parts, the control path and the data path (storage protocol). The + control path contains all the new operations described by this + extension; all existing NFSv4 security mechanisms and features apply + to the control path. The combination of components in a pNFS system + is required to preserve the security properties of NFSv4 with respect + to an entity accessing data via a client, including security + countermeasures to defend against threats that NFSv4 provides + defenses for in environments where these threats are considered + significant. + + The metadata server enforces the file access-control policy at + LAYOUTGET time. 
The client should use suitable authorization + credentials for getting the layout for the requested iomode (READ or + RW) and the server verifies the permissions and ACL for these + credentials, possibly returning NFS4ERR_ACCESS if the client is not + allowed the requested iomode. If the LAYOUTGET operation succeeds + the client receives, as part of the layout, a set of object + capabilities allowing it I/O access to the specified objects + corresponding to the requested iomode. When the client acts on I/O + operations on behalf of its local users, it MUST authenticate and + authorize the user by issuing respective OPEN and ACCESS calls to the + metadata server, similar to having NFSv4 data delegations. If access + is allowed, the client uses the corresponding (READ or RW) + capabilities to perform the I/O operations at the object storage + devices. When the metadata server receives a request to change a + file's permissions or ACL, it SHOULD recall all layouts for that file + and it MUST change the capability version attribute on all objects + comprising the file to implicitly invalidate any outstanding + capabilities before committing to the new permissions and ACL. Doing + this will ensure that clients re-authorize their layouts according to + the modified permissions and ACL by requesting new layouts. + Recalling the layouts in this case is courtesy of the server intended + to prevent clients from getting an error on I/Os done after the + capability version changed. + + + + +Halevy, et al. Standards Track [Page 29] + +RFC 5664 pNFS Objects January 2010 + + + The object storage protocol MUST implement the security aspects + described in version 1 of the T10 OSD protocol definition [1]. The + standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and + ALLDATA. 
To provide a minimum level of security allowing verification
+   and enforcement of the server access control policy using the layout
+   security credentials, the NOSEC security method MUST NOT be used for
+   any I/O operation.  The remainder of this section gives an overview
+   of the security mechanism described in that standard.  The goal is to
+   give the reader a basic understanding of the object security model.
+   Any discrepancies between this text and the actual standard are
+   to be resolved in favor of the OSD standard.
+
+13.1.  OSD Security Data Types
+
+   There are three main data types associated with object security: a
+   capability, a credential, and security parameters.  The capability is
+   a set of fields that specifies an object and what operations can be
+   performed on it.  A credential is a signed capability.  Only a
+   security manager that knows the secret device keys can correctly sign
+   a capability to form a valid credential.  In pNFS, the file server
+   acts as the security manager and returns signed capabilities (i.e.,
+   credentials) to the pNFS client.  The security parameters are values
+   computed by the issuer of OSD commands (i.e., the client) that prove
+   it holds valid credentials.  The client uses the credential as a
+   signing key to sign the requests it makes to the OSD, and puts the
+   resulting signatures into the security_parameters field of the OSD
+   command.  The object storage device uses the secret keys it shares
+   with the security manager to validate the signature values in the
+   security parameters.
+
+   The security types are opaque to the generic layers of the pNFS
+   client.  The credential contents are defined as opaque within the
+   pnfs_osd_object_cred4 type.  Instead of repeating the definitions
+   here, the reader is referred to Section 4.9.2.2 of the OSD standard.
+
+13.2.  The OSD Security Protocol
+
+   The object storage protocol relies on a cryptographically secure
+   capability to control accesses at the object storage devices.
+ Capabilities are generated by the metadata server, returned to the + client, and used by the client as described below to authenticate + their requests to the object-based storage device. Capabilities + therefore achieve the required access and open mode checking. They + allow the file server to define and check a policy (e.g., open mode) + and the OSD to enforce that policy without knowing the details (e.g., + user IDs and ACLs). + + + + + +Halevy, et al. Standards Track [Page 30] + +RFC 5664 pNFS Objects January 2010 + + + Since capabilities are tied to layouts, and since they are used to + enforce access control, when the file ACL or mode changes the + outstanding capabilities MUST be revoked to enforce the new access + permissions. The server SHOULD recall layouts to allow clients to + gracefully return their capabilities before the access permissions + change. + + Each capability is specific to a particular object, an operation on + that object, a byte range within the object (in OSDv2), and has an + explicit expiration time. The capabilities are signed with a secret + key that is shared by the object storage devices and the metadata + managers. Clients do not have device keys so they are unable to + forge the signatures in the security parameters. The combination of + a capability, the OSD System ID, and a signature is called a + "credential" in the OSD specification. + + The details of the security and privacy model for object storage are + defined in the T10 OSD standard. The following sketch of the + algorithm should help the reader understand the basic model. + + LAYOUTGET returns a CapKey and a Cap, which, together with the OSD + System ID, are also called a credential. It is a capability and a + signature over that capability and the SystemID. The OSD Standard + refers to the CapKey as the "Credential integrity check value" and to + the ReqMAC as the "Request integrity check value". 
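The relationship between the secret device key, the CapKey, and the ReqMAC in this section can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the OSD wire format: HMAC-SHA-256 stands in for the MAC function, fields are plain byte strings joined with a length prefix, and the function names (mac, issue_credential, sign_request, osd_verify) are hypothetical. The actual algorithm and field encodings are defined by the T10 OSD standard.

```python
import hashlib
import hmac


def mac(key: bytes, *fields: bytes) -> bytes:
    """Keyed MAC over the given fields (HMAC-SHA-256 is an assumption)."""
    # Length-prefix each field so that field boundaries are unambiguous.
    msg = b"".join(len(f).to_bytes(4, "big") + f for f in fields)
    return hmac.new(key, msg, hashlib.sha256).digest()


def issue_credential(secret_key: bytes, cap: bytes, system_id: bytes):
    """Metadata server side: sign the capability with the secret device key."""
    cap_key = mac(secret_key, cap, system_id)
    return cap, system_id, cap_key


def sign_request(cap_key: bytes, req: bytes, nonce: bytes) -> bytes:
    """Client side: sign a request using the CapKey as the signing key."""
    return mac(cap_key, req, nonce)


def osd_verify(secret_key: bytes, system_id: bytes, cap: bytes,
               req: bytes, nonce: bytes, req_mac: bytes) -> bool:
    """OSD side: recompute the CapKey and ReqMAC locally and compare."""
    local_cap_key = mac(secret_key, cap, system_id)
    local_req_mac = mac(local_cap_key, req, nonce)
    return hmac.compare_digest(local_req_mac, req_mac)
```

In this model the client never sees the secret key. It receives only the CapKey for one specific Cap, so altering the Cap changes the CapKey that the OSD recomputes, and the request signature no longer verifies.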
+ + CapKey = MAC<SecretKey>(Cap, SystemID) + Credential = {Cap, SystemID, CapKey} + + The client uses CapKey to sign all the requests it issues for that + object using the respective Cap. In other words, the Cap appears in + the request to the storage device, and that request is signed with + the CapKey as follows: + + ReqMAC = MAC<CapKey>(Req, ReqNonce) + Request = {Cap, Req, ReqNonce, ReqMAC} + + The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The + OSD uses the SecretKey it shares with the metadata server to compare + the ReqMAC the client sent with a locally computed value: + + LocalCapKey = MAC<SecretKey>(Cap, SystemID) + LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce) + + and if they match, the OSD assumes that the capabilities came from an + authentic metadata server and allows access to the object, as allowed + by the Cap. + +13.3. Protocol Privacy Requirements + + Note that if the server's LAYOUTGET reply, holding CapKey and Cap, is + snooped by another client, it can be used to generate valid OSD + requests (within the Cap access restrictions). + + To meet the privacy requirements for the capability key + returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., + by using the RPCSEC_GSS privacy method to send the LAYOUTGET + operation or by using the SSV key to encrypt the oc_capability_key + using the GSS_Wrap() function. In the absence of GSS-API, two general ways to provide privacy + independently of NFSv4 are an + isolated network, such as a VLAN, and a secure channel provided by IPsec + [15]. + +13.4. Revoking Capabilities + + At any time, the metadata server may invalidate all outstanding + capabilities on an object by changing its POLICY ACCESS TAG + attribute. The value of the POLICY ACCESS TAG is part of a + capability, and it must match the state of the object attribute.
If + they do not match, the OSD rejects accesses to the object with the + sense key set to ILLEGAL REQUEST and an additional sense code set to + INVALID FIELD IN CDB. When a client attempts to use a capability and + is rejected this way, it should issue a LAYOUTCOMMIT for the object + and specify PNFS_OSD_BAD_CRED in the olr_ioerr_report parameter. The + client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or + LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed + set of capabilities. + + The metadata server may elect to change the POLICY ACCESS TAG on an + object at any time, for any reason (with the understanding that there + is likely an associated performance penalty, especially if there are + outstanding layouts for this object). The metadata server MUST + revoke outstanding capabilities when any one of the following occurs: + + o the permissions on the object change, + + o a conflicting mandatory byte-range lock is granted, or + + o a layout is revoked and reassigned to another client. + + A pNFS client will typically hold one layout for each byte range for + either READ or READ/WRITE. The client's credentials are checked by + the metadata server at LAYOUTGET time, and it is the client's + responsibility to enforce access control among multiple users + accessing the same file. It is neither required nor expected that + the pNFS client will obtain a separate layout for each user accessing + a shared object. The client SHOULD use OPEN and ACCESS calls to + check user permissions when performing I/O so that the server's + access control policies are correctly enforced. The result of the + ACCESS operation may be cached while the client holds a valid layout, + because the server is expected to recall layouts when the file's access + permissions or ACL change. + +14.
IANA Considerations + + As described in NFSv4.1 [6], new layout type numbers have been + assigned by IANA. This document defines the protocol associated with + the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it + requires no further actions for IANA. + +15. References + +15.1. Normative References + + [1] Weber, R., "Information Technology - SCSI Object-Based Storage + Device Commands (OSD)", ANSI INCITS 400-2004, December 2004. + + [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement + Levels", BCP 14, RFC 2119, March 1997. + + [3] Eisler, M., "XDR: External Data Representation Standard", + STD 67, RFC 4506, May 2006. + + [4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network + File System (NFS) Version 4 Minor Version 1 External Data + Representation Standard (XDR) Description", RFC 5662, + January 2010. + + [5] IETF Trust, "Legal Provisions Relating to IETF Documents", + November 2008, + <http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf>. + + [6] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network + File System (NFS) Version 4 Minor Version 1 Protocol", + RFC 5661, January 2010. + + [7] Linn, J., "Generic Security Service Application Program + Interface Version 2, Update 1", RFC 2743, January 2000. + + [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. + Zeidner, "Internet Small Computer Systems Interface (iSCSI)", + RFC 3720, April 2004. + + [9] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI + INCITS 408-2005, October 2005. + + [10] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network + Address Authority (NAA) Naming Format for iSCSI Node Names", + RFC 3980, February 2005. + + [11] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) + Registration Authority", + <http://standards.ieee.org/regauth/oui/tutorials/EUI64.html>.
+ + [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. + Souza, "Internet Storage Name Service (iSNS)", RFC 4171, + September 2005. + + [13] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI + INCITS 402-2005, February 2005. + +15.2. Informative References + + [14] Weber, R., "SCSI Object-Based Storage Device Commands -2 + (OSD-2)", January 2009, + <http://www.t10.org/cgi-bin/ac.pl?t=f&f=osd2r05a.pdf>. + + [15] Kent, S. and K. Seo, "Security Architecture for the Internet + Protocol", RFC 4301, December 2005. + + [16] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002, + December 2002. + + [17] T11 1619-D, "Fibre Channel Framing and Signaling - 2 + (FC-FS-2)", ANSI INCITS 424-2007, February 2007. + + [18] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI + INCITS 417-2006, June 2006. + + [19] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting + Codes, Part I", 1977. + +Appendix A. Acknowledgments + + Todd Pisek was a co-editor of the initial versions of this document. + Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian + E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and + commented on this document. + +Authors' Addresses + + Benny Halevy + Panasas, Inc. + 1501 Reedsdale St. Suite 400 + Pittsburgh, PA 15233 + USA + + Phone: +1-412-323-3500 + EMail: bhalevy@panasas.com + URI: http://www.panasas.com/ + + + Brent Welch + Panasas, Inc. + 6520 Kaiser Drive + Fremont, CA 95444 + USA + + Phone: +1-510-608-7770 + EMail: welch@panasas.com + URI: http://www.panasas.com/ + + + Jim Zelenka + Panasas, Inc. + 1501 Reedsdale St. Suite 400 + Pittsburgh, PA 15233 + USA + + Phone: +1-412-323-3500 + EMail: jimz@panasas.com + URI: http://www.panasas.com/