author     Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8435.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8435.txt')
-rw-r--r--  doc/rfc/rfc8435.txt  2355
1 file changed, 2355 insertions(+), 0 deletions(-)
| diff --git a/doc/rfc/rfc8435.txt b/doc/rfc/rfc8435.txt new file mode 100644 index 0000000..02b0b8d --- /dev/null +++ b/doc/rfc/rfc8435.txt @@ -0,0 +1,2355 @@ + + + + + + +Internet Engineering Task Force (IETF)                         B. Halevy +Request for Comments: 8435 +Category: Standards Track                                      T. Haynes +ISSN: 2070-1721                                              Hammerspace +                                                             August 2018 + + +                Parallel NFS (pNFS) Flexible File Layout + +Abstract + +   Parallel NFS (pNFS) allows a separation between the metadata (onto a +   metadata server) and data (onto a storage device) for a file.  The +   flexible file layout type is defined in this document as an extension +   to pNFS that allows the use of storage devices that require only a +   limited degree of interaction with the metadata server and use +   already-existing protocols.  Client-side mirroring is also added to +   provide replication of files. + +Status of This Memo + +   This is an Internet Standards Track document. + +   This document is a product of the Internet Engineering Task Force +   (IETF).  It represents the consensus of the IETF community.  It has +   received public review and has been approved for publication by the +   Internet Engineering Steering Group (IESG).  Further information on +   Internet Standards is available in Section 2 of RFC 7841. + +   Information about the current status of this document, any errata, +   and how to provide feedback on it may be obtained at +   https://www.rfc-editor.org/info/rfc8435. + +Copyright Notice + +   Copyright (c) 2018 IETF Trust and the persons identified as the +   document authors.  All rights reserved. + +   This document is subject to BCP 78 and the IETF Trust's Legal +   Provisions Relating to IETF Documents +   (https://trustee.ietf.org/license-info) in effect on the date of +   publication of this document.  Please review these documents +   carefully, as they describe your rights and restrictions with respect +   to this document.  Code Components extracted from this document must +   include Simplified BSD License text as described in Section 4.e of +   the Trust Legal Provisions and are provided without warranty as +   described in the Simplified BSD License. + + + + +Halevy & Haynes              Standards Track                    [Page 1] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +Table of Contents + +   1. Introduction ....................................................3 +      1.1. Definitions ................................................4 +      1.2. Requirements Language ......................................6 +   2. Coupling of Storage Devices .....................................6 +      2.1. LAYOUTCOMMIT ...............................................7 +      2.2. Fencing Clients from the Storage Device ....................7 +           2.2.1. Implementation Notes for Synthetic uids/gids ........8 +           2.2.2. Example of Using Synthetic uids/gids ................9 +      2.3. State and Locking Models ..................................10 +           2.3.1. Loosely Coupled Locking Model ......................11 +           2.3.2. Tightly Coupled Locking Model ......................12 +   3. XDR Description of the Flexible File Layout Type ...............13 +      3.1. Code Components Licensing Notice ..........................14 +   4. 
Device Addressing and Discovery ................................16 +      4.1. ff_device_addr4 ...........................................16 +      4.2. Storage Device Multipathing ...............................17 +   5. Flexible File Layout Type ......................................18 +      5.1. ff_layout4 ................................................19 +           5.1.1. Error Codes from LAYOUTGET .........................23 +           5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS ...23 +      5.2. LAYOUTCOMMIT ..............................................24 +      5.3. Interactions between Devices and Layouts ..................24 +      5.4. Handling Version Errors ...................................24 +   6. Striping via Sparse Mapping ....................................25 +   7. Recovering from Client I/O Errors ..............................25 +   8. Mirroring ......................................................26 +      8.1. Selecting a Mirror ........................................26 +      8.2. Writing to Mirrors ........................................27 +           8.2.1. Single Storage Device Updates Mirrors ..............27 +           8.2.2. Client Updates All Mirrors .........................27 +           8.2.3. Handling Write Errors ..............................28 +           8.2.4. Handling Write COMMITs .............................28 +      8.3. Metadata Server Resilvering of the File ...................29 +   9. Flexible File Layout Type Return ...............................29 +      9.1. I/O Error Reporting .......................................30 +           9.1.1. ff_ioerr4 ..........................................30 +      9.2. Layout Usage Statistics ...................................31 +           9.2.1. ff_io_latency4 .....................................31 +           9.2.2. ff_layoutupdate4 ...................................32 +           9.2.3. ff_iostats4 ........................................33 +      9.3. ff_layoutreturn4 ..........................................34 +   10. Flexible File Layout Type LAYOUTERROR .........................35 +   11. Flexible File Layout Type LAYOUTSTATS .........................35 +   12. Flexible File Layout Type Creation Hint .......................35 +      12.1. ff_layouthint4 ...........................................35 +   13. Recalling a Layout ............................................36 + + + +Halevy & Haynes              Standards Track                    [Page 2] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +      13.1. CB_RECALL_ANY ............................................36 +   14. Client Fencing ................................................37 +   15. Security Considerations .......................................37 +      15.1. RPCSEC_GSS and Security Services .........................39 +           15.1.1. Loosely Coupled ...................................39 +           15.1.2. Tightly Coupled ...................................39 +   16. IANA Considerations ...........................................39 +   17. References ....................................................40 +      17.1. Normative References .....................................40 +      17.2. Informative References ...................................41 +   Acknowledgments ...................................................42 +   Authors' Addresses ................................................42 + +1.  
Introduction + +   In Parallel NFS (pNFS), the metadata server returns layout type +   structures that describe where file data is located.  There are +   different layout types for different storage systems and methods of +   arranging data on storage devices.  This document defines the +   flexible file layout type used with file-based data servers that are +   accessed using the NFS protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], +   NFSv4.1 [RFC5661], and NFSv4.2 [RFC7862]. + +   To provide a global state model equivalent to that of the files +   layout type, a back-end control protocol might be implemented between +   the metadata server and NFSv4.1+ storage devices.  An implementation +   can either define its own proprietary mechanism or it could define a +   control protocol in a Standards Track document.  The requirements for +   a control protocol are specified in [RFC5661] and clarified in +   [RFC8434]. + +   The control protocol described in this document is based on NFS.  It +   does not provide for knowledge of stateids to be passed between the +   metadata server and the storage devices.  Instead, the storage +   devices are configured such that the metadata server has full access +   rights to the data file system and then the metadata server uses +   synthetic ids to control client access to individual files. + +   In traditional mirroring of data, the server is responsible for +   replicating, validating, and repairing copies of the data file.  With +   client-side mirroring, the metadata server provides a layout that +   presents the available mirrors to the client.  The client then picks +   a mirror to read from and ensures that all writes go to all mirrors. +   The client only considers the write transaction to have succeeded if +   all mirrors are successfully updated.  In case of error, the client +   can use the LAYOUTERROR operation to inform the metadata server, +   which is then responsible for the repairing of the mirrored copies of +   the file. + + + +Halevy & Haynes              Standards Track                    [Page 3] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +1.1.  Definitions + +   control communication requirements:  the specification for +      information on layouts, stateids, file metadata, and file data +      that must be communicated between the metadata server and the +      storage devices.  There is a separate set of requirements for each +      layout type. + +   control protocol:  the particular mechanism that an implementation of +      a layout type would use to meet the control communication +      requirement for that layout type.  This need not be a protocol as +      normally understood.  In some cases, the same protocol may be used +      as a control protocol and storage protocol. + +   client-side mirroring:  a feature in which the client, not the +      server, is responsible for updating all of the mirrored copies of +      a layout segment. + +   (file) data:  that part of the file system object that contains the +      data to be read or written.  It is the contents of the object +      rather than the attributes of the object. + +   data server (DS):  a pNFS server that provides the file's data when +      the file system object is accessed over a file-based protocol. + +   fencing:  the process by which the metadata server prevents the +      storage devices from processing I/O from a specific client to a +      specific file. 
+ +   file layout type:  a layout type in which the storage devices are +      accessed via the NFS protocol (see Section 13 of [RFC5661]). + +   gid:  the group id, a numeric value that identifies to which group a +      file belongs. + +   layout:  the information a client uses to access file data on a +      storage device.  This information includes specification of the +      protocol (layout type) and the identity of the storage devices to +      be used. + +   layout iomode:  a grant of either read-only or read/write I/O to the +      client. + +   layout segment:  a sub-division of a layout.  That sub-division might +      be by the layout iomode (see Sections 3.3.20 and 12.2.9 of +      [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or +      requested byte range. + + + + +Halevy & Haynes              Standards Track                    [Page 4] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   layout stateid:  a 128-bit quantity returned by a server that +      uniquely defines the layout state provided by the server for a +      specific layout that describes a layout type and file (see +      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 of +      [RFC5661] describes differences in handling between layout +      stateids and other stateid types. + +   layout type:  a specification of both the storage protocol used to +      access the data and the aggregation scheme used to lay out the +      file data on the underlying storage devices. + +   loose coupling:  when the control protocol is a storage protocol. + +   (file) metadata:  the part of the file system object that contains +      various descriptive data relevant to the file object, as opposed +      to the file data itself.  This could include the time of last +      modification, access time, EOF position, etc. + +   metadata server (MDS):  the pNFS server that provides metadata +      information for a file system object.  It is also responsible for +      generating, recalling, and revoking layouts for file system +      objects, for performing directory operations, and for performing +      I/O operations to regular files when the clients direct these to +      the metadata server itself. + +   mirror:  a copy of a layout segment.  Note that if one copy of the +      mirror is updated, then all copies must be updated. + +   recalling a layout:  a graceful recall, via a callback, of a specific +      layout by the metadata server to the client.  Graceful here means +      that the client would have the opportunity to flush any WRITEs, +      etc., before returning the layout to the metadata server. + +   revoking a layout:  an invalidation of a specific layout by the +      metadata server.  Once revocation occurs, the metadata server will +      not accept as valid any reference to the revoked layout, and a +      storage device will not accept any client access based on the +      layout. + +   resilvering:  the act of rebuilding a mirrored copy of a layout +      segment from a known good copy of the layout segment.  Note that +      this can also be done to create a new mirrored copy of the layout +      segment. + +   rsize:  the data transfer buffer size used for READs. 
+ + + + + + +Halevy & Haynes              Standards Track                    [Page 5] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   stateid:  a 128-bit quantity returned by a server that uniquely +      defines the set of locking-related state provided by the server. +      Stateids may designate state related to open files, byte-range +      locks, delegations, or layouts. + +   storage device:  the target to which clients may direct I/O requests +      when they hold an appropriate layout.  See Section 2.1 of +      [RFC8434] for further discussion of the difference between a data +      server and a storage device. + +   storage protocol:  the protocol used by clients to do I/O operations +      to the storage device.  Each layout type specifies the set of +      storage protocols. + +   tight coupling:  an arrangement in which the control protocol is one +      designed specifically for control communication.  It may be either +      a proprietary protocol adapted specifically to a particular +      metadata server or a protocol based on a Standards Track document. + +   uid:  the user id, a numeric value that identifies which user owns a +      file. + +   wsize:  the data transfer buffer size used for WRITEs. + +1.2.  Requirements Language + +   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and +   "OPTIONAL" in this document are to be interpreted as described in +   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all +   capitals, as shown here. + +2.  Coupling of Storage Devices + +   A server implementation may choose either a loosely coupled model or +   a tightly coupled model between the metadata server and the storage +   devices.  [RFC8434] describes the general problems facing pNFS +   implementations.  This document details how the new flexible file +   layout type addresses these issues.  To implement the tightly coupled +   model, a control protocol has to be defined.  As the flexible file +   layout imposes no special requirements on the client, the control +   protocol will need to provide: + +   (1)  management of both security and LAYOUTCOMMITs and + +   (2)  a global stateid model and management of these stateids. + + + + + +Halevy & Haynes              Standards Track                    [Page 6] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   When implementing the loosely coupled model, the only control +   protocol will be a version of NFS, with no ability to provide a +   global stateid model or to prevent clients from using layouts +   inappropriately.  To enable client use in that environment, this +   document will specify how security, state, and locking are to be +   managed. + +2.1.  LAYOUTCOMMIT + +   Regardless of the coupling model, the metadata server has the +   responsibility, upon receiving a LAYOUTCOMMIT (see Section 18.42 of +   [RFC5661]) to ensure that the semantics of pNFS are respected (see +   Section 3.1 of [RFC8434]).  These do include a requirement that data +   written to a data storage device be stable before the occurrence of +   the LAYOUTCOMMIT. + +   It is the responsibility of the client to make sure the data file is +   stable before the metadata server begins to query the storage devices +   about the changes to the file.  
If any WRITE to a storage device did +   not result with stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the +   metadata server MUST be preceded by a COMMIT to the storage devices +   written to.  Note that if the client has not done a COMMIT to the +   storage device, then the LAYOUTCOMMIT might not be synchronized to +   the last WRITE operation to the storage device. + +2.2.  Fencing Clients from the Storage Device + +   With loosely coupled storage devices, the metadata server uses +   synthetic uids (user ids) and gids (group ids) for the data file, +   where the uid owner of the data file is allowed read/write access and +   the gid owner is allowed read-only access.  As part of the layout +   (see ffds_user and ffds_group in Section 5.1), the client is provided +   with the user and group to be used in the Remote Procedure Call (RPC) +   [RFC5531] credentials needed to access the data file.  Fencing off of +   clients is achieved by the metadata server changing the synthetic uid +   and/or gid owners of the data file on the storage device to +   implicitly revoke the outstanding RPC credentials.  A client +   presenting the wrong credential for the desired access will get an +   NFS4ERR_ACCESS error. + +   With this loosely coupled model, the metadata server is not able to +   fence off a single client; it is forced to fence off all clients. +   However, as the other clients react to the fencing, returning their +   layouts and trying to get new ones, the metadata server can hand out +   a new uid and gid to allow access. + + + + + + +Halevy & Haynes              Standards Track                    [Page 7] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   It is RECOMMENDED to implement common access control methods at the +   storage device file system to allow only the metadata server root +   (super user) access to the storage device and to set the owner of all +   directories holding data files to the root user.  This approach +   provides a practical model to enforce access control and fence off +   cooperative clients, but it cannot protect against malicious clients; +   hence, it provides a level of security equivalent to AUTH_SYS.  It is +   RECOMMENDED that the communication between the metadata server and +   storage device be secure from eavesdroppers and man-in-the-middle +   protocol tampering.  The security measure could be physical security +   (e.g., the servers are co-located in a physically secure area), +   encrypted communications, or some other technique. + +   With tightly coupled storage devices, the metadata server sets the +   user and group owners, mode bits, and Access Control List (ACL) of +   the data file to be the same as the metadata file.  And the client +   must authenticate with the storage device and go through the same +   authorization process it would go through via the metadata server. +   In the case of tight coupling, fencing is the responsibility of the +   control protocol and is not described in detail in this document. +   However, implementations of the tightly coupled locking model (see +   Section 2.3) will need a way to prevent access by certain clients to +   specific files by invalidating the corresponding stateids on the +   storage device.  In such a scenario, the client will be given an +   error of NFS4ERR_BAD_STATEID. + +   The client need not know the model used between the metadata server +   and the storage device.  
It need only react consistently to any +   errors in interacting with the storage device.  It should both return +   the layout and error to the metadata server and ask for a new layout. +   At that point, the metadata server can either hand out a new layout, +   hand out no layout (forcing the I/O through it), or deny the client +   further access to the file. + +2.2.1.  Implementation Notes for Synthetic uids/gids + +   The selection method for the synthetic uids and gids to be used for +   fencing in loosely coupled storage devices is strictly an +   implementation issue.  That is, an administrator might restrict a +   range of such ids available to the Lightweight Directory Access +   Protocol (LDAP) 'uid' field [RFC4519].  The administrator might also +   be able to choose an id that would never be used to grant access. +   Then, when the metadata server had a request to access a file, a +   SETATTR would be sent to the storage device to set the owner and +   group of the data file.  The user and group might be selected in a +   round-robin fashion from the range of available ids. + + + + + +Halevy & Haynes              Standards Track                    [Page 8] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   Those ids would be sent back as ffds_user and ffds_group to the +   client, who would present them as the RPC credentials to the storage +   device.  When the client is done accessing the file and the metadata +   server knows that no other client is accessing the file, it can reset +   the owner and group to restrict access to the data file. + +   When the metadata server wants to fence off a client, it changes the +   synthetic uid and/or gid to the restricted ids.  Note that using a +   restricted id ensures that there is a change of owner and at least +   one id available that never gets allowed access. + +   Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD +   be avoided.  These typically either grant super access to files on a +   storage device or are mapped to an anonymous id.  In the first case, +   even if the data file is fenced, the client might still be able to +   access the file.  In the second case, multiple ids might be mapped to +   the anonymous ids. + +2.2.2.  Example of Using Synthetic uids/gids + +   The user loghyr creates a file "ompha.c" on the metadata server, +   which then creates a corresponding data file on the storage device. + +   The metadata server entry may look like: + +   -rw-r--r--    1 loghyr  staff    1697 Dec  4 11:31 ompha.c + +   On the storage device, the file may be assigned some unpredictable +   synthetic uid/gid to deny access: + +   -rw-r-----    1 19452   28418    1697 Dec  4 11:31 data_ompha.c + +   When the file is opened on a client and accessed, the user will try +   to get a layout for the data file.  Since the layout knows nothing +   about the user (and does not care), it does not matter whether the +   user loghyr or garbo opens the file.  The client has to present an +   uid of 19452 to get write permission.  If it presents any other value +   for the uid, then it must give a gid of 28418 to get read access. + +   Further, if the metadata server decides to fence the file, it should +   change the uid and/or gid such that these values neither match +   earlier values for that file nor match a predictable change based on +   an earlier fencing. 
+ +   -rw-r-----    1 19453   28419    1697 Dec  4 11:31 data_ompha.c + + + + + + +Halevy & Haynes              Standards Track                    [Page 9] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   The set of synthetic gids on the storage device should be selected +   such that there is no mapping in any of the name services used by the +   storage device, i.e., each group should have no members. + +   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the +   metadata server should return a synthetic uid that is not set on the +   storage device.  Only the synthetic gid would be valid. + +   The client is thus solely responsible for enforcing file permissions +   in a loosely coupled model.  To allow loghyr write access, it will +   send an RPC to the storage device with a credential of 1066:1067.  To +   allow garbo read access, it will send an RPC to the storage device +   with a credential of 1067:1067.  The value of the uid does not matter +   as long as it is not the synthetic uid granted when getting the +   layout. + +   While pushing the enforcement of permission checking onto the client +   may seem to weaken security, the client may already be responsible +   for enforcing permissions before modifications are sent to a server. +   With cached writes, the client is always responsible for tracking who +   is modifying a file and making sure to not coalesce requests from +   multiple users into one request. + +2.3.  State and Locking Models + +   An implementation can always be deployed as a loosely coupled model. +   There is, however, no way for a storage device to indicate over an +   NFS protocol that it can definitively participate in a tightly +   coupled model: + +   o  Storage devices implementing the NFSv3 and NFSv4.0 protocols are +      always treated as loosely coupled. + +   o  NFSv4.1+ storage devices that do not return the +      EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating +      that they are to be treated as loosely coupled.  From the locking +      viewpoint, they are treated in the same way as NFSv4.0 storage +      devices. + +   o  NFSv4.1+ storage devices that do identify themselves with the +      EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID can potentially +      be tightly coupled.  They would use a back-end control protocol to +      implement the global stateid model as described in [RFC5661]. + +   A storage device would have to be either discovered or advertised +   over the control protocol to enable a tightly coupled model. + + + + + +Halevy & Haynes              Standards Track                   [Page 10] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +2.3.1.  Loosely Coupled Locking Model + +   When locking-related operations are requested, they are primarily +   dealt with by the metadata server, which generates the appropriate +   stateids.  When an NFSv4 version is used as the data access protocol, +   the metadata server may make stateid-related requests of the storage +   devices.  However, it is not required to do so, and the resulting +   stateids are known only to the metadata server and the storage +   device. + +   Given this basic structure, locking-related operations are handled as +   follows: + +   o  OPENs are dealt with by the metadata server.  Stateids are +      selected by the metadata server and associated with the client ID +      describing the client's connection to the metadata server.  
The +      metadata server may need to interact with the storage device to +      locate the file to be opened, but no locking-related functionality +      need be used on the storage device. + +      OPEN_DOWNGRADE and CLOSE only require local execution on the +      metadata server. + +   o  Advisory byte-range locks can be implemented locally on the +      metadata server.  As in the case of OPENs, the stateids associated +      with byte-range locks are assigned by the metadata server and only +      used on the metadata server. + +   o  Delegations are assigned by the metadata server that initiates +      recalls when conflicting OPENs are processed.  No storage device +      involvement is required. + +   o  TEST_STATEID and FREE_STATEID are processed locally on the +      metadata server, without storage device involvement. + +   All I/O operations to the storage device are done using the anonymous +   stateid.  Thus, the storage device has no information about the +   openowner and lockowner responsible for issuing a particular I/O +   operation.  As a result: + +   o  Mandatory byte-range locking cannot be supported because the +      storage device has no way of distinguishing I/O done on behalf of +      the lock owner from those done by others. + +   o  Enforcement of share reservations is the responsibility of the +      client.  Even though I/O is done using the anonymous stateid, the +      client must ensure that it has a valid stateid associated with the +      openowner. + + + +Halevy & Haynes              Standards Track                   [Page 11] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   In the event that a stateid is revoked, the metadata server is +   responsible for preventing client access, since it has no way of +   being sure that the client is aware that the stateid in question has +   been revoked. + +   As the client never receives a stateid generated by a storage device, +   there is no client lease on the storage device and no prospect of +   lease expiration, even when access is via NFSv4 protocols.  Clients +   will have leases on the metadata server.  In dealing with lease +   expiration, the metadata server may need to use fencing to prevent +   revoked stateids from being relied upon by a client unaware of the +   fact that they have been revoked. + +2.3.2.  Tightly Coupled Locking Model + +   When locking-related operations are requested, they are primarily +   dealt with by the metadata server, which generates the appropriate +   stateids.  These stateids must be made known to the storage device +   using control protocol facilities, the details of which are not +   discussed in this document. + +   Given this basic structure, locking-related operations are handled as +   follows: + +   o  OPENs are dealt with primarily on the metadata server.  Stateids +      are selected by the metadata server and associated with the client +      ID describing the client's connection to the metadata server.  The +      metadata server needs to interact with the storage device to +      locate the file to be opened and to make the storage device aware +      of the association between the metadata-server-chosen stateid and +      the client and openowner that it represents. + +      OPEN_DOWNGRADE and CLOSE are executed initially on the metadata +      server, but the state change made must be propagated to the +      storage device. + +   o  Advisory byte-range locks can be implemented locally on the +      metadata server. 
 As in the case of OPENs, the stateids associated +      with byte-range locks are assigned by the metadata server and are +      available for use on the metadata server.  Because I/O operations +      are allowed to present lock stateids, the metadata server needs +      the ability to make the storage device aware of the association +      between the metadata-server-chosen stateid and the corresponding +      open stateid it is associated with. + +   o  Mandatory byte-range locks can be supported when both the metadata +      server and the storage devices have the appropriate support.  As +      in the case of advisory byte-range locks, these are assigned by + + + +Halevy & Haynes              Standards Track                   [Page 12] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +      the metadata server and are available for use on the metadata +      server.  To enable mandatory lock enforcement on the storage +      device, the metadata server needs the ability to make the storage +      device aware of the association between the metadata-server-chosen +      stateid and the client, openowner, and lock (i.e., lockowner, +      byte-range, and lock-type) that it represents.  Because I/O +      operations are allowed to present lock stateids, this information +      needs to be propagated to all storage devices to which I/O might +      be directed rather than only to storage device that contain the +      locked region. + +   o  Delegations are assigned by the metadata server that initiates +      recalls when conflicting OPENs are processed.  Because I/O +      operations are allowed to present delegation stateids, the +      metadata server requires the ability (1) to make the storage +      device aware of the association between the metadata-server-chosen +      stateid and the filehandle and delegation type it represents and +      (2) to break such an association. + +   o  TEST_STATEID is processed locally on the metadata server, without +      storage device involvement. + +   o  FREE_STATEID is processed on the metadata server, but the metadata +      server requires the ability to propagate the request to the +      corresponding storage devices. + +   Because the client will possess and use stateids valid on the storage +   device, there will be a client lease on the storage device, and the +   possibility of lease expiration does exist.  The best approach for +   the storage device is to retain these locks as a courtesy.  However, +   if it does not do so, control protocol facilities need to provide the +   means to synchronize lock state between the metadata server and +   storage device. + +   Clients will also have leases on the metadata server that are subject +   to expiration.  In dealing with lease expiration, the metadata server +   would be expected to use control protocol facilities enabling it to +   invalidate revoked stateids on the storage device.  In the event the +   client is not responsive, the metadata server may need to use fencing +   to prevent revoked stateids from being acted upon by the storage +   device. + +3.  XDR Description of the Flexible File Layout Type + +   This document contains the External Data Representation (XDR) +   [RFC4506] description of the flexible file layout type.  The XDR +   description is embedded in this document in a way that makes it +   simple for the reader to extract into a ready-to-compile form.  
The + + + +Halevy & Haynes              Standards Track                   [Page 13] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   reader can feed this document into the following shell script to +   produce the machine-readable XDR description of the flexible file +   layout type: + +   <CODE BEGINS> + +   #!/bin/sh +   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' + +   <CODE ENDS> + +   That is, if the above script is stored in a file called "extract.sh" +   and this document is in a file called "spec.txt", then the reader can +   do: + +   sh extract.sh < spec.txt > flex_files_prot.x + +   The effect of the script is to remove leading white space from each +   line, plus a sentinel sequence of "///". + +   The embedded XDR file header follows.  Subsequent XDR descriptions +   with the sentinel sequence are embedded throughout the document. + +   Note that the XDR code contained in this document depends on types +   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both nfs +   types that end with a 4, such as offset4, length4, etc., as well as +   more generic types such as uint32_t and uint64_t. + +3.1.  Code Components Licensing Notice + +   Both the XDR description and the scripts used for extracting the XDR +   description are Code Components as described in Section 4 of "Trust +   Legal Provisions (TLP)" [LEGAL].  These Code Components are licensed +   according to the terms of that document. + +   <CODE BEGINS> + +   /// /* +   ///  * Copyright (c) 2018 IETF Trust and the persons identified +   ///  * as authors of the code.  All rights reserved. +   ///  * +   ///  * Redistribution and use in source and binary forms, with +   ///  * or without modification, are permitted provided that the +   ///  * following conditions are met: +   ///  * +   ///  * - Redistributions of source code must retain the above +   ///  *   copyright notice, this list of conditions and the +   ///  *   following disclaimer. + + + +Halevy & Haynes              Standards Track                   [Page 14] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   ///  * +   ///  * - Redistributions in binary form must reproduce the above +   ///  *   copyright notice, this list of conditions and the +   ///  *   following disclaimer in the documentation and/or other +   ///  *   materials provided with the distribution. +   ///  * +   ///  * - Neither the name of Internet Society, IETF or IETF +   ///  *   Trust, nor the names of specific contributors, may be +   ///  *   used to endorse or promote products derived from this +   ///  *   software without specific prior written permission. +   ///  * +   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS +   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED +   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  
IN NO +   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING +   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF +   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +   ///  * +   ///  * This code was derived from RFC 8435. +   ///  * Please reproduce this note if possible. +   ///  */ +   /// +   /// /* +   ///  * flex_files_prot.x +   ///  */ +   /// +   /// /* +   ///  * The following include statements are for example only. +   ///  * The actual XDR definition files are generated separately +   ///  * and independently and are likely to have a different name. +   ///  * %#include <nfsv42.x> +   ///  * %#include <rpc_prot.x> +   ///  */ +   /// + +   <CODE ENDS> + + + + + + +Halevy & Haynes              Standards Track                   [Page 15] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +4.  Device Addressing and Discovery + +   Data operations to a storage device require the client to know the +   network address of the storage device.  The NFSv4.1+ GETDEVICEINFO +   operation (Section 18.40 of [RFC5661]) is used by the client to +   retrieve that information. + +4.1.  ff_device_addr4 + +   The ff_device_addr4 data structure is returned by the server as the +   layout-type-specific opaque field da_addr_body in the device_addr4 +   structure by a successful GETDEVICEINFO operation. + +   <CODE BEGINS> + +   /// struct ff_device_versions4 { +   ///         uint32_t        ffdv_version; +   ///         uint32_t        ffdv_minorversion; +   ///         uint32_t        ffdv_rsize; +   ///         uint32_t        ffdv_wsize; +   ///         bool            ffdv_tightly_coupled; +   /// }; +   /// + +   /// struct ff_device_addr4 { +   ///         multipath_list4     ffda_netaddrs; +   ///         ff_device_versions4 ffda_versions<>; +   /// }; +   /// + +   <CODE ENDS> + +   The ffda_netaddrs field is used to locate the storage device.  It +   MUST be set by the server to a list holding one or more of the device +   network addresses. + +   The ffda_versions array allows the metadata server to present choices +   as to NFS version, minor version, and coupling strength to the +   client.  The ffdv_version and ffdv_minorversion represent the NFS +   protocol to be used to access the storage device.  This layout +   specification defines the semantics for ffdv_versions 3 and 4.  If +   ffdv_version equals 3, then the server MUST set ffdv_minorversion to +   0 and ffdv_tightly_coupled to false.  The client MUST then access the +   storage device using the NFSv3 protocol [RFC1813].  If ffdv_version +   equals 4, then the server MUST set ffdv_minorversion to one of the +   NFSv4 minor version numbers, and the client MUST access the storage +   device using NFSv4 with the specified minor version. 
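
   The following Python sketch shows how a client might walk the
   ffda_versions array and select a combination it can speak, honoring
   the MUSTs above.  The FfDeviceVersions type and the SUPPORTED set
   are assumptions made for illustration only; a real client decodes
   the XDR ff_device_versions4 structure directly.

      from dataclasses import dataclass

      @dataclass
      class FfDeviceVersions:       # mirrors ff_device_versions4
          version: int              # ffdv_version
          minorversion: int         # ffdv_minorversion
          rsize: int                # ffdv_rsize
          wsize: int                # ffdv_wsize
          tightly_coupled: bool     # ffdv_tightly_coupled

      # (version, minorversion) pairs this hypothetical client can use
      SUPPORTED = {(3, 0), (4, 0), (4, 1), (4, 2)}

      def pick_version(versions):
          for v in versions:
              if v.version == 3 and (v.minorversion != 0
                                     or v.tightly_coupled):
                  continue          # server violated the NFSv3 MUSTs
              if (v.version, v.minorversion) in SUPPORTED:
                  return v
          # No usable combination: identify the device via
          # LAYOUTRETURN/LAYOUTERROR (see Section 5.4) or fall back
          # to doing I/O through the metadata server.
          raise ValueError("no mutually supported NFS version")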
+ + + + +Halevy & Haynes              Standards Track                   [Page 16] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   Note that while the client might determine that it cannot use any of +   the configured combinations of ffdv_version, ffdv_minorversion, and +   ffdv_tightly_coupled, when it gets the device list from the metadata +   server, there is no way to indicate to the metadata server as to +   which device it is version incompatible.  However, if the client +   waits until it retrieves the layout from the metadata server, it can +   at that time clearly identify the storage device in question (see +   Section 5.4). + +   The ffdv_rsize and ffdv_wsize are used to communicate the maximum +   rsize and wsize supported by the storage device.  As the storage +   device can have a different rsize or wsize than the metadata server, +   the ffdv_rsize and ffdv_wsize allow the metadata server to +   communicate that information on behalf of the storage device. + +   ffdv_tightly_coupled informs the client as to whether or not the +   metadata server is tightly coupled with the storage devices.  Note +   that even if the data protocol is at least NFSv4.1, it may still be +   the case that there is loose coupling in effect.  If +   ffdv_tightly_coupled is not set, then the client MUST commit writes +   to the storage devices for the file before sending a LAYOUTCOMMIT to +   the metadata server.  That is, the writes MUST be committed by the +   client to stable storage via issuing WRITEs with stable_how == +   FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != +   FILE_SYNC (see Section 3.3.7 of [RFC1813]). + +4.2.  Storage Device Multipathing + +   The flexible file layout type supports multipathing to multiple +   storage device addresses.  Storage-device-level multipathing is used +   for bandwidth scaling via trunking and for higher availability of use +   in the event of a storage device failure.  Multipathing allows the +   client to switch to another storage device address that may be that +   of another storage device that is exporting the same data stripe +   unit, without having to contact the metadata server for a new layout. + +   To support storage device multipathing, ffda_netaddrs contains an +   array of one or more storage device network addresses.  This array +   (data type multipath_list4) represents a list of storage devices +   (each identified by a network address), with the possibility that +   some storage device will appear in the list multiple times. + +   The client is free to use any of the network addresses as a +   destination to send storage device requests.  If some network +   addresses are less desirable paths to the data than others, then the +   metadata server SHOULD NOT include those network addresses in +   ffda_netaddrs.  If less desirable network addresses exist to provide +   failover, the RECOMMENDED method to offer the addresses is to provide + + + +Halevy & Haynes              Standards Track                   [Page 17] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   them in a replacement device-ID-to-device-address mapping or a +   replacement device ID.  When a client finds no response from the +   storage device using all addresses available in ffda_netaddrs, it +   SHOULD send a GETDEVICEINFO to attempt to replace the existing +   device-ID-to-device-address mappings.  
If the metadata server detects +   that all network paths represented by ffda_netaddrs are unavailable, +   the metadata server SHOULD send a CB_NOTIFY_DEVICEID (if the client +   has indicated it wants device ID notifications for changed device +   IDs) to change the device-ID-to-device-address mappings to the +   available addresses.  If the device ID itself will be replaced, the +   metadata server SHOULD recall all layouts with the device ID and thus +   force the client to get new layouts and device ID mappings via +   LAYOUTGET and GETDEVICEINFO. + +   Generally, if two network addresses appear in ffda_netaddrs, they +   will designate the same storage device.  When the storage device is +   accessed over NFSv4.1 or a higher minor version, the two storage +   device addresses will support the implementation of client ID or +   session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. +   The two storage device addresses will share the same server owner or +   major ID of the server owner.  It is not always necessary for the two +   storage device addresses to designate the same storage device with +   trunking being used.  For example, the data could be read-only, and +   the data consist of exact replicas. + +5.  Flexible File Layout Type + +   The original layouttype4 introduced in [RFC5662] is modified to be: + +   <CODE BEGINS> + +       enum layouttype4 { +           LAYOUT4_NFSV4_1_FILES   = 1, +           LAYOUT4_OSD2_OBJECTS    = 2, +           LAYOUT4_BLOCK_VOLUME    = 3, +           LAYOUT4_FLEX_FILES      = 4 +       }; + +       struct layout_content4 { +           layouttype4             loc_type; +           opaque                  loc_body<>; +       }; + + + + + + + + + +Halevy & Haynes              Standards Track                   [Page 18] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +       struct layout4 { +           offset4                 lo_offset; +           length4                 lo_length; +           layoutiomode4           lo_iomode; +           layout_content4         lo_content; +       }; + +   <CODE ENDS> + +   This document defines structures associated with the layouttype4 +   value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body structure +   as an XDR type "opaque".  The opaque layout is uninterpreted by the +   generic pNFS client layers but is interpreted by the flexible file +   layout type implementation.  This section defines the structure of +   this otherwise opaque value, ff_layout4. + +5.1.  
ff_layout4 + +   <CODE BEGINS> + +   /// const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001; +   /// const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002; +   /// const FF_FLAGS_NO_READ_IO        = 0x00000004; +   /// const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008; + +   /// typedef uint32_t            ff_flags4; +   /// + +   /// struct ff_data_server4 { +   ///     deviceid4               ffds_deviceid; +   ///     uint32_t                ffds_efficiency; +   ///     stateid4                ffds_stateid; +   ///     nfs_fh4                 ffds_fh_vers<>; +   ///     fattr4_owner            ffds_user; +   ///     fattr4_owner_group      ffds_group; +   /// }; +   /// + +   /// struct ff_mirror4 { +   ///     ff_data_server4         ffm_data_servers<>; +   /// }; +   /// + +   /// struct ff_layout4 { +   ///     length4                 ffl_stripe_unit; +   ///     ff_mirror4              ffl_mirrors<>; +   ///     ff_flags4               ffl_flags; +   ///     uint32_t                ffl_stats_collect_hint; + + + +Halevy & Haynes              Standards Track                   [Page 19] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   /// }; +   /// + +   <CODE ENDS> + +   The ff_layout4 structure specifies a layout in that portion of the +   data file described in the current layout segment.  It is either a +   single instance or a set of mirrored copies of that portion of the +   data file.  When mirroring is in effect, it protects against loss of +   data in layout segments. + +   While not explicitly shown in the above XDR, each layout4 element +   returned in the logr_layout array of LAYOUTGET4res (see +   Section 18.43.2 of [RFC5661]) describes a layout segment.  Hence, +   each ff_layout4 also describes a layout segment.  It is possible that +   the file is concatenated from more than one layout segment.  Each +   layout segment MAY represent different striping parameters. + +   The ffl_stripe_unit field is the stripe unit size in use for the +   current layout segment.  The number of stripes is given inside each +   mirror by the number of elements in ffm_data_servers.  If the number +   of stripes is one, then the value for ffl_stripe_unit MUST default to +   zero.  The only supported mapping scheme is sparse and is detailed in +   Section 6.  Note that there is an assumption here that both the +   stripe unit size and the number of stripes are the same across all +   mirrors. + +   The ffl_mirrors field is the array of mirrored storage devices that +   provide the storage for the current stripe; see Figure 1. + +   The ffl_stats_collect_hint field provides a hint to the client on how +   often the server wants it to report LAYOUTSTATS for a file.  The time +   is in seconds. 
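
   As a quick illustration of the invariants just described, the
   following Python sketch (the names are assumptions that mirror the
   XDR above, not a prescribed API) checks a decoded ff_layout4 for a
   consistent stripe count across mirrors and for the required zero
   stripe unit when only one stripe is present:

      def check_layout(ffl_stripe_unit, ffl_mirrors):
          # ffl_mirrors: list of mirrors, each given here as a list
          # standing in for that mirror's ffm_data_servers array
          widths = {len(m) for m in ffl_mirrors}
          if len(widths) != 1:
              raise ValueError("stripe count differs across mirrors")
          (w,) = widths
          if w == 1 and ffl_stripe_unit != 0:
              raise ValueError("one stripe: ffl_stripe_unit MUST be 0")
          return w                  # number of stripes (W in Section 6)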
+ + + + + + + + + + + + + + + + + + +Halevy & Haynes              Standards Track                   [Page 20] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +                      +-----------+ +                      |           | +                      |           | +                      |   File    | +                      |           | +                      |           | +                      +-----+-----+ +                            | +               +------------+------------+ +               |                         | +          +----+-----+             +-----+----+ +          | Mirror 1 |             | Mirror 2 | +          +----+-----+             +-----+----+ +               |                         | +          +-----------+            +-----------+ +          |+-----------+           |+-----------+ +          ||+-----------+          ||+-----------+ +          +||  Storage  |          +||  Storage  | +           +|  Devices  |           +|  Devices  | +            +-----------+            +-----------+ + +                           Figure 1 + +   The ffs_mirrors field represents an array of state information for +   each mirrored copy of the current layout segment.  Each element is +   described by a ff_mirror4 type. + +   ffds_deviceid provides the deviceid of the storage device holding the +   data file. + +   ffds_fh_vers is an array of filehandles of the data file matching the +   available NFS versions on the given storage device.  There MUST be +   exactly as many elements in ffds_fh_vers as there are in +   ffda_versions.  Each element of the array corresponds to a particular +   combination of ffdv_version, ffdv_minorversion, and +   ffdv_tightly_coupled provided for the device.  The array allows for +   server implementations that have different filehandles for different +   combinations of version, minor version, and coupling strength.  See +   Section 5.4 for how to handle versioning issues between the client +   and storage devices. + +   For tight coupling, ffds_stateid provides the stateid to be used by +   the client to access the file.  For loose coupling and an NFSv4 +   storage device, the client will have to use an anonymous stateid to +   perform I/O on the storage device.  With no control protocol, the +   metadata server stateid cannot be used to provide a global stateid +   model.  Thus, the server MUST set the ffds_stateid to be the +   anonymous stateid. + + + +Halevy & Haynes              Standards Track                   [Page 21] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   This specification of the ffds_stateid restricts both models for +   NFSv4.x storage protocols: + +   loosely coupled model:  the stateid has to be an anonymous stateid + +   tightly coupled model:  the stateid has to be a global stateid + +   A number of issues stem from a mismatch between the fact that +   ffds_stateid is defined as a single item while ffds_fh_vers is +   defined as an array.  It is possible for each open file on the +   storage device to require its own open stateid.  Because there are +   established loosely coupled implementations of the version of the +   protocol described in this document, such potential issues have not +   been addressed here.  It is possible for future layout types to be +   defined that address these issues, should it become important to +   provide multiple stateids for the same underlying file. 
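
   The stateid rule above reduces to a small decision on the client.
   The sketch below is illustrative only (the function name and the
   stateid encoding are assumptions); the anonymous stateid is the
   all-zeros special stateid of NFSv4:

      ANONYMOUS_STATEID = (0, b"\x00" * 12)   # (seqid, other)

      def stateid_for_io(ffds_stateid, ffdv_tightly_coupled):
          if ffdv_tightly_coupled:
              # tight coupling: present the global stateid from the
              # layout to the storage device
              return ffds_stateid
          # loose coupling: I/O to an NFSv4 storage device uses the
          # anonymous stateid regardless of what the layout carried
          return ANONYMOUS_STATEID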
+
+   For loosely coupled storage devices, ffds_user and ffds_group provide
+   the synthetic user and group to be used in the RPC credentials that
+   the client presents to the storage device to access the data files.
+   For tightly coupled storage devices, the user and group on the
+   storage device will be the same as on the metadata server; that is,
+   if ffdv_tightly_coupled (see Section 4.1) is set, then the client
+   MUST ignore both ffds_user and ffds_group.
+
+   The allowed values for both ffds_user and ffds_group are specified as
+   owner and owner_group, respectively, in Section 5.9 of [RFC5661].
+   For NFSv3 compatibility, user and group strings that consist of
+   decimal numeric values with no leading zeros can be given a special
+   interpretation by clients and servers that choose to provide such
+   support.  The receiver may treat such a user or group string as
+   representing the same user as would be represented by an NFSv3 uid or
+   gid having the corresponding numeric value.  Note that if Kerberos is
+   used for security, the expectation is that these values will be a
+   name@domain string.
+
+   ffds_efficiency describes the metadata server's evaluation of the
+   effectiveness of each mirror.  Note that this is per layout and not
+   per device, as the metric may change due to perceived load,
+   availability to the metadata server, etc.  Higher values denote
+   higher perceived utility.  The way the client can select the best
+   mirror to access is discussed in Section 8.1.
+
+   ffl_flags is a bitmap that allows the metadata server to inform the
+   client of particular conditions that may result from more or less
+   tight coupling of the storage devices.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 22]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   FF_FLAGS_NO_LAYOUTCOMMIT:  can be set to indicate that the client is
+      not required to send LAYOUTCOMMIT to the metadata server.
+
+   FF_FLAGS_NO_IO_THRU_MDS:  can be set to indicate that the client
+      should not send I/O operations to the metadata server.  That is,
+      even if the client could determine that there was a network
+      disconnect to a storage device, the client should not try to proxy
+      the I/O through the metadata server.
+
+   FF_FLAGS_NO_READ_IO:  can be set to indicate that the client should
+      not send READ requests with the layouts of iomode
+      LAYOUTIOMODE4_RW.  Instead, it should request a layout of iomode
+      LAYOUTIOMODE4_READ from the metadata server.
+
+   FF_FLAGS_WRITE_ONE_MIRROR:  can be set to indicate that the client
+      only needs to update one of the mirrors (see Section 8.2).
+
+5.1.1.  Error Codes from LAYOUTGET
+
+   [RFC5661] provides little guidance as to how the client is to proceed
+   with a LAYOUTGET that returns an error of NFS4ERR_LAYOUTTRYLATER,
+   NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.  Within the context of
+   this document:
+
+   NFS4ERR_LAYOUTUNAVAILABLE:  there is no layout available and the I/O
+      is to go to the metadata server.  Note that it is possible to have
+      had a layout before a recall and not after.
+
+   NFS4ERR_LAYOUTTRYLATER:  there is some issue preventing the layout
+      from being granted.  If the client already has an appropriate
+      layout, it should continue with I/O to the storage devices.
+
+   NFS4ERR_DELAY:  there is some issue preventing the layout from being
+      granted.  If the client already has an appropriate layout, it
+      should not continue with I/O to the storage devices.
+
+5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS
+
+   Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS
+   flag, the client can still perform I/O to the metadata server.  The
+   flag functions as a hint.  The flag indicates to the client that the
+   metadata server prefers to separate the metadata I/O from the data
+   I/O, most likely for performance reasons.
+
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 23]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+5.2.  LAYOUTCOMMIT
+
+   The flexible file layout does not use lou_body inside the
+   loca_layoutupdate argument to LAYOUTCOMMIT.  If lou_type is
+   LAYOUT4_FLEX_FILES, the lou_body field MUST have a zero length (see
+   Section 18.42.1 of [RFC5661]).
+
+5.3.  Interactions between Devices and Layouts
+
+   In [RFC5661], the file layout type is defined such that the
+   relationship between multipathing and filehandles can result in
+   either 0, 1, or N filehandles (see Section 13.3).  Some rationales
+   for this are clustered servers that share the same filehandle or
+   allow for multiple read-only copies of the file on the same storage
+   device.  In the flexible file layout type, while there is an array of
+   filehandles, they are independent of the multipathing being used.  If
+   the metadata server wants to provide multiple read-only copies of the
+   same file on the same storage device, then it should provide multiple
+   mirrored instances, each with a different ff_device_addr4.  The
+   client can then determine that, since each of the ffds_fh_vers values
+   is different, there are multiple copies of the file available for the
+   current layout segment.
+
+5.4.  Handling Version Errors
+
+   When the metadata server provides the ffda_versions array in the
+   ff_device_addr4 (see Section 4.1), the client is able to determine
+   whether or not it can access a storage device with any of the
+   supplied combinations of ffdv_version, ffdv_minorversion, and
+   ffdv_tightly_coupled.  However, due to the limitations of reporting
+   errors in GETDEVICEINFO (see Section 18.40 in [RFC5661]), the client
+   is not able to specify which specific device it cannot communicate
+   with over one of the provided ffdv_version and ffdv_minorversion
+   combinations.  Using ff_ioerr4 (see Section 9.1.1) inside either the
+   LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see
+   Section 15.6 of [RFC7862] and Section 10 of this document), the
+   client can isolate the problematic storage device.
+
+   The error code to return for LAYOUTRETURN and/or LAYOUTERROR is
+   NFS4ERR_MINOR_VERS_MISMATCH.  It does not matter whether the mismatch
+   is a major version (e.g., client can use NFSv3 but not NFSv4) or a
+   minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the
+   error indicates that for all the supplied combinations of
+   ffdv_version and ffdv_minorversion, the client cannot communicate
+   with the storage device.  The client can retry the GETDEVICEINFO to
+   see if the metadata server can provide a different combination, or it
+   can fall back to doing the I/O through the metadata server.
+
+
+
+Halevy & Haynes              Standards Track                   [Page 24]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+6.  Striping via Sparse Mapping
+
+   While other layout types support both dense and sparse mapping of
+   logical offsets to physical offsets within a file (see, for example,
+   Section 13.4 of [RFC5661]), the flexible file layout type only
+   supports a sparse mapping.
+
+   With sparse mappings, the logical offset within a file (L) is also
+   the physical offset on the storage device.  As detailed in
+   Section 13.4.4 of [RFC5661], this results in holes on each storage
+   device at the offsets whose stripe units are held by the other
+   storage devices.
+
+   L: logical offset within the file
+
+   W: stripe width
+       W = number of elements in ffm_data_servers
+
+   S: number of bytes in a stripe
+       S = W * ffl_stripe_unit
+
+   N: stripe number
+       N = L / S
+
+   A short, non-normative sketch that applies these definitions in the
+   presence of mirroring appears in Section 8.
+
+7.  Recovering from Client I/O Errors
+
+   The pNFS client may encounter errors when directly accessing the
+   storage devices.  However, it is the responsibility of the metadata
+   server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
+   layout type is used, the client MUST report the I/O errors to the
+   server at LAYOUTRETURN time using the ff_ioerr4 structure (see
+   Section 9.1.1).
+
+   The metadata server analyzes the error and determines the required
+   recovery operations, such as recovering from media failures or
+   reconstructing missing data files.
+
+   The metadata server MUST recall any outstanding layouts to allow it
+   exclusive write access to the stripes being recovered and to prevent
+   other clients from hitting the same error condition.  In these cases,
+   the server MUST complete recovery before handing out any new layouts
+   to the affected byte ranges.
+
+   Although the client implementation has the option to propagate a
+   corresponding error to the application that initiated the I/O
+   operation and drop any unwritten data, the client should attempt to
+   retry the original I/O operation by either requesting a new layout or
+   sending the I/O via regular NFSv4.1+ READ or WRITE operations to the
+   metadata server.  The client SHOULD attempt to retrieve a new layout
+
+
+
+Halevy & Haynes              Standards Track                   [Page 25]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   and retry the I/O operation using the storage device first and only
+   retry the I/O operation via the metadata server if the error
+   persists.
+
+8.  Mirroring
+
+   The flexible file layout type has a simple model in place for the
+   mirroring of the file data constrained by a layout segment.  There is
+   no assumption that each copy of the mirror is stored identically on
+   the storage devices.  For example, one device might employ
+   compression or deduplication on the data.  However, the over-the-wire
+   transfer of the file contents MUST appear identical.  Note that this
+   is a constraint of the selected XDR representation, in which each
+   mirrored copy of the layout segment has the same striping pattern
+   (see Figure 1).
+
+   The metadata server is responsible for determining the number of
+   mirrored copies and the location of each mirror.  While the client
+   may provide a hint as to how many copies it wants (see Section 12),
+   the metadata server can ignore that hint; in any event, the client
+   has no means to dictate either the storage device (which also means
+   the coupling and/or protocol levels to access the layout segments) or
+   the location of said storage device.
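+
+   To make the sparse mapping of Section 6 concrete under this
+   mirroring model, the following non-normative Python sketch computes,
+   for a logical offset L, the stripe number N and the per-device
+   offset a client would use.  The parameter names track the XDR
+   fields, but the function itself is an assumption made for
+   illustration only; because every mirrored copy shares the same
+   striping pattern, the same result applies to each mirror.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch of the sparse mapping (Section 6).
+
+   def map_offset(L, ffl_stripe_unit, num_data_servers):
+       """Return (N, data server index, physical offset).
+
+       num_data_servers is len(ffm_data_servers), i.e., the
+       stripe width W.
+       """
+       W = num_data_servers
+       S = W * ffl_stripe_unit          # bytes in a full stripe
+       N = L // S                       # stripe number
+       ds = (L // ffl_stripe_unit) % W  # data server in the stripe
+       # Sparse mapping: the physical offset on the storage
+       # device equals the logical offset within the file.
+       return N, ds, L
+
+   # Example: 1 MB stripe units across 3 data servers; offset 5 MB
+   # lands in stripe 1 on the third data server (index 2).
+   print(map_offset(5 * 2**20, 2**20, 3))  # -> (1, 2, 5242880)
+
+   <CODE ENDS>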
+
+   The updating of mirrored layout segments is done via client-side
+   mirroring.  With this approach, the client is responsible for making
+   sure modifications are made on all copies of the layout segments it
+   is informed of via the layout.  If a layout segment is being
+   resilvered to a storage device, that mirrored copy will not be in the
+   layout.  Thus, the metadata server MUST update that copy until the
+   copy is presented to the client in a layout.  If the
+   FF_FLAGS_WRITE_ONE_MIRROR flag is set in ffl_flags, the client need
+   only update one of the mirrors (see Section 8.2).  If the client is
+   writing to the layout segments via the metadata server, then the
+   metadata server MUST update all copies of the mirror.  As seen in
+   Section 8.3, during the resilvering, the layout is recalled, and the
+   client has to make modifications via the metadata server.
+
+8.1.  Selecting a Mirror
+
+   When the metadata server grants a layout to a client, it MAY let the
+   client know, via the ffds_efficiency member, how fast it expects each
+   mirror to be once the request arrives at the storage devices.  While
+   the algorithms to calculate that value are left to the metadata
+   server implementations, factors that could contribute to that
+   calculation include speed of the storage device, physical memory
+   available to the device, operating system version, current load, etc.
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 26]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   However, what should not be involved in that calculation is a
+   perceived network distance between the client and the storage device.
+   The client is better situated for making that determination based on
+   past interaction with the storage device over the different available
+   network interfaces between the two; that is, the metadata server
+   might not know about a transient outage between the client and
+   storage device because it has no presence on the given subnet.
+
+   As such, it is the client that decides which mirror to access for
+   reading the file.  The requirements for writing to mirrored layout
+   segments are presented below.
+
+8.2.  Writing to Mirrors
+
+8.2.1.  Single Storage Device Updates Mirrors
+
+   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client
+   only needs to update one of the copies of the layout segment.  For
+   this case, the storage device MUST ensure that all copies of the
+   mirror are updated when any one of the mirrors is updated.  If the
+   storage device gets an error when updating one of the mirrors, then
+   it MUST inform the client that the original WRITE had an error.  The
+   client then MUST inform the metadata server (see Section 8.2.3).  The
+   client's responsibility with respect to COMMIT is explained in
+   Section 8.2.4.  The client may choose any one of the mirrors and may
+   use ffds_efficiency as described in Section 8.1 when making this
+   choice.
+
+8.2.2.  Client Updates All Mirrors
+
+   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the
+   client is responsible for updating all mirrored copies of the layout
+   segments that it is given in the layout.  A single failed update is
+   sufficient to fail the entire operation.  If all but one copy is
+   updated successfully and the last one provides an error, then the
+   client needs to inform the metadata server about the error.  The
+   client can use either LAYOUTRETURN or LAYOUTERROR to inform the
+   metadata server that the update to that storage device failed.  If
+   the client is updating the mirrors serially, then it SHOULD stop at
+   the first error encountered and report that to the metadata server.
+   If the client is updating the mirrors in parallel, then it SHOULD
+   wait until all storage devices respond so that it can report all
+   errors encountered during the update.
+
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 27]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+8.2.3.  Handling Write Errors
+
+   When the client reports a write error to the metadata server, the
+   metadata server is responsible for determining if it wants to remove
+   the errant mirror from the layout, if the mirror has recovered from
+   some transient error, etc.  When the client tries to get a new
+   layout, the metadata server informs it of the decision by the
+   contents of the layout.  The client MUST NOT assume that the contents
+   of the previous layout will match those of the new one.  If it has
+   updates that were not committed to all mirrors, then it MUST resend
+   those updates to all mirrors.
+
+   There is no provision in the protocol for the metadata server to
+   directly determine that the client has or has not recovered from an
+   error.  For example, if a storage device was network partitioned from
+   the client and the client reported the error to the metadata server,
+   the network partition might later be repaired, and the client could
+   then successfully update all of the copies.  There is no mechanism
+   for the client to report that fact, and the metadata server is forced
+   to repair the file across the mirror.
+
+   If the client supports NFSv4.2, it can use LAYOUTERROR and
+   LAYOUTRETURN to provide hints to the metadata server about the
+   recovery efforts.  A LAYOUTERROR on a file is for a non-fatal error.
+   A subsequent LAYOUTRETURN without an ff_ioerr4 indicates that the
+   client successfully replayed the I/O to all mirrors.  Any
+   LAYOUTRETURN with an ff_ioerr4 is an error that the metadata server
+   needs to repair.  The client MUST be prepared for the LAYOUTERROR to
+   trigger a CB_LAYOUTRECALL if the metadata server determines it needs
+   to start repairing the file.
+
+8.2.4.  Handling Write COMMITs
+
+   When stable writes are done to the metadata server or to a single
+   replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is
+   the responsibility of the receiving node to propagate the written
+   data stably before replying to the client.
+
+   In the corresponding cases in which unstable writes are done, the
+   receiving node does not have any such obligation, although it may
+   choose to asynchronously propagate the updates.  However, once a
+   COMMIT is replied to, all replicas must reflect the writes that have
+   been done, and this data must have been committed to stable storage
+   on all replicas.
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 28]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   In order to avoid situations in which stale data is read from
+   replicas to which writes have not been propagated:
+
+   o  A client that has outstanding unstable writes made to a single
+      node (metadata server or storage device) MUST do all reads from
+      that same node.
+
+   o  When writes are flushed to the server (for example, to implement
+      close-to-open semantics), a COMMIT must be done by the client to
+      ensure that up-to-date written data will be available irrespective
+      of the particular replica read.
+
+8.3.  Metadata Server Resilvering of the File
+
+   The metadata server may elect to create a new mirror of the layout
+   segments at any time.  This might be to resilver a copy on a storage
+   device that was down for servicing, to provide a copy of the layout
+   segments on storage with different performance characteristics, etc.
+   As the client will not be aware of the new mirror and the metadata
+   server will not be aware of updates that the client is making to the
+   layout segments, the metadata server MUST recall the writable layout
+   segment(s) that it is resilvering.  If the client issues a LAYOUTGET
+   for a writable layout segment that is in the process of being
+   resilvered, then the metadata server can deny that request with
+   NFS4ERR_LAYOUTUNAVAILABLE.  The client would then have to perform the
+   I/O through the metadata server.
+
+9.  Flexible File Layout Type Return
+
+   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
+   layout-type-specific information to the server.  It is defined in
+   Section 18.44.1 of [RFC5661] as follows:
+
+   <CODE BEGINS>
+
+   /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */
+   const LAYOUT4_RET_REC_FILE      = 1;
+   const LAYOUT4_RET_REC_FSID      = 2;
+   const LAYOUT4_RET_REC_ALL       = 3;
+
+   enum layoutreturn_type4 {
+           LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE,
+           LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID,
+           LAYOUTRETURN4_ALL  = LAYOUT4_RET_REC_ALL
+   };
+
+   struct layoutreturn_file4 {
+           offset4         lrf_offset;
+
+
+
+Halevy & Haynes              Standards Track                   [Page 29]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+           length4         lrf_length;
+           stateid4        lrf_stateid;
+           /* layouttype4 specific data */
+           opaque          lrf_body<>;
+   };
+
+   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
+           case LAYOUTRETURN4_FILE:
+                   layoutreturn_file4      lr_layout;
+           default:
+                   void;
+   };
+
+   struct LAYOUTRETURN4args {
+           /* CURRENT_FH: file */
+           bool                    lora_reclaim;
+           layouttype4             lora_layout_type;
+           layoutiomode4           lora_iomode;
+           layoutreturn4           lora_layoutreturn;
+   };
+
+   <CODE ENDS>
+
+   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the
+   lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value
+   is defined by ff_layoutreturn4 (see Section 9.3).  This allows the
+   client to report I/O error information or layout usage statistics
+   back to the metadata server as defined below.  Note that while the
+   data structures are built on concepts introduced in NFSv4.2, the
+   effective discriminated union (lora_layout_type combined with
+   ff_layoutreturn4) allows for an NFSv4.1 metadata server to utilize
+   the data.
+
+9.1.  I/O Error Reporting
+
+9.1.1.  ff_ioerr4
+
+   <CODE BEGINS>
+
+   /// struct ff_ioerr4 {
+   ///         offset4        ffie_offset;
+   ///         length4        ffie_length;
+   ///         stateid4       ffie_stateid;
+   ///         device_error4  ffie_errors<>;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+Halevy & Haynes              Standards Track                   [Page 30]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Recall that [RFC7862] defines device_error4 as:
+
+   <CODE BEGINS>
+
+   struct device_error4 {
+           deviceid4       de_deviceid;
+           nfsstat4        de_status;
+           nfs_opnum4      de_opnum;
+   };
+
+   <CODE ENDS>
+
+   The ff_ioerr4 structure is used to return error indications for data
+   files that generated errors during data transfers.  These are hints
+   to the metadata server that there are problems with that file.  For
+   each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length
+   represent the storage device and byte range within the file in which
+   the error occurred; ffie_errors represents the operation and type of
+   error.  The use of device_error4 is described in Section 15.6 of
+   [RFC7862].
+
+   Even though the storage device might be accessed via NFSv3 and
+   report back NFSv3 errors to the client, the client is responsible
+   for mapping these to appropriate NFSv4 status codes as de_status.
+   Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4
+   operations.
+
+9.2.  Layout Usage Statistics
+
+9.2.1.  ff_io_latency4
+
+   <CODE BEGINS>
+
+   /// struct ff_io_latency4 {
+   ///         uint64_t       ffil_ops_requested;
+   ///         uint64_t       ffil_bytes_requested;
+   ///         uint64_t       ffil_ops_completed;
+   ///         uint64_t       ffil_bytes_completed;
+   ///         uint64_t       ffil_bytes_not_delivered;
+   ///         nfstime4       ffil_total_busy_time;
+   ///         nfstime4       ffil_aggregate_completion_time;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 31]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Both operation counts and bytes transferred are kept in the
+   ff_io_latency4.  As seen in ff_layoutupdate4 (see Section 9.2.2),
+   READ and WRITE operations are aggregated separately.  READ operations
+   are used for the ff_io_latency4 ffl_read.  Both WRITE and COMMIT
+   operations are used for the ff_io_latency4 ffl_write.  "Requested"
+   counters track what the client is attempting to do, and "completed"
+   counters track what was done.  There is no requirement that the
+   client only report completed results that have matching requested
+   results from the reported period.
+
+   ffil_bytes_not_delivered is used to track the aggregate number of
+   bytes requested but not fulfilled due to error conditions.
+   ffil_total_busy_time is the aggregate time spent with outstanding RPC
+   calls.  ffil_aggregate_completion_time is the sum of all round-trip
+   times for completed RPC calls.
+
+   In Section 3.3.1 of [RFC5661], nfstime4 is defined as the number of
+   seconds and nanoseconds since midnight or zero hour January 1, 1970
+   Coordinated Universal Time (UTC).  The use of nfstime4 in
+   ff_io_latency4 is to store time since the start of the first I/O from
+   the client after receiving the layout.  In other words, these are to
+   be decoded as a duration and not as a date and time.
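+
+   As an illustration of decoding these values as durations, the
+   following non-normative Python sketch derives the mean round-trip
+   time per completed RPC from an ff_io_latency4 sample.  The
+   dictionary encoding of the XDR fields and the (seconds, nseconds)
+   tuple form of nfstime4 are assumptions made for illustration only.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch: interpreting ff_io_latency4 counters.
+
+   def to_seconds(nfstime4):
+       """Decode an nfstime4 carried as (seconds, nseconds)."""
+       seconds, nseconds = nfstime4
+       return seconds + nseconds / 1e9
+
+   def average_rtt(latency):
+       """Mean round-trip time per completed RPC, in seconds."""
+       ops = latency["ffil_ops_completed"]
+       if ops == 0:
+           return 0.0
+       total = to_seconds(latency["ffil_aggregate_completion_time"])
+       return total / ops
+
+   sample = {
+       "ffil_ops_completed": 2000,
+       "ffil_aggregate_completion_time": (6, 500000000),  # 6.5 s
+   }
+   print(average_rtt(sample))  # -> 0.00325 seconds per RPC
+
+   <CODE ENDS>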
+
+   Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
+   operation is sent.  If two LAYOUTSTATS operations for the same file
+   and layout stateid originate from the same NFS client and are
+   processed at the same time by the metadata server, then the one
+   containing the larger values contains the most recent time series
+   data.
+
+9.2.2.  ff_layoutupdate4
+
+   <CODE BEGINS>
+
+   /// struct ff_layoutupdate4 {
+   ///         netaddr4       ffl_addr;
+   ///         nfs_fh4        ffl_fhandle;
+   ///         ff_io_latency4 ffl_read;
+   ///         ff_io_latency4 ffl_write;
+   ///         nfstime4       ffl_duration;
+   ///         bool           ffl_local;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 32]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   ffl_addr identifies the network address on the storage device to
+   which the client is connected.  In the case of multipathing,
+   ffl_fhandle indicates which read-only copy was selected.  ffl_read
+   and ffl_write convey the latencies for READ and WRITE operations,
+   respectively.  ffl_duration is used to indicate the time period over
+   which the statistics were collected.  If true, ffl_local indicates
+   that the I/O was serviced by the client's cache.  This flag allows
+   the client to inform the metadata server about "hot" access to a file
+   it would not normally be allowed to report on.
+
+9.2.3.  ff_iostats4
+
+   <CODE BEGINS>
+
+   /// struct ff_iostats4 {
+   ///         offset4           ffis_offset;
+   ///         length4           ffis_length;
+   ///         stateid4          ffis_stateid;
+   ///         io_info4          ffis_read;
+   ///         io_info4          ffis_write;
+   ///         deviceid4         ffis_deviceid;
+   ///         ff_layoutupdate4  ffis_layoutupdate;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   [RFC7862] defines io_info4 as:
+
+   <CODE BEGINS>
+
+   struct io_info4 {
+           uint64_t        ii_count;
+           uint64_t        ii_bytes;
+   };
+
+   <CODE ENDS>
+
+   With pNFS, data transfers are performed directly between the pNFS
+   client and the storage devices.  Therefore, the metadata server has
+   no direct knowledge of the I/O operations being done and thus cannot
+   gather on its own the statistical information about client I/O needed
+   to optimize the data storage location.  ff_iostats4 MAY be used by
+   the client to report I/O statistics back to the metadata server upon
+   returning the layout.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 33]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Since it is not feasible for the client to report every I/O that used
+   the layout, the client MAY identify "hot" byte ranges for which to
+   report I/O statistics.  The definition and/or configuration mechanism
+   of what is considered "hot" and the size of the reported byte range
+   are out of the scope of this document.  For client implementations,
+   it is suggested to provide reasonable default values and an optional
+   run-time management interface to control these parameters.  For
+   example, a client can define the default byte-range resolution to be
+   1 MB in size and the thresholds for reporting to be 1 MB/second or 10
+   I/O operations per second.
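+
+   One plausible client-side policy following that example is sketched
+   below in non-normative Python: I/O is accumulated into fixed-size
+   buckets, and only buckets exceeding a threshold are reported.  The
+   helper names and data structures are assumptions made for
+   illustration, not part of the protocol.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch: identifying "hot" 1 MB byte ranges to
+   # report in ff_iostats4, using the example thresholds of
+   # 1 MB/second or 10 I/O operations per second.
+
+   from collections import defaultdict
+
+   BUCKET = 1024 * 1024             # byte-range resolution: 1 MB
+   HOT_BYTES_PER_SEC = 1024 * 1024
+   HOT_OPS_PER_SEC = 10
+
+   buckets = defaultdict(lambda: [0, 0])   # index -> [ops, bytes]
+
+   def record_io(offset, length):
+       """Accumulate one READ or WRITE into its 1 MB bucket."""
+       b = buckets[offset // BUCKET]
+       b[0] += 1
+       b[1] += length
+
+   def hot_ranges(elapsed):
+       """Yield (ffis_offset, ffis_length) pairs worth reporting."""
+       for index, (ops, nbytes) in buckets.items():
+           if (nbytes / elapsed >= HOT_BYTES_PER_SEC or
+                   ops / elapsed >= HOT_OPS_PER_SEC):
+               yield index * BUCKET, BUCKET
+
+   <CODE ENDS>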
+
+   For each byte range, ffis_offset and ffis_length represent the
+   starting offset of the range and the range length in bytes.
+   ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and
+   ffis_write.ii_bytes represent the number of contiguous READ and WRITE
+   I/Os and the respective aggregate number of bytes transferred within
+   the reported byte range.
+
+   The combination of ffis_deviceid and ffl_addr uniquely identifies
+   both the storage path and the network route to it.  Finally,
+   ffl_fhandle allows the metadata server to differentiate between
+   multiple read-only copies of the file on the same storage device.
+
+9.3.  ff_layoutreturn4
+
+   <CODE BEGINS>
+
+   /// struct ff_layoutreturn4 {
+   ///         ff_ioerr4     fflr_ioerr_report<>;
+   ///         ff_iostats4   fflr_iostats_report<>;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   When data file I/O operations fail, fflr_ioerr_report<> is used to
+   report these errors to the metadata server as an array of elements of
+   type ff_ioerr4.  Each element in the array represents an error that
+   occurred on the data file identified by ffie_errors.de_deviceid.  If
+   no errors are to be reported, the size of the fflr_ioerr_report<>
+   array is set to zero.  The client MAY also use fflr_iostats_report<>
+   to report a list of I/O statistics as an array of elements of type
+   ff_iostats4.  Each element in the array represents statistics for a
+   particular byte range.  Byte ranges are not guaranteed to be disjoint
+   and MAY repeat or intersect.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 34]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+10.  Flexible File Layout Type LAYOUTERROR
+
+   If the client is using NFSv4.2 to communicate with the metadata
+   server, then instead of waiting for a LAYOUTRETURN to send error
+   information to the metadata server (see Section 9.1), it MAY use
+   LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that
+   information.  For the flexible file layout type, this means that
+   LAYOUTERROR4args is treated the same as ff_ioerr4.
+
+11.  Flexible File Layout Type LAYOUTSTATS
+
+   If the client is using NFSv4.2 to communicate with the metadata
+   server, then instead of waiting for a LAYOUTRETURN to send I/O
+   statistics to the metadata server (see Section 9.2), it MAY use
+   LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that
+   information.  For the flexible file layout type, this means that
+   LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
+   contents as in ffis_layoutupdate.
+
+12.  Flexible File Layout Type Creation Hint
+
+   The layouthint4 type is defined in [RFC5661] as follows:
+
+   <CODE BEGINS>
+
+   struct layouthint4 {
+       layouttype4        loh_type;
+       opaque             loh_body<>;
+   };
+
+   <CODE ENDS>
+
+   The layouthint4 structure is used by the client to pass a hint about
+   the type of layout it would like created for a particular file.  If
+   the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
+   opaque value is defined by the ff_layouthint4 type.
+
+12.1.  ff_layouthint4
+
+   <CODE BEGINS>
+
+   /// union ff_mirrors_hint switch (bool ffmc_valid) {
+   ///     case TRUE:
+   ///         uint32_t    ffmc_mirrors;
+   ///     case FALSE:
+   ///         void;
+   /// };
+   ///
+
+
+
+Halevy & Haynes              Standards Track                   [Page 35]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   /// struct ff_layouthint4 {
+   ///     ff_mirrors_hint    fflh_mirrors_hint;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   This type conveys hints for the desired data map.  All parameters are
+   optional, so the client can give values for only the parameters it
+   cares about.
+
+13.  Recalling a Layout
+
+   While Section 12.5.5 of [RFC5661] discusses reasons independent of
+   layout type for recalling a layout, the flexible file layout type
+   metadata server should recall outstanding layouts in the following
+   cases:
+
+   o  When the file's security policy changes, i.e., ACLs or permission
+      mode bits are set.
+
+   o  When the file's layout changes, rendering outstanding layouts
+      invalid.
+
+   o  When existing layouts are inconsistent with the need to enforce
+      locking constraints.
+
+   o  When existing layouts are inconsistent with the requirements
+      regarding resilvering as described in Section 8.3.
+
+13.1.  CB_RECALL_ANY
+
+   The metadata server can use the CB_RECALL_ANY callback operation to
+   notify the client to return some or all of its layouts.  Section 22.3
+   of [RFC5661] defines the allowed types of the "NFSv4 Recallable
+   Object Types Registry".
+
+   <CODE BEGINS>
+
+   /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = 16;
+   /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = 17;
+   ///
+
+   struct  CB_RECALL_ANY4args      {
+       uint32_t        craa_layouts_to_keep;
+       bitmap4         craa_type_mask;
+   };
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 36]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   <CODE ENDS>
+
+   Typically, CB_RECALL_ANY will be used to recall client state when the
+   server needs to reclaim resources.  The craa_type_mask bitmap
+   specifies the type of resources that are recalled, and the
+   craa_layouts_to_keep value specifies how many of the recalled
+   flexible file layouts the client is allowed to keep.  The mask flags
+   for the flexible file layout type are defined as follows:
+
+   <CODE BEGINS>
+
+   /// enum ff_cb_recall_any_mask {
+   ///     PNFS_FF_RCA4_TYPE_MASK_READ = 16,
+   ///     PNFS_FF_RCA4_TYPE_MASK_RW   = 17
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   The flags represent the iomode of the recalled layouts.  In response,
+   the client SHOULD return layouts of the recalled iomode that it needs
+   the least, keeping at most craa_layouts_to_keep flexible file
+   layouts.
+
+   The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return
+   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
+   PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
+   of iomode LAYOUTIOMODE4_RW.  When both mask flags are set, the client
+   is notified to return layouts of either iomode.
+
+14.  Client Fencing
+
+   In cases where clients are uncommunicative and their lease has
+   expired or when clients fail to return recalled layouts within a
+   lease period, the server MAY revoke client layouts and reassign these
+   resources to other clients (see Section 12.5.5 of [RFC5661]).  To
+   avoid data corruption, the metadata server MUST fence off the revoked
+   clients from the respective data files as described in Section 2.2.
+
+15.  Security Considerations
+
+   The combination of components in a pNFS system is required to
+   preserve the security properties of NFSv4.1+ with respect to an
+   entity accessing data via a client.  The pNFS feature partitions the
+   NFSv4.1+ file system protocol into two parts: the control protocol
+   and the data protocol.  As the control protocol in this document is
+   NFS, the security properties are equivalent to the version of NFS
+   being used.  The flexible file layout further divides the data
+
+
+
+Halevy & Haynes              Standards Track                   [Page 37]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   protocol into metadata and data paths.  The security properties of
+   the metadata path are equivalent to those of NFSv4.1+ (see Sections
+   1.7.1 and 2.2.1 of [RFC5661]).  The security properties of the data
+   path are equivalent to those of the version of NFS used to access the
+   storage device, with the provision that the metadata server is
+   responsible for authenticating client access to the data file.  The
+   metadata server provides appropriate credentials to the client to
+   access data files on the storage device.  It is also responsible for
+   revoking access for a client to the storage device.
+
+   The metadata server enforces the file access control policy at
+   LAYOUTGET time.  The client should use RPC authorization credentials
+   for getting the layout for the requested iomode (LAYOUTIOMODE4_READ
+   or LAYOUTIOMODE4_RW), and the server verifies the permissions and ACL
+   for these credentials, possibly returning NFS4ERR_ACCESS if the
+   client is not allowed the requested iomode.  If the LAYOUTGET
+   operation succeeds, the client receives, as part of the layout, a set
+   of credentials allowing it I/O access to the specified data files
+   corresponding to the requested iomode.  When the client acts on I/O
+   operations on behalf of its local users, it MUST authenticate and
+   authorize the user by issuing respective OPEN and ACCESS calls to the
+   metadata server, similar to having NFSv4 data delegations.
+
+   The combination of filehandle, synthetic uid, and gid in the layout
+   is the way that the metadata server enforces access control to the
+   data server.  The client only has access to filehandles of file
+   objects and not directory objects.  Thus, given a filehandle in a
+   layout, it is not possible to guess the parent directory filehandle.
+   Further, as the data file permissions only allow the given synthetic
+   uid read/write permission and the given synthetic gid read
+   permission, knowing the synthetic ids of one file does not
+   necessarily allow access to any other data file on the storage
+   device.
+
+   The metadata server can also deny access at any time by fencing the
+   data file, which means changing the synthetic ids.  In turn, that
+   forces the client to return its current layout and get a new layout
+   if it wants to continue I/O to the data file.
+
+   If access is allowed, the client uses the corresponding (read-only or
+   read/write) credentials to perform the I/O operations at the data
+   file's storage devices.  When the metadata server receives a request
+   to change a file's permissions or ACL, it SHOULD recall all layouts
+   for that file and then MUST fence off any clients still holding
+   outstanding layouts for the respective files by implicitly
+   invalidating the previously distributed credential on all data files
+   comprising the file in question.  It is REQUIRED that this be done
+   before committing to the new permissions and/or ACL.  By requesting
+
+
+
+Halevy & Haynes              Standards Track                   [Page 38]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   new layouts, the clients will reauthorize access against the modified
+   access control metadata.  Recalling the layouts in this case is
+   intended to prevent clients from getting an error on I/Os done after
+   the client was fenced off.
+
+15.1.  RPCSEC_GSS and Security Services
+
+   Because of the special use of principals within the loosely coupled
+   model, the issues are different depending on the coupling model.
+
+15.1.1.  Loosely Coupled
+
+   RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] contains facilities
+   that would allow it to be used to authorize the client to the storage
+   device on behalf of the metadata server.  Doing so would require that
+   the metadata server, storage device, and client each implement
+   RPCSEC_GSSv3 using an RPC-application-defined structured privilege
+   assertion in a manner described in Section 4.9.1 of [RFC7862].  The
+   specifics necessary to do so are not described in this document.
+   This is principally because any such specification would require
+   extensive implementation work on a wide range of storage devices,
+   which would be unlikely to result in a widely usable specification
+   for a considerable time.
+
+   As a result, the layout type described in this document will not
+   provide support for use of RPCSEC_GSS together with the loosely
+   coupled model.  However, future layout types could be specified that
+   allow such support, either through the use of RPCSEC_GSSv3 or in
+   other ways.
+
+15.1.2.  Tightly Coupled
+
+   With tight coupling, the principal used to access the metadata file
+   is exactly the same as used to access the data file.  The storage
+   device can use the control protocol to validate any RPC credentials.
+   As a result, there are no security issues related to using RPCSEC_GSS
+   with a tightly coupled system.  For example, if Kerberos V5 Generic
+   Security Service Application Program Interface (GSS-API) [RFC4121] is
+   used as the security mechanism, then the storage device could use a
+   control protocol to validate the RPC credentials to the metadata
+   server.
+
+16.  IANA Considerations
+
+   [RFC5661] introduced the "pNFS Layout Types Registry"; new layout
+   type numbers in this registry need to be assigned by IANA.  This
+   document defines the protocol associated with an existing layout type
+   number: LAYOUT4_FLEX_FILES.  See Table 1.
+ + + +Halevy & Haynes              Standards Track                   [Page 39] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   +--------------------+------------+----------+-----+----------------+ +   | Layout Type Name   | Value      | RFC      | How | Minor Versions | +   +--------------------+------------+----------+-----+----------------+ +   | LAYOUT4_FLEX_FILES | 0x00000004 | RFC 8435 | L   | 1              | +   +--------------------+------------+----------+-----+----------------+ + +                     Table 1: Layout Type Assignments + +   [RFC5661] also introduced the "NFSv4 Recallable Object Types +   Registry".  This document defines new recallable objects for +   RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see +   Table 2). + +   +------------------------------+-------+--------+-----+-------------+ +   | Recallable Object Type Name  | Value | RFC    | How | Minor       | +   |                              |       |        |     | Versions    | +   +------------------------------+-------+--------+-----+-------------+ +   | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16    | RFC    | L   | 1           | +   |                              |       | 8435   |     |             | +   | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17    | RFC    | L   | 1           | +   |                              |       | 8435   |     |             | +   +------------------------------+-------+--------+-----+-------------+ + +                Table 2: Recallable Object Type Assignments + +17.  References + +17.1.  Normative References + +   [LEGAL]    IETF Trust, "Trust Legal Provisions (TLP)", +              <https://trustee.ietf.org/trust-legal-provisions.html>. + +   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS +              Version 3 Protocol Specification", RFC 1813, +              DOI 10.17487/RFC1813, June 1995, +              <https://www.rfc-editor.org/info/rfc1813>. + +   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate +              Requirement Levels", BCP 14, RFC 2119, +              DOI 10.17487/RFC2119, March 1997, +              <https://www.rfc-editor.org/info/rfc2119>. + +   [RFC4121]  Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos +              Version 5 Generic Security Service Application Program +              Interface (GSS-API) Mechanism: Version 2", RFC 4121, +              DOI 10.17487/RFC4121, July 2005, +              <https://www.rfc-editor.org/info/rfc4121>. + + + + +Halevy & Haynes              Standards Track                   [Page 40] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation +              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May +              2006, <https://www.rfc-editor.org/info/rfc4506>. + +   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol +              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, +              May 2009, <https://www.rfc-editor.org/info/rfc5531>. + +   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., +              "Network File System (NFS) Version 4 Minor Version 1 +              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, +              <https://www.rfc-editor.org/info/rfc5661>. + +   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed., +              "Network File System (NFS) Version 4 Minor Version 1 +              External Data Representation Standard (XDR) Description", +              RFC 5662, DOI 10.17487/RFC5662, January 2010, +              <https://www.rfc-editor.org/info/rfc5662>. + +   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System +              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, +              March 2015, <https://www.rfc-editor.org/info/rfc7530>. + +   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC) +              Security Version 3", RFC 7861, DOI 10.17487/RFC7861, +              November 2016, <https://www.rfc-editor.org/info/rfc7861>. + +   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor +              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862, +              November 2016, <https://www.rfc-editor.org/info/rfc7862>. + +   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC +              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, +              May 2017, <https://www.rfc-editor.org/info/rfc8174>. + +   [RFC8434]  Haynes, T., "Requirements for Parallel NFS (pNFS) Layout +              Types", RFC 8434, DOI 10.17487/RFC8434, August 2018, +              <https://www.rfc-editor.org/info/rfc8434>. + +17.2.  Informative References + +   [RFC4519]  Sciberras, A., Ed., "Lightweight Directory Access Protocol +              (LDAP): Schema for User Applications", RFC 4519, +              DOI 10.17487/RFC4519, June 2006, +              <https://www.rfc-editor.org/info/rfc4519>. + + + + + + +Halevy & Haynes              Standards Track                   [Page 41] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +Acknowledgments + +   The following individuals provided miscellaneous comments to early +   draft versions of this document: Matt W. Benjamin, Adam Emerson, +   J. Bruce Fields, and Lev Solomonov. + +   The following individuals provided miscellaneous comments to the +   final draft versions of this document: Anand Ganesh, Robert Wipfel, +   Gobikrishnan Sundharraj, Trond Myklebust, Rick Macklem, and Jim +   Sermersheim. + +   Idan Kedar caught a nasty bug in the interaction of client-side +   mirroring and the minor versioning of devices. + +   Dave Noveck provided comprehensive reviews of the document during the +   working group last calls.  He also rewrote Section 2.3. + +   Olga Kornievskaia made a convincing case against the use of a +   credential versus a principal in the fencing approach.  Andy Adamson +   and Benjamin Kaduk helped to sharpen the focus. + +   Benjamin Kaduk and Olga Kornievskaia also helped provide concrete +   scenarios for loosely coupled security mechanisms.  In the end, Olga +   proved that as defined, the loosely coupled model would not work with +   RPCSEC_GSS. + +   Tigran Mkrtchyan provided the use case for not allowing the client to +   proxy the I/O through the data server. + +   Rick Macklem provided the use case for only writing to a single +   mirror. + +Authors' Addresses + +   Benny Halevy + +   Email: bhalevy@gmail.com + + +   Thomas Haynes +   Hammerspace +   4300 El Camino Real Ste 105 +   Los Altos, CA  94022 +   United States of America + +   Email: loghyr@gmail.com + + + + + +Halevy & Haynes              Standards Track                   [Page 42] + |