author     Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>  2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8435.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8435.txt')
-rw-r--r--  doc/rfc/rfc8435.txt  2355
1 file changed, 2355 insertions(+), 0 deletions(-)
| diff --git a/doc/rfc/rfc8435.txt b/doc/rfc/rfc8435.txt new file mode 100644 index 0000000..02b0b8d --- /dev/null +++ b/doc/rfc/rfc8435.txt @@ -0,0 +1,2355 @@ + + + + + + +Internet Engineering Task Force (IETF)                         B. Halevy +Request for Comments: 8435 +Category: Standards Track                                      T. Haynes +ISSN: 2070-1721                                              Hammerspace +                                                             August 2018 + + +                Parallel NFS (pNFS) Flexible File Layout + +Abstract + +   Parallel NFS (pNFS) allows a separation between the metadata (onto a +   metadata server) and data (onto a storage device) for a file.  The +   flexible file layout type is defined in this document as an extension +   to pNFS that allows the use of storage devices that require only a +   limited degree of interaction with the metadata server and use +   already-existing protocols.  Client-side mirroring is also added to +   provide replication of files. + +Status of This Memo + +   This is an Internet Standards Track document. + +   This document is a product of the Internet Engineering Task Force +   (IETF).  It represents the consensus of the IETF community.  It has +   received public review and has been approved for publication by the +   Internet Engineering Steering Group (IESG).  Further information on +   Internet Standards is available in Section 2 of RFC 7841. + +   Information about the current status of this document, any errata, +   and how to provide feedback on it may be obtained at +   https://www.rfc-editor.org/info/rfc8435. + +Copyright Notice + +   Copyright (c) 2018 IETF Trust and the persons identified as the +   document authors.  All rights reserved. + +   This document is subject to BCP 78 and the IETF Trust's Legal +   Provisions Relating to IETF Documents +   (https://trustee.ietf.org/license-info) in effect on the date of +   publication of this document.  Please review these documents +   carefully, as they describe your rights and restrictions with respect +   to this document.  Code Components extracted from this document must +   include Simplified BSD License text as described in Section 4.e of +   the Trust Legal Provisions and are provided without warranty as +   described in the Simplified BSD License. + + + + +Halevy & Haynes              Standards Track                    [Page 1] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +Table of Contents + +   1. Introduction ....................................................3 +      1.1. Definitions ................................................4 +      1.2. Requirements Language ......................................6 +   2. Coupling of Storage Devices .....................................6 +      2.1. LAYOUTCOMMIT ...............................................7 +      2.2. Fencing Clients from the Storage Device ....................7 +           2.2.1. Implementation Notes for Synthetic uids/gids ........8 +           2.2.2. Example of Using Synthetic uids/gids ................9 +      2.3. State and Locking Models ..................................10 +           2.3.1. Loosely Coupled Locking Model ......................11 +           2.3.2. Tightly Coupled Locking Model ......................12 +   3. XDR Description of the Flexible File Layout Type ...............13 +      3.1. Code Components Licensing Notice ..........................14 +   4. 
Device Addressing and Discovery ................................16 +      4.1. ff_device_addr4 ...........................................16 +      4.2. Storage Device Multipathing ...............................17 +   5. Flexible File Layout Type ......................................18 +      5.1. ff_layout4 ................................................19 +           5.1.1. Error Codes from LAYOUTGET .........................23 +           5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS ...23 +      5.2. LAYOUTCOMMIT ..............................................24 +      5.3. Interactions between Devices and Layouts ..................24 +      5.4. Handling Version Errors ...................................24 +   6. Striping via Sparse Mapping ....................................25 +   7. Recovering from Client I/O Errors ..............................25 +   8. Mirroring ......................................................26 +      8.1. Selecting a Mirror ........................................26 +      8.2. Writing to Mirrors ........................................27 +           8.2.1. Single Storage Device Updates Mirrors ..............27 +           8.2.2. Client Updates All Mirrors .........................27 +           8.2.3. Handling Write Errors ..............................28 +           8.2.4. Handling Write COMMITs .............................28 +      8.3. Metadata Server Resilvering of the File ...................29 +   9. Flexible File Layout Type Return ...............................29 +      9.1. I/O Error Reporting .......................................30 +           9.1.1. ff_ioerr4 ..........................................30 +      9.2. Layout Usage Statistics ...................................31 +           9.2.1. ff_io_latency4 .....................................31 +           9.2.2. ff_layoutupdate4 ...................................32 +           9.2.3. ff_iostats4 ........................................33 +      9.3. ff_layoutreturn4 ..........................................34 +   10. Flexible File Layout Type LAYOUTERROR .........................35 +   11. Flexible File Layout Type LAYOUTSTATS .........................35 +   12. Flexible File Layout Type Creation Hint .......................35 +      12.1. ff_layouthint4 ...........................................35 +   13. Recalling a Layout ............................................36 + + + +Halevy & Haynes              Standards Track                    [Page 2] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +      13.1. CB_RECALL_ANY ............................................36 +   14. Client Fencing ................................................37 +   15. Security Considerations .......................................37 +      15.1. RPCSEC_GSS and Security Services .........................39 +           15.1.1. Loosely Coupled ...................................39 +           15.1.2. Tightly Coupled ...................................39 +   16. IANA Considerations ...........................................39 +   17. References ....................................................40 +      17.1. Normative References .....................................40 +      17.2. Informative References ...................................41 +   Acknowledgments ...................................................42 +   Authors' Addresses ................................................42 + +1.  
Introduction + +   In Parallel NFS (pNFS), the metadata server returns layout type +   structures that describe where file data is located.  There are +   different layout types for different storage systems and methods of +   arranging data on storage devices.  This document defines the +   flexible file layout type used with file-based data servers that are +   accessed using the NFS protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], +   NFSv4.1 [RFC5661], and NFSv4.2 [RFC7862]. + +   To provide a global state model equivalent to that of the files +   layout type, a back-end control protocol might be implemented between +   the metadata server and NFSv4.1+ storage devices.  An implementation +   can either define its own proprietary mechanism or it could define a +   control protocol in a Standards Track document.  The requirements for +   a control protocol are specified in [RFC5661] and clarified in +   [RFC8434]. + +   The control protocol described in this document is based on NFS.  It +   does not provide for knowledge of stateids to be passed between the +   metadata server and the storage devices.  Instead, the storage +   devices are configured such that the metadata server has full access +   rights to the data file system and then the metadata server uses +   synthetic ids to control client access to individual files. + +   In traditional mirroring of data, the server is responsible for +   replicating, validating, and repairing copies of the data file.  With +   client-side mirroring, the metadata server provides a layout that +   presents the available mirrors to the client.  The client then picks +   a mirror to read from and ensures that all writes go to all mirrors. +   The client only considers the write transaction to have succeeded if +   all mirrors are successfully updated.  In case of error, the client +   can use the LAYOUTERROR operation to inform the metadata server, +   which is then responsible for the repairing of the mirrored copies of +   the file. + + + +Halevy & Haynes              Standards Track                    [Page 3] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +1.1.  Definitions + +   control communication requirements:  the specification for +      information on layouts, stateids, file metadata, and file data +      that must be communicated between the metadata server and the +      storage devices.  There is a separate set of requirements for each +      layout type. + +   control protocol:  the particular mechanism that an implementation of +      a layout type would use to meet the control communication +      requirement for that layout type.  This need not be a protocol as +      normally understood.  In some cases, the same protocol may be used +      as a control protocol and storage protocol. + +   client-side mirroring:  a feature in which the client, not the +      server, is responsible for updating all of the mirrored copies of +      a layout segment. + +   (file) data:  that part of the file system object that contains the +      data to be read or written.  It is the contents of the object +      rather than the attributes of the object. + +   data server (DS):  a pNFS server that provides the file's data when +      the file system object is accessed over a file-based protocol. + +   fencing:  the process by which the metadata server prevents the +      storage devices from processing I/O from a specific client to a +      specific file. 
+ +   file layout type:  a layout type in which the storage devices are +      accessed via the NFS protocol (see Section 13 of [RFC5661]). + +   gid:  the group id, a numeric value that identifies to which group a +      file belongs. + +   layout:  the information a client uses to access file data on a +      storage device.  This information includes specification of the +      protocol (layout type) and the identity of the storage devices to +      be used. + +   layout iomode:  a grant of either read-only or read/write I/O to the +      client. + +   layout segment:  a sub-division of a layout.  That sub-division might +      be by the layout iomode (see Sections 3.3.20 and 12.2.9 of +      [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or +      requested byte range. + + + + +Halevy & Haynes              Standards Track                    [Page 4] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   layout stateid:  a 128-bit quantity returned by a server that +      uniquely defines the layout state provided by the server for a +      specific layout that describes a layout type and file (see +      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 of +      [RFC5661] describes differences in handling between layout +      stateids and other stateid types. + +   layout type:  a specification of both the storage protocol used to +      access the data and the aggregation scheme used to lay out the +      file data on the underlying storage devices. + +   loose coupling:  when the control protocol is a storage protocol. + +   (file) metadata:  the part of the file system object that contains +      various descriptive data relevant to the file object, as opposed +      to the file data itself.  This could include the time of last +      modification, access time, EOF position, etc. + +   metadata server (MDS):  the pNFS server that provides metadata +      information for a file system object.  It is also responsible for +      generating, recalling, and revoking layouts for file system +      objects, for performing directory operations, and for performing +      I/O operations to regular files when the clients direct these to +      the metadata server itself. + +   mirror:  a copy of a layout segment.  Note that if one copy of the +      mirror is updated, then all copies must be updated. + +   recalling a layout:  a graceful recall, via a callback, of a specific +      layout by the metadata server to the client.  Graceful here means +      that the client would have the opportunity to flush any WRITEs, +      etc., before returning the layout to the metadata server. + +   revoking a layout:  an invalidation of a specific layout by the +      metadata server.  Once revocation occurs, the metadata server will +      not accept as valid any reference to the revoked layout, and a +      storage device will not accept any client access based on the +      layout. + +   resilvering:  the act of rebuilding a mirrored copy of a layout +      segment from a known good copy of the layout segment.  Note that +      this can also be done to create a new mirrored copy of the layout +      segment. + +   rsize:  the data transfer buffer size used for READs. 
+ + + + + + +Halevy & Haynes              Standards Track                    [Page 5] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   stateid:  a 128-bit quantity returned by a server that uniquely +      defines the set of locking-related state provided by the server. +      Stateids may designate state related to open files, byte-range +      locks, delegations, or layouts. + +   storage device:  the target to which clients may direct I/O requests +      when they hold an appropriate layout.  See Section 2.1 of +      [RFC8434] for further discussion of the difference between a data +      server and a storage device. + +   storage protocol:  the protocol used by clients to do I/O operations +      to the storage device.  Each layout type specifies the set of +      storage protocols. + +   tight coupling:  an arrangement in which the control protocol is one +      designed specifically for control communication.  It may be either +      a proprietary protocol adapted specifically to a particular +      metadata server or a protocol based on a Standards Track document. + +   uid:  the user id, a numeric value that identifies which user owns a +      file. + +   wsize:  the data transfer buffer size used for WRITEs. + +1.2.  Requirements Language + +   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and +   "OPTIONAL" in this document are to be interpreted as described in +   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all +   capitals, as shown here. + +2.  Coupling of Storage Devices + +   A server implementation may choose either a loosely coupled model or +   a tightly coupled model between the metadata server and the storage +   devices.  [RFC8434] describes the general problems facing pNFS +   implementations.  This document details how the new flexible file +   layout type addresses these issues.  To implement the tightly coupled +   model, a control protocol has to be defined.  As the flexible file +   layout imposes no special requirements on the client, the control +   protocol will need to provide: + +   (1)  management of both security and LAYOUTCOMMITs and + +   (2)  a global stateid model and management of these stateids. + + + + + +Halevy & Haynes              Standards Track                    [Page 6] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   When implementing the loosely coupled model, the only control +   protocol will be a version of NFS, with no ability to provide a +   global stateid model or to prevent clients from using layouts +   inappropriately.  To enable client use in that environment, this +   document will specify how security, state, and locking are to be +   managed. + +2.1.  LAYOUTCOMMIT + +   Regardless of the coupling model, the metadata server has the +   responsibility, upon receiving a LAYOUTCOMMIT (see Section 18.42 of +   [RFC5661]) to ensure that the semantics of pNFS are respected (see +   Section 3.1 of [RFC8434]).  These do include a requirement that data +   written to a data storage device be stable before the occurrence of +   the LAYOUTCOMMIT. + +   It is the responsibility of the client to make sure the data file is +   stable before the metadata server begins to query the storage devices +   about the changes to the file.  
If any WRITE to a storage device did +   not result with stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the +   metadata server MUST be preceded by a COMMIT to the storage devices +   written to.  Note that if the client has not done a COMMIT to the +   storage device, then the LAYOUTCOMMIT might not be synchronized to +   the last WRITE operation to the storage device. + +2.2.  Fencing Clients from the Storage Device + +   With loosely coupled storage devices, the metadata server uses +   synthetic uids (user ids) and gids (group ids) for the data file, +   where the uid owner of the data file is allowed read/write access and +   the gid owner is allowed read-only access.  As part of the layout +   (see ffds_user and ffds_group in Section 5.1), the client is provided +   with the user and group to be used in the Remote Procedure Call (RPC) +   [RFC5531] credentials needed to access the data file.  Fencing off of +   clients is achieved by the metadata server changing the synthetic uid +   and/or gid owners of the data file on the storage device to +   implicitly revoke the outstanding RPC credentials.  A client +   presenting the wrong credential for the desired access will get an +   NFS4ERR_ACCESS error. + +   With this loosely coupled model, the metadata server is not able to +   fence off a single client; it is forced to fence off all clients. +   However, as the other clients react to the fencing, returning their +   layouts and trying to get new ones, the metadata server can hand out +   a new uid and gid to allow access. + + + + + + +Halevy & Haynes              Standards Track                    [Page 7] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   It is RECOMMENDED to implement common access control methods at the +   storage device file system to allow only the metadata server root +   (super user) access to the storage device and to set the owner of all +   directories holding data files to the root user.  This approach +   provides a practical model to enforce access control and fence off +   cooperative clients, but it cannot protect against malicious clients; +   hence, it provides a level of security equivalent to AUTH_SYS.  It is +   RECOMMENDED that the communication between the metadata server and +   storage device be secure from eavesdroppers and man-in-the-middle +   protocol tampering.  The security measure could be physical security +   (e.g., the servers are co-located in a physically secure area), +   encrypted communications, or some other technique. + +   With tightly coupled storage devices, the metadata server sets the +   user and group owners, mode bits, and Access Control List (ACL) of +   the data file to be the same as the metadata file.  And the client +   must authenticate with the storage device and go through the same +   authorization process it would go through via the metadata server. +   In the case of tight coupling, fencing is the responsibility of the +   control protocol and is not described in detail in this document. +   However, implementations of the tightly coupled locking model (see +   Section 2.3) will need a way to prevent access by certain clients to +   specific files by invalidating the corresponding stateids on the +   storage device.  In such a scenario, the client will be given an +   error of NFS4ERR_BAD_STATEID. + +   The client need not know the model used between the metadata server +   and the storage device.  
It need only react consistently to any +   errors in interacting with the storage device.  It should both return +   the layout and error to the metadata server and ask for a new layout. +   At that point, the metadata server can either hand out a new layout, +   hand out no layout (forcing the I/O through it), or deny the client +   further access to the file. + +2.2.1.  Implementation Notes for Synthetic uids/gids + +   The selection method for the synthetic uids and gids to be used for +   fencing in loosely coupled storage devices is strictly an +   implementation issue.  That is, an administrator might restrict a +   range of such ids available to the Lightweight Directory Access +   Protocol (LDAP) 'uid' field [RFC4519].  The administrator might also +   be able to choose an id that would never be used to grant access. +   Then, when the metadata server had a request to access a file, a +   SETATTR would be sent to the storage device to set the owner and +   group of the data file.  The user and group might be selected in a +   round-robin fashion from the range of available ids. + + + + + +Halevy & Haynes              Standards Track                    [Page 8] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   Those ids would be sent back as ffds_user and ffds_group to the +   client, who would present them as the RPC credentials to the storage +   device.  When the client is done accessing the file and the metadata +   server knows that no other client is accessing the file, it can reset +   the owner and group to restrict access to the data file. + +   When the metadata server wants to fence off a client, it changes the +   synthetic uid and/or gid to the restricted ids.  Note that using a +   restricted id ensures that there is a change of owner and at least +   one id available that never gets allowed access. + +   Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD +   be avoided.  These typically either grant super access to files on a +   storage device or are mapped to an anonymous id.  In the first case, +   even if the data file is fenced, the client might still be able to +   access the file.  In the second case, multiple ids might be mapped to +   the anonymous ids. + +2.2.2.  Example of Using Synthetic uids/gids + +   The user loghyr creates a file "ompha.c" on the metadata server, +   which then creates a corresponding data file on the storage device. + +   The metadata server entry may look like: + +   -rw-r--r--    1 loghyr  staff    1697 Dec  4 11:31 ompha.c + +   On the storage device, the file may be assigned some unpredictable +   synthetic uid/gid to deny access: + +   -rw-r-----    1 19452   28418    1697 Dec  4 11:31 data_ompha.c + +   When the file is opened on a client and accessed, the user will try +   to get a layout for the data file.  Since the layout knows nothing +   about the user (and does not care), it does not matter whether the +   user loghyr or garbo opens the file.  The client has to present an +   uid of 19452 to get write permission.  If it presents any other value +   for the uid, then it must give a gid of 28418 to get read access. + +   Further, if the metadata server decides to fence the file, it should +   change the uid and/or gid such that these values neither match +   earlier values for that file nor match a predictable change based on +   an earlier fencing. 
+ +   -rw-r-----    1 19453   28419    1697 Dec  4 11:31 data_ompha.c + + + + + + +Halevy & Haynes              Standards Track                    [Page 9] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   The set of synthetic gids on the storage device should be selected +   such that there is no mapping in any of the name services used by the +   storage device, i.e., each group should have no members. + +   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the +   metadata server should return a synthetic uid that is not set on the +   storage device.  Only the synthetic gid would be valid. + +   The client is thus solely responsible for enforcing file permissions +   in a loosely coupled model.  To allow loghyr write access, it will +   send an RPC to the storage device with a credential of 1066:1067.  To +   allow garbo read access, it will send an RPC to the storage device +   with a credential of 1067:1067.  The value of the uid does not matter +   as long as it is not the synthetic uid granted when getting the +   layout. + +   While pushing the enforcement of permission checking onto the client +   may seem to weaken security, the client may already be responsible +   for enforcing permissions before modifications are sent to a server. +   With cached writes, the client is always responsible for tracking who +   is modifying a file and making sure to not coalesce requests from +   multiple users into one request. + +2.3.  State and Locking Models + +   An implementation can always be deployed as a loosely coupled model. +   There is, however, no way for a storage device to indicate over an +   NFS protocol that it can definitively participate in a tightly +   coupled model: + +   o  Storage devices implementing the NFSv3 and NFSv4.0 protocols are +      always treated as loosely coupled. + +   o  NFSv4.1+ storage devices that do not return the +      EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID are indicating +      that they are to be treated as loosely coupled.  From the locking +      viewpoint, they are treated in the same way as NFSv4.0 storage +      devices. + +   o  NFSv4.1+ storage devices that do identify themselves with the +      EXCHGID4_FLAG_USE_PNFS_DS flag set to EXCHANGE_ID can potentially +      be tightly coupled.  They would use a back-end control protocol to +      implement the global stateid model as described in [RFC5661]. + +   A storage device would have to be either discovered or advertised +   over the control protocol to enable a tightly coupled model. + + + + + +Halevy & Haynes              Standards Track                   [Page 10] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +2.3.1.  Loosely Coupled Locking Model + +   When locking-related operations are requested, they are primarily +   dealt with by the metadata server, which generates the appropriate +   stateids.  When an NFSv4 version is used as the data access protocol, +   the metadata server may make stateid-related requests of the storage +   devices.  However, it is not required to do so, and the resulting +   stateids are known only to the metadata server and the storage +   device. + +   Given this basic structure, locking-related operations are handled as +   follows: + +   o  OPENs are dealt with by the metadata server.  Stateids are +      selected by the metadata server and associated with the client ID +      describing the client's connection to the metadata server.  
The +      metadata server may need to interact with the storage device to +      locate the file to be opened, but no locking-related functionality +      need be used on the storage device. + +      OPEN_DOWNGRADE and CLOSE only require local execution on the +      metadata server. + +   o  Advisory byte-range locks can be implemented locally on the +      metadata server.  As in the case of OPENs, the stateids associated +      with byte-range locks are assigned by the metadata server and only +      used on the metadata server. + +   o  Delegations are assigned by the metadata server that initiates +      recalls when conflicting OPENs are processed.  No storage device +      involvement is required. + +   o  TEST_STATEID and FREE_STATEID are processed locally on the +      metadata server, without storage device involvement. + +   All I/O operations to the storage device are done using the anonymous +   stateid.  Thus, the storage device has no information about the +   openowner and lockowner responsible for issuing a particular I/O +   operation.  As a result: + +   o  Mandatory byte-range locking cannot be supported because the +      storage device has no way of distinguishing I/O done on behalf of +      the lock owner from those done by others. + +   o  Enforcement of share reservations is the responsibility of the +      client.  Even though I/O is done using the anonymous stateid, the +      client must ensure that it has a valid stateid associated with the +      openowner. + + + +Halevy & Haynes              Standards Track                   [Page 11] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   In the event that a stateid is revoked, the metadata server is +   responsible for preventing client access, since it has no way of +   being sure that the client is aware that the stateid in question has +   been revoked. + +   As the client never receives a stateid generated by a storage device, +   there is no client lease on the storage device and no prospect of +   lease expiration, even when access is via NFSv4 protocols.  Clients +   will have leases on the metadata server.  In dealing with lease +   expiration, the metadata server may need to use fencing to prevent +   revoked stateids from being relied upon by a client unaware of the +   fact that they have been revoked. + +2.3.2.  Tightly Coupled Locking Model + +   When locking-related operations are requested, they are primarily +   dealt with by the metadata server, which generates the appropriate +   stateids.  These stateids must be made known to the storage device +   using control protocol facilities, the details of which are not +   discussed in this document. + +   Given this basic structure, locking-related operations are handled as +   follows: + +   o  OPENs are dealt with primarily on the metadata server.  Stateids +      are selected by the metadata server and associated with the client +      ID describing the client's connection to the metadata server.  The +      metadata server needs to interact with the storage device to +      locate the file to be opened and to make the storage device aware +      of the association between the metadata-server-chosen stateid and +      the client and openowner that it represents. + +      OPEN_DOWNGRADE and CLOSE are executed initially on the metadata +      server, but the state change made must be propagated to the +      storage device. + +   o  Advisory byte-range locks can be implemented locally on the +      metadata server. 
 As in the case of OPENs, the stateids associated +      with byte-range locks are assigned by the metadata server and are +      available for use on the metadata server.  Because I/O operations +      are allowed to present lock stateids, the metadata server needs +      the ability to make the storage device aware of the association +      between the metadata-server-chosen stateid and the corresponding +      open stateid it is associated with. + +   o  Mandatory byte-range locks can be supported when both the metadata +      server and the storage devices have the appropriate support.  As +      in the case of advisory byte-range locks, these are assigned by + + + +Halevy & Haynes              Standards Track                   [Page 12] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +      the metadata server and are available for use on the metadata +      server.  To enable mandatory lock enforcement on the storage +      device, the metadata server needs the ability to make the storage +      device aware of the association between the metadata-server-chosen +      stateid and the client, openowner, and lock (i.e., lockowner, +      byte-range, and lock-type) that it represents.  Because I/O +      operations are allowed to present lock stateids, this information +      needs to be propagated to all storage devices to which I/O might +      be directed rather than only to storage device that contain the +      locked region. + +   o  Delegations are assigned by the metadata server that initiates +      recalls when conflicting OPENs are processed.  Because I/O +      operations are allowed to present delegation stateids, the +      metadata server requires the ability (1) to make the storage +      device aware of the association between the metadata-server-chosen +      stateid and the filehandle and delegation type it represents and +      (2) to break such an association. + +   o  TEST_STATEID is processed locally on the metadata server, without +      storage device involvement. + +   o  FREE_STATEID is processed on the metadata server, but the metadata +      server requires the ability to propagate the request to the +      corresponding storage devices. + +   Because the client will possess and use stateids valid on the storage +   device, there will be a client lease on the storage device, and the +   possibility of lease expiration does exist.  The best approach for +   the storage device is to retain these locks as a courtesy.  However, +   if it does not do so, control protocol facilities need to provide the +   means to synchronize lock state between the metadata server and +   storage device. + +   Clients will also have leases on the metadata server that are subject +   to expiration.  In dealing with lease expiration, the metadata server +   would be expected to use control protocol facilities enabling it to +   invalidate revoked stateids on the storage device.  In the event the +   client is not responsive, the metadata server may need to use fencing +   to prevent revoked stateids from being acted upon by the storage +   device. + +3.  XDR Description of the Flexible File Layout Type + +   This document contains the External Data Representation (XDR) +   [RFC4506] description of the flexible file layout type.  The XDR +   description is embedded in this document in a way that makes it +   simple for the reader to extract into a ready-to-compile form.  
The + + + +Halevy & Haynes              Standards Track                   [Page 13] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   reader can feed this document into the following shell script to +   produce the machine-readable XDR description of the flexible file +   layout type: + +   <CODE BEGINS> + +   #!/bin/sh +   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' + +   <CODE ENDS> + +   That is, if the above script is stored in a file called "extract.sh" +   and this document is in a file called "spec.txt", then the reader can +   do: + +   sh extract.sh < spec.txt > flex_files_prot.x + +   The effect of the script is to remove leading white space from each +   line, plus a sentinel sequence of "///". + +   The embedded XDR file header follows.  Subsequent XDR descriptions +   with the sentinel sequence are embedded throughout the document. + +   Note that the XDR code contained in this document depends on types +   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both nfs +   types that end with a 4, such as offset4, length4, etc., as well as +   more generic types such as uint32_t and uint64_t. + +3.1.  Code Components Licensing Notice + +   Both the XDR description and the scripts used for extracting the XDR +   description are Code Components as described in Section 4 of "Trust +   Legal Provisions (TLP)" [LEGAL].  These Code Components are licensed +   according to the terms of that document. + +   <CODE BEGINS> + +   /// /* +   ///  * Copyright (c) 2018 IETF Trust and the persons identified +   ///  * as authors of the code.  All rights reserved. +   ///  * +   ///  * Redistribution and use in source and binary forms, with +   ///  * or without modification, are permitted provided that the +   ///  * following conditions are met: +   ///  * +   ///  * - Redistributions of source code must retain the above +   ///  *   copyright notice, this list of conditions and the +   ///  *   following disclaimer. + + + +Halevy & Haynes              Standards Track                   [Page 14] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   ///  * +   ///  * - Redistributions in binary form must reproduce the above +   ///  *   copyright notice, this list of conditions and the +   ///  *   following disclaimer in the documentation and/or other +   ///  *   materials provided with the distribution. +   ///  * +   ///  * - Neither the name of Internet Society, IETF or IETF +   ///  *   Trust, nor the names of specific contributors, may be +   ///  *   used to endorse or promote products derived from this +   ///  *   software without specific prior written permission. +   ///  * +   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS +   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED +   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  
IN NO +   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT +   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING +   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF +   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +   ///  * +   ///  * This code was derived from RFC 8435. +   ///  * Please reproduce this note if possible. +   ///  */ +   /// +   /// /* +   ///  * flex_files_prot.x +   ///  */ +   /// +   /// /* +   ///  * The following include statements are for example only. +   ///  * The actual XDR definition files are generated separately +   ///  * and independently and are likely to have a different name. +   ///  * %#include <nfsv42.x> +   ///  * %#include <rpc_prot.x> +   ///  */ +   /// + +   <CODE ENDS> + + + + + + +Halevy & Haynes              Standards Track                   [Page 15] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +4.  Device Addressing and Discovery + +   Data operations to a storage device require the client to know the +   network address of the storage device.  The NFSv4.1+ GETDEVICEINFO +   operation (Section 18.40 of [RFC5661]) is used by the client to +   retrieve that information. + +4.1.  ff_device_addr4 + +   The ff_device_addr4 data structure is returned by the server as the +   layout-type-specific opaque field da_addr_body in the device_addr4 +   structure by a successful GETDEVICEINFO operation. + +   <CODE BEGINS> + +   /// struct ff_device_versions4 { +   ///         uint32_t        ffdv_version; +   ///         uint32_t        ffdv_minorversion; +   ///         uint32_t        ffdv_rsize; +   ///         uint32_t        ffdv_wsize; +   ///         bool            ffdv_tightly_coupled; +   /// }; +   /// + +   /// struct ff_device_addr4 { +   ///         multipath_list4     ffda_netaddrs; +   ///         ff_device_versions4 ffda_versions<>; +   /// }; +   /// + +   <CODE ENDS> + +   The ffda_netaddrs field is used to locate the storage device.  It +   MUST be set by the server to a list holding one or more of the device +   network addresses. + +   The ffda_versions array allows the metadata server to present choices +   as to NFS version, minor version, and coupling strength to the +   client.  The ffdv_version and ffdv_minorversion represent the NFS +   protocol to be used to access the storage device.  This layout +   specification defines the semantics for ffdv_versions 3 and 4.  If +   ffdv_version equals 3, then the server MUST set ffdv_minorversion to +   0 and ffdv_tightly_coupled to false.  The client MUST then access the +   storage device using the NFSv3 protocol [RFC1813].  If ffdv_version +   equals 4, then the server MUST set ffdv_minorversion to one of the +   NFSv4 minor version numbers, and the client MUST access the storage +   device using NFSv4 with the specified minor version. 
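
   The following Python sketch shows how a client might walk the
   ffda_versions array and select a combination it can speak, honoring
   the MUSTs above.  The FfDeviceVersions type and the SUPPORTED set
   are assumptions made for illustration only; a real client decodes
   the XDR ff_device_versions4 structure directly.

      from dataclasses import dataclass

      @dataclass
      class FfDeviceVersions:       # mirrors ff_device_versions4
          version: int              # ffdv_version
          minorversion: int         # ffdv_minorversion
          rsize: int                # ffdv_rsize
          wsize: int                # ffdv_wsize
          tightly_coupled: bool     # ffdv_tightly_coupled

      # (version, minorversion) pairs this hypothetical client can use
      SUPPORTED = {(3, 0), (4, 0), (4, 1), (4, 2)}

      def pick_version(versions):
          for v in versions:
              if v.version == 3 and (v.minorversion != 0
                                     or v.tightly_coupled):
                  continue          # server violated the NFSv3 MUSTs
              if (v.version, v.minorversion) in SUPPORTED:
                  return v
          # No usable combination: identify the device via
          # LAYOUTRETURN/LAYOUTERROR (see Section 5.4) or fall back
          # to doing I/O through the metadata server.
          raise ValueError("no mutually supported NFS version")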
+ + + + +Halevy & Haynes              Standards Track                   [Page 16] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   Note that while the client might determine that it cannot use any of +   the configured combinations of ffdv_version, ffdv_minorversion, and +   ffdv_tightly_coupled, when it gets the device list from the metadata +   server, there is no way to indicate to the metadata server as to +   which device it is version incompatible.  However, if the client +   waits until it retrieves the layout from the metadata server, it can +   at that time clearly identify the storage device in question (see +   Section 5.4). + +   The ffdv_rsize and ffdv_wsize are used to communicate the maximum +   rsize and wsize supported by the storage device.  As the storage +   device can have a different rsize or wsize than the metadata server, +   the ffdv_rsize and ffdv_wsize allow the metadata server to +   communicate that information on behalf of the storage device. + +   ffdv_tightly_coupled informs the client as to whether or not the +   metadata server is tightly coupled with the storage devices.  Note +   that even if the data protocol is at least NFSv4.1, it may still be +   the case that there is loose coupling in effect.  If +   ffdv_tightly_coupled is not set, then the client MUST commit writes +   to the storage devices for the file before sending a LAYOUTCOMMIT to +   the metadata server.  That is, the writes MUST be committed by the +   client to stable storage via issuing WRITEs with stable_how == +   FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != +   FILE_SYNC (see Section 3.3.7 of [RFC1813]). + +4.2.  Storage Device Multipathing + +   The flexible file layout type supports multipathing to multiple +   storage device addresses.  Storage-device-level multipathing is used +   for bandwidth scaling via trunking and for higher availability of use +   in the event of a storage device failure.  Multipathing allows the +   client to switch to another storage device address that may be that +   of another storage device that is exporting the same data stripe +   unit, without having to contact the metadata server for a new layout. + +   To support storage device multipathing, ffda_netaddrs contains an +   array of one or more storage device network addresses.  This array +   (data type multipath_list4) represents a list of storage devices +   (each identified by a network address), with the possibility that +   some storage device will appear in the list multiple times. + +   The client is free to use any of the network addresses as a +   destination to send storage device requests.  If some network +   addresses are less desirable paths to the data than others, then the +   metadata server SHOULD NOT include those network addresses in +   ffda_netaddrs.  If less desirable network addresses exist to provide +   failover, the RECOMMENDED method to offer the addresses is to provide + + + +Halevy & Haynes              Standards Track                   [Page 17] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   them in a replacement device-ID-to-device-address mapping or a +   replacement device ID.  When a client finds no response from the +   storage device using all addresses available in ffda_netaddrs, it +   SHOULD send a GETDEVICEINFO to attempt to replace the existing +   device-ID-to-device-address mappings.  
If the metadata server detects +   that all network paths represented by ffda_netaddrs are unavailable, +   the metadata server SHOULD send a CB_NOTIFY_DEVICEID (if the client +   has indicated it wants device ID notifications for changed device +   IDs) to change the device-ID-to-device-address mappings to the +   available addresses.  If the device ID itself will be replaced, the +   metadata server SHOULD recall all layouts with the device ID and thus +   force the client to get new layouts and device ID mappings via +   LAYOUTGET and GETDEVICEINFO. + +   Generally, if two network addresses appear in ffda_netaddrs, they +   will designate the same storage device.  When the storage device is +   accessed over NFSv4.1 or a higher minor version, the two storage +   device addresses will support the implementation of client ID or +   session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. +   The two storage device addresses will share the same server owner or +   major ID of the server owner.  It is not always necessary for the two +   storage device addresses to designate the same storage device with +   trunking being used.  For example, the data could be read-only, and +   the data consist of exact replicas. + +5.  Flexible File Layout Type + +   The original layouttype4 introduced in [RFC5662] is modified to be: + +   <CODE BEGINS> + +       enum layouttype4 { +           LAYOUT4_NFSV4_1_FILES   = 1, +           LAYOUT4_OSD2_OBJECTS    = 2, +           LAYOUT4_BLOCK_VOLUME    = 3, +           LAYOUT4_FLEX_FILES      = 4 +       }; + +       struct layout_content4 { +           layouttype4             loc_type; +           opaque                  loc_body<>; +       }; + + + + + + + + + +Halevy & Haynes              Standards Track                   [Page 18] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +       struct layout4 { +           offset4                 lo_offset; +           length4                 lo_length; +           layoutiomode4           lo_iomode; +           layout_content4         lo_content; +       }; + +   <CODE ENDS> + +   This document defines structures associated with the layouttype4 +   value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body structure +   as an XDR type "opaque".  The opaque layout is uninterpreted by the +   generic pNFS client layers but is interpreted by the flexible file +   layout type implementation.  This section defines the structure of +   this otherwise opaque value, ff_layout4. + +5.1.  
ff_layout4 + +   <CODE BEGINS> + +   /// const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001; +   /// const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002; +   /// const FF_FLAGS_NO_READ_IO        = 0x00000004; +   /// const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008; + +   /// typedef uint32_t            ff_flags4; +   /// + +   /// struct ff_data_server4 { +   ///     deviceid4               ffds_deviceid; +   ///     uint32_t                ffds_efficiency; +   ///     stateid4                ffds_stateid; +   ///     nfs_fh4                 ffds_fh_vers<>; +   ///     fattr4_owner            ffds_user; +   ///     fattr4_owner_group      ffds_group; +   /// }; +   /// + +   /// struct ff_mirror4 { +   ///     ff_data_server4         ffm_data_servers<>; +   /// }; +   /// + +   /// struct ff_layout4 { +   ///     length4                 ffl_stripe_unit; +   ///     ff_mirror4              ffl_mirrors<>; +   ///     ff_flags4               ffl_flags; +   ///     uint32_t                ffl_stats_collect_hint; + + + +Halevy & Haynes              Standards Track                   [Page 19] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   /// }; +   /// + +   <CODE ENDS> + +   The ff_layout4 structure specifies a layout in that portion of the +   data file described in the current layout segment.  It is either a +   single instance or a set of mirrored copies of that portion of the +   data file.  When mirroring is in effect, it protects against loss of +   data in layout segments. + +   While not explicitly shown in the above XDR, each layout4 element +   returned in the logr_layout array of LAYOUTGET4res (see +   Section 18.43.2 of [RFC5661]) describes a layout segment.  Hence, +   each ff_layout4 also describes a layout segment.  It is possible that +   the file is concatenated from more than one layout segment.  Each +   layout segment MAY represent different striping parameters. + +   The ffl_stripe_unit field is the stripe unit size in use for the +   current layout segment.  The number of stripes is given inside each +   mirror by the number of elements in ffm_data_servers.  If the number +   of stripes is one, then the value for ffl_stripe_unit MUST default to +   zero.  The only supported mapping scheme is sparse and is detailed in +   Section 6.  Note that there is an assumption here that both the +   stripe unit size and the number of stripes are the same across all +   mirrors. + +   The ffl_mirrors field is the array of mirrored storage devices that +   provide the storage for the current stripe; see Figure 1. + +   The ffl_stats_collect_hint field provides a hint to the client on how +   often the server wants it to report LAYOUTSTATS for a file.  The time +   is in seconds. 
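
   As a quick illustration of the invariants just described, the
   following Python sketch (the names are assumptions that mirror the
   XDR above, not a prescribed API) checks a decoded ff_layout4 for a
   consistent stripe count across mirrors and for the required zero
   stripe unit when only one stripe is present:

      def check_layout(ffl_stripe_unit, ffl_mirrors):
          # ffl_mirrors: list of mirrors, each given here as a list
          # standing in for that mirror's ffm_data_servers array
          widths = {len(m) for m in ffl_mirrors}
          if len(widths) != 1:
              raise ValueError("stripe count differs across mirrors")
          (w,) = widths
          if w == 1 and ffl_stripe_unit != 0:
              raise ValueError("one stripe: ffl_stripe_unit MUST be 0")
          return w                  # number of stripes (W in Section 6)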
+ + + + + + + + + + + + + + + + + + +Halevy & Haynes              Standards Track                   [Page 20] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +                      +-----------+ +                      |           | +                      |           | +                      |   File    | +                      |           | +                      |           | +                      +-----+-----+ +                            | +               +------------+------------+ +               |                         | +          +----+-----+             +-----+----+ +          | Mirror 1 |             | Mirror 2 | +          +----+-----+             +-----+----+ +               |                         | +          +-----------+            +-----------+ +          |+-----------+           |+-----------+ +          ||+-----------+          ||+-----------+ +          +||  Storage  |          +||  Storage  | +           +|  Devices  |           +|  Devices  | +            +-----------+            +-----------+ + +                           Figure 1 + +   The ffs_mirrors field represents an array of state information for +   each mirrored copy of the current layout segment.  Each element is +   described by a ff_mirror4 type. + +   ffds_deviceid provides the deviceid of the storage device holding the +   data file. + +   ffds_fh_vers is an array of filehandles of the data file matching the +   available NFS versions on the given storage device.  There MUST be +   exactly as many elements in ffds_fh_vers as there are in +   ffda_versions.  Each element of the array corresponds to a particular +   combination of ffdv_version, ffdv_minorversion, and +   ffdv_tightly_coupled provided for the device.  The array allows for +   server implementations that have different filehandles for different +   combinations of version, minor version, and coupling strength.  See +   Section 5.4 for how to handle versioning issues between the client +   and storage devices. + +   For tight coupling, ffds_stateid provides the stateid to be used by +   the client to access the file.  For loose coupling and an NFSv4 +   storage device, the client will have to use an anonymous stateid to +   perform I/O on the storage device.  With no control protocol, the +   metadata server stateid cannot be used to provide a global stateid +   model.  Thus, the server MUST set the ffds_stateid to be the +   anonymous stateid. + + + +Halevy & Haynes              Standards Track                   [Page 21] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   This specification of the ffds_stateid restricts both models for +   NFSv4.x storage protocols: + +   loosely coupled model:  the stateid has to be an anonymous stateid + +   tightly coupled model:  the stateid has to be a global stateid + +   A number of issues stem from a mismatch between the fact that +   ffds_stateid is defined as a single item while ffds_fh_vers is +   defined as an array.  It is possible for each open file on the +   storage device to require its own open stateid.  Because there are +   established loosely coupled implementations of the version of the +   protocol described in this document, such potential issues have not +   been addressed here.  It is possible for future layout types to be +   defined that address these issues, should it become important to +   provide multiple stateids for the same underlying file. 
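
   The stateid rule above reduces to a small decision on the client.
   The sketch below is illustrative only (the function name and the
   stateid encoding are assumptions); the anonymous stateid is the
   all-zeros special stateid of NFSv4:

      ANONYMOUS_STATEID = (0, b"\x00" * 12)   # (seqid, other)

      def stateid_for_io(ffds_stateid, ffdv_tightly_coupled):
          if ffdv_tightly_coupled:
              # tight coupling: present the global stateid from the
              # layout to the storage device
              return ffds_stateid
          # loose coupling: I/O to an NFSv4 storage device uses the
          # anonymous stateid regardless of what the layout carried
          return ANONYMOUS_STATEID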
+
+   For loosely coupled storage devices, ffds_user and ffds_group provide
+   the synthetic user and group to be used in the RPC credentials that
+   the client presents to the storage device to access the data files.
+   For tightly coupled storage devices, the user and group on the
+   storage device will be the same as on the metadata server; that is,
+   if ffdv_tightly_coupled (see Section 4.1) is set, then the client
+   MUST ignore both ffds_user and ffds_group.
+
+   The allowed values for both ffds_user and ffds_group are specified as
+   owner and owner_group, respectively, in Section 5.9 of [RFC5661].
+   For NFSv3 compatibility, user and group strings that consist of
+   decimal numeric values with no leading zeros can be given a special
+   interpretation by clients and servers that choose to provide such
+   support.  The receiver may treat such a user or group string as
+   representing the same user as would be represented by an NFSv3 uid or
+   gid having the corresponding numeric value.  Note that if Kerberos is
+   used for security, the expectation is that these values will be a
+   name@domain string.
+
+   ffds_efficiency describes the metadata server's evaluation of the
+   effectiveness of each mirror.  Note that this is per layout and not
+   per device, as the metric may change due to perceived load,
+   availability to the metadata server, etc.  Higher values denote
+   higher perceived utility.  The way the client can select the best
+   mirror to access is discussed in Section 8.1.
+
+   ffl_flags is a bitmap that allows the metadata server to inform the
+   client of particular conditions that may result from more or less
+   tight coupling of the storage devices.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 22]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   FF_FLAGS_NO_LAYOUTCOMMIT:  can be set to indicate that the client is
+      not required to send LAYOUTCOMMIT to the metadata server.
+
+   FF_FLAGS_NO_IO_THRU_MDS:  can be set to indicate that the client
+      should not send I/O operations to the metadata server.  That is,
+      even if the client could determine that there was a network
+      disconnect to a storage device, the client should not try to proxy
+      the I/O through the metadata server.
+
+   FF_FLAGS_NO_READ_IO:  can be set to indicate that the client should
+      not send READ requests with the layouts of iomode
+      LAYOUTIOMODE4_RW.  Instead, it should request a layout of iomode
+      LAYOUTIOMODE4_READ from the metadata server.
+
+   FF_FLAGS_WRITE_ONE_MIRROR:  can be set to indicate that the client
+      only needs to update one of the mirrors (see Section 8.2).
+
+5.1.1.  Error Codes from LAYOUTGET
+
+   [RFC5661] provides little guidance as to how the client is to proceed
+   with a LAYOUTGET that returns an error of NFS4ERR_LAYOUTTRYLATER,
+   NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.  Within the context of
+   this document:
+
+   NFS4ERR_LAYOUTUNAVAILABLE:  there is no layout available and the I/O
+      is to go to the metadata server.  Note that it is possible to have
+      had a layout before a recall and not after.
+
+   NFS4ERR_LAYOUTTRYLATER:  there is some issue preventing the layout
+      from being granted.  If the client already has an appropriate
+      layout, it should continue with I/O to the storage devices.
+
+   NFS4ERR_DELAY:  there is some issue preventing the layout from being
+      granted.  If the client already has an appropriate layout, it
+      should not continue with I/O to the storage devices.
+
+5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS
+
+   Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS
+   flag, the client can still perform I/O to the metadata server.  The
+   flag functions as a hint.  The flag indicates to the client that the
+   metadata server prefers to separate the metadata I/O from the data
+   I/O, most likely for performance reasons.
+
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 23]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+5.2.  LAYOUTCOMMIT
+
+   The flexible file layout does not use lou_body inside the
+   loca_layoutupdate argument to LAYOUTCOMMIT.  If lou_type is
+   LAYOUT4_FLEX_FILES, the lou_body field MUST have a zero length (see
+   Section 18.42.1 of [RFC5661]).
+
+5.3.  Interactions between Devices and Layouts
+
+   In [RFC5661], the file layout type is defined such that the
+   relationship between multipathing and filehandles can result in
+   either 0, 1, or N filehandles (see Section 13.3).  Some rationales
+   for this are clustered servers that share the same filehandle or
+   allow for multiple read-only copies of the file on the same storage
+   device.  In the flexible file layout type, while there is an array of
+   filehandles, they are independent of the multipathing being used.  If
+   the metadata server wants to provide multiple read-only copies of the
+   same file on the same storage device, then it should provide multiple
+   mirrored instances, each with a different ff_device_addr4.  The
+   client can then determine that, since each of the ffds_fh_vers values
+   is different, there are multiple copies of the file available for the
+   current layout segment.
+
+5.4.  Handling Version Errors
+
+   When the metadata server provides the ffda_versions array in the
+   ff_device_addr4 (see Section 4.1), the client is able to determine
+   whether or not it can access a storage device with any of the
+   supplied combinations of ffdv_version, ffdv_minorversion, and
+   ffdv_tightly_coupled.  However, due to the limitations of reporting
+   errors in GETDEVICEINFO (see Section 18.40 in [RFC5661]), the client
+   is not able to specify which specific device it cannot communicate
+   with over one of the provided ffdv_version and ffdv_minorversion
+   combinations.  Using ff_ioerr4 (see Section 9.1.1) inside either the
+   LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see
+   Section 15.6 of [RFC7862] and Section 10 of this document), the
+   client can isolate the problematic storage device.
+
+   The error code to return for LAYOUTRETURN and/or LAYOUTERROR is
+   NFS4ERR_MINOR_VERS_MISMATCH.  It does not matter whether the mismatch
+   is a major version (e.g., client can use NFSv3 but not NFSv4) or a
+   minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the
+   error indicates that for all the supplied combinations of
+   ffdv_version and ffdv_minorversion, the client cannot communicate
+   with the storage device.  The client can retry the GETDEVICEINFO to
+   see if the metadata server can provide a different combination, or it
+   can fall back to doing the I/O through the metadata server.
+
+
+
+Halevy & Haynes              Standards Track                   [Page 24]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+6.  Striping via Sparse Mapping
+
+   While other layout types support both dense and sparse mapping of
+   logical offsets to physical offsets within a file (see, for example,
+   Section 13.4 of [RFC5661]), the flexible file layout type only
+   supports a sparse mapping.
+
+   With sparse mappings, the logical offset within a file (L) is also
+   the physical offset on the storage device.  As detailed in
+   Section 13.4.4 of [RFC5661], this results in holes on each storage
+   device at the offsets whose stripe units are held by the other
+   storage devices.
+
+   L: logical offset within the file
+
+   W: stripe width
+       W = number of elements in ffm_data_servers
+
+   S: number of bytes in a stripe
+       S = W * ffl_stripe_unit
+
+   N: stripe number
+       N = L / S
+
+   A short, non-normative sketch that applies these definitions in the
+   presence of mirroring appears in Section 8.
+
+7.  Recovering from Client I/O Errors
+
+   The pNFS client may encounter errors when directly accessing the
+   storage devices.  However, it is the responsibility of the metadata
+   server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
+   layout type is used, the client MUST report the I/O errors to the
+   server at LAYOUTRETURN time using the ff_ioerr4 structure (see
+   Section 9.1.1).
+
+   The metadata server analyzes the error and determines the required
+   recovery operations, such as recovering from media failures or
+   reconstructing missing data files.
+
+   The metadata server MUST recall any outstanding layouts to allow it
+   exclusive write access to the stripes being recovered and to prevent
+   other clients from hitting the same error condition.  In these cases,
+   the server MUST complete recovery before handing out any new layouts
+   to the affected byte ranges.
+
+   Although the client implementation has the option to propagate a
+   corresponding error to the application that initiated the I/O
+   operation and drop any unwritten data, the client should attempt to
+   retry the original I/O operation by either requesting a new layout or
+   sending the I/O via regular NFSv4.1+ READ or WRITE operations to the
+   metadata server.  The client SHOULD attempt to retrieve a new layout
+
+
+
+Halevy & Haynes              Standards Track                   [Page 25]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   and retry the I/O operation using the storage device first and only
+   retry the I/O operation via the metadata server if the error
+   persists.
+
+8.  Mirroring
+
+   The flexible file layout type has a simple model in place for the
+   mirroring of the file data constrained by a layout segment.  There is
+   no assumption that each copy of the mirror is stored identically on
+   the storage devices.  For example, one device might employ
+   compression or deduplication on the data.  However, the over-the-wire
+   transfer of the file contents MUST appear identical.  Note that this
+   is a constraint of the selected XDR representation, in which each
+   mirrored copy of the layout segment has the same striping pattern
+   (see Figure 1).
+
+   The metadata server is responsible for determining the number of
+   mirrored copies and the location of each mirror.  While the client
+   may provide a hint as to how many copies it wants (see Section 12),
+   the metadata server can ignore that hint; in any event, the client
+   has no means to dictate either the storage device (which also means
+   the coupling and/or protocol levels to access the layout segments) or
+   the location of said storage device.
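+
+   To make the sparse mapping of Section 6 concrete under this
+   mirroring model, the following non-normative Python sketch computes,
+   for a logical offset L, the stripe number N and the per-device
+   offset a client would use.  The parameter names track the XDR
+   fields, but the function itself is an assumption made for
+   illustration only; because every mirrored copy shares the same
+   striping pattern, the same result applies to each mirror.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch of the sparse mapping (Section 6).
+
+   def map_offset(L, ffl_stripe_unit, num_data_servers):
+       """Return (N, data server index, physical offset).
+
+       num_data_servers is len(ffm_data_servers), i.e., the
+       stripe width W.
+       """
+       W = num_data_servers
+       S = W * ffl_stripe_unit          # bytes in a full stripe
+       N = L // S                       # stripe number
+       ds = (L // ffl_stripe_unit) % W  # data server in the stripe
+       # Sparse mapping: the physical offset on the storage
+       # device equals the logical offset within the file.
+       return N, ds, L
+
+   # Example: 1 MB stripe units across 3 data servers; offset 5 MB
+   # lands in stripe 1 on the third data server (index 2).
+   print(map_offset(5 * 2**20, 2**20, 3))  # -> (1, 2, 5242880)
+
+   <CODE ENDS>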
+
+   The updating of mirrored layout segments is done via client-side
+   mirroring.  With this approach, the client is responsible for making
+   sure modifications are made on all copies of the layout segments it
+   is informed of via the layout.  If a layout segment is being
+   resilvered to a storage device, that mirrored copy will not be in the
+   layout.  Thus, the metadata server MUST update that copy until the
+   copy is presented to the client in a layout.  If the
+   FF_FLAGS_WRITE_ONE_MIRROR flag is set in ffl_flags, the client need
+   only update one of the mirrors (see Section 8.2).  If the client is
+   writing to the layout segments via the metadata server, then the
+   metadata server MUST update all copies of the mirror.  As seen in
+   Section 8.3, during the resilvering, the layout is recalled, and the
+   client has to make modifications via the metadata server.
+
+8.1.  Selecting a Mirror
+
+   When the metadata server grants a layout to a client, it MAY let the
+   client know, via the ffds_efficiency member, how fast it expects each
+   mirror to be once the request arrives at the storage devices.  While
+   the algorithms to calculate that value are left to the metadata
+   server implementations, factors that could contribute to that
+   calculation include speed of the storage device, physical memory
+   available to the device, operating system version, current load, etc.
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 26]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   However, what should not be involved in that calculation is a
+   perceived network distance between the client and the storage device.
+   The client is better situated for making that determination based on
+   past interaction with the storage device over the different available
+   network interfaces between the two; that is, the metadata server
+   might not know about a transient outage between the client and
+   storage device because it has no presence on the given subnet.
+
+   As such, it is the client that decides which mirror to access for
+   reading the file.  The requirements for writing to mirrored layout
+   segments are presented below.
+
+8.2.  Writing to Mirrors
+
+8.2.1.  Single Storage Device Updates Mirrors
+
+   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client
+   only needs to update one of the copies of the layout segment.  For
+   this case, the storage device MUST ensure that all copies of the
+   mirror are updated when any one of the mirrors is updated.  If the
+   storage device gets an error when updating one of the mirrors, then
+   it MUST inform the client that the original WRITE had an error.  The
+   client then MUST inform the metadata server (see Section 8.2.3).  The
+   client's responsibility with respect to COMMIT is explained in
+   Section 8.2.4.  The client may choose any one of the mirrors and may
+   use ffds_efficiency as described in Section 8.1 when making this
+   choice.
+
+8.2.2.  Client Updates All Mirrors
+
+   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the
+   client is responsible for updating all mirrored copies of the layout
+   segments that it is given in the layout.  A single failed update is
+   sufficient to fail the entire operation.  If all but one copy is
+   updated successfully and the last one provides an error, then the
+   client needs to inform the metadata server about the error.  The
+   client can use either LAYOUTRETURN or LAYOUTERROR to inform the
+   metadata server that the update to that storage device failed.  If
+   the client is updating the mirrors serially, then it SHOULD stop at
+   the first error encountered and report that to the metadata server.
+   If the client is updating the mirrors in parallel, then it SHOULD
+   wait until all storage devices respond so that it can report all
+   errors encountered during the update.
+
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 27]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+8.2.3.  Handling Write Errors
+
+   When the client reports a write error to the metadata server, the
+   metadata server is responsible for determining if it wants to remove
+   the errant mirror from the layout, if the mirror has recovered from
+   some transient error, etc.  When the client tries to get a new
+   layout, the metadata server informs it of the decision by the
+   contents of the layout.  The client MUST NOT assume that the contents
+   of the previous layout will match those of the new one.  If it has
+   updates that were not committed to all mirrors, then it MUST resend
+   those updates to all mirrors.
+
+   There is no provision in the protocol for the metadata server to
+   directly determine that the client has or has not recovered from an
+   error.  For example, if a storage device was network partitioned from
+   the client and the client reported the error to the metadata server,
+   the network partition might later be repaired, and the client could
+   then successfully update all of the copies.  There is no mechanism
+   for the client to report that fact, and the metadata server is forced
+   to repair the file across the mirror.
+
+   If the client supports NFSv4.2, it can use LAYOUTERROR and
+   LAYOUTRETURN to provide hints to the metadata server about the
+   recovery efforts.  A LAYOUTERROR on a file is for a non-fatal error.
+   A subsequent LAYOUTRETURN without an ff_ioerr4 indicates that the
+   client successfully replayed the I/O to all mirrors.  Any
+   LAYOUTRETURN with an ff_ioerr4 is an error that the metadata server
+   needs to repair.  The client MUST be prepared for the LAYOUTERROR to
+   trigger a CB_LAYOUTRECALL if the metadata server determines it needs
+   to start repairing the file.
+
+8.2.4.  Handling Write COMMITs
+
+   When stable writes are done to the metadata server or to a single
+   replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is
+   the responsibility of the receiving node to propagate the written
+   data stably before replying to the client.
+
+   In the corresponding cases in which unstable writes are done, the
+   receiving node does not have any such obligation, although it may
+   choose to asynchronously propagate the updates.  However, once a
+   COMMIT is replied to, all replicas must reflect the writes that have
+   been done, and this data must have been committed to stable storage
+   on all replicas.
+
+
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 28]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   In order to avoid situations in which stale data is read from
+   replicas to which writes have not been propagated:
+
+   o  A client that has outstanding unstable writes made to a single
+      node (metadata server or storage device) MUST do all reads from
+      that same node.
+
+   o  When writes are flushed to the server (for example, to implement
+      close-to-open semantics), a COMMIT must be done by the client to
+      ensure that up-to-date written data will be available irrespective
+      of the particular replica read.
+
+8.3.  Metadata Server Resilvering of the File
+
+   The metadata server may elect to create a new mirror of the layout
+   segments at any time.  This might be to resilver a copy on a storage
+   device that was down for servicing, to provide a copy of the layout
+   segments on storage with different performance characteristics, etc.
+   As the client will not be aware of the new mirror and the metadata
+   server will not be aware of updates that the client is making to the
+   layout segments, the metadata server MUST recall the writable layout
+   segment(s) that it is resilvering.  If the client issues a LAYOUTGET
+   for a writable layout segment that is in the process of being
+   resilvered, then the metadata server can deny that request with
+   NFS4ERR_LAYOUTUNAVAILABLE.  The client would then have to perform the
+   I/O through the metadata server.
+
+9.  Flexible File Layout Type Return
+
+   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
+   layout-type-specific information to the server.  It is defined in
+   Section 18.44.1 of [RFC5661] as follows:
+
+   <CODE BEGINS>
+
+   /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */
+   const LAYOUT4_RET_REC_FILE      = 1;
+   const LAYOUT4_RET_REC_FSID      = 2;
+   const LAYOUT4_RET_REC_ALL       = 3;
+
+   enum layoutreturn_type4 {
+           LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE,
+           LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID,
+           LAYOUTRETURN4_ALL  = LAYOUT4_RET_REC_ALL
+   };
+
+   struct layoutreturn_file4 {
+           offset4         lrf_offset;
+
+
+
+Halevy & Haynes              Standards Track                   [Page 29]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+           length4         lrf_length;
+           stateid4        lrf_stateid;
+           /* layouttype4 specific data */
+           opaque          lrf_body<>;
+   };
+
+   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
+           case LAYOUTRETURN4_FILE:
+                   layoutreturn_file4      lr_layout;
+           default:
+                   void;
+   };
+
+   struct LAYOUTRETURN4args {
+           /* CURRENT_FH: file */
+           bool                    lora_reclaim;
+           layouttype4             lora_layout_type;
+           layoutiomode4           lora_iomode;
+           layoutreturn4           lora_layoutreturn;
+   };
+
+   <CODE ENDS>
+
+   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the
+   lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value
+   is defined by ff_layoutreturn4 (see Section 9.3).  This allows the
+   client to report I/O error information or layout usage statistics
+   back to the metadata server as defined below.  Note that while the
+   data structures are built on concepts introduced in NFSv4.2, the
+   effective discriminated union (lora_layout_type combined with
+   ff_layoutreturn4) allows for an NFSv4.1 metadata server to utilize
+   the data.
+
+9.1.  I/O Error Reporting
+
+9.1.1.  ff_ioerr4
+
+   <CODE BEGINS>
+
+   /// struct ff_ioerr4 {
+   ///         offset4        ffie_offset;
+   ///         length4        ffie_length;
+   ///         stateid4       ffie_stateid;
+   ///         device_error4  ffie_errors<>;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+Halevy & Haynes              Standards Track                   [Page 30]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Recall that [RFC7862] defines device_error4 as:
+
+   <CODE BEGINS>
+
+   struct device_error4 {
+           deviceid4       de_deviceid;
+           nfsstat4        de_status;
+           nfs_opnum4      de_opnum;
+   };
+
+   <CODE ENDS>
+
+   The ff_ioerr4 structure is used to return error indications for data
+   files that generated errors during data transfers.  These are hints
+   to the metadata server that there are problems with that file.  For
+   each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length
+   represent the storage device and byte range within the file in which
+   the error occurred; ffie_errors represents the operation and type of
+   error.  The use of device_error4 is described in Section 15.6 of
+   [RFC7862].
+
+   Even though the storage device might be accessed via NFSv3 and
+   report back NFSv3 errors to the client, the client is responsible
+   for mapping these to appropriate NFSv4 status codes as de_status.
+   Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4
+   operations.
+
+9.2.  Layout Usage Statistics
+
+9.2.1.  ff_io_latency4
+
+   <CODE BEGINS>
+
+   /// struct ff_io_latency4 {
+   ///         uint64_t       ffil_ops_requested;
+   ///         uint64_t       ffil_bytes_requested;
+   ///         uint64_t       ffil_ops_completed;
+   ///         uint64_t       ffil_bytes_completed;
+   ///         uint64_t       ffil_bytes_not_delivered;
+   ///         nfstime4       ffil_total_busy_time;
+   ///         nfstime4       ffil_aggregate_completion_time;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 31]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Both operation counts and bytes transferred are kept in the
+   ff_io_latency4.  As seen in ff_layoutupdate4 (see Section 9.2.2),
+   READ and WRITE operations are aggregated separately.  READ operations
+   are used for the ff_io_latency4 ffl_read.  Both WRITE and COMMIT
+   operations are used for the ff_io_latency4 ffl_write.  "Requested"
+   counters track what the client is attempting to do, and "completed"
+   counters track what was done.  There is no requirement that the
+   client only report completed results that have matching requested
+   results from the reported period.
+
+   ffil_bytes_not_delivered is used to track the aggregate number of
+   bytes requested but not fulfilled due to error conditions.
+   ffil_total_busy_time is the aggregate time spent with outstanding RPC
+   calls.  ffil_aggregate_completion_time is the sum of all round-trip
+   times for completed RPC calls.
+
+   In Section 3.3.1 of [RFC5661], nfstime4 is defined as the number of
+   seconds and nanoseconds since midnight or zero hour January 1, 1970
+   Coordinated Universal Time (UTC).  The use of nfstime4 in
+   ff_io_latency4 is to store time since the start of the first I/O from
+   the client after receiving the layout.  In other words, these are to
+   be decoded as a duration and not as a date and time.
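+
+   As an illustration of decoding these values as durations, the
+   following non-normative Python sketch derives the mean round-trip
+   time per completed RPC from an ff_io_latency4 sample.  The
+   dictionary encoding of the XDR fields and the (seconds, nseconds)
+   tuple form of nfstime4 are assumptions made for illustration only.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch: interpreting ff_io_latency4 counters.
+
+   def to_seconds(nfstime4):
+       """Decode an nfstime4 carried as (seconds, nseconds)."""
+       seconds, nseconds = nfstime4
+       return seconds + nseconds / 1e9
+
+   def average_rtt(latency):
+       """Mean round-trip time per completed RPC, in seconds."""
+       ops = latency["ffil_ops_completed"]
+       if ops == 0:
+           return 0.0
+       total = to_seconds(latency["ffil_aggregate_completion_time"])
+       return total / ops
+
+   sample = {
+       "ffil_ops_completed": 2000,
+       "ffil_aggregate_completion_time": (6, 500000000),  # 6.5 s
+   }
+   print(average_rtt(sample))  # -> 0.00325 seconds per RPC
+
+   <CODE ENDS>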
+
+   Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
+   operation is sent.  If two LAYOUTSTATS operations for the same file
+   and layout stateid originate from the same NFS client and are
+   processed at the same time by the metadata server, then the one
+   containing the larger values contains the most recent time series
+   data.
+
+9.2.2.  ff_layoutupdate4
+
+   <CODE BEGINS>
+
+   /// struct ff_layoutupdate4 {
+   ///         netaddr4       ffl_addr;
+   ///         nfs_fh4        ffl_fhandle;
+   ///         ff_io_latency4 ffl_read;
+   ///         ff_io_latency4 ffl_write;
+   ///         nfstime4       ffl_duration;
+   ///         bool           ffl_local;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 32]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   ffl_addr identifies the network address on the storage device to
+   which the client is connected.  In the case of multipathing,
+   ffl_fhandle indicates which read-only copy was selected.  ffl_read
+   and ffl_write convey the latencies for READ and WRITE operations,
+   respectively.  ffl_duration is used to indicate the time period over
+   which the statistics were collected.  If true, ffl_local indicates
+   that the I/O was serviced by the client's cache.  This flag allows
+   the client to inform the metadata server about "hot" access to a file
+   it would not normally be allowed to report on.
+
+9.2.3.  ff_iostats4
+
+   <CODE BEGINS>
+
+   /// struct ff_iostats4 {
+   ///         offset4           ffis_offset;
+   ///         length4           ffis_length;
+   ///         stateid4          ffis_stateid;
+   ///         io_info4          ffis_read;
+   ///         io_info4          ffis_write;
+   ///         deviceid4         ffis_deviceid;
+   ///         ff_layoutupdate4  ffis_layoutupdate;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   [RFC7862] defines io_info4 as:
+
+   <CODE BEGINS>
+
+   struct io_info4 {
+           uint64_t        ii_count;
+           uint64_t        ii_bytes;
+   };
+
+   <CODE ENDS>
+
+   With pNFS, data transfers are performed directly between the pNFS
+   client and the storage devices.  Therefore, the metadata server has
+   no direct knowledge of the I/O operations being done and thus cannot
+   gather on its own the statistical information about client I/O needed
+   to optimize the data storage location.  ff_iostats4 MAY be used by
+   the client to report I/O statistics back to the metadata server upon
+   returning the layout.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 33]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   Since it is not feasible for the client to report every I/O that used
+   the layout, the client MAY identify "hot" byte ranges for which to
+   report I/O statistics.  The definition and/or configuration mechanism
+   of what is considered "hot" and the size of the reported byte range
+   are out of the scope of this document.  For client implementations,
+   it is suggested to provide reasonable default values and an optional
+   run-time management interface to control these parameters.  For
+   example, a client can define the default byte-range resolution to be
+   1 MB in size and the thresholds for reporting to be 1 MB/second or 10
+   I/O operations per second.
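+
+   One plausible client-side policy following that example is sketched
+   below in non-normative Python: I/O is accumulated into fixed-size
+   buckets, and only buckets exceeding a threshold are reported.  The
+   helper names and data structures are assumptions made for
+   illustration, not part of the protocol.
+
+   <CODE BEGINS>
+
+   # Non-normative sketch: identifying "hot" 1 MB byte ranges to
+   # report in ff_iostats4, using the example thresholds of
+   # 1 MB/second or 10 I/O operations per second.
+
+   from collections import defaultdict
+
+   BUCKET = 1024 * 1024             # byte-range resolution: 1 MB
+   HOT_BYTES_PER_SEC = 1024 * 1024
+   HOT_OPS_PER_SEC = 10
+
+   buckets = defaultdict(lambda: [0, 0])   # index -> [ops, bytes]
+
+   def record_io(offset, length):
+       """Accumulate one READ or WRITE into its 1 MB bucket."""
+       b = buckets[offset // BUCKET]
+       b[0] += 1
+       b[1] += length
+
+   def hot_ranges(elapsed):
+       """Yield (ffis_offset, ffis_length) pairs worth reporting."""
+       for index, (ops, nbytes) in buckets.items():
+           if (nbytes / elapsed >= HOT_BYTES_PER_SEC or
+                   ops / elapsed >= HOT_OPS_PER_SEC):
+               yield index * BUCKET, BUCKET
+
+   <CODE ENDS>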
+
+   For each byte range, ffis_offset and ffis_length represent the
+   starting offset of the range and the range length in bytes.
+   ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and
+   ffis_write.ii_bytes represent the number of contiguous READ and WRITE
+   I/Os and the respective aggregate number of bytes transferred within
+   the reported byte range.
+
+   The combination of ffis_deviceid and ffl_addr uniquely identifies
+   both the storage path and the network route to it.  Finally,
+   ffl_fhandle allows the metadata server to differentiate between
+   multiple read-only copies of the file on the same storage device.
+
+9.3.  ff_layoutreturn4
+
+   <CODE BEGINS>
+
+   /// struct ff_layoutreturn4 {
+   ///         ff_ioerr4     fflr_ioerr_report<>;
+   ///         ff_iostats4   fflr_iostats_report<>;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   When data file I/O operations fail, fflr_ioerr_report<> is used to
+   report these errors to the metadata server as an array of elements of
+   type ff_ioerr4.  Each element in the array represents an error that
+   occurred on the data file identified by ffie_errors.de_deviceid.  If
+   no errors are to be reported, the size of the fflr_ioerr_report<>
+   array is set to zero.  The client MAY also use fflr_iostats_report<>
+   to report a list of I/O statistics as an array of elements of type
+   ff_iostats4.  Each element in the array represents statistics for a
+   particular byte range.  Byte ranges are not guaranteed to be disjoint
+   and MAY repeat or intersect.
+
+
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 34]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+10.  Flexible File Layout Type LAYOUTERROR
+
+   If the client is using NFSv4.2 to communicate with the metadata
+   server, then instead of waiting for a LAYOUTRETURN to send error
+   information to the metadata server (see Section 9.1), it MAY use
+   LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that
+   information.  For the flexible file layout type, this means that
+   LAYOUTERROR4args is treated the same as ff_ioerr4.
+
+11.  Flexible File Layout Type LAYOUTSTATS
+
+   If the client is using NFSv4.2 to communicate with the metadata
+   server, then instead of waiting for a LAYOUTRETURN to send I/O
+   statistics to the metadata server (see Section 9.2), it MAY use
+   LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that
+   information.  For the flexible file layout type, this means that
+   LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
+   contents as in ffis_layoutupdate.
+
+12.  Flexible File Layout Type Creation Hint
+
+   The layouthint4 type is defined in [RFC5661] as follows:
+
+   <CODE BEGINS>
+
+   struct layouthint4 {
+       layouttype4        loh_type;
+       opaque             loh_body<>;
+   };
+
+   <CODE ENDS>
+
+   The layouthint4 structure is used by the client to pass a hint about
+   the type of layout it would like created for a particular file.  If
+   the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
+   opaque value is defined by the ff_layouthint4 type.
+
+12.1.  ff_layouthint4
+
+   <CODE BEGINS>
+
+   /// union ff_mirrors_hint switch (bool ffmc_valid) {
+   ///     case TRUE:
+   ///         uint32_t    ffmc_mirrors;
+   ///     case FALSE:
+   ///         void;
+   /// };
+   ///
+
+
+
+Halevy & Haynes              Standards Track                   [Page 35]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   /// struct ff_layouthint4 {
+   ///     ff_mirrors_hint    fflh_mirrors_hint;
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   This type conveys hints for the desired data map.  All parameters are
+   optional, so the client can give values for only the parameters it
+   cares about.
+
+13.  Recalling a Layout
+
+   While Section 12.5.5 of [RFC5661] discusses reasons independent of
+   layout type for recalling a layout, the flexible file layout type
+   metadata server should recall outstanding layouts in the following
+   cases:
+
+   o  When the file's security policy changes, i.e., ACLs or permission
+      mode bits are set.
+
+   o  When the file's layout changes, rendering outstanding layouts
+      invalid.
+
+   o  When existing layouts are inconsistent with the need to enforce
+      locking constraints.
+
+   o  When existing layouts are inconsistent with the requirements
+      regarding resilvering as described in Section 8.3.
+
+13.1.  CB_RECALL_ANY
+
+   The metadata server can use the CB_RECALL_ANY callback operation to
+   notify the client to return some or all of its layouts.  Section 22.3
+   of [RFC5661] defines the allowed types of the "NFSv4 Recallable
+   Object Types Registry".
+
+   <CODE BEGINS>
+
+   /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = 16;
+   /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = 17;
+   ///
+
+   struct  CB_RECALL_ANY4args      {
+       uint32_t        craa_layouts_to_keep;
+       bitmap4         craa_type_mask;
+   };
+
+
+
+
+Halevy & Haynes              Standards Track                   [Page 36]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   <CODE ENDS>
+
+   Typically, CB_RECALL_ANY will be used to recall client state when the
+   server needs to reclaim resources.  The craa_type_mask bitmap
+   specifies the type of resources that are recalled, and the
+   craa_layouts_to_keep value specifies how many of the recalled
+   flexible file layouts the client is allowed to keep.  The mask flags
+   for the flexible file layout type are defined as follows:
+
+   <CODE BEGINS>
+
+   /// enum ff_cb_recall_any_mask {
+   ///     PNFS_FF_RCA4_TYPE_MASK_READ = 16,
+   ///     PNFS_FF_RCA4_TYPE_MASK_RW   = 17
+   /// };
+   ///
+
+   <CODE ENDS>
+
+   The flags represent the iomode of the recalled layouts.  In response,
+   the client SHOULD return layouts of the recalled iomode that it needs
+   the least, keeping at most craa_layouts_to_keep flexible file
+   layouts.
+
+   The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return
+   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
+   PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts
+   of iomode LAYOUTIOMODE4_RW.  When both mask flags are set, the client
+   is notified to return layouts of either iomode.
+
+14.  Client Fencing
+
+   In cases where clients are uncommunicative and their lease has
+   expired or when clients fail to return recalled layouts within a
+   lease period, the server MAY revoke client layouts and reassign these
+   resources to other clients (see Section 12.5.5 of [RFC5661]).  To
+   avoid data corruption, the metadata server MUST fence off the revoked
+   clients from the respective data files as described in Section 2.2.
+
+15.  Security Considerations
+
+   The combination of components in a pNFS system is required to
+   preserve the security properties of NFSv4.1+ with respect to an
+   entity accessing data via a client.  The pNFS feature partitions the
+   NFSv4.1+ file system protocol into two parts: the control protocol
+   and the data protocol.  As the control protocol in this document is
+   NFS, the security properties are equivalent to the version of NFS
+   being used.  The flexible file layout further divides the data
+
+
+
+Halevy & Haynes              Standards Track                   [Page 37]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   protocol into metadata and data paths.  The security properties of
+   the metadata path are equivalent to those of NFSv4.1+ (see Sections
+   1.7.1 and 2.2.1 of [RFC5661]).  The security properties of the data
+   path are equivalent to those of the version of NFS used to access the
+   storage device, with the provision that the metadata server is
+   responsible for authenticating client access to the data file.  The
+   metadata server provides appropriate credentials to the client to
+   access data files on the storage device.  It is also responsible for
+   revoking access for a client to the storage device.
+
+   The metadata server enforces the file access control policy at
+   LAYOUTGET time.  The client should use RPC authorization credentials
+   for getting the layout for the requested iomode (LAYOUTIOMODE4_READ
+   or LAYOUTIOMODE4_RW), and the server verifies the permissions and ACL
+   for these credentials, possibly returning NFS4ERR_ACCESS if the
+   client is not allowed the requested iomode.  If the LAYOUTGET
+   operation succeeds, the client receives, as part of the layout, a set
+   of credentials allowing it I/O access to the specified data files
+   corresponding to the requested iomode.  When the client acts on I/O
+   operations on behalf of its local users, it MUST authenticate and
+   authorize the user by issuing respective OPEN and ACCESS calls to the
+   metadata server, similar to having NFSv4 data delegations.
+
+   The combination of filehandle, synthetic uid, and gid in the layout
+   is the way that the metadata server enforces access control to the
+   data server.  The client only has access to filehandles of file
+   objects and not directory objects.  Thus, given a filehandle in a
+   layout, it is not possible to guess the parent directory filehandle.
+   Further, as the data file permissions only allow the given synthetic
+   uid read/write permission and the given synthetic gid read
+   permission, knowing the synthetic ids of one file does not
+   necessarily allow access to any other data file on the storage
+   device.
+
+   The metadata server can also deny access at any time by fencing the
+   data file, which means changing the synthetic ids.  In turn, that
+   forces the client to return its current layout and get a new layout
+   if it wants to continue I/O to the data file.
+
+   If access is allowed, the client uses the corresponding (read-only or
+   read/write) credentials to perform the I/O operations at the data
+   file's storage devices.  When the metadata server receives a request
+   to change a file's permissions or ACL, it SHOULD recall all layouts
+   for that file and then MUST fence off any clients still holding
+   outstanding layouts for the respective files by implicitly
+   invalidating the previously distributed credential on all data files
+   comprising the file in question.  It is REQUIRED that this be done
+   before committing to the new permissions and/or ACL.  By requesting
+
+
+
+Halevy & Haynes              Standards Track                   [Page 38]
+
+RFC 8435                pNFS Flexible File Layout            August 2018
+
+
+   new layouts, the clients will reauthorize access against the modified
+   access control metadata.  Recalling the layouts in this case is
+   intended to prevent clients from getting an error on I/Os done after
+   the client was fenced off.
+
+15.1.  RPCSEC_GSS and Security Services
+
+   Because of the special use of principals within the loosely coupled
+   model, the issues are different depending on the coupling model.
+
+15.1.1.  Loosely Coupled
+
+   RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] contains facilities
+   that would allow it to be used to authorize the client to the storage
+   device on behalf of the metadata server.  Doing so would require that
+   the metadata server, storage device, and client each implement
+   RPCSEC_GSSv3 using an RPC-application-defined structured privilege
+   assertion in a manner described in Section 4.9.1 of [RFC7862].  The
+   specifics necessary to do so are not described in this document.
+   This is principally because any such specification would require
+   extensive implementation work on a wide range of storage devices,
+   which would be unlikely to result in a widely usable specification
+   for a considerable time.
+
+   As a result, the layout type described in this document will not
+   provide support for use of RPCSEC_GSS together with the loosely
+   coupled model.  However, future layout types could be specified that
+   allow such support, either through the use of RPCSEC_GSSv3 or in
+   other ways.
+
+15.1.2.  Tightly Coupled
+
+   With tight coupling, the principal used to access the metadata file
+   is exactly the same as used to access the data file.  The storage
+   device can use the control protocol to validate any RPC credentials.
+   As a result, there are no security issues related to using RPCSEC_GSS
+   with a tightly coupled system.  For example, if Kerberos V5 Generic
+   Security Service Application Program Interface (GSS-API) [RFC4121] is
+   used as the security mechanism, then the storage device could use a
+   control protocol to validate the RPC credentials to the metadata
+   server.
+
+16.  IANA Considerations
+
+   [RFC5661] introduced the "pNFS Layout Types Registry"; new layout
+   type numbers in this registry need to be assigned by IANA.  This
+   document defines the protocol associated with an existing layout type
+   number: LAYOUT4_FLEX_FILES.  See Table 1.
+ + + +Halevy & Haynes              Standards Track                   [Page 39] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   +--------------------+------------+----------+-----+----------------+ +   | Layout Type Name   | Value      | RFC      | How | Minor Versions | +   +--------------------+------------+----------+-----+----------------+ +   | LAYOUT4_FLEX_FILES | 0x00000004 | RFC 8435 | L   | 1              | +   +--------------------+------------+----------+-----+----------------+ + +                     Table 1: Layout Type Assignments + +   [RFC5661] also introduced the "NFSv4 Recallable Object Types +   Registry".  This document defines new recallable objects for +   RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see +   Table 2). + +   +------------------------------+-------+--------+-----+-------------+ +   | Recallable Object Type Name  | Value | RFC    | How | Minor       | +   |                              |       |        |     | Versions    | +   +------------------------------+-------+--------+-----+-------------+ +   | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16    | RFC    | L   | 1           | +   |                              |       | 8435   |     |             | +   | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17    | RFC    | L   | 1           | +   |                              |       | 8435   |     |             | +   +------------------------------+-------+--------+-----+-------------+ + +                Table 2: Recallable Object Type Assignments + +17.  References + +17.1.  Normative References + +   [LEGAL]    IETF Trust, "Trust Legal Provisions (TLP)", +              <https://trustee.ietf.org/trust-legal-provisions.html>. + +   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS +              Version 3 Protocol Specification", RFC 1813, +              DOI 10.17487/RFC1813, June 1995, +              <https://www.rfc-editor.org/info/rfc1813>. + +   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate +              Requirement Levels", BCP 14, RFC 2119, +              DOI 10.17487/RFC2119, March 1997, +              <https://www.rfc-editor.org/info/rfc2119>. + +   [RFC4121]  Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos +              Version 5 Generic Security Service Application Program +              Interface (GSS-API) Mechanism: Version 2", RFC 4121, +              DOI 10.17487/RFC4121, July 2005, +              <https://www.rfc-editor.org/info/rfc4121>. + + + + +Halevy & Haynes              Standards Track                   [Page 40] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +   [RFC4506]  Eisler, M., Ed., "XDR: External Data Representation +              Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May +              2006, <https://www.rfc-editor.org/info/rfc4506>. + +   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol +              Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, +              May 2009, <https://www.rfc-editor.org/info/rfc5531>. + +   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., +              "Network File System (NFS) Version 4 Minor Version 1 +              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, +              <https://www.rfc-editor.org/info/rfc5661>. + +   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed., +              "Network File System (NFS) Version 4 Minor Version 1 +              External Data Representation Standard (XDR) Description", +              RFC 5662, DOI 10.17487/RFC5662, January 2010, +              <https://www.rfc-editor.org/info/rfc5662>. + +   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System +              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, +              March 2015, <https://www.rfc-editor.org/info/rfc7530>. + +   [RFC7861]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC) +              Security Version 3", RFC 7861, DOI 10.17487/RFC7861, +              November 2016, <https://www.rfc-editor.org/info/rfc7861>. + +   [RFC7862]  Haynes, T., "Network File System (NFS) Version 4 Minor +              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862, +              November 2016, <https://www.rfc-editor.org/info/rfc7862>. + +   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC +              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, +              May 2017, <https://www.rfc-editor.org/info/rfc8174>. + +   [RFC8434]  Haynes, T., "Requirements for Parallel NFS (pNFS) Layout +              Types", RFC 8434, DOI 10.17487/RFC8434, August 2018, +              <https://www.rfc-editor.org/info/rfc8434>. + +17.2.  Informative References + +   [RFC4519]  Sciberras, A., Ed., "Lightweight Directory Access Protocol +              (LDAP): Schema for User Applications", RFC 4519, +              DOI 10.17487/RFC4519, June 2006, +              <https://www.rfc-editor.org/info/rfc4519>. + + + + + + +Halevy & Haynes              Standards Track                   [Page 41] + +RFC 8435                pNFS Flexible File Layout            August 2018 + + +Acknowledgments + +   The following individuals provided miscellaneous comments to early +   draft versions of this document: Matt W. Benjamin, Adam Emerson, +   J. Bruce Fields, and Lev Solomonov. + +   The following individuals provided miscellaneous comments to the +   final draft versions of this document: Anand Ganesh, Robert Wipfel, +   Gobikrishnan Sundharraj, Trond Myklebust, Rick Macklem, and Jim +   Sermersheim. + +   Idan Kedar caught a nasty bug in the interaction of client-side +   mirroring and the minor versioning of devices. + +   Dave Noveck provided comprehensive reviews of the document during the +   working group last calls.  He also rewrote Section 2.3. + +   Olga Kornievskaia made a convincing case against the use of a +   credential versus a principal in the fencing approach.  Andy Adamson +   and Benjamin Kaduk helped to sharpen the focus. + +   Benjamin Kaduk and Olga Kornievskaia also helped provide concrete +   scenarios for loosely coupled security mechanisms.  In the end, Olga +   proved that as defined, the loosely coupled model would not work with +   RPCSEC_GSS. + +   Tigran Mkrtchyan provided the use case for not allowing the client to +   proxy the I/O through the data server. + +   Rick Macklem provided the use case for only writing to a single +   mirror. + +Authors' Addresses + +   Benny Halevy + +   Email: bhalevy@gmail.com + + +   Thomas Haynes +   Hammerspace +   4300 El Camino Real Ste 105 +   Los Altos, CA  94022 +   United States of America + +   Email: loghyr@gmail.com + + + + + +Halevy & Haynes              Standards Track                   [Page 42] + |