author     Thomas Voss <mail@thomasvoss.com>    2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>    2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc8881.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc8881.txt')
-rw-r--r--    doc/rfc/rfc8881.txt    31832
1 files changed, 31832 insertions, 0 deletions
diff --git a/doc/rfc/rfc8881.txt b/doc/rfc/rfc8881.txt new file mode 100644 index 0000000..538a563 --- /dev/null +++ b/doc/rfc/rfc8881.txt @@ -0,0 +1,31832 @@ + + + + +Internet Engineering Task Force (IETF) D. Noveck, Ed. +Request for Comments: 8881 NetApp +Obsoletes: 5661 C. Lever +Category: Standards Track ORACLE +ISSN: 2070-1721 August 2020 + + + Network File System (NFS) Version 4 Minor Version 1 Protocol + +Abstract + + This document describes the Network File System (NFS) version 4 minor + version 1, including features retained from the base protocol (NFS + version 4 minor version 0, which is specified in RFC 7530) and + protocol extensions made subsequently. The later minor version has + no dependencies on NFS version 4 minor version 0, and is considered a + separate protocol. + + This document obsoletes RFC 5661. It substantially revises the + treatment of features relating to multi-server namespace, superseding + the description of those features appearing in RFC 5661. + +Status of This Memo + + This is an Internet Standards Track document. + + This document is a product of the Internet Engineering Task Force + (IETF). It represents the consensus of the IETF community. It has + received public review and has been approved for publication by the + Internet Engineering Steering Group (IESG). Further information on + Internet Standards is available in Section 2 of RFC 7841. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + https://www.rfc-editor.org/info/rfc8881. + +Copyright Notice + + Copyright (c) 2020 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (https://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the Simplified BSD License. + + This document may contain material from IETF Documents or IETF + Contributions published or made publicly available before November + 10, 2008. The person(s) controlling the copyright in some of this + material may not have granted the IETF Trust the right to allow + modifications of such material outside the IETF Standards Process. + Without obtaining an adequate license from the person(s) controlling + the copyright in such materials, this document may not be modified + outside the IETF Standards Process, and derivative works of it may + not be created outside the IETF Standards Process, except to format + it for publication as an RFC or to translate it into languages other + than English. + +Table of Contents + + 1. Introduction + 1.1. Introduction to This Update + 1.2. The NFS Version 4 Minor Version 1 Protocol + 1.3. Requirements Language + 1.4. Scope of This Document + 1.5. NFSv4 Goals + 1.6. NFSv4.1 Goals + 1.7. General Definitions + 1.8. Overview of NFSv4.1 Features + 1.9. Differences from NFSv4.0 + 2. Core Infrastructure + 2.1. Introduction + 2.2. RPC and XDR + 2.3. COMPOUND and CB_COMPOUND + 2.4. Client Identifiers and Client Owners + 2.5. Server Owners + 2.6. Security Service Negotiation + 2.7. Minor Versioning + 2.8. Non-RPC-Based Security Services + 2.9. 
Transport Layers + 2.10. Session + 3. Protocol Constants and Data Types + 3.1. Basic Constants + 3.2. Basic Data Types + 3.3. Structured Data Types + 4. Filehandles + 4.1. Obtaining the First Filehandle + 4.2. Filehandle Types + 4.3. One Method of Constructing a Volatile Filehandle + 4.4. Client Recovery from Filehandle Expiration + 5. File Attributes + 5.1. REQUIRED Attributes + 5.2. RECOMMENDED Attributes + 5.3. Named Attributes + 5.4. Classification of Attributes + 5.5. Set-Only and Get-Only Attributes + 5.6. REQUIRED Attributes - List and Definition References + 5.7. RECOMMENDED Attributes - List and Definition References + 5.8. Attribute Definitions + 5.9. Interpreting owner and owner_group + 5.10. Character Case Attributes + 5.11. Directory Notification Attributes + 5.12. pNFS Attribute Definitions + 5.13. Retention Attributes + 6. Access Control Attributes + 6.1. Goals + 6.2. File Attributes Discussion + 6.3. Common Methods + 6.4. Requirements + 7. Single-Server Namespace + 7.1. Server Exports + 7.2. Browsing Exports + 7.3. Server Pseudo File System + 7.4. Multiple Roots + 7.5. Filehandle Volatility + 7.6. Exported Root + 7.7. Mount Point Crossing + 7.8. Security Policy and Namespace Presentation + 8. State Management + 8.1. Client and Session ID + 8.2. Stateid Definition + 8.3. Lease Renewal + 8.4. Crash Recovery + 8.5. Server Revocation of Locks + 8.6. Short and Long Leases + 8.7. Clocks, Propagation Delay, and Calculating Lease Expiration + 8.8. Obsolete Locking Infrastructure from NFSv4.0 + 9. File Locking and Share Reservations + 9.1. Opens and Byte-Range Locks + 9.2. Lock Ranges + 9.3. Upgrading and Downgrading Locks + 9.4. Stateid Seqid Values and Byte-Range Locks + 9.5. Issues with Multiple Open-Owners + 9.6. Blocking Locks + 9.7. Share Reservations + 9.8. OPEN/CLOSE Operations + 9.9. Open Upgrade and Downgrade + 9.10. Parallel OPENs + 9.11. Reclaim of Open and Byte-Range Locks + 10. Client-Side Caching + 10.1. Performance Challenges for Client-Side Caching + 10.2. Delegation and Callbacks + 10.3. Data Caching + 10.4. Open Delegation + 10.5. Data Caching and Revocation + 10.6. Attribute Caching + 10.7. Data and Metadata Caching and Memory Mapped Files + 10.8. Name and Directory Caching without Directory Delegations + 10.9. Directory Delegations + 11. Multi-Server Namespace + 11.1. Terminology + 11.2. File System Location Attributes + 11.3. File System Presence or Absence + 11.4. Getting Attributes for an Absent File System + 11.5. Uses of File System Location Information + 11.6. Trunking without File System Location Information + 11.7. Users and Groups in a Multi-Server Namespace + 11.8. Additional Client-Side Considerations + 11.9. Overview of File Access Transitions + 11.10. Effecting Network Endpoint Transitions + 11.11. Effecting File System Transitions + 11.12. Transferring State upon Migration + 11.13. Client Responsibilities When Access Is Transitioned + 11.14. Server Responsibilities Upon Migration + 11.15. Effecting File System Referrals + 11.16. The Attribute fs_locations + 11.17. The Attribute fs_locations_info + 11.18. The Attribute fs_status + 12. Parallel NFS (pNFS) + 12.1. Introduction + 12.2. pNFS Definitions + 12.3. pNFS Operations + 12.4. pNFS Attributes + 12.5. Layout Semantics + 12.6. pNFS Mechanics + 12.7. Recovery + 12.8. Metadata and Storage Device Roles + 12.9. Security Considerations for pNFS + 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type + 13.1. Client ID and Session Considerations + 13.2. File Layout Definitions + 13.3. 
File Layout Data Types + 13.4. Interpreting the File Layout + 13.5. Data Server Multipathing + 13.6. Operations Sent to NFSv4.1 Data Servers + 13.7. COMMIT through Metadata Server + 13.8. The Layout Iomode + 13.9. Metadata and Data Server State Coordination + 13.10. Data Server Component File Size + 13.11. Layout Revocation and Fencing + 13.12. Security Considerations for the File Layout Type + 14. Internationalization + 14.1. Stringprep Profile for the utf8str_cs Type + 14.2. Stringprep Profile for the utf8str_cis Type + 14.3. Stringprep Profile for the utf8str_mixed Type + 14.4. UTF-8 Capabilities + 14.5. UTF-8 Related Errors + 15. Error Values + 15.1. Error Definitions + 15.2. Operations and Their Valid Errors + 15.3. Callback Operations and Their Valid Errors + 15.4. Errors and the Operations That Use Them + 16. NFSv4.1 Procedures + 16.1. Procedure 0: NULL - No Operation + 16.2. Procedure 1: COMPOUND - Compound Operations + 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL + 18. NFSv4.1 Operations + 18.1. Operation 3: ACCESS - Check Access Rights + 18.2. Operation 4: CLOSE - Close File + 18.3. Operation 5: COMMIT - Commit Cached Data + 18.4. Operation 6: CREATE - Create a Non-Regular File Object + 18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting + Recovery + 18.6. Operation 8: DELEGRETURN - Return Delegation + 18.7. Operation 9: GETATTR - Get Attributes + 18.8. Operation 10: GETFH - Get Current Filehandle + 18.9. Operation 11: LINK - Create Link to a File + 18.10. Operation 12: LOCK - Create Lock + 18.11. Operation 13: LOCKT - Test for Lock + 18.12. Operation 14: LOCKU - Unlock File + 18.13. Operation 15: LOOKUP - Lookup Filename + 18.14. Operation 16: LOOKUPP - Lookup Parent Directory + 18.15. Operation 17: NVERIFY - Verify Difference in Attributes + 18.16. Operation 18: OPEN - Open a Regular File + 18.17. Operation 19: OPENATTR - Open Named Attribute Directory + 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access + 18.19. Operation 22: PUTFH - Set Current Filehandle + 18.20. Operation 23: PUTPUBFH - Set Public Filehandle + 18.21. Operation 24: PUTROOTFH - Set Root Filehandle + 18.22. Operation 25: READ - Read from File + 18.23. Operation 26: READDIR - Read Directory + 18.24. Operation 27: READLINK - Read Symbolic Link + 18.25. Operation 28: REMOVE - Remove File System Object + 18.26. Operation 29: RENAME - Rename Directory Entry + 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle + 18.28. Operation 32: SAVEFH - Save Current Filehandle + 18.29. Operation 33: SECINFO - Obtain Available Security + 18.30. Operation 34: SETATTR - Set Attributes + 18.31. Operation 37: VERIFY - Verify Same Attributes + 18.32. Operation 38: WRITE - Write to File + 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control + 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection + with Session + 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID + 18.36. Operation 43: CREATE_SESSION - Create New Session and + Confirm Client ID + 18.37. Operation 44: DESTROY_SESSION - Destroy a Session + 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks + 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory + Delegation + 18.40. Operation 47: GETDEVICEINFO - Get Device Information + 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for + a File System + 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a + Layout + 18.43. Operation 50: LAYOUTGET - Get Layout Information + 18.44. Operation 51: LAYOUTRETURN - Release Layout Information + 18.45. 
Operation 52: SECINFO_NO_NAME - Get Security on Unnamed + Object + 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing + and Control + 18.47. Operation 54: SET_SSV - Update SSV for a Client ID + 18.48. Operation 55: TEST_STATEID - Test Stateids for Validity + 18.49. Operation 56: WANT_DELEGATION - Request Delegation + 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID + 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims + Finished + 18.52. Operation 10044: ILLEGAL - Illegal Operation + 19. NFSv4.1 Callback Procedures + 19.1. Procedure 0: CB_NULL - No Operation + 19.2. Procedure 1: CB_COMPOUND - Compound Operations + 20. NFSv4.1 Callback Operations + 20.1. Operation 3: CB_GETATTR - Get Attributes + 20.2. Operation 4: CB_RECALL - Recall a Delegation + 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client + 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory + Changes + 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested + Delegation to Client + 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects + 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources + for Recallable Objects + 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control Limits + 20.9. Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing + and Control + 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending + Delegation Wants + 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible + Lock Availability + 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device + ID Changes + 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation + 21. Security Considerations + 22. IANA Considerations + 22.1. IANA Actions + 22.2. Named Attribute Definitions + 22.3. Device ID Notifications + 22.4. Object Recall Types + 22.5. Layout Types + 22.6. Path Variable Definitions + 23. References + 23.1. Normative References + 23.2. Informative References + Appendix A. The Need for This Update + Appendix B. Changes in This Update + B.1. Revisions Made to Section 11 of RFC 5661 + B.2. Revisions Made to Operations in RFC 5661 + B.3. Revisions Made to Error Definitions in RFC 5661 + B.4. Other Revisions Made to RFC 5661 + Appendix C. Security Issues That Need to Be Addressed + Acknowledgments + Authors' Addresses + +1. Introduction + +1.1. Introduction to This Update + + Two important features previously defined in minor version 0 but + never fully addressed in minor version 1 are trunking, which is the + simultaneous use of multiple connections between a client and server, + potentially to different network addresses, and Transparent State + Migration, which allows a file system to be transferred between + servers in a way that provides to the client the ability to maintain + its existing locking state across the transfer. + + The revised description of the NFS version 4 minor version 1 + (NFSv4.1) protocol presented in this update is necessary to enable + full use of these features together with other multi-server namespace + features. This document is in the form of an updated description of + the NFSv4.1 protocol previously defined in RFC 5661 [66]. RFC 5661 + is obsoleted by this document. However, the update has a limited + scope and is focused on enabling full use of trunking and Transparent + State Migration. The need for these changes is discussed in + Appendix A. Appendix B describes the specific changes made to arrive + at the current text. 
+ + This limited-scope update replaces the current NFSv4.1 RFC with the + intention of providing an authoritative and complete specification, + the motivation for which is discussed in [36], addressing the issues + within the scope of the update. However, it will not address issues + that are known but outside of this limited scope as could be expected + by a full update of the protocol. Below are some areas that are + known to need addressing in a future update of the protocol: + + * Work needs to be done with regard to RFC 8178 [67], which + establishes NFSv4-wide versioning rules. As RFC 5661 is currently + inconsistent with that document, changes are needed in order to + arrive at a situation in which there would be no need for RFC 8178 + to update the NFSv4.1 specification. + + * Work needs to be done with regard to RFC 8434 [70], which + establishes the requirements for parallel NFS (pNFS) layout types, + which are not clearly defined in RFC 5661. When that work is done + and the resulting documents approved, the new NFSv4.1 + specification document will provide a clear set of requirements + for layout types and a description of the file layout type that + conforms to those requirements. Other layout types will have + their own specification documents that conform to those + requirements as well. + + * Work needs to be done to address many errata reports relevant to + RFC 5661, other than errata report 2006 [64], which is addressed + in this document. Addressing that report was not deferrable + because of the interaction of the changes suggested there and the + newly described handling of state and session migration. + + The errata reports that have been deferred and that will need to + be addressed in a later document include reports currently + assigned a range of statuses in the errata reporting system, + including reports marked Accepted and those marked Hold For + Document Update because the change was too minor to address + immediately. + + In addition, there is a set of other reports, including at least + one in state Rejected, that will need to be addressed in a later + document. This will involve making changes to consensus decisions + reflected in RFC 5661, in situations in which the working group + has decided that the treatment in RFC 5661 is incorrect and needs + to be revised to reflect the working group's new consensus and to + ensure compatibility with existing implementations that do not + follow the handling described in RFC 5661. + + Note that it is expected that all such errata reports will remain + relevant to implementors and the authors of an eventual + rfc5661bis, despite the fact that this document obsoletes RFC 5661 + [66]. + + * There is a need for a new approach to the description of + internationalization since the current internationalization + section (Section 14) has never been implemented and does not meet + the needs of the NFSv4 protocol. Possible solutions are to create + a new internationalization section modeled on that in [68] or to + create a new document describing internationalization for all + NFSv4 minor versions and reference that document in the RFCs + defining both NFSv4.0 and NFSv4.1. + + * There is a need for a revised treatment of security in NFSv4.1. + The issues with the existing treatment are discussed in + Appendix C. 
+ + Until the above work is done, there will not be a consistent set of + documents that provides a description of the NFSv4.1 protocol, and + any full description would involve documents updating other documents + within the specification. The updates applied by RFC 8434 [70] and + RFC 8178 [67] to RFC 5661 also apply to this specification, and will + apply to any subsequent v4.1 specification until that work is done. + +1.2. The NFS Version 4 Minor Version 1 Protocol + + The NFS version 4 minor version 1 (NFSv4.1) protocol is the second + minor version of the NFS version 4 (NFSv4) protocol. The first minor + version, NFSv4.0, is now described in RFC 7530 [68]. It generally + follows the guidelines for minor versioning that are listed in + Section 10 of RFC 3530 [37]. However, it diverges from guidelines 11 + ("a client and server that support minor version X must support minor + versions 0 through X-1") and 12 ("no new features may be introduced + as mandatory in a minor version"). These divergences are due to the + introduction of the sessions model for managing non-idempotent + operations and the RECLAIM_COMPLETE operation. These two new + features are infrastructural in nature and simplify implementation of + existing and other new features. Making them anything but REQUIRED + would add undue complexity to protocol definition and implementation. + NFSv4.1 accordingly updates the minor versioning guidelines + (Section 2.7). + + As a minor version, NFSv4.1 is consistent with the overall goals for + NFSv4, but extends the protocol so as to better meet those goals, + based on experiences with NFSv4.0. In addition, NFSv4.1 has adopted + some additional goals, which motivate some of the major extensions in + NFSv4.1. + +1.3. Requirements Language + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [1]. + +1.4. Scope of This Document + + This document describes the NFSv4.1 protocol. With respect to + NFSv4.0, this document does not: + + * describe the NFSv4.0 protocol, except where needed to contrast + with NFSv4.1. + + * modify the specification of the NFSv4.0 protocol. + + * clarify the NFSv4.0 protocol. + +1.5. NFSv4 Goals + + The NFSv4 protocol is a further revision of the NFS protocol defined + already by NFSv3 [38]. It retains the essential characteristics of + previous versions: easy recovery; independence of transport + protocols, operating systems, and file systems; simplicity; and good + performance. NFSv4 has the following goals: + + * Improved access and good performance on the Internet + + The protocol is designed to transit firewalls easily, perform well + where latency is high and bandwidth is low, and scale to very + large numbers of clients per server. + + * Strong security with negotiation built into the protocol + + The protocol builds on the work of the ONCRPC working group in + supporting the RPCSEC_GSS protocol. Additionally, the NFSv4.1 + protocol provides a mechanism to allow clients and servers the + ability to negotiate security and require clients and servers to + support a minimal set of security schemes. + + * Good cross-platform interoperability + + The protocol features a file system model that provides a useful, + common set of features that does not unduly favor one file system + or operating system over another. 
+ + * Designed for protocol extensions + + The protocol is designed to accept standard extensions within a + framework that enables and encourages backward compatibility. + +1.6. NFSv4.1 Goals + + NFSv4.1 has the following goals, within the framework established by + the overall NFSv4 goals. + + * To correct significant structural weaknesses and oversights + discovered in the base protocol. + + * To add clarity and specificity to areas left unaddressed or not + addressed in sufficient detail in the base protocol. However, as + stated in Section 1.4, it is not a goal to clarify the NFSv4.0 + protocol in the NFSv4.1 specification. + + * To add specific features based on experience with the existing + protocol and recent industry developments. + + * To provide protocol support to take advantage of clustered server + deployments including the ability to provide scalable parallel + access to files distributed among multiple servers. + +1.7. General Definitions + + The following definitions provide an appropriate context for the + reader. + + Byte: In this document, a byte is an octet, i.e., a datum exactly 8 + bits in length. + + Client: The client is the entity that accesses the NFS server's + resources. The client may be an application that contains the + logic to access the NFS server directly. The client may also be + the traditional operating system client that provides remote file + system services for a set of applications. + + A client is uniquely identified by a client owner. + + With reference to byte-range locking, the client is also the + entity that maintains a set of locks on behalf of one or more + applications. This client is responsible for crash or failure + recovery for those locks it manages. + + Note that multiple clients may share the same transport and + connection and multiple clients may exist on the same network + node. + + Client ID: The client ID is a 64-bit quantity used as a unique, + short-hand reference to a client-supplied verifier and client + owner. The server is responsible for supplying the client ID. + + Client Owner: The client owner is a unique string, opaque to the + server, that identifies a client. Multiple network connections + and source network addresses originating from those connections + may share a client owner. The server is expected to treat + requests from connections with the same client owner as coming + from the same client. + + File System: The file system is the collection of objects on a + server (as identified by the major identifier of a server owner, + which is defined later in this section) that share the same fsid + attribute (see Section 5.8.1.9). + + Lease: A lease is an interval of time defined by the server for + which the client is irrevocably granted locks. At the end of a + lease period, locks may be revoked if the lease has not been + extended. A lock must be revoked if a conflicting lock has been + granted after the lease interval. + + A server grants a client a single lease for all state. + + Lock: The term "lock" is used to refer to byte-range (in UNIX + environments, also known as record) locks, share reservations, + delegations, or layouts unless specifically stated otherwise. + + Secret State Verifier (SSV): The SSV is a unique secret key shared + between a client and server. The SSV serves as the secret key for + an internal (that is, internal to NFSv4.1) Generic Security + Services (GSS) mechanism (the SSV GSS mechanism; see + Section 2.10.9). 
The SSV GSS mechanism uses the SSV to compute + message integrity code (MIC) and Wrap tokens. See + Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and + the SSV GSS mechanism. + + Server: The Server is the entity responsible for coordinating client + access to a set of file systems and is identified by a server + owner. A server can span multiple network addresses. + + Server Owner: The server owner identifies the server to the client. + The server owner consists of a major identifier and a minor + identifier. When the client has two connections each to a peer + with the same major identifier, the client assumes that both peers + are the same server (the server namespace is the same via each + connection) and that lock state is shareable across both + connections. When each peer has both the same major and minor + identifiers, the client assumes that each connection might be + associable with the same session. + + Stable Storage: Stable storage is storage from which data stored by + an NFSv4.1 server can be recovered without data loss from multiple + power failures (including cascading power failures, that is, + several power failures in quick succession), operating system + failures, and/or hardware failure of components other than the + storage medium itself (such as disk, nonvolatile RAM, flash + memory, etc.). + + Some examples of stable storage that are allowable for an NFS + server include: + + 1. Media commit of data; that is, the modified data has been + successfully written to the disk media, for example, the disk + platter. + + 2. An immediate reply disk drive with battery-backed, on-drive + intermediate storage or uninterruptible power system (UPS). + + 3. Server commit of data with battery-backed intermediate storage + and recovery software. + + 4. Cache commit with uninterruptible power system (UPS) and + recovery software. + + Stateid: A stateid is a 128-bit quantity returned by a server that + uniquely defines the open and locking states provided by the + server for a specific open-owner or lock-owner/open-owner pair for + a specific file and type of lock. + + Verifier: A verifier is a 64-bit quantity generated by the client + that the server can use to determine if the client has restarted + and lost all previous lock state. + +1.8. Overview of NFSv4.1 Features + + The major features of the NFSv4.1 protocol will be reviewed in brief. + This will be done to provide an appropriate context for both the + reader who is familiar with the previous versions of the NFS protocol + and the reader who is new to the NFS protocols. For the reader new + to the NFS protocols, there is still a set of fundamental knowledge + that is expected. The reader should be familiar with the External + Data Representation (XDR) and Remote Procedure Call (RPC) protocols + as described in [2] and [3]. A basic knowledge of file systems and + distributed file systems is expected as well. + + In general, this specification of NFSv4.1 will not distinguish those + features added in minor version 1 from those present in the base + protocol but will treat NFSv4.1 as a unified whole. See Section 1.9 + for a summary of the differences between NFSv4.0 and NFSv4.1. + +1.8.1. RPC and Security + + As with previous versions of NFS, the External Data Representation + (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 + protocol are those defined in [2] and [3]. To meet end-to-end + security requirements, the RPCSEC_GSS framework [4] is used to extend + the basic RPC security. 
With the use of RPCSEC_GSS, various + mechanisms can be provided to offer authentication, integrity, and + privacy to the NFSv4 protocol. Kerberos V5 is used as described in + [5] to provide one security framework. With the use of RPCSEC_GSS, + other mechanisms may also be specified and used for NFSv4.1 security. + + To enable in-band security negotiation, the NFSv4.1 protocol has + operations that provide the client a method of querying the server + about its policies regarding which security mechanisms must be used + for access to the server's file system resources. With this, the + client can securely match the security mechanism that meets the + policies specified at both the client and server. + + NFSv4.1 introduces parallel access (see Section 1.8.2.2), which is + called pNFS. The security framework described in this section is + significantly modified by the introduction of pNFS (see + Section 12.9), because data access is sometimes not over RPC. The + level of significance varies with the storage protocol (see + Section 12.2.5) and can be as low as zero impact (see Section 13.12). + +1.8.2. Protocol Structure + +1.8.2.1. Core Protocol + + Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, + NSM (Network Status Monitor), MOUNT), within all minor versions of + NFSv4 a single RPC protocol is used to make requests to the server. + Facilities that had been separate protocols, such as locking, are now + integrated within a single unified protocol. + +1.8.2.2. Parallel Access + + Minor version 1 supports high-performance data access to a clustered + server implementation by enabling a separation of metadata access and + data access, with the latter done to multiple servers in parallel. + + Such parallel data access is controlled by recallable objects known + as "layouts", which are integrated into the protocol locking model. + Clients direct requests for data access to a set of data servers + specified by the layout via a data storage protocol which may be + NFSv4.1 or may be another protocol. + + Because the protocols used for parallel data access are not + necessarily RPC-based, the RPC-based security model (Section 1.8.1) + is obviously impacted (see Section 12.9). The degree of impact + varies with the storage protocol (see Section 12.2.5) used for data + access, and can be as low as zero (see Section 13.12). + +1.8.3. File System Model + + The general file system model used for the NFSv4.1 protocol is the + same as previous versions. The server file system is hierarchical + with the regular files contained within being treated as opaque byte + streams. In a slight departure, file and directory names are encoded + with UTF-8 to deal with the basics of internationalization. + + The NFSv4.1 protocol does not require a separate protocol to provide + for the initial mapping between path name and filehandle. All file + systems exported by a server are presented as a tree so that all file + systems are reachable from a special per-server global root + filehandle. This allows LOOKUP operations to be used to perform + functions previously provided by the MOUNT protocol. The server + provides any necessary pseudo file systems to bridge any gaps that + arise due to unexported gaps between exported file systems. + +1.8.3.1. Filehandles + + As in previous versions of the NFS protocol, opaque filehandles are + used to identify individual files and directories. 
Lookup-type and + create operations translate file and directory names to filehandles, + which are then used to identify objects in subsequent operations. + + The NFSv4.1 protocol provides support for persistent filehandles, + guaranteed to be valid for the lifetime of the file system object + designated. In addition, it provides support to servers to provide + filehandles with more limited validity guarantees, called volatile + filehandles. + +1.8.3.2. File Attributes + + The NFSv4.1 protocol has a rich and extensible file object attribute + structure, which is divided into REQUIRED, RECOMMENDED, and named + attributes (see Section 5). + + Several (but not all) of the REQUIRED attributes are derived from the + attributes of NFSv3 (see the definition of the fattr3 data type in + [38]). An example of a REQUIRED attribute is the file object's type + (Section 5.8.1.2) so that regular files can be distinguished from + directories (also known as folders in some operating environments) + and other types of objects. REQUIRED attributes are discussed in + Section 5.1. + + An example of three RECOMMENDED attributes are acl, sacl, and dacl. + These attributes define an Access Control List (ACL) on a file object + (Section 6). An ACL provides directory and file access control + beyond the model used in NFSv3. The ACL definition allows for + specification of specific sets of permissions for individual users + and groups. In addition, ACL inheritance allows propagation of + access permissions and restrictions down a directory tree as file + system objects are created. RECOMMENDED attributes are discussed in + Section 5.2. + + A named attribute is an opaque byte stream that is associated with a + directory or file and referred to by a string name. Named attributes + are meant to be used by client applications as a method to associate + application-specific data with a regular file or directory. NFSv4.1 + modifies named attributes relative to NFSv4.0 by tightening the + allowed operations in order to prevent the development of non- + interoperable implementations. Named attributes are discussed in + Section 5.3. + +1.8.3.3. Multi-Server Namespace + + NFSv4.1 contains a number of features to allow implementation of + namespaces that cross server boundaries and that allow and facilitate + a nondisruptive transfer of support for individual file systems + between servers. They are all based upon attributes that allow one + file system to specify alternate, additional, and new location + information that specifies how the client may access that file + system. + + These attributes can be used to provide for individual active file + systems: + + * Alternate network addresses to access the current file system + instance. + + * The locations of alternate file system instances or replicas to be + used in the event that the current file system instance becomes + unavailable. + + These file system location attributes may be used together with the + concept of absent file systems, in which a position in the server + namespace is associated with locations on other servers without there + being any corresponding file system instance on the current server. + For example, + + * These attributes may be used with absent file systems to implement + referrals whereby one server may direct the client to a file + system provided by another server. This allows extensive multi- + server namespaces to be constructed. + + * These attributes may be provided when a previously present file + system becomes absent. 
This allows nondisruptive migration of + file systems to alternate servers. + +1.8.4. Locking Facilities + + As mentioned previously, NFSv4.1 is a single protocol that includes + locking facilities. These locking facilities include support for + many types of locks including a number of sorts of recallable locks. + Recallable locks such as delegations allow the client to be assured + that certain events will not occur so long as that lock is held. + When circumstances change, the lock is recalled via a callback + request. The assurances provided by delegations allow more extensive + caching to be done safely when circumstances allow it. + + The types of locks are: + + * Share reservations as established by OPEN operations. + + * Byte-range locks. + + * File delegations, which are recallable locks that assure the + holder that inconsistent opens and file changes cannot occur so + long as the delegation is held. + + * Directory delegations, which are recallable locks that assure the + holder that inconsistent directory modifications cannot occur so + long as the delegation is held. + + * Layouts, which are recallable objects that assure the holder that + direct access to the file data may be performed directly by the + client and that no change to the data's location that is + inconsistent with that access may be made so long as the layout is + held. + + All locks for a given client are tied together under a single client- + wide lease. All requests made on sessions associated with the client + renew that lease. When the client's lease is not promptly renewed, + the client's locks are subject to revocation. In the event of server + restart, clients have the opportunity to safely reclaim their locks + within a special grace period. + +1.9. Differences from NFSv4.0 + + The following summarizes the major differences between minor version + 1 and the base protocol: + + * Implementation of the sessions model (Section 2.10). + + * Parallel access to data (Section 12). + + * Addition of the RECLAIM_COMPLETE operation to better structure the + lock reclamation process (Section 18.51). + + * Enhanced delegation support as follows. + + - Delegations on directories and other file types in addition to + regular files (Section 18.39, Section 18.49). + + - Operations to optimize acquisition of recalled or denied + delegations (Section 18.49, Section 20.5, Section 20.7). + + - Notifications of changes to files and directories + (Section 18.39, Section 20.4). + + - A method to allow a server to indicate that it is recalling one + or more delegations for resource management reasons, and thus a + method to allow the client to pick which delegations to return + (Section 20.6). + + * Attributes can be set atomically during exclusive file create via + the OPEN operation (see the new EXCLUSIVE4_1 creation method in + Section 18.16). + + * Open files can be preserved if removed and the hard link count + ("hard link" is defined in an Open Group [6] standard) goes to + zero, thus obviating the need for clients to rename deleted files + to partially hidden names -- colloquially called "silly rename" + (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in + Section 18.16). + + * Improved compatibility with Microsoft Windows for Access Control + Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2). + + * Data retention (Section 5.13). + + * Identification of the implementation of the NFS client and server + (Section 18.35). 
+ + * Support for notification of the availability of byte-range locks + (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in + Section 18.16 and see Section 20.11). + + * In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms + [39]. + +2. Core Infrastructure + +2.1. Introduction + + NFSv4.1 relies on core infrastructure common to nearly every + operation. This core infrastructure is described in the remainder of + this section. + +2.2. RPC and XDR + + The NFSv4.1 protocol is a Remote Procedure Call (RPC) application + that uses RPC version 2 and the corresponding eXternal Data + Representation (XDR) as defined in [3] and [2]. + +2.2.1. RPC-Based Security + + Previous NFS versions have been thought of as having a host-based + authentication model, where the NFS server authenticates the NFS + client, and trusts the client to authenticate all users. Actually, + NFS has always depended on RPC for authentication. One of the first + forms of RPC authentication, AUTH_SYS, had no strong authentication + and required a host-based authentication approach. NFSv4.1 also + depends on RPC for basic security services and mandates RPC support + for a user-based authentication model. The user-based authentication + model has user principals authenticated by a server, and in turn the + server authenticated by user principals. RPC provides some basic + security services that are used by NFSv4.1. + +2.2.1.1. RPC Security Flavors + + As described in "Authentication", Section 7 of [3], RPC security is + encapsulated in the RPC header, via a security or authentication + flavor, and information specific to the specified security flavor. + Every RPC header conveys information used to identify and + authenticate a client and server. As discussed in Section 2.2.1.1.1, + some security flavors provide additional security services. + + NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This + requirement to implement is not a requirement to use.) Other + flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well. + +2.2.1.1.1. RPCSEC_GSS and Security Services + + RPCSEC_GSS [4] uses the functionality of GSS-API [7]. This allows + for the use of various security mechanisms by the RPC layer without + the additional implementation overhead of adding RPC security + flavors. + +2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy + + Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate + users on clients to servers, and servers to users. It can also + perform integrity checking on the entire RPC message, including the + RPC header, and on the arguments or results. Finally, privacy, + usually via encryption, is a service available with RPCSEC_GSS. + Privacy is performed on the arguments and results. Note that if + privacy is selected, integrity, authentication, and identification + are enabled. If privacy is not selected, but integrity is selected, + authentication and identification are enabled. If integrity and + privacy are not selected, but authentication is enabled, + identification is enabled. RPCSEC_GSS does not provide + identification as a separate service. + + Although GSS-API has an authentication service distinct from its + privacy and integrity services, GSS-API's authentication service is + not used for RPCSEC_GSS's authentication service. Instead, each RPC + request and response header is integrity protected with the GSS-API + integrity service, and this allows RPCSEC_GSS to offer per-RPC + authentication and identity. See [4] for more information. 
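   The following short sketch is illustrative only and is not part of
   the protocol specification.  It encodes the service-implication
   rules just described: selecting the RPCSEC_GSS privacy service
   implies integrity, authentication, and identification, and selecting
   integrity implies authentication and identification.  The numeric
   rpc_gss_svc_* values are those defined for RPCSEC_GSS in RFC 2203;
   everything else is a hypothetical helper.

      # Python sketch; not a real RPC or NFS implementation.
      RPC_GSS_SVC_NONE = 1       # identification and authentication only
      RPC_GSS_SVC_INTEGRITY = 2  # adds integrity of arguments/results
      RPC_GSS_SVC_PRIVACY = 3    # adds encryption of arguments/results

      def enabled_services(selected):
          """Return the security services implied by a service choice."""
          services = {"identification", "authentication"}
          if selected >= RPC_GSS_SVC_INTEGRITY:
              services.add("integrity")
          if selected >= RPC_GSS_SVC_PRIVACY:
              services.add("privacy")
          return services

      # Privacy enables every service that integrity enables, and more.
      assert enabled_services(RPC_GSS_SVC_PRIVACY) >= \
             enabled_services(RPC_GSS_SVC_INTEGRITY)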
+ + NFSv4.1 client and servers MUST support RPCSEC_GSS's integrity and + authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's + privacy service. NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy + service. + +2.2.1.1.1.2. Security Mechanisms for NFSv4.1 + + RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide + security services. Therefore, NFSv4.1 clients and servers MUST + support the Kerberos V5 security mechanism. + + The use of RPCSEC_GSS requires selection of mechanism, quality of + protection (QOP), and service (authentication, integrity, privacy). + For the mandated security mechanisms, NFSv4.1 specifies that a QOP of + zero is used, leaving it up to the mechanism or the mechanism's + configuration to map QOP zero to an appropriate level of protection. + Each mandated mechanism specifies a minimum set of cryptographic + algorithms for implementing integrity and privacy. NFSv4.1 clients + and servers MUST be implemented on operating environments that comply + with the REQUIRED cryptographic algorithms of each REQUIRED + mechanism. + +2.2.1.1.1.2.1. Kerberos V5 + + The Kerberos V5 GSS-API mechanism as described in [5] MUST be + implemented with the RPCSEC_GSS services as specified in the + following table: + + column descriptions: + 1 == number of pseudo flavor + 2 == name of pseudo flavor + 3 == mechanism's OID + 4 == RPCSEC_GSS service + 5 == NFSv4.1 clients MUST support + 6 == NFSv4.1 servers MUST support + + 1 2 3 4 5 6 + ------------------------------------------------------------------ + 390003 krb5 1.2.840.113554.1.2.2 rpc_gss_svc_none yes yes + 390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes + 390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy no yes + + Note that the number and name of the pseudo flavor are presented here + as a mapping aid to the implementor. Because the NFSv4.1 protocol + includes a method to negotiate security and it understands the GSS- + API mechanism, the pseudo flavor is not needed. The pseudo flavor is + needed for the NFSv3 since the security negotiation is done via the + MOUNT protocol as described in [40]. + + At the time NFSv4.1 was specified, the Advanced Encryption Standard + (AES) with HMAC-SHA1 was a REQUIRED algorithm set for Kerberos V5. + In contrast, when NFSv4.0 was specified, weaker algorithm sets were + REQUIRED for Kerberos V5, and were REQUIRED in the NFSv4.0 + specification, because the Kerberos V5 specification at the time did + not specify stronger algorithms. The NFSv4.1 specification does not + specify REQUIRED algorithms for Kerberos V5, and instead, the + implementor is expected to track the evolution of the Kerberos V5 + standard if and when stronger algorithms are specified. + +2.2.1.1.1.2.1.1. Security Considerations for Cryptographic Algorithms + in Kerberos V5 + + When deploying NFSv4.1, the strength of the security achieved depends + on the existing Kerberos V5 infrastructure. The algorithms of + Kerberos V5 are not directly exposed to or selectable by the client + or server, so there is some due diligence required by the user of + NFSv4.1 to ensure that security is acceptable where needed. + +2.2.1.1.1.3. GSS Server Principal + + Regardless of what security mechanism under RPCSEC_GSS is being used, + the NFS server MUST identify itself in GSS-API via a + GSS_C_NT_HOSTBASED_SERVICE name type. 
GSS_C_NT_HOSTBASED_SERVICE + names are of the form: + + service@hostname + + For NFS, the "service" element is + + nfs + + Implementations of security mechanisms will convert nfs@hostname to + various different forms. For Kerberos V5, the following form is + RECOMMENDED: + + nfs/hostname + +2.3. COMPOUND and CB_COMPOUND + + A significant departure from the versions of the NFS protocol before + NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 + protocol, in all minor versions, there are exactly two RPC + procedures, NULL and COMPOUND. The COMPOUND procedure is defined as + a series of individual operations and these operations perform the + sorts of functions performed by traditional NFS procedures. + + The operations combined within a COMPOUND request are evaluated in + order by the server, without any atomicity guarantees. A limited set + of facilities exist to pass results from one operation to another. + Once an operation returns a failing result, the evaluation ends and + the results of all evaluated operations are returned to the client. + + With the use of the COMPOUND procedure, the client is able to build + simple or complex requests. These COMPOUND requests allow for a + reduction in the number of RPCs needed for logical file system + operations. For example, multi-component look up requests can be + constructed by combining multiple LOOKUP operations. Those can be + further combined with operations such as GETATTR, READDIR, or OPEN + plus READ to do more complicated sets of operation without incurring + additional latency. + + NFSv4.1 also contains a considerable set of callback operations in + which the server makes an RPC directed at the client. Callback RPCs + have a similar structure to that of the normal server requests. In + all minor versions of the NFSv4 protocol, there are two callback RPC + procedures: CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is + defined in an analogous fashion to that of COMPOUND with its own set + of callback operations. + + The addition of new server and callback operations within the + COMPOUND and CB_COMPOUND request framework provides a means of + extending the protocol in subsequent minor versions. + + Except for a small number of operations needed for session creation, + server requests and callback requests are performed within the + context of a session. Sessions provide a client context for every + request and support robust replay protection for non-idempotent + requests. + +2.4. Client Identifiers and Client Owners + + For each operation that obtains or depends on locking state, the + specific client needs to be identifiable by the server. + + Each distinct client instance is represented by a client ID. A + client ID is a 64-bit identifier representing a specific client at a + given time. The client ID is changed whenever the client re- + initializes, and may change when the server re-initializes. Client + IDs are used to support lock identification and crash recovery. + + During steady state operation, the client ID associated with each + operation is derived from the session (see Section 2.10) on which the + operation is sent. A session is associated with a client ID when the + session is created. + + Unlike NFSv4.0, the only NFSv4.1 operations possible before a client + ID is established are those needed to establish the client ID. 
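   The following minimal sketch is illustrative only; the helper
   functions stand in for RPCs and are hypothetical, not a real NFS
   client API.  It shows the establishment sequence detailed in the
   next paragraph: EXCHANGE_ID yields a server-assigned client ID
   (eir_clientid), which a subsequent CREATE_SESSION confirms.

      # Python sketch of the client ID bootstrap; all names hypothetical.
      from dataclasses import dataclass

      @dataclass
      class ExchangeIdResult:
          eir_clientid: int  # 64-bit quantity chosen by the server

      def send_exchange_id(co_ownerid, co_verifier):
          # Stand-in for the EXCHANGE_ID RPC: the server assigns the ID.
          return ExchangeIdResult(
              eir_clientid=hash((co_ownerid, co_verifier)) & (2**64 - 1))

      def send_create_session(client_id):
          # Stand-in for the CREATE_SESSION RPC: confirming the client ID
          # creates the session used by all subsequent requests.
          return ("session", client_id)

      def establish_client_id(co_ownerid, co_verifier):
          eir = send_exchange_id(co_ownerid, co_verifier)
          session = send_create_session(eir.eir_clientid)
          return eir.eir_clientid, session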
+ + A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION + operation using that client ID (eir_clientid as returned from + EXCHANGE_ID) is required to establish and confirm the client ID on + the server. Establishment of identification by a new incarnation of + the client also has the effect of immediately releasing any locking + state that a previous incarnation of that same client might have had + on the server. Such released state would include all byte-range + lock, share reservation, layout state, and -- where the server + supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_CUR_FH claim + types -- all delegation state associated with the same client with + the same identity. For discussion of delegation state recovery, see + Section 10.2.1. For discussion of layout state recovery, see + Section 12.7.1. + + Releasing such state requires that the server be able to determine + that one client instance is the successor of another. Where this + cannot be done, for any of a number of reasons, the locking state + will remain for a time subject to lease expiration (see Section 8.3) + and the new client will need to wait for such state to be removed, if + it makes conflicting lock requests. + + Client identification is encapsulated in the following client owner + data type: + + struct client_owner4 { + verifier4 co_verifier; + opaque co_ownerid<NFS4_OPAQUE_LIMIT>; + }; + + The first field, co_verifier, is a client incarnation verifier, + allowing the server to distinguish successive incarnations (e.g., + reboots) of the same client. The server will start the process of + canceling the client's leased state if co_verifier is different than + what the server has previously recorded for the identified client (as + specified in the co_ownerid field). + + The second field, co_ownerid, is a variable length string that + uniquely defines the client so that subsequent instances of the same + client bear the same co_ownerid with a different verifier. + + There are several considerations for how the client generates the + co_ownerid string: + + * The string should be unique so that multiple clients do not + present the same string. The consequences of two clients + presenting the same string range from one client getting an error + to one client having its leased state abruptly and unexpectedly + cancelled. + + * The string should be selected so that subsequent incarnations + (e.g., restarts) of the same client cause the client to present + the same string. The implementor is cautioned from an approach + that requires the string to be recorded in a local file because + this precludes the use of the implementation in an environment + where there is no local disk and all file access is from an + NFSv4.1 server. + + * The string should be the same for each server network address that + the client accesses. This way, if a server has multiple + interfaces, the client can trunk traffic over multiple network + paths as described in Section 2.10.5. (Note: the precise opposite + was advised in the NFSv4.0 specification [37].) + + * The algorithm for generating the string should not assume that the + client's network address will not change, unless the client + implementation knows it is using statically assigned network + addresses. This includes changes between client incarnations and + even changes while the client is still running in its current + incarnation. 
Thus, with dynamic address assignment, if the client + includes just the client's network address in the co_ownerid + string, there is a real risk that after the client gives up the + network address, another client, using a similar algorithm for + generating the co_ownerid string, would generate a conflicting + co_ownerid string. + + Given the above considerations, an example of a well-generated + co_ownerid string is one that includes: + + * If applicable, the client's statically assigned network address. + + * Additional information that tends to be unique, such as one or + more of: + + - The client machine's serial number (for privacy reasons, it is + best to perform some one-way function on the serial number). + + - A Media Access Control (MAC) address (again, a one-way function + should be performed). + + - The timestamp of when the NFSv4.1 software was first installed + on the client (though this is subject to the previously + mentioned caution about using information that is stored in a + file, because the file might only be accessible over NFSv4.1). + + - A true random number. However, since this number ought to be + the same between client incarnations, this shares the same + problem as that of using the timestamp of the software + installation. + + * For a user-level NFSv4.1 client, it should contain additional + information to distinguish the client from other user-level + clients running on the same host, such as a process identifier or + other unique sequence. + + The client ID is assigned by the server (the eir_clientid result from + EXCHANGE_ID) and should be chosen so that it will not conflict with a + client ID previously assigned by the server. This applies across + server restarts. + + In the event of a server restart, a client may find out that its + current client ID is no longer valid when it receives an + NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on + the characteristics of the sessions involved, specifically whether + the session is persistent (see Section 2.10.6.5), but in each case + the client will receive this error when it attempts to establish a + new session with the existing client ID and receives the error + NFS4ERR_STALE_CLIENTID, indicating that a new client ID needs to be + obtained via EXCHANGE_ID and the new session established with that + client ID. + + When a session is not persistent, the client will find out that it + needs to create a new session as a result of getting an + NFS4ERR_BADSESSION, since the session in question was lost as part of + a server restart. When the existing client ID is presented to a + server as part of creating a session and that client ID is not + recognized, as would happen after a server restart, the server will + reject the request with the error NFS4ERR_STALE_CLIENTID. + + In the case of the session being persistent, the client will re- + establish communication using the existing session after the restart. + This session will be associated with the existing client ID but may + only be used to retransmit operations that the client previously + transmitted and did not see replies to. Replies to operations that + the server previously performed will come from the reply cache; + otherwise, NFS4ERR_DEADSESSION will be returned. Hence, such a + session is referred to as "dead". In this situation, in order to + perform new operations, the client needs to establish a new session. 
+ If an attempt is made to establish this new session with the existing + client ID, the server will reject the request with + NFS4ERR_STALE_CLIENTID. + + When NFS4ERR_STALE_CLIENTID is received in either of these + situations, the client needs to obtain a new client ID by use of the + EXCHANGE_ID operation, then use that client ID as the basis of a new + session, and then proceed to any other necessary recovery for the + server restart case (see Section 8.4.2). + + See the descriptions of EXCHANGE_ID (Section 18.35) and + CREATE_SESSION (Section 18.36) for a complete specification of these + operations. + +2.4.1. Upgrade from NFSv4.0 to NFSv4.1 + + To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a + value of data type client_owner4 in an EXCHANGE_ID with a value of + data type nfs_client_id4 that was established using the SETCLIENTID + operation of NFSv4.0. A server that does so will allow an upgraded + client to avoid waiting until the lease (i.e., the lease established + by the NFSv4.0 instance client) expires. This requires that the + value of data type client_owner4 be constructed the same way as the + value of data type nfs_client_id4. If the latter's contents included + the server's network address (per the recommendations of the NFSv4.0 + specification [37]), and the NFSv4.1 client does not wish to use a + client ID that prevents trunking, it should send two EXCHANGE_ID + operations. The first EXCHANGE_ID will have a client_owner4 equal to + the nfs_client_id4. This will clear the state created by the NFSv4.0 + client. The second EXCHANGE_ID will not have the server's network + address. The state created for the second EXCHANGE_ID will not have + to wait for lease expiration, because there will be no state to + expire. + +2.4.2. Server Release of Client ID + + NFSv4.1 introduces a new operation called DESTROY_CLIENTID + (Section 18.50), which the client SHOULD use to destroy a client ID + it no longer needs. This permits graceful, bilateral release of a + client ID. The operation cannot be used if there are sessions + associated with the client ID, or state with an unexpired lease. + + If the server determines that the client holds no associated state + for its client ID (associated state includes unrevoked sessions, + opens, locks, delegations, layouts, and wants), the server MAY choose + to unilaterally release the client ID in order to conserve resources. + If the client contacts the server after this release, the server MUST + ensure that the client receives the appropriate error so that it will + use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client + ID. The server ought to be very hesitant to release a client ID + since the resulting work on the client to recover from such an event + will be the same burden as if the server had failed and restarted. + Typically, a server would not release a client ID unless there had + been no activity from that client for many minutes. As long as there + are sessions, opens, locks, delegations, layouts, or wants, the + server MUST NOT release the client ID. See Section 2.10.13.1.4 for + discussion on releasing inactive sessions. + +2.4.3. Resolving Client Owner Conflicts + + When the server gets an EXCHANGE_ID for a client owner that currently + has no state, or that has state but the lease has expired, the server + MUST allow the EXCHANGE_ID and confirm the new client ID if followed + by the appropriate CREATE_SESSION. 
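   The decision rule just stated, together with the conflict cases
   enumerated in the following paragraphs, can be summarized by the
   sketch below.  It is schematic and hypothetical, not server code
   from this specification; in particular, the single principal
   comparison stands in for the full set of RPCSEC_GSS and SSV
   conditions listed next.

      # Python sketch of server-side EXCHANGE_ID handling; illustrative.
      import time
      from dataclasses import dataclass

      @dataclass
      class ClientRecord:
          principal: str       # principal that created the client ID
          lease_expiry: float  # monotonic time at which the lease lapses

          def lease_expired(self):
              return time.monotonic() > self.lease_expiry

      def handle_exchange_id(records, co_ownerid, principal):
          existing = records.get(co_ownerid)
          if existing is None or existing.lease_expired():
              # No state, or only lease-expired state: MUST allow, and
              # confirm on the subsequent CREATE_SESSION.
              return "allow"
          if existing.principal == principal:
              # Old incarnation with unexpired state may be disposed of
              # when the requester is tied to the creator of the state.
              return "allow-and-dispose-old-state"
          return "NFS4ERR_CLID_INUSE"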
+
+ When the server gets an EXCHANGE_ID for a new incarnation of a client owner that currently has an old incarnation with state and an unexpired lease, the server is allowed to dispose of the state of the previous incarnation of the client owner if one of the following is true:
+
+ * The principal that created the client ID for the client owner is the same as the principal that is sending the EXCHANGE_ID operation. Note that if the client ID was created with SP4_MACH_CRED state protection (Section 18.35), the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.
+
+ * The client ID was established with SP4_SSV protection (Section 18.35, Section 2.10.8.3) and the client sends the EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.9).
+
+ * The client ID was established with SP4_SSV protection, and under the conditions described herein, the EXCHANGE_ID was sent with SP4_MACH_CRED state protection. Because the SSV might not persist across client and server restart, and because the first time a client sends EXCHANGE_ID to a server it does not have an SSV, the client MAY send the subsequent EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.
+
+ If none of the above situations apply, the server MUST return NFS4ERR_CLID_INUSE.
+
+ If the server accepts the principal and co_ownerid as matching that which created the client ID, and the co_verifier in the EXCHANGE_ID differs from the co_verifier used when the client ID was created, then after the server receives a CREATE_SESSION that confirms the client ID, the server deletes state. If the co_verifier values are the same (e.g., the client either is updating properties of the client ID (Section 18.35) or is attempting trunking (Section 2.10.5)), the server MUST NOT delete state.
+
+ 2.5. Server Owners
+
+ The server owner is similar to a client owner (Section 2.4), but unlike the client owner, there is no shorthand server ID. The server owner is defined in the following data type:
+
+    struct server_owner4 {
+        uint64_t so_minor_id;
+        opaque   so_major_id<NFS4_OPAQUE_LIMIT>;
+    };
+
+ The server owner is returned from EXCHANGE_ID. When the so_major_id fields are the same in two EXCHANGE_ID results, the connections that each EXCHANGE_ID was sent over can be assumed to address the same server (as defined in Section 1.7). If the so_minor_id fields are also the same, then not only do both connections connect to the same server, but the session can be shared across both connections. The reader is cautioned that multiple servers may deliberately or accidentally claim to have the same so_major_id or so_major_id/so_minor_id; the reader should examine Sections 2.10.5 and 18.35 in order to avoid acting on falsely matching server owner values.
+
+ The considerations for generating an so_major_id are similar to those for generating a co_ownerid string (see Section 2.4).
+ The consequences of two servers generating conflicting so_major_id values are less dire than they are for co_ownerid conflicts because the client can use RPCSEC_GSS to compare the authenticity of each server (see Section 2.10.5).
+
+ 2.6. Security Service Negotiation
+
+ With the NFSv4.1 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server. The NFS server may have multiple points within its file system namespace that are available for use by NFS clients. These points can be considered security policy boundaries, and, in some NFS implementations, are tied to NFS export points. In turn, the NFS server may be configured such that each of these security policy boundaries may have different or multiple security mechanisms in use.
+
+ The security negotiation between client and server SHOULD be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired. See Section 21 for further discussion.
+
+ 2.6.1. NFSv4.1 Security Tuples
+
+ An NFS server can assign one or more "security tuples" to each security policy boundary in its namespace. Each security tuple consists of a security flavor (see Section 2.2.1.1) and, if the flavor is RPCSEC_GSS, a GSS-API mechanism Object Identifier (OID), a GSS-API quality of protection, and an RPCSEC_GSS service.
+
+ 2.6.2. SECINFO and SECINFO_NO_NAME
+
+ The SECINFO and SECINFO_NO_NAME operations allow the client to determine, on a per-filehandle basis, what security tuple is to be used for server access. In general, the client will not have to use either operation except during initial communication with the server or when the client crosses security policy boundaries at the server. However, the server's policies may also change at any time and force the client to negotiate a new security tuple.
+
+ Where the use of different security tuples would affect the type of access that would be allowed if a request was sent over the same connection used for the SECINFO or SECINFO_NO_NAME operation (e.g., read-only vs. read-write access), security tuples that allow greater access should be presented first. Where the general level of access is the same and different security flavors limit the range of principals whose privileges are recognized (e.g., allowing or disallowing root access), flavors supporting the greatest range of principals should be listed first.
+
+ 2.6.3. Security Error
+
+ Based on the assumption that each NFSv4.1 client and server MUST support a minimum set of security (i.e., Kerberos V5 under RPCSEC_GSS), the NFS client will initiate file access to the server with one of the minimal security tuples. During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. This error allows the server to notify the client that the security tuple currently being used contravenes the server's security policy. The client is then responsible for determining (see Section 2.6.3.1) what security tuples are available at the server and choosing one that is appropriate for the client.
+
+ 2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME
+
+ This section explains the mechanics of NFSv4.1 security negotiation.
+
+ 2.6.3.1.1. Put Filehandle Operations
+ The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH. Each of the subsections herein describes how the server handles a subseries of operations that starts with a put filehandle operation.
+
+ 2.6.3.1.1.1. Put Filehandle Operation + SAVEFH
+
+ The client is saving a filehandle for a future RESTOREFH, LINK, or RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC. To determine whether or not the put filehandle operation returns NFS4ERR_WRONGSEC, the server implementation pretends SAVEFH is not in the series of operations and examines which of the situations described in the other subsections of Section 2.6.3.1.1 apply.
+
+ 2.6.3.1.1.2. Two or More Put Filehandle Operations
+
+ For a series of N put filehandle operations, the server MUST NOT return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. The Nth put filehandle operation is handled as if it is the first in a subseries of operations. For example, if the server received a COMPOUND request with this series of operations -- PUTFH, PUTROOTFH, LOOKUP -- then the PUTFH operation is ignored for NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is processed according to Section 2.6.3.1.1.3.
+
+ 2.6.3.1.1.3. Put Filehandle Operation + LOOKUP (or OPEN of an Existing Name)
+
+ This situation also applies to a put filehandle operation followed by a LOOKUP or an OPEN operation that specifies an existing component name.
+
+ In this situation, the client is potentially crossing a security policy boundary, and the set of security tuples the parent directory supports may differ from those of the child. The server implementation may decide whether to impose any restrictions on security policy administration. There are at least three approaches (sec_policy_child is the tuple set of the child export, sec_policy_parent is that of the parent).
+
+ (a) sec_policy_child <= sec_policy_parent (<= for subset). This means that the set of security tuples specified on the security policy of a child directory is always a subset of its parent directory.
+
+ (b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, {} for the empty set). This means that the set of security tuples specified on the security policy of a child directory always has a non-empty intersection with that of the parent.
+
+ (c) sec_policy_child ^ sec_policy_parent == {}. This means that the set of security tuples specified on the security policy of a child directory may not intersect with that of the parent. In other words, there are no restrictions on how the system administrator may set up these tuples.
+
+ In order for a server to support approaches (b) (for the case when a client chooses a flavor that is not a member of sec_policy_parent) and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC when there is a security tuple mismatch. Instead, it should be returned from the LOOKUP (or OPEN by existing component name) that follows.
+
+ Since the above guideline does not contradict approach (a), it should be followed in general. Even if approach (a) is implemented, it is possible for the security tuple used to be acceptable for the target of LOOKUP but not for the filehandles used in the put filehandle operation. The put filehandle operation could be a PUTROOTFH or PUTPUBFH, where the client cannot know the security tuples for the root or public filehandle.
Or the security policy for the filehandle + used by the put filehandle operation could have changed since the + time the filehandle was obtained. + + Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in + response to the put filehandle operation if the operation is + immediately followed by a LOOKUP or an OPEN by component name. + +2.6.3.1.1.4. Put Filehandle Operation + LOOKUPP + + Since SECINFO only works its way down, there is no way LOOKUPP can + return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME + solves this issue via style SECINFO_STYLE4_PARENT, which works in the + opposite direction as SECINFO. As with Section 2.6.3.1.1.3, a put + filehandle operation that is followed by a LOOKUPP MUST NOT return + NFS4ERR_WRONGSEC. If the server does not support SECINFO_NO_NAME, + the client's only recourse is to send the put filehandle operation, + LOOKUPP, GETFH sequence of operations with every security tuple it + supports. + + Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server + MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle + operation if the operation is immediately followed by a LOOKUPP. + +2.6.3.1.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME + + A security-sensitive client is allowed to choose a strong security + tuple when querying a server to determine a file object's permitted + security tuples. The security tuple chosen by the client does not + have to be included in the tuple list of the security policy of + either the parent directory indicated in the put filehandle operation + or the child file object indicated in SECINFO (or any parent + directory indicated in SECINFO_NO_NAME). Of course, the server has + to be configured for whatever security tuple the client selects; + otherwise, the request will fail at the RPC layer with an appropriate + authentication error. + + In theory, there is no connection between the security flavor used by + SECINFO or SECINFO_NO_NAME and those supported by the security + policy. But in practice, the client may start looking for strong + flavors from those supported by the security policy, followed by + those in the REQUIRED set. + + The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put + filehandle operation that is immediately followed by SECINFO or + SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC + from SECINFO or SECINFO_NO_NAME. + +2.6.3.1.1.6. Put Filehandle Operation + Nothing + + The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC. + +2.6.3.1.1.7. Put Filehandle Operation + Anything Else + + "Anything Else" includes OPEN by filehandle. + + The security policy enforcement applies to the filehandle specified + in the put filehandle operation. Therefore, the put filehandle + operation MUST return NFS4ERR_WRONGSEC when there is a security tuple + mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as + an allowable error to every other operation. + + A COMPOUND containing the series put filehandle operation + + SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way + for the client to recover from NFS4ERR_WRONGSEC. + + The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation + other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by + component name). + +2.6.3.1.1.8. 
Operations after SECINFO and SECINFO_NO_NAME
+
+ Suppose a client sends a COMPOUND procedure containing the series SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple used does not match that required for the target file. By rule (see Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ cannot return NFS4ERR_WRONGSEC. The issue is resolved by the fact that SECINFO and SECINFO_NO_NAME consume the current filehandle (note that this is a change from NFSv4.0). This leaves no current filehandle for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.
+
+ 2.6.3.1.2. LINK and RENAME
+
+ The LINK and RENAME operations use both the current and saved filehandles. Technically, the server MAY return NFS4ERR_WRONGSEC from LINK or RENAME if the security policy of the saved filehandle rejects the security flavor used in the COMPOUND request's credentials. If the server does so, then if there is no intersection between the security policies of saved and current filehandles, this means that it will be impossible for the client to perform the intended LINK or RENAME operation.
+
+ For example, suppose the client sends this COMPOUND request: SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where filehandles bFH and aFH refer to different directories. Suppose no common security tuple exists between the security policies of aFH and bFH. If the client sends the request using credentials acceptable to bFH's security policy but not aFH's policy, then the PUTFH aFH operation will fail with NFS4ERR_WRONGSEC. After a SECINFO_NO_NAME request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", using credentials acceptable to aFH's security policy but not bFH's policy. The server returns NFS4ERR_WRONGSEC on the RENAME operation.
+
+ To prevent a client from being caught in an endless sequence of a request containing LINK or RENAME, followed by a request containing SECINFO_NO_NAME or SECINFO, the server MUST detect when the security policies of the current and saved filehandles have no mutually acceptable security tuple, and MUST NOT return NFS4ERR_WRONGSEC from LINK or RENAME in that situation. Instead, the server MUST do one of two things:
+
+ * The server can return NFS4ERR_XDEV.
+
+ * The server can allow the security policy of the current filehandle to override that of the saved filehandle, and so return NFS4_OK.
+
+ 2.7. Minor Versioning
+
+ To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning.
+
+ The base assumption with respect to minor versioning is that any future accepted minor version will be documented in one or more Standards Track RFCs. Minor version 0 of the NFSv4 protocol is represented by [37], and minor version 1 is represented by this RFC. The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.
+
+ The following items represent the basic rules for the development of minor versions. Note that a future minor version may modify or add to the following rules as part of the minor version definition.
+
+ 1. Procedures are not added or deleted.
+
+    To maintain the general RPC model, NFSv4 minor versions will not add to or delete procedures from the NFS program.
+
+ 2.
Minor versions may add operations to the COMPOUND and + CB_COMPOUND procedures. + + The addition of operations to the COMPOUND and CB_COMPOUND + procedures does not affect the RPC model. + + * Minor versions may append attributes to the bitmap4 that + represents sets of attributes and to the fattr4 that + represents sets of attribute values. + + This allows for the expansion of the attribute model to allow + for future growth or adaptation. + + * Minor version X must append any new attributes after the last + documented attribute. + + Since attribute results are specified as an opaque array of + per-attribute, XDR-encoded results, the complexity of adding + new attributes in the midst of the current definitions would + be too burdensome. + + 3. Minor versions must not modify the structure of an existing + operation's arguments or results. + + Again, the complexity of handling multiple structure definitions + for a single operation is too burdensome. New operations should + be added instead of modifying existing structures for a minor + version. + + This rule does not preclude the following adaptations in a minor + version: + + * adding bits to flag fields, such as new attributes to + GETATTR's bitmap4 data type, and providing corresponding + variants of opaque arrays, such as a notify4 used together + with such bitmaps + + * adding bits to existing attributes like ACLs that have flag + words + + * extending enumerated types (including NFS4ERR_*) with new + values + + * adding cases to a switched union + + 4. Minor versions must not modify the structure of existing + attributes. + + 5. Minor versions must not delete operations. + + This prevents the potential reuse of a particular operation + "slot" in a future minor version. + + 6. Minor versions must not delete attributes. + + 7. Minor versions must not delete flag bits or enumeration values. + + 8. Minor versions may declare an operation MUST NOT be implemented. + + Specifying that an operation MUST NOT be implemented is + equivalent to obsoleting an operation. For the client, it means + that the operation MUST NOT be sent to the server. For the + server, an NFS error can be returned as opposed to "dropping" + the request as an XDR decode error. This approach allows for + the obsolescence of an operation while maintaining its structure + so that a future minor version can reintroduce the operation. + + 1. Minor versions may declare that an attribute MUST NOT be + implemented. + + 2. Minor versions may declare that a flag bit or enumeration + value MUST NOT be implemented. + + 9. Minor versions may downgrade features from REQUIRED to + RECOMMENDED, or RECOMMENDED to OPTIONAL. + + 10. Minor versions may upgrade features from OPTIONAL to + RECOMMENDED, or RECOMMENDED to REQUIRED. + + 11. A client and server that support minor version X SHOULD support + minor versions zero through X-1 as well. + + 12. Except for infrastructural changes, a minor version must not + introduce REQUIRED new features. + + This rule allows for the introduction of new functionality and + forces the use of implementation experience before designating a + feature as REQUIRED. On the other hand, some classes of + features are infrastructural and have broad effects. Allowing + infrastructural features to be RECOMMENDED or OPTIONAL + complicates implementation of the minor version. + + 13. 
A client MUST NOT attempt to use a stateid, filehandle, or + similar returned object from the COMPOUND procedure with minor + version X for another COMPOUND procedure with minor version Y, + where X != Y. + +2.8. Non-RPC-Based Security Services + + As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for + identification, authentication, integrity, and privacy. NFSv4.1 + itself provides or enables additional security services as described + in the next several subsections. + +2.8.1. Authorization + + Authorization to access a file object via an NFSv4.1 operation is + ultimately determined by the NFSv4.1 server. A client can + predetermine its access to a file object via the OPEN (Section 18.16) + and the ACCESS (Section 18.1) operations. + + Principals with appropriate access rights can modify the + authorization on a file object via the SETATTR (Section 18.30) + operation. Attributes that affect access rights include mode, owner, + owner_group, acl, dacl, and sacl. See Section 5. + +2.8.2. Auditing + + NFSv4.1 provides auditing on a per-file object basis, via the acl and + sacl attributes as described in Section 6. It is outside the scope + of this specification to specify audit log formats or management + policies. + +2.8.3. Intrusion Detection + + NFSv4.1 provides alarm control on a per-file object basis, via the + acl and sacl attributes as described in Section 6. Alarms may serve + as the basis for intrusion detection. It is outside the scope of + this specification to specify heuristics for detecting intrusion via + alarms. + +2.9. Transport Layers + +2.9.1. REQUIRED and RECOMMENDED Properties of Transports + + NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA- + based transports with the following attributes: + + * The transport supports reliable delivery of data, which NFSv4.1 + requires but neither NFSv4.1 nor RPC has facilities for ensuring + [41]. + + * The transport delivers data in the order it was sent. Ordered + delivery simplifies detection of transmit errors, and simplifies + the sending of arbitrary sized requests and responses via the + record marking protocol [3]. + + Where an NFSv4.1 implementation supports operation over the IP + network protocol, any transport used between NFS and IP MUST be among + the IETF-approved congestion control transport protocols. At the + time this document was written, the only two transports that had the + above attributes were TCP and the Stream Control Transmission + Protocol (SCTP). To enhance the possibilities for interoperability, + an NFSv4.1 implementation MUST support operation over the TCP + transport protocol. + + Even if NFSv4.1 is used over a non-IP network protocol, it is + RECOMMENDED that the transport support congestion control. + + It is permissible for a connectionless transport to be used under + NFSv4.1; however, reliable and in-order delivery of data combined + with congestion control by the connectionless transport is REQUIRED. + As a consequence, UDP by itself MUST NOT be used as an NFSv4.1 + transport. NFSv4.1 assumes that a client transport address and + server transport address used to send data over a transport together + constitute a connection, even if the underlying transport eschews the + concept of a connection. + +2.9.2. Client and Server Transport Behavior + + If a connection-oriented transport (e.g., TCP) is used, the client + and server SHOULD use long-lived connections for at least three + reasons: + + 1. 
This will prevent the weakening of the transport's congestion + control mechanisms via short-lived connections. + + 2. This will improve performance for the WAN environment by + eliminating the need for connection setup handshakes. + + 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the + client and server to maintain a client-created backchannel (see + Section 2.10.3.1) for the server to use. + + In order to reduce congestion, if a connection-oriented transport is + used, and the request is not the NULL procedure: + + * A requester MUST NOT retry a request unless the connection the + request was sent over was lost before the reply was received. + + * A replier MUST NOT silently drop a request, even if the request is + a retry. (The silent drop behavior of RPCSEC_GSS [4] does not + apply because this behavior happens at the RPCSEC_GSS layer, a + lower layer in the request processing.) Instead, the replier + SHOULD return an appropriate error (see Section 2.10.6.1), or it + MAY disconnect the connection. + + When sending a reply, the replier MUST send the reply to the same + full network address (e.g., if using an IP-based transport, the + source port of the requester is part of the full network address) + from which the requester sent the request. If using a connection- + oriented transport, replies MUST be sent on the same connection from + which the request was received. + + If a connection is dropped after the replier receives the request but + before the replier sends the reply, the replier might have a pending + reply. If a connection is established with the same source and + destination full network address as the dropped connection, then the + replier MUST NOT send the reply until the requester retries the + request. The reason for this prohibition is that the requester MAY + retry a request over a different connection (provided that connection + is associated with the original request's session). + + When using RDMA transports, there are other reasons for not + tolerating retries over the same connection: + + * RDMA transports use "credits" to enforce flow control, where a + credit is a right to a peer to transmit a message. If one peer + were to retransmit a request (or reply), it would consume an + additional credit. If the replier retransmitted a reply, it would + certainly result in an RDMA connection loss, since the requester + would typically only post a single receive buffer for each + request. If the requester retransmitted a request, the additional + credit consumed on the server might lead to RDMA connection + failure unless the client accounted for it and decreased its + available credit, leading to wasted resources. + + * RDMA credits present a new issue to the reply cache in NFSv4.1. + The reply cache may be used when a connection within a session is + lost, such as after the client reconnects. Credit information is + a dynamic property of the RDMA connection, and stale values must + not be replayed from the cache. This implies that the reply cache + contents must not be blindly used when replies are sent from it, + and credit information appropriate to the channel must be + refreshed by the RPC layer. + + In addition, as described in Section 2.10.6.2, while a session is + active, the NFSv4.1 requester MUST NOT stop waiting for a reply. + +2.9.3. Ports + + Historically, NFSv3 servers have listened over TCP port 2049. The + registered port 2049 [42] for the NFS protocol should be the default + configuration. 
NFSv4.1 clients SHOULD NOT use the RPC binding + protocols as described in [43]. + +2.10. Session + + NFSv4.1 clients and servers MUST support and MUST use the session + feature as described in this section. + +2.10.1. Motivation and Overview + + Previous versions and minor versions of NFS have suffered from the + following: + + * Lack of support for Exactly Once Semantics (EOS). This includes + lack of support for EOS through server failure and recovery. + + * Limited callback support, including no support for sending + callbacks through firewalls, and races between replies to normal + requests and callbacks. + + * Limited trunking over multiple network paths. + + * Requiring machine credentials for fully secure operation. + + Through the introduction of a session, NFSv4.1 addresses the above + shortfalls with practical solutions: + + * EOS is enabled by a reply cache with a bounded size, making it + feasible to keep the cache in persistent storage and enable EOS + through server failure and recovery. One reason that previous + revisions of NFS did not support EOS was because some EOS + approaches often limited parallelism. As will be explained in + Section 2.10.6, NFSv4.1 supports both EOS and unlimited + parallelism. + + * The NFSv4.1 client (defined in Section 1.7) creates transport + connections and provides them to the server to use for sending + callback requests, thus solving the firewall issue + (Section 18.34). Races between responses from client requests and + callbacks caused by the requests are detected via the session's + sequencing properties that are a consequence of EOS + (Section 2.10.6.3). + + * The NFSv4.1 client can associate an arbitrary number of + connections with the session, and thus provide trunking + (Section 2.10.5). + + * The NFSv4.1 client and server produce a session key independent of + client and server machine credentials which can be used to compute + a digest for protecting critical session management operations + (Section 2.10.8.3). + + * The NFSv4.1 client can also create secure RPCSEC_GSS contexts for + use by the session's backchannel that do not require the server to + authenticate to a client machine principal (Section 2.10.8.2). + + A session is a dynamically created, long-lived server object created + by a client and used over time from one or more transport + connections. Its function is to maintain the server's state relative + to the connection(s) belonging to a client instance. This state is + entirely independent of the connection itself, and indeed the state + exists whether or not the connection exists. A client may have one + or more sessions associated with it so that client-associated state + may be accessed using any of the sessions associated with that + client's client ID, when connections are associated with those + sessions. When no connections are associated with any of a client + ID's sessions for an extended time, such objects as locks, opens, + delegations, layouts, etc. are subject to expiration. The session + serves as an object representing a means of access by a client to the + associated client state on the server, independent of the physical + means of access to that state. + + A single client may create multiple sessions. A single session MUST + NOT serve multiple clients. + +2.10.2. NFSv4 Integration + + Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major + infrastructure change such as sessions would require a new major + version number to an Open Network Computing (ONC) RPC program like + NFS. 
However, because NFSv4 encapsulates its functionality in a single procedure, COMPOUND, and because COMPOUND can support an arbitrary number of operations, sessions have been added to NFSv4.1 with little difficulty. COMPOUND includes a minor version number field, and for NFSv4.1 this minor version is set to 1. When the NFSv4 server processes a COMPOUND with the minor version set to 1, it expects a different set of operations than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation, which is required for every COMPOUND that operates over an established session, with the exception of some session administration operations, such as DESTROY_SESSION (Section 18.37).
+
+ 2.10.2.1. SEQUENCE and CB_SEQUENCE
+
+ In NFSv4.1, when the SEQUENCE operation is present, it MUST be the first operation in the COMPOUND procedure. The primary purpose of SEQUENCE is to carry the session identifier. The session identifier associates all other operations in the COMPOUND procedure with a particular session. SEQUENCE also contains required information for maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 COMPOUND requests thus have the form:
+
+    +-----+--------------+-----------+------------+-----------+----
+    | tag | minorversion | numops    |SEQUENCE op | op + args | ...
+    |     |   (== 1)     | (limited) | + args     |           |
+    +-----+--------------+-----------+------------+-----------+----
+
+ and the replies have the form:
+
+    +------------+-----+--------+-------------------------------+--//
+    |last status | tag | numres |status + SEQUENCE op + results |  //
+    +------------+-----+--------+-------------------------------+--//
+            //-----------------------+----
+            // status + op + results | ...
+            //-----------------------+----
+
+ A CB_COMPOUND procedure request and reply has a similar form to COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE operation. CB_COMPOUND also has an additional field called "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored by the client. CB_SEQUENCE has the same information as SEQUENCE, and also includes other information needed to resolve callback races (Section 2.10.6.3).
+
+ 2.10.2.2. Client ID and Session Association
+
+ Each client ID (Section 2.4) can have zero or more active sessions. A client ID and associated session are required to perform file access in NFSv4.1. Each time a session is used (whether by a client sending a request to the server or the client replying to a callback request from the server), the state leased to its associated client ID is automatically renewed.
+
+ State (which can consist of share reservations, locks, delegations, and layouts (Section 1.8.4)) is tied to the client ID. Client state is not tied to any individual session. Successive state changing operations from a given state owner MAY go over different sessions, provided the session is associated with the same client ID. A callback MAY arrive over a different session than that of the request that originally acquired the state pertaining to the callback. For example, if session A is used to acquire a delegation, a request to recall the delegation MAY arrive over session B if both sessions are associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 discuss the security considerations around callbacks.
+
+ 2.10.3. Channels
+
+ A channel is not a connection. A channel represents the direction ONC RPC requests are sent.
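+
+ One way to picture that distinction is with a small data model. The following C sketch is editorial, not protocol XDR; the struct and field names are invented, and the direction bitmask only loosely echoes the direction flags used by BIND_CONN_TO_SESSION (Section 18.34):
+
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    /* Direction(s) of RPC traffic a connection carries. */
+    enum channel_dir { CHAN_FORE = 0x1, CHAN_BACK = 0x2 };
+
+    struct connection {
+        int      socket_fd;  /* the transport endpoint */
+        unsigned channels;   /* bitmask of enum channel_dir */
+    };
+
+    /* Per-channel session resources (slot tables, reply caches)
+     * exist whether or not any connection is associated. */
+    struct session {
+        struct slot_table *fore_slots;  /* client -> server */
+        struct slot_table *back_slots;  /* server -> client */
+    };
+
+    static bool carries_fore(const struct connection *c)
+    {
+        return (c->channels & CHAN_FORE) != 0;
+    }
+
+    int main(void)
+    {
+        struct connection c = { .socket_fd = 3,
+                                .channels  = CHAN_FORE | CHAN_BACK };
+        printf("fore? %d\n", carries_fore(&c));
+        return 0;
+    }
+
+ Note that in this model the per-channel resources live in the session object, consistent with the earlier observation that session state exists independently of any particular connection.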
+ + Each session has one or two channels: the fore channel and the + backchannel. Because there are at most two channels per session, and + because each channel has a distinct purpose, channels are not + assigned identifiers. + + The fore channel is used for ordinary requests from the client to the + server, and carries COMPOUND requests and responses. A session + always has a fore channel. + + The backchannel is used for callback requests from server to client, + and carries CB_COMPOUND requests and responses. Whether or not there + is a backchannel is decided by the client; however, many features of + NFSv4.1 require a backchannel. NFSv4.1 servers MUST support + backchannels. + + Each session has resources for each channel, including separate reply + caches (see Section 2.10.6.1). Note that even the backchannel + requires a reply cache (or, at least, a slot table in order to detect + retries) because some callback operations are non-idempotent. + +2.10.3.1. Association of Connections, Channels, and Sessions + + Each channel is associated with zero or more transport connections + (whether of the same transport protocol or different transport + protocols). A connection can be associated with one channel or both + channels of a session; the client and server negotiate whether a + connection will carry traffic for one channel or both channels via + the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION + (Section 18.34) operations. When a session is created via + CREATE_SESSION, the connection that transported the CREATE_SESSION + request is automatically associated with the fore channel, and + optionally the backchannel. If the client specifies no state + protection (Section 18.35) when the session is created, then when + SEQUENCE is transmitted on a different connection, the connection is + automatically associated with the fore channel of the session + specified in the SEQUENCE operation. + + A connection's association with a session is not exclusive. A + connection associated with the channel(s) of one session may be + simultaneously associated with the channel(s) of other sessions + including sessions associated with other client IDs. + + It is permissible for connections of multiple transport types to be + associated with the same channel. For example, both TCP and RDMA + connections can be associated with the fore channel. In the event an + RDMA and non-RDMA connection are associated with the same channel, + the maximum number of slots SHOULD be at least one more than the + total number of RDMA credits (Section 2.10.6.1). This way, if all + RDMA credits are used, the non-RDMA connection can have at least one + outstanding request. If a server supports multiple transport types, + it MUST allow a client to associate connections from each transport + to a channel. + + It is permissible for a connection of one type of transport to be + associated with the fore channel, and a connection of a different + type to be associated with the backchannel. + +2.10.4. Server Scope + + Servers each specify a server scope value in the form of an opaque + string eir_server_scope returned as part of the results of an + EXCHANGE_ID operation. The purpose of the server scope is to allow a + group of servers to indicate to clients that a set of servers sharing + the same server scope value has arranged to use distinct values of + opaque identifiers so that the two servers never assign the same + value to two distinct objects. 
Thus, the identifiers generated by two servers within that set can be assumed compatible so that, in certain important cases, identifiers generated by one server in that set may be presented to another server of the same scope.
+
+ The use of such compatible values does not imply that a value generated by one server will always be accepted by another. In most cases, it will not. However, a server will not inadvertently accept a value generated by another server. When it does accept it, it will be because it is recognized as valid and carrying the same meaning as on another server of the same scope.
+
+ When servers are of the same server scope, this compatibility of values applies to the following identifiers:
+
+ * Filehandle values. A filehandle value accepted by two servers of the same server scope denotes the same object. A WRITE operation sent to one server is reflected immediately in a READ sent to the other.
+
+ * Server owner values. When the server scope values are the same, server owner values may be validly compared. In cases where the server scope values are different, server owner values are treated as different even if they contain identical strings of bytes.
+
+ The coordination among servers required to provide such compatibility can be quite minimal, and limited to a simple partition of the ID space. The recognition of common values requires additional implementation, but this can be tailored to the specific situations in which that recognition is desired.
+
+ Clients will have occasion to compare the server scope values of multiple servers under a number of circumstances, each of which will be discussed under the appropriate functional section:
+
+ * When server owner values received in response to EXCHANGE_ID operations sent to multiple network addresses are compared for the purpose of determining the validity of various forms of trunking, as described in Section 11.5.2.
+
+ * When network or server reconfiguration causes the same network address to possibly be directed to different servers, with the necessity for the client to determine when lock reclaim should be attempted, as described in Section 8.4.2.1.
+
+ When two replies from EXCHANGE_ID, each from two different server network addresses, have the same server scope, there are a number of ways a client can validate that the common server scope is due to two servers cooperating in a group.
+
+ * If both EXCHANGE_ID requests were sent with RPCSEC_GSS ([4], [9], [27]) authentication and the server principal is the same for both targets, the equality of server scope is validated. It is RECOMMENDED that two servers intending to share the same server scope and server_owner major_id also share the same principal name. In some cases, this simplifies the client's task of validating server scope.
+
+ * The client may accept the appearance of the second server in the fs_locations or fs_locations_info attribute for a relevant file system. For example, if there is a migration event for a particular file system or there are locks to be reclaimed on a particular file system, the attributes for that particular file system may be used. The client sends the GETATTR request to the first server for the fs_locations or fs_locations_info attribute with RPCSEC_GSS authentication. It may need to do this in advance of the need to verify the common server scope.
+ If the client successfully authenticates the reply to GETATTR, and the GETATTR request and reply containing the fs_locations or fs_locations_info attribute refers to the second server, then the equality of server scope is supported. A client may choose to limit the use of this form of support to information relevant to the specific file system involved (e.g., a file system being migrated).
+
+ 2.10.5. Trunking
+
+ Trunking is the use of multiple connections between a client and server in order to increase the speed of data transfer. NFSv4.1 supports two types of trunking: session trunking and client ID trunking.
+
+ In the context of a single server network address, it can be assumed that all connections are accessing the same server, and NFSv4.1 servers MUST support both forms of trunking. When multiple connections use a set of network addresses to access the same server, the server MUST support both forms of trunking. NFSv4.1 servers in a clustered configuration MAY allow network addresses for different servers to use client ID trunking.
+
+ Clients may use either form of trunking as long as they do not, when trunking between different server network addresses, violate the servers' mandates as to the kinds of trunking to be allowed (see below). With regard to callback channels, the client MUST allow the server to choose among all callback channels valid for a given client ID and MUST support trunking when the connections supporting the backchannel allow session or client ID trunking to be used for callbacks.
+
+ Session trunking is essentially the association of multiple connections, each with potentially different target and/or source network addresses, to the same session. When the target network addresses (server addresses) of the two connections are the same, the server MUST support such session trunking. When the target network addresses are different, the server MAY indicate such support using the data returned by the EXCHANGE_ID operation (see below).
+
+ Client ID trunking is the association of multiple sessions to the same client ID. Servers MUST support client ID trunking for two target network addresses whenever they allow session trunking for those same two network addresses. In addition, a server MAY, by presenting the same major server owner ID (Section 2.5) and server scope (Section 2.10.4), allow an additional case of client ID trunking. When two servers return the same major server owner and server scope, it means that the two servers are cooperating on locking state management, which is a prerequisite for client ID trunking.
+
+ Distinguishing when the client is allowed to use session and client ID trunking requires understanding how the results of the EXCHANGE_ID (Section 18.35) operation identify a server. Suppose a client sends EXCHANGE_IDs over two different connections, each with a possibly different target network address, but each EXCHANGE_ID operation has the same value in the eia_clientowner field. If the same NFSv4.1 server is listening over each connection, then each EXCHANGE_ID result MUST return the same values of eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. The client can then treat each connection as referring to the same server (subject to verification; see Section 2.10.5.1 below), and it can use each connection to trunk requests and replies. The client's choice is whether session trunking or client ID trunking applies.
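+
+ The two cases, detailed below, differ only in whether eir_server_owner.so_minor_id must also match. A C sketch of the comparison follows (editorial only; the exid_result type, with fixed-size fields, is an invented simplification of the Section 18.35 results, whose scope and owner fields are really variable-length opaque data):
+
+    #include <stdbool.h>
+    #include <stdio.h>
+    #include <string.h>
+
+    struct exid_result {
+        unsigned long long clientid;  /* eir_clientid                 */
+        char scope[64];               /* eir_server_scope             */
+        char major_id[64];            /* eir_server_owner.so_major_id */
+        unsigned long long minor_id;  /* eir_server_owner.so_minor_id */
+    };
+
+    /* Matching client ID, major server owner ID, and server scope
+     * permit client ID trunking (subject to verification). */
+    static bool clientid_trunking_ok(const struct exid_result *a,
+                                     const struct exid_result *b)
+    {
+        return a->clientid == b->clientid &&
+               strcmp(a->scope, b->scope) == 0 &&
+               strcmp(a->major_id, b->major_id) == 0;
+    }
+
+    /* Session trunking additionally requires matching so_minor_id. */
+    static bool session_trunking_ok(const struct exid_result *a,
+                                    const struct exid_result *b)
+    {
+        return clientid_trunking_ok(a, b) && a->minor_id == b->minor_id;
+    }
+
+    int main(void)
+    {
+        struct exid_result r1 = { 7, "scopeA", "ownerA", 1 };
+        struct exid_result r2 = { 7, "scopeA", "ownerA", 2 };
+
+        printf("client ID trunking: %d\n", clientid_trunking_ok(&r1, &r2));
+        printf("session trunking:   %d\n", session_trunking_ok(&r1, &r2));
+        return 0;
+    }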
+ + Session Trunking. If the eia_clientowner argument is the same in two + different EXCHANGE_ID requests, and the eir_clientid, + eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and + eir_server_scope results match in both EXCHANGE_ID results, then + the client is permitted to perform session trunking. If the + client has no session mapping to the tuple of eir_clientid, + eir_server_owner.so_major_id, eir_server_scope, and + eir_server_owner.so_minor_id, then it creates the session via a + CREATE_SESSION operation over one of the connections, which + associates the connection to the session. If there is a session + for the tuple, the client can send BIND_CONN_TO_SESSION to + associate the connection to the session. + + Of course, if the client does not desire to use session trunking, + it is not required to do so. It can invoke CREATE_SESSION on the + connection. This will result in client ID trunking as described + below. It can also decide to drop the connection if it does not + choose to use trunking. + + Client ID Trunking. If the eia_clientowner argument is the same in + two different EXCHANGE_ID requests, and the eir_clientid, + eir_server_owner.so_major_id, and eir_server_scope results match + in both EXCHANGE_ID results, then the client is permitted to + perform client ID trunking (regardless of whether the + eir_server_owner.so_minor_id results match). The client can + associate each connection with different sessions, where each + session is associated with the same server. + + The client completes the act of client ID trunking by invoking + CREATE_SESSION on each connection, using the same client ID that + was returned in eir_clientid. These invocations create two + sessions and also associate each connection with its respective + session. The client is free to decline to use client ID trunking + by simply dropping the connection at this point. + + When doing client ID trunking, locking state is shared across + sessions associated with that same client ID. This requires the + server to coordinate state across sessions and the client to be + able to associate the same locking state with multiple sessions. + + It is always possible that, as a result of various sorts of + reconfiguration events, eir_server_scope and eir_server_owner values + may be different on subsequent EXCHANGE_ID requests made to the same + network address. + + In most cases, such reconfiguration events will be disruptive and + indicate that an IP address formerly connected to one server is now + connected to an entirely different one. + + Some guidelines on client handling of such situations follow: + + * When eir_server_scope changes, the client has no assurance that + any IDs that it obtained previously (e.g., filehandles) can be + validly used on the new server, and, even if the new server + accepts them, there is no assurance that this is not due to + accident. Thus, it is best to treat all such state as lost or + stale, although a client may assume that the probability of + inadvertent acceptance is low and treat this situation as within + the next case. + + * When eir_server_scope remains the same and + eir_server_owner.so_major_id changes, the client can use the + filehandles it has, consider its locking state lost, and attempt + to reclaim or otherwise re-obtain its locks. It might find that + its filehandle is now stale. However, if NFS4ERR_STALE is not + returned, it can proceed to reclaim or otherwise re-obtain its + open locking state. 
+ + * When eir_server_scope and eir_server_owner.so_major_id remain the + same, the client has to use the now-current values of + eir_server_owner.so_minor_id in deciding on appropriate forms of + trunking. This may result in connections being dropped or new + sessions being created. + +2.10.5.1. Verifying Claims of Matching Server Identity + + When the server responds using two different connections that claim + matching or partially matching eir_server_owner, eir_server_scope, + and eir_clientid values, the client does not have to trust the + servers' claims. The client may verify these claims before trunking + traffic in the following ways: + + * For session trunking, clients SHOULD reliably verify if + connections between different network paths are in fact associated + with the same NFSv4.1 server and usable on the same session, and + servers MUST allow clients to perform reliable verification. When + a client ID is created, the client SHOULD specify that + BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or + SP4_MACH_CRED (Section 18.35) state protection options. For + SP4_SSV, reliable verification depends on a shared secret (the + SSV) that is established via the SET_SSV (see Section 18.47) + operation. + + When a new connection is associated with the session (via the + BIND_CONN_TO_SESSION operation, see Section 18.34), if the client + specified SP4_SSV state protection for the BIND_CONN_TO_SESSION + operation, the client MUST send the BIND_CONN_TO_SESSION with + RPCSEC_GSS protection, using integrity or privacy, and an + RPCSEC_GSS handle created with the GSS SSV mechanism (see + Section 2.10.9). + + If the client mistakenly tries to associate a connection to a + session of a wrong server, the server will either reject the + attempt because it is not aware of the session identifier of the + BIND_CONN_TO_SESSION arguments, or it will reject the attempt + because the RPCSEC_GSS authentication fails. Even if the server + mistakenly or maliciously accepts the connection association + attempt, the RPCSEC_GSS verifier it computes in the response will + not be verified by the client, so the client will know it cannot + use the connection for trunking the specified session. + + If the client specified SP4_MACH_CRED state protection, the + BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or + privacy, using the same credential that was used when the client + ID was created. Mutual authentication via RPCSEC_GSS assures the + client that the connection is associated with the correct session + of the correct server. + + * For client ID trunking, the client has at least two options for + verifying that the same client ID obtained from two different + EXCHANGE_ID operations came from the same server. The first + option is to use RPCSEC_GSS authentication when sending each + EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with + RPCSEC_GSS authentication, the client notes the principal name of + the GSS target. If the EXCHANGE_ID results indicate that client + ID trunking is possible, and the GSS targets' principal names are + the same, the servers are the same and client ID trunking is + allowed. + + The second option for verification is to use SP4_SSV protection. + When the client sends EXCHANGE_ID, it specifies SP4_SSV + protection. The first EXCHANGE_ID the client sends always has to + be confirmed by a CREATE_SESSION call. The client then sends + SET_SSV. 
Later, the client sends EXCHANGE_ID to a second destination network address different from the one the first EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID reply has the same eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. If so, the client verifies the claim by sending a CREATE_SESSION operation to the second destination address, protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the second EXCHANGE_ID. If the server accepts the CREATE_SESSION request, and if the client verifies the RPCSEC_GSS verifier and integrity codes, then the client has proof the second server knows the SSV, and thus the two servers are cooperating for the purposes of specifying server scope and client ID trunking.
+
+ 2.10.6. Exactly Once Semantics
+
+ Via the session, NFSv4.1 offers exactly once semantics (EOS) for requests sent over a channel. EOS is supported on both the fore channel and backchannel.
+
+ Each COMPOUND or CB_COMPOUND request that is sent with a leading SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver exactly once. This requirement holds regardless of whether the request is sent with reply caching specified (see Section 2.10.6.1.3). The requirement holds even if the requester is sending the request over a session created between a pNFS data client and pNFS data server. To understand the rationale for this requirement, divide the requests into three classifications:
+
+ * Non-idempotent requests.
+
+ * Idempotent modifying requests.
+
+ * Idempotent non-modifying requests.
+
+ An example of a non-idempotent request is RENAME. Obviously, if a replier executes the same RENAME request twice, and the first execution succeeds, the re-execution will fail. If the replier returns the result from the re-execution, this result is incorrect. Therefore, EOS is required for non-idempotent requests.
+
+ An example of an idempotent modifying request is a COMPOUND request containing a WRITE operation. Repeated execution of the same WRITE has the same effect as execution of that WRITE a single time. Nevertheless, enforcing EOS for WRITEs and other idempotent modifying requests is necessary to avoid data corruption.
+
+ Suppose a client sends WRITE A to a noncompliant server that does not enforce EOS, and receives no response, perhaps due to a network partition. The client reconnects to the server and re-sends WRITE A. Now, the server has two outstanding instances of A. The server can be in a situation in which it executes and replies to the retry of A, while the first A is still waiting in the server's internal I/O system for some resource. Upon receiving the reply to the second attempt of WRITE A, the client believes its WRITE is done, so it is free to send WRITE B, which overlaps the byte-range of A. When the original A is dispatched from the server's I/O system and executed (thus the second time A will have been written), then what has been written by B can be overwritten and thus corrupted.
+
+ An example of an idempotent non-modifying request is a COMPOUND containing SEQUENCE, PUTFH, READLINK, and nothing else. The re-execution of such a request will not cause data corruption or produce an incorrect result. Nonetheless, to keep the implementation simple, the replier MUST enforce EOS for all requests, whether or not idempotent and non-modifying.
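+
+ The mechanism that provides this enforcement, the slot and sequence ID scheme, is specified in Section 2.10.6.1 below. As a preview, the replier's per-slot classification can be sketched in C as follows (an editorial model with invented names; the handling of retries of requests still in progress, Section 2.10.6.2, is omitted):
+
+    #include <stdint.h>
+    #include <stdio.h>
+
+    /* One slot of the reply cache. */
+    struct slot {
+        uint32_t    seqid;         /* sequence ID last seen here  */
+        const void *cached_reply;  /* reply, once execution ends  */
+    };
+
+    enum seq_check {
+        SEQ_NEW,        /* execute, and bump the slot's sequence ID */
+        SEQ_RETRY,      /* return the cached reply                  */
+        SEQ_MISORDERED  /* return NFS4ERR_SEQ_MISORDERED            */
+    };
+
+    /* Classify an arriving sequence ID against the slot.  Unsigned
+     * arithmetic gives the required wraparound: 0xFFFFFFFF + 1 == 0. */
+    static enum seq_check check_slot(const struct slot *s,
+                                     uint32_t seqid)
+    {
+        if (seqid == s->seqid)
+            return SEQ_RETRY;
+        if (seqid == s->seqid + 1)
+            return SEQ_NEW;
+        return SEQ_MISORDERED;  /* older, or two or more ahead */
+    }
+
+    int main(void)
+    {
+        struct slot s = { 0xFFFFFFFFu, NULL };
+
+        /* Sequence ID 0 follows 0xFFFFFFFF: a new request. */
+        printf("%d\n", check_slot(&s, 0) == SEQ_NEW);
+        return 0;
+    }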
+ + Note that true and complete EOS is not possible unless the server + persists the reply cache in stable storage, and unless the server is + somehow implemented to never require a restart (indeed, if such a + server exists, the distinction between a reply cache kept in stable + storage versus one that is not is one without meaning). See + Section 2.10.6.5 for a discussion of persistence in the reply cache. + Regardless, even if the server does not persist the reply cache, EOS + improves robustness and correctness over previous versions of NFS + because the legacy duplicate request/reply caches were based on the + ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the + shortcomings of the XID as a basis for a reply cache and describes + how NFSv4.1 sessions improve upon the XID. + +2.10.6.1. Slot Identifiers and Reply Cache + + The RPC layer provides a transaction ID (XID), which, while required + to be unique, is not convenient for tracking requests for two + reasons. First, the XID is only meaningful to the requester; it + cannot be interpreted by the replier except to test for equality with + previously sent requests. When consulting an RPC-based duplicate + request cache, the opaqueness of the XID requires a computationally + expensive look up (often via a hash that includes XID and source + address). NFSv4.1 requests use a non-opaque slot ID, which is an + index into a slot table, which is far more efficient. Second, + because RPC requests can be executed by the replier in any order, + there is no bound on the number of requests that may be outstanding + at any time. To achieve perfect EOS, using ONC RPC would require + storing all replies in the reply cache. XIDs are 32 bits; storing + over four billion (2^(32)) replies in the reply cache is not + practical. In practice, previous versions of NFS have chosen to + store a fixed number of replies in the cache, and to use a least + recently used (LRU) approach to replacing cache entries with new + entries when the cache is full. In NFSv4.1, the number of + outstanding requests is bounded by the size of the slot table, and a + sequence ID per slot is used to tell the replier when it is safe to + delete a cached reply. + + In the NFSv4.1 reply cache, when the requester sends a new request, + it selects a slot ID in the range 0..N, where N is the replier's + current maximum slot ID granted to the requester on the session over + which the request is to be sent. The value of N starts out as equal + to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the + response to SEQUENCE or CB_SEQUENCE as described later in this + section. The slot ID must be unused by any of the requests that the + requester has already active on the session. "Unused" here means the + requester has no outstanding request for that slot ID. + + A slot contains a sequence ID and the cached reply corresponding to + the request sent with that sequence ID. The sequence ID is a 32-bit + unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^(32) - + 1). The first time a slot is used, the requester MUST specify a + sequence ID of one (Section 18.36). Each time a slot is reused, the + request MUST specify a sequence ID that is one greater than that of + the previous request on the slot. If the previous sequence ID was + 0xFFFFFFFF, then the next request for the slot MUST have the sequence + ID set to zero (i.e., (2^(32) - 1) + 1 mod 2^(32)). + + The sequence ID accompanies the slot ID in each request. 
It is for
+ the critical check at the replier: it is used to efficiently
+ determine whether a request using a certain slot ID is a retransmit
+ or a new, never-before-seen request. It is not feasible for the
+ requester to assert that it is retransmitting to implement this,
+ because for any given request the requester cannot know whether the
+ replier has seen it unless the replier actually replies. Of course,
+ if the requester has seen the reply, the requester would not
+ retransmit.
+
+ The replier compares each received request's sequence ID with the
+ last one previously received for that slot ID, to see if the new
+ request is:
+
+ * A new request, in which the sequence ID is one greater than that
+ previously seen in the slot (accounting for sequence wraparound).
+ The replier proceeds to execute the new request, and the replier
+ MUST increase the slot's sequence ID by one.
+
+ * A retransmitted request, in which the sequence ID is equal to that
+ currently recorded in the slot. If the original request has
+ executed to completion, the replier returns the cached reply. See
+ Section 2.10.6.2 for direction on how the replier deals with
+ retries of requests that are still in progress.
+
+ * A misordered retry, in which the sequence ID is less than
+ (accounting for sequence wraparound) that previously seen in the
+ slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
+ result from SEQUENCE or CB_SEQUENCE).
+
+ * A misordered new request, in which the sequence ID is two or more
+ than (accounting for sequence wraparound) that previously seen in
+ the slot. Note that because the sequence ID MUST wrap around to
+ zero once it reaches 0xFFFFFFFF, a misordered new request and a
+ misordered retry cannot be distinguished. Thus, the replier MUST
+ return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
+ CB_SEQUENCE).
+
+ Unlike the XID, the slot ID is always within a specific range; this
+ has two implications. The first implication is that for a given
+ session, the replier need only cache the results of a limited number
+ of COMPOUND requests. The second implication derives from the first,
+ which is that unlike XID-indexed reply caches (also known as
+ duplicate request caches - DRCs), the slot ID-based reply cache
+ cannot be overflowed. Through use of the sequence ID to identify
+ retransmitted requests, the replier does not need to actually cache
+ the request itself, reducing the storage requirements of the reply
+ cache further. These facilities make it practical to maintain all
+ the required entries for an effective reply cache.
+
+ The slot ID, sequence ID, and session ID therefore take over the
+ traditional role of the XID and source network address in the
+ replier's reply cache implementation. This approach is considerably
+ more portable and completely robust -- it is not subject to the
+ reassignment of ports as clients reconnect over IP networks. In
+ addition, the RPC XID is not used in the reply cache, enhancing
+ robustness of the cache in the face of any rapid reuse of XIDs by the
+ requester. While the replier does not care about the XID for the
+ purposes of reply cache management (but the replier MUST return the
+ same XID that was in the request), nonetheless there are
+ considerations for the XID in NFSv4.1 that are the same as all other
+ previous versions of NFS. The RPC XID remains in each message and
+ needs to be formulated in NFSv4.1 requests as in any other ONC RPC
+ request.
The reasons include: + + * The RPC layer retains its existing semantics and implementation. + + * The requester and replier must be able to interoperate at the RPC + layer, prior to the NFSv4.1 decoding of the SEQUENCE or + CB_SEQUENCE operation. + + * If an operation is being used that does not start with SEQUENCE or + CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is + needed for correct operation to match the reply to the request. + + * The SEQUENCE or CB_SEQUENCE operation may generate an error. If + so, the embedded slot ID, sequence ID, and session ID (if present) + in the request will not be in the reply, and the requester has + only the XID to match the reply to the request. + + Given that well-formulated XIDs continue to be required, this raises + the question: why do SEQUENCE and CB_SEQUENCE replies have a session + ID, slot ID, and sequence ID? Having the session ID in the reply + means that the requester does not have to use the XID to look up the + session ID, which would be necessary if the connection were + associated with multiple sessions. Having the slot ID and sequence + ID in the reply means that the requester does not have to use the XID + to look up the slot ID and sequence ID. Furthermore, since the XID + is only 32 bits, it is too small to guarantee the re-association of a + reply with its request [44]; having session ID, slot ID, and sequence + ID in the reply allows the client to validate that the reply in fact + belongs to the matched request. + + The SEQUENCE (and CB_SEQUENCE) operation also carries a + "highest_slotid" value, which carries additional requester slot usage + information. The requester MUST always indicate the slot ID + representing the outstanding request with the highest-numbered slot + value. The requester should in all cases provide the most + conservative value possible, although it can be increased somewhat + above the actual instantaneous usage to maintain some minimum or + optimal level. This provides a way for the requester to yield unused + request slots back to the replier, which in turn can use the + information to reallocate resources. + + The replier responds with both a new target highest_slotid and an + enforced highest_slotid, described as follows: + + * The target highest_slotid is an indication to the requester of the + highest_slotid the replier wishes the requester to be using. This + permits the replier to withdraw (or add) resources from a + requester that has been found to not be using them, in order to + more fairly share resources among a varying level of demand from + other requesters. The requester must always comply with the + replier's value updates, since they indicate newly established + hard limits on the requester's access to session resources. + However, because of request pipelining, the requester may have + active requests in flight reflecting prior values; therefore, the + replier must not immediately require the requester to comply. + + * The enforced highest_slotid indicates the highest slot ID the + requester is permitted to use on a subsequent SEQUENCE or + CB_SEQUENCE operation. The replier's enforced highest_slotid + SHOULD be no less than the highest_slotid the requester indicated + in the SEQUENCE or CB_SEQUENCE arguments. + + A requester can be intransigent with respect to lowering its + highest_slotid argument to a Sequence operation, i.e. 
the + requester continues to ignore the target highest_slotid in the + response to a Sequence operation, and continues to set its + highest_slotid argument to be higher than the target + highest_slotid. This can be considered particularly egregious + behavior when the replier knows there are no outstanding requests + with slot IDs higher than its target highest_slotid. When faced + with such intransigence, the replier is free to take more forceful + action, and MAY reply with a new enforced highest_slotid that is + less than its previous enforced highest_slotid. Thereafter, if + the requester continues to send requests with a highest_slotid + that is greater than the replier's new enforced highest_slotid, + the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in + the request is greater than the new enforced highest_slotid and + the request is a retry. + + The replier SHOULD retain the slots it wants to retire until the + requester sends a request with a highest_slotid less than or equal + to the replier's new enforced highest_slotid. + + The requester can also be intransigent with respect to sending + non-retry requests that have a slot ID that exceeds the replier's + highest_slotid. Once the replier has forcibly lowered the + enforced highest_slotid, the requester is only allowed to send + retries on slots that exceed the replier's highest_slotid. If a + request is received with a slot ID that is higher than the new + enforced highest_slotid, and the sequence ID is one higher than + what is in the slot's reply cache, then the server can both retire + the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT + do one and not the other). The reason it is safe to retire the + slot is because by using the next sequence ID, the requester is + indicating it has received the previous reply for the slot. + + * The requester SHOULD use the lowest available slot when sending a + new request. This way, the replier may be able to retire slot + entries faster. However, where the replier is actively adjusting + its granted highest_slotid, it will not be able to use only the + receipt of the slot ID and highest_slotid in the request. Neither + the slot ID nor the highest_slotid used in a request may reflect + the replier's current idea of the requester's session limit, + because the request may have been sent from the requester before + the update was received. Therefore, in the downward adjustment + case, the replier may have to retain a number of reply cache + entries at least as large as the old value of maximum requests + outstanding, until it can infer that the requester has seen a + reply containing the new granted highest_slotid. The replier can + infer that the requester has seen such a reply when it receives a + new request with the same slot ID as the request replied to and + the next higher sequence ID. + +2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies + + When a SEQUENCE or CB_SEQUENCE operation is successfully executed, + its reply MUST always be cached. Specifically, session ID, sequence + ID, and slot ID MUST be cached in the reply cache. The reply from + SEQUENCE also includes the highest slot ID, target highest slot ID, + and status flags. Instead of caching these values, the server MAY + re-compute the values from the current state of the fore channel, + session, and/or client ID as appropriate. Similarly, the reply from + CB_SEQUENCE includes a highest slot ID and target highest slot ID. 
+
+ The client MAY re-compute the values from the current state of the
+ session as appropriate.
+
+ Regardless of whether or not a replier is re-computing highest slot
+ ID, target slot ID, and status on replies to retries, the requester
+ MUST NOT assume that the values are being re-computed whenever it
+ receives a reply after a retry is sent, since it has no way of
+ knowing whether the reply it has received was sent by the replier in
+ response to the retry or is a delayed response to the original
+ request. Therefore, it may be the case that highest slot ID, target
+ slot ID, or status bits may reflect the state of affairs when the
+ request was first executed. Although acting based on such delayed
+ information is valid, it may cause the receiver of the reply to do
+ unneeded work. Requesters MAY choose to send additional requests to
+ get the current state of affairs or use the state of affairs reported
+ by subsequent requests, in preference to acting immediately on data
+ that might be out of date.
+
+2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE
+
+ Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of
+ the slot MUST NOT change. The replier MUST NOT modify the reply
+ cache entry for the slot whenever an error is returned from SEQUENCE
+ or CB_SEQUENCE.
+
+2.10.6.1.3. Optional Reply Caching
+
+ On a per-request basis, the requester can choose to direct the
+ replier to cache the reply to all operations after the first
+ operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or
+ csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE.
+ The reason it would not direct the replier to cache the entire reply
+ is that the request is composed of all idempotent operations [41].
+ Caching the reply may offer little benefit. If the reply is too
+ large (see Section 2.10.6.4), it may not be cacheable anyway. Even
+ if the reply to an idempotent request is small enough to cache,
+ unnecessarily caching the reply slows down the server and increases
+ RPC latency.
+
+ Whether or not the requester requests the reply to be cached has no
+ effect on the slot processing. If the result of SEQUENCE or
+ CB_SEQUENCE is NFS4_OK, then the slot's sequence ID MUST be
+ incremented by one. If a requester does not direct the replier to
+ cache the reply, the replier MUST do one of the following:
+
+ * The replier can cache the entire original reply. Even though
+ sa_cachethis or csa_cachethis is FALSE, the replier is always free
+ to cache. It may choose this approach in order to simplify
+ implementation.
+
+ * The replier enters into its reply cache a reply consisting of the
+ original results to the SEQUENCE or CB_SEQUENCE operation, and
+ with the next operation in COMPOUND or CB_COMPOUND having the
+ error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later
+ retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. If a
+ replier receives a retried Sequence operation where the reply to
+ the COMPOUND or CB_COMPOUND was not cached, then the replier,
+
+ - MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence
+ operation if the Sequence operation is not the first operation
+ (granted, a requester that does so is in violation of the
+ NFSv4.1 protocol).
+
+ - MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a
+ Sequence operation if the Sequence operation is the first
+ operation.
+
+ * If the second operation is an illegal operation, or an operation
+ that was legal in a previous minor version of NFSv4 and MUST NOT
+ be supported in the current minor version (e.g., SETCLIENTID), the
+ replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP. Instead
+ the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or
+ NFS4ERR_NOTSUPP as appropriate.
+
+ * If the second operation can result in another error status, the
+ replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP,
+ provided the operation is not executed in such a way that the
+ state of the replier is changed. Examples of such an error status
+ include: NFS4ERR_NOTSUPP returned for an operation that is legal
+ but not REQUIRED in the current minor version, and thus not
+ supported by the replier; NFS4ERR_SEQUENCE_POS; and
+ NFS4ERR_REQ_TOO_BIG.
+
+ The discussion above assumes that the retried request matches the
+ original one. Section 2.10.6.1.3.1 discusses what the replier might
+ do, and MUST do when original and retried requests do not match.
+ Since the replier may only cache a small amount of the information
+ that would be required to determine whether this is a case of a false
+ retry, the replier may send to the client any of the following
+ responses:
+
+ * The cached reply to the original request (if the replier has
+ cached it in its entirety and the users of the original request
+ and retry match).
+
+ * A reply that consists only of the Sequence operation with the
+ error NFS4ERR_SEQ_FALSE_RETRY.
+
+ * A reply consisting of the response to Sequence with the status
+ NFS4_OK, together with the second operation as it appeared in the
+ retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or
+ other error as described above.
+
+ * A reply that consists of the response to Sequence with the status
+ NFS4_OK, together with the second operation as it appeared in the
+ original request with an error of NFS4ERR_RETRY_UNCACHED_REP or
+ other error as described above.
+
+2.10.6.1.3.1. False Retry
+
+ If a requester sent a Sequence operation with a slot ID and sequence
+ ID that are in the reply cache but the replier detected that the
+ retried request is not the same as the original request, then this is
+ a false retry. Examples include a retry that has different
+ operations or different arguments in the operations from the
+ original, and a retry that uses a different principal in the RPC
+ request's credential field that translates to a different user. When
+ the replier detects a false retry, it is permitted (but not always
+ obligated) to return NFS4ERR_SEQ_FALSE_RETRY in response to the
+ Sequence operation.
+
+ Translations of particularly privileged user values to other users
+ due to the lack of appropriately secure credentials, as configured on
+ the replier, should be applied before determining whether the users
+ are the same or different. If the replier determines the users are
+ different between the original request and a retry, then the replier
+ MUST return NFS4ERR_SEQ_FALSE_RETRY.
+
+ If an operation of the retry is an illegal operation, or an operation
+ that was legal in a previous minor version of NFSv4 and MUST NOT be
+ supported in the current minor version (e.g., SETCLIENTID), the
+ replier MAY return NFS4ERR_SEQ_FALSE_RETRY (and MUST do so if the
+ users of the original request and retry differ). Otherwise, the
+ replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or
+ NFS4ERR_NOTSUPP as appropriate.
Note that the handling is in
+ contrast to how the replier deals with retries of requests with no
+ cached reply. The difference is due to NFS4ERR_SEQ_FALSE_RETRY being
+ a valid error for only Sequence operations, whereas
+ NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except
+ illegal operations and operations that MUST NOT be supported in the
+ current minor version of NFSv4.
+
+2.10.6.2. Retry and Replay of Reply
+
+ A requester MUST NOT retry a request, unless the connection it used
+ to send the request disconnects. The requester can then reconnect
+ and re-send the request, or it can re-send the request over a
+ different connection that is associated with the same session.
+
+ If the requester is a server wanting to re-send a callback operation
+ over the backchannel of a session, the requester of course cannot
+ reconnect because only the client can associate connections with the
+ backchannel. The server can re-send the request over another
+ connection that is bound to the same session's backchannel. If there
+ is no such connection, the server MUST indicate that the session has
+ no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag
+ bit in the response to the next SEQUENCE operation from the client.
+ The client MUST then associate a connection with the session (or
+ destroy the session).
+
+ Note that it is not fatal for a requester to retry without a
+ disconnect between the request and retry. However, the retry does
+ consume resources, especially with RDMA, where each request, retry or
+ not, consumes a credit. Retries for no reason, especially retries
+ sent shortly after the previous attempt, are a poor use of network
+ bandwidth and defeat the purpose of a transport's inherent congestion
+ control system.
+
+ A requester MUST wait for a reply to a request before using the slot
+ for another request. If it does not wait for a reply, then the
+ requester does not know what sequence ID to use for the slot on its
+ next request. For example, suppose a requester sends a request with
+ sequence ID 1, and does not wait for the response. The next time it
+ uses the slot, it sends the new request with sequence ID 2. If the
+ replier has not seen the request with sequence ID 1, then the replier
+ is not expecting sequence ID 2, and rejects the requester's new
+ request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
+ CB_SEQUENCE).
+
+ RDMA fabrics do not guarantee that the memory handles (Steering Tags)
+ within each RPC/RDMA "chunk" [32] are valid on a scope outside that
+ of a single connection. Therefore, handles used by the direct
+ operations become invalid after connection loss. The server must
+ ensure that any RDMA operations that must be replayed from the reply
+ cache use the newly provided handle(s) from the most recent request.
+
+ A retry might be sent while the original request is still in progress
+ on the replier. The replier SHOULD deal with the issue by returning
+ NFS4ERR_DELAY as the reply to SEQUENCE or CB_SEQUENCE operation, but
+ implementations MAY return NFS4ERR_SEQ_MISORDERED. Since errors from
+ SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this
+ approach allows the results of the execution of the original request
+ to be properly recorded in the reply cache (assuming that the
+ requester specified the reply to be cached).
+
+2.10.6.3. Resolving Server Callback Races
+
+ It is possible for server callbacks to arrive at the client before
+ the reply from related fore channel operations.
For example, a + client may have been granted a delegation to a file it has opened, + but the reply to the OPEN (informing the client of the granting of + the delegation) may be delayed in the network. If a conflicting + operation arrives at the server, it will recall the delegation using + the backchannel, which may be on a different transport connection, + perhaps even a different network, or even a different session + associated with the same client ID. + + The presence of a session between the client and server alleviates + this issue. When a session is in place, each client request is + uniquely identified by its { session ID, slot ID, sequence ID } + triple. By the rules under which slot entries (reply cache entries) + are retired, the server has knowledge whether the client has "seen" + each of the server's replies. The server can therefore provide + sufficient information to the client to allow it to disambiguate + between an erroneous or conflicting callback race condition. + + For each client operation that might result in some sort of server + callback, the server SHOULD "remember" the { session ID, slot ID, + sequence ID } triple of the client request until the slot ID + retirement rules allow the server to determine that the client has, + in fact, seen the server's reply. Until the time the { session ID, + slot ID, sequence ID } request triple can be retired, any recalls of + the associated object MUST carry an array of these referring + identifiers (in the CB_SEQUENCE operation's arguments), for the + benefit of the client. After this time, it is not necessary for the + server to provide this information in related callbacks, since it is + certain that a race condition can no longer occur. + + The CB_SEQUENCE operation that begins each server callback carries a + list of "referring" { session ID, slot ID, sequence ID } triples. If + the client finds the request corresponding to the referring session + ID, slot ID, and sequence ID to be currently outstanding (i.e., the + server's reply has not been seen by the client), it can determine + that the callback has raced the reply, and act accordingly. If the + client does not find the request corresponding to the referring + triple to be outstanding (including the case of a session ID + referring to a destroyed session), then there is no race with respect + to this triple. The server SHOULD limit the referring triples to + requests that refer to just those that apply to the objects referred + to in the CB_COMPOUND procedure. + + The client must not simply wait forever for the expected server reply + to arrive before responding to the CB_COMPOUND that won the race, + because it is possible that it will be delayed indefinitely. The + client should assume the likely case that the reply will arrive + within the average round-trip time for COMPOUND requests to the + server, and wait that period of time. If that period of time + expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY. There + are other scenarios under which callbacks may race replies. Among + them are pNFS layout recalls as described in Section 12.5.5.2. + +2.10.6.4. COMPOUND and CB_COMPOUND Construction Issues + + Very large requests and replies may pose both buffer management + issues (especially with RDMA) and reply cache issues. 
When the + session is created (Section 18.36), for each channel (fore and back), + the client and server negotiate the maximum-sized request they will + send or process (ca_maxrequestsize), the maximum-sized reply they + will return or process (ca_maxresponsesize), and the maximum-sized + reply they will store in the reply cache (ca_maxresponsesize_cached). + + If a request exceeds ca_maxrequestsize, the reply will have the + status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG + as the status for the first operation (SEQUENCE or CB_SEQUENCE) in + the request (which means that no operations in the request executed + and that the state of the slot in the reply cache is unchanged), or + it MAY opt to return it on a subsequent operation in the same + COMPOUND or CB_COMPOUND request (which means that at least one + operation did execute and that the state of the slot in the reply + cache does change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on + the operation that exceeds ca_maxrequestsize. + + If a reply exceeds ca_maxresponsesize, the reply will have the status + NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the + status for the first operation (SEQUENCE or CB_SEQUENCE) in the + request, or it MAY opt to return it on a subsequent operation (in the + same COMPOUND or CB_COMPOUND reply). A replier MAY return + NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if + the response would still exceed ca_maxresponsesize. + + If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache + a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE + operation (see Section 2.10.6.1.2). If the reply exceeds + ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is + TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. + Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that + matter) is returned on an operation other than the first operation + (SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if + sa_cachethis or csa_cachethis is TRUE. For example, if a COMPOUND + has eleven operations, including SEQUENCE, the fifth operation is a + RENAME, and the tenth operation is a READ for one million bytes, the + server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth + operation. Since the server executed several operations, especially + the non-idempotent RENAME, the client's request to cache the reply + needs to be honored in order for the correct operation of exactly + once semantics. If the client retries the request, the server will + have cached a reply that contains results for ten of the eleven + requested operations, with the tenth operation having a status of + NFS4ERR_REP_TOO_BIG_TO_CACHE. + + A client needs to take care that, when sending operations that change + the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and + RESTOREFH), it does not exceed the maximum reply buffer before the + GETFH operation. Otherwise, the client will have to retry the + operation that changed the current filehandle, in order to obtain the + desired filehandle. For the OPEN operation (see Section 18.16), + retry is not always available as an option. The following guidelines + for the handling of filehandle-changing operations are advised: + + * Within the same COMPOUND procedure, a client SHOULD send GETFH + immediately after a current filehandle-changing operation. 
A + client MUST send GETFH after a current filehandle-changing + operation that is also non-idempotent (e.g., the OPEN operation), + unless the operation is RESTOREFH. RESTOREFH is an exception, + because even though it is non-idempotent, the filehandle RESTOREFH + produced originated from an operation that is either idempotent + (e.g., PUTFH, LOOKUP), or non-idempotent (e.g., OPEN, CREATE). If + the origin is non-idempotent, then because the client MUST send + GETFH after the origin operation, the client can recover if + RESTOREFH returns an error. + + * A server MAY return NFS4ERR_REP_TOO_BIG or + NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a + filehandle-changing operation if the reply would be too large on + the next operation. + + * A server SHOULD return NFS4ERR_REP_TOO_BIG or + NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a + filehandle-changing, non-idempotent operation if the reply would + be too large on the next operation, especially if the operation is + OPEN. + + * A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent + current filehandle-changing operation, if it looks at the next + operation (in the same COMPOUND procedure) and finds it is not + GETFH. The server SHOULD do this if it is unable to determine in + advance whether the total response size would exceed + ca_maxresponsesize_cached or ca_maxresponsesize. + +2.10.6.5. Persistence + + Since the reply cache is bounded, it is practical for the reply cache + to persist across server restarts. The replier MUST persist the + following information if it agreed to persist the session (when the + session was created; see Section 18.36): + + * The session ID. + + * The slot table including the sequence ID and cached reply for each + slot. + + The above are sufficient for a replier to provide EOS semantics for + any requests that were sent and executed before the server restarted. + If the replier is a client, then there is no need for it to persist + any more information, unless the client will be persisting all other + state across client restart, in which case, the server will never see + any NFSv4.1-level protocol manifestation of a client restart. If the + replier is a server, with just the slot table and session ID + persisting, any requests the client retries after the server restart + will return the results that are cached in the reply cache, and any + new requests (i.e., the sequence ID is one greater than the slot's + sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by + SEQUENCE). Such a session is considered dead. A server MAY re- + animate a session after a server restart so that the session will + accept new requests as well as retries. To re-animate a session, the + server needs to persist additional information through server + restart: + + * The client ID. This is a prerequisite to let the client create + more sessions associated with the same client ID as the re- + animated session. + + * The client ID's sequence ID that is used for creating sessions + (see Sections 18.35 and 18.36). This is a prerequisite to let the + client create more sessions. + + * The principal that created the client ID. This allows the server + to authenticate the client when it sends EXCHANGE_ID. + + * The SSV, if SP4_SSV state protection was specified when the client + ID was created (see Section 18.35). This lets the client create + new sessions, and associate connections with the new and existing + sessions. + + * The properties of the client ID as defined in Section 18.35. 
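+
+ The information enumerated above can be pictured as a persistent
+ record kept by the server. The following sketch is illustrative
+ Python; the record and field names are hypothetical and chosen for
+ this example, not taken from the protocol XDR:
+
+     from dataclasses import dataclass, field
+     from typing import Dict, Optional
+
+     @dataclass
+     class SlotEntry:
+         seq_id: int          # last sequence ID executed on this slot
+         cached_reply: bytes  # reply cached for EOS replay
+
+     @dataclass
+     class PersistentSession:
+         # Sufficient for EOS on requests executed before the restart:
+         session_id: bytes
+         slot_table: Dict[int, SlotEntry] = field(default_factory=dict)
+         # Additional state a server would persist to re-animate the
+         # session so it accepts new requests as well as retries:
+         client_id: Optional[int] = None
+         create_session_seq_id: Optional[int] = None  # client ID's
+                                                      # sequence ID
+         principal: Optional[str] = None  # creator of the client ID
+         ssv: Optional[bytes] = None      # if SP4_SSV was specified
+
+ Whether only the first two fields or the full record is persisted
+ determines whether the restarted session is dead (retries only) or
+ re-animated (retries and new requests).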
+ + A persistent reply cache places certain demands on the server. The + execution of the sequence of operations (starting with SEQUENCE) and + placement of its results in the persistent cache MUST be atomic. If + a client retries a sequence of operations that was previously + executed on the server, the only acceptable outcomes are either the + original cached reply or an indication that the client ID or session + has been lost (indicating a catastrophic loss of the reply cache or a + session that has been deleted because the client failed to use the + session for an extended period of time). + + A server could fail and restart in the middle of a COMPOUND procedure + that contains one or more non-idempotent or idempotent-but-modifying + operations. This creates an even higher challenge for atomic + execution and placement of results in the reply cache. One way to + view the problem is as a single transaction consisting of each + operation in the COMPOUND followed by storing the result in + persistent storage, then finally a transaction commit. If there is a + failure before the transaction is committed, then the server rolls + back the transaction. If the server itself fails, then when it + restarts, its recovery logic could roll back the transaction before + starting the NFSv4.1 server. + + While the description of the implementation for atomic execution of + the request and caching of the reply is beyond the scope of this + document, an example implementation for NFSv2 [45] is described in + [46]. + +2.10.7. RDMA Considerations + + A complete discussion of the operation of RPC-based protocols over + RDMA transports is in [32]. A discussion of the operation of NFSv4, + including NFSv4.1, over RDMA is in [33]. Where RDMA is considered, + this specification assumes the use of such a layering; it addresses + only the upper-layer issues relevant to making best use of RPC/RDMA. + +2.10.7.1. RDMA Connection Resources + + RDMA requires its consumers to register memory and post buffers of a + specific size and number for receive operations. + + Registration of memory can be a relatively high-overhead operation, + since it requires pinning of buffers, assignment of attributes (e.g., + readable/writable), and initialization of hardware translation. + Preregistration is desirable to reduce overhead. These registrations + are specific to hardware interfaces and even to RDMA connection + endpoints; therefore, negotiation of their limits is desirable to + manage resources effectively. + + Following basic registration, these buffers must be posted by the RPC + layer to handle receives. These buffers remain in use by the RPC/ + NFSv4.1 implementation; the size and number of them must be known to + the remote peer in order to avoid RDMA errors that would cause a + fatal error on the RDMA connection. + + NFSv4.1 manages slots as resources on a per-session basis (see + Section 2.10), while RDMA connections manage credits on a per- + connection basis. This means that in order for a peer to send data + over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and + an RDMA credit. If multiple RDMA connections are associated with a + session, then if the total number of credits across all RDMA + connections associated with the session is X, and the number of slots + in the session is Y, then the maximum number of outstanding requests + is the lesser of X and Y. + +2.10.7.2. 
Flow Control
+
+ Previous versions of NFS do not provide flow control; instead, they
+ rely on the windowing provided by transports like TCP to throttle
+ requests. This does not work with RDMA, which provides no operation
+ flow control and will terminate a connection in error when limits are
+ exceeded. Limits such as maximum number of requests outstanding are
+ therefore negotiated when a session is created (see the
+ ca_maxrequests field in Section 18.36). These limits then provide
+ the maxima within which each connection associated with the session's
+ channel(s) must remain. RDMA connections are managed within these
+ limits as described in Section 3.3 of [32]; if there are multiple
+ RDMA connections, then the maximum number of requests for a channel
+ will be divided among the RDMA connections. Put a different way, the
+ onus is on the replier to ensure that the total number of RDMA
+ credits across all connections associated with the replier's channel
+ does not exceed the channel's maximum number of outstanding requests.
+
+ The limits may also be modified dynamically at the replier's choosing
+ by manipulating certain parameters present in each NFSv4.1 reply. In
+ addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
+ can be sent by a server to a client to return RDMA credits to the
+ server, thereby lowering the maximum number of requests a client can
+ have outstanding to the server.
+
+2.10.7.3. Padding
+
+ Header padding is requested by each peer at session initiation (see
+ the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
+ and subsequently used by the RPC RDMA layer, as described in [32].
+ Zero padding is permitted.
+
+ Padding leverages the useful property that RDMA transfers preserve
+ alignment of data, even when they are placed into anonymous
+ (untagged) buffers. If requested, client inline writes will insert
+ appropriate pad bytes within the request header to align the data
+ payload on the specified boundary. The client is encouraged to add
+ sufficient padding (up to the negotiated size) so that the "data"
+ field of the WRITE operation is aligned. Most servers can make good
+ use of such padding, which allows them to chain receive buffers in
+ such a way that any data carried by client requests will be placed
+ into appropriate buffers at the server, ready for file system
+ processing. The receiver's RPC layer encounters no overhead from
+ skipping over pad bytes, and the RDMA layer's high performance makes
+ the insertion and transmission of padding on the sender a significant
+ optimization. In this way, the need for servers to perform RDMA Read
+ to satisfy all but the largest client writes is obviated. An added
+ benefit is the reduction of message round trips on the network -- a
+ potentially good trade, where latency is present.
+
+ The value to choose for padding is subject to a number of criteria.
+ A primary source of variable-length data in the RPC header is the
+ authentication information, the form of which is client-determined,
+ possibly in response to server specification. The contents of
+ COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
+ go into the determination of a maximal NFSv4.1 request size and
+ therefore minimal buffer size. The client must select its offered
+ value carefully, so as to avoid overburdening the server, and vice
+ versa. The benefit of an appropriate padding value is higher
+ performance.
+ + Sender gather: + |RPC Request|Pad bytes|Length| -> |User data...| + \------+----------------------/ \ + \ \ + \ Receiver scatter: \-----------+- ... + /-----+----------------\ \ \ + |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... + + In the above case, the server may recycle unused buffers to the next + posted receive if unused by the actual received request, or may pass + the now-complete buffers by reference for normal write processing. + For a server that can make use of it, this removes any need for data + copies of incoming data, without resorting to complicated end-to-end + buffer advertisement and management. This includes most kernel-based + and integrated server designs, among many others. The client may + perform similar optimizations, if desired. + +2.10.7.4. Dual RDMA and Non-RDMA Transports + + Some RDMA transports (e.g., RFC 5040 [8]) permit a "streaming" (non- + RDMA) phase, where ordinary traffic might flow before "stepping up" + to RDMA mode, commencing RDMA traffic. Some RDMA transports start + connections always in RDMA mode. NFSv4.1 allows, but does not + assume, a streaming phase before RDMA mode. When a connection is + associated with a session, the client and server negotiate whether + the connection is used in RDMA or non-RDMA mode (see Sections 18.36 + and 18.34). + +2.10.8. Session Security + +2.10.8.1. Session Callback Security + + Via session/connection association, NFSv4.1 improves security over + that provided by NFSv4.0 for the backchannel. The connection is + client-initiated (see Section 18.34) and subject to the same firewall + and routing checks as the fore channel. At the client's option (see + Section 18.35), connection association is fully authenticated before + being activated (see Section 18.34). Traffic from the server over + the backchannel is authenticated exactly as the client specifies (see + Section 2.10.8.2). + +2.10.8.2. Backchannel RPC Security + + When the NFSv4.1 client establishes the backchannel, it informs the + server of the security flavors and principals to use when sending + requests. If the security flavor is RPCSEC_GSS, the client expresses + the principal in the form of an established RPCSEC_GSS context. The + server is free to use any of the flavor/principal combinations the + client offers, but it MUST NOT use unoffered combinations. This way, + the client need not provide a target GSS principal for the + backchannel as it did with NFSv4.0, nor does the server have to + implement an RPCSEC_GSS initiator as it did with NFSv4.0 [37]. + + The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL + (Section 18.33) operations allow the client to specify flavor/ + principal combinations. + + Also note that the SP4_SSV state protection mode (see Sections 18.35 + and 2.10.8.3) has the side benefit of providing SSV-derived + RPCSEC_GSS contexts (Section 2.10.9). + +2.10.8.3. Protection from Unauthorized State Changes + + As described to this point in the specification, the state model of + NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation + with a forged session ID and with a slot ID that it expects the + legitimate client to use next. When the legitimate client uses the + slot ID with the same sequence number, the server returns the + attacker's result from the reply cache, which disrupts the legitimate + client and thus denies service to it. Similarly, an attacker could + send a CREATE_SESSION with a forged client ID to create a new session + associated with the client ID. 
The attacker could send requests
+ using the new session that change locking state, such as LOCKU
+ operations to release locks the legitimate client has acquired.
+ Setting a security policy on the file that requires RPCSEC_GSS
+ credentials when manipulating the file's state is one potential
+ workaround, but has the disadvantage of preventing a legitimate
+ client from releasing state when RPCSEC_GSS is required to do so, but
+ a GSS context cannot be obtained (possibly because the user has
+ logged off the client).
+
+ NFSv4.1 provides three options to a client for state protection,
+ which are specified when a client creates a client ID via EXCHANGE_ID
+ (Section 18.35).
+
+ The first (SP4_NONE) is to simply waive state protection.
+
+ The other two options (SP4_MACH_CRED and SP4_SSV) share several
+ traits:
+
+ * An RPCSEC_GSS-based credential is used to authenticate client ID
+ and session maintenance operations, including creating and
+ destroying a session, associating a connection with the session,
+ and destroying the client ID.
+
+ * Because RPCSEC_GSS is used to authenticate client ID and session
+ maintenance, the attacker cannot associate a rogue connection with
+ a legitimate session, or associate a rogue session with a
+ legitimate client ID in order to maliciously alter the client ID's
+ lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.
+
+ * In cases where the server's security policies on a portion of its
+ namespace require RPCSEC_GSS authentication, a client may have to
+ use an RPCSEC_GSS credential to remove per-file state (e.g.,
+ LOCKU, CLOSE, etc.). The server may require that the principal
+ that removes the state match certain criteria (e.g., the principal
+ might have to be the same as the one that acquired the state).
+ However, the client might not have an RPCSEC_GSS context for such
+ a principal, and might not be able to create such a context
+ (perhaps because the user has logged off). When the client
+ establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a
+ list of operations that the server MUST allow using the machine
+ credential (if SP4_MACH_CRED is used) or the SSV credential (if
+ SP4_SSV is used).
+
+ The SP4_MACH_CRED state protection option uses a machine credential
+ where the principal that creates the client ID MUST also be the
+ principal that performs client ID and session maintenance operations.
+ The security of the machine credential state protection approach
+ depends entirely on safeguarding the per-machine credential.
+ Assuming a proper safeguard, using the per-machine credential for
+ operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
+ DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from
+ associating a rogue connection with a session, or associating a rogue
+ session with a client ID.
+
+ There are at least three scenarios for the SP4_MACH_CRED option:
+
+ 1. The system administrator configures a unique, permanent per-
+ machine credential for one of the mandated GSS mechanisms (e.g.,
+ if Kerberos V5 is used, a "keytab" containing a principal derived
+ from a client host name could be used).
+
+ 2. The client is used by a single user, and so the client ID and its
+ sessions are used by just that user. If the user's credential
+ expires, then session and client ID maintenance cannot occur, but
+ since the client has a single user, only that user is
+ inconvenienced.
+
+ 3. The physical client has multiple users, but the client
+ implementation has a unique client ID for each user.
This is
+ effectively the same as the second scenario, but a disadvantage
+ is that each user needs to be allocated at least one session, so
+ the approach suffers from lack of economy.
+
+ The SP4_SSV protection option uses the SSV (Section 1.7), via
+ RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect
+ state from attack. The SP4_SSV protection option is intended for the
+ situation comprised of a client that has multiple active users and a
+ system administrator who wants to avoid the burden of installing a
+ permanent machine credential on each client. The SSV is established
+ and updated on the server via SET_SSV (see Section 18.47). To
+ prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS
+ with the privacy service. Several aspects of the SSV make it
+ intractable for an attacker to guess the SSV, and thus associate
+ rogue connections with a session, and rogue sessions with a client
+ ID:
+
+ * The arguments to and results of SET_SSV include digests of the old
+ and new SSV, respectively.
+
+ * Because the initial value of the SSV is zero and therefore known,
+ the client that opts for SP4_SSV protection and opts to apply
+ SP4_SSV protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST
+ send at least one SET_SSV operation before the first
+ BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION
+ operation on a client ID. If it does not, the SSV mechanism will
+ not generate tokens (Section 2.10.9). A client SHOULD send SET_SSV
+ as soon as a session is created.
+
+ * A SET_SSV request does not replace the SSV with the argument to
+ SET_SSV. Instead, the current SSV on the server is logically
+ exclusive ORed (XORed) with the argument to SET_SSV. Each time a
+ new principal uses a client ID for the first time, the client
+ SHOULD send a SET_SSV with that principal's RPCSEC_GSS
+ credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.
+
+ Here are the types of attacks that can be attempted by an attacker
+ named Eve on a victim named Bob, and how SP4_SSV protection foils
+ each attack:
+
+ * Suppose Eve is the first user to log into a legitimate client.
+ Eve's use of an NFSv4.1 file system will cause the legitimate
+ client to create a client ID with SP4_SSV protection, specifying
+ that the BIND_CONN_TO_SESSION operation MUST use the SSV
+ credential. Eve's use of the file system also causes an SSV to be
+ created. The SET_SSV operation that creates the SSV will be
+ protected by the RPCSEC_GSS context created by the legitimate
+ client, which uses Eve's GSS principal and credentials. Eve can
+ eavesdrop on the network while her RPCSEC_GSS context is created
+ and the SET_SSV using her context is sent. Even if the legitimate
+ client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve
+ knows her own credentials, she can decrypt the SSV. Eve can
+ compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will
+ accept, and so associate a new connection with the legitimate
+ session. Eve can change the slot ID and sequence state of a
+ legitimate session, and/or the SSV state, in such a way that when
+ Bob accesses the server via the same legitimate client, the
+ legitimate client will be unable to use the session.
+
+ The client's only recourse is to create a new client ID for Bob to
+ use, and establish a new SSV for the client ID. The client will
+ be unable to delete the old client ID, and will let the lease on
+ the old client ID expire.
+ + Once the legitimate client establishes an SSV over the new session + using Bob's RPCSEC_GSS context, Eve can use the new session via + the legitimate client, but she cannot disrupt Bob. Moreover, + because the client SHOULD have modified the SSV due to Eve using + the new session, Bob cannot get revenge on Eve by associating a + rogue connection with the session. + + The question is how did the legitimate client detect that Eve has + hijacked the old session? When the client detects that a new + principal, Bob, wants to use the session, it SHOULD have sent a + SET_SSV, which leads to the following sub-scenarios: + + - Let us suppose that from the rogue connection, Eve sent a + SET_SSV with the same slot ID and sequence ID that the + legitimate client later uses. The server will assume the + SET_SSV sent with Bob's credentials is a retry, and return to + the legitimate client the reply it sent Eve. However, unless + Eve can correctly guess the SSV the legitimate client will use, + the digest verification checks in the SET_SSV response will + fail. That is an indication to the client that the session has + apparently been hijacked. + + - Alternatively, Eve sent a SET_SSV with a different slot ID than + the legitimate client uses for its SET_SSV. Then the digest + verification of the SET_SSV sent with Bob's credentials fails + on the server, and the error returned to the client makes it + apparent that the session has been hijacked. + + - Alternatively, Eve sent an operation other than SET_SSV, but + with the same slot ID and sequence that the legitimate client + uses for its SET_SSV. The server returns to the legitimate + client the response it sent Eve. The client sees that the + response is not at all what it expects. The client assumes + either session hijacking or a server bug, and either way + destroys the old session. + + * Eve associates a rogue connection with the session as above, and + then destroys the session. Again, Bob goes to use the server from + the legitimate client, which sends a SET_SSV using Bob's + credentials. The client receives an error that indicates that the + session does not exist. When the client tries to create a new + session, this will fail because the SSV it has does not match that + which the server has, and now the client knows the session was + hijacked. The legitimate client establishes a new client ID. + + * If Eve creates a connection before the legitimate client + establishes an SSV, because the initial value of the SSV is zero + and therefore known, Eve can send a SET_SSV that will pass the + digest verification check. However, because the new connection + has not been associated with the session, the SET_SSV is rejected + for that reason. + + In summary, an attacker's disruption of state when SP4_SSV protection + is in use is limited to the formative period of a client ID, its + first session, and the establishment of the SSV. Once a non- + malicious user uses the client ID, the client quickly detects any + hijack and rectifies the situation. Once a non-malicious user + successfully modifies the SSV, the attacker cannot use NFSv4.1 + operations to disrupt the non-malicious user. + + Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches + prevent hijacking of a transport connection that has previously been + associated with a session. If the goal of a counter-threat strategy + is to prevent connection hijacking, the use of IPsec is RECOMMENDED. 
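+
+ As noted above, a SET_SSV request does not replace the SSV; the
+ server XORs the argument into the current SSV. A minimal sketch of
+ that update follows (illustrative Python; Section 18.47 gives the
+ authoritative definition, and the 32-byte length and the omission of
+ the operation's digest fields are assumptions of this example):
+
+     import secrets
+
+     SSV_LEN = 32          # assumed SSV length for this sketch
+
+     ssv = bytes(SSV_LEN)  # initial SSV is zero, therefore known
+
+     def set_ssv(current: bytes, argument: bytes) -> bytes:
+         # Logically XOR the SET_SSV argument into the current SSV.
+         return bytes(a ^ b for a, b in zip(current, argument))
+
+     # Each principal's SET_SSV mixes fresh entropy into the SSV, so
+     # knowledge of an earlier SSV value (e.g., the known initial
+     # zero value) no longer determines the current one.
+     ssv = set_ssv(ssv, secrets.token_bytes(SSV_LEN))
+     ssv = set_ssv(ssv, secrets.token_bytes(SSV_LEN))
+
+ The XOR composition is why sending SET_SSV as each new principal
+ begins using the client ID strengthens the SSV against an
+ eavesdropper who knew any single earlier contribution.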
+
+ If a connection hijack occurs, the hijacker could in theory change
+ locking state and negatively impact the service to legitimate
+ clients. However, if the server is configured to require the use of
+ RPCSEC_GSS with integrity or privacy on the affected file objects,
+ and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is
+ in force, this will thwart unauthorized attempts to change locking
+ state.
+
+2.10.9. The Secret State Verifier (SSV) GSS Mechanism
+
+ The SSV provides the secret key for a GSS mechanism internal to
+ NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this
+ mechanism are not established via the RPCSEC_GSS protocol. Instead,
+ the contexts are automatically created when EXCHANGE_ID specifies
+ SP4_SSV protection. The only tokens defined are the PerMsgToken
+ (emitted by GSS_GetMIC) and the SealedMessage token (emitted by
+ GSS_Wrap).
+
+ The mechanism OID for the SSV mechanism is
+ iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
+ (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any
+ initial context tokens, the OID can be used to let servers indicate
+ that the SSV mechanism is acceptable whenever the client sends a
+ SECINFO or SECINFO_NO_NAME operation (see Section 2.6).
+
+ The SSV mechanism defines four subkeys derived from the SSV value.
+ Each time SET_SSV is invoked, the subkeys are recalculated by the
+ client and server. The calculation of each of the four subkeys
+ depends on each of the four respective ssv_subkey4 enumerated values.
+ The calculation uses the HMAC [52] algorithm, using the current SSV
+ as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID,
+ and the input text as represented by the XDR encoded enumeration
+ value for that subkey of data type ssv_subkey4. If the length of the
+ output of the HMAC algorithm exceeds the length of the key of the
+ encryption algorithm (which is also negotiated by EXCHANGE_ID), then
+ the subkey MUST be truncated from the HMAC output, i.e., if the
+ subkey is N bytes long, then the first N bytes of the HMAC output
+ MUST be used for the subkey. The specification of EXCHANGE_ID states
+ that the length of the output of the HMAC algorithm MUST NOT be less
+ than the length of subkey needed for the encryption algorithm (see
+ Section 18.35).
+
+     /* Input for computing subkeys */
+     enum ssv_subkey4 {
+         SSV4_SUBKEY_MIC_I2T  = 1,
+         SSV4_SUBKEY_MIC_T2I  = 2,
+         SSV4_SUBKEY_SEAL_I2T = 3,
+         SSV4_SUBKEY_SEAL_T2I = 4
+     };
+
+ The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating
+ message integrity codes (MICs) that originate from the NFSv4.1
+ client, whether as part of a request over the fore channel or a
+ response over the backchannel. The subkey derived from
+ SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1
+ server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for
+ encryption text originating from the NFSv4.1 client, and the subkey
+ derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text
+ originating from the NFSv4.1 server.
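+
+ The subkey derivation just described can be illustrated concretely.
+ The sketch below is illustrative Python; it assumes SHA-256 as the
+ negotiated one-way hash, a 16-byte cipher key, and an all-zero SSV
+ as sample input. The XDR encoding of an ssv_subkey4 value is its
+ 32-bit big-endian representation:
+
+     import hashlib
+     import hmac
+     import struct
+
+     SSV4_SUBKEY_MIC_I2T  = 1
+     SSV4_SUBKEY_MIC_T2I  = 2
+     SSV4_SUBKEY_SEAL_I2T = 3
+     SSV4_SUBKEY_SEAL_T2I = 4
+
+     def derive_subkey(ssv: bytes, subkey_enum: int,
+                       key_len: int) -> bytes:
+         # Input text is the XDR encoding of the enum value
+         # (4 bytes, big endian).
+         text = struct.pack(">I", subkey_enum)
+         # HMAC with the current SSV as the key and the negotiated
+         # one-way hash (SHA-256 assumed here).
+         digest = hmac.new(ssv, text, hashlib.sha256).digest()
+         # Truncate to the cipher's key length; EXCHANGE_ID
+         # guarantees the hash output is not shorter than needed.
+         return digest[:key_len]
+
+     mic_i2t = derive_subkey(b"\x00" * 32, SSV4_SUBKEY_MIC_I2T, 16)
+
+ Both peers repeat this derivation for all four enumerated values
+ whenever SET_SSV changes the SSV.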
+ + The PerMsgToken description is based on an XDR definition: + + /* Input for computing smt_hmac */ + struct ssv_mic_plain_tkn4 { + uint32_t smpt_ssv_seq; + opaque smpt_orig_plain<>; + }; + + /* SSV GSS PerMsgToken token */ + struct ssv_mic_tkn4 { + uint32_t smt_ssv_seq; + opaque smt_hmac<>; + }; + + The field smt_hmac is an HMAC calculated by using the subkey derived + from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one- + way hash algorithm as negotiated by EXCHANGE_ID, and the input text + as represented by data of type ssv_mic_plain_tkn4. The field + smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain + is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of + [7]). The caller of GSS_GetMIC() provides a pointer to a buffer + containing the plain text. The SSV mechanism's entry point for + GSS_GetMIC() encodes this into an opaque array, and the encoding will + include an initial four-byte length, plus any necessary padding. + Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus + making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, + which in turn is the input into the HMAC. + + The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type + ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence + number, which is equal to one after SET_SSV (Section 18.47) is called + the first time on a client ID. Thereafter, the SSV sequence number + is incremented on each SET_SSV. Thus, smt_ssv_seq represents the + version of the SSV at the time GSS_GetMIC() was called. As noted in + Section 18.35, the client and server can maintain multiple concurrent + versions of the SSV. This allows the SSV to be changed without + serializing all RPC calls that use the SSV mechanism with SET_SSV + operations. Once the HMAC is calculated, it is XDR encoded into + smt_hmac, which will include an initial four-byte length, and any + necessary padding. Prepended to this will be the XDR encoded value + of smt_ssv_seq. + + The SealedMessage description is based on an XDR definition: + + /* Input for computing ssct_encr_data and ssct_hmac */ + struct ssv_seal_plain_tkn4 { + opaque sspt_confounder<>; + uint32_t sspt_ssv_seq; + opaque sspt_orig_plain<>; + opaque sspt_pad<>; + }; + + /* SSV GSS SealedMessage token */ + struct ssv_seal_cipher_tkn4 { + uint32_t ssct_ssv_seq; + opaque ssct_iv<>; + opaque ssct_encr_data<>; + opaque ssct_hmac<>; + }; + + The token emitted by GSS_Wrap() is XDR encoded and of XDR data type + ssv_seal_cipher_tkn4. + + The ssct_ssv_seq field has the same meaning as smt_ssv_seq. + + The ssct_encr_data field is the result of encrypting a value of the + XDR encoded data type ssv_seal_plain_tkn4. The encryption key is the + subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and + the encryption algorithm is that negotiated by EXCHANGE_ID. + + The ssct_iv field is the initialization vector (IV) for the + encryption algorithm (if applicable) and is sent in clear text. The + content and size of the IV MUST comply with the specification of the + encryption algorithm. For example, the id-aes256-CBC algorithm MUST + use a 16-byte initialization vector (IV), which MUST be unpredictable + for each instance of a value of data type ssv_seal_plain_tkn4 that is + encrypted with a particular SSV key. + + The ssct_hmac field is the result of computing an HMAC using the + value of the XDR encoded data type ssv_seal_plain_tkn4 as the input + text. 
The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
+   SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that
+   negotiated by EXCHANGE_ID.
+
+   The sspt_confounder field is a random value.
+
+   The sspt_ssv_seq field is the same as ssct_ssv_seq.
+
+   The sspt_orig_plain field is the original plaintext and is the
+   "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of
+   [7]).  As with the handling of the plaintext by the SSV mechanism's
+   GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a
+   pointer to the plaintext, and will XDR encode an opaque array into
+   sspt_orig_plain representing the plain text, along with the other
+   fields of an instance of data type ssv_seal_plain_tkn4.
+
+   The sspt_pad field is present to support encryption algorithms that
+   require inputs to be in fixed-sized blocks.  The content of
+   sspt_pad is zero filled except for the length.  Beware that the XDR
+   encoding of ssv_seal_plain_tkn4 contains three variable-length
+   arrays, and so each array consumes four bytes for an array length,
+   and each array that follows the length is always padded to a
+   multiple of four bytes per the XDR standard.
+
+   For example, suppose the encryption algorithm uses 16-byte blocks,
+   and the sspt_confounder is three bytes long, and the
+   sspt_orig_plain field is 15 bytes long.  The XDR encoding of
+   sspt_confounder uses eight bytes (4 + 3 + 1-byte pad), the XDR
+   encoding of sspt_ssv_seq uses four bytes, the XDR encoding of
+   sspt_orig_plain uses 20 bytes (4 + 15 + 1-byte pad), and the
+   smallest XDR encoding of the sspt_pad field is four bytes.  This
+   totals 36 bytes.  The next multiple of 16 is 48; thus, the length
+   field of sspt_pad needs to be set to 12 bytes, or a total encoding
+   of 16 bytes.  The total number of XDR encoded bytes is thus 8 + 4 +
+   20 + 16 = 48.
+
+   GSS_Wrap() emits a token that is an XDR encoding of a value of data
+   type ssv_seal_cipher_tkn4.  Note that regardless of whether or not
+   the caller of GSS_Wrap() requests confidentiality, the token always
+   has confidentiality.  This is because the SSV mechanism is for
+   RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without
+   confidentiality.
+
+   There is one SSV per client ID.  There is a single GSS context for
+   a client ID / SSV pair.  All SSV mechanism RPCSEC_GSS handles of a
+   client ID / SSV pair share the same GSS context.  SSV GSS contexts
+   do not expire except when the SSV is destroyed (causes would
+   include the client ID being destroyed or a server restart).  Since
+   one purpose of context expiration is to replace keys that have been
+   in use for "too long", hence vulnerable to compromise by brute
+   force or accident, the client can replace the SSV key by sending
+   periodic SET_SSV operations, which is done by cycling through
+   different users' RPCSEC_GSS credentials.  This way, the SSV is
+   replaced without destroying the SSV's GSS contexts.
+
+   SSV RPCSEC_GSS handles can be expired or deleted by the server at
+   any time, and the EXCHANGE_ID operation can be used to create more
+   SSV RPCSEC_GSS handles.  Expiration of SSV RPCSEC_GSS handles does
+   not imply that the SSV or its GSS context has expired.
+
+   The client MUST establish an SSV via SET_SSV before the SSV GSS
+   context can be used to emit tokens from GSS_Wrap() and
+   GSS_GetMIC().  If SET_SSV has not been successfully called,
+   attempts to emit tokens MUST fail.
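+
+   The padding arithmetic above can be checked mechanically.  The
+   following sketch (in Python; purely illustrative and not part of
+   the protocol) recomputes the worked example, assuming, as the
+   example does, a cipher block size that is a multiple of four:
+
+   def xdr_opaque_len(n):
+       # A variable-length XDR opaque is a 4-byte length field plus
+       # the data, padded up to a multiple of 4 bytes.
+       return 4 + n + (-n % 4)
+
+   def sspt_pad_len(block, confounder_len, plain_len):
+       # Bytes used before sspt_pad: two opaques plus the 4-byte
+       # sspt_ssv_seq field.
+       used = (xdr_opaque_len(confounder_len) + 4
+               + xdr_opaque_len(plain_len))
+       # sspt_pad always costs at least its own 4-byte length field;
+       # grow the total to the next multiple of the block size.
+       total = used + 4
+       total += -total % block
+       # Return the number of zero-fill bytes carried in sspt_pad.
+       return total - used - 4
+
+   # 16-byte blocks, 3-byte confounder, 15-byte plaintext:
+   # sspt_pad carries 12 zero bytes, for a 48-byte total encoding.
+   assert sspt_pad_len(16, 3, 15) == 12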
+
+   The SSV mechanism does not support replay detection and sequencing
+   in its tokens because RPCSEC_GSS does not use those features (see
+   "Context Creation Requests", Section 5.2.2 of [4]).  However,
+   Section 2.10.10 discusses special considerations for the SSV
+   mechanism when used with RPCSEC_GSS.
+
+2.10.10.  Security Considerations for RPCSEC_GSS When Using the SSV
+          Mechanism
+
+   When a client ID is created with SP4_SSV state protection (see
+   Section 18.35), the client is permitted to associate multiple
+   RPCSEC_GSS handles with the single SSV GSS context (see
+   Section 2.10.9).  Because of the way RPCSEC_GSS (both version 1 and
+   version 2; see [4] and [9]) calculates the verifier of the reply,
+   special care must be taken by the implementation of the NFSv4.1
+   client to prevent attacks by a man-in-the-middle.  The verifier of
+   an RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the
+   input value of the seq_num field of the RPCSEC_GSS credential (data
+   type rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]).  If
+   multiple RPCSEC_GSS handles share the same GSS context, then if one
+   handle is used to send a request with the same seq_num value as
+   another handle, an attacker could block the reply, and replace it
+   with the verifier used for the other handle.
+
+   There are multiple ways to prevent the attack on the SSV RPCSEC_GSS
+   verifier in the reply.  The simplest is believed to be as follows.
+
+   *  Each time one or more new SSV RPCSEC_GSS handles are created via
+      EXCHANGE_ID, the client SHOULD send a SET_SSV operation to
+      modify the SSV.  By changing the SSV, the new handles will not
+      result in the re-use of an SSV RPCSEC_GSS verifier in a reply.
+
+   *  When a requester decides to use N SSV RPCSEC_GSS handles, it
+      SHOULD assign a unique and non-overlapping range of seq_nums to
+      each SSV RPCSEC_GSS handle.  The size of each range SHOULD be
+      equal to MAXSEQ / N (see Section 5 of [4] for the definition of
+      MAXSEQ).  When an SSV RPCSEC_GSS handle reaches its maximum
+      seq_num, it SHOULD force the replier to destroy the handle by
+      sending a NULL RPC request with seq_num set to MAXSEQ + 1 (see
+      Section 5.3.3.3 of [4]).
+
+   *  When the requester wants to increase or decrease N, it SHOULD
+      force the replier to destroy all N handles by sending a NULL RPC
+      request on each handle with seq_num set to MAXSEQ + 1.  If the
+      requester is the client, it SHOULD send a SET_SSV operation
+      before using new handles.  If the requester is the server, then
+      the client SHOULD send a SET_SSV operation when it detects that
+      the server has forced it to destroy a backchannel's SSV
+      RPCSEC_GSS handle.  By sending a SET_SSV operation, the SSV will
+      change, and so the attacker will be unable to successfully
+      replay a previous verifier in a reply to the requester.
+
+   Note that if the replier carefully creates the SSV RPCSEC_GSS
+   handles, the related risk of a man-in-the-middle splicing a forged
+   SSV RPCSEC_GSS credential with a verifier for another handle does
+   not exist.  This is because the verifier in an RPCSEC_GSS request
+   is computed from input that includes both the RPCSEC_GSS handle and
+   seq_num (see Section 5.3.1 of [4]).  Provided the replier takes
+   care to avoid re-using the value of an RPCSEC_GSS handle that it
+   creates, such as by including a generation number in the handle,
+   the man-in-the-middle will not be able to successfully replay a
+   previous verifier in the request to a replier.
+
+2.10.11.  Session Mechanics - Steady State
+
+2.10.11.1.
Obligations of the Server
+
+   The server has the primary obligation to monitor the state of
+   backchannel resources that the client has created for the server
+   (RPCSEC_GSS contexts and backchannel connections).  If these
+   resources vanish, the server takes action as specified in
+   Section 2.10.13.2.
+
+2.10.11.2.  Obligations of the Client
+
+   The client SHOULD honor the following obligations in order to
+   utilize the session:
+
+   *  Keep a necessary session from going idle on the server.  A
+      client that requires a session but nonetheless is not sending
+      operations risks having the session be destroyed by the server.
+      This is because sessions consume resources, and resource
+      limitations may force the server to cull an inactive session.  A
+      server MAY consider a session to be inactive if the client has
+      not used the session before the session inactivity timer
+      (Section 2.10.12) has expired.
+
+   *  Destroy the session when not needed.  If a client has multiple
+      sessions, one of which has no requests waiting for replies, and
+      has been idle for some period of time, it SHOULD destroy the
+      session.
+
+   *  Maintain GSS contexts and RPCSEC_GSS handles for the
+      backchannel.  If the client requires the server to use the
+      RPCSEC_GSS security flavor for callbacks, then it needs to be
+      sure the RPCSEC_GSS handles and/or their GSS contexts that are
+      handed to the server via BACKCHANNEL_CTL or CREATE_SESSION are
+      unexpired.
+
+   *  Preserve a connection for a backchannel.  The server requires a
+      backchannel in order to gracefully recall recallable state or
+      notify the client of certain events.  Note that if the
+      connection is not being used for the fore channel, there is no
+      way for the client to tell if the connection is still alive
+      (e.g., the server restarted without sending a disconnect).  The
+      onus is on the server, not the client, to determine if the
+      backchannel's connection is alive, and to indicate in the
+      response to a SEQUENCE operation when the last connection
+      associated with a session's backchannel has disconnected.
+
+2.10.11.3.  Steps the Client Takes to Establish a Session
+
+   If the client does not have a client ID, the client sends
+   EXCHANGE_ID to establish a client ID.  If it opts for SP4_MACH_CRED
+   or SP4_SSV protection, in the spo_must_enforce list of operations,
+   it SHOULD at minimum specify CREATE_SESSION, DESTROY_SESSION,
+   BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID.  If it
+   opts for SP4_SSV protection, the client needs to ask for SSV-based
+   RPCSEC_GSS handles.
+
+   The client uses the client ID to send a CREATE_SESSION on a
+   connection to the server.  The results of CREATE_SESSION indicate
+   whether or not the server will persist the session reply cache
+   through a server restart, and the client notes this for future
+   reference.
+
+   If the client specified SP4_SSV state protection when the client ID
+   was created, then it SHOULD send SET_SSV in the first COMPOUND
+   after the session is created.  Each time a new principal goes to
+   use the client ID, it SHOULD send a SET_SSV again.
+
+   If the client wants to use delegations, layouts, directory
+   notifications, or any other state that requires a backchannel, then
+   it needs to add a connection to the backchannel if CREATE_SESSION
+   did not already do so.  The client creates a connection, and calls
+   BIND_CONN_TO_SESSION to associate the connection with the session
+   and the session's backchannel.
If CREATE_SESSION did not already do so,
+   the client MUST tell the server what security is required in order
+   for the client to accept callbacks.  The client does this via
+   BACKCHANNEL_CTL.  If the client selected SP4_MACH_CRED or SP4_SSV
+   protection when it called EXCHANGE_ID, then the client SHOULD
+   specify that the backchannel use RPCSEC_GSS contexts for security.
+
+   If the client wants to use additional connections for the
+   backchannel, then it needs to call BIND_CONN_TO_SESSION on each
+   connection it wants to use with the session.  If the client wants
+   to use additional connections for the fore channel, then it needs
+   to call BIND_CONN_TO_SESSION if it specified SP4_SSV or
+   SP4_MACH_CRED state protection when the client ID was created.
+
+   At this point, the session has reached steady state.
+
+2.10.12.  Session Inactivity Timer
+
+   The server MAY maintain a session inactivity timer for each
+   session.  If the session inactivity timer expires, then the server
+   MAY destroy the session.  To avoid losing a session due to
+   inactivity, the client MUST renew the session inactivity timer.
+   The length of the session inactivity timer MUST NOT be less than
+   the lease_time attribute (Section 5.8.1.11).  As with lease renewal
+   (Section 8.3), when the server receives a SEQUENCE operation, it
+   resets the session inactivity timer, and MUST NOT allow the timer
+   to expire while the rest of the operations in the COMPOUND
+   procedure's request are still executing.  Once the last operation
+   has finished, the server MUST set the session inactivity timer to
+   expire no sooner than the sum of the current time and the value of
+   the lease_time attribute.
+
+2.10.13.  Session Mechanics - Recovery
+
+2.10.13.1.  Events Requiring Client Action
+
+   The following events require client action to recover.
+
+2.10.13.1.1.  RPCSEC_GSS Context Loss by Callback Path
+
+   If all RPCSEC_GSS handles granted by the client to the server for
+   callback use have expired, the client MUST establish a new handle
+   via BACKCHANNEL_CTL.  The sr_status_flags field of the SEQUENCE
+   results indicates when callback handles are nearly expired, or
+   fully expired (see Section 18.46.3).
+
+2.10.13.1.2.  Connection Loss
+
+   If the client loses the last connection of the session and wants to
+   retain the session, then it needs to create a new connection, and
+   if, when the client ID was created, BIND_CONN_TO_SESSION was
+   specified in the spo_must_enforce list, the client MUST use
+   BIND_CONN_TO_SESSION to associate the connection with the session.
+
+   If there was a request outstanding at the time of connection loss,
+   then if the client wants to continue to use the session, it MUST
+   retry the request, as described in Section 2.10.6.2.  Note that it
+   is not necessary to retry requests over a connection with the same
+   source network address or the same destination network address as
+   the lost connection.  As long as the session ID, slot ID, and
+   sequence ID in the retry match that of the original request, the
+   server will recognize the request as a retry if it executed the
+   request prior to disconnect.
+
+   If the connection that was lost was the last one associated with
+   the backchannel, and the client wants to retain the backchannel
+   and/or prevent revocation of recallable state, the client needs to
+   reconnect, and if it does, it MUST associate the connection to the
+   session and backchannel via BIND_CONN_TO_SESSION.  The server
+   SHOULD indicate when it has no callback connection via the
+   sr_status_flags result from SEQUENCE.
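+
+   The following sketch (in Python; an illustration only, with a
+   single-request-deep cache per slot and seq_num wraparound ignored)
+   shows why a retry after connection loss can be recognized: the
+   replier's reply cache is keyed only by session ID, slot ID, and
+   sequence ID, never by network addresses.
+
+   class SlotTable:
+       def __init__(self):
+           # (sessionid, slotid) -> (last seqid, cached reply)
+           self.slots = {}
+
+       def handle(self, sessionid, slotid, seqid, execute):
+           key = (sessionid, slotid)
+           if key in self.slots:
+               last_seq, cached = self.slots[key]
+               if seqid == last_seq:
+                   # Retry: replay the cached reply, whatever
+                   # connection the request arrived on.
+                   return cached
+               if seqid != last_seq + 1:
+                   raise RuntimeError("NFS4ERR_SEQ_MISORDERED")
+           reply = execute()  # new request: execute and cache
+           self.slots[key] = (seqid, reply)
+           return reply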
+ +2.10.13.1.3. Backchannel GSS Context Loss + + Via the sr_status_flags result of the SEQUENCE operation or other + means, the client will learn if some or all of the RPCSEC_GSS + contexts it assigned to the backchannel have been lost. If the + client wants to retain the backchannel and/or not put recallable + state subject to revocation, the client needs to use BACKCHANNEL_CTL + to assign new contexts. + +2.10.13.1.4. Loss of Session + + The replier might lose a record of the session. Causes include: + + * Replier failure and restart. + + * A catastrophe that causes the reply cache to be corrupted or lost + on the media on which it was stored. This applies even if the + replier indicated in the CREATE_SESSION results that it would + persist the cache. + + * The server purges the session of a client that has been inactive + for a very extended period of time. + + * As a result of configuration changes among a set of clustered + servers, a network address previously connected to one server + becomes connected to a different server that has no knowledge of + the session in question. Such a configuration change will + generally only happen when the original server ceases to function + for a time. + + Loss of reply cache is equivalent to loss of session. The replier + indicates loss of session to the requester by returning + NFS4ERR_BADSESSION on the next operation that uses the session ID + that refers to the lost session. + + After an event like a server restart, the client may have lost its + connections. The client assumes for the moment that the session has + not been lost. It reconnects, and if it specified connection + association enforcement when the session was created, it invokes + BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes + SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns + NFS4ERR_BADSESSION, the client knows the session is not available to + it when communicating with that network address. If the connection + survives session loss, then the next SEQUENCE operation the client + sends over the connection will get back NFS4ERR_BADSESSION. The + client again knows the session was lost. + + Here is one suggested algorithm for the client when it gets + NFS4ERR_BADSESSION. It is not obligatory in that, if a client does + not want to take advantage of such features as trunking, it may omit + parts of it. However, it is a useful example that draws attention to + various possible recovery issues: + + 1. If the client has other connections to other server network + addresses associated with the same session, attempt a COMPOUND + with a single operation, SEQUENCE, on each of the other + connections. + + 2. If the attempts succeed, the session is still alive, and this is + a strong indicator that the server's network address has moved. + The client might send an EXCHANGE_ID on the connection that + returned NFS4ERR_BADSESSION to see if there are opportunities for + client ID trunking (i.e., the same client ID and so_major_id + value are returned). The client might use DNS to see if the + moved network address was replaced with another, so that the + performance and availability benefits of session trunking can + continue. + + 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the + session no longer exists on any of the server network addresses + for which the client has connections associated with that session + ID. It is possible the session is still alive and available on + other network addresses. 
The client sends an EXCHANGE_ID on all
+       the connections to see if the server owner is still listening
+       on those network addresses.  If the same server owner is
+       returned but a new client ID is returned, this is a strong
+       indicator of a server restart.  If both the same server owner
+       and same client ID are returned, then this is a strong
+       indication that the server did delete the session, and the
+       client will need to send a CREATE_SESSION if it has no other
+       sessions for that client ID.  If a different server owner is
+       returned, the client can use DNS to find other network
+       addresses.  If it does not, or if DNS does not find any other
+       addresses for the server, then the client will be unable to
+       provide NFSv4.1 service, and fatal errors should be returned to
+       processes that were using the server.  If the client is using a
+       "mount" paradigm, unmounting the server is advised.
+
+   4.  If the client knows of no other connections associated with the
+       session ID and server network addresses that are, or have been,
+       associated with the session ID, then the client can use DNS to
+       find other network addresses.  If it does not, or if DNS does
+       not find any other addresses for the server, then the client
+       will be unable to provide NFSv4.1 service, and fatal errors
+       should be returned to processes that were using the server.  If
+       the client is using a "mount" paradigm, unmounting the server
+       is advised.
+
+   If there is a reconfiguration event that results in the same
+   network address being assigned to servers where the
+   eir_server_scope value is different, it cannot be guaranteed that a
+   session ID generated by the first server will be recognized as
+   invalid by the second.  Therefore, in managing server
+   reconfigurations among servers with different server scope values,
+   it is necessary to make sure that all clients have disconnected
+   from the first server before effecting the reconfiguration.
+   Nonetheless, clients should not assume that servers will always
+   adhere to this requirement; clients MUST be prepared to deal with
+   unexpected effects of server reconfigurations.  Even where a
+   session ID is inappropriately recognized as valid, it is likely
+   either that the connection will not be recognized as valid or that
+   a sequence value for a slot will not be correct.  Therefore, when a
+   client receives results indicating such unexpected errors, the use
+   of EXCHANGE_ID to determine the current server configuration is
+   RECOMMENDED.
+
+   A variation on the above is that after a server's network address
+   moves, there is no NFSv4.1 server listening, e.g., no listener on
+   port 2049.  In this example, one of the following occurs: the NFSv4
+   server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns
+   a PROG_MISMATCH error, the RPC listener on 2049 returns
+   PROG_UNAVAIL, or attempts to reconnect to the network address time
+   out.  These SHOULD be treated as equivalent to SEQUENCE returning
+   NFS4ERR_BADSESSION for these purposes.
+
+   When the client detects session loss, it needs to call
+   CREATE_SESSION to recover.  Any non-idempotent operations that were
+   in progress might have been performed on the server at the time of
+   session loss.  The client has no general way to recover from this.
+
+   Note that loss of session does not imply loss of byte-range lock,
+   open, delegation, or layout state because locks, opens,
+   delegations, and layouts are tied to the client ID and depend on
+   the client ID, not the session.
Nor does loss of byte-range lock, open, delegation, + or layout state imply loss of session state, because the session + depends on the client ID; loss of client ID however does imply loss + of session, byte-range lock, open, delegation, and layout state. See + Section 8.4.2. A session can survive a server restart, but lock + recovery may still be needed. + + It is possible that CREATE_SESSION will fail with + NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not + preserve client ID state). If so, the client needs to call + EXCHANGE_ID, followed by CREATE_SESSION. + +2.10.13.2. Events Requiring Server Action + + The following events require server action to recover. + +2.10.13.2.1. Client Crash and Restart + + As described in Section 18.35, a restarted client sends EXCHANGE_ID + in such a way that it causes the server to delete any sessions it + had. + +2.10.13.2.2. Client Crash with No Restart + + If a client crashes and never comes back, it will never send + EXCHANGE_ID with its old client owner. Thus, the server has session + state that will never be used again. After an extended period of + time, and if the server has resource constraints, it MAY destroy the + old session as well as locking state. + +2.10.13.2.3. Extended Network Partition + + To the server, the extended network partition may be no different + from a client crash with no restart (see Section 2.10.13.2.2). + Unless the server can discern that there is a network partition, it + is free to treat the situation as if the client has crashed + permanently. + +2.10.13.2.4. Backchannel Connection Loss + + If there were callback requests outstanding at the time of a + connection loss, then the server MUST retry the requests, as + described in Section 2.10.6.2. Note that it is not necessary to + retry requests over a connection with the same source network address + or the same destination network address as the lost connection. As + long as the session ID, slot ID, and sequence ID in the retry match + that of the original request, the callback target will recognize the + request as a retry even if it did see the request prior to + disconnect. + + If the connection lost is the last one associated with the + backchannel, then the server MUST indicate that in the + sr_status_flags field of every SEQUENCE reply until the backchannel + is re-established. There are two situations, each of which uses + different status flags: no connectivity for the session's backchannel + and no connectivity for any session backchannel of the client. See + Section 18.46 for a description of the appropriate flags in + sr_status_flags. + +2.10.13.2.5. GSS Context Loss + + The server SHOULD monitor when the number of RPCSEC_GSS handles + assigned to the backchannel reaches one, and when that one handle is + near expiry (i.e., between one and two periods of lease time), and + indicate so in the sr_status_flags field of all SEQUENCE replies. + The server MUST indicate when all of the backchannel's assigned + RPCSEC_GSS handles have expired via the sr_status_flags field of all + SEQUENCE replies. + +2.10.14. Parallel NFS and Sessions + + A client and server can potentially be a non-pNFS implementation, a + metadata server implementation, a data server implementation, or two + or three types of implementations. 
The EXCHGID4_FLAG_USE_NON_PNFS,
+   EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags
+   (not mutually exclusive) are passed in the EXCHANGE_ID arguments
+   and results to allow the client to indicate how it wants to use
+   sessions created under the client ID, and to allow the server to
+   indicate how it will allow the sessions to be used.  See
+   Section 13.1 for pNFS sessions considerations.
+
+3.  Protocol Constants and Data Types
+
+   The syntax and semantics to describe the data types of the NFSv4.1
+   protocol are defined in the XDR (RFC 4506 [2]) and RPC (RFC 5531
+   [3]) documents.  The next sections build upon the XDR data types to
+   define constants, types, and structures specific to this protocol.
+   The full list of XDR data types is in [10].
+
+3.1.  Basic Constants
+
+   const NFS4_FHSIZE               = 128;
+   const NFS4_VERIFIER_SIZE        = 8;
+   const NFS4_OPAQUE_LIMIT         = 1024;
+   const NFS4_SESSIONID_SIZE       = 16;
+
+   const NFS4_INT64_MAX            = 0x7fffffffffffffff;
+   const NFS4_UINT64_MAX           = 0xffffffffffffffff;
+   const NFS4_INT32_MAX            = 0x7fffffff;
+   const NFS4_UINT32_MAX           = 0xffffffff;
+
+   const NFS4_MAXFILELEN           = 0xffffffffffffffff;
+   const NFS4_MAXFILEOFF           = 0xfffffffffffffffe;
+
+   Except where noted, all these constants are defined in bytes.
+
+   *  NFS4_FHSIZE is the maximum size of a filehandle.
+
+   *  NFS4_VERIFIER_SIZE is the fixed size of a verifier.
+
+   *  NFS4_OPAQUE_LIMIT is the maximum size of certain opaque
+      information.
+
+   *  NFS4_SESSIONID_SIZE is the fixed size of a session identifier.
+
+   *  NFS4_INT64_MAX is the maximum value of a signed 64-bit integer.
+
+   *  NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit
+      integer.
+
+   *  NFS4_INT32_MAX is the maximum value of a signed 32-bit integer.
+
+   *  NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit
+      integer.
+
+   *  NFS4_MAXFILELEN is the maximum length of a regular file.
+
+   *  NFS4_MAXFILEOFF is the maximum offset into a regular file.
+
+3.2.  Basic Data Types
+
+   These are the base NFSv4.1 data types.
+
+   +===============+==============================================+
+   | Data Type     | Definition                                   |
+   +===============+==============================================+
+   | int32_t       | typedef int int32_t;                         |
+   +---------------+----------------------------------------------+
+   | uint32_t      | typedef unsigned int uint32_t;               |
+   +---------------+----------------------------------------------+
+   | int64_t       | typedef hyper int64_t;                       |
+   +---------------+----------------------------------------------+
+   | uint64_t      | typedef unsigned hyper uint64_t;             |
+   +---------------+----------------------------------------------+
+   | attrlist4     | typedef opaque attrlist4<>;                  |
+   |               |                                              |
+   |               | Used for file/directory attributes.          |
+   +---------------+----------------------------------------------+
+   | bitmap4       | typedef uint32_t bitmap4<>;                  |
+   |               |                                              |
+   |               | Used in attribute array encoding.            |
+   +---------------+----------------------------------------------+
+   | changeid4     | typedef uint64_t changeid4;                  |
+   |               |                                              |
+   |               | Used in the definition of change_info4.      |
+   +---------------+----------------------------------------------+
+   | clientid4     | typedef uint64_t clientid4;                  |
+   |               |                                              |
+   |               | Shorthand reference to client                |
+   |               | identification.                              |
+   +---------------+----------------------------------------------+
+   | count4        | typedef uint32_t count4;                     |
+   |               |                                              |
+   |               | Various count parameters (READ, WRITE,       |
+   |               | COMMIT).
| + +---------------+----------------------------------------------+ + | length4 | typedef uint64_t length4; | + | | | + | | The length of a byte-range within a file. | + +---------------+----------------------------------------------+ + | mode4 | typedef uint32_t mode4; | + | | | + | | Mode attribute data type. | + +---------------+----------------------------------------------+ + | nfs_cookie4 | typedef uint64_t nfs_cookie4; | + | | | + | | Opaque cookie value for READDIR. | + +---------------+----------------------------------------------+ + | nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE>; | + | | | + | | Filehandle definition. | + +---------------+----------------------------------------------+ + | nfs_ftype4 | enum nfs_ftype4; | + | | | + | | Various defined file types. | + +---------------+----------------------------------------------+ + | nfsstat4 | enum nfsstat4; | + | | | + | | Return value for operations. | + +---------------+----------------------------------------------+ + | offset4 | typedef uint64_t offset4; | + | | | + | | Various offset designations (READ, WRITE, | + | | LOCK, COMMIT). | + +---------------+----------------------------------------------+ + | qop4 | typedef uint32_t qop4; | + | | | + | | Quality of protection designation in | + | | SECINFO. | + +---------------+----------------------------------------------+ + | sec_oid4 | typedef opaque sec_oid4<>; | + | | | + | | Security Object Identifier. The sec_oid4 | + | | data type is not really opaque. Instead, it | + | | contains an ASN.1 OBJECT IDENTIFIER as used | + | | by GSS-API in the mech_type argument to | + | | GSS_Init_sec_context. See [7] for details. | + +---------------+----------------------------------------------+ + | sequenceid4 | typedef uint32_t sequenceid4; | + | | | + | | Sequence number used for various session | + | | operations (EXCHANGE_ID, CREATE_SESSION, | + | | SEQUENCE, CB_SEQUENCE). | + +---------------+----------------------------------------------+ + | seqid4 | typedef uint32_t seqid4; | + | | | + | | Sequence identifier used for locking. | + +---------------+----------------------------------------------+ + | sessionid4 | typedef opaque | + | | sessionid4[NFS4_SESSIONID_SIZE]; | + | | | + | | Session identifier. | + +---------------+----------------------------------------------+ + | slotid4 | typedef uint32_t slotid4; | + | | | + | | Sequencing artifact for various session | + | | operations (SEQUENCE, CB_SEQUENCE). | + +---------------+----------------------------------------------+ + | utf8string | typedef opaque utf8string<>; | + | | | + | | UTF-8 encoding for strings. | + +---------------+----------------------------------------------+ + | utf8str_cis | typedef utf8string utf8str_cis; | + | | | + | | Case-insensitive UTF-8 string. | + +---------------+----------------------------------------------+ + | utf8str_cs | typedef utf8string utf8str_cs; | + | | | + | | Case-sensitive UTF-8 string. | + +---------------+----------------------------------------------+ + | utf8str_mixed | typedef utf8string utf8str_mixed; | + | | | + | | UTF-8 strings with a case-sensitive prefix | + | | and a case-insensitive suffix. | + +---------------+----------------------------------------------+ + | component4 | typedef utf8str_cs component4; | + | | | + | | Represents pathname components. 
| + +---------------+----------------------------------------------+ + | linktext4 | typedef utf8str_cs linktext4; | + | | | + | | Symbolic link contents ("symbolic link" is | + | | defined in an Open Group [11] standard). | + +---------------+----------------------------------------------+ + | pathname4 | typedef component4 pathname4<>; | + | | | + | | Represents pathname for fs_locations. | + +---------------+----------------------------------------------+ + | verifier4 | typedef opaque | + | | verifier4[NFS4_VERIFIER_SIZE]; | + | | | + | | Verifier used for various operations | + | | (COMMIT, CREATE, EXCHANGE_ID, OPEN, READDIR, | + | | WRITE) NFS4_VERIFIER_SIZE is defined as 8. | + +---------------+----------------------------------------------+ + + Table 1 + + End of Base Data Types + +3.3. Structured Data Types + +3.3.1. nfstime4 + + struct nfstime4 { + int64_t seconds; + uint32_t nseconds; + }; + + The nfstime4 data type gives the number of seconds and nanoseconds + since midnight or zero hour January 1, 1970 Coordinated Universal + Time (UTC). Values greater than zero for the seconds field denote + dates after the zero hour January 1, 1970. Values less than zero for + the seconds field denote dates before the zero hour January 1, 1970. + In both cases, the nseconds field is to be added to the seconds field + for the final time representation. For example, if the time to be + represented is one-half second before zero hour January 1, 1970, the + seconds field would have a value of negative one (-1) and the + nseconds field would have a value of one-half second (500000000). + Values greater than 999,999,999 for nseconds are invalid. + + This data type is used to pass time and date information. A server + converts to and from its local representation of time when processing + time values, preserving as much accuracy as possible. If the + precision of timestamps stored for a file system object is less than + defined, loss of precision can occur. An adjunct time maintenance + protocol is RECOMMENDED to reduce client and server time skew. + +3.3.2. time_how4 + + enum time_how4 { + SET_TO_SERVER_TIME4 = 0, + SET_TO_CLIENT_TIME4 = 1 + }; + +3.3.3. settime4 + + union settime4 switch (time_how4 set_it) { + case SET_TO_CLIENT_TIME4: + nfstime4 time; + default: + void; + }; + + The time_how4 and settime4 data types are used for setting timestamps + in file object attributes. If set_it is SET_TO_SERVER_TIME4, then + the server uses its local representation of time for the time value. + +3.3.4. specdata4 + + struct specdata4 { + uint32_t specdata1; /* major device number */ + uint32_t specdata2; /* minor device number */ + }; + + This data type represents the device numbers for the device file + types NF4CHR and NF4BLK. + +3.3.5. fsid4 + + struct fsid4 { + uint64_t major; + uint64_t minor; + }; + +3.3.6. change_policy4 + + struct change_policy4 { + uint64_t cp_major; + uint64_t cp_minor; + }; + + The change_policy4 data type is used for the change_policy + RECOMMENDED attribute. It provides change sequencing indication + analogous to the change attribute. To enable the server to present a + value valid across server re-initialization without requiring + persistent storage, two 64-bit quantities are used, allowing one to + be a server instance ID and the second to be incremented non- + persistently, within a given server instance. + +3.3.7. fattr4 + + struct fattr4 { + bitmap4 attrmask; + attrlist4 attr_vals; + }; + + The fattr4 data type is used to represent file and directory + attributes. 
+ + The bitmap is a counted array of 32-bit integers used to contain bit + values. The position of the integer in the array that contains bit n + can be computed from the expression (n / 32), and its bit within that + integer is (n mod 32). + + 0 1 + +-----------+-----------+-----------+-- + | count | 31 .. 0 | 63 .. 32 | + +-----------+-----------+-----------+-- + +3.3.8. change_info4 + + struct change_info4 { + bool atomic; + changeid4 before; + changeid4 after; + }; + + This data type is used with the CREATE, LINK, OPEN, REMOVE, and + RENAME operations to let the client know the value of the change + attribute for the directory in which the target file system object + resides. + +3.3.9. netaddr4 + + struct netaddr4 { + /* see struct rpcb in RFC 1833 */ + string na_r_netid<>; /* network id */ + string na_r_addr<>; /* universal address */ + }; + + The netaddr4 data type is used to identify network transport + endpoints. The na_r_netid and na_r_addr fields respectively contain + a netid and uaddr. The netid and uaddr concepts are defined in [12]. + The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are + defined in [12], specifically Tables 2 and 3 and in Sections 5.2.3.3 + and 5.2.3.4. + +3.3.10. state_owner4 + + struct state_owner4 { + clientid4 clientid; + opaque owner<NFS4_OPAQUE_LIMIT>; + }; + + typedef state_owner4 open_owner4; + typedef state_owner4 lock_owner4; + + The state_owner4 data type is the base type for the open_owner4 + (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2). + +3.3.10.1. open_owner4 + + This data type is used to identify the owner of OPEN state. + +3.3.10.2. lock_owner4 + + This structure is used to identify the owner of byte-range locking + state. + +3.3.11. open_to_lock_owner4 + + struct open_to_lock_owner4 { + seqid4 open_seqid; + stateid4 open_stateid; + seqid4 lock_seqid; + lock_owner4 lock_owner; + }; + + This data type is used for the first LOCK operation done for an + open_owner4. It provides both the open_stateid and lock_owner, such + that the transition is made from a valid open_stateid sequence to + that of the new lock_stateid sequence. Using this mechanism avoids + the confirmation of the lock_owner/lock_seqid pair since it is tied + to established state in the form of the open_stateid/open_seqid. + +3.3.12. stateid4 + + struct stateid4 { + uint32_t seqid; + opaque other[12]; + }; + + This data type is used for the various state sharing mechanisms + between the client and server. The client never modifies a value of + data type stateid. The starting value of the "seqid" field is + undefined. The server is required to increment the "seqid" field by + one at each transition of the stateid. This is important since the + client will inspect the seqid in OPEN stateids to determine the order + of OPEN processing done by the server. + +3.3.13. layouttype4 + + enum layouttype4 { + LAYOUT4_NFSV4_1_FILES = 0x1, + LAYOUT4_OSD2_OBJECTS = 0x2, + LAYOUT4_BLOCK_VOLUME = 0x3 + }; + + This data type indicates what type of layout is being used. The file + server advertises the layout types it supports through the + fs_layout_type file system attribute (Section 5.12.1). A client asks + for layouts of a particular type in LAYOUTGET, and processes those + layouts in its layout-type-specific logic. + + The layouttype4 data type is 32 bits in length. The range + represented by the layout type is split into three parts. Type 0x0 + is reserved. 
Types within the range 0x00000001-0x7FFFFFFF are + globally unique and are assigned according to the description in + Section 22.5; they are maintained by IANA. Types within the range + 0x80000000-0xFFFFFFFF are site specific and for private use only. + + The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file + layout type, as defined in Section 13, is to be used. The + LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as + defined in [47], is to be used. Similarly, the LAYOUT4_BLOCK_VOLUME + enumeration specifies that the block/volume layout, as defined in + [48], is to be used. + +3.3.14. deviceid4 + + const NFS4_DEVICEID4_SIZE = 16; + + typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; + + Layout information includes device IDs that specify a storage device + through a compact handle. Addressing and type information is + obtained with the GETDEVICEINFO operation. Device IDs are not + guaranteed to be valid across metadata server restarts. A device ID + is unique per client ID and layout type. See Section 12.2.10 for + more details. + +3.3.15. device_addr4 + + struct device_addr4 { + layouttype4 da_layout_type; + opaque da_addr_body<>; + }; + + The device address is used to set up a communication channel with the + storage device. Different layout types will require different data + types to define how they communicate with storage devices. The + opaque da_addr_body field is interpreted based on the specified + da_layout_type field. + + This document defines the device address for the NFSv4.1 file layout + (see Section 13.3), which identifies a storage device by network IP + address and port number. This is sufficient for the clients to + communicate with the NFSv4.1 storage devices, and may be sufficient + for other layout types as well. Device types for object-based + storage devices and block storage devices (e.g., Small Computer + System Interface (SCSI) volume labels) are defined by their + respective layout specifications. + +3.3.16. layout_content4 + + struct layout_content4 { + layouttype4 loc_type; + opaque loc_body<>; + }; + + The loc_body field is interpreted based on the layout type + (loc_type). This document defines the loc_body for the NFSv4.1 file + layout type; see Section 13.3 for its definition. + +3.3.17. layout4 + + struct layout4 { + offset4 lo_offset; + length4 lo_length; + layoutiomode4 lo_iomode; + layout_content4 lo_content; + }; + + The layout4 data type defines a layout for a file. The layout type + specific data is opaque within lo_content. Since layouts are sub- + dividable, the offset and length together with the file's filehandle, + the client ID, iomode, and layout type identify the layout. + +3.3.18. layoutupdate4 + + struct layoutupdate4 { + layouttype4 lou_type; + opaque lou_body<>; + }; + + The layoutupdate4 data type is used by the client to return updated + layout information to the metadata server via the LAYOUTCOMMIT + (Section 18.42) operation. This data type provides a channel to pass + layout type specific information (in field lou_body) back to the + metadata server. For example, for the block/volume layout type, this + could include the list of reserved blocks that were written. The + contents of the opaque lou_body argument are determined by the layout + type. The NFSv4.1 file-based layout does not use this data type; if + lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a + zero length. + +3.3.19. 
layouthint4 + + struct layouthint4 { + layouttype4 loh_type; + opaque loh_body<>; + }; + + The layouthint4 data type is used by the client to pass in a hint + about the type of layout it would like created for a particular file. + It is the data type specified by the layout_hint attribute described + in Section 5.12.4. The metadata server may ignore the hint or may + selectively ignore fields within the hint. This hint should be + provided at create time as part of the initial attributes within + OPEN. The loh_body field is specific to the type of layout + (loh_type). The NFSv4.1 file-based layout uses the + nfsv4_1_file_layouthint4 data type as defined in Section 13.3. + +3.3.20. layoutiomode4 + + enum layoutiomode4 { + LAYOUTIOMODE4_READ = 1, + LAYOUTIOMODE4_RW = 2, + LAYOUTIOMODE4_ANY = 3 + }; + + The iomode specifies whether the client intends to just read or both + read and write the data represented by the layout. While the + LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the + LAYOUTGET operation, it MAY be used in the arguments to the + LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY + iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ + and LAYOUTIOMODE4_RW iomodes are being returned or recalled, + respectively. The metadata server's use of the iomode may depend on + the layout type being used. The storage devices MAY validate I/O + accesses against the iomode and reject invalid accesses. + +3.3.21. nfs_impl_id4 + + struct nfs_impl_id4 { + utf8str_cis nii_domain; + utf8str_cs nii_name; + nfstime4 nii_date; + }; + + This data type is used to identify client and server implementation + details. The nii_domain field is the DNS domain name with which the + implementor is associated. The nii_name field is the product name of + the implementation and is completely free form. It is RECOMMENDED + that the nii_name be used to distinguish machine architecture, + machine platforms, revisions, versions, and patch levels. The + nii_date field is the timestamp of when the software instance was + published or built. + +3.3.22. threshold_item4 + + struct threshold_item4 { + layouttype4 thi_layout_type; + bitmap4 thi_hintset; + opaque thi_hintlist<>; + }; + + This data type contains a list of hints specific to a layout type for + helping the client determine when it should send I/O directly through + the metadata server versus the storage devices. The data type + consists of the layout type (thi_layout_type), a bitmap (thi_hintset) + describing the set of hints supported by the server (they may differ + based on the layout type), and a list of hints (thi_hintlist) whose + content is determined by the hintset bitmap. See the mdsthreshold + attribute for more details. + + The thi_hintset field is a bitmap of the following values: + + +=========================+===+=========+===========================+ + | name | # | Data | Description | + | | | Type | | + +=========================+===+=========+===========================+ + | threshold4_read_size | 0 | length4 | If a file's length is | + | | | | less than the value of | + | | | | threshold4_read_size, | + | | | | then it is RECOMMENDED | + | | | | that the client read | + | | | | from the file via the | + | | | | MDS and not a storage | + | | | | device. 
| + +-------------------------+---+---------+---------------------------+ + | threshold4_write_size | 1 | length4 | If a file's length is | + | | | | less than the value of | + | | | | threshold4_write_size, | + | | | | then it is RECOMMENDED | + | | | | that the client write | + | | | | to the file via the | + | | | | MDS and not a storage | + | | | | device. | + +-------------------------+---+---------+---------------------------+ + | threshold4_read_iosize | 2 | length4 | For read I/O sizes | + | | | | below this threshold, | + | | | | it is RECOMMENDED to | + | | | | read data through the | + | | | | MDS. | + +-------------------------+---+---------+---------------------------+ + | threshold4_write_iosize | 3 | length4 | For write I/O sizes | + | | | | below this threshold, | + | | | | it is RECOMMENDED to | + | | | | write data through the | + | | | | MDS. | + +-------------------------+---+---------+---------------------------+ + + Table 2 + +3.3.23. mdsthreshold4 + + struct mdsthreshold4 { + threshold_item4 mth_hints<>; + }; + + This data type holds an array of elements of data type + threshold_item4, each of which is valid for a particular layout type. + An array is necessary because a server can support multiple layout + types for a single file. + +4. Filehandles + + The filehandle in the NFS protocol is a per-server unique identifier + for a file system object. The contents of the filehandle are opaque + to the client. Therefore, the server is responsible for translating + the filehandle to an internal representation of the file system + object. + +4.1. Obtaining the First Filehandle + + The operations of the NFS protocol are defined in terms of one or + more filehandles. Therefore, the client needs a filehandle to + initiate communication with the server. With the NFSv3 protocol (RFC + 1813 [38]), there exists an ancillary protocol to obtain this first + filehandle. The MOUNT protocol, RPC program number 100005, provides + the mechanism of translating a string-based file system pathname to a + filehandle, which can then be used by the NFS protocols. + + The MOUNT protocol has deficiencies in the area of security and use + via firewalls. This is one reason that the use of the public + filehandle was introduced in RFC 2054 [49] and RFC 2055 [50]. With + the use of the public filehandle in combination with the LOOKUP + operation in the NFSv3 protocol, it has been demonstrated that the + MOUNT protocol is unnecessary for viable interaction between NFS + client and server. + + Therefore, the NFSv4.1 protocol will not use an ancillary protocol + for translation from string-based pathnames to a filehandle. Two + special filehandles will be used as starting points for the NFS + client. + +4.1.1. Root Filehandle + + The first of the special filehandles is the ROOT filehandle. The + ROOT filehandle is the "conceptual" root of the file system namespace + at the NFS server. The client uses or starts with the ROOT + filehandle by employing the PUTROOTFH operation. The PUTROOTFH + operation instructs the server to set the "current" filehandle to the + ROOT of the server's file tree. Once this PUTROOTFH operation is + used, the client can then traverse the entirety of the server's file + tree with the LOOKUP operation. A complete discussion of the server + namespace is in Section 7. + +4.1.2. Public Filehandle + + The second special filehandle is the PUBLIC filehandle. Unlike the + ROOT filehandle, the PUBLIC filehandle may be bound or represent an + arbitrary file system object at the server. 
The server is + responsible for this binding. It may be that the PUBLIC filehandle + and the ROOT filehandle refer to the same file system object. + However, it is up to the administrative software at the server and + the policies of the server administrator to define the binding of the + PUBLIC filehandle and server file system object. The client may not + make any assumptions about this binding. The client uses the PUBLIC + filehandle via the PUTPUBFH operation. + +4.2. Filehandle Types + + In the NFSv3 protocol, there was one type of filehandle with a single + set of semantics. This type of filehandle is termed "persistent" in + NFSv4.1. The semantics of a persistent filehandle remain the same as + before. A new type of filehandle introduced in NFSv4.1 is the + "volatile" filehandle, which attempts to accommodate certain server + environments. + + The volatile filehandle type was introduced to address server + functionality or implementation issues that make correct + implementation of a persistent filehandle infeasible. Some server + environments do not provide a file-system-level invariant that can be + used to construct a persistent filehandle. The underlying server + file system may not provide the invariant or the server's file system + programming interfaces may not provide access to the needed + invariant. Volatile filehandles may ease the implementation of + server functionality such as hierarchical storage management or file + system reorganization or migration. However, the volatile filehandle + increases the implementation burden for the client. + + Since the client will need to handle persistent and volatile + filehandles differently, a file attribute is defined that may be used + by the client to determine the filehandle types being returned by the + server. + +4.2.1. General Properties of a Filehandle + + The filehandle contains all the information the server needs to + distinguish an individual file. To the client, the filehandle is + opaque. The client stores filehandles for use in a later request and + can compare two filehandles from the same server for equality by + doing a byte-by-byte comparison. However, the client MUST NOT + otherwise interpret the contents of filehandles. If two filehandles + from the same server are equal, they MUST refer to the same file. + Servers SHOULD try to maintain a one-to-one correspondence between + filehandles and files, but this is not required. Clients MUST use + filehandle comparisons only to improve performance, not for correct + behavior. All clients need to be prepared for situations in which it + cannot be determined whether two filehandles denote the same object + and in such cases, avoid making invalid assumptions that might cause + incorrect behavior. Further discussion of filehandle and attribute + comparison in the context of data caching is presented in + Section 10.3.4. + + As an example, in the case that two different pathnames when + traversed at the server terminate at the same file system object, the + server SHOULD return the same filehandle for each path. This can + occur if a hard link (see [6]) is used to create two file names that + refer to the same underlying file object and associated data. For + example, if paths /a/b/c and /a/d/c refer to the same file, the + server SHOULD return the same filehandle for both pathnames' + traversals. + +4.2.2. Persistent Filehandle + + A persistent filehandle is defined as having a fixed value for the + lifetime of the file system object to which it refers. 
Once the
+   server creates the filehandle for a file system object, the server
+   MUST accept the same filehandle for the object for the lifetime of
+   the object.  If the server restarts, the NFS server MUST honor the
+   same filehandle value as it did in the server's previous
+   instantiation.  Similarly, if the file system is migrated, the new
+   NFS server MUST honor the same filehandle as the old NFS server.
+
+   The persistent filehandle will become stale or invalid when the
+   file system object is removed.  When the server is presented with a
+   persistent filehandle that refers to a deleted object, it MUST
+   return an error of NFS4ERR_STALE.  A filehandle may become stale
+   when the file system containing the object is no longer available.
+   The file system may become unavailable if it exists on removable
+   media and the media is no longer available at the server or the
+   file system in whole has been destroyed or the file system has
+   simply been removed from the server's namespace (i.e., unmounted in
+   a UNIX environment).
+
+4.2.3.  Volatile Filehandle
+
+   A volatile filehandle does not share the same longevity
+   characteristics of a persistent filehandle.  The server may
+   determine that a volatile filehandle is no longer valid at many
+   different points in time.  If the server can definitively determine
+   that a volatile filehandle refers to an object that has been
+   removed, the server should return NFS4ERR_STALE to the client (as
+   is the case for persistent filehandles).  In all other cases where
+   the server determines that a volatile filehandle can no longer be
+   used, it should return an error of NFS4ERR_FHEXPIRED.
+
+   The REQUIRED attribute "fh_expire_type" is used by the client to
+   determine what type of filehandle the server is providing for a
+   particular file system.  This attribute is a bitmask with the
+   following values:
+
+   FH4_PERSISTENT  The value of FH4_PERSISTENT is used to indicate a
+      persistent filehandle, which is valid until the object is
+      removed from the file system.  The server will not return
+      NFS4ERR_FHEXPIRED for this filehandle.  FH4_PERSISTENT is
+      defined as a value in which none of the bits specified below are
+      set.
+
+   FH4_VOLATILE_ANY  The filehandle may expire at any time, except as
+      specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN).
+
+   FH4_NOEXPIRE_WITH_OPEN  May only be set when FH4_VOLATILE_ANY is
+      set.  If this bit is set, then the meaning of FH4_VOLATILE_ANY
+      is qualified to exclude any expiration of the filehandle when it
+      is open.
+
+   FH4_VOL_MIGRATION  The filehandle will expire as a result of a file
+      system transition (migration or replication), in those cases in
+      which the continuity of filehandle use is not specified by
+      handle class information within the fs_locations_info attribute.
+      When this bit is set, clients without access to
+      fs_locations_info information should assume that filehandles
+      will expire on file system transitions.
+
+   FH4_VOL_RENAME  The filehandle will expire during rename.  This
+      includes a rename by the requesting client or a rename by any
+      other client.  If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is
+      redundant.
+
+   Servers that provide volatile filehandles that can expire while
+   open require special care as regards handling of RENAMEs and
+   REMOVEs.  This situation can arise if FH4_VOL_MIGRATION or
+   FH4_VOL_RENAME is set, if FH4_VOLATILE_ANY is set and
+   FH4_NOEXPIRE_WITH_OPEN is not set, or if a non-read-only file
+   system has a transition target in a different handle class.
In these cases, the server should deny a + RENAME or REMOVE that would affect an OPEN file of any of the + components leading to the OPEN file. In addition, the server should + deny all RENAME or REMOVE requests during the grace period, in order + to make sure that reclaims of files where filehandles may have + expired do not do a reclaim for the wrong file. + + Volatile filehandles are especially suitable for implementation of + the pseudo file systems used to bridge exports. See Section 7.5 for + a discussion of this. + +4.3. One Method of Constructing a Volatile Filehandle + + A volatile filehandle, while opaque to the client, could contain: + + [volatile bit = 1 | server boot time | slot | generation number] + + * slot is an index in the server volatile filehandle table + + * generation number is the generation number for the table entry/ + slot + + When the client presents a volatile filehandle, the server makes the + following checks, which assume that the check for the volatile bit + has passed. If the server boot time is less than the current server + boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return + NFS4ERR_BADHANDLE. If the generation number does not match, return + NFS4ERR_FHEXPIRED. + + When the server restarts, the table is gone (it is volatile). + + If the volatile bit is 0, then it is a persistent filehandle with a + different structure following it. + +4.4. Client Recovery from Filehandle Expiration + + If possible, the client SHOULD recover from the receipt of an + NFS4ERR_FHEXPIRED error. The client must take on additional + responsibility so that it may prepare itself to recover from the + expiration of a volatile filehandle. If the server returns + persistent filehandles, the client does not need these additional + steps. + + For volatile filehandles, most commonly the client will need to store + the component names leading up to and including the file system + object in question. With these names, the client should be able to + recover by finding a filehandle in the namespace that is still + available or by starting at the root of the server's file system + namespace. + + If the expired filehandle refers to an object that has been removed + from the file system, obviously the client will not be able to + recover from the expired filehandle. + + It is also possible that the expired filehandle refers to a file that + has been renamed. If the file was renamed by another client, again + it is possible that the original client will not be able to recover. + However, in the case that the client itself is renaming the file and + the file is open, it is possible that the client may be able to + recover. The client can determine the new pathname based on the + processing of the rename request. The client can then regenerate the + new filehandle based on the new pathname. The client could also use + the COMPOUND procedure to construct a series of operations like: + + RENAME A B + LOOKUP B + GETFH + + Note that the COMPOUND procedure does not provide atomicity. This + example only reduces the overhead of recovering from an expired + filehandle. + +5. File Attributes + + To meet the requirements of extensibility and increased + interoperability with non-UNIX platforms, attributes need to be + handled in a flexible manner. The NFSv3 fattr3 structure contains a + fixed list of attributes that not all clients and servers are able to + support or care about. 
4.4.  Client Recovery from Filehandle Expiration

If possible, the client SHOULD recover from the receipt of an
NFS4ERR_FHEXPIRED error.  The client must take on additional
responsibility so that it may prepare itself to recover from the
expiration of a volatile filehandle.  If the server returns
persistent filehandles, the client does not need these additional
steps.

For volatile filehandles, most commonly the client will need to
store the component names leading up to and including the file
system object in question.  With these names, the client should be
able to recover by finding a filehandle in the namespace that is
still available or by starting at the root of the server's file
system namespace.

If the expired filehandle refers to an object that has been removed
from the file system, obviously the client will not be able to
recover from the expired filehandle.

It is also possible that the expired filehandle refers to a file
that has been renamed.  If the file was renamed by another client,
again it is possible that the original client will not be able to
recover.  However, in the case that the client itself is renaming
the file and the file is open, it is possible that the client may be
able to recover.  The client can determine the new pathname based on
the processing of the rename request.  The client can then
regenerate the new filehandle based on the new pathname.  The client
could also use the COMPOUND procedure to construct a series of
operations like:

   RENAME A B
   LOOKUP B
   GETFH

Note that the COMPOUND procedure does not provide atomicity.  This
example only reduces the overhead of recovering from an expired
filehandle.

5.  File Attributes

To meet the requirements of extensibility and increased
interoperability with non-UNIX platforms, attributes need to be
handled in a flexible manner.  The NFSv3 fattr3 structure contains a
fixed list of attributes that not all clients and servers are able
to support or care about.  The fattr3 structure cannot be extended
as new needs arise, and it provides no way to indicate non-support.
With the NFSv4.1 protocol, the client is able to query what
attributes the server supports and construct requests with only
those supported attributes (or a subset thereof).

To this end, attributes are divided into three groups: REQUIRED,
RECOMMENDED, and named.  Both REQUIRED and RECOMMENDED attributes
are supported in the NFSv4.1 protocol by a specific and well-defined
encoding and are identified by number.  They are requested by
setting a bit in the bit vector sent in the GETATTR request; the
server response includes a bit vector to list what attributes were
returned in the response.  New REQUIRED or RECOMMENDED attributes
may be added to the NFSv4 protocol as part of a new minor version by
publishing a Standards Track RFC that allocates a new attribute
number value and defines the encoding for the attribute.  See
Section 2.7 for further discussion.

Named attributes are accessed by the new OPENATTR operation, which
accesses a hidden directory of attributes associated with a file
system object.  OPENATTR takes a filehandle for the object and
returns the filehandle for the attribute hierarchy.  The filehandle
for the named attributes is a directory object accessible by LOOKUP
or READDIR and contains files whose names represent the named
attributes and whose data bytes are the value of the attribute.  For
example:

   +----------+-----------+---------------------------------+
   | LOOKUP   | "foo"     | ; look up file                  |
   +----------+-----------+---------------------------------+
   | GETATTR  | attrbits  |                                 |
   +----------+-----------+---------------------------------+
   | OPENATTR |           | ; access foo's named attributes |
   +----------+-----------+---------------------------------+
   | LOOKUP   | "x11icon" | ; look up specific attribute    |
   +----------+-----------+---------------------------------+
   | READ     | 0,4096    | ; read stream of bytes          |
   +----------+-----------+---------------------------------+

                             Table 3

Named attributes are intended for data needed by applications rather
than by an NFS client implementation.  NFS implementors are strongly
encouraged to define their new attributes as RECOMMENDED attributes
by bringing them to the IETF Standards Track process.

The set of attributes that are classified as REQUIRED is
deliberately small since servers need to do whatever it takes to
support them.  A server should support as many of the RECOMMENDED
attributes as possible but, by their definition, the server is not
required to support all of them.  Attributes are deemed REQUIRED if
the data is both needed by a large number of clients and is not
otherwise reasonably computable by the client when support is not
provided on the server.

Note that the hidden directory returned by OPENATTR is a convenience
for protocol processing.  The client should not make any assumptions
about the server's implementation of named attributes and whether or
not the underlying file system at the server has a named attribute
directory.  Therefore, operations such as SETATTR and GETATTR on the
named attribute directory are undefined.
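As a non-normative illustration of the bit-vector encoding discussed
above, a client might build the GETATTR attribute mask as follows.
The attrmask_set helper is hypothetical; the attribute numbers are
those assigned in Tables 4 and 5, and the bitmap4 encoding places
attribute number n at bit (n mod 32) of word (n / 32):

   #include <stdint.h>

   #define FATTR4_TYPE    1
   #define FATTR4_CHANGE  3
   #define FATTR4_SIZE    4

   /* Set attribute number attrno in a bitmap4-style word array. */
   static void
   attrmask_set(uint32_t *words, unsigned int attrno)
   {
           words[attrno / 32] |= (uint32_t)1 << (attrno % 32);
   }

   /* Request the type, change, and size attributes in one GETATTR. */
   static void
   build_getattr_mask(uint32_t words[2])
   {
           words[0] = 0;
           words[1] = 0;
           attrmask_set(words, FATTR4_TYPE);
           attrmask_set(words, FATTR4_CHANGE);
           attrmask_set(words, FATTR4_SIZE);
   }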
5.1.  REQUIRED Attributes

These MUST be supported by every NFSv4.1 client and server in order
to ensure a minimum level of interoperability.  The server MUST
store and return these attributes, and the client MUST be able to
function with an attribute set limited to these attributes.  With
just the REQUIRED attributes, some client functionality may be
impaired or limited in some ways.  A client may ask for any of these
attributes to be returned by setting a bit in the GETATTR request,
and the server MUST return their value.

5.2.  RECOMMENDED Attributes

These attributes are understood well enough to warrant support in
the NFSv4.1 protocol.  However, they may not be supported on all
clients and servers.  A client may ask for any of these attributes
to be returned by setting a bit in the GETATTR request but must
handle the case where the server does not return them.  A client MAY
ask for the set of attributes the server supports and SHOULD NOT
request attributes the server does not support.  A server should be
tolerant of requests for unsupported attributes and simply not
return them rather than considering the request an error.  It is
expected that servers will support all attributes they comfortably
can and only fail to support attributes that are difficult to
support in their operating environments.  A server should provide
attributes whenever it doesn't have to "tell lies" to the client.
For example, a file modification time should be either an accurate
time or should not be supported by the server.  At times this will
be difficult for clients, but a client is better positioned to
decide whether and how to fabricate or construct an attribute or
whether to do without the attribute.

5.3.  Named Attributes

These attributes are not supported by direct encoding in the NFSv4
protocol but are accessed by string names rather than numbers and
correspond to an uninterpreted stream of bytes that are stored with
the file system object.  The namespace for these attributes may be
accessed by using the OPENATTR operation.  The OPENATTR operation
returns a filehandle for a virtual "named attribute directory", and
further perusal and modification of the namespace may be done using
operations that work on more typical directories.  In particular,
READDIR may be used to get a list of such named attributes, and
LOOKUP and OPEN may select a particular attribute.  Creation of a
new named attribute may be the result of an OPEN specifying file
creation.

Once an OPEN is done, named attributes may be examined and changed
by normal READ and WRITE operations using the filehandles and
stateids returned by OPEN.

Named attributes and the named attribute directory may have their
own (non-named) attributes.  Each of these objects MUST have all of
the REQUIRED attributes and may have additional RECOMMENDED
attributes.  However, the set of attributes for named attributes and
the named attribute directory need not be, and typically will not
be, as large as that for other objects in that file system.

Named attributes and the named attribute directory might be the
target of delegations (in the case of the named attribute directory,
these will be directory delegations).  However, since granting of
delegations is at the server's discretion, a server need not support
delegations on named attributes or the named attribute directory.

It is RECOMMENDED that servers support arbitrary named attributes.
A client should not depend on the ability to store any named
attributes in the server's file system.
If a server does support named + attributes, a client that is also able to handle them should be able + to copy a file's data and metadata with complete transparency from + one location to another; this would imply that names allowed for + regular directory entries are valid for named attribute names as + well. + + In NFSv4.1, the structure of named attribute directories is + restricted in a number of ways, in order to prevent the development + of non-interoperable implementations in which some servers support a + fully general hierarchical directory structure for named attributes + while others support a limited but adequate structure for named + attributes. In such an environment, clients or applications might + come to depend on non-portable extensions. The restrictions are: + + * CREATE is not allowed in a named attribute directory. Thus, such + objects as symbolic links and special files are not allowed to be + named attributes. Further, directories may not be created in a + named attribute directory, so no hierarchical structure of named + attributes for a single object is allowed. + + * If OPENATTR is done on a named attribute directory or on a named + attribute, the server MUST return NFS4ERR_WRONG_TYPE. + + * Doing a RENAME of a named attribute to a different named attribute + directory or to an ordinary (i.e., non-named-attribute) directory + is not allowed. + + * Creating hard links between named attribute directories or between + named attribute directories and ordinary directories is not + allowed. + + Names of attributes will not be controlled by this document or other + IETF Standards Track documents. See Section 22.2 for further + discussion. + +5.4. Classification of Attributes + + Each of the REQUIRED and RECOMMENDED attributes can be classified in + one of three categories: per server (i.e., the value of the attribute + will be the same for all file objects that share the same server + owner; see Section 2.5 for a definition of server owner), per file + system (i.e., the value of the attribute will be the same for some or + all file objects that share the same fsid attribute (Section 5.8.1.9) + and server owner), or per file system object. Note that it is + possible that some per file system attributes may vary within the + file system, depending on the value of the "homogeneous" + (Section 5.8.2.16) attribute. Note that the attributes + time_access_set and time_modify_set are not listed in this section + because they are write-only attributes corresponding to time_access + and time_modify, and are used in a special instance of SETATTR. 
*  The per-server attribute is:

      lease_time

*  The per-file system attributes are:

      supported_attrs, suppattr_exclcreat, fh_expire_type,
      link_support, symlink_support, unique_handles, aclsupport,
      cansettime, case_insensitive, case_preserving,
      chown_restricted, files_avail, files_free, files_total,
      fs_locations, homogeneous, maxfilesize, maxname, maxread,
      maxwrite, no_trunc, space_avail, space_free, space_total,
      time_delta, change_policy, fs_status, fs_layout_type,
      fs_locations_info, fs_charset_cap

*  The per-file system object attributes are:

      type, change, size, named_attr, fsid, rdattr_error,
      filehandle, acl, archive, fileid, hidden, maxlink, mimetype,
      mode, numlinks, owner, owner_group, rawdev, space_used,
      system, time_access, time_backup, time_create, time_metadata,
      time_modify, mounted_on_fileid, dir_notif_delay,
      dirent_notif_delay, dacl, sacl, layout_type, layout_hint,
      layout_blksize, layout_alignment, mdsthreshold, retention_get,
      retention_set, retentevt_get, retentevt_set, retention_hold,
      mode_set_masked

For quota_avail_hard, quota_avail_soft, and quota_used, see their
definitions below for the appropriate classification.

5.5.  Set-Only and Get-Only Attributes

Some REQUIRED and RECOMMENDED attributes are set-only; i.e., they
can be set via SETATTR but not retrieved via GETATTR.  Similarly,
some REQUIRED and RECOMMENDED attributes are get-only; i.e., they
can be retrieved via GETATTR but not set via SETATTR.  If a client
attempts to set a get-only attribute or get a set-only attribute,
the server MUST return NFS4ERR_INVAL.

5.6.  REQUIRED Attributes - List and Definition References

The list of REQUIRED attributes appears in Table 4.  The meanings of
the columns of the table are:

Name:  The name of the attribute.

Id:  The number assigned to the attribute.  In the event of
   conflicts between the assigned number and [10], the latter is
   likely authoritative, but should be resolved with Errata to this
   document and/or [10].  See [51] for the Errata process.

Data Type:  The XDR data type of the attribute.

Acc:  Access allowed to the attribute.  R means read-only (GETATTR
   may retrieve, SETATTR may not set).  W means write-only (SETATTR
   may set, GETATTR may not retrieve).  R W means read/write (GETATTR
   may retrieve, SETATTR may set).

Defined in:  The section of this specification that describes the
   attribute.
+ + +====================+====+============+=====+==================+ + | Name | Id | Data Type | Acc | Defined in: | + +====================+====+============+=====+==================+ + | supported_attrs | 0 | bitmap4 | R | Section 5.8.1.1 | + +--------------------+----+------------+-----+------------------+ + | type | 1 | nfs_ftype4 | R | Section 5.8.1.2 | + +--------------------+----+------------+-----+------------------+ + | fh_expire_type | 2 | uint32_t | R | Section 5.8.1.3 | + +--------------------+----+------------+-----+------------------+ + | change | 3 | uint64_t | R | Section 5.8.1.4 | + +--------------------+----+------------+-----+------------------+ + | size | 4 | uint64_t | R W | Section 5.8.1.5 | + +--------------------+----+------------+-----+------------------+ + | link_support | 5 | bool | R | Section 5.8.1.6 | + +--------------------+----+------------+-----+------------------+ + | symlink_support | 6 | bool | R | Section 5.8.1.7 | + +--------------------+----+------------+-----+------------------+ + | named_attr | 7 | bool | R | Section 5.8.1.8 | + +--------------------+----+------------+-----+------------------+ + | fsid | 8 | fsid4 | R | Section 5.8.1.9 | + +--------------------+----+------------+-----+------------------+ + | unique_handles | 9 | bool | R | Section 5.8.1.10 | + +--------------------+----+------------+-----+------------------+ + | lease_time | 10 | nfs_lease4 | R | Section 5.8.1.11 | + +--------------------+----+------------+-----+------------------+ + | rdattr_error | 11 | enum | R | Section 5.8.1.12 | + +--------------------+----+------------+-----+------------------+ + | filehandle | 19 | nfs_fh4 | R | Section 5.8.1.13 | + +--------------------+----+------------+-----+------------------+ + | suppattr_exclcreat | 75 | bitmap4 | R | Section 5.8.1.14 | + +--------------------+----+------------+-----+------------------+ + + Table 4 + +5.7. RECOMMENDED Attributes - List and Definition References + + The RECOMMENDED attributes are defined in Table 5. The meanings of + the column headers are the same as Table 4; see Section 5.6 for the + meanings. 
+ + +====================+====+====================+=====+=============+ + | Name | Id | Data Type | Acc | Defined in: | + +====================+====+====================+=====+=============+ + | acl | 12 | nfsace4<> | R W | Section | + | | | | | 6.2.1 | + +--------------------+----+--------------------+-----+-------------+ + | aclsupport | 13 | uint32_t | R | Section | + | | | | | 6.2.1.2 | + +--------------------+----+--------------------+-----+-------------+ + | archive | 14 | bool | R W | Section | + | | | | | 5.8.2.1 | + +--------------------+----+--------------------+-----+-------------+ + | cansettime | 15 | bool | R | Section | + | | | | | 5.8.2.2 | + +--------------------+----+--------------------+-----+-------------+ + | case_insensitive | 16 | bool | R | Section | + | | | | | 5.8.2.3 | + +--------------------+----+--------------------+-----+-------------+ + | case_preserving | 17 | bool | R | Section | + | | | | | 5.8.2.4 | + +--------------------+----+--------------------+-----+-------------+ + | change_policy | 60 | chg_policy4 | R | Section | + | | | | | 5.8.2.5 | + +--------------------+----+--------------------+-----+-------------+ + | chown_restricted | 18 | bool | R | Section | + | | | | | 5.8.2.6 | + +--------------------+----+--------------------+-----+-------------+ + | dacl | 58 | nfsacl41 | R W | Section | + | | | | | 6.2.2 | + +--------------------+----+--------------------+-----+-------------+ + | dir_notif_delay | 56 | nfstime4 | R | Section | + | | | | | 5.11.1 | + +--------------------+----+--------------------+-----+-------------+ + | dirent_notif_delay | 57 | nfstime4 | R | Section | + | | | | | 5.11.2 | + +--------------------+----+--------------------+-----+-------------+ + | fileid | 20 | uint64_t | R | Section | + | | | | | 5.8.2.7 | + +--------------------+----+--------------------+-----+-------------+ + | files_avail | 21 | uint64_t | R | Section | + | | | | | 5.8.2.8 | + +--------------------+----+--------------------+-----+-------------+ + | files_free | 22 | uint64_t | R | Section | + | | | | | 5.8.2.9 | + +--------------------+----+--------------------+-----+-------------+ + | files_total | 23 | uint64_t | R | Section | + | | | | | 5.8.2.10 | + +--------------------+----+--------------------+-----+-------------+ + | fs_charset_cap | 76 | uint32_t | R | Section | + | | | | | 5.8.2.11 | + +--------------------+----+--------------------+-----+-------------+ + | fs_layout_type | 62 | layouttype4<> | R | Section | + | | | | | 5.12.1 | + +--------------------+----+--------------------+-----+-------------+ + | fs_locations | 24 | fs_locations | R | Section | + | | | | | 5.8.2.12 | + +--------------------+----+--------------------+-----+-------------+ + | fs_locations_info | 67 | fs_locations_info4 | R | Section | + | | | | | 5.8.2.13 | + +--------------------+----+--------------------+-----+-------------+ + | fs_status | 61 | fs4_status | R | Section | + | | | | | 5.8.2.14 | + +--------------------+----+--------------------+-----+-------------+ + | hidden | 25 | bool | R W | Section | + | | | | | 5.8.2.15 | + +--------------------+----+--------------------+-----+-------------+ + | homogeneous | 26 | bool | R | Section | + | | | | | 5.8.2.16 | + +--------------------+----+--------------------+-----+-------------+ + | layout_alignment | 66 | uint32_t | R | Section | + | | | | | 5.12.2 | + +--------------------+----+--------------------+-----+-------------+ + | layout_blksize | 65 | uint32_t | R | Section | + | | | | | 5.12.3 | + 
+--------------------+----+--------------------+-----+-------------+ + | layout_hint | 63 | layouthint4 | W | Section | + | | | | | 5.12.4 | + +--------------------+----+--------------------+-----+-------------+ + | layout_type | 64 | layouttype4<> | R | Section | + | | | | | 5.12.5 | + +--------------------+----+--------------------+-----+-------------+ + | maxfilesize | 27 | uint64_t | R | Section | + | | | | | 5.8.2.17 | + +--------------------+----+--------------------+-----+-------------+ + | maxlink | 28 | uint32_t | R | Section | + | | | | | 5.8.2.18 | + +--------------------+----+--------------------+-----+-------------+ + | maxname | 29 | uint32_t | R | Section | + | | | | | 5.8.2.19 | + +--------------------+----+--------------------+-----+-------------+ + | maxread | 30 | uint64_t | R | Section | + | | | | | 5.8.2.20 | + +--------------------+----+--------------------+-----+-------------+ + | maxwrite | 31 | uint64_t | R | Section | + | | | | | 5.8.2.21 | + +--------------------+----+--------------------+-----+-------------+ + | mdsthreshold | 68 | mdsthreshold4 | R | Section | + | | | | | 5.12.6 | + +--------------------+----+--------------------+-----+-------------+ + | mimetype | 32 | utf8str_cs | R W | Section | + | | | | | 5.8.2.22 | + +--------------------+----+--------------------+-----+-------------+ + | mode | 33 | mode4 | R W | Section | + | | | | | 6.2.4 | + +--------------------+----+--------------------+-----+-------------+ + | mode_set_masked | 74 | mode_masked4 | W | Section | + | | | | | 6.2.5 | + +--------------------+----+--------------------+-----+-------------+ + | mounted_on_fileid | 55 | uint64_t | R | Section | + | | | | | 5.8.2.23 | + +--------------------+----+--------------------+-----+-------------+ + | no_trunc | 34 | bool | R | Section | + | | | | | 5.8.2.24 | + +--------------------+----+--------------------+-----+-------------+ + | numlinks | 35 | uint32_t | R | Section | + | | | | | 5.8.2.25 | + +--------------------+----+--------------------+-----+-------------+ + | owner | 36 | utf8str_mixed | R W | Section | + | | | | | 5.8.2.26 | + +--------------------+----+--------------------+-----+-------------+ + | owner_group | 37 | utf8str_mixed | R W | Section | + | | | | | 5.8.2.27 | + +--------------------+----+--------------------+-----+-------------+ + | quota_avail_hard | 38 | uint64_t | R | Section | + | | | | | 5.8.2.28 | + +--------------------+----+--------------------+-----+-------------+ + | quota_avail_soft | 39 | uint64_t | R | Section | + | | | | | 5.8.2.29 | + +--------------------+----+--------------------+-----+-------------+ + | quota_used | 40 | uint64_t | R | Section | + | | | | | 5.8.2.30 | + +--------------------+----+--------------------+-----+-------------+ + | rawdev | 41 | specdata4 | R | Section | + | | | | | 5.8.2.31 | + +--------------------+----+--------------------+-----+-------------+ + | retentevt_get | 71 | retention_get4 | R | Section | + | | | | | 5.13.3 | + +--------------------+----+--------------------+-----+-------------+ + | retentevt_set | 72 | retention_set4 | W | Section | + | | | | | 5.13.4 | + +--------------------+----+--------------------+-----+-------------+ + | retention_get | 69 | retention_get4 | R | Section | + | | | | | 5.13.1 | + +--------------------+----+--------------------+-----+-------------+ + | retention_hold | 73 | uint64_t | R W | Section | + | | | | | 5.13.5 | + +--------------------+----+--------------------+-----+-------------+ + | retention_set | 70 | retention_set4 | W | Section | + | 
| | | | 5.13.2 | + +--------------------+----+--------------------+-----+-------------+ + | sacl | 59 | nfsacl41 | R W | Section | + | | | | | 6.2.3 | + +--------------------+----+--------------------+-----+-------------+ + | space_avail | 42 | uint64_t | R | Section | + | | | | | 5.8.2.32 | + +--------------------+----+--------------------+-----+-------------+ + | space_free | 43 | uint64_t | R | Section | + | | | | | 5.8.2.33 | + +--------------------+----+--------------------+-----+-------------+ + | space_total | 44 | uint64_t | R | Section | + | | | | | 5.8.2.34 | + +--------------------+----+--------------------+-----+-------------+ + | space_used | 45 | uint64_t | R | Section | + | | | | | 5.8.2.35 | + +--------------------+----+--------------------+-----+-------------+ + | system | 46 | bool | R W | Section | + | | | | | 5.8.2.36 | + +--------------------+----+--------------------+-----+-------------+ + | time_access | 47 | nfstime4 | R | Section | + | | | | | 5.8.2.37 | + +--------------------+----+--------------------+-----+-------------+ + | time_access_set | 48 | settime4 | W | Section | + | | | | | 5.8.2.38 | + +--------------------+----+--------------------+-----+-------------+ + | time_backup | 49 | nfstime4 | R W | Section | + | | | | | 5.8.2.39 | + +--------------------+----+--------------------+-----+-------------+ + | time_create | 50 | nfstime4 | R W | Section | + | | | | | 5.8.2.40 | + +--------------------+----+--------------------+-----+-------------+ + | time_delta | 51 | nfstime4 | R | Section | + | | | | | 5.8.2.41 | + +--------------------+----+--------------------+-----+-------------+ + | time_metadata | 52 | nfstime4 | R | Section | + | | | | | 5.8.2.42 | + +--------------------+----+--------------------+-----+-------------+ + | time_modify | 53 | nfstime4 | R | Section | + | | | | | 5.8.2.43 | + +--------------------+----+--------------------+-----+-------------+ + | time_modify_set | 54 | settime4 | W | Section | + | | | | | 5.8.2.44 | + +--------------------+----+--------------------+-----+-------------+ + + Table 5 + +5.8. Attribute Definitions + +5.8.1. Definitions of REQUIRED Attributes + +5.8.1.1. Attribute 0: supported_attrs + + The bit vector that would retrieve all REQUIRED and RECOMMENDED + attributes that are supported for this object. The scope of this + attribute applies to all objects with a matching fsid. + +5.8.1.2. Attribute 1: type + + Designates the type of an object in terms of one of a number of + special constants: + + * NF4REG designates a regular file. + + * NF4DIR designates a directory. + + * NF4BLK designates a block device special file. + + * NF4CHR designates a character device special file. + + * NF4LNK designates a symbolic link. + + * NF4SOCK designates a named socket special file. + + * NF4FIFO designates a fifo special file. + + * NF4ATTRDIR designates a named attribute directory. + + * NF4NAMEDATTR designates a named attribute. + + Within the explanatory text and operation descriptions, the following + phrases will be used with the meanings given below: + + * The phrase "is a directory" means that the object's type attribute + is NF4DIR or NF4ATTRDIR. + + * The phrase "is a special file" means that the object's type + attribute is NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO. + + * The phrases "is an ordinary file" and "is a regular file" mean + that the object's type attribute is NF4REG or NF4NAMEDATTR. + +5.8.1.3. Attribute 2: fh_expire_type + + Server uses this to specify filehandle expiration behavior to the + client. 
See Section 4 for additional description.

5.8.1.4.  Attribute 3: change

A value created by the server that the client can use to determine
if file data, directory contents, or attributes of the object have
been modified.  The server may return the object's time_metadata
attribute for this attribute's value, but only if the file system
object cannot be updated more frequently than the resolution of
time_metadata.

5.8.1.5.  Attribute 4: size

The size of the object in bytes.

5.8.1.6.  Attribute 5: link_support

TRUE, if the object's file system supports hard links.

5.8.1.7.  Attribute 6: symlink_support

TRUE, if the object's file system supports symbolic links.

5.8.1.8.  Attribute 7: named_attr

TRUE, if this object has named attributes.  In other words, the
object has a non-empty named attribute directory.

5.8.1.9.  Attribute 8: fsid

Unique file system identifier for the file system holding this
object.  The fsid attribute has major and minor components, each of
which are of data type uint64_t.

5.8.1.10.  Attribute 9: unique_handles

TRUE, if two distinct filehandles are guaranteed to refer to two
different file system objects.

5.8.1.11.  Attribute 10: lease_time

Duration of the lease at server in seconds.

5.8.1.12.  Attribute 11: rdattr_error

Error returned from an attempt to retrieve attributes during a
READDIR operation.

5.8.1.13.  Attribute 19: filehandle

The filehandle of this object (primarily for READDIR requests).

5.8.1.14.  Attribute 75: suppattr_exclcreat

The bit vector that would set all REQUIRED and RECOMMENDED
attributes that are supported by the EXCLUSIVE4_1 method of file
creation via the OPEN operation.  The scope of this attribute
applies to all objects with a matching fsid.

5.8.2.  Definitions of Uncategorized RECOMMENDED Attributes

The definitions of most of the RECOMMENDED attributes follow.
Collections that share a common category are defined in other
sections.

5.8.2.1.  Attribute 14: archive

TRUE, if this file has been archived since the time of last
modification (deprecated in favor of time_backup).

5.8.2.2.  Attribute 15: cansettime

TRUE, if the server is able to change the times for a file system
object as specified in a SETATTR operation.

5.8.2.3.  Attribute 16: case_insensitive

TRUE, if file name comparisons on this file system are case
insensitive.

5.8.2.4.  Attribute 17: case_preserving

TRUE, if file name case on this file system is preserved.

5.8.2.5.  Attribute 60: change_policy

A value created by the server that the client can use to determine
if some server policy related to the current file system has been
subject to change.  If the value remains the same, then the client
can be sure that the values of the attributes related to fs location
and the fss_type field of the fs_status attribute have not changed.
On the other hand, a change in this value does not necessarily imply
a change in policy.  It is up to the client to interrogate the
server to determine if some policy relevant to it has changed.  See
Section 3.3.6 for details.

This attribute MUST change when the value returned by the
fs_locations or fs_locations_info attribute changes, when a file
system goes from read-only to writable or vice versa, or when the
allowable set of security flavors for the file system or any part
thereof is changed.

5.8.2.6.
Attribute 18: chown_restricted + + If TRUE, the server will reject any request to change either the + owner or the group associated with a file if the caller is not a + privileged user (for example, "root" in UNIX operating environments + or, in Windows 2000, the "Take Ownership" privilege). + +5.8.2.7. Attribute 20: fileid + + A number uniquely identifying the file within the file system. + +5.8.2.8. Attribute 21: files_avail + + File slots available to this user on the file system containing this + object -- this should be the smallest relevant limit. + +5.8.2.9. Attribute 22: files_free + + Free file slots on the file system containing this object -- this + should be the smallest relevant limit. + +5.8.2.10. Attribute 23: files_total + + Total file slots on the file system containing this object. + +5.8.2.11. Attribute 76: fs_charset_cap + + Character set capabilities for this file system. See Section 14.4. + +5.8.2.12. Attribute 24: fs_locations + + Locations where this file system may be found. If the server returns + NFS4ERR_MOVED as an error, this attribute MUST be supported. See + Section 11.16 for more details. + +5.8.2.13. Attribute 67: fs_locations_info + + Full function file system location. See Section 11.17.2 for more + details. + +5.8.2.14. Attribute 61: fs_status + + Generic file system type information. See Section 11.18 for more + details. + +5.8.2.15. Attribute 25: hidden + + TRUE, if the file is considered hidden with respect to the Windows + API. + +5.8.2.16. Attribute 26: homogeneous + + TRUE, if this object's file system is homogeneous; i.e., all objects + in the file system (all objects on the server with the same fsid) + have common values for all per-file-system attributes. + +5.8.2.17. Attribute 27: maxfilesize + + Maximum supported file size for the file system of this object. + +5.8.2.18. Attribute 28: maxlink + + Maximum number of links for this object. + +5.8.2.19. Attribute 29: maxname + + Maximum file name size supported for this object. + +5.8.2.20. Attribute 30: maxread + + Maximum amount of data the READ operation will return for this + object. + +5.8.2.21. Attribute 31: maxwrite + + Maximum amount of data the WRITE operation will accept for this + object. This attribute SHOULD be supported if the file is writable. + Lack of this attribute can lead to the client either wasting + bandwidth or not receiving the best performance. + +5.8.2.22. Attribute 32: mimetype + + MIME body type/subtype of this object. + +5.8.2.23. Attribute 55: mounted_on_fileid + + Like fileid, but if the target filehandle is the root of a file + system, this attribute represents the fileid of the underlying + directory. + + UNIX-based operating environments connect a file system into the + namespace by connecting (mounting) the file system onto the existing + file object (the mount point, usually a directory) of an existing + file system. When the mount point's parent directory is read via an + API like readdir(), the return results are directory entries, each + with a component name and a fileid. The fileid of the mount point's + directory entry will be different from the fileid that the stat() + system call returns. The stat() system call is returning the fileid + of the root of the mounted file system, whereas readdir() is + returning the fileid that stat() would have returned before any file + systems were mounted on the mount point. + + Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other + file systems. 
The client detects the file system crossing whenever + the filehandle argument of LOOKUP has an fsid attribute different + from that of the filehandle returned by LOOKUP. A UNIX-based client + will consider this a "mount point crossing". UNIX has a legacy + scheme for allowing a process to determine its current working + directory. This relies on readdir() of a mount point's parent and + stat() of the mount point returning fileids as previously described. + The mounted_on_fileid attribute corresponds to the fileid that + readdir() would have returned as described previously. + + While the NFSv4.1 client could simply fabricate a fileid + corresponding to what mounted_on_fileid provides (and if the server + does not support mounted_on_fileid, the client has no choice), there + is a risk that the client will generate a fileid that conflicts with + one that is already assigned to another object in the file system. + Instead, if the server can provide the mounted_on_fileid, the + potential for client operational problems in this area is eliminated. + + If the server detects that there is no mounted point at the target + file object, then the value for mounted_on_fileid that it returns is + the same as that of the fileid attribute. + + The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD + provide it if possible, and for a UNIX-based server, this is + straightforward. Usually, mounted_on_fileid will be requested during + a READDIR operation, in which case it is trivial (at least for UNIX- + based servers) to return mounted_on_fileid since it is equal to the + fileid of a directory entry returned by readdir(). If + mounted_on_fileid is requested in a GETATTR operation, the server + should obey an invariant that has it returning a value that is equal + to the file object's entry in the object's parent directory, i.e., + what readdir() would have returned. Some operating environments + allow a series of two or more file systems to be mounted onto a + single mount point. In this case, for the server to obey the + aforementioned invariant, it will need to find the base mount point, + and not the intermediate mount points. + +5.8.2.24. Attribute 34: no_trunc + + If this attribute is TRUE, then if the client uses a file name longer + than name_max, an error will be returned instead of the name being + truncated. + +5.8.2.25. Attribute 35: numlinks + + Number of hard links to this object. + +5.8.2.26. Attribute 36: owner + + The string name of the owner of this object. + +5.8.2.27. Attribute 37: owner_group + + The string name of the group ownership of this object. + +5.8.2.28. Attribute 38: quota_avail_hard + + The value in bytes that represents the amount of additional disk + space beyond the current allocation that can be allocated to this + file or directory before further allocations will be refused. It is + understood that this space may be consumed by allocations to other + files or directories. + +5.8.2.29. Attribute 39: quota_avail_soft + + The value in bytes that represents the amount of additional disk + space that can be allocated to this file or directory before the user + may reasonably be warned. It is understood that this space may be + consumed by allocations to other files or directories though there is + a rule as to which other files or directories. + +5.8.2.30. 
Attribute 40: quota_used + + The value in bytes that represents the amount of disk space used by + this file or directory and possibly a number of other similar files + or directories, where the set of "similar" meets at least the + criterion that allocating space to any file or directory in the set + will reduce the "quota_avail_hard" of every other file or directory + in the set. + + Note that there may be a number of distinct but overlapping sets of + files or directories for which a quota_used value is maintained, + e.g., "all files with a given owner", "all files with a given group + owner", etc. The server is at liberty to choose any of those sets + when providing the content of the quota_used attribute, but should do + so in a repeatable way. The rule may be configured per file system + or may be "choose the set with the smallest quota". + +5.8.2.31. Attribute 41: rawdev + + Raw device number of file of type NF4BLK or NF4CHR. The device + number is split into major and minor numbers. If the file's type + attribute is not NF4BLK or NF4CHR, the value returned SHOULD NOT be + considered useful. + +5.8.2.32. Attribute 42: space_avail + + Disk space in bytes available to this user on the file system + containing this object -- this should be the smallest relevant limit. + +5.8.2.33. Attribute 43: space_free + + Free disk space in bytes on the file system containing this object -- + this should be the smallest relevant limit. + +5.8.2.34. Attribute 44: space_total + + Total disk space in bytes on the file system containing this object. + +5.8.2.35. Attribute 45: space_used + + Number of file system bytes allocated to this object. + +5.8.2.36. Attribute 46: system + + This attribute is TRUE if this file is a "system" file with respect + to the Windows operating environment. + +5.8.2.37. Attribute 47: time_access + + The time_access attribute represents the time of last access to the + object by a READ operation sent to the server. The notion of what is + an "access" depends on the server's operating environment and/or the + server's file system semantics. For example, for servers obeying + Portable Operating System Interface (POSIX) semantics, time_access + would be updated only by the READ and READDIR operations and not any + of the operations that modify the content of the object [13], [14], + [15]. Of course, setting the corresponding time_access_set attribute + is another way to modify the time_access attribute. + + Whenever the file object resides on a writable file system, the + server should make its best efforts to record time_access into stable + storage. However, to mitigate the performance effects of doing so, + and most especially whenever the server is satisfying the read of the + object's content from its cache, the server MAY cache access time + updates and lazily write them to stable storage. It is also + acceptable to give administrators of the server the option to disable + time_access updates. + +5.8.2.38. Attribute 48: time_access_set + + Sets the time of last access to the object. SETATTR use only. + +5.8.2.39. Attribute 49: time_backup + + The time of last backup of the object. + +5.8.2.40. Attribute 50: time_create + + The time of creation of the object. This attribute does not have any + relation to the traditional UNIX file attribute "ctime" or "change + time". + +5.8.2.41. Attribute 51: time_delta + + Smallest useful server time granularity. + +5.8.2.42. Attribute 52: time_metadata + + The time of last metadata modification of the object. + +5.8.2.43. 
Attribute 53: time_modify + + The time of last modification to the object. + +5.8.2.44. Attribute 54: time_modify_set + + Sets the time of last modification to the object. SETATTR use only. + +5.9. Interpreting owner and owner_group + + The RECOMMENDED attributes "owner" and "owner_group" (and also users + and groups within the "acl" attribute) are represented in terms of a + UTF-8 string. To avoid a representation that is tied to a particular + underlying implementation at the client or server, the use of the + UTF-8 string has been chosen. Note that Section 6.1 of RFC 2624 [53] + provides additional rationale. It is expected that the client and + server will have their own local representation of owner and + owner_group that is used for local storage or presentation to the end + user. Therefore, it is expected that when these attributes are + transferred between the client and server, the local representation + is translated to a syntax of the form "user@dns_domain". This will + allow for a client and server that do not use the same local + representation the ability to translate to a common syntax that can + be interpreted by both. + + Similarly, security principals may be represented in different ways + by different security mechanisms. Servers normally translate these + representations into a common format, generally that used by local + storage, to serve as a means of identifying the users corresponding + to these security principals. When these local identifiers are + translated to the form of the owner attribute, associated with files + created by such principals, they identify, in a common format, the + users associated with each corresponding set of security principals. + + The translation used to interpret owner and group strings is not + specified as part of the protocol. This allows various solutions to + be employed. For example, a local translation table may be consulted + that maps a numeric identifier to the user@dns_domain syntax. A name + service may also be used to accomplish the translation. A server may + provide a more general service, not limited by any particular + translation (which would only translate a limited set of possible + strings) by storing the owner and owner_group attributes in local + storage without any translation or it may augment a translation + method by storing the entire string for attributes for which no + translation is available while using the local representation for + those cases in which a translation is available. + + Servers that do not provide support for all possible values of the + owner and owner_group attributes SHOULD return an error + (NFS4ERR_BADOWNER) when a string is presented that has no + translation, as the value to be set for a SETATTR of the owner, + owner_group, or acl attributes. When a server does accept an owner + or owner_group value as valid on a SETATTR (and similarly for the + owner and group strings in an acl), it is promising to return that + same string when a corresponding GETATTR is done. Configuration + changes (including changes from the mapping of the string to the + local representation) and ill-constructed name translations (those + that contain aliasing) may make that promise impossible to honor. + Servers should make appropriate efforts to avoid a situation in which + these attributes have their values changed when no real change to + ownership has occurred. + + The "dns_domain" portion of the owner string is meant to be a DNS + domain name, for example, user@example.org. 
Servers should accept as + valid a set of users for at least one domain. A server may treat + other domains as having no valid translations. A more general + service is provided when a server is capable of accepting users for + multiple domains, or for all domains, subject to security + constraints. + + In the case where there is no translation available to the client or + server, the attribute value will be constructed without the "@". + Therefore, the absence of the @ from the owner or owner_group + attribute signifies that no translation was available at the sender + and that the receiver of the attribute should not use that string as + a basis for translation into its own internal format. Even though + the attribute value cannot be translated, it may still be useful. In + the case of a client, the attribute string may be used for local + display of ownership. + + To provide a greater degree of compatibility with NFSv3, which + identified users and groups by 32-bit unsigned user identifiers and + group identifiers, owner and group strings that consist of decimal + numeric values with no leading zeros can be given a special + interpretation by clients and servers that choose to provide such + support. The receiver may treat such a user or group string as + representing the same user as would be represented by an NFSv3 uid or + gid having the corresponding numeric value. A server is not + obligated to accept such a string, but may return an NFS4ERR_BADOWNER + instead. To avoid this mechanism being used to subvert user and + group translation, so that a client might pass all of the owners and + groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER + error when there is a valid translation for the user or owner + designated in this way. In that case, the client must use the + appropriate name@domain string and not the special form for + compatibility. + + The owner string "nobody" may be used to designate an anonymous user, + which will be associated with a file created by a security principal + that cannot be mapped through normal means to the owner attribute. + Users and implementations of NFSv4.1 SHOULD NOT use "nobody" to + designate a real user whose access is not anonymous. + +5.10. Character Case Attributes + + With respect to the case_insensitive and case_preserving attributes, + each UCS-4 character (which UTF-8 encodes) can be mapped according to + Appendix B.2 of RFC 3454 [16]. For general character handling and + internationalization issues, see Section 14. + +5.11. Directory Notification Attributes + + As described in Section 18.39, the client can request a minimum delay + for notifications of changes to attributes, but the server is free to + ignore what the client requests. The client can determine in advance + what notification delays the server will accept by sending a GETATTR + operation for either or both of two directory notification + attributes. When the client calls the GET_DIR_DELEGATION operation + and asks for attribute change notifications, it should request + notification delays that are no less than the values in the server- + provided attributes. + +5.11.1. Attribute 56: dir_notif_delay + + The dir_notif_delay attribute is the minimum number of seconds the + server will delay before notifying the client of a change to the + directory's attributes. + +5.11.2. 
Attribute 57: dirent_notif_delay + + The dirent_notif_delay attribute is the minimum number of seconds the + server will delay before notifying the client of a change to a file + object that has an entry in the directory. + +5.12. pNFS Attribute Definitions + +5.12.1. Attribute 62: fs_layout_type + + The fs_layout_type attribute (see Section 3.3.13) applies to a file + system and indicates what layout types are supported by the file + system. When the client encounters a new fsid, the client SHOULD + obtain the value for the fs_layout_type attribute associated with the + new file system. This attribute is used by the client to determine + if the layout types supported by the server match any of the client's + supported layout types. + +5.12.2. Attribute 66: layout_alignment + + When a client holds layouts on files of a file system, the + layout_alignment attribute indicates the preferred alignment for I/O + to files on that file system. Where possible, the client should send + READ and WRITE operations with offsets that are whole multiples of + the layout_alignment attribute. + +5.12.3. Attribute 65: layout_blksize + + When a client holds layouts on files of a file system, the + layout_blksize attribute indicates the preferred block size for I/O + to files on that file system. Where possible, the client should send + READ operations with a count argument that is a whole multiple of + layout_blksize, and WRITE operations with a data argument of size + that is a whole multiple of layout_blksize. + +5.12.4. Attribute 63: layout_hint + + The layout_hint attribute (see Section 3.3.19) may be set on newly + created files to influence the metadata server's choice for the + file's layout. If possible, this attribute is one of those set in + the initial attributes within the OPEN operation. The metadata + server may choose to ignore this attribute. The layout_hint + attribute is a subset of the layout structure returned by LAYOUTGET. + For example, instead of specifying particular devices, this would be + used to suggest the stripe width of a file. The server + implementation determines which fields within the layout will be + used. + +5.12.5. Attribute 64: layout_type + + This attribute lists the layout type(s) available for a file. The + value returned by the server is for informational purposes only. The + client will use the LAYOUTGET operation to obtain the information + needed in order to perform I/O, for example, the specific device + information for the file and its layout. + +5.12.6. Attribute 68: mdsthreshold + + This attribute is a server-provided hint used to communicate to the + client when it is more efficient to send READ and WRITE operations to + the metadata server or the data server. The two types of thresholds + described are file size thresholds and I/O size thresholds. If a + file's size is smaller than the file size threshold, data accesses + SHOULD be sent to the metadata server. If an I/O request has a + length that is below the I/O size threshold, the I/O SHOULD be sent + to the metadata server. Each threshold type is specified separately + for read and write. + + The server MAY provide both types of thresholds for a file. If both + file size and I/O size are provided, the client SHOULD reach or + exceed both thresholds before sending its read or write requests to + the data server. Alternatively, if only one of the specified + thresholds is reached or exceeded, the I/O requests are sent to the + metadata server. 
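As a non-normative sketch of this decision, a client might apply the
read thresholds as follows; the structure and helper names are
illustrative stand-ins for the hint's file-size and I/O-size read
members:

   #include <stdbool.h>
   #include <stdint.h>

   /* Illustrative stand-ins for the read thresholds in the hint. */
   struct read_thresholds {
           uint64_t file_size;  /* files smaller than this ...      */
           uint64_t io_size;    /* ... or requests shorter than this */
   };

   /* Returns true if a READ should go to the metadata server. */
   static bool
   read_via_mds(const struct read_thresholds *t,
                uint64_t file_size, uint64_t request_len)
   {
           /* Use the data server only when both thresholds are
            * reached or exceeded. */
           return file_size < t->file_size ||
                  request_len < t->io_size;
   }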
For each threshold type, a value of zero indicates no READ or WRITE
should be sent to the metadata server, while a value of all ones
indicates that all READs or WRITEs should be sent to the metadata
server.

The attribute is available on a per-filehandle basis.  If the
current filehandle refers to a non-pNFS file or directory, the
metadata server should return an attribute that is representative of
the filehandle's file system.  It is suggested that this attribute
is queried as part of the OPEN operation.  Due to dynamic system
changes, the client should not assume that the attribute will remain
constant for any specific time period; thus, it should be
periodically refreshed.

5.13.  Retention Attributes

Retention is a concept whereby a file object can be placed in an
immutable, undeletable, unrenamable state for a fixed or infinite
duration of time.  Once in this "retained" state, the file cannot be
moved out of the state until the duration of retention has been
reached.

When retention is enabled, retention MUST extend to the data of the
file, and the name of the file.  The server MAY extend retention to
any other property of the file, including any subset of REQUIRED,
RECOMMENDED, and named attributes, with the exceptions noted in this
section.

Servers MAY support or not support retention on any file object
type.

The five retention attributes are explained in the next subsections.

5.13.1.  Attribute 69: retention_get

If retention is enabled for the associated file, this attribute's
value represents the retention begin time of the file object.  This
attribute's value is only readable with the GETATTR operation and
MUST NOT be modified by the SETATTR operation (Section 5.5).  The
value of the attribute consists of:

   const RET4_DURATION_INFINITE = 0xffffffffffffffff;
   struct retention_get4 {
           uint64_t        rg_duration;
           nfstime4        rg_begin_time<1>;
   };

The field rg_duration is the duration in seconds indicating how long
the file will be retained once retention is enabled.  The field
rg_begin_time is an array of up to one absolute time value.  If the
array is zero length, no beginning retention time has been
established, and retention is not enabled.  If rg_duration is equal
to RET4_DURATION_INFINITE, the file, once retention is enabled, will
be retained for an infinite duration.

If (as soon as) rg_duration is zero, then rg_begin_time will be of
zero length, and again, retention is not (no longer) enabled.

5.13.2.  Attribute 70: retention_set

This attribute is used to set the retention duration and optionally
enable retention for the associated file object.  This attribute is
only modifiable via the SETATTR operation and MUST NOT be retrieved
by the GETATTR operation (Section 5.5).  This attribute corresponds
to retention_get.  The value of the attribute consists of:

   struct retention_set4 {
           bool            rs_enable;
           uint64_t        rs_duration<1>;
   };

If the client sets rs_enable to TRUE, then it is enabling retention
on the file object with the begin time of retention starting from
the server's current time and date.  The duration of the retention
can also be provided if the rs_duration array is of length one.  The
duration is the time in seconds from the begin time of retention,
and if set to RET4_DURATION_INFINITE, the file is to be retained
forever.  If retention is enabled, with no duration specified in
either this SETATTR or a previous SETATTR, the duration defaults to
zero seconds.
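As a non-normative example, a client wishing to enable retention for
one year might fill in the retention_set4 value as sketched below,
in C rather than XDR, with the optional rs_duration<1> array
flattened into an explicit length field for illustration:

   #include <stdbool.h>
   #include <stdint.h>

   /* C mirror of the XDR above; names follow the XDR definition,
    * but the layout here is illustrative only. */
   struct retention_set4 {
           bool     rs_enable;
           uint32_t rs_duration_len;  /* 0 or 1 */
           uint64_t rs_duration;      /* seconds from begin time */
   };

   static void
   enable_one_year_retention(struct retention_set4 *rs)
   {
           rs->rs_enable = true;     /* begin time: server's now */
           rs->rs_duration_len = 1;  /* supply explicit duration */
           rs->rs_duration = 365ULL * 24 * 60 * 60;
   }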
+ The server MAY restrict the enabling of retention or the duration of + retention on the basis of the ACE4_WRITE_RETENTION ACL permission. + The enabling of retention MUST NOT prevent the enabling of event- + based retention or the modification of the retention_hold attribute. + + The following rules apply to both the retention_set and retentevt_set + attributes. + + * As long as retention is not enabled, the client is permitted to + decrease the duration. + + * The duration can always be set to an equal or higher value, even + if retention is enabled. Note that once retention is enabled, the + actual duration (as returned by the retention_get or retentevt_get + attributes; see Section 5.13.1 or Section 5.13.3) is constantly + counting down to zero (one unit per second), unless the duration + was set to RET4_DURATION_INFINITE. Thus, it will not be possible + for the client to precisely extend the duration on a file that has + retention enabled. + + * While retention is enabled, attempts to disable retention or + decrease the retention's duration MUST fail with the error + NFS4ERR_INVAL. + + * If the principal attempting to change retention_set or + retentevt_set does not have ACE4_WRITE_RETENTION permissions, the + attempt MUST fail with NFS4ERR_ACCESS. + +5.13.3. Attribute 71: retentevt_get + + Gets the event-based retention duration, and if enabled, the event- + based retention begin time of the file object. This attribute is + like retention_get, but refers to event-based retention. The event + that triggers event-based retention is not defined by the NFSv4.1 + specification. + +5.13.4. Attribute 72: retentevt_set + + Sets the event-based retention duration, and optionally enables + event-based retention on the file object. This attribute corresponds + to retentevt_get and is like retention_set, but refers to event-based + retention. When event-based retention is set, the file MUST be + retained even if non-event-based retention has been set, and the + duration of non-event-based retention has been reached. Conversely, + when non-event-based retention has been set, the file MUST be + retained even if event-based retention has been set, and the duration + of event-based retention has been reached. The server MAY restrict + the enabling of event-based retention or the duration of event-based + retention on the basis of the ACE4_WRITE_RETENTION ACL permission. + The enabling of event-based retention MUST NOT prevent the enabling + of non-event-based retention or the modification of the + retention_hold attribute. + +5.13.5. Attribute 73: retention_hold + + Gets or sets administrative retention holds, one hold per bit + position. + + This attribute allows one to 64 administrative holds, one hold per + bit on the attribute. If retention_hold is not zero, then the file + MUST NOT be deleted, renamed, or modified, even if the duration on + enabled event or non-event-based retention has been reached. The + server MAY restrict the modification of retention_hold on the basis + of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of + administration retention holds does not prevent the enabling of + event-based or non-event-based retention. + + If the principal attempting to change retention_hold does not have + ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with + NFS4ERR_ACCESS. + +6. Access Control Attributes + + Access Control Lists (ACLs) are file attributes that specify fine- + grained access control. 
This section covers the "acl", "dacl", + "sacl", "aclsupport", "mode", and "mode_set_masked" file attributes + and their interactions. Note that file attributes may apply to any + file system object. + +6.1. Goals + + ACLs and modes represent two well-established models for specifying + permissions. This section specifies requirements that attempt to + meet the following goals: + + * If a server supports the mode attribute, it should provide + reasonable semantics to clients that only set and retrieve the + mode attribute. + + * If a server supports ACL attributes, it should provide reasonable + semantics to clients that only set and retrieve those attributes. + + * On servers that support the mode attribute, if ACL attributes have + never been set on an object, via inheritance or explicitly, the + behavior should be traditional UNIX-like behavior. + + * On servers that support the mode attribute, if the ACL attributes + have been previously set on an object, either explicitly or via + inheritance: + + - Setting only the mode attribute should effectively control the + traditional UNIX-like permissions of read, write, and execute + on owner, owner_group, and other. + + - Setting only the mode attribute should provide reasonable + security. For example, setting a mode of 000 should be enough + to ensure that future OPEN operations for + OPEN4_SHARE_ACCESS_READ or OPEN4_SHARE_ACCESS_WRITE by any + principal fail, regardless of a previously existing or + inherited ACL. + + * NFSv4.1 may introduce different semantics relating to the mode and + ACL attributes, but it does not render invalid any previously + existing implementations. Additionally, this section provides + clarifications based on previous implementations and discussions + around them. + + * On servers that support both the mode and the acl or dacl + attributes, the server must keep the two consistent with each + other. The value of the mode attribute (with the exception of the + three high-order bits described in Section 6.2.4) must be + determined entirely by the value of the ACL, so that use of the + mode is never required for anything other than setting the three + high-order bits. See Section 6.4.1 for exact requirements. + + * When a mode attribute is set on an object, the ACL attributes may + need to be modified in order to not conflict with the new mode. + In such cases, it is desirable that the ACL keep as much + information as possible. This includes information about + inheritance, AUDIT and ALARM ACEs, and permissions granted and + denied that do not conflict with the new mode. + +6.2. File Attributes Discussion + +6.2.1. Attribute 12: acl + + The NFSv4.1 ACL attribute contains an array of Access Control Entries + (ACEs) that are associated with the file system object. Although the + client can set and get the acl attribute, the server is responsible + for using the ACL to perform access control. The client can use the + OPEN or ACCESS operations to check access without modifying or + reading data or metadata. + + The NFS ACE structure is defined as follows: + + typedef uint32_t acetype4; + + typedef uint32_t aceflag4; + + typedef uint32_t acemask4; + + struct nfsace4 { + acetype4 type; + aceflag4 flag; + acemask4 access_mask; + utf8str_mixed who; + }; + + To determine if a request succeeds, the server processes each nfsace4 + entry in order. Only ACEs that have a "who" that matches the + requester are considered. Each ACE is processed until all of the + bits of the requester's access have been ALLOWED. 
Once a bit (see + below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer + considered in the processing of later ACEs. If an ACCESS_DENIED_ACE + is encountered where the requester's access still has unALLOWED bits + in common with the "access_mask" of the ACE, the request is denied. + When the ACL is fully processed, if there are bits in the requester's + mask that have not been ALLOWED or DENIED, access is denied. + + Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do + not affect a requester's access, and instead are for triggering + events as a result of a requester's access attempt. Therefore, AUDIT + and ALARM ACEs are processed only after processing ALLOW and DENY + ACEs. + + The NFSv4.1 ACL model is quite rich. Some server platforms may + provide access-control functionality that goes beyond the UNIX-style + mode attribute, but that is not as rich as the NFS ACL model. So + that users can take advantage of this more limited functionality, the + server may support the acl attributes by mapping between its ACL + model and the NFSv4.1 ACL model. Servers must ensure that the ACL + they actually store or enforce is at least as strict as the NFSv4 ACL + that was set. It is tempting to accomplish this by rejecting any ACL + that falls outside the small set that can be represented accurately. + However, such an approach can render ACLs unusable without special + client-side knowledge of the server's mapping, which defeats the + purpose of having a common NFSv4 ACL protocol. Therefore, servers + should accept every ACL that they can without compromising security. + To help accomplish this, servers may make a special exception, in the + case of unsupported permission bits, to the rule that bits not + ALLOWED or DENIED by an ACL must be denied. For example, a UNIX- + style server might choose to silently allow read attribute + permissions even though an ACL does not explicitly allow those + permissions. (An ACL that explicitly denies permission to read + attributes should still be rejected.) + + The situation is complicated by the fact that a server may have + multiple modules that enforce ACLs. For example, the enforcement for + NFSv4.1 access may be different from, but not weaker than, the + enforcement for local access, and both may be different from the + enforcement for access through other protocols such as SMB (Server + Message Block). So it may be useful for a server to accept an ACL + even if not all of its modules are able to support it. + + The guiding principle with regard to NFSv4 access is that the server + must not accept ACLs that appear to make access to the file more + restrictive than it really is. + +6.2.1.1. ACE Type + + The constants used for the type field (acetype4) are as follows: + + const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; + const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; + const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; + const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; + + Only the ALLOWED and DENIED bits may be used in the dacl attribute, + and only the AUDIT and ALARM bits may be used in the sacl attribute. + All four are permitted in the acl attribute. + + +==============================+==============+=====================+ + | Value | Abbreviation | Description | + +==============================+==============+=====================+ + | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW | Explicitly grants | + | | | the access | + | | | defined in | + | | | acemask4 to the | + | | | file or | + | | | directory. 
| + +------------------------------+--------------+---------------------+ + | ACE4_ACCESS_DENIED_ACE_TYPE | DENY | Explicitly denies | + | | | the access | + | | | defined in | + | | | acemask4 to the | + | | | file or | + | | | directory. | + +------------------------------+--------------+---------------------+ + | ACE4_SYSTEM_AUDIT_ACE_TYPE | AUDIT | Log (in a system- | + | | | dependent way) | + | | | any access | + | | | attempt to a file | + | | | or directory that | + | | | uses any of the | + | | | access methods | + | | | specified in | + | | | acemask4. | + +------------------------------+--------------+---------------------+ + | ACE4_SYSTEM_ALARM_ACE_TYPE | ALARM | Generate an alarm | + | | | (in a system- | + | | | dependent way) | + | | | when any access | + | | | attempt is made | + | | | to a file or | + | | | directory for the | + | | | access methods | + | | | specified in | + | | | acemask4. | + +------------------------------+--------------+---------------------+ + + Table 6 + + The "Abbreviation" column denotes how the types will be referred to + throughout the rest of this section. + +6.2.1.2. Attribute 13: aclsupport + + A server need not support all of the above ACE types. This attribute + indicates which ACE types are supported for the current file system. + The bitmask constants used to represent the above definitions within + the aclsupport attribute are as follows: + + const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; + const ACL4_SUPPORT_DENY_ACL = 0x00000002; + const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; + const ACL4_SUPPORT_ALARM_ACL = 0x00000008; + + Servers that support either the ALLOW or DENY ACE type SHOULD support + both ALLOW and DENY ACE types. + + Clients should not attempt to set an ACE unless the server claims + support for that ACE type. If the server receives a request to set + an ACE that it cannot store, it MUST reject the request with + NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE + that it can store but cannot enforce, the server SHOULD reject the + request with NFS4ERR_ATTRNOTSUPP. + + Support for any of the ACL attributes is optional (albeit + RECOMMENDED). However, a server that supports either of the new ACL + attributes (dacl or sacl) MUST allow use of the new ACL attributes to + access all of the ACE types that it supports. In other words, if + such a server supports ALLOW or DENY ACEs, then it MUST support the + dacl attribute, and if it supports AUDIT or ALARM ACEs, then it MUST + support the sacl attribute. + +6.2.1.3. ACE Access Mask + + The bitmask constants used for the access mask field are as follows: + + const ACE4_READ_DATA = 0x00000001; + const ACE4_LIST_DIRECTORY = 0x00000001; + const ACE4_WRITE_DATA = 0x00000002; + const ACE4_ADD_FILE = 0x00000002; + const ACE4_APPEND_DATA = 0x00000004; + const ACE4_ADD_SUBDIRECTORY = 0x00000004; + const ACE4_READ_NAMED_ATTRS = 0x00000008; + const ACE4_WRITE_NAMED_ATTRS = 0x00000010; + const ACE4_EXECUTE = 0x00000020; + const ACE4_DELETE_CHILD = 0x00000040; + const ACE4_READ_ATTRIBUTES = 0x00000080; + const ACE4_WRITE_ATTRIBUTES = 0x00000100; + const ACE4_WRITE_RETENTION = 0x00000200; + const ACE4_WRITE_RETENTION_HOLD = 0x00000400; + + const ACE4_DELETE = 0x00010000; + const ACE4_READ_ACL = 0x00020000; + const ACE4_WRITE_ACL = 0x00040000; + const ACE4_WRITE_OWNER = 0x00080000; + const ACE4_SYNCHRONIZE = 0x00100000; + + Note that some masks have coincident values, for example, + ACE4_READ_DATA and ACE4_LIST_DIRECTORY. 
The mask entries + ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are + intended to be used with directory objects, while ACE4_READ_DATA, + ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with + non-directory objects. + +6.2.1.3.1. Discussion of Mask Attributes + + ACE4_READ_DATA + + Operation(s) affected: + READ + + OPEN + + Discussion: + Permission to read the data of the file. + + Servers SHOULD allow a user the ability to read the data of the + file when only the ACE4_EXECUTE access mask bit is allowed. + + ACE4_LIST_DIRECTORY + + Operation(s) affected: + READDIR + + Discussion: + Permission to list the contents of a directory. + + ACE4_WRITE_DATA + + Operation(s) affected: + WRITE + + OPEN + + SETATTR of size + + Discussion: + Permission to modify a file's data. + + ACE4_ADD_FILE + + Operation(s) affected: + CREATE + + LINK + + OPEN + + RENAME + + Discussion: + Permission to add a new file in a directory. The CREATE + operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, + NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because it + is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected when + used to create a regular file. LINK and RENAME are always + affected. + + ACE4_APPEND_DATA + + Operation(s) affected: + WRITE + + OPEN + + SETATTR of size + + Discussion: + The ability to modify a file's data, but only starting at EOF. + This allows for the notion of append-only files, by allowing + ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user + or group. If a file has an ACL such as the one described above + and a WRITE request is made for somewhere other than EOF, the + server SHOULD return NFS4ERR_ACCESS. + + ACE4_ADD_SUBDIRECTORY + + Operation(s) affected: + CREATE + + RENAME + + Discussion: + Permission to create a subdirectory in a directory. The CREATE + operation is affected when nfs_ftype4 is NF4DIR. The RENAME + operation is always affected. + + ACE4_READ_NAMED_ATTRS + + Operation(s) affected: + OPENATTR + + Discussion: + Permission to read the named attributes of a file or to look up + the named attribute directory. OPENATTR is affected when it is + not used to create a named attribute directory. This is when + 1) createdir is TRUE, but a named attribute directory already + exists, or 2) createdir is FALSE. + + ACE4_WRITE_NAMED_ATTRS + + Operation(s) affected: + OPENATTR + + Discussion: + Permission to write the named attributes of a file or to create + a named attribute directory. OPENATTR is affected when it is + used to create a named attribute directory. This is when + createdir is TRUE and no named attribute directory exists. The + ability to check whether or not a named attribute directory + exists depends on the ability to look it up; therefore, users + also need the ACE4_READ_NAMED_ATTRS permission in order to + create a named attribute directory. + + ACE4_EXECUTE + + Operation(s) affected: + READ + + OPEN + + REMOVE + + RENAME + + LINK + + CREATE + + Discussion: + Permission to execute a file. + + Servers SHOULD allow a user the ability to read the data of the + file when only the ACE4_EXECUTE access mask bit is allowed. + This is because there is no way to execute a file without + reading the contents. Though a server may treat ACE4_EXECUTE + and ACE4_READ_DATA bits identically when deciding to permit a + READ operation, it SHOULD still allow the two bits to be set + independently in ACLs, and MUST distinguish between them when + replying to ACCESS operations. 
In particular, servers SHOULD
+         NOT silently turn on one of the two bits when the other is
+         set, as that would make it impossible for the client to
+         correctly enforce the distinction between read and execute
+         permissions.
+
+         As an example, following a SETATTR of the following ACL:
+
+            nfsuser:ACE4_EXECUTE:ALLOW
+
+         A subsequent GETATTR of ACL for that file SHOULD return:
+
+            nfsuser:ACE4_EXECUTE:ALLOW
+
+         Rather than:
+
+            nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW
+
+   ACE4_EXECUTE
+
+      Operation(s) affected:
+         LOOKUP
+
+      Discussion:
+         Permission to traverse/search a directory.
+
+   ACE4_DELETE_CHILD
+
+      Operation(s) affected:
+         REMOVE
+
+         RENAME
+
+      Discussion:
+         Permission to delete a file or directory within a directory.
+         See Section 6.2.1.3.2 for information on how ACE4_DELETE and
+         ACE4_DELETE_CHILD interact.
+
+   ACE4_READ_ATTRIBUTES
+
+      Operation(s) affected:
+         GETATTR of file system object attributes
+
+         VERIFY
+
+         NVERIFY
+
+         READDIR
+
+      Discussion:
+         The ability to read basic attributes (non-ACLs) of a file.
+         On a UNIX system, basic attributes can be thought of as the
+         stat-level attributes.  Allowing this access mask bit would
+         mean that the entity can execute "ls -l" and stat.  If a
+         READDIR operation requests attributes, this mask must be
+         allowed for the READDIR to succeed.
+
+   ACE4_WRITE_ATTRIBUTES
+
+      Operation(s) affected:
+         SETATTR of time_access_set, time_backup,
+
+         time_create, time_modify_set, mimetype, hidden, system
+
+      Discussion:
+         Permission to change the times associated with a file or
+         directory to an arbitrary value.  Also permission to change
+         the mimetype, hidden, and system attributes.  A user having
+         ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to
+         set the times associated with a file to the current server
+         time.
+
+   ACE4_WRITE_RETENTION
+
+      Operation(s) affected:
+         SETATTR of retention_set, retentevt_set.
+
+      Discussion:
+         Permission to modify the durations of event and non-event-
+         based retention.  Also permission to enable event and non-
+         event-based retention.  A server MAY behave such that setting
+         ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION.
+
+   ACE4_WRITE_RETENTION_HOLD
+
+      Operation(s) affected:
+         SETATTR of retention_hold.
+
+      Discussion:
+         Permission to modify the administrative retention holds.  A
+         server MAY map ACE4_WRITE_ATTRIBUTES to
+         ACE4_WRITE_RETENTION_HOLD.
+
+   ACE4_DELETE
+
+      Operation(s) affected:
+         REMOVE
+
+      Discussion:
+         Permission to delete the file or directory.  See
+         Section 6.2.1.3.2 for information on how ACE4_DELETE and
+         ACE4_DELETE_CHILD interact.
+
+   ACE4_READ_ACL
+
+      Operation(s) affected:
+         GETATTR of acl, dacl, or sacl
+
+         NVERIFY
+
+         VERIFY
+
+      Discussion:
+         Permission to read the ACL.
+
+   ACE4_WRITE_ACL
+
+      Operation(s) affected:
+         SETATTR of acl and mode
+
+      Discussion:
+         Permission to write the acl and mode attributes.
+
+   ACE4_WRITE_OWNER
+
+      Operation(s) affected:
+         SETATTR of owner and owner_group
+
+      Discussion:
+         Permission to write the owner and owner_group attributes.  On
+         UNIX systems, this is the ability to execute chown() and
+         chgrp().
+
+   ACE4_SYNCHRONIZE
+
+      Operation(s) affected:
+         NONE
+
+      Discussion:
+         Permission to use the file object as a synchronization
+         primitive for interprocess communication.  This permission is
+         not enforced or interpreted by the NFSv4.1 server on behalf
+         of the client.
+
+         Typically, the ACE4_SYNCHRONIZE permission is only meaningful
+         on local file systems, i.e., file systems not accessed via
+         NFSv4.1.
The reason that the permission bit exists is that + some operating environments, such as Windows, use + ACE4_SYNCHRONIZE. + + For example, if a client copies a file that has + ACE4_SYNCHRONIZE set from a local file system to an NFSv4.1 + server, and then later copies the file from the NFSv4.1 server + to a local file system, it is likely that if ACE4_SYNCHRONIZE + was set in the original file, the client will want it set in + the second copy. The first copy will not have the permission + set unless the NFSv4.1 server has the means to set the + ACE4_SYNCHRONIZE bit. The second copy will not have the + permission set unless the NFSv4.1 server has the means to + retrieve the ACE4_SYNCHRONIZE bit. + + Server implementations need not provide the granularity of control + that is implied by this list of masks. For example, POSIX-based + systems might not distinguish ACE4_APPEND_DATA (the ability to append + to a file) from ACE4_WRITE_DATA (the ability to modify existing + contents); both masks would be tied to a single "write" permission + [17]. When such a server returns attributes to the client, it would + show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the + write permission is enabled. + + If a server receives a SETATTR request that it cannot accurately + implement, it should err in the direction of more restricted access, + except in the previously discussed cases of execute and read. For + example, suppose a server cannot distinguish overwriting data from + appending new data, as described in the previous paragraph. If a + client submits an ALLOW ACE where ACE4_APPEND_DATA is set but + ACE4_WRITE_DATA is not (or vice versa), the server should either turn + off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP. + +6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD + + Two access mask bits govern the ability to delete a directory entry: + ACE4_DELETE on the object itself (the "target") and ACE4_DELETE_CHILD + on the containing directory (the "parent"). + + Many systems also take the "sticky bit" (MODE4_SVTX) on a directory + to allow unlink only to a user that owns either the target or the + parent; on some such systems the decision also depends on whether the + target is writable. + + Servers SHOULD allow unlink if either ACE4_DELETE is permitted on the + target, or ACE4_DELETE_CHILD is permitted on the parent. (Note that + this is true even if the parent or target explicitly denies one of + these permissions.) + + If the ACLs in question neither explicitly ALLOW nor DENY either of + the above, and if MODE4_SVTX is not set on the parent, then the + server SHOULD allow the removal if and only if ACE4_ADD_FILE is + permitted. In the case where MODE4_SVTX is set, the server may also + require the remover to own either the parent or the target, or may + require the target to be writable. + + This allows servers to support something close to traditional UNIX- + like semantics, with ACE4_ADD_FILE taking the place of the write bit. + +6.2.1.4. ACE flag + + The bitmask constants used for the flag field are as follows: + + const ACE4_FILE_INHERIT_ACE = 0x00000001; + const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; + const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; + const ACE4_INHERIT_ONLY_ACE = 0x00000008; + const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; + const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; + const ACE4_IDENTIFIER_GROUP = 0x00000040; + const ACE4_INHERITED_ACE = 0x00000080; + + A server need not support any of these flags. 
If the server supports + flags that are similar to, but not exactly the same as, these flags, + the implementation may define a mapping between the protocol-defined + flags and the implementation-defined flags. + + For example, suppose a client tries to set an ACE with + ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the + server does not support any form of ACL inheritance, the server + should reject the request with NFS4ERR_ATTRNOTSUPP. If the server + supports a single "inherit ACE" flag that applies to both files and + directories, the server may reject the request (i.e., requiring the + client to set both the file and directory inheritance flags). The + server may also accept the request and silently turn on the + ACE4_DIRECTORY_INHERIT_ACE flag. + +6.2.1.4.1. Discussion of Flag Bits + + ACE4_FILE_INHERIT_ACE + Any non-directory file in any sub-directory will get this ACE + inherited. + + ACE4_DIRECTORY_INHERIT_ACE + Can be placed on a directory and indicates that this ACE should be + added to each new directory created. + + If this flag is set in an ACE in an ACL attribute to be set on a + non-directory file system object, the operation attempting to set + the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP. + + ACE4_NO_PROPAGATE_INHERIT_ACE + Can be placed on a directory. This flag tells the server that + inheritance of this ACE should stop at newly created child + directories. + + ACE4_INHERIT_ONLY_ACE + Can be placed on a directory but does not apply to the directory; + ALLOW and DENY ACEs with this bit set do not affect access to the + directory, and AUDIT and ALARM ACEs with this bit set do not + trigger log or alarm events. Such ACEs only take effect once they + are applied (with this bit cleared) to newly created files and + directories as specified by the ACE4_FILE_INHERIT_ACE and + ACE4_DIRECTORY_INHERIT_ACE flags. + + If this flag is present on an ACE, but neither + ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present, + then an operation attempting to set such an attribute SHOULD fail + with NFS4ERR_ATTRNOTSUPP. + + ACE4_SUCCESSFUL_ACCESS_ACE_FLAG and ACE4_FAILED_ACCESS_ACE_FLAG + The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and + ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on + ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE + (ALARM) ACE types. If during the processing of the file's ACL, + the server encounters an AUDIT or ALARM ACE that matches the + principal attempting the OPEN, the server notes that fact, and the + presence, if any, of the SUCCESS and FAILED flags encountered in + the AUDIT or ALARM ACE. Once the server completes the ACL + processing, it then notes if the operation succeeded or failed. + If the operation succeeded, and if the SUCCESS flag was set for a + matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM + event occurs. If the operation failed, and if the FAILED flag was + set for the matching AUDIT or ALARM ACE, then the appropriate + AUDIT or ALARM event occurs. Either or both of the SUCCESS or + FAILED can be set, but if neither is set, the AUDIT or ALARM ACE + is not useful. + + The previously described processing applies to ACCESS operations + even when they return NFS4_OK. For the purposes of AUDIT and + ALARM, we consider an ACCESS operation to be a "failure" if it + fails to return a bit that was requested and supported. + + ACE4_IDENTIFIER_GROUP + Indicates that the "who" refers to a GROUP as defined under UNIX + or a GROUP ACCOUNT as defined under Windows. 
Clients and servers + MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who + value equal to one of the special identifiers outlined in + Section 6.2.1.5. + + ACE4_INHERITED_ACE + Indicates that this ACE is inherited from a parent directory. A + server that supports automatic inheritance will place this flag on + any ACEs inherited from the parent directory when creating a new + object. Client applications will use this to perform automatic + inheritance. Clients and servers MUST clear this bit in the acl + attribute; it may only be used in the dacl and sacl attributes. + +6.2.1.5. ACE Who + + The "who" field of an ACE is an identifier that specifies the + principal or principals to whom the ACE applies. It may refer to a + user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying + which. + + There are several special identifiers that need to be understood + universally, rather than in the context of a particular DNS domain. + Some of these identifiers cannot be understood when an NFS client + accesses the server, but have meaning when a local process accesses + the file. The ability to display and modify these permissions is + permitted over NFS, even if none of the access methods on the server + understands the identifiers. + + +===============+==================================================+ + | Who | Description | + +===============+==================================================+ + | OWNER | The owner of the file. | + +---------------+--------------------------------------------------+ + | GROUP | The group associated with the file. | + +---------------+--------------------------------------------------+ + | EVERYONE | The world, including the owner and owning group. | + +---------------+--------------------------------------------------+ + | INTERACTIVE | Accessed from an interactive terminal. | + +---------------+--------------------------------------------------+ + | NETWORK | Accessed via the network. | + +---------------+--------------------------------------------------+ + | DIALUP | Accessed as a dialup user to the server. | + +---------------+--------------------------------------------------+ + | BATCH | Accessed from a batch job. | + +---------------+--------------------------------------------------+ + | ANONYMOUS | Accessed without any authentication. | + +---------------+--------------------------------------------------+ + | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS). | + +---------------+--------------------------------------------------+ + | SERVICE | Access from a system service. | + +---------------+--------------------------------------------------+ + + Table 7 + + To avoid conflict, these special identifiers are distinguished by an + appended "@" and should appear in the form "xxxx@" (with no domain + name after the "@"), for example, ANONYMOUS@. + + The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these + special identifiers. When encoding entries with these special + identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero. + +6.2.1.5.1. Discussion of EVERYONE@ + + It is important to note that "EVERYONE@" is not equivalent to the + UNIX "other" entity. This is because, by definition, UNIX "other" + does not include the owner or owning group of a file. "EVERYONE@" + means literally everyone, including the owner or owning group. + +6.2.2. Attribute 58: dacl + + The dacl attribute is like the acl attribute, but dacl allows just + ALLOW and DENY ACEs. 
The dacl attribute supports automatic
+   inheritance (see Section 6.4.3.2).
+
+6.2.3.  Attribute 59: sacl
+
+   The sacl attribute is like the acl attribute, but sacl allows just
+   AUDIT and ALARM ACEs.  The sacl attribute supports automatic
+   inheritance (see Section 6.4.3.2).
+
+6.2.4.  Attribute 33: mode
+
+   The NFSv4.1 mode attribute is based on the UNIX mode bits.  The
+   following bits are defined:
+
+      const MODE4_SUID = 0x800;  /* set user id on execution */
+      const MODE4_SGID = 0x400;  /* set group id on execution */
+      const MODE4_SVTX = 0x200;  /* save text even after use */
+      const MODE4_RUSR = 0x100;  /* read permission: owner */
+      const MODE4_WUSR = 0x080;  /* write permission: owner */
+      const MODE4_XUSR = 0x040;  /* execute permission: owner */
+      const MODE4_RGRP = 0x020;  /* read permission: group */
+      const MODE4_WGRP = 0x010;  /* write permission: group */
+      const MODE4_XGRP = 0x008;  /* execute permission: group */
+      const MODE4_ROTH = 0x004;  /* read permission: other */
+      const MODE4_WOTH = 0x002;  /* write permission: other */
+      const MODE4_XOTH = 0x001;  /* execute permission: other */
+
+   Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
+   identified in the owner attribute.  Bits MODE4_RGRP, MODE4_WGRP, and
+   MODE4_XGRP apply to principals identified in the owner_group
+   attribute but who are not identified in the owner attribute.  Bits
+   MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that
+   does not match that in the owner attribute and does not have a group
+   matching that of the owner_group attribute.
+
+   Bits within a mode other than those specified above are not defined
+   by this protocol.  A server MUST NOT return bits other than those
+   defined above in a GETATTR or READDIR operation, and it MUST return
+   NFS4ERR_INVAL if bits other than those defined above are set in a
+   SETATTR, CREATE, OPEN, VERIFY, or NVERIFY operation.
+
+6.2.5.  Attribute 74: mode_set_masked
+
+   The mode_set_masked attribute is a write-only attribute that allows
+   individual bits in the mode attribute to be set or reset, without
+   changing others.  It allows, for example, the bits MODE4_SUID,
+   MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified
+   any of the nine low-order mode bits devoted to permissions.
+
+   In instances in which the nine low-order bits are left unmodified,
+   neither the acl nor the dacl attribute should be automatically
+   modified as discussed in Section 6.4.1.
+
+   The mode_set_masked attribute consists of two words, each in the
+   form of a mode4.  The first word consists of the value to be applied
+   to the current mode and the second is a mask.  Only bits set to one
+   in the mask word are changed (set or reset) in the file's mode.  All
+   other bits in the mode remain unchanged.  Bits in the first word
+   that correspond to bits that are zero in the mask are ignored,
+   except that undefined bits are checked for validity and can result
+   in NFS4ERR_INVAL as described below.
+
+   The mode_set_masked attribute is only valid in a SETATTR operation.
+   If it is used in a CREATE or OPEN operation, the server MUST return
+   NFS4ERR_INVAL.
+
+   Bits not defined as valid in the mode attribute are not valid in
+   either word of the mode_set_masked attribute.  The server MUST
+   return NFS4ERR_INVAL if any such bits are set to one in a SETATTR.
+   If the mode and mode_set_masked attributes are both specified in the
+   same SETATTR, the server MUST also return NFS4ERR_INVAL.
+
+6.3.
Common Methods + + The requirements in this section will be referred to in future + sections, especially Section 6.4. + +6.3.1. Interpreting an ACL + +6.3.1.1. Server Considerations + + The server uses the algorithm described in Section 6.2.1 to determine + whether an ACL allows access to an object. However, the ACL might + not be the sole determiner of access. For example: + + * In the case of a file system exported as read-only, the server may + deny write access even though an object's ACL grants it. + + * Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL + permissions to prevent a situation from arising in which there is + no valid way to ever modify the ACL. + + * All servers will allow a user the ability to read the data of the + file when only the execute permission is granted (i.e., if the ACL + denies the user the ACE4_READ_DATA access and allows the user + ACE4_EXECUTE, the server will allow the user to read the data of + the file). + + * Many servers have the notion of owner-override in which the owner + of the object is allowed to override accesses that are denied by + the ACL. This may be helpful, for example, to allow users + continued access to open files on which the permissions have + changed. + + * Many servers have the notion of a "superuser" that has privileges + beyond an ordinary user. The superuser may be able to read or + write data or metadata in ways that would not be permitted by the + ACL. + + * A retention attribute might also block access otherwise allowed by + ACLs (see Section 5.13). + +6.3.1.2. Client Considerations + + Clients SHOULD NOT do their own access checks based on their + interpretation of the ACL, but rather use the OPEN and ACCESS + operations to do access checks. This allows the client to act on the + results of having the server determine whether or not access should + be granted based on its interpretation of the ACL. + + Clients must be aware of situations in which an object's ACL will + define a certain access even though the server will not enforce it. + In general, but especially in these situations, the client needs to + do its part in the enforcement of access as defined by the ACL. To + do this, the client MAY send the appropriate ACCESS operation prior + to servicing the request of the user or application in order to + determine whether the user or application should be granted the + access requested. For examples in which the ACL may define accesses + that the server doesn't enforce, see Section 6.3.1.1. + +6.3.2. Computing a Mode Attribute from an ACL + + The following method can be used to calculate the MODE4_R*, MODE4_W*, + and MODE4_X* bits of a mode attribute, based upon an ACL. + + First, for each of the special identifiers OWNER@, GROUP@, and + EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY + ACEs for the identifier EVERYONE@ and for the identifier under + consideration. The result of the evaluation will be an NFSv4 ACL + mask showing exactly which bits are permitted to that identifier. + + Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@ + into mode bits for, respectively, the user, group, and other, as + follows: + + 1. Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and + only if ACE4_READ_DATA is set in the corresponding mask. + + 2. Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and + only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the + corresponding mask. + + 3. 
Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if + and only if ACE4_EXECUTE is set in the corresponding mask. + +6.3.2.1. Discussion + + Some server implementations also add bits permitted to named users + and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and + MODE4_XGRP). + + Implementations are discouraged from doing this, because it has been + found to cause confusion for users who see members of a file's group + denied access that the mode bits appear to allow. (The presence of + DENY ACEs may also lead to such behavior, but DENY ACEs are expected + to be more rarely used.) + + The same user confusion seen when fetching the mode also results if + setting the mode does not effectively control permissions for the + owner, group, and other users; this motivates some of the + requirements that follow. + +6.4. Requirements + + The server that supports both mode and ACL must take care to + synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the + ACEs that have respective who fields of "OWNER@", "GROUP@", and + "EVERYONE@". This way, the client can see if semantically equivalent + access permissions exist whether the client asks for the owner, + owner_group, and mode attributes or for just the ACL. + + In this section, much is made of the methods in Section 6.3.2. Many + requirements refer to this section. But note that the methods have + behaviors specified with "SHOULD". This is intentional, to avoid + invalidating existing implementations that compute the mode according + to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by + actual permissions on owner, group, and other. + +6.4.1. Setting the Mode and/or ACL Attributes + + In the case where a server supports the sacl or dacl attribute, in + addition to the acl attribute, the server MUST fail a request to set + the acl attribute simultaneously with a dacl or sacl attribute. The + error to be given is NFS4ERR_ATTRNOTSUPP. + +6.4.1.1. Setting Mode and not ACL + + When any of the nine low-order mode bits are subject to change, + either because the mode attribute was set or because the + mode_set_masked attribute was set and the mask included one or more + bits from the nine low-order mode bits, and no ACL attribute is + explicitly set, the acl and dacl attributes must be modified in + accordance with the updated value of those bits. This must happen + even if the value of the low-order bits is the same after the mode is + set as before. + + Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl + attribute) are unaffected by changes to the mode. + + In cases in which the permissions bits are subject to change, the acl + and dacl attributes MUST be modified such that the mode computed via + the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*, + MODE4_W*, MODE4_X*) of the mode attribute as modified by the + attribute change. The ACL attributes SHOULD also be modified such + that: + + 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL + other than OWNER@ and EVERYONE@ SHOULD NOT be granted + ACE4_READ_DATA. + + 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL + other than OWNER@ and EVERYONE@ SHOULD NOT be granted + ACE4_WRITE_DATA or ACE4_APPEND_DATA. + + 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL + other than OWNER@ and EVERYONE@ SHOULD NOT be granted + ACE4_EXECUTE. + + Access mask bits other than those listed above, appearing in ALLOW + ACEs, MAY also be disabled. 
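+
+   The computation in Section 6.3.2 that the above requirements
+   reference can be illustrated with the following non-normative C
+   sketch.  It assumes the nfsace4 structure and the ACE4_* and
+   MODE4_* constants defined in this document; the header name and the
+   helper ace_matches(), which tests whether an ACE's "who" field
+   names the given special identifier, are hypothetical.
+
+   #include "nfs4_prot.h"   /* hypothetical XDR-generated header */
+
+   enum special_who { WHO_OWNER, WHO_GROUP, WHO_EVERYONE };
+
+   /* Hypothetical helper: does this ACE's "who" name the given
+      special identifier (OWNER@, GROUP@, or EVERYONE@)? */
+   extern int ace_matches(const nfsace4 *ace, enum special_who w);
+
+   /* Evaluate the ACL in order, considering only ALLOW and DENY ACEs
+      for EVERYONE@ and for the identifier under consideration, and
+      return the mask of bits permitted to that identifier. */
+   static acemask4
+   allowed_bits(const nfsace4 *aces, unsigned n, enum special_who w)
+   {
+       acemask4 allowed = 0, denied = 0;
+       unsigned i;
+
+       for (i = 0; i < n; i++) {
+           if (!ace_matches(&aces[i], w) &&
+               !ace_matches(&aces[i], WHO_EVERYONE))
+               continue;
+           if (aces[i].type == ACE4_ACCESS_ALLOWED_ACE_TYPE)
+               allowed |= aces[i].access_mask & ~denied;
+           else if (aces[i].type == ACE4_ACCESS_DENIED_ACE_TYPE)
+               denied |= aces[i].access_mask & ~allowed;
+       }
+       return allowed;
+   }
+
+   /* Steps 1-3 of Section 6.3.2: translate the masks computed for
+      OWNER@, GROUP@, and EVERYONE@ into the nine low-order mode
+      bits. */
+   static mode4
+   mode_from_acl(const nfsace4 *aces, unsigned n)
+   {
+       static const mode4 rwx[3][3] = {
+           { MODE4_RUSR, MODE4_WUSR, MODE4_XUSR },  /* OWNER@    */
+           { MODE4_RGRP, MODE4_WGRP, MODE4_XGRP },  /* GROUP@    */
+           { MODE4_ROTH, MODE4_WOTH, MODE4_XOTH },  /* EVERYONE@ */
+       };
+       mode4 mode = 0;
+       int w;
+
+       for (w = 0; w < 3; w++) {
+           acemask4 m = allowed_bits(aces, n, (enum special_who)w);
+
+           if (m & ACE4_READ_DATA)
+               mode |= rwx[w][0];
+           if ((m & ACE4_WRITE_DATA) && (m & ACE4_APPEND_DATA))
+               mode |= rwx[w][1];
+           if (m & ACE4_EXECUTE)
+               mode |= rwx[w][2];
+       }
+       return mode;
+   }
+
+   A server applying the rules above can use such a computation to
+   check that the ACL it stores yields exactly the nine low-order bits
+   of the newly set mode.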
+ + Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect + the permissions of the ACL itself, nor do ACEs of the type AUDIT and + ALARM. As such, it is desirable to leave these ACEs unmodified when + modifying the ACL attributes. + + Also note that the requirement may be met by discarding the acl and + dacl, in favor of an ACL that represents the mode and only the mode. + This is permitted, but it is preferable for a server to preserve as + much of the ACL as possible without violating the above requirements. + Discarding the ACL makes it effectively impossible for a file created + with a mode attribute to inherit an ACL (see Section 6.4.3). + +6.4.1.2. Setting ACL and Not Mode + + When setting the acl or dacl and not setting the mode or + mode_set_masked attributes, the permission bits of the mode need to + be derived from the ACL. In this case, the ACL attribute SHOULD be + set as given. The nine low-order bits of the mode attribute + (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result + of the method in Section 6.3.2. The three high-order bits of the + mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. + +6.4.1.3. Setting Both ACL and Mode + + When setting both the mode (includes use of either the mode attribute + or the mode_set_masked attribute) and the acl or dacl attributes in + the same operation, the attributes MUST be applied in this order: + mode (or mode_set_masked), then ACL. The mode-related attribute is + set as given, then the ACL attribute is set as given, possibly + changing the final mode, as described above in Section 6.4.1.2. + +6.4.2. Retrieving the Mode and/or ACL Attributes + + This section applies only to servers that support both the mode and + ACL attributes. + + Some server implementations may have a concept of "objects without + ACLs", meaning that all permissions are granted and denied according + to the mode attribute and that no ACL attribute is stored for that + object. If an ACL attribute is requested of such a server, the + server SHOULD return an ACL that does not conflict with the mode; + that is to say, the ACL returned SHOULD represent the nine low-order + bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as + described in Section 6.3.2. + + For other server implementations, the ACL attribute is always present + for every object. Such servers SHOULD store at least the three high- + order bits of the mode attribute (MODE4_SUID, MODE4_SGID, + MODE4_SVTX). The server SHOULD return a mode attribute if one is + requested, and the low-order nine bits of the mode (MODE4_R*, + MODE4_W*, MODE4_X*) MUST match the result of applying the method in + Section 6.3.2 to the ACL attribute. + +6.4.3. Creating New Objects + + If a server supports any ACL attributes, it may use the ACL + attributes on the parent directory to compute an initial ACL + attribute for a newly created object. This will be referred to as + the inherited ACL within this section. The act of adding one or more + ACEs to the inherited ACL that are based upon ACEs in the parent + directory's ACL will be referred to as inheriting an ACE within this + section. + + Implementors should standardize what the behavior of CREATE and OPEN + must be depending on the presence or absence of the mode and ACL + attributes. + + 1. If just the mode is given in the call: + + In this case, inheritance SHOULD take place, but the mode MUST be + applied to the inherited ACL as described in Section 6.4.1.1, + thereby modifying the ACL. + + 2. 
If just the ACL is given in the call: + + In this case, inheritance SHOULD NOT take place, and the ACL as + defined in the CREATE or OPEN will be set without modification, + and the mode modified as in Section 6.4.1.2. + + 3. If both mode and ACL are given in the call: + + In this case, inheritance SHOULD NOT take place, and both + attributes will be set as described in Section 6.4.1.3. + + 4. If neither mode nor ACL is given in the call: + + In the case where an object is being created without any initial + attributes at all, e.g., an OPEN operation with an opentype4 of + OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD + NOT take place (note that EXCLUSIVE4_1 is a better choice of + createmode4, since it does permit initial attributes). Instead, + the server SHOULD set permissions to deny all access to the newly + created object. It is expected that the appropriate client will + set the desired attributes in a subsequent SETATTR operation, and + the server SHOULD allow that operation to succeed, regardless of + what permissions the object is created with. For example, an + empty ACL denies all permissions, but the server should allow the + owner's SETATTR to succeed even though WRITE_ACL is implicitly + denied. + + In other cases, inheritance SHOULD take place, and no + modifications to the ACL will happen. The mode attribute, if + supported, MUST be as computed in Section 6.3.2, with the + MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. If no + inheritable ACEs exist on the parent directory, the rules for + creating acl, dacl, or sacl attributes are implementation + defined. If either the dacl or sacl attribute is supported, then + the ACL4_DEFAULTED flag SHOULD be set on the newly created + attributes. + +6.4.3.1. The Inherited ACL + + If the object being created is not a directory, the inherited ACL + SHOULD NOT inherit ACEs from the parent directory ACL unless the + ACE4_FILE_INHERIT_FLAG is set. + + If the object being created is a directory, the inherited ACL should + inherit all inheritable ACEs from the parent directory, that is, + those that have the ACE4_FILE_INHERIT_ACE or + ACE4_DIRECTORY_INHERIT_ACE flag set. If the inheritable ACE has + ACE4_FILE_INHERIT_ACE set but ACE4_DIRECTORY_INHERIT_ACE is clear, + the inherited ACE on the newly created directory MUST have the + ACE4_INHERIT_ONLY_ACE flag set to prevent the directory from being + affected by ACEs meant for non-directories. + + When a new directory is created, the server MAY split any inherited + ACE that is both inheritable and effective (in other words, that has + neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE set), + into two ACEs, one with no inheritance flags and one with + ACE4_INHERIT_ONLY_ACE set. (In the case of a dacl or sacl attribute, + both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.) + This makes it simpler to modify the effective permissions on the + directory without modifying the ACE that is to be inherited to the + new directory's children. + +6.4.3.2. Automatic Inheritance + + The acl attribute consists only of an array of ACEs, but the sacl + (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an + additional flag field. + + struct nfsacl41 { + aclflag4 na41_flag; + nfsace4 na41_aces<>; + }; + + The flag field applies to the entire sacl or dacl; three flag values + are defined: + + const ACL4_AUTO_INHERIT = 0x00000001; + const ACL4_PROTECTED = 0x00000002; + const ACL4_DEFAULTED = 0x00000004; + + and all other bits must be cleared. 
The ACE4_INHERITED_ACE flag may + be set in the ACEs of the sacl or dacl (whereas it must always be + cleared in the acl). + + Together these features allow a server to support automatic + inheritance, which we now explain in more detail. + + Inheritable ACEs are normally inherited by child objects only at the + time that the child objects are created; later modifications to + inheritable ACEs do not result in modifications to inherited ACEs on + descendants. + + However, the dacl and sacl provide an OPTIONAL mechanism that allows + a client application to propagate changes to inheritable ACEs to an + entire directory hierarchy. + + A server that supports this performs inheritance at object creation + time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on + any inherited ACEs as they are added to the new object. + + A client application such as an ACL editor may then propagate changes + to inheritable ACEs on a directory by recursively traversing that + directory's descendants and modifying each ACL encountered to remove + any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the + new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). It + uses the existing ACE inheritance flags in the obvious way to decide + which ACEs to propagate. (Note that it may encounter further + inheritable ACEs when descending the directory hierarchy and that + those will also need to be taken into account when propagating + inheritable ACEs to further descendants.) + + The reach of this propagation may be limited in two ways: first, + automatic inheritance is not performed from any directory ACL that + has the ACL4_AUTO_INHERIT flag cleared; and second, automatic + inheritance stops wherever an ACL with the ACL4_PROTECTED flag is + set, preventing modification of that ACL and also (if the ACL is set + on a directory) of the ACL on any of the object's descendants. + + This propagation is performed independently for the sacl and the dacl + attributes; thus, the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may + be independently set for the sacl and the dacl, and propagation of + one type of acl may continue down a hierarchy even where propagation + of the other acl has stopped. + + New objects should be created with a dacl and a sacl that both have + the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to + the same value as that on, respectively, the sacl or dacl of the + parent object. + + Both the dacl and sacl attributes are RECOMMENDED, and a server may + support one without supporting the other. + + A server that supports both the old acl attribute and one or both of + the new dacl or sacl attributes must do so in such a way as to keep + all three attributes consistent with each other. Thus, the ACEs + reported in the acl attribute should be the union of the ACEs + reported in the dacl and sacl attributes, except that the + ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. + And of course a client that queries only the acl will be unable to + determine the values of the sacl or dacl flag fields. + + When a client performs a SETATTR for the acl attribute, the server + SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the + dacl. By using the acl attribute, as opposed to the dacl or sacl + attributes, the client signals that it may not understand automatic + inheritance, and thus cannot be trusted to set an ACL for which + automatic inheritance would make sense. 
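+
+   The propagation described above can be illustrated with the
+   following non-normative C sketch of its core step: rewriting a
+   single descendant's dacl.  The nfsace4 structure and the ACE4_* and
+   ACL4_* constants are those defined in this section; the flattened
+   ACL representation, the MAX_ACES bound, the header name, and the
+   caller that walks the directory tree (fetching and storing each
+   dacl) are assumptions of the sketch.
+
+   #include "nfs4_prot.h"   /* hypothetical XDR-generated header */
+
+   #define MAX_ACES 128     /* arbitrary bound for the sketch */
+
+   struct flat_dacl {
+       aclflag4 flag;       /* ACL4_* flags */
+       unsigned n_aces;
+       nfsace4  aces[MAX_ACES];
+   };
+
+   /* Replace the inherited ACEs of one descendant's dacl with the
+    * new inheritable ACEs.  Returns 0 on success, or -1 where
+    * propagation must stop (ACL4_PROTECTED set, or ACL4_AUTO_INHERIT
+    * clear). */
+   static int
+   propagate_one(struct flat_dacl *dacl, const nfsace4 *inh,
+                 unsigned n_inh, int child_is_dir)
+   {
+       nfsace4 out[MAX_ACES];
+       unsigned n = 0, i;
+
+       if ((dacl->flag & ACL4_PROTECTED) ||
+           !(dacl->flag & ACL4_AUTO_INHERIT))
+           return -1;
+
+       /* Keep the ACEs that were not themselves inherited. */
+       for (i = 0; i < dacl->n_aces; i++)
+           if (!(dacl->aces[i].flag & ACE4_INHERITED_ACE))
+               out[n++] = dacl->aces[i];
+
+       /* Append the applicable inheritable ACEs, marked as inherited,
+        * at the end of the ACL.  (For brevity, this sketch omits the
+        * Section 6.4.3.1 case of carrying ACE4_FILE_INHERIT_ACE-only
+        * ACEs onto directories with ACE4_INHERIT_ONLY_ACE set.) */
+       for (i = 0; i < n_inh && n < MAX_ACES; i++) {
+           nfsace4 a = inh[i];
+
+           if (child_is_dir
+                   ? !(a.flag & ACE4_DIRECTORY_INHERIT_ACE)
+                   : !(a.flag & ACE4_FILE_INHERIT_ACE))
+               continue;
+           a.flag |= ACE4_INHERITED_ACE;
+           if (!child_is_dir)   /* inheritance flags apply only to
+                                   directories */
+               a.flag &= ~(ACE4_FILE_INHERIT_ACE |
+                           ACE4_DIRECTORY_INHERIT_ACE |
+                           ACE4_INHERIT_ONLY_ACE);
+           out[n++] = a;
+       }
+
+       for (i = 0; i < n; i++)
+           dacl->aces[i] = out[i];
+       dacl->n_aces = n;
+       return 0;
+   }
+
+   When propagate_one() succeeds on a directory, the application would
+   recurse into it, using that directory's resulting inheritable ACEs
+   for its children.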
+ + When a client application queries an ACL, modifies it, and sets it + again, it should leave any ACEs marked with ACE4_INHERITED_ACE + unchanged, in their original order, at the end of the ACL. If the + application is unable to do this, it should set the ACL4_PROTECTED + flag. This behavior is not enforced by servers, but violations of + this rule may lead to unexpected results when applications perform + automatic inheritance. + + If a server also supports the mode attribute, it SHOULD set the mode + in such a way that leaves inherited ACEs unchanged, in their original + order, at the end of the ACL. If it is unable to do so, it SHOULD + set the ACL4_PROTECTED flag on the file's dacl. + + Finally, in the case where the request that creates a new file or + directory does not also set permissions for that file or directory, + and there are also no ACEs to inherit from the parent's directory, + then the server's choice of ACL for the new object is implementation- + dependent. In this case, the server SHOULD set the ACL4_DEFAULTED + flag on the ACL it chooses for the new object. An application + performing automatic inheritance takes the ACL4_DEFAULTED flag as a + sign that the ACL should be completely replaced by one generated + using the automatic inheritance rules. + +7. Single-Server Namespace + + This section describes the NFSv4 single-server namespace. Single- + server namespaces may be presented directly to clients, or they may + be used as a basis to form larger multi-server namespaces (e.g., + site-wide or organization-wide) to be presented to clients, as + described in Section 11. + +7.1. Server Exports + + On a UNIX server, the namespace describes all the files reachable by + pathnames under the root directory or "/". On a Windows server, the + namespace constitutes all the files on disks named by mapped disk + letters. NFS server administrators rarely make the entire server's + file system namespace available to NFS clients. More often, portions + of the namespace are made available via an "export" feature. In + previous versions of the NFS protocol, the root filehandle for each + export is obtained through the MOUNT protocol; the client sent a + string that identified the export name within the namespace and the + server returned the root filehandle for that export. The MOUNT + protocol also provided an EXPORTS procedure that enumerated the + server's exports. + +7.2. Browsing Exports + + The NFSv4.1 protocol provides a root filehandle that clients can use + to obtain filehandles for the exports of a particular server, via a + series of LOOKUP operations within a COMPOUND, to traverse a path. A + common user experience is to use a graphical user interface (perhaps + a file "Open" dialog window) to find a file via progressive browsing + through a directory tree. The client must be able to move from one + export to another export via single-component, progressive LOOKUP + operations. + + This style of browsing is not well supported by the NFSv3 protocol. + In NFSv3, the client expects all LOOKUP operations to remain within a + single server file system. For example, the device attribute will + not change. This prevents a client from taking namespace paths that + span exports. + + In the case of NFSv3, an automounter on the client can obtain a + snapshot of the server's namespace using the EXPORTS procedure of the + MOUNT protocol. If it understands the server's pathname syntax, it + can create an image of the server's namespace on the client. 
The + parts of the namespace that are not exported by the server are filled + in with directories that might be constructed similarly to an NFSv4.1 + "pseudo file system" (see Section 7.3) that allows the user to browse + from one mounted file system to another. There is a drawback to this + representation of the server's namespace on the client: it is static. + If the server administrator adds a new export, the client will be + unaware of it. + +7.3. Server Pseudo File System + + NFSv4.1 servers avoid this namespace inconsistency by presenting all + the exports for a given server within the framework of a single + namespace for that server. An NFSv4.1 client uses LOOKUP and READDIR + operations to browse seamlessly from one export to another. + + Where there are portions of the server namespace that are not + exported, clients require some way of traversing those portions to + reach actual exported file systems. A technique that servers may use + to provide for this is to bridge the unexported portion of the + namespace via a "pseudo file system" that provides a view of exported + directories only. A pseudo file system has a unique fsid and behaves + like a normal, read-only file system. + + Based on the construction of the server's namespace, it is possible + that multiple pseudo file systems may exist. For example, + + /a pseudo file system + /a/b real file system + /a/b/c pseudo file system + /a/b/c/d real file system + + Each of the pseudo file systems is considered a separate entity and + therefore MUST have its own fsid, unique among all the fsids for that + server. + +7.4. Multiple Roots + + Certain operating environments are sometimes described as having + "multiple roots". In such environments, individual file systems are + commonly represented by disk or volume names. NFSv4 servers for + these platforms can construct a pseudo file system above these root + names so that disk letters or volume names are simply directory names + in the pseudo root. + +7.5. Filehandle Volatility + + The nature of the server's pseudo file system is that it is a logical + representation of file system(s) available from the server. + Therefore, the pseudo file system is most likely constructed + dynamically when the server is first instantiated. It is expected + that the pseudo file system may not have an on-disk counterpart from + which persistent filehandles could be constructed. Even though it is + preferable that the server provide persistent filehandles for the + pseudo file system, the NFS client should expect that pseudo file + system filehandles are volatile. This can be confirmed by checking + the associated "fh_expire_type" attribute for those filehandles in + question. If the filehandles are volatile, the NFS client must be + prepared to recover a filehandle value (e.g., with a series of LOOKUP + operations) when receiving an error of NFS4ERR_FHEXPIRED. + + Because it is quite likely that servers will implement pseudo file + systems using volatile filehandles, clients need to be prepared for + them, rather than assuming that all filehandles will be persistent. + +7.6. Exported Root + + If the server's root file system is exported, one might conclude that + a pseudo file system is unneeded. This is not necessarily so. + Assume the following file systems on a server: + + / fs1 (exported) + /a fs2 (not exported) + /a/b fs3 (exported) + + Because fs2 is not exported, fs3 cannot be reached with simple + LOOKUPs. The server must bridge the gap with a pseudo file system. + +7.7. 
Mount Point Crossing
+
+   The server file system environment may be constructed in such a way
+   that one file system contains a directory that is 'covered' or
+   mounted upon by a second file system.  For example:
+
+      /a/b            (file system 1)
+      /a/b/c/d        (file system 2)
+
+   The pseudo file system for this server may be constructed to look
+   like:
+
+      /               (place holder/not exported)
+      /a/b            (file system 1)
+      /a/b/c/d        (file system 2)
+
+   It is the server's responsibility to present a complete pseudo file
+   system to the client.  If the client sends a LOOKUP request for the
+   path /a/b/c/d, the server's response is the filehandle of the root
+   of the file system /a/b/c/d.  In previous versions of the NFS
+   protocol, the server would respond with the filehandle of directory
+   /a/b/c/d within the file system /a/b.
+
+   The NFS client will be able to determine if it crosses a server
+   mount point by a change in the value of the "fsid" attribute.
+
+7.8.  Security Policy and Namespace Presentation
+
+   Because NFSv4 clients possess the ability to change the security
+   mechanisms used, after determining what is allowed, by using SECINFO
+   and SECINFO_NONAME, the server SHOULD NOT present a different view
+   of the namespace based on the security mechanism being used by a
+   client.  Instead, it should present a consistent view and return
+   NFS4ERR_WRONGSEC if an attempt is made to access data with an
+   inappropriate security mechanism.
+
+   If security considerations make it necessary to hide the existence
+   of a particular file system, as opposed to all of the data within
+   it, the server can apply the security policy of a shared resource in
+   the server's namespace to components of the resource's ancestors.
+   For example:
+
+      /                        (place holder/not exported)
+      /a/b                     (file system 1)
+      /a/b/MySecretProject     (file system 2)
+
+   The /a/b/MySecretProject directory is a real file system and is the
+   shared resource.  Suppose the security policy for
+   /a/b/MySecretProject is Kerberos with integrity and it is desired to
+   limit knowledge of the existence of this file system.  In this case,
+   the server should apply the same security policy to /a/b.  This
+   allows for knowledge of the existence of a file system to be secured
+   when desirable.
+
+   For the case of the use of multiple, disjoint security mechanisms in
+   the server's resources, applying that sort of policy would result in
+   the higher-level file system not being accessible using any security
+   flavor.  Therefore, that sort of configuration is not compatible
+   with hiding the existence (as opposed to the contents) from clients
+   using multiple disjoint sets of security flavors.
+
+   In other circumstances, a desirable policy is for the security of a
+   particular object in the server's namespace to include the union of
+   all security mechanisms of all direct descendants.  A common and
+   convenient practice, unless strong security requirements dictate
+   otherwise, is to make the entire pseudo file system accessible by
+   all of the valid security mechanisms.
+
+   Where there is concern about the security of data on the network,
+   clients should use strong security mechanisms to access the pseudo
+   file system in order to prevent man-in-the-middle attacks.
+
+8.  State Management
+
+   Integrating locking into the NFS protocol necessarily causes it to
+   be stateful.
With the inclusion of such features as share reservations, + file and directory delegations, recallable layouts, and support for + mandatory byte-range locking, the protocol becomes substantially more + dependent on proper management of state than the traditional + combination of NFS and NLM (Network Lock Manager) [54]. These + features include expanded locking facilities, which provide some + measure of inter-client exclusion, but the state also offers features + not readily providable using a stateless model. There are three + components to making this state manageable: + + * clear division between client and server + + * ability to reliably detect inconsistency in state between client + and server + + * simple and robust recovery mechanisms + + In this model, the server owns the state information. The client + requests changes in locks and the server responds with the changes + made. Non-client-initiated changes in locking state are infrequent. + The client receives prompt notification of such changes and can + adjust its view of the locking state to reflect the server's changes. + + Individual pieces of state created by the server and passed to the + client at its request are represented by 128-bit stateids. These + stateids may represent a particular open file, a set of byte-range + locks held by a particular owner, or a recallable delegation of + privileges to access a file in particular ways or at a particular + location. + + In all cases, there is a transition from the most general information + that represents a client as a whole to the eventual lightweight + stateid used for most client and server locking interactions. The + details of this transition will vary with the type of object but it + always starts with a client ID. + +8.1. Client and Session ID + + A client must establish a client ID (see Section 2.4) and then one or + more sessionids (see Section 2.10) before performing any operations + to open, byte-range lock, delegate, or obtain a layout for a file + object. Each session ID is associated with a specific client ID, and + thus serves as a shorthand reference to an NFSv4.1 client. + + For some types of locking interactions, the client will represent + some number of internal locking entities called "owners", which + normally correspond to processes internal to the client. For other + types of locking-related objects, such as delegations and layouts, no + such intermediate entities are provided for, and the locking-related + objects are considered to be transferred directly between the server + and a unitary client. + +8.2. Stateid Definition + + When the server grants a lock of any type (including opens, byte- + range locks, delegations, and layouts), it responds with a unique + stateid that represents a set of locks (often a single lock) for the + same file, of the same type, and sharing the same ownership + characteristics. Thus, opens of the same file by different open- + owners each have an identifying stateid. Similarly, each set of + byte-range locks on a file owned by a specific lock-owner has its own + identifying stateid. Delegations and layouts also have associated + stateids by which they may be referenced. The stateid is used as a + shorthand reference to a lock or set of locks, and given a stateid, + the server can determine the associated state-owner or state-owners + (in the case of an open-owner/lock-owner pair) and the associated + filehandle. When stateids are used, the current filehandle must be + the one associated with that stateid. 
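+
+   The following non-normative C sketch shows one way a server might
+   organize this lookup.  The stateid4 layout follows the description
+   in Section 8.2.2 below (a 32-bit seqid and a 96-bit "other" field),
+   and clientid4, nfs_fh4, and nfsstat4 are the protocol's XDR types;
+   the lock_state record, the header name, and the helpers
+   state_table_find() and fh_equal() are purely illustrative.
+
+   #include "nfs4_prot.h"   /* hypothetical XDR-generated header */
+
+   struct lock_state {
+       clientid4 clientid;   /* client holding the locks */
+       nfs_fh4   fh;         /* filehandle the stateid is tied to */
+       uint32_t  cur_seqid;  /* most recent seqid handed out */
+       /* ... lock type, owner(s), byte-ranges, etc. ... */
+   };
+
+   /* Illustrative helpers: find state by (client ID, "other" field),
+      and compare filehandles. */
+   extern struct lock_state *state_table_find(clientid4 client,
+                                              const char *other);
+   extern int fh_equal(const nfs_fh4 *a, const nfs_fh4 *b);
+
+   /* Resolve a stateid presented on a session of the given client,
+      requiring the current filehandle to be the one associated with
+      the stateid. */
+   static struct lock_state *
+   lookup_state(clientid4 client, const stateid4 *sid,
+                const nfs_fh4 *current_fh, nfsstat4 *status)
+   {
+       struct lock_state *ls = state_table_find(client, sid->other);
+
+       if (ls == NULL || !fh_equal(&ls->fh, current_fh)) {
+           *status = NFS4ERR_BAD_STATEID;
+           return NULL;
+       }
+       *status = NFS4_OK;
+       return ls;
+   }
+
+   Because the table is keyed by the client ID as well as by the
+   stateid's "other" field, the same bit pattern presented by a
+   different client resolves to entirely different state, as described
+   below.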
+ + All stateids associated with a given client ID are associated with a + common lease that represents the claim of those stateids and the + objects they represent to be maintained by the server. See + Section 8.3 for a discussion of the lease. + + The server may assign stateids independently for different clients. + A stateid with the same bit pattern for one client may designate an + entirely different set of locks for a different client. The stateid + is always interpreted with respect to the client ID associated with + the current session. Stateids apply to all sessions associated with + the given client ID, and the client may use a stateid obtained from + one session on another session associated with the same client ID. + +8.2.1. Stateid Types + + With the exception of special stateids (see Section 8.2.3), each + stateid represents locking objects of one of a set of types defined + by the NFSv4.1 protocol. Note that in all these cases, where we + speak of guarantee, it is understood there are situations such as a + client restart, or lock revocation, that allow the guarantee to be + voided. + + * Stateids may represent opens of files. + + Each stateid in this case represents the OPEN state for a given + client ID/open-owner/filehandle triple. Such stateids are subject + to change (with consequent incrementing of the stateid's seqid) in + response to OPENs that result in upgrade and OPEN_DOWNGRADE + operations. + + * Stateids may represent sets of byte-range locks. + + All locks held on a particular file by a particular owner and + gotten under the aegis of a particular open file are associated + with a single stateid with the seqid being incremented whenever + LOCK and LOCKU operations affect that set of locks. + + * Stateids may represent file delegations, which are recallable + guarantees by the server to the client that other clients will not + reference or modify a particular file, until the delegation is + returned. In NFSv4.1, file delegations may be obtained on both + regular and non-regular files. + + A stateid represents a single delegation held by a client for a + particular filehandle. + + * Stateids may represent directory delegations, which are recallable + guarantees by the server to the client that other clients will not + modify the directory, until the delegation is returned. + + A stateid represents a single delegation held by a client for a + particular directory filehandle. + + * Stateids may represent layouts, which are recallable guarantees by + the server to the client that particular files may be accessed via + an alternate data access protocol at specific locations. Such + access is limited to particular sets of byte-ranges and may + proceed until those byte-ranges are reduced or the layout is + returned. + + A stateid represents the set of all layouts held by a particular + client for a particular filehandle with a given layout type. The + seqid is updated as the layouts of that set of byte-ranges change, + via layout stateid changing operations such as LAYOUTGET and + LAYOUTRETURN. + +8.2.2. Stateid Structure + + Stateids are divided into two fields, a 96-bit "other" field + identifying the specific set of locks and a 32-bit "seqid" sequence + value. Except in the case of special stateids (see Section 8.2.3), a + particular value of the "other" field denotes a set of locks of the + same type (for example, byte-range locks, opens, delegations, or + layouts), for a specific file or directory, and sharing the same + ownership characteristics. 
The seqid designates a specific instance
   of such a set of locks, and is incremented to indicate changes in
   such a set of locks, either by the addition or deletion of locks
   from the set, a change in the byte-range they apply to, or an
   upgrade or downgrade in the type of one or more locks.

   When such a set of locks is first created, the server returns a
   stateid with a seqid value of one.  On subsequent operations that
   modify the set of locks, the server is required to increment the
   "seqid" field by one whenever it returns a stateid for the same
   state-owner/file/type combination and there is some change in the
   set of locks actually designated.  In this case, the server will
   return a stateid with an "other" field the same as previously used
   for that state-owner/file/type combination, with an incremented
   "seqid" field.  This pattern continues until the seqid is
   incremented past NFS4_UINT32_MAX, and one (not zero) is the next
   seqid value.

   The purpose of the incrementing of the seqid is to allow the server
   to communicate to the client the order in which operations that
   modified locking state associated with a stateid have been processed
   and to make it possible for the client to send requests that are
   conditional on the set of locks not having changed since the stateid
   in question was returned.

   Except for layout stateids (Section 12.5.3), when a client sends a
   stateid to the server, it has two choices with regard to the seqid
   sent.  It may set the seqid to zero to indicate to the server that
   it wishes the most up-to-date seqid for that stateid's "other" field
   to be used.  This would be the common choice in the case of a
   stateid sent with a READ or WRITE operation.  It also may set a
   non-zero value, in which case the server checks if that seqid is the
   correct one.  In that case, the server is required to return
   NFS4ERR_OLD_STATEID if the seqid is lower than the most current
   value and NFS4ERR_BAD_STATEID if the seqid is greater than the most
   current value.  This would be the common choice in the case of
   stateids sent with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be
   sent in parallel for the same owner, a client might close a file
   without knowing that an OPEN upgrade had been done by the server,
   changing the lock in question.  If CLOSE were sent with a zero
   seqid, the OPEN upgrade would be cancelled before the client even
   received an indication that an upgrade had happened.

   When a stateid is sent by the server to the client as part of a
   callback operation, it is not subject to checking for a current
   seqid and returning NFS4ERR_OLD_STATEID.  This is because the client
   is not in a position to know the most up-to-date seqid and thus
   cannot verify it.  Unless specially noted, the seqid value for a
   stateid sent by the server to the client as part of a callback is
   required to be zero with NFS4ERR_BAD_STATEID returned if it is not.

   In making comparisons between seqids, both by the client in
   determining the order of operations and by the server in determining
   whether the NFS4ERR_OLD_STATEID is to be returned, the possibility
   of the seqid having wrapped around past the NFS4_UINT32_MAX value
   needs to be taken into account.  When two seqid values are being
   compared, the total count of slots for all sessions associated with
   the current client is used to do this.
When one seqid value is less than this
   total slot count and another seqid value is greater than
   NFS4_UINT32_MAX minus the total slot count, the former is to be
   treated as higher than the latter, despite the fact that it is
   numerically lower.

8.2.3.  Special Stateids

   Stateid values whose "other" field is either all zeros or all ones
   are reserved.  They may not be assigned by the server but have
   special meanings defined by the protocol.  The particular meaning
   depends on whether the "other" field is all zeros or all ones and
   the specific value of the "seqid" field.

   The following combinations of "other" and "seqid" are defined in
   NFSv4.1:

   *  When "other" and "seqid" are both zero, the stateid is treated as
      a special anonymous stateid, which can be used in READ, WRITE,
      and SETATTR requests to indicate the absence of any OPEN state
      associated with the request.  When an anonymous stateid value is
      used and an existing open denies the form of access requested,
      then access will be denied to the request.  This stateid MUST NOT
      be used on operations to data servers (Section 13.6).

   *  When "other" and "seqid" are both all ones, the stateid is a
      special READ bypass stateid.  When this value is used in WRITE or
      SETATTR, it is treated like the anonymous value.  When used in
      READ, the server MAY grant access, even if access would normally
      be denied to READ operations.  This stateid MUST NOT be used on
      operations to data servers.

   *  When "other" is zero and "seqid" is one, the stateid represents
      the current stateid, which is whatever value is the last stateid
      returned by an operation within the COMPOUND.  In the case of an
      OPEN, the stateid returned for the open file, and not the
      delegation, is used.  The stateid passed to the operation in
      place of the special value has its "seqid" value set to zero,
      except when the current stateid is used by the operation CLOSE or
      OPEN_DOWNGRADE.  If there is no operation in the COMPOUND that
      has returned a stateid value, the server MUST return the error
      NFS4ERR_BAD_STATEID.  As illustrated in Figure 6, if the value of
      a current stateid is a special stateid and the stateid of an
      operation's arguments has "other" set to zero and "seqid" set to
      one, then the server MUST return the error NFS4ERR_BAD_STATEID.

   *  When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid
      represents a reserved stateid value defined to be invalid.  When
      this stateid is used, the server MUST return the error
      NFS4ERR_BAD_STATEID.

   If a stateid value is used that has all zeros or all ones in the
   "other" field but does not match one of the cases above, the server
   MUST return the error NFS4ERR_BAD_STATEID.

   Special stateids, unlike other stateids, are not associated with
   individual client IDs or filehandles and can be used with all valid
   client IDs and filehandles.  In the case of a special stateid
   designating the current stateid, the current stateid value
   substituted for the special stateid is associated with a particular
   client ID and filehandle, and so, if it is used where the current
   filehandle does not match that associated with the current stateid,
   the operation to which the stateid is passed will return
   NFS4ERR_BAD_STATEID.

8.2.4.  Stateid Lifetime and Validation

   Stateids must remain valid until either a client restart or a server
   restart or until the client returns all of the locks associated with
   the stateid by means of an operation such as CLOSE or DELEGRETURN.
+ If the locks are lost due to revocation, as long as the client ID is + valid, the stateid remains a valid designation of that revoked state + until the client frees it by using FREE_STATEID. Stateids associated + with byte-range locks are an exception. They remain valid even if a + LOCKU frees all remaining locks, so long as the open file with which + they are associated remains open, unless the client frees the + stateids via the FREE_STATEID operation. + + It should be noted that there are situations in which the client's + locks become invalid, without the client requesting they be returned. + These include lease expiration and a number of forms of lock + revocation within the lease period. It is important to note that in + these situations, the stateid remains valid and the client can use it + to determine the disposition of the associated lost locks. + + An "other" value must never be reused for a different purpose (i.e., + different filehandle, owner, or type of locks) within the context of + a single client ID. A server may retain the "other" value for the + same purpose beyond the point where it may otherwise be freed, but if + it does so, it must maintain "seqid" continuity with previous values. + + One mechanism that may be used to satisfy the requirement that the + server recognize invalid and out-of-date stateids is for the server + to divide the "other" field of the stateid into two fields. + + * an index into a table of locking-state structures. + + * a generation number that is incremented on each allocation of a + table entry for a particular use. + + And then store in each table entry, + + * the client ID with which the stateid is associated. + + * the current generation number for the (at most one) valid stateid + sharing this index value. + + * the filehandle of the file on which the locks are taken. + + * an indication of the type of stateid (open, byte-range lock, file + delegation, directory delegation, layout). + + * the last "seqid" value returned corresponding to the current + "other" value. + + * an indication of the current status of the locks associated with + this stateid, in particular, whether these have been revoked and + if so, for what reason. + + With this information, an incoming stateid can be validated and the + appropriate error returned when necessary. Special and non-special + stateids are handled separately. (See Section 8.2.3 for a discussion + of special stateids.) + + Note that stateids are implicitly qualified by the current client ID, + as derived from the client ID associated with the current session. + Note, however, that the semantics of the session will prevent + stateids associated with a previous client or server instance from + being analyzed by this procedure. + + If server restart has resulted in an invalid client ID or a session + ID that is invalid, SEQUENCE will return an error and the operation + that takes a stateid as an argument will never be processed. + + If there has been a server restart where there is a persistent + session and all leased state has been lost, then the session in + question will, although valid, be marked as dead, and any operation + not satisfied by means of the reply cache will receive the error + NFS4ERR_DEADSESSION, and thus not be processed as indicated below. 
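
   The following C sketch shows one possible realization of this
   table-based scheme; all names are illustrative and not part of the
   protocol.  The validation steps enumerated below then reduce to a
   handful of table lookups and comparisons, with seqid comparison
   allowing for wraparound as described in Section 8.2.2:

      #include <stdint.h>
      #include <string.h>

      #define NFS4_OTHER_SIZE 12

      enum sid_kind   { SID_OPEN, SID_LOCK, SID_FILE_DELEG,
                        SID_DIR_DELEG, SID_LAYOUT };
      enum sid_status { SID_VALID, SID_EXPIRED, SID_ADMIN_REVOKED,
                        SID_DELEG_REVOKED };

      struct fh4 { uint32_t len; uint8_t data[128]; };

      /* One entry per allocated "other" value. */
      struct state_entry {
          uint64_t        clientid;    /* client ID the stateid
                                          belongs to */
          uint32_t        generation;  /* generation of the one valid
                                          stateid using this index */
          struct fh4      fh;          /* filehandle the locks are
                                          taken on */
          enum sid_kind   kind;        /* open/lock/delegation/layout */
          uint32_t        last_seqid;  /* last "seqid" returned for
                                          this "other" value */
          enum sid_status status;      /* whether, and why, revoked */
      };

      /* Split the 96-bit "other" field into a 64-bit table index and
       * a 32-bit generation number. */
      static void other_unpack(const uint8_t other[NFS4_OTHER_SIZE],
                               uint64_t *index, uint32_t *generation)
      {
          memcpy(index, other, sizeof *index);
          memcpy(generation, other + sizeof *index,
                 sizeof *generation);
      }

      /* Seqid comparison allowing for wraparound past
       * NFS4_UINT32_MAX (Section 8.2.2): a seqid below the client's
       * total slot count is treated as higher than one within
       * slot-count range of the maximum. */
      static int seqid_newer(uint32_t a, uint32_t b,
                             uint32_t total_slots)
      {
          if (a < total_slots && b > UINT32_MAX - total_slots)
              return 1;
          if (b < total_slots && a > UINT32_MAX - total_slots)
              return 0;
          return a > b;
      }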
+ + When a stateid is being tested and the "other" field is all zeros or + all ones, a check that the "other" and "seqid" fields match a defined + combination for a special stateid is done and the results determined + as follows: + + * If the "other" and "seqid" fields do not match a defined + combination associated with a special stateid, the error + NFS4ERR_BAD_STATEID is returned. + + * If the special stateid is one designating the current stateid and + there is a current stateid, then the current stateid is + substituted for the special stateid and the checks appropriate to + non-special stateids are performed. + + * If the combination is valid in general but is not appropriate to + the context in which the stateid is used (e.g., an all-zero + stateid is used when an OPEN stateid is required in a LOCK + operation), the error NFS4ERR_BAD_STATEID is also returned. + + * Otherwise, the check is completed and the special stateid is + accepted as valid. + + When a stateid is being tested, and the "other" field is neither all + zeros nor all ones, the following procedure could be used to validate + an incoming stateid and return an appropriate error, when necessary, + assuming that the "other" field would be divided into a table index + and an entry generation. + + * If the table index field is outside the range of the associated + table, return NFS4ERR_BAD_STATEID. + + * If the selected table entry is of a different generation than that + specified in the incoming stateid, return NFS4ERR_BAD_STATEID. + + * If the selected table entry does not match the current filehandle, + return NFS4ERR_BAD_STATEID. + + * If the client ID in the table entry does not match the client ID + associated with the current session, return NFS4ERR_BAD_STATEID. + + * If the stateid represents revoked state, then return + NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED, + as appropriate. + + * If the stateid type is not valid for the context in which the + stateid appears, return NFS4ERR_BAD_STATEID. Note that a stateid + may be valid in general, as would be reported by the TEST_STATEID + operation, but be invalid for a particular operation, as, for + example, when a stateid that doesn't represent byte-range locks is + passed to the non-from_open case of LOCK or to LOCKU, or when a + stateid that does not represent an open is passed to CLOSE or + OPEN_DOWNGRADE. In such cases, the server MUST return + NFS4ERR_BAD_STATEID. + + * If the "seqid" field is not zero and it is greater than the + current sequence value corresponding to the current "other" field, + return NFS4ERR_BAD_STATEID. + + * If the "seqid" field is not zero and it is less than the current + sequence value corresponding to the current "other" field, return + NFS4ERR_OLD_STATEID. + + * Otherwise, the stateid is valid and the table entry should contain + any additional information about the type of stateid and + information associated with that particular type of stateid, such + as the associated set of locks, e.g., open-owner and lock-owner + information, as well as information on the specific locks, e.g., + open modes and byte-ranges. + +8.2.5. Stateid Use for I/O Operations + + Clients performing I/O operations need to select an appropriate + stateid based on the locks (including opens and delegations) held by + the client and the various types of state-owners sending the I/O + requests. SETATTR operations that change the file size are treated + like I/O operations in this regard. 
   The following rules, applied in order of decreasing priority, govern
   the selection of the appropriate stateid.  In following these rules,
   the client will only consider locks of which it has actually
   received notification by an appropriate operation response or
   callback.  Note that the rules are slightly different in the case of
   I/O to data servers when file layouts are being used (see
   Section 13.9.1).

   *  If the client holds a delegation for the file in question, the
      delegation stateid SHOULD be used.

   *  Otherwise, if the entity corresponding to the lock-owner (e.g., a
      process) sending the I/O has a byte-range lock stateid for the
      associated open file, then the byte-range lock stateid for that
      lock-owner and open file SHOULD be used.

   *  If there is no byte-range lock stateid, then the OPEN stateid for
      the open file in question SHOULD be used.

   *  Finally, if none of the above apply, then a special stateid
      SHOULD be used.

   Ignoring these rules may result in situations in which the server
   does not have information necessary to properly process the request.
   For example, when mandatory byte-range locks are in effect, if the
   stateid does not indicate the proper lock-owner, via a lock stateid,
   a request might be avoidably rejected.

   The server, however, should not try to enforce these ordering rules
   and should use whatever information is available to properly process
   I/O requests.  In particular, when a client has a delegation for a
   given file, it SHOULD take note of this fact in processing a
   request, even if it is sent with a special stateid.

8.2.6.  Stateid Use for SETATTR Operations

   Because each operation is associated with a session ID and from that
   the client ID can be determined, operations do not need to include a
   stateid for the server to be able to determine whether they should
   cause a delegation to be recalled or are to be treated as done
   within the scope of the delegation.

   In the case of SETATTR operations, a stateid is present.  In cases
   other than those that set the file size, the client may send either
   a special stateid or, when a delegation is held for the file in
   question, a delegation stateid.  While the server SHOULD validate
   the stateid and may use the stateid to optimize the determination as
   to whether a delegation is held, it SHOULD note the presence of a
   delegation even when a special stateid is sent, and MUST accept a
   valid delegation stateid when sent.

8.3.  Lease Renewal

   Each client/server pair, as represented by a client ID, has a single
   lease.  The purpose of the lease is to allow the client to indicate
   to the server, in a low-overhead way, that it is active, and thus
   that the server is to retain the client's locks.  This arrangement
   allows the server to remove stale locking-related objects that are
   held by a client that has crashed or is otherwise unreachable, once
   the relevant lease expires.  This in turn allows other clients to
   obtain conflicting locks without being delayed indefinitely by
   inactive or unreachable clients.  It is not a mechanism for cache
   consistency, and lease renewals may not be denied if the lease
   interval has not expired.

   Since each session is associated with a specific client (identified
   by the client's client ID), any operation sent on that session is an
   indication that the associated client is reachable.
When a request + is sent for a given session, successful execution of a SEQUENCE + operation (or successful retrieval of the result of SEQUENCE from the + reply cache) on an unexpired lease will result in the lease being + implicitly renewed, for the standard renewal period (equal to the + lease_time attribute). + + If the client ID's lease has not expired when the server receives a + SEQUENCE operation, then the server MUST renew the lease. If the + client ID's lease has expired when the server receives a SEQUENCE + operation, the server MAY renew the lease; this depends on whether + any state was revoked as a result of the client's failure to renew + the lease before expiration. + + Absent other activity that would renew the lease, a COMPOUND + consisting of a single SEQUENCE operation will suffice. The client + should also take communication-related delays into account and take + steps to ensure that the renewal messages actually reach the server + in good time. For example: + + * When trunking is in effect, the client should consider sending + multiple requests on different connections, in order to ensure + that renewal occurs, even in the event of blockage in the path + used for one of those connections. + + * Transport retransmission delays might become so large as to + approach or exceed the length of the lease period. This may be + particularly likely when the server is unresponsive due to a + restart; see Section 8.4.2.1. If the client implementation is not + careful, transport retransmission delays can result in the client + failing to detect a server restart before the grace period ends. + The scenario is that the client is using a transport with + exponential backoff, such that the maximum retransmission timeout + exceeds both the grace period and the lease_time attribute. A + network partition causes the client's connection's retransmission + interval to back off, and even after the partition heals, the next + transport-level retransmission is sent after the server has + restarted and its grace period ends. + + The client MUST either recover from the ensuing NFS4ERR_NO_GRACE + errors or it MUST ensure that, despite transport-level + retransmission intervals that exceed the lease_time, a SEQUENCE + operation is sent that renews the lease before expiration. The + client can achieve this by associating a new connection with the + session, and sending a SEQUENCE operation on it. However, if the + attempt to establish a new connection is delayed for some reason + (e.g., exponential backoff of the connection establishment + packets), the client will have to abort the connection + establishment attempt before the lease expires, and attempt to + reconnect. + + If the server renews the lease upon receiving a SEQUENCE operation, + the server MUST NOT allow the lease to expire while the rest of the + operations in the COMPOUND procedure's request are still executing. + Once the last operation has finished, and the response to COMPOUND + has been sent, the server MUST set the lease to expire no sooner than + the sum of current time and the value of the lease_time attribute. + + A client ID's lease can expire when it has been at least the lease + interval (lease_time) since the last lease-renewing SEQUENCE + operation was sent on any of the client ID's sessions and there are + no active COMPOUND operations on any such sessions. 
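
   Absent other traffic, a client typically runs a renewal timer per
   client ID.  The following is a minimal sketch, assuming a
   hypothetical nfs4_send_sequence() helper that sends a COMPOUND
   consisting of a single SEQUENCE operation; renewing at half the
   lease interval is a common client heuristic that leaves headroom for
   transmission and scheduling delays, not a protocol requirement:

      #include <time.h>
      #include <unistd.h>

      extern int nfs4_send_sequence(void *session);  /* hypothetical:
                                     sends COMPOUND { SEQUENCE } */

      /* Keep the lease alive when no other traffic is doing so. */
      void lease_renewal_loop(void *session, time_t lease_time,
                              time_t *last_renewal,
                              volatile int *running)
      {
          while (*running) {
              time_t now = time(NULL);
              if (now - *last_renewal >= lease_time / 2) {
                  if (nfs4_send_sequence(session) == 0)
                      *last_renewal = now;  /* implicitly renewed */
              }
              sleep(1);
          }
      }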
   Because the SEQUENCE operation is the basic mechanism to renew a
   lease, and because it must be done at least once for each lease
   period, it is the natural mechanism whereby the server will inform
   the client of changes in the lease status that the client needs to
   be informed of.  The client should inspect the status flags
   (sr_status_flags) returned by SEQUENCE and take the appropriate
   action (see Section 18.46.3 for details).

   *  The status bits SEQ4_STATUS_CB_PATH_DOWN and
      SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the
      backchannel that the client may need to address in order to
      receive callback requests.

   *  The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and
      SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS
      contexts or RPCSEC_GSS handles for the backchannel that the
      client might have to address in order to allow callback requests
      to be sent.

   *  The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
      SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED,
      SEQ4_STATUS_ADMIN_STATE_REVOKED, and
      SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock
      revocation events.  When these bits are set, the client should
      use TEST_STATEID to find what stateids have been revoked and use
      FREE_STATEID to acknowledge loss of the associated state.

   *  The status bit SEQ4_STATUS_LEASE_MOVED indicates that
      responsibility for lease renewal has been transferred to one or
      more new servers.

   *  The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that
      due to server restart the client must reclaim locking state.

   *  The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates that the
      server has encountered an unrecoverable fault with the
      backchannel (e.g., it has lost track of a sequence ID for a slot
      in the backchannel).

8.4.  Crash Recovery

   A critical requirement in crash recovery is that both the client and
   the server know when the other has failed.  Additionally, it is
   required that a client see a consistent view of data across server
   restarts.  All READ and WRITE operations that may have been queued
   within the client or network buffers must wait until the client has
   successfully recovered the locks protecting the READ and WRITE
   operations.  Any that reach the server before the server can safely
   determine that the client has recovered enough locking state to be
   sure that such operations can be safely processed must be rejected.
   This will happen because either:

   *  The state presented is no longer valid since it is associated
      with a now invalid client ID.  In this case, the client will
      receive either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION
      error, and any attempt to attach a new session to that invalid
      client ID will result in an NFS4ERR_STALE_CLIENTID error.

   *  Subsequent recovery of locks may make execution of the operation
      inappropriate (NFS4ERR_GRACE).

8.4.1.  Client Failure and Recovery

   In the event that a client fails, the server may release the
   client's locks when the associated lease has expired.  Conflicting
   locks from another client may only be granted after this lease
   expiration.  As discussed in Section 8.3, when a client has not
   failed and re-establishes its lease before expiration occurs,
   requests for conflicting locks will not be granted.

   To minimize client delay upon restart, lock requests are associated
   with an instance of the client by a client-supplied verifier.
This
   verifier is part of the client_owner4 sent in the initial
   EXCHANGE_ID call made by the client.  The server returns a client ID
   as a result of the EXCHANGE_ID operation.  The client then confirms
   the use of the client ID by establishing a session associated with
   that client ID (see Section 18.36.3 for a description of how this is
   done).  All locks, including opens, byte-range locks, delegations,
   and layouts obtained by sessions using that client ID, are
   associated with that client ID.

   Since the verifier will be changed by the client upon each
   initialization, the server can compare a new verifier to the
   verifier associated with currently held locks and determine that
   they do not match.  This signifies the client's new instantiation
   and subsequent loss (upon confirmation of the new client ID) of
   locking state.  As a result, the server is free to release all locks
   held that are associated with the old client ID that was derived
   from the old verifier.  At this point, conflicting locks from other
   clients, kept waiting while the lease had not yet expired, can be
   granted.  In addition, all stateids associated with the old client
   ID can also be freed, as they are no longer referenceable.

   Note that the verifier must have the same uniqueness properties as
   the verifier for the COMMIT operation.

8.4.2.  Server Failure and Recovery

   If the server loses locking state (usually as a result of a
   restart), it must allow clients time to discover this fact and
   re-establish the lost locking state.  The client must be able to
   re-establish the locking state without having the server deny valid
   requests because the server has granted conflicting access to
   another client.  Likewise, the server must take care when there is a
   possibility that clients have not yet re-established their locking
   state for a file and that such locking state might make it invalid
   to perform READ or WRITE operations.  For example, if mandatory
   locks are a possibility, the server must disallow READ and WRITE
   operations for that file.

   A client can determine that loss of locking state has occurred via
   several methods.

   1.  When a SEQUENCE (most common) or other operation returns
       NFS4ERR_BADSESSION, this may mean that the session has been
       destroyed but the client ID is still valid.  The client sends a
       CREATE_SESSION request with the client ID to re-establish the
       session.  If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID,
       the client must establish a new client ID (see Section 8.1) and
       re-establish its lock state with the new client ID, after the
       CREATE_SESSION operation succeeds (see Section 8.4.2.1).

   2.  When a SEQUENCE (most common) or other operation on a persistent
       session returns NFS4ERR_DEADSESSION, this indicates that a
       session is no longer usable for new operations, i.e., those not
       satisfied from the reply cache.  Once all pending operations are
       determined to be either performed before the retry or not
       performed, the client sends a CREATE_SESSION request with the
       client ID to re-establish the session.  If CREATE_SESSION fails
       with NFS4ERR_STALE_CLIENTID, the client must establish a new
       client ID (see Section 8.1) and re-establish its lock state
       after the CREATE_SESSION, with the new client ID, succeeds
       (Section 8.4.2.1).

   3.
When an operation that is neither SEQUENCE nor preceded by
       SEQUENCE (for example, CREATE_SESSION or DESTROY_SESSION)
       returns NFS4ERR_STALE_CLIENTID, the client MUST establish a new
       client ID (Section 8.1) and re-establish its lock state
       (Section 8.4.2.1).

8.4.2.1.  State Reclaim

   When state information and the associated locks are lost as a result
   of a server restart, the protocol must provide a way to cause that
   state to be re-established.  The approach used is to define, for
   most types of locking state (layouts are an exception), a request
   whose function is to allow the client to re-establish on the server
   a lock first obtained from a previous instance.  Generally, these
   requests are variants of the requests normally used to create locks
   of that type and are referred to as "reclaim-type" requests, and the
   process of re-establishing such locks is referred to as "reclaiming"
   them.

   Because each client must have an opportunity to reclaim all of the
   locks that it has without the possibility that some other client
   will be granted a conflicting lock, a "grace period" is devoted to
   the reclaim process.  During this period, requests creating client
   IDs and sessions are handled normally, but locking requests are
   subject to special restrictions.  Only reclaim-type locking requests
   are allowed, unless the server can reliably determine (through state
   persistently maintained across restart instances) that granting any
   such lock cannot possibly conflict with a subsequent reclaim.  When
   a request is made to obtain a new lock (i.e., not a reclaim-type
   request) during the grace period and such a determination cannot be
   made, the server must return the error NFS4ERR_GRACE.

   Once a session is established using the new client ID, the client
   will use reclaim-type locking requests (e.g., LOCK operations with
   reclaim set to TRUE and OPEN operations with a claim type of
   CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state.
   Once this is done, or if there is no such locking state to reclaim,
   the client sends a global RECLAIM_COMPLETE operation, i.e., one with
   the rca_one_fs argument set to FALSE, to indicate that it has
   reclaimed all of the locking state that it will reclaim.  Once a
   client sends such a RECLAIM_COMPLETE operation, it may attempt
   non-reclaim locking operations, although it might get an
   NFS4ERR_GRACE status result from each such operation until the
   period of special handling is over.  See Section 11.11.9 for a
   discussion of the analogous handling of lock reclamation in the case
   of file systems transitioning from server to server.

   During the grace period, the server must reject READ and WRITE
   operations and non-reclaim locking requests (i.e., other LOCK and
   OPEN operations) with an error of NFS4ERR_GRACE, unless it can
   guarantee that these may be done safely, as described below.

   The grace period may last until all clients that are known to
   possibly have had locks have done a global RECLAIM_COMPLETE
   operation, indicating that they have finished reclaiming the locks
   they held before the server restart.  This means that a client that
   has done a RECLAIM_COMPLETE must be prepared to receive an
   NFS4ERR_GRACE when attempting to acquire new locks.  In order for
   the server to know that all clients with possible prior lock state
   have done a RECLAIM_COMPLETE, the server must maintain in stable
   storage a list of clients that may have such locks.
The server may
   also terminate the grace period before all clients have done a
   global RECLAIM_COMPLETE.  The server SHOULD NOT terminate the grace
   period before a time equal to the lease period in order to give
   clients an opportunity to find out about the server restart, as a
   result of sending requests on associated sessions with a frequency
   governed by the lease time.  Note that when a client does not send
   such requests (or they are sent by the client but not received by
   the server), it is possible for the grace period to expire before
   the client finds out that the server restart has occurred.

   Some additional time may be added to the lease time in order to
   allow a client to establish a new client ID and session and to
   effect lock reclaims.  Note that analogous rules apply to file
   system-specific grace periods discussed in Section 11.11.9.

   If the server can reliably determine that granting a non-reclaim
   request will not conflict with reclamation of locks by other
   clients, the NFS4ERR_GRACE error does not have to be returned even
   within the grace period, although NFS4ERR_GRACE must always be
   returned to clients attempting a non-reclaim lock request before
   doing their own global RECLAIM_COMPLETE.  For the server to be able
   to service READ and WRITE operations during the grace period, it
   must again be able to guarantee that no possible conflict could
   arise between a potential reclaim locking request and the READ or
   WRITE operation.  If the server is unable to offer that guarantee,
   the NFS4ERR_GRACE error must be returned to the client.

   For a server to provide simple, valid handling during the grace
   period, the easiest method is to simply reject all non-reclaim
   locking requests and READ and WRITE operations by returning the
   NFS4ERR_GRACE error.  However, a server may keep information about
   granted locks in stable storage.  With this information, the server
   could determine if a locking, READ, or WRITE operation can be safely
   processed.

   For example, if the server maintained on stable storage summary
   information on whether mandatory locks exist, either mandatory
   byte-range locks or share reservations specifying deny modes, many
   requests could be allowed during the grace period.  If it is known
   that no such share reservations exist, OPEN requests that do not
   specify deny modes may be safely granted.  If, in addition, it is
   known that no mandatory byte-range locks exist, either through
   information stored on stable storage or simply because the server
   does not support such locks, READ and WRITE operations may be safely
   processed during the grace period.  Another important case is where
   it is known that no mandatory byte-range locks exist, either because
   the server does not provide support for them or because their
   absence is known from persistently recorded data.  In this case,
   READ and WRITE operations specifying stateids derived from
   reclaim-type operations may be validly processed during the grace
   period because the valid reclaim ensures that no lock subsequently
   granted can prevent the I/O.

   To reiterate, for a server that allows non-reclaim lock and I/O
   requests to be processed during the grace period, it MUST determine
   that no lock subsequently reclaimed will be rejected and that no
   lock subsequently reclaimed would have prevented any I/O operation
   processed during the grace period.
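
   A server's grace-period gate can therefore be summarized as in the
   following sketch.  The predicate names are hypothetical;
   may_conflict_with_reclaim() stands in for whatever stable-storage-
   backed check a particular server uses, and a server without such
   records would simply treat it as always true:

      #include <stdbool.h>
      #include <stdint.h>

      #define NFS4_OK        0
      #define NFS4ERR_GRACE  10013  /* from the protocol error list */

      extern bool in_grace_period(void);               /* hypothetical */
      extern bool client_sent_reclaim_complete(uint64_t clientid);
      extern bool may_conflict_with_reclaim(const void *req);

      /* Gate a non-reclaim lock or I/O request during grace. */
      int grace_gate(uint64_t clientid, const void *req,
                     bool is_reclaim)
      {
          if (!in_grace_period())
              return NFS4_OK;       /* normal processing */
          if (is_reclaim)
              return NFS4_OK;       /* reclaim-type requests proceed */
          /* A client that has not yet sent its own global
           * RECLAIM_COMPLETE always gets NFS4ERR_GRACE for
           * non-reclaim lock requests. */
          if (!client_sent_reclaim_complete(clientid))
              return NFS4ERR_GRACE;
          /* Otherwise, allow the request only if it is guaranteed
           * not to conflict with any lock that might still be
           * reclaimed. */
          return may_conflict_with_reclaim(req) ? NFS4ERR_GRACE
                                                : NFS4_OK;
      }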
   Clients should be prepared for the return of NFS4ERR_GRACE errors
   for non-reclaim lock and I/O requests.  In this case, the client
   should employ a retry mechanism for the request.  A delay (on the
   order of several seconds) between retries should be used to avoid
   overwhelming the server.  Further discussion of the general issue is
   included in [55].  The client must account for servers that can
   perform I/O and non-reclaim locking requests within the grace period
   as well as those that cannot do so.

   A reclaim-type locking request outside the server's grace period can
   only succeed if the server can guarantee that no conflicting lock or
   I/O request has been granted since restart.

   A server may, upon restart, establish a new value for the lease
   period.  Therefore, clients should, once a new client ID is
   established, refetch the lease_time attribute and use it as the
   basis for lease renewal for the lease associated with that server.
   However, the server must establish, for this restart event, a grace
   period at least as long as the lease period for the previous server
   instantiation.  This allows the client state obtained during the
   previous server instance to be reliably re-established.

   The possibility exists that, because of server configuration events,
   the client will be communicating with a server different from the
   one on which the locks were obtained, as shown by the combination of
   eir_server_scope and eir_server_owner.  This leads to the issue of
   whether and when the client should attempt to reclaim locks
   previously obtained on what is being reported as a different server.
   The rules to resolve this question are as follows:

   *  If the server scope is different, the client should not attempt
      to reclaim locks.  In this situation, no lock reclaim is
      possible.  Any attempt to re-obtain the locks with non-reclaim
      operations is problematic since there is no guarantee that the
      existing filehandles will be recognized by the new server, or
      that if recognized, they denote the same objects.  It is best to
      treat the locks as having been revoked by the reconfiguration
      event.

   *  If the server scope is the same, the client should attempt to
      reclaim locks, even if the eir_server_owner value is different.
      In this situation, it is the responsibility of the server to
      return NFS4ERR_NO_GRACE if it cannot provide correct support for
      lock reclaim operations, including the prevention of edge
      conditions.

   The eir_server_owner field is not used in making this determination.
   Its function is to specify trunking possibilities for the client
   (see Section 2.10.5) and not to control lock reclaim.

8.4.2.1.1.  Security Considerations for State Reclaim

   During the grace period, a client can reclaim state that it believes
   or asserts it had before the server restarted.  Unless the server
   maintained a complete record of all the state the client had, the
   server has little choice but to trust the client.  (Of course, if
   the server maintained a complete record, then it would not have to
   force the client to reclaim state after server restart.)  While the
   server has to trust the client to tell the truth, the negative
   consequences for security are limited to enabling denial-of-service
   attacks in situations in which AUTH_SYS is supported.
The
   fundamental rule for the server when processing reclaim requests is
   that it MUST NOT grant the reclaim if an equivalent non-reclaim
   request would not be granted during steady state due to access
   control or access conflict issues.  For example, an OPEN request
   during a reclaim will be refused with NFS4ERR_ACCESS if the
   principal making the request does not have access to open the file
   according to the discretionary ACL (Section 6.2.2) on the file.

   Nonetheless, it is possible that a client operating in error or
   maliciously could, during reclaim, prevent another client from
   reclaiming access to state.  For example, an attacker could send an
   OPEN reclaim operation with a deny mode that prevents another client
   from reclaiming the OPEN state it had before the server restarted.
   The attacker could perform the same denial of service during steady
   state prior to server restart, as long as the attacker had
   permissions.  Given that the attack vectors are equivalent, the
   grace period does not offer any additional opportunity for denial of
   service, and any concerns about this attack vector, whether during
   grace or steady state, are addressed the same way: use RPCSEC_GSS
   for authentication and limit access to the file only to principals
   that the owner of the file trusts.

   Note that if prior to restart the server had client IDs with the
   EXCHGID4_FLAG_BIND_PRINC_STATEID (Section 18.35) capability set,
   then the server SHOULD record in stable storage the client owner and
   the principal that established the client ID via EXCHANGE_ID.  If
   the server does not, then there is a risk that a client will be
   unable to reclaim state if it does not have a credential for a
   principal that was originally authorized to establish the state.

8.4.3.  Network Partitions and Recovery

   If the duration of a network partition is greater than the lease
   period provided by the server, the server will not have received a
   lease renewal from the client.  If this occurs, the server may free
   all locks held for the client or it may allow the lock state to
   remain for a considerable period, subject to the constraint that if
   a request for a conflicting lock is made, locks associated with an
   expired lease do not prevent such a conflicting lock from being
   granted but MUST be revoked as necessary so as to avoid interfering
   with such conflicting requests.

   If the server chooses to delay freeing of lock state until there is
   a conflict, it may either free all of the client's locks once there
   is a conflict or it may only revoke the minimum set of locks
   necessary to allow conflicting requests.  When it adopts the
   finer-grained approach, it must revoke all locks associated with a
   given stateid, even if the conflict is with only a subset of locks.

   When the server chooses to free all of a client's lock state, either
   immediately upon lease expiration or as a result of the first
   attempt to obtain a conflicting lock, the server may report the loss
   of lock state in a number of ways.

   The server may choose to invalidate the session and the associated
   client ID.  In this case, once the client can communicate with the
   server, it will receive an NFS4ERR_BADSESSION error.  Upon
   attempting to create a new session, it would get an
   NFS4ERR_STALE_CLIENTID.  Upon creating the new client ID and new
   session, the client will attempt to reclaim locks.
Normally, the
   server will not allow the client to reclaim locks, because the
   server will not be in its recovery grace period.

   Another possibility is for the server to maintain the session and
   client ID but for all stateids held by the client to become invalid
   or stale.  Once the client can reach the server after such a network
   partition, the status returned by the SEQUENCE operation will
   indicate a loss of locking state; i.e., the flag
   SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in
   sr_status_flags.  In addition, all I/O submitted by the client with
   the now invalid stateids will fail with the server returning the
   error NFS4ERR_EXPIRED.  Once the client learns of the loss of
   locking state, it will suitably notify the applications that held
   the invalidated locks.  The client should then take action to free
   invalidated stateids, either by establishing a new client ID using a
   new verifier or by doing a FREE_STATEID operation to release each of
   the invalidated stateids.

   When the server adopts a finer-grained approach to revocation of
   locks when a client's lease has expired, only a subset of stateids
   will normally become invalid during a network partition.  When the
   client can communicate with the server after such a network
   partition heals, the status returned by the SEQUENCE operation will
   indicate a partial loss of locking state
   (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED).  In addition, operations,
   including I/O submitted by the client, with the now invalid stateids
   will fail with the server returning the error NFS4ERR_EXPIRED.  Once
   the client learns of the loss of locking state, it will use the
   TEST_STATEID operation on all of its stateids to determine which
   locks have been lost and then suitably notify the applications that
   held the invalidated locks.  The client can then release the
   invalidated locking state and acknowledge the revocation of the
   associated locks by doing a FREE_STATEID operation on each of the
   invalidated stateids.

   When a network partition is combined with a server restart, there
   are edge conditions that place requirements on the server in order
   to avoid silent data corruption following the server restart.  Two
   of these edge conditions are known, and are discussed below.

   The first edge condition arises as a result of scenarios such as the
   following:

   1.  Client A acquires a lock.

   2.  Client A and server experience mutual network partition, such
       that client A is unable to renew its lease.

   3.  Client A's lease expires, and the server releases the lock.

   4.  Client B acquires a lock that would have conflicted with that of
       client A.

   5.  Client B releases its lock.

   6.  Server restarts.

   7.  Network partition between client A and server heals.

   8.  Client A connects to a new server instance and finds out about
       server restart.

   9.  Client A reclaims its lock within the server's grace period.

   Thus, at the final step, the server has erroneously granted client
   A's lock reclaim.  If client B modified the object the lock was
   protecting, client A will experience object corruption.

   The second known edge condition arises in situations such as the
   following:

   1.  Client A acquires one or more locks.

   2.  Server restarts.

   3.  Client A and server experience mutual network partition, such
       that client A is unable to reclaim all of its locks within the
       grace period.

   4.  Server's reclaim grace period ends.
Client A has either no
       locks or an incomplete set of locks known to the server.

   5.  Client B acquires a lock that would have conflicted with a lock
       of client A that was not reclaimed.

   6.  Client B releases the lock.

   7.  Server restarts a second time.

   8.  Network partition between client A and server heals.

   9.  Client A connects to a new server instance and finds out about
       server restart.

   10. Client A reclaims its lock within the server's grace period.

   As with the first edge condition, the final step of the scenario of
   the second edge condition has the server erroneously granting client
   A's lock reclaim.

   Solving the first and second edge conditions requires either that
   the server always assume after it restarts that some edge condition
   occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts,
   or that the server record some information in stable storage.  The
   amount of information the server records in stable storage is in
   inverse proportion to how harsh the server intends to be whenever
   edge conditions arise.  A server that is completely tolerant of all
   edge conditions will record in stable storage every lock that is
   acquired, removing the lock record from stable storage only when the
   lock is released.  For the two edge conditions discussed above, the
   harshest a server can be while still supporting a grace period for
   reclaims requires that the server record some minimal information in
   stable storage.  For example, a server implementation could, for
   each client, save in stable storage a record containing:

   *  the co_ownerid field from the client_owner4 presented in the
      EXCHANGE_ID operation.

   *  a boolean that indicates if the client's lease expired or if
      there was administrative intervention (see Section 8.5) to revoke
      a byte-range lock, share reservation, or delegation and there has
      been no acknowledgment, via FREE_STATEID, of such revocation.

   *  a boolean that indicates whether the client may have locks that
      it believes to be reclaimable in situations in which the grace
      period was terminated, making the server's view of lock
      reclaimability suspect.  The server will set this for any client
      record in stable storage where the client has not done a suitable
      RECLAIM_COMPLETE (global or file system-specific, depending on
      the target of the lock request) before it grants any new (i.e.,
      not reclaimed) lock to any client.

   Assuming the above record keeping, for the first edge condition,
   after the server restarts, the record that client A's lease expired
   means that another client could have acquired a conflicting
   byte-range lock, share reservation, or delegation.  Hence, the
   server must reject a reclaim from client A with the error
   NFS4ERR_NO_GRACE.

   For the second edge condition, after the server restarts for a
   second time, the indication that the client had not completed its
   reclaims at the time at which the grace period ended means that the
   server must reject a reclaim from client A with the error
   NFS4ERR_NO_GRACE.

   When either edge condition occurs, the client's attempt to reclaim
   locks will result in the error NFS4ERR_NO_GRACE.  When this is
   received, or after the client restarts with no lock state, the
   client will send a global RECLAIM_COMPLETE.
When the
   RECLAIM_COMPLETE is received, the server and client are again in
   agreement regarding reclaimable locks and both booleans in
   persistent storage can be reset, to be set again only when there is
   a subsequent event that causes lock reclaim operations to be
   questionable.

   Regardless of the level and approach to record keeping, the server
   MUST implement one of the following strategies (which apply to
   reclaims of share reservations, byte-range locks, and delegations):

   1.  Reject all reclaims with NFS4ERR_NO_GRACE.  This is extremely
       unforgiving, but necessary if the server does not record lock
       state in stable storage.

   2.  Record sufficient state in stable storage such that all known
       edge conditions involving server restart, including the two
       noted in this section, are detected.  It is acceptable to
       erroneously recognize an edge condition and not allow a reclaim,
       when, with sufficient knowledge, it would be allowed.  The error
       the server would return in this case is NFS4ERR_NO_GRACE.  Note
       that it is not known if there are other edge conditions.

   In the event that, after a server restart, the server determines
   there is unrecoverable damage or corruption to the information in
   stable storage, then for all clients and/or locks that may be
   affected, the server MUST return NFS4ERR_NO_GRACE.

   A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
   outside the scope of this specification, since the strategies for
   such handling are very dependent on the client's operating
   environment.  However, one potential approach is described below.

   When the client receives NFS4ERR_NO_GRACE, it could examine the
   change attribute of the objects for which the client is trying to
   reclaim state, and use that to determine whether to re-establish the
   state via normal OPEN or LOCK operations.  This is acceptable
   provided that the client's operating environment allows it.  In
   other words, the client implementor is advised to document this
   behavior for users.  The client could also inform the application
   that its byte-range lock or share reservations (whether or not they
   were delegated) have been lost, such as via a UNIX signal, a
   Graphical User Interface (GUI) pop-up window, etc.  See Section 10.5
   for a discussion of what the client should do to deal with
   unreclaimed delegations on client state.

   For further discussion of revocation of locks, see Section 8.5.

8.5.  Server Revocation of Locks

   At any point, the server can revoke locks held by a client, and the
   client must be prepared for this event.  When the client detects
   that its locks have been or may have been revoked, the client is
   responsible for validating the state information between itself and
   the server.  Validating locking state for the client means that it
   must verify or reclaim state for each lock currently held.

   The first occasion of lock revocation is upon server restart.  Note
   that this includes situations in which sessions are persistent and
   locking state is lost.  In this class of instances, the client will
   receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes
   a client ID (usually as part of recovery in response to a problem
   with the current session), and the client will proceed with normal
   crash recovery as described in Section 8.4.2.1.

   The second occasion of lock revocation is the inability to renew the
   lease before expiration, as discussed in Section 8.4.3.
While this
   is considered a rare or unusual event, the client must be prepared
   to recover.  The server is responsible for determining the precise
   consequences of the lease expiration, informing the client of the
   scope of the lock revocation decided upon.  The client then uses the
   status information provided by the server in the SEQUENCE results
   (field sr_status_flags, see Section 18.46.3) to synchronize its
   locking state with that of the server, in order to recover.

   The third occasion of lock revocation can occur as a result of
   revocation of locks within the lease period, either because of
   administrative intervention or because a recallable lock (a
   delegation or layout) was not returned within the lease period after
   having been recalled.  While these are considered rare events, they
   are possible, and the client must be prepared to deal with them.
   When either of these events occurs, the client finds out about the
   situation through the status returned by the SEQUENCE operation.
   Any use of stateids associated with locks revoked during the lease
   period will receive the error NFS4ERR_ADMIN_REVOKED or
   NFS4ERR_DELEG_REVOKED, as appropriate.

   In all situations in which a subset of locking state may have been
   revoked, which include all cases in which locking state is revoked
   within the lease period, it is up to the client to determine which
   locks have been revoked and which have not.  It does this by using
   the TEST_STATEID operation on the appropriate set of stateids.  Once
   the set of revoked locks has been determined, the applications can
   be notified, and the invalidated stateids can be freed and lock
   revocation acknowledged by using FREE_STATEID.

8.6.  Short and Long Leases

   When determining the time period for the server lease, the usual
   lease trade-offs apply.  A short lease is good for fast server
   recovery at a cost of increased operations to effect lease renewal
   (when there are no other operations during the period to effect
   lease renewal as a side effect).  A long lease is certainly kinder
   and gentler to servers trying to handle very large numbers of
   clients.  The number of extra requests to effect lease renewal drops
   in inverse proportion to the lease time.  The disadvantages of a
   long lease include the possibility of slower recovery after certain
   failures.  After server failure, a longer grace period may be
   required when some clients do not promptly reclaim their locks and
   do a global RECLAIM_COMPLETE.  In the event of client failure, the
   longer period for a lease to expire will force conflicting requests
   to wait longer.

   A long lease is practical if the server can store lease state in
   stable storage.  Upon recovery, the server can reconstruct the lease
   state from its stable storage and continue operation with its
   clients.

8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration

   To avoid the need for synchronized clocks, lease times are granted
   by the server as a time delta.  However, there is a requirement that
   the client and server clocks do not drift excessively over the
   duration of the lease.  There is also the issue of propagation delay
   across the network, which could easily be several hundred
   milliseconds, as well as the possibility that requests will be lost
   and need to be retransmitted.
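
   Expressed as code, the client-side calculation described in the next
   paragraph might look like the following sketch (times in
   milliseconds; the helper name is illustrative):

      #include <stdint.h>

      /* The lease was already one_way_delay old when the client
       * received it, and a renewal needs another one_way_delay to
       * reach the server, so the client must renew at least
       * 2 * one_way_delay before nominal expiry. */
      int64_t latest_renewal_time_ms(int64_t granted_at,
                                     int64_t lease_time,
                                     int64_t one_way_delay)
      {
          return granted_at + lease_time - 2 * one_way_delay;
      }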
+ + To take propagation delay into account, the client should subtract it + from lease times (e.g., if the client estimates the one-way + propagation delay as 200 milliseconds, then it can assume that the + lease is already 200 milliseconds old when it gets it). In addition, + it will take another 200 milliseconds to get a response back to the + server. So the client must send a lease renewal or write data back + to the server at least 400 milliseconds before the lease would + expire. If the propagation delay varies over the life of the lease + (e.g., the client is on a mobile host), the client will need to + continuously subtract the increase in propagation delay from the + lease times. + + The server's lease period configuration should take into account the + network distance of the clients that will be accessing the server's + resources. It is expected that the lease period will take into + account the network propagation delays and other network delay + factors for the client population. Since the protocol does not allow + for an automatic method to determine an appropriate lease period, the + server's administrator may have to tune the lease period. + +8.8. Obsolete Locking Infrastructure from NFSv4.0 + + There are a number of operations and fields within existing + operations that no longer have a function in NFSv4.1. In one way or + another, these changes are all due to the implementation of sessions + that provide client context and exactly once semantics as a base + feature of the protocol, separate from locking itself. + + The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1. + The server MUST return NFS4ERR_NOTSUPP if these operations are found + in an NFSv4.1 COMPOUND. + + * SETCLIENTID since its function has been replaced by EXCHANGE_ID. + + * SETCLIENTID_CONFIRM since client ID confirmation now happens by + means of CREATE_SESSION. + + * OPEN_CONFIRM because state-owner-based seqids have been replaced + by the sequence ID in the SEQUENCE operation. + + * RELEASE_LOCKOWNER because lock-owners with no associated locks do + not have any sequence-related state and so can be deleted by the + server at will. + + * RENEW because every SEQUENCE operation for a session causes lease + renewal, making a separate operation superfluous. + + Also, there are a number of fields, present in existing operations, + related to locking that have no use in minor version 1. They were + used in minor version 0 to perform functions now provided in a + different fashion. + + * Sequence ids used to sequence requests for a given state-owner and + to provide retry protection, now provided via sessions. + + * Client IDs used to identify the client associated with a given + request. Client identification is now available using the client + ID associated with the current session, without needing an + explicit client ID field. + + Such vestigial fields in existing operations have no function in + NFSv4.1 and are ignored by the server. Note that client IDs in + operations new to NFSv4.1 (such as CREATE_SESSION and + DESTROY_CLIENTID) are not ignored. + +9. File Locking and Share Reservations + + To support Win32 share reservations, it is necessary to provide + operations that atomically open or create files. Having a separate + share/unshare operation would not allow correct implementation of the + Win32 OpenFile API. In order to correctly implement share semantics, + the previous NFS protocol mechanisms used when a file is opened or + created (LOOKUP, CREATE, ACCESS) need to be replaced. 
The NFSv4.1 + protocol defines an OPEN operation that is capable of atomically + looking up, creating, and locking a file on the server. + +9.1. Opens and Byte-Range Locks + + It is assumed that manipulating a byte-range lock is rare when + compared to READ and WRITE operations. It is also assumed that + server restarts and network partitions are relatively rare. + Therefore, it is important that the READ and WRITE operations have a + lightweight mechanism to indicate if they possess a held lock. A + LOCK operation contains the heavyweight information required to + establish a byte-range lock and uniquely define the owner of the + lock. + +9.1.1. State-Owner Definition + + When opening a file or requesting a byte-range lock, the client must + specify an identifier that represents the owner of the requested + lock. This identifier is in the form of a state-owner, represented + in the protocol by a state_owner4, a variable-length opaque array + that, when concatenated with the current client ID, uniquely defines + the owner of a lock managed by the client. This may be a thread ID, + process ID, or other unique value. + + Owners of opens and owners of byte-range locks are separate entities + and remain separate even if the same opaque arrays are used to + designate owners of each. The protocol distinguishes between open- + owners (represented by open_owner4 structures) and lock-owners + (represented by lock_owner4 structures). + + Each open is associated with a specific open-owner while each byte- + range lock is associated with a lock-owner and an open-owner, the + latter being the open-owner associated with the open file under which + the LOCK operation was done. Delegations and layouts, on the other + hand, are not associated with a specific owner but are associated + with the client as a whole (identified by a client ID). + +9.1.2. Use of the Stateid and Locking + + All READ, WRITE, and SETATTR operations contain a stateid. For the + purposes of this section, SETATTR operations that change the size + attribute of a file are treated as if they are writing the area + between the old and new sizes (i.e., the byte-range truncated or + added to the file by means of the SETATTR), even where SETATTR is not + explicitly mentioned in the text. The stateid passed to one of these + operations must be one that represents an open, a set of byte-range + locks, or a delegation, or it may be a special stateid representing + anonymous access or the special bypass stateid. + + If the state-owner performs a READ or WRITE operation in a situation + in which it has established a byte-range lock or share reservation on + the server (any OPEN constitutes a share reservation), the stateid + (previously returned by the server) must be used to indicate what + locks, including both byte-range locks and share reservations, are + held by the state-owner. If no state is established by the client, + either a byte-range lock or a share reservation, a special stateid + for anonymous state (zero as the value for "other" and "seqid") is + used. (See Section 8.2.3 for a description of 'special' stateids in + general.) Regardless of whether a stateid for anonymous state or a + stateid returned by the server is used, if there is a conflicting + share reservation or mandatory byte-range lock held on the file, the + server MUST refuse to service the READ or WRITE operation. 
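+
+   Because the special stateids have fixed bit patterns, a client can
+   construct them without reference to the server.  The following C
+   rendering is illustrative only: the structure mirrors the XDR
+   definition of stateid4, and the READ bypass stateid constructed
+   here is discussed further in the text below:
+
+   #include <stdint.h>
+   #include <string.h>
+
+   /* Illustrative C equivalent of the XDR stateid4: a 32-bit
+    * "seqid" and a 12-byte "other" field. */
+   struct stateid4 {
+       uint32_t      seqid;
+       unsigned char other[12];
+   };
+
+   /* Special stateid for anonymous state: zero in both fields. */
+   static const struct stateid4 anonymous_stateid; /* all zeros */
+
+   /* READ bypass stateid: all bits of "seqid" and "other" set. */
+   static void make_read_bypass_stateid(struct stateid4 *sid)
+   {
+       sid->seqid = UINT32_MAX;
+       memset(sid->other, 0xff, sizeof(sid->other));
+   }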
+ + Share reservations are established by OPEN operations and by their + nature are mandatory in that when the OPEN denies READ or WRITE + operations, that denial results in such operations being rejected + with error NFS4ERR_LOCKED. Byte-range locks may be implemented by + the server as either mandatory or advisory, or the choice of + mandatory or advisory behavior may be determined by the server on the + basis of the file being accessed (for example, some UNIX-based + servers support a "mandatory lock bit" on the mode attribute such + that if set, byte-range locks are required on the file before I/O is + possible). When byte-range locks are advisory, they only prevent the + granting of conflicting lock requests and have no effect on READs or + WRITEs. Mandatory byte-range locks, however, prevent conflicting I/O + operations. When they are attempted, they are rejected with + NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file for + which it knows it has the proper share reservation, it will need to + send a LOCK operation on the byte-range of the file that includes the + byte-range the I/O was to be performed on, with an appropriate + locktype field of the LOCK operation's arguments (i.e., READ*_LT for + a READ operation, WRITE*_LT for a WRITE operation). + + Note that for UNIX environments that support mandatory byte-range + locking, the distinction between advisory and mandatory locking is + subtle. In fact, advisory and mandatory byte-range locks are exactly + the same as far as the APIs and requirements on implementation. If + the mandatory lock attribute is set on the file, the server checks to + see if the lock-owner has an appropriate shared (READ_LT) or + exclusive (WRITE_LT) byte-range lock on the byte-range it wishes to + READ from or WRITE to. If there is no appropriate lock, the server + checks if there is a conflicting lock (which can be done by + attempting to acquire the conflicting lock on behalf of the lock- + owner, and if successful, release the lock after the READ or WRITE + operation is done), and if there is, the server returns + NFS4ERR_LOCKED. + + For Windows environments, byte-range locks are always mandatory, so + the server always checks for byte-range locks during I/O requests. + + Thus, the LOCK operation does not need to distinguish between + advisory and mandatory byte-range locks. It is the server's + processing of the READ and WRITE operations that introduces the + distinction. + + Every stateid that is validly passed to READ, WRITE, or SETATTR, with + the exception of special stateid values, defines an access mode for + the file (i.e., OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or + OPEN4_SHARE_ACCESS_BOTH). + + * For stateids associated with opens, this is the mode defined by + the original OPEN that caused the allocation of the OPEN stateid + and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the + same open-owner/file pair. + + * For stateids returned by byte-range LOCK operations, the + appropriate mode is the access mode for the OPEN stateid + associated with the lock set represented by the stateid. + + * For delegation stateids, the access mode is based on the type of + delegation. + + When a READ, WRITE, or SETATTR (that specifies the size attribute) + operation is done, the operation is subject to checking against the + access mode to verify that the operation is appropriate given the + stateid with which the operation is associated. 
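+
+   One plausible shape for this access mode check is sketched below;
+   the asymmetry between READ and WRITE is motivated by the paragraphs
+   that follow.  The nfsstat4 values and OPEN4_SHARE_ACCESS_* constants
+   are those of the protocol, while the io_kind enumeration and the
+   policy flag are inventions of the example:
+
+   enum io_kind { IO_READ, IO_WRITE };
+
+   /* Hypothetical server-side access mode check for READ and
+    * WRITE-type operations (including size-changing SETATTRs). */
+   nfsstat4 check_access_mode(uint32_t share_access, /* of stateid */
+                              enum io_kind kind,
+                              bool allow_read_on_write_only)
+   {
+       if (kind == IO_WRITE) {
+           /* Writing MUST be checked against the access mode. */
+           if (!(share_access & OPEN4_SHARE_ACCESS_WRITE))
+               return NFS4ERR_OPENMODE;
+           return NFS4_OK;
+       }
+
+       /* READ: the server may accommodate clients whose WRITE
+        * path unavoidably reads, but conflicting locks and share
+        * reservations must still be checked separately. */
+       if (!(share_access & OPEN4_SHARE_ACCESS_READ) &&
+           !allow_read_on_write_only)
+           return NFS4ERR_OPENMODE;
+       return NFS4_OK;
+   }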
+
+   In the case of WRITE-type operations (i.e., WRITEs and SETATTRs that
+   set size), the server MUST verify that the access mode allows
+   writing and MUST return an NFS4ERR_OPENMODE error if it does not.
+   In the case of READ, the server may perform the corresponding check
+   on the access mode, or it may choose to allow READ on OPENs for
+   OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE
+   implementation may unavoidably do reads (e.g., due to buffer cache
+   constraints).  However, even if READs are allowed in these
+   circumstances, the server MUST still check for locks that conflict
+   with the READ (e.g., another OPEN specified OPEN4_SHARE_DENY_READ or
+   OPEN4_SHARE_DENY_BOTH).  Note that a server that does enforce the
+   access mode check on READs need not explicitly check for conflicting
+   share reservations since the existence of OPEN for
+   OPEN4_SHARE_ACCESS_READ guarantees that no conflicting share
+   reservation can exist.
+
+   The READ bypass special stateid (all bits of "other" and "seqid" set
+   to one) indicates a desire to bypass locking checks.  The server MAY
+   allow READ operations to bypass locking checks at the server, when
+   this special stateid is used.  However, WRITE operations with this
+   special stateid value MUST NOT bypass locking checks and are treated
+   exactly the same as if a special stateid for anonymous state were
+   used.
+
+   A lock may not be granted while a READ or WRITE operation using one
+   of the special stateids is being performed and the scope of the lock
+   to be granted would conflict with the READ or WRITE operation.  This
+   can occur when:
+
+   *  A mandatory byte-range lock is requested with a byte-range that
+      conflicts with the byte-range of the READ or WRITE operation.
+      For the purposes of this paragraph, a conflict occurs when a
+      shared lock is requested and a WRITE operation is being
+      performed, or an exclusive lock is requested and either a READ or
+      a WRITE operation is being performed.
+
+   *  A share reservation is requested that denies reading and/or
+      writing and the corresponding operation is being performed.
+
+   *  A delegation is to be granted and the delegation type would
+      prevent the I/O operation, i.e., READ and WRITE conflict with an
+      OPEN_DELEGATE_WRITE delegation and WRITE conflicts with an
+      OPEN_DELEGATE_READ delegation.
+
+   When a client holds a delegation, it needs to ensure that the
+   stateid sent conveys the association of the operation with the
+   delegation, so as to avoid an unnecessary recall of the delegation.
+   When the delegation stateid, an open stateid associated with that
+   delegation, or a stateid representing byte-range locks derived from
+   such an open is used, the server knows that the READ, WRITE, or
+   SETATTR does not conflict with the delegation but is sent under the
+   aegis of the delegation.  Even though it is possible for the server
+   to determine from the client ID (via the session ID) that the client
+   does in fact have a delegation, the server is not obliged to check
+   this, so using a special stateid can result in an avoidable recall
+   of the delegation.
+
+9.2.  Lock Ranges
+
+   The protocol allows a lock-owner to request a lock with a byte-range
+   and then either upgrade, downgrade, or unlock a sub-range of the
+   initial lock, or a byte-range that overlaps -- fully or partially --
+   either with that initial lock or a combination of a set of existing
+   locks for the same lock-owner.  It is expected that this will be
+   an uncommon type of request.
In any case, servers or server file + systems may not be able to support sub-range lock semantics. In the + event that a server receives a locking request that represents a sub- + range of current locking state for the lock-owner, the server is + allowed to return the error NFS4ERR_LOCK_RANGE to signify that it + does not support sub-range lock operations. Therefore, the client + should be prepared to receive this error and, if appropriate, report + the error to the requesting application. + + The client is discouraged from combining multiple independent locking + ranges that happen to be adjacent into a single request since the + server may not support sub-range requests for reasons related to the + recovery of byte-range locking state in the event of server failure. + As discussed in Section 8.4.2, the server may employ certain + optimizations during recovery that work effectively only when the + client's behavior during lock recovery is similar to the client's + locking behavior prior to server failure. + +9.3. Upgrading and Downgrading Locks + + If a client has a WRITE_LT lock on a byte-range, it can request an + atomic downgrade of the lock to a READ_LT lock via the LOCK + operation, by setting the type to READ_LT. If the server supports + atomic downgrade, the request will succeed. If not, it will return + NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this + error and, if appropriate, report the error to the requesting + application. + + If a client has a READ_LT lock on a byte-range, it can request an + atomic upgrade of the lock to a WRITE_LT lock via the LOCK operation + by setting the type to WRITE_LT or WRITEW_LT. If the server does not + support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the + upgrade can be achieved without an existing conflict, the request + will succeed. Otherwise, the server will return either + NFS4ERR_DENIED or NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is + returned if the client sent the LOCK operation with the type set to + WRITEW_LT and the server has detected a deadlock. The client should + be prepared to receive such errors and, if appropriate, report the + error to the requesting application. + +9.4. Stateid Seqid Values and Byte-Range Locks + + When a LOCK or LOCKU operation is performed, the stateid returned has + the same "other" value as the argument's stateid, and a "seqid" value + that is incremented (relative to the argument's stateid) to reflect + the occurrence of the LOCK or LOCKU operation. The server MUST + increment the value of the "seqid" field whenever there is any change + to the locking status of any byte offset as described by any of the + locks covered by the stateid. A change in locking status includes a + change from locked to unlocked or the reverse or a change from being + locked for READ_LT to being locked for WRITE_LT or the reverse. + + When there is no such change, as, for example, when a range already + locked for WRITE_LT is locked again for WRITE_LT, the server MAY + increment the "seqid" value. + +9.5. Issues with Multiple Open-Owners + + When the same file is opened by multiple open-owners, a client will + have multiple OPEN stateids for that file, each associated with a + different open-owner. 
In that case, there can be multiple LOCK and
+   LOCKU requests for the same lock-owner sent using the different OPEN
+   stateids, and so a situation may arise in which there are multiple
+   stateids, each representing byte-range locks on the same file and
+   held by the same lock-owner but each associated with a different
+   open-owner.
+
+   In such a situation, the locking status of each byte (i.e., whether
+   it is locked, the READ_LT or WRITE_LT type of the lock, and the
+   lock-owner holding the lock) MUST reflect the last LOCK or LOCKU
+   operation done for the lock-owner in question, independent of the
+   stateid through which the request was sent.
+
+   When a byte is locked by the lock-owner in question, the open-owner
+   to which that byte-range lock is assigned SHOULD be the open-owner
+   associated with the stateid through which the last LOCK of that byte
+   was done.  When there is a change in the open-owner associated with
+   locks for the stateid through which a LOCK or LOCKU was done, the
+   "seqid" field of the stateid MUST be incremented, even if the
+   locking, in terms of lock-owners, has not changed.  When there is a
+   change to the set of locked bytes associated with a different
+   stateid for the same lock-owner, i.e., associated with a different
+   open-owner, the "seqid" value for that stateid MUST NOT be
+   incremented.
+
+9.6.  Blocking Locks
+
+   Some clients require the support of blocking locks.  While NFSv4.1
+   provides a callback when a previously unavailable lock becomes
+   available, this is an OPTIONAL feature and clients cannot depend on
+   its presence.  Clients need to be prepared to continually poll for
+   the lock.  This presents a fairness problem.  Two of the lock types,
+   READW_LT and WRITEW_LT, are used to indicate to the server that the
+   client is requesting a blocking lock.  When the callback is not
+   used, the server should maintain an ordered list of pending blocking
+   locks.  When the conflicting lock is released, the server may wait
+   for a period of time equal to lease_time for the first waiting
+   client to re-request the lock.  After the lease period expires, the
+   next waiting client request is allowed the lock.  Clients are
+   required to poll at an interval sufficiently small that they are
+   likely to acquire the lock in a timely manner.  The server is not
+   required to maintain a list of pending blocked locks, since the list
+   is used to increase fairness and is not needed for correct
+   operation.  Because of the unordered nature of crash recovery,
+   storing of lock state to stable storage would be required to
+   guarantee ordered granting of blocking locks.
+
+   Servers may also note the lock types and delay returning denial of
+   the request to allow extra time for a conflicting lock to be
+   released, allowing a successful return.  In this way, clients can
+   avoid the burden of needless frequent polling for blocking locks.
+   The server should take care in choosing the length of this delay,
+   in the event that the client retransmits the request.
+
+   If a server receives a blocking LOCK operation, denies it, and then
+   later receives a nonblocking request for the same lock, which is
+   also denied, then it should remove the lock in question from its
+   list of pending blocking locks.  Clients should use such a
+   nonblocking request to indicate to the server that this is the last
+   time they intend to poll for the lock, as may happen when the
+   process requesting the lock is interrupted.  This is a courtesy to
+   the server, to prevent it from unnecessarily waiting a lease period
+   before granting other LOCK operations.  However, clients are not
+   required to perform this courtesy, and servers must not depend on
+   them doing so.  Also, clients must be prepared for the possibility
+   that this final locking request will be accepted.
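+
+   A client-side polling loop consistent with the above might be
+   structured as follows.  The nfs_lock(), interrupted(), and
+   millisleep() helpers, and the open_ctx structure, are hypothetical
+   stand-ins for a client's RPC and scheduling machinery:
+
+   /* Hypothetical client polling loop for a blocking byte-range
+    * lock.  nfs_lock() stands in for sending a LOCK operation
+    * with the given locktype and returning its status. */
+   nfsstat4 poll_for_lock(struct open_ctx *oc, uint64_t offset,
+                          uint64_t length, unsigned interval_ms)
+   {
+       for (;;) {
+           /* WRITEW_LT marks the request as blocking, so the
+            * server may queue this client for fairness. */
+           nfsstat4 st = nfs_lock(oc, WRITEW_LT, offset, length);
+
+           if (st != NFS4ERR_DENIED)
+               return st;       /* granted, or a hard error */
+           if (interrupted()) {
+               /* A final nonblocking probe tells the server to
+                * drop this client from its list of waiters; note
+                * that it may nevertheless succeed. */
+               return nfs_lock(oc, WRITE_LT, offset, length);
+           }
+           millisleep(interval_ms);
+       }
+   }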
+
+   When a server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK,
+   that CB_NOTIFY_LOCK callbacks might be done for the current open
+   file, the client should take notice of this, but, since this is a
+   hint, cannot rely on a CB_NOTIFY_LOCK always being done.  A client
+   may reasonably reduce the frequency with which it polls for a denied
+   lock, since the greater latency that might occur is likely to be
+   eliminated given a prompt callback, but it still needs to poll.
+   When it receives a CB_NOTIFY_LOCK, it should promptly try to obtain
+   the lock, but it should be aware that other clients may be polling
+   and that the server is under no obligation to reserve the lock for
+   that particular client.
+
+9.7.  Share Reservations
+
+   A share reservation is a mechanism to control access to a file.  It
+   is a separate and independent mechanism from byte-range locking.
+   When a client opens a file, it sends an OPEN operation to the server
+   specifying the type of access required (READ, WRITE, or BOTH) and
+   the type of access to deny others (OPEN4_SHARE_DENY_NONE,
+   OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or
+   OPEN4_SHARE_DENY_BOTH).  If the OPEN fails, the client will fail the
+   application's open request.
+
+   Pseudo-code definition of the semantics:
+
+   if (request.access == 0) {
+           return (NFS4ERR_INVAL);
+   } else {
+           if ((request.access & file_state.deny) ||
+               (request.deny & file_state.access)) {
+                   return (NFS4ERR_SHARE_DENIED);
+           }
+           return (NFS4_OK);
+   }
+
+   When doing this checking of share reservations on OPEN, the current
+   file_state used in the algorithm includes bits that reflect all
+   current opens, including those for the open-owner making the new
+   OPEN request.
+
+   The constants used for the OPEN and OPEN_DOWNGRADE operations for
+   the access and deny fields are as follows:
+
+   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
+   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
+   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;
+
+   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
+   const OPEN4_SHARE_DENY_READ     = 0x00000001;
+   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
+   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;
+
+9.8.  OPEN/CLOSE Operations
+
+   To provide correct share semantics, a client MUST use the OPEN
+   operation to obtain the initial filehandle and indicate the desired
+   access and what access, if any, to deny.  Even if the client intends
+   to use a special stateid for anonymous state or READ bypass, it must
+   still obtain the filehandle for the regular file with the OPEN
+   operation so the appropriate share semantics can be applied.
+   Clients that do not have a deny mode built into their programming
+   interfaces for opening a file should request a deny mode of
+   OPEN4_SHARE_DENY_NONE.
+
+   The OPEN operation with the CREATE flag also subsumes the CREATE
+   operation for regular files as used in previous versions of the NFS
+   protocol.  This allows a create with a share to be done atomically.
+
+   The CLOSE operation removes all share reservations held by the
+   open-owner on that file.  If byte-range locks are held, the client
+   SHOULD release all locks before sending a CLOSE operation.  The
+   server MAY free all outstanding locks on CLOSE, but some servers may
+   not support the CLOSE of a file that still has byte-range locks
+   held.  The server MUST return failure, NFS4ERR_LOCKS_HELD, if any
+   locks would exist after the CLOSE.
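+
+   A client following the SHOULD above would release its byte-range
+   locks before sending the CLOSE, roughly as sketched here; the
+   lock-table and RPC helpers are hypothetical:
+
+   /* Hypothetical client-side close path: release byte-range
+    * locks before CLOSE, since a server is permitted to reject
+    * the CLOSE of a file with locks held. */
+   nfsstat4 close_with_locks_released(struct open_ctx *oc)
+   {
+       struct held_lock *hl;
+
+       while ((hl = first_held_lock(oc)) != NULL) {
+           nfsstat4 st = nfs_locku(oc, hl->offset, hl->length);
+
+           if (st != NFS4_OK)
+               return st;
+       }
+       /* NFS4ERR_LOCKS_HELD results if any locks remain. */
+       return nfs_close(oc);
+   }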
+
+   The LOOKUP operation will return a filehandle without establishing
+   any lock state on the server.  Without a valid stateid, the server
+   will assume that the client has the least access.  For example, if
+   one client opened a file with OPEN4_SHARE_DENY_BOTH and another
+   client accesses the file via a filehandle obtained through LOOKUP,
+   the second client could only read the file using the special READ
+   bypass stateid.  The second client could not WRITE the file at all
+   because it would not have a valid stateid from OPEN and the special
+   anonymous stateid would not be allowed access.
+
+9.9.  Open Upgrade and Downgrade
+
+   When an OPEN is done for a file and the open-owner for which the
+   OPEN is being done already has the file open, the result is to
+   upgrade the open file status maintained on the server to include the
+   access and deny bits specified by the new OPEN as well as those for
+   the existing OPEN.  The result is that there is one open file, as
+   far as the protocol is concerned, and it includes the union of the
+   access and deny bits for all of the OPEN requests completed.  The
+   OPEN is represented by a single stateid whose "other" value matches
+   that of the original open, and whose "seqid" value is incremented to
+   reflect the occurrence of the upgrade.  The increment is required in
+   cases in which the "upgrade" results in no change to the open mode
+   (e.g., an OPEN is done for read when the existing open file is
+   opened for OPEN4_SHARE_ACCESS_BOTH).  Only a single CLOSE will be
+   done to reset the effects of both OPENs.  The client may use the
+   stateid returned by the OPEN effecting the upgrade, or a stateid
+   sharing the same "other" field and a seqid of zero, although care
+   needs to be taken with regard to upgrades that happen while the
+   CLOSE is pending.  Note that the client, when sending the OPEN, may
+   not know that the same file is in fact being opened.  The above only
+   applies if both OPENs result in the OPENed object being designated
+   by the same filehandle.
+
+   When the server chooses to export multiple filehandles corresponding
+   to the same file object and returns different filehandles on two
+   different OPENs of the same file object, the server MUST NOT "OR"
+   together the access and deny bits and coalesce the two open files.
+   Instead, the server must maintain separate OPENs with separate
+   stateids and will require separate CLOSEs to free them.
+
+   When multiple open files on the client are merged into a single OPEN
+   file object on the server, the close of one of the open files (on
+   the client) may necessitate a change of the access and deny status
+   of the open file on the server.  This is because the union of the
+   access and deny bits for the remaining opens may be smaller (i.e., a
+   proper subset) than previously.  The OPEN_DOWNGRADE operation is
+   used to make the necessary change and the client should use it to
+   update the server so that share reservation requests by other
+   clients are handled properly.  The stateid returned has the same
+   "other" field as that passed to the server.  The "seqid" value in
+   the returned stateid MUST be incremented, even in situations in
+   which there is no change to the access and deny bits for the file.
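+
+   The recomputation that drives an OPEN_DOWNGRADE can be expressed
+   compactly.  In the sketch below, the client-side bookkeeping
+   structures and helpers are hypothetical; only the unioning of the
+   share access and deny bits follows from the rules above:
+
+   /* Hypothetical client bookkeeping: recompute the unions of the
+    * share access and deny bits over the opens that remain after
+    * one of them is closed locally, and issue an OPEN_DOWNGRADE
+    * if either union has shrunk. */
+   void downgrade_if_needed(struct server_open *so)
+   {
+       uint32_t access = 0, deny = 0;
+       struct local_open *lo;
+
+       for (lo = so->local_opens; lo != NULL; lo = lo->next) {
+           access |= lo->share_access;  /* OPEN4_SHARE_ACCESS_* */
+           deny   |= lo->share_deny;    /* OPEN4_SHARE_DENY_* */
+       }
+       if (access != so->share_access || deny != so->share_deny)
+           nfs_open_downgrade(so, access, deny);
+   }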
+
+9.10.  Parallel OPENs
+
+   Unlike the case of NFSv4.0, in which OPEN operations for the same
+   open-owner are inherently serialized because of the owner-based
+   seqid, multiple OPENs for the same open-owner may be done in
+   parallel.  When clients do this, they may encounter situations in
+   which, because of the existence of hard links, two OPEN operations
+   may turn out to open the same file, with a later OPEN performed
+   being an upgrade of the first, with this fact only visible to the
+   client once the operations complete.
+
+   In this situation, clients may determine the order in which the
+   OPENs were performed by examining the stateids returned by the
+   OPENs.  Stateids that share a common value of the "other" field can
+   be recognized as having opened the same file, with the order of the
+   operations determinable from the order of the "seqid" fields, mod
+   any possible wraparound of the 32-bit field.
+
+   When the possibility exists that the client will send multiple OPENs
+   for the same open-owner in parallel, an open upgrade may happen
+   without the client knowing beforehand that it could.  Because of
+   this possibility, CLOSEs and OPEN_DOWNGRADEs should generally be
+   sent with a non-zero seqid in the stateid, to avoid the possibility
+   that the status change associated with an open upgrade is
+   inadvertently lost.
+
+9.11.  Reclaim of Open and Byte-Range Locks
+
+   Special forms of the LOCK and OPEN operations are provided when it
+   is necessary to re-establish byte-range locks or opens after a
+   server failure.
+
+   *  To reclaim existing opens, an OPEN operation is performed using a
+      claim type of CLAIM_PREVIOUS.  Because the client, in this type
+      of situation, will have already opened the file and have the
+      filehandle of the target file, this operation requires that the
+      current filehandle be the target file, rather than a directory,
+      and no file name is specified.
+
+   *  To reclaim byte-range locks, a LOCK operation with the reclaim
+      parameter set to true is used.
+
+   Reclaims of opens associated with delegations are discussed in
+   Section 10.2.1.
+
+10.  Client-Side Caching
+
+   Client-side caching of data, of file attributes, and of file names
+   is essential to providing good performance with the NFS protocol.
+   Providing distributed cache coherence is a difficult problem, and
+   previous versions of the NFS protocol have not attempted it.
+   Instead, several NFS client implementation techniques have been used
+   to reduce the problems that a lack of coherence poses for users.
+   These techniques have not been clearly defined by earlier protocol
+   specifications, and it is often unclear what is valid or invalid
+   client behavior.
+
+   The NFSv4.1 protocol uses many techniques similar to those that have
+   been used in previous protocol versions.  The NFSv4.1 protocol does
+   not provide distributed cache coherence.  However, it defines a more
+   limited set of caching guarantees to allow locks and share
+   reservations to be used without destructive interference from
+   client-side caching.
+
+   In addition, the NFSv4.1 protocol introduces a delegation mechanism,
+   which allows many decisions normally made by the server to be made
+   locally by clients.  This mechanism provides efficient support of
+   the common cases where sharing is infrequent or where sharing is
+   read-only.
+
+10.1.  Performance Challenges for Client-Side Caching
+
+   Caching techniques used in previous versions of the NFS protocol
+   have been successful in providing good performance.
+   However, several scalability challenges can arise when those
+   techniques are used with very large numbers of clients.  This is
+   particularly true when clients are geographically distributed, which
+   classically increases the latency for cache revalidation requests.
+
+   The previous versions of the NFS protocol repeat their file data
+   cache validation requests at the time the file is opened.  This
+   behavior can have serious performance drawbacks.  A common case is
+   one in which a file is only accessed by a single client.  Therefore,
+   sharing is infrequent.
+
+   In this case, repeated references to the server to find that no
+   conflicts exist are expensive.  A better option with regard to
+   performance is to allow a client that repeatedly opens a file to do
+   so without reference to the server.  This is done until potentially
+   conflicting operations from another client actually occur.
+
+   A similar situation arises in connection with byte-range locking.
+   Sending LOCK and LOCKU operations as well as the READ and WRITE
+   operations necessary to make data caching consistent with the
+   locking semantics (see Section 10.3.2) can severely limit
+   performance.  When locking is used to provide protection against
+   infrequent conflicts, a large penalty is incurred.  This penalty may
+   discourage the use of byte-range locking by applications.
+
+   The NFSv4.1 protocol provides more aggressive caching strategies
+   with the following design goals:
+
+   *  Compatibility with a large range of server semantics.
+
+   *  Providing the same caching benefits as previous versions of the
+      NFS protocol when unable to support the more aggressive model.
+
+   *  Requirements for aggressive caching are organized so that a large
+      portion of the benefit can be obtained even when not all of the
+      requirements can be met.
+
+   The appropriate requirements for the server are discussed in later
+   sections in which specific forms of caching are covered (see
+   Section 10.4).
+
+10.2.  Delegation and Callbacks
+
+   Recallable delegation of server responsibilities for a file to a
+   client improves performance by avoiding repeated requests to the
+   server in the absence of inter-client conflict.  With the use of a
+   "callback" RPC from server to client, a server recalls delegated
+   responsibilities when another client engages in sharing of a
+   delegated file.
+
+   A delegation is passed from the server to the client, specifying the
+   object of the delegation and the type of delegation.  There are
+   different types of delegations, but each type contains a stateid to
+   be used to represent the delegation when performing operations that
+   depend on the delegation.  This stateid is similar to those
+   associated with locks and share reservations but differs in that the
+   stateid for a delegation is associated with a client ID and may be
+   used on behalf of all the open-owners for the given client.  A
+   delegation is made to the client as a whole and not to any specific
+   process or thread of control within it.
+
+   The backchannel is established by CREATE_SESSION and
+   BIND_CONN_TO_SESSION, and the client is required to maintain it.
+   Because the backchannel may be down, even temporarily, correct
+   protocol operation does not depend on it.  Preliminary testing of
+   backchannel functionality by means of a CB_COMPOUND procedure with a
+   single operation, CB_SEQUENCE, can be used to check the continuity
+   of the backchannel.  A server avoids delegating responsibilities
+   until it has determined that the backchannel exists.  Because the
+   granting of a delegation is always conditional upon the absence of
+   conflicting access, clients MUST NOT assume that a delegation will
+   be granted and they MUST always be prepared for OPENs,
+   WANT_DELEGATIONs, and GET_DIR_DELEGATIONs to be processed without
+   any delegations being granted.
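+
+   On the server side, this policy reduces to a simple gate in front of
+   any delegation grant.  In the sketch below, the nfs_client structure
+   and probe_backchannel(), a wrapper for the CB_SEQUENCE test just
+   described, are hypothetical:
+
+   /* Hypothetical server-side gate: no delegation is granted
+    * until a CB_COMPOUND containing a single CB_SEQUENCE has
+    * verified the backchannel. */
+   bool may_grant_delegation(struct nfs_client *clp)
+   {
+       if (!clp->backchannel_verified)
+           clp->backchannel_verified = probe_backchannel(clp);
+       return clp->backchannel_verified;
+   }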
+
+   Unlike locks, an operation by a second client to a delegated file
+   will cause the server to recall a delegation through a callback.
+   For individual operations, we will describe, under IMPLEMENTATION,
+   when such operations are required to effect a recall.  A number of
+   points should be noted, however.
+
+   *  The server is free to recall a delegation whenever it feels it is
+      desirable and may do so even if no operations requiring recall
+      are being done.
+
+   *  Operations done outside the NFSv4.1 protocol, for example, access
+      by other protocols or local access, also need to result in
+      delegation recall when they make analogous changes to file system
+      data.  What is crucial is whether the change would invalidate the
+      guarantees provided by the delegation.  When this is possible,
+      the delegation needs to be recalled and MUST be returned or
+      revoked before allowing the operation to proceed.
+
+   *  The semantics of the file system are crucial in defining when
+      delegation recall is required.  If a particular change within a
+      specific implementation causes change to a file attribute, then
+      delegation recall is required, whether or not that operation has
+      been specifically listed as requiring delegation recall.  Again,
+      what is critical is whether the guarantees provided by the
+      delegation are being invalidated.
+
+   Despite those caveats, the implementation sections for a number of
+   operations describe situations in which delegation recall would be
+   required under some common circumstances:
+
+   *  For GETATTR, see Section 18.7.4.
+
+   *  For OPEN, see Section 18.16.4.
+
+   *  For READ, see Section 18.22.4.
+
+   *  For REMOVE, see Section 18.25.4.
+
+   *  For RENAME, see Section 18.26.4.
+
+   *  For SETATTR, see Section 18.30.4.
+
+   *  For WRITE, see Section 18.32.4.
+
+   On recall, the client holding the delegation needs to flush modified
+   state (such as modified data) to the server and return the
+   delegation.  The conflicting request will not be acted on until the
+   recall is complete.  The recall is considered complete when the
+   client returns the delegation or the server times out its wait for
+   the delegation to be returned and revokes the delegation as a result
+   of the timeout.  In the interim, the server will either delay
+   responding to conflicting requests or respond to them with
+   NFS4ERR_DELAY.  Following the resolution of the recall, the server
+   has the information necessary to grant or deny the second client's
+   request.
+
+   At the time the client receives a delegation recall, it may have
+   substantial state that needs to be flushed to the server.
+   Therefore, the server should allow sufficient time for the
+   delegation to be returned since it may involve numerous RPCs to the
+   server.  If the server is able to determine that the client is
+   diligently flushing state to the server as a result of the recall,
+   the server may extend the usual time allowed for a recall.  However,
+   the time allowed for recall completion should not be unbounded.
+
+   An example of this is when responsibility to mediate opens on a
+   given file is delegated to a client (see Section 10.4).  The server
+   will not know what opens are in effect on the client.  Without this
+   knowledge, the server will be unable to determine if the access and
+   deny states for the file allow any particular open until the
+   delegation for the file has been returned.
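+
+   From the client's perspective, servicing a recall amounts to
+   flushing modified state and then returning the delegation.  A
+   minimal sketch, with both helpers hypothetical:
+
+   /* Hypothetical client handling of a delegation recall: write
+    * back modified state covered by the delegation (WRITEs
+    * followed by COMMIT as needed), then return the delegation
+    * with DELEGRETURN. */
+   void handle_recall(struct delegation *dp)
+   {
+       flush_modified_state(dp);
+       nfs_delegreturn(dp);
+   }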
+
+   A client failure or a network partition can result in failure to
+   respond to a recall callback.  In this case, the server will revoke
+   the delegation, which in turn will render useless any modified state
+   still on the client.
+
+10.2.1.  Delegation Recovery
+
+   There are three situations that delegation recovery needs to deal
+   with:
+
+   *  client restart
+
+   *  server restart
+
+   *  network partition (full or backchannel-only)
+
+   In the event the client restarts, the failure to renew the lease
+   will result in the revocation of byte-range locks and share
+   reservations.  Delegations, however, may be treated a bit
+   differently.
+
+   There will be situations in which delegations will need to be
+   re-established after a client restarts.  The reason for this is that
+   the client may have file data stored locally and this data was
+   associated with the previously held delegations.  The client will
+   need to re-establish the appropriate file state on the server.
+
+   To allow for this type of client recovery, the server MAY extend the
+   period for delegation recovery beyond the typical lease expiration
+   period.  This implies that requests from other clients that conflict
+   with these delegations will need to wait.  Because the normal recall
+   process may require significant time for the client to flush changed
+   state to the server, other clients need to be prepared for delays
+   that occur because of a conflicting delegation.  This longer
+   interval would increase the window for clients to restart and
+   consult stable storage so that the delegations can be reclaimed.
+   OPEN delegations are reclaimed using OPEN with a claim type of
+   CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH (see Sections 10.5 and
+   18.16 for discussion of OPEN delegation and the details of OPEN,
+   respectively).
+
+   A server MAY support claim types of CLAIM_DELEGATE_PREV and
+   CLAIM_DELEG_PREV_FH, and if it does, it MUST NOT remove delegations
+   upon a CREATE_SESSION that confirms a client ID created by
+   EXCHANGE_ID.  Instead, the server MUST, for a period of time no less
+   than the value of the lease_time attribute, maintain the client's
+   delegations to allow time for the client to send CLAIM_DELEGATE_PREV
+   and/or CLAIM_DELEG_PREV_FH requests.  A server that supports
+   CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH MUST support the
+   DELEGPURGE operation.
+
+   When the server restarts, delegations are reclaimed (using the OPEN
+   operation with CLAIM_PREVIOUS) in a similar fashion to byte-range
+   locks and share reservations.  However, there is a slight semantic
+   difference.  In the normal case, if the server decides that a
+   delegation should not be granted, it performs the requested action
+   (e.g., OPEN) without granting any delegation.  For reclaim, the
+   server grants the delegation but a special designation is applied so
+   that the client treats the delegation as having been granted but
+   recalled by the server.  Because of this, the client has the duty to
+   write all modified state to the server and then return the
+   delegation.  This process of handling delegation reclaim reconciles
+   three principles of the NFSv4.1 protocol:
+
+   *  Upon reclaim, a client reporting resources assigned to it by an
+      earlier server instance must be granted those resources.
+ + * The server has unquestionable authority to determine whether + delegations are to be granted and, once granted, whether they are + to be continued. + + * The use of callbacks should not be depended upon until the client + has proven its ability to receive them. + + When a client needs to reclaim a delegation and there is no + associated open, the client may use the CLAIM_PREVIOUS variant of the + WANT_DELEGATION operation. However, since the server is not required + to support this operation, an alternative is to reclaim via a dummy + OPEN together with the delegation using an OPEN of type + CLAIM_PREVIOUS. The dummy open file can be released using a CLOSE to + re-establish the original state to be reclaimed, a delegation without + an associated open. + + When a client has more than a single open associated with a + delegation, state for those additional opens can be established using + OPEN operations of type CLAIM_DELEGATE_CUR. When these are used to + establish opens associated with reclaimed delegations, the server + MUST allow them when made within the grace period. + + When a network partition occurs, delegations are subject to freeing + by the server when the lease renewal period expires. This is similar + to the behavior for locks and share reservations. For delegations, + however, the server may extend the period in which conflicting + requests are held off. Eventually, the occurrence of a conflicting + request from another client will cause revocation of the delegation. + A loss of the backchannel (e.g., by later network configuration + change) will have the same effect. A recall request will fail and + revocation of the delegation will result. + + A client normally finds out about revocation of a delegation when it + uses a stateid associated with a delegation and receives one of the + errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or + NFS4ERR_DELEG_REVOKED. It also may find out about delegation + revocation after a client restart when it attempts to reclaim a + delegation and receives that same error. Note that in the case of a + revoked OPEN_DELEGATE_WRITE delegation, there are issues because data + may have been modified by the client whose delegation is revoked and + separately by other clients. See Section 10.5.1 for a discussion of + such issues. Note also that when delegations are revoked, + information about the revoked delegation will be written by the + server to stable storage (as described in Section 8.4.3). This is + done to deal with the case in which a server restarts after revoking + a delegation but before the client holding the revoked delegation is + notified about the revocation. + +10.3. Data Caching + + When applications share access to a set of files, they need to be + implemented so as to take account of the possibility of conflicting + access by another application. This is true whether the applications + in question execute on different clients or reside on the same + client. + + Share reservations and byte-range locks are the facilities the + NFSv4.1 protocol provides to allow applications to coordinate access + by using mutual exclusion facilities. The NFSv4.1 protocol's data + caching must be implemented such that it does not invalidate the + assumptions on which those using these facilities depend. + +10.3.1. 
Data Caching and OPENs
+
+   In order to avoid invalidating the sharing assumptions on which
+   applications rely, NFSv4.1 clients should not provide cached data to
+   applications or modify it on behalf of an application when it would
+   not be valid to obtain or modify that same data via a READ or WRITE
+   operation.
+
+   Furthermore, in the absence of an OPEN delegation (see
+   Section 10.4), two additional rules apply.  Note that these rules
+   are obeyed in practice by many NFSv3 clients.
+
+   *  First, cached data present on a client must be revalidated after
+      doing an OPEN.  Revalidating means that the client fetches the
+      change attribute from the server, compares it with the cached
+      change attribute, and if different, declares the cached data (as
+      well as the cached attributes) as invalid.  This is to ensure
+      that the data for the OPENed file is still correctly reflected in
+      the client's cache.  This validation must be done at least when
+      the client's OPEN operation includes a deny of
+      OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH, thus terminating
+      a period in which other clients may have had the opportunity to
+      open the file with OPEN4_SHARE_ACCESS_WRITE/
+      OPEN4_SHARE_ACCESS_BOTH access.  Clients may choose to do the
+      revalidation more often (i.e., at OPENs specifying a deny mode of
+      OPEN4_SHARE_DENY_NONE) to parallel the NFSv3 protocol's practice
+      for the benefit of users assuming this degree of cache
+      revalidation.
+
+      Since the change attribute is updated for data and metadata
+      modifications, some client implementors may be tempted to use the
+      time_modify attribute and not the change attribute to validate
+      cached data, so that metadata changes do not spuriously
+      invalidate clean data.  The implementor is cautioned against this
+      approach.  The change attribute is guaranteed to change for each
+      update to the file, whereas time_modify is guaranteed to change
+      only at the granularity of the time_delta attribute.  Use of
+      time_modify rather than change by the client's data cache
+      validation logic runs the risk of the client incorrectly marking
+      stale data as valid.  Thus, any cache validation approach by the
+      client MUST include the use of the change attribute.
+
+   *  Second, modified data must be flushed to the server before
+      closing a file OPENed for OPEN4_SHARE_ACCESS_WRITE.  This is
+      complementary to the first rule.  If the data is not flushed at
+      CLOSE, the revalidation done after the client OPENs a file is
+      unable to achieve its purpose.  The other aspect of flushing the
+      data before close is that the data must be committed to stable
+      storage, at the server, before the CLOSE operation is requested
+      by the client.  In the case of a server restart and a CLOSEd
+      file, it may not be possible to retransmit the data to be written
+      to the file, hence this requirement.
+
+10.3.2.  Data Caching and File Locking
+
+   For those applications that choose to use byte-range locking instead
+   of share reservations to exclude inconsistent file access, there is
+   an analogous set of constraints that apply to client-side data
+   caching.  These rules are effective only if the byte-range locking
+   is used in a way that matches in an equivalent way the actual READ
+   and WRITE operations executed.  This is as opposed to byte-range
+   locking that is based on pure convention.  For example, it is
+   possible to manipulate a two-megabyte file by dividing the file into
+   two one-megabyte ranges and protecting access to the two byte-ranges
+   by byte-range locks on bytes zero and one.
A WRITE_LT lock on byte zero of + the file would represent the right to perform READ and WRITE + operations on the first byte-range. A WRITE_LT lock on byte one of + the file would represent the right to perform READ and WRITE + operations on the second byte-range. As long as all applications + manipulating the file obey this convention, they will work on a local + file system. However, they may not work with the NFSv4.1 protocol + unless clients refrain from data caching. + + The rules for data caching in the byte-range locking environment are: + + * First, when a client obtains a byte-range lock for a particular + byte-range, the data cache corresponding to that byte-range (if + any cache data exists) must be revalidated. If the change + attribute indicates that the file may have been updated since the + cached data was obtained, the client must flush or invalidate the + cached data for the newly locked byte-range. A client might + choose to invalidate all of the non-modified cached data that it + has for the file, but the only requirement for correct operation + is to invalidate all of the data in the newly locked byte-range. + + * Second, before releasing a WRITE_LT lock for a byte-range, all + modified data for that byte-range must be flushed to the server. + The modified data must also be written to stable storage. + + Note that flushing data to the server and the invalidation of cached + data must reflect the actual byte-ranges locked or unlocked. + Rounding these up or down to reflect client cache block boundaries + will cause problems if not carefully done. For example, writing a + modified block when only half of that block is within an area being + unlocked may cause invalid modification to the byte-range outside the + unlocked area. This, in turn, may be part of a byte-range locked by + another client. Clients can avoid this situation by synchronously + performing portions of WRITE operations that overlap that portion + (initial or final) that is not a full block. Similarly, invalidating + a locked area that is not an integral number of full buffer blocks + would require the client to read one or two partial blocks from the + server if the revalidation procedure shows that the data that the + client possesses may not be valid. + + The data that is written to the server as a prerequisite to the + unlocking of a byte-range must be written, at the server, to stable + storage. The client may accomplish this either with synchronous + writes or by following asynchronous writes with a COMMIT operation. + This is required because retransmission of the modified data after a + server restart might conflict with a lock held by another client. + + A client implementation may choose to accommodate applications that + use byte-range locking in non-standard ways (e.g., using a byte-range + lock as a global semaphore) by flushing to the server more data upon + a LOCKU than is covered by the locked range. This may include + modified data within files other than the one for which the unlocks + are being done. In such cases, the client must not interfere with + applications whose READs and WRITEs are being done only within the + bounds of byte-range locks that the application holds. For example, + an application locks a single byte of a file and proceeds to write + that single byte. A client that chose to handle a LOCKU by flushing + all modified data to the server could validly write that single byte + in response to an unrelated LOCKU operation. 
However, it would not + be valid to write the entire block in which that single written byte + was located since it includes an area that is not locked and might be + locked by another client. Client implementations can avoid this + problem by dividing files with modified data into those for which all + modifications are done to areas covered by an appropriate byte-range + lock and those for which there are modifications not covered by a + byte-range lock. Any writes done for the former class of files must + not include areas not locked and thus not modified on the client. + +10.3.3. Data Caching and Mandatory File Locking + + Client-side data caching needs to respect mandatory byte-range + locking when it is in effect. The presence of mandatory byte-range + locking for a given file is indicated when the client gets back + NFS4ERR_LOCKED from a READ or WRITE operation on a file for which it + has an appropriate share reservation. When mandatory locking is in + effect for a file, the client must check for an appropriate byte- + range lock for data being read or written. If a byte-range lock + exists for the range being read or written, the client may satisfy + the request using the client's validated cache. If an appropriate + byte-range lock is not held for the range of the read or write, the + read or write request must not be satisfied by the client's cache and + the request must be sent to the server for processing. When a read + or write request partially overlaps a locked byte-range, the request + should be subdivided into multiple pieces with each byte-range + (locked or not) treated appropriately. + +10.3.4. Data Caching and File Identity + + When clients cache data, the file data needs to be organized + according to the file system object to which the data belongs. For + NFSv3 clients, the typical practice has been to assume for the + purpose of caching that distinct filehandles represent distinct file + system objects. The client then has the choice to organize and + maintain the data cache on this basis. + + In the NFSv4.1 protocol, there is now the possibility to have + significant deviations from a "one filehandle per object" model + because a filehandle may be constructed on the basis of the object's + pathname. Therefore, clients need a reliable method to determine if + two filehandles designate the same file system object. If clients + were simply to assume that all distinct filehandles denote distinct + objects and proceed to do data caching on this basis, caching + inconsistencies would arise between the distinct client-side objects + that mapped to the same server-side object. + + By providing a method to differentiate filehandles, the NFSv4.1 + protocol alleviates a potential functional regression in comparison + with the NFSv3 protocol. Without this method, caching + inconsistencies within the same client could occur, and this has not + been present in previous versions of the NFS protocol. Note that it + is possible to have such inconsistencies with applications executing + on multiple clients, but that is not the issue being addressed here. + + For the purposes of data caching, the following steps allow an + NFSv4.1 client to determine whether two distinct filehandles denote + the same server-side object: + + * If GETATTR directed to two filehandles returns different values of + the fsid attribute, then the filehandles represent distinct + objects. 
+ + * If GETATTR for any file with an fsid that matches the fsid of the + two filehandles in question returns a unique_handles attribute + with a value of TRUE, then the two objects are distinct. + + * If GETATTR directed to the two filehandles does not return the + fileid attribute for both of the handles, then it cannot be + determined whether the two objects are the same. Therefore, + operations that depend on that knowledge (e.g., client-side data + caching) cannot be done reliably. Note that if GETATTR does not + return the fileid attribute for both filehandles, it will return + it for neither of the filehandles, since the fsid for both + filehandles is the same. + + * If GETATTR directed to the two filehandles returns different + values for the fileid attribute, then they are distinct objects. + + * Otherwise, they are the same object. + +10.4. Open Delegation + + When a file is being OPENed, the server may delegate further handling + of opens and closes for that file to the opening client. Any such + delegation is recallable since the circumstances that allowed for the + delegation are subject to change. In particular, if the server + receives a conflicting OPEN from another client, the server must + recall the delegation before deciding whether the OPEN from the other + client may be granted. Making a delegation is up to the server, and + clients should not assume that any particular OPEN either will or + will not result in an OPEN delegation. The following is a typical + set of conditions that servers might use in deciding whether an OPEN + should be delegated: + + * The client must be able to respond to the server's callback + requests. If a backchannel has been established, the server will + send a CB_COMPOUND request, containing a single operation, + CB_SEQUENCE, for a test of backchannel availability. + + * The client must have responded properly to previous recalls. + + * There must be no current OPEN conflicting with the requested + delegation. + + * There should be no current delegation that conflicts with the + delegation being requested. + + * The probability of future conflicting open requests should be low + based on the recent history of the file. + + * The existence of any server-specific semantics of OPEN/CLOSE that + would make the required handling incompatible with the prescribed + handling that the delegated client would apply (see below). + + There are two types of OPEN delegations: OPEN_DELEGATE_READ and + OPEN_DELEGATE_WRITE. An OPEN_DELEGATE_READ delegation allows a + client to handle, on its own, requests to open a file for reading + that do not deny OPEN4_SHARE_ACCESS_READ access to others. Multiple + OPEN_DELEGATE_READ delegations may be outstanding simultaneously and + do not conflict. An OPEN_DELEGATE_WRITE delegation allows the client + to handle, on its own, all opens. Only one OPEN_DELEGATE_WRITE + delegation may exist for a given file at a given time, and it is + inconsistent with any OPEN_DELEGATE_READ delegations. + + When a client has an OPEN_DELEGATE_READ delegation, it is assured + that neither the contents, the attributes (with the exception of + time_access), nor the names of any links to the file will change + without its knowledge, so long as the delegation is held. When a + client has an OPEN_DELEGATE_WRITE delegation, it may modify the file + data locally since no other client will be accessing the file's data. 
The client holding an OPEN_DELEGATE_WRITE delegation may only locally
affect file attributes that are intimately connected with the file
data: size, change, time_access, time_metadata, and time_modify.  All
other attributes must be reflected on the server.

When a client has an OPEN delegation, it does not need to send OPENs
or CLOSEs to the server.  Instead, the client may update the
appropriate status internally.  For an OPEN_DELEGATE_READ delegation,
opens that cannot be handled locally (opens that are for
OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH or that deny
OPEN4_SHARE_ACCESS_READ access) must be sent to the server.

When an OPEN delegation is made, the reply to the OPEN contains an
OPEN delegation structure that specifies the following:

* the type of delegation (OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE)

* space limitation information to control flushing of data on close
  (OPEN_DELEGATE_WRITE delegation only; see Section 10.4.1)

* an nfsace4 specifying read and write permissions

* a stateid to represent the delegation

The delegation stateid is separate and distinct from the stateid for
the OPEN proper.  The standard stateid, unlike the delegation stateid,
is associated with a particular lock-owner and will continue to be
valid after the delegation is recalled while the file remains open.

When a request internal to the client is made to open a file and an
OPEN delegation is in effect, it will be accepted or rejected solely
on the basis of the following conditions.  Any requirement for other
checks to be made by the delegate should result in the OPEN delegation
being denied so that the checks can be made by the server itself.

* The access and deny bits for the request and the file as described
  in Section 9.7.

* The read and write permissions as determined below.

The nfsace4 passed with the delegation can be used to avoid frequent
ACCESS calls.  The permission check should be as follows:

* If the nfsace4 indicates that the open may be done, then it should
  be granted without reference to the server.

* If the nfsace4 indicates that the open may not be done, then an
  ACCESS request must be sent to the server to obtain the definitive
  answer.

The server may return an nfsace4 that is more restrictive than the
actual ACL of the file.  This includes an nfsace4 that specifies
denial of all access.  Note that some common practices, such as
mapping the traditional user "root" to the user "nobody" (see
Section 5.9), may make it incorrect to return the actual ACL of the
file in the delegation response.

The use of a delegation together with various other forms of caching
creates the possibility that no server authentication and
authorization will ever be performed for a given user since all of the
user's requests might be satisfied locally.  Where the client is
depending on the server for authentication and authorization, the
client should be sure authentication and authorization occurs for each
user by use of the ACCESS operation.  This should be the case even if
an ACCESS operation would not be required otherwise.  As mentioned
before, the server may enforce frequent authentication by returning an
nfsace4 denying all access with every OPEN delegation.
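The following non-normative C sketch summarizes this local open
processing.  The helper functions (share_conflict, ace_permits, and so
on) are illustrative assumptions standing for client-internal checks,
not protocol elements.

    #include <stdbool.h>

    enum open_disp { OPEN_LOCAL, OPEN_SEND_TO_SERVER, OPEN_DENY };

    struct open_req { unsigned access; unsigned deny; };

    /* Assumed client-provided helpers. */
    extern bool share_conflict(const struct open_req *r); /* Section 9.7 */
    extern bool ace_permits(const struct open_req *r);    /* delegation's nfsace4 */
    extern bool is_write_deleg(void);
    extern bool wants_write_or_denies_read(const struct open_req *r);

    static enum open_disp local_open(const struct open_req *r)
    {
        /* Access/deny conflicts are checked locally in all cases. */
        if (share_conflict(r))
            return OPEN_DENY;

        /* A read delegation cannot cover writes or denial of read. */
        if (!is_write_deleg() && wants_write_or_denies_read(r))
            return OPEN_SEND_TO_SERVER;

        /* The nfsace4 may only be used to grant; on denial, the
         * definitive answer must come from the server via ACCESS. */
        return ace_permits(r) ? OPEN_LOCAL : OPEN_SEND_TO_SERVER;
    }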
10.4.1.  Open Delegation and Data Caching

An OPEN delegation allows much of the message overhead associated with
the opening and closing of files to be eliminated.  An open when an
OPEN delegation is in effect does not require that a validation
message be sent to the server.  The continued endurance of the
OPEN_DELEGATE_READ delegation provides a guarantee that no OPEN for
OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, and thus no write,
has occurred.  Similarly, when closing a file opened for
OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH and if an
OPEN_DELEGATE_WRITE delegation is in effect, the data written does not
have to be written to the server until the OPEN delegation is
recalled.  The continued endurance of the OPEN delegation provides a
guarantee that no open, and thus no READ or WRITE, has been done by
another client.

For the purposes of OPEN delegation, READs and WRITEs done without an
OPEN are treated as the functional equivalents of a corresponding type
of OPEN.  Although a client SHOULD NOT use special stateids when an
open exists, delegation handling on the server can use the client ID
associated with the current session to determine if the operation has
been done by the holder of the delegation (in which case, no recall is
necessary) or by another client (in which case, the delegation must be
recalled and I/O not proceed until the delegation is returned or
revoked).

With delegations, a client is able to avoid writing data to the server
when the CLOSE of a file is serviced.  The file close system call is
the usual point at which the client is notified of a lack of stable
storage for the modified file data generated by the application.  At
the close, file data is written to the server and, through normal
accounting, the server is able to determine if the available file
system space for the data has been exceeded (i.e., the server returns
NFS4ERR_NOSPC or NFS4ERR_DQUOT).  This accounting includes quotas.
The introduction of delegations requires that an alternative method be
in place for the same type of communication to occur between client
and server.

In the delegation response, the server provides either the limit of
the size of the file or the number of modified blocks and associated
block size.  The server must ensure that the client will be able to
write modified data to the server of a size equal to that provided in
the original delegation.  The server must make this assurance for all
outstanding delegations.  Therefore, the server must be careful in its
management of available space for new or modified data, taking into
account available file system space and any applicable quotas.  The
server can recall delegations as a result of managing the available
file system space.  The client should abide by the server's stated
space limits for delegations.  If the client exceeds the stated limits
for the delegation, the server's behavior is undefined.

Based on server conditions, quotas, or available file system space,
the server may grant OPEN_DELEGATE_WRITE delegations with very
restrictive space limitations.  The limitations may be defined in a
way that will always force modified data to be flushed to the server
on close.

With respect to authentication, flushing modified data to the server
after a CLOSE has occurred may be problematic.  For example, the user
of the application may have logged off the client, and unexpired
authentication credentials may not be present.  In this case, the
client may need to take special care to ensure that local unexpired
credentials will in fact be available.
This may be accomplished by tracking the expiration time of
credentials and flushing data well in advance of their expiration or
by making private copies of credentials to assure their availability
when needed.

10.4.2.  Open Delegation and File Locks

When a client holds an OPEN_DELEGATE_WRITE delegation, lock operations
are performed locally.  This includes those required for mandatory
byte-range locking.  This can be done since the delegation implies
that there can be no conflicting locks.  Similarly, all of the
revalidations that would normally be associated with obtaining locks
and the flushing of data associated with the releasing of locks need
not be done.

When a client holds an OPEN_DELEGATE_READ delegation, lock operations
are not performed locally.  All lock operations, including those
requesting non-exclusive locks, are sent to the server for resolution.

10.4.3.  Handling of CB_GETATTR

The server needs to employ special handling for a GETATTR where the
target is a file that has an OPEN_DELEGATE_WRITE delegation in effect.
The reason for this is that the client holding the OPEN_DELEGATE_WRITE
delegation may have modified the data, and the server needs to reflect
this change to the second client that submitted the GETATTR.
Therefore, the client holding the OPEN_DELEGATE_WRITE delegation needs
to be interrogated.  The server will use the CB_GETATTR operation.
The only attributes that the server can reliably query via CB_GETATTR
are size and change.

Since CB_GETATTR is being used to satisfy another client's GETATTR
request, the server only needs to know if the client holding the
delegation has a modified version of the file.  If the client's copy
of the delegated file is not modified (data or size), the server can
satisfy the second client's GETATTR request from the attributes stored
locally at the server.  If the file is modified, the server only needs
to know about this modified state.  If the server determines that the
file is currently modified, it will respond to the second client's
GETATTR as if the file had been modified locally at the server.

Since the form of the change attribute is determined by the server and
is opaque to the client, the client and server need to agree on a
method of communicating the modified state of the file.  For the size
attribute, the client will report its current view of the file size.
For the change attribute, the handling is more involved.

For the client, the following steps will be taken when receiving an
OPEN_DELEGATE_WRITE delegation:

* The value of the change attribute will be obtained from the server
  and cached.  Let this value be represented by c.

* The client will create a value greater than c that will be used for
  communicating that modified data is held at the client.  Let this
  value be represented by d.

* When the client is queried via CB_GETATTR for the change attribute,
  it checks to see if it holds modified data.  If the file is
  modified, the value d is returned for the change attribute value.
  If this file is not currently modified, the client returns the value
  c for the change attribute.

For simplicity of implementation, the client MAY for each CB_GETATTR
return the same value d.  This is true even if, between successive
CB_GETATTR operations, the client again modifies the file's data or
metadata in its cache.
The client can return the same value because the only requirement is
that the client be able to indicate to the server that the client
holds modified data.  Therefore, the value of d may always be c + 1.

While the change attribute is opaque to the client in the sense that
it has no idea what units of time, if any, the server is counting
change with, it is not opaque in that the client has to treat it as an
unsigned integer, and the server has to be able to see the results of
the client's changes to that integer.  Therefore, the server MUST
encode the change attribute in network order when sending it to the
client.  The client MUST decode it from network order to its native
order when receiving it, and the client MUST encode it in network
order when sending it to the server.  For this reason, change is
defined as an unsigned integer rather than an opaque array of bytes.

For the server, the following steps will be taken when providing an
OPEN_DELEGATE_WRITE delegation:

* Upon providing an OPEN_DELEGATE_WRITE delegation, the server will
  cache a copy of the change attribute in the data structure it uses
  to record the delegation.  Let this value be represented by sc.

* When a second client sends a GETATTR operation on the same file to
  the server, the server obtains the change attribute from the first
  client.  Let this value be cc.

* If the value cc is equal to sc, the file is not modified and the
  server returns the current values for change, time_metadata, and
  time_modify (for example) to the second client.

* If the value cc is NOT equal to sc, the file is currently modified
  at the first client and most likely will be modified at the server
  at a future time.  The server then uses its current time to
  construct attribute values for time_metadata and time_modify.  A new
  value of sc, which we will call nsc, is computed by the server, such
  that nsc >= sc + 1.  The server then returns the constructed
  time_metadata, time_modify, and nsc values to the requester.  The
  server replaces sc in the delegation record with nsc.  To prevent
  the possibility of time_modify, time_metadata, and change from
  appearing to go backward (which would happen if the client holding
  the delegation fails to write its modified data to the server before
  the delegation is revoked or returned), the server SHOULD update the
  file's metadata record with the constructed attribute values.  For
  reasons of reasonable performance, committing the constructed
  attribute values to stable storage is OPTIONAL.

As discussed earlier in this section, the client MAY return the same
cc value on subsequent CB_GETATTR calls, even if the file was modified
in the client's cache yet again between successive CB_GETATTR calls.
Therefore, the server must assume that the file has been modified yet
again, and MUST take care to ensure that the new nsc it constructs and
returns is greater than the previous nsc it returned.  An example
implementation's delegation record would satisfy this mandate by
including a boolean field (let us call it "modified") that is set to
FALSE when the delegation is granted, and an sc value set at the time
of grant to the change attribute value.  The modified field would be
set to TRUE the first time cc != sc, and would stay TRUE until the
delegation is returned or revoked.
The processing for constructing nsc, time_modify, and time_metadata
would use this pseudo code:

    if (!modified) {
        do CB_GETATTR for change and size;

        if (cc != sc)
            modified = TRUE;
    } else {
        do CB_GETATTR for size;
    }

    if (modified) {
        sc = sc + 1;
        time_modify = time_metadata = current_time;
        update sc, time_modify, time_metadata into file's metadata;
    }

This would return to the client (that sent GETATTR) the attributes it
requested, except that the size comes from what CB_GETATTR returned.
The server would not update the file's metadata with the client's
modified size.

In the case where the file attribute size is different from the
server's current value, the server treats this as a modification
regardless of the value of the change attribute retrieved via
CB_GETATTR and responds to the second client as in the last step.

This methodology resolves issues of clock differences between client
and server and other scenarios where the use of CB_GETATTR breaks
down.

It should be noted that the server is under no obligation to use
CB_GETATTR, and therefore the server MAY simply recall the delegation
to avoid its use.
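The client side of this exchange is simple by comparison.  The
following non-normative C sketch shows a client's CB_GETATTR
change-attribute response using d = c + 1, as described above; the
wdeleg structure is an illustrative stand-in for the client's
delegation record, not a protocol structure.

    #include <stdbool.h>
    #include <stdint.h>

    struct wdeleg {
        uint64_t c;        /* change attribute cached at grant time */
        bool     dirty;    /* modified data or metadata held locally? */
    };

    static uint64_t cb_getattr_change(const struct wdeleg *w)
    {
        /* d is chosen here as c + 1; any value greater than c works,
         * and the same d may be returned on every CB_GETATTR. */
        return w->dirty ? w->c + 1 : w->c;
    }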
10.4.4.  Recall of Open Delegation

The following events necessitate recall of an OPEN delegation:

* a potentially conflicting OPEN request (or a READ or WRITE operation
  done with a special stateid)

* a SETATTR sent by another client

* a REMOVE request for the file

* a RENAME request for the file as either the source or target of the
  RENAME

Whether a RENAME of a directory in the path leading to the file
results in recall of an OPEN delegation depends on the semantics of
the server's file system.  If that file system denies such RENAMEs
when a file is open, the recall must be performed to determine whether
the file in question is, in fact, open.

In addition to the situations above, the server may choose to recall
OPEN delegations at any time if resource constraints make it advisable
to do so.  Clients should always be prepared for the possibility of
recall.

When a client receives a recall for an OPEN delegation, it needs to
update state on the server before returning the delegation.  These
same updates must be done whenever a client chooses to return a
delegation voluntarily.  The following items of state need to be dealt
with:

* If the file associated with the delegation is no longer open and no
  previous CLOSE operation has been sent to the server, a CLOSE
  operation must be sent to the server.

* If a file has other open references at the client, then OPEN
  operations must be sent to the server.  The appropriate stateids
  will be provided by the server for subsequent use by the client
  since the delegation stateid will no longer be valid.  These OPEN
  requests are done with the claim type of CLAIM_DELEGATE_CUR.  This
  will allow the presentation of the delegation stateid so that the
  client can establish the appropriate rights to perform the OPEN.
  (See Section 18.16, which describes the OPEN operation, for
  details.)

* If there are granted byte-range locks, the corresponding LOCK
  operations need to be performed.  This applies to the
  OPEN_DELEGATE_WRITE delegation case only.

* For an OPEN_DELEGATE_WRITE delegation, if at the time of recall the
  file is not open for OPEN4_SHARE_ACCESS_WRITE/
  OPEN4_SHARE_ACCESS_BOTH, all modified data for the file must be
  flushed to the server.  If the delegation had not existed, the
  client would have done this data flush before the CLOSE operation.

* For an OPEN_DELEGATE_WRITE delegation when a file is still open at
  the time of recall, any modified data for the file needs to be
  flushed to the server.

* With the OPEN_DELEGATE_WRITE delegation in place, it is possible
  that the file was truncated during the duration of the delegation.
  For example, the truncation could have occurred as a result of an
  OPEN UNCHECKED with a size attribute value of zero.  Therefore, if a
  truncation of the file has occurred and this operation has not been
  propagated to the server, the truncation must occur before any
  modified data is written to the server.

In the case of an OPEN_DELEGATE_WRITE delegation, byte-range locking
imposes some additional requirements.  To precisely maintain the
associated invariant, it is required to flush any modified data in any
byte-range for which a WRITE_LT lock was released while the
OPEN_DELEGATE_WRITE delegation was in effect.  However, because the
OPEN_DELEGATE_WRITE delegation implies no other locking by other
clients, a simpler implementation is to flush all modified data for
the file (as described just above) if any WRITE_LT lock has been
released while the OPEN_DELEGATE_WRITE delegation was in effect.

An implementation need not wait until delegation recall (or the
decision to voluntarily return a delegation) to perform any of the
above actions, if implementation considerations (e.g., resource
availability constraints) make that desirable.  Generally, however,
the fact that the actual OPEN state of the file may continue to change
makes it not worthwhile to send information about opens and closes to
the server, except as part of delegation return.  An exception is when
the client has no more internal opens of the file.  In this case,
sending a CLOSE is useful because it reduces resource utilization on
the client and server.  Regardless of the client's choices on
scheduling these actions, all must be performed before the delegation
is returned, including (when applicable) the close that corresponds to
the OPEN that resulted in the delegation.  These actions can be
performed either in previous requests or in previous operations in the
same COMPOUND request.
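The ordering of these state updates might be summarized by the
following non-normative C sketch for the OPEN_DELEGATE_WRITE case.
Each helper stands for one or more COMPOUND requests that the client
is assumed to construct; the names and the omission of error handling
are illustrative only.

    /* Assumed client-provided helpers. */
    extern int  open_count(void);                     /* internal opens */
    extern void reopen_with_claim_delegate_cur(void); /* OPEN, Section 18.16 */
    extern void establish_byte_range_locks(void);     /* LOCK for held locks */
    extern void propagate_truncation(void);           /* SETATTR of size, if any */
    extern void flush_modified_data(void);            /* WRITE and COMMIT */
    extern void close_file(void);                     /* CLOSE */
    extern void delegreturn(void);                    /* DELEGRETURN */

    static void return_write_delegation(void)
    {
        if (open_count() > 0) {
            reopen_with_claim_delegate_cur(); /* obtain valid open stateids */
            establish_byte_range_locks();     /* granted byte-range locks */
        }
        propagate_truncation();   /* truncation precedes modified data */
        flush_modified_data();
        if (open_count() == 0)
            close_file();         /* if no CLOSE was sent previously */
        delegreturn();
    }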
10.4.5.  Clients That Fail to Honor Delegation Recalls

A client may fail to respond to a recall for various reasons, such as
a failure of the backchannel from the server to the client.  The
client may be unaware of a failure in the backchannel.  This lack of
awareness could result in the client finding out long after the
failure that its delegation has been revoked and that another client
has modified the data for which the client had a delegation.  This is
especially a problem for the client that held an OPEN_DELEGATE_WRITE
delegation.

Status bits returned by SEQUENCE operations help to provide an
alternate way of informing the client of issues regarding the status
of the backchannel and of recalled delegations.  When the backchannel
is not available, the server returns the status bit
SEQ4_STATUS_CB_PATH_DOWN on SEQUENCE operations.  The client can react
by attempting to re-establish the backchannel and by returning
recallable objects if a backchannel cannot be successfully
re-established.

Whether the backchannel is functioning or not, it may be that the
recalled delegation is not returned.  Note that the client's lease
might still be renewed, even though the recalled delegation is not
returned.  In this situation, servers SHOULD revoke delegations that
are not returned in a period of time equal to the lease period.  This
period of time should allow the client time to note the
backchannel-down status and re-establish the backchannel.

When delegations are revoked, the server will return with the
SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent
SEQUENCE operations.  The client should note this and then use
TEST_STATEID to find which delegations have been revoked.

10.4.6.  Delegation Revocation

At the point a delegation is revoked, if there are associated opens on
the client, these opens may or may not be revoked.  If no byte-range
lock or open is granted that is inconsistent with the existing open,
the stateid for the open may remain valid and be disconnected from the
revoked delegation, just as would be the case if the delegation were
returned.

For example, if an OPEN for OPEN4_SHARE_ACCESS_BOTH with a deny of
OPEN4_SHARE_DENY_NONE is associated with the delegation, granting of
another such OPEN to a different client will revoke the delegation but
need not revoke the OPEN, since the two OPENs are consistent with each
other.  On the other hand, if an OPEN denying write access is granted,
then the existing OPEN must be revoked.

When opens and/or locks are revoked, the applications holding these
opens or locks need to be notified.  This notification usually occurs
by returning errors for READ/WRITE operations or when a close is
attempted for the open file.

If no opens exist for the file at the point the delegation is revoked,
then notification of the revocation is unnecessary.  However, if there
is modified data present at the client for the file, the user of the
application should be notified.  Unfortunately, it may not be possible
to notify the user since active applications may not be present at the
client.  See Section 10.5.1 for additional details.

10.4.7.  Delegations via WANT_DELEGATION

In addition to providing delegations as part of the reply to OPEN
operations, servers MAY provide delegations separately from OPEN, via
the OPTIONAL WANT_DELEGATION operation.  This allows delegations to be
obtained in advance of an OPEN that might benefit from them, for
objects that are not a valid target of OPEN, or to deal with cases in
which a delegation has been recalled and the client wants to make an
attempt to re-establish it if the absence of use by other clients
allows that.

The WANT_DELEGATION operation may be performed on any type of file
object other than a directory.

When a delegation is obtained using WANT_DELEGATION, any open files
for the same filehandle held by that client are to be treated as
subordinate to the delegation, just as if they had been created using
an OPEN of type CLAIM_DELEGATE_CUR.  They are otherwise unchanged as
to seqid, access and deny modes, and the relationship with byte-range
locks.  Similarly, because existing byte-range locks are subordinate
to an open, those byte-range locks also become indirectly subordinate
to that new delegation.

The WANT_DELEGATION operation provides for delivery of delegations via
callbacks, when the delegations are not immediately available.  When a
requested delegation is available, it is delivered to the client via a
CB_PUSH_DELEG operation.
When this happens, open files for the same filehandle become
subordinate to the new delegation at the point at which the delegation
is delivered, just as if they had been created using an OPEN of type
CLAIM_DELEGATE_CUR.  Similarly, this occurs for existing byte-range
locks subordinate to an open.

10.5.  Data Caching and Revocation

When locks and delegations are revoked, the assumptions upon which
successful caching depends are no longer guaranteed.  For any locks or
share reservations that have been revoked, the corresponding
state-owner needs to be notified.  This notification includes
applications with a file open that has a corresponding delegation that
has been revoked.  Cached data associated with the revocation must be
removed from the client.  In the case of modified data existing in the
client's cache, that data must be removed from the client without
being written to the server.  As mentioned, the assumptions made by
the client are no longer valid at the point when a lock or delegation
has been revoked.  For example, another client may have been granted a
conflicting byte-range lock after the revocation of the byte-range
lock at the first client.  Therefore, the data within the lock range
may have been modified by the other client.  Obviously, the first
client is unable to guarantee to the application what has occurred to
the file in the case of revocation.

Notification to a state-owner will in many cases consist of simply
returning an error on the next and all subsequent READs/WRITEs to the
open file or on the close.  Where the methods available to a client
make such notification impossible because errors for certain
operations may not be returned, more drastic action such as signals or
process termination may be appropriate.  The justification here is
that an invariant on which an application depends may be violated.
Depending on how errors are typically treated for the client-operating
environment, further levels of notification including logging, console
messages, and GUI pop-ups may be appropriate.

10.5.1.  Revocation Recovery for Write Open Delegation

Revocation recovery for an OPEN_DELEGATE_WRITE delegation poses the
special issue of modified data in the client cache while the file is
not open.  In this situation, any client that does not flush modified
data to the server on each close must ensure that the user receives
appropriate notification of the failure as a result of the revocation.
Since such situations may require human action to correct problems,
notification schemes in which the appropriate user or administrator is
notified may be necessary.  Logging and console messages are typical
examples.

If there is modified data on the client, it must not be flushed
normally to the server.  A client may attempt to provide a copy of the
file data as modified during the delegation under a different name in
the file system namespace to ease recovery.  Note that when the client
can determine that the file has not been modified by any other client,
or when the client has a complete cached copy of the file in question,
such a saved copy of the client's view of the file may be of
particular value for recovery.
In another case, recovery using a copy of the file based partially on
the client's cached data and partially on the server's copy as
modified by other clients will be anything but straightforward, so
clients may avoid saving file contents in these situations or
specially mark the results to warn users of possible problems.

Saving of such modified data in delegation revocation situations may
be limited to files of a certain size or might be used only when
sufficient disk space is available within the target file system.
Such saving may also be restricted to situations when the client has
sufficient buffering resources to keep the cached copy available until
it is properly stored to the target file system.

10.6.  Attribute Caching

This section pertains to the caching of a file's attributes on a
client when that client does not hold a delegation on the file.

The attributes discussed in this section do not include named
attributes.  Individual named attributes are analogous to files, and
caching of the data for these needs to be handled just as data caching
is for ordinary files.  Similarly, LOOKUP results from an OPENATTR
directory (as well as the directory's contents) are to be cached on
the same basis as any other pathnames.

Clients may cache file attributes obtained from the server and use
them to avoid subsequent GETATTR requests.  Such caching is
write-through in that modification to file attributes is always done
by means of requests to the server and is neither done locally nor
cached.  The exceptions to this are modifications to attributes that
are intimately connected with data caching.  Therefore, extending a
file by writing data to the local data cache is reflected immediately
in the size as seen on the client without this change being
immediately reflected on the server.  Normally, such changes are not
propagated directly to the server, but when the modified data is
flushed to the server, analogous attribute changes are made on the
server.  When OPEN delegation is in effect, the modified attributes
may be returned to the server in reaction to a CB_RECALL call.

The result of local caching of attributes is that the attribute caches
maintained on individual clients will not be coherent.  Changes made
in one order on the server may be seen in a different order on one
client and in a third order on another client.

The typical file system application programming interfaces do not
provide means to atomically modify or interrogate attributes for
multiple files at the same time.  The following rules provide an
environment where the potential incoherencies mentioned above can be
reasonably managed.  These rules are derived from the practice of
previous NFS protocols.

* All attributes for a given file (per-fsid attributes excepted) are
  cached as a unit at the client so that no non-serializability can
  arise within the context of a single file.

* An upper time boundary is maintained on how long a client cache
  entry can be kept without being refreshed from the server.

* When operations are performed that change attributes at the server,
  the updated attribute set is requested as part of the containing
  RPC.  This includes directory operations that update attributes
  indirectly.  This is accomplished by following the modifying
  operation with a GETATTR operation and then using the results of the
  GETATTR to update the client's cached attributes, as illustrated by
  the sketch below.
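The following non-normative C sketch illustrates the third rule above;
the types, the helper representing the combined modify-plus-GETATTR
COMPOUND, and the choice of timeout are all illustrative assumptions.

    #include <time.h>

    struct fattr_cache {
        /* all per-file attributes, cached and replaced as a unit */
        time_t expiry;     /* upper time boundary for this entry */
    };

    #define ATTR_TIMEOUT 30   /* seconds; an implementation choice */

    /* Assumed helper: one COMPOUND containing the modifying operation
     * (e.g., SETATTR) followed by GETATTR, whose results land in out. */
    extern void compound_modify_then_getattr(struct fattr_cache *out);

    static void modify_and_recache(struct fattr_cache *fc)
    {
        compound_modify_then_getattr(fc);      /* refresh as a unit */
        fc->expiry = time(NULL) + ATTR_TIMEOUT; /* restart staleness clock */
    }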
Note that if the full set of attributes to be cached is requested by
READDIR, the results can be cached by the client on the same basis as
attributes obtained via GETATTR.

A client may validate its cached version of attributes for a file by
fetching both the change and time_access attributes and assuming that
if the change attribute has the same value as it did when the
attributes were cached, then no attributes other than time_access have
changed.  The reason why time_access is also fetched is that many
servers operate in environments where the operation that updates
change does not update time_access.  For example, POSIX file semantics
do not update access time when a file is modified by the write system
call [15].  Therefore, the client that wants a current time_access
value should fetch it with change during the attribute cache
validation processing and update its cached time_access.

The client may maintain a cache of modified attributes for those
attributes intimately connected with data of modified regular files
(size, time_modify, and change).  Other than those three attributes,
the client MUST NOT maintain a cache of modified attributes.  Instead,
attribute changes are immediately sent to the server.

In some operating environments, the equivalent to time_access is
expected to be implicitly updated by each read of the content of the
file object.  If an NFS client is caching the content of a file
object, whether it is a regular file, directory, or symbolic link, the
client SHOULD NOT update the time_access attribute (via SETATTR or a
small READ or READDIR request) on the server with each read that is
satisfied from cache.  The reason is that this can defeat the
performance benefits of caching content, especially since an explicit
SETATTR of time_access may alter the change attribute on the server.
If the change attribute changes, clients that are caching the content
will think the content has changed, and will re-read unmodified data
from the server.  Nor is the client encouraged to maintain a modified
version of time_access in its cache, since the client would either
eventually have to write the access time to the server with bad
performance effects or never update the server's time_access, thereby
resulting in a situation where an application that caches access time
between a close and open of the same file observes the access time
oscillating between the past and present.  The time_access attribute
always means the time of last access to a file by a read that was
satisfied by the server.  This way, clients will tend to see only
time_access changes that go forward in time.
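A non-normative C sketch of the validation procedure described above
follows; the attribute-fetching helper and the cache structure are
illustrative assumptions.

    #include <stdint.h>
    #include <time.h>

    struct cached_attrs {
        uint64_t change;
        time_t   time_access;
        /* ... remaining attributes, cached as a unit ... */
    };

    /* Assumed helper: GETATTR requesting change and time_access. */
    extern void getattr_change_access(uint64_t *change, time_t *atime);
    extern void refetch_attributes(struct cached_attrs *ca);

    static void revalidate(struct cached_attrs *ca)
    {
        uint64_t change;
        time_t   atime;

        getattr_change_access(&change, &atime);
        if (change == ca->change) {
            /* Nothing but time_access can have changed. */
            ca->time_access = atime;
        } else {
            refetch_attributes(ca);  /* replace the cache as a unit */
        }
    }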
10.7.  Data and Metadata Caching and Memory Mapped Files

Some operating environments include the capability for an application
to map a file's content into the application's address space.  Each
time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the
application's address space).

As long as each memory-mapped access to the file requires a page
fault, the relevant attributes of the file that are used to detect
access and modification (time_access, time_metadata, time_modify, and
change) will be updated.  However, in many operating environments,
when page faults are not required, these attributes will not be
updated on reads or updates to the file via memory access (regardless
of whether the file is local or is accessed remotely).  A client or
server MAY fail to update attributes of a file that is being accessed
via memory-mapped I/O.  This has several implications:

* If there is an application on the server that has memory mapped a
  file that a client is also accessing, the client may not be able to
  get a consistent value of the change attribute to determine whether
  or not its cache is stale.  A server that knows that the file is
  memory-mapped could always pessimistically return updated values for
  change so as to force the application to always get the most
  up-to-date data and metadata for the file.  However, due to the
  negative performance implications of this, such behavior is
  OPTIONAL.

* If the memory-mapped file is not being modified on the server, and
  instead is just being read by an application via the memory-mapped
  interface, the client will not see an updated time_access attribute.
  However, in many operating environments, neither will any process
  running on the server.  Thus, NFS clients are at no disadvantage
  with respect to local processes.

* If there is another client that is memory mapping the file, and if
  that client is holding an OPEN_DELEGATE_WRITE delegation, the same
  set of issues as discussed in the previous two bullet points
  applies.  So, when a server does a CB_GETATTR to a file that the
  client has modified in its cache, the reply from CB_GETATTR will not
  necessarily be accurate.  As discussed earlier, the client's
  obligation is to report that the file has been modified since the
  delegation was granted, not whether it has been modified again
  between successive CB_GETATTR calls, and the server MUST assume that
  any file the client has modified in cache has been modified again
  between successive CB_GETATTR calls.  Depending on the nature of the
  client's memory management system, this weak obligation may not be
  possible.  A client MAY return stale information in CB_GETATTR
  whenever the file is memory-mapped.

* The mixture of memory mapping and byte-range locking on the same
  file is problematic.  Consider the following scenario, where the
  page size on each client is 8192 bytes.

  - Client A memory maps the first page (8192 bytes) of file X.

  - Client B memory maps the first page (8192 bytes) of file X.

  - Client A WRITE_LT locks the first 4096 bytes.

  - Client B WRITE_LT locks the second 4096 bytes.

  - Client A, via a STORE instruction, modifies part of its locked
    byte-range.

  - Simultaneous to client A, client B executes a STORE on part of its
    locked byte-range.

Here the challenge is for each client to resynchronize to get a
correct view of the first page.  In many operating environments, the
virtual memory management systems on each client only know a page is
modified, not that a subset of the page corresponding to the
respective lock byte-ranges has been modified.  So it is not possible
for each client to do the right thing, which is to write to the server
only that portion of the page that is locked.  For example, if client
A simply writes out the page, and then client B writes out the page,
client A's data is lost.

Moreover, if mandatory locking is enabled on the file, then we have a
different problem.
When clients A and B execute the STORE instructions, the resulting
page faults require a byte-range lock on the entire page.  Each client
then tries to extend its locked range to the entire page, which
results in a deadlock.  Communicating the NFS4ERR_DEADLOCK error to a
STORE instruction is difficult at best.

If a client is locking the entire memory-mapped file, there is no
problem with advisory or mandatory byte-range locking, at least until
the client unlocks a byte-range in the middle of the file.

Given the above issues, the following are permitted:

* Clients and servers MAY deny memory mapping a file for which they
  know there are byte-range locks.

* Clients and servers MAY deny a byte-range lock on a file they know
  is memory-mapped.

* A client MAY deny memory mapping a file that it knows requires
  mandatory locking for I/O.  If mandatory locking is enabled after
  the file is opened and mapped, the client MAY deny the application
  further access to its mapped file.

10.8.  Name and Directory Caching without Directory Delegations

The NFSv4.1 directory delegation facility (described in Section 10.9
below) is OPTIONAL for servers to implement.  Even where it is
implemented, it may not always be functional because of resource
availability issues or other constraints.  Thus, it is important to
understand how name and directory caching are done in the absence of
directory delegations.  These topics are discussed in the next two
subsections.

10.8.1.  Name Caching

The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations.  Just as in the case of
attribute caching, inconsistencies may arise among the various client
caches.  To mitigate the effects of these inconsistencies and given
the context of typical file system APIs, an upper time boundary is
maintained for how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory
change operation performed by another client.

When a client is not making changes to a directory for which there
exist name cache entries, the client needs to periodically fetch
attributes for that directory to ensure that it is not being modified.
After determining that no modification has occurred, the expiration
time for the associated name cache entries may be updated to be the
current time plus the name cache staleness bound.

When a client is making changes to a given directory, it needs to
determine whether there have been changes made to the directory by
other clients.  It does this by using the change attribute as reported
before and after the directory operation in the associated
change_info4 value returned for the operation.  The server is able to
communicate to the client whether the change_info4 data is provided
atomically with respect to the directory operation.  If the change
values are provided atomically, the client has a basis for
determining, given proper care, whether other clients are modifying
the directory in question.
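The following non-normative C sketch shows one way such a comparison
might be performed, assuming the client serializes its own
modifications to the directory (see below); the structure layout
mirrors change_info4, but the helper logic is an illustrative
assumption.

    #include <stdbool.h>
    #include <stdint.h>

    struct change_info4 {
        bool     atomic;          /* values reported atomically? */
        uint64_t before, after;   /* directory change attribute */
    };

    /* last_after holds the 'after' value from this client's previous
     * operation on the directory.  Returns true if another client may
     * have modified the directory. */
    static bool dir_changed_by_others(const struct change_info4 *ci,
                                      uint64_t *last_after)
    {
        bool others = !ci->atomic || ci->before != *last_after;

        *last_after = ci->after;  /* basis for the next comparison */
        return others;
    }

When this function returns true, the name cache for the directory
would be purged; otherwise, the cache can be updated to reflect the
operation and its timeout extended, as described below.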
The simplest way to enable the client to make this determination is
for the client to serialize all changes made to a specific directory.
When this is done, and the server provides before and after values of
the change attribute atomically, the client can simply compare the
after value of the change attribute from one operation on a directory
with the before value on the subsequent operation modifying that
directory.  When these are equal, the client is assured that no other
client is modifying the directory in question.

When such serialization is not used, and there may be multiple
simultaneous outstanding operations modifying a single directory sent
from a single client, making this sort of determination can be more
complicated.  If two such operations complete in an order different
from that in which they were actually performed, that might give an
appearance consistent with modification being made by another client.
Where this appears to happen, the client needs to await the completion
of all such modifications that were started previously, to see if the
outstanding before and after change numbers can be sorted into a chain
such that the before value of one change number matches the after
value of a previous one, consistent with this client being the only
one modifying the directory.

In either of these cases, the client is able to determine whether the
directory is being modified by another client.  If the comparison
indicates that the directory was updated by another client, the name
cache associated with the modified directory is purged from the
client.  If the comparison indicates no modification, the name cache
can be updated on the client to reflect the directory operation and
the associated timeout can be extended.  The post-operation change
value needs to be saved as the basis for future change_info4
comparisons.

As demonstrated by the scenario above, name caching requires that the
client revalidate name cache data by inspecting the change attribute
of a directory at the point when the name cache item was cached.  This
requires that the server update the change attribute for directories
when the contents of the corresponding directory are modified.  For a
client to use the change_info4 information appropriately and
correctly, the server must report the pre- and post-operation change
attribute values atomically.  When the server is unable to report the
before and after values atomically with respect to the directory
operation, the server must indicate that fact in the change_info4
return value.  When the information is not atomically reported, the
client should not assume that other clients have not changed the
directory.

10.8.2.  Directory Caching

The results of READDIR operations may be used to avoid subsequent
READDIR operations.  Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the
context of typical file system APIs, the following rules should be
followed:

* Cached READDIR information for a directory that is not obtained in a
  single READDIR operation must always be a consistent snapshot of
  directory contents.  This is determined by using a GETATTR before
  the first READDIR and after the last READDIR that contributes to the
  cache, as illustrated in the sketch below.

* An upper time boundary is maintained to indicate the length of time
  a directory cache entry is considered valid before the client must
  revalidate the cached information.
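The first rule above might be implemented as in the following
non-normative C sketch, in which the READDIR and cache helpers are
illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed client-provided helpers. */
    extern uint64_t getattr_change(void);            /* directory change attr */
    extern bool     readdir_chunk(uint64_t *cookie); /* false when exhausted */
    extern void     cache_add_entries(void);
    extern void     cache_discard(void);

    static bool load_directory_cache(void)
    {
        uint64_t cookie = 0;
        uint64_t before = getattr_change();

        while (readdir_chunk(&cookie))
            cache_add_entries();

        if (getattr_change() != before) {
            cache_discard();   /* not a consistent snapshot */
            return false;
        }
        return true;           /* validity timer starts here */
    }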
The revalidation technique parallels that discussed in the case of
name caching.  When the client is not changing the directory in
question, checking the change attribute of the directory with GETATTR
is adequate.  The lifetime of the cache entry can be extended at these
checkpoints.  When a client is modifying the directory, the client
needs to use the change_info4 data to determine whether there are
other clients modifying the directory.  If it is determined that no
other client modifications are occurring, the client may update its
directory cache to reflect its own changes.

As demonstrated previously, directory caching requires that the client
revalidate directory cache data by inspecting the change attribute of
a directory at the point when the directory was cached.  This requires
that the server update the change attribute for directories when the
contents of the corresponding directory are modified.  For a client to
use the change_info4 information appropriately and correctly, the
server must report the pre- and post-operation change attribute values
atomically.  When the server is unable to report the before and after
values atomically with respect to the directory operation, the server
must indicate that fact in the change_info4 return value.  When the
information is not atomically reported, the client should not assume
that other clients have not changed the directory.

10.9.  Directory Delegations

10.9.1.  Introduction to Directory Delegations

Directory caching for the NFSv4.1 protocol, as previously described,
is similar to file caching in previous versions.  Clients typically
cache directory information for a duration determined by the client.
At the end of a predefined timeout, the client will query the server
to see if the directory has been updated.  By caching attributes,
clients reduce the number of GETATTR calls made to the server to
validate attributes.  Furthermore, frequently accessed files and
directories, such as the current working directory, have their
attributes cached on the client so that some NFS operations can be
performed without having to make an RPC call.  By caching name and
inode information about most recently looked up entries in a Directory
Name Lookup Cache (DNLC), clients do not need to send LOOKUP calls to
the server every time these files are accessed.

This caching approach works reasonably well at reducing network
traffic in many environments.  However, it does not address
environments where there are numerous queries for files that do not
exist.  In these cases of "misses", the client sends requests to the
server in order to provide reasonable application semantics and
promptly detect the creation of new directory entries.  An example of
high-miss activity is compilation in software development
environments.  The current behavior of NFS limits its potential
scalability and wide-area sharing effectiveness in these types of
environments.  Other distributed stateful file system architectures
such as AFS and DFS have proven that adding state around directory
contents can greatly reduce network traffic in high-miss environments.

Delegation of directory contents is an OPTIONAL feature of NFSv4.1.
Directory delegations provide traffic reduction benefits similar to
those of file delegations.
By allowing clients to cache directory contents (in a read-only
fashion) while being notified of changes, the client can avoid making
frequent requests to interrogate the contents of slowly-changing
directories, reducing network traffic and improving client
performance.  It can also simplify the task of determining whether
other clients are making changes to the directory when the client
itself is making many changes to the directory and changes are not
serialized.

Directory delegations allow improved namespace cache consistency to be
achieved through delegations and synchronous recalls, in the absence
of notifications.  In addition, if time-based consistency is
sufficient, asynchronous notifications can provide performance
benefits for the client, and possibly the server, under some common
operating conditions such as slowly-changing and/or very large
directories.

10.9.2.  Directory Delegation Design

NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation to
allow the client to ask for a directory delegation.  The delegation
covers directory attributes and all entries in the directory.  If
either of these change, the delegation will be recalled synchronously.
The operation causing the recall will have to wait until the recall is
complete.  Any changes to directory entry attributes will not cause
the delegation to be recalled.

In addition to asking for delegations, a client can also ask for
notifications for certain events.  These events include changes to the
directory's attributes and/or its contents.  If a client asks for
notification for a certain event, the server will notify the client
when that event occurs.  This will not result in the delegation being
recalled for that client.  The notifications are asynchronous and
provide a way of avoiding recalls in situations where a directory is
changing enough that the pure recall model may not be effective while
trying to allow the client to get substantial benefit.  In the absence
of notifications, once the delegation is recalled the client has to
refresh its directory cache; this might not be very efficient for very
large directories.

The delegation is read-only: the client may not make changes to the
directory other than by performing NFSv4.1 operations that modify the
directory or the associated file attributes, so that the server has
knowledge of these changes.  In order to keep the client's namespace
synchronized with that of the server, the server will notify the
delegation-holding client (assuming it has requested notifications) of
the changes made as a result of that client's directory-modifying
operations.  This is to avoid any need for that client to send
subsequent GETATTR or READDIR operations to the server.  If a single
client is holding the delegation and that client makes any changes to
the directory (i.e., the changes are made via operations sent on a
session associated with the client ID holding the delegation), the
delegation will not be recalled.  Multiple clients may hold a
delegation on the same directory, but if any such client modifies the
directory, the server MUST recall the delegation from the other
clients, unless those clients have made provisions to be notified of
that sort of modification.
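A client's use of this machinery might look like the following
non-normative C sketch.  The helper stands for a COMPOUND containing
GET_DIR_DELEGATION (Section 18.39), and the notification bit values
below are placeholders, not the protocol's XDR values.

    #include <stdbool.h>

    #define WANT_ADD_ENTRY    (1u << 0)   /* placeholder bit */
    #define WANT_REMOVE_ENTRY (1u << 1)   /* placeholder bit */

    /* Assumed helper: returns false if the server declines. */
    extern bool get_dir_delegation(unsigned notification_mask);
    extern void revalidate_on_timeout(void);

    static void setup_directory_caching(void)
    {
        if (!get_dir_delegation(WANT_ADD_ENTRY | WANT_REMOVE_ENTRY))
            revalidate_on_timeout(); /* no delegation; Section 10.8 applies */
    }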
Delegations can be recalled by the server at any time.  Normally, the
server will recall the delegation when the directory changes in a way
that is not covered by the notification, or when the directory changes
and notifications have not been requested.  If another client removes
the directory for which a delegation has been granted, the server will
recall the delegation.

10.9.3.  Attributes in Support of Directory Notifications

See Section 5.11 for a description of the attributes associated with
directory notifications.

10.9.4.  Directory Delegation Recall

The server will recall the directory delegation by sending a callback
to the client.  It will use the same callback procedure as used for
recalling file delegations.  The server will recall the delegation
when the directory changes in a way that is not covered by the
notification.  However, the server need not recall the delegation if
attributes of an entry within the directory change.

If the server notices that handing out a delegation for a directory is
causing too many notifications to be sent out, it may decide not to
hand out delegations for that directory and/or recall those already
granted.  If a client tries to remove the directory for which a
delegation has been granted, the server will recall all associated
delegations.

The implementation sections for a number of operations describe
situations in which notification or delegation recall would be
required under some common circumstances.  In this regard, a similar
set of caveats to those listed in Section 10.2 apply.

* For CREATE, see Section 18.4.4.

* For LINK, see Section 18.9.4.

* For OPEN, see Section 18.16.4.

* For REMOVE, see Section 18.25.4.

* For RENAME, see Section 18.26.4.

* For SETATTR, see Section 18.30.4.

10.9.5.  Directory Delegation Recovery

Recovery from client or server restart for state on regular files has
two main goals: avoiding the necessity of breaking application
guarantees with respect to locked files and delivery of updates cached
at the client.  Neither of these goals applies to directories
protected by OPEN_DELEGATE_READ delegations and notifications.  Thus,
no provision is made for reclaiming directory delegations in the event
of client or server restart.  The client can simply establish a
directory delegation in the same fashion as was done initially.

11.  Multi-Server Namespace

NFSv4.1 supports attributes that allow a namespace to extend beyond
the boundaries of a single server.  It is desirable that clients and
servers support construction of such multi-server namespaces.  Use of
such multi-server namespaces is OPTIONAL, however, and for many
purposes, single-server namespaces are perfectly acceptable.  The use
of multi-server namespaces can provide many advantages by separating a
file system's logical position in a namespace from the (possibly
changing) logistical and administrative considerations that cause a
particular file system to be located on a particular server via a
single network access path that has to be known in advance or
determined using DNS.

11.1.  Terminology

In this section as a whole (i.e., within all of Section 11), the
phrase "client ID" always refers to the 64-bit shorthand identifier
assigned by the server (a clientid4) and never to the structure that
the client uses to identify itself to the server (called an
nfs_client_id4 or client_owner in NFSv4.0 and NFSv4.1, respectively).
The opaque identifier within those structures is referred to as a
"client id string".

11.1.1.  Terminology Related to Trunking

It is particularly important to clarify the distinction between
trunking detection and trunking discovery.  The definitions we present
are applicable to all minor versions of NFSv4, but we will focus on
how these terms apply to NFS version 4.1.

* Trunking detection refers to ways of deciding whether two specific
  network addresses are connected to the same NFSv4 server.  The means
  available to make this determination depend on the protocol version
  and, in some cases, on the client implementation.

  In the case of NFS version 4.1 and later minor versions, the means
  of trunking detection are as described in this document and are
  available to every client.  Two network addresses connected to the
  same server can always be used together to access a particular
  server but cannot necessarily be used together to access a single
  session.  See below for definitions of the terms "server-trunkable"
  and "session-trunkable".

* Trunking discovery is a process by which a client using one network
  address can obtain other addresses that are connected to the same
  server.  Typically, it builds on a trunking detection facility by
  providing one or more methods by which candidate addresses are made
  available to the client, which can then use trunking detection to
  appropriately filter them.

  Despite the support for trunking detection, there was no description
  of trunking discovery provided in RFC 5661 [66], making it necessary
  to provide those means in this document.

The combination of a server network address and a particular
connection type to be used by a connection is referred to as a "server
endpoint".  Although using different connection types may result in
different ports being used, the use of different ports by multiple
connections to the same network address in such cases is not the
essence of the distinction between the two endpoints used.  This is in
contrast to the case of port-specific endpoints, in which the explicit
specification of port numbers within network addresses is used to
allow a single server node to support multiple NFS servers.

Two network addresses connected to the same server are said to be
server-trunkable.  Two such addresses support the use of client ID
trunking, as described in Section 2.10.5.

Two network addresses connected to the same server such that those
addresses can be used to support a single common session are referred
to as session-trunkable.  Note that two addresses may be
server-trunkable without being session-trunkable.  However, when two
connections of different connection types are made to the same network
address and are based on a single file system location entry, they are
always session-trunkable, independent of the connection type, as
specified by Section 2.10.5.  Their derivation from the same file
system location entry, together with the identity of their network
addresses, assures that both connections are to the same server and
will return server-owner information, allowing session trunking to be
used.

11.1.2.  Terminology Related to File System Location

Regarding the terminology that relates to the construction of
multi-server namespaces out of a set of local per-server namespaces:

* Each server has a set of exported file systems that may be accessed
  by NFSv4 clients.
  Typically, this is done by assigning
  each file system a name within the pseudo-fs associated with the
  server, although the pseudo-fs may be dispensed with if there is
  only a single exported file system. Each such file system is part
  of the server's local namespace, and can be considered as a file
  system instance within a larger multi-server namespace.

* The set of all exported file systems for a given server
  constitutes that server's local namespace.

* In some cases, a server will have a namespace more extensive than
  its local namespace by using features associated with attributes
  that provide file system location information. These features,
  which allow construction of a multi-server namespace, are all
  described in individual sections below and include referrals
  (Section 11.5.6), migration (Section 11.5.5), and replication
  (Section 11.5.4).

* A file system present in a server's pseudo-fs may have multiple
  file system instances on different servers associated with it.
  All such instances are considered replicas of one another.
  Whether such replicas can be used simultaneously is discussed in
  Section 11.11.1, while the level of coordination between them
  (important when switching between them) is discussed in Sections
  11.11.2 through 11.11.8 below.

* When a file system is present in a server's pseudo-fs, but there
  is no corresponding local file system, it is said to be "absent".
  In such cases, all associated instances will be accessed on other
  servers.

Regarding the terminology that relates to attributes used in trunking
discovery and other multi-server namespace features:

* File system location attributes include the fs_locations and
  fs_locations_info attributes.

* File system location entries provide the individual file system
  locations within the file system location attributes. Each such
  entry specifies a server, in the form of a hostname or an address,
  and an fs name, which designates the location of the file system
  within the server's local namespace. A file system location entry
  designates a set of server endpoints to which the client may
  establish connections. There may be multiple endpoints because a
  hostname may map to multiple network addresses and because
  multiple connection types may be used to communicate with a single
  network address. However, except where explicit port numbers are
  used to designate a set of servers within a single server node,
  all such endpoints MUST designate a way of connecting to a single
  server. The exact form of the location entry varies with the
  particular file system location attribute used, as described in
  Section 11.2.

  The network addresses used in file system location entries
  typically appear without port number indications and are used to
  designate a server at one of the standard ports for NFS access,
  e.g., 2049 for TCP or 20049 for use with RPC-over-RDMA. Port
  numbers may be used in file system location entries to designate
  servers (typically user-level ones) accessed using other port
  numbers. In the case where network addresses indicate trunking
  relationships, the use of an explicit port number is inappropriate
  since trunking is a relationship between network addresses. See
  Section 11.5.2 for details.

* File system location elements are derived from location entries,
  and each describes a particular network access path consisting of
  a network address and a location within the server's local
  namespace.
  Such location elements need not appear within a file
  system location attribute, but the existence of each location
  element derives from a corresponding location entry. When a
  location entry specifies an IP address, there is only a single
  corresponding location element. File system location entries that
  contain a hostname are resolved using DNS, and may result in one
  or more location elements. All location elements consist of a
  location address that includes the IP address of an interface to a
  server and an fs name, which is the location of the file system
  within the server's local namespace. The fs name can be empty if
  the server has no pseudo-fs and only a single exported file system
  at the root filehandle.

* Two file system location elements are said to be server-trunkable
  if they specify the same fs name and their location addresses are
  server-trunkable. When the corresponding network paths are used,
  the client will always be able to use client ID trunking, but will
  only be able to use session trunking if the paths are also
  session-trunkable.

* Two file system location elements are said to be session-trunkable
  if they specify the same fs name and their location addresses are
  session-trunkable. When the corresponding network paths are used,
  the client will be able to use either client ID trunking or
  session trunking.

Discussion of the term "replica" is complicated by the fact that the
term was used in RFC 5661 [66] with a meaning different from that
used in this document. In short, in [66] each replica is identified
by a single network access path, while in the current document, a set
of network access paths that have server-trunkable network addresses
and the same root-relative file system pathname is considered to be a
single replica with multiple network access paths.

Each set of server-trunkable location elements defines a set of
available network access paths to a particular file system. When
there are multiple such file systems, each of which contains the
same data, these file systems are considered replicas of one another.
Logically, such replication is symmetric, since the fs currently in
use and an alternate fs are replicas of each other. Often, in other
documents, the term "replica" is not applied to the fs currently in
use, despite the fact that the replication relation is inherently
symmetric.

11.2. File System Location Attributes

NFSv4.1 contains attributes that provide information about how a
given file system may be accessed (i.e., at what network address and
namespace position). As a result, file systems in the namespace of
one server can be associated with one or more instances of that file
system on other servers. These attributes contain file system
location entries specifying a server address target (either as a DNS
name representing one or more IP addresses or as a specific IP
address) together with the pathname of that file system within the
associated single-server namespace.

The fs_locations_info RECOMMENDED attribute allows specification of
one or more file system instance locations where the data
corresponding to a given file system may be found.
In addition to
the specification of file system instance locations, this attribute
provides helpful information to do the following:

* Guide choices among the various file system instances provided
  (e.g., priority for use, writability, currency, etc.).

* Help the client efficiently effect as seamless a transition as
  possible among multiple file system instances, when and if that
  should be necessary.

* Guide the selection of the appropriate connection type to be used
  when establishing a connection.

Within the fs_locations_info attribute, each fs_locations_server4
entry corresponds to a file system location entry: the fls_server
field designates the server, and the fli_rootpath field of the
encompassing fs_locations_item4 gives the location pathname within
the server's pseudo-fs.

The fs_locations attribute defined in NFSv4.0 is also a part of
NFSv4.1. This attribute only allows specification of the file system
locations where the data corresponding to a given file system may be
found. Servers SHOULD make this attribute available whenever
fs_locations_info is supported, but client use of fs_locations_info
is preferable because it provides more information.

Within the fs_locations attribute, each fs_location4 contains a file
system location entry with the server field designating the server
and the rootpath field giving the location pathname within the
server's pseudo-fs.

11.3. File System Presence or Absence

A given location in an NFSv4.1 namespace (typically but not
necessarily a multi-server namespace) can have a number of file
system instance locations associated with it (via the fs_locations or
fs_locations_info attribute). There may also be an actual current
file system at that location, accessible via normal namespace
operations (e.g., LOOKUP). In this case, the file system is said to
be "present" at that position in the namespace, and clients will
typically use it, reserving use of additional locations specified via
the location-related attributes to situations in which the principal
location is no longer available.

When there is no actual file system at the namespace location in
question, the file system is said to be "absent". An absent file
system contains no files or directories other than the root. Any
reference to it, except to access a small set of attributes useful in
determining alternate locations, will result in an error,
NFS4ERR_MOVED. Note that if the server ever returns the error
NFS4ERR_MOVED, it MUST support the fs_locations attribute and SHOULD
support the fs_locations_info and fs_status attributes.

While the error name suggests that we have a case of a file system
that once was present, and has only become absent later, this is only
one possibility. A position in the namespace may be permanently
absent with the set of file system(s) designated by the location
attributes being the only realization. The name NFS4ERR_MOVED
reflects an earlier, more limited conception of its function, but
this error will be returned whenever the referenced file system is
absent, whether it has moved or not.

Except in the case of GETATTR-type operations (to be discussed
later), when the current filehandle at the start of an operation is
within an absent file system, that operation is not performed and the
error NFS4ERR_MOVED is returned, to indicate that the file system is
absent on the current server.
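
As an illustration of the client-side error handling this implies,
the following fragment sketches one way a client might react to
NFS4ERR_MOVED. It is a minimal, hypothetical sketch in C: the types
and helpers shown (nfs_op(), getattr_locations(),
switch_to_replica()) are stand-ins for a real client's RPC machinery
and are not defined by this protocol; only the error code value comes
from the NFSv4.1 XDR.

   #define NFS4ERR_MOVED 10019          /* from the NFSv4.1 XDR      */

   struct fh;                           /* current filehandle        */
   struct request;                      /* an operation to (re)issue */
   struct locations;                    /* decoded location data     */

   extern int  nfs_op(struct fh *, struct request *);
   extern int  getattr_locations(struct fh *, struct locations **);
   extern void switch_to_replica(const struct locations *);

   int issue_with_moved_handling(struct fh *fh, struct request *req)
   {
       int status = nfs_op(fh, req);

       if (status == NFS4ERR_MOVED) {
           struct locations *locs;

           /* A GETATTR for the location attributes remains valid
            * within the absent file system (see Section 11.4).    */
           if (getattr_locations(fh, &locs) == 0) {
               switch_to_replica(locs);
               status = nfs_op(fh, req); /* retry on a new replica */
           }
       }
       return status;
   }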

Because a GETFH cannot succeed if the current filehandle is within an
absent file system, filehandles within an absent file system cannot
be transferred to the client. When a client does have filehandles
within an absent file system, it is the result of obtaining them when
the file system was present, and having the file system become absent
subsequently.

It should be noted that because the check for the current filehandle
being within an absent file system happens at the start of every
operation, operations that change the current filehandle so that it
is within an absent file system will not result in an error. This
allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be
used to get attribute information, particularly location attribute
information, as discussed below.

The RECOMMENDED file system attribute fs_status can be used to
interrogate the present/absent status of a given file system.

11.4. Getting Attributes for an Absent File System

When a file system is absent, most attributes are not available, but
it is necessary to allow the client access to the small set of
attributes that are available, and most particularly those that give
information about the correct current locations for this file system:
fs_locations and fs_locations_info.

11.4.1. GETATTR within an Absent File System

As mentioned above, an exception is made for GETATTR in that
attributes may be obtained for a filehandle within an absent file
system. This exception only applies if the attribute mask contains
at least one attribute bit that indicates the client is interested in
a result regarding an absent file system: fs_locations,
fs_locations_info, or fs_status. If none of these attributes is
requested, GETATTR will result in an NFS4ERR_MOVED error.

When a GETATTR is done on an absent file system, the set of supported
attributes is very limited. Many attributes, including those that
are normally REQUIRED, will not be available on an absent file
system. In addition to the attributes mentioned above (fs_locations,
fs_locations_info, fs_status), the following attributes SHOULD be
available on absent file systems. In the case of RECOMMENDED
attributes, they should be available at least to the same degree that
they are available on present file systems.

change_policy: This attribute is useful for absent file systems and
can be helpful in summarizing to the client when any of the
location-related attributes change.

fsid: This attribute should be provided so that the client can
determine file system boundaries, including, in particular, the
boundary between present and absent file systems. This value must
be different from any other fsid on the current server and need
have no particular relationship to fsids on any particular
destination to which the client might be directed.

mounted_on_fileid: For objects at the top of an absent file system,
this attribute needs to be available. Since the fileid is within
the present parent file system, there should be no need to
reference the absent file system to provide this information.

Other attributes SHOULD NOT be made available for absent file
systems, even when it is possible to provide them. The server should
not assume that more information is always better and should avoid
gratuitously providing additional information.
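
The attribute-mask test that determines whether the GETATTR exception
described above applies can be stated compactly in code. The
following server-side check is a minimal sketch (not part of the
protocol); the attribute numbers are those assigned in the NFSv4.1
XDR (fs_locations = 24, fs_status = 61, fs_locations_info = 67).

   #include <stdbool.h>
   #include <stdint.h>

   #define FATTR4_FS_LOCATIONS       24
   #define FATTR4_FS_STATUS          61
   #define FATTR4_FS_LOCATIONS_INFO  67

   /* Test one bit of an XDR-style bitmap4 (array of 32-bit words). */
   static bool mask_has(const uint32_t *mask, unsigned nwords,
                        unsigned bit)
   {
       return bit / 32 < nwords &&
              ((mask[bit / 32] >> (bit % 32)) & 1) != 0;
   }

   /* True when the absent-file-system exception applies, so that
    * GETATTR proceeds instead of failing with NFS4ERR_MOVED.      */
   bool absent_fs_getattr_allowed(const uint32_t *mask,
                                  unsigned nwords)
   {
       return mask_has(mask, nwords, FATTR4_FS_LOCATIONS) ||
              mask_has(mask, nwords, FATTR4_FS_STATUS)    ||
              mask_has(mask, nwords, FATTR4_FS_LOCATIONS_INFO);
   }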

When a GETATTR operation includes a bit mask for one of the
attributes fs_locations, fs_locations_info, or fs_status, but where
the bit mask includes attributes that are not supported, GETATTR will
not return an error, but will return the mask of the actual
attributes supported with the results.

Handling of VERIFY/NVERIFY is similar to GETATTR in that if the
attribute mask does not include fs_locations, fs_locations_info, or
fs_status, the error NFS4ERR_MOVED will result. It differs in that
any appearance in the attribute mask of an attribute not supported
for an absent file system (and note that this will include some
normally REQUIRED attributes) will also cause an NFS4ERR_MOVED
result.

11.4.2. READDIR and Absent File Systems

A READDIR performed when the current filehandle is within an absent
file system will result in an NFS4ERR_MOVED error, since, unlike the
case of GETATTR, no such exception is made for READDIR.

Attributes for an absent file system may be fetched via a READDIR for
a directory in a present file system, when that directory contains
the root directories of one or more absent file systems. In this
case, the handling is as follows:

* If the attribute set requested includes one of the attributes
  fs_locations, fs_locations_info, or fs_status, then fetching of
  attributes proceeds normally and no NFS4ERR_MOVED indication is
  returned, even when the rdattr_error attribute is requested.

* If the attribute set requested does not include one of the
  attributes fs_locations, fs_locations_info, or fs_status, then if
  the rdattr_error attribute is requested, each directory entry for
  the root of an absent file system will report NFS4ERR_MOVED as the
  value of the rdattr_error attribute.

* If the attribute set requested does not include any of the
  attributes fs_locations, fs_locations_info, fs_status, or
  rdattr_error, then the occurrence of the root of an absent file
  system within the directory will result in the READDIR failing
  with an NFS4ERR_MOVED error.

* The unavailability of an attribute because of a file system's
  absence, even one that is ordinarily REQUIRED, does not result in
  any error indication. The set of attributes returned for the root
  directory of the absent file system in that case is simply
  restricted to those actually available.

11.5. Uses of File System Location Information

The file system location attributes (i.e., fs_locations and
fs_locations_info), together with the possibility of absent file
systems, provide a number of important facilities for reliable,
manageable, and scalable data access.

When a file system is present, these attributes can provide the
following:

* The locations of alternative replicas to be used to access the
  same data in the event of server failures, communications
  problems, or other difficulties that make continued access to the
  current replica impossible or otherwise impractical. Provisioning
  and use of such alternate replicas is referred to as "replication"
  and is discussed in Section 11.5.4 below.

* The network address(es) to be used to access the current file
  system instance or replicas of it. Client use of this information
  is discussed in Section 11.5.2 below.

Under some circumstances, multiple replicas may be used
simultaneously to provide higher-performance access to the file
system in question, although the lack of state sharing between
servers may be an impediment to such use.
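
To make the first of these uses, failover to an alternate replica,
concrete, the following hypothetical sketch shows one way a client
might organize the locations it has fetched. The structures are
illustrative only: they record, per replica, the server and fs name
taken from each file system location entry, and select a replacement
when the current replica fails.

   #include <stddef.h>

   struct replica {
       const char *server;       /* hostname or network address */
       const char *rootpath;     /* fs name within that server  */
       int         usable;       /* cleared on persistent error */
   };

   struct fs_replicas {
       struct replica *list;     /* from fs_locations(_info)    */
       size_t          count;
       size_t          current;  /* replica currently in use    */
   };

   /* Pick a replacement after continued access to the current
    * replica becomes impossible or otherwise impractical;
    * returns NULL when no alternative remains.                  */
   struct replica *next_replica(struct fs_replicas *fs)
   {
       for (size_t i = 0; i < fs->count; i++) {
           if (i != fs->current && fs->list[i].usable) {
               fs->current = i;
               return &fs->list[i];
           }
       }
       return NULL;
   }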

When a file system is present but becomes absent, clients can be
given the opportunity to have continued access to their data using a
different replica. In this case, a continued attempt to use the data
in the now-absent file system will result in an NFS4ERR_MOVED error,
and then the successor replica or set of possible replica choices can
be fetched and used to continue access. Transfer of access to the
new replica location is referred to as "migration" and is discussed
in Section 11.5.5 below.

When a file system is currently absent, specification of file system
location provides a means by which file systems located on one server
can be associated with a namespace defined by another server, thus
allowing a general multi-server namespace facility. A designation of
such a remote instance, in place of a file system not previously
present, is called a "pure referral" and is discussed in
Section 11.5.6 below.

Because client support for attributes related to file system location
is OPTIONAL, a server may choose to take action to hide migration and
referral events from such clients, by acting as a proxy, for example.
The server can determine the presence of client support from the
arguments of the EXCHANGE_ID operation (see Section 18.35.3).

11.5.1. Combining Multiple Uses in a Single Attribute

A file system location attribute will sometimes contain information
relating to the location of multiple replicas, which may be used in
different ways:

* File system location entries that relate to the file system
  instance currently in use provide trunking information, allowing
  the client to find additional network addresses by which the
  instance may be accessed.

* File system location entries that provide information about
  replicas to which access is to be transferred.

* Other file system location entries that relate to replicas that
  are available to use in the event that access to the current
  replica becomes unsatisfactory.

In order to simplify client handling and to allow the best choice of
replicas to access, the server should adhere to the following
guidelines:

* All file system location entries that relate to a single file
  system instance should be adjacent.

* File system location entries that relate to the instance currently
  in use should appear first.

* File system location entries that relate to replica(s) to which
  migration is occurring should appear before replicas that are
  available for later use if the current replica should become
  inaccessible.

11.5.2. File System Location Attributes and Trunking

Trunking is the use of multiple connections between a client and
server in order to increase the speed of data transfer. A client may
determine the set of network addresses to use to access a given file
system in a number of ways:

* When the name of the server is known to the client, it may use DNS
  to obtain a set of network addresses to use in accessing the
  server.

* The client may fetch the file system location attribute for the
  file system. This will provide either the name of the server
  (which can be turned into a set of network addresses using DNS) or
  a set of server-trunkable location entries. Using the latter
  alternative, the server can provide addresses it regards as
  desirable to use to access the file system in question. Although
  these entries can contain port numbers, these port numbers are not
  used in determining trunking relationships.
  Once the candidate
  addresses have been determined and EXCHANGE_ID done to the proper
  server, only the value of the so_major_id field returned by the
  servers in question determines whether a trunking relationship
  actually exists.

When the client fetches a location attribute for a file system, it
should be noted that the client may encounter multiple entries for a
number of reasons, such that when it determines trunking information,
it may need to bypass addresses not trunkable with one already known.

The server can provide location entries that include either names or
network addresses. It might use the latter form because of DNS-
related security concerns or because the set of addresses to be used
might require active management by the server.

Location entries used to discover candidate addresses for use in
trunking are subject to change, as discussed in Section 11.5.7 below.
The client may respond to such changes by using additional addresses
once they are verified or by ceasing to use existing ones. The
server can force the client to cease using an address by returning
NFS4ERR_MOVED when that address is used to access a file system.
This allows a transfer of client access that is similar to migration,
although the same file system instance is accessed throughout.

11.5.3. File System Location Attributes and Connection Type Selection

Because of the need to support multiple types of connections, clients
face the issue of determining the proper connection type to use when
establishing a connection to a given server network address. In some
cases, this issue can be addressed through the use of the connection
"step-up" facility described in Section 18.36. However, because
there are cases in which that facility is not available, the client
may have to choose a connection type with no possibility of changing
it within the scope of a single connection.

The two file system location attributes differ as to the information
made available in this regard. The fs_locations attribute provides
no information to support connection type selection. As a result,
clients supporting multiple connection types would need to attempt to
establish connections using multiple connection types until the one
preferred by the client is successfully established.

The fs_locations_info attribute includes the FSLI4TF_RDMA flag, which
is convenient for a client wishing to use RDMA. When this flag is
set, it indicates that RPC-over-RDMA support is available using the
specified location entry. A client can establish a TCP connection
and then convert that connection to use RDMA by using the step-up
facility.

Irrespective of the particular attribute used, when there is no
indication that a step-up operation can be performed, a client
supporting RDMA operation can establish a new RDMA connection, and it
can be bound to the session already established by the TCP
connection, allowing the TCP connection to be dropped and the session
converted to further use in RDMA mode, if the server supports that.

11.5.4. File System Replication

The fs_locations and fs_locations_info attributes provide alternative
file system locations, to be used to access data in place of or in
addition to the current file system instance. On first access to a
file system, the client should obtain the set of alternate locations
by interrogating the fs_locations or fs_locations_info attribute,
with the latter being preferred.
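
The first-access pattern just described can be sketched as follows;
the helpers are hypothetical stand-ins for a client's GETATTR
machinery, and the preference order is the one recommended above.

   struct fh;
   struct locations;

   /* Assumed helpers: each issues a GETATTR for one attribute and
    * decodes the result; nonzero return means "not available".   */
   extern int get_fs_locations_info(struct fh *, struct locations **);
   extern int get_fs_locations(struct fh *, struct locations **);

   /* On first access to a file system, obtain the alternate
    * locations, preferring the more informative attribute.       */
   int fetch_alternate_locations(struct fh *root,
                                 struct locations **out)
   {
       if (get_fs_locations_info(root, out) == 0)
           return 0;             /* preferred: fs_locations_info */
       return get_fs_locations(root, out);   /* fallback         */
   }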

In the event that the occurrence of server failures, communications
problems, or other difficulties make continued access to the current
file system impossible or otherwise impractical, the client can use
the alternate locations as a way to get continued access to its data.

The alternate locations may be physical replicas of the (typically
read-only) file system data supplemented by possible asynchronous
propagation of updates. Alternatively, they may provide for the use
of various forms of server clustering in which multiple servers
provide alternate ways of accessing the same physical file system.
How the difference between replicas affects file system transitions
can be represented within the fs_locations and fs_locations_info
attributes, and how the client deals with file system transition
issues will be discussed in detail in later sections.

Although the location attributes provide some information about the
nature of the inter-replica transition, many aspects of the semantics
of possible asynchronous updates are not currently described by the
protocol, which makes it necessary for clients using replication to
switch among replicas undergoing change to familiarize themselves
with the semantics of the update approach used. Due to this lack of
specificity, many applications may find the use of migration more
appropriate because a server can propagate all updates made before an
established point in time to the new replica as part of the migration
event.

11.5.4.1. File System Trunking Presented as Replication

In some situations, a file system location entry may indicate a file
system access path to be used as an alternate location, where
trunking, rather than replication, is to be used. The situations in
which this is appropriate are limited to those in which both of the
following are true:

* The two file system locations (i.e., the one on which the location
  attribute is obtained and the one specified in the file system
  location entry) designate the same locations within their
  respective single-server namespaces.

* The two server network addresses (i.e., the one being used to
  obtain the location attribute and the one specified in the file
  system location entry) designate the same server (as indicated by
  the same value of the so_major_id field of the eir_server_owner
  field returned in response to EXCHANGE_ID).

When these conditions hold, operations using both access paths are
generally trunked, although trunking may be disallowed when the
attribute fs_locations_info is used:

* When the fs_locations_info attribute shows the two entries as not
  having the same simultaneous-use class, trunking is inhibited, and
  the two access paths cannot be used together.

  In this case, the two paths can be used serially with no
  transition activity required on the part of the client, and any
  transition between access paths is transparent. In transferring
  access from one to the other, the client acts as if communication
  were interrupted, establishing a new connection and possibly a new
  session to continue access to the same file system.

* Note that for two such location entries, any information within
  the fs_locations_info attribute that indicates the need for
  special transition activity, i.e., the appearance of the two file
  system location entries with different handle, fileid, write-
  verifier, change, and readdir classes, indicates a serious
  problem.
  The client, if it allows transition to the file system
  instance at all, must not treat any transition as a transparent
  one. The server SHOULD NOT indicate that these two entries (for
  the same file system on the same server) belong to different
  handle, fileid, write-verifier, change, and readdir classes,
  whether or not the two entries are shown belonging to the same
  simultaneous-use class.

These situations were recognized by [66], even though that document
made no explicit mention of trunking:

* It treated the situation that we describe as trunking as one of
  simultaneous use of two distinct file system instances, even
  though, in the explanatory framework now used to describe the
  situation, the case is one in which a single file system is
  accessed by two different trunked addresses.

* It treated the situation in which two paths are to be used
  serially as a special sort of "transparent transition". However,
  in the descriptive framework now used to categorize transition
  situations, this is considered a case of a "network endpoint
  transition" (see Section 11.9).

11.5.5. File System Migration

When a file system is present and becomes inaccessible using the
current access path, the NFSv4.1 protocol provides a means by which
clients can be given the opportunity to have continued access to
their data. This may involve using a different access path to the
existing replica or providing a path to a different replica. The new
access path or the location of the new replica is specified by a file
system location attribute. The ensuing migration of access includes
the ability to retain locks across the transition. Depending on
circumstances, this can involve:

* The continued use of the existing client ID when accessing the
  current replica using a new access path.

* Use of lock reclaim, taking advantage of a per-fs grace period.

* Use of Transparent State Migration.

Typically, a client will be accessing the file system in question,
get an NFS4ERR_MOVED error, and then use a file system location
attribute to determine the new access path for the data. When
fs_locations_info is used, additional information will be available
that will define the nature of the client's handling of the
transition to a new server.

In most instances, servers will choose to migrate all clients using a
particular file system to a successor replica at the same time to
avoid cases in which different clients are updating different
replicas. However, migration of an individual client can be helpful
in providing load balancing, as long as the replicas in question are
such that they represent the same data as described in
Section 11.11.8.

* In the case in which there is no transition between replicas
  (i.e., only a change in access path), there are no special
  difficulties in using this mechanism to effect load balancing.

* In the case in which the two replicas are sufficiently coordinated
  as to allow a single client coherent, simultaneous access to both,
  there is, in general, no obstacle to the use of migration of
  particular clients to effect load balancing. Generally, such
  simultaneous use involves cooperation between servers to ensure
  that locks granted on two coordinated replicas cannot conflict and
  can remain effective when transferred to a common replica.

* In the case in which a large set of clients is accessing a file
  system in a read-only fashion, it can be helpful to migrate all
  clients with writable access simultaneously, while using load
  balancing on the set of read-only copies, as long as the rules in
  Section 11.11.8, which are designed to prevent data reversion, are
  followed.

In other cases, the client might not have sufficient guarantees of
data similarity or coherence to function properly (e.g., the data in
the two replicas is similar but not identical), and the possibility
that different clients are updating different replicas can exacerbate
the difficulties, making the use of load balancing in such situations
a perilous enterprise.

The protocol does not specify how the file system will be moved
between servers or how updates to multiple replicas will be
coordinated. It is anticipated that a number of different server-to-
server coordination mechanisms might be used, with the choice left to
the server implementer. The NFSv4.1 protocol specifies the method
used to communicate the migration event between client and server.

In the case of various forms of server clustering, the new location
may be another server providing access to the same physical file
system. The client's responsibilities in dealing with this
transition will depend on whether a switch between replicas has
occurred and the means the server has chosen to provide continuity of
locking state. These issues will be discussed in detail below.

Although a single successor location is typical, multiple locations
may be provided. When multiple locations are provided, the client
will typically use the first one provided. If that is inaccessible
for some reason, later ones can be used. In such cases, the client
might consider the transition to the new replica to be a migration
event, even though some of the servers involved might not be aware of
the use of the server that was inaccessible. In such a case, a
client might lose access to locking state as a result of the access
transfer.

When an alternate location is designated as the target for migration,
it must designate the same data (with metadata being the same to the
degree indicated by the fs_locations_info attribute). Where file
systems are writable, a change made on the original file system must
be visible on all migration targets. Where a file system is not
writable but represents a read-only copy (possibly periodically
updated) of a writable file system, similar requirements apply to the
propagation of updates. Any change visible in the original file
system must already be effected on all migration targets, to avoid
any possibility that a client, in effecting a transition to the
migration target, will see any reversion in file system state.

11.5.6. Referrals

Referrals allow the server to associate a file system namespace entry
located on one server with a file system located on another server.
When this includes the use of pure referrals, servers are provided a
way of placing a file system in a location within the namespace
essentially without respect to its physical location on a particular
server. This allows a single server or a set of servers to present a
multi-server namespace that encompasses file systems located on a
wider range of servers.
Some likely uses of this facility include
establishment of site-wide or organization-wide namespaces, with the
eventual possibility of combining such namespaces into a truly global
namespace, such as the one provided by AFS (the Andrew File System)
[65].

Referrals occur when a client determines, upon first referencing a
position in the current namespace, that it is part of a new file
system and that the file system is absent. When this occurs,
typically upon receiving the error NFS4ERR_MOVED, the actual location
or locations of the file system can be determined by fetching a
locations attribute.

The file system location attribute may designate a single file system
location or multiple file system locations, to be selected based on
the needs of the client. The server, in the fs_locations_info
attribute, may specify priorities to be associated with various file
system location choices. The server may assign different priorities
to different locations as reported to individual clients, in order to
adapt to client physical location or to effect load balancing. When
both read-only and read-write file systems are present, some of the
read-only locations might not be absolutely up-to-date (as they would
have to be in the case of replication and migration). Servers may
also specify file system locations that include client-substituted
variables so that different clients are referred to different file
systems (with different data contents) based on client attributes
such as CPU architecture.

If the fs_locations_info attribute lists multiple possible targets,
the relationships among them may be important to the client in
selecting which one to use. The same rules specified in
Section 11.5.5 above regarding multiple migration targets apply to
these multiple replicas as well. For example, the client might
prefer a writable target on a server that has additional writable
replicas to which it subsequently might switch. Note that, as
distinguished from the case of replication, there is no need to deal
with the case of propagation of updates made by the current client,
since the current client has not accessed the file system in
question.

Use of multi-server namespaces is enabled by NFSv4.1 but is not
required. The use of multi-server namespaces and their scope will
depend on the applications used and system administration
preferences.

Multi-server namespaces can be established by a single server
providing a large set of pure referrals to all of the included file
systems. Alternatively, a single multi-server namespace may be
administratively segmented with separate referral file systems (on
separate servers) for each separately administered portion of the
namespace. The top-level referral file system or any segment may use
replicated referral file systems for higher availability.

Multi-server namespaces are for the most part uniform, in
that the same data made available to one client at a given location
in the namespace is made available to all clients at that namespace
location. However, there are facilities provided that allow
different clients to be directed to different sets of data, for
reasons such as enabling adaptation to such client characteristics as
CPU architecture. These facilities are described in Section 11.17.3.
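
The referral mechanism described in this section can be illustrated
with a short sketch of pathname traversal. It is hypothetical (the
helpers stand in for a real client's machinery): a LOOKUP that
crosses into an absent file system fails with NFS4ERR_MOVED, after
which the client fetches the location attribute at that point in the
namespace and continues on the server the referral names.

   #define NFS4ERR_MOVED 10019      /* from the NFSv4.1 XDR        */

   struct server;                   /* a client-server association */
   struct fh;
   struct locations;

   extern int lookup(struct server *, struct fh *dir,
                     const char *name, struct fh **out);
   extern int fetch_locations(struct server *, struct fh *dir,
                              const char *name,
                              struct locations **out);
   extern int follow_referral(const struct locations *,
                              struct server **srv, struct fh **out);

   /* One step of pathname traversal with referral handling. */
   int lookup_step(struct server **srv, struct fh *dir,
                   const char *name, struct fh **out)
   {
       int status = lookup(*srv, dir, name, out);

       if (status == NFS4ERR_MOVED) {
           struct locations *locs;

           /* The name refers to an absent file system: get its
            * locations and continue on the referred-to server.  */
           status = fetch_locations(*srv, dir, name, &locs);
           if (status == 0)
               status = follow_referral(locs, srv, out);
       }
       return status;
   }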

Note that it is possible, when providing a uniform namespace, to
provide different location entries to different clients in order to
provide each client with a copy of the data physically closest to it
or otherwise optimize access (e.g., provide load balancing).

11.5.7. Changes in a File System Location Attribute

Although clients will typically fetch a file system location
attribute when first accessing a file system and when NFS4ERR_MOVED
is returned, a client can choose to fetch the attribute periodically,
in which case the value fetched may change over time.

For clients not prepared to access multiple replicas simultaneously
(see Section 11.11.1), the handling of the various cases of location
change is as follows:

* Changes in the list of replicas or in the network addresses
  associated with replicas do not require immediate action. The
  client will typically update its list of replicas to reflect the
  new information.

* Additions to the list of network addresses for the current file
  system instance need not be acted on promptly. However, to
  prepare for a subsequent migration event, the client can choose to
  take note of the new address and then use it whenever it needs to
  switch access to a new replica.

* Deletions from the list of network addresses for the current file
  system instance do not require the client to immediately cease use
  of existing access paths, although new connections are not to be
  established on addresses that have been deleted. However, clients
  can choose to act on such deletions by preparing for an eventual
  shift in access, which becomes unavoidable as soon as the server
  returns NFS4ERR_MOVED to indicate that a particular network access
  path is not usable to access the current file system.

For clients that are prepared to access several replicas
simultaneously, the following additional cases need to be addressed.
As in the cases discussed above, changes in the set of replicas need
not be acted upon promptly, although the client has the option of
adjusting its access even in the absence of difficulties that would
lead to the selection of a new replica.

* When a new replica is added, which may be accessed simultaneously
  with one currently in use, the client is free to use the new
  replica immediately.

* When a replica currently in use is deleted from the list, the
  client need not cease using it immediately. However, since the
  server may subsequently force such use to cease (by returning
  NFS4ERR_MOVED), clients might decide to limit the need for later
  state transfer. For example, new opens might be done on other
  replicas, rather than on one not present in the list.

11.6. Trunking without File System Location Information

In situations in which a file system is accessed using two server-
trunkable addresses (as indicated by the same value of the
so_major_id field of the eir_server_owner field returned in response
to EXCHANGE_ID), trunked access is allowed even though there might
not be any location entries specifically indicating the use of
trunking for that file system.

This situation was recognized by [66], although that document made no
explicit mention of trunking and treated the situation as one of
simultaneous use of two distinct file system instances. In the
explanatory framework now used to describe the situation, the case is
one in which a single file system is accessed by two different
trunked addresses.

11.7. Users and Groups in a Multi-Server Namespace

As in the case of a single-server environment (see Section 5.9), when
an owner or group name of the form "id@domain" is assigned to a file,
there is an implicit promise to return that same string when the
corresponding attribute is interrogated subsequently. In the case of
a multi-server namespace, that same promise applies even if server
boundaries have been crossed. Similarly, when the owner attribute of
a file is derived from the security principal that created the file,
that attribute should have the same value even if the interrogation
occurs on a different server from the file creation.

Similarly, the set of security principals recognized by all the
participating servers needs to be the same, with each such principal
having the same credentials, regardless of the particular server
being accessed.

In order to meet these requirements, those setting up multi-server
namespaces will need to limit the servers included so that:

* In all cases in which more than a single domain is supported, the
  requirements stated in RFC 8000 [31] are to be respected.

* All servers support a common set of domains that includes all of
  the domains clients use and expect to see returned as the domain
  portion of an owner or group in the form "id@domain". Note that,
  although this set most often consists of a single domain, it is
  possible for multiple domains to be supported.

* All servers, for each domain that they support, accept the same
  set of user and group ids as valid.

* All servers recognize the same set of security principals. For
  each principal, the same credential is required, independent of
  the server being accessed. In addition, the group membership for
  each such principal is to be the same, independent of the server
  accessed.

Note that there is no requirement in general that the users
corresponding to particular security principals have the same local
representation on each server, even though it is most often the case
that this is so.

When AUTH_SYS is used, the following additional requirements must be
met:

* Only a single NFSv4 domain can be supported through the use of
  AUTH_SYS.

* The "local" representation of all owners and groups must be the
  same on all servers. The word "local" is used here since that is
  the way that numeric user and group ids are described in
  Section 5.9. However, when AUTH_SYS or stringified numeric owners
  or groups are used, these identifiers are not truly local, since
  they are known to the clients as well as to the server.

Similarly, when stringified numeric user and group ids are used, the
"local" representation of all owners and groups must be the same on
all servers, even when AUTH_SYS is not used.

11.8. Additional Client-Side Considerations

When clients make use of servers that implement referrals,
replication, and migration, care should be taken that a user who
mounts a given file system that includes a referral or a relocated
file system continues to see a coherent picture of that user-side
file system despite the fact that it contains a number of server-side
file systems that may be on different servers.

One important issue is upward navigation from the root of a server-
side file system to its parent (specified as ".." in UNIX), in the
case in which it transitions to that file system as a result of
referral, migration, or a transition as a result of replication.
When the client is at such a point, and it needs to ascend to the
parent, it must go back to the parent as seen within the multi-server
namespace rather than sending a LOOKUPP operation to the server,
which would result in the parent within that server's single-server
namespace. In order to do this, the client needs to remember the
filehandles that represent such file system roots and use these
instead of sending a LOOKUPP operation to the current server. This
will allow the client to present to applications a consistent
namespace, where upward navigation and downward navigation are
consistent.

Another issue concerns refresh of referral locations. When referrals
are used extensively, they may change as server configurations
change. It is expected that clients will cache information related
to traversing referrals so that future client-side requests are
resolved locally without server communication. This is usually
rooted in client-side name lookup caching. Clients should
periodically purge this data for referral points in order to detect
changes in location information. When the change_policy attribute
changes for directories that hold referral entries or for the
referral entries themselves, clients should consider any associated
cached referral information to be out of date.

11.9. Overview of File Access Transitions

File access transitions are of two types:

* Those that involve a transition from accessing the current replica
  to another one in connection with either replication or migration.
  How these are dealt with is discussed in Section 11.11.

* Those in which access to the current file system instance is
  retained, while the network path used to access that instance is
  changed. This case is discussed in Section 11.10.

11.10. Effecting Network Endpoint Transitions

The endpoints used to access a particular file system instance may
change in a number of ways, as listed below. In each of these cases,
the same fsid, client IDs, filehandles, and stateids are used to
continue access, with a continuity of lock state. In many cases, the
same sessions can also be used.

The appropriate action depends on the set of replacement addresses
that are available for use (i.e., server endpoints that are server-
trunkable with one previously being used).

* When use of a particular address is to cease, and there is also
  another address currently in use that is server-trunkable with it,
  requests that would have been issued on the address whose use is
  to be discontinued can be issued on the remaining address(es).
  When an address is server-trunkable but not session-trunkable with
  the address whose use is to be discontinued, the request might
  need to be modified to reflect the fact that a different session
  will be used.

* When use of a particular connection is to cease, as indicated by
  receiving NFS4ERR_MOVED when using that connection, but that
  address is still indicated as accessible according to the
  appropriate file system location entries, it is likely that
  requests can be issued on a new connection of a different
  connection type once that connection is established. Since any
  two non-port-specific server endpoints that share a network
  address are inherently session-trunkable, the client can use
  BIND_CONN_TO_SESSION to access the existing session with the new
  connection.

* When there are no potential replacement addresses in use, but
  there are valid addresses session-trunkable with the one whose use
  is to be discontinued, the client can use BIND_CONN_TO_SESSION to
  access the existing session using the new address. Although the
  target session will generally be accessible, there may be rare
  situations in which that session is no longer accessible when an
  attempt is made to bind the new connection to it. In this case,
  the client can create a new session to enable continued access to
  the existing instance using the new connection, providing for the
  use of existing filehandles, stateids, and client IDs while
  supplying continuity of locking state.

* When there is no potential replacement address in use, and there
  are no valid addresses session-trunkable with the one whose use is
  to be discontinued, other server-trunkable addresses may be used
  to provide continued access. Although the use of CREATE_SESSION
  is available to provide continued access to the existing instance,
  servers have the option of providing continued access to the
  existing session through the new network access path in a fashion
  similar to that provided by session migration (see Section 11.12).
  To take advantage of this possibility, clients can perform an
  initial BIND_CONN_TO_SESSION, as in the previous case, and use
  CREATE_SESSION only if that fails.

11.11. Effecting File System Transitions

There is a range of situations in which there is a change to be
effected in the set of replicas used to access a particular file
system. Some of these may involve an expansion or contraction of the
set of replicas used, as discussed in Section 11.11.1 below.

For reasons explained in that section, most transitions will involve
a transition from a single replica to a corresponding replacement
replica. When effecting replica transition, some types of sharing
between the replicas may affect handling of the transition as
described in Sections 11.11.2 through 11.11.8 below. The attribute
fs_locations_info provides helpful information to allow the client to
determine the degree of inter-replica sharing.

With regard to some types of state, the degree of continuity across
the transition depends on the occasion prompting the transition, with
transitions initiated by the servers (i.e., migration) offering much
more scope for a nondisruptive transition than cases in which the
client on its own shifts its access to another replica (i.e.,
replication). This issue potentially applies to locking state and to
session state, which are dealt with below as follows:

* An introduction to the possible means of providing continuity in
  these areas appears in Section 11.11.9 below.

* Transparent State Migration is introduced in Section 11.12. The
  possible transfer of session state is addressed there as well.

* The client handling of transitions, including determining how to
  deal with the various means that the server might take to supply
  effective continuity of locking state, is discussed in
  Section 11.13.

* The source and destination servers' responsibilities in effecting
  Transparent State Migration of locking and session state are
  discussed in Section 11.14.

11.11.1. File System Transitions and Simultaneous Access

The fs_locations_info attribute (described in Section 11.17) may
indicate that two replicas may be used simultaneously, although some
situations in which such simultaneous access is permitted are more
appropriately described as instances of trunking (see
Section 11.5.4.1). Although situations in which multiple replicas
may be accessed simultaneously are somewhat similar to those in which
a single replica is accessed by multiple network addresses, there are
important differences since locking state is not shared among
multiple replicas.

Because of this difference in state handling, many clients will not
have the ability to take advantage of the fact that such replicas
represent the same data. Such clients will not be prepared to use
multiple replicas simultaneously but will access each file system
using only a single replica, although the replica selected might make
multiple server-trunkable addresses available.

Clients who are prepared to use multiple replicas simultaneously can
divide opens among replicas however they choose. Once that choice is
made, any subsequent transitions will treat the set of locking state
associated with each replica as a single entity.

For example, if one of the replicas becomes unavailable, access will
be transferred to a different replica, which is also capable of
simultaneous access with the one still in use.

When there is no such replica, the transition may be to the replica
already in use. At this point, the client has a choice between
merging the locking state for the two replicas under the aegis of the
sole replica in use or treating these separately until another
replica capable of simultaneous access presents itself.

11.11.2. Filehandles and File System Transitions

There are a number of ways in which filehandles can be handled across
a file system transition. These can be divided into two broad
classes depending upon whether the two file systems across which the
transition happens share sufficient state to effect some sort of
continuity of file system handling.

When there is no such cooperation in filehandle assignment, the two
file systems are reported as being in different handle classes. In
this case, all filehandles are assumed to expire as part of the file
system transition. Note that this behavior does not depend on the
fh_expire_type attribute and supersedes the specification of the
FH4_VOL_MIGRATION bit, which only affects behavior when
fs_locations_info is not available.

When there is cooperation in filehandle assignment, the two file
systems are reported as being in the same handle class. In this
case, persistent filehandles remain valid after the file system
transition, while volatile filehandles (excluding those that are only
volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration
on the target server.

11.11.3. Fileids and File System Transitions

In NFSv4.0, the issue of continuity of fileids in the event of a file
system transition was not addressed. The general expectation had
been that in situations in which the two file system instances are
created by a single vendor using some sort of file system image copy,
fileids would be consistent across the transition, while in the
analogous multi-vendor transitions they would not. This poses
difficulties, especially for the client without special knowledge of
the transition mechanisms adopted by the server.
Note that although
fileid is not a REQUIRED attribute, many servers support fileids and
many clients provide APIs that depend on fileids.

It is important to note that while clients themselves may have no
trouble with a fileid changing as a result of a file system
transition event, applications do typically have access to the fileid
(e.g., via stat). The result is that an application may work
perfectly well if there is no file system instance transition or if
any such transition is among instances created by a single vendor,
yet be unable to deal with the situation in which a multi-vendor
transition occurs at the wrong time.

Providing the same fileids in a multi-vendor (multiple server
vendors) environment has generally been held to be quite difficult.
While there is work to be done, it needs to be pointed out that this
difficulty is partly self-imposed. Servers have typically identified
fileid with inode number, i.e., with a quantity used to find the file
in question. This identification poses special difficulties for
migration of a file system between vendors where assigning the same
index to a given file may not be possible. Note here that a fileid
is not required to be useful to find the file in question, only that
it is unique within the given file system. Servers prepared to
accept a fileid as a single piece of metadata and store it apart from
the value used to index the file information can relatively easily
maintain a fileid value across a migration event, allowing a truly
transparent migration event.

In any case, where servers can provide continuity of fileids, they
should, and the client should be able to find out that such
continuity is available and take appropriate action. Information
about the continuity (or lack thereof) of fileids across a file
system transition is represented by specifying whether the file
systems in question are of the same fileid class.

Note that when consistent fileids do not exist across a transition
(either because there is no continuity of fileids or because fileid
is not a supported attribute on one of the instances involved), and
there are no reliable filehandles across a transition event (either
because there is no filehandle continuity or because the filehandles
are volatile), the client is in a position where it cannot verify
that files it was accessing before the transition are the same
objects. It is forced to assume that no object has been renamed,
and, unless there are guarantees that provide this (e.g., the file
system is read-only), problems for applications may occur.
Therefore, use of such configurations should be limited to situations
where the problems that this may cause can be tolerated.

11.11.4. Fsids and File System Transitions

Since fsids are generally only unique on a per-server basis, it is
likely that they will change during a file system transition.
Clients should not make the fsids received from the server visible to
applications since they may not be globally unique, and because they
may change during a file system transition event. Applications are
best served if they are isolated from such transitions to the extent
possible.

Although normally a single source file system will transition to a
single target file system, there is a provision for splitting a
single source file system into multiple target file systems, by
specifying the FSLI4F_MULTI_FS flag.

11.11.4.1. File System Splitting
+
+ When a file system transition is made and the fs_locations_info
+ indicates that the file system in question might be split into
+ multiple file systems (via the FSLI4F_MULTI_FS flag), the client
+ SHOULD do GETATTRs to determine the fsid attribute on all known
+ objects within the file system undergoing transition to determine the
+ new file system boundaries.
+
+ Clients might choose to maintain the fsids passed to existing
+ applications by mapping all of the fsids for the descendant file
+ systems to the common fsid used for the original file system.
+
+ Splitting a file system can be done on a transition between file
+ systems of the same fileid class, since the fact that fileids are
+ unique within the source file system ensures that they will be unique
+ in each of the target file systems.
+
+11.11.5. The Change Attribute and File System Transitions
+
+ Since the change attribute is defined as a server-specific one,
+ change attributes fetched from one server are normally presumed to be
+ invalid on another server. Such a presumption is troublesome since
+ it would invalidate all cached change attributes, requiring
+ refetching. Even more disruptive, the absence of any assured
+ continuity for the change attribute means that even if the same value
+ is retrieved on refetch, no conclusions can be drawn as to whether
+ the object in question has changed. The identical change attribute
+ could be merely an artifact of a modified file with a different
+ change attribute construction algorithm, with that new algorithm just
+ happening to result in an identical change value.
+
+ When the two file systems have consistent change attribute formats,
+ and this fact is communicated to the client by reporting in the same
+ change class, the client may assume a continuity of change attribute
+ construction and handle this situation just as it would be handled
+ without any file system transition.
+
+11.11.6. Write Verifiers and File System Transitions
+
+ In a file system transition, the two file systems might be
+ cooperating in the handling of unstably written data. Clients can
+ determine if this is the case by seeing if the two file systems
+ belong to the same write-verifier class. When this is the case,
+ write verifiers returned from one system may be compared to those
+ returned by the other, and superfluous writes can be avoided.
+
+ When two file systems belong to different write-verifier classes, any
+ verifier generated by one must not be compared to one provided by the
+ other. Instead, the two verifiers should be treated as not equal
+ even when the values are identical.
+
+11.11.7. READDIR Cookies and Verifiers and File System Transitions
+
+ In a file system transition, the two file systems might be consistent
+ in their handling of READDIR cookies and verifiers. Clients can
+ determine if this is the case by seeing if the two file systems
+ belong to the same readdir class. When this is the case, READDIR
+ cookies and verifiers from one system will be recognized by the
+ other, and READDIR operations started on one server can be validly
+ continued on the other simply by presenting the cookie and verifier
+ returned by a READDIR operation done on the first file system to the
+ second.
+
+ When two file systems belong to different readdir classes, any
+ READDIR cookie and verifier generated by one is not valid on the
+ second and must not be presented to that server by the client. The
+ client should act as if the verifier were rejected.
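+
+ The following sketch (in Python; the session method and attribute
+ names are assumptions made for illustration, not protocol elements)
+ shows the class check described above being used to decide whether a
+ partially complete directory enumeration can be continued:
+
+     # Non-normative sketch: continuing READDIR across a transition.
+     def continue_readdir(session, old_fs, new_fs, dir_fh, cookie,
+                          verifier):
+         if old_fs.readdir_class == new_fs.readdir_class:
+             # Same readdir class: the cookie and verifier remain
+             # valid, so the enumeration is simply continued.
+             return session.readdir(dir_fh, cookie, verifier)
+         # Different readdir classes: the saved cookie and verifier
+         # must not be presented; restart from the beginning.
+         return session.readdir(dir_fh, 0, b"\x00" * 8)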
+
+11.11.8. File System Data and File System Transitions
+
+ When multiple replicas exist and are used simultaneously or in
+ succession by a client, applications using them will normally expect
+ that they contain either the same data or data that is consistent
+ with the normal sorts of changes that are made by other clients
+ updating the data of the file system (with metadata being the same to
+ the degree indicated by the fs_locations_info attribute). However,
+ when multiple file systems are presented as replicas of one another,
+ the precise relationship between the data of one and the data of
+ another is not, as a general matter, specified by the NFSv4.1
+ protocol. It is quite possible to present as replicas file systems
+ where the data of those file systems is sufficiently different that
+ some applications have problems dealing with the transition between
+ replicas. The namespace will typically be constructed so that
+ applications can choose an appropriate level of support, so that in
+ one position in the namespace, a varied set of replicas might be
+ listed, while in another, only those that are up-to-date would be
+ considered replicas. The protocol does define three special cases of
+ the relationship among replicas to be specified by the server and
+ relied upon by clients:
+
+ * When multiple replicas exist and are used simultaneously by a
+ client (see the FSLI4BX_CLSIMUL definition within
+ fs_locations_info), they must designate the same data. Where file
+ systems are writable, a change made on one instance must be
+ visible on all instances at the same time, regardless of whether
+ the interrogated instance is the one on which the modification was
+ done. This allows a client to use these replicas simultaneously
+ without any special adaptation to the fact that there are multiple
+ replicas, beyond adapting to the fact that locks obtained on one
+ replica are maintained separately (i.e., under a different client
+ ID). In this case, locks (whether share reservations or byte-
+ range locks) and delegations obtained on one replica are
+ immediately reflected on all replicas, in the sense that access
+ from all other servers is prevented regardless of the replica
+ used. However, because the servers are not required to treat two
+ associated client IDs as representing the same client, it is best
+ to access each file system using only a single client ID.
+
+ * When one replica is designated as the successor instance to
+ another existing instance after the return of NFS4ERR_MOVED (i.e.,
+ the case of migration), the client may depend on the fact that all
+ changes written to stable storage on the original instance are
+ written to stable storage of the successor (uncommitted writes are
+ dealt with in Section 11.11.6 above).
+
+ * Where a file system is not writable but represents a read-only
+ copy (possibly periodically updated) of a writable file system,
+ clients have similar requirements with regard to the propagation
+ of updates. They may need a guarantee that any change visible on
+ the original file system instance must be immediately visible on
+ any replica before the client transitions access to that replica,
+ in order to avoid any possibility that a client, in effecting a
+ transition to a replica, will see any reversion in file system
+ state. The specific means of this guarantee varies based on the
+ value of the fss_type field that is reported as part of the
+ fs_status attribute (see Section 11.18).
Since these file systems
+ are presumed to be unsuitable for simultaneous use, there is no
+ specification of how locking is handled; in general, locks
+ obtained on one file system will be separate from those on others.
+ Since these are expected to be read-only file systems, this is not
+ likely to pose an issue for clients or applications.
+
+ When none of these special situations applies, there is no basis
+ within the protocol for the client to make assumptions about the
+ contents of a replica file system or its relationship to previous
+ file system instances. Thus, switching between nominally identical
+ read-write file systems would not be possible because either the
+ client does not use the fs_locations_info attribute, or the server
+ does not support it.
+
+11.11.9. Lock State and File System Transitions
+
+ While accessing a file system, clients obtain locks enforced by the
+ server, which may prevent actions by other clients that are
+ inconsistent with those locks.
+
+ When access is transferred between replicas, clients need to be
+ assured that the actions disallowed by holding these locks cannot
+ have occurred during the transition. This can be ensured by the
+ methods below. Unless at least one of these is implemented, clients
+ will not be assured of continuity of lock possession across a
+ migration event:
+
+ * Providing the client an opportunity to re-obtain its locks via a
+ per-fs grace period on the destination server, denying all clients
+ using the destination file system the opportunity to obtain new
+ locks that conflict with those held by the transferred client as
+ long as that client has not completed its per-fs grace period.
+ Because the lock reclaim mechanism was originally defined to
+ support server reboot, it implicitly assumes that filehandles
+ will, upon reclaim, be the same as those at open. In the case of
+ migration, this requires that source and destination servers use
+ the same filehandles, as evidenced by using the same server scope
+ (see Section 2.10.4) or by showing this agreement using
+ fs_locations_info (see Section 11.11.2 above).
+
+ Note that such a grace period can be implemented without
+ interfering with the ability of non-transferred clients to obtain
+ new locks while it is going on. As long as the destination server
+ is aware of the transferred locks, it can distinguish requests to
+ obtain new locks that conflict with existing locks from those that
+ do not, allowing it to treat such client requests without
+ reference to the ongoing grace period.
+
+ * Locking state can be transferred as part of the transition by
+ providing Transparent State Migration as described in
+ Section 11.12.
+
+ Of these, Transparent State Migration provides the smoother
+ experience for clients in that there is no need to go through a
+ reclaim process before new locks can be obtained; however, it
+ requires a greater degree of inter-server coordination. In general,
+ the servers taking part in migration are free to provide either
+ facility. However, when the filehandles can differ across the
+ migration event, Transparent State Migration is the only available
+ means of providing the needed functionality.
+
+ It should be noted that these two methods are not mutually exclusive
+ and that a server might well provide both.
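+
+ From the client's side, the coexistence of the two methods suggests a
+ recovery path that first probes for transparently migrated state and
+ falls back to reclaim for whatever did not survive. A minimal sketch
+ follows (in Python; the session method names are hypothetical):
+
+     # Non-normative sketch: recovery when both methods may be present.
+     def recover_locks(session, fs, saved_stateids):
+         lost = [sid for sid in saved_stateids
+                 if not session.test_stateid_ok(sid)]
+         for sid in lost:
+             # State that was not transferred transparently is
+             # re-obtained by reclaim during the per-fs grace period.
+             session.reclaim_lock(fs, sid)
+         # Signal the end of reclaim for this file system only.
+         session.reclaim_complete(fs, rca_one_fs=True)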
+
+ For example, if there is some circumstance preventing a specific lock
+ from being transferred transparently, the destination server can
+ allow it to be reclaimed by implementing a per-fs grace period for
+ the migrated file system.
+
+11.11.9.1. Security Consideration Related to Reclaiming Lock State
+ after File System Transitions
+
+ Although it is possible for a client reclaiming state to misrepresent
+ its state in the same fashion as described in Section 8.4.2.1.1, most
+ implementations providing for such reclamation in the case of file
+ system transitions will have the ability to detect such
+ misrepresentations. This limits the ability of unauthenticated
+ clients to execute denial-of-service attacks in these circumstances.
+ Nevertheless, the rules stated in Section 8.4.2.1.1 regarding
+ principal verification for reclaim requests apply in this situation
+ as well.
+
+ Typically, implementations that support file system transitions will
+ have extensive information about the locks to be transferred. This
+ is because of the following:
+
+ * Since failure is not involved, there is no need to store locking
+ information in persistent storage.
+
+ * There is no need, as there is in the failure case, to update
+ multiple repositories containing locking state to keep them in
+ sync. Instead, there is a one-time communication of locking state
+ from the source to the destination server.
+
+ * Providing this information avoids the potential interference with
+ existing clients using the destination file system that would
+ result from denying them the ability to obtain new locks during
+ the grace period.
+
+ When such detailed locking information, not necessarily including the
+ associated stateids, is available:
+
+ * It is possible to detect reclaim requests that attempt to reclaim
+ locks that did not exist before the transfer, rejecting them with
+ NFS4ERR_RECLAIM_BAD (Section 15.1.9.4).
+
+ * It is possible, when dealing with non-reclaim requests, to
+ determine whether they conflict with existing locks, eliminating
+ the need to return NFS4ERR_GRACE (Section 15.1.9.2) on non-reclaim
+ requests.
+
+ It is possible for implementations of grace periods in connection
+ with file system transitions not to have detailed locking information
+ available at the destination server, in which case the security
+ situation is exactly as described in Section 8.4.2.1.1.
+
+11.11.9.2. Leases and File System Transitions
+
+ In the case of lease renewal, the client may not be submitting
+ requests for a file system that has been transferred to another
+ server. This can occur because of the lease renewal mechanism. The
+ client renews the lease associated with all file systems when
+ submitting a request on an associated session, regardless of the
+ specific file system being referenced.
+
+ In order for the client to schedule renewal of its lease where there
+ is locking state that may have been relocated to the new server, the
+ client must find out about lease relocation before that lease
+ expires. To accomplish this, the SEQUENCE operation will return the
+ status bit SEQ4_STATUS_LEASE_MOVED if responsibility for any of the
+ renewed locking state has been transferred to a new server. This
+ will continue until the client receives an NFS4ERR_MOVED error for
+ each of the file systems for which there has been locking state
+ relocation.
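+
+ A sketch of the resulting probing loop appears below (in Python; the
+ helper names are assumptions made for illustration); the paragraphs
+ that follow spell out the required behavior in detail:
+
+     # Non-normative sketch: clearing SEQ4_STATUS_LEASE_MOVED.
+     def probe_moved_leases(server, filesystems):
+         for fs in filesystems:
+             if not fs.has_locking_state:
+                 continue
+             # fs_status can be fetched without error even on an
+             # absent file system ...
+             if server.get_fs_status(fs).absent:
+                 # ... but only an operation that actually receives
+                 # NFS4ERR_MOVED clears the indication for this fs.
+                 server.provoke_moved(fs)
+             if not server.lease_moved():
+                 break    # no further relocated locking state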
+
+ When a client receives an SEQ4_STATUS_LEASE_MOVED indication from a
+ server, for each file system of the server for which the client has
+ locking state, the client should perform an operation. For
+ simplicity, the client may choose to reference all file systems, but
+ what is important is that it must reference all file systems for
+ which there was locking state where that state has moved. Once the
+ client receives an NFS4ERR_MOVED error for each such file system, the
+ server will clear the SEQ4_STATUS_LEASE_MOVED indication. The client
+ can terminate the process of checking file systems once this
+ indication is cleared (but only if the client has received a reply
+ for all outstanding SEQUENCE requests on all sessions it has with the
+ server), since there are no others for which locking state has moved.
+
+ A client may use GETATTR of the fs_status (or fs_locations_info)
+ attribute on all of the file systems to get absence indications in a
+ single (or a few) request(s), since absent file systems will not
+ cause an error in this context. However, it still must do an
+ operation that receives NFS4ERR_MOVED on each file system, in order
+ to clear the SEQ4_STATUS_LEASE_MOVED indication.
+
+ Once the set of file systems with transferred locking state has been
+ determined, the client can follow the normal process to obtain the
+ new server information (through the fs_locations and
+ fs_locations_info attributes) and perform renewal of that lease on
+ the new server, unless information in the fs_locations_info attribute
+ shows that no state could have been transferred. If the server has
+ not had state transferred to it transparently, the client will
+ receive NFS4ERR_STALE_CLIENTID from the new server, as described
+ above, and the client can then reclaim locks as is done in the event
+ of server failure.
+
+11.11.9.3. Transitions and the Lease_time Attribute
+
+ In order that the client may appropriately manage its lease in the
+ case of a file system transition, the destination server must
+ establish proper values for the lease_time attribute.
+
+ When state is transferred transparently, that state should include
+ the correct value of the lease_time attribute. The lease_time
+ attribute on the destination server must never be less than that on
+ the source, since this would result in premature expiration of a
+ lease granted by the source server. Upon transitions in which state
+ is transferred transparently, the client is under no obligation to
+ refetch the lease_time attribute and may continue to use the value
+ previously fetched (on the source server).
+
+ If state has not been transferred transparently, either because the
+ associated servers are shown as having different eir_server_scope
+ strings or because the client ID is rejected when presented to the
+ new server, the client should fetch the value of lease_time on the
+ new (i.e., destination) server, and use it for subsequent locking
+ requests. However, the server must respect a grace period of at
+ least as long as the lease_time on the source server, in order to
+ ensure that clients have ample time to reclaim their locks before
+ potentially conflicting non-reclaimed locks are granted.
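+
+ The resulting client behavior can be summarized in a short sketch
+ (in Python; the names are illustrative only):
+
+     # Non-normative sketch: lease_time handling across a transition.
+     def lease_time_to_use(client, fs, dest, state_transferred):
+         if state_transferred:
+             # The destination's lease_time is no less than the
+             # source's, so the previously fetched value remains safe.
+             return client.cached_lease_time(fs)
+         # Otherwise, fetch lease_time from the destination server
+         # and use it for subsequent locking requests.
+         return dest.fetch_lease_time()
+
+11.12. Transferring State upon Migration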
+
+ When the transition is a result of a server-initiated decision to
+ transition access, and the source and destination servers have
+ implemented appropriate cooperation, it is possible to do the
+ following:
+
+ * Transfer locking state from the source to the destination server
+ in a fashion similar to that provided by Transparent State
+ Migration in NFSv4.0, as described in [69]. Server
+ responsibilities are described in Section 11.14.2.
+
+ * Transfer session state from the source to the destination server.
+ Server responsibilities in effecting such a transfer are described
+ in Section 11.14.3.
+
+ The means by which the client determines which of these transfer
+ events has occurred are described in Section 11.13.
+
+11.12.1. Transparent State Migration and pNFS
+
+ When pNFS is involved, the protocol is capable of supporting:
+
+ * Migration of the Metadata Server (MDS), leaving the Data Servers
+ (DSs) in place.
+
+ * Migration of the file system as a whole, including the MDS and
+ associated DSs.
+
+ * Replacement of one DS by another.
+
+ * Migration of a pNFS file system to one in which pNFS is not used.
+
+ * Migration of a file system not using pNFS to one in which layouts
+ are available.
+
+ Note that migration, per se, is only involved in the transfer of the
+ MDS function. Although the servicing of a layout may be transferred
+ from one data server to another, this is not done using the file
+ system location attributes. The MDS can effect such transfers by
+ recalling or revoking existing layouts and granting new ones on a
+ different data server.
+
+ Migration of the MDS function is directly supported by Transparent
+ State Migration. Layout state will normally be transparently
+ transferred, just as other state is. As a result, Transparent State
+ Migration provides a framework in which, given appropriate inter-MDS
+ data transfer, one MDS can be substituted for another.
+
+ Migration of the file system function as a whole can be accomplished
+ by recalling all layouts as part of the initial phase of the
+ migration process. As a result, I/O will be done through the MDS
+ during the migration process, and new layouts can be granted once the
+ client is interacting with the new MDS. An MDS can also effect this
+ sort of transition by revoking all layouts as part of Transparent
+ State Migration, as long as the client is notified about the loss of
+ locking state.
+
+ In order to allow migration to a file system on which pNFS is not
+ supported, clients need to be prepared for a situation in which
+ layouts are not available or supported on the destination file system
+ and so direct I/O requests to the destination server, rather than
+ depending on layouts being available.
+
+ Replacement of one DS by another is not addressed by migration as
+ such but can be effected by an MDS recalling layouts for the DS to be
+ replaced and issuing new ones to be served by the successor DS.
+
+ Migration may transfer a file system from a server that does not
+ support pNFS to one that does. In order to properly adapt to this
+ situation, clients that support pNFS, but function adequately in its
+ absence, should check for pNFS support when a file system is migrated
+ and be prepared to use pNFS when support is available on the
+ destination.
+
+11.13. Client Responsibilities When Access Is Transitioned
+
+ For a client to respond to an access transition, it must become aware
+ of it.
The ways in which this can happen are discussed in
+ Section 11.13.1, which describes indications that a specific file
+ system access path has transitioned as well as situations in which
+ additional activity is necessary to determine the set of file systems
+ that have been migrated. Section 11.13.2 goes on to complete the
+ discussion of how the set of migrated file systems might be
+ determined. Sections 11.13.3 through 11.13.5 discuss how the client
+ should deal with each transition it becomes aware of, either directly
+ or as a result of migration discovery.
+
+ The following terms are used to describe client activities:
+
+ * "Transition recovery" refers to the process of restoring access to
+ a file system on which NFS4ERR_MOVED was received.
+
+ * "Migration recovery" refers to that subset of transition recovery
+ that applies when the file system has migrated to a different
+ replica.
+
+ * "Migration discovery" refers to the process of determining which
+ file system(s) have been migrated. It is necessary to avoid a
+ situation in which leases could expire when a file system is not
+ accessed for a long period of time, since a client unaware of the
+ migration might be referencing an unmigrated file system and not
+ renewing the lease associated with the migrated file system.
+
+11.13.1. Client Transition Notifications
+
+ When there is a change in the network access path that a client is to
+ use to access a file system, there are a number of related status
+ indications with which clients need to deal:
+
+ * If an attempt is made to use or return a filehandle within a file
+ system that is no longer accessible at the address previously used
+ to access it, the error NFS4ERR_MOVED is returned.
+
+ Exceptions are made to allow such filehandles to be used when
+ interrogating a file system location attribute. This enables a
+ client to determine a new replica's location or a new network
+ access path.
+
+ This condition continues on subsequent attempts to access the file
+ system in question. The only way the client can avoid the error
+ is to cease accessing the file system in question at its old
+ server location and access it instead using a different address at
+ which it is now available.
+
+ * Whenever a client sends a SEQUENCE operation to a server that
+ generated state held on that client and associated with a file
+ system no longer accessible on that server, the response will
+ contain the status bit SEQ4_STATUS_LEASE_MOVED, indicating that
+ there has been a lease migration.
+
+ This condition continues until the client acknowledges the
+ notification by fetching a file system location attribute for the
+ file system whose network access path is being changed. When
+ there are multiple such file systems, the location attribute for
+ each of them needs to be fetched in order to clear the condition.
+ Even after the condition is cleared, the client needs to respond
+ by using the location information to access the file system at its
+ new location to ensure that leases are not needlessly expired.
+
+ Unlike NFSv4.0, in which the corresponding conditions are both errors
+ and thus mutually exclusive, in NFSv4.1 the client can, and often
+ will, receive both indications on the same request. As a result,
+ implementations need to address the question of how to coordinate the
+ necessary recovery actions when both indications arrive in the
+ response to the same request.
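+
+ In outline, such coordination might look as follows (a Python sketch
+ with hypothetical method names; the four possible combinations of the
+ two indications are enumerated below):
+
+     # Non-normative sketch: reacting to the two indications.
+     def handle_result(client, server, fs, status, lease_moved):
+         if status == "NFS4ERR_MOVED":
+             if not client.transition_recovery_active(fs):
+                 client.start_transition_recovery(fs)
+         if lease_moved:
+             if not client.migration_discovery_active(server):
+                 client.start_migration_discovery(server)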
+
+ It should be noted that when processing an NFSv4 COMPOUND, the server
+ will normally decide whether SEQ4_STATUS_LEASE_MOVED is to be set
+ before it determines which file system will be referenced or whether
+ NFS4ERR_MOVED is to be returned.
+
+ Since these indications are not mutually exclusive in NFSv4.1, the
+ following combinations are possible results when a COMPOUND is
+ issued:
+
+ * The COMPOUND status is NFS4ERR_MOVED, and SEQ4_STATUS_LEASE_MOVED
+ is asserted.
+
+ In this case, transition recovery is required. While it is
+ possible that migration discovery is needed in addition, it is
+ likely that only the accessed file system has transitioned. In
+ any case, because addressing NFS4ERR_MOVED is necessary to allow
+ the rejected requests to be processed on the target, dealing with
+ it will typically have priority over migration discovery.
+
+ * The COMPOUND status is NFS4ERR_MOVED, and SEQ4_STATUS_LEASE_MOVED
+ is clear.
+
+ In this case, transition recovery is also required. It is clear
+ that migration discovery is not needed to find file systems that
+ have been migrated other than the one returning NFS4ERR_MOVED.
+ Cases in which this result can arise include a referral or a
+ migration for which there is no associated locking state. This
+ can also arise in cases in which an access path transition other
+ than migration occurs within the same server. In such a case,
+ there is no need to set SEQ4_STATUS_LEASE_MOVED, since the lease
+ remains associated with the current server even though the access
+ path has changed.
+
+ * The COMPOUND status is not NFS4ERR_MOVED, and
+ SEQ4_STATUS_LEASE_MOVED is asserted.
+
+ In this case, no transition recovery activity is required on the
+ file system(s) accessed by the request. However, to prevent
+ avoidable lease expiration, migration discovery needs to be done.
+
+ * The COMPOUND status is not NFS4ERR_MOVED, and
+ SEQ4_STATUS_LEASE_MOVED is clear.
+
+ In this case, neither transition-related activity nor migration
+ discovery is required.
+
+ Note that the specified actions only need to be taken if they are not
+ already going on. For example, when NFS4ERR_MOVED is received while
+ accessing a file system for which transition recovery is already
+ occurring, the client merely waits for that recovery to be completed,
+ while the receipt of the SEQ4_STATUS_LEASE_MOVED indication only
+ needs to initiate migration discovery for a server if such discovery
+ is not already underway for that server.
+
+ The fact that a lease-migrated condition does not result in an error
+ in NFSv4.1 has a number of important consequences. In addition to
+ the fact that the two indications are not mutually exclusive, as
+ discussed above, there are a number of issues that are important in
+ considering implementation of migration discovery, as discussed in
+ Section 11.13.2.
+
+ Because SEQ4_STATUS_LEASE_MOVED is not an error condition, it is
+ possible for file systems whose access paths have not changed to be
+ successfully accessed on a given server even though recovery is
+ necessary for other file systems on the same server. As a result,
+ access can take place while:
+
+ * The migration discovery process is happening for that server.
+
+ * The transition recovery process is happening for other file
+ systems connected to that server.
+
+11.13.2. Performing Migration Discovery
+
+ Migration discovery can be performed in the same context as
+ transition recovery, allowing recovery for each migrated file system
+ to be invoked as it is discovered.
Alternatively, it may be done in + a separate migration discovery thread, allowing migration discovery + to be done in parallel with one or more instances of transition + recovery. + + In either case, because the lease-migrated indication does not result + in an error, other access to file systems on the server can proceed + normally, with the possibility that further such indications will be + received, raising the issue of how such indications are to be dealt + with. In general: + + * No action needs to be taken for such indications received by any + threads performing migration discovery, since continuation of that + work will address the issue. + + * In other cases in which migration discovery is currently being + performed, nothing further needs to be done to respond to such + lease migration indications, as long as one can be certain that + the migration discovery process would deal with those indications. + See below for details. + + * For such indications received in all other contexts, the + appropriate response is to initiate or otherwise provide for the + execution of migration discovery for file systems associated with + the server IP address returning the indication. + + This leaves a potential difficulty in situations in which the + migration discovery process is near to completion but is still + operating. One should not ignore a SEQ4_STATUS_LEASE_MOVED + indication if the migration discovery process is not able to respond + to the discovery of additional migrating file systems without + additional aid. A further complexity relevant in addressing such + situations is that a lease-migrated indication may reflect the + server's state at the time the SEQUENCE operation was processed, + which may be different from that in effect at the time the response + is received. Because new migration events may occur at any time, and + because a SEQ4_STATUS_LEASE_MOVED indication may reflect the + situation in effect a considerable time before the indication is + received, special care needs to be taken to ensure that + SEQ4_STATUS_LEASE_MOVED indications are not inappropriately ignored. + + A useful approach to this issue involves the use of separate + externally-visible migration discovery states for each server. + Separate values could represent the various possible states for the + migration discovery process for a server: + + * Non-operation, in which migration discovery is not being + performed. + + * Normal operation, in which there is an ongoing scan for migrated + file systems. + + * Completion/verification of migration discovery processing, in + which the possible completion of migration discovery processing + needs to be verified. + + Given that framework, migration discovery processing would proceed as + follows: + + * While in the normal-operation state, the thread performing + discovery would fetch, for successive file systems known to the + client on the server being worked on, a file system location + attribute plus the fs_status attribute. + + * If the fs_status attribute indicates that the file system is a + migrated one (i.e., fss_absent is true, and fss_type != + STATUS4_REFERRAL), then a migrated file system has been found. In + this situation, it is likely that the fetch of the file system + location attribute has cleared one of the file systems + contributing to the lease-migrated indication. 
+ + * In cases in which that happened, the thread cannot know whether + the lease-migrated indication has been cleared, and so it enters + the completion/verification state and proceeds to issue a COMPOUND + to see if the SEQ4_STATUS_LEASE_MOVED indication has been cleared. + + * When the discovery process is in the completion/verification + state, if other requests get a lease-migrated indication, they + note that it was received. Later, the existence of such + indications is used when the request completes, as described + below. + + When the request used in the completion/verification state completes: + + * If a lease-migrated indication is returned, the discovery + continues normally. Note that this is so even if all file systems + have been traversed, since new migrations could have occurred + while the process was going on. + + * Otherwise, if there is any record that other requests saw a lease- + migrated indication while the request was occurring, that record + is cleared, and the verification request is retried. The + discovery process remains in the completion/verification state. + + * If there have been no lease-migrated indications, the work of + migration discovery is considered completed, and it enters the + non-operating state. Once it enters this state, subsequent lease- + migrated indications will trigger a new migration discovery + process. + + It should be noted that the process described above is not guaranteed + to terminate, as a long series of new migration events might + continually delay the clearing of the SEQ4_STATUS_LEASE_MOVED + indication. To prevent unnecessary lease expiration, it is + appropriate for clients to use the discovery of migrations to effect + lease renewal immediately, rather than waiting for the clearing of + the SEQ4_STATUS_LEASE_MOVED indication when the complete set of + migrations is available. + + Lease discovery needs to be provided as described above. This + ensures that the client discovers file system migrations soon enough + to renew its leases on each destination server before they expire. + Non-renewal of leases can lead to loss of locking state. While the + consequences of such loss can be ameliorated through implementations + of courtesy locks, servers are under no obligation to do so, and a + conflicting lock request may mean that a lock is revoked + unexpectedly. Clients should be aware of this possibility. + +11.13.3. Overview of Client Response to NFS4ERR_MOVED + + This section outlines a way in which a client that receives + NFS4ERR_MOVED can effect transition recovery by using a new server or + server endpoint if one is available. As part of that process, it + will determine: + + * Whether the NFS4ERR_MOVED indicates migration has occurred, or + whether it indicates another sort of file system access transition + as discussed in Section 11.10 above. + + * In the case of migration, whether Transparent State Migration has + occurred. + + * Whether any state has been lost during the process of Transparent + State Migration. + + * Whether sessions have been transferred as part of Transparent + State Migration. + + During the first phase of this process, the client proceeds to + examine file system location entries to find the initial network + address it will use to continue access to the file system or its + replacement. For each location entry that the client examines, the + process consists of five steps: + + 1. Performing an EXCHANGE_ID directed at the location address. 
This + operation is used to register the client owner (in the form of a + client_owner4) with the server, to obtain a client ID to be used + subsequently to communicate with it, to obtain that client ID's + confirmation status, and to determine server_owner4 and scope for + the purpose of determining if the entry is trunkable with the + address previously being used to access the file system (i.e., + that it represents another network access path to the same file + system and can share locking state with it). + + 2. Making an initial determination of whether migration has + occurred. The initial determination will be based on whether the + EXCHANGE_ID results indicate that the current location element is + server-trunkable with that used to access the file system when + access was terminated by receiving NFS4ERR_MOVED. If it is, then + migration has not occurred. In that case, the transition is + dealt with, at least initially, as one involving continued access + to the same file system on the same server through a new network + address. + + 3. Obtaining access to existing session state or creating new + sessions. How this is done depends on the initial determination + of whether migration has occurred and can be done as described in + Section 11.13.4 below in the case of migration or as described in + Section 11.13.5 below in the case of a network address transfer + without migration. + + 4. Verifying the trunking relationship assumed in step 2 as + discussed in Section 2.10.5.1. Although this step will generally + confirm the initial determination, it is possible for + verification to invalidate the initial determination of network + address shift (without migration) and instead determine that + migration had occurred. There is no need to redo step 3 above, + since it will be possible to continue use of the session + established already. + + 5. Obtaining access to existing locking state and/or re-obtaining + it. How this is done depends on the final determination of + whether migration has occurred and can be done as described below + in Section 11.13.4 in the case of migration or as described in + Section 11.13.5 in the case of a network address transfer without + migration. + + Once the initial address has been determined, clients are free to + apply an abbreviated process to find additional addresses trunkable + with it (clients may seek session-trunkable or server-trunkable + addresses depending on whether they support client ID trunking). + During this later phase of the process, further location entries are + examined using the abbreviated procedure specified below: + + A: Before the EXCHANGE_ID, the fs name of the location entry is + examined, and if it does not match that currently being used, the + entry is ignored. Otherwise, one proceeds as specified by step 1 + above. + + B: In the case that the network address is session-trunkable with + one used previously, a BIND_CONN_TO_SESSION is used to access + that session using the new network address. Otherwise, or if the + bind operation fails, a CREATE_SESSION is done. + + C: The verification procedure referred to in step 4 above is used. + However, if it fails, the entry is ignored and the next available + entry is used. + +11.13.4. Obtaining Access to Sessions and State after Migration + + In the event that migration has occurred, migration recovery will + involve determining whether Transparent State Migration has occurred. 
+ This decision is made based on the client ID returned by the
+ EXCHANGE_ID and the reported confirmation status.
+
+ * If the client ID is an unconfirmed client ID not previously known
+ to the client, then Transparent State Migration has not occurred.
+
+ * If the client ID is a confirmed client ID previously known to the
+ client, then any transferred state would have been merged with an
+ existing client ID representing the client to the destination
+ server. In this state merger case, Transparent State Migration
+ might or might not have occurred, and a determination as to
+ whether it has occurred is deferred until sessions are established
+ and the client is ready to begin state recovery.
+
+ * If the client ID is a confirmed client ID not previously known to
+ the client, then the client can conclude that the client ID was
+ transferred as part of Transparent State Migration. In this
+ transferred client ID case, Transparent State Migration has
+ occurred, although some state might have been lost.
+
+ Once the client ID has been obtained, it is necessary to obtain
+ access to sessions to continue communication with the new server. In
+ any of the cases in which Transparent State Migration has occurred,
+ it is possible that a session was transferred as well. To deal with
+ that possibility, clients can, after doing the EXCHANGE_ID, issue a
+ BIND_CONN_TO_SESSION to connect the transferred session to a
+ connection to the new server. If that fails, it is an indication
+ that the session was not transferred and that a new session needs to
+ be created to take its place.
+
+ In some situations, it is possible for a BIND_CONN_TO_SESSION to
+ succeed without session migration having occurred. If state merger
+ has taken place, then the associated client ID may have already had a
+ set of existing sessions, with it being possible that the session ID
+ of a given session is the same as one that might have been migrated.
+ In that event, a BIND_CONN_TO_SESSION might succeed, even though
+ there could have been no migration of the session with that session
+ ID. In such cases, the client will receive sequence errors when the
+ slot sequence values used are not appropriate on the new session.
+ When this occurs, the client can create a new session and cease
+ using the existing one.
+
+ Once the client has determined the initial migration status, and
+ determined that there was a shift to a new server, it needs to re-
+ establish its locking state, if possible. To enable this to happen
+ without loss of the guarantees normally provided by locking, the
+ destination server needs to implement a per-fs grace period in all
+ cases in which lock state was lost, including those in which
+ Transparent State Migration was not implemented. Each client for
+ which there was a transfer of locking state to the new server will
+ have the duration of the grace period to reclaim its locks, from the
+ time its locks were transferred.
+
+ Clients need to deal with the following cases:
+
+ * In the state merger case, it is possible that the server has not
+ attempted Transparent State Migration, in which case state may
+ have been lost without it being reflected in the SEQ4_STATUS bits.
+ To determine whether this has happened, the client can use
+ TEST_STATEID to check whether the stateids created on the source
+ server are still accessible on the destination server.
Once a
+ single stateid is found to have been successfully transferred, the
+ client can conclude that Transparent State Migration was begun,
+ and any failure to transport all of the stateids will be reflected
+ in the SEQ4_STATUS bits. Otherwise, Transparent State Migration
+ has not occurred.
+
+ * In a case in which Transparent State Migration has not occurred,
+ the client can use the per-fs grace period provided by the
+ destination server to reclaim locks that were held on the source
+ server.
+
+ * In a case in which Transparent State Migration has occurred, and
+ no lock state was lost (as shown by SEQ4_STATUS flags), no lock
+ reclaim is necessary.
+
+ * In a case in which Transparent State Migration has occurred, and
+ some lock state was lost (as shown by SEQ4_STATUS flags), existing
+ stateids need to be checked for validity using TEST_STATEID, and
+ reclaim used to re-establish any that were not transferred.
+
+ For all of the cases above, RECLAIM_COMPLETE with an rca_one_fs value
+ of TRUE needs to be done before normal use of the file system,
+ including obtaining new locks for the file system. This applies even
+ if no locks were lost and there was no need for any to be reclaimed.
+
+11.13.5. Obtaining Access to Sessions and State after Network Address
+ Transfer
+
+ The case in which there is a transfer to a new network address
+ without migration is similar to that described in Section 11.13.4
+ above in that there is a need to obtain access to needed sessions and
+ locking state. However, the details are simpler and will vary
+ depending on the type of trunking between the address receiving
+ NFS4ERR_MOVED and that to which the transfer is to be made.
+
+ To make a session available for use, a BIND_CONN_TO_SESSION should be
+ used to obtain access to the session previously in use. Only if this
+ fails should a CREATE_SESSION be done. While this procedure mirrors
+ that in Section 11.13.4 above, there is an important difference in
+ that preservation of the session is not purely optional but depends
+ on the type of trunking.
+
+ Access to appropriate locking state will generally need no actions
+ beyond access to the session. However, the SEQ4_STATUS bits need to
+ be checked for lost locking state, including the need to reclaim
+ locks after a server reboot, since there is always a possibility of
+ locking state being lost.
+
+11.14. Server Responsibilities Upon Migration
+
+ In the event of file system migration, when the client connects to
+ the destination server, that server needs to be able to provide the
+ client continued access to the files it had open on the source
+ server. There are two ways to provide this:
+
+ * By provision of an fs-specific grace period, allowing the client
+ the ability to reclaim its locks, in a fashion similar to what
+ would have been done in the case of recovery from a server
+ restart. See Section 11.14.1 for a more complete discussion.
+
+ * By implementing Transparent State Migration, possibly in
+ connection with session migration, the server can provide the
+ client immediate access on the destination server to the state
+ built up on the source server.
+
+ These features are discussed separately in Sections 11.14.2 and
+ 11.14.3, which discuss Transparent State Migration and session
+ migration, respectively.
+
+ All the features described above can involve transfer of lock-
+ related information between source and destination servers.
In some cases, + this transfer is a necessary part of the implementation, while in + other cases, it is a helpful implementation aid, which servers might + or might not use. The subsections below discuss the information that + would be transferred but do not define the specifics of the transfer + protocol. This is left as an implementation choice, although + standards in this area could be developed at a later time. + +11.14.1. Server Responsibilities in Effecting State Reclaim after + Migration + + In this case, the destination server needs no knowledge of the locks + held on the source server. It relies on the clients to accurately + report (via reclaim operations) the locks previously held, and does + not allow new locks to be granted on migrated file systems until the + grace period expires. Disallowing of new locks applies to all + clients accessing these file systems, while grace period expiration + occurs for each migrated client independently. + + During this grace period, clients have the opportunity to use reclaim + operations to obtain locks for file system objects within the + migrated file system, in the same way that they do when recovering + from server restart, and the servers typically rely on clients to + accurately report their locks, although they have the option of + subjecting these requests to verification. If the clients only + reclaim locks held on the source server, no conflict can arise. Once + the client has reclaimed its locks, it indicates the completion of + lock reclamation by performing a RECLAIM_COMPLETE specifying + rca_one_fs as TRUE. + + While it is not necessary for source and destination servers to + cooperate to transfer information about locks, implementations are + well advised to consider transferring the following useful + information: + + * If information about the set of clients that have locking state + for the transferred file system is made available, the destination + server will be able to terminate the grace period once all such + clients have reclaimed their locks, allowing normal locking + activity to resume earlier than it would have otherwise. + + * Locking summary information for individual clients (at various + possible levels of detail) can detect some instances in which + clients do not accurately represent the locks held on the source + server. + +11.14.2. Server Responsibilities in Effecting Transparent State + Migration + + The basic responsibility of the source server in effecting + Transparent State Migration is to make available to the destination + server a description of each piece of locking state associated with + the file system being migrated. In addition to client id string and + verifier, the source server needs to provide for each stateid: + + * The stateid including the current sequence value. + + * The associated client ID. + + * The handle of the associated file. + + * The type of the lock, such as open, byte-range lock, delegation, + or layout. + + * For locks such as opens and byte-range locks, there will be + information about the owner(s) of the lock. + + * For recallable/revocable lock types, the current recall status + needs to be included. + + * For each lock type, there will be associated type-specific + information. For opens, this will include share and deny mode + while for byte-range locks and layouts, there will be a type and a + byte-range. 
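+
+ While the transfer protocol itself is left unspecified, one possible
+ shape for such a per-stateid record is sketched below (a Python
+ dataclass; all field names are illustrative):
+
+     # Non-normative sketch: one record per transferred stateid.
+     from dataclasses import dataclass, field
+     from typing import Optional
+
+     @dataclass
+     class MigratedLock:
+         stateid: bytes               # includes current sequence value
+         client_id: int               # associated client ID
+         filehandle: bytes            # handle of the associated file
+         lock_type: str               # "open", "byte-range",
+                                      # "delegation", or "layout"
+         owner: Optional[bytes] = None        # opens, byte-range locks
+         recall_status: Optional[str] = None  # recallable/revocable
+         type_specific: dict = field(default_factory=dict)
+                                      # share/deny modes, byte-ranges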
+
+ Such information will most probably be organized by client id string
+ on the destination server so that it can be used to provide
+ appropriate context to each client when it makes itself known to the
+ destination server. Issues connected with a client impersonating
+ another by presenting another client's client id string can be
+ addressed using NFSv4.1 state protection features, as described in
+ Section 21.
+
+ A further server responsibility concerns locks that are revoked or
+ otherwise lost during the process of file system migration. Because
+ locks that appear to be lost during the process of migration will be
+ reclaimed by the client, the servers have to take steps to ensure
+ that locks revoked soon before or soon after migration are not
+ inadvertently allowed to be reclaimed in situations in which the
+ continuity of lock possession cannot be assured.
+
+ * For locks lost on the source but whose loss has not yet been
+ acknowledged by the client (by using FREE_STATEID), the
+ destination must be aware of this loss so that it can deny a
+ request to reclaim them.
+
+ * For locks lost on the destination after the state transfer but
+ before the client's RECLAIM_COMPLETE is done, the destination
+ server should note these and not allow them to be reclaimed.
+
+ An additional responsibility of the cooperating servers concerns
+ situations in which a stateid cannot be transferred transparently
+ because it conflicts with an existing stateid held by the client and
+ associated with a different file system. In this case, there are two
+ valid choices:
+
+ * Treat the transfer, as in NFSv4.0, as one without Transparent
+ State Migration. In this case, conflicting locks cannot be
+ granted until the client does a RECLAIM_COMPLETE, after reclaiming
+ the locks it had, with the exception of reclaims denied because
+ they were attempts to reclaim locks that had been lost.
+
+ * Implement Transparent State Migration, except for the lock with
+ the conflicting stateid. In this case, the client will be aware
+ of a lost lock (through the SEQ4_STATUS flags) and be allowed to
+ reclaim it.
+
+ When transferring state between the source and destination, the
+ issues discussed in Section 7.2 of [69] must still be attended to.
+ In this case, the use of NFS4ERR_DELAY may still be necessary in
+ NFSv4.1, as it was in NFSv4.0, to prevent locking state from changing
+ while it is being transferred. See Section 15.1.1.3 for information
+ about appropriate client retry approaches in the event that
+ NFS4ERR_DELAY is returned.
+
+ There are a number of important differences in the NFSv4.1 context:
+
+ * The absence of RELEASE_LOCKOWNER means that the one case in which
+ an operation could not be deferred by use of NFS4ERR_DELAY no
+ longer exists.
+
+ * Sequencing of operations is no longer done using owner-based
+ operation sequence numbers. Instead, sequencing is session-
+ based.
+
+ As a result, when sessions are not transferred, the techniques
+ discussed in Section 7.2 of [69] are adequate and will not be further
+ discussed.
+
+11.14.3. Server Responsibilities in Effecting Session Transfer
+
+ The basic responsibility of the source server in effecting session
+ transfer is to make available to the destination server a description
+ of the current state of each slot within the session, including the
+ following:
+
+ * The last sequence value received for that slot.
+
+ * Whether there is cached reply data for the last request executed
+ and, if so, the cached reply.
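+
+ Again purely as an illustration of the information involved (the
+ names below are not protocol elements), the per-slot state might be
+ represented as:
+
+     # Non-normative sketch: per-slot state handed to the destination.
+     from dataclasses import dataclass
+     from typing import List, Optional
+
+     @dataclass
+     class MigratedSlot:
+         last_seqid: int                  # last sequence value received
+         cached_reply: Optional[bytes]    # cached reply, if retained
+
+     @dataclass
+     class MigratedSession:
+         session_id: bytes
+         slots: List[MigratedSlot]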
+
+ When sessions are transferred, there are a number of issues that pose
+ challenges in terms of making the transferred state unmodifiable
+ during the period it is gathered up and transferred to the
+ destination server:
+
+ * A single session may be used to access multiple file systems, not
+ all of which are being transferred.
+
+ * Requests made on a session may, even if rejected, affect the state
+ of the session by advancing the sequence number associated with
+ the slot used.
+
+ As a result, when the file system state might otherwise be considered
+ unmodifiable, the client might have any number of in-flight requests,
+ each of which is capable of changing session state, which may be of a
+ number of types:
+
+ 1. Those requests that were processed on the migrating file system
+ before migration began.
+
+ 2. Those requests that received the error NFS4ERR_DELAY because the
+ file system being accessed was in the process of being migrated.
+
+ 3. Those requests that received the error NFS4ERR_MOVED because the
+ file system being accessed had been migrated.
+
+ 4. Those requests that accessed the migrating file system in order
+ to obtain location or status information.
+
+ 5. Those requests that did not reference the migrating file system.
+
+ It should be noted that the history of any particular slot is likely
+ to include a number of these request classes. In the case in which a
+ migrated session has been used to access file systems other than the
+ one migrated, requests of class 5 may be common and may be the last
+ request processed for many slots.
+
+ Since session state can change even after the locking state has been
+ fixed as part of the migration process, the session state known to
+ the client could be different from that on the destination server,
+ which necessarily reflects the session state on the source server at
+ an earlier time. In deciding how to deal with this situation, it is
+ helpful to distinguish between two sorts of behavioral consequences
+ of the choice of initial sequence ID values:
+
+ * The error NFS4ERR_SEQ_MISORDERED is returned when the sequence ID
+ in a request is neither equal to the last one seen for the current
+ slot nor the next greater one.
+
+ In view of the difficulty of arriving at a mutually acceptable
+ value for the correct last sequence value at the point of
+ migration, it may be necessary for the server to show some degree
+ of forbearance when the sequence ID is one that would be
+ considered unacceptable if session migration were not involved.
+
+ * Returning the cached reply for a previously executed request when
+ the sequence ID in the request matches the last value recorded for
+ the slot.
+
+ In the cases in which an error is returned and there is no
+ possibility of any non-idempotent operation having been executed,
+ it may not be necessary to adhere to this as strictly as might be
+ proper if session migration were not involved. For example, the
+ fact that the error NFS4ERR_DELAY was returned may not assist the
+ client in any material way, while the fact that NFS4ERR_MOVED was
+ returned by the source server may not be relevant when the request
+ was reissued and directed to the destination server.
+
+ An important issue is that the specification needs to take note of
+ all potential COMPOUNDs, even if they might be unlikely in practice.
+ For example, a COMPOUND is allowed to access multiple file systems
+ and might perform non-idempotent operations in some of them before
+ accessing a file system being migrated.
Also, a COMPOUND may return
+ considerable data in the response before being rejected with
+ NFS4ERR_DELAY or NFS4ERR_MOVED, and may in addition be marked as
+ sa_cachethis. However, note that if the client and server adhere to
+ rules in Section 15.1.1.3, there is no possibility of non-idempotent
+ operations being spuriously reissued after receiving an NFS4ERR_DELAY
+ response.
+
+ To address these issues, a destination server MAY do any of the
+ following when implementing session transfer:
+
+ * Avoid enforcing any sequencing semantics for a particular slot
+ until the client has established the starting sequence for that
+ slot on the destination server.
+
+ * For each slot, avoid returning a cached reply returning
+ NFS4ERR_DELAY or NFS4ERR_MOVED until the client has established
+ the starting sequence for that slot on the destination server.
+
+ * Until the client has established the starting sequence for a
+ particular slot on the destination server, avoid reporting
+ NFS4ERR_SEQ_MISORDERED or returning a cached reply that contains
+ either NFS4ERR_DELAY or NFS4ERR_MOVED and consists solely of a
+ series of operations where the response is NFS4_OK until the final
+ error.
+
+ Because of the considerations mentioned above, including the rules
+ for the handling of NFS4ERR_DELAY included in Section 15.1.1.3, the
+ destination server can respond appropriately to SEQUENCE operations
+ received from the client by adopting the three policies listed below:
+
+ * Not responding with NFS4ERR_SEQ_MISORDERED for the initial request
+ on a slot within a transferred session, because the destination
+ server cannot be aware of requests made by the client after the
+ server handoff but before the client became aware of the shift.
+ In cases in which NFS4ERR_SEQ_MISORDERED would normally have been
+ reported, the request is to be processed normally as a new
+ request.
+
+ * Replying as it would for a retry whenever the sequence matches
+ that transferred by the source server, even though this would not
+ provide retry handling for requests issued after the server
+ handoff, under the assumption that, when such requests are issued,
+ they will never be responded to in a state-changing fashion,
+ making retry support for them unnecessary.
+
+ * Once a non-retry SEQUENCE is received for a given slot, using that
+ as the basis for further sequence checking, with no further
+ reference to the sequence value transferred by the source server.
+
+11.15. Effecting File System Referrals
+
+ Referrals are effected when an absent file system is encountered and
+ one or more alternate locations are made available by the
+ fs_locations or fs_locations_info attributes. The client will
+ typically get an NFS4ERR_MOVED error, fetch the appropriate location
+ information, and proceed to access the file system on a different
+ server, even though it retains its logical position within the
+ original namespace. Referrals differ from migration events in that
+ they happen only when the client has not previously referenced the
+ file system in question (so there is nothing to transition).
+ Referrals can only come into effect when an absent file system is
+ encountered at its root.
+
+ The examples given in the sections below are somewhat artificial in
+ that an actual client will not typically do a multi-component look
+ up, but will have cached information regarding the upper levels of
+ the name hierarchy.
However, these examples are chosen to make the + required behavior clear and easy to put within the scope of a small + number of requests, without getting into a discussion of the details + of how specific clients might choose to cache things. + +11.15.1. Referral Example (LOOKUP) + + Let us suppose that the following COMPOUND is sent in an environment + in which /this/is/the/path is absent from the target server. This + may be for a number of reasons. It may be that the file system has + moved, or it may be that the target server is functioning mainly, or + solely, to refer clients to the servers on which various file systems + are located. + + * PUTROOTFH + + * LOOKUP "this" + + * LOOKUP "is" + + * LOOKUP "the" + + * LOOKUP "path" + + * GETFH + + * GETATTR (fsid, fileid, size, time_modify) + + Under the given circumstances, the following will be the result. + + * PUTROOTFH --> NFS_OK. The current fh is now the root of the + pseudo-fs. + + * LOOKUP "this" --> NFS_OK. The current fh is for /this and is + within the pseudo-fs. + + * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is + within the pseudo-fs. + + * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and + is within the pseudo-fs. + + * LOOKUP "path" --> NFS_OK. The current fh is for /this/is/the/path + and is within a new, absent file system, but ... the client will + never see the value of that fh. + + * GETFH --> NFS4ERR_MOVED. Fails because current fh is in an absent + file system at the start of the operation, and the specification + makes no exception for GETFH. + + * GETATTR (fsid, fileid, size, time_modify). Not executed because + the failure of the GETFH stops processing of the COMPOUND. + + Given the failure of the GETFH, the client has the job of determining + the root of the absent file system and where to find that file + system, i.e., the server and path relative to that server's root fh. + Note that in this example, the client did not obtain filehandles and + attribute information (e.g., fsid) for the intermediate directories, + so that it would not be sure where the absent file system starts. It + could be the case, for example, that /this/is/the is the root of the + moved file system and that the reason that the look up of "path" + succeeded is that the file system was not absent on that operation + but was moved between the last LOOKUP and the GETFH (since COMPOUND + is not atomic). Even if we had the fsids for all of the intermediate + directories, we could have no way of knowing that /this/is/the/path + was the root of a new file system, since we don't yet have its fsid. + + In order to get the necessary information, let us re-send the chain + of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we + can be sure where the appropriate file system boundaries are. The + client could choose to get fs_locations_info at the same time but in + most cases the client will have a good guess as to where file system + boundaries are (because of where NFS4ERR_MOVED was, and was not, + received) making fetching of fs_locations_info unnecessary. + + OP01: PUTROOTFH --> NFS_OK + + * Current fh is root of pseudo-fs. + + OP02: GETATTR(fsid) --> NFS_OK + + * Just for completeness. Normally, clients will know the fsid of + the pseudo-fs as soon as they establish communication with a + server. + + OP03: LOOKUP "this" --> NFS_OK + + OP04: GETATTR(fsid) --> NFS_OK + + * Get current fsid to see where file system boundaries are. 
+      The fsid will be that for the pseudo-fs in this example, so no
+      boundary.
+
+   OP05: GETFH --> NFS_OK
+
+   *  Current fh is for /this and is within pseudo-fs.
+
+   OP06: LOOKUP "is" --> NFS_OK
+
+   *  Current fh is for /this/is and is within pseudo-fs.
+
+   OP07: GETATTR(fsid) --> NFS_OK
+
+   *  Get current fsid to see where file system boundaries are.  The
+      fsid will be that for the pseudo-fs in this example, so no
+      boundary.
+
+   OP08: GETFH --> NFS_OK
+
+   *  Current fh is for /this/is and is within pseudo-fs.
+
+   OP09: LOOKUP "the" --> NFS_OK
+
+   *  Current fh is for /this/is/the and is within pseudo-fs.
+
+   OP10: GETATTR(fsid) --> NFS_OK
+
+   *  Get current fsid to see where file system boundaries are.  The
+      fsid will be that for the pseudo-fs in this example, so no
+      boundary.
+
+   OP11: GETFH --> NFS_OK
+
+   *  Current fh is for /this/is/the and is within pseudo-fs.
+
+   OP12: LOOKUP "path" --> NFS_OK
+
+   *  Current fh is for /this/is/the/path and is within a new,
+      absent file system, but ...
+
+   *  The client will never see the value of that fh.
+
+   OP13: GETATTR(fsid, fs_locations_info) --> NFS_OK
+
+   *  We are getting the fsid to know where the file system
+      boundaries are.  In this operation, the fsid will be different
+      than that of the parent directory (which in turn was retrieved
+      in OP10).  Note that the fsid we are given will not necessarily
+      be preserved at the new location.  That fsid might be
+      different, and in fact the fsid we have for this file system
+      might be a valid fsid of a different file system on that new
+      server.
+
+   *  In this particular case, we are pretty sure anyway that what
+      has moved is /this/is/the/path rather than /this/is/the since
+      we have the fsid of the latter and it is that of the pseudo-fs,
+      which presumably cannot move.  However, in other examples, we
+      might not have this kind of information to rely on (e.g.,
+      /this/is/the might be a non-pseudo file system separate from
+      /this/is/the/path), so we need to have other reliable source
+      information on the boundary of the file system that is moved.
+      If, for example, the file system /this/is had moved, we would
+      have a case of migration rather than referral, and once the
+      boundaries of the migrated file system were clear we could
+      fetch fs_locations_info.
+
+   *  We are fetching fs_locations_info because the fact that we got
+      an NFS4ERR_MOVED at this point means that it is most likely
+      that this is a referral and we need the destination.  Even if
+      it is the case that /this/is/the is a file system that has
+      migrated, we will still need the location information for that
+      file system.
+
+   OP14: GETFH --> NFS4ERR_MOVED
+
+   *  Fails because current fh is in an absent file system at the
+      start of the operation, and the specification makes no
+      exception for GETFH.  Note that this means the server will
+      never send the client a filehandle from within an absent file
+      system.
+
+   Given the above, the client knows where the root of the absent
+   file system is (/this/is/the/path) by noting where the change of
+   fsid occurred (between "the" and "path").  The fs_locations_info
+   attribute also gives the client the actual location of the absent
+   file system, so that the referral can proceed.  The server gives
+   the client the bare minimum of information about the absent file
+   system so that there will be very little scope for problems of
+   conflict between information sent by the referring server and
+   information of the file system's home.
No filehandles and very few attributes are present on + the referring server, and the client can treat those it receives as + transient information with the function of enabling the referral. + +11.15.2. Referral Example (READDIR) + + Another context in which a client may encounter referrals is when it + does a READDIR on a directory in which some of the sub-directories + are the roots of absent file systems. + + Suppose such a directory is read as follows: + + * PUTROOTFH + + * LOOKUP "this" + + * LOOKUP "is" + + * LOOKUP "the" + + * READDIR (fsid, size, time_modify, mounted_on_fileid) + + In this case, because rdattr_error is not requested, + fs_locations_info is not requested, and some of the attributes cannot + be provided, the result will be an NFS4ERR_MOVED error on the + READDIR, with the detailed results as follows: + + * PUTROOTFH --> NFS_OK. The current fh is at the root of the + pseudo-fs. + + * LOOKUP "this" --> NFS_OK. The current fh is for /this and is + within the pseudo-fs. + + * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is + within the pseudo-fs. + + * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and + is within the pseudo-fs. + + * READDIR (fsid, size, time_modify, mounted_on_fileid) --> + NFS4ERR_MOVED. Note that the same error would have been returned + if /this/is/the had migrated, but it is returned because the + directory contains the root of an absent file system. + + So now suppose that we re-send with rdattr_error: + + * PUTROOTFH + + * LOOKUP "this" + + * LOOKUP "is" + + * LOOKUP "the" + + * READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) + + The results will be: + + * PUTROOTFH --> NFS_OK. The current fh is at the root of the + pseudo-fs. + + * LOOKUP "this" --> NFS_OK. The current fh is for /this and is + within the pseudo-fs. + + * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is + within the pseudo-fs. + + * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and + is within the pseudo-fs. + + * READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) + --> NFS_OK. The attributes for directory entry with the component + named "path" will only contain rdattr_error with the value + NFS4ERR_MOVED, together with an fsid value and a value for + mounted_on_fileid. + + Suppose we do another READDIR to get fs_locations_info (although we + could have used a GETATTR directly, as in Section 11.15.1). + + * PUTROOTFH + + * LOOKUP "this" + + * LOOKUP "is" + + * LOOKUP "the" + + * READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, + size, time_modify) + + The results would be: + + * PUTROOTFH --> NFS_OK. The current fh is at the root of the + pseudo-fs. + + * LOOKUP "this" --> NFS_OK. The current fh is for /this and is + within the pseudo-fs. + + * LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is + within the pseudo-fs. + + * LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and + is within the pseudo-fs. + + * READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, + size, time_modify) --> NFS_OK. The attributes will be as shown + below. 
+
+   The attributes for the directory entry with the component named
+   "path" will only contain:
+
+   *  rdattr_error (value: NFS_OK)
+
+   *  fs_locations_info
+
+   *  mounted_on_fileid (value: unique fileid within referring file
+      system)
+
+   *  fsid (value: unique value within referring server)
+
+   The attributes for entry "path" will not contain size or
+   time_modify because these attributes are not available within an
+   absent file system.
+
+11.16.  The Attribute fs_locations
+
+   The fs_locations attribute is structured in the following way:
+
+   struct fs_location4 {
+           utf8str_cis     server<>;
+           pathname4       rootpath;
+   };
+
+   struct fs_locations4 {
+           pathname4       fs_root;
+           fs_location4    locations<>;
+   };
+
+   The fs_location4 data type is used to represent the location of a
+   file system by providing a server name and the path to the root of
+   the file system within that server's namespace.  When a set of
+   servers have corresponding file systems at the same path within
+   their namespaces, an array of server names may be provided.  An
+   entry in the server array is a UTF-8 string and represents one of
+   a traditional DNS host name, an IPv4 address, an IPv6 address, or
+   a zero-length string.  An IPv4 or IPv6 address is represented as a
+   universal address (see Section 3.3.9 and [12]), minus the netid,
+   and either with or without the trailing ".p1.p2" suffix that
+   represents the port number.  If the suffix is omitted, then the
+   default port, 2049, SHOULD be assumed.  A zero-length string
+   SHOULD be used to indicate the current address being used for the
+   RPC call.  It is not a requirement that all servers that share the
+   same rootpath be listed in one fs_location4 instance.  The array
+   of server names is provided for convenience.  Servers that share
+   the same rootpath may also be listed in separate fs_location4
+   entries in the fs_locations attribute.
+
+   The fs_locations4 data type and the fs_locations attribute each
+   contain an array of such locations.  Since the namespace of each
+   server may be constructed differently, the "fs_root" field is
+   provided.  The path represented by fs_root represents the location
+   of the file system in the current server's namespace, i.e., that
+   of the server from which the fs_locations attribute was obtained.
+   The fs_root path is meant to aid the client by clearly referencing
+   the root of the file system whose locations are being reported, no
+   matter what object within the current file system the current
+   filehandle designates.  The fs_root is simply the pathname the
+   client used to reach the object on the current server (i.e., the
+   object to which the fs_locations attribute applies).
+
+   When the fs_locations attribute is interrogated and there are no
+   alternate file system locations, the server SHOULD return a zero-
+   length array of fs_location4 structures, together with a valid
+   fs_root.
+
+   As an example, suppose there is a replicated file system located
+   at two servers (servA and servB).  At servA, the file system is
+   located at path /a/b/c.  At servB, the file system is located at
+   path /x/y/z.  If the client were to obtain the fs_locations value
+   for the directory at /a/b/c/d, it might not necessarily know that
+   the file system's root is located in servA's namespace at /a/b/c.
+   When the client switches to servB, it will need to determine that
+   the directory it first referenced at servA is now represented by
+   the path /x/y/z/d on servB.
+   To facilitate this, the fs_locations attribute provided by servA
+   would have an fs_root value of /a/b/c and two entries in
+   fs_locations.  One entry in fs_locations will be for itself
+   (servA) and the other will be for servB with a path of /x/y/z.
+   With this information, the client is able to substitute /x/y/z for
+   the /a/b/c at the beginning of its access path and construct
+   /x/y/z/d to use for the new server.
+
+   Note that there is no requirement that the number of components in
+   each rootpath be the same; there is no relation between the number
+   of components in rootpath or fs_root, and none of the components
+   in a rootpath and fs_root have to be the same.  In the above
+   example, we could have had a third element in the locations array,
+   with server equal to "servC" and rootpath equal to "/I/II", and a
+   fourth element in locations with server equal to "servD" and
+   rootpath equal to "/aleph/beth/gimel/daleth/he".
+
+   The relationship between fs_root and a rootpath is that the client
+   replaces the pathname indicated in fs_root for the current server
+   with the substitute indicated in rootpath for the new server.
+
+   For an example of a referred or migrated file system, suppose
+   there is a file system located at serv1.  At serv1, the file
+   system is located at /az/buky/vedi/glagoli.  The client finds that
+   the object at glagoli has migrated (or is a referral).  The client
+   gets the fs_locations attribute, which contains an fs_root of
+   /az/buky/vedi/glagoli, and one element in the locations array,
+   with server equal to serv2, and rootpath equal to /izhitsa/fita.
+   The client replaces /az/buky/vedi/glagoli with /izhitsa/fita, and
+   uses the latter pathname on serv2.  (A non-normative sketch of
+   this prefix replacement appears after the rules listed below.)
+
+   Thus, the server MUST return an fs_root that is equal to the path
+   the client used to reach the object to which the fs_locations
+   attribute applies.  Otherwise, the client cannot determine the new
+   path to use on the new server.
+
+   Since the fs_locations attribute lacks information defining
+   various attributes of the various file system choices presented,
+   it SHOULD only be interrogated and used when fs_locations_info is
+   not available.  When fs_locations is used, information about the
+   specific locations should be assumed based on the following rules.
+
+   The following rules are general and apply irrespective of the
+   context.
+
+   *  All listed file system instances should be considered as of the
+      same handle class if and only if the current fh_expire_type
+      attribute does not include the FH4_VOL_MIGRATION bit.  Note
+      that in the case of referral, filehandle issues do not apply
+      since there can be no filehandles known within the current file
+      system, nor is there any access to the fh_expire_type attribute
+      on the referring (absent) file system.
+
+   *  All listed file system instances should be considered as of the
+      same fileid class if and only if the fh_expire_type attribute
+      indicates persistent filehandles and does not include the
+      FH4_VOL_MIGRATION bit.  Note that in the case of referral,
+      fileid issues do not apply since there can be no fileids known
+      within the referring (absent) file system, nor is there any
+      access to the fh_expire_type attribute.
+
+   *  All listed file system instances should be considered as of
+      different change classes.
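+
+   The following non-normative C sketch illustrates the prefix
+   replacement described above.  The function and its names are
+   purely illustrative and not part of the protocol; it assumes that
+   fs_root and rootpath have already been rendered from their
+   pathname4 (component-array) form into '/'-separated strings.
+
+   #include <stdio.h>
+   #include <string.h>
+
+   /*
+    * Map a path on the current server to the corresponding path on
+    * a new server by replacing the fs_root prefix with rootpath.
+    * Returns 0 on success, -1 if path does not lie under fs_root or
+    * the result does not fit in the output buffer.
+    */
+   static int
+   remap_path(const char *path, const char *fs_root,
+              const char *rootpath, char *out, size_t outlen)
+   {
+           size_t rootlen = strlen(fs_root);
+
+           /* fs_root must match at a component boundary. */
+           if (strncmp(path, fs_root, rootlen) != 0 ||
+               (path[rootlen] != '/' && path[rootlen] != '\0'))
+                   return -1;
+           if ((size_t)snprintf(out, outlen, "%s%s", rootpath,
+                                path + rootlen) >= outlen)
+                   return -1;
+           return 0;
+   }
+
+   int
+   main(void)
+   {
+           char newpath[256];
+
+           /* fs_root /a/b/c on servA, rootpath /x/y/z on servB. */
+           if (remap_path("/a/b/c/d", "/a/b/c", "/x/y/z", newpath,
+                          sizeof(newpath)) == 0)
+                   printf("%s\n", newpath);  /* prints /x/y/z/d */
+           return 0;
+   }
+
+   A real client would perform the same replacement on the pathname4
+   component arrays directly, which avoids any dependence on a
+   particular separator character.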
+ + For other class assignments, handling of file system transitions + depends on the reasons for the transition: + + * When the transition is due to migration, that is, the client was + directed to a new file system after receiving an NFS4ERR_MOVED + error, the target should be treated as being of the same write- + verifier class as the source. + + * When the transition is due to failover to another replica, that + is, the client selected another replica without receiving an + NFS4ERR_MOVED error, the target should be treated as being of a + different write-verifier class from the source. + + The specific choices reflect typical implementation patterns for + failover and controlled migration, respectively. Since other choices + are possible and useful, this information is better obtained by using + fs_locations_info. When a server implementation needs to communicate + other choices, it MUST support the fs_locations_info attribute. + + See Section 21 for a discussion on the recommendations for the + security flavor to be used by any GETATTR operation that requests the + fs_locations attribute. + +11.17. The Attribute fs_locations_info + + The fs_locations_info attribute is intended as a more functional + replacement for the fs_locations attribute, which will continue to + exist and be supported. Clients can use it to get a more complete + set of data about alternative file system locations, including + additional network paths to access replicas in use and additional + replicas. When the server does not support fs_locations_info, + fs_locations can be used to get a subset of the data. A server that + supports fs_locations_info MUST support fs_locations as well. + + There is additional data present in fs_locations_info that is not + available in fs_locations: + + * Attribute continuity information. This information will allow a + client to select a replica that meets the transparency + requirements of the applications accessing the data and to + leverage optimizations due to the server guarantees of attribute + continuity (e.g., if the change attribute of a file of the file + system is continuous between multiple replicas, the client does + not have to invalidate the file's cache when switching to a + different replica). + + * File system identity information that indicates when multiple + replicas, from the client's point of view, correspond to the same + target file system, allowing them to be used interchangeably, + without disruption, as distinct synchronized replicas of the same + file data. + + Note that having two replicas with common identity information is + distinct from the case of two (trunked) paths to the same replica. + + * Information that will bear on the suitability of various replicas, + depending on the use that the client intends. For example, many + applications need an absolutely up-to-date copy (e.g., those that + write), while others may only need access to the most up-to-date + copy reasonably available. + + * Server-derived preference information for replicas, which can be + used to implement load-balancing while giving the client the + entire file system list to be used in case the primary fails. + + The fs_locations_info attribute is structured similarly to the + fs_locations attribute. A top-level structure (fs_locations_info4) + contains the entire attribute including the root pathname of the file + system and an array of lower-level structures that define replicas + that share a common rootpath on their respective servers. 
+   The lower-level structure in turn (fs_locations_item4) contains a
+   specific pathname and information on one or more individual
+   network access paths.  For that last, lowest level,
+   fs_locations_info has an fs_locations_server4 structure that
+   contains per-server-replica information in addition to the file
+   system location entry.  This per-server-replica information
+   includes a nominally opaque array, fls_info, within which specific
+   pieces of information are located at the specific indices listed
+   below.
+
+   Two fs_location_server4 entries that are within different
+   fs_location_item4 structures are never trunkable, while two
+   entries within the same fs_location_item4 structure might or might
+   not be trunkable.  Two entries that are trunkable will have
+   identical identity information, although, as noted above, the
+   converse is not the case.
+
+   The attribute will always contain at least a single
+   fs_locations_server entry.  Typically, there will be an entry with
+   the FSLI4GF_CUR_REQ flag set, although in the case of a referral
+   there will be no entry with that flag set.
+
+   It should be noted that fs_locations_info attributes returned by
+   servers for various replicas may differ for various reasons.  One
+   server may know about a set of replicas that are not known to
+   other servers.  Further, compatibility attributes may differ.
+   Filehandles might be of the same class going from replica A to
+   replica B but not going in the reverse direction.  This might
+   happen because the filehandles are the same, but replica B's
+   server implementation might not have provision to note and report
+   that equivalence.
+
+   The fs_locations_info attribute consists of a root pathname
+   (fli_fs_root, just like fs_root in the fs_locations attribute),
+   together with an array of fs_location_item4 structures.  The
+   fs_location_item4 structures in turn consist of a root pathname
+   (fli_rootpath) together with an array (fli_entries) of elements of
+   data type fs_locations_server4, all defined as follows.
+
+   /*
+    * Defines an individual server access path
+    */
+   struct fs_locations_server4 {
+           int32_t         fls_currency;
+           opaque          fls_info<>;
+           utf8str_cis     fls_server;
+   };
+
+   /*
+    * Byte indices of items within
+    * fls_info: flag fields, class numbers,
+    * bytes indicating ranks and orders.
+    */
+   const FSLI4BX_GFLAGS            = 0;
+   const FSLI4BX_TFLAGS            = 1;
+
+   const FSLI4BX_CLSIMUL           = 2;
+   const FSLI4BX_CLHANDLE          = 3;
+   const FSLI4BX_CLFILEID          = 4;
+   const FSLI4BX_CLWRITEVER        = 5;
+   const FSLI4BX_CLCHANGE          = 6;
+   const FSLI4BX_CLREADDIR         = 7;
+
+   const FSLI4BX_READRANK          = 8;
+   const FSLI4BX_WRITERANK         = 9;
+   const FSLI4BX_READORDER         = 10;
+   const FSLI4BX_WRITEORDER        = 11;
+
+   /*
+    * Bits defined within the general flag byte.
+    */
+   const FSLI4GF_WRITABLE          = 0x01;
+   const FSLI4GF_CUR_REQ           = 0x02;
+   const FSLI4GF_ABSENT            = 0x04;
+   const FSLI4GF_GOING             = 0x08;
+   const FSLI4GF_SPLIT             = 0x10;
+
+   /*
+    * Bits defined within the transport flag byte.
+    */
+   const FSLI4TF_RDMA              = 0x01;
+
+   /*
+    * Defines a set of replicas sharing
+    * a common value of the rootpath
+    * within the corresponding
+    * single-server namespaces.
+    */
+   struct fs_locations_item4 {
+           fs_locations_server4    fli_entries<>;
+           pathname4               fli_rootpath;
+   };
+
+   /*
+    * Defines the overall structure of
+    * the fs_locations_info attribute.
+    */
+   struct fs_locations_info4 {
+           uint32_t                fli_flags;
+           int32_t                 fli_valid_for;
+           pathname4               fli_fs_root;
+           fs_locations_item4      fli_items<>;
+   };
+
+   /*
+    * Flag bits in fli_flags.
+    */
+   const FSLI4IF_VAR_SUB = 0x00000001;
+
+   typedef fs_locations_info4 fattr4_fs_locations_info;
+
+   As noted above, the fs_locations_info attribute, when supported,
+   may be requested of absent file systems without causing
+   NFS4ERR_MOVED to be returned.  It is generally expected that it
+   will be available for both present and absent file systems even if
+   only a single fs_locations_server4 entry is present, designating
+   the current (present) file system, or two fs_locations_server4
+   entries designating the previous location of an absent file system
+   (the one just referenced) and its successor location.  Servers are
+   strongly urged to support this attribute on all file systems if
+   they support it on any file system.
+
+   The data presented in the fs_locations_info attribute may be
+   obtained by the server in any number of ways, including
+   specification by the administrator or by current protocols for
+   transferring data among replicas and protocols not yet developed.
+   NFSv4.1 only defines how this information is presented by the
+   server to the client.
+
+11.17.1.  The fs_locations_server4 Structure
+
+   The fs_locations_server4 structure consists of the following items
+   in addition to the fls_server field, which specifies a network
+   address or set of addresses to be used to access the specified
+   file system.  Note that both of these items (i.e., fls_currency
+   and fls_info) specify attributes of the file system replica and
+   should not differ among multiple fs_locations_server4 structures
+   for the same replica, each of which specifies a network path to
+   that replica.
+
+   When these values are different in two fs_locations_server4
+   structures, a client has no basis for choosing one over the other
+   and is best off simply ignoring both entries, whether these
+   entries apply to migration, replication, or referral.  When there
+   are more than two such entries, majority voting can be used to
+   exclude a single erroneous entry from consideration.  In the case
+   in which trunking information is provided for a replica currently
+   being accessed, the additional trunked addresses can be ignored
+   while access continues on the address currently being used, even
+   if the entry corresponding to that path might be considered
+   invalid.
+
+   *  An indication of how up-to-date the file system is
+      (fls_currency) in seconds.  This value is relative to the
+      master copy.  A negative value indicates that the server is
+      unable to give any reasonably useful value here.  A value of
+      zero indicates that the file system is the actual writable data
+      or a reliably coherent and fully up-to-date copy.  Positive
+      values indicate how out-of-date this copy can normally be
+      before it is considered for update.  Such a value is not a
+      guarantee that such updates will always be performed on the
+      required schedule but instead serves as a hint about how far
+      the copy of the data would be expected to be behind the most
+      up-to-date copy.
+
+   *  A counted array of one-byte values (fls_info) containing
+      information about the particular file system instance.  This
+      data includes general flags, transport capability flags, file
+      system equivalence class information, and selection priority
+      information.  The encoding will be discussed below.
+
+   *  The server string (fls_server).  For the case of the replica
+      currently being accessed (via GETATTR), a zero-length string
+      MAY be used to indicate the current address being used for the
+      RPC call.
+      The fls_server field can also be an IPv4 or IPv6
+      address, formatted the same way as an IPv4 or IPv6 address in
+      the "server" field of the fs_location4 data type (see
+      Section 11.16).
+
+   With the exception of the transport-flag field (at offset
+   FSLI4BX_TFLAGS within the fls_info array), all of the data defined
+   in this specification applies to the replica specified by the
+   entry, rather than the specific network path used to access it.
+   The classification of data in extensions to this data is discussed
+   below.
+
+   Data within the fls_info array is in the form of 8-bit data items
+   with constants giving the offsets within the array of various
+   values describing this particular file system instance.  This
+   style of definition was chosen, in preference to explicit XDR
+   structure definitions for these values, for a number of reasons.
+
+   *  The kinds of data in the fls_info array, representing flags,
+      file system classes, and priorities among sets of file systems
+      representing the same data, are such that 8 bits provide a
+      quite acceptable range of values.  Even where there might be
+      more than 256 such file system instances, having more than 256
+      distinct classes or priorities is unlikely.
+
+   *  Explicit definition of the various specific data items within
+      XDR would limit expandability in that any extension would
+      require yet another attribute, leading to specification and
+      implementation clumsiness.  In the context of the NFSv4
+      extension model in effect at the time fs_locations_info was
+      designed (i.e., that which is described in RFC 5661 [66]), this
+      would necessitate a new minor version to effect any Standards
+      Track extension to the data in fls_info.
+
+   The set of fls_info data is subject to expansion in a future minor
+   version or in a Standards Track RFC within the context of a single
+   minor version.  The server SHOULD NOT send and the client MUST NOT
+   use indices within the fls_info array or flag bits that are not
+   defined in Standards Track RFCs.
+
+   In light of the new extension model defined in RFC 8178 [67] and
+   the fact that the individual items within fls_info are not
+   explicitly referenced in the XDR, the following practices should
+   be followed when extending or otherwise changing the structure of
+   the data returned in fls_info within the scope of a single minor
+   version:
+
+   *  All extensions need to be described by Standards Track
+      documents.  There is no need for such documents to be marked as
+      updating RFC 5661 [66] or this document.
+
+   *  It needs to be made clear whether the information in any added
+      data items applies to the replica specified by the entry or to
+      the specific network paths specified in the entry.
+
+   *  There needs to be a reliable way defined to determine whether
+      the server is aware of the extension.  This may be based on the
+      length field of the fls_info array, but it is more flexible to
+      provide fs-scope or server-scope attributes to indicate what
+      extensions are provided.
+
+   This encoding scheme can be adapted to the specification of multi-
+   byte numeric values, even though none are currently defined.  If
+   extensions are made via Standards Track RFCs, multi-byte
+   quantities will be encoded as a range of bytes with a range of
+   indices, with the bytes interpreted in big-endian byte order.
+   Further, any such index assignments will be constrained by the
+   need for the relevant quantities not to cross XDR word boundaries.
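+
+   As a non-normative illustration of the encoding just described,
+   the C fragment below reads a single-byte item and a hypothetical
+   two-byte big-endian extension value from an fls_info array
+   received as a counted array of bytes.  The helper names and the
+   two-byte item are invented for this sketch; only the byte indices
+   in the XDR above are defined by the protocol.
+
+   #include <stdint.h>
+   #include <stddef.h>
+
+   /* Byte indices from the XDR above. */
+   #define FSLI4BX_GFLAGS   0
+   #define FSLI4BX_READRANK 8
+
+   /* Read one byte from fls_info; returns -1 if the server's array
+      is too short to contain that index (the item is absent). */
+   static int
+   fls_info_byte(const uint8_t *info, size_t len, size_t idx)
+   {
+           return idx < len ? info[idx] : -1;
+   }
+
+   /* Hypothetical two-byte extension value occupying indices idx
+      and idx + 1, interpreted in big-endian byte order. */
+   static int
+   fls_info_u16(const uint8_t *info, size_t len, size_t idx,
+                uint16_t *val)
+   {
+           if (idx + 1 >= len)
+                   return -1;
+           *val = (uint16_t)((info[idx] << 8) | info[idx + 1]);
+           return 0;
+   }
+
+   A client using such helpers would treat a missing index (a short
+   fls_info array) as "extension not supported", in line with the
+   practices listed above.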
+ + The fls_info array currently contains: + + * Two 8-bit flag fields, one devoted to general file-system + characteristics and a second reserved for transport-related + capabilities. + + * Six 8-bit class values that define various file system equivalence + classes as explained below. + + * Four 8-bit priority values that govern file system selection as + explained below. + + The general file system characteristics flag (at byte index + FSLI4BX_GFLAGS) has the following bits defined within it: + + * FSLI4GF_WRITABLE indicates that this file system target is + writable, allowing it to be selected by clients that may need to + write on this file system. When the current file system instance + is writable and is defined as of the same simultaneous use class + (as specified by the value at index FSLI4BX_CLSIMUL) to which the + client was previously writing, then it must incorporate within its + data any committed write made on the source file system instance. + See Section 11.11.6, which discusses the write-verifier class. + While there is no harm in not setting this flag for a file system + that turns out to be writable, turning the flag on for a read-only + file system can cause problems for clients that select a migration + or replication target based on the flag and then find themselves + unable to write. + + * FSLI4GF_CUR_REQ indicates that this replica is the one on which + the request is being made. Only a single server entry may have + this flag set and, in the case of a referral, no entry will have + it set. Note that this flag might be set even if the request was + made on a network access path different from any of those + specified in the current entry. + + * FSLI4GF_ABSENT indicates that this entry corresponds to an absent + file system replica. It can only be set if FSLI4GF_CUR_REQ is + set. When both such bits are set, it indicates that a file system + instance is not usable but that the information in the entry can + be used to determine the sorts of continuity available when + switching from this replica to other possible replicas. Since + this bit can only be true if FSLI4GF_CUR_REQ is true, the value + could be determined using the fs_status attribute, but the + information is also made available here for the convenience of the + client. An entry with this bit, since it represents a true file + system (albeit absent), does not appear in the event of a + referral, but only when a file system has been accessed at this + location and has subsequently been migrated. + + * FSLI4GF_GOING indicates that a replica, while still available, + should not be used further. The client, if using it, should make + an orderly transfer to another file system instance as + expeditiously as possible. It is expected that file systems going + out of service will be announced as FSLI4GF_GOING some time before + the actual loss of service. It is also expected that the + fli_valid_for value will be sufficiently small to allow clients to + detect and act on scheduled events, while large enough that the + cost of the requests to fetch the fs_locations_info values will + not be excessive. Values on the order of ten minutes seem + reasonable. + + When this flag is seen as part of a transition into a new file + system, a client might choose to transfer immediately to another + replica, or it may reference the current file system and only + transition when a migration event occurs. 
Similarly, when this + flag appears as a replica in the referral, clients would likely + avoid being referred to this instance whenever there is another + choice. + + This flag, like the other items within fls_info, applies to the + replica rather than to a particular path to that replica. When it + appears, a transition to a new replica, rather than to a different + path to the same replica, is indicated. + + * FSLI4GF_SPLIT indicates that when a transition occurs from the + current file system instance to this one, the replacement may + consist of multiple file systems. In this case, the client has to + be prepared for the possibility that objects on the same file + system before migration will be on different ones after. Note + that FSLI4GF_SPLIT is not incompatible with the file systems + belonging to the same fileid class since, if one has a set of + fileids that are unique within a file system, each subset assigned + to a smaller file system after migration would not have any + conflicts internal to that file system. + + A client, in the case of a split file system, will interrogate + existing files with which it has continuing connection (it is free + to simply forget cached filehandles). If the client remembers the + directory filehandle associated with each open file, it may + proceed upward using LOOKUPP to find the new file system + boundaries. Note that in the event of a referral, there will not + be any such files and so these actions will not be performed. + Instead, a reference to a portion of the original file system now + split off into other file systems will encounter an fsid change + and possibly a further referral. + + Once the client recognizes that one file system has been split + into two, it can prevent the disruption of running applications by + presenting the two file systems as a single one until a convenient + point to recognize the transition, such as a restart. This would + require a mapping from the server's fsids to fsids as seen by the + client, but this is already necessary for other reasons. As noted + above, existing fileids within the two descendant file systems + will not conflict. Providing non-conflicting fileids for newly + created files on the split file systems is the responsibility of + the server (or servers working in concert). The server can encode + filehandles such that filehandles generated before the split event + can be discerned from those generated after the split, allowing + the server to determine when the need for emulating two file + systems as one is over. + + Although it is possible for this flag to be present in the event + of referral, it would generally be of little interest to the + client, since the client is not expected to have information + regarding the current contents of the absent file system. + + The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the + following bits related to the transport capabilities of the specific + network path(s) specified by the entry: + + * FSLI4TF_RDMA indicates that any specified network paths provide + NFSv4.1 clients access using an RDMA-capable transport. + + Attribute continuity and file system identity information are + expressed by defining equivalence relations on the sets of file + systems presented to the client. Each such relation is expressed as + a set of file system equivalence classes. For each relation, a file + system has an 8-bit class number. Two file systems belong to the + same class if both have identical non-zero class numbers. 
Zero is + treated as non-matching. Most often, the relevant question for the + client will be whether a given replica is identical to / continuous + with the current one in a given respect, but the information should + be available also as to whether two other replicas match in that + respect as well. + + The following fields specify the file system's class numbers for the + equivalence relations used in determining the nature of file system + transitions. See Sections 11.9 through 11.14 and their various + subsections for details about how this information is to be used. + Servers may assign these values as they wish, so long as file system + instances that share the same value have the specified relationship + to one another; conversely, file systems that have the specified + relationship to one another share a common class value. As each + instance entry is added, the relationships of this instance to + previously entered instances can be consulted, and if one is found + that bears the specified relationship, that entry's class value can + be copied to the new entry. When no such previous entry exists, a + new value for that byte index (not previously used) can be selected, + most likely by incrementing the value of the last class value + assigned for that index. + + * The field with byte index FSLI4BX_CLSIMUL defines the + simultaneous-use class for the file system. + + * The field with byte index FSLI4BX_CLHANDLE defines the handle + class for the file system. + + * The field with byte index FSLI4BX_CLFILEID defines the fileid + class for the file system. + + * The field with byte index FSLI4BX_CLWRITEVER defines the write- + verifier class for the file system. + + * The field with byte index FSLI4BX_CLCHANGE defines the change + class for the file system. + + * The field with byte index FSLI4BX_CLREADDIR defines the readdir + class for the file system. + + Server-specified preference information is also provided via 8-bit + values within the fls_info array. The values provide a rank and an + order (see below) to be used with separate values specifiable for the + cases of read-only and writable file systems. These values are + compared for different file systems to establish the server-specified + preference, with lower values indicating "more preferred". + + Rank is used to express a strict server-imposed ordering on clients, + with lower values indicating "more preferred". Clients should + attempt to use all replicas with a given rank before they use one + with a higher rank. Only if all of those file systems are + unavailable should the client proceed to those of a higher rank. + Because specifying a rank will override client preferences, servers + should be conservative about using this mechanism, particularly when + the environment is one in which client communication characteristics + are neither tightly controlled nor visible to the server. + + Within a rank, the order value is used to specify the server's + preference to guide the client's selection when the client's own + preferences are not controlling, with lower values of order + indicating "more preferred". If replicas are approximately equal in + all respects, clients should defer to the order specified by the + server. When clients look at server latency as part of their + selection, they are free to use this criterion, but it is suggested + that when latency differences are not significant, the server- + specified order should guide selection. 
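+
+   A non-normative sketch of the resulting comparison is given below
+   in C.  The replica structure is invented for the sketch; its rank
+   and order members stand for the bytes at the indices named in the
+   list that follows (FSLI4BX_READRANK and FSLI4BX_READORDER for
+   read-only access, or FSLI4BX_WRITERANK and FSLI4BX_WRITEORDER for
+   writable access).
+
+   #include <stdint.h>
+
+   /* Invented holder for the selection bytes of one replica. */
+   struct replica {
+           uint8_t rank;   /* strict server-imposed ordering  */
+           uint8_t order;  /* preference within a single rank */
+   };
+
+   /*
+    * Compare two replicas for selection: lower rank is strictly
+    * preferred; within equal ranks, lower order is the server's
+    * (non-binding) preference.  Returns a negative value if a is
+    * preferred, positive if b is preferred, and zero if the two
+    * are equivalent.
+    */
+   static int
+   replica_cmp(const struct replica *a, const struct replica *b)
+   {
+           if (a->rank != b->rank)
+                   return (int)a->rank - (int)b->rank;
+           return (int)a->order - (int)b->order;
+   }
+
+   A client might sort its candidate list with such a comparison,
+   trying every replica of the lowest rank before any replica of a
+   higher rank, and letting its own measurements (e.g., observed
+   latency) influence only the order component, never the rank.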
+
+   *  The field at byte index FSLI4BX_READRANK gives the rank value
+      to be used for read-only access.
+
+   *  The field at byte index FSLI4BX_READORDER gives the order value
+      to be used for read-only access.
+
+   *  The field at byte index FSLI4BX_WRITERANK gives the rank value
+      to be used for writable access.
+
+   *  The field at byte index FSLI4BX_WRITEORDER gives the order
+      value to be used for writable access.
+
+   Depending on the potential need for write access by a given
+   client, one of the pairs of rank and order values is used.  The
+   read rank and order should only be used if the client knows that
+   only reading will ever be done or if it is prepared to switch to a
+   different replica in the event that any write access capability is
+   required in the future.
+
+11.17.2.  The fs_locations_info4 Structure
+
+   The fs_locations_info4 structure, encoding the fs_locations_info
+   attribute, contains the following:
+
+   *  The fli_flags field, which contains general flags that affect
+      the interpretation of this fs_locations_info4 structure and all
+      fs_locations_item4 structures within it.  The only flag
+      currently defined is FSLI4IF_VAR_SUB.  All bits in the
+      fli_flags field that are not defined should always be returned
+      as zero.
+
+   *  The fli_fs_root field, which contains the pathname of the root
+      of the current file system on the current server, just as it
+      does in the fs_locations4 structure.
+
+   *  An array called fli_items of fs_locations_item4 structures,
+      which contain information about replicas of the current file
+      system.  Where the current file system is actually present, or
+      has been present, i.e., this is not a referral situation, one
+      of the fs_locations_item4 structures will contain an
+      fs_locations_server4 for the current server.  This structure
+      will have FSLI4GF_ABSENT set if the current file system is
+      absent, i.e., normal access to it will return NFS4ERR_MOVED.
+
+   *  The fli_valid_for field specifies a time in seconds for which
+      it is reasonable for a client to use the fs_locations_info
+      attribute without refetch.  The fli_valid_for value does not
+      provide a guarantee of validity since servers can unexpectedly
+      go out of service or become inaccessible for any number of
+      reasons.  Clients are well-advised to refetch this information
+      for an actively accessed file system at every fli_valid_for
+      seconds.  This is particularly important when file system
+      replicas may go out of service in a controlled way using the
+      FSLI4GF_GOING flag to communicate an ongoing change.  The
+      server should set fli_valid_for to a value that allows well-
+      behaved clients to notice the FSLI4GF_GOING flag and make an
+      orderly switch before the loss of service becomes effective.
+      If this value is zero, then no refetch interval is appropriate
+      and the client need not refetch this data on any particular
+      schedule.  In the event of a transition to a new file system
+      instance, a new value of the fs_locations_info attribute will
+      be fetched at the destination.  It is to be expected that this
+      may have a different fli_valid_for value, which the client
+      should then use in the same fashion as the previous value.
+      Because a refetch of the attribute causes information from all
+      component entries to be refetched, the server will typically
+      provide a low value for this field if any of the replicas are
+      likely to go out of service in a short time frame.
+      Note that, because of the ability of the server to return
+      NFS4ERR_MOVED to trigger the use of different paths, when
+      alternate trunked paths are available, there is generally no
+      need to use low values of fli_valid_for in connection with the
+      management of alternate paths to the same replica.
+
+   The FSLI4IF_VAR_SUB flag within fli_flags controls whether
+   variable substitution is to be enabled.  See Section 11.17.3 for
+   an explanation of variable substitution.
+
+11.17.3.  The fs_locations_item4 Structure
+
+   The fs_locations_item4 structure contains a pathname (in the field
+   fli_rootpath) that encodes the path of the target file system
+   replicas on the set of servers designated by the included
+   fs_locations_server4 entries.  The precise manner in which this
+   target location is specified depends on the value of the
+   FSLI4IF_VAR_SUB flag within the associated fs_locations_info4
+   structure.
+
+   If this flag is not set, then fli_rootpath simply designates the
+   location of the target file system within each server's single-
+   server namespace just as it does for the rootpath within the
+   fs_location4 structure.  When this bit is set, however, component
+   entries of a certain form are subject to client-specific variable
+   substitution so as to allow a degree of namespace non-uniformity
+   in order to accommodate the selection of client-specific file
+   system targets to adapt to different client architectures or other
+   characteristics.
+
+   When such substitution is in effect, a variable beginning with the
+   string "${" and ending with the string "}" and containing a colon
+   is to be replaced by the client-specific value associated with
+   that variable.  The string "unknown" should be used by the client
+   when it has no value for such a variable.  The pathname resulting
+   from such substitutions is used to designate the target file
+   system, so that different clients may have different file systems
+   corresponding to that location in the multi-server namespace.
+
+   As mentioned above, such substituted pathname variables contain a
+   colon.  The part before the colon is to be a DNS domain name, and
+   the part after is to be a case-insensitive alphanumeric string.
+
+   Where the domain is "ietf.org", only variable names defined in
+   this document or subsequent Standards Track RFCs are subject to
+   such substitution.  Organizations are free to use their domain
+   names to create their own sets of client-specific variables, to be
+   subject to such substitution.  In cases where such variables are
+   intended to be used more broadly than a single organization,
+   publication of an Informational RFC defining such variables is
+   RECOMMENDED.
+
+   The variable ${ietf.org:CPU_ARCH} is used to denote the CPU
+   architecture for which object files are compiled.  This
+   specification does not limit the acceptable values (except that
+   they must be valid UTF-8 strings), but such values as "x86",
+   "x86_64", and "sparc" would be expected to be used in line with
+   industry practice.
+
+   The variable ${ietf.org:OS_TYPE} is used to denote the operating
+   system, and thus the kernel and library APIs, for which code might
+   be compiled.  This specification does not limit the acceptable
+   values (except that they must be valid UTF-8 strings), but such
+   values as "linux" and "freebsd" would be expected to be used in
+   line with industry practice.
+
+   The variable ${ietf.org:OS_VERSION} is used to denote the
+   operating system version, and thus the specific details of
+   versioned interfaces, for which code might be compiled.
+   This specification does not limit the acceptable values (except
+   that they must be valid UTF-8 strings).  However, combinations of
+   numbers and letters with interspersed dots would be expected to be
+   used in line with industry practice, with the details of the
+   version format depending on the specific value of the variable
+   ${ietf.org:OS_TYPE} with which it is used.
+
+   Use of these variables could result in the direction of different
+   clients to different file systems on the same server, as
+   appropriate to particular clients.  In cases in which the target
+   file systems are located on different servers, a single server
+   could serve as a referral point so that each valid combination of
+   variable values would designate a referral hosted on a single
+   server, with the targets of those referrals on a number of
+   different servers.
+
+   Because namespace administration is affected by the values
+   selected to substitute for various variables, clients should
+   provide convenient means of determining what variable
+   substitutions a client will implement, as well as, where
+   appropriate, providing means to control the substitutions to be
+   used.  The exact means by which this will be done is outside the
+   scope of this specification.
+
+   Although variable substitution is most suitable for use in the
+   context of referrals, it may be used in the context of replication
+   and migration.  If it is used in these contexts, the server must
+   ensure that no matter what values the client presents for the
+   substituted variables, the result is always a valid successor file
+   system instance to that from which a transition is occurring,
+   i.e., that the data is identical or represents a later image of a
+   writable file system.
+
+   Note that when fli_rootpath is a null pathname (that is, one with
+   zero components), the file system designated is at the root of the
+   specified server, whether or not the FSLI4IF_VAR_SUB flag within
+   the associated fs_locations_info4 structure is set.
+
+11.18.  The Attribute fs_status
+
+   In an environment in which multiple copies of the same basic set
+   of data are available, information regarding the particular source
+   of such data and the relationships among different copies can be
+   very helpful in providing consistent data to applications.
+
+   enum fs4_status_type {
+           STATUS4_FIXED = 1,
+           STATUS4_UPDATED = 2,
+           STATUS4_VERSIONED = 3,
+           STATUS4_WRITABLE = 4,
+           STATUS4_REFERRAL = 5
+   };
+
+   struct fs4_status {
+           bool            fss_absent;
+           fs4_status_type fss_type;
+           utf8str_cs      fss_source;
+           utf8str_cs      fss_current;
+           int32_t         fss_age;
+           nfstime4        fss_version;
+   };
+
+   The boolean fss_absent indicates whether the file system is
+   currently absent.  This value will be set if the file system was
+   previously present and becomes absent, or if the file system has
+   never been present and the type is STATUS4_REFERRAL.  When this
+   boolean is set and the type is not STATUS4_REFERRAL, the remaining
+   information in the fs4_status reflects the state that was last
+   valid when the file system was present.
+
+   The fss_type field indicates the kind of file system image
+   represented.  This is of particular importance when using the
+   version values to determine appropriate succession of file system
+   images.  When fss_absent is set and the file system was previously
+   present, the value of fss_type reflected is that in effect when
+   the file system was last present.  Five values are distinguished:
+
+   *  STATUS4_FIXED, which indicates a read-only image in the sense
+      that it will never change.
+      The possibility is allowed that, as a result of migration or
+      switch to a different image, changed data can be accessed, but
+      within the confines of this instance, no change is allowed.
+      The client can use this fact to cache aggressively.
+
+   *  STATUS4_UPDATED, which indicates an image that cannot be
+      updated by the user writing to it but that may be changed
+      externally, typically because it is a periodically updated copy
+      of another writable file system somewhere else.  In this case,
+      version information is not provided, and the client does not
+      have the responsibility of making sure that this version only
+      advances upon a file system instance transition.  In this case,
+      it is the responsibility of the server to make sure that the
+      data presented after a file system instance transition is a
+      proper successor image and includes all changes seen by the
+      client and any change made before all such changes.
+
+   *  STATUS4_VERSIONED, which indicates that the image, like the
+      STATUS4_UPDATED case, is updated externally, but it provides a
+      guarantee that the server will carefully update an associated
+      version value so that the client can protect itself from a
+      situation in which it reads data from one version of the file
+      system and then later reads data from an earlier version of the
+      same file system.  See below for a discussion of how this can
+      be done.
+
+   *  STATUS4_WRITABLE, which indicates that the file system is an
+      actual writable one.  The client need not, of course, actually
+      write to the file system, but once it does, it should not
+      accept a transition to anything other than a writable instance
+      of that same file system.
+
+   *  STATUS4_REFERRAL, which indicates that the file system in
+      question is absent and has never been present on this server.
+
+   Note that in the STATUS4_UPDATED and STATUS4_VERSIONED cases, the
+   server is responsible for the appropriate handling of locks
+   (including delegations) that are inconsistent with external
+   changes to the data.  If a server gives out delegations, they
+   SHOULD be recalled before an inconsistent change is made to the
+   data, and MUST be revoked if this is not possible.  Similarly, if
+   an OPEN is inconsistent with data that is changed (the OPEN has
+   OPEN4_SHARE_DENY_WRITE/OPEN4_SHARE_DENY_BOTH and the data is
+   changed), that OPEN SHOULD be considered administratively revoked.
+
+   The opaque strings fss_source and fss_current provide a way of
+   presenting information about the source of the file system image
+   being present.  It is not intended that the client do anything
+   with this information other than make it available to
+   administrative tools.  It is intended that this information be
+   helpful when researching possible problems with a file system
+   image that might arise when it is unclear if the correct image is
+   being accessed and, if not, how that image came to be made.  This
+   kind of diagnostic information will be helpful, if, as seems
+   likely, copies of file systems are made in many different ways
+   (e.g., simple user-level copies, file-system-level point-in-time
+   copies, clones of the underlying storage), under a variety of
+   administrative arrangements.  In such environments, determining
+   how a given set of data was constructed can be very helpful in
+   resolving problems.
+
+   The opaque string fss_source is used to indicate the source of a
+   given file system with the expectation that tools capable of
+   creating a file system image propagate this information, when
+   possible.
+   It is understood that this may not always be possible since a
+   user-level copy may be thought of as creating a new data set and
+   the tools used may have no mechanism to propagate this data.  When
+   a file system is initially created, it is desirable to associate
+   with it data regarding how the file system was created, where it
+   was created, who created it, etc.  Making this information
+   available in this attribute in a human-readable string will be
+   helpful for applications and system administrators and will also
+   serve to make it available when the original file system is used
+   to make subsequent copies.
+
+   The opaque string fss_current should provide whatever information
+   is available about the source of the current copy.  Such
+   information includes the tool creating it, any relevant parameters
+   to that tool, the time at which the copy was done, the user making
+   the change, the server on which the change was made, etc.  All
+   information should be in a human-readable string.
+
+   The field fss_age provides an indication of how out-of-date the
+   file system currently is with respect to its ultimate data source
+   (in case of cascading data updates).  This complements the
+   fls_currency field of fs_locations_server4 (see Section 11.17) in
+   the following way: the information in fls_currency gives a bound
+   for how out-of-date the data in a file system might typically get,
+   while the value in fss_age gives a bound on how out-of-date that
+   data actually is.  Negative values imply that no information is
+   available.  A zero means that this data is known to be current.  A
+   positive value means that this data is known to be no older than
+   that number of seconds with respect to the ultimate data source.
+   Using this value, the client may be able to decide that a data
+   copy is too old, so that it may search for a newer version to use.
+
+   The fss_version field provides a version identification, in the
+   form of a time value, such that successive versions always have
+   later time values.  When the fss_type is anything other than
+   STATUS4_VERSIONED, the server may provide such a value, but there
+   is no guarantee as to its validity and clients will not use it
+   except to provide additional information to add to fss_source and
+   fss_current.
+
+   When fss_type is STATUS4_VERSIONED, servers SHOULD provide a value
+   of fss_version that progresses monotonically whenever any new
+   version of the data is established.  This allows the client, if
+   reliable image progression is important to it, to fetch this
+   attribute as part of each COMPOUND where data or metadata from the
+   file system is used.
+
+   When it is important to the client to make sure that only valid
+   successor images are accepted, it must make sure that it does not
+   read data or metadata from the file system without updating its
+   sense of the current state of the image.  This is to avoid the
+   possibility that the fs_status that the client holds will be one
+   for an earlier image, which would cause the client to accept a new
+   file system instance that is later than that but still earlier
+   than the updated data read by the client.
+
+   In order to accept valid images reliably, the client must do a
+   GETATTR of the fs_status attribute that follows any interrogation
+   of data or metadata within the file system in question.  Often
+   this is most conveniently done by appending such a GETATTR after
+   all other operations that reference a given file system.
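+
+   A non-normative sketch of the resulting acceptance test follows,
+   in C.  It declines any candidate instance that is not
+   STATUS4_VERSIONED or whose fss_version is earlier than the last
+   version obtained from the predecessor instance, as described at
+   the end of this section; the structure layout and helper names are
+   illustrative only.
+
+   #include <stdint.h>
+   #include <stdbool.h>
+
+   struct nfstime4 {
+           int64_t  seconds;
+           uint32_t nseconds;
+   };
+
+   /* Returns true if a is strictly earlier than b. */
+   static bool
+   nfstime4_before(const struct nfstime4 *a,
+                   const struct nfstime4 *b)
+   {
+           return a->seconds < b->seconds ||
+                  (a->seconds == b->seconds &&
+                   a->nseconds < b->nseconds);
+   }
+
+   /* Accept a candidate instance only if it is STATUS4_VERSIONED
+      (value 3 in the enum above) and its fss_version is not earlier
+      than the last version seen on the predecessor instance. */
+   static bool
+   acceptable_successor(int fss_type,
+                        const struct nfstime4 *candidate_version,
+                        const struct nfstime4 *last_seen_version)
+   {
+           return fss_type == 3 /* STATUS4_VERSIONED */ &&
+                  !nfstime4_before(candidate_version,
+                                   last_seen_version);
+   }
+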
When errors occur + between reading file system data and performing such a GETATTR, care + must be exercised to make sure that the data in question is not used + before obtaining the proper fs_status value. In this connection, + when an OPEN is done within such a versioned file system and the + associated GETATTR of fs_status is not successfully completed, the + open file in question must not be accessed until that fs_status is + fetched. + + The procedure above will ensure that before using any data from the + file system the client has in hand a newly-fetched current version of + the file system image. Multiple values for multiple requests in + flight can be resolved by assembling them into the required partial + order (and the elements should form a total order within the partial + order) and using the last. The client may then, when switching among + file system instances, decline to use an instance that does not have + an fss_type of STATUS4_VERSIONED or whose fss_version field is + earlier than the last one obtained from the predecessor file system + instance. + +12. Parallel NFS (pNFS) + +12.1. Introduction + + pNFS is an OPTIONAL feature within NFSv4.1; the pNFS feature set + allows direct client access to the storage devices containing file + data. When file data for a single NFSv4 server is stored on multiple + and/or higher-throughput storage devices (by comparison to the + server's throughput capability), the result can be significantly + better file access performance. The relationship among multiple + clients, a single server, and multiple storage devices for pNFS + (server and clients have access to all storage devices) is shown in + Figure 1. + + +-----------+ + |+-----------+ +-----------+ + ||+-----------+ | | + ||| | NFSv4.1 + pNFS | | + +|| Clients |<------------------------------>| Server | + +| | | | + +-----------+ | | + ||| +-----------+ + ||| | + ||| | + ||| Storage +-----------+ | + ||| Protocol |+-----------+ | + ||+----------------||+-----------+ Control | + |+-----------------||| | Protocol| + +------------------+|| Storage |------------+ + +| Devices | + +-----------+ + + Figure 1 + + In this model, the clients, server, and storage devices are + responsible for managing file access. This is in contrast to NFSv4 + without pNFS, where it is primarily the server's responsibility; some + of this responsibility may be delegated to the client under strictly + specified conditions. See Section 12.2.5 for a discussion of the + Storage Protocol. See Section 12.2.6 for a discussion of the Control + Protocol. + + pNFS takes the form of OPTIONAL operations that manage protocol + objects called 'layouts' (Section 12.2.7) that contain a byte-range + and storage location information. The layout is managed in a similar + fashion as NFSv4.1 data delegations. For example, the layout is + leased, recallable, and revocable. However, layouts are distinct + abstractions and are manipulated with new operations. When a client + holds a layout, it is granted the ability to directly access the + byte-range at the storage location specified in the layout. + + There are interactions between layouts and other NFSv4.1 abstractions + such as data delegations and byte-range locking. Delegation issues + are discussed in Section 12.5.5. Byte-range locking issues are + discussed in Sections 12.2.9 and 12.5.1. + +12.2. pNFS Definitions + + NFSv4.1's pNFS feature provides parallel data access to a file system + that stripes its content across multiple storage servers. 
The first
+ instantiation of pNFS, as part of NFSv4.1, separates the file system
+ protocol processing into two parts: metadata processing and data
+ processing. Data consist of the contents of regular files that are
+ striped across storage servers. Data striping occurs in at least two
+ ways: on a file-by-file basis and, within sufficiently large files,
+ on a block-by-block basis. In contrast, striped access to metadata
+ by pNFS clients is not provided in NFSv4.1, even though the file
+ system back end of a pNFS server might stripe metadata. Metadata
+ consist of everything else, including the contents of non-regular
+ files (e.g., directories); see Section 12.2.1. The metadata
+ functionality is implemented by an NFSv4.1 server that supports pNFS
+ and the operations described in Section 18; such a server is called a
+ metadata server (Section 12.2.2).
+
+ The data functionality is implemented by one or more storage devices,
+ each of which is accessed by the client via a storage protocol. A
+ subset (defined in Section 13.6) of NFSv4.1 is one such storage
+ protocol. New terms are introduced to the NFSv4.1 nomenclature and
+ existing terms are clarified to allow for the description of the pNFS
+ feature.
+
+12.2.1. Metadata
+
+ Information about a file system object, such as its name, location
+ within the namespace, owner, ACL, and other attributes. Metadata may
+ also include storage location information, and this will vary based
+ on the underlying storage mechanism that is used.
+
+12.2.2. Metadata Server
+
+ An NFSv4.1 server that supports the pNFS feature. A variety of
+ architectural choices exist for the metadata server and its use of
+ file system information held at the server. Some servers may contain
+ metadata only for file objects residing at the metadata server, while
+ the file data resides on associated storage devices. Other metadata
+ servers may hold both metadata and a varying degree of file data.
+
+12.2.3. pNFS Client
+
+ An NFSv4.1 client that supports pNFS operations and supports at least
+ one storage protocol for performing I/O to storage devices.
+
+12.2.4. Storage Device
+
+ A storage device stores a regular file's data, but leaves metadata
+ management to the metadata server. A storage device could be another
+ NFSv4.1 server, an object-based storage device (OSD), a block device
+ accessed over a System Area Network (SAN, e.g., either FiberChannel
+ or iSCSI SAN), or some other entity.
+
+12.2.5. Storage Protocol
+
+ As noted in Figure 1, the storage protocol is the method used by the
+ client to store and retrieve data directly from the storage devices.
+
+ The NFSv4.1 pNFS feature has been structured to allow for a variety
+ of storage protocols to be defined and used. One example storage
+ protocol is NFSv4.1 itself (as documented in Section 13). Other
+ options for the storage protocol are described elsewhere and include:
+
+ * Block/volume protocols such as Internet SCSI (iSCSI) [56] and FCP
+   [57]. The block/volume protocol support can be independent of the
+   addressing structure of the block/volume protocol used, allowing
+   more than one protocol to access the same file data and enabling
+   extensibility to other block/volume protocols. See [48] for a
+   layout specification that allows pNFS to use block/volume storage
+   protocols.
+
+ * Object protocols such as OSD over iSCSI or Fibre Channel [58].
+   See [47] for a layout specification that allows pNFS to use object
+   storage protocols.
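+
+ As a non-normative sketch of protocol selection: the client simply
+ intersects the layout types advertised in the fs_layout_type
+ attribute (Section 5.12.1) with those it implements. The policy
+ function below and the use of 0 (not an assigned layout type) as a
+ "no match" value are assumptions of the sketch; the fallback it
+ anticipates is discussed in the next paragraph.
+
+   #include <stdbool.h>
+   #include <stddef.h>
+   #include <stdint.h>
+
+   /* Illustrative policy: this client implements only the file
+    * layout type (LAYOUT4_NFSV4_1_FILES, Section 3.3.13). */
+   static bool client_supports(uint32_t lt)
+   {
+       return lt == 1;   /* LAYOUT4_NFSV4_1_FILES */
+   }
+
+   /* Returns the chosen layout type, or 0 when the client must
+    * fall back to ordinary NFSv4.1 READ and WRITE through the
+    * metadata server. */
+   static uint32_t choose_layout_type(const uint32_t *fs_layout_type,
+                                      size_t n)
+   {
+       for (size_t i = 0; i < n; i++)
+           if (client_supports(fs_layout_type[i]))
+               return fs_layout_type[i];
+       return 0;
+   }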
+ + It is possible that various storage protocols are available to both + client and server and it may be possible that a client and server do + not have a matching storage protocol available to them. Because of + this, the pNFS server MUST support normal NFSv4.1 access to any file + accessible by the pNFS feature; this will allow for continued + interoperability between an NFSv4.1 client and server. + +12.2.6. Control Protocol + + As noted in Figure 1, the control protocol is used by the exported + file system between the metadata server and storage devices. + Specification of such protocols is outside the scope of the NFSv4.1 + protocol. Such control protocols would be used to control activities + such as the allocation and deallocation of storage, the management of + state required by the storage devices to perform client access + control, and, depending on the storage protocol, the enforcement of + authentication and authorization so that restrictions that would be + enforced by the metadata server are also enforced by the storage + device. + + A particular control protocol is not REQUIRED by NFSv4.1 but + requirements are placed on the control protocol for maintaining + attributes like modify time, the change attribute, and the end-of- + file (EOF) position. Note that if pNFS is layered over a clustered, + parallel file system (e.g., PVFS [59]), the mechanisms that enable + clustering and parallelism in that file system can be considered the + control protocol. + +12.2.7. Layout Types + + A layout describes the mapping of a file's data to the storage + devices that hold the data. A layout is said to belong to a specific + layout type (data type layouttype4, see Section 3.3.13). The layout + type allows for variants to handle different storage protocols, such + as those associated with block/volume [48], object [47], and file + (Section 13) layout types. A metadata server, along with its control + protocol, MUST support at least one layout type. A private sub-range + of the layout type namespace is also defined. Values from the + private layout type range MAY be used for internal testing or + experimentation (see Section 3.3.13). + + As an example, the organization of the file layout type could be an + array of tuples (e.g., device ID, filehandle), along with a + definition of how the data is stored across the devices (e.g., + striping). A block/volume layout might be an array of tuples that + store <device ID, block number, block count> along with information + about block size and the associated file offset of the block number. + An object layout might be an array of tuples <device ID, object ID> + and an additional structure (i.e., the aggregation map) that defines + how the logical byte sequence of the file data is serialized into the + different objects. Note that the actual layouts are typically more + complex than these simple expository examples. + + Requests for pNFS-related operations will often specify a layout + type. Examples of such operations are GETDEVICEINFO and LAYOUTGET. + The response for these operations will include structures such as a + device_addr4 or a layout4, each of which includes a layout type + within it. The layout type sent by the server MUST always be the + same one requested by the client. When a server sends a response + that includes a different layout type, the client SHOULD ignore the + response and behave as if the server had returned an error response. + +12.2.8. 
Layout + + A layout defines how a file's data is organized on one or more + storage devices. There are many potential layout types; each of the + layout types are differentiated by the storage protocol used to + access data and by the aggregation scheme that lays out the file data + on the underlying storage devices. A layout is precisely identified + by the tuple <client ID, filehandle, layout type, iomode, range>, + where filehandle refers to the filehandle of the file on the metadata + server. + + It is important to define when layouts overlap and/or conflict with + each other. For two layouts with overlapping byte-ranges to actually + overlap each other, both layouts must be of the same layout type, + correspond to the same filehandle, and have the same iomode. Layouts + conflict when they overlap and differ in the content of the layout + (i.e., the storage device/file mapping parameters differ). Note that + differing iomodes do not lead to conflicting layouts. It is + permissible for layouts with different iomodes, pertaining to the + same byte-range, to be held by the same client. An example of this + would be copy-on-write functionality for a block/volume layout type. + +12.2.9. Layout Iomode + + The layout iomode (data type layoutiomode4, see Section 3.3.20) + indicates to the metadata server the client's intent to perform + either just READ operations or a mixture containing READ and WRITE + operations. For certain layout types, it is useful for a client to + specify this intent at the time it sends LAYOUTGET (Section 18.43). + For example, for block/volume-based protocols, block allocation could + occur when a LAYOUTIOMODE4_RW iomode is specified. A special + LAYOUTIOMODE4_ANY iomode is defined and can only be used for + LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET. It specifies + that layouts pertaining to both LAYOUTIOMODE4_READ and + LAYOUTIOMODE4_RW iomodes are being returned or recalled, + respectively. + + A storage device may validate I/O with regard to the iomode; this is + dependent upon storage device implementation and layout type. Thus, + if the client's layout iomode is inconsistent with the I/O being + performed, the storage device may reject the client's I/O with an + error indicating that a new layout with the correct iomode should be + obtained via LAYOUTGET. For example, if a client gets a layout with + a LAYOUTIOMODE4_READ iomode and performs a WRITE to a storage device, + the storage device is allowed to reject that WRITE. + + The use of the layout iomode does not conflict with OPEN share modes + or byte-range LOCK operations; open share mode and byte-range lock + conflicts are enforced as they are without the use of pNFS and are + logically separate from the pNFS layout level. Open share modes and + byte-range locks are the preferred method for restricting user access + to data files. For example, an OPEN of OPEN4_SHARE_ACCESS_WRITE does + not conflict with a LAYOUTGET containing an iomode of + LAYOUTIOMODE4_RW performed by another client. Applications that + depend on writing into the same file concurrently may use byte-range + locking to serialize their accesses. + +12.2.10. Device IDs + + The device ID (data type deviceid4, see Section 3.3.14) identifies a + group of storage devices. The scope of a device ID is the pair + <client ID, layout type>. In practice, a significant amount of + information may be required to fully address a storage device. + Rather than embedding all such information in a layout, layouts embed + device IDs. 
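+
+ In implementation terms, this indirection usually means the client
+ keeps a small cache mapping <layout type, device ID> (the client ID
+ being implicit) to decoded device addresses, filling it on a miss
+ via the GETDEVICEINFO operation described next. The following C
+ fragment is a sketch only; the cache shape and the
+ getdeviceinfo_and_cache() helper are hypothetical.
+
+   #include <stdint.h>
+   #include <string.h>
+
+   #define NFS4_DEVICEID4_SIZE 16          /* from the XDR */
+   typedef struct {
+       unsigned char data[NFS4_DEVICEID4_SIZE];
+   } deviceid4;
+
+   struct dev_cache_entry {
+       uint32_t   layout_type;            /* scopes the device ID */
+       deviceid4  devid;
+       void      *addrs;                  /* decoded device_addr4 */
+       struct dev_cache_entry *next;
+   };
+
+   /* Hypothetical: sends GETDEVICEINFO, decodes the device_addr4,
+    * and inserts the result into the cache. */
+   void *getdeviceinfo_and_cache(uint32_t layout_type, deviceid4 id);
+
+   static struct dev_cache_entry *dev_cache;
+
+   static void *device_addrs(uint32_t layout_type, deviceid4 id)
+   {
+       for (struct dev_cache_entry *e = dev_cache; e; e = e->next)
+           if (e->layout_type == layout_type &&
+               memcmp(&e->devid, &id, sizeof id) == 0)
+               return e->addrs;
+       return getdeviceinfo_and_cache(layout_type, id);
+   }
+
+ Entries in such a cache must be dropped or refreshed when the
+ server changes or deletes a mapping, for example via the
+ notifications discussed later in this section.
+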
The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is + used to retrieve the complete address information (including all + device addresses for the device ID) regarding the storage device + according to its layout type and device ID. For example, the address + of an NFSv4.1 data server or of an object-based storage device could + be an IP address and port. The address of a block storage device + could be a volume label. + + Clients cannot expect the mapping between a device ID and its storage + device address(es) to persist across metadata server restart. See + Section 12.7.4 for a description of how recovery works in that + situation. + + A device ID lives as long as there is a layout referring to the + device ID. If there are no layouts referring to the device ID, the + server is free to delete the device ID any time. Once a device ID is + deleted by the server, the server MUST NOT reuse the device ID for + the same layout type and client ID again. This requirement is + feasible because the device ID is 16 bytes long, leaving sufficient + room to store a generation number if the server's implementation + requires most of the rest of the device ID's content to be reused. + This requirement is necessary because otherwise the race conditions + between asynchronous notification of device ID addition and deletion + would be too difficult to sort out. + + Device ID to device address mappings are not leased, and can be + changed at any time. (Note that while device ID to device address + mappings are likely to change after the metadata server restarts, the + server is not required to change the mappings.) A server has two + choices for changing mappings. It can recall all layouts referring + to the device ID or it can use a notification mechanism. + + The NFSv4.1 protocol has no optimal way to recall all layouts that + referred to a particular device ID (unless the server associates a + single device ID with a single fsid or a single client ID; in which + case, CB_LAYOUTRECALL has options for recalling all layouts + associated with the fsid, client ID pair, or just the client ID). + + Via a notification mechanism (see Section 20.12), device ID to device + address mappings can change over the duration of server operation + without recalling or revoking the layouts that refer to device ID. + The notification mechanism can also delete a device ID, but only if + the client has no layouts referring to the device ID. A notification + of a change to a device ID to device address mapping will immediately + or eventually invalidate some or all of the device ID's mappings. + The server MUST support notifications and the client must request + them before they can be used. For further information about the + notification types, see Section 20.12. + +12.3. pNFS Operations + + NFSv4.1 has several operations that are needed for pNFS servers, + regardless of layout type or storage protocol. These operations are + all sent to a metadata server and summarized here. While pNFS is an + OPTIONAL feature, if pNFS is implemented, some operations are + REQUIRED in order to comply with pNFS. See Section 17. + + These are the fore channel pNFS operations: + + GETDEVICEINFO (Section 18.40), as noted previously + (Section 12.2.10), returns the mapping of device ID to storage + device address. + + GETDEVICELIST (Section 18.41) allows clients to fetch all device IDs + for a specific file system. + + LAYOUTGET (Section 18.43) is used by a client to get a layout for a + file. 
+
+ LAYOUTCOMMIT (Section 18.42) is used to inform the metadata server
+ of the client's intent to commit data that has been written to the
+ storage device (the storage device as originally indicated in the
+ return value of LAYOUTGET).
+
+ LAYOUTRETURN (Section 18.44) is used to return layouts for a file, a
+ file system ID (FSID), or a client ID.
+
+ These are the backchannel pNFS operations:
+
+ CB_LAYOUTRECALL (Section 20.3) recalls a layout, all layouts
+ belonging to a file system, or all layouts belonging to a client
+ ID.
+
+ CB_RECALL_ANY (Section 20.6) tells a client that it needs to return
+ some number of recallable objects, including layouts, to the
+ metadata server.
+
+ CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a
+ recallable object that it was denied (in case of pNFS, a layout
+ denied by LAYOUTGET) due to resource exhaustion is now available.
+
+ CB_NOTIFY_DEVICEID (Section 20.12) notifies the client of changes to
+ device IDs.
+
+12.4. pNFS Attributes
+
+ A number of attributes specific to pNFS are listed and described in
+ Section 5.12.
+
+12.5. Layout Semantics
+
+12.5.1. Guarantees Provided by Layouts
+
+ Layouts grant to the client the ability to access data located at a
+ storage device with the appropriate storage protocol. The client is
+ guaranteed the layout will be recalled when one of two things occurs:
+ either a conflicting layout is requested or the state encapsulated by
+ the layout becomes invalid (this can happen when an event directly or
+ indirectly modifies the layout). When a layout is recalled and
+ returned by the client, the client continues with the ability to
+ access file data with normal NFSv4.1 operations through the metadata
+ server. Only the ability to access the storage devices is affected.
+
+ The requirement of NFSv4.1 that all user access rights MUST be
+ obtained through the appropriate OPEN, LOCK, and ACCESS operations is
+ not modified with the existence of layouts. Layouts are provided to
+ NFSv4.1 clients, and user access still follows the rules of the
+ protocol as if they did not exist. It is a requirement that for a
+ client to access a storage device, a layout must be held by the
+ client. If a storage device receives an I/O request for a byte-range
+ for which the client does not hold a layout, the storage device
+ SHOULD reject that I/O request. Note that the act of modifying a
+ file for which a layout is held does not necessarily conflict with
+ the holding of the layout that describes the file being modified.
+ Therefore, it is the requirement of the storage protocol or layout
+ type that determines the necessary behavior. For example, block/
+ volume layout types require that the layout's iomode agree with the
+ type of I/O being performed.
+
+ Depending upon the layout type and storage protocol in use, storage
+ device access permissions may be granted by LAYOUTGET and may be
+ encoded within the type-specific layout. For an example of storage
+ device access permissions, see an object-based protocol such as [58].
+ If access permissions are encoded within the layout, the metadata
+ server SHOULD recall the layout when those permissions become invalid
+ for any reason -- for example, when a file becomes unwritable or
+ inaccessible to a client. Note that clients are still required to
+ perform the appropriate OPEN, LOCK, and ACCESS operations as
+ described above.
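+
+ As a non-normative sketch of the storage-device check noted above
+ (rejecting I/O for byte-ranges with no matching layout), consider
+ the following C fragment. How a storage device learns which layouts
+ a client holds is a control protocol matter outside this
+ specification, and all structure and function names here are
+ illustrative.
+
+   #include <stdbool.h>
+   #include <stdint.h>
+
+   enum { LAYOUTIOMODE4_READ = 1, LAYOUTIOMODE4_RW = 2 };
+
+   struct held_layout {
+       uint64_t offset;
+       uint64_t length;
+       int      iomode;
+       struct held_layout *next;
+   };
+
+   /* Returns false when the I/O falls outside every layout held
+    * by the client, in which case the request SHOULD be rejected.
+    * The iomode check shown is strict; whether it applies is
+    * layout-type dependent (see Section 12.2.9). */
+   static bool io_permitted(const struct held_layout *l,
+                            uint64_t off, uint64_t len, bool is_write)
+   {
+       for (; l != NULL; l = l->next) {
+           if (off < l->offset || off + len > l->offset + l->length)
+               continue;
+           if (is_write && l->iomode != LAYOUTIOMODE4_RW)
+               continue;
+           return true;
+       }
+       return false;
+   }
+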
The degree to which it is possible for the client + to circumvent these operations and the consequences of doing so must + be clearly specified by the individual layout type specifications. + In addition, these specifications must be clear about the + requirements and non-requirements for the checking performed by the + server. + + In the presence of pNFS functionality, mandatory byte-range locks + MUST behave as they would without pNFS. Therefore, if mandatory file + locks and layouts are provided simultaneously, the storage device + MUST be able to enforce the mandatory byte-range locks. For example, + if one client obtains a mandatory byte-range lock and a second client + accesses the storage device, the storage device MUST appropriately + restrict I/O for the range of the mandatory byte-range lock. If the + storage device is incapable of providing this check in the presence + of mandatory byte-range locks, then the metadata server MUST NOT + grant layouts and mandatory byte-range locks simultaneously. + +12.5.2. Getting a Layout + + A client obtains a layout with the LAYOUTGET operation. The metadata + server will grant layouts of a particular type (e.g., block/volume, + object, or file). The client selects an appropriate layout type that + the server supports and the client is prepared to use. The layout + returned to the client might not exactly match the requested byte- + range as described in Section 18.43.3. As needed a client may send + multiple LAYOUTGET operations; these might result in multiple + overlapping, non-conflicting layouts (see Section 12.2.8). + + In order to get a layout, the client must first have opened the file + via the OPEN operation. When a client has no layout on a file, it + MUST present an open stateid, a delegation stateid, or a byte-range + lock stateid in the loga_stateid argument. A successful LAYOUTGET + result includes a layout stateid. The first successful LAYOUTGET + processed by the server using a non-layout stateid as an argument + MUST have the "seqid" field of the layout stateid in the response set + to one. Thereafter, the client MUST use a layout stateid (see + Section 12.5.3) on future invocations of LAYOUTGET on the file, and + the "seqid" MUST NOT be set to zero. Once the layout has been + retrieved, it can be held across multiple OPEN and CLOSE sequences. + Therefore, a client may hold a layout for a file that is not + currently open by any user on the client. This allows for the + caching of layouts beyond CLOSE. + + The storage protocol used by the client to access the data on the + storage device is determined by the layout's type. The client is + responsible for matching the layout type with an available method to + interpret and use the layout. The method for this layout type + selection is outside the scope of the pNFS functionality. + + Although the metadata server is in control of the layout for a file, + the pNFS client can provide hints to the server when a file is opened + or created about the preferred layout type and aggregation schemes. + pNFS introduces a layout_hint attribute (Section 5.12.4) that the + client can set at file creation time to provide a hint to the server + for new files. Setting this attribute separately, after the file has + been created might make it difficult, or impossible, for the server + implementation to comply. 
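+
+ The usual pattern, then, is to supply the hint in the OPEN that
+ creates the file and to read back the server's choice in the same
+ COMPOUND (the operation flow in Section 12.6 spells this out). The
+ following C fragment is a non-normative sketch; the builder types
+ and helper functions are hypothetical stand-ins for whatever RPC
+ machinery an implementation uses, and only the operation sequence
+ and the choice of createmode4 come from this document.
+
+   #include <stdint.h>
+
+   /* Hypothetical stand-ins for an implementation's machinery. */
+   typedef struct { unsigned char opaque[128]; } nfs_fh4;
+   typedef struct { uint32_t loh_type; /* + loh_body */ } layouthint4;
+   struct compound;                   /* opaque request builder  */
+   struct fattr_builder;              /* attribute set builder   */
+   void compound_add_putfh(struct compound *, const nfs_fh4 *);
+   void attr_set_layout_hint(struct fattr_builder *,
+                             const layouthint4 *);
+   void compound_add_open_create_guarded(struct compound *,
+                                         const char *name,
+                                         struct fattr_builder *);
+   void compound_add_getattr_layout_type(struct compound *);
+
+   /* Create a file, supplying layout_hint (Section 5.12.4) at
+    * creation time, then fetch layout_type in the same COMPOUND so
+    * the client learns which layout type the server chose.
+    * EXCLUSIVE4 cannot carry attributes; GUARDED4 is used here,
+    * and EXCLUSIVE4_1 is the other option (see the next
+    * paragraph). */
+   void create_with_layout_hint(struct compound *c,
+                                const nfs_fh4 *dir_fh,
+                                const char *name,
+                                const layouthint4 *hint,
+                                struct fattr_builder *attrs)
+   {
+       compound_add_putfh(c, dir_fh);
+       attr_set_layout_hint(attrs, hint);
+       compound_add_open_create_guarded(c, name, attrs);
+       compound_add_getattr_layout_type(c);
+   }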
+
+ Because the EXCLUSIVE4 createmode4 does not allow the setting of
+ attributes at file creation time, NFSv4.1 introduces the EXCLUSIVE4_1
+ createmode4, which does allow attributes to be set at file creation
+ time. In addition, if the session is created with persistent reply
+ caches, EXCLUSIVE4_1 is neither necessary nor allowed. Instead,
+ GUARDED4 both works better and is prescribed. Table 18 in
+ Section 18.16.3 summarizes how a client is allowed to send an
+ exclusive create.
+
+12.5.3. Layout Stateid
+
+ As with all other stateids, the layout stateid consists of a "seqid"
+ and "other" field. Once a layout stateid is established, the "other"
+ field will stay constant unless the stateid is revoked or the client
+ returns all layouts on the file and the server disposes of the
+ stateid. The "seqid" field is initially set to one, and is never
+ zero on any NFSv4.1 operation that uses layout stateids, whether it
+ is a fore channel or backchannel operation. After the layout stateid
+ is established, the server increments by one the value of the "seqid"
+ in each subsequent LAYOUTGET and LAYOUTRETURN response, and in each
+ CB_LAYOUTRECALL request.
+
+ Given the design goal of pNFS to provide parallelism, the layout
+ stateid differs from other stateid types in that the client is
+ expected to send LAYOUTGET and LAYOUTRETURN operations in parallel.
+ The "seqid" value is used by the client to properly sort responses to
+ LAYOUTGET and LAYOUTRETURN. The "seqid" is also used to prevent race
+ conditions between LAYOUTGET and CB_LAYOUTRECALL. Given that the
+ processing rules for layout stateids differ from those for other
+ stateid types, only the pNFS sections of this document should be
+ considered to determine proper layout stateid handling.
+
+ Once the client receives a layout stateid, it MUST use the correct
+ "seqid" for subsequent LAYOUTGET or LAYOUTRETURN operations. The
+ correct "seqid" is defined as the highest "seqid" value from
+ responses of fully processed LAYOUTGET or LAYOUTRETURN operations or
+ arguments of a fully processed CB_LAYOUTRECALL operation. Since the
+ server is incrementing the "seqid" value on each layout operation,
+ the client may determine the order of operation processing by
+ inspecting the "seqid" value. In the case of overlapping layout
+ ranges, the ordering information will provide the client the
+ knowledge of which layout ranges are held. Note that overlapping
+ layout ranges may occur because of the client's specific requests or
+ because the server is allowed to expand the range of a requested
+ layout and notify the client in the LAYOUTGET results. Additional
+ layout stateid sequencing requirements are provided in
+ Section 12.5.5.2.
+
+ The client's receipt of a "seqid" is not sufficient for subsequent
+ use. The client must fully process the operations before the "seqid"
+ can be used. For LAYOUTGET results, if the client is not using the
+ forgetful model (Section 12.5.5.1), it MUST first update its record
+ of what ranges of the file's layout it has before using the seqid.
+ For LAYOUTRETURN results, the client MUST delete the range from its
+ record of what ranges of the file's layout it had before using the
+ seqid. For CB_LAYOUTRECALL arguments, the client MUST send a
+ response to the recall before using the seqid. The fundamental
+ requirement in client processing is that the "seqid" is used to
+ provide the order of processing. LAYOUTGET results may be processed
+ in parallel. LAYOUTRETURN results may be processed in parallel.
+ LAYOUTGET and LAYOUTRETURN responses may be processed in parallel as + long as the ranges do not overlap. CB_LAYOUTRECALL request + processing MUST be processed in "seqid" order at all times. + + Once a client has no more layouts on a file, the layout stateid is no + longer valid and MUST NOT be used. Any attempt to use such a layout + stateid will result in NFS4ERR_BAD_STATEID. + +12.5.4. Committing a Layout + + Allowing for varying storage protocol capabilities, the pNFS protocol + does not require the metadata server and storage devices to have a + consistent view of file attributes and data location mappings. Data + location mapping refers to aspects such as which offsets store data + as opposed to storing holes (see Section 13.4.4 for a discussion). + Related issues arise for storage protocols where a layout may hold + provisionally allocated blocks where the allocation of those blocks + does not survive a complete restart of both the client and server. + Because of this inconsistency, it is necessary to resynchronize the + client with the metadata server and its storage devices and make any + potential changes available to other clients. This is accomplished + by use of the LAYOUTCOMMIT operation. + + The LAYOUTCOMMIT operation is responsible for committing a modified + layout to the metadata server. The data should be written and + committed to the appropriate storage devices before the LAYOUTCOMMIT + occurs. The scope of the LAYOUTCOMMIT operation depends on the + storage protocol in use. It is important to note that the level of + synchronization is from the point of view of the client that sent the + LAYOUTCOMMIT. The updated state on the metadata server need only + reflect the state as of the client's last operation previous to the + LAYOUTCOMMIT. The metadata server is not REQUIRED to maintain a + global view that accounts for other clients' I/O that may have + occurred within the same time frame. + + For block/volume-based layouts, LAYOUTCOMMIT may require updating the + block list that comprises the file and committing this layout to + stable storage. For file-based layouts, synchronization of + attributes between the metadata and storage devices, primarily the + size attribute, is required. + + The control protocol is free to synchronize the attributes before it + receives a LAYOUTCOMMIT; however, upon successful completion of a + LAYOUTCOMMIT, state that exists on the metadata server that describes + the file MUST be synchronized with the state that exists on the + storage devices that comprise that file as of the client's last sent + operation. Thus, a client that queries the size of a file between a + WRITE to a storage device and the LAYOUTCOMMIT might observe a size + that does not reflect the actual data written. + + The client MUST have a layout in order to send a LAYOUTCOMMIT + operation. + +12.5.4.1. LAYOUTCOMMIT and change/time_modify + + The change and time_modify attributes may be updated by the server + when the LAYOUTCOMMIT operation is processed. The reason for this is + that some layout types do not support the update of these attributes + when the storage devices process I/O operations. If a client has a + layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY + provide a suggested value to the server for time_modify within the + arguments to LAYOUTCOMMIT. Based on the layout type, the provided + value may or may not be used. The server should sanity-check the + client-provided values before they are used. 
For example, the server + should ensure that time does not flow backwards. The client always + has the option to set time_modify through an explicit SETATTR + operation. + + For some layout protocols, the storage device is able to notify the + metadata server of the occurrence of an I/O; as a result, the change + and time_modify attributes may be updated at the metadata server. + For a metadata server that is capable of monitoring updates to the + change and time_modify attributes, LAYOUTCOMMIT processing is not + required to update the change attribute. In this case, the metadata + server must ensure that no further update to the data has occurred + since the last update of the attributes; file-based protocols may + have enough information to make this determination or may update the + change attribute upon each file modification. This also applies for + the time_modify attribute. If the server implementation is able to + determine that the file has not been modified since the last + time_modify update, the server need not update time_modify at + LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the updated attributes + should be visible if that file was modified since the latest previous + LAYOUTCOMMIT or LAYOUTGET. + +12.5.4.2. LAYOUTCOMMIT and size + + The size of a file may be updated when the LAYOUTCOMMIT operation is + used by the client. One of the fields in the argument to + LAYOUTCOMMIT is loca_last_write_offset; this field indicates the + highest byte offset written but not yet committed with the + LAYOUTCOMMIT operation. The data type of loca_last_write_offset is + newoffset4 and is switched on a boolean value, no_newoffset, that + indicates if a previous write occurred or not. If no_newoffset is + FALSE, an offset is not given. If the client has a layout with + LAYOUTIOMODE4_RW iomode on the file, with a byte-range (denoted by + the values of lo_offset and lo_length) that overlaps + loca_last_write_offset, then the client MAY set no_newoffset to TRUE + and provide an offset that will update the file size. Keep in mind + that offset is not the same as length, though they are related. For + example, a loca_last_write_offset value of zero means that one byte + was written at offset zero, and so the length of the file is at least + one byte. + + The metadata server may do one of the following: + + 1. Update the file's size using the last write offset provided by + the client as either the true file size or as a hint of the file + size. If the metadata server has a method available, any new + value for file size should be sanity-checked. For example, the + file must not be truncated if the client presents a last write + offset less than the file's current size. + + 2. Ignore the client-provided last write offset; the metadata server + must have sufficient knowledge from other sources to determine + the file's size. For example, the metadata server queries the + storage devices with the control protocol. + + The method chosen to update the file's size will depend on the + storage device's and/or the control protocol's capabilities. For + example, if the storage devices are block devices with no knowledge + of file size, the metadata server must rely on the client to set the + last write offset appropriately. + + The results of LAYOUTCOMMIT contain a new size value in the form of a + newsize4 union data type. If the file's size is set as a result of + LAYOUTCOMMIT, the metadata server must reply with the new size; + otherwise, the new size is not provided. 
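+
+ As a non-normative sketch of option 1 above, the following C helper
+ derives a candidate size from loca_last_write_offset and applies the
+ no-truncation sanity check; the function and its names are
+ illustrative only. A server relying on the control protocol
+ (option 2) would ignore the client-supplied offset entirely.
+
+   #include <stdbool.h>
+   #include <stdint.h>
+
+   /* Derive a new file size from loca_last_write_offset.  Returns
+    * true (and sets *new_size, returned via newsize4) only when
+    * the size actually grows; the file MUST NOT be truncated on
+    * the basis of a small or stale last write offset. */
+   static bool layoutcommit_new_size(bool no_newoffset,
+                                     uint64_t last_write_offset,
+                                     uint64_t cur_size,
+                                     uint64_t *new_size)
+   {
+       if (!no_newoffset)
+           return false;       /* no offset given by the client */
+       /* The offset names the last byte written, so the implied
+        * size is offset + 1 (see the example above). */
+       if (last_write_offset + 1 <= cur_size)
+           return false;
+       *new_size = last_write_offset + 1;
+       return true;
+   }
+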
If the file size is + updated, the metadata server SHOULD update the storage devices such + that the new file size is reflected when LAYOUTCOMMIT processing is + complete. For example, the client should be able to read up to the + new file size. + + The client can extend the length of a file or truncate a file by + sending a SETATTR operation to the metadata server with the size + attribute specified. If the size specified is larger than the + current size of the file, the file is "zero extended", i.e., zeros + are implicitly added between the file's previous EOF and the new EOF. + (In many implementations, the zero-extended byte-range of the file + consists of unallocated holes in the file.) When the client writes + past EOF via WRITE, the SETATTR operation does not need to be used. + +12.5.4.3. LAYOUTCOMMIT and layoutupdate + + The LAYOUTCOMMIT argument contains a loca_layoutupdate field + (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This + argument is a layout-type-specific structure. The structure can be + used to pass arbitrary layout-type-specific information from the + client to the metadata server at LAYOUTCOMMIT time. For example, if + using a block/volume layout, the client can indicate to the metadata + server which reserved or allocated blocks the client used or did not + use. The content of loca_layoutupdate (field lou_body) need not be + the same layout-type-specific content returned by LAYOUTGET + (Section 18.43.2) in the loc_body field of the lo_content field of + the logr_layout field. The content of loca_layoutupdate is defined + by the layout type specification and is opaque to LAYOUTCOMMIT. + +12.5.5. Recalling a Layout + + Since a layout protects a client's access to a file via a direct + client-storage-device path, a layout need only be recalled when it is + semantically unable to serve this function. Typically, this occurs + when the layout no longer encapsulates the true location of the file + over the byte-range it represents. Any operation or action, such as + server-driven restriping or load balancing, that changes the layout + will result in a recall of the layout. A layout is recalled by the + CB_LAYOUTRECALL callback operation (see Section 20.3) and returned + with LAYOUTRETURN (see Section 18.44). The CB_LAYOUTRECALL operation + may recall a layout identified by a byte-range, all layouts + associated with a file system ID (FSID), or all layouts associated + with a client ID. Section 12.5.5.2 discusses sequencing issues + surrounding the getting, returning, and recalling of layouts. + + An iomode is also specified when recalling a layout. Generally, the + iomode in the recall request must match the layout being returned; + for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause + the client to only return LAYOUTIOMODE4_RW layouts and not + LAYOUTIOMODE4_READ layouts. However, a special LAYOUTIOMODE4_ANY + enumeration is defined to enable recalling a layout of any iomode; in + other words, the client must return both LAYOUTIOMODE4_READ and + LAYOUTIOMODE4_RW layouts. + + A REMOVE operation SHOULD cause the metadata server to recall the + layout to prevent the client from accessing a non-existent file and + to reclaim state stored on the client. Since a REMOVE may be delayed + until the last close of the file has occurred, the recall may also be + delayed until this time. After the last reference on the file has + been released and the file has been removed, the client should no + longer be able to perform I/O using the layout. 
In the case of a + file-based layout, the data server SHOULD return NFS4ERR_STALE in + response to any operation on the removed file. + + Once a layout has been returned, the client MUST NOT send I/Os to the + storage devices for the file, byte-range, and iomode represented by + the returned layout. If a client does send an I/O to a storage + device for which it does not hold a layout, the storage device SHOULD + reject the I/O. + + Although pNFS does not alter the file data caching capabilities of + clients, or their semantics, it recognizes that some clients may + perform more aggressive write-behind caching to optimize the benefits + provided by pNFS. However, write-behind caching may negatively + affect the latency in returning a layout in response to a + CB_LAYOUTRECALL; this is similar to file delegations and the impact + that file data caching has on DELEGRETURN. Client implementations + SHOULD limit the amount of unwritten data they have outstanding at + any one time in order to prevent excessively long responses to + CB_LAYOUTRECALL. Once a layout is recalled, a server MUST wait one + lease period before taking further action. As soon as a lease period + has passed, the server may choose to fence the client's access to the + storage devices if the server perceives the client has taken too long + to return a layout. However, just as in the case of data delegation + and DELEGRETURN, the server may choose to wait, given that the client + is showing forward progress on its way to returning the layout. This + forward progress can take the form of successful interaction with the + storage devices or of sub-portions of the layout being returned by + the client. The server can also limit exposure to these problems by + limiting the byte-ranges initially provided in the layouts and thus + the amount of outstanding modified data. + +12.5.5.1. Layout Recall Callback Robustness + + It has been assumed thus far that pNFS client state (layout ranges + and iomode) for a file exactly matches that of the pNFS server for + that file. This assumption leads to the implication that any + callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that + exactly match the range in the callback, since both client and server + agree about the state being maintained. However, it can be useful if + this assumption does not always hold. For example: + + * If conflicts that require callbacks are very rare, and a server + can use a multi-file callback to recover per-client resources + (e.g., via an FSID recall or a multi-file recall within a single + CB_COMPOUND), the result may be significantly less client-server + pNFS traffic. + + * It may be useful for servers to maintain information about what + ranges are held by a client on a coarse-grained basis, leading to + the server's layout ranges being beyond those actually held by the + client. In the extreme, a server could manage conflicts on a per- + file basis, only sending whole-file callbacks even though clients + may request and be granted sub-file ranges. + + * It may be useful for clients to "forget" details about what + layouts and ranges the client actually has, leading to the + server's layout ranges being beyond those that the client "thinks" + it has. As long as the client does not assume it has layouts that + are beyond what the server has granted, this is a safe practice. 
+ When a client forgets what ranges and layouts it has, and it + receives a CB_LAYOUTRECALL operation, the client MUST follow up + with a LAYOUTRETURN for what the server recalled, or alternatively + return the NFS4ERR_NOMATCHING_LAYOUT error if it has no layout to + return in the recalled range. + + * In order to avoid errors, it is vital that a client not assign + itself layout permissions beyond what the server has granted, and + that the server not forget layout permissions that have been + granted. On the other hand, if a server believes that a client + holds a layout that the client does not know about, it is useful + for the client to cleanly indicate completion of the requested + recall either by sending a LAYOUTRETURN operation for the entire + requested range or by returning an NFS4ERR_NOMATCHING_LAYOUT error + to the CB_LAYOUTRECALL. + + Thus, in light of the above, it is useful for a server to be able to + send callbacks for layout ranges it has not granted to a client, and + for a client to return ranges it does not hold. A pNFS client MUST + always return layouts that comprise the full range specified by the + recall. Note, the full recalled layout range need not be returned as + part of a single operation, but may be returned in portions. This + allows the client to stage the flushing of dirty data and commits and + returns of layouts. Also, it indicates to the metadata server that + the client is making progress. + + When a layout is returned, the client MUST NOT have any outstanding + I/O requests to the storage devices involved in the layout. + Rephrasing, the client MUST NOT return the layout while it has + outstanding I/O requests to the storage device. + + Even with this requirement for the client, it is possible that I/O + requests may be presented to a storage device no longer allowed to + perform them. Since the server has no strict control as to when the + client will return the layout, the server may later decide to + unilaterally revoke the client's access to the storage devices as + provided by the layout. In choosing to revoke access, the server + must deal with the possibility of lingering I/O requests, i.e., I/O + requests that are still in flight to storage devices identified by + the revoked layout. All layout type specifications MUST define + whether unilateral layout revocation by the metadata server is + supported; if it is, the specification must also describe how + lingering writes are processed. For example, storage devices + identified by the revoked layout could be fenced off from the client + that held the layout. + + In order to ensure client/server convergence with regard to layout + state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN + operations for a particular recall MUST specify the entire range + being recalled, echoing the recalled layout type, iomode, recall/ + return type (FILE, FSID, or ALL), and byte-range, even if layouts + pertaining to partial ranges were previously returned. In addition, + if the client holds no layouts that overlap the range being recalled, + the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to + CB_LAYOUTRECALL. This allows the server to update its view of the + client's layout state. + +12.5.5.2. Sequencing of Layout Operations + + As with other stateful operations, pNFS requires the correct + sequencing of layout operations. pNFS uses the "seqid" in the layout + stateid to provide the correct sequencing between regular operations + and callbacks. 
It is the server's responsibility to avoid + inconsistencies regarding the layouts provided and the client's + responsibility to properly serialize its layout requests and layout + returns. + +12.5.5.2.1. Layout Recall and Return Sequencing + + One critical issue with regard to layout operations sequencing + concerns callbacks. The protocol must defend against races between + the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent + CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that + implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations + to which the client has not yet received a reply. The client detects + such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's + layout stateid. If the "seqid" is not exactly one higher than what + the client currently has recorded, and the client has at least one + LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows + the server sent the CB_LAYOUTRECALL after sending a response to an + outstanding LAYOUTGET or LAYOUTRETURN. The client MUST wait before + processing such a CB_LAYOUTRECALL until it processes all replies for + outstanding LAYOUTGET and LAYOUTRETURN operations for the + corresponding file with seqid less than the seqid given by + CB_LAYOUTRECALL (lor_stateid; see Section 20.3.) + + In addition to the seqid-based mechanism, Section 2.10.6.3 describes + the sessions mechanism for allowing the client to detect callback + race conditions and delay processing such a CB_LAYOUTRECALL. The + server MAY reference conflicting operations in the CB_SEQUENCE that + precedes the CB_LAYOUTRECALL. Because the server has already sent + replies for these operations before sending the callback, the replies + may race with the CB_LAYOUTRECALL. The client MUST wait for all the + referenced calls to complete and update its view of the layout state + before processing the CB_LAYOUTRECALL. + +12.5.5.2.1.1. Get/Return Sequencing + + The protocol allows the client to send concurrent LAYOUTGET and + LAYOUTRETURN operations to the server. The protocol does not provide + any means for the server to process the requests in the same order in + which they were created. However, through the use of the "seqid" + field in the layout stateid, the client can determine the order in + which parallel outstanding operations were processed by the server. + Thus, when a layout retrieved by an outstanding LAYOUTGET operation + intersects with a layout returned by an outstanding LAYOUTRETURN on + the same file, the order in which the two conflicting operations are + processed determines the final state of the overlapping layout. The + order is determined by the "seqid" returned in each operation: the + operation with the higher seqid was executed later. + + It is permissible for the client to send multiple parallel LAYOUTGET + operations for the same file or multiple parallel LAYOUTRETURN + operations for the same file or a mix of both. + + It is permissible for the client to use the current stateid (see + Section 16.2.3.1.2) for LAYOUTGET operations, for example, when + compounding LAYOUTGETs or compounding OPEN and LAYOUTGETs. It is + also permissible to use the current stateid when compounding + LAYOUTRETURNs. + + It is permissible for the client to use the current stateid when + combining LAYOUTRETURN and LAYOUTGET operations for the same file in + the same COMPOUND request since the server MUST process these in + order. 
However, if a client does send such COMPOUND requests, it
+ MUST NOT have more than one such COMPOUND outstanding for the same
+ file at the same time, and it MUST NOT have other LAYOUTGET or
+ LAYOUTRETURN operations outstanding at the same time for that same
+ file.
+
+12.5.5.2.1.2. Client Considerations
+
+ Consider a pNFS client that has sent a LAYOUTGET, and before it
+ receives the reply to LAYOUTGET, it receives a CB_LAYOUTRECALL for
+ the same file with an overlapping range. There are two
+ possibilities, which the client can distinguish via the layout
+ stateid in the recall.
+
+ 1. The server processed the LAYOUTGET before sending the recall, so
+    the LAYOUTGET must be waited for because it may be carrying
+    layout information that will need to be returned to deal with the
+    CB_LAYOUTRECALL.
+
+ 2. The server sent the callback before receiving the LAYOUTGET. The
+    server will not respond to the LAYOUTGET until the
+    CB_LAYOUTRECALL is processed.
+
+ If these possibilities could not be distinguished, a deadlock could
+ result, as the client must wait for the LAYOUTGET response before
+ processing the recall in the first case, but that response will not
+ arrive until after the recall is processed in the second case. Note
+ that in the first case, the "seqid" in the layout stateid of the
+ recall is two greater than what the client has recorded; in the
+ second case, the "seqid" is one greater than what the client has
+ recorded. This allows the client to disambiguate between the two
+ cases.
+
+ In case 1, the client knows it needs to wait for the LAYOUTGET
+ response before processing the recall (or the client can return
+ NFS4ERR_DELAY).
+
+ In case 2, the client will not wait for the LAYOUTGET response before
+ processing the recall because waiting would cause deadlock.
+ Therefore, the action at the client will only require waiting in the
+ case that the client has not yet seen the server's earlier responses
+ to the LAYOUTGET operation(s).
+
+ The recall process can be considered completed when the final
+ LAYOUTRETURN operation for the recalled range is completed. The
+ LAYOUTRETURN uses the layout stateid (with seqid) specified in
+ CB_LAYOUTRECALL. If the client uses multiple LAYOUTRETURNs in
+ processing the recall, the first LAYOUTRETURN will use the layout
+ stateid as specified in CB_LAYOUTRECALL. Subsequent LAYOUTRETURNs
+ will use the highest seqid as is the usual case.
+
+12.5.5.2.1.3. Server Considerations
+
+ Consider a race from the metadata server's point of view. The
+ metadata server has sent a CB_LAYOUTRECALL and receives an
+ overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s)
+ that respond to the CB_LAYOUTRECALL. There are three cases:
+
+ 1. The client sent the LAYOUTGET before processing the
+    CB_LAYOUTRECALL. The "seqid" in the layout stateid of the
+    arguments of LAYOUTGET is one less than the "seqid" in
+    CB_LAYOUTRECALL. The server returns NFS4ERR_RECALLCONFLICT to
+    the client, which indicates to the client that there is a pending
+    recall.
+
+ 2. The client sent the LAYOUTGET after processing the
+    CB_LAYOUTRECALL, but the LAYOUTGET arrived before the
+    LAYOUTRETURN and the response to CB_LAYOUTRECALL that completed
+    that processing. The "seqid" in the layout stateid of LAYOUTGET
+    is equal to or greater than that of the "seqid" in
+    CB_LAYOUTRECALL. The server has not received a response to the
+    CB_LAYOUTRECALL, so it returns NFS4ERR_RECALLCONFLICT.
3. The client sent the LAYOUTGET after processing the
+    CB_LAYOUTRECALL; the server received the CB_LAYOUTRECALL
+    response, but the LAYOUTGET arrived before the LAYOUTRETURN that
+    completed that processing. The "seqid" in the layout stateid of
+    LAYOUTGET is equal to that of the "seqid" in CB_LAYOUTRECALL.
+    The server has received a response to the CB_LAYOUTRECALL, so it
+    returns NFS4ERR_RETURNCONFLICT.
+
+12.5.5.2.1.4. Wraparound and Validation of Seqid
+
+ The rules for layout stateid processing differ from other stateids in
+ the protocol because the "seqid" value cannot be zero and the
+ stateid's "seqid" value changes in a CB_LAYOUTRECALL operation. The
+ non-zero requirement combined with the inherent parallelism of layout
+ operations means that a set of LAYOUTGET and LAYOUTRETURN operations
+ may contain the same value for "seqid". The server uses a slightly
+ modified version of the modulo arithmetic as described in
+ Section 2.10.6.1 when incrementing the layout stateid's "seqid". The
+ difference is that zero is not a valid value for "seqid"; when the
+ value of a "seqid" is 0xFFFFFFFF, the next valid value will be
+ 0x00000001. The modulo arithmetic is also used for the comparisons
+ of "seqid" values in the processing of CB_LAYOUTRECALL events as
+ described above in Section 12.5.5.2.1.3.
+
+ Just as the server validates the "seqid" in the event of
+ CB_LAYOUTRECALL usage, as described in Section 12.5.5.2.1.3, the
+ server also validates the "seqid" value to ensure that it is within
+ an appropriate range. This range represents the degree of
+ parallelism the server supports for layout stateids. If the client
+ is sending multiple layout operations to the server in parallel, by
+ definition, the "seqid" value in the supplied stateid will not be the
+ current "seqid" as held by the server. The range of parallelism
+ spans from the highest or current "seqid" to a "seqid" value in the
+ past. To assist in the discussion, the server's current "seqid"
+ value for a layout stateid is defined as SERVER_CURRENT_SEQID. The
+ lowest "seqid" value that is acceptable to the server is represented
+ by PAST_SEQID. And the value for the range of valid "seqid"s or
+ range of parallelism is VALID_SEQID_RANGE. Therefore, the following
+ holds: VALID_SEQID_RANGE = SERVER_CURRENT_SEQID - PAST_SEQID. In the
+ following, all arithmetic is the modulo arithmetic as described
+ above.
+
+ The server MUST support a minimum VALID_SEQID_RANGE. The minimum is
+ defined as: VALID_SEQID_RANGE = summation over 1..N of
+ (ca_maxoperations(i) - 1), where N is the number of session fore
+ channels and ca_maxoperations(i) is the value of the ca_maxoperations
+ returned from CREATE_SESSION of the i'th session. The reason for "-
+ 1" is to allow for the required SEQUENCE operation. The server MAY
+ support a VALID_SEQID_RANGE value larger than the minimum. The
+ maximum VALID_SEQID_RANGE is (2^(32) - 2) (accounting for zero not
+ being a valid "seqid" value).
+
+ If the server finds the "seqid" is zero, the NFS4ERR_BAD_STATEID
+ error is returned to the client. The server further validates the
+ "seqid" to ensure it is within the range of parallelism,
+ VALID_SEQID_RANGE. If the "seqid" value is outside of that range,
+ the error NFS4ERR_OLD_STATEID is returned to the client. Upon
+ receipt of NFS4ERR_OLD_STATEID, the client updates the stateid in the
+ layout request based on processing of other layout requests and
+ re-sends the operation to the server.
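+
+ A non-normative sketch of this arithmetic in C follows. The helper
+ names mirror the terms above, and the numeric error values are those
+ this document assigns; everything else is illustrative.
+
+   #include <stdint.h>
+
+   enum {
+       NFS4_OK             = 0,
+       NFS4ERR_OLD_STATEID = 10024,
+       NFS4ERR_BAD_STATEID = 10025
+   };
+
+   /* Next "seqid": increments modulo 2^32 but skips zero. */
+   static uint32_t seqid_next(uint32_t s)
+   {
+       return (s == 0xFFFFFFFFu) ? 1u : s + 1u;
+   }
+
+   /* Steps from 'from' forward to 'to' over the 2^32 - 1 valid
+    * values (zero excluded). */
+   static uint32_t seqid_distance(uint32_t from, uint32_t to)
+   {
+       const uint64_t M = 0xFFFFFFFFull;   /* number of values */
+       return (uint32_t)(((uint64_t)to + M - from) % M);
+   }
+
+   /* Validation of a client-supplied layout "seqid".  A value
+    * "ahead" of the server's wraps to a large distance and is
+    * likewise rejected as old. */
+   static int check_layout_seqid(uint32_t supplied,
+                                 uint32_t server_current,
+                                 uint32_t valid_seqid_range)
+   {
+       if (supplied == 0)
+           return NFS4ERR_BAD_STATEID;
+       if (seqid_distance(supplied, server_current) >
+           valid_seqid_range)
+           return NFS4ERR_OLD_STATEID;
+       return NFS4_OK;
+   }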
+
+12.5.5.2.1.5. Bulk Recall and Return
+
+ pNFS supports recalling and returning all layouts that are for files
+ belonging to a particular fsid (LAYOUTRECALL4_FSID,
+ LAYOUTRETURN4_FSID) or client ID (LAYOUTRECALL4_ALL,
+ LAYOUTRETURN4_ALL). There are no "bulk" stateids, so detection of
+ races via the seqid is not possible. The server MUST NOT initiate
+ bulk recall while another recall is in progress, or the corresponding
+ LAYOUTRETURN is in progress or pending. In the event the server
+ sends a bulk recall while the client has a pending or in-progress
+ LAYOUTRETURN, CB_LAYOUTRECALL, or LAYOUTGET, the client returns
+ NFS4ERR_DELAY. In the event the client sends a LAYOUTGET or
+ LAYOUTRETURN while a bulk recall is in progress, the server returns
+ NFS4ERR_RECALLCONFLICT. If the client sends a LAYOUTGET or
+ LAYOUTRETURN after the server receives NFS4ERR_DELAY from a bulk
+ recall, then to ensure forward progress, the server MAY return
+ NFS4ERR_RECALLCONFLICT.
+
+ Once a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL is sent, the server MUST
+ NOT allow the client to use any layout stateid except for
+ LAYOUTCOMMIT operations. Once the client receives a CB_LAYOUTRECALL
+ of LAYOUTRECALL4_ALL, it MUST NOT use any layout stateid except for
+ LAYOUTCOMMIT operations. Once a LAYOUTRETURN of LAYOUTRETURN4_ALL is
+ sent, all layout stateids granted to the client ID are freed. The
+ client MUST NOT use the layout stateids again. It MUST use LAYOUTGET
+ to obtain new layout stateids.
+
+ Once a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID is sent, the server MUST
+ NOT allow the client to use any layout stateid that refers to a file
+ with the specified fsid except for LAYOUTCOMMIT operations. Once the
+ client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID, it MUST NOT
+ use any layout stateid that refers to a file with the specified fsid
+ except for LAYOUTCOMMIT operations. Once a LAYOUTRETURN of
+ LAYOUTRETURN4_FSID is sent, all layout stateids granted to the
+ referenced fsid are freed. The client MUST NOT use those freed
+ layout stateids for files with the referenced fsid again.
+ Subsequently, for any file with the referenced fsid, to use a layout,
+ the client MUST first send a LAYOUTGET operation in order to obtain a
+ new layout stateid for that file.
+
+ If the server has sent a bulk CB_LAYOUTRECALL and receives a
+ LAYOUTGET, or a LAYOUTRETURN with a stateid, the server MUST return
+ NFS4ERR_RECALLCONFLICT. If the server has sent a bulk
+ CB_LAYOUTRECALL and receives a LAYOUTRETURN with an lr_returntype
+ that is not equal to the lor_recalltype of the CB_LAYOUTRECALL, the
+ server MUST return NFS4ERR_RECALLCONFLICT.
+
+12.5.6. Revoking Layouts
+
+ Parallel NFS permits servers to revoke layouts from clients that fail
+ to respond to recalls and/or fail to renew their lease in time.
+ Depending on the layout type, the server might revoke the layout and
+ might take certain actions with respect to the client's I/O to data
+ servers.
+
+12.5.7. Metadata Server Write Propagation
+
+ Asynchronous writes written through the metadata server may be
+ propagated lazily to the storage devices. For data written
+ asynchronously through the metadata server, a client performing a
+ read at the appropriate storage device is not guaranteed to see the
+ newly written data until a COMMIT occurs at the metadata server.
+ While the write is pending, reads to the storage device may give out
+ either the old data, the new data, or a mixture of new and old.
Upon
+ completion of a synchronous WRITE or COMMIT (for asynchronously
+ written data), the metadata server MUST ensure that storage devices
+ give out the new data and that the data has been written to stable
+ storage. If the server implements its storage in any way such that
+ it cannot obey these constraints, then it MUST recall the layouts to
+ prevent reads from being done that cannot be handled correctly. Note
+ that the layouts MUST be recalled prior to the server responding to
+ the associated WRITE operations.
+
+12.6. pNFS Mechanics
+
+ This section describes the flow of operations from a pNFS client to
+ a metadata server and storage device.
+
+ When a pNFS client encounters a new FSID, it sends a GETATTR to the
+ NFSv4.1 server for the fs_layout_type (Section 5.12.1) attribute. If
+ the attribute contains at least one layout type, and the layout types
+ returned are among the set supported by the client, the client knows
+ that pNFS is a possibility for the file system. If, from the server
+ that returned the new FSID, the client does not have a client ID that
+ came from an EXCHANGE_ID result that returned
+ EXCHGID4_FLAG_USE_PNFS_MDS, it MUST send an EXCHANGE_ID to the server
+ with the EXCHGID4_FLAG_USE_PNFS_MDS bit set. If the server's
+ response does not have EXCHGID4_FLAG_USE_PNFS_MDS set, then, contrary
+ to what the fs_layout_type attribute said, the server does not
+ support pNFS, and the client will not be able to use pNFS to that
+ server; in this case, the server MUST return NFS4ERR_NOTSUPP in
+ response to any pNFS operation.
+
+ The client then creates a session, requesting a persistent session,
+ so that exclusive creates can be done with a single round trip via
+ the createmode4 of GUARDED4. If the session ends up not being
+ persistent, the client will use EXCLUSIVE4_1 for exclusive creates.
+
+ If a file is to be created on a pNFS-enabled file system, the client
+ uses the OPEN operation. Among the normal set of attributes that may
+ be provided upon an OPEN used for creation is the OPTIONAL
+ layout_hint attribute. The client's use of layout_hint allows the
+ client to express its preference for a layout type and its associated
+ layout details. The use of a createmode4 of UNCHECKED4, GUARDED4, or
+ EXCLUSIVE4_1 will allow the client to provide the layout_hint
+ attribute at create time. The client MUST NOT use EXCLUSIVE4 (see
+ Table 18). It is RECOMMENDED that the client combine the OPEN with a
+ GETATTR operation in the same COMPOUND. The GETATTR may then
+ retrieve the layout_type attribute for the newly created file. The
+ client will then know what layout type the server has chosen for the
+ file and therefore what storage protocol the client must use.
+
+ If the client wants to open an existing file, then it also includes a
+ GETATTR to determine what layout type the file supports.
+
+ The GETATTR in either the file creation or plain file open case can
+ also include the layout_blksize and layout_alignment attributes so
+ that the client can determine optimal offsets and lengths for I/O on
+ the file.
+
+ Assuming the client supports the layout type returned by GETATTR and
+ it chooses to use pNFS for data access, it then sends LAYOUTGET using
+ the filehandle and stateid returned by OPEN, specifying the range it
+ wants to do I/O on. The response is a layout, which may be a subset
+ of the range for which the client asked. It also includes device IDs
+ and a description of how data is organized (or, in the case of
+ writing, how data is to be organized) across the devices. The device
+ IDs and data description are encoded in a format that is specific to
+ the layout type but that the client is expected to understand.
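+
+ The setup steps above can be summarized by the following
+ non-normative C sketch. The nfs_client structure and the helper
+ functions (getattr_fs_layout_type, client_supports_layout_types,
+ exchange_id, and layoutget) are hypothetical stand-ins for a client
+ implementation's internals; <stdint.h> is assumed, and error
+ handling is omitted for brevity.
+
+    /* Probe a newly seen FSID for pNFS and fetch an initial layout. */
+    int pnfs_setup(struct nfs_client *clp, struct nfs_fsid *fsid,
+                   nfs_fh4 *fh, stateid4 *open_stateid,
+                   struct layout **lop)
+    {
+        /* 1. GETATTR for fs_layout_type on the new FSID. */
+        uint32_t types = getattr_fs_layout_type(clp, fsid);
+        if (types == 0 || !client_supports_layout_types(types))
+            return 0;          /* no pNFS for this file system */
+
+        /* 2. The client ID must come from an EXCHANGE_ID that
+         *    returned the metadata server role. */
+        if (!(clp->eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)) {
+            clp->eir_flags =
+                exchange_id(clp, EXCHGID4_FLAG_USE_PNFS_MDS);
+            if (!(clp->eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS))
+                return 0;      /* server is not an MDS after all */
+        }
+
+        /* 3. LAYOUTGET using the filehandle and stateid from OPEN,
+         *    for the range on which I/O is planned. */
+        *lop = layoutget(clp, fh, open_stateid, 0, UINT64_MAX);
+        return *lop != NULL;
+    }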
+
+ When the client wants to send an I/O, it determines to which device
+ ID it needs to send the I/O command by examining the data description
+ in the layout. It then sends a GETDEVICEINFO to find the device
+ address(es) of the device ID. The client then sends the I/O request
+ to one of the device ID's device addresses, using the storage
+ protocol defined for the layout type. Note that if a client has
+ multiple I/Os to send, these I/O requests may be done in parallel.
+
+ If the I/O was a WRITE, then at some point the client may want to use
+ LAYOUTCOMMIT to commit the modification time and the new size of the
+ file (if it believes it extended the file size) to the metadata
+ server and the modified data to the file system.
+
+12.7. Recovery
+
+ Recovery is complicated by the distributed nature of the pNFS
+ protocol. In general, crash recovery for layouts is similar to crash
+ recovery for delegations in the base NFSv4.1 protocol. However, the
+ client's ability to perform I/O without contacting the metadata
+ server introduces subtleties that must be handled correctly if the
+ possibility of file system corruption is to be avoided.
+
+12.7.1. Recovery from Client Restart
+
+ Client recovery for layouts is similar to client recovery for other
+ lock and delegation state. When a pNFS client restarts, it will lose
+ all information about the layouts that it previously owned. There
+ are two methods by which the server can reclaim these resources and
+ allow otherwise conflicting layouts to be provided to other clients.
+
+ The first is through the expiry of the client's lease. If the client
+ recovery time is longer than the lease period, the client's lease
+ will expire and the server will know that state may be released. For
+ layouts, the server may release the state immediately upon lease
+ expiry, or it may allow the layout to persist, awaiting possible
+ lease revival, as long as it does not conflict with another layout.
+
+ The second is through the client restarting in less time than it
+ takes for the lease period to expire. In such a case, the client
+ will contact the server through the standard EXCHANGE_ID protocol.
+ The server will find that the client's co_ownerid matches the
+ co_ownerid of the previous client invocation, but that the verifier
+ is different. The server uses this as a signal to release all layout
+ state associated with the client's previous invocation. In this
+ scenario, the data written by the client but not covered by a
+ successful LAYOUTCOMMIT is in an undefined state; it may have been
+ written or it may now be lost. This is acceptable behavior, and it
+ is the client's responsibility to use LAYOUTCOMMIT to achieve the
+ desired level of stability.
+
+12.7.2. Dealing with Lease Expiration on the Client
+
+ If a client believes its lease has expired, it MUST NOT send I/O to
+ the storage device until it has validated its lease. The client can
+ send a SEQUENCE operation to the metadata server. If the SEQUENCE
+ operation is successful, but sr_status_flags has
+ SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
+ SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or
+ SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use
+ currently held layouts. The client has two choices to recover from
+ the lease expiration. First, for all modified but uncommitted data,
+ the client writes it to the metadata server, using either WRITEs with
+ the FILE_SYNC4 stable flag or WRITE followed by COMMIT. Second, the
+ client re-establishes a client ID and session with the server and
+ obtains new layouts and device-ID-to-device-address mappings for the
+ modified data ranges and then writes the data to the storage devices
+ with the newly obtained layouts.
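+
+ The "state revoked" test above amounts to a simple bitwise check of
+ sr_status_flags. The following non-normative C sketch illustrates
+ it; the flag values are taken from the protocol's XDR definition.
+
+    #define SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED   0x00000008
+    #define SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED  0x00000010
+    #define SEQ4_STATUS_ADMIN_STATE_REVOKED         0x00000020
+
+    /* Returns nonzero if currently held layouts may still be used
+     * after a successful SEQUENCE reply. */
+    static int layouts_still_usable(uint32_t sr_status_flags)
+    {
+        uint32_t revoked = SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED |
+                           SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED |
+                           SEQ4_STATUS_ADMIN_STATE_REVOKED;
+
+        return (sr_status_flags & revoked) == 0;
+    }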
+
+ If sr_status_flags from the metadata server has
+ SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns
+ NFS4ERR_BAD_SESSION and CREATE_SESSION returns
+ NFS4ERR_STALE_CLIENTID), then the metadata server has restarted, and
+ the client SHOULD recover using the methods described in
+ Section 12.7.4.
+
+ If sr_status_flags from the metadata server has
+ SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following
+ the procedure described in Section 11.11.9.2. After that, the client
+ may get an indication that the layout state was not moved with the
+ file system. The client recovers as in the other applicable
+ situations discussed in the first two paragraphs of this section.
+
+ If sr_status_flags reports no loss of state, then the lease for the
+ layouts that the client holds is valid and has been renewed, and the
+ client can once again send I/O requests to the storage devices.
+
+ While clients SHOULD NOT send I/Os to storage devices that may extend
+ past the lease expiration time period, this is not always possible;
+ consider, for example, an extended network partition that starts
+ after the I/O is sent and does not heal until after the I/O request
+ is received by the storage device. Thus, the metadata server and/or
+ storage devices are responsible for protecting themselves from I/Os
+ that are both sent before the lease expires and arrive after the
+ lease expires. See Section 12.7.3.
+
+12.7.3. Dealing with Loss of Layout State on the Metadata Server
+
+ This is a description of the case where all of the following are
+ true:
+
+ * the metadata server has not restarted
+
+ * a pNFS client's layouts have been discarded (usually because the
+ client's lease expired) and are invalid
+
+ * an I/O from the pNFS client arrives at the storage device
+
+ The metadata server and its storage devices MUST solve this by
+ fencing the client. In other words, they MUST solve this by
+ preventing the execution of I/O operations from the client to the
+ storage devices after layout state loss. The details of how fencing
+ is done are specific to the layout type. The solution for NFSv4.1
+ file-based layouts is described in Section 13.11, and solutions for
+ other layout types are in their respective external specification
+ documents.
+
+12.7.4. Recovery from Metadata Server Restart
+
+ The pNFS client will discover that the metadata server has restarted
+ via the methods described in Section 8.4.2 and discussed in a
+ pNFS-specific context in Section 12.7.2, Paragraph 2. The client
+ MUST stop using layouts and delete the device-ID-to-device-address
+ mappings it previously received from the metadata server. Having
+ done that, if the client wrote data to the storage device without
+ committing the layouts via LAYOUTCOMMIT, then the client has
+ additional work to do in order to have the client, metadata server,
+ and storage device(s) all synchronized on the state of the data.
+
+ * If the client has data still modified and unwritten in the
+ client's memory, the client has only two choices.
+
+ 1.
The client can obtain a layout via LAYOUTGET after the + server's grace period and write the data to the storage + devices. + + 2. The client can WRITE that data through the metadata server + using the WRITE (Section 18.32) operation, and then obtain + layouts as desired. + + * If the client asynchronously wrote data to the storage device, but + still has a copy of the data in its memory, then it has available + to it the recovery options listed above in the previous bullet + point. If the metadata server is also in its grace period, the + client has available to it the options below in the next bullet + point. + + * The client does not have a copy of the data in its memory and the + metadata server is still in its grace period. The client cannot + use LAYOUTGET (within or outside the grace period) to reclaim a + layout because the contents of the response from LAYOUTGET may not + match what it had previously. The range might be different or the + client might get the same range but the content of the layout + might be different. Even if the content of the layout appears to + be the same, the device IDs may map to different device addresses, + and even if the device addresses are the same, the device + addresses could have been assigned to a different storage device. + The option of retrieving the data from the storage device and + writing it to the metadata server per the recovery scenario + described above is not available because, again, the mappings of + range to device ID, device ID to device address, and device + address to physical device are stale, and new mappings via new + LAYOUTGET do not solve the problem. + + The only recovery option for this scenario is to send a + LAYOUTCOMMIT in reclaim mode, which the metadata server will + accept as long as it is in its grace period. The use of + LAYOUTCOMMIT in reclaim mode informs the metadata server that the + layout has changed. It is critical that the metadata server + receive this information before its grace period ends, and thus + before it starts allowing updates to the file system. + + To send LAYOUTCOMMIT in reclaim mode, the client sets the + loca_reclaim field of the operation's arguments (Section 18.42.1) + to TRUE. During the metadata server's recovery grace period (and + only during the recovery grace period) the metadata server is + prepared to accept LAYOUTCOMMIT requests with the loca_reclaim + field set to TRUE. + + When loca_reclaim is TRUE, the client is attempting to commit + changes to the layout that occurred prior to the restart of the + metadata server. The metadata server applies some consistency + checks on the loca_layoutupdate field of the arguments to + determine whether the client can commit the data written to the + storage device to the file system. The loca_layoutupdate field is + of data type layoutupdate4 and contains layout-type-specific + content (in the lou_body field of loca_layoutupdate). The layout- + type-specific information that loca_layoutupdate might have is + discussed in Section 12.5.4.3. If the metadata server's + consistency checks on loca_layoutupdate succeed, then the metadata + server MUST commit the data (as described by the loca_offset, + loca_length, and loca_layoutupdate fields of the arguments) that + was written to the storage device. If the metadata server's + consistency checks on loca_layoutupdate fail, the metadata server + rejects the LAYOUTCOMMIT operation and makes no changes to the + file system. 
However, any time LAYOUTCOMMIT with loca_reclaim
+ TRUE fails, the pNFS client has lost all the data in the range
+ defined by <loca_offset, loca_length>. A client can defend
+ against this risk by caching all data, whether written
+ synchronously or asynchronously, in its memory, and by not
+ releasing the cached data until a successful LAYOUTCOMMIT. This
+ condition does not hold true for all layout types; for example,
+ file-based storage devices need not suffer from this limitation.
+
+ * The client does not have a copy of the data in its memory and the
+ metadata server is no longer in its grace period; i.e., the
+ metadata server returns NFS4ERR_NO_GRACE. As with the scenario in
+ the above bullet point, the failure of LAYOUTCOMMIT means the data
+ in the range <loca_offset, loca_length> is lost. The defense
+ against the risk is the same -- cache all written data on the
+ client until a successful LAYOUTCOMMIT.
+
+12.7.5. Operations during Metadata Server Grace Period
+
+ Some of the recovery scenarios described thus far note that some
+ operations (namely, WRITE and LAYOUTGET) might be permitted during
+ the metadata server's grace period. The metadata server may allow
+ these operations during its grace period. For LAYOUTGET, the
+ metadata server must reliably determine that servicing such a request
+ will not conflict with an impending LAYOUTCOMMIT reclaim request.
+ For WRITE, the metadata server must reliably determine that servicing
+ the request will not conflict with an impending OPEN or with a LOCK
+ where the file has mandatory byte-range locking enabled.
+
+ As mentioned previously, for expediency, the metadata server might
+ reject some operations (namely, WRITE and LAYOUTGET) during its grace
+ period, because the simplest correct approach is to reject all
+ non-reclaim pNFS requests and WRITE operations by returning the
+ NFS4ERR_GRACE error. However, depending on the storage protocol
+ (which is specific to the layout type) and metadata server
+ implementation, the metadata server may be able to determine that a
+ particular request is safe. For example, a metadata server may save
+ provisional allocation mappings for each file to stable storage, as
+ well as information about potentially conflicting OPEN share modes
+ and mandatory byte-range locks that might have been in effect at the
+ time of restart, and the metadata server may use this information
+ during the recovery grace period to determine that a WRITE request is
+ safe.
+
+12.7.6. Storage Device Recovery
+
+ Recovery from storage device restart is mostly dependent upon the
+ layout type in use. However, there are a few general techniques a
+ client can use if it discovers a storage device has crashed while the
+ client holds modified, uncommitted data that was asynchronously
+ written. First and foremost, it is important to realize that the
+ client is the only one that has the information necessary to recover
+ non-committed data, since it holds the modified data and probably
+ nothing else does. Second, the best solution is for the client to
+ err on the side of caution and attempt to rewrite the modified data
+ through another path.
+
+ The client SHOULD immediately WRITE the data to the metadata server,
+ with the stable field in the WRITE4args set to FILE_SYNC4. Once it
+ does this, there is no need to wait for the original storage device.
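+
+ As a non-normative illustration, the rewrite-through-the-MDS path
+ might look like the following C sketch; the pnfs_file and
+ dirty_range structures and the nfs_write helper are hypothetical.
+
+    /* Push every modified, uncommitted range through the metadata
+     * server with stable == FILE_SYNC4. After this, neither a
+     * COMMIT nor the failed storage device is needed. */
+    void rewrite_through_mds(struct pnfs_file *f)
+    {
+        struct dirty_range *r;
+
+        for (r = f->uncommitted_ranges; r != NULL; r = r->next)
+            nfs_write(f->mds_session, &f->mds_fh, r->offset,
+                      r->length, r->data, FILE_SYNC4);
+    }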
+
+12.8. Metadata and Storage Device Roles
+
+ If the same physical hardware is used to implement both a metadata
+ server and storage device, then the same hardware entity is to be
+ understood as implementing two distinct roles, and it is important
+ that it always be clear on behalf of which role the hardware is
+ executing at any given time.
+
+ Two sub-cases can be distinguished.
+
+ 1. The storage device uses NFSv4.1 as the storage protocol, i.e.,
+ the same physical hardware is used to implement both a metadata
+ and data server. See Section 13.1 for a description of how
+ multiple roles are handled.
+
+ 2. The storage device does not use NFSv4.1 as the storage protocol,
+ and the same physical hardware is used to implement both a
+ metadata and storage device. Whether distinct network addresses
+ are used to access the metadata server and storage device is
+ immaterial. This is because it is always clear to the pNFS
+ client and server, from the upper-layer protocol being used
+ (NFSv4.1 or non-NFSv4.1), to which role the request to the common
+ server network address is directed.
+
+12.9. Security Considerations for pNFS
+
+ pNFS separates file system metadata and data and provides access to
+ both. There are pNFS-specific operations (listed in Section 12.3)
+ that provide access to the metadata; all existing NFSv4.1
+ conventional (non-pNFS) security mechanisms and features apply to
+ accessing the metadata. The combination of components in a pNFS
+ system (see Figure 1) is required to preserve the security properties
+ of NFSv4.1 with respect to an entity that is accessing a storage
+ device from a client, including security countermeasures to defend
+ against threats for which NFSv4.1 provides defenses in environments
+ where these threats are considered significant.
+
+ In some cases, the security countermeasures for connections to
+ storage devices may take the form of physical isolation or a
+ recommendation to avoid the use of pNFS in an environment. For
+ example, it may be impractical to provide confidentiality protection
+ for some storage protocols to protect against eavesdropping. In
+ environments where eavesdropping on such protocols is of sufficient
+ concern to require countermeasures, physical isolation of the
+ communication channel (e.g., via direct connection from client(s) to
+ storage device(s)) and/or a decision to forgo use of pNFS (e.g., and
+ fall back to conventional NFSv4.1) may be appropriate courses of
+ action.
+
+ Where communication with storage devices is subject to the same
+ threats as client-to-metadata server communication, the protocols
+ used for that communication need to provide security mechanisms at
+ least as strong as those available via RPCSEC_GSS for NFSv4.1.
+ Except for the storage protocol used for the LAYOUT4_NFSV4_1_FILES
+ layout (see Section 13), i.e., except for NFSv4.1, it is beyond the
+ scope of this document to specify the security mechanisms for storage
+ access protocols.
+
+ pNFS implementations MUST NOT remove NFSv4.1's access controls. The
+ combination of clients, storage devices, and the metadata server is
+ responsible for ensuring that all client-to-storage-device file data
+ access respects NFSv4.1's ACLs and file open modes. This entails
+ performing both of these checks on every access in the client, the
+ storage device, or both (as applicable; when the storage device is an
+ NFSv4.1 server, the storage device is ultimately responsible for
+ controlling access as described in Section 13.9.2).
If a pNFS
+ configuration performs these checks only in the client, the risk of a
+ misbehaving client obtaining unauthorized access is an important
+ consideration in determining when it is appropriate to use such a
+ pNFS configuration. Such configurations SHOULD NOT be used when
+ client-only access checks do not provide sufficient assurance that
+ NFSv4.1 access control is being applied correctly. (This is not a
+ problem for the file layout type described in Section 13 because the
+ storage access protocol for LAYOUT4_NFSV4_1_FILES is NFSv4.1, and
+ thus the security model for storage device access via
+ LAYOUT4_NFSV4_1_FILES is the same as that of the metadata server.)
+ For handling of access control specific to a layout, the reader
+ should examine the layout specification, such as the NFSv4.1/
+ file-based layout (Section 13) of this document, the blocks layout
+ [48], and objects layout [47].
+
+13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type
+
+ This section describes the semantics and format of NFSv4.1 file-based
+ layouts for pNFS. NFSv4.1 file-based layouts use the
+ LAYOUT4_NFSV4_1_FILES layout type. The LAYOUT4_NFSV4_1_FILES type
+ defines the striping of data across multiple NFSv4.1 data servers.
+
+13.1. Client ID and Session Considerations
+
+ Sessions are a REQUIRED feature of NFSv4.1, and this extends to both
+ the metadata server and file-based (NFSv4.1-based) data servers.
+
+ The role a server plays in pNFS is determined by the result it
+ returns from EXCHANGE_ID. The roles are:
+
+ * Metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result
+ eir_flags).
+
+ * Data server (EXCHGID4_FLAG_USE_PNFS_DS).
+
+ * Non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS). This is an
+ NFSv4.1 server that does not support operations (e.g., LAYOUTGET)
+ or attributes that pertain to pNFS.
+
+ The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS,
+ EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though
+ some combinations (e.g., EXCHGID4_FLAG_USE_NON_PNFS |
+ EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory. However, the server
+ MUST only return the following acceptable combinations:
+
+    +========================================================+
+    |          Acceptable Results from EXCHANGE_ID           |
+    +========================================================+
+    | EXCHGID4_FLAG_USE_PNFS_MDS                             |
+    +--------------------------------------------------------+
+    | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS |
+    +--------------------------------------------------------+
+    | EXCHGID4_FLAG_USE_PNFS_DS                              |
+    +--------------------------------------------------------+
+    | EXCHGID4_FLAG_USE_NON_PNFS                             |
+    +--------------------------------------------------------+
+    | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS |
+    +--------------------------------------------------------+
+
+    Table 8
+
+ As the above table implies, a server can have one or two roles. A
+ server can be both a metadata server and a data server, or it can be
+ both a data server and non-metadata server. In addition to returning
+ two roles in the EXCHANGE_ID's results, and thus serving both roles
+ via a common client ID, a server can serve two roles by returning a
+ unique client ID and server owner for each role in each of two
+ EXCHANGE_ID results, with each result indicating one of the roles.
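+
+ The Table 8 check can be expressed compactly in C. The following
+ non-normative sketch uses the flag values from the protocol's XDR
+ definition, including the EXCHGID4_FLAG_MASK_PNFS mask covering the
+ three role bits.
+
+    #define EXCHGID4_FLAG_USE_NON_PNFS   0x00010000
+    #define EXCHGID4_FLAG_USE_PNFS_MDS   0x00020000
+    #define EXCHGID4_FLAG_USE_PNFS_DS    0x00040000
+    #define EXCHGID4_FLAG_MASK_PNFS      0x00070000
+
+    /* Returns nonzero if eir_flags carries one of the five
+     * acceptable role combinations from Table 8. */
+    static int valid_pnfs_roles(uint32_t eir_flags)
+    {
+        switch (eir_flags & EXCHGID4_FLAG_MASK_PNFS) {
+        case EXCHGID4_FLAG_USE_PNFS_MDS:
+        case EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS:
+        case EXCHGID4_FLAG_USE_PNFS_DS:
+        case EXCHGID4_FLAG_USE_NON_PNFS:
+        case EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS:
+            return 1;
+        default:
+            return 0;
+        }
+    }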
+ + In the case of a server with concurrent pNFS roles that are served by + a common client ID, if the EXCHANGE_ID request from the client has + zero or a combination of the bits set in eia_flags, the server result + should set bits that represent the higher of the acceptable + combination of the server roles, with a preference to match the roles + requested by the client. Thus, if a client request has + (EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | + EXCHGID4_FLAG_USE_PNFS_DS) flags set, and the server is both a + metadata server and a data server, serving both the roles by a common + client ID, the server SHOULD return with + (EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS) set. + + In the case of a server that has multiple concurrent pNFS roles, each + role served by a unique client ID, if the client specifies zero or a + combination of roles in the request, the server results SHOULD return + only one of the roles from the combination specified by the client + request. If the role specified by the server result does not match + the intended use by the client, the client should send the + EXCHANGE_ID specifying just the interested pNFS role. + + If a pNFS metadata client gets a layout that refers it to an NFSv4.1 + data server, it needs a client ID on that data server. If it does + not yet have a client ID from the server that had the + EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then + the client needs to send an EXCHANGE_ID to the data server, using the + same co_ownerid as it sent to the metadata server, with the + EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments. If the server's + EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the + client may use the client ID to create sessions that will exchange + pNFS data operations. The client ID returned by the data server has + no relationship with the client ID returned by a metadata server + unless the client IDs are equal, and the server owners and server + scopes of the data server and metadata server are equal. + + In NFSv4.1, the session ID in the SEQUENCE operation implies the + client ID, which in turn might be used by the server to map the + stateid to the right client/server pair. However, when a data server + is presented with a READ or WRITE operation with a stateid, because + the stateid is associated with a client ID on a metadata server, and + because the session ID in the preceding SEQUENCE operation is tied to + the client ID of the data server, the data server has no obvious way + to determine the metadata server from the COMPOUND procedure, and + thus has no way to validate the stateid. One RECOMMENDED approach is + for pNFS servers to encode metadata server routing and/or identity + information in the data server filehandles as returned in the layout. + + If metadata server routing and/or identity information is encoded in + data server filehandles, when the metadata server identity or + location changes, the data server filehandles it gave out will become + invalid (stale), and so the metadata server MUST first recall the + layouts. Invalidating a data server filehandle does not render the + NFS client's data cache invalid. The client's cache should map a + data server filehandle to a metadata server filehandle, and a + metadata server filehandle to cached data. 
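+
+ The two-level mapping described above can be pictured as the
+ following illustrative (and entirely hypothetical) C structures: a
+ layout recall invalidates entries of the first mapping, while the
+ cached data, keyed on the metadata server filehandle, survives.
+
+    struct ds_fh_mapping {
+        nfs_fh4 ds_fh;                /* filehandle from the layout */
+        nfs_fh4 mds_fh;               /* filehandle from OPEN       */
+    };
+
+    struct cached_file {
+        nfs_fh4             mds_fh;   /* cache key: the MDS fh      */
+        struct cached_page *pages;    /* the cached file data       */
+    };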
+
+ If a server is both a metadata server and a data server, the server
+ might need to distinguish operations on files that are directed to
+ the metadata server from those that are directed to the data server.
+ It is RECOMMENDED that the values of the filehandles returned by the
+ LAYOUTGET operation be different from the value of the filehandle
+ returned by the OPEN of the same file.
+
+ Another scenario is for the metadata server and the storage device to
+ be distinct from one client's point of view, and the roles reversed
+ from another client's point of view. For example, in the cluster
+ file system model, a metadata server to one client might be a data
+ server to another client. If NFSv4.1 is being used as the storage
+ protocol, then pNFS servers need to encode the values of filehandles
+ according to their specific roles.
+
+13.1.1. Sessions Considerations for Data Servers
+
+ Section 2.10.11.2 states that a client has to keep its lease renewed
+ in order to prevent a session from being deleted by the server. If
+ the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role
+ set, then (as noted in Section 13.6) the client will not be able to
+ determine the data server's lease_time attribute because GETATTR will
+ not be permitted. Instead, the rule is that any time a client
+ receives a layout referring it to a data server that returns just the
+ EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the
+ lease_time attribute from the metadata server that returned the
+ layout applies to the data server. Thus, the data server MUST be
+ aware of the values of all lease_time attributes of all metadata
+ servers for which it is providing I/O, and it MUST use the maximum of
+ all such lease_time values as the lease interval for all client IDs
+ and sessions established on it.
+
+ For example, if one metadata server has a lease_time attribute of 20
+ seconds, and a second metadata server has a lease_time attribute of
+ 10 seconds, then if both servers return layouts that refer to an
+ EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST
+ renew a client's lease if the interval between two SEQUENCE
+ operations on different COMPOUND requests is less than 20 seconds.
+
+13.2. File Layout Definitions
+
+ The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout
+ type and may be applicable to other layout types.
+
+ Unit. A unit is a fixed-size quantity of data written to a data
+ server.
+
+ Pattern. A pattern is a method of distributing one or more
+ equal-sized units across a set of data servers. A pattern is
+ iterated one or more times.
+
+ Stripe. A stripe is a set of data distributed across a set of data
+ servers in a pattern before that pattern repeats.
+
+ Stripe Count. A stripe count is the number of units in a pattern.
+
+ Stripe Width. A stripe width is the size of a stripe in bytes. The
+ stripe width = the stripe count * the size of the stripe unit.
+
+ Hereafter, this document will refer to a unit that is written in a
+ pattern as a "stripe unit".
+
+ A pattern may have more stripe units than data servers. If so, some
+ data servers will have more than one stripe unit per stripe. A data
+ server that has multiple stripe units per stripe MAY store each unit
+ in a different data file (and depending on the implementation, will
+ possibly assign a unique data filehandle to each data file).
+
+13.3.
File Layout Data Types + + The high level NFSv4.1 layout types are nfsv4_1_file_layouthint4, + nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4. + + The SETATTR operation supports a layout hint attribute + (Section 5.12.4). When the client sets a layout hint (data type + layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the + loh_type field), the loh_body field contains a value of data type + nfsv4_1_file_layouthint4. + + const NFL4_UFLG_MASK = 0x0000003F; + const NFL4_UFLG_DENSE = 0x00000001; + const NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002; + const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK + = 0xFFFFFFC0; + + typedef uint32_t nfl_util4; + + enum filelayout_hint_care4 { + NFLH4_CARE_DENSE = NFL4_UFLG_DENSE, + + NFLH4_CARE_COMMIT_THRU_MDS + = NFL4_UFLG_COMMIT_THRU_MDS, + + NFLH4_CARE_STRIPE_UNIT_SIZE + = 0x00000040, + + NFLH4_CARE_STRIPE_COUNT = 0x00000080 + }; + + /* Encoded in the loh_body field of data type layouthint4: */ + + struct nfsv4_1_file_layouthint4 { + uint32_t nflh_care; + nfl_util4 nflh_util; + count4 nflh_stripe_count; + }; + + The generic layout hint structure is described in Section 3.3.19. + The client uses the layout hint in the layout_hint (Section 5.12.4) + attribute to indicate the preferred type of layout to be used for a + newly created file. The LAYOUT4_NFSV4_1_FILES layout-type-specific + content for the layout hint is composed of three fields. The first + field, nflh_care, is a set of flags indicating which values of the + hint the client cares about. If the NFLH4_CARE_DENSE flag is set, + then the client indicates in the second field, nflh_util, a + preference for how the data file is packed (Section 13.4.4), which is + controlled by the value of the expression nflh_util & NFL4_UFLG_DENSE + ("&" represents the bitwise AND operator). If the + NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a + preference for whether the client should send COMMIT operations to + the metadata server or data server (Section 13.7), which is + controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. If + the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its + preferred stripe unit size, which is indicated in nflh_util & + NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus, the stripe unit size MUST be a + multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If + the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the + third field, nflh_stripe_count, the stripe count. The stripe count + multiplied by the stripe unit size is the stripe width. + + When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in + the loc_type field of the lo_content field), the loc_body field of + the lo_content field contains a value of data type + nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has + a storage device ID (field nfl_deviceid) of data type deviceid4. The + GETDEVICEINFO operation maps a device ID to a storage device address + (type device_addr4). When GETDEVICEINFO returns a device address + with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type + field), the da_addr_body field contains a value of data type + nfsv4_1_file_layout_ds_addr4. + + typedef netaddr4 multipath_list4<>; + + /* + * Encoded in the da_addr_body field of + * data type device_addr4: + */ + struct nfsv4_1_file_layout_ds_addr4 { + uint32_t nflda_stripe_indices<>; + multipath_list4 nflda_multipath_ds_list<>; + }; + + The nfsv4_1_file_layout_ds_addr4 data type represents the device + address. It is composed of two fields: + + 1. 
nflda_multipath_ds_list: An array of lists of data servers, where
+ each list contains one or more elements, and each element
+ represents a data server address that may serve equally as the
+ target of I/O operations (see Section 13.5). The length of this
+ array might be different from the stripe count.
+
+ 2. nflda_stripe_indices: An array of indices used to index into
+ nflda_multipath_ds_list. The value of each element of
+ nflda_stripe_indices MUST be less than the number of elements in
+ nflda_multipath_ds_list. Each element of nflda_multipath_ds_list
+ SHOULD be referred to by one or more elements of
+ nflda_stripe_indices. The number of elements in
+ nflda_stripe_indices is always equal to the stripe count.
+
+    /*
+     * Encoded in the loc_body field of
+     * data type layout_content4:
+     */
+    struct nfsv4_1_file_layout4 {
+        deviceid4 nfl_deviceid;
+        nfl_util4 nfl_util;
+        uint32_t nfl_first_stripe_index;
+        offset4 nfl_pattern_offset;
+        nfs_fh4 nfl_fh_list<>;
+    };
+
+ The nfsv4_1_file_layout4 data type represents the layout. It is
+ composed of the following fields:
+
+ 1. nfl_deviceid: The device ID that maps to a value of type
+ nfsv4_1_file_layout_ds_addr4.
+
+ 2. nfl_util: Like the nflh_util field of data type
+ nfsv4_1_file_layouthint4, a compact representation of how the
+ data on a file on each data server is packed, whether the client
+ should send COMMIT operations to the metadata server or data
+ server, and the stripe unit size. If a server returns two or
+ more overlapping layouts, each stripe unit size in each
+ overlapping layout MUST be the same.
+
+ 3. nfl_first_stripe_index: The index into the first element of the
+ nflda_stripe_indices array to use.
+
+ 4. nfl_pattern_offset: This field is the logical offset into the
+ file where the striping pattern starts. It is required for
+ converting the client's logical I/O offset (e.g., the current
+ offset in a POSIX file descriptor before the read() or write()
+ system call is sent) into the stripe unit number (see
+ Section 13.4.1).
+
+ If dense packing is used, then nfl_pattern_offset is also needed
+ to convert the client's logical I/O offset to an offset on the
+ file on the data server corresponding to the stripe unit number
+ (see Section 13.4.4).
+
+ Note that nfl_pattern_offset is not always the same as lo_offset.
+ For example, via the LAYOUTGET operation, a client might request
+ a layout starting at offset 1000 of a file that has its striping
+ pattern start at offset zero.
+
+ 5. nfl_fh_list: An array of data server filehandles for each list of
+ data servers in each element of the nflda_multipath_ds_list
+ array. The number of elements in nfl_fh_list depends on whether
+ sparse or dense packing is being used.
+
+ * If sparse packing is being used, the number of elements in
+ nfl_fh_list MUST be one of three values:
+
+ - Zero. This means that the filehandle used for each data
+ server is the same as the filehandle returned by the OPEN
+ operation from the metadata server.
+
+ - One. This means that every data server uses the same
+ filehandle: what is specified in nfl_fh_list[0].
+
+ - The same number of elements as in nflda_multipath_ds_list.
+ Thus, in this case, when sending an I/O operation to any
+ data server in nflda_multipath_ds_list[X], the filehandle
+ in nfl_fh_list[X] MUST be used.
+
+ See the discussion on sparse packing in Section 13.4.4.
+
+ * If dense packing is being used, the number of elements in
+ nfl_fh_list MUST be the same as the number of elements in
+ nflda_stripe_indices.
Thus, when sending an I/O operation to
+ any data server in
+ nflda_multipath_ds_list[nflda_stripe_indices[Y]], the
+ filehandle in nfl_fh_list[Y] MUST be used. In addition, any
+ time there exists i and j, (i != j), such that the
+ intersection of
+ nflda_multipath_ds_list[nflda_stripe_indices[i]] and
+ nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty,
+ then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j]. In other
+ words, when dense packing is being used, if a data server
+ appears in two or more units of a striping pattern, each
+ reference to the data server MUST use a different filehandle.
+
+ Indeed, if there are multiple striping patterns, as indicated
+ by the presence of multiple objects of data type layout4
+ (either returned in one or multiple LAYOUTGET operations), and
+ a data server is the target of a unit of one pattern and
+ another unit of another pattern, then each reference to each
+ data server MUST use a different filehandle.
+
+ See the discussion on dense packing in Section 13.4.4.
+
+ The details on the interpretation of the layout are in Section 13.4.
+
+13.4. Interpreting the File Layout
+
+13.4.1. Determining the Stripe Unit Number
+
+ To find the stripe unit number that corresponds to the client's
+ logical file offset, the pattern offset must also be taken into
+ account. The i'th stripe unit (SUi) is:
+
+    relative_offset = file_offset - nfl_pattern_offset;
+    SUi = floor(relative_offset / stripe_unit_size);
+
+13.4.2. Interpreting the File Layout Using Sparse Packing
+
+ When sparse packing is used, the algorithm for determining the
+ filehandle and set of data-server network addresses to write stripe
+ unit i (SUi) to is:
+
+    stripe_count = number of elements in nflda_stripe_indices;
+
+    j = (SUi + nfl_first_stripe_index) % stripe_count;
+
+    idx = nflda_stripe_indices[j];
+
+    fh_count = number of elements in nfl_fh_list;
+    ds_count = number of elements in nflda_multipath_ds_list;
+
+    switch (fh_count) {
+      case ds_count:
+        fh = nfl_fh_list[idx];
+        break;
+
+      case 1:
+        fh = nfl_fh_list[0];
+        break;
+
+      case 0:
+        fh = filehandle returned by OPEN;
+        break;
+
+      default:
+        throw a fatal exception;
+        break;
+    }
+
+    address_list = nflda_multipath_ds_list[idx];
+
+ The client would then select a data server from address_list, and
+ send a READ or WRITE operation using the filehandle specified in fh.
+
+ Consider the following example:
+
+ Suppose we have a device address consisting of seven data servers,
+ arranged in three equivalence (Section 13.5) classes:
+
+    { A, B, C, D }, { E }, { F, G }
+
+ where A through G are network addresses.
+
+ Then
+
+    nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G }
+
+ i.e.,
+
+    nflda_multipath_ds_list[0] = { A, B, C, D }
+
+    nflda_multipath_ds_list[1] = { E }
+
+    nflda_multipath_ds_list[2] = { F, G }
+
+ Suppose the striping index array is:
+
+    nflda_stripe_indices<> = { 2, 0, 1, 0 }
+
+ Now suppose the client gets a layout that has a device ID that maps
+ to the above device address. The initial index contains
+
+    nfl_first_stripe_index = 2,
+
+ and the filehandle list is
+
+    nfl_fh_list = { 0x36, 0x87, 0x67 }.
+
+ If the client wants to write to SU0, the set of valid { network
+ address, filehandle } combinations for SU0 is determined by:
+
+    nfl_first_stripe_index = 2
+
+ So
+
+    idx = nflda_stripe_indices[(0 + 2) % 4]
+
+        = nflda_stripe_indices[2]
+
+        = 1
+
+ So
+
+    nflda_multipath_ds_list[1] = { E }
+
+ and
+
+    nfl_fh_list[1] = { 0x87 }
+
+ The client can thus write SU0 to { 0x87, { E } }.
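+
+ The selection logic above can also be written as a small, runnable,
+ non-normative C function; the trailing comment shows it reproducing
+ the result just derived.
+
+    /* Returns the index into nflda_multipath_ds_list and sets
+     * *fh_index to the index into nfl_fh_list (-1 meaning "use the
+     * filehandle returned by OPEN"). A negative return value
+     * indicates a malformed layout. */
+    int sparse_select(uint32_t sui, uint32_t first_stripe_index,
+                      const uint32_t *stripe_indices,
+                      uint32_t stripe_count, uint32_t fh_count,
+                      uint32_t ds_count, int *fh_index)
+    {
+        uint32_t j = (sui + first_stripe_index) % stripe_count;
+        uint32_t idx = stripe_indices[j];
+
+        if (fh_count == ds_count)
+            *fh_index = (int)idx;
+        else if (fh_count == 1)
+            *fh_index = 0;
+        else if (fh_count == 0)
+            *fh_index = -1;          /* use the OPEN filehandle */
+        else
+            return -1;               /* fatal: malformed layout */
+        return (int)idx;
+    }
+
+    /* For the example: sparse_select(0, 2, (uint32_t[]){2, 0, 1, 0},
+     * 4, 3, 3, &fhi) returns 1 (the { E } class) and sets fhi to 1
+     * (filehandle 0x87), matching the derivation above. */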
+
+ The destinations of the first 13 storage units are:
+
+    +=====+============+==============+
+    | SUi | filehandle | data servers |
+    +=====+============+==============+
+    | 0   | 87         | E            |
+    +-----+------------+--------------+
+    | 1   | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    | 2   | 67         | F,G          |
+    +-----+------------+--------------+
+    | 3   | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    +-----+------------+--------------+
+    | 4   | 87         | E            |
+    +-----+------------+--------------+
+    | 5   | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    | 6   | 67         | F,G          |
+    +-----+------------+--------------+
+    | 7   | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    +-----+------------+--------------+
+    | 8   | 87         | E            |
+    +-----+------------+--------------+
+    | 9   | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    | 10  | 67         | F,G          |
+    +-----+------------+--------------+
+    | 11  | 36         | A,B,C,D      |
+    +-----+------------+--------------+
+    +-----+------------+--------------+
+    | 12  | 87         | E            |
+    +-----+------------+--------------+
+
+    Table 9
+
+13.4.3. Interpreting the File Layout Using Dense Packing
+
+ When dense packing is used, the algorithm for determining the
+ filehandle and set of data server network addresses to write stripe
+ unit i (SUi) to is:
+
+    stripe_count = number of elements in nflda_stripe_indices;
+
+    j = (SUi + nfl_first_stripe_index) % stripe_count;
+
+    idx = nflda_stripe_indices[j];
+
+    fh_count = number of elements in nfl_fh_list;
+    ds_count = number of elements in nflda_multipath_ds_list;
+
+    switch (fh_count) {
+      case stripe_count:
+        fh = nfl_fh_list[j];
+        break;
+
+      default:
+        throw a fatal exception;
+        break;
+    }
+
+    address_list = nflda_multipath_ds_list[idx];
+
+ The client would then select a data server from address_list, and
+ send a READ or WRITE operation using the filehandle specified in fh.
+
+ Consider the following example (which is the same as the sparse
+ packing example, except for the filehandle list):
+
+ Suppose we have a device address consisting of seven data servers,
+ arranged in three equivalence (Section 13.5) classes:
+
+    { A, B, C, D }, { E }, { F, G }
+
+ where A through G are network addresses.
+
+ Then
+
+    nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G }
+
+ i.e.,
+
+    nflda_multipath_ds_list[0] = { A, B, C, D }
+
+    nflda_multipath_ds_list[1] = { E }
+
+    nflda_multipath_ds_list[2] = { F, G }
+
+ Suppose the striping index array is:
+
+    nflda_stripe_indices<> = { 2, 0, 1, 0 }
+
+ Now suppose the client gets a layout that has a device ID that maps
+ to the above device address. The initial index contains
+
+    nfl_first_stripe_index = 2,
+
+ and
+
+    nfl_fh_list = { 0x67, 0x37, 0x87, 0x36 }.
+
+ The interesting examples for dense packing are SU1 and SU3 because
+ each stripe unit refers to the same data server list, yet each stripe
+ unit MUST use a different filehandle. If the client wants to write
+ to SU1, the set of valid { network address, filehandle } combinations
+ for SU1 is determined by:
+
+    nfl_first_stripe_index = 2
+
+ So
+
+    j = (1 + 2) % 4 = 3
+
+    idx = nflda_stripe_indices[j]
+
+        = nflda_stripe_indices[3]
+
+        = 0
+
+ So
+
+    nflda_multipath_ds_list[0] = { A, B, C, D }
+
+ and
+
+    nfl_fh_list[3] = { 0x36 }
+
+ The client can thus write SU1 to { 0x36, { A, B, C, D } }.
+
+ For SU3, j = (3 + 2) % 4 = 1, and nflda_stripe_indices[1] = 0. Then
+ nflda_multipath_ds_list[0] = { A, B, C, D }, and nfl_fh_list[1] =
+ 0x37. The client can thus write SU3 to { 0x37, { A, B, C, D } }.
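+
+ The dense-packing counterpart to the sparse sketch shown earlier
+ differs only in that the filehandle index is j itself rather than
+ stripe_indices[j], which is what forces a data server appearing in
+ several units of the pattern to be addressed through a distinct
+ filehandle for each appearance. This is a non-normative C sketch.
+
+    /* Returns the index into nflda_multipath_ds_list and sets
+     * *fh_index to the index into nfl_fh_list; a negative return
+     * value indicates a malformed layout. */
+    int dense_select(uint32_t sui, uint32_t first_stripe_index,
+                     const uint32_t *stripe_indices,
+                     uint32_t stripe_count, uint32_t fh_count,
+                     int *fh_index)
+    {
+        uint32_t j = (sui + first_stripe_index) % stripe_count;
+
+        if (fh_count != stripe_count)
+            return -1;              /* fatal: malformed layout */
+        *fh_index = (int)j;         /* nfl_fh_list[j], not [idx] */
+        return (int)stripe_indices[j];
+    }
+
+    /* For SU1: j == 3, giving filehandle 0x36 and { A, B, C, D };
+     * for SU3: j == 1, giving filehandle 0x37 and the same list,
+     * as derived above. */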
+ + The destinations of the first 13 storage units are: + + +=====+============+==============+ + | SUi | filehandle | data servers | + +=====+============+==============+ + | 0 | 87 | E | + +-----+------------+--------------+ + | 1 | 36 | A,B,C,D | + +-----+------------+--------------+ + | 2 | 67 | F,G | + +-----+------------+--------------+ + | 3 | 37 | A,B,C,D | + +-----+------------+--------------+ + +-----+------------+--------------+ + | 4 | 87 | E | + +-----+------------+--------------+ + | 5 | 36 | A,B,C,D | + +-----+------------+--------------+ + | 6 | 67 | F,G | + +-----+------------+--------------+ + | 7 | 37 | A,B,C,D | + +-----+------------+--------------+ + +-----+------------+--------------+ + | 8 | 87 | E | + +-----+------------+--------------+ + | 9 | 36 | A,B,C,D | + +-----+------------+--------------+ + | 10 | 67 | F,G | + +-----+------------+--------------+ + | 11 | 37 | A,B,C,D | + +-----+------------+--------------+ + +-----+------------+--------------+ + | 12 | 87 | E | + +-----+------------+--------------+ + + Table 10 + +13.4.4. Sparse and Dense Stripe Unit Packing + + The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util + of the data type nfsv4_1_file_layouthint4 and field nfl_util of data + type nfsv4_1_file_layout_ds_addr4) specifies how the data is packed + within the data file on a data server. It allows for two different + data packings: sparse and dense. The packing type determines the + calculation that will be made to map the client-visible file offset + to the offset within the data file located on the data server. + + If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing + is being used. Hence, the logical offsets of the file as viewed by a + client sending READs and WRITEs directly to the metadata server are + the same offsets each data server uses when storing a stripe unit. + The effect then, for striping patterns consisting of at least two + stripe units, is for each data server file to be sparse or "holey". + So for example, suppose there is a pattern with three stripe units, + the stripe unit size is 4096 bytes, and there are three data servers + in the pattern. Then, the file in data server 1 will have stripe + units 0, 3, 6, 9, ... filled; data server 2's file will have stripe + units 1, 4, 7, 10, ... filled; and data server 3's file will have + stripe units 2, 5, 8, 11, ... filled. The unfilled stripe units of + each file will be holes; hence, the files in each data server are + sparse. + + If sparse packing is being used and a client attempts I/O to one of + the holes, then an error MUST be returned by the data server. Using + the above example, if data server 3 received a READ or WRITE + operation for block 4, the data server would return + NFS4ERR_PNFS_IO_HOLE. Thus, data servers need to understand the + striping pattern in order to support sparse packing. + + If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing + is being used, and the data server files have no holes. Dense + packing might be selected because the data server does not + (efficiently) support holey files or because the data server cannot + recognize read-ahead unless there are no holes. If dense packing is + indicated in the layout, the data files will be packed. Using the + same striping pattern and stripe unit size that were used for the + sparse packing example, the corresponding dense packing example would + have all stripe units of all data files filled as follows: + + * Logical stripe units 0, 3, 6, ... 
of the file would live on stripe
+ units 0, 1, 2, ... of the file of data server 1.
+
+ * Logical stripe units 1, 4, 7, ... of the file would live on stripe
+ units 0, 1, 2, ... of the file of data server 2.
+
+ * Logical stripe units 2, 5, 8, ... of the file would live on stripe
+ units 0, 1, 2, ... of the file of data server 3.
+
+ Because dense packing does not leave holes on the data servers, the
+ pNFS client is allowed to write to any offset of any data file of any
+ data server in the stripe. Thus, the data servers need not know the
+ file's striping pattern.
+
+ The calculation to determine the byte offset within the data file for
+ dense data server layouts is:
+
+    stripe_width = stripe_unit_size * N;
+        where N = number of elements in nflda_stripe_indices.
+
+    relative_offset = file_offset - nfl_pattern_offset;
+
+    data_file_offset = floor(relative_offset / stripe_width)
+                       * stripe_unit_size
+                       + relative_offset % stripe_unit_size
+
+ If dense packing is being used, and a data server appears more than
+ once in a striping pattern, then to distinguish one stripe unit from
+ another, the data server MUST use a different filehandle. Suppose,
+ for example, there are two data servers. Logical stripe units 0, 3,
+ 6 are served by data server 1; logical stripe units 1, 4, 7 are
+ served by data server 2; and logical stripe units 2, 5, 8 are also
+ served by data server 2. Unless data server 2 has two filehandles
+ (each referring to a different data file), then, for example, a write
+ to logical stripe unit 1 overwrites the write to logical stripe unit
+ 2, because both logical stripe units are located in the same stripe
+ unit (0) of data server 2.
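+
+ The offset calculation translates directly into C. The sketch below
+ is non-normative; the worked example in the trailing comment reuses
+ the earlier pattern of three data servers and a 4096-byte stripe
+ unit, with nfl_pattern_offset equal to zero.
+
+    uint64_t dense_data_file_offset(uint64_t file_offset,
+                                    uint64_t nfl_pattern_offset,
+                                    uint64_t stripe_unit_size,
+                                    uint32_t stripe_count /* N */)
+    {
+        uint64_t stripe_width = stripe_unit_size * stripe_count;
+        uint64_t relative_offset = file_offset - nfl_pattern_offset;
+
+        return (relative_offset / stripe_width) * stripe_unit_size
+             + relative_offset % stripe_unit_size;
+    }
+
+    /* Example: a client file offset of 12288 is logical stripe
+     * unit 3, which dense packing stores in stripe unit 1 of data
+     * server 1's file: (12288 / 12288) * 4096 + 12288 % 4096 =
+     * 4096. */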
+
+13.5. Data Server Multipathing
+
+ The NFSv4.1 file layout supports multipathing to multiple data server
+ addresses. Data-server-level multipathing is used for bandwidth
+ scaling via trunking (Section 2.10.5) and for higher availability in
+ the case of a data-server failure. Multipathing allows the client to
+ switch to another data server address, which may be that of another
+ data server exporting the same data stripe unit, without having to
+ contact the metadata server for a new layout.
+
+ To support data server multipathing, each element of the
+ nflda_multipath_ds_list contains an array of one or more data server
+ network addresses. This array (data type multipath_list4) represents
+ a list of data servers (each identified by a network address), with
+ the possibility that some data servers will appear in the list
+ multiple times.
+
+ The client is free to use any of the network addresses as a
+ destination to send data server requests. If some network addresses
+ are less optimal paths to the data than others, then the MDS SHOULD
+ NOT include those network addresses in an element of
+ nflda_multipath_ds_list. If less optimal network addresses exist to
+ provide failover, the RECOMMENDED method to offer the addresses is to
+ provide them in a replacement device-ID-to-device-address mapping, or
+ a replacement device ID. When a client finds that no data server in
+ an element of nflda_multipath_ds_list responds, it SHOULD send a
+ GETDEVICEINFO to attempt to replace the existing
+ device-ID-to-device-address mappings. If the MDS detects that all
+ data servers represented by an element of nflda_multipath_ds_list are
+ unavailable, the MDS SHOULD send a CB_NOTIFY_DEVICEID (if the client
+ has indicated it wants device ID notifications for changed device
+ IDs) to change the device-ID-to-device-address mappings to the
+ available data servers. If the device ID itself will be replaced,
+ the MDS SHOULD recall all layouts with the device ID, and thus force
+ the client to get new layouts and device ID mappings via LAYOUTGET
+ and GETDEVICEINFO.
+
+ Generally, if two network addresses appear in an element of
+ nflda_multipath_ds_list, they will designate the same data server,
+ and the two data server addresses will support the implementation of
+ client ID or session trunking (the latter is RECOMMENDED) as defined
+ in Section 2.10.5. The two data server addresses will share the same
+ server owner or major ID of the server owner. However, it is not
+ always necessary for the two data server addresses to designate the
+ same server, with trunking then being used. For example, the data
+ could be read-only, and could consist of exact replicas.
+
+13.6. Operations Sent to NFSv4.1 Data Servers
+
+ Clients accessing data on an NFSv4.1 data server MUST send only the
+ NULL procedure and COMPOUND procedures whose operations are taken
+ only from two restricted subsets of the operations defined as valid
+ NFSv4.1 operations. Clients MUST use the filehandle specified by the
+ layout when accessing data on NFSv4.1 data servers.
+
+ The first of these operation subsets consists of management
+ operations. This subset consists of the BACKCHANNEL_CTL,
+ BIND_CONN_TO_SESSION, CREATE_SESSION, DESTROY_CLIENTID,
+ DESTROY_SESSION, EXCHANGE_ID, SECINFO_NO_NAME, SET_SSV, and SEQUENCE
+ operations. The client may use these operations in order to set up
+ and maintain the appropriate client IDs, sessions, and security
+ contexts involved in communication with the data server. Henceforth,
+ these will be referred to as data-server housekeeping operations.
+
+ The second subset consists of COMMIT, READ, WRITE, and PUTFH. These
+ operations MUST be used with a current filehandle specified by the
+ layout. In the case of PUTFH, the new current filehandle MUST be one
+ taken from the layout. Henceforth, these will be referred to as
+ data-server I/O operations. As described in Section 12.5.1, a client
+ MUST NOT send an I/O to a data server for which it does not hold a
+ valid layout; the data server MUST reject such an I/O.
+
+ Unless the server has a concurrent non-data-server personality --
+ i.e., the EXCHANGE_ID results returned (EXCHGID4_FLAG_USE_PNFS_DS |
+ EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS |
+ EXCHGID4_FLAG_USE_NON_PNFS); see Section 13.1 -- any attempted use of
+ operations against a data server other than those specified in the
+ two subsets above MUST return NFS4ERR_NOTSUPP to the client.
+
+ When the server has concurrent data-server and non-data-server
+ personalities, each COMPOUND sent by the client MUST be constructed
+ so that it is appropriate to one of the two personalities, and it
+ MUST NOT contain operations directed to a mix of those personalities.
+ The server MUST enforce this. To understand the constraints,
+ operations within a COMPOUND are divided into the following three
+ classes:
+
+ 1. An operation that is ambiguous regarding its personality
+ assignment. This includes all of the data-server housekeeping
+ operations.
Additionally, if the server has assigned filehandles
+ so that the ones defined by the layout are the same as those used
+ by the metadata server, all operations using such filehandles are
+ within this class, with one exception: if the operation uses a
+ stateid that is incompatible with a data-server personality
+ (e.g., a special stateid, or a stateid with a non-zero "seqid"
+ field; see Section 13.9.1), the operation is in class 3, as
+ described below. A COMPOUND containing multiple class 1
+ operations (and operations of no other class) MAY be sent to a
+ server with multiple concurrent data-server and non-data-server
+ personalities.
+
+ 2. An operation that is unambiguously referable to the data-server
+ personality. This includes data-server I/O operations where the
+ filehandle is one that can only be validly directed to the
+ data-server personality.
+
+ 3. An operation that is unambiguously referable to the
+ non-data-server personality. This includes all COMPOUND
+ operations that are neither data-server housekeeping nor
+ data-server I/O operations, plus data-server I/O operations where
+ the current fh (or the one to be made the current fh in the case
+ of PUTFH) is only valid on the metadata server or where a stateid
+ is used that is incompatible with the data server, i.e., is a
+ special stateid or has a non-zero seqid value.
+
+ When a COMPOUND first executes an operation from class 3 above, it
+ acts as a normal COMPOUND on any other server, and the data-server
+ personality ceases to be relevant. There are no special restrictions
+ on the operations in the COMPOUND to limit them to those for a data
+ server. When a PUTFH is done, filehandles derived from the layout
+ are not valid: if their format is not normally acceptable, then
+ NFS4ERR_BADHANDLE MUST result. Similarly, filehandles derived from
+ layouts are not accepted as the current filehandle for other
+ operations, since they are not normally usable on the metadata
+ server; using them will result in NFS4ERR_STALE.
+
+ When a COMPOUND first executes an operation from class 2, which would
+ be a PUTFH where the filehandle is one from a layout, the COMPOUND is
+ henceforth interpreted with respect to the data-server personality.
+ Operations outside the two classes discussed above MUST result in
+ NFS4ERR_NOTSUPP. Filehandles are validated using the rules of the
+ data server, resulting in NFS4ERR_BADHANDLE and/or NFS4ERR_STALE even
+ when they would not normally do so when addressed to the
+ non-data-server personality. Stateids must obey the rules of the
+ data server in that any use of special stateids or stateids with
+ non-zero seqid values must result in NFS4ERR_BAD_STATEID.
+
+ Until the server first executes an operation from class 2 or class 3,
+ the client MUST NOT depend on the operation being executed by either
+ the data-server or the non-data-server personality. The server MUST
+ pick one personality consistently for a given COMPOUND, with the only
+ possible transition being a single one when the first operation from
+ class 2 or class 3 is executed.
+
+ Because of the complexity induced by assigning filehandles so they
+ can be used on both a data server and a metadata server, it is
+ RECOMMENDED that where the same server can have both personalities,
+ the server assign separate unique filehandles to both personalities.
+ This makes it unambiguous for which server a given request is
+ intended.
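+
+ A server-side classifier for the three classes might look like the
+ following deliberately simplified, non-normative C sketch; all of
+ the predicate helpers are hypothetical, and the handling of shared
+ (class 1) filehandles is folded into them.
+
+    enum op_class { CLASS1_AMBIGUOUS, CLASS2_DS_ONLY, CLASS3_NON_DS };
+
+    enum op_class classify_op(const struct compound_op *op)
+    {
+        /* Special stateids and non-zero "seqid" fields force
+         * class 3, even for otherwise ambiguous operations. */
+        if (stateid_incompatible_with_ds(op))
+            return CLASS3_NON_DS;
+        if (is_ds_housekeeping_op(op))      /* SEQUENCE, etc. */
+            return CLASS1_AMBIGUOUS;
+        if (is_ds_io_op(op) && fh_valid_only_on_ds(op))
+            return CLASS2_DS_ONLY;
+        return CLASS3_NON_DS;
+    }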
In the case of a SETATTR of the size attribute, the control
+ protocol is responsible for propagating size updates/truncations to
+ the data servers. In the case of extending WRITEs to the data
+ servers, the new size must be visible on the metadata server once a
+ LAYOUTCOMMIT has completed (see Section 12.5.4.2). Section 13.10
+ describes the mechanism by which the client is to handle data-server
+ files that do not reflect the metadata server's size.
+
+13.7. COMMIT through Metadata Server
+
+ The file layout provides two alternate means of providing for the
+ commit of data written through data servers. The flag
+ NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout
+ (data type nfsv4_1_file_layout4) is an indication from the metadata
+ server to the client of the REQUIRED way of performing COMMIT, either
+ by sending the COMMIT to the data server or the metadata server.
+ These two methods of dealing with the issue correspond to broad
+ styles of implementation for a pNFS server supporting the file layout
+ type.
+
+ * When the flag is FALSE, COMMIT operations MUST be sent to the
+ data server to which the corresponding WRITE operations were sent.
+ This approach is sometimes useful when file striping is
+ implemented within the pNFS server (instead of the file system),
+ with the individual data servers each implementing their own file
+ systems.
+
+ * When the flag is TRUE, COMMIT operations MUST be sent to the
+ metadata server, rather than to the individual data servers. This
+ approach is sometimes useful when file striping is implemented
+ within the clustered file system that is the backend to the pNFS
+ server. In such an implementation, each COMMIT to each data
+ server might result in repeated writes of metadata blocks to the
+ detriment of write performance. Sending a single COMMIT to the
+ metadata server can be more efficient when there exists a
+ clustered file system capable of implementing such a coordinated
+ COMMIT.
+
+ If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to
+ maintain the current NFSv4.1 commit and recovery model, the data
+ servers MUST return a common writeverf verifier in all WRITE
+ responses for a given file layout, and the metadata server's
+ COMMIT implementation must return the same writeverf. The value
+ of the writeverf verifier MUST be changed at the metadata server
+ or any data server that is referenced in the layout, whenever
+ there is a server event that can possibly lead to loss of
+ uncommitted data. The scope of the verifier can be for a file or
+ for the entire pNFS server. It might be more difficult for the
+ server to maintain the verifier at the file level, but the benefit
+ is that only events that impact a given file will require recovery
+ action.
+
+ Note that if the layout specifies dense packing, then the offset used
+ in a COMMIT to the MDS may differ from the offset used in a COMMIT to
+ the data server.
+
+ The single COMMIT to the metadata server will return a verifier, and
+ the client should compare it to all the verifiers from the WRITEs and
+ fail the COMMIT if there are any mismatched verifiers. If COMMIT to
+ the metadata server fails, the client should re-send WRITEs for all
+ the modified data in the file. The client should treat modified data
+ with a mismatched verifier as a WRITE failure and try to recover by
+ resending the WRITEs to the original data server or using another
+ path to that data if the layout has not been recalled.
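+
+ The verifier comparison just described can be sketched in C as
+ follows. This is a minimal client-side illustration under assumed
+ data structures; the write_record list and the resend_write() helper
+ are hypothetical, not part of the protocol.
+
+    #include <stdbool.h>
+    #include <string.h>
+
+    #define NFS4_VERIFIER_SIZE 8
+
+    struct write_record {
+        unsigned char verf[NFS4_VERIFIER_SIZE]; /* writeverf from WRITE */
+        struct write_record *next;
+    };
+
+    /* Hypothetical helper that re-sends one saved WRITE to the
+     * original data server or over another path to the data. */
+    extern void resend_write(const struct write_record *w);
+
+    /* Compare the verifier returned by a COMMIT sent to the MDS with
+     * the verifiers saved from the WRITEs; treat any mismatch as a
+     * WRITE failure and re-send that data. */
+    static bool
+    commit_thru_mds_ok(const unsigned char *commit_verf,
+                       struct write_record *writes)
+    {
+        bool ok = true;
+        struct write_record *w;
+
+        for (w = writes; w != NULL; w = w->next) {
+            if (memcmp(w->verf, commit_verf,
+                       NFS4_VERIFIER_SIZE) != 0) {
+                resend_write(w);
+                ok = false;
+            }
+        }
+        return ok;
+    }
+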
+ Alternatively, the client can obtain a new layout or it could rewrite + the data directly to the metadata server. If nfl_util & + NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata + server might have no effect. If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS + is FALSE, a COMMIT sent to the metadata server should be used only to + commit data that was written to the metadata server. See + Section 12.7.6 for recovery options. + +13.8. The Layout Iomode + + The layout iomode need not be used by the metadata server when + servicing NFSv4.1 file-based layouts, although in some circumstances + it may be useful. For example, if the server implementation supports + reading from read-only replicas or mirrors, it would be useful for + the server to return a layout enabling the client to do so. As such, + the client SHOULD set the iomode based on its intent to read or write + the data. The client may default to an iomode of LAYOUTIOMODE4_RW. + The iomode need not be checked by the data servers when clients + perform I/O. However, the data servers SHOULD still validate that + the client holds a valid layout and return an error if the client + does not. + +13.9. Metadata and Data Server State Coordination + +13.9.1. Global Stateid Requirements + + When the client sends I/O to a data server, the stateid used MUST NOT + be a layout stateid as returned by LAYOUTGET or sent by + CB_LAYOUTRECALL. Permitted stateids are based on one of the + following: an OPEN stateid (the stateid field of data type OPEN4resok + as returned by OPEN), a delegation stateid (the stateid field of data + types open_read_delegation4 and open_write_delegation4 as returned by + OPEN or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid + returned by the LOCK or LOCKU operations. The stateid sent to the + data server MUST be sent with the seqid set to zero, indicating the + most current version of that stateid, rather than indicating a + specific non-zero seqid value. In no case is the use of special + stateid values allowed. + + The stateid used for I/O MUST have the same effect and be subject to + the same validation on a data server as it would if the I/O was being + performed on the metadata server itself in the absence of pNFS. This + has the implication that stateids are globally valid on both the + metadata and data servers. This requires the metadata server to + propagate changes in LOCK and OPEN state to the data servers, so that + the data servers can validate I/O accesses. This is discussed + further in Section 13.9.2. Depending on when stateids are + propagated, the existence of a valid stateid on the data server may + act as proof of a valid layout. + + Clients performing I/O operations need to select an appropriate + stateid based on the locks (including opens and delegations) held by + the client and the various types of state-owners sending the I/O + requests. The rules for doing so when referencing data servers are + somewhat different from those discussed in Section 8.2.5, which apply + when accessing metadata servers. + + The following rules, applied in order of decreasing priority, govern + the selection of the appropriate stateid: + + * If the client holds a delegation for the file in question, the + delegation stateid should be used. + + * Otherwise, there must be an OPEN stateid for the current open- + owner, and that OPEN stateid for the open file in question is + used, unless mandatory locking prevents that. See below. 
+
+ * If the data server had previously responded with NFS4ERR_LOCKED to
+ use of the OPEN stateid, then the client should use the byte-range
+ lock stateid whenever one exists for that open file with the
+ current lock-owner.
+
+ * Special stateids should never be used. If they are used, the data
+ server MUST reject the I/O with an NFS4ERR_BAD_STATEID error.
+
+13.9.2. Data Server State Propagation
+
+ Since the metadata server, which handles byte-range lock and open-
+ mode state changes as well as ACLs, might not be co-located with the
+ data servers where I/O accesses are validated, the server
+ implementation MUST take care of propagating changes of this state to
+ the data servers. Once the propagation to the data servers is
+ complete, the full effect of those changes MUST be in effect at the
+ data servers. However, some state changes need not be propagated
+ immediately, although all changes SHOULD be propagated promptly.
+ These state propagations have an impact on the design of the control
+ protocol, even though the control protocol is outside of the scope of
+ this specification. Immediate propagation refers to the synchronous
+ propagation of state from the metadata server to the data server(s);
+ the propagation must be complete before returning to the client.
+
+13.9.2.1. Lock State Propagation
+
+ If the pNFS server supports mandatory byte-range locking, any
+ mandatory byte-range locks on a file MUST be made effective at the
+ data servers before the request that establishes them returns to the
+ caller. The effect MUST be the same as if the mandatory byte-range
+ lock state were synchronously propagated to the data servers, even
+ though the details of the control protocol may avoid actual transfer
+ of the state under certain circumstances.
+
+ On the other hand, since advisory byte-range lock state is not used
+ for checking I/O accesses at the data servers, there is no semantic
+ reason for propagating advisory byte-range lock state to the data
+ servers. Since updates to advisory locks neither confer nor remove
+ privileges, these changes need not be propagated immediately, and may
+ not need to be propagated promptly. The updates to advisory locks
+ need only be propagated when the data server needs to resolve a
+ question about a stateid. In fact, if byte-range locking is not
+ mandatory (i.e., is advisory), the clients are advised to avoid using
+ the byte-range lock-based stateids for I/O. The stateids returned by
+ OPEN are sufficient and eliminate overhead for this kind of state
+ propagation.
+
+ If a client gets back an NFS4ERR_LOCKED error from a data server,
+ this is an indication that mandatory byte-range locking is in force.
+ The client recovers from this by getting a byte-range lock that
+ covers the affected range and re-sending the I/O with the stateid of
+ the byte-range lock.
+
+13.9.2.2. Open and Deny Mode Validation
+
+ Open and deny mode validation MUST be performed against the open and
+ deny mode(s) held by the data servers. When access is reduced or a
+ deny mode made more restrictive (because of CLOSE or OPEN_DOWNGRADE),
+ the data server MUST prevent any I/Os that would be denied if
+ performed on the metadata server. When access is expanded, the data
+ server MUST make sure that no requests are subsequently rejected
+ because of open or deny issues that no longer apply, given the
+ previous relaxation.
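+
+ As an illustration of the validation just described, the following C
+ sketch shows the kind of check a data server might apply to each I/O
+ against open state propagated from the metadata server. The
+ structure and the share-mode constants are hypothetical
+ simplifications of whatever representation a control protocol would
+ actually propagate.
+
+    #include <stdbool.h>
+
+    #define SHARE_ACCESS_READ  0x1
+    #define SHARE_ACCESS_WRITE 0x2
+
+    /* Open/deny state for one open file, as pushed by the MDS. */
+    struct ds_open_state {
+        unsigned access; /* access modes currently granted by OPEN */
+        unsigned deny;   /* deny modes currently in force           */
+    };
+
+    /* Return true only if the I/O would also have been permitted on
+     * the metadata server.  After a CLOSE or OPEN_DOWNGRADE, the MDS
+     * propagates the reduced modes, so I/Os the MDS would reject are
+     * rejected here as well. */
+    static bool
+    ds_io_permitted(const struct ds_open_state *st, bool is_write)
+    {
+        unsigned needed = is_write ? SHARE_ACCESS_WRITE
+                                   : SHARE_ACCESS_READ;
+        return (st->access & needed) != 0;
+    }
+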
+13.9.2.3. File Attributes
+
+ Since the SETATTR operation has the ability to modify state that is
+ visible on both the metadata and data servers (e.g., the size), care
+ must be taken to ensure that the resultant state across the set of
+ data servers is consistent, especially when truncating or growing the
+ file.
+
+ As described earlier, the LAYOUTCOMMIT operation is used to ensure
+ that the metadata is synchronized with changes made to the data
+ servers. For the NFSv4.1-based data storage protocol, it is
+ necessary to re-synchronize state such as the size attribute, and the
+ setting of mtime/change/atime. See Section 12.5.4 for a full
+ description of the semantics regarding LAYOUTCOMMIT and attribute
+ synchronization. It should be noted that by using an NFSv4.1-based
+ layout type, it is possible to synchronize this state before
+ LAYOUTCOMMIT occurs. For example, the control protocol can be used
+ to query the attributes present on the data servers.
+
+ Any changes to file attributes that control authorization or access
+ (as reflected by ACCESS calls or by READs and WRITEs on the metadata
+ server) MUST be propagated to the data servers for enforcement on
+ READ and WRITE I/O calls. If the changes made on the metadata server
+ result in more restrictive access permissions for any user, those
+ changes MUST be propagated to the data servers synchronously.
+
+ The OPEN operation (Section 18.16.4) does not impose any requirement
+ that I/O operations on an open file have the same credentials as the
+ OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when
+ EXCHANGE_ID creates the client ID), and so it requires the server's
+ READ and WRITE operations to perform appropriate access checking.
+ Changes to ACLs also require new access checking by READ and WRITE on
+ the server. The propagation of access-right changes due to changes
+ in ACLs may be asynchronous only if the server implementation is able
+ to determine that the updated ACL is not more restrictive for any
+ user specified in the old ACL. Due to the relative infrequency of
+ ACL updates, it is suggested that all changes be propagated
+ synchronously.
+
+13.10. Data Server Component File Size
+
+ A potential problem exists when a component data file on a particular
+ data server has grown past EOF; the problem exists for both dense and
+ sparse layouts. Imagine the following scenario: a client creates a
+ new file (size == 0) and writes to byte 131072; the client then seeks
+ to the beginning of the file and reads byte 100. The client should
+ receive zeroes back as a result of the READ. However, if the
+ striping pattern directs the client to send the READ to a data server
+ other than the one that received the client's original WRITE, the
+ data server servicing the READ may believe that the file's size is
+ still 0 bytes. In that event, the data server's READ response will
+ contain zero bytes and an indication of EOF. The data server can
+ only return zeroes if it knows that the file's size has been
+ extended. This would require the immediate propagation of the file's
+ size to all data servers, which is potentially very costly.
+ Therefore, the client that has initiated the extension of the file's
+ size MUST be prepared to deal with these EOF conditions.
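+
+ One way a client might implement this handling is sketched below in
+ C. The read_reply structure and the client's cached view of the file
+ size are hypothetical names; the logic simply zero-fills the portion
+ of a short or EOF-terminated READ that lies below the client's view
+ of EOF.
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <string.h>
+
+    /* Hypothetical shape of a data-server READ result. */
+    struct read_reply {
+        bool   eof;   /* data server reported EOF        */
+        size_t count; /* bytes of data actually returned */
+    };
+
+    /* Patch up a READ issued at 'offset' for 'requested' bytes,
+     * given 'client_size', the client's own view of the file size.
+     * Bytes between the returned data and the client's view of EOF
+     * (capped at the request length) are a hole: fill with zeroes. */
+    static void
+    fixup_short_read(char *buf, size_t requested,
+                     unsigned long long offset,
+                     unsigned long long client_size,
+                     const struct read_reply *rr)
+    {
+        unsigned long long valid;
+
+        if (offset >= client_size)
+            return;            /* READ beyond client's view of EOF */
+
+        valid = client_size - offset;
+        if (valid > requested)
+            valid = requested;
+        if (rr->count < valid)
+            memset(buf + rr->count, 0,
+                   (size_t)(valid - rr->count));
+    }
+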
When the offset in the arguments to READ is less than the client's
+ view of the file size, if the READ response indicates EOF and/or
+ contains fewer bytes than requested, the client will interpret such
+ a response as a hole in the file, and the NFS client will substitute
+ zeroes for the data.
+
+ The NFSv4.1 protocol only provides close-to-open file data cache
+ semantics, meaning that when the file is closed, all modified data is
+ written to the server. When a subsequent OPEN of the file is done,
+ the change attribute is inspected for a difference from a cached
+ value for the change attribute. For the case above, this means that
+ a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and
+ will update the file's size and change attribute. Access from
+ another client after that point will result in the appropriate size
+ being returned.
+
+13.11. Layout Revocation and Fencing
+
+ As described in Section 12.7, the layout-type-specific storage
+ protocol is responsible for handling the effects of I/Os that started
+ before lease expiration and extend through lease expiration. The
+ LAYOUT4_NFSV4_1_FILES layout type can prevent all I/Os to data
+ servers from being executed after lease expiration (this prevention
+ is called "fencing"), without relying on a precise client lease timer
+ and without requiring data servers to maintain lease timers. The
+ LAYOUT4_NFSV4_1_FILES pNFS server has the flexibility to revoke
+ individual layouts, and thus fence I/O on a per-file basis.
+
+ In addition to lease expiration, the reasons a layout can be revoked
+ include a client's failure to respond to a CB_LAYOUTRECALL, a restart
+ of the metadata server, or administrative intervention. Regardless
+ of the reason, once a client's layout has been revoked, the pNFS
+ server MUST prevent the client from sending I/O for the affected file
+ from and to all data servers; in other words, it MUST fence the
+ client from the affected file on the data servers.
+
+ Fencing works as follows. As described in Section 13.1, in COMPOUND
+ procedure requests to the data server, the data filehandle provided
+ by the PUTFH operation and the stateid in the READ or WRITE operation
+ are used to ensure that the client has a valid layout for the I/O
+ being performed; if it does not, the I/O is rejected with
+ NFS4ERR_PNFS_NO_LAYOUT. The server can simply check the stateid and,
+ additionally, make the data filehandle stale if the layout specified
+ a data filehandle that is different from the metadata server's
+ filehandle for the file (see the nfl_fh_list description in
+ Section 13.3).
+
+ Before the metadata server takes any action to revoke layout state
+ given out by a previous instance, it must make sure that all layout
+ state from that previous instance is invalidated at the data
+ servers. This has the following implications.
+
+ * The metadata server must not restripe a file until it has
+ contacted all of the data servers to invalidate the layouts from
+ the previous instance.
+
+ * The metadata server must not give out mandatory locks that
+ conflict with layouts from the previous instance without either
+ doing a specific layout invalidation (as it would have to do
+ anyway) or doing a global data server invalidation.
+
+13.12. Security Considerations for the File Layout Type
+
+ The NFSv4.1 file layout type MUST adhere to the security
+ considerations outlined in Section 12.9.
NFSv4.1 data servers MUST
+ make all of the required access checks on each READ or WRITE I/O as
+ determined by the NFSv4.1 protocol. If the metadata server would
+ deny a READ or WRITE operation on a file due to its ACL, mode
+ attribute, open access mode, open deny mode, mandatory byte-range
+ lock state, or any other attributes and state, the data server MUST
+ also deny the READ or WRITE operation. This impacts the control
+ protocol and the propagation of state from the metadata server to the
+ data servers; see Section 13.9.2 for more details.
+
+ The methods for authentication, integrity, and privacy for data
+ servers based on the LAYOUT4_NFSV4_1_FILES layout type are the same
+ as those used by metadata servers. Metadata and data servers use ONC
+ RPC security flavors to authenticate, and SECINFO and SECINFO_NO_NAME
+ to negotiate the security mechanism and services to be used. Thus,
+ when using the LAYOUT4_NFSV4_1_FILES layout type, the impact on the
+ RPC-based security model due to pNFS (as alluded to in Sections 1.8.1
+ and 1.8.2.2) is zero.
+
+ For a given file object, a metadata server MAY require different
+ security parameters (secinfo4 value) than the data server. For a
+ given file object with multiple data servers, the secinfo4 value
+ SHOULD be the same across all data servers. If the secinfo4 values
+ across a metadata server and its data servers differ for a specific
+ file, the mapping of the principal to the server's internal user
+ identifier MUST be the same in order for the access-control checks
+ based on ACL, mode, open and deny mode, and mandatory locking to be
+ consistent across the pNFS server.
+
+ If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file
+ layouts, then the implementation MUST support the SECINFO_NO_NAME
+ operation on both the metadata and data servers.
+
+14. Internationalization
+
+ The primary area in which NFSv4.1 needs to deal with
+ internationalization, or I18N, is file names and other strings as
+ used within the protocol. The choice of string representation must
+ allow reasonable name/string access to clients that use various
+ languages. The UTF-8 encoding of the UCS (Universal Multiple-Octet
+ Coded Character Set) as defined by ISO 10646 [18] allows for this
+ type of access and follows the policy described in "IETF Policy on
+ Character Sets and Languages", RFC 2277 [19].
+
+ RFC 3454 [16], otherwise known as "stringprep", documents a framework
+ for using Unicode/UTF-8 in networking protocols so as "to increase
+ the likelihood that string input and string comparison work in ways
+ that make sense for typical users throughout the world". A protocol
+ must define a profile of stringprep "in order to fully specify the
+ processing options". The remainder of this section defines the
+ NFSv4.1 stringprep profiles. Much of the terminology used for the
+ remainder of this section comes from stringprep.
+
+ There are three UTF-8 string types defined for NFSv4.1: utf8str_cs,
+ utf8str_cis, and utf8str_mixed. Separate profiles are defined for
+ each. Each profile defines the following, as required by stringprep:
+
+ * The intended applicability of the profile.
+
+ * The character repertoire that is the input and output to
+ stringprep (which is Unicode 3.2 for the referenced version of
+ stringprep). However, NFSv4.1 implementations are not limited to
+ 3.2.
+
+ * The mapping tables from stringprep used (as described in Section 3
+ of stringprep).
+
+ * Any additional mapping tables specific to the profile.
+
+ * The Unicode normalization used, if any (as described in Section 4
+ of stringprep).
+
+ * The tables from the stringprep listing of characters that are
+ prohibited as output (as described in Section 5 of stringprep).
+
+ * The bidirectional string testing used, if any (as described in
+ Section 6 of stringprep).
+
+ * Any additional characters that are prohibited as output specific
+ to the profile.
+
+ Stringprep discusses Unicode characters, whereas NFSv4.1 renders
+ UTF-8 characters. Since there is a one-to-one mapping from UTF-8 to
+ Unicode, when the remainder of this document refers to Unicode, the
+ reader should assume UTF-8.
+
+ Much of the text for the profiles comes from RFC 3491 [20].
+
+14.1. Stringprep Profile for the utf8str_cs Type
+
+ Every use of the utf8str_cs type definition in the NFSv4 protocol
+ specification follows the profile named nfs4_cs_prep.
+
+14.1.1. Intended Applicability of the nfs4_cs_prep Profile
+
+ The utf8str_cs type is a case-sensitive string of UTF-8 characters.
+ Its primary use in NFSv4.1 is for naming components and pathnames.
+ Components and pathnames are stored on the server's file system. Two
+ valid distinct UTF-8 strings might be the same after processing via
+ the utf8str_cs profile. If the strings are two names inside a
+ directory, the NFSv4.1 server will need to either:
+
+ * disallow the creation of a second name if its post-processed form
+ collides with that of an existing name, or
+
+ * allow the creation of the second name, but arrange so that after
+ post-processing, the second name is different than the post-
+ processed form of the first name.
+
+14.1.2. Character Repertoire of nfs4_cs_prep
+
+ The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's
+ Appendix A.1. However, NFSv4.1 implementations are not limited to
+ 3.2.
+
+14.1.3. Mapping Used by nfs4_cs_prep
+
+ The nfs4_cs_prep profile specifies mapping using the following tables
+ from stringprep:
+
+ Table B.1
+
+ Table B.2 is normally not part of the nfs4_cs_prep profile as it is
+ primarily for dealing with case-insensitive comparisons. However, if
+ the NFSv4.1 file server supports the case_insensitive file system
+ attribute, and if case_insensitive is TRUE, the NFSv4.1 server MUST
+ use Table B.2 (in addition to Table B.1) when processing utf8str_cs
+ strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to
+ Table B.1) is being used.
+
+ If the case_preserving attribute is present and set to FALSE, then
+ the NFSv4.1 server MUST use Table B.2 to map case when processing
+ utf8str_cs strings. Whether the server maps from lower to upper case
+ or from upper to lower case is an implementation dependency.
+
+14.1.4. Normalization Used by nfs4_cs_prep
+
+ The nfs4_cs_prep profile does not specify a normalization form. A
+ later revision of this specification may specify a particular
+ normalization form. Therefore, the server and client can expect that
+ they may receive unnormalized characters within protocol requests and
+ responses. If the operating environment requires normalization, then
+ the implementation must normalize utf8str_cs strings within the
+ protocol before presenting the information to an application (at the
+ client) or local file system (at the server).
+
+14.1.5. Prohibited Output for nfs4_cs_prep
+
+ The nfs4_cs_prep profile RECOMMENDS prohibiting the use of the
+ following tables from stringprep:
+
+ Table C.5
+
+ Table C.6
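+
+ The case-mapping behavior described in Section 14.1.3 can be
+ illustrated with a short C sketch of a name-collision check on a
+ file system with case_insensitive set to TRUE. Plain ASCII folding
+ stands in here for stringprep Tables B.1 and B.2; a real
+ implementation would apply the full mapping tables.
+
+    /* Illustrative only: detect whether two utf8str_cs names
+     * collide after case mapping, using ASCII folding as a
+     * stand-in for Tables B.1 and B.2. */
+    static int fold(int c)
+    {
+        return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
+    }
+
+    static int names_collide(const char *a, const char *b)
+    {
+        while (*a != '\0' && *b != '\0') {
+            if (fold((unsigned char)*a) != fold((unsigned char)*b))
+                return 0;
+            a++;
+            b++;
+        }
+        return *a == *b; /* both strings ended together */
+    }
+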
+14.1.6. Bidirectional Output for nfs4_cs_prep
+
+ The nfs4_cs_prep profile does not specify any checking of
+ bidirectional strings.
+
+14.2. Stringprep Profile for the utf8str_cis Type
+
+ Every use of the utf8str_cis type definition in the NFSv4.1 protocol
+ specification follows the profile named nfs4_cis_prep.
+
+14.2.1. Intended Applicability of the nfs4_cis_prep Profile
+
+ The utf8str_cis type is a case-insensitive string of UTF-8
+ characters. Its primary use in NFSv4.1 is for naming NFS servers.
+
+14.2.2. Character Repertoire of nfs4_cis_prep
+
+ The nfs4_cis_prep profile uses Unicode 3.2, as defined in
+ stringprep's Appendix A.1. However, NFSv4.1 implementations are not
+ limited to 3.2.
+
+14.2.3. Mapping Used by nfs4_cis_prep
+
+ The nfs4_cis_prep profile specifies mapping using the following
+ tables from stringprep:
+
+ Table B.1
+
+ Table B.2
+
+14.2.4. Normalization Used by nfs4_cis_prep
+
+ The nfs4_cis_prep profile specifies using Unicode normalization form
+ KC, as described in stringprep.
+
+14.2.5. Prohibited Output for nfs4_cis_prep
+
+ The nfs4_cis_prep profile specifies prohibiting using the following
+ tables from stringprep:
+
+ Table C.1.2
+
+ Table C.2.2
+
+ Table C.3
+
+ Table C.4
+
+ Table C.5
+
+ Table C.6
+
+ Table C.7
+
+ Table C.8
+
+ Table C.9
+
+14.2.6. Bidirectional Output for nfs4_cis_prep
+
+ The nfs4_cis_prep profile specifies checking bidirectional strings as
+ described in stringprep's Section 6.
+
+14.3. Stringprep Profile for the utf8str_mixed Type
+
+ Every use of the utf8str_mixed type definition in the NFSv4.1
+ protocol specification follows the profile named nfs4_mixed_prep.
+
+14.3.1. Intended Applicability of the nfs4_mixed_prep Profile
+
+ The utf8str_mixed type is a string of UTF-8 characters, with a prefix
+ that is case sensitive, a separator equal to '@', and a suffix that
+ is a fully qualified domain name. Its primary use in NFSv4.1 is for
+ naming principals identified in an Access Control Entry.
+
+14.3.2. Character Repertoire of nfs4_mixed_prep
+
+ The nfs4_mixed_prep profile uses Unicode 3.2, as defined in
+ stringprep's Appendix A.1. However, NFSv4.1 implementations are not
+ limited to 3.2.
+
+14.3.3. Mapping Used by nfs4_mixed_prep
+
+ For the prefix and the separator of a utf8str_mixed string, the
+ nfs4_mixed_prep profile specifies mapping using the following table
+ from stringprep:
+
+ Table B.1
+
+ For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile
+ specifies mapping using the following tables from stringprep:
+
+ Table B.1
+
+ Table B.2
+
+14.3.4. Normalization Used by nfs4_mixed_prep
+
+ The nfs4_mixed_prep profile specifies using Unicode normalization
+ form KC, as described in stringprep.
+
+14.3.5. Prohibited Output for nfs4_mixed_prep
+
+ The nfs4_mixed_prep profile specifies prohibiting using the following
+ tables from stringprep:
+
+ Table C.1.2
+
+ Table C.2.2
+
+ Table C.3
+
+ Table C.4
+
+ Table C.5
+
+ Table C.6
+
+ Table C.7
+
+ Table C.8
+
+ Table C.9
+
+14.3.6. Bidirectional Output for nfs4_mixed_prep
+
+ The nfs4_mixed_prep profile specifies checking bidirectional strings
+ as described in stringprep's Section 6.
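+
+ The structure of a utf8str_mixed principal (case-sensitive prefix,
+ '@' separator, domain suffix) can be illustrated with the small C
+ sketch below. The helper is hypothetical and does not perform the
+ nfs4_mixed_prep mappings themselves; it only locates the separator
+ as described in Section 14.3.1.
+
+    #include <string.h>
+
+    /* Split "prefix@domain" into its case-sensitive prefix and its
+     * domain suffix.  Returns the prefix length, or -1 if the string
+     * has no separator, an empty prefix, or an empty suffix. */
+    static int split_principal(const char *s, const char **domain)
+    {
+        const char *at = strrchr(s, '@');
+
+        if (at == NULL || at == s || at[1] == '\0')
+            return -1;
+        *domain = at + 1;
+        return (int)(at - s);
+    }
+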
+14.4. UTF-8 Capabilities
+
+ const FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1;
+ const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2;
+
+ typedef uint32_t fs_charset_cap4;
+
+ Because some operating environments and file systems do not enforce
+ character set encodings, NFSv4.1 supports the fs_charset_cap
+ attribute (Section 5.8.2.11) that indicates to the client a file
+ system's UTF-8 capabilities. The attribute is an integer containing
+ a pair of flags. The first flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8,
+ which, if set to one, tells the client that the file system contains
+ non-UTF-8 characters, and the server will not convert non-UTF-8
+ characters to UTF-8 if the client reads a symbolic link or directory,
+ nor will operations with component names or pathnames in the
+ arguments convert the strings to UTF-8. The second flag is
+ FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set to one, indicates that
+ the server will accept (and generate) only UTF-8 characters on the
+ file system. If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one,
+ FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero.
+ FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one.
+
+14.5. UTF-8 Related Errors
+
+ Where the client sends an invalid UTF-8 string, the server should
+ return NFS4ERR_INVAL (see Table 11). This includes cases in which
+ inappropriate prefixes are detected and where the count includes
+ trailing bytes that do not constitute a full UCS character.
+
+ Where the client-supplied string is valid UTF-8 but contains
+ characters that are not supported by the server as a value for that
+ string (e.g., names containing characters outside of Unicode plane 0
+ on file systems that fail to support such characters despite their
+ presence in the Unicode standard), the server should return
+ NFS4ERR_BADCHAR.
+
+ Where a UTF-8 string is used as a file name, and the file system
+ (while supporting all of the characters within the name) does not
+ allow that particular name to be used, the server should return the
+ error NFS4ERR_BADNAME (Table 11). This includes situations in which
+ the server file system imposes a normalization constraint on name
+ strings, but will also include such situations as file system
+ prohibitions of "." and ".." as file names for certain operations,
+ and other such constraints.
+
+15. Error Values
+
+ NFS error numbers are assigned to failed operations within a Compound
+ (COMPOUND or CB_COMPOUND) request. A Compound request contains a
+ number of NFS operations that have their results encoded in sequence
+ in a Compound reply. The results of successful operations will
+ consist of an NFS4_OK status followed by the encoded results of the
+ operation. If an NFS operation fails, an error status will be
+ entered in the reply and the Compound request will be terminated.
+
+15.1.
Error Definitions + + +===================================+========+===================+ + | Error | Number | Description | + +===================================+========+===================+ + | NFS4_OK | 0 | Section 15.1.3.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ACCESS | 13 | Section 15.1.6.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ATTRNOTSUPP | 10032 | Section 15.1.15.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ADMIN_REVOKED | 10047 | Section 15.1.5.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BACK_CHAN_BUSY | 10057 | Section 15.1.12.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADCHAR | 10040 | Section 15.1.7.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADHANDLE | 10001 | Section 15.1.2.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADIOMODE | 10049 | Section 15.1.10.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADLAYOUT | 10050 | Section 15.1.10.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADNAME | 10041 | Section 15.1.7.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADOWNER | 10039 | Section 15.1.15.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADSESSION | 10052 | Section 15.1.11.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADSLOT | 10053 | Section 15.1.11.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADTYPE | 10007 | Section 15.1.4.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BADXDR | 10036 | Section 15.1.1.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_COOKIE | 10003 | Section 15.1.1.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_HIGH_SLOT | 10077 | Section 15.1.11.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_RANGE | 10042 | Section 15.1.8.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_SEQID | 10026 | Section 15.1.16.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_SESSION_DIGEST | 10051 | Section 15.1.12.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_BAD_STATEID | 10025 | Section 15.1.5.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_CB_PATH_DOWN | 10048 | Section 15.1.11.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_CLID_INUSE | 10017 | Section 15.1.13.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_CLIENTID_BUSY | 10074 | Section 15.1.13.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_COMPLETE_ALREADY | 10054 | Section 15.1.9.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | Section 15.1.11.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DEADLOCK | 10045 | Section 15.1.8.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DEADSESSION | 10078 | Section 15.1.11.5 | + 
+-----------------------------------+--------+-------------------+ + | NFS4ERR_DELAY | 10008 | Section 15.1.1.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DELEG_ALREADY_WANTED | 10056 | Section 15.1.14.1 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DELEG_REVOKED | 10087 | Section 15.1.5.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DENIED | 10010 | Section 15.1.8.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DIRDELEG_UNAVAIL | 10084 | Section 15.1.14.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_DQUOT | 69 | Section 15.1.4.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ENCR_ALG_UNSUPP | 10079 | Section 15.1.13.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_EXIST | 17 | Section 15.1.4.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_EXPIRED | 10011 | Section 15.1.5.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_FBIG | 27 | Section 15.1.4.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_FHEXPIRED | 10014 | Section 15.1.2.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_FILE_OPEN | 10046 | Section 15.1.4.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_GRACE | 10013 | Section 15.1.9.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_HASH_ALG_UNSUPP | 10072 | Section 15.1.13.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_INVAL | 22 | Section 15.1.1.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_IO | 5 | Section 15.1.4.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ISDIR | 21 | Section 15.1.2.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LAYOUTTRYLATER | 10058 | Section 15.1.10.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LAYOUTUNAVAILABLE | 10059 | Section 15.1.10.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LEASE_MOVED | 10031 | Section 15.1.16.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LOCKED | 10012 | Section 15.1.8.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LOCKS_HELD | 10037 | Section 15.1.8.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LOCK_NOTSUPP | 10043 | Section 15.1.8.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_LOCK_RANGE | 10028 | Section 15.1.8.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_MINOR_VERS_MISMATCH | 10021 | Section 15.1.3.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_MLINK | 31 | Section 15.1.4.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_MOVED | 10019 | Section 15.1.2.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NAMETOOLONG | 63 | Section 15.1.7.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOENT | 2 | Section 15.1.4.8 | + +-----------------------------------+--------+-------------------+ + | 
NFS4ERR_NOFILEHANDLE | 10020 | Section 15.1.2.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOMATCHING_LAYOUT | 10060 | Section 15.1.10.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOSPC | 28 | Section 15.1.4.9 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOTDIR | 20 | Section 15.1.2.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOTEMPTY | 66 | Section 15.1.4.10 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOTSUPP | 10004 | Section 15.1.1.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOT_ONLY_OP | 10081 | Section 15.1.3.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NOT_SAME | 10027 | Section 15.1.15.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NO_GRACE | 10033 | Section 15.1.9.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_NXIO | 6 | Section 15.1.16.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_OLD_STATEID | 10024 | Section 15.1.5.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_OPENMODE | 10038 | Section 15.1.8.8 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_OP_ILLEGAL | 10044 | Section 15.1.3.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_OP_NOT_IN_SESSION | 10071 | Section 15.1.3.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_PERM | 1 | Section 15.1.6.2 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_PNFS_IO_HOLE | 10075 | Section 15.1.10.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_PNFS_NO_LAYOUT | 10080 | Section 15.1.10.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RECALLCONFLICT | 10061 | Section 15.1.14.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RECLAIM_BAD | 10034 | Section 15.1.9.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RECLAIM_CONFLICT | 10035 | Section 15.1.9.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_REJECT_DELEG | 10085 | Section 15.1.14.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_REP_TOO_BIG | 10066 | Section 15.1.3.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_REP_TOO_BIG_TO_CACHE | 10067 | Section 15.1.3.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_REQ_TOO_BIG | 10065 | Section 15.1.3.8 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RESTOREFH | 10030 | Section 15.1.16.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RETRY_UNCACHED_REP | 10068 | Section 15.1.3.9 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_RETURNCONFLICT | 10086 | Section 15.1.10.8 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_ROFS | 30 | Section 15.1.4.11 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SAME | 10009 | Section 15.1.15.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SHARE_DENIED | 10015 | 
Section 15.1.8.9 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SEQUENCE_POS | 10064 | Section 15.1.3.10 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SEQ_FALSE_RETRY | 10076 | Section 15.1.11.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SEQ_MISORDERED | 10063 | Section 15.1.11.8 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SERVERFAULT | 10006 | Section 15.1.1.6 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_STALE | 70 | Section 15.1.2.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_STALE_CLIENTID | 10022 | Section 15.1.13.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_STALE_STATEID | 10023 | Section 15.1.16.5 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_SYMLINK | 10029 | Section 15.1.2.8 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_TOOSMALL | 10005 | Section 15.1.1.7 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_TOO_MANY_OPS | 10070 | Section 15.1.3.11 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Section 15.1.10.9 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_UNSAFE_COMPOUND | 10069 | Section 15.1.3.12 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_WRONGSEC | 10016 | Section 15.1.6.3 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_WRONG_CRED | 10082 | Section 15.1.6.4 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_WRONG_TYPE | 10083 | Section 15.1.2.9 | + +-----------------------------------+--------+-------------------+ + | NFS4ERR_XDEV | 18 | Section 15.1.4.12 | + +-----------------------------------+--------+-------------------+ + + Table 11: Protocol Error Definitions + +15.1.1. General Errors + + This section deals with errors that are applicable to a broad set of + different purposes. + +15.1.1.1. NFS4ERR_BADXDR (Error Code 10036) + + The arguments for this operation do not match those specified in the + XDR definition. This includes situations in which the request ends + before all the arguments have been seen. Note that this error + applies when fixed enumerations (these include booleans) have a value + within the input stream that is not valid for the enum. A replier + may pre-parse all operations for a Compound procedure before doing + any operation execution and return RPC-level XDR errors in that case. + +15.1.1.2. NFS4ERR_BAD_COOKIE (Error Code 10003) + + Used for operations that provide a set of information indexed by some + quantity provided by the client or cookie sent by the server for an + earlier invocation. Where the value cannot be used for its intended + purpose, this error results. + +15.1.1.3. NFS4ERR_DELAY (Error Code 10008) + + For any of a number of reasons, the replier could not process this + operation in what was deemed a reasonable time. The client should + wait and then try the request with a new slot and sequence value. + + Some examples of scenarios that might lead to this situation: + + * A server that supports hierarchical storage receives a request to + process a file that had been migrated. 
+
+ * An operation requires a delegation recall to proceed, but the need
+ to wait for this delegation to be recalled and returned makes
+ processing this request in a timely fashion impossible.
+
+ * A request is being performed on a session being migrated from
+ another server as described in Section 11.14.3, and the lack of
+ full information about the state of the session on the source
+ makes it impossible to process the request immediately.
+
+ In such cases, returning the error NFS4ERR_DELAY allows necessary
+ preparatory operations to proceed without holding up requester
+ resources such as a session slot. After delaying for a period of
+ time, the client can then re-send the operation in question, often as
+ part of a nearly identical request. Because of the need to avoid
+ spurious reissues of non-idempotent operations and to avoid acting in
+ response to NFS4ERR_DELAY errors returned from the replier's reply
+ cache, integration with the session-provided reply cache is
+ necessary. There are a number of cases to deal with, each of which
+ requires different sorts of handling by the requester and replier:
+
+ * If NFS4ERR_DELAY is returned on a SEQUENCE operation, the request
+ is retried in full with the SEQUENCE operation containing the same
+ slot and sequence values. In this case, the replier MUST avoid
+ returning a response containing NFS4ERR_DELAY as the response to
+ SEQUENCE solely because an earlier instance of the same request
+ returned that error and it was stored in the reply cache. If the
+ replier did this, the retries would not be effective as there
+ would be no opportunity for the replier to see whether the
+ condition that generated the NFS4ERR_DELAY had been rectified
+ during the interim between the original request and the retry.
+
+ * If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
+ that validly appears as the first operation of a request, the
+ handling is similar. The request can be retried in full without
+ modification. In this case as well, the replier MUST avoid
+ returning a response containing NFS4ERR_DELAY as the response to
+ an initial operation of a request solely on the basis of its
+ presence in the reply cache. If the replier did this, the retries
+ would not be effective as there would be no opportunity for the
+ replier to see whether the condition that generated the
+ NFS4ERR_DELAY had been rectified during the interim between the
+ original request and the retry.
+
+ * If NFS4ERR_DELAY is returned on an operation other than the first
+ in the request, the request when retried MUST contain a SEQUENCE
+ operation that is different than the original one, with either the
+ slot ID or the sequence value different from that in the original
+ request. Because requesters do this, there is no need for the
+ replier to take special care to avoid returning an NFS4ERR_DELAY
+ error obtained from the reply cache. When no non-idempotent
+ operations have been processed before the NFS4ERR_DELAY was
+ returned, the requester should retry the request in full, with the
+ only difference from the original request being the modification
+ to the slot ID or sequence value in the reissued SEQUENCE
+ operation.
+
+ * When NFS4ERR_DELAY is returned on an operation other than the
+ first within a request and there has been a non-idempotent
+ operation processed before the NFS4ERR_DELAY was returned,
+ reissuing the request as is normally done would incorrectly cause
+ the re-execution of the non-idempotent operation.
+
+ To avoid this situation, the client should reissue the request
+ without the non-idempotent operation. The request still must use
+ a SEQUENCE operation with either a different slot ID or sequence
+ value from the SEQUENCE in the original request. This is done
+ because, were the non-idempotent operation reissued, there would
+ be no way the replier could avoid spuriously re-executing it: the
+ different SEQUENCE parameters prevent the replier from recognizing
+ that the non-idempotent operation is being retried.
+
+ Note that without the ability to return NFS4ERR_DELAY and the
+ requester's willingness to re-send when receiving it, deadlock might
+ result. For example, if a recall is done, and if the delegation
+ return or operations preparatory to delegation return are held up by
+ other operations that need the delegation to be returned, session
+ slots might not be available. The result could be deadlock.
+
+15.1.1.4. NFS4ERR_INVAL (Error Code 22)
+
+ The arguments for this operation are not valid for some reason, even
+ though they do match those specified in the XDR definition for the
+ request.
+
+15.1.1.5. NFS4ERR_NOTSUPP (Error Code 10004)
+
+ Operation not supported, either because the operation is an OPTIONAL
+ one and is not supported by this server or because the operation MUST
+ NOT be implemented in the current minor version.
+
+15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006)
+
+ An error occurred on the server that does not map to any of the
+ specific legal NFSv4.1 protocol error values. The client should
+ translate this into an appropriate error. UNIX clients may choose to
+ translate this to EIO.
+
+15.1.1.7. NFS4ERR_TOOSMALL (Error Code 10005)
+
+ Used where an operation returns a variable amount of data, with a
+ limit specified by the client. Where the data returned cannot be fit
+ within the limit specified by the client, this error results.
+
+15.1.2. Filehandle Errors
+
+ These errors deal with the situation in which the current or saved
+ filehandle, or the filehandle passed to PUTFH intended to become the
+ current filehandle, is invalid in some way. This includes situations
+ in which the filehandle is a valid filehandle in general but is not
+ of the appropriate object type for the current operation.
+
+ Where the error description indicates a problem with the current or
+ saved filehandle, it is to be understood that filehandles are only
+ checked for the condition if they are implicit arguments of the
+ operation in question.
+
+15.1.2.1. NFS4ERR_BADHANDLE (Error Code 10001)
+
+ Illegal NFS filehandle for the current server. The current
+ filehandle failed internal consistency checks. Once accepted as
+ valid (by PUTFH), no subsequent status change can cause the
+ filehandle to generate this error.
+
+15.1.2.2. NFS4ERR_FHEXPIRED (Error Code 10014)
+
+ A current or saved filehandle that is an argument to the current
+ operation is volatile and has expired at the server.
+
+15.1.2.3. NFS4ERR_ISDIR (Error Code 21)
+
+ The current or saved filehandle designates a directory when the
+ current operation does not allow a directory to be accepted as the
+ target of this operation.
+
+15.1.2.4. NFS4ERR_MOVED (Error Code 10019)
+
+ The file system that contains the current filehandle object is not
+ present at the server or is not accessible with the network address
+ used. It may have been made accessible on a different set of network
+ addresses, relocated or migrated to another server, or it may have
+ never been present.
The client may obtain the new file system
+ location by obtaining the fs_locations or fs_locations_info attribute
+ for the current filehandle. For further discussion, refer to
+ Section 11.3.
+
+ As with the case of NFS4ERR_DELAY, it is possible that one or more
+ non-idempotent operations may have been successfully executed within
+ a COMPOUND before NFS4ERR_MOVED is returned. Because of this, once
+ the new location is determined, the original request that received
+ the NFS4ERR_MOVED should not be re-executed in full. Instead, the
+ client should send a new COMPOUND with any successfully executed non-
+ idempotent operations removed. When the client uses the same session
+ for the new COMPOUND, its SEQUENCE operation should use a different
+ slot ID or sequence.
+
+15.1.2.5. NFS4ERR_NOFILEHANDLE (Error Code 10020)
+
+ The logical current or saved filehandle value is required by the
+ current operation and is not set. This may be a result of a
+ malformed COMPOUND operation (i.e., no PUTFH or PUTROOTFH before an
+ operation that requires the current filehandle be set).
+
+15.1.2.6. NFS4ERR_NOTDIR (Error Code 20)
+
+ The current (or saved) filehandle designates an object that is not a
+ directory for an operation in which a directory is required.
+
+15.1.2.7. NFS4ERR_STALE (Error Code 70)
+
+ The current or saved filehandle value designating an argument to the
+ current operation is invalid. The file referred to by that
+ filehandle no longer exists or access to it has been revoked.
+
+15.1.2.8. NFS4ERR_SYMLINK (Error Code 10029)
+
+ The current filehandle designates a symbolic link when the current
+ operation does not allow a symbolic link as the target.
+
+15.1.2.9. NFS4ERR_WRONG_TYPE (Error Code 10083)
+
+ The current (or saved) filehandle designates an object that is of an
+ invalid type for the current operation, and there is no more specific
+ error (such as NFS4ERR_ISDIR or NFS4ERR_SYMLINK) that applies. Note
+ that in NFSv4.0, such situations generally resulted in the less-
+ specific error NFS4ERR_INVAL.
+
+15.1.3. Compound Structure Errors
+
+ This section deals with errors that relate to the overall structure
+ of a Compound request (by which we mean to include both COMPOUND and
+ CB_COMPOUND), rather than to particular operations.
+
+ There are a number of basic constraints on the operations that may
+ appear in a Compound request. Sessions add to these basic
+ constraints by requiring a Sequence operation (either SEQUENCE or
+ CB_SEQUENCE) at the start of the Compound.
+
+15.1.3.1. NFS4_OK (Error Code 0)
+
+ Indicates the operation completed successfully, in that all of the
+ constituent operations completed without error.
+
+15.1.3.2. NFS4ERR_MINOR_VERS_MISMATCH (Error Code 10021)
+
+ The minor version specified is not one that the current listener
+ supports. This value is returned in the overall status for the
+ Compound but is not associated with a specific operation since the
+ results will specify a result count of zero.
+
+15.1.3.3. NFS4ERR_NOT_ONLY_OP (Error Code 10081)
+
+ Certain operations, which are allowed to be executed outside of a
+ session, MUST be the only operation within a Compound whenever the
+ Compound does not start with a Sequence operation. This error
+ results when that constraint is not met.
+
+15.1.3.4. NFS4ERR_OP_ILLEGAL (Error Code 10044)
+
+ The operation code is not a valid one for the current Compound
+ procedure.
The opcode in the result stream matched with this error
+ is the ILLEGAL value, although the value that appears in the request
+ stream may be different. Where an illegal value appears and the
+ replier pre-parses all operations for a Compound procedure before
+ doing any operation execution, an RPC-level XDR error may be
+ returned.
+
+15.1.3.5. NFS4ERR_OP_NOT_IN_SESSION (Error Code 10071)
+
+ Most forward operations and all callback operations are only valid
+ within the context of a session, so that the Compound request in
+ question MUST begin with a Sequence operation. If an attempt is made
+ to execute these operations outside the context of a session, this
+ error results.
+
+15.1.3.6. NFS4ERR_REP_TOO_BIG (Error Code 10066)
+
+ The reply to a Compound would exceed the channel's negotiated maximum
+ response size.
+
+15.1.3.7. NFS4ERR_REP_TOO_BIG_TO_CACHE (Error Code 10067)
+
+ The reply to a Compound would exceed the channel's negotiated maximum
+ size for replies cached in the reply cache when the Sequence for the
+ current request specifies that this request is to be cached.
+
+15.1.3.8. NFS4ERR_REQ_TOO_BIG (Error Code 10065)
+
+ The Compound request exceeds the channel's negotiated maximum size
+ for requests.
+
+15.1.3.9. NFS4ERR_RETRY_UNCACHED_REP (Error Code 10068)
+
+ The requester has attempted a retry of a Compound that it previously
+ requested not be placed in the reply cache.
+
+15.1.3.10. NFS4ERR_SEQUENCE_POS (Error Code 10064)
+
+ A Sequence operation appeared in a position other than the first
+ operation of a Compound request.
+
+15.1.3.11. NFS4ERR_TOO_MANY_OPS (Error Code 10070)
+
+ The Compound request has too many operations, exceeding the count
+ negotiated when the session was created.
+
+15.1.3.12. NFS4ERR_UNSAFE_COMPOUND (Error Code 10069)
+
+ The client has sent a COMPOUND request with an unsafe mix of
+ operations -- specifically, with a non-idempotent operation that
+ changes the current filehandle and that is not followed by a GETFH.
+
+15.1.4. File System Errors
+
+ These errors describe situations that occurred in the underlying file
+ system implementation rather than in the protocol or any NFSv4.x
+ feature.
+
+15.1.4.1. NFS4ERR_BADTYPE (Error Code 10007)
+
+ An attempt was made to create an object with an inappropriate type
+ specified to CREATE. This may be because the type is undefined,
+ because the type is not supported by the server, or because the type
+ is not intended to be created by CREATE (such as a regular file or
+ named attribute, for which OPEN is used to do the file creation).
+
+15.1.4.2. NFS4ERR_DQUOT (Error Code 69)
+
+ Resource (quota) hard limit exceeded. The user's resource limit on
+ the server has been exceeded.
+
+15.1.4.3. NFS4ERR_EXIST (Error Code 17)
+
+ A file of the specified target name (when creating, renaming, or
+ linking) already exists.
+
+15.1.4.4. NFS4ERR_FBIG (Error Code 27)
+
+ The file is too large. The operation would have caused the file to
+ grow beyond the server's limit.
+
+15.1.4.5. NFS4ERR_FILE_OPEN (Error Code 10046)
+
+ The operation is not allowed because a file involved in the operation
+ is currently open. Servers may, but are not required to, disallow
+ linking-to, removing, or renaming open files.
+
+15.1.4.6. NFS4ERR_IO (Error Code 5)
+
+ Indicates that an I/O error occurred for which the file system was
+ unable to provide recovery.
+
+15.1.4.7. NFS4ERR_MLINK (Error Code 31)
+
+ The request would have caused the server's limit for the number of
+ hard links a file may have to be exceeded.
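+
+ Several of the file system errors in this section correspond
+ directly to classic POSIX errno values, and a UNIX client will
+ typically translate them as such (as suggested for
+ NFS4ERR_SERVERFAULT in Section 15.1.1.6). The C sketch below shows
+ one plausible client-side mapping; it is illustrative, not
+ normative.
+
+    #include <errno.h>
+
+    /* Selected NFSv4.1 status values from Table 11. */
+    #define NFS4ERR_PERM      1
+    #define NFS4ERR_NOENT     2
+    #define NFS4ERR_IO        5
+    #define NFS4ERR_ACCESS   13
+    #define NFS4ERR_EXIST    17
+    #define NFS4ERR_XDEV     18
+    #define NFS4ERR_NOTDIR   20
+    #define NFS4ERR_ISDIR    21
+    #define NFS4ERR_FBIG     27
+    #define NFS4ERR_NOSPC    28
+    #define NFS4ERR_ROFS     30
+    #define NFS4ERR_MLINK    31
+    #define NFS4ERR_NOTEMPTY 66
+    #define NFS4ERR_DQUOT    69
+
+    /* One plausible translation of file system errors to POSIX
+     * errno values; anything unrecognized falls back to EIO. */
+    static int nfs4_to_errno(int status)
+    {
+        switch (status) {
+        case NFS4ERR_PERM:     return EPERM;
+        case NFS4ERR_NOENT:    return ENOENT;
+        case NFS4ERR_IO:       return EIO;
+        case NFS4ERR_ACCESS:   return EACCES;
+        case NFS4ERR_EXIST:    return EEXIST;
+        case NFS4ERR_XDEV:     return EXDEV;
+        case NFS4ERR_NOTDIR:   return ENOTDIR;
+        case NFS4ERR_ISDIR:    return EISDIR;
+        case NFS4ERR_FBIG:     return EFBIG;
+        case NFS4ERR_NOSPC:    return ENOSPC;
+        case NFS4ERR_ROFS:     return EROFS;
+        case NFS4ERR_MLINK:    return EMLINK;
+        case NFS4ERR_NOTEMPTY: return ENOTEMPTY;
+        case NFS4ERR_DQUOT:    return EDQUOT;
+        default:               return EIO;
+        }
+    }
+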
+
+15.1.4.8. NFS4ERR_NOENT (Error Code 2)
+
+ Indicates no such file or directory. The file or directory name
+ specified does not exist.
+
+15.1.4.9. NFS4ERR_NOSPC (Error Code 28)
+
+ Indicates there is no space left on the device. The operation would
+ have caused the server's file system to exceed its limit.
+
+15.1.4.10. NFS4ERR_NOTEMPTY (Error Code 66)
+
+ An attempt was made to remove a directory that was not empty.
+
+15.1.4.11. NFS4ERR_ROFS (Error Code 30)
+
+ Indicates a read-only file system. A modifying operation was
+ attempted on a read-only file system.
+
+15.1.4.12. NFS4ERR_XDEV (Error Code 18)
+
+ Indicates an attempt to do an operation, such as linking, that
+ inappropriately crosses a boundary. This may be due to such
+ boundaries as:
+
+ * that between file systems (where the fsids are different).
+
+ * that between different named attribute directories or between a
+ named attribute directory and an ordinary directory.
+
+ * that between byte-ranges of a file system that the file system
+ implementation treats as separate (for example, for space
+ accounting purposes), and where cross-connection between the byte-
+ ranges is not allowed.
+
+15.1.5. State Management Errors
+
+ These errors indicate problems with the stateid (or one of the
+ stateids) passed to a given operation. This includes situations in
+ which the stateid is invalid as well as situations in which the
+ stateid is valid but designates locking state that has been revoked.
+ Depending on the operation, the stateid when valid may designate
+ opens, byte-range locks, file or directory delegations, layouts, or
+ device maps.
+
+15.1.5.1. NFS4ERR_ADMIN_REVOKED (Error Code 10047)
+
+ A stateid designates locking state of any type that has been revoked
+ due to administrative interaction, possibly while the lease is valid.
+
+15.1.5.2. NFS4ERR_BAD_STATEID (Error Code 10025)
+
+ A stateid does not properly designate any valid state. See Sections
+ 8.2.4 and 8.2.3 for a discussion of how stateids are validated.
+
+15.1.5.3. NFS4ERR_DELEG_REVOKED (Error Code 10087)
+
+ A stateid designates recallable locking state of any type (delegation
+ or layout) that has been revoked due to the failure of the client to
+ return the lock when it was recalled.
+
+15.1.5.4. NFS4ERR_EXPIRED (Error Code 10011)
+
+ A stateid designates locking state of any type that has been revoked
+ due to expiration of the client's lease, either immediately upon
+ lease expiration, or following a later request for a conflicting
+ lock.
+
+15.1.5.5. NFS4ERR_OLD_STATEID (Error Code 10024)
+
+ A stateid with a non-zero seqid value does not match the current
+ seqid for the state designated by the user.
+
+15.1.6. Security Errors
+
+ These are the various permission-related errors in NFSv4.1.
+
+15.1.6.1. NFS4ERR_ACCESS (Error Code 13)
+
+ Indicates permission denied. The caller does not have the correct
+ permission to perform the requested operation. Contrast this with
+ NFS4ERR_PERM (Section 15.1.6.2), which restricts itself to owner or
+ privileged-user permission failures, and NFS4ERR_WRONG_CRED
+ (Section 15.1.6.4), which deals with appropriate permission to delete
+ or modify transient objects based on the credentials of the user that
+ created them.
+
+15.1.6.2. NFS4ERR_PERM (Error Code 1)
+
+ Indicates requester is not the owner. The operation was not allowed
+ because the caller is neither a privileged user (root) nor the owner
+ of the target of the operation.
+
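+ The division of labor between these two errors can be made concrete.
+ Below is a non-normative C sketch of the distinction: an operation
+ reserved to the owner or a privileged user fails with NFS4ERR_PERM,
+ while an ordinary permission-check failure yields NFS4ERR_ACCESS.
+ The structures and the uid representation are hypothetical
+ simplifications, not part of the protocol.
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct cred { uint32_t uid; };          /* hypothetical */
+    struct object { uint32_t owner_uid; };  /* hypothetical */
+
+    /* For operations that only the owner (or root) may perform. */
+    static uint32_t check_owner_or_privileged(const struct object *o,
+                                              const struct cred *who)
+    {
+        if (who->uid == 0 || who->uid == o->owner_uid)
+            return NFS4_OK;
+        return NFS4ERR_PERM;   /* requester is not the owner */
+    }
+
+    /* For ordinary mode/ACL checks on an operation. */
+    static uint32_t check_permission(bool allowed_by_mode_or_acl)
+    {
+        return allowed_by_mode_or_acl ? NFS4_OK : NFS4ERR_ACCESS;
+    }
+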
+15.1.6.3. NFS4ERR_WRONGSEC (Error Code 10016)
+
+ Indicates that the security mechanism being used by the client for
+ the operation does not match the server's security policy. The
+ client should change the security mechanism being used and re-send
+ the operation (but not with the same slot ID and sequence ID; one or
+ both MUST be different on the re-send). SECINFO and SECINFO_NO_NAME
+ can be used to determine the appropriate mechanism.
+
+15.1.6.4. NFS4ERR_WRONG_CRED (Error Code 10082)
+
+ An operation that manipulates state was attempted by a principal that
+ was not allowed to modify that piece of state.
+
+15.1.7. Name Errors
+
+ Names in NFSv4 are UTF-8 strings. When the strings are not valid
+ UTF-8 or are of length zero, the error NFS4ERR_INVAL results.
+ Besides this, there are a number of other errors to indicate specific
+ problems with names.
+
+15.1.7.1. NFS4ERR_BADCHAR (Error Code 10040)
+
+ A UTF-8 string contains a character that is not supported by the
+ server in the context in which it is being used.
+
+15.1.7.2. NFS4ERR_BADNAME (Error Code 10041)
+
+ A name string in a request consisted of valid UTF-8 characters
+ supported by the server, but the name is not supported by the server
+ as a valid name for the current operation. An example might be
+ creating a file or directory named ".." on a server whose file system
+ uses that name for links to parent directories.
+
+15.1.7.3. NFS4ERR_NAMETOOLONG (Error Code 63)
+
+ Returned when the filename in an operation exceeds the server's
+ implementation limit.
+
+15.1.8. Locking Errors
+
+ This section deals with errors related to locking, both as to share
+ reservations and byte-range locking. It does not deal with errors
+ specific to the process of reclaiming locks. Those are dealt with in
+ Section 15.1.9.
+
+15.1.8.1. NFS4ERR_BAD_RANGE (Error Code 10042)
+
+ The byte-range of a LOCK, LOCKT, or LOCKU operation is not allowed by
+ the server. For example, this error results when a server that only
+ supports 32-bit ranges receives a range that cannot be handled by
+ that server. (See Section 18.10.3.)
+
+15.1.8.2. NFS4ERR_DEADLOCK (Error Code 10045)
+
+ The server has been able to determine a byte-range locking deadlock
+ condition for a READW_LT or WRITEW_LT LOCK operation.
+
+15.1.8.3. NFS4ERR_DENIED (Error Code 10010)
+
+ An attempt to lock a file is denied. Since this may be a temporary
+ condition, the client is encouraged to re-send the lock request (but
+ not with the same slot ID and sequence ID; one or both MUST be
+ different on the re-send) until the lock is accepted. See
+ Section 9.6 for a discussion of the re-send; a non-normative sketch
+ of such a retry appears below (after Section 15.1.8.6).
+
+15.1.8.4. NFS4ERR_LOCKED (Error Code 10012)
+
+ A READ or WRITE operation was attempted on a file where there was a
+ conflict between the I/O and an existing lock:
+
+ * There is a share reservation inconsistent with the I/O being done.
+
+ * The range to be read or written intersects an existing mandatory
+ byte-range lock.
+
+15.1.8.5. NFS4ERR_LOCKS_HELD (Error Code 10037)
+
+ An operation was prevented by the unexpected presence of locks.
+
+15.1.8.6. NFS4ERR_LOCK_NOTSUPP (Error Code 10043)
+
+ A LOCK operation was attempted that would require the upgrade or
+ downgrade of a byte-range lock range already held by the owner, and
+ the server does not support atomic upgrade or downgrade of locks.
+
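+ The retry behavior described for NFS4ERR_DENIED (Section 15.1.8.3)
+ amounts to polling with a fresh slot ID/sequence ID pair on each
+ attempt. The following non-normative C sketch illustrates that
+ shape; nfs_client, seq_slot, next_seq_slot(), send_lock(), and
+ backoff_wait() are all hypothetical client-side helpers, not
+ protocol elements.
+
+    #include <stdint.h>
+
+    struct nfs_client;     /* hypothetical client handle */
+    struct seq_slot;       /* hypothetical slot/seqid pair */
+    struct lock_args;      /* hypothetical LOCK arguments */
+
+    extern struct seq_slot *next_seq_slot(struct nfs_client *);
+    extern uint32_t send_lock(struct nfs_client *, struct seq_slot *,
+                              const struct lock_args *);
+    extern void backoff_wait(struct nfs_client *);
+
+    static uint32_t lock_with_retry(struct nfs_client *clnt,
+                                    const struct lock_args *args)
+    {
+        for (;;) {
+            /* Each attempt is a new request: never reuse the
+             * previous slot ID and sequence ID together. */
+            uint32_t status = send_lock(clnt, next_seq_slot(clnt),
+                                        args);
+            if (status != NFS4ERR_DENIED)
+                return status;
+            backoff_wait(clnt);   /* wait before polling again */
+        }
+    }
+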
+15.1.8.7. NFS4ERR_LOCK_RANGE (Error Code 10028)
+
+ A LOCK operation is operating on a range that overlaps in part a
+ currently held byte-range lock for the current lock-owner and does
+ not precisely match a single such byte-range lock, where the server
+ does not support this type of request and thus does not implement
+ POSIX locking semantics [21]. See Sections 18.10.4, 18.11.4, and
+ 18.12.4 for a discussion of how this applies to LOCK, LOCKT, and
+ LOCKU, respectively.
+
+15.1.8.8. NFS4ERR_OPENMODE (Error Code 10038)
+
+ The client attempted a READ, WRITE, LOCK, or other operation not
+ sanctioned by the stateid passed (e.g., writing to a file opened for
+ read-only access).
+
+15.1.8.9. NFS4ERR_SHARE_DENIED (Error Code 10015)
+
+ An attempt to OPEN a file with a share reservation has failed because
+ of a share conflict.
+
+15.1.9. Reclaim Errors
+
+ These errors relate to the process of reclaiming locks after a server
+ restart.
+
+15.1.9.1. NFS4ERR_COMPLETE_ALREADY (Error Code 10054)
+
+ The client previously sent a successful RECLAIM_COMPLETE operation
+ specifying the same scope, whether that scope is global or for the
+ same file system in the case of a per-fs RECLAIM_COMPLETE. An
+ additional RECLAIM_COMPLETE operation is not necessary and results in
+ this error.
+
+15.1.9.2. NFS4ERR_GRACE (Error Code 10013)
+
+ This error is returned when the server is in its grace period with
+ regard to the file system object for which the lock was requested.
+ In this situation, a non-reclaim locking request cannot be granted.
+ This can occur because either:
+
+ * The server does not have sufficient information about locks that
+ might be potentially reclaimed to determine whether the lock could
+ be granted.
+
+ * The request is made by a client responsible for reclaiming its
+ locks that has not yet done the appropriate RECLAIM_COMPLETE
+ operation, allowing it to proceed to obtain new locks.
+
+ In the case of a per-fs grace period, there may be clients (i.e.,
+ those currently using the destination file system) who might be
+ unaware of the circumstances resulting in the initiation of the grace
+ period. Such clients need to periodically retry the request until
+ the grace period is over, just as other clients do.
+
+15.1.9.3. NFS4ERR_NO_GRACE (Error Code 10033)
+
+ A reclaim of client state was attempted in circumstances in which the
+ server cannot guarantee that conflicting state has not been provided
+ to another client. This occurs in any of the following situations:
+
+ * There is no active grace period applying to the file system object
+ for which the request was made.
+
+ * The client making the request has no current role in reclaiming
+ locks.
+
+ * Previous operations have created a situation in which the server
+ is not able to determine that a reclaim-interfering edge condition
+ does not exist.
+
+15.1.9.4. NFS4ERR_RECLAIM_BAD (Error Code 10034)
+
+ The server has determined that a reclaim attempted by the client is
+ not valid, i.e., the lock specified as being reclaimed could not
+ possibly have existed before the server restart or file system
+ migration event. A server is not obliged to make this determination
+ and will typically rely on the client to only reclaim locks that the
+ client was granted prior to restart. However, when a server does
+ have reliable information to enable it to make this determination,
+ this error indicates that the reclaim has been rejected as invalid.
+ This is as opposed to the error NFS4ERR_RECLAIM_CONFLICT (see
+ Section 15.1.9.5) where the server can only determine that there has
+ been an invalid reclaim, but cannot determine which request is
+ invalid.
+
+15.1.9.5. NFS4ERR_RECLAIM_CONFLICT (Error Code 10035)
+
+ The reclaim attempted by the client has encountered a conflict and
+ cannot be satisfied. This potentially indicates a misbehaving
+ client, although not necessarily the one receiving the error. The
+ misbehavior might be on the part of the client that established the
+ lock with which this client conflicted. See also Section 15.1.9.4
+ for the related error, NFS4ERR_RECLAIM_BAD.
+
+15.1.10. pNFS Errors
+
+ This section deals with pNFS-related errors including those that are
+ associated with using NFSv4.1 to communicate with a data server.
+
+15.1.10.1. NFS4ERR_BADIOMODE (Error Code 10049)
+
+ An invalid or inappropriate layout iomode was specified. For
+ example, suppose a client's LAYOUTGET operation specified an iomode
+ of LAYOUTIOMODE4_RW, and the server is neither able nor willing to
+ let the client send write requests to data servers; the server can
+ reply with NFS4ERR_BADIOMODE. The client would then send another
+ LAYOUTGET with an iomode of LAYOUTIOMODE4_READ (a non-normative
+ sketch of this fallback appears at the end of this section).
+
+15.1.10.2. NFS4ERR_BADLAYOUT (Error Code 10050)
+
+ The layout specified is invalid in some way. For LAYOUTCOMMIT, this
+ indicates that the specified layout is not held by the client or is
+ not of mode LAYOUTIOMODE4_RW. For LAYOUTGET, it indicates that a
+ layout matching the client's specification as to minimum length
+ cannot be granted.
+
+15.1.10.3. NFS4ERR_LAYOUTTRYLATER (Error Code 10058)
+
+ Layouts are temporarily unavailable for the file. The client should
+ re-send later (but not with the same slot ID and sequence ID; one or
+ both MUST be different on the re-send).
+
+15.1.10.4. NFS4ERR_LAYOUTUNAVAILABLE (Error Code 10059)
+
+ Returned when layouts are not available for the current file system
+ or the particular specified file.
+
+15.1.10.5. NFS4ERR_NOMATCHING_LAYOUT (Error Code 10060)
+
+ Returned when layouts are recalled and the client has no layouts
+ matching the specification of the layouts being recalled.
+
+15.1.10.6. NFS4ERR_PNFS_IO_HOLE (Error Code 10075)
+
+ The pNFS client has attempted to read from or write to an illegal
+ hole of a file of a data server that is using sparse packing. See
+ Section 13.4.4.
+
+15.1.10.7. NFS4ERR_PNFS_NO_LAYOUT (Error Code 10080)
+
+ The pNFS client has attempted to read from or write to a file (using
+ a request to a data server) without holding a valid layout. This
+ includes the case where the client had a layout, but the iomode does
+ not allow a WRITE.
+
+15.1.10.8. NFS4ERR_RETURNCONFLICT (Error Code 10086)
+
+ A layout is unavailable due to an attempt to perform the LAYOUTGET
+ before a pending LAYOUTRETURN on the file has been received. See
+ Section 12.5.5.2.1.3.
+
+15.1.10.9. NFS4ERR_UNKNOWN_LAYOUTTYPE (Error Code 10062)
+
+ The client has specified a layout type that is not supported by the
+ server.
+
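+ The read-only fallback described for NFS4ERR_BADIOMODE
+ (Section 15.1.10.1) is mechanical enough to show in outline. In the
+ non-normative C sketch below, layoutget() is a hypothetical wrapper
+ that issues the LAYOUTGET operation (on a fresh slot/sequence for
+ each attempt) and fills in an opaque client-side layout structure.
+
+    #include <stdint.h>
+
+    struct nfs_client;     /* hypothetical client handle */
+    struct layout;         /* hypothetical layout cache entry */
+
+    extern uint32_t layoutget(struct nfs_client *, int iomode,
+                              struct layout *out);
+
+    static uint32_t get_layout_with_fallback(struct nfs_client *clnt,
+                                             struct layout *out)
+    {
+        /* Ask for a read/write layout first... */
+        uint32_t status = layoutget(clnt, LAYOUTIOMODE4_RW, out);
+
+        /* ...and settle for a read-only one if the server is
+         * unable or unwilling to allow writes through data
+         * servers. */
+        if (status == NFS4ERR_BADIOMODE)
+            status = layoutget(clnt, LAYOUTIOMODE4_READ, out);
+        return status;
+    }
+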
+15.1.11. Session Use Errors
+
+ This section deals with errors encountered when using sessions, that
+ is, errors encountered when a request uses a Sequence (i.e., either
+ SEQUENCE or CB_SEQUENCE) operation.
+
+15.1.11.1. NFS4ERR_BADSESSION (Error Code 10052)
+
+ The specified session ID is unknown to the server to which the
+ operation is addressed.
+
+15.1.11.2. NFS4ERR_BADSLOT (Error Code 10053)
+
+ The requester sent a Sequence operation that attempted to use a slot
+ the replier does not have in its slot table. It is possible the slot
+ may have been retired.
+
+15.1.11.3. NFS4ERR_BAD_HIGH_SLOT (Error Code 10077)
+
+ The highest_slotid argument in a Sequence operation exceeds the
+ replier's enforced highest_slotid.
+
+15.1.11.4. NFS4ERR_CB_PATH_DOWN (Error Code 10048)
+
+ There is a problem contacting the client via the callback path. The
+ function of this error has been mostly superseded by the use of
+ status flags in the reply to the SEQUENCE operation (see
+ Section 18.46).
+
+15.1.11.5. NFS4ERR_DEADSESSION (Error Code 10078)
+
+ The specified session is a persistent session that is dead and does
+ not accept new requests or perform new operations on existing
+ requests (in the case in which a request was partially executed
+ before server restart).
+
+15.1.11.6. NFS4ERR_CONN_NOT_BOUND_TO_SESSION (Error Code 10055)
+
+ A Sequence operation was sent on a connection that has not been
+ associated with the specified session, where the client specified
+ that connection association was to be enforced with SP4_MACH_CRED or
+ SP4_SSV state protection.
+
+15.1.11.7. NFS4ERR_SEQ_FALSE_RETRY (Error Code 10076)
+
+ The requester sent a Sequence operation with a slot ID and sequence
+ ID that are in the reply cache, but the replier has detected that the
+ retried request is not the same as the original request. See
+ Section 2.10.6.1.3.1.
+
+15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063)
+
+ The requester sent a Sequence operation with an invalid sequence ID.
+
+15.1.12. Session Management Errors
+
+ This section deals with errors associated with requests used in
+ session management.
+
+15.1.12.1. NFS4ERR_BACK_CHAN_BUSY (Error Code 10057)
+
+ An attempt was made to destroy a session when the session cannot be
+ destroyed because the server has callback requests outstanding.
+
+15.1.12.2. NFS4ERR_BAD_SESSION_DIGEST (Error Code 10051)
+
+ The digest used in a SET_SSV request is not valid.
+
+15.1.13. Client Management Errors
+
+ This section deals with errors associated with requests used to
+ create and manage client IDs.
+
+15.1.13.1. NFS4ERR_CLIENTID_BUSY (Error Code 10074)
+
+ The DESTROY_CLIENTID operation has found there are sessions and/or
+ unexpired state associated with the client ID to be destroyed.
+
+15.1.13.2. NFS4ERR_CLID_INUSE (Error Code 10017)
+
+ While processing an EXCHANGE_ID operation, the server was presented
+ with a co_ownerid field that matches an existing client with valid
+ leased state, but the principal sending the EXCHANGE_ID operation
+ differs from the principal that established the existing client.
+ This indicates a collision (most likely due to chance) between
+ clients. The client should recover by changing the co_ownerid and
+ re-sending EXCHANGE_ID (but not with the same slot ID and sequence
+ ID; one or both MUST be different on the re-send). A non-normative
+ sketch of this recovery appears below (after Section 15.1.13.4).
+
+15.1.13.3. NFS4ERR_ENCR_ALG_UNSUPP (Error Code 10079)
+
+ An EXCHANGE_ID was sent that specified state protection via SSV, and
+ where the set of encryption algorithms presented by the client did
+ not include any supported by the server.
+
+15.1.13.4. NFS4ERR_HASH_ALG_UNSUPP (Error Code 10072)
+
+ An EXCHANGE_ID was sent that specified state protection via SSV, and
+ where the set of hashing algorithms presented by the client did not
+ include any supported by the server.
+
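+ The recovery described for NFS4ERR_CLID_INUSE (Section 15.1.13.2)
+ can be outlined as a small loop. In the non-normative C sketch
+ below, exchange_id() is a hypothetical wrapper that issues the
+ EXCHANGE_ID operation (on a fresh slot/sequence for each attempt),
+ and uniquify() is a hypothetical helper that perturbs the
+ co_ownerid string, for example by appending a counter.
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct nfs_client;     /* hypothetical client handle */
+
+    extern uint32_t exchange_id(struct nfs_client *,
+                                const char *co_ownerid);
+    extern void uniquify(char *co_ownerid, size_t len);
+
+    static uint32_t establish_client_id(struct nfs_client *clnt,
+                                        char *co_ownerid, size_t len)
+    {
+        uint32_t status;
+
+        /* On a collision with another client's co_ownerid, change
+         * ours and try again with a new EXCHANGE_ID. */
+        while ((status = exchange_id(clnt, co_ownerid)) ==
+               NFS4ERR_CLID_INUSE)
+            uniquify(co_ownerid, len);
+        return status;
+    }
+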
+15.1.13.5. NFS4ERR_STALE_CLIENTID (Error Code 10022)
+
+ A client ID not recognized by the server was passed to an operation.
+ Note that unlike the case of NFSv4.0, client IDs are not passed
+ explicitly to the server in ordinary locking operations and cannot
+ result in this error. Instead, when there is a server restart, it is
+ first manifested through an error on the associated session, and the
+ staleness of the client ID is detected when trying to associate a
+ client ID with a new session.
+
+15.1.14. Delegation Errors
+
+ This section deals with errors associated with requesting and
+ returning delegations.
+
+15.1.14.1. NFS4ERR_DELEG_ALREADY_WANTED (Error Code 10056)
+
+ The client has requested a delegation when it had already registered
+ that it wants that same delegation.
+
+15.1.14.2. NFS4ERR_DIRDELEG_UNAVAIL (Error Code 10084)
+
+ This error is returned when the server is unable or unwilling to
+ provide a requested directory delegation.
+
+15.1.14.3. NFS4ERR_RECALLCONFLICT (Error Code 10061)
+
+ A recallable object (i.e., a layout or delegation) is unavailable due
+ to a conflicting recall operation that is currently in progress for
+ that object.
+
+15.1.14.4. NFS4ERR_REJECT_DELEG (Error Code 10085)
+
+ The callback operation invoked to deal with a new delegation has
+ rejected it.
+
+15.1.15. Attribute Handling Errors
+
+ This section deals with errors specific to attribute handling within
+ NFSv4.
+
+15.1.15.1. NFS4ERR_ATTRNOTSUPP (Error Code 10032)
+
+ An attribute specified is not supported by the server. This error
+ MUST NOT be returned by the GETATTR operation.
+
+15.1.15.2. NFS4ERR_BADOWNER (Error Code 10039)
+
+ This error is returned when an owner or owner_group attribute value
+ or the who field of an ACE within an ACL attribute value cannot be
+ translated to a local representation.
+
+15.1.15.3. NFS4ERR_NOT_SAME (Error Code 10027)
+
+ This error is returned by the VERIFY operation to signify that the
+ attributes compared were not the same as those provided in the
+ client's request.
+
+15.1.15.4. NFS4ERR_SAME (Error Code 10009)
+
+ This error is returned by the NVERIFY operation to signify that the
+ attributes compared were the same as those provided in the client's
+ request.
+
+15.1.16. Obsoleted Errors
+
+ These errors MUST NOT be generated by any NFSv4.1 operation. This
+ can be for a number of reasons.
+
+ * The function provided by the error has been superseded by one of
+ the status bits returned by the SEQUENCE operation.
+
+ * The new session structure and associated change in locking have
+ made the error unnecessary.
+
+ * There has been a restructuring of some errors for NFSv4.1 that
+ resulted in the elimination of certain errors.
+
+15.1.16.1. NFS4ERR_BAD_SEQID (Error Code 10026)
+
+ The sequence number (seqid) in a locking request is neither the next
+ expected number nor the last number processed. These seqids are
+ ignored in NFSv4.1.
+
+15.1.16.2. NFS4ERR_LEASE_MOVED (Error Code 10031)
+
+ A lease being renewed is associated with a file system that has been
+ migrated to a new server. The error has been superseded by the
+ SEQ4_STATUS_LEASE_MOVED status bit (see Section 18.46).
+
+15.1.16.3. NFS4ERR_NXIO (Error Code 6)
+
+ I/O error. No such device or address. This error is for errors
+ involving block and character device access, but because NFSv4.1 is
+ not a device-access protocol, this error is not applicable.
+
+15.1.16.4. 
NFS4ERR_RESTOREFH (Error Code 10030) + + The RESTOREFH operation does not have a saved filehandle (identified + by SAVEFH) to operate upon. In NFSv4.1, this error has been + superseded by NFS4ERR_NOFILEHANDLE. + +15.1.16.5. NFS4ERR_STALE_STATEID (Error Code 10023) + + A stateid generated by an earlier server instance was used. This + error is moot in NFSv4.1 because all operations that take a stateid + MUST be preceded by the SEQUENCE operation, and the earlier server + instance is detected by the session infrastructure that supports + SEQUENCE. + +15.2. Operations and Their Valid Errors + + This section contains a table that gives the valid error returns for + each protocol operation. The error code NFS4_OK (indicating no + error) is not listed but should be understood to be returnable by all + operations with two important exceptions: + + * The operations that MUST NOT be implemented: OPEN_CONFIRM, + RELEASE_LOCKOWNER, RENEW, SETCLIENTID, and SETCLIENTID_CONFIRM. + + * The invalid operation: ILLEGAL. + + +======================+========================================+ + | Operation | Errors | + +======================+========================================+ + | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | BACKCHANNEL_CTL | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_INVAL, | + | | NFS4ERR_NOENT, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | BIND_CONN_TO_SESSION | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_SESSION_DIGEST, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_INVAL, NFS4ERR_NOT_ONLY_OP, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | CLOSE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_LOCKS_HELD, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | COMMIT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | + | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | 
NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | CREATE | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | + | | NFS4ERR_BADCHAR, NFS4ERR_BADNAME, | + | | NFS4ERR_BADOWNER, NFS4ERR_BADTYPE, | + | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | + | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MLINK, NFS4ERR_MOVED, | + | | NFS4ERR_NAMETOOLONG, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | + | | NFS4ERR_NOTDIR, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNSAFE_COMPOUND | + +----------------------+----------------------------------------+ + | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_INVAL, NFS4ERR_NOENT, | + | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOSPC, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SEQ_MISORDERED, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE_CLIENTID, | + | | NFS4ERR_TOOSMALL, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | DELEGRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DELEG_REVOKED, | + | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_INVAL, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | DESTROY_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLIENTID_BUSY, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_NOT_ONLY_OP, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE_CLIENTID, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | DESTROY_SESSION | NFS4ERR_BACK_CHAN_BUSY, | + | | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | + | | NFS4ERR_CB_PATH_DOWN, | + | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_NOT_ONLY_OP, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE_CLIENTID, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + 
+----------------------+----------------------------------------+ + | EXCHANGE_ID | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | + | | NFS4ERR_CLID_INUSE, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_ENCR_ALG_UNSUPP, | + | | NFS4ERR_HASH_ALG_UNSUPP, | + | | NFS4ERR_INVAL, NFS4ERR_NOENT, | + | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOT_SAME, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | FREE_STATEID | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_LOCKS_HELD, | + | | NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | GET_DIR_DELEGATION | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DIRDELEG_UNAVAIL, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | GETATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | GETDEVICEINFO | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_INVAL, | + | | NFS4ERR_NOENT, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_TOOSMALL, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNKNOWN_LAYOUTTYPE | + +----------------------+----------------------------------------+ + | GETDEVICELIST | NFS4ERR_BADXDR, NFS4ERR_BAD_COOKIE, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOTSUPP, NFS4ERR_NOT_SAME, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNKNOWN_LAYOUTTYPE | + +----------------------+----------------------------------------+ + | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_STALE | + +----------------------+----------------------------------------+ + | ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | + 
+----------------------+----------------------------------------+
+ | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, |
+ | | NFS4ERR_ATTRNOTSUPP, |
+ | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, |
+ | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, |
+ | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, |
+ | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, |
+ | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, |
+ | | NFS4ERR_INVAL, NFS4ERR_IO, |
+ | | NFS4ERR_ISDIR, NFS4ERR_MOVED, |
+ | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_NO_GRACE, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_RECLAIM_BAD, |
+ | | NFS4ERR_RECLAIM_CONFLICT, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
+ | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_UNKNOWN_LAYOUTTYPE, |
+ | | NFS4ERR_WRONG_CRED |
+ +----------------------+----------------------------------------+
+ | LAYOUTGET | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, |
+ | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, |
+ | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, |
+ | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, |
+ | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, |
+ | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, |
+ | | NFS4ERR_INVAL, NFS4ERR_IO, |
+ | | NFS4ERR_LAYOUTTRYLATER, |
+ | | NFS4ERR_LAYOUTUNAVAILABLE, |
+ | | NFS4ERR_LOCKED, NFS4ERR_MOVED, |
+ | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, |
+ | | NFS4ERR_NOTSUPP, NFS4ERR_OLD_STATEID, |
+ | | NFS4ERR_OPENMODE, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_RECALLCONFLICT, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
+ | | NFS4ERR_TOOSMALL, |
+ | | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_UNKNOWN_LAYOUTTYPE, |
+ | | NFS4ERR_WRONG_TYPE |
+ +----------------------+----------------------------------------+
+ | LAYOUTRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, |
+ | | NFS4ERR_BAD_STATEID, |
+ | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, |
+ | | NFS4ERR_DELEG_REVOKED, |
+ | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, |
+ | | NFS4ERR_GRACE, NFS4ERR_INVAL, |
+ | | NFS4ERR_ISDIR, NFS4ERR_MOVED, |
+ | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, |
+ | | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_UNKNOWN_LAYOUTTYPE, |
+ | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE |
+ +----------------------+----------------------------------------+
+ | LINK | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, |
+ | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, |
+ | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, |
+ | | NFS4ERR_DQUOT, NFS4ERR_EXIST, |
+ | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, |
+ | | NFS4ERR_GRACE, NFS4ERR_INVAL, |
+ | | NFS4ERR_ISDIR, NFS4ERR_IO, |
+ | | NFS4ERR_MLINK, NFS4ERR_MOVED, |
+ | | NFS4ERR_NAMETOOLONG, |
+ | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, |
+ | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, |
+ | | NFS4ERR_STALE, NFS4ERR_SYMLINK, |
+ | | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE, |
+ | | NFS4ERR_XDEV |
+ 
+----------------------+----------------------------------------+ + | LOCK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | + | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | + | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADLOCK, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DENIED, NFS4ERR_EXPIRED, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | + | | NFS4ERR_LOCK_NOTSUPP, | + | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OPENMODE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_RECLAIM_BAD, | + | | NFS4ERR_RECLAIM_CONFLICT, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | LOCKT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_RANGE, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DENIED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_GRACE, NFS4ERR_INVAL, | + | | NFS4ERR_ISDIR, NFS4ERR_LOCK_RANGE, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_STALE, | + | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | LOCKU | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | + | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | + | | NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_INVAL, NFS4ERR_LOCK_RANGE, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | LOOKUP | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONGSEC | + +----------------------+----------------------------------------+ + | LOOKUPP | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOTDIR, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONGSEC | + +----------------------+----------------------------------------+ + | NVERIFY | 
NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | + | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SAME, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | OPEN | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | + | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADNAME, NFS4ERR_BADOWNER, | + | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DELEG_ALREADY_WANTED, | + | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | + | | NFS4ERR_EXIST, NFS4ERR_EXPIRED, | + | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_GRACE, NFS4ERR_INVAL, | + | | NFS4ERR_ISDIR, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | + | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | + | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_PERM, NFS4ERR_RECLAIM_BAD, | + | | NFS4ERR_RECLAIM_CONFLICT, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_SHARE_DENIED, NFS4ERR_STALE, | + | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNSAFE_COMPOUND, | + | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | OPEN_CONFIRM | NFS4ERR_NOTSUPP | + +----------------------+----------------------------------------+ + | OPEN_DOWNGRADE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_INVAL, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OLD_STATEID, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED | + +----------------------+----------------------------------------+ + | OPENATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOSPC, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNSAFE_COMPOUND, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | PUTFH | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_MOVED, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | + 
+----------------------+----------------------------------------+ + | PUTPUBFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | + +----------------------+----------------------------------------+ + | PUTROOTFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | + +----------------------+----------------------------------------+ + | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | + | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DELEG_REVOKED, | + | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_GRACE, NFS4ERR_INVAL, | + | | NFS4ERR_ISDIR, NFS4ERR_IO, | + | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_PNFS_IO_HOLE, | + | | NFS4ERR_PNFS_NO_LAYOUT, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_COOKIE, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | + | | NFS4ERR_NOT_SAME, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | READLINK | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | RECLAIM_COMPLETE | NFS4ERR_BADXDR, | + | | NFS4ERR_COMPLETE_ALREADY, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | RELEASE_LOCKOWNER | NFS4ERR_NOTSUPP | + +----------------------+----------------------------------------+ + | REMOVE | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, 
NFS4ERR_FILE_OPEN, | + | | NFS4ERR_GRACE, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_MOVED, | + | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | + | | NFS4ERR_NOTEMPTY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | RENAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | + | | NFS4ERR_GRACE, NFS4ERR_INVAL, | + | | NFS4ERR_IO, NFS4ERR_MLINK, | + | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | + | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOSPC, NFS4ERR_NOTDIR, | + | | NFS4ERR_NOTEMPTY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONGSEC, NFS4ERR_XDEV | + +----------------------+----------------------------------------+ + | RENEW | NFS4ERR_NOTSUPP | + +----------------------+----------------------------------------+ + | RESTOREFH | NFS4ERR_DEADSESSION, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | + +----------------------+----------------------------------------+ + | SAVEFH | NFS4ERR_DEADSESSION, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | SECINFO | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | + | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOTDIR, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | SECINFO_NO_NAME | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | + | | NFS4ERR_MOVED, NFS4ERR_NOENT, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | + | | NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | + | | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT, | + | | 
NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SEQUENCE_POS, | + | | NFS4ERR_SEQ_FALSE_RETRY, | + | | NFS4ERR_SEQ_MISORDERED, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | SET_SSV | NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_SESSION_DIGEST, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_INVAL, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | SETATTR | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | + | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | + | | NFS4ERR_BADOWNER, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_STATEID, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | + | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | + | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | + | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, | + | | NFS4ERR_STALE, NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | SETCLIENTID | NFS4ERR_NOTSUPP | + +----------------------+----------------------------------------+ + | SETCLIENTID_CONFIRM | NFS4ERR_NOTSUPP | + +----------------------+----------------------------------------+ + | TEST_STATEID | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +----------------------+----------------------------------------+ + | VERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | + | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | + | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOT_SAME, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | + | | NFS4ERR_WRONG_TYPE | + +----------------------+----------------------------------------+ + | WANT_DELEGATION | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | + | | NFS4ERR_DELAY, | + | | NFS4ERR_DELEG_ALREADY_WANTED, | + | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | + | | NFS4ERR_INVAL, NFS4ERR_IO, | + | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | + | | NFS4ERR_NOTSUPP, NFS4ERR_NO_GRACE, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_RECALLCONFLICT, | + | | NFS4ERR_RECLAIM_BAD, | + | | NFS4ERR_RECLAIM_CONFLICT, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | + 
| | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_WRONG_TYPE |
+ +----------------------+----------------------------------------+
+ | WRITE | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, |
+ | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, |
+ | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, |
+ | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, |
+ | | NFS4ERR_EXPIRED, NFS4ERR_FBIG, |
+ | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, |
+ | | NFS4ERR_INVAL, NFS4ERR_IO, |
+ | | NFS4ERR_ISDIR, NFS4ERR_LOCKED, |
+ | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, |
+ | | NFS4ERR_NOSPC, NFS4ERR_OLD_STATEID, |
+ | | NFS4ERR_OPENMODE, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_PNFS_IO_HOLE, |
+ | | NFS4ERR_PNFS_NO_LAYOUT, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_ROFS, NFS4ERR_SERVERFAULT, |
+ | | NFS4ERR_STALE, NFS4ERR_SYMLINK, |
+ | | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_WRONG_TYPE |
+ +----------------------+----------------------------------------+
+
+ Table 12: Valid Error Returns for Each Protocol Operation
+
+15.3. Callback Operations and Their Valid Errors
+
+ This section contains a table that gives the valid error returns for
+ each callback operation. The error code NFS4_OK (indicating no
+ error) is not listed but should be understood to be returnable by all
+ callback operations with the exception of CB_ILLEGAL.
+
+ +=========================+=======================================+
+ | Callback Operation | Errors |
+ +=========================+=======================================+
+ | CB_GETATTR | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, |
+ | | NFS4ERR_DELAY, NFS4ERR_INVAL, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, |
+ | | NFS4ERR_TOO_MANY_OPS |
+ +-------------------------+---------------------------------------+
+ | CB_ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL |
+ +-------------------------+---------------------------------------+
+ | CB_LAYOUTRECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADIOMODE, |
+ | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, |
+ | | NFS4ERR_DELAY, NFS4ERR_INVAL, |
+ | | NFS4ERR_NOMATCHING_LAYOUT, |
+ | | NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_TOO_MANY_OPS, |
+ | | NFS4ERR_UNKNOWN_LAYOUTTYPE, |
+ | | NFS4ERR_WRONG_TYPE |
+ +-------------------------+---------------------------------------+
+ | CB_NOTIFY | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, |
+ | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, |
+ | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, |
+ | | NFS4ERR_TOO_MANY_OPS |
+ +-------------------------+---------------------------------------+
+ | CB_NOTIFY_DEVICEID | NFS4ERR_BADXDR, NFS4ERR_DELAY, |
+ | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, |
+ | | NFS4ERR_OP_NOT_IN_SESSION, |
+ | | NFS4ERR_REP_TOO_BIG, |
+ | | NFS4ERR_REP_TOO_BIG_TO_CACHE, |
+ | | NFS4ERR_REQ_TOO_BIG, |
+ | | NFS4ERR_RETRY_UNCACHED_REP, |
+ | | NFS4ERR_SERVERFAULT, |
+ | | NFS4ERR_TOO_MANY_OPS |
+ +-------------------------+---------------------------------------+
+ | CB_NOTIFY_LOCK | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, |
+ | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, |
+ | | 
NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_PUSH_DELEG | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | + | | NFS4ERR_DELAY, NFS4ERR_INVAL, | + | | NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REJECT_DELEG, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS, | + | | NFS4ERR_WRONG_TYPE | + +-------------------------+---------------------------------------+ + | CB_RECALL | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_RECALL_ANY | NFS4ERR_BADXDR, NFS4ERR_DELAY, | + | | NFS4ERR_INVAL, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_RECALLABLE_OBJ_AVAIL | NFS4ERR_BADXDR, NFS4ERR_DELAY, | + | | NFS4ERR_INVAL, NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_RECALL_SLOT | NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_HIGH_SLOT, NFS4ERR_DELAY, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_SEQUENCE | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT, | + | | NFS4ERR_BADXDR, | + | | NFS4ERR_BAD_HIGH_SLOT, | + | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | + | | NFS4ERR_DELAY, NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SEQUENCE_POS, | + | | NFS4ERR_SEQ_FALSE_RETRY, | + | | NFS4ERR_SEQ_MISORDERED, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + | CB_WANTS_CANCELLED | NFS4ERR_BADXDR, NFS4ERR_DELAY, | + | | NFS4ERR_NOTSUPP, | + | | NFS4ERR_OP_NOT_IN_SESSION, | + | | NFS4ERR_REP_TOO_BIG, | + | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | + | | NFS4ERR_REQ_TOO_BIG, | + | | NFS4ERR_RETRY_UNCACHED_REP, | + | | NFS4ERR_SERVERFAULT, | + | | NFS4ERR_TOO_MANY_OPS | + +-------------------------+---------------------------------------+ + + Table 13: Valid Error Returns for Each Protocol Callback Operation + +15.4. 
Errors and the Operations That Use Them + + +===================================+===============================+ + | Error | Operations | + +===================================+===============================+ + | NFS4ERR_ACCESS | ACCESS, COMMIT, CREATE, | + | | GETATTR, GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LINK, LOCK, LOCKT, LOCKU, | + | | LOOKUP, LOOKUPP, NVERIFY, | + | | OPEN, OPENATTR, READ, | + | | READDIR, READLINK, REMOVE, | + | | RENAME, SECINFO, | + | | SECINFO_NO_NAME, SETATTR, | + | | VERIFY, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_ADMIN_REVOKED | CLOSE, DELEGRETURN, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LOCK, LOCKU, | + | | OPEN, OPEN_DOWNGRADE, READ, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_ATTRNOTSUPP | CREATE, LAYOUTCOMMIT, | + | | NVERIFY, OPEN, SETATTR, | + | | VERIFY | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BACK_CHAN_BUSY | DESTROY_SESSION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADCHAR | CREATE, EXCHANGE_ID, LINK, | + | | LOOKUP, NVERIFY, OPEN, | + | | REMOVE, RENAME, SECINFO, | + | | SETATTR, VERIFY | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADHANDLE | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | PUTFH | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADIOMODE | CB_LAYOUTRECALL, | + | | LAYOUTCOMMIT, LAYOUTGET | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADLAYOUT | LAYOUTCOMMIT, LAYOUTGET | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADNAME | CREATE, LINK, LOOKUP, OPEN, | + | | REMOVE, RENAME, SECINFO | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADOWNER | CREATE, OPEN, SETATTR | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADSESSION | BIND_CONN_TO_SESSION, | + | | CB_SEQUENCE, | + | | DESTROY_SESSION, SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADSLOT | CB_SEQUENCE, SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADTYPE | CREATE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BADXDR | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_ILLEGAL, | + | | CB_LAYOUTRECALL, CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, ILLEGAL, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | NVERIFY, OPEN, OPENATTR, | + | | OPEN_DOWNGRADE, PUTFH, READ, | + | | READDIR, RECLAIM_COMPLETE, | + | | REMOVE, RENAME, SECINFO, | + | | SECINFO_NO_NAME, SEQUENCE, | + | | SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + 
+-----------------------------------+-------------------------------+ + | NFS4ERR_BAD_COOKIE | GETDEVICELIST, READDIR | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BAD_HIGH_SLOT | CB_RECALL_SLOT, CB_SEQUENCE, | + | | SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BAD_RANGE | LOCK, LOCKT, LOCKU | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BAD_SESSION_DIGEST | BIND_CONN_TO_SESSION, | + | | SET_SSV | + +-----------------------------------+-------------------------------+ + | NFS4ERR_BAD_STATEID | CB_LAYOUTRECALL, CB_NOTIFY, | + | | CB_NOTIFY_LOCK, CB_RECALL, | + | | CLOSE, DELEGRETURN, | + | | FREE_STATEID, LAYOUTGET, | + | | LAYOUTRETURN, LOCK, LOCKU, | + | | OPEN, OPEN_DOWNGRADE, READ, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_CB_PATH_DOWN | DESTROY_SESSION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_CLID_INUSE | CREATE_SESSION, EXCHANGE_ID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_CLIENTID_BUSY | DESTROY_CLIENTID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_COMPLETE_ALREADY | RECLAIM_COMPLETE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, | + | | DESTROY_SESSION, SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DEADLOCK | LOCK | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DEADSESSION | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DELAY | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, SECINFO, | + | | SECINFO_NO_NAME, SEQUENCE, | + | | SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + 
+-----------------------------------+-------------------------------+ + | NFS4ERR_DELEG_ALREADY_WANTED | OPEN, WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DELEG_REVOKED | DELEGRETURN, LAYOUTCOMMIT, | + | | LAYOUTGET, LAYOUTRETURN, | + | | OPEN, READ, SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DENIED | LOCK, LOCKT | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DIRDELEG_UNAVAIL | GET_DIR_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_DQUOT | CREATE, LAYOUTGET, LINK, | + | | OPEN, OPENATTR, RENAME, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_ENCR_ALG_UNSUPP | EXCHANGE_ID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_EXIST | CREATE, LINK, OPEN, RENAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_EXPIRED | CLOSE, DELEGRETURN, | + | | LAYOUTCOMMIT, LAYOUTRETURN, | + | | LOCK, LOCKU, OPEN, | + | | OPEN_DOWNGRADE, READ, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_FBIG | LAYOUTCOMMIT, OPEN, SETATTR, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_FHEXPIRED | ACCESS, CLOSE, COMMIT, | + | | CREATE, DELEGRETURN, | + | | GETATTR, GETDEVICELIST, | + | | GETFH, GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SETATTR, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_FILE_OPEN | LINK, REMOVE, RENAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_GRACE | GETATTR, GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, NVERIFY, OPEN, READ, | + | | REMOVE, RENAME, SETATTR, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_HASH_ALG_UNSUPP | EXCHANGE_ID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_INVAL | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_PUSH_DELEG, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, CREATE, | + | | CREATE_SESSION, DELEGRETURN, | + | | EXCHANGE_ID, GETATTR, | + | | GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | NVERIFY, OPEN, | + | | OPEN_DOWNGRADE, READ, | + | | READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, SECINFO, | + | | SECINFO_NO_NAME, SETATTR, | + | | SET_SSV, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_IO | ACCESS, COMMIT, CREATE, | + | | GETATTR, GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LINK, LOOKUP, LOOKUPP, | + | | NVERIFY, OPEN, OPENATTR, | + | | READ, READDIR, READLINK, | + | | REMOVE, 
RENAME, SETATTR, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_ISDIR | COMMIT, LAYOUTCOMMIT, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, OPEN, READ, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LAYOUTTRYLATER | LAYOUTGET | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LAYOUTUNAVAILABLE | LAYOUTGET | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LOCKED | LAYOUTGET, READ, SETATTR, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LOCKS_HELD | CLOSE, FREE_STATEID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LOCK_NOTSUPP | LOCK | + +-----------------------------------+-------------------------------+ + | NFS4ERR_LOCK_RANGE | LOCK, LOCKT, LOCKU | + +-----------------------------------+-------------------------------+ + | NFS4ERR_MLINK | CREATE, LINK, RENAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_MOVED | ACCESS, CLOSE, COMMIT, | + | | CREATE, DELEGRETURN, | + | | GETATTR, GETFH, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, READ, READDIR, | + | | READLINK, RECLAIM_COMPLETE, | + | | REMOVE, RENAME, RESTOREFH, | + | | SAVEFH, SECINFO, | + | | SECINFO_NO_NAME, SETATTR, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NAMETOOLONG | CREATE, LINK, LOOKUP, OPEN, | + | | REMOVE, RENAME, SECINFO | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOENT | BACKCHANNEL_CTL, | + | | CREATE_SESSION, EXCHANGE_ID, | + | | GETDEVICEINFO, LOOKUP, | + | | LOOKUPP, OPEN, OPENATTR, | + | | REMOVE, RENAME, SECINFO, | + | | SECINFO_NO_NAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOFILEHANDLE | ACCESS, CLOSE, COMMIT, | + | | CREATE, DELEGRETURN, | + | | GETATTR, GETDEVICELIST, | + | | GETFH, GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SETATTR, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOMATCHING_LAYOUT | CB_LAYOUTRECALL | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOSPC | CREATE, CREATE_SESSION, | + | | LAYOUTGET, LINK, OPEN, | + | | OPENATTR, RENAME, SETATTR, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOTDIR | CREATE, GET_DIR_DELEGATION, | + | | LINK, LOOKUP, LOOKUPP, OPEN, | + | | READDIR, REMOVE, RENAME, | + | | SECINFO, SECINFO_NO_NAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOTEMPTY | REMOVE, RENAME | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOTSUPP | CB_LAYOUTRECALL, CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | 
CB_WANTS_CANCELLED, | + | | DELEGPURGE, DELEGRETURN, | + | | GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, | + | | OPENATTR, OPEN_CONFIRM, | + | | RELEASE_LOCKOWNER, RENEW, | + | | SECINFO_NO_NAME, | + | | SETCLIENTID, | + | | SETCLIENTID_CONFIRM, | + | | WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOT_ONLY_OP | BIND_CONN_TO_SESSION, | + | | CREATE_SESSION, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, EXCHANGE_ID | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NOT_SAME | EXCHANGE_ID, GETDEVICELIST, | + | | READDIR, VERIFY | + +-----------------------------------+-------------------------------+ + | NFS4ERR_NO_GRACE | LAYOUTCOMMIT, LAYOUTRETURN, | + | | LOCK, OPEN, WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_OLD_STATEID | CLOSE, DELEGRETURN, | + | | FREE_STATEID, LAYOUTGET, | + | | LAYOUTRETURN, LOCK, LOCKU, | + | | OPEN, OPEN_DOWNGRADE, READ, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_OPENMODE | LAYOUTGET, LOCK, READ, | + | | SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_OP_ILLEGAL | CB_ILLEGAL, ILLEGAL | + +-----------------------------------+-------------------------------+ + | NFS4ERR_OP_NOT_IN_SESSION | ACCESS, BACKCHANNEL_CTL, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, DELEGPURGE, | + | | DELEGRETURN, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, GETFH, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_PERM | CREATE, OPEN, SETATTR | + +-----------------------------------+-------------------------------+ + | NFS4ERR_PNFS_IO_HOLE | READ, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_PNFS_NO_LAYOUT | READ, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_RECALLCONFLICT | LAYOUTGET, WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN, | + | | WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_RECLAIM_CONFLICT | LAYOUTCOMMIT, LOCK, OPEN, | + | | WANT_DELEGATION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_REJECT_DELEG | CB_PUSH_DELEG | + +-----------------------------------+-------------------------------+ + | NFS4ERR_REP_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | 
CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_REP_TOO_BIG_TO_CACHE | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_REQ_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_RETRY_UNCACHED_REP | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | 
DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_ROFS | CREATE, LINK, LOCK, LOCKT, | + | | OPEN, OPENATTR, | + | | OPEN_DOWNGRADE, REMOVE, | + | | RENAME, SETATTR, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SAME | NVERIFY | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SEQ_FALSE_RETRY | CB_SEQUENCE, SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, CREATE_SESSION, | + | | SEQUENCE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SERVERFAULT | ACCESS, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKU, LOOKUP, LOOKUPP, | + | | NVERIFY, OPEN, OPENATTR, | + | | OPEN_DOWNGRADE, PUTFH, | + | | PUTPUBFH, PUTROOTFH, READ, | + | | READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SETATTR, TEST_STATEID, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SHARE_DENIED | OPEN | + +-----------------------------------+-------------------------------+ + | NFS4ERR_STALE | ACCESS, CLOSE, COMMIT, | + | | CREATE, DELEGRETURN, | + | | GETATTR, GETFH, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, READ, READDIR, | + | | READLINK, RECLAIM_COMPLETE, | + | | REMOVE, RENAME, RESTOREFH, | + | | SAVEFH, SECINFO, | + | | SECINFO_NO_NAME, SETATTR, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_STALE_CLIENTID | CREATE_SESSION, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION | + +-----------------------------------+-------------------------------+ + | NFS4ERR_SYMLINK | COMMIT, LAYOUTCOMMIT, LINK, | + | | LOCK, LOCKT, LOOKUP, | + | | LOOKUPP, OPEN, READ, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_TOOSMALL | CREATE_SESSION, | + | | GETDEVICEINFO, LAYOUTGET, | + | | READDIR | + +-----------------------------------+-------------------------------+ + | 
NFS4ERR_TOO_MANY_OPS | ACCESS, BACKCHANNEL_CTL, | + | | BIND_CONN_TO_SESSION, | + | | CB_GETATTR, CB_LAYOUTRECALL, | + | | CB_NOTIFY, | + | | CB_NOTIFY_DEVICEID, | + | | CB_NOTIFY_LOCK, | + | | CB_PUSH_DELEG, CB_RECALL, | + | | CB_RECALLABLE_OBJ_AVAIL, | + | | CB_RECALL_ANY, | + | | CB_RECALL_SLOT, CB_SEQUENCE, | + | | CB_WANTS_CANCELLED, CLOSE, | + | | COMMIT, CREATE, | + | | CREATE_SESSION, DELEGPURGE, | + | | DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | EXCHANGE_ID, FREE_STATEID, | + | | GETATTR, GETDEVICEINFO, | + | | GETDEVICELIST, | + | | GET_DIR_DELEGATION, | + | | LAYOUTCOMMIT, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, LOCKU, LOOKUP, | + | | LOOKUPP, NVERIFY, OPEN, | + | | OPENATTR, OPEN_DOWNGRADE, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | READ, READDIR, READLINK, | + | | RECLAIM_COMPLETE, REMOVE, | + | | RENAME, RESTOREFH, SAVEFH, | + | | SECINFO, SECINFO_NO_NAME, | + | | SEQUENCE, SETATTR, SET_SSV, | + | | TEST_STATEID, VERIFY, | + | | WANT_DELEGATION, WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_UNKNOWN_LAYOUTTYPE | CB_LAYOUTRECALL, | + | | GETDEVICEINFO, | + | | GETDEVICELIST, LAYOUTCOMMIT, | + | | LAYOUTGET, LAYOUTRETURN, | + | | NVERIFY, SETATTR, VERIFY | + +-----------------------------------+-------------------------------+ + | NFS4ERR_UNSAFE_COMPOUND | CREATE, OPEN, OPENATTR | + +-----------------------------------+-------------------------------+ + | NFS4ERR_WRONGSEC | LINK, LOOKUP, LOOKUPP, OPEN, | + | | PUTFH, PUTPUBFH, PUTROOTFH, | + | | RENAME, RESTOREFH | + +-----------------------------------+-------------------------------+ + | NFS4ERR_WRONG_CRED | CLOSE, CREATE_SESSION, | + | | DELEGPURGE, DELEGRETURN, | + | | DESTROY_CLIENTID, | + | | DESTROY_SESSION, | + | | FREE_STATEID, LAYOUTCOMMIT, | + | | LAYOUTRETURN, LOCK, LOCKT, | + | | LOCKU, OPEN_DOWNGRADE, | + | | RECLAIM_COMPLETE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_WRONG_TYPE | CB_LAYOUTRECALL, | + | | CB_PUSH_DELEG, COMMIT, | + | | GETATTR, LAYOUTGET, | + | | LAYOUTRETURN, LINK, LOCK, | + | | LOCKT, NVERIFY, OPEN, | + | | OPENATTR, READ, READLINK, | + | | RECLAIM_COMPLETE, SETATTR, | + | | VERIFY, WANT_DELEGATION, | + | | WRITE | + +-----------------------------------+-------------------------------+ + | NFS4ERR_XDEV | LINK, RENAME | + +-----------------------------------+-------------------------------+ + + Table 14: Errors and the Operations That Use Them + +16. NFSv4.1 Procedures + + Both procedures, NULL and COMPOUND, MUST be implemented. + +16.1. Procedure 0: NULL - No Operation + +16.1.1. ARGUMENTS + + void; + +16.1.2. RESULTS + + void; + +16.1.3. DESCRIPTION + + This is the standard NULL procedure with the standard void argument + and void response. This procedure has no functionality associated + with it. Because of this, it is sometimes used to measure the + overhead of processing a service request. Therefore, the server + SHOULD ensure that no unnecessary work is done in servicing this + procedure. + +16.1.4. ERRORS + + None. + +16.2. Procedure 1: COMPOUND - Compound Operations + +16.2.1. 
ARGUMENTS + + enum nfs_opnum4 { + OP_ACCESS = 3, + OP_CLOSE = 4, + OP_COMMIT = 5, + OP_CREATE = 6, + OP_DELEGPURGE = 7, + OP_DELEGRETURN = 8, + OP_GETATTR = 9, + OP_GETFH = 10, + OP_LINK = 11, + OP_LOCK = 12, + OP_LOCKT = 13, + OP_LOCKU = 14, + OP_LOOKUP = 15, + OP_LOOKUPP = 16, + OP_NVERIFY = 17, + OP_OPEN = 18, + OP_OPENATTR = 19, + OP_OPEN_CONFIRM = 20, /* Mandatory not-to-implement */ + OP_OPEN_DOWNGRADE = 21, + OP_PUTFH = 22, + OP_PUTPUBFH = 23, + OP_PUTROOTFH = 24, + OP_READ = 25, + OP_READDIR = 26, + OP_READLINK = 27, + OP_REMOVE = 28, + OP_RENAME = 29, + OP_RENEW = 30, /* Mandatory not-to-implement */ + OP_RESTOREFH = 31, + OP_SAVEFH = 32, + OP_SECINFO = 33, + OP_SETATTR = 34, + OP_SETCLIENTID = 35, /* Mandatory not-to-implement */ + OP_SETCLIENTID_CONFIRM = 36, /* Mandatory not-to-implement */ + OP_VERIFY = 37, + OP_WRITE = 38, + OP_RELEASE_LOCKOWNER = 39, /* Mandatory not-to-implement */ + + /* new operations for NFSv4.1 */ + + OP_BACKCHANNEL_CTL = 40, + OP_BIND_CONN_TO_SESSION = 41, + OP_EXCHANGE_ID = 42, + OP_CREATE_SESSION = 43, + OP_DESTROY_SESSION = 44, + OP_FREE_STATEID = 45, + OP_GET_DIR_DELEGATION = 46, + OP_GETDEVICEINFO = 47, + OP_GETDEVICELIST = 48, + OP_LAYOUTCOMMIT = 49, + OP_LAYOUTGET = 50, + OP_LAYOUTRETURN = 51, + OP_SECINFO_NO_NAME = 52, + OP_SEQUENCE = 53, + OP_SET_SSV = 54, + OP_TEST_STATEID = 55, + OP_WANT_DELEGATION = 56, + OP_DESTROY_CLIENTID = 57, + OP_RECLAIM_COMPLETE = 58, + OP_ILLEGAL = 10044 + }; + + union nfs_argop4 switch (nfs_opnum4 argop) { + case OP_ACCESS: ACCESS4args opaccess; + case OP_CLOSE: CLOSE4args opclose; + case OP_COMMIT: COMMIT4args opcommit; + case OP_CREATE: CREATE4args opcreate; + case OP_DELEGPURGE: DELEGPURGE4args opdelegpurge; + case OP_DELEGRETURN: DELEGRETURN4args opdelegreturn; + case OP_GETATTR: GETATTR4args opgetattr; + case OP_GETFH: void; + case OP_LINK: LINK4args oplink; + case OP_LOCK: LOCK4args oplock; + case OP_LOCKT: LOCKT4args oplockt; + case OP_LOCKU: LOCKU4args oplocku; + case OP_LOOKUP: LOOKUP4args oplookup; + case OP_LOOKUPP: void; + case OP_NVERIFY: NVERIFY4args opnverify; + case OP_OPEN: OPEN4args opopen; + case OP_OPENATTR: OPENATTR4args opopenattr; + + /* Not for NFSv4.1 */ + case OP_OPEN_CONFIRM: OPEN_CONFIRM4args opopen_confirm; + + case OP_OPEN_DOWNGRADE: + OPEN_DOWNGRADE4args opopen_downgrade; + + case OP_PUTFH: PUTFH4args opputfh; + case OP_PUTPUBFH: void; + case OP_PUTROOTFH: void; + case OP_READ: READ4args opread; + case OP_READDIR: READDIR4args opreaddir; + case OP_READLINK: void; + case OP_REMOVE: REMOVE4args opremove; + case OP_RENAME: RENAME4args oprename; + + /* Not for NFSv4.1 */ + case OP_RENEW: RENEW4args oprenew; + + case OP_RESTOREFH: void; + case OP_SAVEFH: void; + case OP_SECINFO: SECINFO4args opsecinfo; + case OP_SETATTR: SETATTR4args opsetattr; + + /* Not for NFSv4.1 */ + case OP_SETCLIENTID: SETCLIENTID4args opsetclientid; + + /* Not for NFSv4.1 */ + case OP_SETCLIENTID_CONFIRM: SETCLIENTID_CONFIRM4args + opsetclientid_confirm; + case OP_VERIFY: VERIFY4args opverify; + case OP_WRITE: WRITE4args opwrite; + + /* Not for NFSv4.1 */ + case OP_RELEASE_LOCKOWNER: + RELEASE_LOCKOWNER4args + oprelease_lockowner; + + /* Operations new to NFSv4.1 */ + case OP_BACKCHANNEL_CTL: + BACKCHANNEL_CTL4args opbackchannel_ctl; + + case OP_BIND_CONN_TO_SESSION: + BIND_CONN_TO_SESSION4args + opbind_conn_to_session; + + case OP_EXCHANGE_ID: EXCHANGE_ID4args opexchange_id; + + case OP_CREATE_SESSION: + CREATE_SESSION4args opcreate_session; + + case OP_DESTROY_SESSION: + DESTROY_SESSION4args opdestroy_session; + 
+ case OP_FREE_STATEID: FREE_STATEID4args opfree_stateid; + + case OP_GET_DIR_DELEGATION: + GET_DIR_DELEGATION4args + opget_dir_delegation; + + case OP_GETDEVICEINFO: GETDEVICEINFO4args opgetdeviceinfo; + case OP_GETDEVICELIST: GETDEVICELIST4args opgetdevicelist; + case OP_LAYOUTCOMMIT: LAYOUTCOMMIT4args oplayoutcommit; + case OP_LAYOUTGET: LAYOUTGET4args oplayoutget; + case OP_LAYOUTRETURN: LAYOUTRETURN4args oplayoutreturn; + + case OP_SECINFO_NO_NAME: + SECINFO_NO_NAME4args opsecinfo_no_name; + + case OP_SEQUENCE: SEQUENCE4args opsequence; + case OP_SET_SSV: SET_SSV4args opset_ssv; + case OP_TEST_STATEID: TEST_STATEID4args optest_stateid; + + case OP_WANT_DELEGATION: + WANT_DELEGATION4args opwant_delegation; + + case OP_DESTROY_CLIENTID: + DESTROY_CLIENTID4args + opdestroy_clientid; + + case OP_RECLAIM_COMPLETE: + RECLAIM_COMPLETE4args + opreclaim_complete; + + /* Operations not new to NFSv4.1 */ + case OP_ILLEGAL: void; + }; + + struct COMPOUND4args { + utf8str_cs tag; + uint32_t minorversion; + nfs_argop4 argarray<>; + }; + +16.2.2. RESULTS + + union nfs_resop4 switch (nfs_opnum4 resop) { + case OP_ACCESS: ACCESS4res opaccess; + case OP_CLOSE: CLOSE4res opclose; + case OP_COMMIT: COMMIT4res opcommit; + case OP_CREATE: CREATE4res opcreate; + case OP_DELEGPURGE: DELEGPURGE4res opdelegpurge; + case OP_DELEGRETURN: DELEGRETURN4res opdelegreturn; + case OP_GETATTR: GETATTR4res opgetattr; + case OP_GETFH: GETFH4res opgetfh; + case OP_LINK: LINK4res oplink; + case OP_LOCK: LOCK4res oplock; + case OP_LOCKT: LOCKT4res oplockt; + case OP_LOCKU: LOCKU4res oplocku; + case OP_LOOKUP: LOOKUP4res oplookup; + case OP_LOOKUPP: LOOKUPP4res oplookupp; + case OP_NVERIFY: NVERIFY4res opnverify; + case OP_OPEN: OPEN4res opopen; + case OP_OPENATTR: OPENATTR4res opopenattr; + /* Not for NFSv4.1 */ + case OP_OPEN_CONFIRM: OPEN_CONFIRM4res opopen_confirm; + + case OP_OPEN_DOWNGRADE: + OPEN_DOWNGRADE4res + opopen_downgrade; + + case OP_PUTFH: PUTFH4res opputfh; + case OP_PUTPUBFH: PUTPUBFH4res opputpubfh; + case OP_PUTROOTFH: PUTROOTFH4res opputrootfh; + case OP_READ: READ4res opread; + case OP_READDIR: READDIR4res opreaddir; + case OP_READLINK: READLINK4res opreadlink; + case OP_REMOVE: REMOVE4res opremove; + case OP_RENAME: RENAME4res oprename; + /* Not for NFSv4.1 */ + case OP_RENEW: RENEW4res oprenew; + case OP_RESTOREFH: RESTOREFH4res oprestorefh; + case OP_SAVEFH: SAVEFH4res opsavefh; + case OP_SECINFO: SECINFO4res opsecinfo; + case OP_SETATTR: SETATTR4res opsetattr; + /* Not for NFSv4.1 */ + case OP_SETCLIENTID: SETCLIENTID4res opsetclientid; + + /* Not for NFSv4.1 */ + case OP_SETCLIENTID_CONFIRM: + SETCLIENTID_CONFIRM4res + opsetclientid_confirm; + case OP_VERIFY: VERIFY4res opverify; + case OP_WRITE: WRITE4res opwrite; + + /* Not for NFSv4.1 */ + case OP_RELEASE_LOCKOWNER: + RELEASE_LOCKOWNER4res + oprelease_lockowner; + + /* Operations new to NFSv4.1 */ + case OP_BACKCHANNEL_CTL: + BACKCHANNEL_CTL4res + opbackchannel_ctl; + + case OP_BIND_CONN_TO_SESSION: + BIND_CONN_TO_SESSION4res + opbind_conn_to_session; + + case OP_EXCHANGE_ID: EXCHANGE_ID4res opexchange_id; + + case OP_CREATE_SESSION: + CREATE_SESSION4res + opcreate_session; + + case OP_DESTROY_SESSION: + DESTROY_SESSION4res + opdestroy_session; + + case OP_FREE_STATEID: FREE_STATEID4res + opfree_stateid; + + case OP_GET_DIR_DELEGATION: + GET_DIR_DELEGATION4res + opget_dir_delegation; + + case OP_GETDEVICEINFO: GETDEVICEINFO4res + opgetdeviceinfo; + + case OP_GETDEVICELIST: GETDEVICELIST4res + opgetdevicelist; + + case OP_LAYOUTCOMMIT: 
LAYOUTCOMMIT4res oplayoutcommit; + case OP_LAYOUTGET: LAYOUTGET4res oplayoutget; + case OP_LAYOUTRETURN: LAYOUTRETURN4res oplayoutreturn; + + case OP_SECINFO_NO_NAME: + SECINFO_NO_NAME4res + opsecinfo_no_name; + + case OP_SEQUENCE: SEQUENCE4res opsequence; + case OP_SET_SSV: SET_SSV4res opset_ssv; + case OP_TEST_STATEID: TEST_STATEID4res optest_stateid; + + case OP_WANT_DELEGATION: + WANT_DELEGATION4res + opwant_delegation; + + case OP_DESTROY_CLIENTID: + DESTROY_CLIENTID4res + opdestroy_clientid; + + case OP_RECLAIM_COMPLETE: + RECLAIM_COMPLETE4res + opreclaim_complete; + + /* Operations not new to NFSv4.1 */ + case OP_ILLEGAL: ILLEGAL4res opillegal; + }; + + struct COMPOUND4res { + nfsstat4 status; + utf8str_cs tag; + nfs_resop4 resarray<>; + }; + +16.2.3. DESCRIPTION + + The COMPOUND procedure is used to combine one or more NFSv4 + operations into a single RPC request. The server interprets each of + the operations in turn. If an operation is executed by the server + and the status of that operation is NFS4_OK, then the next operation + in the COMPOUND procedure is executed. The server continues this + process until there are no more operations to be executed or until + one of the operations has a status value other than NFS4_OK. + + In the processing of the COMPOUND procedure, the server may find that + it does not have the available resources to execute any or all of the + operations within the COMPOUND sequence. See Section 2.10.6.4 for a + more detailed discussion. + + The server will generally choose between two methods of decoding the + client's request. The first would be the traditional one-pass XDR + decode. If there is an XDR decoding error in this case, the RPC XDR + decode error would be returned. The second method would be to make + an initial pass to decode the basic COMPOUND request and then to XDR + decode the individual operations; the most interesting is the decode + of attributes. In this case, the server may encounter an XDR decode + error during the second pass. If it does, the server would return + the error NFS4ERR_BADXDR to signify the decode error. + + The COMPOUND arguments contain a "minorversion" field. For NFSv4.1, + the value for this field is 1. If the server receives a COMPOUND + procedure with a minorversion field value that it does not support, + the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a + zero-length resultdata array. + + Contained within the COMPOUND results is a "status" field. If the + results array length is non-zero, this status must be equivalent to + the status of the last operation that was executed within the + COMPOUND procedure. Therefore, if an operation incurred an error + then the "status" value will be the same error value as is being + returned for the operation that failed. + + Note that operations zero and one are not defined for the COMPOUND + procedure. Operation 2 is not defined and is reserved for future + definition and use with minor versioning. If the server receives an + operation array that contains operation 2 and the minorversion field + has a value of zero, an error of NFS4ERR_OP_ILLEGAL, as described in + the next paragraph, is returned to the client. If an operation array + contains an operation 2 and the minorversion field is non-zero and + the server does not support the minor version, the server returns an + error of NFS4ERR_MINOR_VERS_MISMATCH. Therefore, the + NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other + errors. 
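+
+   To make the evaluation rule above concrete, the following C
+   fragment (an illustrative sketch, not part of the protocol
+   specification) models the sequential processing just described:
+   operations are evaluated in order, evaluation stops at the first
+   non-NFS4_OK status, and the COMPOUND status equals the status of
+   the last operation executed. The opcode values are taken from the
+   nfs_opnum4 enumeration in Section 16.2.1; the eval_op() dispatcher
+   and the reduction of nfs_argop4/nfs_resop4 to bare integers are
+   hypothetical simplifications.
+
+   #include <stdio.h>
+
+   #define NFS4_OK 0
+   #define NFS4ERR_DENIED 10010
+   #define NFS4ERR_MINOR_VERS_MISMATCH 10021
+
+   /* Stub dispatcher: pretend that LOCKT (opcode 13) fails with
+    * NFS4ERR_DENIED and that every other operation succeeds. */
+   static int eval_op(int opnum)
+   {
+       return opnum == 13 ? NFS4ERR_DENIED : NFS4_OK;
+   }
+
+   /* Returns the COMPOUND status; *nres is the number of entries in
+    * the result array (one per operation actually executed). */
+   static int compound(unsigned minorversion, const int *ops,
+                       int nops, int *nres)
+   {
+       int status = NFS4_OK;
+
+       *nres = 0;
+       if (minorversion != 1)         /* NFSv4.1 uses minorversion 1 */
+           return NFS4ERR_MINOR_VERS_MISMATCH; /* empty result array */
+       for (int i = 0; i < nops; i++) {
+           status = eval_op(ops[i]);  /* append result for this op */
+           (*nres)++;
+           if (status != NFS4_OK)     /* stop at the first failure */
+               break;
+       }
+       return status;   /* status of the last operation executed */
+   }
+
+   int main(void)
+   {
+       int ops[] = { 22 /* PUTFH */, 15 /* LOOKUP */,
+                     13 /* LOCKT */, 9 /* GETATTR */ };
+       int nres;
+       int status = compound(1, ops, 4, &nres);
+
+       /* Prints "COMPOUND status 10010 after 3 of 4 ops": GETATTR is
+        * never evaluated because LOCKT did not return NFS4_OK. */
+       printf("COMPOUND status %d after %d of 4 ops\n", status, nres);
+       return 0;
+   }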
+
+   It is possible that the server receives a request that contains an
+   operation that is less than the first legal operation (OP_ACCESS)
+   or greater than the last legal operation (OP_RELEASE_LOCKOWNER).
+   In this case, the server's response will encode the opcode
+   OP_ILLEGAL rather than the illegal opcode of the request. The
+   status field in the ILLEGAL return results will be set to
+   NFS4ERR_OP_ILLEGAL. The COMPOUND procedure's return results will
+   also be NFS4ERR_OP_ILLEGAL.
+
+   The definition of the "tag" in the request is left to the
+   implementor. It may be used to summarize the content of the
+   COMPOUND request for the benefit of packet sniffers and engineers
+   debugging implementations. However, the value of "tag" in the
+   response SHOULD be the same value as provided in the request. This
+   applies to the tag field of the CB_COMPOUND procedure as well.
+
+16.2.3.1. Current Filehandle and Stateid
+
+   The COMPOUND procedure offers a simple environment for the
+   execution of the operations specified by the client. Four values
+   are maintained in this environment: the first two (the current and
+   saved filehandles) relate to the filehandle, while the second two
+   (the current and saved stateids) relate to the current stateid.
+
+16.2.3.1.1. Current Filehandle
+
+   The current and saved filehandles are used throughout the protocol.
+   Most operations implicitly use the current filehandle as an
+   argument, and many set the current filehandle as part of the
+   results. The combination of client-specified sequences of
+   operations and current and saved filehandle arguments and results
+   allows for greater protocol flexibility. The simplest example of
+   current filehandle usage is a sequence like the following:
+
+               PUTFH fh1              {fh1}
+               LOOKUP "compA"         {fh2}
+               GETATTR                {fh2}
+               LOOKUP "compB"         {fh3}
+               GETATTR                {fh3}
+               LOOKUP "compC"         {fh4}
+               GETATTR                {fh4}
+               GETFH
+
+                                Figure 2
+
+   In this example, the PUTFH (Section 18.19) operation explicitly
+   sets the current filehandle value while the result of each LOOKUP
+   operation sets the current filehandle value to the resultant file
+   system object. Also, the client is able to insert GETATTR
+   operations using the current filehandle as an argument.
+
+   The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.20)
+   operations also set the current filehandle. The above example
+   would replace "PUTFH fh1" with PUTROOTFH or PUTPUBFH with no
+   filehandle argument in order to achieve the same effect (on the
+   assumption that "compA" is directly below the root of the
+   namespace).
+
+   Along with the current filehandle, there is a saved filehandle.
+   While the current filehandle is set as the result of operations
+   like LOOKUP, the saved filehandle must be set directly with the use
+   of the SAVEFH operation. The SAVEFH operation copies the current
+   filehandle value to the saved value. The saved filehandle value is
+   used in combination with the current filehandle value for the LINK
+   and RENAME operations. The RESTOREFH operation will copy the saved
+   filehandle value to the current filehandle value; as a result, the
+   saved filehandle value may be used as a sort of "scratch" area for
+   the client's series of operations.
+
+16.2.3.1.2. Current Stateid
+
+   With NFSv4.1, additions of a current stateid and a saved stateid
+   have been made to the COMPOUND processing environment; this allows
+   for the passing of stateids between operations. There are no
+   changes to the syntax of the protocol, only changes to the
+   semantics of a few operations.
+
+   A "current stateid" is the stateid that is associated with the
+   current filehandle. The current stateid may only be changed by an
+   operation that modifies the current filehandle or returns a
+   stateid. If an operation returns a stateid, it MUST set the
+   current stateid to the returned value. If an operation sets the
+   current filehandle but does not return a stateid, the current
+   stateid MUST be set to the all-zeros special stateid, i.e., (seqid,
+   other) = (0, 0). If an operation uses a stateid as an argument but
+   does not return a stateid, the current stateid MUST NOT be changed.
+   For example, PUTFH, PUTROOTFH, and PUTPUBFH will change the current
+   server state from {ocfh, (osid)} to {cfh, (0, 0)}, while LOCK will
+   change the current state from {cfh, (osid)} to {cfh, (nsid)}.
+   Operations like LOOKUP that transform a current filehandle and
+   component name into a new current filehandle will also set the
+   current stateid to (0, 0). The SAVEFH and RESTOREFH operations
+   will save and restore both the current filehandle and the current
+   stateid as a set.
+
+   The following example is the common case of a simple READ operation
+   with a normal stateid showing that the PUTFH initializes the
+   current stateid to (0, 0). The subsequent READ with stateid (sid1)
+   leaves the current stateid unchanged.
+
+        PUTFH fh1               -               -> {fh1, (0, 0)}
+        READ (sid1), 0, 1024    {fh1, (0, 0)}   -> {fh1, (0, 0)}
+
+                                Figure 3
+
+   This next example performs an OPEN with the root filehandle and, as
+   a result, generates stateid (sid1). The next operation specifies
+   the READ with the argument stateid set such that (seqid, other) are
+   equal to (1, 0), but the current stateid set by the previous
+   operation is actually used when the operation is evaluated. This
+   allows correct interaction with any existing, potentially
+   conflicting, locks.
+
+        PUTROOTFH               -               -> {fh1, (0, 0)}
+        OPEN "compA"            {fh1, (0, 0)}   -> {fh2, (sid1)}
+        READ (1, 0), 0, 1024    {fh2, (sid1)}   -> {fh2, (sid1)}
+        CLOSE (1, 0)            {fh2, (sid1)}   -> {fh2, (sid2)}
+
+                                Figure 4
+
+   This next example is similar to the second in how it passes the
+   stateid sid2 generated by the LOCK operation to the next READ
+   operation. This allows the client to explicitly surround a single
+   I/O operation with a lock and its appropriate stateid to guarantee
+   correctness with other client locks. The example also shows how
+   SAVEFH and RESTOREFH can save and later reuse a filehandle and
+   stateid, passing them as the current filehandle and stateid to a
+   READ operation.
+
+        PUTFH fh1               -               -> {fh1, (0, 0)}
+        LOCK 0, 1024, (sid1)    {fh1, (sid1)}   -> {fh1, (sid2)}
+        READ (1, 0), 0, 1024    {fh1, (sid2)}   -> {fh1, (sid2)}
+        LOCKU 0, 1024, (1, 0)   {fh1, (sid2)}   -> {fh1, (sid3)}
+        SAVEFH                  {fh1, (sid3)}   -> {fh1, (sid3)}
+
+        PUTFH fh2               {fh1, (sid3)}   -> {fh2, (0, 0)}
+        WRITE (1, 0), 0, 1024   {fh2, (0, 0)}   -> {fh2, (0, 0)}
+
+        RESTOREFH               {fh2, (0, 0)}   -> {fh1, (sid3)}
+        READ (1, 0), 1024, 1024 {fh1, (sid3)}   -> {fh1, (sid3)}
+
+                                Figure 5
+
+   The final example shows a disallowed use of the current stateid.
+   The client is attempting to implicitly pass an anonymous special
+   stateid, (0, 0), to the READ operation. The server MUST return
+   NFS4ERR_BAD_STATEID in the reply to the READ operation.
+
+        PUTFH fh1               -               -> {fh1, (0, 0)}
+        READ (1, 0), 0, 1024    {fh1, (0, 0)}   -> NFS4ERR_BAD_STATEID
+
+                                Figure 6
+
+16.2.4. ERRORS
+
+   COMPOUND will of course return every error that each operation on
+   the fore channel can return (see Table 12). However, if COMPOUND
+   returns zero operations, obviously the error returned by COMPOUND
+   has nothing to do with an error returned by an operation. The list
+   of errors COMPOUND will return if it processes zero operations
+   includes:
+
+     +==============================+==================================+
+     | Error                        | Notes                            |
+     +==============================+==================================+
+     | NFS4ERR_BADCHAR              | The tag argument has a character |
+     |                              | the replier does not support.    |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_BADXDR               |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_DELAY                |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_INVAL                | The tag argument is not in UTF-8 |
+     |                              | encoding.                        |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_MINOR_VERS_MISMATCH  |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_SERVERFAULT          |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_TOO_MANY_OPS         |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_REP_TOO_BIG          |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_REP_TOO_BIG_TO_CACHE |                                  |
+     +------------------------------+----------------------------------+
+     | NFS4ERR_REQ_TOO_BIG          |                                  |
+     +------------------------------+----------------------------------+
+
+                   Table 15: COMPOUND Error Returns
+
+17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
+
+   The following tables summarize the operations of the NFSv4.1
+   protocol and the corresponding designation of REQUIRED,
+   RECOMMENDED, and OPTIONAL to implement or MUST NOT implement. The
+   designation of MUST NOT implement is reserved for those operations
+   that were defined in NFSv4.0 and MUST NOT be implemented in
+   NFSv4.1.
+
+   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL
+   designation for operations sent by the client is for the server
+   implementation. The client is generally required to implement the
+   operations needed for the operating environment for which it
+   serves. For example, a read-only NFSv4.1 client would have no need
+   to implement the WRITE operation and is not required to do so.
+
+   The REQUIRED or OPTIONAL designation for callback operations sent
+   by the server is for both the client and server. Generally, the
+   client has the option of creating the backchannel and sending the
+   operations on the fore channel that will be a catalyst for the
+   server sending callback operations. A partial exception is
+   CB_RECALL_SLOT; the only way the client can avoid supporting this
+   operation is by not creating a backchannel.
+
+   Since this is a summary of the operations and their designations,
+   there are subtleties that are not presented here. Therefore, if
+   there is a question about the requirements of implementation, the
+   operation descriptions themselves must be consulted along with
+   other relevant explanatory text within this specification.
+
+   The abbreviations used in the second and third columns of the table
+   are defined as follows.
+
+   REQ  REQUIRED to implement
+
+   REC  RECOMMENDED to implement
+
+   OPT  OPTIONAL to implement
+
+   MNI  MUST NOT implement
+
+   For the NFSv4.1 features that are OPTIONAL, the operations that
+   support those features are OPTIONAL, and the server would return
+   NFS4ERR_NOTSUPP in response to the client's use of those
+   operations. If an OPTIONAL feature is supported, it is possible
+   that a set of operations related to the feature become REQUIRED to
+   implement.
The + third column of the table designates the feature(s) and if the + operation is REQUIRED or OPTIONAL in the presence of support for the + feature. + + The OPTIONAL features identified and their abbreviations are as + follows: + + pNFS Parallel NFS + + FDELG File Delegations + + DDELG Directory Delegations + + +======================+=============+============+===============+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or MNI | (REQ, REC, | | + | | | or OPT) | | + +======================+=============+============+===============+ + | ACCESS | REQ | | Section 18.1 | + +----------------------+-------------+------------+---------------+ + | BACKCHANNEL_CTL | REQ | | Section 18.33 | + +----------------------+-------------+------------+---------------+ + | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | + +----------------------+-------------+------------+---------------+ + | CLOSE | REQ | | Section 18.2 | + +----------------------+-------------+------------+---------------+ + | COMMIT | REQ | | Section 18.3 | + +----------------------+-------------+------------+---------------+ + | CREATE | REQ | | Section 18.4 | + +----------------------+-------------+------------+---------------+ + | CREATE_SESSION | REQ | | Section 18.36 | + +----------------------+-------------+------------+---------------+ + | DELEGPURGE | OPT | FDELG | Section 18.5 | + | | | (REQ) | | + +----------------------+-------------+------------+---------------+ + | DELEGRETURN | OPT | FDELG, | Section 18.6 | + | | | DDELG, | | + | | | pNFS (REQ) | | + +----------------------+-------------+------------+---------------+ + | DESTROY_CLIENTID | REQ | | Section 18.50 | + +----------------------+-------------+------------+---------------+ + | DESTROY_SESSION | REQ | | Section 18.37 | + +----------------------+-------------+------------+---------------+ + | EXCHANGE_ID | REQ | | Section 18.35 | + +----------------------+-------------+------------+---------------+ + | FREE_STATEID | REQ | | Section 18.38 | + +----------------------+-------------+------------+---------------+ + | GETATTR | REQ | | Section 18.7 | + +----------------------+-------------+------------+---------------+ + | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | + +----------------------+-------------+------------+---------------+ + | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | + +----------------------+-------------+------------+---------------+ + | GETFH | REQ | | Section 18.8 | + +----------------------+-------------+------------+---------------+ + | GET_DIR_DELEGATION | OPT | DDELG | Section 18.39 | + | | | (REQ) | | + +----------------------+-------------+------------+---------------+ + | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | + +----------------------+-------------+------------+---------------+ + | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | + +----------------------+-------------+------------+---------------+ + | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | + +----------------------+-------------+------------+---------------+ + | LINK | OPT | | Section 18.9 | + +----------------------+-------------+------------+---------------+ + | LOCK | REQ | | Section 18.10 | + +----------------------+-------------+------------+---------------+ + | LOCKT | REQ | | Section 18.11 | + +----------------------+-------------+------------+---------------+ + | LOCKU | REQ | | Section 18.12 | + +----------------------+-------------+------------+---------------+ + | LOOKUP | REQ | | Section 18.13 | + 
+----------------------+-------------+------------+---------------+ + | LOOKUPP | REQ | | Section 18.14 | + +----------------------+-------------+------------+---------------+ + | NVERIFY | REQ | | Section 18.15 | + +----------------------+-------------+------------+---------------+ + | OPEN | REQ | | Section 18.16 | + +----------------------+-------------+------------+---------------+ + | OPENATTR | OPT | | Section 18.17 | + +----------------------+-------------+------------+---------------+ + | OPEN_CONFIRM | MNI | | N/A | + +----------------------+-------------+------------+---------------+ + | OPEN_DOWNGRADE | REQ | | Section 18.18 | + +----------------------+-------------+------------+---------------+ + | PUTFH | REQ | | Section 18.19 | + +----------------------+-------------+------------+---------------+ + | PUTPUBFH | REQ | | Section 18.20 | + +----------------------+-------------+------------+---------------+ + | PUTROOTFH | REQ | | Section 18.21 | + +----------------------+-------------+------------+---------------+ + | READ | REQ | | Section 18.22 | + +----------------------+-------------+------------+---------------+ + | READDIR | REQ | | Section 18.23 | + +----------------------+-------------+------------+---------------+ + | READLINK | OPT | | Section 18.24 | + +----------------------+-------------+------------+---------------+ + | RECLAIM_COMPLETE | REQ | | Section 18.51 | + +----------------------+-------------+------------+---------------+ + | RELEASE_LOCKOWNER | MNI | | N/A | + +----------------------+-------------+------------+---------------+ + | REMOVE | REQ | | Section 18.25 | + +----------------------+-------------+------------+---------------+ + | RENAME | REQ | | Section 18.26 | + +----------------------+-------------+------------+---------------+ + | RENEW | MNI | | N/A | + +----------------------+-------------+------------+---------------+ + | RESTOREFH | REQ | | Section 18.27 | + +----------------------+-------------+------------+---------------+ + | SAVEFH | REQ | | Section 18.28 | + +----------------------+-------------+------------+---------------+ + | SECINFO | REQ | | Section 18.29 | + +----------------------+-------------+------------+---------------+ + | SECINFO_NO_NAME | REC | pNFS file | Section | + | | | layout | 18.45, | + | | | (REQ) | Section 13.12 | + +----------------------+-------------+------------+---------------+ + | SEQUENCE | REQ | | Section 18.46 | + +----------------------+-------------+------------+---------------+ + | SETATTR | REQ | | Section 18.30 | + +----------------------+-------------+------------+---------------+ + | SETCLIENTID | MNI | | N/A | + +----------------------+-------------+------------+---------------+ + | SETCLIENTID_CONFIRM | MNI | | N/A | + +----------------------+-------------+------------+---------------+ + | SET_SSV | REQ | | Section 18.47 | + +----------------------+-------------+------------+---------------+ + | TEST_STATEID | REQ | | Section 18.48 | + +----------------------+-------------+------------+---------------+ + | VERIFY | REQ | | Section 18.31 | + +----------------------+-------------+------------+---------------+ + | WANT_DELEGATION | OPT | FDELG | Section 18.49 | + | | | (OPT) | | + +----------------------+-------------+------------+---------------+ + | WRITE | REQ | | Section 18.32 | + +----------------------+-------------+------------+---------------+ + + Table 16: Operations + + +=========================+=============+============+============+ + | Operation | REQ, REC, | Feature | Definition | + 
| | OPT, or MNI | (REQ, REC, | | + | | | or OPT) | | + +=========================+=============+============+============+ + | CB_GETATTR | OPT | FDELG | Section | + | | | (REQ) | 20.1 | + +-------------------------+-------------+------------+------------+ + | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section | + | | | | 20.3 | + +-------------------------+-------------+------------+------------+ + | CB_NOTIFY | OPT | DDELG | Section | + | | | (REQ) | 20.4 | + +-------------------------+-------------+------------+------------+ + | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section | + | | | | 20.12 | + +-------------------------+-------------+------------+------------+ + | CB_NOTIFY_LOCK | OPT | | Section | + | | | | 20.11 | + +-------------------------+-------------+------------+------------+ + | CB_PUSH_DELEG | OPT | FDELG | Section | + | | | (OPT) | 20.5 | + +-------------------------+-------------+------------+------------+ + | CB_RECALL | OPT | FDELG, | Section | + | | | DDELG, | 20.2 | + | | | pNFS (REQ) | | + +-------------------------+-------------+------------+------------+ + | CB_RECALL_ANY | OPT | FDELG, | Section | + | | | DDELG, | 20.6 | + | | | pNFS (REQ) | | + +-------------------------+-------------+------------+------------+ + | CB_RECALL_SLOT | REQ | | Section | + | | | | 20.8 | + +-------------------------+-------------+------------+------------+ + | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, | Section | + | | | pNFS (REQ) | 20.7 | + +-------------------------+-------------+------------+------------+ + | CB_SEQUENCE | OPT | FDELG, | Section | + | | | DDELG, | 20.9 | + | | | pNFS (REQ) | | + +-------------------------+-------------+------------+------------+ + | CB_WANTS_CANCELLED | OPT | FDELG, | Section | + | | | DDELG, | 20.10 | + | | | pNFS (REQ) | | + +-------------------------+-------------+------------+------------+ + + Table 17: Callback Operations + +18. NFSv4.1 Operations + +18.1. Operation 3: ACCESS - Check Access Rights + +18.1.1. ARGUMENTS + + const ACCESS4_READ = 0x00000001; + const ACCESS4_LOOKUP = 0x00000002; + const ACCESS4_MODIFY = 0x00000004; + const ACCESS4_EXTEND = 0x00000008; + const ACCESS4_DELETE = 0x00000010; + const ACCESS4_EXECUTE = 0x00000020; + + struct ACCESS4args { + /* CURRENT_FH: object */ + uint32_t access; + }; + +18.1.2. RESULTS + + struct ACCESS4resok { + uint32_t supported; + uint32_t access; + }; + + union ACCESS4res switch (nfsstat4 status) { + case NFS4_OK: + ACCESS4resok resok4; + default: + void; + }; + +18.1.3. DESCRIPTION + + ACCESS determines the access rights that a user, as identified by the + credentials in the RPC request, has with respect to the file system + object specified by the current filehandle. The client encodes the + set of access rights that are to be checked in the bit mask "access". + The server checks the permissions encoded in the bit mask. If a + status of NFS4_OK is returned, two bit masks are included in the + response. The first, "supported", represents the access rights for + which the server can verify reliably. The second, "access", + represents the access rights available to the user for the filehandle + provided. On success, the current filehandle retains its value. + + Note that the reply's supported and access fields MUST NOT contain + more values than originally set in the request's access field. 
For + example, if the client sends an ACCESS operation with just the + ACCESS4_READ value set and the server supports this value, the server + MUST NOT set more than ACCESS4_READ in the supported field even if it + could have reliably checked other values. + + The reply's access field MUST NOT contain more values than the + supported field. + + The results of this operation are necessarily advisory in nature. A + return status of NFS4_OK and the appropriate bit set in the bit mask + do not imply that such access will be allowed to the file system + object in the future. This is because access rights can be revoked + by the server at any time. + + The following access permissions may be requested: + + ACCESS4_READ Read data from file or read a directory. + + ACCESS4_LOOKUP Look up a name in a directory (no meaning for non- + directory objects). + + ACCESS4_MODIFY Rewrite existing file data or modify existing + directory entries. + + ACCESS4_EXTEND Write new data or add directory entries. + + ACCESS4_DELETE Delete an existing directory entry. + + ACCESS4_EXECUTE Execute a regular file (no meaning for a directory). + + On success, the current filehandle retains its value. + + ACCESS4_EXECUTE is a challenging semantic to implement because NFS + provides remote file access, not remote execution. This leads to the + following: + + * Whether or not a regular file is executable ought to be the + responsibility of the NFS client and not the server. And yet the + ACCESS operation is specified to seemingly require a server to own + that responsibility. + + * When a client executes a regular file, it has to read the file + from the server. Strictly speaking, the server should not allow + the client to read a file being executed unless the user has read + permissions on the file. Requiring explicit read permissions on + executable files in order to access them over NFS is not going to + be acceptable to some users and storage administrators. + Historically, NFS servers have allowed a user to READ a file if + the user has execute access to the file. + + As a practical example, the UNIX specification [60] states that an + implementation claiming conformance to UNIX may indicate in the + access() programming interface's result that a privileged user has + execute rights, even if no execute permission bits are set on the + regular file's attributes. It is possible to claim conformance to + the UNIX specification and instead not indicate execute rights in + that situation, which is true for some operating environments. + Suppose the operating environments of the client and server are + implementing the access() semantics for privileged users differently, + and the ACCESS operation implementations of the client and server + follow their respective access() semantics. This can cause undesired + behavior: + + * Suppose the client's access() interface returns X_OK if the user + is privileged and no execute permission bits are set on the + regular file's attribute, and the server's access() interface does + not return X_OK in that situation. Then the client will be unable + to execute files stored on the NFS server that could be executed + if stored on a non-NFS file system. + + * Suppose the client's access() interface does not return X_OK if + the user is privileged, and no execute permission bits are set on + the regular file's attribute, and the server's access() interface + does return X_OK in that situation. 
Then: + + - The client will be able to execute files stored on the NFS + server that could be executed if stored on a non-NFS file + system, unless the client's execution subsystem also checks for + execute permission bits. + + - Even if the execution subsystem is checking for execute + permission bits, there are more potential issues. For example, + suppose the client is invoking access() to build a "path search + table" of all executable files in the user's "search path", + where the path is a list of directories each containing + executable files. Suppose there are two files each in separate + directories of the search path, such that files have the same + component name. In the first directory the file has no execute + permission bits set, and in the second directory the file has + execute bits set. The path search table will indicate that the + first directory has the executable file, but the execute + subsystem will fail to execute it. The command shell might + fail to try the second file in the second directory. And even + if it did, this is a potential performance issue. Clearly, the + desired outcome for the client is for the path search table to + not contain the first file. + + To deal with the problems described above, the "smart client, stupid + server" principle is used. The client owns overall responsibility + for determining execute access and relies on the server to parse the + execution permissions within the file's mode, acl, and dacl + attributes. The rules for the client and server follow: + + * If the client is sending ACCESS in order to determine if the user + can read the file, the client SHOULD set ACCESS4_READ in the + request's access field. + + * If the client's operating environment only grants execution to the + user if the user has execute access according to the execute + permissions in the mode, acl, and dacl attributes, then if the + client wants to determine execute access, the client SHOULD send + an ACCESS request with ACCESS4_EXECUTE bit set in the request's + access field. + + * If the client's operating environment grants execution to the user + even if the user does not have execute access according to the + execute permissions in the mode, acl, and dacl attributes, then if + the client wants to determine execute access, it SHOULD send an + ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits + set in the request's access field. This way, if any read or + execute permission grants the user read or execute access (or if + the server interprets the user as privileged), as indicated by the + presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's + access field, the client will be able to grant the user execute + access to the file. + + * If the server supports execute permission bits, or some other + method for denoting executability (e.g., the suffix of the name of + the file might indicate execute), it MUST check only execute + permissions, not read permissions, when determining whether or not + the reply will have ACCESS4_EXECUTE set in the access field. The + server MUST NOT also examine read permission bits when determining + whether or not the reply will have ACCESS4_EXECUTE set in the + access field. Even if the server's operating environment would + grant execute access to the user (e.g., the user is privileged), + the server MUST NOT reply with ACCESS4_EXECUTE set in reply's + access field unless there is at least one execute permission bit + set in the mode, acl, or dacl attributes. 
+   In the case of acl and dacl, the "one execute permission bit" MUST
+   be an ACE4_EXECUTE bit set in an ALLOW ACE.
+
+   *  If the server does not support execute permission bits or some
+      other method for denoting executability, it MUST NOT set
+      ACCESS4_EXECUTE in the reply's supported and access fields.  If
+      the client set ACCESS4_EXECUTE in the ACCESS request's access
+      field, and ACCESS4_EXECUTE is not set in the reply's supported
+      field, then the client will have to send an ACCESS request with
+      the ACCESS4_READ bit set in the request's access field.
+
+   *  If the server supports read permission bits, it MUST only check
+      for read permissions in the mode, acl, and dacl attributes when
+      it receives an ACCESS request with ACCESS4_READ set in the access
+      field.  The server MUST NOT also examine execute permission bits
+      when determining whether the reply will have ACCESS4_READ set in
+      the access field or not.
+
+   Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE
+   set, then the user also has permissions to OPEN (Section 18.16) or
+   READ (Section 18.22) the file.  In other words, if the client sends
+   an ACCESS request with the ACCESS4_READ and ACCESS4_EXECUTE set in
+   the access field (or two separate requests, one with ACCESS4_READ
+   set and the other with ACCESS4_EXECUTE set), and the reply has just
+   ACCESS4_EXECUTE set in the access field (or just one reply has
+   ACCESS4_EXECUTE set), then the user has authorization to OPEN or
+   READ the file.
+
+18.1.4.  IMPLEMENTATION
+
+   In general, it is not sufficient for the client to attempt to deduce
+   access permissions by inspecting the uid, gid, and mode fields in
+   the file attributes or by attempting to interpret the contents of
+   the ACL attribute.  This is because the server may perform uid or
+   gid mapping or enforce additional access-control restrictions.  It
+   is also possible that the server may not be in the same ID space as
+   the client.  In these cases (and perhaps others), the client cannot
+   reliably perform an access check with only current file attributes.
+
+   In the NFSv2 protocol, the only reliable way to determine whether an
+   operation was allowed was to try it and see if it succeeded or
+   failed.  Using the ACCESS operation in the NFSv4.1 protocol, the
+   client can ask the server to indicate whether or not one or more
+   classes of operations are permitted.  The ACCESS operation is
+   provided to allow clients to check before doing a series of
+   operations that will result in an access failure.  The OPEN
+   operation provides a point where the server can verify access to the
+   file object and a method to return that information to the client.
+   The ACCESS operation is still useful for directory operations or for
+   use in the case that the UNIX interface access() is used on the
+   client.
+
+   The information returned by the server in response to an ACCESS call
+   is not permanent.  It was correct at the exact time that the server
+   performed the checks, but not necessarily afterwards.  The server
+   can revoke access permission at any time.
+
+   The client should use the effective credentials of the user to build
+   the authentication information in the ACCESS request used to
+   determine access rights.  It is the effective user and group
+   credentials that are used in subsequent READ and WRITE operations.
+
+   Many implementations do not directly support the ACCESS4_DELETE
+   permission.  Operating systems like UNIX will ignore the
+   ACCESS4_DELETE bit if set on an access request on a non-directory
+   object.
In these systems, delete permission on a file is determined + by the access permissions on the directory in which the file resides, + instead of being determined by the permissions of the file itself. + Therefore, the mask returned enumerating which access rights can be + determined will have the ACCESS4_DELETE value set to 0. This + indicates to the client that the server was unable to check that + particular access right. The ACCESS4_DELETE bit in the access mask + returned will then be ignored by the client. + +18.2. Operation 4: CLOSE - Close File + +18.2.1. ARGUMENTS + + struct CLOSE4args { + /* CURRENT_FH: object */ + seqid4 seqid; + stateid4 open_stateid; + }; + +18.2.2. RESULTS + + union CLOSE4res switch (nfsstat4 status) { + case NFS4_OK: + stateid4 open_stateid; + default: + void; + }; + +18.2.3. DESCRIPTION + + The CLOSE operation releases share reservations for the regular or + named attribute file as specified by the current filehandle. The + share reservations and other state information released at the server + as a result of this CLOSE are only those associated with the supplied + stateid. State associated with other OPENs is not affected. + + If byte-range locks are held, the client SHOULD release all locks + before sending a CLOSE. The server MAY free all outstanding locks on + CLOSE, but some servers may not support the CLOSE of a file that + still has byte-range locks held. The server MUST return failure if + any locks would exist after the CLOSE. + + The argument seqid MAY have any value, and the server MUST ignore + seqid. + + On success, the current filehandle retains its value. + + The server MAY require that the combination of principal, security + flavor, and, if applicable, GSS mechanism that sent the OPEN request + also be the one to CLOSE the file. This might not be possible if + credentials for the principal are no longer available. The server + MAY allow the machine credential or SSV credential (see + Section 18.35) to send CLOSE. + +18.2.4. IMPLEMENTATION + + Even though CLOSE returns a stateid, this stateid is not useful to + the client and should be treated as deprecated. CLOSE "shuts down" + the state associated with all OPENs for the file by a single open- + owner. As noted above, CLOSE will either release all file-locking + state or return an error. Therefore, the stateid returned by CLOSE + is not useful for operations that follow. To help find any uses of + this stateid by clients, the server SHOULD return the invalid special + stateid (the "other" value is zero and the "seqid" field is + NFS4_UINT32_MAX, see Section 8.2.3). + + A CLOSE operation may make delegations grantable where they were not + previously. Servers may choose to respond immediately if there are + pending delegation want requests or may respond to the situation at a + later time. + +18.3. Operation 5: COMMIT - Commit Cached Data + +18.3.1. ARGUMENTS + + struct COMMIT4args { + /* CURRENT_FH: file */ + offset4 offset; + count4 count; + }; + +18.3.2. RESULTS + + struct COMMIT4resok { + verifier4 writeverf; + }; + + union COMMIT4res switch (nfsstat4 status) { + case NFS4_OK: + COMMIT4resok resok4; + default: + void; + }; + +18.3.3. DESCRIPTION + + The COMMIT operation forces or flushes uncommitted, modified data to + stable storage for the file specified by the current filehandle. The + flushed data is that which was previously written with one or more + WRITE operations that had the "committed" field of their results + field set to UNSTABLE4. 
+
+   The offset specifies the position within the file where the flush is
+   to begin.  An offset value of zero means to flush data starting at
+   the beginning of the file.  The count specifies the number of bytes
+   of data to flush.  If the count is zero, a flush from the offset to
+   the end of the file is done.
+
+   The server returns a write verifier upon successful completion of
+   the COMMIT.  The write verifier is used by the client to determine
+   if the server has restarted between the initial WRITE operations and
+   the COMMIT.  The client does this by comparing the write verifier
+   returned from the initial WRITE operations and the verifier returned
+   by the COMMIT operation.  The server must vary the value of the
+   write verifier at each server event or instantiation that may lead
+   to a loss of uncommitted data.  Most commonly this occurs when the
+   server is restarted; however, other events at the server may result
+   in uncommitted data loss as well.
+
+   On success, the current filehandle retains its value.
+
+18.3.4.  IMPLEMENTATION
+
+   The COMMIT operation is similar in operation and semantics to the
+   POSIX fsync() [22] system interface that synchronizes a file's state
+   with the disk (file data and metadata is flushed to disk or stable
+   storage).  COMMIT performs the same operation for a client, flushing
+   any unsynchronized data and metadata on the server to the server's
+   disk or stable storage for the specified file.  Like fsync(), it may
+   be that there is some modified data or no modified data to
+   synchronize.  The data may have been synchronized by the server's
+   normal periodic buffer synchronization activity.  COMMIT should
+   return NFS4_OK, unless there has been an unexpected error.
+
+   COMMIT differs from fsync() in that it is possible for the client to
+   flush a range of the file (most likely triggered by a buffer-
+   reclamation scheme on the client before the file has been completely
+   written).
+
+   The server implementation of COMMIT is reasonably simple.  If the
+   server receives a full file COMMIT request, that is, starting at
+   offset zero and count zero, it should do the equivalent of applying
+   fsync() to the entire file.  Otherwise, it should arrange for the
+   modified data in the range specified by offset and count to be
+   flushed to stable storage.  In both cases, any metadata associated
+   with the file must be flushed to stable storage before returning.
+   It is not an error for there to be nothing to flush on the server.
+   This means that the data and metadata that needed to be flushed have
+   already been flushed or lost during the last server failure.
+
+   The client implementation of COMMIT is a little more complex.  There
+   are two reasons for wanting to commit a client buffer to stable
+   storage.  The first is that the client wants to reuse a buffer.  In
+   this case, the offset and count of the buffer are sent to the server
+   in the COMMIT request.  The server then flushes any modified data
+   based on the offset and count, and flushes any modified metadata
+   associated with the file.  It then returns the status of the flush
+   and the write verifier.  The second reason for the client to
+   generate a COMMIT is for a full file flush, such as may be done at
+   close.  In this case, the client would gather all of the buffers for
+   this file that contain uncommitted data, do the COMMIT operation
+   with an offset of zero and count of zero, and then free all of those
+   buffers.  Any other dirty buffers would be sent to the server in the
+   normal fashion.
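+
+   The following C sketch is illustrative only and not part of the
+   protocol; it shows the two client-side uses of COMMIT just
+   described.  The send_commit() helper is a hypothetical placeholder
+   for the client's real SEQUENCE/PUTFH/COMMIT machinery and here
+   merely zeroes the verifier so that the example compiles and runs.
+
+      #include <stdint.h>
+      #include <stdio.h>
+      #include <string.h>
+
+      typedef uint64_t offset4;
+      typedef uint32_t count4;
+      typedef struct { unsigned char v[8]; } verifier4;
+
+      /* Hypothetical stand-in for issuing COMMIT over RPC. */
+      static int send_commit(offset4 offset, count4 count,
+                             verifier4 *writeverf)
+      {
+          (void)offset; (void)count;
+          memset(writeverf, 0, sizeof(*writeverf));
+          return 0; /* NFS4_OK */
+      }
+
+      /* Reason 1: reuse one buffer -- commit exactly its range. */
+      static int commit_buffer_range(offset4 off, count4 len,
+                                     verifier4 *verf)
+      {
+          return send_commit(off, len, verf);
+      }
+
+      /* Reason 2: full-file flush (e.g., at close).  Offset zero and
+       * count zero mean "flush from the start through end of file". */
+      static int commit_whole_file(verifier4 *verf)
+      {
+          return send_commit(0, 0, verf);
+      }
+
+      int main(void)
+      {
+          verifier4 verf;
+          int st1 = commit_buffer_range(65536, 32768, &verf);
+          int st2 = commit_whole_file(&verf);
+          printf("range commit: %d, full-file commit: %d\n", st1, st2);
+          return 0;
+      }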
+
+   After a buffer is written (via the WRITE operation) by the client
+   with the "committed" field in the result of WRITE set to UNSTABLE4,
+   the buffer must be considered as modified by the client until the
+   buffer has either been flushed via a COMMIT operation or written via
+   a WRITE operation with the "committed" field in the result set to
+   FILE_SYNC4 or DATA_SYNC4.  This is done to prevent the buffer from
+   being freed and reused before the data can be flushed to stable
+   storage on the server.
+
+   When a response is returned from either a WRITE or a COMMIT
+   operation and it contains a write verifier that differs from that
+   previously returned by the server, the client will need to
+   retransmit all of the buffers containing uncommitted data to the
+   server.  How this is to be done is up to the implementor.  If there
+   is only one buffer of interest, then it should be sent in a WRITE
+   request with the FILE_SYNC4 stable parameter.  If there is more than
+   one buffer, it might be worthwhile retransmitting all of the buffers
+   in WRITE operations with the stable parameter set to UNSTABLE4 and
+   then retransmitting the COMMIT operation to flush all of the data on
+   the server to stable storage.  However, if the server repeatedly
+   returns from COMMIT a verifier that differs from that returned by
+   WRITE, the only way to ensure progress is to retransmit all of the
+   buffers with WRITE requests with the FILE_SYNC4 stable parameter.
+
+   The above description applies to page-cache-based systems as well as
+   buffer-cache-based systems.  In the former systems, the virtual
+   memory system will need to be modified instead of the buffer cache.
+
+18.4.  Operation 6: CREATE - Create a Non-Regular File Object
+
+18.4.1.  ARGUMENTS
+
+   union createtype4 switch (nfs_ftype4 type) {
+    case NF4LNK:
+            linktext4 linkdata;
+    case NF4BLK:
+    case NF4CHR:
+            specdata4 devdata;
+    case NF4SOCK:
+    case NF4FIFO:
+    case NF4DIR:
+            void;
+    default:
+            void;  /* server should return NFS4ERR_BADTYPE */
+   };
+
+   struct CREATE4args {
+           /* CURRENT_FH: directory for creation */
+           createtype4     objtype;
+           component4      objname;
+           fattr4          createattrs;
+   };
+
+18.4.2.  RESULTS
+
+   struct CREATE4resok {
+           change_info4    cinfo;
+           bitmap4         attrset;        /* attributes set */
+   };
+
+   union CREATE4res switch (nfsstat4 status) {
+    case NFS4_OK:
+           /* new CURRENTFH: created object */
+           CREATE4resok resok4;
+    default:
+           void;
+   };
+
+18.4.3.  DESCRIPTION
+
+   The CREATE operation creates a file object other than an ordinary
+   file in a directory with a given name.  The OPEN operation MUST be
+   used to create a regular file or a named attribute.
+
+   The current filehandle must be a directory: an object of type
+   NF4DIR.  If the current filehandle is an attribute directory (type
+   NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned.  If the
+   current filehandle designates any other type of object, the error
+   NFS4ERR_NOTDIR results.
+
+   The objname specifies the name for the new object.  The objtype
+   determines the type of object to be created: directory, symlink,
+   etc.  If the object type specified is that of an ordinary file, a
+   named attribute, or a named attribute directory, the error
+   NFS4ERR_BADTYPE results.
+
+   If an object of the same name already exists in the directory, the
+   server will return the error NFS4ERR_EXIST.
+
+   For the directory where the new file object was created, the server
+   returns change_info4 information in cinfo.
+   With the atomic field of the change_info4 data type, the server will
+   indicate if the before and after change attributes were obtained
+   atomically with respect to the file object creation.
+
+   If the objname has a length of zero, or if objname does not obey the
+   UTF-8 definition, the error NFS4ERR_INVAL will be returned.
+
+   The current filehandle is replaced by that of the new object.
+
+   The createattrs specifies the initial set of attributes for the
+   object.  The set of attributes may include any writable attribute
+   valid for the object type.  When the operation is successful, the
+   server will return to the client an attribute mask signifying which
+   attributes were successfully set for the object.
+
+   If createattrs includes neither the owner attribute nor an ACL with
+   an ACE for the owner, and if the server's file system both supports
+   and requires an owner attribute (or an owner ACE), then the server
+   MUST derive the owner (or the owner ACE).  This would typically be
+   from the principal indicated in the RPC credentials of the call, but
+   the server's operating environment or file system semantics may
+   dictate other methods of derivation.  Similarly, if createattrs
+   includes neither the group attribute nor a group ACE, and if the
+   server's file system both supports and requires the notion of a
+   group attribute (or group ACE), the server MUST derive the group
+   attribute (or the corresponding group ACE) for the file.  This could
+   be from the RPC call's credentials, such as the group principal if
+   the credentials include it (such as with AUTH_SYS), from the group
+   identifier associated with the principal in the credentials (e.g.,
+   POSIX systems have a user database [23] that has a group identifier
+   for every user identifier), inherited from the directory in which
+   the object is created, or whatever else the server's operating
+   environment or file system semantics dictate.  This applies to the
+   OPEN operation too.
+
+   Conversely, it is possible that the client will specify in
+   createattrs an owner attribute, group attribute, or ACL that the
+   principal indicated in the RPC call's credentials does not have
+   permissions to create files for.  The error to be returned in this
+   instance is NFS4ERR_PERM.  This applies to the OPEN operation too.
+
+   If the current filehandle designates a directory for which another
+   client holds a directory delegation, then, unless the delegation is
+   such that the situation can be resolved by sending a notification,
+   the delegation MUST be recalled, and the CREATE operation MUST NOT
+   proceed until the delegation is returned or revoked.  Except where
+   this happens very quickly, one or more NFS4ERR_DELAY errors will be
+   returned to requests made while delegation remains outstanding.
+
+   When the current filehandle designates a directory for which one or
+   more directory delegations exist, then, when those delegations
+   request such notifications, NOTIFY4_ADD_ENTRY will be generated as a
+   result of this operation.
+
+   If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set
+   (Section 14.4), and a symbolic link is being created, then the
+   content of the symbolic link MUST be in UTF-8 encoding.
+
+18.4.4.  IMPLEMENTATION
+
+   If the client desires to set attribute values after the create, a
+   SETATTR operation can be added to the COMPOUND request so that the
+   appropriate attributes will be set.
+
+18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery
+
+18.5.1.  ARGUMENTS
+
+   struct DELEGPURGE4args {
+           clientid4       clientid;
+   };
+
+18.5.2.  RESULTS
+
+   struct DELEGPURGE4res {
+           nfsstat4        status;
+   };
+
+18.5.3.  DESCRIPTION
+
+   This operation purges all of the delegations awaiting recovery for a
+   given client.  This is useful for clients that do not commit
+   delegation information to stable storage to indicate that
+   conflicting requests need not be delayed by the server awaiting
+   recovery of delegation information.
+
+   The client is NOT specified by the clientid field of the request.
+   The client SHOULD set the clientid field to zero, and the server
+   MUST ignore the clientid field.  Instead, the server MUST derive the
+   client ID from the value of the session ID in the arguments of the
+   SEQUENCE operation that precedes DELEGPURGE in the COMPOUND request.
+
+   The DELEGPURGE operation should be used by clients that record
+   delegation information on stable storage on the client.  In this
+   case, after the client recovers all delegations it knows of, it
+   should immediately send a DELEGPURGE operation.  Doing so will
+   notify the server that no additional delegations for the client will
+   be recovered, allowing it to free resources and avoid delaying other
+   clients that make requests that conflict with the unrecovered
+   delegations.  The set of delegations known to the server and the
+   client might be different.  The reason for this is that after
+   sending a request that resulted in a delegation, the client might
+   experience a failure before it both received the delegation and
+   committed the delegation to the client's stable storage.
+
+   The server MAY support DELEGPURGE, but if it does not, it MUST NOT
+   support CLAIM_DELEGATE_PREV and MUST NOT support CLAIM_DELEG_PREV_FH.
+
+18.6.  Operation 8: DELEGRETURN - Return Delegation
+
+18.6.1.  ARGUMENTS
+
+   struct DELEGRETURN4args {
+           /* CURRENT_FH: delegated object */
+           stateid4        deleg_stateid;
+   };
+
+18.6.2.  RESULTS
+
+   struct DELEGRETURN4res {
+           nfsstat4        status;
+   };
+
+18.6.3.  DESCRIPTION
+
+   The DELEGRETURN operation returns the delegation represented by the
+   current filehandle and stateid.
+
+   Delegations may be returned voluntarily (i.e., before the server has
+   recalled them) or when recalled.  In either case, the client must
+   properly propagate state changed under the context of the delegation
+   to the server before returning the delegation.
+
+   The server MAY require that the principal, security flavor, and if
+   applicable, the GSS mechanism, combination that acquired the
+   delegation also be the one to send DELEGRETURN on the file.  This
+   might not be possible if credentials for the principal are no longer
+   available.  The server MAY allow the machine credential or SSV
+   credential (see Section 18.35) to send DELEGRETURN.
+
+18.7.  Operation 9: GETATTR - Get Attributes
+
+18.7.1.  ARGUMENTS
+
+   struct GETATTR4args {
+           /* CURRENT_FH: object */
+           bitmap4         attr_request;
+   };
+
+18.7.2.  RESULTS
+
+   struct GETATTR4resok {
+           fattr4          obj_attributes;
+   };
+
+   union GETATTR4res switch (nfsstat4 status) {
+    case NFS4_OK:
+           GETATTR4resok   resok4;
+    default:
+           void;
+   };
+
+18.7.3.  DESCRIPTION
+
+   The GETATTR operation will obtain attributes for the file system
+   object specified by the current filehandle.  The client sets a bit
+   in the bitmap argument for each attribute value that it would like
+   the server to return.
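+
+   As a non-normative illustration, the following C fragment sketches
+   how a client might build the attr_request bitmap (bitmap4, a
+   variable-length array of 32-bit words in which attribute number n
+   maps to bit n mod 32 of word n div 32).  The attribute numbers used
+   here (4 for size, 33 for mode) follow the fattr4 definitions
+   elsewhere in this document.
+
+      #include <stdint.h>
+      #include <stdio.h>
+
+      #define BITMAP4_WORDS 3   /* enough for attributes 0..95 */
+
+      /* Set the bit for attribute number "attrnum". */
+      static void attr_set(uint32_t *bm, unsigned attrnum)
+      {
+          bm[attrnum / 32] |= (uint32_t)1 << (attrnum % 32);
+      }
+
+      int main(void)
+      {
+          uint32_t attr_request[BITMAP4_WORDS] = { 0 };
+
+          attr_set(attr_request, 4);    /* size (attribute 4)  */
+          attr_set(attr_request, 33);   /* mode (attribute 33) */
+
+          for (unsigned i = 0; i < BITMAP4_WORDS; i++)
+              printf("word %u: 0x%08x\n", i, attr_request[i]);
+          return 0;
+      }
+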
The server returns an attribute bitmap that + indicates the attribute values that it was able to return, which will + include all attributes requested by the client that are attributes + supported by the server for the target file system. This bitmap is + followed by the attribute values ordered lowest attribute number + first. + + The server MUST return a value for each attribute that the client + requests if the attribute is supported by the server for the target + file system. If the server does not support a particular attribute + on the target file system, then it MUST NOT return the attribute + value and MUST NOT set the attribute bit in the result bitmap. The + server MUST return an error if it supports an attribute on the target + but cannot obtain its value. In that case, no attribute values will + be returned. + + File systems that are absent should be treated as having support for + a very small set of attributes as described in Section 11.4.1, even + if previously, when the file system was present, more attributes were + supported. + + All servers MUST support the REQUIRED attributes as specified in + Section 5.6, for all file systems, with the exception of absent file + systems. + + On success, the current filehandle retains its value. + +18.7.4. IMPLEMENTATION + + Suppose there is an OPEN_DELEGATE_WRITE delegation held by another + client for the file in question and size and/or change are among the + set of attributes being interrogated. The server has two choices. + First, the server can obtain the actual current value of these + attributes from the client holding the delegation by using the + CB_GETATTR callback. Second, the server, particularly when the + delegated client is unresponsive, can recall the delegation in + question. The GETATTR MUST NOT proceed until one of the following + occurs: + + * The requested attribute values are returned in the response to + CB_GETATTR. + + * The OPEN_DELEGATE_WRITE delegation is returned. + + * The OPEN_DELEGATE_WRITE delegation is revoked. + + Unless one of the above happens very quickly, one or more + NFS4ERR_DELAY errors will be returned while a delegation is + outstanding. + +18.8. Operation 10: GETFH - Get Current Filehandle + +18.8.1. ARGUMENTS + + /* CURRENT_FH: */ + void; + +18.8.2. RESULTS + + struct GETFH4resok { + nfs_fh4 object; + }; + + union GETFH4res switch (nfsstat4 status) { + case NFS4_OK: + GETFH4resok resok4; + default: + void; + }; + +18.8.3. DESCRIPTION + + This operation returns the current filehandle value. + + On success, the current filehandle retains its value. + + As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to + immediately follow certain operations, and servers are free to reject + such operations if the client fails to insert GETFH in the request as + REQUIRED or RECOMMENDED. Section 18.16.4.1 provides additional + justification for why GETFH MUST follow OPEN. + +18.8.4. IMPLEMENTATION + + Operations that change the current filehandle like LOOKUP or CREATE + do not automatically return the new filehandle as a result. For + instance, if a client needs to look up a directory entry and obtain + its filehandle, then the following request is needed. + + PUTFH (directory filehandle) + + LOOKUP (entry name) + + GETFH + +18.9. Operation 11: LINK - Create Link to a File + +18.9.1. ARGUMENTS + + struct LINK4args { + /* SAVED_FH: source object */ + /* CURRENT_FH: target directory */ + component4 newname; + }; + +18.9.2. 
RESULTS + + struct LINK4resok { + change_info4 cinfo; + }; + + union LINK4res switch (nfsstat4 status) { + case NFS4_OK: + LINK4resok resok4; + default: + void; + }; + +18.9.3. DESCRIPTION + + The LINK operation creates an additional newname for the file + represented by the saved filehandle, as set by the SAVEFH operation, + in the directory represented by the current filehandle. The existing + file and the target directory must reside within the same file system + on the server. On success, the current filehandle will continue to + be the target directory. If an object exists in the target directory + with the same name as newname, the server must return NFS4ERR_EXIST. + + For the target directory, the server returns change_info4 information + in cinfo. With the atomic field of the change_info4 data type, the + server will indicate if the before and after change attributes were + obtained atomically with respect to the link creation. + + If the newname has a length of zero, or if newname does not obey the + UTF-8 definition, the error NFS4ERR_INVAL will be returned. + +18.9.4. IMPLEMENTATION + + The server MAY impose restrictions on the LINK operation such that + LINK may not be done when the file is open or when that open is done + by particular protocols, or with particular options or access modes. + When LINK is rejected because of such restrictions, the error + NFS4ERR_FILE_OPEN is returned. + + If a server does implement such restrictions and those restrictions + include cases of NFSv4 opens preventing successful execution of a + link, the server needs to recall any delegations that could hide the + existence of opens relevant to that decision. The reason is that + when a client holds a delegation, the server might not have an + accurate account of the opens for that client, since the client may + execute OPENs and CLOSEs locally. The LINK operation must be delayed + only until a definitive result can be obtained. For example, suppose + there are multiple delegations and one of them establishes an open + whose presence would prevent the link. Given the server's semantics, + NFS4ERR_FILE_OPEN may be returned to the caller as soon as that + delegation is returned without waiting for other delegations to be + returned. Similarly, if such opens are not associated with + delegations, NFS4ERR_FILE_OPEN can be returned immediately with no + delegation recall being done. + + If the current filehandle designates a directory for which another + client holds a directory delegation, then, unless the delegation is + such that the situation can be resolved by sending a notification, + the delegation MUST be recalled, and the operation cannot be + performed successfully until the delegation is returned or revoked. + Except where this happens very quickly, one or more NFS4ERR_DELAY + errors will be returned to requests made while delegation remains + outstanding. + + When the current filehandle designates a directory for which one or + more directory delegations exist, then, when those delegations + request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY + will be generated as a result of the LINK operation. + + If the current file system supports the numlinks attribute, and other + clients have delegations to the file being linked, then those + delegations MUST be recalled and the LINK operation MUST NOT proceed + until all delegations are returned or revoked. 
Except where this + happens very quickly, one or more NFS4ERR_DELAY errors will be + returned to requests made while delegation remains outstanding. + + Changes to any property of the "hard" linked files are reflected in + all of the linked files. When a link is made to a file, the + attributes for the file should have a value for numlinks that is one + greater than the value before the LINK operation. + + The statement "file and the target directory must reside within the + same file system on the server" means that the fsid fields in the + attributes for the objects are the same. If they reside on different + file systems, the error NFS4ERR_XDEV is returned. This error may be + returned by some servers when there is an internal partitioning of a + file system that the LINK operation would violate. + + On some servers, "." and ".." are illegal values for newname and the + error NFS4ERR_BADNAME will be returned if they are specified. + + When the current filehandle designates a named attribute directory + and the object to be linked (the saved filehandle) is not a named + attribute for the same object, the error NFS4ERR_XDEV MUST be + returned. When the saved filehandle designates a named attribute and + the current filehandle is not the appropriate named attribute + directory, the error NFS4ERR_XDEV MUST also be returned. + + When the current filehandle designates a named attribute directory + and the object to be linked (the saved filehandle) is a named + attribute within that directory, the server may return the error + NFS4ERR_NOTSUPP. + + In the case that newname is already linked to the file represented by + the saved filehandle, the server will return NFS4ERR_EXIST. + + Note that symbolic links are created with the CREATE operation. + +18.10. Operation 12: LOCK - Create Lock + +18.10.1. ARGUMENTS + + /* + * For LOCK, transition from open_stateid and lock_owner + * to a lock stateid. + */ + struct open_to_lock_owner4 { + seqid4 open_seqid; + stateid4 open_stateid; + seqid4 lock_seqid; + lock_owner4 lock_owner; + }; + + /* + * For LOCK, existing lock stateid continues to request new + * file lock for the same lock_owner and open_stateid. + */ + struct exist_lock_owner4 { + stateid4 lock_stateid; + seqid4 lock_seqid; + }; + + union locker4 switch (bool new_lock_owner) { + case TRUE: + open_to_lock_owner4 open_owner; + case FALSE: + exist_lock_owner4 lock_owner; + }; + + /* + * LOCK/LOCKT/LOCKU: Record lock management + */ + struct LOCK4args { + /* CURRENT_FH: file */ + nfs_lock_type4 locktype; + bool reclaim; + offset4 offset; + length4 length; + locker4 locker; + }; + +18.10.2. RESULTS + + struct LOCK4denied { + offset4 offset; + length4 length; + nfs_lock_type4 locktype; + lock_owner4 owner; + }; + + struct LOCK4resok { + stateid4 lock_stateid; + }; + + union LOCK4res switch (nfsstat4 status) { + case NFS4_OK: + LOCK4resok resok4; + case NFS4ERR_DENIED: + LOCK4denied denied; + default: + void; + }; + +18.10.3. DESCRIPTION + + The LOCK operation requests a byte-range lock for the byte-range + specified by the offset and length parameters, and lock type + specified in the locktype parameter. If this is a reclaim request, + the reclaim parameter will be TRUE. + + Bytes in a file may be locked even if those bytes are not currently + allocated to the file. To lock the file from a specific offset + through the end-of-file (no matter how long the file actually is) use + a length field equal to NFS4_UINT64_MAX. 
The server MUST return + NFS4ERR_INVAL under the following combinations of length and offset: + + * Length is equal to zero. + + * Length is not equal to NFS4_UINT64_MAX, and the sum of length and + offset exceeds NFS4_UINT64_MAX. + + 32-bit servers are servers that support locking for byte offsets that + fit within 32 bits (i.e., less than or equal to NFS4_UINT32_MAX). If + the client specifies a range that overlaps one or more bytes beyond + offset NFS4_UINT32_MAX but does not end at offset NFS4_UINT64_MAX, + then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE. + + If the server returns NFS4ERR_DENIED, the owner, offset, and length + of a conflicting lock are returned. + + The locker argument specifies the lock-owner that is associated with + the LOCK operation. The locker4 structure is a switched union that + indicates whether the client has already created byte-range locking + state associated with the current open file and lock-owner. In the + case in which it has, the argument is just a stateid representing the + set of locks associated with that open file and lock-owner, together + with a lock_seqid value that MAY be any value and MUST be ignored by + the server. In the case where no byte-range locking state has been + established, or the client does not have the stateid available, the + argument contains the stateid of the open file with which this lock + is to be associated, together with the lock-owner with which the lock + is to be associated. The open_to_lock_owner case covers the very + first lock done by a lock-owner for a given open file and offers a + method to use the established state of the open_stateid to transition + to the use of a lock stateid. + + The following fields of the locker parameter MAY be set to any value + by the client and MUST be ignored by the server: + + * The clientid field of the lock_owner field of the open_owner field + (locker.open_owner.lock_owner.clientid). The reason the server + MUST ignore the clientid field is that the server MUST derive the + client ID from the session ID from the SEQUENCE operation of the + COMPOUND request. + + * The open_seqid and lock_seqid fields of the open_owner field + (locker.open_owner.open_seqid and locker.open_owner.lock_seqid). + + * The lock_seqid field of the lock_owner field + (locker.lock_owner.lock_seqid). + + Note that the client ID appearing in a LOCK4denied structure is the + actual client associated with the conflicting lock, whether this is + the client ID associated with the current session or a different one. + Thus, if the server returns NFS4ERR_DENIED, it MUST set the clientid + field of the owner field of the denied field. + + If the current filehandle is not an ordinary file, an error will be + returned to the client. In the case that the current filehandle + represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If + the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is + returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. + + On success, the current filehandle retains its value. + +18.10.4. IMPLEMENTATION + + If the server is unable to determine the exact offset and length of + the conflicting byte-range lock, the same offset and length that were + provided in the arguments should be returned in the denied results. + + LOCK operations are subject to permission checks and to checks + against the access type of the associated file. 
+   However, the specific rights and modes required for various types of
+   locks reflect the semantics of the server-exported file system, and
+   are not specified by the protocol.  For example, Windows 2000 allows
+   a write lock of a file open for read access, while a POSIX-compliant
+   system does not.
+
+   When the client sends a LOCK operation that corresponds to a range
+   that the lock-owner has locked already (with the same or different
+   lock type), or to a sub-range of such a range, or to a byte-range
+   that includes multiple locks already granted to that lock-owner, in
+   whole or in part, and the server does not support such locking
+   operations (i.e., does not support POSIX locking semantics), the
+   server will return the error NFS4ERR_LOCK_RANGE.  In that case, the
+   client may return an error, or it may emulate the required
+   operations, using only LOCK for ranges that do not include any bytes
+   already locked by that lock-owner and LOCKU of locks held by that
+   lock-owner (specifying an exactly matching range and type).
+   Similarly, when the client sends a LOCK operation that amounts to
+   upgrading (changing from a READ_LT lock to a WRITE_LT lock) or
+   downgrading (changing from WRITE_LT lock to a READ_LT lock) an
+   existing byte-range lock, and the server does not support such a
+   lock, the server will return NFS4ERR_LOCK_NOTSUPP.  Such operations
+   may not perfectly reflect the required semantics in the face of
+   conflicting LOCK operations from other clients.
+
+   When a client holds an OPEN_DELEGATE_WRITE delegation, the client
+   holding that delegation is assured that there are no opens by other
+   clients.  Thus, there can be no conflicting LOCK operations from
+   such clients.  Therefore, the client may be handling locking
+   requests locally, without doing LOCK operations on the server.  If
+   it does that, it must be prepared to update the lock status on the
+   server, by sending appropriate LOCK and LOCKU operations before
+   returning the delegation.
+
+   When one or more clients hold OPEN_DELEGATE_READ delegations, any
+   LOCK operation where the server is implementing mandatory locking
+   semantics MUST result in the recall of all such delegations.  The
+   LOCK operation may not be granted until all such delegations are
+   returned or revoked.  Except where this happens very quickly, one or
+   more NFS4ERR_DELAY errors will be returned to requests made while
+   the delegation remains outstanding.
+
+18.11.  Operation 13: LOCKT - Test for Lock
+
+18.11.1.  ARGUMENTS
+
+   struct LOCKT4args {
+           /* CURRENT_FH: file */
+           nfs_lock_type4  locktype;
+           offset4         offset;
+           length4         length;
+           lock_owner4     owner;
+   };
+
+18.11.2.  RESULTS
+
+   union LOCKT4res switch (nfsstat4 status) {
+    case NFS4ERR_DENIED:
+           LOCK4denied     denied;
+    case NFS4_OK:
+           void;
+    default:
+           void;
+   };
+
+18.11.3.  DESCRIPTION
+
+   The LOCKT operation tests the lock as specified in the arguments.
+   If a conflicting lock exists, the owner, offset, length, and type of
+   the conflicting lock are returned.  The owner field in the results
+   includes the client ID of the owner of the conflicting lock, whether
+   this is the client ID associated with the current session or a
+   different client ID.  If no lock is held, nothing other than NFS4_OK
+   is returned.  Lock types READ_LT and READW_LT are processed in the
+   same way in that a conflicting lock test is done without regard to
+   blocking or non-blocking.  The same is true for WRITE_LT and
+   WRITEW_LT.
+
+   The ranges are specified as for LOCK.
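+
+   The following C fragment offers a non-normative rendering of the
+   range checks of Section 18.10.3, which apply to LOCKT as well; the
+   numeric status values are illustrative stand-ins for the protocol's
+   nfsstat4 codes.
+
+      #include <stdbool.h>
+      #include <stdint.h>
+      #include <stdio.h>
+
+      #define NFS4_UINT32_MAX 0xffffffffUL
+      #define NFS4_UINT64_MAX 0xffffffffffffffffULL
+
+      /* Illustrative stand-ins for nfsstat4 values. */
+      enum { NFS4_OK, NFS4ERR_INVAL, NFS4ERR_BAD_RANGE };
+
+      /* Validate a LOCK/LOCKT byte-range per Section 18.10.3.  A
+       * length of NFS4_UINT64_MAX means "through end of file". */
+      static int check_lock_range(uint64_t offset, uint64_t length,
+                                  bool server_is_32bit)
+      {
+          if (length == 0)
+              return NFS4ERR_INVAL;
+
+          /* offset + length must not exceed NFS4_UINT64_MAX unless
+           * this is a lock to end-of-file. */
+          if (length != NFS4_UINT64_MAX &&
+              offset > NFS4_UINT64_MAX - length)
+              return NFS4ERR_INVAL;
+
+          /* A 32-bit server rejects ranges that reach beyond offset
+           * NFS4_UINT32_MAX yet do not end at NFS4_UINT64_MAX. */
+          if (server_is_32bit &&
+              length != NFS4_UINT64_MAX &&
+              offset + length - 1 != NFS4_UINT64_MAX &&
+              offset + length - 1 > NFS4_UINT32_MAX)
+              return NFS4ERR_BAD_RANGE;
+
+          return NFS4_OK;
+      }
+
+      int main(void)
+      {
+          /* zero length: INVAL */
+          printf("%d\n", check_lock_range(0, 0, false));
+          /* whole-file lock is fine even on a 32-bit server */
+          printf("%d\n", check_lock_range(0, NFS4_UINT64_MAX, true));
+          return 0;
+      }
+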
+   The NFS4ERR_INVAL and NFS4ERR_BAD_RANGE errors are returned under
+   the same circumstances as for LOCK.
+
+   The clientid field of the owner MAY be set to any value by the
+   client and MUST be ignored by the server.  The reason the server
+   MUST ignore the clientid field is that the server MUST derive the
+   client ID from the session ID from the SEQUENCE operation of the
+   COMPOUND request.
+
+   If the current filehandle is not an ordinary file, an error will be
+   returned to the client.  In the case that the current filehandle
+   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
+   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK
+   is returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
+
+   On success, the current filehandle retains its value.
+
+18.11.4.  IMPLEMENTATION
+
+   If the server is unable to determine the exact offset and length of
+   the conflicting lock, the same offset and length that were provided
+   in the arguments should be returned in the denied results.
+
+   LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK,
+   to identify the owner.  This is because the client does not have to
+   open the file to test for the existence of a lock, so a stateid
+   might not be available.
+
+   As noted in Section 18.10.4, some servers may return
+   NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) LOCK
+   operations that overlap ranges already granted to the current lock-
+   owner.
+
+   The LOCKT operation's test for conflicting locks SHOULD exclude
+   locks for the current lock-owner, and thus should return NFS4_OK in
+   such cases.  Note that this means that a server might return NFS4_OK
+   to a LOCKT request even though a LOCK operation for the same range
+   and lock-owner would fail with NFS4ERR_LOCK_RANGE.
+
+   When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose
+   (see Section 18.10.4) to handle LOCK requests locally.  In such a
+   case, LOCKT requests will similarly be handled locally.
+
+18.12.  Operation 14: LOCKU - Unlock File
+
+18.12.1.  ARGUMENTS
+
+   struct LOCKU4args {
+           /* CURRENT_FH: file */
+           nfs_lock_type4  locktype;
+           seqid4          seqid;
+           stateid4        lock_stateid;
+           offset4         offset;
+           length4         length;
+   };
+
+18.12.2.  RESULTS
+
+   union LOCKU4res switch (nfsstat4 status) {
+    case NFS4_OK:
+           stateid4        lock_stateid;
+    default:
+           void;
+   };
+
+18.12.3.  DESCRIPTION
+
+   The LOCKU operation unlocks the byte-range lock specified by the
+   parameters.  The client may set the locktype field to any value that
+   is legal for the nfs_lock_type4 enumerated type, and the server MUST
+   accept any legal value for locktype.  Any legal value for locktype
+   has no effect on the success or failure of the LOCKU operation.
+
+   The ranges are specified as for LOCK.  The NFS4ERR_INVAL and
+   NFS4ERR_BAD_RANGE errors are returned under the same circumstances
+   as for LOCK.
+
+   The seqid parameter MAY be any value and the server MUST ignore it.
+
+   If the current filehandle is not an ordinary file, an error will be
+   returned to the client.  In the case that the current filehandle
+   represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
+   the current filehandle designates a symbolic link, NFS4ERR_SYMLINK
+   is returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.
+
+   On success, the current filehandle retains its value.
+
+   The server MAY require that the principal, security flavor, and if
+   applicable, the GSS mechanism, combination that sent a LOCK
+   operation also be the one to send LOCKU on the file.
+   This might not be possible if credentials for the principal are no
+   longer available.  The server MAY allow the machine credential or
+   SSV credential (see Section 18.35) to send LOCKU.
+
+18.12.4.  IMPLEMENTATION
+
+   If the area to be unlocked does not correspond exactly to a lock
+   actually held by the lock-owner, the server may return the error
+   NFS4ERR_LOCK_RANGE.  This includes the case in which the area is not
+   locked, where the area is a sub-range of the area locked, where it
+   overlaps the area locked without matching exactly, or the area
+   specified includes multiple locks held by the lock-owner.  In all of
+   these cases, allowed by POSIX locking [21] semantics, a client
+   receiving this error should, if it desires support for such
+   operations, simulate the operation using LOCKU on ranges
+   corresponding to locks it actually holds, possibly followed by LOCK
+   operations for the sub-ranges not being unlocked.
+
+   When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose
+   (see Section 18.10.4) to handle LOCK requests locally.  In such a
+   case, LOCKU operations will similarly be handled locally.
+
+18.13.  Operation 15: LOOKUP - Lookup Filename
+
+18.13.1.  ARGUMENTS
+
+   struct LOOKUP4args {
+           /* CURRENT_FH: directory */
+           component4      objname;
+   };
+
+18.13.2.  RESULTS
+
+   struct LOOKUP4res {
+           /* New CURRENT_FH: object */
+           nfsstat4        status;
+   };
+
+18.13.3.  DESCRIPTION
+
+   The LOOKUP operation looks up or finds a file system object using
+   the directory specified by the current filehandle.  LOOKUP evaluates
+   the component and if the object exists, the current filehandle is
+   replaced with the component's filehandle.
+
+   If the component cannot be evaluated either because it does not
+   exist or because the client does not have permission to evaluate the
+   component, then an error will be returned and the current filehandle
+   will be unchanged.
+
+   If the component is a zero-length string or if any component does
+   not obey the UTF-8 definition, the error NFS4ERR_INVAL will be
+   returned.
+
+18.13.4.  IMPLEMENTATION
+
+   If the client wants to achieve the effect of a multi-component look
+   up, it may construct a COMPOUND request such as (and obtain each
+   filehandle):
+
+      PUTFH  (directory filehandle)
+      LOOKUP "pub"
+      GETFH
+      LOOKUP "foo"
+      GETFH
+      LOOKUP "bar"
+      GETFH
+
+   Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on
+   the server.  The client can detect a mountpoint crossing by
+   comparing the fsid attribute of the directory with the fsid
+   attribute of the directory looked up.  If the fsids are different,
+   then the new directory is a server mountpoint.  UNIX clients that
+   detect a mountpoint crossing will need to mount the server's file
+   system.  This needs to be done to maintain the file object identity
+   checking mechanisms common to UNIX clients.
+
+   Servers that limit NFS access to "shared" or "exported" file systems
+   should provide a pseudo file system into which the exported file
+   systems can be integrated, so that clients can browse the server's
+   namespace.  The client's view of a pseudo file system will be
+   limited to paths that lead to exported file systems.
+
+   Note: previous versions of the protocol assigned special semantics
+   to the names "." and "..".  NFSv4.1 assigns no special semantics to
+   these names.  The LOOKUPP operator must be used to look up a parent
+   directory.
+
+   Note that this operation does not follow symbolic links.
The client + is responsible for all parsing of filenames including filenames that + are modified by symbolic links encountered during the look up + process. + + If the current filehandle supplied is not a directory but a symbolic + link, the error NFS4ERR_SYMLINK is returned as the error. For all + other non-directory file types, the error NFS4ERR_NOTDIR is returned. + +18.14. Operation 16: LOOKUPP - Lookup Parent Directory + +18.14.1. ARGUMENTS + + /* CURRENT_FH: object */ + void; + +18.14.2. RESULTS + + struct LOOKUPP4res { + /* new CURRENT_FH: parent directory */ + nfsstat4 status; + }; + +18.14.3. DESCRIPTION + + The current filehandle is assumed to refer to a regular directory or + a named attribute directory. LOOKUPP assigns the filehandle for its + parent directory to be the current filehandle. If there is no parent + directory, an NFS4ERR_NOENT error must be returned. Therefore, + NFS4ERR_NOENT will be returned by the server when the current + filehandle is at the root or top of the server's file tree. + + As is the case with LOOKUP, LOOKUPP will also cross mountpoints. + + If the current filehandle is not a directory or named attribute + directory, the error NFS4ERR_NOTDIR is returned. + + If the requester's security flavor does not match that configured for + the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC + (a future minor revision of NFSv4 may upgrade this to MUST) in the + LOOKUPP response. However, if the server does so, it MUST support + the SECINFO_NO_NAME operation (Section 18.45), so that the client can + gracefully determine the correct security flavor. + + If the current filehandle is a named attribute directory that is + associated with a file system object via OPENATTR (i.e., not a sub- + directory of a named attribute directory), LOOKUPP SHOULD return the + filehandle of the associated file system object. + +18.14.4. IMPLEMENTATION + + An issue to note is upward navigation from named attribute + directories. The named attribute directories are essentially + detached from the namespace, and this property should be safely + represented in the client operating environment. LOOKUPP on a named + attribute directory may return the filehandle of the associated file, + and conveying this to applications might be unsafe as many + applications expect the parent of an object to always be a directory. + Therefore, the client may want to hide the parent of named attribute + directories (represented as ".." in UNIX) or represent the named + attribute directory as its own parent (as is typically done for the + file system root directory in UNIX). + +18.15. Operation 17: NVERIFY - Verify Difference in Attributes + +18.15.1. ARGUMENTS + + struct NVERIFY4args { + /* CURRENT_FH: object */ + fattr4 obj_attributes; + }; + +18.15.2. RESULTS + + struct NVERIFY4res { + nfsstat4 status; + }; + +18.15.3. DESCRIPTION + + This operation is used to prefix a sequence of operations to be + performed if one or more attributes have changed on some file system + object. If all the attributes match, then the error NFS4ERR_SAME + MUST be returned. + + On success, the current filehandle retains its value. + +18.15.4. IMPLEMENTATION + + This operation is useful as a cache validation operator. 
If the + object to which the attributes belong has changed, then the following + operations may obtain new data associated with that object, for + instance, to check if a file has been changed and obtain new data if + it has: + + SEQUENCE + PUTFH fh + NVERIFY attrbits attrs + READ 0 32767 + + Contrast this with NFSv3, which would first send a GETATTR in one + request/reply round trip, and then if attributes indicated that the + client's cache was stale, then send a READ in another request/reply + round trip. + + In the case that a RECOMMENDED attribute is specified in the NVERIFY + operation and the server does not support that attribute for the file + system object, the error NFS4ERR_ATTRNOTSUPP is returned to the + client. + + When the attribute rdattr_error or any set-only attribute (e.g., + time_modify_set) is specified, the error NFS4ERR_INVAL is returned to + the client. + +18.16. Operation 18: OPEN - Open a Regular File + +18.16.1. ARGUMENTS + + /* + * Various definitions for OPEN + */ + enum createmode4 { + UNCHECKED4 = 0, + GUARDED4 = 1, + /* Deprecated in NFSv4.1. */ + EXCLUSIVE4 = 2, + /* + * New to NFSv4.1. If session is persistent, + * GUARDED4 MUST be used. Otherwise, use + * EXCLUSIVE4_1 instead of EXCLUSIVE4. + */ + EXCLUSIVE4_1 = 3 + }; + + struct creatverfattr { + verifier4 cva_verf; + fattr4 cva_attrs; + }; + + union createhow4 switch (createmode4 mode) { + case UNCHECKED4: + case GUARDED4: + fattr4 createattrs; + case EXCLUSIVE4: + verifier4 createverf; + case EXCLUSIVE4_1: + creatverfattr ch_createboth; + }; + + enum opentype4 { + OPEN4_NOCREATE = 0, + OPEN4_CREATE = 1 + }; + + union openflag4 switch (opentype4 opentype) { + case OPEN4_CREATE: + createhow4 how; + default: + void; + }; + + /* Next definitions used for OPEN delegation */ + enum limit_by4 { + NFS_LIMIT_SIZE = 1, + NFS_LIMIT_BLOCKS = 2 + /* others as needed */ + }; + + struct nfs_modified_limit4 { + uint32_t num_blocks; + uint32_t bytes_per_block; + }; + + union nfs_space_limit4 switch (limit_by4 limitby) { + /* limit specified as file size */ + case NFS_LIMIT_SIZE: + uint64_t filesize; + /* limit specified by number of blocks */ + case NFS_LIMIT_BLOCKS: + nfs_modified_limit4 mod_blocks; + } ; + + /* + * Share Access and Deny constants for open argument + */ + const OPEN4_SHARE_ACCESS_READ = 0x00000001; + const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; + const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; + + const OPEN4_SHARE_DENY_NONE = 0x00000000; + const OPEN4_SHARE_DENY_READ = 0x00000001; + const OPEN4_SHARE_DENY_WRITE = 0x00000002; + const OPEN4_SHARE_DENY_BOTH = 0x00000003; + + + /* new flags for share_access field of OPEN4args */ + const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00; + const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 0x0000; + const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100; + const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200; + const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300; + const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400; + const OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500; + + const + OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL + = 0x10000; + + const + OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED + = 0x20000; + + enum open_delegation_type4 { + OPEN_DELEGATE_NONE = 0, + OPEN_DELEGATE_READ = 1, + OPEN_DELEGATE_WRITE = 2, + OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */ + }; + + enum open_claim_type4 { + /* + * Not a reclaim. + */ + CLAIM_NULL = 0, + + CLAIM_PREVIOUS = 1, + CLAIM_DELEGATE_CUR = 2, + CLAIM_DELEGATE_PREV = 3, + + /* + * Not a reclaim. 
+ * + * Like CLAIM_NULL, but object identified + * by the current filehandle. + */ + CLAIM_FH = 4, /* new to v4.1 */ + + /* + * Like CLAIM_DELEGATE_CUR, but object identified + * by current filehandle. + */ + CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */ + + /* + * Like CLAIM_DELEGATE_PREV, but object identified + * by current filehandle. + */ + CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */ + }; + + struct open_claim_delegate_cur4 { + stateid4 delegate_stateid; + component4 file; + }; + + union open_claim4 switch (open_claim_type4 claim) { + /* + * No special rights to file. + * Ordinary OPEN of the specified file. + */ + case CLAIM_NULL: + /* CURRENT_FH: directory */ + component4 file; + /* + * Right to the file established by an + * open previous to server reboot. File + * identified by filehandle obtained at + * that time rather than by name. + */ + case CLAIM_PREVIOUS: + /* CURRENT_FH: file being reclaimed */ + open_delegation_type4 delegate_type; + + /* + * Right to file based on a delegation + * granted by the server. File is + * specified by name. + */ + case CLAIM_DELEGATE_CUR: + /* CURRENT_FH: directory */ + open_claim_delegate_cur4 delegate_cur_info; + + /* + * Right to file based on a delegation + * granted to a previous boot instance + * of the client. File is specified by name. + */ + case CLAIM_DELEGATE_PREV: + /* CURRENT_FH: directory */ + component4 file_delegate_prev; + + /* + * Like CLAIM_NULL. No special rights + * to file. Ordinary OPEN of the + * specified file by current filehandle. + */ + case CLAIM_FH: /* new to v4.1 */ + /* CURRENT_FH: regular file to open */ + void; + + /* + * Like CLAIM_DELEGATE_PREV. Right to file based on a + * delegation granted to a previous boot + * instance of the client. File is identified + * by filehandle. + */ + case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ + /* CURRENT_FH: file being opened */ + void; + + /* + * Like CLAIM_DELEGATE_CUR. Right to file based on + * a delegation granted by the server. + * File is identified by filehandle. + */ + case CLAIM_DELEG_CUR_FH: /* new to v4.1 */ + /* CURRENT_FH: file being opened */ + stateid4 oc_delegate_stateid; + + }; + + /* + * OPEN: Open a file, potentially receiving an OPEN delegation + */ + struct OPEN4args { + seqid4 seqid; + uint32_t share_access; + uint32_t share_deny; + open_owner4 owner; + openflag4 openhow; + open_claim4 claim; + }; + +18.16.2. RESULTS + + struct open_read_delegation4 { + stateid4 stateid; /* Stateid for delegation*/ + bool recall; /* Pre-recalled flag for + delegations obtained + by reclaim (CLAIM_PREVIOUS) */ + + nfsace4 permissions; /* Defines users who don't + need an ACCESS call to + open for read */ + }; + + struct open_write_delegation4 { + stateid4 stateid; /* Stateid for delegation */ + bool recall; /* Pre-recalled flag for + delegations obtained + by reclaim + (CLAIM_PREVIOUS) */ + + nfs_space_limit4 + space_limit; /* Defines condition that + the client must check to + determine whether the + file needs to be flushed + to the server on close. */ + + nfsace4 permissions; /* Defines users who don't + need an ACCESS call as + part of a delegated + open. 
*/
+ };
+
+    enum why_no_delegation4 { /* new to v4.1 */
+            WND4_NOT_WANTED                 = 0,
+            WND4_CONTENTION                 = 1,
+            WND4_RESOURCE                   = 2,
+            WND4_NOT_SUPP_FTYPE             = 3,
+            WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4,
+            WND4_NOT_SUPP_UPGRADE           = 5,
+            WND4_NOT_SUPP_DOWNGRADE         = 6,
+            WND4_CANCELLED                  = 7,
+            WND4_IS_DIR                     = 8
+    };
+
+    union open_none_delegation4 /* new to v4.1 */
+    switch (why_no_delegation4 ond_why) {
+            case WND4_CONTENTION:
+                    bool ond_server_will_push_deleg;
+            case WND4_RESOURCE:
+                    bool ond_server_will_signal_avail;
+            default:
+                    void;
+    };
+
+    union open_delegation4
+    switch (open_delegation_type4 delegation_type) {
+            case OPEN_DELEGATE_NONE:
+                    void;
+            case OPEN_DELEGATE_READ:
+                    open_read_delegation4 read;
+            case OPEN_DELEGATE_WRITE:
+                    open_write_delegation4 write;
+            case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */
+                    open_none_delegation4 od_whynone;
+    };
+
+    /*
+     * Result flags
+     */
+
+    /* Client must confirm open */
+    const OPEN4_RESULT_CONFIRM           = 0x00000002;
+    /* Type of file locking behavior at the server */
+    const OPEN4_RESULT_LOCKTYPE_POSIX    = 0x00000004;
+    /* Server will preserve file if removed while open */
+    const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008;
+
+    /*
+     * Server may use CB_NOTIFY_LOCK on locks
+     * derived from this open
+     */
+    const OPEN4_RESULT_MAY_NOTIFY_LOCK   = 0x00000020;
+
+    struct OPEN4resok {
+            stateid4        stateid;        /* Stateid for open */
+            change_info4    cinfo;          /* Directory Change Info */
+            uint32_t        rflags;         /* Result flags */
+            bitmap4         attrset;        /* attribute set for create */
+            open_delegation4 delegation;    /* Info on any open
+                                               delegation */
+    };
+
+    union OPEN4res switch (nfsstat4 status) {
+            case NFS4_OK:
+                    /* New CURRENT_FH: opened file */
+                    OPEN4resok      resok4;
+            default:
+                    void;
+    };
+
+ 18.16.3. DESCRIPTION
+
+ The OPEN operation opens a regular file in a directory with the
+ provided name or filehandle. OPEN can also create a file if a name
+ is provided, and the client specifies it wants to create a file.
+ Specification of whether or not a file is to be created, and the
+ method of creation, is via the openhow parameter. The openhow
+ parameter consists of a switched union (data type openflag4), which
+ switches on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE).
+ If OPEN4_CREATE is specified, this leads to another switched union
+ (data type createhow4) that supports four cases of creation methods:
+ UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1. If opentype is
+ OPEN4_CREATE, then the claim field of the OPEN argument MUST be one
+ of CLAIM_NULL, CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, because
+ these claim methods include a component of a file name.
+
+ Upon success (which might entail creation of a new file), the current
+ filehandle is replaced by that of the created or existing object.
+
+ If the current filehandle is a named attribute directory, OPEN will
+ then create or open a named attribute file. Note that exclusive
+ create of a named attribute is not supported. If the createmode is
+ EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named
+ attribute directory, the server will return NFS4ERR_INVAL.
+
+ UNCHECKED4 means that the file should be created if a file of that
+ name does not exist, and that encountering an existing regular file
+ of that name is not an error. For this type of create, createattrs
+ specifies the initial set of attributes for the file. The set of
+ attributes may include any writable attribute valid for regular
+ files.
When an + UNCHECKED4 create encounters an existing file, the attributes + specified by createattrs are not used, except that when createattrs + specifies the size attribute with a size of zero, the existing file + is truncated. + + If GUARDED4 is specified, the server checks for the presence of a + duplicate object by name before performing the create. If a + duplicate exists, NFS4ERR_EXIST is returned. If the object does not + exist, the request is performed as described for UNCHECKED4. + + For the UNCHECKED4 and GUARDED4 cases, where the operation is + successful, the server will return to the client an attribute mask + signifying which attributes were successfully set for the object. + + EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow + exclusive creation semantics, using the verifier to ensure exclusive + creation of the target. The server should check for the presence of + a duplicate object by name. If the object does not exist, the server + creates the object and stores the verifier with the object. If the + object does exist and the stored verifier matches the client provided + verifier, the server uses the existing object as the newly created + object. If the stored verifier does not match, then an error of + NFS4ERR_EXIST is returned. + + If using EXCLUSIVE4, and if the server uses attributes to store the + exclusive create verifier, the server will signify which attributes + it used by setting the appropriate bits in the attribute mask that is + returned in the results. Unlike UNCHECKED4, GUARDED4, and + EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes + at file creation, and after a successful OPEN via EXCLUSIVE4, the + client MUST send a SETATTR to set attributes to a known state. + + In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1. + Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 + case, but because the server may use attributes of the target object + to store the verifier, the set of allowable attributes may be fewer + than the set of attributes SETATTR allows. The allowable attributes + for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat + (Section 5.8.1.14) attribute. If the client attempts to set in + cva_attrs an attribute that is not in suppattr_exclcreat, the server + MUST return NFS4ERR_INVAL. The response field, attrset, indicates + both which attributes the server set from cva_attrs and which + attributes the server used to store the verifier. As described in + Section 18.16.4, the client can compare cva_attrs.attrmask with + attrset to determine which attributes were used to store the + verifier. + + With the addition of persistent sessions and pNFS, under some + conditions EXCLUSIVE4 MUST NOT be used by the client or supported by + the server. 
The following table summarizes the appropriate and + mandated exclusive create methods for implementations of NFSv4.1: + + +=============+==========+==============+=======================+ + | Persistent | Server | Server | Client Allowed | + | Reply Cache | Supports | REQUIRED | | + | Enabled | pNFS | | | + +=============+==========+==============+=======================+ + | no | no | EXCLUSIVE4_1 | EXCLUSIVE4_1 (SHOULD) | + | | | and | or EXCLUSIVE4 (SHOULD | + | | | EXCLUSIVE4 | NOT) | + +-------------+----------+--------------+-----------------------+ + | no | yes | EXCLUSIVE4_1 | EXCLUSIVE4_1 | + +-------------+----------+--------------+-----------------------+ + | yes | no | GUARDED4 | GUARDED4 | + +-------------+----------+--------------+-----------------------+ + | yes | yes | GUARDED4 | GUARDED4 | + +-------------+----------+--------------+-----------------------+ + + Table 18: Required Methods for Exclusive Create + + If CREATE_SESSION4_FLAG_PERSIST is set in the results of + CREATE_SESSION, the reply cache is persistent (see Section 18.36). + If the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from + EXCHANGE_ID, the server is a pNFS server (see Section 18.35). If the + client attempts to use EXCLUSIVE4 on a persistent session, or a + session derived from an EXCHGID4_FLAG_USE_PNFS_MDS client ID, the + server MUST return NFS4ERR_INVAL. + + With persistent sessions, exclusive create semantics are fully + achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT + be used. When pNFS is being used, the layout_hint attribute might + not be supported after the file is created. Only the EXCLUSIVE4_1 + and GUARDED methods of exclusive file creation allow the atomic + setting of attributes. + + For the target directory, the server returns change_info4 information + in cinfo. With the atomic field of the change_info4 data type, the + server will indicate if the before and after change attributes were + obtained atomically with respect to the link creation. + + The OPEN operation provides for Windows share reservation capability + with the use of the share_access and share_deny fields of the OPEN + arguments. The client specifies at OPEN the required share_access + and share_deny modes. For clients that do not directly support + SHAREs (i.e., UNIX), the expected deny value is + OPEN4_SHARE_DENY_NONE. In the case that there is an existing SHARE + reservation that conflicts with the OPEN request, the server returns + the error NFS4ERR_SHARE_DENIED. For additional discussion of SHARE + semantics, see Section 9.7. + + For each OPEN, the client provides a value for the owner field of the + OPEN argument. The owner field is of data type open_owner4, and + contains a field called clientid and a field called owner. The + client can set the clientid field to any value and the server MUST + ignore it. Instead, the server MUST derive the client ID from the + session ID of the SEQUENCE operation of the COMPOUND request. + + The "seqid" field of the request is not used in NFSv4.1, but it MAY + be any value and the server MUST ignore it. + + In the case that the client is recovering state from a server + failure, the claim field of the OPEN argument is used to signify that + the request is meant to reclaim state previously held. + + The "claim" field of the OPEN argument is used to specify the file to + be opened and the state information that the client claims to + possess. 
There are seven claim types as follows: + + +======================+============================================+ + | open type | description | + +======================+============================================+ + | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN | + | | request and there is no previous state | + | | associated with the file for the | + | | client. With CLAIM_NULL, the file is | + | | identified by the current filehandle | + | | and the specified component name. | + | | With CLAIM_FH (new to NFSv4.1), the | + | | file is identified by just the current | + | | filehandle. | + +----------------------+--------------------------------------------+ + | CLAIM_PREVIOUS | The client is claiming basic OPEN | + | | state for a file that was held | + | | previous to a server restart. | + | | Generally used when a server is | + | | returning persistent filehandles; the | + | | client may not have the file name to | + | | reclaim the OPEN. | + +----------------------+--------------------------------------------+ + | CLAIM_DELEGATE_CUR, | The client is claiming a delegation | + | CLAIM_DELEG_CUR_FH | for OPEN as granted by the server. | + | | Generally, this is done as part of | + | | recalling a delegation. With | + | | CLAIM_DELEGATE_CUR, the file is | + | | identified by the current filehandle | + | | and the specified component name. | + | | With CLAIM_DELEG_CUR_FH (new to | + | | NFSv4.1), the file is identified by | + | | just the current filehandle. | + +----------------------+--------------------------------------------+ + | CLAIM_DELEGATE_PREV, | The client is claiming a delegation | + | CLAIM_DELEG_PREV_FH | granted to a previous client instance; | + | | used after the client restarts. The | + | | server MAY support CLAIM_DELEGATE_PREV | + | | and/or CLAIM_DELEG_PREV_FH (new to | + | | NFSv4.1). If it does support either | + | | claim type, CREATE_SESSION MUST NOT | + | | remove the client's delegation state, | + | | and the server MUST support the | + | | DELEGPURGE operation. | + +----------------------+--------------------------------------------+ + + Table 19 + + For OPEN requests that reach the server during the grace period, the + server returns an error of NFS4ERR_GRACE. The following claim types + are exceptions: + + * OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted + to reclaiming opens after a server restart and are typically only + valid during the grace period. + + * OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and + CLAIM_DELEG_CUR_FH are valid both during and after the grace + period. Since the granting of the delegation that they are + subordinate to assures that there is no conflict with locks to be + reclaimed by other clients, the server need not return + NFS4ERR_GRACE when these are received during the grace period. + + For any OPEN request, the server may return an OPEN delegation, which + allows further opens and closes to be handled locally on the client + as described in Section 10.4. Note that delegation is up to the + server to decide. The client should never assume that delegation + will or will not be granted in a particular instance. It should + always be prepared for either case. A partial exception is the + reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. + In this case, delegation will always be granted, although the server + may specify an immediate recall in the delegation structure. 
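+
+ As a non-normative illustration of the claim-type distinctions in
+ Table 19, the following C sketch shows one way a client might select
+ a claim type when establishing or re-establishing OPEN state. The
+ function name and boolean inputs are illustrative assumptions; the
+ return values are the open_claim_type4 enumerators from the XDR
+ above, here assumed to be available as a generated C enum (with
+ <stdbool.h> assumed for bool).
+
+    /* Non-normative sketch: choose an OPEN claim type. */
+    enum open_claim_type4
+    choose_claim(bool server_restarted_in_grace,
+                 bool reclaiming_recalled_deleg,
+                 bool client_restarted_with_deleg,
+                 bool have_filehandle)
+    {
+            if (server_restarted_in_grace)
+                    /* Reclaim basic OPEN state held before a server
+                       restart; file identified by filehandle. */
+                    return CLAIM_PREVIOUS;
+            if (reclaiming_recalled_deleg)
+                    /* OPEN subordinate to a delegation granted by
+                       the server, e.g., during delegation recall. */
+                    return have_filehandle ? CLAIM_DELEG_CUR_FH
+                                           : CLAIM_DELEGATE_CUR;
+            if (client_restarted_with_deleg)
+                    /* Delegation granted to a previous client
+                       instance; server support is optional. */
+                    return have_filehandle ? CLAIM_DELEG_PREV_FH
+                                           : CLAIM_DELEGATE_PREV;
+            /* Ordinary OPEN: no previous state for this file. */
+            return have_filehandle ? CLAIM_FH : CLAIM_NULL;
+    }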
+
+ The rflags returned by a successful OPEN allow the server to return
+ information governing how the open file is to be handled.
+
+ * OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an
+   NFSv4.1 server.
+
+ * OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's byte-range
+   locking behavior supports the complete set of POSIX locking
+   techniques [21]. From this, the client can choose how to manage
+   its byte-range locking state so as to handle any mismatch in
+   byte-range locking management.
+
+ * OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will
+   preserve the open file if the client (or any other client) removes
+   the file as long as it is open. Furthermore, the server promises
+   to preserve the file through the grace period after server
+   restart, thereby giving the client the opportunity to reclaim its
+   open.
+
+ * OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt
+   CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a
+   hint only, and may be safely ignored by the client.
+
+ If the component is of zero length, NFS4ERR_INVAL will be returned.
+ The component is also subject to the normal UTF-8, character
+ support, and name checks. See Section 14.5 for further discussion.
+
+ When an OPEN is done and the specified open-owner already has the
+ resulting filehandle open, the result is to "OR" together the new
+ share and deny status with the existing status. In this case, only
+ a single CLOSE need be done, even though multiple OPENs were
+ completed. When such an OPEN is done, checking of share
+ reservations for the new OPEN proceeds normally, with no exception
+ for the existing OPEN held by the same open-owner. In this case,
+ the stateid returned has an "other" field that matches that of the
+ previous open, while the "seqid" field is incremented to reflect the
+ change in status due to the new open.
+
+ If the underlying file system at the server is only accessible in a
+ read-only mode and the OPEN request has specified ACCESS_WRITE or
+ ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a
+ read-only file system.
+
+ As with the CREATE operation, the server MUST derive the owner,
+ owner ACE, group, or group ACE if any of the four attributes are
+ required and supported by the server's file system. For an OPEN
+ with the EXCLUSIVE4 createmode, the server has no choice, since such
+ OPEN calls do not include the createattrs field. Conversely, if
+ createattrs (UNCHECKED4 or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is
+ specified, and includes an owner, owner_group, or ACE that the
+ principal in the RPC call's credentials does not have authorization
+ to create files for, then the server may return NFS4ERR_PERM.
+
+ In the case of an OPEN that specifies a size of zero (e.g.,
+ truncation) and the file has named attributes, the named attributes
+ are left as is and are not removed.
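+
+ Recapping the result flags above in code form, the following
+ non-normative C fragment shows the mask tests a client might apply
+ to the rflags field of a decoded OPEN4resok ("resok" and the
+ variable names are illustrative assumptions; the constants are those
+ defined in this section's XDR):
+
+    /* Non-normative sketch: interpret OPEN result flags. */
+    uint32_t rflags = resok.rflags;
+
+    /* Full POSIX byte-range locking semantics at the server. */
+    bool posix_locks = (rflags & OPEN4_RESULT_LOCKTYPE_POSIX) != 0;
+
+    /* Open file survives REMOVE; no need for the ".nfs<unique
+       value>" rename trick. */
+    bool preserve_unlinked =
+            (rflags & OPEN4_RESULT_PRESERVE_UNLINKED) != 0;
+
+    /* Server may send CB_NOTIFY_LOCK callbacks; a hint only. */
+    bool may_notify_lock =
+            (rflags & OPEN4_RESULT_MAY_NOTIFY_LOCK) != 0;
+
+    /* Deprecated in NFSv4.1; a server MUST NOT set this bit. */
+    bool confirm_bug = (rflags & OPEN4_RESULT_CONFIRM) != 0;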
+
+ NFSv4.1 gives more precise control to clients over acquisition of
+ delegations via the following new flags for the share_access field
+ of OPEN4args:
+
+    OPEN4_SHARE_ACCESS_WANT_READ_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_NO_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_CANCEL
+
+    OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
+
+    OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
+
+ If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero,
+ then the client will have specified one and only one of:
+
+    OPEN4_SHARE_ACCESS_WANT_READ_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_NO_DELEG
+
+    OPEN4_SHARE_ACCESS_WANT_CANCEL
+
+ Otherwise, the client is neither indicating a desire nor a
+ non-desire for a delegation, and it is at the server's discretion
+ whether or not to return a delegation in the OPEN response.
+
+ If the server supports the new _WANT_ flags and the client sends one
+ or more of the new flags, then in the event the server does not
+ return a delegation, it MUST return a delegation type of
+ OPEN_DELEGATE_NONE_EXT. The field ond_why in the reply indicates
+ why no delegation was returned and will be one of:
+
+    WND4_NOT_WANTED
+       The client specified OPEN4_SHARE_ACCESS_WANT_NO_DELEG.
+
+    WND4_CONTENTION
+       There is a conflicting delegation or open on the file.
+
+    WND4_RESOURCE
+       Resource limitations prevent the server from granting a
+       delegation.
+
+    WND4_NOT_SUPP_FTYPE
+       The server does not support delegations on this file type.
+
+    WND4_WRITE_DELEG_NOT_SUPP_FTYPE
+       The server does not support OPEN_DELEGATE_WRITE delegations on
+       this file type.
+
+    WND4_NOT_SUPP_UPGRADE
+       The server does not support atomic upgrade of an
+       OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE
+       delegation.
+
+    WND4_NOT_SUPP_DOWNGRADE
+       The server does not support atomic downgrade of an
+       OPEN_DELEGATE_WRITE delegation to an OPEN_DELEGATE_READ
+       delegation.
+
+    WND4_CANCELLED
+       The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL and now
+       any "want" for this file object is cancelled.
+
+    WND4_IS_DIR
+       The specified file object is a directory, and the operation is
+       OPEN or WANT_DELEGATION, which do not support delegations on
+       directories.
+
+ OPEN4_SHARE_ACCESS_WANT_READ_DELEG,
+ OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, and
+ OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the
+ client wants an OPEN_DELEGATE_READ delegation, an
+ OPEN_DELEGATE_WRITE delegation, or any delegation, regardless of
+ which of OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or
+ OPEN4_SHARE_ACCESS_BOTH is set. If the client has an
+ OPEN_DELEGATE_READ delegation on a file and requests an
+ OPEN_DELEGATE_WRITE delegation, then the client is requesting atomic
+ upgrade of its OPEN_DELEGATE_READ delegation to an
+ OPEN_DELEGATE_WRITE delegation. If the client has an
+ OPEN_DELEGATE_WRITE delegation on a file and requests an
+ OPEN_DELEGATE_READ delegation, then the client is requesting atomic
+ downgrade to an OPEN_DELEGATE_READ delegation. A server MAY support
+ atomic upgrade or downgrade. If it does, then a returned
+ delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that
+ differs from the delegation type the client currently holds
+ indicates a successful upgrade or downgrade. If the server does not
+ support atomic delegation upgrade or downgrade, then ond_why will be
+ set to WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE.
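+
+ As a non-normative sketch of the atomic-upgrade logic just
+ described, the following C fragment shows how a client already
+ holding an OPEN_DELEGATE_READ delegation might request an
+ OPEN_DELEGATE_WRITE delegation and interpret the reply. The
+ variable res is assumed to hold a decoded OPEN4resok; the field and
+ constant names follow the XDR earlier in this section.
+
+    /* Request atomic upgrade of a held OPEN_DELEGATE_READ. */
+    uint32_t share_access = OPEN4_SHARE_ACCESS_READ
+                          | OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG;
+
+    /* ... send OPEN with share_access; decode the reply into res ... */
+
+    switch (res.delegation.delegation_type) {
+    case OPEN_DELEGATE_WRITE:
+            /* Atomic upgrade succeeded. */
+            break;
+    case OPEN_DELEGATE_NONE_EXT:
+            if (res.delegation.od_whynone.ond_why ==
+                WND4_NOT_SUPP_UPGRADE) {
+                    /* Server cannot upgrade atomically; the client
+                       still holds its OPEN_DELEGATE_READ. */
+            }
+            break;
+    default:
+            break;
+    }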
+
+ OPEN4_SHARE_ACCESS_WANT_NO_DELEG means that the client wants no
+ delegation.
+
+ OPEN4_SHARE_ACCESS_WANT_CANCEL means that the client wants no
+ delegation and wants to cancel any previously registered "want" for
+ a delegation.
+
+ The client may set one or both of
+ OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and
+ OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. However, they
+ will have no effect unless one of the following is set:
+
+ * OPEN4_SHARE_ACCESS_WANT_READ_DELEG
+
+ * OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
+
+ * OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
+
+ If the client specifies
+ OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it
+ wishes to register a "want" for a delegation, in the event the OPEN
+ results do not include a delegation. If so and the server denies
+ the delegation due to insufficient resources, the server MAY later
+ inform the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that
+ the resource limitation condition has eased. The server will tell
+ the client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL
+ operation by setting delegation_type in the results to
+ OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and
+ ond_server_will_signal_avail to TRUE. If
+ ond_server_will_signal_avail is set to TRUE, the server MUST later
+ send a CB_RECALLABLE_OBJ_AVAIL operation.
+
+ If the client specifies
+ OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes
+ to register a "want" for a delegation, in the event the OPEN results
+ do not include a delegation. If so and the server denies the
+ delegation due to contention, the server MAY later inform the
+ client, via the CB_PUSH_DELEG operation, that the contention
+ condition has eased. The server will tell the client that it
+ intends to send a future CB_PUSH_DELEG operation by setting
+ delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to
+ WND4_CONTENTION, and ond_server_will_push_deleg to TRUE. If
+ ond_server_will_push_deleg is TRUE, the server MUST later send a
+ CB_PUSH_DELEG operation.
+
+ If the client has previously registered a want for a delegation on a
+ file, and then sends a request to register a want for a delegation
+ on the same file, the server MUST return a new error:
+ NFS4ERR_DELEG_ALREADY_WANTED. If the client wishes to register a
+ different type of delegation want for the same file, it MUST first
+ cancel the existing delegation want.
+
+ 18.16.4. IMPLEMENTATION
+
+ In the absence of a persistent session, the client invokes exclusive
+ create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1.
+ In these cases, the client provides a verifier that can reasonably
+ be expected to be unique. A combination of a client identifier,
+ perhaps the client network address, and a unique number generated by
+ the client, perhaps the RPC transaction identifier, may be
+ appropriate.
+
+ If the object does not exist, the server creates the object and
+ stores the verifier in stable storage. For file systems that do not
+ provide a mechanism for the storage of arbitrary file attributes,
+ the server may use one or more elements of the object's metadata to
+ store the verifier. The verifier MUST be stored in stable storage
+ to prevent erroneous failure on retransmission of the request. It
+ is assumed that an exclusive create is being performed because
+ exclusive semantics are critical to the application. Because of the
+ expected usage, exclusive CREATE does not rely solely on the
+ server's reply cache for storage of the verifier.
A nonpersistent reply cache does not survive a crash, and the session
+ and reply cache may be deleted after a network partition that
+ exceeds the lease time, thus opening failure windows.
+
+ An NFSv4.1 server SHOULD NOT store the verifier in any of the file's
+ RECOMMENDED or REQUIRED attributes. If it does, the server SHOULD
+ use time_modify_set or time_access_set to store the verifier. The
+ server SHOULD NOT store the verifier in the following attributes:
+
+ acl (it is desirable for access control to be established at
+ creation),
+
+ dacl (ditto),
+
+ mode (ditto),
+
+ owner (ditto),
+
+ owner_group (ditto),
+
+ retentevt_set (it may be desired to establish retention at
+ creation),
+
+ retention_hold (ditto),
+
+ retention_set (ditto),
+
+ sacl (it is desirable for auditing control to be established at
+ creation),
+
+ size (on some servers, size may have a limited range of values),
+
+ mode_set_masked (as with mode),
+
+ and
+
+ time_create (a meaningful file creation time should be set when the
+ file is created).
+
+ Another alternative for the server is to use a named attribute to
+ store the verifier.
+
+ Because the EXCLUSIVE4 create method does not specify initial
+ attributes, when processing an EXCLUSIVE4 create the server
+
+ * SHOULD set the owner of the file to that corresponding to the
+   credential of the request's RPC header.
+
+ * SHOULD NOT leave the file's access control to anyone but the owner
+   of the file.
+
+ If the server cannot support exclusive create semantics, possibly
+ because of the requirement to commit the verifier to stable storage,
+ it should fail the OPEN request with the error NFS4ERR_NOTSUPP.
+
+ During an exclusive CREATE request, if the object already exists,
+ the server reconstructs the object's verifier and compares it with
+ the verifier in the request. If they match, the server treats the
+ request as a success. The request is presumed to be a duplicate of
+ an earlier, successful request for which the reply was lost and that
+ the server duplicate request cache mechanism did not detect. If the
+ verifiers do not match, the request is rejected with the status
+ NFS4ERR_EXIST.
+
+ After the client has performed a successful exclusive create, the
+ attrset response indicates which attributes were used to store the
+ verifier. If EXCLUSIVE4 was used, the attributes set in attrset
+ were used for the verifier. If EXCLUSIVE4_1 was used, the client
+ determines the attributes used for the verifier by comparing attrset
+ with cva_attrs.attrmask; any bits set in the former but not the
+ latter identify the attributes used to store the verifier. The
+ client MUST immediately send a SETATTR to set attributes used to
+ store the verifier. Until it does so, the attributes used to store
+ the verifier cannot be relied upon. The subsequent SETATTR MUST NOT
+ occur in the same COMPOUND request as the OPEN.
+
+ Unless a persistent session is used, use of the GUARDED4 attribute
+ does not provide exactly-once semantics. In particular, if a reply
+ is lost and the server does not detect the retransmission of the
+ request, the operation can fail with NFS4ERR_EXIST, even though the
+ create was performed successfully. The client would use this
+ behavior in the case that the application has not requested an
+ exclusive create but has asked to have the file truncated when the
+ file is opened.
In the case of the client timing out and retransmitting the create
+ request, the client can use GUARDED4 to prevent a sequence such as
+ create, write, create (retransmitted) from occurring.
+
+ For SHARE reservations, the value of the expression (share_access &
+ ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) MUST be one of
+ OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or
+ OPEN4_SHARE_ACCESS_BOTH. If not, the server MUST return
+ NFS4ERR_INVAL. The value of share_deny MUST be one of
+ OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ,
+ OPEN4_SHARE_DENY_WRITE, or OPEN4_SHARE_DENY_BOTH. If not, the
+ server MUST return NFS4ERR_INVAL.
+
+ Based on the share_access value (OPEN4_SHARE_ACCESS_READ,
+ OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH), the server
+ should check that the requester has the proper access rights to
+ perform the specified operation. This would generally be the result
+ of applying the ACL access rules to the file for the current
+ requester. However, just as with the ACCESS operation, the client
+ should not attempt to second-guess the server's decisions, as access
+ rights may change and may be subject to server administrative
+ controls outside the ACL framework. If the requester's READ or
+ WRITE operation is not authorized (depending on the share_access
+ value), the server MUST return NFS4ERR_ACCESS.
+
+ Note that if the client ID was not created with the
+ EXCHGID4_FLAG_BIND_PRINC_STATEID capability set in the reply to
+ EXCHANGE_ID, then the server MUST NOT impose any requirement that
+ READs and WRITEs sent for an open file have the same credentials as
+ the OPEN itself, and the server is REQUIRED to perform access
+ checking on the READs and WRITEs themselves. Otherwise, if the
+ reply to EXCHANGE_ID did have EXCHGID4_FLAG_BIND_PRINC_STATEID set,
+ then with one exception, the credentials used in the OPEN request
+ MUST match those used in the READs and WRITEs, and the stateids in
+ the READs and WRITEs MUST match, or be derived from, the stateid
+ from the reply to OPEN. The exception is if SP4_SSV or
+ SP4_MACH_CRED state protection is used, and the spo_must_allow
+ result of EXCHANGE_ID includes the READ and/or WRITE operations. In
+ that case, the machine or SSV credential will be allowed to send
+ READ and/or WRITE. See Section 18.35.
+
+ If the component provided to OPEN is a symbolic link, the error
+ NFS4ERR_SYMLINK will be returned to the client, while if it is a
+ directory, the error NFS4ERR_ISDIR will be returned. If the
+ component is neither of those and is not an ordinary file, the error
+ NFS4ERR_WRONG_TYPE is returned. If the current filehandle is not a
+ directory, the error NFS4ERR_NOTDIR will be returned.
+
+ The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a
+ client to avoid the common implementation practice of renaming an
+ open file to ".nfs<unique value>" after it removes the file. After
+ the server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends
+ a REMOVE operation that would reduce the file's link count to zero,
+ the server SHOULD report a value of zero for the numlinks attribute
+ on the file.
+
+ If another client has a delegation of the file being opened that
+ conflicts with the open being done (sometimes depending on the
+ share_access or share_deny value specified), the delegation(s) MUST
+ be recalled, and the operation cannot proceed until each such
+ delegation is returned or revoked.
Except where this happens very quickly, one or more NFS4ERR_DELAY
+ errors will be returned to requests made while delegation remains
+ outstanding. In the case of an OPEN_DELEGATE_WRITE delegation, any
+ open by a different client will conflict, while for an
+ OPEN_DELEGATE_READ delegation, only opens with one of the following
+ characteristics will be considered conflicting:
+
+ * The value of share_access includes the bit
+   OPEN4_SHARE_ACCESS_WRITE.
+
+ * The value of share_deny specifies OPEN4_SHARE_DENY_READ or
+   OPEN4_SHARE_DENY_BOTH.
+
+ * OPEN4_CREATE is specified together with UNCHECKED4, the size
+   attribute is specified as zero (for truncation), and an existing
+   file is truncated.
+
+ If OPEN4_CREATE is specified and the file does not exist and the
+ current filehandle designates a directory for which another client
+ holds a directory delegation, then, unless the delegation is such
+ that the situation can be resolved by sending a notification, the
+ delegation MUST be recalled, and the operation cannot proceed until
+ the delegation is returned or revoked. Except where this happens
+ very quickly, one or more NFS4ERR_DELAY errors will be returned to
+ requests made while delegation remains outstanding.
+
+ If OPEN4_CREATE is specified and the file does not exist and the
+ current filehandle designates a directory for which one or more
+ directory delegations exist, then, when those delegations request
+ such notifications, NOTIFY4_ADD_ENTRY will be generated as a result
+ of this operation.
+
+ 18.16.4.1. Warning to Client Implementors
+
+ OPEN resembles LOOKUP in that it generates a filehandle for the
+ client to use. Unlike LOOKUP though, OPEN creates server state on
+ the filehandle. In normal circumstances, the client can only
+ release this state with a CLOSE operation. CLOSE uses the current
+ filehandle to determine which file to close. Therefore, the client
+ MUST follow every OPEN operation with a GETFH operation in the same
+ COMPOUND procedure. This will supply the client with the filehandle
+ such that CLOSE can be used appropriately.
+
+ Simply waiting for the lease on the file to expire is insufficient
+ because the server may maintain the state indefinitely as long as
+ another client does not attempt to make a conflicting access to the
+ same file.
+
+ See also Section 2.10.6.4.
+
+ 18.17. Operation 19: OPENATTR - Open Named Attribute Directory
+
+ 18.17.1. ARGUMENTS
+
+    struct OPENATTR4args {
+            /* CURRENT_FH: object */
+            bool    createdir;
+    };
+
+ 18.17.2. RESULTS
+
+    struct OPENATTR4res {
+            /*
+             * If status is NFS4_OK,
+             * new CURRENT_FH: named attribute
+             * directory
+             */
+            nfsstat4        status;
+    };
+
+ 18.17.3. DESCRIPTION
+
+ The OPENATTR operation is used to obtain the filehandle of the named
+ attribute directory associated with the current filehandle. The
+ result of the OPENATTR will be a filehandle to an object of type
+ NF4ATTRDIR. From this filehandle, READDIR and LOOKUP operations can
+ be used to obtain filehandles for the various named attributes
+ associated with the original file system object. Filehandles
+ returned within the named attribute directory will designate objects
+ of type NF4NAMEDATTR.
+
+ The createdir argument allows the client to signify if a named
+ attribute directory should be created as a result of the OPENATTR
+ operation. Some clients may use the OPENATTR operation with a value
+ of FALSE for createdir to determine if any named attributes exist
+ for the object. If none exist, then NFS4ERR_NOENT will be returned.
If createdir has a value of TRUE and no named attribute directory
+ exists, one is created and its filehandle becomes the current
+ filehandle. On the other hand, if createdir has a value of TRUE and
+ the named attribute directory already exists, no error results and
+ the filehandle of the existing directory becomes the current
+ filehandle. The creation of a named attribute directory assumes
+ that the server has implemented named attribute support in this
+ fashion and is not required to do so by this definition.
+
+ If the current filehandle designates an object of type NF4NAMEDATTR
+ (a named attribute) or NF4ATTRDIR (a named attribute directory), an
+ error of NFS4ERR_WRONG_TYPE is returned to the client. Named
+ attributes or a named attribute directory MUST NOT have their own
+ named attributes.
+
+ 18.17.4. IMPLEMENTATION
+
+ If the server does not support named attributes for the current
+ filehandle, an error of NFS4ERR_NOTSUPP will be returned to the
+ client.
+
+ 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
+
+ 18.18.1. ARGUMENTS
+
+    struct OPEN_DOWNGRADE4args {
+            /* CURRENT_FH: opened file */
+            stateid4        open_stateid;
+            seqid4          seqid;
+            uint32_t        share_access;
+            uint32_t        share_deny;
+    };
+
+ 18.18.2. RESULTS
+
+    struct OPEN_DOWNGRADE4resok {
+            stateid4        open_stateid;
+    };
+
+    union OPEN_DOWNGRADE4res switch(nfsstat4 status) {
+            case NFS4_OK:
+                    OPEN_DOWNGRADE4resok    resok4;
+            default:
+                    void;
+    };
+
+ 18.18.3. DESCRIPTION
+
+ This operation is used to adjust the access and deny states for a
+ given open. This is necessary when a given open-owner opens the
+ same file multiple times with different access and deny values. In
+ this situation, a close of one of the opens may change the
+ appropriate share_access and share_deny flags to remove bits
+ associated with opens no longer in effect.
+
+ Valid values for the expression (share_access &
+ ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) are OPEN4_SHARE_ACCESS_READ,
+ OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If the client
+ specifies other values, the server MUST reply with NFS4ERR_INVAL.
+
+ Valid values for the share_deny field are OPEN4_SHARE_DENY_NONE,
+ OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or
+ OPEN4_SHARE_DENY_BOTH. If the client specifies other values, the
+ server MUST reply with NFS4ERR_INVAL.
+
+ After checking for valid values of share_access and share_deny, the
+ server replaces the current access and deny modes on the file with
+ share_access and share_deny, subject to the following constraints:
+
+ * The bits in share_access SHOULD equal the union of the
+   share_access bits (not including OPEN4_SHARE_ACCESS_WANT_* bits)
+   specified for some subset of the OPENs in effect for the current
+   open-owner on the current file.
+
+ * The bits in share_deny SHOULD equal the union of the share_deny
+   bits specified for some subset of the OPENs in effect for the
+   current open-owner on the current file.
+
+ If the above constraints are not respected, the server SHOULD return
+ the error NFS4ERR_INVAL. Since share_access and share_deny bits
+ should be subsets of those already granted, short of a defect in the
+ client or server implementation, it is not possible for the
+ OPEN_DOWNGRADE request to be denied because of conflicting share
+ reservations.
+
+ The seqid argument is not used in NFSv4.1, MAY be any value, and
+ MUST be ignored by the server.
+
+ On success, the current filehandle retains its value.
+
+ 18.18.4.
IMPLEMENTATION + + An OPEN_DOWNGRADE operation may make OPEN_DELEGATE_READ delegations + grantable where they were not previously. Servers may choose to + respond immediately if there are pending delegation want requests or + may respond to the situation at a later time. + +18.19. Operation 22: PUTFH - Set Current Filehandle + +18.19.1. ARGUMENTS + + struct PUTFH4args { + nfs_fh4 object; + }; + +18.19.2. RESULTS + + struct PUTFH4res { + /* + * If status is NFS4_OK, + * new CURRENT_FH: argument to PUTFH + */ + nfsstat4 status; + }; + +18.19.3. DESCRIPTION + + This operation replaces the current filehandle with the filehandle + provided as an argument. It clears the current stateid. + + If the security mechanism used by the requester does not meet the + requirements of the filehandle provided to this operation, the server + MUST return NFS4ERR_WRONGSEC. + + See Section 16.2.3.1.1 for more details on the current filehandle. + + See Section 16.2.3.1.2 for more details on the current stateid. + +18.19.4. IMPLEMENTATION + + This operation is used in an NFS request to set the context for file + accessing operations that follow in the same COMPOUND request. + +18.20. Operation 23: PUTPUBFH - Set Public Filehandle + +18.20.1. ARGUMENT + + void; + +18.20.2. RESULT + + struct PUTPUBFH4res { + /* + * If status is NFS4_OK, + * new CURRENT_FH: public fh + */ + nfsstat4 status; + }; + +18.20.3. DESCRIPTION + + This operation replaces the current filehandle with the filehandle + that represents the public filehandle of the server's namespace. + This filehandle may be different from the "root" filehandle that may + be associated with some other directory on the server. + + PUTPUBFH also clears the current stateid. + + The public filehandle represents the concepts embodied in RFC 2054 + [49], RFC 2055 [50], and RFC 2224 [61]. The intent for NFSv4.1 is + that the public filehandle (represented by the PUTPUBFH operation) be + used as a method of providing WebNFS server compatibility with NFSv3. + + The public filehandle and the root filehandle (represented by the + PUTROOTFH operation) SHOULD be equivalent. If the public and root + filehandles are not equivalent, then the directory corresponding to + the public filehandle MUST be a descendant of the directory + corresponding to the root filehandle. + + See Section 16.2.3.1.1 for more details on the current filehandle. + + See Section 16.2.3.1.2 for more details on the current stateid. + +18.20.4. IMPLEMENTATION + + This operation is used in an NFS request to set the context for file + accessing operations that follow in the same COMPOUND request. + + With the NFSv3 public filehandle, the client is able to specify + whether the pathname provided in the LOOKUP should be evaluated as + either an absolute path relative to the server's root or relative to + the public filehandle. RFC 2224 [61] contains further discussion of + the functionality. With NFSv4.1, that type of specification is not + directly available in the LOOKUP operation. The reason for this is + because the component separators needed to specify absolute vs. + relative are not allowed in NFSv4. Therefore, the client is + responsible for constructing its request such that the use of either + PUTROOTFH or PUTPUBFH signifies absolute or relative evaluation of an + NFS URL, respectively. 
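+
+ As a non-normative illustration of the preceding paragraph, the
+ following C sketch shows a client constructing a COMPOUND that
+ evaluates an NFS URL pathname. The add_op_*() helpers and the
+ compound structure are purely hypothetical stand-ins for whatever
+ mechanism the client uses to append operations to a request
+ (<string.h>, <stdlib.h>, and <stdbool.h> are assumed); the protocol
+ content is simply PUTROOTFH or PUTPUBFH followed by one LOOKUP per
+ component, since the "/" separator cannot appear within an NFSv4
+ component.
+
+    /* Non-normative sketch: absolute vs. relative NFS URL
+       evaluation. */
+    void build_url_lookup(struct compound *c, const char *path,
+                          bool absolute)
+    {
+            char *copy = strdup(path);
+            char *save = NULL;
+
+            if (absolute)
+                    add_op_putrootfh(c);  /* evaluate from the root */
+            else
+                    add_op_putpubfh(c);   /* evaluate from the public
+                                             filehandle */
+
+            /* One LOOKUP per component. */
+            for (char *comp = strtok_r(copy, "/", &save);
+                 comp != NULL;
+                 comp = strtok_r(NULL, "/", &save))
+                    add_op_lookup(c, comp);
+
+            add_op_getfh(c);              /* return the final
+                                             filehandle */
+            free(copy);
+    }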
+ + Note that there are warnings mentioned in RFC 2224 [61] with respect + to the use of absolute evaluation and the restrictions the server may + place on that evaluation with respect to how much of its namespace + has been made available. These same warnings apply to NFSv4.1. It + is likely, therefore, that because of server implementation details, + an NFSv3 absolute public filehandle look up may behave differently + than an NFSv4.1 absolute resolution. + + There is a form of security negotiation as described in RFC 2755 [62] + that uses the public filehandle and an overloading of the pathname. + This method is not available with NFSv4.1 as filehandles are not + overloaded with special meaning and therefore do not provide the same + framework as NFSv3. Clients should therefore use the security + negotiation mechanisms described in Section 2.6. + +18.21. Operation 24: PUTROOTFH - Set Root Filehandle + +18.21.1. ARGUMENTS + + void; + +18.21.2. RESULTS + + struct PUTROOTFH4res { + /* + * If status is NFS4_OK, + * new CURRENT_FH: root fh + */ + nfsstat4 status; + }; + +18.21.3. DESCRIPTION + + This operation replaces the current filehandle with the filehandle + that represents the root of the server's namespace. From this + filehandle, a LOOKUP operation can locate any other filehandle on the + server. This filehandle may be different from the "public" + filehandle that may be associated with some other directory on the + server. + + PUTROOTFH also clears the current stateid. + + See Section 16.2.3.1.1 for more details on the current filehandle. + + See Section 16.2.3.1.2 for more details on the current stateid. + +18.21.4. IMPLEMENTATION + + This operation is used in an NFS request to set the context for file + accessing operations that follow in the same COMPOUND request. + +18.22. Operation 25: READ - Read from File + +18.22.1. ARGUMENTS + + struct READ4args { + /* CURRENT_FH: file */ + stateid4 stateid; + offset4 offset; + count4 count; + }; + +18.22.2. RESULTS + + struct READ4resok { + bool eof; + opaque data<>; + }; + + union READ4res switch (nfsstat4 status) { + case NFS4_OK: + READ4resok resok4; + default: + void; + }; + +18.22.3. DESCRIPTION + + The READ operation reads data from the regular file identified by the + current filehandle. + + The client provides an offset of where the READ is to start and a + count of how many bytes are to be read. An offset of zero means to + read data starting at the beginning of the file. If offset is + greater than or equal to the size of the file, the status NFS4_OK is + returned with a data length set to zero and eof is set to TRUE. The + READ is subject to access permissions checking. + + If the client specifies a count value of zero, the READ succeeds and + returns zero bytes of data again subject to access permissions + checking. The server may choose to return fewer bytes than specified + by the client. The client needs to check for this condition and + handle the condition appropriately. + + Except when special stateids are used, the stateid value for a READ + request represents a value returned from a previous byte-range lock + or share reservation request or the stateid associated with a + delegation. The stateid identifies the associated owners if any and + is used by the server to verify that the associated locks are still + valid (e.g., have not been revoked). 
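+
+ Because the server may legitimately return fewer bytes than
+ requested, a client must be prepared to iterate. The following
+ non-normative C sketch reads a byte range by re-issuing READ for the
+ remainder until the range is exhausted or eof is reported;
+ nfs4_read() is a hypothetical wrapper that sends a single READ
+ (stateid, offset, count), stores the bytes actually returned and the
+ eof flag, and returns an nfsstat4 value.
+
+    /* Non-normative sketch: handle short reads. */
+    nfsstat4 read_range(stateid4 *sid, uint64_t offset,
+                        uint32_t count, unsigned char *buf)
+    {
+            while (count > 0) {
+                    uint32_t got = 0;
+                    bool     eof = false;
+                    nfsstat4 st  = nfs4_read(sid, offset, count,
+                                             buf, &got, &eof);
+
+                    if (st != NFS4_OK)
+                            return st;  /* e.g., NFS4ERR_LOCKED */
+                    offset += got;
+                    buf    += got;
+                    count  -= got;
+                    if (eof || got == 0)
+                            break;      /* end of file, or guard
+                                           against spinning */
+            }
+            return NFS4_OK;
+    }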
+
+ If the read ended at the end-of-file (formally, in a correctly
+ formed READ operation, if offset + count is equal to the size of the
+ file), or the READ operation extends beyond the size of the file (if
+ offset + count is greater than the size of the file), eof is
+ returned as TRUE; otherwise, it is FALSE. A successful READ of an
+ empty file will always return eof as TRUE.
+
+ If the current filehandle is not an ordinary file, an error will be
+ returned to the client. In the case that the current filehandle
+ represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If
+ the current filehandle designates a symbolic link, NFS4ERR_SYMLINK
+ is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
+
+ For a READ with a stateid value of all bits equal to zero, the
+ server MAY allow the READ to be serviced subject to mandatory
+ byte-range locks or the current share deny modes for the file. For
+ a READ with a stateid value of all bits equal to one, the server MAY
+ allow READ operations to bypass locking checks at the server.
+
+ On success, the current filehandle retains its value.
+
+ 18.22.4. IMPLEMENTATION
+
+ If the server returns a "short read" (i.e., fewer bytes than
+ requested and eof set to FALSE), the client should send another READ
+ to get the remaining data. A server may return less data than
+ requested under several circumstances. The file may have been
+ truncated by another client or perhaps on the server itself,
+ changing the file size from what the requesting client believes to
+ be the case. This would reduce the actual amount of data available
+ to the client. It is also possible that the server has reduced its
+ transfer size and so returns a short read result. Server resource
+ exhaustion may also result in a short read.
+
+ If mandatory byte-range locking is in effect for the file, and if
+ the byte-range corresponding to the data to be read from the file is
+ WRITE_LT locked by an owner not associated with the stateid, the
+ server will return the NFS4ERR_LOCKED error. The client should try
+ to get the appropriate READ_LT via the LOCK operation before
+ re-attempting the READ. When the READ completes, the client should
+ release the byte-range lock via LOCKU.
+
+ If another client has an OPEN_DELEGATE_WRITE delegation for the file
+ being read, the delegation must be recalled, and the operation
+ cannot proceed until that delegation is returned or revoked. Except
+ where this happens very quickly, one or more NFS4ERR_DELAY errors
+ will be returned to requests made while the delegation remains
+ outstanding. Normally, delegations will not be recalled as a result
+ of a READ operation since the recall will occur as a result of an
+ earlier OPEN. However, since it is possible for a READ to be done
+ with a special stateid, the server needs to check for this case even
+ though the client should have done an OPEN previously.
+
+ 18.23. Operation 26: READDIR - Read Directory
+
+ 18.23.1. ARGUMENTS
+
+    struct READDIR4args {
+            /* CURRENT_FH: directory */
+            nfs_cookie4     cookie;
+            verifier4       cookieverf;
+            count4          dircount;
+            count4          maxcount;
+            bitmap4         attr_request;
+    };
+
+ 18.23.2. RESULTS
+
+    struct entry4 {
+            nfs_cookie4     cookie;
+            component4      name;
+            fattr4          attrs;
+            entry4          *nextentry;
+    };
+
+    struct dirlist4 {
+            entry4          *entries;
+            bool            eof;
+    };
+
+    struct READDIR4resok {
+            verifier4       cookieverf;
+            dirlist4        reply;
+    };
+
+    union READDIR4res switch (nfsstat4 status) {
+            case NFS4_OK:
+                    READDIR4resok   resok4;
+            default:
+                    void;
+    };
+
+ 18.23.3. DESCRIPTION
+
+ The READDIR operation retrieves a variable number of entries from a
+ file system directory and returns client-requested attributes for
+ each entry along with information to allow the client to request
+ additional directory entries in a subsequent READDIR.
+
+ The arguments contain a cookie value that represents where the
+ READDIR should start within the directory. A value of zero for the
+ cookie is used to start reading at the beginning of the directory.
+ For subsequent READDIR requests, the client specifies a cookie value
+ that is provided by the server on a previous READDIR request.
+
+ The request's cookieverf field should be set to 0 (zero) when the
+ request's cookie field is zero (first read of the directory). On
+ subsequent requests, the cookieverf field must match the cookieverf
+ returned by the READDIR in which the cookie was acquired. If the
+ server determines that the cookieverf is no longer valid for the
+ directory, the error NFS4ERR_NOT_SAME must be returned.
+
+ The dircount field of the request is a hint of the maximum number of
+ bytes of directory information that should be returned. This value
+ represents the total length of the names of the directory entries
+ and the cookie value for these entries. This length represents the
+ XDR encoding of the data (names and cookies) and not the length in
+ the native format of the server.
+
+ The maxcount field of the request represents the maximum total size
+ of all of the data being returned within the READDIR4resok structure
+ and includes the XDR overhead. The server MAY return less data. If
+ the server is unable to return a single directory entry within the
+ maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the
+ client.
+
+ Finally, the request's attr_request field represents the list of
+ attributes to be returned for each directory entry supplied by the
+ server.
+
+ A successful reply consists of a list of directory entries. Each of
+ these entries contains the name of the directory entry, a cookie
+ value for that entry, and the associated attributes as requested.
+ The "eof" flag has a value of TRUE if there are no more entries in
+ the directory.
+
+ The cookie value is only meaningful to the server and is used as a
+ cursor for the directory entry. As mentioned, this cookie is used
+ by the client for subsequent READDIR operations so that it may
+ continue reading a directory. The cookie is similar in concept to a
+ READ offset but MUST NOT be interpreted as such by the client.
+ Ideally, the cookie value SHOULD NOT change if the directory is
+ modified, since the client may be caching these values.
+
+ In some cases, the server may encounter an error while obtaining the
+ attributes for a directory entry. Instead of returning an error for
+ the entire READDIR operation, the server can instead return the
+ attribute rdattr_error (Section 5.8.1.12). With this, the server is
+ able to communicate the failure to the client and not fail the
+ entire operation in the instance of what might be a transient
+ failure. Obviously, the client must request the fattr4_rdattr_error
+ attribute for this method to work properly. If the client does not
+ request the attribute, the server has no choice but to return
+ failure for the entire READDIR operation.
+
+ For some file system environments, the directory entries "." and
+ ".." have special meaning, and in other environments, they do not.
If the server supports these special entries within a directory, they
+ SHOULD NOT be returned to the client as part of the READDIR
+ response. To enable some client environments, the cookie values of
+ zero, one, and two are to be considered reserved. Note that the
+ UNIX client will use these values when combining the server's
+ response and local representations to enable a fully formed UNIX
+ directory presentation to the application.
+
+ For READDIR arguments, cookie values of one and two SHOULD NOT be
+ used, and for READDIR results, cookie values of zero, one, and two
+ SHOULD NOT be returned.
+
+ On success, the current filehandle retains its value.
+
+ 18.23.4. IMPLEMENTATION
+
+ The server's file system directory representations can differ
+ greatly. A client's programming interfaces may also be bound to the
+ local operating environment in a way that does not translate well
+ into the NFS protocol. Therefore, the dircount and maxcount fields
+ are provided to enable the client to give hints to the server. If
+ the client is aggressive about attribute collection during a
+ READDIR, the server has an idea of how to limit the encoded
+ response.
+
+ If dircount is zero, the server bounds the reply's size based on the
+ request's maxcount field.
+
+ The cookieverf may be used by the server to help manage cookie
+ values that may become stale. It should be a rare occurrence that a
+ server is unable to continue properly reading a directory with the
+ provided cookie/cookieverf pair. The server SHOULD make every
+ effort to avoid this condition since the application at the client
+ might be unable to properly handle this type of failure.
+
+ The use of the cookieverf will also protect the client from using
+ READDIR cookie values that might be stale. For example, if the file
+ system has been migrated, the server might or might not be able to
+ use the same cookie values to service READDIR as the previous server
+ used. With the client providing the cookieverf, the server is able
+ to provide the appropriate response to the client. This prevents
+ the case where the server accepts a cookie value but the underlying
+ directory has changed and the response is invalid from the client's
+ context of its previous READDIR.
+
+ Since some servers will not be returning "." and ".." entries as has
+ been done with previous versions of the NFS protocol, the client
+ that requires these entries be present in READDIR responses must
+ fabricate them.
+
+ 18.24. Operation 27: READLINK - Read Symbolic Link
+
+ 18.24.1. ARGUMENTS
+
+    /* CURRENT_FH: symlink */
+    void;
+
+ 18.24.2. RESULTS
+
+    struct READLINK4resok {
+            linktext4       link;
+    };
+
+    union READLINK4res switch (nfsstat4 status) {
+            case NFS4_OK:
+                    READLINK4resok  resok4;
+            default:
+                    void;
+    };
+
+ 18.24.3. DESCRIPTION
+
+ READLINK reads the data associated with a symbolic link. Depending
+ on the value of the UTF-8 capability attribute (Section 14.4), the
+ data is encoded in UTF-8. Whether created by an NFS client or
+ created locally on the server, the data in a symbolic link is not
+ interpreted (except possibly to check for proper UTF-8 encoding)
+ when created, but is simply stored.
+
+ On success, the current filehandle retains its value.
+
+ 18.24.4. IMPLEMENTATION
+
+ A symbolic link is nominally a pointer to another file. The data is
+ not necessarily interpreted by the server, just stored in the file.
+
+   It is possible for a client implementation to store a pathname that
+   is not meaningful to the server operating system in a symbolic
+   link. A READLINK operation returns the data to the client for
+   interpretation. If different implementations want to share access
+   to symbolic links, then they must agree on the interpretation of the
+   data in the symbolic link.
+
+   The READLINK operation is only allowed on objects of type NF4LNK.
+   The server should return the error NFS4ERR_WRONG_TYPE if the object
+   is not of type NF4LNK.
+
+18.25. Operation 28: REMOVE - Remove File System Object
+
+18.25.1. ARGUMENTS
+
+   struct REMOVE4args {
+           /* CURRENT_FH: directory */
+           component4 target;
+   };
+
+18.25.2. RESULTS
+
+   struct REMOVE4resok {
+           change_info4 cinfo;
+   };
+
+   union REMOVE4res switch (nfsstat4 status) {
+    case NFS4_OK:
+            REMOVE4resok resok4;
+    default:
+            void;
+   };
+
+18.25.3. DESCRIPTION
+
+   The REMOVE operation removes (deletes) the directory entry named by
+   the target argument from the directory corresponding to the current
+   filehandle. If the entry in the directory was the last reference to
+   the corresponding file system object, the object may be destroyed.
+   The directory may be either of type NF4DIR or NF4ATTRDIR.
+
+   For the directory from which the entry was removed, the server
+   returns change_info4 information in cinfo. With the atomic field of
+   the change_info4 data type, the server will indicate if the before
+   and after change attributes were obtained atomically with respect to
+   the removal.
+
+   If the target has a length of zero, or if the target does not obey
+   the UTF-8 definition (and the server is enforcing UTF-8 encoding;
+   see Section 14.4), the error NFS4ERR_INVAL will be returned.
+
+   On success, the current filehandle retains its value.
+
+18.25.4. IMPLEMENTATION
+
+   NFSv3 required a separate RMDIR operation for directory removal and
+   REMOVE for non-directory removal. This allowed clients to skip
+   checking the file type when a non-directory delete system call
+   (e.g., unlink() [24] in POSIX) was passed a directory, as well as
+   the converse (e.g., an rmdir() on a non-directory), because they
+   knew the server would check the file type. NFSv4.1 REMOVE can be
+   used to delete any directory entry independent of its file type.
+   The implementor of an NFSv4.1 client's entry points from the
+   unlink() and rmdir() system calls should first check the file type
+   against the types the system call is allowed to remove before
+   sending a REMOVE operation. Alternatively, the implementor can
+   produce a COMPOUND call that includes a LOOKUP/VERIFY sequence of
+   operations to verify the file type before a REMOVE operation in the
+   same COMPOUND call.
+
+   The concept of last reference is server specific. However, if the
+   numlinks field in the previous attributes of the object had the
+   value 1, the client should not rely on referring to the object via a
+   filehandle. Likewise, the client should not rely on the resources
+   (disk space, directory entry, and so on) formerly associated with
+   the object becoming immediately available. Thus, if a client needs
+   to be able to continue to access a file after using REMOVE to remove
+   it, the client should take steps to make sure that the file will
+   still be accessible.
+   While the traditional mechanism used is to RENAME the file from its
+   old name to a new hidden name, the NFSv4.1 OPEN operation MAY return
+   a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, which indicates to
+   the client that the file will be preserved if the file has an
+   outstanding open (see Section 18.16).
+
+   If the server finds that the file is still open when the REMOVE
+   arrives:
+
+   *  The server SHOULD NOT delete the file's directory entry if the
+      file was opened with OPEN4_SHARE_DENY_WRITE or
+      OPEN4_SHARE_DENY_BOTH.
+
+   *  If the file was not opened with OPEN4_SHARE_DENY_WRITE or
+      OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's
+      directory entry. However, until the last CLOSE of the file, the
+      server MAY continue to allow access to the file via its
+      filehandle.
+
+   *  The server MUST NOT delete the directory entry if the reply from
+      OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set.
+
+   The server MAY implement its own restrictions on removal of a file
+   while it is open. The server might disallow such a REMOVE (or a
+   removal that occurs as part of RENAME). The conditions that
+   influence the restrictions on removal of a file while it is still
+   open include:
+
+   *  Whether certain access protocols (i.e., not just NFS) are holding
+      the file open.
+
+   *  Whether particular options, access modes, or policies on the
+      server are enabled.
+
+   If a file has an outstanding OPEN and this prevents the removal of
+   the file's directory entry, the error NFS4ERR_FILE_OPEN is returned.
+
+   Where the determination above cannot be made definitively because
+   delegations are being held, they MUST be recalled to allow
+   processing of the REMOVE to continue. When a delegation is held,
+   the server has no reliable knowledge of the status of OPENs for that
+   client, so unless there are files opened with the particular deny
+   modes by clients without delegations, the determination cannot be
+   made until delegations are recalled, and the operation cannot
+   proceed until each sufficient delegation has been returned or
+   revoked to allow the server to make a correct determination.
+
+   In all cases in which delegations are recalled, the server is likely
+   to return one or more NFS4ERR_DELAY errors while delegations remain
+   outstanding.
+
+   If the current filehandle designates a directory for which another
+   client holds a directory delegation, then, unless the situation can
+   be resolved by sending a notification, the directory delegation MUST
+   be recalled, and the operation MUST NOT proceed until the delegation
+   is returned or revoked. Except where this happens very quickly, one
+   or more NFS4ERR_DELAY errors will be returned to requests made while
+   delegation remains outstanding.
+
+   When the current filehandle designates a directory for which one or
+   more directory delegations exist, then, when those delegations
+   request such notifications, NOTIFY4_REMOVE_ENTRY will be generated
+   as a result of this operation.
+
+   Note that when a remove occurs as a result of a RENAME,
+   NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens
+   as a separate operation. In the case in which the removal is
+   integrated and atomic with RENAME, the notification of the removal
+   is integrated with notification for the RENAME. See the discussion
+   of the NOTIFY4_RENAME_ENTRY notification in Section 20.4.
+
+18.26. Operation 29: RENAME - Rename Directory Entry
+
+18.26.1. ARGUMENTS
+
+   struct RENAME4args {
+           /* SAVED_FH: source directory */
+           component4 oldname;
+           /* CURRENT_FH: target directory */
+           component4 newname;
+   };
+
+18.26.2. RESULTS
+
+   struct RENAME4resok {
+           change_info4 source_cinfo;
+           change_info4 target_cinfo;
+   };
+
+   union RENAME4res switch (nfsstat4 status) {
+    case NFS4_OK:
+            RENAME4resok resok4;
+    default:
+            void;
+   };
+
+18.26.3. DESCRIPTION
+
+   The RENAME operation renames the object identified by oldname in the
+   source directory corresponding to the saved filehandle, as set by
+   the SAVEFH operation, to newname in the target directory
+   corresponding to the current filehandle. The operation is required
+   to be atomic to the client. Source and target directories MUST
+   reside on the same file system on the server. On success, the
+   current filehandle will continue to be the target directory.
+
+   If the target directory already contains an entry with the name
+   newname, the source object MUST be compatible with the target:
+   either both are non-directories or both are directories and the
+   target MUST be empty. If compatible, the existing target is removed
+   before the rename occurs or, preferably, the target is removed
+   atomically as part of the rename. See Section 18.25.4 for client
+   and server actions whenever a target is removed. Note however that
+   when the removal is performed atomically with the rename, certain
+   parts of the removal described there are integrated with the rename.
+   For example, notification of the removal will not be via a
+   NOTIFY4_REMOVE_ENTRY but will be indicated as part of the
+   NOTIFY4_ADD_ENTRY or NOTIFY4_RENAME_ENTRY generated by the rename.
+
+   If the source object and the target are not compatible or if the
+   target is a directory but not empty, the server will return the
+   error NFS4ERR_EXIST.
+
+   If oldname and newname both refer to the same file (e.g., they might
+   be hard links of each other), then unless the file is open (see
+   Section 18.26.4), RENAME MUST perform no action and return NFS4_OK.
+
+   For both directories involved in the RENAME, the server returns
+   change_info4 information. With the atomic field of the change_info4
+   data type, the server will indicate if the before and after change
+   attributes were obtained atomically with respect to the rename.
+
+   If oldname refers to a named attribute and the saved and current
+   filehandles refer to different file system objects, the server will
+   return NFS4ERR_XDEV just as if the saved and current filehandles
+   represented directories on different file systems.
+
+   If oldname or newname has a length of zero, or if oldname or newname
+   does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be
+   returned.
+
+18.26.4. IMPLEMENTATION
+
+   The server MAY impose restrictions on the RENAME operation such that
+   RENAME may not be done when the file being renamed is open or when
+   that open is done by particular protocols, or with particular
+   options or access modes. Similar restrictions may be applied when a
+   file exists with the target name and is open. When RENAME is
+   rejected because of such restrictions, the error NFS4ERR_FILE_OPEN
+   is returned.
+
+   When oldname and newname refer to the same file and that file is
+   open in a fashion such that RENAME would normally be rejected with
+   NFS4ERR_FILE_OPEN if oldname and newname were different files, then
+   RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN.
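+
+   The following C fragment sketches one possible ordering of the
+   checks described in Sections 18.26.3 and 18.26.4. All types and
+   fields shown are hypothetical stand-ins for server-internal state;
+   only the error codes and the rules they encode come from this
+   specification.
+
+      enum nfsstat { NFS4_OK, NFS4ERR_XDEV, NFS4ERR_EXIST,
+                     NFS4ERR_FILE_OPEN };
+
+      struct obj {
+          int fsid;               /* file system the object is on  */
+          int is_dir;             /* directory vs. non-directory   */
+          int is_empty_dir;       /* meaningful only if is_dir     */
+          int open_denies_rename; /* server-specific restriction   */
+      };
+
+      /* dst is NULL when the target name does not already exist. */
+      enum nfsstat rename_check(const struct obj *src_dir,
+                                const struct obj *dst_dir,
+                                const struct obj *src,
+                                const struct obj *dst)
+      {
+          if (src_dir->fsid != dst_dir->fsid)
+              return NFS4ERR_XDEV;   /* different file systems */
+
+          if (dst != NULL && dst == src) {
+              /* oldname and newname name the same file: no action,
+               * unless an open would have caused rejection anyway. */
+              if (src->open_denies_rename)
+                  return NFS4ERR_FILE_OPEN;
+              return NFS4_OK;
+          }
+
+          if (dst != NULL) {
+              /* An existing target must be compatible and, if a
+               * directory, empty. */
+              if (src->is_dir != dst->is_dir ||
+                  (dst->is_dir && !dst->is_empty_dir))
+                  return NFS4ERR_EXIST;
+              if (dst->open_denies_rename)
+                  return NFS4ERR_FILE_OPEN;
+          }
+
+          if (src->open_denies_rename)
+              return NFS4ERR_FILE_OPEN;
+
+          return NFS4_OK;            /* proceed with the rename */
+      }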
+
+   If a server does implement such restrictions and those restrictions
+   include cases of NFSv4 opens preventing successful execution of a
+   rename, the server needs to recall any delegations that could hide
+   the existence of opens relevant to that decision. This is because
+   when a client holds a delegation, the server might not have an
+   accurate account of the opens for that client, since the client may
+   execute OPENs and CLOSEs locally. The RENAME operation need only be
+   delayed until a definitive result can be obtained. For example, if
+   there are multiple delegations and one of them establishes an open
+   whose presence would prevent the rename, given the server's
+   semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon
+   as that delegation is returned without waiting for other delegations
+   to be returned. Similarly, if such opens are not associated with
+   delegations, NFS4ERR_FILE_OPEN can be returned immediately with no
+   delegation recall being done.
+
+   If the current filehandle or the saved filehandle designates a
+   directory for which another client holds a directory delegation,
+   then, unless the situation can be resolved by sending a
+   notification, the delegation MUST be recalled, and the operation
+   cannot proceed until the delegation is returned or revoked. Except
+   where this happens very quickly, one or more NFS4ERR_DELAY errors
+   will be returned to requests made while delegation remains
+   outstanding.
+
+   When the current and saved filehandles are the same and they
+   designate a directory for which one or more directory delegations
+   exist, then, when those delegations request such notifications, a
+   notification of type NOTIFY4_RENAME_ENTRY will be generated as a
+   result of this operation. When oldname and newname refer to the
+   same file, no notification is generated (because, as Section 18.26.3
+   states, the server MUST take no action). When a file is removed
+   because it has the same name as the target, if that removal is done
+   atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will
+   not be generated. Instead, the deletion of the file will be
+   reported as part of the NOTIFY4_RENAME_ENTRY notification.
+
+   When the current and saved filehandles are not the same:
+
+   *  If the current filehandle designates a directory for which one or
+      more directory delegations exist, then, when those delegations
+      request such notifications, NOTIFY4_ADD_ENTRY will be generated
+      as a result of this operation. When a file is removed because it
+      has the same name as the target, if that removal is done
+      atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification
+      will not be generated. Instead, the deletion of the file will be
+      reported as part of the NOTIFY4_ADD_ENTRY notification.
+
+   *  If the saved filehandle designates a directory for which one or
+      more directory delegations exist, then, when those delegations
+      request such notifications, NOTIFY4_REMOVE_ENTRY will be
+      generated as a result of this operation.
+
+   If the object being renamed has file delegations held by clients
+   other than the one doing the RENAME, the delegations MUST be
+   recalled, and the operation cannot proceed until each such
+   delegation is returned or revoked. Note that in the case of
+   multiply linked files, the delegation recall requirement applies
+   even if the delegation was obtained through a different name than
+   the one being renamed.
+   In all cases in which delegations are recalled, the server is likely
+   to return one or more NFS4ERR_DELAY errors while the delegation(s)
+   remains outstanding, although it might not do that if the
+   delegations are returned quickly.
+
+   The RENAME operation must be atomic to the client. The statement
+   "source and target directories MUST reside on the same file system
+   on the server" means that the fsid fields in the attributes for the
+   directories are the same. If they reside on different file systems,
+   the error NFS4ERR_XDEV is returned.
+
+   Based on the value of the fh_expire_type attribute for the object,
+   the filehandle may or may not expire on a RENAME. However, server
+   implementors are strongly encouraged to attempt to keep filehandles
+   from expiring in this fashion.
+
+   On some servers, the file names "." and ".." are illegal as either
+   oldname or newname, and will result in the error NFS4ERR_BADNAME.
+   In addition, on many servers the case of oldname or newname being an
+   alias for the source directory will be checked for. Such servers
+   will return the error NFS4ERR_INVAL in these cases.
+
+   If either the source or target filehandle does not designate a
+   directory, the server will return NFS4ERR_NOTDIR.
+
+18.27. Operation 31: RESTOREFH - Restore Saved Filehandle
+
+18.27.1. ARGUMENTS
+
+   /* SAVED_FH: */
+   void;
+
+18.27.2. RESULTS
+
+   struct RESTOREFH4res {
+           /*
+            * If status is NFS4_OK,
+            *     new CURRENT_FH: value of saved fh
+            */
+           nfsstat4 status;
+   };
+
+18.27.3. DESCRIPTION
+
+   The RESTOREFH operation sets the current filehandle and stateid to
+   the values in the saved filehandle and stateid. If there is no
+   saved filehandle, then the server will return the error
+   NFS4ERR_NOFILEHANDLE.
+
+   See Section 16.2.3.1.1 for more details on the current filehandle.
+
+   See Section 16.2.3.1.2 for more details on the current stateid.
+
+18.27.4. IMPLEMENTATION
+
+   Operations like OPEN and LOOKUP use the current filehandle to
+   represent a directory and replace it with a new filehandle.
+   Assuming that the previous filehandle was saved with a SAVEFH
+   operation, the previous filehandle can be restored as the current
+   filehandle. This is commonly used to obtain post-operation
+   attributes for the directory, e.g.,
+
+      PUTFH (directory filehandle)
+      SAVEFH
+      GETATTR attrbits     (pre-op dir attrs)
+      CREATE optbits "foo" attrs
+      GETATTR attrbits     (file attributes)
+      RESTOREFH
+      GETATTR attrbits     (post-op dir attrs)
+
+18.28. Operation 32: SAVEFH - Save Current Filehandle
+
+18.28.1. ARGUMENTS
+
+   /* CURRENT_FH: */
+   void;
+
+18.28.2. RESULTS
+
+   struct SAVEFH4res {
+           /*
+            * If status is NFS4_OK,
+            *     new SAVED_FH: value of current fh
+            */
+           nfsstat4 status;
+   };
+
+18.28.3. DESCRIPTION
+
+   The SAVEFH operation saves the current filehandle and stateid. If a
+   previous filehandle was saved, then it is no longer accessible. The
+   saved filehandle can be restored as the current filehandle with the
+   RESTOREFH operation.
+
+   On success, the current filehandle retains its value.
+
+   See Section 16.2.3.1.1 for more details on the current filehandle.
+
+   See Section 16.2.3.1.2 for more details on the current stateid.
+
+18.28.4. IMPLEMENTATION
+
+18.29. Operation 33: SECINFO - Obtain Available Security
+
+18.29.1. ARGUMENTS
+
+   struct SECINFO4args {
+           /* CURRENT_FH: directory */
+           component4 name;
+   };
+
+18.29.2.
RESULTS + + /* + * From RFC 2203 + */ + enum rpc_gss_svc_t { + RPC_GSS_SVC_NONE = 1, + RPC_GSS_SVC_INTEGRITY = 2, + RPC_GSS_SVC_PRIVACY = 3 + }; + + struct rpcsec_gss_info { + sec_oid4 oid; + qop4 qop; + rpc_gss_svc_t service; + }; + + /* RPCSEC_GSS has a value of '6' - See RFC 2203 */ + union secinfo4 switch (uint32_t flavor) { + case RPCSEC_GSS: + rpcsec_gss_info flavor_info; + default: + void; + }; + + typedef secinfo4 SECINFO4resok<>; + + union SECINFO4res switch (nfsstat4 status) { + case NFS4_OK: + /* CURRENTFH: consumed */ + SECINFO4resok resok4; + default: + void; + }; + +18.29.3. DESCRIPTION + + The SECINFO operation is used by the client to obtain a list of valid + RPC authentication flavors for a specific directory filehandle, file + name pair. SECINFO should apply the same access methodology used for + LOOKUP when evaluating the name. Therefore, if the requester does + not have the appropriate access to LOOKUP the name, then SECINFO MUST + behave the same way and return NFS4ERR_ACCESS. + + The result will contain an array that represents the security + mechanisms available, with an order corresponding to the server's + preferences, the most preferred being first in the array. The client + is free to pick whatever security mechanism it both desires and + supports, or to pick in the server's preference order the first one + it supports. The array entries are represented by the secinfo4 + structure. The field 'flavor' will contain a value of AUTH_NONE, + AUTH_SYS (as defined in RFC 5531 [3]), or RPCSEC_GSS (as defined in + RFC 2203 [4]). The field flavor can also be any other security + flavor registered with IANA. + + For the flavors AUTH_NONE and AUTH_SYS, no additional security + information is returned. The same is true of many (if not most) + other security flavors, including AUTH_DH. For a return value of + RPCSEC_GSS, a security triple is returned that contains the mechanism + object identifier (OID, as defined in RFC 2743 [7]), the quality of + protection (as defined in RFC 2743 [7]), and the service type (as + defined in RFC 2203 [4]). It is possible for SECINFO to return + multiple entries with flavor equal to RPCSEC_GSS with different + security triple values. + + On success, the current filehandle is consumed (see + Section 2.6.3.1.1.8), and if the next operation after SECINFO tries + to use the current filehandle, that operation will fail with the + status NFS4ERR_NOFILEHANDLE. + + If the name has a length of zero, or if the name does not obey the + UTF-8 definition (assuming UTF-8 capabilities are enabled; see + Section 14.4), the error NFS4ERR_INVAL will be returned. + + See Section 2.6 for additional information on the use of SECINFO. + +18.29.4. IMPLEMENTATION + + The SECINFO operation is expected to be used by the NFS client when + the error value of NFS4ERR_WRONGSEC is returned from another NFS + operation. This signifies to the client that the server's security + policy is different from what the client is currently using. At this + point, the client is expected to obtain a list of possible security + flavors and choose what best suits its policies. + + As mentioned, the server's security policies will determine when a + client request receives NFS4ERR_WRONGSEC. See Table 14 for a list of + operations that can return NFS4ERR_WRONGSEC. In addition, when + READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can + contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE MUST NOT + return NFS4ERR_WRONGSEC. 
The rationale for CREATE is that unless the + target name exists, it cannot have a separate security policy from + the parent directory, and the security policy of the parent was + checked when its filehandle was injected into the COMPOUND request's + operations stream (for similar reasons, an OPEN operation that + creates the target MUST NOT return NFS4ERR_WRONGSEC). If the target + name exists, while it might have a separate security policy, that is + irrelevant because CREATE MUST return NFS4ERR_EXIST. The rationale + for REMOVE is that while that target might have a separate security + policy, the target is going to be removed, and so the security policy + of the parent trumps that of the object being removed. RENAME and + LINK MAY return NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error + applies only to the saved filehandle (see Section 2.6.3.1.2). Any + NFS4ERR_WRONGSEC error on the current filehandle used by LINK and + RENAME MUST be returned by the PUTFH, PUTPUBFH, PUTROOTFH, or + RESTOREFH operation that injected the current filehandle. + + With the exception of LINK and RENAME, the set of operations that can + return NFS4ERR_WRONGSEC represents the point at which the client can + inject a filehandle into the "current filehandle" at the server. The + filehandle is either provided by the client (PUTFH, PUTPUBFH, + PUTROOTFH), generated as a result of a name-to-filehandle translation + (LOOKUP and OPEN), or generated from the saved filehandle via + RESTOREFH. As Section 2.6.3.1.1.1 states, a put filehandle operation + followed by SAVEFH MUST NOT return NFS4ERR_WRONGSEC. Thus, the + RESTOREFH operation, under certain conditions (see + Section 2.6.3.1.1), is permitted to return NFS4ERR_WRONGSEC so that + security policies can be honored. + + The READDIR operation will not directly return the NFS4ERR_WRONGSEC + error. However, if the READDIR request included a request for + attributes, it is possible that the READDIR request's security triple + did not match that of a directory entry. If this is the case and the + client has requested the rdattr_error attribute, the server will + return the NFS4ERR_WRONGSEC error in rdattr_error for the entry. + + To resolve an error return of NFS4ERR_WRONGSEC, the client does the + following: + + * For LOOKUP and OPEN, the client will use SECINFO with the same + current filehandle and name as provided in the original LOOKUP or + OPEN to enumerate the available security triples. + + * For the rdattr_error, the client will use SECINFO with the same + current filehandle as provided in the original READDIR. The name + passed to SECINFO will be that of the directory entry (as returned + from READDIR) that had the NFS4ERR_WRONGSEC error in the + rdattr_error attribute. + + * For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the + client will use SECINFO_NO_NAME { style = + SECINFO_STYLE4_CURRENT_FH }. The client will prefix the + SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or + PUTROOTFH operation that provides the filehandle originally + provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH + operation. + + NOTE: In NFSv4.0, the client was required to use SECINFO, and had + to reconstruct the parent of the original filehandle and the + component name of the original filehandle. The introduction in + NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction. 
+
+   *  For LOOKUPP, the client will use SECINFO_NO_NAME { style =
+      SECINFO_STYLE4_PARENT } and provide the filehandle that equals
+      the filehandle originally provided to LOOKUPP.
+
+   See Section 21 for a discussion on the recommendations for the
+   security flavor used by SECINFO and SECINFO_NO_NAME.
+
+18.30. Operation 34: SETATTR - Set Attributes
+
+18.30.1. ARGUMENTS
+
+   struct SETATTR4args {
+           /* CURRENT_FH: target object */
+           stateid4 stateid;
+           fattr4 obj_attributes;
+   };
+
+18.30.2. RESULTS
+
+   struct SETATTR4res {
+           nfsstat4 status;
+           bitmap4 attrsset;
+   };
+
+18.30.3. DESCRIPTION
+
+   The SETATTR operation changes one or more of the attributes of a
+   file system object. The new attributes are specified with a bitmap
+   and the attributes that follow the bitmap in bit order.
+
+   The stateid argument for SETATTR is used to provide byte-range
+   locking context that is necessary for SETATTR requests that set the
+   size attribute. Since setting the size attribute modifies the
+   file's data, it has the same locking requirements as a corresponding
+   WRITE. Any SETATTR that sets the size attribute is incompatible
+   with a share reservation that specifies OPEN4_SHARE_DENY_WRITE. The
+   area between the old end-of-file and the new end-of-file is
+   considered to be modified just as would have been the case had the
+   area in question been specified as the target of WRITE, for the
+   purpose of checking conflicts with byte-range locks, for those cases
+   in which a server is implementing mandatory byte-range locking
+   behavior. A valid stateid SHOULD always be specified. When the
+   file size attribute is not set, the special stateid consisting of
+   all bits equal to zero MAY be passed.
+
+   On either success or failure of the operation, the server will
+   return the attrsset bitmask to represent what (if any) attributes
+   were successfully set. The attrsset in the response is a subset of
+   the attrmask field of the obj_attributes field in the argument.
+
+   On success, the current filehandle retains its value.
+
+18.30.4. IMPLEMENTATION
+
+   If the request specifies the owner attribute to be set, the server
+   SHOULD allow the operation to succeed if the current owner of the
+   object matches the value specified in the request. Some servers may
+   be implemented in a way as to prohibit the setting of the owner
+   attribute unless the requester has privilege to do so. If the
+   server is lenient in this one case of matching owner values, the
+   client implementation may be simplified in cases of creation of an
+   object (e.g., an exclusive create via OPEN) followed by a SETATTR.
+
+   The file size attribute is used to request changes to the size of a
+   file. A value of zero causes the file to be truncated, a value less
+   than the current size of the file causes data from new size to the
+   end of the file to be discarded, and a size greater than the current
+   size of the file causes logically zeroed data bytes to be added to
+   the end of the file. Servers are free to implement this using
+   unallocated bytes (holes) or allocated data bytes set to zero.
+   Clients should not make any assumptions regarding a server's
+   implementation of this feature, beyond that the bytes in the
+   affected byte-range returned by READ will be zeroed. Servers MUST
+   support extending the file size via SETATTR.
+
+   SETATTR is not guaranteed to be atomic. A failed SETATTR may
+   partially change a file's attributes, which is why the reply always
+   includes the status and the list of attributes that were set.
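+
+   A client therefore has to be prepared for a reply in which attrsset
+   covers only part of what it asked for. The following C fragment is
+   a minimal sketch of that comparison; a single 32-bit word stands in
+   for the protocol's variable-length bitmap4, and the example bit
+   assignments are hypothetical.
+
+      #include <stdint.h>
+      #include <stdio.h>
+
+      #define ATTR_BIT(n)  ((uint32_t)1 << (n))
+      #define EX_ATTR_SIZE ATTR_BIT(0)  /* example bit numbers only */
+      #define EX_ATTR_MODE ATTR_BIT(1)
+
+      int main(void)
+      {
+          uint32_t requested = EX_ATTR_SIZE | EX_ATTR_MODE;
+          uint32_t attrsset  = EX_ATTR_SIZE; /* from SETATTR4res */
+
+          /* attrsset is always a subset of the requested mask. */
+          uint32_t not_set = requested & ~attrsset;
+          if (not_set != 0)
+              printf("partial SETATTR: mask 0x%x not applied\n",
+                     (unsigned)not_set);
+          return 0;
+      }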
+
+   If the object whose attributes are being changed has a file
+   delegation that is held by a client other than the one doing the
+   SETATTR, the delegation(s) must be recalled, and the operation
+   cannot proceed to actually change an attribute until each such
+   delegation is returned or revoked. In all cases in which
+   delegations are recalled, the server is likely to return one or more
+   NFS4ERR_DELAY errors while the delegation(s) remains outstanding,
+   although it might not do that if the delegations are returned
+   quickly.
+
+   If the object whose attributes are being set is a directory and
+   another client holds a directory delegation for that directory, then
+   if enabled, asynchronous notifications will be generated when the
+   set of attributes changed has a non-null intersection with the set
+   of attributes for which notification is requested. Notifications of
+   type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate
+   client(s), but the SETATTR is not delayed by waiting for these
+   notifications to be sent.
+
+   If the object whose attributes are being set is a member of the
+   directory for which another client holds a directory delegation,
+   then asynchronous notifications will be generated when the set of
+   attributes changed has a non-null intersection with the set of
+   attributes for which notification is requested. Notifications of
+   type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate
+   clients, but the SETATTR is not delayed by waiting for these
+   notifications to be sent.
+
+   Changing the size of a file with SETATTR indirectly changes the
+   time_modify and change attributes. A client must account for this,
+   as size changes can result in data deletion.
+
+   The attributes time_access_set and time_modify_set are write-only
+   attributes constructed as a switched union so the client can direct
+   the server in setting the time values. If the switched union
+   specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4
+   to be used for the operation. If the switched union does not
+   specify SET_TO_CLIENT_TIME4, the server is to use its current time
+   for the SETATTR operation.
+
+   If server and client times differ, programs that compare client time
+   to file times can break. A time synchronization protocol should be
+   used to limit client/server time skew.
+
+   Use of a COMPOUND containing a VERIFY operation specifying only the
+   change attribute, immediately followed by a SETATTR, provides a
+   means whereby a client may specify a request that emulates the
+   functionality of the SETATTR guard mechanism of NFSv3. Since the
+   function of the guard mechanism is to avoid changes to the file
+   attributes based on stale information, delays between checking of
+   the guard condition and the setting of the attributes have the
+   potential to compromise this function, as would the corresponding
+   delay in the NFSv4 emulation. Therefore, NFSv4.1 servers SHOULD
+   take care to avoid such delays, to the degree possible, when
+   executing such a request.
+
+   If the server does not support an attribute as requested by the
+   client, the server SHOULD return NFS4ERR_ATTRNOTSUPP.
+
+   A mask of the attributes actually set is returned by SETATTR in all
+   cases. That mask MUST NOT include attribute bits not requested to
+   be set by the client. If the attribute masks in the request and
+   reply are equal, the status field in the reply MUST be NFS4_OK.
+
+18.31. Operation 37: VERIFY - Verify Same Attributes
+
+18.31.1.
ARGUMENTS + + struct VERIFY4args { + /* CURRENT_FH: object */ + fattr4 obj_attributes; + }; + +18.31.2. RESULTS + + struct VERIFY4res { + nfsstat4 status; + }; + +18.31.3. DESCRIPTION + + The VERIFY operation is used to verify that attributes have the value + assumed by the client before proceeding with the following operations + in the COMPOUND request. If any of the attributes do not match, then + the error NFS4ERR_NOT_SAME must be returned. The current filehandle + retains its value after successful completion of the operation. + +18.31.4. IMPLEMENTATION + + One possible use of the VERIFY operation is the following series of + operations. With this, the client is attempting to verify that the + file being removed will match what the client expects to be removed. + This series can help prevent the unintended deletion of a file. + + PUTFH (directory filehandle) + LOOKUP (file name) + VERIFY (filehandle == fh) + PUTFH (directory filehandle) + REMOVE (file name) + + This series does not prevent a second client from removing and + creating a new file in the middle of this sequence, but it does help + avoid the unintended result. + + In the case that a RECOMMENDED attribute is specified in the VERIFY + operation and the server does not support that attribute for the file + system object, the error NFS4ERR_ATTRNOTSUPP is returned to the + client. + + When the attribute rdattr_error or any set-only attribute (e.g., + time_modify_set) is specified, the error NFS4ERR_INVAL is returned to + the client. + +18.32. Operation 38: WRITE - Write to File + +18.32.1. ARGUMENTS + + enum stable_how4 { + UNSTABLE4 = 0, + DATA_SYNC4 = 1, + FILE_SYNC4 = 2 + }; + + struct WRITE4args { + /* CURRENT_FH: file */ + stateid4 stateid; + offset4 offset; + stable_how4 stable; + opaque data<>; + }; + +18.32.2. RESULTS + + struct WRITE4resok { + count4 count; + stable_how4 committed; + verifier4 writeverf; + }; + + union WRITE4res switch (nfsstat4 status) { + case NFS4_OK: + WRITE4resok resok4; + default: + void; + }; + +18.32.3. DESCRIPTION + + The WRITE operation is used to write data to a regular file. The + target file is specified by the current filehandle. The offset + specifies the offset where the data should be written. An offset of + zero specifies that the write should start at the beginning of the + file. The count, as encoded as part of the opaque data parameter, + represents the number of bytes of data that are to be written. If + the count is zero, the WRITE will succeed and return a count of zero + subject to permissions checking. The server MAY write fewer bytes + than requested by the client. + + The client specifies with the stable parameter the method of how the + data is to be processed by the server. If stable is FILE_SYNC4, the + server MUST commit the data written plus all file system metadata to + stable storage before returning results. This corresponds to the + NFSv2 protocol semantics. Any other behavior constitutes a protocol + violation. If stable is DATA_SYNC4, then the server MUST commit all + of the data to stable storage and enough of the metadata to retrieve + the data before returning. The server implementor is free to + implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a + possible performance drop. If stable is UNSTABLE4, the server is + free to commit any part of the data and the metadata to stable + storage, including all or none, before returning a reply to the + client. 
+   There is no guarantee whether or when any uncommitted data will
+   subsequently be committed to stable storage. The only guarantees
+   made by the server are that it will not destroy any data without
+   changing the value of writeverf and that it will not commit the data
+   and metadata at a level less than that requested by the client.
+
+   Except when special stateids are used, the stateid value for a WRITE
+   request represents a value returned from a previous byte-range LOCK
+   or OPEN request or the stateid associated with a delegation. The
+   stateid identifies the associated owners if any and is used by the
+   server to verify that the associated locks are still valid (e.g.,
+   have not been revoked).
+
+   Upon successful completion, the following results are returned. The
+   count result is the number of bytes of data written to the file.
+   The server may write fewer bytes than requested. If so, the actual
+   number of bytes written starting at location, offset, is returned.
+
+   The server also returns an indication of the level of commitment of
+   the data and metadata via committed. Per Table 20,
+
+   *  The server MAY commit the data at a stronger level than
+      requested.
+
+   *  The server MUST commit the data at a level at least as high as
+      that committed.
+
+    +============+===================================+
+    | stable     | committed                         |
+    +============+===================================+
+    | UNSTABLE4  | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 |
+    +------------+-----------------------------------+
+    | DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4            |
+    +------------+-----------------------------------+
+    | FILE_SYNC4 | FILE_SYNC4                        |
+    +------------+-----------------------------------+
+
+      Table 20: Valid Combinations of the Fields
+      Stable in the Request and Committed in the
+                       Reply
+
+   The final portion of the result is the field writeverf. This field
+   is the write verifier and is a cookie that the client can use to
+   determine whether a server has changed instance state (e.g., server
+   restart) between a call to WRITE and a subsequent call to either
+   WRITE or COMMIT. This cookie MUST be unchanged during a single
+   instance of the NFSv4.1 server and MUST be unique between instances
+   of the NFSv4.1 server. If the cookie changes, then the client MUST
+   assume that any data written with an UNSTABLE4 value for committed
+   and an old writeverf in the reply has been lost and will need to be
+   recovered.
+
+   If a client writes data to the server with the stable argument set
+   to UNSTABLE4 and the reply yields a committed response of DATA_SYNC4
+   or UNSTABLE4, the client will follow up some time in the future with
+   a COMMIT operation to synchronize outstanding asynchronous data and
+   metadata with the server's stable storage, barring client error. It
+   is possible that, due to client crash or other error, a subsequent
+   COMMIT will not be received by the server.
+
+   For a WRITE with a stateid value of all bits equal to zero, the
+   server MAY allow the WRITE to be serviced subject to mandatory byte-
+   range locks or the current share deny modes for the file. For a
+   WRITE with a stateid value of all bits equal to 1, the server MUST
+   NOT allow the WRITE operation to bypass locking checks at the
+   server; otherwise, the WRITE is treated as if a stateid of all bits
+   equal to zero were used.
+
+   On success, the current filehandle retains its value.
+
+18.32.4. IMPLEMENTATION
+
+   It is possible for the server to write fewer bytes of data than
+   requested by the client.
+   In this case, the server SHOULD NOT return an error unless no data
+   was written at all. If the server writes fewer than the number of
+   bytes specified, the client will need to send another WRITE to write
+   the remaining data.
+
+   It is assumed that the act of writing data to a file will cause the
+   time_modify and change attributes of the file to be updated.
+   However, these attributes SHOULD NOT be changed unless the contents
+   of the file are changed. Thus, a WRITE request with count set to
+   zero SHOULD NOT cause the time_modify and change attributes of the
+   file to be updated.
+
+   Stable storage is persistent storage that survives:
+
+   1.  Repeated power failures.
+
+   2.  Hardware failures (of any board, power supply, etc.).
+
+   3.  Repeated software crashes and restarts.
+
+   This definition does not address failure of the stable storage
+   module itself.
+
+   The verifier is defined to allow a client to detect different
+   instances of an NFSv4.1 protocol server over which cached,
+   uncommitted data may be lost. In the most likely case, the verifier
+   allows the client to detect server restarts. This information is
+   required so that the client can safely determine whether the server
+   could have lost cached data. If the server fails unexpectedly and
+   the client has uncommitted data from previous WRITE requests (done
+   with the stable argument set to UNSTABLE4 and in which the result
+   committed was returned as UNSTABLE4 as well), the server might not
+   have flushed cached data to stable storage. The burden of recovery
+   is on the client, and the client will need to retransmit the data to
+   the server.
+
+   A suggested verifier would be to use the time that the server was
+   last started (if restarting the server results in lost buffers).
+
+   The reply's committed field allows the client to do more effective
+   caching. If the server is committing all WRITE requests to stable
+   storage, then it SHOULD return with committed set to FILE_SYNC4,
+   regardless of the value of the stable field in the arguments. A
+   server that uses an NVRAM accelerator may choose to implement this
+   policy. The client can use this to increase the effectiveness of
+   the cache by discarding cached data that has already been committed
+   on the server.
+
+   Some implementations may return NFS4ERR_NOSPC instead of
+   NFS4ERR_DQUOT when a user's quota is exceeded.
+
+   In the case that the current filehandle is of type NF4DIR, the
+   server will return NFS4ERR_ISDIR. If the current file is a symbolic
+   link, the error NFS4ERR_SYMLINK will be returned. Otherwise, if the
+   current filehandle does not designate an ordinary file, the server
+   will return NFS4ERR_WRONG_TYPE.
+
+   If mandatory byte-range locking is in effect for the file, and the
+   corresponding byte-range of the data to be written to the file is
+   READ_LT or WRITE_LT locked by an owner that is not associated with
+   the stateid, the server MUST return NFS4ERR_LOCKED. If so, the
+   client MUST check if the owner corresponding to the stateid used
+   with the WRITE operation has a conflicting READ_LT lock that
+   overlaps with the byte-range that was to be written. If the
+   stateid's owner has no conflicting READ_LT lock, then the client
+   SHOULD try to get the appropriate write byte-range lock via the LOCK
+   operation before re-attempting the WRITE. When the WRITE completes,
+   the client SHOULD release the byte-range lock via LOCKU.
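+
+   The recovery sequence just described can be summarized in C as
+   follows. Every function called here is a hypothetical stand-in for
+   a client-internal RPC wrapper; only the control flow (check for a
+   conflicting READ_LT lock, LOCK, re-attempt the WRITE, LOCKU) comes
+   from this specification.
+
+      enum wstat { W_OK, W_LOCKED, W_FAIL };
+
+      /* Stubs standing in for the client's RPC machinery. */
+      static enum wstat do_write(void)               { return W_OK; }
+      static int owner_has_conflicting_read_lt(void) { return 0; }
+      static int lock_write_lt(void)                 { return 0; }
+      static void unlock_range(void)                 { }
+
+      static enum wstat write_with_lock_recovery(void)
+      {
+          enum wstat st = do_write();          /* WRITE           */
+          if (st != W_LOCKED)
+              return st;
+
+          /* A conflicting READ_LT held by this stateid's owner
+           * cannot be upgraded safely; report the error upward. */
+          if (owner_has_conflicting_read_lt())
+              return W_FAIL;
+
+          if (lock_write_lt() != 0)            /* LOCK (WRITE_LT) */
+              return W_FAIL;
+
+          st = do_write();                     /* retry the WRITE */
+          unlock_range();                      /* LOCKU           */
+          return st;
+      }
+
+      int main(void)
+      {
+          return write_with_lock_recovery() == W_OK ? 0 : 1;
+      }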
+
+   If the stateid's owner had a conflicting READ_LT lock, then the
+   client has no choice but to return an error to the application that
+   attempted the WRITE. The reason is that since the stateid's owner
+   had a READ_LT lock, either the server attempted to temporarily
+   upgrade this READ_LT lock to a WRITE_LT lock, or the server has no
+   upgrade capability. If the server attempted to upgrade the READ_LT
+   lock and failed, it is pointless for the client to re-attempt the
+   upgrade via the LOCK operation, because there might be another
+   client also trying to upgrade. If two clients are blocked trying to
+   upgrade the same lock, the clients deadlock. If the server has no
+   upgrade capability, then it is pointless to try a LOCK operation to
+   upgrade.
+
+   If one or more other clients have delegations for the file being
+   written, those delegations MUST be recalled, and the operation
+   cannot proceed until those delegations are returned or revoked.
+   Except where this happens very quickly, one or more NFS4ERR_DELAY
+   errors will be returned to requests made while the delegation
+   remains outstanding. Normally, delegations will not be recalled as
+   a result of a WRITE operation since the recall will occur as a
+   result of an earlier OPEN. However, since it is possible for a
+   WRITE to be done with a special stateid, the server needs to check
+   for this case even though the client should have done an OPEN
+   previously.
+
+18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control
+
+18.33.1. ARGUMENT
+
+   typedef opaque gsshandle4_t<>;
+
+   struct gss_cb_handles4 {
+           rpc_gss_svc_t gcbp_service; /* RFC 2203 */
+           gsshandle4_t gcbp_handle_from_server;
+           gsshandle4_t gcbp_handle_from_client;
+   };
+
+   union callback_sec_parms4 switch (uint32_t cb_secflavor) {
+    case AUTH_NONE:
+            void;
+    case AUTH_SYS:
+            authsys_parms cbsp_sys_cred; /* RFC 5531 */
+    case RPCSEC_GSS:
+            gss_cb_handles4 cbsp_gss_handles;
+   };
+
+   struct BACKCHANNEL_CTL4args {
+           uint32_t bca_cb_program;
+           callback_sec_parms4 bca_sec_parms<>;
+   };
+
+18.33.2. RESULT
+
+   struct BACKCHANNEL_CTL4res {
+           nfsstat4 bcr_status;
+   };
+
+18.33.3. DESCRIPTION
+
+   The BACKCHANNEL_CTL operation replaces the backchannel's callback
+   program number and adds (not replaces) RPCSEC_GSS handles for use by
+   the backchannel.
+
+   The arguments of the BACKCHANNEL_CTL call are a subset of the
+   CREATE_SESSION parameters. In the arguments of BACKCHANNEL_CTL, the
+   bca_cb_program field and bca_sec_parms fields correspond
+   respectively to the csa_cb_program and csa_sec_parms fields of the
+   arguments of CREATE_SESSION (Section 18.36).
+
+   BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE.
+
+   If the RPCSEC_GSS handle identified by gcbp_handle_from_server does
+   not exist on the server, the server MUST return NFS4ERR_NOENT.
+
+   If an RPCSEC_GSS handle is using the SSV context (see
+   Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a
+   common SSV GSS context, there are security considerations specific
+   to this situation discussed in Section 2.10.10.
+
+18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection with
+       Session
+
+18.34.1. ARGUMENT
+
+   enum channel_dir_from_client4 {
+    CDFC4_FORE = 0x1,
+    CDFC4_BACK = 0x2,
+    CDFC4_FORE_OR_BOTH = 0x3,
+    CDFC4_BACK_OR_BOTH = 0x7
+   };
+
+   struct BIND_CONN_TO_SESSION4args {
+           sessionid4 bctsa_sessid;
+
+           channel_dir_from_client4
+                   bctsa_dir;
+
+           bool bctsa_use_conn_in_rdma_mode;
+   };
+
+18.34.2. RESULT
+
+   enum channel_dir_from_server4 {
+    CDFS4_FORE = 0x1,
+    CDFS4_BACK = 0x2,
+    CDFS4_BOTH = 0x3
+   };
+
+   struct BIND_CONN_TO_SESSION4resok {
+           sessionid4 bctsr_sessid;
+
+           channel_dir_from_server4
+                   bctsr_dir;
+
+           bool bctsr_use_conn_in_rdma_mode;
+   };
+
+   union BIND_CONN_TO_SESSION4res
+    switch (nfsstat4 bctsr_status) {
+
+    case NFS4_OK:
+     BIND_CONN_TO_SESSION4resok
+      bctsr_resok4;
+
+    default: void;
+   };
+
+18.34.3. DESCRIPTION
+
+   BIND_CONN_TO_SESSION is used to associate additional connections
+   with a session. It MUST be used on the connection being associated
+   with the session. It MUST be the only operation in the COMPOUND
+   procedure. If SP4_NONE (Section 18.35) state protection is used,
+   any principal, security flavor, or RPCSEC_GSS context MAY be used to
+   invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS MUST be
+   used with the integrity or privacy services, using the principal
+   that created the client ID. If SP4_SSV is used, RPCSEC_GSS with the
+   SSV GSS mechanism (Section 2.10.9) and integrity or privacy MUST be
+   used.
+
+   If, when the client ID was created, the client opted for SP4_NONE
+   state protection, the client is not required to use
+   BIND_CONN_TO_SESSION to associate the connection with the session,
+   unless the client wishes to associate the connection with the
+   backchannel. When SP4_NONE protection is used, simply sending a
+   COMPOUND request with a SEQUENCE operation is sufficient to
+   associate the connection with the session specified in SEQUENCE.
+
+   The field bctsa_dir indicates whether the client wants to associate
+   the connection with the fore channel or the backchannel or both
+   channels. The value CDFC4_FORE_OR_BOTH indicates that the client
+   wants to associate the connection with both the fore channel and
+   backchannel, but will accept the connection being associated to just
+   the fore channel. The value CDFC4_BACK_OR_BOTH indicates that the
+   client wants to associate with both the fore channel and
+   backchannel, but will accept the connection being associated with
+   just the backchannel. The server indicates in bctsr_dir which
+   channel(s) the connection is associated with. If the client
+   specified CDFC4_FORE, the server MUST return CDFS4_FORE. If the
+   client specified CDFC4_BACK, the server MUST return CDFS4_BACK. If
+   the client specified CDFC4_FORE_OR_BOTH, the server MUST return
+   CDFS4_FORE or CDFS4_BOTH. If the client specified
+   CDFC4_BACK_OR_BOTH, the server MUST return CDFS4_BACK or CDFS4_BOTH.
+
+   See the CREATE_SESSION operation (Section 18.36), and the
+   description of the argument csa_use_conn_in_rdma_mode to understand
+   bctsa_use_conn_in_rdma_mode, and the description of
+   csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode.
+
+   Invoking BIND_CONN_TO_SESSION on a connection already associated
+   with the specified session has no effect, and the server MUST
+   respond with NFS4_OK, unless the client is demanding changes to the
+   set of channels the connection is associated with. If so, the
+   server MUST return NFS4ERR_INVAL.
+
+18.34.4. IMPLEMENTATION
+
+   If a session's channel loses all connections, depending on the
+   client ID's state protection and type of channel, the client might
+   need to use BIND_CONN_TO_SESSION to associate a new connection. If
+   the server restarted and does not keep the reply cache in stable
+   storage, the server will not recognize the session ID. The client
+   will ultimately have to invoke EXCHANGE_ID to create a new client ID
+   and session.
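+
+   The following C fragment sketches that reconnect path. The
+   functions are hypothetical stand-ins for the client's RPC wrappers;
+   the fallback to EXCHANGE_ID and CREATE_SESSION when the server no
+   longer recognizes the session ID (NFS4ERR_BADSESSION) follows the
+   paragraph above.
+
+      enum st { ST_OK, ST_BADSESSION, ST_FAIL };
+
+      /* Stubs standing in for the client's RPC machinery. */
+      static enum st bind_conn_to_session(void) { return ST_OK; }
+      static enum st exchange_id(void)          { return ST_OK; }
+      static enum st create_session(void)       { return ST_OK; }
+
+      static enum st reassociate_connection(void)
+      {
+          enum st st = bind_conn_to_session();
+          if (st != ST_BADSESSION)
+              return st;     /* session still known to the server */
+
+          /* Server restarted (or otherwise lost the session):
+           * build a new client ID and session. */
+          if (exchange_id() != ST_OK)
+              return ST_FAIL;
+          return create_session();
+      }
+
+      int main(void)
+      {
+          return reassociate_connection() == ST_OK ? 0 : 1;
+      }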
+ + Suppose SP4_SSV state protection is being used, and + BIND_CONN_TO_SESSION is among the operations included in the + spo_must_enforce set when the client ID was created (Section 18.35). + If so, there is an issue if SET_SSV is sent, no response is returned, + and the last connection associated with the client ID drops. The + client, per the sessions model, MUST retry the SET_SSV. But it needs + a new connection to do so, and MUST associate that connection with + the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS + mechanism. The problem is that the RPCSEC_GSS message integrity + codes use a subkey derived from the SSV as the key and the SSV may + have changed. While there are multiple recovery strategies, a + single, general strategy is described here. + + * The client reconnects. + + * The client assumes that the SET_SSV was executed, and so sends + BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, + i.e., what SET_SSV would have set the SSV to) used as the key for + the RPCSEC_GSS credential message integrity codes. + + * If the request succeeds, this means that the original attempted + SET_SSV did execute successfully. The client re-sends the + original SET_SSV, which the server will reply to via the reply + cache. + + * If the server returns an RPC authentication error, this means that + the server's current SSV was not changed (and the SET_SSV was + likely not executed). The client then tries BIND_CONN_TO_SESSION + with the subkey derived from the old SSV as the key for the + RPCSEC_GSS message integrity codes. + + * The attempted BIND_CONN_TO_SESSION with the old SSV should + succeed. If so, the client re-sends the original SET_SSV. If the + original SET_SSV was not executed, then the server executes it. + If the original SET_SSV was executed but failed, the server will + return the SET_SSV from the reply cache. + +18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID + + The EXCHANGE_ID operation exchanges long-hand client and server + identifiers (owners) and provides access to a client ID, creating one + if necessary. This client ID becomes associated with the connection + on which the operation is done, so that it is available when a + CREATE_SESSION is done or when the connection is used to issue a + request on an existing session associated with the current client. + +18.35.1. 
ARGUMENT + + const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001; + const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002; + + const EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100; + + const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000; + const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000; + const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000; + + const EXCHGID4_FLAG_MASK_PNFS = 0x00070000; + + const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000; + const EXCHGID4_FLAG_CONFIRMED_R = 0x80000000; + + struct state_protect_ops4 { + bitmap4 spo_must_enforce; + bitmap4 spo_must_allow; + }; + + struct ssv_sp_parms4 { + state_protect_ops4 ssp_ops; + sec_oid4 ssp_hash_algs<>; + sec_oid4 ssp_encr_algs<>; + uint32_t ssp_window; + uint32_t ssp_num_gss_handles; + }; + + enum state_protect_how4 { + SP4_NONE = 0, + SP4_MACH_CRED = 1, + SP4_SSV = 2 + }; + + union state_protect4_a switch(state_protect_how4 spa_how) { + case SP4_NONE: + void; + case SP4_MACH_CRED: + state_protect_ops4 spa_mach_ops; + case SP4_SSV: + ssv_sp_parms4 spa_ssv_parms; + }; + + struct EXCHANGE_ID4args { + client_owner4 eia_clientowner; + uint32_t eia_flags; + state_protect4_a eia_state_protect; + nfs_impl_id4 eia_client_impl_id<1>; + }; + +18.35.2. RESULT + + struct ssv_prot_info4 { + state_protect_ops4 spi_ops; + uint32_t spi_hash_alg; + uint32_t spi_encr_alg; + uint32_t spi_ssv_len; + uint32_t spi_window; + gsshandle4_t spi_handles<>; + }; + + union state_protect4_r switch(state_protect_how4 spr_how) { + case SP4_NONE: + void; + case SP4_MACH_CRED: + state_protect_ops4 spr_mach_ops; + case SP4_SSV: + ssv_prot_info4 spr_ssv_info; + }; + + struct EXCHANGE_ID4resok { + clientid4 eir_clientid; + sequenceid4 eir_sequenceid; + uint32_t eir_flags; + state_protect4_r eir_state_protect; + server_owner4 eir_server_owner; + opaque eir_server_scope<NFS4_OPAQUE_LIMIT>; + nfs_impl_id4 eir_server_impl_id<1>; + }; + + union EXCHANGE_ID4res switch (nfsstat4 eir_status) { + case NFS4_OK: + EXCHANGE_ID4resok eir_resok4; + + default: + void; + }; + +18.35.3. DESCRIPTION + + The client uses the EXCHANGE_ID operation to register a particular + instance of that client with the server, as represented by a + client_owner4. However, when the client_owner4 has already been + registered by other means (e.g., Transparent State Migration), the + client may still use EXCHANGE_ID to obtain the client ID assigned + previously. + + The client ID returned from this operation will be associated with + the connection on which the EXCHANGE_ID is received and will serve as + a parent object for sessions created by the client on this connection + or to which the connection is bound. As a result of using those + sessions to make requests involving the creation of state, that state + will become associated with the client ID returned. + + In situations in which the registration of the client_owner has not + occurred previously, the client ID must first be used, along with the + returned eir_sequenceid, in creating an associated session using + CREATE_SESSION. + + If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the result, + eir_flags, then it is an indication that the registration of the + client_owner has already occurred and that a further CREATE_SESSION + is not needed to confirm it. Of course, subsequent CREATE_SESSION + operations may be needed for other reasons. + + The value eir_sequenceid is used to establish an initial sequence + value associated with the client ID returned. 
+   In cases in which a CREATE_SESSION has already been done, there is
+   no need for this value, since sequencing of such requests has
+   already been established, and the client has no need for this value
+   and will ignore it.
+
+   EXCHANGE_ID MAY be sent in a COMPOUND procedure that starts with
+   SEQUENCE. However, when a client communicates with a server for the
+   first time, it will not have a session, so using SEQUENCE will not
+   be possible. If EXCHANGE_ID is sent without a preceding SEQUENCE,
+   then it MUST be the only operation in the COMPOUND procedure's
+   request. If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP.
+
+   The eia_clientowner field is composed of a co_verifier field and a
+   co_ownerid string. As noted in Section 2.4, the co_ownerid
+   identifies the client, and the co_verifier specifies a particular
+   incarnation of that client. An EXCHANGE_ID sent with a new
+   incarnation of the client will lead to the server removing lock
+   state of the old incarnation. On the other hand, when an
+   EXCHANGE_ID sent with the current incarnation and co_ownerid does
+   not result in an unrelated error, it will potentially update an
+   existing client ID's properties or simply return information about
+   the existing client ID. The latter would happen when this operation
+   is done to the same server using different network addresses as part
+   of creating trunked connections.
+
+   A server MUST NOT provide the same client ID to two different
+   incarnations of an eia_clientowner.
+
+   In addition to the client ID and sequence ID, the server returns a
+   server owner (eir_server_owner) and server scope (eir_server_scope).
+   The former field is used in connection with network trunking as
+   described in Section 2.10.5. The latter field is used to allow
+   clients to determine when client IDs sent by one server may be
+   recognized by another in the event of file system migration (see
+   Section 11.11.9 of the current document).
+
+   The client ID returned by EXCHANGE_ID is only unique relative to the
+   combination of eir_server_owner.so_major_id and eir_server_scope.
+   Thus, if two servers return the same client ID, the onus is on the
+   client to distinguish the client IDs on the basis of
+   eir_server_owner.so_major_id and eir_server_scope. In the event two
+   different servers claim matching eir_server_owner.so_major_id and
+   eir_server_scope, the client can use the verification techniques
+   discussed in Section 2.10.5.1 to determine if the servers are
+   distinct. If they are distinct, then the client will need to note
+   the destination network addresses of the connections used with each
+   server and use the network address as the final discriminator.
+
+   The server, as defined by the unique identity expressed in the
+   so_major_id of the server owner and the server scope, needs to track
+   several properties of each client ID it hands out. The properties
+   apply to the client ID and all sessions associated with the client
+   ID. The properties are derived from the arguments and results of
+   EXCHANGE_ID. The client ID properties include:
+
+   *  The capabilities expressed by the following bits, which come from
+      the results of EXCHANGE_ID:
+
+      -  EXCHGID4_FLAG_SUPP_MOVED_REFER
+
+      -  EXCHGID4_FLAG_SUPP_MOVED_MIGR
+
+      -  EXCHGID4_FLAG_BIND_PRINC_STATEID
+
+      -  EXCHGID4_FLAG_USE_NON_PNFS
+
+      -  EXCHGID4_FLAG_USE_PNFS_MDS
+
+      -  EXCHGID4_FLAG_USE_PNFS_DS
+
+      These properties may be updated by subsequent EXCHANGE_ID
+      operations on confirmed client IDs, though the server MAY refuse
+      to change them.
   *  The state protection method used, one of SP4_NONE, SP4_MACH_CRED,
      or SP4_SSV, as set by the spa_how field of the arguments to
      EXCHANGE_ID.  Once the client ID is confirmed, this property
      cannot be updated by subsequent EXCHANGE_ID operations.

   *  For SP4_MACH_CRED or SP4_SSV state protection:

      -  The list of operations (spo_must_enforce) that MUST use the
         specified state protection.  This list comes from the results
         of EXCHANGE_ID.

      -  The list of operations (spo_must_allow) that MAY use the
         specified state protection.  This list comes from the results
         of EXCHANGE_ID.

      Once the client ID is confirmed, these properties cannot be
      updated by subsequent EXCHANGE_ID requests.

   *  For SP4_SSV protection:

      -  The OID of the hash algorithm.  This property is represented
         by one of the algorithms in the ssp_hash_algs field of the
         EXCHANGE_ID arguments.  Once the client ID is confirmed, this
         property cannot be updated by subsequent EXCHANGE_ID requests.

      -  The OID of the encryption algorithm.  This property is
         represented by one of the algorithms in the ssp_encr_algs
         field of the EXCHANGE_ID arguments.  Once the client ID is
         confirmed, this property cannot be updated by subsequent
         EXCHANGE_ID requests.

      -  The length of the SSV.  This property is represented by the
         spi_ssv_len field in the EXCHANGE_ID results.  Once the client
         ID is confirmed, this property cannot be updated by subsequent
         EXCHANGE_ID operations.

         There are REQUIRED and RECOMMENDED relationships among the
         length of the key of the encryption algorithm ("key length"),
         the length of the output of the hash algorithm ("hash
         length"), and the length of the SSV ("SSV length"); a short
         illustrative check appears after this list.

         o  key length MUST be <= hash length.  This is because the
            keys used for the encryption algorithm are actually subkeys
            derived from the SSV, and the derivation is via the hash
            algorithm.  The selection of an encryption algorithm with a
            key length that exceeded the length of the output of the
            hash algorithm would require padding, and thus weaken the
            use of the encryption algorithm.

         o  hash length SHOULD be <= SSV length.  This is because the
            SSV is a key used to derive subkeys via an HMAC, and it is
            recommended that the key used as input to an HMAC be at
            least as long as the length of the HMAC's hash algorithm's
            output (see Section 3 of [52]).

         o  key length SHOULD be <= SSV length.  This is a transitive
            result of the above two invariants.

         o  key length SHOULD be >= hash length / 2.  This is because
            the subkey derivation is via an HMAC and it is recommended
            that if the HMAC has to be truncated, it should not be
            truncated to less than half the hash length (see Section 4
            of RFC 2104 [52]).

      -  Number of concurrent versions of the SSV the client and server
         will support (see Section 2.10.9).  This property is
         represented by spi_window in the EXCHANGE_ID results.  The
         property may be updated by subsequent EXCHANGE_ID operations.

   *  The client's implementation ID as represented by the
      eia_client_impl_id field of the arguments.  The property may be
      updated by subsequent EXCHANGE_ID requests.

   *  The server's implementation ID as represented by the
      eir_server_impl_id field of the reply.  The property may be
      updated by replies to subsequent EXCHANGE_ID requests.
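   The length relationships above can be made concrete with a short,
   purely illustrative check.  Lengths are in octets; the function name
   and the warning strategy are inventions of this sketch, not part of
   the protocol.

   #include <stdio.h>

   /* Illustrative check of the SSV length invariants listed above.
    * Returns 0 when the REQUIRED relationship holds; RECOMMENDED
    * violations merely produce warnings. */
   static int
   check_ssv_lengths(unsigned key_len, unsigned hash_len,
                     unsigned ssv_len)
   {
           if (key_len > hash_len)     /* key length MUST be <= hash length */
                   return -1;
           if (hash_len > ssv_len)     /* hash length SHOULD be <= SSV length */
                   fprintf(stderr,
                           "warning: SSV shorter than hash output\n");
           if (2 * key_len < hash_len) /* key length SHOULD be >= hash length / 2 */
                   fprintf(stderr,
                           "warning: subkey truncated below half the hash length\n");
           /* key length <= SSV length then follows transitively. */
           return 0;
   }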
   The eia_flags passed as part of the arguments and the eir_flags
   results allow the client and server to inform each other of their
   capabilities as well as indicate how the client ID will be used.
   Whether a bit is set or cleared on the arguments' flags does not
   force the server to set or clear the same bit on the results' side.
   Bits not defined above cannot be set in the eia_flags field.  If
   they are, the server MUST reject the operation with NFS4ERR_INVAL.

   The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in
   eia_flags; it is always off in eir_flags.  The
   EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is
   always off in eia_flags.  If the server recognizes the co_ownerid
   and co_verifier as mapping to a confirmed client ID, it sets
   EXCHGID4_FLAG_CONFIRMED_R in eir_flags.  The
   EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client
   ID it is trying to create already exists and is confirmed.

   If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means
   that the client is attempting to update properties of an existing
   confirmed client ID (if the client wants to update properties of an
   unconfirmed client ID, it MUST NOT set
   EXCHGID4_FLAG_UPD_CONFIRMED_REC_A).  If so, it is RECOMMENDED that
   the client send the update EXCHANGE_ID operation in the same
   COMPOUND as a SEQUENCE so that the EXCHANGE_ID is executed exactly
   once.  Whether the client can update the properties of the client ID
   depends on the state protection it selected when the client ID was
   created, and the principal and security flavor it used when sending
   the EXCHANGE_ID operation.  The situations described in items 6, 7,
   8, or 9 of the second numbered list of Section 18.35.4 below will
   apply.  Note that if the operation succeeds and returns a client ID
   that is already confirmed, the server MUST set the
   EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.

   If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this
   means that the client is trying to establish a new client ID; it is
   attempting to trunk data communication to the server (see
   Section 2.10.5); or it is attempting to update properties of an
   unconfirmed client ID.  The situations described in items 1, 2, 3,
   4, or 5 of the second numbered list of Section 18.35.4 below will
   apply.  Note that if the operation succeeds and returns a client ID
   that was previously confirmed, the server MUST set the
   EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.

   When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client
   indicates that it is capable of dealing with an NFS4ERR_MOVED error
   as part of a referral sequence.  When this bit is not set, it is
   still legal for the server to perform a referral sequence.  However,
   a server may use the fact that the client is incapable of correctly
   responding to a referral by avoiding it for that particular client.
   It may, for instance, act as a proxy for that particular file
   system, at some cost in performance, although it is not obligated to
   do so.  If the server will potentially perform a referral, it MUST
   set EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags.

   When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates
   that it is capable of dealing with an NFS4ERR_MOVED error as part of
   a file system migration sequence.  When this bit is not set, it is
   still legal for the server to indicate that a file system has moved,
   when this in fact happens.  However, a server may use the fact that
   the client is incapable of correctly responding to a migration in
   its scheduling of file systems to migrate so as to avoid migration
   of file systems being actively used.
   It may also hide actual migrations from clients unable to deal with
   them by acting as a proxy for a migrated file system for particular
   clients, at some cost in performance, although it is not obligated
   to do so.  If the server will potentially perform a migration, it
   MUST set EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags.

   When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates
   that it wants the server to bind the stateid to the principal.  This
   means that when a principal creates a stateid, it has to be the one
   to use the stateid.  If the server will perform binding, it will
   return EXCHGID4_FLAG_BIND_PRINC_STATEID.  The server MAY return
   EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request
   it.  If an update to the client ID changes the value of
   EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect
   applies only to new stateids.  Existing stateids (and all stateids
   with the same "other" field) that were created with stateid to
   principal binding in force will continue to have binding in force.
   Existing stateids (and all stateids with the same "other" field)
   that were created with stateid to principal binding not in force
   will continue to have binding not in force.

   The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and
   EXCHGID4_FLAG_USE_PNFS_DS bits are described in Section 13.1 and
   convey roles the client ID is to be used for in a pNFS environment.
   The server MUST set one of the acceptable combinations of these bits
   (roles) in eir_flags, as specified in that section.  Note that the
   same client owner/server owner pair can have multiple roles.
   Multiple roles can be associated with the same client ID or with
   different client IDs.  Thus, if a client sends EXCHANGE_ID from the
   same client owner to the same server owner multiple times, but
   specifies different pNFS roles each time, the server might return
   different client IDs.  Given that different pNFS roles might have
   different client IDs, the client may ask for different properties
   for each role/client ID.

   The spa_how field of the eia_state_protect field specifies how the
   client wants to protect its client, locking, and session states from
   unauthorized changes (Section 2.10.8.3):

   *  SP4_NONE.  The client does not request the NFSv4.1 server to
      enforce state protection.  The NFSv4.1 server MUST NOT enforce
      state protection for the returned client ID.

   *  SP4_MACH_CRED.  If spa_how is SP4_MACH_CRED, then the client MUST
      send the EXCHANGE_ID operation with RPCSEC_GSS as the security
      flavor, and with a service of RPC_GSS_SVC_INTEGRITY or
      RPC_GSS_SVC_PRIVACY.  If SP4_MACH_CRED is specified, then the
      client wants to use an RPCSEC_GSS-based machine credential to
      protect its state.  The server MUST note the principal the
      EXCHANGE_ID operation was sent with, and the GSS mechanism used.
      These notes collectively comprise the machine credential.

      After the client ID is confirmed, as long as the lease associated
      with the client ID is unexpired, a subsequent EXCHANGE_ID
      operation that uses the same eia_clientowner.co_owner as the
      first EXCHANGE_ID MUST also use the same machine credential as
      the first EXCHANGE_ID.  The server returns the same client ID for
      the subsequent EXCHANGE_ID as that returned from the first
      EXCHANGE_ID.

   *  SP4_SSV.  If spa_how is SP4_SSV, then the client MUST send the
      EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and
      with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY.
+ If SP4_SSV is specified, then the client wants to use the SSV to + protect its state. The server records the credential used in the + request as the machine credential (as defined above) for the + eia_clientowner.co_owner. The CREATE_SESSION operation that + confirms the client ID MUST use the same machine credential. + + When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides + two lists of operations (each expressed as a bitmap). The first list + is spo_must_enforce and consists of those operations the client MUST + send (subject to the server confirming the list of operations in the + result of EXCHANGE_ID) with the machine credential (if SP4_MACH_CRED + protection is specified) or the SSV-based credential (if SP4_SSV + protection is used). The client MUST send the operations with + RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or + RPC_GSS_SVC_PRIVACY security service. Typically, the first list of + operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, + DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The + client SHOULD NOT specify in this list any operations that require a + filehandle because the server's access policies MAY conflict with the + client's choice, and thus the client would then be unable to access a + subset of the server's namespace. + + Note that if SP4_SSV protection is specified, and the client + indicates that CREATE_SESSION must be protected with SP4_SSV, because + the SSV cannot exist without a confirmed client ID, the first + CREATE_SESSION MUST instead be sent using the machine credential, and + the server MUST accept the machine credential. + + There is a corresponding result, also called spo_must_enforce, of the + operations for which the server will require SP4_MACH_CRED or SP4_SSV + protection. Normally, the server's result equals the client's + argument, but the result MAY be different. If the client requests + one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, + DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID + }, then the result spo_must_enforce MUST include the operations the + client requested from that set. + + If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then + connection binding enforcement is enabled, and the client MUST use + the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV + protection is used) credential on calls to BIND_CONN_TO_SESSION. + + The second list is spo_must_allow and consists of those operations + the client wants to have the option of sending with the machine + credential or the SSV-based credential, even if the object the + operations are performed on is not owned by the machine or SSV + credential. + + The corresponding result, also called spo_must_allow, consists of the + operations the server will allow the client to use SP4_SSV or + SP4_MACH_CRED credentials with. Normally, the server's result equals + the client's argument, but the result MAY be different. + + The purpose of spo_must_allow is to allow clients to solve the + following conundrum. Suppose the client ID is confirmed with + EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the + RPCSEC_GSS credentials of a normal user. Now suppose the user's + credentials expire, and cannot be renewed (e.g., a Kerberos ticket + granting ticket expires, and the user has logged off and will not be + acquiring a new ticket granting ticket). 
   The client will be unable to send CLOSE without the user's
   credentials, which is to say the client has to either leave the
   state on the server or re-send EXCHANGE_ID with a new verifier to
   clear all state, that is, unless the client includes CLOSE on the
   list of operations in spo_must_allow and the server agrees.

   The SP4_SSV protection parameters also have:

   ssp_hash_algs:
      This is the set of algorithms the client supports for the purpose
      of computing the digests needed for the internal SSV GSS
      mechanism and for the SET_SSV operation.  Each algorithm is
      specified as an object identifier (OID).  The REQUIRED algorithms
      for a server are id-sha1, id-sha224, id-sha256, id-sha384, and
      id-sha512 [25].

      Due to known weaknesses in id-sha1, it is RECOMMENDED that the
      client specify at least one algorithm within ssp_hash_algs other
      than id-sha1.

      The algorithm the server selects among the set is indicated in
      spi_hash_alg, a field of spr_ssv_info.  The field spi_hash_alg is
      an index into the array ssp_hash_algs.  Because of the known
      weaknesses in id-sha1, it is RECOMMENDED that it not be selected
      by the server as long as ssp_hash_algs contains any other
      supported algorithm (a selection sketch follows Table 21).

      If the server does not support any of the offered algorithms, it
      returns NFS4ERR_HASH_ALG_UNSUPP.  If ssp_hash_algs is empty, the
      server MUST return NFS4ERR_INVAL.

   ssp_encr_algs:
      This is the set of algorithms the client supports for the purpose
      of providing privacy protection for the internal SSV GSS
      mechanism.  Each algorithm is specified as an OID.  The REQUIRED
      algorithm for a server is id-aes256-CBC.  The RECOMMENDED
      algorithms are id-aes192-CBC and id-aes128-CBC [26].  The
      selected algorithm is returned in spi_encr_alg, an index into
      ssp_encr_algs.  If the server does not support any of the offered
      algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP.  If ssp_encr_algs
      is empty, the server MUST return NFS4ERR_INVAL.  Note that due to
      previously stated requirements and recommendations on the
      relationships between key length and hash length, some
      combinations of RECOMMENDED and REQUIRED encryption algorithm and
      hash algorithm either SHOULD NOT or MUST NOT be used.  Table 21
      summarizes the illegal and discouraged combinations.

   ssp_window:
      This is the number of SSV versions the client wants the server to
      maintain (i.e., each successful call to SET_SSV produces a new
      version of the SSV).  If ssp_window is zero, the server MUST
      return NFS4ERR_INVAL.  The server responds with spi_window, which
      MUST NOT exceed ssp_window and MUST be at least one.  Any
      requests on the backchannel or fore channel that are using a
      version of the SSV that is outside the window will fail with an
      ONC RPC authentication error, and the requester will have to
      retry them with the same slot ID and sequence ID.

   ssp_num_gss_handles:
      This is the number of RPCSEC_GSS handles the server should create
      that are based on the GSS SSV mechanism (see Section 2.10.9).  It
      is not the total number of RPCSEC_GSS handles for the client ID.
      Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS
      handles.  The server responds with a list of handles in
      spi_handles.  If the client asks for at least one handle and the
      server cannot create it, the server MUST return an error.  The
      handles in spi_handles are not available for use until the client
      ID is confirmed, which could be immediately if EXCHANGE_ID
      returns EXCHGID4_FLAG_CONFIRMED_R, or upon successful
      confirmation from CREATE_SESSION.

      While a client ID can span all the connections that are connected
      to a server sharing the same eir_server_owner.so_major_id, the
      RPCSEC_GSS handles returned in spi_handles can only be used on
      connections connected to a server that returns the same
      eir_server_owner.so_major_id and eir_server_owner.so_minor_id on
      each connection.  It is permissible for the client to set
      ssp_num_gss_handles to zero; the client can create more handles
      with another EXCHANGE_ID call.

      Because each SSV RPCSEC_GSS handle shares a common SSV GSS
      context, there are security considerations specific to this
      situation discussed in Section 2.10.10.

      The seq_window (see Section 5.2.3.1 of RFC 2203 [4]) of each
      RPCSEC_GSS handle in spi_handles MUST be the same as the
      seq_window of the RPCSEC_GSS handle used for the credential of
      the RPC request of which the EXCHANGE_ID operation was sent as a
      part.

   +======================+===========================+===============+
   | Encryption Algorithm | MUST NOT be combined with | SHOULD NOT be |
   |                      |                           | combined with |
   +======================+===========================+===============+
   | id-aes128-CBC        |                           | id-sha384,    |
   |                      |                           | id-sha512     |
   +----------------------+---------------------------+---------------+
   | id-aes192-CBC        | id-sha1                   | id-sha512     |
   +----------------------+---------------------------+---------------+
   | id-aes256-CBC        | id-sha1, id-sha224        |               |
   +----------------------+---------------------------+---------------+

                               Table 21
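   One way a server might implement the selection rules above is
   sketched below.  The helpers oid_equal() and combination_forbidden()
   (the latter encoding the MUST NOT pairs of Table 21), as well as the
   id_sha1 constant, are assumptions of this sketch, not part of the
   protocol.

   /* Illustrative choice of spi_hash_alg: prefer any offered hash
    * other than id-sha1, and skip combinations that Table 21 forbids
    * for the already-selected encryption algorithm. */
   static int
   choose_hash_alg(const sec_oid4 *hash_algs, unsigned nalgs,
                   const sec_oid4 *chosen_encr)
   {
           int fallback = -1;

           for (unsigned i = 0; i < nalgs; i++) {
                   if (combination_forbidden(chosen_encr, &hash_algs[i]))
                           continue;      /* a MUST NOT pair from Table 21 */
                   if (oid_equal(&hash_algs[i], &id_sha1)) {
                           fallback = (int)i; /* usable but discouraged */
                           continue;
                   }
                   return (int)i;         /* first acceptable non-SHA-1 offer */
           }
           return fallback;  /* -1: caller returns NFS4ERR_HASH_ALG_UNSUPP */
   }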
   The arguments include an array of up to one element in length called
   eia_client_impl_id.  If eia_client_impl_id is present, it contains
   the information identifying the implementation of the client.
   Similarly, the results include an array of up to one element in
   length called eir_server_impl_id that identifies the implementation
   of the server.  Servers MUST accept a zero-length eia_client_impl_id
   array, and clients MUST accept a zero-length eir_server_impl_id
   array.

   A possible use for implementation identifiers would be in diagnostic
   software that extracts this information in an attempt to identify
   interoperability problems, performance workload behaviors, or
   general usage statistics.  Since the intent of having access to this
   information is for planning or general diagnosis only, the client
   and server MUST NOT interpret this implementation identity
   information in a way that affects how the implementation interacts
   with its peer.  The client and server are not allowed to depend on
   the peer's manifesting a particular allowed behavior based on an
   implementation identifier but are required to interoperate as
   specified elsewhere in the protocol specification.

   Because it is possible that some implementations might violate the
   protocol specification and interpret the identity information,
   implementations MUST provide facilities to allow the NFSv4 client
   and server to be configured to set the contents of the nfs_impl_id
   structures sent to any specified value.

18.35.4.  IMPLEMENTATION

   A server's client record is a 5-tuple:

   1.  co_ownerid:

       The client identifier string, from the eia_clientowner structure
       of the EXCHANGE_ID4args structure.

   2.
       co_verifier:

       A client-specific value used to indicate incarnations (where a
       client restart represents a new incarnation), from the
       eia_clientowner structure of the EXCHANGE_ID4args structure.

   3.  principal:

       The principal that was defined in the RPC header's credential
       and/or verifier at the time the client record was established.

   4.  client ID:

       The shorthand client identifier, generated by the server and
       returned via the eir_clientid field in the EXCHANGE_ID4resok
       structure.

   5.  confirmed:

       A private field on the server indicating whether or not a client
       record has been confirmed.  A client record is confirmed if
       there has been a successful CREATE_SESSION operation to confirm
       it.  Otherwise, it is unconfirmed.  An unconfirmed record is
       established by an EXCHANGE_ID call.  Any unconfirmed record that
       is not confirmed within a lease period SHOULD be removed.

   The following identifiers represent special values for the fields in
   the records.

   ownerid_arg:
      The value of the eia_clientowner.co_ownerid subfield of the
      EXCHANGE_ID4args structure of the current request.

   verifier_arg:
      The value of the eia_clientowner.co_verifier subfield of the
      EXCHANGE_ID4args structure of the current request.

   old_verifier_arg:
      A value of the eia_clientowner.co_verifier field of a client
      record received in a previous request; this is distinct from
      verifier_arg.

   principal_arg:
      The value of the RPCSEC_GSS principal for the current request.

   old_principal_arg:
      A value of the principal of a client record as defined by the RPC
      header's credential or verifier of a previous request.  This is
      distinct from principal_arg.

   clientid_ret:
      The value of the eir_clientid field the server will return in the
      EXCHANGE_ID4resok structure for the current request.

   old_clientid_ret:
      The value of the eir_clientid field the server returned in the
      EXCHANGE_ID4resok structure for a previous request.  This is
      distinct from clientid_ret.

   confirmed:
      The client ID has been confirmed.

   unconfirmed:
      The client ID has not been confirmed.

   Since EXCHANGE_ID is a non-idempotent operation, we must consider
   the possibility that retries occur as a result of a client restart,
   network partition, malfunctioning router, etc.  Retries are
   identified by the value of the eia_clientowner field of
   EXCHANGE_ID4args, and the method for dealing with them is outlined
   in the scenarios below; the update cases are also condensed in a
   short sketch following the list.

   The scenarios are described in terms of the client record(s) a
   server has for a given co_ownerid.  Note that if the client ID was
   created specifying SP4_SSV state protection and EXCHANGE_ID as one
   of the operations in spo_must_allow, then the server MUST authorize
   EXCHANGE_IDs with the SSV principal in addition to the principal
   that created the client ID.

   1.  New Owner ID

       If the server has no client records with
       eia_clientowner.co_ownerid matching ownerid_arg, and
       EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the EXCHANGE_ID,
       then a new shorthand client ID (let us call it clientid_ret) is
       generated, and the following unconfirmed record is added to the
       server's state.

       { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
         unconfirmed }

       Subsequently, the server returns clientid_ret.

   2.
Non-Update on Existing Client ID + + If the server has the following confirmed record, and the request + does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set, then the + request is the result of a retried request due to a faulty router + or lost connection, or the client is trying to determine if it + can perform trunking. + + { ownerid_arg, verifier_arg, principal_arg, clientid_ret, + confirmed } + + Since the record has been confirmed, the client must have + received the server's reply from the initial EXCHANGE_ID request. + Since the server has a confirmed record, and since + EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the possible + exception of eir_server_owner.so_minor_id, the server returns the + same result it did when the client ID's properties were last + updated (or if never updated, the result when the client ID was + created). The confirmed record is unchanged. + + 3. Client Collision + + If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the + server has the following confirmed record, then this request is + likely the result of a chance collision between the values of the + eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args for two + different clients. + + { ownerid_arg, *, old_principal_arg, old_clientid_ret, confirmed + } + + If there is currently no state associated with old_clientid_ret, + or if there is state but the lease has expired, then this case is + effectively equivalent to the New Owner ID case of + Section 18.35.4, Paragraph 7, Item 1. The confirmed record is + deleted, the old_clientid_ret and its lock state are deleted, a + new shorthand client ID is generated, and the following + unconfirmed record is added to the server's state. + + { ownerid_arg, verifier_arg, principal_arg, clientid_ret, + unconfirmed } + + Subsequently, the server returns clientid_ret. + + If old_clientid_ret has an unexpired lease with state, then no + state of old_clientid_ret is changed or deleted. The server + returns NFS4ERR_CLID_INUSE to indicate that the client should + retry with a different value for the eia_clientowner.co_ownerid + subfield of EXCHANGE_ID4args. The client record is not changed. + + 4. Replacement of Unconfirmed Record + + If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and the + server has the following unconfirmed record, then the client is + attempting EXCHANGE_ID again on an unconfirmed client ID, perhaps + due to a retry, a client restart before client ID confirmation + (i.e., before CREATE_SESSION was called), or some other reason. + + { ownerid_arg, *, *, old_clientid_ret, unconfirmed } + + It is possible that the properties of old_clientid_ret are + different than those specified in the current EXCHANGE_ID. + Whether or not the properties are being updated, to eliminate + ambiguity, the server deletes the unconfirmed record, generates a + new client ID (clientid_ret), and establishes the following + unconfirmed record: + + { ownerid_arg, verifier_arg, principal_arg, clientid_ret, + unconfirmed } + + 5. Client Restart + + If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the + server has the following confirmed client record, then this + request is likely from a previously confirmed client that has + restarted. 
       { ownerid_arg, old_verifier_arg, principal_arg,
         old_clientid_ret, confirmed }

       Since the previous incarnation of the same client will no longer
       be making requests, once the new client ID is confirmed by
       CREATE_SESSION, byte-range locks and share reservations should
       be released immediately rather than forcing the new incarnation
       to wait for the lease time on the previous incarnation to
       expire.  Furthermore, session state should be removed since if
       the client had maintained that information across restart, this
       request would not have been sent.  If the server supports
       neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_PREV_FH claim
       types, associated delegations should be purged as well;
       otherwise, delegations are retained and recovery proceeds
       according to Section 10.2.1.

       After processing, clientid_ret is returned to the client and
       this client record is added:

       { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
         unconfirmed }

       The previously described confirmed record continues to exist,
       and thus the same ownerid_arg exists in both a confirmed and
       unconfirmed state at the same time.  The number of states can
       collapse to one once the server receives an applicable
       CREATE_SESSION or EXCHANGE_ID.

       *  If the server subsequently receives a successful
          CREATE_SESSION that confirms clientid_ret, then the server
          atomically destroys the confirmed record and makes the
          unconfirmed record confirmed as described in Section 18.36.3.

       *  If the server instead subsequently receives an EXCHANGE_ID
          with the client owner equal to ownerid_arg, one strategy is
          to simply delete the unconfirmed record, and process the
          EXCHANGE_ID as described in the entirety of Section 18.35.4.

   6.  Update

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an attempt
       at an update.

       { ownerid_arg, verifier_arg, principal_arg, clientid_ret,
         confirmed }

       Since the record has been confirmed, the client must have
       received the server's reply from the initial EXCHANGE_ID
       request.  The server allows the update, and the client record is
       left intact.

   7.  Update but No Confirmed Record

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       no confirmed record corresponding to ownerid_arg, then the
       server returns NFS4ERR_NOENT and leaves any unconfirmed record
       intact.

   8.  Update but Wrong Verifier

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an illegal
       attempt at an update, perhaps because of a retry from a previous
       client incarnation.

       { ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed }

       The server returns NFS4ERR_NOT_SAME and leaves the client record
       intact.

   9.  Update but Wrong Principal

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has
       the following confirmed record, then this request is an illegal
       attempt at an update by an unauthorized principal.

       { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret,
         confirmed }

       The server returns NFS4ERR_PERM and leaves the client record
       intact.
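   The update-path checks above (items 6 through 9) can be condensed
   into a short, non-normative sketch.  struct client_rec and the
   boolean comparisons are placeholders for an implementation's record
   store, not part of the protocol.

   #include <stdbool.h>

   struct client_rec;          /* server's record store (assumed) */

   /* Illustrative outcome selection for an EXCHANGE_ID carrying
    * EXCHGID4_FLAG_UPD_CONFIRMED_REC_A, following items 6-9 above. */
   static nfsstat4
   exchange_id_update_outcome(const struct client_rec *confirmed,
                              bool verifier_matches,
                              bool principal_matches)
   {
           if (confirmed == NULL)
                   return NFS4ERR_NOENT;    /* item 7: no confirmed record */
           if (!verifier_matches)
                   return NFS4ERR_NOT_SAME; /* item 8: wrong verifier */
           if (!principal_matches)
                   return NFS4ERR_PERM;     /* item 9: wrong principal */
           return NFS4_OK;                  /* item 6: update allowed */
   }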
18.36.  Operation 43: CREATE_SESSION - Create New Session and Confirm
        Client ID

18.36.1.  ARGUMENT

   struct channel_attrs4 {
           count4                  ca_headerpadsize;
           count4                  ca_maxrequestsize;
           count4                  ca_maxresponsesize;
           count4                  ca_maxresponsesize_cached;
           count4                  ca_maxoperations;
           count4                  ca_maxrequests;
           uint32_t                ca_rdma_ird<1>;
   };

   const CREATE_SESSION4_FLAG_PERSIST        = 0x00000001;
   const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002;
   const CREATE_SESSION4_FLAG_CONN_RDMA      = 0x00000004;

   struct CREATE_SESSION4args {
           clientid4               csa_clientid;
           sequenceid4             csa_sequence;

           uint32_t                csa_flags;

           channel_attrs4          csa_fore_chan_attrs;
           channel_attrs4          csa_back_chan_attrs;

           uint32_t                csa_cb_program;
           callback_sec_parms4     csa_sec_parms<>;
   };

18.36.2.  RESULT

   struct CREATE_SESSION4resok {
           sessionid4              csr_sessionid;
           sequenceid4             csr_sequence;

           uint32_t                csr_flags;

           channel_attrs4          csr_fore_chan_attrs;
           channel_attrs4          csr_back_chan_attrs;
   };

   union CREATE_SESSION4res switch (nfsstat4 csr_status) {
   case NFS4_OK:
           CREATE_SESSION4resok    csr_resok4;
   default:
           void;
   };

18.36.3.  DESCRIPTION

   This operation is used by the client to create new session objects
   on the server.

   CREATE_SESSION can be sent with or without a preceding SEQUENCE
   operation in the same COMPOUND procedure.  If CREATE_SESSION is sent
   with a preceding SEQUENCE operation, any session created by
   CREATE_SESSION has no direct relation to the session specified in
   the SEQUENCE operation, although the two sessions might be
   associated with the same client ID.  If CREATE_SESSION is sent
   without a preceding SEQUENCE, then it MUST be the only operation in
   the COMPOUND procedure's request.  If it is not, the server MUST
   return NFS4ERR_NOT_ONLY_OP.

   In addition to creating a session, CREATE_SESSION has the following
   effects:

   *  The first session created with a new client ID serves to confirm
      the creation of that client's state on the server.  The server
      returns the parameter values for the new session.

   *  The connection that CREATE_SESSION is sent over is associated
      with the session's fore channel.

   The arguments and results of CREATE_SESSION are described as
   follows:

   csa_clientid:  This is the client ID with which the new session will
      be associated.  The corresponding result is csr_sessionid, the
      session ID of the new session.

   csa_sequence:  Each client ID serializes CREATE_SESSION via a per-
      client ID sequence number (see Section 18.36.4).  The
      corresponding result is csr_sequence, which MUST be equal to
      csa_sequence.

   In the next three arguments, the client offers a value that is to be
   a property of the session.  Except where stated otherwise, it is
   RECOMMENDED that the server accept the value.  If it is not
   acceptable, the server MAY use a different value.  Regardless, the
   server MUST return the value the session will use (which will be
   either what the client offered, or what the server is insisting on)
   to the client.

   csa_flags:  The csa_flags field contains a list of the following
      flag bits:

      CREATE_SESSION4_FLAG_PERSIST:
         If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the
         server to provide a persistent reply cache.  For sessions in
         which only idempotent operations will be used (e.g., a read-
         only session), clients SHOULD NOT set
         CREATE_SESSION4_FLAG_PERSIST.  If the server does not or
         cannot provide a persistent reply cache, the server MUST NOT
         set CREATE_SESSION4_FLAG_PERSIST in the field csr_flags.
         If the server is a pNFS metadata server, for reasons described
         in Section 12.5.2 it SHOULD support
         CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint
         (Section 5.12.4) attribute.

      CREATE_SESSION4_FLAG_CONN_BACK_CHAN:
         If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags,
         the client is requesting that the connection over which the
         CREATE_SESSION operation arrived be associated with the
         session's backchannel in addition to its fore channel.  If the
         server agrees, it sets CREATE_SESSION4_FLAG_CONN_BACK_CHAN in
         the result field csr_flags.  If
         CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in csa_flags,
         then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in
         csr_flags.

      CREATE_SESSION4_FLAG_CONN_RDMA:
         If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if
         the connection over which the CREATE_SESSION operation arrived
         is currently in non-RDMA mode but has the capability to
         operate in RDMA mode, then the client is requesting that the
         server "step up" to RDMA mode on the connection.  If the
         server agrees, it sets CREATE_SESSION4_FLAG_CONN_RDMA in the
         result field csr_flags.  If CREATE_SESSION4_FLAG_CONN_RDMA is
         not set in csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST
         NOT be set in csr_flags.  Note that once the server agrees to
         step up, it and the client MUST exchange all future traffic on
         the connection with RPC RDMA framing and not Record Marking
         ([32]).

   csa_fore_chan_attrs, csa_back_chan_attrs:  The csa_fore_chan_attrs
      and csa_back_chan_attrs fields apply to attributes of the fore
      channel (which conveys requests originating from the client to
      the server), and the backchannel (the channel that conveys
      callback requests originating from the server to the client),
      respectively.  The results are in corresponding structures called
      csr_fore_chan_attrs and csr_back_chan_attrs.  The results
      establish attributes for each channel, and on all subsequent use
      of each channel of the session.  Each structure has the following
      fields:

      ca_headerpadsize:
         The maximum amount of padding the requester is willing to
         apply to ensure that write payloads are aligned on some
         boundary at the replier.  For each channel, the server

         *  will reply in ca_headerpadsize with its preferred value, or
            zero if padding is not in use, and

         *  MAY decrease this value but MUST NOT increase it.

      ca_maxrequestsize:
         The maximum size of a COMPOUND or CB_COMPOUND request that
         will be sent.  This size represents the XDR encoded size of
         the request, including the RPC headers (including security
         flavor credentials and verifiers) but excluding any RPC
         transport framing headers.  Imagine a request coming over a
         non-RDMA TCP/IP connection, and that it has a single Record
         Marking header preceding it.  The maximum allowable count
         encoded in the header will be ca_maxrequestsize.  If a
         requester sends a request that exceeds ca_maxrequestsize, the
         error NFS4ERR_REQ_TOO_BIG will be returned per the description
         in Section 2.10.6.4.  For each channel, the server MAY
         decrease this value but MUST NOT increase it.

      ca_maxresponsesize:
         The maximum size of a COMPOUND or CB_COMPOUND reply that the
         requester will accept from the replier including RPC headers
         (see the ca_maxrequestsize definition).  For each channel, the
         server MAY decrease this value, but MUST NOT increase it.
         However, if the client selects a value for ca_maxresponsesize
         such that a replier on a channel could never send a response,
         the server SHOULD return NFS4ERR_TOOSMALL in the
         CREATE_SESSION reply.  After the session is created, if a
         requester sends a request for which the size of the reply
         would exceed this value, the replier will return
         NFS4ERR_REP_TOO_BIG, per the description in Section 2.10.6.4.

      ca_maxresponsesize_cached:
         Like ca_maxresponsesize, but the maximum size of a reply that
         will be stored in the reply cache (Section 2.10.6.1).  For
         each channel, the server MAY decrease this value, but MUST NOT
         increase it.  If, in the reply to CREATE_SESSION, the value of
         ca_maxresponsesize_cached of a channel is less than the value
         of ca_maxresponsesize of the same channel, then this is an
         indication to the requester that it needs to be selective
         about which replies it directs the replier to cache; for
         example, large replies from non-idempotent operations (e.g.,
         COMPOUND requests with a READ operation) should not be cached.
         The requester decides which replies to cache via an argument
         to the SEQUENCE (the sa_cachethis field, see Section 18.46) or
         CB_SEQUENCE (the csa_cachethis field, see Section 20.9)
         operations.  After the session is created, if a requester
         sends a request for which the size of the reply would exceed
         ca_maxresponsesize_cached, the replier will return
         NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in
         Section 2.10.6.4.

      ca_maxoperations:
         The maximum number of operations the replier will accept in a
         COMPOUND or CB_COMPOUND.  For the backchannel, the server MUST
         NOT change the value the client offers.  For the fore channel,
         the server MAY change the requested value.  After the session
         is created, if a requester sends a COMPOUND or CB_COMPOUND
         with more operations than ca_maxoperations, the replier MUST
         return NFS4ERR_TOO_MANY_OPS.

      ca_maxrequests:
         The maximum number of concurrent COMPOUND or CB_COMPOUND
         requests the requester will send on the session.  Subsequent
         requests will each be assigned a slot identifier by the
         requester within the range zero to ca_maxrequests - 1
         inclusive.  For the backchannel, the server MUST NOT change
         the value the client offers.  For the fore channel, the server
         MAY change the requested value.

      ca_rdma_ird:
         This array has a maximum of one element.  If this array has
         one element, then the element contains the inbound RDMA read
         queue depth (IRD).  For each channel, the server MAY decrease
         this value, but MUST NOT increase it.

   csa_cb_program:  This is the ONC RPC program number the server MUST
      use in any callbacks sent through the backchannel to the client.
      The server MUST specify an ONC RPC program number equal to
      csa_cb_program and an ONC RPC version number equal to 4 in
      callbacks sent to the client.  If a CB_COMPOUND is sent to the
      client, the server MUST use a minor version number of 1.  There
      is no corresponding result.

   csa_sec_parms:  The field csa_sec_parms is an array of acceptable
      security credentials the server can use on the session's
      backchannel.  Three security flavors are supported: AUTH_NONE,
      AUTH_SYS, and RPCSEC_GSS.  If AUTH_NONE is specified for a
      credential, then this says the client is authorizing the server
      to use AUTH_NONE on all callbacks for the session.  If AUTH_SYS
      is specified, then the client is authorizing the server to use
      AUTH_SYS on all callbacks, using the credential specified in
      cbsp_sys_cred.
      If RPCSEC_GSS is specified, then the server is allowed to use the
      RPCSEC_GSS context specified in cbsp_gss_parms as the RPCSEC_GSS
      context in the credential of the RPC header of callbacks to the
      client.  There is no corresponding result.

      The RPCSEC_GSS context for the backchannel is specified via a
      pair of values of data type gsshandle4_t.  The data type
      gsshandle4_t represents an RPCSEC_GSS handle, and is precisely
      the same as the data type of the "handle" field of the
      rpc_gss_init_res data type defined in "Context Creation Response
      - Successful Acceptance", Section 5.2.3.1 of [4].

      The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore
      handle the server returned to the client (either in the handle
      field of data type rpc_gss_init_res or as one of the elements of
      the spi_handles field returned in the reply to EXCHANGE_ID) when
      the RPCSEC_GSS context was created on the server.  The second
      handle, gcbp_handle_from_client, is the back handle to which the
      client will map the RPCSEC_GSS context.  The server can
      immediately use the value of gcbp_handle_from_client in the
      RPCSEC_GSS credential in callback RPCs.  That is, the value in
      gcbp_handle_from_client can be used as the value of the field
      "handle" in data type rpc_gss_cred_t (see "Elements of the
      RPCSEC_GSS Security Protocol", Section 5 of [4]) in callback
      RPCs.  The server MUST use the RPCSEC_GSS security service
      specified in gcbp_service, i.e., it MUST set the "service" field
      of the rpc_gss_cred_t data type in the RPCSEC_GSS credential to
      the value of gcbp_service (see "RPC Request Header",
      Section 5.3.1 of [4]).

      If the RPCSEC_GSS handle identified by gcbp_handle_from_server
      does not exist on the server, the server will return
      NFS4ERR_NOENT.

      Within each element of csa_sec_parms, the fore and back
      RPCSEC_GSS contexts MUST share the same GSS context and MUST have
      the same seq_window (see Section 5.2.3.1 of RFC 2203 [4]).  The
      fore and back RPCSEC_GSS context state are independent of each
      other as far as the RPCSEC_GSS sequence number is concerned (see
      the seq_num field in the rpc_gss_cred_t data type of Sections 5
      and 5.3.1 of [4]).

      If an RPCSEC_GSS handle is using the SSV context (see
      Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a
      common SSV GSS context, there are security considerations
      specific to this situation discussed in Section 2.10.10.

   Once the session is created, the first SEQUENCE or CB_SEQUENCE
   received on a slot MUST have a sequence ID equal to 1; if not, the
   replier MUST return NFS4ERR_SEQ_MISORDERED.

18.36.4.  IMPLEMENTATION

   To describe a possible implementation, the same notation for client
   records introduced in the description of EXCHANGE_ID is used with
   the following addition:

   clientid_arg:  The value of the csa_clientid field of the
      CREATE_SESSION4args structure of the current request.

   Since CREATE_SESSION is a non-idempotent operation, we need to
   consider the possibility that retries may occur as a result of a
   client restart, network partition, malfunctioning router, etc.  For
   each client ID created by EXCHANGE_ID, the server maintains a
   separate reply cache (called the CREATE_SESSION reply cache) similar
   to the session reply cache used for SEQUENCE operations, with two
   distinctions.

   *  First, this is a reply cache just for detecting and processing
      CREATE_SESSION requests for a given client ID.

   *  Second, the size of the client ID reply cache is one slot (and as
      a result, the CREATE_SESSION request does not carry a slot
      number).  This means that at most one CREATE_SESSION request for
      a given client ID can be outstanding.
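   A minimal representation of this one-slot cache might look as
   follows; the field and type names are illustrative, not mandated by
   the protocol.

   /* Illustrative per-client-ID CREATE_SESSION reply cache: a single
    * slot holding the last sequence ID processed and the reply to
    * return on replay. */
   struct create_session_slot {
           sequenceid4        cs_sequence;   /* last csa_sequence seen */
           CREATE_SESSION4res cs_cached_res; /* reply replayed on retry */
   };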
   As previously stated, CREATE_SESSION can be sent with or without a
   preceding SEQUENCE operation.  Even if a SEQUENCE precedes
   CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply
   cache, which is separate from the reply cache for the session
   associated with a SEQUENCE.  If CREATE_SESSION was originally sent
   by itself, the client MAY send a retry of the CREATE_SESSION
   operation within a COMPOUND preceded by a SEQUENCE.  If
   CREATE_SESSION was originally sent in a COMPOUND that started with a
   SEQUENCE, then the client SHOULD send a retry in a COMPOUND that
   starts with a SEQUENCE that has the same session ID as the SEQUENCE
   of the original request.  However, the client MAY send a retry in a
   COMPOUND that either has no preceding SEQUENCE, or has a preceding
   SEQUENCE that refers to a different session than the original
   CREATE_SESSION.  This might be necessary if the client sends a
   CREATE_SESSION in a COMPOUND preceded by a SEQUENCE with session ID
   X, and session X no longer exists.  Regardless, any retry of
   CREATE_SESSION, with or without a preceding SEQUENCE, MUST use the
   same value of csa_sequence as the original.

   After the client has received a reply to an EXCHANGE_ID operation
   that contains a new, unconfirmed client ID, the server expects the
   client to follow with a CREATE_SESSION operation to confirm the
   client ID.  The server expects the value of csa_sequenceid in the
   arguments to that CREATE_SESSION to be equal to the value of the
   field eir_sequenceid that was returned in the results of the
   EXCHANGE_ID that returned the unconfirmed client ID.  Before the
   server replies to that EXCHANGE_ID operation, it initializes the
   client ID slot to be equal to eir_sequenceid - 1 (accounting for
   underflow), and records a contrived CREATE_SESSION result with a
   "cached" result of NFS4ERR_SEQ_MISORDERED.  With the client ID slot
   thus initialized, the processing of the CREATE_SESSION operation is
   divided into four phases:

   1.  Client record look up.  The server looks up the client ID in its
       client record table.  If the server has no records with client
       ID equal to clientid_arg, then most likely the client's state
       has been purged during a period of inactivity, possibly due to a
       loss of connectivity.  NFS4ERR_STALE_CLIENTID is returned, and
       no changes are made to any client records on the server.
       Otherwise, the server goes to phase 2.

   2.  Sequence ID processing.  If csa_sequenceid is equal to the
       sequence ID in the client ID's slot, then this is a replay of
       the previous CREATE_SESSION request, and the server returns the
       cached result.  If csa_sequenceid is not equal to the sequence
       ID in the slot, and is more than one greater (accounting for
       wraparound), then the server returns the error
       NFS4ERR_SEQ_MISORDERED, and does not change the slot.  If
       csa_sequenceid is equal to the slot's sequence ID + 1
       (accounting for wraparound), then the slot's sequence ID is set
       to csa_sequenceid, and the CREATE_SESSION processing goes to the
       next phase.  A subsequent new CREATE_SESSION call over the same
       client ID MUST use a csa_sequenceid that is one greater than the
       sequence ID in the slot; a sketch of this check appears at the
       end of this section.

   3.  Client ID confirmation.
       If this would be the first session for the client ID, the
       CREATE_SESSION operation serves to confirm the client ID.
       Otherwise, the client ID confirmation phase is skipped and only
       the session creation phase occurs.  Any case in which there is
       more than one record with identical values for client ID
       represents a server implementation error.  Operation in the
       potentially valid cases is summarized as follows.

       *  Successful Confirmation

          If the server has the following unconfirmed record, then this
          is the expected confirmation of an unconfirmed record.

          { ownerid, verifier, principal_arg, clientid_arg,
            unconfirmed }

          As noted in Section 18.35.4, the server might also have the
          following confirmed record.

          { ownerid, old_verifier, principal_arg, old_clientid,
            confirmed }

          The server schedules the replacement of both records with:

          { ownerid, verifier, principal_arg, clientid_arg, confirmed }

          The processing of CREATE_SESSION continues on to session
          creation.  Once the session is successfully created, the
          scheduled client record replacement is committed.  If the
          session is not successfully created, then no changes are made
          to any client records on the server.

       *  Unsuccessful Confirmation

          If the server has the following record, then the client has
          changed principals after the previous EXCHANGE_ID request, or
          there has been a chance collision between shorthand client
          identifiers.

          { *, *, old_principal_arg, clientid_arg, * }

          Neither of these cases is permissible.  Processing stops and
          NFS4ERR_CLID_INUSE is returned to the client.  No changes are
          made to any client records on the server.

   4.  Session creation.  The server confirmed the client ID, either in
       this CREATE_SESSION operation, or a previous CREATE_SESSION
       operation.  The server examines the remaining fields of the
       arguments.

       The server creates the session by recording the parameter values
       used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is
       set and has been accepted by the server) and allocating space
       for the session reply cache (if there is not enough space, the
       server returns NFS4ERR_NOSPC).  For each slot in the reply
       cache, the server sets the sequence ID to zero, and records an
       entry containing a COMPOUND reply with zero operations and the
       error NFS4ERR_SEQ_MISORDERED.  This way, if the first SEQUENCE
       request sent has a sequence ID equal to zero, the server can
       simply return what is in the reply cache:
       NFS4ERR_SEQ_MISORDERED.  The client initializes its reply cache
       for receiving callbacks in the same way, and similarly, the
       first CB_SEQUENCE operation on a slot after session creation
       MUST have a sequence ID of one.

       If the session state is created successfully, the server
       associates the session with the client ID provided by the
       client.

       When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs
       to be retried, the retry MUST be done on a new connection that
       is in non-RDMA mode.  If properties of the new connection are
       different enough that the arguments to CREATE_SESSION need to
       change, then a non-retry MUST be sent.  The server will
       eventually dispose of any session that was created on the
       original connection.

   On the backchannel, the client and server might wish to have many
   slots, in some cases perhaps more than on the fore channel, in order
   to deal with the situations where the network link has high latency
   and is the primary bottleneck for response to recalls.  If so, and
   if the client provides too few slots to the backchannel, the server
   might limit the number of recallable objects it gives to the client.
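   The sequence-ID processing of phase 2 above, applied to the one-slot
   cache sketched earlier in this section, might look as follows.
   seq_next(), modeling increment with wraparound, is an assumption of
   this sketch.

   /* Illustrative phase-2 check: a replay returns the cached reply,
    * the successor sequence ID advances the slot, and anything else
    * is misordered. */
   static nfsstat4
   create_session_phase2(struct create_session_slot *slot,
                         sequenceid4 csa_sequence,
                         CREATE_SESSION4res *reply)
   {
           if (csa_sequence == slot->cs_sequence) {
                   *reply = slot->cs_cached_res;      /* replay */
                   return reply->csr_status;
           }
           if (csa_sequence != seq_next(slot->cs_sequence))
                   return NFS4ERR_SEQ_MISORDERED;     /* gap or regression */
           slot->cs_sequence = csa_sequence;          /* proceed to phase 3 */
           return NFS4_OK;
   }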
   Implementing RPCSEC_GSS callback support requires changes to both
   the client and server implementations of RPCSEC_GSS.  One possible
   set of changes includes:

   *  Adding a data structure that wraps the GSS-API context with a
      reference count.

   *  Adding new functions to increment and decrement the reference
      count.  If the reference count is decremented to zero, the
      wrapper data structure and the GSS-API context it refers to would
      be freed.

   *  Changing RPCSEC_GSS to create the wrapper data structure upon
      receiving a GSS-API context from gss_accept_sec_context() and
      gss_init_sec_context().  The reference count would be initialized
      to 1.

   *  Adding a function to map an existing RPCSEC_GSS handle to a
      pointer to the wrapper data structure.  The reference count would
      be incremented.

   *  Adding a function to create a new RPCSEC_GSS handle from a
      pointer to the wrapper data structure.  The reference count would
      be incremented.

   *  Replacing calls from RPCSEC_GSS that free GSS-API contexts, with
      calls to decrement the reference count on the wrapper data
      structure.

18.37.  Operation 44: DESTROY_SESSION - Destroy a Session

18.37.1.  ARGUMENT

   struct DESTROY_SESSION4args {
           sessionid4      dsa_sessionid;
   };

18.37.2.  RESULT

   struct DESTROY_SESSION4res {
           nfsstat4        dsr_status;
   };

18.37.3.  DESCRIPTION

   The DESTROY_SESSION operation closes the session and discards the
   session's reply cache, if any.  Any remaining connections associated
   with the session are immediately disassociated.  If the connection
   has no remaining associated sessions, the connection MAY be closed
   by the server.  Locks, delegations, layouts, wants, and the lease,
   which are all tied to the client ID, are not affected by
   DESTROY_SESSION.

   DESTROY_SESSION MUST be invoked on a connection that is associated
   with the session being destroyed.  In addition, if SP4_MACH_CRED
   state protection was specified when the client ID was created, the
   RPCSEC_GSS principal that created the session MUST be the one that
   destroys the session, using RPCSEC_GSS privacy or integrity.  If
   SP4_SSV state protection was specified when the client ID was
   created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be
   used, with integrity or privacy.

   If the COMPOUND request starts with SEQUENCE, and if the sessionids
   specified in SEQUENCE and DESTROY_SESSION are the same, then

   *  DESTROY_SESSION MUST be the final operation in the COMPOUND
      request.

   *  It is advisable to avoid placing DESTROY_SESSION in a COMPOUND
      request with other state-modifying operations, because the
      DESTROY_SESSION will destroy the reply cache.

   *  Because the session and its reply cache are destroyed, a client
      that retries the request may receive an error in reply to the
      retry, even though the original request was successful.

   If the COMPOUND request starts with SEQUENCE, and if the sessionids
   specified in SEQUENCE and DESTROY_SESSION are different, then
   DESTROY_SESSION can appear in any position of the COMPOUND request
   (except for the first position).  The two sessionids can belong to
   different client IDs.

   If the COMPOUND request does not start with SEQUENCE, and if
   DESTROY_SESSION is not the sole operation, then the server MUST
   return NFS4ERR_NOT_ONLY_OP.
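   The placement rules above can be summarized in a short validation
   sketch.  Note that the error chosen for a misplaced DESTROY_SESSION
   under a matching SEQUENCE is an assumption of this sketch, since the
   text above only mandates NFS4ERR_NOT_ONLY_OP for the standalone
   case.

   #include <stdbool.h>

   /* Illustrative placement check for DESTROY_SESSION in a COMPOUND. */
   static nfsstat4
   destroy_session_placement(bool starts_with_sequence,
                             bool same_sessionid,
                             unsigned op_index, unsigned num_ops)
   {
           if (!starts_with_sequence)   /* must be the sole operation */
                   return num_ops == 1 ? NFS4_OK : NFS4ERR_NOT_ONLY_OP;
           if (same_sessionid && op_index != num_ops - 1)
                   return NFS4ERR_INVAL; /* assumed error: not mandated above */
           /* Different session: any position is allowed except the
            * first, which is occupied by SEQUENCE here anyway. */
           return NFS4_OK;
   }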
   If there is a backchannel on the session and the server has
   outstanding CB_COMPOUND operations for the session which have not
   been replied to, then the server MAY refuse to destroy the session
   and return an error.  If so, then in the event the backchannel is
   down, the server SHOULD return NFS4ERR_CB_PATH_DOWN to inform the
   client that the backchannel needs to be repaired before the server
   will allow the session to be destroyed.  Otherwise, the error
   NFS4ERR_BACK_CHAN_BUSY SHOULD be returned to indicate that there are
   CB_COMPOUNDs that need to be replied to.  The client SHOULD reply to
   all outstanding CB_COMPOUNDs before re-sending DESTROY_SESSION.

18.38.  Operation 45: FREE_STATEID - Free Stateid with No Locks

18.38.1.  ARGUMENT

   struct FREE_STATEID4args {
           stateid4        fsa_stateid;
   };

18.38.2.  RESULT

   struct FREE_STATEID4res {
           nfsstat4        fsr_status;
   };

18.38.3.  DESCRIPTION

   The FREE_STATEID operation is used to free a stateid that no longer
   has any associated locks (including opens, byte-range locks,
   delegations, and layouts).  This may be because of client LOCKU
   operations or because of server revocation.  If there are valid
   locks (of any kind) associated with the stateid in question, the
   error NFS4ERR_LOCKS_HELD will be returned, and the associated
   stateid will not be freed.

   When a stateid is freed that had been associated with revoked locks,
   by sending the FREE_STATEID operation, the client acknowledges the
   loss of those locks.  This allows the server, once all such revoked
   state is acknowledged, to allow that client again to reclaim locks,
   without encountering the edge conditions discussed in Section 8.4.2.

   Once a successful FREE_STATEID is done for a given stateid, any
   subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID
   error.

18.39.  Operation 46: GET_DIR_DELEGATION - Get a Directory Delegation

18.39.1.  ARGUMENT

   typedef nfstime4 attr_notice4;

   struct GET_DIR_DELEGATION4args {
           /* CURRENT_FH: delegated directory */
           bool            gdda_signal_deleg_avail;
           bitmap4         gdda_notification_types;
           attr_notice4    gdda_child_attr_delay;
           attr_notice4    gdda_dir_attr_delay;
           bitmap4         gdda_child_attributes;
           bitmap4         gdda_dir_attributes;
   };

18.39.2.  RESULT

   struct GET_DIR_DELEGATION4resok {
           verifier4       gddr_cookieverf;
           /* Stateid for get_dir_delegation */
           stateid4        gddr_stateid;
           /* Which notifications can the server support */
           bitmap4         gddr_notification;
           bitmap4         gddr_child_attributes;
           bitmap4         gddr_dir_attributes;
   };

   enum gddrnf4_status {
           GDD4_OK         = 0,
           GDD4_UNAVAIL    = 1
   };

   union GET_DIR_DELEGATION4res_non_fatal
   switch (gddrnf4_status gddrnf_status) {
   case GDD4_OK:
           GET_DIR_DELEGATION4resok        gddrnf_resok4;
   case GDD4_UNAVAIL:
           bool                            gddrnf_will_signal_deleg_avail;
   };

   union GET_DIR_DELEGATION4res
   switch (nfsstat4 gddr_status) {
   case NFS4_OK:
           GET_DIR_DELEGATION4res_non_fatal        gddr_res_non_fatal4;
   default:
           void;
   };

18.39.3.  DESCRIPTION

   The GET_DIR_DELEGATION operation is used by a client to request a
   directory delegation.  The directory is represented by the current
   filehandle.  The client also specifies whether it wants the server
   to notify it when the directory changes in certain ways by setting
   one or more bits in a bitmap.  The server may refuse to grant the
   delegation.  In that case, the server will return
   NFS4ERR_DIRDELEG_UNAVAIL.  If the server decides to hand out the
   delegation, it will return a cookie verifier for that directory.
If
+ the cookie verifier changes when the client is holding the
+ delegation, the delegation will be recalled unless the client has
+ asked for notification for this event.
+
+ The server will also return a directory delegation stateid,
+ gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This
+ stateid will appear in callback messages related to the delegation,
+ such as notifications and delegation recalls. The client will use
+ this stateid to return the delegation voluntarily or upon recall. A
+ delegation is returned by calling the DELEGRETURN operation.
+
+ The server might not be able to support notifications of certain
+ events. If the client asks for such notifications, the server MUST
+ inform the client of its inability to do so as part of the
+ GET_DIR_DELEGATION reply by not setting the appropriate bits in the
+ supported notifications bitmask, gddr_notification, contained in the
+ reply. The server MUST NOT add bits to gddr_notification that the
+ client did not request.
+
+ The GET_DIR_DELEGATION operation can be used for both normal and
+ named attribute directories.
+
+ If the client sets gdda_signal_deleg_avail to TRUE, then it is
+ registering with the server a "want" for a directory delegation. If
+ the delegation is not available, and the server supports and will
+ honor the "want", the results will have
+ gddrnf_will_signal_deleg_avail set to TRUE and no error will be
+ indicated on return. If so, the client should expect a future
+ CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory
+ delegation is available. If the server does not wish to honor the
+ "want" or is not able to do so, it returns the error
+ NFS4ERR_DIRDELEG_UNAVAIL. If the delegation is immediately
+ available, the server SHOULD return it with the response to the
+ operation, rather than via a callback.
+
+ When a client makes a request for a directory delegation while it
+ already holds a directory delegation for that directory (including
+ the case where it has been recalled but not yet returned by the
+ client or revoked by the server), the server MUST reply with the
+ value of gddr_status set to NFS4_OK, the value of gddrnf_status set
+ to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set
+ to FALSE. The delegation the client held before the request remains
+ intact, and its state is unchanged. The current stateid is not
+ changed (see Section 16.2.3.1.2 for a description of the current
+ stateid).
+
+18.39.4. IMPLEMENTATION
+
+ Directory delegations provide the benefit of improving cache
+ consistency of namespace information. This is done through
+ synchronous callbacks. A server must support synchronous callbacks
+ in order to support directory delegations. In addition,
+ asynchronous notifications provide a way to reduce network traffic as
+ well as improve client performance in certain conditions.
+
+ Notifications are specified in terms of potential changes to the
+ directory. A client can ask to be notified of events by setting one
+ or more bits in gdda_notification_types. The client can ask for
+ notifications on addition of entries to a directory
+ (NOTIFY4_ADD_ENTRY), notifications on entry removal
+ (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), directory
+ attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and cookie
+ verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting one or
+ more corresponding bits in the gdda_notification_types field.
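+
+ As an illustration of composing gdda_notification_types, the C sketch
+ below sets individual notification bits in a bitmap4-style array of
+ 32-bit words, using the same composition rule as the file attribute
+ bitmap (Section 3.3.7). The helper names and the fixed two-word
+ array are assumptions; the bit positions themselves come from the
+ protocol's notify_type4 enumeration and are passed in by the caller
+ rather than hardcoded here.
+
+    /* Sketch: set bit N of a bitmap4-style word array (bit N lives
+     * in word N / 32, at position N % 32). */
+    #include <stdint.h>
+
+    #define BITMAP4_WORDS 2   /* assumption: two words suffice here */
+
+    static void
+    bitmap4_set(uint32_t words[BITMAP4_WORDS], unsigned bit)
+    {
+        if (bit / 32 < BITMAP4_WORDS)
+            words[bit / 32] |= (uint32_t)1 << (bit % 32);
+    }
+
+    /* Usage: request add-entry and remove-entry notifications, where
+     * the two arguments carry the NOTIFY4_ADD_ENTRY and
+     * NOTIFY4_REMOVE_ENTRY bit positions from notify_type4. */
+    void
+    build_notification_types(uint32_t words[BITMAP4_WORDS],
+                             unsigned add_entry_bit,
+                             unsigned remove_entry_bit)
+    {
+        bitmap4_set(words, add_entry_bit);
+        bitmap4_set(words, remove_entry_bit);
+    }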
+ + The client can also ask for notifications of changes to attributes of + directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep + its attribute cache up to date. However, any changes made to child + attributes do not cause the delegation to be recalled. If a client + is interested in directory entry caching or negative name caching, it + can set the gdda_notification_types appropriately to its particular + need and the server will notify it of all changes that would + otherwise invalidate its name cache. The kind of notification a + client asks for may depend on the directory size, its rate of change, + and the applications being used to access that directory. The + enumeration of the conditions under which a client might ask for a + notification is out of the scope of this specification. + + For attribute notifications, the client will set bits in the + gdda_dir_attributes bitmap to indicate which attributes it wants to + be notified of. If the server does not support notifications for + changes to a certain attribute, it SHOULD NOT set that attribute in + the supported attribute bitmap specified in the reply + (gddr_dir_attributes). The client will also set in the + gdda_child_attributes bitmap the attributes of directory entries it + wants to be notified of, and the server will indicate in + gddr_child_attributes which attributes of directory entries it will + notify the client of. + + The client will also let the server know if it wants to get the + notification as soon as the attribute change occurs or after a + certain delay by setting a delay factor; gdda_child_attr_delay is for + attribute changes to directory entries and gdda_dir_attr_delay is for + attribute changes to the directory. If this delay factor is set to + zero, that indicates to the server that the client wants to be + notified of any attribute changes as soon as they occur. If the + delay factor is set to N seconds, the server will make a best-effort + guarantee that attribute updates are synchronized within N seconds. + If the client asks for a delay factor that the server does not + support or that may cause significant resource consumption on the + server by causing the server to send a lot of notifications, the + server should not commit to sending out notifications for attributes + and therefore must not set the appropriate bit in the + gddr_child_attributes and gddr_dir_attributes bitmaps in the + response. + + The client MUST use a security tuple (Section 2.6.1) that the + directory or its applicable ancestor (Section 2.6) is exported with. + If not, the server MUST return NFS4ERR_WRONGSEC to the operation that + both precedes GET_DIR_DELEGATION and sets the current filehandle (see + Section 2.6.3.1). + + The directory delegation covers all the entries in the directory + except the parent entry. That means if a directory and its parent + both hold directory delegations, any changes to the parent will not + cause a notification to be sent for the child even though the child's + parent entry points to the parent directory. + +18.40. Operation 47: GETDEVICEINFO - Get Device Information + +18.40.1. ARGUMENT + + struct GETDEVICEINFO4args { + deviceid4 gdia_device_id; + layouttype4 gdia_layout_type; + count4 gdia_maxcount; + bitmap4 gdia_notify_types; + }; + +18.40.2. 
RESULT
+
+ struct GETDEVICEINFO4resok {
+ device_addr4 gdir_device_addr;
+ bitmap4 gdir_notification;
+ };
+
+ union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
+ case NFS4_OK:
+ GETDEVICEINFO4resok gdir_resok4;
+ case NFS4ERR_TOOSMALL:
+ count4 gdir_mincount;
+ default:
+ void;
+ };
+
+18.40.3. DESCRIPTION
+
+ The GETDEVICEINFO operation returns pNFS storage device address
+ information for the specified device ID. The client identifies the
+ device information to be returned by providing the gdia_device_id and
+ gdia_layout_type that uniquely identify the device. The client
+ provides gdia_maxcount to limit the number of bytes for the result.
+ This maximum size represents all of the data being returned within
+ the GETDEVICEINFO4resok structure and includes the XDR overhead. The
+ server may return less data. If the server is unable to return any
+ information within the gdia_maxcount limit, the error
+ NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is
+ zero, NFS4ERR_TOOSMALL MUST NOT be returned.
+
+ The da_layout_type field of the gdir_device_addr returned by the
+ server MUST be equal to the gdia_layout_type specified by the client.
+ If it is not equal, the client SHOULD ignore the response as invalid
+ and behave as if the server returned an error, even if the client
+ does have support for the layout type returned.
+
+ The client also provides a notification bitmap, gdia_notify_types,
+ indicating the device ID mapping notifications it is interested in
+ receiving; the server must support device ID notifications for the
+ notification request to have effect. The notification mask is
+ composed in the same manner as the bitmap for file attributes
+ (Section 3.3.7). The numbers of bit positions are listed in the
+ notify_device_type4 enumeration type (Section 20.12). Only two
+ enumerated values of notify_device_type4 currently apply to
+ GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE
+ (see Section 20.12).
+
+ The notification bitmap applies only to the specified device ID. If
+ a client sends a GETDEVICEINFO operation on a device ID multiple
+ times, the last notification bitmap is used by the server for
+ subsequent notifications. If the bitmap is zero or empty, then the
+ device ID's notifications are turned off.
+
+ If the client just wants to update or turn off notifications, it MAY
+ send a GETDEVICEINFO operation with gdia_maxcount set to zero. In
+ that event, if the device ID is valid, the reply's da_addr_body field
+ of the gdir_device_addr field will be of zero length.
+
+ If an unknown device ID is given in gdia_device_id, the server
+ returns NFS4ERR_NOENT. Otherwise, the device address information is
+ returned in gdir_device_addr. Finally, if the server supports
+ notifications for device ID mappings, the gdir_notification result
+ will contain a bitmap of which notifications it will actually send to
+ the client (via CB_NOTIFY_DEVICEID, see Section 20.12).
+
+ If NFS4ERR_TOOSMALL is returned, the results also contain
+ gdir_mincount. The value of gdir_mincount represents the minimum
+ size necessary to obtain the device information.
+
+18.40.4. IMPLEMENTATION
+
+ Aside from updating or turning off notifications, another use case
+ for gdia_maxcount being set to zero is to validate a device ID.
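+
+ In a C binding of the XDR above, such a zero-gdia_maxcount probe
+ might look as follows. The structure here is a simplified stand-in
+ for rpcgen-generated types (the notification bitmap is reduced to a
+ single word for brevity); only the field names and the 16-byte
+ device ID size follow the protocol.
+
+    /* Sketch: refresh or turn off notifications for a device ID, or
+     * just validate it, without fetching its address body. */
+    #include <stdint.h>
+    #include <string.h>
+
+    typedef uint8_t deviceid4[16];     /* NFS4_DEVICEID4_SIZE */
+
+    struct getdeviceinfo_args {
+        deviceid4 gdia_device_id;
+        uint32_t  gdia_layout_type;
+        uint32_t  gdia_maxcount;
+        uint32_t  gdia_notify_types;   /* one bitmap word, simplified */
+    };
+
+    void
+    probe_device_id(struct getdeviceinfo_args *a, const deviceid4 id,
+                    uint32_t layout_type, uint32_t notify_bits)
+    {
+        memcpy(a->gdia_device_id, id, sizeof(deviceid4));
+        a->gdia_layout_type = layout_type;
+        a->gdia_maxcount    = 0;  /* no address body wanted; server
+                                   * must not return NFS4ERR_TOOSMALL */
+        a->gdia_notify_types = notify_bits;  /* 0 turns them off */
+    }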
+
+ The client SHOULD request a notification for changes or deletion of a
+ device ID to device address mapping so that the server can allow the
+ client to use a new mapping gracefully, without having pending I/O
+ fail abruptly or forcing layouts that use the device ID to be
+ recalled or revoked.
+
+ It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with
+ CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the
+ client gets and processes the response to GETDEVICEINFO or
+ GETDEVICELIST. The analysis of the race leverages the fact that the
+ server MUST NOT delete a device ID that is referred to by a layout
+ the client has.
+
+ * CB_NOTIFY_DEVICEID deletes a device ID. If the client believes it
+ has layouts that refer to the device ID, then it is possible that
+ layouts referring to the deleted device ID have been revoked. The
+ client should send a TEST_STATEID request using the stateid for
+ each layout that might have been revoked. If TEST_STATEID
+ indicates that any layouts have been revoked, the client must
+ recover from layout revocation as described in Section 12.5.6. If
+ TEST_STATEID indicates that at least one layout has not been
+ revoked, the client should send a GETDEVICEINFO operation on the
+ supposedly deleted device ID to verify that the device ID has been
+ deleted.
+
+ If GETDEVICEINFO indicates that the device ID does not exist, then
+ the client assumes the server is faulty and recovers by sending an
+ EXCHANGE_ID operation. If GETDEVICEINFO indicates that the device
+ ID does exist, then while the server is faulty for sending an
+ erroneous device ID deletion notification, the degree to which it
+ is faulty does not require the client to create a new client ID.
+
+ If the client does not have layouts that refer to the device ID,
+ no harm is done. The client should mark the device ID as deleted,
+ and when GETDEVICEINFO or GETDEVICELIST results are received that
+ indicate that the device ID has in fact been deleted, the device
+ ID should be removed from the client's cache.
+
+ * CB_NOTIFY_DEVICEID indicates that a device ID's device addressing
+ mappings have changed. The client should assume that the results
+ from the in-progress GETDEVICEINFO will be stale for the device ID
+ once received, and so it should send another GETDEVICEINFO on the
+ device ID.
+
+18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for a File
+ System
+
+18.41.1. ARGUMENT
+
+ struct GETDEVICELIST4args {
+ /* CURRENT_FH: object belonging to the file system */
+ layouttype4 gdla_layout_type;
+
+ /* number of deviceIDs to return */
+ count4 gdla_maxdevices;
+
+ nfs_cookie4 gdla_cookie;
+ verifier4 gdla_cookieverf;
+ };
+
+18.41.2. RESULT
+
+ struct GETDEVICELIST4resok {
+ nfs_cookie4 gdlr_cookie;
+ verifier4 gdlr_cookieverf;
+ deviceid4 gdlr_deviceid_list<>;
+ bool gdlr_eof;
+ };
+
+ union GETDEVICELIST4res switch (nfsstat4 gdlr_status) {
+ case NFS4_OK:
+ GETDEVICELIST4resok gdlr_resok4;
+ default:
+ void;
+ };
+
+18.41.3. DESCRIPTION
+
+ This operation is used by the client to enumerate all of the device
+ IDs that a server's file system uses.
+
+ The client provides a current filehandle of a file object that
+ belongs to the file system (i.e., all file objects sharing the same
+ fsid as that of the current filehandle) and the layout type in
+ gdla_layout_type.
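+
+ The cursor-based enumeration described in the remainder of this
+ section follows the familiar READDIR shape. The C sketch below shows
+ the client-side loop under that assumption; the getdevicelist() RPC
+ wrapper and the result structure are hypothetical stand-ins for
+ generated stubs.
+
+    /* Sketch: enumerate all device IDs of a file system by looping
+     * on GETDEVICELIST until the server reports eof. */
+    #include <stdbool.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    struct gdl_result {
+        uint64_t cookie;
+        uint8_t  cookieverf[8];
+        bool     eof;
+        /* returned device IDs omitted for brevity */
+    };
+
+    /* Hypothetical wrapper: sends GETDEVICELIST, fills *res, and
+     * returns an NFSv4 status (0 == NFS4_OK). */
+    int getdevicelist(uint64_t cookie, const uint8_t cookieverf[8],
+                      uint32_t maxdevices, struct gdl_result *res);
+
+    int
+    enumerate_device_ids(void)
+    {
+        struct gdl_result res;
+        uint64_t cookie = 0;       /* zero: start at the beginning */
+        uint8_t  verf[8] = { 0 };  /* ignored when cookie is zero */
+
+        do {
+            int status = getdevicelist(cookie, verf, 64, &res);
+            if (status != 0)
+                return status;
+            /* ... cache res's device IDs here ... */
+            cookie = res.cookie;
+            memcpy(verf, res.cookieverf, sizeof(verf));
+        } while (!res.eof);
+        return 0;
+    }
+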
Since this operation might require multiple calls
+ to enumerate all the device IDs (and is thus similar to the READDIR
+ (Section 18.23) operation), the client also provides gdla_cookie and
+ gdla_cookieverf to specify the current cursor position in the list.
+ When the client wants to read from the beginning of the file system's
+ device mappings, it sets gdla_cookie to zero. The field
+ gdla_cookieverf MUST be ignored by the server when gdla_cookie is
+ zero. The client provides gdla_maxdevices to limit the number of
+ device IDs in the result. If gdla_maxdevices is zero, the server
+ MUST return NFS4ERR_INVAL. The server MAY return fewer device IDs.
+
+ The successful response to the operation will contain the cookie,
+ gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on
+ the subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies
+ that there are no remaining entries in the server's device list.
+ Each element of gdlr_deviceid_list contains a device ID.
+
+18.41.4. IMPLEMENTATION
+
+ An example of the use of this operation is for pNFS clients and
+ servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments
+ it may be helpful for a client to determine device accessibility upon
+ first file system access.
+
+18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout
+
+18.42.1. ARGUMENT
+
+ union newtime4 switch (bool nt_timechanged) {
+ case TRUE:
+ nfstime4 nt_time;
+ case FALSE:
+ void;
+ };
+
+ union newoffset4 switch (bool no_newoffset) {
+ case TRUE:
+ offset4 no_offset;
+ case FALSE:
+ void;
+ };
+
+ struct LAYOUTCOMMIT4args {
+ /* CURRENT_FH: file */
+ offset4 loca_offset;
+ length4 loca_length;
+ bool loca_reclaim;
+ stateid4 loca_stateid;
+ newoffset4 loca_last_write_offset;
+ newtime4 loca_time_modify;
+ layoutupdate4 loca_layoutupdate;
+ };
+
+18.42.2. RESULT
+
+ union newsize4 switch (bool ns_sizechanged) {
+ case TRUE:
+ length4 ns_size;
+ case FALSE:
+ void;
+ };
+
+ struct LAYOUTCOMMIT4resok {
+ newsize4 locr_newsize;
+ };
+
+ union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) {
+ case NFS4_OK:
+ LAYOUTCOMMIT4resok locr_resok4;
+ default:
+ void;
+ };
+
+18.42.3. DESCRIPTION
+
+ The LAYOUTCOMMIT operation commits changes in the layout represented
+ by the current filehandle, client ID (derived from the session ID in
+ the preceding SEQUENCE operation), byte-range, and stateid. Since
+ layouts are sub-dividable, a smaller portion of a layout, retrieved
+ via LAYOUTGET, can be committed. The byte-range being committed is
+ specified through the byte-range (loca_offset and loca_length). This
+ byte-range MUST overlap with one or more existing layouts previously
+ granted via LAYOUTGET (Section 18.43), each with an iomode of
+ LAYOUTIOMODE4_RW. In the case where the iomode of any held layout
+ segment is not LAYOUTIOMODE4_RW, the server should return the error
+ NFS4ERR_BAD_IOMODE. For the case where the client does not hold
+ matching layout segment(s) for the defined byte-range, the server
+ should return the error NFS4ERR_BAD_LAYOUT.
+
+ The LAYOUTCOMMIT operation indicates that the client has completed
+ writes using a layout obtained by a previous LAYOUTGET. The client
+ may have only written a subset of the data range it previously
+ requested. LAYOUTCOMMIT allows it to commit or discard provisionally
+ allocated space and to update the server with a new end-of-file.
The
+ layout referenced by LAYOUTCOMMIT is still valid after the operation
+ completes and can continue to be referenced by the client ID,
+ filehandle, byte-range, layout type, and stateid.
+
+ If the loca_reclaim field is set to TRUE, this indicates that the
+ client is attempting to commit changes to a layout after the restart
+ of the metadata server during the metadata server's recovery grace
+ period (see Section 12.7.4). This type of request may be necessary
+ when the client has uncommitted writes to provisionally allocated
+ byte-ranges of a file that were sent to the storage devices before
+ the restart of the metadata server. In this case, the layout
+ provided by the client MUST be a subset of a writable layout that the
+ client held immediately before the restart of the metadata server.
+ The value of the field loca_stateid MUST be a value that the metadata
+ server returned before it restarted. The metadata server is free to
+ accept or reject this request based on its own internal metadata
+ consistency checks. If the metadata server finds that the layout
+ provided by the client does not pass its consistency checks, it MUST
+ reject the request with the status NFS4ERR_RECLAIM_BAD. The
+ successful completion of the LAYOUTCOMMIT request with loca_reclaim
+ set to TRUE does NOT provide the client with a layout for the file.
+ It simply commits the changes to the layout specified in the
+ loca_layoutupdate field. To obtain a layout for the file, the client
+ must send a LAYOUTGET request to the server after the server's grace
+ period has expired. If the metadata server receives a LAYOUTCOMMIT
+ request with loca_reclaim set to TRUE when the metadata server is not
+ in its recovery grace period, it MUST reject the request with the
+ status NFS4ERR_NO_GRACE.
+
+ Setting the loca_reclaim field to TRUE is required if and only if the
+ committed layout was acquired before the metadata server restart. If
+ the client is committing a layout that was acquired during the
+ metadata server's grace period, it MUST set the loca_reclaim field to
+ FALSE.
+
+ The loca_stateid is a layout stateid value as returned by previously
+ successful layout operations (see Section 12.5.3).
+
+ The loca_last_write_offset field specifies the offset of the last
+ byte written by the client previous to the LAYOUTCOMMIT. Note that
+ this value is never equal to the file's size (at most it is one byte
+ less than the file's size) and MUST be less than or equal to
+ NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range
+ described by loca_offset and loca_length. The metadata server may
+ use this information to determine whether the file's size needs to be
+ updated. If the metadata server updates the file's size as the
+ result of the LAYOUTCOMMIT operation, it must return the new size
+ (locr_newsize.ns_size) as part of the results.
+
+ The loca_time_modify field allows the client to suggest a
+ modification time it would like the metadata server to set. The
+ metadata server may use the suggestion or it may use the time of the
+ LAYOUTCOMMIT operation to set the modification time. If the metadata
+ server uses the client-provided modification time, it should ensure
+ that time does not flow backwards. If the client wants to force the
+ metadata server to set an exact time, the client should use a SETATTR
+ operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.4
+ for more details.
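+
+ The interplay of the newoffset4 and newtime4 unions can be made
+ concrete with a short C sketch of filling the arguments after a
+ client finishes writing through a layout. The structures loosely
+ mirror the XDR (a real client would use generated types), and the
+ wrote_anything/last_byte_written inputs are assumed client
+ bookkeeping.
+
+    /* Sketch: populate LAYOUTCOMMIT arguments for a normal
+     * (non-reclaim) commit. */
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct newoffset { bool no_newoffset; uint64_t no_offset; };
+    struct newtime   { bool nt_timechanged; int64_t nt_seconds; };
+
+    struct layoutcommit_args {
+        uint64_t loca_offset;
+        uint64_t loca_length;
+        bool     loca_reclaim;
+        struct newoffset loca_last_write_offset;
+        struct newtime   loca_time_modify;
+    };
+
+    void
+    fill_layoutcommit(struct layoutcommit_args *a, uint64_t off,
+                      uint64_t len, bool wrote_anything,
+                      uint64_t last_byte_written)
+    {
+        a->loca_offset  = off;
+        a->loca_length  = len;
+        a->loca_reclaim = false;
+
+        /* Offset of the LAST byte written (not one past it); it
+         * must overlap [loca_offset, loca_offset + loca_length - 1]
+         * and is at most one byte less than the file's size. */
+        a->loca_last_write_offset.no_newoffset = wrote_anything;
+        a->loca_last_write_offset.no_offset    = last_byte_written;
+
+        /* Let the server choose the modification time; a SETATTR
+         * right after LAYOUTCOMMIT forces an exact time instead. */
+        a->loca_time_modify.nt_timechanged = false;
+    }
+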
If the client desires the resultant modification
+ time, it should construct the COMPOUND so that a GETATTR follows the
+ LAYOUTCOMMIT.
+
+ The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism
+ for a client to provide layout-specific updates to the metadata
+ server. For example, the layout update can describe what byte-ranges
+ of the original layout have been used and what byte-ranges can be
+ deallocated. There is no NFSv4.1 file layout-specific layoutupdate4
+ structure.
+
+ The layout information is more verbose for block devices than for
+ objects and files because the latter two hide the details of block
+ allocation behind their storage protocols. At the minimum, the
+ client needs to communicate changes to the end-of-file location back
+ to the server, and, if desired, its view of the file's modification
+ time. For block/volume layouts, it needs to specify precisely which
+ blocks have been used.
+
+ If the layout identified in the arguments does not exist, the error
+ NFS4ERR_BADLAYOUT is returned. The layout being committed may also
+ be rejected if it does not correspond to an existing layout with an
+ iomode of LAYOUTIOMODE4_RW.
+
+ On success, the current filehandle retains its value and the current
+ stateid retains its value.
+
+18.42.4. IMPLEMENTATION
+
+ The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set
+ to TRUE to convey hints about modified file attributes or to report
+ layout-type-specific information such as I/O errors for object-based
+ storage layouts, as would be done during normal operation. Doing so
+ may help the metadata server to recover files more efficiently after
+ restart. For example, some file system implementations may require
+ expansive recovery of file system objects if the metadata server does
+ not get a positive indication from all clients holding a
+ LAYOUTIOMODE4_RW layout that they have successfully completed all
+ their writes. Sending a LAYOUTCOMMIT (if required) and then
+ following with LAYOUTRETURN can provide such an indication and allow
+ for graceful and efficient recovery.
+
+ If loca_reclaim is TRUE, the metadata server is free to either
+ examine or ignore the value in the field loca_stateid. The metadata
+ server implementation might or might not encode in its layout stateid
+ information that allows the metadata server to perform a consistency
+ check on the LAYOUTCOMMIT request.
+
+18.43. Operation 50: LAYOUTGET - Get Layout Information
+
+18.43.1. ARGUMENT
+
+ struct LAYOUTGET4args {
+ /* CURRENT_FH: file */
+ bool loga_signal_layout_avail;
+ layouttype4 loga_layout_type;
+ layoutiomode4 loga_iomode;
+ offset4 loga_offset;
+ length4 loga_length;
+ length4 loga_minlength;
+ stateid4 loga_stateid;
+ count4 loga_maxcount;
+ };
+
+18.43.2. RESULT
+
+ struct LAYOUTGET4resok {
+ bool logr_return_on_close;
+ stateid4 logr_stateid;
+ layout4 logr_layout<>;
+ };
+
+ union LAYOUTGET4res switch (nfsstat4 logr_status) {
+ case NFS4_OK:
+ LAYOUTGET4resok logr_resok4;
+ case NFS4ERR_LAYOUTTRYLATER:
+ bool logr_will_signal_layout_avail;
+ default:
+ void;
+ };
+
+18.43.3. DESCRIPTION
+
+ The LAYOUTGET operation requests a layout from the metadata server
+ for reading or writing the file given by the filehandle at the byte-
+ range specified by offset and length. Layouts are identified by the
+ client ID (derived from the session ID in the preceding SEQUENCE
+ operation), current filehandle, layout type (loga_layout_type), and
+ the layout stateid (loga_stateid).
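+
+ The required relationships among loga_offset, loga_length, and
+ loga_minlength are spelled out in the rules below; as a preview, the
+ following C sketch encodes the ordering and overflow checks a
+ metadata server might apply, including the NFS4_UINT64_MAX special
+ case. The function name and the stand-in error tag are assumptions.
+
+    /* Sketch: argument checks for the LAYOUTGET byte-ranges. */
+    #include <stdint.h>
+
+    #define NFS4_UINT64_MAX 0xffffffffffffffffULL
+
+    enum { LG_OK = 0, LG_ERR_INVAL = 1 };  /* stand-in tags */
+
+    int
+    check_layoutget_range(uint64_t off, uint64_t len, uint64_t minlen)
+    {
+        /* loga_minlength MUST be <= loga_length. */
+        if (len < minlen)
+            return LG_ERR_INVAL;
+
+        /* Unless it is the special to-EOF value, loga_offset +
+         * loga_minlength must not exceed NFS4_UINT64_MAX ... */
+        if (minlen != NFS4_UINT64_MAX &&
+            off > NFS4_UINT64_MAX - minlen)
+            return LG_ERR_INVAL;
+
+        /* ... and likewise for loga_length. */
+        if (len != NFS4_UINT64_MAX && off > NFS4_UINT64_MAX - len)
+            return LG_ERR_INVAL;
+
+        return LG_OK;
+    }
+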
The use of the loga_iomode field
+ depends upon the layout type, but should reflect the client's data
+ access intent.
+
+ If the metadata server is in a grace period, and does not persist
+ layouts and device ID to device address mappings, then it MUST return
+ NFS4ERR_GRACE (see Section 8.4.2.1).
+
+ The LAYOUTGET operation returns layout information for the specified
+ byte-range: a layout. The client actually specifies two ranges, both
+ starting at the offset in the loga_offset field. The first range is
+ between loga_offset and loga_offset + loga_length - 1 inclusive.
+ This range indicates the desired range the client wants the layout to
+ cover. The second range is between loga_offset and loga_offset +
+ loga_minlength - 1 inclusive. This range indicates the required
+ range the client needs the layout to cover. Thus, loga_minlength
+ MUST be less than or equal to loga_length.
+
+ When a length field is set to NFS4_UINT64_MAX, this indicates a
+ desire (when loga_length is NFS4_UINT64_MAX) or requirement (when
+ loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset
+ through the end-of-file, regardless of the file's length.
+
+ The following rules govern the relationships among, and the minima
+ of, loga_length, loga_minlength, and loga_offset.
+
+ * If loga_length is less than loga_minlength, the metadata server
+ MUST return NFS4ERR_INVAL.
+
+ * If loga_minlength is zero, this is an indication to the metadata
+ server that the client desires any layout at offset loga_offset or
+ less that the metadata server has "readily available". "Readily
+ available" is subjective, and depends on the layout type and the
+ pNFS server implementation. For example, some metadata servers
+ might have to pre-allocate stable storage when they receive a
+ request for a range of a file that goes beyond the file's current
+ length. If loga_minlength is zero and loga_length is greater than
+ zero, this tells the metadata server what range of the layout the
+ client would prefer to have. If loga_length and loga_minlength are
+ both zero, then the client is indicating that it desires a layout
+ of any length with the ending offset of the range no less than the
+ value specified by loga_offset, and the starting offset at or below
+ loga_offset. If the metadata server does not have a layout that
+ is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER.
+
+ * If the sum of loga_offset and loga_minlength exceeds
+ NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the
+ error NFS4ERR_INVAL MUST result.
+
+ * If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX,
+ and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL
+ MUST result.
+
+ After the metadata server has performed the above checks on
+ loga_offset, loga_minlength, and loga_length, the metadata server
+ MUST return a layout according to the rules in Table 22.
+
+ Acceptable layouts based on loga_minlength. Note: u64m =
+ NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength.
+ + +===========+============+==========+==========+===================+ + | Layout | Layout | Layout | Layout | Layout length of | + | iomode of | a_minlen | iomode | offset | reply | + | request | of request | of reply | of reply | | + +===========+============+==========+==========+===================+ + | _READ | u64m | MAY be | MUST be | MUST be >= file | + | | | _READ | <= a_off | length - layout | + | | | | | offset | + +-----------+------------+----------+----------+-------------------+ + | _READ | u64m | MAY be | MUST be | MUST be u64m | + | | | _RW | <= a_off | | + +-----------+------------+----------+----------+-------------------+ + | _READ | > 0 and < | MAY be | MUST be | MUST be >= | + | | u64m | _READ | <= a_off | MIN(file length, | + | | | | | a_minlen + a_off) | + | | | | | - layout offset | + +-----------+------------+----------+----------+-------------------+ + | _READ | > 0 and < | MAY be | MUST be | MUST be >= a_off | + | | u64m | _RW | <= a_off | - layout offset + | + | | | | | a_minlen | + +-----------+------------+----------+----------+-------------------+ + | _READ | 0 | MAY be | MUST be | MUST be > 0 | + | | | _READ | <= a_off | | + +-----------+------------+----------+----------+-------------------+ + | _READ | 0 | MAY be | MUST be | MUST be > 0 | + | | | _RW | <= a_off | | + +-----------+------------+----------+----------+-------------------+ + | _RW | u64m | MUST be | MUST be | MUST be u64m | + | | | _RW | <= a_off | | + +-----------+------------+----------+----------+-------------------+ + | _RW | > 0 and < | MUST be | MUST be | MUST be >= a_off | + | | u64m | _RW | <= a_off | - layout offset + | + | | | | | a_minlen | + +-----------+------------+----------+----------+-------------------+ + | _RW | 0 | MUST be | MUST be | MUST be > 0 | + | | | _RW | <= a_off | | + +-----------+------------+----------+----------+-------------------+ + + Table 22 + + If loga_minlength is not zero and the metadata server cannot return a + layout according to the rules in Table 22, then the metadata server + MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero + and the metadata server cannot or will not return a layout according + to the rules in Table 22, then the metadata server MUST return the + error NFS4ERR_LAYOUTTRYLATER. Assuming that loga_length is greater + than loga_minlength or equal to zero, the metadata server SHOULD + return a layout according to the rules in Table 23. + + Desired layouts based on loga_length. The rules of Table 22 MUST be + applied first. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; + a_len = loga_length. 
+
+ +===============+==========+==========+==========+================+
+ | Layout iomode | Layout   | Layout   | Layout   | Layout length  |
+ | of request    | a_len of | iomode   | offset   | of reply       |
+ |               | request  | of reply | of reply |                |
+ +===============+==========+==========+==========+================+
+ | _READ         | u64m     | MAY be   | MUST be  | SHOULD be u64m |
+ |               |          | _READ    | <= a_off |                |
+ +---------------+----------+----------+----------+----------------+
+ | _READ         | u64m     | MAY be   | MUST be  | SHOULD be u64m |
+ |               |          | _RW      | <= a_off |                |
+ +---------------+----------+----------+----------+----------------+
+ | _READ         | > 0 and  | MAY be   | MUST be  | SHOULD be >=   |
+ |               | < u64m   | _READ    | <= a_off | a_off - layout |
+ |               |          |          |          | offset + a_len |
+ +---------------+----------+----------+----------+----------------+
+ | _READ         | > 0 and  | MAY be   | MUST be  | SHOULD be >=   |
+ |               | < u64m   | _RW      | <= a_off | a_off - layout |
+ |               |          |          |          | offset + a_len |
+ +---------------+----------+----------+----------+----------------+
+ | _READ         | 0        | MAY be   | MUST be  | SHOULD be >    |
+ |               |          | _READ    | <= a_off | a_off - layout |
+ |               |          |          |          | offset         |
+ +---------------+----------+----------+----------+----------------+
+ | _READ         | 0        | MAY be   | MUST be  | SHOULD be >    |
+ |               |          | _RW      | <= a_off | a_off - layout |
+ |               |          |          |          | offset         |
+ +---------------+----------+----------+----------+----------------+
+ | _RW           | u64m     | MUST be  | MUST be  | SHOULD be u64m |
+ |               |          | _RW      | <= a_off |                |
+ +---------------+----------+----------+----------+----------------+
+ | _RW           | > 0 and  | MUST be  | MUST be  | SHOULD be >=   |
+ |               | < u64m   | _RW      | <= a_off | a_off - layout |
+ |               |          |          |          | offset + a_len |
+ +---------------+----------+----------+----------+----------------+
+ | _RW           | 0        | MUST be  | MUST be  | SHOULD be >    |
+ |               |          | _RW      | <= a_off | a_off - layout |
+ |               |          |          |          | offset         |
+ +---------------+----------+----------+----------+----------------+
+
+ Table 23
+
+ The loga_stateid field specifies a valid stateid. If a layout is not
+ currently held by the client, the loga_stateid field represents a
+ stateid reflecting the correspondingly valid open, byte-range lock,
+ or delegation stateid. Once a layout is held on the file by the
+ client, the loga_stateid field MUST be a stateid as returned from a
+ previous LAYOUTGET or LAYOUTRETURN operation or provided by a
+ CB_LAYOUTRECALL operation (see Section 12.5.3).
+
+ The loga_maxcount field specifies the maximum layout size (in bytes)
+ that the client can handle. If the size of the layout structure
+ exceeds the size specified by loga_maxcount, the metadata server will
+ return the NFS4ERR_TOOSMALL error.
+
+ The returned layout is expressed as an array, logr_layout, with each
+ element of type layout4. If a file has a single striping pattern,
+ then logr_layout SHOULD contain just one entry. Otherwise, if the
+ requested range overlaps more than one striping pattern, logr_layout
+ will contain the required number of entries. The elements of
+ logr_layout MUST be sorted in ascending order of the value of the
+ lo_offset field of each element. There MUST be no gaps or overlaps
+ in the range between two successive elements of logr_layout. The
+ lo_iomode field in each element of logr_layout MUST be the same.
+
+ Table 22 and Table 23 both refer to a returned layout iomode, offset,
+ and length. Because the returned layout is encoded in the
+ logr_layout array, more description is required.
+
+ iomode The value of the returned layout iomode listed in Table 22
+ and Table 23 is equal to the value of the lo_iomode field in each
+ element of logr_layout. As shown in Table 22 and Table 23, the
+ metadata server MAY return a layout with an lo_iomode different
+ from the requested iomode (field loga_iomode of the request). If
+ it does so, it MUST ensure that the lo_iomode is more permissive
+ than the loga_iomode requested. For example, this behavior allows
+ an implementation to upgrade LAYOUTIOMODE4_READ requests to
+ LAYOUTIOMODE4_RW requests at its discretion, within the limits of
+ the layout-type-specific protocol. A lo_iomode of either
+ LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned.
+
+ offset The value of the returned layout offset listed in Table 22
+ and Table 23 is always equal to the lo_offset field of the first
+ element of logr_layout.
+
+ length When setting the value of the returned layout length, the
+ situation is complicated by the possibility that the special
+ layout length value NFS4_UINT64_MAX is involved. For a
+ logr_layout array of N elements, the lo_length field in the first
+ N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of
+ the last element of logr_layout can be NFS4_UINT64_MAX under some
+ conditions as described in the following list.
+
+ * If an applicable rule of Table 22 states that the metadata
+ server MUST return a layout of length NFS4_UINT64_MAX, then the
+ lo_length field of the last element of logr_layout MUST be
+ NFS4_UINT64_MAX.
+
+ * If an applicable rule of Table 22 states that the metadata
+ server MUST NOT return a layout of length NFS4_UINT64_MAX, then
+ the lo_length field of the last element of logr_layout MUST NOT
+ be NFS4_UINT64_MAX.
+
+ * If an applicable rule of Table 23 states that the metadata
+ server SHOULD return a layout of length NFS4_UINT64_MAX, then
+ the lo_length field of the last element of logr_layout SHOULD
+ be NFS4_UINT64_MAX.
+
+ * When the value of the returned layout length of Table 22 and
+ Table 23 is not NFS4_UINT64_MAX, then the returned layout
+ length is equal to the sum of the lo_length fields of each
+ element of logr_layout.
+
+ The logr_return_on_close result field is a directive to return the
+ layout before closing the file. When the metadata server sets this
+ return value to TRUE, it MUST be prepared to recall the layout in the
+ case in which the client fails to return the layout before close.
+ For a metadata server that knows a layout must be returned before a
+ close of the file, this return value can be used to communicate the
+ desired behavior to the client and thus remove one extra step from
+ the client's and metadata server's interaction.
+
+ The logr_stateid stateid is returned to the client for use in
+ subsequent layout-related operations. See Sections 8.2, 12.5.3, and
+ 12.5.5.2 for a further discussion and requirements.
+
+ The format of the returned layout (lo_content) is specific to the
+ layout type. The value of the layout type (lo_content.loc_type) for
+ each of the elements of the array of layouts returned by the metadata
+ server (logr_layout) MUST be equal to the loga_layout_type specified
+ by the client. If it is not equal, the client SHOULD ignore the
+ response as invalid and behave as if the metadata server returned an
+ error, even if the client does have support for the layout type
+ returned.
+
+ If neither the requested file nor its containing file system supports
+ layouts, the metadata server MUST return NFS4ERR_LAYOUTUNAVAILABLE.
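+
+ A client-side counterpart to the logr_layout requirements above is
+ sketched below in C: the reply is treated as invalid unless every
+ element carries the requested layout type, all elements share one
+ iomode, and the ranges are ascending with no gaps or overlaps. The
+ layout_seg structure is a simplified stand-in for layout4.
+
+    /* Sketch: sanity-check a LAYOUTGET reply's layout array. */
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct layout_seg {
+        uint64_t lo_offset;
+        uint64_t lo_length;
+        int      lo_iomode;
+        int      loc_type;   /* lo_content.loc_type, flattened */
+    };
+
+    bool
+    layoutget_reply_ok(const struct layout_seg *segs, size_t n,
+                       int requested_layout_type)
+    {
+        for (size_t i = 0; i < n; i++) {
+            if (segs[i].loc_type != requested_layout_type)
+                return false;   /* ignore reply as invalid */
+            if (segs[i].lo_iomode != segs[0].lo_iomode)
+                return false;   /* iomode must be uniform */
+            /* Ascending, gap-free, overlap-free ranges; note the
+             * first n-1 lengths are never NFS4_UINT64_MAX. */
+            if (i > 0 && segs[i].lo_offset !=
+                    segs[i - 1].lo_offset + segs[i - 1].lo_length)
+                return false;
+        }
+        return true;
+    }
+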
+ If the layout type is not supported, the metadata server MUST return
+ NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout
+ matches the client-provided layout identification, the metadata
+ server MUST return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is
+ specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the
+ metadata server MUST return NFS4ERR_BADIOMODE.
+
+ If the layout for the file is unavailable due to transient
+ conditions, e.g., file sharing prohibits layouts, the metadata server
+ MUST return NFS4ERR_LAYOUTTRYLATER.
+
+ If the layout request is rejected due to an overlapping layout
+ recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See
+ Section 12.5.5.2 for details.
+
+ If the layout conflicts with a mandatory byte-range lock held on the
+ file, and if the storage devices have no method of enforcing
+ mandatory locks, other than through the restriction of layouts, the
+ metadata server SHOULD return NFS4ERR_LOCKED.
+
+ If the client sets loga_signal_layout_avail to TRUE, then it is
+ registering with the server a "want" for a layout in the event the
+ layout cannot be obtained due to resource exhaustion. If the
+ metadata server supports and will honor the "want", the results will
+ have logr_will_signal_layout_avail set to TRUE. If so, the client
+ should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a
+ layout is available.
+
+ On success, the current filehandle retains its value and the current
+ stateid is updated to match the value as returned in the results.
+
+18.43.4. IMPLEMENTATION
+
+ Typically, LAYOUTGET will be called as part of a COMPOUND request
+ after an OPEN operation and results in the client having location
+ information for the file. This requires that loga_stateid be set to
+ the special stateid that tells the metadata server to use the current
+ stateid, which is set by OPEN (see Section 16.2.3.1.2). A client may
+ also hold a layout across multiple OPENs. The client specifies a
+ layout type that limits what kind of layout the metadata server will
+ return. This prevents metadata servers from granting layouts that
+ are unusable by the client.
+
+ As indicated by Table 22 and Table 23, the specification of LAYOUTGET
+ allows a pNFS client and server considerable flexibility. A pNFS
+ client can take several strategies for sending LAYOUTGET. Some
+ examples are as follows.
+
+ * If LAYOUTGET is preceded by OPEN in the same COMPOUND request and
+ the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might
+ opt to request a _READ layout with loga_offset set to zero,
+ loga_minlength set to zero, and loga_length set to
+ NFS4_UINT64_MAX. If the file has space allocated to it, that
+ space is striped over one or more storage devices, and there is
+ either no conflicting layout or the concept of a conflicting
+ layout does not apply to the pNFS server's layout type or
+ implementation, then the metadata server might return a layout
+ with a starting offset of zero, and a length equal to the length
+ of the file, if not NFS4_UINT64_MAX. If the length of the file is
+ not a multiple of the pNFS server's stripe width (see Section 13.2
+ for a formal definition), the metadata server might round up the
+ returned layout's length.
+
+ * If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and
+ the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not
+ truncate the file, the client might opt to request a _RW layout
+ with loga_offset set to zero, loga_minlength set to zero, and
+ loga_length set to the file's current length (if known), or
+ NFS4_UINT64_MAX. As with the previous case, under some conditions
+ the metadata server might return a layout that covers the entire
+ length of the file or beyond.
+
+ * This strategy is as above, but the OPEN truncates the file. In
+ this case, the client might anticipate it will be writing to the
+ file from offset zero, and so loga_offset and loga_minlength are
+ set to zero, and loga_length is set to the value of
+ threshold4_write_iosize. The metadata server might return a
+ layout from offset zero with a length at least as long as
+ threshold4_write_iosize.
+
+ * A process on the client invokes a request to read from offset
+ 10000 for length 50000. The client is using buffered I/O, and has
+ buffer sizes of 4096 bytes. The client intends to map the request
+ of the process into a series of READ requests starting at offset
+ 8192. The end offset needs to be higher than 10000 + 50000 =
+ 60000, and the next offset that is a multiple of 4096 is 61440.
+ The difference between 61440 and the starting offset of 8192 is
+ 53248 (which is the product of 4096 and 13). The value of
+ threshold4_read_iosize is less than 53248, so the client sends a
+ LAYOUTGET request with loga_offset set to 8192, loga_minlength
+ set to 53248, and loga_length set to the file's length (if known)
+ minus 8192 or NFS4_UINT64_MAX (if the file's length is not known).
+ Since this LAYOUTGET request exceeds the metadata server's
+ threshold, it grants the layout, possibly with an initial offset
+ of zero, with an end offset of at least 8192 + 53248 - 1 = 61439,
+ but preferably a layout with an offset aligned on the stripe width
+ and a length that is a multiple of the stripe width.
+
+ * This strategy is as above, but the client is not using buffered
+ I/O, and instead all internal I/O requests are sent directly to
+ the server. The LAYOUTGET request has loga_offset equal to 10000
+ and loga_minlength set to 50000. The value of loga_length is set
+ to the length of the file. The metadata server is free to return
+ a layout that fully overlaps the requested range, with a starting
+ offset and length aligned on the stripe width.
+
+ * Again, a process on the client invokes a request to read from
+ offset 10000 for length 50000 (i.e., a range with a starting
+ offset of 10000 and an ending offset of 59999), and buffered I/O
+ is in use. The client is expecting that the server might not be
+ able to return the layout for the full I/O range. The client
+ intends to map the request of the process into a series of
+ thirteen READ requests starting at offset 8192, each with length
+ 4096, with a total length of 53248 (which equals 13 * 4096), which
+ fully contains the range that the client's process wants to read.
+ Because the value of threshold4_read_iosize is equal to 4096, it
+ is practical and reasonable for the client to use several
+ LAYOUTGET operations to complete the series of READs. The client
+ sends a LAYOUTGET request with loga_offset set to 8192,
+ loga_minlength set to 4096, and loga_length set to 53248 or
+ higher.
The server will
+ grant a layout possibly with an initial offset of zero, with an
+ end offset of at least 8192 + 4096 - 1 = 12287, but preferably a
+ layout with an offset aligned on the stripe width and a length
+ that is a multiple of the stripe width. This will allow the
+ client to make forward progress, possibly sending more LAYOUTGET
+ operations for the remainder of the range.
+
+ * An NFS client detects a sequential read pattern, and so sends a
+ LAYOUTGET operation that goes well beyond any current or pending
+ read requests to the server. The server might likewise detect
+ this pattern, and grant the LAYOUTGET request. Once the client
+ reads from an offset of the file that represents 50% of the way
+ through the range of the last layout it received, the client, in
+ order to avoid stalling I/O that would wait for a layout, sends
+ more LAYOUTGET operations from an offset of the file that
+ represents 50% of the way through the last layout it received.
+ The client continues to request layouts with byte-ranges that are
+ well in advance of the byte-ranges of recent and/or pending read
+ requests of processes running on the client.
+
+ * This strategy is as above, except that the client fails to detect
+ the pattern, while the server does. The next time the metadata
+ server gets a LAYOUTGET, it returns a layout with a length that is
+ well beyond loga_minlength.
+
+ * A client is using buffered I/O, and has a long queue of write-
+ behinds to process and also detects a sequential write pattern.
+ It sends a LAYOUTGET for a layout that spans the range of the
+ queued write-behinds and well beyond, including ranges beyond the
+ file's current length. The client continues to send LAYOUTGET
+ operations once the write-behind queue reaches 50% of the maximum
+ queue length.
+
+ Once the client has obtained a layout referring to a particular
+ device ID, the metadata server MUST NOT delete the device ID until
+ the layout is returned or revoked.
+
+ CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is
+ that LAYOUTGET returns a device ID for which the client does not have
+ device address mappings, and the metadata server sends a
+ CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and
+ meanwhile the client sends GETDEVICEINFO on the device ID. This
+ scenario is discussed in Section 18.40.4. Another scenario is that
+ the CB_NOTIFY_DEVICEID is processed by the client before it processes
+ the results from LAYOUTGET. The client will send a GETDEVICEINFO on
+ the device ID. If the results from GETDEVICEINFO are received before
+ the client gets results from LAYOUTGET, then there is no longer a
+ race. If the results from LAYOUTGET are received before the results
+ from GETDEVICEINFO, the client can either wait for results of
+ GETDEVICEINFO or send another one to get possibly more up-to-date
+ device address mappings for the device ID.
+
+18.44. Operation 51: LAYOUTRETURN - Release Layout Information
+
+18.44.1.
ARGUMENT + + /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ + const LAYOUT4_RET_REC_FILE = 1; + const LAYOUT4_RET_REC_FSID = 2; + const LAYOUT4_RET_REC_ALL = 3; + + enum layoutreturn_type4 { + LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, + LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, + LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL + }; + + struct layoutreturn_file4 { + offset4 lrf_offset; + length4 lrf_length; + stateid4 lrf_stateid; + /* layouttype4 specific data */ + opaque lrf_body<>; + }; + + union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { + case LAYOUTRETURN4_FILE: + layoutreturn_file4 lr_layout; + default: + void; + }; + + struct LAYOUTRETURN4args { + /* CURRENT_FH: file */ + bool lora_reclaim; + layouttype4 lora_layout_type; + layoutiomode4 lora_iomode; + layoutreturn4 lora_layoutreturn; + }; + +18.44.2. RESULT + + union layoutreturn_stateid switch (bool lrs_present) { + case TRUE: + stateid4 lrs_stateid; + case FALSE: + void; + }; + + union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { + case NFS4_OK: + layoutreturn_stateid lorr_stateid; + default: + void; + }; + +18.44.3. DESCRIPTION + + This operation returns from the client to the server one or more + layouts represented by the client ID (derived from the session ID in + the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. + When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is + further identified by the current filehandle, lrf_offset, lrf_length, + and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all + bytes of the layout, starting at lrf_offset, are returned. When + lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used + to identify the file system and all layouts matching the client ID, + the fsid of the file system, lora_layout_type, and lora_iomode are + returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts + matching the client ID, lora_layout_type, and lora_iomode are + returned and the current filehandle is not used. After this call, + the client MUST NOT use the returned layout(s) and the associated + storage protocol to access the file data. + + If the set of layouts designated in the case of LAYOUTRETURN4_FSID or + LAYOUTRETURN4_ALL is empty, then no error results. In the case of + LAYOUTRETURN4_FILE, the byte-range specified is returned even if it + is a subdivision of a layout previously obtained with LAYOUTGET, a + combination of multiple layouts previously obtained with LAYOUTGET, + or a combination including some layouts previously obtained with + LAYOUTGET, and one or more subdivisions of such layouts. When the + byte-range does not designate any bytes for which a layout is held + for the specified file, client ID, layout type and mode, no error + results. See Section 12.5.5.2.1.5 for considerations with "bulk" + return of layouts. + + The layout being returned may be a subset or superset of a layout + specified by CB_LAYOUTRECALL. However, if it is a subset, the recall + is not complete until the full recalled scope has been returned. + Recalled scope refers to the byte-range in the case of + LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of + LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching + scope to complete the return even if all current layout ranges have + been previously individually returned. 
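+
+ For the common voluntary whole-file return, the arguments reduce to a
+ fixed pattern, sketched here in C. The structures are simplified
+ stand-ins for the XDR types (the stateid is reduced to seqid plus
+ opaque bytes).
+
+    /* Sketch: build a LAYOUTRETURN4_FILE return covering the whole
+     * file: offset 0 and the special length NFS4_UINT64_MAX. */
+    #include <stdint.h>
+
+    #define NFS4_UINT64_MAX 0xffffffffffffffffULL
+
+    struct stateid { uint32_t seqid; uint8_t other[12]; };
+
+    struct layoutreturn_file {
+        uint64_t lrf_offset;
+        uint64_t lrf_length;
+        struct stateid lrf_stateid;
+    };
+
+    void
+    build_whole_file_return(struct layoutreturn_file *lrf,
+                            const struct stateid *layout_stateid)
+    {
+        lrf->lrf_offset = 0;
+        /* NFS4_UINT64_MAX: all bytes from lrf_offset onward. */
+        lrf->lrf_length = NFS4_UINT64_MAX;
+        /* A layout stateid from a previous layout operation; its
+         * seqid field must not be zero. */
+        lrf->lrf_stateid = *layout_stateid;
+    }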
+
+ For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY
+ specifies that all layouts that match the other arguments to
+ LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current
+ filehandle and range; fsid derived from current filehandle; or
+ LAYOUTRETURN4_ALL) are being returned.
+
+ In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid
+ provided by the client is a layout stateid as returned from previous
+ layout operations. Note that the "seqid" field of lrf_stateid MUST
+ NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further
+ discussion and requirements.
+
+ Return of a layout or all layouts does not invalidate the mapping of
+ storage device ID to a storage device address. The mapping remains
+ in effect until specifically changed or deleted via device ID
+ notification callbacks. Of course, if there are no remaining layouts
+ that refer to a previously used device ID, the server is free to
+ delete a device ID without a notification callback, which will be the
+ case when notifications are not in effect.
+
+ If the lora_reclaim field is set to TRUE, the client is attempting to
+ return a layout that was acquired before the restart of the metadata
+ server during the metadata server's grace period. When returning
+ layouts that were acquired during the metadata server's grace period,
+ the client MUST set the lora_reclaim field to FALSE. The
+ lora_reclaim field MUST be set to FALSE also when lr_returntype is
+ LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT
+ (Section 18.42) for more details.
+
+ Layouts may be returned when recalled or voluntarily (i.e., before
+ the server has recalled them). In either case, the client must
+ properly propagate state changed under the context of the layout to
+ the storage device(s) or to the metadata server before returning the
+ layout.
+
+ If the client returns the layout in response to a CB_LAYOUTRECALL
+ where the lor_recalltype field of the clora_recall field was
+ LAYOUTRECALL4_FILE, the client should use the lor_stateid value from
+ CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should
+ use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid
+ (from a previous LAYOUTRETURN result). This is done to indicate the
+ point in time (in terms of layout stateid transitions) when the
+ recall was sent. The client uses the precise lrf_stateid value and
+ MUST NOT set the stateid's seqid to zero; otherwise,
+ NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be
+ returned if the client is using an old seqid, and the server knows
+ the client should not be using the old seqid. For example, the
+ client uses the seqid on slot 1 of the session, receives the response
+ with the new seqid, and uses the slot to send another request with
+ the old seqid.
+
+ If a client fails to return a layout in a timely manner, then the
+ metadata server SHOULD use its control protocol with the storage
+ devices to fence the client from accessing the data referenced by the
+ layout. See Section 12.5.5 for more details.
+
+ If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after
+ the metadata server's grace period, NFS4ERR_NO_GRACE is returned.
+
+ If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and
+ lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL,
+ NFS4ERR_INVAL is returned.
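+
+ The stateid-selection rule above can be condensed into a small
+ helper, sketched here under the assumption that the client tracks
+ both the stateid from the most recent LAYOUTGET/LAYOUTRETURN and,
+ when answering a recall, the lor_stateid carried by the
+ CB_LAYOUTRECALL.
+
+    /* Sketch: choose the lrf_stateid for a LAYOUTRETURN4_FILE. */
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct stateid { uint32_t seqid; uint8_t other[12]; };
+
+    struct stateid
+    choose_lrf_stateid(bool answering_file_recall,
+                       const struct stateid *recall_lor_stateid,
+                       const struct stateid *last_layout_stateid)
+    {
+        /* Echo lor_stateid when responding to a LAYOUTRECALL4_FILE
+         * recall; otherwise use the latest logr_stateid or
+         * lorr_stateid. In both cases seqid is nonzero. */
+        return answering_file_recall ? *recall_lor_stateid
+                                     : *last_layout_stateid;
+    }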
+
+ If the client sets the lr_returntype field to LAYOUTRETURN4_FILE,
+ then the lrs_stateid field will represent the layout stateid as
+ updated for this operation's processing; the current stateid will
+ also be updated to match the returned value. If the last byte of any
+ layout for the current file, client ID, and layout type is being
+ returned and there are no remaining pending CB_LAYOUTRECALL
+ operations for which a LAYOUTRETURN operation must be done,
+ lrs_present MUST be FALSE, and no stateid will be returned. In
+ addition, the COMPOUND request's current stateid will be set to the
+ all-zeroes special stateid (see Section 16.2.3.1.2). The server MUST
+ reject with NFS4ERR_BAD_STATEID any further use of the current
+ stateid in that COMPOUND until the current stateid is re-established
+ by a later stateid-returning operation.
+
+ On success, the current filehandle retains its value.
+
+ If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the
+ client ID (see Section 18.35), the server will require that the
+ combination of principal, security flavor, and, if applicable, GSS
+ mechanism that acquired the layout also be the one to send
+ LAYOUTRETURN. This might not be possible if credentials for the
+ principal are no longer available. The server will allow the machine
+ credential or SSV credential (see Section 18.35) to send LAYOUTRETURN
+ if LAYOUTRETURN's operation code was set in the spo_must_allow result
+ of EXCHANGE_ID.
+
+18.44.4. IMPLEMENTATION
+
+ The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL
+ callback MUST be serialized with any outstanding, intersecting
+ LAYOUTRETURN operations. Note that it is possible that while a
+ client is returning the layout for some recalled range, the server
+ may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the
+ final return operation for the latter must block until the former
+ layout recall is done.
+
+ Returning all layouts in a file system using LAYOUTRETURN4_FSID is
+ typically done in response to a CB_LAYOUTRECALL for that file system
+ as the final return operation. Similarly, LAYOUTRETURN4_ALL is used
+ in response to a recall callback for all layouts. It is possible
+ that the client already returned some outstanding layouts via
+ individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or
+ LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See
+ Section 12.5.5.1 for more details.
+
+ Once the client has returned all layouts referring to a particular
+ device ID, the server MAY delete the device ID.
+
+18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object
+
+18.45.1. ARGUMENT
+
+ enum secinfo_style4 {
+ SECINFO_STYLE4_CURRENT_FH = 0,
+ SECINFO_STYLE4_PARENT = 1
+ };
+
+ /* CURRENT_FH: object or child directory */
+ typedef secinfo_style4 SECINFO_NO_NAME4args;
+
+18.45.2. RESULT
+
+ /* CURRENTFH: consumed if status is NFS4_OK */
+ typedef SECINFO4res SECINFO_NO_NAME4res;
+
+18.45.3. DESCRIPTION
+
+ Like the SECINFO operation, SECINFO_NO_NAME is used by the client to
+ obtain a list of valid RPC authentication flavors for a specific file
+ object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that
+ are accessed by filehandle.
+
+ There are two styles of SECINFO_NO_NAME, as determined by the value
+ of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is
+ passed, then SECINFO_NO_NAME is querying for the required security
+ for the current filehandle.
If SECINFO_STYLE4_PARENT is passed, then + SECINFO_NO_NAME is querying for the required security of the current + filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, + then SECINFO should apply the same access methodology used for + LOOKUPP when evaluating the traversal to the parent directory. + Therefore, if the requester does not have the appropriate access to + LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and + return NFS4ERR_ACCESS. + + If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH returns NFS4ERR_WRONGSEC, + then the client resolves the situation by sending a COMPOUND request + that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed + by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See Section 2.6 + for instructions on dealing with NFS4ERR_WRONGSEC error returns from + PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH. + + If SECINFO_STYLE4_PARENT is specified and there is no parent + directory, SECINFO_NO_NAME MUST return NFS4ERR_NOENT. + + On success, the current filehandle is consumed (see + Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME + tries to use the current filehandle, that operation will fail with + the status NFS4ERR_NOFILEHANDLE. + + Everything else about SECINFO_NO_NAME is the same as SECINFO. See + the discussion on SECINFO (Section 18.29.3). + +18.45.4. IMPLEMENTATION + + See the discussion on SECINFO (Section 18.29.4). + +18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing and + Control + +18.46.1. ARGUMENT + + struct SEQUENCE4args { + sessionid4 sa_sessionid; + sequenceid4 sa_sequenceid; + slotid4 sa_slotid; + slotid4 sa_highest_slotid; + bool sa_cachethis; + }; + +18.46.2. RESULT + + const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001; + const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002; + const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004; + const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008; + const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010; + const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020; + const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040; + const SEQ4_STATUS_LEASE_MOVED = 0x00000080; + const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100; + const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200; + const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400; + const SEQ4_STATUS_DEVID_CHANGED = 0x00000800; + const SEQ4_STATUS_DEVID_DELETED = 0x00001000; + + struct SEQUENCE4resok { + sessionid4 sr_sessionid; + sequenceid4 sr_sequenceid; + slotid4 sr_slotid; + slotid4 sr_highest_slotid; + slotid4 sr_target_highest_slotid; + uint32_t sr_status_flags; + }; + + union SEQUENCE4res switch (nfsstat4 sr_status) { + case NFS4_OK: + SEQUENCE4resok sr_resok4; + default: + void; + }; + +18.46.3. DESCRIPTION + + The SEQUENCE operation is used by the server to implement session + request control and the reply cache semantics. + + SEQUENCE MUST appear as the first operation of any COMPOUND in which + it appears. The error NFS4ERR_SEQUENCE_POS will be returned when it + is found in any position in a COMPOUND beyond the first. Operations + other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, + CREATE_SESSION, and DESTROY_SESSION, MUST NOT appear as the first + operation in a COMPOUND. Such operations MUST yield the error + NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a + COMPOUND. 
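+
+ The first-operation restriction stated above amounts to a small
+ whitelist. The C sketch below encodes it; the numeric op codes
+ mirror the operation numbers used in the section titles of this
+ document and should be read as assumptions rather than normative
+ values.
+
+    /* Sketch: which operations may legally begin a COMPOUND; any
+     * other leading op draws NFS4ERR_OP_NOT_IN_SESSION. */
+    #include <stdbool.h>
+
+    enum {
+        OP_BIND_CONN_TO_SESSION = 41,
+        OP_EXCHANGE_ID          = 42,
+        OP_CREATE_SESSION       = 43,
+        OP_DESTROY_SESSION      = 44,
+        OP_SEQUENCE             = 53
+    };
+
+    bool
+    may_begin_compound(int op)
+    {
+        switch (op) {
+        case OP_SEQUENCE:
+        case OP_BIND_CONN_TO_SESSION:
+        case OP_EXCHANGE_ID:
+        case OP_CREATE_SESSION:
+        case OP_DESTROY_SESSION:
+            return true;
+        default:
+            return false;
+        }
+    }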
+
+   If SEQUENCE is received on a connection not associated with the
+   session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection
+   association enforcement is enabled (see Section 18.35), then the
+   server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION.
+
+   The sa_sessionid argument identifies the session to which this
+   request applies. The sr_sessionid result MUST equal sa_sessionid.
+
+   The sa_slotid argument is the index in the reply cache for the
+   request. The sa_sequenceid field is the sequence number of the
+   request for the reply cache entry (slot). The sr_slotid result MUST
+   equal sa_slotid. The sr_sequenceid result MUST equal sa_sequenceid.
+
+   The sa_highest_slotid argument is the highest slot ID for which the
+   client has a request outstanding; it could be equal to sa_slotid.
+   The server returns two "highest_slotid" values: sr_highest_slotid and
+   sr_target_highest_slotid. The former is the highest slot ID the
+   server will accept in future SEQUENCE operations, and SHOULD NOT be
+   less than the value of sa_highest_slotid (but see Section 2.10.6.1
+   for an exception). The latter is the highest slot ID the server
+   would prefer the client use on a future SEQUENCE operation.
+
+   If sa_cachethis is TRUE, then the client is requesting that the
+   server cache the entire reply in the server's reply cache; therefore,
+   the server MUST cache the reply (see Section 2.10.6.1.3). The server
+   MAY cache the reply if sa_cachethis is FALSE. If the server does not
+   cache the entire reply, it MUST still record that it executed the
+   request at the specified slot and sequence ID.
+
+   The response to the SEQUENCE operation contains a word of status
+   flags (sr_status_flags) that can provide to the client information
+   related to the status of the client's lock state and communications
+   paths. Note that any status bits relating to lock state MAY be reset
+   when lock state is lost due to a server restart (even if the session
+   is persistent across restarts; session persistence does not imply
+   lock state persistence) or the establishment of a new client
+   instance.
+
+   SEQ4_STATUS_CB_PATH_DOWN
+      When set, indicates that the client has no operational backchannel
+      path for any session associated with the client ID, making it
+      necessary for the client to re-establish one. This bit remains
+      set on all SEQUENCE responses on all sessions associated with the
+      client ID until at least one backchannel is available on any
+      session associated with the client ID. If the client fails to re-
+      establish a backchannel for the client ID, it is subject to having
+      recallable state revoked.
+
+   SEQ4_STATUS_CB_PATH_DOWN_SESSION
+      When set, indicates that the session has no operational
+      backchannel. There are two reasons why
+      SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not
+      SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that
+      applies specifically to the session (e.g., CB_RECALL_SLOT, see
+      Section 20.8) needs to be sent. Second is that the server did
+      send a callback operation, but the connection was lost before the
+      reply. The server cannot be sure whether or not the client
+      received the callback operation, and so, per rules on request
+      retry, the server MUST retry the callback operation over the same
+      session. The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the
+      indication to the client that it needs to associate a connection
+      to the session's backchannel.
+      This bit remains set on all SEQUENCE responses of the session
+      until a connection is associated with the session's backchannel.
+      If the client fails to re-establish a backchannel for the session,
+      it is subject to having recallable state revoked.
+
+   SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING
+      When set, indicates that all GSS contexts or RPCSEC_GSS handles
+      assigned to the session's backchannel will expire within a period
+      equal to the lease time. This bit remains set on all SEQUENCE
+      replies until at least one of the following is true:
+
+      *  All SSV RPCSEC_GSS handles on the session's backchannel have
+         been destroyed and all non-SSV GSS contexts have expired.
+
+      *  At least one more SSV RPCSEC_GSS handle has been added to the
+         backchannel.
+
+      *  The expiration time of at least one non-SSV GSS context of an
+         RPCSEC_GSS handle is beyond the lease period from the current
+         time (relative to the time when the SEQUENCE response was
+         sent).
+
+   SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED
+      When set, indicates that all non-SSV GSS contexts and all SSV
+      RPCSEC_GSS handles assigned to the session's backchannel have
+      expired or have been destroyed. This bit remains set on all
+      SEQUENCE replies until at least one non-expired non-SSV GSS
+      context for the session's backchannel has been established or at
+      least one SSV RPCSEC_GSS handle has been assigned to the
+      backchannel.
+
+   SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
+      When set, indicates that the lease has expired and as a result the
+      server released all of the client's locking state. This status
+      bit remains set on all SEQUENCE replies until the loss of all such
+      locks has been acknowledged by use of FREE_STATEID (see
+      Section 18.38), or by establishing a new client instance by
+      destroying all sessions (via DESTROY_SESSION), the client ID (via
+      DESTROY_CLIENTID), and then invoking EXCHANGE_ID and
+      CREATE_SESSION to establish a new client ID.
+
+   SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
+      When set, indicates that some subset of the client's locks have
+      been revoked due to expiration of the lease period followed by
+      another client's conflicting LOCK operation. This status bit
+      remains set on all SEQUENCE replies until the loss of all such
+      locks has been acknowledged by use of FREE_STATEID.
+
+   SEQ4_STATUS_ADMIN_STATE_REVOKED
+      When set, indicates that one or more locks have been revoked
+      without expiration of the lease period, due to administrative
+      action. This status bit remains set on all SEQUENCE replies until
+      the loss of all such locks has been acknowledged by use of
+      FREE_STATEID.
+
+   SEQ4_STATUS_RECALLABLE_STATE_REVOKED
+      When set, indicates that one or more recallable objects have been
+      revoked without expiration of the lease period, due to the
+      client's failure to return them when recalled, which may be a
+      consequence of there being no working backchannel and the client
+      failing to re-establish a backchannel per the
+      SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or
+      SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit
+      remains set on all SEQUENCE replies until the loss of all such
+      locks has been acknowledged by use of FREE_STATEID.
+
+   SEQ4_STATUS_LEASE_MOVED
+      When set, indicates that responsibility for lease renewal has been
+      transferred to one or more new servers.
+      This condition will continue until the client receives an
+      NFS4ERR_MOVED error and the server receives the subsequent GETATTR
+      for the fs_locations or fs_locations_info attribute for an access
+      to each file system for which a lease has been moved to a new
+      server. See Section 11.11.9.2.
+
+   SEQ4_STATUS_RESTART_RECLAIM_NEEDED
+      When set, indicates that due to server restart, the client must
+      reclaim locking state. Until the client sends a global
+      RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will
+      return SEQ4_STATUS_RESTART_RECLAIM_NEEDED.
+
+   SEQ4_STATUS_BACKCHANNEL_FAULT
+      The server has encountered an unrecoverable fault with the
+      backchannel (e.g., it has lost track of the sequence ID for a slot
+      in the backchannel). The client MUST stop sending more requests
+      on the session's fore channel, wait for all outstanding requests
+      to complete on the fore and back channel, and then destroy the
+      session.
+
+   SEQ4_STATUS_DEVID_CHANGED
+      The client is using device ID notifications and the server has
+      changed a device ID mapping held by the client. This flag will
+      stay present until the client has obtained the new mapping with
+      GETDEVICEINFO.
+
+   SEQ4_STATUS_DEVID_DELETED
+      The client is using device ID notifications and the server has
+      deleted a device ID mapping held by the client. This flag will
+      stay in effect until the client sends a GETDEVICEINFO on the
+      device ID with a null value in the argument gdia_notify_types.
+
+   The value of the sa_sequenceid argument relative to the cached
+   sequence ID on the slot falls into one of three cases.
+
+   *  If the difference between sa_sequenceid and the server's cached
+      sequence ID at the slot ID is two (2) or more, or if sa_sequenceid
+      is less than the cached sequence ID (accounting for wraparound of
+      the unsigned sequence ID value), then the server MUST return
+      NFS4ERR_SEQ_MISORDERED.
+
+   *  If sa_sequenceid and the cached sequence ID are the same, this is
+      a retry, and the server replies with what is recorded in the reply
+      cache. The lease is possibly renewed as described below.
+
+   *  If sa_sequenceid is one greater (accounting for wraparound) than
+      the cached sequence ID, then this is a new request, and the slot's
+      sequence ID is incremented. The operations subsequent to
+      SEQUENCE, if any, are processed. If there are no other
+      operations, the only other effects are to cache the SEQUENCE reply
+      in the slot, maintain the session's activity, and possibly renew
+      the lease.
+
+   If the client reuses a slot ID and sequence ID for a completely
+   different request, the server MAY treat the request as if it is a
+   retry of what it has already executed. The server MAY, however,
+   detect the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.
+
+   If SEQUENCE returns an error, then the state of the slot (sequence
+   ID, cached reply) MUST NOT change, and the associated lease MUST NOT
+   be renewed.
+
+   If SEQUENCE returns NFS4_OK, then the associated lease MUST be
+   renewed (see Section 8.3), except if
+   SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags.
+
+18.46.4.  IMPLEMENTATION
+
+   The server MUST maintain a mapping of session ID to client ID in
+   order to validate any operations that follow SEQUENCE that take a
+   stateid as an argument and/or result.
+
+   If the client establishes a persistent session, then a SEQUENCE
+   received after a server restart might encounter requests performed
+   and recorded in a persistent reply cache before the server restart.
+   In this case, SEQUENCE will be processed successfully, while requests
+   that were not previously performed and recorded are rejected with
+   NFS4ERR_DEADSESSION.
+
+   Depending on which of the operations within the COMPOUND were
+   successfully performed before the server restart, these operations
+   will also have replies sent from the server reply cache. Note that
+   when these operations establish locking state, it is locking state
+   that applies to the previous server instance and to the previous
+   client ID, even though the server restart, which logically happened
+   after these operations, eliminated that state. In the case of a
+   partially executed COMPOUND, processing may reach an operation not
+   processed during the earlier server instance, making this operation a
+   new one and not performable on the existing session. In this case,
+   NFS4ERR_DEADSESSION will be returned from that operation.
+
+18.47.  Operation 54: SET_SSV - Update SSV for a Client ID
+
+18.47.1.  ARGUMENT
+
+   struct ssa_digest_input4 {
+           SEQUENCE4args sdi_seqargs;
+   };
+
+   struct SET_SSV4args {
+           opaque          ssa_ssv<>;
+           opaque          ssa_digest<>;
+   };
+
+18.47.2.  RESULT
+
+   struct ssr_digest_input4 {
+           SEQUENCE4res sdi_seqres;
+   };
+
+   struct SET_SSV4resok {
+           opaque          ssr_digest<>;
+   };
+
+   union SET_SSV4res switch (nfsstat4 ssr_status) {
+   case NFS4_OK:
+           SET_SSV4resok   ssr_resok4;
+   default:
+           void;
+   };
+
+18.47.3.  DESCRIPTION
+
+   This operation is used to update the SSV for a client ID. Before
+   SET_SSV is called the first time on a client ID, the SSV is zero.
+   The SSV is the key used for the SSV GSS mechanism (Section 2.10.9).
+
+   SET_SSV MUST be preceded by a SEQUENCE operation in the same
+   COMPOUND. It MUST NOT be used if the client did not opt for SP4_SSV
+   state protection when the client ID was created (see Section 18.35);
+   the server returns NFS4ERR_INVAL in that case.
+
+   The field ssa_digest is computed as the output of the HMAC (RFC 2104
+   [52]) using the subkey derived from the SSV4_SUBKEY_MIC_I2T and
+   current SSV as the key (see Section 2.10.9 for a description of
+   subkeys), and an XDR encoded value of data type ssa_digest_input4.
+   The field sdi_seqargs is equal to the arguments of the SEQUENCE
+   operation for the COMPOUND procedure that SET_SSV is within.
+
+   The argument ssa_ssv is XORed with the current SSV to produce the new
+   SSV. The argument ssa_ssv SHOULD be generated randomly.
+
+   In the response, ssr_digest is the output of the HMAC using the
+   subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and
+   an XDR encoded value of data type ssr_digest_input4. The field
+   sdi_seqres is equal to the results of the SEQUENCE operation for the
+   COMPOUND procedure that SET_SSV is within.
+
+   As noted in Section 18.35, the client and server can maintain
+   multiple concurrent versions of the SSV. The client and server each
+   MUST maintain an internal SSV version number, which is set to one the
+   first time SET_SSV executes on the server and the client receives the
+   first SET_SSV reply. Each subsequent SET_SSV increases the internal
+   SSV version number by one. The value of this version number
+   corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, and
+   ssct_ssv_seq fields of the SSV GSS mechanism tokens (see
+   Section 2.10.9).
+
+18.47.4.  IMPLEMENTATION
+
+   When the server receives ssa_digest, it MUST verify the digest by
+   computing the digest the same way the client did and comparing it
+   with ssa_digest. If the server gets a different result, this is an
+   error, NFS4ERR_BAD_SESSION_DIGEST.
+   This error might be the result of
+   another SET_SSV from the same client ID changing the SSV. If so, the
+   client recovers by sending a SET_SSV operation again with a
+   recomputed digest based on the subkey of the new SSV. If the
+   transport connection is dropped after the SET_SSV request is sent,
+   but before the SET_SSV reply is received, then there are special
+   considerations for recovery if the client has no more connections
+   associated with sessions associated with the client ID of the SSV.
+   See Section 18.34.4.
+
+   Clients SHOULD NOT send an ssa_ssv that is equal to a previous
+   ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv
+   equal to zero since the SSV is initialized to zero when the client ID
+   is created).
+
+   Clients SHOULD send SET_SSV with RPCSEC_GSS privacy. Servers MUST
+   support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE,
+   SET_SSV }.
+
+   A client SHOULD NOT send SET_SSV with the SSV GSS mechanism's
+   credential because the purpose of SET_SSV is to seed the SSV from
+   non-SSV credentials. Instead, SET_SSV SHOULD be sent with the
+   credential of a user that is accessing the client ID for the first
+   time (Section 2.10.8.3). However, if the client does send SET_SSV
+   with SSV credentials, the digest protecting the arguments uses the
+   value of the SSV before ssa_ssv is XORed in, and the digest
+   protecting the results uses the value of the SSV after the ssa_ssv is
+   XORed in.
+
+18.48.  Operation 55: TEST_STATEID - Test Stateids for Validity
+
+18.48.1.  ARGUMENT
+
+   struct TEST_STATEID4args {
+           stateid4        ts_stateids<>;
+   };
+
+18.48.2.  RESULT
+
+   struct TEST_STATEID4resok {
+           nfsstat4        tsr_status_codes<>;
+   };
+
+   union TEST_STATEID4res switch (nfsstat4 tsr_status) {
+   case NFS4_OK:
+           TEST_STATEID4resok tsr_resok4;
+   default:
+           void;
+   };
+
+18.48.3.  DESCRIPTION
+
+   The TEST_STATEID operation is used to check the validity of a set of
+   stateids. It can be used at any time, but the client should
+   definitely use it when it receives an indication that one or more of
+   its stateids have been invalidated due to lock revocation. This
+   occurs when the SEQUENCE operation returns with one of the following
+   sr_status_flags set:
+
+   *  SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
+
+   *  SEQ4_STATUS_ADMIN_STATE_REVOKED
+
+   *  SEQ4_STATUS_RECALLABLE_STATE_REVOKED
+
+   The client can use TEST_STATEID one or more times to test the
+   validity of its stateids. Each use of TEST_STATEID allows a large
+   set of such stateids to be tested and avoids problems with earlier
+   stateids in a COMPOUND request from interfering with the checking of
+   subsequent stateids, as would happen if individual stateids were
+   tested by a series of corresponding operations in a COMPOUND
+   request.
+
+   For each stateid, the server returns the status code that would be
+   returned if that stateid were to be used in normal operation.
+   Returning such a status indication is not an error and does not cause
+   COMPOUND processing to terminate. Checks for the validity of the
+   stateid proceed as they would for normal operations with a number of
+   exceptions:
+
+   *  There is no check for the type of stateid object, as would be the
+      case for normal use of a stateid.
+
+   *  There is no reference to the current filehandle.
+
+   *  Special stateids are always considered invalid (they result in the
+      error code NFS4ERR_BAD_STATEID).
+
+   All stateids are interpreted as being associated with the client for
+   the current session.
+   Any possible association with a previous
+   instance of the client (as stale stateids) is not considered.
+
+   The valid status values in the returned status_code array are
+   NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID,
+   NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED.
+
+18.48.4.  IMPLEMENTATION
+
+   See Sections 8.2.2 and 8.2.4 for a discussion of stateid structure,
+   lifetime, and validation.
+
+18.49.  Operation 56: WANT_DELEGATION - Request Delegation
+
+18.49.1.  ARGUMENT
+
+   union deleg_claim4 switch (open_claim_type4 dc_claim) {
+   /*
+    * No special rights to object. Ordinary delegation
+    * request of the specified object. Object identified
+    * by filehandle.
+    */
+   case CLAIM_FH: /* new to v4.1 */
+           /* CURRENT_FH: object being delegated */
+           void;
+
+   /*
+    * Right to file based on a delegation granted
+    * to a previous boot instance of the client.
+    * File is specified by filehandle.
+    */
+   case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
+           /* CURRENT_FH: object being delegated */
+           void;
+
+   /*
+    * Right to the file established by an open previous
+    * to server reboot. File identified by filehandle.
+    * Used during server reclaim grace period.
+    */
+   case CLAIM_PREVIOUS:
+           /* CURRENT_FH: object being reclaimed */
+           open_delegation_type4   dc_delegate_type;
+   };
+
+   struct WANT_DELEGATION4args {
+           uint32_t        wda_want;
+           deleg_claim4    wda_claim;
+   };
+
+18.49.2.  RESULT
+
+   union WANT_DELEGATION4res switch (nfsstat4 wdr_status) {
+   case NFS4_OK:
+           open_delegation4 wdr_resok4;
+   default:
+           void;
+   };
+
+18.49.3.  DESCRIPTION
+
+   Where this description mandates the return of a specific error code
+   for a specific condition, and where multiple conditions apply, the
+   server MAY return any of the mandated error codes.
+
+   This operation allows a client to:
+
+   *  Get a delegation on all types of files except directories.
+
+   *  Register a "want" for a delegation for the specified file object,
+      and be notified via a callback when the delegation is available.
+      The server MAY support notifications of availability via
+      callbacks. If the server does not support registration of wants,
+      it MUST NOT return an error to indicate that, and instead MUST
+      return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and
+      ond_server_will_push_deleg or ond_server_will_signal_avail set to
+      FALSE. When the server indicates that it will notify the client
+      by means of a callback, it will either provide the delegation
+      using a CB_PUSH_DELEG operation or cancel its promise by sending a
+      CB_WANTS_CANCELLED operation.
+
+   *  Cancel a want for a delegation.
+
+   The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set
+   OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST
+   ignore them.
+
+   The meanings of the following flags in wda_want are the same as they
+   are in OPEN, except as noted below.
+
+   *  OPEN4_SHARE_ACCESS_WANT_READ_DELEG
+
+   *  OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
+
+   *  OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
+
+   *  OPEN4_SHARE_ACCESS_WANT_NO_DELEG. Unlike the OPEN operation, this
+      flag SHOULD NOT be set by the client in the arguments to
+      WANT_DELEGATION, and MUST be ignored by the server.
+
+   *  OPEN4_SHARE_ACCESS_WANT_CANCEL
+
+   *  OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
+
+   *  OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
+
+   The handling of the above flags in WANT_DELEGATION is the same as in
+   OPEN.
+   Information about the delegation and/or the promises the
+   server is making regarding future callbacks is the same as that
+   described in the open_delegation4 structure.
+
+   The successful results of WANT_DELEGATION are of data type
+   open_delegation4, which is the same data type as the "delegation"
+   field in the results of the OPEN operation (see Section 18.16.3).
+   The server constructs wdr_resok4 the same way it constructs OPEN's
+   "delegation" with one difference: WANT_DELEGATION MUST NOT return a
+   delegation type of OPEN_DELEGATE_NONE.
+
+   If ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) &
+   ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) is zero, then the client is
+   indicating no explicit desire or non-desire for a delegation and the
+   server MUST return NFS4ERR_INVAL.
+
+   The client uses the OPEN4_SHARE_ACCESS_WANT_CANCEL flag in the
+   WANT_DELEGATION operation to cancel a previously requested want for a
+   delegation. Note that if the server is in the process of sending the
+   delegation (via CB_PUSH_DELEG) at the time the client sends a
+   cancellation of the want, the delegation might still be pushed to the
+   client.
+
+   If WANT_DELEGATION fails to return a delegation, and the server
+   returns NFS4_OK, the server MUST set the delegation type to
+   OPEN_DELEGATE_NONE_EXT, and set od_whynone, as described in
+   Section 18.16. Write delegations are not available for file types
+   that are not writable. This includes file objects of types NF4BLK,
+   NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO. If the client requests
+   OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without
+   OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the
+   aforementioned file types, the server must set
+   wdr_resok4.od_whynone.ond_why to WND4_WRITE_DELEG_NOT_SUPP_FTYPE.
+
+18.49.4.  IMPLEMENTATION
+
+   A request for a conflicting delegation is not normally intended to
+   trigger the recall of the existing delegation. Servers may choose to
+   treat some clients as having higher priority such that their wants
+   will trigger recall of an existing delegation, although that is
+   expected to be an unusual situation.
+
+   Servers will generally recall delegations assigned by WANT_DELEGATION
+   on the same basis as those assigned by OPEN. CB_RECALL will
+   generally be done only when other clients perform operations
+   inconsistent with the delegation. The normal response to aging of
+   delegations is to use CB_RECALL_ANY, in order to give the client the
+   opportunity to keep the delegations most useful from its point of
+   view.
+
+18.50.  Operation 57: DESTROY_CLIENTID - Destroy a Client ID
+
+18.50.1.  ARGUMENT
+
+   struct DESTROY_CLIENTID4args {
+           clientid4       dca_clientid;
+   };
+
+18.50.2.  RESULT
+
+   struct DESTROY_CLIENTID4res {
+           nfsstat4        dcr_status;
+   };
+
+18.50.3.  DESCRIPTION
+
+   The DESTROY_CLIENTID operation destroys the client ID. If there are
+   sessions (both idle and non-idle), opens, locks, delegations,
+   layouts, and/or wants (Section 18.49) associated with the unexpired
+   lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY.
+   DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as
+   the client ID derived from the session ID of SEQUENCE is not the same
+   as the client ID to be destroyed. If the client IDs are the same,
+   then the server MUST return NFS4ERR_CLIENTID_BUSY.
+
+   If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only
+   operation in the COMPOUND request (otherwise, the server MUST return
+   NFS4ERR_NOT_ONLY_OP).
+   If the operation is sent without a SEQUENCE
+   preceding it, a client that retransmits the request may receive an
+   error in response, because the original request might have been
+   successfully executed.
+
+18.50.4.  IMPLEMENTATION
+
+   DESTROY_CLIENTID allows a server to immediately reclaim the resources
+   consumed by an unused client ID, and also to forget that it ever
+   generated the client ID. By forgetting that it ever generated the
+   client ID, the server can safely reuse the client ID on a future
+   EXCHANGE_ID operation.
+
+18.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished
+
+18.51.1.  ARGUMENT
+
+   struct RECLAIM_COMPLETE4args {
+           /*
+            * If rca_one_fs TRUE,
+            *
+            *    CURRENT_FH: object in
+            *    file system reclaim is
+            *    complete for.
+            */
+           bool            rca_one_fs;
+   };
+
+18.51.2.  RESULTS
+
+   struct RECLAIM_COMPLETE4res {
+           nfsstat4        rcr_status;
+   };
+
+18.51.3.  DESCRIPTION
+
+   A RECLAIM_COMPLETE operation is used to indicate that the client has
+   reclaimed all of the locking state that it will recover using
+   reclaim, when it is recovering state due to either a server restart
+   or the migration of a file system to another server. There are two
+   types of RECLAIM_COMPLETE operations:
+
+   *  When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done.
+      This indicates that recovery of all locks that the client held on
+      the previous server instance has been completed. The current
+      filehandle need not be set in this case.
+
+   *  When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE
+      is being done. This indicates that recovery of locks for a single
+      fs (the one designated by the current filehandle) due to the
+      migration of the file system has been completed. Presence of a
+      current filehandle is required when rca_one_fs is set to TRUE.
+      When the current filehandle designates a filehandle in a file
+      system not in the process of migration, the operation returns
+      NFS4_OK and is otherwise ignored.
+
+   Once a RECLAIM_COMPLETE is done, there can be no further reclaim
+   operations for locks whose scope is defined as having completed
+   recovery. Once the client sends RECLAIM_COMPLETE, the server will
+   not allow the client to do subsequent reclaims of locking state for
+   that scope and, if these are attempted, will return NFS4ERR_NO_GRACE.
+
+   Whenever a client establishes a new client ID and before it does the
+   first non-reclaim operation that obtains a lock, it MUST send a
+   RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
+   locks to reclaim. If non-reclaim locking operations are done before
+   the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
+
+   Similarly, when the client accesses a migrated file system on a new
+   server, before it sends the first non-reclaim operation that obtains
+   a lock on this new server, it MUST send a RECLAIM_COMPLETE with
+   rca_one_fs set to TRUE and the current filehandle within that file
+   system, even if there are no locks to reclaim. If non-reclaim
+   locking operations are done on that file system before the
+   RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
+
+   It should be noted that there are situations in which a client needs
+   to issue both forms of RECLAIM_COMPLETE. An example is an instance
+   of file system migration in which the file system is migrated to a
+   server for which the client has no client ID.
+   As a result, the client
+   needs to obtain a client ID from the server (incurring the
+   responsibility to do RECLAIM_COMPLETE with rca_one_fs set to FALSE)
+   as well as RECLAIM_COMPLETE with rca_one_fs set to TRUE to complete
+   the per-fs grace period associated with the file system migration.
+   These two may be done in any order as long as all necessary lock
+   reclaims have been done before issuing either of them.
+
+   Any locks not reclaimed at the point at which RECLAIM_COMPLETE is
+   done become non-reclaimable. The client MUST NOT attempt to reclaim
+   them, either during the current server instance or in any subsequent
+   server instance, or on another server to which responsibility for
+   that file system is transferred. If the client were to do so, it
+   would be violating the protocol by representing itself as owning
+   locks that it does not own, and so has no right to reclaim. See
+   Section 8.4.3 of [66] for a discussion of edge conditions related to
+   lock reclaim.
+
+   By sending a RECLAIM_COMPLETE, the client indicates readiness to
+   proceed to do normal non-reclaim locking operations. The client
+   should be aware that such operations may temporarily result in
+   NFS4ERR_GRACE errors until the server is ready to terminate its grace
+   period.
+
+18.51.4.  IMPLEMENTATION
+
+   Servers will typically use the information as to when reclaim
+   activity is complete to reduce the length of the grace period. When
+   the server maintains in persistent storage a list of clients that
+   might have had locks, it is able to use the fact that all such
+   clients have done a RECLAIM_COMPLETE to terminate the grace period
+   and begin normal operations (i.e., grant requests for new locks)
+   sooner than it might otherwise.
+
+   Latency can be minimized by doing a RECLAIM_COMPLETE as part of the
+   COMPOUND request in which the last lock-reclaiming operation is done.
+   When there are no reclaims to be done, RECLAIM_COMPLETE should be
+   done immediately in order to allow the grace period to end as soon as
+   possible.
+
+   RECLAIM_COMPLETE should only be done once for each server instance or
+   occasion of the transition of a file system. If it is done a second
+   time, the error NFS4ERR_COMPLETE_ALREADY will result. Note that
+   because of the session feature's retry protection, retries of
+   COMPOUND requests containing a RECLAIM_COMPLETE operation will not
+   result in this error.
+
+   When a RECLAIM_COMPLETE is sent, the client effectively acknowledges
+   any locks not yet reclaimed as lost. This allows the server to re-
+   enable the client to recover locks if the occurrence of edge
+   conditions, as described in Section 8.4.3, had caused the server to
+   disable the client's ability to recover locks.
+
+   Because previous descriptions of RECLAIM_COMPLETE were not
+   sufficiently explicit about the circumstances in which use of
+   RECLAIM_COMPLETE with rca_one_fs set to TRUE was appropriate, there
+   have been cases in which it has been misused by clients that have
+   issued RECLAIM_COMPLETE with rca_one_fs set to TRUE when it should
+   have not been. There have also been cases in which servers have, in
+   various ways, not responded to such misuse as described above, either
+   ignoring the rca_one_fs setting (treating the operation as a global
+   RECLAIM_COMPLETE) or ignoring the entire operation.
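+
+   To make the intended sequencing concrete, the following sketch shows
+   the order of events a client would follow after a server restart.
+   It is illustrative C only and is not part of the protocol
+   specification: the nfs_client and nfs_lock_state types and all of
+   the nfs_*() helper functions are hypothetical stand-ins for an
+   implementation's own machinery.
+
+   /*
+    * Recover from a server restart: re-establish the client ID and a
+    * session, reclaim each lock held before the restart, and only
+    * then send a global RECLAIM_COMPLETE (rca_one_fs = FALSE).
+    */
+   static void
+   recover_after_server_restart(struct nfs_client *clp)
+   {
+           struct nfs_lock_state *ls;
+
+           nfs_exchange_id(clp);           /* EXCHANGE_ID */
+           nfs_create_session(clp);        /* CREATE_SESSION */
+
+           /* Reclaim locks held before the restart (CLAIM_PREVIOUS). */
+           for (ls = clp->held_locks; ls != NULL; ls = ls->next)
+                   nfs_reclaim_lock(clp, ls);
+
+           /*
+            * Global reclaim is finished; this MUST precede the first
+            * non-reclaim locking operation, even when there was
+            * nothing to reclaim.  New locking requests may still draw
+            * NFS4ERR_GRACE until the server ends its grace period.
+            */
+           nfs_reclaim_complete(clp, false /* rca_one_fs */);
+   }
+
+   The per-fs form (rca_one_fs TRUE) would be sent analogously after
+   the reclaims for a single migrated file system, with the current
+   filehandle set within that file system.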
+
+   While clients SHOULD NOT misuse this feature, and servers SHOULD
+   respond to such misuse as described above, implementors need to be
+   aware of the following considerations as they make necessary trade-
+   offs between interoperability with existing implementations and
+   proper support for facilities to allow lock recovery in the event of
+   file system migration.
+
+   *  When servers have no support for becoming the destination server
+      of a file system subject to migration, there is no possibility of
+      a per-fs RECLAIM_COMPLETE being done legitimately, and occurrences
+      of it SHOULD be ignored. However, the negative consequences of
+      accepting such mistaken use are quite limited as long as the
+      client does not issue it before all necessary reclaims are done.
+
+   *  When a server might become the destination for a file system being
+      migrated, inappropriate use of per-fs RECLAIM_COMPLETE is more
+      concerning. In the case in which the file system designated is
+      not within a per-fs grace period, the per-fs RECLAIM_COMPLETE
+      SHOULD be ignored, with the negative consequences of accepting it
+      being limited, as in the case in which migration is not supported.
+      However, if the server encounters a file system undergoing
+      migration, the operation cannot be accepted as if it were a global
+      RECLAIM_COMPLETE without invalidating its intended use.
+
+18.52.  Operation 10044: ILLEGAL - Illegal Operation
+
+18.52.1.  ARGUMENTS
+
+   void;
+
+18.52.2.  RESULTS
+
+   struct ILLEGAL4res {
+           nfsstat4        status;
+   };
+
+18.52.3.  DESCRIPTION
+
+   This operation is a placeholder for encoding a result to handle the
+   case of the client sending an operation code within COMPOUND that is
+   not supported. See the COMPOUND procedure description for more
+   details.
+
+   The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.
+
+18.52.4.  IMPLEMENTATION
+
+   A client will probably not send an operation with code OP_ILLEGAL,
+   but if it does, the response will be ILLEGAL4res just as it would be
+   with any other invalid operation code. Note that if the server gets
+   an illegal operation code that is not OP_ILLEGAL, and if the server
+   checks for legal operation codes during the XDR decode phase, then
+   the ILLEGAL4res would not be returned.
+
+19.  NFSv4.1 Callback Procedures
+
+   The procedures used for callbacks are defined in the following
+   sections. In the interest of clarity, the terms "client" and
+   "server" refer to NFS clients and servers, despite the fact that for
+   an individual callback RPC, the sense of these terms would be
+   precisely the opposite.
+
+   Both procedures, CB_NULL and CB_COMPOUND, MUST be implemented.
+
+19.1.  Procedure 0: CB_NULL - No Operation
+
+19.1.1.  ARGUMENTS
+
+   void;
+
+19.1.2.  RESULTS
+
+   void;
+
+19.1.3.  DESCRIPTION
+
+   CB_NULL is the standard ONC RPC NULL procedure, with the standard
+   void argument and void response. Even though there is no direct
+   functionality associated with this procedure, the server will use
+   CB_NULL to confirm the existence of a path for RPCs from the server
+   to the client.
+
+19.1.4.  ERRORS
+
+   None.
+
+19.2.  Procedure 1: CB_COMPOUND - Compound Operations
+
+19.2.1.
ARGUMENTS + + enum nfs_cb_opnum4 { + OP_CB_GETATTR = 3, + OP_CB_RECALL = 4, + /* Callback operations new to NFSv4.1 */ + OP_CB_LAYOUTRECALL = 5, + OP_CB_NOTIFY = 6, + OP_CB_PUSH_DELEG = 7, + OP_CB_RECALL_ANY = 8, + OP_CB_RECALLABLE_OBJ_AVAIL = 9, + OP_CB_RECALL_SLOT = 10, + OP_CB_SEQUENCE = 11, + OP_CB_WANTS_CANCELLED = 12, + OP_CB_NOTIFY_LOCK = 13, + OP_CB_NOTIFY_DEVICEID = 14, + + OP_CB_ILLEGAL = 10044 + }; + + union nfs_cb_argop4 switch (unsigned argop) { + case OP_CB_GETATTR: + CB_GETATTR4args opcbgetattr; + case OP_CB_RECALL: + CB_RECALL4args opcbrecall; + case OP_CB_LAYOUTRECALL: + CB_LAYOUTRECALL4args opcblayoutrecall; + case OP_CB_NOTIFY: + CB_NOTIFY4args opcbnotify; + case OP_CB_PUSH_DELEG: + CB_PUSH_DELEG4args opcbpush_deleg; + case OP_CB_RECALL_ANY: + CB_RECALL_ANY4args opcbrecall_any; + case OP_CB_RECALLABLE_OBJ_AVAIL: + CB_RECALLABLE_OBJ_AVAIL4args opcbrecallable_obj_avail; + case OP_CB_RECALL_SLOT: + CB_RECALL_SLOT4args opcbrecall_slot; + case OP_CB_SEQUENCE: + CB_SEQUENCE4args opcbsequence; + case OP_CB_WANTS_CANCELLED: + CB_WANTS_CANCELLED4args opcbwants_cancelled; + case OP_CB_NOTIFY_LOCK: + CB_NOTIFY_LOCK4args opcbnotify_lock; + case OP_CB_NOTIFY_DEVICEID: + CB_NOTIFY_DEVICEID4args opcbnotify_deviceid; + case OP_CB_ILLEGAL: void; + }; + + struct CB_COMPOUND4args { + utf8str_cs tag; + uint32_t minorversion; + uint32_t callback_ident; + nfs_cb_argop4 argarray<>; + }; + +19.2.2. RESULTS + + union nfs_cb_resop4 switch (unsigned resop) { + case OP_CB_GETATTR: CB_GETATTR4res opcbgetattr; + case OP_CB_RECALL: CB_RECALL4res opcbrecall; + + /* new NFSv4.1 operations */ + case OP_CB_LAYOUTRECALL: + CB_LAYOUTRECALL4res + opcblayoutrecall; + + case OP_CB_NOTIFY: CB_NOTIFY4res opcbnotify; + + case OP_CB_PUSH_DELEG: CB_PUSH_DELEG4res + opcbpush_deleg; + + case OP_CB_RECALL_ANY: CB_RECALL_ANY4res + opcbrecall_any; + + case OP_CB_RECALLABLE_OBJ_AVAIL: + CB_RECALLABLE_OBJ_AVAIL4res + opcbrecallable_obj_avail; + + case OP_CB_RECALL_SLOT: + CB_RECALL_SLOT4res + opcbrecall_slot; + + case OP_CB_SEQUENCE: CB_SEQUENCE4res opcbsequence; + + case OP_CB_WANTS_CANCELLED: + CB_WANTS_CANCELLED4res + opcbwants_cancelled; + + case OP_CB_NOTIFY_LOCK: + CB_NOTIFY_LOCK4res + opcbnotify_lock; + + case OP_CB_NOTIFY_DEVICEID: + CB_NOTIFY_DEVICEID4res + opcbnotify_deviceid; + + /* Not new operation */ + case OP_CB_ILLEGAL: CB_ILLEGAL4res opcbillegal; + }; + + struct CB_COMPOUND4res { + nfsstat4 status; + utf8str_cs tag; + nfs_cb_resop4 resarray<>; + }; + +19.2.3. DESCRIPTION + + The CB_COMPOUND procedure is used to combine one or more of the + callback procedures into a single RPC request. The main callback RPC + program has two main procedures: CB_NULL and CB_COMPOUND. All other + operations use the CB_COMPOUND procedure as a wrapper. + + During the processing of the CB_COMPOUND procedure, the client may + find that it does not have the available resources to execute any or + all of the operations within the CB_COMPOUND sequence. Refer to + Section 2.10.6.4 for details. + + The minorversion field of the arguments MUST be the same as the + minorversion of the COMPOUND procedure used to create the client ID + and session. For NFSv4.1, minorversion MUST be set to 1. + + Contained within the CB_COMPOUND results is a "status" field. This + status MUST be equal to the status of the last operation that was + executed within the CB_COMPOUND procedure. Therefore, if an + operation incurred an error, then the "status" value will be the same + error value as is being returned for the operation that failed. 
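+
+   As an illustration of this rule, a client-side CB_COMPOUND
+   processing loop might look like the following sketch. This is
+   illustrative C only (not part of the protocol specification),
+   assuming the usual C mapping of the XDR types above, in which an XDR
+   variable-length array is rendered as a length/value pair; the
+   execute_cb_op() helper is hypothetical.
+
+   /*
+    * Execute the operations of a CB_COMPOUND in order, stopping at
+    * the first failure; the procedure's status is the status of the
+    * last operation executed.  Assumes res->resarray.resarray_val
+    * has been allocated with argarray_len entries.
+    */
+   static void
+   process_cb_compound(CB_COMPOUND4args *args, CB_COMPOUND4res *res)
+   {
+           nfsstat4 status = NFS4_OK;
+           uint32_t i = 0;
+
+           res->tag = args->tag;   /* the tag is echoed back */
+
+           if (args->minorversion != 1) {
+                   res->status = NFS4ERR_MINOR_VERS_MISMATCH;
+                   res->resarray.resarray_len = 0; /* zero operations */
+                   return;
+           }
+
+           for (i = 0; i < args->argarray.argarray_len; i++) {
+                   status = execute_cb_op(&args->argarray.argarray_val[i],
+                                          &res->resarray.resarray_val[i]);
+                   if (status != NFS4_OK)
+                           break;  /* later operations are not executed */
+           }
+
+           /* Results are returned for every operation executed. */
+           res->resarray.resarray_len = (status == NFS4_OK) ? i : i + 1;
+           res->status = status;
+   }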
+ + The "tag" field is handled the same way as that of the COMPOUND + procedure (see Section 16.2.3). + + Illegal operation codes are handled in the same way as they are + handled for the COMPOUND procedure. + +19.2.4. IMPLEMENTATION + + The CB_COMPOUND procedure is used to combine individual operations + into a single RPC request. The client interprets each of the + operations in turn. If an operation is executed by the client and + the status of that operation is NFS4_OK, then the next operation in + the CB_COMPOUND procedure is executed. The client continues this + process until there are no more operations to be executed or one of + the operations has a status value other than NFS4_OK. + +19.2.5. ERRORS + + CB_COMPOUND will of course return every error that each operation on + the backchannel can return (see Table 13). However, if CB_COMPOUND + returns zero operations, obviously the error returned by COMPOUND has + nothing to do with an error returned by an operation. The list of + errors CB_COMPOUND will return if it processes zero operations + includes: + + +==============================+==================================+ + | Error | Notes | + +==============================+==================================+ + | NFS4ERR_BADCHAR | The tag argument has a character | + | | the replier does not support. | + +------------------------------+----------------------------------+ + | NFS4ERR_BADXDR | | + +------------------------------+----------------------------------+ + | NFS4ERR_DELAY | | + +------------------------------+----------------------------------+ + | NFS4ERR_INVAL | The tag argument is not in UTF-8 | + | | encoding. | + +------------------------------+----------------------------------+ + | NFS4ERR_MINOR_VERS_MISMATCH | | + +------------------------------+----------------------------------+ + | NFS4ERR_SERVERFAULT | | + +------------------------------+----------------------------------+ + | NFS4ERR_TOO_MANY_OPS | | + +------------------------------+----------------------------------+ + | NFS4ERR_REP_TOO_BIG | | + +------------------------------+----------------------------------+ + | NFS4ERR_REP_TOO_BIG_TO_CACHE | | + +------------------------------+----------------------------------+ + | NFS4ERR_REQ_TOO_BIG | | + +------------------------------+----------------------------------+ + + Table 24: CB_COMPOUND Error Returns + +20. NFSv4.1 Callback Operations + +20.1. Operation 3: CB_GETATTR - Get Attributes + +20.1.1. ARGUMENT + + struct CB_GETATTR4args { + nfs_fh4 fh; + bitmap4 attr_request; + }; + +20.1.2. RESULT + + struct CB_GETATTR4resok { + fattr4 obj_attributes; + }; + + union CB_GETATTR4res switch (nfsstat4 status) { + case NFS4_OK: + CB_GETATTR4resok resok4; + default: + void; + }; + +20.1.3. DESCRIPTION + + The CB_GETATTR operation is used by the server to obtain the current + modified state of a file that has been OPEN_DELEGATE_WRITE delegated. + The size and change attributes are the only ones guaranteed to be + serviced by the client. See Section 10.4.3 for a full description of + how the client and server are to interact with the use of CB_GETATTR. + + If the filehandle specified is not one for which the client holds an + OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is + returned. + +20.1.4. IMPLEMENTATION + + The client returns attrmask bits and the associated attribute values + only for the change attribute, and attributes that it may change + (time_modify, and size). + +20.2. Operation 4: CB_RECALL - Recall a Delegation + +20.2.1. 
ARGUMENT + + struct CB_RECALL4args { + stateid4 stateid; + bool truncate; + nfs_fh4 fh; + }; + +20.2.2. RESULT + + struct CB_RECALL4res { + nfsstat4 status; + }; + +20.2.3. DESCRIPTION + + The CB_RECALL operation is used to begin the process of recalling a + delegation and returning it to the server. + + The truncate flag is used to optimize recall for a file object that + is a regular file and is about to be truncated to zero. When it is + TRUE, the client is freed of the obligation to propagate modified + data for the file to the server, since this data is irrelevant. + + If the handle specified is not one for which the client holds a + delegation, an NFS4ERR_BADHANDLE error is returned. + + If the stateid specified is not one corresponding to an OPEN + delegation for the file specified by the filehandle, an + NFS4ERR_BAD_STATEID is returned. + +20.2.4. IMPLEMENTATION + + The client SHOULD reply to the callback immediately. Replying does + not complete the recall except when the value of the reply's status + field is neither NFS4ERR_DELAY nor NFS4_OK. The recall is not + complete until the delegation is returned using a DELEGRETURN + operation. + +20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client + +20.3.1. ARGUMENT + + /* + * NFSv4.1 callback arguments and results + */ + + enum layoutrecall_type4 { + LAYOUTRECALL4_FILE = LAYOUT4_RET_REC_FILE, + LAYOUTRECALL4_FSID = LAYOUT4_RET_REC_FSID, + LAYOUTRECALL4_ALL = LAYOUT4_RET_REC_ALL + }; + + struct layoutrecall_file4 { + nfs_fh4 lor_fh; + offset4 lor_offset; + length4 lor_length; + stateid4 lor_stateid; + }; + + union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) { + case LAYOUTRECALL4_FILE: + layoutrecall_file4 lor_layout; + case LAYOUTRECALL4_FSID: + fsid4 lor_fsid; + case LAYOUTRECALL4_ALL: + void; + }; + + struct CB_LAYOUTRECALL4args { + layouttype4 clora_type; + layoutiomode4 clora_iomode; + bool clora_changed; + layoutrecall4 clora_recall; + }; + +20.3.2. RESULT + + struct CB_LAYOUTRECALL4res { + nfsstat4 clorr_status; + }; + +20.3.3. DESCRIPTION + + The CB_LAYOUTRECALL operation is used by the server to recall layouts + from the client; as a result, the client will begin the process of + returning layouts via LAYOUTRETURN. The CB_LAYOUTRECALL operation + specifies one of three forms of recall processing with the value of + layoutrecall_type4. The recall is for one of the following: a + specific layout of a specific file (LAYOUTRECALL4_FILE), an entire + file system ID (LAYOUTRECALL4_FSID), or all file systems + (LAYOUTRECALL4_ALL). + + The behavior of the operation varies based on the value of the + layoutrecall_type4. The value and behaviors are: + + LAYOUTRECALL4_FILE + For a layout to match the recall request, the values of the + following fields must match those of the layout: clora_type, + clora_iomode, lor_fh, and the byte-range specified by lor_offset + and lor_length. The clora_iomode field may have a special value + of LAYOUTIOMODE4_ANY. The special value LAYOUTIOMODE4_ANY will + match any iomode originally returned in a layout; therefore, it + acts as a wild card. The other special value used is for + lor_length. If lor_length has a value of NFS4_UINT64_MAX, the + lor_length field means the maximum possible file size. If a + matching layout is found, it MUST be returned using the + LAYOUTRETURN operation (see Section 18.44). 
For example, if clora_iomode is LAYOUTIOMODE4_ANY,
+      lor_offset is zero, and lor_length is NFS4_UINT64_MAX, the entire
+      layout is to be returned.
+
+      The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
+      client does not hold layouts for the file or if the client does
+      not have any overlapping layouts for the specification in the
+      layout recall.
+
+   LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL
+      If LAYOUTRECALL4_FSID is specified, the fsid specifies the file
+      system for which any outstanding layouts MUST be returned. If
+      LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be
+      returned. In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL
+      specify that all the storage device ID to storage device address
+      mappings in the affected file system(s) are also recalled. The
+      respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or
+      LAYOUTRETURN4_ALL acknowledges to the server that the client
+      invalidated the said device mappings. See Section 12.5.5.2.1.5
+      for considerations with "bulk" recall of layouts.
+
+      The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
+      client does not hold layouts and does not have valid deviceid
+      mappings.
+
+   In processing the layout recall request, the client also varies its
+   behavior based on the value of the clora_changed field. This field
+   is used by the server to provide additional context for the reason
+   why the layout is being recalled. A FALSE value for clora_changed
+   indicates that no change in the layout is expected and the client may
+   write modified data to the storage devices involved; this must be
+   done prior to returning the layout via LAYOUTRETURN. A TRUE value
+   for clora_changed indicates that the server is changing the layout.
+   Examples of layout changes and reasons for a TRUE indication are the
+   following: the metadata server is restriping the file or a permanent
+   error has occurred on a storage device and the metadata server would
+   like to provide a new layout for the file. Therefore, a
+   clora_changed value of TRUE indicates some level of change for the
+   layout and the client SHOULD NOT write and commit modified data to
+   the storage devices. In this case, the client writes and commits
+   data through the metadata server.
+
+   See Section 12.5.3 for a description of how the lor_stateid field in
+   the arguments is to be constructed. Note that the "seqid" field of
+   lor_stateid MUST NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2
+   for a further discussion and requirements.
+
+20.3.4.  IMPLEMENTATION
+
+   The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL
+   (recall of file delegations) in that the client responds to the
+   request before actually returning layouts via the LAYOUTRETURN
+   operation. While the client responds to the CB_LAYOUTRECALL
+   immediately, the operation is not considered complete (i.e.,
+   considered pending) until all affected layouts are returned to the
+   server via the LAYOUTRETURN operation.
+
+   Before returning the layout to the server via LAYOUTRETURN, the
+   client should wait for the response from in-process or in-flight
+   READ, WRITE, or COMMIT operations that use the recalled layout.
+
+   If the client is holding modified data that is affected by a recalled
+   layout, the client has various options for writing the data to the
+   server. As always, the client may write the data through the
+   metadata server.
+   In fact, the client may not have a choice other
+   than writing to the metadata server when the clora_changed argument
+   is TRUE and a new layout is unavailable from the server. However,
+   the client may be able to write the modified data to the storage
+   device if the clora_changed argument is FALSE; this needs to be done
+   before returning the layout via LAYOUTRETURN. If the client were to
+   obtain a new layout covering the modified data's byte-range, then
+   writing to the storage devices is an available alternative. Note
+   that before obtaining a new layout, the client must first return the
+   original layout.
+
+   In the case of modified data being written while the layout is held,
+   the client must use LAYOUTCOMMIT operations at the appropriate time;
+   when required, the LAYOUTCOMMIT must be done before the LAYOUTRETURN.
+   If a large amount of modified data is outstanding, the client may
+   send LAYOUTRETURNs for portions of the recalled layout; this allows
+   the server to monitor the client's progress and adherence to the
+   original recall request. However, the last LAYOUTRETURN in a
+   sequence of returns MUST specify the full range being recalled (see
+   Section 12.5.5.1 for details).
+
+   If a server needs to delete a device ID and there are layouts
+   referring to the device ID, CB_LAYOUTRECALL MUST be invoked to cause
+   the client to return all layouts referring to the device ID before
+   the server can delete the device ID. If the client does not return
+   the affected layouts, the server MAY revoke the layouts.
+
+20.4.  Operation 6: CB_NOTIFY - Notify Client of Directory Changes
+
+20.4.1.  ARGUMENT
+
+   /*
+    * Directory notification types.
+    */
+   enum notify_type4 {
+           NOTIFY4_CHANGE_CHILD_ATTRS = 0,
+           NOTIFY4_CHANGE_DIR_ATTRS = 1,
+           NOTIFY4_REMOVE_ENTRY = 2,
+           NOTIFY4_ADD_ENTRY = 3,
+           NOTIFY4_RENAME_ENTRY = 4,
+           NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
+   };
+
+   /* Changed entry information. */
+   struct notify_entry4 {
+           component4      ne_file;
+           fattr4          ne_attrs;
+   };
+
+   /* Previous entry information */
+   struct prev_entry4 {
+           notify_entry4   pe_prev_entry;
+           /* what READDIR returned for this entry */
+           nfs_cookie4     pe_prev_entry_cookie;
+   };
+
+   struct notify_remove4 {
+           notify_entry4   nrm_old_entry;
+           nfs_cookie4     nrm_old_entry_cookie;
+   };
+
+   struct notify_add4 {
+           /*
+            * Information on object
+            * possibly renamed over.
+            */
+           notify_remove4      nad_old_entry<1>;
+           notify_entry4       nad_new_entry;
+           /* what READDIR would have returned for this entry */
+           nfs_cookie4         nad_new_entry_cookie<1>;
+           prev_entry4         nad_prev_entry<1>;
+           bool                nad_last_entry;
+   };
+
+   struct notify_attr4 {
+           notify_entry4   na_changed_entry;
+   };
+
+   struct notify_rename4 {
+           notify_remove4  nrn_old_entry;
+           notify_add4     nrn_new_entry;
+   };
+
+   struct notify_verifier4 {
+           verifier4       nv_old_cookieverf;
+           verifier4       nv_new_cookieverf;
+   };
+
+   /*
+    * Objects of type notify_<>4 and
+    * notify_device_<>4 are encoded in this.
+    */
+   typedef opaque notifylist4<>;
+
+   struct notify4 {
+           /* composed from notify_type4 or notify_deviceid_type4 */
+           bitmap4         notify_mask;
+           notifylist4     notify_vals;
+   };
+
+   struct CB_NOTIFY4args {
+           stateid4    cna_stateid;
+           nfs_fh4     cna_fh;
+           notify4     cna_changes<>;
+   };
+
+20.4.2.  RESULT
+
+   struct CB_NOTIFY4res {
+           nfsstat4    cnr_status;
+   };
+
+20.4.3.  DESCRIPTION
+
+   The CB_NOTIFY operation is used by the server to send notifications
+   to clients about changes to delegated directories. The registration
+   of notifications for the directories occurs when the delegation is
+   established using GET_DIR_DELEGATION.
+   These notifications are sent
+   over the backchannel. The notification is sent once the original
+   request has been processed on the server. The server will send an
+   array of notifications for changes that might have occurred in the
+   directory. The notifications are sent as a list of pairs of bitmaps
+   and values. See Section 3.3.7 for a description of how NFSv4.1
+   bitmaps work.
+
+   If the server has more notifications than can fit in the CB_COMPOUND
+   request, it SHOULD send a sequence of serial CB_COMPOUND requests so
+   that the client's view of the directory does not become confused.
+   For example, if the server indicates that a file named "foo" is added
+   and that the file "foo" is removed, the order in which the client
+   receives these notifications needs to be the same as the order in
+   which the corresponding operations occurred on the server.
+
+   If the client holding the delegation makes any changes in the
+   directory that cause files or sub-directories to be added or removed,
+   the server will notify that client of the resulting change(s). If
+   the client holding the delegation is making attribute or cookie
+   verifier changes only, the server does not need to send notifications
+   to that client. The server will send the following information for
+   each operation:
+
+   NOTIFY4_ADD_ENTRY
+      The server will send information about the new directory entry
+      being created along with the cookie for that entry. The entry
+      information (data type notify_add4) includes the component name of
+      the entry and attributes. The server will send this type of entry
+      when a file is actually being created, when an entry is being
+      added to a directory as a result of a rename across directories
+      (see below), and when a hard link is being created to an existing
+      file. If this entry is added to the end of the directory, the
+      server will set the nad_last_entry flag to TRUE. If the file is
+      added such that there is at least one entry before it, the server
+      will also return the previous entry information (nad_prev_entry, a
+      variable-length array of up to one element. If the array is of
+      zero length, there is no previous entry), along with its cookie.
+      This is to help clients find the right location in their file name
+      caches and directory caches where this entry should be cached. If
+      the new entry's cookie is available, it will be in the
+      nad_new_entry_cookie (another variable-length array of up to one
+      element) field. If the addition of the entry causes another entry
+      to be deleted (which can only happen in the rename case)
+      atomically with the addition, then information on this entry is
+      reported in nad_old_entry.
+
+   NOTIFY4_REMOVE_ENTRY
+      The server will send information about the directory entry being
+      deleted. The server will also send the cookie value for the
+      deleted entry so that clients can get to the cached information
+      for this entry.
+
+   NOTIFY4_RENAME_ENTRY
+      The server will send information about both the old entry and the
+      new entry. This includes the name and attributes for each entry.
+      In addition, if the rename causes the deletion of an entry (i.e.,
+      the case of a file renamed over), then this is reported in
+      nrn_new_entry.nad_old_entry. This notification is only sent
+      if both entries are in the same directory. If the rename is
+      across directories, the server will send a remove notification to
+      one directory and an add notification to the other directory,
+      assuming both have a directory delegation.
+
+   NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS
+      The client will use the attribute mask to inform the server of
+      attributes for which it wants to receive notifications. This
+      change notification can be requested for changes to the attributes
+      of the directory as well as changes to any file's attributes in
+      the directory by using two separate attribute masks. The client
+      cannot ask for change attribute notification for a specific file.
+      One attribute mask covers all the files in the directory. Upon
+      any attribute change, the server will send back the values of
+      changed attributes. Notifications might not make sense for some
+      file system-wide attributes, and it is up to the server to decide
+      which subset it wants to support. The client can negotiate the
+      frequency of attribute notifications by letting the server know
+      how often it wants to be notified of an attribute change. The
+      server will return supported notification frequencies or an
+      indication that no notification is permitted for directory or
+      child attributes by setting the dir_notif_delay and
+      dir_entry_notif_delay attributes, respectively.
+
+   NOTIFY4_CHANGE_COOKIE_VERIFIER
+      If the cookie verifier changes while a client is holding a
+      delegation, the server will notify the client so that it can
+      invalidate its cookies and re-send a READDIR to get the new set of
+      cookies.
+
+20.5.  Operation 7: CB_PUSH_DELEG - Offer Previously Requested
+       Delegation to Client
+
+20.5.1.  ARGUMENT
+
+   struct CB_PUSH_DELEG4args {
+           nfs_fh4          cpda_fh;
+           open_delegation4 cpda_delegation;
+   };
+
+20.5.2.  RESULT
+
+   struct CB_PUSH_DELEG4res {
+           nfsstat4 cpdr_status;
+   };
+
+20.5.3.  DESCRIPTION
+
+   CB_PUSH_DELEG is used by the server both to signal to the client that
+   the delegation it wants (previously indicated via a want established
+   from an OPEN or WANT_DELEGATION operation) is available and to
+   simultaneously offer the delegation to the client. The client has
+   the choice of accepting the delegation by returning NFS4_OK to the
+   server, delaying the decision to accept the offered delegation by
+   returning NFS4ERR_DELAY, or permanently rejecting the offer of the
+   delegation by returning NFS4ERR_REJECT_DELEG. When a delegation is
+   rejected in this fashion, the want previously established is
+   permanently deleted and the delegation is subject to acquisition by
+   another client.
+
+20.5.4.  IMPLEMENTATION
+
+   If the client does return NFS4ERR_DELAY and there is a conflicting
+   delegation request, the server MAY process it at the expense of the
+   client that returned NFS4ERR_DELAY. The client's want will not be
+   cancelled, but MAY be processed behind other delegation requests or
+   registered wants.
+
+   When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or
+   NFS4ERR_REJECT_DELEG, the want remains pending, although servers may
+   decide to cancel the want by sending a CB_WANTS_CANCELLED.
+
+20.6.  Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects
+
+20.6.1.  ARGUMENT
+
+   const RCA4_TYPE_MASK_RDATA_DLG          = 0;
+   const RCA4_TYPE_MASK_WDATA_DLG          = 1;
+   const RCA4_TYPE_MASK_DIR_DLG            = 2;
+   const RCA4_TYPE_MASK_FILE_LAYOUT        = 3;
+   const RCA4_TYPE_MASK_BLK_LAYOUT         = 4;
+   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
+   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 9;
+   const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN   = 12;
+   const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX   = 15;
+
+   struct CB_RECALL_ANY4args {
+           uint32_t        craa_objects_to_keep;
+           bitmap4         craa_type_mask;
+   };
+
+20.6.2.
+20.6.  Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects
+
+20.6.1.  ARGUMENT
+
+   const RCA4_TYPE_MASK_RDATA_DLG          = 0;
+   const RCA4_TYPE_MASK_WDATA_DLG          = 1;
+   const RCA4_TYPE_MASK_DIR_DLG            = 2;
+   const RCA4_TYPE_MASK_FILE_LAYOUT        = 3;
+   const RCA4_TYPE_MASK_BLK_LAYOUT         = 4;
+   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
+   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 9;
+   const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN   = 12;
+   const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX   = 15;
+
+   struct CB_RECALL_ANY4args {
+           uint32_t craa_objects_to_keep;
+           bitmap4  craa_type_mask;
+   };
+
+20.6.2.  RESULT
+
+   struct CB_RECALL_ANY4res {
+           nfsstat4 crar_status;
+   };
+
+20.6.3.  DESCRIPTION
+
+   The server may decide that it cannot hold all of the state for
+   recallable objects, such as delegations and layouts, without
+   running out of resources.  In such a case, while not optimal, the
+   server is free to recall individual objects to reduce the load.
+
+   Because the general purpose of such recallable objects as
+   delegations is to eliminate client interaction with the server, the
+   server cannot interpret lack of recent use as indicating that the
+   object is no longer useful.  The absence of visible use is
+   consistent with a delegation keeping potential operations from
+   being sent to the server.  In the case of layouts, while it is true
+   that the usefulness of a layout is indicated by the use of the
+   layout when storage devices receive I/O requests, because there is
+   no mandate that a storage device indicate to the metadata server
+   any past or present use of a layout, the metadata server is not
+   likely to know which layouts are good candidates to recall in
+   response to low resources.
+
+   In order to implement an effective reclaim scheme for such objects,
+   the server's knowledge of available resources must be used to
+   determine when objects must be recalled, with the clients selecting
+   the actual objects to be returned.
+
+   Server implementations may differ in their resource allocation
+   requirements.  For example, one server may share resources among
+   all classes of recallable objects, whereas another may use separate
+   resource pools for layouts and for delegations, or further separate
+   resources by types of delegations.
+
+   When a given resource pool is over-utilized, the server can send a
+   CB_RECALL_ANY to clients holding recallable objects of the types
+   involved, allowing it to keep a certain number of such objects and
+   return any excess.  A mask specifies which types of objects are to
+   be limited.  The client chooses, based on its own knowledge of
+   current usefulness, which of the objects in that class should be
+   returned.
+
+   A number of bits are defined.  For some of these, ranges are
+   defined and it is up to the definition of the storage protocol to
+   specify how these are to be used.  There are ranges reserved for
+   object-based storage protocols and for other experimental storage
+   protocols.  An RFC defining such a storage protocol needs to
+   specify how particular bits within its range are to be used.  For
+   example, it may specify a mapping between attributes of the layout
+   (read vs. write, size of area) and the bit to be used, or it may
+   define a field in the layout where the associated bit position is
+   made available by the server to the client.
+
+   RCA4_TYPE_MASK_RDATA_DLG
+      The client is to return OPEN_DELEGATE_READ delegations on non-
+      directory file objects.
+
+   RCA4_TYPE_MASK_WDATA_DLG
+      The client is to return OPEN_DELEGATE_WRITE delegations on
+      regular file objects.
+
+   RCA4_TYPE_MASK_DIR_DLG
+      The client is to return directory delegations.
+
+   RCA4_TYPE_MASK_FILE_LAYOUT
+      The client is to return layouts of type LAYOUT4_NFSV4_1_FILES.
+
+   RCA4_TYPE_MASK_BLK_LAYOUT
+      See [48] for a description.
+
+   RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX
+      See [47] for a description.
+
+   RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX
+      This range is reserved for telling the client to recall layouts
+      of experimental or site-specific layout types (see
+      Section 3.3.13).
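+
+   As an illustration only (the object bookkeeping below is
+   hypothetical), the following Python sketch shows how a client
+   might honor the keep-count semantics, choosing the least recently
+   useful objects of the indicated types for return:
+
+      def cb_recall_any(craa_objects_to_keep, craa_type_mask, held):
+          # held: client-side records of recallable objects, each
+          # with a type_bit and a last_used timestamp (illustrative
+          # only; the protocol does not define these structures).
+          affected = [o for o in held
+                      if (craa_type_mask >> o.type_bit) & 1]
+          excess = len(affected) - craa_objects_to_keep
+          if excess <= 0:
+              return []      # already within the limit; no effect
+          # The protocol leaves the choice of objects to the client;
+          # least-recently-used is one plausible policy.
+          affected.sort(key=lambda o: o.last_used)
+          return affected[:excess]   # objects to return to the server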
+
+   When a bit is set in the type mask that corresponds to an undefined
+   type of recallable object, NFS4ERR_INVAL MUST be returned.  When a
+   bit is set that corresponds to a defined type of object but the
+   client does not support an object of the type, NFS4ERR_INVAL MUST
+   NOT be returned.  Future minor versions of NFSv4 may expand the set
+   of valid type mask bits.
+
+   CB_RECALL_ANY specifies a count of objects that the client may keep
+   as opposed to a count that the client must return.  This is to
+   avoid a potential race between a CB_RECALL_ANY that had a count of
+   objects to free with a set of client-originated operations to
+   return layouts or delegations.  As a result of the race, the client
+   and server would have differing ideas as to how many objects to
+   return.  Hence, the client could mistakenly free too many.
+
+   If resource demands prompt it, the server may send another
+   CB_RECALL_ANY with a lower count, even if it has not yet received
+   an acknowledgment from the client for a previous CB_RECALL_ANY with
+   the same type mask.  Although the possibility exists that these
+   will be received by the client in an order different from the order
+   in which they were sent, any such permutation of the callback
+   stream is harmless.  It is the job of the client to bring down the
+   size of the recallable object set in line with each CB_RECALL_ANY
+   received, and until that obligation is met, it cannot be cancelled
+   or modified by any subsequent CB_RECALL_ANY for the same type mask.
+   Thus, if the server sends two CB_RECALL_ANYs, the effect will be
+   the same as if the lower count were sent, whatever the order of
+   recall receipt.  Note that this means that a server may not cancel
+   the effect of a CB_RECALL_ANY by sending another recall with a
+   higher count.  When a CB_RECALL_ANY is received and the count is
+   already within the limit set or is above a limit that the client is
+   working to get down to, that callback has no effect.
+
+   Servers are generally free to deny recallable objects when
+   insufficient resources are available.  Note that the effect of such
+   a policy is implicitly to give precedence to existing objects
+   relative to requested ones, with the result that resources might
+   not be optimally used.  To prevent this, servers are well advised
+   to make the point at which they start sending CB_RECALL_ANY
+   callbacks somewhat below that at which they cease to give out new
+   delegations and layouts.  This allows the client to purge its
+   less-used objects whenever appropriate and so continue to have its
+   subsequent requests given new resources freed up by object returns.
+
+20.6.4.  IMPLEMENTATION
+
+   The client can choose to return any type of object specified by the
+   mask.  If a server wishes to limit the use of objects of a specific
+   type, it should only specify that type in the mask it sends.
+   Should the client fail to return requested objects, it is up to the
+   server to handle this situation, typically by sending specific
+   recalls (i.e., sending CB_RECALL operations) to properly limit
+   resource usage.  The server should give the client enough time to
+   return objects before proceeding to specific recalls.  This time
+   should not be less than the lease period.
+
+20.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for
+       Recallable Objects
+
+20.7.1.  ARGUMENT
+
+   typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args;
+
+20.7.2.  RESULT
+
+   struct CB_RECALLABLE_OBJ_AVAIL4res {
+           nfsstat4 croa_status;
+   };
+
+20.7.3.  DESCRIPTION
+
+   CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client
+   that the server has resources to grant recallable objects that
+   might previously have been denied by OPEN, WANT_DELEGATION,
+   GET_DIR_DELEGATION, or LAYOUTGET.
+
+   The argument craa_objects_to_keep means the total number of
+   recallable objects of the types indicated in the argument
+   craa_type_mask that the server believes it can allow the client to
+   have, including the number of such objects the client already has.
+   A client that tries to acquire more recallable objects than the
+   server informs it can have runs the risk of having objects
+   recalled.
+
+   The server is not obligated to reserve the difference between the
+   number of the objects the client currently has and the value of
+   craa_objects_to_keep, nor does delaying the reply to
+   CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources
+   of the recallable objects for another purpose.  Indeed, if a client
+   responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might
+   interpret the client as having reduced capability to manage
+   recallable objects, and so cancel or reduce any reservation it is
+   maintaining on behalf of the client.  Thus, if the client desires
+   to acquire more recallable objects, it needs to reply quickly to
+   CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations
+   to acquire recallable objects.
+
+20.8.  Operation 10: CB_RECALL_SLOT - Change Flow Control Limits
+
+20.8.1.  ARGUMENT
+
+   struct CB_RECALL_SLOT4args {
+           slotid4 rsa_target_highest_slotid;
+   };
+
+20.8.2.  RESULT
+
+   struct CB_RECALL_SLOT4res {
+           nfsstat4 rsr_status;
+   };
+
+20.8.3.  DESCRIPTION
+
+   The CB_RECALL_SLOT operation requests the client to return session
+   slots, and if applicable, transport credits (e.g., RDMA credits for
+   connections associated with the operations channel) of the
+   session's fore channel.  CB_RECALL_SLOT specifies
+   rsa_target_highest_slotid, the value of the target highest slot ID
+   the server wants for the session.  The client MUST then progress
+   toward reducing the session's highest slot ID to the target value.
+
+   If the session has only non-RDMA connections associated with its
+   operations channel, then the client need only wait for all
+   outstanding requests with a slot ID > rsa_target_highest_slotid to
+   complete, then send a single COMPOUND consisting of a single
+   SEQUENCE operation, with the sa_highest_slotid field set to
+   rsa_target_highest_slotid.  If there are RDMA-based connections
+   associated with the operations channel, then the client needs to
+   also send enough zero-length "RDMA Send" messages to take the total
+   RDMA credit count to rsa_target_highest_slotid + 1 or below.
+
+20.8.4.  IMPLEMENTATION
+
+   If the client fails to reduce the highest slot ID it has on the
+   fore channel to what the server requests, the server can force the
+   issue by asserting flow control on the receive side of all
+   connections bound to the fore channel, and then finish servicing
+   all outstanding requests that are in slots greater than
+   rsa_target_highest_slotid.  Once that is done, the server can then
+   open the flow control, and any time the client sends a new request
+   on a slot greater than rsa_target_highest_slotid, the server can
+   return NFS4ERR_BADSLOT.
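+
+   As a sketch only (the session object, its helpers, and the
+   Sequence constructor are hypothetical), the non-RDMA client
+   behavior described above might look like this:
+
+      async def handle_cb_recall_slot(session, target):
+          # Wait for outstanding requests on slot IDs above the
+          # target to complete ...
+          await session.drain_slots_above(target)
+          # ... then confirm the reduction with a single SEQUENCE
+          # operation carrying sa_highest_slotid = target.
+          await session.send_compound(
+              [Sequence(sa_highest_slotid=target)])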
+20.9.  Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing and
+       Control
+
+20.9.1.  ARGUMENT
+
+   struct referring_call4 {
+           sequenceid4 rc_sequenceid;
+           slotid4     rc_slotid;
+   };
+
+   struct referring_call_list4 {
+           sessionid4      rcl_sessionid;
+           referring_call4 rcl_referring_calls<>;
+   };
+
+   struct CB_SEQUENCE4args {
+           sessionid4           csa_sessionid;
+           sequenceid4          csa_sequenceid;
+           slotid4              csa_slotid;
+           slotid4              csa_highest_slotid;
+           bool                 csa_cachethis;
+           referring_call_list4 csa_referring_call_lists<>;
+   };
+
+20.9.2.  RESULT
+
+   struct CB_SEQUENCE4resok {
+           sessionid4  csr_sessionid;
+           sequenceid4 csr_sequenceid;
+           slotid4     csr_slotid;
+           slotid4     csr_highest_slotid;
+           slotid4     csr_target_highest_slotid;
+   };
+
+   union CB_SEQUENCE4res switch (nfsstat4 csr_status) {
+   case NFS4_OK:
+           CB_SEQUENCE4resok csr_resok4;
+   default:
+           void;
+   };
+
+20.9.3.  DESCRIPTION
+
+   The CB_SEQUENCE operation is used to manage operational accounting
+   for the backchannel of the session on which a request is sent.  The
+   contents include the session ID to which this request belongs, the
+   slot ID and sequence ID used by the server to implement session
+   request control and exactly once semantics, and exchanged slot ID
+   maxima that are used to adjust the size of the reply cache.  In
+   each CB_COMPOUND request, CB_SEQUENCE MUST appear once and MUST be
+   the first operation.  The error NFS4ERR_SEQUENCE_POS MUST be
+   returned when CB_SEQUENCE is found in any position in a CB_COMPOUND
+   beyond the first.  If any other operation is in the first position
+   of CB_COMPOUND, NFS4ERR_OP_NOT_IN_SESSION MUST be returned.
+
+   See Section 18.46.3 for a description of how slots are processed.
+
+   If csa_cachethis is TRUE, then the server is requesting that the
+   client cache the reply in the callback reply cache.  The client
+   MUST cache the reply (see Section 2.10.6.1.3).
+
+   The csa_referring_call_lists array is the list of COMPOUND
+   requests, identified by session ID, slot ID, and sequence ID.
+   These are requests that the client previously sent to the server.
+   These previous requests created state that some operation(s) in the
+   same CB_COMPOUND as the csa_referring_call_lists are identifying.
+   A session ID is included because leased state is tied to a client
+   ID, and a client ID can have multiple sessions.  See
+   Section 2.10.6.3.
+
+   The value of the csa_sequenceid argument relative to the cached
+   sequence ID on the slot falls into one of three cases.
+
+   *  If the difference between csa_sequenceid and the client's cached
+      sequence ID at the slot ID is two (2) or more, or if
+      csa_sequenceid is less than the cached sequence ID (accounting
+      for wraparound of the unsigned sequence ID value), then the
+      client MUST return NFS4ERR_SEQ_MISORDERED.
+
+   *  If csa_sequenceid and the cached sequence ID are the same, this
+      is a retry, and the client returns the CB_COMPOUND request's
+      cached reply.
+
+   *  If csa_sequenceid is one greater (accounting for wraparound)
+      than the cached sequence ID, then this is a new request, and the
+      slot's sequence ID is incremented.  The operations subsequent to
+      CB_SEQUENCE, if any, are processed.  If there are no other
+      operations, the only other effects are to cache the CB_SEQUENCE
+      reply in the slot, maintain the session's activity, and when the
+      server receives the CB_SEQUENCE reply, renew the lease of state
+      related to the client ID.
+
+   If the server reuses a slot ID and sequence ID for a completely
+   different request, the client MAY treat the request as if it is a
+   retry of what it has already executed.
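+
+   As an illustration of the three cases only (the modular arithmetic
+   below is one way to account for wraparound; the function itself is
+   not part of the protocol), a client might classify an incoming
+   CB_SEQUENCE as follows:
+
+      SEQ_MOD = 2**32
+
+      def classify_cb_sequence(csa_sequenceid, cached_sequenceid):
+          # Distance from the cached sequence ID to the incoming one,
+          # computed modulo 2^32 to account for wraparound.
+          delta = (csa_sequenceid - cached_sequenceid) % SEQ_MOD
+          if delta == 0:
+              return "retry"       # replay the cached reply
+          if delta == 1:
+              return "new"         # increment the slot's sequence ID
+          return "misordered"      # return NFS4ERR_SEQ_MISORDERED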
However, the client MAY detect
+   the server's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.
+
+   If CB_SEQUENCE returns an error, then the state of the slot
+   (sequence ID, cached reply) MUST NOT change.  See
+   Section 2.10.6.1.3 for the conditions when the error
+   NFS4ERR_RETRY_UNCACHED_REP might be returned.
+
+   The client returns two "highest_slotid" values: csr_highest_slotid
+   and csr_target_highest_slotid.  The former is the highest slot ID
+   the client will accept in a future CB_SEQUENCE operation, and
+   SHOULD NOT be less than the value of csa_highest_slotid (but see
+   Section 2.10.6.1 for an exception).  The latter is the highest slot
+   ID the client would prefer the server use on a future CB_SEQUENCE
+   operation.
+
+20.10.  Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation
+        Wants
+
+20.10.1.  ARGUMENT
+
+   struct CB_WANTS_CANCELLED4args {
+           bool cwca_contended_wants_cancelled;
+           bool cwca_resourced_wants_cancelled;
+   };
+
+20.10.2.  RESULT
+
+   struct CB_WANTS_CANCELLED4res {
+           nfsstat4 cwcr_status;
+   };
+
+20.10.3.  DESCRIPTION
+
+   The CB_WANTS_CANCELLED operation is used to notify the client that
+   some or all of the wants it registered for recallable delegations
+   and layouts have been cancelled.
+
+   If cwca_contended_wants_cancelled is TRUE, this indicates that the
+   server will not be pushing to the client any delegations that
+   become available after contention passes.
+
+   If cwca_resourced_wants_cancelled is TRUE, this indicates that the
+   server will not notify the client when there are resources on the
+   server to grant delegations or layouts.
+
+   After receiving a CB_WANTS_CANCELLED operation, the client is free
+   to attempt to acquire the delegations or layouts it was waiting
+   for, and possibly re-register wants.
+
+20.10.4.  IMPLEMENTATION
+
+   If a client has an OPEN, WANT_DELEGATION, or GET_DIR_DELEGATION
+   request outstanding when a CB_WANTS_CANCELLED is sent, the server
+   may need to make clear to the client whether a promise to signal
+   delegation availability happened before the CB_WANTS_CANCELLED and
+   is thus covered by it, or after the CB_WANTS_CANCELLED, in which
+   case it was not covered by it.  The server can make this
+   distinction by putting the appropriate requests into the list of
+   referring calls in the associated CB_SEQUENCE.
+
+20.11.  Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible Lock
+        Availability
+
+20.11.1.  ARGUMENT
+
+   struct CB_NOTIFY_LOCK4args {
+           nfs_fh4     cnla_fh;
+           lock_owner4 cnla_lock_owner;
+   };
+
+20.11.2.  RESULT
+
+   struct CB_NOTIFY_LOCK4res {
+           nfsstat4 cnlr_status;
+   };
+
+20.11.3.  DESCRIPTION
+
+   The server can use this operation to indicate that a byte-range
+   lock for the given file and lock-owner, previously requested by the
+   client via an unsuccessful LOCK operation, might be available.
+
+   This callback is meant to be used by servers to help reduce the
+   latency of blocking locks in the case where they recognize that a
+   client that has been polling for a blocking byte-range lock may now
+   be able to acquire the lock.  If the server supports this callback
+   for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag
+   when responding to successful opens for that file.  This does not
+   commit the server to the use of CB_NOTIFY_LOCK, but the client may
+   use this as a hint to decide how frequently to poll for locks
+   derived from that open.
+
+   If an OPEN operation results in an upgrade, in which the stateid
+   returned has an "other" value matching that of a stateid already
+   allocated, with a new "seqid" indicating a change in the lock being
+   represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK
+   flag when responding to that new OPEN controls handling from that
+   point going forward.  When parallel OPENs are done on the same file
+   and open-owner, the ordering of the "seqid" fields of the returned
+   stateids (subject to wraparound) is to be used to select the
+   controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag.
+
+20.11.4.  IMPLEMENTATION
+
+   The server MUST NOT grant the byte-range lock to the client unless
+   and until it receives a LOCK operation from the client.  Similarly,
+   the client receiving this callback cannot assume that it now has
+   the lock or that a subsequent LOCK operation for the lock will be
+   successful.
+
+   The server is not required to implement this callback, and even if
+   it does, it is not required to use it in any particular case.
+   Therefore, the client must still rely on polling for blocking
+   locks, as described in Section 9.6.
+
+   Similarly, the client is not required to implement this callback,
+   and even if it does, it is still free to ignore it.  Therefore, the
+   server MUST NOT assume that the client will act based on the
+   callback.
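+
+   As a sketch only (the polling loop, the send_lock() helper, the
+   interval values, and the threading.Event-based wakeup are all
+   hypothetical; only the NFS4ERR_DENIED constant comes from the
+   protocol), a client might combine polling with this callback as
+   follows:
+
+      import threading
+
+      NFS4ERR_DENIED = 10010
+
+      def poll_for_lock(send_lock, may_notify, wakeup):
+          # send_lock(): sends the LOCK operation, returns its status.
+          # may_notify: OPEN4_RESULT_MAY_NOTIFY_LOCK was set on the
+          #             open.
+          # wakeup: threading.Event set when a CB_NOTIFY_LOCK arrives.
+          interval = 30.0 if may_notify else 5.0  # poll lazily when
+          while True:                             # server may notify
+              status = send_lock()
+              if status != NFS4ERR_DENIED:
+                  return status
+              # The callback is only a hint: the LOCK may still fail,
+              # and the callback may never arrive, so keep a timeout.
+              wakeup.wait(timeout=interval)
+              wakeup.clear()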
+20.12.  Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device ID
+        Changes
+
+20.12.1.  ARGUMENT
+
+   /*
+    * Device notification types.
+    */
+   enum notify_deviceid_type4 {
+           NOTIFY_DEVICEID4_CHANGE = 1,
+           NOTIFY_DEVICEID4_DELETE = 2
+   };
+
+   /* For NOTIFY_DEVICEID4_DELETE */
+   struct notify_deviceid_delete4 {
+           layouttype4 ndd_layouttype;
+           deviceid4   ndd_deviceid;
+   };
+
+   /* For NOTIFY_DEVICEID4_CHANGE */
+   struct notify_deviceid_change4 {
+           layouttype4 ndc_layouttype;
+           deviceid4   ndc_deviceid;
+           bool        ndc_immediate;
+   };
+
+   struct CB_NOTIFY_DEVICEID4args {
+           notify4 cnda_changes<>;
+   };
+
+20.12.2.  RESULT
+
+   struct CB_NOTIFY_DEVICEID4res {
+           nfsstat4 cndr_status;
+   };
+
+20.12.3.  DESCRIPTION
+
+   The CB_NOTIFY_DEVICEID operation is used by the server to send
+   notifications to clients about changes to pNFS device IDs.  The
+   registration of device ID notifications is optional and is done via
+   GETDEVICEINFO.  These notifications are sent over the backchannel
+   once the original request has been processed on the server.  The
+   server will send an array of notifications, cnda_changes, as a list
+   of pairs of bitmaps and values.  See Section 3.3.7 for a
+   description of how NFSv4.1 bitmaps work.
+
+   As with CB_NOTIFY (Section 20.4.3), it is possible the server has
+   more notifications than can fit in a CB_COMPOUND, thus requiring
+   multiple CB_COMPOUNDs.  Unlike CB_NOTIFY, serialization is not an
+   issue because unlike directory entries, device IDs cannot be
+   re-used after being deleted (Section 12.2.10).
+
+   All device ID notifications contain a device ID and a layout type.
+   The layout type is necessary because two different layout types can
+   share the same device ID, and the common device ID can have
+   completely different mappings for each layout type.
+
+   The server will send the following notifications:
+
+   NOTIFY_DEVICEID4_CHANGE
+      A previously provided device-ID-to-device-address mapping has
+      changed and the client uses GETDEVICEINFO to obtain the updated
+      mapping.  The notification is encoded in a value of data type
+      notify_deviceid_change4.  This data type also contains a boolean
+      field, ndc_immediate, which if TRUE indicates that the change
+      will be enforced immediately, and so the client might not be
+      able to complete any pending I/O to the device ID.  If
+      ndc_immediate is FALSE, then for an indefinite time, the client
+      can complete pending I/O.  After pending I/O is complete, the
+      client SHOULD get the new device-ID-to-device-address mappings
+      before sending new I/O requests to the storage devices addressed
+      by the device ID.
+
+   NOTIFY_DEVICEID4_DELETE
+      Deletes a device ID from the mappings.  This notification MUST
+      NOT be sent if the client has a layout that refers to the device
+      ID.  In other words, if the server is sending a delete device ID
+      notification, one of the following is true for layouts
+      associated with the layout type:
+
+      *  The client never had a layout referring to that device ID.
+
+      *  The client has returned all layouts referring to that device
+         ID.
+
+      *  The server has revoked all layouts referring to that device
+         ID.
+
+      The notification is encoded in a value of data type
+      notify_deviceid_delete4.  After a server deletes a device ID, it
+      MUST NOT reuse that device ID for the same layout type until the
+      client ID is deleted.
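+
+   As an illustration only (the mapping cache and the
+   abort_pending_io() helper are hypothetical), a client might handle
+   these notifications as follows:
+
+      NOTIFY_DEVICEID4_CHANGE = 1
+      NOTIFY_DEVICEID4_DELETE = 2
+
+      def handle_deviceid_notification(kind, layout_type, deviceid,
+                                       immediate, cache):
+          # cache: maps (layout_type, deviceid) to a device address
+          # mapping.  The layout type is part of the key because two
+          # layout types can share the same device ID.
+          key = (layout_type, deviceid)
+          if kind == NOTIFY_DEVICEID4_DELETE:
+              # No layout held by this client still refers to the ID.
+              cache.pop(key, None)
+          elif kind == NOTIFY_DEVICEID4_CHANGE:
+              cache.pop(key, None)      # old mapping is now stale
+              if immediate:
+                  # Pending I/O to the old mapping may not complete.
+                  abort_pending_io(key)  # hypothetical helper
+              # A later GETDEVICEINFO repopulates the cache entry.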
+
+20.13.  Operation 10044: CB_ILLEGAL - Illegal Callback Operation
+
+20.13.1.  ARGUMENT
+
+   void;
+
+20.13.2.  RESULT
+
+   /*
+    * CB_ILLEGAL: Response for illegal operation numbers
+    */
+   struct CB_ILLEGAL4res {
+           nfsstat4 status;
+   };
+
+20.13.3.  DESCRIPTION
+
+   This operation is a placeholder for encoding a result to handle the
+   case of the server sending an operation code within CB_COMPOUND
+   that is not defined in the NFSv4.1 specification.  See
+   Section 19.2.3 for more details.
+
+   The status field of CB_ILLEGAL4res MUST be set to
+   NFS4ERR_OP_ILLEGAL.
+
+20.13.4.  IMPLEMENTATION
+
+   A server will probably not send an operation with code
+   OP_CB_ILLEGAL, but if it does, the response will be CB_ILLEGAL4res
+   just as it would be with any other invalid operation code.  Note
+   that if the client gets an illegal operation code that is not
+   OP_CB_ILLEGAL, and if the client checks for legal operation codes
+   during the XDR decode phase, then an instance of data type
+   CB_ILLEGAL4res will not be returned.
+
+21.  Security Considerations
+
+   Historically, the authentication model of NFS was based on the
+   entire machine being the NFS client, with the NFS server trusting
+   the NFS client to authenticate the end-user.  The NFS server in
+   turn shared its files only to specific clients, as identified by
+   the client's source network address.  Given this model, the
+   AUTH_SYS RPC security flavor simply identified the end-user using
+   the client to the NFS server.  When processing NFS responses, the
+   client ensured that the responses came from the same network
+   address and port number to which the request was sent.  While such
+   a model is easy to implement and simple to deploy and use, it is
+   unsafe.  Thus, NFSv4.1 implementations are REQUIRED to support a
+   security model that uses end-to-end authentication, where an
+   end-user on a client mutually authenticates (via cryptographic
+   schemes that do not expose passwords or keys in the clear on the
+   network) to a principal on an NFS server.  Consideration is also
+   given to the integrity and privacy of NFS requests and responses.
+   The issues of end-to-end mutual authentication, integrity, and
+   privacy are discussed in Section 2.2.1.1.1.  There are specific
+   considerations when using Kerberos V5 as described in
+   Section 2.2.1.1.1.2.1.1.
+
+   Note that being REQUIRED to implement does not mean REQUIRED to
+   use; AUTH_SYS can be used by NFSv4.1 clients and servers.  However,
+   AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so
+   interoperability via AUTH_SYS is not assured.
+
+   For reasons of reduced administration overhead, better performance,
+   and/or reduction of CPU utilization, users of NFSv4.1
+   implementations might decline to use security mechanisms that
+   enable integrity protection on each remote procedure call and
+   response.  The use of mechanisms without integrity leaves the user
+   vulnerable to a man-in-the-middle between the NFS client and server
+   that modifies the RPC request and/or the response.  While
+   implementations are free to provide the option to use weaker
+   security mechanisms, there are three operations in particular that
+   warrant the implementation overriding user choices.
+
+   *  The first two such operations are SECINFO and SECINFO_NO_NAME.
+      It is RECOMMENDED that the client send both operations such that
+      they are protected with a security flavor that has integrity
+      protection, such as RPCSEC_GSS with either the
+      rpc_gss_svc_integrity or rpc_gss_svc_privacy service.  Without
+      integrity protection encapsulating SECINFO and SECINFO_NO_NAME
+      and their results, a man-in-the-middle could modify results such
+      that the client might select a weaker algorithm in the set
+      allowed by the server, making the client and/or server
+      vulnerable to further attacks.
+
+   *  The third operation that SHOULD use integrity protection is any
+      GETATTR for the fs_locations and fs_locations_info attributes,
+      in order to mitigate the severity of a man-in-the-middle attack.
+      The attack has two steps.  First the attacker modifies the
+      unprotected results of some operation to return NFS4ERR_MOVED.
+      Second, when the client follows up with a GETATTR for the
+      fs_locations or fs_locations_info attributes, the attacker
+      modifies the results to cause the client to migrate its traffic
+      to a server controlled by the attacker.  With integrity
+      protection, this attack is mitigated.
+
+   Relative to previous NFS versions, NFSv4.1 has additional security
+   considerations for pNFS (see Sections 12.9 and 13.12), locking and
+   session state (see Section 2.10.8.3), and state recovery during
+   grace period (see Section 8.4.2.1.1).  With respect to locking and
+   session state, if SP4_SSV state protection is being used,
+   Section 2.10.10 has specific security considerations for the
+   NFSv4.1 client and server.
+
+   Security considerations for lock reclaim differ between the two
+   different situations in which state reclaim is to be done.  The
+   server failure situation is discussed in Section 8.4.2.1.1, while
+   the per-fs state reclaim done in support of migration/replication
+   is discussed in Section 11.11.9.1.
+
+   The use of the multi-server namespace features described in
+   Section 11 raises the possibility that requests to determine the
+   set of network addresses corresponding to a given server might be
+   interfered with or have their responses modified in flight.  In
+   light of this possibility, the following considerations should be
+   noted:
+
+   *  When DNS is used to convert server names to addresses and DNSSEC
+      [29] is not available, the validity of the network addresses
+      returned generally cannot be relied upon.  However, when
+      combined with a trusted resolver, DNS over TLS [30] and DNS over
+      HTTPS [34] can be relied upon to provide valid address
+      resolutions.
+
+      In situations in which the validity of the provided addresses
+      cannot be relied upon and the client uses RPCSEC_GSS to access
+      the designated server, it is possible for mutual authentication
+      to discover invalid server addresses as long as the RPCSEC_GSS
+      implementation used does not use insecure DNS queries to
+      canonicalize the hostname components of the service principal
+      names, as explained in [28].
+
+   *  The fetching of attributes containing file system location
+      information SHOULD be performed using integrity protection.  It
+      is important to note here that a client making a request of this
+      sort without using integrity protection needs to be aware of the
+      negative consequences of doing so, which can lead to invalid
+      hostnames or network addresses being returned.  These include
+      cases in which the client is directed to a server under the
+      control of an attacker, who might get access to data written or
+      provide incorrect values for data read.  In light of this, the
+      client needs to recognize that using such returned location
+      information to access an NFSv4 server without use of RPCSEC_GSS
+      (i.e., by using AUTH_SYS) poses dangers as it can result in the
+      client interacting with such an attacker-controlled server
+      without any authentication facilities to verify the server's
+      identity.
+
+   *  Despite the fact that it is a requirement that implementations
+      provide "support" for use of RPCSEC_GSS, it cannot be assumed
+      that use of RPCSEC_GSS is always available between any
+      particular client-server pair.
+
+   *  When a client has the network addresses of a server but not the
+      associated hostnames, the lack of hostnames would interfere with
+      its ability to use RPCSEC_GSS.
+
+   In light of the above, a server SHOULD present file system location
+   entries that correspond to file systems on other servers using a
+   hostname.  This would allow the client to interrogate the
+   fs_locations on the destination server to obtain trunking
+   information (as well as replica information) using integrity
+   protection, validating the name provided while assuring that the
+   response has not been modified in flight.
+
+   When RPCSEC_GSS is not available on a server, the client needs to
+   be aware of the fact that the location entries are subject to
+   modification in flight and so cannot be relied upon.  In the case
+   of a client being directed to another server after NFS4ERR_MOVED,
+   this could vitiate the authentication provided by the use of
+   RPCSEC_GSS on the designated destination server.  Even when
+   RPCSEC_GSS authentication is available on the destination, the
+   server might still properly authenticate as the server to which the
+   client was erroneously directed.  Without a way to decide whether
+   the server is a valid one, the client can only determine, using
+   RPCSEC_GSS, that the server corresponds to the name provided, with
+   no basis for trusting that server.  As a result, the client SHOULD
+   NOT use such unverified location entries as a basis for migration,
+   even though RPCSEC_GSS might be available on the destination.
+
+   When a file system location attribute is fetched upon connecting
+   with an NFS server, it SHOULD, as stated above, be done with
+   integrity protection.  When this is not possible, it is generally
+   best for the client to ignore trunking and replica information or
+   simply not fetch the location information for these purposes.
+
+   When location information cannot be verified, it can be subjected
+   to additional filtering to prevent the client from being
+   inappropriately directed.  For example, if a range of network
+   addresses can be determined that assure that the servers and
+   clients using AUTH_SYS are subject to the appropriate set of
+   constraints (e.g., physical network isolation, administrative
+   controls on the operating systems used), then network addresses in
+   the appropriate range can be used with others discarded or
+   restricted in their use of AUTH_SYS.
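+
+   As an illustration only (the helpers and the policy below are
+   hypothetical; only the preference for integrity protection and the
+   filtering fallback come from this section), a client's policy for
+   fetching location information might look like:
+
+      def fetch_location_info(server, gss_integrity_ok, trusted_range):
+          if gss_integrity_ok:
+              # Fetch fs_locations_info under RPCSEC_GSS integrity so
+              # the reply cannot be modified in flight undetected.
+              return server.getattr(["fs_locations_info"],
+                                    integrity=True)
+          # Unverified results: keep only addresses within a range
+          # known to be subject to adequate physical and
+          # administrative controls; discard or restrict the rest.
+          locs = server.getattr(["fs_locations_info"],
+                                integrity=False)
+          return [l for l in locs if l.address in trusted_range]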
+
+   To summarize considerations regarding the use of RPCSEC_GSS in
+   fetching location information, we need to consider the following
+   possibilities for requests to interrogate location information,
+   with interrogation approaches on the referring and destination
+   servers arrived at separately:
+
+   *  The use of integrity protection is RECOMMENDED in all cases,
+      since the absence of integrity protection exposes the client to
+      the possibility of the results being modified in transit.
+
+   *  The use of requests issued without RPCSEC_GSS (i.e., using
+      AUTH_SYS, which has no provision to avoid modification of data
+      in flight), while undesirable and a potential security exposure,
+      may not be avoidable in all cases.  Where the use of the
+      returned information cannot be avoided, it is made subject to
+      filtering as described above to eliminate the possibility that
+      the client would treat an invalid address as if it were an NFSv4
+      server.  The specifics will vary depending on the degree of
+      network isolation and whether the request is to the referring or
+      destination servers.
+
+   Even if such requests are not interfered with in flight, it is
+   possible for a compromised server to direct the client to use
+   inappropriate servers, such as those under the control of the
+   attacker.  It is not clear that being directed to such servers
+   represents a greater threat to the client than the damage that
+   could be done by the compromised server itself.  However, it is
+   possible that some sorts of transient server compromises might be
+   exploited to direct a client to a server capable of doing greater
+   damage over a longer time.  One useful step to guard against this
+   possibility is to issue requests to fetch location data using
+   RPCSEC_GSS, even if no mapping to an RPCSEC_GSS principal is
+   available.  In this case, RPCSEC_GSS would not be used, as it
+   typically is, to identify the client principal to the server, but
+   rather to make sure (via RPCSEC_GSS mutual authentication) that the
+   server being contacted is the one intended.
+
+   Similar considerations apply if the threat to be avoided is the
+   redirection of client traffic to inappropriate (i.e., poorly
+   performing) servers.  In both cases, there is no reason for the
+   information returned to depend on the identity of the client
+   principal requesting it, while the validity of the server
+   information, which has the capability to affect all client
+   principals, is of considerable importance.
+
+22.  IANA Considerations
+
+   This section uses terms that are defined in [63].
+
+22.1.  IANA Actions
+
+   This update does not require any modification of, or additions to,
+   registry entries or registry rules associated with NFSv4.1.
+   However, since this document obsoletes RFC 5661, IANA has updated
+   all registry entries and registry rules references that point to
+   RFC 5661 to point to this document instead.
+
+   Previous actions by IANA related to NFSv4.1 are listed in the
+   remaining subsections of Section 22.
+
+22.2.  Named Attribute Definitions
+
+   IANA created a registry called the "NFSv4 Named Attribute
+   Definitions Registry".
+
+   The NFSv4.1 protocol supports the association of a file with zero
+   or more named attributes.  The namespace identifiers for these
+   attributes are defined as string names.  The protocol does not
+   define the specific assignment of the namespace for these file
+   attributes.  The IANA registry promotes interoperability where
+   common interests exist.  While application developers are allowed
+   to define and use attributes as needed, they are encouraged to
+   register the attributes with IANA.
+
+   Such registered named attributes are presumed to apply to all minor
+   versions of NFSv4, including those defined subsequent to the
+   registration.  If the named attribute is intended to be limited to
+   specific minor versions, this will be clearly stated in the
+   registry's assignment.
+
+   All assignments to the registry are made on a First Come First
+   Served basis, per Section 4.4 of [63].  The policy for each
+   assignment is Specification Required, per Section 4.6 of [63].
+
+   Under the NFSv4.1 specification, the name of a named attribute can
+   in theory be up to 2^(32) - 1 bytes in length, but in practice
+   NFSv4.1 clients and servers will be unable to handle a string that
+   long.  IANA should reject any assignment request with a named
+   attribute that exceeds 128 UTF-8 characters.  To give the IESG the
+   flexibility to set up bases of assignment of Experimental Use and
+   Standards Action, the prefixes of "EXPE" and "STDS" are Reserved.
+   The named attribute with a zero-length name is Reserved.
+
+   The prefix "PRIV" is designated for Private Use.  A site that wants
+   to make use of unregistered named attributes without risk of
+   conflicting with an assignment in IANA's registry should use the
+   prefix "PRIV" in all of its named attributes.
+
+   Because some NFSv4.1 clients and servers have case-insensitive
+   semantics, the fifteen additional lower case and mixed case
+   permutations of each of "EXPE", "PRIV", and "STDS" are Reserved
+   (e.g., "expe", "expE", "exPe", etc. are Reserved).  Similarly, IANA
+   must not allow two assignments that would conflict if both named
+   attributes were converted to a common case.
+
+   The registry of named attributes is a list of assignments, each
+   containing three fields.
+
+   1.  A US-ASCII string name that is the actual name of the
+       attribute.  This name must be unique.  This string name can be
+       1 to 128 UTF-8 characters long.
+
+   2.  A reference to the specification of the named attribute.  The
+       reference can consume up to 256 bytes (or more if IANA
+       permits).
+
+   3.  The point of contact of the registrant.  The point of contact
+       can consume up to 256 bytes (or more if IANA permits).
+
+22.2.1.  Initial Registry
+
+   There is no initial registry.
+
+22.2.2.  Updating Registrations
+
+   The registrant is always permitted to update the point of contact
+   field.  Any other change will require Expert Review or IESG
+   Approval.
+
+22.3.  Device ID Notifications
+
+   IANA created a registry called the "NFSv4 Device ID Notifications
+   Registry".
+
+   The potential exists for new notification types to be added to the
+   CB_NOTIFY_DEVICEID operation (see Section 20.12).  This can be done
+   via changes to the operations that register notifications, or by
+   adding new operations to NFSv4.  This requires a new minor version
+   of NFSv4, and requires a Standards Track document from the IETF.
+   Another way to add a notification is to specify a new layout type
+   (see Section 22.5).
+
+   Hence, all assignments to the registry are made on a Standards
+   Action basis per Section 4.9 of [63], with Expert Review required.
+
+   The registry is a list of assignments, each containing five fields
+   per assignment.
+
+   1.  The name of the notification type.  This name must have the
+       prefix "NOTIFY_DEVICEID4_".  This name must be unique.
+
+   2.  The value of the notification.  IANA will assign this number,
+       and the request from the registrant will use TBD1 instead of an
+       actual value.  IANA MUST use a whole number that can be no
+       higher than 2^(32)-1, and should be the next available value.
+       The value assigned must be unique.  A Designated Expert must be
+       used to ensure that when the name of the notification type and
+       its value are added to the NFSv4.1 notify_deviceid_type4
+       enumerated data type in the NFSv4.1 XDR description [10], the
+       result continues to be a valid XDR description.
+
+   3.  The Standards Track RFC(s) that describe the notification.  If
+       the RFC(s) have not yet been published, the registrant will use
+       RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.
+
+   4.  How the RFC introduces the notification.  This is indicated by
+       a single US-ASCII value.  If the value is N, it means a minor
+       revision to the NFSv4 protocol.  If the value is L, it means a
+       new pNFS layout type.  Other values can be used with IESG
+       Approval.
+
+   5.  The minor versions of NFSv4 that are allowed to use the
+       notification.  While these are numeric values, IANA will not
+       allocate and assign them; the author of the relevant RFCs with
+       IESG Approval assigns these numbers.  Each time there is a new
+       minor version of NFSv4 approved, a Designated Expert should
+       review the registry to make recommended updates as needed.
+
+22.3.1.  Initial Registry
+
+   The initial registry is in Table 25.  Note that the next available
+   value is zero.
+
+   +=========================+=======+==========+=====+================+
+   | Notification Name       | Value | RFC      | How | Minor Versions |
+   +=========================+=======+==========+=====+================+
+   | NOTIFY_DEVICEID4_CHANGE | 1     | RFC 8881 | N   | 1              |
+   +-------------------------+-------+----------+-----+----------------+
+   | NOTIFY_DEVICEID4_DELETE | 2     | RFC 8881 | N   | 1              |
+   +-------------------------+-------+----------+-----+----------------+
+
+           Table 25: Initial Device ID Notification Assignments
+
+22.3.2.  Updating Registrations
+
+   The update of a registration will require IESG Approval on the
+   advice of a Designated Expert.
+
+22.4.  Object Recall Types
+
+   IANA created a registry called the "NFSv4 Recallable Object Types
+   Registry".
+
+   The potential exists for new object types to be added to the
+   CB_RECALL_ANY operation (see Section 20.6).  This can be done via
+   changes to the operations that add recallable types, or by adding
+   new operations to NFSv4.  This requires a new minor version of
+   NFSv4, and requires a Standards Track document from the IETF.
+   Another way to add a new recallable object is to specify a new
+   layout type (see Section 22.5).
+
+   All assignments to the registry are made on a Standards Action
+   basis per Section 4.9 of [63], with Expert Review required.
+
+   Recallable object types are 32-bit unsigned numbers.  There are no
+   Reserved values.  Values in the range 12 through 15, inclusive, are
+   designated for Private Use.
+
+   The registry is a list of assignments, each containing five fields
+   per assignment.
+
+   1.  The name of the recallable object type.
This name must have the + prefix "RCA4_TYPE_MASK_". The name must be unique. + + 2. The value of the recallable object type. IANA will assign this + number, and the request from the registrant will use TBD1 instead + of an actual value. IANA MUST use a whole number that can be no + higher than 2^(32)-1, and should be the next available value. + The value must be unique. A Designated Expert must be used to + ensure that when the name of the recallable type and its value + are added to the NFSv4 XDR description [10], the result continues + to be a valid XDR description. + + 3. The Standards Track RFC(s) that describe the recallable object + type. If the RFC(s) have not yet been published, the registrant + will use RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. + + 4. How the RFC introduces the recallable object type. This is + indicated by a single US-ASCII value. If the value is N, it + means a minor revision to the NFSv4 protocol. If the value is L, + it means a new pNFS layout type. Other values can be used with + IESG Approval. + + 5. The minor versions of NFSv4 that are allowed to use the + recallable object type. While these are numeric values, IANA + will not allocate and assign them; the author of the relevant + RFCs with IESG Approval assigns these numbers. Each time there + is a new minor version of NFSv4 approved, a Designated Expert + should review the registry to make recommended updates as needed. + +22.4.1. Initial Registry + + The initial registry is in Table 26. Note that the next available + value is five. + + +===============================+=======+======+=====+==========+ + | Recallable Object Type Name | Value | RFC | How | Minor | + | | | | | Versions | + +===============================+=======+======+=====+==========+ + | RCA4_TYPE_MASK_RDATA_DLG | 0 | RFC | N | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_WDATA_DLG | 1 | RFC | N | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_DIR_DLG | 2 | RFC | N | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_FILE_LAYOUT | 3 | RFC | N | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_BLK_LAYOUT | 4 | RFC | L | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_OBJ_LAYOUT_MIN | 8 | RFC | L | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + | RCA4_TYPE_MASK_OBJ_LAYOUT_MAX | 9 | RFC | L | 1 | + | | | 8881 | | | + +-------------------------------+-------+------+-----+----------+ + + Table 26: Initial Recallable Object Type Assignments + +22.4.2. Updating Registrations + + The update of a registration will require IESG Approval on the advice + of a Designated Expert. + +22.5. Layout Types + + IANA created a registry called the "pNFS Layout Types Registry". + + All assignments to the registry are made on a Standards Action basis, + with Expert Review required. + + Layout types are 32-bit numbers. The value zero is Reserved. Values + in the range 0x80000000 to 0xFFFFFFFF inclusive are designated for + Private Use. IANA will assign numbers from the range 0x00000001 to + 0x7FFFFFFF inclusive. + + The registry is a list of assignments, each containing five fields. + + 1. The name of the layout type. This name must have the prefix + "LAYOUT4_". The name must be unique. + + 2. 
The value of the layout type.
+       IANA will assign this number, and the request from the
+       registrant will use TBD1 instead of an actual value.  The value
+       assigned must be unique.  A Designated Expert must be used to
+       ensure that when the name of the layout type and its value are
+       added to the NFSv4.1 layouttype4 enumerated data type in the
+       NFSv4.1 XDR description [10], the result continues to be a
+       valid XDR description.
+
+   3.  The Standards Track RFC(s) that describe the layout type.  If
+       the RFC(s) have not yet been published, the registrant will use
+       RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.
+       Collectively, the RFC(s) must adhere to the guidelines listed
+       in Section 22.5.3.
+
+   4.  How the RFC introduces the layout type.  This is indicated by a
+       single US-ASCII value.  If the value is N, it means a minor
+       revision to the NFSv4 protocol.  If the value is L, it means a
+       new pNFS layout type.  Other values can be used with IESG
+       Approval.
+
+   5.  The minor versions of NFSv4 that are allowed to use the layout
+       type.  While these are numeric values, IANA will not allocate
+       and assign them; the author of the relevant RFCs with IESG
+       Approval assigns these numbers.  Each time there is a new minor
+       version of NFSv4 approved, a Designated Expert should review
+       the registry to make recommended updates as needed.
+
+22.5.1.  Initial Registry
+
+   The initial registry is in Table 27.
+
+   +=======================+=======+==========+=====+================+
+   | Layout Type Name      | Value | RFC      | How | Minor Versions |
+   +=======================+=======+==========+=====+================+
+   | LAYOUT4_NFSV4_1_FILES | 0x1   | RFC 8881 | N   | 1              |
+   +-----------------------+-------+----------+-----+----------------+
+   | LAYOUT4_OSD2_OBJECTS  | 0x2   | RFC 5664 | L   | 1              |
+   +-----------------------+-------+----------+-----+----------------+
+   | LAYOUT4_BLOCK_VOLUME  | 0x3   | RFC 5663 | L   | 1              |
+   +-----------------------+-------+----------+-----+----------------+
+
+                Table 27: Initial Layout Type Assignments
+
+22.5.2.  Updating Registrations
+
+   The update of a registration will require IESG Approval on the
+   advice of a Designated Expert.
+
+22.5.3.  Guidelines for Writing Layout Type Specifications
+
+   The author of a new pNFS layout specification must follow these
+   steps to obtain acceptance of the layout type as a Standards Track
+   RFC:
+
+   1.  The author devises the new layout specification.
+
+   2.  The new layout type specification MUST, at a minimum:
+
+       *  Define the contents of the layout-type-specific fields of
+          the following data types:
+
+          -  the da_addr_body field of the device_addr4 data type;
+
+          -  the loh_body field of the layouthint4 data type;
+
+          -  the loc_body field of layout_content4 data type (which in
+             turn is the lo_content field of the layout4 data type);
+
+          -  the lou_body field of the layoutupdate4 data type;
+
+       *  Describe or define the storage access protocol used to
+          access the storage devices.
+
+       *  Describe whether revocation of layouts is supported.
+
+       *  At a minimum, describe the methods of recovery from:
+
+          1.  Failure and restart for client, server, storage device.
+
+          2.  Lease expiration from perspective of the active client,
+              server, storage device.
+
+          3.  Loss of layout state resulting in fencing of client
+              access to storage devices (for an example, see
+              Section 12.7.3).
+
+       *  Include an IANA considerations section, which will in turn
+          include:
+
+          -  A request to IANA for a new layout type per Section 22.5.
+
+          -  A list of requests to IANA for any new recallable object
+             types for CB_RECALL_ANY; each entry is to be presented in
+             the form described in Section 22.4.
+
+          -  A list of requests to IANA for any new notification
+             values for CB_NOTIFY_DEVICEID; each entry is to be
+             presented in the form described in Section 22.3.
+
+       *  Include a security considerations section.  This section
+          MUST explain how the NFSv4.1 authentication, authorization,
+          and access-control models are preserved.  That is, if a
+          metadata server would restrict a READ or WRITE operation,
+          how would pNFS via the layout similarly restrict a
+          corresponding input or output operation?
+
+   3.  The author documents the new layout specification as an
+       Internet-Draft.
+
+   4.  The author submits the Internet-Draft for review through the
+       IETF standards process as defined in "The Internet Standards
+       Process--Revision 3" (BCP 9 [35]).  The new layout
+       specification will be submitted for eventual publication as a
+       Standards Track RFC.
+
+   5.  The layout specification progresses through the IETF standards
+       process.
+
+22.6.  Path Variable Definitions
+
+   This section deals with the IANA considerations associated with the
+   variable substitution feature for location names as described in
+   Section 11.17.3.  As described there, variables subject to
+   substitution consist of a domain name and a specific name within
+   that domain, with the two separated by a colon.  There are two sets
+   of IANA considerations here:
+
+   1.  The list of variable names.
+
+   2.  For each variable name, the list of possible values.
+
+   Thus, there will be one registry for the list of variable names,
+   and possibly one registry for listing the values of each variable
+   name.
+
+22.6.1.  Path Variables Registry
+
+   IANA created a registry called the "NFSv4 Path Variables Registry".
+
+22.6.1.1.  Path Variable Values
+
+   Variable names are of the form "${", followed by a domain name,
+   followed by a colon (":"), followed by a domain-specific portion of
+   the variable name, followed by "}".  When the domain name is
+   "ietf.org", all variable names must be registered with IANA on a
+   Standards Action basis, with Expert Review required.  Path
+   variables with registered domain names neither part of nor equal
+   to ietf.org are assigned on a Hierarchical Allocation basis
+   (delegating to the domain owner) and thus of no concern to IANA,
+   unless the domain owner chooses to register a variable name from
+   its domain.  If the domain owner chooses to do so, IANA will do so
+   on a First Come First Served basis.  To accommodate registrants who
+   do not have their own domain, IANA will accept requests to register
+   variables with the prefix "${FCFS.ietf.org:" on a First Come First
+   Served basis.  Assignments on a First Come First Served basis do
+   not require Expert Review, unless the registrant also wants IANA to
+   establish a registry for the values of the registered variable.
+
+   The registry is a list of assignments, each containing three
+   fields.
+
+   1.  The name of the variable.  The name of this variable must start
+       with a "${" followed by a registered domain name, followed by
+       ":", or it must start with "${FCFS.ietf.org:".  The name must
+       be no more than 64 UTF-8 characters long.  The name must be
+       unique.
+
+   2.  For assignments made on a Standards Action basis, the Standards
+       Track RFC(s) that describe the variable.  If the RFC(s) have
+       not yet been published, the registrant will use RFCTBD1,
+       RFCTBD2, etc. instead of an actual RFC number.
Note that the RFCs do not
+       have to be a part of an NFS minor version.  For assignments
+       made on a First Come First Served basis, an explanation
+       (consuming no more than 1024 bytes, or more if IANA permits) of
+       the purpose of the variable.  A reference to the explanation
+       can be substituted.
+
+   3.  The point of contact, including an email address.  The point of
+       contact can consume up to 256 bytes (or more if IANA permits).
+       For assignments made on a Standards Action basis, the point of
+       contact is always IESG.
+
+22.6.1.1.1.  Initial Registry
+
+   The initial registry is in Table 28.
+
+   +========================+==========+==================+
+   | Variable Name          | RFC      | Point of Contact |
+   +========================+==========+==================+
+   | ${ietf.org:CPU_ARCH}   | RFC 8881 | IESG             |
+   +------------------------+----------+------------------+
+   | ${ietf.org:OS_TYPE}    | RFC 8881 | IESG             |
+   +------------------------+----------+------------------+
+   | ${ietf.org:OS_VERSION} | RFC 8881 | IESG             |
+   +------------------------+----------+------------------+
+
+             Table 28: Initial List of Path Variables
+
+   IANA has created registries for the values of the variable names
+   ${ietf.org:CPU_ARCH} and ${ietf.org:OS_TYPE}.  See Sections 22.6.2
+   and 22.6.3.
+
+   For the values of the variable ${ietf.org:OS_VERSION}, no registry
+   is needed as the specifics of the values of the variable will vary
+   with the value of ${ietf.org:OS_TYPE}.  Thus, values for
+   ${ietf.org:OS_VERSION} are on a Hierarchical Allocation basis and
+   are of no concern to IANA.
+
+22.6.1.1.2.  Updating Registrations
+
+   The update of an assignment made on a Standards Action basis will
+   require IESG Approval on the advice of a Designated Expert.
+
+   The registrant can always update the point of contact of an
+   assignment made on a First Come First Served basis.  Any other
+   update will require Expert Review.
+
+22.6.2.  Values for the ${ietf.org:CPU_ARCH} Variable
+
+   IANA created a registry called the "NFSv4 ${ietf.org:CPU_ARCH}
+   Value Registry".
+
+   Assignments to the registry are made on a First Come First Served
+   basis.  The zero-length value of ${ietf.org:CPU_ARCH} is Reserved.
+   Values with a prefix of "PRIV" are designated for Private Use.
+
+   The registry is a list of assignments, each containing three
+   fields.
+
+   1.  A value of the ${ietf.org:CPU_ARCH} variable.  The value must
+       be 1 to 32 UTF-8 characters long.  The value must be unique.
+
+   2.  An explanation (consuming no more than 1024 bytes, or more if
+       IANA permits) of what CPU architecture the value denotes.  A
+       reference to the explanation can be substituted.
+
+   3.  The point of contact, including an email address.  The point of
+       contact can consume up to 256 bytes (or more if IANA permits).
+
+22.6.2.1.  Initial Registry
+
+   There is no initial registry.
+
+22.6.2.2.  Updating Registrations
+
+   The registrant is free to update the assignment, i.e., change the
+   explanation and/or point-of-contact fields.
+
+22.6.3.  Values for the ${ietf.org:OS_TYPE} Variable
+
+   IANA created a registry called the "NFSv4 ${ietf.org:OS_TYPE} Value
+   Registry".
+
+   Assignments to the registry are made on a First Come First Served
+   basis.  The zero-length value of ${ietf.org:OS_TYPE} is Reserved.
+   Values with a prefix of "PRIV" are designated for Private Use.
+
+   The registry is a list of assignments, each containing three
+   fields.
+
+   1.  A value of the ${ietf.org:OS_TYPE} variable.  The value must be
+       1 to 32 UTF-8 characters long.  The value must be unique.
+   2.  An explanation (consuming no more than 1024 bytes, or more if
+       IANA permits) of what operating system the value denotes.  A
+       reference to the explanation can be substituted.
+
+   3.  The point of contact, including an email address.  The point of
+       contact can consume up to 256 bytes (or more if IANA permits).
+
+22.6.3.1.  Initial Registry
+
+   There is no initial registry.
+
+22.6.3.2.  Updating Registrations
+
+   The registrant is free to update the assignment, i.e., change the
+   explanation and/or point-of-contact fields.
+
+23.  References
+
+23.1.  Normative References
+
+   [1]   Bradner, S., "Key words for use in RFCs to Indicate
+         Requirement Levels", BCP 14, RFC 2119,
+         DOI 10.17487/RFC2119, March 1997,
+         <https://www.rfc-editor.org/info/rfc2119>.
+
+   [2]   Eisler, M., Ed., "XDR: External Data Representation
+         Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May
+         2006, <https://www.rfc-editor.org/info/rfc4506>.
+
+   [3]   Thurlow, R., "RPC: Remote Procedure Call Protocol
+         Specification Version 2", RFC 5531, DOI 10.17487/RFC5531,
+         May 2009, <https://www.rfc-editor.org/info/rfc5531>.
+
+   [4]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
+         Specification", RFC 2203, DOI 10.17487/RFC2203, September
+         1997, <https://www.rfc-editor.org/info/rfc2203>.
+
+   [5]   Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos
+         Version 5 Generic Security Service Application Program
+         Interface (GSS-API) Mechanism: Version 2", RFC 4121,
+         DOI 10.17487/RFC4121, July 2005,
+         <https://www.rfc-editor.org/info/rfc4121>.
+
+   [6]   The Open Group, "Section 3.191 of Chapter 3 of Base
+         Definitions of The Open Group Base Specifications Issue 6
+         IEEE Std 1003.1, 2004 Edition, HTML Version",
+         ISBN 1931624232, 2004, <https://www.opengroup.org>.
+
+   [7]   Linn, J., "Generic Security Service Application Program
+         Interface Version 2, Update 1", RFC 2743,
+         DOI 10.17487/RFC2743, January 2000,
+         <https://www.rfc-editor.org/info/rfc2743>.
+
+   [8]   Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
+         Garcia, "A Remote Direct Memory Access Protocol
+         Specification", RFC 5040, DOI 10.17487/RFC5040, October
+         2007, <https://www.rfc-editor.org/info/rfc5040>.
+
+   [9]   Eisler, M., "RPCSEC_GSS Version 2", RFC 5403,
+         DOI 10.17487/RFC5403, February 2009,
+         <https://www.rfc-editor.org/info/rfc5403>.
+
+   [10]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
+         "Network File System (NFS) Version 4 Minor Version 1
+         External Data Representation Standard (XDR) Description",
+         RFC 5662, DOI 10.17487/RFC5662, January 2010,
+         <https://www.rfc-editor.org/info/rfc5662>.
+
+   [11]  The Open Group, "Section 3.372 of Chapter 3 of Base
+         Definitions of The Open Group Base Specifications Issue 6
+         IEEE Std 1003.1, 2004 Edition, HTML Version",
+         ISBN 1931624232, 2004, <https://www.opengroup.org>.
+
+   [12]  Eisler, M., "IANA Considerations for Remote Procedure Call
+         (RPC) Network Identifiers and Universal Address Formats",
+         RFC 5665, DOI 10.17487/RFC5665, January 2010,
+         <https://www.rfc-editor.org/info/rfc5665>.
+
+   [13]  The Open Group, "Section 'read()' of System Interfaces of
+         The Open Group Base Specifications Issue 6 IEEE Std
+         1003.1, 2004 Edition, HTML Version", ISBN 1931624232,
+         2004, <https://www.opengroup.org>.
+
+   [14]  The Open Group, "Section 'readdir()' of System Interfaces
+         of The Open Group Base Specifications Issue 6 IEEE Std
+         1003.1, 2004 Edition, HTML Version", ISBN 1931624232,
+         2004, <https://www.opengroup.org>.
+ + [15] The Open Group, "Section 'write()' of System Interfaces of + The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [16] Hoffman, P. and M. Blanchet, "Preparation of + Internationalized Strings ("stringprep")", RFC 3454, + DOI 10.17487/RFC3454, December 2002, + <https://www.rfc-editor.org/info/rfc3454>. + + [17] The Open Group, "Section 'chmod()' of System Interfaces of + The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [18] International Organization for Standardization, + "Information Technology - Universal Multiple-octet coded + Character Set (UCS) - Part 1: Architecture and Basic + Multilingual Plane", ISO Standard 10646-1, May 1993. + + [19] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, + January 1998, <https://www.rfc-editor.org/info/rfc2277>. + + [20] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep + Profile for Internationalized Domain Names (IDN)", + RFC 3491, DOI 10.17487/RFC3491, March 2003, + <https://www.rfc-editor.org/info/rfc3491>. + + [21] The Open Group, "Section 'fcntl()' of System Interfaces of + The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [22] The Open Group, "Section 'fsync()' of System Interfaces of + The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [23] The Open Group, "Section 'getpwnam()' of System Interfaces + of The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [24] The Open Group, "Section 'unlink()' of System Interfaces + of The Open Group Base Specifications Issue 6 IEEE Std + 1003.1, 2004 Edition, HTML Version", ISBN 1931624232, + 2004, <https://www.opengroup.org>. + + [25] Schaad, J., Kaliski, B., and R. Housley, "Additional + Algorithms and Identifiers for RSA Cryptography for use in + the Internet X.509 Public Key Infrastructure Certificate + and Certificate Revocation List (CRL) Profile", RFC 4055, + DOI 10.17487/RFC4055, June 2005, + <https://www.rfc-editor.org/info/rfc4055>. + + [26] National Institute of Standards and Technology, "Computer + Security Objects Register", May 2016, + <https://csrc.nist.gov/projects/computer-security-objects- + register/algorithm-registration>. + + [27] Adamson, A. and N. Williams, "Remote Procedure Call (RPC) + Security Version 3", RFC 7861, DOI 10.17487/RFC7861, + November 2016, <https://www.rfc-editor.org/info/rfc7861>. + + [28] Neuman, C., Yu, T., Hartman, S., and K. Raeburn, "The + Kerberos Network Authentication Service (V5)", RFC 4120, + DOI 10.17487/RFC4120, July 2005, + <https://www.rfc-editor.org/info/rfc4120>. + + [29] Arends, R., Austein, R., Larson, M., Massey, D., and S. + Rose, "DNS Security Introduction and Requirements", + RFC 4033, DOI 10.17487/RFC4033, March 2005, + <https://www.rfc-editor.org/info/rfc4033>. + + [30] Hu, Z., Zhu, L., Heidemann, J., Mankin, A., Wessels, D., + and P. Hoffman, "Specification for DNS over Transport + Layer Security (TLS)", RFC 7858, DOI 10.17487/RFC7858, May + 2016, <https://www.rfc-editor.org/info/rfc7858>. + + [31] Adamson, A. and N. 
Williams, "Requirements for NFSv4 + Multi-Domain Namespace Deployment", RFC 8000, + DOI 10.17487/RFC8000, November 2016, + <https://www.rfc-editor.org/info/rfc8000>. + + [32] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct + Memory Access Transport for Remote Procedure Call Version + 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, + <https://www.rfc-editor.org/info/rfc8166>. + + [33] Lever, C., "Network File System (NFS) Upper-Layer Binding + to RPC-over-RDMA Version 1", RFC 8267, + DOI 10.17487/RFC8267, October 2017, + <https://www.rfc-editor.org/info/rfc8267>. + + [34] Hoffman, P. and P. McManus, "DNS Queries over HTTPS + (DoH)", RFC 8484, DOI 10.17487/RFC8484, October 2018, + <https://www.rfc-editor.org/info/rfc8484>. + + [35] Bradner, S., "The Internet Standards Process -- Revision + 3", BCP 9, RFC 2026, October 1996. + + Kolkman, O., Bradner, S., and S. Turner, "Characterization + of Proposed Standards", BCP 9, RFC 7127, January 2014. + + Dusseault, L. and R. Sparks, "Guidance on Interoperation + and Implementation Reports for Advancement to Draft + Standard", BCP 9, RFC 5657, September 2009. + + Housley, R., Crocker, D., and E. Burger, "Reducing the + Standards Track to Two Maturity Levels", BCP 9, RFC 6410, + October 2011. + + Resnick, P., "Retirement of the "Internet Official + Protocol Standards" Summary Document", BCP 9, RFC 7100, + December 2013. + + Dawkins, S., "Increasing the Number of Area Directors in + an IETF Area", BCP 9, RFC 7475, March 2015. + + <https://www.rfc-editor.org/info/bcp9> + +23.2. Informative References + + [36] Roach, A., "Process for Handling Non-Major Revisions to + Existing RFCs", Work in Progress, Internet-Draft, draft- + roach-bis-documents-00, 7 May 2019, + <https://tools.ietf.org/html/draft-roach-bis-documents- + 00>. + + [37] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., + Beame, C., Eisler, M., and D. Noveck, "Network File System + (NFS) version 4 Protocol", RFC 3530, DOI 10.17487/RFC3530, + April 2003, <https://www.rfc-editor.org/info/rfc3530>. + + [38] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS + Version 3 Protocol Specification", RFC 1813, + DOI 10.17487/RFC1813, June 1995, + <https://www.rfc-editor.org/info/rfc1813>. + + [39] Eisler, M., "LIPKEY - A Low Infrastructure Public Key + Mechanism Using SPKM", RFC 2847, DOI 10.17487/RFC2847, + June 2000, <https://www.rfc-editor.org/info/rfc2847>. + + [40] Eisler, M., "NFS Version 2 and Version 3 Security Issues + and the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", + RFC 2623, DOI 10.17487/RFC2623, June 1999, + <https://www.rfc-editor.org/info/rfc2623>. + + [41] Juszczak, C., "Improving the Performance and Correctness + of an NFS Server", USENIX Conference Proceedings, June + 1990. + + [42] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced + by an On-line Database", RFC 3232, DOI 10.17487/RFC3232, + January 2002, <https://www.rfc-editor.org/info/rfc3232>. + + [43] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", + RFC 1833, DOI 10.17487/RFC1833, August 1995, + <https://www.rfc-editor.org/info/rfc1833>. + + [44] Werme, R., "RPC XID Issues", USENIX Conference + Proceedings, February 1996. + + [45] Nowicki, B., "NFS: Network File System Protocol + specification", RFC 1094, DOI 10.17487/RFC1094, March + 1989, <https://www.rfc-editor.org/info/rfc1094>. + + [46] Bhide, A., Elnozahy, E. N., and S. P. Morgan, "A Highly + Available Network Server", USENIX Conference Proceedings, + January 1991. + + [47] Halevy, B., Welch, B., and J. 
Zelenka, "Object-Based + Parallel NFS (pNFS) Operations", RFC 5664, + DOI 10.17487/RFC5664, January 2010, + <https://www.rfc-editor.org/info/rfc5664>. + + [48] Black, D., Fridella, S., and J. Glasgow, "Parallel NFS + (pNFS) Block/Volume Layout", RFC 5663, + DOI 10.17487/RFC5663, January 2010, + <https://www.rfc-editor.org/info/rfc5663>. + + [49] Callaghan, B., "WebNFS Client Specification", RFC 2054, + DOI 10.17487/RFC2054, October 1996, + <https://www.rfc-editor.org/info/rfc2054>. + + [50] Callaghan, B., "WebNFS Server Specification", RFC 2055, + DOI 10.17487/RFC2055, October 1996, + <https://www.rfc-editor.org/info/rfc2055>. + + [51] IESG, "IESG Processing of RFC Errata for the IETF Stream", + July 2008, + <https://www.ietf.org/about/groups/iesg/statements/ + processing-rfc-errata/>. + + [52] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed- + Hashing for Message Authentication", RFC 2104, + DOI 10.17487/RFC2104, February 1997, + <https://www.rfc-editor.org/info/rfc2104>. + + [53] Shepler, S., "NFS Version 4 Design Considerations", + RFC 2624, DOI 10.17487/RFC2624, June 1999, + <https://www.rfc-editor.org/info/rfc2624>. + + [54] The Open Group, "Protocols for Interworking: XNFS, Version + 3W", ISBN 1-85912-184-5, February 1998. + + [55] Floyd, S. and V. Jacobson, "The Synchronization of + Periodic Routing Messages", IEEE/ACM Transactions on + Networking, 2(2), pp. 122-136, April 1994. + + [56] Chadalapaka, M., Satran, J., Meth, K., and D. Black, + "Internet Small Computer System Interface (iSCSI) Protocol + (Consolidated)", RFC 7143, DOI 10.17487/RFC7143, April + 2014, <https://www.rfc-editor.org/info/rfc7143>. + + [57] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version + (FCP-2)", ANSI/INCITS, 350-2003, October 2003. + + [58] Weber, R.O., "Object-Based Storage Device Commands (OSD)", + ANSI/INCITS, 400-2004, July 2004, + <https://www.t10.org/drafts.htm>. + + [59] Carns, P. H., Ligon III, W. B., Ross, R. B., and R. + Thakur, "PVFS: A Parallel File System for Linux + Clusters.", Proceedings of the 4th Annual Linux Showcase + and Conference, 2000. + + [60] The Open Group, "The Open Group Base Specifications Issue + 6, IEEE Std 1003.1, 2004 Edition", 2004, + <https://www.opengroup.org>. + + [61] Callaghan, B., "NFS URL Scheme", RFC 2224, + DOI 10.17487/RFC2224, October 1997, + <https://www.rfc-editor.org/info/rfc2224>. + + [62] Chiu, A., Eisler, M., and B. Callaghan, "Security + Negotiation for WebNFS", RFC 2755, DOI 10.17487/RFC2755, + January 2000, <https://www.rfc-editor.org/info/rfc2755>. + + [63] Cotton, M., Leiba, B., and T. Narten, "Guidelines for + Writing an IANA Considerations Section in RFCs", BCP 26, + RFC 8126, DOI 10.17487/RFC8126, June 2017, + <https://www.rfc-editor.org/info/rfc8126>. + + [64] RFC Errata, Erratum ID 2006, RFC 5661, + <https://www.rfc-editor.org/errata/eid2006>. + + [65] Spasojevic, M. and M. Satayanarayanan, "An Empirical Study + of a Wide-Area Distributed File System", ACM Transactions + on Computer Systems, Vol. 14, No. 2, pp. 200-222, + DOI 10.1145/227695.227698, May 1996, + <https://doi.org/10.1145/227695.227698>. + + [66] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., + "Network File System (NFS) Version 4 Minor Version 1 + Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, + <https://www.rfc-editor.org/info/rfc5661>. + + [67] Noveck, D., "Rules for NFSv4 Extensions and Minor + Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017, + <https://www.rfc-editor.org/info/rfc8178>. + + [68] Haynes, T., Ed. and D. 
Noveck, Ed., "Network File System + (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, + March 2015, <https://www.rfc-editor.org/info/rfc7530>. + + [69] Noveck, D., Ed., Shivam, P., Lever, C., and B. Baker, + "NFSv4.0 Migration: Specification Update", RFC 7931, + DOI 10.17487/RFC7931, July 2016, + <https://www.rfc-editor.org/info/rfc7931>. + + [70] Haynes, T., "Requirements for Parallel NFS (pNFS) Layout + Types", RFC 8434, DOI 10.17487/RFC8434, August 2018, + <https://www.rfc-editor.org/info/rfc8434>. + + [71] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an + Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May + 2014, <https://www.rfc-editor.org/info/rfc7258>. + + [72] Rescorla, E. and B. Korver, "Guidelines for Writing RFC + Text on Security Considerations", BCP 72, RFC 3552, + DOI 10.17487/RFC3552, July 2003, + <https://www.rfc-editor.org/info/rfc3552>. + +Appendix A. The Need for This Update + + This document includes an explanation of how clients and servers are + to determine the particular network access paths to be used to access + a file system. This includes descriptions of how to handle changes + to the specific replica to be used or to the set of addresses to be + used to access it, and how to deal transparently with transfers of + responsibility that need to be made. This includes cases in which + there is a shift between one replica and another and those in which + different network access paths are used to access the same replica. + + As a result of the following problems in RFC 5661 [66], it was + necessary to provide the specific updates that are made by this + document. These updates are described in Appendix B. + + * RFC 5661 [66], while it dealt with situations in which various + forms of clustering allowed coordination of the state assigned by + cooperating servers to be used, made no provisions for Transparent + State Migration. Within NFSv4.0, Transparent State Migration was + first explained clearly in RFC 7530 [68] and corrected and + clarified by RFC 7931 [69]. No corresponding explanation for + NFSv4.1 had been provided. + + * Although NFSv4.1 provided a clear definition of how trunking + detection was to be done, there was no clear specification of how + trunking discovery was to be done, despite the fact that the + specification clearly indicated that this information could be + made available via the file system location attributes. + + * Because the existence of multiple network access paths to the same + file system was dealt with as if there were multiple replicas, + issues relating to transitions between replicas could never be + clearly distinguished from trunking-related transitions between + the addresses used to access a particular file system instance. + As a result, in situations in which both migration and trunking + configuration changes were involved, neither of these could be + clearly dealt with, and the relationship between these two + features was not seriously addressed. + + * Because use of two network access paths to the same file system + instance (i.e., trunking) was often treated as if two replicas + were involved, it was considered that two replicas were being used + simultaneously. 
As a result, the treatment of replicas being used + simultaneously in RFC 5661 [66] was not clear, as it covered the + two distinct cases of a single file system instance being accessed + by two different network access paths and two replicas being + accessed simultaneously, with the limitations of the latter case + not being clearly laid out. + + The majority of the consequences of these issues are dealt with by + presenting in Section 11 a replacement for Section 11 of RFC 5661 + [66]. This replacement modifies existing subsections within that + section and adds new ones as described in Appendix B.1. Also, some + existing sections were deleted. These changes were made in order to + do the following: + + * Reorganize the description so that the case of two network access + paths to the same file system instance is distinguished clearly + from the case of two different replicas since, in the former case, + locking state is shared and there also can be sharing of session + state. + + * Provide a clear statement regarding the desirability of + transparent transfer of state between replicas together with a + recommendation that either transparent transfer or a single-fs + grace period be provided. + + * Specifically delineate how a client is to handle such transfers, + taking into account the differences from the treatment in [69] + made necessary by the major protocol changes to NFSv4.1. + + * Discuss the relationship between transparent state transfer and + Parallel NFS (pNFS). + + * Clarify the fs_locations_info attribute in order to specify which + portions of the provided information apply to a specific network + access path and which apply to the replica that the path is used + to access. + + In addition, other sections of RFC 5661 [66] were updated to correct + the consequences of the incorrect assumptions underlying the + treatment of multi-server namespace issues. These are described in + Appendices B.2 through B.4. + + * A revised introductory section regarding multi-server namespace + facilities is provided. + + * A more realistic treatment of server scope is provided. This + treatment reflects the more limited coordination of locking state + adopted by servers actually sharing a common server scope. + + * Some confusing text regarding changes in server_owner has been + clarified. + + * The description of some existing errors has been modified to more + clearly explain certain error situations to reflect the existence + of trunking and the possible use of fs-specific grace periods. + For details, see Appendix B.3. + + * New descriptions of certain existing operations are provided, + either because the existing treatment did not account for + situations that would arise in dealing with Transparent State + Migration, or because some types of reclaim issues were not + adequately dealt with in the context of fs-specific grace periods. + For details, see Appendix B.2. + +Appendix B. Changes in This Update + +B.1. Revisions Made to Section 11 of RFC 5661 + + A number of areas have been revised or extended, in many cases + replacing subsections within Section 11 of RFC 5661 [66]: + + * New introductory material, including a terminology section, + replaces the material in RFC 5661 [66], ranging from the start of + the original Section 11 up to and including Section 11.1. The new + material starts at the beginning of Section 11 and continues + through 11.2. + + * A significant reorganization of the material in Sections 11.4 and + 11.5 of RFC 5661 [66] was necessary. 
The reasons for the reorganization of these sections into a single
      section with multiple subsections are discussed in
      Appendix B.1.1 below.  This replacement appears as Section 11.5.

      New material relating to the handling of the file system
      location attributes is contained in Sections 11.5.1 and 11.5.7.

   *  A new section describing requirements for user and group
      handling within a multi-server namespace has been added as
      Section 11.7.

   *  A major replacement for Section 11.7 of RFC 5661 [66], entitled
      "Effecting File System Transitions", appears as Sections 11.9
      through 11.14.  The reasons for the reorganization of this
      section into multiple sections are discussed in Appendix B.1.2.

   *  A replacement for Section 11.10 of RFC 5661 [66], entitled "The
      Attribute fs_locations_info", appears as Section 11.17, with
      Appendix B.1.3 describing the differences between the new
      section and the treatment within [66].  A revised treatment was
      necessary because the original treatment did not make clear how
      the added attribute information relates to the case of trunked
      paths to the same replica.  These issues were not addressed in
      RFC 5661 [66], where the concepts of a replica and of a network
      path used to access a replica were not clearly distinguished.

B.1.1.  Reorganization of Sections 11.4 and 11.5 of RFC 5661

   Previously, issues related to the fact that multiple location
   entries directed the client to the same file system instance were
   dealt with in Section 11.5 of RFC 5661 [66].  Because of the new
   treatment of trunking, these issues now belong within Section 11.5
   of this document.

   In this new section, trunking is covered in Section 11.5.2 together
   with the other uses of file system location information described
   in Sections 11.5.3 through 11.5.6.

   As a result, Section 11.5, which replaces Section 11.4 of RFC 5661
   [66], is substantially different from the section it replaces:
   some original sections have been replaced by corresponding sections
   as described below, while new sections have been added:

   *  The material in Section 11.5, exclusive of subsections, replaces
      the material in Section 11.4 of RFC 5661 [66] exclusive of
      subsections.

   *  Section 11.5.1 is the new first subsection of the overall
      section.

   *  Section 11.5.2 is the new second subsection of the overall
      section.

   *  Sections 11.5.4, 11.5.5, and 11.5.6 replace Sections 11.4.1,
      11.4.2, and 11.4.3 of RFC 5661 [66], respectively.

   *  Section 11.5.7 is the new final subsection of the overall
      section.

B.1.2.  Reorganization of Material Dealing with File System
        Transitions

   The material relating to file system transitions, previously
   contained in Section 11.7 of RFC 5661 [66], has been reorganized
   and augmented as described below:

   *  Because there can be a shift of the network access paths used to
      access a file system instance without any shift between
      replicas, a new Section 11.9 distinguishes between those cases
      in which there is a shift between distinct replicas and those
      involving a shift in network access paths with no shift between
      replicas.

      As a result, the new Section 11.10 deals with network address
      transitions, while the bulk of the original Section 11.7 of RFC
      5661 [66] has been extensively modified as reflected in
      Section 11.11, which is now limited to cases in which there is a
      shift between two different sets of replicas.
+ + * The additional Section 11.12 discusses the case in which a shift + to a different replica is made and state is transferred to allow + the client the ability to have continued access to its accumulated + locking state on the new server. + + * The additional Section 11.13 discusses the client's response to + access transitions, how it determines whether migration has + occurred, and how it gets access to any transferred locking and + session state. + + * The additional Section 11.14 discusses the responsibilities of the + source and destination servers when transferring locking and + session state. + + This reorganization has caused a renumbering of the sections within + Section 11 of [66] as described below: + + * The new Sections 11.9 and 11.10 have resulted in the renumbering + of existing sections with these numbers. + + * Section 11.7 of [66] has been substantially modified and appears + as Section 11.11. The necessary modifications reflect the fact + that this section only deals with transitions between replicas, + while transitions between network addresses are dealt with in + other sections. Details of the reorganization are described later + in this section. + + * Sections 11.12, 11.13, and 11.14 have been added. + + * Consequently, Sections 11.8, 11.9, 11.10, and 11.11 in [66] now + appear as Sections 11.15, 11.16, 11.17, and 11.18, respectively. + + As part of this general reorganization, Section 11.7 of RFC 5661 [66] + has been modified as described below: + + * Sections 11.7 and 11.7.1 of RFC 5661 [66] have been replaced by + Sections 11.11 and 11.11.1, respectively. + + * Section 11.7.2 of RFC 5661 (and included subsections) has been + deleted. + + * Sections 11.7.3, 11.7.4, 11.7.5, 11.7.5.1, and 11.7.6 of RFC 5661 + [66] have been replaced by Sections 11.11.2, 11.11.3, 11.11.4, + 11.11.4.1, and 11.11.5 respectively in this document. + + * Section 11.7.7 of RFC 5661 [66] has been replaced by + Section 11.11.9. This subsection has been moved to the end of the + section dealing with file system transitions. + + * Sections 11.7.8, 11.7.9, and 11.7.10 of RFC 5661 [66] have been + replaced by Sections 11.11.6, 11.11.7, and 11.11.8 respectively in + this document. + +B.1.3. Updates to the Treatment of fs_locations_info + + Various elements of the fs_locations_info attribute contain + information that applies to either a specific file system replica or + to a network path or set of network paths used to access such a + replica. The original treatment of fs_locations_info (Section 11.10 + of RFC 5661 [66]) did not clearly distinguish these cases, in part + because the document did not clearly distinguish replicas from the + paths used to access them. + + In addition, special clarification has been provided with regard to + the following fields: + + * With regard to the handling of FSLI4GF_GOING, it was clarified + that this only applies to the unavailability of a replica rather + than to a path to access a replica. + + * In describing the appropriate value for a server to use for + fli_valid_for, it was clarified that there is no need for the + client to frequently fetch the fs_locations_info value to be + prepared for shifts in trunking patterns. + + * Clarification of the rules for extensions to the fls_info has been + provided. The original treatment reflected the extension model + that was in effect at the time RFC 5661 [66] was written, but has + been updated in accordance with the extension model described in + RFC 8178 [67]. + +B.2. 
Revisions Made to Operations in RFC 5661

   Descriptions have been revised to address issues that arose in
   effecting necessary changes to multi-server namespace features.

   *  The treatment of EXCHANGE_ID (Section 18.35 of RFC 5661 [66])
      assumed that client IDs cannot be created/confirmed other than
      by the EXCHANGE_ID and CREATE_SESSION operations.  Also, the
      necessary use of EXCHANGE_ID in recovery from migration and
      related situations was not clearly addressed.  A revised
      treatment of EXCHANGE_ID was necessary, and it appears in
      Section 18.35, while the specific differences between it and the
      treatment within [66] are explained in Appendix B.2.1 below.

   *  The treatment of RECLAIM_COMPLETE in Section 18.51 of RFC 5661
      [66] was not sufficiently clear about the purpose and use of
      rca_one_fs or about how the server was to deal with
      inappropriate values of this argument.  Because the resulting
      confusion raised interoperability issues, a new treatment of
      RECLAIM_COMPLETE was necessary, and it appears in Section 18.51,
      while the specific differences between it and the treatment
      within RFC 5661 [66] are discussed in Appendix B.2.2 below.  In
      addition, the definitions of the reclaim-related errors have
      received an updated treatment in Section 15.1.9 to reflect the
      fact that there are multiple contexts for lock reclaim
      operations.

B.2.1.  Revision of Treatment of EXCHANGE_ID

   There were a number of issues in the original treatment of
   EXCHANGE_ID in RFC 5661 [66] that caused problems for Transparent
   State Migration and for the transfer of access between different
   network access paths to the same file system instance.

   These issues arose from the fact that this treatment was written:

   *  Assuming that a client ID can only become known to a server by
      having been created by executing an EXCHANGE_ID, with
      confirmation of the ID only possible by execution of a
      CREATE_SESSION.

   *  Considering the interactions between a client and a server as
      only occurring on a single network address.

   As these assumptions have become invalid in the context of
   Transparent State Migration and active use of trunking, the
   treatment has been modified in several respects:

   *  It had been assumed that an EXCHANGE_ID executed when the server
      was already aware of a given client instance was either updating
      associated parameters (e.g., with respect to callbacks) or
      dealing with a previously lost reply by retransmitting.  As a
      result, any slot sequence returned by that operation would be of
      no use.  The original treatment went so far as to say that it
      "MUST NOT" be used, although this usage was not in accord with
      [1].  This created a difficulty when an EXCHANGE_ID was done
      after Transparent State Migration, since that slot sequence
      would need to be used in a subsequent CREATE_SESSION.

      In the updated treatment, CREATE_SESSION is one way in which
      client IDs are confirmed, but it is understood that other ways
      are possible.  The slot sequence can be used as needed, and
      cases in which it would be of no use are appropriately noted.

   *  It had been assumed that the only functions of EXCHANGE_ID were
      to inform the server of the client, to create the client ID, and
      to communicate it to the client.  When multiple simultaneous
      connections are involved, as often happens when trunking is in
      use, that treatment was inadequate in that it ignored the role
      of EXCHANGE_ID in associating the client ID with the connection
      on which it was done, so that it could be used by a subsequent
      CREATE_SESSION whose parameters do not include an explicit
      client ID.

      The new treatment explicitly discusses the role of EXCHANGE_ID
      in associating the client ID with the connection so it can be
      used by CREATE_SESSION and in associating a connection with an
      existing session.

   The new treatment can be found in Section 18.35 above.  It
   supersedes the treatment in Section 18.35 of RFC 5661 [66].
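   The connection-association role described above can be illustrated
   with a small model.  The sketch below is a toy illustration only:
   the class and method names are invented for this purpose, and
   nearly all protocol detail is omitted.  It shows a server
   recording, per connection, the client ID produced by the last
   EXCHANGE_ID, so that a later CREATE_SESSION arriving on the same
   connection can be matched to that client ID and confirm it.

   # Toy illustration (invented names); not the protocol's XDR or a
   # real server implementation.
   from itertools import count

   class ToyServer:
       def __init__(self):
           self._ids = count(1)
           self.clientids = {}     # client owner -> client ID
           self.confirmed = set()  # client IDs confirmed so far
           self.conn_assoc = {}    # connection -> client ID of the
                                   # last EXCHANGE_ID on it

       def exchange_id(self, conn, client_owner):
           # Reuse the existing client ID for a known client owner;
           # otherwise create a new, as yet unconfirmed, one.
           cid = self.clientids.setdefault(client_owner,
                                           next(self._ids))
           # Associate the client ID with this connection -- the role
           # that the original treatment did not address.
           self.conn_assoc[conn] = cid
           return cid

       def create_session(self, conn):
           # Use the client ID associated with this connection by the
           # preceding EXCHANGE_ID, and confirm it.  CREATE_SESSION is
           # one way in which client IDs are confirmed.
           cid = self.conn_assoc.get(conn)
           if cid is None:
               raise RuntimeError(
                   "no EXCHANGE_ID seen on this connection")
           self.confirmed.add(cid)
           return ("session", cid)

   In this model, an EXCHANGE_ID issued over a second connection for
   the same client owner yields the same client ID and associates it
   with that connection as well, which is the behavior needed when
   trunked connections are in use.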
B.2.2.  Revision of Treatment of RECLAIM_COMPLETE

   The following changes were made to the treatment of
   RECLAIM_COMPLETE in RFC 5661 [66] to arrive at the treatment in
   Section 18.51:

   *  In a number of places, the text was made more explicit about the
      purpose of rca_one_fs and its connection to file system
      migration.

   *  There is a discussion of the situations in which particular
      forms of RECLAIM_COMPLETE need to be performed.

   *  There is a discussion of the interoperability issues between
      implementations that may have arisen due to the lack of clarity
      of the previous treatment of RECLAIM_COMPLETE.
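   As a rough illustration of the two forms of RECLAIM_COMPLETE
   distinguished by rca_one_fs, the following sketch may be helpful.
   It is illustrative only: the helper name and its inputs are
   invented here, and the authoritative rules are those given in
   Section 18.51.

   # Sketch only (invented helper); the authoritative rules are in
   # Section 18.51.
   def reclaim_complete_args(after_server_restart, migrated_fs=None):
       if after_server_restart:
           # Global form: the client has finished reclaiming all of
           # its state following a server restart.
           return {"rca_one_fs": False}
       if migrated_fs is not None:
           # Per-fs form: reclaim is complete for a single file
           # system that has become newly accessible, e.g., following
           # migration; the operation applies to the file system of
           # the current filehandle.
           return {"rca_one_fs": True, "fs": migrated_fs}
       raise ValueError("no RECLAIM_COMPLETE is called for here")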
B.3.  Revisions Made to Error Definitions in RFC 5661

   The new handling of various situations required revisions to some
   existing error definitions:

   *  Because of the need to appropriately address trunking-related
      issues, some uses of the term "replica" in RFC 5661 [66] became
      problematic because a shift in network access paths was
      considered to be a shift to a different replica.  As a result,
      the original definition of NFS4ERR_MOVED (in Section 15.1.2.4 of
      RFC 5661 [66]) was updated to reflect the different handling of
      unavailability of a particular fs via a specific network
      address.

      Since such a situation is no longer considered to constitute
      unavailability of a file system instance, the description has
      been changed, even though the set of circumstances in which it
      is to be returned remains the same.  The new paragraph
      explicitly recognizes that a different network address might be
      used, while the previous description, misleadingly, treated this
      as a shift between two replicas even though only a single file
      system instance might be involved.  The updated description
      appears in Section 15.1.2.4.

   *  Because of the need to accommodate the use of fs-specific grace
      periods, it was necessary to clarify some of the definitions of
      reclaim-related errors in Section 15 of RFC 5661 [66] so that
      the text applies properly to reclaims for all types of grace
      periods.  The updated descriptions appear within Section 15.1.9.

   *  Because of the need to provide the clarifications in errata
      report 2006 [64] and to adapt these to properly explain the
      interaction of NFS4ERR_DELAY with the reply cache, a revised
      description of NFS4ERR_DELAY appears in Section 15.1.1.3.  This
      errata report, unlike many other RFC 5661 errata reports, is
      addressed in this document because of the extensive use of
      NFS4ERR_DELAY in connection with state migration and session
      migration.

B.4.  Other Revisions Made to RFC 5661

   Besides the major reworking of Section 11 of RFC 5661 [66] and the
   associated revisions to existing operations and errors, there were
   a number of related changes that were necessary:

   *  The summary in Section 1.7.3.3 of RFC 5661 [66] was revised to
      reflect the changes made to Section 11 above.  The updated
      summary appears as Section 1.8.3.3 above.

   *  The discussion of server scope in Section 2.10.4 of RFC 5661
      [66] was replaced, since it appeared to require a level of
      inter-server coordination incompatible with its basic function
      of avoiding the need for a globally uniform means of assigning
      server_owner values.  A revised treatment appears in
      Section 2.10.4.

   *  The discussion of trunking in Section 2.10.5 of RFC 5661 [66]
      was revised to more clearly explain the multiple types of
      trunking support and how the client can be made aware of the
      existing trunking configuration.  In addition, while the last
      paragraph (exclusive of subsections) of that section dealing
      with server_owner changes was literally true, it had been a
      source of confusion.  Since the original paragraph could be read
      as suggesting that such changes be handled nondisruptively, the
      issue was clarified in the revised Section 2.10.5.

Appendix C.  Security Issues That Need to Be Addressed

   The following issues in the treatment of security within the
   NFSv4.1 specification need to be addressed:

   *  The Security Considerations section of RFC 5661 [66] was not
      written in accordance with RFC 3552 (BCP 72) [72].  Of
      particular concern was the fact that the section did not contain
      a threat analysis.

   *  Initial analysis of the existing security issues with NFSv4.1
      has made it likely that a revised Security Considerations
      section for the existing protocol (one containing a threat
      analysis) would conclude that NFSv4.1 does not meet the goal of
      secure use on the Internet.

   The Security Considerations section of this document (Section 21)
   has not been thoroughly revised to correct the difficulties
   mentioned above.  Instead, it has been modified to take proper
   account of issues related to the multi-server namespace features
   discussed in Section 11, leaving the incomplete discussion and
   security weaknesses largely as they were.

   The following major security issues need to be addressed in a
   satisfactory fashion before an updated Security Considerations
   section can be published as part of a bis document for NFSv4.1:

   *  The continued use of AUTH_SYS and the security exposures it
      creates need to be addressed.  Addressing this issue must not be
      limited to the questions of whether the designation of this as
      OPTIONAL was justified and whether it should be changed.

      In any event, it may not be possible at this point to correct
      the security problems created by continued use of AUTH_SYS
      simply by revising this designation.

   *  The protocol's lack of attention to the possibility of pervasive
      monitoring attacks, such as those described in RFC 7258 [71]
      (also BCP 188), needs to be addressed.

      In that connection, the use of CREATE_SESSION without privacy
      protection needs to be addressed, as it exposes the session ID
      to view by an attacker.
This is worrisome because the session ID is precisely the type of
      protocol artifact alluded to in RFC 7258: it enables further
      mischief on the part of the attacker, including denial-of-
      service attacks that can be executed effectively with only a
      single, normally low-value, credential, even when RPCSEC_GSS
      authentication is in use.

   *  The lack of effective use of privacy and integrity, even where
      the infrastructure to support use of RPCSEC_GSS is present,
      needs to be addressed.

      In light of the security exposures that this situation creates,
      it is not enough to define a protocol that could address this
      problem given the provision of sufficient resources.  Instead,
      what is needed is a way to provide the necessary security with
      very limited performance costs and without requiring security
      infrastructure, which experience has shown is difficult for many
      clients and servers to provide.

   In trying to provide a major security upgrade for a deployed
   protocol such as NFSv4.1, the working group and the Internet
   community are likely to find themselves dealing with a number of
   considerations such as the following:

   *  The need to accommodate existing deployments of protocols
      specified previously in existing Proposed Standards.

   *  The difficulty of effecting changes to existing, interoperating
      implementations.

   *  The difficulty of making changes to NFSv4 protocols other than
      those in the form of OPTIONAL extensions.

   *  The tendency of those responsible for existing NFSv4 deployments
      to ignore security flaws in the context of local area networks
      under the mistaken impression that network isolation provides,
      in and of itself, isolation from all potential attackers.

   Given that the above-mentioned difficulties apply to minor version
   zero as well, it may make sense to deal with these security issues
   in a common document that applies to all NFSv4 minor versions.  If
   that approach is taken, the Security Considerations section of an
   eventual NFSv4.1 bis document would reference that common document,
   and the defining RFCs for other minor versions might do so as well.

Acknowledgments

Acknowledgments for This Update

   The authors wish to acknowledge the important role of Andy Adamson
   of NetApp in clarifying the need for trunking discovery
   functionality and in exploring the role of the file system location
   attributes in providing the necessary support.

   The authors wish to thank Tom Haynes of Hammerspace for drawing our
   attention to the fact that internationalization and security might
   best be handled in documents dealing with such protocol issues as
   they apply to all NFSv4 minor versions.

   The authors also wish to acknowledge the work of Xuan Qi of Oracle
   with NFSv4.1 client and server prototypes of Transparent State
   Migration functionality.

   The authors wish to thank others who brought attention to important
   issues.  The comments of Trond Myklebust of Primary Data related to
   trunking helped to clarify the role of DNS in trunking discovery.
   Rick Macklem's comments brought attention to problems in the
   handling of the per-fs version of RECLAIM_COMPLETE.

   The authors wish to thank Olga Kornievskaia of NetApp for her
   helpful review comments.

Acknowledgments for RFC 5661

   The initial text for the SECINFO extensions was edited by Mike
   Eisler, with contributions from Peng Dai, Sergey Klyushin, and Carl
   Burnett.
   The initial text for the SESSIONS extensions was edited by Tom
   Talpey, Spencer Shepler, and Jon Bauman, with contributions from
   Charles Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet
   Juszczak, Trond Myklebust, Dave Noveck, John Scott, Mike
   Stolarchuk, and Mark Wittle.

   Initial text relating to multi-server namespace features, including
   the concept of referrals, was contributed by Dave Noveck, Carl
   Burnett, and Charles Fan, with contributions from Ted Anderson,
   Neil Brown, and Jon Haswell.

   The initial text for the Directory Delegations support was
   contributed by Saadia Khan, with input from Dave Noveck, Mike
   Eisler, Carl Burnett, Ted Anderson, and Tom Talpey.

   The initial text for the ACL explanations was contributed by Sam
   Falkner and Lisa Week.

   The pNFS work was inspired by the NASD and OSD work done by Garth
   Gibson.  Gary Grider has also been a champion of high-performance
   parallel I/O.  Garth Gibson and Peter Corbett started the pNFS
   effort with a problem statement document for the IETF that formed
   the basis for the pNFS work in NFSv4.1.

   The initial text for the parallel NFS support was edited by Brent
   Welch and Garth Goodson.  Additional authors for those documents
   were Benny Halevy, David Black, and Andy Adamson.  Additional input
   came from the informal group that contributed to the construction
   of the initial pNFS drafts; specific acknowledgment goes to Gary
   Grider, Peter Corbett, Dave Noveck, Peter Honeyman, and Stephen
   Fridella.

   Fredric Isaman found several errors in draft versions of the ONC
   RPC XDR description of the NFSv4.1 protocol.

   Audrey Van Belleghem provided, in numerous ways, essential
   coordination and management of the process of editing the
   specification documents.

   Richard Jernigan gave feedback on the file layout's striping
   pattern design.

   Several formal inspection teams were formed to review various areas
   of the protocol.  All the inspections found significant errors and
   room for improvement.  NFSv4.1's inspection teams were:

   *  ACLs, with the following inspectors: Sam Falkner, Bruce Fields,
      Rahul Iyer, Saadia Khan, Dave Noveck, Lisa Week, Mario Wurzl,
      and Alan Yoder.

   *  Sessions, with the following inspectors: William Brown, Tom
      Doeppner, Robert Gordon, Benny Halevy, Fredric Isaman, Rick
      Macklem, Trond Myklebust, Dave Noveck, Karen Rochford, John
      Scott, and Peter Shah.

   *  Initial pNFS inspection, with the following inspectors: Andy
      Adamson, David Black, Mike Eisler, Marc Eshel, Sam Falkner,
      Garth Goodson, Benny Halevy, Rahul Iyer, Trond Myklebust,
      Spencer Shepler, and Lisa Week.

   *  Global namespace, with the following inspectors: Mike Eisler,
      Dan Ellard, Craig Everhart, Fredric Isaman, Trond Myklebust,
      Dave Noveck, Theresa Raj, Spencer Shepler, Renu Tewari, and
      Robert Thurlow.

   *  NFSv4.1 file layout type, with the following inspectors: Andy
      Adamson, Marc Eshel, Sam Falkner, Garth Goodson, Rahul Iyer,
      Trond Myklebust, and Lisa Week.

   *  NFSv4.1 locking and directory delegations, with the following
      inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Saadia
      Khan, Eric Kustarz, Dave Noveck, Spencer Shepler, and Amy
      Weaver.

   *  EXCHANGE_ID and DESTROY_CLIENTID, with the following inspectors:
      Mike Eisler, Pranoop Erasani, Robert Gordon, Benny Halevy,
      Fredric Isaman, Saadia Khan, Ricardo Labiaga, Rick Macklem,
      Trond Myklebust, Spencer Shepler, and Brent Welch.
   *  Final pNFS inspection, with the following inspectors: Andy
      Adamson, Mike Eisler, Marc Eshel, Sam Falkner, Jason Glasgow,
      Garth Goodson, Robert Gordon, Benny Halevy, Dean Hildebrand,
      Rahul Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk,
      Spencer Shepler, Renu Tewari, Lisa Week, and Brent Welch.

   A review team worked together to generate the tables of assignments
   of error sets to operations and make sure that each such assignment
   had two or more people validating it.  Participating in the process
   were Andy Adamson, Mike Eisler, Sam Falkner, Garth Goodson, Robert
   Gordon, Trond Myklebust, Dave Noveck, Spencer Shepler, Tom Talpey,
   Amy Weaver, and Lisa Week.

   Jari Arkko, David Black, Scott Bradner, Lisa Dusseault, Lars
   Eggert, Chris Newman, and Tim Polk provided valuable review and
   guidance.

   Olga Kornievskaia found several errors in the SSV specification.

   Ricardo Labiaga found several places where the use of RPCSEC_GSS
   was underspecified.

   Those who provided miscellaneous comments include: Andy Adamson,
   Sunil Bhargo, Alex Burlyga, Pranoop Erasani, Bruce Fields, Vadim
   Finkelstein, Jason Goldschmidt, Vijay K. Gurbani, Sergey Klyushin,
   Ricardo Labiaga, James Lentini, Anshul Madan, Daniel Muntz, Daniel
   Picken, Archana Ramani, Jim Rees, Mahesh Siddheshwar, Tom Talpey,
   and Peter Varga.

Authors' Addresses

   David Noveck (editor)
   NetApp
   1601 Trapelo Road, Suite 16
   Waltham, MA 02451
   United States of America

   Phone: +1-781-768-5347
   Email: dnoveck@netapp.com


   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI 48104
   United States of America

   Phone: +1-248-614-5091
   Email: chuck.lever@oracle.com