summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc6203.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc6203.txt')
-rw-r--r--doc/rfc/rfc6203.txt395
1 files changed, 395 insertions, 0 deletions
diff --git a/doc/rfc/rfc6203.txt b/doc/rfc/rfc6203.txt
new file mode 100644
index 0000000..e4ed001
--- /dev/null
+++ b/doc/rfc/rfc6203.txt
@@ -0,0 +1,395 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) T. Sirainen
+Request for Comments: 6203 March 2011
+Category: Standards Track
+ISSN: 2070-1721
+
+
+ IMAP4 Extension for Fuzzy Search
+
+Abstract
+
+ This document describes an IMAP protocol extension enabling a server
+ to perform searches with inexact matching and assigning relevancy
+ scores for matched messages.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc6203.
+
+Copyright Notice
+
+ Copyright (c) 2011 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+
+
+
+
+
+
+
+
+Sirainen Standards Track [Page 1]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+1. Introduction
+
+ When humans perform searches in IMAP clients, they typically want to
+ see the most relevant search results first. IMAP servers are able to
+ do this in the most efficient way when they're free to internally
+ decide how searches should match messages. This document describes a
+ new SEARCH=FUZZY extension that provides such functionality.
+
+2. Conventions Used in This Document
+
+ In examples, "C:" indicates lines sent by a client that is connected
+ to a server. "S:" indicates lines sent by the server to the client.
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in RFC 2119 [KEYWORDS].
+
+3. The FUZZY Search Key
+
+ The FUZZY search key takes another search key as its argument. The
+ server is allowed to perform all matching in an implementation-
+ defined manner for this search key, including ignoring the active
+ comparator as defined by [RFC5255]. Typically, this would be used to
+ search for strings. For example:
+
+ C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
+ S: * SEARCH 1 5 10
+ S: A1 OK Search completed.
+
+ Besides matching messages with a subject of "IMAP break", the above
+ search may also match messages with subjects "broken IMAP", "IMAP is
+ broken", or anything else the server decides that might be a good
+ match.
+
+ This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
+ search:
+
+ C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
+ S: * SEARCH 1 4
+ S: A2 OK Search completed.
+
+ How the server handles multiple separate FUZZY search keys is
+ implementation-defined.
+
+ Fuzzy search algorithms might change, or the results of the
+ algorithms might be different from search to search, so that fuzzy
+ searches with the same parameters might give different results for
+ 1) the same user at different times, 2) different users (searches
+
+
+
+Sirainen Standards Track [Page 2]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+ executed simultaneously), or 3) different users (searches executed at
+ different times). For example, a fuzzy search might adapt to a
+ user's search habits in an attempt to give more relevant results (in
+ a "learning" manner). Such differences can also occur because of
+ operational decisions, such as load balancing. Clients asking for
+ "fuzzy" really are requesting search results in a not-necessarily-
+ deterministic way and need to give the user appropriate warning about
+ that.
+
+4. Relevancy Scores for Search Results
+
+ Servers SHOULD assign a search relevancy score for each matched
+ message when the FUZZY search key is given. Relevancy scores are
+ given in the range 1-100, where 100 is the highest relevancy. The
+ relevancy scores SHOULD use the full 1-100 range, so that clients can
+ show them to users in a meaningful way, e.g., as a percentage value.
+
+ As the name already indicates, relevancy scores specify how relevant
+ to the search the matched message is. It's not necessarily the same
+ as how precisely the message matched. For example, a message whose
+ subject fuzzily matches the search string might get a higher
+ relevancy score than a message whose body had the exact string in the
+ middle of a sentence. When multiple search keys are matched fuzzily,
+ how the relevancy score is calculated is server-dependent.
+
+ If the server also advertises the ESEARCH capability as defined by
+ [ESEARCH], the relevancy scores can be retrieved using the new
+ RELEVANCY return option for SEARCH:
+
+ C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
+ S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
+ S: B1 OK Search completed.
+
+ In the example above, the server would treat "hello", "help", and
+ other similar strings as fuzzily matching the misspelled "Helo".
+
+ The RELEVANCY return option MUST NOT be used unless a FUZZY search
+ key is also given. Note that SEARCH results aren't sorted by
+ relevancy; SORT is needed for that.
+
+5. Fuzzy Matching with Non-String Search Keys
+
+ Fuzzy matching is not limited to just string matching. All search
+ keys SHOULD be matched fuzzily, although exactly what that means for
+ different search keys is left for server implementations to decide --
+ including deciding that fuzzy matching is meaningless for a
+ particular key, and falling back to exact matching. Some suggestions
+ are given below.
+
+
+
+Sirainen Standards Track [Page 3]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+ Dates:
+ A typical example could be when a user wants to find a message
+ "from Dave about a week ago". A client could perform this search
+ using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE
+ 24-Jan-2009). The server could return messages outside the
+ specified date range, but the further away the message is, the
+ lower the relevancy score.
+
+ Sizes:
+ These should be handled similarly to dates. If a user wants to
+ search for "about 1 MB attachments", the client could do this by
+ sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again, the
+ further away the message size is from the specified range, the
+ lower the relevancy score.
+
+ Flags:
+ If other search criteria match, the server could return messages
+ that don't have the specified flags set, but with lower relevancy
+ scores. SEARCH SUBJECT "xyz" FUZZY ANSWERED, for example, might
+ be useful if the user thinks the message he is looking for has the
+ ANSWERED flag set, but he isn't sure.
+
+ Unique Identifiers (UIDs), sequences, modification sequences: These
+ are examples of keys for which exact matching probably makes sense.
+ Alternatively, a server might choose, for instance, to expand a UID
+ range by 5% on each side.
+
+6. Extensions to SORT and SEARCH
+
+ If the server also advertises the SORT capability as defined by
+ [SORT], the results can be sorted by the new RELEVANCY sort criteria:
+
+ C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
+ S: * SORT 5 10 1
+ S: C1 OK Sort completed.
+
+ The message with the highest score is returned first. As with the
+ RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
+ unless a FUZZY search key is also given.
+
+ If the server also advertises the ESORT capability as defined by
+ [CONTEXT], the relevancy scores can be retrieved using the new
+ RELEVANCY return option for SORT:
+
+ C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
+ "Helo"
+ S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
+ S: C2 OK Sort completed.
+
+
+
+Sirainen Standards Track [Page 4]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+ Furthermore, if the server advertises the CONTEXT=SORT (or
+ CONTEXT=SEARCH) capability, then the client can limit the number of
+ returned messages to a SORT (or a SEARCH) by using the PARTIAL return
+ option. For example, this returns the 10 most relevant messages:
+
+ C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
+ "World"
+ S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
+ S: C3 OK Sort completed.
+
+7. Formal Syntax
+
+ The following syntax specification uses the augmented Backus-Naur
+ Form (BNF) as described in [ABNF]. It includes definitions from
+ [RFC3501], [IMAP-ABNF], and [SORT].
+
+ capability =/ "SEARCH=FUZZY"
+
+ score = 1*3DIGIT
+ ;; (1 <= n <= 100)
+
+ score-list = "(" [score *(SP score)] ")"
+
+ search-key =/ "FUZZY" SP search-key
+
+ search-return-data =/ "RELEVANCY" SP score-list
+ ;; Conforms to <search-return-data>, from [IMAP-ABNF]
+
+ search-return-opt =/ "RELEVANCY"
+ ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
+
+ sort-key =/ "RELEVANCY"
+
+8. Security Considerations
+
+ Implementation of this extension might enable denial-of-service
+ attacks against server resources. Servers MAY limit the resources
+ that a single search (or a single user) may use. Additionally,
+ implementors should be aware of the following: Fuzzy search engines
+ are often complex with non-obvious disk space, memory, and/or CPU
+ usage patterns. Server implementors should at least test the fuzzy-
+ search behavior with large messages that contain very long words
+ and/or unique random strings. Also, very long search keys might
+ cause excessive memory or CPU usage.
+
+ Invalid input may also be problematic. For example, if the search
+ engine takes a UTF-8 stream as input, it might fail more or less
+ badly when illegal UTF-8 sequences are fed to it from a message whose
+
+
+
+Sirainen Standards Track [Page 5]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+ character set was claimed to be UTF-8. This could be avoided by
+ validating all the input and, for example, replacing illegal UTF-8
+ sequences with the Unicode replacement character (U+FFFD).
+
+ Search relevancy rankings might be susceptible to "poisoning" by
+ smart attackers using certain keywords or hidden markup (e.g., HTML)
+ in their messages to boost the rankings. This can't be fully
+ prevented by servers, so clients should prepare for it by at least
+ allowing users to see all the search results, rather than hiding
+ results below a certain score.
+
+9. IANA Considerations
+
+ IMAP4 capabilities are registered by publishing a standards track or
+ IESG-approved experimental RFC. The "Internet Message Access
+ Protocol (IMAP) 4 Capabilities Registry" is available from
+ http://www.iana.org/.
+
+ This document defines the SEARCH=FUZZY IMAP capability. IANA has
+ added it to the registry.
+
+10. Acknowledgements
+
+ Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo, and Dave
+ Cridland have helped with this document.
+
+11. Normative References
+
+ [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for
+ Syntax Specifications: ABNF", STD 68, RFC 5234,
+ January 2008.
+
+ [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4",
+ RFC 5267, July 2008.
+
+ [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
+ Command for Controlling What Kind of Information Is
+ Returned", RFC 4731, November 2006.
+
+ [IMAP-ABNF] Melnikov, A. and C. Daboo, "Collected Extensions to
+ IMAP4 ABNF", RFC 4466, April 2006.
+
+ [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
+ 4rev1", RFC 3501, March 2003.
+
+
+
+
+Sirainen Standards Track [Page 6]
+
+RFC 6203 IMAP4 FUZZY Search March 2011
+
+
+ [RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
+ Message Access Protocol Internationalization", RFC 5255,
+ June 2008.
+
+ [SORT] Crispin, M. and K. Murchison, "Internet Message Access
+ Protocol - SORT and THREAD Extensions", RFC 5256,
+ June 2008.
+
+Author's Address
+
+ Timo Sirainen
+
+ EMail: tss@iki.fi
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Sirainen Standards Track [Page 7]
+