diff --git a/doc/rfc/rfc4313.txt b/doc/rfc/rfc4313.txt
new file mode 100644
index 0000000..a0249de
--- /dev/null
+++ b/doc/rfc/rfc4313.txt
@@ -0,0 +1,1123 @@
+
+
+
+
+
+
+Network Working Group D. Oran
+Request for Comments: 4313 Cisco Systems, Inc.
+Category: Informational December 2005
+
+
+ Requirements for Distributed Control of
+ Automatic Speech Recognition (ASR),
+ Speaker Identification/Speaker Verification (SI/SV), and
+ Text-to-Speech (TTS) Resources
+
+Status of this Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ This document outlines the needs and requirements for a protocol to
+ control distributed speech processing of audio streams. By speech
+ processing, this document specifically means automatic speech
+ recognition (ASR), speaker recognition -- which includes both speaker
+ identification (SI) and speaker verification (SV) -- and
+ text-to-speech (TTS). Other IETF protocols, such as SIP and Real
+ Time Streaming Protocol (RTSP), address rendezvous and control for
+ generalized media streams. However, speech processing presents
+ additional requirements that none of the extant IETF protocols
+ address.
+
+Table of Contents
+
+   1. Introduction
+      1.1. Document Conventions
+   2. SPEECHSC Framework
+      2.1. TTS Example
+      2.2. Automatic Speech Recognition Example
+      2.3. Speaker Identification Example
+   3. General Requirements
+      3.1. Reuse Existing Protocols
+      3.2. Maintain Existing Protocol Integrity
+      3.3. Avoid Duplicating Existing Protocols
+      3.4. Efficiency
+      3.5. Invocation of Services
+      3.6. Location and Load Balancing
+      3.7. Multiple Services
+      3.8. Multiple Media Sessions
+      3.9. Users with Disabilities
+      3.10. Identification of Process That Produced Media or Control
+            Output
+   4. TTS Requirements
+      4.1. Requesting Text Playback
+      4.2. Text Formats
+           4.2.1. Plain Text
+           4.2.2. SSML
+           4.2.3. Text in Control Channel
+           4.2.4. Document Type Indication
+      4.3. Control Channel
+      4.4. Media Origination/Termination by Control Elements
+      4.5. Playback Controls
+      4.6. Session Parameters
+      4.7. Speech Markers
+   5. ASR Requirements
+      5.1. Requesting Automatic Speech Recognition
+      5.2. XML
+      5.3. Grammar Requirements
+           5.3.1. Grammar Specification
+           5.3.2. Explicit Indication of Grammar Format
+           5.3.3. Grammar Sharing
+      5.4. Session Parameters
+      5.5. Input Capture
+   6. Speaker Identification and Verification Requirements
+      6.1. Requesting SI/SV
+      6.2. Identifiers for SI/SV
+      6.3. State for Multiple Utterances
+      6.4. Input Capture
+      6.5. SI/SV Functional Extensibility
+   7. Duplexing and Parallel Operation Requirements
+      7.1. Full Duplex Operation
+      7.2. Multiple Services in Parallel
+      7.3. Combination of Services
+   8. Additional Considerations (Non-Normative)
+   9. Security Considerations
+      9.1. SPEECHSC Protocol Security
+      9.2. Client and Server Implementation and Deployment
+      9.3. Use of SPEECHSC for Security Functions
+   10. Acknowledgements
+   11. References
+      11.1. Normative References
+      11.2. Informative References
+
+1. Introduction
+
+ There are multiple IETF protocols for establishment and termination
+   of media sessions (SIP [6]), low-level media control (Media Gateway
+   Control Protocol (MGCP) [7] and MEGACO [8]), and media record and
+   playback (RTSP [9]).  This document
+ focuses on requirements for one or more protocols to support the
+   control of network elements that perform Automatic Speech
+   Recognition (ASR), speaker identification or verification (SI/SV),
+   and the rendering of text into audio, also known as Text-to-Speech
+   (TTS).  Many multimedia
+ applications can benefit from having automatic speech recognition
+ (ASR) and text-to-speech (TTS) processing available as a distributed,
+ network resource. This requirements document limits its focus to the
+ distributed control of ASR, SI/SV, and TTS servers.
+
+ There is a broad range of systems that can benefit from a unified
+ approach to control of TTS, ASR, and SI/SV. These include
+ environments such as Voice over IP (VoIP) gateways to the Public
+ Switched Telephone Network (PSTN), IP telephones, media servers, and
+ wireless mobile devices that obtain speech services via servers on
+ the network.
+
+ To date, there are a number of proprietary ASR and TTS APIs, as well
+ as two IETF documents that address this problem [13], [14]. However,
+ there are serious deficiencies to the existing documents. In
+ particular, they mix the semantics of existing protocols yet are
+ close enough to other protocols as to be confusing to the
+ implementer.
+
+ This document sets forth requirements for protocols to support
+ distributed speech processing of audio streams. For simplicity, and
+ to remove confusion with existing protocol proposals, this document
+ presents the requirements as being for a "framework" that addresses
+ the distributed control of speech resources. It refers to such a
+ framework as "SPEECHSC", for Speech Services Control.
+
+1.1. Document Conventions
+
+ In this document, the key words "MUST", "MUST NOT", "REQUIRED",
+ "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
+ and "OPTIONAL" are to be interpreted as described in RFC 2119 [3].
+
+2. SPEECHSC Framework
+
+ Figure 1 below shows the SPEECHSC framework for speech processing.
+
+ +-------------+
+ | Application |
+ | Server |\
+ +-------------+ \ SPEECHSC
+ SIP, VoiceXML, / \
+ etc. / \
+ +------------+ / \ +-------------+
+ | Media |/ SPEECHSC \---| ASR, SI/SV, |
+ | Processing |-------------------------| and/or TTS |
+ RTP | Entity | RTP | Server |
+ =====| |=========================| |
+ +------------+ +-------------+
+
+ Figure 1: SPEECHSC Framework
+
+ The "Media Processing Entity" is a network element that processes
+ media. It may be a pure media handler, or it may also have an
+ associated SIP user agent, VoiceXML browser, or other control entity.
+ The "ASR, SI/SV, and/or TTS Server" is a network element that
+ performs the back-end speech processing. It may generate an RTP
+ stream as output based on text input (TTS) or return recognition
+ results in response to an RTP stream as input (ASR, SI/SV). The
+ "Application Server" is a network element that instructs the Media
+ Processing Entity on what transformations to make to the media
+ stream. Those instructions may be established via a session protocol
+ such as SIP, or provided via a client/server exchange such as
+ VoiceXML. The framework allows either the Media Processing Entity or
+ the Application Server to control the ASR or TTS Server using
+ SPEECHSC as a control protocol, which accounts for the SPEECHSC
+ protocol appearing twice in the diagram.
+
+   The entities may be physically embodied one per platform, or some
+   combination of entities may share a platform.  For example, a
+ VoiceXML [11] gateway may combine the ASR and TTS functions on the
+ same platform as the Media Processing Entity. Note that VoiceXML
+ gateways themselves are outside the scope of this protocol.
+ Likewise, one can combine the Application Server and Media Processing
+ Entity, as would be the case in an interactive voice response (IVR)
+ platform.
+
+ One can also decompose the Media Processing Entity into an entity
+ that controls media endpoints and entities that process media
+ directly. Such would be the case with a decomposed gateway using
+ MGCP or MEGACO. However, this decomposition is again orthogonal to
+ the scope of SPEECHSC. The following subsections provide a number of
+   example use cases of the SPEECHSC framework, one each for TTS, ASR,
+   and SI/SV.
+ They are intended to be illustrative only, and not to imply any
+ restriction on the scope of the framework or to limit the
+ decomposition or configuration to that shown in the example.
+
+2.1. TTS Example
+
+ This example illustrates a simple usage of SPEECHSC to provide a
+ Text-to-Speech service for playing announcements to a user on a phone
+ with no display for textual error messages. The example scenario is
+ shown below in Figure 2. In the figure, the VoIP gateway acts as
+ both the Media Processing Entity and the Application Server of the
+ SPEECHSC framework in Figure 1.
+
+ +---------+
+ _| SIP |
+ _/ | Server |
+ +-----------+ SIP/ +---------+
+ | | _/
+ +-------+ | VoIP |_/
+ | POTS |___| Gateway | RTP +---------+
+ | Phone | | (SIP UA) |=========| |
+ +-------+ | |\_ | SPEECHSC|
+ +-----------+ \ | TTS |
+ \__ | Server |
+ SPEECHSC | |
+ \_| |
+ +---------+
+
+ Figure 2: Text-to-Speech Example of SPEECHSC
+
+ The Plain Old Telephone Service (POTS) phone on the left attempts to
+ make a phone call. The VoIP gateway, acting as a SIP UA, tries to
+ establish a SIP session to complete the call, but gets an error, such
+ as a SIP "486 Busy Here" response. Without SPEECHSC, the gateway
+ would most likely just output a busy signal to the POTS phone.
+ However, with SPEECHSC access to a TTS server, it can provide a
+ spoken error message. The VoIP gateway therefore constructs a text
+ error string using information from the SIP messages, such as "Your
+ call to 978-555-1212 did not go through because the called party was
+ busy". It then can use SPEECHSC to establish an association with a
+ SPEECHSC server, open an RTP stream between itself and the server,
+ and issue a TTS request for the error message, which will be played
+ to the user on the POTS phone.
+
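+   For concreteness, the exchange might look like the following sketch.
+   SPEECHSC does not define a concrete protocol, so the message names
+   and syntax below are purely illustrative (borrowing an MRCP-like
+   request/response style) and show only the sequence of events:
+
+      C->S: SPEAK request (illustrative syntax only)
+               Content-Type: text/plain
+
+               Your call to 978-555-1212 did not go through because
+               the called party was busy.
+
+      S->C: 200 IN-PROGRESS   ; TTS playout begins on the RTP stream
+
+      S->C: SPEAK-COMPLETE    ; playout finished; the association and
+                              ; RTP stream may be reused or torn down
+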
+2.2. Automatic Speech Recognition Example
+
+ This example illustrates a VXML-enabled media processing entity and
+ associated application server using the SPEECHSC framework to supply
+ an ASR-based user interface through an Interactive Voice Response
+ (IVR) system. The example scenario is shown below in Figure 3. The
+ VXML-client corresponds to the "media processing entity", while the
+ IVR application server corresponds to the "application server" of the
+ SPEECHSC framework of Figure 1.
+
+ +------------+
+ | IVR |
+ _|Application |
+ VXML_/ +------------+
+ +-----------+ __/
+ | |_/ +------------+
+ PSTN Trunk | VoIP | SPEECHSC| |
+ =============| Gateway |---------| SPEECHSC |
+ |(VXML voice| | ASR |
+ | browser) |=========| Server |
+ +-----------+ RTP +------------+
+
+ Figure 3: Automatic Speech Recognition Example
+
+ In this example, users call into the service in order to obtain stock
+ quotes. The VoIP gateway answers their PSTN call. An IVR
+ application feeds VXML scripts to the gateway to drive the user
+ interaction. The VXML interpreter on the gateway directs the user's
+ media stream to the SPEECHSC ASR server and uses SPEECHSC to control
+ the ASR server.
+
+ When, for example, the user speaks the name of a stock in response to
+ an IVR prompt, the SPEECHSC ASR server attempts recognition of the
+ name, and returns the results to the VXML gateway. The VXML gateway,
+ following standard VXML mechanisms, informs the IVR Application of
+ the recognized result. The IVR Application can then do the
+ appropriate information lookup. The answer, of course, can be sent
+ back to the user using text-to-speech. This example does not show
+ this scenario, but it would work analogously to the scenario shown in
+   Section 2.1.
+
+2.3. Speaker Identification Example
+
+ This example illustrates using speaker identification to allow
+ voice-actuated login to an IP phone. The example scenario is shown
+ below in Figure 4. In the figure, the IP Phone acts as both the
+ "Media Processing Entity" and the "Application Server" of the
+ SPEECHSC framework in Figure 1.
+
+      +-----------+            +---------+
+      |           |    RTP     |         |
+      |    IP     |============| SPEECHSC|
+      |   Phone   |            |  SI/SV  |
+      |           |____________|  Server |
+      |           |  SPEECHSC  |         |
+      +-----------+            +---------+
+
+ Figure 4: Speaker Identification Example
+
+ In this example, a user speaks into a SIP phone in order to get
+ "logged in" to that phone to make and receive phone calls using his
+ identity and preferences. The IP phone uses the SPEECHSC framework
+ to set up an RTP stream between the phone and the SPEECHSC SI/SV
+ server and to request verification. The SV server verifies the
+ user's identity and returns the result, including the necessary login
+ credentials, to the phone via SPEECHSC. The IP Phone may use the
+ identity directly to identify the user in outgoing calls, to fetch
+ the user's preferences from a configuration server, or to request
+ authorization from an Authentication, Authorization, and Accounting
+ (AAA) server, in any combination. Since this example uses SPEECHSC
+ to perform a security-related function, be sure to note the
+ associated material in Section 9.
+
+3. General Requirements
+
+3.1. Reuse Existing Protocols
+
+ To the extent feasible, the SPEECHSC framework SHOULD use existing
+ protocols.
+
+3.2. Maintain Existing Protocol Integrity
+
+ In meeting the requirement of Section 3.1, the SPEECHSC framework
+ MUST NOT redefine the semantics of an existing protocol. Said
+ differently, we will not break existing protocols or cause
+ backward-compatibility problems.
+
+3.3. Avoid Duplicating Existing Protocols
+
+ To the extent feasible, SPEECHSC SHOULD NOT duplicate the
+ functionality of existing protocols. For example, network
+ announcements using SIP [12] and RTSP [9] already define how to
+   request playback of audio.  The focus of SPEECHSC is new
+   functionality that is not addressed by existing protocols, or
+   extensions of existing protocols within the strictures of the
+   requirement in
+ Section 3.2. Where an existing protocol can be gracefully extended
+ to support SPEECHSC requirements, such extensions are acceptable
+ alternatives for meeting the requirements.
+
+   As a corollary to this, the SPEECHSC framework should not require a separate
+ protocol to perform functions that could be easily added into the
+ SPEECHSC protocol (like redirecting media streams, or discovering
+ capabilities), unless it is similarly easy to embed that protocol
+ directly into the SPEECHSC framework.
+
+3.4. Efficiency
+
+ The SPEECHSC framework SHOULD employ protocol elements known to
+ result in efficient operation. Techniques to be considered include:
+
+ o Re-use of transport connections across sessions
+ o Piggybacking of responses on requests in the reverse direction
+ o Caching of state across requests
+
+3.5. Invocation of Services
+
+ The SPEECHSC framework MUST be compliant with the IAB Open Pluggable
+ Edge Services (OPES) [4] framework. The applicability of the
+ SPEECHSC protocol will therefore be specified as occurring between
+ clients and servers at least one of which is operating directly on
+ behalf of the user requesting the service.
+
+3.6. Location and Load Balancing
+
+ To the extent feasible, the SPEECHSC framework SHOULD exploit
+ existing schemes for supporting service location and load balancing,
+ such as the Service Location Protocol [13] or DNS SRV records [14].
+ Where such facilities are not deemed adequate, the SPEECHSC framework
+ MAY define additional load balancing techniques.
+
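+   As an illustration of reusing DNS SRV, the sketch below shows how a
+   client might locate candidate servers.  It assumes the third-party
+   dnspython package, and the "_speechsc._tcp" service label is
+   hypothetical, since no such label is registered.
+
+      # Minimal sketch of SRV-based server location (RFC 2782).
+      # Assumes dnspython; the "_speechsc._tcp" label is hypothetical.
+      import dns.resolver
+
+      def locate_speechsc_servers(domain):
+          """Return (priority, weight, host, port) tuples, best first."""
+          answers = dns.resolver.resolve("_speechsc._tcp." + domain,
+                                         "SRV")
+          records = [(r.priority, r.weight,
+                      str(r.target).rstrip("."), r.port)
+                     for r in answers]
+          # Lower priority is preferred.  A full implementation would
+          # pick among equal priorities by weighted random selection.
+          records.sort(key=lambda r: (r[0], -r[1]))
+          return records
+
+      for priority, weight, host, port in locate_speechsc_servers(
+              "example.com"):
+          print(host, port)
+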
+3.7. Multiple Services
+
+ The SPEECHSC framework MUST permit multiple services to operate on a
+ single media stream so that either the same or different servers may
+ be performing speech recognition, speaker identification or
+ verification, etc., in parallel.
+
+3.8. Multiple Media Sessions
+
+ The SPEECHSC framework MUST allow a 1:N mapping between session and
+ RTP channels. For example, a single session may include an outbound
+ RTP channel for TTS, an inbound for ASR, and a different inbound for
+ SI/SV (e.g., if processed by different elements on the Media Resource
+ Element). Note: All of these can be described via SDP, so if SDP is
+ utilized for media channel description, this requirement is met "for
+ free".
+
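+   As an illustration of the note above, a single SDP description such
+   as the following sketch (addresses and ports are placeholders) could
+   establish all three RTP channels of the example within one session:
+
+      v=0
+      o=client 2890844526 2890842807 IN IP4 192.0.2.10
+      s=SPEECHSC media session (illustrative)
+      c=IN IP4 192.0.2.10
+      t=0 0
+      m=audio 49170 RTP/AVP 0
+      a=recvonly
+      m=audio 49172 RTP/AVP 0
+      a=sendonly
+      m=audio 49174 RTP/AVP 0
+      a=sendonly
+
+   Here the first media line carries TTS output toward the client,
+   while the second and third carry captured audio toward the ASR and
+   SI/SV engines, assumed in this sketch to be distinct elements.
+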
+3.9. Users with Disabilities
+
+ The SPEECHSC framework must have sufficient capabilities to address
+ the critical needs of people with disabilities. In particular, the
+ set of requirements set forth in RFC 3351 [5] MUST be taken into
+ account by the framework. It is also important that implementers of
+ SPEECHSC clients and servers be cognizant that some interaction
+ modalities of SPEECHSC may be inconvenient or simply inappropriate
+ for disabled users. Hearing-impaired individuals may find TTS of
+ limited utility. Speech-impaired users may be unable to make use of
+ ASR or SI/SV capabilities. Therefore, systems employing SPEECHSC
+ MUST provide alternative interaction modes or avoid the use of speech
+ processing entirely.
+
+3.10. Identification of Process That Produced Media or Control Output
+
+ The client of a SPEECHSC operation SHOULD be able to ascertain via
+ the SPEECHSC framework what speech process produced the output. For
+ example, an RTP stream containing the spoken output of TTS should be
+ identifiable as TTS output, and the recognized utterance of ASR
+ should be identifiable as having been produced by ASR processing.
+
+4. TTS Requirements
+
+4.1. Requesting Text Playback
+
+ The SPEECHSC framework MUST allow a Media Processing Entity or
+ Application Server, using a control protocol, to request the TTS
+ Server to play back text as voice in an RTP stream.
+
+4.2. Text Formats
+
+4.2.1. Plain Text
+
+ The SPEECHSC framework MAY assume that all TTS servers are capable of
+   reading plain text.  For reading plain text, the framework MUST allow the
+ language and voicing to be indicated via session parameters. For
+ finer control over such properties, see [1].
+
+4.2.2. SSML
+
+ The SPEECHSC framework MUST support Speech Synthesis Markup Language
+   (SSML) [1] <speak> basics, and SHOULD support other SSML tags.  The
+ framework assumes all TTS servers are capable of reading SSML
+ formatted text. Internationalization of TTS in the SPEECHSC
+ framework, including multi-lingual output within a single utterance,
+ is accomplished via SSML xml:lang tags.
+
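+   For illustration, the following SSML fragment [1] produces
+   multi-lingual output within a single utterance by scoping xml:lang
+   to a <voice> element:
+
+      <?xml version="1.0"?>
+      <speak version="1.0"
+             xmlns="http://www.w3.org/2001/10/synthesis"
+             xml:lang="en-US">
+        The French for cheese is
+        <voice xml:lang="fr-FR">fromage</voice>.
+      </speak>
+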
+4.2.3. Text in Control Channel
+
+ The SPEECHSC framework assumes all TTS servers accept text over the
+ SPEECHSC connection for reading over the RTP connection. The
+ framework assumes the server can accept text either "by value"
+ (embedded in the protocol) or "by reference" (e.g., by de-referencing
+ a Uniform Resource Identifier (URI) embedded in the protocol).
+
+4.2.4. Document Type Indication
+
+ A document type specifies the syntax in which the text to be read is
+ encoded. The SPEECHSC framework MUST be capable of explicitly
+ indicating the document type of the text to be processed, as opposed
+ to forcing the server to infer the content by other means.
+
+4.3. Control Channel
+
+ The SPEECHSC framework MUST be capable of establishing the control
+ channel between the client and server on a per-session basis, where a
+ session is loosely defined to be associated with a single "call" or
+ "dialog". The protocol SHOULD be capable of maintaining a long-lived
+   control channel for multiple sessions serially, and MAY support
+   shorter time horizons as well, down to the processing of a single
+   utterance.
+
+4.4. Media Origination/Termination by Control Elements
+
+ The SPEECHSC framework MUST NOT require the controlling element
+ (application server, media processing entity) to accept or originate
+   media streams.  Media streams MAY originate from and terminate at
+   the controlled element (ASR, TTS, etc.).
+
+4.5. Playback Controls
+
+ The SPEECHSC framework MUST support "VCR controls" for controlling
+ the playout of streaming media output from SPEECHSC processing, and
+ MUST allow for servers with varying capabilities to accommodate such
+ controls. The protocol SHOULD allow clients to state what controls
+ they wish to use, and for servers to report which ones they honor.
+ These capabilities include:
+
+ o The ability to jump in time to the location of a specific marker.
+ o The ability to jump in time, forwards or backwards, by a specified
+ amount of time. Valid time units MUST include seconds, words,
+ paragraphs, sentences, and markers.
+ o The ability to increase and decrease playout speed.
+ o The ability to fast-forward and fast-rewind the audio, where
+ snippets of audio are played as the server moves forwards or
+ backwards in time.
+ o The ability to pause and resume playout.
+ o The ability to increase and decrease playout volume.
+
+ These controls SHOULD be made easily available to users through the
+ client user interface and through per-user customization capabilities
+ of the client. This is particularly important for hearing-impaired
+ users, who will likely desire settings and control regimes different
+ from those that would be acceptable for non-impaired users.
+
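+   A minimal sketch of the capability declaration this requirement
+   implies is shown below; the control names and the negotiation
+   function are hypothetical, since SPEECHSC defines neither.
+
+      # Hypothetical negotiation of playback ("VCR") controls.  The
+      # control names mirror the list above but are illustrative only.
+      CLIENT_WANTS = {"jump-to-marker", "jump-relative", "speed",
+                      "fast-forward-rewind", "pause-resume", "volume"}
+
+      def negotiate(client_wants, server_honors):
+          """Return the controls both sides support, and the rest."""
+          granted = client_wants & server_honors
+          return granted, client_wants - granted
+
+      # Example: a server that cannot seek to markers.
+      granted, declined = negotiate(
+          CLIENT_WANTS,
+          {"jump-relative", "speed", "pause-resume", "volume"})
+      print(sorted(granted), sorted(declined))
+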
+4.6. Session Parameters
+
+ The SPEECHSC framework MUST support the specification of session
+ parameters, such as language, prosody, and voicing.
+
+4.7. Speech Markers
+
+ The SPEECHSC framework MUST accommodate speech markers, with
+ capability at least as flexible as that provided in SSML [1]. The
+ framework MUST further provide an efficient mechanism for reporting
+ that a marker has been reached during playout.
+
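+   As an illustration, markers are placed in SSML with the <mark>
+   element [1]; the SPEECHSC requirement is that the server report,
+   with low overhead, when playout crosses such a marker:
+
+      <speak version="1.0"
+             xmlns="http://www.w3.org/2001/10/synthesis"
+             xml:lang="en-US">
+        Your balance is <mark name="before-amount"/> forty-two dollars.
+      </speak>
+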
+5. ASR Requirements
+
+5.1. Requesting Automatic Speech Recognition
+
+ The SPEECHSC framework MUST allow a Media Processing Entity or
+ Application Server to request the ASR Server to perform automatic
+ speech recognition on an RTP stream, returning the results over
+ SPEECHSC.
+
+5.2. XML
+
+   The SPEECHSC framework assumes that all ASR servers support the
+   Speech Recognition Grammar Specification (SRGS) [2] used by
+   VoiceXML.
+
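+   For illustration, a minimal SRGS grammar [2] for the stock-quote
+   example of Section 2.2 might look as follows (the rule contents are
+   hypothetical):
+
+      <?xml version="1.0"?>
+      <grammar version="1.0"
+               xmlns="http://www.w3.org/2001/06/grammar"
+               xml:lang="en-US" root="stock">
+        <rule id="stock" scope="public">
+          <one-of>
+            <item>cisco</item>
+            <item>example corporation</item>
+          </one-of>
+        </rule>
+      </grammar>
+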
+5.3. Grammar Requirements
+
+5.3.1. Grammar Specification
+
+ The SPEECHSC framework assumes all ASR servers are capable of
+ accepting grammar specifications either "by value" (embedded in the
+ protocol) or "by reference" (e.g., by de-referencing a URI embedded
+ in the protocol). The latter MUST allow the indication of a grammar
+ already known to, or otherwise "built in" to, the server. The
+ framework and protocol further SHOULD exploit the ability to store
+ and later retrieve by reference large grammars that were originally
+ supplied by the client.
+
+5.3.2. Explicit Indication of Grammar Format
+
+ The SPEECHSC framework protocol MUST be able to explicitly convey the
+ grammar format in which the grammar is encoded and MUST be extensible
+ to allow for conveying new grammar formats as they are defined.
+
+5.3.3. Grammar Sharing
+
+   The SPEECHSC framework SHOULD exploit sharing grammars across
+   sessions for servers that are capable of doing so.  This supports
+   applications with large grammars that are unrealistic to load
+   dynamically on each use.  An example is a city-country grammar for a
+   weather service.
+
+5.4. Session Parameters
+
+   The SPEECHSC framework MUST accommodate at a minimum all of the
+   protocol parameters currently defined in the Media Resource Control
+   Protocol (MRCP) [10].  In addition, there SHOULD be a capability to
+   reset parameters within a session.
+
+5.5. Input Capture
+
+   The SPEECHSC framework MUST support a method for directing the ASR
+   Server to capture the input media stream for later analysis and
+   tuning of the ASR engine.
+
+6. Speaker Identification and Verification Requirements
+
+6.1. Requesting SI/SV
+
+ The SPEECHSC framework MUST allow a Media Processing Entity to
+ request the SI/SV Server to perform speaker identification or
+ verification on an RTP stream, returning the results over SPEECHSC.
+
+6.2. Identifiers for SI/SV
+
+ The SPEECHSC framework MUST accommodate an identifier for each
+ verification resource and permit control of that resource by ID,
+ because voiceprint format and contents are vendor specific.
+
+6.3. State for Multiple Utterances
+
+ The SPEECHSC framework MUST work with SI/SV servers that maintain
+ state to handle multi-utterance verification.
+
+6.4. Input Capture
+
+ The SPEECHSC framework MUST support a method for capturing the input
+ media stream for later analysis and tuning of the SI/SV engine. The
+ framework may assume all servers are capable of doing so. In
+ addition, the framework assumes that the captured stream contains
+ enough timestamp context (e.g., the NTP time range from the RTP
+ Control Protocol (RTCP) packets, which corresponds to the RTP
+ timestamps of the captured input) to ascertain after the fact exactly
+ when the verification was requested.
+
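+   The timestamp recovery described above is simple arithmetic: an RTCP
+   Sender Report (SR) binds an RTP timestamp to an NTP wall-clock time,
+   after which any captured RTP timestamp can be placed on the wall
+   clock.  A sketch, assuming a narrowband 8 kHz audio clock:
+
+      # Map a captured RTP timestamp to wall-clock time using the
+      # RTP/NTP pair from an RTCP Sender Report (RFC 3550 semantics).
+      def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds,
+                           clock_rate=8000):
+          """Return the wall-clock time of rtp_ts, in NTP-era seconds.
+
+          rtp_ts, sr_rtp_ts -- 32-bit RTP timestamps (same SSRC)
+          sr_ntp_seconds    -- NTP time from the SR, as a float
+          clock_rate        -- RTP clock rate of the audio codec
+          """
+          delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF  # 32-bit wraparound
+          if delta >= 0x80000000:      # timestamps just before the SR
+              delta -= 0x100000000     # map backwards, not far forwards
+          return sr_ntp_seconds + delta / clock_rate
+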
+6.5. SI/SV Functional Extensibility
+
+ The SPEECHSC framework SHOULD be extensible to additional functions
+ associated with SI/SV, such as prompting, utterance verification, and
+ retraining.
+
+7. Duplexing and Parallel Operation Requirements
+
+   For an interactive speech-driven system, user perception of the
+   quality of the interaction depends strongly on the ability of the
+   user to interrupt a prompt or rendered TTS with speech; supporting
+   such interruption is therefore a very important requirement.
+   Interrupting, or barging in on, the speech output requires more than
+   energy detection from the user's direction.  Many advanced systems
+   halt the media sent toward the user by employing the ASR engine to
+   decide whether an utterance is likely to be real speech, as opposed
+   to, for example, a cough.
+
+7.1. Full Duplex Operation
+
+ To achieve low latency between utterance detection and halting of
+ playback, many implementations combine the speaking and ASR
+ functions. The SPEECHSC framework MUST support such full-duplex
+ implementations.
+
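+   A sketch of the control flow that such a full-duplex implementation
+   implies appears below.  The event names and engine interface are
+   hypothetical, and a real engine would run recognition and playout
+   concurrently in earnest rather than via this simplified simulation.
+
+      import queue
+      import threading
+      import time
+
+      LIKELY_SPEECH = "likely-speech"  # ASR judged input to be speech
+      PLAYOUT_DONE = "playout-done"    # TTS finished the prompt
+
+      class FakeTTS:
+          """Stand-in for a TTS engine streaming a prompt over RTP."""
+          def __init__(self, events):
+              self.events = events
+              self.stopped = threading.Event()
+          def start(self):
+              threading.Thread(target=self._play, daemon=True).start()
+          def _play(self):
+              for _ in range(50):              # "render" the prompt
+                  if self.stopped.is_set():
+                      return
+                  time.sleep(0.1)
+              self.events.put(PLAYOUT_DONE)
+          def stop(self):
+              self.stopped.set()
+
+      def prompt_with_barge_in(tts, events):
+          """Play a prompt; halt playout as soon as speech arrives."""
+          tts.start()
+          while True:
+              event = events.get()     # fed by ASR and TTS activity
+              if event == LIKELY_SPEECH:
+                  tts.stop()           # barge-in: halt media to user
+                  return "barged-in"
+              if event == PLAYOUT_DONE:
+                  return "completed"
+
+      events = queue.Queue()
+      tts = FakeTTS(events)
+      threading.Timer(1.0, events.put, [LIKELY_SPEECH]).start()
+      print(prompt_with_barge_in(tts, events))   # -> barged-in
+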
+7.2. Multiple Services in Parallel
+
+ Good spoken user interfaces typically depend upon the ease with which
+ the user can accomplish his or her task. When making use of speaker
+ identification or verification technologies, user interface
+ improvements often come from the combination of the different
+ technologies: simultaneous identity claim and verification (on the
+ same utterance), simultaneous knowledge and voice verification (using
+ ASR and verification simultaneously). Using ASR and verification on
+ the same utterance is in fact the only way to support rolling or
+ dynamically-generated challenge phrases (e.g., "say 51723"). The
+ SPEECHSC framework MUST support such parallel service
+ implementations.
+
+7.3. Combination of Services
+
+   Optionally, it is of interest for the SPEECHSC framework to support
+   more complex remote combination and control of speech engines:
+
+ o Combination in series of engines that may then act on the input or
+ output of ASR, TTS, or Speaker recognition engines. The control
+ MAY then extend beyond such engines to include other audio input
+ and output processing and natural language processing.
+ o Intermediate exchanges and coordination between engines.
+ o Remote specification of flows between engines.
+
+ These capabilities MAY benefit from service discovery mechanisms
+ (e.g., engines, properties, and states discovery).
+
+8. Additional Considerations (Non-Normative)
+
+ The framework assumes that Session Description Protocol (SDP) will be
+ used to describe media sessions and streams. The framework further
+ assumes RTP carriage of media. However, since SDP can be used to
+   describe other media transport schemes (e.g., ATM), these could be
+ used if they provide the necessary elements (e.g., explicit
+ timestamps).
+
+ The working group will not be defining distributed speech recognition
+ (DSR) methods, as exemplified by the European Telecommunications
+ Standards Institute (ETSI) Aurora project. The working group will
+ not be recreating functionality available in other protocols, such as
+ SIP or SDP.
+
+   TTS looks very much like playing back a file.  Extending RTSP looks
+   promising when one requires VCR controls or markers in the text to
+   be spoken.  When one does not require VCR controls, SIP in a
+ framework such as Network Announcements [12] works directly without
+ modification.
+
+ ASR has an entirely different set of characteristics. For barge-in
+ support, ASR requires real-time return of intermediate results.
+ Barring the discovery of a good reuse model for an existing protocol,
+ this will most likely become the focus of SPEECHSC.
+
+9. Security Considerations
+
+ Protocols relating to speech processing must take security and
+ privacy into account. Many applications of speech technology deal
+ with sensitive information, such as the use of Text-to-Speech to read
+ financial information. Likewise, popular uses for automatic speech
+ recognition include executing financial transactions and shopping.
+
+ There are at least three aspects of speech processing security that
+ intersect with the SPEECHSC requirements -- securing the SPEECHSC
+ protocol itself, implementing and deploying the servers that run the
+ protocol, and ensuring that utilization of the technology for
+ providing security functions is appropriate. Each of these aspects
+   is discussed in the following subsections.  While some of these
+ considerations are, strictly speaking, out of scope of the protocol
+ itself, they will be carefully considered and accommodated during
+ protocol design, and will be called out as part of the applicability
+ statement accompanying the protocol specification(s). Privacy
+ considerations are discussed as well.
+
+9.1. SPEECHSC Protocol Security
+
+ The SPEECHSC protocol MUST in all cases support authentication,
+ authorization, and integrity, and SHOULD support confidentiality.
+ For privacy-sensitive applications, the protocol MUST support
+ confidentiality. We envision that rather than providing
+ protocol-specific security mechanisms in SPEECHSC itself, the
+ resulting protocol will employ security machinery of either a
+ containing protocol or the transport on which it runs. For example,
+ we will consider solutions such as using Transport Layer Security
+ (TLS) for securing the control channel, and Secure Realtime Transport
+ Protocol (SRTP) for securing the media channel. Third-party
+ dependencies necessitating transitive trust will be minimized or
+ explicitly dealt with through the authentication and authorization
+ aspects of the protocol design.
+
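+   As one illustration of reusing transport security, the sketch below
+   opens a TLS-protected control channel with Python's standard ssl
+   module.  The host name, port, and credential paths are placeholders,
+   and nothing in the sketch is SPEECHSC-specific.
+
+      import socket
+      import ssl
+
+      # TLS for the control channel; the server name, port 9000, and
+      # certificate paths below are placeholders for illustration.
+      context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
+      context.load_cert_chain("client-cert.pem", "client-key.pem")
+
+      with socket.create_connection(
+              ("speechsc.example.com", 9000)) as raw:
+          with context.wrap_socket(
+                  raw, server_hostname="speechsc.example.com") as tls:
+              # The control protocol then runs over tls as an ordinary
+              # byte stream, gaining server authentication, integrity,
+              # and confidentiality from TLS itself.
+              tls.sendall(b"control request goes here")
+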
+9.2. Client and Server Implementation and Deployment
+
+ Given the possibly sensitive nature of the information carried,
+ SPEECHSC clients and servers need to take steps to ensure
+ confidentiality and integrity of the data and its transformations to
+ and from spoken form. In addition to these general considerations,
+ certain SPEECHSC functions, such as speaker verification and
+ identification, employ voiceprints whose privacy, confidentiality,
+ and integrity must be maintained. Similarly, the requirement to
+ support input capture for analysis and tuning can represent a privacy
+ vulnerability because user utterances are recorded and could be
+ either revealed or replayed inappropriately. Implementers must take
+ care to prevent the exploitation of any centralized voiceprint
+ database and the recorded material from which such voiceprints may be
+ derived. Specific actions that are recommended to minimize these
+ threats include:
+
+ o End-to-end authentication, confidentiality, and integrity
+ protection (like TLS) of access to the database to minimize the
+ exposure to external attack.
+ o Database protection measures such as read/write access control and
+ local login authentication to minimize the exposure to insider
+ threats.
+ o Copies of the database, especially ones that are maintained at
+ off-site locations, need the same protection as the operational
+ database.
+
+   Inappropriate disclosure of this data does not, as of the date of
+   this document, represent an exploitable threat, but quite possibly
+   might in
+ the future. Specific vulnerabilities that might become feasible are
+ discussed in the next subsection. It is prudent to take measures
+ such as encrypting the voiceprint database and permitting access only
+ through programming interfaces enforcing adequate authorization
+ machinery.
+
+9.3. Use of SPEECHSC for Security Functions
+
+ Either speaker identification or verification can be used directly as
+ an authentication technology. Authorization decisions can be coupled
+ with speaker verification in a direct fashion through
+ challenge-response protocols, or indirectly with speaker
+ identification through the use of access control lists or other
+ identity-based authorization mechanisms. When so employed, there are
+ additional security concerns that need to be addressed through the
+ use of protocol security mechanisms for clients and servers. For
+ example, the ability to manipulate the media stream of a speaker
+ verification request could inappropriately permit or deny access
+ based on impersonation, or simple garbling via noise injection,
+ making it critical to properly secure both the control and data
+ channels, as recommended above. The following issues specific to the
+ use of SI/SV for authentication should be carefully considered:
+
+ 1. Theft of voiceprints or the recorded samples used to construct
+ them represents a future threat against the use of speaker
+ identification/verification as a biometric authentication
+ technology. A plausible attack vector (not feasible today) is to
+ use the voiceprint information as parametric input to a
+ text-to-speech synthesis system that could mimic the user's voice
+ accurately enough to match the voiceprint. Since it is not very
+ difficult to surreptitiously record reasonably large corpuses of
+ voice samples, the ability to construct voiceprints for input to
+ this attack would render the security of voice-based biometric
+ authentication, even using advanced challenge-response
+ techniques, highly vulnerable. Users of speaker verification for
+ authentication should monitor technological developments in this
+ area closely for such future vulnerabilities (much as users of
+ other authentication technologies should monitor advances in
+ factoring as a way to break asymmetric keying systems).
+ 2. As with other biometric authentication technologies, a downside
+ to the use of speech identification is that revocation is not
+ possible. Once compromised, the biometric information can be
+ used in identification and authentication to other independent
+ systems.
+ 3. Enrollment procedures can be vulnerable to impersonation if not
+ protected both by protocol security mechanisms and some
+ independent proof of identity. (Proof of identity may not be
+ needed in systems that only need to verify continuity of identity
+ since enrollment, as opposed to association with a particular
+      individual.)
+
+ Further discussion of the use of SI/SV as an authentication
+ technology, and some recommendations concerning advantages and
+ vulnerabilities, can be found in Chapter 5 of [15].
+
+10. Acknowledgements
+
+ Eric Burger wrote the original version of these requirements and has
+ continued to contribute actively throughout their development. He is
+ a co-author in all but formal authorship, and is instead acknowledged
+ here as it is preferable that working group co-chairs have
+ non-conflicting roles with respect to the progression of documents.
+
+11. References
+
+11.1. Normative References
+
+ [1] Walker, M., Burnett, D., and A. Hunt, "Speech Synthesis Markup
+ Language (SSML) Version 1.0", W3C
+ REC REC-speech-synthesis-20040907, September 2004.
+
+ [2] McGlashan, S. and A. Hunt, "Speech Recognition Grammar
+ Specification Version 1.0", W3C REC REC-speech-grammar-20040316,
+ March 2004.
+
+ [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
+ Levels", BCP 14, RFC 2119, March 1997.
+
+ [4] Floyd, S. and L. Daigle, "IAB Architectural and Policy
+ Considerations for Open Pluggable Edge Services", RFC 3238,
+ January 2002.
+
+ [5] Charlton, N., Gasson, M., Gybels, G., Spanner, M., and A. van
+ Wijk, "User Requirements for the Session Initiation Protocol
+ (SIP) in Support of Deaf, Hard of Hearing and Speech-impaired
+ Individuals", RFC 3351, August 2002.
+
+11.2. Informative References
+
+ [6] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
+ Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP:
+ Session Initiation Protocol", RFC 3261, June 2002.
+
+ [7] Andreasen, F. and B. Foster, "Media Gateway Control Protocol
+ (MGCP) Version 1.0", RFC 3435, January 2003.
+
+   [8]  Groves, C., Pantaleo, M., Anderson, T., and T. Taylor, "Gateway
+        Control Protocol Version 1", RFC 3525, June 2003.
+
+ [9] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming
+ Protocol (RTSP)", RFC 2326, April 1998.
+
+ [10] Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media
+ Resource Control Protocol", Work in Progress.
+
+ [11] World Wide Web Consortium, "Voice Extensible Markup Language
+        (VoiceXML) Version 2.0", W3C Working Draft, April 2002,
+ <http://www.w3.org/TR/2002/WD-voicexml20-20020424/>.
+
+ [12] Burger, E., Ed., Van Dyke, J., and A. Spitzer, "Basic Network
+ Media Services with SIP", RFC 4240, December 2005.
+
+ [13] Guttman, E., Perkins, C., Veizades, J., and M. Day, "Service
+ Location Protocol, Version 2", RFC 2608, June 1999.
+
+ [14] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
+ specifying the location of services (DNS SRV)", RFC 2782,
+ February 2000.
+
+ [15] Committee on Authentication Technologies and Their Privacy
+ Implications, National Research Council, "Who Goes There?:
+ Authentication Through the Lens of Privacy", Computer Science
+        and Telecommunications Board (CSTB), 2003,
+        <http://www.nap.edu/catalog/10656.html/>.
+
+Author's Address
+
+ David R. Oran
+ Cisco Systems, Inc.
+ 7 Ladyslipper Lane
+ Acton, MA
+ USA
+
+ EMail: oran@cisco.com
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.