diff --git a/doc/rfc/rfc4313.txt b/doc/rfc/rfc4313.txt
new file mode 100644
index 0000000..a0249de
--- /dev/null
+++ b/doc/rfc/rfc4313.txt
@@ -0,0 +1,1123 @@
+
+
+
+
+
+
+Network Working Group D. Oran
+Request for Comments: 4313 Cisco Systems, Inc.
+Category: Informational December 2005
+
+
+ Requirements for Distributed Control of
+ Automatic Speech Recognition (ASR),
+ Speaker Identification/Speaker Verification (SI/SV), and
+ Text-to-Speech (TTS) Resources
+
+Status of this Memo
+
+ This memo provides information for the Internet community. It does
+ not specify an Internet standard of any kind. Distribution of this
+ memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (2005).
+
+Abstract
+
+ This document outlines the needs and requirements for a protocol to
+ control distributed speech processing of audio streams. By speech
+ processing, this document specifically means automatic speech
+ recognition (ASR), speaker recognition -- which includes both speaker
+ identification (SI) and speaker verification (SV) -- and
+ text-to-speech (TTS). Other IETF protocols, such as SIP and Real
+ Time Streaming Protocol (RTSP), address rendezvous and control for
+ generalized media streams. However, speech processing presents
+ additional requirements that none of the extant IETF protocols
+ address.
+
+Table of Contents
+
+   1. Introduction
+      1.1. Document Conventions
+   2. SPEECHSC Framework
+      2.1. TTS Example
+      2.2. Automatic Speech Recognition Example
+      2.3. Speaker Identification Example
+   3. General Requirements
+      3.1. Reuse Existing Protocols
+      3.2. Maintain Existing Protocol Integrity
+      3.3. Avoid Duplicating Existing Protocols
+      3.4. Efficiency
+      3.5. Invocation of Services
+      3.6. Location and Load Balancing
+      3.7. Multiple Services
+      3.8. Multiple Media Sessions
+      3.9. Users with Disabilities
+      3.10. Identification of Process That Produced Media or Control
+            Output
+   4. TTS Requirements
+      4.1. Requesting Text Playback
+      4.2. Text Formats
+           4.2.1. Plain Text
+           4.2.2. SSML
+           4.2.3. Text in Control Channel
+           4.2.4. Document Type Indication
+      4.3. Control Channel
+      4.4. Media Origination/Termination by Control Elements
+      4.5. Playback Controls
+      4.6. Session Parameters
+      4.7. Speech Markers
+   5. ASR Requirements
+      5.1. Requesting Automatic Speech Recognition
+      5.2. XML
+      5.3. Grammar Requirements
+           5.3.1. Grammar Specification
+           5.3.2. Explicit Indication of Grammar Format
+           5.3.3. Grammar Sharing
+      5.4. Session Parameters
+      5.5. Input Capture
+   6. Speaker Identification and Verification Requirements
+      6.1. Requesting SI/SV
+      6.2. Identifiers for SI/SV
+      6.3. State for Multiple Utterances
+      6.4. Input Capture
+      6.5. SI/SV Functional Extensibility
+   7. Duplexing and Parallel Operation Requirements
+      7.1. Full Duplex Operation
+      7.2. Multiple Services in Parallel
+      7.3. Combination of Services
+   8. Additional Considerations (Non-Normative)
+   9. Security Considerations
+      9.1. SPEECHSC Protocol Security
+      9.2. Client and Server Implementation and Deployment
+      9.3. Use of SPEECHSC for Security Functions
+   10. Acknowledgements
+   11. References
+      11.1. Normative References
+      11.2. Informative References
+
+1. Introduction
+
+ There are multiple IETF protocols for establishment and termination
+   of media sessions (SIP [6]), low-level media control (Media Gateway
+   Control Protocol (MGCP) [7] and MEGACO [8]), and media record and
+   playback (RTSP [9]).  This document
+ focuses on requirements for one or more protocols to support the
+   control of network elements that perform Automatic Speech
+   Recognition (ASR), speaker identification or verification (SI/SV),
+   and the rendering of text into audio, also known as Text-to-Speech
+   (TTS).  Many multimedia
+ applications can benefit from having automatic speech recognition
+ (ASR) and text-to-speech (TTS) processing available as a distributed,
+ network resource. This requirements document limits its focus to the
+ distributed control of ASR, SI/SV, and TTS servers.
+
+ There is a broad range of systems that can benefit from a unified
+ approach to control of TTS, ASR, and SI/SV. These include
+ environments such as Voice over IP (VoIP) gateways to the Public
+ Switched Telephone Network (PSTN), IP telephones, media servers, and
+ wireless mobile devices that obtain speech services via servers on
+ the network.
+
+ To date, there are a number of proprietary ASR and TTS APIs, as well
+ as two IETF documents that address this problem [13], [14]. However,
+ there are serious deficiencies to the existing documents. In
+ particular, they mix the semantics of existing protocols yet are
+ close enough to other protocols as to be confusing to the
+ implementer.
+
+ This document sets forth requirements for protocols to support
+ distributed speech processing of audio streams. For simplicity, and
+ to remove confusion with existing protocol proposals, this document
+ presents the requirements as being for a "framework" that addresses
+ the distributed control of speech resources. It refers to such a
+ framework as "SPEECHSC", for Speech Services Control.
+
+1.1. Document Conventions
+
+ In this document, the key words "MUST", "MUST NOT", "REQUIRED",
+ "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
+ and "OPTIONAL" are to be interpreted as described in RFC 2119 [3].
+
+2. SPEECHSC Framework
+
+ Figure 1 below shows the SPEECHSC framework for speech processing.
+
+ +-------------+
+ | Application |
+ | Server |\
+ +-------------+ \ SPEECHSC
+ SIP, VoiceXML, / \
+ etc. / \
+ +------------+ / \ +-------------+
+ | Media |/ SPEECHSC \---| ASR, SI/SV, |
+ | Processing |-------------------------| and/or TTS |
+ RTP | Entity | RTP | Server |
+ =====| |=========================| |
+ +------------+ +-------------+
+
+ Figure 1: SPEECHSC Framework
+
+ The "Media Processing Entity" is a network element that processes
+ media. It may be a pure media handler, or it may also have an
+ associated SIP user agent, VoiceXML browser, or other control entity.
+ The "ASR, SI/SV, and/or TTS Server" is a network element that
+ performs the back-end speech processing. It may generate an RTP
+ stream as output based on text input (TTS) or return recognition
+ results in response to an RTP stream as input (ASR, SI/SV). The
+ "Application Server" is a network element that instructs the Media
+ Processing Entity on what transformations to make to the media
+ stream. Those instructions may be established via a session protocol
+ such as SIP, or provided via a client/server exchange such as
+ VoiceXML. The framework allows either the Media Processing Entity or
+ the Application Server to control the ASR or TTS Server using
+ SPEECHSC as a control protocol, which accounts for the SPEECHSC
+ protocol appearing twice in the diagram.
+
+   The entities may be physically embodied one per platform, or some
+   combination of entities may share a platform.  For example, a
+ VoiceXML [11] gateway may combine the ASR and TTS functions on the
+ same platform as the Media Processing Entity. Note that VoiceXML
+ gateways themselves are outside the scope of this protocol.
+ Likewise, one can combine the Application Server and Media Processing
+ Entity, as would be the case in an interactive voice response (IVR)
+ platform.
+
+ One can also decompose the Media Processing Entity into an entity
+ that controls media endpoints and entities that process media
+ directly. Such would be the case with a decomposed gateway using
+ MGCP or MEGACO. However, this decomposition is again orthogonal to
+ the scope of SPEECHSC. The following subsections provide a number of
+   example use cases of the SPEECHSC framework, one each for TTS, ASR,
+   and SI/SV.
+ They are intended to be illustrative only, and not to imply any
+ restriction on the scope of the framework or to limit the
+ decomposition or configuration to that shown in the example.
+
+2.1. TTS Example
+
+ This example illustrates a simple usage of SPEECHSC to provide a
+ Text-to-Speech service for playing announcements to a user on a phone
+ with no display for textual error messages. The example scenario is
+ shown below in Figure 2. In the figure, the VoIP gateway acts as
+ both the Media Processing Entity and the Application Server of the
+ SPEECHSC framework in Figure 1.
+
+ +---------+
+ _| SIP |
+ _/ | Server |
+ +-----------+ SIP/ +---------+
+ | | _/
+ +-------+ | VoIP |_/
+ | POTS |___| Gateway | RTP +---------+
+ | Phone | | (SIP UA) |=========| |
+ +-------+ | |\_ | SPEECHSC|
+ +-----------+ \ | TTS |
+ \__ | Server |
+ SPEECHSC | |
+ \_| |
+ +---------+
+
+ Figure 2: Text-to-Speech Example of SPEECHSC
+
+ The Plain Old Telephone Service (POTS) phone on the left attempts to
+ make a phone call. The VoIP gateway, acting as a SIP UA, tries to
+ establish a SIP session to complete the call, but gets an error, such
+ as a SIP "486 Busy Here" response. Without SPEECHSC, the gateway
+ would most likely just output a busy signal to the POTS phone.
+ However, with SPEECHSC access to a TTS server, it can provide a
+ spoken error message. The VoIP gateway therefore constructs a text
+ error string using information from the SIP messages, such as "Your
+ call to 978-555-1212 did not go through because the called party was
+ busy". It then can use SPEECHSC to establish an association with a
+ SPEECHSC server, open an RTP stream between itself and the server,
+ and issue a TTS request for the error message, which will be played
+ to the user on the POTS phone.
+
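+   For concreteness, the exchange might look like the following sketch.
+   SPEECHSC does not define a concrete protocol, so the message names
+   and syntax below are purely illustrative (borrowing an MRCP-like
+   request/response style) and show only the sequence of events:
+
+      C->S: SPEAK request (illustrative syntax only)
+               Content-Type: text/plain
+
+               Your call to 978-555-1212 did not go through because
+               the called party was busy.
+
+      S->C: 200 IN-PROGRESS   ; TTS playout begins on the RTP stream
+
+      S->C: SPEAK-COMPLETE    ; playout finished; the association and
+                              ; RTP stream may be reused or torn down
+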
+2.2. Automatic Speech Recognition Example
+
+ This example illustrates a VXML-enabled media processing entity and
+ associated application server using the SPEECHSC framework to supply
+ an ASR-based user interface through an Interactive Voice Response
+ (IVR) system. The example scenario is shown below in Figure 3. The
+ VXML-client corresponds to the "media processing entity", while the
+ IVR application server corresponds to the "application server" of the
+ SPEECHSC framework of Figure 1.
+
+ +------------+
+ | IVR |
+ _|Application |
+ VXML_/ +------------+
+ +-----------+ __/
+ | |_/ +------------+
+ PSTN Trunk | VoIP | SPEECHSC| |
+ =============| Gateway |---------| SPEECHSC |
+ |(VXML voice| | ASR |
+ | browser) |=========| Server |
+ +-----------+ RTP +------------+
+
+ Figure 3: Automatic Speech Recognition Example
+
+ In this example, users call into the service in order to obtain stock
+ quotes. The VoIP gateway answers their PSTN call. An IVR
+ application feeds VXML scripts to the gateway to drive the user
+ interaction. The VXML interpreter on the gateway directs the user's
+ media stream to the SPEECHSC ASR server and uses SPEECHSC to control
+ the ASR server.
+
+ When, for example, the user speaks the name of a stock in response to
+ an IVR prompt, the SPEECHSC ASR server attempts recognition of the
+ name, and returns the results to the VXML gateway. The VXML gateway,
+ following standard VXML mechanisms, informs the IVR Application of
+ the recognized result. The IVR Application can then do the
+ appropriate information lookup. The answer, of course, can be sent
+ back to the user using text-to-speech. This example does not show
+ this scenario, but it would work analogously to the scenario shown in
+   Section 2.1.
+
+2.3. Speaker Identification Example
+
+ This example illustrates using speaker identification to allow
+ voice-actuated login to an IP phone. The example scenario is shown
+ below in Figure 4. In the figure, the IP Phone acts as both the
+ "Media Processing Entity" and the "Application Server" of the
+ SPEECHSC framework in Figure 1.
+
+      +-----------+            +---------+
+      |           |    RTP     |         |
+      |    IP     |============| SPEECHSC|
+      |   Phone   |            |  SI/SV  |
+      |           |____________|  Server |
+      |           |  SPEECHSC  |         |
+      +-----------+            +---------+
+
+ Figure 4: Speaker Identification Example
+
+ In this example, a user speaks into a SIP phone in order to get
+ "logged in" to that phone to make and receive phone calls using his
+ identity and preferences. The IP phone uses the SPEECHSC framework
+ to set up an RTP stream between the phone and the SPEECHSC SI/SV
+ server and to request verification. The SV server verifies the
+ user's identity and returns the result, including the necessary login
+ credentials, to the phone via SPEECHSC. The IP Phone may use the
+ identity directly to identify the user in outgoing calls, to fetch
+ the user's preferences from a configuration server, or to request
+ authorization from an Authentication, Authorization, and Accounting
+ (AAA) server, in any combination. Since this example uses SPEECHSC
+ to perform a security-related function, be sure to note the
+ associated material in Section 9.
+
+3. General Requirements
+
+3.1. Reuse Existing Protocols
+
+ To the extent feasible, the SPEECHSC framework SHOULD use existing
+ protocols.
+
+3.2. Maintain Existing Protocol Integrity
+
+ In meeting the requirement of Section 3.1, the SPEECHSC framework
+ MUST NOT redefine the semantics of an existing protocol. Said
+ differently, we will not break existing protocols or cause
+ backward-compatibility problems.
+
+3.3. Avoid Duplicating Existing Protocols
+
+ To the extent feasible, SPEECHSC SHOULD NOT duplicate the
+ functionality of existing protocols. For example, network
+ announcements using SIP [12] and RTSP [9] already define how to
+   request playback of audio.  The focus of SPEECHSC is new
+   functionality that is not addressed by existing protocols, or
+   extensions of existing protocols within the strictures of the
+   requirement in
+ Section 3.2. Where an existing protocol can be gracefully extended
+ to support SPEECHSC requirements, such extensions are acceptable
+ alternatives for meeting the requirements.
+
+   As a corollary to this, the SPEECHSC framework should not require a separate
+ protocol to perform functions that could be easily added into the
+ SPEECHSC protocol (like redirecting media streams, or discovering
+ capabilities), unless it is similarly easy to embed that protocol
+ directly into the SPEECHSC framework.
+
+3.4. Efficiency
+
+ The SPEECHSC framework SHOULD employ protocol elements known to
+ result in efficient operation. Techniques to be considered include:
+
+ o Re-use of transport connections across sessions
+ o Piggybacking of responses on requests in the reverse direction
+ o Caching of state across requests
+
+3.5. Invocation of Services
+
+ The SPEECHSC framework MUST be compliant with the IAB Open Pluggable
+ Edge Services (OPES) [4] framework. The applicability of the
+ SPEECHSC protocol will therefore be specified as occurring between
+ clients and servers at least one of which is operating directly on
+ behalf of the user requesting the service.
+
+3.6. Location and Load Balancing
+
+ To the extent feasible, the SPEECHSC framework SHOULD exploit
+ existing schemes for supporting service location and load balancing,
+ such as the Service Location Protocol [13] or DNS SRV records [14].
+ Where such facilities are not deemed adequate, the SPEECHSC framework
+ MAY define additional load balancing techniques.
+
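+   As an illustration of reusing DNS SRV, the sketch below shows how a
+   client might locate candidate servers.  It assumes the third-party
+   dnspython package, and the "_speechsc._tcp" service label is
+   hypothetical, since no such label is registered.
+
+      # Minimal sketch of SRV-based server location (RFC 2782).
+      # Assumes dnspython; the "_speechsc._tcp" label is hypothetical.
+      import dns.resolver
+
+      def locate_speechsc_servers(domain):
+          """Return (priority, weight, host, port) tuples, best first."""
+          answers = dns.resolver.resolve("_speechsc._tcp." + domain,
+                                         "SRV")
+          records = [(r.priority, r.weight,
+                      str(r.target).rstrip("."), r.port)
+                     for r in answers]
+          # Lower priority is preferred.  A full implementation would
+          # pick among equal priorities by weighted random selection.
+          records.sort(key=lambda r: (r[0], -r[1]))
+          return records
+
+      for priority, weight, host, port in locate_speechsc_servers(
+              "example.com"):
+          print(host, port)
+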
+3.7. Multiple Services
+
+ The SPEECHSC framework MUST permit multiple services to operate on a
+ single media stream so that either the same or different servers may
+ be performing speech recognition, speaker identification or
+ verification, etc., in parallel.
+
+3.8. Multiple Media Sessions
+
+ The SPEECHSC framework MUST allow a 1:N mapping between session and
+ RTP channels. For example, a single session may include an outbound
+ RTP channel for TTS, an inbound for ASR, and a different inbound for
+ SI/SV (e.g., if processed by different elements on the Media Resource
+ Element). Note: All of these can be described via SDP, so if SDP is
+ utilized for media channel description, this requirement is met "for
+ free".
+
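+   As an illustration of the note above, a single SDP description such
+   as the following sketch (addresses and ports are placeholders) could
+   establish all three RTP channels of the example within one session:
+
+      v=0
+      o=client 2890844526 2890842807 IN IP4 192.0.2.10
+      s=SPEECHSC media session (illustrative)
+      c=IN IP4 192.0.2.10
+      t=0 0
+      m=audio 49170 RTP/AVP 0
+      a=recvonly
+      m=audio 49172 RTP/AVP 0
+      a=sendonly
+      m=audio 49174 RTP/AVP 0
+      a=sendonly
+
+   Here the first media line carries TTS output toward the client,
+   while the second and third carry captured audio toward the ASR and
+   SI/SV engines, assumed in this sketch to be distinct elements.
+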
+3.9. Users with Disabilities
+
+ The SPEECHSC framework must have sufficient capabilities to address
+ the critical needs of people with disabilities. In particular, the
+ set of requirements set forth in RFC 3351 [5] MUST be taken into
+ account by the framework. It is also important that implementers of
+ SPEECHSC clients and servers be cognizant that some interaction
+ modalities of SPEECHSC may be inconvenient or simply inappropriate
+ for disabled users. Hearing-impaired individuals may find TTS of
+ limited utility. Speech-impaired users may be unable to make use of
+ ASR or SI/SV capabilities. Therefore, systems employing SPEECHSC
+ MUST provide alternative interaction modes or avoid the use of speech
+ processing entirely.
+
+3.10. Identification of Process That Produced Media or Control Output
+
+ The client of a SPEECHSC operation SHOULD be able to ascertain via
+ the SPEECHSC framework what speech process produced the output. For
+ example, an RTP stream containing the spoken output of TTS should be
+ identifiable as TTS output, and the recognized utterance of ASR
+ should be identifiable as having been produced by ASR processing.
+
+4. TTS Requirements
+
+4.1. Requesting Text Playback
+
+ The SPEECHSC framework MUST allow a Media Processing Entity or
+ Application Server, using a control protocol, to request the TTS
+ Server to play back text as voice in an RTP stream.
+
+4.2. Text Formats
+
+4.2.1. Plain Text
+
+ The SPEECHSC framework MAY assume that all TTS servers are capable of
+   reading plain text.  For reading plain text, the framework MUST allow the
+ language and voicing to be indicated via session parameters. For
+ finer control over such properties, see [1].
+
+4.2.2. SSML
+
+ The SPEECHSC framework MUST support Speech Synthesis Markup Language
+   (SSML) [1] <speak> basics, and SHOULD support other SSML tags.  The
+ framework assumes all TTS servers are capable of reading SSML
+ formatted text. Internationalization of TTS in the SPEECHSC
+ framework, including multi-lingual output within a single utterance,
+ is accomplished via SSML xml:lang tags.
+
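+   For illustration, the following SSML fragment [1] produces
+   multi-lingual output within a single utterance by scoping xml:lang
+   to a <voice> element:
+
+      <?xml version="1.0"?>
+      <speak version="1.0"
+             xmlns="http://www.w3.org/2001/10/synthesis"
+             xml:lang="en-US">
+        The French for cheese is
+        <voice xml:lang="fr-FR">fromage</voice>.
+      </speak>
+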
+4.2.3. Text in Control Channel
+
+ The SPEECHSC framework assumes all TTS servers accept text over the
+ SPEECHSC connection for reading over the RTP connection. The
+ framework assumes the server can accept text either "by value"
+ (embedded in the protocol) or "by reference" (e.g., by de-referencing
+ a Uniform Resource Identifier (URI) embedded in the protocol).
+
+4.2.4. Document Type Indication
+
+ A document type specifies the syntax in which the text to be read is
+ encoded. The SPEECHSC framework MUST be capable of explicitly
+ indicating the document type of the text to be processed, as opposed
+ to forcing the server to infer the content by other means.
+
+4.3. Control Channel
+
+ The SPEECHSC framework MUST be capable of establishing the control
+ channel between the client and server on a per-session basis, where a
+ session is loosely defined to be associated with a single "call" or
+ "dialog". The protocol SHOULD be capable of maintaining a long-lived
+   control channel for multiple sessions serially, and MAY support
+   shorter time horizons as well, down to the processing of a single
+   utterance.
+
+4.4. Media Origination/Termination by Control Elements
+
+ The SPEECHSC framework MUST NOT require the controlling element
+ (application server, media processing entity) to accept or originate
+   media streams.  Media streams MAY originate from and terminate at
+   the controlled element (ASR, TTS, etc.).
+
+4.5. Playback Controls
+
+ The SPEECHSC framework MUST support "VCR controls" for controlling
+ the playout of streaming media output from SPEECHSC processing, and
+ MUST allow for servers with varying capabilities to accommodate such
+ controls. The protocol SHOULD allow clients to state what controls
+ they wish to use, and for servers to report which ones they honor.
+ These capabilities include:
+
+ o The ability to jump in time to the location of a specific marker.
+ o The ability to jump in time, forwards or backwards, by a specified
+ amount of time. Valid time units MUST include seconds, words,
+ paragraphs, sentences, and markers.
+ o The ability to increase and decrease playout speed.
+ o The ability to fast-forward and fast-rewind the audio, where
+ snippets of audio are played as the server moves forwards or
+ backwards in time.
+ o The ability to pause and resume playout.
+ o The ability to increase and decrease playout volume.
+
+ These controls SHOULD be made easily available to users through the
+ client user interface and through per-user customization capabilities
+ of the client. This is particularly important for hearing-impaired
+ users, who will likely desire settings and control regimes different
+ from those that would be acceptable for non-impaired users.
+
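+   A minimal sketch of the capability declaration this requirement
+   implies is shown below; the control names and the negotiation
+   function are hypothetical, since SPEECHSC defines neither.
+
+      # Hypothetical negotiation of playback ("VCR") controls.  The
+      # control names mirror the list above but are illustrative only.
+      CLIENT_WANTS = {"jump-to-marker", "jump-relative", "speed",
+                      "fast-forward-rewind", "pause-resume", "volume"}
+
+      def negotiate(client_wants, server_honors):
+          """Return the controls both sides support, and the rest."""
+          granted = client_wants & server_honors
+          return granted, client_wants - granted
+
+      # Example: a server that cannot seek to markers.
+      granted, declined = negotiate(
+          CLIENT_WANTS,
+          {"jump-relative", "speed", "pause-resume", "volume"})
+      print(sorted(granted), sorted(declined))
+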
+4.6. Session Parameters
+
+ The SPEECHSC framework MUST support the specification of session
+ parameters, such as language, prosody, and voicing.
+
+4.7. Speech Markers
+
+ The SPEECHSC framework MUST accommodate speech markers, with
+ capability at least as flexible as that provided in SSML [1]. The
+ framework MUST further provide an efficient mechanism for reporting
+ that a marker has been reached during playout.
+
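+   As an illustration, markers are placed in SSML with the <mark>
+   element [1]; the SPEECHSC requirement is that the server report,
+   with low overhead, when playout crosses such a marker:
+
+      <speak version="1.0"
+             xmlns="http://www.w3.org/2001/10/synthesis"
+             xml:lang="en-US">
+        Your balance is <mark name="before-amount"/> forty-two dollars.
+      </speak>
+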
+5. ASR Requirements
+
+5.1. Requesting Automatic Speech Recognition
+
+ The SPEECHSC framework MUST allow a Media Processing Entity or
+ Application Server to request the ASR Server to perform automatic
+ speech recognition on an RTP stream, returning the results over
+ SPEECHSC.
+
+5.2. XML
+
+   The SPEECHSC framework assumes that all ASR servers support the
+   Speech Recognition Grammar Specification (SRGS) [2] used by
+   VoiceXML.
+
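+   For illustration, a minimal SRGS grammar [2] for the stock-quote
+   example of Section 2.2 might look as follows (the rule contents are
+   hypothetical):
+
+      <?xml version="1.0"?>
+      <grammar version="1.0"
+               xmlns="http://www.w3.org/2001/06/grammar"
+               xml:lang="en-US" root="stock">
+        <rule id="stock" scope="public">
+          <one-of>
+            <item>cisco</item>
+            <item>example corporation</item>
+          </one-of>
+        </rule>
+      </grammar>
+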
+5.3. Grammar Requirements
+
+5.3.1. Grammar Specification
+
+ The SPEECHSC framework assumes all ASR servers are capable of
+ accepting grammar specifications either "by value" (embedded in the
+ protocol) or "by reference" (e.g., by de-referencing a URI embedded
+ in the protocol). The latter MUST allow the indication of a grammar
+ already known to, or otherwise "built in" to, the server. The
+ framework and protocol further SHOULD exploit the ability to store
+ and later retrieve by reference large grammars that were originally
+ supplied by the client.
+
+5.3.2. Explicit Indication of Grammar Format
+
+ The SPEECHSC framework protocol MUST be able to explicitly convey the
+ grammar format in which the grammar is encoded and MUST be extensible
+ to allow for conveying new grammar formats as they are defined.
+
+5.3.3. Grammar Sharing
+
+   The SPEECHSC framework SHOULD exploit sharing grammars across
+   sessions for servers that are capable of doing so.  This supports
+   applications with large grammars that are unrealistic to load
+   dynamically on each use.  An example is a city-country grammar for a
+   weather service.
+
+5.4. Session Parameters
+
+   The SPEECHSC framework MUST accommodate at a minimum all of the
+   protocol parameters currently defined in the Media Resource Control
+   Protocol (MRCP) [10].  In addition, there SHOULD be a capability to
+   reset parameters within a session.
+
+5.5. Input Capture
+
+   The SPEECHSC framework MUST support a method for directing the ASR
+   Server to capture the input media stream for later analysis and
+   tuning of the ASR engine.
+
+6. Speaker Identification and Verification Requirements
+
+6.1. Requesting SI/SV
+
+ The SPEECHSC framework MUST allow a Media Processing Entity to
+ request the SI/SV Server to perform speaker identification or
+ verification on an RTP stream, returning the results over SPEECHSC.
+
+6.2. Identifiers for SI/SV
+
+ The SPEECHSC framework MUST accommodate an identifier for each
+ verification resource and permit control of that resource by ID,
+ because voiceprint format and contents are vendor specific.
+
+6.3. State for Multiple Utterances
+
+ The SPEECHSC framework MUST work with SI/SV servers that maintain
+ state to handle multi-utterance verification.
+
+6.4. Input Capture
+
+ The SPEECHSC framework MUST support a method for capturing the input
+ media stream for later analysis and tuning of the SI/SV engine. The
+ framework may assume all servers are capable of doing so. In
+ addition, the framework assumes that the captured stream contains
+ enough timestamp context (e.g., the NTP time range from the RTP
+ Control Protocol (RTCP) packets, which corresponds to the RTP
+ timestamps of the captured input) to ascertain after the fact exactly
+ when the verification was requested.
+
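+   The timestamp recovery described above is simple arithmetic: an RTCP
+   Sender Report (SR) binds an RTP timestamp to an NTP wall-clock time,
+   after which any captured RTP timestamp can be placed on the wall
+   clock.  A sketch, assuming a narrowband 8 kHz audio clock:
+
+      # Map a captured RTP timestamp to wall-clock time using the
+      # RTP/NTP pair from an RTCP Sender Report (RFC 3550 semantics).
+      def rtp_to_wallclock(rtp_ts, sr_rtp_ts, sr_ntp_seconds,
+                           clock_rate=8000):
+          """Return the wall-clock time of rtp_ts, in NTP-era seconds.
+
+          rtp_ts, sr_rtp_ts -- 32-bit RTP timestamps (same SSRC)
+          sr_ntp_seconds    -- NTP time from the SR, as a float
+          clock_rate        -- RTP clock rate of the audio codec
+          """
+          delta = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF  # 32-bit wraparound
+          if delta >= 0x80000000:      # timestamps just before the SR
+              delta -= 0x100000000     # map backwards, not far forwards
+          return sr_ntp_seconds + delta / clock_rate
+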
+6.5. SI/SV Functional Extensibility
+
+ The SPEECHSC framework SHOULD be extensible to additional functions
+ associated with SI/SV, such as prompting, utterance verification, and
+ retraining.
+
+7. Duplexing and Parallel Operation Requirements
+
+   For an interactive speech-driven system, user perception of the
+   quality of the interaction depends strongly on the ability of the
+   user to interrupt a prompt or rendered TTS with speech; supporting
+   such interruption is therefore a very important requirement.
+   Interrupting, or barging in on, the speech output requires more than
+   energy detection from the user's direction.  Many advanced systems
+   halt the media sent toward the user by employing the ASR engine to
+   decide whether an utterance is likely to be real speech, as opposed
+   to, for example, a cough.
+
+7.1. Full Duplex Operation
+
+ To achieve low latency between utterance detection and halting of
+ playback, many implementations combine the speaking and ASR
+ functions. The SPEECHSC framework MUST support such full-duplex
+ implementations.
+
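+   A sketch of the control flow that such a full-duplex implementation
+   implies appears below.  The event names and engine interface are
+   hypothetical, and a real engine would run recognition and playout
+   concurrently in earnest rather than via this simplified simulation.
+
+      import queue
+      import threading
+      import time
+
+      LIKELY_SPEECH = "likely-speech"  # ASR judged input to be speech
+      PLAYOUT_DONE = "playout-done"    # TTS finished the prompt
+
+      class FakeTTS:
+          """Stand-in for a TTS engine streaming a prompt over RTP."""
+          def __init__(self, events):
+              self.events = events
+              self.stopped = threading.Event()
+          def start(self):
+              threading.Thread(target=self._play, daemon=True).start()
+          def _play(self):
+              for _ in range(50):              # "render" the prompt
+                  if self.stopped.is_set():
+                      return
+                  time.sleep(0.1)
+              self.events.put(PLAYOUT_DONE)
+          def stop(self):
+              self.stopped.set()
+
+      def prompt_with_barge_in(tts, events):
+          """Play a prompt; halt playout as soon as speech arrives."""
+          tts.start()
+          while True:
+              event = events.get()     # fed by ASR and TTS activity
+              if event == LIKELY_SPEECH:
+                  tts.stop()           # barge-in: halt media to user
+                  return "barged-in"
+              if event == PLAYOUT_DONE:
+                  return "completed"
+
+      events = queue.Queue()
+      tts = FakeTTS(events)
+      threading.Timer(1.0, events.put, [LIKELY_SPEECH]).start()
+      print(prompt_with_barge_in(tts, events))   # -> barged-in
+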
+7.2. Multiple Services in Parallel
+
+ Good spoken user interfaces typically depend upon the ease with which
+ the user can accomplish his or her task. When making use of speaker
+ identification or verification technologies, user interface
+ improvements often come from the combination of the different
+ technologies: simultaneous identity claim and verification (on the
+ same utterance), simultaneous knowledge and voice verification (using
+ ASR and verification simultaneously). Using ASR and verification on
+ the same utterance is in fact the only way to support rolling or
+ dynamically-generated challenge phrases (e.g., "say 51723"). The
+ SPEECHSC framework MUST support such parallel service
+ implementations.
+
+7.3. Combination of Services
+
+   Optionally, it is of interest for the SPEECHSC framework to support
+   more complex remote combination and control of speech engines:
+
+ o Combination in series of engines that may then act on the input or
+ output of ASR, TTS, or Speaker recognition engines. The control
+ MAY then extend beyond such engines to include other audio input
+ and output processing and natural language processing.
+ o Intermediate exchanges and coordination between engines.
+ o Remote specification of flows between engines.
+
+ These capabilities MAY benefit from service discovery mechanisms
+ (e.g., engines, properties, and states discovery).
+
+8. Additional Considerations (Non-Normative)
+
+ The framework assumes that Session Description Protocol (SDP) will be
+ used to describe media sessions and streams. The framework further
+ assumes RTP carriage of media. However, since SDP can be used to
+   describe other media transport schemes (e.g., ATM), these could be
+ used if they provide the necessary elements (e.g., explicit
+ timestamps).
+
+ The working group will not be defining distributed speech recognition
+ (DSR) methods, as exemplified by the European Telecommunications
+ Standards Institute (ETSI) Aurora project. The working group will
+ not be recreating functionality available in other protocols, such as
+ SIP or SDP.
+
+   TTS looks very much like playing back a file.  Extending RTSP looks
+   promising when one requires VCR controls or markers in the text to
+   be spoken.  When one does not require VCR controls, SIP in a
+ framework such as Network Announcements [12] works directly without
+ modification.
+
+ ASR has an entirely different set of characteristics. For barge-in
+ support, ASR requires real-time return of intermediate results.
+ Barring the discovery of a good reuse model for an existing protocol,
+ this will most likely become the focus of SPEECHSC.
+
+9. Security Considerations
+
+ Protocols relating to speech processing must take security and
+ privacy into account. Many applications of speech technology deal
+ with sensitive information, such as the use of Text-to-Speech to read
+ financial information. Likewise, popular uses for automatic speech
+ recognition include executing financial transactions and shopping.
+
+ There are at least three aspects of speech processing security that
+ intersect with the SPEECHSC requirements -- securing the SPEECHSC
+ protocol itself, implementing and deploying the servers that run the
+ protocol, and ensuring that utilization of the technology for
+ providing security functions is appropriate. Each of these aspects
+   is discussed in the following subsections.  While some of these
+ considerations are, strictly speaking, out of scope of the protocol
+ itself, they will be carefully considered and accommodated during
+ protocol design, and will be called out as part of the applicability
+ statement accompanying the protocol specification(s). Privacy
+ considerations are discussed as well.
+
+9.1. SPEECHSC Protocol Security
+
+ The SPEECHSC protocol MUST in all cases support authentication,
+ authorization, and integrity, and SHOULD support confidentiality.
+ For privacy-sensitive applications, the protocol MUST support
+ confidentiality. We envision that rather than providing
+ protocol-specific security mechanisms in SPEECHSC itself, the
+ resulting protocol will employ security machinery of either a
+ containing protocol or the transport on which it runs. For example,
+ we will consider solutions such as using Transport Layer Security
+ (TLS) for securing the control channel, and Secure Realtime Transport
+ Protocol (SRTP) for securing the media channel. Third-party
+ dependencies necessitating transitive trust will be minimized or
+ explicitly dealt with through the authentication and authorization
+ aspects of the protocol design.
+
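+   As one illustration of reusing transport security, the sketch below
+   opens a TLS-protected control channel with Python's standard ssl
+   module.  The host name, port, and credential paths are placeholders,
+   and nothing in the sketch is SPEECHSC-specific.
+
+      import socket
+      import ssl
+
+      # TLS for the control channel; the server name, port 9000, and
+      # certificate paths below are placeholders for illustration.
+      context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
+      context.load_cert_chain("client-cert.pem", "client-key.pem")
+
+      with socket.create_connection(
+              ("speechsc.example.com", 9000)) as raw:
+          with context.wrap_socket(
+                  raw, server_hostname="speechsc.example.com") as tls:
+              # The control protocol then runs over tls as an ordinary
+              # byte stream, gaining server authentication, integrity,
+              # and confidentiality from TLS itself.
+              tls.sendall(b"control request goes here")
+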
+9.2. Client and Server Implementation and Deployment
+
+ Given the possibly sensitive nature of the information carried,
+ SPEECHSC clients and servers need to take steps to ensure
+ confidentiality and integrity of the data and its transformations to
+ and from spoken form. In addition to these general considerations,
+ certain SPEECHSC functions, such as speaker verification and
+ identification, employ voiceprints whose privacy, confidentiality,
+ and integrity must be maintained. Similarly, the requirement to
+ support input capture for analysis and tuning can represent a privacy
+ vulnerability because user utterances are recorded and could be
+ either revealed or replayed inappropriately. Implementers must take
+ care to prevent the exploitation of any centralized voiceprint
+ database and the recorded material from which such voiceprints may be
+ derived. Specific actions that are recommended to minimize these
+ threats include:
+
+ o End-to-end authentication, confidentiality, and integrity
+ protection (like TLS) of access to the database to minimize the
+ exposure to external attack.
+ o Database protection measures such as read/write access control and
+ local login authentication to minimize the exposure to insider
+ threats.
+ o Copies of the database, especially ones that are maintained at
+ off-site locations, need the same protection as the operational
+ database.
+
+   Inappropriate disclosure of this data does not, as of the date of
+   this document, represent an exploitable threat, but quite possibly
+   might in
+ the future. Specific vulnerabilities that might become feasible are
+ discussed in the next subsection. It is prudent to take measures
+ such as encrypting the voiceprint database and permitting access only
+ through programming interfaces enforcing adequate authorization
+ machinery.
+
+9.3. Use of SPEECHSC for Security Functions
+
+ Either speaker identification or verification can be used directly as
+ an authentication technology. Authorization decisions can be coupled
+ with speaker verification in a direct fashion through
+ challenge-response protocols, or indirectly with speaker
+ identification through the use of access control lists or other
+ identity-based authorization mechanisms. When so employed, there are
+ additional security concerns that need to be addressed through the
+ use of protocol security mechanisms for clients and servers. For
+ example, the ability to manipulate the media stream of a speaker
+ verification request could inappropriately permit or deny access
+ based on impersonation, or simple garbling via noise injection,
+ making it critical to properly secure both the control and data
+ channels, as recommended above. The following issues specific to the
+ use of SI/SV for authentication should be carefully considered:
+
+ 1. Theft of voiceprints or the recorded samples used to construct
+ them represents a future threat against the use of speaker
+ identification/verification as a biometric authentication
+ technology. A plausible attack vector (not feasible today) is to
+ use the voiceprint information as parametric input to a
+ text-to-speech synthesis system that could mimic the user's voice
+ accurately enough to match the voiceprint. Since it is not very
+ difficult to surreptitiously record reasonably large corpuses of
+ voice samples, the ability to construct voiceprints for input to
+ this attack would render the security of voice-based biometric
+ authentication, even using advanced challenge-response
+ techniques, highly vulnerable. Users of speaker verification for
+ authentication should monitor technological developments in this
+ area closely for such future vulnerabilities (much as users of
+ other authentication technologies should monitor advances in
+ factoring as a way to break asymmetric keying systems).
+ 2. As with other biometric authentication technologies, a downside
+ to the use of speech identification is that revocation is not
+ possible. Once compromised, the biometric information can be
+ used in identification and authentication to other independent
+ systems.
+ 3. Enrollment procedures can be vulnerable to impersonation if not
+ protected both by protocol security mechanisms and some
+ independent proof of identity. (Proof of identity may not be
+ needed in systems that only need to verify continuity of identity
+ since enrollment, as opposed to association with a particular
+      individual.)
+
+ Further discussion of the use of SI/SV as an authentication
+ technology, and some recommendations concerning advantages and
+ vulnerabilities, can be found in Chapter 5 of [15].
+
+10. Acknowledgements
+
+ Eric Burger wrote the original version of these requirements and has
+ continued to contribute actively throughout their development. He is
+ a co-author in all but formal authorship, and is instead acknowledged
+ here as it is preferable that working group co-chairs have
+ non-conflicting roles with respect to the progression of documents.
+
+11. References
+
+11.1. Normative References
+
+ [1] Walker, M., Burnett, D., and A. Hunt, "Speech Synthesis Markup
+ Language (SSML) Version 1.0", W3C
+ REC REC-speech-synthesis-20040907, September 2004.
+
+ [2] McGlashan, S. and A. Hunt, "Speech Recognition Grammar
+ Specification Version 1.0", W3C REC REC-speech-grammar-20040316,
+ March 2004.
+
+ [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
+ Levels", BCP 14, RFC 2119, March 1997.
+
+ [4] Floyd, S. and L. Daigle, "IAB Architectural and Policy
+ Considerations for Open Pluggable Edge Services", RFC 3238,
+ January 2002.
+
+ [5] Charlton, N., Gasson, M., Gybels, G., Spanner, M., and A. van
+ Wijk, "User Requirements for the Session Initiation Protocol
+ (SIP) in Support of Deaf, Hard of Hearing and Speech-impaired
+ Individuals", RFC 3351, August 2002.
+
+11.2. Informative References
+
+ [6] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
+ Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP:
+ Session Initiation Protocol", RFC 3261, June 2002.
+
+ [7] Andreasen, F. and B. Foster, "Media Gateway Control Protocol
+ (MGCP) Version 1.0", RFC 3435, January 2003.
+
+   [8]  Groves, C., Pantaleo, M., Anderson, T., and T. Taylor, "Gateway
+        Control Protocol Version 1", RFC 3525, June 2003.
+
+ [9] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming
+ Protocol (RTSP)", RFC 2326, April 1998.
+
+ [10] Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media
+ Resource Control Protocol", Work in Progress.
+
+ [11] World Wide Web Consortium, "Voice Extensible Markup Language
+        (VoiceXML) Version 2.0", W3C Working Draft, April 2002,
+ <http://www.w3.org/TR/2002/WD-voicexml20-20020424/>.
+
+ [12] Burger, E., Ed., Van Dyke, J., and A. Spitzer, "Basic Network
+ Media Services with SIP", RFC 4240, December 2005.
+
+ [13] Guttman, E., Perkins, C., Veizades, J., and M. Day, "Service
+ Location Protocol, Version 2", RFC 2608, June 1999.
+
+ [14] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for
+ specifying the location of services (DNS SRV)", RFC 2782,
+ February 2000.
+
+ [15] Committee on Authentication Technologies and Their Privacy
+ Implications, National Research Council, "Who Goes There?:
+ Authentication Through the Lens of Privacy", Computer Science
+        and Telecommunications Board (CSTB), 2003,
+        <http://www.nap.edu/catalog/10656.html/>.
+
+Author's Address
+
+ David R. Oran
+ Cisco Systems, Inc.
+ 7 Ladyslipper Lane
+ Acton, MA
+ USA
+
+ EMail: oran@cisco.com
+
+Full Copyright Statement
+
+ Copyright (C) The Internet Society (2005).
+
+ This document is subject to the rights, licenses and restrictions
+ contained in BCP 78, and except as set forth therein, the authors
+ retain all their rights.
+
+ This document and the information contained herein are provided on an
+ "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
+ OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
+ ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
+ INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
+ INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
+ WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
+Intellectual Property
+
+ The IETF takes no position regarding the validity or scope of any
+ Intellectual Property Rights or other rights that might be claimed to
+ pertain to the implementation or use of the technology described in
+ this document or the extent to which any license under such rights
+ might or might not be available; nor does it represent that it has
+ made any independent effort to identify any such rights. Information
+ on the procedures with respect to rights in RFC documents can be
+ found in BCP 78 and BCP 79.
+
+ Copies of IPR disclosures made to the IETF Secretariat and any
+ assurances of licenses to be made available, or the result of an
+ attempt made to obtain a general license or permission for the use of
+ such proprietary rights by implementers or users of this
+ specification can be obtained from the IETF on-line IPR repository at
+ http://www.ietf.org/ipr.
+
+ The IETF invites any interested party to bring to its attention any
+ copyrights, patents or patent applications, or other proprietary
+ rights that may cover technology that may be required to implement
+ this standard. Please address the information to the IETF at ietf-
+ ipr@ietf.org.
+
+Acknowledgement
+
+ Funding for the RFC Editor function is currently provided by the
+ Internet Society.