author     Thomas Voss <mail@thomasvoss.com>   2024-11-27 20:54:24 +0100
committer  Thomas Voss <mail@thomasvoss.com>   2024-11-27 20:54:24 +0100
commit     4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
tree       e3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc4463.txt
parent     ea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc4463.txt')
-rw-r--r--   doc/rfc/rfc4463.txt   4819
1 file changed, 4819 insertions, 0 deletions
diff --git a/doc/rfc/rfc4463.txt b/doc/rfc/rfc4463.txt new file mode 100644 index 0000000..37bb4eb --- /dev/null +++ b/doc/rfc/rfc4463.txt @@ -0,0 +1,4819 @@ + + + + + + +Network Working Group S. Shanmugham +Request for Comments: 4463 Cisco Systems, Inc. +Category: Informational P. Monaco + Nuance Communications + B. Eberman + Speechworks Inc. + April 2006 + + + A Media Resource Control Protocol (MRCP) + Developed by Cisco, Nuance, and Speechworks + +Status of This Memo + + This memo provides information for the Internet community. It does + not specify an Internet standard of any kind. Distribution of this + memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2006). + +IESG Note + + This RFC is not a candidate for any level of Internet Standard. The + IETF disclaims any knowledge of the fitness of this RFC for any + purpose and in particular notes that the decision to publish is not + based on IETF review for such things as security, congestion control, + or inappropriate interaction with deployed protocols. The RFC Editor + has chosen to publish this document at its discretion. Readers of + this document should exercise caution in evaluating its value for + implementation and deployment. See RFC 3932 for more information. + + Note that this document uses a MIME type 'application/mrcp' which has + not been registered with the IANA, and is therefore not recognized as + a standard IETF MIME type. The historical value of this document as + an ancestor to ongoing standardization in this space, however, makes + the publication of this document meaningful. + + + + + + + + + + + + + +Shanmugham, et al. Informational [Page 1] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +Abstract + + This document describes a Media Resource Control Protocol (MRCP) that + was developed jointly by Cisco Systems, Inc., Nuance Communications, + and Speechworks, Inc. It is published as an RFC as input for further + IETF development in this area. + + MRCP controls media service resources like speech synthesizers, + recognizers, signal generators, signal detectors, fax servers, etc., + over a network. This protocol is designed to work with streaming + protocols like RTSP (Real Time Streaming Protocol) or SIP (Session + Initiation Protocol), which help establish control connections to + external media streaming devices, and media delivery mechanisms like + RTP (Real Time Protocol). + +Table of Contents + + 1. Introduction ....................................................3 + 2. Architecture ....................................................4 + 2.1. Resources and Services .....................................4 + 2.2. Server and Resource Addressing .............................5 + 3. MRCP Protocol Basics ............................................5 + 3.1. Establishing Control Session and Media Streams .............5 + 3.2. MRCP over RTSP .............................................6 + 3.3. Media Streams and RTP Ports ................................8 + 4. Notational Conventions ..........................................8 + 5. MRCP Specification ..............................................9 + 5.1. Request ...................................................10 + 5.2. Response ..................................................10 + 5.3. Event .....................................................12 + 5.4. Message Headers ...........................................12 + 6. Media Server ...................................................19 + 6.1. 
Media Server Session ......................................19 + 7. Speech Synthesizer Resource ....................................21 + 7.1. Synthesizer State Machine .................................22 + 7.2. Synthesizer Methods .......................................22 + 7.3. Synthesizer Events ........................................23 + 7.4. Synthesizer Header Fields .................................23 + 7.5. Synthesizer Message Body ..................................29 + 7.6. SET-PARAMS ................................................32 + 7.7. GET-PARAMS ................................................32 + 7.8. SPEAK .....................................................33 + 7.9. STOP ......................................................34 + 7.10. BARGE-IN-OCCURRED ........................................35 + 7.11. PAUSE ....................................................37 + 7.12. RESUME ...................................................37 + 7.13. CONTROL ..................................................38 + 7.14. SPEAK-COMPLETE ...........................................40 + + + +Shanmugham, et al. Informational [Page 2] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + 7.15. SPEECH-MARKER ............................................41 + 8. Speech Recognizer Resource .....................................42 + 8.1. Recognizer State Machine ..................................42 + 8.2. Recognizer Methods ........................................42 + 8.3. Recognizer Events .........................................43 + 8.4. Recognizer Header Fields ..................................43 + 8.5. Recognizer Message Body ...................................51 + 8.6. SET-PARAMS ................................................56 + 8.7. GET-PARAMS ................................................56 + 8.8. DEFINE-GRAMMAR ............................................57 + 8.9. RECOGNIZE .................................................60 + 8.10. STOP .....................................................63 + 8.11. GET-RESULT ...............................................64 + 8.12. START-OF-SPEECH ..........................................64 + 8.13. RECOGNITION-START-TIMERS .................................65 + 8.14. RECOGNITON-COMPLETE ......................................65 + 8.15. DTMF Detection ...........................................67 + 9. Future Study ...................................................67 + 10. Security Considerations .......................................67 + 11. RTSP-Based Examples ...........................................67 + 12. Informative References ........................................74 + Appendix A. ABNF Message Definitions ..............................76 + Appendix B. Acknowledgements ......................................84 + +1. Introduction + + The Media Resource Control Protocol (MRCP) is designed to provide a + mechanism for a client device requiring audio/video stream processing + to control processing resources on the network. These media + processing resources may be speech recognizers (a.k.a. Automatic- + Speech-Recognition (ASR) engines), speech synthesizers (a.k.a. Text- + To-Speech (TTS) engines), fax, signal detectors, etc. MRCP allows + implementation of distributed Interactive Voice Response platforms, + for example VoiceXML [6] interpreters. The MRCP protocol defines the + requests, responses, and events needed to control the media + processing resources. 
The MRCP protocol defines the state machine + for each resource and the required state transitions for each request + and server-generated event. + + The MRCP protocol does not address how the control session is + established with the server and relies on the Real Time Streaming + Protocol (RTSP) [2] to establish and maintain the session. The + session control protocol is also responsible for establishing the + media connection from the client to the network server. The MRCP + protocol and its messaging is designed to be carried over RTSP or + another protocol as a MIME-type similar to the Session Description + Protocol (SDP) [5]. + + + + +Shanmugham, et al. Informational [Page 3] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in RFC 2119 [8]. + +2. Architecture + + The system consists of a client that requires media streams generated + or needs media streams processed and a server that has the resources + or devices to process or generate the streams. The client + establishes a control session with the server for media processing + using a protocol such as RTSP. This will also set up and establish + the RTP stream between the client and the server or another RTP + endpoint. Each resource needed in processing or generating the + stream is addressed or referred to by a URL. The client can now use + MRCP messages to control the media resources and affect how they + process or generate the media stream. + + |--------------------| + ||------------------|| |----------------------| + || Application Layer|| ||--------------------|| + ||------------------|| || TTS | ASR | Fax || + || ASR/TTS API || ||Plugin|Plugin|Plugin|| + ||------------------|| || on | on | on || + || MRCP Core || || MRCP | MRCP | MRCP || + || Protocol Stack || ||--------------------|| + ||------------------|| || RTSP Stack || + || RTSP Stack || || || + ||------------------|| ||--------------------|| + || TCP/IP Stack ||========IP=========|| TCP/IP Stack || + ||------------------|| ||--------------------|| + |--------------------| |----------------------| + + MRCP client Real-time Streaming MRCP + media server + +2.1. Resources and Services + + The server is set up to offer a certain set of resources and services + to the client. These resources are of 3 types. + + Transmission Resources + + These are resources that are capable of generating real-time streams, + like signal generators that generate tones and sounds of certain + frequencies and patterns, and speech synthesizers that generate + spoken audio streams, etc. + + + + + +Shanmugham, et al. Informational [Page 4] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Reception Resources + + These are resources that receive and process streaming data like + signal detectors and speech recognizers. + + Dual Mode Resources + + These are resources that both send and receive data like a fax + resource, capable of sending or receiving fax through a two-way RTP + stream. + +2.2. Server and Resource Addressing + + The server as a whole is addressed using a container URL, and the + individual resources the server has to offer are reached by + individual resource URLs within the container URL. 
+ + RTSP Example: + + A media server or container URL like, + + rtsp://mediaserver.com/media/ + + may contain one or more resource URLs of the form, + + rtsp://mediaserver.com/media/speechrecognizer/ + rtsp://mediaserver.com/media/speechsynthesizer/ + rtsp://mediaserver.com/media/fax/ + +3. MRCP Protocol Basics + + The message format for MRCP is text based, with mechanisms to carry + embedded binary data. This allows data like recognition grammars, + recognition results, synthesizer speech markup, etc., to be carried + in the MRCP message between the client and the server resource. The + protocol does not address session control management, media + management, reliable sequencing, and delivery or server or resource + addressing. These are left to a protocol like SIP or RTSP. MRCP + addresses the issue of controlling and communicating with the + resource processing the stream, and defines the requests, responses, + and events needed to do that. + +3.1. Establishing Control Session and Media Streams + + The control session between the client and the server is established + using a protocol like RTSP. This protocol will also set up the + appropriate RTP streams between the server and the client, allocating + ports and setting up transport parameters as needed. Each control + + + +Shanmugham, et al. Informational [Page 5] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + session is identified by a unique session-id. The format, usage, and + life cycle of the session-id is in accordance with the RTSP protocol. + The resources within the session are addressed by the individual + resource URLs. + + The MRCP protocol is designed to work with and tunnel through another + protocol like RTSP, and augment its capabilities. MRCP relies on + RTSP headers for sequencing, reliability, and addressing to make sure + that messages get delivered reliably and in the correct order and to + the right resource. The MRCP messages are carried in the RTSP + message body. The media server delivers the MRCP message to the + appropriate resource or device by looking at the session-level + message headers and URL information. Another protocol, such as SIP + [4], could be used for tunneling MRCP messages. + +3.2. MRCP over RTSP + + RTSP supports both TCP and UDP mechanisms for the client to talk to + the server and is differentiated by the RTSP URL. All MRCP based + media servers MUST support TCP for transport and MAY support UDP. + + In RTSP, the ANNOUNCE method/response MUST be used to carry MRCP + request/responses between the client and the server. MRCP messages + MUST NOT be communicated in the RTSP SETUP or TEARDOWN messages. + + Currently all RTSP messages are request/responses and there is no + support for asynchronous events in RTSP. This is because RTSP was + designed to work over TCP or UDP and, hence, could not assume + reliability in the underlying protocol. Hence, when using MRCP over + RTSP, an asynchronous event from the MRCP server is packaged in a + server-initiated ANNOUNCE method/response communication. A future + RTSP extension to send asynchronous events from the server to the + client would provide an alternate vehicle to carry such asynchronous + MRCP events from the server. + + An RTSP session is created when an RTSP SETUP message is sent from + the client to a server and is addressed to a server URL or any one of + its resource URLs without specifying a session-id. The server will + establish a session context and will respond with a session-id to the + client. 
This sequence will also set up the RTP transport parameters + between the client and the server, and then the server will be ready + to receive or send media streams. If the client wants to attach an + additional resource to an existing session, the client should send + that session's ID in the subsequent SETUP message. + + + + + + + +Shanmugham, et al. Informational [Page 6] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + When a media server implementing MRCP over RTSP receives a PLAY, + RECORD, or PAUSE RTSP method from an MRCP resource URL, it should + respond with an RTSP 405 "Method not Allowed" response. For these + resources, the only allowed RTSP methods are SETUP, TEARDOWN, + DESCRIBE, and ANNOUNCE. + + Example 1: + + C->S: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:4 + Session:12345678 + Content-Type:application/mrcp + Content-Length:223 + + SPEAK 543257 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C: RTSP/1.0 200 OK + CSeq: 4 + Session:12345678 + RTP-Info:url=rtsp://media.server.com/media/synthesizer; + seq=9810092;rtptime=3450012 + Content-Type:application/mrcp + Content-Length:52 + + MRCP/1.0 543257 200 IN-PROGRESS + + S->C: ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:6 + Session:12345678 + + + +Shanmugham, et al. Informational [Page 7] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Content-Type:application/mrcp + Content-Length:123 + + SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0 + + C->S: RTSP/1.0 200 OK + CSeq:6 + + For the sake of brevity, most examples from here on show only the + MRCP messages and do not show the RTSP message and headers in which + they are tunneled. Also, RTSP messages such as response that are not + carrying an MRCP message are also left out. + +3.3. Media Streams and RTP Ports + + A single set of RTP/RTCP ports is negotiated and shared between the + MRCP client and server when multiple media processing resources, such + as automatic speech recognition (ASR) engines and text to speech + (TTS) engines, are used for a single session. The individual + resource instances allocated on the server under a common session + identifier will feed from/to that single RTP stream. + + The client can send multiple media streams towards the server, + differentiated by using different synchronized source (SSRC) + identifier values. Similarly the server can use multiple + Synchronized Source (SSRC) identifier values to differentiate media + streams originating from the individual transmission resource URLs if + more than one exists. The individual resources may, on the other + hand, work together to send just one stream to the client. This is + up to the implementation of the media server. + +4. Notational Conventions + + Since many of the definitions and syntax are identical to HTTP/1.1, + this specification only points to the section where they are defined + rather than copying it. For brevity, [HX.Y] refers to Section X.Y of + the current HTTP/1.1 specification (RFC 2616 [1]). 
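   The ANNOUNCE tunneling described in Section 3.2 can also be
   illustrated informally in code.  The Python sketch below is not part
   of this specification; the host name, port number, and helper
   function names are assumptions made only for the example, and error
   handling and response parsing are omitted.

      # Informal sketch; assumes an MRCP media server reachable at
      # media.server.com:554 and an already established RTSP session.
      import socket

      def build_mrcp_speak(request_id, text):
          # Minimal MRCP SPEAK request with a plain-text body
          # (see Section 7.8).
          body = text.encode("utf-8")
          header = (
              f"SPEAK {request_id} MRCP/1.0\r\n"
              "Voice-gender: neutral\r\n"
              "Content-Type: text/plain\r\n"
              f"Content-Length: {len(body)}\r\n"
              "\r\n"
          )
          return header.encode("ascii") + body

      def announce_mrcp(host, rtsp_url, session_id, cseq, mrcp_message):
          # Tunnel the MRCP message in the body of an RTSP ANNOUNCE
          # request, as described in Section 3.2.
          header = (
              f"ANNOUNCE {rtsp_url} RTSP/1.0\r\n"
              f"CSeq: {cseq}\r\n"
              f"Session: {session_id}\r\n"
              "Content-Type: application/mrcp\r\n"
              f"Content-Length: {len(mrcp_message)}\r\n"
              "\r\n"
          )
          with socket.create_connection((host, 554)) as sock:
              sock.sendall(header.encode("ascii") + mrcp_message)
              return sock.recv(4096).decode("utf-8", "replace")

      reply = announce_mrcp(
          "media.server.com",
          "rtsp://media.server.com/media/synthesizer",
          "12345678", 4,
          build_mrcp_speak(543257, "You have 4 new messages."))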
+ + All the mechanisms specified in this document are described in both + prose and an augmented Backus-Naur form (ABNF) similar to that used + in [H2.1]. It is described in detail in RFC 4234 [3]. + + The ABNF provided along with the descriptive text is informative in + nature and may not be complete. The complete message format in ABNF + form is provided in Appendix A and is the normative format + definition. + + + + + +Shanmugham, et al. Informational [Page 8] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +5. MRCP Specification + + The MRCP PDU is textual using an ISO 10646 character set in the UTF-8 + encoding (RFC 3629 [12]) to allow many different languages to be + represented. However, to assist in compact representations, MRCP + also allows other character sets such as ISO 8859-1 to be used when + desired. The MRCP protocol headers and field names use only the + US-ASCII subset of UTF-8. Internationalization only applies to + certain fields like grammar, results, speech markup, etc., and not to + MRCP as a whole. + + Lines are terminated by CRLF, but receivers SHOULD be prepared to + also interpret CR and LF by themselves as line terminators. Also, + some parameters in the PDU may contain binary data or a record + spanning multiple lines. Such fields have a length value associated + with the parameter, which indicates the number of octets immediately + following the parameter. + + The whole MRCP PDU is encoded in the body of the session level + message as a MIME entity of type application/mrcp. The individual + MRCP messages do not have addressing information regarding which + resource the request/response are to/from. Instead, the MRCP message + relies on the header of the session level message carrying it to + deliver the request to the appropriate resource, or to figure out who + the response or event is from. + + The MRCP message set consists of requests from the client to the + server, responses from the server to the client and asynchronous + events from the server to the client. All these messages consist of + a start-line, one or more header fields (also known as "headers"), an + empty line (i.e., a line with nothing preceding the CRLF) indicating + the end of the header fields, and an optional message body. + + generic-message = start-line + message-header + CRLF + [ message-body ] + + message-body = *OCTET + + start-line = request-line / status-line / event-line + + The message-body contains resource-specific and message-specific data + that needs to be carried between the client and server as a MIME + entity. The information contained here and the actual MIME-types + used to carry the data are specified later when addressing the + specific messages. + + + + +Shanmugham, et al. Informational [Page 9] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + If a message contains data in the message body, the header fields + will contain content-headers indicating the MIME-type and encoding of + the data in the message body. + +5.1. Request + + An MRCP request consists of a Request line followed by zero or more + parameters as part of the message headers and an optional message + body containing data specific to the request message. + + The Request message from a client to the server includes, within the + first line, the method to be applied, a method tag for that request, + and the version of protocol in use. 
+ + request-line = method-name SP request-id SP + mrcp-version CRLF + + The request-id field is a unique identifier created by the client and + sent to the server. The server resource should use this identifier + in its response to this request. If the request does not complete + with the response, future asynchronous events associated with this + request MUST carry the request-id. + + request-id = 1*DIGIT + + The method-name field identifies the specific request that the client + is making to the server. Each resource supports a certain list of + requests or methods that can be issued to it, and will be addressed + in later sections. + + method-name = synthesizer-method + / recognizer-method + + The mrcp-version field is the MRCP protocol version that is being + used by the client. + + mrcp-version = "MRCP" "/" 1*DIGIT "." 1*DIGIT + +5.2. Response + + After receiving and interpreting the request message, the server + resource responds with an MRCP response message. It consists of a + status line optionally followed by a message body. + + response-line = mrcp-version SP request-id SP status-code SP + request-state CRLF + + + + + +Shanmugham, et al. Informational [Page 10] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + The mrcp-version field used here is similar to the one used in the + Request Line and indicates the version of MRCP protocol running on + the server. + + The request-id used in the response MUST match the one sent in the + corresponding request message. + + The status-code field is a 3-digit code representing the success or + failure or other status of the request. + + The request-state field indicates if the job initiated by the Request + is PENDING, IN-PROGRESS, or COMPLETE. The COMPLETE status means that + the Request was processed to completion and that there will be no + more events from that resource to the client with that request-id. + The PENDING status means that the job has been placed on a queue and + will be processed in first-in-first-out order. The IN-PROGRESS + status means that the request is being processed and is not yet + complete. A PENDING or IN-PROGRESS status indicates that further + Event messages will be delivered with that request-id. + + request-state = "COMPLETE" + / "IN-PROGRESS" + / "PENDING" + +5.2.1. Status Codes + + The status codes are classified under the Success(2XX) codes and the + Failure(4XX) codes. + +5.2.1.1. Success 2xx + + 200 Success + 201 Success with some optional parameters ignored. + +5.2.1.2. Failure 4xx + + 401 Method not allowed + 402 Method not valid in this state + 403 Unsupported Parameter + 404 Illegal Value for Parameter + 405 Not found (e.g., Resource URI not initialized + or doesn't exist) + 406 Mandatory Parameter Missing + 407 Method or Operation Failed (e.g., Grammar compilation + failed in the recognizer. Detailed cause codes MAY BE + available through a resource specific header field.) + 408 Unrecognized or unsupported message entity + + + + +Shanmugham, et al. Informational [Page 11] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + 409 Unsupported Parameter Value + 421-499 Resource specific Failure codes + +5.3. Event + + The server resource may need to communicate a change in state or the + occurrence of a certain event to the client. These messages are used + when a request does not complete immediately and the response returns + a status of PENDING or IN-PROGRESS. 
The intermediate results and + events of the request are indicated to the client through the event + message from the server. Events have the request-id of the request + that is in progress and is generating these events and status value. + The status value is COMPLETE if the request is done and this was the + last event, else it is IN-PROGRESS. + + event-line = event-name SP request-id SP request-state SP + mrcp-version CRLF + + The mrcp-version used here is identical to the one used in the + Request/Response Line and indicates the version of MRCP protocol + running on the server. + + The request-id used in the event should match the one sent in the + request that caused this event. + + The request-state indicates if the Request/Command causing this event + is complete or still in progress, and is the same as the one + mentioned in Section 5.2. The final event will contain a COMPLETE + status indicating the completion of the request. + + The event-name identifies the nature of the event generated by the + media resource. The set of valid event names are dependent on the + resource generating it, and will be addressed in later sections. + + event-name = synthesizer-event + / recognizer-event + +5.4. Message Headers + + MRCP header fields, which include general-header (Section 5.4) and + resource-specific-header (Sections 7.4 and 8.4), follow the same + generic format as that given in Section 2.1 of RFC 2822 [7]. Each + header field consists of a name followed by a colon (":") and the + field value. Field names are case-insensitive. The field value MAY + be preceded by any amount of linear whitespace (LWS), though a single + SP is preferred. Header fields can be extended over multiple lines + by preceding each extra line with at least one SP or HT. + + + + +Shanmugham, et al. Informational [Page 12] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + message-header = 1*(generic-header / resource-header) + + The order in which header fields with differing field names are + received is not significant. However, it is "good practice" to send + general-header fields first, followed by request-header or response- + header fields, and ending with the entity-header fields. + + Multiple message-header fields with the same field-name MAY be + present in a message if and only if the entire field value for that + header field is defined as a comma-separated list (i.e., #(values)). + + It MUST be possible to combine the multiple header fields into one + "field-name:field-value" pair, without changing the semantics of the + message, by appending each subsequent field-value to the first, each + separated by a comma. Therefore, the order in which header fields + with the same field-name are received is significant to the + interpretation of the combined field value, and thus a proxy MUST NOT + change the order of these field values when a message is forwarded. + + Generic Headers + + generic-header = active-request-id-list + / proxy-sync-id + / content-id + / content-type + / content-length + / content-base + / content-location + / content-encoding + / cache-control + / logging-tag + + All headers in MRCP will be case insensitive, consistent with HTTP + and RTSP protocol header definitions. + +5.4.1. Active-Request-Id-List + + In a request, this field indicates the list of request-ids to which + it should apply. This is useful when there are multiple Requests + that are PENDING or IN-PROGRESS and you want this request to apply to + one or more of these specifically. 
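   The start-line grammar given in Sections 5.1 through 5.3 can be
   exercised with a small, informal parser.  The Python sketch below is
   not part of this specification; the function name and the returned
   dictionary keys are assumptions made only for the example.

      import re

      MRCP_VERSION = re.compile(r"MRCP/\d+\.\d+$")

      def classify_start_line(line):
          # Distinguish the three start-line forms of Section 5:
          #   request-line  = method-name SP request-id SP mrcp-version
          #   response-line = mrcp-version SP request-id SP status-code
          #                   SP request-state
          #   event-line    = event-name SP request-id SP request-state
          #                   SP mrcp-version
          parts = line.strip().split(" ")
          if len(parts) == 4 and parts[0].startswith("MRCP/"):
              return {"kind": "response", "request-id": parts[1],
                      "status-code": parts[2], "request-state": parts[3]}
          if len(parts) == 3 and MRCP_VERSION.match(parts[2]):
              return {"kind": "request", "method": parts[0],
                      "request-id": parts[1]}
          if len(parts) == 4 and MRCP_VERSION.match(parts[3]):
              return {"kind": "event", "event": parts[0],
                      "request-id": parts[1], "request-state": parts[2]}
          raise ValueError("not an MRCP start-line: " + line)

      classify_start_line("SPEAK 543257 MRCP/1.0")
      classify_start_line("MRCP/1.0 543257 200 IN-PROGRESS")
      classify_start_line("SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0")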
+ + In a response, this field returns the list of request-ids that the + operation modified or were in progress or just completed. There + could be one or more requests that returned a request-state of + PENDING or IN-PROGRESS. When a method affecting one or more PENDING + + + + + +Shanmugham, et al. Informational [Page 13] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + or IN-PROGRESS requests is sent from the client to the server, the + response MUST contain the list of request-ids that were affected in + this header field. + + The active-request-id-list is only used in requests and responses, + not in events. + + For example, if a STOP request with no active-request-id-list is sent + to a synthesizer resource (a wildcard STOP) that has one or more + SPEAK requests in the PENDING or IN-PROGRESS state, all SPEAK + requests MUST be cancelled, including the one IN-PROGRESS. In + addition, the response to the STOP request would contain the + request-id of all the SPEAK requests that were terminated in the + active-request-id-list. In this case, no SPEAK-COMPLETE or + RECOGNITION-COMPLETE events will be sent for these terminated + requests. + + active-request-id-list = "Active-Request-Id-List" ":" request-id + *("," request-id) CRLF + +5.4.2. Proxy-Sync-Id + + When any server resource generates a barge-in-able event, it will + generate a unique Tag and send it as a header field in an event to + the client. The client then acts as a proxy to the server resource + and sends a BARGE-IN-OCCURRED method (Section 7.10) to the + synthesizer server resource with the Proxy-Sync-Id it received from + the server resource. When the recognizer and synthesizer resources + are part of the same session, they may choose to work together to + achieve quicker interaction and response. Here, the proxy-sync-id + helps the resource receiving the event, proxied by the client, to + decide if this event has been processed through a direct interaction + of the resources. + + proxy-sync-id = "Proxy-Sync-Id" ":" 1*ALPHA CRLF + +5.4.3. Accept-Charset + + See [H14.2]. This specifies the acceptable character set for + entities returned in the response or events associated with this + request. This is useful in specifying the character set to use in + the Natural Language Semantics Markup Language (NLSML) results of a + RECOGNITON-COMPLETE event. + + + + + + + + +Shanmugham, et al. Informational [Page 14] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +5.4.4. Content-Type + + See [H14.17]. Note that the content types suitable for MRCP are + restricted to speech markup, grammar, recognition results, etc., and + are specified later in this document. The multi-part content type + "multi-part/mixed" is supported to communicate multiple of the above + mentioned contents, in which case the body parts cannot contain any + MRCP specific headers. + +5.4.5. Content-Id + + This field contains an ID or name for the content, by which it can be + referred to. The definition of this field conforms to RFC 2392 [14], + RFC 2822 [7], RFC 2046 [13] and is needed in multi-part messages. In + MRCP whenever the content needs to be stored, by either the client or + the server, it is stored associated with this ID. Such content can + be referenced during the session in URI form using the session:URI + scheme described in a later section. + +5.4.6. Content-Base + + The content-base entity-header field may be used to specify the base + URI for resolving relative URLs within the entity. 
+ + content-base = "Content-Base" ":" absoluteURI CRLF + + Note, however, that the base URI of the contents within the entity- + body may be redefined within that entity-body. An example of this + would be a multi-part MIME entity, which in turn can have multiple + entities within it. + +5.4.7. Content-Encoding + + The content-encoding entity-header field is used as a modifier to the + media-type. When present, its value indicates what additional + content coding has been applied to the entity-body, and thus what + decoding mechanisms must be applied in order to obtain the media-type + referenced by the content-type header field. Content-encoding is + primarily used to allow a document to be compressed without losing + the identity of its underlying media type. + + content-encoding = "Content-Encoding" ":" + *WSP content-coding + *(*WSP "," *WSP content-coding *WSP ) + CRLF + + content-coding = token + + + + +Shanmugham, et al. Informational [Page 15] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + token = 1*(alphanum / "-" / "." / "!" / "%" / "*" + / "_" / "+" / "`" / "'" / "~" ) + + Content coding is defined in [H3.5]. An example of its use is + + Content-Encoding:gzip + + If multiple encodings have been applied to an entity, the content + codings MUST be listed in the order in which they were applied. + +5.4.8. Content-Location + + The content-location entity-header field MAY BE used to supply the + resource location for the entity enclosed in the message when that + entity is accessible from a location separate from the requested + resource's URI. + + content-location = "Content-Location" ":" ( absoluteURI / + relativeURI ) CRLF + + The content-location value is a statement of the location of the + resource corresponding to this particular entity at the time of the + request. The media server MAY use this header field to optimize + certain operations. When providing this header field, the entity + being sent should not have been modified from what was retrieved from + the content-location URI. + + For example, if the client provided a grammar markup inline, and it + had previously retrieved it from a certain URI, that URI can be + provided as part of the entity, using the content-location header + field. This allows a resource like the recognizer to look into its + cache to see if this grammar was previously retrieved, compiled, and + cached. In which case, it might optimize by using the previously + compiled grammar object. + + If the content-location is a relative URI, the relative URI is + interpreted relative to the content-base URI. + +5.4.9. Content-Length + + This field contains the length of the content of the message body + (i.e., after the double CRLF following the last header field). + Unlike HTTP, it MUST be included in all messages that carry content + beyond the header portion of the message. If it is missing, a + default value of zero is assumed. It is interpreted according to + [H14.13]. + + + + + +Shanmugham, et al. Informational [Page 16] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +5.4.10. Cache-Control + + If the media server plans on implementing caching, it MUST adhere to + the cache correctness rules of HTTP 1.1 (RFC2616), when accessing and + caching HTTP URI. In particular, the expires and cache-control + headers of the cached URI or document must be honored and will always + take precedence over the Cache-Control defaults set by this header + field. 
The cache-control directives are used to define the default + caching algorithms on the media server for the session or request. + The scope of the directive is based on the method it is sent on. If + the directives are sent on a SET-PARAMS method, it SHOULD apply for + all requests for documents the media server may make in that session. + If the directives are sent on any other messages, they MUST only + apply to document requests the media server needs to make for that + method. An empty cache-control header on the GET-PARAMS method is a + request for the media server to return the current cache-control + directives setting on the server. + + cache-control = "Cache-Control" ":" *WSP cache-directive + *( *WSP "," *WSP cache-directive *WSP ) + CRLF + + cache-directive = "max-age" "=" delta-seconds + / "max-stale" "=" delta-seconds + / "min-fresh" "=" delta-seconds + + delta-seconds = 1*DIGIT + + Here, delta-seconds is a time value to be specified as an integer + number of seconds, represented in decimal, after the time that the + message response or data was received by the media server. + + These directives allow the media server to override the basic + expiration mechanism. + + max-age + + Indicates that the client is OK with the media server using a + response whose age is no greater than the specified time in + seconds. Unless a max-stale directive is also included, the + client is not willing to accept the media server using a stale + response. + + min-fresh + + Indicates that the client is willing to accept the media server + using a response whose freshness lifetime is no less than its + current age plus the specified time in seconds. That is, the + + + +Shanmugham, et al. Informational [Page 17] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + client wants the media server to use a response that will still be + fresh for at least the specified number of seconds. + + max-stale + + Indicates that the client is willing to accept the media server + using a response that has exceeded its expiration time. If max- + stale is assigned a value, then the client is willing to accept + the media server using a response that has exceeded its expiration + time by no more than the specified number of seconds. If no value + is assigned to max-stale, then the client is willing to accept the + media server using a stale response of any age. + + The media server cache MAY BE requested to use stale response/data + without validation, but only if this does not conflict with any + "MUST"-level requirements concerning cache validation (e.g., a + "must-revalidate" cache-control directive) in the HTTP 1.1 + specification pertaining the URI. + + If both the MRCP cache-control directive and the cached entry on the + media server include "max-age" directives, then the lesser of the two + values is used for determining the freshness of the cached entry for + that request. + +5.4.11. Logging-Tag + + This header field MAY BE sent as part of a SET-PARAMS/GET-PARAMS + method to set the logging tag for logs generated by the media server. + Once set, the value persists until a new value is set or the session + is ended. The MRCP server should provide a mechanism to subset its + output logs so that system administrators can examine or extract only + the log file portion during which the logging tag was set to a + certain value. + + MRCP clients using this feature should take care to ensure that no + two clients specify the same logging tag. 
In the event that two + clients specify the same logging tag, the effect on the MRCP server's + output logs in undefined. + + logging-tag = "Logging-Tag" ":" 1*ALPHA CRLF + + + + + + + + + + + +Shanmugham, et al. Informational [Page 18] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +6. Media Server + + The capability of media server resources can be found using the RTSP + DESCRIBE mechanism. When a client issues an RTSP DESCRIBE method for + a media resource URI, the media server response MUST contain an SDP + description in its body describing the capabilities of the media + server resource. The SDP description MUST contain at a minimum the + media header (m-line) describing the codec and other media related + features it supports. It MAY contain another SDP header as well, but + support for it is optional. + + The usage of SDP messages in the RTSP message body and its + application follows the SIP RFC 2543 [4], but is limited to media- + related negotiation and description. + +6.1. Media Server Session + + As discussed in Section 3.2, a client/server should share one RTSP + session-id for the different resources it may use under the same + session. The client MUST allocate a set of client RTP/RTCP ports for + a new session and MUST NOT send a Session-ID in the SETUP message for + the first resource. The server then creates a Session-ID and + allocates a set of server RTP/RTCP ports and responds to the SETUP + message. + + If the client wants to open more resources with the same server under + the same session, it will send the session-id (that it got in the + earlier SETUP response) in the SETUP for the new resource. A SETUP + message with an existing session-id tells the server that this new + resource will feed from/into the same RTP/RTCP stream of that + existing session. + + If the client wants to open a resource from a media server that is + not where the first resource came from, it will send separate SETUP + requests with no session-id header field in them. Each server will + allocate its own session-id and return it in the response. Each of + them will also come back with their own set of RTP/RTCP ports. This + would be the case when the synthesizer engine and the recognition + engine are on different servers. + + The RTSP SETUP method SHOULD contain an SDP description of the media + stream being set up. The RTSP SETUP response MUST contain an SDP + description of the media stream that it expects to receive and send + on that session. + + The SDP description in the SETUP method from the client SHOULD + describe the required media parameters like codec, Named Signaling + Event (NSE) payload types, etc. This could have multiple media + + + +Shanmugham, et al. Informational [Page 19] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + headers (i.e., m-lines) to allow the client to provide the media + server with more than one option to choose from. + + The SDP description in the SETUP response should reflect the media + parameters that the media server will be using for the stream. It + should be within the choices that were specified in the SDP of the + SETUP method, if one was provided. 
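   The session-id handling described above can be illustrated with an
   informal sketch.  The Python fragment below is not part of this
   specification; the host and resource URLs, port numbers, and helper
   name are assumptions made only for the example.  The protocol
   example that follows shows an actual SETUP exchange.

      def build_setup(resource_url, cseq, client_rtp_port,
                      session_id=None, sdp=None):
          # The first SETUP of a session omits the Session header so
          # that the server allocates a session-id (Section 6.1); a
          # SETUP for an additional resource under the same session
          # carries the session-id returned earlier by the server.
          lines = [
              f"SETUP {resource_url} RTSP/1.0",
              f"CSeq: {cseq}",
              "Transport: RTP/AVP;unicast;client_port="
              f"{client_rtp_port}-{client_rtp_port + 1}",
          ]
          if session_id is not None:
              lines.append(f"Session: {session_id}")
          body = sdp.encode("utf-8") if sdp else b""
          if body:
              lines.append("Content-Type: application/sdp")
              lines.append(f"Content-Length: {len(body)}")
          return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + body

      # First resource: no Session header yet.
      first = build_setup(
          "rtsp://media.server.com/media/recognizer/", 1, 46456)
      # Additional resource, reusing the session-id from the first
      # SETUP response.
      second = build_setup(
          "rtsp://media.server.com/media/synthesizer/", 2, 46456,
          session_id="0a030258_00003815_3bc4873a_0001_0000")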
+ + Example: + + C->S: + + SETUP rtsp://media.server.com/recognizer/ RTSP/1.0 + CSeq:1 + Transport:RTP/AVP;unicast;client_port=46456-46457 + Content-Type:application/sdp + Content-Length:190 + + v=0 + o=- 123 456 IN IP4 10.0.0.1 + s=Media Server + p=+1-888-555-1212 + c=IN IP4 0.0.0.0 + t=0 0 + m=audio 46456 RTP/AVP 0 96 + a=rtpmap:0 pcmu/8000 + a=rtpmap:96 telephone-event/8000 + a=fmtp:96 0-15 + + S->C: + + RTSP/1.0 200 OK + CSeq:1 + Session:0a030258_00003815_3bc4873a_0001_0000 + Transport:RTP/AVP;unicast;client_port=46456-46457; + server_port=46460-46461 + Content-Length:190 + Content-Type:application/sdp + + v=0 + o=- 3211724219 3211724219 IN IP4 10.3.2.88 + s=Media Server + c=IN IP4 0.0.0.0 + t=0 0 + m=audio 46460 RTP/AVP 0 96 + a=rtpmap:0 pcmu/8000 + a=rtpmap:96 telephone-event/8000 + a=fmtp:96 0-15 + + + +Shanmugham, et al. Informational [Page 20] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + If an SDP description was not provided in the RTSP SETUP method, then + the media server may decide on parameters of the stream but MUST + specify what it chooses in the SETUP response. An SDP announcement + is only returned in a response to a SETUP message that does not + specify a Session. That is, the server will not return an SDP + announcement for the synthesizer SETUP of a session already + established with a recognizer. + + C->S: + + SETUP rtsp://media.server.com/recognizer/ RTSP/1.0 + CSeq:1 + Transport:RTP/AVP;unicast;client_port=46498 + + S->C: + + RTSP/1.0 200 OK + CSeq:1 + Session:0a030258_000039dc_3bc48a13_0001_0000 + Transport:RTP/AVP;unicast; client_port=46498; + server_port=46502-46503 + Content-Length:193 + Content-Type:application/sdp + + v=0 + o=- 3211724947 3211724947 IN IP4 10.3.2.88 + s=Media Server + c=IN IP4 0.0.0.0 + t=0 0 + m=audio 46502 RTP/AVP 0 101 + a=rtpmap:0 pcmu/8000 + a=rtpmap:101 telephone-event/8000 + a=fmtp:101 0-15 + +7. Speech Synthesizer Resource + + This resource is capable of converting text provided by the client + and generating a speech stream in real-time. Depending on the + implementation and capability of this resource, the client can + control parameters like voice characteristics, speaker speed, etc. + + The synthesizer resource is controlled by MRCP requests from the + client. Similarly, the resource can respond to these requests or + generate asynchronous events to the server to indicate certain + conditions during the processing of the stream. + + + + + + +Shanmugham, et al. Informational [Page 21] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.1. Synthesizer State Machine + + The synthesizer maintains states because it needs to correlate MRCP + requests from the client. The state transitions shown below describe + the states of the synthesizer and reflect the request at the head of + the queue. A SPEAK request in the PENDING state can be deleted or + stopped by a STOP request and does not affect the state of the + resource. 
+ + Idle Speaking Paused + State State State + | | | + |----------SPEAK------->| |--------| + |<------STOP------------| CONTROL | + |<----SPEAK-COMPLETE----| |------->| + |<----BARGE-IN-OCCURRED-| | + | |--------| | + | CONTROL |-----------PAUSE--------->| + | |------->|<----------RESUME---------| + | | |----------| + | | PAUSE | + | | |--------->| + | |--------|----------| | + | BARGE-IN-OCCURRED | SPEECH-MARKER | + | |------->|<---------| | + |----------| | |------------| + | STOP | SPEAK | + | | | |----------->| + |<---------| | + |<-------------------STOP--------------------------| + +7.2. Synthesizer Methods + + The synthesizer supports the following methods. + + synthesizer-method = "SET-PARAMS" + / "GET-PARAMS" + / "SPEAK" + / "STOP" + / "PAUSE" + / "RESUME" + / "BARGE-IN-OCCURRED" + / "CONTROL" + + + + + + + + +Shanmugham, et al. Informational [Page 22] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.3. Synthesizer Events + + The synthesizer may generate the following events. + + synthesizer-event = "SPEECH-MARKER" + / "SPEAK-COMPLETE" + +7.4. Synthesizer Header Fields + + A synthesizer message may contain header fields containing request + options and information to augment the Request, Response, or Event of + the message with which it is associated. + + synthesizer-header = jump-target ; Section 7.4.1 + / kill-on-barge-in ; Section 7.4.2 + / speaker-profile ; Section 7.4.3 + / completion-cause ; Section 7.4.4 + / voice-parameter ; Section 7.4.5 + / prosody-parameter ; Section 7.4.6 + / vendor-specific ; Section 7.4.7 + / speech-marker ; Section 7.4.8 + / speech-language ; Section 7.4.9 + / fetch-hint ; Section 7.4.10 + / audio-fetch-hint ; Section 7.4.11 + / fetch-timeout ; Section 7.4.12 + / failed-uri ; Section 7.4.13 + / failed-uri-cause ; Section 7.4.14 + / speak-restart ; Section 7.4.15 + / speak-length ; Section 7.4.16 + + Parameter Support Methods/Events/Response + + jump-target MANDATORY SPEAK, CONTROL + logging-tag MANDATORY SET-PARAMS, GET-PARAMS + kill-on-barge-in MANDATORY SPEAK + speaker-profile OPTIONAL SET-PARAMS, GET-PARAMS, + SPEAK, CONTROL + completion-cause MANDATORY SPEAK-COMPLETE + voice-parameter MANDATORY SET-PARAMS, GET-PARAMS, + SPEAK, CONTROL + prosody-parameter MANDATORY SET-PARAMS, GET-PARAMS, + SPEAK, CONTROL + vendor-specific MANDATORY SET-PARAMS, GET-PARAMS + speech-marker MANDATORY SPEECH-MARKER + speech-language MANDATORY SET-PARAMS, GET-PARAMS, SPEAK + fetch-hint MANDATORY SET-PARAMS, GET-PARAMS, SPEAK + audio-fetch-hint MANDATORY SET-PARAMS, GET-PARAMS, SPEAK + fetch-timeout MANDATORY SET-PARAMS, GET-PARAMS, SPEAK + + + +Shanmugham, et al. Informational [Page 23] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + failed-uri MANDATORY Any + failed-uri-cause MANDATORY Any + speak-restart MANDATORY CONTROL + speak-length MANDATORY SPEAK, CONTROL + +7.4.1. Jump-Target + + This parameter MAY BE specified in a CONTROL method and controls the + jump size to move forward or rewind backward on an active SPEAK + request. A + or - indicates a relative value to what is being + currently played. This MAY BE specified in a SPEAK request to + indicate an offset into the speech markup that the SPEAK request + should start speaking from. The different speech length units + supported are dependent on the synthesizer implementation. If it + does not support a unit or the operation, the resource SHOULD respond + with a status code of 404 "Illegal or Unsupported value for + parameter". 
+ + jump-target = "Jump-Size" ":" speech-length-value CRLF + speech-length-value = numeric-speech-length + / text-speech-length + text-speech-length = 1*ALPHA SP "Tag" + numeric-speech-length= ("+" / "-") 1*DIGIT SP + numeric-speech-unit + numeric-speech-unit = "Second" + / "Word" + / "Sentence" + / "Paragraph" + +7.4.2. Kill-On-Barge-In + + This parameter MAY BE sent as part of the SPEAK method to enable + kill-on-barge-in support. If enabled, the SPEAK method is + interrupted by DTMF input detected by a signal detector resource or + by the start of speech sensed or recognized by the speech recognizer + resource. + + kill-on-barge-in = "Kill-On-Barge-In" ":" boolean-value CRLF + boolean-value = "true" / "false" + + If the recognizer or signal detector resource is on, the same server + as the synthesizer, the server should be intelligent enough to + recognize their interactions by their common RTSP session-id and work + with each other to provide kill-on-barge-in support. The client + needs to send a BARGE-IN-OCCURRED method to the synthesizer resource + when it receives a barge-in-able event from the synthesizer resource + + + + + +Shanmugham, et al. Informational [Page 24] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + or signal detector resource. These resources MAY BE local or + distributed. If this field is not specified, the value defaults to + "true". + +7.4.3. Speaker Profile + + This parameter MAY BE part of the SET-PARAMS/GET-PARAMS or SPEAK + request from the client to the server and specifies the profile of + the speaker by a URI, which may be a set of voice parameters like + gender, accent, etc. + + speaker-profile = "Speaker-Profile" ":" uri CRLF + +7.4.4. Completion Cause + + This header field MUST be specified in a SPEAK-COMPLETE event coming + from the synthesizer resource to the client. This indicates the + reason behind the SPEAK request completion. + + completion-cause = "Completion-Cause" ":" 1*DIGIT SP 1*ALPHA + CRLF + + Cause-Code Cause-Name Description + 000 normal SPEAK completed normally. + 001 barge-in SPEAK request was terminated because + of barge-in. + 002 parse-failure SPEAK request terminated because of a + failure to parse the speech markup text. + 003 uri-failure SPEAK request terminated because, access + to one of the URIs failed. + 004 error SPEAK request terminated prematurely due + to synthesizer error. + 005 language-unsupported + Language not supported. + +7.4.5. Voice-Parameters + + This set of parameters defines the voice of the speaker. + + voice-parameter = "Voice-" voice-param-name ":" + voice-param-value CRLF + + voice-param-name is any one of the attribute names under the voice + element specified in W3C's Speech Synthesis Markup Language + Specification [9]. The voice-param-value is any one of the value + choices of the corresponding voice element attribute specified in the + above section. + + + + +Shanmugham, et al. Informational [Page 25] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + These header fields MAY BE sent in SET-PARAMS/GET-PARAMS request to + define/get default values for the entire session or MAY BE sent in + the SPEAK request to define default values for that speak request. + Furthermore, these attributes can be part of the speech text marked + up in Speech Synthesis Markup Language (SSML). + + These voice parameter header fields can also be sent in a CONTROL + method to affect a SPEAK request in progress and change its behavior + on the fly. 
If the synthesizer resource does not support this + operation, it should respond back to the client with a status of + unsupported. + +7.4.6. Prosody-Parameters + + This set of parameters defines the prosody of the speech. + + prosody-parameter = "Prosody-" prosody-param-name ":" + prosody-param-value CRLF + + prosody-param-name is any one of the attribute names under the + prosody element specified in W3C's Speech Synthesis Markup Language + Specification [9]. The prosody-param-value is any one of the value + choices of the corresponding prosody element attribute specified in + the above section. + + These header fields MAY BE sent in SET-PARAMS/GET-PARAMS request to + define/get default values for the entire session or MAY BE sent in + the SPEAK request to define default values for that speak request. + Furthermore, these attributes can be part of the speech text marked + up in SSML. + + The prosody parameter header fields in the SET-PARAMS or SPEAK + request only apply if the speech data is of type text/plain and does + not use a speech markup format. + + These prosody parameter header fields MAY also be sent in a CONTROL + method to affect a SPEAK request in progress and to change its + behavior on the fly. If the synthesizer resource does not support + this operation, it should respond back to the client with a status of + unsupported. + + + + + + + + + + + +Shanmugham, et al. Informational [Page 26] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.4.7. Vendor-Specific Parameters + + This set of headers allows for the client to set vendor-specific + parameters. + + vendor-specific = "Vendor-Specific-Parameters" ":" + vendor-specific-av-pair + *[";" vendor-specific-av-pair] CRLF + + vendor-specific-av-pair = vendor-av-pair-name "=" + vendor-av-pair-value + + This header MAY BE sent in the SET-PARAMS/GET-PARAMS method and is + used to set vendor-specific parameters on the server side. The + vendor-av-pair-name can be any vendor-specific field name and + conforms to the XML vendor-specific attribute naming convention. The + vendor-av-pair-value is the value to set the attribute to and needs + to be quoted. + + When asking the server to get the current value of these parameters, + this header can be sent in the GET-PARAMS method with the list of + vendor-specific attribute names to get separated by a semicolon. + +7.4.8. Speech Marker + + This header field contains a marker tag that may be embedded in the + speech data. Most speech markup formats provide mechanisms to embed + marker fields between speech texts. The synthesizer will generate + SPEECH-MARKER events when it reaches these marker fields. This field + SHOULD be part of the SPEECH-MARKER event and will contain the marker + tag values. + + speech-marker = "Speech-Marker" ":" 1*ALPHA CRLF + +7.4.9. Speech Language + + This header field specifies the default language of the speech data + if it is not specified in the speech data. The value of this header + field should follow RFC 3066 [16] for its values. This MAY occur in + SPEAK, SET-PARAMS, or GET-PARAMS request. + + speech-language = "Speech-Language" ":" 1*ALPHA CRLF + + + + + + + + + +Shanmugham, et al. Informational [Page 27] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.4.10. Fetch Hint + + When the synthesizer needs to fetch documents or other resources like + speech markup or audio files, etc., this header field controls URI + access properties. This defines when the synthesizer should retrieve + content from the server. 
A value of "prefetch" indicates a file may + be downloaded when the request is received, whereas "safe" indicates + a file that should only be downloaded when actually needed. The + default value is "prefetch". This header field MAY occur in SPEAK, + SET-PARAMS, or GET-PARAMS requests. + + fetch-hint = "Fetch-Hint" ":" 1*ALPHA CRLF + +7.4.11. Audio Fetch Hint + + When the synthesizer needs to fetch documents or other resources like + speech audio files, etc., this header field controls URI access + properties. This defines whether or not the synthesizer can attempt + to optimize speech by pre-fetching audio. The value is either "safe" + to say that audio is only fetched when it is needed, never before; + "prefetch" to permit, but not require the platform to pre-fetch the + audio; or "stream" to allow it to stream the audio fetches. The + default value is "prefetch". This header field MAY occur in SPEAK, + SET-PARAMS, or GET-PARAMS requests. + + audio-fetch-hint = "Audio-Fetch-Hint" ":" 1*ALPHA CRLF + +7.4.12. Fetch Timeout + + When the synthesizer needs to fetch documents or other resources like + speech audio files, etc., this header field controls URI access + properties. This defines the synthesizer timeout for resources the + media server may need to fetch from the network. This is specified + in milliseconds. The default value is platform-dependent. This + header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS. + + fetch-timeout = "Fetch-Timeout" ":" 1*DIGIT CRLF + +7.4.13. Failed URI + + When a synthesizer method needs a synthesizer to fetch or access a + URI, and the access fails, the media server SHOULD provide the failed + URI in this header field in the method response. + + failed-uri = "Failed-URI" ":" Url CRLF + + + + + + +Shanmugham, et al. Informational [Page 28] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.4.14. Failed URI Cause + + When a synthesizer method needs a synthesizer to fetch or access a + URI, and the access fails, the media server SHOULD provide the URI + specific or protocol-specific response code through this header field + in the method response. This field has been defined as alphanumeric + to accommodate all protocols, some of which might have a response + string instead of a numeric response code. + + failed-uri-cause = "Failed-URI-Cause" ":" 1*ALPHA CRLF + +7.4.15. Speak Restart + + When a CONTROL jump backward request is issued to a currently + speaking synthesizer resource and the jumps beyond the start of the + speech, the current SPEAK request re-starts from the beginning of its + speech data and the response to the CONTROL request would contain + this header indicating a restart. This header MAY occur in the + CONTROL response. + + speak-restart = "Speak-Restart" ":" boolean-value CRLF + +7.4.16. Speak Length + + This parameter MAY BE specified in a CONTROL method to control the + length of speech to speak, relative to the current speaking point in + the currently active SPEAK request. A "-" value is illegal in this + field. If a field with a Tag unit is specified, then the media must + speak until the tag is reached or the SPEAK request complete, + whichever comes first. This MAY BE specified in a SPEAK request to + indicate the length to speak in the speech data and is relative to + the point in speech where the SPEAK request starts. The different + speech length units supported are dependent on the synthesizer + implementation. 
If it does not support a unit or the operation, the + resource SHOULD respond with a status code of 404 "Illegal or + Unsupported value for parameter". + + speak-length = "Speak-Length" ":" speech-length-value + CRLF + +7.5. Synthesizer Message Body + + A synthesizer message may contain additional information associated + with the Method, Response, or Event in its message body. + + + + + + + +Shanmugham, et al. Informational [Page 29] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.5.1. Synthesizer Speech Data + + Marked-up text for the synthesizer to speak is specified as a MIME + entity in the message body. The message to be spoken by the + synthesizer can be specified inline (by embedding the data in the + message body) or by reference (by providing the URI to the data). In + either case, the data and the format used to markup the speech needs + to be supported by the media server. + + All media servers MUST support plain text speech data and W3C's + Speech Synthesis Markup Language [9] at a minimum and, hence, MUST + support the MIME types text/plain and application/synthesis+ssml at a + minimum. + + If the speech data needs to be specified by URI reference, the MIME + type text/uri-list is used to specify the one or more URIs that will + list what needs to be spoken. If a list of speech URIs is specified, + speech data provided by each URI must be spoken in the order in which + the URI are specified. + + If the data to be spoken consists of a mix of URI and inline speech + data, the multipart/mixed MIME-type is used and embedded with the + MIME-blocks for text/uri-list, application/synthesis+ssml or + text/plain. The character set and encoding used in the speech data + may be specified according to standard MIME-type definitions. The + multi-part MIME-block can contain actual audio data in .wav or Sun + audio format. This is used when the client has audio clips that it + may have recorded, then stored in memory or a local device, and that + it currently needs to play as part of the SPEAK request. The audio + MIME-parts can be sent by the client as part of the multi-part MIME- + block. This audio will be referenced in the speech markup data that + will be another part in the multi-part MIME-block according to the + multipart/mixed MIME-type specification. + + Example 1: + Content-Type:text/uri-list + Content-Length:176 + + http://www.cisco.com/ASR-Introduction.sml + http://www.cisco.com/ASR-Document-Part1.sml + http://www.cisco.com/ASR-Document-Part2.sml + http://www.cisco.com/ASR-Conclusion.sml + + Example 2: + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + + + +Shanmugham, et al. 
Informational [Page 30] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + Example 3: + Content-Type:multipart/mixed; boundary="--break" + + --break + Content-Type:text/uri-list + Content-Length:176 + + http://www.cisco.com/ASR-Introduction.sml + http://www.cisco.com/ASR-Document-Part1.sml + http://www.cisco.com/ASR-Document-Part2.sml + http://www.cisco.com/ASR-Conclusion.sml + + --break + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + --break + + + + + + + + +Shanmugham, et al. Informational [Page 31] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.6. SET-PARAMS + + The SET-PARAMS method, from the client to server, tells the + synthesizer resource to define default synthesizer context + parameters, like voice characteristics and prosody, etc. If the + server accepted and set all parameters, it MUST return a Response- + Status of 200. If it chose to ignore some optional parameters, it + MUST return 201. + + If some of the parameters being set are unsupported or have illegal + values, the server accepts and sets the remaining parameters and MUST + respond with a Response-Status of 403 or 404, and MUST include in the + response the header fields that could not be set. + + Example: + C->S:SET-PARAMS 543256 MRCP/1.0 + Voice-gender:female + Voice-category:adult + Voice-variant:3 + + S->C:MRCP/1.0 543256 200 COMPLETE + +7.7. GET-PARAMS + + The GET-PARAMS method, from the client to server, asks the + synthesizer resource for its current synthesizer context parameters, + like voice characteristics and prosody, etc. The client SHOULD send + the list of parameters it wants to read from the server by listing a + set of empty parameter header fields. If a specific list is not + specified then the server SHOULD return all the settable parameters + including vendor-specific parameters and their current values. The + wild card use can be very intensive as the number of settable + parameters can be large depending on the vendor. Hence, it is + RECOMMENDED that the client does not use the wildcard GET-PARAMS + operation very often. + + Example: + C->S:GET-PARAMS 543256 MRCP/1.0 + Voice-gender: + Voice-category: + Voice-variant: + Vendor-Specific-Parameters:com.mycorp.param1; + com.mycorp.param2 + + S->C:MRCP/1.0 543256 200 COMPLETE + Voice-gender:female + Voice-category:adult + Voice-variant:3 + + + +Shanmugham, et al. Informational [Page 32] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Vendor-Specific-Parameters:com.mycorp.param1="Company Name"; + com.mycorp.param2="124324234@mycorp.com" + +7.8. SPEAK + + The SPEAK method from the client to the server provides the + synthesizer resource with the speech text and initiates speech + synthesis and streaming. 
The SPEAK method can carry voice and + prosody header fields that define the behavior of the voice being + synthesized, as well as the actual marked-up text to be spoken. If + specific voice and prosody parameters are specified as part of the + speech markup text, it will take precedence over the values specified + in the header fields and those set using a previous SET-PARAMS + request. + + When applying voice parameters, there are 3 levels of scope. The + highest precedence are those specified within the speech markup text, + followed by those specified in the header fields of the SPEAK request + and, hence, apply for that SPEAK request only, followed by the + session default values that can be set using the SET-PARAMS request + and apply for the whole session moving forward. + + If the resource is idle and the SPEAK request is being actively + processed, the resource will respond with a success status code and a + request-state of IN-PROGRESS. + + If the resource is in the speaking or paused states (i.e., it is in + the middle of processing a previous SPEAK request), the status + returns success and a request-state of PENDING. This means that this + SPEAK request is in queue and will be processed after the currently + active SPEAK request is completed. + + For the synthesizer resource, this is the only request that can + return a request-state of IN-PROGRESS or PENDING. When the text to + be synthesized is complete, the resource will issue a SPEAK-COMPLETE + event with the request-id of the SPEAK message and a request-state of + COMPLETE. + + Example: + C->S:SPEAK 543257 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + + + + + +Shanmugham, et al. Informational [Page 33] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543257 200 IN-PROGRESS + + S->C:SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0 + Completion-Cause:000 normal + +7.9. STOP + + The STOP method from the client to the server tells the resource to + stop speaking if it is speaking something. + + The STOP request can be sent with an active-request-id-list header + field to stop the zero or more specific SPEAK requests that may be in + queue and return a response code of 200(Success). If no active- + request-id-list header field is sent in the STOP request, it will + terminate all outstanding SPEAK requests. + + If a STOP request successfully terminated one or more PENDING or + IN-PROGRESS SPEAK requests, then the response message body contains + an active-request-id-list header field listing the SPEAK request-ids + that were terminated. Otherwise, there will be no active-request- + id-list header field in the response. No SPEAK-COMPLETE events will + be sent for these terminated requests. + + If a SPEAK request that was IN-PROGRESS and speaking was stopped, the + next pending SPEAK request, if any, would become IN-PROGRESS and move + to the speaking state. + + If a SPEAK request that was IN-PROGRESS and in the paused state was + stopped, the next pending SPEAK request, if any, would become + IN-PROGRESS and move to the paused state. 
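+
+   The following minimal, non-normative Python sketch (not part of the
+   protocol examples in this document) illustrates one way a client
+   might reconcile its queue of outstanding SPEAK request-ids against
+   the Active-Request-Id-List header of a STOP response.  The helper
+   name and the assumption that the header value is a comma-separated
+   list of request-ids are illustrative only.
+
+      def terminated_request_ids(stop_response):
+          # Return the request-ids named in an Active-Request-Id-List
+          # header of an MRCP response, if present (simplified parsing).
+          for line in stop_response.splitlines():
+              name, _, value = line.partition(":")
+              if name.strip().lower() == "active-request-id-list":
+                  return {rid.strip() for rid in value.split(",")
+                          if rid.strip()}
+          return set()
+
+      # SPEAK request-ids the client has not yet seen complete
+      pending_speaks = {"543258", "543260"}
+      response = ("MRCP/1.0 543259 200 COMPLETE\r\n"
+                  "Active-Request-Id-List:543258\r\n")
+      pending_speaks -= terminated_request_ids(response)
+      # No SPEAK-COMPLETE events will arrive for the removed ids.
+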
+ + + + + + + +Shanmugham, et al. Informational [Page 34] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Example: + C->S:SPEAK 543258 MRCP/1.0 + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543258 200 IN-PROGRESS + + C->S:STOP 543259 200 MRCP/1.0 + + S->C:MRCP/1.0 543259 200 COMPLETE + Active-Request-Id-List:543258 + +7.10. BARGE-IN-OCCURRED + + The BARGE-IN-OCCURRED method is a mechanism for the client to + communicate a barge-in-able event it detects to the speech resource. + + This event is useful in two scenarios, + + 1. The client has detected some events like DTMF digits or other + barge-in-able events and wants to communicate that to the + synthesizer. + + 2. The recognizer resource and the synthesizer resource are in + different servers. In which case the client MUST act as a Proxy + and receive event from the recognition resource, and then send a + BARGE-IN-OCCURRED method to the synthesizer. In such cases, the + BARGE-IN-OCCURRED method would also have a proxy-sync-id header + field received from the resource generating the original event. + + If a SPEAK request is active with kill-on-barge-in enabled, and the + BARGE-IN-OCCURRED event is received, the synthesizer should stop + streaming out audio. It should also terminate any speech requests + queued behind the current active one, irrespective of whether they + + + +Shanmugham, et al. Informational [Page 35] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + have barge-in enabled or not. If a barge-in-able prompt was playing + and it was terminated, the response MUST contain the request-ids of + all SPEAK requests that were terminated in its active-request-id- + list. There will be no SPEAK-COMPLETE events generated for these + requests. + + If the synthesizer and the recognizer are on the same server, they + could be optimized for a quicker kill-on-barge-in response by having + them interact directly based on a common RTSP session-id. In these + cases, the client MUST still proxy the recognition event through a + BARGE-IN-OCCURRED method, but the synthesizer resource may have + already stopped and sent a SPEAK-COMPLETE event with a barge-in + completion cause code. If there were no SPEAK requests terminated as + a result of the BARGE-IN-OCCURRED method, the response would still be + a 200 success, but MUST not contain an active-request-id-list header + field. + + C->S:SPEAK 543258 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543258 200 IN-PROGRESS + + C->S:BARGE-IN-OCCURRED 543259 200 MRCP/1.0 + Proxy-Sync-Id:987654321 + + S->C:MRCP/1.0 543259 200 COMPLETE + Active-Request-Id-List:543258 + + + + + + + +Shanmugham, et al. 
Informational [Page 36] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.11. PAUSE + + The PAUSE method from the client to the server tells the resource to + pause speech, if it is speaking something. If a PAUSE method is + issued on a session when a SPEAK is not active, the server SHOULD + respond with a status of 402 or "Method not valid in this state". If + a PAUSE method is issued on a session when a SPEAK is active and + paused, the server SHOULD respond with a status of 200 or "Success". + If a SPEAK request was active, the server MUST return an active- + request-id-list header with the request-id of the SPEAK request that + was paused. + + C->S:SPEAK 543258 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543258 200 IN-PROGRESS + + C->S:PAUSE 543259 MRCP/1.0 + + S->C:MRCP/1.0 543259 200 COMPLETE + Active-Request-Id-List:543258 + +7.12. RESUME + + The RESUME method from the client to the server tells a paused + synthesizer resource to continue speaking. If a RESUME method is + issued on a session when a SPEAK is not active, the server SHOULD + respond with a status of 402 or "Method not valid in this state". If + a RESUME method is issued on a session when a SPEAK is active and + speaking (i.e., not paused), the server SHOULD respond with a status + + + +Shanmugham, et al. Informational [Page 37] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + of 200 or "Success". If a SPEAK request was active, the server MUST + return an active-request-id-list header with the request-id of the + SPEAK request that was resumed + + Example: + C->S:SPEAK 543258 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543258 200 IN-PROGRESS + + C->S:PAUSE 543259 MRCP/1.0 + + S->C:MRCP/1.0 543259 200 COMPLETE + Active-Request-Id-List:543258 + + C->S:RESUME 543260 MRCP/1.0 + + S->C:MRCP/1.0 543260 200 COMPLETE + Active-Request-Id-List:543258 + +7.13. CONTROL + + The CONTROL method from the client to the server tells a synthesizer + that is speaking to modify what it is speaking on the fly. This + method is used to make the synthesizer jump forward or backward in + what it is being spoken, change speaker rate and speaker parameters, + etc. It affects the active or IN-PROGRESS SPEAK request. Depending + on the implementation and capability of the synthesizer resource, it + may allow this operation or one or more of its parameters. + + + + +Shanmugham, et al. 
Informational [Page 38] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + When a CONTROL to jump forward is issued and the operation goes + beyond the end of the active SPEAK method's text, the request + succeeds. A SPEAK-COMPLETE event follows the response to the CONTROL + method. If there are more SPEAK requests in the queue, the + synthesizer resource will continue to process the next SPEAK method. + When a CONTROL to jump backwards is issued and the operation jumps to + the beginning of the speech data of the active SPEAK request, the + response to the CONTROL request contains the speak-restart header. + + These two behaviors can be used to rewind or fast-forward across + multiple speech requests, if the client wants to break up a speech + markup text into multiple SPEAK requests. + + If a SPEAK request was active when the CONTROL method was received, + the server MUST return an active-request-id-list header with the + Request-id of the SPEAK request that was active. + + Example: + C->S:SPEAK 543258 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543258 200 IN-PROGRESS + + C->S:CONTROL 543259 MRCP/1.0 + Prosody-rate:fast + + S->C:MRCP/1.0 543259 200 COMPLETE + Active-Request-Id-List:543258 + + C->S:CONTROL 543260 MRCP/1.0 + + + +Shanmugham, et al. Informational [Page 39] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Jump-Size:-15 Words + + S->C:MRCP/1.0 543260 200 COMPLETE + Active-Request-Id-List:543258 + +7.14. SPEAK-COMPLETE + + This is an Event message from the synthesizer resource to the client + indicating that the SPEAK request was completed. The request-id + header field WILL match the request-id of the SPEAK request that + initiated the speech that just completed. The request-state field + should be COMPLETE indicating that this is the last Event with that + request-id, and that the request with that request-id is now + complete. The completion-cause header field specifies the cause code + pertaining to the status and reason of request completion such as the + SPEAK completed normally or because of an error or kill-on-barge-in, + etc. + + Example: + C->S:SPEAK 543260 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:MRCP/1.0 543260 200 IN-PROGRESS + + S->C:SPEAK-COMPLETE 543260 COMPLETE MRCP/1.0 + + Completion-Cause:000 normal + + + + + + +Shanmugham, et al. Informational [Page 40] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +7.15. 
SPEECH-MARKER + + This is an event generated by the synthesizer resource to the client + when it hits a marker tag in the speech markup it is currently + processing. The request-id field in the header matches the SPEAK + request request-id that initiated the speech. The request-state + field should be IN-PROGRESS as the speech is still not complete and + there is more to be spoken. The actual speech marker tag hit, + describing where the synthesizer is in the speech markup, is returned + in the speech-marker header field. + + Example: + C->S:SPEAK 543261 MRCP/1.0 + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + <mark name="here"/> + <sentence>The subject is + <prosody rate="-20%">ski trip</prosody> + </sentence> + <mark name="ANSWER"/> + </paragraph> + </speak> + + S->C:MRCP/1.0 543261 200 IN-PROGRESS + + S->C:SPEECH-MARKER 543261 IN-PROGRESS MRCP/1.0 + Speech-Marker:here + + S->C:SPEECH-MARKER 543261 IN-PROGRESS MRCP/1.0 + Speech-Marker:ANSWER + + S->C:SPEAK-COMPLETE 543261 COMPLETE MRCP/1.0 + Completion-Cause:000 normal + + + + + + +Shanmugham, et al. Informational [Page 41] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8. Speech Recognizer Resource + + The speech recognizer resource is capable of receiving an incoming + voice stream and providing the client with an interpretation of what + was spoken in textual form. + +8.1. Recognizer State Machine + + The recognizer resource is controlled by MRCP requests from the + client. Similarly, the resource can respond to these requests or + generate asynchronous events to the server to indicate certain + conditions during the processing of the stream. Hence, the + recognizer maintains states to correlate MRCP requests from the + client. The state transitions are described below. + + Idle Recognizing Recognized + State State State + | | | + |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->| + |<------STOP------------|<-----RECOGNIZE-----------| + | | | + | | |-----------| + | |--------| GET-RESULT | + | START-OF-SPEECH | |---------->| + |------------| |------->| | + | | |----------| | + | DEFINE-GRAMMAR | RECOGNITION-START-TIMERS | + |<-----------| |<---------| | + | | | + | | | + |-------| | | + | STOP | | + |<------| | | + | | + |<-------------------STOP--------------------------| + |<-------------------DEFINE-GRAMMAR----------------| + +8.2. Recognizer Methods + + The recognizer supports the following methods. + recognizer-method = SET-PARAMS + / GET-PARAMS + / DEFINE-GRAMMAR + / RECOGNIZE + / GET-RESULT + / RECOGNITION-START-TIMERS + / STOP + + + + +Shanmugham, et al. Informational [Page 42] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.3. Recognizer Events + + The recognizer may generate the following events. + + recognizer-event = START-OF-SPEECH + / RECOGNITION-COMPLETE + +8.4. Recognizer Header Fields + + A recognizer message may contain header fields containing request + options and information to augment the Method, Response, or Event + message it is associated with. 
+ + recognizer-header = confidence-threshold ; Section 8.4.1 + / sensitivity-level ; Section 8.4.2 + / speed-vs-accuracy ; Section 8.4.3 + / n-best-list-length ; Section 8.4.4 + / no-input-timeout ; Section 8.4.5 + / recognition-timeout ; Section 8.4.6 + / waveform-url ; Section 8.4.7 + / completion-cause ; Section 8.4.8 + / recognizer-context-block ; Section 8.4.9 + / recognizer-start-timers ; Section 8.4.10 + / vendor-specific ; Section 8.4.11 + / speech-complete-timeout ; Section 8.4.12 + / speech-incomplete-timeout; Section 8.4.13 + / dtmf-interdigit-timeout ; Section 8.4.14 + / dtmf-term-timeout ; Section 8.4.15 + / dtmf-term-char ; Section 8.4.16 + / fetch-timeout ; Section 8.4.17 + / failed-uri ; Section 8.4.18 + / failed-uri-cause ; Section 8.4.19 + / save-waveform ; Section 8.4.20 + / new-audio-channel ; Section 8.4.21 + / speech-language ; Section 8.4.22 + + Parameter Support Methods/Events + + confidence-threshold MANDATORY SET-PARAMS, RECOGNIZE + GET-RESULT + sensitivity-level Optional SET-PARAMS, GET-PARAMS, + RECOGNIZE + speed-vs-accuracy Optional SET-PARAMS, GET-PARAMS, + RECOGNIZE + n-best-list-length Optional SET-PARAMS, GET-PARAMS, + RECOGNIZE, GET-RESULT + no-input-timeout MANDATORY SET-PARAMS, GET-PARAMS, + RECOGNIZE + + + +Shanmugham, et al. Informational [Page 43] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + recognition-timeout MANDATORY SET-PARAMS, GET-PARAMS, + RECOGNIZE + waveform-url MANDATORY RECOGNITION-COMPLETE + completion-cause MANDATORY DEFINE-GRAMMAR, RECOGNIZE, + RECOGNITON-COMPLETE + recognizer-context-block Optional SET-PARAMS, GET-PARAMS + recognizer-start-timers MANDATORY RECOGNIZE + vendor-specific MANDATORY SET-PARAMS, GET-PARAMS + speech-complete-timeout MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE + speech-incomplete-timeout MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE + dtmf-interdigit-timeout MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE + dtmf-term-timeout MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE + dtmf-term-char MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE + fetch-timeout MANDATORY SET-PARAMS, GET-PARAMS + RECOGNIZE, DEFINE-GRAMMAR + failed-uri MANDATORY DEFINE-GRAMMAR response, + RECOGNITION-COMPLETE + failed-uri-cause MANDATORY DEFINE-GRAMMAR response, + RECOGNITION-COMPLETE + save-waveform MANDATORY SET-PARAMS, GET-PARAMS, + RECOGNIZE + new-audio-channel MANDATORY RECOGNIZE + speech-language MANDATORY SET-PARAMS, GET-PARAMS, + RECOGNIZE, DEFINE-GRAMMAR + +8.4.1. Confidence Threshold + + When a recognition resource recognizes or matches a spoken phrase + with some portion of the grammar, it associates a confidence level + with that conclusion. The confidence-threshold parameter tells the + recognizer resource what confidence level should be considered a + successful match. This is an integer from 0-100 indicating the + recognizer's confidence in the recognition. If the recognizer + determines that its confidence in all its recognition results is less + than the confidence threshold, then it MUST return no-match as the + recognition result. This header field MAY occur in RECOGNIZE, SET- + PARAMS, or GET-PARAMS. The default value for this field is platform + specific. + + confidence-threshold = "Confidence-Threshold" ":" 1*DIGIT CRLF + + + + + + +Shanmugham, et al. Informational [Page 44] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.4.2. Sensitivity Level + + To filter out background noise and not mistake it for speech, the + recognizer may support a variable level of sound sensitivity. 
The + sensitivity-level parameter allows the client to set this value on + the recognizer. This header field MAY occur in RECOGNIZE, SET- + PARAMS, or GET-PARAMS. A higher value for this field means higher + sensitivity. The default value for this field is platform specific. + + sensitivity-level = "Sensitivity-Level" ":" 1*DIGIT CRLF + +8.4.3. Speed Vs Accuracy + + Depending on the implementation and capability of the recognizer + resource, it may be tunable towards Performance or Accuracy. Higher + accuracy may mean more processing and higher CPU utilization, meaning + less calls per media server and vice versa. This parameter on the + resource can be tuned by the speed-vs-accuracy header. This header + field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. A higher + value for this field means higher speed. The default value for this + field is platform specific. + + speed-vs-accuracy = "Speed-Vs-Accuracy" ":" 1*DIGIT CRLF + +8.4.4. N Best List Length + + When the recognizer matches an incoming stream with the grammar, it + may come up with more than one alternative match because of + confidence levels in certain words or conversation paths. If this + header field is not specified, by default, the recognition resource + will only return the best match above the confidence threshold. The + client, by setting this parameter, could ask the recognition resource + to send it more than 1 alternative. All alternatives must still be + above the confidence-threshold. A value greater than one does not + guarantee that the recognizer will send the requested number of + alternatives. This header field MAY occur in RECOGNIZE, SET-PARAMS, + or GET-PARAMS. The minimum value for this field is 1. The default + value for this field is 1. + + n-best-list-length = "N-Best-List-Length" ":" 1*DIGIT CRLF + +8.4.5. No Input Timeout + + When recognition is started and there is no speech detected for a + certain period of time, the recognizer can send a RECOGNITION- + COMPLETE event to the client and terminate the recognition operation. + The no-input-timeout header field can set this timeout value. The + value is in milliseconds. This header field MAY occur in RECOGNIZE, + + + +Shanmugham, et al. Informational [Page 45] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + SET-PARAMS, or GET-PARAMS. The value for this field ranges from 0 to + MAXTIMEOUT, where MAXTIMEOUT is platform specific. The default value + for this field is platform specific. + + no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF + +8.4.6. Recognition Timeout + + When recognition is started and there is no match for a certain + period of time, the recognizer can send a RECOGNITION-COMPLETE event + to the client and terminate the recognition operation. The + recognition-timeout parameter field sets this timeout value. The + value is in milliseconds. The value for this field ranges from 0 to + MAXTIMEOUT, where MAXTIMEOUT is platform specific. The default value + is 10 seconds. This header field MAY occur in RECOGNIZE, SET-PARAMS + or GET-PARAMS. + + recognition-timeout = "Recognition-Timeout" ":" 1*DIGIT CRLF + +8.4.7. Waveform URL + + If the save-waveform header field is set to true, the recognizer MUST + record the incoming audio stream of the recognition into a file and + provide a URI for the client to access it. This header MUST be + present in the RECOGNITION-COMPLETE event if the save-waveform header + field was set to true. 
The URL value of the header MUST be NULL if + there was some error condition preventing the server from recording. + Otherwise, the URL generated by the server SHOULD be globally unique + across the server and all its recognition sessions. The URL SHOULD + BE available until the session is torn down. + + waveform-url = "Waveform-URL" ":" Url CRLF + +8.4.8. Completion Cause + + This header field MUST be part of a RECOGNITION-COMPLETE event coming + from the recognizer resource to the client. This indicates the + reason behind the RECOGNIZE method completion. This header field + MUST BE sent in the DEFINE-GRAMMAR and RECOGNIZE responses, if they + return with a failure status and a COMPLETE state. + + Cause-Code Cause-Name Description + + 000 success RECOGNIZE completed with a match or + DEFINE-GRAMMAR succeeded in + downloading and compiling the + grammar + + + + +Shanmugham, et al. Informational [Page 46] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + 001 no-match RECOGNIZE completed, but no match + was found + 002 no-input-timeout + RECOGNIZE completed without a match + due to a no-input-timeout + 003 recognition-timeout + RECOGNIZE completed without a match + due to a recognition-timeout + 004 gram-load-failure + RECOGNIZE failed due grammar load + failure. + 005 gram-comp-failure + RECOGNIZE failed due to grammar + compilation failure. + 006 error RECOGNIZE request terminated + prematurely due to a recognizer + error. + 007 speech-too-early + RECOGNIZE request terminated because + speech was too early. + 008 too-much-speech-timeout + RECOGNIZE request terminated because + speech was too long. + 009 uri-failure Failure accessing a URI. + 010 language-unsupported + Language not supported. + +8.4.9. Recognizer Context Block + + This parameter MAY BE sent as part of the SET-PARAMS or GET-PARAMS + request. If the GET-PARAMS method contains this header field with no + value, then it is a request to the recognizer to return the + recognizer context block. The response to such a message MAY contain + a recognizer context block as a message entity. If the server + returns a recognizer context block, the response MUST contain this + header field and its value MUST match the content-id of that entity. + + If the SET-PARAMS method contains this header field, it MUST contain + a message entity containing the recognizer context data, and a + content-id matching this header field. + + This content-id should match the content-id that came with the + context data during the GET-PARAMS operation. + + recognizer-context-block = "Recognizer-Context-Block" ":" + 1*ALPHA CRLF + + + + + +Shanmugham, et al. Informational [Page 47] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.4.10. Recognition Start Timers + + This parameter MAY BE sent as part of the RECOGNIZE request. A value + of false tells the recognizer to start recognition, but not to start + the no-input timer yet. The recognizer should not start the timers + until the client sends a RECOGNITION-START-TIMERS request to the + recognizer. This is useful in the scenario when the recognizer and + synthesizer engines are not part of the same session. Here, when a + kill-on-barge-in prompt is being played, you want the RECOGNIZE + request to be simultaneously active so that it can detect and + implement kill-on-barge-in. But at the same time, you don't want the + recognizer to start the no-input timers until the prompt is finished. + The default value is "true". 
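+
+   As a non-normative illustration of the scenario above, the Python
+   fragment below sketches a RECOGNIZE request that carries
+   Recognizer-Start-Timers:false so that the no-input timer is
+   deferred until a later RECOGNITION-START-TIMERS request.  The build
+   helper is hypothetical and deliberately simplified; the formal
+   syntax of the header field follows.
+
+      def build_recognize(request_id, grammar_uris, start_timers=False):
+          # Assemble a RECOGNIZE request whose no-input timer will not
+          # start until RECOGNITION-START-TIMERS is sent (simplified;
+          # a real client would also manage Content-Id, encodings, etc.).
+          body = "\r\n".join(grammar_uris) + "\r\n"
+          return ("RECOGNIZE %d MRCP/1.0\r\n"
+                  "Recognizer-Start-Timers:%s\r\n"
+                  "Content-Type:text/uri-list\r\n"
+                  "Content-Length:%d\r\n"
+                  "\r\n%s" % (request_id,
+                              "true" if start_timers else "false",
+                              len(body), body))
+
+      print(build_recognize(543261,
+                            ["session:request1@form-level.store"]))
+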
+ + recognizer-start-timers = "Recognizer-Start-Timers" ":" + boolean-value CRLF + +8.4.11. Vendor Specific Parameters + + This set of headers allows the client to set Vendor Specific + parameters. + + This header can be sent in the SET-PARAMS method and is used to set + vendor-specific parameters on the server. The vendor-av-pair-name + can be any vendor-specific field name and conforms to the XML + vendor-specific attribute naming convention. The vendor-av-pair- + value is the value to set the attribute to, and needs to be quoted. + + When asking the server to get the current value of these parameters, + this header can be sent in the GET-PARAMS method with the list of + vendor-specific attribute names to get separated by a semicolon. + This header field MAY occur in SET-PARAMS or GET-PARAMS. + +8.4.12. Speech Complete Timeout + + This header field specifies the length of silence required following + user speech before the speech recognizer finalizes a result (either + accepting it or throwing a nomatch event). The speech-complete- + timeout value is used when the recognizer currently has a complete + match of an active grammar, and specifies how long it should wait for + more input before declaring a match. By contrast, the incomplete + timeout is used when the speech is an incomplete match to an active + grammar. The value is in milliseconds. + + speech-complete-timeout = "Speech-Complete-Timeout" ":" + 1*DIGIT CRLF + + + + + +Shanmugham, et al. Informational [Page 48] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + A long speech-complete-timeout value delays the result completion + and, therefore, makes the computer's response slow. A short speech- + complete-timeout may lead to an utterance being broken up + inappropriately. Reasonable complete timeout values are typically in + the range of 0.3 seconds to 1.0 seconds. The value for this field + ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. + The default value for this field is platform specific. This header + field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. + +8.4.13. Speech Incomplete Timeout + + This header field specifies the required length of silence following + user speech, after which a recognizer finalizes a result. The + incomplete timeout applies when the speech prior to the silence is an + incomplete match of all active grammars. In this case, once the + timeout is triggered, the partial result is rejected (with a nomatch + event). The value is in milliseconds. The value for this field + ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform specific. + The default value for this field is platform specific. + + speech-incomplete-timeout = "Speech-Incomplete-Timeout" ":" + 1*DIGIT CRLF + + The speech-incomplete-timeout also applies when the speech prior to + the silence is a complete match of an active grammar, but where it is + possible to speak further and still match the grammar. By contrast, + the complete timeout is used when the speech is a complete match to + an active grammar and no further words can be spoken. + + A long speech-incomplete-timeout value delays the result completion + and, therefore, makes the computer's response slow. A short speech- + incomplete-timeout may lead to an utterance being broken up + inappropriately. + + The speech-incomplete-timeout is usually longer than the speech- + complete-timeout to allow users to pause mid-utterance (for example, + to breathe). This header field MAY occur in RECOGNIZE, SET-PARAMS, + or GET-PARAMS. + +8.4.14. 
DTMF Interdigit Timeout + + This header field specifies the inter-digit timeout value to use when + recognizing DTMF input. The value is in milliseconds. The value for + this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform + specific. The default value is 5 seconds. This header field MAY + occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. + + + + + +Shanmugham, et al. Informational [Page 49] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + dtmf-interdigit-timeout = "DTMF-Interdigit-Timeout" ":" + 1*DIGIT CRLF + +8.4.15. DTMF Term Timeout + + This header field specifies the terminating timeout to use when + recognizing DTMF input. The value is in milliseconds. The value for + this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform + specific. The default value is 10 seconds. This header field MAY + occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. + + dtmf-term-timeout = "DTMF-Term-Timeout" ":" 1*DIGIT CRLF + +8.4.16. DTMF-Term-Char + + This header field specifies the terminating DTMF character for DTMF + input recognition. The default value is NULL which is specified as + an empty header field. This header field MAY occur in RECOGNIZE, + SET-PARAMS, or GET-PARAMS. + + dtmf-term-char = "DTMF-Term-Char" ":" CHAR CRLF + +8.4.17. Fetch Timeout + + When the recognizer needs to fetch grammar documents, this header + field controls URI access properties. This defines the recognizer + timeout for completing the fetch of the resources the media server + needs from the network. The value is in milliseconds. The value for + this field ranges from 0 to MAXTIMEOUT, where MAXTIMEOUT is platform + specific. The default value for this field is platform specific. + This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. + +8.4.18. Failed URI + + When a recognizer method needs a recognizer to fetch or access a URI, + and the access fails, the media server SHOULD provide the failed URI + in this header field in the method response. + +8.4.19. Failed URI Cause + + When a recognizer method needs a recognizer to fetch or access a URI, + and the access fails, the media server SHOULD provide the URI- + specific or protocol-specific response code through this header field + in the method response. This field has been defined as alphanumeric + to accommodate all protocols, some of which might have a response + string instead of a numeric response code. + + + + + +Shanmugham, et al. Informational [Page 50] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.4.20. Save Waveform + + This header field allows the client to indicate to the recognizer + that it MUST save the audio stream that was recognized. The + recognizer MUST then record the recognized audio and make it + available to the client in the form of a URI returned in the + waveform-uri header field in the RECOGNITION-COMPLETE event. If + there was an error in recording the stream or the audio clip is + otherwise not available, the recognizer MUST return an empty + waveform-uri header field. The default value for this fields is + "false". + + save-waveform = "Save-Waveform" ":" boolean-value CRLF + +8.4.21. New Audio Channel + + This header field MAY BE specified in a RECOGNIZE message and allows + the client to tell the media server that, from that point on, it will + be sending audio data from a new audio source, channel, or speaker. + If the recognition resource had collected any line statistics or + information, it MUST discard it and start fresh for this RECOGNIZE. 
+ This helps in the case where the client MAY want to reuse an open + recognition session with the media server for multiple telephone + calls. + + new-audio-channel = "New-Audio-Channel" ":" boolean-value CRLF + +8.4.22. Speech Language + + This header field specifies the language of recognition grammar data + within a session or request, if it is not specified within the data. + The value of this header field should follow RFC 3066 [16] for its + values. This MAY occur in DEFINE-GRAMMAR, RECOGNIZE, SET-PARAMS, or + GET-PARAMS request. + +8.5. Recognizer Message Body + + A recognizer message may carry additional data associated with the + method, response, or event. The client may send the grammar to be + recognized in DEFINE-GRAMMAR or RECOGNIZE requests. When the grammar + is sent in the DEFINE-GRAMMAR method, the server should be able to + download compile and optimize the grammar. The RECOGNIZE request + MUST contain a list of grammars that need to be active during the + recognition. The server resource may send the recognition results in + the RECOGNITION-COMPLETE event or the GET-RESULT response. This data + will be carried in the message body of the corresponding MRCP + message. + + + + +Shanmugham, et al. Informational [Page 51] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.5.1. Recognizer Grammar Data + + Recognizer grammar data from the client to the server can be provided + inline or by reference. Either way, they are carried as MIME + entities in the message body of the MRCP request message. The + grammar specified inline or by reference specifies the grammar used + to match in the recognition process and this data is specified in one + of the standard grammar specification formats like W3C's XML or ABNF + or Sun's Java Speech Grammar Format, etc. All media servers MUST + support W3C's XML based grammar markup format [11] (MIME-type + application/grammar+xml) and SHOULD support the ABNF form (MIME-type + application/grammar). + + When a grammar is specified in-line in the message, the client MUST + provide a content-id for that grammar as part of the content headers. + The server MUST store the grammar associated with that content-id for + the duration of the session. A stored grammar can be overwritten by + defining a new grammar with the same content-id. Grammars that have + been associated with a content-id can be referenced through a special + "session:" URI scheme. + + Example: + session:help@root-level.store + + If grammar data needs to be specified by external URI reference, the + MIME-type text/uri-list is used to list the one or more URI that will + specify the grammar data. All media servers MUST support the HTTP + URI access mechanism. + + If the data to be defined consists of a mix of URI and inline grammar + data, the multipart/mixed MIME-type is used and embedded with the + MIME-blocks for text/uri-list, application/grammar or + application/grammar+xml. The character set and encoding used in the + grammar data may be specified according to standard MIME-type + definitions. + + When more than one grammar URI or inline grammar block is specified + in a message body of the RECOGNIZE request, it is an active list of + grammar alternatives to listen. The ordering of the list implies the + precedence of the grammars, with the first grammar in the list having + the highest precedence. + + Example 1: + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + + +Shanmugham, et al. 
Informational [Page 52] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + <!-- multiple language attachment to a token --> + <rule id="people1"> + <token lexicon="en-US,fr-CA"> Robert </token> + </rule> + + <!-- the equivalent single-language attachment expansion --> + <rule id="people2"> + <one-of> + <item xml:lang="en-US">Robert</item> + <item xml:lang="fr-CA">Robert</item> + </one-of> + </rule> + + </grammar> + + Example 2: + Content-Type:text/uri-list + Content-Length:176 + + session:help@root-level.store + http://www.cisco.com/Directory-Name-List.grxml + http://www.cisco.com/Department-List.grxml + http://www.cisco.com/TAC-Contact-List.grxml + session:menu1@menu-level.store + + Example 3: + Content-Type:multipart/mixed; boundary="--break" + + + + +Shanmugham, et al. Informational [Page 53] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + --break + Content-Type:text/uri-list + Content-Length:176 + http://www.cisco.com/Directory-Name-List.grxml + http://www.cisco.com/Department-List.grxml + http://www.cisco.com/TAC-Contact-List.grxml + + --break + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + <!-- multiple language attachment to a token --> + <rule id="people1"> + <token lexicon="en-US,fr-CA"> Robert </token> + </rule> + + <!-- the equivalent single-language attachment expansion --> + + <rule id="people2"> + <one-of> + <item xml:lang="en-US">Robert</item> + <item xml:lang="fr-CA">Robert</item> + </one-of> + </rule> + + + + +Shanmugham, et al. Informational [Page 54] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + </grammar> + --break + +8.5.2. Recognizer Result Data + + Recognition result data from the server is carried in the MRCP + message body of the RECOGNITION-COMPLETE event or the GET-RESULT + response message as MIME entities. All media servers MUST support + W3C's Natural Language Semantics Markup Language (NLSML) [10] as the + default standard for returning recognition results back to the + client, and hence MUST support the MIME-type application/x-nlsml. + + Example 1: + Content-Type:application/x-nlsml + Content-Length:104 + + <?xml version="1.0"?> + <result grammar="http://theYesNoGrammar"> + <interpretation> + <instance> + <myApp:yes_no> + <response>yes</response> + </myApp:yes_no> + </instance> + <input>ok</input> + </interpretation> + </result> + +8.5.3. 
Recognizer Context Block + + When the client has to change recognition servers within a call, this + is a block of data that the client MAY collect from the first media + server and provide to the second media server. This may be because + the client needs different language support or because the media + server issued an RTSP RE-DIRECT. Here, the first recognizer may have + collected acoustic and other data during its recognition. When we + switch recognition servers, communicating this data may allow the + second recognition server to provide better recognition based on the + acoustic data collected by the previous recognizer. This block of + data is vendor-specific and MUST be carried as MIME-type + application/octets in the body of the message. + + This block of data is communicated in the SET-PARAMS and GET-PARAMS + method/response messages. In the GET-PARAMS method, if an empty + recognizer-context-block header field is present, then the recognizer + should return its vendor-specific context block in the message body + as a MIME-entity with a specific content-id. The content-id value + should also be specified in the recognizer-context-block header field + + + +Shanmugham, et al. Informational [Page 55] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + in the GET-PARAMS response. The SET-PARAMS request wishing to + provide this vendor-specific data should send it in the message body + as a MIME-entity with the same content-id that it received from the + GET-PARAMS. The content-id should also be sent in the recognizer- + context-block header field of the SET-PARAMS message. + + Each automatic speech recognition (ASR) vendor choosing to use this + mechanism to handoff recognizer context data among its servers should + distinguish its vendor-specific block of data from other vendors by + choosing a unique content-id that they should recognize. + +8.6. SET-PARAMS + + The SET-PARAMS method, from the client to the server, tells the + recognizer resource to set and modify recognizer context parameters + like recognizer characteristics, result detail level, etc. In the + following sections some standard parameters are discussed. If the + server resource does not recognize an OPTIONAL parameter, it MUST + ignore that field. Many of the parameters in the SET-PARAMS method + can also be used in another method like the RECOGNIZE method. But + the difference is that when you set something like the sensitivity- + level using the SET-PARAMS, it applies for all future requests, + whenever applicable. On the other hand, when you pass sensitivity- + level in a RECOGNIZE request, it applies only to that request. + + Example: + C->S:SET-PARAMS 543256 MRCP/1.0 + Sensitivity-Level:20 + Recognition-Timeout:30 + Confidence-Threshold:85 + + S->C:MRCP/1.0 543256 200 COMPLETE + +8.7. GET-PARAMS + + The GET-PARAMS method, from the client to the server, asks the + recognizer resource for its current default parameters, like + sensitivity-level, n-best-list-length, etc. The client can request + specific parameters from the server by sending it one or more empty + parameter headers with no values. The server should then return the + settings for those specific parameters only. When the client does + not send a specific list of empty parameter headers, the recognizer + should return the settings for all parameters. The wild card use can + be very intensive as the number of settable parameters can be large + depending on the vendor. 
Hence, it is RECOMMENDED that the client + does not use the wildcard GET-PARAMS operation very often. + + + + + +Shanmugham, et al. Informational [Page 56] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Example: + C->S:GET-PARAMS 543256 MRCP/1.0 + Sensitivity-Level: + Recognition-Timeout: + Confidence-threshold: + + S->C:MRCP/1.0 543256 200 COMPLETE + Sensitivity-Level:20 + Recognition-Timeout:30 + Confidence-Threshold:85 + +8.8. DEFINE-GRAMMAR + + The DEFINE-GRAMMAR method, from the client to the server, provides a + grammar and tells the server to define, download if needed, and + compile the grammar. + + If the server resource is in the recognition state, the DEFINE- + GRAMMAR request MUST respond with a failure status. + + If the resource is in the idle state and is able to successfully load + and compile the grammar, the status MUST return a success code and + the request-state MUST be COMPLETE. + + If the recognizer could not define the grammar for some reason, say + the download failed or the grammar failed to compile, or the grammar + was in an unsupported form, the MRCP response for the DEFINE-GRAMMAR + method MUST contain a failure status code of 407, and a completion- + cause header field describing the failure reason. + + Example: + C->S:DEFINE-GRAMMAR 543257 MRCP/1.0 + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + + +Shanmugham, et al. Informational [Page 57] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + </grammar> + + S->C:MRCP/1.0 543257 200 COMPLETE + Completion-Cause:000 success + + + C->S:DEFINE-GRAMMAR 543258 MRCP/1.0 + Content-Type:application/grammar+xml + Content-Id:helpgrammar@root-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <rule id="request"> + I need help + </rule> + + </grammar> + + S->C:MRCP/1.0 543258 200 COMPLETE + Completion-Cause:000 success + + C->S:DEFINE-GRAMMAR 543259 MRCP/1.0 + Content-Type:application/grammar+xml + Content-Id:request2@field-level.store + Content-Length:104 + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <rule id="request"> + I need help + </rule> + + S->C:MRCP/1.0 543258 200 COMPLETE + + + +Shanmugham, et al. 
Informational [Page 58] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Completion-Cause:000 success + + C->S:DEFINE-GRAMMAR 543259 MRCP/1.0 + Content-Type:application/grammar+xml + Content-Id:request2@field-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <grammar xml:lang="en"> + + <import uri="session:politeness@form-level.store" + name="polite"/> + + <rule id="basicCmd" scope="public"> + <example> please move the window </example> + <example> open a file </example> + + <ruleref import="polite#startPolite"/> + <ruleref uri="#command"/> + <ruleref import="polite#endPolite"/> + </rule> + + <rule id="command"> + <ruleref uri="#action"/> <ruleref uri="#object"/> + </rule> + + <rule id="action"> + <choice> + <item weight="10" tag="OPEN"> open </item> + <item weight="2" tag="CLOSE"> close </item> + <item weight="1" tag="DELETE"> delete </item> + <item weight="1" tag="MOVE"> move </item> + </choice> + </rule> + + <rule id="object"> + <count number="optional"> + <choice> + <item> the </item> + <item> a </item> + </choice> + </count> + <choice> + <item> window </item> + <item> file </item> + <item> menu </item> + </choice> + + + +Shanmugham, et al. Informational [Page 59] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + </rule> + + </grammar> + + S->C:MRCP/1.0 543259 200 COMPLETE + Completion-Cause:000 success + + C->S:RECOGNIZE 543260 MRCP/1.0 + N-Best-List-Length:2 + Content-Type:text/uri-list + Content-Length:176 + + session:request1@form-level.store + session:request2@field-level.store + session:helpgramar@root-level.store + + S->C:MRCP/1.0 543260 200 IN-PROGRESS + + S->C:START-OF-SPEECH 543260 IN-PROGRESS MRCP/1.0 + + S->C:RECOGNITION-COMPLETE 543260 COMPLETE MRCP/1.0 + Completion-Cause:000 success + Waveform-URL:http://web.media.com/session123/audio.wav + Content-Type:applicationt/x-nlsml + Content-Length:276 + + <?xml version="1.0"?> + <result x-model="http://IdentityModel" + xmlns:xf="http://www.w3.org/2000/xforms" + grammar="session:request1@form-level.store"> + <interpretation> + <xf:instance name="Person"> + <Person> + <Name> Andre Roy </Name> + </Person> + </xf:instance> + <input> may I speak to Andre Roy </input> + </interpretation> + </result> + +8.9. RECOGNIZE + + The RECOGNIZE method from the client to the server tells the + recognizer to start recognition and provides it with a grammar to + match for. The RECOGNIZE method can carry parameters to control the + sensitivity, confidence level, and the level of detail in results + provided by the recognizer. These parameters override the current + defaults set by a previous SET-PARAMS method. + + + +Shanmugham, et al. Informational [Page 60] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + If the resource is in the recognition state, the RECOGNIZE request + MUST respond with a failure status. + + If the resource is in the Idle state and was able to successfully + start the recognition, the server MUST return a success code and a + request-state of IN-PROGRESS. This means that the recognizer is + active and that the client should expect further events with this + request-id. + + If the resource could not start a recognition, it MUST return a + failure status code of 407 and contain a completion-cause header + field describing the cause of failure. + + For the recognizer resource, this is the only request that can return + request-state of IN-PROGRESS, meaning that recognition is in + progress. 
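+
+   A minimal, non-normative sketch of how a client might dispatch on
+   the request-state of responses and events belonging to an
+   outstanding RECOGNIZE request is shown below.  The MrcpMessage
+   structure is an assumed client-side abstraction, not something
+   defined by this protocol.
+
+      from collections import namedtuple
+
+      MrcpMessage = namedtuple("MrcpMessage",
+                               "request_id request_state event_name body")
+
+      def handle_recognizer_message(msg, pending):
+          # Track an outstanding RECOGNIZE request by its request-state.
+          if msg.request_state == "IN-PROGRESS":
+              pending.add(msg.request_id)      # recognition still active
+          elif msg.request_state == "COMPLETE":
+              pending.discard(msg.request_id)  # last message for this id
+              if msg.event_name == "RECOGNITION-COMPLETE":
+                  return msg.body              # e.g., an NLSML result
+          return None
+
+      pending = set()
+      handle_recognizer_message(
+          MrcpMessage("543257", "IN-PROGRESS", None, None), pending)
+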
When the recognition completes by matching one of the + grammar alternatives or by a time-out without a match or for some + other reason, the recognizer resource MUST send the client a + RECOGNITON-COMPLETE event with the result of the recognition and a + request-state of COMPLETE. + + For large grammars that can take a long time to compile and for + grammars that are used repeatedly, the client could issue a DEFINE- + GRAMMAR request with the grammar ahead of time. In such a case, the + client can issue the RECOGNIZE request and reference the grammar + through the "session:" special URI. This also applies in general if + the client wants to restart recognition with a previous inline + grammar. + + Note that since the audio and the messages are carried over separate + communication paths there may be a race condition between the start + of the flow of audio and the receipt of the RECOGNIZE method. For + example, if audio flow is started by the client at the same time as + the RECOGNIZE method is sent, either the audio or the RECOGNIZE will + arrive at the recognizer first. As another example, the client may + chose to continuously send audio to the Media server and signal the + Media server to recognize using the RECOGNIZE method. A number of + mechanisms exist to resolve this condition and the mechanism chosen + is left to the implementers of recognizer Media servers. + + Example: + C->S:RECOGNIZE 543257 MRCP/1.0 + Confidence-Threshold:90 + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + + +Shanmugham, et al. Informational [Page 61] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + </grammar> + + S->C:MRCP/1.0 543257 200 IN-PROGRESS + + S->C:START-OF-SPEECH 543257 IN-PROGRESS MRCP/1.0 + + S->C:RECOGNITION-COMPLETE 543257 COMPLETE MRCP/1.0 + + Completion-Cause:000 success + Waveform-URL:http://web.media.com/session123/audio.wav + Content-Type:application/x-nlsml + Content-Length:276 + + <?xml version="1.0"?> + <result x-model="http://IdentityModel" + xmlns:xf="http://www.w3.org/2000/xforms" + grammar="session:request1@form-level.store"> + <interpretation> + <xf:instance name="Person"> + <Person> + <Name> Andre Roy </Name> + </Person> + </xf:instance> + <input> may I speak to Andre Roy </input> + </interpretation> + </result> + + + + +Shanmugham, et al. Informational [Page 62] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.10. STOP + + The STOP method from the client to the server tells the resource to + stop recognition if one is active. If a RECOGNIZE request is active + and the STOP request successfully terminated it, then the response + header contains an active-request-id-list header field containing the + request-id of the RECOGNIZE request that was terminated. In this + case, no RECOGNITION-COMPLETE event will be sent for the terminated + request. If there was no recognition active, then the response MUST + NOT contain an active-request-id-list header field. 
Either + way,method the response MUST contain a status of 200(Success). + + Example: + C->S:RECOGNIZE 543257 MRCP/1.0 + Confidence-Threshold:90 + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + </grammar> + + S->C:MRCP/1.0 543257 200 IN-PROGRESS + + C->S:STOP 543258 200 MRCP/1.0 + + + + + +Shanmugham, et al. Informational [Page 63] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + S->C:MRCP/1.0 543258 200 COMPLETE + Active-Request-Id-List:543257 + +8.11. GET-RESULT + + The GET-RESULT method from the client to the server can be issued + when the recognizer is in the recognized state. This request allows + the client to retrieve results for a completed recognition. This is + useful if the client decides it wants more alternatives or more + information. When the media server receives this request, it should + re-compute and return the results according to the recognition + constraints provided in the GET-RESULT request. + + The GET-RESULT request could specify constraints like a different + confidence-threshold, or n-best-list-length. This feature is + optional and the automatic speech recognition (ASR) engine may return + a status of unsupported feature. + + Example: + C->S:GET-RESULT 543257 MRCP/1.0 + Confidence-Threshold:90 + + S->C:MRCP/1.0 543257 200 COMPLETE + Content-Type:application/x-nlsml + Content-Length:276 + + <?xml version="1.0"?> + <result x-model="http://IdentityModel" + xmlns:xf="http://www.w3.org/2000/xforms" + grammar="session:request1@form-level.store"> + <interpretation> + <xf:instance name="Person"> + <Person> + <Name> Andre Roy </Name> + + </Person> + </xf:instance> + <input> may I speak to Andre Roy </input> + </interpretation> + </result> + +8.12. START-OF-SPEECH + + This is an event from the recognizer to the client indicating that it + has detected speech. This event is useful in implementing kill-on- + barge-in scenarios when the synthesizer resource is in a different + session than the recognizer resource and, hence, is not aware of an + incoming audio source. In these cases, it is up to the client to act + + + +Shanmugham, et al. Informational [Page 64] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + as a proxy and turn around and issue the BARGE-IN-OCCURRED method to + the synthesizer resource. The recognizer resource also sends a + unique proxy-sync-id in the header for this event, which is sent to + the synthesizer in the BARGE-IN-OCCURRED method to the synthesizer. + + This event should be generated irrespective of whether the + synthesizer and recognizer are in the same media server or not. + +8.13. RECOGNITION-START-TIMERS + + This request is sent from the client to the recognition resource when + it knows that a kill-on-barge-in prompt has finished playing. This + is useful in the scenario when the recognition and synthesizer + engines are not in the same session. 
Here, when a kill-on-barge-in + prompt is being played, you want the RECOGNIZE request to be + simultaneously active so that it can detect and implement kill-on- + barge-in. But at the same time, you don't want the recognizer to + start the no-input timers until the prompt is finished. The + parameter recognizer-start-timers header field in the RECOGNIZE + request will allow the client to say if the timers should be started + or not. The recognizer should not start the timers until the client + sends a RECOGNITION-START-TIMERS method to the recognizer. + +8.14. RECOGNITON-COMPLETE + + This is an Event from the recognizer resource to the client + indicating that the recognition completed. The recognition result is + sent in the MRCP body of the message. The request-state field MUST + be COMPLETE indicating that this is the last event with that + request-id, and that the request with that request-id is now + complete. The recognizer context still holds the results and the + audio waveform input of that recognition until the next RECOGNIZE + request is issued. A URL to the audio waveform MAY BE returned to + the client in a waveform-url header field in the RECOGNITION-COMPLETE + event. The client can use this URI to retrieve or playback the + audio. + + Example: + C->S:RECOGNIZE 543257 MRCP/1.0 + Confidence-Threshold:90 + Content-Type:application/grammar+xml + Content-Id:request1@form-level.store + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + + +Shanmugham, et al. Informational [Page 65] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + + <!-- single language attachment to tokens --> + <rule id="yes"> + <one-of> + <item xml:lang="fr-CA">oui</item> + <item xml:lang="en-US">yes</item> + </one-of> + </rule> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + may I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + </grammar> + + S->C:MRCP/1.0 543257 200 IN-PROGRESS + + S->C:START-OF-SPEECH 543257 IN-PROGRESS MRCP/1.0 + + S->C:RECOGNITION-COMPLETE 543257 COMPLETE MRCP/1.0 + Completion-Cause:000 success + Waveform-URL:http://web.media.com/session123/audio.wav + Content-Type:application/x-nlsml + Content-Length:276 + + <?xml version="1.0"?> + <result x-model="http://IdentityModel" + xmlns:xf="http://www.w3.org/2000/xforms" + grammar="session:request1@form-level.store"> + <interpretation> + <xf:instance name="Person"> + <Person> + <Name> Andre Roy </Name> + </Person> + </xf:instance> + <input> may I speak to Andre Roy </input> + </interpretation> + </result> + + + + + + + + +Shanmugham, et al. Informational [Page 66] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +8.15. DTMF Detection + + Digits received as DTMF tones will be delivered to the automatic + speech recognition (ASR) engine in the RTP stream according to RFC + 2833 [15]. The automatic speech recognizer (ASR) needs to support + RFC 2833 [15] to recognize digits. If it does not support RFC 2833 + [15], it will have to process the audio stream and extract the audio + tones from it. + +9. Future Study + + Various sections of the recognizer could be distributed into Digital + Signal Processors (DSPs) on the Voice Browser/Gateway or IP Phones. 
+ For instance, the gateway might perform voice activity detection to + reduce network bandwidth and CPU requirement of the automatic speech + recognition (ASR) server. Such extensions are deferred for further + study and will not be addressed in this document. + +10. Security Considerations + + The MRCP protocol may carry sensitive information such as account + numbers, passwords, etc. For this reason it is important that the + client have the option of secure communication with the server for + both the control messages as well as the media, though the client is + not required to use it. If all MRCP communications happens in a + trusted domain behind a firewall, this may not be necessary. If the + client or server is deployed in an insecure network, communication + happening across this insecure network needs to be protected. In + such cases, the following additional security functionality MUST be + supported on the MRCP server. MRCP servers MUST implement Transport + Layer Security (TLS) to secure the RTSP communication, i.e., the RTSP + stack SHOULD support the rtsps: URI form. MRCP servers MUST support + Secure Real-Time Transport Protocol (SRTP) as an option to send and + receive media. + +11. RTSP-Based Examples + + The following is an example of a typical session of speech synthesis + and recognition between a client and the server. + + Opening the synthesizer. This is the first resource for this + session. The server and client agree on a single Session ID 12345678 + and set of RTP/RTCP ports on both sides. + + C->S:SETUP rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:2 + Transport:RTP/AVP;unicast;client_port=46456-46457 + Content-Type:application/sdp + + + +Shanmugham, et al. Informational [Page 67] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Content-Length:190 + + v=0 + o=- 123 456 IN IP4 10.0.0.1 + s=Media Server + p=+1-888-555-1212 + c=IN IP4 0.0.0.0 + t=0 0 + m=audio 0 RTP/AVP 0 96 + a=rtpmap:0 pcmu/8000 + a=rtpmap:96 telephone-event/8000 + a=fmtp:96 0-15 + + S->C:RTSP/1.0 200 OK + CSeq:2 + Transport:RTP/AVP;unicast;client_port=46456-46457; + server_port=46460-46461 + Session:12345678 + Content-Length:190 + Content-Type:application/sdp + + v=0 + o=- 3211724219 3211724219 IN IP4 10.3.2.88 + s=Media Server + c=IN IP4 0.0.0.0 + t=0 0 + m=audio 46460 RTP/AVP 0 96 + a=rtpmap:0 pcmu/8000 + a=rtpmap:96 telephone-event/8000 + a=fmtp:96 0-15 + + Opening a recognizer resource. Uses the existing session ID and + ports. + + C->S:SETUP rtsp://media.server.com/media/recognizer RTSP/1.0 + CSeq:3 + Transport:RTP/AVP;unicast;client_port=46456-46457; + mode=record;ttl=127 + Session:12345678 + + S->C:RTSP/1.0 200 OK + CSeq:3 + Transport:RTP/AVP;unicast;client_port=46456-46457; + server_port=46460-46461;mode=record;ttl=127 + Session:12345678 + + + + + + +Shanmugham, et al. Informational [Page 68] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + An ANNOUNCE message with the MRCP SPEAK request initiates speech. 
+ + C->S:ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:4 + Session:12345678 + Content-Type:application/mrcp + Content-Length:456 + + SPEAK 543257 MRCP/1.0 + Kill-On-Barge-In:false + Voice-gender:neutral + Voice-category:teenager + Prosody-volume:medium + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>You have 4 new messages.</sentence> + <sentence>The first is from <say-as + type="name">Stephanie Williams</say-as> <mark + name="Stephanie"/> + and arrived at <break/> + <say-as type="time">3:45pm</say-as>.</sentence> + + <sentence>The subject is <prosody + rate="-20%">ski trip</prosody></sentence> + </paragraph> + </speak> + + S->C:RTSP/1.0 200 OK + CSeq:4 + Session:12345678 + RTP-Info:url=rtsp://media.server.com/media/synthesizer; + seq=9810092;rtptime=3450012 + Content-Type:application/mrcp + Content-Length:456 + + MRCP/1.0 543257 200 IN-PROGRESS + + + The synthesizer hits the special marker in the message to be spoken + and faithfully informs the client of the event. + + S->C:ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:5 + Session:12345678 + + + +Shanmugham, et al. Informational [Page 69] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Content-Type:application/mrcp + Content-Length:123 + + SPEECH-MARKER 543257 IN-PROGRESS MRCP/1.0 + Speech-Marker:Stephanie + C->S:RTSP/1.0 200 OK + CSeq:5 + + The synthesizer finishes with the SPEAK request. + + S->C:ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:6 + Session:12345678 + Content-Type:application/mrcp + Content-Length:123 + + SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0 + + + C->S:RTSP/1.0 200 OK + CSeq:6 + + The recognizer is issued a request to listen for the customer + choices. + + C->S:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0 + CSeq:7 + Session:12345678 + + RECOGNIZE 543258 MRCP/1.0 + Content-Type:application/grammar+xml + Content-Length:104 + + <?xml version="1.0"?> + + <!-- the default grammar language is US English --> + <grammar xml:lang="en-US" version="1.0"> + + <!-- single language attachment to a rule expansion --> + <rule id="request"> + Can I speak to + <one-of xml:lang="fr-CA"> + <item>Michel Tremblay</item> + <item>Andre Roy</item> + </one-of> + </rule> + + </grammar> + + + +Shanmugham, et al. Informational [Page 70] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + S->C:RTSP/1.0 200 OK + CSeq:7 + Content-Type:application/mrcp + Content-Length:123 + + MRCP/1.0 543258 200 IN-PROGRESS + + The client issues the next MRCP SPEAK method in an ANNOUNCE message, + asking the user the question. It is generally RECOMMENDED when + playing a prompt to the user with kill-on-barge-in and asking for + input, that the client issue the RECOGNIZE request ahead of the SPEAK + request for optimum performance and user experience. This way, it is + guaranteed that the recognizer is online before the prompt starts + playing and the user's speech will not be truncated at the beginning + (especially for power users). 
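+
+   When a kill-on-barge-in prompt is played while the RECOGNIZE request
+   is already active, as recommended above, the client will usually
+   also want the recognizer to hold its no-input timer until the prompt
+   finishes, using the mechanism described under RECOGNITION-START-
+   TIMERS.  A minimal, non-normative sketch of that variation follows;
+   the request-ids are illustrative, other headers and the grammar body
+   are omitted, and each MRCP message would be carried in an ANNOUNCE
+   request as elsewhere in this example.
+
+   C->S:RECOGNIZE 543260 MRCP/1.0
+        Recognizer-Start-Timers:false
+        Content-Type:application/grammar+xml
+
+        (grammar omitted)
+
+   S->C:MRCP/1.0 543260 200 IN-PROGRESS
+
+   (the kill-on-barge-in SPEAK request is issued next and completes)
+
+   C->S:RECOGNITION-START-TIMERS 543261 MRCP/1.0
+
+   S->C:MRCP/1.0 543261 200 COMPLETE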
+ + C->S:ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:8 Session:12345678 Content-Type:application/mrcp + Content-Length:733 + + SPEAK 543259 MRCP/1.0 + Kill-On-Barge-In:true + Content-Type:application/synthesis+ssml + Content-Length:104 + + <?xml version="1.0"?> + <speak> + <paragraph> + <sentence>Welcome to ABC corporation.</sentence> + <sentence>Who would you like Talk to.</sentence> + </paragraph> + </speak> + + S->C:RTSP/1.0 200 OK + CSeq:8 + Content-Type:application/mrcp + Content-Length:123 + + MRCP/1.0 543259 200 IN-PROGRESS + + Since the last SPEAK request had Kill-On-Barge-In set to "true", the + message synthesizer is interrupted when the user starts speaking, and + the client is notified. + + Now, since the recognition and synthesizer resources are in the same + session, they worked with each other to deliver kill-on-barge-in. If + the resources were in different sessions, it would have taken a few + more messages before the client got the SPEAK-COMPLETE event from the + + + +Shanmugham, et al. Informational [Page 71] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + synthesizer resource. Whether the synthesizer and recognizer are in + the same session or not, the recognizer MUST generate the START-OF- + SPEECH event to the client. + + The client should have then blindly turned around and issued a + BARGE-IN-OCCURRED method to the synthesizer resource. The + synthesizer, if kill-on-barge-in was enabled on the current SPEAK + request, would have then interrupted it and issued SPEAK-COMPLETE + event to the client. In this example, since the synthesizer and + recognizer are in the same session, the client did not issue the + BARGE-IN-OCCURRED method to the synthesizer and assumed that kill- + on-barge-in was implemented between the two resources in the same + session and worked. + + The completion-cause code differentiates if this is normal completion + or a kill-on-barge-in interruption. + + S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0 + CSeq:9 + Session:12345678 + Content-Type:application/mrcp + Content-Length:273 + + START-OF-SPEECH 543258 IN-PROGRESS MRCP/1.0 + + C->S:RTSP/1.0 200 OK + CSeq:9 + + S->C:ANNOUNCE rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:10 + Session:12345678 + Content-Type:application/mrcp + Content-Length:273 + + SPEAK-COMPLETE 543259 COMPLETE MRCP/1.0 + Completion-Cause:000 normal + + C->S:RTSP/1.0 200 OK + CSeq:10 + + The recognition resource matched the spoken stream to a grammar and + generated results. The result of the recognition is returned by the + server as part of the RECOGNITION-COMPLETE event. + + S->C:ANNOUNCE rtsp://media.server.com/media/recognizer RTSP/1.0 + CSeq:11 + Session:12345678 + Content-Type:application/mrcp + + + +Shanmugham, et al. 
Informational [Page 72] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + Content-Length:733 + + RECOGNITION-COMPLETE 543258 COMPLETE MRCP/1.0 + Completion-Cause:000 success + Waveform-URL:http://web.media.com/session123/audio.wav + Content-Type:application/x-nlsml + Content-Length:104 + + <?xml version="1.0"?> + <result x-model="http://IdentityModel" + xmlns:xf="http://www.w3.org/2000/xforms" + grammar="session:request1@form-level.store"> + <interpretation> + <xf:instance name="Person"> + <Person> + <Name> Andre Roy </Name> + </Person> + </xf:instance> + <input> may I speak to Andre Roy </input> + </interpretation> + </result> + + C->S:RTSP/1.0 200 OK + CSeq:11 + + C->S:TEARDOWN rtsp://media.server.com/media/synthesizer RTSP/1.0 + CSeq:12 + Session:12345678 + + S->C:RTSP/1.0 200 OK + CSeq:12 + + We are done with the resources and are tearing them down. When the + last of the resources for this session are released, the Session-ID + and the RTP/RTCP ports are also released. + + C->S:TEARDOWN rtsp://media.server.com/media/recognizer RTSP/1.0 + CSeq:13 + Session:12345678 + + S->C:RTSP/1.0 200 OK + CSeq:13 + + + + + + + + + +Shanmugham, et al. Informational [Page 73] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +12. Informative References + + [1] Fielding, R., Gettys, J., Mogul, J., Frystyk. H., Masinter, L., + Leach, P., and T. Berners-Lee, "Hypertext transfer protocol -- + HTTP/1.1", RFC 2616, June 1999. + + [2] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming + Protocol (RTSP)", RFC 2326, April 1998 + + [3] Crocker, D. and P. Overell, "Augmented BNF for Syntax + Specifications: ABNF", RFC 4234, October 2005. + + [4] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., + Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: + Session Initiation Protocol", RFC 3261, June 2002. + + [5] Handley, M. and V. Jacobson, "SDP: Session Description + Protocol", RFC 2327, April 1998. + + [6] World Wide Web Consortium, "Voice Extensible Markup Language + (VoiceXML) Version 2.0", W3C Candidate Recommendation, March + 2004. + + [7] Resnick, P., "Internet Message Format", RFC 2822, April 2001. + + [8] Bradner, S., "Key words for use in RFCs to Indicate Requirement + Levels", BCP 14, RFC 2119, March 1997. + + [9] World Wide Web Consortium, "Speech Synthesis Markup Language + (SSML) Version 1.0", W3C Candidate Recommendation, September + 2004. + + [10] World Wide Web Consortium, "Natural Language Semantics Markup + Language (NLSML) for the Speech Interface Framework", W3C + Working Draft, 30 May 2001. + + [11] World Wide Web Consortium, "Speech Recognition Grammar + Specification Version 1.0", W3C Candidate Recommendation, March + 2004. + + [12] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD + 63, RFC 3629, November 2003. + + [13] Freed, N. and N. Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part Two: Media Types", RFC 2046, November + 1996. + + + + + +Shanmugham, et al. Informational [Page 74] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + [14] Levinson, E., "Content-ID and Message-ID Uniform Resource + Locators", RFC 2392, August 1998. + + [15] Schulzrinne, H. and S. Petrack, "RTP Payload for DTMF Digits, + Telephony Tones and Telephony Signals", RFC 2833, May 2000. + + [16] Alvestrand, H., "Tags for the Identification of Languages", BCP + 47, RFC 3066, January 2001. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Shanmugham, et al. 
Informational [Page 75] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +Appendix A. ABNF Message Definitions + + ALPHA = %x41-5A / %x61-7A ; A-Z / a-z + + CHAR = %x01-7F ; any 7-bit US-ASCII character, + ; excluding NUL + + CR = %x0D ; carriage return + + CRLF = CR LF ; Internet standard newline + + DIGIT = %x30-39 ; 0-9 + + DQUOTE = %x22 ; " (Double Quote) + + HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F" + + HTAB = %x09 ; horizontal tab + + LF = %x0A ; linefeed + + OCTET = %x00-FF ; 8 bits of data + + SP = %x20 ; space + + WSP = SP / HTAB ; white space + + LWS = [*WSP CRLF] 1*WSP ; linear whitespace + + SWS = [LWS] ; sep whitespace + + UTF8-NONASCII = %xC0-DF 1UTF8-CONT + / %xE0-EF 2UTF8-CONT + / %xF0-F7 3UTF8-CONT + / %xF8-Fb 4UTF8-CONT + / %xFC-FD 5UTF8-CONT + + UTF8-CONT = %x80-BF + + param = *pchar + + quoted-string = SWS DQUOTE *(qdtext / quoted-pair ) + DQUOTE + + qdtext = LWS / %x21 / %x23-5B / %x5D-7E + / UTF8-NONASCII + + + + + +Shanmugham, et al. Informational [Page 76] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + quoted-pair = "\" (%x00-09 / %x0B-0C + / %x0E-7F) + + token = 1*(alphanum / "-" / "." / "!" / "%" / "*" + / "_" / "+" / "`" / "'" / "~" ) + + reserved = ";" / "/" / "?" / ":" / "@" / "&" / "=" + / "+" / "$" / "," + + mark = "-" / "_" / "." / "!" / "~" / "*" / "'" + / "(" / ")" + + unreserved = alphanum / mark + + char = unreserved / escaped / + ":" / "@" / "&" / "=" / "+" / "$" / "," + + alphanum = ALPHA / DIGIT + + escaped = "%" HEXDIG HEXDIG + + absoluteURI = scheme ":" ( hier-part / opaque-part ) + + relativeURI = ( net-path / abs-path / rel-path ) + [ "?" query ] + + hier-part = ( net-path / abs-path ) [ "?" query ] + + net-path = "//" authority [ abs-path ] + + abs-path = "/" path-segments + + rel-path = rel-segment [ abs-path ] + + rel-segment = 1*( unreserved / escaped / ";" / "@" + / "&" / "=" / "+" / "$" / "," ) + + opaque-part = uric-no-slash *uric + + uric = reserved / unreserved / escaped + + uric-no-slash = unreserved / escaped / ";" / "?" / ":" + / "@" / "&" / "=" / "+" / "$" / "," + + path-segments = segment *( "/" segment ) + + segment = *pchar *( ";" param ) + + + + +Shanmugham, et al. Informational [Page 77] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) + + authority = srvr / reg-name + + srvr = [ [ userinfo "@" ] hostport ] + + reg-name = 1*( unreserved / escaped / "$" / "," + / ";" / ":" / "@" / "&" / "=" / "+" ) + + query = *uric + + userinfo = ( user ) [ ":" password ] "@" + + user = 1*( unreserved / escaped + / user-unreserved ) + + user-unreserved = "&" / "=" / "+" / "$" / "," / ";" + / "?" / "/" + + password = *( unreserved / escaped / + "&" / "=" / "+" / "$" / "," ) + + hostport = host [ ":" port ] + + host = hostname / IPv4address / IPv6reference + + hostname = *( domainlabel "." ) toplabel [ "." ] + + domainlabel = alphanum + / alphanum *( alphanum / "-" ) alphanum + + toplabel = ALPHA / ALPHA *( alphanum / "-" ) + alphanum + + IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." + 1*3DIGIT + + IPv6reference = "[" IPv6address "]" + + IPv6address = hexpart [ ":" IPv4address ] + + hexpart = hexseq / hexseq "::" [ hexseq ] / "::" + [ hexseq ] + + hexseq = hex4 *( ":" hex4) + + hex4 = 1*4HEXDIG + + + + +Shanmugham, et al. 
Informational [Page 78] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + port = 1*DIGIT + + generic-message = start-line + message-header + CRLF + [ message-body ] + + message-body = *OCTET + + start-line = request-line / status-line / event-line + + request-line = method-name SP request-id SP + mrcp-version CRLF + + status-line = mrcp-version SP request-id SP + status-code SP request-state CRLF + + event-line = event-name SP request-id SP + request-state SP mrcp-version CRLF + + message-header = 1*(generic-header / resource-header) + + generic-header = active-request-id-list + / proxy-sync-id + / content-id + / content-type + / content-length + / content-base + / content-location + / content-encoding + / cache-control + / logging-tag + ; -- content-id is as defined in RFC 2392 and RFC 2046 + + mrcp-version = "MRCP" "/" 1*DIGIT "." 1*DIGIT + + request-id = 1*DIGIT + + status-code = 1*DIGIT + + active-request-id-list = "Active-Request-Id-List" ":" + request-id *("," request-id) CRLF + + proxy-sync-id = "Proxy-Sync-Id" ":" 1*ALPHA CRLF + + content-length = "Content-Length" ":" 1*DIGIT CRLF + + content-base = "Content-Base" ":" absoluteURI CRLF + + + +Shanmugham, et al. Informational [Page 79] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + content-type = "Content-Type" ":" media-type + + media-type = type "/" subtype *( ";" parameter ) + + type = token + + subtype = token + + parameter = attribute "=" value + + attribute = token + + value = token / quoted-string + + content-encoding = "Content-Encoding" ":" + *WSP content-coding + *(*WSP "," *WSP content-coding *WSP ) + CRLF + + content-coding = token + + + content-location = "Content-Location" ":" + ( absoluteURI / relativeURI ) CRLF + + cache-control = "Cache-Control" ":" + *WSP cache-directive + *( *WSP "," *WSP cache-directive *WSP ) + CRLF + + cache-directive = "max-age" "=" delta-seconds + / "max-stale" "=" delta-seconds + / "min-fresh" "=" delta-seconds + + logging-tag = "Logging-Tag" ":" 1*ALPHA CRLF + + + resource-header = recognizer-header + / synthesizer-header + + method-name = synthesizer-method + / recognizer-method + + event-name = synthesizer-event + / recognizer-event + + + + + + +Shanmugham, et al. Informational [Page 80] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + request-state = "COMPLETE" + / "IN-PROGRESS" + / "PENDING" + + synthesizer-method = "SET-PARAMS" + / "GET-PARAMS" + / "SPEAK" + / "STOP" + / "PAUSE" + / "RESUME" + / "BARGE-IN-OCCURRED" + / "CONTROL" + + synthesizer-event = "SPEECH-MARKER" + / "SPEAK-COMPLETE" + + synthesizer-header = jump-target + / kill-on-barge-in + / speaker-profile + / completion-cause + / voice-parameter + / prosody-parameter + / vendor-specific + / speech-marker + / speech-language + / fetch-hint + / audio-fetch-hint + / fetch-timeout + / failed-uri + / failed-uri-cause + / speak-restart + / speak-length + + recognizer-method = "SET-PARAMS" + / "GET-PARAMS" + / "DEFINE-GRAMMAR" + / "RECOGNIZE" + / "GET-RESULT" + / "RECOGNITION-START-TIMERS" + / "STOP" + + recognizer-event = "START-OF-SPEECH" + / "RECOGNITION-COMPLETE" + + recognizer-header = confidence-threshold + / sensitivity-level + / speed-vs-accuracy + / n-best-list-length + + + +Shanmugham, et al. 
Informational [Page 81] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + / no-input-timeout + / recognition-timeout + / waveform-url + / completion-cause + / recognizer-context-block + / recognizer-start-timers + / vendor-specific + / speech-complete-timeout + / speech-incomplete-timeout + / dtmf-interdigit-timeout + / dtmf-term-timeout + / dtmf-term-char + / fetch-timeout + / failed-uri + / failed-uri-cause + / save-waveform + / new-audio-channel + / speech-language + + jump-target = "Jump-Size" ":" speech-length-value CRLF + + speech-length-value = numeric-speech-length + / text-speech-length + + text-speech-length = 1*ALPHA SP "Tag" + + numeric-speech-length =("+" / "-") 1*DIGIT SP + numeric-speech-unit + + numeric-speech-unit = "Second" + / "Word" + / "Sentence" + / "Paragraph" + + delta-seconds = 1*DIGIT + + kill-on-barge-in = "Kill-On-Barge-In" ":" boolean-value CRLF + + boolean-value = "true" / "false" + + speaker-profile = "Speaker-Profile" ":" absoluteURI CRLF + + completion-cause = "Completion-Cause" ":" 1*DIGIT SP + 1*ALPHA CRLF + + voice-parameter = "Voice-" voice-param-name ":" + voice-param-value CRLF + + + + +Shanmugham, et al. Informational [Page 82] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + voice-param-name = 1*ALPHA + + voice-param-value = 1*alphanum + + prosody-parameter = "Prosody-" prosody-param-name ":" + prosody-param-value CRLF + + prosody-param-name = 1*ALPHA + + prosody-param-value = 1*alphanum + + vendor-specific = "Vendor-Specific-Parameters" ":" + vendor-specific-av-pair + *[";" vendor-specific-av-pair] CRLF + + vendor-specific-av-pair = vendor-av-pair-name "=" + vendor-av-pair-value + + vendor-av-pair-name = 1*ALPHA + + vendor-av-pair-value = 1*alphanum + + speech-marker = "Speech-Marker" ":" 1*ALPHA CRLF + + speech-language = "Speech-Language" ":" 1*ALPHA CRLF + + fetch-hint = "Fetch-Hint" ":" 1*ALPHA CRLF + + audio-fetch-hint = "Audio-Fetch-Hint" ":" 1*ALPHA CRLF + + fetch-timeout = "Fetch-Timeout" ":" 1*DIGIT CRLF + + failed-uri = "Failed-URI" ":" absoluteURI CRLF + + failed-uri-cause = "Failed-URI-Cause" ":" 1*ALPHA CRLF + + speak-restart = "Speak-Restart" ":" boolean-value CRLF + + speak-length = "Speak-Length" ":" speech-length-value + CRLF + confidence-threshold = "Confidence-Threshold" ":" + 1*DIGIT CRLF + + sensitivity-level = "Sensitivity-Level" ":" 1*DIGIT CRLF + + speed-vs-accuracy = "Speed-Vs-Accuracy" ":" 1*DIGIT CRLF + + n-best-list-length = "N-Best-List-Length" ":" 1*DIGIT CRLF + + + +Shanmugham, et al. Informational [Page 83] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + + no-input-timeout = "No-Input-Timeout" ":" 1*DIGIT CRLF + + recognition-timeout = "Recognition-Timeout" ":" 1*DIGIT CRLF + + waveform-url = "Waveform-URL" ":" absoluteURI CRLF + + recognizer-context-block = "Recognizer-Context-Block" ":" + 1*ALPHA CRLF + + recognizer-start-timers = "Recognizer-Start-Timers" ":" + boolean-value CRLF + + speech-complete-timeout = "Speech-Complete-Timeout" ":" + 1*DIGIT CRLF + + speech-incomplete-timeout = "Speech-Incomplete-Timeout" ":" + 1*DIGIT CRLF + + dtmf-interdigit-timeout = "DTMF-Interdigit-Timeout" ":" + 1*DIGIT CRLF + + dtmf-term-timeout = "DTMF-Term-Timeout" ":" 1*DIGIT CRLF + + dtmf-term-char = "DTMF-Term-Char" ":" CHAR CRLF + + save-waveform = "Save-Waveform" ":" boolean-value CRLF + + new-audio-channel = "New-Audio-Channel" ":" + boolean-value CRLF + +Appendix B. 
Acknowledgements + + Andre Gillet (Nuance Communications) + Andrew Hunt (SpeechWorks) + Aaron Kneiss (SpeechWorks) + Kristian Finlator (SpeechWorks) + Martin Dragomirecky (Cisco Systems, Inc.) + Pierre Forgues (Nuance Communications) + Suresh Kaliannan (Cisco Systems, Inc.) + Corey Stohs (Cisco Systems, Inc.) + Dan Burnett (Nuance Communications) + + + + + + + + + + +Shanmugham, et al. Informational [Page 84] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +Authors' Addresses + + Saravanan Shanmugham + Cisco Systems, Inc. + 170 W. Tasman Drive + San Jose, CA 95134 + + EMail: sarvi@cisco.com + + + Peter Monaco + Nuasis Corporation + 303 Bryant St. + Mountain View, CA 94041 + + EMail: peter.monaco@nuasis.com + + + Brian Eberman + Speechworks, Inc. + 695 Atlantic Avenue + Boston, MA 02111 + + EMail: brian.eberman@speechworks.com + + + + + + + + + + + + + + + + + + + + + + + + + + + +Shanmugham, et al. Informational [Page 85] + +RFC 4463 MRCP by Cisco, Nuance, and Speechworks April 2006 + + +Full Copyright Statement + + Copyright (C) The Internet Society (2006). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78 and at www.rfc-editor.org/copyright.html, and + except as set forth therein, the authors retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET + ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, + INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE + INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the procedures with respect to rights in RFC documents can be + found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at + ietf-ipr@ietf.org. + +Acknowledgement + + Funding for the RFC Editor function is provided by the IETF + Administrative Support Activity (IASA). + + + + + + + +Shanmugham, et al. Informational [Page 86] + |