From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001 From: Thomas Voss Date: Wed, 27 Nov 2024 20:54:24 +0100 Subject: doc: Add RFC documents --- doc/rfc/rfc5707.txt | 10307 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 10307 insertions(+) create mode 100644 doc/rfc/rfc5707.txt (limited to 'doc/rfc/rfc5707.txt') diff --git a/doc/rfc/rfc5707.txt b/doc/rfc/rfc5707.txt new file mode 100644 index 0000000..6265c1d --- /dev/null +++ b/doc/rfc/rfc5707.txt @@ -0,0 +1,10307 @@ + + + + + + +Independent Submission A. Saleem +Request for Comments: 5707 Y. Xin +Category: Informational RadiSys +ISSN: 2070-1721 G. Sharratt + Consultant + February 2010 + + + Media Server Markup Language (MSML) + +Abstract + + The Media Server Markup Language (MSML) is used to control and invoke + many different types of services on IP media servers. The MSML + control interface was initially driven by RadiSys with subsequent + significant contributions from Intel, Dialogic, and others in the + industry. Clients can use it to define how multimedia sessions + interact on a media server and to apply services to individuals or + groups of users. MSML can be used, for example, to control media + server conferencing features such as video layout and audio mixing, + create sidebar conferences or personal mixes, and set the properties + of media streams. As well, clients can use MSML to define media + processing dialogs, which may be used as parts of application + interactions with users or conferences. Transformation of media + streams to and from users or conferences as well as interactive voice + response (IVR) dialogs are examples of such interactions, which are + specified using MSML. MSML clients may also invoke dialogs with + individual users or with groups of conference participants using + VoiceXML. + +Status of This Memo + + This document is not an Internet Standards Track specification; it is + published for informational purposes. 
+ + This is a contribution to the RFC Series, independently of any other + RFC stream. The RFC Editor has chosen to publish this document at + its discretion and makes no statement about its value for + implementation or deployment. Documents approved for publication by + the RFC Editor are not a candidate for any level of Internet + Standard; see Section 2 of RFC 5741. + + Information about the current status of this document, any errata, + and how to provide feedback on it may be obtained at + http://www.rfc-editor.org/info/rfc5707. + + + + + + +Saleem, et al. Informational [Page 1] + +RFC 5707 Media Server Markup Language February 2010 + + +IESG Note + + This RFC is not a candidate for any level of Internet Standard. The + IETF disclaims any knowledge of the fitness of this RFC for any + purpose and in particular notes that the decision to publish is not + based on IETF review for such things as security, congestion control, + or inappropriate interaction with deployed protocols. The RFC Editor + has chosen to publish this document at its discretion. Readers of + this document should exercise caution in evaluating its value for + implementation and deployment. See RFC 3932 for more information. + +Copyright Notice + + Copyright (c) 2010 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. + +Table of Contents + + 1. Introduction ....................................................4 + 2. Glossary ........................................................5 + 3. MSML SIP Usage ..................................................6 + 3.1. SIP INFO ...................................................7 + 3.2. 
SIP Control Framework ......................................8 + 4. Language Structure .............................................15 + 4.1. Package Scheme ............................................15 + 4.2. Profile Scheme ............................................18 + 5. Execution Flow .................................................19 + 6. Media Server Object Model ......................................21 + 6.1. Objects ...................................................21 + 6.2. Identifiers ...............................................23 + 7. MSML Core Package ..............................................26 + 7.1. <msml> ....................................................26 + 7.2. <send> ....................................................26 + 7.3. <result> ..................................................27 + 7.4. <event> ...................................................27 + 8. MSML Conference Core Package ...................................28 + 8.1. Conferences ...............................................28 + 8.2. Media Streams .............................................29 + 8.3. <createconference> ........................................31 + 8.4. <modifyconference> ........................................33 + 8.5. <destroyconference> .......................................34 + + + +Saleem, et al. Informational [Page 2] + +RFC 5707 Media Server Markup Language February 2010 + + + 8.6. <audiomix> ................................................35 + 8.7. <videolayout> .............................................36 + 8.8. <join> ....................................................43 + 8.9. <modifystream> ............................................45 + 8.10. <unjoin> .................................................46 + 8.11. <monitor> ................................................47 + 8.12. <stream> .................................................47 + 9. MSML Dialog Packages ...........................................51 + 9.1. Overview ..................................................51 + 9.2. Primitives ................................................53 + 9.3. 
Events ....................................................55 + 9.4. MSML Dialog Usage with SIP ................................56 + 9.5. MSML Dialog Structure and Modularity ......................57 + 9.6. MSML Dialog Core Package ..................................58 + 9.7. MSML Dialog Base Package ..................................63 + 9.8. MSML Dialog Group Package .................................81 + 9.9. MSML Dialog Transform Package .............................85 + 9.10. MSML Dialog Speech Package ...............................88 + 9.11. MSML Dialog Fax Detection Package ........................92 + 9.12. MSML Dialog Fax Send/Receive Package .....................93 + 10. MSML Audit Package ...........................................100 + 10.1. MSML Audit Core Package .................................100 + 10.2. MSML Audit Conference Package ...........................102 + 10.3. MSML Audit Connection Package ...........................106 + 10.4. MSML Audit Dialog Package ...............................108 + 10.5. MSML Audit Stream Package ...............................110 + 11. Response Codes ...............................................111 + 12. MSML Conference Examples .....................................113 + 12.1. Establishing a Dial-In Conference .......................113 + 12.2. Example of a Sidebar Audio Conference ...................117 + 12.3. Example of Removing a Conference ........................118 + 12.4. Example of Modifying Video Layout .......................118 + 13. MSML Dialog Examples .........................................120 + 13.1. Announcement ............................................120 + 13.2. Voice Mail Retrieval ....................................120 + 13.3. Play and Record .........................................122 + 13.4. Speech Recognition ......................................125 + 13.5. Play and Collect ........................................125 + 13.6. 
User Controlled Gain ....................................128 + 14. MSML Audit Examples ..........................................128 + 14.1. Audit All Conferences ...................................128 + 14.2. Audit Conference Dialogs ................................129 + 14.3. Audit Conference Streams ................................130 + 14.4. Audit All Connections ...................................131 + 14.5. Audit Connection Dialogs ................................131 + 14.6. Audit Connection Streams ................................132 + 14.7. Audit Connection with Selective States ..................133 + 15. Future Work ..................................................134 + + + +Saleem, et al. Informational [Page 3] + +RFC 5707 Media Server Markup Language February 2010 + + + 16. XML Schema ...................................................134 + 16.1. MSML Core ...............................................136 + 16.2. MSML Conference Core Package ............................140 + 16.3. MSML Dialog Packages ....................................148 + 16.4. MSML Audit Packages .....................................170 + 17. Security Considerations ......................................176 + 18. IANA Considerations ..........................................176 + 18.1. IANA Registrations for 'application' MIME Media Type ....176 + 18.2. IANA Registrations for 'text' MIME Media Type ...........178 + 18.3. URN Sub-Namespace Registration ..........................179 + 18.4. XML Schema Registration .................................180 + 19. References ...................................................181 + 19.1. Normative References ....................................181 + 19.2. Informative References ..................................182 + Acknowledgments ..................................................183 + +1. Introduction + + Media servers contain dynamic pools of media resources. 
Control + agents and other users of media servers (called media server clients) + can define and create many different services based on how they + configure and use those resources. Often, that configuration and the + ways in which those resources interact will be changed dynamically + over the course of a call, to reflect changes in the way that an + application interacts with a user. + + For example, a call may undergo an initial IVR dialog before being + placed into a conference. Calls may be moved from a main conference + to a sidebar conference and then back again. Individual calls may be + directly bridged to create small n-way calls or simple sidebars. + None of these operations changes the SIP [n1] dialog or RTP [i3] + session, yet they do affect the media flow and processing internal + to the media server. + + The Media Server Markup Language (MSML) is an XML [n2] language used + to control the flow of media streams and services applied to media + streams within a media server. It is used to invoke many different + types of services on individual sessions, groups of sessions, and + conferences. MSML allows the creation of conferences, bridging + different sessions together, and bridging sessions into conferences. + + MSML may also be used to create user interaction dialogs and allows + the application of media transforms to media streams. Media + interaction dialogs created using MSML allow construction of IVR + dialog sessions to individual users as well as to groups of users + participating in a conference. Dialogs may also be specified using + other languages, such as VoiceXML [n5], which allow complete + single-party application logic to be executed on the media server. + + + +Saleem, et al. Informational [Page 4] + +RFC 5707 Media Server Markup Language February 2010 + + + MSML is a transport-independent language: it does not rely on any + particular underlying transport mechanism, and its semantics are + independent of transport. 
However, SIP is a typical and commonly + used transport mechanism for MSML, invoked using the SIP URI scheme. + This specification defines the use of MSML dialogs with SIP as the + transport mechanism. + + A network connection may be established with the media server using + SIP. Media received and transmitted on that connection will flow + through different media resources on the media server depending on + the requested service. Basic Network Media Services with SIP [n7] + defines conventions for associating a basic service with a SIP + Request-URI. MSML allows services to be dynamically applied and + changed by a control agent during the lifetime of the SIP dialog. + + MSML has been designed to address the control and manipulation of + media processing operations (e.g., announcement, IVR, play and + record, automatic speech recognition (ASR), text to speech (TTS), + fax, video), as well as control and relationships of media streams + (e.g., simple and advanced conferencing). It provides a general- + purpose media server control architecture. MSML can additionally be + used to invoke other more complex IVR languages such as VoiceXML. + + The MSML control interface has been widely deployed in the industry, + with numerous client-side and server-side implementations, since + 2003. The in-service commercial deployments cover a wide variety of + applications including, but not limited to, IP multimedia + conferencing, network voice services, IVR, IVVR (interactive voice + and video response), and voice/video mail. + +2. Glossary + + Media Server: a general-purpose platform for executing real-time + media processing tasks. This is a logical function that maps either + to a single physical device or to a portion of a physical device. + + Media Server Client: an application that originates MSML requests to + a media server; it is also referred to as a control agent in this + specification. 
+ + Network Connection: a participant that represents the termination on + a media server of one or more RTP [i3] sessions (for example, audio + and video) associated with a call. Network connections are + established and removed using a session establishment protocol such + as SIP. An instance of a network connection is independent of MSML + processing instructions applied to it. + + + + + +Saleem, et al. Informational [Page 5] + +RFC 5707 Media Server Markup Language February 2010 + + + Dialog: an automated IVR participant. Examples of dialogs may be + announcement players, IVR interfaces, or voice recorders. Dialogs + may be defined in MSML or using VoiceXML [n5]. + + Conference: an intermediary function that provides multimedia mixing + and other advanced conferencing services. This specification + currently considers conferences with audio and/or video media types, + but is extensible to other media types. + + Identifier: a name that is used to refer to a specific instance of an + object on the media server, such as a conference or a dialog. + Identifiers are composed of one or more terms where each term + identifies an object class and instance. + + Object: the generic term for a media server entity that terminates, + originates, or processes media. This specification defines four + classes of objects and specifies mechanisms to create them, join them + together, and destroy them. + + Participant Object: an object in a media server that sources original + media in a call and/or receives and terminates media in a call. + + Intermediary Object: an object in a media server that acts on media + within a call for the benefit of the participants. + + Independent Object: an object that can exist on a media server + independent of other objects. + + Operator: an intermediary transformer that modifies or transforms a + media stream. Examples of operators may be audio gain controls, + video scaling, or voice masking. 
MSML defines operators as media + transform objects, which transform media using operations such as + gain control, when applied to media streams. + + Media Stream: a single media flow between two objects. A media + stream has a media type and may be unidirectional or bidirectional. + +3. MSML SIP Usage + + SIP is used to create and modify media sessions with a media server + according to the procedures defined in RFC 3261 [n1]. Often, SIP + third party call control [i4] will be used to create sessions to a + media server on behalf of end users. MSML is used to define and + change the service that a user connected to a media server will + receive. MSML clients are application servers, soft-switches, or + other forms of control agents, and SHOULD have an authorized security + relationship with the media server. MSML itself does not define + authorization mechanisms. + + + +Saleem, et al. Informational [Page 6] + +RFC 5707 Media Server Markup Language February 2010 + + + MSML transactions are originated based upon events that occur in the + application domain. These events may be independent from any media + or user interaction. For example, an application may wish to play an + announcement to a conference warning that its scheduled completion + time is approaching. Applications themselves are structured in many + different ways. Their structure and requirements contribute to their + selection of protocols and languages. To accommodate differing + application needs, MSML has been designed to be neutral to other + languages and independent of the transport used to carry it. + + MSML is purposely designed to be transport independent. In this + release of the specification, SIP INFO [i5] and SIP Control Framework + [i11] have been chosen for transport mechanisms for MSML, as + described in the following sections. + +3.1. SIP INFO + + SIP INVITE and INFO [i5] requests and responses MAY be used to carry + MSML. 
INFO requests allow asynchronous mid-call messages within SIP + with few additional semantics. In addition, there are existing + widely deployed implementations of that method, it aids in initial + developments that are closely coupled with SIP session establishment, + and it allows MSML to be directly associated with user dialogs when + third party call control is used. + + Although INFO is sometimes considered not to be a suitable general- + purpose transport mechanism for messages within SIP, there have been + proposals to make it more acceptable. MSML may evolve to include + other SIP usage and/or to work with other protocols or as a stand- + alone protocol established through SIP, in future releases of this + document. + + MSML supports several models for client interaction. When clients + use 3PCC to establish media sessions on behalf of end users, clients + will have a SIP dialog for each media session. MSML MAY be sent on + these dialogs. However the targets of MSML actions are not inferred + from the session associated with the SIP dialog. The targets of MSML + actions are always explicitly specified using identifiers as + previously defined. + + An application, after interacting with a user, may want to affect + multiple objects within a media server. For example, tones or + messages are often played to a conference when connections are added + or removed. A separate message may also be played to a participant + as they are joined, or to moderators. Explicit identifiers, that is, + not inferred from a transport mechanism, allow these multiple actions + to be easily grouped into a single transaction sent on any SIP + dialog. + + + +Saleem, et al. Informational [Page 7] + +RFC 5707 Media Server Markup Language February 2010 + + + MSML also supports a model of dedicated control associations. This + supports decoupled application architectures where a client can + control media server services without also establishing all of the + media sessions itself. 
Control associations are created using SIP, + but they do not have any associated media session. Although + initially INFO messages will be sent on this SIP dialog, just as with + dialogs associated with media sessions, it is possible that in the + future, the SIP dialog will be used to establish a separate control + session (defined in SDP [n9]) that does not use SIP as the transport + for MSML messages. + + A media server using MSML also sends asynchronous events to a client + using MSML scripts in SIP INFO. Events are sent based on previous + MSML requests and are sent within the SIP dialog on which the MSML + request that caused the event to be generated was received. If this + dialog no longer exists when the event is generated, the event is + discarded. + + Events may be generated during the execution of a dialog created by a + <dialogstart> element. For example, dialogs can send events based on + user input. VoiceXML dialogs, on the other hand, generally interact + with other servers outside of MSML using HTTP. + + An event is also generated when the execution of a dialog terminates, + because of either completion or failure. The exact information + returned is dependent on the dialog language, the capabilities of the + dialog execution environment, and what was requested by the dialog. + Both MSML and VoiceXML [n5] allow information to be returned when + they exit. These events may be sent in a SIP INFO or a SIP BYE. SIP + BYE is used when the dialog itself specifies that the connection + should be disconnected, for example, through the use of the + <disconnect> element. + + Conferences may also generate events based upon their configuration. + An example of this is the notification of the set of active speakers. + +3.2. SIP Control Framework + + The SIP Control Framework [i11] MAY be used as a transport mechanism + for MSML. + + The Control Framework provides a generic approach for establishment + and reporting capabilities of remotely initiated commands. 
The + framework utilizes many functions provided by the Session Initiation + Protocol (SIP) [n1] for the rendezvous and establishment of a + reliable channel for control interactions. Compared to SIP INFO, the + + + + + +Saleem, et al. Informational [Page 8] + +RFC 5707 Media Server Markup Language February 2010 + + + SIP Control Framework is a more general-purpose transport mechanism + and one that is not constrained by limitations of the SIP INFO + mechanism. + + The Control Framework also introduces the concept of a Control + Package, which is an explicit usage of the Control Framework for a + particular interaction set. This specification has already specified + a list of packages for MSML to control the media server in many + aspects, including basic dialog, advanced conferencing, advanced + dialog, and audit service. Each of these packages has a unique + Control Package name assigned in order for MSML to be used with the + Control Framework. + + This section fulfills the mandatory requirement for information that + MUST be specified during the definition of a Control Framework + Package, as detailed in SIP Control Framework [i11]. + +3.2.1. Control Framework Package Names + + The Control Framework [i11] requires a Control Package definition to + specify and register a unique name. + + MSML specification defines Control Package names using a hierarchical + scheme to indicate the inherited relationship across packages. For + example, package "msml-x" is derived from package "msml", and package + "msml-x-y" is derived from package "msml-x". + + The following is a list of Control Package names reserved by the MSML + specification. + + "msml": this Control Package supports MSML Core Package as specified + in section 7. + + "msml-conf": this Control Package supports MSML Conference Core + Package as specified in section 8. + + "msml-dialog": this Control Package supports MSML Dialog Core Package + as specified in section 9.6. 
+ + "msml-dialog-base": this Control Package supports MSML Dialog Base + Package as specified in section 9.7. + + "msml-dialog-group": this Control Package supports MSML Dialog Group + Package as specified in section 9.8. + + "msml-dialog-transform": this Control Package supports MSML Dialog + Transform Package as specified in section 9.9. + + + + +Saleem, et al. Informational [Page 9] + +RFC 5707 Media Server Markup Language February 2010 + + + "msml-dialog-speech": this Control Package supports MSML Dialog + Speech Package as specified in section 9.10. + + "msml-dialog-fax-detect": this Control Package supports MSML Dialog + Fax Detection Package as specified in section 9.11. + + "msml-dialog-fax-sendrecv": this Control Package supports MSML Dialog + Fax Send/Receive Package as specified in section 9.12. + + "msml-audit": this Control Package supports MSML Audit Core Package + as specified in section 10.1. + + "msml-audit-conf": this Control Package supports MSML Audit + Conference Package as specified in section 10.2. + + "msml-audit-conn": this Control Package supports MSML Audit + Connection Package as specified in section 10.3. + + "msml-audit-dialog": this Control Package supports MSML Audit Dialog + Package as specified in section 10.4. + + "msml-audit-stream": this Control Package supports MSML Audit Stream + Package as specified in section 10.5. + + An application server using the Control Framework as transport for + MSML MUST use one or multiple package names, depending on the service + required from the media server. The package name(s) are identified + in the "Control-Packages" SIP header that is present in the SIP + INVITE dialog request that creates the control channel, as specified + in [i11]. The "Control-Packages" value MAY be re-negotiated via the + SIP re-INVITE mechanism. + +3.2.2. 
Control Framework Messages + + The usage of CONTROL, response, and REPORT messages, as defined in + [i11], differs for each Control Package defined in MSML and is + described separately in the following sections. + + MSML Core Package "msml" + + The application server (AS) may send the media server (MS) a + CONTROL message whose body is an MSML request using the + following elements: + + <msml>: the root element that may contain a list of child + elements that request a specific operation. The child elements + are defined in extended packages (e.g., "msml-conf" and "msml- + dialog"). This element is also the root element that contains + an MSML result and event. + + + +Saleem, et al. Informational [Page 10] + +RFC 5707 Media Server Markup Language February 2010 + + + <send>: sends an event to the specified recipient within the + media server. Specific event types are defined within the + extended packages. + + The media server replies with a response message containing an + MSML result using the following elements: + + <result>: reports the results of an MSML transaction. + + The media server MAY send the MSML event to the application + server, in a REPORT or CONTROL message, using the element + <event>. The actual content of the <event> and which Control + Framework message to use are defined within the extended + packages. + + MSML Conference Core Package "msml-conf" + + This package extends the MSML Core Package to define a + framework for creation, manipulation, and deletion of a + conference. + + The AS can send the MS a CONTROL message whose body is an MSML + request that contains one or multiple conference-related + commands. The MS then replies with a response message whose + body is the MSML result, indicating whether or + not the request has been fulfilled. + + During the lifetime of a conference, whenever an event occurs, + the media server MAY send CONTROL messages containing MSML + events to notify the application server. 
The application + server SHOULD reply with a response message with no MSML body + to acknowledge that the event has been received. + + This package does NOT use the REPORT message. + + Dialog Core Package "msml-dialog" + + This package extends the MSML Core Package to define the + structural framework and abstractions for MSML dialogs. + + The application server MAY send CONTROL messages containing an + MSML request using the following elements: + + <dialogstart>: instantiates an MSML media dialog on a connection + or a conference. + + <dialogend>: terminates an MSML dialog. + + + + +Saleem, et al. Informational [Page 11] + +RFC 5707 Media Server Markup Language February 2010 + + + <send>: sends an event and an optional namelist to the dialog, + dialog group, or dialog primitive. + + <exit>: used by the dialog description language to cause the + execution of the MSML dialog to terminate. + + For the <dialogstart> command, the response message MUST + contain an MSML result that indicates that the dialog has been + started successfully. The MSML result MAY contain <dialogid> + to return the dialog identifier, if the identifier was assigned + by the media server. Subsequently, zero or more MSML events + MAY be initiated by the media server in (update) REPORT + messages to report information gathered during the dialog. + Finally, an MSML event "msml.dialog.exit" SHOULD be generated + in a (terminate) REPORT message when the dialog terminates + (e.g., MSML execution of <exit>). + + For the <dialogend> and <send> commands, the response message + contains the final MSML result that indicates that the request + has either been fulfilled or rejected. + + Dialog Base Package "msml-dialog-base" + + This package extends the MSML Dialog Core Package to define a + set of base functionality for MSML dialogs. The extension + defines individual media primitives, including , + , , , and , to be + used as child elements of <dialogstart>. This package does not + change the framework message usage as defined by the MSML + Dialog Core Package. 
+ + Dialog Transform Package "msml-dialog-transform" + + This package extends the MSML Dialog Core Package to define a + set of transform primitives that work as filters on half-duplex + media streams. The extension defines transform primitives, + including , , , , and , + that MAY be used as child elements of <dialogstart>. This + package does not change the framework message usage as defined + by the MSML Dialog Core Package. + + Dialog Group Package "msml-dialog-group" + + This package extends the MSML Dialog Core, Base, and Transform + Packages to define a single control flow construct that + specifies concurrent execution of multiple media primitives. + The extension defines the <group> element that MAY be used as a + child element of <dialogstart> to enclose multiple media + + + +Saleem, et al. Informational [Page 12] + +RFC 5707 Media Server Markup Language February 2010 + + + primitives, such that they can be executed concurrently. This + package does not change the framework message usage as defined + by the MSML Dialog Core Package. + + Dialog Speech Package "msml-dialog-speech" + + This package extends the MSML Dialog Core and Dialog Base Packages + to define functionality that MAY be used for automatic speech + recognition and text to speech. The extension extends the + and the elements. + + For , it defines a new child element to + activate grammars or user input rules associated with speech + recognition. For , it defines a new child element + to initiate the text-to-speech service. + + This package does not change the framework message usage as + defined by the MSML Dialog Core Package. + + Dialog Fax Detection Package "msml-dialog-fax-detect" + + This package extends the MSML Dialog Core Package to define + primitives that provide fax detection service. The extension + defines a primitive <faxdetect> to be used as a child element + of <dialogstart>. This package does not change the framework + message usage as defined by the MSML Dialog Core Package. 
+ + Dialog Fax Send/Receive Package "msml-dialog-fax-sendrecv" + + This package extends the MSML Dialog Core Package to define + primitives that allow a media server to provide fax send or + receive service. The extension defines new primitives + <faxsend> and <faxrcv>, to be used as a child element of + <dialogstart>. This package does not change the framework + message usage as defined by the MSML Dialog Core Package. + + Dialog Audit Core Package "msml-audit" + + This package extends the MSML Core Package to define a + framework for auditing media resource(s) allocated on the media + server. + + This package follows a simple request/response transaction, + allowing the application server to send CONTROL messages + containing MSML <audit> requests. The media server MUST reply + with a response message containing the result. The result is + contained within the <auditresult> element, returning the + queried state information. + + + +Saleem, et al. Informational [Page 13] + +RFC 5707 Media Server Markup Language February 2010 + + + This package does NOT use the REPORT message. + + Dialog Audit Conference Package "msml-audit-conf" + + This package extends the MSML Audit Core Package to define + conference specific states that MAY be queried via the <audit> + command and the corresponding response MUST be returned by the + <auditresult> element. This package does not change the + framework message usage as defined by the MSML Audit Core + Package. + + Dialog Audit Connection Package "msml-audit-conn" + + This package extends the MSML Audit Core Package to define + connection specific states that MAY be queried via the <audit> + command and the corresponding response MUST be returned by the + <auditresult> element. This package does not change the + framework message usage as defined by the MSML Audit Core + Package. + + Dialog Audit Dialog Package "msml-audit-dialog" + + This package extends the MSML Audit Core Package to define + dialog specific states that MAY be queried via the <audit> + command and the corresponding response MUST be returned by the + <auditresult> element. 
This package does not change the + framework message usage as defined by the MSML Audit Core + Package. + + Dialog Audit Stream Package "msml-audit-stream" + + This package extends the MSML Audit Core Package to define + stream specific states that MAY be queried via the + command and the corresponding response MUST be returned by the + element. This package does not change the + framework message usage as defined by the MSML Audit Core + Package. + +3.2.3. Common XML Support + + The XML schema described in [i11] MUST be supported by all Control + Packages defined by MSML. However, the "connection-id" value MUST be + constructed as defined by MSML (i.e., the identifier MUST contain a + local dialog tag only, while the SIP Control Framework [i11] requires + that the "connection-id" contain both local and remote dialog tags). + + + + + + +Saleem, et al. Informational [Page 14] + +RFC 5707 Media Server Markup Language February 2010 + + +3.2.4. Control Message Body + + A valid CONTROL body message MUST conform to the MSML schema, as + included in this specification, for the MSML package(s) used. + +3.2.5. REPORT Message Body + + A valid REPORT body message MUST conform to the MSML schema, as + included in this specification, for the MSML package(s) used. + +4. Language Structure + +4.1. Package Scheme + + The primary mechanism for extending MSML is the "package". A package + is an integrated set of one or more XML schemas that define + additional features and functions via new or extended use of elements + and attributes. Each package, except for those defined in the + current document, is defined in a separate standards document, e.g., + an Internet Draft or an RFC. All packages that extend the base MSML + functionality MUST include references to the MSML base set of schemas + provided in the Internet Drafts. A schema in a package MUST only + extend MSML; that is, it must not alter the existing specification.
+ + A particular MSML script will include references to all the schemas + defining the packages whose elements and attributes it makes use of. + A particular script MUST reference MSML base and optionally extension + package(s). See the IANA Considerations section. + + Each package MUST define its own namespace so that elements or + attributes with the same name in different packages do not conflict. + A script using a particular element or attribute MUST prefix the + namespace name on that element or attribute's name if it is defined + in a package (as opposed to being defined in the base). + + MSML consists of a core package that provides structure without + support for any specific feature set. Additional packages, relying + on the core package, provide functional features. Any combination of + additional packages may be used along with the core package. The + following describes the set of MSML packages defined in this + document. + + + + + + + + + + +Saleem, et al. Informational [Page 15] + +RFC 5707 Media Server Markup Language February 2010 + + + +--------------------------------------------------------+ + | MSML Core | + +--------------------------------------------------------+ + / \ \ + +--------+ +--------+ +-------+ + | Dialog | | Conf | | Audit | + | Core | | Core | | Core | + +--------+ +--------+ +-------+ + ________ \_______________________________________ | + ------------------------------------------------ | + / \ \ \ \ \ | + +------+ +---------+ +------+ +------+ +------+ +-------+ | + |Dialog| |Dialog | |Dialog| |Dialog| |Dialog| |Dialog | | + |Base | |Transform| |Group | |Speech| |Fax | |Fax | | + +------+ +---------+ +------+ +------+ |Detect| |Send/ | | + +------+ |Receive| | + +-------+ | + ________________________| + ------------------------- + / \ \ \ + +-----+ +-----+ +------+ +------+ + |Audit| |Audit| |Audit | |Audit | + |Conf | |Conn | |Dialog| |Stream| + +-----+ +-----+ +------+ +------+ + + + o MSML Core Package (Mandatory) + + 
Describes the minimum base framework that MUST be implemented to + support additional core packages. + + o MSML Conference Core Package (Conditionally Mandatory, for + Conferencing) + + Describes the audio and multimedia basic and advanced conferencing + package that MAY be implemented. + + o MSML Dialog Core Package (Conditionally Mandatory, for Dialogs) + + Describes the dialog core package that MUST be implemented for any + dialog services. However, systems supporting conferencing only, + MAY omit support for MSML dialogs. The MSML Dialog Core Package + specifies the framework within which additional dialog packages + are supported. The MSML Dialog Base Package MUST be supported, + while all other dialog packages MAY be supported. + + o MSML Dialog Base Package (Conditionally Mandatory, for Dialogs) + + + + +Saleem, et al. Informational [Page 16] + +RFC 5707 Media Server Markup Language February 2010 + + + o MSML Dialog Group Package (Optional) + + o MSML Dialog Transform Package (Optional) + + o MSML Dialog Fax Detection Package (Optional) + + o MSML Dialog Fax Send/Receive Package (Optional) + + o MSML Dialog Speech Package (Optional) + + o MSML Audit Core Package (Conditionally Mandatory, for Auditing) + + Describes the audit core package that MUST be implemented to + support auditing services. The MSML audit core package specifies + the framework within which additional audit packages are + supported. 
+ + o MSML Audit Conference Package (Conditionally Mandatory, for + Auditing Conference, Conference Dialog, and Conference Stream) + + o MSML Audit Connection Package (Conditionally Mandatory, for + Auditing Connection, Connection Dialog, and Connection Stream) + + o MSML Audit Dialog Package (Conditionally Mandatory, for Auditing + Dialog, and MUST be used with either MSML Audit Conference + Package or MSML Audit Connection Package) + + o MSML Audit Stream Package (Conditionally Mandatory, for Auditing + Stream, and MUST be used with either MSML Audit Conference + Package or MSML Audit Connection Package) + + The formal process for defining extensions to MSML dialogs is to + define a new package. The new package MUST provide a text + description of what extensions are included and how they work. It + MUST also define an XML schema file (if applicable) that defines the + new package (which may be through extension, restriction of an + existing package, or a specific profile of an existing package). + Dependencies upon other packages MUST be stated. For example, a + package that extends or restricts has a dependency on the original + package specification. Finally, the new package MUST be assigned a + unique name and version. + + The types of things that can be defined in new packages are: + + o new primitives + + o extensions to existing primitives (events, shadow variables, + attributes, content) + + + +Saleem, et al. Informational [Page 17] + +RFC 5707 Media Server Markup Language February 2010 + + + o new recognition grammars for existing primitives + + o new markup languages for speech generation + + o languages for specifying a topology schema + + o new predefined topology schemas + + o new variables / segment types (sets & languages) + + o new control flow elements + + MSML packages are assembled together to form a specific MSML profile + that is shared between different implementations. 
The base MSML + dialog profiles that are defined in this document consist of the MSML + Core Package, MSML Dialog Core Package, MSML Dialog Base Package, + MSML Dialog Group Package, MSML Transform Package, MSML Fax Packages, + and the MSML Speech Package. + + MSML extension packages, which define primitives, MUST define the + following for each primitive within the package: + + o the function that the primitive performs + + o the attributes that may be used to tailor its behavior + + o the events that it is capable of understanding + + o the shadow variables that provide access to information + determined as a result of the primitive's operation + + The mechanism used to ensure that a media server and its client share + a compatible set of packages is not defined. Currently, it is + expected that provisioning will be used, possibly coupled with a + future auditing capability. Additionally, when used in SIP networks, + packages could be defined using feature tags and the procedures + defined for Indicating User Agent Capabilities in SIP [i1] used to + allow a media server to describe its capabilities to other user + agents. + +4.2. Profile Scheme + + Not all devices and applications using MSML will need to support the + entire MSML schema. For example, a media processing device might + support only audio announcements, only audio simple conferencing, or + only multimedia IVR. It is highly desirable to have a system for + describing what portion of MSML a particular media processing device + or control agent supports. + + + +Saleem, et al. Informational [Page 18] + +RFC 5707 Media Server Markup Language February 2010 + + + The package scheme described earlier allows MSML functionality to be + functionally grouped, relying on the MSML core package. This scheme + allows a portion of the complete MSML specification to be + implemented, on a per-package basis, and also creates a framework for + future extension packages. 
However, within a given package, in some + cases, only a subset of the package functionality may be required. + In order to support subsets of packages, with greater degree of + granularity than at the package level, a profile scheme is required. + + MSML package profiles would identify a subset of a given MSML package + with specific definitions of elements and attributes. Each MSML + package profile MUST be accompanied by one or more corresponding + schemas. To use the examples above, there could be an audio + announcements profile of the MSML Dialog Base Package, an audio + simple conferencing profile of the MSML Conference Core Package, and + a multimedia IVR profile of the MSML Dialog Base Package. + + MSML package profiles MUST be published separately from the MSML + specification, in one or more standards documents (e.g., Internet + Drafts or RFCs) dedicated to MSML package profiles. Profiles would + not be registered with IANA and any organization would additionally + be free to create its own profile(s) if required. + +5. Execution Flow + + MSML assumes a model where there is a single control context within a + media server for MSML processing. That context may have one or many + SIP [n1] dialogs associated with it. It is assumed that any SIP + dialogs associated with the MSML control context have been + authorized, as appropriate, by mechanisms outside the scope of MSML. + + A media server control context maintains information about the state + of all media objects and media streams within a media server. It + receives and processes all MSML requests from authorized SIP dialogs + and receives all events generated internally by media objects and + sends them on the appropriate SIP dialog. An MSML request is able to + create new media objects and streams, and to modify or destroy any + existing media objects and streams. + + An MSML request may simply specify a single action for a media server + to undertake. 
In this case, the document is very similar to a simple + command request. Often, though, it may be more natural for a client + to request multiple actions at one time, or the client would like + several actions to be closely coordinated by the media server. + Multiple MSML elements received in a single request MUST be processed + sequentially in document order. + + + + + +Saleem, et al. Informational [Page 19] + +RFC 5707 Media Server Markup Language February 2010 + + + An example of the first scenario would be to create a conference and + join it with an initial participant. An example of the second case + would be to unjoin one or more participants from a main conference + and join them to a sidebar conference. In the first scenario, + network latencies may not be an issue, but it is simpler for the + client to combine the requests. In the second case, the added + network latency between separate requests could mean perceptible + audio loss to the participant. + + Each MSML request is processed as a single transaction. A media + server MUST ensure that it has the necessary resources available to + carry out the complete transaction before executing any elements of + the request. If it does not have sufficient resources, it MUST + return a 520 response and MUST NOT execute the transaction. + + The MSML request MUST be checked for well-formedness and validated + against the schema prior to executing any elements. This allows XML + [n2] errors to be reported immediately and minimizes failures within a + transaction and the corresponding execution of only part of the + transaction. + + Each element is expected to execute immediately. Elements such as + , which take an unpredictable amount of time, are + "forked" and executed in a separate thread (see MSML Dialog + Packages). Once successfully forked, execution continues with the + element following the . As such, MSML does not provide + mechanisms to sequence or coordinate other operations with dialog + elements.
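
The processing order described above (check the request first, confirm resources for the whole transaction, then execute elements in document order) can be sketched in Python. This is only an illustration under stated assumptions: the function name, the handler table, and the resource callback are hypothetical, schema validation is reduced to a lookup in the handler table, and the 200 and 400 codes are assumed here; only the 520 code is taken from the text above.

```python
import xml.etree.ElementTree as ET

INSUFFICIENT_RESOURCES = 520  # response code named in the text above
OK = 200                      # assumed success code
BAD_REQUEST = 400             # assumed code for a malformed request


def process_request(document, handlers, has_resources):
    """Process one MSML request as a single transaction.

    handlers: hypothetical mapping of element name -> callable(element)
    has_resources: hypothetical callable(list of elements) -> bool
    Returns (response_code, names of the elements that were executed).
    """
    # 1. Check well-formedness before executing anything.
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return BAD_REQUEST, []
    operations = list(root)
    # Stand-in for schema validation: every child must have a handler.
    if any(op.tag not in handlers for op in operations):
        return BAD_REQUEST, []
    # 2. Confirm resources for the complete transaction up front.
    if not has_resources(operations):
        return INSUFFICIENT_RESOURCES, []
    # 3. Execute the child elements sequentially, in document order.
    executed = []
    for op in operations:
        handlers[op.tag](op)
        executed.append(op.tag)
    return OK, executed
```

Because both checks run before the loop, a request that is malformed or that cannot be given resources never executes any element, which matches the requirement that the transaction not be executed in those cases.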
+ + Processing within a transaction MUST stop if any errors occur. + Elements that were executed prior to the error are not rolled back. + It is the responsibility of the client to determine appropriate + actions based upon the results indicated in the response. Most + elements MAY contain an optional "mark" attribute. The value of that + attribute from the last successfully executed element MUST be + returned in an error response. Note that errors that occur during + the execution of a dialog occur outside the context of an MSML + transaction. These errors will be indicated in an asynchronous + event. + + Transaction results are returned as part of the SIP request response. + The transaction results indicate the success or failure of the + transaction. The result MUST also include identifiers for any + objects created by a media server for which the client did not + provide an instance name. Additionally, if the transaction fails, + the reason for the failure MUST be returned, as well as an indication + of how much of the transaction was executed before the failure + occurred SHOULD be returned. + + + +Saleem, et al. Informational [Page 20] + +RFC 5707 Media Server Markup Language February 2010 + + +6. Media Server Object Model + + Media servers are general-purpose platforms for executing real-time + media processing tasks. These tasks range in complexity from simple + ones such as serving announcements, to complex ones, such as speech + interfaces, centralized multimedia conferencing, and sophisticated + gaming applications. + + Calls are established to a media server using SIP. Clients will + often use SIP third party call control (3PCC) [i4] to establish calls + to a media server on behalf of end users. However MSML does not + require that 3PCC be used, only that the client and the media server + share a common identifier for the call and its associated RTP [i3] + sessions. + + Objects represent entities that source, sink, or modify media + streams. 
A media stream is a bidirectional or unidirectional media + flow between objects on a media server. The following subsections + define the classes of objects that exist on a media server and the + way these are identified in MSML. + +6.1. Objects + + A media object is an endpoint of one or more media streams. It may + be a connection that terminates RTP sessions from the network or a + resource that transforms or manipulates media. MSML defines four + classes of media objects. Each class defines the basic properties of + how object instances are used within a media server. However, most + classes require that the function of specific instances be defined by + the client, using MSML or other languages such as VoiceXML. + + The following classes of media processing objects are defined. The + class names are given in parentheses: + + o network connection (conn) + + o conference (conf) + + o dialog (dialog) + + Network connection is an abstraction for the media processing + resources involved in terminating the RTP session(s) of a call. For + audio services, a connection instance presents a full-duplex audio + stream interface within a media server. Multimedia connections have + multiple media streams of different media types, each corresponding + to an RTP session. Network connections get instantiated through SIP + [n1]. + + + + +Saleem, et al. Informational [Page 21] + +RFC 5707 Media Server Markup Language February 2010 + + + A conference represents the media resources and state information + required for a single logical mix of each media type in the + conference (e.g., audio and video). MSML models multiple mixes/views + of the same media type as separate conferences. Each conference has + multiple inputs. Inputs may be divided into classes that allow an + application to request different media treatment for different + participants.
For example, the video streams for some participants + may be assigned to fixed regions of the screen while those for other + participants may only be shown when they are speaking. + + A conference has a single logical output per media type. For each + participant, it consists of the audio conference mix, less any + contributed audio of the participant, and the video mix shared by all + conference participants. Video conferences using voice activated + switching have an optional ability to show the previous speaker to + the current speaker. + + Conferences are instantiated using the element. + The content of the element specifies the + parameters of the audio and/or video mixes. + + Dialogs are a class of objects that represent automated participants. + They are similar to network connections from a media flow perspective + and may have one or more media streams as the abstraction for their + interface within a media server. Unlike connections, however, + dialogs are created and destroyed through MSML, and the media server + itself implements the dialog participant. Dialogs are instantiated + through the element. Contents of the + element define the desired or expected dialog behavior. Dialogs may + also be invoked by referencing VoiceXML as the dialog description + language. + + Operators are functions that are used to filter or transform a media + stream. The function that an instance of an operator fulfills is + defined as a property of the media stream. Operators may be + unidirectional or bidirectional and have a media type. + Unidirectional operators reflect simple atomic functions such as + automatic gain control, filtering tones from conferences, or applying + specific gain values to a stream. Unidirectional operators have a + single media input, which is connected to the media stream from one + object, and a single media output, which is connected to the media + stream of a different object. 
+ + Bidirectional operators have two media inputs and two media outputs. + One media input and output is associated with the stream to one + object, and the other input and output is associated with a stream to + a different object. Bidirectional objects may treat the media + differently in each direction. For example, an operator could be + + + +Saleem, et al. Informational [Page 22] + +RFC 5707 Media Server Markup Language February 2010 + + + defined that changed the media sent to a connection based upon + recognized speech or dual-tone multi-frequency (DTMF) received from + the connection. Operators are implicitly instantiated when streams + are created or modified using the elements and , + respectively. + + The relationships between the different object classes (conf, conn, + and dialog) are shown in the figure below. + + +--------------------------------------+ + | Media Server | + | | + |------+ ,---. | + | | +------+ / \ | + <== RTP ==>| conn |<---->| oper |<---->( conf ) | + | | +------+ \ / | + |------+ `---' | + | ^ ^ | + | | | | + | | +------+ +------+ | | + | | | | | | | | + | +-->|dialog| |dialog|<---+ | + | | | | | | + | +------+ +------+ | + +--------------------------------------+ + + A single, full-duplex instance of each object class is shown together + with common relationships between them. An operator (such as gain) + is shown between a connection and a conference and dialogs are shown + participating both with an individual connection and with a + conference. The figure is not meant to imply only one-to-one + relationships. Conferences will often have hundreds of participants, + and either connections or conferences may be interacting with more + than one dialog. For example, one dialog may be recording a + conference while other dialogs announce participants joining or + leaving the conference. + +6.2. Identifiers + + Objects are referenced using identifiers that are composed of one or + more terms. 
Each term specifies an object class and names a specific + instance within that class. The object class and instance are + separated by a colon ":" in an identifier term. + + Identifiers are assigned to objects when they are first created. In + general, either the MSML client or a media server may specify the + instance name for an object. Objects for which a client does not + assign an instance name will be assigned one by a media server. + + + +Saleem, et al. Informational [Page 23] + +RFC 5707 Media Server Markup Language February 2010 + + + Media server assigned instance names are returned to the client as a + complete object identifier in the response to the request that + created the object. + + It is meaningful for some classes of objects to exist independently + on a media server. Network connections may be created through SIP at + any time. MSML can then be used to associate their media with other + objects as required to create services. Conferences may be created + and have specific resources reserved waiting for participant + connections. + + Objects from these two classes, connections and conferences, are + considered independent objects since they can exist on a standalone + basis. Identifiers for independent objects consist of a single term + as defined above. For example, identifiers for a conference and + connection could be "conf:abc" or "conn:1234" respectively. Clients + that choose to assign instance names to independent objects must use + globally unique instance names. One way to create globally unique + names is to include the domain name of the client as part of the + name. + + Dialogs are created to provide a service to independent objects. + Dialogs may act as a participant in a conference or interact with a + connection similar to a two-participant call. Dialogs depend upon + the existence of independent objects, and this is reflected in the + composition of their identifiers. 
Operators modify the media flow + between other objects, such as application of gain between a + connection and a conference. As operators are merely media transform + primitives defined as properties of the media stream, they are not + represented by identifiers and created implicitly. + + Identifiers for dialogs are composed of a structured list of slash + ('/') separated terms. The left-most term of the identifier must + specify a conference or connection. This serves as the root for the + identifier. An example of an identifier for a dialog acting as a + conference participant could be: + + conf:abc/dialog:recorder + + All objects except connections are created using MSML. Connections + are created when media sessions get established through SIP. There + are several options clients and media servers can use to establish a + shared instance name for a connection and its media streams. + + When media servers support multiple media types, the instance name + SHOULD be a call identifier that can be used to identify the + collection of RTP sessions associated with a call. When MSML is used + in conjunction with SIP and third party call control, the call + + + +Saleem, et al. Informational [Page 24] + +RFC 5707 Media Server Markup Language February 2010 + + + identifier MUST be the same as the local tag assigned by the media + server to identify the SIP dialog. This will be the tag the media + server adds to the "To" header in its response to an initial invite + transaction. RFC 3261 requires the tag values to be globally unique. + + An example of a connection identifier is: conn:74jgd63956ts. + + With third party call control, the MSML client acts as a back-to-back + user agent (B2BUA) to establish the media sessions. SIP dialogs are + established between the client and the media server allowing the use + of the media server local tag as a connection identifier. 
If third + party call control is not used, a SIP event package MAY be used to + allow a media server to notify new sessions to a client that has + subscribed to this information. + + Identifiers as described above allow every object in a media server + to be uniquely addressed. They can also be used to refer to multiple + objects. There are two ways in which this can currently be done: + + wildcards + + common instance names + + An identifier can reference multiple objects when a wildcard is used + as an instance name. MSML reserves the instance name composed of a + single asterisk ('*') to mean all objects that have the same + identifier root and class. Instance names containing an asterisk + cannot be created. Wildcards MUST only be used as the right-most + term of an identifier and MUST NOT be used as part of the root for + dialog identifiers. Wildcards are only allowed where explicitly + indicated below. + + The following are examples of valid wildcards: + + conf:abc/dialog:* + + conn:* + + An example of illegal wildcard usage is: + + conf:*/dialog:73849 + + Although identifiers share a common syntax, MSML elements restrict + the class of objects that are valid in a given context. As an + example, although it is valid to join two connections together, it is + not valid to join two IVR dialogs. + + + + + +Saleem, et al. Informational [Page 25] + +RFC 5707 Media Server Markup Language February 2010 + + +7. MSML Core Package + + This section describes the core MSML package that MUST be supported + in order to use any other MSML packages. The core MSML package + defines a framework, without explicit functionality, over which + functional packages are used. + +7.1. + + is the root element. When received by a media server, it + defines the set of operations that form a single MSML request. + Operations are requested by the contents of the element. Each + operation MAY appear zero or more times as children of . 
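
The identifier syntax and wildcard rules of section 6.2 (Identifiers) can be sketched as a small Python validator. The function name is hypothetical, and restricting the terms after the root to the dialog class is an assumption made for this sketch rather than a rule stated in the text.

```python
ROOT_CLASSES = {"conf", "conn"}        # independent objects
CLASSES = ROOT_CLASSES | {"dialog"}


def parse_identifier(identifier):
    """Split an identifier into (class, instance) terms or raise ValueError."""
    terms = []
    for raw in identifier.split("/"):
        # Each term is "class:instance", separated by a colon.
        cls, sep, instance = raw.partition(":")
        if not sep or cls not in CLASSES or not instance:
            raise ValueError(f"malformed term: {raw!r}")
        terms.append((cls, instance))
    # The left-most term is the root: a conference or a connection.
    if terms[0][0] not in ROOT_CLASSES:
        raise ValueError("root term must be a conference or connection")
    # Assumption for this sketch: only dialog terms follow the root.
    if any(cls != "dialog" for cls, _ in terms[1:]):
        raise ValueError("only dialog terms may follow the root")
    # A wildcard is the single-asterisk instance name, and it may
    # appear only in the right-most term.
    for index, (_, instance) in enumerate(terms):
        if "*" in instance:
            if instance != "*" or index != len(terms) - 1:
                raise ValueError("wildcard allowed only as right-most term")
    return terms
```

With these rules, "conf:abc/dialog:*" and "conn:*" are accepted, while "conf:*/dialog:73849" is rejected because the wildcard appears in the root of a dialog identifier.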
+ Specific operations are defined within the conference package and in + the set of dialog packages. + + The results of a request or the contents of events sent by a media + server are also enclosed within the element. The results of + the transaction are included as a body in the response to the SIP + request that contained the transaction. This response will contain + any identifiers that the media server assigned to newly created + objects. All messages that a media server generates are correlated + to an object identifier. Objects and identifiers are discussed in + section 6 (Media Server Object Model). + + Attributes: + + version: "1.1" Mandatory + +7.2. + + Events are used to affect the behavior of different objects within a + media server. The element is used to send an event to the + specified recipient within the media server. + + Attributes: + + event: the name of an event. Mandatory. + + target: an object identifier. When the identifier is for a + dialog, it may optionally be appended with a slash "/" followed by + the target to be included in an MSML dialog . Mandatory. + + valuelist: a list of zero or more parameters that are included + with the event. + + + + + + +Saleem, et al. Informational [Page 26] + +RFC 5707 Media Server Markup Language February 2010 + + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document should be unique. + +7.3. + + The element is used to report the results of an MSML + transaction. It is included as a body in the final response to the + SIP request that initiated the transaction. An optional child + element may include text that expands on the meaning of + error responses. Response codes are defined in section 11 (Response + Codes). 
+ + Attributes: + + response: a numeric code indicating the overall success or failure + of the transaction, and in the case of failure, an indication of + the reason. Mandatory. + + mark: in the case of an error, the value of the mark attribute + from the last successfully executed element that included the mark + attribute. + + In the case of failure, a description of the reason SHOULD be + provided using the child element . + + Three other child elements allow the response to include identifiers + for objects created by the request but that did not have instance + names specified by the client. Those elements are and + , for objects created through a and + respectively. + +7.4. + + The element is used to notify an event to a media server + client. Three types of events are defined by the MSML Core Package: + "msml.dialog.exit", "msml.conf.nomedia", and "msml.conf.asn". These + correspond to the termination of an executing dialog, a conference + being automatically deleted when the last participant has left, and + the notification of the current set of active speakers for a + conference, respectively. Events may also be generated by an + executing dialog. In this case, the event type is specified by the + dialog (see MSML Dialog Core Package ). + + + + + + +Saleem, et al. Informational [Page 27] + +RFC 5707 Media Server Markup Language February 2010 + + + Attributes: + + name: the type of event. If the event is generated because of the + execution MSML dialog , the value MUST be the value of the + "event" attribute from the element within the MSML Dialog + Core Package. If the event is generated because of the execution + of an , the value MUST be "moml.exit". If the event is + generated because of the execution of a , the value + MUST be "moml.disconnect". If the event is generated because of + an error, the value must be "moml.error". Mandatory. + + id: the identifier of the conference or dialog that generated the + event or caused the event to be generated. 
Mandatory. + + has two children, and , which contain the + name and value respectively of each namelist item associated with + the event. + +8. MSML Conference Core Package + +8.1. Conferences + + A conference has a mixer for each type of media that the conference + supports. Each mix has a corresponding description that defines how + the media from participants contributes to that mix. A mixer has + multiple inputs that are combined in a media specific way to create a + single logical output. + + The elements that describe the mix for each media type are called + mixer description elements. They are: + + defines the parameters for mixing audio media. + + defines the composition of a video window. + + These elements, defined in sections 8.6 (Audio Mix) and 8.7 (Video + Layout) respectively, are used as content of the + element to establish the initial properties of a conference. The + elements are used within the element to change the + properties of a conference once it has been created, or within the + element to remove individual mixes from the + conference. + + Conferences may be terminated by an MSML client using the + element to remove the entire conference or by + removing the last mixer(s) associated with the conference. + Conferences can also be terminated automatically by a media server + based on criteria specified when the conference is created. When the + + + +Saleem, et al. Informational [Page 28] + +RFC 5707 Media Server Markup Language February 2010 + + + conference is deleted, any remaining participants will have their + associated SIP dialogs left unchanged or deleted based on the value + of the "term" attribute specified when the conference was created. + +8.2. Media Streams + + Objects have at least one media input and output for each type of + media that they support. Each object class defines the number of + inputs and outputs that objects of that class support.
Media streams are + created when objects are joined, either explicitly using or + implicitly when dialogs are created using . Dialog + creation has two stages, allocating and configuring the resources + required for the dialog instance, and implicitly joining those + resources to the dialog target during the dialog execution. Refer to + the MSML Dialog Base Package. + + A join operation by default creates a bidirectional audio stream + between two objects. Video and unidirectional streams may also be + created. A media stream is created by connecting the output from one + object to the input of another object and vice versa (assuming a + bidirectional or full-duplex join). + + Many objects may only support a single input for each type of media. + Within this specification, only the conference object class supports + an arbitrary number of inputs. When a stream is requested to be + created to an object that already has a stream of the same type + connected to its single input, the result of the request depends upon + the type of the media stream. + + Audio mixing is done by summing audio signals. Automatically mixing + audio streams has common and straightforward applications. For + example, the ability to bridge two streams allows for the easy + creation of simple three-way calls or to bridge private announcements + with a (whispered) conference mix for an individual participant. In + the case of general conferences, however, an MSML client SHOULD + create an audio conference and then join participants to the + conference. Conference mixers SHOULD subtract the audio of each + participant from the mix so that they do not hear themselves. + + A media server receiving a request that requires joining an audio + stream to the single audio input of an object that already has an + audio stream connected SHOULD automatically bridge the new stream + with the existing stream, creating a mix of the two audio streams. 
+ The maximum number of streams that may be bridged in this manner is + implementation specific. It is RECOMMENDED that a media server + support bridging at least two streams. A media server that cannot + bridge a new stream with any existing streams MUST fail the operation + requesting the join. + + Unlike audio mixing, there are many different ways that two video + streams may be combined and presented. For example, they may be + presented side by side in separate panes, picture in picture, or in a + single pane that displays only a single stream at a time based on a + heuristic such as active speaker. Each of these options creates a + very different presentation and requires significantly different + media resources. + + A join operation does not describe how a new stream can be combined + with an existing stream. Therefore, automatic bridging of video is + not supported. A media server MUST fail requests to join a new video + stream to an object that only supports a single video input and + already has a video stream connected to that input. For an object to + have multiple video streams joined to it, the object itself must be + capable of supporting multiple video streams. Conference objects can + support multiple video streams and provide a way to specify the + mixing presentation for the video streams. + + A media server MUST NOT establish any streams unless the media server + is able to create all the streams requested by an operation. Streams + can be created only if both objects support a media type and + at least one of the following conditions is true: + + 1. Each object that is to receive media is not already receiving a + stream of that type. + + 2. Any object that is to receive media and is already receiving a + stream of that type supports receiving an additional stream of + that type.
The only class of objects defined in this + specification that directly support receiving multiple streams + of the same type are conferences. + + 3. The media server is able to automatically bridge media streams + for an object that is to receive media and that is already + receiving a stream of the requested type. The only type of + media defined in this specification that MAY be automatically + bridged is audio. + + The directionality of media streams associated with a connection is + modeled independently from what SDP [n9] allows for the corresponding + RTP [i3] sessions. Media servers MUST respect the SDP in what they + actually transmit but MUST NOT allow the SDP to affect the + directionality when joining streams internal to the media server. + + + + + + + + +Saleem, et al. Informational [Page 30] + +RFC 5707 Media Server Markup Language February 2010 + + +8.3. + + is used to allocate and configure the media mixing + resources for conferences. A description of the properties for each + type of media mix required for the conference is defined within the + content of the element. Mixer descriptions are + described in Audio Mix and Video Layout sections. When no mixer + descriptions are specified, the default behavior MUST be equivalent + to inclusion of a single . + + Clients can request that a media server automatically delete a + conference when a specified condition occurs by using the + "deletewhen" attribute. A value of "nomedia" indicates that the + conference MUST be deleted when no participants remain in the + conference. When this occurs, an "msml.conf.nomedia" event MUST be + notified to the MSML client. A value of "nocontrol" indicates that + the conference MUST be deleted when the SIP [n1] dialog that carries + the element is terminated. When this occurs, a + media server MUST terminate all participant dialogs by sending a BYE + for their associated SIP dialog. 
A value of "never" MUST leave the + ability to delete a conference under the control of the MSML client. + + Attributes: + + name: the instance name of the conference. If the attribute is + not present, the media server MUST assign a globally unique name + for the conference. If the attribute is present but the name is + already in use, an error (432) will result and MSML document + execution MUST stop. Events that the conference generates use + this name as the value of their "id" attribute (see section 7.4 + ()). + + deletewhen: defines whether a media server should automatically + delete the conference. Possible values are "nomedia", + "nocontrol", and "never". Default is "nomedia". + + term: when true, the media server MUST send a BYE request on all + SIP dialogs still associated with the conference when the + conference is deleted. Setting term equal to false allows clients + to start dialogs on connections once the conference has completed. + Default is "true". + + mark: a token that MAY be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document should be unique. + + + + +Saleem, et al. Informational [Page 31] + +RFC 5707 Media Server Markup Language February 2010 + + + An example of creating an audio conference is shown below. This + conference allows at most two participants to contend to be heard and + reports the set of active speakers no more frequently than every 10 + seconds. + + + + + + + + + + + +8.3.1. + + Conference resources may be reserved by including the + element as a child of . allows the + specification of a set of resources that a media server will reserve + for the conference. Any requests for resources beyond those that + have been reserved should be honored on a best-effort basis by a + media server. 
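As a rough illustration of the reservation behavior described in this section, the following Python sketch models an all-or-nothing reservation. The function name, the dictionary-based resource pool, and the returned tuple are hypothetical and are not part of MSML. The sketch assumes the semantics of the "required" attribute given in the Attributes list of this section: creation fails when the attribute is true, and the conference is created with no reserved resources when it is false.

```python
def create_with_reservation(requested, available, required=True):
    """Model conference creation with an all-or-nothing reservation.

    'requested' and 'available' map a hypothetical resource label to a
    count. Returns a (created, reserved) tuple.
    """
    # The reservation is complete only if every requested quantity fits.
    complete = all(available.get(label, 0) >= count
                   for label, count in requested.items())
    if complete:
        return True, dict(requested)
    if required:
        # required="true" (the default): conference creation fails.
        return False, {}
    # required="false": the conference is created with nothing reserved.
    return True, {}
```

A server modeled this way never holds a partial reservation. Later requests beyond the reserved set would be handled on a best-effort basis, which this sketch does not attempt to model.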
+ + Attributes: + + required: boolean that specifies whether should + fail if the requested resources are not available. When set to + false, the conference will be created, with no reserved resources, + if the complete reservation cannot be honored. Default is "true". + +8.3.1.1. + + The resources to be reserved are defined using . The + contents of these elements describe a resource that is to be + reserved. Descriptions are implementation dependent. Media servers + that support MSML dialogs may use the elements from that package as + the basis for resource descriptions. Each resource element may use + the attribute "n" to define the quantity of the resource to reserve. + + For example, the following creates a conference and reserves two + types of resources. One resource element may represent resources + that are shared by all participants of the conference, while the + other may represent resources that are reserved for each of the + expected participants. + + + + + + +Saleem, et al. Informational [Page 32] + +RFC 5707 Media Server Markup Language February 2010 + + + Attributes: + + n: number of resources to be reserved. Default is 1. + + type: specifies whether the resource is to be reserved by each + individual participant or reserved as a shared conference + resource. Valid values for this attribute are "individual" or + "shared". Default is "individual". + + + + + + + + + + + + +8.4. + + All of the properties of an audio mix or the presentation of a video + mix may be changed during the life of a conference using the + element. Changes to an audio mix are requested by + including an element as a child of . + This may also be used to add an audio mixer to the conference if none + was previously allocated. Changes to a video presentation are + requested by including a element as a child of + . Similar to an audio mixer, this may be used to + add a video mixer if none was previously allocated. 
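The update rule for this element can be sketched in Python as follows. The rule, as stated in this section, is that a feature included in a mixer description replaces the previous definition of that feature in full, anything not mentioned keeps its current state, and a mixer is added if none was previously allocated. The nested-dictionary model of a conference and the labels "audio", "asn", and "n-loudest" are illustrative assumptions, not MSML syntax.

```python
import copy


def modify_conference(current, changes):
    """Return the conference state after applying 'changes'.

    Both arguments map a mixer label to {feature label: settings}.
    The input state is never mutated, so a caller can discard the
    result if any part of the request turns out to be invalid.
    """
    updated = copy.deepcopy(current)
    for mixer, features in changes.items():
        # A mixer that was not previously allocated is added.
        mixer_state = updated.setdefault(mixer, {})
        for feature, settings in features.items():
            # The new definition completely replaces the old one.
            mixer_state[feature] = copy.deepcopy(settings)
    return updated


conference = {"audio": {"n-loudest": {"n": 2},
                        "asn": {"ri": 10, "asth": -40}}}
after = modify_conference(conference, {"audio": {"asn": {"ri": 5}}})
```

In this model, changing only the reporting interval leaves the N-loudest settings untouched. The previously configured threshold is dropped, because the complete specification of a feature must be resent with each change.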
+ + Mixers are removed by including a mixer description element within + . + + Features and presentation aspects are enabled/added or modified by + including the element(s) that define the feature or presentation + aspect within a mixer description. The complete specification of the + element must be included just as it would be included when the + conference is created. The new definition completely replaces any + previous definition that existed. Only things that are defined by + elements included in the mixer descriptions are affected. Any + existing configuration aspects of a conference, which are not + specified within the element, MUST maintain their + current state in the media server. + + + + + + +Saleem, et al. Informational [Page 33] + +RFC 5707 Media Server Markup Language February 2010 + + + For example, if an MSML client wanted to change the minimum reporting + interval for active speaker notification from that shown in the + Conference Examples section () it would send the + following to the media server: + + + + + + + + + + + This would also enable active speaker notification if it had not + previously been enabled. The N-loudest mixing is unaffected. + + Multiple elements MAY be included in the mixer descriptions similar + to when conferences are created. For example, in a video conference, + the video mix description () could specify that the + layout of the video being displayed should change such that the + regions currently displaying participants get smaller and new + region(s) are created to support additional participants. A media + server MUST make all of the requested changes or none of the + requested changes. + + Additional examples of modifying conferences are presented in the + Conference Examples section. + + Attributes: + + id: the identifier for a conference. Wildcards MUST NOT be used. + Mandatory. + + mark: a token that can be used to identify execution progress in + the case of errors. 
The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all "mark" attributes within an + MSML document SHOULD be unique. + +8.5. + + Destroy conference is used to delete mixers or to delete the entire + conference and all state and shared resources. When a mixer is + removed, all of the streams joined to that mixer are unjoined. When + a conference is destroyed, SIP dialogs for any remaining participants + MUST be maintained or removed based on the value of the "term" + attribute when the conference was created. + + When there is no element content, deletes the + entire conference. Individual mixers are removed by including a + mixer description element identifying the mix (or mixes) to be + removed as content to . is used to + remove audio mixers and is used to remove video mixers. + When one or more mixer descriptions are specified, the media server + MUST delete only the specified mixers and MUST NOT affect any other + existing mixers. When or is identified + for individual removal, other feature aspects of the mix MUST NOT be + included. If specified, the media server MUST ignore any such + elements. When the last mixer is removed from a conference, a media + server MUST remove all conference state, leaving or removing any + remaining SIP dialogs as described above. + + Attributes: + + id: the identifier for a conference. Mandatory. + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all "mark" attributes within an + MSML document SHOULD be unique. + +8.6. + + The properties of the overall audio mix are specified using the + element. + + Attributes: + + id: an optional identifier for the audio mix.
+ + samplerate: an integer value that specifies the sample rate (in Hz) for + the audio mixer. Optional, default value of 8000. + + An example of the description for an audio mix is: + + + + + + +8.6.1. + + The element defines that participants contend to be + included in the conference mix based upon their audio energy. When + the element is not present, all participants are mixed. + + Attributes: + + n: the number of participants that will be included in the audio + mix based upon having the greatest audio energy. Mandatory. + +8.6.2. + + The element enables notification of active speakers. Active + speakers MUST be notified using the element with an event + name of "msml.conf.asn". The namelist of the event consists of the + set of active speakers. The name of each item is the string + "speaker" with a value of the connection identifier for the + connection. + + Attributes: + + ri: the minimum reporting interval defines the minimum duration of + time that must pass before changes to active speakers will be + reported. A value of zero disables active speaker notification. + + asth: specifies the active speaker threshold (in units of dBm0). + Valid value range is 0 to -96. Optional, default is -96. + + An example of an active speaker notification is: + + + speaker + conn:hd93tg5hdf + speaker + conn:w8cn59vei7 + speaker + conn:p78fnh6sek47fg + +8.7. + + A video layout is specified using the element. It is + used as a container to hold elements that describe all of the + properties of a video mix. The parameters of the window that + displays the video mix are defined by the element. When the + video mix is composed of multiple panes, the location and + characteristics of the panes are defined by one or more + elements. A element is not required when only a single + video stream is displayed at one time and none of the visual + attributes of regions are required.
+ + Some regions may be used to display a video stream based on a + selection criterion rather than having a video stream of a single + participant continuously presented in the region. One such + example is a distance learning lecture where the instructor sees each + of the students periodically displayed in a region. When a region is + used to display one of a number of streams, it is placed as a child + of a element. + + Attributes: + + type: specifies the language used to define the layout. Layouts + defined using MSML MUST use the value "text/msml-basic-layout". + This is the same convention as defined for the layout package from + the W3C SMIL 2.0 specification [i6]. The default when omitted is + "text/msml-basic-layout". + + id: an optional identifier for the video layout. + +8.7.1. + + The element describes the root window or virtual screen in + which the conference video mix will be displayed. Simple conferences + can display participant video directly within the root window but + more complex conferences will use regions for this purpose. Areas of + the window that are not used to display video will show the root + window background. + + All video presentations require a root window. It MUST be present + when a video mix is created and it cannot be deleted; however, its + attributes MAY be changed using the element. + + Attributes: + + size: the size of the root window specified as one of the five + standard common intermediate formats (e.g., CIF, QCIF). + + backgroundcolor: the color for the root window background defined + using the values for the "background-color" property of the CSS2 + specification [n10]. + + backgroundimage: the URI for an image to be displayed as the root + window background. Transparent portions of the image allow the + background color to show through. + +8.7.2.
+ + elements define video panes that are used to display + participant video streams. Regions are rendered on top of the root + window. + + + + + +Saleem, et al. Informational [Page 37] + +RFC 5707 Media Server Markup Language February 2010 + + + The size of a region is specified relative to the size of the root + window using the "relativesize" attribute. Relative sizes are + expressed as fractions (e.g., 1/4, 1/3) that preserve the aspect + ratio of the original video stream while allowing for efficient + scaling implementations. + + Regions are located on the root window based on the value of the + position attributes "top" and "left". These attributes define the + position of the top left corner of the region as an offset from the + top left corner of the root window. Their values may be expressed + either as a number of pixels or as a percent of the vertical or + horizontal dimension of the root window. Percent values are appended + with a percent ('%') character. Percent values of "33%" and "67%" + should be interpreted as "1/3" and "2/3" to allow easy alignment of + regions whose size is expressed relative to the size of the root + window. + + An example of a video layout with six regions is: + + +-------+---+ + | | 2 | + | 1 +---+ + | | 3 | + +---+---+---+ + | 6 | 5 | 4 | + +---+---+---+ + + + + + + + + + + + + The area of the root window covered by a region is a function of the + region's position and its size. When areas of different regions + overlap, they are layered in order of their "priority" attribute. + The region with the highest value for the "priority" attribute is + below all other regions and will be hidden by overlapping regions. + The region with the lowest non-zero value for the "priority" + attribute is on top of all other regions and will not be hidden by + overlapping regions. The priority attribute may be assigned values + between 0 and 1. 
A value of zero disables the region, freeing any + resources associated with the region, and unjoining any video stream + displayed in the region. + + + +Saleem, et al. Informational [Page 38] + +RFC 5707 Media Server Markup Language February 2010 + + + Regions that do not specify a priority will be assigned a priority by + a media server when a conference is created. The first region within + the element that does not specify a priority will be + assigned a priority of one, the second a priority of two, etc. In + this way, all regions that do not explicitly specify a priority will + be underneath all regions that do specify a priority. As well, + within those regions that do not specify a priority, they will be + layered from top to bottom, in the order they appear within the + element. + + For example, if a layout was specified as follows: + + + + + + + + + + Then the regions would be layered, from top to bottom, c,a,b,d. + + Portions of regions that extend beyond the root window will be + cropped. For example, a layout specified as: + + + + + + + would appear similar to: + + +-----------+ + | root | + |background | + | +-----+-- + | | |// + | | foo |// + +-----+-----+// + |//////// + + Visual attributes are used to define aspects of the visual appearance + of individual regions. A border may be defined together with a title + and/or logo. Text and logos are displayed as images on top of the + region's video, below all regions with a lower priority. The visual + attributes are "title", "titletextcolor", "titlebackgroundcolor", + "bordercolor", "borderwidth", and "logo". + + + + +Saleem, et al. Informational [Page 39] + +RFC 5707 Media Server Markup Language February 2010 + + + Visual attributes can also be defined for individual streams (Video + Stream Properties). When visual attributes are specified as part of + both a region and a stream, those associated with the stream MUST + take precedence. 
This allows streams that are chosen for display + automatically (Stream Selection) to have proper text and logos + displayed. The region visual attributes are displayed when no stream + is associated with the region. + + Two other attributes associated with a region, "blank" and "freeze", + define the state of the video displayed in the region. When the + blank or freeze attribute is assigned the value "true", then the + media server MUST display the region either as a blank region, or the + video image frozen at the last received frame. + + These attributes are specified for a region and not allowed for + streams because that appears to be the common use case. Applying + them to streams would allow only that stream to be affected within a + selector while other streams continue to display normally. Except + for personal mixing scenarios, the same effect can be achieved by + having the participant mute their own transmission to the media + server. + + Attributes: associated with each region: + + id: a name that can be used to refer to the region. + + left: the position of the region from the left side of the root + window. + + top: the position of the region from the top of the root window. + + relativesize: the size of the region expressed as a fraction of + the root window size. + + priority: a number between 0 and 1 that is used to define the + precedence when rendering overlapping regions. A value of zero + disables the region. + + title: text to be displayed as the title for the region + + titletextcolor: the color of the text + + titlebackgroundcolor: the color of the text background + + bordercolor: the color of the region border + + borderwidth: the width of the region border + + + + +Saleem, et al. 
Informational [Page 40] + +RFC 5707 Media Server Markup Language February 2010 + + + logo: the URI of an image file to be displayed + + freeze: a boolean value, with a default of "false", that defines + whether the video image should be frozen at the currently + displayed frame + + blank: a boolean value, with a default of "false", that defines + whether the region should display black instead of the associated + video stream + +8.7.3. + + It is often desired that one of several video streams be + automatically selected to be displayed. The element is + used to define the selection criteria and its associated parameters. + The selection algorithm is specified by the "method" attribute. + Currently defined selection methods allow for voice activated + switching and to iterate sequentially through the set of associated + video streams. + + The regions that will display the selected video stream are placed as + child elements of the element. Including regions within a + element does not affect their layout with respect to + regions not subject to the selection. For simple video conferences + that display the video directly in the root window, the + element can be placed as a child of . Region elements MUST + NOT be used in this case. + + For example, below is a common video layout that allows the video + stream from the currently active speaker to be displayed in the large + region ("1") at the top left of the layout while the streams from + five other participants are displayed in regions located at the + layout periphery. + + +-------+---+ + | | 2 | + | 1 +---+ + | | 3 | + +---+---+---+ + | 6 | 5 | 4 | + +---+---+---+ + + + + + + + + + + +Saleem, et al. Informational [Page 41] + +RFC 5707 Media Server Markup Language February 2010 + + + + + + + + + + + + + + + All selector methods must be defined so that they work if only a + single region is a child of the selector. 
Selector methods that + support more than one child region MUST specify how the method works + across multiple regions. Media server implementations MAY support + only a single region for methods that are defined to allow multiple + regions. + + The selector or region for a participant's video is defined using the + "display" attribute of during a join operation. Specifying + a selector allows the stream to be displayed according to the + criteria defined by the selector method. Specifying a region + supports continuous presence display of participants. Some streams + may be joined with both a selector and a region. In this case, the + value of attribute defines whether the streams + associated with a continuous presence region should be blanked when + the stream is selected for display in one of the selector regions. + + Attributes: common to all selector methods are: + + id: a name that can be used to refer to the selector. + + method: the name of the method used to select the video stream. A + value of "vas" (see the following section, Voice Activated + Switching) MAY be specified. + + status: specifies whether the selector is "active" or "disabled". + + blankothers: when "true", video streams that are also displayed in + continuous presence regions will have the continuous presence + regions blanked when the stream is displayed in a selection + region. + + + + + + + + +Saleem, et al. Informational [Page 42] + +RFC 5707 Media Server Markup Language February 2010 + + +8.7.3.1. Voice Activated Switching ("vas") + + Voice activated switching (VAS) is used to display the video stream + that correlates with the participant who is currently speaking. It + is specified using a selector method value of "vas". + + If the video stream associated with the active speaker is not + currently displayed in a selection region, then it replaces the video + in the region that is displaying the video of the speaker that was + least recently active. 
If the video of the active speaker is + currently displayed in a selection region, then there is no change to + any region. When VAS is applied to a single region, this has the + effect that the current speaker is displayed in that region. + + Attributes: + + si: switching interval is the minimum period of time that must + elapse before allowing the video to switch to the active speaker. + + speakersees: defines whether the active speaker sees the "current" + speaker (themselves) or the "previous" speaker. + +8.8. + + is used to create one or more streams between two independent + objects. Streams may be audio or video and may be bidirectional or + unidirectional. A bidirectional stream is implicitly composed of two + unidirectional streams that can be manipulated independently. The + streams to be established are specified by elements (section + ) as the content of . + + Without any content, by default establishes a bidirectional + audio stream. When only a stream of a single type has previously + been created between two objects, or when only a unidirectional + stream exists, can be used to add a stream of another media + type or make the stream bidirectional by including the necessary + elements. Bidirectional streams are made unidirectional by + using (section ) to remove the unidirectional stream + for the direction that is no longer required. + + In addition to defining the media type and direction of streams, + elements are also used to establish the properties of + streams, such as gain, voice masking, or tone clamping of audio + streams, or labels and other visual characteristics of video streams. + Properties are often defined asymmetrically for a single direction of + a stream. Creating a bidirectional stream requires two + elements within the , one for each direction, if one direction + is to have different properties from the other direction. + + + +Saleem, et al. 
Informational [Page 43] + +RFC 5707 Media Server Markup Language February 2010 + + + If a media server can provide services using both compressed or + uncompressed media, the MSML client may need to distinguish within + requests which format is to be used. When compressed streams are + created, both objects must use the same media format or an error + response (450) is generated. + + Attributes: + + id1: an identifier of either a connection or conference. + Wildcards MUST NOT be used. Mandatory. Any other object class + results in a 440 error. + + id2: an identifier of either a connection or conference. + Wildcards MUST NOT be used. Mandatory. Any other object class + results in a 440 error. + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document SHOULD be unique. + + For example, consider a call center coaching scenario where a + supervisor can listen to the conversation between an agent and a + customer and provide hints to the agent, which are not heard by the + customer. One join establishes a stream between the agent and the + customer and another join establishes a stream between the agent and + the supervisor. A third join is used to establish a half-duplex + stream from the customer to the supervisor. The media server + automatically bridges the media streams from the customer and the + supervisor for the agent, and from the customer and the agent for the + supervisor. + + Assuming the following connections, each with a single audio stream: + + conn:supervisor + + conn:agent + + conn:customer + + + + + + + + + + + +Saleem, et al. 
Informational [Page 44] + +RFC 5707 Media Server Markup Language February 2010 + + + The following would create the media flows previously described: + + + + + + + + + + + The following example shows joining a participant to a multimedia + conference. It assumes that the conference has a video + presentation region named "topright". The "display" attribute is + explained in the section Video Stream Properties. + + + + + + + + + + +8.9. + + Media streams can have different properties such as the gain for an + audio stream or a visual label for a video stream. These properties + are specified as the content of elements (section ). + is used to change the properties of a stream by + including one or more elements that are to have their + properties changed. + + Stream properties MUST be set as specified by the element as + a child element of element. Any properties not + included in the element when modifying a stream MUST remain + unchanged. Setting a property for only one direction of a + bidirectional stream MUST NOT affect the other direction. The + directionality of streams can be changed by issuing an + followed by a . Any streams that exist between the two objects + that are not included within MUST NOT be affected. + + Attributes: + + id1: an identifier of either a conference or a connection. The + instance name MUST NOT contain a wildcard if "id2" contains a + wildcard. Mandatory. + + + +Saleem, et al. Informational [Page 45] + +RFC 5707 Media Server Markup Language February 2010 + + + id2: an identifier of either a conference or a connection. The + instance name MUST NOT contain a wildcard if "id1" contains a + wildcard. Mandatory. + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document is RECOMMENDED to be unique. + +8.10. 
+ + Unjoin removes one or more media streams between two objects. In the + absence of any content in the element, all media streams + between the objects MUST be removed. Individual streams may be + removed by specifying them using elements, while the + unspecified streams MUST NOT be removed. A bidirectional stream is + changed to a unidirectional stream by unjoining the direction that is + no longer required, using the element. Operator elements + MUST NOT be specified within elements when streams are being + unjoined using the element. Any specified stream operators + MUST be ignored. + + and may be used together to move a media stream, such + as from a main conference to a sidebar conference. + + Attributes: + + id1: an identifier of either a conference or a connection. The + instance name MUST NOT contain a wildcard if "id2" contains a + wildcard. Mandatory. + + id2: an identifier of either a conference or a connection. The + instance name MUST NOT contain a wildcard if "id1" contains a + wildcard. Mandatory. + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document SHOULD be unique. + + The following removes a participant from a conference and plays a + leave tone for the remaining participants in the conference. + + + + + + + +Saleem, et al. Informational [Page 46] + +RFC 5707 Media Server Markup Language February 2010 + + + + + + + + + + + +8.11. + + Monitor is a specialized unidirectional join that copies the media + that is destined for a connection object. One example of the use for + may be quality monitoring within a conference. The media + stream may be removed using the element (see the section + ). + + Attributes: + + id1: an identifier of the connection to be monitored. Mandatory. + Any other object class results in a 440 error. 
Wildcards MUST NOT + be used. + + id2: an identifier of the object that is to receive the copy of + the media destined to id1. id2 may be a connection or a + conference. Mandatory. Any other object class results in a 440 + error. Wildcards MUST NOT be used. + + compressed: "true" or "false". Specifies whether the join should + occur before or after compression. When "true", id2 must be a + connection using the same media format as id1 or an error response + (450) is generated. Default is "false". + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all mark attributes within an + MSML document SHOULD be unique. + +8.12. + + Individual streams are specified using the element. They + MAY be included as a child element in any of the stream manipulation + elements , , or . + + + + + + +Saleem, et al. Informational [Page 47] + +RFC 5707 Media Server Markup Language February 2010 + + + The type of the stream is specified using a "media" attribute that + uses values corresponding to the top-level MIME media types as + defined in RFC 2046 [i7]. This specification only addresses audio + and video media. Other specifications may define procedures for + additional types. + + A bidirectional stream is identified when no direction attribute + "dir" is present. A unidirectional stream is identified when a + direction attribute is present. The "dir" attribute MUST have a + value of "from-id1" or "to-id1" depending on the required direction. + These values are relative to the identifier attributes of the parent + element. + + The compressed attribute is used to distinguish the compressed nature + of the stream when necessary. It is implementation specific what is + used when the attribute is not present. Joining compressed streams + acts much like an RTP [i3] relay. 
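As a hedged illustration of the "media" and "dir" attributes just described, the fragment below sketches a join that establishes only one direction of an audio stream. The connection identifiers are hypothetical, and the enclosing msml element with version "1.1" is assumed from the MSML Core Package rather than defined in this section.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<!-- Sketch only: conn:alice and conn:bob are hypothetical identifiers. -->
<msml version="1.1">
   <join id1="conn:alice" id2="conn:bob">
      <!-- "from-id1" is relative to the parent element's id1, so audio
           flows from conn:alice to conn:bob and not in reverse. -->
      <stream media="audio" dir="from-id1"/>
   </join>
</msml>
```

Omitting the "dir" attribute in this sketch would request a bidirectional audio stream between the two connections.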
+
+   The properties of the media streams are specified as the content of
+   <stream> elements when the element is used as a child of <join> or
+   <modifystream>. Stream elements MUST NOT have any content when they
+   are used as a child of <unjoin> to identify specific streams to
+   remove.
+
+   Some properties are defined within MSML as additional attributes or
+   child elements of <stream> that are media type specific. Those for
+   audio streams and video streams are defined in the following two
+   subsections. Operators, viewed as properties of the media stream, MAY
+   be specified as child elements of the <stream> element.
+
+   Attributes:
+
+      media: "audio" or "video". Mandatory.
+
+      dir: "from-id1" or "to-id1".
+
+      compressed: "true" or "false". Specifies whether the stream uses
+      compressed media. Default is implementation specific.
+
+8.12.1. Audio Stream Properties
+
+   Audio mixes can be specified to only mix the N-loudest participants.
+   However, there may be some "preferred" participants that are always
+   able to contribute. When audio streams are joined to a conference
+   that uses N-loudest audio mixing, preferred streams need to be
+   identified.
+
+   A preferred audio stream is identified using the "preferred"
+   attribute. The "preferred" attribute MAY be used for an audio stream
+   that is input to a conference and MUST NOT be used for other streams.
+
+   Additional attributes of the <stream> element for audio streams are:
+
+   Attributes:
+
+      preferred: a boolean value that defines whether the stream does
+      not contend for N-loudest mixing. A value of "true" means that
+      the stream MUST always be mixed while a value of "false" means
+      that the stream MAY contend for mixing into a conference when
+      N-loudest mixing is enabled. Default is "false".
+
+   There are two elements that can be used to change the characteristics
+   of an audio stream as defined below.
+
+8.12.1.1. <gain>
+
+   The <gain> element may be used to adjust the volume of an audio media
+   stream. It may be set to a specific gain amount, to automatically
+   adjust the gain to a desired target level, or to mute the stream.
+
+   Attributes:
+
+      id: an optional identifier that may be referenced elsewhere for
+      sending events to the gain primitive.
+
+      amt: a specific gain to apply, specified in dB, or the string
+      "mute" indicating that the stream should be muted. This attribute
+      MUST NOT be used if "agc" is present.
+
+      agc: boolean indicating whether automatic gain control is to be
+      used. This attribute MUST NOT be used if "amt" is present.
+
+      tgtlvl: the desired target level for AGC specified in dBm0. This
+      attribute MUST be specified if "agc" is set to "true". This
+      attribute MUST NOT be specified if "agc" is not present.
+
+      maxgain: the maximum gain that AGC may apply. Maxgain is
+      specified in dB. This attribute MUST be used if "agc" is present
+      and MUST NOT be used when "agc" is not present.
+
+8.12.1.2. <clamp>
+
+   The <clamp> element is used to filter tones and/or audio-band DTMF
+   from a media stream.
+
+   Attributes:
+
+      dtmf: boolean indicating whether DTMF tones should be removed.
+
+      tone: boolean indicating whether other tones should be removed.
+
+8.12.2. Video Stream Properties
+
+   Video mixes define a presentation that may have multiple regions,
+   such as a quad-split. Each region displays the video from one or
+   more participants. When video streams are joined to such a
+   conference, the region that will display the video needs to be
+   specified as part of the join operation.
+
+   The region that will display the video is specified using the
+   "display" attribute. The "display" attribute MUST be used for a
+   video stream that is input to a conference and MUST NOT be used for
+   other streams.
+   The value of the attribute MUST identify a <region>
+   (see the section <region>) or a <selector> (see the section
+   <selector>) that is defined for the conference. A stream MUST NOT be
+   directly joined to a region that is defined within a selector.
+   Changing the value of the "display" attribute can be used to change
+   where in a video presentation layout a video stream is displayed.
+
+   Additional attributes of the <stream> element for video streams are:
+
+   Attributes:
+
+      display: the identifier of a video layout region or selector that
+      is to be used to display the video stream.
+
+      override: specifies whether or not the given video stream is the
+      override source in the region defined by the "display" attribute.
+      Valid values are "true" or "false". Optional, default value is
+      "false". Only a video stream that is input to a conference can be
+      the override source. A particular region can have at most one
+      override source at a time. The most recently joined video stream
+      with this attribute set to "true" becomes the override source.
+      When an override source is in place, its video is always
+      displayed in the region, regardless of what video selection
+      algorithm (either a selector or continuous presence mode) is
+      configured for that region. Once the override source is cleared,
+      the conference MUST revert to the original video selection
+      algorithm.
+
+8.12.2.1. <visual>
+
+   Some regions of video conferences may display different streams
+   automatically, such as when voice-activated switching is used.
+   Connections MAY also be joined directly without the use of video
+   mixing. In these cases, the <visual> element may be used to define
+   visual display properties for a stream.
+
+   The <visual> element MAY use any of the visual attributes defined for
+   regions (see the section <region>).
+   This allows the visual aspects
+   of regions within a <selector> to be tailored to the selected video
+   stream, or for streams that are directly joined to display a name or
+   logo.
+
+9. MSML Dialog Packages
+
+9.1. Overview
+
+   MSML Dialog Packages define an XML [n2] language for composing
+   complex media objects from a vocabulary of simple media resource
+   objects called primitives. It is primarily a descriptive or
+   declarative language to describe media processing objects. MSML
+   dialogs operate on one or more streams that are identified
+   by the MSML document outside the scope of the MSML Dialog Package.
+
+   MSML dialogs are intended to be used in different environments. As
+   such, the language itself does not define how an MSML dialog is used.
+   Each environment in which an MSML dialog is used must define how it
+   is used, the set of services provided, and the mechanism for passing
+   information between the environment and the MSML dialog. The specific
+   mechanisms used to realize the interface between an MSML dialog and
+   its environment are platform specific.
+
+   MSML Dialog Packages provide two models for access to media resources
+   and service creation building blocks. Both models MAY be used in
+   conjunction with each other in a complementary manner. The first
+   model (referred to as "Media Primitives and Composites", part of the
+   mandatory MSML Dialog Base Package) contains media primitives (such
+   as digit collection and announcements) and composite functions (such
+   as play and collect combined as a single operation). The second
+   model (referred to as "Media Groups", part of the optional MSML
+   Dialog Group Package) allows complex customized interactions between
+   media primitives to be defined, via event passing mechanisms, if
+   required.
+
+      MSML Dialog Core Package
+
+         Defines the core framework over which all MSML Dialog Packages
+         operate.
+
+      MSML Dialog Base Package
+
+         Media Primitives
+         <dtmf> or <collect>
+            DTMF digit collection
+         <play>
+            Playing of Announcements
+         <dtmfgen>
+            Generation of DTMF digits
+         <tonegen>
+            Tone generation
+         <record>
+            Media recording
+
+         Media Composites
+
+            Supports play and collect operation.
+            Composite function with inclusion of play.
+         <record>
+            Supports play and record operation.
+            Composite function with inclusion of play.
+
+      MSML Dialog Group Package
+         <group>
+            Allows grouping of media primitives for parallel
+            execution, with an event exchange mechanism
+            between the media primitives to achieve
+            customized media operations. All the above media
+            primitive elements are accepted within the
+            group.
+
+   The following operations MUST be supported, using the elements
+   described above, with either the MSML Dialog Base Package or the
+   MSML Dialog Group Package.
+
+      Announcement only
+         <play>
+
+      Collection only
+         <dtmf> or <collect>
+
+      Recording only
+         <record>
+
+      Play and Collect
+
+      Play and Record
+
+   Additional MSML Dialog Packages are:
+
+      o MSML Dialog Transform Package
+
+      o MSML Dialog Speech Package
+
+      o MSML Fax Detection Package
+
+      o MSML Fax Send/Receive Package
+
+   MSML dialogs MAY be used to simply expose primitive media resource
+   objects but will be used more often to describe dialog operations and
+   media transformation objects that can be controlled via user
+   interaction.
+
+   MSML dialogs do not contain any computation or flow control
+   constructs. There are no results automatically generated when media
+   operations complete. Results MUST be explicitly requested using a
+   <send> or <exit> element within the definition of the MSML dialog.
+
+9.2. Primitives
+
+   Primitives perform a single function on a media stream or multiple
+   streams such as generating audio/video, recognizing speech or DTMF,
+   or adjusting the gain. They may be composed so that primitives
+   execute concurrently.
Primitives not composed for concurrent + execution MUST simply execute sequentially in the order they occur in + an MSML document. All concurrently executing primitives in the same + MSML object (defined in one MSML document) MAY interact with each + other through events (see MSML Dialog Group Package). + + Primitives are categorized into one of the following descriptive + categories. + + o Recognizers have a media input but no output. They allow + different things within a media stream to be recognized or + detected and for events to be generated based upon received + media. + + + +Saleem, et al. Informational [Page 53] + +RFC 5707 Media Server Markup Language February 2010 + + + o Transformers have one media input and output and may send and + receive events. + + o Sources and sinks generate or consume media. They have either + a media input or a media output but not both. They may receive + and generate events. + + o Composites combine underlying primitives to provide higher- + level user interaction, without the need for specific event- + based exchange between the primitives. The composite elements + provide a simpler mechanism for more commonly used services, + such as play and collect or play and record. + + Primitives may define different media processing behavior (states) + based upon the events that they receive. Primitives that support + different processing states must define their default starting state + and should support the "initial" attribute to allow that state to be + specified when the primitive is instantiated. All primitives must + support the "terminate" event class. 
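As an illustration of the "terminate" event class, the fragments below sketch how a dialog might ask a primitive to stop. They use the event-sending element of the MSML Dialog Core Package with its "target" and "event" attributes. The primitive identifier "greeting" is hypothetical, and both fragments are assumed to sit inside a dialog that actually contains the addressed play primitives.

```xml
<!-- Sketch only: addresses the play primitive by type, which suffices
     when the dialog contains a single play primitive. -->
<send target="play" event="terminate"/>

<!-- Sketch only: addresses one named play primitive; "greeting" is a
     hypothetical identifier appended to the primitive type. -->
<send target="play.greeting" event="terminate"/>
```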
+
+   The following types of primitives are defined within this
+   specification:
+
+   Recognizers    Transformers    Source/Sink    Composites
+   ------------------------------------------------------
+   dtmf/collect   agc             play           dtmf/collect
+   faxdetect      clamp           record         record
+   speech         gain            dtmfgen
+   vad            gate            tonegen
+                  relay           faxsend
+                                  faxrcv
+
+   Primitives have shadow variables, similar to those within VoiceXML
+   [n5], which are automatically assigned values when the primitives are
+   used. Upon initialization of an MSML dialog context, all shadow
+   variables have the string value "undefined". Each primitive has its
+   own instance of shadow variables that are global in scope to the
+   entire MSML dialog context.
+
+   Names SHOULD be assigned to individual primitives when more than one
+   primitive of the same type is used within one MSML document. Shadow
+   variables are overwritten if the primitive has not been named and is
+   instantiated a second time.
+
+   Shadow variables cannot be modified under user control. They may be
+   returned from the MSML dialog context using the <send> element.
+
+9.3. Events
+
+   Events provide the mechanism for primitives to interact with each
+   other and for an MSML context to interact with its external
+   environment. The external environment is defined by the way in which
+   an MSML context has been invoked. This will often be through MSML,
+   but other languages and protocols such as SIP may also be used.
+
+   Every primitive and group conceptually implements its own event
+   queue. Events sent to a primitive or group are placed into its queue.
+   Events are removed from each queue and processed in order.
+   Primitives within a group conceptually have their own thread of
+   execution.
Due to the asynchronous nature of servicing events from + multiple queues, it cannot be assumed that several events sent in + sequence to different queues will be processed in the order in which + they were sent. For example, if recognition of something led to + sending events to both a and a in that order, it is + possible that the may process its event before the . + + Primitives each define the set of events that they support and the + behavior associated with their handling of each event. This allows + many types of behaviors to be defined. For example, VCR type + controls can be constructed by defining primitives that support + events corresponding to each control. Media recognition/detection + can be used to cause those events to be generated. + + Alternatively, events can be originated elsewhere, such as from a + control agent, and simply received by the primitive implementing the + control. Examples of the use of events include adjusting volume + (gain) and pause and resume of both announcement playout and record + creation. + + Primitives act on events based upon the longest match of an event + name. Event names are a period '.' delimited sequence of tokens. + The first token, or the root of the name, can be considered an event + class. Matching allows a standard meaning to be defined and then + extended based upon what triggers an event's generation. For + example, a record primitive has different behavior depending upon + whether it completed because a user stopped speaking or because it + was cancelled. The recording is retained in the first case but not + the second. + + Longest match allows new recognizers to be created and used without + changing how existing primitives are defined. For example, a face + recognition capability could be created that generates a + terminate.frowning event when a user looks puzzled. Although no + primitive directly defines this event, it will still effect a generic + terminate action. 
+   Primitives that require specialized behavior based
+   upon frowning may be extended to support this. As well, the event
+   can still be exported from the MSML context without requiring that
+   primitives receiving the event understand facial expressions.
+
+9.4. MSML Dialog Usage with SIP
+
+   MSML dialogs MAY be used directly with SIP for dialog interactions
+   (e.g., IVR or fax). They can be initially invoked as part of the
+   "Prompt and Collect" service described in "Basic Network Media
+   Services with SIP" [n7]. That document defines service indicators for
+   a small number of well-defined services using the user part of the
+   SIP Request-URI (R-URI).
+
+   The prompt and collect service uses "dialog" as the service
+   indicator. URI parameters further refine the specific IVR request.
+   This document defines an additional parameter "moml-param" for the
+   dialog service indicator as follows:
+
+      dialog-parameters = ";" ( dialog-param [ vxml-parameters ] )
+                              | moml-param
+      dialog-param      = "voicexml=" dialog-url
+      moml-param        = "moml=" moml-url
+
+   There are no additional URI parameters when MSML is used as the
+   dialog language.
+
+   MSML dialogs define discrete IVR dialog commands. These commands MAY
+   be included directly in the body of the INVITE to the "dialog"
+   service indicator by using the "cid" [n8] URL scheme. This scheme
+   identifies a message body part that in this case would contain the
+   MSML dialog request. Note that a multipart message body, containing
+   a single part, MUST be present even if the INVITE does not contain an
+   SDP offer. Subsequent MSML dialog requests are sent in the body of
+   SIP INFO messages, as are all messages from a media server.
+
+   An example of a SIP URI as described above is:
+
+      sip:dialog@mediaserver.example.net;\
+         moml=cid:14864099865376@appserver.example.net
+
+   The body part that contains the MSML dialog referenced by the URL
+   would have a Content-Id header of:
+
+      Content-Id: <14864099865376@appserver.example.net>
+
+   The results of executing an <exit> or <disconnect>, or of executing a
+   <send> that has a "target" attribute value equal to "source", are
+   notified in SIP INFO messages using the <event> element from the MSML
+   Core Package. No messages are sent if execution completes normally
+   without executing one of these elements.
+
+   If there is an error during validation or execution, then a media
+   server MUST notify the error as described above and must include the
+   namelist items "moml.error.status" and "moml.error.description". The
+   values for these items are defined in section 11.
+
+   A restricted subset of MSML dialogs can also be used with the
+   "Announcement" service defined in [n7]. This service uses "annc" as
+   the service indicator and defines parameters that describe an
+   announcement. The "play=" parameter identifies the URL of a prompt
+   or a provisioned announcement sequence. The value of the "play="
+   parameter can refer to an MSML dialog body part using a "cid" URL as
+   described above. That body part must only contain the <play>
+   primitive.
+
+   Using MSML dialogs enhances the announcement service by allowing the
+   client to specify a sequence of audio segments, rather than requiring
+   each sequence to be provisioned, and by adding support for video.
+   Moreover, MSML dialogs define a standard set of variables in contrast
+   to [n7], which defines a parameterization mechanism but does not
+   formally specify any semantics.
+
+   If a media server does not understand the "cid" scheme or does not
+   understand MSML dialogs, it must respond with the SIP response code
+   "488 - not acceptable here".
+   If the MSML dialog body contains
+   elements other than the <play> primitive, or there are errors during
+   validation, a media server must respond with a SIP response code "400
+   - bad request". Finally, if there is a discrepancy between
+   parameters specified in the Request-URI and corresponding attributes
+   defined in the MSML dialog body, the Request-URI parameters must be
+   silently ignored.
+
+   MSML dialogs MUST NOT change the operation of the announcement
+   service from that defined in [n7]. When the announcement completes,
+   a media server issues a SIP BYE request. The INFO method MUST NOT
+   be used with the announcement service.
+
+9.5. MSML Dialog Structure and Modularity
+
+   MSML is structured as a set of packages. Only the core and base
+   packages are required. The Dialog Core Package defines the framework
+   for MSML requests to a media server, without specific functionality.
+   It consists of the "primitive" abstraction, an abstract element for
+   control flow, the sequential execution model, and the <send> element.
+   That is, the MSML Dialog Core Package allows for the execution of a
+   sequence of one or more media processing primitives with the ability
+   to notify events to the invocation environment.
+
+   Primitives are contained within the MSML Dialog Base Package, which
+   defines the basic <play>, <dtmfgen>, <tonegen>, <record>, <dtmf>, and
+   <collect> elements. Another package, the MSML Dialog Transform
+   Package, defines the simple half-duplex filters. More advanced
+   primitives are defined in the speech and fax packages. The MSML
+   speech package depends on the MSML Dialog Base Package as it extends
+   the capability of <play> by adding synthesized speech. Finally, the
+   group execution model, which is currently the only element that
+   changes the flow of control, is defined in a separate MSML Dialog
+   Group Package.
+   All of these packages are optional with the exception
+   that the MSML Dialog Core and MSML Dialog Base Packages MUST be
+   implemented to provide the minimal functionality.
+
+9.6. MSML Dialog Core Package
+
+   The MSML Dialog Core Package defines the structural framework and
+   abstractions for MSML dialogs (via its schema). It also defines the
+   basic elements that are not part of the core primitive or control
+   abstractions. This package is dependent on the MSML Core Package.
+   Events generated by MSML dialogs, such as prompt completion, digits
+   collected, or dialog termination, are communicated by the media
+   server via the MSML Core Package (see MSML Core Package).
+
+   MSML dialogs are executed independently from the MSML core context.
+   When an MSML dialog is started, MSML allocates the dialog control
+   resources, and if successful, starts those resources executing. MSML
+   core execution then continues without waiting for the MSML dialog to
+   complete. This forking of MSML dialog invocation from the MSML core
+   context is done via the <dialogstart> element. Media streams are
+   created between the MSML dialog target and other internal media
+   server resources as part of dialog execution. Stream creation is
+   subject to the requirements defined in the MSML Core Package and
+   media streams as defined by the MSML Conference Core Package.
+
+9.6.1. <dialogstart>
+
+   The <dialogstart> element is used to instantiate an MSML media dialog
+   on connections or conferences. The dialog is specified either inline
+   or by a URI [n6]. Inline dialogs MUST be composed of any of the MSML
+   Dialog Packages. MSML dialogs MAY be defined externally as VoiceXML
+   [n5]. The MSML dialog description MUST NOT be inline if the src
+   attribute, containing a URI, is present.
+
+   The originator of the MSML dialog is notified using a
+   "msml.dialog.exit" event when the dialog completes.
Any results + returned by the dialog when it exits are sent as a namelist to the + event. + + The "msml.dialog.exit" event is also used when dialogs fail due to + errors encountered fetching external documents or errors that occur + within the dialog execution thread. In this case, a namelist + containing the items "dialog.exit.status" and + "dialog.exit.description" is returned with the event to inform the + client of the failure and the failure reason. The values of these + items are defined within this package and the MSML Core Package. + Information from the failed dialog may be returned as additional + namelist items. + + Attributes: + + target: an identifier of a connection or a conference that will + interact with the dialog. The identifier must not contain + wildcards. Mandatory. + + src: the URL of the dialog description. MUST NOT be used if the + MSML dialog description is inline. Otherwise, an error (422) will + result and MSML document execution will stop. + + type: a MIME type that identifies the type of language used to + describe the dialog. application/moml+xml and + application/vxml+xml are used to identify MSML dialogs and + VoiceXML [n5] respectively. Mandatory. + + name: an instance name for the dialog. If the attribute is not + present, the media server will assign an identifier to the dialog. + If the attribute is present but the name is already associated + with the target, an error (431) will result and MSML document + execution will stop. Any results that a dialog generates will be + correlated to its identifier. + + mark: a token that can be used to identify execution progress in + the case of errors. The value of the mark attribute from the last + successfully executed MSML element is returned in an error + response. Therefore, the value of all "mark" attributes within an + MSML document should be unique. 
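The attributes listed above can be sketched as follows. This is a minimal illustration, not an example taken from the specification: the connection identifier, dialog name, mark value, and URL are hypothetical, and the enclosing msml element with version "1.1" is assumed from the MSML Core Package.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<!-- Sketch only: identifier, name, mark, and URL are hypothetical. -->
<msml version="1.1">
   <!-- "src" is present, so the dialog description is not inline. -->
   <dialogstart target="conn:abcd1234"
                type="application/moml+xml"
                name="greeting"
                src="http://server.example.com/dialogs/greeting.moml"
                mark="1"/>
</msml>
```

If the name "greeting" were already associated with the target, the text above indicates that a 431 error would result and execution of the MSML document would stop.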
+
+   The following sections show examples of initiating an external MSML
+   dialog, an inline embedded MSML dialog, and an MSML-initiated
+   VoiceXML dialog.
+
+   The following example starts an MSML dialog on a connection.
+
+   The following example starts an inline embedded MSML dialog on a
+   connection.
+
+   The following example starts a VoiceXML dialog on a connection.
+
+   If this dialog fails after its execution thread has begun (for
+   example, because the fetch of the VoiceXML document failed), the
+   event that is returned would carry namelist items such as:
+
+      dialog.exit.status       423
+      dialog.exit.description  External document fetch error
+
+9.6.2. <dialogend>
+
+   Dialog end is used to terminate an MSML dialog created through
+   <dialogstart> before it completes of its own accord. The operation
+   of <dialogend> depends on the dialog language being used by the
+   executing context. When that context is VoiceXML, a
+   "connection.disconnected" event will be thrown to the VoiceXML
+   application. When that context is an MSML dialog, a "terminate" event
+   will be sent to the MSML core context.
+
+   <dialogend> allows the executing dialog the opportunity to gracefully
+   complete before generating a "msml.dialog.exit" event. Dialog
+   results may be returned and will be contained as a namelist to that
+   event.
+
+   Attributes:
+
+      id: the identifier of a dialog. Mandatory.
+
+      mark: a token that can be used to identify execution progress in
+      the case of errors. The value of the mark attribute from the last
+      successfully executed MSML dialog element is returned in an error
+      response. Therefore, the value of all "mark" attributes within an
+      MSML document should be unique.
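The two attributes above can be sketched in a request such as the one below. The dialog identifier is hypothetical, and its "conn:.../dialog:..." form is an assumption based on MSML identifier conventions defined outside this section; the enclosing msml element is likewise assumed from the MSML Core Package.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<!-- Sketch only: the dialog identifier and mark value are hypothetical. -->
<msml version="1.1">
   <!-- Asks the named dialog to complete; a "msml.dialog.exit" event
        is expected to follow once the dialog has finished. -->
   <dialogend id="conn:abcd1234/dialog:greeting" mark="2"/>
</msml>
```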
+
+   For example, if the dialog from the previous example was still
+   executing, the following would terminate the dialog and generate an
+   "msml.dialog.exit" event.
+
+9.6.3. <send>
+
+   The <send> element sends an event and optional namelist to the
+   recipient identified by the target attribute. Event names are
+   defined by the recipient. In the case where the recipient is an MSML
+   dialog group or primitive, the events are defined within this
+   document. Other recipients MAY use names that are suitable for their
+   environment.
+
+   The "target" attribute specifies the recipient of the event.
+   Recipients MAY be other MSML dialog primitives or groups executing
+   within the object, the object itself, or the environment that invoked
+   the MSML dialog. Sending events to media primitives or groups is
+   supported by the MSML Dialog Group Package. Any target that is
+   unknown within the object is assumed to be destined for the external
+   environment. By convention, the string "source" SHOULD be used to
+   address that environment, but any target name distinct from the MSML
+   dialog namespace MAY be used.
+
+   Attributes:
+
+      event: the name of an event. Mandatory.
+
+      target: the recipient of the event. The recipient MUST be an MSML
+      dialog primitive, the currently executing group, or the MSML
+      dialog environment. A primitive is specified by a primitive type,
+      optionally appended by a period '.' followed by the identifier of
+      a primitive. Identifiers are only needed when more than one
+      primitive of the same type exists in the object. The executing
+      group is specified using the token "group". The environment is
+      specified using the token "source", optionally appended by a
+      period '.' followed by any environment specific target.
+      Mandatory.
+
+      namelist: a list of zero or more shadow variables that are
+      included with the event.
+
+9.6.4. <exit>
+
+   The <exit> element causes execution of the MSML dialog to terminate.
+
+   Attributes:
+
+      namelist: a list of one or more shadow variables that MAY
+      be sent to the context that invoked the MSML dialog
+      object.
+
+9.6.5. <disconnect>
+
+   The <disconnect> element is similar to <exit> but has the additional
+   semantics of indicating to the context that invoked the MSML dialog
+   that it should disconnect the media stream associated with the
+   object from the media server. The method of disconnection depends upon
+   how the media stream was initially established. If SIP was used, a
+   <disconnect> would cause a media server to issue a BYE request. The
+   request would be sent for the SIP dialog associated with the media
+   session on which the MSML dialog was operating.
+
+   Attributes:
+
+      namelist: a list of one or more shadow variables that MAY
+      be sent to the context that invoked the MSML dialog
+      object.
+
+9.7. MSML Dialog Base Package
+
+   The MSML Dialog Base Package defines a required set of base
+   functionality for the media server. It supports individual media
+   primitives, such as playing an announcement or collecting digits, as
+   well as composite operations such as play and collect. When this
+   package is used in conjunction with the MSML Dialog Group Package,
+   the event-based mechanism is used to control primitives. This
+   package may also be used in conjunction with the MSML Speech Package
+   to extend the functionality of prompts to include TTS and user input
+   collection to include ASR.
+
+   In the following sections, subsections of a primitive define child
+   elements of that primitive and are not themselves considered
+   primitives. They do not receive events or populate shadow variables.
+
+9.7.1. <play>
+
+   Play is used to generate an audio or video stream. It MUST play in
+   sequence the media created by the child media elements