Diffstat (limited to 'doc/rfc/rfc5629.txt')
-rw-r--r-- | doc/rfc/rfc5629.txt | 2131 |
1 files changed, 2131 insertions, 0 deletions
diff --git a/doc/rfc/rfc5629.txt b/doc/rfc/rfc5629.txt new file mode 100644 index 0000000..eb0cfd0 --- /dev/null +++ b/doc/rfc/rfc5629.txt @@ -0,0 +1,2131 @@ + + + + + + +Network Working Group J. Rosenberg +Request for Comments: 5629 Cisco Systems +Category: Standards Track October 2009 + + + A Framework for Application Interaction + in the Session Initiation Protocol (SIP) + +Abstract + + This document describes a framework for the interaction between users + and Session Initiation Protocol (SIP) based applications. By + interacting with applications, users can guide the way in which they + operate. The focus of this framework is stimulus signaling, which + allows a user agent (UA) to interact with an application without + knowledge of the semantics of that application. Stimulus signaling + can occur to a user interface running locally with the client, or to + a remote user interface, through media streams. Stimulus signaling + encompasses a wide range of mechanisms, ranging from clicking on + hyperlinks, to pressing buttons, to traditional Dual-Tone Multi- + Frequency (DTMF) input. In all cases, stimulus signaling is + supported through the use of markup languages, which play a key role + in this framework. + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (c) 2009 IETF Trust and the persons identified as the + document authors. All rights reserved. + + This document is subject to BCP 78 and the IETF Trust's Legal + Provisions Relating to IETF Documents + (http://trustee.ietf.org/license-info) in effect on the date of + publication of this document. 
Please review these documents + carefully, as they describe your rights and restrictions with respect + to this document. Code Components extracted from this document must + include Simplified BSD License text as described in Section 4.e of + the Trust Legal Provisions and are provided without warranty as + described in the BSD License. + + + + + +Rosenberg Standards Track [Page 1] + +RFC 5629 App Interaction Framework October 2009 + + + This document may contain material from IETF Documents or IETF + Contributions published or made publicly available before November + 10, 2008. The person(s) controlling the copyright in some of this + material may not have granted the IETF Trust the right to allow + + modifications of such material outside the IETF Standards Process. + Without obtaining an adequate license from the person(s) controlling + the copyright in such materials, this document may not be modified + outside the IETF Standards Process, and derivative works of it may + not be created outside the IETF Standards Process, except to format + it for publication as an RFC or to translate it into languages other + than English. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Rosenberg Standards Track [Page 2] + +RFC 5629 App Interaction Framework October 2009 + + +Table of Contents + + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 + 2. Conventions Used in This Document . . . . . . . . . . . . . . 4 + 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 + 4. A Model for Application Interaction . . . . . . . . . . . . . 7 + 4.1. Functional vs. Stimulus . . . . . . . . . . . . . . . . . 9 + 4.2. Real-Time vs. Non-Real-Time . . . . . . . . . . . . . . . 10 + 4.3. Client-Local vs. Client-Remote . . . . . . . . . . . . . . 10 + 4.4. Presentation-Capable vs. Presentation-Free . . . . . . . . 11 + 5. Interaction Scenarios on Telephones . . . . . . . . . . . . . 11 + 5.1. Client Remote . . . . . . . . . . . . 
. . . . . . . . . . 12 + 5.2. Client Local . . . . . . . . . . . . . . . . . . . . . . . 12 + 5.3. Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . 13 + 6. Framework Overview . . . . . . . . . . . . . . . . . . . . . . 13 + 7. Deployment Topologies . . . . . . . . . . . . . . . . . . . . 16 + 7.1. Third-Party Application . . . . . . . . . . . . . . . . . 16 + 7.2. Co-Resident Application . . . . . . . . . . . . . . . . . 17 + 7.3. Third-Party Application and User Device Proxy . . . . . . 18 + 7.4. Proxy Application . . . . . . . . . . . . . . . . . . . . 19 + 8. Application Behavior . . . . . . . . . . . . . . . . . . . . . 19 + 8.1. Client-Local Interfaces . . . . . . . . . . . . . . . . . 20 + 8.1.1. Discovering Capabilities . . . . . . . . . . . . . . . 20 + 8.1.2. Pushing an Initial Interface Component . . . . . . . . 20 + 8.1.3. Updating an Interface Component . . . . . . . . . . . 22 + 8.1.4. Terminating an Interface Component . . . . . . . . . . 22 + 8.2. Client-Remote Interfaces . . . . . . . . . . . . . . . . . 23 + 8.2.1. Originating and Terminating Applications . . . . . . . 23 + 8.2.2. Intermediary Applications . . . . . . . . . . . . . . 24 + 9. User Agent Behavior . . . . . . . . . . . . . . . . . . . . . 24 + 9.1. Advertising Capabilities . . . . . . . . . . . . . . . . . 24 + 9.2. Receiving User Interface Components . . . . . . . . . . . 25 + 9.3. Mapping User Input to User Interface Components . . . . . 26 + 9.4. Receiving Updates to User Interface Components . . . . . . 27 + 9.5. Terminating a User Interface Component . . . . . . . . . . 27 + 10. Inter-Application Feature Interaction . . . . . . . . . . . . 27 + 10.1. Client-Local UI . . . . . . . . . . . . . . . . . . . . . 28 + 10.2. Client-Remote UI . . . . . . . . . . . . . . . . . . . . . 29 + 11. Intra Application Feature Interaction . . . . . . . . . . . . 29 + 12. Example Call Flow . . . . . . . . . . . . . . . . . . . . . . 30 + 13. Security Considerations . . . . . . . . . . . . 
. . . . . . . 36 + 14. Contributors . . . . . . . . . . . . . . . . . . . . . . . . . 36 + 15. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 36 + 16. References . . . . . . . . . . . . . . . . . . . . . . . . . . 36 + 16.1. Normative References . . . . . . . . . . . . . . . . . . . 36 + 16.2. Informative References . . . . . . . . . . . . . . . . . . 37 + + + + + +Rosenberg Standards Track [Page 3] + +RFC 5629 App Interaction Framework October 2009 + + +1. Introduction + + The Session Initiation Protocol (SIP) [2] provides the ability for + users to initiate, manage, and terminate communications sessions. + Frequently, these sessions will involve a SIP application. A SIP + application is defined as a program running on a SIP-based element + (such as a proxy or user agent) that provides some value-added + function to a user or system administrator. Examples of SIP + applications include prepaid calling card calls, conferencing, and + presence-based [12] call routing. + + In order for most applications to properly function, they need input + from the user to guide their operation. As an example, a prepaid + calling card application requires the user to input their calling + card number, their PIN code, and the destination number they wish to + reach. The process by which a user provides input to an application + is called "application interaction". + + Application interaction can be either functional or stimulus. + Functional interaction requires the user device to understand the + semantics of the application, whereas stimulus interaction does not. + Stimulus signaling allows for applications to be built without + requiring modifications to the user device. Stimulus interaction is + the subject of this framework. The framework provides a model for + how users interact with applications through user interfaces, and how + user interfaces and applications can be distributed throughout a + network. 
This model is then used to describe how applications can + instantiate and manage user interfaces. + +2. Conventions Used in This Document + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", + "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this + document are to be interpreted as described in [1]. + +3. Definitions + + SIP Application: A SIP application is defined as a program running + on a SIP-based element (such as a proxy or user agent) that + provides some value-added function to a user or system + administrator. Examples of SIP applications include prepaid + calling card calls, conferencing, and presence-based [12] call + routing. + + Application Interaction: The process by which a user provides input + to an application. + + Real-Time Application Interaction: Application interaction that + takes place while an application instance is executing. For + example, when a user enters their PIN into a prepaid + calling card application, this is real-time application + interaction. + + Non-Real-Time Application Interaction: Application interaction that + takes place asynchronously with the execution of the application. + Generally, non-real-time application interaction is accomplished + through provisioning. + + Functional Application Interaction: Application interaction is + functional when the user device has an understanding of the + semantics of the interaction with the application. + + Stimulus Application Interaction: Application interaction is + stimulus when the user device has no understanding of the + semantics of the interaction with the application. + + User Interface (UI): The user interface provides the user with + context to make decisions about what they want. The user + interacts with the device, which conveys the user input to the + user interface. 
The user interface interprets the information and + passes it to the application. + + User Interface Component: A piece of user interface that operates + independently of other pieces of the user interface. For example, + a user might have two separate web interfaces to a prepaid calling + card application: one for hanging up and making another call, and + another for entering the username and PIN. + + User Device: The software or hardware system that the user directly + interacts with to communicate with the application. An example of + a user device is a telephone. Another example is a PC with a web + browser. + + User Device Proxy: A software or hardware system that a user + indirectly interacts through to communicate with the application. + This indirection can be through a network. An example is a + gateway from IP to the Public Switched Telephone Network (PSTN). + It acts as a user device proxy, acting on behalf of the user on + the circuit network. + + User Input: The "raw" information passed from a user to a user + interface. Examples of user input include a spoken word or a + click on a hyperlink. + + + + + +Rosenberg Standards Track [Page 5] + +RFC 5629 App Interaction Framework October 2009 + + + Client-Local User Interface: A user interface that is co-resident + with the user device. + + Client-Remote User Interface: A user interface that executes + remotely from the user device. In this case, a standardized + interface is needed between the user device and the user + interface. Typically, this is done through media sessions: audio, + video, or application sharing. + + Markup Language: A markup language describes a logical flow of + presentation of information to the user, collection of information + from the user, and transmission of that information to an + application. + + Media Interaction: A means of separating a user and a user interface + by connecting them with media streams. 
+ + Interactive Voice Response (IVR): An IVR is a type of user interface + that allows users to speak commands to the application, and hear + responses to those commands prompting for more information. + + Prompt-and-Collect: The basic primitive of an IVR user interface. + The user is presented with a voice option, and the user speaks + their choice. + + Barge-In: The act of entering information into an IVR user interface + prior to the completion of a prompt requesting that information. + + Focus: A user interface component has focus when user input is + provided to it, as opposed to any other user interface components. + This is not to be confused with the term "focus" within the SIP + conferencing framework, which refers to the center user agent in a + conference [14]. + + Focus Determination: The process by which the user device determines + which user interface component will receive the user input. + + Focusless Device: A user device that has no ability to perform focus + determination. An example of a focusless device is a telephone + with a keypad. + + Presentation-Capable UI: A user interface that can prompt the user + with input, collect results, and then prompt the user with new + information based on those results. + + + + + + + +Rosenberg Standards Track [Page 6] + +RFC 5629 App Interaction Framework October 2009 + + + Presentation-Free UI: A user interface that cannot prompt the user + with information. + + Feature Interaction: A class of problems that result when multiple + applications or application components are trying to provide + services to a user at the same time. + + Inter-Application Feature Interaction: Feature interactions that + occur between applications. + + DTMF: Dual-Tone Multi-Frequency. DTMF refers to a class of tones + generated by circuit-switched telephony devices when the user + presses a key on the keypad. 
As a result, DTMF and keypad input + are often used synonymously, when in fact one of them (DTMF) is + merely a means of conveying the other (the keypad input) to a + client-remote user interface (the switch, for example). + + Application Instance: A single execution path of a SIP application. + + Originating Application: A SIP application that acts as a User Agent + Client (UAC), making a call on behalf of the user. + + Terminating Application: A SIP application that acts as a User Agent + Server (UAS), answering a call generated by a user. IVR + applications are terminating applications. + + Intermediary Application: A SIP application that is neither the + caller nor the callee, but rather a third party involved in a call. + +4. A Model for Application Interaction + + User --Input--> User Device --Input--> User Interface --Results--> Application + User <--Output-- User Device <--Output-- User Interface <..Update.. Application + + Figure 1: Model for Real-Time Interactions + + Figure 1 presents a general model for how users interact with + applications. Generally, users interact with a user interface + through a user device. A user device can be a telephone, or it can + be a PC with a web browser. Its role is to pass the user input from + the user to the user interface. The user interface provides the user + with context in order to make decisions about what they want. The + user interacts with the device, causing information to be passed from + the device to the user interface. 
The user interface interprets the + information, and passes it as a user interface event to the + application. The application may be able to modify the user + interface based on this event. Whether or not this is possible + depends on the type of user interface. + + User interfaces are fundamentally about rendering and interpretation. + Rendering refers to the way in which the user is provided context. + This can be through hyperlinks, images, sounds, videos, text, and so + on. Interpretation refers to the way in which the user interface + takes the "raw" data provided by the user, and returns the result to + the application as a meaningful event, abstracted from the + particulars of the user interface. As an example, consider a prepaid + calling card application. The user interface worries about details + such as what prompt the user is provided, whether the voice is male + or female, and so on. It is concerned with recognizing the speech + that the user provides, in order to obtain the desired information. + In this case, the desired information is the calling card number, the + PIN code, and the destination number. The application needs that + data, and it doesn't matter to the application whether it was + collected using a male prompt or a female one. + + User interfaces generally have real-time requirements towards the + user. That is, when a user interacts with the user interface, the + user interface needs to react quickly, and that change needs to be + propagated to the user right away. However, the interface between + the user interface and the application need not be that fast. Faster + is better, but the user interface itself can frequently compensate + for long latencies between the user interface and the application. + In the case of a prepaid calling card application, when the user is + prompted to enter their PIN, the prompt should generally stop + immediately once the first digit of the PIN is entered. This is + referred to as "barge-in". 
After the user interface collects the + rest of the PIN, it can tell the user to "please wait while + processing". The PIN can then be gradually transmitted to the + application. In this example, the user interface has compensated for + a slow UI to application interface by asking the user to wait. + + The separation between user interface and application is absolutely + fundamental to the entire framework provided in this document. Its + importance cannot be overstated. + + + +Rosenberg Standards Track [Page 8] + +RFC 5629 App Interaction Framework October 2009 + + + With this basic model, we can begin to taxonomize the types of + systems that can be built. + +4.1. Functional vs. Stimulus + + The first way to taxonomize the system is to consider the interface + between the UI and the application. There are two fundamentally + different models for this interface. In a functional interface, the + user interface has detailed knowledge about the application and is, + in fact, specific to the application. The interface between the two + components is through a functional protocol, capable of representing + the semantics that can be exposed through the user interface. + Because the user interface has knowledge of the application, it can + be optimally designed for that application. As a result, functional + user interfaces are almost always the most user friendly, the + fastest, and the most responsive. However, in order to allow + interoperability between user devices and applications, the details + of the functional protocols need to be specified in standards. This + slows down innovation and limits the scope of applications that can + be built. + + An alternative is a stimulus interface. In a stimulus interface, the + user interface is generic -- that is, totally ignorant of the details + of the application. Indeed, the application may pass instructions to + the user interface describing how it should operate. 
The user + interface translates user input into "stimulus", which are data + understood only by the application, and not by the user interface. + Because they are generic, and because they require communications + with the application in order to change the way in which they render + information to the user, stimulus user interfaces are usually slower, + less user friendly, and less responsive than a functional + counterpart. However, they allow for substantial innovation in + applications, since no standardization activity is needed to build a + new application, as long as it can interact with the user within the + confines of the user interface mechanism. The web is an example of a + stimulus user interface to applications. + + In SIP systems, functional interfaces are provided by extending the + SIP protocol to provide the needed functionality. For example, the + SIP caller preferences specification [15] provides a functional + interface that allows a user to request applications to route the + call to specific types of user agents. Functional interfaces are + important, but are not the subject of this framework. The primary + goal of this framework is to address the role of stimulus interfaces + to SIP applications. + + + + + + +Rosenberg Standards Track [Page 9] + +RFC 5629 App Interaction Framework October 2009 + + +4.2. Real-Time vs. Non-Real-Time + + Application interaction systems can also be real-time or non-real- + time. Non-real-time interaction allows the user to enter information + about application operation asynchronously with its invocation. + Frequently, this is done through provisioning systems. As an + example, a user can set up the forwarding number for a call-forward + on no-answer application using a web page. Real-time interaction + requires the user to interact with the application at the time of its + invocation. + +4.3. Client-Local vs. 
Client-Remote + + Another axis in the taxonomization is whether the user interface is + co-resident with the user device (which we refer to as a client-local + user interface), or the user interface runs in a host separated from + the client (which we refer to as a client-remote user interface). In + a client-remote user interface, there exists some kind of protocol + between the client device and the UI that allows the client to + interact with the user interface over a network. + + The most important way to separate the UI and the client device is + through media interaction. In media interaction, the interface + between the user and the user interface is through media: audio, + video, messaging, and so on. This is the classic mode of operation + for VoiceXML [5], where the user interface (also referred to as the + voice browser) runs on a platform in the network. Users communicate + with the voice browser through the telephone network (or using a SIP + session). The voice browser interacts with the application using + HTTP to convey the information collected from the user. + + In the case of a client-local user interface, the user interface runs + co-located with the user device. The interface between them is + through the software that interprets the user's input and passes it + to the user interface. The classic example of this is the Web. In + the Web, the user interface is a web browser, and the interface is + defined by the HTML document that it's rendering. The user interacts + directly with the user interface running in the browser. The results + of that user interface are sent to the application (running on the + web server) using HTTP. + + It is important to note that whether or not the user interface is + local or remote (in the case of media interaction) is not a property + of the modality of the interface, but rather a property of the + system. 
As an example, it is possible for a Web-based user interface + to be provided with a client-remote user interface. In such a + scenario, video- and application-sharing media sessions can be used + between the user and the user interface. The user interface, still + + + +Rosenberg Standards Track [Page 10] + +RFC 5629 App Interaction Framework October 2009 + + + guided by HTML, now runs "in the network", remote from the client. + Similarly, a VoiceXML document can be interpreted locally by a client + device, with no media streams at all. Indeed, the VoiceXML document + can be rendered using text, rather than media, with no impact on the + interface between the user interface and the application. + + It is also important to note that systems can be hybrid. In a hybrid + user interface, some aspects of it (usually those associated with a + particular modality) run locally, and others run remotely. + +4.4. Presentation-Capable vs. Presentation-Free + + A user interface can be capable of presenting information to the user + (a presentation-capable UI), or it can be capable only of collecting + user input (a presentation-free UI). These are very different types + of user interfaces. A presentation-capable UI can provide the user + with feedback after every input, providing the context for collecting + the next input. As a result, presentation-capable user interfaces + require an update to the information provided to the user after each + input. The Web is a classic example of this. After every input + (i.e., a click), the browser provides the input to the application + and fetches the next page to render. In a presentation-free user + interface, this is not the case. Since the user is not provided with + feedback, these user interfaces tend to merely collect information as + it's entered, and pass it to the application. + + Another difference is that a presentation-free user interface cannot + easily support the concept of a focus. 
Selection of a focus usually + requires a means for informing the user of the available + applications, allowing the user to choose, and then informing them + about which one they have chosen. Without the first and third steps + (which a presentation-free UI cannot provide), focus selection is + very difficult. Without a selected focus, the input provided to + applications through presentation-free user interfaces is more of a + broadcast or notification operation. + +5. Interaction Scenarios on Telephones + + In this section, we apply the model of Section 4 to telephones. + + In a traditional telephone, the user interface consists of a 12-key + keypad, a speaker, and a microphone. Indeed, from here forward, the + term "telephone" is used to represent any device that meets, at a + minimum, the characteristics described in the previous sentence. + Circuit-switched telephony applications are almost universally + client-remote user interfaces. In the Public Switched Telephone + Network (PSTN), there is usually a circuit interface between the user + and the user interface. The user input from the keypad is conveyed + + + +Rosenberg Standards Track [Page 11] + +RFC 5629 App Interaction Framework October 2009 + + + using Dual-Tone Multi-Frequency (DTMF), and the microphone input as + Pulse Code Modulated (PCM) encoded voice. + + In an IP-based system, there is more variability in how the system + can be instantiated. Both client-remote and client-local user + interfaces to a telephone can be provided. + + In this framework, a PSTN gateway can be considered a User Device + Proxy. It is a proxy for the user because it can provide, to a user + interface on an IP network, input taken from a user on a circuit- + switched telephone. The gateway may be able to run a client-local + user interface, just as an IP telephone might. + +5.1. Client Remote + + The most obvious instantiation is the "classic" circuit-switched + telephony model. 
In that model, the user interface runs remotely + from the client. The interface between the user and the user + interface is through media, which is set up by SIP and carried over + the Real-time Transport Protocol (RTP) [18]. The microphone input + can be carried using any suitable voice-encoding algorithm. The + keypad input can be conveyed in one of two ways. The first is to + convert the keypad input to DTMF, and then convey that DTMF using a + suitable encoding algorithm (such as PCMU). An alternative, and + generally the preferred approach, is to transmit the keypad input + using RFC 4733 [19], which provides an encoding mechanism for + carrying keypad input within RTP. + + In this classic model, the user interface would run on a server in + the IP network. It would perform speech recognition and DTMF + recognition to derive the user intent, feed it through the user + interface, and provide the result to an application. + +5.2. Client Local + + An alternative model is for the entire user interface to reside on + the telephone. The user interface can be a VoiceXML browser, running + speech recognition on the microphone input, and feeding the keypad + input directly into the script. As discussed above, the VoiceXML + script could be rendered using text instead of voice, if the + telephone has a textual display. + + For simpler phones without a display, the user interface can be + described by a Keypad Markup Language request document [8]. As the + user enters digits on the keypad, they are passed to the user + interface, which generates user interface events that can be + transported to the application. + +5.3. Flip-Flop + + A middle-ground approach is to flip back and forth between a client- + local and client-remote user interface. 
Many voice applications are + of the type that listen to the media stream and wait for some + specific trigger that kicks off a more complex user interaction. The + long pound in a prepaid calling card application is one example. + Another example is a conference recording application, where the user + can press a key at some point in the call to begin recording. When + the key is pressed, the user hears a whisper to inform them that + recording has started. + + The ideal way to support such an application is to install a client- + local user interface component that waits for the trigger to kick off + the real interaction. Once the trigger is received, the application + connects the user to a client-remote user interface that can play + announcements, collect more information, and so on. + + The benefit of flip-flopping between a client-local and client-remote + user interface is cost. The client-local user interface will + eliminate the need to send media streams into the network just to + wait for the user to press the pound key on the keypad. + + The Keypad Markup Language (KPML) was designed to support exactly + this kind of need [8]. It models the keypad on a phone and allows an + application to be informed when any sequence of keys has been + pressed. However, KPML has no presentation component. Since user + interfaces generally require a response to user input, the + presentation will need to be done using a client-remote user + interface that gets instantiated as a result of the trigger. + + It is tempting to use a hybrid model, where a prompt-and-collect + application is implemented by using a client-remote user interface + that plays the prompts, and a client-local user interface, described + by KPML, that collects digits. However, this only complicates the + application. Firstly, the keypad input will be sent to both the + media stream and the KPML user interface. 
This requires the + application to sort out which user inputs are duplicates, a process + that is very complicated. Secondly, the primary benefit of KPML is + to avoid having a media stream towards a user interface. However, + there is already a media stream for the prompting, so there is no + real savings. + +6. Framework Overview + + In this framework, we use the term "SIP application" to refer to a + broad set of functionality. A SIP application is a program running + on a SIP-based element (such as a proxy or user agent) that provides + + + +Rosenberg Standards Track [Page 13] + +RFC 5629 App Interaction Framework October 2009 + + + some value-added function to a user or system administrator. SIP + applications can execute on behalf of a caller, a called party, or a + multitude of users at once. + + Each application has a number of instances that are executing at any + given time. An instance represents a single execution path for an + application. It is established as a result of some event. That + event can be a SIP event, such as the reception of a SIP INVITE + request, or it can be a non-SIP event, such as a web form post or + even a timer. Application instances also have an end time. Some + instances have a lifetime that is coupled with a SIP transaction or + dialog. For example, a proxy application might begin when an INVITE + arrives, and terminate when the call is answered. Other applications + have a lifetime that spans multiple dialogs or transactions. For + example, a conferencing application instance may exist so long as + there are dialogs connected to it. When the last dialog terminates, + the application instance terminates. Other applications have a + lifetime that is completely decoupled from SIP events. + + It is fundamental to the framework described here that multiple + application instances may interact with a user during a single SIP + transaction or dialog. Each instance may be for the same + application, or different applications. 
Each of the applications may + be completely independent, in that each may be owned by a different + provider, and may not be aware of each other's existence. Similarly, + there may be application instances interacting with the caller, and + instances interacting with the callee, both within the same + transaction or dialog. + + The first step in the interaction with the user is to instantiate one + or more user interface components for the application instance. A + user interface component is a single piece of the user interface that + is defined by a logical flow that is not synchronously coupled with + any other component. In other words, each component runs + independently. + + A user interface component can be instantiated in one of the user + agents in a dialog (for a client-local user interface), or within a + network element (for a client-remote user interface). If a client- + local user interface is to be used, the application needs to + determine whether or not the user agent is capable of supporting a + client-local user interface, and in what format. In this framework, + all client-local user interface components are described by a markup + language. A markup language describes a logical flow of presentation + of information to the user, a collection of information from the + user, and a transmission of that information to an application. + Examples of markup languages include HTML, Wireless Markup Language + (WML), VoiceXML, and the Keypad Markup Language (KPML) [8]. + + + +Rosenberg Standards Track [Page 14] + +RFC 5629 App Interaction Framework October 2009 + + + Unlike an application instance, which has a very flexible lifetime, a + user interface component has a very fixed lifetime. A user interface + component is always associated with a dialog. The user interface + component can be created at any point after the dialog (or early + dialog) is created. However, the user interface component terminates + when the dialog terminates. 
The user interface component can be + terminated earlier by the user agent, and possibly by the + application, but its lifetime never exceeds that of its associated + dialog. + + There are two ways to create a client-local interface component. For + interface components that are presentation capable, the application + sends a REFER [7] request to the user agent. The Refer-To header + field contains an HTTP URI that points to the markup for the user + interface, and the REFER contains a Target-Dialog header field [10] + which identifies the dialog associated with the user interface + component. For user interface components that are presentation free + (such as those defined by KPML), the application sends a SUBSCRIBE + request to the user agent. The body of the SUBSCRIBE request + contains a filter, which, in this case, is the markup that defines + when information is to be sent to the application in a NOTIFY. The + SUBSCRIBE does not contain the Target-Dialog header field, since + equivalent information is conveyed in the Event header field. + + If a user interface component is to be instantiated in the network, + there is no need to determine the capabilities of the device on which + the user interface is instantiated. Presumably, it is on a device on + which the application knows a UI can be created. However, the + application does need to connect the user device to the user + interface. This will require manipulation of media streams in order + to establish that connection. + + The interface between the user interface component and the + application depends on the type of user interface. For presentation- + capable user interfaces, such as those described by HTML and + VoiceXML, HTTP form POST operations are used. For presentation-free + user interfaces, a SIP NOTIFY is used. The differing needs and + capabilities of these two user interfaces, as described in + Section 4.4, are what drives the different choices for the + interactions. 
Since presentation-capable user interfaces require an
+   update to the presentation every time user data is entered, they are
+   a good match for HTTP. Since presentation-free user interfaces
+   merely transmit user input to the application, a NOTIFY is more
+   appropriate.
+
+   Indeed, for presentation-free user interfaces, there are two
+   different modalities of operation. The first is called "one shot".
+   In the one-shot modality, the markup waits for a user to enter some
+   information and, when they do, reports this event to the application.
+   The application then does something, and the markup is no longer
+   used. In the other modality, called "monitor", the markup stays
+   permanently resident, and reports information back to an application
+   until termination of the associated dialog.
+
+7. Deployment Topologies
+
+   This section presents some of the network topologies in which this
+   framework can be instantiated.
+
+7.1. Third-Party Application
+
+                             +-------------+
+                         /---| Application |
+                        /    +-------------+
+                       /
+               SUB/   /  REFER/
+               NOT   /   HTTP
+                    /
+          +--------+    SIP (INVITE)    +-----+
+          |   UI   A--------------------X     |
+          |........|                    | SIP |
+          |  User  |        RTP         | UA  |
+          | Device B--------------------Y     |
+          +--------+                    +-----+
+
+                     Figure 2: Third-Party Topology
+
+   In this topology, the application that is interested in interacting
+   with the users exists outside of the SIP dialog between the user
+   agents. In that case, the application learns about the initiation
+   and termination of the dialog, along with the dialog identifiers,
+   through some out-of-band means. One such possibility is the dialog
+   event package [16]. Dialog information is only revealed to trusted
+   parties, so the application would need to be trusted by one of the
+   users in order to obtain this information.
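
   As an illustration only, the bookkeeping that a third-party
   application performs on out-of-band dialog notifications might be
   sketched as follows. The class and method names are hypothetical,
   and the state names are assumed to follow those reported by the
   dialog event package [16]; neither is defined by this framework.

```python
from dataclasses import dataclass, field

# Dialog states in which a user interface component may exist. The
# framework allows creation once a dialog (or early dialog) is present.
ACTIVE_STATES = {"early", "confirmed"}


@dataclass(frozen=True)
class DialogId:
    """Identifiers a third-party application must learn out of band."""
    call_id: str
    local_tag: str
    remote_tag: str


@dataclass
class DialogRegistry:
    """Hypothetical bookkeeping fed by out-of-band dialog notifications."""
    states: dict = field(default_factory=dict)

    def on_dialog_event(self, dialog: DialogId, state: str) -> None:
        # A terminated dialog also ends every component bound to it,
        # so the entry is simply dropped.
        if state == "terminated":
            self.states.pop(dialog, None)
        else:
            self.states[dialog] = state

    def may_instantiate_ui(self, dialog: DialogId) -> bool:
        # Unknown dialogs and dialogs that are not yet (early)
        # established are not eligible.
        return self.states.get(dialog) in ACTIVE_STATES
```

   An application built this way would consult may_instantiate_ui()
   before sending any SUBSCRIBE or REFER toward the user device.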
+
+   At any point during the dialog, the application can instantiate user
+   interface components on the user device of the caller or callee. It
+   can do this using either SUBSCRIBE or REFER, depending on the type of
+   user interface (presentation capable or presentation free).
+
+7.2. Co-Resident Application
+
+          +--------+    SIP (INVITE)    +-----+
+          |  User  A--------------------X SIP |
+          | Device |        RTP         | UA  |
+          |........B--------------------Y     |
+          |        |      SUB/NOT       | App)|
+          |   UI   A'-------------------X'    |
+          +--------+     REFER/HTTP     +-----+
+
+                     Figure 3: Co-Resident Topology
+
+   In this deployment topology, the application is co-resident with one
+   of the user agents (the one on the right in the picture above). This
+   application can install client-local user interface components on the
+   other user agent, which is acting as the user device. These
+   components can be installed using either SUBSCRIBE, for
+   presentation-free user interfaces, or REFER, for presentation-capable
+   ones. This situation typically arises when the application wishes to
+   install UI components on a presentation-capable user interface. If
+   the only user input is via keypad input, the framework is not needed
+   per se, because the UA/application will receive the input via
+   RFC 4733 in the RTP stream.
+
+   If the application resides in the called party, it is called a
+   "terminating application". If it resides in the calling party, it is
+   called an "originating application".
+
+   This kind of topology is common in protocol converter and gateway
+   applications.
+
+7.3. Third-Party Application and User Device Proxy
+
+                                           +-------------+
+                                       /---| Application |
+                                      /    +-------------+
+                                     /
+                             SUB/   /  REFER/
+                             NOT   /   HTTP
+                                  /
+   +-----+        SIP         +---M----+        SIP         +-----+
+   |     V--------------------C        A--------------------X     |
+   | SIP |                    |   UI   |                    | SIP |
+   | UAa |        RTP         |        |        RTP         | UAb |
+   |     W--------------------D        B--------------------Y     |
+   +-----+                    +--------+                    +-----+
+    User                         User
+    Device                      Device
+                                 Proxy
+
+                  Figure 4: User Device Proxy Topology
+
+   In this deployment topology, there is a third-party application as in
+   Section 7.1. However, instead of installing a user interface
+   component on the end user device, the component is installed in an
+   intermediate device, known as a User Device Proxy. From the
+   perspective of the actual user device (on the left), the User Device
+   Proxy is a client-remote user interface. As such, media, typically
+   transported using RTP (including RFC 4733 for carrying user input),
+   is sent from the user device to the client-remote user interface on
+   the User Device Proxy. As far as the application is concerned, it is
+   installing what it thinks is a client-local user interface on the
+   user device, but it happens to be on a User Device Proxy that looks
+   like the user device to the application.
+
+   The User Device Proxy will need to terminate and re-originate both
+   signaling (SIP) and media traffic towards the actual peer in the
+   conversation. The User Device Proxy is a media relay in the
+   terminology of RFC 3550 [18]. The User Device Proxy will need to
+   monitor the media streams associated with each dialog, in order to
+   convert user input received in the media stream to events reported to
+   the user interface. This can pose a challenge in multi-media
+   systems, where it may be unclear on which media stream the user input
+   is being sent. As discussed in RFC 3264 [20], if a user agent has a
+   single media source and is supporting multiple streams, it is
+   supposed to send that source to all streams.
In cases where there
+   are multiple sources, the mapping is a matter of local policy. In
+   the absence of a way to explicitly identify or request which sources
+   map to which streams, the User Device Proxy will need to do the best
+   job it can. This specification RECOMMENDS that the User Device Proxy
+   monitor the first stream (defined in terms of ordering of media
+   sessions within a session description). As such, user agents SHOULD
+   send their user input on the first stream, absent a policy to direct
+   it otherwise.
+
+7.4. Proxy Application
+
+                          +----------+
+             SUB/NOT      |   App    |      SUB/NOT
+         +--------------->|          |<-----------------+
+         |  REFER/HTTP    |..........|    REFER/HTTP    |
+         |                |   SIP    |                  |
+         |                |  Proxy   |                  |
+         |                +----------+                  |
+         V                 ^        |                   V
+   +----------+            |        |             +----------+
+   |    UI    |   INVITE   |        |   INVITE    |    UI    |
+   |          |------------+        +------------>|          |
+   |..........|                                   |..........|
+   |   SIP    |...................................|   SIP    |
+   |    UA    |                                   |    UA    |
+   +----------+                RTP                +----------+
+    User Device                                    User Device
+
+                  Figure 5: Proxy Application Topology
+
+   In this topology, the application is co-resident with a transaction
+   stateful, record-routing proxy server on the call path between two
+   user devices. The application uses SUBSCRIBE or REFER to install
+   user interface components on one or both user devices.
+
+   This topology is common in routing applications, such as a
+   web-assisted call-routing application.
+
+8. Application Behavior
+
+   The behavior of an application within this framework depends on
+   whether it seeks to use a client-local or client-remote user
+   interface.
+
+8.1. Client-Local Interfaces
+
+   One key component of this framework is support for client-local user
+   interfaces.
+
+8.1.1. Discovering Capabilities
+
+   A client-local user interface can only be instantiated on a user
+   agent if the user agent supports that type of user interface
+   component. Support for client-local user interface components is
+   declared by both the UAC and UAS in their Allow, Accept, Supported,
+   and Allow-Events header fields of dialog-initiating requests and
+   responses. If the Allow header field indicates support for the SIP
+   SUBSCRIBE method, and the Allow-Events header field indicates support
+   for the KPML package [8], and the Supported header field indicates
+   support for the Globally Routable UA URI (GRUU) [9] specification
+   (which, in turn, means that the Contact header field contains a
+   GRUU), it means that the UA can instantiate presentation-free user
+   interface components. In this case, the application can push
+   presentation-free user interface components according to the rules of
+   Section 8.1.2. The specific markup languages that can be supported
+   are indicated in the Accept header field.
+
+   If the Allow header field indicates support for the SIP REFER method,
+   and the Supported header field indicates support for the
+   Target-Dialog header field [10], and the Contact header field
+   contains UA capabilities [6] that indicate support for the HTTP URI
+   scheme, it means that the UA supports presentation-capable user
+   interface components. In this case, the application can push
+   presentation-capable user interface components to the client
+   according to the rules of Section 8.1.2. The specific markups that
+   are supported are indicated in the Accept header field.
+
+   A third-party application that is not present on the call path will
+   not be privy to these header fields in the dialog-initiating requests
+   that pass by. As such, it will need to obtain this capability
+   information in other ways. One way is through the registration event
+   package [21], which can contain user agent capability information
+   provided in REGISTER requests [6].
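
   The two capability checks described in this section can be sketched
   in Python as follows. This is a minimal, non-normative illustration:
   the function names, the dictionary representation of header fields,
   and the simplified comma-splitting parser are assumptions made for
   the example. The Allow-Events header field name follows Section 9.1.

```python
import re


def _tokens(value: str) -> set:
    """Split a comma-separated header value into lower-cased tokens."""
    return {t.strip().lower() for t in value.split(",") if t.strip()}


def _contact_schemes(contact: str) -> set:
    """Extract the URI schemes listed in a 'schemes' Contact parameter."""
    match = re.search(r';\s*schemes="([^"]*)"', contact, re.IGNORECASE)
    return _tokens(match.group(1)) if match else set()


def discover_capabilities(headers: dict) -> dict:
    """Classify a UA from the header fields of a dialog-initiating message."""
    allow = _tokens(headers.get("Allow", ""))
    events = _tokens(headers.get("Allow-Events", ""))
    supported = _tokens(headers.get("Supported", ""))
    contact = headers.get("Contact", "")

    # Presentation-free: SUBSCRIBE method, KPML event package, and GRUU.
    presentation_free = (
        "subscribe" in allow and "kpml" in events and "gruu" in supported
    )
    # Presentation-capable: REFER method, Target-Dialog support, and a
    # Contact that advertises the HTTP URI scheme.
    presentation_capable = (
        "refer" in allow
        and "tdialog" in supported
        and "http" in _contact_schemes(contact)
    )
    return {
        "presentation_free": presentation_free,
        "presentation_capable": presentation_capable,
        # The Accept header field lists the markups the UA can handle.
        "markups": sorted(_tokens(headers.get("Accept", ""))),
    }
```

   A real implementation would rely on its SIP stack's header parser
   rather than on string splitting, which ignores quoting and header
   parameters.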
+ +8.1.2. Pushing an Initial Interface Component + + Generally, we anticipate that interface components will need to be + created at various different points in a SIP session. Clearly, they + will need to be pushed during session setup, or after the session is + established. A user interface component is always associated with a + specific dialog, however. + + + + +Rosenberg Standards Track [Page 20] + +RFC 5629 App Interaction Framework October 2009 + + + An application MUST NOT attempt to push a user interface component to + a user agent until it has determined that the user agent has the + necessary capabilities and a dialog has been created. In the case of + a UAC, this means that an application MUST NOT push a user interface + component for an INVITE-initiated dialog until the application has + seen a request confirming the receipt of a dialog-creating response. + This could be an ACK for a 200 OK, or a PRACK for a provisional + response [3]. For SUBSCRIBE-initiated dialogs, the application MUST + NOT push a user interface component until the application has seen a + 200 OK to the NOTIFY request. For a user interface component on a + UAS, the application MUST NOT push a user interface component for an + INVITE-initiated dialog until it has seen a dialog-creating response + from the UAS. For a SUBSCRIBE-initiated dialog, it MUST NOT push a + user interface component until it has seen a NOTIFY request from the + notifier. + + To create a presentation-capable UI component on the UA, the + application sends a REFER request to the UA. This REFER MUST be sent + to the GRUU [9] advertised by that UA in the Contact header field of + the dialog-initiating request or response sent by that UA. Note that + this REFER request creates a separate dialog between the application + and the UA. The Refer-To header field of the REFER request MUST + contain an HTTP URI that references the markup document to be + fetched. 
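
   A non-normative sketch of how an application might assemble such a
   REFER is shown below. The function and parameter names are
   hypothetical, and header fields that a SIP stack would normally
   supply (Via, Max-Forwards, Contact) are omitted, so the output is
   not a complete request. The Target-Dialog and Require header fields
   anticipate the correlation requirement stated in the next paragraph.
   The tag ordering in Target-Dialog reflects our reading of [10], in
   which the tags are given from the point of view of the UA receiving
   the REFER. Accepting "https" as well as "http" is an assumption.

```python
import secrets


def build_refer(gruu, markup_url, app_name, app_uri,
                target_call_id, ua_tag, peer_tag):
    """Return the text of a REFER that pushes a presentation-capable UI."""
    # Assumption: both http and https URIs are treated as acceptable.
    if not markup_url.startswith(("http://", "https://")):
        raise ValueError("Refer-To must carry an HTTP URI")
    lines = [
        # The request is addressed to the GRUU taken from the UA's Contact.
        f"REFER {gruu} SIP/2.0",
        f'From: "{app_name}" <{app_uri}>;tag={secrets.token_hex(8)}',
        f"To: <{gruu}>",
        # The REFER opens its own dialog, so it gets a fresh Call-ID.
        f"Call-ID: {secrets.token_hex(16)}",
        "CSeq: 1 REFER",
        f"Refer-To: <{markup_url}>",
        # Ties the new component to the existing dialog.
        f"Target-Dialog: {target_call_id}"
        f";local-tag={ua_tag};remote-tag={peer_tag}",
        "Require: tdialog",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"
```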
+ + Furthermore, it is essential for the REFER request to be correlated + with the dialog to which the user interface component will be + associated. This is necessary for authorization and for terminating + the user interface components when the dialog terminates. To provide + this context, the REFER request MUST contain a Target-Dialog header + field identifying the dialog with which the user interface component + is associated. As discussed in [10], this request will also contain + a Require header field with the tdialog option tag. + + To create a presentation-free user interface component, the + application sends a SUBSCRIBE request to the UA. The SUBSCRIBE MUST + be sent to the GRUU advertised by the UA. This SUBSCRIBE request + creates a separate dialog. The SUBSCRIBE request MUST use the KPML + [8] event package. The body of the SUBSCRIBE request contains the + markup document that defines the conditions under which the + application wishes to be notified of user input. + + In both cases, the REFER or SUBSCRIBE request SHOULD include a + display name in the From header field that identifies the name of the + application. For example, a prepaid calling card might include a + From header field that looks like: + + + + + +Rosenberg Standards Track [Page 21] + +RFC 5629 App Interaction Framework October 2009 + + + From: "Prepaid Calling Card" <sip:prepaid@example.com> + + Any of the SIP identity assertion mechanisms that have been defined, + such as [11] and [13], are applicable to these requests as well. + +8.1.3. Updating an Interface Component + + Once a user interface component has been created on a client, it can + be updated. The means for updating it depends on the type of UI + component. + + Presentation-capable UI components are updated using techniques + already in place for those markups. In particular, user input will + cause an HTTP POST operation to push the user input to the + application. 
The result of the POST operation is a new markup that + the UI is supposed to use. This allows the UI to be updated in + response to user action. Some markups, such as HTML, provide the + ability to force a refresh after a certain period of time, so that + the UI can be updated without user input. Those mechanisms can be + used here as well. However, there is no support for an asynchronous + push of an updated UI component from the application to the user + agent. A new REFER request to the same GRUU would create a new UI + component rather than update any components already in place. + + For presentation-free UI, the story is different. The application + MAY update the filter at any time by generating a SUBSCRIBE refresh + with the new filter. The UA will immediately begin using this new + filter. + +8.1.4. Terminating an Interface Component + + User interface components have a well-defined lifetime. They are + created when the component is first pushed to the client. User + interface components are always associated with the SIP dialog on + which they were pushed. As such, their lifetime is bound by the + lifetime of the dialog. When the dialog ends, so does the interface + component. + + However, there are some cases where the application would like to + terminate the user interface component before its natural termination + point. For presentation-capable user interfaces, this is not + possible. For presentation-free user interfaces, the application MAY + terminate the component by sending a SUBSCRIBE with Expires equal to + zero. This terminates the subscription, which removes the UI + component. + + A client can remove a UI component at any time. For presentation- + capable UI, this is analogous to the user dismissing the web form + + + +Rosenberg Standards Track [Page 22] + +RFC 5629 App Interaction Framework October 2009 + + + window. There is no mechanism provided for reporting this kind of + event to the application. 
The application MUST be prepared to time + out and never receive input from a user. The duration of this + timeout is application dependent. For presentation-free user + interfaces, the UA can explicitly terminate the subscription. This + will result in the generation of a NOTIFY with a Subscription-State + header field equal to "terminated". + +8.2. Client-Remote Interfaces + + As an alternative to, or in conjunction with client-local user + interfaces, an application can make use of client-remote user + interfaces. These user interfaces can execute co-resident with the + application itself (in which case no standardized interfaces between + the UI and the application need to be used), or they can run + separately. This framework assumes that the user interface runs on a + host that has a sufficient trust relationship with the application. + As such, the means for instantiating the user interface is not + considered here. + + The primary issue is to connect the user device to the remote user + interface. Doing so requires the manipulation of media streams + between the client and the user interface. Such manipulation can + only be done by user agents. There are two types of user agent + applications within this framework: originating/terminating + applications, and intermediary applications. + +8.2.1. Originating and Terminating Applications + + Originating and terminating applications are applications that are + themselves the originator or the final recipient of a SIP invitation. + They are "pure" user agent applications, not back-to-back user + agents. The classic example of such an application is an interactive + voice response (IVR) application, which is typically a terminating + application. It is a terminating application because the user + explicitly calls it; i.e., it is the actual called party. An example + of an originating application is a wakeup call application, which + calls a user at a specified time in order to wake them up. 
+ + Because originating and terminating applications are a natural + termination point of the dialog, manipulation of the media session by + the application is trivial. Traditional SIP techniques for adding + and removing media streams, modifying codecs, and changing the + address of the recipient of the media streams can be applied. + + + + + + + +Rosenberg Standards Track [Page 23] + +RFC 5629 App Interaction Framework October 2009 + + +8.2.2. Intermediary Applications + + Intermediary applications are, at the same time, more common than + originating/terminating applications and more complex. Intermediary + applications are applications that are neither the actual caller nor + the called party. Rather, they represent a "third party" that wishes + to interact with the user. The classic example is the ubiquitous + prepaid calling card application. + + In order for the intermediary application to add a client-remote user + interface, it needs to manipulate the media streams of the user agent + to terminate on that user interface. This also introduces a + fundamental feature interaction issue. Since the intermediary + application is not an actual participant in the call, the user will + need to interact with both the intermediary application and its peer + in the dialog. Doing both at the same time is complicated and is + discussed in more detail in Section 10. + +9. User Agent Behavior + +9.1. Advertising Capabilities + + In order to participate in applications that make use of stimulus + interfaces, a user agent needs to advertise its interaction + capabilities. + + If a user agent supports presentation-capable user interfaces, it + MUST support the REFER method. It MUST include, in all dialog- + initiating requests and responses, an Allow header field that + includes the REFER method. 
The user agent MUST support the target + dialog specification [10], and MUST include the "tdialog" option tag + in the Supported header field of dialog-forming requests and + responses. Furthermore, the UA MUST support the SIP user agent + capabilities specification [6]. The UA MUST be capable of being + REFERed to an HTTP URI. It MUST include, in the Contact header field + of its dialog-initiating requests and responses, a "schemes" Contact + header field parameter that includes the HTTP URI scheme. The UA + MUST include, in all dialog-initiating requests and responses, an + Accept header field listing all of those markups supported by the UA. + It is RECOMMENDED that all user agents that support presentation- + capable user interfaces support HTML. + + If a user agent supports presentation-free user interfaces, it MUST + support the SUBSCRIBE [4] method. It MUST support the KPML [8] event + package. It MUST include, in all dialog-initiating requests and + responses, an Allow header field that includes the SUBSCRIBE method. + It MUST include, in all dialog-initiating requests and responses, an + Allow-Events header field that lists the KPML event package. The UA + + + +Rosenberg Standards Track [Page 24] + +RFC 5629 App Interaction Framework October 2009 + + + MUST include, in all dialog-initiating requests and responses, an + Accept header field listing those event filters it supports. At a + minimum, a UA MUST support the "application/kpml-request+xml" MIME + type. + + For either presentation-free or presentation-capable user interfaces, + the user agent MUST support the GRUU [9] specification. The Contact + header field in all dialog-initiating requests and responses MUST + contain a GRUU. The UA MUST include a Supported header field that + contains the "gruu" option tag and the "tdialog" option tag. 
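
   The advertising rules of this section could be collected into a
   single helper, sketched below in Python. This is an illustration
   under stated assumptions rather than a definitive implementation:
   the function name, the base list of methods placed in Allow, and the
   inclusion of "sip" next to "http" in the "schemes" parameter are all
   choices made for the example.

```python
def capability_headers(gruu, presentation_capable, presentation_free,
                       markups=("text/html",)):
    """Build capability header fields for a dialog-initiating message."""
    allow = ["INVITE", "ACK", "BYE", "CANCEL"]  # assumed baseline methods
    supported = []
    accept = []
    headers = {}
    contact = f"<{gruu}>"  # the Contact must carry a GRUU

    if presentation_capable or presentation_free:
        # Either interface type requires GRUU support; the text asks for
        # the "tdialog" option tag alongside "gruu".
        supported += ["gruu", "tdialog"]

    if presentation_capable:
        allow.append("REFER")
        # Advertise that the UA can be REFERed to an HTTP URI.
        contact += ';schemes="http,sip"'
        accept += list(markups)

    if presentation_free:
        allow.append("SUBSCRIBE")
        headers["Allow-Events"] = "kpml"
        # Minimum event filter type a presentation-free UA must accept.
        accept.append("application/kpml-request+xml")

    headers["Allow"] = ", ".join(allow)
    headers["Contact"] = contact
    if supported:
        headers["Supported"] = ", ".join(supported)
    if accept:
        headers["Accept"] = ", ".join(accept)
    return headers
```

   The default markup list contains only HTML, reflecting the
   recommendation that presentation-capable user agents support it.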
+ + Because these headers are examined by proxies that may be executing + applications, a UA that wishes to support client-local user + interfaces should not encrypt them. + +9.2. Receiving User Interface Components + + Once the UA has created a dialog (in either the early or confirmed + states), it MUST be prepared to receive a SUBSCRIBE or REFER request + against its GRUU. If the UA receives such a request prior to the + establishment of a dialog, the UA MUST reject the request. + + A user agent SHOULD attempt to authenticate the sender of the + request. The sender will generally be an application; therefore, the + user agent is unlikely to ever have a shared secret with it, making + digest authentication useless. However, authenticated identities can + be obtained through other means, such as the Identity mechanism [11]. + + A user agent MAY have pre-defined authorization policies that permit + applications which have authenticated themselves with a particular + identity to push user interface components. If such a set of + policies is present, it is checked first. If the application is + authorized, processing proceeds. + + If the application has authenticated itself but is not explicitly + authorized or blocked, this specification RECOMMENDS that the + application be automatically authorized if it can prove that it was + either on the call path, or is trusted by one of the elements on the + call path. An application proves this to the user agent by + demonstrating that it knows the dialog identifiers. That occurs by + including them in a Target-Dialog header field for REFER requests, or + in the Event header field parameters of the KPML SUBSCRIBE request. + + Because the dialog identifiers serve as a tool for authorization, a + user agent compliant to this framework SHOULD use dialog identifiers + that are cryptographically random, with at least 128 bits of + randomness. 
It is recommended that this randomness be split between + the Call-ID and From header field tags in the case of a UAC. + + + +Rosenberg Standards Track [Page 25] + +RFC 5629 App Interaction Framework October 2009 + + + Furthermore, to ensure that only applications resident in or trusted + by on-path elements can instantiate a user interface component, a + user agent compliant to this specification SHOULD use the Session + Initiation Protocol Secure (SIPS) URI scheme for all dialogs it + initiates. This will guarantee secure links between all the elements + on the signaling path. + + If the dialog was not established with a SIPS URI, or the user agent + did not choose cryptographically random dialog identifiers, then the + application MUST NOT automatically be authorized, even if it + presented valid dialog identifiers. A user agent MAY apply any other + policies in addition to (but not instead of) the ones specified here + in order to authorize the creation of the user interface component. + One such mechanism would be to prompt the user, informing them of the + identity of the application and the dialog it is associated with. If + an authorization policy requires user interaction, the user agent + SHOULD respond to the SUBSCRIBE or REFER request with a 202. In the + case of SUBSCRIBE, if authorization is not granted, the user agent + SHOULD generate a NOTIFY to terminate the subscription. In the case + of REFER, the user agent MUST NOT act upon the URI in the Refer-To + header field until user authorization is obtained. + + If an application does not present a valid dialog identifier in its + REFER or SUBSCRIBE request, the user agent MUST reject the request + with a 403 response. + + If a REFER request to an HTTP URI is authorized, the UA executes the + URI and fetches the content to be rendered to the user. This + instantiates a presentation-capable user interface component. 
If a
+   SUBSCRIBE is authorized, a presentation-free user interface
+   component is instantiated.
+
+9.3. Mapping User Input to User Interface Components
+
+   Once the user interface components are instantiated, the user agent
+   must direct user input to the appropriate component. In the case of
+   presentation-capable user interfaces, this process is known as focus
+   selection. It is done by means that are specific to the user
+   interface on the device. In the case of a PC, for example, the
+   window manager would allow the user to select the appropriate user
+   interface component to which their input is directed.
+
+   For presentation-free user interfaces, the situation is more
+   complicated. In some cases, the device may support a mechanism that
+   allows the user to select a "line", and thus the associated dialog.
+   Any user input on the keypad while this line is selected is fed to
+   the user interface components associated with that dialog.
+   Otherwise, for client-local user interfaces, the user input is
+   assumed to be associated with all user interface components. For
+   client-remote user interfaces, the user device converts the user
+   input to media, typically conveyed using RFC 4733, and sends this to
+   the client-remote user interface. This user interface then needs to
+   map user input from potentially many media streams into user
+   interface events. The process for doing this is described in
+   Section 7.3.
+
+9.4. Receiving Updates to User Interface Components
+
+   For presentation-capable user interfaces, updates to the user
+   interface occur in ways specific to that user interface component.
+   In the case of HTML, for example, the document can tell the client to
+   fetch a new document periodically. However, this framework does not
+   provide any additional machinery to asynchronously push a new user
+   interface component to the client.
+ + For presentation-free user interfaces, an application can push an + update to a component by sending a SUBSCRIBE refresh with a new + filter. The user agent will process these according to the rules of + the event package. + +9.5. Terminating a User Interface Component + + Termination of a presentation-capable user interface component is a + trivial procedure. The user agent merely dismisses the window (or + its equivalent). The fact that the component is dismissed is not + communicated to the application. As such, it is purely a local + matter. + + In the case of a presentation-free user interface, the user might + wish to cease interacting with the application. However, most + presentation-free user interfaces will not have a way for the user to + signal this through the device. If such a mechanism did exist, the + UA SHOULD generate a NOTIFY request with a Subscription-State header + field equal to "terminated" and a reason of "rejected". This tells + the application that the component has been removed and that it + should not attempt to re-subscribe. + +10. Inter-Application Feature Interaction + + The inter-application feature interaction problem is inherent to + stimulus signaling. Whenever there are multiple applications, there + are multiple user interfaces. The system has to determine to which + user interface any particular input is destined. That question is + the essence of the inter-application feature interaction problem. + + + + +Rosenberg Standards Track [Page 27] + +RFC 5629 App Interaction Framework October 2009 + + + Inter-application feature interaction is not an easy problem to + resolve. For now, we consider separately the issues for client-local + and client-remote user interface components. + +10.1. Client-Local UI + + When the user interface itself resides locally on the client device, + the feature interaction problem is actually much simpler. 
The end + device knows explicitly about each application, and therefore can + present the user with each one separately. When the user provides + input, the client device can determine to which user interface the + input is destined. The user interface to which input is destined is + referred to as the "application in focus", and the means by which the + focused application is selected is called "focus determination". + + Generally speaking, focus determination is purely a local operation. + In the PC universe, focus determination is provided by window + managers. Each application does not know about focus; it merely + receives the user input that has been targeted to it when it's in + focus. This basic concept applies to SIP-based applications as well. + + Focus determination will frequently be trivial, depending on the user + interface type. Consider a user that makes a call from a PC. The + call passes through a prepaid calling card application and a call- + recording application. Both of these wish to interact with the user. + Both push an HTML-based user interface to the user. On the PC, each + user interface would appear as a separate window. The user interacts + with the call-recording application by selecting its window, and with + the prepaid calling card application by selecting its window. Focus + determination is literally provided by the PC window manager. It is + clear to which application the user input is targeted. + + As another example, consider the same two applications, but on a + "smart phone" that has a set of buttons, and next to each button, + there is an LCD display that can provide the user with an option. + This user interface can be represented using the Wireless Markup + Language (WML), for example. + + The phone would allocate some number of buttons to each application. + The prepaid calling card would get one button for its "hangup" + command, and the recording application would get one for its "start/ + stop" command. 
The user can easily determine which application to + interact with by pressing the appropriate button. Pressing a button + determines focus and provides user input, both at the same time. + + Unfortunately, not all devices will have these advanced displays. A + PSTN gateway, or a basic IP telephone, may only have a 12-key keypad. + The user interfaces for these devices are provided through the Keypad + Markup Language (KPML). Considering once again the feature + interaction case above, the prepaid calling card application and the + call-recording application would both pass a KPML document to the + device. When the user presses a button on the keypad, to which + document does the input apply? The device does not allow the user to + select. A device where the user cannot provide focus is called a + "focusless device". This is quite a hard problem to solve. This + framework does not make any explicit normative recommendation, but it + concludes that the best option is to send the input to both user + interfaces unless the markup in one interface has indicated that it + should be suppressed from others. This is a sensible choice by + analogy -- it's exactly what the existing circuit-switched telephone + network will do. It is an explicit non-goal to provide a better + mechanism for feature interaction resolution than the PSTN on devices + that have the same user interface as they do on the PSTN. Devices + with better displays, such as PCs or screen phones, can benefit from + the capabilities of this framework, allowing the user to determine + which application they are interacting with. + + Indeed, when a user provides input on a focusless device, the input + must be passed to all client-local user interfaces AND all client- + remote user interfaces, unless the markup tells the UI to suppress + the media. 
In the case of KPML, key events are passed to remote user + interfaces by encoding them as described in RFC 4733 [19]. Of + course, since a client cannot determine whether or not a media stream + terminates in a remote user interface, these key events are passed in + all audio media streams unless the KPML request document is used to + suppress them. + +10.2. Client-Remote UI + + When the user interfaces run remotely, the determination of focus can + be much, much harder. There are many architectures that can be + deployed to handle the interaction. None are ideal. However, all + are beyond the scope of this specification. + +11. Intra Application Feature Interaction + + An application can instantiate a multiplicity of user interface + components. For example, a single application can instantiate two + separate HTML components and one WML component. Furthermore, an + application can instantiate both client-local and client-remote user + interfaces. + + The feature interaction issues between these components within the + same application are less severe. If an application has multiple + client user interface components, their interaction is resolved + identically to the inter-application case -- through focus + determination. However, the problems in focusless user devices (such + as a keypad on a telephone) generally won't exist, since the + application can generate user interfaces that do not overlap in their + usage of an input. + + The real issue is that the optimal user experience frequently + requires some kind of coupling between the differing user interface + components. This is a classic problem in multi-modal user + interfaces, such as those described by Speech Application Language + Tags (SALT). 
As an example, consider a user interface where a user + can either press a labeled button to make a selection, or listen to a + prompt, and speak the desired selection. Ideally, when the user + presses the button, the prompt should cease immediately, since both + of them were targeted at collecting the same information in parallel. + Such interactions are best handled by markups that natively support + such interactions, such as SALT, and thus require no explicit support + from this framework. + +12. Example Call Flow + + This section shows the operation of a call-recording application. + This application allows a user to record the media in their call by + clicking on a button in a web form. The application uses a + presentation-capable user interface component that is pushed to the + caller. The conventions of [17] are used to describe representation + of long message lines. + + + + + + + + + + + + + + + + + + + + + + + + + +Rosenberg Standards Track [Page 30] + +RFC 5629 App Interaction Framework October 2009 + + + A Recording App B + |(1) INVITE | | + |----------------------->| | + | |(2) INVITE | + | |----------------------->| + | |(3) 200 OK | + | |<-----------------------| + |(4) 200 OK | | + |<-----------------------| | + |(5) ACK | | + |----------------------->| | + | |(6) ACK | + | |----------------------->| + |(7) REFER | | + |<-----------------------| | + |(8) 200 OK | | + |----------------------->| | + |(9) NOTIFY | | + |----------------------->| | + |(10) 200 OK | | + |<-----------------------| | + |(11) HTTP GET | | + |----------------------->| | + |(12) 200 OK | | + |<-----------------------| | + |(13) NOTIFY | | + |----------------------->| | + |(14) 200 OK | | + |<-----------------------| | + |(15) HTTP POST | | + |----------------------->| | + |(16) 200 OK | | + |<-----------------------| | + + Figure 6 + + First, the caller, A, sends an INVITE to set up a call (message 1). 
+ Since the caller supports the framework and can handle presentation- + capable user interface components, it includes the Supported header + field indicating that the GRUU extension and the Target-Dialog header + field are understood, the Allow header field indicating that REFER is + understood, and the Contact header field that includes the "schemes" + header field parameter. + + + + + + + + +Rosenberg Standards Track [Page 31] + +RFC 5629 App Interaction Framework October 2009 + + + INVITE sip:B@example.com SIP/2.0 + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz8 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.org> + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 INVITE + Max-Forwards: 70 + Supported: gruu, tdialog + Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, REFER + Accept: application/sdp, text/html + <allOneLine> + Contact: <sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6>;schemes="http,sip" + </allOneLine> + Content-Length: ... + Content-Type: application/sdp + + --SDP not shown-- + + The proxy acts as a recording server, and forwards the INVITE to the + called party (message 2). It strips the Record-Route it would + normally insert due to the presence of the GRUU in the INVITE: + + INVITE sip:B@pc.example.com SIP/2.0 + Via: SIP/2.0/TLS app.example.com;branch=z9hG4bK97sh + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz8 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.org> + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 INVITE + Max-Forwards: 70 + Supported: gruu, tdialog + Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, REFER + Accept: application/sdp, text/html + <allOneLine> + Contact: <sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6>;schemes="http,sip" + </allOneLine> + Content-Length: ... + Content-Type: application/sdp + + --SDP not shown-- + + B accepts the call with a 200 OK (message 3). 
It does not support + the framework, so the various header fields are not present. + + + + + + +Rosenberg Standards Track [Page 32] + +RFC 5629 App Interaction Framework October 2009 + + + SIP/2.0 200 OK + Via: SIP/2.0/TLS app.example.com;branch=z9hG4bK97sh + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz8 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.com>;tag=7777 + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 INVITE + Contact: <sip:B@pc.example.com> + Content-Length: ... + Content-Type: application/sdp + + --SDP not shown-- + + This 200 OK is passed back to the caller (message 4): + + SIP/2.0 200 OK + Record-Route: <sip:app.example.com;lr> + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz8 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.com>;tag=7777 + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 INVITE + Contact: <sip:B@pc.example.com> + Content-Length: ... + Content-Type: application/sdp + + --SDP not shown-- + + The caller generates an ACK (message 5). + + ACK sip:B@pc.example.com + Route: <sip:app.example.com;lr> + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz9 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.com>;tag=7777 + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 ACK + + The ACK is forwarded to the called party (message 6). + + ACK sip:B@pc.example.com + Via: SIP/2.0/TLS app.example.com;branch=z9hG4bKh7s + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9zz9 + From: Caller <sip:A@example.com>;tag=kkaz- + To: Callee <sip:B@example.com>;tag=7777 + Call-ID: fa77as7dad8-sd98ajzz@host.example.com + CSeq: 1 ACK + + + + +Rosenberg Standards Track [Page 33] + +RFC 5629 App Interaction Framework October 2009 + + + Now, the application decides to push a user interface component to + user A. 
So, it sends A a REFER request (message 7): + + <allOneLine> + REFER sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6 SIP/2.0 + </allOneLine> + Refer-To: https://app.example.com/script.pl + Target-Dialog: fa77as7dad8-sd98ajzz@host.example.com + ;remote-tag=7777;local-tag=kkaz- + Require: tdialog + Via: SIP/2.0/TLS app.example.com;branch=z9hG4bK9zh6 + Max-Forwards: 70 + From: Recorder Application <sip:app.example.com>;tag=jhgf + <allOneLine> + To: Caller <sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6> + </allOneLine> + Require: tdialog + Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, REFER + Call-ID: 66676776767@app.example.com + CSeq: 1 REFER + Event: refer + Contact: <sip:app.example.com> + + Since the recording application is the same as the authoritative + proxy for the domain, it resolves the Request URI to the registered + contact of A, and the request is then sent there. The REFER is + answered by a 200 OK (message 8). + + SIP/2.0 200 OK + Via: SIP/2.0/TLS app.example.com;branch=z9hG4bK9zh6 + From: Recorder Application <sip:app.example.com>;tag=jhgf + To: Caller <sip:A@example.com>;tag=pqoew + Call-ID: 66676776767@app.example.com + Supported: gruu, tdialog + Allow: INVITE, OPTIONS, BYE, CANCEL, ACK, REFER + <allOneLine> + Contact: <sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6>;schemes="http,sip" + </allOneLine> + CSeq: 1 REFER + + User A sends a NOTIFY (message 9): + + NOTIFY sip:app.example.com SIP/2.0 + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9320394238995 + To: Recorder Application <sip:app.example.com>;tag=jhgf + From: Caller <sip:A@example.com>;tag=pqoew + Call-ID: 66676776767@app.example.com + CSeq: 1 NOTIFY + Max-Forwards: 70 + <allOneLine> + Contact: <sip:A@example.com;gr=urn:uuid:f81d4fae + -7dec-11d0-a765-00a0c91e6bf6>;schemes="http,sip" + </allOneLine> + Event: refer;id=93809824 + 
 Subscription-State: active;expires=3600 + Content-Type: message/sipfrag;version=2.0 + Content-Length: 20 + + SIP/2.0 100 Trying + + And the recording server responds with a 200 OK (message 10). + + SIP/2.0 200 OK + Via: SIP/2.0/TLS host.example.com;branch=z9hG4bK9320394238995 + To: Recorder Application <sip:app.example.com>;tag=jhgf + From: Caller <sip:A@example.com>;tag=pqoew + Call-ID: 66676776767@app.example.com + CSeq: 1 NOTIFY + + The REFER request contained a Target-Dialog header field + with a valid dialog identifier. Furthermore, all of the signaling + was over TLS and the dialog identifiers contain sufficient + randomness. As such, the caller, A, automatically authorizes the + application. It then acts on the Refer-To URI, fetching the script + from app.example.com (message 11). The response, message 12, + contains a web application that the user can click on to enable + recording. Because the client executed the URL in the Refer-To, it + generates another NOTIFY to the application, informing it of the + successful response (message 13). This is answered with a 200 OK + (message 14). When the user clicks on the link (message 15), the + results are posted to the server, and an updated display is provided + (message 16). + +13. Security Considerations + + There are many security considerations associated with this + framework. It allows applications in the network to instantiate user + interface components on a client device. Such instantiations need to + be from authenticated applications, and also need to be authorized to + place a UI into the client. Indeed, the stronger requirement is + authorization. It is not as important to know the name of the + provider of the application, as it is to know that the provider is + authorized to instantiate components. + + This specification defines specific authorization techniques and + requirements. 
Automatic authorization is granted if the application + can prove that it is on the call path, or is trusted by an element on + the call path. As documented above, this can be accomplished by the + use of cryptographically random dialog identifiers and the use of + SIPS for message confidentiality. It is RECOMMENDED that SIPS be + implemented by user agents compliant with this specification. This + does not represent a change from the requirements in RFC 3261. + +14. Contributors + + This document was produced as a result of discussions amongst the + application interaction design team. All members of this team + contributed significantly to the ideas embodied in this document. + The members of this team were: + + Eric Burger + Cullen Jennings + Robert Fairlie-Cuninghame + +15. Acknowledgements + + The authors would like to thank Martin Dolly and Rohan Mahy for their + input and comments. Thanks to Allison Mankin for her support of this + work. + +16. References + +16.1. Normative References + + [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement + Levels", BCP 14, RFC 2119, March 1997. + + [2] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., + Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: + Session Initiation Protocol", RFC 3261, June 2002. + + [3] Rosenberg, J. and H. Schulzrinne, "Reliability of Provisional + Responses in Session Initiation Protocol (SIP)", RFC 3262, + June 2002. + + [4] Roach, A., "Session Initiation Protocol (SIP)-Specific Event + Notification", RFC 3265, June 2002. + + [5] McGlashan, S., Lucas, B., Porter, B., Rehor, K., Burnett, D., + Carter, J., Ferrans, J., and A. Hunt, "Voice Extensible Markup + Language (VoiceXML) Version 2.0", W3C CR CR-voicexml20- + 20030220, February 2003. + + [6] Rosenberg, J., Schulzrinne, H., and P. 
Kyzivat, "Indicating + User Agent Capabilities in the Session Initiation Protocol + (SIP)", RFC 3840, August 2004. + + [7] Sparks, R., "The Session Initiation Protocol (SIP) Refer + Method", RFC 3515, April 2003. + + [8] Burger, E. and M. Dolly, "A Session Initiation Protocol (SIP) + Event Package for Key Press Stimulus (KPML)", RFC 4730, + November 2006. + + [9] Rosenberg, J., "Obtaining and Using Globally Routable User + Agent URIs (GRUUs) in the Session Initiation Protocol (SIP)", + RFC 5627, October 2009. + + [10] Rosenberg, J., "Request Authorization through Dialog + Identification in the Session Initiation Protocol (SIP)", + RFC 4538, June 2006. + +16.2. Informative References + + [11] Peterson, J. and C. Jennings, "Enhancements for Authenticated + Identity Management in the Session Initiation Protocol (SIP)", + RFC 4474, August 2006. + + [12] Day, M., Rosenberg, J., and H. Sugano, "A Model for Presence + and Instant Messaging", RFC 2778, February 2000. + + [13] Jennings, C., Peterson, J., and M. Watson, "Private Extensions + to the Session Initiation Protocol (SIP) for Asserted Identity + within Trusted Networks", RFC 3325, November 2002. + + [14] Rosenberg, J., "A Framework for Conferencing with the Session + Initiation Protocol (SIP)", RFC 4353, February 2006. + + + + + +Rosenberg Standards Track [Page 37] + +RFC 5629 App Interaction Framework October 2009 + + + [15] Rosenberg, J., Schulzrinne, H., and P. Kyzivat, "Caller + Preferences for the Session Initiation Protocol (SIP)", + RFC 3841, August 2004. + + [16] Rosenberg, J., Schulzrinne, H., and R. Mahy, "An INVITE- + Initiated Dialog Event Package for the Session Initiation + Protocol (SIP)", RFC 4235, November 2005. + + [17] Sparks, R., Hawrylyshen, A., Johnston, A., Rosenberg, J., and + H. Schulzrinne, "Session Initiation Protocol (SIP) Torture Test + Messages", RFC 4475, May 2006. + + [18] Schulzrinne, H., Casner, S., Frederick, R., and V. 
Jacobson, + "RTP: A Transport Protocol for Real-Time Applications", STD 64, + RFC 3550, July 2003. + + [19] Schulzrinne, H. and T. Taylor, "RTP Payload for DTMF Digits, + Telephony Tones, and Telephony Signals", RFC 4733, December + 2006. + + [20] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with + Session Description Protocol (SDP)", RFC 3264, June 2002. + + [21] Rosenberg, J., "A Session Initiation Protocol (SIP) Event + Package for Registrations", RFC 3680, March 2004. + +Author's Address + + Jonathan Rosenberg + Cisco Systems + 600 Lanidex Plaza + Parsippany, NJ 07054 + US + + Phone: +1 973 952-5000 + EMail: jdrosen@cisco.com + URI: http://www.jdrosen.net