summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc6464.txt
diff options
context:
space:
mode:
authorThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
committerThomas Voss <mail@thomasvoss.com> 2024-11-27 20:54:24 +0100
commit4bfd864f10b68b71482b35c818559068ef8d5797 (patch)
treee3989f47a7994642eb325063d46e8f08ffa681dc /doc/rfc/rfc6464.txt
parentea76e11061bda059ae9f9ad130a9895cc85607db (diff)
doc: Add RFC documents
Diffstat (limited to 'doc/rfc/rfc6464.txt')
-rw-r--r--doc/rfc/rfc6464.txt507
1 files changed, 507 insertions, 0 deletions
diff --git a/doc/rfc/rfc6464.txt b/doc/rfc/rfc6464.txt
new file mode 100644
index 0000000..e96fe9c
--- /dev/null
+++ b/doc/rfc/rfc6464.txt
@@ -0,0 +1,507 @@
+
+
+
+
+
+
+Internet Engineering Task Force (IETF) J. Lennox, Ed.
+Request for Comments: 6464 Vidyo
+Category: Standards Track E. Ivov
+ISSN: 2070-1721 Jitsi
+ E. Marocco
+ Telecom Italia
+ December 2011
+
+
+ A Real-time Transport Protocol (RTP) Header Extension for
+ Client-to-Mixer Audio Level Indication
+
+Abstract
+
+ This document defines a mechanism by which packets of Real-time
+ Transport Protocol (RTP) audio streams can indicate, in an RTP header
+ extension, the audio level of the audio sample carried in the RTP
+ packet. In large conferences, this can reduce the load on an audio
+ mixer or other middlebox that wants to forward only a few of the
+ loudest audio streams, without requiring it to decode and measure
+ every stream that is received.
+
+Status of This Memo
+
+ This is an Internet Standards Track document.
+
+ This document is a product of the Internet Engineering Task Force
+ (IETF). It represents the consensus of the IETF community. It has
+ received public review and has been approved for publication by the
+ Internet Engineering Steering Group (IESG). Further information on
+ Internet Standards is available in Section 2 of RFC 5741.
+
+ Information about the current status of this document, any errata,
+ and how to provide feedback on it may be obtained at
+ http://www.rfc-editor.org/info/rfc6464.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 1]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+Copyright Notice
+
+ Copyright (c) 2011 IETF Trust and the persons identified as the
+ document authors. All rights reserved.
+
+ This document is subject to BCP 78 and the IETF Trust's Legal
+ Provisions Relating to IETF Documents
+ (http://trustee.ietf.org/license-info) in effect on the date of
+ publication of this document. Please review these documents
+ carefully, as they describe your rights and restrictions with respect
+ to this document. Code Components extracted from this document must
+ include Simplified BSD License text as described in Section 4.e of
+ the Trust Legal Provisions and are provided without warranty as
+ described in the Simplified BSD License.
+
+Table of Contents
+
+ 1. Introduction ....................................................2
+ 2. Terminology .....................................................3
+ 3. Audio Levels ....................................................3
+ 4. Signaling (Setup) Information ...................................5
+ 5. Considerations on Use ...........................................6
+ 6. Security Considerations .........................................6
+ 7. IANA Considerations .............................................7
+ 8. References ......................................................7
+ 8.1. Normative References .......................................7
+ 8.2. Informative References .....................................8
+
+1. Introduction
+
+ In a centralized Real-time Transport Protocol (RTP) [RFC3550] audio
+ conference, an audio mixer or forwarder receives audio streams from
+ many or all of the conference participants. It then selectively
+ forwards some of them to other participants in the conference. In
+ large conferences, it is possible that such a server might be
+ receiving a large number of streams, of which only a few are intended
+ to be forwarded to the other conference participants.
+
+ In such a scenario, in order to pick the audio streams to forward, a
+ centralized server needs to decode, measure audio levels, and
+ possibly perform voice activity detection on audio data from a large
+ number of streams. The need for such processing limits the size or
+ number of conferences such a server can support.
+
+ As an alternative, this document defines an RTP header extension
+ [RFC5285] through which senders of audio packets can indicate the
+ audio level of the packets' payload, reducing the processing load for
+ a server.
+
+
+
+Lennox, et al. Standards Track [Page 2]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+ The header extension in this document is different than, but
+ complementary with, the one defined in [RFC6465], which defines a
+ mechanism by which audio mixers can indicate to clients the levels of
+ the contributing sources that made up the mixed audio.
+
+2. Terminology
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+ "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+ document are to be interpreted as described in RFC 2119 [RFC2119] and
+ indicate requirement levels for compliant implementations.
+
+3. Audio Levels
+
+ The audio level header extension carries the level of the audio in
+ the RTP [RFC3550] payload of the packet with which it is associated.
+ This information is carried in an RTP header extension element as
+ defined by "A General Mechanism for RTP Header Extensions" [RFC5285].
+
+ The payload of the audio level header extension element can be
+ encoded using either the one-byte or two-byte header defined in
+ [RFC5285]. Figures 1 and 2 show sample audio level encodings with
+ each of these header formats.
+
+ 0 1
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | ID | len=0 |V| level |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 1: Sample Audio Level Encoding Using the
+ One-Byte Header Format
+
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | ID | len=1 |V| level | 0 (pad) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Figure 2: Sample Audio Level Encoding Using the
+ Two-Byte Header Format
+
+ Note that, as indicated in [RFC5285], the length field in the one-
+ byte header format takes the value 0 to indicate that 1 byte follows.
+ In the two-byte header format, on the other hand, the length field
+ takes the value of 1.
+
+
+
+
+Lennox, et al. Standards Track [Page 3]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+ The magnitude of the audio level itself is packed into the seven
+ least significant bits of the single byte of the header extension,
+ shown in Figures 1 and 2. The least significant bit of the audio
+ level magnitude is packed into the least significant bit of the byte.
+ The most significant bit of the byte is used as a separate flag bit
+ "V", defined below.
+
+ The audio level is expressed in -dBov, with values from 0 to 127
+ representing 0 to -127 dBov. dBov is the level, in decibels, relative
+ to the overload point of the system, i.e., the highest-intensity
+ signal encodable by the payload format. (Note: Representation
+ relative to the overload point of a system is particularly useful for
+ digital implementations, since one does not need to know the relative
+ calibration of the analog circuitry.) For example, in the case of
+ u-law (audio/pcmu) audio [ITU.G711], the 0 dBov reference would be a
+ square wave with values +/- 8031. (This translates to 6.18 dBm0,
+ relative to u-law's dBm0 definition in Table 6 of [ITU.G711].)
+
+ The audio level for digital silence -- for a muted audio source, for
+ example -- MUST be represented as 127 (-127 dBov), regardless of the
+ dynamic range of the encoded audio format.
+
+ The audio level header extension only carries the level of the audio
+ in the RTP payload of the packet with which it is associated, with no
+ long-term averaging or smoothing applied. For payload formats that
+ contain extra error-correction bits or loss-concealment information,
+ the level corresponds only to the data that would result from the
+ payload's normal decoding process, not what it would produce under
+ error or packet loss concealment. The level is measured as a root
+ mean square of all the samples in the audio encoded by the packet.
+
+ To simplify implementation of the encoding procedures described here,
+ Appendix A of [RFC6465] provides a sample Java implementation of an
+ audio level calculator that helps obtain such values from raw linear
+ Pulse Code Modulation (PCM) audio samples.
+
+ In addition, a flag bit (labeled "V") optionally indicates whether
+ the encoder believes the audio packet contains voice activity. If
+ the V bit is in use, the value 1 indicates that the encoder believes
+ the audio packet contains voice activity, and the value 0 indicates
+ that the encoder believes it does not. (The voice activity detection
+ algorithm is unspecified and left implementation-specific.) If the V
+ bit is not in use, its value is unspecified and MUST be ignored by
+ receivers. The use of the V bit is signaled using the extension
+ attribute "vad", discussed in Section 4.
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 4]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+ When this header extension is used with RTP data sent using the RTP
+ Payload for Redundant Audio Data [RFC2198], the header's data
+ describes the contents of the primary encoding.
+
+ Note: This audio level is defined in the same manner as is audio
+ noise level in the RTP Payload Comfort Noise specification
+ [RFC3389]. In [RFC3389], the overall magnitude of the noise level
+ in comfort noise is encoded into the first byte of the payload,
+ with spectral information about the noise in subsequent bytes.
+ This specification's audio level parameter is defined so as to be
+ identical to the comfort noise payload's noise-level byte.
+
+4. Signaling (Setup) Information
+
+ The URI for declaring this header extension in an extmap attribute is
+ "urn:ietf:params:rtp-hdrext:ssrc-audio-level".
+
+ It has a single extension attribute, named "vad". It takes the form
+ "vad=on" or "vad=off". If the header extension element is signaled
+ with "vad=on", the V bit described in Section 3 is in use, and MUST
+ be set by senders. If the header extension element is signaled with
+ "vad=off", the V bit is not in use, and its value MUST be ignored by
+ receivers. If the vad extension attribute is not specified, the
+ default is "vad=on".
+
+ An example attribute line in the Session Description Protocol (SDP)
+ for a conference might hence be:
+
+ a=extmap:6 urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on
+
+ The vad extension attribute only controls the semantics of this
+ header extension attribute, and does not make any statement about
+ whether the sender is using any other voice activity detection
+ features, such as discontinuous transmission, comfort noise, or
+ silence suppression.
+
+ Using the mechanisms of [RFC5285], an endpoint MAY signal multiple
+ instances of the header extension element, with different values of
+ the vad attribute, so long as these instances use different values
+ for the extension identifier. However, again following the rules of
+ [RFC5285], the semantics chosen for a header extension element
+ (including its vad setting) for a particular extension identifier
+ value MUST NOT be changed within an RTP session.
+
+
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 5]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+5. Considerations on Use
+
+ Mixers and forwarders generally ought not base audio forwarding
+ decisions directly on packet-by-packet audio level information, but
+ rather ought to apply some analysis of the audio levels and trends.
+ This general rule applies whether audio levels are provided by
+ endpoints (as defined in this document), or are calculated at a
+ server, as would be done in the absence of this information. This
+ section discusses several issues that mixers and forwarders may wish
+ to take into account. (Note that this section provides design
+ guidance only, and is not normative.)
+
+ First of all, audio levels generally ought to be measured over longer
+ intervals than that of a single audio packet. In order to avoid
+ false-positives for short bursts of sound (such as a cough or a
+ dropped microphone), it is often useful to require that a
+ participant's audio level be maintained for some period of time
+ before considering it to be "real"; i.e., some type of low-pass
+ filter ought to be applied to the audio levels. Note, though, that
+ such filtering must be balanced with the need to avoid clipping of
+ the beginning of a speaker's speech.
+
+ Additionally, different participants may have their audio input set
+ differently. It may be useful to apply some sort of automatic gain
+ control to the audio levels. There are a number of possible
+ approaches to achieving this, e.g., by measuring peak audio levels,
+ by average audio levels during speech, or by measuring background
+ audio levels (average audio levels during non-speech).
+
+6. Security Considerations
+
+ A malicious endpoint could choose to set the values in this header
+ extension falsely, so as to falsely claim that audio or voice is or
+ is not present. It is not clear what could be gained by falsely
+ claiming that audio is not present, but an endpoint falsely claiming
+ that audio is present, or falsely exaggerating its reported levels,
+ could perform a denial-of-service attack on an audio conference, so
+ as to send silence to suppress other conference members' audio, or
+ could dominate a conference by seizing its speaker-selection
+ algorithm. Thus, if a device relies on audio level data from
+ untrusted endpoints, it SHOULD periodically audit the level
+ information transmitted, taking appropriate corrective action against
+ endpoints that appear to be sending incorrect data. (However, as it
+ is valid for an endpoint to choose to measure audio levels prior to
+ encoding, some degree of discrepancy could be present. This would
+ not indicate that an endpoint is malicious.)
+
+
+
+
+
+Lennox, et al. Standards Track [Page 6]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+ In the Secure Real-time Transport Protocol (SRTP) [RFC3711], RTP
+ header extensions are authenticated but not encrypted. When this
+ header extension is used, audio levels are therefore visible on a
+ packet-by-packet basis to an attacker passively observing the audio
+ stream. As discussed in [SRTP-VBR-AUDIO], such an attacker might be
+ able to infer information about the conversation, possibly with
+ phoneme-level resolution. In scenarios where this is a concern,
+ additional mechanisms MUST be used to protect the confidentiality of
+ the header extension. This mechanism could be header extension
+ encryption [SRTP-ENCR-HDR], or a lower-level security and
+ authentication mechanism such as IPsec [RFC4301].
+
+7. IANA Considerations
+
+ This document defines a new extension URI in the RTP Compact Header
+ Extensions subregistry of the Real-Time Transport Protocol (RTP)
+ Parameters registry, according to the following data:
+
+ Extension URI: urn:ietf:params:rtp-hdrext:ssrc-audio-level
+ Description: Audio Level
+ Contact: jonathan@vidyo.com
+ Reference: RFC 6464
+
+8. References
+
+8.1. Normative References
+
+ [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
+ Requirement Levels", BCP 14, RFC 2119, March 1997.
+
+ [RFC2198] Perkins, C., Kouvelas, I., Hodson, O., Hardman, V.,
+ Handley, M., Bolot, J., Vega-Garcia, A., and S. Fosse-
+ Parisis, "RTP Payload for Redundant Audio Data", RFC 2198,
+ September 1997.
+
+ [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
+ Jacobson, "RTP: A Transport Protocol for Real-Time
+ Applications", STD 64, RFC 3550, July 2003.
+
+ [RFC5285] Singer, D. and H. Desineni, "A General Mechanism for RTP
+ Header Extensions", RFC 5285, July 2008.
+
+
+
+
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 7]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+8.2. Informative References
+
+ [ITU.G711] International Telecommunication Union, "Pulse Code
+ Modulation (PCM) of Voice Frequencies",
+ ITU-T Recommendation G.711, November 1988.
+
+ [RFC3389] Zopf, R., "Real-time Transport Protocol (RTP) Payload for
+ Comfort Noise (CN)", RFC 3389, September 2002.
+
+ [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
+ Norrman, "The Secure Real-time Transport Protocol (SRTP)",
+ RFC 3711, March 2004.
+
+ [RFC4301] Kent, S. and K. Seo, "Security Architecture for the
+ Internet Protocol", RFC 4301, December 2005.
+
+ [RFC6465] Ivov, E., Ed., Marocco, E., Ed., and J. Lennox,
+ "A Real-time Transport Protocol (RTP) Header Extension for
+ Mixer-to-Client Audio Level Indication", RFC 6465,
+ December 2011.
+
+ [SRTP-ENCR-HDR]
+ Lennox, J., "Encryption of Header Extensions in the Secure
+ Real-Time Transport Protocol (SRTP)", Work in Progress,
+ October 2011.
+
+ [SRTP-VBR-AUDIO]
+ Perkins, C. and JM. Valin, "Guidelines for the use of
+ Variable Bit Rate Audio with Secure RTP", Work
+ in Progress, July 2011.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 8]
+
+RFC 6464 Client-to-Mixer Audio Level Indication December 2011
+
+
+Authors' Addresses
+
+ Jonathan Lennox (editor)
+ Vidyo, Inc.
+ 433 Hackensack Avenue
+ Seventh Floor
+ Hackensack, NJ 07601
+ US
+
+ EMail: jonathan@vidyo.com
+
+
+ Emil Ivov
+ Jitsi
+ Strasbourg 67000
+ France
+
+ EMail: emcho@jitsi.org
+
+
+ Enrico Marocco
+ Telecom Italia
+ Via G. Reiss Romoli, 274
+ Turin 10148
+ Italy
+
+ EMail: enrico.marocco@telecomitalia.it
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lennox, et al. Standards Track [Page 9]
+