Diffstat (limited to 'doc/rfc/rfc3984.txt')
-rw-r--r-- | doc/rfc/rfc3984.txt | 4651 |
1 files changed, 4651 insertions, 0 deletions
diff --git a/doc/rfc/rfc3984.txt b/doc/rfc/rfc3984.txt new file mode 100644 index 0000000..f84e338 --- /dev/null +++ b/doc/rfc/rfc3984.txt @@ -0,0 +1,4651 @@ + + + + + + +Network Working Group S. Wenger +Request for Comments: 3984 M.M. Hannuksela +Category: Standards Track T. Stockhammer + M. Westerlund + D. Singer + February 2005 + + + RTP Payload Format for H.264 Video + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (2005). + +Abstract + + This memo describes an RTP Payload format for the ITU-T + Recommendation H.264 video codec and the technically identical + ISO/IEC International Standard 14496-10 video codec. The RTP payload + format allows for packetization of one or more Network Abstraction + Layer Units (NALUs), produced by an H.264 video encoder, in each RTP + payload. The payload format has wide applicability, as it supports + applications from simple low bit-rate conversational usage, to + Internet video streaming with interleaved transmission, to high bit- + rate video-on-demand. + +Table of Contents + + 1. Introduction.................................................. 3 + 1.1. The H.264 Codec......................................... 3 + 1.2. Parameter Set Concept................................... 4 + 1.3. Network Abstraction Layer Unit Types.................... 5 + 2. Conventions................................................... 6 + 3. Scope......................................................... 6 + 4. Definitions and Abbreviations................................. 6 + 4.1. Definitions............................................. 6 + 5. 
RTP Payload Format............................................ 8 + 5.1. RTP Header Usage........................................ 8 + 5.2. Common Structure of the RTP Payload Format.............. 11 + 5.3. NAL Unit Octet Usage.................................... 12 + + + +Wenger, et al. Standards Track [Page 1] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + 5.4. Packetization Modes..................................... 14 + 5.5. Decoding Order Number (DON)............................. 15 + 5.6. Single NAL Unit Packet.................................. 18 + 5.7. Aggregation Packets..................................... 18 + 5.8. Fragmentation Units (FUs)............................... 27 + 6. Packetization Rules........................................... 31 + 6.1. Common Packetization Rules.............................. 31 + 6.2. Single NAL Unit Mode.................................... 32 + 6.3. Non-Interleaved Mode.................................... 32 + 6.4. Interleaved Mode........................................ 33 + 7. De-Packetization Process (Informative)........................ 33 + 7.1. Single NAL Unit and Non-Interleaved Mode................ 33 + 7.2. Interleaved Mode........................................ 34 + 7.3. Additional De-Packetization Guidelines.................. 36 + 8. Payload Format Parameters..................................... 37 + 8.1. MIME Registration....................................... 37 + 8.2. SDP Parameters.......................................... 52 + 8.3. Examples................................................ 58 + 8.4. Parameter Set Considerations............................ 60 + 9. Security Considerations....................................... 62 + 10. Congestion Control............................................ 63 + 11. IANA Considerations........................................... 64 + 12. Informative Appendix: Application Examples.................... 65 + 12.1. 
Video Telephony according to ITU-T Recommendation H.241 + Annex A................................................. 65 + 12.2. Video Telephony, No Slice Data Partitioning, No NAL + Unit Aggregation........................................ 65 + 12.3. Video Telephony, Interleaved Packetization Using NAL + Unit Aggregation........................................ 66 + 12.4. Video Telephony with Data Partitioning.................. 66 + 12.5. Video Telephony or Streaming with FUs and Forward + Error Correction........................................ 67 + 12.6. Low Bit-Rate Streaming.................................. 69 + 12.7. Robust Packet Scheduling in Video Streaming............. 70 + 13. Informative Appendix: Rationale for Decoding Order Number..... 71 + 13.1. Introduction............................................ 71 + 13.2. Example of Multi-Picture Slice Interleaving............. 71 + 13.3. Example of Robust Packet Scheduling..................... 73 + 13.4. Robust Transmission Scheduling of Redundant Coded + Slices.................................................. 77 + 13.5. Remarks on Other Design Possibilities................... 77 + 14. Acknowledgements.............................................. 78 + 15. References.................................................... 78 + 15.1. Normative References.................................... 78 + 15.2. Informative References.................................. 79 + Authors' Addresses................................................ 81 + Full Copyright Statement.......................................... 83 + + + + +Wenger, et al. Standards Track [Page 2] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +1. Introduction + +1.1. The H.264 Codec + + This memo specifies an RTP payload specification for the video coding + standard known as ITU-T Recommendation H.264 [1] and ISO/IEC + International Standard 14496 Part 10 [2] (both also known as Advanced + Video Coding, or AVC). 
Recommendation H.264 was approved by ITU-T on + May 2003, and the approved draft specification is available for + public review [8]. In this memo the H.264 acronym is used for the + codec and the standard, but the memo is equally applicable to the + ISO/IEC counterpart of the coding standard. + + The H.264 video codec has a very broad application range that covers + all forms of digital compressed video from, low bit-rate Internet + streaming applications to HDTV broadcast and Digital Cinema + applications with nearly lossless coding. Compared to the current + state of technology, the overall performance of H.264 is such that + bit rate savings of 50% or more are reported. Digital Satellite TV + quality, for example, was reported to be achievable at 1.5 Mbit/s, + compared to the current operation point of MPEG 2 video at around 3.5 + Mbit/s [9]. + + The codec specification [1] itself distinguishes conceptually between + a video coding layer (VCL) and a network abstraction layer (NAL). + The VCL contains the signal processing functionality of the codec; + mechanisms such as transform, quantization, and motion compensated + prediction; and a loop filter. It follows the general concept of + most of today's video codecs, a macroblock-based coder that uses + inter picture prediction with motion compensation and transform + coding of the residual signal. The VCL encoder outputs slices: a bit + string that contains the macroblock data of an integer number of + macroblocks, and the information of the slice header (containing the + spatial address of the first macroblock in the slice, the initial + quantization parameter, and similar information). Macroblocks in + slices are arranged in scan order unless a different macroblock + allocation is specified, by using the so-called Flexible Macroblock + Ordering syntax. In-picture prediction is used only within a slice. + More information is provided in [9]. 
+ + The Network Abstraction Layer (NAL) encoder encapsulates the slice + output of the VCL encoder into Network Abstraction Layer Units (NAL + units), which are suitable for transmission over packet networks or + use in packet oriented multiplex environments. Annex B of H.264 + defines an encapsulation process to transmit such NAL units over + byte-stream oriented networks. In the scope of this memo, Annex B is + not relevant. + + + + +Wenger, et al. Standards Track [Page 3] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Internally, the NAL uses NAL units. A NAL unit consists of a one- + byte header and the payload byte string. The header indicates the + type of the NAL unit, the (potential) presence of bit errors or + syntax violations in the NAL unit payload, and information regarding + the relative importance of the NAL unit for the decoding process. + This RTP payload specification is designed to be unaware of the bit + string in the NAL unit payload. + + One of the main properties of H.264 is the complete decoupling of the + transmission time, the decoding time, and the sampling or + presentation time of slices and pictures. The decoding process + specified in H.264 is unaware of time, and the H.264 syntax does not + carry information such as the number of skipped frames (as is common + in the form of the Temporal Reference in earlier video compression + standards). Also, there are NAL units that affect many pictures and + that are, therefore, inherently timeless. For this reason, the + handling of the RTP timestamp requires some special considerations + for NAL units for which the sampling or presentation time is not + defined or, at transmission time, unknown. + +1.2. Parameter Set Concept + + One very fundamental design concept of H.264 is to generate self- + contained packets, to make mechanisms such as the header duplication + of RFC 2429 [10] or MPEG-4's Header Extension Code (HEC) [11] + unnecessary. 
This was achieved by decoupling information relevant to + more than one slice from the media stream. This higher layer meta + information should be sent reliably, asynchronously, and in advance + from the RTP packet stream that contains the slice packets. + (Provisions for sending this information in-band are also available + for applications that do not have an out-of-band transport channel + appropriate for the purpose.) The combination of the higher-level + parameters is called a parameter set. The H.264 specification + includes two types of parameter sets: sequence parameter set and + picture parameter set. An active sequence parameter set remains + unchanged throughout a coded video sequence, and an active picture + parameter set remains unchanged within a coded picture. The sequence + and picture parameter set structures contain information such as + picture size, optional coding modes employed, and macroblock to slice + group map. + + To be able to change picture parameters (such as the picture size) + without having to transmit parameter set updates synchronously to the + slice packet stream, the encoder and decoder can maintain a list of + more than one sequence and picture parameter set. Each slice header + contains a codeword that indicates the sequence and picture parameter + set to be used. + + + + +Wenger, et al. Standards Track [Page 4] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + This mechanism allows the decoupling of the transmission of parameter + sets from the packet stream, and the transmission of them by external + means (e.g., as a side effect of the capability exchange), or through + a (reliable or unreliable) control protocol. It may even be possible + that they are never transmitted but are fixed by an application + design specification. + +1.3. Network Abstraction Layer Unit Types + + Tutorial information on the NAL design can be found in [12], [13], + and [14]. 
+
+   All NAL units consist of a single NAL unit type octet, which also
+   co-serves as the payload header of this RTP payload format. The
+   payload of a NAL unit follows immediately.
+
+   The syntax and semantics of the NAL unit type octet are specified in
+   [1], but the essential properties of the NAL unit type octet are
+   summarized below. The NAL unit type octet has the following format:
+
+      +---------------+
+      |0|1|2|3|4|5|6|7|
+      +-+-+-+-+-+-+-+-+
+      |F|NRI|  Type   |
+      +---------------+
+
+   The semantics of the components of the NAL unit type octet, as
+   specified in the H.264 specification, are described briefly below.
+
+   F: 1 bit
+      forbidden_zero_bit. The H.264 specification declares a value of
+      1 as a syntax violation.
+
+   NRI: 2 bits
+      nal_ref_idc. A value of 00 indicates that the content of the NAL
+      unit is not used to reconstruct reference pictures for inter
+      picture prediction. Such NAL units can be discarded without
+      risking the integrity of the reference pictures. Values greater
+      than 00 indicate that the decoding of the NAL unit is required to
+      maintain the integrity of the reference pictures.
+
+   Type: 5 bits
+      nal_unit_type. This component specifies the NAL unit payload type
+      as defined in table 7-1 of [1], and later within this memo. For a
+      reference of all currently defined NAL unit types and their
+      semantics, please refer to section 7.4.1 in [1].
+
+
+
+
+
+Wenger, et al.              Standards Track                     [Page 5]
+
+RFC 3984           RTP Payload Format for H.264 Video      February 2005
+
+
+   This memo introduces new NAL unit types, which are presented in
+   section 5.2. The NAL unit types defined in this memo are marked as
+   unspecified in [1]. Moreover, this specification extends the
+   semantics of F and NRI as described in section 5.3.
+
+2. Conventions
+
+   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+   document are to be interpreted as described in BCP 14, RFC 2119 [3].
+ + This specification uses the notion of setting and clearing a bit when + bit fields are handled. Setting a bit is the same as assigning that + bit the value of 1 (On). Clearing a bit is the same as assigning + that bit the value of 0 (Off). + +3. Scope + + This payload specification can only be used to carry the "naked" + H.264 NAL unit stream over RTP, and not the bitstream format + discussed in Annex B of H.264. Likely, the first applications of + this specification will be in the conversational multimedia field, + video telephony or video conferencing, but the payload format also + covers other applications, such as Internet streaming and TV over IP. + +4. Definitions and Abbreviations + +4.1. Definitions + + This document uses the definitions of [1]. The following terms, + defined in [1], are summed up for convenience: + + access unit: A set of NAL units always containing a primary coded + picture. In addition to the primary coded picture, an access unit + may also contain one or more redundant coded pictures or other NAL + units not containing slices or slice data partitions of a coded + picture. The decoding of an access unit always results in a + decoded picture. + + coded video sequence: A sequence of access units that consists, in + decoding order, of an instantaneous decoding refresh (IDR) access + unit followed by zero or more non-IDR access units including all + subsequent access units up to but not including any subsequent IDR + access unit. + + IDR access unit: An access unit in which the primary coded picture + is an IDR picture. + + + + +Wenger, et al. Standards Track [Page 6] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + IDR picture: A coded picture containing only slices with I or SI + slice types that causes a "reset" in the decoding process. After + the decoding of an IDR picture, all following coded pictures in + decoding order can be decoded without inter prediction from any + picture decoded prior to the IDR picture. 
+ + primary coded picture: The coded representation of a picture to be + used by the decoding process for a bitstream conforming to H.264. + The primary coded picture contains all macroblocks of the picture. + + redundant coded picture: A coded representation of a picture or a + part of a picture. The content of a redundant coded picture shall + not be used by the decoding process for a bitstream conforming to + H.264. The content of a redundant coded picture may be used by + the decoding process for a bitstream that contains errors or + losses. + + VCL NAL unit: A collective term used to refer to coded slice and + coded data partition NAL units. + + In addition, the following definitions apply: + + decoding order number (DON): A field in the payload structure, or + a derived variable indicating NAL unit decoding order. Values of + DON are in the range of 0 to 65535, inclusive. After reaching the + maximum value, the value of DON wraps around to 0. + + NAL unit decoding order: A NAL unit order that conforms to the + constraints on NAL unit order given in section 7.4.1.2 in [1]. + + transmission order: The order of packets in ascending RTP sequence + number order (in modulo arithmetic). Within an aggregation + packet, the NAL unit transmission order is the same as the order + of appearance of NAL units in the packet. + + media aware network element (MANE): A network element, such as a + middlebox or application layer gateway that is capable of parsing + certain aspects of the RTP payload headers or the RTP payload and + reacting to the contents. + + Informative note: The concept of a MANE goes beyond normal + routers or gateways in that a MANE has to be aware of the + signaling (e.g., to learn about the payload type mappings of + the media streams), and in that it has to be trusted when + working with SRTP. The advantage of using MANEs is that they + allow packets to be dropped according to the needs of the media + coding. 
For example, if a MANE has to drop packets due to + congestion on a certain link, it can identify those packets + + + +Wenger, et al. Standards Track [Page 7] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + whose dropping has the smallest negative impact on the user + experience and remove them in order to remove the congestion + and/or keep the delay low. + + Abbreviations + + DON: Decoding Order Number + DONB: Decoding Order Number Base + DOND: Decoding Order Number Difference + FEC: Forward Error Correction + FU: Fragmentation Unit + IDR: Instantaneous Decoding Refresh + IEC: International Electrotechnical Commission + ISO: International Organization for Standardization + ITU-T: International Telecommunication Union, + Telecommunication Standardization Sector + MANE: Media Aware Network Element + MTAP: Multi-Time Aggregation Packet + MTAP16: MTAP with 16-bit timestamp offset + MTAP24: MTAP with 24-bit timestamp offset + NAL: Network Abstraction Layer + NALU: NAL Unit + SEI: Supplemental Enhancement Information + STAP: Single-Time Aggregation Packet + STAP-A: STAP type A + STAP-B: STAP type B + TS: Timestamp + VCL: Video Coding Layer + +5. RTP Payload Format + +5.1. RTP Header Usage + + The format of the RTP header is specified in RFC 3550 [4] and + reprinted in Figure 1 for convenience. This payload format uses the + fields of the header in a manner consistent with that specification. + + When one NAL unit is encapsulated per RTP packet, the RECOMMENDED RTP + payload format is specified in section 5.6. The RTP payload (and the + settings for some RTP header bits) for aggregation packets and + fragmentation units are specified in sections 5.7 and 5.8, + respectively. + + + + + + + + + +Wenger, et al. 
Standards Track                     [Page 8]
+
+RFC 3984           RTP Payload Format for H.264 Video      February 2005
+
+
+       0                   1                   2                   3
+       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |                           timestamp                           |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+      |           synchronization source (SSRC) identifier            |
+      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
+      |            contributing source (CSRC) identifiers             |
+      |                             ....                              |
+      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+      Figure 1. RTP header according to RFC 3550
+
+   The RTP header information to be set according to this RTP payload
+   format is set as follows:
+
+   Marker bit (M): 1 bit
+      Set for the very last packet of the access unit indicated by the
+      RTP timestamp, in line with the normal use of the M bit in video
+      formats, to allow an efficient playout buffer handling. For
+      aggregation packets (STAP and MTAP), the marker bit in the RTP
+      header MUST be set to the value that the marker bit of the last
+      NAL unit of the aggregation packet would have been if it were
+      transported in its own RTP packet. Decoders MAY use this bit as
+      an early indication of the last packet of an access unit, but MUST
+      NOT rely on this property.
+
+         Informative note: Only one M bit is associated with an
+         aggregation packet carrying multiple NAL units. Thus, if a
+         gateway has re-packetized an aggregation packet into several
+         packets, it cannot reliably set the M bit of those packets.
+
+   Payload type (PT): 7 bits
+      The assignment of an RTP payload type for this new packet format
+      is outside the scope of this document and will not be specified
+      here. The assignment of a payload type has to be performed either
+      through the profile used or in a dynamic way.
+
+   Sequence number (SN): 16 bits
+      Set and used in accordance with RFC 3550.
For the single NALU and + non-interleaved packetization mode, the sequence number is used to + determine decoding order for the NALU. + + Timestamp: 32 bits + The RTP timestamp is set to the sampling timestamp of the content. + A 90 kHz clock rate MUST be used. + + + +Wenger, et al. Standards Track [Page 9] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + If the NAL unit has no timing properties of its own (e.g., + parameter set and SEI NAL units), the RTP timestamp is set to the + RTP timestamp of the primary coded picture of the access unit in + which the NAL unit is included, according to section 7.4.1.2 of + [1]. + + The setting of the RTP Timestamp for MTAPs is defined in section + 5.7.2. + + Receivers SHOULD ignore any picture timing SEI messages included + in access units that have only one display timestamp. Instead, + receivers SHOULD use the RTP timestamp for synchronizing the + display process. + + RTP senders SHOULD NOT transmit picture timing SEI messages for + pictures that are not supposed to be displayed as multiple fields. + + If one access unit has more than one display timestamp carried in + a picture timing SEI message, then the information in the SEI + message SHOULD be treated as relative to the RTP timestamp, with + the earliest event occurring at the time given by the RTP + timestamp, and subsequent events later, as given by the difference + in SEI message picture timing values. Let tSEI1, tSEI2, ..., + tSEIn be the display timestamps carried in the SEI message of an + access unit, where tSEI1 is the earliest of all such timestamps. + Let tmadjst() be a function that adjusts the SEI messages time + scale to a 90-kHz time scale. Let TS be the RTP timestamp. Then, + the display time for the event associated with tSEI1 is TS. The + display time for the event with tSEIx, where x is [2..n] is TS + + tmadjst (tSEIx - tSEI1). 
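The display-time rule above (TS for the earliest SEI timestamp, TS + tmadjst(tSEIx - tSEI1) for the others) can be sketched in a few lines of Python. This is only an illustration: the `sei_time_scale_hz` parameter, the floor-division rounding inside `tmadjst`, and the wrap at 2^32 (the width of the RTP timestamp field) are assumptions of this sketch, not something the memo prescribes.

```python
RTP_CLOCK_HZ = 90_000      # clock rate required by the payload format
RTP_TS_MODULUS = 1 << 32   # the RTP timestamp field is 32 bits wide


def tmadjst(sei_ticks: int, sei_time_scale_hz: int) -> int:
    """Rescale a tick count from the SEI time scale to the 90 kHz scale."""
    # Floor division is an assumption; the text does not fix a rounding rule.
    return sei_ticks * RTP_CLOCK_HZ // sei_time_scale_hz


def display_times(ts: int, t_sei: list[int], sei_time_scale_hz: int) -> list[int]:
    """Return one 90 kHz display time per SEI timestamp, in input order."""
    t_sei1 = min(t_sei)  # tSEI1: the earliest display timestamp
    return [
        (ts + tmadjst(t - t_sei1, sei_time_scale_hz)) % RTP_TS_MODULUS
        for t in t_sei
    ]
```

For instance, with a hypothetical SEI time scale of 30 Hz, `display_times(1000, [0, 1, 2], 30)` spaces the three events 3000 RTP ticks apart, starting at TS = 1000.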
+ + Informative note: Displaying coded frames as fields is needed + commonly in an operation known as 3:2 pulldown, in which film + content that consists of coded frames is displayed on a display + using interlaced scanning. The picture timing SEI message + enables carriage of multiple timestamps for the same coded + picture, and therefore the 3:2 pulldown process is perfectly + controlled. The picture timing SEI message mechanism is + necessary because only one timestamp per coded frame can be + conveyed in the RTP timestamp. + + Informative note: Because H.264 allows the decoding order to be + different from the display order, values of RTP timestamps may + not be monotonically non-decreasing as a function of RTP + sequence numbers. Furthermore, the value for interarrival + jitter reported in the RTCP reports may not be a trustworthy + indication of the network performance, as the calculation rules + + + + +Wenger, et al. Standards Track [Page 10] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + for interarrival jitter (section 6.4.1 of RFC 3550) assume that + the RTP timestamp of a packet is directly proportional to its + transmission time. + +5.2. Common Structure of the RTP Payload Format + + The payload format defines three different basic payload structures. + A receiver can identify the payload structure by the first byte of + the RTP payload, which co-serves as the RTP payload header and, in + some cases, as the first byte of the payload. This byte is always + structured as a NAL unit header. The NAL unit type field indicates + which structure is present. The possible structures are as follows: + + Single NAL Unit Packet: Contains only a single NAL unit in the + payload. The NAL header type field will be equal to the original NAL + unit type; i.e., in the range of 1 to 23, inclusive. Specified in + section 5.6. + + Aggregation packet: Packet type used to aggregate multiple NAL units + into a single RTP payload. 
This packet exists in four versions, the
+   Single-Time Aggregation Packet type A (STAP-A), the Single-Time
+   Aggregation Packet type B (STAP-B), Multi-Time Aggregation Packet
+   (MTAP) with 16-bit offset (MTAP16), and Multi-Time Aggregation Packet
+   (MTAP) with 24-bit offset (MTAP24). The NAL unit type numbers
+   assigned for STAP-A, STAP-B, MTAP16, and MTAP24 are 24, 25, 26, and
+   27, respectively. Specified in section 5.7.
+
+   Fragmentation unit: Used to fragment a single NAL unit over multiple
+   RTP packets. Exists with two versions, FU-A and FU-B, identified
+   with the NAL unit type numbers 28 and 29, respectively. Specified in
+   section 5.8.
+
+   Table 1. Summary of NAL unit types and their payload structures
+
+      Type   Packet    Type name                        Section
+      ---------------------------------------------------------
+      0      undefined                                    -
+      1-23   NAL unit  Single NAL unit packet per H.264   5.6
+      24     STAP-A    Single-time aggregation packet     5.7.1
+      25     STAP-B    Single-time aggregation packet     5.7.1
+      26     MTAP16    Multi-time aggregation packet      5.7.2
+      27     MTAP24    Multi-time aggregation packet      5.7.2
+      28     FU-A      Fragmentation unit                 5.8
+      29     FU-B      Fragmentation unit                 5.8
+      30-31  undefined                                    -
+
+
+
+
+
+
+Wenger, et al.              Standards Track                    [Page 11]
+
+RFC 3984           RTP Payload Format for H.264 Video      February 2005
+
+
+      Informative note: This specification does not limit the size of
+      NAL units encapsulated in single NAL unit packets and
+      fragmentation units. The maximum size of a NAL unit encapsulated
+      in any aggregation packet is 65535 bytes.
+
+5.3. NAL Unit Octet Usage
+
+   The structure and semantics of the NAL unit octet were introduced in
+   section 1.3. For convenience, the format of the NAL unit type octet
+   is reprinted below:
+
+      +---------------+
+      |0|1|2|3|4|5|6|7|
+      +-+-+-+-+-+-+-+-+
+      |F|NRI|  Type   |
+      +---------------+
+
+   This section specifies the semantics of F and NRI according to this
+   specification.
+
+   F: 1 bit
+      forbidden_zero_bit.
A value of 0 indicates that the NAL unit type + octet and payload should not contain bit errors or other syntax + violations. A value of 1 indicates that the NAL unit type octet + and payload may contain bit errors or other syntax violations. + + MANEs SHOULD set the F bit to indicate detected bit errors in the + NAL unit. The H.264 specification requires that the F bit is + equal to 0. When the F bit is set, the decoder is advised that + bit errors or any other syntax violations may be present in the + payload or in the NAL unit type octet. The simplest decoder + reaction to a NAL unit in which the F bit is equal to 1 is to + discard such a NAL unit and to conceal the lost data in the + discarded NAL unit. + + NRI: 2 bits + nal_ref_idc. The semantics of value 00 and a non-zero value + remain unchanged from the H.264 specification. In other words, a + value of 00 indicates that the content of the NAL unit is not used + to reconstruct reference pictures for inter picture prediction. + Such NAL units can be discarded without risking the integrity of + the reference pictures. Values greater than 00 indicate that the + decoding of the NAL unit is required to maintain the integrity of + the reference pictures. + + In addition to the specification above, according to this RTP + payload specification, values of NRI greater than 00 indicate the + relative transport priority, as determined by the encoder. MANEs + + + +Wenger, et al. Standards Track [Page 12] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + can use this information to protect more important NAL units + better than they do less important NAL units. The highest + transport priority is 11, followed by 10, and then by 01; finally, + 00 is the lowest. + + Informative note: Any non-zero value of NRI is handled + identically in H.264 decoders. Therefore, receivers need not + manipulate the value of NRI when passing NAL units to the + decoder. 
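As an illustration of the octet layout (F in bit 0, NRI in bits 1-2, Type in bits 3-7, counted from the most significant bit) and of the type ranges in Table 1, the following Python sketch splits the first payload byte into its fields and names the payload structure. The names `NalHeader`, `parse_nal_header`, and `payload_structure` are made up for this sketch and are not defined by the memo.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int         # forbidden_zero_bit (1 bit)
    nri: int       # nal_ref_idc (2 bits)
    nal_type: int  # nal_unit_type (5 bits)


def parse_nal_header(octet: int) -> NalHeader:
    """Split the NAL unit type octet into F, NRI, and Type."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("expected a single octet")
    return NalHeader(f=octet >> 7, nri=(octet >> 5) & 0b11, nal_type=octet & 0b11111)


def payload_structure(nal_type: int) -> str:
    """Map a 5-bit type value to the packet name used in Table 1."""
    if not 0 <= nal_type <= 31:
        raise ValueError("nal_unit_type is a 5-bit value")
    if 1 <= nal_type <= 23:
        return "single NAL unit packet"
    names = {24: "STAP-A", 25: "STAP-B", 26: "MTAP16",
             27: "MTAP24", 28: "FU-A", 29: "FU-B"}
    return names.get(nal_type, "undefined")  # 0 and 30-31
```

A first payload byte of 0x67, for example, parses to F = 0, NRI = 3, Type = 7, which falls in the single NAL unit packet range. A receiver or MANE could use `nri == 0` as the test for NAL units that can be dropped without affecting reference pictures.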
+
+      An H.264 encoder MUST set the value of NRI according to the H.264
+      specification (subclause 7.4.1) when the value of nal_unit_type is
+      in the range of 1 to 12, inclusive. In particular, the H.264
+      specification requires that the value of NRI SHALL be equal to 0
+      for all NAL units having nal_unit_type equal to 6, 9, 10, 11, or
+      12.
+
+      For NAL units having nal_unit_type equal to 7 or 8 (indicating a
+      sequence parameter set or a picture parameter set, respectively),
+      an H.264 encoder SHOULD set the value of NRI to 11 (in binary
+      format). For coded slice NAL units of a primary coded picture
+      having nal_unit_type equal to 5 (indicating a coded slice
+      belonging to an IDR picture), an H.264 encoder SHOULD set the
+      value of NRI to 11 (in binary format).
+
+      For a mapping of the remaining nal_unit_types to NRI values, the
+      following example MAY be used and has been shown to be efficient
+      in a certain environment [13]. Other mappings MAY also be
+      desirable, depending on the application and the H.264/AVC Annex A
+      profile in use.
+
+         Informative note: Data Partitioning is not available in certain
+         profiles; e.g., in the Main or Baseline profiles.
+         Consequently, the nal unit types 2, 3, and 4 can occur only if
+         the video bitstream conforms to a profile in which data
+         partitioning is allowed and not in streams that conform to the
+         Main or Baseline profiles.
+
+      Table 2. Example of NRI values for coded slices and coded slice
+      data partitions of primary coded reference pictures
+
+      NAL Unit Type     Content of NAL unit              NRI (binary)
+      ----------------------------------------------------------------
+       1              non-IDR coded slice                         10
+       2              Coded slice data partition A                10
+       3              Coded slice data partition B                01
+       4              Coded slice data partition C                01
+
+
+
+
+Wenger, et al.              Standards Track                    [Page 13]
+
+RFC 3984           RTP Payload Format for H.264 Video      February 2005
+
+
+         Informative note: As mentioned before, the NRI value of non-
+         reference pictures is 00 as mandated by H.264/AVC.
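The NRI rules above combine one hard requirement (types 6, 9, 10, 11, 12), two SHOULD-level recommendations (types 5, 7, 8), and the example mapping of Table 2. A minimal Python sketch of that lookup follows. The function name `recommended_nri` and the `reference_picture` flag are hypothetical, the sketch covers primary coded pictures only, and the Table 2 values are the memo's example rather than a mandatory mapping.

```python
from typing import Optional

# Table 2 example mapping for primary coded reference pictures.
_TABLE_2_EXAMPLE = {1: 0b10, 2: 0b10, 3: 0b01, 4: 0b01}


def recommended_nri(nal_unit_type: int, reference_picture: bool = True) -> Optional[int]:
    """Return an NRI value for a NAL unit type, or None if no rule is coded here."""
    if nal_unit_type in (6, 9, 10, 11, 12):
        return 0b00  # required by the H.264 specification
    if nal_unit_type in (7, 8):
        return 0b11  # sequence / picture parameter set (SHOULD)
    if nal_unit_type == 5:
        return 0b11  # coded slice of an IDR picture (SHOULD)
    if nal_unit_type in _TABLE_2_EXAMPLE:
        if not reference_picture:
            return 0b00  # non-reference pictures carry NRI 00 per H.264
        return _TABLE_2_EXAMPLE[nal_unit_type]
    return None  # types outside the rules sketched above
```

Returning `None` for uncovered types keeps the sketch from inventing a priority where the text gives none. An encoder with a different loss model could swap in its own table for types 1 to 4.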
+ + An H.264 encoder SHOULD set the value of NRI for coded slice and + coded slice data partition NAL units of redundant coded reference + pictures equal to 01 (in binary format). + + Definitions of the values for NRI for NAL unit types 24 to 29, + inclusive, are given in sections 5.7 and 5.8 of this memo. + + No recommendation for the value of NRI is given for NAL units + having nal_unit_type in the range of 13 to 23, inclusive, because + these values are reserved for ITU-T and ISO/IEC. No + recommendation for the value of NRI is given for NAL units having + nal_unit_type equal to 0 or in the range of 30 to 31, inclusive, + as the semantics of these values are not specified in this memo. + +5.4. Packetization Modes + + This memo specifies three cases of packetization modes: + + o Single NAL unit mode + o Non-interleaved mode + o Interleaved mode + + The single NAL unit mode is targeted for conversational systems that + comply with ITU-T Recommendation H.241 [15] (see section 12.1). The + non-interleaved mode is targeted for conversational systems that may + not comply with ITU-T Recommendation H.241. In the non-interleaved + mode, NAL units are transmitted in NAL unit decoding order. The + interleaved mode is targeted for systems that do not require very low + end-to-end latency. The interleaved mode allows transmission of NAL + units out of NAL unit decoding order. + + The packetization mode in use MAY be signaled by the value of the + OPTIONAL packetization-mode MIME parameter or by external means. The + used packetization mode governs which NAL unit types are allowed in + RTP payloads. Table 3 summarizes the allowed NAL unit types for each + packetization mode. Some NAL unit type values (indicated as + undefined in Table 3) are reserved for future extensions. NAL units + of those types SHOULD NOT be sent by a sender and MUST be ignored by + a receiver. 
For example, the Types 1-23, with the associated packet + type "NAL unit", are allowed in "Single NAL Unit Mode" and in "Non- + Interleaved Mode", but disallowed in "Interleaved Mode". + Packetization modes are explained in more detail in section 6. + + + + + + +Wenger, et al. Standards Track [Page 14] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Table 3. Summary of allowed NAL unit types for each packetization + mode (yes = allowed, no = disallowed, ig = ignore) + + Type Packet Single NAL Non-Interleaved Interleaved + Unit Mode Mode Mode + ------------------------------------------------------------- + + 0 undefined ig ig ig + 1-23 NAL unit yes yes no + 24 STAP-A no yes no + 25 STAP-B no no yes + 26 MTAP16 no no yes + 27 MTAP24 no no yes + 28 FU-A no yes yes + 29 FU-B no no yes + 30-31 undefined ig ig ig + +5.5. Decoding Order Number (DON) + + In the interleaved packetization mode, the transmission order of NAL + units is allowed to differ from the decoding order of the NAL units. + Decoding order number (DON) is a field in the payload structure or a + derived variable that indicates the NAL unit decoding order. + Rationale and examples of use cases for transmission out of decoding + order and for the use of DON are given in section 13. + + The coupling of transmission and decoding order is controlled by the + OPTIONAL sprop-interleaving-depth MIME parameter as follows. When + the value of the OPTIONAL sprop-interleaving-depth MIME parameter is + equal to 0 (explicitly or per default) or transmission of NAL units + out of their decoding order is disallowed by external means, the + transmission order of NAL units MUST conform to the NAL unit decoding + order. 
When the value of the OPTIONAL sprop-interleaving-depth MIME + parameter is greater than 0 or transmission of NAL units out of their + decoding order is allowed by external means, + + o the order of NAL units in an MTAP16 and an MTAP24 is NOT REQUIRED + to be the NAL unit decoding order, and + + o the order of NAL units generated by decapsulating STAP-Bs, MTAPs, + and FUs in two consecutive packets is NOT REQUIRED to be the NAL + unit decoding order. + + The RTP payload structures for a single NAL unit packet, an STAP-A, + and an FU-A do not include DON. STAP-B and FU-B structures include + DON, and the structure of MTAPs enables derivation of DON as + specified in section 5.7.2. + + + + +Wenger, et al. Standards Track [Page 15] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Informative note: When an FU-A occurs in interleaved mode, it + always follows an FU-B, which sets its DON. + + Informative note: If a transmitter wants to encapsulate a single + NAL unit per packet and transmit packets out of their decoding + order, STAP-B packet type can be used. + + In the single NAL unit packetization mode, the transmission order of + NAL units, determined by the RTP sequence number, MUST be the same as + their NAL unit decoding order. In the non-interleaved packetization + mode, the transmission order of NAL units in single NAL unit packets, + STAP-As, and FU-As MUST be the same as their NAL unit decoding order. + The NAL units within an STAP MUST appear in the NAL unit decoding + order. Thus, the decoding order is first provided through the + implicit order within a STAP, and second provided through the RTP + sequence number for the order between STAPs, FUs, and single NAL unit + packets. + + Signaling of the value of DON for NAL units carried in STAP-B, MTAP, + and a series of fragmentation units starting with an FU-B is + specified in sections 5.7.1, 5.7.2, and 5.8, respectively. 
The DON + value of the first NAL unit in transmission order MAY be set to any + value. Values of DON are in the range of 0 to 65535, inclusive. + After reaching the maximum value, the value of DON wraps around to 0. + + The decoding order of two NAL units contained in any STAP-B, MTAP, or + a series of fragmentation units starting with an FU-B is determined + as follows. Let DON(i) be the decoding order number of the NAL unit + having index i in the transmission order. Function don_diff(m,n) is + specified as follows: + + If DON(m) == DON(n), don_diff(m,n) = 0 + + If (DON(m) < DON(n) and DON(n) - DON(m) < 32768), + don_diff(m,n) = DON(n) - DON(m) + + If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768), + don_diff(m,n) = 65536 - DON(m) + DON(n) + + If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768), + don_diff(m,n) = - (DON(m) + 65536 - DON(n)) + + If (DON(m) > DON(n) and DON(m) - DON(n) < 32768), + don_diff(m,n) = - (DON(m) - DON(n)) + + A positive value of don_diff(m,n) indicates that the NAL unit having + transmission order index n follows, in decoding order, the NAL unit + having transmission order index m. When don_diff(m,n) is equal to 0, + then the NAL unit decoding order of the two NAL units can be in + either order. A negative value of don_diff(m,n) indicates that the + NAL unit having transmission order index n precedes, in decoding + order, the NAL unit having transmission order index m. + + Values of DON related fields (DON, DONB, and DOND; see section 5.7) + MUST be such that the decoding order determined by the values of DON, + as specified above, conforms to the NAL unit decoding order. If the + order of two NAL units in NAL unit decoding order is switched and the + new order does not conform to the NAL unit decoding order, the NAL + units MUST NOT have the same value of DON.
If the order of two + consecutive NAL units in the NAL unit stream is switched and the new + order still conforms to the NAL unit decoding order, the NAL units + MAY have the same value of DON. For example, when arbitrary slice + order is allowed by the video coding profile in use, all the coded + slice NAL units of a coded picture are allowed to have the same value + of DON. Consequently, NAL units having the same value of DON can be + decoded in any order, and two NAL units having a different value of + DON should be passed to the decoder in the order specified above. + When two consecutive NAL units in the NAL unit decoding order have a + different value of DON, the value of DON for the second NAL unit in + decoding order SHOULD be the value of DON for the first, incremented + by one. + + An example of the decapsulation process to recover the NAL unit + decoding order is given in section 7. + + Informative note: Receivers should not expect that the absolute + difference of values of DON for two consecutive NAL units in the + NAL unit decoding order will be equal to one, even in error-free + transmission. An increment by one is not required, as at the time + of associating values of DON to NAL units, it may not be known + whether all NAL units are delivered to the receiver. For example, + a gateway may not forward coded slice NAL units of non-reference + pictures or SEI NAL units when there is a shortage of bit rate in + the network to which the packets are forwarded. In another + example, a live broadcast is interrupted by pre-encoded content, + such as commercials, from time to time. The first intra picture + of a pre-encoded clip is transmitted in advance to ensure that it + is readily available in the receiver. When transmitting the first + intra picture, the originator does not exactly know how many NAL + units will be encoded before the first intra picture of the pre- + encoded clip follows in decoding order. 
Thus, the values of DON + for the NAL units of the first intra picture of the pre-encoded + clip have to be estimated when they are transmitted, and gaps in + values of DON may occur. + + + + + +Wenger, et al. Standards Track [Page 17] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +5.6. Single NAL Unit Packet + + The single NAL unit packet defined here MUST contain only one NAL + unit, of the types defined in [1]. This means that neither an + aggregation packet nor a fragmentation unit can be used within a + single NAL unit packet. A NAL unit stream composed by decapsulating + single NAL unit packets in RTP sequence number order MUST conform to + the NAL unit decoding order. The structure of the single NAL unit + packet is shown in Figure 2. + + Informative note: The first byte of a NAL unit co-serves as the + RTP payload header. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |F|NRI| type | | + +-+-+-+-+-+-+-+-+ | + | | + | Bytes 2..n of a Single NAL unit | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 2. RTP payload format for single NAL unit packet + +5.7. Aggregation Packets + + Aggregation packets are the NAL unit aggregation scheme of this + payload specification. The scheme is introduced to reflect the + dramatically different MTU sizes of two key target networks: + wireline IP networks (with an MTU size that is often limited by the + Ethernet MTU size; roughly 1500 bytes), and IP or non-IP (e.g., ITU-T + H.324/M) based wireless communication systems with preferred + transmission unit sizes of 254 bytes or less. To prevent media + transcoding between the two worlds, and to avoid undesirable + packetization overhead, a NAL unit aggregation scheme is introduced. 
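The first payload octet, laid out as F, NRI, and type in Figure 2, decides how the rest of the payload is interpreted. The following is a minimal sketch of splitting that octet and naming the packet type according to Table 3; the names NalHeader, parse_nal_header, and packet_kind are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int         # forbidden_zero_bit (1 bit)
    nri: int       # nal_ref_idc (2 bits)
    nal_type: int  # type field (5 bits)


def parse_nal_header(octet: int) -> NalHeader:
    """Split the first payload octet into its F, NRI, and type fields."""
    return NalHeader(f=(octet >> 7) & 0x1,
                     nri=(octet >> 5) & 0x3,
                     nal_type=octet & 0x1F)


def packet_kind(nal_type: int) -> str:
    """Name the payload structure for a type value, following Table 3."""
    if 1 <= nal_type <= 23:
        return "NAL unit"
    names = {24: "STAP-A", 25: "STAP-B", 26: "MTAP16",
             27: "MTAP24", 28: "FU-A", 29: "FU-B"}
    return names.get(nal_type, "undefined")
```

A receiver would typically call parse_nal_header on payload[0] and dispatch on packet_kind, ignoring payloads reported as "undefined".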
+ + Two types of aggregation packets are defined by this specification: + + o Single-time aggregation packet (STAP): aggregates NAL units with + identical NALU-time. Two types of STAPs are defined, one without + DON (STAP-A) and another including DON (STAP-B). + + o Multi-time aggregation packet (MTAP): aggregates NAL units with + potentially differing NALU-time. Two different MTAPs are defined, + differing in the length of the NAL unit timestamp offset. + + The term NALU-time is defined as the value that the RTP timestamp + would have if that NAL unit were transported in its own RTP + packet. + + Each NAL unit to be carried in an aggregation packet is encapsulated + in an aggregation unit. The four different aggregation units and + their characteristics are described in sections 5.7.1 and 5.7.2. + + The structure of the RTP payload format for aggregation packets is + presented in Figure 3. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |F|NRI| type | | + +-+-+-+-+-+-+-+-+ | + | | + | one or more aggregation units | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 3. RTP payload format for aggregation packets + + MTAPs and STAPs share the following packetization rules: The RTP + timestamp MUST be set to the earliest of the NALU-times of all the + NAL units to be aggregated. The type field of the NAL unit type + octet MUST be set to the appropriate value, as indicated in Table 4. + The F bit MUST be cleared if all F bits of the aggregated NAL units + are zero; otherwise, it MUST be set. The value of NRI MUST be the + maximum of the NRI values of all the NAL units carried in the + aggregation packet. + + Table 4.
Type field for STAPs and MTAPs + + Type Packet Timestamp offset DON related fields + field length (DON, DONB, DOND) + (in bits) present + -------------------------------------------------------- + 24 STAP-A 0 no + 25 STAP-B 0 yes + 26 MTAP16 16 yes + 27 MTAP24 24 yes + + The marker bit in the RTP header is set to the value that the marker + bit of the last NAL unit of the aggregated packet would have if it + were transported in its own RTP packet. + + + + +Wenger, et al. Standards Track [Page 19] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + The payload of an aggregation packet consists of one or more + aggregation units. See sections 5.7.1 and 5.7.2 for the four + different types of aggregation units. An aggregation packet can + carry as many aggregation units as necessary; however, the total + amount of data in an aggregation packet obviously MUST fit into an IP + packet, and the size SHOULD be chosen so that the resulting IP packet + is smaller than the MTU size. An aggregation packet MUST NOT contain + fragmentation units specified in section 5.8. Aggregation packets + MUST NOT be nested; i.e., an aggregation packet MUST NOT contain + another aggregation packet. + +5.7.1. Single-Time Aggregation Packet + + Single-time aggregation packet (STAP) SHOULD be used whenever NAL + units are aggregated that all share the same NALU-time. The payload + of an STAP-A does not include DON and consists of at least one + single-time aggregation unit, as presented in Figure 4. The payload + of an STAP-B consists of a 16-bit unsigned decoding order number + (DON) (in network byte order) followed by at least one single-time + aggregation unit, as presented in Figure 5. 
+ + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : | + +-+-+-+-+-+-+-+-+ | + | | + | single-time aggregation units | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 4. Payload format for STAP-A + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : decoding order number (DON) | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + | | + | single-time aggregation units | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 5. Payload format for STAP-B + + + +Wenger, et al. Standards Track [Page 20] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + The DON field specifies the value of DON for the first NAL unit in an + STAP-B in transmission order. For each successive NAL unit in + appearance order in an STAP-B, the value of DON is equal to (the + value of DON of the previous NAL unit in the STAP-B + 1) % 65536, in + which '%' stands for the modulo operation. + + A single-time aggregation unit consists of 16-bit unsigned size + information (in network byte order) that indicates the size of the + following NAL unit in bytes (excluding these two octets, but + including the NAL unit type octet of the NAL unit), followed by the + NAL unit itself, including its NAL unit type byte. A single-time + aggregation unit is byte aligned within the RTP payload, but it may + not be aligned on a 32-bit word boundary. Figure 6 presents the + structure of the single-time aggregation unit. 
+ + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : NAL unit size | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + | | + | NAL unit | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 6. Structure for single-time aggregation unit + + Figure 7 presents an example of an RTP packet that contains an + STAP-A. The STAP contains two single-time aggregation units, labeled + as 1 and 2 in the figure. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | RTP Header | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |STAP-A NAL HDR | NALU 1 Size | NALU 1 HDR | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 1 Data | + : : + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | NALU 2 Size | NALU 2 HDR | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 2 Data | + : : + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 7. An example of an RTP packet including an STAP-A and two + single-time aggregation units + + Figure 8 presents an example of an RTP packet that contains an + STAP-B. The STAP contains two single-time aggregation units, labeled + as 1 and 2 in the figure.
+ + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | RTP Header | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |STAP-B NAL HDR | DON | NALU 1 Size | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 1 Size | NALU 1 HDR | NALU 1 Data | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + : : + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | NALU 2 Size | NALU 2 HDR | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 2 Data | + : : + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 8. An example of an RTP packet including an STAP-B and two + single-time aggregation units + +5.7.2. Multi-Time Aggregation Packets (MTAPs) + + The NAL unit payload of MTAPs consists of a 16-bit unsigned decoding + order number base (DONB) (in network byte order) and one or more + multi-time aggregation units, as presented in Figure 9. DONB MUST + contain the value of DON for the first NAL unit in the NAL unit + decoding order among the NAL units of the MTAP. + + Informative note: The first NAL unit in the NAL unit decoding + order is not necessarily the first NAL unit in the order in which + the NAL units are encapsulated in an MTAP. + + + + + + + + + + + + + + +Wenger, et al. Standards Track [Page 23] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : decoding order number base | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + | | + | multi-time aggregation units | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 9. 
NAL unit payload format for MTAPs + + Two different multi-time aggregation units are defined in this + specification. Both of them consist of 16 bits unsigned size + information of the following NAL unit (in network byte order), an 8- + bit unsigned decoding order number difference (DOND), and n bits (in + network byte order) of timestamp offset (TS offset) for this NAL + unit, whereby n can be 16 or 24. The choice between the different + MTAP types (MTAP16 and MTAP24) is application dependent: the larger + the timestamp offset is, the higher the flexibility of the MTAP, but + the overhead is also higher. + + The structure of the multi-time aggregation units for MTAP16 and + MTAP24 are presented in Figures 10 and 11, respectively. The + starting or ending position of an aggregation unit within a packet is + NOT REQUIRED to be on a 32-bit word boundary. The DON of the + following NAL unit is equal to (DONB + DOND) % 65536, in which % + denotes the modulo operation. This memo does not specify how the NAL + units within an MTAP are ordered, but, in most cases, NAL unit + decoding order SHOULD be used. + + The timestamp offset field MUST be set to a value equal to the value + of the following formula: If the NALU-time is larger than or equal to + the RTP timestamp of the packet, then the timestamp offset equals + (the NALU-time of the NAL unit - the RTP timestamp of the packet). + If the NALU-time is smaller than the RTP timestamp of the packet, + then the timestamp offset is equal to the NALU-time + (2^32 - the RTP + timestamp of the packet). + + + + + + + + + + + +Wenger, et al. 
Standards Track [Page 24] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : NAL unit size | DOND | TS offset | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | TS offset | | + +-+-+-+-+-+-+-+-+ NAL unit | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 10. Multi-time aggregation unit for MTAP16 + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + : NAL unit size | DOND | TS offset | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | TS offset | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + | NAL unit | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 11. Multi-time aggregation unit for MTAP24 + + For the "earliest" multi-time aggregation unit in an MTAP the + timestamp offset MUST be zero. Hence, the RTP timestamp of the MTAP + itself is identical to the earliest NALU-time. + + Informative note: The "earliest" multi-time aggregation unit is + the one that would have the smallest extended RTP timestamp among + all the aggregation units of an MTAP if the aggregation units were + encapsulated in single NAL unit packets. An extended timestamp is + a timestamp that has more than 32 bits and is capable of counting + the wraparound of the timestamp field, thus enabling one to + determine the smallest value if the timestamp wraps. Such an + "earliest" aggregation unit may not be the first one in the order + in which the aggregation units are encapsulated in an MTAP. The + "earliest" NAL unit need not be the same as the first NAL unit in + the NAL unit decoding order either. + + + + + + + + +Wenger, et al.
Standards Track [Page 25] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Figure 12 presents an example of an RTP packet that contains a + multi-time aggregation packet of type MTAP16 that contains two + multi-time aggregation units, labeled as 1 and 2 in the figure. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | RTP Header | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |MTAP16 NAL HDR | decoding order number base | NALU 1 Size | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 1 Size | NALU 1 DOND | NALU 1 TS offset | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 1 HDR | NALU 1 DATA | + +-+-+-+-+-+-+-+-+ + + : : + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | NALU 2 SIZE | NALU 2 DOND | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 2 TS offset | NALU 2 HDR | NALU 2 DATA | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + : : + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 12. An RTP packet including a multi-time aggregation + packet of type MTAP16 and two multi-time aggregation + units + + + + + + + + + + + + + + + + + + + + + + +Wenger, et al. Standards Track [Page 26] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Figure 13 presents an example of an RTP packet that contains a + multi-time aggregation packet of type MTAP24 that contains two + multi-time aggregation units, labeled as 1 and 2 in the figure. 
+ + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | RTP Header | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |MTAP24 NAL HDR | decoding order number base | NALU 1 Size | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 1 Size | NALU 1 DOND | NALU 1 TS offs | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |NALU 1 TS offs | NALU 1 HDR | NALU 1 DATA | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + : : + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | NALU 2 SIZE | NALU 2 DOND | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 2 TS offset | NALU 2 HDR | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | NALU 2 DATA | + : : + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 13. An RTP packet including a multi-time aggregation + packet of type MTAP24 and two multi-time aggregation + units + +5.8. Fragmentation Units (FUs) + + This payload type allows fragmenting a NAL unit into several RTP + packets. Doing so on the application layer instead of relying on + lower layer fragmentation (e.g., by IP) has the following advantages: + + o The payload format is capable of transporting NAL units bigger + than 64 kbytes over an IPv4 network that may be present in pre- + recorded video, particularly in High Definition formats (there is + a limit of the number of slices per picture, which results in a + limit of NAL units per picture, which may result in big NAL + units). + + o The fragmentation mechanism allows fragmenting a single picture + and applying generic forward error correction as described in + section 12.5. + + + + +Wenger, et al. 
Standards Track [Page 27] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Fragmentation is defined only for a single NAL unit and not for any + aggregation packets. A fragment of a NAL unit consists of an integer + number of consecutive octets of that NAL unit. Each octet of the NAL + unit MUST be part of exactly one fragment of that NAL unit. + Fragments of the same NAL unit MUST be sent in consecutive order with + ascending RTP sequence numbers (with no other RTP packets within the + same RTP packet stream being sent between the first and last + fragment). Similarly, a NAL unit MUST be reassembled in RTP sequence + number order. + + When a NAL unit is fragmented and conveyed within fragmentation units + (FUs), it is referred to as a fragmented NAL unit. STAPs and MTAPs + MUST NOT be fragmented. FUs MUST NOT be nested; i.e., an FU MUST NOT + contain another FU. + + The RTP timestamp of an RTP packet carrying an FU is set to the NALU + time of the fragmented NAL unit. + + Figure 14 presents the RTP payload format for FU-As. An FU-A + consists of a fragmentation unit indicator of one octet, a + fragmentation unit header of one octet, and a fragmentation unit + payload. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | FU indicator | FU header | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | + | | + | FU payload | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 14. RTP payload format for FU-A + + + + + + + + + + + + + + + +Wenger, et al. Standards Track [Page 28] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Figure 15 presents the RTP payload format for FU-Bs. 
An FU-B + consists of a fragmentation unit indicator of one octet, a + fragmentation unit header of one octet, a decoding order number (DON) + (in network byte order), and a fragmentation unit payload. In other + words, the structure of FU-B is the same as the structure of FU-A, + except for the additional DON field. + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | FU indicator | FU header | DON | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | + | FU payload | + | | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | :...OPTIONAL RTP padding | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Figure 15. RTP payload format for FU-B + + NAL unit type FU-B MUST be used in the interleaved packetization mode + for the first fragmentation unit of a fragmented NAL unit. NAL unit + type FU-B MUST NOT be used in any other case. In other words, in the + interleaved packetization mode, each NALU that is fragmented has an + FU-B as the first fragment, followed by one or more FU-A fragments. + + The FU indicator octet has the following format: + + +---------------+ + |0|1|2|3|4|5|6|7| + +-+-+-+-+-+-+-+-+ + |F|NRI| Type | + +---------------+ + + Values equal to 28 and 29 in the Type field of the FU indicator octet + identify an FU-A and an FU-B, respectively. The use of the F bit is + described in section 5.3. The value of the NRI field MUST be set + according to the value of the NRI field in the fragmented NAL unit. + + The FU header has the following format: + + +---------------+ + |0|1|2|3|4|5|6|7| + +-+-+-+-+-+-+-+-+ + |S|E|R| Type | + +---------------+ + + S: 1 bit + When set to one, the Start bit indicates the start of a fragmented + NAL unit.
When the following FU payload is not the start of a + fragmented NAL unit payload, the Start bit is set to zero. + + E: 1 bit + When set to one, the End bit indicates the end of a fragmented NAL + unit, i.e., the last byte of the payload is also the last byte of + the fragmented NAL unit. When the following FU payload is not the + last fragment of a fragmented NAL unit, the End bit is set to + zero. + + R: 1 bit + The Reserved bit MUST be equal to 0 and MUST be ignored by the + receiver. + + Type: 5 bits + The NAL unit payload type as defined in table 7-1 of [1]. + + The value of DON in FU-Bs is selected as described in section 5.5. + + Informative note: The DON field in FU-Bs allows gateways to + fragment NAL units to FU-Bs without organizing the incoming NAL + units to the NAL unit decoding order. + + A fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the + Start bit and End bit MUST NOT both be set to one in the same FU + header. + + The FU payload consists of fragments of the payload of the fragmented + NAL unit so that if the fragmentation unit payloads of consecutive + FUs are sequentially concatenated, the payload of the fragmented NAL + unit can be reconstructed. The NAL unit type octet of the fragmented + NAL unit is not included as such in the fragmentation unit payload, + but rather the information of the NAL unit type octet of the + fragmented NAL unit is conveyed in F and NRI fields of the FU + indicator octet of the fragmentation unit and in the type field of + the FU header. A FU payload MAY have any number of octets and MAY be + empty. + + Informative note: Empty FUs are allowed to reduce the latency of a + certain class of senders in nearly lossless environments. These + senders can be characterized in that they packetize NALU fragments + before the NALU is completely generated and, hence, before the + NALU size is known. 
If zero-length NALU fragments were not + allowed, the sender would have to generate at least one bit of + data of the following fragment before the current fragment could + be sent. Due to the characteristics of H.264, where sometimes + several macroblocks occupy zero bits, this is undesirable and can + add delay. However, the (potential) use of zero-length NALU + fragments should be carefully weighed against the increased risk + of the loss of the NALU because of the additional packets employed + for its transmission. + + If a fragmentation unit is lost, the receiver SHOULD discard all + following fragmentation units in transmission order corresponding to + the same fragmented NAL unit. + + A receiver in an endpoint or in a MANE MAY aggregate the first n-1 + fragments of a NAL unit to an (incomplete) NAL unit, even if fragment + n of that NAL unit is not received. In this case, the + forbidden_zero_bit of the NAL unit MUST be set to one to indicate a + syntax violation. + +6. Packetization Rules + + The packetization modes are introduced in section 5.4. The + packetization rules common to more than one of the packetization + modes are specified in section 6.1. The packetization rules for the + single NAL unit mode, the non-interleaved mode, and the interleaved + mode are specified in sections 6.2, 6.3, and 6.4, respectively. + +6.1. Common Packetization Rules + + All senders MUST enforce the following packetization rules regardless + of the packetization mode in use: + + o Coded slice NAL units or coded slice data partition NAL units + belonging to the same coded picture (and thus sharing the same RTP + timestamp value) MAY be sent in any order permitted by the + applicable profile defined in [1]; however, for delay-critical + systems, they SHOULD be sent in their original coding order to + minimize the delay.
Note that the coding order is not necessarily + the scan order, but the order the NAL packets become available to + the RTP stack. + + o Parameter sets are handled in accordance with the rules and + recommendations given in section 8.4. + + o MANEs MUST NOT duplicate any NAL unit except for sequence or + picture parameter set NAL units, as neither this memo nor the + H.264 specification provides means to identify duplicated NAL + units. Sequence and picture parameter set NAL units MAY be + duplicated to make their correct reception more probable, but any + such duplication MUST NOT affect the contents of any active + sequence or picture parameter set. Duplication SHOULD be + + + +Wenger, et al. Standards Track [Page 31] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + performed on the application layer and not by duplicating RTP + packets (with identical sequence numbers). + + Senders using the non-interleaved mode and the interleaved mode MUST + enforce the following packetization rule: + + o MANEs MAY convert single NAL unit packets into one aggregation + packet, convert an aggregation packet into several single NAL unit + packets, or mix both concepts, in an RTP translator. The RTP + translator SHOULD take into account at least the following + parameters: path MTU size, unequal protection mechanisms (e.g., + through packet-based FEC according to RFC 2733 [18], especially + for sequence and picture parameter set NAL units and coded slice + data partition A NAL units), bearable latency of the system, and + buffering capabilities of the receiver. + + Informative note: An RTP translator is required to handle RTCP as + per RFC 3550. + +6.2. Single NAL Unit Mode + + This mode is in use when the value of the OPTIONAL packetization-mode + MIME parameter is equal to 0, the packetization-mode is not present, + or no other packetization mode is signaled by external means. All + receivers MUST support this mode. 
It is primarily intended for low- + delay applications that are compatible with systems using ITU-T + Recommendation H.241 [15] (see section 12.1). Only single NAL unit + packets MAY be used in this mode. STAPs, MTAPs, and FUs MUST NOT be + used. The transmission order of single NAL unit packets MUST comply + with the NAL unit decoding order. + +6.3. Non-Interleaved Mode + + This mode is in use when the value of the OPTIONAL packetization-mode + MIME parameter is equal to 1 or the mode is turned on by external + means. This mode SHOULD be supported. It is primarily intended for + low-delay applications. Only single NAL unit packets, STAP-As, and + FU-As MAY be used in this mode. STAP-Bs, MTAPs, and FU-Bs MUST NOT + be used. The transmission order of NAL units MUST comply with the + NAL unit decoding order. + + + + + + + + + + + +Wenger, et al. Standards Track [Page 32] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +6.4. Interleaved Mode + + This mode is in use when the value of the OPTIONAL packetization-mode + MIME parameter is equal to 2 or the mode is turned on by external + means. Some receivers MAY support this mode. STAP-Bs, MTAPs, FU-As, + and FU-Bs MAY be used. STAP-As and single NAL unit packets MUST NOT + be used. The transmission order of packets and NAL units is + constrained as specified in section 5.5. + +7. De-Packetization Process (Informative) + + The de-packetization process is implementation dependent. Therefore, + the following description should be seen as an example of a suitable + implementation. Other schemes may be used as well. Optimizations + relative to the described algorithms are likely possible. Section + 7.1 presents the de-packetization process for the single NAL unit and + non-interleaved packetization modes, whereas section 7.2 describes + the process for the interleaved mode. Section 7.3 includes + additional decapsulation guidelines for intelligent receivers. 
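
   Informative sketch (not part of the payload format): before a
   de-packetizer can dispatch on packet types, it has to know which
   types are valid in the mode that was negotiated.  The Python
   fragment below restates the restrictions of sections 6.2 through 6.4
   as a lookup table.  It assumes the packet type numbering of section
   5.2 (1-23 single NAL unit packet, 24 STAP-A, 25 STAP-B, 26 MTAP16,
   27 MTAP24, 28 FU-A, 29 FU-B).  The names ALLOWED_TYPES and
   packet_allowed are illustrative only and do not appear in this memo.

```python
# Packet types permitted per packetization-mode (sections 6.2 to 6.4).
# The type numbering is assumed from section 5.2 of this memo.
SINGLE_NAL = set(range(1, 24))   # single NAL unit packets
STAP_A, STAP_B, MTAP16, MTAP24, FU_A, FU_B = 24, 25, 26, 27, 28, 29

ALLOWED_TYPES = {
    0: SINGLE_NAL,                               # single NAL unit mode
    1: SINGLE_NAL | {STAP_A, FU_A},              # non-interleaved mode
    2: {STAP_B, MTAP16, MTAP24, FU_A, FU_B},     # interleaved mode
}


def packet_allowed(first_payload_octet: int, packetization_mode: int = 0) -> bool:
    """Return True if the packet type may be used in the given mode.

    The type is taken from the low 5 bits of the first payload octet.
    An absent packetization-mode parameter is treated as mode 0, as
    stated in section 6.2.
    """
    packet_type = first_payload_octet & 0x1F
    return packet_type in ALLOWED_TYPES[packetization_mode]
```

   A receiver could call packet_allowed() on the first payload octet of
   every packet and drop anything the negotiated mode forbids, for
   example an FU-A (type 28) arriving in single NAL unit mode.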
+ + All normal RTP mechanisms related to buffer management apply. In + particular, duplicated or outdated RTP packets (as indicated by the + RTP sequence number and the RTP timestamp) are removed. To + determine the exact time for decoding, factors such as a possible + intentional delay to allow for proper inter-stream synchronization + must be taken into account. + +7.1. Single NAL Unit and Non-Interleaved Mode + + The receiver includes a receiver buffer to compensate for + transmission delay jitter. The receiver stores incoming packets in + reception order into the receiver buffer. Packets are decapsulated + in RTP sequence number order. If a decapsulated packet is a single + NAL unit packet, the NAL unit contained in the packet is passed + directly to the decoder. If a decapsulated packet is an STAP-A, the + NAL units contained in the packet are passed to the decoder in the + order in which they are encapsulated in the packet. If a + decapsulated packet is an FU-A, all the fragments of the fragmented + NAL unit are concatenated and passed to the decoder. + + Informative note: If the decoder supports Arbitrary Slice Order, + coded slices of a picture can be passed to the decoder in any + order regardless of their reception and transmission order. + +7.2. Interleaved Mode + + The general concept behind these de-packetization rules is to reorder + NAL units from transmission order to the NAL unit decoding order. + + The receiver includes a receiver buffer, which is used to compensate + for transmission delay jitter and to reorder packets from + transmission order to the NAL unit decoding order. In this section, + the receiver operation is described under the assumption that there + is no transmission delay jitter.
To distinguish it from a + practical receiver buffer that is also used for compensation of + transmission delay jitter, the receiver buffer is hereafter called + the deinterleaving buffer in this section. Receivers SHOULD also + prepare for transmission delay jitter; i.e., either reserve separate + buffers for transmission delay jitter buffering and deinterleaving + buffering or use a receiver buffer for both transmission delay jitter + and deinterleaving. Moreover, receivers SHOULD take transmission + delay jitter into account in the buffering operation; e.g., by + additional initial buffering before decoding and playback start. + + This section is organized as follows: subsection 7.2.1 presents how + to calculate the size of the deinterleaving buffer. Subsection 7.2.2 + specifies how the receiver organizes received NAL units into + the NAL unit decoding order. + +7.2.1. Size of the Deinterleaving Buffer + + When the SDP Offer/Answer model or any other capability exchange + procedure is used in session setup, the properties of the received + stream SHOULD be such that the receiver capabilities are not + exceeded. In the SDP Offer/Answer model, the receiver can indicate + its capabilities to allocate a deinterleaving buffer with the + deint-buf-cap MIME parameter. The sender indicates the requirement for the + deinterleaving buffer size with the sprop-deint-buf-req MIME + parameter. It is therefore RECOMMENDED to set the deinterleaving + buffer size, in terms of number of bytes, equal to or greater than + the value of the sprop-deint-buf-req MIME parameter. See section 8.1 for + further information on the deint-buf-cap and sprop-deint-buf-req MIME + parameters and section 8.2.2 for further information on their use in + the SDP Offer/Answer model. + + When a declarative session description is used in session setup, the + sprop-deint-buf-req MIME parameter signals the requirement for the + deinterleaving buffer size.
It is therefore RECOMMENDED to set the + deinterleaving buffer size, in terms of number of bytes, equal to or + greater than the value of sprop-deint-buf-req MIME parameter. + + + + +Wenger, et al. Standards Track [Page 34] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +7.2.2. Deinterleaving Process + + There are two buffering states in the receiver: initial buffering and + buffering while playing. Initial buffering occurs when the RTP + session is initialized. After initial buffering, decoding and + playback is started, and the buffering-while-playing mode is used. + + Regardless of the buffering state, the receiver stores incoming NAL + units, in reception order, in the deinterleaving buffer as follows. + NAL units of aggregation packets are stored in the deinterleaving + buffer individually. The value of DON is calculated and stored for + all NAL units. + + The receiver operation is described below with the help of the + following functions and constants: + + o Function AbsDON is specified in section 8.1. + + o Function don_diff is specified in section 5.5. + + o Constant N is the value of the OPTIONAL sprop-interleaving-depth + MIME type parameter (see section 8.1) incremented by 1. + + Initial buffering lasts until one of the following conditions is + fulfilled: + + o There are N VCL NAL units in the deinterleaving buffer. + + o If sprop-max-don-diff is present, don_diff(m,n) is greater than + the value of sprop-max-don-diff, in which n corresponds to the NAL + unit having the greatest value of AbsDON among the received NAL + units and m corresponds to the NAL unit having the smallest value + of AbsDON among the received NAL units. + + o Initial buffering has lasted for the duration equal to or greater + than the value of the OPTIONAL sprop-init-buf-time MIME parameter. 
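
   Informative sketch (not part of the payload format): the three
   conditions above can be evaluated each time a NAL unit is stored in
   the deinterleaving buffer.  The Python fragment below is one possible
   reading, under these assumptions: AbsDON (section 8.1) has already
   been computed by the caller and is supplied with each NAL unit;
   don_diff is restated from section 5.5; elapsed time and
   sprop-init-buf-time are both expressed in ticks of a 90-kHz clock;
   and "there are N VCL NAL units" is treated as "at least N".  The
   names BufferedNal and initial_buffering_done are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BufferedNal:
    don: int       # 16-bit decoding order number of the NAL unit
    abs_don: int   # AbsDON (section 8.1), assumed computed by the caller
    is_vcl: bool   # True for VCL NAL units


def don_diff(don_m: int, don_n: int) -> int:
    """don_diff(m, n), restated from section 5.5 (wrap-around aware)."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        d = don_n - don_m
        return d if d < 32768 else -(don_m + 65536 - don_n)
    d = don_m - don_n
    return 65536 - don_m + don_n if d >= 32768 else -d


def initial_buffering_done(buf: List[BufferedNal],
                           interleaving_depth: int,
                           elapsed_ticks: int,
                           max_don_diff: Optional[int] = None,
                           init_buf_time: Optional[int] = None) -> bool:
    """Return True as soon as any one of the three conditions holds."""
    n_const = interleaving_depth + 1   # constant N of section 7.2.2
    # Condition 1: N VCL NAL units are in the deinterleaving buffer.
    if sum(1 for u in buf if u.is_vcl) >= n_const:
        return True
    # Condition 2: DON spread between smallest and greatest AbsDON.
    if max_don_diff is not None and buf:
        m = min(buf, key=lambda u: u.abs_don)
        n = max(buf, key=lambda u: u.abs_don)
        if don_diff(m.don, n.don) > max_don_diff:
            return True
    # Condition 3: sprop-init-buf-time has elapsed.
    if init_buf_time is not None and elapsed_ticks >= init_buf_time:
        return True
    return False
```

   Passing None for max_don_diff or init_buf_time models the case in
   which the corresponding OPTIONAL parameter was not signaled, so that
   condition is simply skipped.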
+ + The NAL units to be removed from the deinterleaving buffer are + determined as follows: + + o If the deinterleaving buffer contains at least N VCL NAL units, + NAL units are removed from the deinterleaving buffer and passed to + the decoder in the order specified below until the buffer contains + N-1 VCL NAL units. + + o If sprop-max-don-diff is present, all NAL units m for which + don_diff(m,n) is greater than sprop-max-don-diff are removed from + the deinterleaving buffer and passed to the decoder in the order + specified below. Herein, n corresponds to the NAL unit having the + greatest value of AbsDON among the received NAL units. + + The order in which NAL units are passed to the decoder is specified + as follows: + + o Let PDON be a variable that is initialized to 0 at the beginning + of an RTP session. + + o For each NAL unit associated with a value of DON, a DON distance + is calculated as follows. If the value of DON of the NAL unit is + larger than the value of PDON, the DON distance is equal to + (DON - PDON). Otherwise, the DON distance is equal to + (65535 - PDON + DON + 1). + + o NAL units are delivered to the decoder in ascending order of DON + distance. If several NAL units share the same value of DON + distance, they can be passed to the decoder in any order. + + o When a desired number of NAL units have been passed to the + decoder, the value of PDON is set to the value of DON for the last + NAL unit passed to the decoder. + +7.3. Additional De-Packetization Guidelines + + The following additional de-packetization rules may be used to + implement an operational H.264 de-packetizer: + + o Intelligent RTP receivers (e.g., in gateways) may identify lost + coded slice data partitions A (DPAs).
If a lost DPA is found, a + gateway may decide not to send the corresponding coded slice data + partitions B and C, as their information is meaningless for H.264 + decoders. In this way a MANE can reduce network load by + discarding useless packets without parsing a complex bitstream. + + o Intelligent RTP receivers (e.g., in gateways) may identify lost + FUs. If a lost FU is found, a gateway may decide not to send the + following FUs of the same fragmented NAL unit, as their + information is meaningless for H.264 decoders. In this way a MANE + can reduce network load by discarding useless packets without + parsing a complex bitstream. + + + + + + + +Wenger, et al. Standards Track [Page 36] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + o Intelligent receivers having to discard packets or NALUs should + first discard all packets/NALUs in which the value of the NRI + field of the NAL unit type octet is equal to 0. This will + minimize the impact on user experience and keep the reference + pictures intact. If more packets have to be discarded, then + packets with a numerically lower NRI value should be discarded + before packets with a numerically higher NRI value. However, + discarding any packets with an NRI bigger than 0 very likely leads + to decoder drift and SHOULD be avoided. + +8. Payload Format Parameters + + This section specifies the parameters that MAY be used to select + optional features of the payload format and certain features of the + bitstream. The parameters are specified here as part of the MIME + subtype registration for the ITU-T H.264 | ISO/IEC 14496-10 codec. A + mapping of the parameters into the Session Description Protocol (SDP) + [5] is also provided for applications that use SDP. Equivalent + parameters could be defined elsewhere for use with control protocols + that do not use MIME or SDP. + + Some parameters provide a receiver with the properties of the stream + that will be sent. 
The name of all these parameters starts with + "sprop" for stream properties. Some of these "sprop" parameters are + limited by other payload or codec configuration parameters. For + example, the sprop-parameter-sets parameter is constrained by the + profile-level-id parameter. The media sender selects all "sprop" + parameters rather than the receiver. This uncommon characteristic of + the "sprop" parameters may not be compatible with some signaling + protocol concepts, in which case the use of these parameters SHOULD + be avoided. + +8.1. MIME Registration + + The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is + allocated from the IETF tree. + + The receiver MUST ignore any unspecified parameter. + + Media Type name: video + + Media subtype name: H264 + + Required parameters: none + + + + + + + +Wenger, et al. Standards Track [Page 37] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + OPTIONAL parameters: + profile-level-id: + A base16 [6] (hexadecimal) representation of + the following three bytes in the sequence + parameter set NAL unit specified in [1]: 1) + profile_idc, 2) a byte herein referred to as + profile-iop, composed of the values of + constraint_set0_flag, constraint_set1_flag, + constraint_set2_flag, and reserved_zero_5bits + in bit-significance order, starting from the + most significant bit, and 3) level_idc. Note + that reserved_zero_5bits is required to be + equal to 0 in [1], but other values for it may + be specified in the future by ITU-T or ISO/IEC. + + If the profile-level-id parameter is used to + indicate properties of a NAL unit stream, it + indicates the profile and level that a decoder + has to support in order to comply with [1] when + it decodes the stream. The profile-iop byte + indicates whether the NAL unit stream also + obeys all constraints of the indicated profiles + as follows. 
If bit 7 (the most significant + bit), bit 6, or bit 5 of profile-iop is equal + to 1, all constraints of the Baseline profile, + the Main profile, or the Extended profile, + respectively, are obeyed in the NAL unit + stream. + + If the profile-level-id parameter is used for + capability exchange or session setup procedure, + it indicates the profile that the codec + supports and the highest level + supported for the signaled profile. The + profile-iop byte indicates whether the codec + has additional limitations whereby only the + common subset of the algorithmic features and + limitations of the profiles signaled with the + profile-iop byte and of the profile indicated + by profile_idc is supported by the codec. For + example, if a codec supports only the common + subset of the coding tools of the Baseline + profile and the Main profile at level 2.1 and + below, the profile-level-id becomes 42E015, in + which 42 stands for the Baseline profile, E0 + indicates that only the common subset for all + profiles is supported, and 15 indicates level + 2.1. + + + +Wenger, et al. Standards Track [Page 38] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Informative note: Capability exchange and + session setup procedures should provide + means to list the capabilities for each + supported codec profile separately. For + example, the one-of-N codec selection + procedure of the SDP Offer/Answer model can + be used (section 10.2 of [7]). + + If no profile-level-id is present, the Baseline + Profile without additional constraints at Level + 1 MUST be implied. + + max-mbps, max-fs, max-cpb, max-dpb, and max-br: + These parameters MAY be used to signal the + capabilities of a receiver implementation. + These parameters MUST NOT be used for any other + purpose. The profile-level-id parameter MUST + be present in the same receiver capability + description that contains any of these + parameters. 
The level conveyed in the value of + the profile-level-id parameter MUST be such + that the receiver is fully capable of + supporting. max-mbps, max-fs, max-cpb, max- + dpb, and max-br MAY be used to indicate + capabilities of the receiver that extend the + required capabilities of the signaled level, as + specified below. + + When more than one parameter from the set (max- + mbps, max-fs, max-cpb, max-dpb, max-br) is + present, the receiver MUST support all signaled + capabilities simultaneously. For example, if + both max-mbps and max-br are present, the + signaled level with the extension of both the + frame rate and bit rate is supported. That is, + the receiver is able to decode NAL unit + streams in which the macroblock processing rate + is up to max-mbps (inclusive), the bit rate is + up to max-br (inclusive), the coded picture + buffer size is derived as specified in the + semantics of the max-br parameter below, and + other properties comply with the level + specified in the value of the profile-level-id + parameter. + + A receiver MUST NOT signal values of max- + mbps, max-fs, max-cpb, max-dpb, and max-br that + meet the requirements of a higher level, + + + +Wenger, et al. Standards Track [Page 39] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + referred to as level A herein, compared to the + level specified in the value of the profile- + level-id parameter, if the receiver can support + all the properties of level A. + + Informative note: When the OPTIONAL MIME + type parameters are used to signal the + properties of a NAL unit stream, max-mbps, + max-fs, max-cpb, max-dpb, and max-br are + not present, and the value of profile- + level-id must always be such that the NAL + unit stream complies fully with the + specified profile and level. + + max-mbps: The value of max-mbps is an integer indicating + the maximum macroblock processing rate in units + of macroblocks per second. 
The max-mbps + parameter signals that the receiver is capable + of decoding video at a higher rate than is + required by the signaled level conveyed in the + value of the profile-level-id parameter. When + max-mbps is signaled, the receiver MUST be able + to decode NAL unit streams that conform to the + signaled level, with the exception that the + MaxMBPS value in Table A-1 of [1] for the + signaled level is replaced with the value of + max-mbps. The value of max-mbps MUST be + greater than or equal to the value of MaxMBPS + for the level given in Table A-1 of [1]. + Senders MAY use this knowledge to send pictures + of a given size at a higher picture rate than + is indicated in the signaled level. + + max-fs: The value of max-fs is an integer indicating + the maximum frame size in units of macroblocks. + The max-fs parameter signals that the receiver + is capable of decoding larger picture sizes + than are required by the signaled level conveyed + in the value of the profile-level-id parameter. + When max-fs is signaled, the receiver MUST be + able to decode NAL unit streams that conform to + the signaled level, with the exception that the + MaxFS value in Table A-1 of [1] for the + signaled level is replaced with the value of + max-fs. The value of max-fs MUST be greater + than or equal to the value of MaxFS for the + level given in Table A-1 of [1]. Senders MAY + use this knowledge to send larger pictures at a + proportionally lower frame rate than is + indicated in the signaled level. + + max-cpb: The value of max-cpb is an integer indicating + the maximum coded picture buffer size in units + of 1000 bits for the VCL HRD parameters (see + A.3.1 item i of [1]) and in units of 1200 bits + for the NAL HRD parameters (see A.3.1 item j of + [1]).
The max-cpb parameter signals that the + receiver has more memory than the minimum + amount of coded picture buffer memory required + by the signaled level conveyed in the value of + the profile-level-id parameter. When max-cpb + is signaled, the receiver MUST be able to + decode NAL unit streams that conform to the + signaled level, with the exception that the + MaxCPB value in Table A-1 of [1] for the + signaled level is replaced with the value of + max-cpb. The value of max-cpb MUST be greater + than or equal to the value of MaxCPB for the + level given in Table A-1 of [1]. Senders MAY + use this knowledge to construct coded video + streams with greater variation of bit rate + than can be achieved with the + MaxCPB value in Table A-1 of [1]. + + Informative note: The coded picture buffer + is used in the hypothetical reference + decoder (Annex C) of H.264. The use of the + hypothetical reference decoder is + recommended in H.264 encoders to verify + that the produced bitstream conforms to the + standard and to control the output bitrate. + Thus, the coded picture buffer is + conceptually independent of any other + potential buffers in the receiver, + including de-interleaving and de-jitter + buffers. The coded picture buffer need not + be implemented in decoders as specified in + Annex C of H.264, but rather standard- + compliant decoders can have any buffering + arrangements provided that they can decode + standard-compliant bitstreams. Thus, in + practice, the input buffer for video + decoder can be integrated with de- + interleaving and de-jitter buffers of the + receiver. + + + + +Wenger, et al. Standards Track [Page 41] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + max-dpb: The value of max-dpb is an integer indicating + the maximum decoded picture buffer size in + units of 1024 bytes. 
The max-dpb parameter + signals that the receiver has more memory than + the minimum amount of decoded picture buffer + memory required by the signaled level conveyed + in the value of the profile-level-id parameter. + When max-dpb is signaled, the receiver MUST be + able to decode NAL unit streams that conform to + the signaled level, with the exception that the + MaxDPB value in Table A-1 of [1] for the + signaled level is replaced with the value of + max-dpb. Consequently, a receiver that signals + max-dpb MUST be capable of storing the + following number of decoded frames, + complementary field pairs, and non-paired + fields in its decoded picture buffer: + + Min(1024 * max-dpb / ( PicWidthInMbs * + FrameHeightInMbs * 256 * ChromaFormatFactor ), + 16) + + PicWidthInMbs, FrameHeightInMbs, and + ChromaFormatFactor are defined in [1]. + + The value of max-dpb MUST be greater than or + equal to the value of MaxDPB for the level + given in Table A-1 of [1]. Senders MAY use + this knowledge to construct coded video streams + with improved compression. + + Informative note: This parameter was added + primarily to complement a similar codepoint + in the ITU-T Recommendation H.245, so as to + facilitate signaling gateway designs. The + decoded picture buffer stores reconstructed + samples and is a property of the video + decoder only. There is no relationship + between the size of the decoded picture + buffer and the buffers used in RTP, + especially de-interleaving and de-jitter + buffers. + + max-br: The value of max-br is an integer indicating + the maximum video bit rate in units of 1000 + bits per second for the VCL HRD parameters (see + A.3.1 item i of [1]) and in units of 1200 bits + + + + +Wenger, et al. Standards Track [Page 42] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + per second for the NAL HRD parameters (see + A.3.1 item j of [1]). 
+ + The max-br parameter signals that the video + decoder of the receiver is capable of decoding + video at a higher bit rate than is required by + the signaled level conveyed in the value of the + profile-level-id parameter. The value of max- + br MUST be greater than or equal to the value + of MaxBR for the level given in Table A-1 of + [1]. + + When max-br is signaled, the video codec of the + receiver MUST be able to decode NAL unit + streams that conform to the signaled level, + conveyed in the profile-level-id parameter, + with the following exceptions in the limits + specified by the level: + o The value of max-br replaces the MaxBR value + of the signaled level (in Table A-1 of [1]). + o When the max-cpb parameter is not present, + the result of the following formula replaces + the value of MaxCPB in Table A-1 of [1]: + (MaxCPB of the signaled level) * max-br / + (MaxBR of the signaled level). + + For example, if a receiver signals capability + for Level 1.2 with max-br equal to 1550, this + indicates a maximum video bitrate of 1550 + kbits/sec for VCL HRD parameters, a maximum + video bitrate of 1860 kbits/sec for NAL HRD + parameters, and a CPB size of 4036458 bits + (1550000 / 384000 * 1000 * 1000). + + The value of max-br MUST be greater than or + equal to the value MaxBR for the signaled level + given in Table A-1 of [1]. + + Senders MAY use this knowledge to send higher + bitrate video as allowed in the level + definition of Annex A of H.264, to achieve + improved video quality. + + Informative note: This parameter was added + primarily to complement a similar codepoint + in the ITU-T Recommendation H.245, so as to + facilitate signaling gateway designs. No + assumption can be made from the value of + + + +Wenger, et al. Standards Track [Page 43] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + this parameter that the network is capable + of handling such bit rates at any given + time. 
In particular, no conclusion can be + drawn that the signaled bit rate is + possible under congestion control + constraints. + + redundant-pic-cap: + This parameter signals the capabilities of a + receiver implementation. When equal to 0, the + parameter indicates that the receiver makes no + attempt to use redundant coded pictures to + correct incorrectly decoded primary coded + pictures; because the receiver is not + capable of using redundant slices, a + sender SHOULD avoid sending them to + save bandwidth. When equal to 1, the receiver + is capable of decoding any such redundant slice + that covers a corrupted area in a primary + decoded picture (at least partly), and therefore + a sender MAY send redundant slices. When the + parameter is not present, then a value of 0 + MUST be used for redundant-pic-cap. When + present, the value of redundant-pic-cap MUST be + either 0 or 1. + + When the profile-level-id parameter is present + in the same capability signaling as the + redundant-pic-cap parameter, and the profile + indicated in profile-level-id is such that it + disallows the use of redundant coded pictures + (e.g., Main Profile), the value of + redundant-pic-cap MUST be equal to 0. When a receiver + indicates redundant-pic-cap equal to 0, the + received stream SHOULD NOT contain redundant + coded pictures. + + Informative note: Even if redundant-pic-cap + is equal to 0, the decoder is able to + ignore redundant coded pictures provided + that the decoder supports such a profile + (Baseline, Extended) in which redundant + coded pictures are allowed. + + Informative note: Even if redundant-pic-cap + is equal to 1, the receiver may also choose + other error concealment strategies to + replace or complement decoding of redundant + slices.
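
   Informative sketch (not part of the payload format): the derived
   limits given above for max-dpb and max-br are easy to get wrong by a
   factor of 1000 or 1024, so the Python fragment below works them
   through numerically.  It assumes that the frame count from the
   max-dpb formula is rounded down to whole frames, that
   ChromaFormatFactor is 1.5 (4:2:0 video), and that the caller supplies
   the MaxBR and MaxCPB values of the signaled level, since Table A-1 of
   [1] is not reproduced in this memo.  The function names are
   hypothetical.

```python
import math
from typing import Optional, Tuple


def dpb_frame_capacity(max_dpb: int,
                       pic_width_in_mbs: int,
                       frame_height_in_mbs: int,
                       chroma_format_factor: float = 1.5) -> int:
    """Min(1024 * max-dpb / (PicWidthInMbs * FrameHeightInMbs * 256 *
    ChromaFormatFactor), 16).  Rounding down is an assumption here."""
    frame_bytes = (pic_width_in_mbs * frame_height_in_mbs
                   * 256 * chroma_format_factor)
    return min(math.floor(1024 * max_dpb / frame_bytes), 16)


def max_br_limits(max_br: int,
                  level_max_br: int,
                  level_max_cpb: int,
                  max_cpb: Optional[int] = None) -> Tuple[int, int, int]:
    """Return (VCL bit rate in bit/s, NAL bit rate in bit/s,
    VCL CPB size in bits) for a receiver that signals max-br."""
    vcl_bps = 1000 * max_br    # units of 1000 bit/s for VCL HRD
    nal_bps = 1200 * max_br    # units of 1200 bit/s for NAL HRD
    if max_cpb is not None:
        # An explicitly signaled max-cpb takes precedence.
        cpb_bits = 1000 * max_cpb
    else:
        # (MaxCPB of the level) * max-br / (MaxBR of the level),
        # expressed in bits and rounded down.
        cpb_bits = (1000 * level_max_cpb * max_br) // level_max_br
    return vcl_bps, nal_bps, cpb_bits
```

   With the Level 1.2 figures implied by the max-br example above
   (MaxBR 384, MaxCPB 1000), max_br_limits(1550, 384, 1000) yields
   1550000 bit/s, 1860000 bit/s, and a CPB of 4036458 bits, which agrees
   with the values quoted in the text.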
+ + sprop-parameter-sets: + This parameter MAY be used to convey + any sequence and picture parameter set NAL + units (herein referred to as the initial + parameter set NAL units) that MUST precede any + other NAL units in decoding order. The + parameter MUST NOT be used to indicate codec + capability in any capability exchange + procedure. The value of the parameter is the + base64 [6] representation of the initial + parameter set NAL units as specified in + sections 7.3.2.1 and 7.3.2.2 of [1]. The + parameter sets are conveyed in decoding order, + and no framing of the parameter set NAL units + takes place. A comma is used to separate any + pair of parameter sets in the list. Note that + the number of bytes in a parameter set NAL unit + is typically less than 10, but a picture + parameter set NAL unit can contain several + hundreds of bytes. + + Informative note: When several payload + types are offered in the SDP Offer/Answer + model, each with its own sprop-parameter- + sets parameter, then the receiver cannot + assume that those parameter sets do not use + conflicting storage locations (i.e., + identical values of parameter set + identifiers). Therefore, a receiver should + double-buffer all sprop-parameter-sets and + make them available to the decoder instance + that decodes a certain payload type. + + parameter-add: This parameter MAY be used to signal whether + the receiver of this parameter is allowed to + add parameter sets in its signaling response + using the sprop-parameter-sets MIME parameter. + The value of this parameter is either 0 or 1. + 0 is equal to false; i.e., it is not allowed to + add parameter sets. 1 is equal to true; i.e., + it is allowed to add parameter sets. If the + parameter is not present, its value MUST be 1. + + + + + + +Wenger, et al. 
Standards Track [Page 45] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + packetization-mode: + This parameter signals the properties of an + RTP payload type or the capabilities of a + receiver implementation. Only a single + configuration point can be indicated; thus, + when capabilities to support more than one + packetization-mode are declared, multiple + configuration points (RTP payload types) must + be used. + + When the value of packetization-mode is equal + to 0 or packetization-mode is not present, the + single NAL unit mode, as defined in section 6.2 of + RFC 3984, MUST be used. This mode is in use in + standards using ITU-T Recommendation H.241 [15] + (see section 12.1). When the value of + packetization-mode is equal to 1, the + non-interleaved mode, as defined in section 6.3 of + RFC 3984, MUST be used. When the value of + packetization-mode is equal to 2, the + interleaved mode, as defined in section 6.4 of + RFC 3984, MUST be used. The value of + packetization-mode MUST be an integer in the + range of 0 to 2, inclusive. + + sprop-interleaving-depth: + This parameter MUST NOT be present + when packetization-mode is not present or the + value of packetization-mode is equal to 0 or 1. + This parameter MUST be present when the value + of packetization-mode is equal to 2. + + This parameter signals the properties of a NAL + unit stream. It specifies the maximum number + of VCL NAL units that precede any VCL NAL unit + in the NAL unit stream in transmission order + and follow the VCL NAL unit in decoding order. + Consequently, it is guaranteed that receivers + can reconstruct NAL unit decoding order when + the buffer size for NAL unit decoding order + recovery is at least the value of + (sprop-interleaving-depth + 1) in terms of VCL NAL + units. + + The value of sprop-interleaving-depth MUST be + an integer in the range of 0 to 32767, + inclusive. + + + + +Wenger, et al.
Standards Track [Page 46] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + sprop-deint-buf-req: + This parameter MUST NOT be present when + packetization-mode is not present or the value + of packetization-mode is equal to 0 or 1. It + MUST be present when the value of + packetization-mode is equal to 2. + + sprop-deint-buf-req signals the required size + of the deinterleaving buffer for the NAL unit + stream. The value of the parameter MUST be + greater than or equal to the maximum buffer + occupancy (in units of bytes) required in such + a deinterleaving buffer that is specified in + section 7.2 of RFC 3984. It is guaranteed that + receivers can perform the deinterleaving of + interleaved NAL units into NAL unit decoding + order, when the deinterleaving buffer size is + at least the value of sprop-deint-buf-req in + terms of bytes. + + The value of sprop-deint-buf-req MUST be an + integer in the range of 0 to 4294967295, + inclusive. + + Informative note: sprop-deint-buf-req + indicates the required size of the + deinterleaving buffer only. When network + jitter can occur, an appropriately sized + jitter buffer has to be provisioned for + as well. + + deint-buf-cap: This parameter signals the capabilities of a + receiver implementation and indicates the + amount of deinterleaving buffer space in units + of bytes that the receiver has available for + reconstructing the NAL unit decoding order. A + receiver is able to handle any stream for which + the value of the sprop-deint-buf-req parameter + is smaller than or equal to this parameter. + + If the parameter is not present, then a value + of 0 MUST be used for deint-buf-cap. The value + of deint-buf-cap MUST be an integer in the + range of 0 to 4294967295, inclusive. + + Informative note: deint-buf-cap indicates + the maximum possible size of the + deinterleaving buffer of the receiver only. + + + +Wenger, et al. 
Standards Track [Page 47] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + When network jitter can occur, an + appropriately sized jitter buffer has to + be provisioned for as well. + + sprop-init-buf-time: + This parameter MAY be used to signal the + properties of a NAL unit stream. The parameter + MUST NOT be present, if the value of + packetization-mode is equal to 0 or 1. + + The parameter signals the initial buffering + time that a receiver MUST buffer before + starting decoding to recover the NAL unit + decoding order from the transmission order. + The parameter is the maximum value of + (transmission time of a NAL unit - decoding + time of the NAL unit), assuming reliable and + instantaneous transmission, the same + timeline for transmission and decoding, and + that decoding starts when the first packet + arrives. + + An example of specifying the value of sprop- + init-buf-time follows. A NAL unit stream is + sent in the following interleaved order, in + which the value corresponds to the decoding + time and the transmission order is from left to + right: + + 0 2 1 3 5 4 6 8 7 ... + + Assuming a steady transmission rate of NAL + units, the transmission times are: + + 0 1 2 3 4 5 6 7 8 ... + + Subtracting the decoding time from the + transmission time column-wise results in the + following series: + + 0 -1 1 0 -1 1 0 -1 1 ... + + Thus, in terms of intervals of NAL unit + transmission times, the value of + sprop-init-buf-time in this + example is 1. + + + + + +Wenger, et al. Standards Track [Page 48] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + The parameter is coded as a non-negative base10 + integer representation in clock ticks of a 90- + kHz clock. If the parameter is not present, + then no initial buffering time value is + defined. Otherwise the value of sprop-init- + buf-time MUST be an integer in the range of 0 + to 4294967295, inclusive. 
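The worked example for sprop-init-buf-time above can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function names, the 30 NAL units per second rate used for the tick conversion, and the choice to round up are all assumptions made for the sketch.

```python
import math
from fractions import Fraction

CLOCK_RATE_HZ = 90000    # sprop-init-buf-time is coded in ticks of a 90 kHz clock
MAX_VALUE = 4294967295   # upper bound of the permitted range


def init_buf_intervals(decoding_times):
    """Return max(transmission time - decoding time) in NAL unit intervals.

    decoding_times holds the decoding time of each NAL unit, listed in
    transmission order. The unit at position k is assumed to be sent at
    time k (steady transmission rate, as in the example above).
    """
    return max(tx - dec for tx, dec in enumerate(decoding_times))


def init_buf_ticks(decoding_times, nal_interval):
    """Express the result in 90 kHz ticks; nal_interval is in seconds.

    Rounding up is an assumption of this sketch, not a rule from the text.
    """
    intervals = init_buf_intervals(decoding_times)
    ticks = math.ceil(intervals * nal_interval * CLOCK_RATE_HZ)
    if not 0 <= ticks <= MAX_VALUE:
        raise ValueError("value outside the range allowed for sprop-init-buf-time")
    return ticks


stream = [0, 2, 1, 3, 5, 4, 6, 8, 7]            # decoding times, transmission order
print(init_buf_intervals(stream))               # result in NAL unit intervals
print(init_buf_ticks(stream, Fraction(1, 30)))  # assumes 30 NAL units per second
```

Passing the interval as a Fraction keeps the tick computation exact before it is rounded.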
+ + In addition to the signaled sprop-init-buf- + time, receivers SHOULD take into account the + transmission delay jitter buffering, including + buffering for the delay jitter caused by + mixers, translators, gateways, proxies, + traffic-shapers, and other network elements. + + sprop-max-don-diff: + This parameter MAY be used to signal the + properties of a NAL unit stream. It MUST NOT + be used to signal transmitter or receiver or + codec capabilities. The parameter MUST NOT be + present if the value of packetization-mode is + equal to 0 or 1. sprop-max-don-diff is an + integer in the range of 0 to 32767, inclusive. + If sprop-max-don-diff is not present, the value + of the parameter is unspecified. sprop-max- + don-diff is calculated as follows: + + sprop-max-don-diff = max{AbsDON(i) - + AbsDON(j)}, + for any i and any j>i, + + where i and j indicate the index of the NAL + unit in the transmission order and AbsDON + denotes a decoding order number of the NAL + unit that does not wrap around to 0 after + 65535. In other words, AbsDON is calculated as + follows: Let m and n be consecutive NAL units + in transmission order. For the very first NAL + unit in transmission order (whose index is 0), + AbsDON(0) = DON(0). For other NAL units, + AbsDON is calculated as follows: + + If DON(m) == DON(n), AbsDON(n) = AbsDON(m) + + If (DON(m) < DON(n) and DON(n) - DON(m) < + 32768), + AbsDON(n) = AbsDON(m) + DON(n) - DON(m) + + + + +Wenger, et al. Standards Track [Page 49] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + If (DON(m) > DON(n) and DON(m) - DON(n) >= + 32768), + AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n) + + If (DON(m) < DON(n) and DON(n) - DON(m) >= + 32768), + + AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - + DON(n)) + + If (DON(m) > DON(n) and DON(m) - DON(n) < + 32768), + AbsDON(n) = AbsDON(m) - (DON(m) - DON(n)) + + where DON(i) is the decoding order number of + the NAL unit having index i in the transmission + order. 
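The AbsDON recurrence and the sprop-max-don-diff definition above can be sketched in Python as follows. The function names are hypothetical, and flooring the result at 0 for streams that are already in decoding order is an assumption made to stay within the permitted range of 0 to 32767; the text does not address that case.

```python
def abs_don_sequence(dons):
    """Map 16-bit DON values, given in transmission order, to AbsDON values."""
    if not dons:
        return []
    abs_dons = [dons[0]]                       # AbsDON(0) = DON(0)
    for m, n in zip(dons, dons[1:]):           # m, n: consecutive DON values
        prev = abs_dons[-1]
        if m == n:
            cur = prev
        elif m < n and n - m < 32768:          # forward step, no wrap
            cur = prev + n - m
        elif m > n and m - n >= 32768:         # forward step across the wrap
            cur = prev + 65536 - m + n
        elif m < n and n - m >= 32768:         # backward step across the wrap
            cur = prev - (m + 65536 - n)
        else:                                  # m > n and m - n < 32768
            cur = prev - (m - n)               # backward step, no wrap
        abs_dons.append(cur)
    return abs_dons


def sprop_max_don_diff(dons):
    """Return max{AbsDON(i) - AbsDON(j)} over all i and all j > i.

    The result is floored at 0 (an assumption of this sketch).
    """
    abs_dons = abs_don_sequence(dons)
    if len(abs_dons) < 2:
        return 0
    best = 0
    highest_so_far = abs_dons[0]
    for value in abs_dons[1:]:
        best = max(best, highest_so_far - value)
        highest_so_far = max(highest_so_far, value)
    return best


print(abs_don_sequence([65534, 65535, 0, 1]))   # DON wraps, AbsDON keeps growing
print(sprop_max_don_diff([0, 2, 1, 3, 5, 4]))   # lightly interleaved stream
```

Tracking the highest AbsDON seen so far gives the maximum difference in a single pass, without comparing every pair of NAL units.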
The decoding order number is specified + in section 5.5 of RFC 3984. + + Informative note: Receivers may use sprop- + max-don-diff to trigger which NAL units in + the receiver buffer can be passed to the + decoder. + + max-rcmd-nalu-size: + This parameter MAY be used to signal the + capabilities of a receiver. The parameter MUST + NOT be used for any other purposes. The value + of the parameter indicates the largest NALU + size in bytes that the receiver can handle + efficiently. The parameter value is a + recommendation, not a strict upper boundary. + The sender MAY create larger NALUs but must be + aware that the handling of these may come at a + higher cost than NALUs conforming to the + limitation. + + The value of max-rcmd-nalu-size MUST be an + integer in the range of 0 to 4294967295, + inclusive. If this parameter is not specified, + no known limitation to the NALU size exists. + Senders still have to consider the MTU size + available between the sender and the receiver + and SHOULD run MTU discovery for this purpose. + + This parameter is motivated by, for example, an + IP to H.223 video telephony gateway, where + NALUs smaller than the H.223 transport data + + + +Wenger, et al. Standards Track [Page 50] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + unit will be more efficient. A gateway may + terminate IP; thus, MTU discovery will normally + not work beyond the gateway. + + Informative note: Setting this parameter to + a lower than necessary value may have a + negative impact. + + Encoding considerations: + This type is only defined for transfer via RTP + (RFC 3550). + + A file format of H.264/AVC video is defined in + [29]. This definition is utilized by other + file formats, such as the 3GPP multimedia file + format (MIME type video/3gpp) [30] or the MP4 + file format (MIME type video/mp4). + + Security considerations: + See section 9 of RFC 3984. + + Public specification: + Please refer to RFC 3984 and its section 15. 
+ + Additional information: + None + + File extensions: none + Macintosh file type code: none + Object identifier or OID: none + + Person & email address to contact for further information: + stewe@stewe.org + + Intended usage: COMMON + + Author: + stewe@stewe.org + Change controller: + IETF Audio/Video Transport working group + delegated from the IESG. + + + + + + + + + + +Wenger, et al. Standards Track [Page 51] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +8.2. SDP Parameters + +8.2.1. Mapping of MIME Parameters to SDP + + The MIME media type video/H264 string is mapped to fields in the + Session Description Protocol (SDP) [5] as follows: + + o The media name in the "m=" line of SDP MUST be video. + + o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the + MIME subtype). + + o The clock rate in the "a=rtpmap" line MUST be 90000. + + o The OPTIONAL parameters "profile-level-id", "max-mbps", "max-fs", + "max-cpb", "max-dpb", "max-br", "redundant-pic-cap", "sprop- + parameter-sets", "parameter-add", "packetization-mode", "sprop- + interleaving-depth", "deint-buf-cap", "sprop-deint-buf-req", + "sprop-init-buf-time", "sprop-max-don-diff", and "max-rcmd-nalu- + size", when present, MUST be included in the "a=fmtp" line of SDP. + These parameters are expressed as a MIME media type string, in the + form of a semicolon separated list of parameter=value pairs. + + An example of media representation in SDP is as follows (Baseline + Profile, Level 3.0, some of the constraints of the Main profile may + not be obeyed): + + m=video 49170 RTP/AVP 98 + a=rtpmap:98 H264/90000 + a=fmtp:98 profile-level-id=42A01E; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA== + +8.2.2. 
Usage with the SDP Offer/Answer Model + + When H.264 is offered over RTP using SDP in an Offer/Answer model [7] + for negotiation for unicast usage, the following limitations and + rules apply: + + o The parameters identifying a media format configuration for H.264 + are "profile-level-id", "packetization-mode", and, if required by + "packetization-mode", "sprop-deint-buf-req". These three + parameters MUST be used symmetrically; i.e., the answerer MUST + either maintain all configuration parameters or remove the media + format (payload type) completely, if one or more of the parameter + values are not supported. + + + + + + +Wenger, et al. Standards Track [Page 52] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Informative note: The requirement for symmetric use applies + only for the above three parameters and not for the other + stream properties and capability parameters. + + To simplify handling and matching of these configurations, the + same RTP payload type number used in the offer SHOULD also be used + in the answer, as specified in [7]. An answer MUST NOT contain a + payload type number used in the offer unless the configuration + ("profile-level-id", "packetization-mode", and, if present, + "sprop-deint-buf-req") is the same as in the offer. + + Informative note: An offerer, when receiving the answer, has to + compare payload types not declared in the offer based on media + type (i.e., video/h264) and the above three parameters with any + payload types it has already declared, in order to determine + whether the configuration in question is new or equivalent to a + configuration already offered. + + o The parameters "sprop-parameter-sets", "sprop-deint-buf-req", + "sprop-interleaving-depth", "sprop-max-don-diff", and "sprop- + init-buf-time" describe the properties of the NAL unit stream that + the offerer or answerer is sending for this media format + configuration. 
This differs from the normal usage of the + Offer/Answer parameters: normally such parameters declare the + properties of the stream that the offerer or the answerer is able + to receive. When dealing with H.264, the offerer assumes that the + answerer will be able to receive media encoded using the + configuration being offered. + + Informative note: The above parameters apply for any stream + sent by the declaring entity with the same configuration; i.e., + they are dependent on their source. Rather than being bound to + the payload type, the values may have to be applied to another + payload type when being sent, as they apply for the + configuration. + + o The capability parameters ("max-mbps", "max-fs", "max-cpb", "max- + dpb", "max-br", "redundant-pic-cap", "max-rcmd-nalu-size") MAY be + used to declare further capabilities. Their interpretation + depends on the direction attribute. When the direction attribute + is sendonly, then the parameters describe the limits of the RTP + packets and the NAL unit stream that the sender is capable of + producing. When the direction attribute is sendrecv or recvonly, + then the parameters describe the limitations of what the receiver + accepts. + + + + + + +Wenger, et al. Standards Track [Page 53] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + o As specified above, an offerer has to include the size of the + deinterleaving buffer in the offer for an interleaved H.264 + stream. To enable the offerer and answerer to inform each other + about their capabilities for deinterleaving buffering, both + parties are RECOMMENDED to include "deint-buf-cap". This + information MAY be used when the value for "sprop-deint-buf-req" + is selected in a second round of offer and answer. For + interleaved streams, it is also RECOMMENDED to consider offering + multiple payload types with different buffering requirements when + the capabilities of the receiver are unknown.
+ + o The "sprop-parameter-sets" parameter is used as described above. + In addition, an answerer MUST maintain all parameter sets received + in the offer in its answer. Depending on the value of the + "parameter-add" parameter, different rules apply: If "parameter- + add" is false (0), the answerer MUST NOT add any additional + parameter sets. If "parameter-add" is true (1), the answerer, in + its answer, MAY add additional parameter sets to the "sprop- + parameter-sets" parameter. The answerer MUST also, independent of + the value of "parameter-add", be prepared to receive a video stream + using the sprop-parameter-sets it declared in the answer. + + Informative note: Care must be taken when parameter sets are + added not to cause overwriting of already transmitted parameter + sets by using conflicting parameter set identifiers. + + For streams being delivered over multicast, the following rules apply + in addition: + + o The stream properties parameters ("sprop-parameter-sets", "sprop- + deint-buf-req", "sprop-interleaving-depth", "sprop-max-don-diff", + and "sprop-init-buf-time") MUST NOT be changed by the answerer. + Thus, a payload type can either be accepted unaltered or removed. + + o The receiver capability parameters "max-mbps", "max-fs", "max- + cpb", "max-dpb", "max-br", and "max-rcmd-nalu-size" MUST be + supported by the answerer for all streams declared as sendrecv or + recvonly; otherwise, one of the following actions MUST be + performed: the media format is removed, or the session is rejected. + + o The receiver capability parameter redundant-pic-cap SHOULD be + supported by the answerer for all streams declared as sendrecv or + recvonly as follows: The answerer SHOULD NOT include redundant + coded pictures in the transmitted stream if the offerer indicated + redundant-pic-cap equal to 0. Otherwise (when redundant-pic-cap + is equal to 1), it is beyond the scope of this memo to recommend + how the answerer should use redundant coded pictures.
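The unicast rules for "sprop-parameter-sets" and "parameter-add" given earlier in this section can be sketched as a small check in Python. This is an illustration under stated assumptions, not an implementation of the Offer/Answer model: the helper names are hypothetical, the input is the parameter list of an "a=fmtp" line with line wrapping already removed, and parameter sets are compared as base64 strings instead of by their parameter set identifiers.

```python
def parse_fmtp(params):
    """Parse 'name=value; name=value' into a dict of strings."""
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only: base64 padding also uses '='.
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


def parameter_sets(fmtp):
    """Return the comma-separated sprop-parameter-sets entries as a list."""
    value = fmtp.get("sprop-parameter-sets", "")
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def check_answer_parameter_sets(offer, answer):
    """Return a list of rule violations found in the answer (empty if none)."""
    offered = parameter_sets(offer)
    answered = parameter_sets(answer)
    problems = []

    # The answerer must maintain every parameter set received in the offer.
    missing = [entry for entry in offered if entry not in answered]
    if missing:
        problems.append("answer dropped offered parameter sets: %s" % missing)

    # parameter-add defaults to 1 (adding allowed) when it is not present.
    add_allowed = offer.get("parameter-add", "1") == "1"
    added = [entry for entry in answered if entry not in offered]
    if added and not add_allowed:
        problems.append("answer added parameter sets although parameter-add=0")
    return problems


offer = parse_fmtp("profile-level-id=42A01E; packetization-mode=1; "
                   "sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==")
answer = parse_fmtp("profile-level-id=42A01E; packetization-mode=1; "
                    "sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,"
                    "As0DEWlsIOp==,KyzFGleR; max-rcmd-nalu-size=3980")
print(check_answer_parameter_sets(offer, answer))
```

The sample values are the illustrative base64 strings used in the examples of this memo; they do not describe a legal H.264 operation point.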
+ + + + +Wenger, et al. Standards Track [Page 54] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Below are the complete lists of how the different parameters shall be + interpreted in the different combinations of offer or answer and + direction attribute. + + o In offers and answers for which "a=sendrecv" or no direction + attribute is used, or in offers and answers for which "a=recvonly" + is used, the following interpretation of the parameters MUST be + used. + + Declaring actual configuration or properties for receiving: + + - profile-level-id + - packetization-mode + + Declaring actual properties of the stream to be sent (applicable + only when "a=sendrecv" or no direction attribute is used): + + - sprop-deint-buf-req + - sprop-interleaving-depth + - sprop-parameter-sets + - sprop-max-don-diff + - sprop-init-buf-time + + Declaring receiver implementation capabilities: + + - max-mbps + - max-fs + - max-cpb + - max-dpb + - max-br + - redundant-pic-cap + - deint-buf-cap + - max-rcmd-nalu-size + + Declaring how Offer/Answer negotiation shall be performed: + + - parameter-add + + o In an offer or answer for which the direction attribute + "a=sendonly" is included for the media stream, the following + interpretation of the parameters MUST be used: + + Declaring actual configuration and properties of stream proposed + to be sent: + + - profile-level-id + - packetization-mode + - sprop-deint-buf-req + + + +Wenger, et al. 
Standards Track [Page 55] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + - sprop-max-don-diff + - sprop-init-buf-time + - sprop-parameter-sets + - sprop-interleaving-depth + + Declaring the capabilities of the sender when it receives a + stream: + + - max-mbps + - max-fs + - max-cpb + - max-dpb + - max-br + - redundant-pic-cap + - deint-buf-cap + - max-rcmd-nalu-size + + Declaring how Offer/Answer negotiation shall be performed: + + - parameter-add + + Furthermore, the following considerations are necessary: + + o Parameters used for declaring receiver capabilities are in general + downgradable; i.e., they express the upper limit for a sender's + possible behavior. Thus a sender MAY select to set its encoder + using only lower/lesser or equal values of these parameters. + "sprop-parameter-sets" MUST NOT be used in a sender's declaration + of its capabilities, as the limits of the values that are carried + inside the parameter sets are implicit with the profile and level + used. + + o Parameters declaring a configuration point are not downgradable, + with the exception of the level part of the "profile-level-id" + parameter. This expresses values a receiver expects to be used + and must be used verbatim on the sender side. + + o When a sender's capabilities are declared, and non-downgradable + parameters are used in this declaration, then these parameters + express a configuration that is acceptable. In order to achieve + high interoperability levels, it is often advisable to offer + multiple alternative configurations; e.g., for the packetization + mode. It is impossible to offer multiple configurations in a + single payload type. Thus, when multiple configuration offers are + made, each offer requires its own RTP payload type associated with + the offer. + + + + + +Wenger, et al. 
Standards Track [Page 56] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + o A receiver SHOULD understand all MIME parameters, even if it only + supports a subset of the payload format's functionality. This + ensures that a receiver is capable of understanding when an offer + to receive media can be downgraded to what is supported by the + receiver of the offer. + + o An answerer MAY extend the offer with additional media format + configurations. However, to enable their usage, in most cases a + second offer is required from the offerer to provide the stream + properties parameters that the media sender will use. This also + has the effect that the offerer has to be able to receive this + media format configuration, not only to send it. + + o If an offerer wishes to have non-symmetric capabilities between + sending and receiving, the offerer has to offer different RTP + sessions; i.e., different media lines declared as "recvonly" and + "sendonly", respectively. This may have further implications for + the system. + +8.2.3. Usage in Declarative Session Descriptions + + When H.264 over RTP is offered with SDP in a declarative style, as in + RTSP [27] or SAP [28], the following considerations are necessary. + + o All parameters capable of indicating the properties of both a NAL + unit stream and a receiver are used to indicate the properties of + a NAL unit stream. For example, in this case, the parameter + "profile-level-id" declares the values used by the stream, instead + of the capabilities of the sender. As a result, the + following interpretation of the parameters MUST be used: + + Declaring actual configuration or properties: + + - profile-level-id + - sprop-parameter-sets + - packetization-mode + - sprop-interleaving-depth + - sprop-deint-buf-req + - sprop-max-don-diff + - sprop-init-buf-time + + + + + + + + + + + +Wenger, et al. 
Standards Track [Page 57] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + Not usable: + + - max-mbps + - max-fs + - max-cpb + - max-dpb + - max-br + - redundant-pic-cap + - max-rcmd-nalu-size + - parameter-add + - deint-buf-cap + + o A receiver of the SDP is required to support all parameters and + values of the parameters provided; otherwise, the receiver MUST + reject (RTSP) or not participate in (SAP) the session. It falls + on the creator of the session to use values that are expected to + be supported by the receiving application. + +8.3. Examples + + A SIP Offer/Answer exchange wherein both parties are expected to both + send and receive could look like the following. Only the media codec + specific parts of the SDP are shown. Some lines are wrapped due to + text constraints. + + Offerer -> Answerer SDP message: + + m=video 49170 RTP/AVP 100 99 98 + a=rtpmap:98 H264/90000 + a=fmtp:98 profile-level-id=42A01E; packetization-mode=0; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA== + a=rtpmap:99 H264/90000 + a=fmtp:99 profile-level-id=42A01E; packetization-mode=1; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA== + a=rtpmap:100 H264/90000 + a=fmtp:100 profile-level-id=42A01E; packetization-mode=2; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==; + sprop-interleaving-depth=45; sprop-deint-buf-req=64000; + sprop-init-buf-time=102478; deint-buf-cap=128000 + + The above offer presents the same codec configuration in three + different packetization formats. PT 98 represents the single NAL + unit mode, PT 99 the non-interleaved mode, and PT 100 the interleaved + mode. In the interleaved mode case, the interleaving parameters that + the offerer would use if the answer indicates support for PT 100 are + also included. In all three cases the parameter "sprop-parameter-sets" + conveys the initial parameter sets that are required by the answerer + when receiving a stream from the offerer when this configuration + + + +Wenger, et al. 
Standards Track [Page 58] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + (profile-level-id and packetization-mode) is accepted. Note that the + value for "sprop-parameter-sets", although identical in the example + above, could be different for each payload type. + + Answerer -> Offerer SDP message: + + m=video 49170 RTP/AVP 100 99 97 + a=rtpmap:97 H264/90000 + a=fmtp:97 profile-level-id=42A01E; packetization-mode=0; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==, + KyzFGleR + a=rtpmap:99 H264/90000 + a=fmtp:99 profile-level-id=42A01E; packetization-mode=1; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==, + KyzFGleR; max-rcmd-nalu-size=3980 + a=rtpmap:100 H264/90000 + a=fmtp:100 profile-level-id=42A01E; packetization-mode=2; + sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==, + KyzFGleR; sprop-interleaving-depth=60; + sprop-deint-buf-req=86000; sprop-init-buf-time=156320; + deint-buf-cap=128000; max-rcmd-nalu-size=3980 + + As the Offer/Answer negotiation covers both sending and receiving + streams, an offer indicates the exact parameters for what the offerer + is willing to receive, whereas the answer indicates the same for what + the answerer is willing to receive. In this case the offerer declared + that it is willing to receive payload type 98. The answerer accepts + this by declaring an equivalent payload type 97; i.e., it has + identical values for the three parameters "profile-level-id", + "packetization-mode", and "sprop-deint-buf-req". This has the + following implications for both the offerer and the answerer + concerning the parameters that declare properties. The offerer + initially declared a certain value of the "sprop-parameter-sets" in + the payload definition for PT=98. However, as the answerer accepted + this as PT=97, the values of "sprop-parameter-sets" in PT=98 must now + be used instead when the offerer sends PT=97.
Similarly, when the + answerer sends PT=98 to the offerer, it has to use the properties + parameters it declared in PT=97. + + The answerer also accepts the reception of the two configurations + that payload types 99 and 100 represent. It provides the initial + parameter sets for the answerer-to-offerer direction, and the + buffering-related parameters that it will use to send the payload + types. It also provides the offerer with its memory limit for + deinterleaving operations by providing a "deint-buf-cap" parameter. + This is only useful if the offerer decides on making a second offer, + where it can take the new value into account. The "max-rcmd-nalu- + size" indicates that the answerer can efficiently process NALUs up to + + + +Wenger, et al. Standards Track [Page 59] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + the size of 3980 bytes. However, there is no guarantee that the + network supports this size. + + Please note that the parameter sets in the above example do not + represent a legal operation point of an H.264 codec. The base64 + strings are only used for illustration. + +8.4. Parameter Set Considerations + + The H.264 parameter sets are a fundamental part of the video codec + and vital to its operation; see section 1.2. Due to their + characteristics and their importance for the decoding process, lost + or erroneously transmitted parameter sets can hardly be concealed + locally at the receiver. A reference to a corrupt parameter set + normally has fatal results for the decoding process. Corruption could + occur, for example, due to the erroneous transmission or loss of a + parameter set data structure, but also due to the untimely + transmission of a parameter set update. Therefore, the following + recommendations are provided as a guideline for the implementer of + the RTP sender. + + Parameter set NALUs can be transported using three different + principles: + + A. 
Using a session control protocol (out-of-band) prior to the actual + RTP session. + + B. Using a session control protocol (out-of-band) during an ongoing + RTP session. + + C. Within the RTP stream in the payload (in-band) during an ongoing + RTP session. + + It is necessary to implement principles A and B within a session + control protocol. SIP and SDP can be used as described in the SDP + Offer/Answer model and in the previous sections of this memo. This + section contains guidelines on how principles A and B must be + implemented within session control protocols. It is independent of + the particular protocol used. Principle C is supported by the RTP + payload format defined in this specification. + + The picture and sequence parameter set NALUs SHOULD NOT be + transmitted in the RTP payload unless reliable transport is provided + for RTP, as a loss of a parameter set of either type will likely + prevent decoding of a considerable portion of the corresponding RTP + + + + + + +Wenger, et al. Standards Track [Page 60] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + stream. Thus, the transmission of parameter sets using a reliable + session control protocol (i.e., usage of principle A or B above) is + RECOMMENDED. + + In the rest of the section it is assumed that out-of-band signaling + provides reliable transport of parameter set NALUs and that in-band + transport does not. If in-band signaling of parameter sets is used, + the sender SHOULD take the error characteristics into account and use + mechanisms to provide a high probability for delivering the parameter + sets correctly. Mechanisms that increase the probability for a + correct reception include packet repetition, FEC, and retransmission. + The use of an unreliable, out-of-band control protocol has similar + disadvantages as the in-band signaling (possible loss) and, in + addition, may also lead to difficulties in the synchronization (see + below). Therefore, it is NOT RECOMMENDED. 
+ + Parameter sets MAY be added or updated during the lifetime of a + session using principles B and C. It is required that parameter sets + are present at the decoder prior to the NAL units that refer to them. + Updating or adding of parameter sets can result in further problems, + and therefore the following recommendations should be considered. + + - When parameter sets are added or updated, principle C is + vulnerable to transmission errors as described above, and + therefore principle B is RECOMMENDED. + + - When parameter sets are added or updated, care SHOULD be taken to + ensure that any parameter set is delivered prior to its usage. It + is common that no synchronization is present between out-of-band + signaling and in-band traffic. If out-of-band signaling is used, + it is RECOMMENDED that a sender does not start sending NALUs + requiring the updated parameter sets prior to acknowledgement of + delivery from the signaling protocol. + + - When parameter sets are updated, the following synchronization + issue should be taken into account. When overwriting a parameter + set at the receiver, the sender has to ensure that the parameter + set in question is not needed by any NALU present in the network + or receiver buffers. Otherwise, decoding with a wrong parameter + set may occur. To lessen this problem, it is RECOMMENDED either + to overwrite only those parameter sets that have not been used for + a sufficiently long time (to ensure that all related NALUs have + been consumed), or to add a new parameter set instead (which may + have negative consequences for the efficiency of the video + coding). + + - When new parameter sets are added, previously unused parameter set + identifiers are used. This avoids the problem identified in the + + + +Wenger, et al. Standards Track [Page 61] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + previous paragraph. 
However, in a multiparty session, unless a + synchronized control protocol is used, there is a risk that + multiple entities try to add different parameter sets for the same + identifier, which has to be avoided. + + - Adding or modifying parameter sets by using both principles B and + C in the same RTP session may lead to inconsistencies of the + parameter sets because of the lack of synchronization between the + control and the RTP channel. Therefore, principles B and C MUST + NOT both be used in the same session unless sufficient + synchronization can be provided. + + In some scenarios (e.g., when only the subset of this payload format + specification corresponding to H.241 is used), it is not possible to + employ out-of-band parameter set transmission. In this case, + parameter sets have to be transmitted in-band. Here, the + synchronization with the non-parameter-set data in the bitstream is + implicit, but the possibility of a loss has to be taken into account. + The loss probability should be reduced using the mechanisms discussed + above. + + - When parameter sets are initially provided using principle A and + then later added or updated in-band (principle C), there is a risk + associated with updating the parameter sets delivered out-of-band. + If receivers miss some in-band updates (for example, because of a + loss or a late tune-in), those receivers attempt to decode the + bitstream using outdated parameters. It is RECOMMENDED that + parameter set IDs be partitioned between the out-of-band and in- + band parameter sets. + + To allow for maximum flexibility and best performance from the H.264 + coder, it is recommended, if possible, to allow any sender to add its + own parameter sets to be used in a session. Setting the "parameter- + add" parameter to false should only be done in cases where the + session topology prevents a participant from adding its own parameter + sets. + +9. 
Security Considerations + + RTP packets using the payload format defined in this specification + are subject to the security considerations discussed in the RTP + specification [4], and in any appropriate RTP profile (for example, + [16]). This implies that confidentiality of the media streams is + achieved by encryption; for example, through the application of SRTP + [26]. Because the data compression used with this payload format is + applied end-to-end, any encryption needs to be performed after + compression. + + + + +Wenger, et al. Standards Track [Page 62] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + A potential denial-of-service threat exists for data encodings using + compression techniques that have non-uniform receiver-end + computational load. The attacker can inject pathological datagrams + into the stream that are complex to decode and that cause the + receiver to be overloaded. H.264 is particularly vulnerable to such + attacks, as it is extremely simple to generate datagrams containing + NAL units that affect the decoding process of many future NAL units. + Therefore, the usage of data origin authentication and data integrity + protection of at least the RTP packet is RECOMMENDED; for example, + with SRTP [26]. + + Note that the appropriate mechanism to ensure confidentiality and + integrity of RTP packets and their payloads is very dependent on the + application and on the transport and signaling protocols employed. + Thus, although SRTP is given as an example above, other possible + choices exist. + + Decoders MUST exercise caution with respect to the handling of user + data SEI messages, particularly if they contain active elements, and + MUST restrict their domain of applicability to the presentation + containing the stream. + + End-to-End security with either authentication, integrity or + confidentiality protection will prevent a MANE from performing + media-aware operations other than discarding complete packets. 
And + in the case of confidentiality protection it will even be prevented + from performing discarding of packets in a media aware way. To allow + any MANE to perform its operations, it will be required to be a + trusted entity which is included in the security context + establishment. + +10. Congestion Control + + Congestion control for RTP SHALL be used in accordance with RFC 3550 + [4], and with any applicable RTP profile; e.g., RFC 3551 [16]. An + additional requirement if best-effort service is being used is: + users of this payload format MUST monitor packet loss to ensure that + the packet loss rate is within acceptable parameters. Packet loss is + considered acceptable if a TCP flow across the same network path, and + experiencing the same network conditions, would achieve an average + throughput, measured on a reasonable timescale, that is not less than + the RTP flow is achieving. This condition can be satisfied by + implementing congestion control mechanisms to adapt the transmission + rate (or the number of layers subscribed for a layered multicast + session), or by arranging for a receiver to leave the session if the + loss rate is unacceptably high. + + + + + +Wenger, et al. Standards Track [Page 63] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + The bit rate adaptation necessary for obeying the congestion control + principle is easily achievable when real-time encoding is used. + However, when pre-encoded content is being transmitted, bandwidth + adaptation requires the availability of more than one coded + representation of the same content, at different bit rates, or the + existence of non-reference pictures or sub-sequences [22] in the + bitstream. The switching between the different representations can + normally be performed in the same RTP session; e.g., by employing a + concept known as SI/SP slices of the Extended Profile, or by + switching streams at IDR picture boundaries. 
Only when non- + downgradable parameters (such as the profile part of the + profile/level ID) are required to be changed does it become necessary + to terminate and re-start the media stream. This may be accomplished + by using a different RTP payload type. + + MANEs MAY follow the suggestions outlined in section 7.3 and remove + certain unusable packets from the packet stream when that stream was + damaged due to previous packet losses. This can help reduce the + network load in certain special cases. + +11. IANA Consideration + + IANA has registered one new MIME type; see section 8.1. + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Wenger, et al. Standards Track [Page 64] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + +12. Informative Appendix: Application Examples + + This payload specification is very flexible in its use, in order to + cover the extremely wide application space anticipated for H.264. + However, this great flexibility also makes it difficult for an + implementer to decide on a reasonable packetization scheme. Some + information on how to apply this specification to real-world + scenarios is likely to appear in the form of academic publications + and a test model software and description in the near future. + However, some preliminary usage scenarios are described here as well. + +12.1. Video Telephony according to ITU-T Recommendation H.241 + Annex A + + H.323-based video telephony systems that use H.264 as an optional + video compression scheme are required to support H.241 Annex A [15] + as a packetization scheme. The packetization mechanism defined in + this Annex is technically identical with a small subset of this + specification. + + When a system operates according to H.241 Annex A, parameter set NAL + units are sent in-band. Only Single NAL unit packets are used. 
Many + such systems are not sending IDR pictures regularly, but only when + required by user interaction or by control protocol means; e.g., when + switching between video channels in a Multipoint Control Unit or for + error recovery requested by feedback. + +12.2. Video Telephony, No Slice Data Partitioning, No NAL Unit + Aggregation + + The RTP part of this scheme is implemented and tested (though not the + control-protocol part; see below). + + In most real-world video telephony applications, picture parameters + such as picture size or optional modes never change during the + lifetime of a connection. Therefore, all necessary parameter sets + (usually only one) are sent as a side effect of the capability + exchange/announcement process, e.g., according to the SDP syntax + specified in section 8.2 of this document. As all necessary + parameter set information is established before the RTP session + starts, there is no need for sending any parameter set NAL units. + Slice data partitioning is not used, either. Thus, the RTP packet + stream basically consists of NAL units that carry single coded + slices. + + The encoder chooses the size of coded slice NAL units so that they + offer the best performance. Often, this is done by adapting the + coded slice size to the MTU size of the IP network. For small + + + +Wenger, et al. Standards Track [Page 65] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + picture sizes, this may result in a one-picture-per-one-packet + strategy. Intra refresh algorithms clean up the loss of packets and + the resulting drift-related artifacts. + +12.3. Video Telephony, Interleaved Packetization Using NAL Unit + Aggregation + + This scheme allows better error concealment and is used in H.263 + based designs using RFC 2429 packetization [10]. It has been + implemented, and good results were reported [12]. + + The VCL encoder codes the source picture so that all macroblocks + (MBs) of one MB line are assigned to one slice. 
All slices with even + MB row addresses are combined into one STAP, and all slices with odd + MB row addresses into another. Those STAPs are transmitted as RTP + packets. The establishment of the parameter sets is performed as + discussed above. + + Note that the use of STAPs is essential here, as the high number of + individual slices (18 for a CIF picture) would lead to unacceptably + high IP/UDP/RTP header overhead (unless the source coding tool FMO is + used, which is not assumed in this scenario). Furthermore, some + wireless video transmission systems, such as H.324M and the IP-based + video telephony specified in 3GPP, are likely to use relatively small + transport packet size. For example, a typical MTU size of H.223 AL3 + SDU is around 100 bytes [17]. Coding individual slices according to + this packetization scheme provides further advantage in communication + between wired and wireless networks, as individual slices are likely + to be smaller than the preferred maximum packet size of wireless + systems. Consequently, a gateway can convert the STAPs used in a + wired network into several RTP packets with only one NAL unit, which + are preferred in a wireless network, and vice versa. + +12.4. Video Telephony with Data Partitioning + + This scheme has been implemented and has been shown to offer good + performance, especially at higher packet loss rates [12]. + + Data Partitioning is known to be useful only when some form of + unequal error protection is available. Normally, in single-session + RTP environments, even error characteristics are assumed; i.e., the + packet loss probability of all packets of the session is the same + statistically. However, there are means to reduce the packet loss + probability of individual packets in an RTP session. A FEC packet + according to RFC 2733 [18], for example, specifies which media + packets are associated with the FEC packet. + + + + + +Wenger, et al. 
Standards Track [Page 66] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + In all cases, the incurred overhead is substantial but is in the same + order of magnitude as the number of bits that have otherwise been + spent for intra information. However, this mechanism does not add + any delay to the system. + + Again, the complete parameter set establishment is performed through + control protocol means. + +12.5. Video Telephony or Streaming with FUs and Forward Error + Correction + + This scheme has been implemented and has been shown to provide good + performance, especially at higher packet loss rates [19]. + + The most efficient means to combat packet losses for scenarios where + retransmissions are not applicable is forward error correction (FEC). + Although application layer, end-to-end use of FEC is often less + efficient than an FEC-based protection of individual links + (especially when links of different characteristics are in the + transmission path), application layer, end-to-end FEC is unavoidable + in some scenarios. RFC 2733 [18] provides means to use generic, + application layer, end-to-end FEC in packet-loss environments. A + binary forward error correcting code is generated by applying the XOR + operation to the bits at the same bit position in different packets. + The binary code can be specified by the parameters (n,k) in which k + is the number of information packets used in the connection and n is + the total number of packets generated for k information packets; + i.e., n-k parity packets are generated for k information packets. + + When a code is used with parameters (n,k) within the RFC 2733 + framework, the following properties are well known: + + a) If applied over one RTP packet, RFC 2733 provides only packet + repetition. + + b) RFC 2733 is most bit rate efficient if XOR-connected packets have + equal length. 
+ + c) At the same packet loss probability p and for a fixed k, the + greater the value of n is, the smaller the residual error + probability becomes. For example, for a packet loss probability + of 10%, k=1, and n=2, the residual error probability is about 1%, + whereas for n=3, the residual error probability is about 0.1%. + + d) At the same packet loss probability p and for a fixed code rate + k/n, the greater the value of n is, the smaller the residual error + probability becomes. For example, at a packet loss probability of + p=10%, k=1 and n=2, the residual error rate is about 1%, whereas + + + +Wenger, et al. Standards Track [Page 67] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + for an extended Golay code with k=12 and n=24, the residual error + rate is about 0.01%. + + For applying RFC 2733 in combination with H.264 baseline coded video + without using FUs, several options might be considered: + + 1) The video encoder produces NAL units for which each video frame is + coded in a single slice. Applying FEC, one could use a simple + code; e.g., (n=2, k=1). That is, each NAL unit would basically + just be repeated. The disadvantage is obviously the bad code + performance according to d), above, and the low flexibility, as + only (n, k=1) codes can be used. + + 2) The video encoder produces NAL units for which each video frame is + encoded in one or more consecutive slices. Applying FEC, one + could use a better code, e.g., (n=24, k=12), over a sequence of + NAL units. Depending on the number of RTP packets per frame, a + loss may introduce a significant delay, which is reduced when more + RTP packets are used per frame. Packets of completely different + length might also be connected, which decreases bit rate + efficiency according to b), above. However, with some care and + for slices of 1kb or larger, similar length (100-200 bytes + difference) may be produced, which will not lower the bit + efficiency catastrophically. 
+ + 3) The video encoder produces NAL units, for which a certain frame + contains k slices of possibly almost equal length. Then, applying + FEC, a better code, e.g., (n=24, k=12), can be used over the + sequence of NAL units for each frame. The delay compared to that + of 2), above, may be reduced, but several disadvantages are + obvious. First, the coding efficiency of the encoded video is + lowered significantly, as slice-structured coding reduces intra- + frame prediction and additional slice overhead is necessary. + Second, pre-encoded content or, when operating over a gateway, the + video is usually not appropriately coded with k slices such that + FEC can be applied. Finally, the encoding of video producing k + slices of equal length is not straightforward and might require + more than one encoding pass. + + Many of the mentioned disadvantages can be avoided by applying FUs in + combination with FEC. Each NAL unit can be split into any number of + FUs of basically equal length; therefore, FEC with a reasonable k and + n can be applied, even if the encoder made no effort to produce + slices of equal length. For example, a coded slice NAL unit + containing an entire frame can be split to k FUs, and a parity check + code (n=k+1, k) can be applied. However, this has the disadvantage + + + + + +Wenger, et al. Standards Track [Page 68] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + that unless all created fragments can be recovered, the whole slice + will be lost. Thus a larger section is lost than would be if the + frame had been split into several slices. + + The presented technique makes it possible to achieve good + transmission error tolerance, even if no additional source coding + layer redundancy (such as periodic intra frames) is present. + Consequently, the same coded video sequence can be used to achieve + the maximum compression efficiency and quality over error-free + transmission and for transmission over error-prone networks. 
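The FU-plus-parity idea described above can be illustrated with a minimal sketch. It is not an implementation of the FU packetization or of the RFC 2733 packet format: the function names are hypothetical, fragments are plain byte strings without FU indicator/header octets or RTP headers, and zero padding is assumed to make all k fragments equal in length. It only shows that an (n=k+1, k) XOR parity code repairs exactly one lost fragment.

```python
from typing import List, Optional


def split_into_fragments(nal_unit: bytes, k: int) -> List[bytes]:
    """Split a NAL unit payload into k equal-length fragments (zero-padded).

    Hypothetical helper: real FUs also carry FU indicator and FU header octets.
    """
    frag_len = -(-len(nal_unit) // k)  # ceiling division
    padded = nal_unit.ljust(frag_len * k, b"\x00")
    return [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]


def xor_parity(fragments: List[bytes]) -> bytes:
    """XOR the bits at the same bit position in all fragments."""
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Repair at most one missing fragment (marked None) using the parity."""
    missing = [i for i, frag in enumerate(received) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    repaired = list(received)
    if missing:
        present = [frag for frag in received if frag is not None]
        # XOR of the parity and all surviving fragments yields the lost one.
        repaired[missing[0]] = xor_parity(present + [parity])
    return repaired
```

If two or more of the k+1 packets are lost, the sketch raises an error, which mirrors the disadvantage noted above: the whole slice is lost unless all fragments can be recovered.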
+ Furthermore, the technique allows the application of FEC to pre- + encoded sequences without adding delay. In this case, pre-encoded + sequences that are not encoded for error-prone networks can still be + transmitted almost reliably without adding extensive delays. In + addition, FUs of equal length result in a bit rate efficient use of + RFC 2733. + + If the error probability depends on the length of the transmitted + packet (e.g., in case of mobile transmission [14]), the benefits of + applying FUs with FEC are even more obvious. Basically, the + flexibility of the size of FUs allows appropriate FEC to be applied + for each NAL unit and unequal error protection of NAL units. + + When FUs and FEC are used, the incurred overhead is substantial but + is in the same order of magnitude as the number of bits that have to + be spent for intra-coded macroblocks if no FEC is applied. In [19], + it was shown that the overall performance of the FEC-based approach + enhanced quality when using the same error rate and same overall bit + rate, including the overhead. + +12.6. Low Bit-Rate Streaming + + This scheme has been implemented with H.263 and non-standard RTP + packetization and has given good results [20]. There is no technical + reason why similarly good results could not be achievable with H.264. + + In today's Internet streaming, some of the offered bit rates are + relatively low in order to allow terminals with dial-up modems to + access the content. In wired IP networks, relatively large packets, + say 500 - 1500 bytes, are preferred to smaller and more frequently + occurring packets in order to reduce network congestion. Moreover, + use of large packets decreases the amount of RTP/UDP/IP header + overhead. For low bit-rate video, the use of large packets means + that sometimes up to few pictures should be encapsulated in one + packet. + + + + + + +Wenger, et al. 
Standards Track [Page 69] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + However, loss of a packet including many coded pictures would have + drastic consequences for visual quality, as there is practically no + other way to conceal a loss of an entire picture than to repeat the + previous one. One way to construct relatively large packets and + maintain possibilities for successful loss concealment is to + construct MTAPs that contain interleaved slices from several + pictures. An MTAP should not contain spatially adjacent slices from + the same picture or spatially overlapping slices from any picture. + If a packet is lost, it is likely that a lost slice is surrounded by + spatially adjacent slices of the same picture and spatially + corresponding slices of the temporally previous and succeeding + pictures. Consequently, concealment of the lost slice is likely to + be relatively successful. + +12.7. Robust Packet Scheduling in Video Streaming + + Robust packet scheduling has been implemented with MPEG-4 Part 2 and + simulated in a wireless streaming environment [21]. There is no + technical reason why similar or better results could not be + achievable with H.264. + + Streaming clients typically have a receiver buffer that is capable of + storing a relatively large amount of data. Initially, when a + streaming session is established, a client does not start playing the + stream back immediately. Rather, it typically buffers the incoming + data for a few seconds. This buffering helps maintain continuous + playback, as, in case of occasional increased transmission delays or + network throughput drops, the client can decode and play buffered + data. Otherwise, without initial buffering, the client has to freeze + the display, stop decoding, and wait for incoming data. The + buffering is also necessary for either automatic or selective + retransmission in any protocol level. 
If any part of a picture is + lost, a retransmission mechanism may be used to resend the lost data. + If the retransmitted data is received before its scheduled decoding + or playback time, the loss is recovered perfectly. Coded pictures + can be ranked according to their importance in the subjective quality + of the decoded sequence. For example, non-reference pictures, such + as conventional B pictures, are subjectively least important, as + their absence does not affect decoding of any other pictures. In + addition to non-reference pictures, the ITU-T H.264 | ISO/IEC + 14496-10 standard includes a temporal scalability method called sub- + sequences [22]. Subjective ranking can also be made on coded slice + data partition or slice group basis. Coded slices and coded slice + data partitions that are subjectively the most important can be sent + earlier than their decoding order indicates, whereas coded slices and + coded slice data partitions that are subjectively the least important + can be sent later than their natural coding order indicates. + Consequently, any retransmitted parts of the most important slices + + + +Wenger, et al. Standards Track [Page 70] + +RFC 3984 RTP Payload Format for H.264 Video February 2005 + + + and coded slice data partitions are more likely to be received before + their scheduled decoding or playback time compared to the least + important slices and slice data partitions. + +13. Informative Appendix: Rationale for Decoding Order Number + +13.1. Introduction + + The Decoding Order Number (DON) concept was introduced mainly to + enable efficient multi-picture slice interleaving (see section 12.6) + and robust packet scheduling (see section 12.7). In both of these + applications, NAL units are transmitted out of decoding order. DON + indicates the decoding order of NAL units and should be used in the + receiver to recover the decoding order. 
Example use cases for + efficient multi-picture slice interleaving and for robust packet + scheduling are given in sections 13.2 and 13.3, respectively. + Section 13.4 describes the benefits of the DON concept in error + resiliency achieved by redundant coded pictures. Section 13.5 + summarizes the alternatives to DON that were considered and explains + why DON was chosen for this RTP payload specification. + +13.2. Example of Multi-Picture Slice Interleaving + + An example of multi-picture slice interleaving follows. A subset of + a coded video sequence is depicted below in output order. R denotes + a reference picture, N denotes a non-reference picture, and the + number indicates a relative output time. + + ... R1 N2 R3 N4 R5 ... + + The decoding order of these pictures from left to right is as + follows: + + ... R1 R3 N2 R5 N4 ... + + The NAL units of pictures R1, R3, N2, R5, and N4 are marked with a + DON equal to 1, 2, 3, 4, and 5, respectively. + + Each reference picture consists of three slice groups that are + scattered as follows (a number denotes the slice group number for + each macroblock in a QCIF frame): + + 0 1 2 0 1 2 0 1 2 0 1 + 2 0 1 2 0 1 2 0 1 2 0 + 1 2 0 1 2 0 1 2 0 1 2 + 0 1 2 0 1 2 0 1 2 0 1 + 2 0 1 2 0 1 2 0 1 2 0 + 1 2 0 1 2 0 1 2 0 1 2 + 0 1 2 0 1 2 0 1 2 0 1 + 2 0 1 2 0 1 2 0 1 2 0 + 1 2 0 1 2 0 1 2 0 1 2 + + + For the sake of simplicity, we assume that all the macroblocks of a + slice group are included in one slice. Three MTAPs are constructed + from three consecutive reference pictures so that each MTAP contains + three aggregation units, each of which contains all the macroblocks + from one slice group. The first MTAP contains slice group 0 of + picture R1, slice group 1 of picture R3, and slice group 2 of + picture R5.
The second MTAP contains slice group 1 of picture R1, + slice group 2 of picture R3, and slice group 0 of picture R5. The + third MTAP contains slice group 2 of picture R1, slice group 0 of + picture R3, and slice group 1 of picture R5. Each non-reference + picture is encapsulated into an STAP-B. + + Consequently, the transmission order of NAL units is the following: + + R1, slice group 0, DON 1, carried in MTAP, RTP SN: N + R3, slice group 1, DON 2, carried in MTAP, RTP SN: N + R5, slice group 2, DON 4, carried in MTAP, RTP SN: N + R1, slice group 1, DON 1, carried in MTAP, RTP SN: N+1 + R3, slice group 2, DON 2, carried in MTAP, RTP SN: N+1 + R5, slice group 0, DON 4, carried in MTAP, RTP SN: N+1 + R1, slice group 2, DON 1, carried in MTAP, RTP SN: N+2 + R3, slice group 0, DON 2, carried in MTAP, RTP SN: N+2 + R5, slice group 1, DON 4, carried in MTAP, RTP SN: N+2 + N2, DON 3, carried in STAP-B, RTP SN: N+3 + N4, DON 5, carried in STAP-B, RTP SN: N+4 + + The receiver is able to organize the NAL units back in decoding order + based on the value of DON associated with each NAL unit. + + If one of the MTAPs is lost, the spatially adjacent and temporally + co-located macroblocks are received and can be used to conceal the + loss efficiently. If one of the STAPs is lost, the effect of the + loss does not propagate temporally. + +13.3. Example of Robust Packet Scheduling + + An example of robust packet scheduling follows. The communication + system used in the example consists of the following components in + the order that the video is processed from source to sink: + + o camera and capturing + o pre-encoding buffer + o encoder + o encoded picture buffer + o transmitter + o transmission channel + o receiver + o receiver buffer + o decoder + o decoded picture buffer + o display + + The video communication system used in the example operates as + follows.
Note that processing of the video stream happens gradually + and at the same time in all components of the system. The source + video sequence is shot and captured to a pre-encoding buffer. The + pre-encoding buffer can be used to order pictures from sampling order + to encoding order or to analyze multiple uncompressed frames for bit + rate control purposes, for example. In some cases, the pre-encoding + buffer may not exist; instead, the sampled pictures are encoded right + away. The encoder encodes pictures from the pre-encoding buffer and + stores the output (i.e., coded pictures) in the encoded picture + buffer. The transmitter encapsulates the coded pictures from the + encoded picture buffer into transmission packets and sends them to a + receiver through a transmission channel. The receiver stores the + received packets in the receiver buffer. The receiver buffering + process typically includes buffering for transmission delay jitter. + The receiver buffer can also be used to recover correct decoding + order of coded data. The decoder reads coded data from the receiver + buffer and produces decoded pictures as output into the decoded + picture buffer. The decoded picture buffer is used to recover the + output (or display) order of pictures. Finally, pictures are + displayed. + + In the following example figures, I denotes an IDR picture, R denotes + a reference picture, N denotes a non-reference picture, and the + number after I, R, or N indicates the sampling time relative to the + previous IDR picture in decoding order. Values below the sequence of + pictures indicate scaled system clock timestamps. The system clock + is initialized arbitrarily in this example, and time runs from left + to right. Each I, R, and N picture is mapped into the same timeline + compared to the previous processing step, if any, assuming that
encoding, transmission, and decoding take no time. Thus, events + happening at the same time are located in the same column throughout + all example figures. + + A subset of a sequence of coded pictures is depicted below in + sampling order. + + ... N58 N59 I00 N01 N02 R03 N04 N05 R06 ... N58 N59 I00 N01 ... + ... --|---|---|---|---|---|---|---|---|- ... -|---|---|---|- ... + ... 58 59 60 61 62 63 64 65 66 ... 128 129 130 131 ... + + Figure 16. Sequence of pictures in sampling order + + The sampled pictures are buffered in the pre-encoding buffer to + arrange them in encoding order. In this example, we assume that the + non-reference pictures are predicted from both the previous and the + next reference picture in output order, except for the non-reference + pictures immediately preceding an IDR picture, which are predicted + only from the previous reference picture in output order. Thus, the + pre-encoding buffer has to contain at least two pictures, and the + buffering causes a delay of two picture intervals. The output of the + pre-encoding buffering process and the encoding (and decoding) order + of the pictures are as follows: + + ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ... + ... -|---|---|---|---|---|---|---|---|- ... + ... 60 61 62 63 64 65 66 67 68 ... + + Figure 17. Re-ordered pictures in the pre-encoding buffer + + The encoder or the transmitter can set the DON of each picture to + the DON of the previous picture in decoding order plus one. + + For the sake of simplicity, let us assume that: + + o the frame rate of the sequence is constant, + o each picture consists of only one slice, + o each slice is encapsulated in a single NAL unit packet, + o there is no transmission delay, and + o pictures are transmitted at constant intervals (that is, 1 / frame + rate).
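The pre-encoding reordering and the DON assignment described above can be sketched as follows. This is an illustrative sketch under the stated prediction assumptions only: the function names are hypothetical, pictures are represented by their labels from the figures, and the 16-bit wraparound of DON values is deliberately ignored.

```python
from typing import List, Tuple


def to_encoding_order(sampling_order: List[str]) -> List[str]:
    """Reorder pictures from sampling order to encoding (decoding) order.

    Assumes N pictures are predicted from the previous and the next
    reference picture, except N pictures immediately preceding an IDR
    picture, which use only the previous reference picture.
    """
    out: List[str] = []
    pending: List[str] = []  # N pictures waiting for their next reference
    for pic in sampling_order:
        kind = pic[0]
        if kind == "N":
            pending.append(pic)
        elif kind == "R":
            out.append(pic)      # the reference picture is coded first
            out.extend(pending)
            pending = []
        else:                    # "I": pending N pictures keep their place
            out.extend(pending)
            pending = []
            out.append(pic)
    out.extend(pending)
    return out


def assign_don(encoding_order: List[str], first_don: int = 0) -> List[Tuple[int, str]]:
    """Give each picture the DON of the previous picture plus one."""
    return [(first_don + i, pic) for i, pic in enumerate(encoding_order)]


def recover_decoding_order(received: List[Tuple[int, str]]) -> List[str]:
    """Receiver side: sort by DON (wraparound not handled in this sketch)."""
    return [pic for _, pic in sorted(received, key=lambda item: item[0])]
```

Applied to the sampling order N58 N59 I00 N01 N02 R03 N04 N05 R06 of Figure 16, to_encoding_order yields the order shown in Figure 17.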
When pictures are transmitted in decoding order, they are received as + follows: + + ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ... + ... -|---|---|---|---|---|---|---|---|- ... + ... 60 61 62 63 64 65 66 67 68 ... + + Figure 18. Received pictures in decoding order + + The OPTIONAL sprop-interleaving-depth MIME type parameter is set to + 0, as the transmission (or reception) order is identical to the + decoding order. + + The decoder has to buffer for one picture interval initially in its + decoded picture buffer to organize pictures from decoding order to + output order as depicted below: + + ... N58 N59 I00 N01 N02 R03 N04 N05 R06 ... + ... -|---|---|---|---|---|---|---|---|- ... + ... 61 62 63 64 65 66 67 68 69 ... + + Figure 19. Output order + + The amount of required initial buffering in the decoded picture + buffer can be signaled in the buffering period SEI message or with + the num_reorder_frames syntax element of H.264 video usability + information. num_reorder_frames indicates the maximum number of + frames, complementary field pairs, or non-paired fields that precede + any frame, complementary field pair, or non-paired field in the + sequence in decoding order and that follow it in output order. For + the sake of simplicity, we assume that num_reorder_frames is used to + indicate the initial buffer in the decoded picture buffer. In this + example, num_reorder_frames is equal to 1. + + It can be observed that if the IDR picture I00 is lost during + transmission and a retransmission request is issued when the value of + the system clock is 62, there is one picture interval of time (until + the system clock reaches timestamp 63) to receive the retransmitted + IDR picture I00.
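The num_reorder_frames value of 1 given above can be checked against its definition with a small sketch. The function name is hypothetical, and pictures are identified by the labels used in Figures 18 and 19; this is a check of the counting rule, not decoder code.

```python
from typing import List


def max_reordering(decoding_order: List[str], output_order: List[str]) -> int:
    """Largest number of pictures that precede some picture in decoding
    order and follow that same picture in output order."""
    out_pos = {pic: i for i, pic in enumerate(output_order)}
    worst = 0
    for i, pic in enumerate(decoding_order):
        count = sum(
            1 for earlier in decoding_order[:i] if out_pos[earlier] > out_pos[pic]
        )
        worst = max(worst, count)
    return worst


decoding = ["N58", "N59", "I00", "R03", "N01", "N02", "R06", "N04", "N05"]
output = ["N58", "N59", "I00", "N01", "N02", "R03", "N04", "N05", "R06"]
num_reorder_frames = max_reordering(decoding, output)
```

For example, R03 precedes N01 in decoding order and follows it in output order, and no picture has more than one such predecessor, so the result is 1.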
Let us then assume that IDR pictures are transmitted two frame + intervals earlier than their decoding position; i.e., the pictures + are transmitted as follows: + + ... I00 N58 N59 R03 N01 N02 R06 N04 N05 ... + ... --|---|---|---|---|---|---|---|---|- ... + ... 62 63 64 65 66 67 68 69 70 ... + + Figure 20. Interleaving: Early IDR pictures in sending order + + The OPTIONAL sprop-interleaving-depth MIME type parameter is set + equal to 1 according to its definition. (The value of sprop- + interleaving-depth in this example can be derived as follows: + Picture I00 is the only picture preceding picture N58 or N59 in + transmission order and following it in decoding order. Except for + pictures I00, N58, and N59, the transmission order is the same as the + decoding order of pictures. As a coded picture is encapsulated into + exactly one NAL unit, the value of sprop-interleaving-depth is equal + to the maximum number of pictures preceding any picture in + transmission order and following the picture in decoding order.) + + The receiver buffering process contains two pictures at a time + according to the value of the sprop-interleaving-depth parameter and + orders pictures from the reception order to the correct decoding + order based on the value of DON associated with each picture. The + output of the receiver buffering process is as follows: + + ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ... + ... -|---|---|---|---|---|---|---|---|- ... + ... 63 64 65 66 67 68 69 70 71 ... + + Figure 21. Interleaving: Receiver buffer + + Again, an initial buffering delay of one picture interval is needed + to organize pictures from decoding order to output order, as depicted + below: + + ... N58 N59 I00 N01 N02 R03 N04 N05 ... + ... -|---|---|---|---|---|---|---|- ... + ... 64 65 66 67 68 69 70 71 ... + + Figure 22.
Interleaving: Receiver buffer after reordering + + Note that the maximum delay that IDR pictures can undergo during + transmission, including possible application, transport, or link + layer retransmission, is equal to three picture intervals. Thus, the + loss resiliency of IDR pictures is improved in systems supporting + retransmission compared to the case in which pictures were + transmitted in their decoding order. + +13.4. Robust Transmission Scheduling of Redundant Coded Slices + + A redundant coded picture is a coded representation of a picture or a + part of a picture that is not used in the decoding process if the + corresponding primary coded picture is correctly decoded. There + should be no noticeable difference between any area of the decoded + primary picture and a corresponding area that would result from + application of the H.264 decoding process for any redundant picture + in the same access unit. A redundant coded slice is a coded slice + that is a part of a redundant coded picture. + + Redundant coded pictures can be used to provide unequal error + protection in error-prone video transmission. If a primary coded + representation of a picture is decoded incorrectly, a corresponding + redundant coded picture can be decoded. Examples of applications and + coding techniques using the redundant coded picture feature include + the video redundancy coding [23] and the protection of "key pictures" + in multicast streaming [24]. + + One property of many error-prone video communications systems is that + transmission errors are often bursty. Therefore, they may affect + several consecutive transmission packets in transmission order. + In low bit-rate video communication, it is relatively common that an + entire coded picture can be encapsulated into one transmission + packet. 
Consequently, a primary coded picture and the corresponding + redundant coded pictures may be transmitted in consecutive packets in + transmission order. To make the transmission scheme more tolerant of + bursty transmission errors, it is beneficial to transmit the primary + coded picture and redundant coded picture separated by more than a + single packet. The DON concept enables this. + +13.5. Remarks on Other Design Possibilities + + The slice header syntax structure of the H.264 coding standard + contains the frame_num syntax element that can indicate the decoding + order of coded frames. However, using the frame_num syntax + element to recover the decoding order is neither feasible nor + desirable, for the following reasons: + + o The receiver would be required to parse at least one slice header per + coded picture (before passing the coded data to the decoder). + + o Coded slices from multiple coded video sequences cannot be + interleaved, as the frame number syntax element is reset to 0 in + each IDR picture. + + o The coded fields of a complementary field pair share the same + value of the frame_num syntax element. Thus, the decoding order + of the coded fields of a complementary field pair cannot be + recovered based on the frame_num syntax element or any other + syntax element of the H.264 coding syntax. + + The RTP payload format for transport of MPEG-4 elementary streams + [25] enables interleaving of access units and transmission of + multiple access units in the same RTP packet. An access unit is + specified in the H.264 coding standard to comprise all NAL units + associated with a primary coded picture according to subclause + 7.4.1.2 of [1]. Consequently, slices of different pictures cannot be + interleaved, and the multi-picture slice interleaving technique (see + section 12.6) for improved error resilience cannot be used. + +14. 
Acknowledgements + + The authors thank Roni Even, Dave Lindbergh, Philippe Gentric, + Gonzalo Camarillo, Gary Sullivan, Joerg Ott, and Colin Perkins for + careful review. + +15. References + +15.1. Normative References + + [1] ITU-T Recommendation H.264, "Advanced video coding for generic + audiovisual services", May 2003. + + [2] ISO/IEC International Standard 14496-10:2003. + + [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement + Levels", BCP 14, RFC 2119, March 1997. + + [4] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, + "RTP: A Transport Protocol for Real-Time Applications", STD 64, + RFC 3550, July 2003. + + [5] Handley, M. and V. Jacobson, "SDP: Session Description + Protocol", RFC 2327, April 1998. + + [6] Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", + RFC 3548, July 2003. + + [7] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with + Session Description Protocol (SDP)", RFC 3264, June 2002. + +15.2. Informative References + + [8] "Draft ITU-T Recommendation and Final Draft International + Standard of Joint Video Specification (ITU-T Rec. H.264 | + ISO/IEC 14496-10 AVC)", available from http://ftp3.itu.int/av- + arch/jvt-site/2003_03_Pattaya/JVT-G050r1.zip, May 2003. + + [9] Luthra, A., Sullivan, G.J., and T. Wiegand (eds.), Special Issue + on H.264/AVC. IEEE Transactions on Circuits and Systems for Video + Technology, July 2003. + + [10] Bormann, C., Cline, L., Deisher, G., Gardos, T., Maciocco, C., + Newell, D., Ott, J., Sullivan, G., Wenger, S., and C. Zhu, "RTP + Payload Format for the 1998 Version of ITU-T Rec. H.263 Video + (H.263+)", RFC 2429, October 1998. + + [11] ISO/IEC IS 14496-2. + + [12] Wenger, S., "H.26L over IP", IEEE Transactions on Circuits and + Systems for Video Technology, Vol. 13, No. 7, July 2003. 
+ + [13] Wenger, S., "H.26L over IP: The IP Network Adaptation Layer", + Proceedings Packet Video Workshop 02, April 2002. + + [14] Stockhammer, T., Hannuksela, M.M., and S. Wenger, "H.26L/JVT + Coding Network Abstraction Layer and IP-based Transport" in + Proc. ICIP 2002, Rochester, NY, September 2002. + + [15] ITU-T Recommendation H.241, "Extended video procedures and + control signals for H.300 series terminals", 2004. + + [16] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video + Conferences with Minimal Control", STD 65, RFC 3551, July 2003. + + [17] ITU-T Recommendation H.223, "Multiplexing protocol for low bit + rate multimedia communication", July 2001. + + [18] Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for + Generic Forward Error Correction", RFC 2733, December 1999. + + [19] Stockhammer, T., Wiegand, T., Oelbaum, T., and F. Obermeier, + "Video Coding and Transport Layer Techniques for H.264/AVC-Based + Transmission over Packet-Lossy Networks", IEEE International + Conference on Image Processing (ICIP 2003), Barcelona, Spain, + September 2003. + + [20] Varsa, V. and M. Karczewicz, "Slice interleaving in compressed + video packetization", Packet Video Workshop 2000. + + [21] Kang, S.H. and A. Zakhor, "Packet scheduling algorithm for + wireless video streaming", International Packet Video Workshop + 2002. + + [22] Hannuksela, M.M., "Enhanced concept of GOP", JVT-B042, available + from http://ftp3.itu.int/av-arch/video-site/0201_Gen/JVT-B042.doc, + January 2002. + + [23] Wenger, S., "Video Redundancy Coding in H.263+", 1997 + International Workshop on Audio-Visual Services over Packet + Networks, September 1997. + + [24] Wang, Y.-K., Hannuksela, M.M., and M. Gabbouj, "Error Resilient + Video Coding Using Unequally Protected Key Pictures", in Proc. + International Workshop VLBV03, September 2003. 
+ + [25] van der Meer, J., Mackie, D., Swaminathan, V., Singer, D., and + P. Gentric, "RTP Payload Format for Transport of MPEG-4 + Elementary Streams", RFC 3640, November 2003. + + [26] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. + Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC + 3711, March 2004. + + [27] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming + Protocol (RTSP)", RFC 2326, April 1998. + + [28] Handley, M., Perkins, C., and E. Whelan, "Session Announcement + Protocol", RFC 2974, October 2000. + + [29] ISO/IEC 14496-15: "Information technology - Coding of audio- + visual objects - Part 15: Advanced Video Coding (AVC) file + format". + + [30] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd + Generation Partnership Project (3GPP) Multimedia files", RFC + 3839, July 2004. + +Authors' Addresses + + Stephan Wenger + TU Berlin / Teles AG + Franklinstr. 28-29 + D-10587 Berlin + Germany + + Phone: +49-172-300-0813 + EMail: stewe@stewe.org + + + Miska M. Hannuksela + Nokia Corporation + P.O. Box 100 + 33721 Tampere + Finland + + Phone: +358-7180-73151 + EMail: miska.hannuksela@nokia.com + + + Thomas Stockhammer + Nomor Research + D-83346 Bergen + Germany + + Phone: +49-8662-419407 + EMail: stockhammer@nomor.de + + + Magnus Westerlund + Multimedia Technologies + Ericsson Research EAB/TVA/A + Ericsson AB + Torshamsgatan 23 + SE-164 80 Stockholm + Sweden + + Phone: +46-8-7190000 + EMail: magnus.westerlund@ericsson.com + + + David Singer + QuickTime Engineering + Apple + 1 Infinite Loop MS 302-3MT + Cupertino + CA 95014 + USA + + Phone: +1 408 974-3162 + EMail: singer@apple.com + +Full Copyright Statement + + Copyright (C) The Internet Society (2005). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET + ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, + INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE + INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the IETF's procedures with respect to rights in IETF Documents can + be found in BCP 78 and BCP 79. 
+ + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at ietf- + ipr@ietf.org. + + +Acknowledgement + + Funding for the RFC Editor function is currently provided by the + Internet Society. + |