Why the buzz about H.264?
It's the bitrate!
H.264 is getting so much attention because it can encode video
with approximately one third of the bits required by comparable
MPEG-2 encoders.
Because H.264 is up to twice as efficient as MPEG-4 Part 2
(natural video) encoding, it has recently been welcomed into the
MPEG-4 standard as Part 10 Advanced Video Coding. Many
established encoder and decoder vendors are moving directly to
h.264 and skipping the intermediate step of MPEG-4 Part 2.
Goals & Approach of H.264
The International Telecommunication Union (ITU) initiated the
h.26L (for long term) effort in 1998 as a continuation of work
following the MPEG-2 and h.263 standards. The overriding goal
was to achieve a factor-of-2 reduction in bit rate compared to
any competing standard.
Recall that MPEG-2 was optimized with specific focus on Standard
and High Definition digital television services, which are
delivered via circuit-switched head-end networks to dedicated
satellite uplinks, cable infrastructure or terrestrial
facilities. MPEG-2's ability to cope is being strained as the
range of delivery media expands to include heterogeneous mobile
networks, packet-switched IP networks, and multiple storage
formats, and as the variety of services grows to include
multimedia messaging, security, increased use of HDTV, and
others. Thus, a second goal for h.264 was to accommodate a wider
variety of bandwidth requirements, picture formats, and
unfriendly network environments that throw high jitter, packet
loss, and bandwidth instability into the mix.
The h.264 approach is a strictly evolutionary extension of the
block-based encoding approach so well established in the MPEG
and ITU standards. Key steps include:
Use of Motion Estimation to support Inter-picture prediction
for eliminating temporal redundancies
Use of spatial correlation of data to provide Intra-picture
prediction.
Construction of residuals as the difference between predicted
images and source images.
Use of a discrete spatial transform and filtering to eliminate
spatial redundancies in the residuals.
Entropy coding of the transformed residual coefficients and of
the supporting data such as motion vectors.
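The prediction and residual steps above can be sketched on a single row of pixels. This is only an illustration: the sample values, the uniform quantizer step `QSTEP`, and the helper names are assumptions, and the transform and entropy-coding stages are left out for brevity.

```python
QSTEP = 4  # hypothetical quantizer step size, chosen only for illustration

def encode_residual(source, predicted, qstep=QSTEP):
    """Form the residual (source minus prediction) and quantize it."""
    residual = [s - p for s, p in zip(source, predicted)]
    # round() serves as a simple nearest-integer quantizer.
    return [round(r / qstep) for r in residual]

def decode_residual(levels, predicted, qstep=QSTEP):
    """Rebuild pixels from the prediction plus the dequantized residual."""
    return [p + level * qstep for p, level in zip(predicted, levels)]

source = [100, 104, 109, 115]      # pixels to be coded
predicted = [101, 103, 108, 118]   # prediction (inter or intra)
levels = encode_residual(source, predicted)
reconstructed = decode_residual(levels, predicted)
```

Only `levels` would need to be transmitted, because the decoder can form the same prediction on its own. A better prediction yields smaller residuals and therefore fewer bits.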
Major Features of H.264
Improved Inter-Prediction and Motion Estimation
First recall the limitations of motion estimation in MPEG-2,
which searches reference pictures for a 16x16 set of pixels that
closely matches the current macro block. The matching set of
pixels must be completely within the reference picture. In
contrast, H.264 provides:
Fine-grained motion estimation.
Temporal search seeks matching sub-macro blocks of variable size
as small as 4x4, and finds the motion vector to 1/4 pel
resolution. Searches may also identify motion vectors associated
with matching sub-macro blocks of 4x8, 8x4, 8x8, 8x16, 16x8, or
the full 16x16. [In future, even finer 1/8 pel resolution will
be supported.]
Multiple reference frames. H.264 provides additional
flexibility for frames to reference multiple frames, which may
be any combination of past and future frames. This capability
provides opportunities for more precise inter-prediction, and
also improves robustness to lost picture data.
Unrestricted motion search. Motion search allows for
reference blocks that may lie partly outside the picture; missing
data can be spatially predicted from boundary data. Users may
choose to disable this feature by specifying a Restricted Motion
search.
Motion vector prediction. Where sufficient temporal
correlation exists, motion vectors may be accurately predicted
and only their residuals transmitted explicitly in the
bitstream.
Such techniques not only provide for more accurate
inter-prediction, but also help to partition and scale the
bitstream with priority given to data that is more globally
applicable. Thus, they not only improve compression but also
resilience to errors and network instabilities.
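As a rough illustration of the block-matching search described above, the sketch below runs an exhaustive integer-pel search for one block. The sum of absolute differences (SAD) cost, the synthetic `texture` pattern, and all function names are assumptions made for this example. Sub-pel refinement and multiple reference frames are not shown, and the search is the restricted kind that keeps every candidate inside the reference picture.

```python
def texture(x, y):
    """Synthetic pixel pattern, used only to build test pictures."""
    return (3 * x * x + 5 * y * y + 7 * x * y) % 256

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )

def full_search(cur, ref, bx, by, n=4, search=2):
    """Return (cost, dx, dy) of the best integer-pel match within +/-search."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Restricted search: skip candidates that leave the reference picture.
            if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

SIZE = 12
ref = [[texture(x, y) for x in range(SIZE)] for y in range(SIZE)]
# Current picture: content at (x, y) is found at (x + 2, y + 1) in the reference.
cur = [[texture(x + 2, y + 1) for x in range(SIZE)] for y in range(SIZE)]

cost, dx, dy = full_search(cur, ref, bx=4, by=4)
```

Here the search recovers the displacement (2, 1) with zero cost. The block size `n` is a parameter, so the same routine could be called for each of the sub-macro block sizes listed above.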
Improved Intra Spatial Prediction and Transform
Because "intra prediction" is concerned with only one picture at
a time, it relies upon spatial rather than temporal
correlations. As the algorithm works through a picture's macro
blocks in raster scan order, earlier results may be used to
"predict" the downstream calculations. Then we need only
transmit residuals as refinements to the predicted results.
H.264 performs intra prediction in the spatial domain (prior to
the transform), and it is a key part of the approach. Even for an
intra-picture, every block of data is predicted from its
neighbors before being transformed and coefficients generated
for inclusion in the bitstream.
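A minimal sketch of spatial intra prediction for one 4x4 block follows. It covers only three simple modes (vertical, horizontal, and DC). The mode names, the SAD-based mode decision, and the sample values are assumptions for illustration, since an encoder is free to choose modes in other ways.

```python
def predict_4x4(top, left, mode):
    """Predict a 4x4 block from the 4 pixels above and the 4 pixels to the left."""
    if mode == "vertical":      # copy the row above down each column
        return [list(top) for _ in range(4)]
    if mode == "horizontal":    # copy the left column across each row
        return [[left[y]] * 4 for y in range(4)]
    if mode == "dc":            # rounded mean of all eight neighbours
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unknown mode: {mode}")

def best_mode(block, top, left):
    """Pick the mode with the smallest sum of absolute residuals."""
    def cost(mode):
        pred = predict_4x4(top, left, mode)
        return sum(abs(block[y][x] - pred[y][x])
                   for y in range(4) for x in range(4))
    return min(("vertical", "horizontal", "dc"), key=cost)

top = [10, 50, 90, 130]             # reconstructed pixels above the block
left = [200, 200, 200, 200]         # reconstructed pixels to the left
block = [[10, 50, 90, 130] for _ in range(4)]   # vertical stripes

mode = best_mode(block, top, left)
pred = predict_4x4(top, left, mode)
residual = [[block[y][x] - pred[y][x] for x in range(4)] for y in range(4)]
```

For this striped block the vertical mode predicts every pixel exactly, so the residual passed on to the transform is all zeros.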
Coarse versus fine intra prediction. Intra prediction may be
performed either on 4x4 blocks, or 16x16 macro blocks. The
latter is more efficient for uniform areas of a picture.
Direction Dependent Intra Modes. By doing intra prediction in
the spatial domain (rather than in the transform domain), h.264
can employ prediction that is direction dependent, and thus can
focus on the most highly correlated neighbors. For Intra 4x4
coding and Intra 16x16 coding, there are 9 and 4 prediction
modes, respectively.
4x4 transform of Residual Data. For initially supported
profiles, residual data transforms are always performed for 4x4
blocks of data, and coefficients transmitted on this
fine-grained basis.
Variable block sizes for spatial transform*. Future profiles
will allow transform of variable size blocks (4x8, 8x8, etc.)
with the same level of flexibility as motion estimation blocks.
This will provide more flexibility and further reduction of
bitrate.
Integer transforms. Efficiency in both computation and bitrate
is gained by implementing the traditional Discrete Cosine
Transform (DCT) as an integer transform that requires no
multiplications, except for a single normalization. It can also
be inverted exactly without mismatch.
Deblocking filter. To eliminate fine structure blockiness that
might be aggravated by the smaller transform blocks, a
context-sensitive deblocking filter smoothes out the internal
edges. Its filter strength depends upon the prediction modes and
relationship between the neighboring blocks. In addition to
increasing signal-to-noise ratio (S/N), this technique
significantly improves the subjective quality of the image for a
given S/N.
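The exact invertibility of the integer transform can be checked with a short sketch. The matrix `C` below is the commonly cited 4x4 forward core matrix for H.264. The helper names are assumptions, and the rescaling by `NORMS` merely stands in for the normalization mentioned above; how a real codec folds that scaling into quantization is not shown.

```python
from fractions import Fraction

# Core matrix of the 4x4 forward integer transform.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]
# Squared row norms of C, so that C * C^T = diag(NORMS).
NORMS = [4, 10, 4, 10]

def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward(block):
    """Y = C * X * C^T, using integer arithmetic only."""
    return matmul(matmul(C, block), transpose(C))

def inverse(coeffs):
    """X = C^T * D^-1 * Y * D^-1 * C, with D = diag(NORMS)."""
    scaled = [[Fraction(coeffs[i][j], NORMS[i] * NORMS[j]) for j in range(4)]
              for i in range(4)]
    x = matmul(matmul(transpose(C), scaled), C)
    return [[int(v) for v in row] for row in x]

flat = [[5] * 4 for _ in range(4)]
textured = [[(4 * i + 3 * j) % 17 for j in range(4)] for i in range(4)]
```

A flat block puts all of its energy into the single DC coefficient, and `inverse(forward(block))` returns the original block with no rounding mismatch.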
Improved Algorithms for Encoding
Two alternative methods improve efficiency of the entropy coding
process by selecting variable length codes depending upon
context of the data being encoded.
Context-Adaptive Variable Length Coding (CAVLC) employs
multiple variable length codeword tables to encode transform
coefficients, which consume the bulk of bandwidth. Based upon a
priori statistics of already processed data, the best table is
selected adaptively. For non-coefficient data, a simpler scheme
is used that relies upon only a single table.
Context-Adaptive Binary Arithmetic Coding (CABAC*) provides an
extremely efficient encoding scheme when it is known that
certain symbols are much more likely than others. Such dominant
symbols may be encoded with extremely small bit/symbol ratios.
The CABAC method continually updates frequency statistics of the
incoming data and adaptively adjusts the algorithm in real-time.
This method is an advanced option available in profiles beyond
the baseline profile.
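The benefit of adapting to skewed symbol statistics can be illustrated with a toy model. This is not the CABAC algorithm itself: the simple frequency-count estimator, the function name, and the test sequences are assumptions, and the code only computes the ideal cost that an arithmetic coder driven by such a model would approach.

```python
import math

def adaptive_cost_bits(symbols):
    """Ideal cost, in bits, of coding a binary sequence with an adaptive model."""
    counts = [1, 1]  # start with one pseudo-count per symbol value
    bits = 0.0
    for s in symbols:
        p = counts[s] / (counts[0] + counts[1])  # current estimate of P(s)
        bits += -math.log2(p)                    # ideal arithmetic-coding cost
        counts[s] += 1                           # update the running statistics
    return bits

skewed = [0] * 95 + [1] * 5     # one symbol strongly dominates
balanced = [0, 1] * 50          # both symbols equally likely

skewed_bits = adaptive_cost_bits(skewed)
balanced_bits = adaptive_cost_bits(balanced)
```

For the skewed sequence the cost comes out well under one bit per symbol, while the balanced sequence costs roughly one bit per symbol, which is the behavior the text attributes to dominant symbols.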
Techniques for Mitigation of Errors, Packet Losses, and
Network Variability
Error containment and scalability
H.264 includes several other features that are useful in
containing the impact of errors, and in enabling the use of
scalable or multiple bit streams:
Slice coding. Each picture is subdivided into one or more
slices. The slice is given increased importance in H.264 as the
basic spatial segment that is independent from its neighbors.
Thus, errors or missing data from one slice cannot propagate to
any other slice within the picture. This also increases
flexibility to extend picture types (I, P, B) down to the level
of "slice types." Redundant slices are permitted.
Data partitioning is supported to allow higher priority data
(e.g., sequence headers) to be separated from lower priority
data (e.g., B-picture transform coefficients).
Flexible macro block ordering (FMO) can be used to scatter the
bits associated with adjoining macro blocks more randomly
throughout the bit stream. This reduces the chance that a packet
loss will affect a large region and enables error concealment by
ensuring that neighboring macro blocks will be available for
prediction of a missing macro block.
The Multiple Reference Frames that are used for improved
motion estimation also allow for partial motion compensation for
a P picture when one of its referenced frames is missing or
corrupted.
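The effect of flexible macro block ordering on error concealment can be sketched with a checkerboard assignment of macro blocks to two slice groups. The checkerboard pattern, the contiguous comparison layout, and the function names are assumptions chosen for illustration. The sketch counts how many of a lost macro block's four neighbors survive when one whole slice group is lost.

```python
def checkerboard_groups(width_mbs, height_mbs):
    """Assign macro blocks to two slice groups in a checkerboard pattern."""
    return [[(x + y) % 2 for x in range(width_mbs)] for y in range(height_mbs)]

def contiguous_groups(width_mbs, height_mbs):
    """Comparison case: top half in group 0, bottom half in group 1."""
    return [[0 if y < height_mbs // 2 else 1 for _ in range(width_mbs)]
            for y in range(height_mbs)]

def surviving_neighbours(groups, x, y, lost_group):
    """Count the left/right/up/down neighbours of (x, y) not in the lost group."""
    height, width = len(groups), len(groups[0])
    count = 0
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height and groups[ny][nx] != lost_group:
            count += 1
    return count

scattered = checkerboard_groups(4, 4)
contiguous = contiguous_groups(4, 4)
# Macro block (1, 1) belongs to group 0 in both layouts; suppose group 0 is lost.
with_fmo = surviving_neighbours(scattered, 1, 1, lost_group=0)
without_fmo = surviving_neighbours(contiguous, 1, 1, lost_group=0)
```

With the scattered layout all four neighbors of the lost macro block remain available for concealment, whereas in the contiguous layout only the neighbor below it survives.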
SI and SP Pictures (or slices)*
MPEG-2 practice is to insert intra pictures (I) at regular
intervals to contain errors that otherwise could propagate
through the picture sequence indefinitely. In addition,
intra-pictures provide a means for random access or fast-forward
actions, because intra frames do not require any knowledge of
other referenced frames. Similarly, regular I pictures would be
necessary to switch promptly between higher and lower bitrate
streams, an important feature for accommodating the bandwidth
variability in mobile networks. However, I pictures
typically require far more bits than P pictures and thus are an
inefficient means for addressing these two requirements.
H.264 introduces two new slice types, "Switching I Pictures"
(SI) and "Switching P Pictures" (SP), which help address these
needs with significantly reduced bit rate. Identical SP frames
can be obtained even though different reference frames are used;
thus, they can be substituted for I frames as temporal
resynchronization points, but with significantly reduced
bitrate. SP pictures rely upon the transformation and
quantization of predicted inter blocks. Because SP pictures do
not take full advantage of intra-prediction, at the cost of some
bits they can be extended to SI pictures which do so.
Note that because slices are coded independently, switching
slices (SI or SP) can be defined at that level.
Low Latency Feature
Arbitrary Slice Ordering (ASO) relaxes the constraint that all
macro blocks must be sequenced in decoding order, and thus
enhances flexibility for low-delay performance important in
teleconferencing, surveillance and interactive Internet
applications.
Simplified Profiles
H.264 is completely focused on efficient coding of natural video
and does not directly address the object-oriented functionality,
synthetic video, and other systems functionality in MPEG-4,
which carries a very complex structure of over 50 profiles.
In contrast, H.264 is initially defined with only three
profiles:
Baseline Profile. A basic goal of H.264 was to provide a
royalty-free baseline profile to encourage early application of
the standard. The baseline profile includes most of the major
features described above, with the exception of: B slices and
weighted prediction; CABAC encoding; field coding; and SP & SI
slices. Thus, the baseline profile is appropriate for many
progressive scan applications such as video conferencing and
video-over-IP, but not for interlaced television or multiple
stream applications.
Main Profile. Main profile contains all of the features in
Baseline, except flexible macro block ordering (FMO), arbitrary
slice order (ASO) and redundant slices. However, it adds field
coding, B slices and weighted prediction, and CABAC entropy
coding. This profile is appropriate for efficient coding of
interlaced television applications where bit or packet error is
not excessive, and where low latency is not a requirement.
Extended Profile. This profile contains all features from the
baseline profile and main profiles, except that CABAC is not
supported. In addition, the Extended profile adds SP and SI for
stream switching, and up to 8 slice groups. This profile is
appropriate for server-based streaming applications where
bit-rate scalability and error resilience are very important.
Security applications and mobile video services are examples.
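The profile descriptions above can be summarized as feature sets. The sketch below encodes only what the text states; the short feature labels, the inclusion of I and P slices and CAVLC in the baseline set, and the helper `profiles_supporting` are assumptions made for illustration.

```python
BASELINE = {"I slices", "P slices", "CAVLC", "FMO", "ASO", "redundant slices"}

# Main: Baseline without the error-resilience tools, plus the additions.
MAIN = (BASELINE - {"FMO", "ASO", "redundant slices"}) | {
    "field coding", "B slices", "weighted prediction", "CABAC",
}

# Extended: everything in Baseline and Main except CABAC, plus switching slices.
EXTENDED = ((BASELINE | MAIN) - {"CABAC"}) | {"SP slices", "SI slices"}

PROFILES = {"Baseline": BASELINE, "Main": MAIN, "Extended": EXTENDED}

def profiles_supporting(required):
    """Return the names of profiles that include every required feature."""
    return [name for name, features in PROFILES.items()
            if set(required) <= features]
```

For example, `profiles_supporting({"CABAC"})` returns only Main, while a request for both FMO and B slices is met only by Extended.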
Where will H.264 have the biggest impact?
Any video application can benefit from a reduction in bandwidth
requirements, but highest impact will involve applications where
such reduction relieves a hard technical constraint, or which
makes more cost-effective use of bandwidth as a limiting
resource.
In addition, other H.264 features such as error containment,
error concealment, and efficient bitstream switching are
especially useful for IP and wireless environments.
Squeeze More Services into a Broadcast Channel
Reduction in bandwidth requirements by factors of 2-3 provides
cost savings for bandwidth-constrained services such as
satellite and DVB-Terrestrial, or alternatively allow such
providers to expand services at reduced incremental cost.
Facilitate High Quality Video Streaming over IP Networks
H.264 can deliver very good, TV-quality streaming of standard
definition video at less than 1 Mbps. This slips under the
1 Mbps threshold for xDSL and thus opens possibilities for new
access methods for high quality, larger format video.
High Definition Transmission and Storage
Recall that MPEG-2 consumes 15-20 Mbps for High Definition video
at suitable quality for broadcast or DVD. Use of h.264 will
bring this down to about 8 Mbps, making it possible for
bandwidth-strapped satellite service providers to fit 4 HD
channels per QPSK channel.
Even more significant is that this reduction enables burning one
HD movie onto a conventional DVD, thus avoiding the need for the
industry to adopt a higher density ("blue laser") DVD format.
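A back-of-the-envelope check of the playing time on a DVD is sketched below. The nominal disc capacities (4.7 GB single-layer, 8.5 GB dual-layer) are not taken from the text above and are assumptions of this example, as is the simplification of ignoring audio and container overhead.

```python
DVD_SINGLE_LAYER = 4.7e9   # bytes, nominal single-layer capacity (assumed)
DVD_DUAL_LAYER = 8.5e9     # bytes, nominal dual-layer capacity (assumed)

def playing_time_minutes(capacity_bytes, video_bitrate_bps):
    """Minutes of video that fit, ignoring audio and container overhead."""
    return capacity_bytes * 8 / video_bitrate_bps / 60

h264_single = playing_time_minutes(DVD_SINGLE_LAYER, 8e6)    # H.264 at 8 Mbps
h264_dual = playing_time_minutes(DVD_DUAL_LAYER, 8e6)
mpeg2_dual = playing_time_minutes(DVD_DUAL_LAYER, 15e6)      # MPEG-2 at 15 Mbps
```

Under these assumptions, 8 Mbps video runs for about 78 minutes on a single-layer disc and about 142 minutes on a dual-layer disc, compared with roughly 76 minutes for 15 Mbps MPEG-2 on the dual-layer disc. On these figures, a feature-length HD movie would need the dual-layer disc.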
Mobile Video Applications
3G Mobile networks present an unusual array of technical
challenges that have driven many features in h.264. Applications
include video conferencing, streaming video on demand,
multimedia-messaging services, and low resolution broadcast.
Some key issues, and h.264 tools for dealing with them, include:
Low bandwidth (50-300 kbps) is the key issue. The expected
trend is for 3G deployment to start with h.263 and move up to
h.264 as it matures. An industry analyst points out: "3G
networks are only likely to offer 57.6kbit/s initially. As those
bit rates increase, mobiles and networks will move to the new
H.264 codec, which offers twice the performance of H.263. This
should result in the same picture quality being achieved at half
the bit rate."
Small devices with many formats; variability of available
bandwidth. For streaming applications, these two separate issues
can be addressed by providing multiple streams with different
formats and bandwidths, and selecting the appropriate stream at
run-time. H.264's SP and SI pictures facilitate dynamic
switching among multiple streams to accommodate bandwidth
variability.
High bit error rates, packet losses, and latency. For video
applications, retransmissions are impractical for dropped or
delayed packets, so h.264 provides several means (e.g., FMO,
data partitioning, etc.) to contain error impacts and facilitate
error concealment.
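Run-time selection among multiple streams, as described in the list above, might look like the following sketch. The stream bitrates and the function name are hypothetical, and the switching itself (which would happen at SP or SI pictures) is not modelled.

```python
STREAMS_KBPS = [64, 128, 256]   # hypothetical encodings of the same content

def select_stream(stream_bitrates_kbps, available_kbps):
    """Pick the highest-bitrate stream that fits; fall back to the lowest."""
    fitting = [b for b in stream_bitrates_kbps if b <= available_kbps]
    return max(fitting) if fitting else min(stream_bitrates_kbps)
```

A server could re-evaluate `select_stream` whenever its bandwidth estimate changes and move to the newly selected stream at the next switching point.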
What is the relationship to MPEG-4 and MPEG-2?
Compared to MPEG-2
H.264 employs the same general approach as MPEG 1 & 2 as well as
the h.261 and h.263 standards, but adds many incremental
improvements to obtain coding efficiency improvement of about a
factor-of-3.
MPEG-2 was optimized with specific focus on Standard and High
Definition digital television services, which are delivered via
circuit-switched head-end networks to dedicated satellite
uplinks, cable infrastructure or terrestrial facilities. MPEG-2's
ability to cope is being strained as the range of delivery media
expands to include heterogeneous mobile networks,
packet-switched IP networks, and multiple storage formats, and
as the variety of services grows to include multimedia
messaging, increased use of HDTV, and others. Thus, a second
goal for h.264 was to accommodate a wider variety of bandwidth
requirements, picture formats, and unfriendly network
environments that throw high jitter, packet loss, and bandwidth
instability into the mix.
Compared to MPEG-4
In December 2001, the ITU-T Video Coding Experts Group (VCEG)
combined forces with MPEG-4 experts to form the Joint Video Team
(JVT), so H.264 is being published as MPEG-4 Part 10 (Advanced
Video Coding).
MPEG-4 is really a family of standards whose overall theme is
object-oriented multimedia applications. It thus has much
broader scope than H.264, which is strictly focused on more
efficient and robust video coding. The comparable part of MPEG-4
is Part 2 Visual (sometimes called "Natural Video"). Other parts
of MPEG-4 address scene composition, object description and Java
representation of behavior, animation of human body and facial
movements, audio and systems.