MPEG-2 AAC
The MPEG-2 Advanced Audio Coding (AAC) format codes monaural, stereo, or multichannel
audio with up to 48 channels, including the 5.1-channel configuration, at a
variety of bit rates.
AAC is known for its relatively high fidelity at low bit rates; for example,
about 64 kbps per channel. It also provides high-quality 5.1-channel coding
at an overall rate of 320 kbps or 384 kbps. AAC uses a reference model (RM)
structure in which a set of tools (modules) has defined interfaces and can
be combined variously in three different profiles. Individual tools can be
upgraded and used to replace older tools in the reference software. In addition,
this modularity makes it easy to compare revisions against older versions.
AAC also comprises the kernel of audio tools used in the MPEG-4 standard for
coding high-quality audio, and it supports lossless coding. AAC is specified
in Part 7 of the MPEG-2 standard (ISO/IEC 13818-7), which was finalized in
April 1997.
MPEG-2 AAC coding is not backward compatible with MPEG-1 and was originally
designated as NBC (non backward compatible) coding. An AAC bitstream cannot
be decoded by an MPEG-1-only decoder. By lifting the constraint of compatibility,
better performance is achieved compared to MPEG-2 BC. MPEG-2 AAC supports the
standard sampling frequencies of 32, 44.1, and 48 kHz, as well as other rates
from 8 kHz to 96 kHz, yielding maximum per-channel bit rates of 48 kbps (at 8
kHz) and 576 kbps (at 96 kHz).
Its input channel configurations are: 1/0 (monaural), 2/0 (two-channel stereo),
different multichannel configurations up to 3/2 + 1, and provision for up to
48 channels. Matrixing is not used. Downmixing is supported. To improve error
performance, the system is designed to maintain bitstream synchronization in
the presence of bit errors, and error concealment is supported as well.
To allow flexibility in audio quality versus processing requirements, AAC
coding modules are used to create three profiles: main profile, scalable sampling
rate (SSR) profile, and low-complexity (LC) profile. The main profile employs
the most sophisticated encoder using all the coding modules except preprocessing
to yield the highest audio quality at any bit rate. A main profile decoder
can also decode the low-complexity bitstream. The SSR profile uses a gain control
tool to perform poly-phase quadrature filtering (PQF), gain detection, and
gain modification preprocessing; prediction is not used and temporal noise
shaping (TNS) order is limited. SSR divides the audio signal into four equal
frequency bands, each with an independent bitstream; decoders can choose
to decode one or more streams and thus vary the bandwidth of the output signal.
SSR provides partial compatibility with the low-complexity profile; the decoded
signal is bandlimited. The LC profile does not use preprocessing or prediction
tools and the TNS order is limited. LC operates with low memory and processing
requirements.
AAC Main Profile
A block diagram of a main profile AAC encoder and decoder is shown in FIG. 15. An MDCT with 50% overlap is used as the only input signal filter bank.
It uses lengths of 1024 for stationary signals or 128 for transient signals,
with a 2048-point window or a block of eight 256-point windows, respectively.
To preserve interchannel block synchronization (phase), short block lengths
are retained for eight-block durations. For multi-channel coding, different
filter bank resolutions can be used for different channels. At 48 kHz, the
long-window frequency resolution is 23 Hz and time resolution is 21 ms; the
short window yields 187 Hz and 2.6 ms. The MDCT employs time-domain aliasing
cancellation (TDAC). Two alternate window shapes are selectable on a frame
basis in the 2048-point mode; either sine or Kaiser-Bessel-derived (KBD) windows
can be employed. The encoder can select the optimal window shape on the basis
of signal characteristics. The sine window is used when perceptually important
components are spaced closer than 140 Hz and narrow-band selectivity is more
important than stop-band attenuation. The KBD window is used when components
are spaced more than 220 Hz apart and stopband attenuation is needed. Window
switching is seamless, even with the overlap-add sequence. The shape of the
left half of each window must match the shape of the right half of the preceding
window; a new window shape is thus introduced as a new right half.
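To make the overlap constraint concrete, the following sketch (illustrative, not from the standard) builds a 2048-point sine window and checks the Princen-Bradley condition that a run of identical windows must satisfy for time-domain aliasing cancellation:

```python
import math

def sine_window(N):
    """MDCT sine window of length N: w[n] = sin((pi / N) * (n + 0.5))."""
    return [math.sin(math.pi / N * (n + 0.5)) for n in range(N)]

# Princen-Bradley (TDAC) condition for perfect reconstruction with
# 50% overlap: w[n]^2 + w[n + N/2]^2 == 1 for every n.
w = sine_window(2048)
for n in range(1024):
    assert abs(w[n] ** 2 + w[n + 1024] ** 2 - 1.0) < 1e-9
```

The KBD window satisfies the same condition by construction, which is what allows sine and KBD halves to be mixed from frame to frame.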
FIG. 15 Block diagram of MPEG-2 AAC encoder and decoder. Heavy lines
denote data paths, light lines denote control signals.
The suggested psychoacoustic model is based on MPEG-1 psychoacoustic model 2 and examines
the perceptual entropy of the audio signal. It controls the quantizer step
size, increasing step size to decrease buffer levels during stationary signals,
and correspondingly decreasing step size to allow levels to rise during transient
signals.
A second-order backward-adaptive predictor is applied to remove redundancy
in stationary signals found in long windows; residues are calculated and used
to replace frequency coefficients. Reconstructed coefficients in successive
blocks are examined for frequencies below 16 kHz. Values from two previous
blocks are used to form one predicted value for each current coefficient. The
predicted value is subtracted from the actual target value to yield a prediction
error (residue) which is quantized. Coefficient residues are grouped into scale
factor bands that emulate critical bands. A prediction control algorithm determines
if prediction should be activated in individual scale factor bands or in the
frame at all, based on whether it improves coding gain.
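The residue mechanism can be sketched as follows. The real AAC predictor is second-order and backward-adaptive (its coefficients are continuously updated from previously decoded data); the fixed coefficients a1 and a2 below are purely illustrative:

```python
def encode_residues(values, a1=1.0, a2=-0.5):
    """Predict each value from the two previous ones and keep only the
    prediction error (residue). a1 and a2 are illustrative constants."""
    residues, prev1, prev2 = [], 0.0, 0.0
    for x in values:
        residues.append(x - (a1 * prev1 + a2 * prev2))
        prev2, prev1 = prev1, x
    return residues

def decode_residues(residues, a1=1.0, a2=-0.5):
    """The decoder rebuilds each coefficient from the residue plus the
    same prediction, so no extra information is required."""
    out, prev1, prev2 = [], 0.0, 0.0
    for r in residues:
        x = r + a1 * prev1 + a2 * prev2
        out.append(x)
        prev2, prev1 = prev1, x
    return out

coeffs = [1.00, 1.05, 1.10, 1.15]   # slowly varying (stationary) data
res = encode_residues(coeffs)
assert all(abs(a - b) < 1e-12 for a, b in zip(decode_residues(res), coeffs))
```

Because both sides compute the prediction from already-decoded values, the encoder sends only residues and the decoder needs no side information.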
AAC Allocation Loops
Two nested inner and outer loops iteratively perform nonuniform quantization
and analysis-by-synthesis. The simplified nested algorithms are shown in FIG. 16. The inner loop (within the outer loop) begins with an initial quantization
step size that is used to quantize the data and perform Huffman coding to determine
the number of bits needed for coding. If necessary, the quantizer step size
can be increased to reduce the number of bits needed. The outer loop uses scale
factors to amplify scale factor bands to reduce audibility of quantization
noise (inverse scale factors are applied in the decoder). Each scale factor
band is assigned one multiplying scale factor. The scale factor is a gain value
that changes the amplitude of the coefficients in the scale factor band; this
shapes the quantization noise according to the masking threshold. The outer
loop uses analysis-by-synthesis to determine the resulting distortion and this
is compared to the distortion allowed by the psychoacoustic model; the best
result so far is stored. If distortion is too high in a scale factor band,
the band is amplified (this increases the bit rate) and the outer loop repeats.
The two loops work in conjunction to optimally distribute quantization noise
across the spectrum.
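A toy sketch of the inner rate loop; bits_needed is an illustrative stand-in for real nonuniform quantization plus Huffman coding, and the 2**(1/4) ratio mirrors the 1.5-dB scale-factor increment:

```python
def bits_needed(coeffs, step):
    """Toy stand-in for nonuniform quantization plus Huffman coding:
    the bit cost grows with the quantized magnitudes."""
    return sum(max(1, round(abs(c) / step)).bit_length() for c in coeffs)

def inner_loop(coeffs, bit_budget, step=1.0):
    """AAC-style inner loop (sketch): enlarge the quantizer step size
    until the coded spectrum fits the available bits."""
    while bits_needed(coeffs, step) > bit_budget:
        step *= 2 ** 0.25   # one 1.5-dB step-size increment
    return step

band = [10.0, 5.0, 3.0, 1.0]
step = inner_loop(band, bit_budget=6)
assert bits_needed(band, step) <= 6
```

The outer loop would wrap this, amplifying scale factor bands whose measured distortion exceeds the masking threshold and re-running the inner loop until the noise is acceptably distributed.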
FIG. 16 Two nested inner and outer allocation loops iteratively perform
nonuniform quantization and analysis-by-synthesis.
The width of the scale factor bands is limited to 32 coefficients, except
in the last scale factor band. There are 49 scale factor bands for long blocks.
Scale factor bands can be individually amplified in increments of 1.5 dB. Noise
shaping results because amplified coefficients have larger values and will
yield a higher SNR after quantization.
Because inverse amplification must be applied at the decoder, scale factors
are transmitted in the bitstream.
Designers should note that scale factors are defined with opposite polarity
in MPEG-2 AAC and MPEG-1/2 Layer III (larger scale factor values represent
larger signals in AAC, whereas the opposite holds in Layer III).
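For illustration, mapping a scale factor sf to a band gain of the form 2**(sf/4) (a common description of AAC's convention, with offset details omitted) gives steps of about 1.5 dB, with larger values meaning larger signals:

```python
def band_gain(scale_factor):
    """Each scale-factor step scales the band by 2**(1/4); in AAC a
    larger scale factor means a larger signal (opposite of Layer III)."""
    return 2.0 ** (scale_factor / 4.0)

# One step of 2**(1/4) is 20*log10(2**0.25) ~ 1.505 dB,
# matching the 1.5-dB amplification increment.
import math
step_db = 20.0 * math.log10(band_gain(1))
assert abs(step_db - 1.5) < 0.01
```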
Huffman coding is applied to the quantized spectrum, scale factors, and directional
information. Twelve Huffman codebooks are available to code pairs or quadruples
of quantized spectral values. Two codebooks are available for each maximum
value, each representing a different probability function. A bit reservoir
accommodates instantaneously variable bit rates, allowing bits to be distributed
across consecutive blocks for more effective coding within the average bit-rate
constraint. A frame output consists of spectral coefficients and control parameters.
The bitstream syntax defines a lower layer for raw audio data and a higher
layer for audio transport data. In the decoder, current spectral components
are reconstructed by adding a prediction error to the predicted value. As in
the encoder, the coefficients are calculated from preceding values; no additional
information is required.
AAC Temporal Noise Shaping
The spectral predictability of signals dictates the optimal coding strategy.
For example, consider a steady-state sine wave comprising a flat temporal envelope
and a single spectral line, that is, a spectral impulse that is maximally nonflat
spectrally. This sine wave is most easily coded directly in the frequency domain or by
using linear prediction in the time domain. Conversely, consider a transient
pulse signal comprising an impulse in the time domain, and a flat power spectrum.
This pulse would be difficult to code directly in the frequency domain and
difficult to code with prediction in the time domain. However, the pulse could
be optimally coded directly in the time domain, or by using linear prediction
in the frequency domain.
In the AAC codec, predictive coding is used to examine coefficients in each
block. Transient signals will yield a more uniform spectrum and allow transients
to be identified and more efficiently coded as residues. When coding transients,
by analyzing the spectral data from the MDCT, temporal noise shaping (TNS)
can be used to control the temporal shape of the quantization noise within
each window to achieve perceptual noise shaping. By using the duality between
the time and frequency domains, TNS provides improved predictive coding. When
a time-domain signal is coded with predictive coding, the power spectral density
of the quantization noise in the output signal will be shaped by the power
spectral density of the input signal.
Conversely, when a frequency-domain signal is coded with predictive coding,
the temporal shape of the quantization noise in the output signal will follow
the temporal shape of the input signal.
In particular, TNS shapes the temporal envelope of the quantization noise
to follow the transient's temporal envelope and thus conceals the noise under
the transient.
This can overcome problems such as pre-echo. As noted, this is accomplished
with linear predictive coding of the spectral signal; for example, using open-loop
differential pulse-code modulation (DPCM) encoding of spectral values. Corresponding
DPCM decoding is performed in the decoder to create the output signal. During
encoding, TNS replaces the target spectral coefficients with the forward-prediction
residual (prediction error). In the AAC main profile, up to 20 successive coefficients
in a block can be examined to predict the next coefficient and the prediction
value is subtracted from the target coefficient to yield a spectral residue,
which is quantized and encoded. A filter order up to 12 is allowed in the LC
and SSR profiles.
During decoding, the inverse predictive TNS filtering is performed to replace
the residual values with spectral coefficients.
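The forward and inverse TNS filtering can be sketched as open-loop DPCM across frequency; the first-order filter coefficient below is illustrative (a real encoder derives the filter from the spectrum itself, and the standard allows much higher orders):

```python
def tns_forward(spectrum, a):
    """Open-loop DPCM across frequency (sketch): predict each MDCT
    coefficient from preceding coefficients with FIR weights a, and
    keep the prediction residue. The weights are illustrative."""
    residues = []
    for i, x in enumerate(spectrum):
        pred = sum(a[j] * spectrum[i - 1 - j] for j in range(min(len(a), i)))
        residues.append(x - pred)
    return residues

def tns_inverse(residues, a):
    """Decoder-side inverse filtering: residue plus the same prediction
    restores the spectral coefficients."""
    out = []
    for i, r in enumerate(residues):
        pred = sum(a[j] * out[i - 1 - j] for j in range(min(len(a), i)))
        out.append(r + pred)
    return out

spec = [0.50, 0.45, 0.40, 0.38, 0.35]   # smooth spectrum of a transient
a = [0.9]                               # first-order example filter
rec = tns_inverse(tns_forward(spec, a), a)
assert all(abs(x - y) < 1e-12 for x, y in zip(rec, spec))
```

Note that the prediction runs over coefficient index (frequency), not time, which is precisely what shapes the quantization noise temporally.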
FIG. 17 An example showing how TNS shapes quantization noise to conceal
it under the transient envelope. A. The original speech signal. B. The quantization
coding noise shaped with TNS. C. The quantization coding noise without TNS;
masking is not utilized as well. (Herre and Johnston, 1997)
It should be emphasized that TNS prediction is done over frequency, and not
over time. Thus the prediction error is shaped in time as opposed to frequency.
Time resolution is increased as opposed to frequency resolution; temporal spread
of quantization noise is reduced in the output decoded signal. TNS thus allows
the encoder to control temporal pre-echo quantization noise within a filter-bank
window by shaping it according to the audio signal, so that the noise is masked
by the temporal audio signal, as shown in FIG. 17. TNS allows better coding
of both transient content and pitch-based signals such as speech.
The impulses which comprise speech are not always effectively coded with traditional
transform block switching and may demand instantaneous increases in bit rate.
TNS minimizes unmasked pre-echo in pitch-based signals and reduces the peak
bit demand. With TNS, the codec can also use the more efficient long-block
mode more often without introducing artifacts, and can also perform better
at low sampling frequencies. TNS effectively and dynamically adapts the codec
between high-time resolution for transient signals and high-frequency resolution
for stationary signals and is more efficient than other designs using switched
windows. As explained by Juergen Herre, the prediction filter can be determined
from the range of spectral coefficients corresponding to the target frequency
range (for example, 4 kHz to 20 kHz) and by using DPCM predictive coding methods
such as calculating the autocorrelation function of the coefficients and using
the Levinson-Durbin recursion algorithm. A single TNS prediction filter can
be applied to the entire spectrum or different TNS prediction filters can be
uniquely applied to different parts of a spectrum, and TNS can be omitted for
some frequency regions. Thus the temporal quantization noise control can be
applied in a frequency-dependent manner.
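The filter derivation can be sketched with a small pure-Python Levinson-Durbin recursion; the autocorrelation values below are an illustrative stand-in for the autocorrelation of actual spectral coefficients:

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for prediction coefficients
    a[0..order] (with a[0] = 1) from autocorrelation values r[0..order].
    Returns (a, residual_error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                  # reflection coefficient
        a = [a[j] + k * a[i - j] for j in range(i + 1)] + a[i + 1:]
        err *= 1.0 - k * k
    return a, err

# Autocorrelation of a first-order process: r[m] = 0.5**m.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
assert abs(a[1] + 0.5) < 1e-12   # predictor uses 0.5 * previous value
assert abs(a[2]) < 1e-12         # order 2 adds nothing for this signal
assert abs(err - 0.75) < 1e-12
```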
AAC Techniques and Performance
The input audio signal can be applied to a four-band polyphase quadrature
mirror filter (PQMF) bank to create four equal-width, critically sampled frequency
bands. This is used for the scalable sampling rate (SSR) profile. An MDCT is
used to produce 256 spectral coefficients from each of the four bands, for
a total of 1024 coefficients.
Positive or negative gain control can be applied independently to each of
the four bands. With SSR, lower sampling rate signals (with lower bit rates)
can be obtained at the decoder by ignoring the upper PQMF bands. For example,
bandwidths of 18, 12, and 6 kHz can be obtained by ignoring one, two, or three
bands. This allows scalability with low decoder complexity.
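The scalability arithmetic is easy to sketch (assuming a 48-kHz sampling rate, so the 24-kHz full band splits into four 6-kHz PQF bands):

```python
def ssr_output_bandwidth(fs_hz, bands_ignored):
    """SSR sketch: the 4-band PQF splits 0..fs/2 into equal-width bands;
    ignoring the top bands bandlimits the decoded output."""
    assert 0 <= bands_ignored <= 3
    band_width = (fs_hz / 2) / 4
    return (4 - bands_ignored) * band_width

# At 48 kHz: ignoring one, two, or three bands gives 18, 12, or 6 kHz.
assert ssr_output_bandwidth(48000, 1) == 18000
assert ssr_output_bandwidth(48000, 2) == 12000
assert ssr_output_bandwidth(48000, 3) == 6000
```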
Two stereo coding techniques are used in AAC:
intensity coding and M/S (middle/side) coding. Both methods can be combined
and applied to selective parts of the signal's spectrum. M/S coding is applied
between channel pairs that are symmetrically placed to the left and right of
the listener; this helps avoid spatial unmasking. M/S coding can be selectively
switched in time (block by block) and frequency (scale factor bands). M/S coding
can control the imaging of coding noise that is separate from the imaging of
the masking signal. High-frequency time domain imaging must be preserved in
transient signals.
Intensity stereo coding considers that perception of high frequency sounds
is based on their energy-time envelopes.
Thus, some signals can be conveyed with one set of spectral values, shared
among channels. Envelope information is maintained by reconstructing each channel
level. Intensity coding can be implemented between channel pairs, and among
coupling channel elements. In the latter, channel spectra are shared between
channel pairs.
Also, coupling channels permit downmixing in which additional audio elements
such as a voice-over can be added to a recording. Both of these techniques
can be used on both stereo and 5.1 multichannel content.
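A minimal sketch of M/S matrixing, using the common normalization M = (L + R)/2 and S = (L - R)/2 (the per-band, per-block switching and the intensity-coding path are omitted):

```python
def ms_encode(left, right):
    """Matrix a channel pair into middle and side signals; for nearly
    identical channels the side signal is almost zero and cheap to code."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse matrixing recovers the original left/right pair."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L, R = [1.0, 0.8, -0.3], [0.9, 0.8, -0.25]
L2, R2 = ms_decode(*ms_encode(L, R))
assert all(abs(a - b) < 1e-12 for a, b in zip(L2, L))
assert all(abs(a - b) < 1e-12 for a, b in zip(R2, R))
```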
In one listening test, multichannel MPEG-2 AAC at 320 kbps outperformed MPEG-2
Layer II BC at 640 kbps.
MPEG-2 Layer II at 640 kbps did not outperform MPEG-2 AAC at 256 kbps. For
five full-bandwidth channels, MPEG 2 AAC claims "indistinguishable quality" for
bit rates as low as 256 kbps to 320 kbps. Stereo MPEG-2 AAC at 128 kbps is
said to provide significantly better sound quality than MPEG-2 Layer II at
192 kbps or MPEG-2 Layer III at 128 kbps. MPEG-2 AAC at 96 kbps is comparable
to MPEG-2 Layer II at 192 kbps or MPEG-2 Layer III at 128 kbps. Spectral band
replication (SBR) can be applied to AAC codecs. This is sometimes known as
High-Efficiency AAC (HE AAC) or aacPlus. With SBR, a bit rate of 24 kbps per
channel, or 32 kbps to 40 kbps for stereo signals, can yield good results.
The MPEG-4 and MPEG-7 standards are discussed in Section 15.
ATRAC Codec
The proprietary ATRAC (Adaptive TRansform Acoustic Coding) algorithm was developed
to provide data reduction for the SDDS cinematic sound system and was subsequently
employed in other applications such as the MiniDisc format. ATRAC uses a modified
discrete cosine transform and psychoacoustic masking to achieve a 5:1 compression
ratio; for example, data on a MiniDisc is stored at 292 kbps. ATRAC transform
coding is based on nonuniform frequency and time splitting concepts, and assigns
bits according to rules fixed by a bit allocation algorithm. The algorithm
both observes the fixed threshold of hearing curve, and dynamically analyzes
the audio program to take advantage of psychoacoustic effects such as masking.
The original codec version is sometimes known as ATRAC1. ATRAC was developed
by Sony Corporation.
An ATRAC encoder accepts a digital audio input and parses it into blocks.
The audio signal is divided into three subbands, which are then transformed
into the frequency domain using a variable block length. Transform coefficients
are grouped into 52 subbands (called block floating units or BFUs) modeled
on the ear's critical bands, with particular resolution given to lower frequencies.
Data in these bands is quantized according to dynamic sensitivity and masking
characteristics based on a psychoacoustic model. During decoding, the quantized
spectra are reconstructed according to the bit allocation method, and synthesized
into the output audio signal.
ATRAC differs from some other codecs in that psychoacoustic principles are
applied to both the bit allocation and the time-frequency splitting. In that
respect, both subband and transform coding techniques are used. In addition,
the transform block length adapts to the audio signal's characteristics so
that amplitude and time resolution can be varied between static and transient
musical passages. Through this processing, the data rate is reduced by 4/5.
The ATRAC encoding algorithm can be considered in three parts: time-frequency
analysis, bit allocation, and quantization of spectral components. The analysis
portion of the algorithm decomposes the signal into spectral coefficients grouped
into BFUs that emulate critical bands. The bit allocation portion of the algorithm
divides available bits between the BFUs, allocating more bits to perceptually
sensitive units. The quantization portion of the algorithm quantizes each spectral
coefficient to the specified word length.
FIG. 18 The ATRAC encoder time-frequency analysis block contains QMF
filter banks and MDCT transforms to analyze the signal.
The time-frequency analysis, shown in FIG. 18, uses subband and transform
coding techniques. Two quadrature mirror filters (QMFs) divide the input signal
into three subbands: low (0 Hz to 5.5125 kHz), medium (5.5125 kHz to 11.025
kHz), and high (11.025 kHz to 22.05 kHz). The QMF banks ensure that time-domain
aliasing caused by the subband decomposition will be canceled during reconstruction.
Following splitting, the band contents are examined to determine the block
durations. Signals in each of these bands are then placed in the frequency
domain with the MDCT algorithm. The MDCT allows up to a 50% overlap between
adjacent time-domain windows; this maintains frequency resolution at critical
sampling. A total of 512 coefficients are output, with 128 spectra in the low
band, 128 spectra in the mid band, and 256 spectra in the high band.
Transform coders must balance frequency resolution with temporal resolution.
A long block size achieves high frequency resolution and quantization noise
is readily masked by simultaneous masking; this is appropriate for a steady-state
signal. However, transient signals require temporal resolution, otherwise quantization
noise will be spread in time over the block of samples; a pre-echo can be audible
prior to the onset of the transient masker. Thus, instead of a fixed transform
block length, the ATRAC algorithm adaptively performs nonuniform time splitting
with blocks that vary according to the audio program content.
Two modes are used: long mode (11.6 ms in the high-, medium-, and low-frequency
bands) and short mode (1.45 ms in the high-frequency band, and 2.9 ms in the
mid- and low-frequency bands). The long block mode yields a narrow frequency
band, and the short block mode yields wider frequency bands, trading time and
frequency resolution as required by the audio signal. Specifically, transient
attacks prompt a decrease in block duration (to 1.45 ms or 2.9 ms), and a more
slowly changing program promotes an increase in block duration (to 11.6 ms).
Block duration is interactive with frequency bandwidth; longer block durations
permit selection of narrower frequency bands and greater resolution. This time
splitting is based on the effect of temporal pre-masking (backward masking)
in which tones sounding close in time exhibit masking properties.
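The mode durations can be cross-checked against the subband sample counts (a sketch: the 44.1-kHz rate is the MiniDisc case, and the per-band sample counts are inferred from the coefficient counts quoted earlier, assuming critical sampling at fs/4 in the low and mid bands and fs/2 in the high band):

```python
fs = 44100.0   # MiniDisc sampling rate

def block_ms(samples, rate):
    """Block duration in milliseconds for a band sampled at `rate`."""
    return 1000.0 * samples / rate

assert abs(block_ms(128, fs / 4) - 11.6) < 0.05   # long mode, low/mid bands
assert abs(block_ms(256, fs / 2) - 11.6) < 0.05   # long mode, high band
assert abs(block_ms(32, fs / 4) - 2.9) < 0.01     # short mode, low/mid bands
assert abs(block_ms(32, fs / 2) - 1.45) < 0.01    # short mode, high band
```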
Normally, the long mode provides good frequency resolution. However, with
transients, quantization noise is spread over the entire signal block and the
initial quantization noise is not masked. Thus, when a transient is detected,
the algorithm switches to the short mode.
Because the noise is limited to a short duration before the onset of the transient,
it is masked by pre-masking.
Because of its greater extent, post-masking (forward masking) can be relied
on to mask any signal decay in the long mode. The block size mode can be selected
independently for each band. For example, a long block mode might be selected
in the low-frequency band, and short modes in the mid- and high-frequency bands.
The MDCT frequency domain coefficients are then grouped into 52 BFUs; each
contains a fixed number of coefficients. As noted, in the long mode, each unit
conveys 11.6 ms of a narrow frequency band, and in the short mode each block
conveys 1.45 ms or 2.9 ms of a wider frequency band. Fifty-two nonuniform BFUs
are present across the frequency range; there are more BFUs at low frequencies,
and fewer at high frequencies. This nonlinear division is based on the concept
of critical bands. In the ATRAC model, for example, the band centered at 150
Hz is 100 Hz wide, the band at 1 kHz is 160 Hz wide, and the band at 10.5 kHz
is 2500 Hz wide. These widths reflect the ear's decreasing sensitivity to high
frequencies.
Each of the 512 spectral coefficients is quantized according to scale factor
and word length. The scale factor defines the full-scale range of the quantization.
It is selected from a list of possibilities and describes the magnitude of
the spectral coefficients in each of the 52 BFUs. The word length defines the
precision within each scale; it is calculated by the bit allocation algorithm
as described below. All the coefficients in a given BFU are given the same
scale factor and quantization word length because of the psychoacoustic similarity
within each group. Thus the following information is coded for each frame of
512 values: MDCT block size mode (long or short), word length for each BFU,
scale factor for each BFU, and quantized spectral coefficients.
The bit allocation algorithm considers the minimum threshold curve and simultaneous
masking conditions applicable to the BFUs, operating to yield a reduced data
rate. Available bits must be divided optimally between the block floating units.
BFUs coded with many bits will have low quantization noise, but BFUs with few
bits will have greater noise. ATRAC does not specify an arbitrary bit allocation
algorithm; this allows improvement in future encoder versions. The decoder
is completely independent of any allocation algorithm, also allowing future
improvement. To some extent, because the time-frequency splitting relies on
critical band and pre-masking considerations, the choice of the bit allocation
algorithm is less critical. However, any algorithm must minimize perceptual
error.
FIG. 19 An example of a bit-allocation algorithm showing the bit assignment,
using both fixed and variable bits. Fixed bits are weighted toward low-frequency
BFU regions. Variable bits are assigned according to the logarithm of the spectral
coefficients in each BFU. (Tsutsui et al., 1996)
One example of a bit allocation model declares both fixed and variable bits,
as shown in FIG. 19. Fixed bits are allocated mainly to low-frequency BFU
regions, emphasizing their perceptual importance. Variable bits are assigned
according to the logarithm of the spectral coefficients in each BFU. The total
bit allocation btotal(k) for each BFU is the weighted sum of the fixed bits
bfixed(k) and the variable bits bvariable(k). Thus, for each BFU k:

btotal(k) = T · bvariable(k) + (1 - T) · bfixed(k)
The weight T describes the tonality of the signal, taking a value close to
0 for nontonal signals, and a value close to 1 for tonal signals. Thus the
proportion of fixed bits to variable bits is itself variable. For example,
for noise-like signals the allocation emphasizes fixed bits, thus decreasing
the number of bits devoted to insensitive high frequencies. For pure tones,
the allocation emphasizes variable bits, concentrating available bits to a
few sensitive BFUs with tonal components.
However, the allocation method must observe the overall bit rate. The previous
equation does not account for this and will generally allocate more bits than
available. To maintain a fixed and limited bit rate, an offset boffset is devised
and set equal for all BFUs. The offset is subtracted from btotal(k) for each
BFU, yielding the final bit allocation bfinal(k):

bfinal(k) = btotal(k) - boffset
If the final value describes a negative word length, that BFU is given zero
bits. Because low frequencies are given a greater number of fixed bits, they
generally need fewer variable bits to exceed the offset threshold, and so remain
coded (see FIG. 19). To meet the required output bit rate, the global bit
allocation can be raised or lowered by correspondingly raising or lowering
the threshold of masking. As noted, ATRAC does not specify this, or any other
arbitrary allocation algorithm.
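The example allocation can be sketched numerically; the fixed and variable bit profiles, tonality weight, and offset below are illustrative values, not data from the ATRAC specification:

```python
def atrac_allocation(b_fixed, b_variable, tonality, b_offset):
    """Example ATRAC-style allocation (after the Tsutsui et al. model):
    each BFU's total is a tonality-weighted sum of fixed and variable
    bits; a common offset enforces the overall rate, and negative word
    lengths become zero bits."""
    final = []
    for b_fix, b_var in zip(b_fixed, b_variable):
        b_total = tonality * b_var + (1.0 - tonality) * b_fix
        final.append(max(0, round(b_total - b_offset)))
    return final

# A tonal signal (T near 1) concentrates the available bits in the few
# BFUs holding the tonal components.
alloc = atrac_allocation([4, 4, 2, 1], [8, 1, 1, 0], tonality=0.9, b_offset=2)
assert alloc == [6, 0, 0, 0]
```

With tonality near 0 the fixed profile dominates instead, spreading bits toward the perceptually weighted low-frequency BFUs.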
FIG. 20 The ATRAC decoder time-frequency synthesis block contains QMF
banks and MDCT transforms to synthesize and reconstruct the signal.
The ATRAC decoder essentially reverses the encoding process, performing spectral
reconstruction and time-frequency synthesis. Time-frequency synthesis is shown
in FIG. 20. The decoder first accepts the quantized spectral coefficients,
and uses the word length and scale factor parameters to reconstruct the MDCT
spectral coefficients. To reconstruct the audio signal, these coefficients
are first transformed back into the time domain by the inverse MDCT (IMDCT),
using either long or short mode blocks as specified by the received parameters.
The three time-domain subband signals are synthesized into the output signal
using QMF synthesis banks, obtaining a full spectrum, 16-bit digital audio
signal. Wideband quantization noise introduced during encoding (to achieve
data reduction) is limited to critical bands, where it is masked by signal
energy in each band.
Other versions of ATRAC were developed. ATRAC3 achieves twice the compression
of ATRAC1 while providing similar sound quality operating at bit rates such
as 128 kbps. The broadband audio signal is split into four subbands using a
QMF bank; the bands are 0 Hz to 2.75625 kHz, 2.75625 kHz to 5.5125 kHz, 5.5125
kHz to 11.025 kHz, and 11.025 kHz to 22.05 kHz. Gain control is applied to
each band to minimize pre-echo. When a transient occurs, the amplitude of the
section preceding the attack is increased. Gain is correspondingly decreased
during decoding, effectively attenuating pre-echo. The subbands are applied
to fixed-length MDCT with 256 components. Tonal components are subtracted from
the signal and analyzed and quantized separately. Entropy coding is applied.
In addition, joint stereo coding can be used adaptively for each band.
The ATRAC3plus codec is designed to operate at generally lower bit rates;
rates of 48, 64, 132, and 256 kbps are often used. The broadband audio signal
is processed in 16 subbands; a window of up to 4096 samples (92 ms) can be
used and bits can be allocated unequally over two channels.
The ATRAC Advanced Lossless (AAL) codec provides scalable lossless compression.
It codes ATRAC3 or ATRAC3plus data as well as residual information that is
otherwise lost. The ATRAC3 or ATRAC3plus data can be decoded alone for lossy
reproduction or the residual can be added for lossless reproduction.
Perceptual Audio Coding (PAC) Codec
The Perceptual Audio Coding (PAC) codec was designed to provide audio coding
with bit rates ranging from 6 kbps for a monophonic channel to 1024 kbps for
a 5.1-channel format. It was particularly aimed at digital audio broadcast
and Internet download applications, at a rate of 128 kbps for two-channel near-CD
quality coding; however, 96 kbps may be used for FM quality. PAC employs coding
methods that remove signal perceptual irrelevancy, as well as source coding
to remove signal redundancy, to achieve a reduction ratio of about 11:1 while
maintaining transparency. PAC is a third-generation codec with PXFM and ASPEC
as its antecedents, the latter also providing the ancestral basis for MPEG-1
Layer III. PAC was developed by AT&T and Bell Laboratories of Lucent Technologies.
The architecture of a PAC encoder is similar to that of other perceptual codecs.
Throughout the algorithm, data is placed in blocks of 1024 samples per channel.
An MDCT filter bank converts time-domain audio signals to the frequency domain;
a hybrid filter is not used. The MDCT uses an adaptive window size to control
quantization noise spreading, where the spreading is greater in the time domain
with a longer 2048-point window and greater in the frequency domain with a
series of shorter 256-point windows. Specifically, a frequency resolution of
1024 uniformly spaced frequency bands (a window of 2048 points) is usually
employed. When signal transient characteristics suggest that pre-echo artifacts
may occur, the filter bank adaptively switches to a transform with 128 bands.
In either case, the perceptual model calculates a frequency-domain masking
threshold to determine the maximum quantization noise that can be added to
each frequency band without an audible penalty. The perceptual model used in
PAC to code monophonic signals is similar to the MPEG-1 psychoacoustic model
2.
The audio signal, represented as spectral coefficients, is requantized to
one of 128 exponentially distributed quantization step sizes according to noise
allocation determinations. The codec uses a variety of frequency band groupings.
A fixed "threshold calculation partition" is a set of one-to-many
adjacent filter bank outputs arranged to create a partition width that is about
1/3 of a critical band.
Fixed "coder bands" consist of a multiple of four adjacent filter
bank outputs, ranging from 4 to 32 outputs, yielding a bandwidth as close to
1/3 critical band as possible. There are 49 coder bands for the 1024-point
mode and 14 coder bands for the 128-point filter mode. An iterative rate control
loop is used to determine quantization relative to masking thresholds. Time
buffering may be used to smooth the resulting bit rate. Coder bands are assigned
one scale factor. "Sections" are data dependent groupings of adjacent
coder bands using the same Huffman codeword.
Coefficients in each coder band are encoded using one of 16 Huffman codebooks.
At the codec output, a formatter generates a packetized bitstream. One 1024-sample
block (or eight 128-sample blocks) from each channel is placed in one packet,
regardless of the number of channels. The size of a packet corresponding to
each 1024 input samples is thus variable.
Depending on the reliability of the transmission medium, additional header
information is added to the first frame, or to every frame. A header may contain
data such as synchronization, error correction, sample rate, number of channels,
and transmission bit rate.
For joint-stereo coding, the codec employs a binaural masking level difference
(BMLD) model using M (monaural, L+R), S (stereo, L-R), and independent L and R thresholds.
M-S versus L-R coding decisions are made independently for each band. The multi-channel
MPAC codec (for example, coding 5.1 channels) computes individual masking thresholds
for each channel, two pairs (front and surround) of M-S thresholds, as well
as a global threshold based on all channels. The global threshold takes advantage
of masking across all channels and is used when the bit pool is close to depletion.
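A per-band M-S versus L-R decision can be sketched with a simple bit-cost estimate driven by the four thresholds. The cost model below is a hypothetical stand-in for a perceptual-entropy estimate; PAC's actual BMLD-based rule is more elaborate.

```python
import math

def band_bits(samples, threshold):
    """Rough per-band bit demand: half the log2 signal-to-mask ratio,
    floored at zero (a stand-in for a perceptual-entropy estimate)."""
    energy = sum(x * x for x in samples)
    return max(0.0, 0.5 * math.log2(max(energy, 1e-12) / threshold))

def choose_ms(l, r, thr_l, thr_r, thr_m, thr_s):
    """Per-band M-S vs. L-R decision driven by the four thresholds
    (hypothetical cost model, for illustration only)."""
    m = [(a + b) / 2 for a, b in zip(l, r)]
    s = [(a - b) / 2 for a, b in zip(l, r)]
    lr_cost = band_bits(l, thr_l) + band_bits(r, thr_r)
    ms_cost = band_bits(m, thr_m) + band_bits(s, thr_s)
    return "MS" if ms_cost < lr_cost else "LR"

# Nearly identical channels: the side signal is tiny, so M-S coding wins.
l, r = [1.0, 0.9, 1.1, 1.0], [0.98, 0.92, 1.08, 1.01]
assert choose_ms(l, r, 1e-6, 1e-6, 1e-6, 1e-6) == "MS"
# Uncorrelated channels with stricter (lower) M/S thresholds: L-R wins.
assert choose_ms([1.0, 0.0], [0.0, 1.0], 1e-3, 1e-3, 1e-6, 1e-6) == "LR"
```

The second case reflects why separate M and S thresholds matter: binaural unmasking can make quantization noise in the M or S signals more audible, lowering those thresholds and tipping the decision back to L-R.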
PAC employs unequal error protection (UEP) to more carefully protect some
portions of the data. For example, corrupted control information could lead
to a catastrophic loss of synchronization. Moreover, some errors in audio data
are more disruptive than others. For example, distortion in midrange frequencies
is more apparent than a loss of stereo separation. Different versions of PAC
are available for DAB and Internet applications; they are optimized for different
transmission error conditions and error concealment. The error concealment
algorithm mitigates the effect of bit errors and corrupted or lost packets;
partial information is used along with heuristic interpolation. There is slight
audible degradation with 5% random packet losses and the algorithm is effective
with 10 to 15% packet losses.
As with most codecs, PAC has evolved. PAC version 1.A is optimized for unimpaired
channel transmission of voice and music with up to 8-kHz bandwidth; bit rates
range from 16 kbps to 32 kbps. PAC version 1.B uses a bandwidth of 6.5 kHz.
PAC version 2 is designed for impaired channel broadcast applications, with
bit rates of 16 kbps to 128 kbps for stereo signals. PAC version 3 is optimized
for 64 kbps with a bandwidth of about 13 kHz.
PAC version 4 is optimized for 5.1-channel sound. EPAC is an enhanced version
of PAC optimized for low bit rates.
Its filter switches between two different filter-bank designs depending on
signal conditions. At 128 kbps, EPAC offers CD-transparent stereo sound and
is compliant with RealNetworks' G2 streaming Internet player. In some applications,
monaural MPAC codecs are used to code multichannel audio using a perceptual
model with provisions for spatial coding conditions such as binaural unmasking
effects and binaural masking level differences.
Signal pairs are coded and masking thresholds are computed for each channel.
AC-3 (Dolby Digital) Codec
Many data reduction codecs are designed for a variety of applications. The
AC-3 (Dolby Digital) codec in particular is widely used to convey multichannel
audio in applications such as DTV, DBS, DVD-Video, and Blu-ray. The AC-3 codec
was preceded by the AC-1 and AC-2 codecs.
The AC-1 (Audio Coding-1) stereo codec uses adaptive delta modulation, as
described in section 4, combined with analog companding; it is not a perceptual
codec. An AC-1 codec can code a 20-kHz bandwidth stereo audio signal into a
512-kbps bitstream (approximately a 3:1 reduction).
AC-1 was used in satellite relays of television and FM programming, as well
as cable radio services.
The AC-2 codec is a family of four single-channel codecs used in two-channel
or multichannel applications. It was designed for point-to-point transmission
such as full-duplex ISDN applications. AC-2 is a perceptual codec using a
low-complexity time-domain aliasing cancellation (TDAC) transform. It divides
a wideband signal into multiple subbands using a 512-sample 50% overlapping
FFT algorithm performing alternating modified discrete cosine and sine transform
(MDCT/MDST) calculations; a 128-sample FFT can be used for low-delay coding.
A window function based on the Kaiser-Bessel kernel is used in the window design.
Coefficients are grouped into subbands containing from 1 to 15 coefficients
to model critical bandwidths. The bit allocation process is backward adaptive:
bit assignments are computed identically at both the encoder and decoder. The
decoder uses a perceptual model to extract bit allocation information
from the spectral envelope of the transmitted signal. This effectively reduces
the bit rate, at the expense of decoder complexity. Subbands have pre-allocated
bits, with the lower subbands receiving a greater share.
Additional bits are adaptively drawn from a pool and assigned according to
the logarithm of peak energy levels in subbands. Coefficients are quantized
according to bit allocation calculations, and blocks are formed. Algorithm
parameters vary according to sampling frequency. At sampling frequencies of
48, 44.1, and 32 kHz, the following apply: bytes/block: 168, 184, 190; total
bits: 1344, 1472, 1520; subbands: 40, 43, 42; adaptive bits: 225, 239, 183.
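The backward-adaptive scheme can be sketched as follows: both ends run the same routine over the transmitted spectral envelope, so no explicit allocation is sent. The weighting rule and the numbers below are illustrative assumptions, not AC-2's tables.

```python
def allocate(envelope_db, base_bits, pool_bits):
    """Derive a bit allocation from the spectral envelope alone. Because
    the decoder sees the same envelope, it computes the same answer --
    the essence of backward-adaptive allocation (values illustrative)."""
    weights = [max(0.0, db) for db in envelope_db]   # log peak levels
    total = sum(weights) or 1.0
    extra = [int(pool_bits * w / total) for w in weights]
    return [b + e for b, e in zip(base_bits, extra)]

env = [60.0, 48.0, 30.0, 12.0]   # subband peak levels in dB (hypothetical)
base = [8, 6, 4, 2]              # fixed pre-allocation, low bands favored
assert allocate(env, base, 100) == [48, 38, 24, 10]
# Encoder and decoder agree without any allocation side information:
assert allocate(env, base, 100) == allocate(env, base, 100)
```

This is the trade named in the text: the bitstream carries no allocation data, at the cost of the decoder having to run the allocation model itself.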
The AC-2 codec provides high audio quality with a data rate of 256 kbps per
channel. With 16-bit input, reduction ratios include 6.1:1, 5.6:1, and 5.4:1
for sample rates of 48, 44.1, and 32 kHz, respectively. AC-2 is also used at
128 kbps and 192 kbps per channel. AC-2 is a registered .wav type so that AC-2
files are interchangeable between computer platforms. The AC-2 .wav header
contains an auxiliary data field at the end of each block, selectable from
0 to 32 bits. For example, peak levels can be stored to facilitate viewing
and editing of .wav files. AC-2 codec applications include PC sound cards,
studio/transmitter links, and ISDN linking of recording studios for long distance
recording. The AC-2 bitstream is robust against errors. Depending on the implementation,
AC-2 delay varies between 7 ms and 60 ms. AC-2A is a multirate, adaptive block
codec, designed for higher reduction ratios; it uses a 512/128-point TDAC filter.
AC-2 was introduced in 1989.
AC-3 Overview
The AC-3 coding system (popularly known as Dolby Digital) is an outgrowth
of the AC-2 encoding format, as well as applications in commercial cinema.
AC-3 was first introduced in 1992. AC-3 is a perceptual codec designed to process
an ensemble of audio channels. It can code from 1 to 5 full-bandwidth channels
in the configurations 3/2, 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0, as well as an
optional low-frequency effects (LFE) channel.
AC-3 is often used to provide a 5.1 multichannel surround format with left,
center, right, left-surround, right-surround, and an LFE channel. The frequency
response of the main channels is 3 Hz to 20 kHz, and the frequency response
of the LFE channel is 3 Hz to 120 Hz. These six channels (requiring 6 × 48
kHz × 18 bits = 5.184 Mbps in uncompressed PCM representation) can be coded
at a nominal rate of 384 kbps, with a bandwidth reduction of about 13:1. However,
the AC-3 standard also supports bit rates ranging from 32 kbps to 640 kbps.
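The 13:1 figure follows directly from the PCM arithmetic:

```python
channels, fs_hz, bits = 6, 48_000, 18
pcm_bps = channels * fs_hz * bits     # uncompressed 5.1 payload
coded_bps = 384_000                   # nominal AC-3 5.1 rate
assert pcm_bps == 5_184_000           # 5.184 Mbps, as stated
assert round(pcm_bps / coded_bps, 1) == 13.5   # "about 13:1"
```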
The AC-3 codec is backward compatible with matrix surround sound formats, two-channel
stereo, and monaural reproduction; all of these can be decoded from the AC-3
data stream. AC-3 does not use 5.1 matrixing in its bitstream. This ensures
that quantization noise is not directed to an incorrect channel, where it could
be unmasked. AC-3 transmits a discrete multichannel coded bitstream, with digital
downmixing in the decoder to create the appropriate number (monaural, stereo,
matrix surround, or full multichannel) of reproduction channels.
AC-3 contains a dialogue normalization level control so that the reproduced
level of dialogue (or any audio content) is uniform for different programs
and channels. With dialogue normalization, a listener can select a playback
volume and the decoder will automatically replay content at that average relative
level regardless of how it was recorded. AC-3 also contains a dynamic range
control feature. Control data can be placed in the bitstream so that a program's
recorded dynamic range can be varied in the decoder over a ±24-dB range. Thus,
the decoder can alter the dynamic range of a program to suit the listener's
preference (for example, a reduced dynamic range "midnight mode").
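A decoder-side sketch of this control: the bitstream carries gain words, and a listener-selected scale decides how much of the producer's compression to apply. The `scale` parameter and the clamp are illustrative of the mechanism, not the standard's exact per-block gain-word syntax.

```python
def apply_drc(sample, gain_db, scale=1.0):
    """Apply a transmitted dynamic-range gain word. scale=1.0 reproduces
    the producer's full compression; scale=0.0 defeats it (illustrative
    sketch; the gain word is confined to the +/-24 dB range)."""
    gain_db = max(-24.0, min(24.0, gain_db)) * scale
    return sample * 10.0 ** (gain_db / 20.0)

# "Midnight mode": quiet material boosted, loud material attenuated.
assert apply_drc(0.1, +12.0) > 0.1
assert apply_drc(0.8, -6.0) < 0.8
assert apply_drc(0.8, -6.0, scale=0.0) == 0.8   # compression defeated
```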
AC-3 also provides a down-mixing feature; a multichannel recording can be reduced
to stereo or monaural. The mixing engineer can specify relative interchannel
levels. Additional services can be embedded in the bitstream including verbal
description for the visually impaired, dialogue with enhanced intelligibility
for the hearing impaired, commentary, and a second stereo program. All services
may be tagged to indicate language.
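The downmix can be sketched with the interchannel levels the mixing engineer specifies in the bitstream. The -3 dB (0.707) coefficients below are common choices used here as assumptions, and the LFE channel is often omitted from the downmix, as it is here.

```python
def downmix_to_stereo(L, C, R, Ls, Rs, lfe, clev=0.707, slev=0.707):
    """Fold discrete 5.1 to stereo using transmitted center/surround mix
    levels (coefficient values are illustrative; LFE omitted here)."""
    lo = L + clev * C + slev * Ls
    ro = R + clev * C + slev * Rs
    return lo, ro

lo, ro = downmix_to_stereo(1.0, 0.5, 0.0, 0.2, 0.0, 0.3)
assert abs(lo - (1.0 + 0.707 * 0.5 + 0.707 * 0.2)) < 1e-12
assert abs(ro - 0.707 * 0.5) < 1e-12
```

Because the decoder downmixes a discrete multichannel stream rather than decoding a matrixed signal, quantization noise stays in the channel where it is masked.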
AC-3 facilitates editing at the block level; blocks can be rocked back and
forth at the decoder and read as forward or reverse audio. Complete encoding/decoding
delay is typically 100 ms.
Because AC-3 eliminates redundancies between channels, greater coding efficiency
is achieved relative to AC-2; a stereo version of AC-3 provides high quality
with a data rate of 192 kbps. In one test, AC-3 at 192 kbps scored 4.5 on the
ITU-R impairment scale. Differences between the original and coded files were
perceptible to expert listeners, but not annoying. The AC-3 format also delivers
data describing a program's original production format (monaural, stereo, matrix,
and the like), can encode parameters for selectable dynamic range compression,
can route low bass only to those speakers with subwoofers, and can provide
gain control of a program.
AC-3 uses hybrid backward/forward adaptive bit allocation in which an adaptive
allocation routine operates in both the encoder and decoder. The model defines
the spectral envelope, which is encoded in the bitstream. The encoder contains
a core psychoacoustic model, but can employ a different model and compare results.
If desired, the encoder can use the data syntax to code parameter variations
in the core model, or convey explicit delta bit allocation information, to
improve results. Block diagrams of an AC-3 encoder and decoder are shown in
FIG. 21.
FIG. 21 The AC-3 (Dolby Digital) adaptive transform encoder and decoder.
This codec can provide 5.1-channel surround sound. A. AC-3 encoder. B. AC-3
decoder.
AC-3 achieves its data reduction by quantizing a frequency-domain representation
of the audio signal. The encoder first uses an analysis filter bank to transform
time domain PCM samples into frequency-domain coefficients.
Each coefficient is represented in binary exponential notation as a binary
exponent and mantissa. Sets of exponents are encoded into a coarse representation
of the signal spectrum and referred to as the spectral envelope.
This spectral envelope is used by the bit allocation routine to determine
the number of bits needed to code each mantissa. The spectral envelope and
quantized mantissas for six audio blocks (1536 audio samples) are formatted
into a frame for transmission.
The decoding process is the inverse of the encoding process. The decoder synchronizes
the received bitstream, checks for errors, and de-formats the data to recover
the encoded spectral envelope and quantized mantissas. The bit allocation routine
and the results are used to unpack and de-quantize the mantissas. The spectral
envelope is decoded to yield the exponents. Finally, the exponents and mantissas
are transformed back to the time domain to produce output PCM samples.
AC-3 Theory of Operation
Operation of the AC-3 encoder is complex, with much dynamic optimization performed.
In the encoder, blocks of 512 samples are collected and highpass filtered at
3 Hz to eliminate dc offset and analyzed with a bandpass filter to detect transients.
Blocks are windowed and processed with a signal-adaptive transform codec using
a critically sampled filter bank with time-domain aliasing cancellation (TDAC)
described by Princen and Bradley. An FFT is employed to implement an MDCT algorithm.
Frequency resolution is 93.75 Hz at 48 kHz; each transform block represents
10.66 ms of audio, but transforms are computed every 5.33 ms so the audio block
rate is 187.5 Hz. Because there is a 50% long-window overlap (an optimal window
function based on the Kaiser-Bessel kernel is used in the window design), each
PCM sample is represented in two sequential transform blocks; coefficients
are decimated by a factor of two to yield 256 coefficients per block. Aliasing
from sub-sampling is exactly canceled during reconstruction. The transformation
allows the redundancy introduced in the blocking process to be removed. The
input to the TDAC is 512 time-domain samples while the output is 256 frequency-domain
coefficients. There are 50 bands between 0 Hz and 24 kHz; the bandwidths vary
between 3/4 and 1/4 of critical bandwidth values.
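The TDAC property can be demonstrated with a toy MDCT: each 50%-overlapped block of 2M windowed samples yields only M coefficients, yet the aliasing introduced by that 2:1 decimation cancels exactly on overlap-add. A sine window is used here for brevity in place of AC-3's Kaiser-Bessel-derived design, and M is kept small; both are assumptions for the demo.

```python
import math

def mdct(x):
    """MDCT: 2M windowed samples -> M coefficients (Princen-Bradley form)."""
    M = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for n in range(2 * M)) for k in range(M)]

def imdct(X):
    """Inverse MDCT: M coefficients -> 2M aliased time samples."""
    M = len(X)
    return [(2.0 / M) * sum(X[k] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                            for k in range(M)) for n in range(2 * M)]

M = 8
# Sine window: satisfies w[n]**2 + w[n+M]**2 = 1, the Princen-Bradley
# condition required for exact alias cancellation on overlap-add.
w = [math.sin(math.pi / (2 * M) * (n + 0.5)) for n in range(2 * M)]
x = [math.sin(0.3 * n) for n in range(4 * M)]

# Analysis: window 50%-overlapped blocks, keep only M coefficients each.
coeffs = [mdct([x[h + n] * w[n] for n in range(2 * M)])
          for h in range(0, len(x) - M, M)]

# Synthesis: inverse transform, window again, overlap-add.
y = [0.0] * len(x)
for i, X in enumerate(coeffs):
    z = imdct(X)
    for n in range(2 * M):
        y[i * M + n] += z[n] * w[n]

# Interior samples (covered by two overlapping blocks) reconstruct exactly:
assert all(abs(x[n] - y[n]) < 1e-9 for n in range(M, len(x) - M))
```

The transform itself adds nothing and removes nothing; it is the critical sampling (M outputs per M new inputs) that undoes the 2:1 redundancy introduced by the 50% block overlap.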
Time-domain transients such as an impulsive sound might create audible quantization
artifacts. A transient detector in the encoder, using a high-frequency bandpass
filter, can trigger window switching to dynamically halve the transform length
from 512 to 256 samples for a finer time resolution. The 512-sample transform
is replaced by two 256-sample transforms, each producing 128 unique coefficients;
time resolution is doubled, to help ensure that quantization noise is concealed
by temporal masking.
Audio blocks are 5.33 ms, and transforms are computed every 2.67 ms at 48
kHz. Short blocks use an asymmetric window that uses only one-half of a long
window. This yields poor frequency selectivity and does not give a smooth crossfade
between blocks. However, because short blocks are only used for transient signals,
the signal's flat and wide spectrum does not require selectivity and the transient
itself will mask artifacts. This block switching also simplifies processing,
because groups of short blocks can be treated as groups of long blocks and
no special handling is needed.
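A crude version of the transient detector can be sketched as follows: highpass-filter the block and compare the high-band energy of its second half against the first. The first-difference filter and the threshold are assumptions; AC-3's actual detector examines a hierarchy of highpass-filtered segments.

```python
def needs_short_blocks(block, threshold=4.0):
    """Return True when a transient suggests switching from one 512-sample
    transform to two 256-sample transforms (illustrative detector only)."""
    half = len(block) // 2
    # First difference acts as a cheap highpass filter.
    hp = [block[i] - block[i - 1] for i in range(1, len(block))]
    e1 = sum(v * v for v in hp[:half]) + 1e-12   # first-half HF energy
    e2 = sum(v * v for v in hp[half:])           # second-half HF energy
    return e2 / e1 > threshold

quiet = [0.0] * 256
attack = [0.0] * 128 + [((-1) ** n) * 0.5 for n in range(128)]
assert not needs_short_blocks(quiet)
assert needs_short_blocks(attack)
```

When the detector fires, the halved transform length doubles time resolution so that the quantization noise is confined close enough to the attack to be temporally masked.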
Coefficients are grouped into subbands that emulate critical bands. Each frequency
coefficient is processed with floating-point representation with mantissa (0
to 16 bits) and exponent (5 bit) to maintain dynamic range. Coefficient precision
is typically 16 to 18 bits but may reach 24 bits.
The coded exponents act as scale factors for mantissas and represent the signal's
spectrum; their representation is referred to as the spectral envelope. This
spectral envelope coding permits variable resolution of time and frequency.
Unlike some codecs, to reduce the number of exponents conveyed, AC-3 does
not choose one exponent, based on the coefficient with the largest magnitude,
to represent each band. In AC-3, fine-grained exponents are used to represent
each coefficient, and efficiency is achieved by differential coding and sharing
of exponents across frequency and time. The spectral envelope is coded as the
difference between adjacent filter-bank bins; because the filter response falls
off at 12 dB/bin, maximum deltas of ±2 (each unit representing a 6-dB difference)
are needed. The first (dc) exponent is coded as an absolute value, and subsequent
exponents are coded as one of five changes (±2, ±1, 0) from the previous
lower-frequency exponent, allowing for differences of up to ±12 dB/bin in exponents.
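The differential scheme can be sketched as below. Clamping deltas to ±2 is the rule the text describes, and the round trip is exact whenever successive exponents differ by no more than 2; the grouping strategies that additionally share exponents across frequency and time are omitted from this sketch.

```python
def encode_exponents(exps):
    """First exponent absolute; each later one as a delta clamped to one
    of five values (-2, -1, 0, +1, +2), each unit worth 6 dB."""
    out, prev = [exps[0]], exps[0]
    for e in exps[1:]:
        d = max(-2, min(2, e - prev))
        out.append(d)
        prev += d              # track the decoder's view of the exponent
    return out

def decode_exponents(codes):
    """Rebuild exponents by accumulating deltas from the absolute start."""
    exps = [codes[0]]
    for d in codes[1:]:
        exps.append(exps[-1] + d)
    return exps

e = [10, 11, 11, 9, 8, 8]
assert encode_exponents(e) == [10, 1, 0, -2, -1, 0]
assert decode_exponents(encode_exponents(e)) == e
```

Because each delta takes one of only five values, runs of deltas pack into far fewer bits than the absolute exponents would require.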