MPEG-2 AAC
The MPEG-2 Advanced Audio Coding (AAC) format codes monaural, stereo, or multichannel
audio with up to 48 channels, including the 5.1-channel configuration, at a
variety of bit rates.
AAC is known for its relatively high fidelity at low bit rates; for example,
about 64 kbps per channel. It also provides high-quality 5.1-channel coding
at an overall rate of 320 kbps or 384 kbps. AAC uses a reference model (RM)
structure in which a set of tools (modules) has defined interfaces and can
be combined variously in three different profiles. Individual tools can be
upgraded and used to replace older tools in the reference software. In addition,
this modularity makes it easy to compare revisions against older versions.
AAC also comprises the kernel of audio tools used in the MPEG-4 standard for
coding high-quality audio, and it supports lossless coding. AAC is specified
in Part 7 of the MPEG-2 standard (ISO/IEC 13818-7), which was finalized in
April 1997.
MPEG-2 AAC coding is not backward compatible with MPEG-1 and was originally
designated as NBC (non backward compatible) coding. An AAC bitstream cannot
be decoded by an MPEG-1-only decoder. By lifting the constraint of compatibility,
better performance is achieved compared to MPEG-2 BC. MPEG-2 AAC supports the
standard sampling frequencies of 32, 44.1, and 48 kHz, as well as other rates
from 8 kHz to 96 kHz, yielding maximum per-channel bit rates of 48 kbps (at 8
kHz) and 576 kbps (at 96 kHz).
Its input channel configurations are: 1/0 (monaural), 2/0 (two-channel stereo),
different multichannel configurations up to 3/2 + 1, and provision for up to
48 channels. Matrixing is not used. Downmixing is supported. To improve error
performance, the system is designed to maintain bitstream synchronization in
the presence of bit errors, and error concealment is supported as well.
To allow flexibility in audio quality versus processing requirements, AAC
coding modules are used to create three profiles: main profile, scalable sampling
rate (SSR) profile, and low-complexity (LC) profile. The main profile employs
the most sophisticated encoder using all the coding modules except preprocessing
to yield the highest audio quality at any bit rate. A main profile decoder
can also decode the low-complexity bitstream. The SSR profile uses a gain control
tool to perform poly-phase quadrature filtering (PQF), gain detection, and
gain modification preprocessing; prediction is not used and temporal noise
shaping (TNS) order is limited. SSR divides the audio signal into four equal
frequency bands, each with an independent bitstream; decoders can choose
to decode one or more streams and thus vary the bandwidth of the output signal.
SSR provides partial compatibility with the low-complexity profile; the decoded
signal is bandlimited. The LC profile does not use preprocessing or prediction
tools and the TNS order is limited. LC operates with low memory and processing
requirements.
AAC Main Profile
A block diagram of a main profile AAC encoder and decoder is shown in FIG. 15. An MDCT with 50% overlap is used as the only input signal filter bank.
It uses lengths of 1024 for stationary signals or 128 for transient signals,
with a 2048-point window or a block of eight 256-point windows, respectively.
To preserve interchannel block synchronization (phase), short block lengths
are retained for eight-block durations. For multi-channel coding, different
filter bank resolutions can be used for different channels. At 48 kHz, the
long-window frequency resolution is 23 Hz and time resolution is 21 ms; the
short window yields 187 Hz and 2.6 ms. The MDCT employs time-domain aliasing
cancellation (TDAC). Two alternate window shapes are selectable on a frame
basis in the 2048-point mode; either sine or Kaiser-Bessel-derived (KBD) windows
can be employed. The encoder can select the optimal window shape on the basis
of signal characteristics. The sine window is used when perceptually important
components are spaced closer than 140 Hz and narrow-band selectivity is more
important than stop-band attenuation. The KBD window is used when components
are spaced more than 220 Hz apart and stopband attenuation is needed. Window
switching is seamless, even with the overlap-add sequence. The shape of the
left half of each window must match the shape of the right half of the preceding
window; a new window shape is thus introduced as a new right half.
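To make the overlap constraint concrete, the following sketch (illustrative, not from the standard) builds a 2048-point sine window and checks the Princen-Bradley condition that a run of identical windows must satisfy for time-domain aliasing cancellation:

```python
import math

def sine_window(N):
    """MDCT sine window of length N: w[n] = sin((pi / N) * (n + 0.5))."""
    return [math.sin(math.pi / N * (n + 0.5)) for n in range(N)]

# Princen-Bradley (TDAC) condition for perfect reconstruction with
# 50% overlap: w[n]^2 + w[n + N/2]^2 == 1 for every n.
w = sine_window(2048)
for n in range(1024):
    assert abs(w[n] ** 2 + w[n + 1024] ** 2 - 1.0) < 1e-9
```

The KBD window satisfies the same condition by construction, which is what allows sine and KBD halves to be mixed from frame to frame.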
FIG. 15 Block diagram of MPEG-2 AAC encoder and decoder. Heavy lines
denote data paths, light lines denote control signals.
The suggested psychoacoustic model is based on MPEG-1 psychoacoustic model 2 and examines
the perceptual entropy of the audio signal. It controls the quantizer step
size, increasing step size to decrease buffer levels during stationary signals,
and correspondingly decreasing step size to allow levels to rise during transient
signals.
A second-order backward-adaptive predictor is applied to remove redundancy
in stationary signals found in long windows; residues are calculated and used
to replace frequency coefficients. Reconstructed coefficients in successive
blocks are examined for frequencies below 16 kHz. Values from two previous
blocks are used to form one predicted value for each current coefficient. The
predicted value is subtracted from the actual target value to yield a prediction
error (residue) which is quantized. Coefficient residues are grouped into scale
factor bands that emulate critical bands. A prediction control algorithm determines
if prediction should be activated in individual scale factor bands or in the
frame at all, based on whether it improves coding gain.
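The residue mechanism can be sketched as follows. The real AAC predictor is second-order and backward-adaptive (its coefficients are continuously updated from previously decoded data); the fixed coefficients a1 and a2 below are purely illustrative:

```python
def encode_residues(values, a1=1.0, a2=-0.5):
    """Predict each value from the two previous ones and keep only the
    prediction error (residue). a1 and a2 are illustrative constants."""
    residues, prev1, prev2 = [], 0.0, 0.0
    for x in values:
        residues.append(x - (a1 * prev1 + a2 * prev2))
        prev2, prev1 = prev1, x
    return residues

def decode_residues(residues, a1=1.0, a2=-0.5):
    """The decoder rebuilds each coefficient from the residue plus the
    same prediction, so no extra information is required."""
    out, prev1, prev2 = [], 0.0, 0.0
    for r in residues:
        x = r + a1 * prev1 + a2 * prev2
        out.append(x)
        prev2, prev1 = prev1, x
    return out

coeffs = [1.00, 1.05, 1.10, 1.15]   # slowly varying (stationary) data
res = encode_residues(coeffs)
assert all(abs(a - b) < 1e-12 for a, b in zip(decode_residues(res), coeffs))
```

Because both sides compute the prediction from already-decoded values, the encoder sends only residues and the decoder needs no side information.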
AAC Allocation Loops
Two nested inner and outer loops iteratively perform nonuniform quantization
and analysis-by-synthesis. The simplified nested algorithms are shown in FIG. 16. The inner loop (within the outer loop) begins with an initial quantization
step size that is used to quantize the data and perform Huffman coding to determine
the number of bits needed for coding. If necessary, the quantizer step size
can be increased to reduce the number of bits needed. The outer loop uses scale
factors to amplify scale factor bands to reduce audibility of quantization
noise (inverse scale factors are applied in the decoder). Each scale factor
band is assigned one multiplying scale factor. The scale factor is a gain value
that changes the amplitude of the coefficients in the scale factor band; this
shapes the quantization noise according to the masking threshold. The outer
loop uses analysis-by-synthesis to determine the resulting distortion and this
is compared to the distortion allowed by the psychoacoustic model; the best
result so far is stored. If distortion is too high in a scale factor band,
the band is amplified (this increases the bit rate) and the outer loop repeats.
The two loops work in conjunction to optimally distribute quantization noise
across the spectrum.
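A toy sketch of the inner rate loop; bits_needed is an illustrative stand-in for real nonuniform quantization plus Huffman coding, and the 2**(1/4) ratio mirrors the 1.5-dB scale-factor increment:

```python
def bits_needed(coeffs, step):
    """Toy stand-in for nonuniform quantization plus Huffman coding:
    the bit cost grows with the quantized magnitudes."""
    return sum(max(1, round(abs(c) / step)).bit_length() for c in coeffs)

def inner_loop(coeffs, bit_budget, step=1.0):
    """AAC-style inner loop (sketch): enlarge the quantizer step size
    until the coded spectrum fits the available bits."""
    while bits_needed(coeffs, step) > bit_budget:
        step *= 2 ** 0.25   # one 1.5-dB step-size increment
    return step

band = [10.0, 5.0, 3.0, 1.0]
step = inner_loop(band, bit_budget=6)
assert bits_needed(band, step) <= 6
```

The outer loop would wrap this, amplifying scale factor bands whose measured distortion exceeds the masking threshold and re-running the inner loop until the noise is acceptably distributed.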
FIG. 16 Two nested inner and outer allocation loops iteratively perform
nonuniform quantization and analysis-by-synthesis.
The width of the scale factor bands is limited to 32 coefficients, except
in the last scale factor band. There are 49 scale factor bands for long blocks.
Scale factor bands can be individually amplified in increments of 1.5 dB. Noise
shaping results because amplified coefficients have larger values and will
yield a higher SNR after quantization.
Because inverse amplification must be applied at the decoder, scale factors
are transmitted in the bitstream.
Designers should note that scale factors are defined with opposite polarity
in MPEG-2 AAC and MPEG-1/2 Layer III (larger scale factor values represent
larger signals in AAC, whereas the opposite holds in Layer III).
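For illustration, mapping a scale factor sf to a band gain of the form 2**(sf/4) (a common description of AAC's convention, with offset details omitted) gives steps of about 1.5 dB, with larger values meaning larger signals:

```python
def band_gain(scale_factor):
    """Each scale-factor step scales the band by 2**(1/4); in AAC a
    larger scale factor means a larger signal (opposite of Layer III)."""
    return 2.0 ** (scale_factor / 4.0)

# One step of 2**(1/4) is 20*log10(2**0.25) ~ 1.505 dB,
# matching the 1.5-dB amplification increment.
import math
step_db = 20.0 * math.log10(band_gain(1))
assert abs(step_db - 1.5) < 0.01
```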
Huffman coding is applied to the quantized spectrum, scale factors, and directional
information. Twelve Huffman codebooks are available to code pairs or quadruples
of quantized spectral values. Two codebooks are available for each maximum
value, each representing a different probability function. A bit reservoir
accommodates instantaneously variable bit rates, allowing bits to be distributed
across consecutive blocks for more effective coding within the average bit-rate
constraint. A frame output consists of spectral coefficients and control parameters.
The bitstream syntax defines a lower layer for raw audio data and a higher
layer for audio transport data. In the decoder, current spectral components
are reconstructed by adding a prediction error to the predicted value. As in
the encoder, the coefficients are calculated from preceding values; no additional
information is required.
AAC Temporal Noise Shaping
The spectral predictability of signals dictates the optimal coding strategy.
For example, consider a steady-state sine wave comprising a flat temporal envelope
and a single spectral line, that is, a spectral impulse that is maximally nonflat
spectrally. This sine wave is most easily coded directly in the frequency domain or by
using linear prediction in the time domain. Conversely, consider a transient
pulse signal comprising an impulse in the time domain, and a flat power spectrum.
This pulse would be difficult to code directly in the frequency domain and
difficult to code with prediction in the time domain. However, the pulse could
be optimally coded directly in the time domain, or by using linear prediction
in the frequency domain.
In the AAC codec, predictive coding is used to examine coefficients in each
block. Transient signals will yield a more uniform spectrum and allow transients
to be identified and more efficiently coded as residues. When coding transients,
by analyzing the spectral data from the MDCT, temporal noise shaping (TNS)
can be used to control the temporal shape of the quantization noise within
each window to achieve perceptual noise shaping. By using the duality between
the time and frequency domains, TNS provides improved predictive coding. When
a time-domain signal is coded with predictive coding, the power spectral density
of the quantization noise in the output signal will be shaped by the power
spectral density of the input signal.
Conversely, when a frequency-domain signal is coded with predictive coding,
the temporal shape of the quantization noise in the output signal will follow
the temporal shape of the input signal.
In particular, TNS shapes the temporal envelope of the quantization noise
to follow the transient's temporal envelope and thus conceals the noise under
the transient.
This can overcome problems such as pre-echo. As noted, this is accomplished
with linear predictive coding of the spectral signal; for example, using open-loop
differential pulse-code modulation (DPCM) encoding of spectral values. Corresponding
DPCM decoding is performed in the decoder to create the output signal. During
encoding, TNS replaces the target spectral coefficients with the forward-prediction
residual (prediction error). In the AAC main profile, up to 20 successive coefficients
in a block can be examined to predict the next coefficient and the prediction
value is subtracted from the target coefficient to yield a spectral residue,
which is quantized and encoded. A filter order up to 12 is allowed in the LC
and SSR profiles.
During decoding, the inverse predictive TNS filtering is performed to replace
the residual values with spectral coefficients.
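The forward and inverse TNS filtering can be sketched as open-loop DPCM across frequency; the first-order filter coefficient below is illustrative (a real encoder derives the filter from the spectrum itself, and the standard allows much higher orders):

```python
def tns_forward(spectrum, a):
    """Open-loop DPCM across frequency (sketch): predict each MDCT
    coefficient from preceding coefficients with FIR weights a, and
    keep the prediction residue. The weights are illustrative."""
    residues = []
    for i, x in enumerate(spectrum):
        pred = sum(a[j] * spectrum[i - 1 - j] for j in range(min(len(a), i)))
        residues.append(x - pred)
    return residues

def tns_inverse(residues, a):
    """Decoder-side inverse filtering: residue plus the same prediction
    restores the spectral coefficients."""
    out = []
    for i, r in enumerate(residues):
        pred = sum(a[j] * out[i - 1 - j] for j in range(min(len(a), i)))
        out.append(r + pred)
    return out

spec = [0.50, 0.45, 0.40, 0.38, 0.35]   # smooth spectrum of a transient
a = [0.9]                               # first-order example filter
rec = tns_inverse(tns_forward(spec, a), a)
assert all(abs(x - y) < 1e-12 for x, y in zip(rec, spec))
```

Note that the prediction runs over coefficient index (frequency), not time, which is precisely what shapes the quantization noise temporally.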
FIG. 17 An example showing how TNS shapes quantization noise to conceal
it under the transient envelope. A. The original speech signal. B. The quantization
coding noise shaped with TNS. C. The quantization coding noise without TNS;
masking is not utilized as well. (Herre and Johnston, 1997)
It should be emphasized that TNS prediction is done over frequency, and not
over time. Thus the prediction error is shaped in time as opposed to frequency.
Time resolution is increased as opposed to frequency resolution; temporal spread
of quantization noise is reduced in the output decoded signal. TNS thus allows
the encoder to control temporal pre-echo quantization noise within a filter-bank
window by shaping it according to the audio signal, so that the noise is masked
by the temporal audio signal, as shown in FIG. 17. TNS allows better coding
of both transient content and pitch-based signals such as speech.
The impulses which comprise speech are not always effectively coded with traditional
transform block switching and may demand instantaneous increases in bit rate.
TNS minimizes unmasked pre-echo in pitch-based signals and reduces the peak
bit demand. With TNS, the codec can also use the more efficient long-block
mode more often without introducing artifacts, and can also perform better
at low sampling frequencies. TNS effectively and dynamically adapts the codec
between high-time resolution for transient signals and high-frequency resolution
for stationary signals and is more efficient than other designs using switched
windows. As explained by Juergen Herre, the prediction filter can be determined
from the range of spectral coefficients corresponding to the target frequency
range (for example, 4 kHz to 20 kHz) and by using DPCM predictive coding methods
such as calculating the autocorrelation function of the coefficients and using
the Levinson-Durbin recursion algorithm. A single TNS prediction filter can
be applied to the entire spectrum or different TNS prediction filters can be
uniquely applied to different parts of a spectrum, and TNS can be omitted for
some frequency regions. Thus the temporal quantization noise control can be
applied in a frequency-dependent manner.
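The filter derivation can be sketched with a small pure-Python Levinson-Durbin recursion; the autocorrelation values below are an illustrative stand-in for the autocorrelation of actual spectral coefficients:

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for prediction coefficients
    a[0..order] (with a[0] = 1) from autocorrelation values r[0..order].
    Returns (a, residual_error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                  # reflection coefficient
        a = [a[j] + k * a[i - j] for j in range(i + 1)] + a[i + 1:]
        err *= 1.0 - k * k
    return a, err

# Autocorrelation of a first-order process: r[m] = 0.5**m.
a, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
assert abs(a[1] + 0.5) < 1e-12   # predictor uses 0.5 * previous value
assert abs(a[2]) < 1e-12         # order 2 adds nothing for this signal
assert abs(err - 0.75) < 1e-12
```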
AAC Techniques and Performance
The input audio signal can be applied to a four-band polyphase quadrature
mirror filter (PQMF) bank to create four equal-width, critically sampled frequency
bands. This is used for the scalable sampling rate (SSR) profile. An MDCT is
used to produce 256 spectral coefficients from each of the four bands, for
a total of 1024 coefficients.
Positive or negative gain control can be applied independently to each of
the four bands. With SSR, lower sampling rate signals (with lower bit rates)
can be obtained at the decoder by ignoring the upper PQMF bands. For example,
bandwidths of 18, 12, and 6 kHz can be obtained by ignoring one, two, or three
bands. This allows scalability with low decoder complexity.
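The scalability arithmetic is easy to sketch (assuming a 48-kHz sampling rate, so the 24-kHz full band splits into four 6-kHz PQF bands):

```python
def ssr_output_bandwidth(fs_hz, bands_ignored):
    """SSR sketch: the 4-band PQF splits 0..fs/2 into equal-width bands;
    ignoring the top bands bandlimits the decoded output."""
    assert 0 <= bands_ignored <= 3
    band_width = (fs_hz / 2) / 4
    return (4 - bands_ignored) * band_width

# At 48 kHz: ignoring one, two, or three bands gives 18, 12, or 6 kHz.
assert ssr_output_bandwidth(48000, 1) == 18000
assert ssr_output_bandwidth(48000, 2) == 12000
assert ssr_output_bandwidth(48000, 3) == 6000
```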
Two stereo coding techniques are used in AAC:
intensity coding and M/S (middle/side) coding. Both methods can be combined
and applied to selective parts of the signal's spectrum. M/S coding is applied
between channel pairs that are symmetrically placed to the left and right of
the listener; this helps avoid spatial unmasking. M/S coding can be selectively
switched in time (block by block) and frequency (scale factor bands). M/S coding
can control the imaging of coding noise that is separate from the imaging of
the masking signal. High-frequency time domain imaging must be preserved in
transient signals.
Intensity stereo coding considers that perception of high frequency sounds
is based on their energy-time envelopes.
Thus, some signals can be conveyed with one set of spectral values, shared
among channels. Envelope information is maintained by reconstructing each channel
level. Intensity coding can be implemented between channel pairs, and among
coupling channel elements. In the latter, channel spectra are shared between
channel pairs.
Also, coupling channels permit downmixing in which additional audio elements
such as a voice-over can be added to a recording. Both of these techniques
can be used on both stereo and 5.1 multichannel content.
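A minimal sketch of M/S matrixing, using the common normalization M = (L + R)/2 and S = (L - R)/2 (the per-band, per-block switching and the intensity-coding path are omitted):

```python
def ms_encode(left, right):
    """Matrix a channel pair into middle and side signals; for nearly
    identical channels the side signal is almost zero and cheap to code."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse matrixing recovers the original left/right pair."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L, R = [1.0, 0.8, -0.3], [0.9, 0.8, -0.25]
L2, R2 = ms_decode(*ms_encode(L, R))
assert all(abs(a - b) < 1e-12 for a, b in zip(L2, L))
assert all(abs(a - b) < 1e-12 for a, b in zip(R2, R))
```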
In one listening test, multichannel MPEG-2 AAC at 320 kbps outperformed MPEG-2
Layer II BC at 640 kbps.
MPEG-2 Layer II at 640 kbps did not outperform MPEG-2 AAC at 256 kbps. For
five full-bandwidth channels, MPEG 2 AAC claims "indistinguishable quality" for
bit rates as low as 256 kbps to 320 kbps. Stereo MPEG-2 AAC at 128 kbps is
said to provide significantly better sound quality than MPEG-2 Layer II at
192 kbps or MPEG-2 Layer III at 128 kbps. MPEG-2 AAC at 96 kbps is comparable
to MPEG-2 Layer II at 192 kbps or MPEG-2 Layer III at 128 kbps. Spectral band
replication (SBR) can be applied to AAC codecs. This is sometimes known as
High-Efficiency AAC (HE AAC) or aacPlus. With SBR, a bit rate of 24 kbps per
channel, or 32 kbps to 40 kbps for stereo signals, can yield good results.
The MPEG-4 and MPEG-7 standards are discussed in Section 15.
ATRAC Codec
The proprietary ATRAC (Adaptive TRansform Acoustic Coding) algorithm was developed
to provide data reduction for the SDDS cinematic sound system and was subsequently
employed in other applications such as the MiniDisc format. ATRAC uses a modified
discrete cosine transform and psychoacoustic masking to achieve a 5:1 compression
ratio; for example, data on a MiniDisc is stored at 292 kbps. ATRAC transform
coding is based on nonuniform frequency and time splitting concepts, and assigns
bits according to rules fixed by a bit allocation algorithm. The algorithm
both observes the fixed threshold of hearing curve, and dynamically analyzes
the audio program to take advantage of psychoacoustic effects such as masking.
The original codec version is sometimes known as ATRAC1. ATRAC was developed
by Sony Corporation.
An ATRAC encoder accepts a digital audio input and parses it into blocks.
The audio signal is divided into three subbands, which are then transformed
into the frequency domain using a variable block length. Transform coefficients
are grouped into 52 subbands (called block floating units or BFUs) modeled
on the ear's critical bands, with particular resolution given to lower frequencies.
Data in these bands is quantized according to dynamic sensitivity and masking
characteristics based on a psychoacoustic model. During decoding, the quantized
spectra are reconstructed according to the bit allocation method, and synthesized
into the output audio signal.
ATRAC differs from some other codecs in that psychoacoustic principles are
applied to both the bit allocation and the time-frequency splitting. In that
respect, both subband and transform coding techniques are used. In addition,
the transform block length adapts to the audio signal's characteristics so
that amplitude and time resolution can be varied between static and transient
musical passages. Through this processing, the data rate is reduced by 4/5.
The ATRAC encoding algorithm can be considered in three parts: time-frequency
analysis, bit allocation, and quantization of spectral components. The analysis
portion of the algorithm decomposes the signal into spectral coefficients grouped
into BFUs that emulate critical bands. The bit allocation portion of the algorithm
divides available bits between the BFUs, allocating more bits to perceptually
sensitive units. The quantization portion of the algorithm quantizes each spectral
coefficient to the specified word length.
FIG. 18 The ATRAC encoder time-frequency analysis block contains QMF
filter banks and MDCT transforms to analyze the signal.
The time-frequency analysis, shown in FIG. 18, uses subband and transform
coding techniques. Two quadrature mirror filters (QMFs) divide the input signal
into three subbands: low (0 Hz to 5.5125 kHz), medium (5.5125 kHz to 11.025
kHz), and high (11.025 kHz to 22.05 kHz). The QMF banks ensure that time-domain
aliasing caused by the subband decomposition will be canceled during reconstruction.
Following splitting, the band contents are examined to determine the block
durations. Signals in each of these bands are then placed in the frequency
domain with the MDCT algorithm. The MDCT allows up to a 50% overlap between
adjacent time-domain windows; this maintains frequency resolution at critical
sampling. A total of 512 coefficients are output, with 128 spectra in the low
band, 128 spectra in the mid band, and 256 spectra in the high band.
Transform coders must balance frequency resolution with temporal resolution.
A long block size achieves high frequency resolution and quantization noise
is readily masked by simultaneous masking; this is appropriate for a steady-state
signal. However, transient signals require temporal resolution, otherwise quantization
noise will be spread in time over the block of samples; a pre-echo can be audible
prior to the onset of the transient masker. Thus, instead of a fixed transform
block length, the ATRAC algorithm adaptively performs nonuniform time splitting
with blocks that vary according to the audio program content.
Two modes are used: long mode (11.6 ms in the high-, medium-, and low-frequency
bands) and short mode (1.45 ms in the high-frequency band, and 2.9 ms in the
mid- and low-frequency bands). The long block mode yields a narrow frequency
band, and the short block mode yields wider frequency bands, trading time and
frequency resolution as required by the audio signal. Specifically, transient
attacks prompt a decrease in block duration (to 1.45 ms or 2.9 ms), and a more
slowly changing program promotes an increase in block duration (to 11.6 ms).
Block duration is interactive with frequency bandwidth; longer block durations
permit selection of narrower frequency bands and greater resolution. This time
splitting is based on the effect of temporal pre-masking (backward masking)
in which tones sounding close in time exhibit masking properties.
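The mode durations can be cross-checked against the subband sample counts (a sketch: the 44.1-kHz rate is the MiniDisc case, and the per-band sample counts are inferred from the coefficient counts quoted earlier, assuming critical sampling at fs/4 in the low and mid bands and fs/2 in the high band):

```python
fs = 44100.0   # MiniDisc sampling rate

def block_ms(samples, rate):
    """Block duration in milliseconds for a band sampled at `rate`."""
    return 1000.0 * samples / rate

assert abs(block_ms(128, fs / 4) - 11.6) < 0.05   # long mode, low/mid bands
assert abs(block_ms(256, fs / 2) - 11.6) < 0.05   # long mode, high band
assert abs(block_ms(32, fs / 4) - 2.9) < 0.01     # short mode, low/mid bands
assert abs(block_ms(32, fs / 2) - 1.45) < 0.01    # short mode, high band
```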
Normally, the long mode provides good frequency resolution. However, with
transients, quantization noise is spread over the entire signal block and the
initial quantization noise is not masked. Thus, when a transient is detected,
the algorithm switches to the short mode.
Because the noise is limited to a short duration before the onset of the transient,
it is masked by pre-masking.
Because of its greater extent, post-masking (forward masking) can be relied
on to mask any signal decay in the long mode. The block size mode can be selected
independently for each band. For example, a long block mode might be selected
in the low-frequency band, and short modes in the mid- and high-frequency bands.
The MDCT frequency domain coefficients are then grouped into 52 BFUs; each
contains a fixed number of coefficients. As noted, in the long mode, each unit
conveys 11.6 ms of a narrow frequency band, and in the short mode each block
conveys 1.45 ms or 2.9 ms of a wider frequency band. Fifty-two nonuniform BFUs
are present across the frequency range; there are more BFUs at low frequencies,
and fewer at high frequencies. This nonlinear division is based on the concept
of critical bands. In the ATRAC model, for example, the band centered at 150
Hz is 100 Hz wide, the band at 1 kHz is 160 Hz wide, and the band at 10.5 kHz
is 2500 Hz wide. These widths reflect the ear's decreasing sensitivity to high
frequencies.
Each of the 512 spectral coefficients is quantized according to scale factor
and word length. The scale factor defines the full-scale range of the quantization.
It is selected from a list of possibilities and describes the magnitude of
the spectral coefficients in each of the 52 BFUs. The word length defines the
precision within each scale; it is calculated by the bit allocation algorithm
as described below. All the coefficients in a given BFU are given the same
scale factor and quantization word length because of the psychoacoustic similarity
within each group. Thus the following information is coded for each frame of
512 values: MDCT block size mode (long or short), word length for each BFU,
scale factor for each BFU, and quantized spectral coefficients.
The bit allocation algorithm considers the minimum threshold curve and simultaneous
masking conditions applicable to the BFUs, operating to yield a reduced data
rate. Available bits must be divided optimally between the block floating units.
BFUs coded with many bits will have low quantization noise, but BFUs with few
bits will have greater noise. ATRAC does not specify an arbitrary bit allocation
algorithm; this allows improvement in future encoder versions. The decoder
is completely independent of any allocation algorithm, also allowing future
improvement. To some extent, because the time-frequency splitting relies on
critical band and pre-masking considerations, the choice of the bit allocation
algorithm is less critical. However, any algorithm must minimize perceptual
error.
FIG. 19 An example of a bit-allocation algorithm showing the bit assignment,
using both fixed and variable bits. Fixed bits are weighted toward low-frequency
BFU regions. Variable bits are assigned according to the logarithm of the spectral
coefficients in each BFU. (Tsutsui et al., 1996)
One example of a bit allocation model declares both fixed and variable bits,
as shown in FIG. 19. Fixed bits are allocated mainly to low-frequency BFU
regions, emphasizing their perceptual importance. Variable bits are assigned
according to the logarithm of the spectral coefficients in each BFU. The total
bit allocation btotal(k) for each BFU is the weighted sum of the fixed bits
bfixed(k) and the variable bits bvariable(k). Thus, for each BFU k:

btotal(k) = T · bvariable(k) + (1 - T) · bfixed(k)
The weight T describes the tonality of the signal, taking a value close to
0 for nontonal signals, and a value close to 1 for tonal signals. Thus the
proportion of fixed bits to variable bits is itself variable. For example,
for noise-like signals the allocation emphasizes fixed bits, thus decreasing
the number of bits devoted to insensitive high frequencies. For pure tones,
the allocation emphasizes variable bits, concentrating available bits to a
few sensitive BFUs with tonal components.
However, the allocation method must observe the overall bit rate. The previous
equation does not account for this and will generally allocate more bits than
available. To maintain a fixed and limited bit rate, an offset boffset is devised
and set equal for all BFUs. The offset is subtracted from btotal(k) for each
BFU, yielding the final bit allocation bfinal(k):

bfinal(k) = btotal(k) - boffset
If the final value describes a negative word length, that BFU is given zero
bits. Because low frequencies are given a greater number of fixed bits, they
generally need fewer variable bits to exceed the offset threshold, and so remain
coded (see FIG. 19). To meet the required output bit rate, the global bit
allocation can be raised or lowered by correspondingly raising or lowering
the threshold of masking. As noted, ATRAC does not specify this, or any other
arbitrary allocation algorithm.
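The example allocation can be sketched numerically; the fixed and variable bit profiles, tonality weight, and offset below are illustrative values, not data from the ATRAC specification:

```python
def atrac_allocation(b_fixed, b_variable, tonality, b_offset):
    """Example ATRAC-style allocation (after the Tsutsui et al. model):
    each BFU's total is a tonality-weighted sum of fixed and variable
    bits; a common offset enforces the overall rate, and negative word
    lengths become zero bits."""
    final = []
    for b_fix, b_var in zip(b_fixed, b_variable):
        b_total = tonality * b_var + (1.0 - tonality) * b_fix
        final.append(max(0, round(b_total - b_offset)))
    return final

# A tonal signal (T near 1) concentrates the available bits in the few
# BFUs holding the tonal components.
alloc = atrac_allocation([4, 4, 2, 1], [8, 1, 1, 0], tonality=0.9, b_offset=2)
assert alloc == [6, 0, 0, 0]
```

With tonality near 0 the fixed profile dominates instead, spreading bits toward the perceptually weighted low-frequency BFUs.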
FIG. 20 The ATRAC decoder time-frequency synthesis block contains QMF
banks and MDCT transforms to synthesize and reconstruct the signal.
The ATRAC decoder essentially reverses the encoding process, performing spectral
reconstruction and time-frequency synthesis. Time-frequency synthesis is shown
in FIG. 20. The decoder first accepts the quantized spectral coefficients,
and uses the word length and scale factor parameters to reconstruct the MDCT
spectral coefficients. To reconstruct the audio signal, these coefficients
are first transformed back into the time domain by the inverse MDCT (IMDCT),
using either long or short mode blocks as specified by the received parameters.
The three time-domain subband signals are synthesized into the output signal
using QMF synthesis banks, obtaining a full spectrum, 16-bit digital audio
signal. Wideband quantization noise introduced during encoding (to achieve
data reduction) is limited to critical bands, where it is masked by signal
energy in each band.
Other versions of ATRAC were developed. ATRAC3 achieves twice the compression
of ATRAC1 while providing similar sound quality operating at bit rates such
as 128 kbps. The broadband audio signal is split into four subbands using a
QMF bank; the bands are 0 Hz to 2.75625 kHz, 2.75625 kHz to 5.5125 kHz, 5.5125
kHz to 11.025 kHz, and 11.025 kHz to 22.05 kHz. Gain control is applied to
each band to minimize pre-echo. When a transient occurs, the amplitude of the
section preceding the attack is increased. Gain is correspondingly decreased
during decoding, effectively attenuating pre-echo. The subbands are applied
to fixed-length MDCT with 256 components. Tonal components are subtracted from
the signal and analyzed and quantized separately. Entropy coding is applied.
In addition, joint stereo coding can be used adaptively for each band.
The ATRAC3plus codec is designed to operate at generally lower bit rates;
rates of 48, 64, 132, and 256 kbps are often used. The broadband audio signal
is processed in 16 subbands; a window of up to 4096 samples (92 ms) can be
used and bits can be allocated unequally over two channels.
The ATRAC Advanced Lossless (AAL) codec provides scalable lossless compression.
It codes ATRAC3 or ATRAC3plus data as well as residual information that is
otherwise lost. The ATRAC3 or ATRAC3plus data can be decoded alone for lossy
reproduction or the residual can be added for lossless reproduction.
Perceptual Audio Coding (PAC) Codec
The Perceptual Audio Coding (PAC) codec was designed to provide audio coding
with bit rates ranging from 6 kbps for a monophonic channel to 1024 kbps for
a 5.1-channel format. It was particularly aimed at digital audio broadcast
and Internet download applications, at a rate of 128 kbps for two-channel near-CD
quality coding; however, 96 kbps may be used for FM quality. PAC employs coding
methods that remove signal perceptual irrelevancy, as well as source coding
to remove signal redundancy, to achieve a reduction ratio of about 11:1 while
maintaining transparency. PAC is a third-generation codec with PXFM and ASPEC
as its antecedents, the latter also providing the ancestral basis for MPEG-1
Layer III. PAC was developed by AT&T and Bell Laboratories of Lucent Technologies.
The architecture of a PAC encoder is similar to that of other perceptual codecs.
Throughout the algorithm, data is placed in blocks of 1024 samples per channel.
An MDCT filter bank converts time-domain audio signals to the frequency domain;
a hybrid filter is not used. The MDCT uses an adaptive window size to control
quantization noise spreading, where the spreading is greater in the time domain
with a longer 2048-point window and greater in the frequency domain with a
series of shorter 256-point windows. Specifically, a frequency resolution of
1024 uniformly spaced frequency bands (a window of 2048 points) is usually
employed. When signal transient characteristics suggest that pre-echo artifacts
may occur, the filter bank adaptively switches to a transform with 128 bands.
In either case, the perceptual model calculates a frequency-domain masking
threshold to determine the maximum quantization noise that can be added to
each frequency band without an audible penalty. The perceptual model used in
PAC to code monophonic signals is similar to the MPEG-1 psychoacoustic model
2.
The audio signal, represented as spectral coefficients, is requantized to
one of 128 exponentially distributed quantization step sizes according to noise
allocation determinations. The codec uses a variety of frequency band groupings.
A fixed "threshold calculation partition" is a set of one-to-many
adjacent filter bank outputs arranged to create a partition width that is about
1/3 of a critical band.
Fixed "coder bands" consist of a multiple of four adjacent filter
bank outputs, ranging from 4 to 32 outputs, yielding a bandwidth as close to
1/3 critical band as possible. There are 49 coder bands for the 1024-point
mode and 14 coder bands for the 128-point filter mode. An iterative rate control
loop is used to determine quantization relative to masking thresholds. Time
buffering may be used to smooth the resulting bit rate. Coder bands are assigned
one scale factor. "Sections" are data dependent groupings of adjacent
coder bands using the same Huffman codeword.
Coefficients in each coder band are encoded using one of 16 Huffman codebooks.
At the codec output, a formatter generates a packetized bitstream. One 1024-sample
block (or eight 128-sample blocks) from each channel is placed in one packet,
regardless of the number of channels. The size of a packet corresponding to
each 1024 input samples is thus variable.
Depending on the reliability of the transmission medium, additional header
information is added to the first frame, or to every frame. A header may contain
data such as synchronization, error correction, sample rate, number of channels,
and transmission bit rate.
For joint-stereo coding, the codec employs a binaural masking level difference
(BMLD) model using M (monaural, L+R), S (stereo, L-R), and independent L and R thresholds.
M-S versus L-R coding decisions are made independently for each band. The multi-channel
MPAC codec (for example, coding 5.1 channels) computes individual masking thresholds
for each channel, two pairs (front and surround) of M-S thresholds, as well
as a global threshold based on all channels. The global threshold takes advantage
of masking across all channels and is used when the bit pool is close to depletion.
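A per-band M-S versus L-R decision can be sketched with a simple bit-cost estimate driven by the four thresholds. The cost model below is a hypothetical stand-in for a perceptual-entropy estimate; PAC's actual BMLD-based rule is more elaborate.

```python
import math

def band_bits(samples, threshold):
    """Rough per-band bit demand: half the log2 signal-to-mask ratio,
    floored at zero (a stand-in for a perceptual-entropy estimate)."""
    energy = sum(x * x for x in samples)
    return max(0.0, 0.5 * math.log2(max(energy, 1e-12) / threshold))

def choose_ms(l, r, thr_l, thr_r, thr_m, thr_s):
    """Per-band M-S vs. L-R decision driven by the four thresholds
    (hypothetical cost model, for illustration only)."""
    m = [(a + b) / 2 for a, b in zip(l, r)]
    s = [(a - b) / 2 for a, b in zip(l, r)]
    lr_cost = band_bits(l, thr_l) + band_bits(r, thr_r)
    ms_cost = band_bits(m, thr_m) + band_bits(s, thr_s)
    return "MS" if ms_cost < lr_cost else "LR"

# Nearly identical channels: the side signal is tiny, so M-S coding wins.
l, r = [1.0, 0.9, 1.1, 1.0], [0.98, 0.92, 1.08, 1.01]
assert choose_ms(l, r, 1e-6, 1e-6, 1e-6, 1e-6) == "MS"
# Uncorrelated channels with stricter (lower) M/S thresholds: L-R wins.
assert choose_ms([1.0, 0.0], [0.0, 1.0], 1e-3, 1e-3, 1e-6, 1e-6) == "LR"
```

The second case reflects why separate M and S thresholds matter: binaural unmasking can make quantization noise in the M or S signals more audible, lowering those thresholds and tipping the decision back to L-R.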
PAC employs unequal error protection (UEP) to more carefully protect some
portions of the data. For example, corrupted control information could lead
to a catastrophic loss of synchronization. Moreover, some errors in audio data
are more disruptive than others. For example, distortion in midrange frequencies
is more apparent than a loss of stereo separation. Different versions of PAC
are available for DAB and Internet applications; they are optimized for different
transmission error conditions and error concealment. The error concealment
algorithm mitigates the effect of bit errors and corrupted or lost packets;
partial information is used along with heuristic interpolation. There is slight
audible degradation with 5% random packet losses and the algorithm is effective
with 10 to 15% packet losses.
As with most codecs, PAC has evolved. PAC version 1.A is optimized for unimpaired
channel transmission of voice and music with up to 8-kHz bandwidth; bit rates
range from 16 kbps to 32 kbps. PAC version 1.B uses a bandwidth of 6.5 kHz.
PAC version 2 is designed for impaired channel broadcast applications, with
bit rates of 16 kbps to 128 kbps for stereo signals. PAC version 3 is optimized
for 64 kbps with a bandwidth of about 13 kHz.
PAC version 4 is optimized for 5.1-channel sound. EPAC is an enhanced version
of PAC optimized for low bit rates.
Its filter switches between two different filter-bank designs depending on
signal conditions. At 128 kbps, EPAC offers CD-transparent stereo sound and
is compliant with RealNetworks' G2 streaming Internet player. In some applications,
monaural MPAC codecs are used to code multichannel audio using a perceptual
model with provisions for spatial coding conditions such as binaural unmasking
effects and binaural masking level differences.
Signal pairs are coded and masking thresholds are computed for each channel.
AC-3 (Dolby Digital) Codec
Many data reduction codecs are designed for a variety of applications. The
AC-3 (Dolby Digital) codec in particular is widely used to convey multichannel
audio in applications such as DTV, DBS, DVD-Video, and Blu-ray. The AC-3 codec
was preceded by the AC-1 and AC-2 codecs.
The AC-1 (Audio Coding-1) stereo codec uses adaptive delta modulation, as
described in section 4, combined with analog companding; it is not a perceptual
codec. An AC-1 codec can code a 20-kHz bandwidth stereo audio signal into a
512-kbps bitstream (approximately a 3:1 reduction).
AC-1 was used in satellite relays of television and FM programming, as well
as cable radio services.
The AC-2 codec is a family of four single-channel codecs used in two-channel
or multichannel applications. It was designed for point-to-point transmission
such as full-duplex ISDN applications. AC-2 is a perceptual codec using a
low-complexity time-domain aliasing cancellation (TDAC) transform. It divides
a wideband signal into multiple subbands using a 512-sample 50% overlapping
FFT algorithm performing alternating modified discrete cosine and sine transform
(MDCT/MDST) calculations; a 128-sample FFT can be used for low-delay coding.
A window function based on the Kaiser-Bessel kernel is used in the window design.
Coefficients are grouped into subbands containing from 1 to 15 coefficients
to model critical bandwidths. The bit allocation process is backward adaptive:
bit assignments are computed identically at both the encoder and decoder. The
decoder uses a perceptual model to extract bit allocation information
from the spectral envelope of the transmitted signal. This effectively reduces
the bit rate, at the expense of decoder complexity. Subbands have pre-allocated
bits, with the lower subbands receiving a greater share.
Additional bits are adaptively drawn from a pool and assigned according to
the logarithm of peak energy levels in subbands. Coefficients are quantized
according to bit allocation calculations, and blocks are formed. Algorithm
parameters vary according to sampling frequency. At sampling frequencies of
48, 44.1, and 32 kHz, the following apply: bytes/block: 168, 184, 190; total
bits: 1344, 1472, 1520; subbands: 40, 43, 42; adaptive bits: 225, 239, 183.
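The backward-adaptive scheme can be sketched as follows: both ends run the same routine over the transmitted spectral envelope, so no explicit allocation is sent. The weighting rule and the numbers below are illustrative assumptions, not AC-2's tables.

```python
def allocate(envelope_db, base_bits, pool_bits):
    """Derive a bit allocation from the spectral envelope alone. Because
    the decoder sees the same envelope, it computes the same answer --
    the essence of backward-adaptive allocation (values illustrative)."""
    weights = [max(0.0, db) for db in envelope_db]   # log peak levels
    total = sum(weights) or 1.0
    extra = [int(pool_bits * w / total) for w in weights]
    return [b + e for b, e in zip(base_bits, extra)]

env = [60.0, 48.0, 30.0, 12.0]   # subband peak levels in dB (hypothetical)
base = [8, 6, 4, 2]              # fixed pre-allocation, low bands favored
assert allocate(env, base, 100) == [48, 38, 24, 10]
# Encoder and decoder agree without any allocation side information:
assert allocate(env, base, 100) == allocate(env, base, 100)
```

This is the trade named in the text: the bitstream carries no allocation data, at the cost of the decoder having to run the allocation model itself.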
The AC-2 codec provides high audio quality with a data rate of 256 kbps per
channel. With 16-bit input, reduction ratios include 6.1:1, 5.6:1, and 5.4:1
for sample rates of 48, 44.1, and 32 kHz, respectively. AC-2 is also used at
128 kbps and 192 kbps per channel. AC-2 is a registered .wav type so that AC-2
files are interchangeable between computer platforms. The AC-2 .wav header
contains an auxiliary data field at the end of each block, selectable from
0 to 32 bits. For example, peak levels can be stored to facilitate viewing
and editing of .wav files. AC-2 codec applications include PC sound cards,
studio/transmitter links, and ISDN linking of recording studios for long distance
recording. The AC-2 bitstream is robust against errors. Depending on the implementation,
AC-2 delay varies between 7 ms and 60 ms. AC-2A is a multirate, adaptive block
codec, designed for higher reduction ratios; it uses a 512/128-point TDAC filter.
AC-2 was introduced in 1989.
AC-3 Overview
The AC-3 coding system (popularly known as Dolby Digital) is an outgrowth
of the AC-2 encoding format, as well as applications in commercial cinema.
AC-3 was first introduced in 1992. AC-3 is a perceptual codec designed to process
an ensemble of audio channels. It can code from 1 to 5 full-bandwidth channels
in the configurations 3/2, 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0, as well as an
optional low-frequency effects (LFE) channel.
AC-3 is often used to provide a 5.1 multichannel surround format with left,
center, right, left-surround, right-surround, and an LFE channel. The frequency
response of the main channels is 3 Hz to 20 kHz, and the frequency response
of the LFE channel is 3 Hz to 120 Hz. These six channels (requiring 6 × 48
kHz × 18 bits = 5.184 Mbps in uncompressed PCM representation) can be coded
at a nominal rate of 384 kbps, with a bandwidth reduction of about 13:1. However,
the AC-3 standard also supports bit rates ranging from 32 kbps to 640 kbps.
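The 13:1 figure follows directly from the PCM arithmetic:

```python
channels, fs_hz, bits = 6, 48_000, 18
pcm_bps = channels * fs_hz * bits     # uncompressed 5.1 payload
coded_bps = 384_000                   # nominal AC-3 5.1 rate
assert pcm_bps == 5_184_000           # 5.184 Mbps, as stated
assert round(pcm_bps / coded_bps, 1) == 13.5   # "about 13:1"
```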
The AC-3 codec is backward compatible with matrix surround sound formats, two-channel
stereo, and monaural reproduction; all of these can be decoded from the AC-3
data stream. AC-3 does not use 5.1 matrixing in its bitstream. This ensures
that quantization noise is not directed to an incorrect channel, where it could
be unmasked. AC-3 transmits a discrete multichannel coded bitstream, with digital
downmixing in the decoder to create the appropriate number (monaural, stereo,
matrix surround, or full multichannel) of reproduction channels.
AC-3 contains a dialogue normalization level control so that the reproduced
level of dialogue (or any audio content) is uniform for different programs
and channels. With dialogue normalization, a listener can select a playback
volume and the decoder will automatically replay content at that average relative
level regardless of how it was recorded. AC-3 also contains a dynamic range
control feature. Control data can be placed in the bitstream so that a program's
recorded dynamic range can be varied in the decoder over a ±24-dB range. Thus,
the decoder can alter the dynamic range of a program to suit the listener's
preference (for example, a reduced dynamic range "midnight mode").
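A decoder-side sketch of this control: the bitstream carries gain words, and a listener-selected scale decides how much of the producer's compression to apply. The `scale` parameter and the clamp are illustrative of the mechanism, not the standard's exact per-block gain-word syntax.

```python
def apply_drc(sample, gain_db, scale=1.0):
    """Apply a transmitted dynamic-range gain word. scale=1.0 reproduces
    the producer's full compression; scale=0.0 defeats it (illustrative
    sketch; the gain word is confined to the +/-24 dB range)."""
    gain_db = max(-24.0, min(24.0, gain_db)) * scale
    return sample * 10.0 ** (gain_db / 20.0)

# "Midnight mode": quiet material boosted, loud material attenuated.
assert apply_drc(0.1, +12.0) > 0.1
assert apply_drc(0.8, -6.0) < 0.8
assert apply_drc(0.8, -6.0, scale=0.0) == 0.8   # compression defeated
```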
AC-3 also provides a down-mixing feature; a multichannel recording can be reduced
to stereo or monaural. The mixing engineer can specify relative interchannel
levels. Additional services can be embedded in the bitstream including verbal
description for the visually impaired, dialogue with enhanced intelligibility
for the hearing impaired, commentary, and a second stereo program. All services
may be tagged to indicate language.
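The downmix can be sketched with the interchannel levels the mixing engineer specifies in the bitstream. The -3 dB (0.707) coefficients below are common choices used here as assumptions, and the LFE channel is often omitted from the downmix, as it is here.

```python
def downmix_to_stereo(L, C, R, Ls, Rs, lfe, clev=0.707, slev=0.707):
    """Fold discrete 5.1 to stereo using transmitted center/surround mix
    levels (coefficient values are illustrative; LFE omitted here)."""
    lo = L + clev * C + slev * Ls
    ro = R + clev * C + slev * Rs
    return lo, ro

lo, ro = downmix_to_stereo(1.0, 0.5, 0.0, 0.2, 0.0, 0.3)
assert abs(lo - (1.0 + 0.707 * 0.5 + 0.707 * 0.2)) < 1e-12
assert abs(ro - 0.707 * 0.5) < 1e-12
```

Because the decoder downmixes a discrete multichannel stream rather than decoding a matrixed signal, quantization noise stays in the channel where it is masked.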
AC-3 facilitates editing at the block level; blocks can be rocked back and
forth at the decoder and read as forward or reverse audio. Complete encoding/decoding
delay is typically 100 ms.
Because AC-3 eliminates redundancies between channels, greater coding efficiency
is achieved relative to AC-2; a stereo version of AC-3 provides high quality
with a data rate of 192 kbps. In one test, AC-3 at 192 kbps scored 4.5 on the
ITU-R impairment scale. Differences between the original and coded files were
perceptible to expert listeners, but not annoying. The AC-3 format also delivers
data describing a program's original production format (monaural, stereo, matrix,
and the like), can encode parameters for selectable dynamic range compression,
can route low bass only to those speakers with subwoofers, and can provide
gain control of a program.
AC-3 uses hybrid backward/forward adaptive bit allocation in which an adaptive
allocation routine operates in both the encoder and decoder. The model defines
the spectral envelope, which is encoded in the bitstream. The encoder contains
a core psychoacoustic model, but can employ a different model and compare results.
If desired, the encoder can use the data syntax to code parameter variations
in the core model, or convey explicit delta bit allocation information, to
improve results. Block diagrams of an AC-3 encoder and decoder are shown in
FIG. 21.
FIG. 21 The AC-3 (Dolby Digital) adaptive transform encoder and decoder.
This codec can provide 5.1-channel surround sound. A. AC-3 encoder. B. AC-3
decoder.
AC-3 achieves its data reduction by quantizing a frequency-domain representation
of the audio signal. The encoder first uses an analysis filter bank to transform
time domain PCM samples into frequency-domain coefficients.
Each coefficient is represented in binary exponential notation as a binary
exponent and mantissa. Sets of exponents are encoded into a coarse representation
of the signal spectrum and referred to as the spectral envelope.
This spectral envelope is used by the bit allocation routine to determine
the number of bits needed to code each mantissa. The spectral envelope and
quantized mantissas for six audio blocks (1536 audio samples) are formatted
into a frame for transmission.
The decoding process is the inverse of the encoding process. The decoder synchronizes
the received bitstream, checks for errors, and de-formats the data to recover
the encoded spectral envelope and quantized mantissas. The bit allocation routine
and the results are used to unpack and de-quantize the mantissas. The spectral
envelope is decoded to yield the exponents. Finally, the exponents and mantissas
are transformed back to the time domain to produce output PCM samples.
AC-3 Theory of Operation
Operation of the AC-3 encoder is complex, with much dynamic optimization performed.
In the encoder, blocks of 512 samples are collected and highpass filtered at
3 Hz to eliminate dc offset and analyzed with a bandpass filter to detect transients.
Blocks are windowed and processed with a signal-adaptive transform codec using
a critically sampled filter bank with time-domain aliasing cancellation (TDAC)
described by Princen and Bradley. An FFT is employed to implement an MDCT algorithm.
Frequency resolution is 93.75 Hz at 48 kHz; each transform block represents
10.66 ms of audio, but transforms are computed every 5.33 ms so the audio block
rate is 187.5 Hz. Because there is a 50% long-window overlap (an optimal window
function based on the Kaiser-Bessel kernel is used in the window design), each
PCM sample is represented in two sequential transform blocks; coefficients
are decimated by a factor of two to yield 256 coefficients per block. Aliasing
from sub-sampling is exactly canceled during reconstruction. The transformation
allows the redundancy introduced in the blocking process to be removed. The
input to the TDAC is 512 time-domain samples while the output is 256 frequency-domain
coefficients. There are 50 bands between 0 Hz and 24 kHz; the bandwidths vary
between 3/4 and 1/4 of critical bandwidth values.
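The TDAC property can be demonstrated with a toy MDCT: each 50%-overlapped block of 2M windowed samples yields only M coefficients, yet the aliasing introduced by that 2:1 decimation cancels exactly on overlap-add. A sine window is used here for brevity in place of AC-3's Kaiser-Bessel-derived design, and M is kept small; both are assumptions for the demo.

```python
import math

def mdct(x):
    """MDCT: 2M windowed samples -> M coefficients (Princen-Bradley form)."""
    M = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for n in range(2 * M)) for k in range(M)]

def imdct(X):
    """Inverse MDCT: M coefficients -> 2M aliased time samples."""
    M = len(X)
    return [(2.0 / M) * sum(X[k] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                            for k in range(M)) for n in range(2 * M)]

M = 8
# Sine window: satisfies w[n]**2 + w[n+M]**2 = 1, the Princen-Bradley
# condition required for exact alias cancellation on overlap-add.
w = [math.sin(math.pi / (2 * M) * (n + 0.5)) for n in range(2 * M)]
x = [math.sin(0.3 * n) for n in range(4 * M)]

# Analysis: window 50%-overlapped blocks, keep only M coefficients each.
coeffs = [mdct([x[h + n] * w[n] for n in range(2 * M)])
          for h in range(0, len(x) - M, M)]

# Synthesis: inverse transform, window again, overlap-add.
y = [0.0] * len(x)
for i, X in enumerate(coeffs):
    z = imdct(X)
    for n in range(2 * M):
        y[i * M + n] += z[n] * w[n]

# Interior samples (covered by two overlapping blocks) reconstruct exactly:
assert all(abs(x[n] - y[n]) < 1e-9 for n in range(M, len(x) - M))
```

The transform itself adds nothing and removes nothing; it is the critical sampling (M outputs per M new inputs) that undoes the 2:1 redundancy introduced by the 50% block overlap.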
Time-domain transients such as an impulsive sound might create audible quantization
artifacts. A transient detector in the encoder, using a high-frequency bandpass
filter, can trigger window switching to dynamically halve the transform length
from 512 to 256 samples for a finer time resolution. The 512-sample transform
is replaced by two 256-sample transforms, each producing 128 unique coefficients;
time resolution is doubled, to help ensure that quantization noise is concealed
by temporal masking.
Audio blocks are 5.33 ms, and transforms are computed every 2.67 ms at 48
kHz. Short blocks use an asymmetric window that uses only one-half of a long
window. This yields poor frequency selectivity and does not give a smooth crossfade
between blocks. However, because short blocks are only used for transient signals,
the signal's flat and wide spectrum does not require selectivity and the transient
itself will mask artifacts. This block switching also simplifies processing,
because groups of short blocks can be treated as groups of long blocks and
no special handling is needed.
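A crude version of the transient detector can be sketched as follows: highpass-filter the block and compare the high-band energy of its second half against the first. The first-difference filter and the threshold are assumptions; AC-3's actual detector examines a hierarchy of highpass-filtered segments.

```python
def needs_short_blocks(block, threshold=4.0):
    """Return True when a transient suggests switching from one 512-sample
    transform to two 256-sample transforms (illustrative detector only)."""
    half = len(block) // 2
    # First difference acts as a cheap highpass filter.
    hp = [block[i] - block[i - 1] for i in range(1, len(block))]
    e1 = sum(v * v for v in hp[:half]) + 1e-12   # first-half HF energy
    e2 = sum(v * v for v in hp[half:])           # second-half HF energy
    return e2 / e1 > threshold

quiet = [0.0] * 256
attack = [0.0] * 128 + [((-1) ** n) * 0.5 for n in range(128)]
assert not needs_short_blocks(quiet)
assert needs_short_blocks(attack)
```

When the detector fires, the halved transform length doubles time resolution so that the quantization noise is confined close enough to the attack to be temporally masked.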
Coefficients are grouped into subbands that emulate critical bands. Each frequency
coefficient is processed with floating-point representation with mantissa (0
to 16 bits) and exponent (5 bit) to maintain dynamic range. Coefficient precision
is typically 16 to 18 bits but may reach 24 bits.
The coded exponents act as scale factors for mantissas and represent the signal's
spectrum; their representation is referred to as the spectral envelope. This
spectral envelope coding permits variable resolution of time and frequency.
Unlike some codecs, to reduce the number of exponents conveyed, AC-3 does
not choose one exponent, based on the coefficient with the largest magnitude,
to represent each band. In AC-3, fine-grained exponents are used to represent
each coefficient, and efficiency is achieved by differential coding and sharing
of exponents across frequency and time. The spectral envelope is coded as the
difference between adjacent filter-bank bins; because the filter response falls
off at 12 dB/bin, maximum deltas of ±2 (each unit representing a 6-dB difference)
are needed. The first (dc) exponent is coded as an absolute value, and subsequent
exponents are coded as one of five changes (±2, ±1, 0) from the previous
lower-frequency exponent, allowing for differences of up to ±12 dB/bin in exponents.
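The differential scheme can be sketched as below. Clamping deltas to ±2 is the rule the text describes, and the round trip is exact whenever successive exponents differ by no more than 2; the grouping strategies that additionally share exponents across frequency and time are omitted from this sketch.

```python
def encode_exponents(exps):
    """First exponent absolute; each later one as a delta clamped to one
    of five values (-2, -1, 0, +1, +2), each unit worth 6 dB."""
    out, prev = [exps[0]], exps[0]
    for e in exps[1:]:
        d = max(-2, min(2, e - prev))
        out.append(d)
        prev += d              # track the decoder's view of the exponent
    return out

def decode_exponents(codes):
    """Rebuild exponents by accumulating deltas from the absolute start."""
    exps = [codes[0]]
    for d in codes[1:]:
        exps.append(exps[-1] + d)
    return exps

e = [10, 11, 11, 9, 8, 8]
assert encode_exponents(e) == [10, 1, 0, -2, -1, 0]
assert decode_exponents(encode_exponents(e)) == e
```

Because each delta takes one of only five values, runs of deltas pack into far fewer bits than the absolute exponents would require.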