Psychoacoustic Models
Psychoacoustic models emulate the human hearing system and analyze spectral
data to determine how the audio signal can be coded to render quantization
noise as inaudible as possible. Most models calculate the masking thresholds
for critical bands to determine this just noticeable noise level. In other
words, the model determines how much coding noise is allowed in every critical
band, performing one such analysis on each frame of data. The difference between
the maximum signal level and the minimum masking threshold (the signal-to-mask
ratio) thus determines bit allocation for each band. An important element in
modeling masking curves is determining the relative tonality of signals, because
this affects the character of the masking curve they project. Any model must
be time-aligned so that its results coincide with the correct frame of audio
data. This accounts for the filter delay and the need to center the analysis
output in the current data block.
In most codecs, the goal of bit allocation is to minimize the total noise-to-mask
ratio over the entire frame. The number of bits allocated cannot exceed the
number of bits available for the frame at a given bit rate. The noise-to-mask
ratio for each sub-band is calculated as:
NMR = SMR - SNR dB
The SNR is the difference between the masker and the noise floor established
by a quantization level; the more bits used for quantization, the larger the
value of SNR. The SMR is the difference between the masker and the minimum
value of the masking threshold within a critical band. More specifically, the
pertinent masking threshold is the global masking threshold (also known as
the just noticeable distortion or JND) within a critical band. The SMR determines
the number of bits needed for quantization. If a signal is below the threshold,
then the signal is not coded. The NMR is the difference between the quantization
noise level and the level where noise reaches audibility. The relationship
is shown in FIG. 11.
FIG. 11 The NMR is the difference between the SMR and SNR, expressed in dB.
The masking threshold varies according to the tonality of the signal. (Noll,
1997)
Within a critical band, the larger the SNR is compared to the SMR, the less
audible the quantization noise. If the SNR is less than the SMR, then the
noise is audible. A codec thus strives to minimize the value of the NMR in
sub-bands by increasing the accuracy of the quantization. This figure gauges
the perceptual quality of the coding. For example, NMR values of less than
0 may indicate transparent coding, while values above 0 may indicate audible
degradation.
Referring again to FIG. 11, we can also note that the masking threshold is
shifted downward from the masking peak by some amount that depends most significantly
on whether the masker is tonal or nontonal. Generally, these expressions can
be applied:
ΔTMN = 14.5 + z dB
ΔNMT = S dB
where z is the frequency in Bark and S can be assumed to lie between 3 and
6 but can be frequency-dependent.
Alternatively, James Johnston has suggested these expressions for the tonal/nontonal
shift:
ΔTMN = 19.5 + z(18.0/26.0) dB
ΔNMT = 6.56 - z(3.06/26.0) dB
where z is the frequency in Bark.
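These offsets are simple to evaluate. The following minimal sketch (in Python,
with illustrative names; not from the text) computes Johnston's tonal and
nontonal shifts at a given critical-band rate:

    def masking_offsets_johnston(z_bark):
        # Tone-masking-noise and noise-masking-tone offsets in dB,
        # per the Johnston expressions given above.
        delta_tmn = 19.5 + z_bark * (18.0 / 26.0)
        delta_nmt = 6.56 - z_bark * (3.06 / 26.0)
        return delta_tmn, delta_nmt

    # For example, at z = 13 Bark: delta_tmn = 28.5 dB, delta_nmt = 5.03 dB.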
The codec must place noise below the JND, or more specifically, taking into
account the absolute threshold of hearing, the codec must place noise below
the higher of JND or the threshold of hearing. For example, the SNR may be
estimated from table data specified according to the number of quantizing levels,
and the SMR is output by the psychoacoustic model. In an iterative process,
the bit allocator determines the NMR for all sub-bands. The sub-band with the
highest NMR value is allocated bits, and a new NMR is calculated based on the
SNR value. The process is repeated until all the available bits are allocated.
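As an illustration, the iterative loop might be sketched as follows; the greedy
strategy, the 6.02-dB-per-bit SNR estimate, and all names are assumptions for
the example, not any standard's actual tables:

    def allocate_bits(smr_db, bit_pool, max_bits=15):
        # Greedy allocation: repeatedly grant one bit to the sub-band with
        # the highest NMR = SMR - SNR, approximating SNR as 6.02 dB per bit.
        bits = [0] * len(smr_db)
        while bit_pool > 0:
            nmr = [s - 6.02 * b for s, b in zip(smr_db, bits)]
            worst = max(range(len(nmr)), key=nmr.__getitem__)
            if bits[worst] >= max_bits:
                break                    # no band can usefully take more bits
            bits[worst] += 1             # allocate one bit, then recompute NMR
            bit_pool -= 1
        return bits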
The validity of the psychoacoustic model is crucial to the success of any
perceptual codec, but it is the utilization of the model's output in the bit
allocation and quantization process that ultimately determines the audibility
of noise. In that respect, the interrelationship of the model and the quantizer
is the most proprietary part of any codec. Many companies have developed proprietary
psychoacoustic models and bit-allocation methods that are held in secret; however,
their coding is compatible with standards compliant decoders.
Spreading Function
Many psychoacoustic models use a spreading function to compute an auditory
spectrum. It is straightforward to estimate masking levels within a critical
band by using a component in the critical band. However, masking is usually
not limited to a single critical band; its effect spreads to other bands. The
spreading function represents the masking response of the entire basilar membrane
and describes masking across several critical bands, that is, how masking can
occur several Bark away from a masking signal. In crude models (and the most
conservative) the spreading function is an asymmetrical triangle. As noted,
the lower slope is about 27 dB/Bark; the upper slope may vary from -20 to -5
dB/Bark. The masking contour of a pure tone can be approximated as two slopes,
where S1 is the lower slope and S2 is the upper slope, plotted as SPL per critical-band
rate. S1 is essentially independent of masker frequency and level:
S1 = 27 dB/Bark
S2 = [24 + 0.23(fv/1000)^-1 - 0.2(Lv/dB)] dB/Bark
where fv is the frequency of the masking tone in Hz, and Lv is the level of
the masking tone in dB.
The 0.23(fv/1000)^-1 term makes the slope of S2 steeper at low frequencies,
reflecting the threshold of hearing; at masking frequencies above 100 Hz, the
slope is almost independent of frequency. S2 also depends on SPL.
A more sophisticated spreading function, but one that does not account for
the masker level, is given by the expression:
10 log10 SF(dz) = 15.81 + 7.5(dz + 0.474) - 17.5[1 + (dz + 0.474)^2]^(1/2) dB
where dz is the distance in Bark between the maskee and masker frequency.
To use a spreading function, the audio spectrum is divided into critical bands
and the energy in each band is computed. These values are convolved with the
spreading function to yield the auditory spectrum. When offsets and the absolute
threshold of hearing are considered, the final masking thresholds are produced.
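A minimal numerical sketch of this procedure is shown below, assuming 1-Bark
band spacing, the Schroeder spreading function given above, and intensity
summation of overlapping contributions (one of the combining approaches
discussed next); the names are illustrative:

    import numpy as np

    def spreading_db(dz):
        # Schroeder spreading-function level at dz Bark from the masker.
        return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

    def auditory_spectrum(band_db):
        # Spread each band's energy (dB, 1-Bark spacing) across its neighbors
        # and sum the contributions as intensities; offsets and the absolute
        # threshold would then be applied to these values.
        n = len(band_db)
        total = np.zeros(n)
        for masker in range(n):
            for maskee in range(n):
                level = band_db[masker] + spreading_db(maskee - masker)
                total[maskee] += 10.0 ** (level / 10.0)
        return 10.0 * np.log10(total)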
When calculating a global masking threshold, the effects of multiple maskers
must be considered. For example, a model could use the higher of two thresholds,
or add together the masking threshold intensities of different components.
Alternatively, a value averaged between the values of the two methods could
be used, or another nonlinear approach could be taken. For example, in the
MPEG-1 psychoacoustic model 1, intensities are summed. However, in MPEG-1 model
2, the higher value of the global masking threshold and the absolute threshold
is selected. These models are discussed in Section 11.
Tonality
Distinguishing between tonal and nontonal components is an important feature
of most psychoacoustic models because tonal and nontonal components demand
different masking emulation. For example, as noted, noise is a better masker
than a tone. Many methods have been devised to detect and characterize tonality
in audio signals.
For example, in MPEG-1 model 1, tonality is determined by detecting local
maxima in the audio spectrum. All nontonal components in a critical band are
represented with one value at one frequency. In MPEG-1 model 2, a spectral
flatness measure is used to measure the average or global tonality. These models
are discussed in Section 11.
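For example, a spectral flatness measure can be computed as the ratio of the
geometric mean to the arithmetic mean of the power spectrum; the sketch below
maps it to a tonality index. The -60-dB reference is an assumption commonly
associated with this style of model, not a value from the text:

    import numpy as np

    def tonality_index(power_spectrum):
        # SFM in dB: geometric mean over arithmetic mean of spectral power.
        p = np.asarray(power_spectrum, dtype=float) + 1e-12
        sfm_db = 10.0 * np.log10(np.exp(np.mean(np.log(p))) / np.mean(p))
        # Map to 0..1: 1.0 is entirely tonal, 0.0 is entirely noiselike.
        return min(sfm_db / -60.0, 1.0)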
In some tonality models, when a signal has strong local maxima tonal components,
they are detected and withdrawn and coded separately. This flattens the overall
spectrum and increases the efficiency of the subsequent Huffman coding because
the average number of bits needed in a codebook increases according to the
magnitude of the maximum value. The increase in efficiency depends on the nature
of the audio signal. Some models further distinguish the harmonic structure
of multitonal maskers. With two multitonal maskers of the same power, the one
with a strong harmonic structure yields a lower masking threshold.
Identification of tonal and nontonal components can also be important in the
decoder when data is conveyed across an error-prone transmission channel and
error concealment is applied before the output synthesis filter bank. Missing
tonal components can be replaced by predicted values. For example, predictions
can be made using an FIR filter for all-pole modeling of the signal; from an
autocorrelation function, the predictor coefficients can be generated with the
Levinson-Durbin algorithm. Studies indicate that concealment in the lower sub-bands
is more important than in the upper sub-bands. Noise properly shaped by a spectral
envelope can be successfully substituted for missing nontonal sections.
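A compact version of the Levinson-Durbin recursion mentioned above is sketched
here; the implementation is generic, not drawn from any particular codec's
concealment scheme:

    def levinson_durbin(r, order):
        # Solve the all-pole (LPC) normal equations from the autocorrelation
        # sequence r[0..order]; returns predictor coefficients and the final
        # prediction-error power.
        a = [1.0] + [0.0] * order
        err = r[0]
        for i in range(1, order + 1):
            acc = sum(a[j] * r[i - j] for j in range(1, i))
            k = -(r[i] + acc) / err          # reflection coefficient
            prev = a[:]
            for j in range(1, i):
                a[j] = prev[j] + k * prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a, err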
Rationale for Perceptual Coding
The purpose of any low bit-rate coding system is to decrease the data rate,
which is the product of the sampling frequency and the word length. This can be accomplished
by decreasing the sampling frequency; however, the Nyquist theorem dictates
a corresponding decrease in high-frequency audio bandwidth. Another approach
uniformly decreases the word length; however, this reduces the dynamic range
of the audio signal by 6 dB per bit, thus increasing broadband quantization
noise. As we have seen, a more enlightened approach uses psychoacoustics.
Perceptual codecs maintain sampling frequency, but selectively decrease word
length. The word-length reduction is done dynamically based on signal conditions.
Specifically, masking and other factors are considered so that the resulting
increase in quantization noise is rendered as inaudible as possible. The level
of quantization error, and its associated distortion from truncating the word
length, can be allowed to rise, so long as it is masked by the audio signal.
For example, a codec might convey an audio signal with an average bit rate
of 2 bits/sample; with PCM encoding, this would correspond to a signal-to-noise
ratio of 12 dB, a very poor result. But by exploiting psychoacoustics, the codec
can render the noise floor nearly inaudible.
Perceptual codecs analyze the frequency and amplitude content of the input
signal. The encoder removes the irrelevancy and statistical redundancy of the
audio signal. In theory, although the method is lossy, the human perceiver
will not hear degradation in the decoded signal.
Considerable data reduction is possible. For example, a perceptual codec might
reduce a channel's bit rate from 768 kbps to 128 kbps; a word length of 16
bits/sample is reduced to an average of 2.67 bits/sample, and data quantity
is reduced by about 83%. Table 2 lists various reduction ratios and resulting
bit rates for 48-kHz and 44.1-kHz monaural signals. A perceptually coded recording,
with a conservative level of reduction, can rival the sound quality of a conventional
recording because the data is coded in a much more intelligent fashion, and
quite simply, because we do not hear all of what is recorded anyway. In other
words, perceptual codecs are efficient because they can convey much of the
perceived information in an audio signal, while requiring only a fraction of
the data needed by a conventional system.
Part of this efficiency stems from the adaptive quantization used by most
perceptual codecs. With PCM, all signals are given equal word lengths. Perceptual
codecs assign bits according to audibility. A prominent tone is given a large
number of bits to ensure audible integrity. Conversely, fewer bits are used to
code soft tones, and inaudible tones are not coded at all. Overall, bit-rate
reduction is achieved. A codec's reduction ratio (or coding gain) is the ratio
of input bit rate to output bit rate; reduction ratios of 4:1, 6:1, or 12:1
are common.
Perceptual codecs have achieved remarkable transparency, so that in many applications
reduced data is audibly indistinguishable from linearly represented data.
Tests show that reduction ratios of 4:1 or 6:1 can be transparent.
TABLE 2 Bit-rate reduction for 48-kHz and 44.1-kHz sampling frequencies.
The heart of a perceptual codec is the bit-allocation algorithm; this is where
the bit rate is reduced. For example, a 16-bit monaural signal sampled at 48
kHz that is coded at a bit rate of 96 kbps must be requantized with an average
of 2 bits/sample. Moreover, at that bit rate, the bit budget might be 1024
bits per block of analyzed data.
The bit-allocation algorithm must determine how best to distribute the bits
across the signal's spectrum and re-quantize samples to minimize audibility
of quantization noise while meeting its overall bit budget for that block.
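The arithmetic behind this example is straightforward; in the sketch below, the
512-sample block size is an assumption chosen to match the 1024-bit budget
mentioned in the text:

    bit_rate = 96_000                       # coded rate in bits per second
    fs = 48_000                             # sampling frequency in Hz
    bits_per_sample = bit_rate / fs         # = 2.0 bits/sample on average
    block = 512                             # assumed samples per analysis block
    budget = int(block * bits_per_sample)   # = 1024 bits for the block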
Generally, two kinds of bit-allocation strategies can be used in perceptual
codecs. In forward adaptive allocation, all allocation is performed in the
encoder and this encoding information is contained in the bitstream. Very accurate
allocation is permitted, provided the encoder is sufficiently sophisticated.
An important advantage of forward adaptive coding is that the psychoacoustic
model is located in the encoder; the decoder does not need a psychoacoustic
model because it uses the encoded data to completely reconstruct the signal.
Thus as psychoacoustic models in encoders are improved, the increased sonic
quality can be conveyed through existing decoders. A disadvantage is that a
portion of the available bit rate is needed to convey the allocation information
to the decoder. In backward adaptive allocation, bit-allocation information
is derived from the coded audio data itself without explicit information from
the encoder. The bit rate is not partly consumed by allocation information.
However, because bit allocation in the decoder is calculated from limited information,
accuracy may be reduced. In addition, the decoder is more complex, and the
psychoacoustic model cannot be easily improved following the introduction of
new codecs.
Perceptual coding is generally tolerant of errors. With PCM, an error introduces
a broadband noise. However, with most perceptual codecs, the error is limited
to a narrow band corresponding to the bandwidth of the coded critical band,
thus limiting its loudness. Instead of a click, an error might be perceived
as a burst of low-level noise.
Perceptual coding systems also permit targeted error correction. For example,
particularly vulnerable sounds (such as pianissimo passages) may be given greater
protection than less vulnerable sounds (such as forte passages). As with any
coded data, perceptually coded data requires error correction appropriate to
the storage or transmission medium.
Because perceptual codecs tailor the coding to the ear's acuity, they may
similarly decrease the required response of the playback system itself. Live
acoustic music does not pass through amplifiers and loudspeakers; it goes directly
to the ear. But recorded music must pass through the playback signal chain.
Arguably, some of the original signal present in a recording could degrade
the playback system's ability to reproduce the audible signal.
Because a perceptual codec removes inaudible signal content, the playback
system's ability to convey audible music may improve. In short, a perceptual
codec may more properly code an audio signal for passage through an audio system.
Perceptual Coding in Time and Frequency
Low bit-rate lossy codecs, whether designed for music or speech coding, attempt
to represent the audio signal at a reduced bit rate while minimizing the associated
increase in quantization error. Time-domain coding methods such as delta modulation
can be considered to be data-reduction codecs (other time-domain methods such
as PCM do not provide reduction). They use prediction methods on samples representing
the full bandwidth of the audio signal and yield a quantization error spectrum
that spans the audio band. Although the audibility of the error depends on
the amplitude and spectrum of the signal, the quantization error generally
is not masked by the signal. However, time-domain codecs operating across the
full bandwidth of the time-domain signal can achieve reduction ratios of up
to about 2.5:1. For example, Near Instantaneously Companded Audio Multiplex (NICAM)
codecs reduce blocks of 32 samples from 14 bits to 10 bits using a sliding
window to determine which 10 of the 14 bits can be transmitted with minimal
audible degradation. With this method, coding is lossless with low-level signals,
with increasing loss at high levels.
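A rough sketch of this sliding-window idea follows; the shift search and block
handling are illustrative only, not the actual NICAM coding ranges:

    def compand_block(samples):
        # Near-instantaneous companding: for a block of 32 signed 14-bit
        # samples, find how many top magnitude bits are unused, then keep
        # a 10-bit window. Low-level blocks (peak < 512) lose nothing.
        peak = max(abs(s) for s in samples)
        shift = 0
        while shift < 4 and peak < (1 << (12 - shift)):
            shift += 1
        drop = 4 - shift                 # LSBs discarded for this block
        mantissas = [s >> drop for s in samples]
        return mantissas, shift          # shift is sent as the scale factor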
Although data reduction is achieved, the bit rate is too high for many applications;
primarily, reduction is limited because masking is not fully exploited.
Frequency-domain codecs take a different approach.
The signal is analyzed in the frequency domain, and only the perceptually
significant parts of the signal are quantized, on the basis of psychoacoustic
characteristics of the ear. Other parts of the signal that are below the minimum
threshold, or masked by more significant signals, may be judged to be inaudible
and are not coded. In addition, quantization resolution is dynamically adapted
so that error is allowed to rise near significant parts of the signal with
the expectation that when the signal is reconstructed, the error will be masked
by the signal. This approach can yield significant data reduction. However,
codec complexity is greatly increased.
Conceptually, there are two types of frequency-domain codecs: sub-band and
transform codecs. Generally, sub-band codecs use a low number of sub-bands
and process samples adjacent in time, and transform codecs use a high number
of sub-bands and process samples adjacent in frequency. Generally, sub-band
codecs provide good time resolution and poor frequency resolution, and transform
codecs provide good frequency resolution and poor time resolution.
However, the distinction between sub-band and transform codecs is primarily
based on their separate historical development. Mathematically, all transforms
used in codecs can be viewed as filter banks. Perhaps the most practical difference
between sub-band and transform codecs is the number of bands they process.
Thus, both sub-band and transform codecs follow the architecture shown in FIG.
12; either time-domain samples or frequency-domain coefficients are quantized
according to a psychoacoustic model contained in the encoder.
In sub-band coding, a hybrid of time- and frequency domain techniques is used.
A short block of time-based broadband input samples is divided into a number
of frequency sub-bands using a filter bank of bandpass filters; this allows
determination of the energy in each sub-band.
Using a side-chain transform frequency analysis, the samples in each sub-band
are analyzed for energy content and coded according to a psychoacoustic model.
In transform coding, a block of input samples is directly applied to a transform
to obtain the block's spectrum in the frequency domain. These transform coefficients
are then quantized and coded according to a psychoacoustic model. Problematically,
a relatively long block of data is required to obtain a high-resolution spectral
representation.
Transform codecs achieve greater reduction than sub-band codecs; ratios of
4:1 to 12:1 are typical. Transform codecs incur a longer processing delay than
sub-band codecs.
FIG. 12 The basic structure of a time-frequency domain encoder and decoder
(A and B, respectively).
Sub-band (time) codecs quantize time-based samples, and transform (frequency)
codecs quantize frequency-based coefficients.
As noted, most low bit-rate lossy codecs use psychoacoustic models to analyze
the input signal in the frequency domain. To accomplish this, the time-domain
input signal is often applied to a transform prior to analysis in the model.
Any periodic signal can be represented as amplitude variations in time, or
as a set of frequency coefficients describing amplitude and phase. Jean Baptiste
Joseph Fourier first established this relationship between time and frequency.
Changes in a time-domain signal also appear as changes in its frequency-domain
spectrum. For example, a slowly changing signal would be represented by a low-frequency
spectral content. If a sequence of time-based samples are thus transformed,
the signal's spectral content can be determined over that period of time. Likewise,
the time-based samples can be recovered by inverse transforming the spectral
representation back into the time domain. A variety of mathematical transforms
can be used to transform a time domain signal into the frequency domain and
back again.
For example, the fast Fourier transform (FFT) gives a spectrum with half as
many frequency points as there are time samples. Assume, for instance, that 480
samples are taken at a 48-kHz sampling frequency. In this 10-ms interval, 240
frequency points are obtained over a spectrum from the highest frequency of
24 kHz down to the lowest of 100 Hz (the reciprocal of the 10-ms interval),
with frequency points placed 100 Hz apart. In addition, a dc point is generated.
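This example is easy to verify numerically; the short sketch below (using
NumPy's real FFT) assumes nothing beyond the numbers in the text:

    import numpy as np

    fs = 48_000
    x = np.random.randn(480)        # any 10-ms block of samples at 48 kHz
    spectrum = np.fft.rfft(x)       # dc point plus 240 frequency points
    spacing = fs / len(x)           # 100.0 Hz between frequency points
    print(len(spectrum), spacing)   # prints: 241 100.0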
Sub-band Coding
Sub-band coding was first developed at Bell Labs in the early 1980s, and much
subsequent work was done in Europe later in the decade. Blocks of consecutive
time domain samples representing the broadband signal are collected over a
short period and applied to a digital filter bank. This analysis filter bank
divides the signal into multiple (perhaps up to 32) bandlimited channels to
approximate the critical band response of the human ear.
The filter bank must provide a very sharp cutoff (perhaps 100 dB/octave) to
emulate critical band response and limit quantization noise within that bandwidth.
Only digital filters can accomplish this result. In addition, the processing
block length (ideally less than 2 ms to 4 ms) must be small so that quantization
error does not exceed the temporal masking limits of the ear. The samples in
each sub-band are analyzed and compared to a psychoacoustic model. The codec
adaptively quantizes the samples in each sub-band based on the masking threshold
in that sub-band. Ideally, the filter bank should yield sub-bands with a width
that corresponds to the width of the narrowest critical band. This would allow
precise psychoacoustic modeling. However, most filter banks producing uniformly
spaced sub-bands cannot meet this goal; this points out the difficulties posed
by the great difference in bandwidth between the narrowest critical band and
the widest.
Each sub-band is coded independently, with more or fewer bits allocated to the
samples in the sub-band, so quantization noise may be increased in a sub-band.
However, when the signal is reconstructed, the quantization noise in a sub-band
will be limited to that sub-band, where it is ideally masked by the audio signal
in that sub-band, as shown in FIG. 13. Quantization noise levels that are otherwise
intrusive can be tolerated in a sub-band with a signal contained in it because
noise will be masked by the signal. Sub-bands that do not contain an audible
signal are quantized to zero. Bit allocation is determined by a psychoacoustic
model and analysis of the signal itself; these operations are recalculated
for every sub-band in every new block of data. Samples are dynamically quantized
according to audibility of signals and noise.
There is great flexibility in the design of psychoacoustic models and bit-allocation
algorithms used in codecs that are otherwise compatible. The decoder uses the
quantized data to re-form the samples in each block; a synthesis filter bank
sums the sub-band signals to reconstruct the output broadband signal.
FIG. 13 A sub-band encoder analyzes the broadband audio signal in narrow
sub-bands. Using masking information from a psychoacoustic model, samples in
sub-bands are coarsely quantized, raising the noise floor.
When the samples are reconstructed in the decoder, the synthesis filter constrains
the quantization noise floor within each sub-band, where it is masked by the
audio signal.
A sub-band perceptual codec uses a filter bank to split a short duration of
the audio signal into multiple bands, as depicted in FIG. 14. In some designs,
a side-chain processor applies the signal to a transform such as an FFT to
analyze the energy in each sub-band. These values are applied to a psychoacoustic
model to determine the combined masking curve that applies to the signals in
that block. This permits more efficient coding of the time-domain samples. Specifically,
the encoder analyzes the energy in each sub-band to determine which sub-bands
contain audible information. A calculation is made to determine the average
power level of each sub-band over the block. This average level is used to
calculate the masking level due to signals within each sub-band, as well as
masking from signals in adjacent sub-bands.
Finally, minimum hearing threshold values are applied to each sub-band to
derive its final masking level. Peak power levels present in each sub-band
are calculated, and compared to the masking level. Sub-bands that do not contain
audible information are not coded. Similarly, tones in a sub-band that are
masked by louder nearby tones are not coded, and in some cases entire sub-bands
can mask nearby sub-bands, which thus need not be coded.
Calculations determine the ratio of peak power to masking level in each sub-band.
Quantization bits are assigned to audible program material with a priority
schedule that allocates bits to each sub-band according to signal strength
above the audibility curve. For example, FIG. 15 shows vertical lines representing
peak power levels, along with the minimum and masking thresholds.
The signals below the minimum or masking curves are not coded, and the quantization
noise floor is allowed to rise to those levels. For example, in the figure,
signal A is below the minimum curve and would not be coded in any event. Signal
C is also irrelevant in this frame because signal B has dynamically shifted
the hearing threshold upward.
Signal B must be coded; however, its presence has created a masking curve that
reduces the effective amplitude to be coded above the minimum threshold curve.
The portion of signal B between the minimum curve and the masking curve represents
the reduction in bits needed to code the signal when the masking effect is taken
into account. In other words, rather than using a signal-to-noise ratio, a
signal-to-mask ratio (SMR) is used. The SMR is the difference between the maximum
signal and the masking threshold and is used to determine the number of bits
assigned to a sub-band. The SMR is calculated for each sub-band.
The number of bits allocated to any sub-band must be sufficient to yield a
requantizing noise level that is below the masking level. The number of bits
depends on the SMR value, with the goal of maintaining the quantization noise
level below the calculated masking level for each sub-band.
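Using the roughly 6-dB-per-bit behavior of uniform quantization, the bit count
per sub-band can be sketched as below; the optional safety margin is an
assumption for illustration, not part of the text:

    import math

    def bits_for_band(smr_db, margin_db=0.0):
        # Enough bits that SNR (about 6.02 dB per bit) exceeds the SMR,
        # placing requantization noise below the masking level.
        if smr_db <= 0:
            return 0             # signal already below the masking level
        return math.ceil((smr_db + margin_db) / 6.02)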
In fixed-rate codecs, a bit-pool approach can be taken. A large number of
sub-bands requiring coding and signals with large SMR values might empty the
pool, resulting in less than optimal coding. On the other hand, if the pool
is not empty after initial allocation, the process is repeated until all bits
in the codec's data capacity have been used.
Typically, the iterative process continues, allocating more bits where required,
with signals with the highest SMR requirements always receiving the most bits;
this increases the coding margin. In some cases, sub-bands previously classified
as inaudible might receive coding from these extra bits. Thus, signals below
the masking threshold can in practice be coded, but only on a secondary priority
basis.
Summarizing the concept of sub-band coding, FIG. 16 shows how a 24-sub-band
codec might code three tones at 250 Hz, 1 kHz, and 4 kHz; note that in each
case the quantization noise level is below the combined masking and threshold
curve.
FIG. 14 A sub-band codec divides the signal into narrow sub-bands, calculates
average signal level and masking level, and then quantizes the samples in each
sub-band accordingly. A. Output of the 24-band filter bank. B. Calculation of
average level in each sub-band. C. Calculation of masking level in each sub-band.
D. Sub-bands below audibility are not coded; bands above audibility are coded.
E. Bits are allocated according to peak level above the masking threshold. Sub-bands
with peak levels above the masking level contain audible signals that must be
coded.
FIG. 15 The bit-allocation algorithm assigns bits according to audibility
of sub-band signals. Bits may not be assigned to masked or inaudible tones.
Transform Coding
In transform coding, the audio signal is viewed as a quasi-stationary signal
that changes relatively little over short time intervals. For efficient coding,
blocks of time-domain audio samples are transformed to the frequency domain.
Frequency coefficients, rather than amplitude samples, are quantized to achieve
data reduction. For playback, the coefficients are inverse-transformed back
to the time domain.
The operation of the transform approximates how the basilar membrane analyzes
the frequency content of vibrations along its length. The spectral coefficients
output by the transform are quantized according to a psychoacoustic model;
masked components are eliminated, and quantization decisions are made based
on audibility. In contrast to a sub-band codec, which uses frequency analysis
to code time-based samples, a transform codec codes frequency coefficients.
From an information theory standpoint, the transform reduces the entropy of
the signal, permitting efficient coding. Longer transform blocks provide greater
spectral resolution, but lose temporal resolution; for example, a long block
might result in a pre-echo before a transient. In many codecs, block length
is adapted according to audio signal conditions. Short blocks are used for
transient signals, while long blocks are used for continuous signals.
FIG. 16 In this 24-band sub-band codec, three tones are coded so that the
quantization noise in each sub-band falls below the calculated composite masking
curves.
(Thiele, Link, and Stoll, 1987)
Time-domain samples are transformed to the
frequency domain, yielding spectral coefficients. The coefficient numbers are
sometimes called frequency bin numbers; for example, a 512-point transform
can produce 256 frequency coefficients or frequency bins. The coefficients,
which might number 512, 1024, or more, are grouped into about 32 bands that
emulate critical-band analysis. This spectrum represents the block of time-based
input samples. The frequency coefficients in each band are quantized according
to the codec's psychoacoustic model; quantization can be uniform, nonuniform,
fixed, or adaptive in each band.
Transform codecs may use a discrete cosine transform (DCT) or modified discrete
cosine transform (MDCT) for transform coding because of low computational complexity,
and because they can critically sample (sample at twice the bandwidth of the
bandpass filter) the signal to yield an appropriate number of coefficients.
Most codecs overlap successive blocks in time by about 50%, so that each sample
appears in two different transform blocks. For example, the samples in the
first half of a current block are repeated from the second half of the previous
block. This reduces changes in spectra from block to block and improves temporal
resolution. The DCT and MDCT can yield the same number of coefficients as with
nonoverlapping blocks. As noted, an FFT may be used in the codec's side chain
to yield coefficients for perceptual modeling.
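A direct (non-fast) MDCT sketch follows, showing the critical sampling described
above: 2N windowed time samples with 50% overlap yield N coefficients. The sine
window is a common but assumed choice here:

    import numpy as np

    def mdct(block):
        # block holds 2N time samples; returns N frequency coefficients.
        n2 = len(block)
        n = n2 // 2
        win = np.sin(np.pi * (np.arange(n2) + 0.5) / n2)   # sine window
        x = win * np.asarray(block, dtype=float)
        k = np.arange(n)[:, None]
        t = np.arange(n2)[None, :]
        basis = np.cos(np.pi / n * (t + 0.5 + n / 2.0) * (k + 0.5))
        return basis @ x

Successive calls advance by N samples, so each input sample contributes to two
transform blocks, as described above.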
All low bit-rate codecs operate over a block of samples.
This block must be kept short to stay within the temporal masking limits of
the ear. During decoding, quantization noise will be spread over the frequency
of the band, and over the duration of the block. If the block is longer than
temporal backward masking allows, the noise will be heard prior to the onset
of the sound, in a phenomenon known as pre-echo. (The term pre-echo is misleading
because the artifact is unmasked noise, not an echo.)
Pre-echo is particularly problematic in the case of a silence followed by a
time-domain transient within the analysis window. The energy in the transient
portion causes the encoder to allocate relatively few bits, thus raising the
eventual quantization noise level. Pre-echoes are created in the decoder when
frequency coefficients are inverse transformed prior to the reconstruction
of sub-band samples in the synthesis filter bank. The duration of the quantization
noise equals that of the synthesis window, so the elevated noise extends over
the duration of the window, while the transient only occurs briefly. In other
words, encoding dictates that a transient in the audio signal will be accompanied
by an increase in quantization noise but a brief transient may not fully mask
the quantization noise surrounding it, as shown in FIG. 17. In this example,
the attack of a triangle occurs as a transient signal. The analysis window
of a transform codec operates over a relatively long time period. Quantization
noise is spread over the time of the window and precedes the music signal;
thus it may be audible as a pre-echo.
FIG. 17 An example of a pre-echo. On reconstruction, quantization noise falls
within the analysis block, where the leading edge is not masked by the signal.
Transform codecs are particularly affected by the problem of pre-echo because
they require long blocks for greater frequency accuracy. Short block length
limits frequency resolution (and also relatively increases the amount of overhead
side information). In essence, transform codecs sacrifice temporal resolution
for spectral resolution. Long blocks are suitable for slowly changing or tonal
signals; the frequency resolution allows the codec to identify spectral peaks
and use their masking properties in bit allocation. For example, a clarinet
note and its harmonics would require fine frequency resolution but only coarse
time resolution. However, transient signals require a short block length; the
signals have a flatter spectrum. For example, the fast transient of a castanet
click would require fine time resolution but only coarse frequency resolution.
In most transform codecs, to provide the resolution demanded by particular
signal conditions, and to avoid pre-echo, the block length dynamically adapts
to those conditions.
Referring again to FIG. 17, a shorter analysis block would constrain the quantization
noise to a shorter duration, where it will be masked by the signal. A short
block is also advantageous because it limits the duration of high bit rates
demanded by transient encoding. Alternatively, a variable bit rate encoder
can minimize pre-echo by briefly increasing the bit rate to decrease the noise
level. Some codecs use temporal noise shaping (TNS) to minimize pre-echo by
manipulating the nature of the quantization noise within a filter bank window.
When a transient signal is detected, TNS uses a predictive coding method to
shape the quantization noise to follow the transient's envelope. In this way,
the quantization error is more effectively concealed by the transient. However,
no matter what approach is taken, difficulty arises because most music simultaneously
places contradictory demands on the codec.
In adaptive transform codecs, a model is applied to uniformly and adaptively
quantize each individual band; all coefficient values within a band are quantized
with the same number of bits. The bit-allocation algorithm calculates the optimal
quantization noise in each sub-band to achieve a desired signal-to-noise ratio
that will promote masking.
Iterative allocation is used to supply additional bits as available to increase
the coding margin, yet maintain limited bit rate. In some cases, the output
bit rate can be fixed or variable for each block. Before transmission, the
reduced data is often compressed with entropy coding such as Huffman coding
and run-length coding to perform lossless compression. The decoder inversely
quantizes the coefficients and performs an inverse transform to reconstruct
the signal in the time domain.
An example of an adaptive transform codec proposed by Karlheinz Brandenburg
is shown in FIG. 18. An MDCT transforms the signal to the frequency domain.
Signal energy in each critical band is calculated using the spectral coefficients.
This is used to determine the masking threshold for each critical band. Two
iterative loops perform quantization and coding using an analysis-by-synthesis
technique. Coefficients are initially assigned a quantizer step size and the
algorithm calculates the resulting number of bits needed to code the signal
in the block. If the count exceeds the bit rate allowed for the block, the
loop reassigns a larger quantizer step size and the count is recalculated until
the target bit rate is achieved. An outer loop calculates the quantization
error as it will appear in the reconstructed signal. If the error in a band
exceeds the error allowed by the masking model, the quantizer step size in
the band is decreased. Iterations continue in both loops until optimal coding
is achieved. Codecs such as this can operate at low bit rates (for example,
2.5 bits/sample).
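The two nested loops can be sketched schematically as below; the uniform
quantizer, the crude bit count, and the step-size factors are all assumptions
for illustration, not Brandenburg's actual algorithm:

    import numpy as np

    def quantize(c, step):
        return np.round(np.asarray(c, dtype=float) / step)

    def bits_used(q):
        # Crude rate estimate standing in for real Huffman coding.
        q = np.abs(q)
        return int(np.sum(np.where(q > 0, np.log2(q + 1) + 1, 0)))

    def two_loop(bands, allowed_err, budget, iters=100):
        steps = [1.0] * len(bands)
        for _ in range(iters):
            q = [quantize(c, s) for c, s in zip(bands, steps)]
            if sum(bits_used(b) for b in q) > budget:
                steps = [s * 1.25 for s in steps]   # inner loop: coarser steps
                continue
            ok = True
            for i, (c, s) in enumerate(zip(bands, steps)):
                err = np.max(np.abs(np.asarray(c, dtype=float) - q[i] * s))
                if err > allowed_err[i]:
                    steps[i] *= 0.8                 # outer loop: finer step
                    ok = False
            if ok:
                return q, steps
        return q, steps                             # best effort after iters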
FIG. 18 Adaptive transform codec using an FFT side chain and iterative quantization
to achieve optimal reduction. Entropy coding is additionally used for data
compression.