Psychoacoustic Models
Psychoacoustic models emulate the human hearing system and analyze spectral
data to determine how the audio signal can be coded to render quantization
noise as inaudible as possible. Most models calculate the masking thresholds
for critical bands to determine this just noticeable noise level. In other
words, the model determines how much coding noise is allowed in every critical
band, performing one such analysis on each frame of data. The difference between
the maximum signal level and the minimum masking threshold (the signal-to-mask
ratio) thus determines bit allocation for each band. An important element in
modeling masking curves is determining the relative tonality of signals, because
this affects the character of the masking curve they project. Any model must
be time-aligned so that its results coincide with the correct frame of audio
data. This accounts for the filter delay and the need to center the analysis
output in the current data block.
In most codecs, the goal of bit allocation is to minimize the total noise-to-mask
ratio over the entire frame. The number of bits allocated cannot exceed the
number of bits available for the frame at a given bit rate. The noise-to-mask
ratio for each sub-band is calculated as:
NMR = SMR - SNR dB
The SNR is the difference between the masker and the noise floor established
by a quantization level; the more bits used for quantization, the larger the
value of SNR. The SMR is the difference between the masker and the minimum
value of the masking threshold within a critical band. More specifically, the
pertinent masking threshold is the global masking threshold (also known as
the just noticeable distortion or JND) within a critical band. The SMR determines
the number of bits needed for quantization. If a signal is below the threshold,
then the signal is not coded. The NMR is the difference between the quantization
noise level and the level where noise reaches audibility. The relationship
is shown in FIG. 11.
FIG. 11 The NMR is the difference between the SMR and SNR, expressed in dB.
The masking threshold varies according to the tonality of the signal. (Noll,
1997)
Within a critical band, the larger the SNR is compared to the SMR, the less
audible the quantization noise. If the SNR is less than the SMR, then the
noise is audible. A codec thus strives to minimize the value of the NMR in
sub-bands by increasing the accuracy of the quantization. This figure gauges
the perceptual quality of the coding. For example, NMR values of less than
0 may indicate transparent coding, while values above 0 may indicate audible
degradation.
Referring again to FIG. 11, we can also note that the masking threshold is
shifted downward from the masking peak by some amount that depends most significantly
on whether the masker is tonal or nontonal. Generally, these expressions can
be applied:
ΔTMN = 14.5 + z dB
ΔNMT = S dB
where z is the frequency in Bark and S can be assumed to lie between 3 and
6 but can be frequency-dependent.
Alternatively, James Johnston has suggested these expressions for the tonal/nontonal
shift:
ΔTMN = 19.5 + z(18.0/26.0) dB
ΔNMT = 6.56 - z(3.06/26.0) dB
where z is the frequency in Bark.
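These offsets are simple to evaluate. The following minimal sketch (in Python,
with illustrative names; not from the text) computes Johnston's tonal and
nontonal shifts at a given critical-band rate:

    def masking_offsets_johnston(z_bark):
        # Tone-masking-noise and noise-masking-tone offsets in dB,
        # per the Johnston expressions given above.
        delta_tmn = 19.5 + z_bark * (18.0 / 26.0)
        delta_nmt = 6.56 - z_bark * (3.06 / 26.0)
        return delta_tmn, delta_nmt

    # For example, at z = 13 Bark: delta_tmn = 28.5 dB, delta_nmt = 5.03 dB.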
The codec must place noise below the JND, or more specifically, taking into
account the absolute threshold of hearing, the codec must place noise below
the higher of JND or the threshold of hearing. For example, the SNR may be
estimated from table data specified according to the number of quantizing levels,
and the SMR is output by the psychoacoustic model. In an iterative process,
the bit allocator determines the NMR for all sub-bands. The sub-band with the
highest NMR value is allocated bits, and a new NMR is calculated based on the
SNR value. The process is repeated until all the available bits are allocated.
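As an illustration, the iterative loop might be sketched as follows; the greedy
strategy, the 6.02-dB-per-bit SNR estimate, and all names are assumptions for
the example, not any standard's actual tables:

    def allocate_bits(smr_db, bit_pool, max_bits=15):
        # Greedy allocation: repeatedly grant one bit to the sub-band with
        # the highest NMR = SMR - SNR, approximating SNR as 6.02 dB per bit.
        bits = [0] * len(smr_db)
        while bit_pool > 0:
            nmr = [s - 6.02 * b for s, b in zip(smr_db, bits)]
            worst = max(range(len(nmr)), key=nmr.__getitem__)
            if bits[worst] >= max_bits:
                break                    # no band can usefully take more bits
            bits[worst] += 1             # allocate one bit, then recompute NMR
            bit_pool -= 1
        return bits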
The validity of the psychoacoustic model is crucial to the success of any
perceptual codec, but it is the utilization of the model's output in the bit
allocation and quantization process that ultimately determines the audibility
of noise. In that respect, the interrelationship of the model and the quantizer
is the most proprietary part of any codec. Many companies have developed proprietary
psychoacoustic models and bit-allocation methods that are held in secret; however,
their coding is compatible with standards compliant decoders.
Spreading Function
Many psychoacoustic models use a spreading function to compute an auditory
spectrum. It is straightforward to estimate masking levels within a critical
band by using a component in the critical band. However, masking is usually
not limited to a single critical band; its effect spreads to other bands. The
spreading function represents the masking response of the entire basilar membrane
and describes masking across several critical bands, that is, how masking can
occur several Bark away from a masking signal. In crude models (and the most
conservative) the spreading function is an asymmetrical triangle. As noted,
the lower slope is about 27 dB/Bark; the upper slope may vary from -20 to -5
dB/Bark. The masking contour of a pure tone can be approximated as two slopes,
where S1 is the lower slope and S2 is the upper slope, plotted as SPL per critical-band
rate. S1 is essentially independent of masker frequency and level:
S1 = 27 dB/Bark
S2 = [24 + 0.23(fv/1000)^-1 - 0.2(Lv/dB)] dB/Bark
where fv is the frequency of the masking tone in Hz, and Lv is the level of
the masking tone in dB.
The 0.23(fv/1000)^-1 term makes the slope of S2 steeper at low frequencies,
reflecting the threshold of hearing; at masking frequencies above 100 Hz, the
slope is almost independent of frequency. S2 also depends on SPL.
A more sophisticated spreading function, but one that does not account for
the masker level, is given by the expression:
10 log10 SF(dz) = 15.81 + 7.5(dz + 0.474) - 17.5[1 + (dz + 0.474)^2]^(1/2) dB
where dz is the distance in Bark between the maskee and masker frequency.
To use a spreading function, the audio spectrum is divided into critical bands
and the energy in each band is computed. These values are convolved with the
spreading function to yield the auditory spectrum. When offsets and the absolute
threshold of hearing are considered, the final masking thresholds are produced.
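A minimal numerical sketch of this procedure is shown below, assuming 1-Bark
band spacing, the Schroeder spreading function given above, and intensity
summation of overlapping contributions (one of the combining approaches
discussed next); the names are illustrative:

    import numpy as np

    def spreading_db(dz):
        # Schroeder spreading-function level at dz Bark from the masker.
        return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)

    def auditory_spectrum(band_db):
        # Spread each band's energy (dB, 1-Bark spacing) across its neighbors
        # and sum the contributions as intensities; offsets and the absolute
        # threshold would then be applied to these values.
        n = len(band_db)
        total = np.zeros(n)
        for masker in range(n):
            for maskee in range(n):
                level = band_db[masker] + spreading_db(maskee - masker)
                total[maskee] += 10.0 ** (level / 10.0)
        return 10.0 * np.log10(total)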
When calculating a global masking threshold, the effects of multiple maskers
must be considered. For example, a model could use the higher of two thresholds,
or add together the masking threshold intensities of different components.
Alternatively, a value averaged between the values of the two methods could
be used, or another nonlinear approach could be taken. For example, in the
MPEG-1 psychoacoustic model 1, intensities are summed. However, in MPEG-1 model
2, the higher value of the global masking threshold and the absolute threshold
is selected. These models are discussed in Section 11.
Tonality
Distinguishing between tonal and nontonal components is an important feature
of most psychoacoustic models because tonal and nontonal components demand
different masking emulation. For example, as noted, noise is a better masker
than a tone. Many methods have been devised to detect and characterize tonality
in audio signals.
For example, in MPEG-1 model 1, tonality is determined by detecting local
maxima in the audio spectrum. All nontonal components in a critical band are
represented with one value at one frequency. In MPEG-1 model 2, a spectral
flatness measure is used to measure the average or global tonality. These models
are discussed in Section 11.
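For example, a spectral flatness measure can be computed as the ratio of the
geometric mean to the arithmetic mean of the power spectrum; the sketch below
maps it to a tonality index. The -60-dB reference is an assumption commonly
associated with this style of model, not a value from the text:

    import numpy as np

    def tonality_index(power_spectrum):
        # SFM in dB: geometric mean over arithmetic mean of spectral power.
        p = np.asarray(power_spectrum, dtype=float) + 1e-12
        sfm_db = 10.0 * np.log10(np.exp(np.mean(np.log(p))) / np.mean(p))
        # Map to 0..1: 1.0 is entirely tonal, 0.0 is entirely noiselike.
        return min(sfm_db / -60.0, 1.0)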
In some tonality models, when a signal has strong local maxima tonal components,
they are detected and withdrawn and coded separately. This flattens the overall
spectrum and increases the efficiency of the subsequent Huffman coding because
the average number of bits needed in a codebook increases according to the
magnitude of the maximum value. The increase in efficiency depends on the nature
of the audio signal. Some models further distinguish the harmonic structure
of multitonal maskers. With two multitonal maskers of the same power, the one
with a strong harmonic structure yields a lower masking threshold.
Identification of tonal and nontonal components can also be important in the
decoder when data is conveyed across an error-prone transmission channel and
error concealment is applied before the output synthesis filter bank. Missing
tonal components can be replaced by predicted values. For example, predictions
can be made using an FIR filter for all-pole modeling of the signal; from an
autocorrelation function, the predictor coefficients can be generated with the
Levinson-Durbin algorithm. Studies indicate that concealment in the lower sub-bands
is more important than in the upper sub-bands. Noise properly shaped by a spectral
envelope can be successfully substituted for missing nontonal sections.
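A compact version of the Levinson-Durbin recursion mentioned above is sketched
here; the implementation is generic, not drawn from any particular codec's
concealment scheme:

    def levinson_durbin(r, order):
        # Solve the all-pole (LPC) normal equations from the autocorrelation
        # sequence r[0..order]; returns predictor coefficients and the final
        # prediction-error power.
        a = [1.0] + [0.0] * order
        err = r[0]
        for i in range(1, order + 1):
            acc = sum(a[j] * r[i - j] for j in range(1, i))
            k = -(r[i] + acc) / err          # reflection coefficient
            prev = a[:]
            for j in range(1, i):
                a[j] = prev[j] + k * prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a, err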
Rationale for Perceptual Coding
The purpose of any low bit-rate coding system is to decrease the data rate,
which is the product of the sampling frequency and the word length. This can be accomplished
by decreasing the sampling frequency; however, the Nyquist theorem dictates
a corresponding decrease in high-frequency audio bandwidth. Another approach
uniformly decreases the word length; however, this reduces the dynamic range
of the audio signal by 6 dB per bit, thus increasing broadband quantization
noise. As we have seen, a more enlightened approach uses psychoacoustics.
Perceptual codecs maintain sampling frequency, but selectively decrease word
length. The word-length reduction is done dynamically based on signal conditions.
Specifically, masking and other factors are considered so that the resulting
increase in quantization noise is rendered as inaudible as possible. The level
of quantization error, and its associated distortion from truncating the word
length, can be allowed to rise, so long as it is masked by the audio signal.
For example, a codec might convey an audio signal with an average bit rate
of 2 bits/sample; with PCM encoding, this would correspond to a signal-to-noise
ratio of 12 dB, a very poor result. But by exploiting psychoacoustics, the codec
can render the noise floor nearly inaudible.
Perceptual codecs analyze the frequency and amplitude content of the input
signal. The encoder removes the irrelevancy and statistical redundancy of the
audio signal. In theory, although the method is lossy, the human perceiver
will not hear degradation in the decoded signal.
Considerable data reduction is possible. For example, a perceptual codec might
reduce a channel's bit rate from 768 kbps to 128 kbps; a word length of 16
bits/sample is reduced to an average of 2.67 bits/sample, and data quantity
is reduced by about 83%. Table 2 lists various reduction ratios and resulting
bit rates for 48-kHz and 44.1-kHz monaural signals. A perceptually coded recording,
with a conservative level of reduction, can rival the sound quality of a conventional
recording because the data is coded in a much more intelligent fashion, and
quite simply, because we do not hear all of what is recorded anyway. In other
words, perceptual codecs are efficient because they can convey much of the
perceived information in an audio signal, while requiring only a fraction of
the data needed by a conventional system.
Part of this efficiency stems from the adaptive quantization used by most
perceptual codecs. With PCM, all signals are given equal word lengths. Perceptual
codecs assign bits according to audibility. A prominent tone is given a large
number of bits to ensure audible integrity. Conversely, fewer bits are used to
code soft tones, and inaudible tones are not coded at all. Overall, bit-rate
reduction is achieved. A codec's reduction ratio (or coding gain) is the ratio
of input bit rate to output bit rate; reduction ratios of 4:1, 6:1, or 12:1
are common.
Perceptual codecs have achieved remarkable transparency, so that in many applications
reduced data is audibly indistinguishable from linearly represented data.
Tests show that reduction ratios of 4:1 or 6:1 can be transparent.
TABLE 2 Bit-rate reduction for 48-kHz and 44.1-kHz sampling frequencies.
The heart of a perceptual codec is the bit-allocation algorithm; this is where
the bit rate is reduced. For example, a 16-bit monaural signal sampled at 48
kHz that is coded at a bit rate of 96 kbps must be requantized with an average
of 2 bits/sample. Moreover, at that bit rate, the bit budget might be 1024
bits per block of analyzed data.
The bit-allocation algorithm must determine how best to distribute the bits
across the signal's spectrum and re-quantize samples to minimize audibility
of quantization noise while meeting its overall bit budget for that block.
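The arithmetic behind this example is straightforward; in the sketch below, the
512-sample block size is an assumption chosen to match the 1024-bit budget
mentioned in the text:

    bit_rate = 96_000                       # coded rate in bits per second
    fs = 48_000                             # sampling frequency in Hz
    bits_per_sample = bit_rate / fs         # = 2.0 bits/sample on average
    block = 512                             # assumed samples per analysis block
    budget = int(block * bits_per_sample)   # = 1024 bits for the block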
Generally, two kinds of bit-allocation strategies can be used in perceptual
codecs. In forward adaptive allocation, all allocation is performed in the
encoder and this encoding information is contained in the bitstream. Very accurate
allocation is permitted, provided the encoder is sufficiently sophisticated.
An important advantage of forward adaptive coding is that the psychoacoustic
model is located in the encoder; the decoder does not need a psychoacoustic
model because it uses the encoded data to completely reconstruct the signal.
Thus as psychoacoustic models in encoders are improved, the increased sonic
quality can be conveyed through existing decoders. A disadvantage is that a
portion of the available bit rate is needed to convey the allocation information
to the decoder. In backward adaptive allocation, bit-allocation information
is derived from the coded audio data itself without explicit information from
the encoder. The bit rate is not partly consumed by allocation information.
However, because bit allocation in the decoder is calculated from limited information,
accuracy may be reduced. In addition, the decoder is more complex, and the
psychoacoustic model cannot be easily improved following the introduction of
new codecs.
Perceptual coding is generally tolerant of errors. With PCM, an error introduces
a broadband noise. However, with most perceptual codecs, the error is limited
to a narrow band corresponding to the bandwidth of the coded critical band,
thus limiting its loudness. Instead of a click, an error might be perceived
as a burst of low-level noise.
Perceptual coding systems also permit targeted error correction. For example,
particularly vulnerable sounds (such as pianissimo passages) may be given greater
protection than less vulnerable sounds (such as forte passages). As with any
coded data, perceptually coded data requires error correction appropriate to
the storage or transmission medium.
Because perceptual codecs tailor the coding to the ear's acuity, they may
similarly decrease the required response of the playback system itself. Live
acoustic music does not pass through amplifiers and loudspeakers; it goes directly
to the ear. But recorded music must pass through the playback signal chain.
Arguably, some of the original signal present in a recording could degrade
the playback system's ability to reproduce the audible signal.
Because a perceptual codec removes inaudible signal content, the playback
system's ability to convey audible music may improve. In short, a perceptual
codec may more properly code an audio signal for passage through an audio system.
Perceptual Coding in Time and Frequency
Low bit-rate lossy codecs, whether designed for music or speech coding, attempt
to represent the audio signal at a reduced bit rate while minimizing the associated
increase in quantization error. Time-domain coding methods such as delta modulation
can be considered to be data-reduction codecs (other time-domain methods such
as PCM do not provide reduction). They use prediction methods on samples representing
the full bandwidth of the audio signal and yield a quantization error spectrum
that spans the audio band. Although the audibility of the error depends on
the amplitude and spectrum of the signal, the quantization error generally
is not masked by the signal. However, time-domain codecs operating across the
full bandwidth of the time-domain signal can achieve reduction ratios of up
to about 2.5:1. For example, Near Instantaneously Companded Audio Multiplex (NICAM)
codecs reduce blocks of 32 samples from 14 bits to 10 bits using a sliding
window to determine which 10 of the 14 bits can be transmitted with minimal
audible degradation. With this method, coding is lossless with low-level signals,
with increasing loss at high levels.
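A rough sketch of this sliding-window idea follows; the shift search and block
handling are illustrative only, not the actual NICAM coding ranges:

    def compand_block(samples):
        # Near-instantaneous companding: for a block of 32 signed 14-bit
        # samples, find how many top magnitude bits are unused, then keep
        # a 10-bit window. Low-level blocks (peak < 512) lose nothing.
        peak = max(abs(s) for s in samples)
        shift = 0
        while shift < 4 and peak < (1 << (12 - shift)):
            shift += 1
        drop = 4 - shift                 # LSBs discarded for this block
        mantissas = [s >> drop for s in samples]
        return mantissas, shift          # shift is sent as the scale factor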
Although data reduction is achieved, the bit rate is too high for many applications;
primarily, reduction is limited because masking is not fully exploited.
Frequency-domain codecs take a different approach.
The signal is analyzed in the frequency domain, and only the perceptually
significant parts of the signal are quantized, on the basis of psychoacoustic
characteristics of the ear. Other parts of the signal that are below the minimum
threshold, or masked by more significant signals, may be judged to be inaudible
and are not coded. In addition, quantization resolution is dynamically adapted
so that error is allowed to rise near significant parts of the signal with
the expectation that when the signal is reconstructed, the error will be masked
by the signal. This approach can yield significant data reduction. However,
codec complexity is greatly increased.
Conceptually, there are two types of frequency-domain codecs: sub-band and
transform codecs. Generally, sub-band codecs use a low number of sub-bands
and process samples adjacent in time, and transform codecs use a high number
of sub-bands and process samples adjacent in frequency. Generally, sub-band
codecs provide good time resolution and poor frequency resolution, and transform
codecs provide good frequency resolution and poor time resolution.
However, the distinction between sub-band and transform codecs is primarily
based on their separate historical development. Mathematically, all transforms
used in codecs can be viewed as filter banks. Perhaps the most practical difference
between sub-band and transform codecs is the number of bands they process.
Thus, both sub-band and transform codecs follow the architecture shown in FIG.
12; either time-domain samples or frequency-domain coefficients are quantized
according to a psychoacoustic model contained in the encoder.
In sub-band coding, a hybrid of time- and frequency domain techniques is used.
A short block of time-based broadband input samples is divided into a number
of frequency sub-bands using a filter bank of bandpass filters; this allows
determination of the energy in each sub-band.
Using a side-chain transform frequency analysis, the samples in each sub-band
are analyzed for energy content and coded according to a psychoacoustic model.
In transform coding, a block of input samples is directly applied to a transform
to obtain the block's spectrum in the frequency domain. These transform coefficients
are then quantized and coded according to a psychoacoustic model. Problematically,
a relatively long block of data is required to obtain a high-resolution spectral
representation.
Transform codecs achieve greater reduction than sub-band codecs; ratios of
4:1 to 12:1 are typical. Transform codecs incur a longer processing delay than
sub-band codecs.
FIG. 12 The basic structure of a time-frequency domain encoder and decoder
(A and B, respectively).
Sub-band (time) codecs quantize time-based samples, and transform (frequency)
codecs quantize frequency-based coefficients.
As noted, most low bit-rate lossy codecs use psychoacoustic models to analyze
the input signal in the frequency domain. To accomplish this, the time-domain
input signal is often applied to a transform prior to analysis in the model.
Any periodic signal can be represented as amplitude variations in time, or
as a set of frequency coefficients describing amplitude and phase. Jean Baptiste
Joseph Fourier first established this relationship between time and frequency.
Changes in a time-domain signal also appear as changes in its frequency-domain
spectrum. For example, a slowly changing signal would be represented by a low-frequency
spectral content. If a sequence of time-based samples are thus transformed,
the signal's spectral content can be determined over that period of time. Likewise,
the time-based samples can be recovered by inverse transforming the spectral
representation back into the time domain. A variety of mathematical transforms
can be used to transform a time domain signal into the frequency domain and
back again.
For example, the fast Fourier transform (FFT) gives a spectrum with half as
many frequency points as there are time samples. Assume, for instance, that 480
samples are taken at a 48-kHz sampling frequency. In this 10-ms interval, 240
frequency points are obtained over a spectrum from the highest frequency of
24 kHz down to the lowest of 100 Hz (the reciprocal of the 10-ms interval),
with frequency points placed 100 Hz apart. In addition, a dc point is generated.
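This example is easy to verify numerically; the short sketch below (using
NumPy's real FFT) assumes nothing beyond the numbers in the text:

    import numpy as np

    fs = 48_000
    x = np.random.randn(480)        # any 10-ms block of samples at 48 kHz
    spectrum = np.fft.rfft(x)       # dc point plus 240 frequency points
    spacing = fs / len(x)           # 100.0 Hz between frequency points
    print(len(spectrum), spacing)   # prints: 241 100.0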
Sub-band Coding
Sub-band coding was first developed at Bell Labs in the early 1980s, and much
subsequent work was done in Europe later in the decade. Blocks of consecutive
time domain samples representing the broadband signal are collected over a
short period and applied to a digital filter bank. This analysis filter bank
divides the signal into multiple (perhaps up to 32) bandlimited channels to
approximate the critical band response of the human ear.
The filter bank must provide a very sharp cutoff (perhaps 100 dB/octave) to
emulate critical band response and limit quantization noise within that bandwidth.
Only digital filters can accomplish this result. In addition, the processing
block length (ideally less than 2 ms to 4 ms) must be small so that quantization
error does not exceed the temporal masking limits of the ear. The samples in
each sub-band are analyzed and compared to a psychoacoustic model. The codec
adaptively quantizes the samples in each sub-band based on the masking threshold
in that sub-band. Ideally, the filter bank should yield sub-bands with a width
that corresponds to the width of the narrowest critical band. This would allow
precise psychoacoustic modeling. However, most filter banks producing uniformly
spaced sub-bands cannot meet this goal; this points out the difficulties posed
by the great difference in bandwidth between the narrowest critical band and
the widest.
Each sub-band is coded independently, with more or fewer bits allocated to the
samples in the sub-band, so quantization noise may be increased in a sub-band.
However, when the signal is reconstructed, the quantization noise in a sub-band
will be limited to that sub-band, where it is ideally masked by the audio signal
in that sub-band, as shown in FIG. 13. Quantization noise levels that are otherwise
intrusive can be tolerated in a sub-band with a signal contained in it because
noise will be masked by the signal. Sub-bands that do not contain an audible
signal are quantized to zero. Bit allocation is determined by a psychoacoustic
model and analysis of the signal itself; these operations are recalculated
for every sub-band in every new block of data. Samples are dynamically quantized
according to audibility of signals and noise.
There is great flexibility in the design of psychoacoustic models and bit-allocation
algorithms used in codecs that are otherwise compatible. The decoder uses the
quantized data to re-form the samples in each block; a synthesis filter bank
sums the sub-band signals to reconstruct the output broadband signal.
FIG. 13 A sub-band encoder analyzes the broadband audio signal in narrow
sub-bands. Using masking information from a psychoacoustic model, samples in
sub-bands are coarsely quantized, raising the noise floor.
When the samples are reconstructed in the decoder, the synthesis filter constrains
the quantization noise floor within each sub-band, where it is masked by the
audio signal.
A sub-band perceptual codec uses a filter bank to split a short duration of
the audio signal into multiple bands, as depicted in FIG. 14. In some designs,
a side-chain processor applies the signal to a transform such as an FFT to
analyze the energy in each sub-band. These values are applied to a psychoacoustic
model to determine the combined masking curve that applies to the signals in
that block. This permits more efficient coding of the time-domain samples. Specifically,
the encoder analyzes the energy in each sub-band to determine which sub-bands
contain audible information. A calculation is made to determine the average
power level of each sub-band over the block. This average level is used to
calculate the masking level due to signals within each sub-band, as well as
masking from signals in adjacent sub-bands.
Finally, minimum hearing threshold values are applied to each sub-band to
derive its final masking level. Peak power levels present in each sub-band
are calculated, and compared to the masking level. Sub-bands that do not contain
audible information are not coded. Similarly, tones in a sub-band that are
masked by louder nearby tones are not coded, and in some cases entire sub-bands
can mask nearby sub-bands, which thus need not be coded.
Calculations determine the ratio of peak power to masking level in each sub-band.
Quantization bits are assigned to audible program material with a priority
schedule that allocates bits to each sub-band according to signal strength
above the audibility curve. For example, FIG. 15 shows vertical lines representing
peak power levels, along with the minimum and masking thresholds.
The signals below the minimum or masking curves are not coded, and the quantization
noise floor is allowed to rise to those levels. For example, in the figure,
signal A is below the minimum curve and would not be coded in any event. Signal
C is also irrelevant in this frame because signal B has dynamically shifted
the hearing threshold upward.
Signal B must be coded; however, its presence has created a masking curve that
reduces the effective amplitude to be coded above the minimum threshold curve.
The portion of signal B between the minimum curve and the masking curve represents
the reduction in bits needed to code the signal when the masking effect is taken
into account. In other words, rather than using a signal-to-noise ratio, a
signal-to-mask ratio (SMR) is used. The SMR is the difference between the maximum
signal and the masking threshold and is used to determine the number of bits
assigned to a sub-band. The SMR is calculated for each sub-band.
The number of bits allocated to any sub-band must be sufficient to yield a
requantizing noise level that is below the masking level. The number of bits
depends on the SMR value, with the goal of maintaining the quantization noise
level below the calculated masking level for each sub-band.
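Using the roughly 6-dB-per-bit behavior of uniform quantization, the bit count
per sub-band can be sketched as below; the optional safety margin is an
assumption for illustration, not part of the text:

    import math

    def bits_for_band(smr_db, margin_db=0.0):
        # Enough bits that SNR (about 6.02 dB per bit) exceeds the SMR,
        # placing requantization noise below the masking level.
        if smr_db <= 0:
            return 0             # signal already below the masking level
        return math.ceil((smr_db + margin_db) / 6.02)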
In fixed-rate codecs, a bit-pool approach can be taken. A large number of
sub-bands requiring coding and signals with large SMR values might empty the
pool, resulting in less than optimal coding. On the other hand, if the pool
is not empty after initial allocation, the process is repeated until all bits
in the codec's data capacity have been used.
Typically, the iterative process continues, allocating more bits where required,
with signals with the highest SMR requirements always receiving the most bits;
this increases the coding margin. In some cases, sub-bands previously classified
as inaudible might receive coding from these extra bits. Thus, signals below
the masking threshold can in practice be coded, but only on a secondary priority
basis.
Summarizing the concept of sub-band coding, FIG. 16 shows how a 24-sub-band
codec might code three tones at 250 Hz, 1 kHz, and 4 kHz; note that in each
case the quantization noise level is below the combined masking and threshold
curve.
FIG. 14 A sub-band codec divides the signal into narrow sub-bands, calculates
average signal level and masking level, and then quantizes the samples in each
sub-band accordingly. A. Output of the 24-band filter bank. B. Calculation of
average level in each sub-band. C. Calculation of masking level in each sub-band.
D. Sub-bands below audibility are not coded; bands above audibility are coded.
E. Bits are allocated according to peak level above the masking threshold. Sub-bands
with peak levels above the masking level contain audible signals that must be
coded.
FIG. 15 The bit-allocation algorithm assigns bits according to audibility
of sub-band signals. Bits may not be assigned to masked or inaudible tones.
Transform Coding
In transform coding, the audio signal is viewed as a quasi-stationary signal
that changes relatively little over short time intervals. For efficient coding,
blocks of time-domain audio samples are transformed to the frequency domain.
Frequency coefficients, rather than amplitude samples, are quantized to achieve
data reduction. For playback, the coefficients are inverse-transformed back
to the time domain.
The operation of the transform approximates how the basilar membrane analyzes
the frequency content of vibrations along its length. The spectral coefficients
output by the transform are quantized according to a psychoacoustic model;
masked components are eliminated, and quantization decisions are made based
on audibility. In contrast to a sub-band codec, which uses frequency analysis
to code time-based samples, a transform codec codes frequency coefficients.
From an information theory standpoint, the transform reduces the entropy of
the signal, permitting efficient coding. Longer transform blocks provide greater
spectral resolution, but lose temporal resolution; for example, a long block
might result in a pre-echo before a transient. In many codecs, block length
is adapted according to audio signal conditions. Short blocks are used for
transient signals, while long blocks are used for continuous signals.
FIG. 16 In this 24-band sub-band codec, three tones are coded so that the
quantization noise in each sub-band falls below the calculated composite masking
curves.
(Thiele, Link, and Stoll, 1987)
Time-domain samples are transformed to the
frequency domain, yielding spectral coefficients. The coefficient numbers are
sometimes called frequency bin numbers; for example, a 512-point transform
can produce 256 frequency coefficients or frequency bins. The coefficients,
which might number 512, 1024, or more, are grouped into about 32 bands that
emulate critical-band analysis. This spectrum represents the block of time-based
input samples. The frequency coefficients in each band are quantized according
to the codec's psychoacoustic model; quantization can be uniform, nonuniform,
fixed, or adaptive in each band.
Transform codecs may use a discrete cosine transform (DCT) or modified discrete
cosine transform (MDCT) for transform coding because of low computational complexity,
and because they can critically sample (sample at twice the bandwidth of the
bandpass filter) the signal to yield an appropriate number of coefficients.
Most codecs overlap successive blocks in time by about 50%, so that each sample
appears in two different transform blocks. For example, the samples in the
first half of a current block are repeated from the second half of the previous
block. This reduces changes in spectra from block to block and improves temporal
resolution. The DCT and MDCT can yield the same number of coefficients as with
nonoverlapping blocks. As noted, an FFT may be used in the codec's side chain
to yield coefficients for perceptual modeling.
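A direct (non-fast) MDCT sketch follows, showing the critical sampling described
above: 2N windowed time samples with 50% overlap yield N coefficients. The sine
window is a common but assumed choice here:

    import numpy as np

    def mdct(block):
        # block holds 2N time samples; returns N frequency coefficients.
        n2 = len(block)
        n = n2 // 2
        win = np.sin(np.pi * (np.arange(n2) + 0.5) / n2)   # sine window
        x = win * np.asarray(block, dtype=float)
        k = np.arange(n)[:, None]
        t = np.arange(n2)[None, :]
        basis = np.cos(np.pi / n * (t + 0.5 + n / 2.0) * (k + 0.5))
        return basis @ x

Successive calls advance by N samples, so each input sample contributes to two
transform blocks, as described above.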
All low bit-rate codecs operate over a block of samples.
This block must be kept short to stay within the temporal masking limits of
the ear. During decoding, quantization noise will be spread over the frequency
of the band, and over the duration of the block. If the block is longer than
temporal backward masking allows, the noise will be heard prior to the onset
of the sound, in a phenomenon known as pre-echo. (The term pre-echo is misleading
because the artifact is unmasked noise, not an echo.)
Pre-echo is particularly problematic in the case of a silence followed by a
time-domain transient within the analysis window. The energy in the transient
portion causes the encoder to allocate relatively few bits, thus raising the
eventual quantization noise level. Pre-echoes are created in the decoder when
frequency coefficients are inverse transformed prior to the reconstruction
of sub-band samples in the synthesis filter bank. The duration of the quantization
noise equals that of the synthesis window, so the elevated noise extends over
the duration of the window, while the transient only occurs briefly. In other
words, encoding dictates that a transient in the audio signal will be accompanied
by an increase in quantization noise but a brief transient may not fully mask
the quantization noise surrounding it, as shown in FIG. 17. In this example,
the attack of a triangle occurs as a transient signal. The analysis window
of a transform codec operates over a relatively long time period. Quantization
noise is spread over the time of the window and precedes the music signal;
thus it may be audible as a pre-echo.
FIG. 17 An example of a pre-echo. On reconstruction, quantization noise falls
within the analysis block, where the leading edge is not masked by the signal.
Transform codecs are particularly affected by the problem of pre-echo because
they require long blocks for greater frequency accuracy. Short block length
limits frequency resolution (and also relatively increases the amount of overhead
side information). In essence, transform codecs sacrifice temporal resolution
for spectral resolution. Long blocks are suitable for slowly changing or tonal
signals; the frequency resolution allows the codec to identify spectral peaks
and use their masking properties in bit allocation. For example, a clarinet
note and its harmonics would require fine frequency resolution but only coarse
time resolution. However, transient signals require a short block length; the
signals have a flatter spectrum. For example, the fast transient of a castanet
click would require fine time resolution but only coarse frequency resolution.
In most transform codecs, to provide the resolution demanded by particular
signal conditions, and to avoid pre-echo, the block length dynamically adapts
to those conditions.
Referring again to FIG. 17, a shorter analysis block would constrain the quantization
noise to a shorter duration, where it will be masked by the signal. A short
block is also advantageous because it limits the duration of high bit rates
demanded by transient encoding. Alternatively, a variable bit rate encoder
can minimize pre-echo by briefly increasing the bit rate to decrease the noise
level. Some codecs use temporal noise shaping (TNS) to minimize pre-echo by
manipulating the nature of the quantization noise within a filter bank window.
When a transient signal is detected, TNS uses a predictive coding method to
shape the quantization noise to follow the transient's envelope. In this way,
the quantization error is more effectively concealed by the transient. However,
no matter what approach is taken, difficulty arises because most music simultaneously
places contradictory demands on the codec.
In adaptive transform codecs, a model is applied to uniformly and adaptively
quantize each individual band; all coefficient values within a band are quantized
with the same number of bits. The bit-allocation algorithm calculates the optimal
quantization noise in each sub-band to achieve a desired signal-to-noise ratio
that will promote masking.
Iterative allocation is used to supply additional bits as available to increase
the coding margin, yet maintain limited bit rate. In some cases, the output
bit rate can be fixed or variable for each block. Before transmission, the
reduced data is often compressed with entropy coding such as Huffman coding
and run-length coding to perform lossless compression. The decoder inversely
quantizes the coefficients and performs an inverse transform to reconstruct
the signal in the time domain.
An example of an adaptive transform codec proposed by Karlheinz Brandenburg
is shown in FIG. 18. An MDCT transforms the signal to the frequency domain.
Signal energy in each critical band is calculated using the spectral coefficients.
This is used to determine the masking threshold for each critical band. Two
iterative loops perform quantization and coding using an analysis-by-synthesis
technique. Coefficients are initially assigned a quantizer step size and the
algorithm calculates the resulting number of bits needed to code the signal
in the block. If the count exceeds the bit rate allowed for the block, the
loop reassigns a larger quantizer step size and the count is recalculated until
the target bit rate is achieved. An outer loop calculates the quantization
error as it will appear in the reconstructed signal. If the error in a band
exceeds the error allowed by the masking model, the quantizer step size in
the band is decreased. Iterations continue in both loops until optimal coding
is achieved. Codecs such as this can operate at low bit rates (for example,
2.5 bits/sample).
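The two nested loops can be sketched schematically as below; the uniform
quantizer, the crude bit count, and the step-size factors are all assumptions
for illustration, not Brandenburg's actual algorithm:

    import numpy as np

    def quantize(c, step):
        return np.round(np.asarray(c, dtype=float) / step)

    def bits_used(q):
        # Crude rate estimate standing in for real Huffman coding.
        q = np.abs(q)
        return int(np.sum(np.where(q > 0, np.log2(q + 1) + 1, 0)))

    def two_loop(bands, allowed_err, budget, iters=100):
        steps = [1.0] * len(bands)
        for _ in range(iters):
            q = [quantize(c, s) for c, s in zip(bands, steps)]
            if sum(bits_used(b) for b in q) > budget:
                steps = [s * 1.25 for s in steps]   # inner loop: coarser steps
                continue
            ok = True
            for i, (c, s) in enumerate(zip(bands, steps)):
                err = np.max(np.abs(np.asarray(c, dtype=float) - q[i] * s))
                if err > allowed_err[i]:
                    steps[i] *= 0.8                 # outer loop: finer step
                    ok = False
            if ok:
                return q, steps
        return q, steps                             # best effort after iters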
FIG. 18 Adaptive transform codec using an FFT side chain and iterative quantization
to achieve optimal reduction. Entropy coding is additionally used for data
compression.