MP3 Stereo Coding
To take advantage of redundancies between stereo channels, and to exploit
limitations in human spatial listening, Layer III allows a choice of stereo
coding methods, with four basic modes: normal stereo mode with independent
left and right channels; M/S stereo mode in which the entire spectrum is coded
with M/S; intensity stereo mode in which the lower spectral range is coded
as left/right and the upper spectral range is coded as intensity; and the intensity
and M/S mode in which the lower spectral range is coded as M/S and the upper
spectral range is coded as intensity. Each frame may have a different mode.
The partition between upper and lower spectral modes can be changed dynamically
in units of scale factor bands.
Layer III supports both M/S (middle/side) stereo coding and intensity stereo
coding. In M/S coding, certain frequency ranges of the left and right channels
are mixed as sum (middle) and difference (side) signals of the left and right
channels before quantization. In this way, stereo unmasking can be avoided.
In addition, when there is high correlation between the left and right channels,
the difference signal is further reduced to conserve bits. In intensity stereo
coding, the left and right channels of upper-frequency subbands are not coded
individually. Instead, one summed signal is transmitted along with individual
left- and right-channel scale factors indicating position in the stereo panorama.
This method retains one spectral shape for both channels in upper subbands,
but scales the magnitudes. This is effective for stationary signals, but less
effective for transient signals because they may have different envelopes in
different channels. Intensity coding may lead to artifacts such as changes
in stereo imaging, particularly for transient signals. It is used primarily
at low bit rates.
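As a concrete illustration, the sketch below shows the arithmetic behind these joint-stereo modes: M/S encoding and decoding with energy-preserving 1/√2 weighting, and intensity reconstruction from one summed spectrum. The scale-factor handling is simplified for illustration and does not follow the actual Layer III bitstream syntax.

```python
import numpy as np

def ms_encode(left, right):
    # Mix L/R into middle (sum) and side (difference) signals; the
    # 1/sqrt(2) weighting keeps the total energy unchanged.
    mid = (left + right) / np.sqrt(2)
    side = (left - right) / np.sqrt(2)
    return mid, side

def ms_decode(mid, side):
    # Invert the M/S matrix to recover the left and right channels.
    return (mid + side) / np.sqrt(2), (mid - side) / np.sqrt(2)

def intensity_decode(summed, left_scale, right_scale):
    # Upper-band intensity stereo: both channels share one spectral
    # shape; only the magnitudes differ, set by per-channel scale factors.
    return summed * left_scale, summed * right_scale

# Highly correlated channels leave little energy in the side signal,
# which is therefore cheap to code.
t = np.linspace(0, 1, 1024, endpoint=False)
left = np.sin(2 * np.pi * 440 * t)
right = 0.9 * left
mid, side = ms_encode(left, right)
print(np.max(np.abs(side)))                 # ~0.07, versus ~1.0 per channel
l2, r2 = ms_decode(mid, side)
print(np.allclose(left, l2), np.allclose(right, r2))   # True True
```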
MP3 Decoder Optimization
MP3 files can be decoded with dedicated hardware chips or software programs.
To optimize operation and decrease computation, some software decoders implement
special features. Calculation of the hybrid synthesis filter bank is the most
computationally complex aspect of the decoder.
The process can be simplified by implementing a stereo downmix to monaural
in the frequency domain, before the filter bank, so that only one filter operation
must be performed. Downmixing can be accomplished with a simple weighted sum
of the left and right channels.
However, this is not optimal because, for example, an M/S stereo or intensity-stereo
signal already contains a sum signal. More efficiently, built-in downmixing
routines can calculate the sum signal only for those scale factor bands that
are coded in left/right stereo. For M/S- and intensity-coded scale factor bands,
only scaling operations are needed.
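A minimal sketch of such a selective downmix, assuming hypothetical per-band mode tags ("lr", "ms", "intensity") in place of real bitstream syntax:

```python
import numpy as np

def downmix_to_mono(bands):
    # Frequency-domain stereo-to-mono downmix performed ahead of the
    # synthesis filter bank. `bands` is a list of (mode, data) pairs,
    # one per scale factor band.
    mono = []
    for mode, data in bands:
        if mode == "lr":             # independent channels: true weighted sum
            left, right = data
            mono.append(0.5 * (left + right))
        elif mode == "ms":           # mid = (L + R)/sqrt(2): rescale only
            mid, _side = data
            mono.append(mid / np.sqrt(2))
        elif mode == "intensity":    # one summed spectrum: rescale only
            summed, l_scale, r_scale = data
            mono.append(0.5 * (l_scale + r_scale) * summed)
    return np.concatenate(mono)
```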
To further reduce computational complexity, the hybrid filter bank can be
optimized. The filter bank consists of IMDCT and polyphase filter bank sections.
As noted, the IMDCT is executed 32 times for 18 spectral values each to transform
the spectrum of 576 values into 18 consecutive spectra of length 32. These
spectra are converted into the time domain by executing a polyphase synthesis
filter bank 18 times. The polyphase filter bank contains a frequency mapping
operation (such as a matrix multiplication) and an FIR filter with 512 coefficients.
The FIR filter calculation can be simplified by reducing the number of coefficients:
the filter coefficients can be truncated at the ends of the impulse response,
or the impulse response can be modeled with fewer coefficients. Experiments
have suggested that the filter length can be reduced by 25% without introducing
audible artifacts. More directly, computation can be reduced by limiting the
output audio bandwidth. The high-frequency spectral values can be set to zero;
an IMDCT with all input samples set to zero does not have to be calculated.
If only the lower halves of the IMDCTs are calculated, the audio bandwidth
is limited. The output can be downsampled by a factor of 2, so that computation
for every second output value can be skipped, thus cutting the FIR calculation
in half.
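The bandwidth-limiting idea can be sketched as follows, using the standard 18-in/36-out long-block IMDCT in direct form; windowing, overlap-add, and the polyphase stage are omitted, and the `synthesize` helper is hypothetical.

```python
import numpy as np

def imdct36(X):
    # Direct-form IMDCT for MP3 long blocks: 18 spectral values in,
    # 36 time-domain samples out.
    n = np.arange(36)
    k = np.arange(18)
    basis = np.cos(np.pi / 72.0 * np.outer(2 * n + 1 + 18, 2 * k + 1))
    return basis @ X

def synthesize(spectrum, keep_subbands=32):
    # Transform a 576-line spectrum into 32 subband blocks of 36 samples,
    # skipping the IMDCT entirely for subbands whose inputs are all zero.
    out = np.zeros((32, 36))
    for sb in range(min(keep_subbands, 32)):
        out[sb] = imdct36(spectrum[sb * 18:(sb + 1) * 18])
    return out

# Keeping only the lower 16 subbands halves the IMDCT work and limits
# the output audio bandwidth to roughly half the original.
blocks = synthesize(np.random.randn(576), keep_subbands=16)
```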
There are many nonstandard codecs that produce MP3-compliant bitstreams; they
vary greatly in performance quality. LAME is an example of a fast, high-quality,
royalty-free codec that produces an MP3-compliant bitstream.
LAME is open-source, but using LAME may require a patent license in some countries.
LAME is available at sourceforge.net. MP3 Internet applications are discussed
in Section 15.
MPEG-1 Psychoacoustic Model 1
The MPEG-1 standard suggests two psychoacoustic models that determine the
minimum masking threshold for inaudibility. The models are only informative
in the standard; their use is not mandated. The models are used only in the
encoder. In both cases, the difference between the maximum signal level and
the masking threshold is used by the bit allocator to set the quantization
levels.
Generally, model 1 is applied to Layers I and II and model 2 is applied to
Layer III.
Psychoacoustic model 1 proposes a low-complexity method to analyze spectral
data and output signal-to-mask ratios. Model 1 performs these nine steps:
1. Perform FFT analysis: A 512- or 1024-point fast Fourier transform is used
to transform time-aligned time-domain data to the frequency domain; a Hann
window with an overlap of 32 or 64 samples, respectively, between adjacent
blocks reduces edge effects. An appropriate delay is applied to time-align the
psychoacoustic model's output. The signal is normalized to a maximum value
of 96 dB SPL, calibrating the signal's minimum value to the absolute threshold
of hearing.
2. Determine the sound pressure level: The maximum SPL is calculated for each
subband by choosing the greater of the maximum amplitude spectral line in the
subband or the maximum scale factor that accounts for low-level spectral lines
in the subband.
3. Consider the threshold in quiet: An absolute hearing threshold in the absence
of any signal is given; this forms the lower masking bound. An offset is applied
depending on the bit rate.
4. Find tonal and nontonal components: Tonal (sinusoidal) and nontonal
(noise-like) components in the signal are identified (see the sketch following
this list). First, local maxima in the spectral components are identified relative
to bandwidths of varying size. Components that stand out within a critical band
by 7 dB or more are labeled as tonal and their sound-pressure level is calculated.
Intensities of the remaining components, assumed to be nontonal, within each
critical band are summed and their SPL is calculated for each critical band.
The nontonal maskers are centered in each critical band.
5. Decimation of tonal and nontonal masking components: The number of maskers
is reduced to obtain only the relevant maskers. Relevant maskers are those
with magnitude that exceeds the threshold in quiet, and those tonal components
that are strongest within 1/2 Bark.
6. Calculate individual masking thresholds: The total number of masker frequency
bins is reduced (for example, in Layer I at 48 kHz, 256 is reduced to 102)
and maskers are relocated. Noise masking thresholds for each subband, accounting
for tonal and nontonal components and their different downward shifts, are
determined by applying a masking (spreading) function to the signal. Calculations
use a masking index and masking function to describe masking effects on adjacent
frequencies. The masking index is an attenuation factor based on critical-band
rate. The piecewise masking function is an attenuation factor with different
lower and upper slopes between −3 and +8 Bark that vary with respect to the
distance to the masking component and the component's magnitude.
When the subband is wide compared to the critical band, the spectral model
can select a minimum threshold; when it is narrow, the model averages the thresholds
covering the subband.
7. Calculate the global masking threshold: The powers corresponding to the
upper and lower slopes of individual subband masking curves, as well as a given
threshold of hearing (threshold in quiet), are summed to form a composite global
masking contour. The final global masking threshold is thus a signal-dependent
modification of the absolute threshold of hearing as affected by tonal and
nontonal masking components across the basilar membrane.
8. Determine the minimum masking threshold: The minimum masking level is calculated
for each subband.
9. Calculate the signal-to-mask ratio: Signal-to-mask ratios are determined
for each subband, based on the global masking threshold. The difference between
the maximum SPL and the minimum masking threshold determines the SMR value
in each subband; this value is supplied to the bit allocator.
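A minimal sketch of steps 1 and 4 above: a Hann-windowed FFT normalized toward 96 dB SPL, followed by detection of local maxima that stand out by 7 dB or more. The fixed ±3-bin search span and the simple level floor are illustrative simplifications; the standard varies the neighborhood with frequency and uses the actual threshold in quiet.

```python
import numpy as np

def spl_spectrum(x, n_fft=512):
    # Step 1: Hann window, FFT, and normalization so the largest
    # component sits at 96 dB SPL.
    w = np.hanning(n_fft)
    db = 20 * np.log10(np.abs(np.fft.rfft(x[:n_fft] * w)) + 1e-12)
    return db - db.max() + 96.0

def find_tonal(db, delta=7.0, span=3, floor=40.0):
    # Step 4 (simplified): a bin is tonal if it is a local maximum,
    # lies above a crude threshold-in-quiet stand-in (`floor`), and
    # exceeds the bins 2..span away on both sides by at least delta dB.
    tonal = []
    for k in range(span, len(db) - span):
        if db[k] < floor or not (db[k] > db[k - 1] and db[k] >= db[k + 1]):
            continue
        far = [k + j for j in range(2, span + 1)] + \
              [k - j for j in range(2, span + 1)]
        if all(db[k] - db[j] >= delta for j in far):
            tonal.append(k)
    return tonal

fs = 44100
t = np.arange(512) / fs
x = np.sin(2 * np.pi * (12 * fs / 512) * t)   # tone centered on bin 12
print(find_tonal(spl_spectrum(x)))            # [12]
```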
The principal steps in the operation of model 1 can be illustrated with a
test signal that contains a band of noise, as well as prominent tonal components.
The model analyzes one block of the 16-bit test signal sampled at 44.1 kHz.
FIG. 11A shows the audio signal as output by the FFT; the model has identified
the local maxima. The figure also shows the absolute threshold of hearing used
in this particular example (offset by -12 dB). FIG. 11B shows tonal components
marked with a "+" and nontonal components marked with a "o." FIG. 11C shows the masking functions assigned to tonal maskers after decimation.
The peak SMR (about 14.5 dB) corresponds to that used for tonal maskers. FIG. 11D shows the masking functions assigned to nontonal maskers after decimation.
The peak SMR (about 5 dB) corresponds to that used for nontonal maskers. FIG. 11E shows the final global masking curve obtained by combining the individual
masking thresholds. The higher of the global masking curve and the absolute
threshold of hearing is used as the final global masking curve. FIG. 11F
shows the minimum masking threshold. From this, SMR values can be calculated
in each subband.
FIG. 11 Operation of MPEG-1 model 1 is illustrated using a test signal.
A. Local maxima and absolute threshold. B. Tonal and nontonal components. C.
Tonal masking. D. Nontonal masking. E. Masking threshold. F. Minimum masking
threshold.
To further explain the operation of model 1, additional comments are given
here. The delay in the 512-point analysis filter bank is 256 samples, and centering
the 384 new samples in the 512-point Hann window adds (512 − 384)/2 = 64 samples.
An offset of 256 + 64 = 320 samples is therefore needed to time-align the model's
output with the 384 samples being coded.
The masking (spreading) function vf used in model 1 is described in terms of
piecewise slopes (in dB):

vf = 17(dz + 1) − (0.4 X[z(j)] + 6)       for −3 ≤ dz < −1
vf = (0.4 X[z(j)] + 6) dz                 for −1 ≤ dz < 0
vf = −17 dz                               for 0 ≤ dz < 1
vf = −(dz − 1)(17 − 0.15 X[z(j)]) − 17    for 1 ≤ dz < 8

where dz = z(i) − z(j) is the distance in Bark between the maskee and masker
frequency; i and j are index values of spectral lines of the maskee and masker,
respectively; and X[z(j)] is the sound pressure level of the jth masking component
in dB. Values outside −3 and +8 Bark are not considered in this model.
Model 1 uses this general approach to detect and characterize tonality in
audio signals: An FFT is applied to 512 or 1024 samples, and the components
of the spectrum analysis are considered. Local maxima in the spectrum are identified
as having more energy than adjacent components. These components are decimated
such that a tonal component closer than 1/2 Bark to a stronger tonal component
is discarded. Tonal components below the threshold of hearing are discarded
as well. The energies of groups of remaining components are summed to represent
tonal components in the signal; other components are summed and marked as nontonal.
A binary designation is given: tonal components are assigned 1, and nontonal
components are assigned 0. This information is presented to the bit allocation
algorithm. Specifically, in model 1, tonality is determined by detecting local
maxima that exceed their spectral neighborhood by 7 dB. To derive the masking
threshold relative to the masker, a level shift is applied; the nature of the
shift depends on whether the masker is tonal or nontonal:
ΔT(z) = −6.025 − 0.275z dB
ΔN(z) = −2.025 − 0.175z dB
where z is the frequency of the masker in Bark.
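Combining the two, the individual threshold that a masker casts at a maskee's position is the masker level plus the masking index plus the masking function; a minimal sketch of the relations given above:

```python
def masking_function(dz, X):
    # Piecewise masking (spreading) function vf in dB; dz is the
    # maskee-masker distance in Bark, X the masker SPL in dB.
    if -3 <= dz < -1:
        return 17 * (dz + 1) - (0.4 * X + 6)
    if -1 <= dz < 0:
        return (0.4 * X + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz < 8:
        return -(dz - 1) * (17 - 0.15 * X) - 17
    return float("-inf")      # beyond -3..+8 Bark, masking is ignored

def masking_index(z, tonal):
    # Attenuation that depends on the critical-band rate z (in Bark)
    # and on whether the masker is tonal or nontonal.
    return (-6.025 - 0.275 * z) if tonal else (-2.025 - 0.175 * z)

def individual_threshold(X, z_masker, z_maskee, tonal):
    dz = z_maskee - z_masker
    return X + masking_index(z_masker, tonal) + masking_function(dz, X)

# A 70-dB tonal masker at 10 Bark, evaluated 1 Bark above the masker:
print(individual_threshold(70.0, 10.0, 11.0, tonal=True))   # ~44.2 dB
```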
Model 1 considers all the nontonal components in a critical band and represents
them with one value at one frequency. This is appropriate at low frequencies
where subbands and critical bands have good correspondence, but can be inefficient
at high frequencies where a critical band spans many subbands. A subband that
is distant from the single frequency chosen to represent the nontonal components
of its critical band may not receive a correct nontonal evaluation.
MPEG-1 Psychoacoustic Model 2
Psychoacoustic model 2 performs a more detailed analysis than model 1, at
the expense of greater computational complexity. It is designed for lower bit
rates than model 1.
As in model 1, model 2 outputs a signal-to-mask ratio for each subband; however,
its approach is significantly different. It contours the noise floor of the
signal represented by many spectral coefficients in a way that is more accurate
than that allowed by coarse subband coding. Also, the model uses an unpredictability
measure to examine the side-chain data for tonal or nontonal qualities. Model
2 performs these 14 steps:
1. Reconstruct input samples: A set of 1024 input samples is assembled.
2. Calculate the complex spectrum: The time-aligned input signal is windowed
with a 1024-point Hann window; alternatively, a shorter window may be used.
An FFT is computed and output represented in magnitude and phase.
3. Calculate the predicted magnitude and phase: The predicted magnitude and
phase are determined by extrapolation from the two preceding threshold blocks.
4. Calculate the unpredictability measure: The unpredictability measure is
computed using the Euclidean distance between the predicted and actual values
in the magnitude/phase domain (see the sketch following this list). To reduce
complexity, the measure may be computed only for lower frequencies and assumed
constant for higher frequencies.
5. Calculate the energy and unpredictability in the partitions: The energy
magnitude and the weighted unpredictability measure in each threshold calculation
partition are calculated. A partition has a resolution of one spectral line
(at low frequencies) or 1/3 critical band (at high frequencies), whichever
is wider.
6. Convolve energy and unpredictability with the spreading function: The energy
and the unpredictability measure in threshold calculation partitions are each
convolved with a cochlear spreading function. Values are renormalized.
7. Derive tonality index: The unpredictability measures are converted to tonality
indices ranging from 0 (high unpredictability) to 1 (low unpredictability).
This determines the relative tonality of the maskers in each threshold calculation
partition.
8. Calculate the required signal-to-noise ratio: An SNR is calculated for
each threshold calculation partition using tonality to interpolate an attenuation
shift factor between noise-masking-tone (NMT) and tone-masking-noise (TMN).
The interpolated shift ranges upward from 5.5 dB for pure NMT toward the larger
TMN value. The final shift value is the higher of the interpolated value or a
frequency-dependent minimum value.
9. Calculate power ratio: The power ratio of the SNR is calculated for each
threshold calculation partition.
10. Calculate energy threshold: The actual energy threshold is calculated
for each threshold calculation partition.
11. Spread threshold energy: The masking threshold energy is spread over FFT
lines corresponding to threshold calculation partitions to represent the masking
in the frequency domain.
12. Calculate final energy threshold of audibility: The spread threshold energy
is compared to values in absolute threshold of quiet tables, and the higher
value is used (not the sum) as the energy threshold of audibility. This is
because it is wasteful to specify a noise threshold lower than the level that
can be heard.
13. Calculate pre-echo control: A narrow-band pre-echo control used in the
Layer III encoder is calculated, to prevent audibility of the error signal
spread in time by the synthesis filter. The calculation lowers the masking
threshold after a quiet signal. The calculation takes the minimum of the comparison
of the current threshold with the scaled thresholds of two previous blocks.
14. Calculate signal-to-mask ratios: Threshold calculation partitions are
converted to codec partitions (scale factor bands). The SMR (energy in each
scale factor band divided by noise level in each scale factor band) is calculated
for each partition and expressed in decibels.
The SMR values are forwarded to the allocation algorithm.
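A minimal sketch of steps 3, 4, and 7 above (magnitude/phase prediction, unpredictability, and tonality index). The mapping T = −0.299 − 0.43 ln C, clamped to [0, 1], is the commonly cited form; real implementations differ in detail.

```python
import numpy as np

def unpredictability(r2, f2, r1, f1, r0, f0):
    # Steps 3-4: linearly extrapolate magnitude r and phase f from the
    # two previous blocks (r2 oldest), then measure the normalized
    # Euclidean distance to the actual values (r0, f0).
    r_pred = 2 * r1 - r2
    f_pred = 2 * f1 - f2
    dist = np.hypot(r0 * np.cos(f0) - r_pred * np.cos(f_pred),
                    r0 * np.sin(f0) - r_pred * np.sin(f_pred))
    return dist / (r0 + abs(r_pred) + 1e-12)

def tonality_index(c):
    # Step 7: map unpredictability to 0 (noise-like) .. 1 (tonal).
    return float(np.clip(-0.299 - 0.43 * np.log(max(c, 1e-12)), 0.0, 1.0))

# A steady sinusoidal bin advances its phase at a constant rate, so it
# is perfectly predicted: C ~ 0 and T ~ 1 (tonal).
c = unpredictability(1.0, 0.2, 1.0, 0.4, 1.0, 0.6)
print(c, tonality_index(c))    # ~0.0 1.0
```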
The principal steps in the operation of model 2 can be illustrated with a
test signal that contains three prominent tonal components. The model analyzes
a set of 1024 input samples of the 16-bit test signal sampled at 44.1 kHz.
FIG. 12A shows the magnitude of the audio signal as output by the FFT;
the phase is also computed. Following prediction of magnitude and phase, the
unpredictability measure is computed, as shown in FIG. 12B, using the Euclidean
distance between the predicted and actual values in the magnitude/phase domain.
When the measure equals 0, the current value is completely predicted. FIG.
12C shows the energy magnitude in each partition and the spreading functions
that are applied. FIG. 12D shows the tonality index derived from the unpredictability
measure; the tonality index ranges from 0 (high unpredictability and noise-like)
to 1 (low unpredictability and tonal). FIG. 12E shows the spread masking
threshold energy in the frequency domain and the absolute threshold of quiet;
the higher value is used to find the energy threshold of audibility. FIG.
12F shows signal-to-mask ratios (energy in each scale factor band divided by
noise level in each scale factor band) in codec partitions.
To further explain the operation of model 2, additional comments are given
here. The spreading function used in model 2 is:
10 log10 SF(dz) = 15.8111389 + 7.5(1.05dz + 0.474) − 17.5[1.0 + (1.05dz + 0.474)^2]^(1/2)
+ 8 MIN[(1.05dz − 0.5)^2 − 2(1.05dz − 0.5), 0] dB
where dz is the distance in Bark between the maskee and masker frequency.
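The formula translates directly to code; evaluating it at a few Bark distances shows the steep attenuation below the masker (negative dz) and the shallower slope above it.

```python
import math

def spreading_db(dz):
    # Model 2 spreading function in dB; dz in Bark (maskee minus masker).
    a = 1.05 * dz + 0.474
    b = 1.05 * dz - 0.5
    return (15.8111389 + 7.5 * a - 17.5 * math.sqrt(1.0 + a * a)
            + 8.0 * min(b * b - 2.0 * b, 0.0))

for dz in (-3, -1, 0, 1, 3):
    print(dz, round(spreading_db(dz), 1))   # e.g., ~0 dB at dz = 0
```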
The spectral flatness measure (SFM), devised by James Johnston, measures the
average or global tonality of the segment. SFM is the ratio of the geometric
mean of the power spectrum to its arithmetic mean. The value is converted to
decibels and referenced to -60 dB to provide a coefficient of tonality ranging
continuously from 0 (nontonal) to 1 (tonal). This coefficient can be used to
interpolate between TMN and NMT models. SFM leads to very conservative masking
decisions for nontonal parts of a signal. More efficiently, specific tonal
and nontonal regions within a segment can be identified. This local tonality
can be measured as the normalized Euclidean distance between the actual and
predicted values over two successive segments, for amplitude and phase. On
this basis, unpredictability can be computed for narrow frequency partitions
and used to create tonality metrics that interpolate between tone and noise
models.
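A minimal sketch of the SFM-based global tonality coefficient described above:

```python
import numpy as np

def tonality_coefficient(power_spectrum):
    # SFM = geometric mean / arithmetic mean of the power spectrum,
    # in dB, referenced to -60 dB and clamped to the range 0..1.
    p = np.asarray(power_spectrum, dtype=float) + 1e-12
    sfm_db = 10 * np.log10(np.exp(np.mean(np.log(p))) / np.mean(p))
    return min(max(sfm_db / -60.0, 0.0), 1.0)

flat = np.ones(256)                             # noise-like: SFM = 0 dB
peaky = np.full(256, 1e-9)
peaky[32] = 1.0                                 # one dominant tone
print(tonality_coefficient(flat))               # 0.0 (nontonal)
print(tonality_coefficient(peaky))              # 1.0 (tonal)
```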
FIG. 12 Operation of MPEG-1 model 2 is illustrated using a test signal.
A. Magnitude of FFT. B. Unpredictability measure. C. Energy and spreading functions.
D. Tonality index. E. Threshold energy and absolute threshold. F. Signal-to-mask
ratios. (Boley and Rao, 2004)
Specifically, in model 2, a tonality index is created, on the basis of the
predictability of the audio signal's spectral components in a partition in
two successive frames. Tonal components are more accurately predicted. Amplitude
and phase are predicted to form an unpredictability measure C.
When C = 0, the current value is completely predicted, and when C = 1, the
predicted values maximally differ from the actual values. This yields the tonality
index T ranging from 0 (high unpredictability and noise-like) to 1 (low
unpredictability and tonal). For example, the audio signal's strongly tonal and
nontonal areas are evident in FIG. 12D. The tonality index is used to calculate
a frequency-dependent shift, for example, interpolating values from 6 dB (nontonal)
to 29 dB (tonal).
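The interpolation itself is a one-line calculation; the 6-dB nontonal and 29-dB tonal endpoints are the values cited above.

```python
def required_shift_db(tonality, tmn=29.0, nmt=6.0):
    # Interpolate the threshold shift between the noise-masking-tone
    # and tone-masking-noise cases using the tonality index.
    return tonality * tmn + (1.0 - tonality) * nmt

print(required_shift_db(0.0))   # 6.0 dB: noise-like masker
print(required_shift_db(1.0))   # 29.0 dB: a tonal masker masks less
```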
When used in a Layer III encoder, model 2 is modified.
The model is executed twice, once with a long block and once with a short
256-sample block. These values are used in the unpredictability measure calculation.
A slightly different spreading function is used. The NMT shift is changed to
6.0 dB and a fixed TMN shift of 29.0 dB is used. As noted, a pre-echo control
is calculated.
Perceptual entropy is calculated as the logarithm of the geometric mean of
the normalized spectral energy in a partition. This predicts the minimum number
of bits needed for transparency. High values are used to identify transient
attacks, and thus to determine block size in the encoder. In addition, model
2 accepts the minimum masking threshold at low frequencies where there is good
correspondence between subbands and critical bands, and it uses the average
of the thresholds at higher frequencies where subbands are narrow compared
to critical bands.
Much research has been done since the informative model 2 was published in
the MPEG-1 standard. Thus, most practical encoders use models that offer better
performance, even if they are based on the informative model. An encoder that
follows the informative documentation literally will not provide good results
compared to more sophisticated implementations.
MPEG-2 Audio Standard
The MPEG-2 audio standard was designed for applications ranging from Internet
downloading to high-definition digital television (HDTV) transmission. It provides
a backward-compatible path to multichannel sound and a low sampling frequency
provision, as well as a non-backward-compatible multichannel format known as
Advanced Audio Coding (AAC). The MPEG-2 audio standard encompasses the MPEG-1
audio standard of Layers I, II, and III, using the same encoding and decoding
principles as MPEG-1. In many cases, the same layer algorithms developed for
MPEG-1 applications are used for MPEG-2 applications. Multichannel MPEG-2 audio
is backward compatible with MPEG-1. An MPEG-2 decoder will accept an MPEG-1
bitstream and an MPEG-1 decoder can derive a stereo signal from an MPEG-2 bitstream.
However, MPEG-2 also permits use of incompatible audio codecs.
One part of the MPEG-2 standard provides multichannel sound at sampling frequencies
of 32, 44.1, and 48 kHz.
Because it is backward compatible with MPEG-1, it is designated as BC (backward
compatible), that is, MPEG-2 BC. Clearly, because there is more redundancy
between six channels than between two, greater coding efficiency is achieved.
Overall, 5.1 channels can be successfully coded at rates from 384 kbps to 640
kbps. MPEG-2 also supports monaural and stereo coding at sampling frequencies
of 16, 22.05, and 24 kHz, using Layers I, II, and III. The MPEG-1 and -2
audio coding family is shown in FIG. 13. The MPEG-2 audio standard was approved
by the MPEG committee in November 1994 and is specified in ISO/IEC 13818-3.
FIG. 13 The MPEG-2 audio standard adds monaural/stereo coding at low
sampling frequencies, multichannel coding, and AAC. The three MPEG-1 layers
are supported.
The multichannel MPEG-2 BC format uses a five-channel approach sometimes referred
to as 3/2 + 1 stereo (3 front and 2 surround channels + subwoofer). The
low-frequency effects (LFE) subwoofer channel is optional, providing an audio range
up to 120 Hz. A hierarchy of formats is created in which 3/2 may be downmixed
to 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0. The multichannel MPEG-2 BC format uses
an encoder matrix that allows a two-channel decoder to decode a compatible
two-channel signal that is a subset of a multichannel bitstream. The multiple
channels of MPEG-2 are matrixed to form compatible MPEG-1 left/right channels,
as well as other MPEG-2 channels, as shown in FIG. 14. The MPEG-1 left and
right channels are replaced by matrixed MPEG-2 left and right channels and
these are encoded into backward-compatible MPEG frames with an MPEG-1 encoder.
Additional multichannel data is placed in the expanded ancillary data field.
FIG. 14 The MPEG-2 audio encoder and decoder showing how a 5.1-channel
surround format can be achieved with backward compatibility with MPEG-1.
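The matrixing and dematrixing arithmetic shown in FIG. 14 can be sketched as follows. The 1/√2 weights and the overload-preventing normalization used here are one common, illustrative choice; ISO/IEC 13818-3 defines several matrix modes.

```python
import numpy as np

A = 1.0 / np.sqrt(2)          # center/surround weight (illustrative)
G = 1.0 / (1.0 + np.sqrt(2))  # normalization against overload (illustrative)

def matrix_for_bc(L, R, C, Ls, Rs):
    # Fold five channels into a compatible stereo pair Lo/Ro that an
    # MPEG-1 decoder can play directly; C, Ls, and Rs travel in the
    # expanded ancillary data field.
    Lo = G * (L + A * C + A * Ls)
    Ro = G * (R + A * C + A * Rs)
    return Lo, Ro, C, Ls, Rs

def dematrix(Lo, Ro, C, Ls, Rs):
    # A multichannel decoder recovers L and R by subtracting the
    # matrixed contributions of the other channels.
    L = Lo / G - A * C - A * Ls
    R = Ro / G - A * C - A * Rs
    return L, R, C, Ls, Rs
```

Note that quantization noise added to Lo and Ro after matrixing is not removed by dematrixing, which is the source of the unmasking artifact described below.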
To efficiently code multiple channels, MPEG-2 BC uses techniques such as dynamic
crosstalk reduction, adaptive interchannel prediction, and center channel phantom
image coding. With dynamic crosstalk reduction, as with intensity coding, multichannel
high-frequency information is combined and conveyed along with scale factors
to direct levels to different playback channels. In adaptive prediction, a
prediction error signal is conveyed for the center and surround channels. The
high-frequency information in the center channel can be conveyed through the
front left and right channels as a phantom image.
MPEG-2 BC can achieve a combined bit rate of 384 kbps, using Layer II at a
48-kHz sampling frequency.
MPEG-2 allows for audio bit rates up to 1066 kbps. To accommodate this, the
MPEG-2 frame is divided into two parts. The first part is an MPEG-1-compatible
stereo section with Layer I data up to 448 kbps, Layer II data up to 384 kbps,
or Layer III data up to 320 kbps. The MPEG-2 extension part contains all other
surround data.
A standard two-channel MPEG-1 decoder ignores the ancillary information, and
reproduces the front main channels. In some cases, the dematrixing procedure
in the decoder can yield an artifact in which the sound in a channel is mainly
phase canceled but the quantization noise is not, and thus becomes audible.
This limitation of spatial unmasking in MPEG-2 BC is a direct result of the
matrixing used to achieve backward compatibility with the original two-channel
MPEG standard. In part, it can be addressed by increasing the bit rate of the
coded signals.
MPEG-2 also specifies Layers I, II, and III at low sampling frequencies (LSF)
of 16, 22.05, and 24 kHz. This extension is not backward compatible with MPEG-1
codecs. This portion of the standard is known as MPEG-2 LSF. At these low bit
rates, Layer III generally shows the best performance. Only minor changes in
the MPEG-1 bit rate and bit allocation tables are necessary to adapt this LSF
format. The relative improvement in quality stems from the improved frequency
resolution of the polyphase filter bank in low- and mid-frequency regions;
this allows more efficient application of masking. Layers I and II benefit more
from this extension than Layer III because Layer III already has good frequency
resolution. The bitstream is unchanged in the LSF mode and the same frame format
is used. For 24-kHz sampling, the frame length is 16 ms for Layer I and 48
ms for Layer II. The frame length of Layer III is decreased relative to that
of MPEG-1. In addition, the unofficial "MPEG-2.5" format supports sampling
frequencies of 8, 11.025, and 12 kHz with the corresponding decrease in audio
bandwidth; implementations use Layer III as the codec. Many MP3 codecs support
the original MPEG-1 Layer III codec as well as the MPEG-2 and MPEG-2.5 extensions
for lower sampling frequencies.
The menu of data rates, fidelity, and layer compatibility provided by MPEG
is useful in a wide variety of applications such as computer multimedia, CD-ROM,
DVD-Video, computer disks, local area networks, studio recording and editing,
multichannel disk recording, ISDN transmission, digital audio broadcasting,
and multichannel digital television. Numerous C and C++ programs performing
MPEG-1 and -2 audio coding and decoding can be downloaded from a number of
Internet file sites, and executed on personal computers. The backward compatible
format, using Layer II coding, is used for the soundtracks of some DVD-Video
discs. However, a matrix approach to surround sound does not preserve spatial
fidelity as well as discrete channel coding.