| 
 
  MP3 Stereo Coding

  To take advantage of redundancies between stereo channels, and to exploit
  limitations in human spatial listening, Layer III allows a choice of stereo
  coding methods, with four basic modes: normal stereo mode with independent
  left and right channels; M/S stereo mode in which the entire spectrum is coded
  with M/S; intensity stereo mode in which the lower spectral range is coded
  as left/right and the upper spectral range is coded as intensity; and the intensity
  and M/S mode in which the lower spectral range is coded as M/S and the upper
  spectral range is coded as intensity. Each frame may have a different mode.

  The partition between upper and lower spectral modes can be changed dynamically
  in units of scale factor bands.

  Layer III supports both M/S (middle/side) stereo coding and intensity stereo
  coding. In M/S coding, certain frequency ranges of the left and right channels
  are mixed as sum (middle) and difference (side) signals of the left and right
  channels before quantization. In this way, stereo unmasking can be avoided.
  In addition, when there is high correlation between the left and right channels,
  the difference signal is further reduced to conserve bits. In intensity stereo
  coding, the left and right channels of upper-frequency subbands are not coded
  individually. Instead, one summed signal is transmitted along with individual
  left- and right-channel scale factors indicating position in the stereo panorama.
  This method retains one spectral shape for both channels in upper subbands,
  but scales the magnitudes. This is effective for stationary signals, but less
  effective for transient signals because they may have different envelopes in
  different channels. Intensity coding may lead to artifacts such as changes
  in stereo imaging, particularly for transient signals. It is used primarily
  at low bit rates.

  MP3 Decoder Optimization

  MP3 files can be decoded with dedicated hardware chips or software programs.
  To optimize operation and decrease computation, some software decoders implement
  special features. Calculation of the hybrid synthesis filter bank is the most
  computationally complex aspect of the decoder.

  The process can be simplified by implementing a stereo downmix to monaural
  in the frequency domain, before the filter bank, so that only one filter operation
  must be performed. Downmixing can be accomplished with a simple weighted sum
  of the left and right channels.

  However, this is not optimal because, for example, an M/S stereo or intensity-stereo
  signal already contains a sum signal. More efficiently, built-in downmixing
  routines can calculate the sum signal only for those scale factor bands that
  are coded in left/right stereo. For M/S- and intensity-coded scale factor bands,
  only scaling operations are needed.

  To further reduce computational complexity, the hybrid filter bank can be
  optimized. The filter bank consists of IMDCT and polyphase filter bank sections.
  As noted, the IMDCT is executed 32 times for 18 spectral values each to transform
  the spectrum of 576 values into 18 consecutive spectra of length 32. These
  spectra are converted into the time domain by executing a polyphase synthesis
  filter bank 18 times. The polyphase filter bank contains a frequency mapping
  operation (such as matrix multiplication) and a FIR filter with 512 coefficients.
  The FIR filter calculation can be simplified by reducing the number of coefficients:
  the filter coefficients can be truncated at the ends of the impulse response,
  or the impulse response can be modeled with fewer coefficients. Experiments
  have suggested that filter length can be reduced by 25% without yielding additional
  audible artifacts. More directly, computation can be reduced by limiting the
  output audio bandwidth. The high-frequency spectral values can be set to zero;
  an IMDCT with all input samples set to zero does not have to be calculated.
  If only the lower halves of the IMDCTs are calculated, the audio bandwidth
  is limited. The output can be downsampled by a factor of 2, so that computation
  for every second output value can be skipped, thus cutting the FIR calculation
  in half.

  There are many nonstandard codecs that produce MP3-compliant bitstreams; they
  vary greatly in performance quality. LAME is an example of a fast, high-quality,
  royalty-free codec that produces an MP3-compliant bitstream.

  LAME is open-source, but using LAME may require a patent license in some countries.
  LAME is available at sourceforge.net. MP3 Internet applications are discussed
  in section 15.

  MPEG-1 Psychoacoustic Model 1

  The MPEG-1 standard suggests two psychoacoustic models that determine the
  minimum masking threshold for inaudibility. The models are only informative
  in the standard; their use is not mandated. The models are used only in the
  encoder. In both cases, the difference between the maximum signal level and
  the masking threshold is used by the bit allocator to set the quantization
  levels.

  Generally, model 1 is applied to Layers I and II and model 2 is applied to
  Layer III.

  Psychoacoustic model 1 proposes a low-complexity method to analyze spectral
  data and output signal-to-mask ratios. Model 1 performs these nine steps:

  1. Perform FFT analysis: A 512- or 1024-point fast Fourier transform, with
  a Hann window with adjacent overlapping of 32 or 64 samples, respectively,
  to reduce edge effects, is used to transform time-aligned time-domain data
  to the frequency domain. An appropriate delay is applied to time-align the
  psychoacoustic model's output. The signal is normalized to a maximum value
  of 96 dB SPL, calibrating the signal's minimum value to the absolute threshold
  of hearing.

  2. Determine the sound pressure level: The maximum SPL is calculated for each
  subband by choosing the greater of the maximum amplitude spectral line in the
  subband or the maximum scale factor that accounts for low-level spectral lines
  in the subband.

  3. Consider the threshold in quiet: An absolute hearing threshold in the absence
  of any signal is given; this forms the lower masking bound. An offset is applied
  depending on the bit rate.

  4. Find tonal and nontonal components: Tonal (sinusoidal) and nontonal
  (noise-like) components in the signal are identified. First, local maxima in
  the spectral components are identified relative to bandwidths of varying size.
  Components that are locally prominent in a critical band by +7 dB are labeled
  as tonal and their sound-pressure level is calculated. Intensities of the remaining
  components, assumed to be nontonal, within each critical band are summed and
  their SPL is calculated for each critical band. The nontonal maskers are centered
  in each critical band.

  5. Decimate tonal and nontonal masking components: The number of maskers
  is reduced to obtain only the relevant maskers. Relevant maskers are those
  with magnitude that exceeds the threshold in quiet, and those tonal components
  that are strongest within 1/2 Bark.

  6. Calculate individual masking thresholds: The total number of masker frequency
  bins is reduced (for example, in Layer I at 48 kHz, 256 is reduced to 102)
  and maskers are relocated. Noise masking thresholds for each subband, accounting
  for tonal and nontonal components and their different downward shifts, are
  determined by applying a masking (spreading) function to the signal. Calculations
  use a masking index and masking function to describe masking effects on adjacent
  frequencies. The masking index is an attenuation factor based on critical-band
  rate. The piecewise masking function is an attenuation factor with different
  lower and upper slopes between -3 and +8 Bark that vary with respect to the
  distance to the masking component and the component's magnitude.

  When the subband is wide compared to the critical band, the spectral model
  can select a minimum threshold; when it is narrow, the model averages the thresholds
  covering the subband.

  7. Calculate the global masking threshold: The powers corresponding to the
  upper and lower slopes of individual subband masking curves, as well as a given
  threshold of hearing (threshold in quiet), are summed to form a composite global
  masking contour. The final global masking threshold is thus a signal-dependent
  modification of the absolute threshold of hearing as affected by tonal and
  nontonal masking components across the basilar membrane.

  8. Determine the minimum masking threshold: The minimum masking level is calculated
  for each subband.

  9. Calculate the signal-to-mask ratio: Signal-to-mask ratios are determined
  for each subband, based on the global masking threshold. The difference between
  the maximum SPL levels and the minimum masking threshold values determines
  the SMR value in each subband; this value is supplied to the bit allocator.

  The principal steps in the operation of model 1 can be illustrated with a
  test signal that contains a band of noise, as well as prominent tonal components.
  The model analyzes one block of the 16-bit test signal sampled at 44.1 kHz.
  FIG. 11A shows the audio signal as output by the FFT; the model has identified
  the local maxima. The figure also shows the absolute threshold of hearing used
  in this particular example (offset by -12 dB). FIG. 11B shows tonal components
  marked with a "+" and nontonal components marked with a "o." FIG. 11C shows the masking functions assigned to tonal maskers after decimation.
  The peak SMR (about 14.5 dB) corresponds to that used for tonal maskers. FIG. 11D shows the masking functions assigned to nontonal maskers after decimation.
  The peak SMR (about 5 dB) corresponds to that used for nontonal maskers. FIG. 11E shows the final global masking curve obtained by combining the individual
  masking thresholds. The higher of the global masking curve and the absolute
  threshold of hearing is used as the final global masking curve. FIG. 11F
  shows the minimum masking threshold. From this, SMR values can be calculated
  in each subband.  
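  The tonal labeling of step 4 (the "+" markers of FIG. 11B) can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of the standard: the function names, the fixed neighbor offset of two spectral lines, and the choice to sum the peak with its two adjacent lines are assumptions made here. Only the 7-dB prominence criterion is taken from the text.

```python
import math


def find_tonal_components(spl_db, offsets=(2,), prominence_db=7.0):
    """Return indices of spectral lines labeled tonal.

    spl_db: per-line levels in dB.
    offsets: neighbor distances (in lines) used for the prominence test;
             a fixed value is an assumption of this sketch.
    """
    tonal = []
    reach = max(offsets)
    for k in range(reach, len(spl_db) - reach):
        # The line must first be a local maximum among its direct neighbors.
        if not (spl_db[k] > spl_db[k - 1] and spl_db[k] >= spl_db[k + 1]):
            continue
        # It must also exceed the lines at each offset by prominence_db.
        if all(spl_db[k] - spl_db[k - j] >= prominence_db and
               spl_db[k] - spl_db[k + j] >= prominence_db
               for j in offsets):
            tonal.append(k)
    return tonal


def tonal_level_db(spl_db, k):
    """Level of a tonal component: power sum of the peak and both neighbors
    (an assumed way of calculating its sound-pressure level)."""
    power = sum(10.0 ** (spl_db[i] / 10.0) for i in (k - 1, k, k + 1))
    return 10.0 * math.log10(power)


if __name__ == "__main__":
    narrow_peak = [10, 10, 10, 10, 10, 40, 10, 10, 10, 10, 10]
    broad_bump = [10, 10, 20, 22, 24, 22, 20, 10, 10]
    print(find_tonal_components(narrow_peak))  # [5]
    print(find_tonal_components(broad_bump))   # []
```

  The narrow peak is labeled tonal; the broad bump is a local maximum but rises only 4 dB above the lines two positions away, so it would be left for the nontonal sum.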
 
  FIG. 11 Operation of MPEG-1 model 1 is illustrated using a test signal.
  A. Local maxima and absolute threshold. B. Tonal and nontonal components. C.
  Tonal masking. D. Nontonal masking. E. Masking threshold. F. Minimum masking
  threshold.
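  Steps 7 through 9 reduce to a power sum, a minimum, and a subtraction. The sketch below assumes that the individual masking thresholds are already available in dB at each frequency line and that every subband covers the same number of lines; the function and parameter names are hypothetical.

```python
import math


def db_to_power(level_db):
    """Convert a level in dB to a linear power value."""
    return 10.0 ** (level_db / 10.0)


def global_threshold_db(quiet_db, masker_thresholds_db):
    """Step 7 at one frequency line: sum the powers of the threshold in
    quiet and of every individual masking threshold, then return dB."""
    total = db_to_power(quiet_db)
    total += sum(db_to_power(t) for t in masker_thresholds_db)
    return 10.0 * math.log10(total)


def subband_smr_db(max_spl_db, global_db, lines_per_subband):
    """Steps 8 and 9: take the minimum global threshold inside each subband
    and subtract it from that subband's maximum SPL."""
    smr = []
    for sb, spl in enumerate(max_spl_db):
        start = sb * lines_per_subband
        band = global_db[start:start + lines_per_subband]
        smr.append(spl - min(band))
    return smr


if __name__ == "__main__":
    # Two subbands of two lines each, with made-up threshold values.
    print(subband_smr_db([60.0, 50.0], [20.0, 15.0, 30.0, 40.0], 2))
    # [45.0, 20.0]
```

  Two equal contributions of 10 dB sum to about 13 dB, which is why the global curve sits slightly above the strongest individual threshold wherever several maskers overlap.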
 To further explain the operation of model 1, additional comments are given
  here. The delay in the 512-point analysis filter bank is 256 samples and centering
  the data in the 512-point Hann window adds 64 samples. An offset of 320 samples
  (256 + (512 - 384)/2 = 320) is needed to time-align the model's 384 samples.

  The spreading function used in model 1 is described in terms of piecewise
  slopes (in dB):

  $$
  v_f(dz) =
  \begin{cases}
  17\,(dz + 1) - (0.4\,X[z(j)] + 6), & -3 \le dz < -1 \\
  (0.4\,X[z(j)] + 6)\,dz, & -1 \le dz < 0 \\
  -17\,dz, & 0 \le dz < 1 \\
  -(dz - 1)\,(17 - 0.15\,X[z(j)]) - 17, & 1 \le dz < 8
  \end{cases}
  $$

  where dz = z(i) - z(j) is the distance in Bark between the maskee and masker
  frequency; i and j are index values of spectral lines of the maskee and masker,
  respectively. X[z(j)] is the sound pressure level of the jth masking component in dB. Values
  outside -3 and +8 Bark are not considered in this model.

  Model 1 uses this general approach to detect and characterize tonality in
  audio signals: An FFT is applied to 512 or 1024 samples, and the components
  of the spectrum analysis are considered. Local maxima in the spectrum are identified
  as having more energy than adjacent components. These components are decimated
  such that a tonal component closer than 1/2 Bark to a stronger tonal component
  is discarded. Tonal components below the threshold of hearing are discarded
  as well. The energies of groups of remaining components are summed to represent
  tonal components in the signal; other components are summed and marked as nontonal.
  A binary designation is given: tonal components are assigned 1, and nontonal
  components are assigned 0. This information is presented to the bit allocation
  algorithm. Specifically, in model 1, tonality is determined by detecting local
  maxima of 7 dB in the audio spectrum. To derive the masking threshold relative
  to the masker, a level shift is applied; the nature of the shift depends on
  whether the masker is tonal or nontonal:

  ΔT(z) = -6.025 - 0.275z dB

  ΔN(z) = -2.025 - 0.175z dB

  where z is the frequency of the masker in Bark.

  Model 1 considers all the nontonal components in a critical band and represents
  them with one value at one frequency. This is appropriate at low frequencies
  where subbands and critical bands have good correspondence, but can be inefficient
  at high frequencies where there are many subbands in each critical band. A
  subband that is apart from the identified nontonal component in a critical
  band may not receive a correct nontonal evaluation.

  MPEG-1 Psychoacoustic Model 2

  Psychoacoustic model 2 performs a more detailed analysis than model 1, at
  the expense of greater computational complexity. It is designed for lower bit
  rates than model 1.

  As in model 1, model 2 outputs a signal-to-mask ratio for each subband; however,
  its approach is significantly different. It contours the noise floor of the
  signal represented by many spectral coefficients in a way that is more accurate
  than that allowed by coarse subband coding. Also, the model uses an unpredictability
  measure to examine the side-chain data for tonal or nontonal qualities. Model
  2 performs these 14 steps:

  1. Reconstruct input samples: A set of 1024 input samples is assembled.

  2. Calculate the complex spectrum: The time-aligned input signal is windowed
  with a 1024-point Hann window; alternatively, a shorter window may be used.
  An FFT is computed and the output is represented in magnitude and phase.

  3. Calculate the predicted magnitude and phase: The predicted magnitude and
  phase are determined by extrapolation from the two preceding threshold blocks.

  4. Calculate the unpredictability measure: The unpredictability measure is
  computed using the Euclidean distance between the predicted and actual values
  in the magnitude/phase domain. To reduce complexity, the measure may be computed
  only for lower frequencies and assumed constant for higher frequencies.

  5. Calculate the energy and unpredictability in the partitions: The energy
  magnitude and the weighted unpredictability measure in each threshold calculation
  partition are calculated. A partition has a resolution of one spectral line
  (at low frequencies) or 1/3 critical band (at high frequencies), whichever
  is wider.

  6. Convolve energy and unpredictability with the spreading function: The energy
  and the unpredictability measure in threshold calculation partitions are each
  convolved with a cochlea spreading function. Values are renormalized.

  7. Derive tonality index: The unpredictability measures are converted to tonality
  indices ranging from 0 (high unpredictability) to 1 (low unpredictability).
  This determines the relative tonality of the maskers in each threshold calculation
  partition.

  8. Calculate the required signal-to-noise ratio: An SNR is calculated for
  each threshold calculation partition using tonality to interpolate an attenuation
  shift factor between noise-masking-tone (NMT) and tone-masking-noise (TMN).
  The interpolated shift ranges from 5.5 dB for NMT and upward. The final shift
  value is the higher of the interpolated value or a frequency-dependent minimum
  value.

  9. Calculate power ratio: The power ratio of the SNR is calculated for each
  threshold calculation partition.

  10. Calculate energy threshold: The actual energy threshold is calculated
  for each threshold calculation partition.

  11. Spread threshold energy: The masking threshold energy is spread over FFT
  lines corresponding to threshold calculation partitions to represent the masking
  in the frequency domain.

  12. Calculate final energy threshold of audibility: The spread threshold energy
  is compared to values in absolute threshold of quiet tables, and the higher
  value is used (not the sum) as the energy threshold of audibility. This is
  because it is wasteful to specify a noise threshold lower than the level that
  can be heard.

  13. Calculate pre-echo control: A narrow-band pre-echo control used in the
  Layer III encoder is calculated, to prevent audibility of the error signal
  spread in time by the synthesis filter. The calculation lowers the masking
  threshold after a quiet signal. The calculation takes the minimum of the comparison
  of the current threshold with the scaled thresholds of two previous blocks.

  14. Calculate signal-to-mask ratios: Threshold calculation partitions are
  converted to codec partitions (scale factor bands). The SMR (energy in each
  scale factor band divided by noise level in each scale factor band) is calculated
  for each partition and expressed in decibels.

  The SMR values are forwarded to the allocation algorithm.

  The principal steps in the operation of model 2 can be illustrated with a
  test signal that contains three prominent tonal components. The model analyzes
  a set of 1024 input samples of the 16-bit test signal sampled at 44.1 kHz.
  FIG. 12A shows the magnitude of the audio signal as output by the FFT;
  the phase is also computed. Following prediction of magnitude and phase, the
  unpredictability measure is computed, as shown in FIG. 12B, using the Euclidean
  distance between the predicted and actual values in the magnitude/phase domain.
  When the measure equals 0, the current value is completely predicted. FIG. 12C
  shows the energy magnitude in each partition and the spreading functions
  that are applied. FIG. 12D shows the tonality index derived from the unpredictability
  measure; the tonality index ranges from 0 (high unpredictability and noise-like)
  to 1 (low unpredictability and tonal). FIG. 12E shows the spread masking
  threshold energy in the frequency domain and the absolute threshold of quiet;
  the higher value is used to find the energy threshold of inaudibility. FIG. 12F
  shows signal-to-mask ratios (energy in each scale factor band divided by
  noise level in each scale factor band) in codec partitions.

  To further explain the operation of model 2, additional comments are given
  here. The spreading function used in model 2 is:

  $$
  10 \log_{10} SF(dz) = 15.8111389 + 7.5\,(1.05\,dz + 0.474)
    - 17.5\,\bigl[1.0 + (1.05\,dz + 0.474)^2\bigr]^{1/2}
    + 8\,\mathrm{MIN}\bigl[(1.05\,dz - 0.5)^2 - 2\,(1.05\,dz - 0.5),\ 0\bigr] \ \text{dB}
  $$

  where dz is the distance in Bark between the maskee and masker frequency.

  The spectral flatness measure (SFM), devised by James Johnston, measures the
  average or global tonality of the segment. SFM is the ratio of the geometric
  mean of the power spectrum to its arithmetic mean. The value is converted to
  decibels and referenced to -60 dB to provide a coefficient of tonality ranging
  continuously from 0 (nontonal) to 1 (tonal). This coefficient can be used to
  interpolate between TMN and NMT models. SFM leads to very conservative masking
  decisions for nontonal parts of a signal. More efficiently, specific tonal
  and nontonal regions within a segment can be identified. This local tonality
  can be measured as the normalized Euclidean distance between the actual and
  predicted values over two successive segments, for amplitude and phase. On
  the basis of this, tonality unpredictability can be computed for narrow frequency
  partitions and used to create tonality metrics that are used to interpolate
  between tone or noise models.  
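  The SFM-based coefficient of tonality described above can be sketched as follows. The function names are hypothetical, the input is assumed to be a list of strictly positive power-spectrum values, and the linear interpolation between the two shift values is one plausible reading of "interpolate between TMN and NMT models" rather than a quoted formula.

```python
import math


def spectral_flatness_db(power):
    """Ratio of geometric mean to arithmetic mean of the power spectrum, in dB.
    All values in `power` must be greater than zero."""
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return 10.0 * math.log10(geometric / arithmetic)


def tonality_coefficient(power, sfm_ref_db=-60.0):
    """Reference the SFM to -60 dB and limit the result to the range 0..1
    (0 = nontonal, 1 = tonal)."""
    ratio = spectral_flatness_db(power) / sfm_ref_db
    return min(max(0.0, ratio), 1.0)


def interpolate_shift_db(tonality, tmn_db, nmt_db):
    """Blend the tone-masking-noise and noise-masking-tone shifts
    according to the tonality coefficient (assumed linear)."""
    return tonality * tmn_db + (1.0 - tonality) * nmt_db


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]        # noise-like: SFM is 0 dB
    peaky = [1e6, 1e-6, 1e-6, 1e-6]    # one dominant line: SFM below -60 dB
    print(tonality_coefficient(flat), tonality_coefficient(peaky))
    print(interpolate_shift_db(0.5, tmn_db=29.0, nmt_db=6.0))  # 17.5
```

  The 29-dB and 6-dB figures in the usage lines are the Layer III shift values quoted in this section; they are passed as parameters because other layers use different values.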
 
  FIG. 12 Operation of MPEG-1 model 2 is illustrated using a test signal.
  A. Magnitude of FFT. B. Unpredictability measure. C. Energy and spreading functions.
  D. Tonality index. E. Threshold energy and absolute threshold. F. Signal-to-mask
  ratios. (Boley and Rao, 2004)
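  The unpredictability measure of steps 3 and 4 (FIG. 12B) can be sketched for a single spectral line. The sketch assumes that the prediction is a linear extrapolation of magnitude and phase from the two preceding blocks and that the distance is normalized by the sum of the actual and predicted magnitudes; the text states only that extrapolation and a Euclidean distance are used, so both details, along with the function names, are assumptions.

```python
import cmath


def predict(prev1, prev2):
    """Extrapolate magnitude and phase from the two preceding blocks.
    prev1 is the most recent complex FFT value, prev2 the one before it."""
    r_hat = 2.0 * abs(prev1) - abs(prev2)
    phi_hat = 2.0 * cmath.phase(prev1) - cmath.phase(prev2)
    return r_hat, phi_hat


def unpredictability(current, prev1, prev2):
    """Normalized Euclidean distance between actual and predicted values.
    0 means completely predicted; values near 1 mean poorly predicted."""
    r_hat, phi_hat = predict(prev1, prev2)
    predicted = cmath.rect(r_hat, phi_hat)
    denom = abs(current) + abs(r_hat)
    if denom == 0.0:
        return 0.0  # silence in all three blocks: nothing to predict
    return abs(current - predicted) / denom


if __name__ == "__main__":
    # A steady tone: constant magnitude, phase advancing 0.5 rad per block.
    steady = [cmath.rect(1.0, 0.0), cmath.rect(1.0, 0.5), cmath.rect(1.0, 1.0)]
    print(unpredictability(steady[2], steady[1], steady[0]))  # close to 0

    # Same history, but the current phase lands opposite to the prediction.
    flipped = cmath.rect(1.0, 1.0 + cmath.pi)
    print(unpredictability(flipped, steady[1], steady[0]))    # close to 1
```

  A stationary sinusoid advances by the same phase step in every block, so its extrapolated value matches the actual one and the measure stays near 0; noise-like lines change erratically and push the measure toward 1.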
 Specifically, in model 2, a tonality index is created, on the basis of the
  predictability of the audio signal's spectral components in a partition in
  two successive frames. Tonal components are more accurately predicted. Amplitude
  and phase are predicted to form an unpredictability measure C.

  When C = 0, the current value is completely predicted, and when C = 1, the
  predicted values differ from the actual values. This yields the tonality index
  T ranging from 0 (high unpredictability and noise-like) to 1 (low unpredictability
  and tonal). For example, the audio signal's strongly tonal and nontonal areas
  are evident in FIG. 12D. The tonality index is used to calculate a Δ(z) shift,
  for example, interpolating values from 6 dB (nontonal) to 29 dB (tonal).

  When used in a Layer III encoder, model 2 is modified. The model is executed
  twice, once with a long block and once with a short 256-sample block. These
  values are used in the unpredictability measure calculation.
  A slightly different spreading function is used. The NMT shift is changed to
  6.0 dB and a fixed TMN shift of 29.0 dB is used. As noted, a pre-echo control
  is calculated.

  Perceptual entropy is calculated as the logarithm of the geometric mean of
  the normalized spectral energy in a partition. This predicts the minimum number
  of bits needed for transparency. High values are used to identify transient
  attacks, and thus to determine block size in the encoder. In addition, model
  2 accepts the minimum masking threshold at low frequencies where there is good
  correspondence between subbands and critical bands, and it uses the average
  of the thresholds at higher frequencies where subbands are narrow compared
  to critical bands.

  Much research has been done since the informative model 2 was published in
  the MPEG-1 standard. Thus, most practical encoders use models that offer better
  performance, even if they are based on the informative model. An encoder that
  follows the informative documentation literally will not provide good results
  compared to more sophisticated implementations.

  MPEG-2 Audio Standard

  The MPEG-2 audio standard was designed for applications ranging from Internet
  downloading to high definition digital television (HDTV) transmission. It provides
  a backward-compatible path to multichannel sound and a low sampling frequency
  provision, as well as a non-backward-compatible multichannel format known as
  Advanced Audio Coding (AAC). The MPEG-2 audio standard encompasses the MPEG-1
  audio standard of Layers I, II, and III, using the same encoding and decoding
  principles as MPEG-1. In many cases, the same layer algorithms developed for
  MPEG-1 applications are used for MPEG-2 applications. Multichannel MPEG-2 audio
  is backward compatible with MPEG-1. An MPEG-2 decoder will accept an MPEG-1
  bitstream and an MPEG-1 decoder can derive a stereo signal from an MPEG-2 bitstream.
  However, MPEG-2 also permits use of incompatible audio codecs.

  One part of the MPEG-2 standard provides multichannel sound at sampling frequencies
  of 32, 44.1, and 48 kHz. Because it is backward compatible with MPEG-1, it is
  designated as BC (backward compatible), that is, MPEG-2 BC. Clearly, because there
  is more redundancy between six channels than between two, greater coding efficiency
  is achieved. Overall, 5.1 channels can be successfully coded at rates from 384 kbps
  to 640 kbps. MPEG-2 also supports monaural and stereo coding at sampling frequencies
  of 16, 22.05, and 24 kHz, using Layers I, II, and III. The MPEG-1 and -2
  audio coding family is shown in FIG. 13. The MPEG-2 audio standard was approved
  by the MPEG committee in November 1994 and is specified in ISO/IEC 13818-3.  
 FIG. 13 The MPEG-2 audio standard adds monaural/stereo coding at low
  sampling frequencies, multichannel coding, and AAC. The three MPEG-1 layers
  are supported.
 The multichannel MPEG-2 BC format uses a five-channel approach sometimes referred
  to as 3/2 + 1 stereo (3 front and 2 surround channels + subwoofer). The low
  frequency effects (LFE) subwoofer channel is optional, providing an audio range
  up to 120 Hz. A hierarchy of formats is created in which 3/2 may be downmixed
  to 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0. The multichannel MPEG-2 BC format uses
  an encoder matrix that allows a two-channel decoder to decode a compatible
  two-channel signal that is a subset of a multichannel bitstream. The multiple
  channels of MPEG-2 are matrixed to form compatible MPEG-1 left/right channels,
  as well as other MPEG-2 channels, as shown in FIG. 14. The MPEG-1 left and
  right channels are replaced by matrixed MPEG-2 left and right channels and
  these are encoded into backward-compatible MPEG frames with an MPEG-1 encoder.
  Additional multichannel data is placed in the expanded ancillary data field.  
 FIG. 14 The MPEG-2 audio encoder and decoder showing how a 5.1-channel
  surround format can be achieved with backward compatibility with MPEG-1.
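  The matrixing idea can be sketched for one sample of a 3/2 signal. The coefficients of 1/sqrt(2) for the center and surround contributions, the choice of center and surround channels as the three extension channels, and the function names are assumptions made for illustration; an overall attenuation to prevent overload is omitted.

```python
import math

# Assumed weight for the center and surround contributions.
A = 1.0 / math.sqrt(2.0)


def matrix_encode(l, c, r, ls, rs, a=A, b=A):
    """Form the compatible left/right pair and pass three channels along
    as extension data. Returns (lo, ro, c, ls, rs)."""
    lo = l + a * c + b * ls
    ro = r + a * c + b * rs
    return lo, ro, c, ls, rs


def matrix_decode(lo, ro, c, ls, rs, a=A, b=A):
    """Multichannel decoder: remove the center and surround contributions
    from the compatible pair. Returns (l, c, r, ls, rs)."""
    l = lo - a * c - b * ls
    r = ro - a * c - b * rs
    return l, c, r, ls, rs


if __name__ == "__main__":
    original = (0.5, 0.2, -0.3, 0.1, 0.05)   # l, c, r, ls, rs
    lo, ro, c, ls, rs = matrix_encode(*original)
    # A two-channel decoder would simply play lo and ro.
    restored = matrix_decode(lo, ro, c, ls, rs)
    print(all(abs(x - y) < 1e-12 for x, y in zip(original, restored)))
```

  In this lossless sketch the subtraction restores the front channels exactly. In a real codec each transmitted channel carries its own quantization noise, and that noise does not cancel in the subtraction.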
 To efficiently code multiple channels, MPEG-2 BC uses techniques such as dynamic
  crosstalk reduction, adaptive interchannel prediction, and center channel phantom
  image coding. With dynamic crosstalk reduction, as with intensity coding, multichannel
  high-frequency information is combined and conveyed along with scale factors
  to direct levels to different playback channels. In adaptive prediction, a
  prediction error signal is conveyed for the center and surround channels. The
  high-frequency information in the center channel can be conveyed through the
  front left and right channels as a phantom image.

  MPEG-2 BC can achieve a combined bit rate of 384 kbps, using Layer II at a
  48-kHz sampling frequency.

  MPEG-2 allows for audio bit rates up to 1066 kbps. To accommodate this, the
  MPEG-2 frame is divided into two parts. The first part is an MPEG-1-compatible
  stereo section with Layer I data up to 448 kbps, Layer II data up to 384 kbps,
  or Layer III data up to 320 kbps. The MPEG-2 extension part contains all other
  surround data.

  A standard two-channel MPEG-1 decoder ignores the ancillary information, and
  reproduces the front main channels. In some cases, the dematrixing procedure
  in the decoder can yield an artifact in which the sound in a channel is mainly
  phase canceled but the quantization noise is not, and thus becomes audible.
  This limitation of spatial unmasking in MPEG-2 BC is a direct result of the
  matrixing used to achieve backward compatibility with the original two-channel
  MPEG standard. In part, it can be addressed by increasing the bit rate of the
  coded signals.

  MPEG-2 also specifies Layers I, II, and III at low sampling frequencies (LSF)
  of 16, 22.05, and 24 kHz. This extension is not backward compatible with MPEG-1
  codecs. This portion of the standard is known as MPEG-2 LSF. At these low bit
  rates, Layer III generally shows the best performance. Only minor changes in
  the MPEG-1 bit rate and bit allocation tables are necessary to adapt to this LSF
  format. The relative improvement in quality stems from the improved frequency
  resolution of the polyphase filter bank in low- and mid-frequency regions;
  this allows more efficient application of masking. Layers I and II benefit more
  than Layer III in these applications because Layer III already has good frequency
  resolution. The bitstream is unchanged in the LSF mode and the same frame format
  is used. For 24-kHz sampling, the frame length is 16 ms for Layer I and 48
  ms for Layer II. The frame length of Layer III is decreased relative to that
  of MPEG-1. In addition, the "MPEG-2.5" standard supports sampling
  frequencies of 8, 11.025, and 12 kHz with the corresponding decrease in audio
  bandwidth; implementations use Layer III as the codec. Many MP3 codecs support
  the original MPEG-1 Layer III codec as well as the MPEG-2 and MPEG-2.5 extensions
  for lower sampling frequencies.

  The menu of data rates, fidelity, and layer compatibility provided by MPEG
  is useful in a wide variety of applications such as computer multimedia, CD-ROM,
  DVD-Video, computer disks, local area networks, studio recording and editing,
  multichannel disk recording, ISDN transmission, digital audio broadcasting,
  and multichannel digital television. Numerous C and C++ programs performing
  MPEG-1 and -2 audio coding and decoding can be downloaded from a number of
  Internet file sites, and executed on personal computers. The backward compatible
  format, using Layer II coding, is used for the soundtracks of some DVD-Video
  discs. However, a matrix approach to surround sound does not preserve spatial
  fidelity as well as discrete channel coding. |