Audio Coding Fundamentals

Vicente González Ruiz

November 2, 2014

1 How big is audio data?
2 How to reduce the bit-rate?
3 What is an audio codec (COder/DECoder)?
4 Typical encoder steps
5 Overlapped processing?
6 The MDCT (Modified Discrete Cosine Transform)
7 SAM (pSycho Acoustic Model) of the HAS (Human Auditory System)
7.1 ATH (Absolute Threshold of Hearing) model [1]
7.2 Frequency resolution and simultaneous masking
7.3 Temporal masking
7.4 Channel coupling
8 Quantization
9 Entropy Coding

1 How big is audio data?

Mobile: up to $13$ Kbps.
(Terrestrial) telephony: $64$ Kbps.
CD quality: $1.411$ Mbps.
AC-3 (Dolby Digital): up to $640$ Kbps; E-AC-3 (Dolby Digital Plus): up to $6.144$ Mbps.
DTS: up to $1509.75$ Kbps.
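The telephony and CD figures can be estimated as sampling rate × bits per sample × number of channels. The sketch below assumes the usual PCM parameters (8 kHz, 8 bits, mono for telephony; 44.1 kHz, 16 bits, stereo for CD), which are not stated in the list above; `pcm_bit_rate` is a hypothetical helper name.

```python
def pcm_bit_rate(sampling_rate_hz, bits_per_sample, channels):
    """Bit-rate (in bits per second) of an uncompressed PCM stream."""
    return sampling_rate_hz * bits_per_sample * channels

# Assumed parameters (not given in the text above).
telephony_bps = pcm_bit_rate(8000, 8, 1)
cd_bps = pcm_bit_rate(44100, 16, 2)

print(telephony_bps / 1e3, "Kbps")
print(cd_bps / 1e6, "Mbps")
```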

2 How to reduce the bit-rate?

Lowering the sampling rate (less bandwidth).
Lowering the number of channels.
Lowering the bits/sample (coarser quantization).
Using audio compression.

3 What is an audio codec (COder/DECoder)?

PCM   +---------+        +---------+ PCM
----->| Encoder |------->| Decoder |----->
audio +---------+ stream +---------+ audio’

              audio != audio’
                (usually)

4 Typical encoder steps

Overlapped subband analysis, usually with the MDCT (Modified Discrete Cosine Transform). Transforms the signal from the time domain to a frequency domain.
Quantization. Basically, removes pure tones (frequency components) of low amplitude, also taking into account the SAM (pSycho Acoustic Model) of the HAS (Human Auditory System). Noise usually has low power!
Entropy coding. Compresses the data, usually with Huffman or arithmetic coding.

5 Overlapped processing?

0            N-1            2N-1            3N-1
+---------------+---------------+---------------+ s[n]
<--------Transform Step--------->
                <---------Transform Step-------->

Each transform step inputs $2 N$ samples and outputs $N$ MDCT coefficients. Consecutive steps overlap by $N$ samples.
$N$ can vary depending on the characteristics of the sound. For complex sounds without clear harmonics (such as a plosive sound), shorter windows improve the performance. For simple sounds (such as a musical instrument), longer windows are better.
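The overlapped partition can be sketched as follows. This is only an illustration: `overlapped_blocks` is a hypothetical helper, the signal is a toy list, and $N$ is kept fixed, whereas a real encoder may change it from step to step.

```python
def overlapped_blocks(s, N):
    """Yield blocks of 2N samples whose starting points are N samples apart."""
    for start in range(0, len(s) - 2 * N + 1, N):
        yield s[start:start + 2 * N]

s = list(range(12))   # toy signal: s[n] = n
N = 4                 # hypothetical (small) value of N
blocks = list(overlapped_blocks(s, N))
for b in blocks:
    print(b)
```

Each block shares its last $N$ samples with the first $N$ samples of the next block.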

6 The MDCT (Modified Discrete Cosine Transform)

Equivalent to applying a bank of $N$ filters.
Determines the correlation between a set of $2 N$ numbers (samples) and $N$ orthogonal cosine functions. Therefore, at the input of the MDCT there are $2 N$ samples and, at the output, $N$ coefficients.
The MDCT coefficients $S [w]$ of the PCM samples $s [n]$ are defined as:
$S[w] = \sum_{n = 0}^{2 N - 1} s[n] \cos\left[\frac{\pi}{N} \left(n + \frac{1}{2} + \frac{N}{2}\right) \left(w + \frac{1}{2}\right)\right], \quad w = 0, \ldots, N - 1.$ (1)
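A direct (quadratic-time) evaluation of Eq. (1) can be sketched as below. It is meant only to make the definition concrete: practical codecs typically window the block first and use a fast algorithm, neither of which is shown here. The check uses one of the cosine functions as input, for which only the matching coefficient should be non-zero (and equal to $N$).

```python
import math

def mdct(s):
    """MDCT of a block of 2N samples, computed directly from Eq. (1)."""
    N = len(s) // 2
    return [
        sum(s[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (w + 0.5))
            for n in range(2 * N))
        for w in range(N)
    ]

N = 8    # hypothetical block parameter
w0 = 3   # index of the cosine function used as test input
basis = [math.cos(math.pi / N * (n + 0.5 + N / 2) * (w0 + 0.5))
         for n in range(2 * N)]
S = mdct(basis)
print([round(c, 6) for c in S])
```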

7 SAM (pSycho Acoustic Model) of the HAS (Human Auditory System)

7.1 ATH (Absolute Threshold of Hearing) model [1]

According to this model, humans hear best those sounds whose frequencies range between 3 kHz and 4 kHz.
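The text above does not reproduce the model itself. A widely quoted approximation of the threshold in quiet, commonly attributed to [1], is $T(f) = 3.64 f^{-0.8} - 6.5 e^{-0.6 (f - 3.3)^2} + 10^{-3} f^4$ dB SPL, with $f$ in kHz. Assuming this formula (and a hypothetical function name `ath_db_spl`), the most sensitive frequency can be located numerically:

```python
import math

def ath_db_spl(f_hz):
    """Approximate absolute threshold of hearing (dB SPL) at f_hz (assumed formula)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Search a 10 Hz grid for the frequency with the lowest threshold.
freqs = range(100, 16001, 10)
most_sensitive_hz = min(freqs, key=ath_db_spl)
print(most_sensitive_hz, round(ath_db_spl(most_sensitive_hz), 2))
```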

7.2 Frequency resolution and simultaneous masking

The HAS has a limited frequency resolution. Psychoacoustic experiments have demonstrated that the audible frequencies can be grouped into critical bands, called barks.
Each bark defines the group of frequencies that excite the same cochlear area, i.e., those frequencies that can be masked by the tone with the highest energy (in that bark).
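To get a feeling for how frequencies map to barks, the sketch below uses a common analytic approximation of the Bark scale, $z(f) = 13 \arctan(0.00076 f) + 3.5 \arctan((f / 7500)^2)$ with $f$ in Hz. This formula is an assumption brought in for illustration; it does not appear in the text above.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate (in barks) of a frequency in Hz."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

for f in (100, 1000, 4000, 10000):
    print(f, round(hz_to_bark(f), 2))
```

Under this approximation, a 100 Hz interval spans about one bark at low frequencies but only a small fraction of a bark around 10 kHz.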

7.3 Temporal masking

The human auditory system has inertia: sounds are not perceived instantly, and their perception persists for a short time after they have disappeared.

7.4 Channel coupling

Most of the time, the channels of a non-mono audio signal carry similar sounds. Channel coupling reduces this inter-channel redundancy, usually by means of prediction techniques.
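As a minimal sketch of inter-channel prediction, assume that the right channel is predicted from the left one with a single least-squares gain, and that only the prediction residual is encoded. This is one simple possibility chosen for illustration, not the scheme of any particular codec; the sample values are made up.

```python
def predict_gain(left, right):
    """Least-squares gain g that minimizes the energy of right - g * left."""
    return sum(l * r for l, r in zip(left, right)) / sum(l * l for l in left)

def energy(x):
    """Sum of the squared samples."""
    return sum(v * v for v in x)

# Toy stereo signal (made-up values): right is roughly 0.8 * left.
left = [1.0, -2.0, 3.0, -1.0, 0.5, 2.0]
right = [0.9, -1.5, 2.4, -0.9, 0.3, 1.7]

g = predict_gain(left, right)
residual = [r - g * l for l, r in zip(left, right)]
print(round(g, 3), round(energy(right), 3), round(energy(residual), 3))
```

The residual carries much less energy than the original right channel, so it can be represented with fewer bits; the decoder recovers the right channel as `g * left + residual`.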

8 Quantization

Depending on the desired output bit-rate and on the frequency (see the ATH model in Section 7.1), the SAM applies a different quantization step to each bark (see Section 7.2). Roughly speaking, the higher the compression ratio, the larger the quantization step and, therefore, the quantization noise; and the higher the frequency, the wider the bark. Notice also that the perception of a tone in a bark depends on temporal masking.
At decoding time, those barks that suffered the biggest losses are usually filled with white noise in order to increase the perceived quality.
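The relation between the quantization step and the quantization noise can be illustrated with a uniform quantizer. The sketch assumes a single step for all the coefficients (a real encoder would choose one per bark) and uses made-up coefficient values.

```python
def quantize(coeffs, step):
    """Uniform quantization: map each coefficient to an integer index."""
    return [round(c / step) for c in coeffs]

def dequantize(indexes, step):
    """Reconstruct approximate coefficients from the indexes."""
    return [i * step for i in indexes]

def noise_energy(coeffs, step):
    """Energy of the error introduced by quantizing with the given step."""
    rec = dequantize(quantize(coeffs, step), step)
    return sum((c - r) ** 2 for c, r in zip(coeffs, rec))

coeffs = [12.3, -7.8, 0.4, 3.1, -0.2, 25.6]  # made-up MDCT coefficients
fine, coarse = 1.0, 8.0                      # hypothetical quantization steps

print(quantize(coeffs, fine), noise_energy(coeffs, fine))
print(quantize(coeffs, coarse), noise_energy(coeffs, coarse))
```

With the coarse step, the low-amplitude coefficients become zero and the noise energy grows.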

9 Entropy Coding

Usually, a variable bit-rate (VBR) lossless encoding algorithm assigns code-words of fewer bits to those code-vectors (one or more quantized MDCT coefficients) with a high probability, and vice versa, producing an effective reduction of the bit-rate.
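As an illustration, the following sketch computes Huffman code-word lengths for a toy sequence of quantized coefficients, with one coefficient per code-vector. The sequence is made up and `huffman_code_lengths` is a hypothetical helper; actual codecs often rely on predefined code tables or on arithmetic coding instead.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code-word length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees makes every symbol in them one bit longer.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Toy quantized coefficients (made up): small values are the most frequent.
quantized = [0] * 8 + [1] * 4 + [-1] * 2 + [2] * 1 + [-2] * 1
lengths = huffman_code_lengths(quantized)
total_bits = sum(lengths[q] for q in quantized)
print(lengths, total_bits)
```

A fixed-length code for these five symbols would need 3 bits per coefficient, so the variable-length code is shorter for this sequence.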

References

[1] E. Terhardt. Calculating virtual pitch. Hearing Res., 1:155–182, 1979.