A Hi-Fi Subband-Multipulse Digital Audio Codec
X. Lin, R. Steele and L. Hanzo
(c) University of Southampton 1994
1. Introduction
Analogue frequency modulated (FM) radio broadcasting dates back to 1949 and was designed mainly for directional antennae with about 12dB antenna gain. The audio quality of portable stereo FM radios has come under renewed criticism against the background of the proliferation of high fidelity (Hi-Fi) portable compact disc (CD), digital audio tape (DAT) and digital compact cassette players, leading to a growing demand for a terrestrial or satellite-based Hi-Fi digital audio broadcast (DAB) system.
In 1986 the European Community (EC) launched the Eureka EU147 project [1] with the aim of proposing a robust, mobile Hi-Fi DAB system that is additionally capable of decoding concomitant traffic and control data while the user listens to audio programs, and of synthesizing the corresponding voice messages on demand at the required time. Furthermore, the system will be capable of producing five-channel surround sound with ambient-dependent dynamic control, catering for example for a reduced dynamic range in a noisy vehicle, where pianissimo music cannot be heard. Originally two transform codecs (TC) and two subband (SB) codecs were proposed, and the best candidate codec [2],[3] reduces the 1.411Mbit/s stereo CD bit rate to 192kbit/s using the so-called 'masking pattern adapted universal subband integrated coding and multiplexing' (MUSICAM) technique.
The DAB system's transmission scheme is based on the coded orthogonal frequency division multiplex (COFDM) principle, whose origin goes back to the 1960s [4],[6] and which was considered recently for mobile radio telephony [7]. Its principle is that the total transmission bandwidth is divided into a large number of narrow-band sub-channels and each sub-channel is assigned a low-rate modem. Fortunately, the sub-channel modems do not have to be implemented separately, because it can be shown that the bank of sub-channel modems can be replaced by a pair of inverse fast Fourier transform (IFFT) and FFT processing blocks, if the number of sub-channels is an integer power of two. If the transmitted signal is corrupted by bursty channel errors, after FFT-demodulation at the receiver the errors will be distributed over the whole block, reducing the probability of erroneous decisions. The remaining randomly distributed errors can be combatted more easily by error correction coding. The main advantage of the COFDM method is that once the wide-band frequency selective fading mobile channel is split into narrow-band sub-channels, there is no significant signal dispersion within each sub-channel and hence the deployment of channel equalisers can be avoided [8]. Although the lack of an equaliser reduces the receiver's complexity, the price paid for this is the COFDM method's own relatively high complexity. The interested reader is referred to Reference [9] for further details on the DAB system. In this contribution we propose a Hi-Fi audio codec, which is based on a specific subband split modified multipulse excited linear predictive (SB-MMPLPC) scheme.
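The equivalence between a bank of sub-channel modems and an IFFT/FFT pair can be illustrated with a short sketch. The Python code below is an illustration only and not the DAB system's modem: it assumes eight sub-channels, uses a naive O(N^2) discrete Fourier transform in place of a true FFT in order to stay dependency-free, and the function names (`dft`, `ofdm_modulate`, `ofdm_demodulate`) are hypothetical. It also shows how an error hitting a single time-domain sample is spread with equal magnitude over all sub-channels after demodulation.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def ofdm_modulate(symbols):
    """Map one complex symbol per sub-channel to a block of time-domain samples."""
    return dft(symbols, inverse=True)

def ofdm_demodulate(samples):
    """Recover the per-sub-channel symbols from a received block."""
    return dft(samples)

# Eight QPSK symbols, one per sub-channel (arbitrary illustrative data).
symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
block = ofdm_modulate(symbols)

# An error hitting a single time-domain sample ...
hit = list(block)
hit[3] += 0.4
received = ofdm_demodulate(hit)

# ... appears as a perturbation of equal magnitude on every sub-channel.
spread = [abs(r - s) for r, s in zip(received, symbols)]
```

In this toy case each sub-channel symbol is displaced by the same small amount, which stays well inside the QPSK decision regions, rather than one symbol being destroyed outright.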
Figure 1: MMPLPC Codec Schematic
2. The audio codec
In recent years subband coding [11],[12], transform coding [13],[14], multi-pulse LPC (MPLPC) [15] and the wavelet transform [16] have been successfully used for Hi-Fi audio coding. In this paper, we propose a novel modified multi-pulse excited linear predictive codec (MMPLPC) structure combined with a subband splitting technique in order to further improve the coding efficiency of audio signals. The MMPLPC codec is developed from the amplitude re-optimization method of conventional multi-pulse LPC [17] schemes by choosing among a number of different excitation modes, which results in an improved subjective audio quality. By using split band coding, perceptually motivated dynamically adapted bit allocation procedures can be deployed in our subband MMPLPC (SB-MMPLPC) codec [18].
2.1. The MMPLPC Codec
The MMPLPC codec's schematic diagram is shown in Figure 1, which is similar to that of a conventional MPLPC arrangement, except that it incorporates a number of different excitation modes. The audio input signal is divided into frames of samples for LPC analysis and the LPC filter parameters are determined by minimising the mean squared prediction error over this interval. Each frame is further divided into contiguous subframes of samples, for which the long term predictor (LTP) parameters are initially determined under the assumption of no excitation, since at this stage the excitation is unknown.
In order to find the optimum LTP delay and gain that minimise the weighted error for the current subsegment, a term formed from two signals has to be maximised [10] over the permitted range of delays. The first signal is the perceptually weighted audio signal after removing the memory contribution of the weighted synthesis filter due to its input in the previous subframe. The second is the convolution of the previous history of the filter input at the candidate delay with the impulse response of the weighted synthesis filter. The weighting factor of this filter controls the degree to which the error signal is perceptually de-emphasised in the spectrally prominent frequency regions during the excitation optimisation. The corresponding LTP gain is then computed for the delay that maximises this term.
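A closed-loop LTP search of this kind can be sketched as follows. The sketch assumes the formulation commonly used in analysis-by-synthesis coders: the term to be maximised is the squared cross-correlation between the weighted target and the filtered past excitation, divided by the energy of the latter, and the gain is the cross-correlation divided by that energy. The function names, the way the past excitation is repeated for delays shorter than the subsegment, and all parameter values are assumptions made for illustration rather than details taken from the codec.

```python
def filtered_candidate(past, delay, h, n):
    """Zero-state response of the weighted synthesis filter (impulse response h)
    to the excitation found `delay` samples back, repeated if delay < n."""
    seg = [past[len(past) - delay + (i % delay)] for i in range(n)]
    return [sum(h[j] * seg[i - j] for j in range(min(i + 1, len(h))))
            for i in range(n)]

def ltp_search(target, past, h, min_delay, max_delay):
    """Return (delay, gain) maximising corr^2 / energy over the delay range."""
    best = None
    for delay in range(min_delay, max_delay + 1):
        y = filtered_candidate(past, delay, h, len(target))
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # a silent candidate cannot reduce the error
        corr = sum(a * b for a, b in zip(target, y))
        score = corr * corr / energy
        if best is None or score > best[0]:
            best = (score, delay, corr / energy)  # gain = corr / energy
    return None if best is None else (best[1], best[2])
```

Here `target` plays the role of the weighted audio signal with the filter memory removed, and `past` holds the previous history of the filter input; `max_delay` must not exceed `len(past)`.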
The optimum excitation is determined by filtering the candidate innovation sequences through the LTP synthesis filter and the perceptually weighted short term prediction (STP) synthesis filter in order to generate the perceptually weighted synthesized audio signal. The total mean squared weighted error between this signal and the weighted input can be expressed as a quadratic function of the excitation pulse amplitudes, whose coefficients are determined by the positions of the pulses in the excitation frame. Setting the derivative of this error with respect to the pulse amplitudes to zero yields the minimum error for a given set of pulse positions, and this minimum is lowest when the term subtracted from the energy of the weighted input is maximized. If the amplitudes are quantized, minimizing the weighted error becomes equivalent to maximizing the corresponding term evaluated with the quantized amplitudes.
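The least-squares argument used here can be written out explicitly. The notation below is assumed for illustration and is not necessarily that of the original equations: $\mathbf{x}$ is the weighted target vector, $\mathbf{H}$ is the matrix whose $k$-th column is the impulse response of the weighted synthesis filter shifted to pulse position $m_k$, $\boldsymbol{\beta}$ is the vector of pulse amplitudes and $\hat{\boldsymbol{\beta}}$ is its quantised version.

```latex
\begin{align}
E_w &= (\mathbf{x}-\mathbf{H}\boldsymbol{\beta})^T(\mathbf{x}-\mathbf{H}\boldsymbol{\beta})
     = \mathbf{x}^T\mathbf{x} - 2\,\mathbf{c}^T\boldsymbol{\beta}
       + \boldsymbol{\beta}^T\boldsymbol{\Phi}\,\boldsymbol{\beta},
     \qquad \mathbf{c}=\mathbf{H}^T\mathbf{x},\quad \boldsymbol{\Phi}=\mathbf{H}^T\mathbf{H}, \\
\frac{\partial E_w}{\partial \boldsymbol{\beta}} &= \mathbf{0}
     \;\Rightarrow\; \boldsymbol{\beta}_{\mathrm{opt}} = \boldsymbol{\Phi}^{-1}\mathbf{c},
     \qquad E_{w,\min} = \mathbf{x}^T\mathbf{x} - \mathbf{c}^T\boldsymbol{\Phi}^{-1}\mathbf{c}, \\
E_w(\hat{\boldsymbol{\beta}}) &= \mathbf{x}^T\mathbf{x}
     - \left( 2\,\mathbf{c}^T\hat{\boldsymbol{\beta}}
     - \hat{\boldsymbol{\beta}}^T\boldsymbol{\Phi}\,\hat{\boldsymbol{\beta}} \right).
\end{align}
```

Since $\mathbf{x}^T\mathbf{x}$ does not depend on the excitation, the unquantised error is smallest when $\mathbf{c}^T\boldsymbol{\Phi}^{-1}\mathbf{c}$ is largest, and the quantised error is smallest when $2\,\mathbf{c}^T\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^T\boldsymbol{\Phi}\,\hat{\boldsymbol{\beta}}$ is largest.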
When the total number of bits used to quantize the pulse amplitudes and to encode the pulse positions is fixed, the number of excitation pulses can be varied to find the best set of quantised excitation pulses. The excitation mode index then identifies the specific excitation mode which maximizes the above term. Explicitly, the best compromise between the number of excitation pulses and the associated number of quantisation bits must be found: if the number of excitation pulses is higher, only a lower quantisation precision is possible, and vice versa. We refer to the aforementioned method as MMPLPC.
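The mode selection idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the codec's actual search: pulses are placed greedily one at a time without the joint amplitude re-optimisation of [17], amplitudes are normalised by their peak and uniformly quantised, the peak itself is left unquantised, and the candidate modes are hypothetical (pulses, bits) pairs that are not forced to use the same total number of bits. Choosing the mode with the smallest weighted error is equivalent to choosing the one that maximises the term discussed above, because the two differ only by the constant energy of the weighted input.

```python
def unit_response(h, pos, n):
    """Weighted synthesis filter response to a unit pulse at `pos`, n samples long."""
    return [h[i - pos] if 0 <= i - pos < len(h) else 0.0 for i in range(n)]

def greedy_multipulse(x, h, num_pulses):
    """Place pulses one by one where they remove the most residual energy."""
    n, residual, pulses, used = len(x), list(x), [], set()
    for _ in range(num_pulses):
        best = None
        for pos in range(n):
            if pos in used:
                continue
            g = unit_response(h, pos, n)
            energy = sum(v * v for v in g)
            if energy == 0.0:
                continue
            corr = sum(r * v for r, v in zip(residual, g))
            if best is None or corr * corr / energy > best[0]:
                best = (corr * corr / energy, pos, corr / energy)
        if best is None:
            break
        _, pos, amp = best
        used.add(pos)
        pulses.append((pos, amp))
        g = unit_response(h, pos, n)
        residual = [r - amp * v for r, v in zip(residual, g)]
    return pulses

def quantise(pulses, bits):
    """Uniformly quantise amplitudes after normalising by the largest magnitude."""
    if not pulses:
        return []
    peak = max(abs(a) for _, a in pulses) or 1.0
    levels = 2 ** (bits - 1) - 1
    return [(p, round(a / peak * levels) / levels * peak) for p, a in pulses]

def weighted_error(x, h, pulses):
    """Squared error between the target and the synthesised signal."""
    synth = [0.0] * len(x)
    for pos, amp in pulses:
        for i, v in enumerate(unit_response(h, pos, len(x))):
            synth[i] += amp * v
    return sum((a - b) ** 2 for a, b in zip(x, synth))

def select_mode(x, h, modes):
    """modes: list of (num_pulses, bits_per_pulse). Returns (best index, errors)."""
    errors = [weighted_error(x, h, quantise(greedy_multipulse(x, h, k), b))
              for k, b in modes]
    return min(range(len(modes)), key=errors.__getitem__), errors

# Toy target made of two filtered pulses; one coarse pulse loses to two finer ones.
h = [1.0, 0.5, 0.25]
x = [a - 0.5 * b for a, b in zip(unit_response(h, 3, 16), unit_response(h, 10, 16))]
best_mode, errors = select_mode(x, h, modes=[(1, 8), (2, 4)])
```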
2.2. The Performance of MMPLPC
Using MMPLPC, simulations were performed with a variety of music signals listed in Table 1. The signals were band limited from 0.05 to 15kHz and sampled at 32kHz, and each signal had a duration of 6 to 16 seconds. Our results were compared with those of conventional MPLPC.
Table 1: Music Excerpts Used in Simulations
The simulation results for MPLPC were similar to those given in reference [19]. For our bandwidth of about 15kHz and sampling frequency of 32kHz we chose a subsegment length of 5 ms, a weighting factor representing mild weighting, and 10th order LPC filtering. At this stage no LTP filtering was invoked. The LPC analysis frame size of 320 samples was equivalent to a frame duration of 10ms, and a Hamming window duration of 13.75ms was used. Figure 2 shows the segmental signal to noise ratio (SEGSNR) performance of two previously published codecs, MPLPC1 and MPLPC2 [17], in contrast to that of our proposed MMPLPC scheme. In MPLPC1 22 excitation pulse amplitudes were quantized with 7 bits/pulse, while in MPLPC2 25 pulses were quantized using 6 bits/pulse, as summarised in the bit allocation table, Table 2.
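For reference, the SEGSNR measure used in these comparisons can be computed as sketched below. The sketch assumes the usual definition, namely the average over frames of the per-frame signal-to-noise ratio in dB. The frame length and the decision to skip frames with zero signal or zero noise energy are assumptions of this sketch, as the text does not state how such frames were treated.

```python
import math

def segsnr_db(original, decoded, frame_len):
    """Mean of the per-frame SNR values in dB over all complete frames."""
    snrs = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        ref = original[start:start + frame_len]
        out = decoded[start:start + frame_len]
        signal = sum(v * v for v in ref)
        noise = sum((a - b) ** 2 for a, b in zip(ref, out))
        if signal == 0.0 or noise == 0.0:
            continue  # assumption: skip silent or perfectly coded frames
        snrs.append(10.0 * math.log10(signal / noise))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

Because each frame contributes equally, quiet passages weigh as much as loud ones, which is why the measure tracks perceived quality better than a single SNR over the whole excerpt.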
Figure 2: SEGSNR Performance of Various MPLPC Schemes
Table 2: Bit Allocation Schemes for Codecs MPLPC1, MPLPC2 and MMPLPC
For our MMPLPC arrangement we set the number of excitation modes in Figure 1 to two. Modes 1 and 2 used 22 and 25 excitation pulses, quantised using seven and six bits, respectively, as in MPLPC1 and MPLPC2. The full quantization schemes for the parameters of MPLPC1, MPLPC2 and MMPLPC are listed in Table 2. All the parameters were linearly quantized except the maximum excitation magnitude, which was logarithmically quantized to eight bits precision and used in the normalization of the pulse amplitudes. The encoding of the positions of the excitation pulses used the enumerative method outlined in reference [19]. The simulation results of Figure 2 show that the SEGSNR of the MMPLPC codec was almost always the highest when compared with the MPLPC1 and MPLPC2 schemes. Our informal listening tests also confirmed subjective improvements by the MMPLPC over MPLPC1 and MPLPC2. Figure 2 shows, however, that for music 6, for example, the MMPLPC did not have the highest segmental SNR, because the minimization in the analysis-by-synthesis (ABS) loop was carried out for the perceptually weighted error signal rather than for the original audio signal. Although this did result in a lower SEGSNR, the MMPLPC codec maintained a higher perceptual quality. Figure 2 demonstrates that for MPLPC1 and MPLPC2, some excerpts of music needed more excitation pulses per frame with less precise quantization, while some other sections needed more accurate quantization with fewer excitation pulses per frame. This observation led us to the concept of the MMPLPC.
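The enumerative encoding of pulse positions can be illustrated with a small sketch. It assumes one common realisation, the combinatorial number system, in which a set of k distinct positions out of n is mapped to its rank among all C(n, k) possible sets. Whether [19] uses exactly this ordering is not confirmed here, and the function names and the 40-sample, 3-pulse example are hypothetical.

```python
from math import comb

def position_bits(n, k):
    """Bits needed to index every way of placing k pulses in n positions."""
    return (comb(n, k) - 1).bit_length()

def encode_positions(positions):
    """Rank of a set of distinct pulse positions (combinatorial number system)."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def decode_positions(index, k):
    """Inverse of encode_positions for a set of k positions."""
    positions = []
    for i in range(k, 0, -1):
        p = i - 1
        while comb(p + 1, i) <= index:
            p += 1                     # largest p with comb(p, i) <= index
        positions.append(p)
        index -= comb(p, i)
    return sorted(positions)

code = encode_positions([3, 10, 17])   # three pulses in a 40-sample subsegment
restored = decode_positions(code, 3)
```

Coding the positions jointly in this way needs fewer bits than coding each position separately, which is what makes the larger pulse counts affordable.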
2.3. Subband MMPLPC Structure
The audio codec's efficiency can be further improved if the human ear's frequency and energy sensitive properties are exploited by dividing the audio bandwidth into subbands corresponding to the critical bands found in hearing [11],[20]. However, after band splitting, the correlation between adjacent time domain samples is reduced, and the more finely the band is split, the more this correlation is decreased. The MMPLPC codec utilizes linear prediction, which requires high correlation between adjacent samples. As a compromise, we chose four-band splitting.
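A four-band split of this kind can be obtained by cascading two-band splits, as the following sketch shows. It deliberately swaps in the shortest possible QMF pair, the 2-tap Haar filters, so that the example stays short and reconstructs perfectly; a practical codec would use much longer filters for sharper band separation. The function names are hypothetical.

```python
import math

S = 1.0 / math.sqrt(2.0)

def split2(x):
    """Two-band analysis with 2-tap (Haar) QMF filters and decimation by two."""
    low = [S * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [S * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def merge2(low, high):
    """Two-band synthesis, the exact inverse of split2."""
    out = []
    for l, h in zip(low, high):
        out += [S * (l + h), S * (l - h)]
    return out

def split4(x):
    """Cascade two stages into four equal-width bands, lowest band first."""
    low, high = split2(x)
    b1, b2 = split2(low)
    b4, b3 = split2(high)  # decimation mirrors the spectrum of the upper half-band
    return b1, b2, b3, b4

def merge4(b1, b2, b3, b4):
    """Inverse of split4."""
    return merge2(merge2(b1, b2), merge2(b4, b3))
```

Each band carries a quarter of the input samples, so for a 32kHz input every subband runs at 8kHz. Note the swapped order in the upper branch: after decimation the upper half-band is spectrally inverted, so its low-pass output corresponds to the highest band.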
Figure 3: Subband Coded Multipulse Excited Schematic
The subband MMPLPC scheme is shown in Figure 3. The input audio signal is split into four subbands: 0-4kHz, 4-8kHz, 8-12kHz, 12-16kHz, by a Quadrature Mirror Filter (QMF) bank, using two cascaded 64th order QMF filters [21]. The four subband signals are each encoded by an MMPLPC codec. If we were to deploy pure waveform coding for the subband signals in the form of pulse code modulation (PCM) without taking account of perceptual hearing properties, the bit allocation would have to be adjusted according to the signal power in each band using the formula [22]:
$$ R_k = R + \frac{1}{2}\log_2 \frac{\sigma_k^2}{\left(\prod_{j=1}^{4}\sigma_j^2\right)^{1/4}} \qquad (9) $$
where $R$ is the total average number of bits/sample, $\sigma_k^2$ is the signal power in band $k$, and $R_k$ is the number of bits/sample in band $k$. However, hearing sensitivity is different for the different subbands. For the same sound pressure, human hearing within 0.5-8kHz is more sensitive than for frequencies higher than 8kHz, and especially for those higher than 12kHz. Consequently, for the same subband input signal power more bits must be allocated to the more sensitive frequency bands than to the less sensitive high frequency bands. Furthermore, since in each subband we propose to use perceptually motivated MMPLPC, Eq. 9 can only be used as an initial guide in our experimentally determined bit allocation scheme.
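The power-based allocation rule of [22] is easy to try numerically. The sketch below assumes its standard textbook form, in which each band receives the average number of bits/sample plus half the base-2 logarithm of the ratio between its own power and the geometric mean of all band powers. The function name and the example powers are hypothetical.

```python
import math

def allocate_bits(band_powers, avg_bits):
    """Bits/sample per band: avg_bits + 0.5 * log2(power / geometric mean power)."""
    log_geo_mean = sum(math.log2(p) for p in band_powers) / len(band_powers)
    return [avg_bits + 0.5 * (math.log2(p) - log_geo_mean) for p in band_powers]

# Four bands whose power drops by 6 dB per band (illustrative values only).
allocation = allocate_bits([16.0, 4.0, 1.0, 0.25], avg_bits=2.67)
```

The allocations average to `avg_bits` by construction, and every fourfold drop in power costs a band one bit/sample. The rule can return negative values for very weak bands, so a practical scheme has to clip and redistribute, which is one more reason for treating it only as a starting point.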
2.4. Subband-MMPLPC Parameters
Accordingly, the short time energy in each subband was estimated, then the proportion of bits allocated to each band was initially determined using Equation 9, and every subband was assigned to one of sixteen empirically designed bit allocation classes, as demonstrated by Tables 3-5. These tables summarise for both excitation modes the number of excitation pulses, their quantisation accuracies in terms of the number of bits/pulse, as well as the number of bits needed for the encoding of their positions when using the enumerative method [19]. Observe from these tables that for the same bit allocation class the lower frequency subbands were typically assigned a higher number of excitation pulses and a higher number of pulse amplitude quantisation bits, whose values were previously determined from a series of subjective experiments. The LPC analysis frame size of 20ms was found to be suitable for every subband. As expected, the LPC prediction gain increased in each subband when the LPC order was increased. To achieve high fidelity audio, much higher excitation densities were needed than for encoding speech.
Table 3: Bit Allocation Scheme for bands B1 and B2 of the SB-MMPLPC Codec
Table 4: Bit Allocation Scheme for band B3 of the SB-MMPLPC Codec
Table 5: Bit Allocation Scheme for band B4 of the SB-MMPLPC Codec
The number of LPC filter coefficients was 6, 4, 4 and 4 for subbands 1, 2, 3 and 4, respectively. The LPC filter parameters were quantized by linear quantization of the log-area ratios [10]. After band splitting to a bandwidth of 4kHz, the sampling frequency was reduced to 8kHz, yielding a subsegment length of 5ms, equivalent to 40 samples. Accordingly, a subsegment excitation frame size of 5ms or 40 samples was used for our SB-MMPLPC codec. Again, the excitation pulse positions were encoded using the enumerative method [19]. A long term predictor (LTP) was also invoked, as it provided a noticeable improvement in subjective and objective quality when the excitation pulse density was low, and even for high excitation densities it retained the same performance as without the LTP, both in terms of bit rate and SEGSNR. For each subband, 4 bits were needed to linearly quantize the LTP filter gain, while 7 bits were used to encode the LTP delay.
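Linear quantisation of log-area ratios can be sketched as follows. The sketch assumes that the LPC filter is represented by reflection coefficients k with |k| < 1 and uses one common sign convention for the log-area ratio, LAR = ln((1 + k) / (1 - k)). The quantiser range `max_abs`, the bit count and the function names are hypothetical, since the text does not list them here.

```python
import math

def reflection_to_lar(k):
    """Log-area ratio of a reflection coefficient with |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def lar_to_reflection(lar):
    """Inverse mapping; the result always satisfies |k| < 1."""
    return math.tanh(lar / 2.0)

def quantise_linear(value, bits, max_abs):
    """Uniform quantiser over [-max_abs, max_abs); returns (index, reconstruction)."""
    levels = 2 ** bits
    step = 2.0 * max_abs / levels
    index = min(levels - 1, max(0, int((value + max_abs) / step)))
    return index, -max_abs + (index + 0.5) * step
```

The attraction of this representation is that a uniform step in the LAR domain becomes a finer step in k as |k| approaches one, where the filter is most sensitive, and any decoded LAR maps back to a stable filter.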
When quantising the excitation pulse amplitudes, we found that if the number of excitation pulses was less than six, 3, 4, 5, or 6 bit quantization achieved almost the same quality, with the segmental SNR differing by only 0.2dB, while the number of excitation pulse quantisation bits doubled. If the number of pulses was from six to ten, 4, 5, or 6 bit quantization had a similar effect, whereas if it was from eleven to sixteen, 5 or 6 bit quantization also got to within 0.2dB, and so on. Hence, when we constructed Tables 3-5, we used less precision to quantize the lower excitation density pulses, while higher precision was employed to quantize the higher excitation density pulses. Before quantisation, the excitation pulse amplitudes were normalised by their maximum value within the subsegment, and this maximum value was logarithmically quantised using eight bits for each subsegment and each subband. The MMPLPC codec structure was identical for all four subbands. The total number of bits per 20ms became 1707, which yielded a mono bit rate of about 86kbits/s at a coding rate of 1707 bits/640 samples ≈ 2.67 bits/sample.
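The rate figures quoted above follow from simple arithmetic, which the short sketch below reproduces from the 32kHz sampling rate, the 20ms frame and the 1707 bits per frame stated in the text. The variable names are hypothetical.

```python
SAMPLE_RATE_HZ = 32_000
FRAME_MS = 20
BITS_PER_FRAME = 1707

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000    # full-band samples per frame
bit_rate = BITS_PER_FRAME * 1000 / FRAME_MS              # bits per second
bits_per_sample = BITS_PER_FRAME / samples_per_frame     # coding rate

# After the four-band split each subband runs at a quarter of the input rate.
subband_rate_hz = SAMPLE_RATE_HZ // 4
subsegment_samples = subband_rate_hz * 5 // 1000          # one 5 ms subsegment
```

This gives 640 samples per frame, roughly 85.4 kbit/s (quoted in the text as about 86kbits/s), about 2.67 bits/sample, and 40 samples per 5ms subsegment at the 8kHz subband rate.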
2.5. Subband-MMPLPC Performance
In a further experiment the objective and subjective audio quality of our 86kbits/s SB-MMPLPC codec was compared to that of the previously proposed 96kbits/s full-band MMPLPC codec and a 128kbits/s wide-band subband split adaptive differential pulse code modulated (SB-ADPCM) codec. Our SB-ADPCM benchmark codec was based on the CCITT G.722 Recommendation [24], but due to the doubled sampling rate of 32kHz its bit rate was also doubled.
Our informal listening tests demonstrated that the SB-MMPLPC codec operating at an overall bit rate of 86kbits/s outperformed both of the higher-rate benchmark codecs in terms of subjective audio quality, although its objective SEGSNR performance was slightly lower. This fact is attributable to the error weighting filter, which de-emphasized the error signal in the perceptually less audible frequency regions.
2.6. Bit-Sensitivity Analysis
In order to ensure robust source-matched error protection for our favoured 2.67 bits/sample SB-MMPLPC audio codec, it was subjected to bit sensitivity investigations by systematically corrupting each of its bits in a 1707-bit frame and evaluating the SEGSNR penalty inflicted. When, for example, the sensitivity of bit 1 was investigated, this bit was consistently corrupted in every frame, while keeping all other bits intact. The 1707-bit frame is constituted by 103 bits for the LPC parameters and 401 bits for each of the four 5ms subframes. The detailed bit allocation within a frame is shown in Table 6, where A, B, C, and D represent quantities having a variable number of quantisation bits in the subbands SB1-SB4 that add up to a fixed total number of bits.
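The systematic corruption procedure can be sketched with a toy codec standing in for the SB-MMPLPC scheme. Everything codec-specific below is an assumption made for illustration: the "codec" is plain 4-bit uniform PCM, a frame holds four samples, and the function names are hypothetical. Only the procedure itself mirrors the text: one bit position is inverted in every frame, all other bits are kept intact, and the resulting SEGSNR loss is recorded for that position.

```python
import math

BITS = 4  # resolution of the toy PCM quantiser (assumption)

def encode_frame(samples):
    """Toy codec: uniform PCM of samples in [-1, 1), most significant bit first."""
    bits = []
    for s in samples:
        code = min(2 ** BITS - 1, max(0, int((s + 1.0) / 2.0 * 2 ** BITS)))
        bits += [(code >> (BITS - 1 - b)) & 1 for b in range(BITS)]
    return bits

def decode_frame(bits):
    out = []
    for i in range(0, len(bits), BITS):
        code = 0
        for b in bits[i:i + BITS]:
            code = (code << 1) | b
        out.append((code + 0.5) / 2 ** BITS * 2.0 - 1.0)
    return out

def segsnr_db(ref, test, frame_len):
    snrs = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        r, t = ref[start:start + frame_len], test[start:start + frame_len]
        signal = sum(v * v for v in r)
        noise = sum((a - b) ** 2 for a, b in zip(r, t))
        if signal > 0.0 and noise > 0.0:
            snrs.append(10.0 * math.log10(signal / noise))
    return sum(snrs) / len(snrs) if snrs else float("inf")

def bit_sensitivity(signal, frame_len):
    """SEGSNR degradation in dB per bit position, corrupting it in every frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    coded = [encode_frame(f) for f in frames]
    ref = [s for f in frames for s in f]
    clean = segsnr_db(ref, [s for c in coded for s in decode_frame(c)], frame_len)
    degradation = []
    for pos in range(len(coded[0])):
        corrupted = []
        for c in coded:
            flipped = list(c)
            flipped[pos] ^= 1          # invert this bit, keep all others intact
            corrupted += decode_frame(flipped)
        degradation.append(clean - segsnr_db(ref, corrupted, frame_len))
    return degradation
```

Even in this toy setting the curve has the expected shape, with the most significant bits producing by far the deepest SEGSNR drops.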
Table 6: SB-MMPLPC bit allocation for the first subframe
In order to show the objective importance of the different subbands, as an example in Figure 4 we evaluated their SEGSNR using music excerpt 3. Observe that subband 1 has an average SEGSNR of nearly 25dB and subband 2 an average of about 15dB, while subbands 3 and 4 have fairly low SEGSNR, yet they improve the subjective quality. A formal subjective investigation of the bit error sensitivities would be desirable, but for 1707 bits it constitutes a time consuming exercise, hence we had to content ourselves with less reliable objective assessments. Accordingly, the subband energy and the associated SEGSNR values predictably give a lower weight to high-band coding bits than to low-band bits.
Figure 4: SEGSNR versus time for bands B1-B4 using music excerpt 3
Figure 5: Bit Error Sensitivity of the 2.67 bits/sample Audio Codec in Terms of SEGSNR
An overview of the SEGSNR degradation inflicted by systematic bit corruption is given in Figure 5 (a)-(f) for music excerpt 3 for the first 103 LAR bits and the subsequent 401 bits representing the first 5ms sub-frame. The results for the remaining sub-frames are identical to those of the first one. Observe in the global Figure 5(a) that the degradation caused by the first 250 bits, and in particular by the first 50 low-band LAR coefficient bits, is the most dramatic, which can be seen more clearly in the expanded Figure 5(b). This figure also reflects that the LAR sensitivity in subbands 3 and 4 is not dramatic.
The results of Figure 5(c) are also interesting to analyse. For example, according to Table 6, bit positions 106-110, which encode the high-energy low-band bit allocation classifiers, are vulnerable, but those of the lower-energy bands are less sensitive. These bits are followed by the important high-energy band (SB1, SB2) excitation mode bits 117-118 and the less vital low-energy mode bits 119-120. On the same note, the high-energy LTP delay bits 121-127 are followed by more robust lower energy band LTP bits approaching bit position 148. Again, the more vital low-band LTP gain bits cause a deep cut in the SEGSNR curve above this position, but the curve improves for the higher bands. The sub-frame excitation maxima follow from position 165 onwards, with the low-band ones yielding a deep SEGSNR valley, which after a temporary recovery for the high bands dips again to about 10dB, indicating the location of high-band excitation pulses. After this last sensitive region the SEGSNR curve exhibits substantial robustness for the remaining bits. The overall shape of these bit sensitivity curves suggests that there are basically two sensitivity classes, the sensitive C1 and the more robust C2 categories, associated with more than 15dB and less than 10dB SEGSNR degradations, respectively.
Figure 6: SEGSNR Degradation versus BER for two sensitivity classes
The robustness of the proposed codec, expressed in terms of segmental SNR degradation, was also evaluated by injecting random errors at a given fixed BER, as would be experienced over a Gaussian channel. On the basis of the previously discussed bit sensitivities, we initially divided the bits into six sensitivity classes, subjected each class to random bit errors and evaluated the SEGSNR degradation as a function of the BER for all six classes, although these results are not shown in this treatise due to lack of space. As expected on the basis of the bit sensitivity curves shown for the systematic corruption of bits in Figure 5, fundamentally only two different sensitivity classes exist. The SEGSNR degradation of these two classes due to random, rather than consistently periodic, bit errors introduced with various fixed error probabilities is shown in Figure 6. It becomes clear there that the BER of the more sensitive C1 bits must be kept below a lower threshold than that of the more robust C2 bits in order to ensure acceptable audio quality, although even lower BERs are preferable.
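The random error experiment differs from the systematic one only in how the corrupted frames are produced. A minimal sketch of the error injection step is given below, assuming independent bit errors with a fixed probability and a Boolean mask that marks which bits belong to the class under test. The function name, the seed and the choice of the first 250 bits of a 1707-bit frame as the class are purely illustrative and do not reproduce the actual C1 and C2 class membership.

```python
import random

def inject_bit_errors(bits, ber, class_mask, rng):
    """Invert each bit whose mask entry is True with probability `ber`."""
    return [b ^ 1 if m and rng.random() < ber else b
            for b, m in zip(bits, class_mask)]

rng = random.Random(1994)
frame = [rng.randint(0, 1) for _ in range(1707)]
mask = [i < 250 for i in range(1707)]      # hypothetical class under test
noisy = inject_bit_errors(frame, 1e-2, mask, rng)
errors = sum(a != b for a, b in zip(frame, noisy))
```

Decoding many such frames and averaging the SEGSNR loss for each BER value yields one curve per class of the kind described above.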
Having designed the audio codec, the pertinent audio transmission issues via fading mobile channels will be the subject of our next contribution in this issue.
3. Summary and conclusions
A modified multipulse LPC audio codec was introduced and incorporated in an SB-MMPLPC codec for the encoding of wideband audio signals. At a coding rate of about 2.67 bit/sample, high fidelity audio reproduction was achieved for a mono bit rate of 86kbits/s. Future work will be targeted at improving the audio quality, complexity, bit rate and error resilience trade-off achieved.
4. References
[1] F. Mueller-Roemer: Directions in Audio Broadcasting, J. Audio Eng. Soc., Vol. 41, No. 3, March 1993, pp 158-173
[2] G. Plenge: DAB - A New Radio Broadcasting System - State of Development and Ways for its Introduction, Rundfunktech. Mitt., Vol. 35, No. 2, 1991, pp 45 ff.
[3] N.H.C. Gilchrist: BBC Research Department Report RD 1990/16, Digital Sound: Subjective tests on low bit-rate codecs, pp 1-10
[4] R.W. Chang: Synthesis of Band-limited Orthogonal Signals for Multichannel Data Transmission, BSTJ, Dec. 1966
[5] M.S. Zimmermann, A.L. Kirsch: The AN/GSC-10 (KATHRIN) Variable Rate Data Modem for HF Radio, IEEE Trans. on Comm's Technology, Vol. Com-15, No. 2, Apr. 1967
[6] S.B. Weinstein: Data transmission by frequency-division multiplexing using the discrete Fourier transform, IEEE Trans. Comms, Vol. COM-19, No. 5, Oct. 1971, pp 628-634
[7] L.J. Cimini: Analysis and simulation of a digital mobile channel using orthogonal frequency division multiplexing, IEEE Trans. Comms, Vol. Com-33, No. 7, July 1985, pp 665-675
[8] M. Alard, R. Lassalle: Principles of modulation and channel coding for digital broadcasting for mobile receivers, EBU Review, Technical No. 224, Aug. 1987, pp 47-69
[9] Proceedings of the 1st Intern. Symp. on DAB, June 1992, Montreux, Switzerland
[10] R. Steele (Ed.): Mobile Radio Communications, Pentech Publishers, 1992
[11] G. Theile, M. Link, and G. Stoll: Low bit rate coding of high quality audio signals, AES preprint, p. 2432, 1987
[12] S. Smyth and P. Challener: An efficient coding scheme for the transmission of high quality music signals, Br. Telecom. Technol. J., Vol. 6, No. 2, Apr. 1988, pp 60-70
[13] K. Brandenburg: OCF - A new coding algorithm for high quality sound signals, Proc. ICASSP, 1987, pp 141-144
[14] J. Johnston: Transform coding of audio signals using perceptual noise criteria, IEEE J. on selected areas in Commu., Vol. 6, No. 2, Feb. 1988, pp 314-323
[15] S. Singhal: High quality audio coding using multipulse LPC, Proc. ICASSP, 1990, pp 1101-1104
[16] D. Sinha and A. Tewfik: Synthesis/coding of audio signals using optimized wavelets, Proc. ICASSP, 1992, pp I-113-I-116
[17] S. Singhal and B. Atal: Amplitude optimization and pitch prediction in multipulse coders, IEEE Trans. on ASSP, Vol. 37, No. 3, March 1989, pp 317-327
[18] X. Lin and R. Steele: Subband coding with modified multipulse LPC for high quality audio, Proc. ICASSP, 1993, pp I201-I204
[19] P. Kroon and E. Deprettere: A class of analysis-by-synthesis predictive coders for high quality speech coding at rates between 4.8 and 16 kbits/s, IEEE J. on selected areas in Commu., Vol. 6, No. 2, Feb. 1988, pp 353-363
[20] R. Veldhuis, M. Breeuwer, and R. van der Waal: Subband coding of digital audio signals without loss of quality, Proc. ICASSP, 1989, pp 2009-2012
[21] R. Crochiere and L. Rabiner: Multirate digital signal processing, Englewood Cliffs, New Jersey: Prentice-Hall
[22] N. Jayant and P. Noll: Digital coding of waveforms, Englewood Cliffs: Prentice-Hall, 1984
[23] E. Zwicker and H. Fastl: Psychoacoustics, Berlin: Springer-Verlag, 1990
[24] X. Maitre: 7 kHz audio coding within 64 kbit/s, IEEE J. on selected areas in Commu., Vol. 6, No. 2, Feb. 1988, pp 283-298