Spectral noise reduction
Text and figures © DD4WH, under GNU GPLv3
NEW (January 17th, 2018): two algorithms have been implemented: the Kim & Ruwisch algorithm (by DD4WH) and the Romanin et al. algorithm (by DL2FW). Both seem to be very promising candidates!
Noise reduction is a pretty complex task, as I had to learn. The most promising algorithms mentioned in the professional noise reduction literature are spectral subtraction and spectral weighting methods (Martin 1994, Kim et al. 2002). There was a longstanding wish to add a really good, state-of-the-art noise reduction to the Teensy Convolution SDR.
The following guidelines were followed:
- Use an algorithm in the frequency domain rather than in the time domain. See for example Schmitt et al. (2002).
- Use a minimum statistics algorithm OR
- Use the algorithm by Ephraim & Malah (1984, 1985)
- Use an algorithm that has explicitly been designed for low CPU load AND suppression of the musical tones often associated with Spectral Weighting, e.g. the low CPU load rule by Romanin et al. (2009).
- Use relatively small sizes for the FFT-iFFT chain in order to restrict the memory needs, i.e. 256-point FFT/iFFTs.
- Use the algorithm in the decimated path, e.g. at 12ksps (= 48ksps decimated by 4).
- Think hard whether the latency introduced by this implementation is small enough to be negligible even for CW freaks with QSK etc. and potentially new applications like VOX. Is a latency of 10.67ms (256 * 4 / 48000 / 2) acceptable?
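The latency figure quoted above can be checked with a small arithmetic sketch (the function name is hypothetical): with 50% overlap the frame advance is half the FFT size, and the effective sample rate is the input rate divided by the decimation factor.

```c
/* Latency of one frame advance in milliseconds.
   With 50% overlap the frame step is fft_size / 2 = 128 samples;
   at 48 ksps decimated by 4 (= 12 ksps) this is 128 / 12000 s = 10.67 ms. */
float nr_latency_ms(int fft_size, float fs, int decimation)
{
    float fs_dec = fs / (float)decimation;  /* effective sample rate       */
    float step   = (float)fft_size / 2.0f;  /* frame advance, 50 % overlap */
    return step / fs_dec * 1000.0f;         /* latency in milliseconds     */
}
```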
For all spectral noise reduction algorithms we can define the noisy input x(k) as the sum of the pure speech signal s(k) and the noise signal n(k) in the time domain:
x(k) = s(k) + n(k) (eq. 1 of Schmitt et al. 2002)
Have a look at figure 1: Spectral Noise Reduction transforms the time domain signal into the frequency domain. This is done for "frames", i.e. blocks of samples, one at a time. The frame number is n. When we transform into the frequency domain, we get a spectrum of the signal with a number of frequency bins bin[i].
The principle of spectral noise reduction is to estimate the noise in each bin (Nest) and multiply the bin amplitude by (1 - Nest / X) in order to eliminate the noise:
Y(n, bin[i]) = X(n, bin[i]) * (1 - Nest(n, bin[i]) / X(n, bin[i])) (eq. 2 of Schmitt et al. 2002)
which is equivalent to:
Y(n, bin[i]) = X(n, bin[i]) * H(n, bin[i]) (eq. 3 of Schmitt et al. 2002)
So we need two core elements for the noise reduction:
- a noise estimator Nest to estimate the noise in each bin
- an algorithm/function H to calculate the weight to be applied to each bin
After that, the weighting function is multiplied with the bin spectrum X(n, bin[i]) and the result is transformed back to the time domain. Ready!
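The per-bin weighting of eqs. 2 and 3 can be sketched in a few lines of C; the name nr_gain and the clamping to non-negative gains are hypothetical, not taken from the actual implementation:

```c
/* Per-bin spectral weighting gain (eqs. 2/3 of Schmitt et al. 2002):
   given the magnitude X of a bin and the noise estimate Nest,
   the gain H = 1 - Nest / X is applied to the bin.
   Assumes X > 0; the gain is clamped so it never goes negative. */
float nr_gain(float X, float Nest)
{
    float H = 1.0f - Nest / X;     /* eq. 2: (1 - Nest / X)          */
    return (H > 0.0f) ? H : 0.0f;  /* noise-dominated bins -> gain 0 */
}
```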
Diagram A: General functioning blocks of the noise reduction (modified from Schmitt et al. 2002). (c) DD4WH, under GNU GPLv3
The first spectral noise reduction I implemented is a modified version of the algorithm by Kim & Ruwisch (2002), which is called spectral weighting with noise estimation by minimum statistics:
- FFT size is 256 points, overlap is 50% (half-overlapped data buffers), window function is Hann, frame step size is 128 (10.67ms)
- fill first half of FFT_buffer with last frame's audio samples
- fill second half of FFT buffer with recent frame's audio samples
- do windowing on the whole FFT buffer
- perform FFT 256 points
- NR_X[bin] --> calculate magnitude for 128 frequency bins (sqrtf(real * real + imag * imag))
- NR_E[bin] --> average magnitude over 3 frames (32ms)
- NR_M[bin] --> search bin-wise for the minimum magnitude value in the last 15 frames (160ms)
- NR_lambda[bin] --> calculate signal-to-noise ratio NR_T[bin] = NR_X[bin] / NR_M[bin]; if NR_T > NR_PSI (an SNR threshold, user-definable from 2 to 7) --> NR_lambda[bin] = NR_M[bin], else NR_lambda[bin] = NR_E[bin]
- gain weight calculation: NR_G[bin] = 1 - (NR_lambda[bin] / NR_E[bin])
- time smoothing (exponential averaging) of gain weights: NR_Gts[t, bin] = alpha * NR_Gts[t-1, bin] + (1 - alpha) * NR_G[t, bin]
- frequency smoothing of gain weights: NR_Gfs[bin] = beta * NR_Gts[bin - 1] + (1 - 2 * beta) * NR_Gts[bin] + beta * NR_Gts[bin + 1]
- apply gain weighting to all the FFT bins in the first half of the FFT buffer, real AND imaginary
- do this also for the second half of the FFT buffer in order to account for conjugate symmetry
- perform 256 point inverse FFT
- do overlap-add: take the real part of the first half of the current iFFT result and add it to the second half of the last frame's iFFT result
- DONE !
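The gain-calculation core of the steps above (magnitude averaging, minimum search, noise floor decision, gain weight) can be sketched as follows. All names, the ring-buffer layout and the calling convention are hypothetical; windowing, FFT/iFFT, overlap-add and the time/frequency smoothing steps are omitted for brevity:

```c
#define NR_BINS 128   /* frequency bins (256-point FFT)        */
#define NR_L      3   /* frames averaged for NR_E              */
#define NR_N     15   /* frames searched for the minimum NR_M  */

/* Gain calculation for one frame.  NR_hist is a ring buffer holding the
   magnitudes of the last NR_N frames; 'frame' indexes the current frame
   in that buffer; NR_X holds the current magnitudes.  Assumes all
   magnitudes are strictly positive. */
void kim_ruwisch_gains(const float NR_X[NR_BINS],
                       float NR_hist[NR_N][NR_BINS], int frame,
                       float psi, float NR_G[NR_BINS])
{
    for (int bin = 0; bin < NR_BINS; bin++) {
        /* NR_E: average magnitude over the last NR_L frames */
        float NR_E = 0.0f;
        for (int l = 0; l < NR_L; l++)
            NR_E += NR_hist[(frame - l + NR_N) % NR_N][bin];
        NR_E /= (float)NR_L;

        /* NR_M: bin-wise minimum magnitude over the last NR_N frames */
        float NR_M = NR_hist[0][bin];
        for (int n = 1; n < NR_N; n++)
            if (NR_hist[n][bin] < NR_M)
                NR_M = NR_hist[n][bin];

        /* NR_T: SNR estimate; choose the noise estimate NR_lambda */
        float NR_T = NR_X[bin] / NR_M;
        float NR_lambda = (NR_T > psi) ? NR_M : NR_E;

        /* gain weight NR_G = 1 - NR_lambda / NR_E, clamped at zero */
        float G = 1.0f - NR_lambda / NR_E;
        NR_G[bin] = (G > 0.0f) ? G : 0.0f;
    }
}
```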
This algorithm already gives very acceptable noise suppression results. Clear noise suppression, only minor bubbling sounds, and much better intelligibility have been achieved with the following settings:
- decimation-by-4 --> effective sample rate 12ksps
- number of frames to average magnitudes: L = 3
- number of averaged magnitudes values for minimum search: N = 15
- threshold for noise floor PSI = 7 to 12
- time averaging constant alpha = 0.9 to 0.95
- frequency averaging constant beta = 0.25
- noise suppression strength 0.9 to 1.0
A second way of doing this is the scheme described by Romanin et al. (2009) and Schmitt et al. (2002); it was implemented in a joint collaboration with Michael, DL2FW.
We need a fast weighting function Hk (taken from Romanin et al. 2009):
The weighting function estimates the signal-to-noise ratio (SNR) prior to the weighting (SNRprio) and the SNR posterior to the weighting (SNRpost).
- set time constant alpha for the estimation of the a priori SNR
- set time constant beta for exponential averager for noise power estimation to 0.85
- estimate the noise power spectrum in each bin by applying an exponential averager with beta = 0.85: Nest(n, bin[i]) = beta * Nest(n-1, bin[i]) + X(n, bin[i]) * (1 - beta) (eq. 5 of Schmitt et al. 2002, eq. 12 of Romanin et al. 2009)
- calculate SNRpost (n, bin[i]) = (X(n, bin[i])^2 / Nest(n, bin[i])^2) - 1 (eq. 13 of Schmitt et al. 2002)
- calculate SNRprio (n, bin[i]) = (1 - alpha) * Q[SNRpost(n, bin[i])] + alpha * (Hk(n - 1, bin[i]) * X(n - 1, bin[i]))^2 / Nest(n, bin[i])^2 (eq. 14 of Schmitt et al. 2002, eq. 13 of Romanin et al. 2009) [Q[x] = x if x >= 0, else Q[x] = 0]
- calculate vk = SNRprio(n, bin[i]) / (SNRprio(n, bin[i]) + 1) * SNRpost(n, bin[i]) (eq. 12 of Schmitt et al. 2002, eq. 9 of Romanin et al. 2009)
- finally calculate the weighting function for each bin: Hk(n, bin[i]) = 1 / SNRpost(n, bin[i]) * sqrtf(0.7212 * vk + vk * vk) (eq. 26 of Romanin et al. 2009)
Although the algorithm is faster than the original Ephraim/Malah algorithm by a factor of 115 (!) on average according to Romanin et al. (2009), it still consumes a lot of resources. Therefore we should update the noise spectrum estimate Nest only in those frames n where only noise is present (Romanin et al. 2009, p. 239). Those frames are detected by a voice activity detector (VAD). The approach of Hu et al. (2001) is followed for this detector: a statistical model-based voice activity detection, which computes the likelihood ratio of speech being present or absent in the input frame n. The threshold of speech detection, VAD_thresh, has to be determined.
- We have to calculate eq. (10) of Hu et al. (2001) for one frame n:
VAD = 1/frame_size * sum over i = 0 to frame_size - 1 of (X(n, bin[i])^2 / Nest(n, bin[i])^2 - logf(X(n, bin[i])^2 / Nest(n, bin[i])^2) - 1)
If VAD is larger than VAD_thresh --> speech present! If VAD is smaller than VAD_thresh --> noise present!
Only in the latter case should we estimate the noise spectrum in that frame!
Diagram B: Implementation blocks of the noise reduction (modified from Romanin et al. 2009). (c) DD4WH, under GNU GPLv3
Preliminary tests of this algorithm implemented in the Teensy Convolution SDR show very promising results: strong noise reduction with negligible artefacts!
References
Berouti, M., R. Schwartz & J. Makhoul (1979): Enhancement of speech corrupted by acoustic noise. - Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., April 1979, pages 208-211.
Ephraim, Y. & D. Malah (1984): Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. - IEEE Trans. Acoust. Speech Signal Proc. ASSP-32 (6), Dec 1984, pages 1109-1121.
Ephraim, Y. & D. Malah (1985): Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. - IEEE Trans. Acoust. Speech Signal Proc. ASSP-33 (2), April 1985, pages 443-445.
Esch, T. & P. Vary (2009): Efficient musical noise suppression for speech enhancement systems. - ICASSP 2009, pages 4409-4412.
Gerkmann, T. & R.C. Hendriks (2012): Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. - IEEE Transactions on Speech and Audio Processing 13 (5), pages 857-869.
Hu, Y., M. Bhatnagar & P. Loizou (2001): A cross-correlation technique for enhancing speech corrupted with correlated noise. - ICASSP 1, May 2001, pages 673-676.
Kim, H.-G., K. Obermayer, M. Bode & D. Ruwisch (2001): Efficient speech enhancement by diffusive gain factors (DGF). - Eurospeech 2001, Scandinavia.
Kim, H.-G. & D. Ruwisch (2002): Speech enhancement in non-stationary noise environments. - 7th International Conference on Spoken Language Processing, Denver, Colorado, USA, September 16-20, 2002.
Martin, R. (1994): Spectral subtraction based on minimum statistics. - Proc. EUSIPCO 94, pages 1182-1185.
Martin, R. (2001): Noise power spectral density estimation based on optimal smoothing and minimum statistics. - IEEE Transactions on Speech and Audio Processing 9 (5), July 2001.
MATLAB voicebox implementation - MMSE noise estimation.
Romanin, M., E. Marchetto & F. Avanzini (2009): A spectral subtraction rule for real-time DSP implementation of noise reduction in speech signals. - Proc. 12th Conf. DAFx-09, pages 235-239.
Schmitt, S., M. Sandrock & J. Cronemeyer (2002): Single channel noise reduction for hands free operation in automotive environments. - AES 112th Convention, Munich, May 11-14, 2002.
Sohn, J. & W. Sung: A voice activity detector employing soft decision based noise spectrum adaptation.
fig. 0: Noise reduction in the UHSDR: work in progress [2017-11-18]