Thursday, March 29, 2018

mfcc - Cepstral Mean Normalization


Can anyone please explain Cepstral Mean Normalisation, and how the convolution-multiplication equivalence property affects it? Is it mandatory to do CMN in MFCC-based speaker recognition? Why is this property of convolution a fundamental need for MFCCs?


I am very new to signal processing. Please help.



Answer




Just to make things clear: this property is not fundamental, but it is important. It is also a fundamental point of difference between using the DFT and the DCT for the spectrum calculation.


Why do we do Cepstral Mean Normalisation


In speaker recognition we want to remove any channel effects (the impulse response of the vocal tract, audio path, room, etc.). Provided that the input signal is $x[n]$ and the channel impulse response is $h[n]$, the recorded signal is the linear convolution of the two:


$$y[n] = x[n] \star h[n]$$


By taking the Fourier Transform we get:


$$Y[f] = X[f]\cdot H[f] $$


due to the convolution-multiplication equivalence property of the Fourier Transform. That is why this property of the FFT is so important at this step.


The next step in calculating the cepstrum is taking the logarithm of the spectrum:


$$Y[q] = \log Y[f] = \log \left( X[f] \cdot H[f]\right) = X[q] + H[q]$$


because $\log(ab) = \log a + \log b$. Here $q$ is the quefrency. As one might notice, by taking the cepstrum, a convolution in the time domain ends up as an addition in the cepstral (quefrency) domain.
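This chain is easy to verify numerically. A minimal NumPy sketch, where the "source" signal and the channel impulse response are random stand-ins rather than real speech:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)   # stand-in for the source signal x[n]
h = rng.standard_normal(8)   # stand-in for the channel impulse response h[n]

# Linear convolution y[n] = x[n] * h[n]; its length is N + len(h) - 1.
y = np.convolve(x, h)
L = len(y)

# Zero-pad the FFTs to length L so they match the linear convolution.
X = np.fft.fft(x, L)
H = np.fft.fft(h, L)
Y = np.fft.fft(y, L)

# Convolution in time is multiplication in frequency...
assert np.allclose(Y, X * H)

# ...and the log of the magnitude spectrum turns the product into a sum.
assert np.allclose(np.log(np.abs(Y)),
                   np.log(np.abs(X)) + np.log(np.abs(H)))
```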



What is the Cepstral Mean Normalisation?


Now we know that in the cepstral domain any convolutional distortion is represented by an addition. Let's assume that all such distortions are stationary (a strong assumption, since the vocal tract response changes from frame to frame even if the channel does not) and that the stationary part of the speech itself is negligible. Then for every $i$-th frame the following holds:


$$Y_i[q] = H[q] + X_i[q] $$


By taking the average over all $N$ frames we get:


$$\dfrac{1}{N}\sum_{i} Y_i[q] = H[q] + \dfrac{1}{N}\sum_{i} X_i[q]$$


Defining the difference:


$$\begin{aligned} R_i[q] &= Y_i[q] - \dfrac{1}{N}\sum_{j} Y_j[q]\\ &= H[q] + X_i[q] - \left(H[q] + \dfrac{1}{N}\sum_{j} X_j[q]\right) \\ &= X_i[q] - \dfrac{1}{N}\sum_{j} X_j[q] \end{aligned}$$


We end up with our signal, with the channel distortion removed. Putting the above equations into plain English:



  • Calculate the cepstrum of each frame


  • Subtract the per-coefficient average from each frame

  • Optionally divide by the variance as well, performing Cepstral Mean (and Variance) Normalisation as opposed to mere Subtraction.
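The steps above can be sketched as follows; the function name and the toy MFCC matrix are illustrative, not from any particular library:

```python
import numpy as np

def cepstral_mean_normalisation(mfcc, use_variance=False):
    """mfcc: (num_frames, num_coeffs) matrix of cepstral coefficients.
    Subtracts the per-coefficient mean over all frames; optionally also
    divides by the standard deviation (mean *and* variance normalisation)."""
    out = mfcc - mfcc.mean(axis=0)
    if use_variance:
        out = out / (mfcc.std(axis=0) + 1e-8)  # epsilon avoids division by zero
    return out

# Toy check: a constant per-coefficient offset (the stationary H[q] term)
# added to every frame vanishes after normalisation.
rng = np.random.default_rng(1)
c = rng.standard_normal((100, 13))   # stand-in for real MFCC frames X_i[q]
offset = rng.standard_normal(13)     # stationary "channel" term H[q]
assert np.allclose(cepstral_mean_normalisation(c + offset),
                   cepstral_mean_normalisation(c))
```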


Is Cepstral Mean Normalisation necessary?


It's not mandatory, especially when you are recognising a single speaker in a single environment. In fact, it can even degrade your results, as it is prone to errors caused by additive noise:


$$y[n] = x[n] \star h[n] + w[n] $$


$$Y[f] = X[f]\cdot H[f] + W[f] $$


$$\log Y[f] = \log \left[X[f]\left(H[f]+\dfrac{W[f]}{X[f]} \right) \right] = \log X[f] +\log \left(H[f]+\color{red}{\dfrac{W[f]}{X[f]}} \right)$$


In poor SNR conditions, the marked term can dominate the estimate.
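A small numerical sketch of this breakdown (random stand-in signals; the noise level is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
x = rng.standard_normal(N)   # stand-in source signal
h = rng.standard_normal(8)   # stand-in channel impulse response

y_clean = np.convolve(x, h)          # y[n] = x[n] * h[n]
L = len(y_clean)
w = 0.5 * rng.standard_normal(L)     # additive noise w[n], fairly poor SNR
y_noisy = y_clean + w

# log|X| + log|H|: what the homomorphic decomposition expects to recover.
log_sum = (np.log(np.abs(np.fft.fft(x, L)))
           + np.log(np.abs(np.fft.fft(h, L))))

err_clean = np.abs(np.log(np.abs(np.fft.fft(y_clean))) - log_sum).max()
err_noisy = np.abs(np.log(np.abs(np.fft.fft(y_noisy))) - log_sum).max()

assert err_clean < 1e-8       # decomposition holds without noise...
assert err_noisy > err_clean  # ...and breaks once W[f]/X[f] is non-negligible
```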


That said, when CMS is performed you can usually gain a few extra percent of accuracy. If you add the performance gain from derivatives (deltas) of the coefficients, you get a real boost in recognition rate. The final decision is up to you, especially since there are plenty of other methods for improving speech recognition systems.


