Sunday, January 5, 2020

python - Librosa stft + istft - Understanding my output (which always seems too perfect) at varying window lengths


I've just started to use Python with Librosa for a DSP project I'll be working on. The first thing I've been trying to do is determine my preferred parameters for the FFT window size and hop distance.



The domain is music, and my plan is to try a variety of values for the window size and hop distance, and for each of them, do a forward STFT and then an inverse STFT and write the result back out to a WAV file. I'll then listen to the results and choose the values that I think best capture the information in the input.


My simple code is as follows:


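import librosa.core as lc
import numpy as np
import scipy.io.wavfile  # "import scipy" alone does not load the io.wavfile submodule

_n_fft = 80
print(_n_fft)
_hop_length = _n_fft // 4  # integer division: hop_length must be an int in Python 3

# Load the first 20 seconds as mono, resampled to 44.1 kHz
data, sampleRate = lc.load("13_Hate_To_See_Your_Heart_Break.wav", sr=44100, duration=20, mono=True)

# Forward STFT, then inverse STFT back to a time-domain signal
stftMat = lc.stft(data, n_fft=_n_fft, hop_length=_hop_length, center=True)
iStftMat = lc.istft(stftMat, hop_length=_hop_length)

scipy.io.wavfile.write("testOut.wav", sampleRate, iStftMat)

# Magnitude of the complex STFT (not squared, so strictly a magnitude rather than a power)
powerMat = np.abs(stftMat)
print("powerMat shape = " + str(powerMat.shape))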


The behavior I'm experiencing, however, is not what I would have expected.


When I use an incredibly short window length (as in the code above) - I get the correct number of window frames for my FFT length and hop-distance:


powerMat shape = (41, 44101)

44101 window frames makes sense, and as you can see the frequency resolution is low, with only 41 frequency bins. I would expect the resulting testOut.wav to sound pretty terrible, as the frequency resolution is so low. I can see the effects on a rendered spectrogram, as the subtleties in frequency changes are smeared together. Listening back, however, the resulting track sounds great - pretty much like the original input.


Compare this with a much wider window size of 44100 (1 window = 1 second of audio, hop-distance of 1/4*Window size):


powerMat shape = (22051, 81)

Again this output makes sense - in the 20 seconds of audio, with a window length of a second and a hop distance of a quarter second, there would be about 80 window frames. This is pretty poor time resolution, but fairly high frequency resolution with 22051 frequency bins. Again I would expect the resulting testOut.wav to sound poor in the time domain.
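The two shapes above can be checked with a little arithmetic. The sketch below assumes a real-valued input and center=True (frames are padded so that frame t is centered on sample t * hop_length); expected_stft_shape is a hypothetical helper name used only for illustration, not a Librosa function.

```python
def expected_stft_shape(n_samples, n_fft, hop_length):
    """Expected (frequency bins, frames) of a centered STFT of a real signal."""
    n_bins = 1 + n_fft // 2                 # non-negative frequencies only
    n_frames = 1 + n_samples // hop_length  # one frame per hop, plus the frame at t=0
    return n_bins, n_frames


sample_rate = 44100
n_samples = 20 * sample_rate  # 20 seconds of audio

for n_fft in (80, 44100):
    hop_length = n_fft // 4
    shape = expected_stft_shape(n_samples, n_fft, hop_length)
    bin_spacing_hz = sample_rate / n_fft  # distance between adjacent frequency bins
    print(n_fft, shape, bin_spacing_hz)
```

For n_fft=80 this gives (41, 44101) with bins about 551 Hz apart, and for n_fft=44100 it gives (22051, 81) with bins 1 Hz apart, matching the printed shapes.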


Once again the resulting track sounds great - pretty much like the original input. These extreme values, and everything in between, pretty much yield the same output testOut.wav, even though on the real power spectrum I can visibly see the differences when changing the parameters.
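To check whether the round trip itself loses anything, independent of my audio file, here is a minimal NumPy-only sketch. It is not Librosa's implementation: stft_sketch and istft_sketch are hypothetical helpers, and the periodic Hann window, reflect padding, hop of n_fft/4 and random test signal are all assumptions chosen for illustration.

```python
import numpy as np


def stft_sketch(x, n_fft, hop):
    """Hann-windowed STFT with reflect padding (hypothetical helper)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    padded = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1).T  # (1 + n_fft // 2, n_frames)


def istft_sketch(spec, n_fft, hop, length):
    """Inverse of stft_sketch: windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec.T, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-10)  # avoid dividing by zero at the padded edges
    return out[n_fft // 2 : n_fft // 2 + length]  # drop the padding


rng = np.random.default_rng(0)
x = rng.standard_normal(8192)  # stand-in test signal, not real audio

errors = {}
for n_fft in (16, 256, 2048):
    hop = n_fft // 4
    spec = stft_sketch(x, n_fft, hop)
    y = istft_sketch(spec, n_fft, hop, len(x))
    errors[n_fft] = float(np.max(np.abs(x - y)))
    print(n_fft, spec.shape, errors[n_fft])
```

In this sketch the inverse works from the full complex matrix (magnitude and phase), whereas the spectrogram I am looking at shows only the magnitude, so the two are not directly comparable.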



Is there a fundamental misunderstanding I'm having with the STFT and its inverse? Am I simply not understanding the library?



