Friday, October 25, 2019

downsampling - How to extract vocal part from stereo audio signal?


I'm processing an MP3 file and have run into this problem. My MP3 is stereo encoded. What I want to do is extract the vocal part for further processing (the output can be mono or stereo; either is OK).


As far as I know, MP3 encodes audio into disjoint frequency sub-bands. I think I can limit the signal to the vocal range with high-pass/low-pass filters whose cutoff frequencies are set properly. However, the result will still contain some of the purely instrumental signal in this case. Alternatively, after googling, I think I could compute the background signal first by inverting one channel and adding it to the other, assuming the vocal part is centered in the stereo image (this is called phase cancellation). After this transformation, the signal is mono. Then I would merge the original stereo into mono and subtract the background signal from it.


In terms of effectiveness, which one is preferred (or is there any other solution)? If the second one, with the two channels called A and B, should (B-A) or (A-B) be used to compute the background? As for merging the two channels, is the arithmetic mean accurate enough? Or can I downsample each channel by a factor of two and interleave the downsampled signals to get the mono result?


Thanks and best regards.



Answer



First of all, how the data is encoded in an MP3 file is irrelevant to the question unless you aim at doing compressed-domain processing (which would be quite foolish). So you can assume your algorithm will work with decompressed time-domain data.


The sum / difference is a very, very basic trick for vocal suppression (not extraction). It is based on the assumption that the vocals are mixed at the center of the stereo field, while other instruments are panned laterally. This is rarely true. L-R and R-L will sound the same (the human ear is insensitive to a global phase shift) and will give you a mono mix without the instruments mixed at the center. The problem is, once you have recovered the background, what will you do with it? Try to suppress it from the center (average) signal? This won't work: you will be doing (L + R) / 2 - (L - R), which is just another weighted mix of L and R and is not very interesting... You can try any linear combination of those two signals (averaged and "center removed"), and nothing will come out of it!
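To make the sum / difference trick concrete, here is a minimal sketch in plain Python. The function name `mid_side` and the toy sample values are assumptions made up for illustration; real input would be decoded PCM samples. The toy mix follows the idealized assumption described above: a vocal present equally in both channels, with one instrument only on the left and another only on the right.

```python
def mid_side(left, right):
    """Return (mid, side) for two equal-length sequences of samples.

    mid  = (L + R) / 2  -> the averaged mono mix
    side = L - R        -> the "center removed" signal
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


# Toy signals (hypothetical values, chosen to be exact in floating point).
vocal = [0.5, -0.25, 0.75, 0.0]    # mixed dead center
guitar = [0.25, 0.5, -0.5, 0.25]   # panned hard left
keys = [-0.25, 0.25, 0.5, -0.5]    # panned hard right

left = [v + g for v, g in zip(vocal, guitar)]
right = [v + k for v, k in zip(vocal, keys)]

mid, side = mid_side(left, right)

# The vocal cancels in the side signal, leaving guitar - keys.
print(side)
# Swapping the channels only flips the sign of the side signal.
print(mid_side(right, left)[1])
```

Note that `a * mid + b * side` equals `(a/2 + b) * L + (a/2 - b) * R`, so any combination of the two is again a weighted sum of the original channels. The weights on L and R add up to `a` regardless of `b`, so the centered vocal can only be scaled or cancelled, never separated from the panned instruments.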


Regarding filtering approaches: the f0 of the voice rarely exceeds 1000 Hz, but its harmonics can go above that. Removing the highest frequencies will make consonants (especially sss, chhh) sound unpleasant. Some male voices go below 100 Hz. You can safely cut whatever is below 50 or 60 Hz (bass, kick), though.
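As a rough illustration of cutting the very low end, here is a sketch of a first-order (RC-style) high-pass filter in plain Python. The function name `highpass` and the 60 Hz default are assumptions for this example. A first-order filter has only a gentle 6 dB/octave slope, so a practical implementation would likely use a steeper design; this only shows the idea on already decoded samples.

```python
import math


def highpass(samples, sample_rate, cutoff_hz=60.0):
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # time constant for the cutoff
    dt = 1.0 / sample_rate                  # sampling period
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def sine(freq_hz, sample_rate, seconds=1.0):
    """Helper for the demo: a unit-amplitude test tone."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


sample_rate = 44100
low = highpass(sine(20.0, sample_rate), sample_rate)     # below the cutoff
high = highpass(sine(1000.0, sample_rate), sample_rate)  # well above the cutoff

# Compare peak levels over the second half, after the start-up transient.
half = sample_rate // 2
print(max(abs(v) for v in low[half:]))   # clearly attenuated
print(max(abs(v) for v in high[half:]))  # passes almost unchanged
```

The same loop would be applied to each channel separately. It removes sub-bass content but, as noted above, filtering alone cannot separate the voice from instruments that share its frequency range.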


Some recent developments in voice separation worth exploring:




  • Jean Louis Durrieu's background NMF + harmonic comb > filter model. Python code here.

  • Rafii's background extraction approach. Straightforward to code and works well on computer-produced music with very repetitive patterns like Electro, Hip-hop...

  • Hsu's approach based on f0 detection, tracking and masking. "A Tandem Algorithm for Singing Pitch Extraction and Voice Separation from Music Accompaniment" (can't find accessible PDF).

