Using Gabor-Based Time-Frequency Filtering for Audio Source Separation
This example shows how to isolate a speech signal using a binary mask with the discrete Gabor transform (DGT). You compare results with those obtained using the short-time Fourier transform (STFT). Gabor-based time-frequency filtering enables precise signal manipulation in the time-frequency domain.
Import Data
Load audio files containing male and female speech sampled at 4 kHz. These audio files are from the example Cocktail Party Source Separation Using Deep Learning Networks (Audio Toolbox). The duration of each file is about 20 seconds. Listen to the audio files individually for reference.
[mSpeech,Fs] = audioread("MaleSpeech-16-4-mono-20secs.wav");
soundsc(mSpeech,Fs)
[fSpeech] = audioread("FemaleSpeech-16-4-mono-20secs.wav");
soundsc(fSpeech,Fs)
Create Signal
Combine the two speech sources. Ensure the sources have equal power in the mix. Scale the mix so that its maximum amplitude is one.
mSpeech = mSpeech/norm(mSpeech); fSpeech = fSpeech/norm(fSpeech); ampAdj = max(abs([mSpeech;fSpeech])); mSpeech = mSpeech/ampAdj; fSpeech = fSpeech/ampAdj; mix = mSpeech + fSpeech; mix = mix./max(abs(mix));
Visualize the original and mixed signals.
tm = (0:numel(mSpeech)-1)/Fs; tiledlayout(3,1) nexttile plot(tm,mSpeech) title("Male Speech") grid on nexttile plot(tm,fSpeech) title("Female Speech") grid on nexttile plot(tm,mix) title("Speech Mix") grid on
Listen to the mixed speech signal.
soundsc(mix,Fs)
Time-Frequency Filtering Using STFT
Obtain STFT
Time-frequency filtering with the short-time Fourier transform involves inverting the transform. To ensure perfect reconstruction of nonmodified spectra using the ISTFT, the analysis window and the overlap must satisfy the Constant Overlap-Add (COLA) Constraint. Use the function hann
(Signal Processing Toolbox) to specify a Hann window of length 128 and an overlap of 96 samples. Use the function iscola
(Signal Processing Toolbox) to confirm the window and overlap satisfy the constraint.
winLen = 128; win = hann(winLen,"periodic"); overlapLen = 96; fprintf("Does window satisfy cola condition? %d\n",iscola(win,overlapLen))
Does window satisfy cola condition? 1
Use the function stft
(Signal Processing Toolbox) to obtain the one-sided STFT of the male and female signals. Specify 128 DFT points.
frqRng = "onesided"; numBin = 128; Sm = stft(mSpeech, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng); Sf = stft(fSpeech, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng);
Visualize the magnitude of the STFT of the two signals.
figure tiledlayout(2,1) nexttile stft(mSpeech, Fs, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng) title("Male Speech STFT") nexttile stft(fSpeech, Fs, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng) title("Female Speech STFT")
Create Binary Mask Using STFT
Use the STFT of both signals to create a binary mask. A true
value identifies a time-frequency bin where the magnitude of the STFT of the male signal is greater than or equal to the corresponding bin in the STFT of the female signal.
stftBmask = abs(Sm) >= abs(Sf);
Reconstruct Male Speech
Apply the binary mask to the STFT of the mixed signal. Use the function istft
(Signal Processing Toolbox) to invert the result.
[S_mix,F] = stft(mix, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng); stftmSpeechRec = istft(S_mix.*stftBmask, ... Window=win, ... OverlapLength=overlapLen, ... FFTLength=numBin, ... FrequencyRange=frqRng);
Time-Frequency Filtering Using DGT
Obtain DGT
Use the dgt
function to obtain the one-sided DGT of the male and female signals. Use the same parameters used for the STFT. Note that the hop length is the difference between the window length and overlap.
Dm = dgt(mSpeech, ... WindowLength=winLen, ... HopLength=winLen-overlapLen, ... NumFrequencyBins=numBin, ... FrequencyRange=frqRng); Df = dgt(fSpeech, ... WindowLength=winLen, ... HopLength=winLen-overlapLen, ... NumFrequencyBins=numBin, ... FrequencyRange=frqRng);
Visualize the magnitude of the DGT of both signals.
figure tiledlayout(2,1) nexttile dgt(mSpeech, ... SampleRate=Fs, ... WindowLength=winLen, ... HopLength=winLen-overlapLen, ... NumFrequencyBins=numBin, ... FrequencyRange=frqRng) title("Male Speech DGT") nexttile dgt(fSpeech, ... SampleRate=Fs, ... WindowLength=winLen, ... HopLength=winLen-overlapLen, ... NumFrequencyBins=numBin, ... FrequencyRange=frqRng) title("Female Speech DGT")
Create Binary Mask Using DGT
Use the DGT of both signals to create a binary mask. For the STFT, you created a mask that identified the STFT coefficients to include for the inverse transform. The tffilt
function expects the binary mask to identify the DGT coefficients to filter out before reconstruction. Therefore, to reconstruct the male signal, create a mask where a true
value identifies a time-frequency bin where the magnitude of the DGT of the female signal is greater than the corresponding bin in the DGT of the male signal.
dgtBmask = abs(Df) > abs(Dm);
Reconstruct Male Speech
Use the binary mask in tffilt
to reconstruct the male signal. To apply the mask directly to the DGT of the input signal, specify the "gm"
time-frequency filtering method.
dgtmSpeechRec = tffilt(dgtBmask,mix, ... WindowLength=winLen, ... HopLength=winLen-overlapLen, ... NumFrequencyBins=numBin, ... FrequencyRange=frqRng, ... Method="gm");
Compare Reconstructions
To compare the original male signal with the STFT- and DGT-based reconstructions, compute these two metrics for each reconstruction.
Signal-to-interference ratio (SIR) — The higher the SIR, the better the performance. To compute this metric, use the helper function
helperComputeSIR
. The source for this function is listed in the appendix.Itakura-Saito distance (ISD) — The lower the ISD, the better the performance. To compute this metric, use the helper function
helperComputeISD
. The source for this function is listed in the appendix. The Itakura–Saito distance (or divergence) quantifies the difference between an original spectrum and its approximation. While not inherently a perceptual measure, the metric is designed to capture perceptual (dis)similarity. For more information about the ISD, see [5].
sirBefore = helperComputeSIR(mSpeech,mix); sirSTFT = helperComputeSIR(mSpeech,stftmSpeechRec); sirDGT = helperComputeSIR(mSpeech,dgtmSpeechRec); isdBefore = helperComputeISD(mSpeech,mix); isdSTFT = helperComputeISD(mSpeech,stftmSpeechRec); isdDGT = helperComputeISD(mSpeech,dgtmSpeechRec); T = table(round([sirBefore;sirSTFT;sirDGT]), ... round([isdBefore;isdSTFT;isdDGT]), ... variableNames=["SIR (dB)","ISD"], ... RowNames=["Before","STFT","TFFILT"])
T=3×2 table
SIR (dB) ISD
________ _____
Before 1 21026
STFT 6 26174
TFFILT 12 6365
Summary
In this example, you learned how to isolate a speech signal by using a binary mask with the DGT. The tffilt
function performed the time-frequency filtering by applying the binary mask directly to the DGT of the mixed signal and reconstructing the result using the inverse DGT. The function supports two additional filtering methods. Both methods involve Gabor multipliers, linear operators that apply a pointwise product with a time-frequency transfer function, or mask. Time-frequency filtering using Gabor multipliers offers a robust and adaptable solution for precise signal manipulation. You can also provide tffilt
a collection of binary masks to apply sequentially to the DGT before reconstructing the filtered signal.
Appendix
helperComputeSIR
function s = helperComputeSIR(x,xh) % helperComputeSIR Compute signal-to-interference ratio % This function is only intended to support this wavelet example. % It may change or be removed in a future release. s = 20*log10(norm(x,2)/norm(xh-x,2)); end
helperComputeISD
function dis = helperComputeISD(x,xh) % helperComputeISD Itakura-Saito distance % This function is only intended to support this wavelet example. % It may change or be removed in a future release. arguments x {mustBeFloat,mustBeVector} xh {mustBeFloat,mustBeVector} end validateattributes(xh,{'numeric'},{'numel',length(x)}) rdtPropto = zeros("like",real(x([]))) + zeros("like",real(xh([]))); xc = reshape(cast(x,"like",rdtPropto),[],1); xhc = reshape(cast(xh,"like",rdtPropto),[],1); p = abs(fft(xc)); ph = abs(fft(xhc)); % DIS(P(w),Ph(w)) = int_(-pi)^(pi)[P(w)/Ph(w)-log(P(w)/Ph(w))-1]dw/2*pi z = p./(ph+eps(ph)); dis = sum(z - log(z)) - length(z); end
References
[1] Krémé, A. Marina, Valentin Emiya, Caroline Chaux, and Bruno Torrésani. “Time-Frequency Fading Algorithms Based on Gabor Multipliers.” IEEE Journal of Selected Topics in Signal Processing 15, no. 1 (January 2021): 65–77. https://doi.org/10.1109/JSTSP.2020.3045938.
[2] Mallat, S.G. and Zhifeng Zhang. “Matching Pursuits with Time-Frequency Dictionaries.” IEEE Transactions on Signal Processing 41, no. 12 (December 1993): 3397–3415. https://doi.org/10.1109/78.258082.
[3] Søndergaard, Peter. “An Efficient Algorithm for the Discrete Gabor Transform Using Full Length Windows.” In SAMPTA ’09 International Conference on SAMPling Theory and Applications, edited by Laurent Fesquet and Bruno Torresani, 223–26. Marseille, France, 2009. https://hal.science/hal-00495456/file/SampTAProceedings.pdf.
[4] Halko, N., P. G. Martinsson, and J. A. Tropp. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review 53, no. 2 (January 2011): 217–88. https://doi.org/10.1137/090771806.
[5] Itakura, F., and S. Saito. “Analysis Synthesis Telephony Based on the Maximum Likelihood Method.” In Proc. 6th International Congress on Acoustics, C17–20. Los Alamitos, CA, 1968.