Signal Recovery with Differentiable Scalograms and Spectrograms

This example uses:

This example shows how to implement the technique using custom code, presents ready-to-use functions based on it, and demonstrates it on synthetic data and a speech signal. Additionally, the example compares the gradient-descent technique with differentiable spectrograms to the Griffin-Lim algorithm.

Background

In a number of applications, the phase information in a time-frequency representation is discarded in favor of the magnitudes. There are a number of reasons for this. One reason is simply that the complex-valued time-frequency representations containing phase information are difficult to plot and interpret. This may lead people to retain only the magnitudes. In other applications, the required signal processing is optimally done by modifying the magnitudes of a time-frequency representation. This is most frequently done in speech processing where the underlying time-frequency representation is usually the short-time Fourier transform. In the latter case, the original complex-valued time-frequency representation no longer corresponds to the modified magnitude representation.

In these applications, it still may be useful or even necessary to recover an approximation of the original signal. The techniques for doing this are referred to as phase retrieval. Phase retrieval is, in general, ill-posed and prior iterative methods suffer from the non-convexity of the formulation and therefore convergence to an optimal solution is impossible to guarantee.

The incorporation of automatic differentiation and differentiable signal processing makes it feasible to perform gradient-descent phase retrieval with the usual convex loss functions. To recover a signal from the magnitude of its spectrogram, use stftmag2sig (Signal Processing Toolbox). To recover a signal from the magnitude of its scalogram, use cwtmag2sig. You must have a Deep Learning Toolbox license to use the gradient-descent algorithm.

Synthetic Signal — Exponential Chirp

Initially, create a chirp signal with an exponentially increasing carrier frequency. This synthetic signal is challenging for a signal recovery, or phase retrieval, algorithm due to how rapidly the instantaneous frequency increases over time.

rng("default")
N = 2048;
fs = N;
a = 1;
b = fs/2;
t = (0:N-1)'/fs;
sig = chirp(t,a,t(end),b,"logarithmic")+1;
plot(t,sig)
grid on
title("Exponential Chirp")
xlabel("Seconds")
ylabel("Amplitude")

Obtain the scalogram of the chirp and plot the scalogram along with the instantaneous frequency of the chirp. Here the scalogram is plotted using a linear scaling on the frequency (scale) axis to clearly show the exponential nature of the chirp.

[fmin,fmax] = cwtfreqbounds(2048,fs,Cutoff=100,wavelet="amor");
[cfs,f,~,~,scalcfs] = cwt(sig,fs,FrequencyLimits=[fmin fmax],ExtendSignal=false);
t = linspace(0,1,length(sig));
surf(t,f,abs(cfs))
ylabel("Hz")
shading interp
view(0,90)
yyaxis right
plot(t,1024.^t,"k--")
axis tight
title("Scalogram of Exponential Chirp")
ylabel("Hz")
xlabel("Seconds")

Now, use a differentiable scalogram to perform phase retrieval. Throughout the example, the helper object, helperPhaseRetrieval, is used to perform phase retrieval for both the scalogram and spectrogram. By default, helperPhaseRetrieval pads the signal symmetrically with 10 samples at the beginning and 10 samples at the end to compensate for edge effects.

First, create an object configured for the scalogram and obtain the scalogram of the chirp signal. Show that the scalogram contains only real-valued data.

pr = helperPhaseRetrieval(Method="scalogram",wavelet="morse", ...
    IncludeLowpass=true);
sc = obtainTFR(pr,sig);
isreal(sc)

ans = logical
   1

The scalogram is also a dlarray, which allows us to record operations performed on it for automatic differentiation.

Signal Recovery Using Gradient Descent

Here we recover an approximation to the original signal using the magnitude scalogram and gradient descent. The helper function retrievePhase does this by the following procedure:

Generate a white-noise signal with the same length as the padded input signal.
Obtain the scalogram of the noise. Measure the mean squared error (MSE) between the scalogram of the target signal and the scalogram of the noise.
Use gradient descent with an Adam optimizer to update the noise signal based on the MSE loss between the target scalogram and the scalogram of the noise.

The above procedure is detailed in [1]. At the end of the gradient descent procedure, determine how the noise has converged to a reconstruction of the original signal.

Inside of retrievePhase, a noise signal is initialized as a starting point. Here we compare a representative noise exactly like the one used to initiate the gradient descent procedure. Compare the initial noise with the original chirp signal.

rng default
x = dlarray(randn(length(sig)+20,1),"CBT");
xinit = x./max(abs(x),[],3)+1;
figure
plot(t,squeeze(extractdata(xinit(11:end-10,:,:))),LineWidth=0.5)
hold on
plot(t,sig)
legend("Random","Chirp Signal")
title("Random Noise Initialization with Exponential Chirp")
axis tight
hold off
ylim([-0.5 2.5])
xlabel("Seconds")

Use gradient descent and the differentiable scalogram to recover an approximation to the original signal.

xrec = retrievePhase(pr,sc,"InitialSignal",xinit);

After 300 iterations of gradient descent, the noise signal is modified to closely approximate the chirp signal. Plot the result of the phase retrieval. Note that phase retrieval for real-valued signals is only defined up to a sign change. Accordingly, the result scaled by 1 or -1 may provide a better result.

plot(t,sig,t,xrec,"--")
grid on
xlabel("Seconds")
ylabel("Amplitude")
legend("Original Signal","Phase Reconstruction")
title("Phase Retrieval using Scalogram")

Repeat the computation using the cwtmag2sig function. For more information, see the function reference page.

xrec = cwtmag2sig(abs(cfs),fs,FrequencyLimits=[fmin fmax], ...
    ScalingCoefficients=scalcfs,Display=true);

#Iteration  |  Normalized Inconsistency  
      1     |  1.3560e+00 
     20     |  5.3378e-02 
     40     |  1.5615e-02 
     60     |  9.3957e-03 
     80     |  4.1957e-03 
    100     |  1.3182e-02 
    120     |  3.2693e-03 
    140     |  1.6851e-03 
    160     |  3.2657e-04 
    180     |  1.8968e-03 
    200     |  2.4342e-04 
    220     |  2.4639e-04 
    240     |  7.6536e-05 
    255     |  8.0224e-06 

Decomposition process stopped.
The normalized inconsistency for each channel is smaller than the "InconsistencyTolerance" of 1e-05.

plot(t,sig,t,xrec,"--")
grid on
xlabel("Seconds")
ylabel("Amplitude")
legend("Original Signal","Phase Reconstruction")
title("Phase Retrieval using Scalogram")

Obtain the scalogram of the reconstructed signal and compare its phase at selected center frequencies (CF) with the phase of the original. The phase is compared by plotting the real and imaginary parts of the continuous wavelet transform (CWT) coefficients separately.

cfsR = cwt(xrec,fs,FrequencyLimits=[fmin fmax]);
figure
tiledlayout(3,2);
indices = [30 40 50];
xlimits = [0.63 0.79; 0.48 0.76; 0.4 0.62];

for i = 1:length(indices)
    idx = indices(i);
    fstr = sprintf("%2.2f",f(idx));

    nexttile(2*i-1)
    plot(t,real(cfs(idx,:))',t,real(cfsR(idx,:))',"--")
    grid on
    title("Scalogram (Real Part)", "CF "+fstr+" Hz")
    xlim(xlimits(i,:))
    
    nexttile(2*i)
    plot(t,imag(cfs(idx,:))',t,imag(cfsR(idx,:))',"--")
    grid on
    title("Scalogram (Imaginary Part)"," CF "+fstr+" Hz")
    xlim(xlimits(i,:))
end

The phase agreement for the selected center frequencies is quite good. In fact, if we look at the wavelet coherence between the original signal and its reconstructed version using the gradient-descent based phase retrieval, the overall agreement is quite strong.

figure
wcoherence(sig,xrec,fs,FrequencyLimits=[0 1/2]*fs,PhaseDisplayThreshold=0.7)
title("Wavelet Coherence (Exponential Chirp with reconstruction)", ...
    "Differentiable Scalogram with Gradient Descent");

Repeat the same process using the spectrogram as the time-frequency representation instead of the scalogram. Evaluate the discrete Fourier transform at 128 points and use the default settings for the stft function.

nfft = 128;
sp = stft(sig,FFTLength=nfft);
xrecGD = stftmag2sig(abs(sp),nfft, ...
    Method="gd",Display=true,MaxIterations=600);

#Iteration  |  Normalized Inconsistency  
      1     |  9.3315e-01 
     20     |  1.0800e-01 
     40     |  4.0291e-02 
     60     |  1.1883e-02 
     80     |  4.8425e-03 
    100     |  2.0517e-03 
    120     |  1.3525e-03 
    140     |  1.1850e-03 
    160     |  1.0127e-03 
    180     |  8.6954e-04 
    200     |  7.6701e-04 
    220     |  6.9067e-04 
    240     |  6.2690e-04 
    260     |  5.7133e-04 
    280     |  5.2309e-04 
    300     |  4.8117e-04 
    320     |  4.4435e-04 
    340     |  4.1164e-04 
    360     |  3.8222e-04 
    380     |  3.5552e-04 
    400     |  3.3112e-04 
    420     |  3.0870e-04 
    440     |  2.8801e-04 
    460     |  2.6888e-04 
    480     |  2.5115e-04 
    500     |  2.3468e-04 
    520     |  2.1938e-04 
    540     |  2.0513e-04 
    560     |  1.9187e-04 
    580     |  1.7951e-04 
    600     |  1.6799e-04 

Decomposition process stopped.
The number of iterations reached the specified "MaxIterations" value of 600.

% figure
plot(t,sig,t,xrecGD,"--")
grid on
legend("Original Signal","Phase Reconstruction")
title("Phase Retrieval (Gradient Descent)")
axis padded

The recovery from the spectrogram in this case is also quite good. Use wavelet coherence again to look at the time-varying phase coherence between the original signal and the output of phase retrieval approach using gradient descent with the differentiable spectrogram.

wcoherence(sig,xrecGD,fs,FrequencyLimits=[0 1/2]*fs,PhaseDisplayThreshold=0.7)
title("Wavelet Coherence (Exponential Chirp with Reconstruction)", ...
    "Differentiable Spectrogram with Gradient Descent")

The phase coherence between the two signals is strong except at the start and finish of the signal in the high-frequency range.

In the spectrogram below, it can be observed that in those regions, the frequency content of the original signal is concentrated at the low and high ends of the frequency range, respectively, so you have to be careful when interpreting the results there. Meanwhile, in the area where the signal is primarily distributed, the phase coherence in the corresponding part of the wavelet coherence diagram is very strong.

figure(Position=[0 0 1500 600])
subplot(1,2,1) 
stft(sig,fs,FFTLength=nfft,FrequencyRange="onesided")
set(gca,"YScale","log");
title("Spectrogram of the Original Signal")
subplot(1,2,2)
wcoherence(sig,xrecGD,fs,FrequencyLimits=[0 1/2]*fs,PhaseDisplayThreshold=0.7)

Comparison with Griffin-Lim

Repeat the computation using the well-known Griffin-Lim algorithm. The Griffin-Lim procedure requires the inverse short-time Fourier transform at every iteration and exhibits edge effects.

S = stft(sig);
xrecGL = stftmag2sig(abs(sp),nfft, ...
    Method="gl",Display=true,MaxIterations=600);

#Iteration  |  Normalized Inconsistency  
      1     |  7.4048e-01 
     20     |  9.0505e-02 
     40     |  5.2304e-02 
     60     |  4.3877e-02 
     80     |  4.2882e-02 
    100     |  4.2024e-02 
    120     |  4.1609e-02 
    140     |  4.1364e-02 
    160     |  4.1044e-02 
    180     |  3.0982e-02 
    200     |  2.4466e-02 
    220     |  2.3341e-02 
    240     |  2.2494e-02 
    260     |  2.1793e-02 
    280     |  2.1195e-02 
    300     |  2.0673e-02 
    320     |  2.0211e-02 
    340     |  1.9795e-02 
    360     |  1.9418e-02 
    380     |  1.9072e-02 
    400     |  1.8754e-02 
    420     |  1.8459e-02 
    440     |  1.8184e-02 
    460     |  1.7927e-02 
    480     |  1.7685e-02 
    500     |  1.7457e-02 
    520     |  1.7241e-02 
    540     |  1.7036e-02 
    560     |  1.6841e-02 
    580     |  1.6655e-02 
    600     |  1.6476e-02 

Decomposition process stopped.
The number of iterations reached the specified "MaxIterations" value of 600.

figure
plot(t,sig,t,xrecGL,"--")
grid on
legend(["Original Signal","Phase Reconstruction"])
title("Phase Retrieval (Griffin-Lim)")

As anticipated, Griffin-Lim exhibits substantial deviation at the end of the signal. Moreover, upon closer observation, it can be noticed that in this case, it also presents significant discrepancy in the reconstruction of the signal's middle section.

plot(t,sig,t,xrecGL,"--")
ylim([0,2])
xlim([0.5 0.7])
grid on
legend(["Original Signal","Phase Reconstruction"])
title("Phase Retrieval (Griffin-Lim)")

Accordingly, use the original reconstruction from the STFT magnitudes using the Griffin-Lim algorithm and examine the wavelet coherence between the original signal and the reconstruction.

wcoherence(sig,xrecGL,fs,FrequencyLimits=[0 1/2]*fs,PhaseDisplayThreshold=0.7)
title("Wavelet Coherence (Exponential Chirp with Reconstruction)", ...
    "Griffin-Lim Algorithm using Spectrograms");

The Griffin-Lim algorithm is less accurate than the gradient-descent method in the high-frequency areas where the signal primarily resides.

In this instance, the differentiable phase-retrieval technique does a significantly better job at reconstructing the original phase than the Griffin-Lim algorithm. You can verify this numerically by examining the relative L2-norm error between the original signal and the approximations.

RelativeL2DiffGD = norm(sig-xrecGD,2)/norm(sig,2)

RelativeL2DiffGD = 0.1075

RelativeL2DiffGL = norm(sig-xrecGL,2)/norm(sig,2)

RelativeL2DiffGL = 4.5252

Speech Signal

As mentioned in the introduction, a common application of signal recovery from magnitude time-frequency transforms is in speech. Here we apply differentiable signal processing and gradient descent to speech.

Load and play a speech sample of a speaker saying "I saw the sheep". The original data is sampled at 22050 Hz, resample the data at 1/2 the original rate.

load wavsheep.mat
xsheep = resample(sheep,1,2);
fs = fs/2;
soundsc(xsheep,fs)
t = 0:1/fs:(length(xsheep)-1)/fs;

Recover an approximation of the speech signal from the magnitude spectrogram using gradient descent. Use a Hamming window with 256 samples and overlap the windows by 75% of the window length, or 192 samples. Use 200 iterations of gradient descent and a different optimizer, the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimizer, which usually gives better results at the cost of longer computation time.

win = hamming(256);
overlap = 0.75*256;
scsheep = stft(xsheep,Window=win,OverlapLength=overlap);

xrecGDsheep = stftmag2sig(abs(scsheep),length(win),Window=win,OverlapLength=overlap, ...
    Method="gd",Optimizer = "lbfgs",Display=true,MaxIterations=300);

#Iteration  |  Normalized Inconsistency  
      1     |  1.3115e+00 
     20     |  9.9177e-02 
     40     |  2.5096e-02 
     60     |  2.1943e-02 
     80     |  1.2074e-02 
    100     |  2.2944e-02 
    120     |  5.8762e-03 
    140     |  1.1873e-02 
    160     |  2.6551e-03 
    180     |  1.0744e-02 
    200     |  4.5168e-02 
    220     |  8.8441e-03 
    240     |  6.4858e-03 
    260     |  7.1194e-03 
    280     |  8.9862e-03 
    300     |  1.3787e-03 

Decomposition process stopped.
The number of iterations reached the specified "MaxIterations" value of 300.

The reconstructed signal is very similar to the original. The π/2 ambiguity that inverts some peaks is expected and does not affect the perceived sound.

% figure
idx = 1:length(xrecGDsheep);

plot(t(idx),xsheep(idx),t(idx),xrecGDsheep,"--")
title(["Reconstruction from STFT Magnitude and Noise Input", ...
    "using Stochastic Gradient Descent"])
legend("Original","Phaseless Reconstruction")
xlabel("Time (s)")
axis tight
grid on

Play the original waveform and the reconstruction. Pause 3 seconds between playbacks.

soundsc(xsheep,fs)
pause(3)
soundsc(xrecGDsheep,fs)

Repeat the procedure using as input the magnitude of the CWT. The CWT is more computationally expensive than the STFT and takes longer to compute. Accelerate the computation using a graphical processing unit (GPU), if available. To determine the reduction in computation time, toggle back and forth between true and false.

useGPU = false;
[fmin,fmax] = cwtfreqbounds(2048,Cutoff=100,wavelet="amor");
[cfs,~,~,~,scalcfs] = cwt(xsheep,FrequencyLimits=[fmin fmax],extend=false);

if useGPU
    cfs = gpuArray(cfs);
end
xrecGDsheep = cwtmag2sig(abs(cfs),FrequencyLimits=[fmin fmax], ...
    ScalingCoefficients=scalcfs, ...
    Optimizer="sgdm",Display=true);

#Iteration  |  Normalized Inconsistency  
      1     |  1.0646e+00 
     20     |  8.5140e-02 
     40     |  2.8676e-02 
     60     |  1.4658e-02 
     80     |  1.0646e-02 
    100     |  8.8762e-03 
    120     |  7.7660e-03 
    140     |  6.6092e-03 
    160     |  5.6052e-03 
    180     |  4.7687e-03 
    200     |  4.1340e-03 
    220     |  3.7607e-03 
    240     |  3.4947e-03 
    260     |  3.5036e-03 
    280     |  3.1490e-03 
    300     |  3.1313e-03 

Decomposition process stopped.
The number of iterations reached the specified "MaxIterations" value of 300.

Plot the result. Similar to the case with the magnitude STFT, what began as random noise has converged to a good approximation of the speech signal. There is always a phase ambiguity between a value, V and its negative -V. So you can also try plotting the negative.

% figure
plot(t,xsheep,t,xrecGDsheep,"--")
title(["Reconstruction from Scalogram Magnitude and Noise Input", ...
    "using Stochastic Gradient Descent"])
legend("Original","Phaseless Reconstruction")
xlabel("Time (s)")
axis tight
grid on

Play the original waveform and the reconstruction. Pause 3 seconds between playbacks. The scalogram technique also produces an approximation which is perceptually equivalent to the original.

soundsc(xsheep,fs)
pause(3)
soundsc(xrecGDsheep,fs)

Conclusion

This example demonstrates the algorithms and functions that employ differentiable signal processing and gradient descent to recover signal approximations from magnitude time-frequency representations. The approach has many advantages over traditional iterative methods based on inverse transform like the Griffin-Lim algorithm. For instance, it is less affected by minor edge effects and, in many cases, provides superior signal recovery. It allows for targeted selection among various optimizers and benefits from GPU acceleration. Furthermore, this method does not depend on the implementation of an inverse transform, making it easily expandable.