Keyword Spotting in Noise Using MFCC and LSTM Networks

This example shows how to identify a keyword in noisy speech using a deep learning network. In particular, the example uses a Bidirectional Long Short-Term Memory (BiLSTM) network and mel-frequency cepstral coefficients (MFCC).

Introduction

Keyword spotting (KWS) is an essential component of voice-assist technologies, where the user speaks a predefined keyword to wake up a system before speaking a complete command or query to the device.

This example trains a KWS deep network with feature sequences of mel-frequency cepstral coefficients (MFCC). The example also demonstrates how network accuracy in a noisy environment can be improved using data augmentation.

This example uses long short-term memory (LSTM) networks, which are a type of recurrent neural network (RNN) well suited to studying sequence and time-series data. An LSTM network can learn long-term dependencies between time steps of a sequence. An LSTM layer (lstmLayer) can look at the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) can look at the time sequence in both forward and backward directions. This example uses a bidirectional LSTM layer.

The example uses the Google Speech Commands Dataset to train the deep learning model. To run the example, you must first download the data set. If you do not want to download the data set or train the network, then you can load a pretrained network by opening this example in MATLAB® and typing load("KWSNet.mat") at the command line.

Example Summary

The example goes through the following steps:

  1. Inspect a "gold standard" keyword spotting baseline on a validation signal.

  2. Create training utterances from a noise-free dataset.

  3. Train a keyword spotting LSTM network using MFCC sequences extracted from those utterances.

  4. Check the network accuracy by comparing the validation baseline to the output of the network when applied to the validation signal.

  5. Check the network accuracy for a validation signal corrupted by noise.

  6. Augment the training dataset by injecting noise to the speech data using audioDataAugmenter.

  7. Retrain the network with the augmented dataset.

  8. Verify that the retrained network now yields higher accuracy when applied to the noisy validation signal.

Inspect the Validation Signal

In this example, the keyword to spot is YES.

You use a sample speech signal to validate the KWS network. The validation signal consists of 34 seconds of speech with the keyword YES appearing intermittently.

Load the validation signal.

[audioIn,fs] = audioread('KeywordSpeech-16-16-mono-34secs.flac');

Listen to the signal.

sound(audioIn,fs)

Visualize the signal.

t = (1/fs) * (0:length(audioIn)-1);
plot(t,audioIn);
grid on;
xlabel('Time (s)')
title('Validation Speech Signal')

Inspect the KWS Baseline

Load the KWS baseline. This baseline was obtained using speech2text: Create Keyword Spotting Mask Using Audio Labeler.

load('KWSBaseline.mat','KWSBaseline')

The baseline is a logical vector of the same length as the validation audio signal. Segments in audioIn where the keyword is uttered are set to one in KWSBaseline.
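For example, you can check how much of the validation signal is labeled as keyword (an optional sanity check, not part of the original workflow):

keywordDuration = sum(KWSBaseline)/fs   % total keyword duration in seconds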

Visualize the speech signal along with the KWS baseline.

h = figure;
plot(t,audioIn)
grid on
xlabel('Time (s)')
hold on
plot(t,KWSBaseline)
legend('Speech','KWS Baseline','Location','southeast')
l = findall(h,'type','line');
l(1).LineWidth = 2;
title("Validation Signal")

Listen to the speech segments identified as keywords.

sound(audioIn(KWSBaseline),fs)

The objective of the network that you train is to output a KWS mask of zeros and ones like this baseline.

Load Speech Commands Data Set

Download the training data set from Speech Commands Dataset and extract the downloaded files. Set datafolder to the location of the data. Use audioDatastore to create a datastore that contains the file names. Use the folder names as the label source.

datafolder = PathToDatabase;
ads = audioDatastore(datafolder,'LabelSource','foldernames','IncludeSubfolders',true);

The dataset contains background noise files that are not used in this example. Use subset to create a new datastore that does not have the background noise files.

isBackNoise = ismember(ads.Labels,"_background_noise_");
ads = subset(ads,~isBackNoise);

The dataset has approximately 65,000 one-second long utterances of 30 short words (including the keyword YES). Get a breakdown of the word distribution in the datastore.

countEachLabel(ads)
ans =

  30×2 table

    Label     Count
    ______    _____

    bed       1713 
    bird      1731 
    cat       1733 
    dog       1746 
    down      2359 
    eight     2352 
    five      2357 
    four      2372 
    go        2372 
    happy     1742 
    house     1750 
    left      2353 
    marvin    1746 
    nine      2364 
    no        2375 
    off       2357 
    on        2367 
    one       2370 
    right     2367 
    seven     2377 
    sheila    1734 
    six       2369 
    stop      2380 
    three     2356 
    tree      1733 
    two       2373 
    up        2375 
    wow       1745 
    yes       2377 
    zero      2376 

Split ads into two datastores: The first datastore contains files corresponding to the keyword. The second datastore contains all the other words.

keyword     = 'yes';
isKeyword   = ismember(ads.Labels,keyword);
ads_keyword = subset(ads,isKeyword);
ads_other   = subset(ads,~isKeyword);

Get a breakdown of the word distribution in each datastore.

countEachLabel(ads_keyword)
countEachLabel(ads_other)
ans =

  1×2 table

    Label    Count
    _____    _____

     yes     2377 


ans =

  29×2 table

    Label     Count
    ______    _____

    bed       1713 
    bird      1731 
    cat       1733 
    dog       1746 
    down      2359 
    eight     2352 
    five      2357 
    four      2372 
    go        2372 
    happy     1742 
    house     1750 
    left      2353 
    marvin    1746 
    nine      2364 
    no        2375 
    off       2357 
    on        2367 
    one       2370 
    right     2367 
    seven     2377 
    sheila    1734 
    six       2369 
    stop      2380 
    three     2356 
    tree      1733 
    two       2373 
    up        2375 
    wow       1745 
    zero      2376 

Create Training Sentences and Labels

The training datastores contain one-second speech signals where one word is uttered. You will create more complex training speech utterances that contain a mixture of the keyword along with other words.

Here is an example of a constructed utterance. Read one keyword from the keyword datastore and normalize it so that its maximum absolute value is one.

yes = read(ads_keyword);
yes = yes / max(abs(yes));

The signal has non-speech portions (silence, background noise, etc.) that do not contain useful speech information. This example removes silence using a simple thresholding approach identical to the one used in Classify Gender Using LSTM Networks.

Get the start and end indices of the useful portion of the signal.

[~,~,startIndex,endIndex] = HelperGetSpeechSegments(yes,fs);
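HelperGetSpeechSegments is a helper function supplied with this example. A minimal sketch of a comparable energy-threshold detector (an illustrative assumption, not the actual helper) might look like this; the third and fourth outputs follow the call above:

function [speech,segmentIdx,startIndex,endIndex] = simpleSpeechSegments(x,fs)
% Sketch of an energy-threshold speech detector (illustrative only).
frameLength = round(0.02*fs);                      % 20 ms analysis frames
numFrames   = floor(numel(x)/frameLength);
frameRMS    = zeros(numFrames,1);
for k = 1:numFrames
    frame       = x((k-1)*frameLength+1:k*frameLength);
    frameRMS(k) = sqrt(mean(frame.^2));            % RMS energy per frame
end
active = find(frameRMS > 0.1*max(frameRMS));       % simple relative threshold
if isempty(active)
    active = 1:numFrames;
end
startIndex = (active(1)-1)*frameLength + 1;
endIndex   = min(active(end)*frameLength,numel(x));
speech     = x(startIndex:endIndex);
segmentIdx = [startIndex endIndex];
end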

Randomly select the number of non-keyword words to use in the synthesized training sentence. Use a maximum of 10 words; the keyword is added separately.

numWords = randi([0 10]);

Randomly pick the location at which the keyword occurs.

keywordLocation = randi([1 numWords+1]);

Read the desired number of non-keyword utterances, and construct the training sentence and mask.

sentence = [];
mask = [];
for index = 1:numWords+1
    if index == keywordLocation
        sentence = [sentence;yes]; %#ok
        newMask = zeros(size(yes));
        newMask(startIndex:endIndex) = 1;
        mask = [mask;newMask]; %#ok
    else
        other = read(ads_other);
        other = other ./ max(abs(other));
        sentence = [sentence;other]; %#ok
        mask = [mask;zeros(size(other))]; %#ok
    end
end

Plot the training sentence along with the mask.

t  = (1/fs) * (0:length(sentence)-1);
h = figure;
plot(t,sentence)
grid on
hold on
plot(t,mask)
xlabel('Time (s)')
legend('Training Signal' , 'Mask','Location','southeast')
l = findall(h,'type','line');
l(1).LineWidth = 2;
title("Example Utterance")

Listen to the training sentence.

sound(sentence,fs)
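The parallel feature extraction later in this example calls a helper function named synthesizeSentence that packages this construction. A minimal sketch, assuming the helper simply repeats the steps shown above (the actual helper used by the example may differ), might look like this:

function [sentence,mask] = synthesizeSentence(ads_keyword,ads_other,fs)
% Sketch of a sentence-synthesis helper (assumes it mirrors the manual
% construction above).
yes = read(ads_keyword);
yes = yes / max(abs(yes));
[~,~,startIndex,endIndex] = HelperGetSpeechSegments(yes,fs);
numWords = randi([0 10]);
keywordLocation = randi([1 numWords+1]);
sentence = [];
mask = [];
for index = 1:numWords+1
    if index == keywordLocation
        sentence = [sentence;yes];
        newMask = zeros(size(yes));
        newMask(startIndex:endIndex) = 1;
        mask = [mask;newMask];
    else
        other = read(ads_other);
        other = other ./ max(abs(other));
        sentence = [sentence;other];
        mask = [mask;zeros(size(other))];
    end
end
end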

Extract Features

This example trains a deep learning network using 42 features per frame: 14 MFCC, 14 delta, and 14 delta-delta coefficients.

Define parameters required for MFCC extraction.

WindowLength = 512;
OverlapLength = 384;

Extract the MFCC features.

[coeffs,delta,deltaDelta] = mfcc(sentence,fs,'WindowLength',WindowLength,'OverlapLength',OverlapLength);

Concatenate the coefficients into one feature matrix.

featureMatrix = [coeffs delta deltaDelta];
size(featureMatrix)
ans =

        1113          42

Note that you compute MFCC by sliding a window through the input, so the feature matrix is shorter than the input speech signal. Each row in featureMatrix corresponds to 128 samples from the speech signal (WindowLength-OverlapLength).
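As a rough check (assuming frames are counted without padding), you can relate the signal length to the number of feature rows:

numHops = floor((numel(sentence) - OverlapLength)/(WindowLength - OverlapLength))   % should be close to size(coeffs,1)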

Compute a mask of the same length as featureMatrix.

HopLength = WindowLength - OverlapLength;
range = (HopLength) * (1:size(coeffs,1)) + HopLength;
featureMask  = zeros(size(range));
for index = 1:numel(range)
    featureMask(index) = mode(mask( (index-1)*HopLength+1:(index-1)*HopLength+WindowLength ));
end

Extract Features from Training Dataset

Sentence synthesis and feature extraction for the whole training dataset can be quite time-consuming. To speed up processing, if you have Parallel Computing Toolbox™, partition the training datastore, and process each partition on a separate worker.

Select a number of datastore partitions.

numPartitions = 6;

Initialize cell arrays for the feature matrices and masks.

TrainingFeatures  = {};
TrainingMasks = {};

Perform sentence synthesis, feature extraction, and mask creation using parfor.

tic
parfor ii = 1:numPartitions

    subads_keyword = partition(ads_keyword,numPartitions,ii);
    subads_other   = partition(ads_other,numPartitions,ii);

    count = 1;
    localFeatures = cell(length(subads_keyword.Files),1);
    localMasks    = cell(length(subads_keyword.Files),1);

    while hasdata(subads_keyword)

        % Create a training sentence
        [sentence,mask] = synthesizeSentence(subads_keyword,subads_other,fs);

        % Compute mfcc features
        [coeffs,delta,deltaDelta] = mfcc(sentence,fs,'WindowLength',WindowLength,'OverlapLength',OverlapLength);
        featureMatrix = [coeffs delta deltaDelta];
        featureMatrix(~isfinite(featureMatrix)) = 0;

        % Create mask
        hopLength = WindowLength - OverlapLength;
        range     = (hopLength) * (1:size(coeffs,1)) + hopLength;
        featureMask  = zeros(size(range));
        for index = 1:numel(range)
            featureMask(index) = mode(mask( (index-1)*hopLength+1:(index-1)*hopLength+WindowLength ));
        end

        localFeatures{count} = featureMatrix;
        catVect              = categorical(featureMask);
        catVect              = addcats(catVect,{'1'});
        localMasks{count}    = catVect;

        count = count + 1;

    end

    TrainingFeatures = [TrainingFeatures;localFeatures];
    TrainingMasks    = [TrainingMasks;localMasks];

end
fprintf('Training feature extraction took %f seconds.\n',toc)
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).
Training feature extraction took 114.697514 seconds.

It is good practice to normalize all features to have zero mean and unity standard deviation. Compute the mean and standard deviation for each coefficient and use them to normalize the data.

sampleFeature = TrainingFeatures{1};
numFeatures = size(sampleFeature,2);
featuresMatrix = cat(1,TrainingFeatures{:});
M = mean(featuresMatrix);
S = std(featuresMatrix);
for index = 1:length(TrainingFeatures)
    f = TrainingFeatures{index};
    f = (f - M) ./ S;
    TrainingFeatures{index} = f.'; %#ok
end

Extract Validation Features

Extract MFCC features from the validation signal.

[coeffs,delta,deltaDelta] = mfcc(audioIn,fs,'WindowLength',WindowLength,'OverlapLength',OverlapLength);
featureMatrix = [coeffs,delta,deltaDelta];
featureMatrix(~isfinite(featureMatrix)) = 0;

Normalize the validation features.

FeaturesValidationClean = (featureMatrix - M)./S;
range = (HopLength) * (1:size(FeaturesValidationClean,1)) + HopLength;

Construct the validation KWS mask.

featureMask  = zeros(size(range));
for index = 1:numel(range)
    featureMask(index) = mode(KWSBaseline( (index-1)*HopLength+1:(index-1)*HopLength+WindowLength ));
end
BaselineV = categorical(featureMask);

Define the LSTM Network Architecture

LSTM networks can learn long-term dependencies between time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to look at the sequence in both forward and backward directions.

Specify the input size to be sequences of size numFeatures. Specify two bidirectional LSTM layers with 150 hidden units each, and set their output mode to "sequence" so that each layer passes a full output sequence to the next layer. Specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.

layers = [ ...
    sequenceInputLayer(numFeatures)
    bilstmLayer(150,"OutputMode","sequence")
    bilstmLayer(150,"OutputMode","sequence")
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer
    ];
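Optionally, inspect the layer array before training. analyzeNetwork (Deep Learning Toolbox) displays the layers with their activation sizes and flags potential problems:

analyzeNetwork(layers)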

Define Training Options

Specify the training options for the classifier. Set MaxEpochs to 10 so that the network makes 10 passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training signals at a time. Set Plots to "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to false to disable printing the table output that corresponds to the data shown in the plot. Set Shuffle to "every-epoch" to shuffle the training sequences at the beginning of each epoch. Set LearnRateSchedule to "piecewise" to decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (5) has passed. Set ValidationData to the validation predictors and targets.

This example uses the adaptive moment estimation (ADAM) solver. ADAM performs better with recurrent neural networks (RNNs) like LSTMs than the default stochastic gradient descent with momentum (SGDM) solver.

maxEpochs     = 10;
miniBatchSize = 64;
options = trainingOptions("adam", ...
    "InitialLearnRate",1e-4,...
    "MaxEpochs",maxEpochs, ...
    "MiniBatchSize",miniBatchSize, ...
    "Shuffle","every-epoch",...
    "Verbose",false, ...
    "ValidationFrequency",floor(numel(TrainingFeatures)/miniBatchSize),...
    "ValidationData",{FeaturesValidationClean.',BaselineV},...
    "Plots","training-progress",...
    "LearnRateSchedule","piecewise",...
    "LearnRateDropFactor",0.1, ...
    "LearnRateDropPeriod",5);

Train the LSTM Network

Train the LSTM network with the specified training options and layer architecture using trainNetwork. Because the training set is large, the training process can take several minutes.

doTraining = true;
if doTraining
    [keywordNetNoAugmentation,info] = trainNetwork(TrainingFeatures,TrainingMasks,layers,options);
    fprintf("Validation accuracy: %f percent.\n" , info.ValidationAccuracy(end));
else
    load('keywordNetNoAugmentation.mat','keywordNetNoAugmentation','M','S');%#ok
end
Validation accuracy: 89.470030 percent.

Check Network Accuracy for a Noise-Free Validation Signal

Estimate the KWS mask for the validation signal using the trained network.

v = classify(keywordNetNoAugmentation , FeaturesValidationClean.');

Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels.

figure
cm = confusionchart(BaselineV,v,"title","Validation Accuracy");
cm.ColumnSummary = "column-normalized";
cm.RowSummary = "row-normalized";

Convert the network output from categorical to double.

v = double(v) - 1;
v = repmat(v,HopLength,1);
v = v(:);

Listen to the keyword areas identified by the network.

sound(audioIn(logical(v)),fs)

Visualize the estimated and expected KWS masks.

baseline = double(BaselineV) - 1;
baseline = repmat(baseline,HopLength,1);
baseline = baseline(:);

t  = (1/fs) * (0:length(v)-1);
h = figure;
plot(t,audioIn(1:length(v)))
grid on
hold on
plot(t,v)
plot(t,0.8 * baseline)
xlabel('Time (s)')
legend('Validation Signal','Network Mask','Baseline Mask','Location','southeast')
l = findall(h,'type','line');
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title('Results for Noise-Free Speech')

Check Network Accuracy for a Noisy Validation Signal

You will now check the network accuracy for a noisy speech signal. The noisy signal was obtained by corrupting the clean validation signal with additive white Gaussian noise.
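If you want to synthesize such a signal yourself rather than load the provided file, a minimal sketch (assuming a target SNR of roughly 0 dB) is:

targetSNR   = 0;                                   % desired SNR in dB
signalPower = mean(audioIn.^2);
noisePower  = signalPower/10^(targetSNR/10);
noise       = sqrt(noisePower)*randn(size(audioIn));
audioInNoisySketch = audioIn + noise;              % noisy version of the clean validation signal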

Load the noisy signal.

[audioInNoisy,fs] = audioread('NoisyKeywordSpeech-16-16-mono-34secs.flac');
sound(audioInNoisy,fs)

Visualize the signal.

figure
t = (1/fs) * (0:length(audioInNoisy)-1);
plot(t,audioInNoisy);
grid on;
xlabel('Time (s)')
title('Noisy Validation Speech Signal')

Extract the feature matrix from the noisy signal.

[coeffs,delta,deltaDelta] = mfcc(audioInNoisy,fs,'WindowLength',WindowLength,'OverlapLength',OverlapLength);
featureMatrixV = [coeffs,delta,deltaDelta];
featureMatrixV(~isfinite(featureMatrixV)) = 0;
FeaturesValidationNoisy = (featureMatrixV - M)./S;

Pass the feature matrix to the network.

v   = classify(keywordNetNoAugmentation,FeaturesValidationNoisy.');

Compare the network output to the baseline. Note that the accuracy is lower than the one you got for a clean signal.

figure
cm = confusionchart(BaselineV,v,"title","Validation Accuracy - Noisy Speech");
cm.ColumnSummary = "column-normalized";
cm.RowSummary = "row-normalized";
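You can also quantify the frame-level accuracy directly (a quick check, not part of the original workflow):

frameAccuracy = mean(v == BaselineV)   % fraction of frames classified correctly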

Convert the network output from categorical to double.

v = double(v) - 1;
v = repmat(v,HopLength,1);
v = v(:);

Listen to the keyword areas identified by the network.

sound(audioIn(logical(v)),fs)

Visualize the estimated and baseline masks.

t  = (1/fs) * (0:length(v)-1);
h = figure;
plot(t,audioInNoisy(1:length(v)))
grid on
hold on
plot(t,v)
plot(t,0.8 * baseline)
xlabel('Time (s)')
legend('Validation Signal','Network Mask','Baseline Mask','Location','southeast')
l = findall(h,'type','line');
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title('Results for Noisy Speech - No Data Augmentation')

Perform Data Augmentation

The trained network did not perform well on a noisy signal because the training dataset contained only noise-free sentences. You will rectify this by augmenting your dataset to include noisy sentences.

Use audioDataAugmenter to augment your dataset.

ada = audioDataAugmenter('TimeStretchProbability',0,...
                         'PitchShiftProbability',0,...
                         'VolumeControlProbability',0,...
                         'TimeShiftProbability',0,...
                         'SNRRange',[-1 1],...
                         'AddNoiseProbability',.85);

With these settings, the audioDataAugmenter object corrupts an input audio signal with white Gaussian noise with a probability of 85%. The SNR is randomly selected from the range [-1 1] (in dB). There is a 15% probability that the augmenter does not modify your input signal.

As an example, pass an audio signal to the augmenter.

reset(ads_keyword)
x = read(ads_keyword);
data = augment(ada,x,fs)
data =

  1×2 table

         Audio          AugmentationInfo
    ________________    ________________

    {16000×1 double}      [1×1 struct]  

Inspect the AugmentationInfo variable in data to verify how the signal was modified.

data.AugmentationInfo
ans = 

  struct with fields:

    SNR: 0.5247

Reset the datastores.

reset(ads_keyword)
reset(ads_other)

Initialize the feature and mask cells.

TrainingFeatures = {};
TrainingMasks = {};

Perform feature extraction again. Each signal is corrupted by noise with a probability of 85%, so your augmented dataset has approximately 85% noisy data and 15% noise-free data.

tic
parfor ii = 1:numPartitions

    subads_keyword = partition(ads_keyword,numPartitions,ii);
    subads_other   = partition(ads_other,numPartitions,ii);

    count = 1;
    localFeatures = cell(length(subads_keyword.Files),1);
    localMasks    = cell(length(subads_keyword.Files),1);

    while hasdata(subads_keyword)

        [sentence,mask] = synthesizeSentence(subads_keyword,subads_other,fs);

        % Corrupt with noise
        augmentedData = augment(ada,sentence,fs);
        sentence      = augmentedData.Audio{1};

        % Compute mfcc features
        [coeffs,delta,deltaDelta] = mfcc(sentence,fs,'WindowLength',WindowLength,'OverlapLength',OverlapLength);
        featureMatrix = [coeffs delta deltaDelta];
        featureMatrix(~isfinite(featureMatrix)) = 0;

        hopLength = WindowLength - OverlapLength;
        range     = (hopLength) * (1:size(coeffs,1)) + hopLength;
        featureMask  = zeros(size(range));
        for index = 1:numel(range)
            featureMask(index) = mode(mask( (index-1)*hopLength+1:(index-1)*hopLength+WindowLength ));
        end

        localFeatures{count} = featureMatrix;
        catVect              = categorical(featureMask);
        catVect              = addcats(catVect,{'1'});
        localMasks{count}    = catVect;

        count = count + 1;

    end

    TrainingFeatures = [TrainingFeatures;localFeatures];
    TrainingMasks    = [TrainingMasks;localMasks];

end
fprintf('Training feature extraction took %f seconds.\n',toc)
Training feature extraction took 61.492070 seconds.

Compute the mean and standard deviation for each coefficient; use them to normalize the data.

sampleFeature = TrainingFeatures{1};
numFeatures = size(sampleFeature,2);
featuresMatrix = cat(1,TrainingFeatures{:});
M = mean(featuresMatrix);
S = std(featuresMatrix);
for index = 1:length(TrainingFeatures)
    f = TrainingFeatures{index};
    f = (f - M) ./ S;
    TrainingFeatures{index} = f.'; %#ok
end

Normalize the validation features with the new mean and standard deviation values.

FeaturesValidationNoisy = (featureMatrixV - M)./S;

Retrain Network with Augmented Dataset

Recreate the training options. Use the noisy baseline features and mask for validation.

options = trainingOptions("adam", ...
     "InitialLearnRate",1e-4,...
    "MaxEpochs",maxEpochs, ...
    "MiniBatchSize",miniBatchSize, ...
    "Shuffle","every-epoch",...
    "Verbose",false, ...
    "ValidationFrequency",floor(numel(TrainingFeatures)/miniBatchSize),...
    "ValidationData",{FeaturesValidationNoisy.',BaselineV},...
    "Plots","training-progress",...
    "LearnRateSchedule","piecewise",...
    "LearnRateDropFactor",0.1, ...
    "LearnRateDropPeriod",5);

Train the network.

if doTraining
    [KWSNet,info] = trainNetwork(TrainingFeatures,TrainingMasks,layers,options);
else
    load('KWSNet.mat','KWSNet');%#ok
end

Verify the network accuracy on the validation signal.

v = classify(KWSNet,FeaturesValidationNoisy.');

Compare the estimated and expected KWS masks.

figure
cm = confusionchart(BaselineV,v,"title","Validation Accuracy with Data Augmentation");
cm.ColumnSummary = "column-normalized";
cm.RowSummary = "row-normalized";

Listen to the identified keyword regions.

v = double(v) - 1;
v = repmat(v,HopLength,1);
v = v(:);
sound(audioIn(logical(v)),fs)

Visualize the estimated and expected masks.

h = figure;
plot(t,audioInNoisy(1:length(v)))
grid on
hold on
plot(t,v)
plot(t,0.8 * baseline)
xlabel('Time (s)')
legend('Validation Signal','Network Mask','Baseline Mask','Location','southeast')
l = findall(h,'type','line');
l(1).LineWidth = 2;
l(2).LineWidth = 2;
title('Results for Noisy Speech - With Data Augmentation')