Hi Cristina, I understand you are now sorted but I am including some more info below in case it can help others.
The network expects input in exactly the same format it was trained with. In particular, the input size depends on:
- The parameters of audioFeatureExtractor: both those of the feature extraction algorithm itself (e.g. NumBands) and the buffering parameters (in this case Window and OverlapLength)
- The length of the actual input waveform segment
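For concreteness, here is a rough sketch of how those buffering parameters determine the size of the feature matrix the network sees. The parameter values below are made up for illustration; substitute the ones used at training time:

N = 16000;                 % waveform length: 1 s at 16 kHz
windowLength = 512;        % hypothetical length of the Window
overlapLength = 384;       % hypothetical OverlapLength
numBands = 50;             % hypothetical NumBands

hopLength = windowLength - overlapLength;
numFrames = floor((N - overlapLength)/hopLength);  % buffered frames
% The network input is then roughly numFrames-by-numBands, so changing
% any of these values (or the waveform length N) changes the input size.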
In this example, you correctly honored the first point but not the second. This pre-trained network requires every waveform segment to be 1 s long (16000 samples at a sample rate of 16 kHz). However, this test segment, stop_command.flac, contains only 12800 samples. The most common approach is simply to pad the signal with zeros after its end, as in the zero-padding line of the code below:
[stopCmd, fs] = audioread('stop_command.flac');
N = 1*fs;  % required length: 1 s at fs = 16000 Hz
stopCmdIn = [stopCmd(:,1); zeros(N-size(stopCmd,1),1,'like',stopCmd)];  % zero-pad at the end
afe = audioFeatureExtractor( ...  % use the same parameters as during training
features = extract(afe, stopCmdIn);
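As a side note (my own sketch, not part of the example): if a clip could also be longer than 1 s, you would trim rather than pad, e.g.:

[x, fs] = audioread('stop_command.flac');
x = x(:,1);   % keep the first channel only
N = 1*fs;     % required length: 1 s
if size(x,1) < N
    x = [x; zeros(N-size(x,1),1,'like',x)];  % pad with trailing zeros
else
    x = x(1:N);                              % trim to the first N samples
end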
Better padding approaches tend to distribute the zeros across both the start and the end of the signal, as done in the function helperExtractAuditoryFeatures.m (lines 37-46) that ships with the example Speech Command Recognition Using Deep Learning. To open that example, execute the following in your local MATLAB installation: