Low LSTM Accuracy in Speech Recognition

Hamza on 31 Oct 2023
Hello everyone, I am applying LSTM to speech emotion recognition. I have performed feature extraction using MFCC, resulting in a matrix of dimensions 60,575 × 39. I subsequently transformed this matrix into a cell array named "AllCellTrain" with dimensions 280 × 1, containing signals of varying sizes, as illustrated in the image below. I then utilized "AllCellTrain" as input for the trainNetwork function, along with the labels YCA, network layers, and training options. However, I encountered a significant issue with accuracy, achieving only around 20%. I'm unsure where I may have made a mistake. Could someone please offer some assistance?
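num_hidden_units = 1024;
% The layers around lstmLayer were cut off in the post. The ones marked
% "assumed" are a standard sequence-classification stack, not the original code.
layers = [
    sequenceInputLayer(39)                              % assumed: 39 MFCC features per time step
    lstmLayer(num_hidden_units, 'OutputMode', 'last')
    fullyConnectedLayer(7)                              % assumed: 7 emotion classes
    softmaxLayer                                        % assumed
    classificationLayer];                               % assumed
% Specify the training options
max_epochs = 36;
mini_batch_size = 28;
initial_learning_rate = 0.001;
options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false);
% Train on the cell array of sequences, then evaluate on the test set
net = trainNetwork(AllCellTrain, YCA, layers, options);
predicted_labels = classify(net, AllCellTest, 'ExecutionEnvironment', 'gpu');
acc = mean(predicted_labels == YCT)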
  4 Comments
Hamza on 6 Nov 2023
Edited: Hamza on 6 Nov 2023
Hi @Christopher McCausland, thanks for your answer. I am trying to classify 7 emotion classes. For your information, I used the same data with a 1D CNN and got 90% accuracy, so I don't know what the issue is with the LSTM. Also, when I shuffled the columns (the features) I got a different result, which shouldn't be the case. Please find the training curve attached. Thanks in advance.
Christopher McCausland on 6 Nov 2023
Hi @Hamza,
To me this looks like classic overfitting. Your model appears to train well and learn features, but those features are overfitted to the training data and are not representative of generalised data.
A few things to consider:
  1. Do you have multiple speakers? If so, how do you pick which speakers go into the test and train sets?
  2. You have 280 input sequences and seven classes. If the data is perfectly balanced, that is 40 observations per class. Is this enough?
  3. Can you include a validation split to prevent overfitting?
  4. These are just a few ways to prevent overfitting and to check that your data is appropriate for training. There are many others, which I would suggest you take a look at.
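As a rough illustration of points 2 and 3, here is a minimal sketch of how you might count the observations per class and hold out a stratified validation set. AllCellTrain and YCA are the names from your post. The stand-in data, the 20% hold-out fraction, the patience value, and the variable names holdout, XTrain, YTrain, XVal and YVal are my assumptions. It also assumes YCA is a 280 × 1 categorical column vector and that cvpartition (Statistics and Machine Learning Toolbox) is available.

```matlab
% Sketch only: AllCellTrain and YCA come from the question; all other names
% and values are assumptions chosen for illustration.
rng(0);  % fixed seed so the split is repeatable

% Stand-in data so the sketch runs on its own; skipped if the real variables exist.
% trainNetwork expects each cell as (features x time steps), here 39 x T.
if ~exist('AllCellTrain', 'var')
    numObs = 280;
    numFeatures = 39;
    AllCellTrain = arrayfun(@(k) randn(numFeatures, randi([100 300])), ...
        (1:numObs)', 'UniformOutput', false);
    YCA = categorical(repmat((1:7)', numObs/7, 1));
end

% 1) Check how many sequences each emotion class actually has.
classNames  = categories(YCA);
classCounts = countcats(YCA);
disp(table(classNames, classCounts));

% 2) Hold out 20% of the sequences; cvpartition stratifies by class label.
holdout = cvpartition(YCA, 'HoldOut', 0.2);
XTrain = AllCellTrain(training(holdout));
YTrain = YCA(training(holdout));
XVal   = AllCellTrain(test(holdout));
YVal   = YCA(test(holdout));

% 3) Give the held-out set to trainingOptions so validation loss is tracked
%    and training stops once it no longer improves.
options = trainingOptions('adam', ...
    'MaxEpochs', 36, ...
    'MiniBatchSize', 28, ...
    'InitialLearnRate', 0.001, ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', {XVal, YVal}, ...
    'ValidationFrequency', 8, ...
    'ValidationPatience', 5, ...
    'Verbose', false);
```

You would then call trainNetwork(XTrain, YTrain, layers, options) with your own layers. Note that this split is stratified by class only. If you have several speakers, it does not keep a speaker's recordings out of the training set, so point 1 still needs a split by speaker.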
In terms of the CNN performance, were the test and train sets the same, and how many epochs did you train the CNN for?

