Train Speaker Identification System
Use the Census Database (also known as the AN4 Database) from the CMU Robust Speech Recognition Group. The data set contains recordings of male and female subjects speaking words and numbers. The helper function in this example downloads the data set, converts the raw files to FLAC, and returns two audioDatastore objects containing the training set and the test set. By default, the data set is reduced so that the example runs quickly. To use the full data set, set ReduceDataset to false.
[adsTrain,adsTest] = HelperAN4Download(ReduceDataset=true);
Split the test data set into enroll and test sets. Use two utterances per speaker for enrollment and the remaining utterances for the test set. Generally, the more utterances you use for enrollment, the better the system performs. However, most practical applications are limited to a small set of enrollment utterances.
[adsEnroll,adsTest] = splitEachLabel(adsTest,2);
Inspect the distribution of speakers in the training, test, and enroll sets. The speakers in the training set do not overlap with the speakers in the test and enroll sets.
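Training set (utterances per speaker):
fejs 13
fmjd 13
fsrb 13
ftmj 13
fwxs 12
mcen 13
mrcb 13
msjm 13
msjr 13
msmn 9

Enroll set (utterances per speaker):
fvap 2
marh 2

Test set (utterances per speaker):
fvap 11
marh 11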
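The calls that produced these counts are not shown here. As a minimal sketch in base MATLAB, the following rebuilds toy label vectors with the same counts and checks that the training speakers do not overlap with the test and enroll speakers. The variable names are hypothetical stand-ins for adsTrain.Labels, adsEnroll.Labels, and adsTest.Labels; nothing is read from the AN4 files.

```matlab
% Toy label vectors that mirror the listed counts; they are stand-ins,
% not labels read from the AN4 files.
trainNames = ["fejs","fmjd","fsrb","ftmj","fwxs","mcen","mrcb","msjm","msjr","msmn"];
trainCounts = [13 13 13 13 12 13 13 13 13 9];
trainLabels = categorical(repelem(trainNames,trainCounts)).';
enrollLabels = categorical(repelem(["fvap","marh"],[2 2])).';
testLabels = categorical(repelem(["fvap","marh"],[11 11])).';

% Print the number of utterances per speaker in each set.
summary(trainLabels)
summary(enrollLabels)
summary(testLabels)

% Speakers shared between the training set and the test or enroll sets.
heldOutSpeakers = union(categories(testLabels),categories(enrollLabels));
sharedSpeakers = intersect(categories(trainLabels),heldOutSpeakers);
numTrainUtterances = numel(trainLabels);
disp(isempty(sharedSpeakers))
```

With real datastores, the same check would apply to the Labels properties directly.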
Create an i-vector system that accepts feature input.
fs = 16e3;
iv = ivectorSystem(SampleRate=fs,InputType="features");
Create an audioFeatureExtractor object to extract the gammatone cepstral coefficients (GTCC), the delta GTCC, the delta-delta GTCC, and the pitch from 50 ms periodic Hann windows with 45 ms overlap.
afe = audioFeatureExtractor(gtcc=true,gtccDelta=true,gtccDeltaDelta=true,pitch=true,SampleRate=fs);
afe.Window = hann(round(0.05*fs),"periodic");
afe.OverlapLength = round(0.045*fs);
afe
afe = 
  audioFeatureExtractor with properties:

   Properties
                     Window: [800×1 double]
              OverlapLength: 720
                 SampleRate: 16000
                  FFTLength: []
    SpectralDescriptorInput: 'linearSpectrum'
        FeatureVectorLength: 40

   Enabled Features
     gtcc, gtccDelta, gtccDeltaDelta, pitch

   Disabled Features
     linearSpectrum, melSpectrum, barkSpectrum, erbSpectrum, mfcc, mfccDelta
     mfccDeltaDelta, spectralCentroid, spectralCrest, spectralDecrease, spectralEntropy, spectralFlatness
     spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope, spectralSpread
     harmonicRatio, zerocrossrate, shortTimeEnergy

   To extract a feature, set the corresponding property to true.
   For example, obj.mfcc = true, adds mfcc to the list of enabled features.
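The window and overlap lengths in the display follow directly from the sample rate. The sketch below reproduces that arithmetic in base MATLAB. The breakdown of the 40-element feature vector into 13 coefficients per GTCC-type feature plus one pitch value is an assumption that happens to match FeatureVectorLength, and the frame count assumes a hypothetical 1 s signal framed without edge padding.

```matlab
fs = 16e3;                               % sample rate from the example (Hz)
winLength = round(0.05*fs);              % 50 ms window -> 800 samples
overlapLength = round(0.045*fs);         % 45 ms overlap -> 720 samples
hopLength = winLength - overlapLength;   % 80 samples, or 5 ms between frames

% Assumed breakdown of the feature vector: 13 coefficients for each of
% gtcc, gtccDelta, and gtccDeltaDelta (an assumption), plus one pitch value.
numCoeffs = 13;
featureVectorLength = 3*numCoeffs + 1;

% Frame count for a hypothetical 1 s signal, assuming no padding at the edges.
numSamples = fs;
numHops = floor((numSamples - winLength)/hopLength) + 1;

fprintf("hop = %d samples, features = %d, hops = %d\n", ...
    hopLength,featureVectorLength,numHops)
```

The short 5 ms hop means each utterance contributes many feature vectors (rows) to the matrices passed to the i-vector system.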
Create transformed datastores by adding feature extraction to the read function of adsTrain and adsEnroll.
trainLabels = adsTrain.Labels;
adsTrain = transform(adsTrain,@(x)extract(afe,x));
enrollLabels = adsEnroll.Labels;
adsEnroll = transform(adsEnroll,@(x)extract(afe,x));
Train both the extractor and classifier using the training set.
trainExtractor(iv,adsTrain, ...
    UBMNumComponents=64, ...
    UBMNumIterations=5, ...
    TVSRank=32, ...
    TVSNumIterations=3);
Calculating standardization factors ....done.
Training universal background model ........done.
Training total variability space ......done.
i-vector extractor training complete.
trainClassifier(iv,adsTrain,trainLabels, ...
    NumEigenvectors=16, ...
    PLDANumDimensions=16, ...
    PLDANumIterations=5);
Extracting i-vectors ...done.
Training projection matrix .....done.
Training PLDA model ........done.
i-vector classifier training complete.
To calibrate the system so that scores can be interpreted as a measure of confidence in a positive decision, use calibrate.
Extracting i-vectors ...done.
Calibrating CSS scorer ...done.
Calibrating PLDA scorer ...done.
Calibration complete.
Enroll the speakers from the enrollment set.
Extracting i-vectors ...done.
Enrolling i-vectors .....done.
Enrollment complete.
Evaluate the file-level prediction accuracy on the test set.
numCorrect = 0;
reset(adsTest)
for index = 1:numel(adsTest.Files)
    features = extract(afe,read(adsTest));
    results = identify(iv,features);

    trueLabel = adsTest.Labels(index);
    predictedLabel = results.Label(1);
    isPredictionCorrect = trueLabel==predictedLabel;

    numCorrect = numCorrect + isPredictionCorrect;
end
display("File Accuracy: " + round(100*numCorrect/numel(adsTest.Files),2) + " (%)")
"File Accuracy: 100 (%)"
 "CMU Sphinx Group - Audio Databases." http://www.speech.cs.cmu.edu/databases/an4/. Accessed 19 Dec. 2019.
ivs — i-vector system
i-vector system, specified as an object of type ivectorSystem.
data — Data to identify
column vector | matrix
Data to identify, specified as a column vector representing a single-channel (mono) audio signal or a matrix of audio features.
If InputType is set to "audio" when the i-vector system is created, data must be a column vector with underlying type single or double.
If InputType is set to "features" when the i-vector system is created, data must be a matrix with underlying type single or double. The matrix must consist of audio features, where the number of features (columns) is locked the first time trainExtractor is called and the number of hops (rows) is variable-sized.
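The two accepted shapes can be checked before calling identify. The following is a minimal sketch in base MATLAB, not part of the i-vector system itself: the sizes (191 hops, 40 features, 16000 samples) are hypothetical, and restricting the check to floating-point types is an assumption made for this illustration.

```matlab
% Hypothetical feature matrix: 191 hops (rows) by 40 features (columns).
numFeatures = 40;
features = randn(191,numFeatures);

% The row count may vary between calls; the column count must match the
% number of features the system was trained with.
validateattributes(features,{'single','double'}, ...
    {'2d','nonempty','ncols',numFeatures})

% A mono audio signal, by contrast, is a single column.
audioIn = randn(16000,1);
validateattributes(audioIn,{'single','double'},{'column','nonempty'})
```

validateattributes throws an error when a check fails, so a mismatched column count is caught before the data reaches the system.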
scorer — Scoring algorithm
Scoring algorithm used by the i-vector system, specified as "plda", which corresponds to probabilistic linear discriminant analysis (PLDA), or "css", which corresponds to cosine similarity scoring (CSS). To use "plda", you must train the PLDA model using trainClassifier. If the PLDA model has been trained, then scorer defaults to "plda". Otherwise, scorer defaults to "css".
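To make the cosine similarity option concrete, the sketch below scores a query vector against two enrolled vectors and ranks the candidates. It illustrates the general technique only: the 3-element vectors are hypothetical stand-ins for i-vectors, the table and its variable names are chosen for this example, and the scorer inside the i-vector system may apply additional projection or calibration steps.

```matlab
% Hypothetical 3-element i-vectors; real i-vectors come from a trained extractor.
enrolled = [1 0 0; 0 1 0];      % one row per enrolled speaker
labels = ["fvap";"marh"];
query = [2 1 0];

% Cosine similarity: inner product divided by the product of Euclidean norms.
scores = (enrolled*query.') ./ (vecnorm(enrolled,2,2)*norm(query));

% Rank candidates from most to least similar.
[sortedScores,order] = sort(scores,"descend");
ranked = table(labels(order),sortedScores,VariableNames=["Label","Score"]);
disp(ranked)
```

Because the score depends only on the angle between vectors, scaling the query does not change the ranking.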
N — Number of candidates
Number of candidates to return in tableOut, specified as a positive integer scalar. If you request a number of candidates greater than the number of labels enrolled in the i-vector system, then all candidates are
returned. If unspecified, the number of candidates defaults to the number of enrolled