Voice Activity Detection in Audio Toolbox

Voice activity detection (VAD) is a common building block in many audio systems. Distinguishing speech and non-speech segments enables more efficient processing and transmission of audio data. Applications of VAD include speech and speaker recognition systems, telecommunications, hearing aids, audio conferencing, surveillance and security, speech enhancement, and data set preprocessing for machine learning workflows.

In this example, you compare VAD implementations provided by Audio Toolbox™.

Overview

Audio Toolbox provides the following features for voice activity detection, which this example compares:

- voiceActivityDetector — a System object suited to streaming, frame-based processing, based on statistical estimates of noise and SNR
- detectSpeech — a batch function that thresholds statistical models of short-term signal features
- detectspeechnn — a batch function based on a pretrained deep learning model

Inspect Test File

Load a relatively clean speech file and a corresponding speech detection mask. There is no gold-standard labeling for speech detection because the boundaries of speech are debatable, especially in the spaces between words. The mask shown here gives generous boundaries to the speech regions. You can use this mask as a reference when assessing VAD performance.

[cleanspeech,fs] = audioread("vadexample-cleanspeech.ogg");
sound(cleanspeech,fs)

load("vadexample-cleanspeech-mask.mat","vadmask")
plotsigroi(vadmask,cleanspeech,true)

[Figure: clean speech waveform with the reference speech regions highlighted; x-axis in seconds]
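
One way to use the reference is to wrap a detector's output regions in a signalMask object and plot them with the same function for a visual comparison. The following is a minimal sketch, using the detectSpeech function introduced later in this example.

% A minimal sketch: wrap detected regions in a signalMask so you can plot
% them the same way as the reference mask and compare visually.
roi = detectSpeech(cleanspeech,fs);           % detected speech regions in samples
detectedMask = signalMask(roi,SampleRate=fs);
plotsigroi(detectedMask,cleanspeech,true)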

voiceActivityDetector

The voiceActivityDetector object maintains a moving estimate of the noise over time and uses it to estimate the signal-to-noise ratio (SNR) of each input frame [1] [2] [3]. If the SNR is above a certain threshold, the frame may correspond to speech. The SNR estimate is combined with a hidden Markov model (HMM) to determine the probability that the current frame contains speech. The final metric is returned as a number in the range [0,1], and you choose a threshold on this metric to decide whether speech is present.

The voiceActivityDetector is well-suited for streaming applications because it retains state between calls. It performs well for clean speech and under the condition that the background variability is static or slow-moving and is not peaky, such as the channel noise introduced in telecommunications.
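
For example, this minimal sketch converts the probability into a binary frame-level decision; the 0.9 threshold is an assumption that you should tune for your application.

% A minimal sketch: threshold the speech-presence probability per frame.
% The 0.9 threshold is an assumption; tune it for your application.
reader = dsp.AudioFileReader("vadexample-cleanspeech.ogg");
vad = voiceActivityDetector;
isSpeech = false(0,1);
while ~isDone(reader)
    frame = reader();
    isSpeech(end+1,1) = vad(frame) > 0.9; % binary decision for this frame
end
release(reader)
release(vad)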

fileReader = dsp.AudioFileReader("vadexample-noisyspeech.ogg");

vad = voiceActivityDetector();

scope = timescope( ...
    NumInputPorts=2, ...
    SampleRate=fileReader.SampleRate, ...
    TimeSpanSource="Property",TimeSpan=20, ...
    BufferLength=20*fileReader.SampleRate, ...
    YLimits=[-1 1.2], ...
    ShowLegend=true, ...
    ChannelNames=["Audio","Probability of speech presence"]);

device = audiostreamer(SampleRate=fileReader.SampleRate);

while ~isDone(fileReader)
    x = fileReader();
    probability = vad(x);
    scope(x,probability*ones(fileReader.SamplesPerFrame,1))
    drawnow limitrate
    play(device,x);
end
release(fileReader)
release(vad)
release(device)
release(scope)

The algorithm has difficulty distinguishing speech from non-speech when the non-speech is particularly peaky, such as music or some real-world ambiance. For example, in the following file containing both speech and music, the voiceActivityDetector assigns a high probability of speech to musical regions that contain no speech.

fileReader = dsp.AudioFileReader("vadexample-speechplusmusic.ogg");

while ~isDone(fileReader)
    x = fileReader();
    probability = vad(x);
    scope(x,probability*ones(fileReader.SamplesPerFrame,1))
    drawnow limitrate
    play(device,x);
end
release(fileReader)
release(vad)
release(device)
release(scope)

detectSpeech

The detectSpeech function works by extracting short-term features of the signal, building statistical models of those features, and then thresholding them to create decision masks [4]. By default, the function assumes that both speech and non-speech are present in the signal when it determines the thresholds. For streaming applications, or to save time when processing large data sets, you can instead specify the thresholds directly.
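
For example, this minimal sketch estimates thresholds once on a representative recording and reuses them across a data set; the folder name is a hypothetical placeholder.

% A minimal sketch: estimate thresholds once, then reuse them per file.
[~,thresholds] = detectSpeech(cleanspeech,fs); % requesting outputs suppresses the plot
ads = audioDatastore("speechFolder");          % hypothetical folder of audio files
while hasdata(ads)
    [x,info] = read(ads);
    roi = detectSpeech(x,info.SampleRate,Thresholds=thresholds);
    % ... use or store roi for this file ...
end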

The detectSpeech function is well-suited to isolate relatively clean speech. It operates best on complete audio segments. It does not require a startup time to build a statistical noise estimate.

detectSpeech(cleanspeech,fs)

[Figure: Detected Speech plot of the clean speech signal with detected regions highlighted; x-axis: Time (s)]

Because it operates in batch fashion, the detectSpeech function also offers post-processing to merge nearby speech decisions. The MergeDistance argument is specified in samples, so a value of 2*fs merges detected regions separated by less than two seconds.

detectSpeech(cleanspeech,fs,MergeDistance=2*fs)

[Figure: Detected Speech plot showing fewer, merged regions; x-axis: Time (s)]

The detectSpeech algorithm has difficulty distinguishing speech from non-speech under low SNR or when the non-speech is particularly peaky, such as music or ambiance. As shown below, it fails in both the low-SNR and background-music scenarios.

[noisyspeech,fs] = audioread("vadexample-noisyspeech.ogg");
detectSpeech(noisyspeech,fs)

[Figure: Detected Speech plot of the noisy speech signal; x-axis: Time (s)]

[speechplusmusic,fs] = audioread("vadexample-speechplusmusic.ogg");
detectSpeech(speechplusmusic,fs)

[Figure: Detected Speech plot of the speech-plus-music signal; x-axis: Time (s)]

detectspeechnn

The detectspeechnn function uses an AI model to estimate the probability of speech, then applies postprocessing logic to smooth the detected regions and, optionally, an energy-based VAD on the back end [5]. The model combines convolutional, recurrent, and fully connected layers. It was trained on the LibriParty data set [6], a synthetic cocktail-party/meeting scenario data set derived from LibriSpeech [7].

The detectspeechnn function is well-suited to detecting continuous speech under difficult noise conditions. It performs well in both the low-SNR and music scenarios. It is also the only algorithm of the three that disregards the throat clearing around the 13-second mark.

detectspeechnn(noisyspeech,fs)

[Figure: Detected Speech plot of the noisy speech signal; x-axis: Time (s), y-axis: Amplitude]

detectspeechnn(speechplusmusic,fs)

[Figure: Detected Speech plot of the speech-plus-music signal; x-axis: Time (s), y-axis: Amplitude]
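
In the calls above, omitting the output argument plots the results. You can instead request an output to obtain the detected regions programmatically. A minimal sketch:

% Request an output to get detected speech regions as an N-by-2 matrix of
% start and end sample indices instead of a plot.
roi = detectspeechnn(noisyspeech,fs);
durations = (roi(:,2) - roi(:,1) + 1)/fs % region durations in seconds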

As a byproduct of the training data, the model performs poorly at detecting very short speech segments, like isolated words. In the following audio file, a female voice says "volume up" several times. The algorithm does not detect the last utterance.

[x,fs] = audioread("FemaleVolumeUp-16-mono-11secs.ogg");
detectspeechnn(x,fs)

[Figure: Detected Speech plot of the "volume up" recording; x-axis: Time (s), y-axis: Amplitude]

You can access the model underlying detectspeechnn directly and perform transfer learning to target specific noise conditions, languages, or voices, or to detect isolated words. See Train Voice Activity Detection in Noise Model Using Deep Learning for an example.

Comparison

Use the following table to determine which feature is suitable for your application.

| Feature | Use for ... | Data processing | Speed | Noise robustness | Simulink® support | GPU arrays | C/C++ code generation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| voiceActivityDetector | Detection of continuous speech or speech segments in clean and non-tonal noisy environments. | frame-based | slowest (operates on frames independently) | medium | Voice Activity Detector block | no | yes |
| detectSpeech | Detection of continuous speech or speech segments in clean or low-level noise environments. | batch; to use in streaming, buffering is required | fastest | low | Use in MATLAB Function (Simulink) block | yes | yes |
| detectspeechnn | Detection of continuous speech in clean or noisy environments. | batch; to use in streaming, buffering is required | moderate | high | Use in MATLAB Function (Simulink) block | yes | yes |
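
For the batch functions, streaming use requires buffering, as the table notes. The following is a minimal sketch that buffers streamed audio into one-second chunks and runs detectSpeech with precomputed thresholds; the chunk size is an assumption to tune.

% A minimal sketch: buffer streamed frames and run detectSpeech on chunks.
fileReader = dsp.AudioFileReader("vadexample-cleanspeech.ogg");
fs = fileReader.SampleRate;
[~,thresholds] = detectSpeech(cleanspeech,fs); % thresholds from the earlier clean recording
buf = dsp.AsyncBuffer(2*fs);
while ~isDone(fileReader)
    write(buf,fileReader());          % accumulate incoming frames
    if buf.NumUnreadSamples >= fs
        chunk = read(buf,fs);         % one second of audio
        roi = detectSpeech(chunk,fs,Thresholds=thresholds);
    end
end
release(fileReader)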

References

  1. Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. "A Statistical Model-Based Voice Activity Detection." IEEE Signal Processing Letters. Vol. 6, No. 1, 1999, pp. 1–3.

  2. Martin, R. "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics." IEEE Transactions on Speech and Audio Processing. Vol. 9, No. 5, 2001, pp. 504–512.

  3. Ephraim, Y., and D. Malah. "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing. Vol. 32, No. 6, 1984, pp. 1109–1121.

  4. Giannakopoulos, Theodoros. "A Method for Silence Removal and Segmentation of Speech Signals, Implemented in MATLAB." University of Athens, Athens, 2009.

  5. Ravanelli, Mirco, et al. "SpeechBrain: A General-Purpose Speech Toolkit." arXiv, June 8, 2021. http://arxiv.org/abs/2106.04624.

  6. SpeechBrain. "LibriParty Dataset Generation Recipe." GitHub. Accessed November 21, 2024. https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriParty/generate_dataset.

  7. Panayotov, Vassil, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. "Librispeech: An ASR Corpus Based on Public Domain Audio Books." In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
