speechClient

Interface with pretrained model or third-party speech service

Since R2022b

Description

Use a speechClient object to interface with a pretrained speech-to-text model, pretrained text-to-speech model, or third-party cloud-based speech services. Use the object with speech2text or text2speech.

Note

Using the Emformer, Whisper, or HiFi-GAN pretrained models requires Deep Learning Toolbox™ and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

To interface with third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

Using wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained model.

Creation

Syntax

clientObj = speechClient(name)

clientObj = speechClient(___,Name=Value)

Description

clientObj = speechClient(name) returns a speechClient object that interfaces with the specified pretrained model or speech service.

example

clientObj = speechClient(___,Name=Value) sets Properties using one or more name-value arguments.

example

Input Arguments

expand all

`name` — Pretrained model or service name
`"wav2vec2.0"` | `"emformer"` | `"whisper"` | `"hifigan"` | `"Google"` | `"IBM"` | `"Microsoft"` | `"Amazon"`

Name of the pretrained model or speech service, specified as "wav2vec2.0", "emformer", "whisper", "hifigan", "Google", "IBM", "Microsoft", or "Amazon".

"wav2vec2.0" –– Use a pretrained wav2vec 2.0 model. You can only use wav2vec 2.0 to perform speech-to-text transcription, and therefore you cannot use it with text2speech.
"emformer" –– Use a pretrained Emformer model. You can only use Emformer to perform speech-to-text transcription, and therefore you cannot use it with text2speech.
"whisper" –– Use a pretrained Whisper model. You can only use Whisper to perform speech-to-text transcription, and therefore you cannot use it with text2speech.
"hifigan" –– Use a pretrained HiFi-GAN/Tacotron2 model. You can only use HiFi-GAN/Tacotron2 to perform text-to-speech synthesis, and therefore you cannot use it with speech2text.
"Google" –– Interface with the Google^® Cloud Speech-to-Text and Text-to-Speech service.
"IBM" –– Interface with the IBM^® Watson Speech to Text and Text to Speech service.
"Microsoft" –– Interface with the Microsoft^® Azure^® Speech service.
"Amazon" –– Interface with the Amazon^® Transcribe and Amazon Polly services.

Using the Emformer, Whisper, or HiFi-GAN pretrained models requires Deep Learning Toolbox and Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, then the function provides a link to the Add-On Explorer, where you can download and install the support package.

Using the wav2vec 2.0 pretrained model requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

To use any of the third-party speech services (Google, IBM, Microsoft, or Amazon), you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

Data Types: string | char

Output Arguments

expand all

`clientObj` — Client object
`speechClient` object

Client object to use with speech2text to transcribe speech in audio signals to text, or with text2speech to synthesize speech signals from text.

Properties

expand all

`Segmentation` — Segmentation of transcript
`"word"` | `"sentence"` | `"none"`

Segmentation of the output transcript, specified as "word", "sentence", or "none".

This property applies only to the wav2vec 2.0 and Whisper pretrained models and the Amazon speech service.

"word" — speech2text returns the transcript as a table where each word is in its own row. This is the default for the wav2vec 2.0 and Whisper pretrained models.
"sentence" — speech2text returns the transcript as a table where each sentence is in its own row. The wav2vec 2.0 and Whisper pretrained models do not support this option.
"none" — speech2text returns a string containing the entire transcript. This is the default for the Amazon speech service.

Data Types: string | char

`ModelSize` — Size of model
`"large"` (default) | `"medium"` | `"tiny"` | `"base"` | `"small"`

Size of model, specified as "tiny", "base", "small", "medium", or "large". Smaller models allow faster inference at the expense of decreased transcription performance.

This property applies only to the Whisper pretrained model.

Data Types: char | string

`Language` — Language spoken in input signal
`"english"` | `"spanish"` | `"italian"` | `"french"` | `"german"` | ...

Language spoken in input signal, specified as a string containing the language name.

This property applies only to the wav2vec 2.0 and Whisper models. For the third-party speech services, use setOptions with service-specific options to specify languages.

For wav2vec 2.0, specify the language as "english", "spanish", "italian", "french", or "german". You can also specify the ISO language codes ("en", "es", "it", "fr", "de").

For Whisper, speech2text can by default automatically transcribe and translate speech from languages other than English. Alternatively, you can also specify the language with this property.

The following languages are supported for Whisper: "english", "chinese", "german", "spanish", "russian", "korean", "french", "japanese", "portuguese", "turkish", "polish", "catalan", "dutch", "arabic", "swedish", "italian", "indonesian", "hindi", "finnish", "vietnamese", "hebrew", "ukrainian", "greek", "malay", "czech", "romanian", "danish", "hungarian", "tamil", "norwegian", "thai", "urdu", "croatian", "bulgarian", "lithuanian", "latin", "maori", "malayalam", "welsh", "slovak", "telugu", "persian", "latvian", "bengali", "serbian", "azerbaijani", "slovenian", "kannada", "estonian", "macedonian", "breton", "basque", "icelandic", "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian", "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer", "shona", "yoruba", "somali", "afrikaans", "occitan", "georgian", "belarusian", "tajik", "sindhi", "gujarati", "amharic", "yiddish", "lao", "uzbek", "faroese", "haitian creole", "pashto", "turkmen", "nynorsk", "maltese", "sanskrit", "luxembourgish", "myanmar", "tibetan", "tagalog", "malagasy", "assamese", "tatar", "hawaiian", "lingala", "hausa", "bashkir", "javanese", "sundanese", "cantonese", "burmese", "valencian", "flemish", "haitian", "letzeburgesch", "pushto", "panjabi", "moldavian", "moldovan", "sinhalese", "castilian", "mandarin".

You can also specify the ISO language codes for these languages.

Data Types: char | string

`Task` — Task to perform
`"transcribe"` (default) | `"translate"`

Task to perform, specified as "transcribe" or "translate". When you set Task to "translate", speech2text translates the speech to English when transcribing the text.

This property applies only to the Whisper pretrained model, and "translate" applies only when Language is not English.

Data Types: char | string

`ExecutionEnvironment` — Hardware resource for execution
`"cpu"` (default) | `"gpu"`

Hardware resource for execution, specified as one of these values:

"cpu" — Use the CPU.
"gpu" — Use the GPU. This option requires Parallel Computing Toolbox™.

This property applies only to the Emformer, Whisper, and HiFi-GAN pretrained models.

Data Types: string | char

`DecoderBeamWidth` — Beam width for speech-to-text decoding
10 (default) | nonnegative integer

Beam width for speech-to-text beam search decoding, specified as a nonnegative integer. A higher beam width means the decoder keeps track of more text hypotheses at each time step. Increasing the beam width may lead to more accurate predictions, at the expense of being more computationally expensive.

This property applies only to the Emformer pretrained model.

`TimeStamps` — Include timestamps in transcript
`false` (default) | `true`

Include timestamps of transcribed speech in the transcript, specified as true or false. If you specify TimeStamps as true, speech2text includes an additional column in the transcript table that contains the timestamps. When using the wav2vec 2.0 pretrained model, the speech2text function determines the timestamps using the algorithm described in [2].

This property applies only to the wav2vec 2.0 and Whisper pretrained models and the Amazon speech service if you set the Segmentation property to "word" or "sentence".

Data Types: logical

`TimeOut` — Connection timeout
nonnegative scalar

Connection timeout, specified as a nonnegative scalar in seconds. The timeout specifies the time to wait for the initial server connection to the third-party speech service.

This property applies only to the third-party speech services.

Object Functions

reset Reset states for streaming-enabled speech clients

Note

For the third-party speech services, you can configure server-specific options using the following functions. See the documentation for the specific service for option names and values.

`setOptions`	Set server options
`getOptions`	Get server options
`clearOptions`	Remove all server options

Examples

collapse all

Download wav2vec 2.0 Functionality

Open Live Script

Type speechClient("wav2vec2.0") into the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, and unzip the file to a location on the MATLAB® path.

Alternatively, execute the following commands to download and unzip the wav2vec model files to your temporary directory.

downloadFolder = fullfile(tempdir,"wav2vecDownload");
loc = websave(downloadFolder,"https://ssd.mathworks.com/supportfiles/audio/asr-wav2vec2-librispeech.zip");
modelsLocation = tempdir;
unzip(loc,modelsLocation)
addpath(fullfile(modelsLocation,"asr-wav2vec2-librispeech"))

Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the files are installed, then the function returns a Wav2VecSpeechClient object.

speechClient("wav2vec2.0")

ans = 
  Wav2VecSpeechClient with properties:

    Segmentation: 'word'
      TimeStamps: 0
        Language: 'english'

Download Whisper Speech-to-Text Model

Open Live Script

Type speechClient("whisper") into the command line. If the required model files are not installed, then the function throws an error and provides a link to download them. Click the link, unzip the file, and place the resulting folder with the model files in a location on the MATLAB path.

Alternatively, execute the following commands to download and unzip the Whisper model files to your temporary directory.

downloadFolder = fullfile(tempdir,"whisperDownload");
loc = websave(downloadFolder,"https://ssd.mathworks.com/supportfiles/audio/whisper-large-v3.zip");
modelsLocation = tempdir;
unzip(loc,modelsLocation)
addpath(fullfile(modelsLocation,"whisper-large-v3"))

Use Emformer for Streaming Speech-to-Text

Open Live Script

Create a speechClient object that uses the Emformer pretrained model.

emformerSpeechClient = speechClient("emformer");

Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.

afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
txtTotal = "";
while ~isDone(afr)
    x = afr();
    txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
    txtTotal = txtTotal + txt;
end

txtTotal

txtTotal = 
"one two three four five six seven eight nine"

Use GPU for Text-to-Speech Synthesis

This example uses:

Open Live Script

Create a speechClient object that uses the HiFi-GAN/Tacotron2 pretrained model. Set ExecutionEnvironment to "gpu" to use the GPU when running the model.

hifiganSpeechClient = speechClient("hifigan",ExecutionEnvironment="gpu");

Call text2speech on a string of text with the HiFi-GAN/Tacotron2 speechClient object to synthesize the speech signal.

[x,fs] = text2speech("hello world",Client=hifiganSpeechClient);

Listen to the synthesized speech.

sound(x,fs)

Translate and Transcribe Speech with Whisper

Open Live Script

Create a speechClient object that uses a Whisper pretrained model. Set Task to "translate" to translate other languages into English when performing speech-to-text with this object.

whisperSpeechClient = speechClient("whisper",Task="translate");

Read in a speech signal containing Polish and listen to it.

[x,fs] = audioread("polish.wav");
sound(x,fs)

Call speech2text on the signal with Client set to the Whisper client object to simultaneously translate and transcribe the speech.

translatedTranscript = speech2text(x,fs,Client=whisperSpeechClient)

translatedTranscript = 
"Good day, I am Polish."

References

[1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

[2] Kürzinger, Ludwig, Dominik Winkelbauer, Lujun Li, Tobias Watzel, and Gerhard Rigoll. “CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition.” In Speech and Computer, edited by Alexey Karpov and Rodmonga Potapova, 12335:267–78. Cham: Springer International Publishing, 2020. https://doi.org/10.1007/978-3-030-60276-5_27.

[3] Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv, December 6, 2022. https://doi.org/10.48550/arXiv.2212.04356.

Version History

Introduced in R2022b

speechClient

Description

Creation

Syntax

Description

Input Arguments

`name` — Pretrained model or service name
`"wav2vec2.0"` | `"emformer"` | `"whisper"` | `"hifigan"` | `"Google"` | `"IBM"` | `"Microsoft"` | `"Amazon"`

Output Arguments

`clientObj` — Client object
`speechClient` object

Properties

`Segmentation` — Segmentation of transcript
`"word"` | `"sentence"` | `"none"`

`ModelSize` — Size of model
`"large"` (default) | `"medium"` | `"tiny"` | `"base"` | `"small"`

`Language` — Language spoken in input signal
`"english"` | `"spanish"` | `"italian"` | `"french"` | `"german"` | ...

`Task` — Task to perform
`"transcribe"` (default) | `"translate"`

`ExecutionEnvironment` — Hardware resource for execution
`"cpu"` (default) | `"gpu"`

`DecoderBeamWidth` — Beam width for speech-to-text decoding
10 (default) | nonnegative integer

`TimeStamps` — Include timestamps in transcript
`false` (default) | `true`

`TimeOut` — Connection timeout
nonnegative scalar

Object Functions

Examples

Download wav2vec 2.0 Functionality

Download Whisper Speech-to-Text Model

Use Emformer for Streaming Speech-to-Text

Use GPU for Text-to-Speech Synthesis

Translate and Transcribe Speech with Whisper

References

Version History

See Also

Topics

speechClient

Description

Creation

Syntax

Description

Input Arguments

name — Pretrained model or service name "wav2vec2.0" | "emformer" | "whisper" | "hifigan" | "Google" | "IBM" | "Microsoft" | "Amazon"

Output Arguments

clientObj — Client object speechClient object

Properties

Segmentation — Segmentation of transcript "word" | "sentence" | "none"

ModelSize — Size of model "large" (default) | "medium" | "tiny" | "base" | "small"

Language — Language spoken in input signal "english" | "spanish" | "italian" | "french" | "german" | ...

Task — Task to perform "transcribe" (default) | "translate"

ExecutionEnvironment — Hardware resource for execution "cpu" (default) | "gpu"

DecoderBeamWidth — Beam width for speech-to-text decoding 10 (default) | nonnegative integer

TimeStamps — Include timestamps in transcript false (default) | true

TimeOut — Connection timeout nonnegative scalar

Object Functions

Examples

Download wav2vec 2.0 Functionality

Download Whisper Speech-to-Text Model

Use Emformer for Streaming Speech-to-Text

Use GPU for Text-to-Speech Synthesis

Translate and Transcribe Speech with Whisper

References

Version History

See Also

Topics

`name` — Pretrained model or service name
`"wav2vec2.0"` | `"emformer"` | `"whisper"` | `"hifigan"` | `"Google"` | `"IBM"` | `"Microsoft"` | `"Amazon"`

`clientObj` — Client object
`speechClient` object

`Segmentation` — Segmentation of transcript
`"word"` | `"sentence"` | `"none"`

`ModelSize` — Size of model
`"large"` (default) | `"medium"` | `"tiny"` | `"base"` | `"small"`

`Language` — Language spoken in input signal
`"english"` | `"spanish"` | `"italian"` | `"french"` | `"german"` | ...

`Task` — Task to perform
`"transcribe"` (default) | `"translate"`

`ExecutionEnvironment` — Hardware resource for execution
`"cpu"` (default) | `"gpu"`

`DecoderBeamWidth` — Beam width for speech-to-text decoding
10 (default) | nonnegative integer

`TimeStamps` — Include timestamps in transcript
`false` (default) | `true`

`TimeOut` — Connection timeout
nonnegative scalar