

Interface with pretrained model or third-party speech service



    clientObj = speechClient(name) returns a speechClient object that interfaces with a wav2vec 2.0 pretrained speech-to-text model or third-party cloud-based speech services. Use the object with speech2text or text2speech.


    To use speechClient to interface with third-party speech services, you must download the extended Audio Toolbox™ functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.
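
    The toolbox requirements above can be checked before calling speechClient. The following is a minimal sketch, assuming the products are listed by ver under the names "Audio Toolbox" and "Deep Learning Toolbox"; the warning identifier is a hypothetical name chosen for illustration. It does not check whether the pretrained model itself is installed.

```matlab
% List installed MathWorks products and look for the two toolboxes by name.
installed = ver;                            % struct array, one element per product
productNames = string({installed.Name});

hasAudio = any(productNames == "Audio Toolbox");
hasDeepLearning = any(productNames == "Deep Learning Toolbox");

if ~(hasAudio && hasDeepLearning)
    % "speechClient:missingToolbox" is an illustrative identifier, not a MathWorks one.
    warning("speechClient:missingToolbox", ...
        "wav2vec 2.0 needs Audio Toolbox and Deep Learning Toolbox.");
end
```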

    clientObj = speechClient(___,Name=Value) specifies additional properties used by the speechClient object.


    Examples

    Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.

    Type speechClient("wav2vec2.0") into the command line. If the pretrained model for wav2vec 2.0 is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.

    Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.

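    downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/");
    wav2vecLocation = fullfile(tempdir,"wav2vec");
    unzip(downloadFile,wav2vecLocation)   % extract the model into the temporary directory
    addpath(wav2vecLocation)              % make the model visible on the MATLAB path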

    Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the model is installed, then the function returns a Wav2VecSpeechClient object.

    ans = 
      Wav2VecSpeechClient with properties:
        Segmentation: 'word'
          TimeStamps: 0

    Read in an audio file containing speech and listen to it.

    [y,fs] = audioread("speech_dft.wav");
    sound(y,fs)

    Create a speechClient object that uses the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.

    transcriber = speechClient("wav2vec2.0");

    Use speech2text to obtain a transcription of the audio signal.

    transcript = speech2text(transcriber,y,fs)
    transcript=12×2 table
        Transcript     Confidence
        ___________    __________
        "the"           0.97605  
        "discreet"      0.91927  
        "fourier"       0.84546  
        "transform"     0.89922  
        "of"            0.66676  
        "a"             0.50026  
        "real"          0.88796  
        "valued"        0.89913  
        "signal"         0.8041  
        "is"            0.53891  
        "conjugate"     0.98438  
        "symmetric"     0.89333  

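    The word-level table lends itself to simple post-processing. The sketch below rebuilds the table from the values displayed above so that it runs without the pretrained model, joins the words into one sentence, and flags low-confidence words. The 0.6 threshold is an arbitrary choice for illustration, not a recommended value.

```matlab
% Rebuild the transcript table shown above (values copied from the display).
words = ["the";"discreet";"fourier";"transform";"of";"a"; ...
         "real";"valued";"signal";"is";"conjugate";"symmetric"];
confidence = [0.97605;0.91927;0.84546;0.89922;0.66676;0.50026; ...
              0.88796;0.89913;0.8041;0.53891;0.98438;0.89333];
transcript = table(words,confidence,VariableNames=["Transcript","Confidence"]);

% Join the per-word rows into a single space-separated string.
sentence = join(transcript.Transcript);

% Keep only the rows whose confidence falls below the chosen threshold.
threshold = 0.6;   % arbitrary cutoff for this example
uncertain = transcript(transcript.Confidence < threshold,:);
```

    With these values, uncertain contains the two rows for "a" and "is". Note that "discreet" is a misrecognition of "discrete" yet scores above 0.9, so a high confidence value does not guarantee a correct word.
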
    Input Arguments


    name: Name of the pretrained model or speech service, specified as "wav2vec2.0", "Google", "IBM", or "Microsoft".

    • "wav2vec2.0": Use a pretrained wav2vec 2.0 model. The wav2vec 2.0 model supports only speech-to-text transcription, so you cannot use it with text2speech.

    • "Google": Interface with the Google® Cloud Speech-to-Text and Text-to-Speech service.

    • "IBM": Interface with the IBM® Watson Speech to Text and Text to Speech service.

    • "Microsoft": Interface with the Microsoft® Azure® Speech service.

    Using the wav2vec 2.0 pretrained model requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

    To use any of the third-party speech services (Google, IBM, or Microsoft), you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Data Types: string | char

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: speechClient("wav2vec2.0",Segmentation="none")


    These arguments apply to the wav2vec 2.0 pretrained model. For the third-party speech services (Google, IBM, or Microsoft), valid property names and values depend on their specific API. See the documentation for the corresponding service for property names and values.

    Segmentation: Segmentation of the output transcript, specified as "word" or "none".

    If the segmentation is "word", speech2text returns the transcript as a table where each word is in its own row.

    If the segmentation is "none", speech2text returns a string containing the entire transcript.

    Data Types: string | char
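
    The two Segmentation settings can be compared on the same signal. This is a minimal sketch, assuming the pretrained wav2vec 2.0 model is installed and that speech_dft.wav (the file used in the example above) is on the path; the variable names are illustrative.

```matlab
[y,fs] = audioread("speech_dft.wav");

% Word-level output: a table with one row per transcribed word.
wordClient = speechClient("wav2vec2.0",Segmentation="word");
wordTranscript = speech2text(wordClient,y,fs);

% Whole-utterance output: a single string containing the entire transcript.
textClient = speechClient("wav2vec2.0",Segmentation="none");
fullTranscript = speech2text(textClient,y,fs);
```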

    TimeStamps: Option to include timestamps of the transcribed speech in the transcript, specified as true or false. This property takes effect only when Segmentation is "word". If you specify TimeStamps as true, speech2text adds a column containing the timestamps to the transcript table. The speech2text function determines the timestamps using the algorithm described in [2].

    Data Types: logical
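
    A short sketch of requesting timestamps follows, assuming the pretrained wav2vec 2.0 model is installed and speech_dft.wav is on the path. Because the name of the added column is not stated here, the code lists the table's variable names instead of assuming one.

```matlab
[y,fs] = audioread("speech_dft.wav");

% Timestamps require word-level segmentation.
client = speechClient("wav2vec2.0",Segmentation="word",TimeStamps=true);
transcript = speech2text(client,y,fs);

% Show the column names of the returned table, including the added timestamp column.
disp(transcript.Properties.VariableNames)
```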

    Output Arguments


    clientObj: Client object to use with speech2text to transcribe speech in audio signals to text, or with text2speech to synthesize speech signals from text.


    References

    [1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020.

    [2] Kürzinger, Ludwig, Dominik Winkelbauer, Lujun Li, Tobias Watzel, and Gerhard Rigoll. “CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition.” In Speech and Computer, edited by Alexey Karpov and Rodmonga Potapova, 12335:267–78. Cham: Springer International Publishing, 2020.

    Version History

    Introduced in R2022b