
speech2text

Transcribe speech signal to text

    Description


    transcript = speech2text(clientObj,audioIn,fs) transcribes speech in the input audio signal to text using a wav2vec 2.0 pretrained deep learning model or a third-party speech service.

    Note

    To use speech2text with the third-party speech services, you must download the extended Audio Toolbox™ functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Using wav2vec 2.0 requires Deep Learning Toolbox™ and installation of the pretrained model.

    transcript = speech2text(___,HTTPTimeout=timeout) specifies the time in seconds to wait for the initial server connection to the third-party speech service.
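    For example, the following sketch uses HTTPTimeout with a third-party speech service. The "Google" service name is an assumption here; the available service names depend on the File Exchange submission you install.

    % Requires the extended Audio Toolbox functionality from File Exchange.
    % "Google" is assumed to be one of the available third-party services.
    transcriber = speechClient("Google");
    [audioIn,fs] = audioread("speech_dft.wav");

    % Wait up to 30 seconds for the initial server connection.
    transcript = speech2text(transcriber,audioIn,fs,HTTPTimeout=30);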

    Examples


    Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.

    Enter speechClient("wav2vec2.0") at the command line. If the pretrained model for wav2vec 2.0 is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.

    Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.

    downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
    wav2vecLocation = fullfile(tempdir,"wav2vec");
    unzip(downloadFile,wav2vecLocation)
    addpath(wav2vecLocation)

    Check that the installation is successful by entering speechClient("wav2vec2.0") at the command line. If the model is installed, then the function returns a Wav2VecSpeechClient object.

    speechClient("wav2vec2.0")
    ans = 
      Wav2VecSpeechClient with properties:
    
        Segmentation: 'word'
          TimeStamps: 0
    
    

    Read in an audio file containing speech and listen to it.

    [y,fs] = audioread("speech_dft.wav");
    soundsc(y,fs)

    Create a speechClient object that uses the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.

    transcriber = speechClient("wav2vec2.0");

    Use speech2text to obtain a transcription of the audio signal.

    transcript = speech2text(transcriber,y,fs)
    transcript=12×2 table
        Transcript     Confidence
        ___________    __________
    
        "the"           0.97605  
        "discreet"      0.91927  
        "fourier"       0.84546  
        "transform"     0.89922  
        "of"            0.66676  
        "a"             0.50026  
        "real"          0.88796  
        "valued"        0.89913  
        "signal"         0.8041  
        "is"            0.53891  
        "conjugate"     0.98438  
        "symmetric"     0.89333  
    
    

    Input Arguments


    clientObj - Client object, specified as an object returned by speechClient. The object is an interface to a pretrained wav2vec 2.0 model or to a third-party speech service.

    Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

    To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

    Example: speechClient("wav2vec2.0")

    audioIn - Audio input signal, specified as a column vector (single channel).

    Data Types: single | double
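    Because speech2text expects a single-channel signal, average the channels of a multichannel recording into a column vector before transcription. This preprocessing sketch is not part of speech2text itself.

    [audioIn,fs] = audioread("speech_dft.wav");

    % Average the channels if the recording is multichannel.
    if size(audioIn,2) > 1
        audioIn = mean(audioIn,2);
    end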

    fs - Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    timeout - Time in seconds to wait for the initial server connection, specified as a positive scalar.

    This argument applies only when clientObj interfaces with one of the third-party speech services.

    Output Arguments


    transcript - Speech transcript of the input audio signal, returned as a table with one column containing the transcribed words and another column containing the associated confidence metrics.

    If clientObj interfaces with the wav2vec 2.0 pretrained model and you set its Segmentation property to "none" when creating it with speechClient, then speech2text returns the transcript as a string.

    Note

    The returned table can have additional columns depending on the properties specified when creating the clientObj with speechClient.

    Data Types: table | string
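    For example, this sketch returns the transcript as a single string by setting the Segmentation property to "none" when creating the client. It assumes the pretrained wav2vec 2.0 model is installed.

    % Create a wav2vec 2.0 client that returns an unsegmented transcript.
    transcriber = speechClient("wav2vec2.0",Segmentation="none");
    [audioIn,fs] = audioread("speech_dft.wav");

    % transcript is a string rather than a word-level table.
    transcript = speech2text(transcriber,audioIn,fs);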

    References

    [1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

    Version History

    Introduced in R2022b