
Offload Experiments as Batch Jobs to a Cluster

Since R2022a

By default, Experiment Manager runs your experiments interactively, so you can monitor the progress of each trial by inspecting the results table and the training plot. However, running an experiment interactively limits your access to MATLAB® functionality. For example, when an experiment is running, you cannot close the project that contains the experiment or run other experiments.

If you have Parallel Computing Toolbox™ and MATLAB Parallel Server™, you can send your experiment as a batch job to a remote cluster. While the experiment is running in the cluster, you can:

  • Run another experiment interactively or start another batch job using the same experiment, a different experiment in the same project, or an experiment in a different project.

  • Close the Experiment Manager app and continue using MATLAB.

  • Close your MATLAB session.

If you have only Parallel Computing Toolbox, you can use a local cluster profile to develop and test your experiments on your client machine instead of running them on a network cluster. If you close your MATLAB session, any batch jobs that use the local cluster profile stop immediately.
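
As a minimal sketch of what a local cluster profile looks like from the command line, the following code creates a cluster object from the local profile and displays how many workers it offers. The profile name is an assumption: it is "Processes" in R2022b and later and "local" in earlier releases.

```matlab
% Assumed profile name: "Processes" in R2022b and later.
% In earlier releases, use "local" instead.
profileName = "Processes";

% Create a cluster object from the local profile.
c = parcluster(profileName);

% Display the number of workers that the profile makes available.
fprintf("Profile %s offers %d workers.\n", profileName, c.NumWorkers);
```

Knowing the worker count of the local profile helps you choose a pool size that your client machine can actually supply.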

Create Batch Job on Cluster

To start a batch job for your experiment:

  1. Configure your experiment.

    Tip

    Load data for your experiment from a location that is accessible to all your parallel workers. For example, store your data outside the project and access the data by using an absolute path. Alternatively, create a datastore object that can access the data on another machine by setting up the AlternateFileSystemRoots property of the datastore. For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

  2. In the Experiment Manager toolstrip, under Execution, use the Mode list to specify an execution mode:

    • To run one trial of the experiment at a time, select Batch Sequential. Deep learning experiments do not support this execution mode when you set the training option ExecutionEnvironment to "multi-gpu".

    • To run multiple trials at the same time, select Batch Simultaneous. Deep learning experiments do not support this execution mode when you set the training option ExecutionEnvironment to "multi-gpu" or "parallel" or when you enable the training option DispatchInBackground.

  3. Use the Cluster list to select a cluster profile to use for your batch job. To create and manage cluster profiles, open the Cluster Profile Manager. For more information, see Discover Clusters and Use Cluster Profiles (Parallel Computing Toolbox).

  4. In the Pool Size field, enter the number of workers for your batch job.

    • If Mode is Batch Sequential, use this field to configure the number of parallel workers that collaborate on each trial of the experiment. If you set the pool size to 0, the experiment runs on a single worker.

    • If Mode is Batch Simultaneous, use this field to specify the number of trials that the cluster runs at the same time.

    Because Experiment Manager uses an additional worker to run the batch job, the cluster must have at least one more worker available than the number you specify in the Pool Size field. For example, if you specify a pool size of 2, the cluster must have at least three workers available: two workers for the experiment and an additional worker to run the batch job. For more information, see Run a Batch Job with a Parallel Pool (Parallel Computing Toolbox).

  5. Click Run. Experiment Manager uses the batch (Parallel Computing Toolbox) function to run the experiment in the specified cluster.
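
The tip in step 1 about datastores can be sketched as follows. This example assumes that the same data is visible to the client as Z:\datasets and to the cluster workers as /nfs-bucket/datasets. Both paths, and the flowers subfolder, are hypothetical placeholders that you must replace with locations that exist in your environment.

```matlab
% Hypothetical paths: the client sees the data under Z:\datasets and the
% cluster workers see the same data under /nfs-bucket/datasets.
clientRoot = "Z:\datasets";
workerRoot = "/nfs-bucket/datasets";

% Create the datastore on the client. AlternateFileSystemRoots lists the
% equivalent root paths so that workers can resolve the files.
imds = imageDatastore(fullfile(clientRoot, "flowers"), ...
    IncludeSubfolders=true, ...
    LabelSource="foldernames", ...
    AlternateFileSystemRoots=[clientRoot, workerRoot]);
```

In an experiment, code like this would typically live in the setup or training function that loads the data.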

While the batch job runs your experiment, you can close Experiment Manager and recover the results later.
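
The worker-count rule in step 4 can be illustrated with a plain Parallel Computing Toolbox job. This is only a sketch under stated assumptions: it uses the default cluster profile, a stand-in workload (rand), and an example pool size of 2. It does not reproduce what Experiment Manager submits internally. The NumWorkers check compares against the worker count configured for the profile, not against live availability.

```matlab
poolSize = 2;                     % example value for the Pool Size field
requiredWorkers = poolSize + 1;   % one extra worker runs the batch job itself

c = parcluster;                   % cluster object for the default profile

if c.NumWorkers < requiredWorkers
    warning("Cluster offers %d workers, but %d are required.", ...
        c.NumWorkers, requiredWorkers);
else
    % Stand-in workload: return a 3-by-3 random matrix from the cluster.
    job = batch(c, @rand, 1, {3}, Pool=poolSize);
    wait(job);                    % block until the job finishes
    out = fetchOutputs(job);      % cell array that holds the single output
    disp(out{1})
    delete(job);                  % remove this sketch's own job from the cluster
end
```

With a pool size of 2, the job occupies three workers in total, which matches the example in step 4.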

Track Progress of Batch Job

When you run a batch job for an experiment, Experiment Manager does not continually communicate with the cluster to update the values in the results table and save the visualizations for your experiment. Instead, you retrieve this information from the cluster by clicking the Refresh button above the results table.

Results table showing Refresh button.

To monitor batch jobs without opening the Experiment Manager app, use the Job Monitor, as described in Send Deep Learning Batch Job to Cluster. The Job Monitor tells you whether your batch job is queued, running, or finished.

Job Monitor showing a batch job created with Experiment Manager.

Note

Using the Job Monitor to cancel or delete jobs that you create with Experiment Manager can lead to unexpected behavior. Instead, cancel and delete these batch jobs by using Experiment Manager.
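
If you prefer the command line to the Job Monitor, a read-only query along these lines lists the state of each job on a cluster. This sketch assumes that the default cluster profile is the one you selected in Experiment Manager. In keeping with the preceding note, it only reads job properties and does not cancel or delete anything.

```matlab
% Cluster object for the default profile. Pass a profile name to
% parcluster to query a different cluster.
c = parcluster;

% List the ID and state of every job that the cluster object reports.
jobs = c.Jobs;
for k = 1:numel(jobs)
    fprintf("Job %d: %s\n", jobs(k).ID, jobs(k).State);
end
```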

Cancel Batch Job

To cancel a batch job running an experiment, in the Experiment Manager toolstrip, click Cancel. Experiment Manager marks any running and queued trials as Canceled and discards their results.

Batch execution does not support stopping, canceling, or restarting individual trials of an experiment.

Download Training Results

To download the training results for a completed trial, in the Actions column of the results table, click the Download button for the trial.

Results table showing download button for a completed trial.

Experiment Manager saves the training results that you download from the cluster:

  • For built-in training experiments, Experiment Manager downloads the trained network and training information from the cluster.

  • For custom training experiments, Experiment Manager downloads the training output from the cluster.

You can access these results after you close your MATLAB session.

After you download the training results from the cluster, you can export these results to the workspace to perform additional computations to evaluate the quality of the training:

  • For built-in training experiments, select Export > Trained Network or Export > Training Information.

  • For custom training experiments, select Export > Training Output.
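
After exporting, one possible follow-up computation is to plot the training loss. This sketch assumes that you exported the training information to a workspace variable named trainingInfo and that it is a structure with a TrainingLoss field. Both the variable name and the field layout are assumptions, and the layout can differ by release and by training function, so the code checks before plotting.

```matlab
% Assumed variable name: trainingInfo (use the name you chose on export).
hasLoss = exist("trainingInfo", "var") == 1 && ...
    isstruct(trainingInfo) && isfield(trainingInfo, "TrainingLoss");

if hasLoss
    % Plot the training loss recorded at each iteration.
    figure
    plot(trainingInfo.TrainingLoss)
    xlabel("Iteration")
    ylabel("Training loss")
else
    % The exported variable is missing or uses a different layout.
    disp("Inspect the exported variable to see which fields it provides.")
end
```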

Delete Batch Job

To avoid consuming resources unnecessarily, delete the job from the cluster by clicking the Clean up button above the results table. Deleting the batch job permanently discards any results that you have not downloaded from the cluster.

Results table showing Clean up button.
