
Resize Clusters Automatically

Your cluster can resize automatically based on the amount of work submitted to it. Automatic resizing requires a dedicated headnode.

To enable automatic resizing, select the option Allow cluster to auto-resize on the Create Cluster page. Specify the maximum number of workers that you require in the cluster using the Upper Limit menu next to Workers in Cluster. To ensure that all machines are started with the same number of workers, the Upper Limit menu options are multiples of the Workers per Machine value.

Based on your Upper Limit selection and the Workers per Machine value, the Machines in Cluster field displays the maximum number of machines for your cluster, including the headnode machine. For example, with four workers per machine and an upper limit of 16 workers, the cluster uses at most five machines: four worker machines plus the headnode. Your cluster does not auto-resize above the number of machines displayed in the Machines in Cluster field or the number of workers specified by Workers in Cluster. You can use these limits to cap the cost of running the cluster.

Note

Set the Workers in Cluster field to the maximum cluster size that you are prepared to pay for.

You can view the maximum number of workers and current requested number of workers on the Cluster Summary page in Cloud Center. You can also view these properties from your cluster object in MATLAB®, using the properties MaxNumWorkers and NumWorkersRequested. For more information, see parallel.Cluster (Parallel Computing Toolbox).
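
For example, a sketch like the following inspects these properties from the MATLAB client. The profile name myAutoResizeCluster is a hypothetical placeholder for your Cloud Center cluster profile.

c = parcluster("myAutoResizeCluster");  % hypothetical profile name
c.MaxNumWorkers                         % upper limit set at cluster creation
c.NumWorkersRequested                   % workers the cluster currently requests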

Tip

To prevent your cluster from shutting down when all workers become idle and no jobs are in the queue, set the termination policy of your cluster to After a fixed time period or Never. Your cluster remains online with only the dedicated headnode machine until more jobs are submitted or the cluster times out.

Cluster Growing and Shrinking

Your cluster starts with the dedicated headnode machine and zero workers. When you submit jobs, the cluster grows to accommodate the next queued job by adding machines, up to the maximum number of machines set when you created the cluster. The cluster continues to grow until it runs out of queued jobs or until the upper limit on the number of workers prevents it from accommodating the next queued job. Workers are added in increments of Workers per Machine.

As workers become available, they can be assigned to the next job in the queue. A queued job is scheduled as soon as there are enough available workers to run the job.
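
For example, in this hedged sketch a batch pool job declares its worker requirement: one worker runs the function and five more form its pool, so the job is scheduled only once six workers are available. The profile name and the function processData are hypothetical.

c = parcluster("myAutoResizeCluster");          % hypothetical profile name
% One worker runs processData and five form its pool, so this job
% waits in the queue until six workers are available.
job = batch(c, @processData, 1, {}, 'Pool', 5);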

Machines no longer in use are removed from the cluster. A machine is not removed until every worker on it is idle; if even one worker on a machine is busy, that machine stays in the cluster. Cloud Center checks for idle workers every five minutes and removes machines that have been idle for at least five minutes, so it can take up to 15 minutes to remove a machine after all workers on that machine become idle. When no jobs are running, your cluster can shrink to zero workers. In this case, only the headnode remains in the cluster.

Distributing Jobs Across Machines

Workers on a new machine or from a finishing job do not necessarily become available to the cluster at the same time. The cluster schedules a job to idle workers as soon as the minimum requirements of the job are met. Consequently, as running jobs finish and queued jobs start, jobs can become distributed across several machines. In such cases, the cluster can occupy more machines than its number of active workers strictly requires, yet it cannot shrink because every machine has at least one active worker.

The following example shows one of several ways that the cluster can distribute these jobs among workers. The actual distribution depends on the order in which workers become available on new machines and after jobs finish running. Suppose you create a cluster with a maximum of 16 workers, with four workers per machine. The cluster starts with zero workers. You submit four jobs: one six-worker job, two four-worker jobs, and one five-worker job. The jobs finish in the order they are submitted.
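
Submitted from MATLAB, the four jobs in this example might look like the following sketch. The profile name is a hypothetical placeholder; each communicating job fixes its size with the NumWorkersRange property.

c = parcluster("myAutoResizeCluster");      % hypothetical profile name
for n = [6 4 4 5]                           % worker counts of the four jobs
    job = createCommunicatingJob(c, 'NumWorkersRange', [n n]);
    createTask(job, @labindex, 1, {});      % trivial task that runs on every worker
    submit(job);
end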

First, the cluster adds two machines to run the six-worker job: six of the eight workers run the job and two remain idle. To run the second job, the cluster needs two additional workers, so a third machine is requested. The job is assigned to the two free workers on the second machine and two workers on the new machine. Similarly, two additional workers are required to run the third job, so the cluster requests a fourth machine. Now, 14 of the 16 workers on four machines are in use. There are not enough workers available to run the final, five-worker job, and the 16-worker upper limit prevents the cluster from growing further, so this job remains in the queue while the first three jobs run.

Figure: Three jobs running on four machines. Two workers on the fourth machine are idle.

When the first job finishes, the six workers that ran the job become idle. They do not necessarily become idle at the same time. As soon as three additional workers are available, the cluster assigns workers for the final, five-worker job.

Figure: Three jobs running on four machines. Two workers on the first machine and one worker on the second machine are idle.

When the second job finishes, there are still active workers across all four machines. The cluster cannot shrink, even though there are seven idle workers.

Figure: Two jobs running on four machines. Two workers on the first machine, three workers on the second machine, and two workers on the third machine are idle.

When the third job finishes, all workers on one machine become idle. Once they have been idle for at least five minutes, Cloud Center can remove that machine, and the cluster shrinks to three machines.

Figure: One job running on three machines. Two workers on the first machine, three workers on the second machine, and two workers on the last machine are idle.

When the fourth and final job finishes, all workers on the three remaining machines become idle. If no further jobs are submitted, the cluster is reduced to zero workers. Only the dedicated headnode machine remains in the cluster.

AWS Resource Limits

If the cluster encounters an AWS quota limit error or other resource constraint during its lifetime, Cloud Center reduces the maximum number of workers to the number of workers that Cloud Center allocated successfully before encountering the error. Queued jobs that the reduced maximum cluster size cannot support are canceled and removed from the queue. If you stop and restart the cluster, the restriction is removed and the cluster attempts to grow to the maximum that you specified.

For more information on how AWS service limits affect the maximum number of instances that you can launch, see AWS Resource Limits.
