Distributing Arrays to Parallel Workers
Using Distributed Arrays to Partition Data Across Workers
Depending on how your data fits in memory, choose one of the following methods:
- If your data is currently in the memory of your local machine, you can use the - distributedfunction to distribute an existing array from the client workspace to the workers of a parallel pool. This option can be useful for testing or before performing operations which significantly increase the size of your arrays, such as- repmat.
- If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use - datastorewith the- distributedfunction to read data into the memory of the workers of a parallel pool.
- If your data does not fit in the memory of your cluster, you can use - datastorewith- tallarrays to partition and process your data in chunks. See also Big Data Workflow Using Tall Arrays and Datastores.
Load Distributed Arrays in Parallel Using datastore
If your data does not fit in the memory of your local machine, but does fit in the
                memory of your cluster, you can use datastore with the
                    distributed function to create distributed arrays and
                partition the data among your workers. 
This example shows how to create and load distributed arrays using
                    datastore. Create a datastore using a tabular file of
                airline flight data. This data set is too small to show equal partitioning of the
                data over the workers. To simulate a large data set, artificially increase the size
                of the datastore using repmat.
files = repmat({'airlinesmall.csv'}, 10, 1);
ds = tabularTextDatastore(files);
Select the example variables.
ds.SelectedVariableNames = {'DepTime','DepDelay'};
ds.TreatAsMissing = 'NA';
Create a distributed table by reading the datastore in parallel. Partition the datastore with one partition per worker. Each worker then reads all data from the corresponding partition. The files must be in a shared location that is accessible by the workers.
dt = distributed(ds);
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.
Display summary information about the distributed table.
summary(dt)
Variables:
    DepTime: 1,235,230×1 double
        Values:
            min          1
            max       2505
            NaNs    23,510
    DepDelay: 1,235,230×1 double
        Values:
            min      -1036
            max       1438
            NaNs    23,510
Determine the size of the tall table.
size(dt)
ans =
     1235230           2
Return the first few rows of dt.
head(dt)
ans =
    DepTime    DepDelay
    _______    ________
     642       12      
    1021        1      
    2055       20      
    1332       12      
     629       -1      
    1446       63      
     928       -2      
     859       -1      
    1833        3      
    1041        1      
Finally, check how much data each worker has loaded.
spmd, dt, end
Worker 1: 
  
  This worker stores dt2(1:370569,:).
  
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
  
Worker 2: 
  
  This worker stores dt2(370570:617615,:).
  
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]
  
Worker 3: 
  
  This worker stores dt2(617616:988184,:).
  
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
  
Worker 4: 
  
  This worker stores dt2(988185:1235230,:).
  
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]Note that the data is partitioned equally over the workers. For more details on
                    datastore, see What Is a Datastore?
            
For more details about workflows for big data, see Choose a Parallel Computing Solution.
Alternative Methods for Creating Distributed and Codistributed Arrays
If your data fits in the memory of your local machine, you can use distributed
                arrays to partition the data among your workers. Use the
                    distributed function to create a distributed array in the
                    MATLAB® client, and store its data on the workers of the open parallel pool. A
                distributed array is distributed in one dimension, and as evenly as possible along
                that dimension among the workers. You cannot control the details of distribution
                when creating a distributed array. 
You can create a distributed array in several ways:
- Use the - distributedfunction to distribute an existing array from the client workspace to the workers of a parallel pool.
- Use any of the - distributedfunctions to directly construct a distributed array on the workers. This technique does not require that the array already exists in the client, thereby reducing client workspace memory requirements. Functions include- eye(___,'distributed')- rand(___,'distributed')- distributedobject reference page.
- Create a codistributed array inside an - spmdstatement, and then access it as a distributed array outside the- spmdstatement. This technique lets you use distribution schemes other than the default.
The first two techniques do not involve spmd in creating the
                array, but you can use spmd to manipulate arrays created this
                way. For example:
Create an array in the client workspace, and then make it a distributed array.
parpool('Processes',2) % Create pool W = ones(6,6); W = distributed(W); % Distribute to the workers spmd T = W*2; % Calculation performed on workers, in parallel. % T and W are both codistributed arrays here. end T % View results in client. whos % T and W are both distributed arrays here. delete(gcp) % Stop pool
Alternatively, you can use the codistributed function, which
                allows you to control more options such as dimensions and partitions, but is often
                more complicated. You can create a codistributed array by
                executing on the workers themselves, either inside an spmd statement or inside a
                communicating job. When creating a codistributed array, you can
                control all aspects of distribution, including dimensions and partitions.
The relationship between distributed and codistributed arrays is one of
                perspective. Codistributed arrays are partitioned among the workers from which you
                execute code to create or manipulate them. When you create a distributed array in
                the client, you can access it as a codistributed array inside an
                    spmd statement. When you create a codistributed array in an
                    spmd statement, you can access it as a distributed array in
                the client. Only spmd statements let you access the same array
                data from two different perspectives.
You can create a codistributed array in several ways:
- Use the - codistributedfunction inside an- spmdstatement or a communicating job to codistribute data already existing on the workers running that job.
- Use any of the codistributed functions to directly construct a codistributed array on the workers. This technique does not require that the array already exists in the workers. Functions include - eye(___,'codistributed')- rand(___,'codistributed')- codistributedobject reference page.
- Create a distributed array outside an - spmdstatement, then access it as a codistributed array inside the- spmdstatement running on the same parallel pool.
Create a codistributed array inside an spmd statement using a
                nondefault distribution scheme. First, define 1-D distribution along the third
                dimension, with 4 parts on worker 1, and 12 parts on worker 2. Then create a
                3-by-3-by-16 array of zeros.
parpool('Processes',2) % Create pool spmd codist = codistributor1d(3,[4,12]); Z = zeros(3,3,16,codist); Z = Z + spmdIndex; end Z % View results in client. % Z is a distributed array here. delete(gcp) % Stop pool
For more details on codistributed arrays, see Working with Codistributed Arrays.
See Also
distributed | codistributed | tall | datastore | spmd | repmat | eye | rand