Partition a Datastore in Parallel

Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:

Perform some action on only one part of the whole datastore, or on several defined parts simultaneously.
Search for specific values in the data store, with all workers acting simultaneously on their own partitions.
Perform a reduction calculation on the workers across all partitions.

Read Data from Datastore in Parallel

Open Live Script

This example shows how to use the partition function to parallelize the reading of data from a datastore. It uses a small dataset that represents events at the substations and feeders of a large energy grid and finds the mean of the non-NaN values from its 'CustomersAffected' column.

Serial Execution

A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The code in the sumAndCountCustomersAffected helper function does this for the datastore first in a non-parallel way.

First, delete any existing parallel pools.

delete(gcp("nocreate"));

The energyGridEvents.parquet file is a relatively small data set so you are unlikely to observe any improved performance with the parallel pool when compared to the MATLAB® client. To increase the dataset size, you can create multiple copies of the file. Create a datastore from the collection of energyGridEvents.parquet files and select the CustomersAffected variable to import.

ds = parquetDatastore(repmat("energyGridEvents.parquet",50,1));
ds.SelectedVariableNames = "CustomersAffected";
reset(ds);

Use the function sumAndCountCustomersAffected to calculate the mean without any parallel execution. Use the tic and toc functions to time the execution, here and in the later parallel cases.

tic
  [total,count] = sumAndCountCustomersAffected(ds)

total = 
1.8720e+11

count = 
9110750

sumtime = toc

sumtime = 
2.5695

mean = total/count

mean = 
2.0547e+04

Parallel Execution

The partition function allows you to partition the datastore into smaller parts, each represented as a datastore itself. These smaller datastores work completely independently of each other, so that you can work with them inside of parallel language features such as parfor-loops and spmd blocks.

Use Automatic Partitions

You can use the numpartitions function to specify the number of partitions, which is based on the datastore itself and the parallel pool size. This does not necessarily equal the number of workers in the pool. Set the number of loop iterations to the number of partitions (N).

The following code starts a parallel pool on the local cluster, then partitions the datastore among workers for iterating over the loop. This code calls the helper function parforSumAndCountCustomersAffected, which includes a parfor-loop to amass the count and sum totals in parallel loop iterations.

p = parpool("Processes",4);

Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.

reset(ds);
tic
  [total,count] = parforSumAndCountCustomersAffected(ds)

total = 
1.8720e+11

count = 
9110750

parfortime = toc

parfortime = 
1.1271

mean = total/count

mean = 
2.0547e+04

Specify Number of Partitions

Rather than let the software calculate the number of partitions, you can explicitly set this value, so that the data can be appropriately partitioned to fit your algorithm. For example, to parallelize data from within an spmd block, you can specify the number of workers (spmdSize) as the number of partitions to use. The spmdSumAndCountCustomersAffected helper function uses an spmd block to perform a parallel read, and explicitly sets the number of partitions equal to the number of workers.

reset(ds);
tic
[total,count] = spmdSumAndCountCustomersAffected(ds)

total = 
1.8720e+11

count = 
9110750

spmdtime = toc

spmdtime = 
1.2823

mean = total/count

mean = 
2.0547e+04

When you are done with your computation, you can delete the current parallel pool.

delete(p);

Helper Functions

Create a helper function to amass the count and sum in a non-parallel way.

function [total,count] = sumAndCountCustomersAffected(ds)
    total = 0;
    count = 0;
    while hasdata(ds)
        data = read(ds);
        total = total + sum(data.CustomersAffected,1,"OmitNaN");
        count = count + sum(~isnan(data.CustomersAffected));
    end
end

Create a helper function to amass the count and sum in parallel using parfor.

function [total, count] = parforSumAndCountCustomersAffected(ds)
    N = numpartitions(ds,gcp);
    total = 0;
    count = 0;    
    parfor ii = 1:N
        % Get partition ii of the datastore.
        subds = partition(ds,N,ii);
    
        [localTotal,localCount] = sumAndCountCustomersAffected(subds);
        total = total + localTotal;
        count = count + localCount;
    end
end

Create a helper function to amass the count and sum in parallel using spmd.

function [total,count] = spmdSumAndCountCustomersAffected(ds)
    spmd
        subds = partition(ds,spmdSize,spmdIndex);
        [total,count] = sumAndCountCustomersAffected(subds);    
    end
    total = sum([total{:}]);
    count = sum([count{:}]);
end

Partition a Datastore in Parallel

Read Data from Datastore in Parallel

See Also

Topics