Documentation

This is machine translation

Mouseover text to see original. Click the button below to return to the English version of the page.

Partition a Datastore in Parallel

Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:

• Perform some action on only one part of the whole datastore, or on several defined parts simultaneously.

• Search for specific values in the data store, with all workers acting simultaneously on their own partitions.

• Perform a reduction calculation on the workers across all partitions.

This example shows how to use `partition` to parallelize the reading of data from a datastore. It uses a small datastore of airline data provided in MATLAB®, and finds the mean of the non-NaN values from its `'ArrDelay'` column.

A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The following code does this for the datastore first in a non-parallel way. To begin, you define a function to amass the count and sum. If you want to run this example, copy and save this function in a folder on the MATLAB command search path.

``` % Copyright 2015 The MathWorks, Inc. function [total,count] = sumAndCountArrivalDelay(ds) total = 0; count = 0; while hasdata(ds) data = read(ds); total = total + sum(data.ArrDelay,1,'OmitNaN'); count = count + sum(~isnan(data.ArrDelay)); end end ```

The following code creates a datastore, calls the function, and calculates the mean without any parallel execution. The `tic` and `toc` functions are used to time the execution, here and in the later parallel cases.

```ds = datastore(repmat({'airlinesmall.csv'},20,1),'TreatAsMissing','NA'); ds.SelectedVariableNames = 'ArrDelay'; reset(ds); tic [total,count] = sumAndCountArrivalDelay(ds) sumtime = toc mean = total/count ```
```total = 17211680 count = 2417320 sumtime = 8.2906 mean = 7.1201 ```

The `partition` function allows you to partition the datastore into smaller parts, each represented as a datastore itself. These smaller datastores work completely independently of each other, so that you can work with them inside of parallel language features such as `parfor` loops and `spmd` blocks.

The number of partitions in the following code is set by the `numpartitions` function, based on the datastore itself (`ds`) and the parallel pool (`gcp`) size. This does not necessarily equal the number of workers in the pool. In this case, the number of loop iterations is then set to the number of partitions (`N`).

The following code starts a parallel pool on a local cluster, then partitions the datastore among workers for iterating over the loop. Again, a separate function is called, which includes the `parfor` loop to amass the count and sum totals. Copy and save this function if you want to run the example.

``` % Copyright 2015 The MathWorks, Inc. function [total, count] = parforSumAndCountArrivalDelay(ds) N = numpartitions(ds,gcp); total = 0; count = 0; parfor ii = 1:N % Get partition ii of the datastore. subds = partition(ds,N,ii); [localTotal,localCount] = sumAndCountArrivalDelay(subds); total = total + localTotal; count = count + localCount; end end ```

Now the MATLAB code calls this new function, so that the counting and summing of the non-NAN values can occur in parallel loop iterations.

```p = parpool('local',4); reset(ds); tic [total,count] = parforSumAndCountArrivalDelay(ds) parfortime = toc mean = total/count ```
```Starting parallel pool (parpool) using the 'local' profile ... Connected to the parallel pool (number of workers: 4). total = 17211680 count = 2417320 parfortime = 10.6307 mean = 7.1201 ```

Rather than let the software calculate the number of partitions, you can explicitly set this value, so that the data can be appropriately partitioned to fit your algorithm. For example, to parallelize data from within an `spmd` block, you can specify the number of workers (`numlabs`) as the number of partitions to use. The following function uses an `spmd` block to perform a parallel read, and explicitly sets the number of partitions equal to the number of workers. To run this example, copy and save the function.

``` % Copyright 2015 The MathWorks, Inc. function [total,count] = spmdSumAndCountArrivalDelay(ds) spmd subds = partition(ds,numlabs,labindex); [total,count] = sumAndCountArrivalDelay(subds); end total = sum([total{:}]); count = sum([count{:}]); end ```

Now the MATLAB code calls the function that uses an `spmd` block.

```reset(ds); tic [total,count] = spmdSumAndCountArrivalDelay(ds) spmdtime = toc mean = total/count ```
```total = 17211680 count = 2417320 spmdtime = 6.0593 mean = 7.1201 ```
```delete(p); ```
```Parallel pool using the 'local' profile is shutting down. ```

You might get some idea of modest performance improvements by comparing the times recorded in the variables `sumtime`, `parfortime`, and `spmdtime`. Your results might vary, as the performance can be affected by the datastore size, parallel pool size, hardware configuration, and other factors.