matlab.tall.reduce

Reduce arrays by applying reduction algorithm to blocks of data

Description

tA = matlab.tall.reduce(fcn,reducefcn,tX) applies the function fcn to each block of array tX to generate partial results. Then the function applies reducefcn to the vertical concatenation of partial results repeatedly until it has one final result, tA.
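The two stages described above can be emulated on an ordinary in-memory array. This is only an illustrative sketch: the variable blockSize and the manual loop are assumptions made for the example, and the real function chooses its own blocks and may apply reducefcn several times.

```matlab
% Illustrative emulation only; not how matlab.tall.reduce is implemented.
X = (1:10)';          % in-memory stand-in for the tall array tX
blockSize = 4;        % hypothetical block height chosen for this sketch
fcn = @numel;         % transform function applied to each block
reducefcn = @sum;     % reduction function applied to the partial results

partial = [];
for first = 1:blockSize:size(X,1)
    last = min(first + blockSize - 1, size(X,1));
    % Vertically concatenate the partial result of each block.
    partial = [partial; fcn(X(first:last,:))]; %#ok<AGROW>
end

% Reduce the concatenated partial results to one final result.
result = reducefcn(partial);
```

Here partial is [4; 4; 2] (the element count of each block) and result is 10, the total number of elements in X.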

tA = matlab.tall.reduce(fcn,reducefcn,tX,tY,...) specifies several arrays tX,tY,... that are inputs to fcn. The same rows of each array are operated on by fcn; for example, fcn(tX(n:m,:),tY(n:m,:)). Inputs with a height of one are passed to every call of fcn. With this syntax, fcn must return one output, and reducefcn must accept one input and return one output.

[tA,tB,...] = matlab.tall.reduce(fcn,reducefcn,tX,tY,...), where fcn and reducefcn are functions that return multiple outputs, returns arrays tA,tB,..., each corresponding to one of the output arguments of fcn and reducefcn. This syntax has these requirements:

  • fcn must return the same number of outputs as were requested from matlab.tall.reduce.

  • reducefcn must have the same number of inputs and outputs as the number of outputs requested from matlab.tall.reduce.

  • Each output of fcn and reducefcn must be the same type as the first input tX.

  • Corresponding outputs of fcn and reducefcn must have the same height.
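As a minimal sketch of the multiple-output syntax under the requirements listed above, the following script computes the minimum and maximum of a vector in one call. The helper names minmaxBlock and minmaxReduce are hypothetical and exist only for this illustration. Both functions return two outputs of the same type as the input and of equal height, and the reduction function accepts two inputs.

```matlab
% Sketch: compute the column-wise minimum and maximum together.
% minmaxBlock and minmaxReduce are hypothetical helper names.
x = [3; -1; 7; 4; 0];          % small in-memory data for illustration
tX = tall(x);

[tMin, tMax] = matlab.tall.reduce(@minmaxBlock, @minmaxReduce, tX);
[lo, hi] = gather(tMin, tMax); % expected to agree with min(x) and max(x)

% Local functions (in a script file, place these at the end of the file).
function [lo, hi] = minmaxBlock(x)
  % Transform step: one row per block for each output.
  lo = min(x, [], 1);
  hi = max(x, [], 1);
end

function [lo, hi] = minmaxReduce(lo, hi)
  % Reduction step: collapse the concatenated partial results.
  lo = min(lo, [], 1);
  hi = max(hi, [], 1);
end
```

Passing the dimension explicitly (the third argument of min and max) keeps the functions reducing along the tall dimension even when a block has only one row.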

[tA,tB,...] = matlab.tall.reduce(___,'OutputsLike',{PA,PB,...}) specifies that the outputs tA,tB,... have the same data types as the prototype arrays PA,PB,..., respectively. You can use any of the input argument combinations in previous syntaxes.

Examples

Create a tall table, extract a tall vector from the table, and then find the total number of elements in the vector.

Create a tall table for the airlinesmall.csv data set. The data contains information about arrival and departure times of US flights. Extract the ArrDelay variable, which is a vector of arrival delays.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tX = tt.ArrDelay;

Use matlab.tall.reduce to count the total number of elements in the tall vector. The first function, numel, counts the elements in each block of data (including any NaN values), and the second function, sum, adds the per-block counts to produce a scalar result.

s = matlab.tall.reduce(@numel,@sum,tX)
s =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

Gather the result into memory.

s = gather(s)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.57 sec
Evaluation completed in 0.71 sec
s = 123523

Create a tall table, extract two tall vectors from the table, and then calculate the mean value of each vector.

Create a tall table for the airlinesmall.csv data set. The data contains information about arrival and departure times of US flights. Extract the ArrDelay and DepDelay variables, which are vectors of arrival and departure delays.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tt = rmmissing(tt);
tX = tt.ArrDelay;
tY = tt.DepDelay;

In the first stage of the algorithm, calculate the sum and element count for each block of data in the vectors. To do this you can write a function that accepts two inputs and returns one output with the sum and count for each input. This function is listed as a local function at the end of the example.

function bx = sumcount(tx,ty)
  bx = [sum(tx) numel(tx) sum(ty) numel(ty)];
end

In the reduction stage of the algorithm, you need to add together all of the intermediate sums and counts. Thus, matlab.tall.reduce returns the overall sum of elements and number of elements for each input vector, and calculating the mean is then a simple division. For this step you can apply the sum function to the first dimension of the 1-by-4 vector outputs from the first stage.

 reducefcn = @(x) sum(x,1);
 s = matlab.tall.reduce(@sumcount,reducefcn,tX,tY)
s =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
 s = gather(s)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.5 sec
Evaluation completed in 2.9 sec
s = 1×4

      860584      120866      982764      120866

The first two elements of s are the sum and count for tX, and the second two elements are the sum and count for tY. Dividing the sums and counts yields the mean values, which you can compare to the answer returned by the mean function.

 my_mean = [s(1)/s(2) s(3)/s(4)]
my_mean = 1×2

    7.1201    8.1310

 m = gather(mean([tX tY]))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.53 sec
Evaluation completed in 0.72 sec
m = 1×2

    7.1201    8.1310

Local Functions

Listed here is the sumcount function that matlab.tall.reduce calls to calculate the intermediate sums and element counts.

function bx = sumcount(tx,ty)
  bx = [sum(tx) numel(tx) sum(ty) numel(ty)];
end

Create a tall table, then calculate the mean flight delay for each year in the data.

Create a tall table for the airlinesmall.csv data set. The data contains information about arrival and departure times of US flights. Remove rows of missing data from the table and extract the ArrDelay, DepDelay, and Year variables. These variables are vectors of arrival and departure delays and of the associated years for each flight in the data set.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay' 'Year'};
tt = tall(ds);
tt = rmmissing(tt);

Use matlab.tall.reduce to apply two functions to the tall table. The first function combines the ArrDelay and DepDelay variables to find the total mean delay for each flight. The function determines how many unique years are in each chunk of data, and then cycles through each year and calculates the average total delay for flights in that year. The result is a two-variable table containing the year and mean total delay. This intermediate data needs to be reduced further to arrive at the mean delay per year. Save this function in your current folder as transform_fcn.m.

type transform_fcn
function t = transform_fcn(a,b,c)
ii = gather(unique(c));

for k = 1:length(ii)
    jj = (c == ii(k));
    d = mean([a(jj) b(jj)], 2);
    
    if k == 1
        t = table(c(jj),d,'VariableNames',{'Year' 'MeanDelay'});
    else
        t = [t; table(c(jj),d,'VariableNames',{'Year' 'MeanDelay'})];
    end
end

end

The second function uses the results from the first function to calculate the mean total delay for each year. The output from reduce_fcn is compatible with the output from transform_fcn, so that blocks of data can be concatenated in any order and continually reduced until only one row remains for each year.

type reduce_fcn
function TT = reduce_fcn(t)
[groups,Y] = findgroups(t.Year);
D = splitapply(@mean, t.MeanDelay, groups);

TT = table(Y,D,'VariableNames',{'Year' 'MeanDelay'});
end

Apply the transform and reduce functions to the tall vectors. Since the inputs (type double) and outputs (type table) have different data types, use the 'OutputsLike' name-value pair to specify that the output is a table. A simple way to specify the type of the output is to call the transform function with dummy inputs.

a = tt.ArrDelay;
b = tt.DepDelay;
c = tt.Year;
d1 = matlab.tall.reduce(@transform_fcn, @reduce_fcn, a, b, c, 'OutputsLike',{transform_fcn(0,0,0)})
d1 =

  Mx2 tall table

    Year    MeanDelay
    ____    _________

     ?          ?    
     ?          ?    
     ?          ?    
     :          :
     :          :

Gather the results into memory to see the mean total flight delay per year.

d1 = gather(d1)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.1 sec
Evaluation completed in 2.3 sec
d1=22×2 table
    Year    MeanDelay
    ____    _________

    1987     7.6889  
    1988     6.7918  
    1989     8.0757  
    1990     7.1548  
    1991     4.0134  
    1992     5.1767  
    1993     5.4941  
    1994     6.0303  
    1995     8.4284  
    1996     9.6981  
    1997     8.4346  
    1998     8.3789  
    1999     8.9121  
    2000     10.595  
    2001     6.8975  
    2002     3.4325  
      ⋮

Alternative Approach

Another way to calculate the same statistics by group is to use splitapply to call matlab.tall.reduce (rather than using matlab.tall.reduce to call splitapply).

Using this approach, you call findgroups and splitapply directly on the data. The function mySplitFcn that operates on each group of data includes a call to matlab.tall.reduce. The transform and reduce functions employed by matlab.tall.reduce do not need to group the data, so those functions just perform calculations on the pregrouped data that splitapply passes to them.

type mySplitFcn
function T = mySplitFcn(a,b,c)
T = matlab.tall.reduce(@non_group_transform_fcn, @non_group_reduce_fcn, ...
    a, b, c, 'OutputsLike', {non_group_transform_fcn(0,0,0)});

    function t = non_group_transform_fcn(a,b,c)
        d = mean([a b], 2);
        t = table(c,d,'VariableNames',{'Year' 'MeanDelay'});
    end

    function TT = non_group_reduce_fcn(t)
        D = mean(t.MeanDelay);
        TT = table(t.Year(1),D,'VariableNames',{'Year' 'MeanDelay'});
    end

end

Call findgroups and splitapply to operate on the data and apply mySplitFcn to each group of data.

groups = findgroups(c);
d2 = splitapply(@mySplitFcn, a, b, c, groups);
d2 = gather(d2)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.52 sec
- Pass 2 of 2: Completed in 1.6 sec
Evaluation completed in 3.1 sec
d2=22×2 table
    Year    MeanDelay
    ____    _________

    1987     7.6889  
    1988     6.7918  
    1989     8.0757  
    1990     7.1548  
    1991     4.0134  
    1992     5.1767  
    1993     5.4941  
    1994     6.0303  
    1995     8.4284  
    1996     9.6981  
    1997     8.4346  
    1998     8.3789  
    1999     8.9121  
    2000     10.595  
    2001     6.8975  
    2002     3.4325  
      ⋮

Calculate weighted standard deviation and variance of a tall array using a vector of weights. This is one example of how you can use matlab.tall.reduce to work around functionality that tall arrays do not support yet.

Create two tall vectors of random data. tX contains random data, and tP contains corresponding probabilities such that sum(tP) is 1. These probabilities are suitable to weight the data.

rng default
tX = tall(rand(1e4,1));
p = rand(1e4,1);
tP = tall(normalize(p,'scale',sum(p)));

Write an identity function that returns outputs equal to the inputs. This approach skips the transform step of matlab.tall.reduce and passes the data directly to the reduction step, where the reduction function is repeatedly applied to reduce the size of the data.

type identityTransform.m
function [A,B] = identityTransform(X,Y)
  A = X;
  B = Y;
end

Next, write a reduction function that operates on blocks of the tall vectors to calculate the weighted variance and standard deviation.

type weightedStats.m
function [wvar, wstd] = weightedStats(X, P)
  wvar = var(X,P);
  wstd = std(X,P);
end

Use matlab.tall.reduce to apply these functions to the blocks of data in the tall vectors.

[tX_var_weighted, tX_std_weighted] = matlab.tall.reduce(@identityTransform, @weightedStats, tX, tP)
tX_var_weighted =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :


tX_std_weighted =

  MxNx... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

Input Arguments

fcn

Transform function to apply, specified as a function handle or anonymous function. Each output of fcn must be the same type as the first input tX. You can use the 'OutputsLike' option to return outputs of different data types. If fcn returns more than one output, then the outputs must all have the same height.

The general functional signature of fcn is

[a, b, c, ...] = fcn(x, y, z, ...)

fcn must satisfy these requirements:

  1. Input Arguments — The inputs [x, y, z, ...] are blocks of data that fit in memory. The blocks are produced by extracting data from the respective tall array inputs [tX, tY, tZ, ...]. The inputs [x, y, z, ...] satisfy these properties:

    • All of [x, y, z, ...] have the same size in the first dimension after any allowed expansion.

    • The blocks of data in [x, y, z, ...] come from the same index in the tall dimension, assuming the tall array is nonsingleton in the tall dimension. For example, if tX and tY are nonsingleton in the tall dimension, then the first set of blocks might be x = tX(1:20000,:) and y = tY(1:20000,:).

    • If the first dimension of any of [tX, tY, tZ, ...] has a size of 1, then the corresponding block [x, y, z, ...] consists of all the data in that tall array.

  2. Output Arguments — The outputs [a, b, c, ...] are blocks that fit in memory, to be sent to the respective outputs [tA, tB, tC, ...]. The outputs [a, b, c, ...] satisfy these properties:

    • All of [a, b, c, ...] must have the same size in the first dimension.

    • All of [a, b, c, ...] are vertically concatenated with the respective results of previous calls to fcn.

    • All of [a, b, c, ...] are sent to the same index in the first dimension in their respective destination output arrays.

  3. Functional Rules — fcn must satisfy the functional rule:

    • F([inputs1; inputs2]) == [F(inputs1); F(inputs2)]: Applying the function to the concatenation of the inputs should be the same as applying the function to the inputs separately and then concatenating the results.

  4. Empty Inputs — Ensure that fcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data.
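The empty-input requirement can be checked in memory before running on tall data. The sketch below is illustrative only: safeFcn and fragileFcn are hypothetical names, and the 0-by-1 block stands in for what a function might receive after heavy filtering.

```matlab
% Sketch: test candidate functions against a block with a height of 0.
emptyBlock = zeros(0,1);

% Summing along the first dimension is well defined for an empty block.
safeFcn = @(x) sum(x, 1);
out = safeFcn(emptyBlock);       % 1-by-1 result equal to 0

% Indexing the first row assumes that the block has at least one row.
fragileFcn = @(x) x(1,:);
try
    fragileFcn(emptyBlock);
    failed = false;
catch
    failed = true;               % indexing an empty block raises an error
end
```

A function like fragileFcn would need an explicit guard (for example, a check on size(x,1)) before it is used with matlab.tall.reduce.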

For example, this function accepts two input arrays, squares them, and returns two output arrays:

function [xx,yy] = sqInputs(x,y)
xx = x.^2;
yy = y.^2;
end

After you save this function to an accessible folder, you can invoke the function to square tX and tY and find the maximum value with this command:

tA = matlab.tall.reduce(@sqInputs, @max, tX, tY)

Example: tC = matlab.tall.reduce(@numel,@sum,tX) finds the number of elements in each block, and then it sums the results to count the total number of elements.

Data Types: function_handle

reducefcn

Reduction function to apply, specified as a function handle or anonymous function. Each output of reducefcn must be the same type as the first input tX. You can use the 'OutputsLike' option to return outputs of different data types. If reducefcn returns more than one output, then the outputs must all have the same height.

The general functional signature of reducefcn is

[rA, rB, rC, ...] = reducefcn(a, b, c, ...)

reducefcn must satisfy these requirements:

  1. Input Arguments — The inputs [a, b, c, ...] are blocks that fit in memory. The blocks of data are either outputs returned by fcn, or a partially reduced output from reducefcn that is being operated on again for further reduction. The inputs [a, b, c, ...] satisfy these properties:

    • The inputs [a, b, c, ...] have the same size in the first dimension.

    • For a given index in the first dimension, every row of the blocks of data [a, b, c, ...] either originates from the input, or originates from the same previous call to reducefcn.

    • For a given index in the first dimension, every row of the inputs [a, b, c, ...] for that index originates from the same index in the first dimension.

  2. Output Arguments — All outputs [rA, rB, rC, ...] must have the same size in the first dimension. Additionally, they must be vertically concatenable with the respective inputs [a, b, c, ...] to allow for repeated reductions when necessary.

  3. Functional Rules — reducefcn must satisfy these functional rules (up to roundoff error):

    • F(input) == F(F(input)): Applying the function repeatedly to the same inputs should not change the result.

    • F([input1; input2]) == F([input2; input1]): The result should not depend on the order of concatenation.

    • F([input1; input2]) == F([F(input1); F(input2)]): Applying the function once to the concatenation of some intermediate results should be the same as applying it separately, concatenating, and applying it again.

  4. Empty Inputs — Ensure that reducefcn can handle an input that has a height of 0. Empty inputs can occur when a file is empty or if you have done a lot of filtering on the data. For this call, all input blocks are empty arrays of the correct type and size in dimensions beyond the first.

Some examples of suitable reduction functions are built-in dimension reduction functions such as sum, prod, max, and so on. These functions can work on intermediate results produced by fcn and return a single scalar. These functions have the properties that the order in which concatenations occur and the number of times the reduction operation is applied do not change the final answer. Some functions, such as mean and var, should generally be avoided as reduction functions because the number of times the reduction operation is applied can change the final answer.
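The difference between sum and mean as reduction functions can be seen with a small in-memory check of the functional rules. This sketch uses two hypothetical blocks of unequal height, in1 and in2, and plain arrays rather than tall arrays.

```matlab
% Sketch: check the reduction rules on two blocks of unequal height.
in1 = [1; 2; 3];
in2 = [10; 20];

% sum gives the same answer whether it is applied once or in stages.
F = @(x) sum(x, 1);
sumOnce   = F([in1; in2]);          % 36
sumStaged = F([F(in1); F(in2)]);    % sum of [6; 30], also 36
sumOrder  = F([in2; in1]);          % 36, independent of block order

% mean depends on how the data is split into blocks.
M = @(x) mean(x, 1);
meanOnce   = M([in1; in2]);         % 36/5 = 7.2
meanStaged = M([M(in1); M(in2)]);   % mean of [2; 15] = 8.5
```

Because meanOnce and meanStaged differ, a mean is usually computed by reducing sums and counts and dividing afterward, as in the sumcount example earlier on this page.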

Example: tC = matlab.tall.reduce(@numel,@sum,tX) finds the number of elements in each block, and then it sums the results to count the total number of elements.

Data Types: function_handle

tX,tY,...

Input arrays, specified as scalars, vectors, matrices, or multidimensional arrays. The input arrays are used as inputs to the transform function fcn. The input arrays tX,tY,... must have compatible heights. Two inputs have compatible heights when they have the same height, or when one input has a height of one.
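As a hedged sketch of the height-one rule, the following code passes a tall vector together with an in-memory scalar. The scalar mu is an illustrative value, and the example assumes that a non-tall input with a height of one is handed unchanged to every call of fcn, as described above.

```matlab
% Sketch: sum of squared deviations from a fixed reference value.
x = [1; 2; 3; 4];
tX = tall(x);
mu = 2.5;                               % height-one input, reused for every block

fcn = @(xb, m) sum((xb - m).^2, 1);     % per-block partial sum
reducefcn = @(s) sum(s, 1);             % add the partial sums

tSS = matlab.tall.reduce(fcn, reducefcn, tX, mu);
ss = gather(tSS);

% In-memory reference value for comparison (equals 5 for this data).
ssRef = sum((x - mu).^2, 1);
```

The tall result ss is expected to agree with ssRef up to roundoff.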

PA,PB,...

Prototype of output arrays, specified as arrays. When you specify 'OutputsLike', the output arrays tA,tB,... returned by matlab.tall.reduce have the same data types and attributes as the specified arrays {PA,PB,...}.

Example: tA = matlab.tall.reduce(fcn,reducefcn,tX,'OutputsLike',{int8(1)});, where tX is a double-precision tall array, returns tA as int8 instead of double.
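A slightly fuller sketch of 'OutputsLike' is a reduction whose result is logical while the input is double. The data and function handles here are illustrative assumptions; the prototype {true} only tells matlab.tall.reduce what type to expect.

```matlab
% Sketch: test whether a double tall vector contains any NaN values.
x = [1; NaN; 3];
tX = tall(x);

fcn = @(xb) any(isnan(xb), 1);   % logical flag per block
reducefcn = @(p) any(p, 1);      % combine the per-block flags

tHasNaN = matlab.tall.reduce(fcn, reducefcn, tX, 'OutputsLike', {true});
hasNaN = gather(tHasNaN);        % expected to be a logical scalar
```

Both functions specify the dimension explicitly, so an empty block yields a logical false rather than an error.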

Output Arguments

tA,tB,...

Output arrays, returned as scalars, vectors, matrices, or multidimensional arrays. If any input to matlab.tall.reduce is tall, then all output arguments are also tall. Otherwise, all output arguments are in-memory arrays.

The size and data type of the output arrays depend on the specified functions fcn and reducefcn. In general, the outputs tA,tB,... must all have the same data type as the first input tX. However, you can specify 'OutputsLike' to return different data types. The output arrays tA,tB,... all have the same height.

More About

Tall Array Blocks

When you create a tall array from a datastore, the underlying datastore facilitates the movement of data during a calculation. The data moves in discrete pieces called blocks or chunks, where each block is a set of consecutive rows that can fit in memory. For example, one block of a 2-D array (such as a table) is X(n:m,:), for some subscripts n and m. The size of each block is based on the value of the ReadSize property of the datastore, but the block might not be exactly that size. For the purposes of matlab.tall.reduce, a tall array is considered to be the vertical concatenation of many such blocks:

Illustration of an array broken into vertical blocks.

For example, if you use the sum function as the transform function, the intermediate result is the sum per block. Therefore, instead of returning a single scalar value for the sum of the elements, the result is a vector with length equal to the number of blocks.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay' 'DepDelay'};
tt = tall(ds);
tX = tt.ArrDelay;

f = @(x) sum(x,'omitnan');
s = matlab.tall.reduce(f, @(x) x, tX);
s = gather(s)
s =

      140467
      101065
      164355
      135920
      111182
      186274
       21321
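The same identity-reduction pattern can be used to inspect how the data is divided into blocks. This is a sketch under assumptions: the ReadSize value of 10000 is a hypothetical choice, and the block heights that come back depend on the datastore and the execution environment, so no particular output is shown.

```matlab
% Sketch: report the height of each block instead of its sum.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay'};
ds.ReadSize = 10000;             % hypothetical value chosen for illustration
tt = tall(ds);
tX = tt.ArrDelay;

% Transform: height of each block. Reduction: identity, as in the example above.
blockHeights = gather(matlab.tall.reduce(@(x) size(x,1), @(x) x, tX));

% The heights should add up to the total number of rows in the vector.
totalRows = sum(blockHeights);
```

For this data set, totalRows should match the element count of 123523 that the first example on this page returns.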

Version History

Introduced in R2018b