mapreduce requires both an input map function
that receives chunks of data and that outputs intermediate results,
and an input reduce function that reads the intermediate results and
produces a final result. Thus, it is normal to break up a calculation
into two related pieces for the map and reduce functions to fulfill
separately. For example, to find the maximum value in a data set,
the map function can find the maximum value in each chunk of input
data, and then the reduce function can find the single maximum value
among all of the intermediate maxima.
This figure shows the Reduce phase of the
The Reduce phase of the
has the following steps:
The result of the Map phase of the
is an intermediate
KeyValueStore object that contains all of the
key-value pairs added by the map function. Before calling the reduce
mapreduce groups the values in the
KeyValueStore object by unique key.
Each unique key in the intermediate
results in a single call to the reduce function.
For each key,
that contains all of the values associated with that key.
The reduce function scrolls through the values from
ValueIterator object using the
which are typically used in a
After performing a summary calculation, the reduce
function adds one or more key-value pairs to the final
The Reduce phase of the
is complete when the reduce function processes all of the unique intermediate
keys and their associated values. The result of this phase of the
(similar to the Map phase) is a
containing all of the final key-value pairs added by the reduce function.
After the Reduce phase,
mapreduce pulls the key-value
pairs from the
KeyValueStore and returns them in
a datastore (a
KeyValueDatastore object by default).
The key-value pairs in the output datastore are not in sorted order;
they appear in the same order as they were added by the reduce function.
mapreduce automatically calls the reduce
function for each unique key in the intermediate
KeyValueStore object, so
the reduce function must meet certain basic requirements to run properly
during these automatic calls. These requirements collectively ensure
the proper movement of data through the Reduce phase of the
The inputs to the reduce function are
intermKey is one of the unique
keys added by the map function. Each call to the reduce function by
a new unique key from the keys in the intermediate
intermValIter is the
associated with the active key,
contains all of the values associated with the active key. Scroll
through the values using the
outKVStore is the name for the
to which the reduce function needs to add key-value pairs. The
use this object name to add key-value pairs to the output.
the output key-value pairs from
returns them in the output
datastore, which is
KeyValueDatastore object by default. If the reduce
function does not add any key-value pairs to
mapreduce returns an empty datastore.
In addition to these basic requirements for the reduce function, the key-value pairs added by the reduce function must also meet these conditions:
Keys must be numeric scalars, character vectors, or strings. Numeric keys cannot be
NaN, logical, complex, or sparse.
All keys added by the reduce function must have the same class, but that class may differ from the class of the keys added by the map function.
OutputType argument of
default), then a value added by the reduce function can be any MATLAB® object,
including all valid MATLAB data types.
OutputType argument of
'TabularText', then a value added by the reduce function can be a
numeric scalar, character vector, or string. In this case, the value cannot be
NaN, complex, logical, or sparse.
The above key-value pair requirements may differ when using
other products with
mapreduce. See the documentation
for the appropriate product to get product-specific key-value pair
These examples contain some reduce functions
used by the
mapreduce examples in the
One of the simplest examples of a reducer is
maxArrivalDelayReducer.m, which is the reducer for the example file
MaxMapReduceExample.m. The map function in this example finds the maximum arrival delay in each chunk of input data. Then the reduce function finishes the task by finding the single maximum value among all of the intermediate maxima. To find the maximum value, the reducer scrolls through the values in the
ValueIterator object and compares each value to the current maximum.
mapreduce only calls this reducer function once, since the mapper adds a single unique key to the intermediate
KeyValueStore object. The reduce function adds a single key-value pair to the output.
function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore) % Reducer function for the MaxMapreduceExample. % Copyright 2014 The MathWorks, Inc. % intermKey is 'PartialMaxArrivalDelay'. intermValIter is an iterator of % all values that has the key 'PartialMaxArrivalDelay'. maxVal = -inf; while hasnext(intermValIter) maxVal = max(getnext(intermValIter), maxVal); end % The key-value pair added to outKVStore will become the output of mapreduce add(outKVStore,'MaxArrivalDelay',maxVal);
A more advanced example of a reducer is
statsByGroupReducer.m, which is the reducer for the example file
StatisticsByGroupMapReduceExample.m. The map function in this example groups the data in each input using an extra parameter (airline carrier, month, and so on), and then calculates several statistical quantities for each group of data. The reduce function finishes the task by retrieving the statistical quantities and concatenating them into long vectors, and then using the vectors to calculate the final statistical quantities for count, mean, variance, skewness, and kurtosis. The reducer stores these values as fields in a structure, so that each unique key has a structure of statistical quantities in the output.
function statsByGroupReducer(intermKey, intermValIter, outKVStore) % Reducer function for the StatisticsByGroupMapReduceExample. % Copyright 2014 The MathWorks, Inc. n = ; m = ; v = ; s = ; k = ; % get all sets of intermediate statistics while hasnext(intermValIter) value = getnext(intermValIter); n = [n; value(1)]; m = [m; value(2)]; v = [v; value(3)]; s = [s; value(4)]; k = [k; value(5)]; end % Note that this approach assumes the concatenated intermediate values fit % in memory. Refer to the reducer function, covarianceReducer, of the % CovarianceMapReduceExample for an alternative pairwise reduction approach % combine the intermediate results count = sum(n); meanVal = sum(n.*m)/count; d = m - meanVal; variance = (sum(n.*v) + sum(n.*d.^2))/count; skewnessVal = (sum(n.*s) + sum(n.*d.*(3*v + d.^2)))./(count*variance^(1.5)); kurtosisVal = (sum(n.*k) + sum(n.*d.*(4*s + 6.*v.*d +d.^3)))./(count*variance^2); outValue = struct('Count',count, 'Mean',meanVal, 'Variance',variance,... 'Skewness',skewnessVal, 'Kurtosis',kurtosisVal); % add results to the output datastore add(outKVStore,intermKey,outValue);
For more information about common programming patterns in map or reduce functions, see Build Effective Algorithms with MapReduce.