Tall Array Support, Usage Notes, and Limitations

This table lists the Statistics and Machine Learning Toolbox™ functions that support tall arrays. The "Notes or Limitations" column is empty for reference pages that fully support tall arrays and in-memory data.

Descriptive Statistics and Visualization

Function | Notes or Limitations
binScatterPlot

This function is specifically designed for visualizing tall arrays. Instead of plotting millions of individual data points, which is often infeasible, binScatterPlot summarizes the data into bins. The resulting "scatter plot of bins" reveals high-level trends in the data.
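
As a quick sketch of the idea (the tall arrays here are built from in-memory random data purely for illustration; real workflows typically start from a datastore):

```matlab
% Bin-based scatter visualization of two correlated tall vectors
x = tall(randn(1e5,1));
y = tall(2*x + randn(1e5,1));
binScatterPlot(x,y)   % counts points per bin instead of drawing each point
```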

confusionchart
confusionmat
corr

Only the 'Pearson' type is supported.

crosstab

The fourth output, labels, is returned as a cell array containing M unevaluated tall cell arrays, where M is the number of input grouping variables. Each unevaluated tall cell array, labels{j}, contains the labels for one grouping variable.
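
A sketch of retrieving the labels, assuming two illustrative grouping variables built from in-memory data:

```matlab
% Cross-tabulate two tall grouping variables
g1 = tall(categorical(randi(3,1000,1),1:3,{'a','b','c'}));
g2 = tall(randi(2,1000,1));
[tbl,chi2,p,labels] = crosstab(g1,g2);
% labels{j} is an unevaluated tall cell array; gather it to bring it into memory
labels1 = gather(labels{1});   % labels for the first grouping variable
```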

dummyvar
geomean
grpstats
• If the input data is a tall array, then all grouping variables must also be tall and have the same number of rows as the data.

• The whichstats option cannot be specified as a function handle. In addition to the current built-in options, whichstats can also be:

• 'Count' — Number of non-NaNs.

• 'NNZ' — Number of nonzeros and non-NaNs.

• 'Kurtosis' — Compute kurtosis.

• 'Skewness' — Compute skewness.

• 'all-stats' — Compute all summary statistics.

• Group order is not guaranteed to be the same as in the in-memory grpstats computation. Specify 'gname' as a whichstats option to return the order of the rows of the summary statistics. For example, [means,grpname] = grpstats(x,bins,{'mean','gname'}) returns the means of the groups in x in the same order that the groups appear in grpname.

• Summary statistics for nonnumeric variables return NaNs.

• grpstats always operates on the first dimension.

• If the input is a tall table, then the output is also a tall table. However, rather than including row names, the output tall table contains an extra variable GroupLabel that contains the same information.
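
A minimal sketch of the 'gname' pattern described above, using illustrative in-memory data wrapped in tall:

```matlab
x = tall(rand(1000,1));
bins = tall(randi(4,1000,1));
% 'gname' returns the group labels in the same order as the statistics rows
[means,grpname] = grpstats(x,bins,{'mean','gname'});
[m,g] = gather(means,grpname);   % evaluate both deferred results in one pass
```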

harmmean
ksdensity
• Some options that require extra passes or sorting of the input data are not supported:

• 'BoundaryCorrection'

• 'Censoring'

• 'Support' (support is always unbounded).

• Uses standard deviation (instead of median absolute deviation) to compute the bandwidth.

kurtosis
prctile
• Y = prctile(X,p) returns the exact percentiles (using a sorting-based algorithm) only if X is a tall column vector.

• Y = prctile(X,p,dim) returns the exact percentiles only when one of these conditions exists:

• X is a tall column vector.

• X is a tall array and dim is not 1. For example, prctile(X,p,2) returns the exact percentiles along the rows of the tall array X.

If X is a tall array and dim is 1, then you must specify 'Method','approximate' to use an approximation algorithm based on T-Digest for computing the percentiles. For example, prctile(X,p,1,'Method','approximate') returns the approximate percentiles along the columns of the tall array X.

• Y = prctile(X,p,vecdim) returns the exact percentiles only when one of these conditions exists:

• X is a tall column vector.

• X is a tall array and vecdim does not include 1. For example, if X is a 3-by-5-by-2 array, then prctile(X,p,[2,3]) returns the exact percentiles of the elements in each X(i,:,:) slice.

• X is a tall array and vecdim includes 1 and all the nonsingleton dimensions of X. For example, if X is a 10-by-1-by-4 array, then prctile(X,p,[1 3]) returns the exact percentiles of the elements in X(:,1,:).

If X is a tall array and vecdim includes 1 but does not include all the nonsingleton dimensions of X, then you must specify 'Method','approximate' to use the approximation algorithm. For example, if X is a 10-by-1-by-4 array, you can use prctile(X,p,[1 2],'Method','approximate') to find the approximate percentiles of each page of X.
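
The exact-versus-approximate distinction can be sketched as follows (illustrative data only):

```matlab
X = tall(rand(1000,3));
p = [25 50 75];
Yexact  = prctile(X,p,2);                         % dim is not 1, so exact percentiles
Yapprox = prctile(X,p,1,'Method','approximate');  % dim 1 requires the T-Digest algorithm
[ye,ya] = gather(Yexact,Yapprox);
```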

quantile
• Y = quantile(X,p) and Y = quantile(X,N) return the exact quantiles (using a sorting-based algorithm) only if X is a tall column vector.

• Y = quantile(__,dim) returns the exact quantiles only when one of these conditions exists:

• X is a tall column vector.

• X is a tall array and dim is not 1. For example, quantile(X,p,2) returns the exact quantiles along the rows of the tall array X.

If X is a tall array and dim is 1, then you must specify 'Method','approximate' to use an approximation algorithm based on T-Digest for computing the quantiles. For example, quantile(X,p,1,'Method','approximate') returns the approximate quantiles along the columns of the tall array X.

• Y = quantile(__,vecdim) returns the exact quantiles only when one of these conditions exists:

• X is a tall column vector.

• X is a tall array and vecdim does not include 1. For example, if X is a 3-by-5-by-2 array, then quantile(X,p,[2,3]) returns the exact quantiles of the elements in each X(i,:,:) slice.

• X is a tall array and vecdim includes 1 and all the nonsingleton dimensions of X. For example, if X is a 10-by-1-by-4 array, then quantile(X,p,[1 3]) returns the exact quantiles of the elements in X(:,1,:).

If X is a tall array and vecdim includes 1 but does not include all the nonsingleton dimensions of X, then you must specify 'Method','approximate' to use the approximation algorithm. For example, if X is a 10-by-1-by-4 array, you can use quantile(X,p,[1 2],'Method','approximate') to find the approximate quantiles of each page of X.

range
skewness
tabulate
zscore

Probability Distributions

Function | Notes or Limitations
datasample
• datasample is useful as a precursor to plotting and fitting a random subset of a large data set. Sampling a large data set preserves trends in the data without requiring the use of all the data points. If the sample is small enough to fit in memory, then you can apply plotting and fitting functions that do not directly support tall arrays.

• datasample supports sampling only along the first dimension of the data.

• For tall arrays, datasample does not support sampling with replacement. You must specify 'Replace',false; for example, datasample(data,k,'Replace',false).

• The value of 'Weights' must be a numeric tall array of the same height as data.

• For the syntax [Y,idx] = datasample(___), the output idx is a tall logical vector of the same height as data. The vector indicates whether each data point is included in the sample.

• If you specify a random number stream, then the underlying generator must support multiple streams and substreams. If you do not specify a random number stream, then datasample uses the stream controlled by tallrng.
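
A minimal sampling-then-plotting sketch along these lines (illustrative data; the sample size is assumed small enough to gather into memory):

```matlab
tx = tall(randn(1e5,2));
tallrng('default')                             % datasample draws from this stream
[s,idx] = datasample(tx,1000,'Replace',false); % replacement is not supported for tall data
sample = gather(s);                            % bring the small sample into memory
plot(sample(:,1),sample(:,2),'.')              % now any in-memory plotting function works
```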

Cluster Analysis

Function | Notes or Limitations
kmeans

• Supported syntaxes are:

• idx = kmeans(X,k)

• [idx,C] = kmeans(X,k)

• [idx,C,sumd] = kmeans(X,k)

• [___] = kmeans(___,Name,Value)

• Supported name-value pair arguments, and any differences, are:

• 'Display' — Default value is 'iter'.

• 'MaxIter'

• 'Options' — Supports only the 'TolFun' field of the structure array created by statset. The default value of 'TolFun' is 1e-4. The kmeans function uses the value of 'TolFun' as the termination tolerance for the within-cluster sums of point-to-centroid distances. For example, you can specify 'Options',statset('TolFun',1e-8).

• 'Replicates'

• 'Start' — Supports only 'plus', 'sample', and a numeric array.
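
A sketch combining several of the supported options (illustrative in-memory data wrapped in tall):

```matlab
X = tall(randn(1e4,2));
opts = statset('TolFun',1e-6);   % only the 'TolFun' field of the structure is honored
[idx,C] = kmeans(X,3,'Options',opts,'Start','plus','Replicates',3);
```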

knnsearch

• If X is a tall array, then Y cannot be a tall array. Similarly, if Y is a tall array, then X cannot be a tall array.

pdist2

• The first input X must be a tall array. Input Y cannot be a tall array.

rangesearch

• If X is a tall array, then Y cannot be a tall array. Similarly, if Y is a tall array, then X cannot be a tall array.

Regression

Function | Notes or Limitations

The loss and predict methods of these regression classes support tall arrays:

• You can use models trained on either in-memory or tall data with these methods.

• The loss method of CompactRegressionTree only supports one output argument.

The resume method of RegressionKernel supports tall arrays.

• You can use models trained on either in-memory or tall data.

• The default value for the 'IterationLimit' name-value pair argument is relaxed to 20 when you work with tall arrays.

• resume uses a block-wise strategy. For details, see the Algorithms section of fitrkernel.

cvpartition
• When you use cvpartition with tall arrays, the first input argument must be a grouping variable, tGroup. If you specify a tall scalar as the first input argument, cvpartition gives an error.

• cvpartition supports only Holdout cross-validation for tall arrays; for example, c = cvpartition(tGroup,'HoldOut',p). By default, cvpartition randomly partitions observations into a training set and a test set with stratification, using the class information in tGroup. The parameter p is a scalar such that 0 < p < 1.

• To create nonstratified Holdout partitions, specify the value of the 'Stratify' name-value pair argument as false; for example, c = cvpartition(tGroup,'Holdout',p,'Stratify',false).
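
The two holdout variants might look like this, with an illustrative tall grouping variable:

```matlab
tGroup = tall(categorical(randi(2,1000,1),1:2,{'yes','no'}));
c   = cvpartition(tGroup,'Holdout',0.2);                    % stratified (default)
cNS = cvpartition(tGroup,'Holdout',0.2,'Stratify',false);   % nonstratified
```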

fitglm
• If any input argument to fitglm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights', 'Exclude', 'Offset', and 'BinomialSize' name-value pairs.

• The default number of iterations is 5. You can change the number of iterations using the 'Options' name-value pair to pass in an options structure. Create an options structure using statset to specify a different value for MaxIter.

• For tall data, fitglm returns a CompactGeneralizedLinearModel object that contains most of the same properties as a GeneralizedLinearModel object. The main difference is that the compact object is more memory efficient. The compact object does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these GeneralizedLinearModel properties:

• Diagnostics

• Fitted

• Offset

• ObservationInfo

• ObservationNames

• Residuals

• Steps

• Variables

You can compute the residuals directly from the compact object returned by GLM = fitglm(X,Y) using

RES = Y - predict(GLM,X);
S = sqrt(GLM.SSE/GLM.DFE);
histogram(RES,linspace(-3*S,3*S,51))

fitlm
• If any input argument to fitlm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights' and 'Exclude' name-value pairs.

• The 'RobustOpts' name-value pair is not supported with tall arrays.

• For tall data, fitlm returns a CompactLinearModel object that contains most of the same properties as a LinearModel object. The main difference is that the compact object is more memory efficient. The compact object does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these LinearModel properties:

• Diagnostics

• Fitted

• ObservationInfo

• ObservationNames

• Residuals

• Steps

• Variables

You can compute the residuals directly from the compact object returned by LM = fitlm(X,Y) using

RES = Y - predict(LM,X);
S = LM.RMSE;
histogram(RES,linspace(-3*S,3*S,51))

• If the CompactLinearModel object is missing lower order terms that include categorical factors:

• The plotEffects and plotInteraction methods are not supported.

• The anova method with the 'components' option is not supported.

fitrkernel
• Some name-value pair arguments have different defaults compared to the default values for the in-memory fitrkernel function. Supported name-value pair arguments, and any differences, are:

• 'BoxConstraint'

• 'Epsilon'

• 'NumExpansionDimensions'

• 'KernelScale'

• 'Lambda'

• 'Learner'

• 'Verbose' — Default value is 1.

• 'BlockSize'

• 'RandomStream'

• 'ResponseTransform'

• 'Weights' — Value must be a tall array.

• 'BetaTolerance' — Default value is relaxed to 1e–3.

• 'GradientTolerance' — Default value is relaxed to 1e–5.

• 'HessianHistorySize'

• 'IterationLimit' — Default value is relaxed to 20.

• 'OptimizeHyperparameters'

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitrkernel(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• If 'KernelScale' is 'auto', then fitrkernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.

• If 'Lambda' is 'auto', then fitrkernel might take an extra pass through the data to calculate the number of observations in X.

• fitrkernel uses a block-wise strategy. For details, see Algorithms.
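
A reproducibility sketch for the 'KernelScale','auto' case (illustrative data; the seed values are arbitrary):

```matlab
rng(1)              % seed the global stream
tallrng('default')  % seed the stream used for tall-array subsampling
X = tall(randn(1e4,5));
Y = tall(randn(1e4,1));
Mdl = fitrkernel(X,Y,'KernelScale','auto');   % subsampling is now reproducible
```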

fitrlinear
• Some name-value pair arguments have different defaults and values compared to the in-memory fitrlinear function. Supported name-value pair arguments, and any differences, are:

• 'Epsilon'

• 'ObservationsIn' — Supports only 'rows'.

• 'Lambda' — Can be 'auto' (default) or a scalar.

• 'Learner'

• 'Regularization' — Supports only 'ridge'.

• 'Solver' — Supports only 'lbfgs'.

• 'Verbose' — Default value is 1.

• 'Beta'

• 'Bias'

• 'FitBias' — Supports only true.

• 'Weights' — Value must be a tall array.

• 'HessianHistorySize'

• 'BetaTolerance' — Default value is relaxed to 1e-3.

• 'GradientTolerance' — Default value is relaxed to 1e-3.

• 'IterationLimit' — Default value is relaxed to 20.

• 'OptimizeHyperparameters' — Value of 'Regularization' parameter must be 'ridge'.

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitrlinear(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• For tall arrays, fitrlinear implements LBFGS by distributing the calculation of the loss and the gradient among different parts of the tall array at each iteration. Other solvers are not available for tall arrays.

When initial values for Beta and Bias are not given, fitrlinear first refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.

lasso
• With tall arrays, lasso uses an algorithm based on ADMM (Alternating Direction Method of Multipliers).

• No elastic net support. The 'Alpha' parameter is always 1.

• No cross-validation ('CV' parameter) support, which includes the related parameter 'MCReps'.

• The output FitInfo does not contain the additional fields 'SE', 'LambdaMinMSE', 'Lambda1SE', 'IndexMinMSE', and 'Index1SE'.

• The 'Options' parameter is not supported because it does not contain options that apply to the ADMM algorithm. You can tune the ADMM algorithm using name-value pair arguments.

• Supported name-value pair arguments are:

• 'Lambda'

• 'LambdaRatio'

• 'NumLambda'

• 'Standardize'

• 'PredictorNames'

• 'RelTol'

• 'Weights'

• Additional name-value pair arguments to control the ADMM algorithm are:

• 'Rho' — Augmented Lagrangian parameter, ρ. The default value is automatic selection.

• 'AbsTol' — Absolute tolerance used to determine convergence. The default value is 1e–4.

• 'MaxIter' — Maximum number of iterations. The default value is 1e4.

• 'B0' — Initial values for the coefficients x. The default value is a vector of zeros.

• 'U0' — Initial values of the scaled dual variable u. The default value is a vector of zeros.
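
Tuning the ADMM algorithm through these name-value pairs might look like the following sketch (illustrative data and parameter values):

```matlab
X = tall(randn(1e4,10));
y = tall(randn(1e4,1));
% Explicit Rho and a tighter AbsTol in place of the unsupported 'Options' structure
[B,FitInfo] = lasso(X,y,'Lambda',1e-3,'Rho',1,'AbsTol',1e-5,'MaxIter',1e4);
```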

Classification

Function | Notes or Limitations

The predict, loss, margin, and edge methods of these classification classes support tall arrays:

• You can use models trained on either in-memory or tall data with these methods.

• The loss method of CompactClassificationTree only supports one output argument.

The resume method of ClassificationKernel supports tall arrays.

• You can use models trained on either in-memory or tall data.

• The default value for the 'IterationLimit' name-value pair argument is relaxed to 20 when working with tall arrays.

• resume uses a block-wise strategy. For details, see Algorithms of fitckernel.

fitcdiscr
• Supported syntaxes are:

• Mdl = fitcdiscr(Tbl,Y)

• Mdl = fitcdiscr(X,Y)

• Mdl = fitcdiscr(___,Name,Value)

• [Mdl,FitInfo,HyperparameterOptimizationResults] = fitcdiscr(___,Name,Value)

fitcdiscr returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.

• The FitInfo output argument is an empty structure array currently reserved for possible future use.

• The HyperparameterOptimizationResults output argument is a BayesianOptimization object or a table of hyperparameters with associated values that describe the cross-validation optimization of hyperparameters.

'HyperparameterOptimizationResults' is nonempty when the 'OptimizeHyperparameters' name-value pair argument is nonempty at the time you create the model. The values in 'HyperparameterOptimizationResults' depend on the value you specify for the 'HyperparameterOptimizationOptions' name-value pair argument when you create the model.

• If you specify 'bayesopt' (default), then HyperparameterOptimizationResults is an object of class BayesianOptimization.

• If you specify 'gridsearch' or 'randomsearch', then HyperparameterOptimizationResults is a table of the hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst).

• Supported name-value pair arguments, and any differences, are:

• 'ClassNames'

• 'Cost'

• 'DiscrimType'

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitcdiscr(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• 'OptimizeHyperparameters' — The only eligible parameter to optimize is 'DiscrimType'. Specifying 'auto' uses 'DiscrimType'.

• 'PredictorNames'

• 'Prior'

• 'ResponseName'

• 'ScoreTransform'

• 'Weights'

• For tall arrays and tall tables, fitcdiscr returns a CompactClassificationDiscriminant object, which contains most of the same properties as a ClassificationDiscriminant object. The main difference is that the compact object is more memory efficient. The compact object does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these ClassificationDiscriminant properties:

• ModelParameters

• NumObservations

• HyperparameterOptimizationResults

• RowsUsed

• XCentered

• W

• X

• Y

Additionally, the compact object does not support these ClassificationDiscriminant methods:

• compact

• crossval

• cvshrink

• resubEdge

• resubLoss

• resubMargin

• resubPredict

fitcecoc
• Supported syntaxes are:

• Mdl = fitcecoc(X,Y)

• Mdl = fitcecoc(X,Y,Name,Value)

• [Mdl,FitInfo,HyperparameterOptimizationResults] = fitcecoc(X,Y,Name,Value)

fitcecoc returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.

• The FitInfo output argument is an empty structure array currently reserved for possible future use.

• Options related to cross-validation are not supported. The supported name-value pair arguments are:

• 'ClassNames'

• 'Cost'

• 'Coding' — Default value is 'onevsall'.

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitcecoc(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• 'Learners' — Default value is 'linear'. You can specify 'linear', 'kernel', a templateLinear or templateKernel object, or a cell array of such objects.

• 'OptimizeHyperparameters' — When you use linear binary learners, the value of the 'Regularization' hyperparameter must be 'ridge'.

• 'Prior'

• 'Verbose' — Default value is 1.

• 'Weights'

• This additional name-value pair argument is specific to tall arrays:

• 'NumConcurrent' — A positive integer scalar specifying the number of binary learners that are trained concurrently by combining file I/O operations. The default value for 'NumConcurrent' is 1, which means fitcecoc trains the binary learners sequentially. 'NumConcurrent' is most beneficial when the input arrays cannot fit into the distributed cluster memory. Otherwise, the input arrays can be cached and speedup is negligible.

If you run your code on Apache Spark™, NumConcurrent is upper bounded by the memory available for communications. Check the 'spark.executor.memory' and 'spark.driver.memory' properties in your Apache Spark configuration. See parallel.cluster.Hadoop for more details. For more information on Apache Spark and other execution environments that control where your code runs, see Extend Tall Arrays with Other Products (MATLAB).
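
A sketch of training binary learners concurrently (illustrative data; whether two concurrent learners help depends on your cluster memory):

```matlab
X = tall(randn(1e4,5));
Y = tall(randi(3,1e4,1));   % three classes
% Combine file I/O across two binary learners instead of training sequentially
Mdl = fitcecoc(X,Y,'Learners','linear','NumConcurrent',2);
```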

fitckernel
• Some name-value pair arguments have different defaults compared to the default values for the in-memory fitckernel function. Supported name-value pair arguments, and any differences, are:

• 'Learner'

• 'NumExpansionDimensions'

• 'KernelScale'

• 'BoxConstraint'

• 'Lambda'

• 'BetaTolerance' — Default value is relaxed to 1e–3.

• 'GradientTolerance' — Default value is relaxed to 1e–5.

• 'IterationLimit' — Default value is relaxed to 20.

• 'BlockSize'

• 'RandomStream'

• 'HessianHistorySize'

• 'Verbose' — Default value is 1.

• 'ClassNames'

• 'Cost'

• 'Prior'

• 'ScoreTransform'

• 'Weights' — Value must be a tall array.

• 'OptimizeHyperparameters'

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitckernel(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• If 'KernelScale' is 'auto', then fitckernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.

• If 'Lambda' is 'auto', then fitckernel might take an extra pass through the data to calculate the number of observations in X.

• fitckernel uses a block-wise strategy. For details, see Algorithms.

templateKernel

• The default values for these name-value pair arguments are different when you work with tall arrays.

• 'Verbose' — Default value is 1.

• 'BetaTolerance' — Default value is relaxed to 1e–3.

• 'GradientTolerance' — Default value is relaxed to 1e–5.

• 'IterationLimit' — Default value is relaxed to 20.

• If 'KernelScale' is 'auto', then templateKernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.

• If 'Lambda' is 'auto', then templateKernel might take an extra pass through the data to calculate the number of observations.

• templateKernel uses a block-wise strategy. For details, see Algorithms.

fitclinear
• Some name-value pair arguments have different defaults compared to the default values for the in-memory fitclinear function. Supported name-value pair arguments, and any differences, are:

• 'ObservationsIn' — Supports only 'rows'.

• 'Lambda' — Can be 'auto' (default) or a scalar.

• 'Learner'

• 'Regularization' — Supports only 'ridge'.

• 'Solver' — Supports only 'lbfgs'.

• 'FitBias' — Supports only true.

• 'Verbose' — Default value is 1.

• 'Beta'

• 'Bias'

• 'ClassNames'

• 'Cost'

• 'Prior'

• 'Weights' — Value must be a tall array.

• 'HessianHistorySize'

• 'BetaTolerance' — Default value is relaxed to 1e–3.

• 'GradientTolerance' — Default value is relaxed to 1e–3.

• 'IterationLimit' — Default value is relaxed to 20.

• 'OptimizeHyperparameters' — Value of 'Regularization' parameter must be 'ridge'.

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitclinear(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• For tall arrays, fitclinear implements LBFGS by distributing the calculation of the loss and gradient among different parts of the tall array at each iteration. Other solvers are not available for tall arrays.

When initial values for Beta and Bias are not given, fitclinear refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.

templateLinear

• The default values for these name-value pair arguments are different when you work with tall arrays.

• 'Lambda' — Can be 'auto' (default) or a scalar

• 'Regularization' — Supports only 'ridge'

• 'Solver' — Supports only 'lbfgs'

• 'FitBias' — Supports only true

• 'Verbose' — Default value is 1

• 'BetaTolerance' — Default value is relaxed to 1e–3

• 'GradientTolerance' — Default value is relaxed to 1e–3

• 'IterationLimit' — Default value is relaxed to 20

• When fitcecoc uses a templateLinear object with tall arrays, the only available solver is LBFGS. The software implements LBFGS by distributing the calculation of the loss and gradient among different parts of the tall array at each iteration. If you do not specify initial values for Beta and Bias, the software refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.

fitcnb
• Supported syntaxes are:

• Mdl = fitcnb(Tbl,Y)

• Mdl = fitcnb(X,Y)

• Mdl = fitcnb(___,Name,Value)

• Options related to kernel densities, cross-validation, and hyperparameter optimization are not supported. The supported name-value pair arguments are:

• 'DistributionNames' — The 'kernel' value is not supported.

• 'CategoricalPredictors'

• 'Cost'

• 'PredictorNames'

• 'Prior'

• 'ResponseName'

• 'ScoreTransform'

• 'Weights' — Value must be a tall array.

fitctree
• Supported syntaxes are:

• tree = fitctree(Tbl,Y)

• tree = fitctree(X,Y)

• tree = fitctree(___,Name,Value)

• [tree,FitInfo,HyperparameterOptimizationResults] = fitctree(___,Name,Value)

fitctree returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.

• The FitInfo output argument is an empty structure array currently reserved for possible future use.

• The HyperparameterOptimizationResults output argument is a BayesianOptimization object or a table of hyperparameters with associated values that describe the cross-validation optimization of hyperparameters.

'HyperparameterOptimizationResults' is nonempty when the 'OptimizeHyperparameters' name-value pair argument is nonempty at the time you create the model. The values in 'HyperparameterOptimizationResults' depend on the value you specify for the 'HyperparameterOptimizationOptions' name-value pair argument when you create the model.

• If you specify 'bayesopt' (default), then HyperparameterOptimizationResults is an object of class BayesianOptimization.

• If you specify 'gridsearch' or 'randomsearch', then HyperparameterOptimizationResults is a table of the hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst).

• Supported name-value pair arguments, and any differences, are:

• 'AlgorithmForCategorical'

• 'CategoricalPredictors'

• 'ClassNames'

• 'Cost'

• 'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify fitctree(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).

• 'MaxNumCategories'

• 'MaxNumSplits' — For tall optimization, fitctree searches among integers, by default log-scaled in the range [1,max(2,min(10000,NumObservations-1))].

• 'MergeLeaves'

• 'MinLeafSize'

• 'MinParentSize'

• 'NumVariablesToSample'

• 'OptimizeHyperparameters'

• 'PredictorNames'

• 'Prior'

• 'ResponseName'

• 'ScoreTransform'

• 'SplitCriterion'

• 'Weights'

• This additional name-value pair argument is specific to tall arrays:

• 'MaxDepth' — A positive integer specifying the maximum depth of the output tree. Specify a value for this argument to return a tree that has fewer levels and requires fewer passes through the tall array to compute. Generally, the algorithm of fitctree takes one pass through the data and an additional pass for each tree level. The function does not set a maximum tree depth, by default.
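
Capping the depth to bound the number of data passes can be sketched as follows (illustrative data; the depth value is arbitrary):

```matlab
X = tall(randn(1e4,4));
Y = tall(randi(2,1e4,1));
% Roughly one pass through the data per tree level, so MaxDepth bounds total passes
tree = fitctree(X,Y,'MaxDepth',5);
```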

TreeBagger
• Supported syntaxes for tall X, Y, Tbl are:

• B = TreeBagger(NumTrees,Tbl,Y)

• B = TreeBagger(NumTrees,X,Y)

• B = TreeBagger(___,Name,Value)

• For tall arrays, TreeBagger supports classification but does not support regression.

• Supported name-value pairs are:

• 'NumPredictorsToSample' — Default value is the square root of the number of variables for classification.

• 'MinLeafSize' — Default value is 1 if the number of observations is less than 50,000. If the number of observations is 50,000 or greater, then the default value is max(1,min(5,floor(0.01*NobsChunk))), where NobsChunk is the number of observations in a chunk.

• 'ChunkSize' (only for tall arrays) — Default value is 50000.

In addition, TreeBagger supports these optional arguments of fitctree:

• 'AlgorithmForCategorical'

• 'CategoricalPredictors'

• 'Cost' — The columns of the cost matrix C cannot contain Inf or NaN values.

• 'MaxNumCategories'

• 'MaxNumSplits'

• 'MergeLeaves'

• 'PredictorNames'

• 'PredictorSelection'

• 'Prior'

• 'Prune'

• 'PruneCriterion'

• 'SplitCriterion'

• 'Surrogate'

• 'Weights'

• For tall data, TreeBagger returns a CompactTreeBagger object that contains most of the same properties as a full TreeBagger object. The main difference is that the compact object is more memory efficient. The compact object does not include properties that include the data, or that include an array of the same size as the data.

• The number of trees contained in the returned CompactTreeBagger object can differ from the number of trees specified as input to the TreeBagger function. TreeBagger determines the number of trees to return based on factors that include the size of the input data set and the number of data chunks available to grow trees.

• Supported CompactTreeBagger methods are:

• combine

• error

• margin

• meanMargin

• predict

• setDefaultYfit

The error, margin, meanMargin, and predict methods do not support the name-value pair arguments 'Trees', 'TreeWeights', or 'UseInstanceForTree'. The meanMargin method additionally does not support 'Weights'.

• TreeBagger creates a random forest by generating trees on disjoint chunks of the data. When more data is available than is required to create the random forest, the data is subsampled. For a similar example, see Random Forests for Big Data (Genuer, Poggi, Tuleau-Malot, Villa-Vialaneix 2015).

Depending on how the data is stored, it is possible that some chunks of data contain observations from only a few classes out of all the classes. In this case, TreeBagger might produce inferior results compared to the case where each chunk of data contains observations from most of the classes.

• During training of the TreeBagger algorithm, the speed, accuracy, and memory usage depend on a number of factors. These factors include values for NumTrees, 'ChunkSize', 'MinLeafSize', and 'MaxNumSplits'.

For an n-by-p tall array X, TreeBagger implements sampling during training. This sampling depends on these variables:

• Number of trees NumTrees

• Chunk size 'ChunkSize'

• Number of observations n

• Number of chunks r (approximately equal to n/'ChunkSize')

Because the value of n is fixed for a given X, your settings for NumTrees and 'ChunkSize' determine how TreeBagger samples X.

1. If r > NumTrees, then TreeBagger samples 'ChunkSize' * NumTrees observations from X, and trains one tree per chunk (with each chunk containing 'ChunkSize' number of observations). This scenario is the most common when you work with tall arrays.

2. If r ≤ NumTrees, then TreeBagger trains approximately NumTrees/r trees in each chunk, using bootstrapping within the chunk.

3. If n ≤ 'ChunkSize', then TreeBagger uses bootstrapping to generate samples (each of size n) on which to train individual trees.

• When specifying a value for NumTrees, consider the following:

• If you run your code on Apache Spark, and your data set is distributed with Hadoop® Distributed File System (HDFS™), start by specifying a value for NumTrees that is at least twice the number of partitions in HDFS for your data set. This setting prevents excessive data communication among Apache Spark executors and can improve performance of the TreeBagger algorithm.

• TreeBagger copies fitted trees into the client memory in the resulting CompactTreeBagger model. Therefore, the amount of memory available to the client creates an upper bound on the value you can set for NumTrees. You can tune the values of 'MinLeafSize' and 'MaxNumSplits' for more efficient speed and memory usage at the expense of some predictive accuracy. After tuning, if the value of NumTrees is less than twice the number of partitions in HDFS for your data set, then consider repartitioning your data in HDFS to have larger partitions.

After specifying a value for NumTrees, set 'ChunkSize' to ensure that TreeBagger uses most of the data to grow trees. Ideally, 'ChunkSize' * NumTrees should approximate n, the number of rows in your data. Note that the memory available in the workers for training individual trees can also determine an upper bound for 'ChunkSize'.

You can adjust the Apache Spark memory properties to avoid out-of-memory errors and support your workflow. See parallel.cluster.Hadoop for more information.
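
A sizing sketch following the guidance above (illustrative numbers; on Apache Spark, first tune NumTrees to your HDFS partitioning):

```matlab
n = 2e5;                           % rows in the tall data
numTrees = 20;
chunkSize = ceil(n/numTrees);      % aim for ChunkSize * NumTrees ≈ n
X = tall(randn(n,4));
Y = tall(randi(2,n,1));            % classification labels (regression is unsupported)
B = TreeBagger(numTrees,X,Y,'ChunkSize',chunkSize);
```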

Dimensionality Reduction

Function | Notes or Limitations
pca
• pca works directly with tall arrays by computing the covariance matrix and using the in-memory pcacov function to compute the principal components.

• Supported syntaxes are:

• coeff = pca(X)

• [coeff,score,latent] = pca(X)

• [coeff,score,latent,explained] = pca(X)

• [coeff,score,latent,tsquared] = pca(X)

• [coeff,score,latent,tsquared,explained] = pca(X)

• Name-value pair arguments are not supported.

pcacov, factoran

pcacov and factoran do not work directly on tall arrays. Instead, use C = gather(cov(X)) to compute the covariance matrix of a tall array. Then, you can use pcacov or factoran on the in-memory covariance matrix. Alternatively, you can use pca directly on a tall array.
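
The gather-then-factor workaround can be sketched as follows (illustrative data):

```matlab
X = tall(randn(1e4,5));
C = gather(cov(X));            % compute the covariance out of memory, then gather it
[coeff,latent] = pcacov(C);    % apply the in-memory function to the small matrix
```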