
Framework for Ensemble Learning

Using various methods, you can meld results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with minor changes in your commands.

You can create an ensemble for classification by using fitcensemble or for regression by using fitrensemble.

To train an ensemble for classification using fitcensemble, use this syntax.

ens = fitcensemble(X,Y,Name,Value)
  • X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

  • Y is the vector of responses, with the same number of observations as the rows in X.

  • Name,Value specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the 'Method' argument, the number of ensemble learning cycles with the 'NumLearningCycles' argument, and the type of weak learners with the 'Learners' argument. For a complete list of name-value pair arguments, see the fitcensemble function page.

This figure shows the information you need to create a classification ensemble.

[Figure: Required inputs to fitcensemble to create a classification ensemble]

Similarly, you can train an ensemble for regression by using fitrensemble, which follows the same syntax as fitcensemble. For details on the input arguments and name-value pair arguments, see the fitrensemble function page.
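As a minimal sketch of the syntax above, the following trains a classification ensemble on the fisheriris sample data set that ships with Statistics and Machine Learning Toolbox. The data set, the method, and the number of learning cycles are illustrative assumptions, not recommendations.

```matlab
% Load sample data: meas is a 150-by-4 numeric matrix of predictors,
% species is a cell array of character vectors holding the class labels.
load fisheriris
X = meas;
Y = species;

% Train a boosted ensemble of classification trees (three classes,
% so AdaBoostM2 is one applicable method).
ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
    'NumLearningCycles',100,'Learners','Tree');

% Predict the class of the first observation.
label = predict(ens,X(1,:));
```

The same call pattern applies to fitrensemble when Y is a numeric response.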

For all classification or nonlinear regression problems, follow these steps to create an ensemble:

Prepare the Predictor Data

All supervised learning methods start with predictor data, usually called X in this documentation. X can be stored in a matrix or a table. Each row of X represents one observation, and each column of X represents one variable or predictor.
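The two storage options can be sketched as follows, again using the fisheriris sample data. The variable names given to the table columns are hypothetical labels chosen for this example.

```matlab
load fisheriris

% Option 1: numeric matrix, one row per observation, one column per predictor.
X = meas;

% Option 2: table with named predictor variables plus the response.
% The variable names below are illustrative, not part of the data set.
Tbl = array2table(meas, ...
    'VariableNames',{'SepalLength','SepalWidth','PetalLength','PetalWidth'});
Tbl.Species = species;

% Either way, rows are observations and columns are variables.
nObs  = size(X,1);   % number of observations
nPred = size(X,2);   % number of predictors
```

With a table, you can pass the response by name, for example fitcensemble(Tbl,'Species').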

Prepare the Response Data

You can use a wide variety of data types for the response data.

  • For regression ensembles, Y must be a numeric vector with the same number of elements as the number of rows of X.

  • For classification ensembles, Y can be a numeric vector, categorical vector, character array, string array, cell array of character vectors, or logical vector.

    For example, suppose your response data consists of three observations in the following order: true, false, true. You could express Y as:

    • [1;0;1] (numeric vector)

    • categorical({'true','false','true'}) (categorical vector)

    • [true;false;true] (logical vector)

    • ['true ';'false';'true '] (character array, padded with spaces so each row has the same length)

    • ["true","false","true"] (string array)

    • {'true','false','true'} (cell array of character vectors)

    Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in Y.
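The following sketch builds the same three responses in each supported data type, starting from a cell array of character vectors. It uses only core MATLAB conversion functions; the variable names are illustrative.

```matlab
% Three observations: true, false, true
Ycell = {'true';'false';'true'};      % cell array of character vectors

Ycat     = categorical(Ycell);        % categorical vector
Ystr     = string(Ycell);             % string array
Ychar    = char(Ycell);               % character array, rows padded with spaces
Ylogical = strcmp(Ycell,'true');      % logical vector
Ynum     = double(Ylogical);          % numeric vector [1;0;1]
```

Each variable represents three observations. For the character array, the number of observations is the number of rows, not the number of characters.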

fitcensemble and fitrensemble ignore observations with missing values in Y when creating an ensemble. This table shows how to represent a missing entry for each data type.

Data Type                          Missing Entry
Numeric vector                     NaN
Categorical vector                 <undefined>
Character array                    Row of spaces
String array                       <missing> or ""
Cell array of character vectors    ''
Logical vector                     (not possible to represent)
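As a rough illustration of the table, this sketch marks the second of three responses as missing in several data types and checks the result with the general-purpose MATLAB function ismissing. Using ismissing here is an assumption made for demonstration; the source does not say how fitcensemble detects missing responses internally.

```matlab
Ynum  = [1; NaN; 1];                          % numeric: NaN
Ycat  = categorical({'true'; ''; 'true'});    % categorical: '' becomes <undefined>
Ystr  = ["true"; missing; "true"];            % string: <missing>
Ycell = {'true'; ''; 'true'};                 % cell array of character vectors: ''

% Logical index of the missing entry in each representation
mNum  = ismissing(Ynum);
mCat  = ismissing(Ycat);
mStr  = ismissing(Ystr);
mCell = ismissing(Ycell);
```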

Choose an Applicable Ensemble Aggregation Method

To create classification and regression ensembles with fitcensemble and fitrensemble, respectively, choose appropriate algorithms from this list.

  • For classification with two classes:

    • 'AdaBoostM1' — Adaptive boosting

    • 'LogitBoost' — Adaptive logistic regression

    • 'GentleBoost' — Gentle adaptive boosting

    • 'RobustBoost' — Robust boosting (requires Optimization Toolbox™)

    • 'LPBoost' — Linear programming boosting (requires Optimization Toolbox)

    • 'TotalBoost' — Totally corrective boosting (requires Optimization Toolbox)

    • 'RUSBoost' — Random undersampling boosting

    • 'Subspace' — Random subspace

    • 'Bag' — Bootstrap aggregation (bagging)

  • For classification with three or more classes:

    • 'AdaBoostM2' — Adaptive boosting

    • 'LPBoost' — Linear programming boosting (requires Optimization Toolbox)

    • 'TotalBoost' — Totally corrective boosting (requires Optimization Toolbox)

    • 'RUSBoost' — Random undersampling boosting

    • 'Subspace' — Random subspace

    • 'Bag' — Bootstrap aggregation (bagging)

  • For regression:

    • 'LSBoost' — Least-squares boosting

    • 'Bag' — Bootstrap aggregation (bagging)

For descriptions of the various algorithms, see Ensemble Algorithms.
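One way to apply the list above is to pick the method from the number of classes in Y. The sketch below does this for adaptive boosting only; the choice of AdaBoostM1 versus AdaBoostM2 as the starting point and the fisheriris data are illustrative assumptions.

```matlab
load fisheriris
X = meas;
Y = species;

% Count the distinct class labels in the response.
numClasses = numel(unique(Y));

% AdaBoostM1 applies to two classes, AdaBoostM2 to three or more.
if numClasses == 2
    method = 'AdaBoostM1';
else
    method = 'AdaBoostM2';
end

ens = fitcensemble(X,Y,'Method',method,'NumLearningCycles',50);
```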

See Suggestions for Choosing an Appropriate Ensemble Algorithm.

This table lists characteristics of the various algorithms. In the table column headings:

  • Class Imbalance — Good for imbalanced data (one class has many more observations than the other)

  • Stop — Algorithm self-terminates

  • Sparse — Requires fewer weak learners than other ensemble algorithms

Algorithm    Regression  Binary Classification  Multiclass Classification  Class Imbalance  Stop  Sparse
Bag          ×           ×                      ×
AdaBoostM1               ×
AdaBoostM2                                      ×
LogitBoost               ×
GentleBoost              ×
RobustBoost              ×
LPBoost                  ×                      ×                                           ×     ×
TotalBoost               ×                      ×                                           ×     ×
RUSBoost                 ×                      ×                          ×
LSBoost      ×
Subspace                 ×                      ×

RobustBoost, LPBoost, and TotalBoost require an Optimization Toolbox license. Try TotalBoost before LPBoost, as TotalBoost can be more robust.

Suggestions for Choosing an Appropriate Ensemble Algorithm

  • Regression — Your choices are LSBoost or Bag. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.

  • Binary Classification — Try AdaBoostM1 first, with these modifications:

    Data Characteristic                                     Recommended Algorithm
    Many predictors                                         Subspace
    Skewed data (many more observations of one class)       RUSBoost
    Label noise (some training data has the wrong class)    RobustBoost
    Many observations                                       Avoid LPBoost and TotalBoost

  • Multiclass Classification — Try AdaBoostM2 first, with these modifications:

    Data Characteristic                                     Recommended Algorithm
    Many predictors                                         Subspace
    Skewed data (many more observations of one class)       RUSBoost
    Many observations                                       Avoid LPBoost and TotalBoost

For details of the algorithms, see Ensemble Algorithms.
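As an illustration of the skewed-data suggestion, this sketch trains a RUSBoost ensemble on synthetic two-class data in which one class has 950 observations and the other has 50. The data, the tree settings, and the number of learning cycles are all assumptions made for the example.

```matlab
rng(1)  % for reproducibility

% Synthetic imbalanced data: 950 majority-class and 50 minority-class points.
Xmajor = randn(950,2);
Xminor = randn(50,2) + 2;
X = [Xmajor; Xminor];
Y = [zeros(950,1); ones(50,1)];

% Shallow trees as weak learners (illustrative setting).
t = templateTree('MaxNumSplits',5);
ens = fitcensemble(X,Y,'Method','RUSBoost', ...
    'NumLearningCycles',100,'Learners',t);

% Confusion matrix on the training data: rows are true classes,
% columns are predicted classes.
C = confusionmat(Y,predict(ens,X));
```

Inspecting the second row of C shows how many minority-class observations the ensemble recovers. Training-set results are optimistic, so treat them only as a quick check.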

General Characteristics of Ensemble Algorithms

  • Boost algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore it is not always clear which class of algorithms is superior.

  • Bag generally constructs deep trees. This construction is both time-consuming and memory-intensive, and it also leads to relatively slow predictions.

  • Bag can estimate the generalization error without additional cross-validation. See oobLoss.

  • Except for Subspace, all boosting and bagging algorithms are based on decision tree learners. Subspace can use either discriminant analysis or k-nearest neighbor learners.

For details of the characteristics of individual ensemble members, see Characteristics of Classification Algorithms.
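The out-of-bag estimate mentioned above can be sketched as follows. The data set and the number of learning cycles are illustrative assumptions.

```matlab
load fisheriris
rng(1)  % bagging draws bootstrap samples, so fix the seed

% Train a bagged ensemble of classification trees.
ens = fitcensemble(meas,species,'Method','Bag','NumLearningCycles',100);

% Estimate the generalization error from out-of-bag observations,
% without a separate cross-validation run.
oobErr = oobLoss(ens);
```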

Set the Number of Ensemble Members

Choosing the size of an ensemble involves balancing speed and accuracy.

  • Larger ensembles take longer to train and to generate predictions.

  • Some ensemble algorithms can become overtrained (inaccurate) when too large.

To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it appears that you need more members, add them using the resume method of the classification or regression ensemble. Repeat until adding more members does not improve ensemble quality.
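A sketch of this train, check, and resume loop, assuming the ionosphere sample data set and five-fold cross-validation as the quality check (both are illustrative choices):

```matlab
load ionosphere   % X: 351-by-34 predictors, Y: cell array of 'b'/'g' labels
rng(1)

% Start with a modest number of members.
ens = fitcensemble(X,Y,'Method','AdaBoostM1','NumLearningCycles',50);

% Cross-validated loss as a function of the number of members.
cvens  = crossval(ens,'KFold',5);
cvLoss = kfoldLoss(cvens,'Mode','cumulative');

% If cvLoss is still decreasing at the end of the curve,
% add 50 more members and check again.
ens = resume(ens,50);
```

Plotting cvLoss against the member index is a simple way to see whether the curve has flattened.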

Tip

For classification, the LPBoost and TotalBoost algorithms are self-terminating, meaning you do not have to investigate the appropriate ensemble size. Try setting NumLearningCycles to 500. The algorithms usually terminate with fewer members.

Prepare the Weak Learners

Currently the weak learner types are:

  • 'Discriminant' (recommended for Subspace ensemble)

  • 'KNN' (only for Subspace ensemble)

  • 'Tree' (for any ensemble except Subspace)

There are two ways to set the weak learner type in an ensemble.

  • To create an ensemble with default weak learner options, specify the value of the 'Learners' name-value pair argument as the character vector or string scalar of the weak learner name. For example:

    ens = fitcensemble(X,Y,'Method','Subspace', ...
       'NumLearningCycles',50,'Learners','KNN');
    % or
    ens = fitrensemble(X,Y,'Method','Bag', ...
       'NumLearningCycles',50,'Learners','Tree');
  • To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method.

    For example, if you have missing data, and want to use classification trees with surrogate splits for better accuracy:

    templ = templateTree('Surrogate','all');
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
       'NumLearningCycles',50,'Learners',templ);

    To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

    templ = templateTree('MinLeafSize',ceil(size(X,1)/10));
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
       'NumLearningCycles',50,'Learners',templ);

    Alternatively, choose the maximal number of splits per tree:

    templ = templateTree('MaxNumSplits',4);
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
       'NumLearningCycles',50,'Learners',templ);

    You can also use nondefault weak learners in fitrensemble.

While you can give fitcensemble and fitrensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
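For regression, the template approach looks the same. This sketch assumes the carsmall sample data set and an illustrative minimum leaf size.

```matlab
load carsmall
X = [Weight Cylinders Horsepower Displacement];
Y = MPG;

% Nondefault regression tree learner: at least 10 observations per leaf.
t = templateTree('MinLeafSize',10);

ens = fitrensemble(X,Y,'Method','LSBoost', ...
    'NumLearningCycles',50,'Learners',t);

% Predicted MPG for the first car in the data set.
mpgHat = predict(ens,X(1,:));
```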

For examples using a template, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.

Decision trees can handle NaN values in X. Such values are called “missing”. If you have some missing values in a row of X, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitcensemble and fitrensemble ignore that row. If you have data with a large fraction of missing values in X, use surrogate decision splits. For examples of surrogate splits, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.
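To see the effect of surrogate splits on your own data, you can train with and without them and compare an error estimate. The sketch below does this on fisheriris after replacing roughly 20% of the predictor entries with NaN; the missing-data rate, the use of bagging, and the out-of-bag comparison are all assumptions made for illustration, and no particular outcome is implied.

```matlab
load fisheriris
rng(1)

% Knock out about 20% of the predictor entries at random.
X = meas;
X(rand(size(X)) < 0.2) = NaN;

% Same ensemble settings, with and without surrogate splits.
tSurr = templateTree('Surrogate','on');
ensPlain = fitcensemble(X,species,'Method','Bag', ...
    'NumLearningCycles',50,'Learners','Tree');
ensSurr = fitcensemble(X,species,'Method','Bag', ...
    'NumLearningCycles',50,'Learners',tSurr);

% Out-of-bag classification error for each ensemble.
errPlain = oobLoss(ensPlain);
errSurr  = oobLoss(ensSurr);
```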

Common Settings for Tree Weak Learners

  • The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with these parameters:

    • MaxNumSplits — The maximal number of branch node splits is MaxNumSplits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1. The default for boosting is 1.

    • MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.

    • MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.

    If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):

    MinParent = max(MinParent,2*MinLeaf)

    If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.

  • Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.

    Note

    Surrogate splits cause slower training and use more memory.

  • PredictorSelection — fitcensemble, fitrensemble, and TreeBagger grow trees using the standard CART algorithm [1] by default. If the predictor variables are heterogeneous, or if some predictors have many levels and others have few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response or each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node. For more details, see Choose Split Predictor Selection Technique.

    Note

    When boosting decision trees, selecting split predictors using the curvature or interaction tests is not recommended.
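The depth settings above can be sketched as follows. The numeric values are illustrative assumptions; the arithmetic simply applies the rule MinParent = max(MinParent,2*MinLeaf) stated in this section.

```matlab
% Requested settings (illustrative values)
minLeaf   = 10;
minParent = 5;

% Effective minimum parent size under the stated rule:
% max(5, 2*10) = 20, so the leaf-size setting dominates here.
effectiveMinParent = max(minParent,2*minLeaf);

% Pass all three depth controls to the tree template.
t = templateTree('MinLeafSize',minLeaf, ...
    'MinParentSize',minParent,'MaxNumSplits',20);

load fisheriris
ens = fitcensemble(meas,species,'Method','AdaBoostM2', ...
    'NumLearningCycles',20,'Learners',t);
```

Because the requested MinParentSize of 5 is smaller than twice the MinLeafSize, it has no practical effect in this configuration.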

Call fitcensemble or fitrensemble

The syntaxes for fitcensemble and fitrensemble are identical. For fitrensemble, the syntax is:

ens = fitrensemble(X,Y,Name,Value)
  • X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

  • Y is the vector of responses, with the same number of observations as rows in X.

  • Name,Value specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the 'Method' argument, the number of ensemble learning cycles with the 'NumLearningCycles' argument, and the type of weak learners with the 'Learners' argument. For a complete list of name-value pair arguments, see the fitrensemble function page.

The result of fitrensemble and fitcensemble is an ensemble object, suitable for making predictions on new data. For a basic example of creating a regression ensemble, see Train Regression Ensemble. For a basic example of creating a classification ensemble, see Train Classification Ensemble.
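A minimal sketch of training a regression ensemble and predicting on new data, assuming the carsmall sample data set. The predictor choice, the method, and the new observation are illustrative.

```matlab
load carsmall
rng(1)

% Predict fuel economy from weight and horsepower.
X = [Weight Horsepower];
Y = MPG;

% Train a bagged ensemble of regression trees.
ens = fitrensemble(X,Y,'Method','Bag','NumLearningCycles',100);

% Predict MPG for a hypothetical new car: 3000 lb, 120 hp.
mpgHat = predict(ens,[3000 120]);
```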

Where to Set Name-Value Pairs

There are several name-value pairs you can pass to fitcensemble or fitrensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine where a name-value pair argument belongs, in the ensemble call or in the weak learner template:

  • Use template name-value pairs to control the characteristics of the weak learners.

  • Use fitcensemble or fitrensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults. Alternatively, set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).

To name the predictors in a classification ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitcensemble.
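The split between template options and ensemble options can be sketched like this. The specific values and the short predictor names are hypothetical choices for the example.

```matlab
load fisheriris

% Weak-learner options go in the template.
t = templateTree('MaxNumSplits',20,'MinLeafSize',2);

% Ensemble-level options go in the fitcensemble call.
ens = fitcensemble(meas,species, ...
    'Method','AdaBoostM2', ...
    'NumLearningCycles',50, ...
    'Learners',t, ...
    'PredictorNames',{'SL','SW','PL','PW'});

% The names are stored with the ensemble.
names = ens.PredictorNames;
```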

References

[1] Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Boca Raton, FL: Chapman & Hall, 1984.
