You have several methods for melding results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with minor changes in your commands.
You can create an ensemble for classification or regression using fitensemble. To train an ensemble, use this syntax:

ens = fitensemble(X,Y,model,numberens,learners)
X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

Y is the vector of responses, with the same number of observations as the rows in X.

model is a character vector, such as 'bag', naming the type of ensemble.

numberens is the number of weak learners in ens from each element of learners. So the number of elements in ens is numberens times the number of elements in learners.

learners is either a character vector, such as 'tree', naming a weak learner, a weak learner template, or a cell array of such templates.
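For instance, here is a minimal sketch (with placeholder X and Y, not from the original documentation) showing how numberens and learners interact; passing a cell array of two tree templates with numberens set to 50 yields an ensemble of 100 trained learners:

t1 = templateTree('MaxNumSplits',1);    % stumps
t2 = templateTree('MaxNumSplits',10);   % deeper trees
ens = fitensemble(X,Y,'AdaBoostM1',50,{t1 t2});
% ens.NTrained is 100: numberens (50) times the number of templates (2)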
For all classification or nonlinear regression problems, follow these steps to create an ensemble:
All supervised learning methods start with a data matrix, usually called X in this documentation. Each row of X represents one observation. Each column of X represents one variable, or predictor.
You can use a wide variety of data types for response data.

For regression ensembles, Y must be a numeric vector with the same number of elements as the number of rows of X.

For classification ensembles, Y can be any of the following data types. The table also shows how to represent missing entries for each type.
| Data Type | Missing Entry |
| --- | --- |
| Numeric vector | NaN |
| Categorical vector | <undefined> |
| Character array | Row of spaces |
| Cell array of character vectors | '' |
| Logical vector | (not possible to represent) |
fitensemble ignores missing values in Y when creating an ensemble.

For example, suppose your response data consists of three observations in the following order: true, false, true. You could express Y as:

[1;0;1] (numeric vector)

nominal({'true','false','true'}) (categorical vector)

[true;false;true] (logical vector)

['true ';'false';'true '] (character array, padded with spaces so each row has the same length)

{'true','false','true'} (cell array of character vectors)

Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in Y.
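As a small sketch (with made-up data, not from the original documentation) of how a missing response is handled, the row whose response is the empty character vector '' is skipped during training:

X = rand(4,3);                    % four observations, three predictors
Y = {'true';'false';'';'true'};   % third response is missing
ens = fitensemble(X,Y,'Bag',10,'Tree','Type','classification');
% fitensemble trains on the three rows that have nonmissing responses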
fitensemble uses one of these algorithms to create an ensemble.

For classification with two classes:

'AdaBoostM1'

'LogitBoost'

'GentleBoost'

'RobustBoost' (requires Optimization Toolbox™)

'LPBoost' (requires Optimization Toolbox)

'TotalBoost' (requires Optimization Toolbox)

'RUSBoost'

'Subspace'

'Bag'

For classification with three or more classes:

'AdaBoostM2'

'LPBoost' (requires Optimization Toolbox)

'TotalBoost' (requires Optimization Toolbox)

'RUSBoost'

'Subspace'

'Bag'

For regression:

'LSBoost'

'Bag'

'Bag' applies to all methods. When using 'Bag', indicate whether you want a classifier or regressor with the Type name-value pair set to 'classification' or 'regression'.
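For example, a minimal sketch (X, Y, and Ynum are placeholder training data, not from the original documentation):

ensCls = fitensemble(X,Y,'Bag',100,'Tree','Type','classification');
ensReg = fitensemble(X,Ynum,'Bag',100,'Tree','Type','regression');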
For descriptions of the various algorithms, see Ensemble Algorithms.
See Suggestions for Choosing an Appropriate Ensemble Algorithm.
This table lists characteristics of the various algorithms. In the table titles:

Regress. — Regression

Classif. — Classification

Preds. — Predictors

Imbalance — Good for imbalanced data (one class has many more observations than the other)

Stop — Algorithm self-terminates

Sparse — Requires fewer weak learners than other ensemble algorithms

| Algorithm | Regress. | Binary Classif. | Binary Classif. Multi-Level Preds. | Classif. 3+ Classes | Class Imbalance | Stop | Sparse |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bag | × | × |  | × |  |  |  |
| AdaBoostM1 |  | × |  |  |  |  |  |
| AdaBoostM2 |  |  |  | × |  |  |  |
| LogitBoost |  | × | × |  |  |  |  |
| GentleBoost |  | × | × |  |  |  |  |
| RobustBoost |  | × |  |  |  |  |  |
| LPBoost |  | × |  | × |  | × | × |
| TotalBoost |  | × |  | × |  | × | × |
| RUSBoost |  | × |  | × | × |  |  |
| LSBoost | × |  |  |  |  |  |  |
| Subspace |  | × |  | × |  |  |  |
RobustBoost, LPBoost, and TotalBoost require an Optimization Toolbox license. Try TotalBoost before LPBoost, as TotalBoost can be more robust.
Suggestions for Choosing an Appropriate Ensemble Algorithm
Regression — Your choices are LSBoost or Bag. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.

Binary Classification — Try AdaBoostM1 first, with these modifications:
| Data Characteristic | Recommended Algorithm |
| --- | --- |
| Many predictors | Subspace |
| Skewed data (many more observations of one class) | RUSBoost |
| Categorical predictors with over 31 levels | LogitBoost or GentleBoost |
| Label noise (some training data has the wrong class) | RobustBoost |
| Many observations | Avoid LPBoost, TotalBoost, and Bag |
Multiclass Classification — Try AdaBoostM2 first, with these modifications:

| Data Characteristic | Recommended Algorithm |
| --- | --- |
| Many predictors | Subspace |
| Skewed data (many more observations of one class) | RUSBoost |
| Many observations | Avoid LPBoost, TotalBoost, and Bag |
For details of the algorithms, see Ensemble Algorithms.
General Characteristics of Ensemble Algorithms
Bag generally constructs deep trees. This construction is both time consuming and memory-intensive, and it also leads to relatively slow predictions.

Boost algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore, it is not always clear which class of algorithms is superior.
Bag can estimate the generalization error without additional cross validation. See oobLoss.
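For example, a minimal sketch (with placeholder X and Y) of estimating generalization error from out-of-bag observations:

bag = fitensemble(X,Y,'Bag',100,'Tree','Type','classification');
oobErr = oobLoss(bag)   % out-of-bag estimate of the generalization error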
Except for Subspace, all boosting and bagging algorithms are based on tree learners. Subspace can use either discriminant analysis or k-nearest neighbor learners.
For details of the characteristics of individual ensemble members, see Characteristics of Classification Algorithms.
Choosing the size of an ensemble involves balancing speed and accuracy.
Larger ensembles take longer to train and to generate predictions.
Some ensemble algorithms can become overtrained (inaccurate) when too large.
To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it appears that you need more members, add them using the resume method (available for both classification and regression ensembles), as sketched below. Repeat until adding more members does not improve ensemble quality.
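A minimal sketch of growing an ensemble in stages (placeholder X and Y):

ens = fitensemble(X,Y,'LSBoost',100,'Tree');
err100 = resubLoss(ens);   % quality with 100 members
ens = resume(ens,50);      % train and append 50 more weak learners
err150 = resubLoss(ens);   % check whether the extra members helped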
Currently the weak learner types are:
'Discriminant' (recommended for Subspace ensembles)

'KNN' (only for Subspace ensembles)

'Tree' (for any ensemble except Subspace)
There are two ways to set the weak learner type in the ensemble.
To create an ensemble with default weak learner options, pass in a character vector as the weak learner. For example:

ens = fitensemble(X,Y,'AdaBoostM2',50,'Tree');
% or
ens = fitensemble(X,Y,'Subspace',50,'KNN');
To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method. For example, if you have missing data and want to use trees with surrogate splits for better accuracy:

templ = templateTree('Surrogate','all');
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);

To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

templ = templateTree('MinLeafSize',size(X,1)/10);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);

Alternatively, choose the maximal number of splits per tree:

templ = templateTree('MaxNumSplits',4);
ens = fitensemble(X,Y,'AdaBoostM2',50,templ);
While you can give fitensemble a cell array of learner templates, the most common usage is to give just one weak learner template.
For examples using a template, see Train Ensemble With Unequal Classification Costs and Surrogate Splits.
Decision trees can handle NaN values in X. Such values are called "missing". If you have some missing values in a row of X, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of NaN, fitensemble ignores that row. If you have data with a large fraction of missing values in X, use surrogate decision splits. For examples of surrogate splits, see Train Ensemble With Unequal Classification Costs and Surrogate Splits.
Common Settings for Tree Weak Learners
The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth using these parameters:
MaxNumSplits — The maximal number of branch node splits is MaxNumSplits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1. The default for boosting is 1.

MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.

MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.

If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):

MinParent = max(MinParent,2*MinLeaf)

If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.
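For instance, a minimal sketch (placeholder X and Y) combining the three settings; here the effective minimum parent size is max(20, 2*10) = 20, and splitting also stops once a tree reaches 15 branch nodes:

t = templateTree('MinLeafSize',10,'MinParentSize',20,'MaxNumSplits',15);
ens = fitensemble(X,Y,'Bag',50,t,'Type','regression');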
Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.

Note: Surrogate splits cause slower training and use more memory.
PredictorSelection — fitensemble and TreeBagger grow trees using the standard CART [1] algorithm by default. If the predictor variables are heterogeneous, or there are predictors having many levels and others having few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response, or between each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node. For more details, see Choose Split Predictor Selection Technique.

Note: When boosting decision trees, selecting split predictors using the curvature or interaction tests is not recommended.
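For example, a minimal sketch (placeholder X and Y) of a bagged ensemble whose trees use the curvature test for split-predictor selection:

t = templateTree('PredictorSelection','curvature');
ens = fitensemble(X,Y,'Bag',100,t,'Type','classification');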
The syntax of fitensemble is:

ens = fitensemble(X,Y,model,numberens,learners)

X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

Y is the responses, with the same number of observations as rows in X.

model is a character vector, such as 'bag', naming the type of ensemble.

numberens is the number of weak learners in ens from each element of learners. The number of elements in ens is numberens times the number of elements in learners.

learners is a character vector, such as 'tree', naming a weak learner, a weak learner template, or a cell array of such character vectors and templates.
The result of fitensemble is an ensemble object, suitable for making predictions on new data. For a basic example of creating a classification ensemble, see Train Classification Ensemble. For a basic example of creating a regression ensemble, see Train Regression Ensemble.
Where to Set Name-Value Pairs. There are several name-value pairs you can pass to fitensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine whether a name-value pair argument belongs to the ensemble or to the weak learner:

Use template name-value pairs to control the characteristics of the weak learners.

Use fitensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults, or set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).

To name the predictors in the ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitensemble.
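A minimal sketch combining the two levels (the data and predictor names here are placeholders): the template name-value pairs shape each tree, while the fitensemble name-value pairs shape the ensemble.

t = templateTree('MinLeafSize',1,'MinParentSize',2);   % deeper trees
ens = fitensemble(X,Y,'AdaBoostM1',200,t,...
    'LearnRate',0.1,'PredictorNames',{'x1','x2','x3'});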
This example shows how to create a classification tree ensemble for the ionosphere data set, and use it to predict the classification of a radar return with average measurements.

Load the ionosphere data set.
load ionosphere
Train a classification ensemble. For binary classification problems, fitcensemble aggregates 100 classification trees using LogitBoost.
Mdl = fitcensemble(X,Y)
Mdl = classreg.learning.classif.ClassificationEnsemble ResponseName: 'Y' CategoricalPredictors: [] ClassNames: {'b' 'g'} ScoreTransform: 'none' NumObservations: 351 NumTrained: 100 Method: 'LogitBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.' FitInfo: [100×1 double] FitInfoDescription: {2×1 cell}
Mdl is a ClassificationEnsemble model.
Plot a graph of the first trained classification tree in the ensemble.
view(Mdl.Trained{1}.CompactRegressionLearner,'Mode','graph');
By default, fitcensemble grows shallow trees for boosting algorithms. You can alter the tree depth by passing a tree template object to fitcensemble. For more details, see templateTree.
Predict the quality of a radar return with average predictor measurements.
label = predict(Mdl,mean(X))
label = cell 'g'
This example shows how to create a regression ensemble to predict mileage of cars based on their horsepower and weight, trained on the carsmall data.

Load the carsmall data set.
load carsmall
Prepare the predictor data.
X = [Horsepower Weight];
The response data is MPG. The only available boosted regression ensemble type is LSBoost. For this example, arbitrarily choose an ensemble of 100 trees, and use the default tree options.
Train an ensemble of regression trees.
Mdl = fitensemble(X,MPG,'LSBoost',100,'Tree')
Mdl = classreg.learning.regr.RegressionEnsemble ResponseName: 'Y' CategoricalPredictors: [] ResponseTransform: 'none' NumObservations: 94 NumTrained: 100 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.' FitInfo: [100×1 double] FitInfoDescription: {2×1 cell} Regularization: []
Plot a graph of the first trained regression tree in the ensemble.
view(Mdl.Trained{1},'Mode','graph');
By default, fitensemble grows stumps for boosted trees.
Predict the mileage of a car with 150 horsepower weighing 2750 lbs.
mileage = predict(Mdl,[150 2750])
mileage = 22.4236
This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.
Load and Preprocess Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.

load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,...
    Origin,MPG);
Determine Levels in Predictors
The standard CART algorithm tends to split predictors with many unique values (levels), e.g., continuous variables, over those with fewer levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:
Converts all variables to the categorical data type using categorical
Determines all unique categories while ignoring missing values using categories
Counts the categories using numel
Then, apply the function to each variable using varfun
.
countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels,X(:,1:end-1),'OutputFormat','uniform');
Compare the number of levels among the predictor variables.
figure;
bar(numLevels);
title('Number of Levels Among Predictors');
xlabel('Predictor variable');
ylabel('Number of levels');
h = gca;
h.XTickLabel = X.Properties.VariableNames(1:end-1);
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.
Grow Robust Random Forest
Grow a random forest of 200 regression trees. Specify sampling all variables at each node. Specify usage of the interaction test to select split predictors. Because there are missing values in the data, specify usage of surrogate splits to increase accuracy.
t = templateTree('NumPredictorsToSample','all',...
    'PredictorSelection','interaction-curvature','Surrogate','on');
rng(1); % For reproducibility
Mdl = fitrensemble(X,'MPG','Method','bag','NumLearningCycles',200,...
    'Learners',t);
Mdl is a RegressionBaggedEnsemble model.

Estimate the R² of the model using out-of-bag predictions.
yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y,yHat)^2
R2 = 0.8739
Mdl explains 87.39% of the variability around the mean.
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl);
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
Compare the predictor importance estimates.
figure;
bar(impOOB);
title('Unbiased Predictor Importance Estimates');
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.
Compare the predictor importance estimates obtained by permuting out-of-bag observations with the estimates obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure;
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
The values of impGain are not commensurate with those of impOOB. According to the values of impGain, Model_Year and Weight do not appear to be the most important predictors.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure;
imagesc(predAssociation);
title('Predictor Association Estimates');
colorbar;
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans = 0.6830
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
Grow Random Forest Using Reduced Predictor Set
Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only.
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight' 'MPG'}),'MPG','Method','bag',...
    'NumLearningCycles',200,'Learners',t);
Compute the R² of the reduced model.

yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2
r2Reduced = 0.8525
The R² for the reduced model is close to the R² of the full model. This result suggests that the reduced model is sufficient for prediction.
Usually you cannot evaluate the predictive quality of an ensemble based on its performance on training data. Ensembles tend to "overtrain," meaning they produce overly optimistic estimates of their predictive power. This means that the result of resubLoss (for classification or regression) usually indicates lower error than you get on new data.
To obtain a better idea of the quality of an ensemble, use one of these methods:
Evaluate the ensemble on an independent test set (useful when you have a lot of training data).
Evaluate the ensemble by cross validation (useful when you don't have a lot of training data).
Evaluate the ensemble on out-of-bag data (useful when you create a bagged ensemble with fitensemble).
This example uses a bagged ensemble so it can use all three methods of evaluating ensemble quality.
Generate an artificial dataset with 20 predictors. Each entry is a random number from 0 to 1. The initial classification is Y = 1 if x_{1} + x_{2} + x_{3} + x_{4} + x_{5} > 2.5 and Y = 0 otherwise.
rng(1,'twister') % for reproducibility
X = rand(2000,20);
Y = sum(X(:,1:5),2) > 2.5;
In addition, to add noise to the results, randomly switch 10% of the classifications:
idx = randsample(2000,200);
Y(idx) = ~Y(idx);
Independent Test Set
Create independent training and test sets of data. Use 70% of the data for a training set by calling cvpartition using the holdout option:
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
Create a bagged classification ensemble of 200 trees from the training data:
bag = fitensemble(Xtrain,Ytrain,'Bag',200,'Tree',...
    'Type','Classification')
bag = classreg.learning.classif.ClassificationBaggedEnsemble ResponseName: 'Y' CategoricalPredictors: [] ClassNames: [0 1] ScoreTransform: 'none' NumObservations: 1400 NumTrained: 200 Method: 'Bag' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.' FitInfo: [] FitInfoDescription: 'None' FResample: 1 Replace: 1 UseObsForLearner: [1400×200 logical]
Plot the loss (misclassification) of the test data as a function of the number of trained trees in the ensemble:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Test classification error');
Cross Validation
Generate a five-fold cross-validated bagged ensemble:
cv = fitensemble(X,Y,'Bag',200,'Tree',...
    'type','classification','kfold',5)
cv = classreg.learning.partition.ClassificationPartitionedEnsemble CrossValidatedModel: 'Bag' PredictorNames: {1×20 cell} ResponseName: 'Y' NumObservations: 2000 KFold: 5 Partition: [1×1 cvpartition] NumTrainedPerFold: [200 200 200 200 200] ClassNames: [0 1] ScoreTransform: 'none'
Examine the cross-validation loss as a function of the number of trees in the ensemble:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold on;
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Location','NE');
Cross validating gives comparable estimates to those of the independent set.
Out-of-Bag Estimates
Generate the loss curve for out-of-bag estimates, and plot it along with the other curves:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold on;
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
plot(oobLoss(bag,'mode','cumulative'),'k--');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Out of bag','Location','NE');
The out-of-bag estimates are again comparable to those of the other methods.
This example shows how to classify when one class has many more observations than another. Try the RUSBoost algorithm first, because it is designed to handle this case.

This example uses the "Cover type" data from the UCI machine learning archive, described in http://archive.ics.uci.edu/ml/datasets/Covertype.
The data classifies types of forest (ground cover), based on predictors
such as elevation, soil type, and distance to water. The data has
over 500,000 observations and over 50 predictors, so training and
using a classifier is time consuming.
Blackard and Dean [4] describe a neural net classification of this data. They quote a 70.6% classification accuracy. RUSBoost obtains over 76% classification accuracy; see steps 6 and 7.
Step 1. Obtain the data.
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz','forestcover.gz');
Then, extract the data from the forestcover.gz file. The data is in the covtype.data file.
Step 2. Import the data and prepare it for classification.
Import the data into your workspace. Extract the last data column into a variable named Y.
load covtype.data
Y = covtype(:,end);
covtype(:,end) = [];
Step 3. Examine the response data.
tabulate(Y)
Value Count Percent 1 211840 36.46% 2 283301 48.76% 3 35754 6.15% 4 2747 0.47% 5 9493 1.63% 6 17367 2.99% 7 20510 3.53%
There are hundreds of thousands of data points. Those of class 4 are less than 0.5% of the total. This imbalance indicates that RUSBoost is an appropriate algorithm.
Step 4. Partition the data for quality assessment.
Use half the data to fit a classifier, and half to examine the quality of the resulting classifier.
part = cvpartition(Y,'holdout',0.5);
istrain = training(part); % data for fitting
istest = test(part); % data for quality assessment
tabulate(Y(istrain))
Value Count Percent 1 105920 36.46% 2 141651 48.76% 3 17877 6.15% 4 1374 0.47% 5 4746 1.63% 6 8683 2.99% 7 10255 3.53%
Step 5. Create the ensemble.
Use deep trees for higher ensemble accuracy. To do so, set the trees to have a minimal leaf size of 5. Set LearnRate to 0.1 in order to achieve higher accuracy as well. The data is large, and, with deep trees, creating the ensemble is time consuming.
t = templateTree('MinLeafSize',5);
tic
rusTree = fitensemble(covtype(istrain,:),Y(istrain),'RUSBoost',1000,t,...
    'LearnRate',0.1,'nprint',100);
toc
Training RUSBoost... Grown weak learners: 100 Grown weak learners: 200 Grown weak learners: 300 Grown weak learners: 400 Grown weak learners: 500 Grown weak learners: 600 Grown weak learners: 700 Grown weak learners: 800 Grown weak learners: 900 Grown weak learners: 1000 Elapsed time is 918.258401 seconds.
Step 6. Inspect the classification error.
Plot the classification error against the number of members in the ensemble.
figure;
tic
plot(loss(rusTree,covtype(istest,:),Y(istest),'mode','cumulative'));
toc
grid on;
xlabel('Number of trees');
ylabel('Test classification error');
Elapsed time is 775.646935 seconds.
The ensemble achieves a classification error of under 24% using 150 or more trees. It achieves the lowest error for 400 or more trees.
Examine the confusion matrix for each class as a percentage of the true class.
tic
Yfit = predict(rusTree,covtype(istest,:));
toc
tab = tabulate(Y(istest));
bsxfun(@rdivide,confusionmat(Y(istest),Yfit),tab(:,2))*100
Elapsed time is 427.293168 seconds. ans = Columns 1 through 6 83.3771 7.4056 0.0736 0 1.7051 0.2681 18.3156 66.4652 2.1193 0.0162 9.3435 2.8239 0 0.0839 90.8038 2.3885 0.6545 6.0693 0 0 2.4763 95.8485 0 1.6752 0 0.2739 0.6530 0 98.6518 0.4213 0 0.1036 3.8346 1.1400 0.4030 94.5187 0.2340 0 0 0 0.0195 0 Column 7 7.1705 0.9163 0 0 0 0 99.7465
All classes except class 2 have over 80% classification accuracy, and classes 3 through 7 have over 90% accuracy. But class 2 makes up close to half the data, so the overall accuracy is not that high.
Step 7. Compact the ensemble.
The ensemble is large. Remove the data using the compact method.
cmpctRus = compact(rusTree);
sz(1) = whos('rusTree');
sz(2) = whos('cmpctRus');
[sz(1).bytes sz(2).bytes]
ans = 1.0e+09 * 1.6947 0.9790
The compacted ensemble is about half the size of the original.
Remove half the trees from cmpctRus. This action is likely to have minimal effect on the predictive performance, based on the observation that 400 out of 1000 trees give nearly optimal accuracy.
cmpctRus = removeLearners(cmpctRus,[500:1000]);
sz(3) = whos('cmpctRus');
sz(3).bytes
ans = 475495669
The reduced compact ensemble takes about a quarter the memory of the full ensemble. Its overall loss rate is under 24%:
L = loss(cmpctRus,covtype(istest,:),Y(istest))
L = 0.2326
The predictive accuracy on new data might differ, because the ensemble accuracy might be biased. The bias arises because the same data used for assessing the ensemble was used for reducing the ensemble size. To obtain an unbiased estimate of requisite ensemble size, you should use cross validation. However, that procedure is time consuming.
In many real-world applications, you might prefer to treat classes in your data asymmetrically. For example, you might have data with many more observations of one class than of any other. Or you might work on a problem in which misclassifying observations of one class has more severe consequences than misclassifying observations of another class. In such situations, you can use two optional parameters for fitensemble: prior and cost.
By using prior, you set prior class probabilities (that is, class probabilities used for training). Use this option if some classes are under- or overrepresented in your training set. For example, you might obtain your training data by simulation. Because simulating class A is more expensive than class B, you opt to generate fewer observations of class A and more observations of class B. You expect, however, that class A and class B are mixed in a different proportion in the real world. In this case, set prior probabilities for class A and B approximately to the values you expect to observe in the real world. fitensemble normalizes prior probabilities to make them add up to 1; multiplying all prior probabilities by the same positive factor does not affect the result of classification.
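For example, a minimal sketch (hypothetical classes 'A' and 'B', placeholder X and Y) of setting priors to the expected real-world mixture:

prior.ClassNames = {'A','B'};
prior.ClassProbs = [0.5 0.5];   % expected real-world class proportions
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree','prior',prior);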
If classes are adequately represented in the training data but you want to treat them asymmetrically, use the cost parameter.
Suppose you want to classify benign and malignant tumors in cancer
patients. Failure to identify a malignant tumor (false negative) has
far more severe consequences than misidentifying benign as malignant
(false positive). You should assign high cost to misidentifying malignant
as benign and low cost to misidentifying benign as malignant.
You must pass misclassification costs as a square matrix with nonnegative elements. Element C(i,j) of this matrix is the cost of classifying an observation into class j if the true class is i. The diagonal elements C(i,i) of the cost matrix must be 0. For the previous example, you can choose malignant tumor to be class 1 and benign tumor to be class 2. Then you can set the cost matrix to
$$\left[\begin{array}{cc}0& c\\ 1& 0\end{array}\right]$$
where c > 1 is the cost of misidentifying a malignant tumor as benign. Costs are relative—multiplying all costs by the same positive factor does not affect the result of classification.
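For example, a minimal sketch with c = 10 (placeholder X and Y; the rows and columns of the matrix assume the class order malignant = 1, benign = 2):

cost = [0 10; 1 0];   % C(1,2) = 10: misclassifying malignant as benign is costly
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree','cost',cost);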
If you have only two classes, fitensemble adjusts their prior probabilities using ${\tilde{P}}_{i}={C}_{ij}{P}_{i}$ for class i = 1,2 and j ≠ i. P_{i} are the prior probabilities either passed into fitensemble or computed from class frequencies in the training data, and ${\tilde{P}}_{i}$ are the adjusted prior probabilities. Then fitensemble uses the default cost matrix

$$\left[\begin{array}{cc}0& 1\\ 1& 0\end{array}\right]$$

and these adjusted probabilities for training its weak learners. Manipulating the cost matrix is thus equivalent to manipulating the prior probabilities.
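As a worked sketch of this adjustment (using the hypothetical cost matrix from the earlier example), the adjusted priors concentrate on the expensive-to-miss class:

P = [0.5 0.5];                         % original priors
C = [0 10; 1 0];                       % misclassification costs
Ptilde = [C(1,2)*P(1) C(2,1)*P(2)];    % adjusted priors: [5 0.5]
Ptilde = Ptilde/sum(Ptilde)            % normalized: [0.9091 0.0909]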
If you have three or more classes, fitensemble also converts input costs into adjusted prior probabilities. This conversion is more complex. First, fitensemble attempts to solve a matrix equation described in Zhou and Liu [51]. If it fails to find a solution, fitensemble applies the "average cost" adjustment described in Breiman et al. [11]. For more information, see Zadrozny, Langford, and Abe [50].
This example shows how to train an ensemble of classification trees with unequal classification costs. This example uses data on patients with hepatitis to see if they live or die as a result of the disease. The data set is described at UCI Machine Learning Data Repository.

Read the hepatitis data set from the UCI repository as a character array. Then convert the result to a cell array of character vectors using textscan. Specify a cell array of character vectors containing the variable names.
hepatitis = textscan(urlread(['http://archive.ics.uci.edu/ml/' ...
    'machine-learning-databases/hepatitis/hepatitis.data']),...
    '%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f','TreatAsEmpty','?',...
    'Delimiter',',');
size(hepatitis)
VarNames = {'dieOrLive' 'age' 'sex' 'steroid' 'antivirals' 'fatigue' ...
    'malaise' 'anorexia' 'liverBig' 'liverFirm' 'spleen' ...
    'spiders' 'ascites' 'varices' 'bilirubin' 'alkPhosphate' 'sgot' ...
    'albumin' 'protime' 'histology'};
ans = 1 20
hepatitis is a 1-by-20 cell array of character vectors. The cells correspond to the response (dieOrLive) and 19 heterogeneous predictors.
Specify a numeric matrix containing the predictors and a cell vector containing 'Die' and 'Live', which are the response categories. The response contains two values: 1 indicates that a patient died, and 2 indicates that a patient lived. Specify a cell array of character vectors for the response using the response categories. The first variable in hepatitis contains the response.
X = cell2mat(hepatitis(2:end));
ClassNames = {'Die' 'Live'};
Y = ClassNames(hepatitis{:,1});
X is a numeric matrix containing the 19 predictors. Y is a cell array of character vectors containing the response.
Inspect the data for missing values.
figure;
barh(sum(isnan(X),1)/size(X,1));
h = gca;
h.YTick = 1:numel(VarNames) - 1;
h.YTickLabel = VarNames(2:end);
ylabel 'Predictor';
xlabel 'Fraction of missing values';
Most predictors have missing values, and one has nearly 45% missing values. Therefore, use decision trees with surrogate splits for better accuracy. Because the data set is small, training time with surrogate splits should be tolerable.
Create a classification tree template that uses surrogate splits.
rng(0,'twister') % for reproducibility
t = templateTree('surrogate','all');
Examine the data or the description of the data to see which predictors are categorical.
X(1:5,:)
ans = Columns 1 through 7 30.0000 2.0000 1.0000 2.0000 2.0000 2.0000 2.0000 50.0000 1.0000 1.0000 2.0000 1.0000 2.0000 2.0000 78.0000 1.0000 2.0000 2.0000 1.0000 2.0000 2.0000 31.0000 1.0000 NaN 1.0000 2.0000 2.0000 2.0000 34.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 Columns 8 through 14 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 1.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.9000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.7000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 0.7000 2.0000 2.0000 2.0000 2.0000 2.0000 2.0000 1.0000 Columns 15 through 19 85.0000 18.0000 4.0000 NaN 1.0000 135.0000 42.0000 3.5000 NaN 1.0000 96.0000 32.0000 4.0000 NaN 1.0000 46.0000 52.0000 4.0000 80.0000 1.0000 NaN 200.0000 4.0000 NaN 1.0000
It appears that predictors 2 through 13 are categorical, as well as predictor 19. You can confirm this inference using the data set description at UCI Machine Learning Data Repository.
List the categorical variables.
catIdx = [2:13,19];
Create a cross-validated ensemble using 150 learners and the GentleBoost algorithm.
Ensemble = fitensemble(X,Y,'GentleBoost',150,t,...
    'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
    'CategoricalPredictors',catIdx,'KFold',5);
figure;
plot(kfoldLoss(Ensemble,'Mode','cumulative','LossFun','exponential'));
xlabel('Number of trees');
ylabel('Cross-validated exponential loss');
Inspect the confusion matrix to see which patients the ensemble predicts correctly.
[yFit,sFit] = kfoldPredict(Ensemble);
confusionmat(Y,yFit,'Order',ClassNames)
ans = 19 13 11 112
Of the 123 patients who live, the ensemble predicts correctly that 112 will live. But for the 32 patients who die of hepatitis, the ensemble only predicts correctly that about half will die of hepatitis.
There are two types of error in the predictions of the ensemble:
Predicting that the patient lives, but the patient dies
Predicting that the patient dies, but the patient lives
Suppose you believe that the first error is five times worse than the second. Create a new classification cost matrix that reflects this belief.
cost.ClassNames = ClassNames;
cost.ClassificationCosts = [0 5; 1 0];
Create a new cross-validated ensemble using cost as the misclassification cost, and inspect the resulting confusion matrix.
EnsembleCost = fitensemble(X,Y,'GentleBoost',150,t,...
    'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
    'CategoricalPredictors',catIdx,'KFold',5,...
    'Cost',cost);
[yFitCost,sFitCost] = kfoldPredict(EnsembleCost);
confusionmat(Y,yFitCost,'Order',ClassNames)
ans = 20 12 8 115
As expected, the new ensemble does a better job classifying the patients who die. Somewhat surprisingly, the new ensemble also does a better job classifying the patients who live, though the result is not statistically significantly better. The results of the cross validation are random, so this result is simply a statistical fluctuation. The result seems to indicate that the classification of patients who live is not very sensitive to the cost.
This example shows how to train an ensemble of classification trees using data containing predictors with many categorical levels.
Generally, you cannot use classification with more than 31 levels in any categorical predictor. However, two boosting algorithms can classify data with many categorical predictor levels and binary responses: LogitBoost and GentleBoost. For details, see LogitBoost and GentleBoost.
This example uses demographic data from the U.S. Census Bureau, available at UCI Machine Learning Data Repository.
The objective of the researchers who posted the data is predicting whether an individual makes over $50,000 a year, based on a set of characteristics. You can see details of the data, including predictor names, in the adult.names file at the site.

Load the 'adult.data' file from the UCI Machine Learning Data Repository. Specify a cell array of character vectors containing the variable names.
adult = urlread(['http://archive.ics.uci.edu/ml/'...
    'machine-learning-databases/adult/adult.data']);
VarNames = {'age' 'workclass' 'fnlwgt' 'education' 'educationNum'...
    'maritalStatus' 'occupation' 'relationship' 'race'...
    'sex' 'capitalGain' 'capitalLoss'...
    'hoursPerWeek' 'nativeCountry' 'income'};
adult.data represents missing data as '?'. Replace instances of missing data with an empty character vector. Use textscan to put the data into a cell array of character vectors.
adult = strrep(adult,'?','');
adult = textscan(adult,'%f%s%f%s%f%s%s%s%s%s%f%f%f%s%s',...
    'Delimiter',',','TreatAsEmpty','');
The name-value pair argument TreatAsEmpty converts all observations corresponding to numeric variables to NaN if the observation is an empty character vector.

Because the variables are heterogeneous, put the data set into a table.
adult = table(adult{:},'VariableNames',VarNames);
Some categorical variables have many levels. Plot the number of levels of each categorical predictor.
cat = varfun(@iscellstr,adult(:,1:end - 1),...
    'OutputFormat','uniform'); % Logical flag for categorical variables
catVars = find(cat); % Indices of categorical variables
countCats = @(var)numel(categories(nominal(var)));
numCat = varfun(@(var)countCats(var),adult(:,catVars),...
    'OutputFormat','uniform');
figure
barh(numCat);
h = gca;
h.YTickLabel = VarNames(catVars);
ylabel 'Predictor'
xlabel 'Number of categories'
The anonymous function countCats converts a predictor to a nominal array, then counts the unique, nonempty categories of the predictor. Predictor 14 ('nativeCountry') has more than 40 categorical levels. For binary classification, fitctree uses a computational shortcut to find an optimal split for categorical predictors with many categories. For classification with more than two classes, you can choose a heuristic algorithm to find a good split. For details, see Splitting Categorical Predictors.

Specify the predictor matrix using classreg.regr.modelutils.predictormatrix and the response vector.
X = classreg.regr.modelutils.predictormatrix(adult,'ResponseVar',...
    size(adult,2));
Y = nominal(adult.income);
X is a numeric matrix; predictormatrix converts all categorical variables into group indices. The name-value pair argument ResponseVar indicates that the last column is the response variable, and excludes it from the predictor matrix. Y is a nominal, categorical array.

Train classification ensembles using both LogitBoost and GentleBoost.
rng(1); % For reproducibility
LBEnsemble = fitensemble(X,Y,'LogitBoost',300,'Tree',...
    'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),...
    'ResponseName','income');
GBEnsemble = fitensemble(X,Y,'GentleBoost',300,'Tree',...
    'CategoricalPredictors',cat,'PredictorNames',VarNames(1:end-1),...
    'ResponseName','income');
Examine the resubstitution error for both ensembles.
figure
plot(resubLoss(LBEnsemble,'Mode','cumulative'))
hold on
plot(resubLoss(GBEnsemble,'Mode','cumulative'),'r--')
hold off
xlabel('Number of trees')
ylabel('Resubstitution error')
legend('LogitBoost','GentleBoost','Location','NE')
The GentleBoost algorithm has a slightly smaller resubstitution error.
Estimate the generalization error for both algorithms by cross validation.
CVLBEnsemble = crossval(LBEnsemble,'KFold',5);
CVGBEnsemble = crossval(GBEnsemble,'KFold',5);
figure
plot(kfoldLoss(CVLBEnsemble,'Mode','cumulative'))
hold on
plot(kfoldLoss(CVGBEnsemble,'Mode','cumulative'),'r--')
hold off
xlabel('Number of trees')
ylabel('Cross-validated error')
legend('LogitBoost','GentleBoost','Location','NE')
The cross-validated loss is nearly the same as the resubstitution error.
When you have missing data, trees and ensembles of trees give better predictions when they include surrogate splits. Furthermore, estimates of predictor importance are often different with surrogate splits. Eliminating unimportant predictors can save time and memory for predictions, and can make predictions easier to understand.
This example shows the effects of surrogate splits for predictions for data containing missing entries in the test set.
Load sample data. Partition it into a training and test set.
load ionosphere;
rng(10) % for reproducibility
cv = cvpartition(Y,'Holdout',0.3);
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
Xtest = X(test(cv),:);
Ytest = Y(test(cv));
Bag decision trees with and without surrogate splits.
b = fitensemble(Xtrain,Ytrain,'Bag',50,'Tree',...
    'Type','Class');
templS = templateTree('Surrogate','On');
bs = fitensemble(Xtrain,Ytrain,'Bag',50,templS,...
    'Type','Class');
Suppose half of the values in the test set are missing.
Xtest(rand(size(Xtest))>0.5) = NaN;
Test accuracy with and without surrogate splits.
figure;
plot(loss(b,Xtest,Ytest,'Mode','Cumulative'));
hold on;
plot(loss(bs,Xtest,Ytest,'Mode','Cumulative'),'r--');
legend('Regular trees','Trees with surrogate splits');
xlabel('Number of trees');
ylabel('Test classification error');
Check the statistical significance of the difference in results with the McNemar test. Convert the labels to a nominal data type to make it easier to check for equality.
Yfit = nominal(predict(b,Xtest));
YfitS = nominal(predict(bs,Xtest));
N10 = sum(Yfit==nominal(Ytest) & YfitS~=nominal(Ytest));
N01 = sum(Yfit~=nominal(Ytest) & YfitS==nominal(Ytest));
mcnemar = (abs(N10-N01) - 1)^2/(N10+N01);
pval = 1 - chi2cdf(mcnemar,1)
pval = 1.7683e-04
The extremely low p-value indicates that the ensemble with surrogate splits is better in a statistically significant manner.
This example shows how to obtain the benefits of the LPBoost and TotalBoost algorithms. These algorithms share two beneficial characteristics:
They are self-terminating, so you don't have to guess how many members to include.
They produce ensembles with some very small weights, so you can safely remove ensemble members.
Note that the algorithms in this example require an Optimization Toolbox™ license.
Load the data
Load the ionosphere data set.
load ionosphere
Create the classification ensembles
Create ensembles for classifying the ionosphere data using the LPBoost, TotalBoost, and, for comparison, AdaBoostM1 algorithms. It is hard to know how many members to include in an ensemble. For LPBoost and TotalBoost, try using 500. For comparison, also use 500 for AdaBoostM1.
rng default % For reproducibility
T = 500;
adaStump = fitensemble(X,Y,'AdaBoostM1',T,'Tree');
totalStump = fitensemble(X,Y,'TotalBoost',T,'Tree');
lpStump = fitensemble(X,Y,'LPBoost',T,'Tree');
figure;
plot(resubLoss(adaStump,'Mode','Cumulative'));
hold on
plot(resubLoss(totalStump,'Mode','Cumulative'),'r');
plot(resubLoss(lpStump,'Mode','Cumulative'),'g');
hold off
xlabel('Number of stumps');
ylabel('Training error');
legend('AdaBoost','TotalBoost','LPBoost','Location','NE');
All three algorithms achieve perfect prediction on the training data after a while.
Examine the number of members in all three ensembles.
[adaStump.NTrained totalStump.NTrained lpStump.NTrained]
ans = 500 52 81
AdaBoostM1 trained all 500 members. The other two algorithms stopped training early.
Cross validate the ensembles
Cross validate the ensembles to better determine ensemble accuracy.
cvlp = crossval(lpStump,'KFold',5);
cvtotal = crossval(totalStump,'KFold',5);
cvada = crossval(adaStump,'KFold',5);
figure;
plot(kfoldLoss(cvada,'Mode','Cumulative'));
hold on
plot(kfoldLoss(cvtotal,'Mode','Cumulative'),'r');
plot(kfoldLoss(cvlp,'Mode','Cumulative'),'g');
hold off
xlabel('Ensemble size');
ylabel('Cross-validated error');
legend('AdaBoost','TotalBoost','LPBoost','Location','NE');
It appears that each boosting algorithm achieves 10% or lower loss with 50 ensemble members, and AdaBoostM1 achieves near 6% error with 150 or more ensemble members.
Compact and remove ensemble members
To reduce the ensemble sizes, compact them, and then use removeLearners. The question is, how many learners should you remove? The cross-validated loss curves give you one measure. For another, examine the learner weights for LPBoost and TotalBoost after compacting.
cada = compact(adaStump);
clp = compact(lpStump);
ctotal = compact(totalStump);
figure
subplot(2,1,1)
plot(clp.TrainedWeights)
title('LPBoost weights')
subplot(2,1,2)
plot(ctotal.TrainedWeights)
title('TotalBoost weights')
Both LPBoost and TotalBoost show clear points where the ensemble member weights become negligible.
Remove the unimportant ensemble members.
cada = removeLearners(cada,150:cada.NTrained);
clp = removeLearners(clp,60:clp.NTrained);
ctotal = removeLearners(ctotal,40:ctotal.NTrained);
Check that removing these learners does not affect ensemble accuracy on the training data.
[loss(cada,X,Y) loss(clp,X,Y) loss(ctotal,X,Y)]
ans = 0 0 0
Check the resulting compact ensemble sizes.
s(1) = whos('cada');
s(2) = whos('clp');
s(3) = whos('ctotal');
s.bytes
ans = 503987 ans = 201283 ans = 134023
The sizes of the compact ensembles are approximately proportional to the number of members in each.
Regularization is a process of choosing fewer weak learners for an ensemble in a way that does not diminish predictive performance. Currently you can regularize regression ensembles. (You can also regularize a discriminant analysis classifier in a non-ensemble context; see Regularize a Discriminant Analysis Classifier.)
The regularize method finds an optimal set of learner weights α_{t} that minimize

$$\sum_{n=1}^{N} w_n\, g\left(\left(\sum_{t=1}^{T} \alpha_t h_t(x_n)\right), y_n\right) + \lambda \sum_{t=1}^{T} \left|\alpha_t\right|.$$

Here

λ ≥ 0 is a parameter you provide, called the lasso parameter.

h_{t} is a weak learner in the ensemble trained on N observations with predictors x_{n}, responses y_{n}, and weights w_{n}.

g(f,y) = (f – y)^{2} is the squared error.

The ensemble is regularized on the same (x_{n},y_{n},w_{n}) data used for training, so

$$\sum_{n=1}^{N} w_n\, g\left(\left(\sum_{t=1}^{T} \alpha_t h_t(x_n)\right), y_n\right)$$

is the ensemble resubstitution error. The error is measured by mean squared error (MSE).
If you use λ = 0, regularize finds the weak learner weights by minimizing the resubstitution MSE. Ensembles tend to overtrain. In other words, the resubstitution error is typically smaller than the true generalization error. By making the resubstitution error even smaller, you are likely to make the ensemble accuracy worse instead of improving it. On the other hand, positive values of λ push the magnitude of the α_{t} coefficients to 0. This often improves the generalization error. Of course, if you choose λ too large, all the optimal coefficients are 0, and the ensemble does not have any accuracy. Usually you can find an optimal range for λ in which the accuracy of the regularized ensemble is better or comparable to that of the full ensemble without regularization.
A nice feature of lasso regularization is its ability to drive the optimized coefficients precisely to 0. If a learner's weight α_{t} is 0, this learner can be excluded from the regularized ensemble. In the end, you get an ensemble with improved accuracy and fewer learners.
This example uses data for predicting the insurance risk of a car based on its many attributes.
Load the imports-85 data into the MATLAB workspace:

load imports-85;
Look at a description of the data to find the categorical variables and predictor names:
Description
Description = 9×79 char array '1985 Auto Imports Database from the UCI repository ' 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names' 'Variables have been reordered to place variables with numeric values (referred ' 'to as "continuous" on the UCI site) to the left and categorical values to the ' 'right. Specifically, variables 1:16 are: symboling, normalized-losses, ' 'wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, ' 'compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, and price. ' 'Variables 17:26 are: make, fuel-type, aspiration, num-of-doors, body-style, ' 'drive-wheels, engine-location, engine-type, num-of-cylinders, and fuel-system. '
The objective of this process is to predict the "symboling," the first variable in the data, from the other predictors. "symboling" is an integer from -3 (good insurance risk) to 3 (poor insurance risk). You could use a classification ensemble to predict this risk instead of a regression ensemble. When you have a choice between regression and classification, you should try regression first.
Prepare the data for ensemble fitting:
Y = X(:,1);
X(:,1) = [];
VarNames = {'normalized-losses' 'wheel-base' 'length' 'width' 'height' ...
    'curb-weight' 'engine-size' 'bore' 'stroke' 'compression-ratio' ...
    'horsepower' 'peak-rpm' 'city-mpg' 'highway-mpg' 'price' 'make' ...
    'fuel-type' 'aspiration' 'num-of-doors' 'body-style' 'drive-wheels' ...
    'engine-location' 'engine-type' 'num-of-cylinders' 'fuel-system'};
catidx = 16:25; % indices of categorical predictors
Create a regression ensemble from the data using 300 default trees:
ls = fitensemble(X,Y,'LSBoost',300,'Tree','LearnRate',0.1,...
    'PredictorNames',VarNames,'ResponseName','Symboling',...
    'CategoricalPredictors',catidx)
ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1×25 cell} ResponseName: 'Symboling' CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25] ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.' FitInfo: [300×1 double] FitInfoDescription: {2×1 cell} Regularization: []
The final line, Regularization, is empty ([]). To regularize the ensemble, you have to use the regularize method.

Cross validate the ensemble, and plot its loss curve.
cv = crossval(ls,'KFold',5);
figure;
plot(kfoldLoss(cv,'Mode','Cumulative'));
xlabel('Number of trees');
ylabel('Cross-validated MSE');
ylim([0.2,2])
It appears you might obtain satisfactory performance from a smaller ensemble, perhaps one containing from 50 to 100 trees.
Call the regularize method to try to find trees that you can remove from the ensemble. By default, regularize examines 10 values of the lasso (Lambda) parameter spaced exponentially.
ls = regularize(ls)
ls = classreg.learning.regr.RegressionEnsemble PredictorNames: {1×25 cell} ResponseName: 'Symboling' CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25] ResponseTransform: 'none' NumObservations: 205 NumTrained: 300 Method: 'LSBoost' LearnerNames: {'Tree'} ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.' FitInfo: [300×1 double] FitInfoDescription: {2×1 cell} Regularization: [1×1 struct]
The Regularization property is no longer empty.

Plot the resubstitution mean-squared error (MSE) and number of learners with nonzero weights against the lasso parameter. Separately plot the value at Lambda = 0. Use a logarithmic scale because the values of Lambda are exponentially spaced.
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ...
    ls.Regularization.ResubstitutionMSE(1)],...
    'Marker','x','Markersize',12,'Color','b');
r0 = resubLoss(ls);
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
    [r0 r0],'Color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Resubstitution MSE');
annotation('textbox',[0.5 0.22 0.5 0.05],'String','unregularized ensemble',...
    'Color','r','FontSize',14,'LineStyle','none');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
line([1e-3 1e-3],...
    [sum(ls.Regularization.TrainedWeights(:,1)>0) ...
    sum(ls.Regularization.TrainedWeights(:,1)>0)],...
    'marker','x','markersize',12,'color','b');
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
    [ls.NTrained ls.NTrained],...
    'color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Number of learners');
annotation('textbox',[0.3 0.8 0.5 0.05],'String','unregularized ensemble',...
    'color','r','FontSize',14,'LineStyle','none');
The resubstitution MSE values are likely to be overly optimistic. To obtain more reliable estimates of the error associated with various values of Lambda, cross validate the ensemble using cvshrink. Plot the resulting cross-validation loss (MSE) and number of learners against Lambda.
rng(0,'Twister') % for reproducibility
[mse,nlearn] = cvshrink(ls,'Lambda',ls.Regularization.Lambda,'KFold',5);
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
hold on;
semilogx(ls.Regularization.Lambda,mse,'r--');
hold off;
xlabel('Lambda');
ylabel('Mean squared error');
legend('resubstitution','cross-validation','Location','NW');
line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ...
    ls.Regularization.ResubstitutionMSE(1)],...
    'Marker','x','Markersize',12,'Color','b','HandleVisibility','off');
line([1e-3 1e-3],[mse(1) mse(1)],'Marker','o',...
    'Markersize',12,'Color','r','LineStyle','--','HandleVisibility','off');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
hold;
loglog(ls.Regularization.Lambda,nlearn,'r--');
hold off;
xlabel('Lambda');
ylabel('Number of learners');
legend('resubstitution','cross-validation','Location','NE');
line([1e-3 1e-3],...
    [sum(ls.Regularization.TrainedWeights(:,1)>0) ...
    sum(ls.Regularization.TrainedWeights(:,1)>0)],...
    'Marker','x','Markersize',12,'Color','b','HandleVisibility','off');
line([1e-3 1e-3],[nlearn(1) nlearn(1)],'marker','o',...
    'Markersize',12,'Color','r','LineStyle','--','HandleVisibility','off');
Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Warning: Some folds do not have any trained weak learners. Current plot held
Examining the cross-validated error shows that the cross-validation MSE is almost flat for Lambda
up to a bit over 1e-2
.
Examine ls.Regularization.Lambda
to find the highest value that gives MSE in the flat region (up to a bit over 1e-2
):
jj = 1:length(ls.Regularization.Lambda);
[jj;ls.Regularization.Lambda]
ans =

  Columns 1 through 7

    1.0000    2.0000    3.0000    4.0000    5.0000    6.0000    7.0000
         0    0.0014    0.0033    0.0077    0.0183    0.0435    0.1031

  Columns 8 through 10

    8.0000    9.0000   10.0000
    0.2446    0.5800    1.3754
Element 5
of ls.Regularization.Lambda
has value 0.0183
, the largest in the flat range.
Reduce the ensemble size using the shrink
method. shrink
returns a compact ensemble with no training data. The generalization error for the new compact ensemble was already estimated by cross validation in mse(5)
.
cmp = shrink(ls,'weightcolumn',5)
cmp =

  classreg.learning.regr.CompactRegressionEnsemble
           PredictorNames: {1×25 cell}
             ResponseName: 'Symboling'
    CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
        ResponseTransform: 'none'
               NumTrained: 15
There are only 15 trees in the new ensemble, notably reduced from the 300 in ls
.
Compare the sizes of the ensembles:
sz(1) = whos('cmp');
sz(2) = whos('ls');
[sz(1).bytes sz(2).bytes]
ans =

      113240     2274704
The reduced ensemble is about 5% the size of the original.
Compare the MSE of the reduced ensemble to that of the original ensemble:
figure;
plot(kfoldLoss(cv,'mode','cumulative'));
hold on
plot(cmp.NTrained,mse(5),'ro','MarkerSize',12);
xlabel('Number of trees');
ylabel('Cross-validated MSE');
legend('unregularized ensemble','regularized ensemble',...
    'Location','NE');
hold off
The reduced ensemble gives low loss while using many fewer trees.
The RobustBoost
algorithm can make good classification predictions even when the training data has noise. However, the default RobustBoost
parameters can produce an ensemble that does not predict well. This example shows one way of tuning the parameters for better predictive accuracy.
Note that RobustBoost
requires an Optimization Toolbox™ license.
Generate data with label noise. This example has twenty uniform random numbers per observation, and classifies the observation as 1
if the sum of the first five numbers exceeds 2.5 (so is larger than average), and 0
otherwise:
rng(0,'twister') % for reproducibility
Xtrain = rand(2000,20);
Ytrain = sum(Xtrain(:,1:5),2) > 2.5;
To add noise, randomly switch 10% of the classifications:
idx = randsample(2000,200);
Ytrain(idx) = ~Ytrain(idx);
Create an ensemble with AdaBoostM1
for comparison purposes:
ada = fitensemble(Xtrain,Ytrain,'AdaBoostM1',...
    300,'Tree','LearnRate',0.1);
Create an ensemble with RobustBoost
. Because the data has 10% incorrect classification, perhaps an error goal of 15% is reasonable.
rb1 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
    'Tree','RobustErrorGoal',0.15,'RobustMaxMargin',1);
Note that if you set the error goal too high, the software returns an error.
Create an ensemble with a very optimistic error goal of 0.01
:
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
    'Tree','RobustErrorGoal',0.01);
Compare the resubstitution error of the three ensembles:
figure
plot(resubLoss(rb1,'Mode','Cumulative'));
hold on
plot(resubLoss(rb2,'Mode','Cumulative'),'r--');
plot(resubLoss(ada,'Mode','Cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Resubstitution error');
legend('ErrorGoal=0.15','ErrorGoal=0.01',...
    'AdaBoostM1','Location','NE');
Both RobustBoost curves show lower resubstitution error than the AdaBoostM1 curve. The curve for the error goal of 0.15 shows the lowest resubstitution error over most of the range.
See how the ensembles perform on independent data. Generate test data with the same 10% label noise, and compare the cumulative test errors of the three ensembles:

Xtest = rand(2000,20);
Ytest = sum(Xtest(:,1:5),2) > 2.5;
idx = randsample(2000,200);
Ytest(idx) = ~Ytest(idx);

figure;
plot(loss(rb1,Xtest,Ytest,'Mode','Cumulative'));
hold on
plot(loss(rb2,Xtest,Ytest,'Mode','Cumulative'),'r--');
plot(loss(ada,Xtest,Ytest,'Mode','Cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Test error');
legend('ErrorGoal=0.15','ErrorGoal=0.01',...
    'AdaBoostM1','Location','NE');
The curve for error goal 0.15 is lowest (best) in the plotted range, and AdaBoostM1 has somewhat higher error. The curve for the too-optimistic error goal 0.01 remains substantially higher (worse) than the other two algorithms for most of the plotted range.
This example shows how to use a random subspace ensemble to increase the accuracy of classification. It also shows how to use cross validation to determine good parameters for both the weak learner template and the ensemble.
Load the data
Load the ionosphere
data. This data has 351 binary responses to 34 predictors.
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
N =

   351

D =

    34

resp =

  2×1 cell array

    'b'
    'g'
Choose the number of nearest neighbors
Find a good choice for k
, the number of nearest neighbors in the classifier, by cross validation. Choose the number of neighbors approximately evenly spaced on a logarithmic scale.
rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
    knn = fitcknn(X,Y,...
        'NumNeighbors',K(k),'CrossVal','On');
    cvloss(k) = kfoldLoss(knn);
end
figure; % Plot the accuracy versus k
semilogx(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');
The lowest cross-validation error occurs for k = 2
.
Create the ensembles
Create ensembles for 2
-nearest neighbor classification with various numbers of dimensions, and examine the cross-validated loss of the resulting ensembles.
This step takes a long time. To keep track of the progress, print a message as each dimension finishes.
NPredToSample = round(linspace(1,D,10)); % linear spacing of dimensions
cvloss = zeros(numel(NPredToSample),1);
learner = templateKNN('NumNeighbors',2);
for npred=1:numel(NPredToSample)
    subspace = fitensemble(X,Y,'Subspace',100,learner,...
        'NPredToSample',NPredToSample(npred),'CrossVal','On');
    cvloss(npred) = kfoldLoss(subspace);
    fprintf('Random Subspace %i done.\n',npred);
end
figure; % plot the accuracy versus dimension
plot(NPredToSample,cvloss);
xlabel('Number of predictors selected at random');
ylabel('10 fold classification error');
title('k-NN classification with Random Subspace');
Random Subspace 1 done.
Random Subspace 2 done.
Random Subspace 3 done.
Random Subspace 4 done.
Random Subspace 5 done.
Random Subspace 6 done.
Random Subspace 7 done.
Random Subspace 8 done.
Random Subspace 9 done.
Random Subspace 10 done.
The ensembles that use five and eight predictors per learner have the lowest cross-validated error. The error rate for these ensembles is about 0.06, while the other ensembles have cross-validated error rates that are approximately 0.1 or more.
Find a good ensemble size
Find the smallest number of learners in the ensemble that still gives good classification.
ens = fitensemble(X,Y,'Subspace',100,learner,...
    'NPredToSample',5,'CrossVal','on');
figure; % Plot the accuracy versus number in ensemble
plot(kfoldLoss(ens,'Mode','Cumulative'))
xlabel('Number of learners in ensemble');
ylabel('10 fold classification error');
title('k-NN classification with Random Subspace');
There seems to be no advantage in an ensemble with more than 50 or so learners. Even 25 learners might give good predictions.
Create a final ensemble
Construct a final ensemble with 50 learners. Compact the ensemble and see if the compacted version saves an appreciable amount of memory.
ens = fitensemble(X,Y,'Subspace',50,learner,...
    'NPredToSample',5);
cens = compact(ens);
s1 = whos('ens');
s2 = whos('cens');
[s1.bytes s2.bytes] % si.bytes = size in bytes
ans =

     1728705     1499584
The compact ensemble is about 13% smaller than the full ensemble. Both give the same predictions.
Regression of Insurance Risk Rating for Car Imports Using TreeBagger
Classifying Radar Returns for Ionosphere Data Using TreeBagger
TreeBagger
ensembles
have more functionality than those constructed with fitensemble
; see TreeBagger Features Not in fitensemble. Also, some
property and method names differ from their fitensemble
counterparts.
This section contains examples of workflow for regression and classification
that use this extra TreeBagger
functionality.
In this example, use a database of 1985 car imports with 205 observations, 25 predictors, and 1 response, which is insurance risk rating, or "symboling." The first 15 variables are numeric and the last 10 are categorical. The symboling index takes integer values from -3 to 3.
Load the data set and split it into predictor and response arrays.
load imports-85
Y = X(:,1);
X = X(:,2:end);
isCategorical = [zeros(15,1); ones(size(X,2)-15,1)]; % Categorical variable flag
Because bagging uses randomized data drawings, its exact outcome depends on the initial random seed. To reproduce the results in this example, set the random number generator state as follows.
rng(1945,'twister')
Finding the Optimal Leaf Size
For regression, the general rule is to set the leaf size to 5 and select one third of the input features for decision splits at random. In the following step, verify the optimal leaf size by comparing mean squared errors obtained by regression for various leaf sizes. oobError
computes MSE versus the number of grown trees. You must set OOBPred
to 'On'
to obtain out-of-bag predictions later.
leaf = [5 10 20 50 100];
col = 'rbcmy';
figure
for i=1:length(leaf)
    b = TreeBagger(50,X,Y,'Method','R','OOBPred','On',...
        'CategoricalPredictors',find(isCategorical == 1),...
        'MinLeafSize',leaf(i));
    plot(oobError(b),col(i))
    hold on
end
xlabel 'Number of Grown Trees'
ylabel 'Mean Squared Error'
legend({'5' '10' '20' '50' '100'},'Location','NorthEast')
hold off
The red curve (leaf size 5) yields the lowest MSE values.
Estimating Feature Importance
In practical applications, you typically grow ensembles with hundreds of trees; the previous code block uses only 50 trees for faster processing. Now that you have estimated the optimal leaf size, grow a larger ensemble with 100 trees and use it to estimate feature importance.
b = TreeBagger(100,X,Y,'Method','R','OOBVarImp','On',...
    'CategoricalPredictors',find(isCategorical == 1),...
    'MinLeafSize',5);
Inspect the error curve again to make sure nothing went wrong during training.
figure
plot(oobError(b))
xlabel 'Number of Grown Trees'
ylabel 'Out-of-Bag Mean Squared Error'
Prediction ability should depend more on important features than unimportant features. You can use this idea to measure feature importance.
For each feature, permute its values across every observation in the data set and measure how much worse the MSE becomes after the permutation.
Plot the increase in MSE due to permuting out-of-bag observations across each input variable. The OOBPermutedVarDeltaError
array stores the increase in MSE averaged over all trees in the ensemble and divided by the standard deviation taken over the trees, for each variable. The larger this value, the more important the variable. Imposing an arbitrary cutoff at 0.7, you can select the four most important features.
figure
bar(b.OOBPermutedVarDeltaError)
xlabel 'Feature Number'
ylabel 'Out-of-Bag Feature Importance'
idxvar = find(b.OOBPermutedVarDeltaError>0.7)
idxCategorical = find(isCategorical(idxvar)==1);
idxvar =

     1     2    16    19
The OOBIndices
property of TreeBagger
tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure
plot(finbag)
xlabel 'Number of Grown Trees'
ylabel 'Fraction of In-Bag Observations'
Growing Trees on a Reduced Set of Features
Using just the four most powerful features, determine if it is possible to obtain a similar predictive power. To begin, grow 100 trees on these features only. The first two of the four selected features are numeric and the last two are categorical.
b5v = TreeBagger(100,X(:,idxvar),Y,'Method','R',...
    'OOBVarImp','On','CategoricalPredictors',idxCategorical,...
    'MinLeafSize',5);
figure
plot(oobError(b5v))
xlabel 'Number of Grown Trees'
ylabel 'Out-of-Bag Mean Squared Error'
figure
bar(b5v.OOBPermutedVarDeltaError)
xlabel 'Feature Index'
ylabel 'Out-of-Bag Feature Importance'
These four most powerful features give the same MSE as the full set, and the ensemble trained on the reduced set ranks these features similarly to each other. If you remove features 1 and 2 from the reduced set, then the predictive power of the algorithm might not decrease significantly.
Finding Outliers
To find outliers in the training data, compute the proximity matrix using fillProximities
.
b5v = fillProximities(b5v);
fillProximities also computes an outlier measure for each observation. The method normalizes this measure by subtracting the mean outlier measure for the entire sample, taking the magnitude of this difference, and dividing the result by the median absolute deviation for the entire sample.
figure
histogram(b5v.OutlierMeasure)
xlabel 'Outlier Measure'
ylabel 'Number of Observations'
Discovering Clusters in the Data
By applying multidimensional scaling to the computed matrix of proximities, you can inspect the structure of the input data and look for possible clusters of observations. The mdsProx
method returns scaled coordinates and eigenvalues for the computed proximity matrix. If you run it with the Colors
name-value-pair argument, then this method creates a scatter plot of two scaled coordinates.
figure(8)
[~,e] = mdsProx(b5v,'Colors','K');
xlabel 'First Scaled Coordinate'
ylabel 'Second Scaled Coordinate'
Assess the relative importance of the scaled axes by plotting the first 20 eigenvalues.
figure
bar(e(1:20))
xlabel 'Scaled Coordinate Index'
ylabel 'Eigenvalue'
Saving the Ensemble Configuration for Future Use
To use the trained ensemble for predicting the response on unseen data, store the ensemble to disk and retrieve it later. If you do not want to compute predictions for out-of-bag data or reuse training data in any other way, there is no need to store the ensemble object itself. Saving the compact version of the ensemble is enough in this case. Extract the compact object from the ensemble.
c = compact(b5v)
c =

  CompactTreeBagger
  Ensemble with 100 bagged decision trees:
             Method: regression
      NumPredictors: 4
You can save the resulting CompactTreeBagger
model in a *.mat
file.
You can also use ensembles of decision trees for classification. For this example, use ionosphere data with 351 observations and 34 real-valued predictors. The response variable is categorical with two levels:
'g' represents good radar returns.
'b' represents bad radar returns.
The goal is to predict good or bad returns using a set of 34 measurements.
Fix the initial random seed, grow 50 trees, inspect how the ensemble error changes with accumulation of trees, and estimate feature importance. For classification, it is best to set the minimal leaf size to 1 and select the square root of the total number of features for each decision split at random. These settings are defaults for TreeBagger
used for classification.
load ionosphere
rng(1945,'twister')
b = TreeBagger(50,X,Y,'OOBVarImp','On');
figure
plot(oobError(b))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Classification Error')
The method trains ensembles with few trees on observations that are in bag for all trees. For such observations, it is impossible to compute the true out-of-bag prediction, and TreeBagger
returns the most probable class for classification and the sample mean for regression. You can change the default value returned for in-bag observations using the DefaultYfit
property. If you set the default value to an empty character vector for classification, the method excludes in-bag observations from computation of the out-of-bag error. In this case, the curve is more variable when the number of trees is small, either because some observations are never out of bag (and are therefore excluded) or because their predictions are based on few trees.
b.DefaultYfit = '';
figure
plot(oobError(b))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Error Excluding In-Bag Observations')
The OOBIndices
property of TreeBagger
tracks which observations are out of bag for what trees. Using this property, you can monitor the fraction of observations in the training data that are in bag for all trees. The curve starts at approximately 2/3, which is the fraction of unique observations selected by one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
    finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure
plot(finbag)
xlabel('Number of Grown Trees')
ylabel('Fraction of In-Bag Observations')
Estimate feature importance.
figure
bar(b.OOBPermutedVarDeltaError)
xlabel('Feature Index')
ylabel('Out-of-Bag Feature Importance')
Select the features yielding an importance measure greater than 0.75. This threshold is chosen arbitrarily.
idxvar = find(b.OOBPermutedVarDeltaError>0.75)
idxvar =

     3     5     6     7     8    27
Having selected the most important features, grow a larger ensemble on the reduced feature set. Save time by not permuting out-of-bag observations to obtain new estimates of feature importance for the reduced feature set (set OOBVarImp
to 'off'
). You would still be interested in obtaining out-of-bag estimates of classification error (set OOBPred
to 'on'
).
b5v = TreeBagger(100,X(:,idxvar),Y,'OOBVarImp','off','OOBPred','on');
figure
plot(oobError(b5v))
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Classification Error')
For classification ensembles, in addition to classification error (fraction of misclassified observations), you can also monitor the average classification margin. For each observation, the margin is defined as the difference between the score for the true class and the maximal score for other classes predicted by this tree. The cumulative classification margin uses the scores averaged over all trees and the mean cumulative classification margin is the cumulative margin averaged over all observations. The oobMeanMargin
method with the 'mode'
argument set to 'cumulative'
(default) shows how the mean cumulative margin changes as the ensemble grows: every new element in the returned array represents the cumulative margin obtained by including a new tree in the ensemble. If training is successful, you would expect to see a gradual increase in the mean classification margin.
For decision trees, a classification score is the probability of observing an instance of this class in this tree leaf. For example, if the leaf of a grown decision tree has five 'good'
and three 'bad'
training observations in it, the scores returned by this decision tree for any observation that lands on this leaf are 5/8 for the 'good'
class and 3/8 for the 'bad'
class. These probabilities are called 'scores'
for consistency with other classifiers that might not have an obvious interpretation for numeric values of returned predictions.
figure
plot(oobMeanMargin(b5v));
xlabel('Number of Grown Trees')
ylabel('Out-of-Bag Mean Classification Margin')
Compute the matrix of proximities and examine the distribution of outlier measures. Unlike regression, outlier measures for classification ensembles are computed within each class separately.
b5v = fillProximities(b5v);
figure
histogram(b5v.OutlierMeasure)
xlabel('Outlier Measure')
ylabel('Number of Observations')
In this run, all of the extreme outliers (outlier measure greater than 40) are labeled 'g', the good class.
extremeOutliers = b5v.Y(b5v.OutlierMeasure>40)
percentBad = 100*sum(strcmp(extremeOutliers,'b'))/numel(extremeOutliers)
extremeOutliers =

  3×1 cell array

    'g'
    'g'
    'g'

percentBad =

     0
As for regression, you can plot scaled coordinates, displaying the two classes in different colors using the 'Colors' name-value pair argument of mdsProx
. This argument takes a character vector in which every character represents a color. The software does not rank class names. Therefore, it is best practice to determine the position of the classes in the ClassNames
property of the ensemble.
gPosition = find(strcmp('g',b5v.ClassNames))
gPosition =

     2
The 'bad'
class is first and the 'good'
class is second. Display scaled coordinates using red for the 'bad'
class and blue for the 'good'
class observations.
figure
[s,e] = mdsProx(b5v,'Colors','rb');
xlabel('First Scaled Coordinate')
ylabel('Second Scaled Coordinate')
Plot the first 20 eigenvalues obtained by scaling. The first eigenvalue clearly dominates and the first scaled coordinate is most important.
figure
bar(e(1:20))
xlabel('Scaled Coordinate Index')
ylabel('Eigenvalue')
Another way of exploring the performance of a classification ensemble is to plot its Receiver Operating Characteristic (ROC) curve or another performance curve suitable for the current problem. Obtain predictions for out-of-bag observations. For a classification ensemble, the oobPredict
method returns a cell array of classification labels as the first output argument and a numeric array of scores as the second output argument. The returned array of scores has two columns, one for each class. In this case, the first column is for the 'bad'
class and the second column is for the 'good'
class. One column in the score matrix is redundant because the scores represent class probabilities in tree leaves and by definition add up to 1.
[Yfit,Sfit] = oobPredict(b5v);
Use perfcurve
to compute a performance curve. By default, perfcurve
returns the standard ROC curve, which is the true positive rate versus the false positive rate. perfcurve
requires true class labels, scores, and the positive class label for input. In this case, choose the 'good'
class as positive.
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,gPosition),'g');
figure
plot(fpr,tpr)
xlabel('False Positive Rate')
ylabel('True Positive Rate')
Instead of the standard ROC curve, you might want to plot, for example, ensemble accuracy versus threshold on the score for the 'good'
class. The ycrit
input argument of perfcurve
lets you specify the criterion for the y
-axis, and the third output argument of perfcurve
returns an array of thresholds for the positive class score. Accuracy is the fraction of correctly classified observations, or equivalently, 1 minus the classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,gPosition),'g','YCrit','Accu');
figure(20)
plot(thre,accu)
xlabel('Threshold for ''good'' Returns')
ylabel('Classification Accuracy')
The curve shows a flat region indicating that any threshold from 0.2 to 0.6 is a reasonable choice. By default, perfcurve assigns classification labels using 0.5 as the boundary between the two classes. You can check exactly what accuracy this threshold corresponds to:
accu(abs(thre-0.5)<eps)
ans =

  0×1 empty double column vector
No computed threshold is exactly 0.5, so the query returns an empty vector. Find the maximal accuracy instead:
[maxaccu,iaccu] = max(accu)
maxaccu =

    0.9459

iaccu =

   105
The optimal threshold is therefore:
thre(iaccu)
ans =

    0.4503
AdaBoostM1
is a very popular boosting algorithm
for binary classification. The algorithm trains learners sequentially.
For every learner with index t, AdaBoostM1
computes
the weighted classification error
$${\epsilon}_{t}={\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}\mathbb{I}\left({y}_{n}\ne {h}_{t}\left({x}_{n}\right)\right),$$
where
x_{n} is a vector of predictor values for observation n.
y_{n} is the true class label.
h_{t} is the prediction of learner (hypothesis) with index t.
$$\mathbb{I}$$ is the indicator function.
$${d}_{n}^{\left(t\right)}$$ is the weight of observation n at step t.
AdaBoostM1
then increases weights for observations
misclassified by learner t and reduces weights
for observations correctly classified by learner t.
The next learner t + 1 is then trained on the data with
updated weights ${d}_{n}^{\left(t+1\right)}$.
After training finishes, AdaBoostM1
computes
prediction for new data using
$f\left(x\right)={\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left(x\right),$
where
${\alpha}_{t}=\frac{1}{2}\mathrm{log}\frac{1-{\epsilon}_{t}}{{\epsilon}_{t}}$
are weights of the weak hypotheses in the ensemble.
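The following minimal numeric sketch (an illustration, not the toolbox implementation) shows one such boosting step: computing the weighted error ε_t, the hypothesis weight α_t, and the updated observation weights.

% One AdaBoostM1 step on toy data (illustration only)
y    = [1; 1; -1; -1; 1];          % true class labels in {-1,+1}
pred = [1; -1; -1; -1; -1];        % weak learner predictions h_t(x_n)
d    = ones(5,1)/5;                % observation weights d_n^(t), summing to 1

miss  = (pred ~= y);               % indicator I(y_n ~= h_t(x_n))
eps_t = sum(d(miss));              % weighted classification error
alpha = 0.5*log((1-eps_t)/eps_t);  % weight alpha_t of this weak hypothesis
d = d .* exp(alpha*(2*miss - 1));  % raise weights of misclassified observations
d = d / sum(d)                     % renormalized weights d_n^(t+1)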
Training by AdaBoostM1
can be viewed as stagewise
minimization of the exponential loss
$${\displaystyle \sum}_{n=1}^{N}{w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({x}_{n}\right)\right),$$
where
y_{n} ∊ {–1,+1} is the true class label.
w_{n} are observation weights normalized to add up to 1.
f(x_{n}) ∊ (–∞,+∞) is the predicted classification score.
The observation weights w_{n} are
the original observation weights you passed to fitensemble
.
The second output from the predict method of an AdaBoostM1 classification ensemble is an N-by-2 matrix of classification scores for the two classes and N observations. predict returns two scores to be consistent with multiclass models, though this is redundant: the second column is always the negative of the first.
Most often AdaBoostM1
is used with decision
stumps (default) or shallow trees. If boosted stumps give poor performance,
try setting the minimal parent node size to one quarter of the training
data.
By default, the learning rate for boosting algorithms is 1
.
If you set the learning rate to a lower number, the ensemble learns
at a slower rate, but can converge to a better solution. 0.1
is
a popular choice for the learning rate. Learning at a rate less than 1
is
often called "shrinkage".
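For example, this sketch (using the ionosphere data that ships with the toolbox) trains a boosted ensemble with shrinkage:

load ionosphere % binary classification data
ens = fitensemble(X,Y,'AdaBoostM1',300,'Tree','LearnRate',0.1);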
For examples using AdaBoostM1
, see Tune RobustBoost.
For references related to AdaBoostM1
, see
Freund and Schapire [20],
Schapire et al. [40], Friedman, Hastie,
and Tibshirani [22], and Friedman [21].
AdaBoostM2
is an extension of AdaBoostM1
for
multiple classes. Instead of weighted classification error, AdaBoostM2
uses
weighted pseudo-loss for N observations and K classes
${\epsilon}_{t}=\frac{1}{2}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{k\ne {y}_{n}}{d}_{n,k}^{\left(t\right)}\left(1-{h}_{t}\left({x}_{n},{y}_{n}\right)+{h}_{t}\left({x}_{n},k\right)\right),$
where
h_{t}(x_{n},k) is the confidence of prediction by learner at step t into class k ranging from 0 (not at all confident) to 1 (highly confident).
${d}_{n,k}^{\left(t\right)}$ are observation weights at step t for class k.
y_{n} is the true class label taking one of the K values.
The second sum is over all classes other than the true class y_{n}.
Interpreting the pseudo-loss is harder than classification error,
but the idea is the same. Pseudo-loss can be used as a measure of
the classification accuracy from any learner in an ensemble. Pseudo-loss
typically exhibits the same behavior as a weighted classification
error for AdaBoostM1
: the first few learners in
a boosted ensemble give low pseudo-loss values. After the first few
training steps, the ensemble begins to learn at a slower pace, and
the pseudo-loss value approaches 0.5 from below.
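As a sketch, this code (using the Fisher iris data, a three-class problem) trains an AdaBoostM2 ensemble and plots the cumulative resubstitution error as the ensemble grows:

load fisheriris                          % three classes
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree');
figure
plot(resubLoss(ens,'Mode','Cumulative')) % error versus number of trees
xlabel('Number of trees')
ylabel('Resubstitution error')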
For examples using AdaBoostM2
, see Train Classification Ensemble.
For references related to AdaBoostM2
, see
Freund and Schapire [20].
Bagging, which stands for "bootstrap
aggregation," is a type of ensemble learning. To bag a weak
learner such as a decision tree on a dataset, generate many bootstrap
replicas of this dataset and grow decision trees on these replicas.
Obtain each bootstrap replica by randomly selecting N
observations
out of N
with replacement, where N
is
the dataset size. To find the predicted response of a trained ensemble,
take an average over predictions from individual trees.
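As a quick sketch of the resampling, this code draws one bootstrap replica and checks the fraction of unique observations it contains (about 1 – 1/e, or roughly 63%):

N = 1000;                         % dataset size
idx = randsample(N,N,true);       % draw N out of N with replacement
fracUnique = numel(unique(idx))/N % approximately 0.63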
Bagged decision trees were introduced in MATLAB® R2009a as TreeBagger.
The fitensemble
function lets
you bag in a manner consistent with boosting. An ensemble of bagged
trees, either ClassificationBaggedEnsemble
or RegressionBaggedEnsemble
,
returned by fitensemble
offers almost the same
functionality as TreeBagger
. Discrepancies between TreeBagger
and
the new framework are described in detail in TreeBagger Features Not in fitensemble.
Bagging works by training learners on resampled versions of the data. This resampling is usually done by bootstrapping observations, that is, selecting N out of N observations with replacement for every new learner. In addition, every tree in the ensemble can randomly select predictors for decision splits—a technique known to improve the accuracy of bagged trees.
By default, the minimal leaf sizes for bagged trees are set
to 1
for classification and 5
for
regression. Trees grown with the default leaf size are usually very
deep. These settings are close to optimal for the predictive power
of an ensemble. Often you can grow trees with larger leaves without
losing predictive power. Doing so reduces training and prediction
time, as well as memory usage for the trained ensemble.
Another important parameter is the number of predictors selected at random for every decision split. This random selection is made for every split, and every deep tree involves many splits. By default, this parameter is set to a square root of the number of predictors for classification, and one third of predictors for regression.
Several features of bagged decision trees make them a unique
algorithm. Drawing N
out of N
observations
with replacement omits on average 37% of observations for each decision
tree. These are "out-of-bag" observations. You can use
them to estimate the predictive power and feature importance. For
each observation, you can estimate the out-of-bag prediction by averaging
over predictions from all trees in the ensemble for which this observation
is out of bag. You can then compare the computed prediction against
the observed response for this observation. By comparing the out-of-bag
predicted responses against the observed responses for all observations
used for training, you can estimate the average out-of-bag error.
This out-of-bag average is an unbiased estimator of the true ensemble
error. You can also obtain out-of-bag estimates of feature importance
by randomly permuting out-of-bag data across one variable or column
at a time and estimating the increase in the out-of-bag error due
to this permutation. The larger the increase, the more important the
feature. Thus, you need not supply test data for bagged ensembles
because you obtain reliable estimates of the predictive power and
feature importance in the process of training, which is an attractive
feature of bagging.
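For example, this sketch (using the ionosphere data) estimates the ensemble error from out-of-bag observations alone, with no separate test set:

load ionosphere
ens = fitensemble(X,Y,'Bag',100,'Tree','Type','classification');
figure
plot(oobLoss(ens,'Mode','Cumulative')) % out-of-bag error versus number of trees
xlabel('Number of trees')
ylabel('Out-of-bag classification error')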
Another attractive feature of bagged decision trees is the proximity matrix. Every time two observations land on the same leaf of a tree, their proximity increases by 1. For normalization, sum these proximities over all trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix for finding outlier observations and discovering clusters in the data through multidimensional scaling.
For examples using bagging, see:
Regression of Insurance Risk Rating for Car Imports Using TreeBagger
Classifying Radar Returns for Ionosphere Data Using TreeBagger
For references related to bagging, see Breiman [8], [9], and [10].
Comparison of TreeBagger and Bagged Ensembles. fitensemble
produces bagged
ensembles that have most, but not all, of the functionality of TreeBagger
objects. Additionally,
some functionality has different names in the new bagged ensembles.
TreeBagger Features Not in fitensemble

| Feature | TreeBagger Property | TreeBagger Method |
| --- | --- | --- |
| Computation of proximity matrix | Proximity | fillprox, mdsprox |
| Computation of outliers | OutlierMeasure | N/A |
| Out-of-bag estimates of predictor importance | OOBPermutedPredictorDeltaMeanMargin and OOBPermutedPredictorCountRaiseMargin | N/A |
| Merging two ensembles trained separately | N/A | append |
| Parallel computation for creating ensemble | Set the UseParallel name-value pair to true | N/A |

Note: When you estimate the proximity matrix and outliers of a …
Differing Names Between TreeBagger and Bagged Ensembles

| Feature | TreeBagger | Bagged Ensembles |
| --- | --- | --- |
| Split criterion contributions for each predictor | DeltaCriterionDecisionSplit property | First output of predictorImportance (classification) or predictorImportance (regression) |
| Predictor associations | SurrogateAssociation property | Second output of predictorImportance (classification) or predictorImportance (regression) |
| Out-of-bag estimates of predictor importance | OOBPermutedPredictorDeltaError property | Output of oobPermutedPredictorImportance (classification) or oobPermutedPredictorImportance (regression) |
| Error (misclassification probability or mean-squared error) | error and oobError methods | loss and oobLoss methods (classification); loss and oobLoss methods (regression) |
| Train additional trees and add to ensemble | growTrees method | resume method (classification); resume method (regression) |
| Mean classification margin per tree | meanMargin and oobMeanMargin methods | edge and oobEdge methods (classification) |
In addition, two important changes were made to training and prediction for bagged classification ensembles:
If you pass a misclassification cost matrix to TreeBagger
,
it passes this matrix along to the trees. If you pass a misclassification
cost matrix to fitensemble
, it uses this matrix
to adjust the class prior probabilities. fitensemble
then
passes the adjusted prior probabilities and the default cost matrix
to the trees. The default cost matrix is ones(K)-eye(K) for K classes, as shown in the snippet after this list.
Unlike the loss
and edge
methods
in the new framework, the TreeBagger
error
and meanMargin
methods
do not normalize input observation weights by the prior probabilities of the respective class.
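To illustrate the first point, the default cost matrix that fitensemble passes to the trees has zero cost on the diagonal and unit cost elsewhere:

K = 3;            % number of classes
ones(K) - eye(K)  % default misclassification cost matrix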
GentleBoost
(also known as Gentle AdaBoost)
combines features of AdaBoostM1
and LogitBoost
.
Like AdaBoostM1
, GentleBoost
minimizes
the exponential loss. But its numeric optimization is set up differently.
Like LogitBoost
, every weak learner fits a regression
model to response values y_{n} ∊ {–1,+1}.
This makes GentleBoost
another good candidate for
binary classification of data with multilevel categorical predictors.
fitensemble
computes and
stores the mean-squared error in the FitInfo
property
of the ensemble object. The mean-squared error is
$${\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$$
where
${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).
h_{t}(x_{n}) are predictions of the regression model h_{t} fitted to response values y_{n}.
As the strength of individual learners weakens, the weighted mean-squared error approaches 1.
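As a sketch, you can train a GentleBoost ensemble on the ionosphere data and plot the stored weighted mean-squared error at each boosting step:

load ionosphere
ens = fitensemble(X,Y,'GentleBoost',100,'Tree');
figure
plot(ens.FitInfo)          % weighted MSE stored at each boosting step
xlabel('Number of trees')
ylabel('Weighted mean-squared error')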
For examples using GentleBoost
, see Train Ensemble With Unequal Classification Costs and Classification with Many Categorical Levels.
For references related to GentleBoost
, see
Friedman, Hastie, and Tibshirani [22].
LogitBoost
is another popular algorithm for
binary classification. LogitBoost
works similarly
to AdaBoostM1
, except it minimizes binomial deviance
$${\displaystyle \sum}_{n=1}^{N}{w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({x}_{n}\right)\right)\right),$$
where
y_{n} ∊ {–1,+1} is the true class label.
w_{n} are observation weights normalized to add up to 1.
f(x_{n}) ∊ (–∞,+∞) is the predicted classification score.
Binomial deviance assigns less weight to badly misclassified
observations (observations with large negative values of y_{n}f(x_{n})). LogitBoost
can
give better average accuracy than AdaBoostM1
for
data with poorly separable classes.
Learner t in a LogitBoost
ensemble
fits a regression model to response values
${\tilde{y}}_{n}=\frac{{y}_{n}^{*}-{p}_{t}\left({x}_{n}\right)}{{p}_{t}\left({x}_{n}\right)\left(1-{p}_{t}\left({x}_{n}\right)\right)},$
where
y*_{n} ∊ {0,+1} are relabeled classes (0 instead of –1).
p_{t}(x_{n}) is the current ensemble estimate of the probability for observation x_{n} to be of class 1.
Fitting a regression model at each boosting step turns into
a great computational advantage for data with multilevel categorical
predictors. Take a categorical predictor with L levels.
To find the optimal decision split on such a predictor, a classification tree needs to consider 2^{L–1} – 1 splits.
A regression tree needs to consider only L – 1 splits, so the processing time can be much shorter. LogitBoost
is
recommended for categorical predictors with many levels.
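This quick computation shows the gap for a predictor with L = 10 levels:

L = 10;
classificationSplits = 2^(L-1) - 1 % 511 candidate splits
regressionSplits = L - 1           % 9 candidate splits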
fitensemble
computes and
stores the mean-squared error in the FitInfo
property
of the ensemble object. The mean-squared error is
$${\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$$
where
${d}_{n}^{\left(t\right)}$ are observation weights at step t (the weights add up to 1).
h_{t}(x_{n}) are predictions of the regression model h_{t} fitted to response values ${\tilde{y}}_{n}$.
Values ${\tilde{y}}_{n}$ can range from –∞ to +∞, so the mean-squared error does not have well-defined bounds.
For examples using LogitBoost
, see Classification with Many Categorical Levels.
For references related to LogitBoost
, see
Friedman, Hastie, and Tibshirani [22].
LPBoost
(linear programming boost), like TotalBoost
,
performs multiclass classification by attempting to maximize the minimal margin in
the training set. This attempt uses optimization algorithms, namely
linear programming for LPBoost
. So you need an Optimization Toolbox license
to use LPBoost
or TotalBoost
.
The margin of a classification is the difference between the
predicted soft classification score for the
true class, and the largest score for the false classes. For trees,
the score of a classification of a leaf node
is the posterior probability of the classification at that node. The
posterior probability of the classification at a node is the number
of training sequences that lead to that node with the classification,
divided by the number of training sequences that lead to that node.
For more information, see Definitions in margin
.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [41] establish this inequality on the probability of obtaining a negative margin:
$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$
Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.
LPBoost
iteratively maximizes the minimal
margin through a sequence of linear programming problems. Equivalently,
by duality, LPBoost
minimizes the maximal edge,
where edge is the weighted mean margin (see Definitions). At each iteration, there are more constraints
in the problem. So, for large problems, the optimization problem becomes
increasingly constrained, and slow to solve.
LPBoost
typically creates ensembles with
many learners having weights that are orders of magnitude smaller
than those of other learners. Therefore, to better enable you to remove
the unimportant ensemble members, the compact
method
reorders the members of an LPBoost
ensemble from
largest weight to smallest. Therefore, you can easily remove the least
important members of the ensemble using the removeLearners
method.
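For instance, this sketch (which requires Optimization Toolbox; the choice of keeping 10 members is arbitrary) trains an LPBoost ensemble and discards all but its 10 largest-weight members:

load ionosphere
ens  = fitensemble(X,Y,'LPBoost',100,'Tree');
cens = compact(ens);  % members reordered from largest weight to smallest
cens = removeLearners(cens,11:cens.NTrained); % keep the 10 most important members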
For examples using LPBoost
, see LPBoost and TotalBoost for Small Ensembles.
For references related to LPBoost
, see Warmuth,
Liao, and Ratsch [44].
LSBoost
(least squares boosting) fits regression
ensembles. At every step, the ensemble fits a new learner to the difference
between the observed response and the aggregated prediction of all
learners grown previously. The ensemble fits to minimize mean-squared
error.
You can use LSBoost
with shrinkage by passing
in the LearnRate
parameter. By default this parameter
is set to 1
, and the ensemble learns at the maximal
speed. If you set LearnRate
to a value from 0
to 1
,
the ensemble fits every new learner to y_{n} – ηf(x_{n}),
where
y_{n} is the observed response.
f(x_{n}) is the aggregated prediction from all weak learners grown so far for observation x_{n}.
η is the learning rate.
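For example, this sketch fits a least-squares boosted ensemble to the car import data used earlier, with a learning rate of 0.1:

load imports-85
Y1 = X(:,1);      % symboling response, as in the earlier example
X1 = X(:,2:end);
ens = fitensemble(X1,Y1,'LSBoost',300,'Tree','LearnRate',0.1);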
For examples using LSBoost
, see Train Regression Ensemble and Regularize a Regression Ensemble.
For references related to LSBoost
, see Hastie,
Tibshirani, and Friedman [24], Chapters 7 (Model Assessment and Selection) and 15
(Random Forests, see also [9]).
Boosting algorithms such as AdaBoostM1
and LogitBoost
increase
weights for misclassified observations at every boosting step. These
weights can become very large. If this happens, the boosting algorithm
sometimes concentrates on a few misclassified observations and neglects
the majority of training data. Consequently the average classification
accuracy suffers.
In this situation, you can try using RobustBoost
.
This algorithm does not assign almost the entire data weight to badly
misclassified observations. It can produce better average classification
accuracy.
Unlike AdaBoostM1
and LogitBoost
, RobustBoost
does
not minimize a specific loss function. Instead, it maximizes the number
of observations with the classification margin above a certain threshold.
RobustBoost
trains based on time evolution.
The algorithm starts at t = 0. At every step, RobustBoost
solves
an optimization problem to find a positive step in time Δt and
a corresponding positive change in the average margin for training
data Δm. RobustBoost
stops
training and exits if at least one of these three conditions is true:
Time t reaches 1.
RobustBoost
cannot find a solution
to the optimization problem with positive updates Δt and
Δm.
RobustBoost
grows as many learners
as you requested.
Results from RobustBoost are usable under any of these termination conditions. Estimate the classification accuracy by
cross validation or by using an independent test set.
To get better classification accuracy from RobustBoost
,
you can adjust three parameters in fitensemble
: RobustErrorGoal
, RobustMaxMargin
,
and RobustMarginSigma
. Start by varying values
for RobustErrorGoal
from 0 to 1. The maximal allowed
value for RobustErrorGoal
depends on the two other
parameters. If you pass a value that is too high, fitensemble
produces
an error message showing the allowed range for RobustErrorGoal
.
For examples using RobustBoost
, see Tune RobustBoost.
For references related to RobustBoost
, see
Freund [19].
RUSBoost
is especially effective at classifying
imbalanced data, meaning some class in the training data has many
fewer members than another. RUS stands for Random Under Sampling.
The algorithm takes N, the number of members in
the class with the fewest members in the training data, as the basic
unit for sampling. Classes with more members are under sampled by
taking only N observations of every class. In other
words, if there are K classes, then, for each weak
learner in the ensemble, RUSBoost
takes a subset
of the data with N observations from each of the K classes.
The boosting procedure follows the procedure in AdaBoostM2 for reweighting and constructing the ensemble.
When you construct a RUSBoost
ensemble, there
is an optional name-value pair called RatioToSmallest
.
Give a vector of K values, each value representing
the multiple of N to sample for the associated
class. For example, if the smallest class has N =
100 members, then RatioToSmallest
= [2,3,4]
means
each weak learner has 200 members in class 1, 300 in class 2, and
400 in class 3. If RatioToSmallest
leads to a value
that is larger than the number of members in a particular class, then RUSBoost
samples
the members with replacement. Otherwise, RUSBoost
samples
the members without replacement.
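For example, this sketch applies RatioToSmallest = [2,3,4] to synthetic three-class data in which the smallest class has N = 100 members, so each weak learner trains on 200, 300, and 400 observations from classes 1, 2, and 3:

rng(1) % for reproducibility
X = [randn(100,2); randn(300,2)+2; randn(400,2)-2]; % class sizes 100, 300, 400
Y = [ones(100,1); 2*ones(300,1); 3*ones(400,1)];
rus = fitensemble(X,Y,'RUSBoost',50,'Tree',...
    'RatioToSmallest',[2,3,4],'LearnRate',0.1);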
For examples using RUSBoost
, see Classification with Imbalanced Data.
For references related to RUSBoost
, see Seiffert et al. [43].
Use random subspace ensembles (Subspace
)
to improve the accuracy of discriminant analysis (ClassificationDiscriminant
)
or k-nearest neighbor (ClassificationKNN
)
classifiers. Subspace
ensembles also have the advantage
of using less memory than ensembles with all predictors, and can handle
missing values (NaN
s).
The basic random subspace algorithm uses these parameters.
m is the number of dimensions (variables)
to sample in each learner. Set m using the NPredToSample
name-value
pair.
d is the number of dimensions in
the data, which is the number of columns (predictors) in the data
matrix X
.
n is the number of learners in
the ensemble. Set n using the NLearn
input.
The basic random subspace algorithm performs the following steps:
Choose without replacement a random set of m predictors from the d possible values.
Train a weak learner using just the m chosen predictors.
Repeat steps 1 and 2 until there are n weak learners.
Predict by taking an average of the score
prediction
of the weak learners, and classify the category with the highest average score
.
You can choose to create a weak learner for every possible set
of m predictors from the d dimensions.
To do so, set n, the number of learners, to 'AllPredictorCombinations'
.
In this case, there are nchoosek(size(X,2),NPredToSample)
weak
learners in the ensemble.
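For example, this sketch trains one 2-nearest-neighbor learner for every possible pair of the 34 ionosphere predictors, giving nchoosek(34,2) = 561 weak learners:

load ionosphere                          % 34 predictors
learner = templateKNN('NumNeighbors',2);
ens = fitensemble(X,Y,'Subspace','AllPredictorCombinations',learner,...
    'NPredToSample',2);
nLearners = nchoosek(size(X,2),2)        % 561 weak learners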
fitensemble
downweights
predictors after choosing them for a learner, so subsequent learners
have a lower chance of using a predictor that was previously used.
This weighting tends to make predictors more evenly distributed among
learners than in uniform weighting.
For examples using Subspace
, see Random Subspace Classification.
For references related to random subspace ensembles, see Ho [26].
TotalBoost
, like linear programming boost
(LPBoost
), performs multiclass classification by
attempting to maximize the minimal margin in
the training set. This attempt uses optimization algorithms, namely
quadratic programming for TotalBoost
. So you need
an Optimization Toolbox license to use LPBoost
or TotalBoost
.
The margin of a classification is the difference between the
predicted soft classification score for the
true class, and the largest score for the false classes. For trees,
the score of a classification of a leaf node
is the posterior probability of the classification at that node. The
posterior probability of the classification at a node is the number
of training sequences that lead to that node with the classification,
divided by the number of training sequences that lead to that node.
For more information, see Definitions in margin
.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [41] establish this inequality on the probability of obtaining a negative margin:
$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$
Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: To obtain a low generalization error, minimize the number of observations below margin θ in the training set.
TotalBoost
minimizes a proxy of the Kullback-Leibler
divergence between the current weight distribution and the initial
weight distribution, subject to the constraint that the edge (the
weighted margin) is below a certain value. The proxy is a quadratic
expansion of the divergence:
$$D(W,{W}_{0})={\displaystyle \sum _{n=1}^{N}\mathrm{log}\frac{W(n)}{{W}_{0}(n)}}\approx {\displaystyle \sum _{n=1}^{N}\left(1+\frac{W(n)}{{W}_{0}(n)}\right)\Delta +\frac{1}{2W(n)}{\Delta}^{2}},$$
where Δ is the difference between W(n), the weights at the current and next iteration, and W_{0}, the initial weight distribution, which is uniform. This optimization formulation keeps weights from becoming zero. At each iteration, there are more constraints in the problem. So, for large problems, the optimization problem becomes increasingly constrained, and slow to solve.
TotalBoost
typically creates ensembles with
many learners having weights that are orders of magnitude smaller
than those of other learners. Therefore, to better enable you to remove
the unimportant ensemble members, the compact
method
reorders the members of a TotalBoost
ensemble from
largest weight to smallest. Therefore you can easily remove the least
important members of the ensemble using the removeLearners
method.
For examples using TotalBoost
, see LPBoost and TotalBoost for Small Ensembles.
For references related to TotalBoost
, see
Warmuth, Liao, and Ratsch [44].