fitrchains

Multiresponse regression with regression chains

Since R2024b

Syntax

Mdl = fitrchains(Tbl,ResponseVarNames)

Mdl = fitrchains(Tbl,formula)

Mdl = fitrchains(X,Y)

Mdl = fitrchains(___,Name=Value)

Description

Mdl = fitrchains(Tbl,ResponseVarNames) returns a trained multiresponse regression model Mdl by using regression chains. The function trains the model using the predictors in the table Tbl and the response values in the ResponseVarNames table variables. For more information, see Regression Chains.

Mdl = fitrchains(Tbl,formula) returns a regression model trained using the sample data in the table Tbl. The input argument formula is an explanatory model of the responses and a subset of the predictor variables in Tbl used to fit Mdl.

Mdl = fitrchains(X,Y) returns a regression model trained using the predictor variables in the table or matrix X and the response values in the table or matrix Y.

Mdl = fitrchains(___,Name=Value) specifies options using one or more name-value arguments in addition to any of the input argument combinations in previous syntaxes. For example, you can specify the type of model to use in the regression chains by setting the Learner name-value argument.

Examples

Train Multiresponse Regression Model with Regression Chains

Create a regression model with more than one response variable by using fitrchains.

Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Displacement, Horsepower, and so on, as well as the response variables Acceleration and MPG. Display the first eight rows of the table.

load carbig
cars = table(Displacement,Horsepower,Model_Year, ...
    Origin,Weight,Acceleration,MPG);
head(cars)

    Displacement    Horsepower    Model_Year    Origin     Weight    Acceleration    MPG
    ____________    __________    __________    _______    ______    ____________    ___

        307            130            70        USA         3504           12        18 
        350            165            70        USA         3693         11.5        15 
        318            150            70        USA         3436           11        18 
        304            150            70        USA         3433           12        16 
        302            140            70        USA         3449         10.5        17 
        429            198            70        USA         4341           10        15 
        454            220            70        USA         4354            9        14 
        440            215            70        USA         4312          8.5        14

Categorize the cars based on whether they were made in the USA.

cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan",...
    "Germany","Sweden","Italy","England"],"NotUSA");

Partition the data into training and test sets. Use approximately 85% of the observations to train a multiresponse model, and 15% of the observations to test the performance of the trained model on new data. Use cvpartition to partition the data.

rng("default") % For reproducibility
c = cvpartition(height(cars),"Holdout",0.15);
carsTrain = cars(training(c),:);
carsTest = cars(test(c),:);

Train a multiresponse regression model by passing the carsTrain training data to the fitrchains function. By default, the function uses bagged ensembles of trees in the regression chains.

Mdl = fitrchains(carsTrain,["Acceleration","MPG"])

Mdl = 
  RegressionChainEnsemble
           PredictorNames: {'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
             ResponseName: ["Acceleration"    "MPG"]
    CategoricalPredictors: 4
                NumChains: 2
            LearnedChains: {2x2 cell}
          NumObservations: 338

Mdl is a trained RegressionChainEnsemble model object. You can use dot notation to access the properties of Mdl. For example, you can display Mdl.Learners to see the trained bagged ensembles in the regression chains.

Evaluate the performance of the regression model on the test set by computing the test mean squared error (MSE). Smaller MSE values indicate better performance. Return the loss for each response variable separately by setting the OutputType name-value argument to "per-response".

testMSE = loss(Mdl,carsTest,["Acceleration","MPG"], ...
    OutputType="per-response")

testMSE = 1×2

    2.4921    9.0568

Predict the response values for the observations in the test set. Return the predicted response values as a table.

predictedY = predict(Mdl,carsTest,OutputType="table")

predictedY=60×2 table
    Acceleration     MPG  
    ____________    ______

       12.573       16.109
        10.78       13.988
       11.282       12.963
       15.185       21.066
       12.203       13.773
       13.216       14.216
       17.117       30.199
       16.478       29.033
       13.439       14.208
       11.552       13.066
       13.398       13.271
       14.848       20.927
       16.552       24.603
       12.501       15.359
       15.778       19.328
       12.343       13.185
      ⋮

Specify Multiresponse Regression Model Properties

Train a multiresponse regression model using regression chains. Specify the type of regression models to use in the regression chains, and train the models with predicted values for response variables used as predictors.

load carbig
cars = table(Displacement,Horsepower,Model_Year, ...
    Origin,Weight,Acceleration,MPG);
head(cars)

    Displacement    Horsepower    Model_Year    Origin     Weight    Acceleration    MPG
    ____________    __________    __________    _______    ______    ____________    ___

        307            130            70        USA         3504           12        18 
        350            165            70        USA         3693         11.5        15 
        318            150            70        USA         3436           11        18 
        304            150            70        USA         3433           12        16 
        302            140            70        USA         3449         10.5        17 
        429            198            70        USA         4341           10        15 
        454            220            70        USA         4354            9        14 
        440            215            70        USA         4312          8.5        14

Categorize the cars based on whether they were made in the USA.

cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan",...
    "Germany","Sweden","Italy","England"],"NotUSA");

Remove observations with missing values.

cars = rmmissing(cars);

Train a multiresponse regression model by passing the cars data to the fitrchains function. Use regression chains composed of regression support vector machine (SVM) models with standardized numeric predictors. When training the SVM models, use the predicted values for the response variables that are treated as predictors.

Mdl = fitrchains(cars,["Acceleration","MPG"], ...
    Learner=templateSVM(Standardize=true), ...
    ChainPredictedResponse=true);

Mdl is a trained RegressionChainEnsemble model object. You can use dot notation to access the properties of Mdl.

Display the order of the response variables in the regression chains in Mdl, and display the trained regression SVM models in the regression chains.

Mdl.ChainOrders

Mdl.Learners

ans=2×2 cell array
    {1x1 classreg.learning.regr.CompactRegressionSVM}    {1x1 classreg.learning.regr.CompactRegressionSVM}
    {1x1 classreg.learning.regr.CompactRegressionSVM}    {1x1 classreg.learning.regr.CompactRegressionSVM}

In the first regression chain, the first SVM model uses Acceleration as the response variable. The second SVM model uses MPG as the response variable and the predicted values for Acceleration as a predictor variable. The first SVM model provides the predicted Acceleration values used by the second SVM model.

Recall that the SVM models use standardized numeric predictors. Find the means (Mu) and standard deviations (Sigma) used by the second model in the first regression chain.

Chain1Model2 = Mdl.Learners{1,2};

Mdl.PredictorNames

ans = 1x5 cell
    {'Displacement'}    {'Horsepower'}    {'Model_Year'}    {'Origin'}    {'Weight'}

Chain1Model2.ExpandedPredictorNames

ans = 1x7 cell
    {'x1'}    {'x2'}    {'x3'}    {'x4 == 1'}    {'x4 == 2'}    {'x5'}    {'x6'}

Chain1Model2.Mu

ans = 1×7
10³ ×

    0.1944    0.1045    0.0760         0         0    2.9776    0.0153

Chain1Model2.Sigma

ans = 1×7

  104.6440   38.4912    3.6837    1.0000    1.0000  849.4026    2.2190

The SVM model uses five numeric predictors: Displacement (x1), Horsepower (x2), Model_Year (x3), Weight (x5), and the predicted values for Acceleration (x6). The software uses the corresponding Mu and Sigma values to standardize the predictor data before predicting with the predict object function.

The categorical predictor Origin is split into two variables (x4 == 1 and x4 == 2) after categorical expansion. The corresponding Mu and Sigma values indicate that the two variables are unchanged after standardization.

Input Arguments

`Tbl` — Sample data
table

Sample data used to train the model, specified as a table. Each row of Tbl corresponds to one observation, and each column corresponds to one variable. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

Tbl must contain columns for the response variables and can contain a column for the observation weights. Each response and observation weight variable must be a numeric vector.

You must specify the response variables in Tbl by using ResponseVarNames or formula, and specify the observation weights in Tbl by using Weights.

- When you specify the response variables by using ResponseVarNames, fitrchains uses the remaining variables as predictors. To use a subset of the remaining variables in Tbl as predictors, specify predictor variables by using PredictorNames.
- When you define a model specification by using formula, fitrchains uses a subset of the variables in Tbl as predictor variables and response variables, as specified in formula.

Data Types: table

`ResponseVarNames` — Names of response variables
names of variables in `Tbl`

Names of the response variables, specified as the names of variables in Tbl. Each response variable must be a numeric vector.

You must specify ResponseVarNames as a string array or a cell array of character vectors. For example, if Tbl stores the response variables Y1 and Y2 as Tbl.Y1 and Tbl.Y2, respectively, then specify ResponseVarNames as ["Y1","Y2"]. Otherwise, the software treats the Y1 and Y2 columns of Tbl as predictors when training the model.

Data Types: string | cell

`formula` — Explanatory model of response variables and subset of predictor variables
character vector | string scalar

Explanatory model of the response variables and a subset of the predictor variables, specified as a character vector or string scalar in the form "Y1,Y2~x1+x2+x3". In this form, Y1 and Y2 represent the response variables, and x1, x2, and x3 represent the predictor variables.

To specify a subset of variables in Tbl as predictors for training the model, use a formula. If you specify a formula, then the software does not use any variables in Tbl that do not appear in formula, except for observation weights (if specified).

The variable names in the formula must be both variable names in Tbl (Tbl.Properties.VariableNames) and valid MATLAB® identifiers. You can verify the variable names in Tbl by using the isvarname function. If the variable names are not valid, then you can convert them by using the matlab.lang.makeValidName function.

Data Types: char | string
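
The following is a minimal sketch of the formula syntax, assuming the carbig data set used in the examples on this page. The table and its variable selection are illustrative choices, not a recommended model. Model_Year is deliberately left out of the formula to show that table variables absent from the formula are not used as predictors.

```matlab
load carbig
cars = table(Displacement,Horsepower,Model_Year,Weight,Acceleration,MPG);
cars = rmmissing(cars);   % drop rows with missing values before training

% Responses go to the left of "~" (comma-separated), predictors to the right.
% Model_Year is in the table but not in the formula, so it is not used.
Mdl = fitrchains(cars,"Acceleration,MPG~Displacement+Horsepower+Weight");

% Inspect which variables the model treats as predictors
Mdl.PredictorNames
```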

`Y` — Response data
numeric matrix | numeric table

Response data, specified as a numeric matrix or table. Each row corresponds to an observation, and each column corresponds to a response variable. Y must have the same number of rows as the predictor data X.

Data Types: single | double | table

`X` — Predictor data
numeric matrix | numeric table

Predictor data, specified as a numeric matrix or table. Each row corresponds to an observation, and each column corresponds to a predictor. Optionally, when X is a table, it can contain a column for the observation weights. X and Y must have the same number of rows.

- If X is a matrix, you can specify the names of the predictors in the order of their appearance in X by using the PredictorNames name-value argument.
- If X is a table, you can use a subset of the variables in X as predictors. To do so, specify predictor variables by using PredictorNames.

Data Types: single | double | table

Note

The software treats NaN, empty character vector (''), empty string (""), <missing>, and <undefined> elements as missing data. Before training Mdl, the software removes observations with missing values in the response data, although the model retains the observations in its data properties (for example, Mdl.X and Mdl.Y). The treatment of observations with missing values in the predictor data depends on the regression model type specified by the Learner name-value argument.
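
As a sketch of the matrix syntax fitrchains(X,Y), the code below builds numeric predictor and response matrices from the carbig data set and names their columns with PredictorNames and ResponseName. Removing incomplete rows beforehand is an optional step chosen here for illustration, so that every learner type sees the same observations.

```matlab
load carbig
X = [Displacement Horsepower Weight];   % n-by-3 numeric predictor matrix
Y = [Acceleration MPG];                 % n-by-2 numeric response matrix

% Optional: keep only the rows without missing values in X or Y
complete = ~any(isnan([X Y]),2);
X = X(complete,:);
Y = Y(complete,:);

% Name the matrix columns so that the model properties are readable
Mdl = fitrchains(X,Y, ...
    PredictorNames=["Displacement","Horsepower","Weight"], ...
    ResponseName=["Acceleration","MPG"]);

Mdl.PredictorNames
Mdl.ResponseName
```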

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: fitrchains(Tbl,["Y1","Y2"],Learner="svm",ChainPredictedResponse=true) creates a support vector machine (SVM) regression model with two response variables and uses predicted responses in the regression chains to train the model.

`ChainOrder` — Order of response variables in regression chain
`[]` (default) | positive integer vector

Order of the response variables in the regression chain, specified as a positive integer vector. For more information, see Regression Chains.

If you specify ChainOrder, Mdl contains only one regression chain.

Example: ChainOrder=[1 3 2]

Data Types: single | double
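
A short sketch of ChainOrder, assuming that the integers index the response variables in the order in which you list them (here 1 for Acceleration and 2 for MPG). The data preparation is illustrative.

```matlab
load carbig
cars = rmmissing(table(Displacement,Horsepower,Weight,Acceleration,MPG));

% ChainOrder=[2 1] fits MPG first, then Acceleration
% (assumption: indices follow the order of the listed response names)
Mdl = fitrchains(cars,["Acceleration","MPG"],ChainOrder=[2 1]);

Mdl.NumChains     % a single chain is expected when ChainOrder is specified
Mdl.ChainOrders   % order of the responses in that chain
```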

`ChainPredictedResponse` — Flag to use predicted responses in regression chains
`false` or `0` (default) | `true` or `1`

Flag to use predicted responses in the regression chains, specified as a numeric or logical 0 (false) or 1 (true).

- A value of 0 indicates to train models with observed values for response variables used as predictors.
- A value of 1 indicates to train models with predicted values for response variables used as predictors.

For more information, see Regression Chains.

Example: ChainPredictedResponse=true

Data Types: single | double | logical
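
One way to judge which setting suits your data is to compare test losses. The sketch below trains both variants with decision tree learners on a holdout split of the carbig data; the split fraction and the learner are arbitrary choices for illustration, and neither setting is guaranteed to perform better.

```matlab
load carbig
cars = rmmissing(table(Displacement,Horsepower,Weight,Acceleration,MPG));

rng("default") % For reproducibility
c = cvpartition(height(cars),"Holdout",0.2);
carsTrain = cars(training(c),:);
carsTest = cars(test(c),:);
responses = ["Acceleration","MPG"];

% Chains trained with observed response values as predictors (default)
MdlObserved = fitrchains(carsTrain,responses,Learner="tree");

% Chains trained with predicted response values as predictors
MdlPredicted = fitrchains(carsTrain,responses,Learner="tree", ...
    ChainPredictedResponse=true);

% Test MSE for each response variable
lossObserved = loss(MdlObserved,carsTest,responses,OutputType="per-response")
lossPredicted = loss(MdlPredicted,carsTest,responses,OutputType="per-response")
```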

`Learner` — Type of regression model to train
`"bag"` (default) | `"gam"` | `"gp"` | `"kernel"` | `"linear"` | `"lsboost"` | `"svm"` | `"tree"` | template object

Type of regression model to train, specified as one of the values in this table.

Value	Regression Model Type
`"bag"` or `templateEnsemble` template (with the method specified as `"Bag"` and the weak learners specified as `"Tree"`)	Bagged ensemble of trees
`"gam"` or `templateGAM` template	Generalized additive model (GAM)
`"gp"` or `templateGP` template	Gaussian process regression (GPR)
`"kernel"` or `templateKernel` template	Kernel model
`"linear"` or `templateLinear` template	Linear model
`"lsboost"` or `templateEnsemble` template (with the method specified as `"LSBoost"` and the weak learners specified as `"Tree"`)	Boosted ensemble of trees
`"svm"` or `templateSVM` template	Support vector machine (SVM)
`"tree"` or `templateTree` template	Decision tree

Example: Learner="svm"

Example: Learner=templateEnsemble("LSBoost",50,"Tree")
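
A template gives you control over learner options that the string shortcuts do not expose. As a sketch, the code below passes the boosted-tree template from the example above to fitrchains and then inspects the Learners property. The data preparation is illustrative.

```matlab
load carbig
cars = rmmissing(table(Displacement,Horsepower,Weight,Acceleration,MPG));

% Boosted ensemble of 50 regression trees for every model in every chain
t = templateEnsemble("LSBoost",50,"Tree");
Mdl = fitrchains(cars,["Acceleration","MPG"],Learner=t);

% One row per chain, one column per model in the chain
size(Mdl.Learners)
class(Mdl.Learners{1,1})
```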

`MaxNumChains` — Maximum number of regression chains
`10` (default) | positive scalar

Maximum number of regression chains, specified as a positive scalar. Because each regression chain contains one regression model for each response variable, specify MaxNumChains to limit the total number of regression models to train.

Example: MaxNumChains=5

Data Types: single | double
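
The sketch below uses three response variables, which can be ordered in 3! = 6 ways, and caps the number of chains at three. Treating Horsepower as a response is purely for illustration. The code displays the resulting chain orders rather than assuming which ones the function selects.

```matlab
load carbig
cars = rmmissing(table(Displacement,Weight,Acceleration,Horsepower,MPG));
responses = ["Acceleration","Horsepower","MPG"];

% Limit training to at most three regression chains
Mdl = fitrchains(cars,responses,Learner="tree",MaxNumChains=3);

Mdl.NumChains        % number of chains actually trained
Mdl.ChainOrders      % one row per chain
size(Mdl.Learners)   % chains-by-responses cell array of trained models
```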

`CategoricalPredictors` — Categorical predictors list
vector of positive integers | logical vector | character matrix | string array | cell array of character vectors | `"all"`

Categorical predictors list, specified as one of the values in this table.

Value	Description
Vector of positive integers	Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 and `p`, where `p` is the number of predictors used to train the model. If `fitrchains` uses a subset of input variables as predictors, then the function indexes the predictors using only the subset. The `CategoricalPredictors` values do not count any response variable, observation weights variable, or other variable that the function does not use.
Logical vector	A `true` entry means that the corresponding predictor is categorical. The length of the vector is `p`.
Character matrix	Each row of the matrix is the name of a predictor variable. The names must match the entries in `PredictorNames`. Pad the names with extra blanks so each row of the character matrix has the same length.
String array or cell array of character vectors	Each element in the array is the name of a predictor variable. The names must match the entries in `PredictorNames`.
`"all"`	All predictors are categorical.

By default, if the predictor data is in a table, fitrchains assumes that a variable is categorical if it is a logical vector, categorical vector, character array, string array, or cell array of character vectors. However, learners that use decision trees assume that mathematically ordered categorical vectors are continuous variables. If the predictor data is a matrix, fitrchains assumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using the CategoricalPredictors name-value argument.

The software creates dummy variables based on the Learner name-value argument and the underlying fitting function used to create the regression models in the Learners property of Mdl. For more information on how fitting functions treat categorical predictors, see Automatic Creation of Dummy Variables.

Example: CategoricalPredictors="all"
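
Because fitrchains assumes that all predictors in a matrix are continuous, you must flag categorical columns yourself. This sketch treats the Cylinders variable from carbig as categorical by name; the choice of variables is illustrative.

```matlab
load carbig
X = [Displacement Weight Cylinders];
Y = [Acceleration MPG];
complete = ~any(isnan([X Y]),2);   % rows without missing values

% The categorical predictor name must match an entry in PredictorNames
Mdl = fitrchains(X(complete,:),Y(complete,:), ...
    PredictorNames=["Displacement","Weight","Cylinders"], ...
    CategoricalPredictors="Cylinders");

Mdl.CategoricalPredictors
```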

`Options` — Options for computing in parallel and setting random streams
structure

Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.

Field Name	Value	Default
`UseParallel`	Set this value to `true` to run computations in parallel.	`false`
`UseSubstreams`	Set this value to `true` to run computations in a reproducible manner. To compute reproducibly, set `Streams` to a type that allows substreams: `"mlfg6331_64"` or `"mrg32k3a"`.	`false`
`Streams`	Specify this value as a `RandStream` object or cell array of such objects. Use a single object except when the `UseParallel` value is `true` and the `UseSubstreams` value is `false`. In that case, use a cell array that has the same size as the parallel pool.	If you do not specify `Streams`, then `fitrchains` uses the default stream or streams.

Note

You need Parallel Computing Toolbox™ to run computations in parallel.

Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))

Data Types: struct
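
The following sketch passes an options structure to fitrchains. It uses canUseParallelPool to decide whether to request parallel computation, which is a choice made here so that the code also runs without Parallel Computing Toolbox; it is not required by fitrchains. The data preparation is illustrative.

```matlab
load carbig
cars = rmmissing(table(Displacement,Horsepower,Weight,Acceleration,MPG));

% Request parallel computation only if a parallel pool can be used
% (assumption: fall back to serial computation otherwise)
useParallel = canUseParallelPool;

% Substreams require a generator type that supports them
opts = statset(UseParallel=useParallel,UseSubstreams=true, ...
    Streams=RandStream("mlfg6331_64"));

Mdl = fitrchains(cars,["Acceleration","MPG"],Options=opts);
```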

`PredictorNames` — Predictor variable names
string array | cell array of character vectors

Predictor variable names, specified as a string array or a cell array of character vectors.

- If you supply predictor data using a numeric matrix, then you can use PredictorNames to assign names to the predictor variables.
  - The order of the names in PredictorNames must correspond to the order of the columns in the matrix.
  - By default, PredictorNames is {'x1','x2',...}.
- If you supply predictor data using a table, then you can use PredictorNames to specify which variables to use as predictors during training.
  - PredictorNames must be a subset of the variable names in the table and cannot include the names of response variables.
  - By default, PredictorNames contains the names of all predictor variables.

Example: PredictorNames=["SepalLength","SepalWidth","PetalLength","PetalWidth"]

Data Types: string | cell

`ResponseName` — Response variable names
string array | cell array of character vectors

Response variable names, specified as a string array or a cell array of character vectors.

- If you supply Y, then you can use ResponseName to specify names for the response variables.
- If you supply ResponseVarNames or formula, then you cannot use ResponseName.

Example: ResponseName=["Response1","Response2"]

Data Types: string | cell

`Weights` — Observation weights
nonnegative numeric vector | name of variable in `X` or `Tbl`

Observation weights, specified as a nonnegative numeric vector or the name of a variable in X or Tbl. The software weights each observation in X or Tbl with the corresponding value in Weights. The length of Weights must equal the number of observations in X or Tbl.

If you specify the input data as a table, then Weights can be the name of a variable in the table that contains a numeric vector. In this case, you must specify Weights as a character vector or string scalar. For example, if the weights vector W is stored as Tbl.W, then specify it as "W". Otherwise, the software treats the W column of Tbl as a predictor during the training process.

By default, Weights is ones(n,1), where n is the number of observations in X or Tbl.

Before training, fitrchains normalizes the weights to sum to 1.

Data Types: single | double | char | string
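
As a sketch of table-based weights, the code below adds a weight variable W to the table and passes its name to fitrchains. The weighting scheme (heavier weights for newer cars) is hypothetical and serves only to produce a valid nonnegative vector.

```matlab
load carbig
cars = rmmissing(table(Displacement,Horsepower,Weight,Model_Year, ...
    Acceleration,MPG));

% Hypothetical weighting scheme: weight newer cars more heavily
cars.W = cars.Model_Year - min(cars.Model_Year) + 1;

% Pass the weights by variable name so that W is not treated as a predictor
Mdl = fitrchains(cars,["Acceleration","MPG"],Weights="W");

Mdl.PredictorNames
```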

Output Arguments

`Mdl` — Multiresponse regression model
`RegressionChainEnsemble` model object

Multiresponse regression model, returned as a RegressionChainEnsemble model object. To access the properties of Mdl, use dot notation.

Algorithms

Regression Chains

A regression chain is a sequence of regression models in which the response variables for previous models become predictor variables for subsequent models. If the training data consists of p predictor variables and k response variables, then a regression chain includes exactly k models, each with a different response variable. The first model has p predictors, the second model has p+1 predictors, and so on, with the last model having p+k–1 predictors.

For example, suppose that the predictor data in X or Tbl consists of three variables, x1, x2, and x3, and the response data in Y or Tbl consists of two variables, y1 and y2. A regression chain with the chain order [2 1] (ChainOrder) consists of a model trained on the predictor data [x1, x2, x3] and the response variable y2, followed by a model trained on the predictor data [x1, x2, x3, y2] and the response variable y1.

If you specify to use predicted responses in regression chains (ChainPredictedResponse), the predictor data for the second model is [x1, x2, x3, yfit2], where yfit2 contains the predicted responses returned by the first model.

In general, fitrchains returns an ensemble of regression chains Mdl, where each row of Mdl.Learners corresponds to one regression chain.
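
To make the chain structure concrete, the following hand-built sketch reproduces a single regression chain with the chain order [2 1] by using fitrtree directly. It illustrates the idea only and is not a description of how fitrchains is implemented internally. The variable names model1, model2, and Xnew are introduced here for illustration.

```matlab
load carbig
data = rmmissing([Displacement Horsepower Weight Acceleration MPG]);
X = data(:,1:3);   % predictors x1, x2, x3
Y = data(:,4:5);   % responses y1 = Acceleration, y2 = MPG

% Chain order [2 1]: fit y2 first, then y1 with y2 as an extra predictor
model1 = fitrtree(X,Y(:,2));            % p = 3 predictors
model2 = fitrtree([X Y(:,2)],Y(:,1));   % p + 1 = 4 predictors

% For new data, no observed responses are available, so the prediction
% from model1 fills the y2 column that model2 expects
Xnew = X(1:5,:);
yfit2 = predict(model1,Xnew);
yfit1 = predict(model2,[Xnew yfit2]);
Yfit = [yfit1 yfit2]   % columns in the original response order
```

Training model2 on [X predict(model1,X)] instead of [X Y(:,2)] would correspond to the predicted-response variant described above.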

References

[1] Spyromitros-Xioufis, Eleftherios, Grigorios Tsoumakas, William Groves, and Ioannis Vlahavas. "Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs." Machine Learning 104, no. 1 (July 2016): 55–98. https://doi.org/10.1007/s10994-016-5546-z.

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To run in parallel, specify the Options name-value argument in the call to this function and set the UseParallel field of the options structure to true using statset:

Options=statset(UseParallel=true)

For more information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).

Version History

Introduced in R2024b
