fitrchains

Multiresponse regression with regression chains

Since R2024b

    Description

    Mdl = fitrchains(Tbl,ResponseVarNames) returns a trained multiresponse regression model Mdl by using regression chains. The function trains the model using the predictors in the table Tbl and the response values in the ResponseVarNames table variables. For more information, see Regression Chains.

    Mdl = fitrchains(Tbl,formula) returns a regression model trained using the sample data in the table Tbl. The input argument formula is an explanatory model of the responses and a subset of the predictor variables in Tbl used to fit Mdl.

    Mdl = fitrchains(X,Y) returns a regression model using the predictor variables in the table or matrix X and the response values in the table or matrix Y.

    Mdl = fitrchains(___,Name=Value) specifies options using one or more name-value arguments in addition to any of the input argument combinations in previous syntaxes. For example, you can specify the type of model to use in the regression chains by setting the Learner name-value argument.

    Examples

    Create a regression model with more than one response variable by using fitrchains.

    Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Displacement, Horsepower, and so on, as well as the response variables Acceleration and MPG. Display the first eight rows of the table.

    load carbig
    cars = table(Displacement,Horsepower,Model_Year, ...
        Origin,Weight,Acceleration,MPG);
    head(cars)
        Displacement    Horsepower    Model_Year    Origin     Weight    Acceleration    MPG
        ____________    __________    __________    _______    ______    ____________    ___
    
            307            130            70        USA         3504           12        18 
            350            165            70        USA         3693         11.5        15 
            318            150            70        USA         3436           11        18 
            304            150            70        USA         3433           12        16 
            302            140            70        USA         3449         10.5        17 
            429            198            70        USA         4341           10        15 
            454            220            70        USA         4354            9        14 
            440            215            70        USA         4312          8.5        14 
    

    Categorize the cars based on whether they were made in the USA.

    cars.Origin = categorical(cellstr(cars.Origin));
    cars.Origin = mergecats(cars.Origin,["France","Japan",...
        "Germany","Sweden","Italy","England"],"NotUSA");

    Partition the data into training and test sets. Use approximately 85% of the observations to train a multiresponse model, and 15% of the observations to test the performance of the trained model on new data. Use cvpartition to partition the data.

    rng("default") % For reproducibility
    c = cvpartition(height(cars),"Holdout",0.15);
    carsTrain = cars(training(c),:);
    carsTest = cars(test(c),:);

    Train a multiresponse regression model by passing the carsTrain training data to the fitrchains function. By default, the function uses bagged ensembles of trees in the regression chains.

    Mdl = fitrchains(carsTrain,["Acceleration","MPG"])
    Mdl = 
      RegressionChainEnsemble
               PredictorNames: {'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
                 ResponseName: ["Acceleration"    "MPG"]
        CategoricalPredictors: 4
                    NumChains: 2
                LearnedChains: {2x2 cell}
              NumObservations: 338
    
    
    

    Mdl is a trained RegressionChainEnsemble model object. You can use dot notation to access the properties of Mdl. For example, you can enter Mdl.Learners to display the bagged ensembles trained for the regression chains.

    Evaluate the performance of the regression model on the test set by computing the test mean squared error (MSE). Smaller MSE values indicate better performance. Return the loss for each response variable separately by setting the OutputType name-value argument to "per-response".

    testMSE = loss(Mdl,carsTest,["Acceleration","MPG"], ...
        OutputType="per-response")
    testMSE = 1×2
    
        2.4921    9.0568
    
    

    Predict the response values for the observations in the test set. Return the predicted response values as a table.

    predictedY = predict(Mdl,carsTest,OutputType="table")
    predictedY=60×2 table
        Acceleration     MPG  
        ____________    ______
    
           12.573       16.109
            10.78       13.988
           11.282       12.963
           15.185       21.066
           12.203       13.773
           13.216       14.216
           17.117       30.199
           16.478       29.033
           13.439       14.208
           11.552       13.066
           13.398       13.271
           14.848       20.927
           16.552       24.603
           12.501       15.359
           15.778       19.328
           12.343       13.185
          ⋮
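
    As a cross-check on the loss and predict calls above, the following sketch recomputes the per-response test MSE by hand from the predicted table. It repeats the setup so that it runs on its own. The names responseNames, err, and manualMSE are illustrative, and the assumption that loss uses uniform observation weights and skips missing responses is ours rather than something this page states, so small differences from testMSE are possible.

```matlab
% Rebuild the training and test sets from the example above.
load carbig
cars = table(Displacement,Horsepower,Model_Year, ...
    Origin,Weight,Acceleration,MPG);
cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan", ...
    "Germany","Sweden","Italy","England"],"NotUSA");
rng("default") % For reproducibility
c = cvpartition(height(cars),"Holdout",0.15);
carsTrain = cars(training(c),:);
carsTest = cars(test(c),:);

Mdl = fitrchains(carsTrain,["Acceleration","MPG"]);
predictedY = predict(Mdl,carsTest,OutputType="table");

% Prediction errors: one row per observation, one column per response.
responseNames = ["Acceleration","MPG"];
err = carsTest{:,responseNames} - predictedY{:,responseNames};

% Average the squared errors over observations, ignoring rows with a
% missing response (assumption: this mirrors how loss treats them).
manualMSE = mean(err.^2,1,"omitnan")
```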
    
    

    Train a multiresponse regression model using regression chains. Specify the type of regression models to use in the regression chains, and train the models with predicted values for response variables used as predictors.

    Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Displacement, Horsepower, and so on, as well as the response variables Acceleration and MPG. Display the first eight rows of the table.

    load carbig
    cars = table(Displacement,Horsepower,Model_Year, ...
        Origin,Weight,Acceleration,MPG);
    head(cars)
        Displacement    Horsepower    Model_Year    Origin     Weight    Acceleration    MPG
        ____________    __________    __________    _______    ______    ____________    ___
    
            307            130            70        USA         3504           12        18 
            350            165            70        USA         3693         11.5        15 
            318            150            70        USA         3436           11        18 
            304            150            70        USA         3433           12        16 
            302            140            70        USA         3449         10.5        17 
            429            198            70        USA         4341           10        15 
            454            220            70        USA         4354            9        14 
            440            215            70        USA         4312          8.5        14 
    

    Categorize the cars based on whether they were made in the USA.

    cars.Origin = categorical(cellstr(cars.Origin));
    cars.Origin = mergecats(cars.Origin,["France","Japan",...
        "Germany","Sweden","Italy","England"],"NotUSA");

    Remove observations with missing values.

    cars = rmmissing(cars);

    Train a multiresponse regression model by passing the cars data to the fitrchains function. Use regression chains composed of regression support vector machine (SVM) models with standardized numeric predictors. When training the SVM models, use the predicted values for the response variables that are treated as predictors.

    Mdl = fitrchains(cars,["Acceleration","MPG"], ...
        Learner=templateSVM(Standardize=true), ...
        ChainPredictedResponse=true);

    Mdl is a trained RegressionChainEnsemble model object. You can use dot notation to access the properties of Mdl.

    Display the order of the response variables in the regression chains in Mdl, and display the trained regression SVM models in the regression chains.

    Mdl.ChainOrders
    ans = 2×2
    
         1     2
         2     1
    
    
    Mdl.Learners
    ans=2×2 cell array
        {1x1 classreg.learning.regr.CompactRegressionSVM}    {1x1 classreg.learning.regr.CompactRegressionSVM}
        {1x1 classreg.learning.regr.CompactRegressionSVM}    {1x1 classreg.learning.regr.CompactRegressionSVM}
    
    

    In the first regression chain, the first SVM model uses Acceleration as the response variable. The second SVM model uses MPG as the response variable and the predicted values for Acceleration as a predictor variable. The first SVM model provides the predicted Acceleration values used by the second SVM model.

    Recall that the SVM models use standardized numeric predictors. Find the means (Mu) and standard deviations (Sigma) used by the second model in the first regression chain.

    Chain1Model2 = Mdl.Learners{1,2};
    
    Mdl.PredictorNames
    ans = 1x5 cell
        {'Displacement'}    {'Horsepower'}    {'Model_Year'}    {'Origin'}    {'Weight'}
    
    
    Chain1Model2.ExpandedPredictorNames
    ans = 1x7 cell
        {'x1'}    {'x2'}    {'x3'}    {'x4 == 1'}    {'x4 == 2'}    {'x5'}    {'x6'}
    
    
    Chain1Model2.Mu
    ans = 1×7
    10^3 ×
    
        0.1944    0.1045    0.0760         0         0    2.9776    0.0153
    
    
    Chain1Model2.Sigma
    ans = 1×7
    
      104.6440   38.4912    3.6837    1.0000    1.0000  849.4026    2.2190
    
    

    The SVM model uses five numeric predictors: Displacement (x1), Horsepower (x2), Model_Year (x3), Weight (x5), and the predicted values for Acceleration (x6). The software uses the corresponding Mu and Sigma values to standardize the predictor data before predicting with the predict object function.

    The categorical predictor Origin is split into two variables (x4 == 1 and x4 == 2) after categorical expansion. The corresponding Mu and Sigma values indicate that the two variables are unchanged after standardization.
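
    The standardization described above amounts to z = (x - Mu)./Sigma, applied column by column. The sketch below uses the Mu and Sigma values displayed above (rounded as printed) together with one hypothetical expanded predictor row x. The dummy-variable pattern [1 0] and the predicted Acceleration value 12 are made up for illustration.

```matlab
% Mu and Sigma as displayed above (Mu rescaled from the 10^3 display).
mu    = [194.4 104.5 76.0 0 0 2977.6 15.3];
sigma = [104.6440 38.4912 3.6837 1 1 849.4026 2.2190];

% Hypothetical expanded predictor row:
% [Displacement Horsepower Model_Year (x4==1) (x4==2) Weight yfitAccel]
x = [307 130 70 1 0 3504 12];

% Standardize each column with its own mean and standard deviation.
z = (x - mu)./sigma

% With Mu = 0 and Sigma = 1, the dummy variables pass through unchanged.
dummiesUnchanged = isequal(z(4:5),x(4:5))
```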

    Input Arguments

    Tbl

    Sample data used to train the model, specified as a table. Each row of Tbl corresponds to one observation, and each column corresponds to one variable. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

    Tbl must contain columns for the response variables and can contain a column for the observation weights. Each response and observation weight variable must be a numeric vector.

    You must specify the response variables in Tbl by using ResponseVarNames or formula, and specify the observation weights in Tbl by using Weights.

    • When you specify the response variables by using ResponseVarNames, fitrchains uses the remaining variables as predictors. To use a subset of the remaining variables in Tbl as predictors, specify predictor variables by using PredictorNames.

    • When you define a model specification by using formula, fitrchains uses a subset of the variables in Tbl as predictor variables and response variables, as specified in formula.

    Data Types: table

    ResponseVarNames

    Names of the response variables, specified as the names of variables in Tbl. Each response variable must be a numeric vector.

    You must specify ResponseVarNames as a string array or a cell array of character vectors. For example, if Tbl stores the response variables Y1 and Y2 as Tbl.Y1 and Tbl.Y2, respectively, then specify ResponseVarNames as ["Y1","Y2"]. Otherwise, the software treats the Y1 and Y2 columns of Tbl as predictors when training the model.

    Data Types: string | cell

    formula

    Explanatory model of the response variables and a subset of the predictor variables, specified as a character vector or string scalar in the form "Y1,Y2~x1+x2+x3". In this form, Y1 and Y2 represent the response variables, and x1, x2, and x3 represent the predictor variables.

    To specify a subset of variables in Tbl as predictors for training the model, use a formula. If you specify a formula, then the software does not use any variables in Tbl that do not appear in formula, except for observation weights (if specified).

    The variable names in the formula must be both variable names in Tbl (Tbl.Properties.VariableNames) and valid MATLAB® identifiers. You can verify the variable names in Tbl by using the isvarname function. If the variable names are not valid, then you can convert them by using the matlab.lang.makeValidName function.

    Data Types: char | string
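
    A minimal sketch of the formula syntax, using the carbig data from the examples. It first checks the table variable names with isvarname, shows matlab.lang.makeValidName on a made-up name ("Model Year" is not a variable in this table), and then trains with a formula that omits Model_Year. The names varNames, allValid, and fixedName are illustrative.

```matlab
load carbig
cars = table(Displacement,Horsepower,Model_Year,Weight,Acceleration,MPG);

% Check that every table variable name is a valid MATLAB identifier.
varNames = cars.Properties.VariableNames;
allValid = all(cellfun(@isvarname,varNames))

% A name that is not valid can be converted before it is used.
fixedName = matlab.lang.makeValidName("Model Year")

% Responses go on the left of ~ and predictors on the right. Model_Year
% is in the table but not in the formula, so it is not used for training.
Mdl = fitrchains(cars,"Acceleration,MPG~Displacement+Horsepower+Weight");
Mdl.PredictorNames
```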

    Y

    Response data, specified as a numeric matrix or table. Each row corresponds to an observation, and each column corresponds to a response variable. Y must have the same number of rows as the predictor data X.

    Data Types: single | double | table

    X

    Predictor data, specified as a numeric matrix or table. Each row corresponds to an observation, and each column corresponds to a predictor. Optionally, when X is a table, it can contain a column for the observation weights. X and Y must have the same number of rows.

    • If X is a matrix, you can specify the names of the predictors in the order of their appearance in X by using the PredictorNames name-value argument.

    • If X is a table, you can use a subset of the variables in X as predictors. To do so, specify predictor variables by using PredictorNames.

    Data Types: single | double | table
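
    The matrix form of the call can be sketched as follows, again with the carbig data. The split of the columns into X and Y and the names passed to PredictorNames and ResponseName are choices made for this illustration. The sketch assumes that predict accepts a numeric matrix with the same columns as X.

```matlab
load carbig

% Stack predictors and responses, then drop rows with any missing value.
data = rmmissing([Displacement Horsepower Weight Acceleration MPG]);
X = data(:,1:3);   % predictor matrix, one column per predictor
Y = data(:,4:5);   % response matrix, one column per response

% Name the matrix columns so that the model reports readable names.
Mdl = fitrchains(X,Y, ...
    PredictorNames=["Displacement","Horsepower","Weight"], ...
    ResponseName=["Acceleration","MPG"]);

% Predict both responses for the first three observations.
Yfit = predict(Mdl,X(1:3,:))
```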

    Note

    The software treats NaN, empty character vector (''), empty string (""), <missing>, and <undefined> elements as missing data. Before training Mdl, the software removes observations with missing values in the response data, although the model retains the observations in its data properties (for example, Mdl.X and Mdl.Y). The treatment of observations with missing values in the predictor data depends on the regression model type specified by the Learner name-value argument.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: fitrchains(Tbl,["Y1","Y2"],Learner="svm",ChainPredictedResponse=true) creates a support vector machine (SVM) regression model with two response variables and uses predicted responses in the regression chains to train the model.

    ChainOrder

    Order of the response variables in the regression chain, specified as a positive integer vector. For more information, see Regression Chains.

    If you specify ChainOrder, Mdl contains only one regression chain.

    Example: ChainOrder=[1 3 2]

    Data Types: single | double
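
    A short sketch of fixing the chain order, assuming that the indices in ChainOrder refer to the positions of the names in ResponseVarNames (consistent with the Regression Chains description on this page). The data preparation is illustrative.

```matlab
load carbig
cars = table(Displacement,Horsepower,Weight,Acceleration,MPG);
cars = rmmissing(cars);

% Model MPG (response 2) first, then Acceleration (response 1).
Mdl = fitrchains(cars,["Acceleration","MPG"],ChainOrder=[2 1]);

Mdl.NumChains     % number of regression chains in the model
Mdl.ChainOrders   % order of the responses in each chain
```

    Because ChainOrder is specified, the model is expected to hold a single chain rather than one chain per candidate ordering.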

    ChainPredictedResponse

    Flag to use predicted responses in the regression chains, specified as a numeric or logical 0 (false) or 1 (true).

    • A value of 0 indicates to train models with observed values for response variables used as predictors.

    • A value of 1 indicates to train models with predicted values for response variables used as predictors.

    For more information, see Regression Chains.

    Example: ChainPredictedResponse=true

    Data Types: single | double | logical

    Learner

    Type of regression model to train, specified as one of the values in this table.

    Value | Regression Model Type
    "bag" or templateEnsemble template (with the method specified as "Bag" and the weak learners specified as "Tree") | Bagged ensemble of trees
    "gam" or templateGAM template | Generalized additive model (GAM)
    "gp" or templateGP template | Gaussian process regression (GPR)
    "kernel" or templateKernel template | Kernel model
    "linear" or templateLinear template | Linear model
    "lsboost" or templateEnsemble template (with the method specified as "LSBoost" and the weak learners specified as "Tree") | Boosted ensemble of trees
    "svm" or templateSVM template | Support vector machine (SVM)
    "tree" or templateTree template | Decision tree

    Example: Learner="svm"

    Example: Learner=templateEnsemble("LSBoost",50,"Tree")
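
    The two ways of choosing a learner can be sketched side by side: a name for default settings, or a template to set the learner's own options. The MinLeafSize value of 10 is arbitrary and only serves as an illustration.

```matlab
load carbig
cars = table(Displacement,Horsepower,Weight,Acceleration,MPG);
cars = rmmissing(cars);

% Option 1: name a learner type and accept its default settings.
MdlTree = fitrchains(cars,["Acceleration","MPG"],Learner="tree");

% Option 2: pass a template to control the learner's own options.
t = templateTree(MinLeafSize=10);
MdlTemplate = fitrchains(cars,["Acceleration","MPG"],Learner=t);

% Inspect the class of the first trained model in the first chain.
class(MdlTemplate.Learners{1,1})
```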

    MaxNumChains

    Maximum number of regression chains, specified as a positive scalar. Because each regression chain contains one regression model for each response variable, specify MaxNumChains to limit the total number of regression models to train.

    Example: MaxNumChains=5

    Data Types: single | double
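
    To make the effect on training cost concrete, the sketch below trains with three responses and caps the number of chains at two. Treating Weight as a response is a choice made for this illustration, and the sketch does not rely on any particular default for the number of chains.

```matlab
load carbig
cars = table(Displacement,Horsepower,Model_Year,Weight,Acceleration,MPG);
cars = rmmissing(cars);

% Three responses allow up to 3! = 6 distinct chain orders.
responses = ["Weight","Acceleration","MPG"];
Mdl = fitrchains(cars,responses,Learner="tree",MaxNumChains=2);

% Each chain holds one model per response, so the number of trained
% models is at most MaxNumChains times the number of responses.
numModels = numel(Mdl.Learners)
```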

    CategoricalPredictors

    Categorical predictors list, specified as one of the values in this table.

    Value | Description
    Vector of positive integers | Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If fitrchains uses a subset of input variables as predictors, then the function indexes the predictors using only the subset. The CategoricalPredictors values do not count any response variable, observation weights variable, or other variable that the function does not use.
    Logical vector | A true entry means that the corresponding predictor is categorical. The length of the vector is p.
    Character matrix | Each row of the matrix is the name of a predictor variable. The names must match the entries in PredictorNames. Pad the names with extra blanks so each row of the character matrix has the same length.
    String array or cell array of character vectors | Each element in the array is the name of a predictor variable. The names must match the entries in PredictorNames.
    "all" | All predictors are categorical.

    By default, if the predictor data is in a table, fitrchains assumes that a variable is categorical if it is a logical vector, categorical vector, character array, string array, or cell array of character vectors. However, learners that use decision trees assume that mathematically ordered categorical vectors are continuous variables. If the predictor data is a matrix, fitrchains assumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using the CategoricalPredictors name-value argument.

    The software creates dummy variables based on the Learner name-value argument and the underlying fitting function used to create the regression models in the Learners property of Mdl. For more information on how fitting functions treat categorical predictors, see Automatic Creation of Dummy Variables.

    Example: CategoricalPredictors="all"

    Data Types: single | double | logical | char | string | cell
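
    A brief sketch of flagging a numeric variable as categorical, by name and by index. Cylinders is a numeric variable in carbig. The sketch assumes that the CategoricalPredictors property of the model stores predictor indices, as the model display in the first example suggests.

```matlab
load carbig
cars = table(Displacement,Horsepower,Cylinders,Acceleration,MPG);
cars = rmmissing(cars);

% Cylinders is numeric, so it is treated as continuous by default.
% Flag it as categorical by name.
MdlByName = fitrchains(cars,["Acceleration","MPG"], ...
    Learner="tree",CategoricalPredictors="Cylinders");

% Alternatively, flag it by index. The index counts predictors only,
% not response variables, so Cylinders is predictor number 3.
MdlByIndex = fitrchains(cars,["Acceleration","MPG"], ...
    Learner="tree",CategoricalPredictors=3);

MdlByName.CategoricalPredictors
```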

    Options

    Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. This table lists the option fields and their values.

    Field Name | Value | Default
    UseParallel | Set this value to true to run computations in parallel. | false
    UseSubstreams | Set this value to true to run computations in a reproducible manner. To compute reproducibly, set Streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". | false
    Streams | Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. | If you do not specify Streams, then fitrchains uses the default stream or streams.

    Note

    You need Parallel Computing Toolbox™ to run computations in parallel.

    Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))

    Data Types: struct

    PredictorNames

    Predictor variable names, specified as a string array or a cell array of character vectors.

    • If you supply predictor data using a numeric matrix, then you can use PredictorNames to assign names to the predictor variables.

      • The order of the names in PredictorNames must correspond to the order of the columns in the matrix.

      • By default, PredictorNames is {'x1','x2',...}.

    • If you supply predictor data using a table, then you can use PredictorNames to specify which variables to use as predictors during training.

      • PredictorNames must be a subset of the variable names in the table and cannot include the names of response variables.

      • By default, PredictorNames contains the names of all predictor variables.

    Example: PredictorNames=["SepalLength","SepalWidth","PetalLength","PetalWidth"]

    Data Types: string | cell

    ResponseName

    Response variable names, specified as a string array or a cell array of character vectors.

    • If you supply Y, then you can use ResponseName to specify names for the response variables.

    • If you supply ResponseVarNames or formula, then you cannot use ResponseName.

    Example: ResponseName=["Response1","Response2"]

    Data Types: string | cell

    Weights

    Observation weights, specified as a nonnegative numeric vector or the name of a variable in X or Tbl. The software weights each observation in X or Tbl with the corresponding value in Weights. The length of Weights must equal the number of observations in X or Tbl.

    If you specify the input data as a table, then Weights can be the name of a variable in the table that contains a numeric vector. In this case, you must specify Weights as a character vector or string scalar. For example, if the weights vector W is stored as Tbl.W, then specify it as "W". Otherwise, the software treats the W column of Tbl as a predictor during the training process.

    By default, Weights is ones(n,1), where n is the number of observations in X or Tbl.

    Before training, fitrchains normalizes the weights to sum to 1.

    Data Types: single | double | char | string
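
    A minimal sketch of passing weights by variable name. The weighting rule (counting cars from model year 80 onward twice) and the variable name W are hypothetical. The last lines write out the normalization to unit sum by hand as an illustration of the arithmetic, not as a view into the function's internals.

```matlab
load carbig
cars = table(Displacement,Horsepower,Model_Year,Acceleration,MPG);
cars = rmmissing(cars);

% Hypothetical weighting: cars from model year 80 onward count twice.
cars.W = 1 + double(cars.Model_Year >= 80);

% Name the weights variable so that W is not treated as a predictor.
Mdl = fitrchains(cars,["Acceleration","MPG"],Learner="tree",Weights="W");
usesWAsPredictor = any(strcmp(Mdl.PredictorNames,"W"))

% Normalization to unit sum, written out by hand.
wNormalized = cars.W/sum(cars.W);
sumOfWeights = sum(wNormalized)
```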

    Output Arguments

    Mdl

    Multiresponse regression model, returned as a RegressionChainEnsemble model object. To access the properties of Mdl, use dot notation.

    Algorithms

    Regression Chains

    A regression chain is a sequence of regression models in which the response variables for previous models become predictor variables for subsequent models. If the training data consists of p predictor variables and k response variables, then a regression chain includes exactly k models, each with a different response variable. The first model has p predictors, the second model has p+1 predictors, and so on, with the last model having p+k–1 predictors.

    For example, suppose that the predictor data in X or Tbl consists of three variables, x1, x2, and x3, and the response data in Y or Tbl consists of two variables, y1 and y2. A regression chain with the chain order [2 1] (ChainOrder) consists of a model trained on the predictor data [x1, x2, x3] and the response variable y2, followed by a model trained on the predictor data [x1, x2, x3, y2] and the response variable y1.

    If you specify to use predicted responses in regression chains (ChainPredictedResponse), the predictor data for the second model is [x1, x2, x3, yfit2], where yfit2 contains the predicted responses returned by the first model.

    In general, fitrchains returns an ensemble of regression chains Mdl, where each row of Mdl.Learners corresponds to one regression chain.
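
    The chain with order [2 1] described above can be sketched by hand with two single-response models. This is an illustration of the idea, not the implementation used by fitrchains. It uses fitrtree as a stand-in learner, in-sample predictions for yfit2, and illustrative variable names. Whether fitrchains computes the predicted responses in-sample or by another scheme is not stated on this page.

```matlab
load carbig
cars = table(Displacement,Horsepower,Weight,Acceleration,MPG);
cars = rmmissing(cars);
X  = cars{:,["Displacement","Horsepower","Weight"]};   % p = 3 predictors
y1 = cars.Acceleration;
y2 = cars.MPG;

% Chain order [2 1]: the first model predicts y2 from the p predictors.
model1 = fitrtree(X,y2);

% Second model, observed variant: y2 joins the predictors (p+1 columns).
model2Observed = fitrtree([X y2],y1);

% Second model, predicted variant: yfit2 from model1 replaces y2.
yfit2 = predict(model1,X);
model2Predicted = fitrtree([X yfit2],y1);

% For new data the true y2 is unknown, so the chain feeds the
% prediction from model1 into the second model.
Xnew = X(1:5,:);
yfit2New = predict(model1,Xnew);
yfit1New = predict(model2Predicted,[Xnew yfit2New]);
chainPrediction = [yfit1New yfit2New]   % columns ordered as [y1 y2]
```

    With k = 2 responses, the first model sees p = 3 predictors and the second sees p+k–1 = 4, which matches the predictor counts given above.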

    References

    [1] Spyromitros-Xioufis, Eleftherios, Grigorios Tsoumakas, William Groves, and Ioannis Vlahavas. "Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs." Machine Learning 104, no. 1 (July 2016): 55–98. https://doi.org/10.1007/s10994-016-5546-z.

    Version History

    Introduced in R2024b