Contenuto principale

stepwisefit

Fit linear regression model using stepwise regression

Description

b = stepwisefit(X,y) returns a vector b of coefficient estimates from stepwise regression of the response vector y on the predictor variables in matrix X. stepwisefit begins with an initial constant model and takes forward or backward steps to add or remove variables, until a stopping criterion is satisfied.

example

b = stepwisefit(X,y,Name,Value) specifies additional options using one or more name-value pair arguments. For example, you can specify a nonconstant initial model, or a maximum number of steps that stepwisefit can take.

example

[b,se,pval] = stepwisefit(___) returns the coefficient estimates b, standard errors se, and p-values pval using any of the input argument combinations in previous syntaxes.

example

[b,se,pval,finalmodel,stats] = stepwisefit(___) also returns a specification of the variables in the final regression model finalmodel, and statistics stats about the final model.

example

[b,se,pval,finalmodel,stats,nextstep,history] = stepwisefit(___) also returns the recommended next step nextstep and information history about all the steps taken.

example

Examples

collapse all

Perform a basic stepwise regression and obtain the coefficient estimates.

Load the hald data set.

load hald
whos % Check variables loaded in workspace
  Name              Size            Bytes  Class     Attributes

  Description      22x58             2552  char                
  hald             13x5               520  double              
  heat             13x1               104  double              
  ingredients      13x4               416  double              

This data set contains observations of the heat evolved during cement hardening for various mixtures of four cement ingredients. The response variable is heat. The matrix ingredients contains four columns of predictors.

Run stepwisefit beginning with only a constant term in the model and using the default entry and exit tolerances of 0.05 and 0.10, respectively.

b = stepwisefit(ingredients,heat)
Initial columns included: none
Step 1, added column 4, p=0.000576232
Step 2, added column 1, p=1.10528e-06
Final columns included:  1 4 
    {'Coeff'  }    {'Std.Err.'}    {'Status'}    {'P'         }
    {[ 1.4400]}    {[  0.1384]}    {'In'    }    {[1.1053e-06]}
    {[ 0.4161]}    {[  0.1856]}    {'Out'   }    {[    0.0517]}
    {[-0.4100]}    {[  0.1992]}    {'Out'   }    {[    0.0697]}
    {[-0.6140]}    {[  0.0486]}    {'In'    }    {[1.8149e-07]}
b = 4×1

    1.4400
    0.4161
   -0.4100
   -0.6140

The stepwisefit display shows that columns 1 and 4 are included in the final model. The output b includes estimates for all columns, even those that do not appear in the final model. stepwisefit computes the estimate for column 2 (or 3) by fitting a model consisting of the final model plus column 2 (or 3).

Load the carsmall data set, which contains various car measurements.

load carsmall
whos
  Name                Size            Bytes  Class     Attributes

  Acceleration      100x1               800  double              
  Cylinders         100x1               800  double              
  Displacement      100x1               800  double              
  Horsepower        100x1               800  double              
  MPG               100x1               800  double              
  Mfg               100x13             2600  char                
  Model             100x33             6600  char                
  Model_Year        100x1               800  double              
  Origin            100x7              1400  char                
  Weight            100x1               800  double              

Perform stepwise regression with four continuous variables and the response variable MPG.

X = [Acceleration Cylinders Displacement Horsepower];
y = MPG;
b4_default = stepwisefit(X,y) % Stepwise regression with default arguments
Initial columns included: none
Step 1, added column 2, p=1.59001e-25
Step 2, added column 4, p=0.00364266
Step 3, added column 1, p=0.0161414
Final columns included:  1 2 4 
    {'Coeff'  }    {'Std.Err.'}    {'Status'}    {'P'         }
    {[-0.4517]}    {[  0.1842]}    {'In'    }    {[    0.0161]}
    {[-2.6407]}    {[  0.4823]}    {'In'    }    {[4.0003e-07]}
    {[ 0.0148]}    {[  0.0157]}    {'Out'   }    {[    0.3472]}
    {[-0.0772]}    {[  0.0204]}    {'In'    }    {[2.6922e-04]}
b4_default = 4×1

   -0.4517
   -2.6407
    0.0148
   -0.0772

The term Displacement never enters the model. Determine if it is highly correlated with the other three terms by computing the term correlation matrix.

corrcoef(X,'Rows','complete') % To exclude rows with missing values from calculation
ans = 4×4

    1.0000   -0.6438   -0.6968   -0.6968
   -0.6438    1.0000    0.9517    0.8622
   -0.6968    0.9517    1.0000    0.9134
   -0.6968    0.8622    0.9134    1.0000

The third row of the correlation matrix corresponds to Displacement. This term is highly correlated with the other three terms, especially Cylinders (0.95) and Horsepower (0.91).

Redefine the input matrix X to include Weight. Specify an initial model containing the terms Displacement and Horsepower by using the 'InModel' name-value pair argument.

X = [Acceleration Cylinders Displacement Horsepower Weight];
inmodel = [false false true true false];
b5_inmodel = stepwisefit(X,y,'InModel',inmodel)
Initial columns included: 3 4 
Step 1, added column 5, p=1.06457e-06
Step 2, added column 2, p=0.00410234
Final columns included:  2 3 4 5 
    {'Coeff'  }    {'Std.Err.'}    {'Status'}    {'P'         }
    {[-0.0912]}    {[  0.2032]}    {'Out'   }    {[    0.6548]}
    {[-2.3223]}    {[  0.7879]}    {'In'    }    {[    0.0041]}
    {[ 0.0252]}    {[  0.0145]}    {'In'    }    {[    0.0862]}
    {[-0.0449]}    {[  0.0231]}    {'In'    }    {[    0.0555]}
    {[-0.0050]}    {[  0.0012]}    {'In'    }    {[1.0851e-04]}
b5_inmodel = 5×1

   -0.0912
   -2.3223
    0.0252
   -0.0449
   -0.0050

The final model consists of terms 2–5. However, Displacement and Horsepower estimates have p-values greater than 0.05 in the final model. You can tune the stepwise algorithm to behave more conservatively by using the 'PRemove' name-value pair argument. For example, setting 'PRemove' to 0.05 (instead of the default 0.1) results in a smaller final model with only two terms, each with a p-value less than 0.05.

b5_inmodel_premove = stepwisefit(X,y,'InModel',inmodel,'PRemove',0.05)
Initial columns included: 3 4 
Step 1, added column 5, p=1.06457e-06
Step 2, added column 2, p=0.00410234
Step 3, removed column 3, p=0.0862131
Step 4, removed column 4, p=0.239239
Final columns included:  2 5 
    {'Coeff'  }    {'Std.Err.'}    {'Status'}    {'P'         }
    {[-0.0115]}    {[  0.1656]}    {'Out'   }    {[    0.9449]}
    {[-1.6037]}    {[  0.5146]}    {'In'    }    {[    0.0025]}
    {[ 0.0101]}    {[  0.0124]}    {'Out'   }    {[    0.4186]}
    {[-0.0234]}    {[  0.0198]}    {'Out'   }    {[    0.2392]}
    {[-0.0055]}    {[  0.0011]}    {'In'    }    {[3.9038e-06]}
b5_inmodel_premove = 5×1

   -0.0115
   -1.6037
    0.0101
   -0.0234
   -0.0055

Center and scale each column (compute the z-scores) before fitting by using the 'Scale' name-value pair argument. The scaling does not change the model selected, the signs of coefficient estimates, or their p-values. However, the scaling does scale the coefficient estimates.

b5_inmodel_premove_scale = stepwisefit(X,y,'InModel',inmodel,'PRemove',0.05,'Scale','on')
Initial columns included: 3 4 
Step 1, added column 5, p=1.06457e-06
Step 2, added column 2, p=0.00410234
Step 3, removed column 3, p=0.0862131
Step 4, removed column 4, p=0.239239
Final columns included:  2 5 
    {'Coeff'  }    {'Std.Err.'}    {'Status'}    {'P'         }
    {[-0.0370]}    {[  0.5339]}    {'Out'   }    {[    0.9449]}
    {[-2.8136]}    {[  0.9028]}    {'In'    }    {[    0.0025]}
    {[ 1.1155]}    {[  1.3726]}    {'Out'   }    {[    0.4186]}
    {[-1.0617]}    {[  0.8961]}    {'Out'   }    {[    0.2392]}
    {[-4.4406]}    {[  0.9028]}    {'In'    }    {[3.9038e-06]}
b5_inmodel_premove_scale = 5×1

   -0.0370
   -2.8136
    1.1155
   -1.0617
   -4.4406

Usually, you scale to compare estimates of terms that are measured in different scales, such as Horsepower and Weight. In this case, increasing Horsepower by one standard deviation leads to an expected drop of 1 in MPG, whereas increasing Weight by one standard deviation leads to an expected drop of 4.4 in MPG.

Load the imports-85 data set. This data set contains characteristics of cars imported in 1985. For a list of all column names, see the variable Description in the workspace or type Description at the command line.

load imports-85
whos 
  Name               Size            Bytes  Class     Attributes

  Description        9x79             1422  char                
  X                205x26            42640  double              

Choose a subset of continuous variables to use in stepwise regression, consisting of the predictor variables engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, and highway-mpg, and the response variable price.

varnames = ["engine-size","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]; % Variable names to use in stepwise regression
dataTbl = array2table(X(:,8:16),'VariableNames',varnames); % Create data table with variable names
Xstepw = dataTbl{:,{'engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg'}}; % Input matrix
ystepw = dataTbl{:,{'price'}}; % Response vector

Run stepwisefit of the variable price on the other eight variables, first with the default constant initial model and then with an initial model including highway-mpg. Omit the display of step information.

[betahat_def,se_def,pval_def,finalmodel_def,stats_def] = stepwisefit(Xstepw,ystepw,'Display',"off"); 
inmodel = [false false false false false false false true]; 
[betahat_in,se_in,pval_in,finalmodel_in,stats_in] = stepwisefit(Xstepw,ystepw,'InModel',inmodel,'Display','off');

Inspect the final models returned by stepwisefit.

finalmodel_def
finalmodel_def = 1×8 logical array

   1   0   1   1   0   1   1   0

finalmodel_in
finalmodel_in = 1×8 logical array

   1   0   1   1   0   1   0   1

The default model drops highway-mpg (term 8) from the model and includes city-mpg (term 7) instead. Compare the root mean squared errors (RMSEs) of these two final models.

stats_def.rmse
ans = 
3.3033e+03
stats_in.rmse
ans = 
3.3324e+03

The model resulting from the default arguments has a slightly lower RMSE. Note that a full specification of the final model consists of the term estimates plus the intercept estimate.

betahat_def % Term estimates
betahat_def = 8×1
103 ×

    0.1559
   -0.2242
   -2.8578
    0.3904
    0.0222
    0.0024
   -0.2414
    0.0793

stats_def.intercept % Intercept estimate
ans = 
-7.3506e+03

Retrieve the history of the default run of stepwisefit and the recommended next step. Omit the display of step information.

[~,~,~,~,~,nextstep_def,history_def]=stepwisefit(Xstepw,ystepw,'Display',"off"); 
nextstep_def
nextstep_def = 
0

No further steps are recommended (nextstep_def is 0).

history_def.('in') 
ans = 7×8 logical array

   1   0   0   0   0   0   0   0
   1   0   0   0   1   0   0   0
   1   0   0   1   1   0   0   0
   1   0   1   1   1   0   0   0
   1   0   1   1   1   1   0   0
   1   0   1   1   1   1   1   0
   1   0   1   1   0   1   1   0

The algorithm performs a total of seven steps. The output shows that engine-size (term 1) is added in step 1, horsepower (term 5) is added in step 2, and so on.

Input Arguments

collapse all

Predictor variables, specified as an n-by-p numeric matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

stepwisefit always includes a constant term in the model. Therefore, do not include a column of 1s in X.

Data Types: single | double

Response variable, specified as an n-by-1 numeric or logical vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.

Data Types: single | double | logical

Note

stepwisefit treats NaN values in either X or y as missing and ignores all rows containing these values.

Name-Value Arguments

collapse all

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'PEnter',0.10,'PRemove',0.15,'MaxIter',8 instructs stepwisefit to use entry and exit tolerances of 0.10 and 0.15, respectively, and to take a maximum of 8 steps.

Terms for the initial model, specified as the comma-separated pair consisting of 'InModel' and a logical vector specifying terms to include in the initial model. The default is to include no terms.

Example: 'InModel',[true false false true]

Data Types: logical

Tolerance for adding terms to the model, specified as the comma-separated pair consisting of 'PEnter' and a positive scalar specifying the maximum p-value for a term to be added. The default is 0.05.

Example: 'PEnter',0.10

Data Types: single | double

Tolerance for removing terms from the model, specified as the comma-separated pair consisting of 'PRemove' and a positive scalar specifying the minimum p-value for a term to be removed. The default is the maximum of PEnter and 0.10.

Note

PRemove is not allowed to be smaller than PEnter because that would cause stepwisefit to enter an infinite loop, where a variable is repeatedly added to the model and removed from the model.

Example: 'PRemove',0.15

Data Types: single | double

Indicator for displaying step information, specified as the comma-separated pair consisting of 'Display' and 'on' or 'off'.

  • 'on' displays information about each step in the Command Window (default).

  • 'off' omits the display.

Example: 'Display','off'

Maximum number of steps, specified as the comma-separated pair consisting of 'MaxIter' and a positive integer or Inf (default). Inf allows the algorithm to run until no single step improves the model.

Example: 'MaxIter',12

Data Types: double

Terms to keep in their initial state, specified as the comma-separated pair consisting of 'Keep' and a logical vector. The value true for a term specified to be in (or out of) the initial model forces that term to remain in (or out of) the final model. The value false for a term does not force that term to remain in (or out of) the final model. The default is to specify no terms to keep in their initial state.

Example: 'Keep',[true true false false]

Data Types: logical

Indicator for centering and scaling terms, specified as the comma-separated pair consisting of 'Scale' and 'off' or 'on'.

  • 'off' does not center and scale the terms (default).

  • 'on' centers and scales each column of X (computes the z-scores) before fitting.

Example: 'Scale','on'

Output Arguments

collapse all

Estimated coefficients, returned as a numeric vector corresponding to the terms in X. The stepwisefit function calculates the values in b as follows:

  • If a term is included in the final model, then its corresponding value in b is the estimate resulting from fitting the final model.

  • If a term is excluded from the final model, then its corresponding value in b is the estimate resulting from fitting the final model plus that term.

Note

To obtain a full specification of the fitted model, you also need the estimated intercept in addition to b. The estimated intercept is provided as a field in the output argument stats. For more details see stepwisefit Fitted Model.

Standard errors, returned as a numeric vector corresponding to the estimates in b.

p-values, returned as a numeric vector that results from testing whether elements of b are 0.

Final model, returned as a logical vector with length equal to the number of columns in X, indicating which terms are in the final model.

Additional statistics, returned as a structure with the following fields. All statistics pertain to the final model except where noted.

FieldDescription
source

Character vector 'stepwisefit'

dfe

Degrees of freedom for error

df0

Degrees of freedom for the regression

SStotal

Total sum of squares of the response

SSresid

Sum of squares of the residuals

fstat

F-statistic for testing the final model vs. no model (mean only)

pval

p-value of the F-statistic

rmse

Root mean squared error

xr

Residuals for terms not in the final model, computed by subtracting from each term the predicted response of the final model

yr

Residuals for the response using predictors in the final model

B

Coefficients for terms in the final model, with the value for each term not in the model set to the value that would be obtained by adding that term to the model

SE

Standard errors for coefficient estimates

TSTAT

t statistics for coefficient estimates

PVAL

p-values for coefficient estimates

intercept

Estimated intercept

wasnan

Rows in the data that contain NaN values

Recommended next step, returned as a nonnegative integer equal to the index of the next term to add to or remove from the model, or 0 if no further steps are recommended.

Information on steps taken, returned as a structure with the following fields.

FieldDescription
B

Matrix of regression coefficients, where each column is one step and each row is one coefficient vector

rmse

Root mean squared errors for the model at each step

df0

Degrees of freedom for the regression at each step

in

Logical array indicating which predictors are in the model at each step, where each row is one step and each column is one predictor

More About

collapse all

Algorithms

Stepwise regression is a method for adding terms to and removing terms from a multilinear model based on their statistical significance. This method begins with an initial model and then takes successive steps to modify the model by adding or removing terms. At each step, the p-value of an F-statistic is computed to test models with and without a potential term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, the term is added to the model. Conversely, if a term is currently in the model, the null hypothesis is that the term has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the term is removed from the model. The method proceeds as follows:

  1. Fit the initial model.

  2. If any terms not in the model have p-values less than an entry tolerance, add the one with the smallest p-value and repeat this step. For example, assume the initial model is the default constant model and the entry tolerance is the default 0.05. The algorithm first fits all models consisting of the constant plus another term and identifies the term that has the smallest p-value, for example term 4. If the term 4 p-value is less than 0.05, then term 4 is added to the model. Next, the algorithm performs a search among all models consisting of the constant, term 4, and another term. If a term not in the model has a p-value less than 0.05, the term with the smallest p-value is added to the model and the process is repeated. When no further terms exist that can be added to the model, the algorithm proceeds to step 3.

  3. If any terms in the model have p-values greater than an exit tolerance, remove the one with the largest p-value and go to step 2; otherwise, end.

In each step of the algorithm, stepwisefit uses the method of least squares to estimate the model coefficients. After adding a term to the model at an earlier stage, the algorithm might subsequently drop that term if it is no longer helpful in combination with other terms added later. The method terminates when no single step improves the model. However, the final model is not guaranteed to be optimal, which means having the best fit to the data. A different initial model or a different sequence of steps might lead to a better fit. In this sense, stepwise models are locally optimal, but are not necessarily globally optimal.

Alternative Functionality

  • You can create a model using fitlm, and then manually adjust the model using step, addTerms, and removeTerms.

  • Use stepwiselm if you have data in a table, you have a mix of continuous and categorical predictors, or you want to specify model formulas that can potentially include higher-order and interaction terms.

  • Use stepwiseglm to create stepwise generalized linear models (for example, if you have a binary response variable and want to fit a classification model).

References

[1] Draper, Norman R., and Harry Smith. Applied Regression Analysis. Hoboken, NJ: Wiley-Interscience, 1998. pp. 307–312.

Version History

Introduced before R2006a