Are lassoglm solutions independent of data order?

5 views (last 30 days)
Ken Johnson
Ken Johnson on 13 Aug 2024
Commented: Ken Johnson on 19 Aug 2024
I thought lassoglm solutions were unique, but I find that the solution from lassoglm depends on the column order of the X array. Is there a way to avoid this? Here's my example. I have 3 X variables (3 columns) and the Y variable. I get different solutions with X(var1, var2, var3) versus X(var1, var3, var2). In the example code, the fitted values are:
CONC123 = [21.54, 1.689, 0.726]
CONC132 = [21.94, 2.558, 0]
Which solution is most correct? The deviance for the 123 solution is a bit smaller.
load('YX123') % Y and X(var1, var2, var3)
load('YX132') % Y and X(var1, var3, var2)
lambda = 0.0005; % lambda was optimized at 0.0005 with a training set
reltol = 1e-4;   % default value
alpha = 1;       % Alpha = 1 forces lasso (no ridge penalty)
[CONC123, FitInfo123] = lassoglm(X123, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
[CONC132, FitInfo132] = lassoglm(X132, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);

Answers (1)

Ayush
Ayush on 16 Aug 2024
Hi Ken,
The issue you're encountering is related to the numerical stability and convergence properties of the optimization algorithm used by the lassoglm function. The order of the columns in the X matrix can sometimes affect the solution because of these numerical properties, even though, in theory, the Lasso solution should not depend on the column order of X.
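One quick check worth trying (this goes beyond the original answer, and the 1e-8 value is just an illustrative choice): tighten RelTol so the coordinate-descent solver runs closer to the true optimum, where the column-order dependence should largely vanish. A sketch, reusing the variable names from the question:

```matlab
% Tighten the convergence tolerance; a stricter RelTol makes the solver
% iterate closer to the true minimizer, so the two column orders should
% give nearly identical coefficients.
reltolTight = 1e-8; % much stricter than the 1e-4 default
[B123, Fit123] = lassoglm(X123, Y, 'normal', 'Alpha', 1, ...
    'Lambda', 0.0005, 'RelTol', reltolTight);
[B132, Fit132] = lassoglm(X132, Y, 'normal', 'Alpha', 1, ...
    'Lambda', 0.0005, 'RelTol', reltolTight);
% Compare side by side after restoring the original variable order
% (columns of X132 are var1, var3, var2):
disp([B123, B132([1 3 2])])
```

If the coefficients agree at the tighter tolerance, the difference you saw was a convergence artifact rather than two genuinely distinct solutions.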
That said, here are several methods I use to mitigate this type of numerical instability:
  1. Standardize the features: Standardizing the features, i.e. scaling them to have zero mean and unit variance, can make the optimization more stable and less sensitive to the order of the columns. Here's example code for standardizing the features and performing the same Lasso regression on the standardized features:
% Standardize the features
X123_standardized = zscore(X123);
X132_standardized = zscore(X132);
% Perform Lasso regression on standardized features
[CONC123, FitInfo123] = lassoglm(X123_standardized, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
[CONC132, FitInfo132] = lassoglm(X132_standardized, Y, 'normal', 'Alpha', alpha, 'Lambda', lambda, 'RelTol', reltol);
  2. Compare deviances: If the deviance for one solution is smaller, that solution is generally preferable. However, it's essential to confirm that the smaller deviance is not due to overfitting.
deviance123 = FitInfo123.Deviance;
deviance132 = FitInfo132.Deviance;
if deviance123 < deviance132
    disp('Solution 123 has the lower deviance')
else
    disp('Solution 132 has the lower deviance')
end
Note: One more technique I generally use is cross-validation. It helps ensure that the chosen model generalizes well to unseen data, and it can also reduce the sensitivity to feature order.
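As a sketch of that suggestion, lassoglm has a built-in 'CV' option for selecting Lambda by cross-validation (the 10-fold choice below is an assumption, and the variable names reuse those from the question):

```matlab
% 10-fold cross-validation over a Lambda path; FitInfo then reports the
% Lambda with minimum CV deviance and the sparser one-standard-error choice.
rng('default') % fix the CV partition for reproducibility
[B, FitInfo] = lassoglm(X123, Y, 'normal', 'Alpha', 1, 'CV', 10);
idxMinDev = FitInfo.IndexMinDeviance;   % index of min-deviance Lambda
lambdaCV  = FitInfo.LambdaMinDeviance;  % the selected Lambda value
coefCV    = B(:, idxMinDev);            % coefficients at that Lambda
idx1SE    = FitInfo.Index1SE;           % sparser model within 1 SE
```

Comparing the cross-validated Lambda with your hand-tuned 0.0005 is a useful sanity check on the model selection.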
So, by standardizing your features, comparing deviances, and using cross-validation, you can reduce the sensitivity of your Lasso regression solutions to the order of the columns in X. The solution with the smaller deviance is generally preferable, but it's crucial to ensure that this is not due to overfitting.
For standardization, I've used the zscore function; see the MATLAB documentation on zscore and on Lasso regularization for more details.
Hope it helps!

Release

R2020b

