Why aren't plsregress and SVM working together?

2 views (last 30 days)
Juuso Korhonen on 1 Jun 2021
Answered: Sai Pavan on 17 Feb 2024
Hi,
I've been trying to use plsregress as a dimensionality reduction method for my large 3D volumes before classifying them with an SVM. Here is my approach in code:
% Cross validation (train: 70%, test: 30%)
cv = cvpartition(size(X,1),'HoldOut',0.3);
idx = cv.test;
% Separate to training and test data
XTrain = X(~idx,:);
YTrain = Y(~idx, :);
XTest = X(idx,:);
YTest = Y(idx, :);
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(XTrain,YTrain,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
SVMModel = fitcsvm(XS,YTrain,'Standardize',true,'KernelFunction','RBF',...
'KernelScale','auto');
CVSVMModel = crossval(SVMModel);
classLoss = kfoldLoss(CVSVMModel)
This classLoss (the average misclassification rate with 10-fold cross-validation) comes out to a really good 0.0143. But then if I try:
[label,score] = predict(SVMModel, XTest*XL);
accuracy = mean(label == YTest)
I get only 0.4444, which is really bad. Obviously something is going wrong. Isn't XTest*XL the correct way to map the test samples into the latent space?
Even with:
[label,score] = predict(SVMModel, XTrain*XL);
accuracy = mean(label == YTrain)
I only get 0.52 training accuracy, which doesn't make sense, since according to classLoss it should be close to 0.98. When I look at the predicted labels, they are all 1s, and the scores are all around [-0.1 0.1].
Best wishes,
Juuso

Answers (1)

Sai Pavan on 17 Feb 2024
Hello Juuso,
I see that you are trying to perform dimensionality reduction on 3D volumes before performing SVM classification.
As you have rightly done, the scores for the predictor variables, XS, should be used to train the SVM classifier. However, when you apply the PLS model to new data, such as your test data XTest, you need to compute scores for that data in the same way the training scores were computed. Simply multiplying XTest by XL is not correct: first mean-center the test data using the mean of the training data, and then project it with the PLS weight matrix stats.W (returned in the last output of plsregress), since the documentation notes that the training scores satisfy XS = X0*stats.W, where X0 is the centered training data. Projecting onto the loadings XL, even after centering, generally does not reproduce the training-score space, which is why your predictions collapse to a single class. Please refer to the documentation to learn more about the PLS algorithm: https://www.mathworks.com/help/stats/plsregress.html
Please refer to the code snippet below, which illustrates the modified workflow:
% Cross-validation (train: 70%, test: 30%)
cv = cvpartition(size(X,1),'HoldOut',0.3);
idx = cv.test;
% Separate to training and test data
XTrain = X(~idx,:);
YTrain = Y(~idx, :);
XTest = X(idx,:);
YTest = Y(idx, :);
% Perform PLS regression to get the scores for the training data
nComponents = 10;
[XL,~,XS,~,~,~,~,stats] = plsregress(XTrain,YTrain,nComponents); % stats.W holds the PLS weights
% Train SVM using the scores from PLS
SVMModel = fitcsvm(XS,YTrain,'Standardize',true,'KernelFunction','RBF',...
'KernelScale','auto');
% Perform k-fold cross-validation and compute the loss
CVSVMModel = crossval(SVMModel);
classLoss = kfoldLoss(CVSVMModel);
% Calculate the scores for the test data
XTestCentered = XTest - mean(XTrain); % Center the test data with the training mean
XTestScores = XTestCentered * stats.W; % Project with the PLS weights (XS = X0*W)
% Predict using the SVM model and the test scores
[label,~] = predict(SVMModel, XTestScores);
% Calculate accuracy
accuracy = mean(label == YTest);
As for the performance of the SVM on the training data, there can be several reasons. Some potential issues to consider (see the sketches after this list):
  • The number of PLS components chosen might not be optimal. Too few components might not capture enough variance, while too many might include noise. Try adjusting the number of components to see if it improves performance.
  • The choice of kernel function and its parameters can greatly affect SVM performance. You are using the RBF kernel; make sure that the 'KernelScale' parameter is appropriate for your data, for example by searching over kernel parameters.
  • The SVM might not be flexible enough to capture the patterns in the data, or it might be too heavily regularized. You can also experiment with different machine learning classifiers to find a better fit.
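For the first point, plsregress has a built-in cross-validation option that can guide the choice. Here is a minimal sketch, assuming a search range of up to 20 components (maxComp is an illustrative value, not something derived from your data):
maxComp = 20; % Assumed upper bound on the number of components to try
[~,~,~,~,~,~,MSE] = plsregress(XTrain, YTrain, maxComp, 'CV', 10);
plot(0:maxComp, MSE(2,:), '-o'); % Row 2 of MSE is the cross-validated error for Y
xlabel('Number of PLS components');
ylabel('Cross-validated MSE for Y');
[~,idxMin] = min(MSE(2,:));
nComponents = idxMin - 1; % Column 1 of MSE corresponds to 0 components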
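For the kernel settings, fitcsvm can search over 'BoxConstraint' and 'KernelScale' itself. A minimal sketch (the optimization options shown are illustrative defaults, not tuned values):
SVMModel = fitcsvm(XS, YTrain, ...
    'Standardize', true, ...
    'KernelFunction', 'rbf', ...
    'OptimizeHyperparameters', {'BoxConstraint','KernelScale'}, ...
    'HyperparameterOptimizationOptions', ...
        struct('Optimizer','bayesopt', 'KFold',10, 'ShowPlots',false));
This replaces the fixed 'KernelScale','auto' setting with a cross-validated search, and the resulting model can then be evaluated on the test scores as above.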
Hope it helps!

Release

R2021a
