permutation in regression learner app

Question

Parisa Ahmadi Ghomroudi on 12 Oct 2021


https://it.mathworks.com/matlabcentral/answers/1562071-permutation-in-regression-learner-app

Answered: Shubham on 15 May 2024

I am using the Statistics and Machine Learning Toolbox to find the best model for predicting my variable of interest. Once I have the best model, how can I run a permutation test to assess its p-value?

2 Comments

Ive J on 24 Oct 2021

"How can I run a permutation test to assess its p-value?" What exactly do you mean by computing a permutation?

Parisa Ahmadi Ghomroudi on 25 Oct 2021

Thank you for your reply. The best model for my data is a boosted tree, and I want to find the predictor importance for my data by permutation. I could not find an option for this in the MATLAB toolbox. I tried oobPermutedPredictorImportance, but it seems to be suitable only for bagged ensembles.


Answer 1

Shubham on 15 May 2024


https://it.mathworks.com/matlabcentral/answers/1562071-permutation-in-regression-learner-app#answer_1457726


Hi Parisa,

To assess the significance (p-value) of your best model's performance with a permutation test in MATLAB, using the Statistics and Machine Learning Toolbox, you can follow a general approach: shuffle the responses of your dataset many times, refit the model on each shuffled copy, and record its performance. Shuffling breaks any real relationship between predictors and response, so the shuffled fits show what performance looks like by chance alone. Comparing the original model's performance against that distribution tells you how likely it is to observe such performance by chance.

Here's a step-by-step guide to performing a permutation test:

1. Fit Your Best Model

First, fit your model using the original dataset. This involves selecting your predictors (features) and response variable, then training the model accordingly.

% Assuming X are your predictors and Y is your response variable
bestModel = fitlm(X, Y); % Example for a linear model, adjust according to your model type
% Compute the performance of your model, e.g., R-squared, RMSE, accuracy...
originalPerformance = bestModel.Rsquared.Ordinary; % Adjust this metric as needed

2. Permutation Test

To perform the permutation test, you will shuffle the response variable Y multiple times, refit the model for each shuffled dataset, and compute its performance metric.

numPermutations = 1000; % Number of permutations
performanceShuffled = zeros(numPermutations, 1); % Preallocate array for performance metrics
for i = 1:numPermutations
    % Shuffle the response variable
    Y_shuffled = Y(randperm(length(Y)));
    
    % Fit the model to the shuffled dataset
    modelShuffled = fitlm(X, Y_shuffled); % Adjust for your model type
    
    % Compute the performance metric for the shuffled model
    performanceShuffled(i) = modelShuffled.Rsquared.Ordinary; % Adjust metric as needed
end

3. Compute the P-value

After obtaining the distribution of performance metrics from the shuffled datasets, compute the p-value as the proportion of shuffled models whose performance equals or exceeds that of the original model. Adding 1 to both the count and the denominator treats the observed fit as one of the permutations, so the estimate can never be exactly zero.

pValue = (sum(performanceShuffled >= originalPerformance) + 1) / (numPermutations + 1);

Notes

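The choice of performance metric (e.g., R-squared, RMSE, accuracy) depends on your model type and on whether the task is regression or classification. For error metrics such as RMSE, where lower is better, flip the comparison in step 3 to <=.
Adjust the model fitting function (fitlm in the example) to match your model, e.g., fitglm for generalized linear models, fitrtree for regression trees, or fitrensemble with 'Method','LSBoost' for boosted regression trees. Tree and ensemble models have no Rsquared property, so compute the metric from predictions or from a loss function instead. For flexible models such as ensembles, prefer a cross-validated or held-out metric, because the in-sample fit can look good even on shuffled responses.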
A low p-value (typically <0.05) suggests that the observed model performance is unlikely to be due to chance, indicating a potentially significant relationship captured by the model.
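
As a rough illustration of how these notes might apply to a boosted tree, here is a minimal sketch of the same permutation test using fitrensemble with 'Method','LSBoost' and 5-fold cross-validated MSE as the metric. The synthetic X and Y, the number of trees, and the reduced number of permutations are all assumptions chosen to keep the example short; substitute your own data and settings.

```matlab
% Sketch: permutation test for a boosted regression ensemble.
% Synthetic data stand in for the real predictors X and response Y (assumption).
rng(0);                                  % reproducible shuffles and folds
n = 150;
X = randn(n, 4);
Y = 2*X(:,1) - X(:,2) + 0.5*randn(n, 1);

numTrees = 50;                           % hypothetical setting, tune for real data
numPermutations = 100;                   % kept small here because each fit is slow
cvp = cvpartition(n, 'KFold', 5);        % same folds reused for every fit

% Cross-validated MSE on the real responses (lower is better)
cvModel = fitrensemble(X, Y, 'Method', 'LSBoost', ...
    'NumLearningCycles', numTrees, 'CVPartition', cvp);
originalLoss = kfoldLoss(cvModel);

permLoss = zeros(numPermutations, 1);
for i = 1:numPermutations
    Yperm = Y(randperm(n));              % break the X-Y relationship
    cvPerm = fitrensemble(X, Yperm, 'Method', 'LSBoost', ...
        'NumLearningCycles', numTrees, 'CVPartition', cvp);
    permLoss(i) = kfoldLoss(cvPerm);
end

% MSE is an error metric, so "at least as good" means less than or equal
pValue = (sum(permLoss <= originalLoss) + 1) / (numPermutations + 1);
```

Because the metric here is an error, the comparison uses <= rather than >=. Reusing one cvpartition for all fits keeps the fold assignment from adding extra noise to the comparison.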

This approach provides a non-parametric way to assess the significance of your model's predictive ability, complementing traditional statistical tests and confidence intervals that assume specific data distributions.
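
On the predictor-importance question raised in the comments: oobPermutedPredictorImportance relies on out-of-bag observations, which a boosted ensemble does not have. One possible workaround, sketched below under assumptions, is to permute one predictor column at a time in a held-out test set and measure how much the test MSE increases. The synthetic data, the 30% holdout, and the number of repeats are illustrative choices, not settings from your problem.

```matlab
% Sketch: permutation predictor importance for a boosted regression ensemble,
% measured on a held-out test set. Synthetic data are an assumption.
rng(1);
n = 300;
X = randn(n, 4);
Y = 3*X(:,1) + X(:,2) + 0.5*randn(n, 1);   % predictors 3 and 4 are pure noise

holdout = cvpartition(n, 'HoldOut', 0.3);
Xtrain = X(training(holdout), :);  Ytrain = Y(training(holdout));
Xtest  = X(test(holdout), :);      Ytest  = Y(test(holdout));

mdl = fitrensemble(Xtrain, Ytrain, 'Method', 'LSBoost', 'NumLearningCycles', 100);
baselineMSE = loss(mdl, Xtest, Ytest);      % test-set mean squared error

numRepeats = 20;                            % shuffles averaged per predictor
numPredictors = size(X, 2);
importance = zeros(1, numPredictors);
for j = 1:numPredictors
    permMSE = zeros(numRepeats, 1);
    for r = 1:numRepeats
        Xperm = Xtest;
        Xperm(:, j) = Xtest(randperm(size(Xtest, 1)), j);  % shuffle one column only
        permMSE(r) = loss(mdl, Xperm, Ytest);
    end
    importance(j) = mean(permMSE) - baselineMSE;           % increase in test error
end
```

A larger value in importance means the model depends more on that predictor. Values near zero, or slightly negative, suggest the predictor contributes little to the held-out predictions.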