Non-reproducible "fitcsvm" MATLAB output

Mm on 9 Jun 2023
Commented: Mm on 10 Jun 2023
load ionosphere
% run number 1
rng(1); % For reproducibility
SVMModel1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
% run number 2
indperm = randperm(size(X,1))';
X=X(indperm,:);
Y=Y(indperm);
SVMModel2 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
SVMModel1 and SVMModel2 are different (in their bias and kernel scale values) just from changing the row order of the input data X and Y. Any idea what's going on?
Thanks for the help.
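For reference, one way to probe this (a sketch only; Mdl1, Mdl2, Xp and Yp are just illustrative names): as far as I can tell from the fitcsvm documentation, 'KernelScale','auto' estimates the scale from a random subsample, so re-seeding the generator immediately before each fit keeps the drawn subsample indices identical, and any remaining difference comes from the permuted data itself.
load ionosphere
rng(1); % re-seed right before fit 1
Mdl1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale','auto');
indperm = randperm(size(X,1))';
Xp = X(indperm,:);
Yp = Y(indperm);
rng(1); % re-seed right before fit 2 as well
Mdl2 = fitcsvm(Xp,Yp,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale','auto');
[Mdl1.KernelParameters.Scale, Mdl2.KernelParameters.Scale] % may still differ: same subsample indices, but different rows behind them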
4 Comments
Rik on 9 Jun 2023
I'm not sure you fully understand what rng(1) does (or I'm misunderstanding you).
What it does is set the state of the random number generator, making sure that the output of any random function is deterministic (though still random). An example will help:
rng(1)
A = randi(20,1);
rng(1)
B = randi(20,1);
C = randi(20,1);
% A should now be equal to B, but C may be different
A,B,C
A = 9
B = 9
C = 15
So there are two reasons why the output is not the same despite calling rng: you have already called random functions (which advances the generator state), and you are changing the input (which could affect the results).
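(For the former, a quick sketch: re-seeding right before every call makes all the draws equal again.)
rng(1); A = randi(20,1);
rng(1); B = randi(20,1);
rng(1); C = randi(20,1);
% A, B and C are now all equal, because the generator state is reset before each draw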
For an example of the latter: I don't know how the internals of fitcsvm work, but that doesn't matter for the concept anyway.
rng(1)
data = 5*rand(2000,1);
indperm = randperm(size(data,1))';
SuperFancyMachineLearningMean(data)-mean(data)
ans = 4.4409e-16
SuperFancyMachineLearningMean(data(indperm))-mean(data)
ans = 8.8818e-16
function output = SuperFancyMachineLearningMean(data)
% Calculate (well, approximate, actually) the mean of a vector.
% Split the data in N blocks.
N = min(numel(data),10);
D1 = repmat(ceil(numel(data)/N),1,N);
D2 = 1;
D1(end) = numel(data)-sum(D1(1:(end-1))); % make the last smaller to fit element count
d = mat2cell(reshape(data,[],1),D1,D2);
for n=1:numel(d)
d{n} = mean(d{n});
end
output = mean([d{:}]);
end
This is apparently not as bad an example as I thought (unless you're working with very small numbers), but the idea carries over.
Mm on 10 Jun 2023
Thanks for the explanation. We are on the same page regarding rng. Anyway, your code hints that the subsampling performed internally for the SVM kernel scale estimation is responsible for the non-reproducibility of the final model when the input ordering changes. Thanks for the collaboration.
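As a rough check of that hypothesis (a sketch only; it assumes the kernel-scale heuristic is the main order-dependent step), fixing 'KernelScale' to a numeric value removes the subsampling, and the two fits should then agree much more closely under a row permutation:
load ionosphere
rng(1); % only so that randperm is repeatable
MdlA = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale',1); % fixed scale, no heuristic
indperm = randperm(size(X,1))';
MdlB = fitcsvm(X(indperm,:),Y(indperm),'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale',1);
[MdlA.Bias, MdlB.Bias] % expected to agree up to solver tolerance (not guaranteed to be bit-identical)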


Accepted Answer

Rik on 9 Jun 2023
I'm not familiar with the internals of what this does exactly, but is this truly unexpected?
Since this is a form of fitting your data to a function, some variation is expected. For small fitting problems you can use the entire dataset in one go, meaning that sorting may or may not affect the result, but with machine learning this is generally not feasible. That means that the order of your samples may affect the training result.
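(A very small sketch of why order can matter at all at the numerical level: plain summation already depends on the order of the operands, because floating-point addition is not associative.)
x = rand(1e6,1);
p = randperm(numel(x));
sum(x) - sum(x(p)) % often a tiny nonzero value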
2 Comments
Mm on 9 Jun 2023
Edited: Mm on 9 Jun 2023
Why should this be expected? I expected that differences in the order of the samples would be handled internally by the code, so as to give reproducible results. The slight differences seen with linear kernels and simple optimizers explode dramatically when using polynomial or RBF kernels and Bayesian optimization.
Rik on 9 Jun 2023
Would you still expect the code to sort the data internally in some way if we're talking terabytes of data? Because that is essentially what you're asking. Note that I'm not defending the current implementation of this function, I'm merely explaining why I'm not surprised that there are functions in the stats&ML toolbox for which this happens.
This is essentially the same problem as when you make splits for cross-validation: the splits may determine the outcome (I don't recall whether my colleague published this, so you will have to look for it yourself if you want to see a paper). While it is true that small changes in the data may explode when extrapolating, that is not unique to systems that depend on the data input order. Every extrapolation runs this risk.
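(For illustration, a sketch of the split effect, with arbitrary settings rather than anything from that paper: two different 10-fold partitions of the same data usually give slightly different cross-validated losses.)
load ionosphere
rng(1); c1 = cvpartition(Y,'KFold',10);
rng(2); c2 = cvpartition(Y,'KFold',10);
CV1 = fitcsvm(X,Y,'Standardize',true,'CVPartition',c1);
CV2 = fitcsvm(X,Y,'Standardize',true,'CVPartition',c2);
[kfoldLoss(CV1), kfoldLoss(CV2)] % typically close, but not identical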


More Answers (0)

Release: R2022b
