Non-reproducible "fitcsvm" MATLAB output

Mm on 9 Jun 2023
Commented: Mm on 10 Jun 2023
load ionosphere
% run number 1
rng(1); % For reproducibility
SVMModel1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
% run number 2
indperm = randperm(size(X,1))';
X=X(indperm,:);
Y=Y(indperm);
SVMModel2 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','CacheSize','maximal','Solver','L1QP','KernelScale','auto');
SVMModel1 and SVMModel2 are different (in their bias and kernel scale values) just from changing the row order of the input data X and Y. Any idea what's going on?
Thanks for the help.
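For reference, one way to probe this (a sketch only; Mdl1, Mdl2, Xp and Yp are just illustrative names): as far as I can tell from the fitcsvm documentation, 'KernelScale','auto' estimates the scale from a random subsample, so re-seeding the generator immediately before each fit keeps the drawn subsample indices identical, and any remaining difference comes from the permuted data itself.
load ionosphere
rng(1); % re-seed right before fit 1
Mdl1 = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale','auto');
indperm = randperm(size(X,1))';
Xp = X(indperm,:);
Yp = Y(indperm);
rng(1); % re-seed right before fit 2 as well
Mdl2 = fitcsvm(Xp,Yp,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale','auto');
[Mdl1.KernelParameters.Scale, Mdl2.KernelParameters.Scale] % may still differ: same subsample indices, but different rows behind them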
4 Comments
Rik on 9 Jun 2023
I'm not sure you fully understand what rng(1) does (or I'm misunderstanding you).
What it does is set the state of the random number generator, making sure that the output of any random function is deterministic (though still random). An example will help:
rng(1)
A = randi(20,1);
rng(1)
B = randi(20,1);
C = randi(20,1);
% A should now be equal to B, but C may be different
A,B,C
A = 9
B = 9
C = 15
So there are two reasons why the output is not the same despite calling rng: you have already called random functions (which advances the generator state), and you are changing the input (which could affect the results).
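(For the former, a quick sketch: re-seeding right before every call makes all the draws equal again.)
rng(1); A = randi(20,1);
rng(1); B = randi(20,1);
rng(1); C = randi(20,1);
% A, B and C are now all equal, because the generator state is reset before each draw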
For an example of the latter: I don't know how the internals of fitcsvm work, but that doesn't matter for the concept anyway.
rng(1)
data = 5*rand(2000,1);
indperm = randperm(size(data,1))';
SuperFancyMachineLearningMean(data)-mean(data)
ans = 4.4409e-16
SuperFancyMachineLearningMean(data(indperm))-mean(data)
ans = 8.8818e-16
function output = SuperFancyMachineLearningMean(data)
% Calculate (well, approximate, actually) the mean of a vector.
% Split the data in N blocks.
N = min(numel(data),10);
D1 = repmat(ceil(numel(data)/N),1,N);
D2 = 1;
D1(end) = numel(data)-sum(D1(1:(end-1))); % make the last smaller to fit element count
d = mat2cell(reshape(data,[],1),D1,D2);
for n=1:numel(d)
d{n} = mean(d{n});
end
output = mean([d{:}]);
end
This is apparently not as bad an example as I thought (unless you're working with very small numbers), but the idea carries over.
Mm on 10 Jun 2023
Thanks for the explanation. We are on the same page regarding rng. Anyway, your code hints that the subsampling performed internally for the SVM kernel scale estimation is responsible for the non-reproducibility of the final model when the input ordering changes. Thanks for the collaboration.
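As a rough check of that hypothesis (a sketch only; it assumes the kernel-scale heuristic is the main order-dependent step), fixing 'KernelScale' to a numeric value removes the subsampling, and the two fits should then agree much more closely under a row permutation:
load ionosphere
rng(1); % only so that randperm is repeatable
MdlA = fitcsvm(X,Y,'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale',1); % fixed scale, no heuristic
indperm = randperm(size(X,1))';
MdlB = fitcsvm(X(indperm,:),Y(indperm),'Standardize',true,'KernelFunction','linear','Solver','L1QP','KernelScale',1);
[MdlA.Bias, MdlB.Bias] % expected to agree up to solver tolerance (not guaranteed to be bit-identical)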


Accepted Answer

Rik on 9 Jun 2023
I'm not familiar with the internals of what this does exactly, but is this truly unexpected?
Since this is a form of fitting your data to a function, some variation is expected. For small fitting problems you can use the entire dataset in one go, meaning that sorting may or may not affect the result, but with machine learning this is generally not feasible. That means that the order of your samples may affect the training result.
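(A very small sketch of why order can matter at all at the numerical level: plain summation already depends on the order of the operands, because floating-point addition is not associative.)
x = rand(1e6,1);
p = randperm(numel(x));
sum(x) - sum(x(p)) % often a tiny nonzero value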
2 Comments
Mm on 9 Jun 2023
Edited: Mm on 9 Jun 2023
Why should this be expected? I expected that differences in the order of the samples would be handled internally by the code, so as to give reproducible results. The slight differences seen with linear kernels and simple optimizers explode dramatically when using polynomial or RBF kernels and Bayesian optimization.
Rik on 9 Jun 2023
Would you still expect the code to sort the data internally in some way if we're talking terabytes of data? Because that is essentially what you're asking. Note that I'm not defending the current implementation of this function, I'm merely explaining why I'm not surprised that there are functions in the stats&ML toolbox for which this happens.
This is essentially the same problem as when you make splits for cross-validation: the splits may determine the outcome (I don't recall whether my colleague published this, so you will have to look for it yourself if you want to see a paper). While it is true that small changes in the data may explode when extrapolating, that is not unique to systems that depend on the data input order. Every extrapolation runs this risk.
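(For illustration, a sketch of the split effect, with arbitrary settings rather than anything from that paper: two different 10-fold partitions of the same data usually give slightly different cross-validated losses.)
load ionosphere
rng(1); c1 = cvpartition(Y,'KFold',10);
rng(2); c2 = cvpartition(Y,'KFold',10);
CV1 = fitcsvm(X,Y,'Standardize',true,'CVPartition',c1);
CV2 = fitcsvm(X,Y,'Standardize',true,'CVPartition',c2);
[kfoldLoss(CV1), kfoldLoss(CV2)] % typically close, but not identical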


More Answers (0)

Release: R2022b
