Why is SVM performance with small random datasets so high?
To better understand how SVMs work, I am training a binary SVM with the function fitcsvm on a sample data set of completely random numbers, and cross-validating the classifier with 10-fold cross-validation.
Since the dataset consists of random numbers, I would expect the classification accuracy of the trained cross-validated SVM to be around 50%.
However, with small datasets, for example 2 predictors and 12 observations (6 per class), I get very high classification accuracy, up to about 75%. Accuracy approaches 50% as the dataset grows, for example with 2 predictors and 60 observations, or 40 predictors and 12 observations. Why is the classification accuracy so high with small datasets?
I guess that small datasets overfit more easily. Is that the case here?
With cross-validation, though, the SVM is repeatedly trained on nine partitions and tested on the tenth. Even if the dataset is small, I would still expect an accuracy of around 50%, simply because the tenth partition is made of random numbers. Does the cross-validation perform some optimization of the model parameters?
The code that I am using is something like the following, where I try 100 different combinations of Kernel Scale and Box Constraint and then take the combination that yields the lowest classification error:
SVMModel = fitcsvm(cdata, label, 'KernelFunction','linear', 'Standardize',true,...
MisclassRate = kfoldLoss(SVMModel);
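For reference, the whole procedure described above might look like the following minimal sketch. The grid values, the random seed, the shared partition `cvp`, and the names `boxC`, `kScale`, and `err` are assumptions made for illustration, not the actual settings used in the question.

```matlab
rng(1);                                    % fixed seed so the run is repeatable
nPerClass = 6;  nPred = 2;
cdata = randn(2*nPerClass, nPred);         % pure-noise predictors
label = [ones(nPerClass,1); -ones(nPerClass,1)];

boxC   = logspace(-2, 2, 10);              % hypothetical grid: 10 x 10 = 100 combinations
kScale = logspace(-2, 2, 10);
cvp    = cvpartition(numel(label), 'KFold', 10);   % same folds for every combination

err = zeros(numel(boxC), numel(kScale));
for i = 1:numel(boxC)
    for j = 1:numel(kScale)
        cvModel = fitcsvm(cdata, label, 'KernelFunction','linear', ...
            'Standardize',true, 'BoxConstraint',boxC(i), ...
            'KernelScale',kScale(j), 'CVPartition',cvp);
        err(i,j) = kfoldLoss(cvModel);     % cross-validated misclassification rate
    end
end

bestErr = min(err(:));                     % the optimistic "best of 100" error
meanErr = mean(err(:));                    % typical error across the grid
```

Comparing `bestErr` with `meanErr` shows how much of the apparent accuracy comes from picking the best combination rather than from any single model.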
I would very much appreciate any clarification. Many thanks!
Ilya on 27 Feb 2017
Let me make sure I got your procedure right. You apply M models to a dataset and measure their accuracies by cross-validation. Each model is described by a set of parameter values such as box constraint and kernel scale. Out of these models, you select the one with the largest cross-validation accuracy a_best and record the parameter values for this model, pars_best. To estimate the significance of this model, you learn the same model (that is, pass pars_best to fitcsvm) on R synthetic datasets. Each synthetic dataset is obtained by randomly permuting class labels in the original dataset. You estimate the cdf F(a) over these R accuracy values. Then you take 1-F(a_best) to be the p-value for the null hypothesis "the model pars_best has no discriminative power".
If I got this right, you should modify your procedure like so. In every run (that is, for every noise dataset), instead of recording the accuracy of a model learned using pars_best, search for the best model over the M parameter values and record the accuracy of that best model. Estimate the cdf F_noisebest(a) using these R values and take 1-F_noisebest(a_best) to be the p-value.
In your procedure, you apply a single classifier to a noise dataset, and its accuracy is expected to be that of a random coin toss (perhaps an unfair toss if you have imbalanced classes). In my procedure, you choose the best out of M classifiers applied to a noise dataset, and the best accuracy will usually be better (or a lot better) than a random coin toss. This could increase your estimate of the p-value quite a bit, making the best model pars_best less significant.
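A minimal sketch of this modified procedure, assuming a small hypothetical 3-by-3 parameter grid, R = 20 permutations, and placeholder data `X` and `y` (none of these values come from the thread; the helper name `bestAcc` is also made up for illustration):

```matlab
rng(2);                                        % fixed seed for repeatability
X = randn(12, 2);                              % placeholder predictors
y = [ones(6,1); -ones(6,1)];                   % placeholder class labels

[BC, KS] = ndgrid(logspace(-1,1,3), logspace(-1,1,3));   % M = 9 parameter pairs
cvp = cvpartition(numel(y), 'KFold', 10);

% Best cross-validated accuracy over all M parameter pairs for given labels
bestAcc = @(labels) max(arrayfun(@(b,k) 1 - kfoldLoss(fitcsvm(X, labels, ...
    'KernelFunction','linear', 'Standardize',true, ...
    'BoxConstraint',b, 'KernelScale',k, 'CVPartition',cvp)), BC(:), KS(:)));

a_best = bestAcc(y);                           % best accuracy on the original labels

R = 20;                                        % number of noise datasets
a_noise = zeros(R, 1);
for r = 1:R
    permLabels = y(randperm(numel(y)));        % permute labels to destroy any signal
    a_noise(r) = bestAcc(permLabels);          % repeat the full search on noise
end

% Empirical 1 - F_noisebest(a_best): share of noise runs that beat a_best
pValue = mean(a_noise > a_best);
```

The key point is that the search over all M parameter pairs sits inside the permutation loop, so the null distribution reflects the same selection step that produced a_best.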
You could also use simple analytic formulas for the binomial distribution and order statistic to verify your computation.
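One way such a check might look, as a rough sketch: if each of M models classified n observations correctly with probability 0.5, independently of the other models, the number of correct classifications of the best model would follow the distribution of the maximum of M binomial draws. The independence assumption is mine and is unrealistic for models trained on the same data, so the result is only an upper-end reference, not a prediction. The values n = 12 and M = 100 are taken from the question.

```matlab
n = 12;                                % observations
M = 100;                               % models compared
k = 0:n;                               % possible numbers of correct classifications

F    = binocdf(k, n, 0.5);             % cdf for a single chance-level model
Fmax = F.^M;                           % cdf of the maximum over M independent models
pmfMax = diff([0 Fmax]);               % pmf of the maximum (largest order statistic)

expectedBestAcc = sum(k .* pmfMax) / n;   % expected accuracy of the best model
```

Because real models fitted to the same data are strongly correlated, the observed best accuracy would be expected to fall between 0.5 and `expectedBestAcc`.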
More Answers (1)
Ilya on 31 Jan 2017
You have 12 observations. For each observation, the probability of correct classification is 0.5. What is the probability of classifying 9 or more observations correctly by chance? It's
>> p = binocdf(8,12,0.5,'upper')
And what is the probability of that chance event occurring at least once in 100 experiments? It's
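A minimal sketch of that calculation, assuming the 100 experiments are independent (an assumption for illustration; models trained on the same data are correlated, so the true value is lower). The names `nExp` and `pAtLeastOnce` are made up here.

```matlab
p = binocdf(8, 12, 0.5, 'upper');      % P(9 or more correct out of 12) by chance
nExp = 100;                            % number of parameter combinations tried

% P(at least one success) = 1 - P(no success in any experiment)
pAtLeastOnce = 1 - (1 - p)^nExp;

% Equivalent form: P(more than 0 successes in nExp Bernoulli(p) trials)
pCheck = binocdf(0, nExp, p, 'upper');
```

Under this assumption, seeing at least one model reach 75% accuracy on pure noise is close to certain.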
Since you take the most accurate model, you always get a highly optimistic estimate of accuracy, that's all.