How to detect and remove outliers when the data are not normally distributed

I have seen many people say that z-score and mapstd standardization are good ways to detect outliers. But the z-score is only useful when the data are normally distributed, and I found that my data do not follow a normal distribution. What should I do?
(1) Should I transform my data (Box-Cox or Johnson transformation) to a normal distribution and then use z-scores to detect outliers?
(2) After transforming and removing the outliers, should I use the transformed data or the original data (with the outliers removed from both) as the input to a neural network? I found that if I feed the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?
Can anybody help? Thanks a lot.

Answers (1)

John D'Errico on 12 Jul 2022
z-scores are not a terrible approach. HOWEVER, if you have many outliers, then they will themselves bias the z-scores, making outliers less easy to detect.
A better scheme might be to use the parameters from a trimmed data set. For example, suppose we start with a corrupted set of data. In this example, the data should be normally distributed with mean 0 and standard deviation 1, but I then corrupted it with about 5% high-variance random crap that has a non-zero mean to boot.
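% 1000 clean N(0,1) points plus 50 contaminating points with mean 5, std 50
X = [randn(1000,1);5 + randn(50,1)*50];
% shuffle so the contaminating points are not all at the end
X = X(randperm(numel(X)));
histogram(X,100,'Normalization','pdf')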
If we try to estimate the mean and variance from this set, we will get a rather poor estimate.
mean(X)
ans = 0.1366
std(X)
ans = 11.7369
So total crapola for a result. Could we use z-scores to search for the outliers? Probably not, since the standard deviation of the data is itself huge, because the outliers were still in the data when I computed those parameters. So the z-scores will look VERY strange themselves.
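To make that concrete, here is a minimal sketch of what naive z-scores do on data built the same way. The fixed seed, the bookkeeping vector isJunk, and the cutoff of 3 are my own assumptions for illustration, not part of the original example.

```matlab
% Sketch: z-scores computed from the contaminated sample itself.
rng(0);                                  % fixed seed, only so the run is repeatable (assumption)
clean  = randn(1000,1);                  % the "good" data, N(0,1)
junk   = 5 + randn(50,1)*50;             % the contamination, as in the example above
X      = [clean; junk];
isJunk = [false(1000,1); true(50,1)];    % hypothetical label vector, for bookkeeping only

z       = (X - mean(X)) / std(X);        % classical z-scores, parameters from ALL the data
flagged = abs(z) > 3;                    % 3 is a conventional but arbitrary cutoff

fprintf('std used for the z-scores: %.2f\n', std(X));
fprintf('contaminating points flagged: %d of %d\n', sum(flagged & isJunk), sum(isJunk));
fprintf('clean points flagged:         %d of %d\n', sum(flagged & ~isJunk), sum(~isJunk));
```

Because the inflated standard deviation sits in the denominator, a contaminating point has to be very far out before its z-score exceeds the cutoff, so a good share of them should slip through.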
Instead, consider what happens if we use a trimmed data set. Just take the central 90% of your data, discarding roughly the top and bottom 5%.
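nx = numel(X);
% sort so the extremes sit at both ends of the vector
Xtrim = sort(X);
% index corresponding to 5% of the sample
tfac = ceil(nx*0.05);
% keep the central part (this drops tfac-1 points from each tail)
Xtrim = Xtrim(tfac:nx-tfac+1);
histogram(Xtrim,100,'Normalization','pdf')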
That looks a lot more normal, but as you can see, it lacks the correct tail shape for a true normal distribution.
mean(Xtrim)
ans = 0.0257
std(Xtrim)
ans = 0.8593
As you can see here, we have done way better. In fact, you could probably use z-scores based on those numbers alone, and use them to discard outliers.
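A rough sketch of that last step, under the same assumptions as before (fixed seed, hypothetical label vector isJunk, cutoff of 3 chosen by me): compute the mean and standard deviation from the trimmed set, then score every point in the full data set against them.

```matlab
% Sketch: z-scores for ALL points, using parameters from the trimmed set.
rng(0);                                   % fixed seed, for repeatability only (assumption)
X      = [randn(1000,1); 5 + randn(50,1)*50];
isJunk = [false(1000,1); true(50,1)];     % hypothetical label vector, for bookkeeping only

% Trim exactly as in the answer: keep roughly the central 90%
nx    = numel(X);
Xs    = sort(X);
tfac  = ceil(nx*0.05);
Xtrim = Xs(tfac:nx-tfac+1);

mu    = mean(Xtrim);                      % location estimate from the trimmed data
sigma = std(Xtrim);                       % scale estimate; tends to run low because the tails are cut

zTrim  = (X - mu) / sigma;                % score the full data set, not just the trimmed part
isOut  = abs(zTrim) > 3;                  % cutoff of 3 is an assumption
Xclean = X(~isOut);                       % data with the flagged points removed

fprintf('contaminating points flagged: %d of %d\n', sum(isOut & isJunk), sum(isJunk));
fprintf('clean points flagged:         %d of %d\n', sum(isOut & ~isJunk), sum(~isJunk));
```

Since the trimmed standard deviation underestimates the true one, expect this to also flag a few legitimate points in the tails. Loosening the cutoff a little, or rescaling sigma, would be a reasonable adjustment.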
Could I have done better, perhaps using maximum likelihood estimation to compute a better set of parameter estimates? Probably. But this question is now 7 years old.
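This is not maximum likelihood, but a different robust alternative worth a look: base MATLAB's isoutlier (R2017a and later, so I am assuming a reasonably recent release) by default flags points more than three scaled median absolute deviations from the median, which avoids choosing a trimming fraction by hand. A minimal sketch on the same kind of data, with a fixed seed as my own assumption:

```matlab
% Sketch: median/MAD-based detection with isoutlier (a swapped-in technique, not MLE).
rng(0);                                   % fixed seed, for repeatability only (assumption)
X = [randn(1000,1); 5 + randn(50,1)*50];

isOutMAD = isoutlier(X);                  % default 'median' method: > 3 scaled MAD from the median
Xkept    = X(~isOutMAD);                  % data with the flagged points removed

fprintf('flagged %d of %d points\n', sum(isOutMAD), numel(X));
fprintf('after removal: mean = %.4f, std = %.4f\n', mean(Xkept), std(Xkept));
```

For data that are clearly skewed rather than just contaminated, isoutlier(X,'quartiles') may be the more sensible choice, since it does not lean on a normal-like scale estimate.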
