How to check for and remove outliers when the data is not normally distributed
Many people say that z-score and mapstd standardization are good ways to detect outliers. But the z-score is only useful when the data is normally distributed, and I found that my data does not follow a normal distribution. What should I do? (1) Should I transform my data (Box-Cox, Johnson transformation) to a normal distribution and then use z-scores to detect outliers? (2) After transforming the data and removing the outliers, should I use the transformed data or the original data (with the outliers removed in both) as the input to a neural network? I found that if I feed the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?
Can anybody help? Thanks a lot.
Answers (1)
John D'Errico
on 12 Jul 2022
z-scores are not a terrible approach. HOWEVER, if you have many outliers, then they will themselves bias the z-scores, making outliers less easy to detect.
A better scheme might be to use the parameters from a trimmed data set. For example, suppose we start with a corrupted set of data. In this example, the data should be normally distributed with mean=0, and standard deviation=1, but then I corrupted it with 5% high variance random crap, that has non-zero mean to boot.
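% 1000 standard-normal points plus 50 high-variance contaminants (mean 5, std 50)
X = [randn(1000,1); 5 + randn(50,1)*50];
X = X(randperm(numel(X)));   % shuffle so the contaminants are not all at the end
histogram(X,100,'Normalization','pdf')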
If we try to estimate the mean and variance from this set, we will get a rather poor estimate.
mean(X)
std(X)
So total crapola for a result. Could we use z-scores to search for the outliers? Probably not, since the standard deviation of the data is itself huge, because the outliers were still in the data when I computed those parameters. So the z-scores will look VERY strange themselves.
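To make that concrete, here is a minimal sketch that counts how many points a naive z-score test flags on the same kind of contaminated data. The fixed seed and the |z| > 3 cutoff are my assumptions for illustration, not something from the question.

```matlab
rng(0);                                   % assumed seed, only for reproducibility
X = [randn(1000,1); 5 + randn(50,1)*50];  % same contamination model as above

% z-scores computed from the contaminated mean and standard deviation
z = (X - mean(X)) ./ std(X);

% |z| > 3 is an assumed, conventional cutoff
nFlagged = nnz(abs(z) > 3);
fprintf('std(X) = %.2f, flagged %d of %d points\n', std(X), nFlagged, numel(X));
```

Because the 50 contaminants inflate std(X), a good share of them end up with |z| below the cutoff and slip through.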
Instead, consider what happens if we use a trimmed data set. Just take the central 90% of your data, discarding the top and bottom 5%.
nx = numel(X);
Xtrim = sort(X);
tfac = ceil(nx*0.05);           % number of points to drop from each tail
Xtrim = Xtrim(tfac+1:nx-tfac);  % keep the central ~90%
histogram(Xtrim,100,'Normalization','pdf')
That looks a lot more normal, but as you can see, it lacks the correct tail shape for a true normal distribution.
mean(Xtrim)
std(Xtrim)
As you can see here, we have done way better. In fact, you could probably use z-scores based on those numbers alone, and use them to discard outliers.
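A rough sketch of that last step, under my own assumptions: the seed, the |z| > 3 cutoff, and the variable names mu, sigma, isOut and Xclean are all illustrative. Note that the standard deviation of a trimmed sample is somewhat too small, since the tails were cut off, so this flags a few more points than an exact estimate would.

```matlab
rng(0);                                   % assumed seed, only for reproducibility
X = [randn(1000,1); 5 + randn(50,1)*50];
X = X(randperm(numel(X)));

% location and scale estimated from the central ~90% of the data
nx = numel(X);
Xs = sort(X);
tfac = ceil(nx*0.05);                     % points dropped from each tail
Xtrim = Xs(tfac+1:nx-tfac);
mu = mean(Xtrim);
sigma = std(Xtrim);

% z-scores of ALL points, but based on the trimmed estimates
z = (X - mu) ./ sigma;
isOut = abs(z) > 3;                       % assumed cutoff
Xclean = X(~isOut);

fprintf('flagged %d of %d points, std(Xclean) = %.2f\n', ...
    nnz(isOut), nx, std(Xclean));
```

The trimmed set is only used to estimate the parameters. The test itself is applied to the full data, so good points that were trimmed away get a chance to come back in.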
Could I have done better, perhaps using maximum likelihood estimation to compute a better set of parameter estimates? Probably. But this question is now 7 years old.

