How to check and remove outliers when it is Non-normal distribution

Question

J1 on 18 Nov 2015

https://it.mathworks.com/matlabcentral/answers/255745-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution

Answered: John D'Errico on 12 Jul 2022

I have found that many people say z-score and mapstd standardization are good ways to detect outliers. But the z-score is only useful when the data are normally distributed, and my data do not follow a normal distribution. What should I do?

(1) Should I transform my data (Box-Cox, Johnson transformation) to a normal distribution and then use z-scores to detect outliers?

(2) After transforming and removing the outliers, should I use the transformed data or the original data (with the outliers removed in both cases) as the input to a neural network? I found that if I feed the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?

Can anybody help? Thanks a lot.

Answer 1

John D'Errico on 12 Jul 2022

https://it.mathworks.com/matlabcentral/answers/255745-how-to-check-and-remove-outliers-when-it-is-non-normal-distribution#answer_1005710

z-scores are not a terrible approach. HOWEVER, if you have many outliers, then they will themselves bias the mean and standard deviation that the z-scores are computed from, making outliers harder to detect.

A better scheme might be to use the parameters from a trimmed data set. For example, suppose we start with a corrupted set of data. In this example, the data should be normally distributed with mean=0, and standard deviation=1, but then I corrupted it with 5% high variance random crap, that has non-zero mean to boot.

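% 1000 standard normal samples plus 50 (about 5%) contaminating samples
% with mean 5 and standard deviation 50
X = [randn(1000,1);5 + randn(50,1)*50];
% shuffle so the contamination is not grouped at the end
X = X(randperm(numel(X)));
% 100-bin histogram, normalized to a pdf
histogram(X,100,'Normalization','pdf')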

If we try to estimate the mean and variance from this set, we will get a rather poor estimate.

mean(X)
ans = 0.1366
std(X)
ans = 11.7369

So total crapola for a result. Could we use z-scores to search for the outliers? Probably not, since the standard deviation of the data is itself huge, because the outliers were still in the data when I computed those parameters. So the z-scores will look VERY strange themselves.
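A minimal sketch of that masking effect, assuming a conventional cutoff of |z| > 3. The cutoff, the fixed seed, and the names `zRaw` and `flagged` are illustrative and not from the answer above. The contaminating samples are left unshuffled at the end of `X` so they can be counted:

```matlab
rng(1);                                   % fixed seed, only for reproducibility
Xclean = randn(1000,1);                   % the "good" N(0,1) data
Xbad   = 5 + randn(50,1)*50;              % the 5% high-variance contamination
X      = [Xclean; Xbad];                  % not shuffled: rows 1001:end are contamination

% z-scores computed from the contaminated mean and standard deviation
zRaw    = (X - mean(X)) / std(X);
flagged = abs(zRaw) > 3;                  % assumed cutoff of 3

fprintf('std(X) = %.2f\n', std(X));
fprintf('points flagged: %d of %d\n', nnz(flagged), numel(X));
fprintf('contaminating points flagged: %d of %d\n', ...
        nnz(flagged(1001:end)), numel(Xbad));
```

Because the standard deviation is inflated by the contamination itself, the clean samples get nowhere near the cutoff, and a good share of the contaminating samples stay below it too.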

Instead, consider what happens if we use a trimmed data set. Just take the central 90% of the data, discarding the top and bottom 5%.

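nx = numel(X);
Xtrim = sort(X);
% number of points corresponding to 5% of the data
tfac = ceil(nx*0.05);
% keep the central part of the sorted data (drops tfac-1 points from each end)
Xtrim = Xtrim(tfac:nx-tfac+1);
histogram(Xtrim,100,'Normalization','pdf')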

That looks a lot more normal, but as you can see, it lacks the correct tail shape for a true normal distribution.

mean(Xtrim)
ans = 0.0257
std(Xtrim)
ans = 0.8593

As you can see here, we have done way better. In fact, you could probably use z-scores based on those numbers alone, and use them to discard outliers.
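One way that last step might look, as a sketch only: z-scores for every point in the full data set are computed from the trimmed mean and standard deviation, and points beyond an assumed cutoff of 3 are discarded. The cutoff, the seed, and the names `zTrim`, `keep` and `Xkept` are my assumptions, not part of the answer.

```matlab
rng(1);                                   % fixed seed, only for reproducibility
X = [randn(1000,1); 5 + randn(50,1)*50];  % same construction as above

% central ~90% of the sorted data, as in the answer
nx    = numel(X);
Xs    = sort(X);
tfac  = ceil(nx*0.05);
Xtrim = Xs(tfac:nx-tfac+1);

% z-scores of ALL points, using the trimmed estimates as location and scale
zTrim = (X - mean(Xtrim)) / std(Xtrim);
keep  = abs(zTrim) <= 3;                  % assumed cutoff of 3
Xkept = X(keep);

fprintf('removed %d of %d points\n', nnz(~keep), nx);
fprintf('after removal: mean = %.4f, std = %.4f\n', mean(Xkept), std(Xkept));
```

Since the trimmed standard deviation underestimates the true one (the tails were cut off), a cutoff of 3 here is somewhat tighter than 3 true standard deviations, so a few legitimate tail points will be removed along with the contamination.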

Could I have done better, perhaps using maximum likelihood estimation to compute a better set of parameter estimates? Probably. But this question is now 7 years old.
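For what it is worth, one possible maximum likelihood sketch is to fit a normal distribution truncated to the range of the trimmed data, which corrects for the missing tails. This is not the author's method. Treating the trim points as fixed truncation bounds is an approximation, and the names `Phi`, `nll`, `muHat` and `sigmaHat` are illustrative. Only base MATLAB functions (`erfc`, `fminsearch`) are used.

```matlab
rng(1);                                   % fixed seed, only for reproducibility
X = [randn(1000,1); 5 + randn(50,1)*50];  % same construction as above

nx    = numel(X);
Xs    = sort(X);
tfac  = ceil(nx*0.05);
Xtrim = Xs(tfac:nx-tfac+1);

lo = Xtrim(1);                            % assumed lower truncation bound
hi = Xtrim(end);                          % assumed upper truncation bound
n  = numel(Xtrim);

Phi = @(z) 0.5*erfc(-z/sqrt(2));          % standard normal CDF

% negative log-likelihood of a normal truncated to [lo, hi], up to a constant;
% p(1) = mu, p(2) = log(sigma), so sigma stays positive during the search
nll = @(p) n*p(2) + sum((Xtrim - p(1)).^2)/(2*exp(2*p(2))) ...
         + n*log(Phi((hi - p(1))/exp(p(2))) - Phi((lo - p(1))/exp(p(2))));

p0       = [mean(Xtrim), log(std(Xtrim))];   % start from the trimmed estimates
pHat     = fminsearch(nll, p0);
muHat    = pHat(1);
sigmaHat = exp(pHat(2));

fprintf('trimmed std = %.4f, truncated-normal MLE sigma = %.4f\n', ...
        std(Xtrim), sigmaHat);
```

The fitted `sigmaHat` should come out larger than `std(Xtrim)`, because the truncated model accounts for the tails that trimming removed.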
