A suitable method to detect outliers from a non-normally distributed dataset?

Question

Sim on 1 May 2023

Direct link to this question

https://it.mathworks.com/matlabcentral/answers/1955994-a-suitable-method-to-detect-outliers-from-a-non-normally-distributed-dataset

Commented: Sim on 4 May 2023

x_data.m

I understand that a suitable method for detecting outliers in a non-normally distributed (I would say skewed) dataset is the quartiles method. However, it looks like that method flags as outliers data points that are not visibly outliers, but part of the main cluster of data.

Alternatively, I tried the mean plus 3-sigma method (i.e. isoutlier(x,'mean')), and it looks like it gives a better result, even though it seems to be recommended for normal distributions only. However, as we can see from the plots, it misses a few other outliers, i.e. a few points that are still quite close to the cluster of data, but visibly separated and relatively distant from it.

You can see the difference between the two methods in the two figures below.

Is there a more suitable method (and definition) available in MATLAB to detect all the outliers in the plot below?

run('x_data.m')
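% According to the following tests, my dataset does not appear to be normally
% distributed, which suggests that a 'good' method to detect outliers
% in a non-normally distributed dataset would be the 'quartiles' one.
% Note: kstest(x) tests against the standard normal N(0,1) by default;
% cmtest and swtest are not built into MATLAB (third-party functions).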
disp('Normality Test Name            H')
disp('--------------------------------')
fprintf('Kolmogorov-Smirnov Test        %d\n',double(kstest(x)))
fprintf('Anderson-Darling Test          %d\n',double(adtest(x)))
fprintf('Cramer-Von Mises Test          %d\n',double(cmtest(x)))   
fprintf('Shapiro-Wilk Test              %d\n',double(swtest(x)))
fprintf('Jarque-Bera Test               %d\n',double(jbtest(x)))
% detect and plot outliers with the method 'mean'
outliers=x(isoutlier(x,'mean'));
[~,i]=ismember(outliers,sort(x,'descend'));
figure
hold on
plot(sort(x,'descend'),'o','MarkerFaceColor','k','MarkerEdgeColor','k')
plot(i,outliers,'o','MarkerFaceColor','r','MarkerEdgeColor','r')
yline(min(outliers),'-','Threshold','LabelHorizontalAlignment','center');
title('mean method')
hold off
% detect and plot outliers with the method 'quartiles'
outliers=x(isoutlier(x,'quartiles'));
[~,i]=ismember(outliers,sort(x,'descend'));
figure
hold on
plot(sort(x,'descend'),'o','MarkerFaceColor','k','MarkerEdgeColor','k')
plot(i,outliers,'o','MarkerFaceColor','r','MarkerEdgeColor','r')
yline(min(outliers),'-','Threshold','LabelHorizontalAlignment','center');
title('quartiles method')
hold off
% output
Normality Test Name            H
--------------------------------
Kolmogorov-Smirnov Test        1
Warning: P is less than the smallest tabulated value, returning 0.0005. 
Anderson-Darling Test          1
Cramer-Von Mises Test          1
Shapiro-Wilk Test              1
Jarque-Bera Test               1
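
The gap between the two counts is easier to see once the thresholds themselves are printed. The sketch below is only an illustration: it uses a synthetic right-skewed sample (an assumption, since the real values live in x_data.m) and relies on the extra outputs of isoutlier, which return the lower and upper thresholds it applied.

```matlab
% Synthetic, right-skewed stand-in for x (assumption: NOT the data from x_data.m)
rng(0);
x = exp(randn(500,1));

% 'mean' rule written out by hand: more than 3 standard deviations from the mean
mu = mean(x);
sigma = std(x);
tfManual = abs(x - mu) > 3*sigma;

% isoutlier returns the thresholds it used as its second and third outputs
[tfMean,  loMean,  hiMean]  = isoutlier(x,'mean');
[tfQuart, loQuart, hiQuart] = isoutlier(x,'quartiles');

fprintf('hand-written 3-sigma rule: %d flagged\n', nnz(tfManual));
fprintf('mean rule:      [%g, %g], %d flagged\n', loMean,  hiMean,  nnz(tfMean));
fprintf('quartiles rule: [%g, %g], %d flagged\n', loQuart, hiQuart, nnz(tfQuart));
```

On right-skewed data the long tail inflates both the mean and the standard deviation, so the 3-sigma upper threshold tends to sit further out than the Q3 + 1.5*IQR fence. That is consistent with the 'mean' rule flagging fewer points than the 'quartiles' rule.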


Answer 1

dpb on 1 May 2023

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/1955994-a-suitable-method-to-detect-outliers-from-a-non-normally-distributed-dataset#answer_1226874

Edited: dpb on 1 May 2023

You made it notoriously difficult to do anything to help by not attaching the data in a usable form, but...

% Read the attached script as text and extract the numeric values
fn='https://www.mathworks.com/matlabcentral/answers/uploaded_files/1371389/x_data.m';
m=readlines(fn);
ix=find(contains(m,{'[',']'}));      % lines containing '[' or ']'
m=m(ix(1):ix(2));
pat=digitsPattern+optionalPattern("."+digitsPattern);
x=str2double(extract(m,pat));
% Histograms on the raw and log10 scales, plus a normal probability plot
subplot(3,1,1)
histogram(x)
title('Histogram X')
subplot(3,1,2)
histogram(log10(x))
title('Histogram log10(X)')
subplot(3,1,3)
normplot(log10(x))

ALWAYS visualize your data first -- the second doesn't truly look too awful bad for a lognormal at first blush, although the probability plot does indicate it's a little long in the tails so is more extreme than "just" a lognormal.
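
To put a number on what the log10 histogram suggests, one can compare how many points the same rule flags on the raw scale and on the log scale. This is a rough sketch on synthetic lognormal data (an assumption, not the data from x_data.m), and it is an exploratory check rather than a recommended procedure.

```matlab
% Synthetic lognormal sample (assumption: stands in for the real x, which must be > 0)
rng(1);
x = exp(0.8*randn(1000,1));
y = log10(x);                         % roughly normal if x is roughly lognormal

% Same quartile-based rule, applied on both scales
tfRaw = isoutlier(x,'quartiles');
tfLog = isoutlier(y,'quartiles');

fprintf('flagged on raw scale:   %d of %d\n', nnz(tfRaw), numel(x));
fprintf('flagged on log10 scale: %d of %d\n', nnz(tfLog), numel(x));
```

If the data really are close to lognormal, most of the points flagged on the raw scale are just the ordinary upper tail and are no longer flagged after the transform.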

BUT, unless you have some reason to know these are truly outliers and you're just not dealing with an extreme distribution sample, it's not clear whether those really are "outliers" or just actual realizations of the underlying process.

The MATLAB toolset comes from the <NIST guidelines>; I'd recommend reading it thoroughly to get a better appreciation of the various tests built into MATLAB and their application.

It's a notoriously difficult issue; without knowing much more about the provenance of the dataset, I'd be reluctant to recommend any particular process here. Your order statistics plots focus on the upper tail only; the lower tail is pretty much symmetric with it, which would make one wonder whether there is a reason for that and whether one should be looking at alternative distribution families...

Sim on 2 May 2023

thanks @dpb!

About "ALWAYS visualize your data first -- the second doesn't truly look too awful bad for a lognormal at first blush, although the probability plot does indicate it's a little long in the tails so is more extreme than "just" a lognormal", sorry, I think I did not fully understand... do you think it might be a Generalized Extreme Value distribution?
About "BUT, unless you have some reason to know these are truly outliers and you're just not dealing with an extreme distribution sample, it's not clear whether those really are "outliers" or just actual realizations of the underlying process", well... the array "x" does not represent direct observations. The numbers in the array "x" are the product of two variables, let's say "a" and "b", which are somehow related to each other and which, together, represent direct observations. So, "a" and "b" are two arrays, with elements "a_i" and "b_i". In particular, "a_i" is the Euclidean distance between two physical points and "b_i" is a (sort of) density. If we plot "a_i" on the x-axis and "b_i" on the y-axis, the point (a_i,b_i) indicates an event with a certain distance "a_i" and a certain density "b_i". The elements of the array "x" are therefore calculated as x_i = a_i * b_i. From some ground truth knowledge, it appears that the first 15 elements of the sorted array "x" are interesting and represent some phenomenon. Then, from a first visual inspection of plot(sort(x,'descend')), I noticed that the first 15 points look like outliers. So I was thinking of detecting those 15 points, yes, in the upper tail of the distribution, with one of the techniques available in MATLAB. I then found that two widely used methods, i.e. the "mean plus 3-sigma" method (isoutlier(x,'mean')) and the "1.5*IQR" method (isoutlier(x,'quartiles'), with lower bound Q1 - 1.5*IQR and upper bound Q3 + 1.5*IQR), give 10 outliers and around 20 outliers, respectively.
About "The MATLAB toolset comes from the <NIST guidelines>; I'd recommend reading it thoroughly to get a better appreciation of the various tests built into MATLAB and their application", I will read it carefully, thanks a lot!!
About "It's a notoriously difficult issue; without knowing much more about the dataset provenance, I'd be reluctant to recommend any particular process here; your order statistics plots focus on the upper tail only; the lower tail is pretty-much symmetric with it that would make one wonder if not a reason for that and that should be looking at alternative distribution families...", in case I am able to find an alternative distribution family... would you know, by chance, whether MATLAB provides alternative methods to detect outliers for specific distributions?

Again, many thanks for the comments!!
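
On the last question, one generic way to tie a cutoff to a specific distribution family is to fit that family and flag points beyond a high quantile of the fitted model. The sketch below is illustrative only: the data are synthetic, the lognormal family and the 0.999 tail probability are assumptions chosen for the example, and fitdist and icdf require the Statistics and Machine Learning Toolbox. It is not a dedicated outlier function.

```matlab
% Synthetic stand-in for the real data (assumption: NOT the values from x_data.m)
rng(2);
x = exp(0.8*randn(1000,1));

pd    = fitdist(x,'Lognormal');   % fit the candidate family (assumed lognormal here)
pCut  = 0.999;                    % hypothetical tail probability; a judgment call
upper = icdf(pd,pCut);            % value the fitted model exceeds with probability 1-pCut
tfTail = x > upper;               % points further out than the fitted model expects

fprintf('Fitted mu = %.3f, sigma = %.3f\n', pd.mu, pd.sigma);
fprintf('Cutoff at p = %.3f is %.3f; %d points lie above it\n', pCut, upper, nnz(tfTail));
```

Whether the flagged points are "outliers" still depends on whether the fitted family is a defensible model of the process that generated the data.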

dpb on 2 May 2023

"the array "x" are a product of two variables, let's say "a" and "b", which are somehow related to each other, and, together, they represent direct observations."

That's precisely the kind of thing I had in mind -- so, there's no reason to think these observations in x are "outliers" unless there's some reason that the underlying a, b observations (one or the other or both) are outliers for the process by which they were obtained.

That there may be some other phenomenon occurring that makes them "interesting" for various other reasons may well be true, but I wouldn't then consider that an "outlier" in the statistical sense.

It well may be you're looking at something more like a discriminant analysis or similar here to identify what might be those specific points of interest.

Whether it makes sense to consider them as RVs (random variables) at all isn't necessarily even totally clear. One could possibly find a reasonable distribution function that would describe the observation frequency, but unless the data came about as a representative sample from the underlying process and aren't just a set of coordinates, regenerating a sample wouldn't necessarily produce a similar distribution if the points were selected in another manner.
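
If the points of interest are already known to be the 15 largest values of x, they can be picked by rank instead of by an outlier rule, and then inspected in the (a,b) plane. This is a minimal sketch with made-up a and b arrays (hypothetical stand-ins for the real distances and densities); it is far simpler than a discriminant analysis and only shows the selection step.

```matlab
% Hypothetical stand-ins for the real a (distance) and b (density) arrays
rng(3);
a = 10*rand(200,1);
b = exp(randn(200,1));
x = a .* b;                         % x_i = a_i * b_i, as described in the thread

k = 15;                             % number of "interesting" points, from ground truth
[topVals, topIdx] = maxk(x,k);      % k largest values of x and their positions

% Show where the selected points sit in the (a,b) plane
figure
plot(a,b,'k.')
hold on
plot(a(topIdx),b(topIdx),'ro')
hold off
xlabel('a (distance)')
ylabel('b (density)')
title(sprintf('%d largest values of x = a.*b',k))
```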

dpb on 2 May 2023

I spent nearly 30 years consulting on "analysis of data in a plain brown wrapper" -- being asked to figure out how to draw conclusions from data collected without a prior sit-down with a consulting statistician to design an experiment to produce data that could answer a specific hypothesis. This is a very common-to-me scenario, and these are the sorts of questions one has to start with to try to get a feel for what the actual underlying data are, rather than just applying some textbook procedure that may not be at all applicable.

Sim on 4 May 2023

thanks a lot @dpb for your comments :-)
