A suitable method to detect outliers from a non-normally distributed dataset?

I understood that a suitable method to detect outliers in a non-normally distributed (I would say skewed) dataset is the quartiles method. However, that method appears to flag as outliers data that are not visibly outliers, but part of the main cluster of data.
Alternatively, I tried the mean plus 3-sigma method (i.e. isoutlier(x,'mean')), and it seems to give a better result, even though it is apparently recommended for normal distributions only. As the plots show, however, it misses a few other outliers, i.e. a few points that are still fairly close to the cluster of data, but visibly separated and relatively distant from it.
You can see the difference between the two methods in the two figures below.
Is there a more suitable method (and definition) available in MATLAB to detect all the outliers in the plots below?
run('x_data.m')
% According to the following tests it looks like my dataset is not normally
% distributed, which means that a 'good' method to detect outliers
% in a not normally distributed dataset would be the 'quartiles' one
disp('Normality Test Name H')
disp('--------------------------------')
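% note: without standardizing x first, kstest compares it against the standard normal N(0,1)
fprintf('Kolmogorov-Smirnov Test %d\n',double(kstest(x)))
fprintf('Anderson-Darling Test %d\n',double(adtest(x)))
% note: cmtest and swtest are not built-in MATLAB functions (presumably File Exchange submissions)
fprintf('Cramer-Von Mises Test %d\n',double(cmtest(x)))
fprintf('Shapiro-Wilk Test %d\n',double(swtest(x)))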
fprintf('Jarque-Bera Test %d\n',double(jbtest(x)))
% detect and plot outliers with the method 'mean'
outliers=x(isoutlier(x,'mean'));
[~,i]=ismember(outliers,sort(x,'descend'));
figure
hold on
plot(sort(x,'descend'),'o','MarkerFaceColor','k','MarkerEdgeColor','k')
plot(i,outliers,'o','MarkerFaceColor','r','MarkerEdgeColor','r')
yline(min(outliers),'-','Threshold','LabelHorizontalAlignment','center');
title('mean method')
hold off
% detect and plot outliers with the method 'quartiles'
outliers=x(isoutlier(x,'quartiles'));
[~,i]=ismember(outliers,sort(x,'descend'));
figure
hold on
plot(sort(x,'descend'),'o','MarkerFaceColor','k','MarkerEdgeColor','k')
plot(i,outliers,'o','MarkerFaceColor','r','MarkerEdgeColor','r')
yline(min(outliers),'-','Threshold','LabelHorizontalAlignment','center');
title('quartiles method')
hold off
% output
Normality Test Name H
--------------------------------
Kolmogorov-Smirnov Test 1
Warning: P is less than the smallest tabulated value, returning 0.0005.
Anderson-Darling Test 1
Cramer-Von Mises Test 1
Shapiro-Wilk Test 1
Jarque-Bera Test 1
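
To see why the two methods disagree on skewed data, it can help to look at the thresholds isoutlier actually applies rather than inferring them from min(outliers). The following is only a minimal sketch: the sample x is a synthetic, hypothetical stand-in (not the contents of x_data.m), and the variable names are illustrative.

```matlab
% Hypothetical skewed sample standing in for the real x (assumption)
rng(2);                          % reproducible synthetic data
x = exp(0.5*randn(500,1));       % lognormal-like, right-skewed

% isoutlier can also return the lower and upper thresholds it used
[tfQ,loQ,hiQ] = isoutlier(x,'quartiles');   % 1.5*IQR beyond the quartiles
[tfM,loM,hiM] = isoutlier(x,'mean');        % 3 standard deviations from the mean

fprintf('quartiles: [%g, %g], %d flagged\n',loQ,hiQ,nnz(tfQ))
fprintf('mean     : [%g, %g], %d flagged\n',loM,hiM,nnz(tfM))
```

On a right-skewed sample the upper quartile fence tends to sit well inside the long tail, which is consistent with the behaviour described above. The returned upper threshold could also be passed to yline in place of min(outliers).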

Answers (1)

dpb on 1 May 2023
Edited: dpb on 1 May 2023
You made it unnecessarily difficult to help by not attaching the data in a usable form, but...
fn='https://www.mathworks.com/matlabcentral/answers/uploaded_files/1371389/x_data.m';
m=readlines(fn);
ix=find(contains(m,{'[',']'}));
m=m(ix(1):ix(2));
pat=digitsPattern+optionalPattern("."+digitsPattern);
x=str2double(extract(m,pat));
subplot(3,1,1)
histogram(x)
title('Histogram X')
subplot(3,1,2)
histogram(log10(x))
title('Histogram log10(X)')
subplot(3,1,3)
normplot(log10(x))
ALWAYS visualize your data first. The second histogram doesn't look too bad for a lognormal at first blush, although the probability plot indicates the tails are somewhat long, so the data are more extreme than "just" a lognormal.
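
As a rough numeric companion to the plots, one could also summarize the shape of the log-transformed data. This is only a sketch under stated assumptions: x here is a synthetic lognormal stand-in (not the poster's data), and skewness, kurtosis and lillietest require the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in for x (assumption, not the original data)
rng(1);                          % reproducible synthetic data
x = exp(0.5*randn(500,1));       % lognormal sample
y = log10(x);                    % should look roughly normal if x is lognormal

% Shape summaries: a normal sample has skewness near 0 and kurtosis near 3
fprintf('skewness of log10(x)       : %.3f\n',skewness(y))
fprintf('excess kurtosis of log10(x): %.3f\n',kurtosis(y)-3)

% Lilliefors test: normality with mean and variance estimated from the data
[h,p] = lillietest(y);
fprintf('lillietest: h = %d, p = %.3f\n',h,p)
```

A clearly positive excess kurtosis would agree with a probability plot that bends away from the line in both tails. These numbers complement the plots but do not replace them.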
BUT, unless you have some reason to know these are truly outliers rather than a sample from a heavy-tailed distribution, it's not clear whether they really are "outliers" or just genuine realizations of the underlying process.
The MATLAB toolset comes from the <NIST guidelines>; I'd recommend reading them thoroughly to get a better appreciation of the various tests built into MATLAB and their application.
It's a notoriously difficult issue. Without knowing much more about the dataset's provenance, I'd be reluctant to recommend any particular procedure here. Your order statistics plots focus on the upper tail only, but the lower tail is pretty much symmetric with it, which makes one wonder whether there is a reason for that and whether you should be looking at alternative distribution families...
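
If, after looking at the data, working on the log scale seems defensible, one option to experiment with is running isoutlier on log10(x) instead of x. The sketch below is not a recommendation for this dataset: x is a synthetic lognormal sample with three planted large values (an assumption, not the original x_data.m), and the median/MAD rule is just one of the methods isoutlier offers.

```matlab
% Hypothetical stand-in: lognormal bulk plus three planted large values
rng(0);                                   % reproducible synthetic data
x = [exp(0.5*randn(500,1)); 40; 60; 90];  % assumed data, not the original x

% Flag outliers on the raw scale and on the log10 scale
tfRaw = isoutlier(x,'quartiles');         % 1.5*IQR rule on the skewed raw data
tfLog = isoutlier(log10(x),'quartiles');  % same rule after the log transform
tfMad = isoutlier(log10(x),'median');     % more than 3 scaled MAD from the median

fprintf('raw quartiles : %d flagged\n',nnz(tfRaw))
fprintf('log quartiles : %d flagged\n',nnz(tfLog))
fprintf('log median/MAD: %d flagged\n',nnz(tfMad))

% Plot the sorted values and mark the points flagged on the log scale
[xs,order] = sort(x,'descend');
flagged = tfMad(order);
figure
semilogy(find(~flagged),xs(~flagged),'ko','MarkerFaceColor','k')
hold on
semilogy(find(flagged),xs(flagged),'ro','MarkerFaceColor','r')
hold off
title('median/MAD on log10(x)')
```

Because the rules are applied to log10(x), they are two-sided on the log scale, so unusually small values are examined as well as the large ones. Whether any flagged point is a real outlier still depends on what is known about how the data were produced.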
5 Comments
dpb on 2 May 2023
I spent nearly 30 years consulting on "analysis of data in a plain brown wrapper": being asked to figure out how to draw conclusions from data collected without a prior sit-down with a consulting statistician to design an experiment that would produce data able to answer a specific hypothesis. This is a very common scenario for me, and the sorts of questions one has to start with are aimed at getting a feel for what the actual underlying data are, rather than just applying some textbook procedure that may not be applicable at all.
