How to remove Outliers

Hi everyone. Does anyone know how to detect and remove outliers? I've tried many ways but none of them worked. I have had to remove the data in Excel, but I think that's not enough. Thank you!

Accepted Answer

Mathieu NOE on 29 Nov 2021
Hello,
here is a quick result after a few manipulations.
Code:
clc
clearvars
T = readtable('AskingData.xlsx');
P = T.Power;
Ws = T.WingSpeed;
% remove duplicates
[Ws_unique,IA,IC] = unique(Ws);
P_unique = P(IA);
% remove negative Power
ind = find(P_unique<0);
P_unique(ind) = [];
Ws_unique(ind) = [];
% remove outliers
[P_unique2,TF] = rmoutliers(P_unique,'movmedian',100);
Ws_unique2 = Ws_unique(~TF);
% let's add some smoothing
P_unique2S = smoothdata(P_unique2,'sgolay',151);
plot(Ws_unique,P_unique,'b*',Ws_unique2,P_unique2,'r--',Ws_unique2,P_unique2S,'g');
legend('unique','unique and outliers removed','smoothed');

9 Comments

Huy Cao on 29 Nov 2021
It says:
Undefined function or variable 'rmoutliers'.
Error in removeoutliers (line 14)
[P_unique2,TF] = rmoutliers(P_unique,'movmedian',100);
T.T
Mathieu NOE on 29 Nov 2021
OK, I guess you are running an older MATLAB release (I have R2020b).
Huy Cao on 29 Nov 2021
I use R2018a :(
Well, maybe we can achieve a reasonable plot even without rmoutliers:
clc
clearvars
T = readtable('AskingData.xlsx');
P = T.Power;
Ws = T.WingSpeed;
% remove duplicates
[Ws_unique,IA,IC] = unique(Ws);
P_unique = P(IA);
% remove negative Power
ind = find(P_unique<0);
P_unique(ind) = [];
Ws_unique(ind) = [];
% smooth the data
P_uniqueS = smoothdata(P_unique,'movmedian',80);
plot(Ws_unique,P_unique,'b*',Ws_unique,P_uniqueS,'r');
legend('unique','outliers removed');
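On R2018a, isoutlier (available since R2017a) gives the logical mask that rmoutliers (added in R2018b) would return as its second output, so the removal step can be done by plain indexing. A minimal sketch on synthetic data follows; the sigmoid curve, noise level, injected outliers, and the 15-sample window are all assumptions chosen for illustration, not values taken from AskingData.xlsx.

```matlab
% Sketch: flag and drop outliers without rmoutliers.
% Synthetic stand-in data (hypothetical); the thread itself uses AskingData.xlsx.
rng(0);                                   % reproducible example
Ws = linspace(0, 25, 500)';               % hypothetical wind speed
P  = 2000 ./ (1 + exp(-(Ws - 10)));       % hypothetical S-shaped power curve
P  = P + 20*randn(size(P));               % measurement noise
badIdx = [50 200 350 450];                % positions of injected outliers
P(badIdx) = P(badIdx) + 1500;

% Flag points more than 3 scaled MADs away from the local median,
% computed over a moving window of 15 samples (assumed window length).
TF = isoutlier(P, 'movmedian', 15);

% Keep only the points that were not flagged.
Ws_clean = Ws(~TF);
P_clean  = P(~TF);

plot(Ws, P, 'b*', Ws_clean, P_clean, 'r.');
legend('raw', 'outliers removed');
```

A window that is wide compared with the steep part of the curve inflates the local MAD there and can let outliers through, so the window length is worth tuning against a plot.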
Huy Cao on 30 Nov 2021
Thank you, but I think the outliers are the dots far from the shape ^^
Mathieu NOE on 30 Nov 2021
Yes, sure.
I have just realised that the number of outliers may not be that high, so a simple smoothing was good enough.
IMHO, the last result is pretty much the same as the previous one (when I used rmoutliers).
I know that I have another smoothing function somewhere with "robust" outlier removal built in; I will try that one as well.
This is an alternative using smoothn (from the FEX: smoothn - File Exchange - MATLAB Central (mathworks.com)), also attached for convenience.
The difference is quite minor, more pronounced on the tail section, but that also depends on which smoothing factor is used in each scenario.
clc
clearvars
T = readtable('AskingData.xlsx');
P = T.Power;
Ws = T.WingSpeed;
% remove duplicates
[Ws_unique,IA,IC] = unique(Ws);
P_unique = P(IA);
% remove negative Power
ind = find(P_unique<0);
P_unique(ind) = [];
Ws_unique(ind) = [];
% smooth the data
P_uniqueS = smoothdata(P_unique,'movmedian',80);
P_uniqueS2 = smoothn(P_unique,1e4,'robust'); % see FEX :
plot(Ws_unique,P_unique,'b*',Ws_unique,P_uniqueS,'r',Ws_unique,P_uniqueS2,'g');
legend('unique','smoothdata','smoothn');
Mathieu NOE on 30 Nov 2021
Attached is the smoothn function (FYI).
It's me again!
I tried another smoothing function (irlssmooth) found on the File Exchange.
This one is also very good with outliers.
See the benefit vs. the built-in smoothdata from TMW.
Code:
clc
clearvars
T = readtable('AskingData.xlsx');
P = T.Power;
Ws = T.WingSpeed;
% remove duplicates
[Ws_unique,IA,IC] = unique(Ws);
P_unique = P(IA);
% remove negative Power
ind = find(P_unique<0);
P_unique(ind) = [];
Ws_unique(ind) = [];
% smooth the data
P_uniqueS = smoothdata(P_unique,'movmedian',80);
P_uniqueS2 = irlssmooth(P_unique,80); % good !! % see FEX : https://fr.mathworks.com/matlabcentral/fileexchange/49788-robust-least-squares-smoother?s_tid=srchtitle_irlssmooth_1
plot(Ws_unique,P_unique,'b*',Ws_unique,P_uniqueS,'r',Ws_unique,P_uniqueS2,'g');
legend('unique','smoothdata','irlssmooth');


More Answers (2)

John D'Errico on 29 Nov 2021

The obvious answer is to look at the tools provided in MATLAB. Thus RMOUTLIERS, ISOUTLIER, etc. They are provided in the stats toolbox.
Or if you want to do the work yourself, use tools to do local filtering of some sort. For example, a local polynomial model that drops out the point at the center, but predicts a value there. In a time series context, this can reduce to a simple call to CONV with the correct kernel. (Not difficult to compute that kernel either.) Now compare the predicted value to the point left out. Those with large residuals are potential outliers.
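As a rough illustration of the leave-one-out idea above, the sketch below builds a local quadratic predictor that ignores the centre point and applies it with CONV. It assumes evenly spaced samples and uses synthetic data; the half-width h, the quadratic degree, and the 3-sigma cutoff are arbitrary choices for the example, not values from the answer.

```matlab
% Sketch: leave-one-out local quadratic prediction implemented with conv.
% Synthetic, evenly spaced data; all numbers below are illustrative assumptions.
rng(0);
Ws = linspace(0, 25, 500)';
P  = 2000 ./ (1 + exp(-(Ws - 10))) + 20*randn(500, 1);
badIdx = [50 200 350 450];
P(badIdx) = P(badIdx) + 1500;             % injected single-point outliers

h = 5;                                    % half-width of the local window
x = [-h:-1, 1:h]';                        % neighbour offsets, centre left out
V = [ones(size(x)), x, x.^2];             % design matrix of the local quadratic
W = pinv(V);                              % maps neighbour values to coefficients
w = W(1, :);                              % intercept row = predicted value at offset 0
kernel = [w(1:h), 0, w(h+1:end)]';        % zero weight on the centre point

Phat = conv(P, flip(kernel), 'same');     % predict each point from its neighbours
r = P - Phat;                             % leave-one-out residuals

interior = (h+1):(numel(P)-h);            % 'same' zero-pads, so skip the edges
r0 = r(interior);
sigma = 1.4826 * median(abs(r0 - median(r0)));    % robust scale (scaled MAD)
flag = false(size(P));
flag(interior) = abs(r0 - median(r0)) > 3*sigma;  % candidate outliers
```

A large outlier also perturbs the predictions for its neighbours, so points next to it may be flagged as well; the flags are candidates to inspect rather than a final verdict.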
You can also use tools for robust regression modeling, then identifying any points with large residuals as a possible outlier.
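One way to try the robust-regression suggestion without any toolbox is iteratively reweighted least squares with bisquare weights. The sketch below assumes a straight-line model on synthetic data purely for illustration; an S-shaped power curve would need a different model (a sigmoid or a spline, for example). The tuning constant 4.685 and the 3-sigma cutoff are conventional choices, not something stated in the answer.

```matlab
% Sketch: robust straight-line fit by iteratively reweighted least squares.
% Synthetic data and a linear model are assumptions made for this example.
rng(1);
x = linspace(0, 10, 200)';
y = 3*x + 5 + 0.5*randn(size(x));
badIdx = [20 90 160];
y(badIdx) = y(badIdx) + 25;               % injected outliers

X = [ones(size(x)), x];                   % design matrix: intercept and slope
w = ones(size(x));                        % start from ordinary least squares
for it = 1:20
    sw = sqrt(w);
    b  = (X .* sw) \ (y .* sw);           % weighted fit (implicit expansion, R2016b+)
    r  = y - X*b;                         % residuals of the current fit
    s  = 1.4826 * median(abs(r - median(r)));   % robust scale (scaled MAD)
    u  = r / (4.685 * s);                 % standardised residuals
    w  = (1 - u.^2).^2 .* (abs(u) < 1);   % bisquare weights, zero for far points
end

flag = abs(r) > 3*s;                      % large residuals = possible outliers
```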
The data you show appears to have multiple problems, though. There appears to be at least one region with a large dropout, an obvious outlier cluster, possibly caused by some sort of equipment issue. Outlier detection schemes tend to be best at detecting single-point outliers. Groups of outliers are far more difficult to detect, because these points all look like the data around them.
You might also look into clustering methods.
And that means you probably need to make an effort to clean up your data manually. Look for problems in the data that are obvious. If possible, then look back at the source of your data to see if there was some reason for the problem.

1 Comment

FYI, rmoutliers and isoutlier are part of MATLAB, not Statistics and Machine Learning Toolbox.
which rmoutliers
/MATLAB/toolbox/matlab/datafun/rmoutliers.m
which isoutlier
/MATLAB/toolbox/matlab/datafun/isoutlier.m


Sean Harvey on 21 Mar 2022
Edited: Sean Harvey on 21 Mar 2022

Did you find a way to remove only the outliers (i.e., points away from the S-shaped curve)?

Release

R2018a
