removing outliers

Question

joseph Frank on 26 Mar 2011

https://it.mathworks.com/matlabcentral/answers/4040-removing-outliers

Commented: Anirudh Thatipelli on 24 May 2018

Hi,

I have event-level data for n companies (not time series data). Visually, I can see that there are outliers, but I don't know which method to use to remove them in MATLAB. Any help is appreciated.


Answer 1

Richard Willey on 1 Apr 2011

https://it.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_6305

Automatically detecting outliers is tricky stuff.

You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.

Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistic called "Cook's Distance" that you can extract from regstats.

Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.

Here's a simple example illustrating how this works:

% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to 
% which a regression model would change if this data point 
% were excluded from the regression. Cook's Distance is 
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n is a typical threshold that is used to suggest
% the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')

2 Comments

Mark Shore on 1 Apr 2011

Looks interesting but unfortunately requires the Statistics Toolbox.
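
For readers without the Statistics Toolbox, Cook's Distance for a straight-line fit can be computed with base MATLAB alone. The following is a minimal sketch, not the regstats implementation: it assumes the textbook formula D_i = e_i^2/(p*s^2) * h_ii/(1-h_ii)^2, where e_i are the residuals, h_ii the leverages, p the number of fitted coefficients, and s^2 the mean squared error. The data and the variable names (A, beta, lev, cookd) are made up for illustration.

```matlab
% Illustrative data: an exact line with one corrupted observation
X = (1:20)';
Y = 3*X + 2;
Y(10) = Y(10) + 50;          % inject a single gross error at obs# 10

n = numel(X);
A = [ones(n,1) X];           % design matrix: intercept and slope
p = size(A,2);               % number of fitted coefficients

beta  = A \ Y;               % ordinary least-squares fit
resid = Y - A*beta;          % residuals e_i
s2    = sum(resid.^2)/(n-p); % mean squared error

% Leverages h_ii (diagonal of the hat matrix) via an economy-size QR
[Q,~] = qr(A,0);
lev   = sum(Q.^2,2);

% Cook's Distance, textbook formula (assumed, see the note above)
cookd = (resid.^2 ./ (p*s2)) .* (lev ./ (1-lev).^2);

% Same 4/n rule of thumb as in the answer above
potential_outlier = cookd > 4/n;
find(potential_outlier)
```

On this toy data only observation 10 exceeds the 4/n threshold. With real data the flagged points should still be inspected by hand before anything is removed.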

Anirudh Thatipelli on 24 May 2018

Thanks for referring to Cook's distance @Richard Willey. It has been a great help for me in removing outliers.

Answer 2

Matt Fig on 26 Mar 2011

https://it.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_5755

What form is the data in? You might be able to use logical indexing. For example:

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10);  % Take those values less than 10
x = x(x>0);  % Take those values greater than zero.

You could also do this in one shot, as below.

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)

1 Comment

joseph Frank on 27 Mar 2011

The data is in % stock returns. It will be difficult to set a subjective cutoff point. I am wondering if there is another way to determine what is an outlier and what is not.
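
One way to avoid a hand-picked cutoff is to derive it from the data themselves. The sketch below flags values that lie more than three scaled median absolute deviations (MAD) from the median, which is far less sensitive to the outliers than a mean and standard deviation would be. It uses base MATLAB only and reuses the sample vector from the answer above. The threshold of 3 and the 1.4826 scaling (which makes the MAD comparable to a standard deviation for normally distributed data) are conventional choices assumed here, not something taken from the thread.

```matlab
% Sample vector from the answer above (outliers 99 and -70)
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];

med       = median(x);               % robust estimate of the center
absDev    = abs(x - med);            % distance of each point from it
scaledMAD = 1.4826 * median(absDev); % robust estimate of the spread

% Flag points more than 3 scaled MADs from the median.
% Caveat: if more than half the values are identical, scaledMAD is 0
% and every differing value gets flagged.
isOut  = absDev > 3*scaledMAD;
xClean = x(~isOut);

x(isOut)   % displays 99 and -70
```

Recent MATLAB releases also provide isoutlier, whose default 'median' method applies the same kind of rule. Stock returns are often heavy-tailed, so the cutoff is best treated as a starting point rather than a verdict.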


Answer 3

Walter Roberson on 27 Mar 2011

https://it.mathworks.com/matlabcentral/answers/4040-removing-outliers#answer_5784

"outlier" is mathematically a matter of interpretation.

What is the outlier in this data?

1 2 3 1 2 3

Answer: 2, because the underlying process is believed to produce a 2 only 1 time in 1000, far less often than a 1 or a 3, so for 2 to show up twice is unusual for this data.

But if you only had the data, how would you know that?

Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.

5 Comments (3 older comments not shown)

joseph Frank on 1 Apr 2011

I have used 3 standard deviations away from the mean to remove outliers and I still have some. I have no clue how to compute the 1st derivative. If you have any instructions, I will follow them to compute the 1st derivative.

Walter Roberson on 1 Apr 2011

Sometimes it is more effective to compute deviations with a "leave one out" method: if this point were not already part of the dataset, how many standard deviations would it lie from the mean of the (smaller) dataset?

Three standard deviations covers about 99.7% of normally distributed data; possibly for your purposes, a lower cutoff such as 2.5 standard deviations, which flags more points, is warranted.
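
The leave-one-out idea can be sketched in a few lines of base MATLAB. For each point, the mean and standard deviation are recomputed without that point, and the point is then scored against those statistics. The sample vector is borrowed from Answer 2 and the 2.5 cutoff from the comment above; the variable names are illustrative.

```matlab
% Sample vector from Answer 2 (outliers 99 and -70)
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
n = numel(x);
cutoff = 2.5;   % cutoff in standard deviations, as suggested above

% Ordinary z-scores: every point contributes to its own mean and std
zPlain = abs(x - mean(x)) / std(x);

% Leave-one-out z-scores: score each point against the other n-1 points
zLoo = zeros(size(x));
for i = 1:n
    rest    = x([1:i-1, i+1:n]);   % all points except the i-th
    zLoo(i) = abs(x(i) - mean(rest)) / std(rest);
end

find(zPlain > cutoff)   % indices flagged by the ordinary score
find(zLoo > cutoff)     % indices flagged by the leave-one-out score
```

On this sample the ordinary score flags only 99, because the two extreme values inflate the standard deviation enough to mask -70; the leave-one-out score flags both.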
