removing outliers

Hi,
I have event-level data for n companies (not time series data). Visually, I can see that there are outliers, but I don't know which method to use to remove them in MATLAB. Any help is appreciated.

Answers (3)

Richard Willey on 1 Apr 2011

4 votes

Automatically detecting outliers is tricky stuff.
You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.
Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistic called "Cook's Distance" that you can extract from regstats.
Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.
Here's a simple example illustrating how this works:
% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to
% which a regression model would change if this data point
% were excluded from the regression. Cook's Distance is
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% Cook's Distance > 4/n is a typical threshold that is used to suggest
% the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')
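
For readers without the Statistics Toolbox, the same quantity can be computed by hand. The following is a minimal sketch, not part of the original answer, that uses the standard formula D_i = e_i^2 / (p*s^2) * h_ii / (1 - h_ii)^2 with base MATLAB only. The fixed seed and the variable names (A, h, cookd, b_clean) are assumptions made for illustration.

```matlab
% Sketch: Cook's distance for a straight-line fit using only base MATLAB.
rng(0);                              % fixed seed (assumption, for repeatability)
n = 100;
X = (1:n)';
noise = randn(n,1);
noise(11:20:91) = 10*randn(5,1);     % five noisier observations, as in the example above
Y = 3*X + 2 + noise;

A = [ones(n,1) X];                   % design matrix: intercept and slope
p = size(A,2);                       % number of fitted coefficients
b = A \ Y;                           % least-squares coefficients
e = Y - A*b;                         % residuals
s2 = sum(e.^2) / (n - p);            % mean squared error

[Q,~] = qr(A,0);                     % economy-size QR; hat matrix H = Q*Q'
h = sum(Q.^2, 2);                    % leverages, i.e. diag(H)

cookd = (e.^2 ./ (p*s2)) .* (h ./ (1 - h).^2);
potential_outlier = cookd > 4/n;     % same 4/n rule of thumb as above

% Refit without the flagged points to see how much the coefficients move
b_clean = A(~potential_outlier,:) \ Y(~potential_outlier);
disp([b b_clean])
```

Comparing b with b_clean gives a rough sense of how much the flagged points were pulling on the fit. Whether to drop them is still a judgment call.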

2 Comments

Mark Shore on 1 Apr 2011
Looks interesting but unfortunately requires the Statistics Toolbox.
Anirudh Thatipelli on 24 May 2018
Thanks for referring to Cook's distance @Richard Wiley. It has been a great help for me in removing outliers.


Matt Fig on 26 Mar 2011

0 votes

What form is the data? You might be able to use logical indexing. For example:
% x is some data with outliers 99 and -70. We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10); % Take those values less than 10
x = x(x>0); % Take those values greater than zero.
You could also do this in one shot, as below.
% x is some data with outliers 99 and -70. We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)

1 Comment

joseph Frank on 27 Mar 2011
The data is in % stock returns. It will be difficult to set a subjective cutoff point. I am wondering if there is another way to determine what is an outlier and what is not.


Walter Roberson on 27 Mar 2011

0 votes

"outlier" is mathematically a matter of interpretation.
What is the outlier in this data?
1 2 3 1 2 3
Answer: 2, because the underlying process is believed to create 2 only 1 time in 1000 compared to 1 or 3, so for 2 to show up twice is unusual for this data.
But if you only had the data, how would you know that?
Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.

5 Comments

joseph Frank on 1 Apr 2011
Visually I can tell. For example, since I have daily prices: price1=0.037, p2=0.039, p3=139.5, p4=0.385. Obviously the 3rd observation is an error that needs to be removed. The database is huge and the number of stocks is enormous, so I can't do it visually each time. There must be a criterion that would catch these extreme values and big jumps when you have time series data.
Walter Roberson on 1 Apr 2011
How many standard deviations is 139.5 from the mean?
Sean de Wolski on 1 Apr 2011
What percentage of the range is the 1st derivative?
joseph Frank on 1 Apr 2011
I have used 3 standard deviations away from the mean to remove outliers and I still have some. I have no clue how to compute the 1st derivative. If you have any instructions, I will follow them to compute the 1st derivative.
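
For sampled data, the "1st derivative" can be approximated by the difference between consecutive observations, which MATLAB's diff function returns. A rough sketch on the four prices quoted earlier in this thread is below. The 0.5 cutoff and the spike rule (a large jump in, followed by a large jump back out with the opposite sign) are hypothetical choices for illustration, not something proposed in the thread.

```matlab
% Sketch: flag isolated spikes using first differences as a fraction of the range.
p = [0.037 0.039 139.5 0.385];       % prices quoted in the thread

d = diff(p);                         % first differences: p(2)-p(1), p(3)-p(2), ...
relJump = abs(d) / (max(p) - min(p));% each jump as a fraction of the data range

thresh = 0.5;                        % hypothetical cutoff, tune for your data
big = relJump > thresh;

% A spike at point i shows up as a big jump into i and a big jump out of i
% with opposite signs. The first and last points cannot be tested this way.
reversal = sign(d(1:end-1)) ~= sign(d(2:end));
isSpike = [false, big(1:end-1) & big(2:end) & reversal, false];

find(isSpike)                        % index of the flagged observation(s)
p_clean = p(~isSpike);               % series with flagged points removed
```

Because the range in the denominator is itself inflated by the spike, this only works as a crude screen. For long series it may be more sensible to apply it within a moving window.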
Walter Roberson on 1 Apr 2011
Sometimes it is more effective to compute deviations with a "leave one out" method: if this point was not already part of the dataset, how many deviations away from the mean would it be of the (smaller) dataset?
For normally distributed data, about 99.7% of values fall within three standard deviations of the mean; possibly for your purposes a lower cutoff such as 2.5 standard deviations is warranted.
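
The "leave one out" idea can be sketched as follows, again on the four prices quoted in this thread. The loop, the variable names, and the cutoff k = 3 are assumptions for illustration only.

```matlab
% Sketch: ordinary vs. leave-one-out deviation scores.
p = [0.037 0.039 139.5 0.385];       % prices quoted in the thread
n = numel(p);

% Ordinary score: the suspect point inflates both the mean and the std
z_all = abs(p - mean(p)) / std(p);

% Leave-one-out score: compare each point with the mean and std of the others
z_loo = zeros(1, n);
for i = 1:n
    rest = p([1:i-1, i+1:n]);        % all observations except the i-th
    z_loo(i) = abs(p(i) - mean(rest)) / std(rest);
end

k = 3;                               % cutoff in standard deviations (assumption)
flag_all = z_all > k;                % ordinary rule
flag_loo = z_loo > k;                % leave-one-out rule
```

With these four values the ordinary score of 139.5 is only about 1.5, because the point itself drags the mean up and inflates the standard deviation, so a 3-standard-deviation rule does not flag it. The leave-one-out score for the same point is in the hundreds, so it is flagged.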


Asked: 26 Mar 2011
