filloutliers
Detect and replace outliers in data
Syntax
Description
B = filloutliers(A,fillmethod) finds outliers in A and replaces them according to fillmethod. For example, filloutliers(A,'previous') replaces outliers with the previous nonoutlier element. By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median. If A is a matrix or table, then filloutliers operates on each column separately. If A is a multidimensional array, then filloutliers operates along the first dimension whose size does not equal 1.
B = filloutliers(A,fillmethod,findmethod) specifies a method for detecting outliers. For example, filloutliers(A,'previous','mean') defines an outlier as an element of A more than three standard deviations from the mean.
B = filloutliers(A,fillmethod,'percentiles',threshold) defines outliers as points outside of the percentiles specified in threshold. The threshold argument is a two-element row vector containing the lower and upper percentile thresholds, such as [10 90].
B = filloutliers(A,fillmethod,movmethod,window) specifies a moving method for detecting local outliers according to a window length defined by window. For example, filloutliers(A,'previous','movmean',5) identifies outliers as elements more than three local standard deviations away from the local mean within a five-element window.
B = filloutliers(___,dim) specifies the dimension of A to operate along for any of the previous syntaxes. For example, filloutliers(A,'linear',2) operates on each row of a matrix A.
B = filloutliers(___,Name,Value) specifies additional parameters for detecting and replacing outliers using one or more name-value pair arguments. For example, filloutliers(A,'previous','SamplePoints',t) detects outliers in A relative to the corresponding elements of a time vector t.
[B,TF,L,U,C] = filloutliers(___) also returns information about the position of the outliers and thresholds computed by the detection method. TF is a logical array indicating the location of the outliers in A. The L, U, and C arguments represent the lower and upper thresholds and the center value used by the outlier detection method.
Examples
Interpolate Outliers
Create a vector of data containing an outlier, and use linear interpolation to replace the outlier. Plot the original and filled data.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
B = filloutliers(A,'linear');
plot(1:15,A,1:15,B,'o')
legend('Original Data','Interpolated Data')
Determine Outliers with Mean
Create a vector containing an outlier, and define outliers as points outside three standard deviations from the mean. Replace the outlier with the nearest element that is not an outlier, and plot the original data and the interpolated data.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
B = filloutliers(A,'nearest','mean');
plot(1:15,A,1:15,B,'o')
legend('Original Data','Interpolated Data')
Determine Outliers with Sliding Window
Use a moving median to find local outliers within a sine wave that corresponds to a time vector.
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD away from the local median within a sliding window. Find the location of the outlier in A relative to the points in t with a window size of 5 hours. Fill the outlier with the computed threshold value using the method 'clip', and plot the original and filled data.
[B,TF,L,U,C] = filloutliers(A,'clip','movmedian',hours(5),'SamplePoints',t);
plot(t,A,t,B,'o')
legend('Original Data','Filled Data')
Display the threshold value that replaced the outlier. The outlier lies above its neighbors, so it is clipped to the upper threshold.
U(TF)
ans = -0.8779
Matrix of Data
Fill outliers for each row of a matrix.
Create a matrix of data containing outliers along the diagonal.
A = randn(5,5) + diag(1000*ones(1,5))
A = 5×5
10^{3} ×
1.0005 -0.0013 -0.0013 -0.0002 0.0007
0.0018 0.9996 0.0030 -0.0001 -0.0012
-0.0023 0.0003 1.0007 0.0015 0.0007
0.0009 0.0036 -0.0001 1.0014 0.0016
0.0003 0.0028 0.0007 0.0014 1.0005
Fill outliers with zeros based on the data in each row, and display the new values.
[B,TF,lower,upper,center] = filloutliers(A,0,2);
B
B = 5×5
0 -1.3077 -1.3499 -0.2050 0.6715
1.8339 0 3.0349 -0.1241 -1.2075
-2.2588 0.3426 0 1.4897 0.7172
0.8622 3.5784 -0.0631 0 1.6302
0.3188 2.7694 0.7147 1.4172 0
You can directly access the detected outlier values and their filled values using TF as an index vector.
[A(TF) B(TF)]
ans = 5×2
10^{3} ×
1.0005 0
0.9996 0
1.0007 0
1.0014 0
1.0005 0
Outlier Thresholds
Find the outlier in a vector of data, and replace it using the 'clip' method. Plot the original data, the filled data, and the thresholds and center value determined by the detection method. 'clip' replaces the outlier with the upper threshold value.
x = 1:10;
A = [60 59 49 49 58 100 61 57 48 58];
[B,TF,lower,upper,center] = filloutliers(A,'clip');
plot(x,A,x,B,'o',x,lower*ones(1,10),x,upper*ones(1,10),x,center*ones(1,10))
legend('Original Data','Filled Data','Lower Threshold','Upper Threshold','Center Value')
Input Arguments
A — Input data
vector | matrix | multidimensional array | table | timetable
Input data, specified as a vector, matrix, multidimensional array, table, or timetable.
If A is a table, then its variables must be of type double or single, or you can use the 'DataVariables' name-value pair to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.
If A is a timetable, then filloutliers operates only on the table elements. Row times must be unique and listed in ascending order.
Data Types: double | single | table | timetable
fillmethod — Fill method
numeric scalar | 'center' | 'clip' | 'previous' | 'next' | 'nearest' | 'linear' | 'spline' | 'pchip' | 'makima'
Fill method for replacing outliers, specified as a numeric scalar or one of the following:
Fill Method | Description
--- | ---
Numeric scalar | Fills with the specified scalar value
'center' | Fills with the center value determined by findmethod
'clip' | Fills with the lower threshold value for elements smaller than the lower threshold determined by findmethod. Fills with the upper threshold value for elements larger than the upper threshold determined by findmethod
'previous' | Fills with the previous nonoutlier value
'next' | Fills with the next nonoutlier value
'nearest' | Fills with the nearest nonoutlier value
'linear' | Fills using linear interpolation of neighboring, nonoutlier values
'spline' | Fills using piecewise cubic spline interpolation
'pchip' | Fills using shape-preserving piecewise cubic spline interpolation
'makima' | Fills using modified Akima cubic Hermite interpolation (numeric, duration, and datetime data types only)
Data Types: double | single | char
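As an illustration of how the fill methods in the table differ, the following sketch applies several of them to the sample vector used in the examples on this page. It assumes the default 'median' detection rule, under which the elements 100 and 300 (positions 4 and 9) are the outliers. The variable names are chosen here for illustration only.

```matlab
% Sample data with two outliers (100 and 300) under the default 'median' rule
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

Bprev   = filloutliers(A,'previous');  % copy the previous nonoutlier value
Bnext   = filloutliers(A,'next');      % copy the next nonoutlier value
Blinear = filloutliers(A,'linear');    % interpolate between neighboring nonoutliers
Bcenter = filloutliers(A,'center');    % use the center value (here, the median)
Bconst  = filloutliers(A,0);           % use a fixed scalar value

% Compare the replacements at the two outlier positions, one row per method
idx = [4 9];
disp([Bprev(idx); Bnext(idx); Blinear(idx); Bcenter(idx); Bconst(idx)])
```

Under these assumptions, 'previous' yields 60 and 58, 'next' yields 59 and 61, 'linear' yields 59.5 at both positions, and 'center' yields the median 59.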
findmethod — Method for detecting outliers
'median' (default) | 'mean' | 'quartiles' | 'grubbs' | 'gesd'
Method for detecting outliers, specified as one of the following:
Method | Description
--- | ---
'median' | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
'mean' | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than 'median'.
'quartiles' | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed.
'grubbs' | Outliers are detected using Grubbs’s test, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed.
'gesd' | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to 'grubbs', but can perform better when there are multiple outliers masking each other.
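To make the robustness difference between 'median' and 'mean' concrete, the sketch below runs both detection methods on the sample vector from the examples and inspects the logical output TF. The fill method 'center' is an arbitrary choice here, since only the detection result is of interest.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% The second output marks the detected outliers
[~,tfMedian] = filloutliers(A,'center','median');
[~,tfMean]   = filloutliers(A,'center','mean');

find(tfMedian)   % positions flagged by the 'median' rule
find(tfMean)     % positions flagged by the 'mean' rule
```

For this vector, the value 300 inflates the standard deviation (roughly 62) so much that 100 falls within three standard deviations of the mean. The 'mean' rule therefore flags only position 9, while the 'median' rule flags positions 4 and 9.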
threshold — Percentile thresholds
two-element row vector
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0,100]. The first element indicates the lower percentile threshold and the second element indicates the upper percentile threshold. For example, a threshold of [10 90] defines outliers as points below the 10th percentile and above the 90th percentile. The first element of threshold must be less than the second element.
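A short sketch of the 'percentiles' syntax follows, assuming a release in which this option is available. It clips values outside the 10th and 90th percentiles and then checks two properties that follow from the descriptions on this page: nonoutliers pass through unchanged, and every clipped value lies within the returned thresholds. The variable names unchanged and withinThresholds are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Treat points below the 10th or above the 90th percentile as outliers
% and clip them to the corresponding threshold
[B,TF,L,U] = filloutliers(A,'clip','percentiles',[10 90]);

% Nonoutliers are unchanged, and all values of B lie within [L, U]
unchanged = isequal(B(~TF),A(~TF));
withinThresholds = all(B >= L & B <= U);
```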
movmethod — Moving method
'movmedian' | 'movmean'
Moving method for detecting outliers, specified as one of the following:
Method | Description
--- | ---
'movmedian' | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter.
'movmean' | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window.
window — Window length
positive integer scalar | two-element vector of positive integers | positive duration scalar | two-element vector of positive durations
Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or 'SamplePoints' is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | duration
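The window rules above can be sketched as plain index arithmetic. This is an interpretation of the description for an interior element only, not the function's internal implementation; how windows are shortened near the endpoints is not covered. The position i and the window sizes are hypothetical values.

```matlab
i = 5;                               % position of the current element (hypothetical)

% Odd scalar window: symmetric about the current element
w = 5;
idxOdd = i-(w-1)/2 : i+(w-1)/2;      % indices 3 4 5 6 7

% Even scalar window: centered about the current and previous elements
w = 4;
idxEven = i-w/2 : i+w/2-1;           % indices 3 4 5 6

% Two-element window [b f]: b elements backward and f elements forward
bf = [3 1];
idxBF = i-bf(1) : i+bf(2);           % indices 2 3 4 5 6
```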
dim — Dimension to operate along
positive integer scalar
Dimension to operate along, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.
Consider a matrix A.
filloutliers(A,fillmethod,1) fills outliers according to the data in each column.
filloutliers(A,fillmethod,2) fills outliers according to the data in each row.
When A is a table or timetable, dim is not supported. filloutliers operates along each table or timetable variable separately.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
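The effect of dim can be seen on a small matrix constructed for illustration, in which the last column is extreme relative to each row but unremarkable within its own column. The sketch assumes the default 'median' detection rule and uses 'center' as the fill method.

```matlab
% Column 5 holds values that are extreme relative to the rest of each row
A = [10 11 12 13 100;
     20 21 22 23 200;
     30 31 32 33 300];

[B1,TF1] = filloutliers(A,'center',1);   % detect along each column
[B2,TF2] = filloutliers(A,'center',2);   % detect along each row
```

Along the columns no element is flagged, so B1 equals A. Along the rows the three elements in column 5 are flagged and replaced by the row medians 12, 22, and 32.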
Name-Value Arguments
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: filloutliers(A,'center','mean','ThresholdFactor',4)
SamplePoints — Sample points
vector | table variable name | scalar | function handle | table vartype subscript
Sample points, specified as the comma-separated pair consisting of 'SamplePoints' and either a vector of sample point values or one of the options in the following table when the input data is a table. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.
When the input data is a table, you can specify the sample points as a table variable using one of the following options.
Option for Table Input | Description
--- | ---
Variable name | A character vector or scalar string specifying a single table variable name
Scalar variable index | A scalar table variable index
Logical vector | A logical vector whose elements each correspond to a table variable, where true includes the corresponding variable and false excludes all other variables
Function handle | A function handle that takes a table variable as input and returns a logical scalar, which must be true for only one table variable
vartype subscript | A table subscript generated by the vartype function that selects only one variable
Note
This name-value pair is not supported when the input data is a timetable. Timetables always use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.
Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then filloutliers(rand(1,10),'previous','movmean',3,'SamplePoints',t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, then the moving window length must have type duration.
Example: filloutliers([1 100 3 4],'nearest','SamplePoints',[1 2.5 3 4])
Example: filloutliers(T,'nearest','SamplePoints',"Var1")
Data Types: single | double | datetime | duration
DataVariables — Table variables to operate on
table variable name | scalar | vector | cell array | function handle | table vartype subscript
Table variables to operate on, specified as the comma-separated pair consisting of 'DataVariables' and one of the options in this table. The 'DataVariables' value indicates which variables of the input table to fill. Other variables in the table not specified by 'DataVariables' pass through to the output without being operated on.
Option | Description
--- | ---
Variable name | A character vector or scalar string specifying a single table variable name
Vector of variable names | A cell array of character vectors or string array where each element is a table variable name
Scalar or vector of variable indices | A scalar or vector of table variable indices
Logical vector | A logical vector whose elements each correspond to a table variable, where true includes the corresponding variable and false excludes it
Function handle | A function handle that takes a table variable as input and returns a logical scalar
vartype subscript | A table subscript generated by the vartype function
Example: filloutliers(A,'previous','DataVariables',["Var1" "Var2" "Var4"])
ThresholdFactor — Detection threshold factor
nonnegative scalar
Detection threshold factor, specified as the comma-separated pair consisting of 'ThresholdFactor' and a nonnegative scalar.
For methods 'median' and 'movmedian', the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods 'mean' and 'movmean', the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods 'grubbs' and 'gesd', the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the 'quartiles' method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value pair is not supported when the specified method is 'percentiles'.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
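As a sketch of how 'ThresholdFactor' changes detection with the 'median' method, the example below compares the default factor of 3 with a deliberately loose factor of 15 on the sample vector from the examples. The factor 15 is an arbitrary value picked for illustration.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default: more than 3 scaled MAD from the median
[~,tfDefault] = filloutliers(A,'center','median');

% Looser rule: more than 15 scaled MAD from the median
[~,tfLoose] = filloutliers(A,'center','median','ThresholdFactor',15);

find(tfDefault)
find(tfLoose)
```

For this vector the scaled MAD is about 2.97, so the default rule flags elements more than about 8.9 from the median 59 (positions 4 and 9). With a factor of 15 the cutoff grows to about 44.5, and only the value 300 at position 9 remains flagged.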
MaxNumOutliers — Maximum outlier count
positive scalar
Maximum outlier count, for the 'gesd' method only, specified as the comma-separated pair consisting of 'MaxNumOutliers' and a positive scalar. The 'MaxNumOutliers' value specifies the maximum number of outliers returned by the 'gesd' method. For example, filloutliers(A,'linear','gesd','MaxNumOutliers',5) returns no more than five outliers.
The default value for 'MaxNumOutliers' is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers can ensure that all outliers are detected, but at the cost of reduced computational efficiency.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
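The cap imposed by 'MaxNumOutliers' can be checked with a brief sketch. It limits the 'gesd' test to a single outlier on the sample vector and counts the flagged elements; which element is flagged is not asserted here, only that the count respects the cap. The variable name numFlagged is illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Allow the 'gesd' test to report at most one outlier
[B,TF] = filloutliers(A,'linear','gesd','MaxNumOutliers',1);

numFlagged = nnz(TF);   % number of elements treated as outliers
```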
OutlierLocations — Known outlier indicator
vector | matrix | multidimensional array
Known outlier indicator, specified as the comma-separated pair consisting of 'OutlierLocations' and a logical vector, matrix, or multidimensional array of the same size as A. The known outlier indicator elements can be true to indicate an outlier in the corresponding location of A or false otherwise. Specifying 'OutlierLocations' turns off the default outlier detection method, and uses only the elements of the known outlier indicator to define outliers.
The 'OutlierLocations' name-value pair cannot be specified when findmethod is specified.
The output TF is the same as the 'OutlierLocations' value.
Data Types: logical
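A minimal sketch of 'OutlierLocations' follows. The data and the logical mask known are hypothetical: the third element is treated as a known bad value even though no detection rule would flag it, and it is replaced with the previous element.

```matlab
A = [1 2 3 4 5];
known = logical([0 0 1 0 0]);   % hypothetical: third element is known to be bad

% Detection is skipped; only the marked location is filled
[B,TF] = filloutliers(A,'previous','OutlierLocations',known);
```

Here B is [1 2 2 4 5], and TF is identical to known, as described above.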
Output Arguments
B — Filled outlier array
vector | matrix | multidimensional array | table | timetable
Filled outlier array, returned as a vector, matrix, multidimensional array, table, or timetable. The elements of B are the same as those of A, but with all outliers replaced according to fillmethod.
Data Types: double | single | table | timetable
TF — Outlier indicator
vector | matrix | multidimensional array
Outlier indicator, returned as a vector, matrix, or multidimensional array. An element of TF is true when the corresponding element of A is an outlier and false otherwise. TF is the same size as A.
Data Types: logical
L — Lower threshold
scalar | vector | matrix | multidimensional array | table | timetable
Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower value of the default outlier detection method is three scaled MAD below the median of the input data. L has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
U — Upper threshold
scalar | vector | matrix | multidimensional array | table | timetable
Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper value of the default outlier detection method is three scaled MAD above the median of the input data. U has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
C — Center value
scalar | vector | matrix | multidimensional array | table | timetable
Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data. C has the same size as A in all dimensions except for the operating dimension where the length is 1.
Data Types: double | single | table | timetable
More About
Median Absolute Deviation
For a random variable vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as
$$\text{MAD} = \text{median}\left(\left|A_{i} - \text{median}\left(A\right)\right|\right)$$
for i = 1,2,...,N.
The scaled MAD is defined as c*median(abs(A-median(A))) where c=-1/(sqrt(2)*erfcinv(3/2)).
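The definitions above can be followed by hand. The sketch below computes the scaled MAD and the resulting default thresholds for the sample vector used in the examples. It mirrors the stated formulas and is not taken from the function's internal implementation.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

c = -1/(sqrt(2)*erfcinv(3/2));        % scaling constant, approximately 1.4826
C = median(A);                        % center value
scaledMAD = c*median(abs(A - C));     % scaled median absolute deviation
L = C - 3*scaledMAD;                  % lower threshold
U = C + 3*scaledMAD;                  % upper threshold
TF = A < L | A > U;                   % elements more than three scaled MAD from the median
```

For this vector the median is 59 and the unscaled MAD is 2, giving thresholds of roughly 50.10 and 67.90. The elements 100 and 300 fall outside that interval.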
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
Usage notes and limitations:
The 'percentiles', 'grubbs', and 'gesd' methods are not supported.
The 'movmedian' and 'movmean' methods do not support tall timetables.
The 'SamplePoints' and 'MaxNumOutliers' name-value pairs are not supported.
The value of 'DataVariables' cannot be a function handle.
Computation of filloutliers(A,fillmethod), filloutliers(A,fillmethod,'median',…), or filloutliers(A,fillmethod,'quartiles',…) along the first dimension is only supported when A is a tall column vector.
The syntaxes filloutliers(A,'spline',…) and filloutliers(A,'makima',…) are not supported.
For more information, see Tall Arrays.
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
The 'movmean' and 'movmedian' methods for detecting outliers do not support timetable input data, datetime 'SamplePoints' values, or duration 'SamplePoints' values.
Only the 'center', 'clip', and numeric scalar methods for filling outliers are supported when the input data is a timetable or when the 'SamplePoints' value has type datetime or duration.
To use the 'spline' and 'pchip' fill methods, you must enable support for variable-size arrays.
String and character array inputs must be constant.
The 'makima' option is not supported.
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
See Also
rmoutliers | isoutlier | ismissing | fillmissing | Clean Outlier Data