prctile

Percentiles of a data set

Description

Y = prctile(X,p) returns percentiles of the elements in a data vector or array X for the percentages p in the interval [0,100].

• If X is a vector, then Y is a scalar or a vector with the same length as the number of percentiles requested (length(p)). Y(i) contains the p(i)th percentile.

• If X is a matrix, then Y is a row vector or a matrix, where the number of rows of Y is equal to the number of percentiles requested (length(p)). The ith row of Y contains the p(i)th percentiles of each column of X.

• For multidimensional arrays, prctile operates along the first nonsingleton dimension of X.

Y = prctile(X,p,'all') returns percentiles of all the elements of X.

Y = prctile(X,p,dim) returns percentiles along the operating dimension dim.

Y = prctile(X,p,vecdim) returns percentiles over the dimensions specified in the vector vecdim. For example, if X is a matrix, then prctile(X,50,[1 2]) returns the 50th percentile of all the elements of X because every element of a matrix is contained in the array slice defined by dimensions 1 and 2.

Y = prctile(___,'Method',method) returns either exact or approximate percentiles based on the value of method, using any of the input argument combinations in the previous syntaxes.

Examples

Generate a data set of size 10.

rng('default'); % for reproducibility
x = normrnd(5,2,1,10)
x = 1×10

6.0753    8.6678    0.4823    6.7243    5.6375    2.3846    4.1328    5.6852   12.1568   10.5389

Calculate the 42nd percentile.

Y = prctile(x,42)
Y = 5.6709

Find the percentiles of all the values in an array.

Create a 3-by-5-by-2 array X.

X = reshape(1:30,[3 5 2])
X =
X(:,:,1) =

1     4     7    10    13
2     5     8    11    14
3     6     9    12    15

X(:,:,2) =

16    19    22    25    28
17    20    23    26    29
18    21    24    27    30

Find the 40th and 60th percentiles of the elements of X.

Y = prctile(X,[40 60],'all')
Y = 2×1

12.5000
18.5000

Y(1) is the 40th percentile of X, and Y(2) is the 60th percentile of X.
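
As a quick cross-check (a sketch, not part of the original example), linearizing X with X(:) and calling prctile on the resulting column vector should agree with the 'all' option, because both treat every element of X as one data set. The variable names Yall and Yvec are chosen here for illustration.

```matlab
X = reshape(1:30,[3 5 2]);
p = [40 60];

Yall = prctile(X,p,'all');   % percentiles over every element of X
Yvec = prctile(X(:),p);      % same data, passed as a 30-by-1 column vector

maxDiff = max(abs(Yall - Yvec))   % largest difference between the two results
```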

Calculate the percentiles along the columns and rows of a data matrix for specified percentages.

Generate a 5-by-5 data matrix.

X = (1:5)'*(2:6)
X = 5×5

2     3     4     5     6
4     6     8    10    12
6     9    12    15    18
8    12    16    20    24
10    15    20    25    30

Calculate the 25th, 50th, and 75th percentiles along the columns of X.

Y = prctile(X,[25 50 75],1)
Y = 3×5

3.5000    5.2500    7.0000    8.7500   10.5000
6.0000    9.0000   12.0000   15.0000   18.0000
8.5000   12.7500   17.0000   21.2500   25.5000

The rows of Y correspond to the percentiles of columns of X. For example, the 25th, 50th, and 75th percentiles of the third column of X with elements (4, 8, 12, 16, 20) are 7, 12, and 17, respectively. Y = prctile(X,[25 50 75]) returns the same percentile matrix.

Calculate the 25th, 50th, and 75th percentiles along the rows of X.

Y = prctile(X,[25 50 75],2)
Y = 5×3

2.7500    4.0000    5.2500
5.5000    8.0000   10.5000
8.2500   12.0000   15.7500
11.0000   16.0000   21.0000
13.7500   20.0000   26.2500

The rows of Y correspond to the percentiles of rows of X. For example, the 25th, 50th, and 75th percentiles of the first row of X with elements (2, 3, 4, 5, 6) are 2.75, 4, and 5.25, respectively.

Find the percentiles of a multidimensional array along multiple dimensions simultaneously.

Create a 3-by-5-by-2 array X.

X = reshape(1:30,[3 5 2])
X =
X(:,:,1) =

1     4     7    10    13
2     5     8    11    14
3     6     9    12    15

X(:,:,2) =

16    19    22    25    28
17    20    23    26    29
18    21    24    27    30

Calculate the 40th and 60th percentiles for each page of X by specifying dimensions 1 and 2 as the operating dimensions.

Ypage = prctile(X,[40 60],[1 2])
Ypage =
Ypage(:,:,1) =

6.5000
9.5000

Ypage(:,:,2) =

21.5000
24.5000

For example, Ypage(1,1,1) is the 40th percentile of the first page of X, and Ypage(2,1,1) is the 60th percentile of the first page of X.

Calculate the 40th and 60th percentiles of the elements in each X(:,i,:) slice by specifying dimensions 1 and 3 as the operating dimensions.

Ycol = prctile(X,[40 60],[1 3])
Ycol = 2×5

2.9000    5.9000    8.9000   11.9000   14.9000
16.1000   19.1000   22.1000   25.1000   28.1000

For example, Ycol(1,4) is the 40th percentile of the elements in X(:,4,:), and Ycol(2,4) is the 60th percentile of the elements in X(:,4,:).

Calculate exact and approximate percentiles of a tall column vector for a given percentage.

When you perform calculations on tall arrays, MATLAB® uses either a parallel pool (default if you have Parallel Computing Toolbox™) or the local MATLAB session. If you want to run the example using the local MATLAB session when you have Parallel Computing Toolbox, you can change the global execution environment by using the mapreducer function.

Create a datastore for the airlinesmall data set. Treat 'NA' values as missing data so that datastore replaces them with NaN values. Specify to work with the ArrTime variable.

ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
'SelectedVariableNames','ArrTime');

Create a tall table on top of the datastore, and extract the data from the tall table into a tall vector.

t = tall(ds) % Tall table
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).

t =

Mx1 tall table

ArrTime
_______

735
1124
2218
1431
746
1547
1052
1134
:
:
x = t{:,:}   % Tall vector
x =

Mx1 tall double column vector

735
1124
2218
1431
746
1547
1052
1134
:
:

Calculate the exact 50th percentile of x. Because x is a tall column vector and p is a scalar, prctile returns the exact percentile value by default.

p = 50;
yExact = prctile(x,p)
yExact =

tall double

?

Calculate the approximate 50th percentile of x. Specify 'Method','approximate' to use an approximation algorithm based on T-Digest for computing the percentile.

yApprox = prctile(x,p,'Method','approximate')
yApprox =

MxNx... tall double array

?    ?    ?    ...
?    ?    ?    ...
?    ?    ?    ...
:    :    :
:    :    :

Evaluate the tall arrays and bring the results into memory by using gather.

[yExact,yApprox] = gather(yExact,yApprox)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 4: Completed in 5.1 sec
- Pass 2 of 4: Completed in 1 sec
- Pass 3 of 4: Completed in 1.8 sec
- Pass 4 of 4: Completed in 1.8 sec
Evaluation completed in 13 sec
yExact = 1522
yApprox = 1.5220e+03

The values of the approximate percentile and the exact percentile are the same to the four digits shown.

Calculate exact and approximate percentiles of a tall matrix for specified percentages along different dimensions.

When you perform calculations on tall arrays, MATLAB® uses either a parallel pool (default if you have Parallel Computing Toolbox™) or the local MATLAB session. If you want to run the example using the local MATLAB session when you have Parallel Computing Toolbox, you can change the global execution environment by using the mapreducer function.

Create a tall matrix X containing a subset of variables from the airlinesmall data set. See the previous example (percentiles of a tall column vector for a given percentage) for details about the steps to extract data from a tall array.

varnames = {'ArrDelay','ArrTime','DepTime','ActualElapsedTime'}; % Subset of variables in the data set
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
'SelectedVariableNames',varnames); % Datastore
t = tall(ds);     % Tall table
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).
X = t{:,varnames} % Tall matrix
X =

Mx4 tall double matrix

8         735         642          53
8        1124        1021          63
21        2218        2055          83
13        1431        1332          59
4         746         629          77
59        1547        1446          61
3        1052         928          84
11        1134         859         155
:          :            :           :
:          :            :           :

When operating along a dimension that is not 1, the prctile function calculates the exact percentiles only, so that it can perform the computation efficiently using a sorting-based algorithm (see Algorithms) instead of an approximation algorithm based on T-Digest.

Calculate the exact 25th, 50th, and 75th percentiles of X along the second dimension.

p = [25 50 75]; % Vector of percentages
Yexact = prctile(X,p,2)
Yexact =

MxNx... tall double array

?    ?    ?    ...
?    ?    ?    ...
?    ?    ?    ...
:    :    :
:    :    :

When the function operates along the first dimension and p is a vector of percentages, you must use the approximation algorithm based on t-digest to compute the percentiles. Using the sorting-based algorithm to find the percentiles along the first dimension of a tall array is computationally intensive.

Calculate the approximate 25th, 50th, and 75th percentiles of X along the first dimension. Because the default dimension is 1, you do not need to specify a value for dim.

Yapprox = prctile(X,p,'Method','approximate')
Yapprox =

MxNx... tall double array

?    ?    ?    ...
?    ?    ?    ...
?    ?    ?    ...
:    :    :
:    :    :

Evaluate the tall arrays and bring the results into memory by using gather.

[Yexact,Yapprox] = gather(Yexact,Yapprox);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 12 sec
Evaluation completed in 14 sec

Show the first five rows of the exact 25th, 50th, and 75th percentiles along the second dimension of X.

Yexact(1:5,:)
ans = 5×3
10^3 ×

0.0305    0.3475    0.6885
0.0355    0.5420    1.0725
0.0520    1.0690    2.1365
0.0360    0.6955    1.3815
0.0405    0.3530    0.6875

Each row of the matrix Yexact contains the three percentiles of the corresponding row in X. For example, 30.5, 347.5, and 688.5 are the 25th, 50th, and 75th percentiles, respectively, of the first row in X.

Show the approximate 25th, 50th, and 75th percentiles of X along the first dimension.

Yapprox
Yapprox = 3×4
10^3 ×

-0.0070    1.1149    0.9322    0.0700
0    1.5220    1.3350    0.1020
0.0110    1.9180    1.7400    0.1510

Each column of the matrix Yapprox corresponds to the three percentiles for each column of the matrix X. For example, the first column of Yapprox with elements (–7, 0, 11) contains the percentiles for the first column of X.

Input Arguments

X: Input data, specified as a vector or array.

Data Types: double | single

p: Percentages for which to compute percentiles, specified as a scalar or vector of scalars from 0 to 100.

Example: 25

Example: [25, 50, 75]

Data Types: double | single

dim: Dimension along which the percentiles of X are requested, specified as a positive integer. For example, for a matrix X, when dim = 1, prctile returns the percentile(s) of the columns of X; when dim = 2, prctile returns the percentile(s) of the rows of X. For a multidimensional array X, the length of the dimth dimension of Y is equal to the length of p.

Data Types: double | single
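
The effect of dim on the output size can be checked directly. The following is a minimal sketch using arbitrary random data; the variable names s1, s2, and s3 are illustrative only.

```matlab
X = rand(4,6,3);     % arbitrary 4-by-6-by-3 data
p = [25 75];         % two percentages, so length(p) = 2

s1 = size(prctile(X,p,1))   % 2 6 3: dimension 1 now has length(p) entries
s2 = size(prctile(X,p,2))   % 4 2 3: dimension 2 now has length(p) entries
s3 = size(prctile(X,p,3))   % 4 6 2: dimension 3 now has length(p) entries
```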

vecdim: Vector of dimensions, specified as a positive integer vector. Each element of vecdim represents a dimension of the input array X. The output Y has length length(p) in the smallest specified operating dimension (that is, dimension min(vecdim)) and has length 1 in each of the remaining operating dimensions. The other dimension lengths are the same for X and Y.

For example, consider a 2-by-3-by-3 array X with p = [20 40 60 80]. In this case, prctile(X,p,[1 2]) returns an array, where each page of the array contains the 20th, 40th, 60th, and 80th percentiles of the elements of the corresponding page of X. Because 1 and 2 are the operating dimensions, with min([1 2]) = 1 and length(p) = 4, the output is a 4-by-1-by-3 array.

Data Types: single | double
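
The 2-by-3-by-3 case described above can be reproduced as follows. The data values are arbitrary (an assumption for illustration); only the sizes matter.

```matlab
X = reshape(1:18,[2 3 3]);   % 2-by-3-by-3 array with arbitrary values
p = [20 40 60 80];

Y = prctile(X,p,[1 2]);      % operate over each 2-by-3 page of X
sz = size(Y)                 % 4 1 3: length(p) in dimension min([1 2]) = 1
```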

method: Method for calculating percentiles, specified as 'exact' or 'approximate'. By default, prctile returns the exact percentiles by implementing an algorithm that uses sorting. You can specify 'Method','approximate' for prctile to return approximate percentiles by implementing an algorithm that uses T-Digest.

Data Types: char | string

Output Arguments

Y: Percentiles of a data vector or array, returned as a scalar or array for one or more percentage values.

• If X is a vector, then Y is a scalar or a vector with the same length as the number of percentiles requested (length(p)). Y(i) contains the p(i)th percentile.

• If X is an array of dimension d, then Y is an array with the length of the smallest operating dimension equal to the number of percentiles requested (length(p)).

Multidimensional Array

A multidimensional array is an array with more than two dimensions. For example, if X is a 1-by-3-by-4 array, then X is a 3-D array.

Nonsingleton Dimension

A nonsingleton dimension of an array is a dimension whose size is not equal to 1. A first nonsingleton dimension of an array is the first dimension that satisfies the nonsingleton condition. For example, if X is a 1-by-1-by-2-by-4 array, then the third dimension is the first nonsingleton dimension of X.
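
One way to locate the first nonsingleton dimension programmatically is to search the size vector. This is a small sketch; firstDim is an illustrative variable name.

```matlab
X = zeros(1,1,2,4);                 % 1-by-1-by-2-by-4 array
firstDim = find(size(X) ~= 1, 1)    % returns 3

% For a scalar input, every dimension is singleton and find returns empty.
```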

Linear Interpolation

Linear interpolation uses linear polynomials to find $y_i = f(x_i)$, the values of the underlying function $Y = f(X)$ at the points in the vector or array $x$. Given the data points $(x_1, y_1)$ and $(x_2, y_2)$, where $y_1 = f(x_1)$ and $y_2 = f(x_2)$, linear interpolation finds $y = f(x)$ for a given $x$ between $x_1$ and $x_2$ as follows:

$y=f\left(x\right)={y}_{1}+\frac{\left(x-{x}_{1}\right)}{\left({x}_{2}-{x}_{1}\right)}\left({y}_{2}-{y}_{1}\right).$

Similarly, if the 100(1.5/n)th percentile is $y_{1.5/n}$ and the 100(2.5/n)th percentile is $y_{2.5/n}$, then linear interpolation finds the 100(2.3/n)th percentile, $y_{2.3/n}$, as:

${y}_{\frac{2.3}{n}}={y}_{\frac{1.5}{n}}+\frac{\left(\frac{2.3}{n}-\frac{1.5}{n}\right)}{\left(\frac{2.5}{n}-\frac{1.5}{n}\right)}\left({y}_{\frac{2.5}{n}}-{y}_{\frac{1.5}{n}}\right).$
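
To make the formula concrete, the following sketch applies it with n = 5 to the sorted five-element data {1, 2, 3, 6, 10} used in the Algorithms section, where the 30th percentile is 2 and the 50th percentile is 3. The variable names are illustrative.

```matlab
n  = 5;
x1 = 1.5/n;  y1 = 2;    % 100(1.5/n) = 30th percentile of {1, 2, 3, 6, 10}
x2 = 2.5/n;  y2 = 3;    % 100(2.5/n) = 50th percentile
x  = 2.3/n;             % target: 100(2.3/n) = 46th percentile

y = y1 + (x - x1)/(x2 - x1)*(y2 - y1)   % approximately 2.8
```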

T-Digest

T-digest is a probabilistic data structure that is a sparse representation of the empirical cumulative distribution function (CDF) of a data set. T-digest is useful for computing approximations of rank-based statistics (such as percentiles and quantiles) from online or distributed data in a way that allows for controllable accuracy, particularly near the tails of the data distribution.

For data that is distributed in different partitions, t-digest computes quantile estimates (and percentile estimates) for each data partition separately, and then combines the estimates while maintaining a constant-memory bound and constant relative accuracy of computation ($q\left(1-q\right)$ for the qth quantile). For these reasons, t-digest is practical for working with tall arrays.

To estimate quantiles of an array that is distributed in different partitions, first build a t-digest in each partition of the data. A t-digest clusters the data in the partition and summarizes each cluster by a centroid value and an accumulated weight that represents the number of samples contributing to the cluster. T-digest uses large clusters (widely spaced centroids) to represent areas of the CDF that are near q = 0.5 and uses small clusters (tightly spaced centroids) to represent areas of the CDF that are near q = 0 or q = 1.

T-digest controls the cluster size by using a scaling function that maps a quantile q to an index k with a compression parameter $\delta$. That is,

$k\left(q,\delta \right)=\delta \cdot \left(\frac{{\mathrm{sin}}^{-1}\left(2q-1\right)}{\pi }+\frac{1}{2}\right),$

where the mapping k is monotonic with minimum value k(0,δ) = 0 and maximum value k(1,δ) = δ. The following figure shows the scaling function for δ = 10. The scaling function translates the quantile q to the scaling factor k in order to give variable size steps in q. As a result, cluster sizes are unequal (larger around the center quantiles and smaller near q = 0 or q = 1). The smaller clusters allow for better accuracy near the edges of the data.
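
The scaling function is simple to evaluate numerically. The sketch below defines it as an anonymous function (the name k and the sample quantiles are chosen for illustration) and compares how much of the k range a step of 0.01 in q consumes near the tail and near the center.

```matlab
% Scaling function k(q,delta) from the formula above.
k = @(q,delta) delta .* (asin(2*q - 1)/pi + 1/2);

delta = 10;
q  = [0 0.01 0.25 0.5 0.75 0.99 1];
kq = k(q,delta)          % k(0) = 0, k(0.5) = delta/2, k(1) = delta

% The same step in q spans a larger step in k near the tail than near q = 0.5.
tailStep   = k(0.01,delta) - k(0,delta);
centerStep = k(0.51,delta) - k(0.50,delta);
```

Because the same step in q consumes more of the k range near q = 0 or q = 1, a fixed allowance in k admits a smaller share of the samples per cluster in the tails, which matches the behavior described above.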

To update a t-digest with a new observation that has a weight and location, find the cluster closest to the new observation. Then, add the weight and update the centroid of the cluster based on the weighted average, provided that the updated weight of the cluster does not exceed the size limitation.
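
The update step can be sketched as follows. This is a simplified illustration, not the implementation that prctile uses: the centroid and weight values are hypothetical, δ = 5 is an arbitrary choice, and the size limitation is assumed to be that a cluster may span at most one unit of k, as in the t-digest reference paper.

```matlab
% Hypothetical t-digest state: sorted centroids and their weights.
c = 1:9;                     % cluster centroids
w = [1 2 3 4 4 4 3 2 1];     % number of samples in each cluster
delta = 5;                   % compression parameter (illustrative value)
k = @(q) delta .* (asin(2*q - 1)/pi + 1/2);   % scaling function

xNew = 5.2;  wNew = 1;       % new observation and its weight

[~,idx] = min(abs(c - xNew));        % closest cluster
N      = sum(w) + wNew;              % total weight after the update
qLeft  = sum(w(1:idx-1))/N;          % quantile at the cluster's left edge
qRight = (sum(w(1:idx)) + wNew)/N;   % right edge if the point is absorbed

if k(qRight) - k(qLeft) <= 1         % assumed size limit: one unit of k
    c(idx) = (c(idx)*w(idx) + xNew*wNew)/(w(idx) + wNew);   % weighted average
    w(idx) = w(idx) + wNew;
else
    [c,order] = sort([c xNew]);      % otherwise start a new singleton cluster
    w = [w wNew];
    w = w(order);
end
```

With these values the closest cluster is the fifth one, the size check passes, and the cluster absorbs the observation: its weight becomes 5 and its centroid moves from 5 to 5.04.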

You can combine independent t-digests from each partition of the data by taking a union of the t-digests and merging their centroids. To combine t-digests, first sort the clusters from all the independent t-digests in decreasing order of cluster weights. Then, merge neighboring clusters, when they meet the size limitation, to form a new t-digest.

Once you form a t-digest that represents the complete data set, you can estimate the end-points (or boundaries) of each cluster in the t-digest and then use interpolation between the end-points of each cluster to find accurate quantile estimates.

Algorithms

For an n-element vector X, prctile returns percentiles by using a sorting-based algorithm as follows:

1. The sorted elements in X are taken as the 100(0.5/n)th, 100(1.5/n)th, ..., 100([n – 0.5]/n)th percentiles. For example:

• For a data vector of five elements such as {6, 3, 2, 10, 1}, the sorted elements {1, 2, 3, 6, 10} respectively correspond to the 10th, 30th, 50th, 70th, and 90th percentiles.

• For a data vector of six elements such as {6, 3, 2, 10, 8, 1}, the sorted elements {1, 2, 3, 6, 8, 10} respectively correspond to the (50/6)th, (150/6)th, (250/6)th, (350/6)th, (450/6)th, and (550/6)th percentiles.

2. prctile uses linear interpolation to compute percentiles for percentages between 100(0.5/n) and 100([n – 0.5]/n).

3. prctile assigns the minimum or maximum values of the elements in X to the percentiles corresponding to the percentages outside that range.

prctile treats NaNs as missing values and removes them.
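
The steps above can be sketched for a single vector as follows. This is a minimal illustration of the described procedure under the assumption that at least two non-NaN elements remain (interp1 needs two sample points); it is not the source code of prctile, and the variable names are illustrative.

```matlab
x = [6 3 2 10 1 NaN];          % sample data; the NaN is treated as missing
p = [5 40 50 95];              % requested percentages

xs = sort(x(~isnan(x)));       % remove NaNs, then sort: [1 2 3 6 10]
n  = numel(xs);
pk = 100*((1:n) - 0.5)/n;      % percentages of the sorted elements: [10 30 50 70 90]

pc = min(max(p,pk(1)),pk(end));   % clamp so out-of-range percentages map to min or max
y  = interp1(pk,xs,pc)            % linear interpolation: [1 2.5 3 10]
```

The 5th and 95th percentiles fall outside the range [10, 90], so they take the minimum and maximum values, while the 40th percentile is interpolated halfway between the 30th (2) and 50th (3) percentiles.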

References

[1] Langford, E. “Quartiles in Elementary Statistics.” Journal of Statistics Education. Vol. 14, No. 3, 2006.

[2] Dunning, T., and O. Ertl. “Computing Extremely Accurate Quantiles Using T-Digests.” August 2017.