# pdist

Pairwise distance between pairs of objects

## Syntax

`D = pdist(X)`

`D = pdist(X,distance)`

## Description

`D = pdist(X)` computes the Euclidean distance between pairs of objects in the m-by-n data matrix `X`. Rows of `X` correspond to observations, and columns correspond to variables. `D` is a row vector of length m(m–1)/2, corresponding to pairs of observations in `X`. The distances are arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1). `D` is commonly used as a dissimilarity matrix in clustering or multidimensional scaling.

To save space and computation time, `D` is formatted as a vector. However, you can convert this vector into a square matrix using the `squareform` function so that element i, j in the matrix, where i < j, corresponds to the distance between objects i and j in the original data set.
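For illustration, SciPy's `scipy.spatial.distance` module provides `pdist` and `squareform` analogues that use the same condensed layout; a minimal Python sketch (a SciPy analogue, not the MATLAB function itself):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three observations in the plane, forming a 3-4-5 right triangle.
X = np.array([[0.0, 0.0],
              [3.0, 0.0],
              [0.0, 4.0]])

D = pdist(X)        # condensed vector of length m*(m-1)/2 = 3
                    # pairs (2,1), (3,1), (3,2) -> distances 3, 4, 5
Z = squareform(D)   # 3-by-3 symmetric matrix with zero diagonal
```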

`D = pdist(X,distance)` computes the distance between objects in the data matrix, `X`, using the method specified by `distance`, which can be any of the following character strings.

| Metric | Description |
| --- | --- |
| `'euclidean'` | Euclidean distance (default). |
| `'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in `X` is scaled by dividing by the corresponding element of the standard deviation `S = nanstd(X)`. To specify another value for `S`, use `D = pdist(X,'seuclidean',S)`. |
| `'cityblock'` | City block metric. |
| `'minkowski'` | Minkowski distance. The default exponent is 2. To specify a different exponent, use `D = pdist(X,'minkowski',P)`, where `P` is a positive scalar exponent. |
| `'chebychev'` | Chebychev distance (maximum coordinate difference). |
| `'mahalanobis'` | Mahalanobis distance, using the sample covariance of `X` as computed by `nancov`. To compute the distance with a different covariance, use `D = pdist(X,'mahalanobis',C)`, where the matrix `C` is symmetric and positive definite. |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
| `'spearman'` | One minus the sample Spearman's rank correlation between observations (treated as sequences of values). |
| `'hamming'` | Hamming distance, which is the percentage of coordinates that differ. |
| `'jaccard'` | One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ. |

A custom distance function can also be supplied as a function handle, specified using `@`:

`D = pdist(X,@distfun)`

The distance function must have the form

`d2 = distfun(XI,XJ)`

taking as arguments a 1-by-n vector `XI`, corresponding to a single row of `X`, and an m2-by-n matrix `XJ`, corresponding to multiple rows of `X`. `distfun` must accept a matrix `XJ` with an arbitrary number of rows, and must return an m2-by-1 vector of distances `d2`, whose kth element is the distance between `XI` and `XJ(k,:)`.
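SciPy's `pdist` likewise accepts a function handle, though with a different signature: the callable receives two 1-by-n row vectors rather than a vector plus a block of rows. A sketch using a hypothetical weighted Euclidean metric (`weuc` and the weights are illustrative, not part of any library):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
wgts = np.array([0.1, 0.3, 0.3, 0.2, 0.1])   # hypothetical coordinate weights

# Unlike MATLAB's distfun(XI,XJ), SciPy calls the metric with two 1-D rows.
def weuc(u, v, w=wgts):
    return np.sqrt(np.sum(w * (u - v) ** 2))

Dwgt = pdist(X, metric=weuc)    # condensed vector of length 100*99/2 = 4950
```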

The output `D` is arranged in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1), i.e., the lower-left triangle of the full m-by-m distance matrix in column order. To get the distance between the ith and jth observations (i < j), either use the formula `D((i-1)*(m-i/2)+j-i)`, or use the helper function `Z = squareform(D)`, which returns an m-by-m symmetric matrix with the (i,j) entry equal to the distance between observations i and j.
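The linear-index formula can be verified numerically. A Python sketch against SciPy's `pdist`/`squareform`, whose condensed ordering matches the one described above; `condensed_index` is a helper defined here, not a library function:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def condensed_index(m, i, j):
    """1-based position in D of the distance between observations i < j."""
    assert 1 <= i < j <= m
    return int((i - 1) * (m - i / 2) + j - i)

m = 6
rng = np.random.default_rng(0)
X = rng.standard_normal((m, 3))
D = pdist(X)
Z = squareform(D)

# Every (i, j) entry of the square form matches the condensed vector.
ok = all(D[condensed_index(m, i, j) - 1] == Z[i - 1, j - 1]
         for i in range(1, m) for j in range(i + 1, m + 1))
```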

### Metrics

Given an m-by-n data matrix `X`, which is treated as m (1-by-n) row vectors ${x}_{1}$, ${x}_{2}$, ..., ${x}_{m}$, the various distances between the vectors ${x}_{s}$ and ${x}_{t}$ are defined as follows:

• Euclidean distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right)\left({x}_{s}-{x}_{t}{\right)}^{\prime }$

Notice that the Euclidean distance is a special case of the Minkowski metric, where `p` = 2.
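The formula can be checked directly; a small Python/NumPy sketch against SciPy's `pdist` (an analogue of the MATLAB function):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 3.0]])

# d^2 = (x_s - x_t)(x_s - x_t)'
diff = X[0] - X[1]
d_manual = np.sqrt(diff @ diff)          # 5.0 for this 3-4-0 difference
d_pdist = pdist(X, 'euclidean')[0]       # same value
```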

• Standardized Euclidean distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){V}^{-1}\left({x}_{s}-{x}_{t}{\right)}^{\prime }$

where `V` is the n-by-n diagonal matrix whose jth diagonal element is ${S\left(j\right)}^{2}$, where `S` is the vector of standard deviations.
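Equivalently, each squared coordinate difference is divided by the corresponding variance. A Python sketch; note that SciPy's `'seuclidean'` takes the variances through its `V` keyword:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
S = np.std(X, axis=0, ddof=1)            # sample standard deviations

# d^2 = sum_j (x_sj - x_tj)^2 / S(j)^2
diff = X[0] - X[1]
d_manual = np.sqrt(np.sum(diff ** 2 / S ** 2))
d_pdist = pdist(X, 'seuclidean', V=S ** 2)[0]   # V holds the variances
```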

• Mahalanobis distance

${d}_{st}^{2}=\left({x}_{s}-{x}_{t}\right){C}^{-1}\left({x}_{s}-{x}_{t}{\right)}^{\prime }$

where `C` is the covariance matrix.
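A Python sketch of the same quadratic form; SciPy's `'mahalanobis'` metric expects the inverse covariance through its `VI` keyword:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
C = np.cov(X, rowvar=False)              # sample covariance matrix
Cinv = np.linalg.inv(C)

# d^2 = (x_s - x_t) C^{-1} (x_s - x_t)'
diff = X[0] - X[1]
d_manual = np.sqrt(diff @ Cinv @ diff)
d_pdist = pdist(X, 'mahalanobis', VI=Cinv)[0]
```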

• City block metric

${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{x}_{tj}|$

Notice that the city block distance is a special case of the Minkowski metric, where `p` = 1.

• Minkowski metric

${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{x}_{tj}|}^{p}}$

Notice that for the special case of `p` = 1, the Minkowski metric gives the city block metric; for `p` = 2, it gives the Euclidean distance; and for `p` = ∞, it gives the Chebychev distance.
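These special cases can be confirmed numerically with SciPy's `pdist` (which spells the last metric `'chebyshev'`):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))

# p = 1 recovers city block, p = 2 Euclidean, p = inf Chebychev.
same_city = np.allclose(pdist(X, 'minkowski', p=1), pdist(X, 'cityblock'))
same_euc  = np.allclose(pdist(X, 'minkowski', p=2), pdist(X, 'euclidean'))
same_cheb = np.allclose(pdist(X, 'minkowski', p=np.inf), pdist(X, 'chebyshev'))
```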

• Chebychev distance

${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{x}_{tj}|\right\}$

Notice that the Chebychev distance is a special case of the Minkowski metric, where `p` = ∞.

• Cosine distance

${d}_{st}=1-\frac{{x}_{s}{{x}^{\prime }}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime }}_{s}\right)\left({x}_{t}{{x}^{\prime }}_{t}\right)}}$
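A Python sketch of this formula for two vectors separated by a 45° angle, checked against SciPy's `'cosine'` metric:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 0.0],
              [1.0, 1.0]])

# d = 1 - cos(angle) between the rows treated as vectors.
num = X[0] @ X[1]
den = np.sqrt((X[0] @ X[0]) * (X[1] @ X[1]))
d_manual = 1 - num / den                 # 1 - 1/sqrt(2)
d_pdist = pdist(X, 'cosine')[0]
```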

• Correlation distance

${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({x}_{t}-{\overline{x}}_{t}\right){\left({x}_{t}-{\overline{x}}_{t}\right)}^{\prime }}}$

where

${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$ and ${\overline{x}}_{t}=\frac{1}{n}\sum _{j}{x}_{tj}$
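Written out, the correlation distance is one minus the normalized inner product of the mean-centered rows; a Python sketch against SciPy's `'correlation'` metric:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10))

# One minus the sample correlation of the two centered rows.
xs = X[0] - X[0].mean()
xt = X[1] - X[1].mean()
d_manual = 1 - (xs @ xt) / (np.sqrt(xs @ xs) * np.sqrt(xt @ xt))
d_pdist = pdist(X, 'correlation')[0]
```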

• Hamming distance

${d}_{st}=\#\left({x}_{sj}\ne {x}_{tj}\right)/n$

• Jaccard distance

${d}_{st}=\frac{\#\left[\left({x}_{sj}\ne {x}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right)\right]}{\#\left[\left({x}_{sj}\ne 0\right)\cup \left({x}_{tj}\ne 0\right)\right]}$
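Both counting-based distances can be worked through on a small boolean example. The two rows below differ in 2 of 5 coordinates (Hamming), and in 2 of the 4 coordinates where at least one row is nonzero (Jaccard); SciPy's metrics of the same names confirm this:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1, 0, 1, 1, 0],
              [1, 1, 0, 1, 0]], dtype=bool)

d_ham = pdist(X, 'hamming')[0]   # 2 differing of 5 coordinates -> 0.4
d_jac = pdist(X, 'jaccard')[0]   # 2 differing of 4 nonzero positions -> 0.5
```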

• Spearman distance

${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}}$

where

• ${r}_{sj}$ is the rank of ${x}_{sj}$ taken over ${x}_{1j},{x}_{2j},...,{x}_{mj}$, as computed by `tiedrank`

• ${r}_{s}$ and ${r}_{t}$ are the coordinate-wise rank vectors of ${x}_{s}$ and ${x}_{t}$, i.e., ${r}_{s}=\left({r}_{s1},{r}_{s2},...,{r}_{sn}\right)$

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$
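SciPy's `pdist` has no built-in `'spearman'` metric, but the definition above can be reproduced by ranking each column (with `scipy.stats.rankdata` as a `tiedrank` analogue) and then taking the correlation distance of the rank vectors; a sketch under that assumption:

```python
import numpy as np
from scipy.stats import rankdata
from scipy.spatial.distance import pdist

X = np.array([[1.0, 5.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 2.0, 1.0],
              [4.0, 3.0, 4.0]])

# Rank each column over the m observations (ties get average ranks),
# then take one minus the correlation of the resulting rank rows.
R = np.apply_along_axis(rankdata, 0, X)
D_spear = pdist(R, 'correlation')   # Spearman distance between observations
```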

## Examples

Generate random data, compute the unweighted Euclidean distance, and then compute weighted distances in two different ways:

```
% Compute the ordinary Euclidean distance.
X = randn(100, 5);
D = pdist(X,'euclidean');       % Euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation.
Dstd = pdist(X,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently.
Wgts = [.1 .3 .3 .2 .1];        % coordinate weights
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist(X, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```