corr

Linear or rank correlation

Syntax

rho = corr(X)

rho = corr(X,Y)

[rho,pval] = corr(X,Y)

[rho,pval] = corr(___,Name,Value)

Description

rho = corr(X) returns a matrix of the pairwise linear correlation coefficient between each pair of columns in the input matrix X.

rho = corr(X,Y) returns a matrix of the pairwise correlation coefficient between each pair of columns in the input matrices X and Y.

[rho,pval] = corr(X,Y) also returns pval, a matrix of p-values for testing the hypothesis of no correlation against the alternative hypothesis of a nonzero correlation.

[rho,pval] = corr(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in the previous syntaxes. For example, 'Type','Kendall' specifies computing Kendall's tau correlation coefficient.

Examples

Find Correlation Between Matrix Columns

Generate a matrix that contains one missing value.

rng(0,"twister"); % For reproducibility
X = rand(5, 5);
indices = randperm(numel(X), 1);
X(indices) = NaN

X = 5×5

    0.8147    0.0975    0.1576    0.1419    0.6557
    0.9058    0.2785    0.9706    0.4218    0.0357
    0.1270    0.5469    0.9572    0.9157    0.8491
    0.9134    0.9575    0.4854       NaN    0.9340
    0.6324    0.9649    0.8003    0.9595    0.6787

Calculate the matrix of the pairwise linear correlation coefficients.

rho = corr(X)

rho = 5×5

    1.0000   -0.0875   -0.4566       NaN   -0.3958
   -0.0875    1.0000    0.2336       NaN    0.5303
   -0.4566    0.2336    1.0000       NaN   -0.3636
       NaN       NaN       NaN       NaN       NaN
   -0.3958    0.5303   -0.3636       NaN    1.0000

Each entry rho(a,b) is the pairwise linear correlation coefficient between column a and column b in X. By default, rho(a,b) is NaN if a or b contains a missing value.

Calculate each element rho(a,b) of the coefficient matrix using rows with no missing values in column a or b.

rho2 = corr(X,Rows="pairwise")

rho2 = 5×5

    1.0000   -0.0875   -0.4566   -0.7054   -0.3958
   -0.0875    1.0000    0.2336    0.9089    0.5303
   -0.4566    0.2336    1.0000    0.6948   -0.3636
   -0.7054    0.9089    0.6948    1.0000    0.4338
   -0.3958    0.5303   -0.3636    0.4338    1.0000

The software uses only the first, second, third, and fifth rows of X to calculate the coefficients in the fourth row and fourth column.

Calculate each element rho(a,b) of the coefficient matrix using only the rows of X that have no missing values.

rho = corr(X,Rows="complete")

rho = 5×5

    1.0000   -0.4044   -0.3842   -0.7054   -0.7317
   -0.4044    1.0000    0.5057    0.9089    0.3614
   -0.3842    0.5057    1.0000    0.6948   -0.2608
   -0.7054    0.9089    0.6948    1.0000    0.4338
   -0.7317    0.3614   -0.2608    0.4338    1.0000

The software uses only the first, second, third, and fifth rows of X to calculate the coefficient matrix.
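The difference between the two row-selection strategies can also be checked by hand. The following is a minimal sketch, not part of the original example: the matrix `A` and all variable names are made up for illustration, and corrcoef is used only as an independent reference on rows selected manually.

```matlab
% Hypothetical data: 6 observations, 3 variables, one missing value in column 3.
A = [1.0 2.1 0.5
     2.0 1.9 1.7
     3.0 3.2 NaN
     4.0 2.8 2.9
     5.0 4.1 3.1
     6.0 3.9 4.4];

% Rows="pairwise": entry (1,3) uses the rows where columns 1 and 3 are both present.
rhoPairwise = corr(A,Rows="pairwise");
keep13 = ~isnan(A(:,1)) & ~isnan(A(:,3));
r13 = corrcoef(A(keep13,1),A(keep13,3));
errPairwise = abs(rhoPairwise(1,3) - r13(1,2));

% Rows="complete": every entry uses only the rows with no NaN in any column.
rhoComplete = corr(A,Rows="complete");
keepAll = ~any(isnan(A),2);
rAll = corrcoef(A(keepAll,:));
errComplete = max(abs(rhoComplete - rAll),[],"all");

% Entry (1,2) involves no missing data, so "pairwise" uses all six rows for it,
% whereas "complete" still drops row 3.
r12 = corrcoef(A(:,1),A(:,2));
errPairwise12 = abs(rhoPairwise(1,2) - r12(1,2));
```

Both error values should be at round-off level, which is consistent with the row selections described above.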

Find Correlation Between Two Matrices

Find the correlation between two matrices and compare it to the correlation between two column vectors.

Generate sample data.

rng('default')
X = randn(30,4);
Y = randn(30,4);

Introduce correlation between column two of the matrix X and column four of the matrix Y.

Y(:,4) = Y(:,4)+X(:,2);

Calculate the correlation between columns of X and Y.

[rho,pval] = corr(X,Y)

rho = 4×4

   -0.1686   -0.0363    0.2278    0.3245
    0.3022    0.0332   -0.0866    0.7653
   -0.3632   -0.0987   -0.0200   -0.3693
   -0.1365   -0.1804    0.0853    0.0279

pval = 4×4

    0.3731    0.8489    0.2260    0.0802
    0.1045    0.8619    0.6491    0.0000
    0.0485    0.6039    0.9166    0.0446
    0.4721    0.3400    0.6539    0.8837

As expected, the correlation coefficient between column two of X and column four of Y, rho(2,4), is the highest, and it represents a high positive correlation between the two columns. The corresponding p-value, pval(2,4), is zero to the four digits shown. Because the p-value is less than the significance level of 0.05, it indicates rejection of the hypothesis that no correlation exists between the two columns.

Calculate the correlation between X and Y using corrcoef.

[r,p] = corrcoef(X,Y)

r = 2×2

    1.0000   -0.0329
   -0.0329    1.0000

p = 2×2

    1.0000    0.7213
    0.7213    1.0000

The MATLAB® function corrcoef, unlike the corr function, converts the input matrices X and Y into column vectors, X(:) and Y(:), before computing the correlation between them. Therefore, the correlation introduced between column two of matrix X and column four of matrix Y is no longer apparent, because those two columns occupy different sections of the converted column vectors.

The value of the off-diagonal elements of r, which represents the correlation coefficient between X and Y, is low. This value indicates little to no correlation between X and Y. Likewise, the value of the off-diagonal elements of p, which represents the p-value, is much higher than the significance level of 0.05. This value indicates that not enough evidence exists to reject the hypothesis of no correlation between X and Y.
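The vectorizing behavior of corrcoef described above can be confirmed directly. The sketch below regenerates the same sample data; the variable names `rMatrix`, `rVector`, and `rCols` are illustrative and do not appear in the original example.

```matlab
rng('default')
X = randn(30,4);
Y = randn(30,4);
Y(:,4) = Y(:,4) + X(:,2);

% corrcoef treats each matrix as one long vector of 120 elements.
rMatrix = corrcoef(X,Y);
rVector = corrcoef(X(:),Y(:));
sameAsVectorized = abs(rMatrix(1,2) - rVector(1,2)) < 1e-12;

% To recover a single column pair with corrcoef, pass the two columns explicitly.
rho = corr(X,Y);
rCols = corrcoef(X(:,2),Y(:,4));
matchesCorr = abs(rho(2,4) - rCols(1,2)) < 1e-12;
```

Passing the two columns explicitly recovers the same coefficient that corr reports in rho(2,4).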

Test Alternative Hypotheses for Correlation

Test alternative hypotheses for positive, negative, and nonzero correlation between the columns of two matrices. Compare values of the correlation coefficient and p-value in each case.

Generate sample data.

rng('default')
X = randn(50,4);
Y = randn(50,4);

Introduce positive correlation between column one of the matrix X and column four of the matrix Y.

Y(:,4) = Y(:,4)+0.7*X(:,1);

Introduce negative correlation between column two of X and column two of Y.

Y(:,2) = Y(:,2)-2*X(:,2);

Test the alternative hypothesis that the correlation is greater than zero.

[rho,pval] = corr(X,Y,'Tail','right')

rho = 4×4

    0.0627   -0.1438   -0.0035    0.7060
   -0.1197   -0.8600   -0.0440    0.1984
   -0.1119    0.2210   -0.3433    0.1070
   -0.3526   -0.2224    0.1023    0.0374

pval = 4×4

    0.3327    0.8405    0.5097    0.0000
    0.7962    1.0000    0.6192    0.0836
    0.7803    0.0615    0.9927    0.2298
    0.9940    0.9397    0.2398    0.3982

As expected, the correlation coefficient between column one of X and column four of Y, rho(1,4), has the highest positive value, representing a high positive correlation between the two columns. The corresponding p-value, pval(1,4), is zero to the four digits shown, which is lower than the significance level of 0.05. These results indicate rejection of the null hypothesis that no correlation exists between the two columns and lead to the conclusion that the correlation is greater than zero.

Test the alternative hypothesis that the correlation is less than zero.

[rho,pval] = corr(X,Y,'Tail','left')

rho = 4×4

    0.0627   -0.1438   -0.0035    0.7060
   -0.1197   -0.8600   -0.0440    0.1984
   -0.1119    0.2210   -0.3433    0.1070
   -0.3526   -0.2224    0.1023    0.0374

pval = 4×4

    0.6673    0.1595    0.4903    1.0000
    0.2038    0.0000    0.3808    0.9164
    0.2197    0.9385    0.0073    0.7702
    0.0060    0.0603    0.7602    0.6018

As expected, the correlation coefficient between column two of X and column two of Y, rho(2,2), is the most negative entry (-0.86), representing a high negative correlation between the two columns. The corresponding p-value, pval(2,2), is zero to the four digits shown, which is lower than the significance level of 0.05. Again, these results indicate rejection of the null hypothesis and lead to the conclusion that the correlation is less than zero.

Test the alternative hypothesis that the correlation is not zero.

[rho,pval] = corr(X,Y)

rho = 4×4

    0.0627   -0.1438   -0.0035    0.7060
   -0.1197   -0.8600   -0.0440    0.1984
   -0.1119    0.2210   -0.3433    0.1070
   -0.3526   -0.2224    0.1023    0.0374

pval = 4×4

    0.6654    0.3190    0.9807    0.0000
    0.4075    0.0000    0.7615    0.1673
    0.4393    0.1231    0.0147    0.4595
    0.0120    0.1206    0.4797    0.7964

The p-values, pval(1,4) and pval(2,2), are both zero to the four digits shown. Because the p-values are lower than the significance level of 0.05, the correlation coefficients rho(1,4) and rho(2,2) are significantly different from zero. Therefore, the null hypothesis is rejected; the correlation is not zero.

Input Arguments

`X` — Input matrix
matrix

Input matrix, specified as an n-by-k matrix. The rows of X correspond to observations, and the columns correspond to variables.

Example: X = randn(10,5)

Data Types: single | double

`Y` — Input matrix
matrix

Input matrix, specified as an n-by-k₂ matrix when X is specified as an n-by-k₁ matrix. The rows of Y correspond to observations, and the columns correspond to variables.

Example: Y = randn(20,7)

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: corr(X,Y,'Type','Kendall','Rows','complete') returns Kendall's tau correlation coefficient using only the rows that contain no missing values.

`Type` — Type of correlation
`'Pearson'` (default) | `'Kendall'` | `'Spearman'`

Type of correlation, specified as the comma-separated pair consisting of 'Type' and one of these values.

Value	Description
`'Pearson'`	Pearson's Linear Correlation Coefficient
`'Kendall'`	Kendall's Tau Coefficient
`'Spearman'`	Spearman's Rho

corr computes the p-values for Pearson's correlation using a Student's t distribution for a transformation of the correlation. This p-value is exact when X and Y come from a normal distribution. corr computes the p-values for Kendall's tau and Spearman's rho using either the exact permutation distributions (for small sample sizes) or large-sample approximations.

Example: 'Type','Spearman'
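The text above does not spell out the transformation used for the Pearson p-value. The following sketch assumes the standard one, a t statistic $t = r\sqrt{(n-2)/(1-r^{2})}$ with n-2 degrees of freedom, and compares the resulting two-sided p-value with the value that corr returns. The data and variable names are illustrative.

```matlab
rng('default')
n = 25;
x = randn(n,1);
y = 0.5*x + randn(n,1);

[r,p] = corr(x,y);                 % Pearson is the default type

% Assumed transformation: t statistic with n-2 degrees of freedom.
t = r*sqrt((n-2)/(1-r^2));
pManual = 2*tcdf(-abs(t),n-2);     % two-sided p-value

pDiff = abs(p - pManual);
```

If the assumption holds, pDiff is at round-off level.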

`Rows` — Rows to use in computation
`"all"` (default) | `"complete"` | `"pairwise"`

Rows to use in computation, specified as one of these values.

Value	Description
`"all"`	Use all rows of the input regardless of missing values (`NaN`s).
`"complete"`	Use only rows of the input with no missing values.
`"pairwise"`	Compute `rho(i,j)` using rows with no missing values in column `i` or `j`.

The "complete" value, unlike the "pairwise" value, always produces a positive definite or positive semidefinite rho. Also, the "complete" value generally uses fewer observations to estimate rho when rows of the input (X or Y) contain missing values.

If Rows="all" (the default), then rho(i,j) and pval(i,j) are NaN if column i or j contains a missing value.

Example: Rows="pairwise"
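As a rough illustration of the trade-off described above, the sketch below counts the observations each option can use and inspects the smallest eigenvalue of each result. The data and the positions of the missing entries are hypothetical.

```matlab
rng('default')
X = randn(12,4);
X([3 17 30 44]) = NaN;     % hypothetical missing entries, one per column

rhoComplete = corr(X,Rows="complete");
rhoPairwise = corr(X,Rows="pairwise");

% Count the observations available to each strategy.
nComplete = sum(~any(isnan(X),2));           % rows with no NaN at all
nPair12 = sum(~any(isnan(X(:,[1 2])),2));    % rows usable for the (1,2) entry

% Symmetrize before calling eig to guard against round-off asymmetry.
minEigComplete = min(eig((rhoComplete + rhoComplete')/2));
minEigPairwise = min(eig((rhoPairwise + rhoPairwise')/2));
```

Here "complete" uses 8 of the 12 rows for every entry, whereas "pairwise" uses 10 rows for the (1,2) entry. The smallest eigenvalue of rhoComplete is nonnegative up to round-off; no such guarantee applies to rhoPairwise, although it can also turn out to be positive semidefinite for a particular data set.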

`Tail` — Alternative hypothesis
`'both'` (default) | `'right'` | `'left'`

Alternative hypothesis, specified as the comma-separated pair consisting of 'Tail' and one of the values in the table. 'Tail' specifies the alternative hypothesis against which to compute p-values for testing the hypothesis of no correlation.

Value	Description
`'both'`	Test the alternative hypothesis that the correlation is not `0`.
`'right'`	Test the alternative hypothesis that the correlation is greater than `0`.
`'left'`	Test the alternative hypothesis that the correlation is less than `0`.

corr computes the p-values for the two-tailed test by doubling the more significant of the two one-tailed p-values.

Example: 'Tail','left'
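The doubling rule for the two-tailed p-value can be checked numerically. This is a small sketch using the default Pearson type; the variable names are illustrative.

```matlab
rng('default')
X = randn(50,4);
Y = randn(50,4);

[~,pRight] = corr(X,Y,'Tail','right');
[~,pLeft]  = corr(X,Y,'Tail','left');
[~,pBoth]  = corr(X,Y);               % 'Tail','both' is the default

% The more significant one-tailed p-value is the smaller of the two.
pDoubled = 2*min(pRight,pLeft);
maxDiff = max(abs(pBoth - pDoubled),[],"all");
```

For the Pearson type, maxDiff should be at round-off level.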

`Weights` — Observation weights
n-by-1 vector of ones (default) | n-by-1 vector of nonnegative scalar values

Observation weights, specified as an n-by-1 vector of nonnegative scalar values, where n is the number of observations. For more information, see Algorithms.

Example: Weights=[300 457 200]

Data Types: single | double

Output Arguments

`rho` — Pairwise linear correlation coefficient
matrix

Pairwise linear correlation coefficient, returned as a matrix.

If you input only a matrix X, rho is a symmetric k-by-k matrix, where k is the number of columns in X. The entry rho(a,b) is the pairwise linear correlation coefficient between column a and column b in X.
If you input matrices X and Y, rho is a k₁-by-k₂ matrix, where k₁ and k₂ are the number of columns in X and Y, respectively. The entry rho(a,b) is the pairwise linear correlation coefficient between column a in X and column b in Y.
If Rows="all" (the default), then rho(i,j) is NaN if column i or j contains a missing value.

`pval` — p-values
matrix

p-values, returned as a matrix. Each element of pval is the p-value for the corresponding element of rho.

pval is NaN when the corresponding element of rho is NaN, or when you specify observation weights using the Weights name-value argument.

If pval(a,b) is small (for example, less than a significance level of 0.05), then the correlation rho(a,b) is significantly different from zero at that level.

More About

Pearson's Linear Correlation Coefficient

Pearson's linear correlation coefficient is the most commonly used linear correlation coefficient. For column X_a in matrix X and column Y_b in matrix Y, having means $\bar{X}_{a} = \sum_{i=1}^{n} X_{a,i} / n$ and $\bar{Y}_{b} = \sum_{j=1}^{n} Y_{b,j} / n$, Pearson's linear correlation coefficient rho(a,b) is defined as:

$\mathrm{rho}(a,b) = \frac{\sum_{i=1}^{n} (X_{a,i} - \bar{X}_{a})(Y_{b,i} - \bar{Y}_{b})}{\left\{ \sum_{i=1}^{n} (X_{a,i} - \bar{X}_{a})^{2} \sum_{j=1}^{n} (Y_{b,j} - \bar{Y}_{b})^{2} \right\}^{1/2}},$

where n is the length of each column.

Values of the correlation coefficient can range from –1 to +1. A value of –1 indicates perfect negative correlation, while a value of +1 indicates perfect positive correlation. A value of 0 indicates no correlation between the columns.
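The definition above translates directly into a few lines of code. The following sketch evaluates the formula for one pair of illustrative column vectors, here named `Xa` and `Yb`, and compares the result with corr.

```matlab
rng('default')
n = 40;
Xa = randn(n,1);
Yb = 0.6*Xa + randn(n,1);

% Deviations from the column means.
dx = Xa - mean(Xa);
dy = Yb - mean(Yb);

% Numerator: sum of cross products. Denominator: square root of the
% product of the two sums of squares.
rhoManual = sum(dx.*dy)/sqrt(sum(dx.^2)*sum(dy.^2));

rhoCorr = corr(Xa,Yb);
rhoDiff = abs(rhoManual - rhoCorr);
```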

Kendall's Tau Coefficient

Kendall's tau is based on counting the number of (i,j) pairs, for i<j, that are concordant—that is, for which $X_{a, i} - X_{a, j}$ and $Y_{b, i} - Y_{b, j}$ have the same sign. The equation for Kendall's tau includes an adjustment for ties in the normalizing constant and is often referred to as tau-b.

For column X_a in matrix X and column Y_b in matrix Y, Kendall's tau coefficient is defined as:

$τ = \frac{2 K}{n (n - 1)},$

where $K = \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} ξ^{*} (X_{a, i}, X_{a, j}, Y_{b, i}, Y_{b, j}),$ and

$ξ^{*}(X_{a,i}, X_{a,j}, Y_{b,i}, Y_{b,j}) = \begin{cases} 1 & \text{if } (X_{a,i} - X_{a,j})(Y_{b,i} - Y_{b,j}) > 0 \\ 0 & \text{if } (X_{a,i} - X_{a,j})(Y_{b,i} - Y_{b,j}) = 0 \\ -1 & \text{if } (X_{a,i} - X_{a,j})(Y_{b,i} - Y_{b,j}) < 0 \end{cases}.$

Values of the correlation coefficient can range from –1 to +1. A value of –1 indicates that one column ranking is the reverse of the other, while a value of +1 indicates that the two rankings are the same. A value of 0 indicates no relationship between the columns.
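The pair-counting definition can be sketched with a double loop. The example below assumes data without ties (continuous random draws), in which case the formula as written should agree with the value that corr returns; with tied values, the tie adjustment mentioned above can make the two differ. The variable names are illustrative.

```matlab
rng('default')
n = 15;
Xa = randn(n,1);          % continuous draws, so ties are not expected
Yb = Xa + randn(n,1);

% K sums +1 for each concordant pair and -1 for each discordant pair (i < j).
K = 0;
for i = 1:n-1
    for j = i+1:n
        K = K + sign((Xa(i) - Xa(j))*(Yb(i) - Yb(j)));
    end
end
tauManual = 2*K/(n*(n-1));

tauCorr = corr(Xa,Yb,'Type','Kendall');
tauDiff = abs(tauManual - tauCorr);
```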

Spearman's Rho

Spearman's rho is equivalent to Pearson's Linear Correlation Coefficient applied to the rankings of the columns X_a and Y_b.

If all the ranks in each column are distinct, the equation simplifies to:

$\mathrm{rho}(a,b) = 1 - \frac{6 \sum d^{2}}{n(n^{2} - 1)},$

where d is the difference between the ranks of the two columns, and n is the length of each column.
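Both characterizations of Spearman's rho can be compared numerically. This sketch uses tiedrank to obtain the ranks and assumes tie-free data so that the simplified formula applies; the data and variable names are illustrative.

```matlab
rng('default')
n = 20;
Xa = randn(n,1);
Yb = Xa.^3 + 0.5*randn(n,1);   % monotone trend plus noise, no ties expected

rx = tiedrank(Xa);
ry = tiedrank(Yb);

rhoSpearman = corr(Xa,Yb,'Type','Spearman');
rhoRanks = corr(rx,ry);                        % Pearson applied to the ranks

d = rx - ry;                                   % rank differences
rhoFormula = 1 - 6*sum(d.^2)/(n*(n^2 - 1));    % valid only when ranks are distinct

diffRanks = abs(rhoSpearman - rhoRanks);
diffFormula = abs(rhoSpearman - rhoFormula);
```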

Tips

The difference between corr(X,Y) and the MATLAB® function corrcoef(X,Y) is that corrcoef(X,Y) returns a matrix of correlation coefficients for two column vectors X and Y. If X and Y are not column vectors, corrcoef(X,Y) converts them to column vectors.

Algorithms

When you specify the Weights name-value argument, corr calculates the Pearson correlation by weighting the variance and covariance calculations. For the Spearman correlation (which is based on ranks), corr calculates weighted ranks as proposed by [5]. To calculate the Kendall correlation (which is based on counts of permutations), corr extends the weighted counts algorithm in [6] to account for ties.
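For the Pearson case, one plausible reading of "weighting the variance and covariance calculations" is sketched below. The helper functions `wmean`, `wcov`, and `wcorr` are hypothetical and written only for illustration; the exact normalization that corr uses internally is not stated here, so the sketch is not guaranteed to match the built-in result. The call with Weights requires R2024a or later.

```matlab
rng('default')
n = 30;
x = randn(n,1);
y = 0.5*x + randn(n,1);
w = rand(n,1);                 % hypothetical nonnegative observation weights

% Hypothetical helpers: weighted mean, weighted covariance, weighted correlation.
wmean = @(v,w) sum(w.*v)/sum(w);
wcov  = @(u,v,w) sum(w.*(u - wmean(u,w)).*(v - wmean(v,w)))/sum(w);
wcorr = @(u,v,w) wcov(u,v,w)/sqrt(wcov(u,u,w)*wcov(v,v,w));

rhoSketch = wcorr(x,y,w);

% With equal weights, the sketch reduces to the ordinary Pearson coefficient.
uniformDiff = abs(wcorr(x,y,ones(n,1)) - corr(x,y));

% Built-in weighted correlation for comparison (R2024a or later).
rhoBuiltIn = corr(x,y,Weights=w);
```

The scale of the weights cancels in the ratio, so multiplying w by a constant leaves rhoSketch unchanged.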

References

[1] Gibbons, J.D. Nonparametric Statistical Inference. 2nd ed. M. Dekker, 1985.

[2] Hollander, M., and D.A. Wolfe. Nonparametric Statistical Methods. Wiley, 1973.

[3] Kendall, M.G. Rank Correlation Methods. Griffin, 1970.

[4] Best, D.J., and D.E. Roberts. "Algorithm AS 89: The Upper Tail Probabilities of Spearman's rho." Applied Statistics, vol. 24, no. 3, 1975, pp. 377–79.

[5] Bailey, Paul, and Ahmad Emad (2023). wCorr: Weighted Correlations. R package version 1.9.7, https://american-institutes-for-research.github.io/wCorr.

[6] Van Doorn, Johnny, et al. "Using the Weighted Kendall Distance to Analyze Rank Data in Psychology." The Quantitative Methods for Psychology, vol. 17, no. 2, June 2021, pp. 154–65.

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

This function supports tall arrays for out-of-memory data with the limitation:

Only the 'Pearson' type is supported.

For more information, see Tall Arrays for Out-of-Memory Data.

Thread-Based Environment
Run code in the background using MATLAB® `backgroundPool` or accelerate code with Parallel Computing Toolbox™ `ThreadPool`.

This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced before R2006a

R2024a: Specify weights for linear or rank correlation

corr allows you to specify weights for a linear or rank correlation by using the Weights name-value argument.
