Identifying Biomolecular Subgroups Using Attractor Metagenes
This example shows workflows for the analysis of gene expression data with the attractor metagene algorithm. Gene expression data is available for many model organisms and disease conditions. This example shows how to use the metafeatures
function to explore biomolecular phenotypes in breast cancer.
The Cancer Genome Atlas Data
The Cancer Genome Atlas (TCGA) includes several kinds of data across multiple cancer indications. TCGA includes measurements of gene expression, protein expression, clinical outcomes, and more. In this example, you explore breast cancer gene expression.
Researchers collected tumor samples, and used Agilent G4502A microarrays to measure their gene expression. In this example you use the Level-3 expression data, which has been post-processed from the original measurements into the expression calls. Data was retrieved May 20, 2014.
Load the data into MATLAB®. The MAT-file TCGA_Breast_Gene_Expression.mat
contains gene expression data of 17814 genes for 590 different patients. The expression data is stored in the variable geneExpression
. The gene names are stored in the variable geneNames
.
load TCGA_Breast_Gene_Expression
To see for the organization of the data, check number of genes and samples in this data set.
size(geneExpression)
ans = 1×2
17814 590
geneNames
is a cell array of the gene names. You can access the entries using MATLAB cell array indexing:
geneNames{655}
ans = 'EGFR'
This cell array indicates that the 655th row of the variable geneExpression
contains expression measurements for the gene expression of Epidermal Growth Factor Receptor (EGFR).
Attractor Metagene Algorithm
The attractor metagene algorithm was developed as part of the DREAM 8 challenge to develop prognostic biomarkers for breast cancer survival. The attractor metagene approach discovers and quantifies underlying biomolecular events. These events reduce the dimensionality of the gene expression data, and also allow for subtype classification and investigation of regulatory machinery [1].
A metagene is defined as any weighted sum of gene expression. Suppose you have a collection of co-expressed genes. You can create a metagene by averaging the expression levels of the genes in the collection.
There is the potential to refine our understanding of the gene expression captured in this metagene. Suppose you create a set of weights that quantify the similarity between the genes in our collection and the metagene. Genes that are more similar to the metagene receive larger weights, while genes that are less similar receive smaller weights. Using these new weights, you can form a new metagene that is a weighted average of gene expression. The new metagene better captures a biomolecular event that governs some element of gene regulation in the expression data.
This procedure forms the core of the attractor metagene algorithm. Form a metagene using some current estimate of the weights, then update the weights based on a measure of similarity. Attractor metagenes are defined as the attracting fixed points of this iterative process.
The algorithm exists within the broad family of unsupervised machine learning algorithms. Related algorithms include principal component analysis, various clustering algorithms (especially fuzzy c-means), non-negative matrix factorization, and others. The main advantage of the metagene approach is that the results of the algorithm tend to be more clearly linked with a phenotype defined by gene expression.
Concretely, in the ith iteration of the algorithm. You have a vector of weights, , of size 1-by-number of genes. The estimate of the metagene during the th iteration is:
is the number of genes by number of samples gene expression matrix. To update the weights:
is the th element of , is the th row of G, and is a similarity metric. In the metagene attractor algorithm, is defined as:
if the correlation between and is greater than 0. MI is the mutual information between and . The function metafeatures
uses the B-spline estimator of mutual information described in [3].
If, instead, the correlation between and is less than or equal to 0, then:
The weights are all greater than or equal to zero. Because mutual information is scale invariant, you can normalize the weights in whatever way you choose. Here, they are normalized so their sum is 1.
The algorithm is initialized by either random or user-selected weights. It proceeds until the change in between iterations is small, or a prespecified number of iterations is exhausted.
Cleaning the Data
The data has several NaN
values. To check how many, sum over an indicator returned by isnan
.
sum(sum(isnan(geneExpression)))
ans = 1695
Out of the approximately 10 million entries of geneExpression
, there are 1695 missing entries. Before proceeding you will need to deal with these missing entries.
There are several ways to impute these missing values. You can use a simple method called K nearest neighbor imputation supplied by the Bioinformatics Toolbox (TM). K-nearest neighbor imputation works by replacing missing data with the corresponding value from a weighted average of the k
nearest columns to the column with the missing data.
Use k
= 3, and replace the current value of geneExpression
with one that has no NaN
values.
geneExpression = knnimpute(geneExpression,3);
The variable geneExpression
has no NaN
values.
sum(sum(isnan(geneExpression)))
ans = 0
For more information about knnimpute
, see the Bioinformatics Toolbox documentation.
doc knnimpute
Identifying Biomolecular Events Using the Attractor Metagene Algorithm
The function metafeatures
uses the attractor metagene algorithm to identify motifs of gene regulation.
Setup an options structure. In this case, set the display to provide the information about the algorithm at each iteration.
opts = struct('Display','iter');
metafeatures
also allows for specifying start values. You can seed the starting weights to emphasize genes that you are interested in. There are three common drivers of breast cancer, ERBB2 (also called HER2), estrogen, and progestrone.
Set the weight for each of these genes to 1 in three different rows of startValues
. Each row corresponds to initial values for a different replicate. strcmp
compares the genes of interest and the list of genes in the data set. find
returns the index in the list of the gene.
erbb = find(strcmp('ERBB2',geneNames)); estrogen = find(strcmp('ESR1',geneNames)); progestrone = find(strcmp('PGR',geneNames)); startValues = zeros(size(geneExpression,1),3); startValues(erbb,1) = 1; startValues(estrogen,2) = 1; startValues(progestrone,3) = 1;
Call metafeatures
with the imputed data set. The second argument, geneNames
is the list of all the genes in the data set. Supplying the gene names is not required. However, the gene names can allow exploration of the highly ranked genes that are returned by the algorithm to get insights into the biomolecular event described by the metagene.
[meta, weights, genes_sorted] = metafeatures(geneExpression,geneNames,'start',startValues,'options',opts);
Caching self information ... ... done. Took 29.7148 seconds. Caching entropy and binning information... ... done. Took 10.463 seconds. non-zero Found iter diff weights 1 1 1.26e+01 8924 1 2 7.29e+00 8885 1 3 4.22e+00 8796 1 4 2.54e+00 8761 1 5 1.63e+00 8745 1 6 1.14e+00 8720 1 7 8.59e-01 8706 1 8 7.18e-01 8682 1 9 7.04e-01 8687 1 10 6.44e-01 8680 1 11 5.53e-01 8676 1 12 4.56e-01 8664 1 13 3.67e-01 8654 1 14 2.91e-01 8649 1 15 2.30e-01 8642 1 16 1.83e-01 8636 1 17 1.46e-01 8634 1 18 1.17e-01 8631 1 19 9.45e-02 8632 1 20 7.65e-02 8634 1 21 6.22e-02 8633 1 22 5.06e-02 8631 1 23 4.13e-02 8635 1 24 3.38e-02 8639 1 25 2.76e-02 8636 1 26 2.26e-02 8633 1 27 1.85e-02 8633 1 28 1.51e-02 8635 1 29 1.24e-02 8635 1 30 1.02e-02 8634 1 31 8.35e-03 8633 1 32 6.85e-03 8633 1 33 5.57e-03 8633 1 34 4.59e-03 8631 1 35 3.78e-03 8631 1 36 3.07e-03 8632 1 37 2.53e-03 8632 1 38 2.06e-03 8632 1 39 1.70e-03 8632 1 40 1.40e-03 8632 1 41 1.15e-03 8632 1 42 9.24e-04 8632 1 43 7.70e-04 8632 1 44 6.21e-04 8632 1 45 5.20e-04 8632 1 46 4.43e-04 8632 1 47 3.49e-04 8632 1 48 2.97e-04 8632 1 49 2.36e-04 8632 1 50 1.93e-04 8632 1 51 1.56e-04 8632 1 52 1.42e-04 8632 1 53 8.98e-05 8632 1 54 9.72e-05 8632 1 55 5.37e-05 8632 1 56 7.47e-05 8632 1 57 5.17e-05 8632 1 58 4.81e-05 8632 1 59 2.85e-05 8632 1 60 1.97e-05 8632 1 61 3.05e-05 8632 1 62 1.41e-05 8632 1 63 1.02e-05 8632 1 64 7.89e-06 8632 1 65 9.34e-06 8632 1 66 2.07e-05 8632 1 67 1.52e-05 8632 1 68 2.26e-05 8632 1 69 1.55e-05 8632 1 70 2.24e-05 8632 1 71 1.75e-05 8632 1 72 2.01e-05 8632 1 73 6.47e-06 8632 1 74 1.62e-05 8632 1 75 2.23e-05 8632 1 76 1.93e-05 8632 1 77 1.71e-05 8632 1 78 6.94e-06 8632 1 79 3.21e-06 8632 1 80 1.58e-05 8632 1 81 2.02e-05 8632 1 82 1.99e-05 8632 1 83 2.12e-05 8632 1 84 1.79e-05 8632 1 85 1.60e-05 8632 1 86 1.78e-05 8632 1 87 1.87e-05 8632 1 88 1.66e-05 8632 1 89 5.98e-06 8632 1 90 1.26e-05 8632 1 91 2.14e-05 8632 1 92 1.82e-05 8632 1 93 6.97e-06 8632 1 94 1.04e-05 8632 1 95 2.13e-05 8632 1 96 6.39e-06 8632 1 97 1.75e-05 8632 1 98 2.37e-05 8632 1 99 2.01e-05 8632 1 100 1.98e-05 8632
Warning: 'Maximum iterations exceeded, terminating early.'
2 1 1.93e+01 9893 2 2 6.04e+00 9885 2 3 3.80e+00 9883 2 4 2.53e+00 9886 2 5 1.73e+00 9881 2 6 1.13e+00 9873 2 7 7.19e-01 9869 2 8 4.63e-01 9866 2 9 3.08e-01 9870 2 10 2.13e-01 9874 2 11 1.54e-01 9872 2 12 1.15e-01 9874 2 13 8.72e-02 9874 2 14 6.68e-02 9874 2 15 5.14e-02 9874 2 16 3.97e-02 9875 2 17 3.07e-02 9875 2 18 2.37e-02 9873 2 19 1.84e-02 9871 2 20 1.42e-02 9871 2 21 1.10e-02 9871 2 22 8.54e-03 9872 2 23 6.62e-03 9872 2 24 5.05e-03 9872 2 25 4.01e-03 9872 2 26 3.09e-03 9872 2 27 2.38e-03 9872 2 28 1.85e-03 9872 2 29 1.43e-03 9872 2 30 1.09e-03 9872 2 31 8.46e-04 9872 2 32 6.73e-04 9872 2 33 5.10e-04 9872 2 34 3.81e-04 9872 2 35 2.98e-04 9872 2 36 2.46e-04 9872 2 37 1.51e-04 9872 2 38 1.63e-04 9872 2 39 1.15e-04 9872 2 40 7.11e-05 9872 2 41 1.18e-04 9872 2 42 7.28e-05 9872 2 43 1.89e-05 9872 2 44 4.24e-05 9872 2 45 1.60e-05 9872 2 46 6.75e-06 9872 2 47 4.81e-05 9872 2 48 2.47e-05 9872 2 49 1.04e-05 9872 2 50 7.46e-06 9872 2 51 9.31e-06 9872 2 52 5.25e-06 9872 2 53 3.89e-05 9872 2 54 9.38e-06 9872 2 55 3.33e-05 9872 2 56 1.48e-05 9872 2 57 2.45e-05 9872 2 58 2.58e-05 9872 2 59 1.00e-05 9872 2 60 1.86e-05 9872 2 61 5.87e-05 9872 2 62 2.97e-05 9872 2 63 1.07e-05 9872 2 64 8.84e-06 9872 2 65 8.29e-06 9872 2 66 1.58e-05 9872 2 67 1.48e-05 9872 2 68 5.00e-06 9872 2 69 2.74e-05 9872 2 70 1.20e-05 9872 2 71 2.91e-05 9872 2 72 9.45e-06 9872 2 73 1.75e-05 9872 2 74 1.56e-05 9872 2 75 6.56e-06 9872 2 76 1.79e-05 9872 2 77 2.67e-05 9872 2 78 5.55e-05 9872 2 79 2.55e-05 9872 2 80 1.03e-05 9872 2 81 2.74e-05 9872 2 82 2.04e-05 9872 2 83 1.00e-05 9872 2 84 1.11e-05 9872 2 85 9.83e-06 9872 2 86 2.71e-05 9872 2 87 1.42e-05 9872 2 88 1.28e-05 9872 2 89 2.24e-05 9872 2 90 4.58e-05 9872 2 91 3.36e-05 9872 2 92 9.74e-06 9872 2 93 1.06e-05 9872 2 94 1.50e-05 9872 2 95 5.05e-05 9872 2 96 1.12e-05 9872 2 97 2.52e-05 9872 2 98 9.77e-06 9872 2 99 6.10e-06 9872 2 100 2.97e-05 9872
Warning: 'Maximum iterations exceeded, terminating early.'
3 1 3.75e+00 9963 3 2 1.08e+00 9966 3 3 4.29e-01 9959 3 4 1.87e-01 9961 3 5 8.45e-02 9958 3 6 3.88e-02 9957 3 7 1.80e-02 9956 3 8 8.36e-03 9956 3 9 3.89e-03 9956 3 10 1.78e-03 9956 3 11 8.68e-04 9956 3 12 3.96e-04 9956 3 13 1.89e-04 9956 3 14 8.92e-05 9956 3 15 4.25e-05 9956 3 16 1.16e-05 9956 3 17 1.57e-05 9956 3 18 1.67e-05 9956 3 19 1.59e-05 9956 3 20 1.07e-05 9956 3 21 9.21e-06 9956 3 22 1.59e-05 9956 3 23 6.23e-06 9956 3 24 8.68e-06 9956 3 25 1.56e-05 9956 3 26 1.55e-05 9956 3 27 9.65e-06 9956 3 28 9.74e-06 9956 3 29 9.75e-06 9956 3 30 9.75e-06 9956 3 31 9.84e-06 9956 3 32 1.49e-05 9956 3 33 1.05e-05 9956 3 34 1.43e-05 9956 3 35 2.14e-05 9956 3 36 6.64e-06 9956 3 37 1.54e-06 9956 3 38 2.23e-06 9956 3 39 2.98e-06 9956 3 40 8.89e-06 9956 3 41 1.60e-05 9956 3 42 1.06e-05 9956 3 43 9.08e-06 9956 3 44 1.60e-05 9956 3 45 6.74e-06 9956 3 46 8.71e-06 9956 3 47 9.45e-06 9956 3 48 1.48e-05 9956 3 49 1.05e-05 9956 3 50 1.43e-05 9956 3 51 2.15e-05 9956 3 52 6.64e-06 9956 3 53 1.41e-06 9956 3 54 1.98e-06 9956 3 55 2.58e-06 9956 3 56 9.07e-06 9956 3 57 6.54e-06 9956 3 58 5.44e-06 9956 3 59 4.36e-06 9956 3 60 8.43e-06 9956 3 61 1.08e-05 9956 3 62 9.73e-06 9956 3 63 9.72e-06 9956 3 64 9.72e-06 9956 3 65 9.75e-06 9956 3 66 9.78e-06 9956 3 67 9.82e-06 9956 3 68 1.50e-05 9956 3 69 1.02e-05 9956 3 70 1.34e-05 9956 3 71 2.08e-05 9956 3 72 1.30e-05 9956 3 73 2.11e-05 9956 3 74 1.56e-05 9956 3 75 9.45e-06 9956 3 76 1.48e-05 9956 3 77 1.11e-05 9956 3 78 8.97e-06 9956 3 79 1.31e-05 9956 3 80 2.19e-05 9956 3 81 8.99e-06 9956 3 82 1.60e-05 9956 3 83 7.51e-06 9956 3 84 6.78e-06 9956 3 85 7.51e-06 9956 3 86 1.10e-05 9956 3 87 1.39e-05 9956 3 88 6.38e-06 9956 3 89 6.05e-06 9956 3 90 4.66e-06 9956 3 91 7.28e-06 9956 3 92 7.98e-06 9956 3 93 1.15e-05 9956 3 94 8.72e-06 9956 3 95 1.56e-05 9956 3 96 1.82e-05 9956 3 97 1.23e-05 9956 3 98 6.69e-06 9956 3 99 1.63e-06 9956 3 100 1.15e-06 9956
Warning: 'Maximum iterations exceeded, terminating early.'
The variable meta
has the value of the three metagenes discovered for each sample. You can plot the three metagenes to gain insight into the nature of gene regulation across different phenotypes of breast cancer.
plot3(meta(1,:),meta(2,:),meta(3,:),'o') xlabel('ERBB2 metagene') ylabel('Estrogen metagene') zlabel('Progestrone metagene')
In the plot you can observe a few things.
In the plot, there is a group of points bunched together with low values for all three metagenes. Based on mRNA levels, the expectation is that points are associated with tumor samples that are triple-negative or basal type.
There is also a group of points that have high estrogen receptor metagene expression. This group spans both high and low progestrone metagene expression. There are no points with high progestrone metagene expression and low estrogen metagene expression. This finding is consistent with the observation that ER-/PR+ breast cancers are extremely rare [2].
The remaining points are the ERBB2 positive cancers. They have less representation in this data set than the hormone-driven and triple-negative cancers. There are also no firmly established relationships between hormone receptor expression and ERBB2 status.
To develop a better understanding of the gene regulation captured by the metagenes, take a closer look at the metagene discovered by initializing the estrogen receptor to have weight 1. You can list the top ten genes contributing to the metagene for the 11th metagene discovered.
genes_sorted(1:10,2)
ans = 10x1 cell
{'AGR3' }
{'ESR1' }
{'CA12' }
{'AGR2' }
{'MLPH' }
{'FOXA1'}
{'THSD4'}
{'FSIP1'}
{'ANXA9'}
{'XBP1' }
This metagene captures the biomolecular event associated with the transition to estrogen-driven breast cancer. The four, top-ranked, genes listed are:
Anterior Gradient Homolog 3 (AGR3)
Estrogen Receptor 1 (ESR1)
Carbonic anhydrase 12 (CA12)
Anterior Gradient Homolog 2 (AGR2)
Transcriptional changes in each of these genes are implicated in estrogen-driven breast cancer. The three genes other than ESR1 are known to be coexpressed with ESR1. Identification of these genes illustrates the power of the attractor metagene algorithm to link gene expression with phenotypes.
Similar versions of the estrogen metagene and the ERBB2 metagene are described in [1]. The ordering of the gene contributions differs slightly between this analysis and [1] because a different breast cancer data set was used. Variations in the weights are to be expected, but the ordering of the genes by weights are roughly the same. Specifically, genes with the top 10 weights are mostly the same between this version, and the version described in [1]. Similarly, there is significant overlap between the genes with the top 100 weights.
Genes can contribute to multiple metagenes. In this sense, the attractor metagene algorithm is a "soft" clustering technique. In this example, finding metagenes in breast cancer data, there is overlap in the sets of genes that have larger contribution weights to the estrogen and progestrone metagenes.
If a weight is "elevated" when it is larger than .001, then:
elevated_weights = weights>.001;
The column sum of the elevated_weights
is the total number of elevated weights in each of the three metagenes.
sum(elevated_weights)
ans = 1×3
19 96 27
Of the 96 elevated weights for the estrogen metagene, and the 27 for the progestrone metagene, there are 22 elevated weights that are in both sets.
sum(elevated_weights(:,2) & elevated_weights(:,3))
ans = 22
However, there is no overlap between the ERBB2 metagene and the estrogen metagene:
sum(elevated_weights(:,1) & elevated_weights(:,2))
ans = 0
as well as no overlap between the ERBB2 metagene and the progestrone metagene:
sum(elevated_weights(:,1) & elevated_weights(:,3))
ans = 0
The Role of Alpha
In the similarity metric of the algorithm, the parameter alpha controls the degree of nonlinearity. As alpha is increased, the number of metagenes tends to increase. The default alpha is 5, because this value was good for the work in [1], but for different data sets or use cases, you must adjust alpha.
To illustrate the effects of alpha, if alpha is 1 in the breast cancer analysis, then the progesterone and estrogen metagenes are not distinct.
[meta_alpha_1, weights_alpha_1, genes_sorted_alpha_1] = ... metafeatures(geneExpression,geneNames,'start',startValues,'alpha',1);
Warning: 'Maximum iterations exceeded, terminating early.'
Warning: 'Maximum iterations exceeded, terminating early.'
Warning: 'Maximum iterations exceeded, terminating early.'
In this case, only two metagenes are returned, despite the fact that we ran the algorithm three times.
size(meta_alpha_1)
ans = 1×2
2 590
This result is because, by default, metafeatures
returns only the unique metagenes. The initialization with the weight for ESR1 set to 1, and the initialization with the weight for PGR set to 1, both converge to metagenes that are effectively the same.
References
[1] Cheng, Wei-Yi, Tai-Hsien Ou Yang, and Dimitris Anastassiou. "Biomolecular events in cancer revealed by attractor metagenes." PLoS computational biology 9.2 (2013): e1002920.
[2] Hefti, Marco M., et al."Estrogen receptor negative/progesterone receptor positive breast cancer is not a reproducible subtype." Breast Cancer Research 15.4 (2013): R68.
[3] Daub, Carsten O., et al. "Estimating mutual information using B-spline functions?an improved similarity measure for analysing gene expression data." BMC bioinformatics 5.1 (2004): 118.