cuffnorm

Normalize transcript expression levels

Description

cuffnorm(transcriptsAnnot,alignmentFiles) normalizes transcript expression to FPKM for the samples in alignmentFiles and corrects for differences in library size [1].

cuffnorm requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.

cuffnorm(transcriptsAnnot,alignmentFiles,opt) uses additional options specified by opt.

cuffnorm(transcriptsAnnot,alignmentFiles,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, cuffnorm('gyrAB.gtf',["Myco_1_1.sam", "Myco_2_1.sam"],'NumThreads',5) specifies to use five parallel threads.

[isoform,gene,tss,cds] = cuffnorm(___) returns the names of files containing normalized results using any of the input argument combinations in the previous syntaxes. By default, the function saves all files to the current directory.

Examples

Create a CufflinksOptions object to define cufflinks options, such as the number of parallel threads and the output directory to store the results.

cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";

The SAM files provided for this example contain aligned reads for Mycoplasma pneumoniae from two samples with three replicates each. The reads are simulated 100-bp reads for two genes (gyrA and gyrB) located next to each other on the genome. All the reads are sorted by reference position, as required by cufflinks.

sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",...
        "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];

Assemble the transcriptome from the aligned reads.

[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);

gtfs is a list of GTF files that contain assembled isoforms.

Compare the assembled isoforms using cuffcompare.

stats = cuffcompare(gtfs);

Merge the assembled transcripts using cuffmerge.

mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');

mergedGTF reports only one transcript. This is because the two genes of interest are located next to each other, and cuffmerge cannot distinguish two distinct genes. To guide cuffmerge, use a reference GTF (gyrAB.gtf) containing information about these two genes. If the file is not located in the same directory that you run cuffmerge from, you must also specify the file path.

gyrAB = which('gyrAB.gtf');
mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',...
                       'ReferenceGTF',gyrAB);

Calculate abundances (expression levels) from aligned reads for each sample.

abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                        'OutputDirectory','./cuffquantOutput1');
abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],...
                        'OutputDirectory','./cuffquantOutput2');

Assess the significance of changes in expression for genes and transcripts between conditions by performing differential testing with cuffdiff. The cuffdiff function operates in two distinct steps: the function first estimates abundances from aligned reads, and then performs the statistical analysis. In some cases (for example, distributing computing load across multiple workers), performing the two steps separately is desirable. After performing the first step with cuffquant, you can then use the binary CXB output file as an input to cuffdiff to perform statistical analysis. Because cuffdiff returns several files, specifying the output directory is recommended.

isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],...
                      'OutputDirectory','./cuffdiffOutput');

Display a table containing the differential expression test results for the two genes gyrB and gyrA.

readtable(isoformDiff,'FileType','text')
ans =

  2×14 table

        test_id            gene_id        gene              locus             sample_1    sample_2    status     value_1       value_2      log2_fold_change_    test_stat    p_value    q_value    significant
    ________________    _____________    ______    _______________________    ________    ________    ______    __________    __________    _________________    _________    _______    _______    ___________

    'TCONS_00000001'    'XLOC_000001'    'gyrB'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     1.0913e+05    4.2228e+05          1.9522           7.8886      5e-05      5e-05        'yes'   
    'TCONS_00000002'    'XLOC_000001'    'gyrA'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     3.5158e+05    1.1546e+05         -1.6064          -7.3811      5e-05      5e-05        'yes'   

You can use cuffnorm to generate normalized expression tables for further analyses. cuffnorm results are useful when you have many samples and you want to cluster them or plot expression levels for genes that are important in your study. Note that you cannot perform differential expression analysis using cuffnorm.

Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.

alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                  ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]};
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
                      'OutputDirectory', './cuffnormOutput');

Display a table containing the normalized expression levels for each transcript.

readtable(isoformNorm,'FileType','text')
ans =

  2×7 table

      tracking_id          q1_0          q1_2          q1_1          q2_1          q2_0          q2_2   
    ________________    __________    __________    __________    __________    __________    __________

    'TCONS_00000001'    1.0913e+05         78628    1.2132e+05    4.3639e+05    4.2228e+05    4.2814e+05
    'TCONS_00000002'    3.5158e+05    3.7458e+05    3.4238e+05    1.0483e+05    1.1546e+05    1.1105e+05

Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
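The naming convention above can be used to group replicate columns by condition. The following is a minimal sketch, assuming the default labels shown in the table (q1, q2) and that a label contains no underscore of its own. The table values are copied from the output above, and the mean_ column names are introduced here only for illustration.

```matlab
% Rebuild the normalized-expression table shown above (values copied from the output).
T = table({'TCONS_00000001';'TCONS_00000002'}, ...
          [1.0913e+05; 3.5158e+05], [78628; 3.7458e+05], [1.2132e+05; 3.4238e+05], ...
          [4.3639e+05; 1.0483e+05], [4.2228e+05; 1.1546e+05], [4.2814e+05; 1.1105e+05], ...
          'VariableNames', {'tracking_id','q1_0','q1_2','q1_1','q2_1','q2_0','q2_2'});

% Split each column name of the form conditionX_N into condition and replicate.
% Assumption: the condition label itself contains no underscore.
names     = string(T.Properties.VariableNames(2:end));
condition = extractBefore(names, "_");
replicate = str2double(extractAfter(names, "_"));

% Average the replicates of each condition (the mean_* names are illustrative).
for c = unique(condition, 'stable')
    T.("mean_" + c) = mean(T{:, names(condition == c)}, 2);
end
```

If you supply your own labels through the Labels argument, the text before the underscore would need to be matched against those labels instead.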

Input Arguments

transcriptsAnnot

Name of the transcript annotation file, specified as a string or character vector. The file can be a GTF or GFF file produced by cufflinks, cuffcompare, or another source of GTF annotations.

Example: "gyrAB.gtf"

Data Types: char | string

alignmentFiles

Names of SAM, BAM, or CXB files containing alignment records for each sample, specified as a string vector or cell array. If you use a cell array, each element must be a string vector or cell array of character vectors specifying alignment files for every replicate of the same sample.

Example: ["Myco_1_1.sam", "Myco_2_1.sam"]

Data Types: char | string | cell
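As a sketch of the CXB case, the abundance files returned by cuffquant in the example above can be passed in place of SAM files. This assumes that the variables mergedGTF2, abundances1, and abundances2 from that example are in the workspace; the output directory name is made up for illustration.

```matlab
% abundances1 and abundances2 are the CXB file names returned by cuffquant
% in the example above, one per sample.
isoformNormCXB = cuffnorm(mergedGTF2, [abundances1, abundances2], ...
                          'OutputDirectory', './cuffnormOutputCXB');
```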

opt

cuffnorm options, specified as a CuffNormOptions object, string, or character vector. The string or character vector must be in the original cuffnorm option syntax (prefixed by one or two dashes) [1].
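The two forms of this argument might look as follows. This is a minimal sketch: it assumes that CuffNormOptions exposes properties with the same names as the name-value arguments described below, and that --num-threads and --output-dir are the native cuffnorm flags corresponding to NumThreads and OutputDirectory. The file names come from the syntax example above, and the directory names are made up.

```matlab
sams = ["Myco_1_1.sam", "Myco_2_1.sam"];

% Form 1: options object (property names assumed to match the name-value arguments).
normOpt = CuffNormOptions;
normOpt.NumThreads = 4;
normOpt.OutputDirectory = "./cuffnormOut1";
isoformNorm1 = cuffnorm("gyrAB.gtf", sams, normOpt);

% Form 2: comparable settings written in the native cuffnorm syntax.
isoformNorm2 = cuffnorm("gyrAB.gtf", sams, ...
                        "--num-threads 4 --output-dir ./cuffnormOut2");
```

The object form is easier to reuse and inspect across calls, while the string form is convenient when you already have a working cuffnorm command line.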

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: cuffnorm('gyrAB.gtf',["Myco_1_1.sam", "Myco_2_1.sam"],'NumThreads',5)

ExtraCommand

Additional commands, specified as a string or character vector. The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.

Example: 'ExtraCommand','--library-type fr-secondstrand'

Data Types: char | string

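IncludeAll

Flag to include all available options, with default values, when converting to the original options syntax, specified as true or false. The original (native) syntax is prefixed by one or two dashes. By default, the function converts only the specified options. If the value is true, the software converts all available options, with default values for unspecified options, to the original syntax.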

Note

If you set IncludeAll to true, the software converts all available properties, using default values for unspecified properties. The only exception is when the default value of a property is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.

Example: 'IncludeAll',true

Data Types: logical
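One way to see the effect of IncludeAll is to inspect the translated command string. The sketch below assumes that CuffNormOptions provides a getCommand function, as the other options objects in the toolbox do, and that IncludeAll is also available as a property of the object; the exact strings returned are not shown because they depend on the installed version.

```matlab
normOpt = CuffNormOptions;
normOpt.NumThreads = 4;

% By default, only the options that were explicitly specified are translated.
cmdSpecified = getCommand(normOpt)

% With IncludeAll set, options with usable default values are translated as well.
normOpt.IncludeAll = true;
cmdAll = getCommand(normOpt)
```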

Labels

Labels for samples, specified as a string, character vector, string vector, or cell array of character vectors. If you provide labels, you must specify the same number of labels as input samples.

Example: 'Labels',["mutant1","mutant2"]

Data Types: char | string | cell
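The requirement that the number of labels match the number of samples can be checked before a run starts. This is a minimal sketch that reuses mergedGTF2 and the SAM files from the example above; the label names and the output directory are hypothetical.

```matlab
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"], ...
                  ["Myco_2_1.sam","Myco_2_2.sam","Myco_2_3.sam"]};
labels = ["control","treated"];   % hypothetical labels, one per sample

% Each cell of alignmentFiles is one sample, so compare against the cell count.
assert(numel(labels) == numel(alignmentFiles), ...
       'Specify one label per input sample.');

isoformNorm = cuffnorm(mergedGTF2, alignmentFiles, 'Labels', labels, ...
                       'OutputDirectory', './cuffnormLabeled');
```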

LibraryNormalizationMethod

Method to normalize the library size, specified as one of the following options:

  • "geometric" — The function scales the FPKM values by the median of the geometric means of fragment counts across all libraries as described in [2].

  • "classic-fpkm" — The function applies no scaling to the FPKM values or fragment counts.

  • "quartile" — The function scales the FPKM values by the ratio of the upper-quartile fragment count of each library to the average upper-quartile value across all libraries.

Example: 'LibraryNormalizationMethod',"classic-fpkm"

Data Types: char | string
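The idea behind the "geometric" option can be illustrated with a small median-of-ratios calculation. This is only a sketch of the general technique on made-up counts, not the code that cuffnorm runs, and the variable names are hypothetical.

```matlab
% Toy fragment counts: rows are genes, columns are libraries (made-up values).
% The second library was sequenced twice as deeply as the first.
counts = [100 200; 50 100; 30 60; 10 20];

% Geometric mean of each gene across libraries.
geoMean = exp(mean(log(counts), 2));

% Scaling factor of each library: median ratio of its counts to the geometric means.
sizeFactors = median(counts ./ geoMean, 1);

% Dividing by the scaling factors puts the libraries on a common scale.
scaled = counts ./ sizeFactors;
```

In this toy case the two scaled columns coincide, because the libraries differ only in sequencing depth.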

NormalizeCompatibleHits

Flag to use only fragments compatible with a reference transcript to calculate FPKM values, specified as true or false.

Example: 'NormalizeCompatibleHits',false

Data Types: logical

NormalizeTotalHits

Flag to include all fragments to calculate FPKM values, specified as true or false. If the value is true, the function includes all fragments, including fragments without a compatible reference.

Example: 'NormalizeTotalHits',true

Data Types: logical

NumThreads

Number of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.

Example: 'NumThreads',4

Data Types: double

OutputDirectory

Directory to store analysis results, specified as a string or character vector.

Example: 'OutputDirectory',"./AnalysisResults/"

Data Types: char | string

OutputFormat

Format for result files, specified as "simple-table" or "cuffdiff".

  • "simple-table" — The output is in tab-delimited table format.

  • "cuffdiff" — The output is in the same form used by cuffdiff.

Example: 'OutputFormat',"cuffdiff"

Data Types: char | string

Seed

Seed for the random number generator, specified as a nonnegative integer. Setting a seed value ensures the reproducibility of the analysis results.

Example: 'Seed',10

Data Types: double

Output Arguments

isoform

Name of a file containing the normalized expression level for each isoform, returned as a string.

The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/isoforms.fpkm_table".

gene

Name of a file containing the normalized expression level for each gene, returned as a string.

The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/genes.fpkm_table".

tss

Name of a file containing the normalized expression level for each transcript start site (TSS), returned as a string.

The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/tss_groups.fpkm_table".

cds

Name of a file containing the normalized expression level for each coding sequence, returned as a string.

The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/cds.fpkm_table".
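All four outputs can be captured in one call and read the same way as the isoform table in the example. This is a minimal sketch that assumes mergedGTF2 and alignmentFiles from the example above are in the workspace; the variable names outDir and geneNorm are introduced here for illustration.

```matlab
outDir = "./cuffnormOutput";   % directory name taken from the example above
[isoform, gene, tss, cds] = cuffnorm(mergedGTF2, alignmentFiles, ...
                                     'OutputDirectory', outDir);

% Each output is a file name that includes the output directory.
[folder, name, ext] = fileparts(gene);

% Read the gene-level table the same way as the isoform-level table.
geneNorm = readtable(gene, 'FileType', 'text');
```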

References

[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.

Version History

Introduced in R2019a