cigar2align

Convert unaligned sequences to aligned sequences using signatures in CIGAR format

Description

Alignment = cigar2align(Seqs,Cigars) converts unaligned sequences in Seqs into Alignment using the information stored in Cigars.

[GapSeq,Indices] = cigar2align(Seqs,Cigars) converts unaligned sequences in Seqs into GapSeq, and also returns Indices, a vector of numeric indices, using the information stored in Cigars. When an alignment has many columns, this syntax uses less memory and is faster.

___ = cigar2align(Seqs,Cigars,Name,Value), for any outputs, specifies additional options using one or more name-value arguments. For example, to have the output display gaps in the reference sequence, use Alignment = cigar2align(Seqs,Cigars,GapsInRef=true).

Examples

Create a cell array of character vectors containing unaligned sequences and a cell array of the corresponding CIGAR-formatted character vectors, which describe how each sequence aligns to the reference sequence ACGTATGC. Then reconstruct the alignment.

Seqs = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'}; % unaligned sequences
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'}; % cigar-formatted
Alignment = cigar2align(Seqs, Cigars)
Alignment = 3x8 char array
    'ACG-ATGC'
    'ACGT-TGC'
    'AGGTAT-C'

Reconstruct the same alignment to display positions in the aligned sequences that correspond to gaps in the reference sequence.

Alignment2 = cigar2align(Seqs,Cigars,GapsInRef=true)
Alignment2 = 3x9 char array
    'ACG-ACTGC'
    'ACGT--TGC'
    'AGGTA-T-C'

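Reconstruct the alignment with offset padding, starting each sequence at position 5 of the reference sequence. The padding adds four blank columns on the left, one for each reference position before the start.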

Alignment3 = cigar2align(Seqs,Cigars,Start=[5 5 5],OffsetPad=true)
Alignment3 = 3x12 char array
    '    ACG-ATGC'
    '    ACGT-TGC'
    '    AGGTAT-C'

Use the two-output syntax to obtain the alignment and indices.

[GapSeq, Indices] = cigar2align(Seqs,Cigars)
GapSeq = 3x1 cell
    {'ACG-ATGC'}
    {'ACGT-TGC'}
    {'AGGTAT-C'}

Indices = 3×1

     1
     1
     1

Input Arguments

Seqs: Unaligned sequences, specified as a cell array of character vectors or as a string vector. Seqs must contain the same number of elements as Cigars.

Data Types: string | cell

Cigars: CIGAR signatures for the sequences, specified as a cell array of valid CIGAR-formatted character vectors or a CIGAR-formatted string vector. Cigars must contain the same number of elements as Seqs.

Data Types: string | cell
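The requirement that Seqs and Cigars correspond element by element can be checked before calling cigar2align. The following is a minimal sketch in base MATLAB, not part of the toolbox. It assumes the standard CIGAR convention that M, I, S, = and X consume bases of the read, while M, D, N, = and X consume positions of the reference. The variable names queryLen, refLen, and lengthsAgree are hypothetical.

```matlab
% Hypothetical pre-check (not part of cigar2align): confirm that each CIGAR
% string accounts for every base of its sequence.
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};

assert(numel(Seqs) == numel(Cigars), ...
    'Seqs and Cigars must have the same number of elements.');

queryLen = zeros(1, numel(Cigars));
refLen   = zeros(1, numel(Cigars));
for k = 1:numel(Cigars)
    % Split the CIGAR string into (count, operation) pairs.
    tok    = regexp(Cigars{k}, '(\d+)([MIDNSHP=X])', 'tokens');
    counts = cellfun(@(t) str2double(t{1}), tok);
    ops    = cellfun(@(t) t{2}, tok);   % 1-by-N char vector of operations

    % Assumed convention: M, I, S, = and X consume read bases;
    % M, D, N, = and X consume reference positions.
    queryLen(k) = sum(counts(ismember(ops, 'MIS=X')));
    refLen(k)   = sum(counts(ismember(ops, 'MDN=X')));
end

% True when every CIGAR string describes exactly as many bases as its read has.
lengthsAgree = isequal(queryLen, cellfun(@numel, Seqs))
```

For the example inputs, each CIGAR string spans the eight positions of the reference ACGTATGC, and the read lengths it implies match the lengths of the sequences in Seqs.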

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: Alignment = cigar2align(Seqs,Cigars,GapsInRef=true)
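The two ways of passing a name-value argument can be compared side by side. This short sketch reuses the inputs from the Examples section; the variable names A1, A2, and sameResult are only illustrative.

```matlab
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};

% R2021a and later: Name=Value syntax.
A1 = cigar2align(Seqs, Cigars, GapsInRef=true);

% Before R2021a (and still accepted afterward): quoted name, comma, value.
A2 = cigar2align(Seqs, Cigars, 'GapsInRef', true);

% Both calls request the same option, so the results should be identical.
sameResult = isequal(A1, A2)
```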

GapsInRef: Indication to display positions in the aligned sequences that correspond to gaps in the reference sequence, specified as false (do not display gaps) or true. If your reference sequence has gaps and you set GapsInRef to false, and then later use Alignment as input to align2cigar, the returned CIGAR-formatted character vectors will not match the original ones.

Example: true

Data Types: logical
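The round trip described above can be tried directly. This is a sketch under two assumptions: that align2cigar takes the alignment followed by a reference sequence with the same number of columns, and that the gapped reference for the 3x9 example alignment is ACGTA-TGC (inferred from that example, not stated in this page). The result is printed rather than assumed.

```matlab
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};

% Reference ACGTATGC with one gap column where reads carry an insertion
% or padding (assumed layout, inferred from the GapsInRef example).
RefGapped = 'ACGTA-TGC';

% Keep the columns that correspond to gaps in the reference.
Aln = cigar2align(Seqs, Cigars, GapsInRef=true);

% Assumption: align2cigar accepts the alignment and then the reference.
CigarsBack = align2cigar(Aln, RefGapped);

% Compare the recovered signatures with the original ones.
roundTripOK = isequal(cellstr(CigarsBack(:)), Cigars(:))
```

Repeating the comparison with GapsInRef=false and the ungapped reference is one way to see the mismatch that the description warns about.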

OffsetPad: Indication to add padding blanks to the left of each aligned sequence, specified as false (do not add padding) or true. When true, the blanks represent the offset of each start position from the first position of the reference sequence. When false, the matrix of aligned sequences starts at the start position of the leftmost aligned read sequence.

Example: true

Data Types: logical
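A small sketch of the effect of padding, reusing the inputs from the Examples section. The padded call matches the one shown there; the unpadded call is added for comparison, and the difference in width is printed rather than assumed.

```matlab
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};
Start  = [5 5 5];   % every read starts at reference position 5

Padded   = cigar2align(Seqs, Cigars, Start=Start, OffsetPad=true);
Unpadded = cigar2align(Seqs, Cigars, Start=Start, OffsetPad=false);

% Number of blank columns that padding adds on the left.
extraColumns = size(Padded, 2) - size(Unpadded, 2)
```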

SoftClipping: Indication to include characters in the aligned read sequences corresponding to soft clipping ends, specified as false (do not include) or true.

Example: true

Data Types: logical
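The following sketch contrasts the two settings on a made-up read. It assumes that the option described above is named SoftClipping; the read and its CIGAR string are invented for illustration, and the outputs are displayed rather than predicted.

```matlab
% Made-up read: two soft-clipped bases followed by eight aligned bases.
Seq   = {'TTACGTATGC'};
Cigar = {'2S8M'};

% Default behavior: soft-clipped ends are left out of the alignment.
AlnDefault = cigar2align(Seq, Cigar)

% Assumption: the name of the option described above is SoftClipping.
AlnClipped = cigar2align(Seq, Cigar, SoftClipping=true)
```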

Start: Reference sequence position at which each aligned sequence starts, specified as a vector of positive integers. By default, each aligned sequence starts at position 1 of the reference sequence.

Data Types: single | double
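Because the default start position is 1 for every sequence, passing a vector of ones explicitly should reproduce the default result. A brief sketch, with illustrative variable names:

```matlab
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};

Default  = cigar2align(Seqs, Cigars);
% One start position per sequence, all at reference position 1.
Explicit = cigar2align(Seqs, Cigars, Start=[1 1 1]);

sameAsDefault = isequal(Default, Explicit)
```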

Output Arguments

Alignment: Aligned sequences, returned as a character array. Each row of Alignment represents one aligned sequence. The number of rows of Alignment equals the number of character vectors in Seqs.

GapSeq: Aligned sequences without any leading or trailing whitespace, returned as a cell array of character vectors. The number of character vectors in GapSeq equals the number of character vectors in Seqs.

Indices: Indices of starting columns in Alignment, returned as a numeric vector. Sequences returned in GapSeq are identical to those in the Alignment output except that those in GapSeq have no leading or trailing whitespace.

Entries in Indices are not necessarily the same as the start positions in the reference sequence for each aligned sequence. This is because either of the following can hold:

  • The reference sequence can be extended to account for insertions.

  • An aligned sequence can have leading soft clippings, padding, or insertion characters.
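The relationship between the two output forms can be made concrete by rebuilding the character matrix from GapSeq and Indices. This is a minimal sketch in base MATLAB, not the toolbox's own code; the names lens, nCols, and Rebuilt are hypothetical.

```matlab
Seqs   = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'};
Cigars = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'};

[GapSeq, Indices] = cigar2align(Seqs, Cigars);
Indices = double(Indices(:));          % ensure plain double arithmetic

% Width of the matrix: the rightmost column reached by any sequence.
lens  = cellfun(@numel, GapSeq(:));
nCols = max(Indices + lens - 1);

% Start from blanks and copy each sequence in at its starting column.
Rebuilt = repmat(' ', numel(GapSeq), nCols);
for k = 1:numel(GapSeq)
    cols = Indices(k) : Indices(k) + lens(k) - 1;
    Rebuilt(k, cols) = GapSeq{k};
end

% Compare with the character array from the one-output syntax.
matchesOneOutput = isequal(Rebuilt, cigar2align(Seqs, Cigars))
```

Keeping GapSeq and Indices separate avoids storing the blank columns, which is why the two-output syntax suits alignments with many columns.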

Algorithms

When cigar2align reconstructs the alignment, it does not display hard clipped positions (H) and, unless you choose to include soft clipping ends, it does not display soft clipped positions (S) either. It also does not treat soft clipped positions as start positions for aligned sequences.
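To illustrate what leaving out clipped positions means for a single read, the sketch below trims soft-clipped ends using only base MATLAB. It is an illustration of the idea, not the function's implementation. The read, its CIGAR string, and the variable names are invented, and it assumes the usual convention that hard-clipped bases are absent from the stored read while soft-clipped bases are present at its ends.

```matlab
% Hypothetical illustration, not the toolbox implementation.
seq   = 'TTACGTATGCGG';
cigar = '2S8M2S';

% Split the CIGAR string into counts and operation letters.
tok    = regexp(cigar, '(\d+)([MIDNSHP=X])', 'tokens');
counts = cellfun(@(t) str2double(t{1}), tok);
ops    = cellfun(@(t) t{2}, tok);

% Hard clips (H) are assumed absent from the stored read, so drop them.
keep   = ops ~= 'H';
counts = counts(keep);
ops    = ops(keep);

% Soft clips (S) can only sit at the two ends of what remains.
lead  = 0;
trail = 0;
if ~isempty(ops) && ops(1) == 'S'
    lead = counts(1);
end
if numel(ops) > 1 && ops(end) == 'S'
    trail = counts(end);
end

% Bases that remain once the soft-clipped ends are removed.
alignedPart = seq(lead+1 : end-trail)
```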

Alternative Functionality

If your CIGAR information is captured in the Signature property of a BioMap object, you can use the getAlignment method to construct the alignment.
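A minimal sketch of this alternative follows. It rests on assumptions that should be checked against the BioMap documentation: that ex1.sam is a sample SAM file available on the MATLAB path with the toolbox, and that getAlignment takes the object followed by start and end positions on the reference. The positions 50 and 60 are arbitrary.

```matlab
% Assumption: ex1.sam is a sample SAM file available on the MATLAB path.
BMObj = BioMap('ex1.sam');

% Assumption: getAlignment takes start and end reference positions and
% builds the alignment from the CIGAR strings stored in the object.
Alignment = getAlignment(BMObj, 50, 60)
```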

Version History

Introduced in R2010b