samread

Read data from Sequence Alignment/Map (SAM) file

Syntax

SAMStruct = samread(File)
[SAMStruct, HeaderStruct] = samread(File)
... = samread(File,'ParameterName',ParameterValue)

Description

SAMStruct = samread(File) reads a SAM-formatted file and returns the data in a MATLAB® array of structures.

[SAMStruct, HeaderStruct] = samread(File) returns the alignment data and the header data in two separate variables.

... = samread(File,'ParameterName',ParameterValue) accepts one or more comma-separated parameter name/value pairs. Specify ParameterName inside single quotes.

Input Arguments

File

Either of the following:

  • String specifying the file name, or the path and file name, of a SAM-formatted file. If you specify only a file name, that file must be on the MATLAB search path or in the current folder.

  • MATLAB string containing the text of a SAM-formatted file.
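
The two input forms can be sketched as follows. This is a minimal illustration, assuming ex1.sam (the sample file included with Bioinformatics Toolbox) is on the MATLAB search path; the variable names are illustrative, and fileread and which are base MATLAB functions used here only to obtain the file text.

```matlab
% Form 1: pass a file name. The file must be on the search path
% or in the current folder.
dataFromFile = samread('ex1.sam');

% Form 2: pass the text of the SAM file itself.
% which returns the full path; fileread returns the contents as text.
samText = fileread(which('ex1.sam'));
dataFromText = samread(samText);
```

Both calls are expected to describe the same alignment records; the second form is mainly useful when the SAM text is already in memory.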

Name-Value Pair Arguments

'Tags'

Controls whether samread reads the optional tags in addition to the first 11 fields of each alignment in the SAM-formatted file. Choices are true (default) or false.

'ReadGroup'

String specifying the ID of the read group whose alignment records you want to read. By default, samread reads records from all read groups.

    Tip   For a list of the read groups (if present), return the header information in the separate HeaderStruct output and view the ReadGroup field in this structure.

'BlockRead'

Scalar or vector that controls the reading of a single sequence entry or a block of sequence entries from a SAM-formatted file containing multiple sequences. Enter a scalar N to read the Nth entry in the file. Enter a 1-by-2 vector [M1, M2] to read the block of entries that starts at entry M1 and ends at entry M2. To read all remaining entries in the file starting at entry M1, enter a positive value for M1 and Inf for M2.
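
The three forms of BlockRead described above can be sketched as follows. This is an illustration rather than a prescribed workflow: it assumes ex1.sam is on the MATLAB search path and contains at least 100 entries, and the variable names are illustrative.

```matlab
% Scalar N: read only the 5th entry
fifthEntry = samread('ex1.sam', 'BlockRead', 5);

% Vector [M1, M2]: read entries 5 through 10 (six entries)
block = samread('ex1.sam', 'BlockRead', [5 10]);

% Vector [M1, Inf]: read from entry 100 to the end of the file,
% skipping the optional tags
remaining = samread('ex1.sam', 'BlockRead', [100 Inf], 'Tags', false);
```

Name-value pairs can be combined in one call, as in the last line, which pairs BlockRead with Tags.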

Output Arguments

SAMStruct

An N-by-1 array of structures containing sequence alignment and mapping information from a SAM-formatted file, where N is the number of alignment records stored in the SAM-formatted file. Each structure contains the following fields.

Field    Description
QueryName

Name of read sequence (if unpaired) or name of sequence pair (if paired).

    Tip   You can use this information to populate the Header property of the BioMap object.

Flag

Integer indicating the bit-wise information that specifies the status of each of 11 flags described by the SAM format specification.

    Tip   You can use the bitget function to determine the status of a specific SAM flag.

ReferenceName

Name of the reference sequence.

Position

Position (one-based offset) of the forward reference sequence where the left-most base of the alignment of the read sequence starts.

MappingQuality

Integer specifying the mapping quality score for the read sequence.

CigarString

CIGAR-formatted string representing how the read sequence aligns with the reference sequence.

MateReferenceName

Name of the reference sequence associated with the mate. If this name is the same as ReferenceName, then this value is =. If there is no mate, then this value is *.

MatePosition

Position (one-based offset) of the forward reference sequence where the left-most base of the alignment of the mate of the read sequence starts.

InsertSize

The number of base positions between the read sequence and its mate, when both are mapped to the same reference sequence. Otherwise, this value is 0.

Sequence

String containing the letter representations of the read sequence. It is the reverse-complement if the read sequence aligns to the reverse strand of the reference sequence.

Quality

String containing the ASCII representation of the per-base quality score for the read sequence. The quality score is reversed if the read sequence aligns to the reverse strand of the reference sequence.

Tags

List of applicable SAM tags and their values.
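
The Flag field can be decoded with bitget without reading a file. The sketch below uses a hypothetical flag value of 99; the bit meanings in the comments follow the SAM format specification, and the variable names are illustrative.

```matlab
% Hypothetical Flag value: 99 = 1 + 2 + 32 + 64
flag = 99;

% bitget counts bits from 1, so SAM flag 0x1 is bit 1, 0x2 is bit 2, and so on
isPaired      = logical(bitget(flag, 1));  % 0x1:  read is paired
isProperPair  = logical(bitget(flag, 2));  % 0x2:  mapped in a proper pair
isUnmapped    = logical(bitget(flag, 3));  % 0x4:  read is unmapped
isReverse     = logical(bitget(flag, 5));  % 0x10: read aligns to the reverse strand
isFirstInPair = logical(bitget(flag, 7));  % 0x40: first read in the pair

% Status of all 11 flags at once, lowest bit first
allBits = bitget(flag, 1:11);
```

For a value taken from samread output, replace the hypothetical flag with the Flag field of one element of SAMStruct.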

HeaderStruct

Structure containing header information for the SAM-formatted file in the following fields.

Field    Description
Header*

Structure containing the file format version, sort order, and group order.

SequenceDictionary*

Structure containing the:

  • Sequence name

  • Sequence length

  • Genome assembly identifier

  • MD5 checksum of sequence

  • URI of sequence

  • Species

ReadGroup*

Structure containing the:

  • Read group identifier

  • Sample

  • Library

  • Description

  • Platform unit

  • Predicted median insert size

  • Sequencing center

  • Date

  • Platform

Program*

Structure containing the:

  • Program name

  • Version

  • Command line

* These structures and their fields appear in the output structure only if they are present in the SAM file. Their contents depend on the information present in the SAM file.
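
Because the starred fields are optional, code that uses HeaderStruct may need to check for them first. The following is a minimal sketch, assuming ex1.sam is on the MATLAB search path. The field name ID for the read group identifier is an assumption, so the code tests for it before use; the variable names are illustrative.

```matlab
[data, header] = samread('ex1.sam');

% List whichever header sections this particular file provides
if isstruct(header)
    presentSections = fieldnames(header);
else
    presentSections = {};
end

% ASSUMPTION: the read group identifier is stored in a field named ID
if isfield(header, 'ReadGroup') && isfield(header.ReadGroup, 'ID')
    groupIds = {header.ReadGroup.ID};
    % Read only the records that belong to the first read group
    groupData = samread('ex1.sam', 'ReadGroup', groupIds{1});
else
    disp('No read group information found in the file header.');
end
```

The isfield checks return false rather than raising an error when a section is missing, so the same code can run on files with or without read groups.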

Examples

Read the header information and the alignment data from the ex1.sam file included with Bioinformatics Toolbox™, and then return the information in two separate variables:

[data, header] = samread('ex1.sam');

Read a block of entries, excluding the tags, from the ex1.sam file, and then return the information in an array of structures:

% Read entries 5 through 10 and do not include the tags
data = samread('ex1.sam', 'BlockRead', [5 10], 'Tags', false);
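
Once the data is in an array of structures, standard MATLAB structure-array syntax can be used to inspect it. The sketch below assumes ex1.sam is on the MATLAB search path; the variable names are illustrative.

```matlab
data = samread('ex1.sam');

% Number of alignment records that were read
numRecords = numel(data);

% Fields of a single record
first = data(1);
fprintf('%s aligns to %s at position %d\n', ...
    first.QueryName, first.ReferenceName, first.Position);

% Collect one field across all records
refNames  = unique({data.ReferenceName});  % distinct reference names
positions = [data.Position];               % one position per record
```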

More About

Tips

  • Use the saminfo function to investigate the size and content of a SAM-formatted file before using the samread function to read the file contents into a MATLAB array of structures.

  • If your SAM-formatted file is too large to read using available memory, try one of the following:

    • Use the BlockRead parameter with the samread function to read a subset of entries.

    • Create a BioIndexedFile object from the SAM-formatted file, then access the entries using methods of the BioIndexedFile class.

  • Use the SAMStruct output argument that samread returns to create a BioMap object, which lets you explore, access, filter, and manipulate all or a subset of the data, before doing subsequent analyses or viewing the data.
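
The tips above can be combined into one workflow. This is a sketch under stated assumptions, not a definitive recipe: it assumes ex1.sam is on the MATLAB search path, writes the BioIndexedFile index to the temporary folder returned by tempdir, and keeps only records from a single reference before building the BioMap object. The variable names are illustrative.

```matlab
samFile = 'ex1.sam';

% 1. Inspect the file before reading it
info = saminfo(samFile, 'NumOfReads', true);
fprintf('%s: %d bytes, %d reads\n', info.Filename, info.FileSize, info.NumReads);

% 2a. Read only a subset of entries to limit memory use
subset = samread(samFile, 'BlockRead', [1 50]);

% 2b. Alternatively, index the file and read entries on demand
bif = BioIndexedFile('SAM', which(samFile), tempdir);
firstTen = read(bif, 1:10);

% 3. Build a BioMap object from the structure array.
% ASSUMPTION: records are restricted to one reference sequence first.
refName = subset(1).ReferenceName;
sameRef = subset(strcmp({subset.ReferenceName}, refName));
bm = BioMap(sameRef);
```

Steps 2a and 2b are alternatives; either keeps the full file out of memory.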

References

[1] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079.
