Azzera filtri
Azzera filtri

How to import a very large nucleotide sequence in Matlab workspace

3 visualizzazioni (ultimi 30 giorni)
Iam using the following commands to access very big sequences.
Though Iam able to read the sequence in Matlab workspace but due to the amt of memory it takes, Matlab exits automatically when some mathematical fn is used on the data.
a= getgenbank ('NC_000913','ToFile', 'Ecoli.txt','FileFormat','FASTA', 'SequenceOnly','true')
b=fastaread('Ecoli.txt')
If I use memory mapping instruction
m=memmapfile('Ecoli1.txt', format,'uint8')
I get the following output
ans =
Filename: 'C:\Users\Shiwani\Documents\MATLAB\Ecoli1.txt'
Writable: false
Offset: 0
Format: 'uint8'
Repeat: Inf
Data: 4772342x1 uint8 array
If I retrieve the sequence using m.Data , it is in ascii format and not as nucleotide sequence(thats because memory mapping requires the data to be in int format and not char)
Iam missing out on some key instruction in between please help

Risposte (1)

per isakson
per isakson il 31 Mag 2012
4772342 bytes is a small file
Matlab should not exit, but issue an "Out of memory error". If Matlab exits there is some other problem.
It is a text file. Do you know how it is encoded, UTF-8 ??
What does
char( m.Data(1:120) )
show?
You can read any text file piece wise with fscanf
.
--- Next step ---
This is not my domain. I have never used the the Bio info toolbox and know little about genomes.
However, I have the toolbox installed (R2012a) and have done some tests.
The function, nt2int, converts a string of characters to a vector of uint8. The key lines in nt2int are
% Copyright 2002-2004 The MathWorks, Inc.
...
% 'a b c d e f g h i j k l m n o p q r s t u v w x y z -'
map = uint8([1 11 2 12 0 0 3 13 0 0 7 0 8 15 0 0 0 5 9 4 4 14 10 15 6 0 16]);
...
nt = lower(nt);
...
seq = map((uint8(nt) + 1) - uint8('a'));
nt2int cannot handle numbers, e.g. "49175990", because seq cannot hold negative numbers
Thus, your file has a header that nt2int cannot handle. I guess that fastaread is supposed to take care of the header, but fails for some reason. I have not checked.
Try this, which converts to characters, strips of the header, and converts to a numeric vector.
str = char( m.Data );
cac = regexp( str, 'complete genome', 'split' );
vector_of_uint8 = nt2int( cac{2} );
or safer
vector_of_uint8 = nt2int( strtrim(cac{2}) );
.
--- On fastaread ---
What does the following lines show?
b.Sequence(1:120)
char( b.Sequence(1:120) )
.
--- getgenbank ---
Try this:
str = getgenbank ('NC_000913','ToFile', 'Ecoli.txt','FileFormat','FASTA', 'SequenceOnly','true');
str =
AGCTTTTCATTCTGACTGCAACGGGCAATA ....
vector_of_uint8 = nt2int( str );
works perfectly on my system
  3 Commenti
per isakson
per isakson il 1 Giu 2012
Above, I've added a few more steps to try. I cannot understand why you get "Out of memory error", but we use different Matlab releases.
Shiwani
Shiwani il 6 Giu 2012
Thanks for your help. I have tried to read the sequence using fastaread only and then saved all the workspace variables into a matfile.
That saves a lot of memory temporarily.
But I am now facing problems at the analysis step.
Iam trying to use 1D continuous wavelets for analysing the curves for DNA sequences.
But the processing of wavelet coefficients keeps on going endlessly for exceptionally large sequences.
Is it possible to analyse the complete genomic signal (nt2int(seq)) using 1D continuous wavelet?

Accedi per commentare.

Categorie

Scopri di più su Genomics and Next Generation Sequencing in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by