How to import a very large nucleotide sequence in Matlab workspace

Question

Shiwani il 31 Mag 2012

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/39961-how-to-import-a-very-large-nucleotide-sequence-in-matlab-workspace

Iam using the following commands to access very big sequences.

Though Iam able to read the sequence in Matlab workspace but due to the amt of memory it takes, Matlab exits automatically when some mathematical fn is used on the data.

a= getgenbank ('NC_000913','ToFile', 'Ecoli.txt','FileFormat','FASTA', 'SequenceOnly','true')
b=fastaread('Ecoli.txt')

If I use memory mapping instruction

m=memmapfile('Ecoli1.txt', format,'uint8')

I get the following output

ans =

    Filename: 'C:\Users\Shiwani\Documents\MATLAB\Ecoli1.txt'
    Writable: false
      Offset: 0
      Format: 'uint8'
      Repeat: Inf
        Data: 4772342x1 uint8 array

If I retrieve the sequence using m.Data , it is in ascii format and not as nucleotide sequence(thats because memory mapping requires the data to be in int format and not char)

Iam missing out on some key instruction in between please help

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

per isakson il 31 Mag 2012

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/39961-how-to-import-a-very-large-nucleotide-sequence-in-matlab-workspace#answer_49592

Apri in MATLAB Online

4772342 bytes is a small file

Matlab should not exit, but issue an "Out of memory error". If Matlab exits there is some other problem.

It is a text file. Do you know how it is encoded, UTF-8 ??

What does

char( m.Data(1:120) )

show?

You can read any text file piece wise with fscanf

.

--- Next step ---

This is not my domain. I have never used the the Bio info toolbox and know little about genomes.

However, I have the toolbox installed (R2012a) and have done some tests.

The function, nt2int, converts a string of characters to a vector of uint8. The key lines in nt2int are

    %   Copyright 2002-2004 The MathWorks, Inc.
     ...
    %           'a  b c  d e f g  h i j k l m  n o p q r s t u  v  w  x y z  -'
    map = uint8([1 11 2 12 0 0 3 13 0 0 7 0 8 15 0 0 0 5 9 4 4 14 10 15 6 0 16]);
    ...
    nt = lower(nt);
    ...
    seq = map((uint8(nt) + 1) - uint8('a'));

nt2int cannot handle numbers, e.g. "49175990", because seq cannot hold negative numbers

Thus, your file has a header that nt2int cannot handle. I guess that fastaread is supposed to take care of the header, but fails for some reason. I have not checked.

Try this, which converts to characters, strips of the header, and converts to a numeric vector.

    str = char( m.Data );
    cac = regexp( str, 'complete genome', 'split' );
    vector_of_uint8 = nt2int( cac{2} );

or safer

vector_of_uint8 = nt2int( strtrim(cac{2}) );

.

--- On fastaread ---

What does the following lines show?

    b.Sequence(1:120)
    char( b.Sequence(1:120) )

.

--- getgenbank ---

Try this:

    str = getgenbank ('NC_000913','ToFile', 'Ecoli.txt','FileFormat','FASTA', 'SequenceOnly','true');
    str =
    AGCTTTTCATTCTGACTGCAACGGGCAATA ....
    vector_of_uint8 = nt2int( str );

works perfectly on my system

3 Commenti
Mostra 1 commento meno recenteNascondi 1 commento meno recente

per isakson il 1 Giu 2012

Above, I've added a few more steps to try. I cannot understand why you get "Out of memory error", but we use different Matlab releases.

Shiwani il 6 Giu 2012

Thanks for your help. I have tried to read the sequence using fastaread only and then saved all the workspace variables into a matfile.

That saves a lot of memory temporarily.

But I am now facing problems at the analysis step.

Iam trying to use 1D continuous wavelets for analysing the curves for DNA sequences.

But the processing of wavelet coefficients keeps on going endlessly for exceptionally large sequences.

Is it possible to analyse the complete genomic signal (nt2int(seq)) using 1D continuous wavelet?

Accedi per commentare.

How to import a very large nucleotide sequence in Matlab workspace

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposte (1)

3 Commenti
Mostra 1 commento meno recenteNascondi 1 commento meno recente

Vedere anche

Categorie

Tag

Community Treasure Hunt

How to import a very large nucleotide sequence in Matlab workspace

0 Commenti Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposte (1)

3 Commenti Mostra 1 commento meno recenteNascondi 1 commento meno recente

Vedere anche

Categorie

Tag

Community Treasure Hunt

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

3 Commenti
Mostra 1 commento meno recenteNascondi 1 commento meno recente