Replacing characters with integers in a very long string

I have a string of a few million characters and want to replace it with a vector of integers according to simple rules, such as 'C' = -1 and so forth. My implementation works but takes forever and uses gigabytes of memory, mainly because of the str2num function, as far as I understand. Is there a more efficient way to do this?
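sequence = fileread('sourcefile.txt');                 % Read the whole file as one char vector
% Replace each base with its number as text, followed by a space
sequence_num = strrep(sequence, 'A', '0 ');
sequence_num = strrep(sequence_num, 'C', '-1 ');
sequence_num = strrep(sequence_num, 'G', '1 ');
sequence_num = strrep(sequence_num, 'T', '0 ');
sequence_num = regexprep(sequence_num, '\r\n', '');    % Remove Windows line breaks
sequence_num = str2num(sequence_num);                  % Parse the text into doubles (the slow step)
sequence_num = int32(sequence_num);                    % Convert to 32-bit integers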

Accepted Answer

Star Strider on 17 Dec 2016
I don’t know what structure ‘sequence’ has. I created it as a cell array here:
bases = {'A','C','T','G'};                       % Cell array of the four bases
sequence = bases(randi(4, 1, 20));               % Create random test data
skew = zeros(1, length(sequence)+1, 'int32');    % Preallocate (one extra leading element)
Cix = find(ismember(sequence, 'C'));             % Indices of 'C'
Gix = find(ismember(sequence, 'G'));             % Indices of 'G'
skew(Cix+1) = -1;                                % Replace with integer
skew(Gix+1) = +1;                                % Replace with integer
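Since fileread returns a char vector rather than a cell array, the same masking idea can be applied to the characters directly. The following is only a minimal sketch under stated assumptions: the sample string is a hypothetical stand-in for the file contents, all carriage returns and line feeds are dropped (slightly broader than the '\r\n' replacement in the question), and the result is not shifted by one element as skew is above.

```matlab
% Sketch: map a char vector of bases to int32 without str2num.
% The sample string is a hypothetical stand-in for fileread('sourcefile.txt').
sequence = sprintf('ACGT\r\nGGCA\r\n');
sequence(sequence == 13 | sequence == 10) = [];      % Drop CR (13) and LF (10)

sequence_num = zeros(1, numel(sequence), 'int32');   % 'A' and 'T' stay 0
sequence_num(sequence == 'C') = -1;                  % Logical mask, no find needed
sequence_num(sequence == 'G') = 1;

disp(sequence_num)
```

Comparing the char vector with a single character yields a logical mask of the same size, so no text parsing and no intermediate string copies are involved.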
7 Comments
Paolo Binetti on 18 Dec 2016
Thank you, @Star and @Jan. All in all, your help sped up my code about 700 times: it now takes 0.17 s for a bacterial genome. Roughly a factor of 250 comes from @Star's suggestions, and a further factor of 3 from @Jan's final simplification.
Star Strider on 18 Dec 2016
Our pleasure!
It is always more gratifying to help with real-world research. We wish you well!
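The final simplification mentioned in the comments is not reproduced in this thread. One way such a mapping could be condensed further, offered purely as an illustration and not as the code that was actually posted, is a lookup table indexed by character code. The names lut and the sample input are hypothetical, and the sketch assumes every character code lies between 1 and 255.

```matlab
% Sketch: lookup table indexed by character code (an assumption, not the posted code).
lut = zeros(1, 256, 'int32');           % Every character maps to 0 by default
lut(double('C')) = -1;
lut(double('G')) = 1;

sequence = 'ACGTTGCA';                  % Hypothetical input; codes must be 1..255
sequence_num = lut(double(sequence));   % One indexing operation for the whole string

disp(sequence_num)
```

Adding a rule for another character then takes a single extra assignment into lut, and the conversion itself remains one vectorized indexing step.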
