Generate neighbors of a string
Paolo Binetti
on 9 Apr 2017
Commented: Paolo Binetti
on 3 Jun 2017
I need to generate a cell array (or char array) containing all d-neighbors of a string of length k, that is, all the k-long strings that differ from the input in at most d characters.
My specific need concerns an alphabet of four letters, A, C, G, T. Here is an example: with string='CTA' and d=2 as inputs, the output array should have 37 elements, like 'AGA', 'CAA', 'GGA', ... and 'CTA'.
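The figure of 37 can be checked by counting: choose which i positions differ, then pick one of the other letters at each of them, for i from 0 to d. A minimal sketch, assuming d does not exceed k (the variable names k, d, a and count are only illustrative):

```matlab
% Expected number of d-neighbors of a k-long string over an alphabet of a letters.
% For each i = 0..d: nchoosek(k, i) ways to choose the differing positions,
% times (a - 1)^i ways to fill them with a different letter.
k = 3;   % string length, e.g. 'CTA'
d = 2;   % maximum number of mismatches (assumed <= k)
a = 4;   % alphabet size, 'ACGT'
count = sum(arrayfun(@(i) nchoosek(k, i) * (a - 1)^i, 0:d));
disp(count)   % 1 + 9 + 27 = 37
```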
My code, attached, is the fastest of several I have tried. The main function "neighbors" calls the auxiliary function "immediate_neighbors". The expected inputs of these functions are provided as comments in the script.
I am convinced there are ways to achieve this much faster, but could not find any.
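For small k, a brute-force reference can be useful to check any faster implementation against. The sketch below is not the attached code: it simply enumerates all k-long strings over the alphabet and keeps those within Hamming distance d. The variable names are illustrative, and it grows as 4^k, so it is only practical for short strings.

```matlab
pattern = 'CTA';
d = 2;
charset = 'ACGT';
k = numel(pattern);

% Enumerate every k-long string over charset (numel(charset)^k rows).
idx = cell(1, k);
[idx{:}] = ndgrid(1:numel(charset));
idx = reshape(cat(k + 1, idx{:}), [], k);
candidates = reshape(charset(idx), [], k);

% Keep the candidates whose Hamming distance to pattern is at most d.
mismatches = sum(bsxfun(@ne, candidates, pattern), 2);
neighbors = cellstr(candidates(mismatches <= d, :));

disp(numel(neighbors))   % 37 for 'CTA' with d = 2
```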
Accepted Answer
Guillaume
on 9 Apr 2017
Edited: Guillaume
on 9 Apr 2017
For short strings and small d, your code is slightly faster, but for longer strings with larger d (tested with 18 chars and d = 3) the following is much faster:
function patterns = neighbours(pattern, distance, charset)
    if nargin < 3
        charset = 'ACGT';
    end
    %build replacement list for each character:
    %row i of charsetreps holds every character of charset except charset(i)
    charsetreps = arrayfun(@(cidx) charset([1:cidx-1, cidx+1:end]), 1:numel(charset), 'UniformOutput', false);
    charsetreps = vertcat(charsetreps{:});
    %fill patterns for distances from 0 to distance
    patterns = {pattern}; %0 distance
    for d = 1:distance
        %get all combinations of d indices of pattern to replace at once
        charidx = nchoosek(1:numel(pattern), d);
        %storage for all replacements at distance d:
        allreps = cell(size(charidx, 1), (numel(charset)-1)^d);
        %get Cartesian product of replacement columns in charsetreps
        repcols = cell(1, d);
        [repcols{:}] = ndgrid(1:numel(charset)-1);
        repcols = reshape(cat(d+1, repcols{:}), [], d);
        repcols = numel(charset) * (repcols-1); %conversion to linear indexing of the column
        %iterate over the charidx combinations
        for ci = 1:size(charidx, 1)
            [~, reprows] = ismember(pattern(charidx(ci, :)), charset);
            %iterate over the replacement characters
            for rep = 1:size(repcols, 1)
                replacement = pattern;
                replacement(charidx(ci, :)) = charsetreps(reprows + repcols(rep, :)); %reprows + repcols creates linear indices
                allreps{ci, rep} = replacement;
            end
        end
        patterns = [patterns; allreps(:)]; %#ok<AGROW>
    end
end
It is also a lot more generic as it works with any character set, not just 'ACGT'.
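The linear-indexing step (reprows + repcols) is the least obvious part of the function, so here it is in isolation. This is only an illustrative sketch: the two characters 'CT' and the chosen replacement columns [1 3] are arbitrary example values.

```matlab
charset = 'ACGT';

% Row i of charsetreps holds the characters that may replace charset(i).
charsetreps = arrayfun(@(cidx) charset([1:cidx-1, cidx+1:end]), ...
                       1:numel(charset), 'UniformOutput', false);
charsetreps = vertcat(charsetreps{:});   % ['CGT'; 'AGT'; 'ACT'; 'ACG']

% Rows of charsetreps for the characters being replaced: 'C' -> 2, 'T' -> 4.
[~, reprows] = ismember('CT', charset);

% Pick column 1 for the first character and column 3 for the second.
% In a 4-row matrix, element (r, c) has linear index r + 4*(c - 1).
repcols = numel(charset) * ([1 3] - 1);  % column offsets [0 8]

replaced = charsetreps(reprows + repcols);  % elements (2,1) and (4,3)
disp(replaced)   % AG
```

Because reprows selects the row of the original character, the replacement can never equal the character it replaces, which is why no duplicates of the input pattern are produced.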
However, note that your own code could be optimised by moving the call to unique and the assignment to f out of the loop.