Find and Replace Overlapping Substrings

Hello,
I want to find a set of substrings (DNA sequences, i.e. mixes of 'ACGT', between 19 and 24 characters long) in a longer string (template DNA) and replace each one with as many '*' characters as the substring is long. I have the following code.
% "template" is an 8x1 cell array with the original DNA sequence data (around 1800 chars each). To keep the example small, I only go through the first cell.
% "substring" is e.g. a 50x2 cell array, with column 1 = substring and column 2 = length of the substring.
% "substituted_seq" is an 8x1 cell array with the replaced sequence (substrings substituted by '*')
%
substituted_seq{1,1} = strrep(template{1,1},substring{1,1},'*');
for j=1:size(substring,1)
substituted_seq{1,1} = strrep(substituted_seq{1,1},substring{j,1},'*');
end
The first problem is that these substrings overlap with each other. Once I replace the first substring with '*' and then search for the next one (which overlaps the first), the code no longer finds it, so it is not replaced.
Second, I couldn't figure out how to replace a substring such as 'ACGTCG' with the same number of '*' characters (in this example '******').
I would be very grateful for any help. Thanks!

Accepted Answer

Robert Cumming on 30 Aug 2012
I would make a logical flag array with the same length as your string. Then run through all your substrings, always searching the original (unmodified) string, and set the flag to true for every character that should be replaced with *. This eliminates the overlap problem, because no match is destroyed by an earlier replacement.
Once that is all done, replace every character whose flag is true with * in your string.
For your second issue, something like:
flag = 'CC';
key = regexprep ( 'CC', '.', '*' );
regexprep ( 'ABCCCCCDEFG', flag, key )
ans =
AB****CDEFG
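Putting both parts together, the flag approach could be sketched roughly as follows. This is only a minimal illustration, not tested against the real data: the short `template` string and the two overlapping entries in `substring` are made-up examples, and `mask` and `substituted` are hypothetical variable names. It relies on `strfind` searching the original string, so overlapping hits are all still found.

```matlab
% Made-up example data; the real sequences come from the question's cell arrays.
template  = 'AAACGTCGTTTGGG';
substring = {'ACGTCG'; 'GTCGTT'};     % these two matches overlap in template

mask = false(size(template));         % one logical flag per character

for j = 1:size(substring, 1)
    len    = numel(substring{j, 1});
    starts = strfind(template, substring{j, 1});  % search the ORIGINAL string
    for s = starts                    % loop body is skipped if nothing was found
        mask(s:s+len-1) = true;       % flag every character of this match
    end
end

substituted = template;
substituted(mask) = '*';              % one '*' per flagged character
disp(substituted)
```

Because each flagged character is overwritten individually, the result has the same length as the template, which also covers the "same number of '*'" requirement without needing `regexprep`.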

More Answers (0)
