Find common 4-letter substrings in a list of strings

How do I find ANY (4-letter) common pattern (substring) among all strings in the ID column in the table below, and create another table per common pattern found.
Note that I only have a license for R2018a, so I cannot use newer functions such as 'lettersPattern'.
I have attached an excerpt of the .csv file.
Stephen23 on 15 Feb 2023
"How do I find ANY (4-letter) common pattern (substring) among all strings in the ID column in the table below."
There are no 4-letter substrings that occur in all of the strings in the ID column.
Or do you really mean something like "...that occur in two or more of the strings in the ID column"?


Accepted Answer

dpb on 15 Feb 2023
Edited: dpb on 16 Feb 2023
There may be something more clever, but the "dead-ahead" way, looking within one string at a time, is something like this:
tAMP=readtable('AMPdb_short.csv');   % bring the data in...
SUBSTRLEN=4;                         % the substring length
L=SUBSTRLEN-1;                       % offset from first to last character of a substring
for j=1:height(tAMP)                 % iterate over all strings
  S=tAMP.ID{j};                      % convenient temporary
  for i=1:length(S)-L                % keep in bounds of string
    s=S(i:i+L);                      % the ith substring in the string
    ix=strfind(S,s);                 % find the locations, if any, in this string
    if numel(ix)>1                   % if it repeats, do whatever you wish here
      fprintf(['%3d' '%5s' repmat('%3d',1,numel(ix)) '\n'],i,s,ix)
    end
  end
end
That finds all the matches within each string; to search across all strings requires wrapping it in another layer, so that the ith substring of the jth string is compared not only against the jth string itself but also against all the others in the collection. Leave as "exercise for Student"...
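One possible way to approach that exercise is sketched below. It is not the answer's own method: it assumes "common" means "occurs in two or more different strings" (as the comment above suggests), it uses a small made-up cell array `ids` in place of `tAMP.ID`, and it swaps the nested comparison for a `containers.Map` lookup, which is available in R2018a. The names `hits`, `common` and `rows` are hypothetical.

```matlab
% Hypothetical sample data standing in for tAMP.ID (not from the attached file)
ids = {'ABCDEF'; 'XXBCDEYY'; 'QRSTUV'; 'ZABCDQ'};
SUBSTRLEN = 4;                              % the substring length

% Map each substring to the row numbers of the strings that contain it
hits = containers.Map('KeyType','char','ValueType','any');
for j = 1:numel(ids)                        % iterate over all strings
    S = ids{j};                             % convenient temporary
    for i = 1:length(S)-SUBSTRLEN+1         % keep in bounds of string
        s = S(i:i+SUBSTRLEN-1);             % the ith substring of the jth string
        if isKey(hits,s)
            hits(s) = union(hits(s), j);    % record each row only once
        else
            hits(s) = j;
        end
    end
end

% Keep only the substrings found in two or more different strings
common = {};                                % the common substrings
rows = {};                                  % rows{n} lists the strings containing common{n}
k = keys(hits);
for n = 1:numel(k)
    if numel(hits(k{n})) > 1
        common{end+1,1} = k{n};
        rows{end+1,1} = hits(k{n});
    end
end
```

With the real data, replacing `ids` by `tAMP.ID` and then indexing with `tAMP(rows{n},:)` should give one sub-table per common substring, which is roughly what the question asks for.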
For the longest string in the sample dataset (60 characters), the above inner loop over that string alone produced (run locally)...
>> for i=1:length(S)-3
s=S(i:i+3); ix=strfind(S,s);
if numel(ix)>1,fprintf(['%3d' '%5s' repmat('%3d',1,numel(ix)) '\n'],i,s,ix),end
end
4 RPRP 4 12 14
11 PRPR 11 13
12 RPRP 4 12 14
13 PRPR 11 13
14 RPRP 4 12 14
15 PRPL 15 29 43 57
16 RPLP 16 30 44
17 PLPF 17 31 45
18 LPFP 18 32 46
50 RPGP 22 36 50
51 PGPR 23 37 51
52 GPRP 24 38 52
53 PRPI 25 39 53
54 RPIP 26 40 54
55 PIPR 27 41 55
56 IPRP 28 42 56
57 PRPL 15 29 43 57
The above, of course, finds duplicates, so the overall number of patterns is the number of unique substrings in the list above. Well, let's see:
smatch={};
for i=1:length(S)-3
s=S(i:i+3); ix=strfind(S,s);
if numel(ix)>1,smatch=[smatch;{s}];end
end
smatch=unique(smatch)
smatch =
16×1 cell array
You can be a little more clever by keeping the running array of matches during the loop, checking whether the next substring is already in that set, and skipping it instead of searching again for the same pattern.
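A minimal sketch of that idea, assuming a short made-up test string rather than one from the dataset. It uses `continue` to skip a substring that has already been recorded, so the loop still goes on to the remaining positions, and it collects each repeated pattern only once, so no later call to `unique` is needed.

```matlab
S = 'RPRPRPRPLPF';                % hypothetical test string, not from the dataset
SUBSTRLEN = 4;                    % the substring length
smatch = {};                      % running set of substrings already found to repeat
for i = 1:length(S)-SUBSTRLEN+1   % keep in bounds of string
    s = S(i:i+SUBSTRLEN-1);       % the ith substring in the string
    if any(strcmp(s, smatch))     % already recorded, so skip the repeat search
        continue
    end
    ix = strfind(S, s);           % all start positions of s in S
    if numel(ix) > 1              % keep only substrings that occur more than once
        smatch{end+1,1} = s;
        fprintf(['%5s' repmat('%3d',1,numel(ix)) '\n'], s, ix)
    end
end
```

The linear `strcmp` scan over `smatch` is fine for a handful of patterns; for very long strings a `containers.Map` keyed on the substring would probably scale better.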
