How to search a substring in a list of strings?

51 visualizzazioni (ultimi 30 giorni)
Mr M.
Mr M. il 29 Gen 2018
Modificato: Jan il 29 Gen 2018
I have {'xx', 'abc1', 'abc2', 'yy', 'abc100'} and I would like to search 'abc' and get back {'abc1', 'abc2', 'abc100'}. Is it possible to do this in a simple way without a for cycle?

Risposte (4)

Jos (10584)
Jos (10584) il 29 Gen 2018
In recent releases you can use startsWith
A = {'xx', 'abc1', 'abc2', 'yy', 'abc100'}
tf = startsWith(A,'abc')
B = A(tf)
See the documentation on string functions for many other utilities that may be useful for you.

Stephen23
Stephen23 il 29 Gen 2018
Much faster than using cellfun or any string functions:
>> C = {'xx', 'abc1', 'abc2', 'yy', 'abc100'};
>> Z = C(strncmp(C,'abc',3));
>> Z{:}
ans = abc1
ans = abc2
ans = abc100
  1 Commento
Jan
Jan il 29 Gen 2018
Modificato: Jan il 29 Gen 2018
+1. 2 minutes faster than me. See my answers for timings: for cell strings, strcnmp is 65 times faster than startsWith.

Accedi per commentare.


Jan
Jan il 29 Gen 2018
Modificato: Jan il 29 Gen 2018
If you want to search at the start of the strings only, this is efficient:
A = {'xx', 'abc1', 'abc2', 'yy', 'abc100'};
B = s(strncmp(s, 'abc', 3));
Some timings:
% Some larger test data:
A = repmat({'xx', 'abc1', 'abc2', 'yy', 'abc100'}, 1, 1000);
S = string(A);
tic;
for k = 1:1000
tf = startsWith(A, 'abc');
B = A(tf);
end
toc
tic;
for k = 1:1000
tf = startsWith(S, 'abc');
B = A(tf);
end
toc
tic;
for k = 1:1000
tf = strncmp(s, 'abc', 3);
B = A(tf);
end
toc
tic;
for k = 1:1000
tf = cellfun(@any,strfind(A, 'abc'));
B = A(tf);
end
toc
tic;
for k = 1:1000
tf = ~cellfun('isempty', strfind(A, 'abc'));
B = A(tf);
end
toc
Elapsed time is 1.492006 seconds. % startsWith(cell string)
Elapsed time is 0.308345 seconds. % startsWith(string)
Elapsed time is 0.018157 seconds. % strncmp
Elapsed time is 8.095714 seconds. % cellfun(@any, strfind)
Elapsed time is 1.706694 seconds. % cellfun('isempty', strfind)
Note that cellfun method searches for the substring anywhere in the strings, while the two other methods search at the start only. With modern string methods this would be:
tf = contains(A, 'abc');
This has an equivalent speed as startsWith.
@MathWorks: strncmp is 17 times faster for cell strings than startsWith for strings. The conversion from cell strings to strings inside startsWith let it run 65 times slower than strncmp. There is a great potential for improvements.

Fangjun Jiang
Fangjun Jiang il 29 Gen 2018
s={'xx', 'abc1', 'abc2', 'yy', 'abc100'};
index=cellfun(@any,strfind(s,'abc'));
s(index)
  2 Commenti
Matt J
Matt J il 29 Gen 2018
Modificato: Matt J il 29 Gen 2018
This solution has a for-loop embedded within cellfun.
Jan
Jan il 29 Gen 2018
Modificato: Jan il 29 Gen 2018
@Matt J: But cellfun is at least a fast C-mex function. Every function to process a cell string must contain a loop anywhere. The problem is using cellfun with an expensive anonymous function. About 4 times faster (but still slower than startsWith, timings see my answer):
index = ~cellfun('isempty', strfind(s, 'abc'));

Accedi per commentare.

Categorie

Scopri di più su Data Type Conversion in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by