Why is isfield faster than intersect for strings?

8 views (last 30 days)
Sebastian Kraemer on 9 Mar 2018
Edited: Jan on 10 Mar 2018
I need to intersect moderately sized cells containing strings, say length 5. Using Matlabs function intersect (with or without 'stable') turns out to be up to 10 times slower than the following, quite weird lines of code. And even for large cells, this code is faster (using Matlab 2017a).
a = {'a1','c2','e3','g4','i5'};
b = {'a1','b2','c2','d2','e3'};
% for i = 1:1000
b2 = [b;b];
c = struct(b2{:});
cisfieldina = isfield(c,a);
v = a(cisfieldina); % = intersect(a,b,'stable')
% end
Intersect can not have such grave overhead. I tried out different ways to use struct, but this seems to be the fastest way, and it is stable as well.

Jan on 9 Mar 2018
Edited: Jan on 9 Mar 2018
As you have found out already, intersect has a remarkable overhead. It checks, if the inputs are sorted, and if they aren't, it sorts them for a faster binary search. For 5 strings this is much slower than a linear search.
For a faster implementation see e.g. the C-mex function FEX: CStrAinBP. This replies the indices of all strings occurring in both cell strings. Here is an equivalent M-code:
function [AI, BI] = CStrAinBP(A, B)
nA = numel(A);
nB = numel(B);
if nA <= nB
% Collect the index of the first occurrence in B for every A:
M = zeros(1, nA);
for iA = 1:nA
Ind = find(strcmp(A{iA}, B), 1);
if isempty(Ind) == 0
M(iA) = Ind(1);
end
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
% Mark every A which equal the current B
M = zeros(1, nA);
for iB = nB:-1:1
M(strcmp(B{iB}, A)) = iB; % Same as M(find(strcmp()))
end
end
AI = find(M); % If any occurrence was found, this A exists...
BI = M(AI); % at this index in B
end
And if the cell of intersecting objects is needed, use this in the caller:
C = A(AI);
The code can be simplified, if you only need the common elements:
function AB = CStrCommon(A, B)
nA = numel(A);
nB = numel(B);
M = false(1, nA);
if nA <= nB
for iA = 1:nA
M(iA) = any(strcmp(A{iA}, B));
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
for iB = 1:nB
M(strcmp(A, B{iB})) = true;
end
end
AB = A(M);
end
UNTESTED CODE Both versions have been adjusted for the publication in the forum, which might have inserted bugs.
Matlab's function for handling sets are very powerful and optimized for huge data sets and handling different input types. But a specific code can be much faster for a small inputs.
Jan on 10 Mar 2018
@Sebastian: You are welcome. The final code looks straight and tight. I think it is very efficient for small cell strings, where "small" is maybe 100 strings.
There is no need for excuses. I tried to help you efficiently and you have been the one, who needs the solution. I'm glad, if I could help you.