Why is isfield faster than intersect for strings?

4 visualizzazioni (ultimi 30 giorni)
I need to intersect moderately sized cells containing strings, say length 5. Using Matlabs function intersect (with or without 'stable') turns out to be up to 10 times slower than the following, quite weird lines of code. And even for large cells, this code is faster (using Matlab 2017a).
a = {'a1','c2','e3','g4','i5'};
b = {'a1','b2','c2','d2','e3'};
% for i = 1:1000
b2 = [b;b];
c = struct(b2{:});
cisfieldina = isfield(c,a);
v = a(cisfieldina); % = intersect(a,b,'stable')
w = a(~cisfieldina); % = setdiff(a,b,'stable') (basically for free)
% end
Intersect can not have such grave overhead. I tried out different ways to use struct, but this seems to be the fastest way, and it is stable as well.

Risposta accettata

Jan
Jan il 9 Mar 2018
Modificato: Jan il 9 Mar 2018
As you have found out already, intersect has a remarkable overhead. It checks, if the inputs are sorted, and if they aren't, it sorts them for a faster binary search. For 5 strings this is much slower than a linear search.
For a faster implementation see e.g. the C-mex function FEX: CStrAinBP. This replies the indices of all strings occurring in both cell strings. Here is an equivalent M-code:
function [AI, BI] = CStrAinBP(A, B)
nA = numel(A);
nB = numel(B);
if nA <= nB
% Collect the index of the first occurrence in B for every A:
M = zeros(1, nA);
for iA = 1:nA
Ind = find(strcmp(A{iA}, B), 1);
if isempty(Ind) == 0
M(iA) = Ind(1);
end
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
% Mark every A which equal the current B
M = zeros(1, nA);
for iB = nB:-1:1
M(strcmp(B{iB}, A)) = iB; % Same as M(find(strcmp()))
end
end
AI = find(M); % If any occurrence was found, this A exists...
BI = M(AI); % at this index in B
end
And if the cell of intersecting objects is needed, use this in the caller:
C = A(AI);
The code can be simplified, if you only need the common elements:
function AB = CStrCommon(A, B)
nA = numel(A);
nB = numel(B);
M = false(1, nA);
if nA <= nB
for iA = 1:nA
M(iA) = any(strcmp(A{iA}, B));
end
else % nB <= nA, B is smaller, so it is cheaper to loop over B:
for iB = 1:nB
M(strcmp(A, B{iB})) = true;
end
end
AB = A(M);
end
UNTESTED CODE Both versions have been adjusted for the publication in the forum, which might have inserted bugs.
Matlab's function for handling sets are very powerful and optimized for huge data sets and handling different input types. But a specific code can be much faster for a small inputs.
  6 Commenti
Sebastian Kraemer
Sebastian Kraemer il 10 Mar 2018
Sorry Jan if I was not making myself clear (working Friday afternoon is my excuse I guess). I kind of figured out my needs in the process (considering that the original post was just about intersect). I needed a function that takes two cells of strings a and b and outputs the same as intersect(a,b), setdiff(a,b) and setdiff(b,a).
Regarding my previous question, I was just wondering if I could actually overtake your codes exactly as they are, being implemented as mex files. Since that does not seem to be the case, as far as I can see, I indeed modified your code to suit my needs. I settled for this now:
function [acapb,awob,bwoa,IA,IB] = str_intersect(a,b)
% STR_INTERSECT Equivalent to intersect(a,b), setdiff(a,b)
% and setdiff(b,a) for string cells a,b
%
% [acapb,awob,bwoa] = STR_INTERSECT(a,b) returns
% acapb = intersect(a,b)
% awob = setdiff(a,b) % if asked for
% bwoa = setdiff(b,a) % if asked for
% but without overhead (and relying on correct input).
%
% [acapb,awob,bwoa,IA,IB] = STR_INTERSECT(a,b) also returns
% the logical indices IA, IB for which acapb = a(IA) = b(IB).
%
% See also: INTERSECT, SETDIFF, INT_INTERSECT
IA = false(size(a));
IB = false(size(b));
for j = 1:numel(a)
cmp = strcmp(a{j},b);
IA(j) = any(cmp);
IB = IB | cmp;
end
acapb = a(IA);
if nargout > 1
awob = a(~IA);
end
if nargout > 2
bwoa = b(~IB);
end
end
This is a basic and sufficient implementation for me. Thank you for your input on the problem! (Of course I could ask as well which cell is smaller, but they use to have same size.)
Jan
Jan il 10 Mar 2018
Modificato: Jan il 10 Mar 2018
@Sebastian: You are welcome. The final code looks straight and tight. I think it is very efficient for small cell strings, where "small" is maybe 100 strings.
There is no need for excuses. I tried to help you efficiently and you have been the one, who needs the solution. I'm glad, if I could help you.

Accedi per commentare.

Più risposte (0)

Categorie

Scopri di più su Characters and Strings in Help Center e File Exchange

Prodotti

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by