Why is isfield faster than intersect for strings?

Question

Sebastian Kraemer il 9 Mar 2018

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/387324-why-is-isfield-faster-than-intersect-for-strings

Modificato: Jan il 10 Mar 2018

I need to intersect moderately sized cells containing strings, say length 5. Using Matlabs function intersect (with or without 'stable') turns out to be up to 10 times slower than the following, quite weird lines of code. And even for large cells, this code is faster (using Matlab 2017a).

 a = {'a1','c2','e3','g4','i5'};
 b = {'a1','b2','c2','d2','e3'};
 % for i = 1:1000
    b2 = [b;b];
    c = struct(b2{:}); 
    cisfieldina = isfield(c,a);
    v = a(cisfieldina); % = intersect(a,b,'stable')
    w = a(~cisfieldina); % = setdiff(a,b,'stable') (basically for free) 
 % end

Intersect can not have such grave overhead. I tried out different ways to use struct, but this seems to be the fastest way, and it is stable as well.

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

Jan il 9 Mar 2018

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/387324-why-is-isfield-faster-than-intersect-for-strings#answer_309254

Modificato: Jan il 9 Mar 2018

Apri in MATLAB Online

As you have found out already, intersect has a remarkable overhead. It checks, if the inputs are sorted, and if they aren't, it sorts them for a faster binary search. For 5 strings this is much slower than a linear search.

For a faster implementation see e.g. the C-mex function FEX: CStrAinBP. This replies the indices of all strings occurring in both cell strings. Here is an equivalent M-code:

function [AI, BI] = CStrAinBP(A, B)
nA = numel(A);
nB = numel(B);
if nA <= nB
   % Collect the index of the first occurrence in B for every A:
   M = zeros(1, nA);
   for iA = 1:nA
      Ind = find(strcmp(A{iA}, B), 1);
      if isempty(Ind) == 0
         M(iA) = Ind(1);
      end
   end
else  % nB <= nA, B is smaller, so it is cheaper to loop over B:
   % Mark every A which equal the current B
   M = zeros(1, nA);
   for iB = nB:-1:1
      M(strcmp(B{iB}, A)) = iB;  % Same as M(find(strcmp()))
   end
end
AI = find(M);  % If any occurrence was found, this A exists...
BI = M(AI);    % at this index in B
end

And if the cell of intersecting objects is needed, use this in the caller:

C = A(AI);

The code can be simplified, if you only need the common elements:

function AB = CStrCommon(A, B)
nA = numel(A);
nB = numel(B);
M  = false(1, nA);
if nA <= nB
   for iA = 1:nA
      M(iA) = any(strcmp(A{iA}, B));
   end     
else  % nB <= nA, B is smaller, so it is cheaper to loop over B:
   for iB = 1:nB
      M(strcmp(A, B{iB})) = true;
   end
end
AB = A(M);
end

UNTESTED CODE Both versions have been adjusted for the publication in the forum, which might have inserted bugs.

Matlab's function for handling sets are very powerful and optimized for huge data sets and handling different input types. But a specific code can be much faster for a small inputs.

6 Commenti
Mostra 4 commenti meno recentiNascondi 4 commenti meno recenti

Sebastian Kraemer il 10 Mar 2018

Apri in MATLAB Online

Sorry Jan if I was not making myself clear (working Friday afternoon is my excuse I guess). I kind of figured out my needs in the process (considering that the original post was just about intersect). I needed a function that takes two cells of strings a and b and outputs the same as intersect(a,b), setdiff(a,b) and setdiff(b,a).

Regarding my previous question, I was just wondering if I could actually overtake your codes exactly as they are, being implemented as mex files. Since that does not seem to be the case, as far as I can see, I indeed modified your code to suit my needs. I settled for this now:

 function [acapb,awob,bwoa,IA,IB] = str_intersect(a,b)
% STR_INTERSECT  Equivalent to intersect(a,b), setdiff(a,b)
% and setdiff(b,a) for string cells a,b
%
%   [acapb,awob,bwoa] = STR_INTERSECT(a,b) returns
%   acapb = intersect(a,b)
%   awob = setdiff(a,b) % if asked for
%   bwoa = setdiff(b,a) % if asked for
%   but without overhead (and relying on correct input).
%
%   [acapb,awob,bwoa,IA,IB] = STR_INTERSECT(a,b) also returns
%   the logical indices IA, IB for which acapb = a(IA) = b(IB).
%
%   See also: INTERSECT, SETDIFF, INT_INTERSECT
    IA = false(size(a));
    IB = false(size(b));
    for j = 1:numel(a)
        cmp = strcmp(a{j},b);
        IA(j) = any(cmp);
        IB = IB | cmp;
    end
    acapb = a(IA);
    if nargout > 1
        awob = a(~IA);
    end
    if nargout > 2
        bwoa = b(~IB);
    end
 end

This is a basic and sufficient implementation for me. Thank you for your input on the problem! (Of course I could ask as well which cell is smaller, but they use to have same size.)

Jan il 10 Mar 2018

Modificato: Jan il 10 Mar 2018

@Sebastian: You are welcome. The final code looks straight and tight. I think it is very efficient for small cell strings, where "small" is maybe 100 strings.

There is no need for excuses. I tried to help you efficiently and you have been the one, who needs the solution. I'm glad, if I could help you.

Accedi per commentare.

Why is isfield faster than intersect for strings?

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposta accettata

6 Commenti
Mostra 4 commenti meno recentiNascondi 4 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Prodotti

Community Treasure Hunt

Why is isfield faster than intersect for strings?

0 Commenti Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposta accettata

6 Commenti Mostra 4 commenti meno recentiNascondi 4 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Prodotti

Community Treasure Hunt

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

6 Commenti
Mostra 4 commenti meno recentiNascondi 4 commenti meno recenti