search for elements of a vector in a matrix (without using ismember)

I have a matrix A of size Nx2, N is large. I have another vector b of size nx1 and n<<N. I want to filter out those row indices of A, where both columns could contain any 2 elements of b.
For instance, let
A = [1 2;3 4; 5 6]
b = [1,2,5,6]'
Then output of method(A,b) should be
[1,3]'
(i.e. the first and 3rd rows of A contain 2 elements from b, but not the 2nd row.
I am currently using a combination of ismember and find but seems like ismember is not very efficient because my A, as I said, is very large.
Thanks in advance!
EDIT: (some more background on the problem)
The code has about couple dozen functions of which couple of them call ismember within a loop. The overall code takes several hours to execute when N is in O(1e5) and profiling it showed most of the time is consumed by ismember. The whole point of the question is to verify if (i) ismember is specifically inefficient for large scale code and (ii) are there alternatives that could take only a fraction of ismember 's cputime
EDIT 2: Tried alternative suggestion using 'any'. The cpu time was almost identical.
EDIT 3: Solved. Based on my personal experience, it does seem like ismember is inefficient when used repeatedly used within a large loop. So I replaced it with cellfun (since my 'b' is actually a cell array) & bsxfun and saw 25x speedup. Thanks to everyone that chimed in. I learned a lot!

Risposte (4)

Cedric
Cedric il 19 Ott 2017
Modificato: Cedric il 19 Ott 2017
EDIT: I took 3 more minutes and I profiled a small test case (N=1e7, n=1e2). ISMEMBER is more efficient!
Here is an alternative for MATLAB 2016b and above:
>> all( any( A == permute( b, [3,2,1] ), 3 ), 2 )
ans =
3×1 logical array
1
0
1
Then FIND if you need indices.
And here the version for below 2016b:
all( any( bsxfun( @eq, A, permute( b, [3,2,1] )), 3 ), 2 )
If speed maters much, it may be faster to avoid pages:
>> any( A(:,1) == b.', 2 ) & any( A(:,2) == b.', 2 )
ans =
3×1 logical array
1
0
1
but you'll have to profile all solutions and also ISMEMBER, because our guesses about performance are often wrong.

2 Commenti

Cedric
Cedric il 19 Ott 2017
Modificato: Cedric il 19 Ott 2017
Moved here:
There may be ways to pre-process a few things, to eliminate the loop, to parallelize, etc. I built a quick test (with n~10) and you can see that working with an expansion/ANY/etc may be as efficient as Andrei's solution for small values of N, but that as soon as N gets larger ISMEMBER becomes more efficient:

Accedi per commentare.

Jan
Jan il 19 Ott 2017
Modificato: Jan il 19 Ott 2017
[EDITED Wrong approach:]
You can save some time with calling the internally used C-Mex function:
bs = sort(b); % Omit if b is sorted already!
find(ismembc(A(:,1), bs) & ismembc(A(:,2), bs))
Then b is sorted once only.
Unfortunately ismembc is not documented. It existed at least from Matlab 5.3 to 2016b, but it might be removed from the toolbox in the future.
[EDITED] No, this is slower that ismember. Obviously Matlab uses faster methods now internally.

3 Commenti

+1! I love Jan's research.
:-) Thanks, Andrei.
In modern Matlab versions the "ismembc" is replaced by "_ismemberhelper". As other commands starting with an underscore it can be called through builtin also. Then you can omit the error checking and classification of the inputs by:
% b MUST be sorted!
find(builtin('_ismemberhelper', A(:,1), b) & ...
builtin('_ismemberhelper', A(:,2), b));
The speedup with inputs with 1e7 is negligible. This shortcut is only useful, if you process many small problems.
I was about to refer to _ismemberhelper, which is what is used by ismember() for some special cases of sorted data. In R2017b look near line 333 of ismember() for a description of _ismemberhelper

Accedi per commentare.

all(ismember(A,b),2)
or
diff(A(all(ismember(A,b),2),:),1,2)~=0
find( ismember(A(:,1),b) & ismember(A(:,2),b) )

15 Commenti

But I thought ismember is not efficient when A is large? Any alternatives that does not use ismember?
Why do you suppose that it is not efficient?
1. after profiling my code and finding out that ismember takes ~85% of the cputime and 2. searching the internet to notice other people acknowledging the same thing?
Well in Matt's code where there are two calls to ismember() and one to find(), I'm not surprised that ismember takes 85% of the time. But what's more important is the absolute time. If the whole thing only takes 20 millseconds, who cares? Who much time does it actually take for your actual arrays? Milliseconds? Hours?
I believe I did not make myself clear. The code has about couple dozen functions of which couple of them call ismember within a loop. The overall code takes several hours to execute when N is in O(1e5) and profiling it showed most of the time is consumed by ismember. The whole point of the question is to verify if (i) ismember is specifically inefficient for large scale code and (ii) are there alternatives that could take only a fraction of ismember 's cputime
Also, I was originally using a variant of the 1st suggestion provided here. I just plugged in the suggested line of code as well. Both give almost identical cpu times.
Matt J
Matt J il 19 Ott 2017
Modificato: Matt J il 19 Ott 2017
searching the internet to notice other people acknowledging the same thing?
It would be worthwhile posting links to those in your question so we can better see the context. I don't see why the efficiency of ismember would be affected by data size. Obviously, it will take longer as A and b get larger, but for reasons of data size, not efficiency.
I will leave it for the MathWorks folks to search the internet about user feedback on ismember. My point is that when I went from N=20,000 to N=100,000 the cputime went from around 2 minutes to several hours. So as you can see it does not scale well with N. Upon pofiling I found ismemer takes most of the time. The question really is that does MW acknowledge any known limitations on the efficiency of ismember? If they do, is there a better alternative. I have seen a few other suggestions in this thread which I shall try and report sometime today.
Cedric
Cedric il 19 Ott 2017
Modificato: Cedric il 19 Ott 2017
What version of MATLAB are you using on what system? My profiling took ~2.4s for N=1e8 and n=1e2 on a laptop, and in your comments you mention N=2e4-1e5, which should take no time in comparison(?) How does does my code below perform in comparison?
2017b. On a windows workstation with 16gb of RAM. My ismember is within a loop and I am looking for cputime in ~1e-3s to be computationally affordable. I haven't tried the other ones yet but will do that and report back soon.
Cedric
Cedric il 19 Ott 2017
Modificato: Cedric il 19 Ott 2017
Ok, so it is not just a single call that takes 2 minutes. Is this loop something that you could share? There may be a better approach. For relatively small Ns, the expansion can become more efficient than ISMEMBER, but I doubt that it will be the case here. What is the range of n?
I will leave it for the MathWorks folks to search the internet about user feedback on ismember...The question really is that does MW acknowledge any known limitations on the efficiency of ismember?
Just to be clear, you are not talking to the MathWorks right now. It's a public forum.
n is almost always < 10.
I will see what I can do about sharing the loop. Also the 'any' approach gave similar results.
What I gather from this thread is that (i) I am using the correct methods and (ii) all of them are equally good (or bad). So may be it is an algorithmic issue, that I will have to look into.
|Just to be clear, you are not talking to the MathWorks right now. It's a public forum.
I know and I understand. Still this is a MW forum, so I did not want to quote anybody from other public forum.
Cedric - I really appreciate it. Thanks! I am re-thinking my algorithm to see scope for improvement. I will try to include the actual loops sometime today if it is still necessary.

Accedi per commentare.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by