find indices of row subsets

Michal on 9 May 2018
Edited: Michal on 11 May 2018
I am trying to write a vectorized MATLAB function ind = item2ind(item,t) that solves the following problem. I have a list of row vectors
item = [2 3 1; 2 1 2; 3 1 1; 1 3 3]
and a vector t holding all the elements that are available to each item row:
t = [1 1 2 2 2 3 3];
For each row of item, I need the indices in t of that row's elements, where each position of t may be used at most once per row:
ind = [3 6 1; 3 1 4; 6 1 2; 1 6 7]
However, a row such as
item = [1 1 1]
does not correspond to the vector t, because it contains three "1" elements and t contains only two.
Note: The serial version below is inefficient for large item (10000 x 100) and t (1 x 200).
function ind = item2ind(item,t)
    [nlp,N] = size(item);
    ind = zeros(nlp,N);
    for i = 1:nlp
        auxitem = item(i,:);
        auxt = t;
        for j = 1:N
            I = find(auxitem(j) == auxt,1,'first');
            if ~isempty(I)
                auxt(I) = 0;
                ind(i,j) = I;
            else
                error('Incompatible content of item and t.');
            end
        end
    end
end
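For reference, here is a minimal sketch that runs the same greedy matching on the example data. It is written inline and tracks consumed positions with a logical mask instead of zeroing a copy of t. The helpers countIn and isCompatible are hypothetical names introduced only for this example; they express the compatibility rule as a count comparison and assume positive integer values.

```matlab
% Example data from the question.
item = [2 3 1; 2 1 2; 3 1 1; 1 3 3];
t    = [1 1 2 2 2 3 3];

% Greedy matching: each element of a row consumes the first unused
% matching position in t.
[nlp, N] = size(item);
ind = zeros(nlp, N);
for i = 1:nlp
    used = false(size(t));            % positions of t already consumed by row i
    for j = 1:N
        I = find(t == item(i,j) & ~used, 1, 'first');
        assert(~isempty(I), 'Incompatible content of item and t.');
        used(I)  = true;
        ind(i,j) = I;
    end
end
disp(ind)

% Compatibility rule: no value may occur more often in a row than in t.
% countIn and isCompatible are hypothetical helpers for this example.
maxVal  = max(t);
countIn = @(v) accumarray(v(:), 1, [maxVal 1]);
isCompatible = @(row) all(row >= 1 & row <= maxVal) && ...
                      all(countIn(row) <= countIn(t));
disp(isCompatible([2 3 1]))           % first row of item
disp(isCompatible([1 1 1]))           % three 1s, but t holds only two
```

The first display should reproduce the ind matrix given above, and the row [1 1 1] is rejected by the count comparison.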
Additional remarks:
Most of the time is spent on the line:
I = find(auxitem(j) == auxt,1,'first');
Is there any clever trick to speed up this line of code? I tried the following, for example, but it gave no speedup:
I = ipos(auxitem(j) == auxt); I = I(1);
where ipos is preallocated as:
ipos = 1:length(t);
Thanks in advance for any help ...
6 Comments
Jan on 9 May 2018
Edited: Jan on 9 May 2018
@Michal: If you provide a relevant set of example data, we could test the speed and the correctness of the suggested code.
What is the maximum value of t?
Do you have a C compiler installed?
Michal on 9 May 2018
Edited: Michal on 9 May 2018
Please read the discussion under Chad's answer.
The typical length of "t" is 30-100; the maximum value of the elements of "t" equals the maximum value of the elements of the array "item".
Yes, I have a C compiler.
Please keep in mind that the vector "t" is not sorted.
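Test data with these properties could be generated roughly as follows. This is only a sketch: the sizes are the ones mentioned in the question, while maxVal and the fixed seed are assumptions made for illustration. Each row of item is drawn from distinct positions of t, so every row is compatible by construction.

```matlab
% Sizes from the question: item is 10000 x 100, t is 1 x 200.
nRows  = 10000;
nCols  = 100;
nT     = 200;
maxVal = 20;                      % assumed value range 1..maxVal

rng(0);                           % fixed seed so runs are repeatable
t = randi(maxVal, 1, nT);         % unsorted, with repeated values

% Each row takes nCols distinct positions of t, so it can never use a
% value more often than t contains it.
item = zeros(nRows, nCols);
for k = 1:nRows
    item(k, :) = t(randperm(nT, nCols));
end

% Timing a candidate, assuming a function item2ind(item, t) is on the path:
% elapsed = timeit(@() item2ind(item, t));
```

Lowering maxVal increases the number of repeated values per row, which is the case that stresses the matching step most.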


Accepted Answer

Michal on 11 May 2018
Edited: Michal on 11 May 2018
The best solution so far:
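function ind = item2ind_new(item,t)
t = t(:);
[m,n] = size(item);
% Largest number of repetitions of any value in t
mct = max(accumarray(t,1));
% G{v}: sorted positions of value v in t
G = accumarray(t,1:length(t),[],@(x) {sort(x)});
% Zero-pad every position list to length mct and stack the lists into a matrix
G = cellfun(@(x) padarray(x.',[0 mct-length(x)],0,'post'), G, 'UniformOutput', false);
G = vertcat(G{:});
% C(i,j,k): number of columns k' <= k with item(i,k') == item(i,j)
C = cumsum(reshape(item,m,1,n)==item,3);
% ia(i,j) = C(i,j,j): occurrence rank of item(i,j) within row i
ia = C(sub2ind(size(C),repelem((1:m).',1,n),repelem(1:n,m,1),repelem(1:n,m,1)));
% Position in t of the ia-th copy of each item value
ind = G(sub2ind(size(G),item,ia));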
Any idea how to improve it?
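To make the two ingredients easier to follow, here is a small trace on the example data from the question. It is a sketch, not a replacement for the function above: the lookup table is built with a plain loop instead of padarray (which ships with the Image Processing Toolbox), and the occurrence ranks are read off column by column. The final assert is an added check, not part of the original code. The comparison relies on implicit expansion, so R2016b or later is assumed.

```matlab
item = [2 3 1; 2 1 2; 3 1 1; 1 3 3];
t    = [1 1 2 2 2 3 3];
t    = t(:);
[m, n] = size(item);

% Lookup table: row v lists the positions of value v in t, zero-padded.
counts = accumarray(t, 1);
G = zeros(numel(counts), max(counts));
for v = 1:numel(counts)
    G(v, 1:counts(v)) = find(t == v).';
end

% Occurrence rank: ia(i,j) counts how often item(i,j) appears in item(i,1:j).
C  = cumsum(reshape(item, m, 1, n) == item, 3);   % m x n x n
ia = zeros(m, n);
for j = 1:n
    ia(:, j) = C(:, j, j);
end

% Look up the position of the ia-th copy of each value.
ind = G(sub2ind(size(G), item, ia));

% A zero means a row asked for more copies of a value than t holds.
assert(all(ind(:) > 0), 'Incompatible content of item and t.');
disp(G)
disp(ind)
```

Because the table is zero-padded, a row that overuses a value can come back with a 0 entry rather than an error, so a check like the assert above may be worth keeping.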

More Answers (2)

Jan on 9 May 2018
Edited: Jan on 9 May 2018
function ind = item2ind(item, t)
    maxRun = length(t) + 1;
    [T, TI] = accumsort(t, maxRun);
    ind = zeros(size(item));
    for k = 1:size(item, 1)
        [aItem, aItemI] = accumsort(item(k, :), maxRun);
        % [m, index] = ismember(aItem, T);
        % Faster with undocumented function:
        [m, index] = builtin('_ismemberhelper', aItem, T);
        if all(m)
            ind(k, aItemI) = TI(index);
        else
            error('Incompatible item.');
        end
    end
end

function [T, TI] = accumsort(t, maxRun)
    [sortedT, TI] = sort(t);
    T = sortedT * maxRun;
    c = -1;
    for k = 1:numel(T)
        if T(k) ~= c
            d = 0;
            c = T(k);
        else
            d = d + 1;
        end
        T(k) = T(k) + d;
    end
end
For some test data of size [10'000 x 100] I get a runtime of 0.21 sec instead of 1.3 sec of the original version.
With ismember the runtime is 0.41 sec. Internally, ismember calls the helper function builtin('_ismemberhelper') for sorted data of type double. If the input is already known to be sorted, calling the internal function directly avoids the overhead.
If you have a C compiler, converting accumsort to a C MEX function would be useful.
maxRun must be a number greater than the highest number of repetitions in t. length(t)+1 is guaranteed to be larger.
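The key construction can be traced on a small vector. The sketch below is not the code that was timed above: it replaces the loop in accumsort with cumsum-based run indexing and uses the documented ismember, and the test vectors are made up for illustration.

```matlab
t      = [3 1 2 1 2 3 2];          % unsorted on purpose (made-up data)
maxRun = length(t) + 1;

[sortedT, TI] = sort(t);
% Running index inside each run of equal values: 0, 1, 2, ...
isStart  = [true, diff(sortedT) ~= 0];
startPos = find(isStart);
d        = (1:numel(sortedT)) - startPos(cumsum(isStart));
T        = sortedT * maxRun + d;   % one unique, ascending key per (value, occurrence)

% Encode one row of item the same way and match the keys.
row = [2 3 2];
[sortedRow, rowI] = sort(row);
isStartR  = [true, diff(sortedRow) ~= 0];
startPosR = find(isStartR);
dR        = (1:numel(sortedRow)) - startPosR(cumsum(isStartR));
[found, index] = ismember(sortedRow * maxRun + dR, T);
assert(all(found), 'Incompatible item.');

ind = zeros(size(row));
ind(rowI) = TI(index);             % undo the sort of the row
disp(T)
disp(ind)
```

The k-th copy of a value in the row gets the same key as the k-th copy of that value in t, so a single membership lookup pairs them up. A row that needs more copies than t holds produces a key that is absent from T.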
9 Comments
Wick on 10 May 2018
Interesting. Jan's code is always faster on my computer than the numbers you gave. My code is faster on my machine for the short runs but for the long runs your machine is faster. It appears your memory subsystem is a bit more efficient.
Michal on 10 May 2018
@Jan any progress or new ideas on your side?



Wick on 9 May 2018
Edited: Wick on 9 May 2018
Here you go. At the sizes you suggested, this shouldn't take too long. It has a single 'for' loop that cycles through the unique values of 't'.
I'm using logical indexing to identify all the elements in 'item' that match the given unique 't' and summing across the row. If the sum exceeds the number of times that value showed up in 't' you get your error. Otherwise I'm using 'cumsum' in a creative fashion (in my ever so humble opinion) to provide the indexes back to the location of the unique value in the original vector 't'.
Good Luck!
function ind = item2ind(item,t)
unique_t = unique(t);
ind = zeros(size(item));
try
    % a single 'for' loop as long as the unique elements of t
    for jj = 1:length(unique_t)
        O = zeros(size(item));
        O(item == unique_t(jj)) = 1;
        positions_of_t = [0 find(t == unique_t(jj))];
        % adding zero so sub_index call below will always reference a non-zero element
        sub_index = cumsum(O,2) .* O + 1;
        ind = ind + positions_of_t(sub_index);
        % this is why we needed the 0 in positions_of_t above
    end
catch
    error('Incompatible content of item and t.');
end
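To show what the cumsum step does, here is one pass of the loop traced by hand on the example data from the question, for the value 1 only. The variable names value and contribution are introduced just for this sketch.

```matlab
item  = [2 3 1; 2 1 2; 3 1 1; 1 3 3];
t     = [1 1 2 2 2 3 3];
value = 1;                                % one pass of the loop, for the value 1

O = double(item == value);                % 1 where item holds the value
positions_of_t = [0 find(t == value)];    % [0 1 2]: the leading 0 serves non-matching cells
sub_index = cumsum(O, 2) .* O + 1;        % 1 for non-matches, k+1 for the k-th match in a row
contribution = positions_of_t(sub_index); % what this pass adds to ind
disp(sub_index)
disp(contribution)
```

Non-matching cells index the leading 0 and add nothing, while the k-th match in a row picks the k-th position of the value in t. If a row holds more copies than t does, sub_index runs past the end of positions_of_t and the indexing error triggers the catch block.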
12 Comments
Michal on 9 May 2018
This is strange, my code works perfectly with both of the example data cases you mentioned above … ??!!
Wick on 9 May 2018
Jan,
My code is faster for small length 't' and much, much slower for large 't'. You vectorized in a completely different way than I did (and used an undocumented function but we won't use that against you). My question is, is there some rule of thumb my snippet of code didn't follow that I should change how I code things? I've always felt I was pretty good at vectorizing my MATLAB code but I've been coming here to learn how to be better. Obviously, you know some tricks I don't.
Thanks.
