Find the K most orthogonal vectors in a set of vectors

Hello All,
The context of this search is a step in tuning a spectral clustering routine a la Ng et al. 2002, where it provides the initialization point for k-means clustering in a higher-dimensional space. In particular, well-separated data ought to sit in K tight clusters on the surface of a hypersphere, so the purpose of this search is to find the locations of those cluster centroids, which then initialize k-means clustering of the same data.
I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple of times] K largest eigenvectors of the affinity matrix).
Ng et al. suggest: "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero: since the rows of Y are unit-normalized, the dot product of two rows is the cosine of the angle between them, so "closest to 90 degrees" means the absolute value of the dot product is closest to zero. This was quoted as computationally cheap (perhaps it is when clustering fewer points into, say, O(10^1) clusters, or perhaps they meant it in the sense that it requires fewer iterations of k-means once initialized), but my CPU is dragging ass at it.
So far I've tried 2 approaches:

Approach #1 - Compute everything then search
% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
Approach #2 - Simultaneous computation and search
%try a cheaper one?
n = size(Y,1);
dotProductY = zeros(n,K-1);   %preallocate; unfilled columns just add 0 to the row sums
centroidIdx = randi(n);       %initialize on a random row of Y
for k = 1:K-1
    dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %inner products with the newest centroid
    dotProductY(centroidIdx,:) = NaN;      %don't pick the same rows twice (min skips NaN rows)
    [~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
    centroidIdx(k+1) = im;                 %reassign
end
Neither of these approaches seems cheap to me. Has anyone else taken a stab at this before? Any suggestions?

Accepted Answer

Matt J
Matt J on 19 May 2016
Edited: Matt J on 19 May 2016
Your computation of dotProductY could be more efficient. The most vectorized way of computing it, I believe, is
dotProductY=abs(Y*Y.');
I expect that would have been the main bottleneck.
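As a sketch of how that plugs into Approach #1 (assuming, as stated in the question, that the rows of Y are unit-normalized), the bsxfun loop and the eye-based diagonal masking collapse to two lines:
G = abs(Y*Y.');             %all pairwise |dot products| in one matrix multiply
G(1:size(G,1)+1:end) = NaN; %exclude each row's dot product with itself (the diagonal)
Note that with O(10^4) rows this is still a dense 10^4-by-10^4 double matrix, on the order of 800 MB, so memory rather than arithmetic may become the limiting factor at that scale.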
  2 Comments
Matt J
Matt J on 19 May 2016
Edited: Matt J on 22 May 2016
This part
for k = 2:K
    [~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid (min skips NaN rows)
    centroidIdx(k) = im;                             %reassign
    dotProductY(centroidIdx,:) = NaN;                %don't pick the same row twice
end
also looks like it could be incrementalized as follows
im = randi(size(Y,1));
centroidIdx(1) = im;
temp = dotProductY(:,im); %running sum of |dot products| with the chosen centroids
temp(im) = inf;           %exclude the first centroid from later picks
for k = 2:K
    [~,im] = min(temp);   %find next best centroid
    centroidIdx(k) = im;  %reassign
    temp = temp + dotProductY(:,im); %update temp
    temp(im) = inf;       %don't pick the same row twice
end
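For reference, here is a self-contained sketch combining the two suggestions above (an illustrative assembly rather than code from the thread; the kmeans 'Start' usage at the end is an assumed way to consume the seeds):
n = size(Y,1);
G = abs(Y*Y.');             %all pairwise |dot products| (|cosines|, for unit-normalized rows)
centroidIdx = zeros(1,K);
centroidIdx(1) = randi(n);  %random first centroid
temp = G(:,centroidIdx(1)); %running sum of |dot products| with the chosen centroids
temp(centroidIdx(1)) = inf; %never re-pick a chosen row
for k = 2:K
    [~,im] = min(temp);     %row most orthogonal, in aggregate, to the centroids so far
    centroidIdx(k) = im;
    temp = temp + G(:,im);
    temp(im) = inf;
end
seeds = Y(centroidIdx,:);   %e.g. kmeans(Y, K, 'Start', seeds)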
Peter Cook
Peter Cook on 8 Jun 2016
Thanks for the help, I can't believe I had that boneheaded dotProductY computation in there. The algorithm runtime is still quite slow, but I accept that this is to be expected for most clustering algorithms with this amount of data.


More Answers (0)
