BUG (#2)? kmeans is sensitive to row (point) order

micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
Dear All,
I have noticed that kmeans gives different results when the same points are supplied in a different row order!
In my opinion this does not make sense.
I would expect the row order of the data matrix to have no effect on the centroid locations when the random number generator is set to a fixed seed.
Can anybody explain this?
clear; close all; clc;
nPoints = 100;
nDimensions = 2;
nClusters = 3;
data = rand(nPoints, nDimensions); % points from a uniform distribution
dataReordered = sortrows(data);    % same points, different row order
scatter(data(:,1), data(:,2), 'b')
rndGenSeed = 1;
%% cluster data in the original row order
rng(rndGenSeed) % set the random generator's seed
[~, clusters] = kmeans(data, nClusters)
hold on
scatter(clusters(:,1), clusters(:,2), 'rv') % red triangles
hold off
%% cluster reordered data
rng(rndGenSeed) % set the random generator's seed - same seed
[~, clusters_sh] = kmeans(dataReordered, nClusters)
hold on
scatter(dataReordered(:,1), dataReordered(:,2), 'k*') % control - plot the reordered points - they should be in the same spots
scatter(clusters_sh(:,1), clusters_sh(:,2), 'gv') % green triangles - these should cover the red triangles
hold off
grid on
  1 Comment
micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
I think I have a clue, but I would appreciate it if somebody from the MathWorks team could verify it.
My reasoning is this:
  1. kmeans needs to choose initial cluster positions. It may randomly select k of the INPUT POINTS as starting positions.
  2. If you call rng(seed) with a constant seed, you always get the SAME row indices into the data matrix as the starting cluster positions.
  3. If you shuffle the input data (the point locations are the same; only their order in the data matrix changes), then even with rng(seed) and a constant seed you get the SAME row indices, BUT the points at those indices are DIFFERENT!
  4. That means kmeans can converge differently for shuffled input data.
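The steps above can be illustrated with a small sketch. It is not the actual kmeans initialization code: randperm is used here only as a stand-in (my assumption) for whatever random draw kmeans performs internally, and the variable names are made up for the illustration.

```matlab
% Sketch only: randperm stands in for the random draw kmeans makes internally.
data = rand(100, 2);              % 100 random 2-D points
dataReordered = sortrows(data);   % same points, different row order
k = 3;                            % number of starting points to pick

rng(1);                                   % fixed seed
idxA = randperm(size(data, 1), k);        % row indices drawn for the original order
rng(1);                                   % same seed again
idxB = randperm(size(dataReordered, 1), k);

sameIndices = isequal(idxA, idxB);        % same seed gives the same row indices

startA = data(idxA, :);                   % points found at those indices
startB = dataReordered(idxB, :);          % generally different points

disp(sameIndices)
disp(startA)
disp(startB)
```

The indices are identical because they depend only on the seed and the number of rows, but the rows they point to generally hold different points once the matrix has been reordered.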
This would also explain my puzzle in another question: https://www.mathworks.com/matlabcentral/answers/448832-bug-evalclusters-is-sensitive-to-rows-points-order
What do you think, MathWorks experts? :) Does k-means select input data points as the starting centroid locations?
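One way to check this idea, as a sketch rather than a confirmed answer, is to take the random initialization out of the picture by passing the same explicit starting centroids to both calls through the 'Start' name-value argument of kmeans (Statistics and Machine Learning Toolbox). The choice of the first nClusters rows as starting centroids is arbitrary and only for illustration.

```matlab
% Sketch: give both calls identical starting centroids, so the row order
% can no longer influence the initialization.
rng(1);
data = rand(100, 2);
dataReordered = sortrows(data);        % same points, different row order
nClusters = 3;

startCentroids = data(1:nClusters, :); % arbitrary but fixed starting positions

[~, cA] = kmeans(data, nClusters, 'Start', startCentroids);
[~, cB] = kmeans(dataReordered, nClusters, 'Start', startCentroids);

% Largest coordinate-wise difference between the two sets of centroids
maxDiff = max(abs(cA(:) - cB(:)));
fprintf('Largest centroid difference: %g\n', maxDiff);
```

If the explanation above is right, I would expect the two sets of centroids to agree up to floating-point rounding, assuming default options (in particular with the online update phase left off). A large difference would suggest that something other than the initialization depends on the row order.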


Answers (0)