What if clustering and all data belong to the same group?
9 views (last 30 days)
Hi,
I'm researching some binary clustering algorithms (and unsupervised learning in general) for a problem I'm currently working on. Clearly there are numerous options and solutions out there. By and large, the clustering works fine and as expected. However, I occasionally run into a problem when my data is essentially "ideal". By that I mean every data point in the set belongs to the same cluster (this is only known a priori for testing purposes, and won't be in practice). What happens then is that most elements are clustered together into group A, while the "furthest" data point(s) are classified into the other, group B, even though they belong to A.
I was at first surprised by these results, but upon further consideration realized that perhaps they are not so unexpected within the realm of binary classification.
The nature of the data is incredibly simple and could, in theory, be distilled down to something as simple as a set of 1D values, where "high" values are true and "low" values are false. In this example, false values are the equivalent of noise and are ultimately disregarded. If we use k-means as an example, the mandatory second argument is the number of clusters. As such, the algorithm forces 2-cluster separability. A quick example might look something like this:
s = [3.58470; 3.7784; 3.6453; 3.5204; 3.3012; 3.5844; 3.5048; 3.5244; 3.5038];
clusterS = kmeans(s,2)'
clusterS =
1 1 1 1 2 1 1 1 1
t = [3.58470; 3.7784; 3.6453; 0.5204; 3.3012; 3.5844; 0.5048; 0.5244; 3.5038];
clusterT = kmeans(t,2)'
clusterT =
1 1 1 2 1 1 2 2 1
(Note that k-means doesn't always produce the same result)
In this example, the result that I want would be:
clusterS =
1 1 1 1 1 1 1 1 1
clusterT =
1 1 1 2 1 1 2 2 1
As of now, I'm considering two options:
- Perform fuzzy clustering (fcm) and set some arbitrary threshold for pushing data into one group or the other. Though I should note that, in the interest of full automation, I'm really trying to avoid inputs of this nature.
- Introduce some small perturbation(s) in the data to simulate noise, providing at least one data point for the false category (this could perhaps handle the scenario in which all the other points are true).
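For what it's worth, the first option could be sketched roughly like this, assuming the Fuzzy Logic Toolbox's fcm; the 0.9 membership threshold below is an arbitrary placeholder for illustration, not a recommendation:

```
% Sketch: fuzzy c-means with a membership threshold to flag ambiguous points
t = [3.58470; 3.7784; 3.6453; 0.5204; 3.3012; 3.5844; 0.5048; 0.5244; 3.5038];
[centers, U] = fcm(t, 2);   % U is a 2-by-numel(t) membership matrix
[maxU, idx] = max(U);       % hard assignment and its membership degree
ambiguous = maxU < 0.9;     % points whose membership never clears the threshold
```

Points flagged as ambiguous could then be forced into the "true" group (or handled separately) rather than accepting the hard 2-cluster split, though as noted this still requires choosing a threshold.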
At any rate, I'm curious whether this is a well-known problem with unsupervised learning, whether there are any practical solutions, or whether it's even a valid dilemma for clustering (am I violating some primary assumption here)?
Thanks!
2 Comments
Walter Roberson
on 14 Apr 2019
This is a general problem with k-means.
With k-means, the optimal number of clusters is the same as the number of unique datapoints, because you can always reduce the sum of squares of distances to centers by adding an additional center located at a point that is not already a center.
Likewise, k-means does not know that all of the data is in one class: all it knows is that it can reduce the sum-of-squares of distances by assigning the furthest point to a second class.
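You can see this directly with kmeans' third output, the within-cluster sums of squared point-to-centroid distances: on the single-cluster s data from the question, k = 2 will typically report a smaller total than k = 1, even though the data form one cluster (Replicates is used here to reduce local-minimum effects):

```
s = [3.58470; 3.7784; 3.6453; 3.5204; 3.3012; 3.5844; 3.5048; 3.5244; 3.5038];
[~, ~, sumd1] = kmeans(s, 1);                    % total within-cluster sum of squares, k = 1
[~, ~, sumd2] = kmeans(s, 2, 'Replicates', 5);   % k = 2 can only shave this down further
fprintf('k=1: %g   k=2: %g\n', sum(sumd1), sum(sumd2));
```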
Answers (1)
Image Analyst
on 14 Apr 2019
Automatic segmentation is not always good. For example, you might try to threshold a gray scale image with kmeans(), as in the attached example, but that only works well if you have nice, well-separated clusters. If you don't have distinct clusters but something like a single shotgun-blast cluster, or randomly distributed points, then you may really have only one cluster. But you're telling kmeans() to find 2, so it will find 2 because you're forcing it to, and whatever it finds may be no good. That's why it's often better NOT to do an automatic segmentation, but to use a fixed threshold. For example, if you have foreground and background gray levels, telling it to find 2 classes may be okay when there is a bright thing on a dark background, but it will not be good if the image is 100% background or 100% foreground. It won't find a good threshold, whereas a fixed threshold will accurately tell you that you have 100% class 1 (background) or 100% class 2 (foreground).
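Using the s and t vectors from the question, a fixed threshold behaves exactly this way; the cutoff value of 2 below is an arbitrary choice for illustration:

```
s = [3.58470; 3.7784; 3.6453; 3.5204; 3.3012; 3.5844; 3.5048; 3.5244; 3.5038];
t = [3.58470; 3.7784; 3.6453; 0.5204; 3.3012; 3.5844; 0.5048; 0.5244; 3.5038];
threshold = 2;                      % fixed cutoff, picked for illustration
clusterS = (s <= threshold)' + 1    % all 1s: every point lands in the "high" class
clusterT = (t <= threshold)' + 1    % 1 1 1 2 1 1 2 2 1, matching the desired output
```

Unlike the forced 2-cluster kmeans() split, this returns the all-ones labeling for s that the question asks for, at the cost of having to pick the threshold.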
2 Comments
Image Analyst
on 15 Apr 2019
Automatic/dynamic thresholds are sometimes used where your foreground and background both drift in tandem, which they shouldn't if you have good control over your imaging situation. But sometimes that happens, for example in outdoor scenes where the sunlight varies with time of day, obscuration by clouds, etc., and you have no control over that.
I have no idea what more extreme situations you're dealing with, so I can't comment. Anyway, I believe there are functions in the toolbox to tell you how much confidence you should have in the clusters it came up with. I've seen them but don't remember what they are off the top of my head.
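I can't say whether these are the functions meant above, but the Statistics and Machine Learning Toolbox does have evalclusters for comparing candidate cluster counts, and its gap criterion allows k = 1, so it can in principle report that a single cluster fits best (silhouette, by contrast, requires k >= 2):

```
% Sketch: let the gap criterion choose among 1, 2, or 3 clusters
s = [3.58470; 3.7784; 3.6453; 3.5204; 3.3012; 3.5844; 3.5048; 3.5244; 3.5038];
E = evalclusters(s, 'kmeans', 'gap', 'KList', 1:3);
E.OptimalK    % ideally 1 for data that form a single cluster
```

With only 9 points the criterion may be noisy, so treat this as a starting point rather than a guarantee.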