What if, when clustering, all the data belong to the same group?

Hi,
I'm researching some binary clustering algorithms (and unsupervised learning in general) for a problem I'm currently working on. Clearly there are numerous options and solutions out there, and by and large the clustering works fine and as expected. However, I occasionally run into a problem when my data is essentially "ideal". By that I mean every data point in the set belongs to the same cluster (this is only known a priori for testing purposes, but won't be in practice). What happens then is that most elements are clustered together into group A, and the "furthest" data point(s) is/are classified into the other group, B, even though they belong to A.
I was at first surprised by these results, but upon further consideration realized that they are perhaps not so unexpected within the realm of binary classification.
The nature of the data is incredibly simple and could, in theory, be distilled down to something as simple as a set of 1D values, where "high" values are true and "low" values are false. In this example, false values are the equivalent of noise and are ultimately disregarded. If we use k-means as an example, the mandatory second argument is the number of clusters. As such, the algorithm forces 2-cluster separability. A quick example may look something like this:
s = [3.58470; 3.7784; 3.6453; 3.5204; 3.3012; 3.5844; 3.5048; 3.5244; 3.5038];
clusterS = kmeans(s,2)'
clusterS =
1 1 1 1 2 1 1 1 1
t = [3.58470; 3.7784; 3.6453; 0.5204; 3.3012; 3.5844; 0.5048; 0.5244; 3.5038];
clusterT = kmeans(t,2)'
clusterT =
1 1 1 2 1 1 2 2 1
(Note that k-means doesn't always produce the same result, since its initial centroids are chosen randomly)
In this example, the result that I want would be:
clusterS =
1 1 1 1 1 1 1 1 1
clusterT =
1 1 1 2 1 1 2 2 1
As of now, I'm considering two options:
  1. Perform fuzzy clustering (fcm) and set some arbitrary threshold for pushing data into one group or the other. Though I should note that, in the interest of full automation, I'm really trying to avoid any inputs of this nature.
  2. Introduce some small perturbation(s) in the data to simulate noise that will provide at least one data point for the false category (this could perhaps solve the scenario where all the other points are true; see the sketch below).
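For what it's worth, a minimal sketch of option 2 on the example data above; the sentinel value 0 and the replicate count are assumptions for illustration, not tested recommendations:
sentinel = 0;                                 % artificial "false"/noise point
idx = kmeans([s; sentinel], 2, 'Replicates', 5);
idx(end) = [];                                % discard the sentinel's label afterwards
% For s, the sentinel soaks up the second cluster, so all real points share
% one label; for t, it simply joins the existing low/noise cluster.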
At any rate, I'm curious whether this is a well-known problem with unsupervised learning and whether there are any practical solutions, or whether it's even a valid dilemma for clustering in the first place (am I violating some primary assumption here)?
Thanks!
  2 Comments
Walter Roberson on 14 Apr 2019
This is a general problem with k-means.
With k-means, the optimal number of clusters is the same as the number of unique datapoints, because you can always reduce the sum of squares of distances to centers by adding an additional center located at a point that is not already a center.
Likewise, k-means does not know that all of the data is in one class: all it knows is that it can reduce the sum-of-squares of distances by assigning the furthest point to a second class.
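To illustrate the point numerically, a quick sketch (the seed and replicate count are arbitrary, just for reproducibility):
rng(0);                                       % reproducible random data
x = randn(50,1);                              % 50 points from a single Gaussian, i.e. one true cluster
for k = 1:4
    [~, ~, sumd] = kmeans(x, k, 'Replicates', 5);
    fprintf('k = %d, total within-cluster SSE = %.3f\n', k, sum(sumd));
end
% The total sum of squares keeps shrinking as k grows, even though there is
% no real second cluster, which is why kmeans happily splits unary data.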
K Joe on 15 Apr 2019
Thank you for the response. I agree: I feel that I am violating the primary assumption of kmeans here. I mean, I'm forcing the function to look for 2 clusters even though there is only one. The optimization is simply maximizing separability, so of course one or more of the extrema in the set will be classified into the second cluster.
Perhaps I'm looking at this all wrong and need to move beyond the realm of learning/clustering....


Answers (1)

Image Analyst on 14 Apr 2019
Automatic segmentation is not always good. For example, you might try to threshold a gray-scale image with kmeans, like in the attached example, but that only works well if you have nice, well-separated clusters. If you don't have clusters, but instead something like a single shotgun-blast cluster or randomly distributed points, then you may really have only one cluster. But you're telling kmeans() to find 2, so it will find 2 because you're forcing it to, and whatever it finds may be no good.
That's why it's often better NOT to do an automatic segmentation, but to use a fixed segmentation instead. For example, if you have foreground and background gray levels, telling it to find 2 classes may be okay when there is a bright thing on a dark background, but it will not be good if the image is 100% background or 100% foreground. kmeans won't find a good threshold there, whereas a fixed threshold will accurately tell you that you have 100% class 1 (background) or 100% class 2 (foreground).
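As a concrete 1-D version of the fixed-threshold idea, using the poster's example vectors; the cutoff 2.0 is an assumption chosen for that data, not a general value:
threshold = 2.0;                 % fixed cutoff between "low"/noise and "high"/true
sClass = 2 - (s > threshold);    % 1 = high/true, 2 = low/noise
tClass = 2 - (t > threshold);
% sClass comes out all 1s (100% "true"), and tClass flags exactly the three
% low points as class 2, which is the result the question asked for.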
  2 Comments
K Joe on 15 Apr 2019
Thank you for the response; the imaging analogy is spot-on. In terms of a fixed threshold, I'm hesitant to use one because the nature of the input data tends to be widely variable. I see a fixed threshold as presenting some serious problems in the outer, more extreme cases that I'm likely to come across.
As a similar idea though, are you familiar with setting a "dynamic threshold"? That is, a fixed threshold of sorts, but that is adjusted appropriately for the given input characteristics?
I'm still leaning toward the idea of adding inconsequential noise to the data, just to ensure the clustering has at least one noise bin to cluster to.
However, I did just think of another possibility: a pre-pre-pre-processing step could be employed to detect whether the dataset is binary or unary. I'm not familiar with any such test, but I suppose it could be something as simple as using a histogram to check whether the distribution has 2 groups or just 1. Or perhaps do the clustering first, then look at the p-value between groups, haha.
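One hedged sketch of that unary-vs-binary pre-check: compare the separation between the two k-means centroids to the within-cluster spread, and collapse to one cluster when the separation is weak. The factor 4 is an arbitrary cutoff that happens to work on the s and t examples above and would need tuning in general:
[idx, C, sumd] = kmeans(t, 2, 'Replicates', 5);
gapBetween   = abs(C(1) - C(2));              % distance between the two centroids
spreadWithin = sqrt(sum(sumd) / numel(t));    % pooled within-cluster spread
if gapBetween < 4 * spreadWithin
    idx(:) = 1;                               % separation too weak: call it one cluster
end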
Finally, another avenue I've explored is something called "outlier detection". The best resources I could find for that were texts that I don't have access to (and seemingly nothing in MATLAB R2014a). If anyone happens to know of any great resources for that, that'd be greatly appreciated!
(P.S. I apologize for the rambling stream of consciousness here - I'm kinda answering/asking/thinking all at once)
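Regarding the dynamic-threshold question above, one possible sketch is Otsu's method via graythresh (Image Processing Toolbox), rescaling the 1-D data into [0,1] first. Treating non-image data this way is an assumption, and note that Otsu has the same failure mode as kmeans on unary data:
lo = min(t); hi = max(t);                     % assumes the data is not constant
level = graythresh((t - lo) / (hi - lo));     % Otsu threshold on the rescaled data
threshold = lo + level * (hi - lo);           % map back to the original units
tClass = 2 - (t > threshold);                 % 1 = high/true, 2 = low/noise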
Image Analyst on 15 Apr 2019
Automatic/dynamic thresholds are sometimes used where your foreground and background both drift in tandem, which they should not do if you have good control over your imaging situation. But sometimes that can happen, for example in outdoor scenes where the sunlight varies depending on time of day, obscuration by clouds, etc., and you have no control over that.
I have no idea what the more extreme situations you're dealing with are, so I can't comment. Anyway, I believe there are functions in the toolbox to tell you how much confidence you should have in the clusters it came up with. I've seen them but don't remember what they are off the top of my head.
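One such function may be evalclusters (Statistics Toolbox, R2013b or later, I believe). Its gap criterion allows K = 1, so it can report that no split is warranted at all; a sketch, untested on the poster's data:
eva = evalclusters(s, 'kmeans', 'gap', 'KList', 1:3);
k = eva.OptimalK;                             % ideally 1 for s and 2 for t
idx = kmeans(s, k);                           % only split if the criterion finds clusters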

