using fitgmdist to separate data in clusters

Question

Pieter van Doorn on 29 Nov 2021

https://it.mathworks.com/matlabcentral/answers/1598694-using-fitgmdist-to-separate-data-in-clusters

Answered: Abhimenyu on 14 Apr 2024

I have the following code to fit a trimodal distribution to my data, separate the data into clusters, and plot the clusters as histograms:

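% Fit a three-component Gaussian mixture model to the data
GMmodel = fitgmdist(Data,3);
% Assign each observation to its most probable component
idx = cluster(GMmodel, Data);
cluster1 = Data(idx==1,:);
cluster2 = Data(idx==2,:);
cluster3 = Data(idx==3,:);
% Overlay a histogram with a fitted normal curve for each cluster
histfit(cluster1)
hold on
histfit(cluster2)
histfit(cluster3)
hold off;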

I ran this code on two different datasets (Walkingspeed 2 and 3) and now get the following figure:

What happened in Walkingspeed 2? Why did it create a cluster in the middle of another cluster? My goal is to separate the modes. Any help would be very much appreciated.


Answer 1

Abhimenyu on 14 Apr 2024


Hi Pieter,

From what you have shared, you are fitting a three-component Gaussian mixture to your data and getting an unexpected clustering for "Walkingspeed 2". This is not an error in the code itself. A cluster appearing in the middle of another can be attributed to several factors related to the nature of the data and the assumptions behind Gaussian mixture models (GMMs). The following factors could contribute to this behavior:

Overlapping Data: If the underlying distributions overlap substantially, points in the green cluster can fall in the same range as points in the yellow cluster. A GMM does not split the axis into contiguous intervals. It can place a narrow component on top of a broad one, and cluster then assigns each point to the component with the highest posterior probability, so the points near the narrow component's mean form a cluster inside the larger one.
Noise or Variability: Noise or variability in the data can cause unexpected clustering results. Even if the underlying distribution is trimodal, noisy data can create spurious additional modes.
Model Complexity: Fitting three components assumes that the data is best represented by three underlying Gaussian distributions. If the actual distribution doesn't match this assumption (e.g., if there are not three modes, or if the variance within a mode is large), the model may fit small, noisy clusters that don’t correspond to meaningful modes.
Initialization: GMMs are fitted with an iterative optimization technique (Expectation-Maximization) that is sensitive to initialization. It can converge to a local optimum of the likelihood that is not the globally best fit for your data. The initial guesses for the means, variances, and mixing proportions can significantly affect the final model.
Data Preprocessing: Please ensure that your data is appropriately scaled, centered, and cleaned. Outliers or poorly conditioned data can distort the clustering.
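
One way to check which of the factors above is at play is to inspect the fitted components directly. The sketch below is only an illustration: the synthetic Data vector and its three modes are assumptions standing in for the walking-speed data, which is not available here. It lists the mean, standard deviation, and mixing proportion of each component, sorted by mean, since fitgmdist returns components in arbitrary order.

```matlab
% Sketch only: synthetic 1-D data stands in for the walking-speed data.
rng(1);                                   % reproducible random numbers
Data = [0.8 + 0.10*randn(200,1);          % hypothetical slow mode
        1.3 + 0.10*randn(300,1);          % hypothetical medium mode
        1.9 + 0.15*randn(200,1)];         % hypothetical fast mode

GMmodel = fitgmdist(Data, 3);

% Collect the parameters of each component as column vectors.
mu    = GMmodel.mu(:);                    % k-by-1 component means
sigma = sqrt(squeeze(GMmodel.Sigma));     % k-by-1 standard deviations (1-D data)
w     = GMmodel.ComponentProportion(:);   % k-by-1 mixing proportions

% Component order is arbitrary, so sort everything by mean.
[~, order] = sort(mu);
summary = table(mu(order), sigma(order), w(order), ...
    'VariableNames', {'Mean', 'Std', 'Proportion'});
disp(summary)
```

A component whose mean sits close to a neighbor's while its standard deviation is much smaller matches the "cluster inside a cluster" pattern from the question.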

The following steps may help to improve the separation:

Evaluate Data: Please visualize the data using histograms or density plots before fitting the model. This shows whether a three-component model is appropriate or whether a different number of components would fit better.
Adjust Model Complexity: Please experiment with different numbers of components (e.g., two or four) to see whether the separation improves.
Evaluate Initialization: Please run the GMM fit with several different initializations and keep the fit with the highest likelihood, which is more likely to be close to the global optimum.
Regularization: Adding regularization to the covariance matrices can help prevent overfitting and may lead to more stable clustering.
Consider Other Clustering Algorithms: GMMs assume Gaussian components. If your data doesn’t fit this assumption, explore other clustering methods such as DBSCAN, k-means, or hierarchical clustering, which might be better suited to your data's characteristics.
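
The steps above can be sketched in a few lines. This is a minimal illustration and not a tuned solution: the synthetic Data vector, the candidate range 1 to 5, the 10 replicates, and the regularization value 1e-4 are all assumptions chosen for the example. It compares component counts by BIC, restarts EM several times per fit, regularizes the variances, and relabels the clusters by increasing mean.

```matlab
% Sketch only: synthetic 1-D data stands in for the walking-speed data.
rng(1);
Data = [0.8 + 0.10*randn(200,1);          % hypothetical slow mode
        1.3 + 0.10*randn(300,1);          % hypothetical medium mode
        1.9 + 0.15*randn(200,1)];         % hypothetical fast mode

opts   = statset('MaxIter', 1000);        % give EM more iterations to converge
maxK   = 5;                               % largest component count to try (assumed)
bic    = zeros(maxK, 1);
models = cell(maxK, 1);
for k = 1:maxK
    models{k} = fitgmdist(Data, k, ...
        'Replicates', 10, ...             % 10 starts; the best log-likelihood is kept
        'RegularizationValue', 1e-4, ...  % small value added to each variance
        'Options', opts);
    bic(k) = models{k}.BIC;
end
[~, bestK] = min(bic);                    % lower BIC is better
best = models{bestK};

% Hard-assign points, then relabel so that label 1 has the lowest mean.
idx = cluster(best, Data);
[~, order] = sort(best.mu);
newLabel = zeros(bestK, 1);
newLabel(order) = 1:bestK;                % newLabel(c) = rank of component c by mean
idxSorted = newLabel(idx);

% Compare the data with the fitted mixture density.
histogram(Data, 'Normalization', 'pdf');
hold on
x = linspace(min(Data), max(Data), 400)';
plot(x, pdf(best, x), 'LineWidth', 1.5);
hold off
```

If the BIC prefers fewer than three components on the real data, that suggests the third mode is not well supported, and forcing three components is likely what produces the nested cluster.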