
How to cluster a dataset and remove outliers in MATLAB

Hello, I have the following dataset, which has four features, one per column.
I want to cluster the dataset. I have looked at k-means, but it requires the number of clusters as input.

Answers (2)

Sai Pavan on 17 Apr 2024
Hello,
I understand that you want to cluster the 4-feature dataset and remove its outliers. This can be done with the following workflow:
  • Determine the number of clusters: The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters; the "elbow" point, where the rate of decrease changes sharply, is often a good choice for the number of clusters.
  • Perform k-means clustering: Run k-means with the number of clusters chosen in the previous step.
  • Remove outliers: Outliers can be detected from their distance to the centroid of their assigned cluster. A common approach is to remove points whose distance exceeds a chosen threshold.
The following code snippet illustrates this workflow:
data = Dataset;
maxK = 10; % Test up to 10 clusters
wcss = zeros(1, maxK);
for k = 1:maxK
    [~, ~, sumd] = kmeans(data, k, 'Replicates', 10);
    wcss(k) = sum(sumd); % total within-cluster sum of squares for this k
end
plot(1:maxK, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = 3; % placeholder value: replace with the number of clusters you read off the elbow plot
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
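As a possible alternative to the loop above, `kmeans` can return a fourth output holding the distance from every point to every centroid, so each point's distance to its own centroid can be picked out by indexing. The sketch below is only an illustration: the synthetic `data` matrix and `k = 2` are assumptions standing in for `Dataset` and `optimalK`. With the default `'sqeuclidean'` metric the returned values are squared distances, hence the `sqrt`.

```matlab
rng(0);                                    % reproducible synthetic example (assumption)
data = [randn(100, 4); randn(100, 4) + 6]; % stand-in for Dataset: two 4-feature blobs
k = 2;                                     % hypothetical cluster count

% D is n-by-k: squared Euclidean distance from each point to each centroid
[idx, C, ~, D] = kmeans(data, k, 'Replicates', 10);

% For every row, pick the column belonging to its assigned cluster
n = size(data, 1);
distances = sqrt(D(sub2ind(size(D), (1:n)', idx)));

% Same 95th-percentile rule as in the snippet above
threshold = prctile(distances, 95);
outliers = distances > threshold;
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
```

This avoids recomputing distances by hand, at the cost of holding the full n-by-k matrix in memory.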
Hope it helps!
  2 Comments
Med Future on 23 Apr 2024
Edited: Walter Roberson on 24 Apr 2024
I have combined the code you shared with my own code, but I still get the error "Arrays have incompatible sizes for this operation." I have attached the dataset and the code below; please modify the code to fix this. From the ground truth I know there should be only one cluster, and the remaining points are noise, based on the distance calculation.
load Question
dataset1=data(:,[2 4]);
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
centroid = C(i, :);
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);



Walter Roberson on 24 Apr 2024
Moved: Walter Roberson on 24 Apr 2024
load Question
dataset1=data(:,[2 4]);
dataset1 is created from 2 columns of data
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
C is created from dataset1 so it has two columns
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
clusterPoints = data(idx == i, :);
data has 6 columns, so clusterPoints has 6 columns
centroid = C(i, :);
centroid is created from C so it has two columns
whos clusterPoints centroid
distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
You are trying to subtract something with 2 columns from something with 6 columns, which is an error
end
  Name               Size            Bytes  Class     Attributes

  centroid           1x2                16  double
  clusterPoints    177x6              8496  double
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
  1 Comment
Med Future on 25 Apr 2024
@Walter Roberson Thank you for the detailed explanation. Basically, the problem is to refine the clusters that k-means has already produced; that is, I want to remove the outliers. As you can see, recalculating the distance of each cluster's points from its centroid is where the error occurs. Can you please help me solve this problem?
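Following the diagnosis above, one way to remove the size mismatch is to measure distances in the same feature space that was clustered, that is, to subtract the 2-column centroid from rows of `dataset1` (2 columns) rather than rows of `data` (6 columns). The sketch below rests on stated assumptions: `Question.mat` is not reproduced here, so a synthetic 6-column `data` matrix stands in for it, and `K` is fixed by hand rather than taken from `evalclusters`.

```matlab
rng(1);                                    % reproducible synthetic stand-in (assumption)
data = [randn(180, 6); randn(20, 6) * 8];  % hypothetical 6-column data: dense group plus noise
dataset1 = data(:, [2 4]);                 % cluster on columns 2 and 4, as in the question

K = 2;                                     % hypothetical; the question takes K from evalclusters
[idx, C] = kmeans(dataset1, K, 'Replicates', 10);

% Distances must be computed on the clustered columns, because C is K-by-2
distances = zeros(size(dataset1, 1), 1);
for i = 1:K
    clusterPoints = dataset1(idx == i, :);  % n_i-by-2, same width as centroid
    centroid = C(i, :);                     % 1-by-2
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end

threshold = prctile(distances, 95);        % same 95th-percentile rule as the answer
outliers = distances > threshold;

% The logical mask selects rows, so it also applies to the full 6-column matrix
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
```

Because `kmeans` was run on every row of `dataset1`, `idx` and `outliers` have one entry per row of `data`, which is why the mask can be reused on the full matrix. Whether a 95th-percentile cut suits a "one cluster plus noise" dataset depends on how much noise is present, so the threshold is worth checking against the actual data.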



Release

R2022a
