Draw a curve defining the bounds of a scatter plot to detect anomalous (sparse or low density) points
Mostra commenti meno recenti
Dear all, I have thousands of x,y data paris of observed and theoretical temperatures in a given region, where I need to identify some anomalous (for the purpose of my work) data points. I have tried PCA analysys (i.e., DBSCAN), statistical thresholds, curve fittings, etc., but none of them did work. I then tried a KDE-based approach, which seems to 'better' work for my data. To achieve the results shown below I used https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note, I modified the code to export hAxes,col,ctrs1,ctrs2 and plot
% the contour line as:
contour(ctrs1,ctrs2,F,0.0015,'k-'); % or 0.015 ---> See figure below
By using this apporach I was able to draw a contour line around the main cluster of my data, and I was mostly able to draw the line between 'main cluster' and 'anomalous' points. However, if I have too many 'anomalous' data points (see xy4 in the figure below), the method, using a fixed threshold, fails. Beside, I have to adjust the threshold based on the region of each x,y pair and I have no idea on how to find the correct threshold level (see the figure below - For the same region, a threshold level of 0.0015 seems to work for situations with a few anomalous points, but a threshold of 0.015 is needed when more spreaded points occurred).
What I would really like to do is to draw a curve around the main cluster of my scatterplot, so to devide the anomalous data points. I fully understand this may be a challenging task, but I hope you may provide some good alternatives and/or solutions.
Another solution, as it seems to work, may be defining automatically the density threshold level, but I don't really know where to start from.
Below, you can see 4 examples (xy 1 to 4 are also attached as .mat files). To the left is the simple x,y scatterplot. In the middle, the ideal curve (in red) defining the boundary of my scatterplot (manually sketched). To the right, the KDE-based approach discussed above. Please note the last figure where a threshold of 0.0015 fails, and a threshld of 0.015 is needed instead.

Any help is grately appreciated!
Any other approach to identify the points outside the red boundaries is more than wellcome!
Thanks a lot in advance
Risposta accettata
Più risposte (1)
Mathieu NOE
il 13 Set 2023
hello
maybe this ?
I simply decided that I wanted the contour line corresponding to a level 5% of F range above the min value of F
Result :

[hAxes,col,ctrs1,ctrs2,F] = dscatter(xy4(:,1),xy4(:,2),'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
threshold = lb+0.05*(ub-lb); % threshold is 5% of data range above min value
contour(ctrs1,ctrs2,F,[threshold threshold],'k-'); %
hold off
9 Commenti
Simone A.
il 13 Set 2023
Mathieu NOE
il 14 Set 2023
mayba this ?
Sets=["xy1" "xy2" "xy3" "xy4"];
threshold = 0.03; % threshold is 3 % of F range above min value
s = 0.95; % k = boundary(___,s) specifies shrink factor s using any of the previous syntaxes.
for k=1:numel(Sets)
xy=load(Sets(k)).(Sets(k));
x = xy(:,1);
y = xy(:,2);
figure(k)
[hAxes,col,ctrs1,ctrs2,F] = dscatter(x,y,'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
thr = lb+threshold*(ub-lb);
[row,col]=find(F>thr);
m = boundary(row,col,s);
plot(ctrs1(col(m)),ctrs2(row(m)),'r')
hold off
end
Mathieu NOE
il 14 Set 2023
ok
I am not sure there is a magic tool or AI that can decide for each case how to optimize the parameters...
Simone A.
il 14 Set 2023
Mathieu NOE
il 15 Set 2023
hello again
I tried to simplify a bit my code , so the shrink factor is simply the one by default - one parameter less to play with
remains only the threshold , according to my own trials I would pick something between 3 and 5 %
I recognize it may not give the expected nice shape like contour does, but at least it's working
Sets=["xy1" "xy2" "xy3" "xy4"];
threshold = 0.05; % threshold is 5 % of F range above min value
% s = 0.5; % k = boundary(___,s) specifies shrink factor s using any of the previous syntaxes.
for k=1:numel(Sets)
xy=load(Sets(k)).(Sets(k));
x = xy(:,1);
y = xy(:,2);
figure(k)
[hAxes,col,ctrs1,ctrs2,F] = dscatter(x,y,'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
thr = lb+threshold*(ub-lb);
[row,col]=find(F>thr);
% m = boundary(row,col,s);
m = boundary(row,col);
plot(ctrs1(col(m)),ctrs2(row(m)),'r')
hold off
end
Simone A.
il 16 Set 2023
Mathieu NOE
il 18 Set 2023
No problem
I just need now to find an alternative to pdist2 as I don't have the relevant toolbox
Mathieu NOE
il 18 Set 2023
so finally (and for my own fun) I tried some variants to replace pdist2
so for those like me that lack the right toolbox , here two codes that do the same process .
the differences are minor, one keeps valid points and the other one remove invalid points like Matt did in his code.
One thing I removed is the 3 iterations in the envelope function ; I didn't see any benefit of doing this , but maybe it's specific to my own code
solution 1
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% find cols where we can have n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
f = all(D2<=1); % keep those points
xy=xy(f,:);
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
solution 2
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% remove rows where we haven't n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
D2=max(D2,[],1);
xy(D2>1,:)=[];
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
Categorie
Scopri di più su Shifting and Sorting Matrices in Centro assistenza e File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!










