Draw a curve defining the bounds of a scatter plot to detect anomalous (sparse or low density) points

Dear all, I have thousands of x,y data paris of observed and theoretical temperatures in a given region, where I need to identify some anomalous (for the purpose of my work) data points. I have tried PCA analysys (i.e., DBSCAN), statistical thresholds, curve fittings, etc., but none of them did work. I then tried a KDE-based approach, which seems to 'better' work for my data. To achieve the results shown below I used https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note, I modified the code to export hAxes,col,ctrs1,ctrs2 and plot
% the contour line as:
contour(ctrs1,ctrs2,F,0.0015,'k-'); % or 0.015 ---> See figure below
By using this apporach I was able to draw a contour line around the main cluster of my data, and I was mostly able to draw the line between 'main cluster' and 'anomalous' points. However, if I have too many 'anomalous' data points (see xy4 in the figure below), the method, using a fixed threshold, fails. Beside, I have to adjust the threshold based on the region of each x,y pair and I have no idea on how to find the correct threshold level (see the figure below - For the same region, a threshold level of 0.0015 seems to work for situations with a few anomalous points, but a threshold of 0.015 is needed when more spreaded points occurred).
What I would really like to do is to draw a curve around the main cluster of my scatterplot, so to devide the anomalous data points. I fully understand this may be a challenging task, but I hope you may provide some good alternatives and/or solutions.
Another solution, as it seems to work, may be defining automatically the density threshold level, but I don't really know where to start from.
Below, you can see 4 examples (xy 1 to 4 are also attached as .mat files). To the left is the simple x,y scatterplot. In the middle, the ideal curve (in red) defining the boundary of my scatterplot (manually sketched). To the right, the KDE-based approach discussed above. Please note the last figure where a threshold of 0.0015 fails, and a threshld of 0.015 is needed instead.
Any help is grately appreciated!
Any other approach to identify the points outside the red boundaries is more than wellcome!
Thanks a lot in advance

 Risposta accettata

Sets=["xy1" "xy2" "xy3" "xy4"];
Factors=[20,20,5,5];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
p=envelope(xy,Factors(i));
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
Warning: Polyshape has duplicate vertices, intersections, or other inconsistencies that may produce inaccurate or unexpected results. Input data has been modified to create a well-defined polyshape.
function p=envelope(xy,spreadFactor)
[D,I]=pdist2(xy,xy,'euclidean','Smallest',5);
Dmax=max(D,[],1);
d=median(D(:))*spreadFactor;
XY=xy(Dmax<=d,:);
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end

11 Commenti

Hi @Matt J, thanks a lot for getting back! This approach has a great potential, thanks for that! However, the main issue remains the choose of the right coefficient 'Factors'. As in the KDE-based approach, I should change the coefficient depending on the scene, thus on the scatterplot. As I need to automatise the problem, I need either something that works with a single coefficient, or a way to estimate the coefficient directly from my xy data. Can you think of anything to overcome this issue?
Once again, thanks a lot!
@Simone A. There's no way you will be able to avoid, in whatever method you choose, the selection of some parameter defining the density difference between what you consider the main lobe and what you consider to be sparse, outlying points. That density difference is a subjective choice, not some well-defined thing.
Thanks @Matt J. As such, it is my understanding that the only option I have left is to find, if possible, a way to establish the threshold a priori, based on my x and y. Considering these examples, is there anything that may suggest why xy1 and xy2 require a factor 20, while xy3 and xy 4 a factor 5?
Please note, xy1 and xy2 are indeed the same geographic region, similarly xy3 and xy4 are the same region, but they are different from xy1 and xy2.
Even when I applied the KDE-based approach, xy1 and xy2 worked with the same threshold, whilst that had to be changed to work for xy3 and xy4.
I am thinking that, maybe, x alone - this being the reference temperature variable - may provide hints on the threshold to be applied. Is there any statistical parameter (i.e., std, mean, covariance, etc...) that in your opinion may be directly related to the threshold factor? That would be a great way to automatise the process!
Considering these examples, is there anything that may suggest why xy1 and xy2 require a factor 20, while xy3 and xy 4 a factor 5?
They don't "require" it. If you make all the factors 5, you will get different boundaries. It's not for me to say whether those new boundaries are right or wrong. I chose to use different factor values because they gave the particular results I think you wanted, even though I don't know why you want them.
I chose to use different factor values because they gave the particular results I think you wanted, even though I don't know why you want them.
That said, there is a definite difference in the range of local point densities across the region. In xy3 and xy4, the sparsest points are much closer in density to the main lobe than those in xy1,xy2.
even though I don't know why you want them
I am interested in values of y positively deviating from x, as they may represent anomalous data points. These data points are indeed those falling outside the boundaries I drawn, and this is all I am trying to extract.
'there is a definite difference in the range of local point densities across the region'
May I kindly ask you where and how did you defined the above? Where can I find the information about the range of local point densities? Maybe that may help me to find a way to automatically define the threshold!
Thanks!
These data points are indeed those falling outside the boundaries I drawn, and this is all I am trying to extract automatically.
Yes, but my point was that the precise boundaries are subjectively chosen by you, as far as we, your audience, can tell. We do not know by what criteria you judge whether our code proposals are finding the correct boundaries.
May I kindly ask you where and how did you defined the above? Where can I find the information about the range of local point densities?
In my code, the vector Dmax is what I use to quantify the local density of neighbors around each point. If you plot the histograms of Dmax for each xy data set, you will see that for xy1,xy2 the tails are heavier than for xy3.
Here's another variation to consider. Don't know if it draws the borders exactly where you'd like...
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
p=envelope(xy);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
function [p,XY]=envelope(xy)
for i=1:3
D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
xy(D2>1,:)=[];
end
XY=xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
@Matt J that seems pretty nice looking ! congrat's , well done !
maybe just a little bit of smoothing ?
wish I had the pdist2 function to run this code
all the best
@Matt J Absolutely amazing! This code is performing extremely well for most of my data distributions and, with a couple of tweaks, such as adding a smooth (as kindly suggested by @Mathieu NOE - see below for future reference), I am mostly achieving what I needed! Thank you very much indeed!!!
As a last thing, could you kindly provide a brief summary on the rationale and on what this code (thus pdist2 function) is actually doing, or better yet, how it determines 'where to draw a boundary'?
Especially, why is
D2=max(pdist2(xy,xy,'euclidean','Smallest', 8),[],1);
running 3 times, keeping the values >1? Is this related to an initial guess of the pdist2 function, which may return different results at every run? What is 8 defining (in the code you wrote it was 10, but I found that 8 better adapts to different distributions)?
Also, do you have any suggestions for further reading so that I can better understand the approach? It would be greatly appreciated!
Thanks a lot for your time, patience, and consideration!
xy= [x,y];
p=envelope(xy);
p = polyshape(p.Vertices(:,1), p.Vertices(:,2));
p = polybuffer(p,0.25);
function [p,XY]=envelope(xy)
for i=1:3
D2=max(pdist2(xy,xy,'euclidean','Smallest', 8),[],1);
xy(D2>1,:)=[];
end
XY=xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
As a last thing, could you kindly provide a brief summary on the rationale and on what this code (thus pdist2 function) is actually doing, or better yet, how it determines 'where to draw a boundary'?
Well, pdist2 finds the 8 nearest neighbors of every point (x0,y0) and if any of the neighbors is at a distance of >1 than (x0,y0) is classified as belonging to the sparse region, and is discarded from the list. Meanwhile, points well-nested in the dense region will survive this elimination process. Ultimately, the idea is that all the sparse points will be stripped away. Some points within the dense central cloud may also be stripped away, but the expectation is that enough points there will survive so that the boundaries of the central cloud can be well-delineated by the boundary command.
To determine the threshold of >1, I had to read your mind :-). It seemed to me from your initial post and where you were drawing the boundaries, that your criterion for classifying points as an outlier was that they were spaced apart by >1 on average.
Especially, why is D2=max(__); running 3 times,
In the cloud of sparse outlying points, you may have a small, isolated clump of dense points, separate from the main lobe. By iterating a few times, it will eat away at the edges of those clumps, eventually consuming them.
Also, do you have any suggestions for further reading so that I can better understand the approach?
I don't. This is a very improvised clustering strategy, and one that is very customized to your specific data sets. I'm sure you will find much better algorithms if you really dive into the literature on clustering. If you do come across this one in the literature, though, I would be interested to know!

Accedi per commentare.

Più risposte (1)

hello
maybe this ?
I simply decided that I wanted the contour line corresponding to a level 5% of F range above the min value of F
Result :
[hAxes,col,ctrs1,ctrs2,F] = dscatter(xy4(:,1),xy4(:,2),'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
threshold = lb+0.05*(ub-lb); % threshold is 5% of data range above min value
contour(ctrs1,ctrs2,F,[threshold threshold],'k-'); %
hold off

9 Commenti

Hi @Mathieu NOE thanks for getting back and for your help!
Unfortunatelly, that seem to work for xy4, but not for the others. This is how xy1 and xy2 look like:
Can you think of any other suggestions/approaches that may help me to achieve a solution?
Thanks a lot once again
mayba this ?
Sets=["xy1" "xy2" "xy3" "xy4"];
threshold = 0.03; % threshold is 3 % of F range above min value
s = 0.95; % k = boundary(___,s) specifies shrink factor s using any of the previous syntaxes.
for k=1:numel(Sets)
xy=load(Sets(k)).(Sets(k));
x = xy(:,1);
y = xy(:,2);
figure(k)
[hAxes,col,ctrs1,ctrs2,F] = dscatter(x,y,'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
thr = lb+threshold*(ub-lb);
[row,col]=find(F>thr);
m = boundary(row,col,s);
plot(ctrs1(col(m)),ctrs2(row(m)),'r')
hold off
end
Hi @Mathieu NOE, thanks once again. Unfortunatelly I still need to change the threshold at every run, as the shrink factor alone is not sufficient. I have tried to play a bit with both threshold and s, but I could not find a unique setting that was applicable to all of these 4 examples (thus I am pretty sure that it would be even harder to accomodate thousands of them). This is for instance what I get for xy1 if I apply the settings you suggested:
ok
I am not sure there is a magic tool or AI that can decide for each case how to optimize the parameters...
Hi @Mathieu NOE. I understand this is a quite challenging task, which may indeed require some AI-based approach to be solved. Yet, it is still worth it asking for some ideas and suggestion, maybe a simpler, yet effective, way of doing it exists! Thanks a lot for your help and time anyway, I really appreciate it!
hello again
I tried to simplify a bit my code , so the shrink factor is simply the one by default - one parameter less to play with
remains only the threshold , according to my own trials I would pick something between 3 and 5 %
I recognize it may not give the expected nice shape like contour does, but at least it's working
Sets=["xy1" "xy2" "xy3" "xy4"];
threshold = 0.05; % threshold is 5 % of F range above min value
% s = 0.5; % k = boundary(___,s) specifies shrink factor s using any of the previous syntaxes.
for k=1:numel(Sets)
xy=load(Sets(k)).(Sets(k));
x = xy(:,1);
y = xy(:,2);
figure(k)
[hAxes,col,ctrs1,ctrs2,F] = dscatter(x,y,'BINS',[250,250]);
colorbar('vert');
hold on
lb = min(F,[],'all');
ub = max(F,[],'all');
thr = lb+threshold*(ub-lb);
[row,col]=find(F>thr);
% m = boundary(row,col,s);
m = boundary(row,col);
plot(ctrs1(col(m)),ctrs2(row(m)),'r')
hold off
end
Hi, @Mathieu NOE thanks a lot for taking the time to provide a simplified appraoch. Although this code is indeed better suited for my task (thanks a lot for it!), I must aknoweldge that @Matt J code is indeed better adapting to most of my data distributions! Nonetheless, thank you very much for your time, I really appreciate your effort and kindness!
No problem
I was conscient that my proposal is not as good as @Matt Jone. He's done an amazing job
I just need now to find an alternative to pdist2 as I don't have the relevant toolbox
so finally (and for my own fun) I tried some variants to replace pdist2
so for those like me that lack the right toolbox , here two codes that do the same process .
the differences are minor, one keeps valid points and the other one remove invalid points like Matt did in his code.
One thing I removed is the 3 iterations in the envelope function ; I didn't see any benefit of doing this , but maybe it's specific to my own code
solution 1
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% find cols where we can have n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
f = all(D2<=1); % keep those points
xy=xy(f,:);
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end
solution 2
Sets=["xy1" "xy2" "xy3" "xy4"];
for i=1:numel(Sets)
xy=load(Sets(i)).(Sets(i));
[p,XY]=myenvelope(xy);
p = polybuffer(p,0.25);
figure
scatter(xy(:,1),xy(:,2),5);
hold on
plot(p,'FaceColor','none','EdgeColor','r', 'LineWidth',2)
hold off
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [p,XY]=myenvelope(xy)
% D2=max(pdist2(xy,xy,'euclidean','Smallest', 10),[],1);
% the one line engine above is replaced by the following 8 lines of code
x = xy(:,1);
y = xy(:,2);
D2 = sqrt((x-x').^2 + (y-y').^2);
D2(D2 <eps ) = NaN;% replace diag elements that are 0 by NaN's
% remove rows where we haven't n neigbours within distance of 1
D2 = sort(D2,'ascend');
D2 = D2(1:10,:); % keep the n smallest values
D2=max(D2,[],1);
xy(D2>1,:)=[];
XY = xy;
b=boundary(XY,0.95);
p=polyshape(XY(b,:));
end

Accedi per commentare.

Categorie

Prodotti

Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by