Fast Euclidean Distance Calculation

34 views (last 30 days)
Moreno, M.
Moreno, M. on 17 Mar 2022
Commented: Moreno, M. on 18 Mar 2022
Hi, I hope you are well.
I am trying to compute Euclidean distances in 2D or 3D (for the sake of the question, I will refer to 2D for ease of computation) in the fastest way possible in MATLAB. I am aware that when comparing distances between three points, the following algebraic identities can be used, as well as L1-norm approximations:
  • d1 > d2 is equivalent to d1^2 > d2^2
  • d1^2 > d2^2, i.e. (A - B)^2 > (A - C)^2, is equivalent to (B^2 - 2AB) > (C^2 - 2AC), since the A^2 term cancels on both sides.
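For reference, these identities can be sanity-checked on concrete points (a minimal sketch; the coordinates of A, B and C are arbitrary examples):

```matlab
% Comparing squared distances avoids the square root entirely; since
% both distances share the reference point A, the A^2 terms cancel.
A = [0 0];  B = [3 4];  C = [1 1];
d1 = norm(A - B);                   % 5
d2 = norm(A - C);                   % sqrt(2)
lhs = sum(B.^2) - 2 * sum(A .* B);
rhs = sum(C.^2) - 2 * sum(A .* C);
isequal(d1 > d2, lhs > rhs)         % true: same ordering, no sqrt needed
```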
However, these expressions failed to outperform the traditional d = sqrt(sum(t .* t, 2)), where t = B - A. To illustrate this, I have calculated the distance between corresponding elements of two arrays X = rand(1e6, 2) and Y = rand(1e6, 2) in different ways:
  1. L1 distance as the absolute difference between coordinates
  2. Quasi-Euclidean (octagonal) distance: 0.941246 * max(|dx|, |dy|) + 0.414214 * min(|dx|, |dy|), where dx and dy are the coordinate differences
  3. Expanded binomial expression: sqrt(sum(X .* X, 2) + sum(Y .* Y, 2) - 2 * sum(X .* Y, 2))
  4. Same as 3. but with bsxfun
  5. Traditional distance sqrt(sum((X - Y) .^ 2, 2))
X = rand(1e6, 2);
Y = rand(1e6, 2);
% L1 distance
tic
t = abs(X - Y);
d1 = sum(t, 2);
toc
% Quasi-Euclidean distance
tic
t = abs(X - Y);
a = t(:, 1);
b = t(:, 2);
i = a > b;
d2 = i .* (0.941246 * a + 0.414214 * b) + ~i .* (0.941246 * b + ...
0.414214 * a);
toc
% Expanded binomial distance
tic
a = sum(X .* X, 2);
b = sum(Y .* Y, 2);
ab = 2 * sum(X .* Y, 2);
d3 = sqrt(a + b - ab);
toc
% Expanded binomial with bsxfun
tic
d4 = sqrt(bsxfun(@plus, sum(X .* X, 2), bsxfun(@minus, sum(Y .* Y, 2), ...
2 * sum(X .* Y, 2))));
toc
% Traditional distance
tic
t = X - Y;
d5 = sqrt(sum(t .* t, 2));
toc
As a result, I am getting the following elapsed times for the L1, quasi-Euclidean, expanded binomial, bsxfun and traditional L2 distance calculation methods:
Elapsed time is 0.004421 seconds.
Elapsed time is 0.009535 seconds.
Elapsed time is 0.021147 seconds.
Elapsed time is 0.019462 seconds.
Elapsed time is 0.009760 seconds.
It was surprising that, after simplifying the calculations via algebra or quasi-Euclidean norms, no significant speed gains were achieved. It is true that the L1 distance is faster than the other cases, but its error relative to the actual L2 distance is too large (1x - 2x %). For example, the speed difference between the quasi-Euclidean distance (which needs no square roots or matrix multiplication) and the traditional L2 distance is negligible.
The speed gain from bsxfun over 'manually' combining the array terms is also small. The squared distance computed the traditional way (method 5 without the sqrt) has the following elapsed time for the same X and Y matrices:
Elapsed time is 0.008278 seconds.
This came across as shocking to me, since the traditional method seems even faster than the quasi-Euclidean one, and I thought square roots were incredibly costly as they are typically computed iteratively or via 1/sqrt(x).
Is there any way to speed up this distance calculation in MATLAB (using means accessible to users)? There must be, since calling functions that involve distance calculation or comparison, such as knnsearch, results in faster computing times than the traditional computation of the squared distance (sometimes even twice as fast, in already optimised KNN code).
As a note, I also tried gpuArray, but the elapsed times were greater than the ones I am showing here.
Thank you in advance for the help and regards.
  2 Comments
Matt J
Matt J on 17 Mar 2022
Edited: Matt J on 17 Mar 2022
This came across as shocking to me, since the traditional method seems even faster than the quasi-Euclidean one, and I thought square roots were incredibly costly as they are typically computed iteratively or via 1/sqrt(x).
Perhaps, but your M-code is doing a lot of extra memory allocation to hold intermediate quantities, beyond what the built-in functions do behind the scenes. The quasi-Euclidean method looks like it allocates twice as much memory as the "traditional" method, or more.
Moreno, M.
Moreno, M. on 17 Mar 2022
Thank you Matt, it makes sense that this is a memory-related problem and that the optimisation work must come from allocation strategies as well as from the intrinsic distance calculation. I appreciate the reply and the help. I will look for workarounds along those lines, but it is beyond my knowledge at the moment.
Regards,
Moreno, M.


Accepted Answer

Jan
Jan on 17 Mar 2022
Edited: Jan on 18 Mar 2022
There is no reason to call bsxfun, if the arrays have the same sizes:
% d = sqrt(bsxfun(@plus, sum(X .* X, 2), ...
% bsxfun(@minus, sum(Y .* Y, 2), 2 * sum(X .* Y, 2))))
% d = sqrt(sum(X .* X, 2) + sum(Y .* Y, 2) - 2 * sum(X .* Y, 2))
Both do exactly the same, but bsxfun has a certain overhead.
Creating large arrays in RAM is the bottleneck of many functions. Although sqrt() is slow (even though it is not solved iteratively on modern CPUs), waiting for the data to be delivered from the slow RAM takes more time. In your case:
X = rand(1e6, 2);
Y = rand(1e6, 2);
tic
for kk = 1:100  % Repeat this to reduce noise in the timing
    t = X - Y;
    d5 = sqrt(sum(t .* t, 2));
end
toc
tic
for kk = 1:100
    d6 = zeros(1e6, 1);
    for k = 1:1e6
        t1 = X(k, 1) - Y(k, 1);
        t2 = X(k, 2) - Y(k, 2);
        d6(k) = sqrt(t1*t1 + t2*t2);
    end
end
toc
% Matlab 2018b, i7-3770:
% Elapsed time is 1.645735 seconds.
% Elapsed time is 1.197466 seconds. !!!
% In Matlab online the loop is slower than the vectorized code.
The loop is faster because MATLAB has to access the three involved arrays X, Y and d6 only locally. In contrast, sqrt(sum(t .* t, 2)) creates the complete t .* t first and flushes it from the cache to the slow RAM. Then sum() is fed with the complete result and writes its output back to RAM. Then sqrt() requests the complete array again and (you know it already) pushes the result back to RAM once more.
The loop imports the data from RAM in blocks of the cache-line size (usually 64 bytes), processes them, and once they are written back to RAM they are not touched again.
Highly optimized functions take the CPU's cache sizes and the limited bandwidth to RAM into account.
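One user-accessible way to exploit this, short of writing a MEX function, is to block the vectorized code so that each temporary array stays cache-sized. A sketch (the chunk size blk is an assumption and would need tuning per machine):

```matlab
% Process the rows in chunks so the temporaries fit in cache
% instead of spilling to RAM between the vectorized steps.
X = rand(1e6, 2);
Y = rand(1e6, 2);
blk = 1e4;                          % hypothetical chunk size, tune per CPU
d = zeros(size(X, 1), 1);
for k = 1:blk:size(X, 1)
    idx    = k:min(k + blk - 1, size(X, 1));
    t      = X(idx, :) - Y(idx, :); % small temporary, likely cache-resident
    d(idx) = sqrt(sum(t .* t, 2));
end
```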
  4 Comments
Jan
Jan on 18 Mar 2022
Edited: Jan on 18 Mar 2022
Hi @Moreno, M., when I run the code in MATLAB Online (2021b, I assume), the loop is slower than the vectorized code. This could be an effect of MATLAB's JIT acceleration, which parses and optimizes the code. Further techniques might then be applied, e.g. calling optimized SSE or AVX code to process the data.
Optimizing code requires either trusting the manufacturer of the compiler, e.g. MathWorks, or understanding in detail how the hardware is designed. "Loops are slow in Matlab" was true in MATLAB 6.1 (R12.1, from 2001). Then MATLAB introduced JIT acceleration, and loops have been getting faster with each new release. But the vectorized code is supported by BLAS libraries, which call SSE/AVX code and use multi-threading in addition. As you can see in my code example, the fastest solution in R2018b need not be the winner in 2021b.
A good starting point is not easy to suggest. Besides the architecture of CPUs, the field of software engineering matters as well: if you get the code to run 2 seconds faster, but it is hand-written AVX assembly that fails on another CPU, it can take days to weeks to get it running on a new machine. A scientist measures the time until a problem is solved, and the run time of the code is only one of the steps:
T_solution = T_design + T_programming + T_debugging + ...
             T_unittests + T_documentation + T_runtime
This can mean that an optimal runtime wastes weeks on debugging. For this reason, "premature optimization" is a widespread programming anti-pattern.
I've seen too many projects fail due to unstable code or missing documentation. So my suggestion: start with writing stable code. Include unit tests and integration tests. Include exhaustive documentation. Then use the profiler to find the bottlenecks. Keep the reference solution to allow a comparison with the optimized code in future versions. Once these points are covered, care about CPU caches and memory bandwidth.
With the attached C-Mex function I get the timings:
Elapsed time is 1.447366 seconds. SQRT(SUM((X - Y).^2, 2))
Elapsed time is 1.184052 seconds. Loop
Elapsed time is 0.589187 seconds. C-Mex
Moreno, M.
Moreno, M. on 18 Mar 2022
Hi Jan, thank you for this.
That is a very good suggestion, I will start there.
Regards,
Moreno, M.


More Answers (1)

Matt J
Matt J on 17 Mar 2022
Edited: Matt J on 17 Mar 2022
vecnorm(X-Y,2,2);
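For context, a minimal timing sketch comparing vecnorm with the hand-written version from the question (same X and Y; timings will vary by release and machine):

```matlab
X = rand(1e6, 2);
Y = rand(1e6, 2);
tic; d5 = sqrt(sum((X - Y).^2, 2)); toc   % hand-written L2 distance
tic; d6 = vecnorm(X - Y, 2, 2);     toc   % built-in 2-norm along dim 2
max(abs(d5 - d6))                         % agreement check, ~0
```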
  3 Comments
Moreno, M.
Moreno, M. on 17 Mar 2022
Hi Matt,
I appreciate your reply and the comments. From what I know, vecnorm parses the input arrays and then performs an element-wise norm calculation using the same expression as I am using in point 5, sqrt(sum(dXY .* dXY, 2)), as shown in:
https://es.mathworks.com/help/matlab/ref/vecnorm.html
https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/7844/versions/19/previews/geom2d/geom2d/vecnorm.m/index.html
So it would surprise me if it had faster computing times than the traditional element-wise calculation.
Also, I am including the calculation of X - Y in every timed section because it may vary between distance calculation methods, and I need to account for that time difference between methods, despite the redundancy.
Regards, Moreno, M.
Matt J
Matt J on 17 Mar 2022
So it would surprise me if it had faster computing times than the traditional element-wise calculation.
And yet it does...

