Inlined code segment slower than internal function pass - why?

Question

Carson Purnell on 10 Feb 2023


https://it.mathworks.com/matlabcentral/answers/1910525-inlined-code-segment-slower-than-internal-function-pass-why

Edited: Matt J on 12 Feb 2023

I'm trying to speed up prototype code and have found a strange speed increase when I replace ordinary inlined code inside a loop with a function call. The inlined code is as follows:

s=0;
for dd=1:numel(loc)
    s=s+(dynpts(:,dd)-loc(dd)).^2;
end
fidx=s<sel.rad^2;
ix=find(fidx);

Somehow, this profiles 2-3x slower than making it an in-script subfunction call, ix = rangesearchnest(loc,sel.rad,dynpts); with an identical body (different variable names). I don't see how this could be the case under any circumstances: my understanding is that the JIT and internal optimizations should work better on inlined code than on external calls. Moreover, dynpts is an n-by-3 array where n is in the millions to billions, so I was expecting a tremendous speed increase with the inlined version merely as a result of not needing to pass the gargantuan array as an argument (and potential memory limit issues).

Is there special-case behavior I'm not aware of happening here?
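For readers following along, here is a minimal sketch of what the subfunction version presumably looks like. The name rangesearchnest and its argument order come from the call quoted in the question; the body, the internal variable names, and the tiny test data are assumptions for illustration only.

```matlab
% Small illustrative data (assumed; the real dynpts has millions of rows)
dynpts  = [0 0 0; 1 1 1; 0.1 0.1 0.1; 2 0 0];
loc     = [0 0 0];
sel.rad = 0.5;

% Indices of the rows of dynpts that lie within sel.rad of loc
ix = rangesearchnest(loc, sel.rad, dynpts);

function ix = rangesearchnest(c, r, pts)
    % Accumulate the squared distance one coordinate at a time,
    % mirroring the inlined loop from the question.
    d2 = 0;
    for dd = 1:numel(c)
        d2 = d2 + (pts(:,dd) - c(dd)).^2;
    end
    ix = find(d2 < r^2);   % strict inequality, as in the inlined code
end
```

Local functions in a script file need R2016b or later, which is presumably what "in-script subfunction" refers to.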


Walter Roberson on 11 Feb 2023


x = find(s<sel.rad^2);

is potentially better optimized than the two-statement version.

In numeric cases where the < ordering is guaranteed not to return errors, MATLAB could potentially evaluate s(K)<sel.rad^2 in a loop, gathering indices as it goes (perhaps into a linked list), instead of first calculating s and sel.rad^2 as logical vectors and then doing a find() operation on the result.

In order to determine whether it does that kind of operation, you would probably need to use large matrices, right on the boundary, where calculating s<sel.rad^2 first would exhaust your memory.

The language's semantic model is to calculate the logical vector first, but in most languages, internal optimizations are permitted to vary the order of operations provided that the result is the same when no exceptions occur.
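To make the two evaluation strategies concrete, here is a rough sketch of the logic only. Whether MATLAB's engine actually fuses the comparison into find() is not confirmed anywhere in this thread; the sample values and the variable names r, ix_twostep and ix_fused are made up for the example.

```matlab
s = [0.1; 0.9; 0.2; 0.5; 0.05];   % assumed sample squared distances
r = 0.5;                          % assumed scalar radius

% Two-step form: materialize the full logical vector, then search it
fidx = s < r^2;
ix_twostep = find(fidx);

% Fused form: compare and collect indices in a single pass.
% A real engine might use a growable list; preallocating the worst
% case here just keeps the sketch simple.
ix_fused = zeros(numel(s), 1);
count = 0;
for K = 1:numel(s)
    if s(K) < r^2
        count = count + 1;
        ix_fused(count) = K;
    end
end
ix_fused = ix_fused(1:count);     % trim the unused slots

same = isequal(ix_twostep, ix_fused);
```

Both forms give the same indices; the fused one simply never needs the intermediate logical array, which is the memory difference the boundary test above would try to expose.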


Answer 1

Matt J on 10 Feb 2023


https://it.mathworks.com/matlabcentral/answers/1910525-inlined-code-segment-slower-than-internal-function-pass-why#answer_1168940

Edited: Matt J on 10 Feb 2023

"... with the inlined version merely as a result of not needing to pass the gargantuan array as an argument (and potential memory limit issues)."

Passing a variable to a function does not result in any memory copying unless the function makes changes to the variable, which you are not doing. Also, my recollection of how the JIT works is that it optimizes the execution of functions, but not scripts. So, if your top-level code is not enclosed in a function, that might be part of it as well.
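A quick way to see the no-copy behavior for yourself is to time a function that only reads its input against one that writes to it. This is a sketch, not a rigorous benchmark: the function names readOnly and writeFirst and the array size are assumptions, and the timings will vary by machine and release.

```matlab
A = rand(5e6, 3);                     % assumed size, roughly 120 MB of doubles

tRead  = timeit(@() readOnly(A));     % only reads A: no copy expected
tWrite = timeit(@() writeFirst(A));   % modifies its input: a copy is expected

fprintf('read-only: %.4f s, modifying: %.4f s\n', tRead, tWrite);

function v = readOnly(X)
    v = X(1);          % reading does not trigger a copy
end

function v = writeFirst(X)
    X(1) = 0;          % the first write forces a private copy of the shared data
    v = X(1);
end
```

If the modifying call comes out much slower, that gap is the cost of duplicating the array. The caller's A is unchanged either way.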


Walter Roberson on 11 Feb 2023

This is a point that some people disagree with me on, but my reading of historical documents, and discussions with Mathworks employees, is:

JIT never optimized scripts, but did optimize functions in limited ways
JIT was replaced with Execution Engine, which is now used for all computation paths, so these days scripts are at least partly optimized. The optimization is not exactly the same as inside functions; for example, a plain A = B; in a script is faster than a plain A = B; in a function (!!) according to a recent test I did. Also, for scripts there has to be more room left for the possibility that a value might change type completely, so scripts need to be more tentative in their optimization.

The part that people disagree with me on is where I say that JIT was replaced with Execution Engine; some people say that Execution Engine is a kind of JIT and therefore that MATLAB still has JIT. However, I recall being corrected on that point a number of years ago by Mathworks staff, who (IIRC) said that JIT is no longer present. I interpret this as meaning that Mathworks had a specific execution flow that they called JIT, and that the flow has been so completely rewritten that the implementing technology they called JIT is no longer present.

That leaves open the question of whether these days everything is "compiled" at parse time and so there is no longer any "Just In Time" left, or if instead there is still a component of run-time optimization as needed that is now called something else at Mathworks but falls under the broad "just in time" category. I do not know the answer to that; the statements I have seen from Mathworks employees have tended to imply compilation without run-time as-needed optimization, but it is quite difficult for me to explain the results of some timing tests unless there is an optimization phase happening the second or third time a particular statement is executed.
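The kind of timing test meant here can be sketched as follows: run the same statement several times and record each pass separately. The statement, the array size, and the names x, y, t and nRuns are arbitrary choices for illustration. A slow first pass can also come from memory allocation or caching, so this is suggestive at best.

```matlab
x = rand(1e6, 1);          % assumed test data
nRuns = 5;                 % assumed number of repetitions
t = zeros(1, nRuns);

for k = 1:nRuns
    tic;
    y = x.^2 + 1;          % the same statement on every pass
    t(k) = toc;
end

disp(t)                    % compare the first entries against the later ones
```

Running this once as a script and once wrapped in a function would show whether the warm-up pattern differs between the two contexts.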

Carson Purnell on 12 Feb 2023

Alright, after some more testing, things are... still confusing. I tried a vectorized solution both inlined and as a function (note that sel.rad is a scalar, not a vector as Walter appeared to assume):

s = sum((dynpts-loc).^2,2);

ix5=find(s<sel.rad^2)

This did increase the speed in the script, and it slowed the function down to a nearly identical speed (the difference is probably function-call overhead). So the script loop was slowest, the vectorized version was fast either way, but the loop in the function is still faster. That makes it look like the script prevented optimization of the loop, but it does not explain why the external loop is faster. Maybe 1xm vector math is faster than operating on the whole array?

Good to know that passing arguments doesn't require a full memory copy under those conditions.

Matt J on 12 Feb 2023

Edited: Matt J on 12 Feb 2023

I don't know what you mean by the "external loop", but the tests below seem consistent with the rest of your comment. None of it is too surprising, IMHO. The vectorized version allocates the most memory, so it makes sense to me that the loop is fastest when full optimizations are applied.

n=1e7;
[dynpts,loc]=deal(rand(n,3),rand(1,3));
timeit(@()implem1(dynpts,loc))
ans = 0.0320
timeit(@()implem2(dynpts,loc))
ans = 0.0888
tic;
    s=0;
    for dd=1:numel(loc)
        s=s+(dynpts(:,dd)-loc(dd)).^2;
    end
toc
Elapsed time is 0.112159 seconds.
tic
    s = sum((dynpts-loc).^2,2);
toc
Elapsed time is 0.090275 seconds.
function implem1(dynpts,loc)
    s=0;
    for dd=1:numel(loc)
        s=s+(dynpts(:,dd)-loc(dd)).^2;
    end
end
function implem2(dynpts,loc)
    s = sum((dynpts-loc).^2,2);
end
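The benchmark above times only the accumulation of s. As a sanity check, the loop and vectorized forms can be compared for agreement before trusting either timing. This is a sketch under assumptions: the seed, the sizes, the radius rad, and the names sLoop, sVec and maxDiff are illustrative, and the vectorized line relies on implicit expansion (R2016b or later).

```matlab
rng(0);                                   % fixed seed so the check is repeatable
n = 1e5;                                  % assumed size, much smaller than the real data
[dynpts, loc] = deal(rand(n,3), rand(1,3));
rad = 0.3;                                % assumed scalar radius

% Loop form: accumulate squared distance one coordinate at a time
sLoop = 0;
for dd = 1:numel(loc)
    sLoop = sLoop + (dynpts(:,dd) - loc(dd)).^2;
end

% Vectorized form: implicit expansion of loc across the rows of dynpts
sVec = sum((dynpts - loc).^2, 2);

maxDiff = max(abs(sLoop - sVec));         % expected to be at rounding level
nMismatch = numel(setxor(find(sLoop < rad^2), find(sVec < rad^2)));
fprintf('max difference: %g, mismatched indices: %d\n', maxDiff, nMismatch);
```

Any mismatched indices would come from points sitting essentially on the sphere boundary, where a last-bit rounding difference can flip the comparison.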
