Optimize/speed up a big and slow matrix operation with addition and bsxfun?

Hi. I have a fairly long line of code in my script that takes about 7 seconds to run. It was originally three separate calculations, but they are now combined into one line that looks like this:
X = bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
+bsxfun(@times,reshape(E,1440,1,2251),D)...
+bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));
Since I need to run it several tens of thousands of times, it is causing a fair amount of anxiety in my life at the moment. I don't know how I would go about speeding it up further, or whether it is even possible, so I would really appreciate it if any of you optimization gurus could give me some advice on whether and how I can make this faster.
The input variables are of size:
A = 2251x1
B = 1440x2251
C = 1x181
D = 1440x181
E = 1440x2251
F = 1440x2251
Thanks!

2 Comments

Do you have a C compiler available for compiling mex functions?
I assume A is really 1 x 2251 (otherwise bsxfun(@times,A,B) wouldn't work as written)


Accepted Answer

Assuming you have a C compiler available, here is a simple C mex routine to do the calculation. It could possibly be sped up even more with multi-threading (e.g., OpenMP), but I did not include that code below. Let me know if you have an OpenMP-compliant compiler, since that addition would be fairly simple.
As is (with no C-mex multi-threading) here is an example run on my machine:
Elapsed time is 6.655025 seconds. % M-code with bsxfun
Elapsed time is 1.700077 seconds. % C-mex code (not multi-threaded)
CAUTION: Code below is bare bones with no argument checking.
// X = bsxfun(@times,reshape(bsxfun(@times,A,B),[1440,1,2251]),(1-C)./2)...
// + bsxfun(@times,reshape(E,1440,1,2251),D)...
// + bsxfun(@times,reshape(F,1440,1,[]),((1+C)/2));
//
// The input variables are of size:
//
// A = 1 x 2251
// B = 1440 x 2251
// C = 1 x 181
// D = 1440 x 181
// E = 1440 x 2251
// F = 1440 x 2251
//
// Output:
//
// X = 1440 x 181 x 2251
//
// Calling sequence:
//
// X = this_file_name(A,B,C,D,E,F);
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, j, k, m, n, p, km, ikm, jm;
    mwSize dims[3];
    double *A, *B, *C, *D, *E, *F, *X, *Cm, *Cp;
    double a, cp, cm;

    m = mxGetM(prhs[3]);   /* 1440, rows of D    */
    n = mxGetN(prhs[3]);   /* 181,  columns of D */
    p = mxGetN(prhs[4]);   /* 2251, columns of E */
    A = mxGetPr(prhs[0]);
    B = mxGetPr(prhs[1]);
    C = mxGetPr(prhs[2]);
    D = mxGetPr(prhs[3]);
    E = mxGetPr(prhs[4]);
    F = mxGetPr(prhs[5]);
    dims[0] = m;
    dims[1] = n;
    dims[2] = p;
    plhs[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);
    X = mxGetPr(plhs[0]);

    /* Precompute (1-C)/2 and (1+C)/2 once instead of per element */
    Cm = (double *) mxMalloc(n*sizeof(*Cm));
    Cp = (double *) mxMalloc(n*sizeof(*Cp));
    for( j=0; j<n; j++ ) {
        Cm[j] = (1.0 - C[j]) / 2.0;
        Cp[j] = (1.0 + C[j]) / 2.0;
    }
    for( k=0; k<p; k++ ) {
        a = A[k];
        km = k * m;
        for( j=0; j<n; j++ ) {
            jm = j * m;
            cm = Cm[j] * a;
            cp = Cp[j];
            for( i=0; i<m; i++ ) {
                ikm = i+km;
                *X++ = B[ikm] * cm + E[ikm] * D[i+jm] + F[ikm] * cp;
            }
        }
    }
    mxFree(Cp);
    mxFree(Cm);
}
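The caution above notes that the routine has no argument checking. As a rough sketch only (the error messages and the exact size relations, taken from the comment block above, are illustrative, not a drop-in patch), the checks one might add at the top of mexFunction before touching any data could look like:

```c
#include "mex.h"

/* Hypothetical argument checks; call first inside mexFunction. */
static void check_args(int nlhs, int nrhs, const mxArray *prhs[])
{
    int i;
    if (nrhs != 6)
        mexErrMsgTxt("Expected exactly six inputs: A, B, C, D, E, F.");
    if (nlhs > 1)
        mexErrMsgTxt("Expected at most one output.");
    for (i = 0; i < 6; i++) {
        if (!mxIsDouble(prhs[i]) || mxIsComplex(prhs[i]))
            mexErrMsgTxt("All inputs must be real double arrays.");
    }
    /* B, E, F share m rows and p columns; D is m x n; A is 1 x p; C is 1 x n */
    if (mxGetM(prhs[1]) != mxGetM(prhs[3]) ||
        mxGetN(prhs[1]) != mxGetN(prhs[0]) ||
        mxGetN(prhs[2]) != mxGetN(prhs[3]))
        mexErrMsgTxt("Input sizes are inconsistent.");
}
```

With something like this in place, passing singles by mistake would produce a readable error instead of a crash.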

6 Comments

Forgive me if I’m a bit slow here; I’ve never worked with mex files before, but I’m trying to learn, and I’m running into some problems:
I have the Microsoft SDK, and I’ve pasted your code into a file saved as MexCalculation.cpp. Then I compile it from my main MATLAB script by running “mex MexCalculation.cpp”, which creates a new file called “MexCalculation.mexw64”.
If I then try to run it, for example by typing “MexCalculation” (without any extension) in the command window, MATLAB encounters an internal error and needs to close. Am I doing this completely wrong?
How should I call the mex code, and do I have to pass the variables from my workspace to it, or can it access them automatically?
Thanks
Sorry for not being explicit. Call it as follows:
X = MexCalculation(A,B,C,D,E,F);
If this works for you and you intend on using it going forward, then let me know and I can put in the argument checking so MATLAB will not bomb for incorrect inputs.
Amazing, it worked and it runs in 1.8 seconds! I will definitely be using it, and so far it hasn’t complained about the inputs in any way. I have two small questions though:
• If I compare the output of this with the output produced by the original, slower way, using isequal(X,X_old), I get 0, as if they were not identical. But when I plot parts of the data to compare, they look the same as far as I can tell, and the matrices have the same mean values. So is the inequality simply created by tiny differences in how the two languages calculate it, so that for all intents and purposes they are the same, or is this something I should be worried about?
• Does the same idea that single-precision numbers give roughly twice the speed apply here, so that I could theoretically go below one second? I tried just feeding the inputs to the mex file as singles instead of doubles, but that crashed MATLAB again. I see the term “double” in a few places in your mex code; would it simply be a matter of replacing that with “single” if I wanted to try that?
(1) If you plan on using this long term, I will add code to check the arguments so it won't crash if something is not right.
(2) Part of the speed-up is consolidating the arithmetic so the same operations aren't done multiple times. This involves pulling some operations out of the inner loop into the outer loops, which results in some of the operations being done in a different order. This is nothing to worry about, and you will find that the results are on the order of eps(result) in difference. If you are concerned about this, I can post a version that does the computations in exactly the same order, giving exactly the same results. It will not run as fast because it will not take advantage of what I mentioned earlier, but it should still run faster than your m-code because it does not need to make large intermediate arrays.
(3) If you want to try single vs double, then make these changes:
• Change all instances of "double" to "float"
• Change the mxDOUBLE_CLASS to mxSINGLE_CLASS
• Change all of the mxGetPr(etc) to (float *)mxGetData(etc)
• Change 1.0 to 1.0f and 2.0 to 2.0f
Switching to singles brought it down to 1.3 seconds, so not quite half, but still: since I started with 7 seconds I’m really pleased with 1.3. I will run it continuously for around two weeks now, and after that I don’t know if or when I will be using it again, so you don’t have to put any effort into argument checking.
Thank you so much for your help on this, it really saved me a lot of time!
Another potential speed-up I should mention is to replace mxCreateNumericArray with mxCreateUninitNumericArray.


More Answers (1)

You are trying to do billions of multiplies here, resulting in an array that is of size
1440*181*2251
ans =
586700640
So roughly 600 million elements. Since they are double-precision numbers, the array requires 8 bytes per element, so the final result will need nearly 5 gigabytes of RAM to store.
Oh, by the way, there are three parts to that computation, so your code must generate three such temporary arrays in that computation, that are then summed together. So the internal storage requirements might be as large as 15 gigabytes of RAM, depending on how that code is processed internally.
That it ran in 7 seconds is a testament to the speed of your system, not the slowness of the code.
If you want it to run faster, get more memory. A good idea would be to use a flash-based hard drive so your CPU can access virtual memory much faster.

5 Comments

I have 16 GB of RAM, and usage peaks at 95-98% when I run it, so I think I’m pretty fortunate that the size matches my memory limit, and the computer seems able to hold it all in RAM. But are you saying that it’s not possible to speed up the way the operations are performed? Is it, for example, not possible to perform the calculations in such a way that the three additions never happen and it’s all done with one temporary array? Or simply with a different approach than the bsxfun method I have now?
Update: and since you mentioned flash-based hard drives, I do have flash-based hard drives, two of them connected in RAID 0 even. Do I need to change MATLAB’s settings somehow to specify which drives it creates virtual memory on, or is this done through the paging file set up in Windows? And virtual memory is only used by MATLAB once RAM is full, right? So unless all RAM is used, virtual memory shouldn’t influence runtime, or can it?
I see no commonality in those terms to compress things down into something more efficient.
So first I'd watch your disk to see if it is swapping in that computation. My guess is it is being used if you are that close to the edge. That will cause a serious time loss. That you are seeing all your memory being used tells me that MATLAB is doing as I expected.
If there is disk thrashing involved, the simple answer is to get more RAM, OR a faster disk. RAM is cheap. A terabyte sized flash drive is relatively cheap too. Both are VERY cheap compared to YOUR time. People forget these things too often.
Of course, you could also write MEX code to do this, avoiding the creation of those large internal temporary arrays.
Perhaps another option, IF you can tolerate some precision loss in the result, is to convert the arrays to single precision. That will cut your RAM needs in half here, making it run MUCH faster. In fact, this will be a good test of my recommendation, since that computation will not need any swapping.
Therefore, IF the computation runs much more quickly using singles versus doubles, then it says that the problem is insufficient RAM, NOT the number of computations that needed to be done. A single precision multiply or add seems to take about half the time as the same operation using doubles. But more importantly, if swapping can be avoided, then it will take MUCH less time.
A = single(rand(100,1,100));
B = single(rand(1,200,100));
timeit(@() bsxfun(@times,A,B))
ans =
0.0024701
Ad = double(A);
Bd = double(B);
timeit(@() bsxfun(@times,Ad,Bd))
ans =
0.0048494
So try converting your arrays to single precision. If the overall run time is cut by 50%, then you know that swapping is not a factor. But even then, you have a way to cut your time by roughly 50%.
If the single-precision run time is much less than half the double-precision time, and you cannot afford the loss of precision, then you know that you really do need more memory and/or a faster drive.
Ok, I’ve tried that now:
Time with doubles = 6.576590 sec
Time with singles = 3.378048 sec
So almost exactly half! If I understand your reasoning correctly, this should mean that the RAM was sufficient when the calculation was done with doubles? When I run the line with singles, my RAM usage is only around 50%, so clearly that can’t be a problem anymore (if it ever was). Maybe this memory usage is now low enough for me to make a parfor loop and run two of these simultaneously.
But I’ve never calculated things with single-precision numbers before. I read a bit about them on Wikipedia, but it seemed too technical for me to understand exactly what it meant. Is it possible to estimate in layman’s terms how much accuracy I’m giving up by converting to singles? For example, if I have an integer followed by a long string of decimals, at approximately which decimal place is my number rounded when I make it single? (If that’s even how it works.)
And thank you for enabling me to cut the runtime in half, that could be a game changer!
The funny thing is, much computation in the old days was done using single precision. Double precision is more robust though. You can get away with things with doubles that might hurt you with single.
When you start swapping to disk, things suddenly get much slower. So my argument is that IF the time drops by 50%, then you were not doing any serious amount of swapping with the doubles.
Eps tells you about the potential error in those numbers.
eps('double')
ans =
2.22044604925031e-16
eps('single')
ans =
1.192093e-07
Eps is the smallest number you can add to 1, and get a different result. So think of eps as the granularity for that precision, the least significant bit.
Much will depend on what you are doing with those numbers. Is there noise in them anyway, so that trash down around 1 part in 1e7 won't bother you? What are you doing with this result afterwards? On some computations that noise won't matter a bit.
If you're creating a mex function, would using a library like Eigen provide value?

