Perhaps then I need to understand how diff is written so I can write my own MEX file to speed up the averaging process?
I'm not a MathWorks developer, so this is all speculation, but it stands to reason that accelerating diff() on the GPU is easier than accelerating other routines like movsum. This is because each thread always requires exactly 2 pieces of data. The CUDA kernel threads can read all elements of A(1:Nx-1,:,:) from GPU global memory in one synchronized step, and all elements of A(2:Nx,:,:) in a second synchronized step. No memory sharing is required between the threads, all data access is embarrassingly parallel and naturally coalesced, and the memory consumption per thread is so small that the fastest local memory registers can be used.
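To illustrate the access pattern (this is my own sketch, not MathWorks' actual implementation; the kernel name and the flattening of the trailing dimensions into M = Ny*Nz columns are my assumptions), a minimal diff-along-dim-1 kernel could look like this:

```cuda
// Sketch: diff along the first dimension of a column-major
// Nx-by-M array, one thread per element of the (Nx-1)-by-M output.
// Hypothetical kernel, not MathWorks code.
__global__ void diff_dim1(const double *A, double *D, int Nx, long long M)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long nOut = (long long)(Nx - 1) * M;
    if (i >= nOut) return;

    long long col = i / (Nx - 1);     // which column of the result
    long long row = i % (Nx - 1);     // which row within that column
    long long src = col * Nx + row;   // linear index of A(row, col)

    // Exactly two global-memory reads per thread, no shared memory,
    // and adjacent threads touch adjacent addresses (coalesced):
    D[i] = A[src + 1] - A[src];
}
```

Each thread keeps its two operands in registers; there is no inter-thread communication at all, which is what makes this pattern so GPU-friendly.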
Conversely, for movsum and conv, the code needs to handle moving windows of arbitrary, run-time-specified sizes, so the parallelization strategy must be more complicated (and hence slower): overlapping window data is typically staged through shared memory, and the per-thread work can no longer fit entirely in registers.
You could almost certainly write your own optimized movsum CUDA kernel using the same principles as diff(), as long as the moving window size is always 2, like diff's. Or at least, as long as the window size is small, fixed, and known a priori.
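As a sketch of that idea (again my own illustration with a hypothetical kernel name, not an existing library routine), fixing the window size at compile time via a template parameter lets the compiler fully unroll the inner loop and keep the accumulator in a register, recovering most of the diff()-style simplicity:

```cuda
// Sketch: moving sum over a small, compile-time-fixed window W on a
// length-n vector; the output has n - W + 1 elements. Hypothetical code.
template <int W>
__global__ void movsum_fixed(const double *x, double *y, long long nOut)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i >= nOut) return;

    double s = 0.0;
    #pragma unroll              // W is a compile-time constant, so this
    for (int k = 0; k < W; ++k) // loop unrolls into W plain reads
        s += x[i + k];          // adjacent threads read adjacent data
    y[i] = s;
}

// Launch example for W = 2 (the diff-like case):
//   movsum_fixed<2><<<numBlocks, 256>>>(d_x, d_y, n - 1);
```

Neighboring threads re-read overlapping elements, but for small W those reads hit the L1/L2 caches, so no shared-memory staging is needed; this only pays off when W is genuinely small and known when the kernel is compiled.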