There are many ways to solve this, such as reimplementing the function yourself in C/C++ and wrapping it as a MEX file. I now think it is not just about improving the algorithm; it is also about external factors and adapting to my own needs. Even if I rewrite it in optimised C/C++ (ignoring marginal effects), for example with cv::copyTo, the performance does not really differ much; see the performance curve below.
Note: the test uses a two-dimensional image array of type uint8.
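
To make this kind of comparison concrete, here is a minimal C++ sketch, not the benchmark behind the curve. It assumes the operation is a plain copy of a contiguous 2-D uint8 image, which is only a guess based on the mention of cv::copyTo. The names `copy_loop`, `copy_memcpy`, `time_ms` and `compare_copies` are hypothetical helpers invented for illustration. It times a hand-written loop against `std::memcpy` on the same buffer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical helper: element-by-element copy of a rows x cols uint8 image
// stored contiguously (assumed layout, not taken from the original post).
void copy_loop(const std::uint8_t* src, std::uint8_t* dst,
               std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[r * cols + c];
}

// Hypothetical helper: the same copy done with a single library call.
void copy_memcpy(const std::uint8_t* src, std::uint8_t* dst,
                 std::size_t rows, std::size_t cols) {
    std::memcpy(dst, src, rows * cols);
}

// Average wall-clock time of f() in milliseconds over reps runs.
template <typename F>
double time_ms(F&& f, int reps) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
}

// Times both variants on one image size, prints the averages, and
// returns true when both destinations equal the source.
bool compare_copies(std::size_t rows, std::size_t cols, int reps) {
    assert(rows > 0 && cols > 0 && reps > 0);
    std::vector<std::uint8_t> src(rows * cols);
    for (std::size_t i = 0; i < src.size(); ++i)
        src[i] = static_cast<std::uint8_t>(i % 251);  // arbitrary test pattern
    std::vector<std::uint8_t> a(src.size(), 0), b(src.size(), 0);

    const double t_loop = time_ms(
        [&] { copy_loop(src.data(), a.data(), rows, cols); }, reps);
    const double t_mem = time_ms(
        [&] { copy_memcpy(src.data(), b.data(), rows, cols); }, reps);

    std::printf("%zu x %zu: loop %.4f ms, memcpy %.4f ms\n",
                rows, cols, t_loop, t_mem);
    return a == src && b == src;
}
```

Calling `compare_copies(rows, cols, reps)` from `main` for a range of sizes gives the data points for a curve. The timings depend heavily on the compiler, the optimisation flags and the memory system, so no particular result should be expected from this sketch.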