Replace multiple intervals in array with NaN - No loops

Hi
I am working on a project with a lot of data. In the area of 10.000.000 rows. I therefore cannot use for loops. The data has to be cleaned, with respect to certain parameters. These parameters are found independently and are not on measured with the same timesteps. Below is a simplified example that explains the idea of the problem. :)
The main dataset, let's call it:
dataset = 1:30;
Two arrays are now given. They indicate intervals within which measurements has to replaced with NaN.
upperbound = [2,5,10,18,27];
lowerbound = [0,3,8,15,25];
With a for loop I would normally say:
for i = 1:length(upperbound);
dataset(lowerbound(i):upperbound(i)) = NaN;
end
But with so much data this is not possible.
Do any of you have a good idea to solve this ?

2 Commenti

The smallest lowerbound index is zero. Is this a typo, because indices must be >= 1?
Yes, that is a typo!

Accedi per commentare.

 Risposta accettata

Jan
Jan il 14 Feb 2017
Modificato: Jan il 15 Feb 2017
With logical indexing:
data = 1:30;
upper = [2,5,10,18,27];
lower = [1,3,8,15,25]; % Not 0 as 1st element
% FAILING for upper==lower or overlapping intervals:
index = zeros(1, numel(data));
index(lower) = 1;
index(upper) = -1;
toNaN = (cumsum(index) == 1) | (index == -1); % Thanks Adam!
data(toNaN) = NaN;
Alternatively use Bruno's FEX: mcolon:
v = mcolon(lower, upper);
data(v) = NaN;
If this is the bottleneck of your code, use a C-Mex function.
// File: BlockCopy.c
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *low, *up, *out, NaN = mxGetNaN(), *p, *pend;
size_t n, lowi, upi, maxi;
maxi = mxGetNumberOfElements(prhs[0]);
low = mxGetPr(prhs[1]);
up = mxGetPr(prhs[2]);
n = mxGetNumberOfElements(prhs[1]);
if (n != mxGetNumberOfElements(prhs[1])) {
mexErrMsgTxt("*** BlockCopy[mex]: Index vectors must have the same length.");
}
plhs[0] = mxDuplicateArray(prhs[0]);
out = mxGetPr(plhs[0]) - 1; // For 1-based Matlab indexing
while (n--) {
lowi = (size_t) *low++; // Check validity of indices
upi = (size_t) *up++;
if (lowi < 1 || upi > maxi) {
mexErrMsgTxt("*** BlockCopy[mex]: Index out of bounds.");
}
p = out + lowi;
pend = out + upi;
while (p <= pend) {
*p++ = NaN;
}
}
return;
}
This works with overlapping intervals also.
Attention: The C-code will crash if any index is <= 0 or > then the length of the input array or if the inputs are no DOUBLE arrays. Either care for proper inputs or add checks of each index.

7 Commenti

To be inclusive of the upper bound you'd need the condition to be:
toNaN = ( cumsum( index ) == 1 ) | index == -1
though.
Jan
Jan il 14 Feb 2017
Modificato: Jan il 14 Feb 2017
@Adam: You are right. Thanks for finding this bug. I've considered it by adding 1 to the upperbound instead. Perhaps your suggestion is faster.
Creating a double vector with the same size as the input is a painful waste of resources, but unfortuantely cumsum doe not operate on INT8.
Adding 1 to upperbound won't work so far as I can see as the new upper bounds will overwrite the lowerbound where they are contiguous and the whole section will be missed out
Jan
Jan il 14 Feb 2017
Modificato: Jan il 15 Feb 2017
@Adam: Sigh. You are right. Unfortunately your suggestion fails as mine if the upper limit equals the lower limit. I'm going to have a break.
Anyway, my solution is not lean:
  1. zeros(1, numel(dataset)) creates a double vector of the same size as the input.
  2. cumsum(index) creates the next full size vector.
  3. (cumsum(index) == 1) and index == -1 create two temporary vectors of the same size, at least with logical type
  4. a | b creates the next logical vector.
This is not memory efficient. It cries for C-mex-ing.
I thought of this:
index = zeros(1, numel(dataset)+1);
index(lowerbound) = 1;
index(upperbound+1) = index(upperbound+1) + 1;
toNaN = (mod(cumsum(index), 2) == 1);
dataset(toNaN(1:end-1)) = NaN;
Not nice also, and slow.
Jan
Jan il 15 Feb 2017
Modificato: Jan il 15 Feb 2017
@Michael Madelaire: I'm impressed. The vectorized method cannot handle intervals with a single element and overlapping intervals. mcolon is slower than expected. And The speedup of the MEX is not overwhelming:
function test
dataset = 1:1e7;
lower = sort(randi([1, 1e7 - 1000], 1, 1e5));
upper = lower + randi([1,1000], size(lower));
tic; v1 = method1(dataset, lower, upper); toc;
tic; v2 = method2(dataset, lower, upper); toc;
tic; v3 = method3(dataset, lower, upper); toc;
tic; v4 = method4(dataset, lower, upper); toc;
tic; v5 = method5(dataset, lower, upper); toc;
function v = method1(v, lower, upper) % Original loop
for i = 1:length(upper)
v(lower(i):upper(i)) = NaN;
end
function v = method2(v, lower, upper) % Vectorized 1
index = zeros(1, numel(v));
index(lower) = 1;
index(upper) = -1;
toNaN = (cumsum(index) == 1) | (index == -1);
v(toNaN) = NaN;
function v = method3(v, lower, upper) % Vectorized 2
index = zeros(1, numel(v)+1);
index(lower) = 1;
index(upper+1) = index(upper+1) + 1;
toNaN = (mod(cumsum(index), 2) == 1);
v(toNaN(1:end-1)) = NaN;
function v = method4(v, lower, upper) % mcolon
idx = mcolon(lower, upper);
v(idx) = NaN;
function v = method5(v, lower, upper) % C-Mex
v = BlockCopy(v, lower, upper);
Elapsed time is 0.328137 seconds. Original loop
Elapsed time is 0.286607 seconds. Vectorized 1
Elapsed time is 0.878544 seconds. Vectorized 2
Elapsed time is 1.516311 seconds. mcolon
Elapsed time is 0.149635 seconds. C-mex
Matlab 2016b/64, Win, SDK7.1 compiler, dual core. Note that the vectorized versions do not handle the overlapping intervals correctly. A data set without overlaps:
lower = 1:1e3:(1e7-1000);
upper = lower + randi([1, 999], size(lower));
Elapsed time is 0.153022 seconds. Original loops
Elapsed time is 0.332387 seconds. Vectorized 1
Elapsed time is 0.787915 seconds. Vectorized 2
Elapsed time is 0.252076 seconds. mcolon
Elapsed time is 0.109385 seconds. C-mex
Summary: The naive loops are fine! The vectorized version can almost compete, but is less powerful with scalar or overlapping intervals (perhaps somebody finds a better solution). The C-mex is 30% to 55% faster only. Omitting the boundary checks saves some percent only, but let the Matlab session crash in case of unexpected input.
The Matlab loop is efficient, because the main work of writing NaNs to the memory performs well, while the loop itself and the check of the boudaries is not very expensive.
Thank you so much for the thorough answer answer! The elapsed time table is very cool. I think i will stick to the loops then. But really had expected the vectorized solution to be much faster.
@Michael: Feel free to use the C-mex. I've attached the commented C-code and a pre-compiled function for Win/64.

Accedi per commentare.

Più risposte (0)

Categorie

Scopri di più su Loops and Conditional Statements in Centro assistenza e File Exchange

Richiesto:

il 14 Feb 2017

Commentato:

Jan
il 15 Feb 2017

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by