How to find rows in a table quickly based on time ranges specified in another table?
I have a table T that contains a start time, an end time, and an ID. I have another table S that contains a timestamp and an ID. T typically has ~50k rows, while S has ~1 million rows. Simplified example code for tables T and S:
rng(1);
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % ids are reused
T = table(timeStart, timeEnd, id);
head(T)
size(T)
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
head(S)
size(S)
For each row in T, I would like to find all the rows in S that have a timestamp within the (timeStart, timeEnd) range and the same ID as in T. Rows in T that do not have a timeEnd can be ignored. I see a few approaches:
- Use isbetween() in a for loop over all the rows of T for the time-based check, but that is slow and does not scale well with the number of rows in S (and scales linearly with the number of rows in T).
- Convert S to a timetable and use time ranges constructed from T.timeStart and T.timeEnd, together with a check on the T.id and S.id columns, but that seemed just as slow as approach 1.
- Convert all datetimes to Unix time (using posixtime) and work with them as real numbers instead. This seems to be much faster than approaches 1 and 2.
- Take approach 3 and, instead of a for loop, compare all time ranges against all timestamps at once, creating one giant (1 million by 50k) logical matrix indicating whether each row of S falls within the time range of each row of T. This matrix is sparse. Like all previous approaches, this does not scale well; to keep memory use manageable, I can build a number of smaller (X by 50k) sparse matrices and vertcat them at the end.
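Approach 3 can be sketched as follows (a minimal illustration of the posixtime idea, assuming the tables T and S constructed above; the loop-with-find structure is my own, not from the question):

```matlab
% Convert datetimes to numeric seconds since the Unix epoch once up front,
% so the per-row comparisons are cheap numeric operations.
tStart = posixtime(T.timeStart);
tEnd   = posixtime(T.timeEnd);   % NaT becomes NaN, so rows with missing timeEnd never match
ts     = posixtime(S.timeStamp);

match = cell(height(T), 1);      % match{k} = row indices of S matching row k of T
for k = 1:height(T)
    match{k} = find(ts >= tStart(k) & ts <= tEnd(k) & S.id == T.id(k));
end
```

This is still a loop over T, so it scales linearly in height(T), but each iteration is a vectorized numeric comparison rather than a datetime comparison.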
Does anyone see a good approach that scales well and keeps computation time manageable?
0 Comments
Answers (2)
Voss
on 30 Oct 2023
Edited: Voss on 30 Oct 2023
% input data construction (copied):
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
id = (1:numel(timeStart))';
T = table(timeStart, timeEnd, id);
rng(1);
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
% find which row of T contains each id in S:
[ism,idx] = ismember(S.id,T.id);
% make sure each id in S exists in T:
assert(all(ism));
% make a new table (S_new) that has the timeStamp from S and everything
% from T, but the rows of T are ordered according to the id in S:
S_new = [S(:,1), T(idx,:)];
head(S_new)
size(S_new)
% now keep only rows in S_new where timeStamp is between timeStart and timeEnd:
to_keep = S_new.timeStamp >= S_new.timeStart & S_new.timeStamp <= S_new.timeEnd;
S_new = S_new(to_keep,:);
head(S_new)
size(S_new)
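One caveat: ismember returns only the first matching row of T for each id, which is fine for the unique ids constructed above but not for the original question, where ids are reused in T. A join-based variant (a sketch I am adding, not part of the answer) handles the many-to-many case:

```matlab
% innerjoin pairs every row of S with every row of T that shares its id,
% replicating rows as needed when an id appears more than once in T.
SJ = innerjoin(S, T, 'Keys', 'id');
% Keep only pairs whose timestamp falls inside [timeStart, timeEnd].
% isbetween with a NaT bound returns false, so rows with missing timeEnd drop out.
SJ = SJ(isbetween(SJ.timeStamp, SJ.timeStart, SJ.timeEnd), :);
head(SJ)
```

The intermediate joined table can be large if ids repeat heavily, but both the join and the filter are vectorized.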