Azzera filtri
Azzera filtri

How to find rows in table quickly based on time ranges specified in another table?

5 visualizzazioni (ultimi 30 giorni)
I have a table T that contains a time start, time end and an ID. I have another table S that contains a timestamp and an ID. T typically has 50k rows, while S has ~1 million rows. Simplified example code for tables T and S:
rng(1);
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
head(T)
timeStart timeEnd id ____________________ ____________________ __ 01-Jan-2023 00:00:00 02-Jan-2023 00:00:00 1 01-Jan-2023 00:02:52 02-Jan-2023 00:02:52 2 01-Jan-2023 00:05:45 02-Jan-2023 00:05:45 3 01-Jan-2023 00:08:38 02-Jan-2023 00:08:38 4 01-Jan-2023 00:11:31 02-Jan-2023 00:11:31 5 01-Jan-2023 00:14:24 NaT 6 01-Jan-2023 00:17:16 02-Jan-2023 00:17:16 7 01-Jan-2023 00:20:09 02-Jan-2023 00:20:09 8
size(T)
ans = 1×2
50001 3
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
head(S)
timeStamp id ____________________ _____ 01-Jan-2023 00:00:00 8167 01-Jan-2023 00:00:08 13177 01-Jan-2023 00:00:17 22150 01-Jan-2023 00:00:25 8933 01-Jan-2023 00:00:34 22715 01-Jan-2023 00:00:43 15585 01-Jan-2023 00:00:51 396 01-Jan-2023 00:01:00 23237
size(S)
ans = 1×2
1000001 2
For each row in T, I would like to find all the rows in S that has a timestamp within the (time start, time end) range and has the same ID as in T. Rows in T that do not have an timeEnd can be ignored. I see a few approaches:
  1. Use isbetween() in a for loop around all the rows of T for the time-based check, but that was slow and really does not scale well with the number of rows in S (and linearly in the number of rows in T).
  2. I could achieve this by converting S to a timetable and using timeranges constructed from T.timeStart and T.timeEnd together with a check on the T.id and S.id columns, but that seemed to be just as slow as approach 1.
  3. I can convert all datetimes to unix time (using posixtime) and instead work with those as real numbers. This seems to be much faster than approaches 1 and 2.
  4. I can take approach 3 and instead of using a for loop, compare all time ranges against all timestamps and create one giant (1 million by 50k) logical matrix indicating whether a row in S is within the timerange of each row of T. This matrix is sparse. Like all previous approaches, this does not scale well at all. To make memory use manageable, I can create a number of smaller (X by 50k) sparse matrices and then vertcat them at the end.
Does anyone see a good approach that scales well and keeps computation time manageable?

Risposte (2)

Voss
Voss il 30 Ott 2023
Modificato: Voss il 30 Ott 2023
% input data construction (copied):
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
id = (1:numel(timeStart))';
T = table(timeStart, timeEnd, id);
rng(1);
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
% find which row of T contains each id in S:
[ism,idx] = ismember(S.id,T.id);
% make sure each id in S exists in T:
assert(all(ism));
% make a new table (S_new) that has the timeStamp from S and everything
% from T, but the rows of T are ordered according to the id in S:
S_new = [S(:,1), T(idx,:)];
head(S_new)
timeStamp timeStart timeEnd id ____________________ ____________________ ____________________ _____ 01-Jan-2023 00:00:00 11-Feb-2023 16:50:52 12-Feb-2023 16:50:52 20852 01-Jan-2023 00:00:08 14-Mar-2023 00:46:04 15-Mar-2023 00:46:04 36017 01-Jan-2023 00:00:17 01-Jan-2023 00:14:24 02-Jan-2023 00:14:24 6 01-Jan-2023 00:00:25 31-Jan-2023 05:34:04 01-Feb-2023 05:34:04 15117 01-Jan-2023 00:00:34 15-Jan-2023 16:10:33 16-Jan-2023 16:10:33 7338 01-Jan-2023 00:00:43 10-Jan-2023 05:36:57 11-Jan-2023 05:36:57 4618 01-Jan-2023 00:00:51 19-Jan-2023 15:01:26 20-Jan-2023 15:01:26 9314 01-Jan-2023 00:01:00 04-Feb-2023 13:20:38 05-Feb-2023 13:20:38 17279
size(S_new)
ans = 1×2
1000001 4
% now keep only rows in S_new where timeStamp is between timeStart and timeEnd:
to_keep = S_new.timeStamp >= S_new.timeStart & S_new.timeStamp <= S_new.timeEnd;
S_new = S_new(to_keep,:);
head(S_new)
timeStamp timeStart timeEnd id ____________________ ____________________ ____________________ ___ 01-Jan-2023 02:58:24 01-Jan-2023 01:49:26 02-Jan-2023 01:49:26 39 01-Jan-2023 03:50:24 01-Jan-2023 02:26:52 02-Jan-2023 02:26:52 52 01-Jan-2023 06:14:15 01-Jan-2023 04:42:14 02-Jan-2023 04:42:14 99 01-Jan-2023 06:59:11 01-Jan-2023 02:38:24 02-Jan-2023 02:38:24 56 01-Jan-2023 07:14:52 01-Jan-2023 06:17:16 02-Jan-2023 06:17:16 132 01-Jan-2023 07:31:00 01-Jan-2023 06:31:40 02-Jan-2023 06:31:40 137 01-Jan-2023 07:58:04 01-Jan-2023 07:32:09 02-Jan-2023 07:32:09 158 01-Jan-2023 08:03:59 01-Jan-2023 01:52:19 02-Jan-2023 01:52:19 40
size(S_new)
ans = 1×2
10033 4
  2 Commenti
Jori Selen
Jori Selen il 31 Ott 2023
Thanks for the nice solution. I've tried implementing it for my real case and ran into an issue. I should have been a bit more precise in my example: id's in the table T are also not unique and are reused. However, no 2 rows will have the same id and have overlap in their (timeStart, timeEnd) intervals. I will adapt the example.
Jori Selen
Jori Selen il 31 Ott 2023
In addition, some timeEnd fields are missing. That is not a problem for your solution, but I've added it to the example as well.

Accedi per commentare.


Jori Selen
Jori Selen il 31 Ott 2023
Modificato: Jori Selen il 31 Ott 2023
I've come up with the solution below, but it still takes ~50 seconds on my machine with approximately all time spent on the sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id'); line. It should be possible to create something considerably faster, but I don't see how.
rng(1);
% Create table T
timeStart = datetime(2023,1,1) + days(0:0.002:100)';
timeEnd = datetime(2023,1,2) + days(0:0.002:100)';
timeEnd(randi(numel(timeEnd), 100, 1)) = NaT; % some data will be missing
id = [ 1:floor(numel(timeStart)/2) 1:ceil(numel(timeStart)/2) ]'; % id's are reused
T = table(timeStart, timeEnd, id);
nT = rows(T);
% Create table S
timeStamp = datetime(2023,1,1) + days(0:0.0001:100)';
id = randi(max(T.id), numel(timeStamp), 1);
S = table(timeStamp, id);
nS = rows(S);
% Determine which rows in T and S have the same ID
MAX_ELEMENTS_PER_GROUP = 1e9;
nElementsPerGroup = ceil(MAX_ELEMENTS_PER_GROUP / nT);
nGroups = ceil(nS / nElementsPerGroup);
sameID = cell(nGroups, 1);
for iGroup = 1:nGroups % constructed incrementally to not run into memory issues
idxStart = (iGroup - 1)*nElementsPerGroup + 1;
idxEnd = min(iGroup*nElementsPerGroup, nS);
sameID{iGroup} = sparse(S.id(idxStart:idxEnd) == T.id');
end
sameID = vertcat(sameID{:}); % sparse logical matrix with nS rows and nT columns
% For each row in T, determine which rows of S have the same ID and have a
% timeStamp within the [timeStart, timeEnd] timerange
idx = 1:nS;
idxAssociatedRowsInS = cell(nT, 1);
for iT = 1:nT
idxSameID = idx(sameID(:, iT));
withinTimeRange = isbetween(S.timeStamp(idxSameID), T.timeStart(iT), T.timeEnd(iT));
idxAssociatedRowsInS{iT} = idxSameID(withinTimeRange);
end
T.idxAssociatedRowsInS = idxAssociatedRowsInS;
% any other way of storing the same information in T (or a joined table) is also fine

Categorie

Scopri di più su Data Type Conversion in Help Center e File Exchange

Prodotti


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by