Finding index to values at latest dates in tall timetable

4 visualizzazioni (ultimi 30 giorni)
I have a tall timetable containing some repeated values (a, b, or c) in one column. I want to find the rows containing the latest time associated with a, with b, and with c and get rid of the non-latest rows in the tall timetable. Here's what I tried. Is there a better way?
%% Make some example data
nRow = 100;
letterList = {'a', 'b', 'c'};
stringList = randsample(letterList, nRow, true)'; % Column containing a, b or c
timeList = datetime(2017, 03, 1) + rand(nRow, 1)*1000;
% Make a regular table
tableReg = table(timeList, ... % Date
stringList, ... % List with values of either 'a', 'b', or 'c'
rand(nRow, 1), ... % Value
'VariableNames', {'Date', 'Letter', 'Value'});
% Make a tall table
tt = tall(tableReg);
nRow = gather(height(tt)); % Confirm there are now only 3 rows in tt after trimming
disp([num2str(nRow) ' rows originally'])
%% Get index to latest value of each letter
% Sort by time in descending order, so most recent is first
tt = sortrows(tt, 'Date', 'desc');
[~, iLatestLetter, ~] = unique(tt.Letter); % Default is first value
iLatestLetter = gather(iLatestLetter);
tt = tt(iLatestLetter, :); % Trim table to only select latest values of each letter
nRow = gather(height(tt)); % Confirm there are now only 3 rows in tt after trimming
disp([num2str(nRow) ' rows after trim'])

Risposta accettata

Rupesh
Rupesh il 26 Feb 2024
Modificato: Rupesh il 26 Feb 2024
Hi KAE,
I understand that you are trying to streamline the process of filtering a tall timetable in MATLAB to keep only the latest entries for each category ('a', 'b', and 'c') based on their associated timestamps. You are looking for an efficient method to achieve this without having to sort the entire dataset. Another optimised approached would be the one that uses the “groupsummary” function to directly calculate the maximum date for each category and then joins this result with the original data to filter out the earlier entries:
% Calculate the latest time for each category using groupsummary
latestTimes = groupsummary(tt, 'Letter', 'max', 'Date');
% Join the latest times with the original tall timetable to filter it
ttLatest = innerjoin(tt, latestTimes, 'Keys', 'Letter', 'RightVariables', 'max_Date');
% Rename the max_Date variable back to Date for clarity
ttLatest.Properties.VariableNames{'max_Date'} = 'Date';
% Gather the result to get the filtered timetable with only the latest entries
ttLatest = gather(ttLatest);
Here’s a concise overview of the functions used in the code.
“groupsummary”: Utilized to compute the latest (maximum) datetime for each category in the 'Letter' column without sorting the entire dataset.
“innerjoin”: Employed to merge the original timetable with the latest times, ensuring only rows with the latest datetimes for each category are retained.
“gather”: Executed to materialize the results from the tall array operations into memory, finalizing the filtering process.
“Properties.VariableNames”: Accessed to rename a column in the timetable for consistency after performing the join operation.
Above procedure should give you a tall timetable with only the latest entry for each category, avoiding the need to sort the entire timetable and improving the overall efficiency of the operation.
You can also refer to below documents regarding detailed information about above function used in the code.
Hope this helps!

Più risposte (0)

Categorie

Scopri di più su Tables in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by