Concept decision for visualization: for loops vs custom iterators vs findgroups/splitapply vs ???

Hi,
in data analysis, the visualization workflow usually is
  1. load data into (time-) table T
  2. process/transform data
  3. visualize data
Step 3 usually involves filtering, grouping, and indexing directly into T to avoid copies of large data sets.
I'm searching for general concepts for step 3 that are flexible, easy to understand, and sufficiently performant.
Simple example: a table with two data variables and one categorical variable used for grouping. I'd like to plot each variable in its own subplot and each group as a single line plot, i.e. a figure with two subplots and three line plots per subplot (since there are three categories). You can find the whole script attached!
% T =
%
%   3×3 timetable
%
%             Time            cat      dat1        dat2
%     ____________________    ___    ________    _______
%
%     01-Jan-2021 12:00:00     2     -0.54518     -5.995
%     01-Jan-2021 12:00:00     3      0.37835     18.321
%     01-Jan-2021 12:00:00     1     -0.32751    -10.042
The script/class should be as readable as possible while maintaining performance. I hate using traditional for loops because they are error-prone: one always has to know which index iterates through which array. I therefore prefer "for each" style loops that iterate directly over an array or list instead of using an integer index. But this does not work as soon as two lists have to be iterated through at the same time. In this example, the array of subplot handles and the list of variables to be plotted have the same size and are iterated through together; there is a connection between them.
In my opinion, some kind of nested loop is needed here: one has to iterate through the subplots anyway, and the plotting cannot be vectorized because the number of elements per group differs, so the data cannot be reshaped into a matrix.
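One toolbox-free way to iterate over two lists together relies on the fact that a MATLAB for loop walks over the columns of its right-hand side. Stacking the lists as rows of a cell array gives one column per tuple. The sketch below uses hypothetical stand-in lists (names, scales), not the handles from the question:

```matlab
% Hypothetical stand-ins for two lists that belong together.
names  = ["dat1", "dat2"];
scales = [1, 10];

% 2-by-N cell array: row 1 holds the names, row 2 the scales.
pairs = [num2cell(names); num2cell(scales)];

labels = strings(1, 0);
for pair = pairs                 % pair is a 2-by-1 cell on every pass
    [name, scale] = pair{:};     % unpack the tuple into named variables
    labels(end+1) = name + " x" + scale; %#ok<SAGROW>
end
```

The same pattern should also work when one row holds graphics handles, as long as both lists have the same number of elements; num2cell raises no error for mismatched lengths, but the vertical concatenation does.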
Let's start with the example, create data:
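testSize = 1e5; % assumed value; testSize is defined in the attached script, which is not shown here
d = datetime("2021-01-01 12:00");
t = linspace(d, d+hours(1), testSize);
cat = categorical(randi(3, size(t))); % note: this variable shadows the built-in cat function
dat1 = randn(size(t));
dat2 = 10*randn(size(t));
T = timetable(t', cat', dat1', dat2', 'VariableNames', ["cat", "dat1", "dat2"]);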
Approach 1: manual grouping and plotting. Simple for loops with integer indices. The user has to know which variables belong together, i.e. which ones must be iterated through simultaneously (e.g. plotVars and haxes).
function naiveForLoopPlotTest1(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    % infer number of subplots and lines
    numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    numGrps = length(grps); % #lines per subplot
    % we want/need to keep the handles for further processing
    haxes = gobjects(1, numPlotVars); % subplot handles; gobjects(n) alone would allocate an n-by-n array
    hlines = gobjects(numPlotVars, numGrps); % line handles per subplot
    % create graphics objects, called just once
    figure
    tiledlayout('flow')
    for iPlotVar = 1:numPlotVars
        haxes(iPlotVar) = nexttile();
        hold(haxes(iPlotVar), 'on');
        for iGrp = 1:numGrps
            hlines(iPlotVar, iGrp) = plot(haxes(iPlotVar), NaT, NaN); % explicit target axes
        end
    end
    % plotting, could be called multiple times in a real program - I hate it, it's so verbose
    % and one needs to keep track of the indices
    for iPlotVar = 1:numPlotVars
        plotVar = plotVars(iPlotVar); % current variable to plot
        for iGrp = 1:numGrps % nested loop...
            grp = grps(iGrp); % current group
            idx = T.cat == grp; % logical row filter for this group
            TT = T(idx, :); % filtered (grouped) rows; note this copies all variables
            set(hlines(iPlotVar, iGrp), 'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update graphics data
        end
    end
end
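Since the goal is to avoid copies of large data, one possible variation of Approach 1 is to compute the group masks once and to index only the column that is actually plotted, instead of building the sub-table T(idx, :) on every pass. The sketch below is an assumption about how that could look; it uses a tiny stand-in timetable and collects the per-line data in a cell array (xy, a hypothetical name) so that it runs without a figure:

```matlab
% Small stand-in data with the same variable names as in the question.
t = datetime(2021,1,1,12,0,0) + minutes(0:5)';
T = timetable(t, categorical([1;2;3;1;2;3]), (1:6)', 10*(1:6)', ...
    'VariableNames', ["cat", "dat1", "dat2"]);
plotVars = ["dat1", "dat2"];

[G, grps] = findgroups(T.cat);        % G: group number per row
masks = (G == 1:numel(grps));         % N-by-K logical, one column per group
times = T.Properties.RowTimes;        % row times extracted once

xy = cell(numel(plotVars), numel(grps));
for iVar = 1:numel(plotVars)
    col = T.(plotVars(iVar));         % only the plotted column, no sub-table
    for iGrp = 1:numel(grps)
        m = masks(:, iGrp);
        xy{iVar, iGrp} = {times(m), col(m)};
        % In the plotting code this would become, for example:
        % set(hlines(iVar, iGrp), 'XData', times(m), 'YData', col(m));
    end
end
```

Whether this is measurably faster depends on the table width and the number of groups; it has not been benchmarked here.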
Approach 2: Reduce the number of numXXX variables; do not iterate through integer indices but through the elements directly. Drawback: needs the for-each toolbox (with an awful license) https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each . Use arrayfun for handle initialization.
function maybeBetterForLoopPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    % NO LONGER NEEDED: infer number of plots & subplots
    % numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    % numGrps = length(grps); % #lines per subplot
    % combine / make graphics object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    hlines = arrayfun(@(hax)arrayfun(@(~)plot(hax, NaT, NaN), grps), ...
        haxes, 'UniformOutput', false); % lines per subplot; not nice either, since the result is a cell array
    % 1x2 cell array with line handles, each of size numGrps x 1
    % plotting, could be called multiple times in a real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in MATLAB?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly seen that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        for subElem = eachTuple(grps, hlinesSub)
            grp = subElem{1};
            hline = subElem{2}; % handle class, can therefore be updated here!
            idx = T.cat == grp;
            TT = T(idx, :); % filtered (grouped) rows; note this copies all variables
            set(hline, 'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update plots
        end
    end
end
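Another way to make it visible that certain things belong together, without the for-each toolbox, might be to bundle them in a struct array and iterate over the structs. This is only a sketch under assumptions: the struct name specs and its fields var, ax, and lines are hypothetical, the data is a small stand-in, and the figure is hidden so the code also runs headless.

```matlab
% Small stand-in data with the same variable names as in the question.
t = datetime(2021,1,1,12,0,0) + minutes(0:5)';
T = timetable(t, categorical([1;2;3;1;2;3]), (1:6)', 10*(1:6)', ...
    'VariableNames', ["cat", "dat1", "dat2"]);
plotVars = ["dat1", "dat2"];
grps = unique(T.cat);

fig = figure('Visible', 'off');       % hidden figure
tl  = tiledlayout(fig, 'flow');

% One struct per subplot: variable name, axes, and line handles live together.
specs = struct('var', num2cell(plotVars), 'ax', [], 'lines', []);
for k = 1:numel(specs)
    specs(k).ax = nexttile(tl);
    hold(specs(k).ax, 'on');
    specs(k).lines = gobjects(numel(grps), 1);
    for g = 1:numel(grps)
        specs(k).lines(g) = plot(specs(k).ax, NaT, NaN);
    end
end

% Update pass: for iterates over the columns of specs, so s is one struct.
for s = specs
    for g = 1:numel(grps)
        idx = T.cat == grps(g);
        set(s.lines(g), 'XData', T.Properties.RowTimes(idx), ...
            'YData', T.(s.var)(idx));
    end
end
```

Because the line objects are handles, updating s.lines(g) inside the loop changes the same objects that specs refers to, even though s itself is a copy of the struct.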
Approach 3: Replace the inner (second) for loop with splitapply.
function splitApplyPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    % find groups
    [G, grps] = findgroups(T.cat);
    % combine / make graphics object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    hlines = arrayfun(@(hax)arrayfun(@(~)plot(hax, NaT, NaN), grps), ...
        haxes, 'UniformOutput', false); % lines per subplot; not nice either, since the result is a cell array
    % 1x2 cell array with line handles, each of size numGrps x 1
    % plotting, could be called multiple times in a real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in MATLAB?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly visible that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        hline = hlinesSub(G); % this is ugly but needed - creation of a big handle array
        % remove 2nd for loop - is this clearer?
        % the problem here is: how to supply the hlinesSub handle array without
        % making it huge? All data variables must have the same number of rows in
        % splitapply.
        splitapply(@(h, t, dat) ...
            set(h(1), 'XData', t, 'YData', dat), ...
            hline, T.Properties.RowTimes, T.(plotVar), G); % update plots
    end
end
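A possible way around the expanded handle array hlinesSub(G) is to let splitapply return the per-group data as cell arrays and to hand those cells to the line handles in a single vectorized set call. The sketch below assumes that set(H, NameArray, ValueArray) accepts datetime XData in the same way as the scalar set calls above; the data and the hidden figure are stand-ins.

```matlab
% Small stand-in data with the same variable names as in the question.
t = datetime(2021,1,1,12,0,0) + minutes(0:5)';
T = timetable(t, categorical([1;2;3;1;2;3]), (1:6)', 10*(1:6)', ...
    'VariableNames', ["cat", "dat1", "dat2"]);
plotVar = "dat1";

[G, grps] = findgroups(T.cat);

% One cell per group; groups may have different numbers of rows.
xByGrp = splitapply(@(x) {x}, T.Properties.RowTimes, G);
yByGrp = splitapply(@(y) {y}, T.(plotVar), G);

% Create one empty line per group in a hidden figure.
fig = figure('Visible', 'off');
ax  = axes(fig);
hold(ax, 'on');
hlinesSub = gobjects(numel(grps), 1);
for g = 1:numel(grps)
    hlinesSub(g) = plot(ax, NaT, NaN);
end

% Row g of the K-by-2 value cell is applied to hlinesSub(g).
set(hlinesSub, {'XData', 'YData'}, [xByGrp, yByGrp]);
```

Only numGrps handles are involved here, so nothing has to be replicated to the number of table rows. If the vectorized set form turns out not to work for a given release, a short loop over g with set(hlinesSub(g), 'XData', xByGrp{g}, 'YData', yByGrp{g}) does the same job.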
Benchmark results:
% bench =
%
%   3×4 table
%
%                                  TimeTimeit    TimeProfiler    TotalMemoryMb    PeakMemoryMb
%                                  __________    ____________    _____________    ____________
%
%     naiveForLoopPlotTest1         0.11933        0.17104          3962.1           65.221
%     maybeBetterForLoopPlotTest    0.10815        0.11897          141.47           41.275
%     splitApplyPlotTest            0.17568        0.18233           11215           185.27
This is very interesting: splitapply seems to need much more time and memory. That makes sense, especially because the hline array must be expanded (as with repmat) to the size of the group array G, which seems like complete nonsense.
Interestingly, maybeBetterForLoopPlotTest is faster and needs less memory than the other solutions.
What do you prefer? Are there other ways to structure the code, or functions I'm not aware of yet? I guess this problem comes up every day in data analysis.
I'm looking forward to your suggestions, thank you very much!
1 Comment
Simon on 30 Aug 2023
Thanks for making the benchmark comparison. I was actually looking for comparisons between splitapply and parfor. I tried splitapply, and in my case (mainly data cleaning) it is faster than a for loop, and the code is much more readable. I also have the Parallel Computing Toolbox; parfor generally speeds things up really well on my eight-core machine. I haven't made any comparison between splitapply and parfor (maybe I should do that).
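A comparison like the one described in this comment could be set up with timeit roughly as follows. This is a sketch with synthetic data and an assumed cleaning step (a per-group mean that ignores NaN); parfor is left out because it needs the Parallel Computing Toolbox, and no timings are claimed here since they depend on the machine and the data.

```matlab
rng(0);                                  % reproducible synthetic data
n = 1e5;
g = randi(50, n, 1);                     % group numbers 1..50
x = randn(n, 1);
x(rand(n, 1) < 0.01) = NaN;              % sprinkle in some missing values

% Variant 1: splitapply computes the statistic per group.
viaSplitapply = @() splitapply(@(v) mean(v, 'omitnan'), x, g);

% Variant 2: explicit per-group pass with logical indexing.
viaLoop = @() arrayfun(@(k) mean(x(g == k), 'omitnan'), (1:max(g))');

tSplit = timeit(viaSplitapply);
tLoop  = timeit(viaLoop);
fprintf('splitapply: %.4f s, per-group loop: %.4f s\n', tSplit, tLoop);
```

Both variants should return the same 50-by-1 vector, so the timing difference reflects only how the grouping is carried out.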


Answers (1)

Gaurav Garg on 22 Feb 2021
Hi Jan,
The benchmark results you have tabulated seem correct and make sense for the case study you describe.
In addition to the approaches you have already mentioned, gscatter is a function that creates scatter plots of data grouped by a grouping variable. It is not a general approach, but gscatter and similar functions (such as scatter, grpstats, and gplotmatrix) are worth being aware of.
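For reference, a minimal gscatter call on data shaped like the example in the question might look like this. It assumes that the Statistics and Machine Learning Toolbox is installed; the table here is a small synthetic stand-in, and the columns are passed as plain vectors.

```matlab
rng(1);                                            % reproducible synthetic data
grp  = categorical(repmat([1; 2; 3], 20, 1));      % three groups, 20 rows each
T = table(grp, randn(60, 1), 10*randn(60, 1), ...
    'VariableNames', ["cat", "dat1", "dat2"]);

fig = figure('Visible', 'off');                    % hidden figure
h = gscatter(T.dat1, T.dat2, T.cat);               % one line object per group
```

The returned handles h can be kept and restyled later, in the same spirit as the hlines arrays in the approaches above.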
1 Comment
Jan Kappen on 22 Feb 2021
Thank you very much for your comment.
I'm aware of the g**** functions, but they belong to the Statistics Toolbox and feel kind of outdated. There seems to be no support for tables, which are one of the best features MATLAB has gained in recent releases.
Do you have suggestions on how to speed up the splitapply workflow, or on any other similar function?
Or would you (or anybody else) just stick to for loops with integer indices? If yes, why?

Release

R2020b
