Concept decision for visualization: for loops vs custom iterators vs findgroups/splitapply vs ???

Question

matlabQuestions.m

Hi,

in data analysis, the visualization workflow usually is

1. load data into a (time)table T
2. process/transform data
3. visualize data

Point 3 usually involves filtering, grouping and indexing directly into T to avoid copies of large data sets.
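As a minimal sketch of what "indexing directly into T" can look like: a logical mask is built once and applied only to the variables that are actually plotted, rather than copying all matching rows into a new timetable. The small timetable and its values are made up for illustration.

```matlab
% Tiny illustrative timetable (values are made up)
t = datetime(2021,1,1,12,0,0) + minutes(0:3)';
T = timetable(t, categorical([1;2;1;2]), [10;20;30;40], ...
    'VariableNames', ["cat", "dat1"]);

idx = T.cat == "1";               % logical mask for one group
x = T.Properties.RowTimes(idx);   % only the row times of that group
y = T.dat1(idx);                  % only the one variable that is plotted

disp(table(x, y))
```

Whether this is measurably cheaper than `T(idx, :)` depends on how many variables T has; it mainly avoids copying columns that are never plotted.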

I'm looking for general concepts for point 3 that are flexible, easy to understand and sufficiently performant.

A simple example: a table with 2 data variables and 1 categorical variable that is used for grouping. I'd like to plot each variable in its own subplot and each group as a single line, i.e., a figure with two subplots and 3 lines per subplot (since there are 3 categories). You can find the whole script attached!

% T =
% 
%   3×3 timetable
% 
%             Time            cat      dat1       dat2  
%     ____________________    ___    ________    _______
% 
%     01-Jan-2021 12:00:00     2     -0.54518     -5.995
%     01-Jan-2021 12:00:00     3      0.37835     18.321
%     01-Jan-2021 12:00:00     1     -0.32751    -10.042

The script/class should be as readable as possible while maintaining performance. I dislike traditional for loops because they are error-prone: one always has to know which array each index belongs to. I therefore prefer "for each", which iterates directly over the elements of an array/list instead of using an integer index. But this does not work as soon as two lists have to be iterated at the same time. In this example, the array of subplot handles and the list of variables to be plotted have the same size and are iterated together; there is a connection between them.
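One built-in possibility, sketched here under the assumption that both lists have the same length: a MATLAB `for` loop visits the columns of its array, so stacking two lists as the rows of a 2-by-N cell array yields one pair per iteration. The second list (`offsets`) is hypothetical and only stands in for something like the subplot handles.

```matlab
plotVars = ["dat1", "dat2"];
offsets  = [10, 20];   % hypothetical second list of the same length

% Each list becomes one row of a 2-by-N cell array;
% the for loop then visits one column (one pair) per pass.
pairs = [num2cell(plotVars); num2cell(offsets)];

visited = strings(1, 0);
for pair = pairs
    [plotVar, offset] = pair{:};              % unpack the 2-by-1 cell
    visited(end+1) = plotVar + ":" + offset;  %#ok<AGROW>
end
disp(visited)
```

This keeps the pairing visible in a single line without an external toolbox, at the cost of wrapping every element in a cell.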

In my opinion, some kind of nested double loop is needed here: one has to iterate through the subplots anyway, and the plotting can't be vectorized because the number of elements per group differs, so the data cannot be reshaped into a matrix.

Let's start with the example, create data:

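    testSize = 1e5; % ASSUMED value; the post does not show it (defined in the attached script)
    d = datetime("2021-01-01 12:00");
    t = linspace(d, d+hours(1), testSize); % evenly spaced row times over one hour
    cat = categorical(randi(3, size(t))); % 3 groups; note: shadows the built-in function cat
    dat1 = randn(size(t));
    dat2 = 10*randn(size(t));
    T = timetable(t', cat', dat1', dat2', 'VariableNames',["cat", "dat1", "dat2"]);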
% ans =
% 
%   3×3 timetable
% 
%             Time            cat      dat1       dat2  
%     ____________________    ___    ________    _______
% 
%     01-Jan-2021 12:00:00     2     -0.54518     -5.995
%     01-Jan-2021 12:00:00     3      0.37835     18.321
%     01-Jan-2021 12:00:00     1     -0.32751    -10.042

Approach 1: manual grouping & plotting. Simple for loops with integer indices. The user has to know which variables belong together, i.e. which ones need to be iterated simultaneously (e.g. plotVars and haxes).

function naiveForLoopPlotTest1(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    
    % infer number of plots & subplots
    numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    numGrps = length(grps); % #lines per subplot
    
    % we want/need to keep the handles for further processing
    haxes = gobjects(1, numPlotVars); % subplot handles (gobjects(n) alone would create an n-by-n array)
    hlines = gobjects(numPlotVars, numGrps); % line handles per subplot
    
    % create graphic objects, just called once
    figure
    tiledlayout('flow')
    for iPlotVar = 1:numPlotVars
        haxes(iPlotVar) = nexttile();
        hold(haxes(iPlotVar),'on');
        for iGrp = 1:numGrps
            hlines(iPlotVar, iGrp) = plot(haxes(iPlotVar), NaT, NaN); % target the axes explicitly
        end
    end
    
    % plotting, could be called multiple times in real program - I hate it, its so verbose
    % and one needs to keep track on the indices
    for iPlotVar = 1:numPlotVars
        plotVar = plotVars(iPlotVar); % current var to plot
        for iGrp = 1:numGrps % nested loop...
            grp = grps(iGrp); % current group
            idx = T.cat == grp; % find index -> filter
            TT = T(idx, :); % filtered (grouped) data from a single column
            set(hlines(iPlotVar, iGrp),'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update graphic data
        end
    end
end

Approach 2: Reduce the number of numXXX variables; do not iterate through integer indices but through the elements directly. Drawback: needs the for-each toolbox (with an awful license..) https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each . Uses arrayfun for handle initialization.

function maybeBetterForLoopPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    
    % NO LONGER NEEDED!: Infer number of plots & subplots
    %     numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    %     numGrps = length(grps); % #lines per subplot
    
    % combine / make graphic object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    
    hlines = arrayfun(@(hax)arrayfun(@(~)plot(hax, NaT, NaN), grps), ...
        haxes,'unif',false); % lines per subplots; this is not nice either.... is cell array
    % 1x2 cellarr with line handles, each of size numGrps x 1
    
    % plotting, could be called multiple times in real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in Matlab?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly seen that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        for subElem = eachTuple(grps, hlinesSub)
            grp = subElem{1};
            hline = subElem{2}; % handle class, can therefore be updated here!
            idx = T.cat == grp;
            TT = T(idx, :); % filtered (grouped) data from a single column
            set(hline,'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update plots
        end
    end
end

Approach 3: Replace second nested for loop by splitapply

function splitApplyPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end
    
    % find groups
    [G, grps] = findgroups(T.cat); %
    
    % combine / make graphic object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    
    hlines = arrayfun(@(hax)arrayfun(@(n)plot(hax, NaT, NaN), grps), ...
        haxes,'unif',false); % lines per subplots; this is not nice either.... is cell array
    % 1x2 cellarr with line handles, each of size numGrps x 1
    
    % plotting, could be called multiple times in real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in Matlab?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly visible that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        hline = hlinesSub(G); % this is ugly but needed - creation of a big handle array
        
        % remove 2nd for loop - is this clearer?
        % the problem here is: how to supply the hlinesSub handles array without
        % making it huge? All data variables must have same number of rows in
        % splitapply..
        splitapply(@(h, t,dat)...
            set(h(1),'XData', t, 'YData', dat),...
            hline, T.Properties.RowTimes, T.(plotVar), G); % update plots
    end
end

Benchmark results:

% bench =
% 
%   3×4 table
% 
%                                   TimeTimeit    TimeProfiler    TotalMemoryMb    PeakMemoryMb
%                                   __________    ____________    _____________    ____________
% 
%     naiveForLoopPlotTest1          0.11933        0.17104          3962.1           65.221   
%     maybeBetterForLoopPlotTest     0.10815        0.11897          141.47           41.275   
%     splitApplyPlotTest             0.17568        0.18233           11215           185.27   

This is very interesting: splitapply seems to need much more time and memory. That makes sense, especially because the hline array must be expanded to the size of the group array G, which seems to be complete nonsense.
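One way the large handle array might be avoided, as an untested-for-speed sketch: pass the row numbers as the only data variable and let the function look up its line handle through G. The small timetable and the invisible figure are illustrative assumptions; no benchmark was run for this variant.

```matlab
% Tiny illustrative timetable (values are made up)
t = datetime(2021,1,1,12,0,0) + minutes(0:5)';
T = timetable(t, categorical([1;2;3;1;2;3]), (1:6)', ...
    'VariableNames', ["cat", "dat1"]);
[G, grps] = findgroups(T.cat);

% One placeholder line per group in a single (hidden) axes
fig = figure('Visible', 'off');
ax = axes(fig);
hold(ax, 'on');
hl = gobjects(numel(grps), 1);
for k = 1:numel(grps)
    hl(k) = plot(ax, NaT, NaN);
end

% Only the row numbers are split by group; G(r(1)) is the group
% number of the current chunk and selects the matching handle.
rows  = (1:height(T))';
times = T.Properties.RowTimes;
y     = T.dat1;
splitapply(@(r) set(hl(G(r(1))), 'XData', times(r), 'YData', y(r)), rows, G);
```

The handle array stays at one entry per group, and the data variables are indexed inside the function instead of being split by splitapply itself.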

Interestingly, maybeBetterForLoopPlotTest is faster and needs less memory than other solutions.

What do you prefer? Are there other ways to structure the code, or functions I'm not aware of yet? I guess this problem comes up every day in data analysis.
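One further option that might be worth trying, sketched with made-up group numbers: compute the row indices of every group once with accumarray and reuse them for all plotted variables, so the `T.cat == grp` comparison is not repeated per variable. Whether this helps in the benchmark above has not been measured.

```matlab
G = [2; 3; 1; 2; 1];   % group numbers, e.g. as returned by findgroups

% One cell per group holding the (sorted) row numbers of that group
rowsPerGroup = accumarray(G, (1:numel(G))', [], @(r) {sort(r)});

for iGrp = 1:numel(rowsPerGroup)
    fprintf('group %d: rows %s\n', iGrp, mat2str(rowsPerGroup{iGrp}'));
end
```

Inside the plotting loop, `rowsPerGroup{iGrp}` could then index both the row times and the current data variable.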

I'm looking forward to your suggestions, thank you very much!


Simon on 30 Aug 2023

Thanks for making the benchmark comparison. I was actually looking for comparisons between splitapply and parfor. I tried splitapply, and in my case (mainly data cleaning) it is faster than a for loop, and the code is much more readable. I also have the Parallel Computing Toolbox; parfor generally speeds things up really well on my eight-core machine. I haven't made any comparison between splitapply and parfor yet (maybe I should do that).


Answer 1

Gaurav Garg on 22 Feb 2021


Hi Jan,

The benchmark results you have tabulated seem correct and make sense for the case study you describe.

Besides the approaches you have already mentioned, gscatter is a function that plots classification data by group very nicely. It is not a general approach in itself, but gscatter and related functions (like scatter, grpstats, gplotmatrix) are worth being aware of.


Jan Kappen on 22 Feb 2021

Thank you very much for your comment.

I'm aware of the g**** functions, but they belong to the Statistics Toolbox and feel kind of outdated. There seems to be no support for tables, which are one of the best features MATLAB has gained in recent releases.

Would you have suggestions how to speed up the splitapply workflow or use any other, similar function?

Or would you (or anybody else) just stick to for loops with integer indices? If yes, why?
