Efficient access and manipulation of arrays in nested cells
10 views (last 30 days)
I have nested cells of the form mycell{i}{j,k}, with an array in each of those cells. I have not found working examples of operations like taking a statistic (e.g., max) of every array without a loop, returning something like cellstat(i,j,k). Another example: I'm performing a fit with each array, and it would be nice to gather one of the goodness-of-fit stats from every fit into a single array, or to take stats of a goodness stat across i so I can see it at each j,k.
I think with an example of each of those, I could figure out anything else that comes up. Thanks!
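To make the structure concrete, here is a tiny made-up illustration of mycell and the kind of result I mean (the sizes here are arbitrary; the real arrays are much larger):
mycell = {{rand(5,1), rand(7,1); rand(3,1), rand(9,1)}}; % i = 1; j,k = 1:2
% desired: something like cellstat(i,j,k) = max(mycell{i}{j,k}), gathered into one numeric array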
**********************
Adding an example:
data = rand(2e5,1); % one data set, I have many
datay = rand(2e5,1); % y-coordinate of the data
dataz = rand(2e5,1); % z-coordinate of the data
The first task with this data is to create a grid of y,z pairs and sort each data set into those. Since rand is [0,1], say the grid is every 0.1. This only has to be done once, but I suppose how the data are stored could affect the speed of future steps.
After that, I'm doing a windowed fit on the points that are sorted into each y,z bin for each dataset. There may be some trial and error here, and, while I can test on subsets, it would be helpful if the data are structured in a way that makes the fitting routine as fast as possible. Would any more information be useful?
8 Comments
dpb
on 2 Apr 2025
OK, I let the "grid" and the initial structure stuff confuse me... @Voss got back before I did and answered the basics; as he points out, there's no reason to create excessively complex storage structures; use the data you have the way it comes. I'd still be looking into how the data are initially created and what the multiple cases are, with an eye to further consolidation, but if there really are 10E5 points per dataset, it's probably not practical to actually combine them until you summarize results.
It would also be worth seeing what
NY=10;
edges=linspace(0,1,NY+1);
iyz=discretize([datay dataz],edges);
does compared to histcounts2. It returns the indices by column in one output array and uses the same binning in both directions, so it isn't quite as flexible, but it might be a little faster. Although given the tasks so far, I don't see performance as being a big issue if you don't make things more difficult than need be... :)
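If it ever does matter, a quick timeit comparison on the actual data would settle it; a sketch (results will vary by machine and release):
NY = 10;
edges = linspace(0,1,NY+1);
t_disc = timeit(@() discretize([datay dataz], edges));            % one call, both columns
t_hist = timeit(@() histcounts2(datay, dataz, edges, edges), 5);  % request 5 outputs so the bin indices are computed
[t_disc t_hist]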
Accepted Answer
Voss
on 1 Apr 2025
data = rand(2e5,1); % one data set, I have many
datay = rand(2e5,1); % y-coordinate of the data
dataz = rand(2e5,1); % z-coordinate of the data
"The first task with this data, is to create a grid of y,z pairs and sort each data set into those. Since rand is [0,1], say the grid is every 0.1.... how the data are stored could affect the speed of future steps"
Store the bin index of each data point, so you know what bin each data point belongs to. (It's not necessary to make a new copy of the data with a different structure.)
NY = 10;
NZ = 10;
yedges = linspace(0,1,NY+1);
zedges = linspace(0,1,NZ+1);
[~,~,~,yidx,zidx] = histcounts2(datay,dataz,yedges,zedges);
"After that, I'm doing a windowed fit on the points that are sorted into each y,z bin for each dataset."
Maybe something like the following. groupsummary uses the bin indices found in the previous step:
function out = your_fit_function(d,y,z)
[f,gof] = fit([y,z],d,'poly11');
out = {{f,gof}};
end
[C,BG] = groupsummary({data,datay,dataz},[zidx,yidx],@your_fit_function);
Now you have an sfit object and goodness-of-fit struct, returned from fit, for each grid cell:
C{1}
C{1}{:}
And you can do what you want with those:
for ii = 1:3 % use 1:numel(C) to show all bins
fprintf(1,'region %0.1f<y<%0.1f, %0.1f<z<%0.1f:\n\n', ...
yedges(BG{2}(ii)),yedges(BG{2}(ii)+1),zedges(BG{1}(ii)),zedges(BG{1}(ii)+1));
fprintf(1,' fit object:\n');
disp(C{ii}{1})
fprintf(1,' goodness:\n');
disp(C{ii}{2})
fprintf(1,' \n');
end
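And if the goal is one goodness statistic across all bins as a plain numeric array (per the original question), something like this sketch could follow (rsq is just an illustrative name; rsquare is one of the fields of the gof struct returned by fit):
rsq = cellfun(@(c) c{2}.rsquare, C);   % C{ii}{2} is the gof struct for bin ii, in the same order as BG
[best_rsq, ibest] = max(rsq);          % e.g., pick out the best-fitting y,z bin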
0 Comments
More Answers (3)
Walter Roberson
on 31 Mar 2025
Example:
function gof = getgof(PAGE)
[~, gof] = fit(PAGE somehow); % fill in the fit() arguments appropriate to how each array is laid out
end
gof_stats = cellfun(@getgof, mycell, 'uniform', 0);
gof_stats = vertcat(gof_stats{:});
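If mycell really is nested as mycell{i}{j,k}, a second cellfun level is needed to reach the inner arrays, and a single goodness metric can then be collected into a plain numeric array. A sketch, assuming (purely for illustration) that each inner array is an N-by-3 matrix [y z d] so the fit call can be written out:
function r = getrsq(A)
% hypothetical helper: R^2 of a poly11 surface fit to one inner array A = [y z d]
[~, gof] = fit(A(:,1:2), A(:,3), 'poly11');
r = gof.rsquare;
end
rsq_inner = cellfun(@(inner) cellfun(@getrsq, inner), mycell, 'UniformOutput', false); % one j-by-k matrix per i
rsq = cat(3, rsq_inner{:});   % j-by-k-by-i numeric array of R^2 values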
0 Comments
Matt J
on 1 Apr 2025
Edited: Matt J
on 1 Apr 2025
There is no way to iterate over cells (nested or otherwise) without a loop, or something equivalent in performance to a loop (cellfun, arrayfun, cell2mat, etc...).
4 Comments
Matt J
on 1 Apr 2025
Edited: Matt J
on 1 Apr 2025
"Can you give an example without a loop, e.g., cellfun?"
How would an example of cellfun help you? You said you are looking for something more efficient than a loop, and as I have said, nothing is more efficient than a loop when dealing with cell arrays.
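For reference, the plain-loop version of the max example from the question is about as short as any cellfun construction would be; a sketch, assuming every mycell{i} has the same j-by-k size:
[nj, nk] = size(mycell{1});
cellstat = zeros(numel(mycell), nj, nk);
for i = 1:numel(mycell)
    for j = 1:nj
        for k = 1:nk
            cellstat(i,j,k) = max(mycell{i}{j,k});   % swap in any per-array statistic here
        end
    end
end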
dpb
on 1 Apr 2025
Edited: dpb
on 1 Apr 2025
To amplify on @Matt J's comment: at their heart, all the cell-, array-, and struct- functions are looping constructs internally; they are "syntactic sugar" that replaces the for ... end loop with a single line of source code. But their performance cannot exceed that of JIT-compiled looping code, and, given that they have not been subject to all the optimizations MathWorks has made to for loops over the years, including multi-threading, they will all be at least somewhat slower than a "deadahead" for loop.
Functionally, a cellfun is a wrapper for an arrayfun -- it passes the dereferenced cell to the function instead; you could construct the same with arrayfun if you did the dereferencing in the argument list for it. See this <recent post> for a general discussion and some pertinent remarks from TMW Staff members on the differences.
MORAL: Do NOT assume that fewer lines of source code equate to faster execution speed.
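To make the cellfun/arrayfun equivalence above concrete, a small sketch with made-up data:
c = {rand(5,1), rand(8,1), rand(3,1)};              % made-up cell of vectors
m1 = cellfun(@max, c);                              % cellfun dereferences each cell for you
m2 = arrayfun(@(k) max(c{k}), 1:numel(c));          % same thing, dereferencing done in the argument
isequal(m1, m2)                                     % true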
dpb
on 1 Apr 2025
Edited: dpb
on 1 Apr 2025
The other alternative to investigate is to turn the metadata you're segregating/tracking by cell indices into real data in a flat table or array. Ideally, those would be recognizable things like test number, date, whatever..., but for starters they could just be the indices. Then the power of <grouping variables> and/or grpstats and/or varfun could be brought to bear on the problem. Large datasets can be handled with tall arrays and/or memory mapping; see also findgroups.
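A sketch of what that flattening might look like (the [i j k value] column layout is just one possible choice, assuming mycell from the question):
rows = {};
for i = 1:numel(mycell)
    for j = 1:size(mycell{i},1)
        for k = 1:size(mycell{i},2)
            v = mycell{i}{j,k}(:);
            rows{end+1,1} = [repmat([i j k], numel(v), 1), v]; %#ok<AGROW> % one row per data point, with its i,j,k indices
        end
    end
end
flat = vertcat(rows{:});                                % columns: i, j, k, value
G = findgroups(flat(:,1), flat(:,2), flat(:,3));
groupmax = splitapply(@max, flat(:,4), G);              % one statistic per (i,j,k) group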
4 Comments
Walter Roberson
on 1 Apr 2025
"I believe I could reorganize the data into a table"
Accessing a range of table rows is notably less efficient than accessing a range of rows of a numeric array.
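A quick way to see that difference on a given machine (a sketch; the exact ratio varies by release):
A = rand(2e5, 3);
T = array2table(A, 'VariableNames', {'d','y','z'});
t_array = timeit(@() A(1000:2000, :));     % range of rows from a numeric array
t_table = timeit(@() T(1000:2000, :));     % same range of rows from a table
[t_array t_table]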
dpb
on 1 Apr 2025
Edited: dpb
on 1 Apr 2025
"... turn the metadata you're segregating/tracking by cell indices into real data in a flat table or array." (emphasis added...dpb)
The table is awfully convenient for display and is generally "fast enough" ...but, agreed, findgroups and splitapply to do the calculations will be faster on an array than will be varfun or grpstats on a table.
I was interpreting the question about speed as including the existing cell array structure as well, not just the comparison of an array to a table. Dereferencing a cell itself is generally quick, but by the time one calls cellfun() a number of times and then has to reconstruct/collect the results, who knows how it might compare?
But, it's pretty tough to attack @Dan Houck's real problem without an example to poke at...others may be able to write air code that might be applicable to his actual situation, but I'm not that clairvoyant and as @John D'Errico was complaining the other day, the Crystal Ball TB is notably dark these days.