Data structure for mixed-type arrays - cellarray, dataset, structarray, struct of arrays, or other?

Question

2 votes

I need to process large amounts of tabular data of mixed type - strings and doubles. A standard problem, I would think. What is the best data structure in Matlab for working with this?

A cell array is definitely not the answer: it is extremely memory inefficient (tests shown below). Dataset (from the Statistics Toolbox) is horribly inefficient in both time and space. That leaves me with a struct array or a struct of arrays. I tested all four options for time and memory below, and it seems to me that the struct of arrays is the best option for the things I tested.

I am relatively new to Matlab and this is a bit disappointing, frankly. Anyway, I am looking for advice on whether I am missing something and whether my tests are accurate and reasonable. Are there other considerations besides access speed, conversion speed, and memory usage that are likely to come up as I write more code with these structures? (FYI, I am using R2010b.)

**** Test #1: Access speed

Accessing a single data item (1000 repetitions per data structure).

cellarray:0.002s
dataset:36.665s       %<<< This is horrible
structarray:0.001s
struct of array:0.000s

**** Test #2: Conversion speed and memory usage

I dropped dataset from this test.

Cellarray(doubles)->matrix:d->m: 0.865s
Cellarray(mixed)->structarray:c->sc: 0.268s
Cellarray(doubles)->structarray:d->sd: 0.430s
Cellarray(mixed)->struct of arrays:c->sac: 0.361s
Cellarray(doubles)->struct of arrays:d->sad: 0.887s
    Name           Size               Bytes  Class     Attributes
    c         100000x10            68000000  cell                
    d         100000x10            68000000  cell                
    m         100000x10             8000000  double              
    sac            1x1             38001240  struct              
    sad            1x1              8001240  struct              
    sc        100000x1             68000640  struct              
    sd        100000x1             68000640  struct

================== CODE: TEST #1

    %%cellarray
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    te = tic;
    for iii=1:1000
        x = c(1234,5); % () returns a 1x1 cell; c{1234,5} would return the stored value
    end
    te = toc(te);
    fprintf('cellarray:%0.3fs\n', te);
    %%dataset
    ds = dataset( { c, cols{:} } );
    te = tic;
    for iii=1:1000
        x = ds(1234,5);
    end
    te = toc(te);
    fprintf('dataset:%0.3fs\n', te);
    %%structarray
    s = cell2struct( c, cols, 2 );
    te = tic;
    for iii=1:1000
        x = s(1234).Var5;
    end
    te = toc(te);
    fprintf('structarray:%0.3fs\n', te);
    %%struct of arrays
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = tic;
    for iii=1:1000
        x = sac.Var5(1234);
    end
    te = toc(te);
    fprintf('struct of array:%0.3fs\n', te);

================== CODE: TEST #2

    %%cellarray
    % c - cellarray containing mixed type 
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    % d - cellarray containing doubles only
    d = num2cell( zeros( 100000, 10 ) );
    %%matrix
    % doubles only
    te = tic;
    m = cell2mat(d);
    te = toc(te);
    fprintf('Cellarray(doubles)->matrix:d->m: %0.3fs\n', te);
    %%structarray
    % mixed
    te = tic;
    sc = cell2struct( c, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(mixed)->structarray:c->sc: %0.3fs\n', te);
    % doubles
    te = tic;
    sd = cell2struct( d, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(doubles)->structarray:d->sd: %0.3fs\n', te);
    %%struct of arrays
    % mixed
    te = tic;
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = toc(te);
    fprintf('Cellarray(mixed)->struct of arrays:c->sac: %0.3fs\n', te);
    % doubles
    te = tic;
    for iii=1:numel(cols)
        sad.(cols{iii}) = cell2mat(d(:,iii));
    end
    te = toc(te);
    fprintf('Cellarray(doubles)->struct of arrays:d->sad: %0.3fs\n', te);
    clear iii cols te;
    whos

5 Comments

Cedric on 26 Apr 2013

Edited: Cedric on 26 Apr 2013

I would say that a mixed set of specific data structures is almost always better than a flexible data structure of mixed data. What you want to guarantee is "contiguity" when possible/meaningful. If you want multiple data structures to appear under the same base name, or to be able to pass a "whole set of mixed structures" under a unique variable/name to functions, use a struct.

To illustrate, your best option if you want to store asset names, codes, and prices is probably to build the following struct and access its fields using logical indexing when possible:

 Assets.code  = randi(1000, 1e7,1) ;   % Contiguous.
 Assets.price = 200*rand(1e7,1) ;      % Contiguous (scaled so that some prices exceed 100).
 Assets.name  = {..} ;                 % Non-contiguous.
 lid = Assets.price > 100 ;            % Fast.
 codes = Assets.code(lid) ;            % Fast.
 names = Assets.name(lid) ;            % Not too fast; execute only 
                                       % when relevant, e.g. for display 
                                       % or export to file.

(I let you generate fake names the way you did if you want to test this).
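A runnable version of the idea above might look like the following. It is only a sketch: the fake names reuse the 'asdf' placeholder from the question, the row count is reduced to keep it light, and the prices are scaled by 200 (an assumption) so that the `> 100` mask actually selects some rows.

```matlab
% Sketch of a struct of column arrays filtered with one logical mask.
% Sizes, the 'asdf' names, and the 200* price scaling are illustrative assumptions.
n = 1e5;                                  % smaller than 1e7 to keep the demo light
Assets.code  = randi(1000, n, 1);         % numeric column, contiguous
Assets.price = 200 * rand(n, 1);          % numeric column, contiguous
Assets.name  = repmat({'asdf'}, n, 1);    % cell column of strings, non-contiguous

lid   = Assets.price > 100;               % logical row mask, computed once
codes = Assets.code(lid);                 % numeric extraction
names = Assets.name(lid);                 % cell extraction; do this only when needed

fprintf('%d of %d rows selected\n', nnz(lid), n);
```

The same mask `lid` indexes every column, which is what keeps the columns aligned as rows of one table.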

PS: and congratz for the amount of tests/work that you did before submitting a question!

Matt J on 26 Apr 2013

Edited: Matt J on 26 Apr 2013

I'm a bit surprised to discover that sortrows works on cells. Nothing in the documentation about it.

Anyway, the fact that it requires multiple lines of code wouldn't make it "painful" in my book. You can always encapsulate the multiple lines in your own M-file and just reuse that.

The issue you point out is also only an issue when the columns being sorted are of mixed type. My approach might be to convert all the numeric data to strings and then concatenate the columns being sorted into one big string matrix. Then I can run sortrows on that.
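The string-key approach described here could be sketched as follows. The column contents are made up for illustration, and the zero-padded `%08d` format is an assumption that only holds for non-negative integers of at most eight digits; other numeric data would need a different fixed-width encoding for the text order to match the numeric order.

```matlab
% Sketch: sort rows by (name, value) through one fixed-width char key.
% The data and the '%08d' width are illustrative assumptions.
names = {'pear'; 'apple'; 'pear'; 'apple'};   % string column
vals  = [10; 2; 3; 7];                        % non-negative integer column

% char() pads names with trailing blanks; num2str with a format gives one row per value.
key = [char(names), num2str(vals, '%08d')];

[~, order] = sortrows(key);                   % permutation that sorts the key rows
sortedNames = names(order);
sortedVals  = vals(order);
```

Applying `order` to every column keeps the rows together, so the same permutation works for a struct of arrays or a cell of arrays.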

per isakson on 26 Apr 2013

[num2cell(sac.Var1(1:10)), sac.Var2(1:10), ...]

This operation represents a cost. What is the benefit of storing Var1, Var2, etc. in separate fields compared to a double array?

Answer 1

Marc on 29 Feb 2016

0 votes

I thought about asking a separate question, but I was wondering whether the "table" data type fits this. I cannot confirm when this data type was released (it looks like it wasn't in R2010a or R2012a), but I have been using it for a wide range of data that I dump into Excel files and want to do some serious analysis on. In R2015b, the help for dataset refers you to table, so it looks like table will be replacing dataset in the future. It works pretty well if you do a lot of design of experiments and, like me, have to dump the data somewhere like Excel or some database due to work requirements.

If you format the Excel file correctly, reading tabular data into a table is really easy.
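As a rough illustration of reading mixed-type tabular data into a table, here is a minimal round trip. It assumes R2013b or later, uses a temporary CSV file instead of an Excel file so that it runs without Excel, and the variable names `Name` and `Price` are made up.

```matlab
% Sketch: write a small mixed-type table to a file and read it back.
% Assumes R2013b+; the CSV file and the variable names are illustrative.
Name  = {'a'; 'b'; 'c'};          % string column (cell array of char)
Price = [1.5; 2.5; 3.5];          % numeric column
T = table(Name, Price);           % variable names come from the workspace names

f = [tempname '.csv'];            % temporary file path
writetable(T, f);                 % header row holds the variable names
T2 = readtable(f);                % column types are inferred from the file
delete(f);

cheap = T2(T2.Price < 3, :);      % logical row indexing keeps all columns aligned
disp(cheap);
```

With a well-formed header row and one type per column, `readtable` also accepts a spreadsheet file name in place of the CSV path.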

2 Comments

per isakson on 29 Feb 2016

Edited: per isakson on 29 Feb 2016

The documentation page for table ("Create table from workspace variables") says: Introduced in R2013b.
dataset is in the Statistics Toolbox. However, it seems as if they decided to move the functionality to table in Matlab itself.

Guillaume on 29 Feb 2016

In the context of this thread, a table is equivalent to a cell of arrays in terms of memory overhead. Access is a bit more convenient, at the expense of performance.

Answer 2

Matt J on 26 Apr 2013

Edited: Matt J on 18 May 2013

0 votes

As best I can tell, you haven't tested a "cell of arrays", i.e., instead of having a 100000x10 cell array, have a 1x10 cell array where each c{i} contains an array of a column of data. Should be similar to "struct of arrays", but with easier indexing.
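A "cell of arrays" as described above could be built from the question's per-element cell array roughly like this. The sketch uses a smaller table than the question and assumes, as in the question's test data, that odd columns are numeric and even columns are strings.

```matlab
% Sketch: convert an n-by-p per-element cell array into a 1-by-p cell of columns.
% The size and the odd/even column layout mirror the question's test data.
n = 1000;
c = cell(n, 4);
c(:, [1 3]) = num2cell(zeros(n, 2));       % numeric columns
c(:, [2 4]) = repmat({'asdf'}, n, 2);      % string columns

ca = cell(1, size(c, 2));
for k = 1:size(c, 2)
    if mod(k, 2) == 0
        ca{k} = c(:, k);                   % string column stays a cell column
    else
        ca{k} = cell2mat(c(:, k));         % numeric column becomes a double vector
    end
end

x = ca{3}(123);                            % element of a numeric column
s = ca{2}{123};                            % element of a string column
```

Columns are addressed by position (`ca{3}`) rather than by field name, which is the indexing convenience mentioned above.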

Beyond that, nothing in your tests is very unexpected. You have a large amount of data and have to be careful not to scatter it discontiguously in memory. Successive cell/struct elements cannot be held contiguously in memory, because they hold non-homogeneous data types. Numeric and string arrays are contiguous, however, so by grouping things into large numeric/string sub-arrays where possible, you maximize data contiguity, which leads to efficiencies both in access speed and memory usage.
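The memory side of this can be checked with `whos`. In the question's output, the 100000x10 cell array takes 68000000 bytes (68 bytes per element), while the equivalent double matrix takes 8000000 bytes (8 bytes per element). A small sketch to reproduce the comparison follows; the per-cell overhead it prints is expected to differ between MATLAB versions and platforms, so no specific value is assumed.

```matlab
% Sketch: compare bytes per element for a per-element cell array and a double vector.
% The exact per-cell overhead depends on the MATLAB version and platform.
n = 1000;
d = num2cell(zeros(n, 1));     % one cell per number
m = cell2mat(d);               % one contiguous double vector

wd = whos('d');
wm = whos('m');
fprintf('cell:   %g bytes per element\n', wd.bytes / n);
fprintf('double: %g bytes per element\n', wm.bytes / n);
```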

As for "dataset", I cannot comment, since I don't have the Stats Toolbox. However, a mixed data table with 100000 rows is uncommonly large in my experience. I don't think you would ever see it in an Excel spreadsheet, for example. If dataset was meant to be "Excel-like", I can imagine 100000 rows being usage outside of what the designers anticipated.

3 Comments

Sean de Wolski on 26 Apr 2013

Nope. That is the only real difference and the rest is preference.

Matt J on 26 Apr 2013

Edited: Matt J on 26 Apr 2013

As Sean says, nothing of great consequence. There are small differences in storage since structs need to allocate memory for field names. Maybe small differences in indexing speed, too, to convert string indices to numeric ones.
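For anyone who wants to measure the storage difference on their own installation, a sketch along these lines should do. The two-column layout is arbitrary, and the printed overheads are whatever `whos` reports on that version; no particular values are assumed.

```matlab
% Sketch: storage overhead of a struct of arrays vs. a cell of arrays.
% Column count and length are arbitrary; reported overheads vary by version.
n = 1000;
s.Var1 = zeros(n, 1);
s.Var2 = zeros(n, 1);
ca = {zeros(n, 1), zeros(n, 1)};

ws = whos('s');
wc = whos('ca');
raw = 2 * n * 8;               % bytes of the numeric data alone
fprintf('struct overhead: %d bytes\n', ws.bytes - raw);
fprintf('cell overhead:   %d bytes\n', wc.bytes - raw);
```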

Answer 3

matal on 17 May 2013

0 votes

Thought I would post some of my thoughts after looking at this problem a bit more.

I don't see an efficient data structure in Matlab for managing heterogeneous tabular data. The best I can do is a struct of vectors where each vector is either a numeric matrix or a cellarray depending on the data type I am storing. Then it is up to me to keep the vectors of equal length and write accessor function(s) to efficiently get/set arbitrary "sub-matrices" of the heterogeneous data matrix.
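A minimal sketch of such a struct of columns, with a length check and a getter for an arbitrary "sub-matrix", is shown below. The field names, the data, and the choice to return the block as a cell array are all illustrative assumptions, not a definitive design.

```matlab
% Sketch: struct of equal-length columns plus a row/column block getter.
% Field names, data, and the cell-array return format are illustrative.
T.Var1 = (1:5)';                          % numeric column
T.Var2 = {'a'; 'b'; 'c'; 'd'; 'e'};       % string column
T.Var3 = 10 * (1:5)';                     % numeric column

% Invariant: every column must have the same number of rows.
lens = structfun(@(v) size(v, 1), T);
assert(all(lens == lens(1)), 'Columns differ in length.');

% Getter: pull the requested rows and fields into one heterogeneous cell block.
rows = [2 4];
vars = {'Var1', 'Var2'};
block = cell(numel(rows), numel(vars));
for k = 1:numel(vars)
    col = T.(vars{k})(rows);              % rows of one column
    if ~iscell(col)
        col = num2cell(col);              % wrap numeric values, one per cell
    end
    block(:, k) = col;
end
disp(block);
```

Converting to a cell block costs memory per element, so a getter like this is best kept for small slices such as display or export.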

As noted above, dataset is too inefficient both in space and time. Cellarray is too inefficient in space.

Taking this route, I run into other issues to do with copy-on-write semantics, but that's for another thread.

1 Comment

Cedric on 17 May 2013

Edited: Cedric on 17 May 2013

It would not be too difficult to create your own class for that if you can precisely define what you need, especially if you know OOP but have never used it in MATLAB. Let us know if you are interested; this would be a good "case/pretext/application" for taking the step towards OOP. It would not be more efficient than managing numeric arrays and cell arrays directly, as you would essentially build a wrapper around these structures with proper methods to manage size/indexing, but it would make the whole thing clean and consistent.
