filedatastore; read M mat files of size (1xL) into single tall array that is of size (1, L*M)
Alex Hogg
on 13 May 2022
Commented: Alex Hogg
on 16 May 2022
I have an extremely large collection of data distributed across multiple mat files, and I want to compute some stats on it as a single entity, so I need to load the files through a fileDatastore. The dimensions of the resulting tall array are stopping me from getting the exact stats I am after.
Before the question is inevitably asked: I can't read all the files normally, as they won't all fit in RAM at the same time. If they did, I would simply append the array from each file as a normal matrix operation.
Mat file format:
Each file is identical in size; there is a single variable in each file of size [1 x L].
In the directory consisting of the mat files for the datastore, the number of files is M.
Attempted datastore code:
So far I've managed to create a datastore that reads my data format. However, the dimensions of the tall array are preventing me from getting the exact stats I'm after.
%find files in directory with mat file extension (all matching formats)
Datastore_Files = Search_Files(Datastore_Directory, ".mat");
%create absolute file path to each individual file, make string array for datastore input
Datastore_File_List = string(fullfile({Datastore_Files.folder}, {Datastore_Files.name}));
%create single tall datastore from array of mat files
File_Data_Store = tall(fileDatastore(Datastore_File_List, 'ReadFcn', @(x) table2array(struct2table(load(x))), 'UniformRead', true));
%get data store size
File_Data_Store_Size = gather(size(File_Data_Store));
disp(File_Data_Store_Size)
M L
Stats issue:
Because of the datastore dimensions, if I call a function such as mean(), I end up with a mean value per file rather than a single value representing the mean of the entire datastore.
%Returns mean value per-file; not a single value for the whole dataset.
Test1 = gather(mean(File_Data_Store))
disp(size(Test1))
1 M
%Also returns mean value per-file; not a single value for the whole dataset.
Test2 = gather(mean(File_Data_Store(:,:)))
disp(size(Test2))
1 M
Attempted workarounds:
As shown above, I don't seem to be able to use the normal trick for a standard matrix, where if you had a multidimensional array A and wanted a single mean() value representing the entire array, you could just use mean(A(:)).
I also can't use reshape, as you can't change the size of the first dimension of a tall array.
T = reshape(File_Data_Store, 1, File_Data_Store_Size(1)*File_Data_Store_Size(2))
Error using tall/reshape (line 17)
Reshaping the first dimension of tall arrays is not supported.
Question:
Is there a way for me to concatenate the output from each file during the datastore creation such that I end up with a single tall array of dimensions [M*L, 1] instead of [M, L]?
Alternatively, is there a way I am unaware of to perform operations on a tall array as a whole, rather than on each column independently (each file)?
Accepted Answer
Jeremy Hughes
on 13 May 2022
In general, you'll have better luck identifying where the problem lies by looking at each piece of the code separately.
fcn = @(x)table2array(struct2table(load(x))); % Issue may be here
ds = fileDatastore(Datastore_File_List, 'ReadFcn', fcn, 'UniformRead', true);
A = tall(ds)
If fcn returns a 1-by-L vector, then tall will eventually try to create an M-by-L array from that data. Calling mean on that will return the mean of each column, i.e. a 1-by-L array.
A = rand(3,10)
m = mean(A)
I think what you are asking for is the mean of the whole array. For in-memory arrays, I would do this:
m = mean(A(:))
But tall probably won't like that.
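For the mean specifically, one possible alternative that leaves the read function unchanged is to reduce along the second dimension first. This is only a sketch: it relies on every row having the same length L (so the mean of the row means equals the overall mean), and it uses a small in-memory matrix wrapped in tall() as a stand-in for the real datastore.

```matlab
% In-memory stand-in for the M-by-L data: M = 3 files, L = 10 samples each.
A = rand(3, 10);

% Reference value: mean over every element of the in-memory array.
mAll = mean(A(:));

% Same computation on a tall array without reshaping the tall (first) dimension:
% average along each row first, then average the M row means.
tA = tall(A);
rowMeans = mean(tA, 2);           % tall M-by-1, one mean per row
mTall = gather(mean(rowMeans));   % single value for the whole array

% The two results should agree up to floating-point round-off.
disp(abs(mTall - mAll) < 1e-12)
```

This shortcut does not carry over to statistics such as the median or standard deviation, which is one reason to prefer changing the shape of each read instead.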
If you modify that fcn to return the transpose,
fcn = @(x)table2array(struct2table(load(x)))'; % Note the added ' transpose character.
After you recreate ds and A with this fcn, each read returns an L-by-1 vector instead, and the tall array should represent an (M*L)-by-1 array. Calling mean on it then gives a single value.
m = gather(mean(A))
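Putting the pieces together, the transposed read could be tried end to end on a few small throwaway files. This is a minimal sketch, not the original poster's setup: the temporary folder, the file names, the variable name Data, and the sizes M = 4 and L = 5 are all made up for illustration, and reshape(..., [], 1) is used in place of the transpose so the read returns a column regardless of the stored orientation.

```matlab
% Create M small example MAT-files, each holding one 1-by-L variable.
% Folder, file names, and the variable name "Data" are hypothetical.
M = 4;
L = 5;
exampleDir = tempname;
mkdir(exampleDir);
for k = 1:M
    Data = (1:L) + (k - 1) * L;   % file k holds the values (k-1)*L+1 ... k*L
    save(fullfile(exampleDir, sprintf('part_%02d.mat', k)), 'Data');
end

% Read function: load the file's single variable and return it as a column.
readAsColumn = @(file) reshape(table2array(struct2table(load(file))), [], 1);

% With 'UniformRead' true, the L-by-1 reads are concatenated vertically.
ds = fileDatastore(fullfile(exampleDir, '*.mat'), ...
    'ReadFcn', readAsColumn, 'UniformRead', true);
A = tall(ds);

sz = gather(size(A));   % expected to be [M*L, 1]
m  = gather(mean(A));   % one mean over all files; the data are 1..M*L,
                        % so this should equal (M*L + 1)/2

% Remove the temporary example files.
rmdir(exampleDir, 's');
```

Because the result is a single tall column, other column-wise reductions should then also apply to the whole dataset rather than to each file separately.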