Slicing matrices to fit in memory

Yi-xiao Liu
Yi-xiao Liu on 17 Jun 2022
Answered: Avni Agrawal on 26 Oct 2023
I have a custom vectorized function whose space (and time) complexity is O(n^2). The problem is that n is very large, ~1e6, so the function cannot run in memory; even storing the output is difficult. Fortunately, the output is very sparse, containing only ~5% non-zero elements. So I am thinking of slicing the matrix into smaller chunks that fit in memory and storing the output in a sparse format. This is what I have:
n = 1e6+1e4-1;
v = rand(n,1);
slice = 1e4;
n_slice = floor(n/slice);
slice = repmat(slice, n_slice, 1);
slice(1) = n - sum(slice) + slice(1);   % first slice absorbs the remainder
v = mat2cell(v, slice);
out = cell(n_slice);
for i = 1:n_slice
    for ii = 1:n_slice
        if ii < i   % lower triangle: all-false sparse placeholder blocks
            out{i,ii} = logical(sparse(size(v{i},1), size(v{ii},1)));
        else        % diagonal and upper triangle: actual computation
            out{i,ii} = sparse(myfun(v{i}, v{ii}));
        end
    end
end
out = cell2mat(out);
out = out | tril(out', -1);   % mirror the upper triangle into the lower

function out = myfun(v1, v2)
% just an example
out = abs(v1 - v2') < 1e-6;
end
My question is:
  1. What is the ideal size of each slice? Is it just the largest that fits into RAM, or are there other considerations?
  2. The conversion of the full matrix to sparse (the `sparse(myfun(v{i},v{ii}))` call) seems to have a big overhead; is there any way to improve this?
  3. In the current code the first slice can be nearly twice the nominal size (when the remainder of n/slice is slice-1), so its diagonal block can be ~4x larger. Is there any way to get more consistent slices given an arbitrary n?
I would also appreciate it if you point out any other direction/room for optimization.
Thanks!
  2 Comments
Jonas
Jonas on 18 Jun 2022
If you work with large data sets, you may want to have a look at Tall Arrays.
Yi-xiao Liu
Yi-xiao Liu on 18 Jun 2022
@Jonas The output is used in subsequent operations so sparse has performance advantages over tall as well.


Answers (1)

Avni Agrawal
Avni Agrawal on 26 Oct 2023
Your approach of handling large data under memory constraints by slicing it into smaller chunks and storing the output in a sparse format is a good one. Here are some suggestions for your questions:
1. The ideal slice size is indeed roughly the largest that fits into your RAM, but you also need to leave headroom for the operating system and other programs running on your system. You can use the `memory` function in MATLAB (available on Windows) to get an idea of how much memory is available.
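As a rough sizing sketch (the 25% headroom fraction and the 8 GB fallback budget are illustrative assumptions, not part of the answer):

```matlab
% Derive a slice size from available memory. `memory` exists on Windows
% only, so fall back to an assumed budget elsewhere.
budget = 0.25 * 8e9;                         % assumed: ~8 GB free, use a quarter
if ispc
    m = memory;                              % actual availability on Windows
    budget = 0.25 * m.MemAvailableAllArrays;
end
bytesPerElement = 8;                         % double intermediates; logicals use 1
slice = floor(sqrt(budget / bytesPerElement));  % so one dense slice^2 block fits
```

The square root appears because the dense intermediate block produced by `myfun` has slice^2 elements.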
2. You are right that converting a full matrix to a sparse one can be computationally expensive. One way to reduce this overhead is to generate sparse matrices directly in your function `myfun`, if possible. This way, you can avoid creating dense matrices and then converting them to sparse.
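For the `abs(v1-v2')<1e-6` example from the question, one way to do this is to assemble the result from (row, column) triplets so no dense block is ever allocated. This is a sketch; the name `myfun_sparse` and the per-column loop are illustrative, and the loop trades some speed for an O(numel(v1)) peak temporary:

```matlab
function out = myfun_sparse(v1, v2)
% Build the sparse comparison result directly from triplets,
% without materializing a dense numel(v1)-by-numel(v2) logical.
tol = 1e-6;
rows = cell(numel(v2), 1);
cols = cell(numel(v2), 1);
for j = 1:numel(v2)
    r = find(abs(v1 - v2(j)) < tol);   % only an O(numel(v1)) temporary
    rows{j} = r;
    cols{j} = repmat(j, numel(r), 1);
end
out = sparse(vertcat(rows{:}), vertcat(cols{:}), true, ...
             numel(v1), numel(v2));    % sparse logical output
end
```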
3. To get more consistent slices, you could adjust your slicing strategy. Instead of making the first slice larger to accommodate the remainder, you could distribute the remainder across all slices. Here's an example:
remainder = mod(n,slice);
slice = repmat(slice,n_slice,1) + [ones(remainder,1); zeros(n_slice-remainder,1)];
This will add 1 to the size of the first `remainder` slices, making all slices more consistent in size.
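An equivalent alternative (a sketch, not taken from the answer) rounds equally spaced cut points; by construction the sizes sum to exactly n and differ from each other by at most one:

```matlab
% Near-equal slice sizes for an arbitrary n via rounded cut points.
n = 1e6 + 1e4 - 1;
n_slice = 100;                                     % illustrative slice count
slice = diff(round(linspace(0, n, n_slice + 1)))'; % n_slice sizes summing to n
```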
A few other optimization suggestions:
- You can avoid the `if ii<i` condition by adjusting your loops: start the inner loop from `i` instead of 1.
- You compute only the upper triangle and then mirror it with `tril(out',-1)`. If `myfun` satisfies `myfun(b,a) = myfun(a,b)'` (as your example does), you can instead assign both `out{i,ii}` and `out{ii,i}` in the same iteration.
Here's your code with these suggestions applied:
n = 1e6+1e4-1;
v = rand(n,1);
slice = 1e4;
n_slice = floor(n/slice);
remainder = mod(n, slice);
slice = repmat(slice, n_slice, 1) + [ones(remainder,1); zeros(n_slice-remainder,1)];
v = mat2cell(v, slice);
out = cell(n_slice);
for i = 1:n_slice
    for ii = i:n_slice              % upper triangle only
        out{i,ii} = sparse(myfun(v{i}, v{ii}));
        if ii > i
            out{ii,i} = out{i,ii}'; % mirror into the lower triangle
        end
    end
end
out = cell2mat(out);
Lastly, you could consider using a machine with more memory or distributed computing techniques if your data is too large to fit into memory even with these optimizations.
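If the Parallel Computing Toolbox is available (an assumption; `v` and `myfun` are as defined above), the block rows are independent, so the outer loop can be a `parfor` with sliced row assignment. The lower triangle is mirrored afterwards, as in the original code, because a worker cannot write the transposed block into another row:

```matlab
% Parallel sketch: one block row per parfor iteration.
n_slice = numel(v);               % v: cell array of slices, as above
out = cell(n_slice);
parfor i = 1:n_slice
    row = cell(1, n_slice);       % build the whole block row locally
    for ii = 1:n_slice
        if ii < i                 % lower-triangle placeholder
            row{ii} = logical(sparse(size(v{i},1), size(v{ii},1)));
        else
            row{ii} = sparse(myfun(v{i}, v{ii}));
        end
    end
    out(i,:) = row;               % sliced assignment, one row per iteration
end
out = cell2mat(out);
out = out | tril(out', -1);       % mirror, as in the original code
```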
