Efficient use of matfile within parfor

Question

Angelo Cuzzola il 27 Apr 2020

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/521137-efficient-use-of-matfile-within-parfor

Commentato: Angelo Cuzzola il 5 Mag 2020

I have a very simple code that works pretty well when the size of the involved objects is manageable by the memory.

%%% Variables
% Z -> matrix of dimensions (r,T)
% V -> matrix of dimensions (r,r,T)
% C -> matrix of dimensions (n,r)
% nanY -> matrix of dimensions (n,T)
% y -> matrix of dimensions (n,T)
R_new_T = zeros(n,1);
parfor t=1:T
    yC = diag(~nanY(:,t))*C;
    R_new_T = R_new_T + (y(:,t)-yC*Z(:,t+1)).^2 ...
                      + diag(yC*V(:,:,t+1)*yC')  ...
                       + Rvec.*nanY(:,t);  
end

In practical application, that bunch of code has to run with the parameter n taking values around 80k leading quickly to an unfeasible memory load. The solution I figured out relies on matfiles, at the expenses of paralellizionation and general performance. This is the version with which I am able to handle it for large n.

yC = zeros(n,r);
save var_cyc.mat nanY y yC -v7.3;
v = matfile('var_cyc.mat', 'Writable', true);
R_new_T = zeros(n,1);
for t=1:T
    v.yC = diag(~v.nanY(:,t))*C;
    R_new_T = R_new_T + (v.y(:,t) - v.yC*Z(:,t+1)).^2 ...
                       + diag(v.yC*V(:,:,t+1)*v.yC')  ...
                       + Rvec.*(v.nanY(:,t)) ; 
end     

Unfortunately this script cannot be paralellizable for classification problems related to the variable v. I'm wondering if there is a way to recover the initial parallelization structure using matlab file to handle huge file without incurring in 'out of memory errors'.

Thanks.

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

Edric Ellis il 28 Apr 2020

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/521137-efficient-use-of-matfile-within-parfor#answer_428851

Apri in MATLAB Online

The problem is all to do with your assignment into v.yC. The parfor analysis cannot tell that this is safe and not order-dependent - it doesn't consider the fact that you're completely overwriting one field of the object. So, I think the simplest workaround is not to place yC in the matfile, and instead keep this as a loop temporary variable. In other words:

parfor t=1:T
    yC = diag(~v.nanY(:,t))*C;
    R_new_T = R_new_T + (v.y(:,t) - yC*Z(:,t+1)).^2 ...
                       + diag(yC*V(:,:,t+1)*yC')  ...
                       + Rvec.*(v.nanY(:,t)) ; 
end  

This shouldn't cause any additional memory usage since each worker already needs a full copy of yC in memory in order to perform the other operations.

8 Commenti
Mostra 6 commenti meno recentiNascondi 6 commenti meno recenti

Angelo Cuzzola il 28 Apr 2020

Modificato: Angelo Cuzzola il 28 Apr 2020

Apri in MATLAB Online

I tried your solution

parfor t=1:T
    yC = diag(~v.nanY(:,t))*C;
    R_new_T = R_new_T + (v.y(:,t) - yC*Z(:,t+1)).^2 ...
                       + diag(yC*V(:,:,t+1)*yC')  ...
                       + Rvec.*(v.nanY(:,t)) ; 
end  

I get the following error.

"Conversion to double from logical is not possible."

As if yC is imported in double format to save space (NOTE: nanY(:,t) is a vector of 0s and 1s) and it is impossible to convert it to double and get the multiplication done.

Moreover, what if R_new_T becomes too large leading to out of memory error? What are possible strategies to write on a mat-file as an output of a parfor.

For example, the following code leads to errore due to parallelization:

yC = zeros(n,r);
R_new_T = zeros(n,1);
save var_cyc.mat nanY y yC R_new_T -v7.3;
v = matfile('var_cyc.mat', 'Writable', true);
parfor t=1:T
    v.yC = diag(~v.nanY(:,t))*C;
    v.R_new_T = v.R_new_T + (v.y(:,t) - v.yC*Z(:,t+1)).^2 ...
                       + diag(v.yC*V(:,:,t+1)*v.yC')  ...
                       + Rvec.*(v.nanY(:,t)) ; 
end     

Edric Ellis il 5 Mag 2020

Apri in MATLAB Online

This code is missing nanY and Rvec, so I added them, like this:

% Initialize data at random
n = 60000;
r = 3;
T = 100;
Z = randn(r,T);
V = randn(r,r,T);
C = rand(n,r);
y = randn(n,T);
y(y==0) = NaN;
% adding nanY and Rvec
nanY = isnan(y);   % ??
Rvec = zeros(n,1); % ??
% Run the code
R_new_T = zeros(n,1);
save var_cyc.mat nanY y -v7.3;
clear y nanY
v = matfile('var_cyc.mat');
parfor t=1:(T-1)
    nanYCol = v.nanY(:,t);
    yC = nanYCol.*C;
    R_new_T = R_new_T + (v.y(:,t) - yC*Z(:,t+1)).^2 ...
                       + diag(yC*V(:,:,t+1)*yC')  ...
                       + Rvec.*nanYCol ; 
end

This works, but it uses a lot of memory. The problem is essentially that each worker needs to make an n-by-n temporary. I was able to remove the first one of those by using yC = nanYCol .* C rather than forming the diagonal matrix. However, the second line of the computation of R_new_T also forms an n-by-n temporary only to extract the diagonal. It might be worth trying to refactor that so that it doesn't need the whole matrix.

Angelo Cuzzola il 5 Mag 2020

Apri in MATLAB Online

Thanks. It should work. Just for consistency yC definition need a correction

yC = (~nanYCol).*C;

The second line can refactored using only matrices of dimension (n,r)

sum((yC*V(:,:,t+1)).*yC,2)

Accedi per commentare.

Efficient use of matfile within parfor

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposte (1)

8 Commenti
Mostra 6 commenti meno recentiNascondi 6 commenti meno recenti

Vedere anche

Categorie

Tag

Community Treasure Hunt

Efficient use of matfile within parfor

0 Commenti Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposte (1)

8 Commenti Mostra 6 commenti meno recentiNascondi 6 commenti meno recenti

Vedere anche

Categorie

Tag

Community Treasure Hunt

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

8 Commenti
Mostra 6 commenti meno recentiNascondi 6 commenti meno recenti