Mat file empty after saving

Nicholas Hopkins on 22 Oct 2022
Answered: Nicholas Hopkins on 23 Oct 2022
I'm training a neural network on an HPCMP GPU cluster (Linux OS) and saving the trained model parameters in a structure. When I run the code on my local machine (Windows OS) for debugging, it executes fine and my model is saved. The saved .mat file containing the model/data structure is about 550 MB, and I have no problems importing the structure and its contents back into MATLAB on my local machine.
When I execute the same code on the HPCMP GPU cluster, the code executes fine, the model is saved in the designated directory, and the file size shows that it is also about 550 MB. However, when I try to import this file back into MATLAB, the import utility says the MAT-file is empty. I'm performing transfer learning with my neural network on the HPCMP systems, and when the saved MAT-file is reloaded, the code fails because no data is loaded from the file.
I don't understand why this MAT-file would show a file size of 550 MB and yet be empty in the import wizard. If I pull the MAT-file off the HPCMP systems and onto my local machine, the file size still shows 550 MB, but I still can't see or load any data from it; the import utility still says the MAT-file is empty.
I've saved models on HPCMP systems with similar code without these issues, so I don't know what has changed or why the file won't load any data even though the file size shows it isn't empty. I'm using iLauncher to submit communicating jobs to the HPCMP cluster. In the past I used the HPC portal, but I don't see why this would make a difference. Since it's a communicating job, I can't debug the job after it is submitted to the cluster.
Any help would be greatly appreciated!
7 Comments
Nicholas Hopkins on 23 Oct 2022
Oh, I see what you mean concerning tempDir(). This project has re-introduced me to Linux, so I'm pretty green at interpreting troubleshooting tips on that OS.
I'll also try saving with '-v7.3' and see if that changes anything. However, since I can save and reload the model when stepping through the code on the HPC side without submitting it as a communicating job, I don't think '-v7.3' will help, but it's worth trying.
An HPC data scientist set up the communicating job code so my algorithm could run properly on the HPC system; he has modified it a fair bit this past week to run MATLAB in a container, since the Linux OS doesn't have the video codecs my algorithm's application needs. I'm hoping to speak to him shortly and get more information.
Nicholas Hopkins on 23 Oct 2022
Apparently passing '-v7.3' to the save function does save the model in a way that allows me to load it back in from the MAT-file after saving with a communicating job. I'm not sure what this save version does differently in my case, though; I've only ever used this option when the size of the variables to be saved exceeds 2 GB. I spoke with my HPC counterpart and he thinks the issue was related to another problem, which I'll describe in an answer to my original question. Thank you for your input though, I certainly appreciate it!
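For anyone following along, here is a minimal sketch of saving with '-v7.3' (the HDF5-based MAT-file format) and then checking what the file actually contains without loading it. The names modelStruct and fileName are placeholders I made up for illustration, not the variables from my real code, and this does not by itself explain why the default format failed in the communicating job.

```matlab
% Placeholder data and file name (hypothetical, for illustration only).
modelStruct = struct('weights', rand(4), 'epoch', 10);
fileName = fullfile(tempdir, 'model_v73_example.mat');

% Save in the HDF5-based version 7.3 MAT-file format.
save(fileName, 'modelStruct', '-v7.3');

% List the variables stored in the file without loading them.
info = whos('-file', fileName);
if isempty(info)
    warning('No variables found in %s.', fileName);
else
    fprintf('%s contains %d variable(s); the first is "%s".\n', ...
        fileName, numel(info), info(1).name);
end
```

Running the whos('-file', ...) check right after the save inside the job would at least show whether the file is readable at the moment it is written.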

Accepted Answer

Nicholas Hopkins on 23 Oct 2022
As Walter mentioned in the comments above, passing '-v7.3' to the save function saves the model in a way that lets me load it back in when the code is executed via a communicating job on the HPC GPU cluster. However, in my case the following code is what should have been used when saving a model processed on multiple GPU clusters:
% Only the worker with labindex 1 writes the file
if labindex == 1
    save(fileName, variableName);
end
I was not using the if statement. Including it ensures that only the worker with labindex 1 saves the file, so the copies of the code running on each GPU cluster do not all try to save at the same time. I believe this was my issue, since I only encountered the saving problem when submitting a communicating job that ran on multiple GPU clusters (i.e. the '-v7.3' option was not needed to save the model when running the code locally on my machine, outside of GPU cluster processing).
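A slightly fuller sketch of the same idea is below, with a check that the file is readable right after it is written. It assumes Parallel Computing Toolbox is available (labindex comes from that toolbox), and trainedModel and fileName are placeholder names rather than my actual variables. On a real multi-node job, fileName would need to point to storage that the writing worker can reach; tempdir is used here only so the example runs anywhere.

```matlab
% Placeholder model and output path (hypothetical, for illustration only).
trainedModel = struct('weights', rand(4));
fileName = fullfile(tempdir, 'trained_model_example.mat');

% Only one worker writes the file, so several workers never write
% to the same MAT-file at the same time.
if labindex == 1
    save(fileName, 'trainedModel');

    % Confirm on the writing worker that the variable is in the file.
    info = whos('-file', fileName);
    assert(any(strcmp({info.name}, 'trainedModel')), ...
        'trainedModel was not found in %s.', fileName);
end
```

In R2022b and later, spmdIndex is the documented replacement for labindex; I was on R2021b, so labindex is what I used.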

Release

R2021b
