Out-of-memory error when using validation while training a DAGNetwork

Hi,
I'm training a ResNet-18 network for semantic segmentation with my own data, based on this tutorial: https://nl.mathworks.com/help/vision/examples/semantic-segmentation-using-deep-learning.html .
The training is done on a Threadripper 1950X with 64GB of RAM and a GTX 1050 Ti 4GB (I use a mini-batch size of 4 so that it fits in GPU memory). MATLAB R2020a is used.
The dataset consists of images of size [284 481 3], with 10k images in the training set and ~3k images in the validation set. Both are stored in a pixelLabelImageDatastore object. When I train the network without validation, Task Manager shows around 6GB of memory used out of 64GB, so no problems there.
However, when I attempt to train the network with validation, memory usage skyrockets to 64GB, with approximately 30 minutes of swapping to disk every time the validation step happens. After a couple of epochs, MATLAB then throws the 'out of memory' error. I even increased the size of the swap file in Windows to 500GB, with the same result: it only takes a couple of extra epochs before crashing.
Why does this happen only with the validation data, and what can be done to counteract it? I believed that using a datastore meant that only the data required at a given moment was read into memory, instead of the entire set. The total file size of all the labeled and input images in my dataset is only ~300MB.
Thanks for any feedback!

Accepted Answer

More Answers (1)

Hi,
This is the result of a bug that causes the GPU to run out of memory as the validation set grows. The issue is not caused by the size of the training set.
One workaround is to split the training set into smaller groups (and thus break up the validation set as well), loop through the groups, and have each successive group pick up training where the previous one left off.
Another workaround is to reduce the size of the validation set until the GPU no longer runs out of memory, and continue training on the full training set.

7 Comments

Thanks for your reply. However, the issue does not lie with the GPU memory, but with the system memory. The GPU memory usage relates to the minibatchsize setting, and validation data has no effect on it.
In the meantime I've also attempted to train the network on a different system, with 128GB of RAM and an RTX 2080 Ti, and I experienced the same issue there: the entire 128GB of RAM is used, with additional swapping to disk, when validation is enabled for training.
As a "workaround" I'm currently training the network without the validation step...
That seems like a fairly major bug as using an extremely small validation set renders the validation far less useful. Is there a plan to fix this bug?
Unfortunately, it's still present in MATLAB R2021a. 64GB of system RAM is not sufficient...
And now I have even run into this issue with 128GB of RAM, while training on some slightly larger images with a batch size of 8.
MATLAB R2021a
Apparently this is fixed in R2021b, though I have not verified the claim. Here's the response I got from MathWorks support: "You are correct. In MATLAB R2021a, there is a bug in the Neural Network Toolbox where, depending on the workflow, if the validation data is large, you may run out of memory on the GPU. This has been reported in image segmentation and LSTM workflows.
The workaround is to reduce the validation data set size or train without validation data. Reducing the "miniBatchSize" does not fix this issue.
A patch for this bug was made in MATLAB R2021b. You may want to consider using this version of MATLAB to avoid encountering this issue."
Again, the issue I'm having is NOT related to GPU memory.
It's the system memory (RAM) that is being filled to the max, to the point that 64GB of system memory is insufficient to train a ResNet-18 network with validation enabled.
I have run into the same issue when doing semantic segmentation experiments. I have a validation dataset of 43,034 images, and MATLAB attempts to request ~2TB of RAM, which obviously fails.
MATLAB: R2021a
Caused by:
Error using zeros
Requested 320x4096x10x43034 (2101.3GB) array exceeds maximum array size preference (251.6GB). This might cause MATLAB to become unresponsive.'
From the dimensions of the requested array, I suspect that this RAM buffer is used to store per-image data so that some statistic can be computed across the entire validation set (maybe mean accuracy or loss? I'm not sure).
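As a rough check on the numbers in that error message, the arithmetic can be reproduced in a few lines. This is only a sketch, written in Python for convenience. It assumes 4 bytes per element (single precision), which the error message does not state, and that "GB" means 2^30 bytes. The helper names buffer_gib and max_images are made up for illustration. Under those assumptions the array comes out at the reported 2101.3GB, which is about 50MB of buffer per validation image.

```python
# Dimensions copied from the error message: 320 x 4096 x 10 x 43034.
# ASSUMPTION: 4 bytes per element (single precision); the error does not say.
BYTES_PER_ELEMENT = 4
dims = (320, 4096, 10, 43034)  # last dimension = number of validation images


def buffer_gib(shape, bytes_per_element=BYTES_PER_ELEMENT):
    """Size of a dense array of the given shape, in units of 2**30 bytes."""
    n_elements = 1
    for d in shape:
        n_elements *= d
    return n_elements * bytes_per_element / 2**30


total_gib = buffer_gib(dims)                  # whole validation buffer
per_image_mib = buffer_gib(dims[:3]) * 1024   # share of a single image


def max_images(ram_gib):
    """Hypothetical helper: images whose buffer share fits in ram_gib."""
    return int(ram_gib * 1024 // per_image_mib)


print(f"total buffer : {total_gib:.1f} GB")     # 2101.3 GB
print(f"per image    : {per_image_mib:.1f} MB")  # 50.0 MB
print(f"fits in 64GB : {max_images(64)} images")
print(f"fits in 128GB: {max_images(128)} images")
```

If the buffer really does scale like this, its size grows linearly with the number of validation images and does not depend on the mini-batch size. The image counts printed above apply only to these particular array dimensions, not to other datasets in this thread.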
Anyway, this is a very big issue and obviously needs some action from the crew. Reducing the validation set size is a compromise, NOT A FIX!

Release

R2020a