Memory Usage and block proc

1 visualizzazione (ultimi 30 giorni)
Mohammad Abouali
Mohammad Abouali il 20 Ago 2014
Commentato: Ashish Uthama il 22 Ago 2014
Hi,
Consider this command:
blockproc(inputImage.tif,blockSize,function_handle,
'UseParallel',true,
'Destination',BigTiff_Adapter);
As the function suggest the input image is a huge tiff image (44GB+). So, no way, I can load it onto memory on a machine that has 16GB memory in total.
The output of the function_handle has the same number of rows and columns of the input except it has only two layers of single precision value, so the output is a single precision floating point number which defines the computed value based on the input image. as I mentioned the input image would be 44GB+ 4 band Tiff image, where each band is uint8. Therefore, the output of the entire image would be twice that, i.e. 88GB+, (2 band x 32bits per pixel per band). So, it is also not possible to get the output as one matrix on a machine with total of 16GB memory.
Since, these are geolocated, I need to store it as tiff, and as the size of the image is too big I definitely need to write it as BigTiff; hence, I am using a customized BigTiff_Adapter. "blockSize" is set to tile size, (my input tiff image is tiled, 256x256 pixels per tile). So, pretty much one tile is loaded, processed via function_handle and then written to the output BigTiff tile by tile.
So, here is the question that I don't get it. Once I set "UseParallel" to false; everything works just fine. except that it takes a long time. You would think that using parallel proc should improve that. However, once I set the " UseParallel " to true, all my memory is used and it seems that it takes even slower to compute compared to the time that " UseParallel " is set to false. Literally, the serial version seems to be much faster.
If you are thinking of the communication cost, don't bother. The computation in the function_handle is so easy and it only needs the data within one pixel. So no communication at all. Let's say the the output pixel o(i,j)=K*reshape(I(i,j,:),[],1), where o(i,j) is the pixel on i-th row and j-th column of the output, K is a 1x3 matrix, and reshape(I(i,j,:),[],1) is a column vector of size 3x1 . As you can see, I don't need any information from the neighboring block or even neighboring pixels. So, absolutely no inter communication (seems a heaven for parallel proc). All this said, It is not the communication problem that slows down the parallel version, or better to say that the communication is not within function_handle (I don't know what MATLAB does under the hood).
Any idea why turning UseParallel increases the memory usage and causing slower calculation? On some machine I even get java.lang.OutOfMemoryError.
I have to add that once I use an image of let's say 8 or 9GB everything works just fine, both parallel set to true or false, doesn't produce any problem. But when I track the code, it appears that the memory usage goes up, it appears that the entire image seems to be loaded. I am controlling the number of workers, it changes between 4 and 12. If each block is assigned to one worker, then 12 block of 256x256x4 must be loaded at any time, i.e. 3MB, my output should be twice that so the output should be 6MB, and There are not that many intermediate variable during computation, but let's say they also take another 24MB. So, practically I shouldn't see more than 40MB memory usage, but I see at least 5 6 GB of memory usage.
So, what's going on?

Risposta accettata

Meshooo
Meshooo il 21 Ago 2014
That's too much to read. Can you make it shorter?
  1 Commento
Mohammad Abouali
Mohammad Abouali il 21 Ago 2014
That is a work around. But not really solving the problem. At the end we want the whole thing in one file, so, if I split the image, process each part and then recombine them into one image, I would end up with more reading/writing to the disk. (and these disks are network mounted, so imagine how much slower it would be). I thought, block proc pretty much does this for you, but as Ashish has explained the problem is not with reading part but the writing part, since the write in parallel is not supported.

Accedi per commentare.

Più risposte (1)

Ashish Uthama
Ashish Uthama il 21 Ago 2014
blockproc will not read the full image, it ought to only read one tile each. However, since your file system is serial, the final write will have to have to be done in a serial fashion to your storage device. If you have large number of workers, processing each block in short time, then the one process which is trying to serialize this to the final output could be the main bottleneck. In this extreme case, the problem is IO bound rather than compute bound, so parallel processing might actually hurt. I know this explanation might not really 'help'.
Is there more processing you need to do on this file? You might see a parallel advantage if you can combine all your future processing into one block processing function. That way you might even out the IO/Compute balance.
  5 Commenti
Mohammad Abouali
Mohammad Abouali il 21 Ago 2014
I read one tile at a time. So the entire tile is read, but they are small, 256x256.
Based on your instruction I checked the memory during serial run of block proc, i.e. setting UseParallel=false. Although the overall system memory usage goes up, but the instances of matlab stay about 70MB, I tested this using 4 workers. all 4 workers were about 70MB and the master one was about 1GB, but fluctuating a lot (+-200MB).
Ashish Uthama
Ashish Uthama il 22 Ago 2014
That makes sense to me. The overall system memory uptick might be the OS caching some parts of the file (read ahead?). In the parallel case, the master worker is the 'stitcher', so the increase and fluctuations make sense - its probably buffering completed tiles while waiting for the disk output to go through. (I hope you already use SSD's/RAID array or have the budget for it!)

Accedi per commentare.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by