Is it possible to improve fread/fwrite performance and further speed up loading/writing of binary data?

87 views (last 30 days)
I am working on a project that requires reading and writing large amounts of binary data from disk (tens to hundreds of GB). I've managed to optimize the code to the point where I/O takes up a rather large chunk of the total run time and is realistically the only thing I can further improve (all the other non-trivial parts of the code are either vectorized as matrix-matrix multiplications or use bsxfun when matrix multiplications aren't possible).
I've managed to set up the code such that the entire block of data is read/written as one continuous block using a single call to fread/fwrite. From what I've seen looking this issue up on Google, this seems to be the best possible situation for fread/fwrite. However, I know that it isn't utilizing the full capabilities of the hardware. I've run tests with other programs and know that the disk is capable of using the full SATA3 bandwidth (for example, with "dd" I get speeds of ~550-575 MB/s). fread/fwrite, however, give me a fairly constant ~110 MB/s, which is ~5x slower than what the disk should be capable of.
I suspect this is a CPU bottleneck: fread/fwrite are both single-threaded, and the CPU I'm using is an older (Sandy Bridge) 16-core Xeon, which has pretty good multi-threaded performance but is a bit lacking in single-threaded performance. This is just a guess, though.
Is there any way to further speed up the read/write process?
Are there any other functions (either built-in or publicly available m-file/MEX functions) that might be faster than fread/fwrite?
Could I possibly parallelize it and have multiple threads access the file at the same time (with specified byte ranges to avoid accessing the same part of the same file simultaneously)?
It was also suggested that I might be able to get around this by using datastores and tall arrays...if I go this route would the execution time for the rest of the code be affected?
The machine I am using has a ton of memory (allowing me to load the whole dataset and analyse it directly in memory very efficiently), so if the main computations take a performance hit from using tall arrays (or some other form of memory mapping procedure) then I would highly doubt that the overall execution time would be less. I will actually test this at some point though to confirm.
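For reference, the single-call pattern I'm describing looks roughly like this (filenames are placeholders; reading as 'uint8=>uint8' keeps the raw bytes and avoids the implicit conversion to double, which would multiply the memory traffic):

```matlab
% Read the whole file in one fread call, with no type conversion.
fid = fopen('data.bin', 'r');
raw = fread(fid, inf, 'uint8=>uint8');   % one continuous read, stays uint8
fclose(fid);

% Write it back in one fwrite call. 'W' (capital) disables the
% per-call flush that plain 'w' performs.
fid = fopen('out.bin', 'W');
fwrite(fid, raw, 'uint8');
fclose(fid);
```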
  2 Comments
Anthony Barone on 31 Jul 2017
Edited: Anthony Barone on 31 Jul 2017
Good to know. The system I'm working on currently doesn't have the parallel computing toolbox, but I assume some of the free alternatives would also have this functionality? If not I have access to another (less powerful) system that does have parfor, so I can test it and confirm it works (after which I can likely convince my employer to purchase the parallel computing toolbox).
Is there anything in particular I need to do to prevent data corruption other than make sure that each thread which is accessing the data has a unique (non-overlapping) set of byte locations to read from / write to?
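To make sure I'm describing the same thing you have in mind, here is a rough sketch of the parallel read I'm picturing (this assumes the Parallel Computing Toolbox's parfor; the filename, file size, and worker count are placeholders):

```matlab
% Each worker opens its own file handle and reads a non-overlapping
% byte range, so no two workers touch the same region of the file.
nWorkers  = 4;
fileBytes = 1e9;                          % total file size (placeholder)
chunk     = ceil(fileBytes / nWorkers);   % bytes per worker
parts     = cell(nWorkers, 1);

parfor k = 1:nWorkers
    fid = fopen('data.bin', 'r');         % separate handle per worker
    fseek(fid, (k-1)*chunk, 'bof');       % jump to this worker's range
    n   = min(chunk, fileBytes - (k-1)*chunk);
    parts{k} = fread(fid, n, 'uint8=>uint8');
    fclose(fid);
end
data = vertcat(parts{:});                 % reassemble in order
```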


Answers (1)

Yair Altman on 31 Jul 2017
You might try some of the suggestions mentioned here:
Also note Joe V's important comment:
"In general, when doing binary file I/O, touch the file as few times as possible. Reading or writing the entire file in one go is fastest. Make use of fread's (and fwrite's) skip parameter when possible. When performance is truly critical, read the data into memory as a uint8 array and parse it yourself using typecast, rather than relying on fread to do the parsing for you. It can be complicated and tricky and error-prone, but it's tons faster."
Additional I/O performance tips can be found in chapter 11 of my book "Accelerating Matlab Performance" (CRC Press, 2014).
  1 Comment
Anthony Barone on 31 Jul 2017
Thanks for the response Yair. I'm a big fan of your blog.
I had previously come across the "improving fwrite performance" page and I've already tried switching to 'W' and 'A' instead of 'w' and 'a', with no change in performance. From what I can tell, this mostly helps when writing a number of small files, so the result somewhat makes sense: as I mention in the original post, I am already loading the data as one continuous block using a single call to fread/fwrite, so buffering seems less likely to make a difference.
I have also implemented the "load the data as a binary blob and separate it with typecast" approach you refer to, though I came up with it independently and from scratch, so it's implemented a little differently (for example, I'm reading uint32's rather than uint8's, since the main data is uint32 / IBM floats -- it is the data header attached to every trace that needs this process). I've also tailored the process specifically to the data I am using (SEGY data, to be specific).
(If you are curious, http://seg.org/Portals/0/SEG/News%20and%20Resources/Technical%20Standards/seg_y_rev2_0-mar2017.pdf, pages 15-24, describes how to break up the 240 byte big endian header into 91 different little endian int8, uint8, int16, uint16, int32 and uint32 values...this was a major pain in the ass to get correct, but provided at least an order of magnitude speed up).
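In case it helps anyone else, the general shape of what I did is below. This is an illustrative sketch, not my actual SEG-Y code: the offsets are made up, and the real field layout is the one in the SEG-Y rev 2 standard linked above.

```matlab
% Pull big-endian fields out of a 240-byte trace header blob:
% typecast reinterprets the bytes, swapbytes fixes the endianness.
hdr = uint8(1:240).';                             % stand-in for one header

v16 = swapbytes(typecast(hdr(29:30), 'int16'));   % big-endian int16 field
v32 = swapbytes(typecast(hdr(73:76), 'uint32'));  % big-endian uint32 field
```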
I will have to give a few of the explicit multi-threaded data I/O examples you linked a try. Do you have a suggestion on which might be best to start with? (My initial thought would be to start with the C implementation, as my intuition tells me it is likely to be the fastest of the bunch, but I'm just guessing.)

