h5create/h5write - speed issues

Question

Guillaume Erny on 9 Jan 2021


Direct link to this question

https://it.mathworks.com/matlabcentral/answers/712483-h5create-h5write-speed-issues

Answered: Avadhoot on 13 Feb 2024

Dear all,

I work with large and numerous scientific datasets. My original system was to write the data to a binary file, indexed in a structure so that I could read only the portion of data that interested me. I recently tested the HDF5 file structure (h5create/h5write, ...), and my first approach was to create many independent HDF5 files so that I could work with Parallel Computing. In my first test I created 20,000+ h5 files. While it works fine, it is at least ten times slower than my previous approach, where I dumped everything into a binary file and recorded the start and end positions.

I would appreciate any help on the following topics:

1- Will it be faster if I use the low-level h5 functions rather than the high-level functions?

2- Will it be faster to have one large h5 file (>1 GB) rather than many, and in that case will I be able to access this file simultaneously with many workers (one worker per dataset, though)?

Thanks in advance

Guillaume


Answer 1

Avadhoot on 13 Feb 2024


Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/712483-h5create-h5write-speed-issues#answer_1407418

Hi Guillaume,

I see that you are experimenting with HDF5 files for your project. Here are my answers to your two questions:

Firstly, using the low-level h5 functions can improve performance somewhat, but the complexity of the code increases accordingly. The low-level functions give you fine-grained control over data access, custom tuning of chunk sizes, and more flexibility for parallel I/O operations. All of this requires a deeper understanding of the functions and will complicate your code considerably.
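To give an idea of the extra code involved, here is a minimal sketch of writing one chunked dataset with the low-level interface. The file name, the dataset name `/data`, the array size and the chunk size are all assumptions made up for illustration, not values from your project, and I have not benchmarked this against h5create/h5write.

```matlab
% Hypothetical file and dataset names, for illustration only.
fname = fullfile(tempdir, 'lowlevel_demo.h5');
data  = rand(100, 50);

% Create (or overwrite) the file with default property lists.
fid = H5F.create(fname, 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');

% The low-level interface expects C-style (row-major) dimension order,
% so the MATLAB dimensions are flipped.
h5dims  = fliplr(size(data));
spaceID = H5S.create_simple(numel(h5dims), h5dims, h5dims);

% Dataset creation property list with an explicit chunk size.
% The chunk size [50 25] is an arbitrary choice for this sketch.
dcpl = H5P.create('H5P_DATASET_CREATE');
H5P.set_chunk(dcpl, fliplr([50 25]));

% Create the dataset and write the whole array in one call.
typeID = H5T.copy('H5T_NATIVE_DOUBLE');
dsetID = H5D.create(fid, '/data', typeID, spaceID, dcpl);
H5D.write(dsetID, 'H5ML_DEFAULT', 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', data);

% Every identifier has to be closed explicitly.
H5D.close(dsetID);
H5T.close(typeID);
H5P.close(dcpl);
H5S.close(spaceID);
H5F.close(fid);

% Check the round trip with the high-level reader.
readBack = h5read(fname, '/data');
```

The point of going this route is that the file stays open for as long as you hold `fid`, so many datasets can be written through one open handle, at the cost of managing all the identifiers yourself.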

Secondly, it would be better to use a single HDF5 file instead of multiple files, for the following reasons:

File system overhead: Every file that is created, opened and closed carries a fixed overhead, so spreading the data over thousands of files makes the code significantly slower.
Better I/O optimization: A single large HDF5 file allows optimizations such as reading and writing larger, contiguous blocks.
File management: It is easier to keep track of one file than of thousands of files.
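As a rough sketch of the single-file layout, the example below stores each record as its own dataset in one file with the high-level functions and then reads back only the part that is needed. The file name, the `/record_%05d` naming scheme and the record sizes are hypothetical placeholders, and `nSets = 5` stands in for your 20,000+ records.

```matlab
% Hypothetical file name; start from a clean file because h5create
% errors if a dataset with the same name already exists.
fname = fullfile(tempdir, 'single_file_demo.h5');
if isfile(fname)
    delete(fname);
end

nSets  = 5;                 % stand-in for the real number of records
blocks = cell(1, nSets);
for k = 1:nSets
    blocks{k} = rand(10, 3);                 % made-up record contents
    dset = sprintf('/record_%05d', k);       % one dataset per record
    h5create(fname, dset, size(blocks{k}));
    h5write(fname, dset, blocks{k});
end

% Read one complete record without touching the others.
third = h5read(fname, '/record_00003');

% Read only the first 5 rows of that record (start index, then count).
firstRows = h5read(fname, '/record_00003', [1 1], [5 3]);
```

The dataset name plays the role of the index structure in your binary-file approach, and the start/count arguments of h5read give you the partial reads.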

Also, you can have parallel access to a single HDF5 file using the single-writer/multiple-reader (SWMR) model in MATLAB. With this functionality you can read in parallel, but you cannot write in parallel. More information is provided in the following documentation:

https://www.mathworks.com/help/matlab/import_export/read-and-write-data-concurrently-using-single-writermultiple-reader-swmr.html
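For the "one worker per dataset" case, a simpler situation than SWMR is a file that has been completely written and closed before the workers start. The sketch below is not SWMR itself; it only assumes that each worker reads its own dataset from the finished file with h5read. The file name, dataset names and contents are hypothetical, and `parfor` falls back to a serial loop if Parallel Computing Toolbox is not available.

```matlab
% Build a small, finished file first (hypothetical names and contents).
fname = fullfile(tempdir, 'parallel_read_demo.h5');
if isfile(fname)
    delete(fname);
end
nSets = 4;
for k = 1:nSets
    dset = sprintf('/record_%05d', k);
    h5create(fname, dset, [10 3]);
    h5write(fname, dset, k * ones(10, 3));   % record k is filled with the value k
end

% Each iteration reads a different dataset from the same file.
totals = zeros(1, nSets);
parfor k = 1:nSets
    block = h5read(fname, sprintf('/record_%05d', k));
    totals(k) = sum(block(:));
end
```

If the file is still being written while the workers read it, this pattern is not enough and the SWMR workflow from the documentation above is the one to follow.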

I hope it helps.