How does ValueStore (Parallel Computing Toolbox) deal with concurrent write access?

Use case: random, parallel read access over a network where the operating system's read cache is ineffective. We therefore implement our own read cache.
I am working on a script in which data is loaded and stored inside a parfor loop; the loads go through the cache and also fill it.
Thus, each parallel worker has to access - namely read and extend - this cache/shared data storage independently.
After working with persistent variables (+external interprocess library), I found the very new ValueStore [1] functionality which seems to do the job with less hassle.
But I couldn't find in the documentation whether I risk running into race conditions, for instance when two workers extend the same data entry simultaneously, which can theoretically happen, although with low probability. Atomic operations, synchronization, and mutexes are not mentioned in the documentation (at least not for write access), which is a little bit scary.
To ask more precisely: are ValueStore operations in MATLAB atomic (thread safe)? And where exactly does ValueStore store its data? The documentation only mentions where it is not stored. Is it kept on local persistent storage, relying on the OS for caching and performance?

Accepted Answer

Stuart Moulder on 17 Oct 2022
ValueStore is designed for workers to incrementally store data during job execution. ValueStore therefore only makes the following guarantees:
  • Multiple processes may safely read or write to the same entry. It should not be possible to end up with corrupted data.
  • Since there is no synchronization, concurrent writes to the same entry from multiple processes do not have a well defined order. The final stored value of the entry will be that of the "last" writer as deemed by the implementation of ValueStore.
  • Since there is no synchronization, concurrent reads during writes from multiple processes may see a stale value.
To answer your question then, ValueStore does not provide the synchronization mechanisms required for multiple processes to coordinate a shared entry. As you point out, this would require the addition of mutexes or atomic read-then-modify operations to do so safely.
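One way to stay within these guarantees is to avoid shared entries altogether. The following is a minimal sketch, not an official recommendation: it assumes a process-based pool, and the key naming scheme `chunk_<n>` is made up for illustration. Each parfor iteration writes only its own key, so no read-then-modify step ever races, and the client merges the entries after the loop has finished.

```matlab
pool = gcp();                        % start or reuse a parallel pool
clientStore = pool.ValueStore;       % client-side handle to the pool's store

parfor i = 1:8
    store = getCurrentValueStore();  % worker-side handle to the same store
    key = sprintf("chunk_%d", i);    % hypothetical scheme: one key per iteration
    store(key) = i^2;                % only this iteration ever writes this key
end

% Merge on the client once all writers are done.
allKeys = keys(clientStore);
chunkKeys = allKeys(startsWith(allKeys, "chunk_"));
total = 0;
for k = 1:numel(chunkKeys)
    total = total + clientStore(chunkKeys(k));
end
```

Because every key has exactly one writer, the "last writer wins" rule never comes into play. The remaining caveat is visibility delay on slow shared file systems.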
Data stored in ValueStore uses the JobStorageLocation of the cluster running your parfor. Depending on your cluster setup, this will either be a shared filesystem location or a database. Since this involves file system access, it is also unlikely that ValueStore would be more performant than an operating system read-cache.
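To see where this is on a given machine, the cluster object can be inspected from the client. This is a small sketch that assumes the default profile is a local or file-system-based cluster; clusters backed by a database may not expose the property, hence the guard.

```matlab
c = parcluster();                        % cluster object for the default profile
if isprop(c, "JobStorageLocation")
    location = c.JobStorageLocation;     % folder holding job data, including ValueStore entries
else
    location = "";                       % assumption: cluster keeps job data elsewhere (e.g. a database)
end
disp(location);
```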
  3 Comments
Stuart Moulder on 19 Oct 2022
Entries in the ValueStore obey eventual consistency. For most environments this consistency will occur instantly, however for others the ValueStore could be using the network file system which may have significant delays depending on the user setup.
ValueStore is intended for one process to write an entry which other processes can see. Intended use cases for this include:
  1. One process writing an entry once. In this case the entry either exists or does not, so all other processes which see the entry are seeing the final result. This could be used for one process to offload a result which another process then post-processes. It can also be used to offload large results out of memory where they can be easily found later by the client MATLAB.
  2. One process repeatedly writing the same entry. This could be used to share progress or some intermediate result. Any process which reads this entry should expect that it may be overwritten and therefore without some external synchronisation mechanism will always have to accept that the result could be stale.
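The first use case can be sketched as follows. This is an illustrative example only: the key name `result` and the payload `magic(4)` are placeholders. A single worker writes the entry exactly once, and the reader treats a missing key as "not available yet" rather than as an error.

```matlab
pool = gcp();
clientStore = pool.ValueStore;

% Single writer: one worker stores the entry exactly once.
f = parfeval(pool, @() put(getCurrentValueStore(), "result", magic(4)), 0);
wait(f);

% Reader: the entry either exists (and is then final) or is not visible yet.
if isKey(clientStore, "result")
    result = clientStore("result");
else
    result = [];   % not visible yet, e.g. on a slow network file system; retry later
end
```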
The use case you describe, where multiple processes coordinate on the value of a shared entry, is not supported. Since there are no locking or atomic read-and-modify operations, any process which reads an entry and then tries to write an updated value to the same entry cannot guarantee that another process hasn't modified the entry in the meantime. If you do need to coordinate actions between workers in a pool, you might wish to try the spmdBarrier, spmdSend, spmdReceive etc. operations, which use MPI to communicate between workers.
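A rough sketch of that kind of coordination, assuming R2022b or later (in earlier releases the equivalent functions are labSend, labReceive, labBarrier, labindex, and numlabs): every worker sends its contribution to worker 1, which is then the only process that writes the shared entry. The key name `shared` and the per-worker value are made up for illustration.

```matlab
pool = gcp();

spmd
    localEntry = spmdIndex * 10;             % hypothetical per-worker contribution
    if spmdIndex == 1
        merged = localEntry;
        for src = 2:spmdSize
            merged = [merged, spmdReceive(src)]; %#ok<AGROW> collect from each worker
        end
        store = getCurrentValueStore();
        store("shared") = merged;            % single writer, so no lost updates
    else
        spmdSend(localEntry, 1);             % hand the contribution to worker 1
    end
    spmdBarrier;                             % all workers wait until the write is issued
end

clientStore = pool.ValueStore;
hasShared = isKey(clientStore, "shared");
```

The design choice here is to funnel all updates through one writer, which turns the unsupported multi-writer pattern into the supported "one process repeatedly writing the same entry" pattern.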
As discussed, the data inside ValueStore is stored in the JobStorageLocation of the cluster which is running your pool. To delete the data you must delete the Job. Usually when you delete a parpool, the backing Job and its associated data are automatically deleted as well. In this case it sounds like too much data was stored and MATLAB exited without running the Job delete hook. To clean up your SSD you can manually delete the Job using the Cluster object:
cluster = parcluster();
jobs = cluster.Jobs;
% Find which Job is the one related to the parpool and set indexOfPoolJob
% (a placeholder you must define, not a built-in) to its position in jobs
delete(jobs(indexOfPoolJob));
Leonie Schicketanz on 11 Nov 2022
Hello, thank you for taking the time to answer all my questions (and sorry for the late reaction).
I tried to delete the job as you proposed, but after MATLAB had crashed due to the error I described above I couldn't access this job to delete the data (so I had to search for the folder and delete the data manually). And even when it works, it is a hassle if it doesn't clean up by itself at some point. Thus, I hope that this behaviour can be improved.
Also, would it be possible to extend the documentation on ValueStore with the explanation you gave me (especially the details in your first answer)? This info was very insightful for dealing with ValueStore correctly, and I imagine others feel the same.

Release

R2022a
