Multithreading causing system crashes on Linux
Hello,
I'm trying to run a rather heavy calculation on our server, and high levels of multithreading cause the system to crash, sadly without a crash log. The system stops responding to everything, both over the network and to the power button, and the power cable has to be pulled before it can restart.
The Linux version is "Ubuntu 24.04.2 LTS", with two AMD EPYC 7763 64-Core Processor CPUs, about 1 TB of RAM, a GeForce RTX 3050 8GB graphics card and a Supermicro H12DSi-NT6 mainboard.
When I use parpool and maxNumCompThreads at 16, the system works fine, but rather slowly. At 32, the system works fine for a couple of minutes before crashing, even though the CPU and RAM load never goes above 20%.
The computations include FFTs, interpolation, and conversions between complex and real data.
Do you have any tips or hints on where I can check what is happening, or any information about what is going on and what I can do to resolve this problem?
Thanks
2 Comments
Walter Roberson
about 5 hours ago
Typically the causes of something like this are one of the following:
- general overheating. You say that the CPU load never goes above 20%, but a load that low would mean the tasks are mostly I/O bound, which contradicts "heavy calculation".
- memory error. A memory fault might only get triggered when the system is busy.
- occasionally, the problem can be traced back to overheating of one specific CPU. If you isolate CPUs so that nothing can run on them, you might find the culprit by systematic elimination (but it would be very tedious).
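One way to start checking the overheating and memory-error hypotheses above is to read the sensor and error counters that Linux exposes under sysfs. The following is a minimal Python sketch, not a definitive diagnostic. It assumes that a hwmon driver (for AMD CPUs typically k10temp) and an EDAC memory-controller driver are loaded on the server. The function names are made up for this example.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def hwmon_temperatures(root="/sys/class/hwmon"):
    """Map each temp*_input file to degrees Celsius (sysfs stores millidegrees)."""
    temps = {}
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        raw = read_int(f)
        if raw is not None:
            temps[f"{f.parent.name}/{f.name}"] = raw / 1000.0
    return temps


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Map each memory controller to its (corrected, uncorrected) error counts."""
    counts = {}
    for mc in sorted(Path(root).glob("mc*")):
        ce = read_int(mc / "ce_count")
        ue = read_int(mc / "ue_count")
        if ce is not None or ue is not None:
            counts[mc.name] = (ce, ue)
    return counts
```

Calling `hwmon_temperatures()` and `edac_error_counts()` with no arguments reads the live system. Logging both every few seconds to a file while the 32-thread job runs would show whether temperatures climb or corrected memory errors accumulate shortly before the hang. An empty result only means the corresponding driver is not exposing data, not that the hardware is healthy.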
Steven Lord
about an hour ago
Without more details about what you're actually doing, it's probably going to be difficult if not impossible to identify a cause.
How large are the arrays involved in your calculations? Is the machine perhaps busy allocating huge blocks of memory (potentially with swapping), or is MATLAB perhaps being killed by the operating system's out-of-memory killer?
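To make the array-size question concrete, here is a rough back-of-the-envelope estimate as a Python sketch. It assumes complex double-precision data at 16 bytes per element. The element count, worker count, and number of simultaneous copies per worker used below are purely hypothetical, since the original post does not state them.

```python
BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte doubles: real and imaginary parts


def array_gib(n_elements, bytes_per_element=BYTES_PER_COMPLEX_DOUBLE):
    """Size of one array in GiB."""
    return n_elements * bytes_per_element / 2**30


def peak_gib(n_elements, workers, copies_per_worker):
    """Rough peak memory if every worker holds several copies at once
    (for example input, FFT output, and an interpolated result)."""
    return array_gib(n_elements) * workers * copies_per_worker


# Hypothetical numbers: 2**30 complex elements, 32 workers, 3 copies each.
print(array_gib(2**30))        # 16.0 GiB per array
print(peak_gib(2**30, 32, 3))  # 1536.0 GiB in total
```

With these made-up figures the peak would exceed the roughly 1 TB of RAM in the server, which is the kind of situation in which swapping or the out-of-memory killer would come into play. Plugging in the real array sizes shows whether that is plausible here.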
When you say "When I use parpool and maxNumCompThreads at 16", does that mean that you start a parallel pool (with how many workers?) and on each worker you set maxNumCompThreads to 16?
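The reason the worker count matters is that the total number of computational threads is the product of the two settings. A small Python sketch of that arithmetic follows. The 128 physical cores come from the two 64-core CPUs in the original post, while the worker count of 8 is only an assumed value for illustration.

```python
def total_compute_threads(workers, threads_per_worker):
    """Threads that can be busy at once if every worker uses its full limit."""
    return workers * threads_per_worker


def oversubscription_ratio(workers, threads_per_worker, physical_cores):
    """Ratio above 1.0 means more compute threads than physical cores."""
    return total_compute_threads(workers, threads_per_worker) / physical_cores


PHYSICAL_CORES = 2 * 64  # two EPYC 7763 CPUs with 64 cores each

# Assumed pool of 8 workers (the post does not state the pool size).
print(oversubscription_ratio(8, 16, PHYSICAL_CORES))  # 1.0
print(oversubscription_ratio(8, 32, PHYSICAL_CORES))  # 2.0
```

Under that assumption, going from 16 to 32 threads per worker would move the machine from exactly one thread per core to two threads per core. Whether that matches the actual setup depends on the real pool size.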
You said "sadly without crash log. The system does not respond to anything, both over the network and pressing the power button, and requires the power connection to be pulled before it can restart." Does "without crash log" mean without any MATLAB crash log file, without any entries in the operating system's log files at the time the system stops responding, or both?
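For the operating-system side of that question, the kernel messages from the boot that ended in the hang can be searched for typical signs of trouble. The Python sketch below filters log text for a few well-known kernel message fragments (out-of-memory kills, machine check events, hardware errors, lockups). The keyword list is an assumption chosen for this example and is not exhaustive, and a hard hang may leave no log entries at all.

```python
import re

# Fragments that commonly appear in kernel messages for these conditions.
PATTERNS = re.compile(
    r"out of memory|oom-kill|machine check|hardware error|soft lockup|hard lockup",
    re.IGNORECASE,
)


def suspicious_lines(log_text):
    """Return the log lines that match any of the patterns above."""
    return [line for line in log_text.splitlines() if PATTERNS.search(line)]
```

On a system with a persistent systemd journal, the text could come from `journalctl -k -b -1` (kernel messages of the previous boot) saved to a file after the forced restart, with the file contents then passed to `suspicious_lines`.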
Answers (0)