[Dune] DUNE running out of memory on 65536 cores?
eike.h.mueller at googlemail.com
Sat Mar 10 12:44:16 CET 2012
Dear DUNE list,
I'm slowly increasing the core count on Hector... With ParMETIS
installed, I could extend my weak scaling runs to 32768 processes. I'm
using 1.7E10 degrees of freedom there, i.e. 0.5E6 dof per process.
However, when I push this further to 65536 processes (with 3.4E10 dof,
but the same number of dof per process), my program gets killed as it
runs out of memory (I get error messages like this: '[NID 00134]
2012-03-07 09:42:53 Apid 1756591: OOM killer terminated this
I then used a slightly different scaling, a mixture of weak and
strong scaling, in which the number of processes increases by a
factor of 16 whenever the size of the domain in each direction is
doubled (i.e. the dof increase by a factor of 8). This means, of
course, that the dof per process decrease as the total process count
goes up. With this approach I only have 85E3 dof per process on 4096
processes (where the code runs fine), but this drops even further to
30E3 dof per process on 65536 processes, and here the code crashes
again.
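For reference, the per-process load in the two scaling approaches can be estimated with a few lines of Python. This is only a rough sketch: the helper names are hypothetical, the input figures are the ones quoted in this message, and the mixed-scaling rule is applied in its idealised form (processes times 16, total dof times 8 per step).

```python
# Rough estimate of dof per process; helper names are hypothetical and the
# input figures are the ones quoted in the message.

def dof_per_process(total_dof, processes):
    """Average number of degrees of freedom held by one process."""
    return total_dof / processes


def mixed_scaling_step(dof_per_proc, steps=1):
    """Apply the stated rule `steps` times: processes x16, total dof x8."""
    return dof_per_proc * (8 / 16) ** steps


# Weak scaling: the total dof doubles together with the process count,
# so the load per process stays constant.
for processes, total_dof in [(32768, 1.7e10), (65536, 3.4e10)]:
    print(processes, f"{dof_per_process(total_dof, processes):.2e}")

# Mixed scaling: one step takes 4096 processes to 65536 processes.
print(f"{mixed_scaling_step(85e3):.2e}")
```

The idealised rule halves the dof per process at every step. The figures quoted in the message are the values from the actual runs and need not match this estimate exactly.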
Looking at the output, the code appears to get to the stage where it
has built the coarse grids, but has not started the main CG iteration
yet. (I also attach one output file; I have cut out most of the lines
in the middle of the file, as otherwise it would be more than 6MB
long. I hope it is still useful.) I do not use SuperLU.
The nodes on Hector have 32 cores each. I'm using CG preconditioned
with an AMG (with an SSOR smoother, but I have also used ILU0 as a
smoother instead). The runs are all based on YaspGrid (but for the
'mixed' scaling approach I have transformed this onto part of a
spherical grid with GeometryGrid).
Does anyone have any idea of what might be going wrong there?
Thank you very much for your help,