[Dune] DUNE running out of memory on 65536 cores?

Eike Mueller eike.h.mueller at googlemail.com
Sat Mar 10 12:44:16 CET 2012


Dear DUNE list,

I'm slowly increasing the core count on Hector... With ParMETIS  
installed, I could extend my weak scaling runs to 32768 processes. I'm  
using 1.7E10 degrees of freedom there, i.e. 0.5E6 dof per process.  
However, when I push this further to 65536 processes (with 3.4E10 dof,
but the same number of dof per process), my program gets killed because
it runs out of memory. I get error messages like this:
'[NID 00134] 2012-03-07 09:42:53 Apid 1756591: OOM killer terminated this process.'
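
As a quick check of the weak scaling figures above, here is a minimal Python sketch. The helper name dof_per_process is made up for illustration and is not part of DUNE or of my code; it only divides the quoted totals by the process counts.

```python
def dof_per_process(total_dof: float, processes: int) -> float:
    """Average number of degrees of freedom owned by one process."""
    return total_dof / processes


# Totals and process counts quoted above for the two largest weak scaling runs.
runs = [(1.7e10, 32768), (3.4e10, 65536)]

for total_dof, processes in runs:
    load = dof_per_process(total_dof, processes)
    print(f"{processes:6d} processes: {load:.2e} dof per process")
```

Both runs come out at roughly 0.5E6 dof per process, so the per-process problem size is unchanged between the run that works and the run that is killed.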

I then used a slightly different scaling, a mixture of weak and strong
scaling, in which the number of processes increases by a factor of 16
whenever the size of the domain in each direction is doubled (i.e. the
dof increase by a factor of 8). This means, of course, that the dof per
process decreases as the total process count goes up. With this
approach I only have 85E3 dof per process on 4096 processes (where the
code runs fine), but this drops even further to 30E3 dof per process on
65536 processes, and here the code crashes again.

Looking at the output, the code appears to get to the stage where it
has built the coarse grids, but has not started the main CG iteration
yet. I attach one output file; I have cut out most of the lines in the
middle, as the file would otherwise be more than 6MB long, but I hope
it is still useful. I do not use SuperLU.
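
The mixed scaling rule described above can be written down as a small Python sketch. It takes the rule at face value (processes times 16 and dof times 8 per doubling of the domain size in each direction); the function name mixed_scaling and the default base of 4096 processes are chosen for illustration only.

```python
def mixed_scaling(doublings: int, base_processes: int = 4096):
    """Return (processes, dof growth factor, dof-per-process factor)
    after `doublings` doublings of the domain size in each direction."""
    process_factor = 16 ** doublings   # processes grow by 16 per doubling
    dof_factor = 8 ** doublings        # total dof grow by 8 per doubling
    processes = base_processes * process_factor
    per_process_factor = dof_factor / process_factor
    return processes, dof_factor, per_process_factor


for k in range(3):
    processes, dof_factor, per_process = mixed_scaling(k)
    print(f"{processes:8d} processes: total dof x{dof_factor}, "
          f"dof per process x{per_process}")
```

Under this rule the per-process load shrinks with every step, so the run on 65536 processes has less local work per process than the run on 4096 processes that completes.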

The nodes on Hector have 32 cores each. I'm using CG preconditioned
with an AMG (with an SSOR smoother, but I have also tried ILU0 as the
smoother instead). The runs are all based on YaspGrid (but for the
'mixed' scaling approach I have transformed this onto part of a
spherical grid with GeometryGrid).

Does anyone have any idea of what might be going wrong there?

Thank you very much for your help,

Eike



