[Dune] DUNE running out of memory on 65536 cores?

Eike Mueller E.Mueller at bath.ac.uk
Tue Mar 13 10:45:16 CET 2012


Hi Markus,

many thanks for your reply. I'm still using the fixed version 2.1. How 
complicated are these changes? Can I make them myself, or can you point 
me to a patch? It's just that we have to use up our current AU 
allocation on Hector by the end of March... Or is the trunk 
sufficiently stable, and would you recommend using it instead of version 
2.1 anyway?

Will the changes to ParMETIS you mention have any impact on the 
performance of the solver? I.e. will they change the time spent in the 
solver? If they don't have any impact on the solver time, I will just 
keep running with lower core counts for now. But I suppose they should 
have an impact on the AMG setup time?

I managed to run on 49152 cores yesterday, with 2.6E10 degrees of freedom.

Another thing I noticed is that if I add up the timings reported by the 
DUNE timers, i.e. for the 49152 core run I get something like this:

Building Hierarchy of 11 levels took 31.1499 seconds.
=== CGSolver
   124      2.87928e-11
=== rate=0.911289, T=100.386, TIT=0.809567, IT=124

then this does not match up with the time spent in slp.apply(). E.g. in 
the above example, the master process spends 148.7s in slp.apply(), but 
adding up the 31.15s and 100.4s, I only get 131.55s. Am I missing 
something?
If I compare to the total runtime, as measured by a time command, I'm 
still missing a few seconds, but I'm happy with that, as the code 
probably does not start up instantly on such large core counts.
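
To make the mismatch concrete, here is a small sketch of the arithmetic, 
using only the numbers quoted above. The variable names are made up for 
illustration. It shows that roughly 17 seconds of slp.apply() are covered 
by neither the hierarchy setup timer nor the CG timer, and that T/IT 
reproduces the reported TIT.

```python
# Timings copied from the 49152-core run above (all in seconds).
hierarchy_setup = 31.1499  # "Building Hierarchy of 11 levels took ..."
cg_total = 100.386         # T reported by CGSolver
cg_iterations = 124        # IT reported by CGSolver
apply_wallclock = 148.7    # measured around slp.apply() on the master process

# Time covered by the two DUNE timers, and what is left over.
accounted = hierarchy_setup + cg_total
unaccounted = apply_wallclock - accounted

# Consistency check: T / IT should be close to the reported TIT.
time_per_iteration = cg_total / cg_iterations

print(f"accounted for by timers: {accounted:.2f} s")
print(f"unaccounted in apply():  {unaccounted:.2f} s")
print(f"time per CG iteration:   {time_per_iteration:.4f} s")
```

The sketch only quantifies the gap. It does not say where the remaining 
time is spent.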

Thanks a lot,

Eike

Markus Blatt wrote:
> Hi Eike,
> 
> On Sat, Mar 10, 2012 at 11:44:16AM +0000, Eike Mueller wrote:
>> Dear DUNE list,
>>
>> I'm slowly increasing the core count on Hector... With ParMETIS
>> installed, I could extend my weak scaling runs to 32768 processes.
>> I'm using 1.7E10 degrees of freedom there, i.e. 0.5E6 dof per
>> process. However, when I push this further to 65536 processes (with
>> 3.4E10 dof, but the same number of dof per process), my program gets
>> killed as it runs out of memory (I get error messages like this:
>> '[NID 00134] 2012-03-07 09:42:53 Apid 1756591: OOM killer terminated
>> this process.').
> 
> you are not using the trunk, are you?
> 
> These problems are due to parmetis using a dense matrix structure for
> saving the adjacency information. In your case this results in
> allocating a 65536x65536 matrix.
> 
> I fixed this in the trunk, but probably forgot to merge the changes
> to the 2.1 branch. If I find the time I will do it at the end of this
> week or the beginning of next week.
> 
> BTW: The attachment was missing.
> 
> Cheers, 
> 
> Markus
> 
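
As a rough sanity check on the dense matrix mentioned in the quoted 
message, the following sketch estimates how storage for a P x P matrix 
grows with the process count P. The entry size of 4 bytes is purely an 
assumption, since the actual entry type and where the matrix lives are 
not stated in this thread. The function name is hypothetical.

```python
# Assumption: 4 bytes per matrix entry. The real entry type is not given
# in the thread, so the absolute numbers are only indicative.
BYTES_PER_ENTRY = 4


def dense_adjacency_bytes(num_processes, bytes_per_entry=BYTES_PER_ENTRY):
    """Storage in bytes for a dense num_processes x num_processes matrix."""
    return num_processes * num_processes * bytes_per_entry


# Process counts taken from the runs discussed in this thread.
for p in (32768, 49152, 65536):
    gib = dense_adjacency_bytes(p) / 2**30
    print(f"{p:6d} processes: {p * p:.3e} entries, {gib:5.1f} GiB")
```

Whatever the true entry size is, the storage grows quadratically: going 
from 32768 to 65536 processes quadruples it, which would fit with the 
32768 run succeeding and the 65536 run being killed.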



