[Dune] DUNE running out of memory on 65536 cores?
Eike Mueller
E.Mueller at bath.ac.uk
Tue Mar 13 10:45:16 CET 2012
Hi Markus,
many thanks for your reply. I'm still using the fixed version 2.1. How
complicated are these changes? Can I make them myself, or can you point
me to a patch? It's just that we have to use up our current AU
allocation on Hector by the end of March... Or is the trunk
sufficiently stable, and would you recommend using it instead of version
2.1 anyway?
Will the changes to ParMETIS you mention have any impact on the
performance of the solver? I.e. will they change the time spent in the
solver? If they don't have any impact on the solver time, I will just
keep running with lower core counts for now. But I suppose they should
have an impact on the AMG setup time?
I managed to run on 49152 cores yesterday, with 2.6E10 degrees of freedom.
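As a quick consistency check on the weak scaling setup, here is a small sketch that computes the dof per process for the three runs mentioned in this thread (the 32768 and 65536 figures are from my earlier message quoted below). The variable names are only illustrative, and the dof totals are the rounded values quoted in the emails, so the results are approximate.

```python
# (processes, total degrees of freedom) as quoted in this thread (rounded values).
runs = {
    32768: 1.7e10,
    49152: 2.6e10,
    65536: 3.4e10,
}

# Weak scaling: the work per process should stay roughly constant.
dof_per_process = {procs: dof / procs for procs, dof in runs.items()}

for procs, dof in sorted(dof_per_process.items()):
    print(f"{procs:6d} processes: {dof:.3e} dof per process")
```

All three come out at roughly 0.5E6 dof per process, as intended.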
Another thing I noticed is that if I add up the timings reported by the
DUNE timers, i.e. for the 49152 core run I get something like this:
Building Hierarchy of 11 levels took 31.1499 seconds.
=== CGSolver
124 2.87928e-11
=== rate=0.911289, T=100.386, TIT=0.809567, IT=124
then this does not match up with the time spent in slp.apply(). E.g. in
the above example, the master process spends 148.7s in slp.apply(), but
adding up the 31.15s and 100.4s, I only get 131.55s. Am I missing
something?
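To make the gap explicit, here is a minimal sketch of the arithmetic, using only the numbers quoted above. The variable names are my own and not part of any DUNE API, and the sketch does not say anything about where the remaining time is actually spent.

```python
# Timings quoted above for the 49152-core run, in seconds.
hierarchy_setup = 31.1499  # "Building Hierarchy of 11 levels took ..."
solve_total = 100.386      # T reported by the CGSolver
iterations = 124           # IT reported by the CGSolver
apply_wallclock = 148.7    # time the master process spends in slp.apply()

# Sum of what the DUNE timers report, and what is left over.
reported = hierarchy_setup + solve_total
unaccounted = apply_wallclock - reported
fraction = unaccounted / apply_wallclock

# Cross-check: T / IT should reproduce the reported TIT up to rounding.
time_per_iteration = solve_total / iterations

print(f"reported by timers: {reported:.2f} s")
print(f"unaccounted for:    {unaccounted:.2f} s ({100 * fraction:.1f}% of slp.apply())")
print(f"time per iteration: {time_per_iteration:.4f} s")
```

So roughly 17s of the time in slp.apply() is not covered by either of the two reported timers.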
If I compare to the total runtime, as measured by the time command, I'm
still missing a few seconds, but I'm happy with that, as the code
probably does not start up instantly on such large core counts.
Thanks a lot,
Eike
Markus Blatt wrote:
> Hi Eike,
>
> On Sat, Mar 10, 2012 at 11:44:16AM +0000, Eike Mueller wrote:
>> Dear DUNE list,
>>
>> I'm slowly increasing the core count on Hector... With ParMETIS
>> installed, I could extend my weak scaling runs to 32768 processes.
>> I'm using 1.7E10 degrees of freedom there, i.e. 0.5E6 dof per
>> process. However, when I push this further to 65536 processes (with
>> 3.4E10 dof, but the same number of dof per process), my program gets
>> killed as it runs out of memory (I get error messages like this:
>> '[NID 00134] 2012-03-07 09:42:53 Apid 1756591: OOM killer terminated
>> this process.') .
>
> you are not using the trunk, are you?
>
> These problems are due to parmetis using a dense matrix structure for
> saving the adjacency information. In your case this results in
> allocating a 65536x65536 matrix.
>
> I fixed this in the trunk, but probably forgot to merge the changes
> to the 2.1 branch. If I find the time I will do it at the end of this
> week or the beginning of next week.
>
> BTW: The attachment was missing.
>
> Cheers,
>
> Markus
>
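For a rough feel of the dense 65536x65536 structure mentioned above, here is a back-of-the-envelope sketch. The entry sizes (4 and 8 bytes) are assumptions, since the thread does not say which type is stored, and the helper name is hypothetical. How this translates into the actual per-process footprint depends on details not given here.

```python
def dense_matrix_gib(n_procs, bytes_per_entry):
    """Size in GiB of a dense n_procs x n_procs matrix (hypothetical helper)."""
    return n_procs * n_procs * bytes_per_entry / 2**30

n_procs = 65536

# Assumed entry sizes; the real type used for the adjacency data is not stated.
size_4_bytes = dense_matrix_gib(n_procs, 4)
size_8_bytes = dense_matrix_gib(n_procs, 8)

# The size grows quadratically: doubling the process count quadruples it.
growth = dense_matrix_gib(2 * n_procs, 4) / dense_matrix_gib(n_procs, 4)

print(f"{n_procs} x {n_procs}, 4-byte entries: {size_4_bytes:.0f} GiB")
print(f"{n_procs} x {n_procs}, 8-byte entries: {size_8_bytes:.0f} GiB")
print(f"growth when doubling the process count: {growth:.0f}x")
```

Under these assumptions the matrix alone would take 16 GiB or 32 GiB, which is consistent with the run being killed for lack of memory.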