[Dune] UG grid code: Segfault on large core counts
Peter Bastian
peter.bastian at iwr.uni-heidelberg.de
Wed Dec 5 12:57:49 CET 2012
Hi Oliver,
the question is: what will be the next place where some limit is broken…
-- Peter
On 05.12.2012 at 11:20, Oliver Sander <sander at igpm.rwth-aachen.de> wrote:
> On 05.12.2012 at 10:48, Peter Bastian wrote:
>> Hi Oliver,
>>
>> you ask questions… The parallel UG graphics used a clever
>> way to do depth ordering: it sorted the coarse mesh in the
>> view direction and then "interpolated" this ordering through the
>> grid hierarchy. I do not know whether it has something to
>> do with that. Apparently some tree is constructed on
>> the processor graph, and its maximum degree is estimated
>> using the strange formula. I really cannot say more, as it was
>> all Klaus Johannsen's work.
> Hi Peter,
> okay, that explains why the file is 20k lines long (you guys were tough!).
>
> I then propose to make WOP_DOWN_CHANNELS_MAX very large(tm)
> and forget about it. Since the number of channels computed in wop.c line 21023
> is roughly the square root of the number of processors, I guess a value
> of about 2000 should be okay for the near future.
>
> best,
> Oliver
>
>>
>> Best,
>>
>> Peter
>>
>> On 05.12.2012 at 09:46, Oliver Sander <sander at igpm.rwth-aachen.de> wrote:
>>
>>> On 04.12.2012 at 20:10, Peter Bastian wrote:
>>>> Dear Eike,
>>>>
>>>> a quick search through the code suggests that
>>>> the constant
>>>> #define WOP_DOWN_CHANNELS_MAX 32
>>>> in ug/graphics/uggraph/wop.h line 146 is too small.
>>>>
>>>> In wop.c line 21023
>>>>
>>>> WopDownChannels = (INT)ceil(0.5*(sqrt(4.0*(DOUBLE)procs-3.0)-1.0));
>>>>
>>>> evaluates to 39 for procs = 1536, which is just above 32. This then produces
>>>> out-of-bounds accesses in many arrays. Maybe it is as simple as that.
>>> Hi Peter,
>>> I came to the same conclusion. Can you give us an explanation of what
>>> these WopDownChannels are?
>>> best,
>>> Oliver
>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 04.12.2012 at 18:39, Oliver Sander <sander at igpm.rwth-aachen.de> wrote:
>>>>
>>>>> On 04.12.2012 at 17:05, Eike Mueller wrote:
>>>>>> Dear dune-list,
>>>>> Hi Eike,
>>>>>> now my code runs on 6, 24, 96 and 384 processors. However, on 1536 cores it crashes with a (different) segmentation fault,
>>>>> is that with custom types for DDD_GID, DDD_PROC and such, or with the defaults?
>>>>>
>>>>> Is it possible to get a valgrind output? The crash is at program startup, so even a
>>>>> slow valgrind run shouldn't take too long.
>>>>>
>>>>> The crash is in the UG graphics implementation, which we don't use from Dune.
>>>>> So in the very worst case we could comment out lots of stuff and make it work
>>>>> that way. But I would prefer a proper fix.
>>>>>
>>>>> best,
>>>>> Oliver
>>>>>
>>>>>> when I inspect the core dump with
>>>>>>
>>>>>> gdb <executable> --core=<core>
>>>>>>
>>>>>> I get this backtrace:
>>>>>>
>>>>>> Core was generated by `/work/n02/n02/eike/Code/DUNEGeometricMultigrid/geometricmultigrid_nz128_r130_DE'.
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0 0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>>>> size=4, error=0x7fffffff41f0) at ppif.c:656
>>>>>> 656 ((MPIVChannel*)v)->p, ((MPIVChannel*)v)->chanid, COMM, req) )
>>>>>> (gdb) backtrace
>>>>>> #0 0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>>>> size=4, error=0x7fffffff41f0) at ppif.c:656
>>>>>> #1 0x000000000081ce2b in NumberOfDesc () at wop.c:21095
>>>>>> #2 0x000000000082b2db in UG::D2::InitWOP () at wop.c:24677
>>>>>> #3 0x00000000007fbb1d in UG::D2::InitUGGraph () at initgraph.c:90
>>>>>> #4 0x00000000007fbac5 in UG::D2::InitGraphics () at graphics.c:133
>>>>>> #5 0x0000000000715e4c in UG::D2::InitUg (argcp=0x7fffffff465c,
>>>>>> argvp=0x7fffffff4648) at ../initug.c:293
>>>>>> #6 0x000000000051bb8c in Dune::UG_NS<2>::InitUg (argcp=0x7fffffff465c,
>>>>>> argvp=0x7fffffff4648) at ../../../dune/grid/uggrid/ugwrapper.hh:910
>>>>>> #7 0x000000000052062a in Dune::UGGrid<3>::UGGrid (this=0x1603360)
>>>>>> at uggrid.cc:74
>>>>>> #8 0x000000000059332f in Dune::GridFactory<Dune::UGGrid<3> >::GridFactory (
>>>>>> this=0x7fffffff9cd8) at uggridfactory.cc:74
>>>>>> #9 0x000000000040faa2 in SphericalGridGenerator::SphericalGridGenerator (
>>>>>> this=0x7fffffff9cd0, filename=..., refcount=5)
>>>>>> at sphericalgridgenerator.hh:37
>>>>>> #10 0x00000000004010c8 in main (argc=2, argv=0x7fffffffb778)
>>>>>> at geometricmultigrid.cc:167
>>>>>>
>>>>>> I'm setting up a grid with one element on each core, and then refine it 5 times, so the total grid size is 49152. I don't know whether this has any influence, but I set the default heap size to 8000.
>>>>>>
>>>>>> Any ideas what might be going on here, or just suggestions on how to proceed with debugging a code with this large number of cores would be much appreciated. I'm now trying to get a better trace with the Cray ATP tool.
>>>>>>
>>>>>> Thank you very much,
>>>>>>
>>>>>> Eike
>>>>>>
>>>>>> On 30 Nov 2012, at 18:47, Eike Mueller wrote:
>>>>>>
>>>>>>> Hi Oliver,
>>>>>>>
>>>>>>> I just did some detective work and I think I know why it crashes and how it can be fixed.
>>>>>>>
>>>>>>> The problem is that in parallel/ddd/include/ddd.h the type DDD_PROC is defined as 'unsigned short', so it can only store up to 2^16 different processor IDs. In the subroutine IFCreateFromScratch(), where the segfault occurs, the variable lastproc, which is initialised to PROC_INVALID, is of this type. According to parallel/ddd/dddi.h, PROC_INVALID=(MAX_PROCS+1) and MAX_PROCS=(1<<MAX_PROCBITS_IN_GID). For MAX_PROCBITS_IN_GID=16 this gives PROC_INVALID=65537, which does not fit into a 16-bit unsigned short and wraps around to 1. Investigating the core dump with gdb, lastproc is indeed 1, and then on processor 1 the pointer ifHead never gets initialised, so dereferencing it with ifHead->nItems++; causes the segfault.
>>>>>>>
>>>>>>> I now changed DDD_PROC to 'unsigned int' in parallel/ddd/include/ddd.h and then the test program I sent you works fine if I run on 6 processors (I can also look at the .vtu files if I write them out in ascii format).
>>>>>>> I can run with up to 6 refinement steps, and if I go to 7 it runs out of memory with
>>>>>>>
>>>>>>> testsphericalgridgenerator: xfer/xfer.c:942: void UG::D3::ddd_XferRegisterDelete(UG::D3::DDD_HDR): Assertion `0' failed.
>>>>>>> DDD [000] FATAL 06060: out of memory during XferEnd()
>>>>>>> _pmiu_daemon(SIGCHLD): [NID 01842] [c11-0c1s6n0] [Fri Nov 30 18:35:27 2012] PE RANK 0 exit signal Aborted
>>>>>>> DDD [001] WARNING 02514: increased coupling table, now 131072 entries
>>>>>>> DDD [001] WARNING 02224: increased object table, now 131072 entries
>>>>>>> [NID 01842] 2012-11-30 18:35:27 Apid 3144197: initiated application termination
>>>>>>> Application 3144197 exit codes: 134
>>>>>>> Application 3144197 resources: utime ~24s, stime ~1s
>>>>>>>
>>>>>>> but I think this is another problem, and in my case 6 should actually be enough.
>>>>>>>
>>>>>>> Do you think the change I made makes sense and doesn't break anything else? If that is the case, what's the best way of setting up a patch? To be on the safe side, maybe something like this:
>>>>>>>
>>>>>>> #if (DDD_MAX_PROCBITS_IN_GID < 16)
>>>>>>> typedef unsigned short DDD_PROC;
>>>>>>> #else
>>>>>>> typedef unsigned int DDD_PROC;
>>>>>>> #endif // (DDD_MAX_PROCBITS_IN_GID < 16)
>>>>>>>
>>>>>>> I need to do some more testing to check that it all works for my full code, and I will keep you updated on any progress.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Eike
>>>>>>>
>>>>>>> Oliver Sander wrote:
>>>>>>>> Okay, with maxprocbits = 16 I do see a crash. 2 processors and 1 level is enough.
>>>>>>>> I'm in a bit of a hurry right now, but I'll have a look at it later.
>>>>>>>> best,
>>>>>>>> Oliver
>>>>>>>> On 29.11.2012 at 19:36, Eike Mueller wrote:
>>>>>>>>> Hi Oliver,
>>>>>>>>>
>>>>>>>>> thank you very much for trying this out and for the patch. I now rebuilt UG with the latest patch file you sent me this morning and I still get the segfault. However, this only occurs if I specify the with_ddd_maxprocbits flag. If I do not set it (i.e. only use DDD_GID=long, as you do), then it runs fine (I have only done a small run so far and haven't tried 7 refinement steps, so I cannot say anything about that other error you get).
>>>>>>>>> I tried both with_ddd_maxprocbits=20 and 16, but it does not work in either case. The default is 2^9=512, and unfortunately that's not enough for me; I would need at least 2^16=65536.
>>>>>>>>> As for the error message you get when reading the .vtu files, could that be because they are written out in Dune::VTK::appendedbase64 format? I also get an error message when I open them in paraview:
>>>>>>>>>
>>>>>>>>> ERROR: In /home/kitware/ParaView3/Utilities/BuildScripts/ParaView-3.6/ParaView3/VTK/IO/vtkXMLUnstructuredDataReader.cxx, line 522
>>>>>>>>> vtkXMLUnstructuredGridReader (0xa16c918): Cannot read points array from Points in piece 0. The data array in the element may be too short.
>>>>>>>>>
>>>>>>>>> When I opened .vtu files produced by my main solver code on HECToR, paraview actually crashed, and this was before I modified any of the UG settings. I could fix this by writing data out in Dune::VTK::ascii format instead. Could this be a big/little endian issue? HECToR is 64bit, but my local desktop, where I run paraview to look at the output is 32bit, not sure if that has any impact.
>>>>>>>>>
>>>>>>>>> Eike
>>>>>>>>>
>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>> Hi Eike,
>>>>>>>>>> I tried your example with DDD_GID==long, on my laptop where sizeof(long)==8 and sizeof(uint)==4.
>>>>>>>>>> I started the program with
>>>>>>>>>>
>>>>>>>>>> mpirun -np 6 ./testsphericalgridgenerator sphericalshell_cube_6.dgf 6
>>>>>>>>>>
>>>>>>>>>> Besides a few DDD warnings that I have never seen before, it works like a charm.
>>>>>>>>>>
>>>>>>>>>> What version of UG are you using? I'll send you the very latest patch file just
>>>>>>>>>> to be sure.
>>>>>>>>>>
>>>>>>>>>> The program runs, but paraview gives me an error message when trying to open
>>>>>>>>>> the output file. Does that happen to you, too?
>>>>>>>>>>
>>>>>>>>>> I didn't try the with_ddd_maxprocbits setting. Does your program crash if you
>>>>>>>>>> do _not_ set this?
>>>>>>>>>>
>>>>>>>>>> In the worst case: is it possible to get a temporary account on your HECToR machine?
>>>>>>>>>>
>>>>>>>>>> best,
>>>>>>>>>> Oliver
>>>>>>>>>>
>>>>>>>>>> On 28.11.2012 at 10:28, Eike Mueller wrote:
>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>
>>>>>>>>>>> thanks a lot, that would be great. My desktop is a 32bit machine, where sizeof(long) = sizeof(int) = 4, so I'm not sure if recompiling everything with GID=long there will make a difference.
>>>>>>>>>>>
>>>>>>>>>>> Eike
>>>>>>>>>>>
>>>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>>>> Thanks for the backtrace. I'll try and see whether I can reproduce the crash
>>>>>>>>>>>> on my machine. If that's not possible things will be a bit difficult :-)
>>>>>>>>>>>> --
>>>>>>>>>>>> Oliver
>>>>>>>>>>>>
>>>>>>>>>>>> On 26.11.2012 at 18:39, Eike Mueller wrote:
>>>>>>>>>>>>> Hi Markus and Oliver,
>>>>>>>>>>>>>
>>>>>>>>>>>>> to get to the bottom of this I recompiled everything (UG+Dune+my code) with -O0 -g, and that way I was able to get some more information out of the core dump. On 1 processor it runs fine now, but when running on 6, this is what I get. It looks like it crashes in loadBalance(), but I can't make sense of what's happening inside UG. It always seems to crash inside ifcreate.c, either in line 482 or 489:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>> ifId=1) at if/ifcreate.c:489
>>>>>>>>>>>>> 489 ifHead->nItems++;
>>>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>> ifId=1) at if/ifcreate.c:489
>>>>>>>>>>>>> #1 0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1059
>>>>>>>>>>>>> #2 0x0000000000a44e71 in UG::D3::IFAllFromScratch () at if/ifcreate.c:1097
>>>>>>>>>>>>> #3 0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:869
>>>>>>>>>>>>> #4 0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>>>> level=0) at trans.c:835
>>>>>>>>>>>>> #5 0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>>>> theMG=0x1441880) at lb.c:659
>>>>>>>>>>>>> #6 0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>>>> at commands.c:10658
>>>>>>>>>>>>> #7 0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=4,
>>>>>>>>>>>>> argv=0x7fffffffab90) at ../../../dune/grid/uggrid/ugwrapper.hh:979
>>>>>>>>>>>>> #8 0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>>>> strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>>>> at uggrid.cc:556
>>>>>>>>>>>>> #9 0x00000000004077bf in Dune::UGGrid<3>::loadBalance (this=0x134e490)
>>>>>>>>>>>>> at /home/n02/n02/eike/work/Library/Dune2.2/include/dune/grid/uggrid.hh:738
>>>>>>>>>>>>> #10 0x0000000000400928 in main (argc=3, argv=0x7fffffffb5a8)
>>>>>>>>>>>>> at testsphericalgridgenerator.cc:65
>>>>>>>>>>>>>
>>>>>>>>>>>>> but I sometimes also get:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Core was generated by `./testsphericalgridgenerator sphericalshell_cube_6.dgf 4'.
>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>> ifId=1) at if/ifcreate.c:482
>>>>>>>>>>>>> 482 ifAttr->nAB = ifAttr->nBA = ifAttr->nABA = 0;
>>>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>> ifId=1) at if/ifcreate.c:482
>>>>>>>>>>>>> #1 0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1049
>>>>>>>>>>>>> #2 0x0000000000a44e71 in UG::D3::IFRebuildAll () at if/ifcreate.c:1057
>>>>>>>>>>>>> #3 0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:850
>>>>>>>>>>>>> #4 0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>>>> level=0) at trans.c:824
>>>>>>>>>>>>> #5 0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>>>> theMG=0x1441880) at lb.c:644
>>>>>>>>>>>>> #6 0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>>>> at commands.c:10644
>>>>>>>>>>>>> #7 0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=0,
>>>>>>>>>>>>> argv=0x7fffffffa420) at ../../../dune/grid/uggrid/ugwrapper.hh:977
>>>>>>>>>>>>> #8 0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>>>> strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>>>> at uggrid.cc:554
>>>>>>>>>>>>> #9 0x00000000004077bf in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x134e490, __in_chrg=<optimized out>)
>>>>>>>>>>>>> at /opt/gcc/4.6.3/snos/include/g++/bits/shared_ptr_base.h:550
>>>>>>>>>>>>> #10 0x0000000000400928 in main (argc=0, argv=0x520)
>>>>>>>>>>>>> at testsphericalgridgenerator.cc:66
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've tried different load balancing strategies, but for all I get a segfault.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eike
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Markus Blatt wrote:
>>>>>>>>>>>>>> On Mon, Nov 26, 2012 at 01:57:15PM +0000, Eike Mueller wrote:
>>>>>>>>>>>>>>> thanks a lot for the patch, unfortunately I still get a segfault when I run on HECToR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I feared that, but it was still worth a shot. The change probably
>>>>>>>>>>>>>> interferes with the memory allocation in ddd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Markus
>>>>>>> --
>>>>>>> Dr Eike Mueller
>>>>>>> Research Officer
>>>>>>>
>>>>>>> Department of Mathematical Sciences
>>>>>>> University of Bath
>>>>>>> Bath BA2 7AY, United Kingdom
>>>>>>>
>>>>>>> +44 1225 38 5633
>>>>>>> e.mueller at bath.ac.uk
>>>>>>> http://people.bath.ac.uk/em459/
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Dune mailing list
>>>>>>> Dune at dune-project.org
>>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>>> ------------------------------------------------------------
>>>> Peter Bastian
>>>> Interdisziplinäres Zentrum für Wissenschaftliches Rechnen
>>>> Universität Heidelberg
>>>> Im Neuenheimer Feld 368
>>>> D-69120 Heidelberg
>>>> Tel: 0049 (0) 6221 548261
>>>> Fax: 0049 (0) 6221 548884
>>>> email: peter.bastian at iwr.uni-heidelberg.de
>>>> web: http://conan.iwr.uni-heidelberg.de/people/peter/
>>>>
>>>>
>>>
>>
>