[Dune] UG grid code: Segfault on large core counts
Peter Bastian
peter.bastian at iwr.uni-heidelberg.de
Wed Dec 5 10:48:29 CET 2012
Hi Oliver,
you ask questions… The parallel UG graphics used a clever way to do
depth ordering: it sorts the coarse mesh in the view direction and then
"interpolates" this ordering through the grid hierarchy. I do not know
whether the crash has anything to do with that. Apparently some tree is
constructed on the processor graph and its maximum degree is estimated
with that strange formula. I really cannot say more, as it was all
Klaus Johannsen's work.
Best,
Peter
On 05.12.2012 at 09:46, Oliver Sander <sander at igpm.rwth-aachen.de> wrote:
> On 04.12.2012 at 20:10, Peter Bastian wrote:
>> Dear Eike,
>>
>> a quick search through the code suggests that
>> the constant
>> #define WOP_DOWN_CHANNELS_MAX 32
>> in ug/graphics/uggraph/wop.h line 146 is too small.
>>
>> In wop.c line 21023
>>
>> WopDownChannels = (INT)ceil(0.5*(sqrt(4.0*(DOUBLE)procs-3.0)-1.0));
>>
>> evaluates to 39, which is just above 32. This then produces out-of-bounds
>> accesses in many arrays. Maybe it is as simple as that.
> Hi Peter,
> I came to the same conclusion. Can you give us an explanation of what
> these WopDownChannels are?
> best,
> Oliver
>
>>
>> Best,
>>
>> Peter
>>
>>
>> On 04.12.2012 at 18:39, Oliver Sander <sander at igpm.rwth-aachen.de> wrote:
>>
>>> On 04.12.2012 at 17:05, Eike Mueller wrote:
>>>> Dear dune-list,
>>> Hi Eike,
>>>> now my code runs on 6, 24, 96 and 384 processors. However, on 1536 cores it crashes with a (different) segmentation fault,
>>> is that with custom types for DDD_GID, DDD_PROC and such, or with the defaults?
>>>
>>> Is it possible to get a valgrind output? The crash is at program startup, so even a
>>> slow valgrind run shouldn't take too long.
>>>
>>> The crash is in the ug graphics implementation, which we don't use from Dune.
>>> So in the very worst case we could comment out a lot of that code and make it work
>>> that way. But I would prefer a proper fix.
>>>
>>> best,
>>> Oliver
>>>
>>>> when I inspect the core dump with
>>>>
>>>> gdb <executable> --core=<core>
>>>>
>>>> I get this backtrace:
>>>>
>>>> Core was generated by `/work/n02/n02/eike/Code/DUNEGeometricMultigrid/geometricmultigrid_nz128_r130_DE'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>> size=4, error=0x7fffffff41f0) at ppif.c:656
>>>> 656 ((MPIVChannel*)v)->p, ((MPIVChannel*)v)->chanid, COMM, req) )
>>>> (gdb) backtrace
>>>> #0 0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>> size=4, error=0x7fffffff41f0) at ppif.c:656
>>>> #1 0x000000000081ce2b in NumberOfDesc () at wop.c:21095
>>>> #2 0x000000000082b2db in UG::D2::InitWOP () at wop.c:24677
>>>> #3 0x00000000007fbb1d in UG::D2::InitUGGraph () at initgraph.c:90
>>>> #4 0x00000000007fbac5 in UG::D2::InitGraphics () at graphics.c:133
>>>> #5 0x0000000000715e4c in UG::D2::InitUg (argcp=0x7fffffff465c,
>>>> argvp=0x7fffffff4648) at ../initug.c:293
>>>> #6 0x000000000051bb8c in Dune::UG_NS<2>::InitUg (argcp=0x7fffffff465c,
>>>> argvp=0x7fffffff4648) at ../../../dune/grid/uggrid/ugwrapper.hh:910
>>>> #7 0x000000000052062a in Dune::UGGrid<3>::UGGrid (this=0x1603360)
>>>> at uggrid.cc:74
>>>> #8 0x000000000059332f in Dune::GridFactory<Dune::UGGrid<3> >::GridFactory (
>>>> this=0x7fffffff9cd8) at uggridfactory.cc:74
>>>> #9 0x000000000040faa2 in SphericalGridGenerator::SphericalGridGenerator (
>>>> this=0x7fffffff9cd0, filename=..., refcount=5)
>>>> at sphericalgridgenerator.hh:37
>>>> #10 0x00000000004010c8 in main (argc=2, argv=0x7fffffffb778)
>>>> at geometricmultigrid.cc:167
>>>>
>>>> I'm setting up a grid with one element on each core and then refining it 5 times, so the total grid size is 49152. I don't know whether this has any influence, but I set the default heap size to 8000.
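>>>>
>>>> For reference, the setup boils down to something like the following (only a sketch of the Dune calls involved; makeGrid is a made-up name, the coarse-grid insertion is omitted, and the real code lives in sphericalgridgenerator.hh):
>>>>
>>>> #include <dune/grid/uggrid.hh>
>>>> #include <dune/grid/common/gridfactory.hh>
>>>>
>>>> Dune::UGGrid<3>* makeGrid()
>>>> {
>>>>   Dune::UGGrid<3>::setDefaultHeapSize(8000);       // the heap size mentioned above
>>>>   Dune::GridFactory<Dune::UGGrid<3> > factory;
>>>>   // ... insert vertices and one coarse element per core here (omitted) ...
>>>>   Dune::UGGrid<3>* grid = factory.createGrid();
>>>>   grid->loadBalance();                             // distribute the coarse grid
>>>>   grid->globalRefine(5);                           // the 5 refinement steps
>>>>   return grid;
>>>> }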
>>>>
>>>> Any ideas about what might be going on here, or suggestions on how to proceed with debugging a code on this many cores, would be much appreciated. I'm now trying to get a better trace with the Cray ATP tool.
>>>>
>>>> Thank you very much,
>>>>
>>>> Eike
>>>>
>>>> On 30 Nov 2012, at 18:47, Eike Mueller wrote:
>>>>
>>>>> Hi Oliver,
>>>>>
>>>>> I just did some detective work and I think I know why it crashes and how it can be fixed.
>>>>>
>>>>> The problem is that in parallel/ddd/include/ddd.h the type DDD_PROC is defined as 'unsigned short', so it can only store up to 2^16 different processor IDs. In the subroutine IFCreateFromScratch(), where the segfault occurs, the variable lastproc, which is initialised to PROC_INVALID, is of this type. According to parallel/ddd/dddi.h, PROC_INVALID=(MAX_PROCS+1) and MAX_PROCS=(1<<MAX_PROCBITS_IN_GID), so for MAX_PROCBITS_IN_GID=16 the sentinel is 2^16+1=65537, which wraps to 1 when stored in an unsigned short. Investigating the core dump with gdb confirms that lastproc is indeed 1. On processor 1 the pointer ifHead then never gets initialised, so dereferencing it with ifHead->nItems++; causes the segfault.
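>>>>>
>>>>> To see the wrap-around in isolation, here is a minimal standalone check (not UG code; the names just mirror the DDD macros):
>>>>>
>>>>> #include <cstdio>
>>>>>
>>>>> int main()
>>>>> {
>>>>>   const unsigned int maxProcs = 1u << 16;        // MAX_PROCS for MAX_PROCBITS_IN_GID = 16
>>>>>   const unsigned short lastproc = maxProcs + 1;  // PROC_INVALID stored in a DDD_PROC ('unsigned short')
>>>>>   std::printf("lastproc = %hu\n", lastproc);     // prints 1, matching what gdb shows
>>>>>   return 0;
>>>>> }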
>>>>>
>>>>> I have now changed DDD_PROC to 'unsigned int' in parallel/ddd/include/ddd.h, and with that change the test program I sent you works fine when I run on 6 processors (I can also look at the .vtu files if I write them out in ascii format).
>>>>> I can run with up to 6 refinement steps, and if I go to 7 it runs out of memory with
>>>>>
>>>>> testsphericalgridgenerator: xfer/xfer.c:942: void UG::D3::ddd_XferRegisterDelete(UG::D3::DDD_HDR): Assertion `0' failed.
>>>>> DDD [000] FATAL 06060: out of memory during XferEnd()
>>>>> _pmiu_daemon(SIGCHLD): [NID 01842] [c11-0c1s6n0] [Fri Nov 30 18:35:27 2012] PE RANK 0 exit signal Aborted
>>>>> DDD [001] WARNING 02514: increased coupling table, now 131072 entries
>>>>> DDD [001] WARNING 02224: increased object table, now 131072 entries
>>>>> [NID 01842] 2012-11-30 18:35:27 Apid 3144197: initiated application termination
>>>>> Application 3144197 exit codes: 134
>>>>> Application 3144197 resources: utime ~24s, stime ~1s
>>>>>
>>>>> but I think this is a different problem, and in my case 6 refinement steps should actually be enough.
>>>>>
>>>>> Do you think the change I made makes sense and doesn't break anything else? If that is the case, what's the best way of setting up a patch? To be on the safe side, maybe something like this:
>>>>>
>>>>> #if (DDD_MAX_PROCBITS_IN_GID < 16)
>>>>> typedef unsigned short DDD_PROC;
>>>>> #else
>>>>> typedef unsigned int DDD_PROC;
>>>>> #endif // (DDD_MAX_PROCBITS_IN_GID < 16)
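>>>>>
>>>>> One could also put a compile-time guard next to the typedef, so that a mismatch is caught at build time instead of as a segfault; a sketch using the classic negative-array-size trick (the typedef name is made up, and it assumes MAX_PROCS is visible at that point):
>>>>>
>>>>> /* fails to compile if PROC_INVALID = MAX_PROCS+1 does not survive a round trip through DDD_PROC */
>>>>> typedef char ddd_proc_holds_proc_invalid[
>>>>>   ((DDD_PROC)(MAX_PROCS + 1) == (MAX_PROCS + 1)) ? 1 : -1];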
>>>>>
>>>>> I need to do some more testing to check that it all works for my full code, and I will keep you updated on any progress.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Eike
>>>>>
>>>>> Oliver Sander wrote:
>>>>>> Okay, with maxprocbits = 16 I do see a crash. 2 processors and 1 level is enough.
>>>>>> I'm in a bit of a hurry right now, but I'll have a look at it later.
>>>>>> best,
>>>>>> Oliver
>>>>>> On 29.11.2012 at 19:36, Eike Mueller wrote:
>>>>>>> Hi Oliver,
>>>>>>>
>>>>>>> thank you very much for trying this out and for the patch. I have now rebuilt UG with the latest patch file you sent me this morning and I still get the segfault. However, this only occurs if I specify the with_ddd_maxprocbits flag; if I do not set it (i.e. only use DDD_GID=long, as you do), then it runs fine. (I have only done a small run so far and haven't tried 7 refinement steps, so I cannot say anything about that other error you get.)
>>>>>>> I tried both with_ddd_maxprocbits=20 and 16, but it does not work in either case. The default is 2^9=512, and unfortunately that's not enough for me; I would need at least 2^16=65536.
>>>>>>> As for the error message you get when reading the .vtu files, could that be because they are written out in Dune::VTK::appendedbase64 format? I also get an error message when I open them in paraview
>>>>>>>
>>>>>>> ERROR: In /home/kitware/ParaView3/Utilities/BuildScripts/ParaView-3.6/ParaView3/VTK/IO/vtkXMLUnstructuredDataReader.cxx, line 522
>>>>>>> vtkXMLUnstructuredGridReader (0xa16c918): Cannot read points array from Points in piece 0. The data array in the element may be too short.
>>>>>>>
>>>>>>> When I opened .vtu files produced by my main solver code on HECToR, paraview actually crashed, and this was before I modified any of the UG settings. I could fix this by writing the data out in Dune::VTK::ascii format instead. Could this be a big/little endian issue? HECToR is 64bit, but my local desktop, where I run paraview to look at the output, is 32bit; I'm not sure if that has any impact.
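>>>>>>>
>>>>>>> (For reference, switching formats is just the OutputType argument of VTKWriter::write; a sketch, with GridView, gridView and solution as placeholders:)
>>>>>>>
>>>>>>> #include <dune/grid/io/file/vtk/vtkwriter.hh>
>>>>>>>
>>>>>>> Dune::VTKWriter<GridView> vtkWriter(gridView);  // gridView: the view the solver writes from
>>>>>>> vtkWriter.addVertexData(solution, "solution");  // hypothetical data field
>>>>>>> vtkWriter.write("output", Dune::VTK::ascii);    // instead of Dune::VTK::appendedbase64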
>>>>>>>
>>>>>>> Eike
>>>>>>>
>>>>>>> Oliver Sander wrote:
>>>>>>>> Hi Eike,
>>>>>>>> I tried your example with DDD_GID==long, on my laptop where sizeof(long)==8 and sizeof(uint)==4.
>>>>>>>> I started the program with
>>>>>>>>
>>>>>>>> mpirun -np 6 ./testsphericalgridgenerator sphericalshell_cube_6.dgf 6
>>>>>>>>
>>>>>>>> Besides a few DDD warnings that I have never seen before, it works like a charm.
>>>>>>>>
>>>>>>>> What version of UG are you using? I'll send you the very latest patch file just
>>>>>>>> to be sure.
>>>>>>>>
>>>>>>>> The program runs, but paraview gives me an error message when trying to open
>>>>>>>> the output file. Does that happen to you, too?
>>>>>>>>
>>>>>>>> I didn't try the with_ddd_maxprocbits setting. Does your program crash if you
>>>>>>>> do _not_ set this?
>>>>>>>>
>>>>>>>> For the worst case: would it be possible to get a temporary account on your HECToR machine?
>>>>>>>>
>>>>>>>> best,
>>>>>>>> Oliver
>>>>>>>>
>>>>>>>> On 28.11.2012 at 10:28, Eike Mueller wrote:
>>>>>>>>> Hi Oliver,
>>>>>>>>>
>>>>>>>>> thanks a lot, that would be great. My desktop is a 32bit machine, where sizeof(long) = sizeof(int) = 4, so I'm not sure if recompiling everything with GID=long there will make a difference.
>>>>>>>>>
>>>>>>>>> Eike
>>>>>>>>>
>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>> Thanks for the backtrace. I'll try to see whether I can reproduce the crash
>>>>>>>>>> on my machine. If that's not possible, things will be a bit difficult :-)
>>>>>>>>>> --
>>>>>>>>>> Oliver
>>>>>>>>>>
>>>>>>>>>> On 26.11.2012 at 18:39, Eike Mueller wrote:
>>>>>>>>>>> Hi Markus and Oliver,
>>>>>>>>>>>
>>>>>>>>>>> to get to the bottom of this I recompiled everything (UG + Dune + my code) with -O0 -g, and that way I was able to get some more information out of the core dump. On 1 processor it now runs fine, but when running on 6 I get the backtrace below; it looks like it crashes in loadBalance(), but I can't make sense of what's happening inside UG. It always seems to crash inside ifcreate.c, either at line 482 or 489:
>>>>>>>>>>>
>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>> ifId=1) at if/ifcreate.c:489
>>>>>>>>>>> 489 ifHead->nItems++;
>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>> ifId=1) at if/ifcreate.c:489
>>>>>>>>>>> #1 0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1059
>>>>>>>>>>> #2 0x0000000000a44e71 in UG::D3::IFAllFromScratch () at if/ifcreate.c:1097
>>>>>>>>>>> #3 0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:869
>>>>>>>>>>> #4 0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>> level=0) at trans.c:835
>>>>>>>>>>> #5 0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>> theMG=0x1441880) at lb.c:659
>>>>>>>>>>> #6 0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>> at commands.c:10658
>>>>>>>>>>> #7 0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=4,
>>>>>>>>>>> argv=0x7fffffffab90) at ../../../dune/grid/uggrid/ugwrapper.hh:979
>>>>>>>>>>> #8 0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>> strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>> at uggrid.cc:556
>>>>>>>>>>> #9 0x00000000004077bf in Dune::UGGrid<3>::loadBalance (this=0x134e490)
>>>>>>>>>>> at /home/n02/n02/eike/work/Library/Dune2.2/include/dune/grid/uggrid.hh:738
>>>>>>>>>>> #10 0x0000000000400928 in main (argc=3, argv=0x7fffffffb5a8)
>>>>>>>>>>> at testsphericalgridgenerator.cc:65
>>>>>>>>>>>
>>>>>>>>>>> but I sometimes also get:
>>>>>>>>>>>
>>>>>>>>>>> Core was generated by `./testsphericalgridgenerator sphericalshell_cube_6.dgf 4'.
>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>> ifId=1) at if/ifcreate.c:482
>>>>>>>>>>> 482 ifAttr->nAB = ifAttr->nBA = ifAttr->nABA = 0;
>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>> #0 0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>> ifId=1) at if/ifcreate.c:482
>>>>>>>>>>> #1 0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1049
>>>>>>>>>>> #2 0x0000000000a44e71 in UG::D3::IFRebuildAll () at if/ifcreate.c:1057
>>>>>>>>>>> #3 0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:850
>>>>>>>>>>> #4 0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>> level=0) at trans.c:824
>>>>>>>>>>> #5 0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>> theMG=0x1441880) at lb.c:644
>>>>>>>>>>> #6 0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>> at commands.c:10644
>>>>>>>>>>> #7 0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=0,
>>>>>>>>>>> argv=0x7fffffffa420) at ../../../dune/grid/uggrid/ugwrapper.hh:977
>>>>>>>>>>> #8 0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>> strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>> at uggrid.cc:554
>>>>>>>>>>> #9 0x00000000004077bf in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x134e490, __in_chrg=<optimized out>)
>>>>>>>>>>> at /opt/gcc/4.6.3/snos/include/g++/bits/shared_ptr_base.h:550
>>>>>>>>>>> #10 0x0000000000400928 in main (argc=0, argv=0x520)
>>>>>>>>>>> at testsphericalgridgenerator.cc:66
>>>>>>>>>>>
>>>>>>>>>>> I've tried different load balancing strategies, but I get a segfault with all of them.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Eike
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Markus Blatt wrote:
>>>>>>>>>>>> On Mon, Nov 26, 2012 at 01:57:15PM +0000, Eike Mueller wrote:
>>>>>>>>>>>>> thanks a lot for the patch, unfortunately I still get a segfault when I run on HECToR.
>>>>>>>>>>>>>
>>>>>>>>>>>> I feared that, but it was still worth a shot. The change probably
>>>>>>>>>>>> interferes with the memory allocation in ddd.
>>>>>>>>>>>>
>>>>>>>>>>>> Markus
>>>>> --
>>>>> Dr Eike Mueller
>>>>> Research Officer
>>>>>
>>>>> Department of Mathematical Sciences
>>>>> University of Bath
>>>>> Bath BA2 7AY, United Kingdom
>>>>>
>>>>> +44 1225 38 5633
>>>>> e.mueller at bath.ac.uk
>>>>> http://people.bath.ac.uk/em459/
>>>>>
------------------------------------------------------------
Peter Bastian
Interdisziplinäres Zentrum für Wissenschaftliches Rechnen
Universität Heidelberg
Im Neuenheimer Feld 368
D-69120 Heidelberg
Tel: 0049 (0) 6221 548261
Fax: 0049 (0) 6221 548884
email: peter.bastian at iwr.uni-heidelberg.de
web: http://conan.iwr.uni-heidelberg.de/people/peter/