[Dune] UG grid code: Segfault on large core counts

Oliver Sander sander at igpm.rwth-aachen.de
Fri Dec 7 08:44:42 CET 2012


Hi Eike,
can you try the attached patch?  It invalidates the last one.
good luck,
Oliver

Am 06.12.2012 17:57, schrieb Eike Mueller:
> Hi Oliver,
>
> thanks, I've now tried your patch, but I still get a segfault. I believe this is still a problem with the BTree. Here is the backtrace:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000b9a239 in UG::D3::XICopyObj_Compare (item1=0xffffffff5d4ed200,
>      item2=0x2aac5d4ed820) at xfer/supp.c:208
> 208		if (item1->dest<  item2->dest) return(-1);
> (gdb) backtrace
> #0  0x0000000000b9a239 in UG::D3::XICopyObj_Compare (item1=0xffffffff5d4ed200,
>      item2=0x2aac5d4ed820) at xfer/supp.c:208
> #1  0x0000000000b9865b in UG::D3::XICopyObjBTreeNode_Insert (
>      _oopp_this=0x2aac5d4eb920, item=0x2aac5d4ed820) at ./basic/ooppcc.h:697
> #2  0x0000000000b986cf in UG::D3::XICopyObjBTreeNode_Insert (
>      _oopp_this=0x2aac64f04920, item=0x2aac5d4ed820) at ./basic/ooppcc.h:721
> #3  0x0000000000b986cf in UG::D3::XICopyObjBTreeNode_Insert (
>      _oopp_this=0x2aac64423180, item=0x2aac5d4ed820) at ./basic/ooppcc.h:721
> #4  0x0000000000b986cf in UG::D3::XICopyObjBTreeNode_Insert (
>      _oopp_this=0x2aac6eaca3b0, item=0x2aac5d4ed820) at ./basic/ooppcc.h:721
> #5  0x0000000000b98b85 in UG::D3::XICopyObjBTree_Insert (_oopp_this=0x179a560,
>      item=0x2aac5d4ed820) at ./basic/ooppcc.h:902
> #6  0x0000000000b98e2b in UG::D3::XICopyObjSet_ItemOK (_oopp_this=0x17381a0)
>      at ./basic/ooppcc.h:1097
> #7  0x0000000000b851d9 in UG::D3::XferInitCopyInfo (hdr=0x2aab89588708,
>      desc=0x1483dd0, size=128, dest=487, prio=1) at xfer/cmds.c:1017
> #8  0x0000000000b852dc in UG::D3::DDD_XferCopyObj (hdr=0x2aab89588708,
>      proc=487, prio=1) at xfer/cmds.c:1133
> #9  0x0000000000cc46f3 in NodeXferCopy (
>      obj=0x2aab89587600 "\002\200o``\233\003", proc=487, prio=1)
>      at handler.c:1106
> #10 0x0000000000b8525d in UG::D3::XferInitCopyInfo (hdr=0x2aab89587610,
>      desc=0x1484880, size=104, dest=487, prio=1) at xfer/cmds.c:1047
> ---Type<return>  to continue, or q<return>  to quit---
> #11 0x0000000000b852dc in UG::D3::DDD_XferCopyObj (hdr=0x2aab89587610,
>      proc=487, prio=1) at xfer/cmds.c:1133
> #12 0x0000000000cc4b68 in ElementXferCopy (
>      obj=0x2aab895887f0 "\006\223}@\224\314\001", proc=487, prio=1)
>      at handler.c:1373
> #13 0x0000000000b8525d in UG::D3::XferInitCopyInfo (hdr=0x2aab89588808,
>      desc=0x1489e00, size=312, dest=487, prio=1) at xfer/cmds.c:1047
> #14 0x0000000000b853cf in UG::D3::DDD_XferCopyObjX (hdr=0x2aab89588808,
>      proc=487, prio=1, size=312) at xfer/cmds.c:1206
> #15 0x0000000000bb295c in XferGridWithOverlap (theGrid=0x2aab8023f0a8)
>      at trans.c:630
> #16 0x0000000000bb2d0d in UG::D3::TransferGridFromLevel (theMG=0x1a3a250,
>      level=0) at trans.c:831
> #17 0x0000000000bacb34 in UG::D3::lbs (argv=0x7fffffff3ce0 "0",
>      theMG=0x1a3a250) at lb.c:659
> #18 0x0000000000b6db93 in UG::D3::LBCommand (argc=4, argv=0x7fffffff44e0)
>      at commands.c:10658
> #19 0x000000000051bf07 in Dune::UG_NS<3>::LBCommand (argc=4,
>      argv=0x7fffffff44e0) at ../../../dune/grid/uggrid/ugwrapper.hh:979
> #20 0x0000000000522b7d in Dune::UGGrid<3>::loadBalance (this=0x1652de0,
>      strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>      at uggrid.cc:556
>
> #21 0x0000000000414ed3 in Dune::UGGrid<3>::loadBalance (this=0x1652de0)
> ---Type<return>  to continue, or q<return>  to quit---
>      at /home/n02/n02/eike/work/Library/Dune2.2_DEBUG/include/dune/grid/uggrid.hh:738
> #22 0x0000000000401110 in main (argc=2, argv=0x7fffffffb578)
>      at geometricmultigrid.cc:180
>
> Eike
>
> On 5 Dec 2012, at 15:22, Oliver Sander wrote:
>
>> Hi Eike,
>> please try the attached patch.  It may fix your BTree issue.
>> good luck,
>> Oliver
>>
>> Am 05.12.2012 13:32, schrieb Eike Mueller:
>>> Hi Oliver,
>>>
>>> thanks, it would be great if you could send me that patch for the BTree.
>>> This is probably an unrelated problem, but I now reran the same code and problem it with UG heap size = 16000. Now it crashes with a segfault:
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x0000000000b8be5c in UG::D3::XferInitCopyInfo (hdr=0x2aab8001e6b0,
>>>     desc=0x144b290, size=96, dest=938, prio=1) at xfer/cmds.c:1012
>>> 1012			xi->hdr  = hdr;
>>> (gdb) backtrace
>>> #0  0x0000000000b8be5c in UG::D3::XferInitCopyInfo (hdr=0x2aab8001e6b0,
>>>     desc=0x144b290, size=96, dest=938, prio=1) at xfer/cmds.c:1012
>>> #1  0x0000000000b8c088 in UG::D3::DDD_XferCopyObjX (hdr=0x2aab8001e6b0,
>>>     proc=938, prio=1, size=96) at xfer/cmds.c:1206
>>> #2  0x0000000000ccee14 in ElementXferCopy (obj=0x2aab8001e088 "\006\223]@\ny",
>>>     proc=938, prio=1) at handler.c:1442
>>> #3  0x0000000000b8bf16 in UG::D3::XferInitCopyInfo (hdr=0x2aab8001e0a0,
>>>     desc=0x1452820, size=312, dest=938, prio=1) at xfer/cmds.c:1047
>>> #4  0x0000000000b8c088 in UG::D3::DDD_XferCopyObjX (hdr=0x2aab8001e0a0,
>>>     proc=938, prio=1, size=312) at xfer/cmds.c:1206
>>> #5  0x0000000000bbb754 in XferGridWithOverlap (theGrid=0x2aab7d7e7d10)
>>>     at trans.c:630
>>> #6  0x0000000000bbbb5e in UG::D3::TransferGridFromLevel (theMG=0x1a005a0,
>>>     level=0) at trans.c:831
>>> #7  0x0000000000bb562f in UG::D3::lbs (argv=0x7fffffff3f00 "0",
>>>     theMG=0x1a005a0) at lb.c:659
>>> #8  0x0000000000b746d1 in UG::D3::LBCommand (argc=4, argv=0x7fffffff4700)
>>>     at commands.c:10658
>>> #9  0x000000000051bef7 in Dune::UG_NS<3>::LBCommand (argc=4,
>>>     argv=0x7fffffff4700) at ../../../dune/grid/uggrid/ugwrapper.hh:979
>>> #10 0x0000000000522b6d in Dune::UGGrid<3>::loadBalance (this=0x1619af0,
>>>     strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>     at uggrid.cc:556
>>> ---Type<return>   to continue, or q<return>   to quit---
>>> #11 0x0000000000414ec3 in Dune::UGGrid<3>::loadBalance (this=0x1619af0)
>>>     at /home/n02/n02/eike/work/Library/Dune2.2_DEBUG/include/dune/grid/uggrid.hh:738
>>> #12 0x0000000000401100 in main (argc=2, argv=0x7fffffffb7a8)
>>>     at geometricmultigrid.cc:180
>>>
>>> Eike
>>>
>>> On 5 Dec 2012, at 12:14, Oliver Sander wrote:
>>>
>>>> Hi,
>>>> that's funny.  Here in Aachen people used to work with DDD, and just a few
>>>> days ago a guy here sent me their version of DDD.  They changed very little,
>>>> but they did fix some bug in the BTree implementation.  Seems like this may
>>>> be the culprit here.  Unfortunately there is no revision history anymore.
>>>> I'll try to isolate the fix from the patch.
>>>> :-)
>>>> Oliver
>>>>
>>>> Am 05.12.2012 12:32, schrieb Eike Mueller:
>>>>> Dear Peter and Oliver,
>>>>>
>>>>> thanks a lot for your help. I now changed  WOP_DOWN_CHANNELS_MAX to 64 (which is enough for my case), but then I get
>>>>>
>>>>> geometricmultigrid_nz128_r130_DEBUG: ./basic/ooppcc.h:724: int UG::D3::XICopyObjBTreeNode_Insert(UG::D3::XICopyObjBTreeNode*, UG::D3::XICopyObj*): Assertion `new_r!=__null' failed
>>>>>
>>>>> The gdb backtrace is this:
>>>>>
>>>>> Program terminated with signal 6, Aborted.
>>>>> #0  0x0000000000d97f1b in raise (sig=<optimized out>)
>>>>>      at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>>>>> 42	../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
>>>>> 	in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
>>>>> (gdb) backtrace
>>>>> #0  0x0000000000d97f1b in raise (sig=<optimized out>)
>>>>>      at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>>>>> #1  0x0000000000db2f41 in abort () at abort.c:92
>>>>> #2  0x0000000000dadc17 in __assert_fail (assertion=0xf279ef "new_r!=__null",
>>>>>      file=0xf27980 "./basic/ooppcc.h", line=724,
>>>>>      function=0xf27d40 "int UG::D3::XICopyObjBTreeNode_Insert(UG::D3::XICopyObjBTreeNode*, UG::D3::XICopyObj*)") at assert.c:81
>>>>> #3  0x0000000000ba0637 in UG::D3::XICopyObjBTreeNode_Insert (
>>>>>      _oopp_this=0x2aac5d4ea920, item=0x2aac5d4ec7e8) at ./basic/ooppcc.h:724
>>>>> #4  0x0000000000ba05e8 in UG::D3::XICopyObjBTreeNode_Insert (
>>>>>      _oopp_this=0x2aac64f03920, item=0x2aac5d4ec7e8) at ./basic/ooppcc.h:717
>>>>> #5  0x0000000000ba05e8 in UG::D3::XICopyObjBTreeNode_Insert (
>>>>>      _oopp_this=0x2aac64422180, item=0x2aac5d4ec7e8) at ./basic/ooppcc.h:717
>>>>> #6  0x0000000000ba05e8 in UG::D3::XICopyObjBTreeNode_Insert (
>>>>>      _oopp_this=0x2aac6eac93b0, item=0x2aac5d4ec7e8) at ./basic/ooppcc.h:717
>>>>> #7  0x0000000000ba0ae3 in UG::D3::XICopyObjBTree_Insert (_oopp_this=0x17589f0,
>>>>>      item=0x2aac5d4ec7e8) at ./basic/ooppcc.h:898
>>>>> #8  0x0000000000ba0e1d in UG::D3::XICopyObjSet_ItemOK (_oopp_this=0x16fed40)
>>>>>      at ./basic/ooppcc.h:1093
>>>>> #9  0x0000000000b8be92 in UG::D3::XferInitCopyInfo (hdr=0x2aab89586610,
>>>>>      desc=0x144d2a0, size=104, dest=487, prio=1) at xfer/cmds.c:1017
>>>>> #10 0x0000000000b8bf95 in UG::D3::DDD_XferCopyObj (hdr=0x2aab89586610,
>>>>>      proc=487, prio=1) at xfer/cmds.c:1133
>>>>> ---Type<return>    to continue, or q<return>    to quit---
>>>>> #11 0x0000000000cceb58 in ElementXferCopy (
>>>>>      obj=0x2aab895877f0 "\006\223}@\224\314\001", proc=487, prio=1)
>>>>>      at handler.c:1373
>>>>> #12 0x0000000000b8bf16 in UG::D3::XferInitCopyInfo (hdr=0x2aab89587808,
>>>>>      desc=0x1452820, size=312, dest=487, prio=1) at xfer/cmds.c:1047
>>>>> #13 0x0000000000b8c088 in UG::D3::DDD_XferCopyObjX (hdr=0x2aab89587808,
>>>>>      proc=487, prio=1, size=312) at xfer/cmds.c:1206
>>>>> #14 0x0000000000bbb754 in XferGridWithOverlap (theGrid=0x2aab8023e0a8)
>>>>>      at trans.c:630
>>>>> #15 0x0000000000bbbb5e in UG::D3::TransferGridFromLevel (theMG=0x19fe680,
>>>>>      level=0) at trans.c:831
>>>>> #16 0x0000000000bb562f in UG::D3::lbs (argv=0x7fffffff3ee0 "0",
>>>>>      theMG=0x19fe680) at lb.c:659
>>>>> #17 0x0000000000b746d1 in UG::D3::LBCommand (argc=4, argv=0x7fffffff46e0)
>>>>>      at commands.c:10658
>>>>> #18 0x000000000051bef7 in Dune::UG_NS<3>::LBCommand (argc=4,
>>>>>      argv=0x7fffffff46e0) at ../../../dune/grid/uggrid/ugwrapper.hh:979
>>>>> #19 0x0000000000522b6d in Dune::UGGrid<3>::loadBalance (this=0x1619970,
>>>>>      strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>      at uggrid.cc:556
>>>>> #20 0x0000000000414ec3 in Dune::UGGrid<3>::loadBalance (this=0x1619970)
>>>>>      at /home/n02/n02/eike/work/Library/Dune2.2_DEBUG/include/dune/grid/uggrid.hh:738
>>>>> ---Type<return>    to continue, or q<return>    to quit---
>>>>> #21 0x0000000000401100 in main (argc=2, argv=0x7fffffffb778)
>>>>>      at geometricmultigrid.cc:180
>>>>>
>>>>> I have no idea what it means and it's just a wild guess, but could Set_BTreeOrder and BTreeOrder be of the wrong size? They are set to 32.
>>>>>
>>>>> Eike
>>>>>
>>>>> On 5 Dec 2012, at 10:20, Oliver Sander wrote:
>>>>>
>>>>>> Am 05.12.2012 10:48, schrieb Peter Bastian:
>>>>>>> Hi Oliver,
>>>>>>>
>>>>>>> you ask questions… The parallel UG graphics used some clever
>>>>>>> way to do depth ordering by sorting the coarse mesh in
>>>>>>> view direction and then "interpolating" this ordering in the
>>>>>>> grid hierarchy. I do not know whether it has something to
>>>>>>> do with that. Obviously some tree is constructed on
>>>>>>> the processor graph and the maximum degree is estimated
>>>>>>> using the strange formula. I really cannot say more as it was
>>>>>>> all Klaus Johannsens work.
>>>>>> Hi Peter,
>>>>>> okay, that explains why the file is 20k lines long (you guys were tough!).
>>>>>>
>>>>>> I then propose to make WOP_DOWN_CHANNELS_MAX very large(tm),
>>>>>> and forget about it.  Since the number of channels computed in line 21023
>>>>>> is roughly the square root of the processor number I guess a value
>>>>>> of about 2000 should be okay for the nearer future.
>>>>>>
>>>>>> best,
>>>>>> Oliver
>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> Am 05.12.2012 um 09:46 schrieb Oliver Sander<sander at igpm.rwth-aachen.de>:
>>>>>>>
>>>>>>>> Am 04.12.2012 20:10, schrieb Peter Bastian:
>>>>>>>>> Dear Eike,
>>>>>>>>>
>>>>>>>>> a quick search through the code suggests that
>>>>>>>>> the constant
>>>>>>>>> #define WOP_DOWN_CHANNELS_MAX           32
>>>>>>>>> in ug/graphics/uggraph/wop.h line 146 is too small.
>>>>>>>>>
>>>>>>>>> In wop.c line 21023
>>>>>>>>>
>>>>>>>>> 	WopDownChannels = (INT)ceil(0.5*(sqrt(4.0*(DOUBLE)procs-3.0)-1.0));
>>>>>>>>>
>>>>>>>>> evaluates to 39 which fairly close above 32. Then out of bound access in many
>>>>>>>>> arrays is produced. Maybe it is as simple as that.
>>>>>>>> Hi Peter,
>>>>>>>> I came to the same conclusion.  Can you give us an explanation of what
>>>>>>>> these WopDownChannels are?
>>>>>>>> best,
>>>>>>>> Oliver
>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 04.12.2012 um 18:39 schrieb Oliver Sander<sander at igpm.rwth-aachen.de>:
>>>>>>>>>
>>>>>>>>>> Am 04.12.2012 17:05, schrieb Eike Mueller:
>>>>>>>>>>> Dear dune-list,
>>>>>>>>>> Hi Eike,
>>>>>>>>>>> now my code runs on 6, 24, 96 and 384 processors. However, on 1536 cores it crashes with a (different) segmentation faullt,
>>>>>>>>>> is that with custom types for DDD_GID, DDD_PROC and such, or with the defaults?
>>>>>>>>>>
>>>>>>>>>> Is it possible to get a valgrind output?  The crash is at program startup, so even a
>>>>>>>>>> slow valgrind run shouldn't take too long.
>>>>>>>>>>
>>>>>>>>>> The crash is in the ug graphics implementation, which we don't use from Dune.
>>>>>>>>>> So in the very worst case we could out-comment lots of stuff and make it work
>>>>>>>>>> that way.  But I prefer a proper fix.
>>>>>>>>>>
>>>>>>>>>> best,
>>>>>>>>>> Oliver
>>>>>>>>>>
>>>>>>>>>>> when I inspect the core dump with
>>>>>>>>>>>
>>>>>>>>>>> gdb<executable>       --core=<core>
>>>>>>>>>>>
>>>>>>>>>>> I get this backtrace:
>>>>>>>>>>>
>>>>>>>>>>> Core was generated by `/work/n02/n02/eike/Code/DUNEGeometricMultigrid/geometricmultigrid_nz128_r130_DE'.
>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>> #0  0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>>>>>>>>>      size=4, error=0x7fffffff41f0) at ppif.c:656
>>>>>>>>>>> 656	                                     ((MPIVChannel*)v)->p, ((MPIVChannel*)v)->chanid, COMM, req) )
>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>> #0  0x0000000000898d7a in PPIF::RecvASync (v=0x100000001, data=0x13b80a0,
>>>>>>>>>>>      size=4, error=0x7fffffff41f0) at ppif.c:656
>>>>>>>>>>> #1  0x000000000081ce2b in NumberOfDesc () at wop.c:21095
>>>>>>>>>>> #2  0x000000000082b2db in UG::D2::InitWOP () at wop.c:24677
>>>>>>>>>>> #3  0x00000000007fbb1d in UG::D2::InitUGGraph () at initgraph.c:90
>>>>>>>>>>> #4  0x00000000007fbac5 in UG::D2::InitGraphics () at graphics.c:133
>>>>>>>>>>> #5  0x0000000000715e4c in UG::D2::InitUg (argcp=0x7fffffff465c,
>>>>>>>>>>>      argvp=0x7fffffff4648) at ../initug.c:293
>>>>>>>>>>> #6  0x000000000051bb8c in Dune::UG_NS<2>::InitUg (argcp=0x7fffffff465c,
>>>>>>>>>>>      argvp=0x7fffffff4648) at ../../../dune/grid/uggrid/ugwrapper.hh:910
>>>>>>>>>>> #7  0x000000000052062a in Dune::UGGrid<3>::UGGrid (this=0x1603360)
>>>>>>>>>>>      at uggrid.cc:74
>>>>>>>>>>> #8  0x000000000059332f in Dune::GridFactory<Dune::UGGrid<3>       >::GridFactory (
>>>>>>>>>>>      this=0x7fffffff9cd8) at uggridfactory.cc:74
>>>>>>>>>>> #9  0x000000000040faa2 in SphericalGridGenerator::SphericalGridGenerator (
>>>>>>>>>>>      this=0x7fffffff9cd0, filename=..., refcount=5)
>>>>>>>>>>>      at sphericalgridgenerator.hh:37
>>>>>>>>>>> #10 0x00000000004010c8 in main (argc=2, argv=0x7fffffffb778)
>>>>>>>>>>>      at geometricmultigrid.cc:167
>>>>>>>>>>>
>>>>>>>>>>> I'm setting up a grid with one element on each core, and then refine it 5 times, so the total grid size is 49152. I don't know whether this has any influence, but I set the default heap size to 8000.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas what might be going on here, or just suggestions on how to proceed with debugging a code with this large number of cores would be much appreciated. I'm now trying to get a better trace with the Cray ATP tool.
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much,
>>>>>>>>>>>
>>>>>>>>>>> Eike
>>>>>>>>>>>
>>>>>>>>>>> On 30 Nov 2012, at 18:47, Eike Mueller wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>>
>>>>>>>>>>>> I just did some detective work and I think I know why it crashes and how it can be fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is that in parallel/ddd/include/ddd.h the type DDD_PROC is defined as 'unsigned short', so can only store up to 2^16 different processor IDs. In the subroutine IFCreateFromScratch(), where the segfault occurs, the variable lastproc, which is initialised to PROC_INVALID is of this type. According to parallel/ddd/dddi.h, PROC_INVALID=(MAX_PROCS+1) and MAX_PROCS=(1<<MAX_PROCBITS_IN_GID). Investigating the core dump with gdb, lastproc is indeed 1, at least for MAX_PROCBITS_IN_GID=16 and then on processor 1 the pointer ifHead never gets initialised, so dereferencing it with ifHead->nItems++; causes the segfault.
>>>>>>>>>>>>
>>>>>>>>>>>> I now changed DDD_PROC to 'unsigned int' in parallel/ddd/include/ddd.h and then the test program I sent you works fine if I run on 6 processors (I can also look at the .vtu files if I write them out in ascii format).
>>>>>>>>>>>> I can run with up to 6 refinement steps, and if I go to 7 it runs out of memory with
>>>>>>>>>>>>
>>>>>>>>>>>> testsphericalgridgenerator: xfer/xfer.c:942: void UG::D3::ddd_XferRegisterDelete(UG::D3::DDD_HDR): Assertion `0' failed.
>>>>>>>>>>>> DDD [000] FATAL 06060: out of memory during XferEnd()
>>>>>>>>>>>> _pmiu_daemon(SIGCHLD): [NID 01842] [c11-0c1s6n0] [Fri Nov 30 18:35:27 2012] PE RANK 0 exit signal Aborted
>>>>>>>>>>>> DDD [001] WARNING 02514: increased coupling table, now 131072 entries
>>>>>>>>>>>> DDD [001] WARNING 02224: increased object table, now 131072 entries
>>>>>>>>>>>> [NID 01842] 2012-11-30 18:35:27 Apid 3144197: initiated application termination
>>>>>>>>>>>> Application 3144197 exit codes: 134
>>>>>>>>>>>> Application 3144197 resources: utime ~24s, stime ~1s
>>>>>>>>>>>>
>>>>>>>>>>>> but I think this is another problem, and in my case 6 should actually be enough.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you think the change I made makes sense and doesn't break anything else? If that is the case, what's the best way of setting up a patch? To be on the safe side, maybe something like this:
>>>>>>>>>>>>
>>>>>>>>>>>> #if (DDD_MAX_PROCBITS_IN_GID<       16)
>>>>>>>>>>>>   typedef unsigned short   DDD_PROC;
>>>>>>>>>>>> #else
>>>>>>>>>>>>   typedef unsigned int   DDD_PROC;
>>>>>>>>>>>> #endif // (DDD_MAX_PROCBITS_IN_GID<       16)
>>>>>>>>>>>>
>>>>>>>>>>>> I need to do some more testing to check that it all works for my full code, and I will keep you updated on any progress.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Eike
>>>>>>>>>>>>
>>>>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>>>>> Okay, with maxprocbits = 16 I do see a crash.  2 processors and 1 level is enough.
>>>>>>>>>>>>> I'm in a bit of a hurry right now, but I'll have a look at it later.
>>>>>>>>>>>>> best,
>>>>>>>>>>>>> Oliver
>>>>>>>>>>>>> Am 29.11.2012 19:36, schrieb Eike Mueller:
>>>>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thank you very much for trying this out and for the patch. I now rebuilt UG with the latest patch file you sent me this morning and I still get the segfault. However, this only occurs if I specify the with_ddd_maxprocbits flag, if I do not set this (i.e. only use DDD_GID=long, as you do), then it runs fine (I have only done a small run so far, and haven't tried 7 refinement steps, so can not say anything about that other error you get).
>>>>>>>>>>>>>> I tried both with_ddd_maxprocbits=20 and 16, but it does not work in any of these cases. The default is 2^9=512, and unfortunately that's not enough for me, I would need at least 2^16=65536.
>>>>>>>>>>>>>> As for the error message you get when reading the .vtu files, could that be because they are written out in Dune::VTK::appendedbase64 format? I also get an error message when I open them in paraview
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ERROR: In /home/kitware/ParaView3/Utilities/BuildScripts/ParaView-3.6/ParaView3/VTK/IO/vtkXMLUnstructuredDataReader.cxx, line 522
>>>>>>>>>>>>>> vtkXMLUnstructuredGridReader (0xa16c918): Cannot read points array from Points in piece 0.  The data array in the element may be too short.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I opened .vtu files produced by my main solver code on HECToR, paraview actually crashed, and this was before I modified any of the UG settings. I could fix this by writing data out in Dune::VTK::ascii format instead. Could this be a big/little endian issue? HECToR is 64bit, but my local desktop, where I run paraview to look at the output is 32bit, not sure if that has any impact.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eike
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>>>>>>> Hi Eike,
>>>>>>>>>>>>>>> I tried your example with DDD_GID==long, on my laptop where sizeof(long)==8 and sizeof(uint)==4.
>>>>>>>>>>>>>>> I started the program with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> mpirun -np 6 ./testsphericalgridgenerator sphericalshell_cube_6.dgf 6
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Besides a few DDD warnings that I have never seen before, it works like a charm.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What version of UG are you using?  I'll send you the very latest patch file just
>>>>>>>>>>>>>>> to be sure.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The programs runs, but paraview gives me an error message when trying to open
>>>>>>>>>>>>>>> the output file.  Does that happen to you, too?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I didn't try the with_ddd_maxprocbits setting.  Does your program crash if you
>>>>>>>>>>>>>>> do _not_ set this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For the worst case: is it possible to get a temporary account on your hector computer?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>> Oliver
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 28.11.2012 10:28, schrieb Eike Mueller:
>>>>>>>>>>>>>>>> Hi Oliver,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thanks a lot, that would be great. My desktop is a 32bit machine, where sizeof(long) = sizeof(int) = 4, so I'm not sure if recompiling everything with GID=long there will make a difference.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Eike
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Oliver Sander wrote:
>>>>>>>>>>>>>>>>> Thanks for the backtrace.  I'll try and see whether I can reproduce the crash
>>>>>>>>>>>>>>>>> on my machine. If that's not possible things will be a bit difficult :-)
>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>> Oliver
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am 26.11.2012 18:39, schrieb Eike Mueller:
>>>>>>>>>>>>>>>>>> Hi Markus and Oliver,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> to get to the bottom of this I recompiled everything (UG+Dune+my code) with -O0 -g, and that way I was able to get some more information out of the core dump. On 1 processor it runs fine now, but when running on 6, this is what I get, looks like it crashes in loadBalance(), but I can't make sense of what's happening inside UG. It always seems to crash inside ifcreate.c, either in line 482 or 489:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>>>>>>>     ifId=1) at if/ifcreate.c:489
>>>>>>>>>>>>>>>>>> 489                     ifHead->nItems++;
>>>>>>>>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>>>>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>>>>>>>     ifId=1) at if/ifcreate.c:489
>>>>>>>>>>>>>>>>>> #1  0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1059
>>>>>>>>>>>>>>>>>> #2  0x0000000000a44e71 in UG::D3::IFAllFromScratch () at if/ifcreate.c:1097
>>>>>>>>>>>>>>>>>> #3  0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:869
>>>>>>>>>>>>>>>>>> #4  0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>>>>>>>>>     level=0) at trans.c:835
>>>>>>>>>>>>>>>>>> #5  0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>>>>>>>>>     theMG=0x1441880) at lb.c:659
>>>>>>>>>>>>>>>>>> #6  0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>>>>>>>>>     at commands.c:10658
>>>>>>>>>>>>>>>>>> #7  0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=4,
>>>>>>>>>>>>>>>>>>     argv=0x7fffffffab90) at ../../../dune/grid/uggrid/ugwrapper.hh:979
>>>>>>>>>>>>>>>>>> #8  0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>>>>>>>>>     strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>>>>>>>>>     at uggrid.cc:556
>>>>>>>>>>>>>>>>>> #9  0x00000000004077bf in Dune::UGGrid<3>::loadBalance (this=0x134e490)
>>>>>>>>>>>>>>>>>>     at /home/n02/n02/eike/work/Library/Dune2.2/include/dune/grid/uggrid.hh:738
>>>>>>>>>>>>>>>>>> #10 0x0000000000400928 in main (argc=3, argv=0x7fffffffb5a8)
>>>>>>>>>>>>>>>>>>     at testsphericalgridgenerator.cc:65
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> but I sometimes also get:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Core was generated by `./testsphericalgridgenerator sphericalshell_cube_6.dgf 4'.
>>>>>>>>>>>>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>>>>>>>>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>>>>>>>     ifId=1) at if/ifcreate.c:482
>>>>>>>>>>>>>>>>>> 482                             ifAttr->nAB    = ifAttr->nBA   = ifAttr->nABA   = 0;
>>>>>>>>>>>>>>>>>> (gdb) backtrace
>>>>>>>>>>>>>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch (tmpcpl=0x1462ad0,
>>>>>>>>>>>>>>>>>>     ifId=1) at if/ifcreate.c:482
>>>>>>>>>>>>>>>>>> #1  0x0000000000a44dae in UG::D3::IFRebuildAll () at if/ifcreate.c:1049
>>>>>>>>>>>>>>>>>> #2  0x0000000000a44e71 in UG::D3::IFRebuildAll () at if/ifcreate.c:1057
>>>>>>>>>>>>>>>>>> #3  0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:850
>>>>>>>>>>>>>>>>>> #4  0x0000000000a65b5c in UG::D3::TransferGridFromLevel (theMG=0x1441880,
>>>>>>>>>>>>>>>>>>     level=0) at trans.c:824
>>>>>>>>>>>>>>>>>> #5  0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>>>>>>>>>>>>>     theMG=0x1441880) at lb.c:644
>>>>>>>>>>>>>>>>>> #6  0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, argv=0x7fffffffab90)
>>>>>>>>>>>>>>>>>>     at commands.c:10644
>>>>>>>>>>>>>>>>>> #7  0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=0,
>>>>>>>>>>>>>>>>>>     argv=0x7fffffffa420) at ../../../dune/grid/uggrid/ugwrapper.hh:977
>>>>>>>>>>>>>>>>>> #8  0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance (this=0x134e490,
>>>>>>>>>>>>>>>>>>     strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>>>>>>>>>>>>>     at uggrid.cc:554
>>>>>>>>>>>>>>>>>> #9  0x00000000004077bf in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x134e490, __in_chrg=<optimized out>)
>>>>>>>>>>>>>>>>>>     at /opt/gcc/4.6.3/snos/include/g++/bits/shared_ptr_base.h:550
>>>>>>>>>>>>>>>>>> #10 0x0000000000400928 in main (argc=0, argv=0x520)
>>>>>>>>>>>>>>>>>>     at testsphericalgridgenerator.cc:66
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've tried different load balancing strategies, but for all I get a segfault.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Eike
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Markus Blatt wrote:
>>>>>>>>>>>>>>>>>>> On Mon, Nov 26, 2012 at 01:57:15PM +0000, Eike Mueller wrote:
>>>>>>>>>>>>>>>>>>>> thanks a lot for the patch, unfortunately I still get a segfault when I run on HECToR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I feared that, but it was still worth a shot. The change probably
>>>>>>>>>>>>>>>>>>> interferes with the memory allocation in ddd.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Markus
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Dr Eike Mueller
>>>>>>>>>>>> Research Officer
>>>>>>>>>>>>
>>>>>>>>>>>> Department of Mathematical Sciences
>>>>>>>>>>>> University of Bath
>>>>>>>>>>>> Bath BA2 7AY, United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>> +44 1225 38 5633
>>>>>>>>>>>> e.mueller at bath.ac.uk
>>>>>>>>>>>> http://people.bath.ac.uk/em459/
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Dune mailing list
>>>>>>>>>>>> Dune at dune-project.org
>>>>>>>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Dune mailing list
>>>>>>>>>> Dune at dune-project.org
>>>>>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>>>>>>>> ------------------------------------------------------------
>>>>>>>>> Peter Bastian
>>>>>>>>> Interdisziplinäres Zentrum für Wissenschaftliches Rechnen
>>>>>>>>> Universität Heidelberg
>>>>>>>>> Im Neuenheimer Feld 368
>>>>>>>>> D-69120 Heidelberg
>>>>>>>>> Tel: 0049 (0) 6221 548261
>>>>>>>>> Fax: 0049 (0) 6221 548884
>>>>>>>>> email: peter.bastian at iwr.uni-heidelberg.de
>>>>>>>>> web: http://conan.iwr.uni-heidelberg.de/people/peter/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Dune mailing list
>>>>>>>>> Dune at dune-project.org
>>>>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>>>>>>> _______________________________________________
>>>>>>>> Dune mailing list
>>>>>>>> Dune at dune-project.org
>>>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>>>>>> ------------------------------------------------------------
>>>>>>> Peter Bastian
>>>>>>> Interdisziplinäres Zentrum für Wissenschaftliches Rechnen
>>>>>>> Universität Heidelberg
>>>>>>> Im Neuenheimer Feld 368
>>>>>>> D-69120 Heidelberg
>>>>>>> Tel: 0049 (0) 6221 548261
>>>>>>> Fax: 0049 (0) 6221 548884
>>>>>>> email: peter.bastian at iwr.uni-heidelberg.de
>>>>>>> web: http://conan.iwr.uni-heidelberg.de/people/peter/
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Dune mailing list
>>>>>> Dune at dune-project.org
>>>>>> http://lists.dune-project.org/mailman/listinfo/dune
>>> _______________________________________________
>>> Dune mailing list
>>> Dune at dune-project.org
>>> http://lists.dune-project.org/mailman/listinfo/dune
>> <ooppcc.patch>_______________________________________________
>> Dune mailing list
>> Dune at dune-project.org
>> http://lists.dune-project.org/mailman/listinfo/dune

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ooppcc_second_attempt.patch
Type: text/x-patch
Size: 863 bytes
Desc: not available
URL: <https://lists.dune-project.org/pipermail/dune/attachments/20121207/2bc545a7/attachment-0001.bin>


More information about the Dune mailing list