[Dune] UG grid problem: ID overflow DDD_HdrConstructor

Eike Mueller E.Mueller at bath.ac.uk
Fri Nov 30 19:47:28 CET 2012


Hi Oliver,

I just did some detective work and I think I know why it crashes and how it can be fixed.

The problem is that in parallel/ddd/include/ddd.h the type DDD_PROC is defined as 'unsigned short', so it can only store up to 2^16 
different processor IDs. In the subroutine IFCreateFromScratch(), where the segfault occurs, the variable lastproc is of this type 
and is initialised to PROC_INVALID. According to parallel/ddd/dddi.h, PROC_INVALID=(MAX_PROCS+1) and 
MAX_PROCS=(1<<MAX_PROCBITS_IN_GID), so for MAX_PROCBITS_IN_GID=16 we get PROC_INVALID=65537, which does not fit into 16 bits 
and wraps around to 1. Inspecting the core dump with gdb confirms that lastproc is indeed 1. As a result, on processor 1 the 
pointer ifHead never gets initialised, so dereferencing it with ifHead->nItems++; causes the segfault.
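
To illustrate the wrap-around, here is a minimal standalone sketch (not UG code). The macros are copied by hand from the 
definitions quoted above, the function names are made up for this example, and it assumes a 16-bit unsigned short and an int 
of at least 32 bits:

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hand-copied stand-ins for the definitions in dddi.h (maxprocbits = 16). */
#define MAX_PROCBITS_IN_GID 16
#define MAX_PROCS    (1 << MAX_PROCBITS_IN_GID)   /* 65536 */
#define PROC_INVALID (MAX_PROCS + 1)              /* 65537 */

/* Hypothetical helper: the sentinel stored in the old, narrow type. */
unsigned short invalid_as_short(void)
{
    /* This sketch assumes a 16-bit unsigned short. */
    assert(USHRT_MAX == 65535);
    return (unsigned short)PROC_INVALID;   /* reduced modulo 2^16 */
}

/* Hypothetical helper: the sentinel stored in the wider type. */
unsigned int invalid_as_int(void)
{
    return (unsigned int)PROC_INVALID;
}

/* Print both values for comparison. */
void print_invalid(void)
{
    printf("unsigned short: %u\n", (unsigned int)invalid_as_short());
    printf("unsigned int:   %u\n", invalid_as_int());
}
```

Calling print_invalid() on such a platform reports 1 for 'unsigned short' and 65537 for 'unsigned int', which matches the 
value of lastproc seen in gdb.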

I have now changed DDD_PROC to 'unsigned int' in parallel/ddd/include/ddd.h, and with that change the test program I sent you 
works fine when I run on 6 processors (I can also look at the .vtu files if I write them out in ascii format).
I can run with up to 6 refinement steps; if I go to 7, it runs out of memory with

testsphericalgridgenerator: xfer/xfer.c:942: void UG::D3::ddd_XferRegisterDelete(UG::D3::DDD_HDR): Assertion `0' failed.
DDD [000] FATAL 06060: out of memory during XferEnd()
_pmiu_daemon(SIGCHLD): [NID 01842] [c11-0c1s6n0] [Fri Nov 30 18:35:27 2012] PE RANK 0 exit signal Aborted
DDD [001] WARNING 02514: increased coupling table, now 131072 entries
DDD [001] WARNING 02224: increased object table, now 131072 entries
[NID 01842] 2012-11-30 18:35:27 Apid 3144197: initiated application termination
Application 3144197 exit codes: 134
Application 3144197 resources: utime ~24s, stime ~1s

but I think this is another problem, and in my case 6 should actually be enough.

Do you think the change I made makes sense and doesn't break anything else? If that is the case, what's the best way of setting 
up a patch? To be on the safe side, maybe something like this:

#if (DDD_MAX_PROCBITS_IN_GID < 16)
   typedef unsigned short   DDD_PROC;
#else
   typedef unsigned int   DDD_PROC;
#endif // (DDD_MAX_PROCBITS_IN_GID < 16)
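
On top of the #if above, a compile-time check could catch any future mismatch between the type and the sentinel value. This is 
only a sketch under assumptions: the fallback value of DDD_MAX_PROCBITS_IN_GID, the use of 1UL in MAX_PROCS (the original uses 
a plain 1) and the names ddd_proc_holds_invalid and check_proc_invalid are mine and do not exist in UG:

```c
#include <assert.h>

/* Assumed fallback so that the sketch compiles on its own. */
#ifndef DDD_MAX_PROCBITS_IN_GID
#define DDD_MAX_PROCBITS_IN_GID 16
#endif

#if (DDD_MAX_PROCBITS_IN_GID < 16)
typedef unsigned short DDD_PROC;
#else
typedef unsigned int DDD_PROC;
#endif

/* Same formulas as in dddi.h, but with 1UL (my assumption). */
#define MAX_PROCS    (1UL << DDD_MAX_PROCBITS_IN_GID)
#define PROC_INVALID (MAX_PROCS + 1)

/* Compile-time check: the array size becomes -1, and compilation fails,
   if PROC_INVALID exceeds the largest value DDD_PROC can hold. */
typedef char ddd_proc_holds_invalid[(PROC_INVALID <= (DDD_PROC)-1) ? 1 : -1];

/* Runtime counterpart: the sentinel must survive a round trip. */
void check_proc_invalid(void)
{
    DDD_PROC p = (DDD_PROC)PROC_INVALID;
    assert((unsigned long)p == PROC_INVALID);
}
```

With 16 bits and 'unsigned short' the typedef line would refuse to compile, so the problem would show up at build time rather 
than as a segfault in IFCreateFromScratch().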

I need to do some more testing to check that it all works for my full code, and I will keep you updated on any progress.

Thanks,

Eike

Oliver Sander wrote:
> Okay, with maxprocbits = 16 I do see a crash.  2 processors and 1 level 
> is enough.
> 
> I'm in a bit of a hurry right now, but I'll have a look at it later.
> 
> best,
> Oliver
> 
> Am 29.11.2012 19:36, schrieb Eike Mueller:
>> Hi Oliver,
>>
>> thank you very much for trying this out and for the patch. I have now 
>> rebuilt UG with the latest patch file you sent me this morning and I 
>> still get the segfault. However, this only occurs if I specify the 
>> with_ddd_maxprocbits flag. If I do not set it (i.e. only use 
>> DDD_GID=long, as you do), it runs fine (I have only done a small 
>> run so far and haven't tried 7 refinement steps, so I cannot say 
>> anything about the other error you get).
>> I tried both with_ddd_maxprocbits=20 and 16, but it does not work in 
>> either case. The default is 2^9=512, and unfortunately that's 
>> not enough for me; I would need at least 2^16=65536.
>> As for the error message you get when reading the .vtu files, could 
>> that be because they are written out in Dune::VTK::appendedbase64 
>> format? I also get an error message when I open them in paraview
>>
>> ERROR: In 
>> /home/kitware/ParaView3/Utilities/BuildScripts/ParaView-3.6/ParaView3/VTK/IO/vtkXMLUnstructuredDataReader.cxx, 
>> line 522
>> vtkXMLUnstructuredGridReader (0xa16c918): Cannot read points array 
>> from Points in piece 0.  The data array in the element may be too short.
>>
>> When I opened .vtu files produced by my main solver code on HECToR, 
>> paraview actually crashed, and this was before I modified any of the 
>> UG settings. I could fix this by writing data out in Dune::VTK::ascii 
>> format instead. Could this be a big/little endian issue? HECToR is 
>> 64bit, but my local desktop, where I run paraview to look at the 
>> output is 32bit, not sure if that has any impact.
>>
>> Eike
>>
>> Oliver Sander wrote:
>>> Hi Eike,
>>> I tried your example with DDD_GID==long, on my laptop where 
>>> sizeof(long)==8 and sizeof(uint)==4.
>>> I started the program with
>>>
>>> mpirun -np 6 ./testsphericalgridgenerator sphericalshell_cube_6.dgf 6
>>>
>>> Besides a few DDD warnings that I have never seen before, it works 
>>> like a charm.
>>>
>>> What version of UG are you using?  I'll send you the very latest 
>>> patch file just
>>> to be sure.
>>>
>>> The program runs, but paraview gives me an error message when trying 
>>> to open
>>> the output file.  Does that happen to you, too?
>>>
>>> I didn't try the with_ddd_maxprocbits setting.  Does your program 
>>> crash if you
>>> do _not_ set this?
>>>
>>> For the worst case: is it possible to get a temporary account on your 
>>> hector computer?
>>>
>>> best,
>>> Oliver
>>>
>>> Am 28.11.2012 10:28, schrieb Eike Mueller:
>>>> Hi Oliver,
>>>>
>>>> thanks a lot, that would be great. My desktop is a 32bit machine, 
>>>> where sizeof(long) = sizeof(int) = 4, so I'm not sure if recompiling 
>>>> everything with GID=long there will make a difference.
>>>>
>>>> Eike
>>>>
>>>> Oliver Sander wrote:
>>>>> Thanks for the backtrace.  I'll try and see whether I can reproduce 
>>>>> the crash
>>>>> on my machine. If that's not possible things will be a bit 
>>>>> difficult :-)
>>>>> -- 
>>>>> Oliver
>>>>>
>>>>> Am 26.11.2012 18:39, schrieb Eike Mueller:
>>>>>> Hi Markus and Oliver,
>>>>>>
>>>>>> to get to the bottom of this I recompiled everything (UG+Dune+my 
>>>>>> code) with -O0 -g, and that way I was able to get some more 
>>>>>> information out of the core dump. On 1 processor it runs fine now, 
>>>>>> but when running on 6 I get the output below. It looks like it 
>>>>>> crashes in loadBalance(), but I can't make sense of what's happening 
>>>>>> inside UG. It always seems to crash inside ifcreate.c, either in 
>>>>>> line 482 or 489:
>>>>>>
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch 
>>>>>> (tmpcpl=0x1462ad0,
>>>>>>     ifId=1) at if/ifcreate.c:489
>>>>>> 489                     ifHead->nItems++;
>>>>>> (gdb) backtrace
>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch 
>>>>>> (tmpcpl=0x1462ad0,
>>>>>>     ifId=1) at if/ifcreate.c:489
>>>>>> #1  0x0000000000a44dae in UG::D3::IFRebuildAll () at 
>>>>>> if/ifcreate.c:1059
>>>>>> #2  0x0000000000a44e71 in UG::D3::IFAllFromScratch () at 
>>>>>> if/ifcreate.c:1097
>>>>>> #3  0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:869
>>>>>> #4  0x0000000000a65b5c in UG::D3::TransferGridFromLevel 
>>>>>> (theMG=0x1441880,
>>>>>>     level=0) at trans.c:835
>>>>>> #5  0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>     theMG=0x1441880) at lb.c:659
>>>>>> #6  0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, 
>>>>>> argv=0x7fffffffab90)
>>>>>>     at commands.c:10658
>>>>>> #7  0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=4,
>>>>>>     argv=0x7fffffffab90) at 
>>>>>> ../../../dune/grid/uggrid/ugwrapper.hh:979
>>>>>> #8  0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance 
>>>>>> (this=0x134e490,
>>>>>>     strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>     at uggrid.cc:556
>>>>>> #9  0x00000000004077bf in Dune::UGGrid<3>::loadBalance 
>>>>>> (this=0x134e490)
>>>>>>     at 
>>>>>> /home/n02/n02/eike/work/Library/Dune2.2/include/dune/grid/uggrid.hh:738 
>>>>>>
>>>>>> #10 0x0000000000400928 in main (argc=3, argv=0x7fffffffb5a8)
>>>>>>     at testsphericalgridgenerator.cc:65
>>>>>>
>>>>>> but I sometimes also get:
>>>>>>
>>>>>> Core was generated by `./testsphericalgridgenerator 
>>>>>> sphericalshell_cube_6.dgf 4'.
>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch 
>>>>>> (tmpcpl=0x1462ad0,
>>>>>>     ifId=1) at if/ifcreate.c:482
>>>>>> 482                             ifAttr->nAB    = ifAttr->nBA   = 
>>>>>> ifAttr->nABA   = 0;
>>>>>> (gdb) backtrace
>>>>>> #0  0x0000000000a438b0 in UG::D3::IFCreateFromScratch 
>>>>>> (tmpcpl=0x1462ad0,
>>>>>>     ifId=1) at if/ifcreate.c:482
>>>>>> #1  0x0000000000a44dae in UG::D3::IFRebuildAll () at 
>>>>>> if/ifcreate.c:1049
>>>>>> #2  0x0000000000a44e71 in UG::D3::IFRebuildAll () at 
>>>>>> if/ifcreate.c:1057
>>>>>> #3  0x0000000000a4bd89 in UG::D3::DDD_XferEnd () at xfer/cmds.c:850
>>>>>> #4  0x0000000000a65b5c in UG::D3::TransferGridFromLevel 
>>>>>> (theMG=0x1441880,
>>>>>>     level=0) at trans.c:824
>>>>>> #5  0x0000000000a5df4b in UG::D3::lbs (argv=0x7fffffffa390 "0",
>>>>>>     theMG=0x1441880) at lb.c:644
>>>>>> #6  0x0000000000a0bd83 in UG::D3::LBCommand (argc=4, 
>>>>>> argv=0x7fffffffab90)
>>>>>>     at commands.c:10644
>>>>>> #7  0x00000000004a1e2d in Dune::UG_NS<3>::LBCommand (argc=0,
>>>>>>     argv=0x7fffffffa420) at 
>>>>>> ../../../dune/grid/uggrid/ugwrapper.hh:977
>>>>>> #8  0x00000000004a8df9 in Dune::UGGrid<3>::loadBalance 
>>>>>> (this=0x134e490,
>>>>>>     strategy=0, minlevel=0, depth=2, maxLevel=32, minelement=1)
>>>>>>     at uggrid.cc:554
>>>>>> #9  0x00000000004077bf in 
>>>>>> std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count 
>>>>>> (this=0x134e490, __in_chrg=<optimized out>)
>>>>>>     at /opt/gcc/4.6.3/snos/include/g++/bits/shared_ptr_base.h:550
>>>>>> #10 0x0000000000400928 in main (argc=0, argv=0x520)
>>>>>>     at testsphericalgridgenerator.cc:66
>>>>>>
>>>>>> I've tried different load balancing strategies, but for all I get 
>>>>>> a segfault.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Eike
>>>>>>
>>>>>>
>>>>>> Markus Blatt wrote:
>>>>>>> On Mon, Nov 26, 2012 at 01:57:15PM +0000, Eike Mueller wrote:
>>>>>>>> thanks a lot for the patch, unfortunately I still get a segfault 
>>>>>>>> when I run on HECToR.
>>>>>>>>
>>>>>>>
>>>>>>> I feared that, but it was still worth a shot. The change probably
>>>>>>> interferes with the memory allocation in ddd.
>>>>>>>
>>>>>>> Markus
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
> 


-- 
Dr Eike Mueller
Research Officer

Department of Mathematical Sciences
University of Bath
Bath BA2 7AY, United Kingdom

+44 1225 38 5633
e.mueller at bath.ac.uk
http://people.bath.ac.uk/em459/



