[dune-pdelab] Fwd: Fwd: solver fails to reset correctly after FMatrixError (singular matrix)

Nils-Arne Dreier n.dreier at uni-muenster.de
Wed Jul 10 14:39:09 CEST 2019


Hi Shubhangi,

I just talked to Jö. We suspect that the exception is thrown on only one
rank, say rank X. The other ranks do not know that rank X failed and
proceed as usual, until at some point they all wait for communication
from rank X. That is the deadlock you are seeing.

You may want to have a look at Dune::MPIGuard in
dune/common/parallel/mpiguard.hh. It makes it possible to propagate the
error state to all ranks.
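
To illustrate what "propagating the error state" means, here is a minimal
sketch in plain C++. It only simulates the ranks in a loop and uses neither
MPI nor Dune::MPIGuard. The names allReduceMin, attemptStep and
stepWithRetry are made up for this example, and allReduceMin merely stands
in for a collective min-reduction over all ranks. The point is that every
rank records a local success flag, and the decision to halve dt and retry
is taken from the reduced flag, so all ranks take the same branch.

```cpp
#include <algorithm>
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Stand-in for a collective min-reduction: every rank contributes
// 1 (success) or 0 (failure), and every rank sees the minimum.
// (Assumption: a real code would use an actual collective here.)
int allReduceMin(const std::vector<int>& flags)
{
  assert(!flags.empty());
  return *std::min_element(flags.begin(), flags.end());
}

// Hypothetical per-rank work: returns normally or throws.
using LocalSolve = std::function<void(int rank, double dt)>;

// One attempt on all simulated ranks. A rank that catches an exception
// does not act on it alone; it only records the failure in its flag.
bool attemptStep(int numRanks, double dt, const LocalSolve& solve)
{
  std::vector<int> ok(numRanks, 1);
  for (int rank = 0; rank < numRanks; ++rank) {
    try {
      solve(rank, dt);
    } catch (const std::exception&) {
      ok[rank] = 0;
    }
  }
  // All ranks agree: the step counts as successful only if no rank failed.
  return allReduceMin(ok) == 1;
}

// Halve dt until all ranks succeed. Returns the accepted dt,
// or -1.0 if the step still fails after maxRetries halvings.
double stepWithRetry(int numRanks, double dt, const LocalSolve& solve,
                     int maxRetries = 10)
{
  for (int attempt = 0; attempt <= maxRetries; ++attempt) {
    if (attemptStep(numRanks, dt, solve))
      return dt;
    dt *= 0.5;
  }
  return -1.0;
}
```

Note that this does not by itself solve the case where the other ranks are
already blocked in a communication inside the solver; the flag exchange has
to happen at a point that every rank is guaranteed to reach.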

There is also a merge request for dune-common, which adapts the MPIGuard
such that you don't need to check for an error state before
communicating, making use of the ULFM proposal for MPI. You can find it
here: https://gitlab.dune-project.org/core/dune-common/merge_requests/517

If you don't have an MPI implementation that provides a *working* ULFM
implementation, you may want to use the blackchannel-ulfm lib:
https://gitlab.dune-project.org/exadune/blackchannel-ulfm

I hope that helps.

Kind regards
Nils

On 10.07.19 14:07, Shubhangi Gupta wrote:
> Hi Jö,
>
> So, since you asked about the number of ranks... I tried running the
> simulations again on 2 processes and 1 process. I get the same problem
> with 2, but not with 1.
>
> On 10.07.19 13:33, Shubhangi Gupta wrote:
>> Hi Jö,
>>
>> Yes, I am running it MPI-parallel, on 4 ranks.
>>
>> On 10.07.19 13:32, Jö Fahlke wrote:
>>> Are you running this MPI-parallel?  If yes, how many ranks?
>>>
>>> Regards, Jö.
>>>
>>> Am Mi, 10. Jul 2019, 11:55:45 +0200 schrieb Shubhangi Gupta:
>>>> Dear pdelab users,
>>>>
>>>> I am currently experiencing a rather strange problem during parallel
>>>> solution of my finite volume code. I have written a short outline of
>>>> my code below for reference.
>>>>
>>>> At some point during computation, if dune throws an error, the code
>>>> catches this error, resets the solution vector to the old value,
>>>> halves the time step size, and tries to redo the calculation
>>>> (osm.apply()).
>>>>
>>>> However, if I get the error "FMatrixError: matrix is singular", the
>>>> solver seems to freeze. Even the initial defect is not shown! (See
>>>> the terminal output below.) I am not sure why this is so, and I have
>>>> not experienced this issue before.
>>>>
>>>> I will be very thankful if someone can help me figure out a way
>>>> around this problem.
>>>>
>>>> Thanks, and warm wishes, Shubhangi
>>>>
>>>>
>>>> *// code layout*
>>>>
>>>>      ...UG grid, generated using gmsh, GV, ...
>>>>
>>>>      typedef Dune::PDELab::QkDGLocalFiniteElementMap<
>>>>          GV::Grid::ctype, double, 0, dim,
>>>>          Dune::PDELab::QkDGBasisPolynomial::lagrange> FEMP0;
>>>>      FEMP0 femp0;
>>>>      typedef Dune::PDELab::GridFunctionSpace<
>>>>          GV, FEMP0, Dune::PDELab::P0ParallelConstraints,
>>>>          Dune::PDELab::ISTL::VectorBackend<>> GFS0;
>>>>      GFS0 gfs0(gv,femp0);
>>>>      typedef Dune::PDELab::PowerGridFunctionSpace<
>>>>          GFS0, num_of_vars,
>>>>          Dune::PDELab::ISTL::VectorBackend<Dune::PDELab::ISTL::Blocking::fixed>,
>>>>          Dune::PDELab::EntityBlockedOrderingTag> GFS_TCH;
>>>>
>>>>      ... LocalOperator LOP lop, TimeLocalOperator TOP top,
>>>>      GridOperator GO go, InstationaryGridOperator IGO igo, ...
>>>>
>>>>      typedef Dune::PDELab::ISTLBackend_BCGS_AMG_SSOR<IGO> LS;
>>>>      LS ls(gfs,50,1,false,true);
>>>>      typedef Dune::PDELab::Newton< IGO, LS, U > PDESOLVER;
>>>>      PDESOLVER pdesolver( igo, ls );
>>>>      Dune::PDELab::ImplicitEulerParameter<double> method;
>>>>
>>>>      Dune::PDELab::OneStepMethod< double, IGO, PDESOLVER, U, U >
>>>>          osm( method, igo, pdesolver );
>>>>
>>>>      //TIME-LOOP
>>>>      while( time < t_END - 1e-8){
>>>>              try{
>>>>                  //PDE-SOLVE
>>>>                  osm.apply( time, dt, uold, unew );
>>>>                  exceptionCaught = false;
>>>>              }catch ( Dune::Exception &e ) {
>>>>                  //RESET
>>>>                  exceptionCaught = true;
>>>>                  std::cout << "Catched Error, Dune reported error: "
>>>>                            << e << std::endl;
>>>>                  unew = uold;
>>>>                  dt *= 0.5;
>>>>                  osm.getPDESolver().discardMatrix();
>>>>                  continue;
>>>>              }
>>>>              uold = unew;
>>>>              time += dt;
>>>>      }
>>>>
>>>>
>>>> *// terminal output showing FMatrixError...*
>>>>
>>>>
>>>>   time = 162.632 , time+dt = 164.603 , opTime = 180 , dt  : 1.97044
>>>>
>>>>   READY FOR NEXT ITERATION.
>>>> _____________________________________________________
>>>>   current opcount = 2
>>>> ****************************
>>>> TCH HYDRATE:
>>>> ****************************
>>>> TIME STEP [implicit Euler]     89 time (from):   1.6263e+02 dt:  
>>>> 1.9704e+00
>>>> time (to):   1.6460e+02
>>>> STAGE 1 time (to):   1.6460e+02.
>>>>    Initial defect:   2.1649e-01
>>>> Using a direct coarse solver (SuperLU)
>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.2195
>>>> seconds.
>>>> === BiCGSTABSolver
>>>>   12.5        6.599e-11
>>>> === rate=0.1733, T=1.152, TIT=0.09217, IT=12.5
>>>>    Newton iteration  1.  New defect:   3.4239e-02.  Reduction (this):
>>>> 1.5816e-01.  Reduction (total):   1.5816e-01
>>>> Using a direct coarse solver (SuperLU)
>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.195
>>>> seconds.
>>>> === BiCGSTABSolver
>>>>     17        2.402e-11
>>>> === rate=0.2894, T=1.655, TIT=0.09738, IT=17
>>>>    Newton iteration  2.  New defect:   3.9906e+00.  Reduction (this):
>>>> 1.1655e+02.  Reduction (total):   1.8434e+01
>>>> Using a direct coarse solver (SuperLU)
>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.8697
>>>> seconds.
>>>> === BiCGSTABSolver
>>>> Catched Error, Dune reported error: FMatrixError
>>>> [luDecomposition:/home/sgupta/dune_2_6/source/dune/dune-common/dune/common/densematrix.hh:909]:
>>>> matrix is singular
>>>> _____________________________________________________
>>>>   current opcount = 2
>>>> ****************************
>>>> TCH HYDRATE:
>>>> ****************************
>>>> TIME STEP [implicit Euler]     89 time (from):   1.6263e+02 dt:  
>>>> 9.8522e-01
>>>> time (to):   1.6362e+02
>>>> STAGE 1 time (to):   1.6362e+02.
>>>>
>>>> *... nothing happens here... the terminal appears to freeze...*
>>>>
>>>>
>>>>
>>>> -- 
>>>> Dr. Shubhangi Gupta
>>>> Marine Geosystems
>>>> GEOMAR Helmholtz Center for Ocean Research
>>>> Wischhofstraße 1-3,
>>>> D-24148 Kiel
>>>>
>>>> Room: 12-206
>>>> Phone: +49 431 600-1402
>>>> Email:sgupta at geomar.de
>>>>
>>>> _______________________________________________
>>>> dune-pdelab mailing list
>>>> dune-pdelab at lists.dune-project.org
>>>> https://lists.dune-project.org/mailman/listinfo/dune-pdelab
>>>





