[dune-pdelab] Fwd: Fwd: solver fails to reset correctly after FMatrixError (singular matrix)

Nils-Arne Dreier n.dreier at uni-muenster.de
Fri Jul 12 13:38:14 CEST 2019


Hi Shubhangi,

You have to call the MPIGuard::finalize() method after the point where
the exception might be thrown and before the next communication is
performed. From the information you provided, I guess that the
exception is thrown in the smoother of the AMG, which makes things
slightly more complicated. Maybe AMG::mgc is a good starting point.
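
To illustrate the pattern, here is a minimal sketch; the function names
and their placement are only for illustration, not the actual AMG code:

    #include <dune/common/parallel/mpiguard.hh>

    // Hypothetical critical section that may throw on a single rank only,
    // e.g. the smoother application inside the multigrid cycle.
    void applySmootherOnThisRank();

    void guardedCriticalSection()
    {
      Dune::MPIGuard guard;        // start guarding (default communicator)

      applySmootherOnThisRank();   // may fail locally, e.g. with an FMatrixError

      // Collective check before the next communication: if any rank did not
      // reach this point, an MPIGuardError is thrown on the ranks that did,
      // so every rank leaves the section via an exception and your outer
      // try/catch sees an error on all ranks.
      guard.finalize();
    }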

By the way: if you use the ULFM approach I described previously, you can
use the MPIGuard at the coarsest level, i.e. around the whole solver call,
and don't need to call MPIGuard::finalize() after every critical section.
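
As a rough sketch, assuming the ULFM-enabled MPIGuard from the merge
request linked in the quoted mail below and reusing the objects from your
code outline, that would look like this:

    // inside your time loop
    try{
        Dune::MPIGuard guard;                // one guard around the whole solve
        osm.apply( time, dt, uold, unew );   // may fail on a single rank;
                                             // with ULFM the failure should
                                             // interrupt the other ranks
                                             // instead of leaving them blocked
        guard.finalize();                    // propagates the error state, if any
        exceptionCaught = false;
    }catch ( Dune::Exception &e ) {
        // every rank now ends up here, so the reset stays consistent
        exceptionCaught = true;
        unew = uold;
        dt *= 0.5;
        osm.getPDESolver().discardMatrix();
        continue;
    }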

Regards
Nils

On 11.07.19 14:56, Shubhangi Gupta wrote:
> Dear Jö and Nils,
>
> Thanks a lot for your replies.
>
> I actually tried putting the MPIGuard within the time loop (at the
> highest level) just to see what happens... Indeed, the one-step method
> now proceeds as it should, but the BiCGSTab solver freezes... So yeah, as Jö
> mentioned, the MPIGuard needs to be introduced inside the
> ISTL solver... I am not very sure how and where exactly, though! Any
> ideas?
>
> Thanks again, and warm wishes, Shubhangi
>
> On 10.07.19 14:52, Jö Fahlke wrote:
>> On Wed, 10 Jul 2019, 14:39:09 +0200, Nils-Arne Dreier wrote:
>>> Hi Shubhangi,
>>>
>>> I just talked to Jö. We guess that the problem is that the exception is
>>> only thrown on one rank, say rank X. All other ranks do not know that
>>> rank X failed and proceed as usual; at some point, all these ranks are
>>> waiting for communication from rank X. That is the deadlock that you see.
>>>
>>> You may want to have a look at Dune::MPIGuard in
>>> dune/common/parallel/mpiguard.hh. It makes it possible to propagate the
>>> error state to all ranks.
>> It should be mentioned that MPIGuard probably cannot be used at a high level;
>> it would probably need to be introduced into the ISTL solvers (BiCGSTab, AMG,
>> SSOR) and/or PDELab (the parallel scalar product, Newton) for this to work.
>> Not sure where exactly.
>>
>> Regards,
>> Jö.
>>
>>> There is also a merge request for dune-common, which adapts the MPIGuard
>>> such that you don't need to check for an error state before communicating,
>>> making use of the ULFM proposal for MPI. You can find it here:
>>> https://gitlab.dune-project.org/core/dune-common/merge_requests/517
>>>
>>> If you don't have an MPI implementation that provides a *working* ULFM
>>> implementation, you may want to use the blackchannel-ulfm lib:
>>> https://gitlab.dune-project.org/exadune/blackchannel-ulfm
>>>
>>> I hope that helps.
>>>
>>> Kind regards
>>> Nils
>>>
>>> On 10.07.19 14:07, Shubhangi Gupta wrote:
>>>> Hi Jö,
>>>>
>>>> So, since you asked about the number of ranks... I tried running the
>>>> simulations again on 2 processes and 1 process. I get the same problem
>>>> with 2, but not with 1.
>>>>
>>>> On 10.07.19 13:33, Shubhangi Gupta wrote:
>>>>> Hi Jö,
>>>>>
>>>>> Yes, I am running it MPI-parallel, on 4 ranks.
>>>>>
>>>>> On 10.07.19 13:32, Jö Fahlke wrote:
>>>>>> Are you running this MPI-parallel?  If yes, how many ranks?
>>>>>>
>>>>>> Regards, Jö.
>>>>>>
>>>>>> On Wed, 10 Jul 2019, 11:55:45 +0200, Shubhangi Gupta wrote:
>>>>>>> Dear pdelab users,
>>>>>>>
>>>>>>> I am currently experiencing a rather strange problem during the parallel
>>>>>>> solution of my finite volume code. I have written a short outline of my
>>>>>>> code below for reference.
>>>>>>>
>>>>>>> At some point during the computation, if Dune throws an error, the code
>>>>>>> catches this error, resets the solution vector to the old value, halves
>>>>>>> the time step size, and tries to redo the calculation (osm.apply()).
>>>>>>>
>>>>>>> However, if I get the error "FMatrixError: matrix is singular", the
>>>>>>> solver seems to freeze. Even the initial defect is not shown! (See the
>>>>>>> terminal output below.) I am not sure why this is so, and I have not
>>>>>>> experienced this issue before.
>>>>>>>
>>>>>>> I will be very thankful if someone can help me figure out a way around
>>>>>>> this problem.
>>>>>>>
>>>>>>> Thanks, and warm wishes, Shubhangi
>>>>>>>
>>>>>>>
>>>>>>> *// code layout*
>>>>>>>
>>>>>>>     ...UG grid, generated using gmsh, GV, ...
>>>>>>>
>>>>>>>     typedef Dune::PDELab::QkDGLocalFiniteElementMap<GV::Grid::ctype, double,
>>>>>>>         0, dim, Dune::PDELab::QkDGBasisPolynomial::lagrange> FEMP0;
>>>>>>>     FEMP0 femp0;
>>>>>>>     typedef Dune::PDELab::GridFunctionSpace<GV, FEMP0,
>>>>>>>         Dune::PDELab::P0ParallelConstraints,
>>>>>>>         Dune::PDELab::ISTL::VectorBackend<>> GFS0;
>>>>>>>     GFS0 gfs0(gv,femp0);
>>>>>>>     typedef Dune::PDELab::PowerGridFunctionSpace< GFS0, num_of_vars,
>>>>>>>         Dune::PDELab::ISTL::VectorBackend<Dune::PDELab::ISTL::Blocking::fixed>,
>>>>>>>         Dune::PDELab::EntityBlockedOrderingTag> GFS_TCH;
>>>>>>>
>>>>>>>     ... LocalOperator LOP lop, TimeLocalOperator TOP top, GridOperator GO go,
>>>>>>>         InstationaryGridOperator IGO igo, ...
>>>>>>>
>>>>>>>     typedef Dune::PDELab::ISTLBackend_BCGS_AMG_SSOR<IGO> LS;
>>>>>>>     LS ls(gfs,50,1,false,true);
>>>>>>>     typedef Dune::PDELab::Newton< IGO, LS, U > PDESOLVER;
>>>>>>>     PDESOLVER pdesolver( igo, ls );
>>>>>>>     Dune::PDELab::ImplicitEulerParameter<double> method;
>>>>>>>
>>>>>>>     Dune::PDELab::OneStepMethod< double, IGO, PDESOLVER, U, U >
>>>>>>>         osm( method, igo, pdesolver );
>>>>>>>
>>>>>>>     //TIME-LOOP
>>>>>>>     while( time < t_END - 1e-8 ){
>>>>>>>         try{
>>>>>>>             //PDE-SOLVE
>>>>>>>             osm.apply( time, dt, uold, unew );
>>>>>>>             exceptionCaught = false;
>>>>>>>         }catch ( Dune::Exception &e ) {
>>>>>>>             //RESET
>>>>>>>             exceptionCaught = true;
>>>>>>>             std::cout << "Catched Error, Dune reported error: " << e << std::endl;
>>>>>>>             unew = uold;
>>>>>>>             dt *= 0.5;
>>>>>>>             osm.getPDESolver().discardMatrix();
>>>>>>>             continue;
>>>>>>>         }
>>>>>>>         uold = unew;
>>>>>>>         time += dt;
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>> *// terminal output showing FMatrixError...*
>>>>>>>
>>>>>>>
>>>>>>>    time = 162.632 , time+dt = 164.603 , opTime = 180 , dt  : 1.97044
>>>>>>>
>>>>>>>    READY FOR NEXT ITERATION.
>>>>>>> _____________________________________________________
>>>>>>>    current opcount = 2
>>>>>>> ****************************
>>>>>>> TCH HYDRATE:
>>>>>>> ****************************
>>>>>>> TIME STEP [implicit Euler]     89 time (from):   1.6263e+02 dt:   1.9704e+00 time (to):   1.6460e+02
>>>>>>> STAGE 1 time (to):   1.6460e+02.
>>>>>>>     Initial defect:   2.1649e-01
>>>>>>> Using a direct coarse solver (SuperLU)
>>>>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.2195 seconds.
>>>>>>> === BiCGSTABSolver
>>>>>>>    12.5        6.599e-11
>>>>>>> === rate=0.1733, T=1.152, TIT=0.09217, IT=12.5
>>>>>>>     Newton iteration  1.  New defect:   3.4239e-02.  Reduction (this):   1.5816e-01.  Reduction (total):   1.5816e-01
>>>>>>> Using a direct coarse solver (SuperLU)
>>>>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.195 seconds.
>>>>>>> === BiCGSTABSolver
>>>>>>>      17        2.402e-11
>>>>>>> === rate=0.2894, T=1.655, TIT=0.09738, IT=17
>>>>>>>     Newton iteration  2.  New defect:   3.9906e+00.  Reduction (this):   1.1655e+02.  Reduction (total):   1.8434e+01
>>>>>>> Using a direct coarse solver (SuperLU)
>>>>>>> Building hierarchy of 2 levels (inclusive coarse solver) took 0.8697 seconds.
>>>>>>> === BiCGSTABSolver
>>>>>>> Catched Error, Dune reported error: FMatrixError [luDecomposition:/home/sgupta/dune_2_6/source/dune/dune-common/dune/common/densematrix.hh:909]: matrix is singular
>>>>>>> _____________________________________________________
>>>>>>>    current opcount = 2
>>>>>>> ****************************
>>>>>>> TCH HYDRATE:
>>>>>>> ****************************
>>>>>>> TIME STEP [implicit Euler]     89 time (from):   1.6263e+02 dt:   9.8522e-01 time (to):   1.6362e+02
>>>>>>> STAGE 1 time (to):   1.6362e+02.
>>>>>>>
>>>>>>> *... nothing happens here... the terminal appears to freeze...*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Dr. Shubhangi Gupta
>>>>>>> Marine Geosystems
>>>>>>> GEOMAR Helmholtz Center for Ocean Research
>>>>>>> Wischhofstraße 1-3,
>>>>>>> D-24148 Kiel
>>>>>>>
>>>>>>> Room: 12-206
>>>>>>> Phone: +49 431 600-1402
>>>>>>> Email:sgupta at geomar.de
>>>>>>>
>



