[dune-pdelab] Fwd: Fwd: solver fails to reset correctly after FMatrixError (singular matrix)

Nils-Arne Dreier n.dreier at uni-muenster.de
Thu Jul 25 22:19:40 CEST 2019


Hi Shubhangi,

sorry I missed a lot of the discussion. Just to clarify the differences
between the non-ULFM MPIGuard and the ULFM-MPIGuard:

The non-ULFM MPIGuard performs a non-blocking communication after every
critical section to synchronize the error state. For that to work, the
reinitialize method must be called after every critical section.
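
In code, the pattern looks roughly like this (a minimal sketch only,
using the reactivate()/finalize() interface of the MPIGuard in
dune-common and the variable names from Shubhangi's time loop quoted
below; the header path may differ between DUNE versions):

    #include <dune/common/parallel/mpihelper.hh>
    #include <dune/common/parallel/mpiguard.hh>

    Dune::MPIGuard guard;                  // armed on construction

    while (time < t_END) {
      try {
        guard.reactivate();                // arm the guard for this critical section
        osm.apply(time, dt, uold, unew);   // may throw on a single rank
        guard.finalize();                  // collective check; throws
                                           // Dune::MPIGuardError here if any
                                           // other rank has failed
      }
      catch (Dune::Exception& e) {
        // every rank ends up here together, so the recovery stays collective
        unew = uold;
        dt *= 0.5;
        continue;
      }
      uold = unew;
      time += dt;
    }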

The ULFM MPIGuard does not rely on this synchronization. Instead, MPI
calls simply fail if the communicator has been revoked (by a remote
failed rank) and throw an exception (if an appropriate error handler is
chosen). The revocation is performed in the destructor of the MPIGuard
if an exception is in flight. The advantages are: 1. We don't need to
add reinitialize calls in the low-level code. 2. We don't add much
communication overhead for the error-state synchronization.
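
The mechanism is roughly the following (an illustrative sketch only, not
the code from our branch; it assumes an ULFM-capable MPI that provides
MPIX_Comm_revoke, e.g. through <mpi-ext.h> in Open MPI, and C++17 for
std::uncaught_exceptions):

    #include <mpi.h>
    #include <mpi-ext.h>   // ULFM extensions; header name may differ
    #include <exception>

    class UlfmGuard {      // hypothetical name, for illustration only
      MPI_Comm comm_;
    public:
      explicit UlfmGuard(MPI_Comm comm) : comm_(comm) {}
      ~UlfmGuard() {
        // Being destroyed during stack unwinding means an exception is in
        // flight: revoke the communicator so that remote ranks blocked in
        // MPI calls get an error (and can throw) instead of deadlocking.
        if (std::uncaught_exceptions() > 0)
          MPIX_Comm_revoke(comm_);
      }
    };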

There is a problem with the ULFM version and dune-istl. A lot of classes
in dune-istl store a copy of the MPI_Comm. If that communicator is
revoked, these classes cannot be used anymore, and a new instance must
be created with a working MPI_Comm (which can be obtained from the
revoked communicator with MPIX_shrink). You can find a quick-and-dirty
fix of that problem in this branch:
https://gitlab.dune-project.org/exadune/dune-istl/tree/p/mirco/ccsolvers
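
The shrink step itself looks roughly like this (again only a sketch,
assuming an ULFM-capable MPI whose shrink call is MPIX_Comm_shrink;
rebuilding the istl objects on top of the new communicator is what the
branch above does):

    #include <mpi.h>
    #include <mpi-ext.h>   // ULFM extensions (MPIX_Comm_shrink)

    // Collectively build a new communicator from the surviving ranks of a
    // revoked communicator. Every istl object that cached the old MPI_Comm
    // (e.g. an OwnerOverlapCopyCommunication or a solver backend) has to be
    // recreated with the returned communicator afterwards.
    MPI_Comm recoverCommunicator(MPI_Comm revoked)
    {
      MPI_Comm working;
      MPIX_Comm_shrink(revoked, &working);
      return working;
    }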


I hope that helps.

Best
Nils


On 25.07.19 17:28, Shubhangi Gupta wrote:
> Hi Markus,
>
> Thanks a lot for your advice.
>
> I corrected the implementation of the mpiguard as per your suggestion
> (both in the main time loop and in the ovlpistlsolverbackend). Two
> notable things I observe:
>
> 1. The mpiguard **seems** to work on my local machine... as in, I have
> run my simulations for a number of parameter sets, and my linear
> solver hasn't frozen *yet*. But the mpiguard doesn't work on the copy
> of the code on our university server!
>
> 2. It seems that the mpiguard is making the code slower ... can this be?
>
> Also, yes, I agree that my linear system could be ill-conditioned (or
> weird, as you put it). I have a complicated setting with rather
> extreme properties taken from the Black Sea cores. But I think the
> linear/nonlinear solvers shouldn't fail partially, and communication
> failure between processes is certainly not a good sign for the solvers
> in general, is it? I would expect the solver to simply not converge
> overall if the linear system is incorrect... not freeze halfway and
> stop communicating.
>
> Thanks once again! I really appreciate your help.
>
> best wishes, Shubhangi
>
>
> On 24.07.19 11:25, Markus Blatt wrote:
>> Please always reply to the list. Free consulting is only available
>> there.
>>
>> The solution to your problems is at the bottom. Please also read the
>> rest, as you seem to use MPIGuard the wrong way.
>>
>> On Wed, Jul 24, 2019 at 09:42:01AM +0200, Shubhangi Gupta wrote:
>>> Hi Markus,
>>>
>>> Thanks a lot for your reply! I am answering your questions below...
>>>
>>> 1. Does at the highest level mean outside the try clause? That might
>>> be wrong as it will throw if something went wrong. It needs to be
>>> inside the try clause.
>>>
>>> By highest level, I meant **inside** the try clause.
>> I really have no experience with MPIGuard. Maybe someone else can
>> tell us where
>> it throws.
>>
>> But I think you are using it wrong.
>>> Dune::MPIGuard guard;
>>>
>> This would be outside the try clause. But that might be right as
>> MPIGuard
>> throws during finalize.
>>
>>>          bool exceptionCaught = false;
>>>
>>>          while( time < t_END ){
>>>
>>>              try{
>>>
>> Personally I would have initialized the MPIGuard here, but it seems
>> like your approach is valid too, as you reactivate it.
>>
>>>                  // reactivate the guard for the next critical operation
>>>                  guard.reactivate();
>>>
>>>                  osm.apply( time, dt, uold, unew );
>>>
>>>                  exceptionCaught = false;
>>>
>> Here you definitely need to tell it that you passed the critical section:
>> guard.finalize();
>>
>>>              }catch ( Dune::Exception &e ) {
>>>                  exceptionCaught = true;
>>>
>>>                  // tell the guard that you successfully passed a critical operation
>>>                  guard.finalize();
>> This is too late! You have already experienced any exception there
>> might be.
>>
>>>                  unew = uold;
>>>
>>>                  dt *= 0.5;
>>>
>>>                  osm_tch.getPDESolver().discardMatrix();
>>>
>>>                  continue;
>>>              }
>>>
>>>              uold = unew;
>>>              time += dt;
>>>          }
>>>
>>> 2. freezes means deadlock (stopping at an iteration and never
>>> finishing)? That will happen in your code if the MPIGuard is before
>>> the try clause.
>>>
>>> Yes, freezes means stopping at the iteration and never finishing it.
>>>
>>> So first, this was happening right after FMatrixError (singular matrix).
>>> The osm froze without initiating the Newton solver... After I put in the
>>> MPIGuard, this problem was solved... the Newton solver restarts as it
>>> should... But now the freezing happens with the linear solver (BiCGStab,
>>> in this case). Nils said that to solve this I will have to put the
>>> MPIGuard also at lower levels (inside the Newton and linear solvers...).
>>> I, on the other hand, prefer not to touch the dune core code and risk
>>> introducing more errors along the way...
>>>
>> That is because in your case different processors will work with
>> different timesteps, and that cannot work because the linear system is
>> then utterly wrong.
>>
>>> 3. ....have you tried the poor-man's solution, below? ...
>>>
>>> Yes, I tried that, but the problem is that if the apply step doesn't
>>> finish, then nothing really happens...
>>>
>> Finally I understand. You are using
>> Dune::PDELab::ISTLBackend_BCGS_AMG_SSOR<IGO>.
>> You must have a very weird linear system, as this bug can only appear
>> when inverting the diagonal block in the application of one step of SSOR.
>> Personally I would say that your linear system is incorrect/not very sane.
>>
>> The bug is in PDELab, which does not expect an exception during the
>> application of the preconditioner. It has to be fixed there, in file
>> ovlpistlsolverbackend.hh, OverlappingPreconditioner::apply:
>>
>> MPIGuard guard;
>> prec.apply(Backend::native(v),Backend::native(dd));
>> guard.finalize(true);
>>
>> and probably in many more places. In addition, this construct is also
>> needed in the constructor of AMG, as the exception can also occur there
>> if ILU is used as the smoother.
>>
>> Please make your patch available afterwards.
>>
>> HTH
>>
>> Markus
>>




