[dune-pdelab] Fwd: Fwd: solver fails to reset correctly after FMatrixError (singular matrix)

Nils-Arne Dreier n.dreier at uni-muenster.de
Tue Jul 30 09:20:26 CEST 2019


Hi Markus,

> I am not at all an expert on this, but reading MPIX_shrink does ring
an alarm
> bell. Doesn't it remove the processor from the communicator? In that
case this
> approach is clearly the wrong solution to this problem as the error
(inversion
> of the diagonal seems to fail) might occur over and over again until
no processors
> are left.

The problem is that MPIX_shrink does not shrink the communicator; it
creates a new communicator containing all sane ranks of the
failed/revoked communicator. In the case where all ranks are sane, the
only difference to MPI_Comm_dup is that it can be applied to a revoked
communicator. If we want to "shrink" the communicator used by the AMG,
we would actually need to change the value of the MPI_Comm member of the
underlying classes, which is currently not possible from the user side.
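
A minimal sketch of the shrink semantics (this assumes a ULFM-capable
MPI; MPIX_Comm_shrink is an MPI extension, not part of the standard, so
treat this as illustrative rather than runnable everywhere):

```cpp
// After a failure/revocation, shrink does not modify 'comm' in place --
// it creates a *new* communicator containing only the surviving ranks.
MPI_Comm newcomm;
int err = MPIX_Comm_shrink(comm, &newcomm);

// If no rank actually failed, newcomm has the same group as comm, much
// like MPI_Comm_dup(comm, &newcomm); the difference is that shrink also
// works on a revoked communicator. To make the AMG use newcomm, the
// MPI_Comm stored inside the dune-istl communication classes would have
// to be replaced -- which is what is not possible from the user side.
```
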

Best
Nils

On 26.07.19 09:28, Markus Blatt wrote:
> Hi,
> 
> my two cents:
> 
> On Thu, Jul 25, 2019 at 10:19:40PM +0200, Nils-Arne Dreier wrote:
> 
>> dune-istl stores a copy of the MPI_Comm. If that communicator is revoked,
>> these classes can no longer be used, and a new instance must be created
>> with a working MPI_Comm (which can be obtained from the revoked
>> communicator with MPIX_shrink). You can find a
>> quick-and-dirty fix of that problem in this branch:
>> https://gitlab.dune-project.org/exadune/dune-istl/tree/p/mirco/ccsolvers
>>
> 
> I am not at all an expert on this, but reading MPIX_shrink does ring an alarm
> bell. Doesn't it remove the processor from the communicator? In that case this
> approach is clearly the wrong solution to this problem as the error (inversion
> of the diagonal seems to fail) might occur over and over again until no processors
> are left.
> 
> I guess your proposed approach is very suitable for hardware errors but not
> for the ones we are experiencing here.
> 
> Markus.
>>
>> I hope that helps.
>>
>> Best
>> Nils
>>
>>
>> On 25.07.19 17:28, Shubhangi Gupta wrote:
>>> Hi Markus,
>>>
>>> Thanks a lot for your advice.
>>>
>>> I corrected the implementation of the MPIGuard as per your suggestion
>>> (both in the main time loop and in the ovlpistlsolverbackend). Two
>>> notable things I observe:
>>>
>>> 1. The mpiguard **seems** to work on my local machine... as in, I have
>>> run my simulations for a number of parameter sets, and my linear
>>> solver hasn't frozen *yet*. But, the mpiguard doesn't work on the copy
>>> of the code on our university server!
>>>
>>> 2. It seems that the mpiguard is making the code slower ... can this be?
>>>
>>> Also, yes, I agree that my linear system could be ill-conditioned (or
>>> weird, as you put it). I have a complicated setting with rather
>>> extreme properties taken from the Black Sea cores.. But I think the
>>> linear/nonlinear solvers shouldn't fail partially, and communication
>>> failure between processes is certainly not a good sign for the solvers
>>> in general, is it? I would expect the solver to simply not converge
>>> overall if the linear system is incorrect... not freeze halfway and
>>> stop communicating.
>>>
>>> Thanks once again! I really appreciate your help.
>>>
>>> best wishes, Shubhangi
>>>
>>>
>>> On 24.07.19 11:25, Markus Blatt wrote:
>>>> Please always reply to the list. Free consulting is only available
>>>> there.
>>>>
>>>> The solution to your problems is at the bottom. Please also read the
>>>> rest, as you seem to be using MPIGuard the wrong way.
>>>>
>>>> On Wed, Jul 24, 2019 at 09:42:01AM +0200, Shubhangi Gupta wrote:
>>>>> Hi Markus,
>>>>>
>>>>> Thanks a lot for your reply! I am answering your questions below...
>>>>>
>>>>> 1. Does at the highest level mean outside the try clause? That might
>>>>> be wrong as it will throw if something went wrong. It needs to be
>>>>> inside the try clause.
>>>>>
>>>>> By highest level, I meant **inside** the try clause.
>>>> I really have no experience with MPIGuard. Maybe someone else can
>>>> tell us where
>>>> it throws.
>>>>
>>>> But I think you are using it wrong.
>>>>> Dune::MPIGuard guard;
>>>>>
>>>> This would be outside the try clause. But that might be right as
>>>> MPIGuard
>>>> throws during finalize.
>>>>
>>>>>          bool exceptionCaught = false;
>>>>>
>>>>>          while( time < t_END ){
>>>>>
>>>>>              try{
>>>>>
>>>> Personally I would have initialized the MPIGuard here, but your
>>>> approach seems valid too, as you reactivate it.
>>>>
>>>>>                  // reactivate the guard for the next critical operation
>>>>>                  guard.reactivate();
>>>>>
>>>>>                  osm.apply( time, dt, uold, unew );
>>>>>
>>>>>                  exceptionCaught = false;
>>>>>
>>>> Here you definitely need to tell it that you passed the critical section:
>>>> guard.finalize();
>>>>
>>>>>              }catch ( Dune::Exception &e ) {
>>>>>                  exceptionCaught = true;
>>>>>
>>>>>                  // tell the guard that you successfully passed a critical operation
>>>>>                  guard.finalize();
>>>> This is too late! By this point any exception there might be has
>>>> already occurred.
>>>>
>>>>>                  unew = uold;
>>>>>
>>>>>                  dt *= 0.5;
>>>>>
>>>>>                  osm_tch.getPDESolver().discardMatrix();
>>>>>
>>>>>                  continue;
>>>>>              }
>>>>>
>>>>>              uold = unew;
>>>>>              time += dt;
>>>>>          }
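
Putting these corrections together, the intended control flow can be
sketched self-contained. Guard and Stepper below are hypothetical
stand-ins (the real Dune::MPIGuard performs a global reduction in
finalize() so that all ranks throw together; osm.apply() is mimicked by
a stepper that fails once, like a singular diagonal block raising
FMatrixError):

```cpp
#include <stdexcept>

// Hypothetical stand-in for Dune::MPIGuard. In real code, finalize()
// communicates, so it must run on every rank that entered the critical
// section -- hence it belongs *inside* the try block, after apply().
struct Guard {
  void reactivate() { /* arm the guard for the next critical section */ }
  void finalize()   { /* all ranks agree the section succeeded */ }
};

// Stand-in for osm.apply(): fails on its first invocation only.
struct Stepper {
  int calls = 0;
  void apply(double /*time*/, double /*dt*/) {
    if (++calls == 1) throw std::runtime_error("singular matrix");
  }
};

// Damped time loop with the corrected guard placement.
// Returns the total number of apply() calls made.
inline int run_time_loop(double t_END) {
  Guard guard;
  Stepper osm;
  double time = 0.0, dt = 1.0;
  while (time < t_END) {
    try {
      guard.reactivate();   // reactivate for the next critical operation
      osm.apply(time, dt);  // critical section: may throw
      guard.finalize();     // passed the critical section -- must happen
                            // here, before any exception handling
    } catch (const std::exception&) {
      dt *= 0.5;            // reduce the timestep and retry the step
      continue;
    }
    time += dt;
  }
  return osm.calls;
}
```

With t_END = 2.0 the first attempt fails, dt is halved to 0.5, and four
successful steps then reach t_END, for five apply() calls in total.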
>>>>>
>>>>> 2. freezes means deadlock (stopping at an iteration and never
>>>>> finishing)? That will happen in your code if the MPIGuard is before
>>>>> the try clause.
>>>>>
>>>>> Yes, freezes means stopping at the iteration and never finishing it.
>>>>>
>>>>> So first, this was happening right after the FMatrixError (singular
>>>>> matrix). The osm froze without initiating the Newton solver... After
>>>>> I put in the MPIGuard, this problem was solved... the Newton solver
>>>>> restarts as it should... But now the freezing happens with the linear
>>>>> solver (BiCGStab, in this case). Nils said that to solve this I will
>>>>> have to put the MPIGuard also at lower levels (inside the Newton and
>>>>> linear solvers...). I, on the other hand, prefer not to touch the
>>>>> dune core code and risk introducing more errors along the way...
>>>>>
>>>> That is because in your case different processors will work with
>>>> different timesteps, and that cannot work, as the linear system is
>>>> utterly wrong.
>>>>
>>>>> 3. ....have you tried the poor-man's solution, below? ...
>>>>>
>>>>> Yes, I tried that, but the problem is if the apply step doesn't
>>>>> finish, then
>>>>> nothing really happens...
>>>>>
>>>> Finally I understand. You are using
>>>> Dune::PDELab::ISTLBackend_BCGS_AMG_SSOR<IGO>.
>>>> You must have a very weird linear system, as this bug can only appear
>>>> when inverting the diagonal block in the application of one step of
>>>> SSOR. Personally I would say that your linear system is incorrect/not
>>>> very sane.
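
As a generic illustration (not Dune code) of why applying SSOR can
throw: inverting a diagonal block fails exactly when its determinant
vanishes, which is what surfaces as the FMatrixError. The helper below
is a hypothetical 2x2 stand-in for the FieldMatrix inversion:

```cpp
#include <array>
#include <cmath>
#include <stdexcept>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Invert a 2x2 block; throws when the block is singular, playing the
// role of the FMatrixError raised inside SSOR's diagonal solve.
inline Mat2 invert2x2(const Mat2& a) {
  const double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
  if (std::abs(det) < 1e-300)
    throw std::runtime_error("singular matrix");
  return {{{ a[1][1] / det, -a[0][1] / det},
           {-a[1][0] / det,  a[0][0] / det}}};
}
```

A rank whose local matrix contains such a block throws mid-iteration
while the other ranks keep communicating, which is exactly the deadlock
pattern described above.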
>>>>
>>>> The bug is in PDELab, which does not expect an exception during the
>>>> application of the preconditioner. It has to be fixed there, in file
>>>> ovlpistlsolverbackend.hh, OverlappingPreconditioner::apply:
>>>>
>>>> MPIGuard guard;
>>>> prec.apply(Backend::native(v),Backend::native(dd));
>>>> guard.finalize(true);
>>>>
>>>> and probably in many more places. In addition, this construct is also
>>>> needed in the constructor of AMG, as the error can happen there too if
>>>> ILU is used as the smoother.
>>>>
>>>> Please make your patch available afterwards.
>>>>
>>>> HTH
>>>>
>>>> Markus
>>>>
>>
>>
>> _______________________________________________
>> dune-pdelab mailing list
>> dune-pdelab at lists.dune-project.org
>> https://lists.dune-project.org/mailman/listinfo/dune-pdelab
> 


