[dune-pdelab] Fwd: Fwd: solver fails to reset correctly after FMatrixError (singular matrix)

Shubhangi Gupta sgupta at geomar.de
Fri Jul 26 10:07:40 CEST 2019


Thanks again, Markus and Nils, for your inputs...

But I am quite lost here! So, in any case, I will reiterate:

- I put the mpiguard (*not* ulfm) in the time loop (highest level) and 
also in the ovlpistlsolverbackend around apply (so, lower level). A sketch 
of the resulting time loop is included after this list.

1. The mpiguard in this form seems to help, but only on my laptop with 4 
processors... But it does not help on the university server. I am 
wondering why this could be.

2. The mpiguard also makes the code slower... I guess, as Nils said, 
communicating the failure state at every step (presumably a global 
reduction inside guard.finalize()) has a non-negligible computational 
overhead.

- I have not yet managed to get the dune version with the 
ulfm-mpiguard... but from what Markus said, this will probably not help 
anyway... Is this correct?

- As Nils said, the ulfm stuff has not been integrated with the latest 
dune master (v2.7?). I am currently using v2.6. So upgrading to the next 
version will also probably not help... since the relevant changes to the 
solvers have not been made... Is this correct?
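
For reference, here is the corrected time loop as I understand Markus's 
comments further down in this thread (a sketch only, using the variable 
names osm, osm_tch, uold, unew from my code, and assuming that the guard 
placed around the preconditioner apply turns a failure on one rank into 
an exception on every rank):

    // Dune::MPIGuard lives in <dune/common/parallel/mpiguard.hh>
    Dune::MPIGuard guard;

    while( time < t_END ){
        try{
            // arm the guard for this critical section
            guard.reactivate();

            // critical operation: may throw locally (e.g. FMatrixError) or
            // via the guard inside the preconditioner (MPIGuardError)
            osm.apply( time, dt, uold, unew );

            // collective check: throws here if another rank failed above
            guard.finalize();
        }
        catch( Dune::Exception& e ){
            // ideally all ranks arrive here together, so dt stays consistent
            unew = uold;
            dt *= 0.5;
            osm_tch.getPDESolver().discardMatrix();
            continue;
        }
        uold = unew;
        time += dt;
    }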

Thanks, and warm wishes, Shubhangi


On 26.07.19 09:28, Markus Blatt wrote:
> Hi,
>
> my two cents:
>
> On Thu, Jul 25, 2019 at 10:19:40PM +0200, Nils-Arne Dreier wrote:
>
>> dune-istl stores a copy of the MPI_Comm. If that communicator is revoked,
>> these classes cannot be used anymore and a new instance of the class
>> must be created with a working MPI_Comm (which can be obtained with
>> MPIX_shrink from the revoked communicator). You can find a
>> quick-and-dirty fix of that problem in this branch:
>> https://gitlab.dune-project.org/exadune/dune-istl/tree/p/mirco/ccsolvers
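
For context, the recovery step Nils describes looks roughly like this in 
ULFM terms (the call is spelled MPIX_Comm_shrink and is only provided by 
ULFM-capable MPI implementations; the helper function below is made up 
for illustration and is not the actual fix in the branch above):

    #include <mpi.h>
    #include <mpi-ext.h>   // MPIX_* (ULFM) extensions, if the MPI library ships them

    // After a failure the old communicator is revoked; shrink it to a new
    // communicator containing only the surviving ranks. Every solver object
    // that cached the old MPI_Comm then has to be re-created with it.
    MPI_Comm rebuildCommunicator(MPI_Comm revoked)
    {
      MPI_Comm working;
      MPIX_Comm_shrink(revoked, &working);
      return working;
    }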
>>
> I am not at all an expert on this, but reading about MPIX_shrink does ring an alarm
> bell. Doesn't it remove the failed processors from the communicator? In that case this
> approach is clearly the wrong solution to this problem, as the error (the inversion
> of the diagonal seems to fail) might occur over and over again until no processors
> are left.
>
> I guess your proposed approach is very suitable for hardware errors but not
> for the ones we are experiencing here.
>
> Markus.
>> I hope that helps.
>>
>> Best
>> Nils
>>
>>
>> On 25.07.19 17:28, Shubhangi Gupta wrote:
>>> Hi Markus,
>>>
>>> Thanks a lot for your advice.
>>>
>>> I corrected the implementation of the mpiguard as per your suggestion
>>> (both, in the main time loop, and in the ovlpistlsolverbackend). Two
>>> notable things I observe:
>>>
>>> 1. The mpiguard **seems** to work on my local machine... as in, I have
>>> run my simulations for a number of parameter sets, and my linear
>>> solver hasn't frozen *yet*. But, the mpiguard doesn't work on the copy
>>> of the code on our university server!
>>>
>>> 2. It seems that the mpiguard is making the code slower ... can this be?
>>>
>>> Also, yes, I agree that my linear system could be ill-conditioned (or
>>> weird, as you put it). I have a complicated setting with rather
>>> extreme properties taken from the Black Sea cores... But I think the
>>> linear/nonlinear solvers shouldn't fail partially, and communication
>>> failure between processes is certainly not a good sign for the solvers
>>> in general... or? I would expect the solver to simply not converge
>>> overall if the linear system is incorrect... not freeze halfway and
>>> stop communicating.
>>>
>>> Thanks once again! I really appreciate your help.
>>>
>>> best wishes, Shubhangi
>>>
>>>
>>> On 24.07.19 11:25, Markus Blatt wrote:
>>>> Please always reply to the list. Free consulting is only available
>>>> there.
>>>>
>>>> The solution to your problems is at the bottom. Please also read the rest,
>>>> as you seem to use MPIGuard the wrong way.
>>>>
>>>> On Wed, Jul 24, 2019 at 09:42:01AM +0200, Shubhangi Gupta wrote:
>>>>> Hi Markus,
>>>>>
>>>>> Thanks a lot for your reply! I am answering your questions below...
>>>>>
>>>>> 1. Does at the highest level mean outside the try clause? That might
>>>>> be wrong as it will throw if something went wrong. It needs to be
>>>>> inside the try clause.
>>>>>
>>>>> By highest level, I meant **inside** the try clause.
>>>> I really have no experience with MPIGuard. Maybe someone else can
>>>> tell us where
>>>> it throws.
>>>>
>>>> But I think you are using it wrong.
>>>>> Dune::MPIGuard guard;
>>>>>
>>>> This would be outside the try clause. But that might be right as
>>>> MPIGuard
>>>> throws during finalize.
>>>>
>>>>>           bool exceptionCaught = false;
>>>>>
>>>>>           while( time < t_END ){
>>>>>
>>>>>               try{
>>>>>
>>>> Personally I would have initialized the MPIGuard here, but it seems like
>>>> your approach is valid too, as you reactivate the guard each iteration.
>>>>
>>>>>                   // reactivate the guard for the next critical operation
>>>>>                   guard.reactivate();
>>>>>
>>>>>                   osm.apply( time, dt, uold, unew );
>>>>>
>>>>>                   exceptionCaught = false;
>>>>>
>>>> Here you definitely need to tell it that you passed the critical section:
>>>> guard.finalize();
>>>>
>>>>>               }catch ( Dune::Exception &e ) {
>>>>>                   exceptionCaught = true;
>>>>>
>>>>>                   // tell the guard that you successfully passed a critical operation
>>>>>                   guard.finalize();
>>>> This is too late! You have already experienced any exception there
>>>> might be.
>>>>
>>>>>                   unew = uold;
>>>>>
>>>>>                   dt *= 0.5;
>>>>>
>>>>>                   osm_tch.getPDESolver().discardMatrix();
>>>>>
>>>>>                   continue;
>>>>>               }
>>>>>
>>>>>               uold = unew;
>>>>>               time += dt;
>>>>>           }
>>>>>
>>>>> 2. freezes means deadlock (stopping at an iteration and never
>>>>> finishing)? That will happen in your code if the MPIGuard is before
>>>>> the try clause.
>>>>>
>>>>> Yes, freezes means stopping at the iteration and never finishing it.
>>>>>
>>>>> So first, this was happening right after the FMatrixError (singular
>>>>> matrix): the osm froze without initiating the Newton solver... After I
>>>>> put in the MPIGuard, this problem was solved, and the Newton solver
>>>>> restarts as it should... But now the freezing happens in the linear
>>>>> solver (BiCGStab, in this case). Nils said that to solve this I will
>>>>> have to put the MPIGuard also at lower levels (inside the Newton and
>>>>> linear solvers...). I, on the other hand, prefer not to touch the dune
>>>>> core code and risk introducing more errors along the way...
>>>>>
>>>> That is because in your case different processors will work with different
>>>> time steps, and that cannot work, as the linear system is then utterly wrong.
>>>>
>>>>> 3. ....have you tried the poor-man's solution, below? ...
>>>>>
>>>>> Yes, I tried that, but the problem is that if the apply step doesn't
>>>>> finish, then nothing really happens...
>>>>>
>>>> Finally I understand. You are using
>>>> Dune::PDELab::ISTLBackend_BCGS_AMG_SSOR<IGO>.
>>>> You must have a very weird linear system, as this bug can only appear
>>>> when inverting the diagonal block in the application of one SSOR step.
>>>> Personally I would say that your linear system is incorrect/not very sane.
>>>>
>>>> The bug is in PDELab, which does not expect an exception
>>>> during the application of the preconditioner. It has to be fixed there, in
>>>> file ovlpistlsolverbackend.hh, OverlappingPreconditioner::apply:
>>>>
>>>> MPIGuard guard;
>>>> prec.apply(Backend::native(v),Backend::native(dd));
>>>> guard.finalize(true);
>>>>
>>>> and probably in many more places. In addition, this construct is also needed
>>>> in the constructor of the AMG, as the error can also happen there if ILU is
>>>> used as the smoother.
>>>>
>>>> Please make your patch available afterwards.
>>>>
>>>> HTH
>>>>
>>>> Markus
>>>>
>>

-- 
Dr. Shubhangi Gupta
Marine Geosystems
GEOMAR Helmholtz Center for Ocean Research
Wischhofstraße 1-3,
D-24148 Kiel

Room: 12-206
Phone: +49 431 600-1402
Email: sgupta at geomar.de




