[dune-pdelab] reg: OpenMP parallelization of assembly

Markus Blatt markus at dr-blatt.de
Fri Jun 26 15:18:06 CEST 2020


Hi,

just my two cents.

On Fri, Jun 26, 2020 at 12:30:51PM +0200, Shubhangi Gupta wrote:
> Dear Linus,
> 
> Thanks a lot for your reply...
> 
> I am already using parallelization via MPI, and that of course helps
> speed up the computation, but the communication starts to get expensive
> really fast.
>

Please keep in mind that the assembly is usually the part where
parallelization is nearly ideal. There is usually not much communication,
but a cell may be assembled multiple times (on the process where it is
interior and on a few others where it is not). You may notice that
overhead once the local problems per process get small.

Getting ideal scalability might also be possible with OpenMP, but then the
implementation is far from trivial: it takes much more than putting a few
pragmas in.

In OPM, for example, you can use both MPI and OpenMP during the assembly.
Our experience is that with hyperthreading (one CPU core with two hardware
threads), using 2 threads per MPI rank gives some benefit, but much less
than the theoretical speedup (~15%). You can check that yourself with OPM
and one of the supplied models (though that is of course quite a different
problem than the one we solve).

> As I said earlier, matrix assembly is the main bottleneck (very
> large system of PDEs, highly nonlinear...),

That is indeed different from our problems, where it is 50:50.

> and I was wondering
> whether there was an easier way of just parallelizing the assembly
> but continue using one of the direct solvers (superlu) instead of
> the available parallel solvers (I am currently using amg).

If your available memory allows that.

Which MPI implementation are you using? At least Open MPI tries to be
smart and guesses whether you want affinity per core or per socket:
https://blogs.cisco.com/performance/open-mpi-binding-to-core-by-default. That
might not always be what you want.
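If you want the binding to be explicit rather than guessed, Open MPI lets
you state it on the command line. A few examples (./myapp is a placeholder
for your executable; flag availability depends on the Open MPI version):

```shell
# Show the binding Open MPI actually applies, per rank:
mpirun --report-bindings -np 4 ./myapp

# Pin one rank per core (good for pure MPI runs):
mpirun --bind-to core -np 4 ./myapp

# For hybrid MPI+OpenMP, bind ranks to sockets instead, so each
# rank's threads can spread over the cores of its socket:
mpirun --bind-to socket --map-by socket -np 2 ./myapp
```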

Markus
> 
> Best wishes, Shubhangi
> 
> On 26.06.20 11:54, Linus Seelinger wrote:
> > Hi Shubhangi,
> > 
> > just to make sure you are not heading in the wrong direction: are you sure
> > you really want to use OpenMP? Parallelization via MPI is fully integrated in
> > PDELab, rather easy to use, and would allow you to scale beyond a single
> > machine.
> > By using a parallel grid, matrix assembly will immediately scale as well, so
> > maybe that would be a better choice for you?
> > 
> > Best,
> > 
> > Linus
> > 
> > Am Freitag, 26. Juni 2020, 09:23:55 CEST schrieb Shubhangi Gupta:
> > > Dear Santiago,
> > > 
> > > Thanks a lot for your reply.
> > > 
> > > I was hoping it would be a bit easier than this to get OpenMP working,
> > > but I'll give it a shot. If by any chance it works, I'll get back to you :)
> > > 
> > > Warm wishes, Shubhangi
> > > 
> > > On 25.06.20 18:38, Santiago Ospina wrote:
> > > > Hi Shubhangi,
> > > > 
> > > > as far as I can tell, mainline PDELab is not able to do this. I know
> > > > that this was tried out in the EXADUNE project, but I don't know the
> > > > outcome of that implementation. Perhaps someone else can comment on
> > > > that. In general, each thread needs to own a copy of a
> > > > LocalFunctionSpace, an LFSCache, an assembler engine, and an entity
> > > > (this may require some modifications to these classes). Once that is
> > > > done, most of the assembler loop can run in parallel. The bind, load,
> > > > and assemble methods should be fine with multiple threads. The
> > > > problematic part is the unbind: that is when the local containers
> > > > from the assembler engines are scattered into the global container.
> > > > Since contiguous entities are likely to share DOFs or sit very close
> > > > in memory in the global container, data races may appear. I think
> > > > most of that is possible with C++ threads, but I might be wrong.
> > > > 
> > > > Please let us know if you get that working ;-)
> > > > 
> > > > Best,
> > > > Santiago Ospina
> > > > 
> > > > On Thu, Jun 25, 2020 at 2:10 PM Shubhangi Gupta <sgupta at geomar.de> wrote:
> > > >      Dear all,
> > > > 
> > > >      The matrix assembly is the main bottleneck for my numerical
> > > >      implementation in PDELab, so I am thinking of parallelizing
> > > >      this part using OpenMP. I understand that dune-pdelab is
> > > >      already capable of doing this, but I don't know where to start,
> > > >      and I have only a very superficial understanding of OpenMP.
> > > > 
> > > >      Is there an example that I can look at? Or can someone give me
> > > >      a quick outline of how to proceed?
> > > > 
> > > >      Thanks, and warm wishes, Shubhangi
> > > >      _______________________________________________
> > > >      dune-pdelab mailing list
> > > >      dune-pdelab at lists.dune-project.org
> > > >      https://lists.dune-project.org/mailman/listinfo/dune-pdelab
> > 
> 

-- 
Dr. Markus Blatt - HPC-Simulation-Software & Services http://www.dr-blatt.de
Pedettistr. 38, 85072 Eichstätt, Germany,  USt-Id: DE279960836
Tel.: +49 (0) 160 97590858



