[Dune-devel] SOCIS 2014: FEM assembly with threads

Mon Jun 30 11:48:07 CEST 2014

Hi Marco,

On Thu, Jun 26, 2014 at 02:41:23PM +0000, Agnese, Marco wrote:
> Hi DUNErs,
> as suggested by Markus I have written a very simple code which uses
> threads to assemble the stiffness matrix and the RHS of a FEM scheme
> for a 1D Poisson problem. 
> 
> You can find the code here: https://github.com/magnese/dune-common/blob/threads/dune/common/test/threadstest.cc
> 
> The code constructs a vector of grid points. The mesh is
> "partitioned" among threads. Each thread assembles a local stiffness
> matrix and than copy the values in the global stiffness matrix
> A. The only race condition which could happen is for the grid points
> which are the boundaries of the partitions. Therefore if we are
> using N threads the number "critical" points is N-1. To avoid a race
> condition I have created a mutex for each "critical" point (which is
> marked as shared). If an entry of type shared need to be inserted,
> assemble() takes the corresponding lock. This choice works in this
> case because the number of critical point is very small compared to
> the size of the vector. In general this approach is very
> inefficient.  

Nice piece of code. For our problems the number of shared points will
always be small compared to the number of points per grid.

Please be aware that the actual idea of the ParallelIndexSets and the
parallelization of dune-istl is that the threads only see and work on
their local portion of the datastructures. The parallel index set
provides a mapping from the global representation to the local
one. For MPI the global representation of the matrix/vector does not
even exist.
> 
> Now I am writing a more general class to obtain something similar of what we have for ParallelIndexSet. 
> Let's suppose that we have a vector v and we want to call several threads which will read and write on this vector. The class should work like this:
> 1) the user specifies all the positions for each thread which can be accessed by the thread
> 2) the class will find the which part of the vector are critical and will create several mutex to address the problem (computed only once)
> 3) the user give to the class the function to be called by the threads. This function can only read from the global vector and return a buffer of values which need to be inserted in the global vector
> 4) when the buffers are created, the values are inserted in the vectors directly if they aren't in a critical position, otherwise using the mutexs.
> Using the class, the user avoids to directly deal with threads and mutex. 
> 

One of the goals of this project is to reuse the existing parallel
solvers within dune-istl with threads (as is). For this to work each
thread needs to see only local matrices and vectors. I.e. all entries
are used for the computation within this very thread. To facilitate
the communication the parallel decomposition does not form a
partitioning. Instead, the parts of each thread is a little big larger
to hold e.g. one layer of additional unknowns per thread as overlap. 
http://www.dune-project.org/publications/parallel_istl_paper_ijcse.pdf

This overlap is used in combination of the parallel index sets for the
communication. See e.g. OverlappingSchwarzOperator for some code that
represents the parallel application of a matrix-vector multiplication.

> Obviously the difficult part it to implement an efficient and general lock mechanism. As soon as I have some code I will let you know. 
> 

Yes, you need this for general communication schemes, but maybe not
for all of them in the broadest sense. The most often used one by the
parallel solvers in dune-istl is
OwnerOverlapCopyCommunication::copyOwnerToAll(), where each thread
copies the values of his partition (which are part of the copy/overlap
region on other threads) to the other threads. Each value will only be
written once. Therefore we only need to make sure that no read
interferes with the write method. The easiest (yet not most efficient
way) would be to make sure the all threads exit the method at the same
time. Later on you can refine this approach.

Again I have to stress, that I am very much driven by the message
passing paradigm and so is the current (official/public) design of
dune-istl. If there are better ideas, we can discuss them.

> Therefore my objective right now is to implement this class and test it with the assemble(). What do you think? I would appreciate very much some feedbacks to understand if I am moving in the right direction or not. 
> 

For now I would prefer it the following way:

1. modify your assembly routine such that each thread sets up local
   matrix and a local vector with some overlap.

   What I mean by this is the following. Partition the unknows between
   the threads (the owner region). Then augment the partition by
   adding all unknowns to the set that are connected by a non-zero
   entry of the global matrix to an unknown already in the initial
   partition. The added unknowns form the overlap/copy part of the thread.
2. Create the corresponding parallel index sets.
3. Perform a parallel matrix vector product using
   OverlappingSchwarzOperator. 

Markus
-- 
Do you need more support with DUNE or HPC in general? 

Dr. Markus Blatt - HPC-Simulation-Software & Services http://www.dr-blatt.de
Hans-Bunte-Str. 8-10, 69123 Heidelberg, Germany
Tel.: +49 (0) 160 97590858
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <https://lists.dune-project.org/pipermail/dune-devel/attachments/20140630/57a57f6e/attachment.sig>