[Dune-devel] SOCIS 2014: FEM assembly with threads

Fri Jun 27 13:27:19 CEST 2014

Dear Jö, dear Christian,
thank you very much for your feedbacks, I appreciate it a lot. 

Actually I am not focus on an efficient assembly of the matrix but to write a class to manage a thread-safe writing on a shared vector (something like what happens with ParallelIndexSet). This should be my first goal, if I have understood it right.  

For what concern the design, I was thinking to this strategy to find out the best locking system for a general case. 
For each thread I know which elements of the vector it will access. I can therefore simply estimates if an element of the vector is shared or not. I can also describe the interactions between threads as graph where each node represents a thread; two nodes are connected if they share at least one element (a weight could be the number of shared elements). For this graph I extract the connected components; for each connected components, I choose the most connected node and I remove all the nodes connected to until I have a set of disjoint threads. All these threads can write simultaneously without any danger. I repeat this algorithm with the remaining threads, until all the threads have been considered. In this way I have obtained a list of sets of threads which can be executed simultaneously. This list is computed only once and it is used every time an update of the shared vector is performed. Obviously when the thread needs to write in a not shared element of the vector, it won't way anything.

Before implementing it I would like to know what you think about it since I would like to avoid spending a lot of coding hours for something which is useless/inefficient. 

Thank you very much,
cheers,
Marco.
________________________________________
From: Christian Engwer [christian.engwer at uni-muenster.de]
Sent: Thursday, June 26, 2014 8:40 PM
To: Agnese, Marco; dune-devel at dune-project.org
Subject: Re: [Dune-devel] SOCIS 2014: FEM assembly with threads

Dear all,

I think we are currently drifting a bit off-topic (at least wrt
SOCIS). For the solver it should not matter how you assembled the
matrix, that is actually a benefit of
shared-memory/multi-threading... you can assemble with a totally
different partition compared to solving and even use seq. assembly and
multi-threaded solving.

Ciao
Christian

PS: this does not mean that thinking about such questions is useless,
it is just not high-priority for the SOCIS project.

On Thu, Jun 26, 2014 at 09:27:50PM +0200, Jorrit Fahlke wrote:
> Am Thu, 26. Jun 2014, 14:41:23 +0000 schrieb Agnese, Marco:
> > Hi DUNErs,
> > as suggested by Markus I have written a very simple code which uses threads to assemble the stiffness matrix and the RHS of a FEM scheme for a 1D Poisson problem.
> >
> > You can find the code here: https://github.com/magnese/dune-common/blob/threads/dune/common/test/threadstest.cc
> >
> > The code constructs a vector of grid points. The mesh is "partitioned" among threads. Each thread assembles a local stiffness matrix and than copy the values in the global stiffness matrix A. The only race condition which could happen is for the grid points which are the boundaries of the partitions. Therefore if we are using N threads the number "critical" points is N-1. To avoid a race condition I have created a mutex for each "critical" point (which is marked as shared). If an entry of type shared need to be inserted, assemble() takes the corresponding lock. This choice works in this case because the number of critical point is very small compared to the size of the vector. In general this approach is very inefficient.
> >
> > Now I am writing a more general class to obtain something similar of what we have for ParallelIndexSet.
> > Let's suppose that we have a vector v and we want to call several threads which will read and write on this vector. The class should work like this:
> > 1) the user specifies all the positions for each thread which can be accessed by the thread
> > 2) the class will find the which part of the vector are critical and will create several mutex to address the problem (computed only once)
> > 3) the user give to the class the function to be called by the threads. This function can only read from the global vector and return a buffer of values which need to be inserted in the global vector
> > 4) when the buffers are created, the values are inserted in the vectors directly if they aren't in a critical position, otherwise using the mutexs.
> > Using the class, the user avoids to directly deal with threads and mutex.
> >
> > Obviously the difficult part it to implement an efficient and general lock mechanism. As soon as I have some code I will let you know.
> >
> > Therefore my objective right now is to implement this class and test it with the assemble(). What do you think? I would appreciate very much some feedbacks to understand if I am moving in the right direction or not.
>
> This is pretty similar to something I did for EXADUNE, see attachment.  Ignore
> the global-lock implementation at the top, there is an implementation with one
> mutex per element further down.
>
> The difference is: I'm doing this for dune-grid and you're implementing your
> own grid ;) But it should not matter greatly if your mesh entity is some
> complicated class or just some integer.  And even whether you're restricting
> yourself to vertices or not.
>
> Since you asked I'm going to dump my thoughts on this matter here:
>
> - A lock manager should not force locking on entities that do not require
>   locking.  My lock manager concept currently fails there, because it must
>   always return some kind of mutex when asked.
>
> - It should be possible to lock a single entity (for residual assembly).  This
>   is not a big deal, both our concepts allow that.
>
> - It should be possible to lock a pair of entities.  This can be necessary for
>   matrix assembly.
>
>   + It isn't actually necessary in your case (yet): your locking the entity
>     corresponding to row elem+i of your matrix, then you iterate over the
>     columns of the matrix, locking each one in turn.  Locking the columns is
>     not necessary, because no other thread will access the same row, either
>     because you locked it, or because it isn't shared.
>
>     If you know you matrix is going to be symmetric and you want to compute
>     the update to A[i][j] and A[j][i] only once at the same time, then the
>     matter is different.
>
>   + Your implementation depends on the mutex being recursive (which is
>     guaranteed for std::mutex, see the table at[1]).  It would be
>     nice if the lock manager could do without such a requirement, although one
>     might wonder whether it is worth to work around that limitation.
>
>   + My implementation compares the addresses of both mutexes to avoid trying
>     to double-lock the same mutex (which is quite frankly a hack).  The lock
>     manager should not require such hacks.
>
> - Locking three entities at once and beyond: I cannot think of a case where
>   this is necessary, so I would ignore this case.
>
> - The lock manager should allow alternate implementations to let you try out
>   different things.
>
>   + For instance, you may want to try out a global lock for all entities that
>     need locking, just to see how bad it really is.
>
>   + Or you may want to unconditionally have one lock per entity.
>
>   + Or you may want to use a lock manager that does not actually lock
>     anything.  Your results will be wrong, but you can time your program and
>     get an idea of the overhead from locking.
>
>   + Beyond 1D, you have many entities shared between multiple partitions, and
>     storage may become an issue.  You may want to use the same lock for
>     multiple entities (e.g. all entities shared between the same two
>     partitions).  This of course comes at the cost of locking too much, but it
>     may still be worth it: it is something that happens on the surface of the
>     partitions, which should be small compared to the interior, so threads
>     competing for the same lock should not happen too often.
>
>     Again this can pose a problem when locking two entities for matrix
>     assembly: the entities may share the same lock, even though they are
>     different entities.  This needs either detection of this condition or
>     recursive locks.
>
> - For ease of programming it would be nice to be able to use guard objects for
>   locking, which provide for automatic unlocking at end of scope and exception
>   safety.  Examples are std::lock_guard and std::unique_lock.
>
> Here is one approache how to achieve this under a general interface:
>
> Write a wrapper mutex that has a list of zero, one, or two (or even more)
> underlying mutexes.  When the wrapper is locked, it locks all the mutexes in
> the list using std::lock().  When it is unlocked, it unlocks all of them.
> This wrapper can be used as the mutex in std::lock_guard.
>
>   template<class BaseMutex>
>   class MutexList {
>     std::size_t nMutexes;
>     typdef BaseMutex *BaseMutexP;
>     BaseMutexP mutexList[2];
>   public:
>
>     MutexList();
>     MutexList(BaseMutex*);
>     MutexList(BaseMutex*, BaseMutex*);
>
>     // usual mutex interface
>   };
>
> The lock manager can then be queried for the mutexes of one or two entities,
> and it will return a wrapper with the list of the requested mutexes.  If the
> underlying mutexes are not recursive, the lock manager must ensure that the
> list does not contain the same mutex twice.
>
>   template<class BaseMutex>
>   struct FooLockManager {
>     typedef MutexList<BaseMutex> Mutex;
>
>     template<class Entity>
>     Mutex get(const Entity &);
>     template<class Entity>
>     Mutex get(const Entity &, const Entity &);
>   };
>
> Of course, the length of the list in the wrapper must be dynamic at runtime.
> This can be a challenge since std::lock() requires each mutex to be locked as
> a seperate function argument.  However, we can limit ourselves to lists of at
> most two underlying mutexes, so this should not pose a real problem.
>
> Regards,
> Jö.
>
> [1] https://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm#reference/reference.htm
>
> --
> Jorrit (Jö) Fahlke, Institute for Computational und Applied Mathematics,
> University of Münster, Orleans-Ring 10, D-48149 Münster
> Tel: +49 251 83 35146 Fax: +49 251 83 32729
>
> featured product: Debian GNU/Linux - http://www.debian.org

> #ifndef DUNE_PDELAB_COMMON_LOCKMANAGER_HH
> #define DUNE_PDELAB_COMMON_LOCKMANAGER_HH
>
> #include <cstddef>
>
> #include <dune/pdelab/common/elementmapper.hh>
>
> namespace Dune{
>   namespace PDELab{
>
>     //! LockManager that uses one global lock.
>     template<class Mutex>
>     class GlobalLockManager
>     {
>       Mutex mutex_;
>
>     public:
>       typedef Mutex value_type;
>
>       template<class Entity>
>       Mutex & operator[](const Entity &e)
>       {
>         return mutex_;
>       }
>     };
>
>     //! Hold a vector of mutexes
>     /**
>      * Mutexes usually can only be default-constructed, but cannot be copied
>      * or moved, neither by construction nor bei assignment.  This vector can
>      * hold a variable number of such mutexes.  The vector can be resized, but
>      * that reconstructs all mutexes, so references to old mutexes are no
>      * longer valid.
>      */
>     template<class Mutex>
>     class MutexVector
>     {
>       Mutex* data_;
>       std::size_t size_;
>
>     public:
>       typedef Mutex value_type;
>       typedef Mutex* iterator;
>
>       MutexVector(std::size_t size = 0) :
>         data_(0), size_(0)
>       {
>         resize(size);
>       }
>       ~MutexVector()
>       {
>         clear();
>       }
>
>       void clear() {
>         if(data_)
>           delete[] data_;
>         data_ = 0;
>         size_ = 0;
>       }
>
>       void resize(std::size_t newsize) {
>         clear();
>         if(newsize) {
>           data_ = new Mutex[newsize];
>           size_ = newsize;
>         }
>       }
>
>       Mutex &operator[](std::size_t i)
>       {
>         return data_[i];
>       }
>
>       std::size_t size() const
>       {
>         return size_;
>       }
>
>       iterator begin()
>       {
>         return data_;
>       }
>       iterator end()
>       {
>         return data_ + size_;
>       }
>
>     };
>
>     //! LockManager that uses one lock per Element.
>     template<class GridView, class Mutex>
>     class PerElementLockManager :
>       public MutexVector<Mutex>
>     {
>       ElementMapper<GridView> emapper;
>
>     public:
>       PerElementLockManager(const GridView &gv) :
>         emapper(gv)
>       {
>         this->resize(emapper.size());
>       }
>
>       Mutex & operator[](const typename GridView::template Codim<0>::Entity &e)
>       {
>         return MutexVector<Mutex>::operator[](emapper.map(e));
>       }
>     };
>
>   }
> }
> #endif // DUNE_PDELAB_COMMON_LOCKMANAGER_HH

> _______________________________________________
> Dune-devel mailing list
> Dune-devel at dune-project.org
> http://lists.dune-project.org/mailman/listinfo/dune-devel

--
Prof. Dr. Christian Engwer
Institut für Numerische und Angewandte Mathematik
Fachbereich Mathematik und Informatik der Universität Münster
Einsteinstrasse 62
48149 Münster

E-Mail  christian.engwer at uni-muenster.de
Telefon +49 251 83-35067
FAX             +49 251 83-32729