[Dune-devel] New assembly mode for BCRSMatrix

Steffen Müthing steffen.muething at ipvs.uni-stuttgart.de
Thu Aug 1 14:37:52 CEST 2013


Hi Markus,

Am 01.08.2013 um 13:02 schrieb Markus Blatt:

> Hi,
> 
> On Wed, Jul 24, 2013 at 11:55:02AM +0200, Steffen Müthing wrote:
>> Hi everybody,
>> 
>> as there won't be a GSoC project for the new BCRSMatrix assembly mode, Peter hired a student in Heidelberg
>> (Dominic Kempf) to implement it instead (PDELab *really* needs this to get memory usage under control).
>> 
> 
> it is very nice to see that there is so much interest in this and
> people already took the time to implement a prototype. Given the
> reactions and willingness to donate mentoring time for this project, I
> did not expect this.
> 
>> We now have a prototype that is working extremely well - what's left IMO is ironing out the details like naming and API.
>> Here is what we've got right at the moment: There is a new build mode (currently creatively called "mymode") that
>> builds the matrix in a way that is similar to PETSc / Trilinos. The new mode has two parameters: The average number
>> of nonzeroes per matrix row and an overflow, which is given as a fraction of the total number of nonzeroes in the matrix.
>> 
> 
> What are the advantages of having yet another build mode instead of
> implementing it similar to the MTL with an inserter object.
> 
> What I liked about their approach was:
> 1. It does not clutter the actual matrix interface. Instead the
> creation is clearly separated from the rest.

That might be true - on the other hand, it requires *significantly* more code. Did you look at their implementation?
The inserter for their CRS matrix is about 500 lines, and a lot of that just has to forward stuff like number of rows / columns
etc.

> 2. Code might be reused for other matrices too. Even using an inserter
> for a dense matrix to facilitate generic programming is
> possible. Furthermore, reusing code to set up PETSc matrices seems
> like a viable further extension.

Well, there isn't really any inserter code reuse in MTL - that thing is specialized for every one of their matrix implementations.
There isn't really a way around that because to work efficiently, the inserter has to exploit the internal data structures of the
matrix. PETSc for example will have to work completely differently (that's why PDELab now has those LocalMatrixViews):
In PETSc, you absolutely *have to* batch your matrix updates as much as possible to get good performance because every call
to the matrix update method in PETSc involves 2-3 function calls, at least one of them through a function pointer. When I get around
to fixing the PETSc backend in PDELab, it will update all entries for a LocalFunctionSpace with a single call to PETSc.
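
To make the batching point concrete, here is a sketch against the plain PETSc C API
(MatSetValues/ADD_VALUES are the real PETSc calls; the function and variable names
around them are made up for illustration, this is not the PDELab backend code):

#include <petscmat.h>
#include <vector>

// Scatter one local (element) matrix into the global PETSc matrix with a
// single call instead of one call per entry.
void scatterLocalMatrix(Mat A,
                        const std::vector<PetscInt>& rows,
                        const std::vector<PetscInt>& cols,
                        const std::vector<PetscScalar>& values) // row-major, rows.size()*cols.size()
{
  // Slow variant: one (indirect) PETSc call per entry.
  // for (std::size_t i = 0; i < rows.size(); ++i)
  //   for (std::size_t j = 0; j < cols.size(); ++j)
  //     MatSetValue(A, rows[i], cols[j], values[i*cols.size()+j], ADD_VALUES);

  // Batched variant: the whole local matrix in one call.
  MatSetValues(A,
               static_cast<PetscInt>(rows.size()), rows.data(),
               static_cast<PetscInt>(cols.size()), cols.data(),
               values.data(), ADD_VALUES);
}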

Even if we go for an inserter, I really don't like their syntax with the streaming operator<<, where you have to decide beforehand
what it will do (add, set, subtract). What if you need to do something else with it? That's a very weird interface IMHO - just give me
a reference and be done with it!
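
Just to make the difference concrete, here is a toy comparison of the two styles
(neither class is the real MTL inserter nor anything from ISTL, purely illustrative):

#include <cassert>
#include <map>
#include <utility>

enum class Op { set, add, subtract };

// Streaming style: the update operation is fixed when the inserter is created.
class StreamingInserter {
  std::map<std::pair<int,int>, double>& data_;
  Op op_;
  std::pair<int,int> pos_{0,0};
public:
  StreamingInserter(std::map<std::pair<int,int>, double>& d, Op op) : data_(d), op_(op) {}
  StreamingInserter& operator()(int i, int j) { pos_ = {i,j}; return *this; }
  StreamingInserter& operator<<(double v) {
    double& e = data_[pos_];
    switch (op_) {
      case Op::set:      e  = v; break;
      case Op::add:      e += v; break;
      case Op::subtract: e -= v; break;
    }
    return *this;
  }
};

// Reference style: entry() just hands out a reference, the caller decides.
struct RefMatrix {
  std::map<std::pair<int,int>, double> data;
  double& entry(int i, int j) { return data[{i,j}]; }
};

int main() {
  RefMatrix A;
  StreamingInserter ins(A.data, Op::add);  // operation chosen up front
  ins(0,0) << 1.0;                         // from here on it can only add
  A.entry(0,0) += 1.0;                     // reference: add, scale, compare, ...
  A.entry(0,0) *= 0.5;
  assert(A.entry(0,0) == 1.0);
  return 0;
}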

> 
>> Example: You have a matrix with 1 000 rows and 25 entries on average per row (25 000 entries in total). Specifying
>> an overflow of .05 gives you a total capacity for 25 000 + 1 250 = 26 250 nonzero entries. If you try to add more entries
>> than that, you get an exception.
>> 
> 
> Actually, I think specifying an overflow should be
> unnecessary. Can't the user incorporate the overflow into the average
> number of nonzeros, e.g. use 10.1 instead of 10? Therefore I would
> prefer skipping it. As floating point numbers for the number of
> nonzeros per row seem natural, we could also simply ask the user to
> provide an upper bound on the total number of nonzeros expected.

The reason for this split approach is that Peter and Dominic have come up with a more memory-efficient way of
doing the compression step: The new code allocates arrays for the column indices and values of size (N*row_avg*(1+overflow)).
The overflow area (which is not used initially because overflow entries end up in a map during insertion) is now placed
*at the beginning* of those arrays, i.e. the entries for row 0 start at index int(N*row_avg*overflow). Then, during compression, those
entries are moved to the beginning of the arrays (and additional entries from the map are inserted, slowly eating up the margin
provided by the overflow area). If you have an unstructured grid, the number of entries per row can vary quite a lot, so you want to
be able to have a large overflow area, but that's not possible if you only specify a single number.
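
In case the description is hard to follow, here is a stripped-down sketch of the idea.
This is *not* Dominic's code: it is scalar only, leaves out all the block/window
machinery, and the names are made up, but it shows the front-loaded overflow margin,
the insertion rule and the compression step:

#include <cstddef>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Scalar CRS prototype: one flat array for column indices and one for values,
// with the unused overflow margin sitting at the *front*.
struct ProtoCRS {
  std::size_t N, avg, margin;          // rows, reserved entries per row, front margin
  std::vector<std::size_t> colIndex;   // size margin + N*avg
  std::vector<double> values;          // same layout
  std::vector<std::size_t> rowSize;    // entries currently stored per row
  std::map<std::pair<std::size_t,std::size_t>, double> overflow; // (row,col) -> value

  ProtoCRS(std::size_t rows, std::size_t rowAvg, double overflowFraction)
    : N(rows), avg(rowAvg)
    , margin(static_cast<std::size_t>(rows * rowAvg * overflowFraction))
    , colIndex(margin + rows*rowAvg), values(margin + rows*rowAvg)
    , rowSize(rows, 0)
  {}

  // Insertion: as long as row i has fewer than 'avg' entries, the entry goes
  // straight into the arrays (behind the margin); everything else is parked
  // in the overflow map until compress().
  double& entry(std::size_t i, std::size_t j) {
    const std::size_t begin = margin + i*avg;
    for (std::size_t k = 0; k < rowSize[i]; ++k)
      if (colIndex[begin+k] == j)
        return values[begin+k];
    if (rowSize[i] < avg) {
      colIndex[begin + rowSize[i]] = j;
      return values[begin + rowSize[i]++];
    }
    return overflow[{i,j}];
  }

  // Compression: move each row to the front of the arrays and merge in its
  // overflow entries; rows that grew beyond 'avg' slowly eat up the margin.
  void compress() {
    std::size_t write = 0;                      // next free slot from the front
    for (std::size_t i = 0; i < N; ++i) {
      const std::size_t begin = margin + i*avg; // where row i's direct entries sit
      const std::size_t limit = (i+1 < N) ? margin + (i+1)*avg : colIndex.size();
      const std::size_t rowStart = write;
      for (std::size_t k = 0; k < rowSize[i]; ++k, ++write) {
        colIndex[write] = colIndex[begin+k];
        values[write]   = values[begin+k];
      }
      for (auto it = overflow.lower_bound({i,0});
           it != overflow.end() && it->first.first == i; ++it, ++write) {
        if (write >= limit)                     // would clobber the next row's data
          throw std::length_error("overflow margin exhausted");
        colIndex[write] = it->first.second;
        values[write]   = it->second;
      }
      rowSize[i] = write - rowStart;
      // sorting of [rowStart, write) by column index happens only here,
      // not on every insertion (see below)
    }
    overflow.clear();
  }
};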

The one major disadvantage of this approach is that you permanently waste some memory (if your row average or your overflow
fraction were too large), but you avoid having to resize the arrays, which temporarily doubles your memory requirements. This short
spike in memory usage is really the central problem for PDELab because it unnecessarily limits the matrix size you can fit on a node.

During testing, we found that you can dump quite a few entries into the map without a major performance impact, so a bit of
twiddling with the two parameters easily gets you up to a used memory fraction of .95+. So the tradeoff is: You have to tweak a bit,
but that allows you to really get your matrix as big as possible.

Another interesting observation was that we could speed up the code by *not* sorting the per-row entries during insertion, but only
at the compression stage.

> 
> I am unsure whether there should be an alternative to throwing an
> exception. Think of a production parallel simulation where you experience
> such an exception after a few days of runtime.

Well, we could implement that (you'd have to resize the column index and the value array). That said, it should normally not be
difficult to determine the pattern size in a small model simulation before going big...

> 
>> Those parameters are currently set via setMymodeParameters(), but
>> I'm not sure whether that's the best way of doing things.
> 
> One more argument for stripping the build mode off into an inserter class.
> 
>> After setting the parameters, you set the size of the matrix and can start assembling (there is also a new constructor that
>> takes the additional parameters for direct matrix setup in a single step).
>> 
>> In order to avoid slowing down operator[] (and because the CompressedBlockVectorWindows aren't functional at this stage),
>> you have to access the matrix entries using entry(row,column) during construction. As long as there are fewer entries in the
>> current row than the specified average, new entries will be placed directly in the matrix; otherwise they end up in a std::map
>> for now.
>> 
> 
> IMHO this is another strong argument for using an inserter class. Having
> different access methods for different stages seems rather confusing
> to me. Additionally, it will not be possible to use existing
> algorithms based on our matrix access scheme.

I don't quite understand what you mean by existing algorithms. Are you talking about algorithms that operate on the matrix, like a scalar
product? That doesn't work anyway during matrix setup. Apart from that, I really don't like the idea of having to fiddle around with this
additional inserter object. Frameworks like PDELab have to support repeated assembly into the same matrix. With the current implementation,
I just have to check the current build stage and call either [][] or entry(). With an inserter, I would have to dynamically allocate and deallocate that
thing (to call its destructor) and lug it around… I don't like that. 
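
Roughly, what I have in mind on the framework side is the sketch below. The
buildStage()/built names are just placeholders here, not necessarily what ends up in
the cleaned-up API:

template<typename Matrix>
void accumulate(Matrix& A,
                typename Matrix::size_type i,
                typename Matrix::size_type j,
                const typename Matrix::block_type& update)
{
  // Pattern already finished: normal access via operator[][].
  if (A.buildStage() == Matrix::built)
    A[i][j] += update;
  // Still assembling: go through entry(), which may cache in the map.
  else
    A.entry(i,j) += update;
}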

> 
>> When you are done with the assembly, you have to finish up the construction process by calling compress(). That call frees the
>> map used as intermediate storage and optimizes data storage, but it does not reclaim any excess memory requested via the row
>> average and overflow parameters. After the call, you cannot add additional entries anymore and the matrix works just as if it had
>> been constructed the traditional way. compress() also returns a struct with some statistics about your matrix: The average and
>> maximum number of entries per row, the number of entries that had to be cached in the map and the fraction of the allocated
>> memory that is actually in use.
> 
> Again this seems clear with an inserter class. There this would simply
> happen in the destructor. Still, I have to admit that the user needs to
> make sure that the destructor is called by using correct scoping.

As I said, in any kind of framework, this involves allocating and deallocating the inserter on the heap. I don't see how that is simpler than
calling a compress() method...
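
For reference, here is what the whole lifecycle looks like from the user side against
Dominic's branch as I understand it. "mymode", setMymodeParameters(), entry() and
compress() are the prototype's working names from this thread; the statistics field
names in the comment are placeholders, not the actual struct members:

#include <cstddef>
#include <dune/common/fmatrix.hh>
#include <dune/istl/bcrsmatrix.hh>

using Block  = Dune::FieldMatrix<double,1,1>;
using Matrix = Dune::BCRSMatrix<Block>;

void assembleExample()
{
  Matrix A;
  A.setBuildMode(Matrix::mymode);     // the new (still badly named) build mode
  A.setMymodeParameters(25, 0.05);    // avg entries per row, overflow fraction
  A.setSize(1000, 1000);

  // Assembly: entry(i,j) places the value directly while the row has fewer
  // than 'avg' entries, otherwise it goes into the intermediate map.
  for (std::size_t i = 0; i < 1000; ++i)
    A.entry(i,i) += 4.0;

  // Finish construction: merges the map, compacts the arrays and returns
  // some statistics (field names below are placeholders).
  auto stats = A.compress();
  // e.g. stats.avg_entries_per_row, stats.overflow_entries, stats.mem_ratio

  // From here on the matrix behaves like any other BCRSMatrix.
  A[0][0] += 1.0;
}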

> 
>> 
>> During construction, you cannot copy matrices. If a matrix is copied after construction, the copy constructor only allocates as much
>> memory as is actually needed by the matrix, throwing away unused overflow space.
>> 
>> The one problem with this approach is the fact that matrix entries are created on the fly, and if they are of type FieldMatrix with a
>> primitive field type, you run into FS1322.
>> 
>> We did some benchmarking with this new scheme, and it really improves memory usage and (initial) assembly time quite a bit: In
>> PDELab (which right now uses a std::vector<std::unordered_set<std::size_t> > to build up the pattern before creating the matrix),
>> memory usage goes down by a factor of at least 2 (and a lot more if you have a large pattern), and initial assembly is sped up as well
>> (depending on pattern size, the improvements range from just a few percent up to a reduction by a factor of 2 or 3). Interestingly, the
>> mode is quite tolerant regarding the overflow area: I've tried with large matrices, and even if you end up with a million entries in the
>> map, performance only goes down by something like 10-20%. If you are really concerned about matrix size, you can use this to
>> compensate for the smaller pattern at the domain boundary: Just specify a smaller row average and add enough overflow. ;-)
>> 
> 
> Sounds promising.
> 
>> You can find Dominic's code on GitHub (https://github.com/dokempf/dune-istl/tree/feature/new-fast-build-mode), and if you have
>> PDELab, you can play with the new build mode on the branch
>> p/muething/new-bcrs-matrix. Check out
>> dune/pdelab/test/testpoisson.cc
>> as an example. Just note that you might need the fix from FS1322. The code in ISTL isn't really cleaned up yet, but I thought it might
>> be a good time to get some initial feedback.
>> 
> 
> I am not sure when I will find the time to test this.
> 
>> Any comments / ideas / problems / naming suggestions?
>> 
> 
> Have you thought about exting a sparsity pattern of an already setup
> matrix. For the AMG for non-overlapping grids Rebecca might take
> advantage of this.

I don't quite understand that sentence, looks like your autocorrection messed it up. But if you are talking about multiple BCRSMatrices
sharing a pattern, that has been implemented since 2010 by Carsten: If you copy a matrix that has been correctly set up, both matrices
will share the pattern through a shared_ptr.

Steffen

> 
> Cheers,
> 
> Markus
> 
> -- 
> Do you need more support with DUNE or HPC in general? 
> 
> Dr. Markus Blatt - HPC-Simulation-Software & Services http://www.dr-blatt.de
> Hans-Bunte-Str. 8-10, 69123 Heidelberg, Germany
> Tel.: +49 (0) 160 97590858  Fax: +49 (0)322 1108991658 
> 

Steffen Müthing
Universität Stuttgart
Institut für Parallele und Verteilte Systeme
Universitätsstr. 38
70569 Stuttgart
Tel: +49 711 685 88429
Fax: +49 711 685 88340
Email: steffen.muething at ipvs.uni-stuttgart.de
