Tuesday 27 August 2019

Update on Trip timings for multivariate arithmetic

New Trip timings


In my last blog I wrote about the timings we did for the OpenDreamKit project for parallel multivariate arithmetic.

After publishing those timings, Mickael Gastineau of the Trip project let me know that there is an undocumented feature for modular arithmetic, using the Mod(., p) command. He also told me that they had not linked in a fast parallel memory allocator. He released a patch update version of the publicly available Trip with this fix.

You can see the new timings, including for mod p arithmetic below.



Cores Trip n=16 ODK n=16 Trip n=20 ODK n=20
1 24.0s 10.5s 140s 120s
2 11.9s 5.55s 69.9s 60.0s
3 8.06s 3.80s 46.7s 41.0s
4 6.13s 2.95s 35.2s 31.4s
6 4.19s 2.09s 25.0s 21.1s
8 3.19s 1.60s 18.7s 16.1s
10 2.60s 1.30s 15.1s 13.0s
12 2.19s 1.09s 12.6s 10.9s
14 1.99s 0.96s 10.8s 9.4s
16 1.96s 0.86s 9.7s 8.3s
20 1.42s 0.72s 7.98s 6.7s
24 1.30s 0.64s 6.72s 5.6s
28 1.17s 0.55s 6.01s 4.9s
32 1.10s 0.50s 5.36s 4.9s
Sparse Multiplication over Z

Cores Trip n=16 ODK n=16 Trip n=20 ODK n=20
1 27.9s 9.15s 175s 53.9s
2 14.0s 4.73s 97.5s 28.1s
3 9.38s 3.29s 59.1s 18.7s
4 7.18s 2.49s 48.5s 14.3s
6 5.14s 1.68s 31.8s 9.48s
8 3.80s 1.29s 23.3s 7.26s
10 3.02s 1.04s 19.0s 5.96s
12 2.60s 0.86s 15.6s 4.99s
14 2.25s 0.74s 13.5s 4.22
16 1.97s 0.66s 11.7s 3.76s
20 1.70s 0.56s 9.72s 3.13s
24 1.44s 0.49s 8.50s 2.65s
28 1.29s 0.43s 7.35s 2.27s
32 1.21s 0.38s 6.79s 2.04s
Sparse Multiplication over Z/pZ

Cores Trip Z ODK Z
1 25.6s 5.08s
2 12.8s 2.56s
3 8.55s 1.71s
4 6.44s 1.29s
6 4.38s 0.86s
8 3.36s 0.67s
10 2.81s 0.52s
12 2.34s 0.46s
14 1.98s 0.40s
16 1.79s 0.34s
Dense Multiplication over Z

Cores Trip Z/pZ ODK Z/pZ
1 30.5s 3.64s
2 15.3s 1.83s
3 10.2s 1.22s
4 7.65s 0.92s
6 5.21s 0.62s
8 3.97s 0.46s
10 3.17s 0.37s
12 2.86s 0.31s
14 2.51s 0.28s
16 2.08s 0.24s
Dense Multiplication over Z/pZ

Conclusion


As you can see, there are some really impressive speedups for Trip from 16 to 32 cores. Below that the times seem to have gone up a little, but overall the scaling with the number of cores is much better.

The Trip Z/pZ times are often slower than the Z times, which is the opposite for us.

By the way, as the ODK project is coming to a close and my other work related things now have a blog elsewhere, this blog will probably now revert to just being a private blog again. I will no doubt not be able to resist doing a post here when Oscar is finally ready for its first prerelease/release, and there may be some new timings for giac, but otherwise it'll be mostly personal projects now.

I may post some articles on speeding up graphics on the CGA, EGA and VGA adapters. So stick around if that interests you.

Wednesday 7 August 2019

Fast Parallel Multivariate Arithmetic - Final Report!



We are hiring: see below!

As many readers of my blog know, we received funding here at TU Kaiserslautern in Germany from the European Union H2020 initiative to develop fast parallel multivariate polynomial arithmetic as part of the OpenDreamKit collaboration.

OpenDreamKit is a set of tools developed here in Europe for constructing Virtual Research Environments for scientific/mathematical research and other collaborations, and our site has been part of the High Performance team.

As the OpenDreamKit (ODK) project is drawing to a close, it is time to update everyone on how our project for parallel multivariate arithmetic has gone.

Here in Kaiserslautern, Daniel Schultz has been writing the code for this. There's now well over 100,000 lines of new, really high quality C code and test code, plus associated documentation.

You will remember from previous blog posts of mine that we had the following aims:
  • Parallelise multivariate multiplication, division and GCD
  • Support lex, deglex and degrevlex orderings
  • Support 64 bit exponents with any number of variables [1]
  • Aim for linear speedup with the number of cores [2]
  • Provide divisibility testing, exact division, quotient only division, division with remainder
  • Make it available via the computer algebra system Singular [3]
  • Support Z, Q, Z/pZ as coefficient rings [4]
  • Competitive single core and multicore timings
  • Support dense and sparse arithmetic [5]
  • Open source
This might seem like an impossible list, but we will let you judge if we have been successful.

In previous blog posts we've talked about our single core multiplication and division timings and our multicore multiplication timings versus many other systems.

In this blog we will summarise the multicore performance for multiplication, division and GCD.

Note that many other things were implemented in single core versions: various kinds of evaluation and composition, ideal reduction, powering, derivative and integral, input parsing and a panoply of support functions [12].

Benchmarking


We've decided to only benchmark systems that we know can do all of the following:
  • Parallel multivariate arithmetic
  • Support at least one of the main polynomial orderings: lex, deglex or degrevlex
  • Support at least one of Z, Q, Z/pZ
This means there are only three contenders:
  • Maple 2019 (deglex)
  • Giac 1.5.0 (lex)
  • Trip 14.62 (lex)
Maple and Giac currently optimise for small numbers of cores, e.g. two to four cores [6]. Trip is optimised for many cores.

Note that Maple is proprietary and closed source, Trip is closed source but available for academic use, and Giac is free and open source. All our software is free and open source.

As we were not able to install all software on the same machine, we benchmark against each system separately on different machines. Thus the timings between the different systems are not directly comparable.

To avoid conversion costs to and from Singular polynomials [7] we did all our timings with Oscar/Nemo which directly interfaces to Flint [8].

Benchmarks


We decided to benchmark the following algorithms:
  • Sparse multiplication over Z and Z/pZ, (Pearce n = 16 and n = 20)
  • Dense multiplication over Z and Z/pZ, (Fateman n = 30)
  • Sparse divisibility testing over Z and Z/pZ, (Pearce n = 16) [9]
  • Sparse GCD over Z (Schultz n = (7,5),(4,8))
These are the algorithms that we have spent the most effort parallelising [10]. 

Code for the benchmarks can be found at [13] and [14].

Maple 2019


Maple is not installed on our development server and we thank Michael Monagan for providing access to a machine on which the latest Maple is installed.

We were not able to pin threads when benchmarking Maple [17] and the machine we had access to only had 16 cores, so we only used 12 cores to be respectful of other users.

Unfortunately, for many of our benchmarks, unpredictable and long running garbage collection/conversion costs dominated some of the Maple timings. It was not practical to run the benchmarks to completion. We present here only the timings that did not suffer from this problem.


Cores Maple Z ODK Z Maple Z/pZ ODK Z/pZ
1 30.87s 5.79s 30.97s 4.57s
2 15.12s 2.94s 15.00s 2.33s
3 9.60s 2.04s 9.42s 1.59s
4 6.93s 1.58s 6.97s 1.23s
6 4.95s 1.10s 4.62s 0.89s
8 3.58s 0.88s 3.55s 0.66s
10 2.97s 0.72s 2.93s 0.58s
12 2.52s 0.62s 2.47s 0.49s
Dense Multiplication

Trip 14.62


We were able to install Trip [18] on our Gentoo development server with 32 cores.

We were able to pin cores when benchmarking Trip. For consistency we used the performance governor with turbo off. We did not use a specialised malloc (however Trip does), but Flint has its own multithreaded memory manager [15]. Edit: Mickaël Gastineau has informed us that due to a linking issue in the version of Trip that we used, the system allocator was being used. He has issued a new version which we will soon benchmark here on this blog.

Trip offers a variety of polynomial representations. We picked the one that seemed to give the best timings. Note that our code automatically converts on the fly so that no ahead-of-time selection is required.

Trip has very different applications to our system, so not many of our benchmarks are supported [19]. Edit: Mickaël Gastineau has informed us that mod p arithmetic is now available through an undocumented feature. We will do timings for comparison in an upcoming blog.


Cores Trip n=16 ODK n=16 Trip n=20 ODK n=20
1 21.9s 10.5s 130s 120s
2 11.5s 5.55s 67.2s 60.0s
3 8.02s 3.80s 45.7s 41.0s
4 6.02s 2.95s 33.4s 31.4s
6 3.96s 2.09s 22.9s 21.1s
8 3.29s 1.60s 17.3s 16.1s
10 2.68s 1.30s 15.5s 13.0s
12 2.30s 1.09s 12.9s 10.9s
14 2.07s 0.96s 11.0s 9.4s
16 1.96s 0.86s 10.1s 8.3s
20 1.79s 0.72s 9.39s 6.7s
24 1.73s 0.64s 9.08s 5.6s
28 1.62s 0.55s 8.87s 4.9s
32 1.53s 0.50s 8.57s 4.9s
Sparse Multiplication over Z

Cores Trip Z ODK Z
1 24.4s 5.08s
2 12.3s 2.56s
3 8.30s 1.71s
4 6.21s 1.29s
6 4.24s 0.86s
8 3.17s 0.67s
10 2.87s 0.52s
12 2.47s 0.46s
14 2.01s 0.40s
16 1.78s 0.34s
Dense Multiplication

Giac 1.5.0


Giac did not install successfully on our Gentoo development server [20]. The only system I could install it on was my four core laptop [21]. Therefore the giac timings should not in any way be considered scientific!

The current version of giac is only optimised for two or four threads. Bernard Parisse reports that he is currently working on improving much of the multivariate arithmetic and threading. He also notes that giac has to internally convert from a new polynomial representation to a historic representation, which can significantly increase times. It may be better to compare with Singular rather than Oscar/Nemo.

We did not do any thread pinning for the giac timings.

Some operations did not complete in a reasonable time. We marked these with a dash.

Cores Giac Z n=16 ODK Z n = 16 Giac Z/pZ n=16 ODK Z/pZ n=16 Giac Z n=20 ODK Z n=20 Giac Z/pZ n=20 ODK Z/pZ n=20
1 18.3s 12.0s 153s 7.85s 162s 92.2s ----- 45.3s
2 16.2s 7.42s 153s 4.85s 217s 63.5s ----- 25.0s
3 18.6s 5.91s 158s 3.49s 191s 40.7s ----- 18.1s
4 13.4s 5.67s 153s 2.77s 196s 36.7s ----- 14.7s
Sparse Multiplication

Cores Giac Z ODK Z Giac Z/pZ ODK Z/pZ
1 5.65s 3.61s 4.12s 2.73s
2 3.64s 1.91s 2.61s 1.45s
3 2.64s 1.29s 1.83s 1.01s
4 2.40s 0.99s 1.61s 0.75s
Dense Multiplication

Cores Giac Z ODK Z
1 41.6s 14.8s
2 38.8s 8.03s
3 37.5s 5.31s
4 36.8s 4.22s
Sparse GCD

Other OpenDreamKit (ODK) Timings


Divisibility testing with quotient does not seem to be available in any of the systems we are comparing against [11].  It is available in Magma [16], but this is not parallelised.

[We have since been told that Maple does have divisibility testing but does not return the quotient. However, apparently it internally does sufficient work to compute the quotient, so we could have safely benchmarked it. Even so, the multiplication does not terminate for us in a reasonable time, so we are unable to test the divisibility testing in Maple.]

Thus, we present this benchmark for our system only.

For consistent timings that scale well we used thread pinning, the performance governor and disabled turbo. We do not use a specialised malloc, however Flint has a multithreaded memory manager [15].


Cores ODK Z ODK Z/pZ
1 9.96s 9.57s
2 5.52s 3.30s
3 3.64s 2.41s
4 2.68s 1.97s
6 1.92s 1.72s
8 1.55s 1.38s
9 1.26s 1.11s
12 1.14s 1.01s
14 0.92s 0.81s
16 0.88s 0.76s
Sparse Divisibility Testing

As scaling with many threads for our GCD code is not shown above, we give the timings here.

Cores ODK Z
1 21.58s
2 11.10s
3 7.46s
4 5.74s
6 3.93s
8 3.02s
10 2.50s
12 2.13s
14 1.89s
16 1.70s
Sparse GCD

We spare the reader all the Z/pZ timings for multiplication, as they scale much as the timings over Z do.

[We have since discovered we can do this on a single core in about 3s, but are still trying to figure out how to parallelise that trick.]

Current and Future Work


We are currently working on integration into Singular. This is almost done, with only rational functions still to be done. Our final ODK report will contain timings and details.

We also plan on implementing multivariate factoring directly on top of our multivariate arithmetic. Some progress has already been made on this, but it goes well beyond the scope of the ODK project. It will likely be a side project of Daniel Schultz, probably over a period of months or years.

We are hiring!


Daniel and I will be here in Kaiserslautern long after ODK finishes, but we are also hiring new people. There are numerous open postdocs in Kaiserslautern after Prof. Wolfram Decker managed to secure lots of new funding.

We are looking for individuals who:

* Have good programming skills
* Have a PhD
* Interest in some area of computer algebra (broadly interpreted)
* Are interested in being sited at TU Kaiserslautern

If that's you or you know someone suitable, don't hesitate to put them in contact with us. You can find our email addresses on the University website.

Dr. William Hart.

Notes


[1] We also support multiprecision exponents for some operations.

[2] For sufficient sized problems.

[3] The code is actually in Flint, which is a component of Singular. It is also available through Oscar/Nemo.

[4] We also support Fq in the new Flint fq_nmod_mpoly interface.

[5] We also support what we call "semi-sparse" arithmetic, which is sometimes specially optimised for.

[6] Bernard Parisse tells us he will be optimising for more cores in future.

[7] Groebner bases require fast (monomial)x (polynomial) operations, and Singular is optimised to do many fundamental operations without memory allocation. This is achieved using a custom memory manager (omalloc) and a linked list implementation.

[8] Note that a very small patch is currently required in Nemo to avoid use of Julia's counted_malloc, which isn't threadsafe for use with Flint threads. Essentially we comment out a few lines in Nemo.jl that redirect the memory manager functions in GMP and Flint to Julia's counted malloc.

[9] We use Pearce's multiplication benchmark, then check divisibility by the second factor and return the quotient if it is divisible.

[10] Some dense algorithms are also parallelised by earlier ODK work on parallelising the FFT, however we are not reporting on that work here.

[11] Divisibility testing usually requires a heuristic test to quickly exclude inexact divsion, and a divrem with early termination. If one does not have this then it will be really slow in the case of inexact division. We time only exact division, as this requires the expensive divrem step, with early termination in case it made it past our heuristic test. 



[14] Flint profiling code, see for example: https://github.com/wbhart/flint2/tree/trunk/fmpz_mpoly/profile

[14a] No one will read all these comments, which is why they are comments.

[15] Developed for the ODK project by me, see: https://github.com/wbhart/flint2/blob/trunk/fmpz/link/fmpz_single.c

[16] We also have it on Good Authority TM that (some recent version of) Magma does not do a proper divisibility test, but only a heuristic test, which we do not accept to be sufficient for this benchmark.

[17] We also did not use thread pinning or adjust the CPU governor when timing Oscar/Nemo on this machine.

[18] It didn't install out of the box, but we were easily able to install older versions of certain prerequisites that then allowed Trip to run.

[19] Trip does not provide exact division but has division with remainder. But this is not optimised for performance.

[20] I tried to install it from source, because I do not have sudo access. But Bernard reports that the timings we were obtaining were not reasonable, probably due to some misconfiguration which we were unable to identify.

[21] This was a Windows system, so I had to use the WSL. Unfortunately this is not a recent enough version of Ubuntu, so I attempted an upgrade which was partially successful. It was at least enough to get giac 1.5.0 stable working with timings that Bernard Parisse agrees are reasonable.

Saturday 23 September 2017

Parallel multivariate multiplication


Since my last blog on multivariate polynomial arithmetic, a lot has happened, so it's time to give an update, including our first parallel sparse multiplication benchmarks! In fact, there is so much to report that I'm restricting this blog to multiplication only!

The most exciting news since last time is that Daniel Schultz has gotten parallel sparse multiplication working. Daniel is the person we hired on the OpenDreamKit grant to work on parallel multivariate arithmetic. He's also implemented numerous improvements to the code, which I'll mention below.

After the last blog post, I discovered that a number of things affect the timings:

  • Polynomial ordering (lex, deglex, degrevlex)
  • Ordering of variables (x, y, z, t, u)
  • Whether one computes f*g or g*f, or in the case of division, p/f or p/g

Obviously in a perfect system, none of these would matter, but especially with our old code and with certain systems, they do. In particular, some of the division timings I gave previously for Flint, were invalid. I've now corrected them on the old blog.

In this blog, I will give new timings which always force lex ordering, for x > y > z > t > u, and will always take f*g.

Sdmp and Maple prefer deglex, so I will time these separately. We also don't have them available on our benchmark machine anyway.

Magma, Sage and Singular support a virtual panoply of orderings, whilst some systems are restricted just to lex or just to deglex. Flint supports lex, deglex and degrevlex. (Since there is no agreement on what revlex should be, we decided not to support it.)

We also discovered that we had implemented degrevlex incorrectly, and this has now been fixed. This didn't affect the timings in the last blog.

Sage division improvements


In the last blog we noted that Sage performed much worse than Singular for division over Z. The Sage developers immediately took note of this and set about fixing it. The latest release is 8.0 and it doesn't seem to have the fix, so we have to wait for the new version to give the new Sage division timings. It's great to see how quickly the community responded to this.

It was of course fixed by using Singular for that computation, rather than a very slow generic implementation of division.

Singular improvements


Hans Schoenemann has been working on improving the use of Factory from Singular. This means that Singular will call Factory for large arithmetic operations, since it is much faster. Singular itself is optimised for Groebner basis computations, not arithmetic.

The new improvements will currently take effect for computations with over 1000 terms or so, which is heuristically chosen to avoid overhead in converting to and from Factory's lex ordering and recursive polynomial format. Changing from an arbitrary ordering to a lex ordering and back can be extremely expensive, so this crossover is essential to avoid harming performance for smaller operations.

We've also decided that when we have a really solid, performant implementation of multivariate arithmetic, we are going to make use of it in Factory. The end result is that the code that we are writing now will be available to all Singular (and Factory) users.

The main operations that really benefit from such arithmetic are rational functions and multivariate factorisation. They aren't for speeding up Groebner Basis computations!

Flint improvements


Most of the actual work done on Flint since my last blog, has been done by Daniel Schultz. Here are some of the improvements:

Improved exponent packing


When using deglex or degrevlex, our monomial format packs the total degree field into the same words as all the other exponent fields. The trick of packing exponent vectors into bit fields goes way back to the early days of Singular. We get the idea from a paper of Hans Schoenemann.

Actually, Singular keeps the degree field separate from the rest of the vector (which in hindsight might have been a better thing for us to do as well), but we decided to try and pack it along with the other exponents.

The problem with our implementation of deglex and degrevlex is precisely this extra field that is required for the degree. If we don't want to overflow a word for each exponent vector, we have fewer bits per exponent when using degree orderings.

The maximum total degree is also often larger than the maximum of any individual degree, further exacerbating the problem. These issues mean that if we aren't careful, we can take more space for packed monomials using degree orderings than for lex orderings.

In practice, this meant that for our largest benchmarks, two 64 bit words were being used for exponent vectors, instead of one. This is because we were packing each exponent into a 16 bit field. Obviously six times 16 bits requires more than 64 bits.

We were using 16 bit fields, since we only supported fields of 8, 16, 32 or 64 bits (including an overflow bit), and 8 bit fields just weren't large enough; the maximum total degree in our largest benchmark is 160, (which requires 8 bits, plus an overflow bit).

We now pack into an arbitrary number of bits, and for 5 variables plus a degree field, it obviously makes sense to pack into 10 bit fields, since you get 6 of those in 64 bits. This is plenty big enough for all our benchmarks, in fact. The result is a 50% saving in time when using degree orderings.

Delayed heap insertion


Recently at ISSAC 2017 here in Kaiserslautern, Roman Pearce, one of the authors of the Sdmp library, told me about an improvement for multiplication that delays inserting nodes into the heap. In some cases, we know that they will be inserted at a later time anyway, so there is no hurry to put them in at the earliest possible opportunity, as this just increases the size of the heap.

Daniel implemented a version of this that he came up with, and it makes a big difference in timings, and shrinks the heap to a fraction of its original size.

Multi-core multiplication


Flint can now multiply sparse multivariate polynomials using threads. Daniel experimented with various strategies, including breaking one of the polynomials up. This strategy has the disadvantage of requiring a merge of polynomials from the various threads at the end, which is expensive.

The best strategy we found was to take "diagonals". That is, we divide the output polynomial up into chunks instead of dividing up the input polynomials. This has to be done heuristically, of course, since you don't have the output polynomial until it is computed.

This idea is contained in a paper of Jacques Laskar and Mickael Gastineau.

To ensure that all threads take about the same amount of time, we start with big blocks and take smaller and smaller blocks towards the end, so that no thread will be left running for a long time at the end while all the others are waiting. This gives fairly regular performance. The scheme we use is a variation on that found in the paper by Laskar and Gastineau.

The advantage of the diagonal splitting method is that no merge is required at the end of the computation. One simply needs to concatenate the various outputs from the threads, and this can be done relatively efficiently.

Improved memory allocation


At scale, everything breaks, and malloc and the Flint memory manager are no exception. I wrote a new memory manager for Flint which uses thread local storage to manage allocations in a way that tries to prevent different threads writing to adjacent memory locations (which would incur a performance penalty).

The design of this memory manager is beyond the scope of this blog, but it seems to work in practice. 

[A brief summary is that it allocates blocks of memory per thread, consisting of a few pages each. Each page has a header which points to the header of the first page in the block, where a count of the number of active allocations is stored. When this count reaches zero, the entire block is deleted. When threads terminate, blocks are marked as moribund so that further allocations are not made in them. The main thread can still clean thread allocated blocks up when their count reaches zero, but no further allocations are made in those blocks.]

The main advantage of this scheme, apart from keeping memory separated between threads is that we can block-allocate the mpz_t's that are used for the polynomial coefficients. This is critical, since in our largest benchmarks, almost a third of the time is spent allocating space for the output!

Unfortunately, each mpz_t allocates a small block of memory internally. That memory is handled by GMP/MPIR, not Flint, and so we still have 28 million small allocations for our largest benchmark!

I was able to make a small improvement here by changing the default allocation to two words, since we don't really use one word GMP integers in Flint anyway. But fundamentally, we are still making many small allocations at this point, which is very expensive.

This can be improved in a future version of Flint by setting a custom GMP memory allocator. But it will require implementation of a threaded, pooled, bump allocator. It can be done, but it will be a lot of work.

If we want to scale up to more than 4 cores (and we do!), we have no choice but to fix this!

Trip homogeneous format


Mickael Gastineau, one of the authors of Trip, was kind enough to send me a special version of Trip which allows access to their homogeneous polynomial format. It isn't optimised for integer arithmetic, but their papers show it is very fast for dense multivariate arithmetic, assuming you are happy to wait for a long precomputation.

The basic idea is to split polynomials into homogeneous components and precompute a very large table in memory for lookup. One only needs a table for the given degree bound, and Trip manages this automatically upon encountering this degree for the first time.

Before we get to the main timings below, we present Trip timings with this homogeneous format (excluding the precomputation).

We only time this for dense computations, since the precomputation is not practical for most of the sparse benchmarks. Moreover, since the code is not optimised for integer arithmetic, we use floating point to exactly represent integers, which means that the largest benchmarks can't be performed, since they would overflow the double precision floating point numbers.

Since Flint doesn't yet have parallel dense multiplication, these timings are restricted to a single core.


Dense Fateman (1 core)


New sparse multiplication benchmarks


In this section, we give timings for all systems (except Sdmp and Maple) on 1, 2 and 4 cores (where available). Most of the systems, including ours, perform badly on larger numbers of cores, unless the problems are really huge (> 30 minutes to run). When we have fixed our memory management issues, we'll do another blog post with timings on larger numbers of cores.

All the systems below were forced to use lexicographical ordering, with x > y > z > t > u, and we always take f*g (the polynomials f and g are the same length, so it's an arbitrary choice in reality).

The new delayed insertion technique Daniel implemented actually makes the timings for f*g and g*f the same for our new code. But other systems that support lex don't necessarily have this.

All of the timings in this section are done on the same machine, an old 2.2GHz Opteron server.

First, we give all the single core timings. We've added Piranha and Trip timings this time around. Incidentally, my understanding is that Piranha uses a hash table and therefore doesn't have a specific polynomial ordering. We include timings for Piranha to quench your burning curiosity.

Dense Fateman (1 core)




Sparse Pearce (1 core)



Of course, not all the systems can do parallel multiplication. So a smaller set of benchmarks for a smaller set of systems follows.

Sparse Pearce (2 core)

Sparse Pearce (4 core)

Sdmp and Maple 2017 timings


Michael Monagan very kindly granted me guest access to a machine that has Maple 2017 installed, which is currently also the only way to time Sdmp. Roman Pearce has also kindly given lots of information on how to use Maple and Sdmp, so that we can get fair timings.

The first thing to note is that the timings here are on a completely different machine to the above. So they are not in any way comparable.

I believe Sdmp internally only supports deglex. Therefore, we decided it best to use deglex for the Maple and Sdmp timings, otherwise we may be comparing apples and oranges. This is another reason one can't make a direct comparison with the timings for the other systems above. It also means there are no dense benchmarks, since our dense code doesn't support deglex efficiently yet.

It's also very important to point out that we have no way to run Sdmp except via Maple. We don't always get the timings for Sdmp below in practice, they are simply the self-reported times for Sdmp as a component of the full Maple run times.

It's not entirely clear why Maple is so slow on large inputs. Maple makes use of a DAG structure for its symbolic engine, rather than directly manipulating Sdmp polynomials. Something is apparently causing this to be incredibly slow when the polynomials are large. According to the Maple CodeTools it seems to be doing some long garbage collection runs after each multiplication. One wonders if there isn't something better that Maple can do so that their users can take practical advantage of the full performance of the Sdmp library in all cases!

An important benchmarking consideration is that for Maple, all Sdmp computations can be considered static. This means that Sdmp can block allocate their outputs, since the coefficients will never be mutated and therefore will never be reallocated. To simulate this, when comparing Sdmp and Flint, we write Flint polynomials into an already allocated polynomial. This probably slightly advantages Flint in some cases, but since we don't actually get the stated Sdmp timings in practice, this seems reasonable.

Of course, this isn't fair to Maple, but we are talking about a few percent, and the comparison with Sdmp was more interesting to us, since it is a library like Flint.

When comparing against Maple, we do the Flint timings using Nemo, which is a Julia package that wraps Flint. The code isn't on by default in Nemo, since we don't yet have GCD code, for example, but we switch the Flint code on in Nemo for the basic arithmetic timings in this blog. 

We take considerable care to ensure that both Maple and Sdmp use the same effective variable order that we do.

Sparse Pearce (1 core)


Sparse Pearce (2 core)


Sparse Pearce (4 core)

Conclusion


The project is coming along nicely. 

Daniel has started working on arithmetic over Q, and we'll update everyone when that is working. Hopefully we'll also find a good strategy for parallelisation of dense multiplication.

I should point out that the code described in this blog is very new, and after the last blog, we found many things that changed the results. We've taken much more care this time, but even an infinite amount of effort won't avoid all possible sources of error. The code itself is merged into trunk, since we mostly trust it, but it should still be considered experimental until it settles a bit.

I wish to thank Roman Pearce, Michael Monagan and Mickael Gastineau who have very generously provided details of how to use their systems, and gave us access so that we could do timings. There is an academic version of Trip available, which has everything but the homogeneous algorithm. For that, Mickael prepared a special version we could make use of.

We also want to thank the EU for their Horizon 2020 funding of the OpenDreamKit project.

Saturday 5 August 2017

Oscar:Antic Update


This week we had the first OSCAR:Antic workshop and coding sprint [1]. We had about 10 people turn up, though unfortunately we clashed with big meetings in Atlanta and Leiden. Still, we got a lot done, so we'll probably have another one of these soon!

For those who don't know, OSCAR is the computer algebra system being developed as part of a large DFG (German Research Council) transregional grant. The Antic subsystem of OSCAR is the Number Theory subsystem, consisting of Julia packages such as Hecke.jl, Nemo.jl and making use of Flint, Arb, the Antic C library, etc.

The workshop and expenses for invited speakers were funded by DFG TRR 195, but I should also note that a number of the participants are currently employed by OpenDreamKit, an EU Horizon 2020 grant.

Fp bar


The focus of the workshop more or less ended up being finite fields and their embeddings. Luca De Feo gave a really fantastic set of talks on algorithms for embedding finite fields with the aim of eventually reaching an implementation of Fp bar (the algebraic closure of the finite field of p elements).

We learned about all the recent breakthroughs in using elliptic curves over a finite field to efficiently do such constructions, including the papers of Luca and his coauthors [3].

Luca put all this work in the context of the older work of Bosma-Cannon-Steel and Lenstra-deSmit and the algorithms of Allombert and Plesken.

One of the main aims of the coding sprints was to implement a naive embedding method in Nemo, which just relies on finding roots of one finite field in another. This is still work in progress, but I think it is fair to say we made some good progress towards this.

The eventual aim is to have a modern sophisticated and highly performant implementation in Nemo, based on the code of Luca and his student Jean Kieffer. Some of the basic building blocks for this will be in Flint, of course.

The basic steps were:
  • Add some additional helper functions to Flint
  • Add root finding, based on equal degree factorisation, to Flint
  • Port the Hecke code for maps into Nemo
  • Add the code for handling embeddings, going up and down along embedding maps
  • Thinking about how to handle compatibility
  • Designing an interface for Nemo

Elliptic curve cryptography in Nemo


Luca and Jean presented a poster at ISSAC, which was held the previous week in Kaiserslautern. I think they even won an award for their poster.

Following on from this, Jean gave a talk at our Antic workshop on the work he and Luca have been doing on a Nemo module for elliptic curves over a finite field [4]. You can find the slides on the workshop homepage [1].

One remark that I found interesting was a note Jean made on performance. One of his graphs showed that Sage was beating Nemo for a generic implementation of Weierstrass elliptic curves over a finite field of cryptographic sized characteristic (e.g. p = 2^502 + 49), especially as the degree of the field grows.

We eventually traced this down to two specific things:

  • Sage was using one division instead of two in its implementation of the addition law on an elliptic curve (this was easily fixed in Jean's code)
  • Sage was using Pari for finite fields, and Pari is really fast at finite field arithmetic
My contribution was to speed up arithmetic for finite fields of smallish degree for large characteristic. Addition is still faster in Pari, due to their stack based memory management, but we now have multiplication that is faster than Pari, inversion about the same speed (sometimes faster, sometimes slower, to within about 1-2%), division and powering faster than Pari.

The tricks involved in speeding up inversion (and hence division) were as follows:

  • Improvements to the memory manager in Flint to block allocate mpz_structs
  • implement a gcdinv function for the Flint fmpz_mod_poly module (polynomials over Z/pZ for large p), instead of using the xgcd implementation, which does additional work
  • use temporary allocations (on the stack where possible) throughout the gcdinv and divrem code
  • delay reduction mod p in the divrem_basecase code in fmpz_mod_poly
  • inplace divrem_basecase
  • change crossover to divrem_divconquer to a higher value, in light of new basecase improvements
These improvements dropped the time for inversion in a finite field of degree 5 in large characteristic from about 12.7s for 100,000 inversions to about 8.7s.

There's still one potential improvement we could make, which would speed things up quite a bit:

  • implement a dedicated gcdinv function that doesn't call out to divrem, but does division itself, delaying reduction all the way through the entire Euclidean algorithm, except when actually needed
This would be quite a lot of work and so we didn't implement this yet. Contributions welcome.

An interesting observation


Initially we believed that Pari has special code for quadratic extensions of Fp. This is because we were constructing Fp^2 and then inverting f = x^1234 as a benchmark for inversion.

After investigating this, I realised that Pari was using cyclotomic polynomials for at least degree 2 and 4 extensions, where possible. This meant that in the degree 2 case, x^3 = 1 and in the degree 4 case x^5 = 1.

Of course x^1234 is not a really random looking element in such extensions, and it just so happens that the extended gcd computation that is necessary for inversion really goes fast for such computations.

I don't see how to make use of this in general, but it was interesting that it made such a huge difference (a factor of 10 or so in Pari).

Actually, we don't yet get such an improvement in Flint (it's more like 5x), so I will look into that today. It's almost certainly due to not terminating the xgcd early enough, or not handling a leading coefficient of 1 optimally, or something like that. It'll be something trivial, but worth optimising for, I think.

Update 06 Aug: this is now fixed in flint/trunk. It was indeed due to not dealing with length 1 remainders in a special case in the gcdinv code.

Other topics


Many other topics were covered at the workshop.

Ball arithmetic


Fredrik Johansson gave a very nice presentation on Arb for ball arithmetic over R and C. The slides are on the workshop website [1].

Generic polynomial testing


We decided to write test code for generic polynomials in Nemo. Three of us (me, Luca and Remi) wrote test functions, with particular emphasis on:

  • towers of univariate polynomial rings
  • generic polynomials where the base ring is built over a characteristic p field
  • randomised tests
As expected, this turned up numerous corner cases and failures, most of which are fixed (the rest will be fixed before the next Nemo release).

Multivariate polynomial arithmetic over Z


We finally merged all the new multivariate code in Flint/trunk that I have been working on (recently with the help of Daniel Schultz). I've spoken about this code in a recent blog post, and there will be another blog post on this topic fairly soon. Lots of things happened since the last update, including some speedups.

Memory leak in Flint matrices


Fredrik fixed a serious memory leak in Flint matrices, which was slowing things down.

A 2500 times speedup


Fredrik sped up some Flint code by a factor of 2500. The slow Flint code was pointed out by the Sage developers, who were using it for modular symbols. The code is still in a branch, and could be improved even further. But since the Flint code is also used in Nemo, this will speed up some things in Nemo too.

Flint on FreeBSD


Dima Pasechnik helped with getting Flint working on FreeBSD.

Gap-Julia interface


Thomas Breuer and Sebastian Gutsche made progress in the OSCAR Gap-Julia interface. They worked on:
  • Extended the GAP-Package JuliaInterface (Type conversions, e.g., of GAP cyclotomics into Nemo cyclotomics, access to more functionality, improved error handling)
  • Translating shortest vector GAP function to Julia to look to test how speed improves.
  • Tried to access compiled Julia methods directly (unsucessfully so far)

There is an upcoming Gap days in Siegen early next month [2]. 

Register for the Gap conference, or the subsequent coding sprint if you are interested in general Gap development, or a Gap-Julia interface.

Moving finite fields out of Flint


We discussed the future possibility of moving finite fields out of Flint, into a separate library, similar to the way Antic (for number field arithmetic) is separate from Flint.

The reason for this is the templated code (using C #defines) that is used in the finite fields module. This suggests we should be using C++ for finite fields (with a C interface to maintain compatibility with existing code).

Some of the work Luca wants to do on finite field extensions, which requires bivariate polynomial arithmetic and so on, would be fiddly to develop using the Flint templating system. C++ would make this easier, and we'd still get good performance if we restricted to a C-style interface and only used templates to make the code easier to write.

The other possibility is to reimplement finite fields directly in Julia. But we decided to delay a project like that until we see how the Julia JIT compiler will be improved. Putting too many layers of code in Julia is asking for trouble right now. When JIT compilation is improved after the language matures, this would be a sensible choice. Each time Julia improves, the sweet spot enlarges, and we can do more in Julia. For example, traits are a highly anticipated feature of Julia, which should materialise over the next couple of years.

Rewriting the finite fields code in C++ (with a C-style interface, for performance) is certainly a worthwhile project. Volunteers welcome.

References and links


Sunday 9 July 2017

Update on fast multivariate polynomial arithmetic in Flint

Fast multivariate polynomial arithmetic in Flint


It's time for me to update everyone on the implementation of fast multivariates in Flint. We are working on this as part of the OpenDreamKit project, funded by the EU. The eventual aim is to provide a robust, extremely fast implementation of multivariate polynomial arithmetic and GCD that:

  • is parallelised (hopefully scaling linearly with number of cores)
  • supports basic monomial orderings
  • supports individual exponents up to 64 bits and exponent vectors of arbitrary size
  • provides multiplication, exact division, divisibility testing, division with remainder and GCD
  • is open source
  • is highly optimised for Z, Q and Z/pZ

What else already exists?


But this is a solved problem, right? There's heaps of libraries that already do this, aren't there?

Indeed there are lots of cool libraries and proprietary software. Let's summarise each of them:

Singular

Singular is a Computer Algebra System for polynomial systems, developed here in Kaiserslautern. It has sparse multivariate polynomial arithmetic over various fields, supporting exponents from 2 to 64 bits (no smooth promotion for the user doing polynomial arithmetic, but internally in Groebner bases promotion happens optimally and automatically) and in practice, any number of variables (up to about 2^15).

But Singular's polynomial representation is optimised for Groebner bases (GBs). In particular, GBs require that one be able to get the leading monomial of a polynomial very fast and to be able to multiply polynomials by monomials extremely fast. Whilst a basic Buchberger GB implementation can benefit from fast basic arithmetic, most of the really efficient GB algorithms cannot. And it is not enough to be able to return the value of the leading monomial. One really needs to be able to return it without copying any data. For this purpose, Singular uses a linked list implementation of (sparse distributed) multivariate polynomials. Of course this makes Singular very slow for ordinary multivariate arithmetic.

Naturally, Singular's multivariate arithmetic is not parallelised.

Another issue with Singular is that its multivariate arithmetic is only "fast" if you don't supply a module ordering. But if you don't supply one, you can't use the polynomials you generate in most of Singular's higher level functions. In other words, the implementation is very highly tied to Singular's domain of specialty.

Some of Singular's multivariate functionality is supplemented by Factory, which we'll describe next.

Factory

Factory is a library distributed with Singular, for multivariate polynomial arithmetic, GCD and factorisation. Martin Lee put a lot of time into Factory over the past few years, to speed it up, in particular switching to Flint for univariate polynomial arithmetic and speeding up factorisation routines.

Factory still uses a sparse representation for polynomials, but optimised more for arithmetic. I believe it uses a sparse recursive representation, which is a logical choice for such a library.

Some operations in Factory are parallelised, basically because those operations are parallelised in Flint. But basic multivariate arithmetic is not parallel.

In the long run, the work that we are currently doing for ODK will possibly replace the core of Factory. I'm sorry that I don't have explicit timings for Factory below. I may include them in future.

Magma

This is a closed source Computer Algebra System. Based on my timings, I now believe that Magma does certain critical operations in a heuristic way, so I no longer trust it. But still, this is the system to beat if possible. It has for a long time provided exceptionally fast basic arithmetic. Allan Steel in particular has done an amazing job with core arithmetic in Magma. Unfortunately his ideas are often not published, and he doesn't seem to find time to reply to emails asking about his algorithms. An open code base (even if not Open Source) would be highly preferable!

Magma supports exponents up to about 32 bits, but its basic multivariate arithmetic is not parallelised. Some operations seem to be missing (or I couldn't find them documented). But it generally has very good coverage. Magma is good at both GBs and basic arithmetic.

Giac

Giac is part of the Xcas computer algebra system, which is widely used in schools in France. Moreover, it is Open Source!

Bernard Parisse has done a magnificent job speeding up multivariate arithmetic in Giac. It has also been really fantastic to have email exchanges with Bernard and exchange ideas.

Giac supports exponents up to 16 bits, but as far as I'm aware, only supports lexical orderings. The front end is symbolic and a conversion is done to a raw, internal polynomial format for doing polynomial arithmetic. One can change the priority of variables, but degree orderings and reverse orderings can only be supported artificially.

Since Flint currently only allows the order of variables to be reversed and not the order of comparisons (as required for true reverse lexical orderings), I decided to blog about our timings now, since comparison with Giac is actually fair. Properly supporting reverse lexical orderings (work in progress right now), will very slightly slow down our benchmarks.

Giac is parallelised and gives a nice speedup with multiple threads. It also has really fast multivariate GCD and factorisation. 

Piranha

Piranha is another Open Source library, written in C++. It makes heavy use of templating and is insanely fast. It is also parallelised and scales nearly linearly with the number of cores. Its maintainer, Francesco Biscani is also really talkative by email and has lots of great ideas.

Piranha has a reputation for being fast, and it is well-deserved. However, I don't include it in timings below since Piranha uses a hashed representation for polynomials, which means that it doesn't support things like division of multivariate polynomials (it's really hard to get the leading term of a polynomial from a hash table!)

For those interested to know, I think we are beating Piranha on multiplication of dense polynomials, but it beats us by a factor of 3 or so on the sparse benchmarks. This is to be expected given their representation. We are thinking of also adding a "hash" ordering for cases where division is not required, so that we can compete with Piranha on that benchmark.

Piranha is an amazing project. It supports arithmetic and symbolic operations. It even supports Poisson and divisor series, which are presumably important in applications in Celestial mechanics. 

Piranha also has very limited range for its packed monomial representation, which is fine for some applications, but not useful for our purposes. It is still under development though, and I expect some of its limitations will eventually be lifted.

Sdmp

Sdmp is the library developed by Michael Monagan and Roman Pearce for fast polynomial arithmetic in Maple. It's closed source, but its authors regularly publish papers describing their (revolutionary) ideas. Unfortunately, I only have access to the versions of these papers on their website and have had trouble implementing some of their algorithms as given in their papers. The fixes seem to be nontrivial and have consumed a lot of my time. An open code base (even if not Open Source) would be highly preferable!

I do not provide timings of Sdmp below, since I do not have access to it. I believe we are probably beating Sdmp on the dense benchmarks, but maybe 10-15% slower on the sparse benchmarks, but this is purely conjecture based on various bits of information I have.

Of course, Sdmp is parallelised, and scales well with the number of cores up to about 16 cores.

Trip

Trip has a reputation for being extremely fast, represents decades of work, and should be hard to compete with especially on multiple cores, as it scales extremely well. It is closed source, though occasionally papers are published describing the algorithms in very high level detail, and giving timings.

I don't have a full copy of Trip, but the authors confirm that the academic version is full speed and can be used for timings for academic purposes. I should be able to include timings for functionality available in the academic version of Trip at some point in the future. I give just a couple of data points in the comments below, but want to get familiar with the system before putting up official timings.

Pari/GP

Pari/GP is Open Source and implements something like a recursive dense or even recursive sparse format, depending on how you define it. Unfortunately I can't time it, since it isn't capable of the benchmarks I'm doing, due to the memory usage of their representation.

What Pari/GP is doing is usually fine for Number Theory, where things tend to be pretty dense. But it isn't trying to solve the same problem we are trying to solve, which is a really fast, flexible system for multivariate arithmetic in both the sparse and dense cases.

SageMath

Initially I thought it was pointless including SageMath since I assumed it just used Singular. However, on my machine, Sage is actually slower than Singular, and so I decided to include timings for that, at least for the cases that finished in under two hours.

Presumably the Sage developers didn't know about certain operations in Factory (or they weren't available at the time), since some Sage timings are really bad. I don't know what they do instead. Sage is Open Source, but there'd be more motivation to look up the source code if it was faster than everyone else, instead of slower.

As the work we are doing is part of OpenDreamKit, it is anticipated that Sage will eventually benefit greatly from the work we are doing.

Update on Flint multivariate polynomial arithmetic


Over the past year, I've spent a lot of time focusing on an implementation of multivariate arithmetic over the integers. Of course, one needs to be a little careful to describe what one actually means by division with remainder of multivariate polynomials over Z. But the idea has been to get a decent implementation over Z that can be easily modified for multivariate polynomials over Z/pZ and Q.

We currently have single core code only, for multiplication, divisibility testing, quotient only, quotient with remainder and reduction by an ideal (specified as an array of generators), plus a bunch of simpler operations, of course.

The code supports exponents up to 64 bits (actually 63 bits, since we use the top bit to detect overflows), with automatic packing of monomials where possible. The number of words in an exponent vector is essentially unlimited.

We support lexical and degree lexical orderings, with degree reverse lexical ordering not far away (we currently support reversing the order of variables, just not the order of comparisons).

The big news for this blog post is that it all now passes very strong test code, and I've documented the interface. It should be merged into flint/trunk soon.

The other exciting news is we've hired Daniel Schultz to work on the project for OpenDreamKit, for two and a half years. He will primarily be focusing on parallelising the code, though he'll no doubt work on many other things as well, such as a hash ordering, reverse lexical orderings, GCD and so on.

One major thing we are still missing is quasidivision. There are two kinds required: one that allows division over Q using only integer arithmetic, and the other that allows division over the fraction field of a multivariate polynomial ring using only polynomial arithmetic over the integers. Neither of these should be difficult to implement now that we have the basics working.

A puzzle


So far we can't get down to the same time as Magma for divisibility testing of multivariate polynomials. This is a critical operation for various fast multivariate GCD algorithms. Magma seems to be able to do this 200 times faster than it can do exact quotient!

I've asked a lot of smart people about this, and no one has an idea how it is possible to do this so fast. Moreover, Magma curiously takes about the same time as a heuristic test. There's really not enough time to do anything other than a few evaluations at some small values, and even then you have to do some tricks to pull that off!

I've written to Allan Steel to ask how this is done, but he hasn't replied. At this stage I strongly suspect Magma is doing a heuristic test only. Since the source code is not open for me to examine, and no one published this amazing algorithm, I am free to assume what I want about it.

Because I don't believe Magma is doing the right thing here, I've asked it to provide me with the exact quotient when doing divisibility testing (which just happens to be what our code does anyway). This makes the Magma times more realistic.

Benchmarks


And now for the benchmarks of all the systems I have on one machine (it's nearly 7 years old), with the exceptions noted above.

In each case there are "dense" and "sparse" benchmarks. Note that we don't include really dense or really sparse benchmarks. In the really dense case, you'd probably want to use an FFT based algorithm. Our code is probably more suited to the really sparse case. But it does well in the sparse case anyway (if we don't compare against a hash based approach), so that's what we've chosen to use for the benchmarks.

Timings that took longer than 2 hours are marked with a dash.

In some cases, systems don't provide the exact operation required, but I've allowed division with remainder to be substituted when that is the case.

The coefficient ring is Z for all operations, though Giac uses Q and internally optimises when working over Z. The benchmarks are picked to not introduce denominators.

Multiplication (dense Fateman):

  • f = (1 + x + y + z + t)^n
  • time p = f*(f + 1)

Multiplication (sparse Pearce):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • time p = f*g

Quotient only division (dense):

  • f = (1 + x + y + z + t)^n
  • p = f*(f + 1) + x^(n - 3)
  • time q = p/f discard remainder

Quotient only division (sparse):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • p = f*g + x^(n - 3)
  • time q = p/f discard remainder

Yes, exact quotient can be much faster than multiplication!!

N.B: due to an accident of interfacing, previously published Flint times for this benchmark were using reflected lexicograhical ordering, instead of lexicographical ordering. This has now been corrected.

Divisibility test with quotient (dense):

  • f = (1 + x + y + z + t)^n
  • p = f*(f + 1)
  • time flag, q = divides(p, f)
  • flag indicates whether division was exact, in which case q is the quotient
  • (division with remainder can be substituted for this operation)

Divisibility test with quotient (sparse):

  • f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^n
  • g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^n
  • p = f*g
  • time flag, q = divides(p, f)
  • flag indicates whether division was exact, in which case q is the quotient
  • (division with remainder can be substituted for this operation)

Tuesday 28 February 2017

Parallelising integer and polynomial multiplication in Flint


As I explained in my last two blogs, over the last eight months I have been working on the OpenDreamKit project, funded by the EU in the Horizon2020 initiative. In this blog post I want to discuss the work I've been doing to speed up integer and polynomial multiplication as part of the OpenDreamKit high performance computing and optimisation component.

Flint makes use of a variety of algorithms for integer and polynomials. For integers, GMP/MPIR is used, and algorithms such as the schoolboy algorithm, Karatsuba and Toom-Cook algorithms are used. For polynomials the schoolboy algorithm is used, along with the Karatsuba algorithm and various variants of Kronecker segmentation.

For large integer and polynomial multiplications, however, Flint uses an FFT convolution.

Actually, integer multiplication is reduced to polynomial multiplication. The large integers are broken, down to the bit, into equal sized chunks, which are interpreted as coefficients of a polynomial. The polynomials are multiplied, and the result is again interpreted as a large integer.

Polynomial multiplications of large degree polynomials with small coefficients are also reduced to integer multiplication, by a method known as Kronecker Segmentation, where the polynomials are evaluated at a sufficiently large power of two, the resulting large integers are multiplied, and then the coefficients of the product polynomial are identified in the zero padded output.

The FFT convolution


The FFT convolution is a method that has much better complexity than all of these methods. It can multiply polynomials in O(d log d) steps, where d is the degree of the polynomial. Compare this with the classical schoolboy method, which takes O(d^2) steps.

At its core, the FFT convolution efficiently multiplies two large polynomials modulo x^(2^n) - 1. This is like polynomial multiplication with wraparound. Every 2^n coefficients, the result is wrapped around to the start. But we can use it to multiply polynomials of degree d if 2d < 2^n so that wraparound of the result doesn't interfere.

To multiply modulo x^(2^n) - 1 the FFT convolution uses an evaluation, pointwise multiplication, interpolation strategy. The polynomials to be multiplied are first evaluated at 2^n-th roots of unity. These 2^n values for each polynomial are multiplied with each of their counterparts from the other polynomial, to yield 2^n "pointwise products" and then the result polynomial is interpolated from these 2^n values.

Notice that the number of roots of unity is always a power of two in the FFT algorithm, so you are always doing multiplication modulo x^(2^n) - 1 for some n. In other words, convolutions always have length a power of two.

The evaluation phase of the FFT convolution is called the forward FFT. In the middle are the pointwise multiplications, i.e. a total of 2^n coefficient multiplications, and then the interpolation phase is called the inverse FFT or IFFT, for short.

Both the FFT and IFFT take O(n log n) operations, all of which are additions, subtractions and multiplications by roots of unity. The pointwise multiplications are simply coefficient multiplications. For example, if the polynomials being multiplied were over the complex numbers, the pointwise multiplications would be products of complex numbers.

The Schoenhage-Strassen algorithm


Performing FFTs and IFFTs over the complex numbers is problematic. When implemented on a computer, we have to use floating point arithmetic for this, which can result in rounding errors which can't be effectively controlled.

Instead of working over the complex numbers, the Schoenhage-Strassen algorithm works over another ring with roots of unity, namely Z/pZ where p = 2^(2^k) + 1. Note that in this ring, 2 is a 2^(k+1)-th root of unity. This allows for convolutions up to that length. But the advantage is that elements of this ring can be represented exactly, using (multiprecision) integer arithmetic,

Pointwise multiplications become multiplications modulo p = 2^(2^k) + 1, which can also be performed exactly. Thus, the result of such an algorithm is always exactly correct if the inputs are polynomials over the integers, so long as the output coefficients are smaller than p.

The Flint FFT convolution code uses the Schoenhage-Strassen algorithm to multiply large polynomials and integers. However, it incorporates many improvements, which I will now describe.

Efficient FFT butterflies


The basic low level component of the FFT and IFFT algorithms is called a butterfly. The FFT butterfly transforms pairs of inputs as follows

[a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]

where z is some power of the 2^(k+1)-th root of unity discussed above. As this root of unity is a power of 2 in the Schoenhage-Strassen algorithm, multiplication by z is just a binary shift to the left or right.

The IFFT butterflies are similar. They take the form

[a{i}, b{i}] => [a{i}+z^(-i)*b{i}, a{i}-z^(-i)*b{i}]

Since the FFT coefficients are multiprecision integers (modulo p), the butterfly operations require applying the given  operations on multiprecision integers of a fixed precision.

To speed these up, we make use of MPIR's low level assembly optimised functions. The sumdiff function performs an addition and a subtraction at the same time, and of course we make use of code for shifting multiprecision integers. In addition, the subtraction modulo p requires complementing the integers in twos complement format.

One of the OpenDreamKit projects we worked on was to speed up these basic assembly operations on modern CPUs, especially those that support AVX instructions.

You can read about the work that was done on this by Alex Kruppa [1] and the various improvements he got over the old MPIR assembly code.

Extended Fermat numbers


Numbers of the form p = 2^(2^k) + 1 are called Fermat numbers. But they limit us to working with coefficients in both our input polynomials and the output polynomial, which don't exceed p. To give us more flexibility, we can work with numbers of the form p = 2^(a*2^k) + 1 for some number a. Now 2^a is a 2^(k+1)-th root of unity, but we get to work with much larger coefficients at very little extra cost. Of course the pointwise multiplications are now more expensive, but sometimes this extra cost is worth paying for the extra flexibility we get.

Truncated Fourier Transform


As we noticed above, the length of a convolution is always a power of 2. We are always working modulo x^(2^n) - 1.

This means that if we are multiplying polynomials of degree d, everything is fine until 2d = 2^n. At this point, the output polynomial will become too large and we have to increase n. This leads to ugly steps in performance at power of two lengths.

To get around this, we make use of an algorithm known as the Truncated Fourier Transform (TFT). Truncating the FFT to non-power-of-2 lengths is relatively straightforward. We simply omit the operations which deal with coefficients beyond the desired length.

The pointwise multiplications are also easy. Simply do fewer pointwise multiplications.

But now we have less information going into the inverse FFT than we normally would. We use some clever linear algebra to reconstruct the missing information as the algorithm proceeds. 

By truncating the entire convolution in this way, it is possible to save almost all the work that would be required to do a full, length 2^n transform.

Negacyclic transform


When the pointwise multiplications get very large, there is a trick we can use to perform them using an FFT of half the usual length.

Notice that the pointwise multiplications are done modulo p = 2^(a*2^k) + 1. If we split such a number up, this almost looks like doing a polynomial multiplication modulo x^(2^n) - 1 for some value of x, except for that pesky -1 instead of +1.

To get around this, we multiply x by a suitable root of unity and rescale so that the FFT convolution effectively becomes multiplication modulo x^(2^n) + 1. In other words, we use the FFT wraparound feature to our advantage.

This trick is due to Nussbaumer and saves time whenever the FFT coefficients are themselves large enough to be multiplied using an FFT, or perhaps slightly smaller than that.

The main problem with this approach is the size of the FFT coefficients of that small FFT convolution. It usually isn't quite the right size.

To get around this, it is possible to do a naive convolution where the coefficients are just one or two words in size and use Chinese remaindering to deal with coefficients that are just slightly larger than this Nussbaumer convolution can handle.

Algebraic identities


Another trick that can be used is when the number a is a multiple of 3 in p = 2^(2^a) + 1, specifically when the pointwise multiplications are small.

In such a case, we can use the identity x^{3a} + 1 = (x^a + 1)(x^{2a} - x^a + 1) to break the pointwise multiplications into smaller multiplication modulo x^a + 1 and x^{2a} - x^a + 1 respectively. If those multiplications are small enough to be quadratic time, splitting them up and using Chinese remaindering is actually faster than doing the one large multiplication.

Multiplication modulo x^a + 1 takes time some constant multiple of a^2 and multiplication modulo x^{2a} - x^a + 1 takes four times as long again. The total is some constant multiple of 5a^2.

Doing a single multiplication modulo x^{3a} + 1, on the other hand, would take time equal to some multiple of 9a^2. So splitting things up using identities can be a big win.

The sqrt2 trick


In the ring Z/pZ where p = 2^S  + 1 the value w = 2^(2S/4) - 2^(S/4) is a square root of 2. And recall that 2 was a root of unity in the Schoenhage-Strassen algorithm. Thus w is also a root of unity.

Making use of the root of unity w allows us to perform convolutions of twice the length using the Schoenhage-Strassen algorithm. And multiplication by powers of w is not so difficult. Any even power of w is just a power of 2, and any odd power is just the difference of two such powers.

Ordinarily, making the length of an FFT convolution twice as big, multiplies the time taken by a factor of four, since the coefficients required to support the larger convolution also have to be twice as large. But the sqrt2 trick allows us to do it for just over twice the cost.

The matrix Fourier algorithm


One of the big problems with the Schoenhage-Strassen algorithm as described above is that it works with a lot of data. There are 2^n multiprecision FFT coefficients being dealt with at any one time. The pattern of access to these coefficients is also not very regular throughout the algorithm, and so once the entire convolution falls out of cache, performance can be very bad.

The performance degradation can be reduced to some degree by switching to a recursive implementation of the FFT, which quickly gets down to blocks of coefficients that fit in cache. But it doesn't mitigate it entirely.

One way of resolving this is the matrix Fourier algorithm. Instead of viewing the input as a linear array of coefficients, it breaks the single large FFT down into many smaller FFTs, as though the input coefficients were the rows and columns of a matrix.

Two passes over the data are required, in order to handle lots of row FFTs and lots of column FFTs. But the overall saving is dramatic.

Of course, combining this trick with all the other tricks above is really a lot of work. But we do this in Flint.

In particular, we break the coefficients of the truncated Fourier transform up in such a way that each of the row and column FFTs has no wasted space.

Parallelisation


Now we have an obvious target for parallelisation. As we have broken the single large FFT into many smaller independent ones, these can be sent to different threads to perform.

Of course, nothing is ever that simple. We must ensure that the temporary space allocated throughout the algorithm is not overwritten by different threads. The algorithm is also more efficient if some of the inward column FFT transforms for example, are combined with the relevant pointwise multiplications and outward IFFT column transforms.

All of this also requires all the coefficients to be in memory in the right order, so that memory access is not complicated. Fortunately there are two kinds of FFT and IFFT, namely decimation in time and decimation in frequency. Picking the correct kinds at the right time allows things to be managed more or less efficiently.

The matrix Fourier algorithm also requires multiplying by some additional roots of unity at the right points, and these need to be merged into one layer of the FFTs and IFFTs so that they don't become an additional cost.

But all complications aside, the matrix Fourier algorithm is both cache efficient and a good target for parallelisation.

As part of my OpenDreamKit work, I parallelised the Flint Schoenhage-Strassen code in precisely this way, maintaining all of the above optimisations. I made use of OpenMP for this.

The results were quite good, with up to a 5.5x speedup on 8 cores for large integer multiplications (I also parallelised splitting the large integer up into FFT coefficients).

You can see timings and a high level summary of the work that I did in the writeup I did for the final deliverable for this task of the OpenDreamKit project [1].


[1] https://github.com/OpenDreamKit/OpenDreamKit/issues/120