As I mentioned in my previous blog post, I have been involved in the OpenDreamKit project, funded by the EU through their Horizon 2020 programme.
Today I want to write about a project I was involved with here at the University of Kaiserslautern, which was carried out by two people we hired, namely Alex Best and Alex Kruppa (we joked that we only hire Alexes on H2020, and referred to them as Alex v1.0 and Alex v2.0).
MPIR stands for Multiple Precision Integers and Rationals. It is a fork of the GMP library for multiple precision arithmetic on integers, rationals and floating point numbers.
GMP and MPIR consist of three main components: (i) assembly optimised bignum arithmetic, (ii) highly optimised bignum algorithms implemented in C, (iii) high level interfaces.
The way MPIR works is to provide assembly implementations of low level arithmetic primitives, such as addition, subtraction, shifting, multiplication and many other similar things, for each new microprocessor architecture that comes out.
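To give a feel for what such a primitive computes, here is a minimal Python sketch of limb-by-limb addition with carry. It is only an illustration: the name add_n, the 64-bit limb size and the list representation are assumptions made for this example, and the real routines are handwritten assembly working on machine words.

```python
LIMB_BITS = 64                      # assumed limb size for this sketch
LIMB_MASK = (1 << LIMB_BITS) - 1    # largest value a single limb can hold


def add_n(a, b):
    """Add two equal-length bignums stored as little-endian lists of limbs.

    Returns (sum_limbs, carry_out), where carry_out is 0 or 1.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of limbs")
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry            # at most 2 * LIMB_MASK + 1
        result.append(total & LIMB_MASK)  # low 64 bits stay in this limb
        carry = total >> LIMB_BITS        # overflow moves to the next limb
    return result, carry


# (2^64 - 1) + 1 overflows the low limb and carries into the high limb.
print(add_n([LIMB_MASK, 0], [1, 0]))
```

In assembly the loop body boils down to a handful of instructions (load, add with carry, store), which is exactly why the order and choice of those instructions matters so much.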
You may have heard of Intel's tick-tock cycle. They bring out a completely new microarchitecture in their tock cycle, and they shrink it and optimise it in their tick cycle. Every year or so, there is a new tick or tock.
Starting in 2014, they introduced a new phase called their refresh cycle. So it's now tick-tock-refresh, since it is getting to be too complicated to do new microarchitectures every two or three years.
What this means for MPIR is that every year or so there is a new tick, tock or refresh for Intel, and similar for AMD, that needs support at the assembly level.
Over the years there have been many new instruction set technologies that have had to be supported, such as X86_64, MMX, 3DNOW!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512, BMI, BMI2.
It's surprising to many people that the difference between bignum code output by an optimising compiler like gcc and handwritten assembly code can be a factor of 4-12 in performance. Of course, if you already have assembly code for a prior, related architecture, the improvement you can get with handwritten assembly code is much less than this. However, the difference accumulates with time, as you forgo more and more potential improvements and support for more and more new instruction sets and technologies.
We made the case in the OpenDreamKit grant that this ongoing maintenance to support new instruction sets requires investment to keep up with the latest processor technology. Each new microarchitecture requires as much as 3-6 months full time work to support!
In addition to writing new assembly code to support each new microprocessor iteration, one can use superoptimisation to get up to another factor of two difference in performance (though often much less).
Superoptimisation takes already written assembly code, explores all valid reorderings of the assembly instructions (those that don't change the behaviour of the code), and looks for the fastest one.
As typical assembly sequences can be dozens of lines long, this cannot be done by hand. There can be billions of valid reorderings.
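The search space can be pictured with a small Python sketch that enumerates every ordering in which each instruction still comes after the instructions it depends on. The instruction names and the dependency table are made up for illustration, and this is not how AJS itself decides which reorderings are valid.

```python
def valid_reorderings(instructions, depends_on):
    """Yield every ordering in which each instruction follows its dependencies.

    instructions: list of unique instruction names.
    depends_on:   dict mapping a name to the set of names it must come after.
    """
    def extend(order, remaining):
        if not remaining:
            yield list(order)
            return
        placed = set(order)
        for ins in remaining:
            # An instruction may go next only if all its inputs are placed.
            if depends_on.get(ins, set()) <= placed:
                order.append(ins)
                yield from extend(order, [r for r in remaining if r != ins])
                order.pop()

    yield from extend([], list(instructions))


# Hypothetical five-instruction fragment: two loads feed an add, the add
# feeds a store, and the loop counter update is independent of the rest.
instrs = ["load_a", "load_b", "add", "store", "dec_counter"]
deps = {"add": {"load_a", "load_b"}, "store": {"add"}}

orders = list(valid_reorderings(instrs, deps))
print(len(orders))  # 10
```

Even this tiny fragment has 10 valid orderings. With no dependencies at all, n instructions can be ordered in n! ways, and 13! is already more than six billion.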
The reason this kind of rearrangement can make a difference is that the CPU uses a very simple algorithm to determine which instructions to put in which pipeline. There are also limitations on how many of certain types of instructions can be executed at the same time, e.g. because of a limited number of integer ports, etc.
By rearranging the assembly instructions, we can sometimes cause the CPU scheduler to put the instructions in the pipeline in just the right order that resources are used optimally.
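A toy model shows why the order matters. The sketch below assumes a deliberately simplified CPU that issues instructions strictly in program order, has one memory port and one ALU port per cycle, and stalls as soon as the next instruction is waiting for an operand or a port. The latencies and instruction names are invented, and real Haswell or Skylake cores are far more sophisticated than this.

```python
def cycles_needed(order, info, ports):
    """Return the cycle at which the last result becomes available.

    order: instruction names in program order (dependencies must come first).
    info:  name -> (port, latency, names of instructions it depends on).
    ports: port name -> how many instructions that port accepts per cycle.
    """
    ready_at = {}   # name -> cycle at which its result is available
    cycle = 0
    i = 0
    while i < len(order):
        free = dict(ports)
        # Issue in program order until the next instruction has to wait.
        while i < len(order):
            name = order[i]
            port, latency, deps = info[name]
            operands_ready = all(ready_at[d] <= cycle for d in deps)
            if not operands_ready or free[port] == 0:
                break
            free[port] -= 1
            ready_at[name] = cycle + latency
            i += 1
        cycle += 1
    return max(ready_at.values())


# Hypothetical instructions: two 3-cycle loads, each feeding a 1-cycle add.
info = {
    "load_a": ("mem", 3, ()),
    "load_b": ("mem", 3, ()),
    "add_a": ("alu", 1, ("load_a",)),
    "add_b": ("alu", 1, ("load_b",)),
}
ports = {"mem": 1, "alu": 1}

naive = ["load_a", "add_a", "load_b", "add_b"]
reordered = ["load_a", "load_b", "add_a", "add_b"]
print(cycles_needed(naive, info, ports))      # 7
print(cycles_needed(reordered, info, ports))  # 5
```

Both orderings compute the same thing, but starting the second load early hides its latency behind the first one. A superoptimiser looks for wins of this kind by measuring real hardware rather than trusting a model.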
If an assembly function, like a multiplication routine, is going to be used quadrillions of times, it is certainly well worth trying to get an optimal ordering, since this will save a large amount of CPU time for a lot of people.
The AJS Superoptimiser
Alex Best was the first of the two Alexes to work on writing a superoptimiser for MPIR that supported the recent AVX and BMI instruction sets.
He began with an assembly JIT (Just-In-Time) library written by Petr Kobalicek and improved it for use with MPIR, e.g. by supporting yasm syntax and removing numerous limitations.
On top of this, he wrote a superoptimiser called AJS [3, 6] which cleverly works out which reorderings of a segment of assembly code will be valid and times them all to find the optimal one.
AJS is tailored to work with MPIR assembly functions, but it could be adapted by a sufficiently talented individual to work with other projects if desired.
AJS takes a large number of command line options which control things such as which lines of assembly code should be reordered, how the output is to be written, how timing is to be done, etc.
After six months of work, Alex Kruppa took over AJS development. The basic superoptimiser worked, modulo some bugs, but it still couldn't be used because of an unexpected problem we encountered.
In the good ole days, getting cycle accurate timing was trivial. x86 architectures had an instruction for this. But over time, CPUs have become more and more complex, and the demands on them have become greater. We don't know whether gamers are to blame, or patent trolls, or Intel engineers, but these days cycle accurate timings can only be obtained with a great deal of trouble.
It literally took 3 or so months to solve the problem of getting cycle accurate timings on Intel processors. Some combination of fiddling with hyperthreading, address space layout randomisation, kernel options, frequency scaling, performance counters, kernel modules, stack address manipulation, SSE to AVX switching and various other tricks later, we finally got more or less cycle accurate timing on some machines we were superoptimising for.
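None of those tricks can be reproduced in a few lines, but one general idea behind noisy benchmarking can: repeat the measurement many times and keep the minimum, since interference from the rest of the system only ever makes a run slower. The Python sketch below illustrates that idea with wall-clock nanoseconds. It is not what AJS does, it is nowhere near cycle accurate, and the function name and default parameters are made up for this example.

```python
import time


def best_time_ns(func, repeats=200, inner=1000):
    """Return the smallest observed time per call of func, in nanoseconds.

    Each of the `repeats` samples times `inner` back-to-back calls; the
    minimum sample is kept because noise can only add time, never remove it.
    """
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(inner):
            func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best / inner


# Example: time a small big-integer multiplication.
x = 3 ** 200
print(best_time_ns(lambda: x * x))
```

The printed value varies from machine to machine and from run to run, which is precisely the problem that had to be solved before superoptimisation results could be trusted.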
After this, Alex Kruppa was able to superoptimise for two recent Intel microarchitectures, namely Haswell and Skylake.
He also did some optimisation (though not superoptimisation) for an older AMD architecture called Bulldozer.
As this was all more work than could be accomplished in the 12 months total that we had available, we also relied heavily on outside volunteer effort. We are very thankful to Jens Nurmann in particular who was able to supply handwritten Skylake assembly for many functions which were easy to convert to the MPIR interface.
Brian Gladman also helped to modify these functions so they would work on Windows (which uses a different ABI, i.e. functions receive their arguments in different registers on Windows).
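The ABI difference is easy to state concretely. The sketch below lists the registers used for the first integer or pointer arguments under the System V AMD64 convention (Linux, macOS) and the Microsoft x64 convention. The helper arg_register and the "sysv"/"win64" labels are names invented for this example.

```python
# Registers used for the leading integer/pointer arguments of a call.
SYSV_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")
WIN64_ARG_REGS = ("rcx", "rdx", "r8", "r9")


def arg_register(index, abi):
    """Return where the index-th (0-based) integer argument is passed."""
    regs = {"sysv": SYSV_ARG_REGS, "win64": WIN64_ARG_REGS}[abi]
    return regs[index] if index < len(regs) else "stack"


# A four-argument routine (result pointer, two source pointers, length):
for abi in ("sysv", "win64"):
    print(abi, [arg_register(i, abi) for i in range(4)])
```

So assembly written against one convention reads its inputs from the wrong registers under the other, and a fifth argument already spills to the stack on Windows. On top of that, rdi and rsi must be preserved by the callee on Windows, so a port usually involves more than renaming registers.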
In some cases, GMP already had very fast or optimal assembly code for these processors, and where our internal interface is the same as theirs, we were able to use some of their functions in MPIR.
The result of all this effort is a new version of MPIR which will be released in a couple of days, with much faster assembly optimised functions for Haswell, Skylake and Bulldozer architectures.
We are also in the process of doing some optimisation for Broadwell, the tick (a die shrink of Haswell) that preceded Skylake's tock.
You can see tables of all the performance improvements that were obtained for Haswell, Skylake and Bulldozer on the final writeup for the OpenDreamKit project here.
As can be seen, even over the previous assembly code that was being used for these architectures (which had been written for earlier, but related microarchitectures), we obtain as much as 20 or 30 percent improvement. This represents a real-world speedup that one can expect to see in most calls to MPIR on those architectures.
Of course, we'd like to do much more for the three architectures we optimised for. There wasn't time to speed up division functions, for example, or Montgomery REDC, which are all assembly optimised in MPIR.
And there are the Excavator, Steamroller, Broadwell, Kaby Lake, Zen, Cannon Lake and other architectures still to go.
Volunteers are of course welcome. We need assembly experts who are interested in writing new assembly code, superoptimising it and maintaining support in MPIR for new architectures as they arise. If you can help, please volunteer on our development list, which you can find on our website.