This produces slightly better performance than the inline assembly,
and has the added benefit that it should be portable to other systems
that use gcc, not just x86-64.
Here are the results on my "AMD Athlon(tm) 7450 Dual-Core Processor"
with "gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3":
with portable 64H macros:
camellia : Schedule at 1659
camellia [ 23]: Encrypt at 431, Decrypt at 434
whirlpool : Process at 55
with inline assembly (with "memory clobber" for correctness):
camellia : Schedule at 1380
camellia [ 23]: Encrypt at 406, Decrypt at 403
whirlpool : Process at 50
with __builtin_bswap64:
camellia : Schedule at 1352
camellia [ 23]: Encrypt at 396, Decrypt at 391
whirlpool : Process at 46
This had been causing Camellia (the only cipher that uses these
macros) to fail when compiling "out-of-the-box" with gcc version
"4.3.3-5ubuntu4". I think because the compiler had no idea any memory
access was going on in these macros.
Adding "memory" as a clobber solves the problem, but is probably
overkill. I suspect that if we specify the constraint for y
differently, we could get rid of both "memory" and __volatile__, which
would allow the compiler to optimize much more.
Also, in gcc versions that support it, we should probably use the
bswap builtins instead.
This seemed to be the only place in the code that was using this
particular transposition. And, indeed, when compiling with
"GMP_DESC", it looks like it is necessary to disable Diffie-Hellman.
(Otherwise, the test fails for me.)
addmod and submod are moved to the end of the math descriptor, in order
to be able to run existing software against a new version of ltc without need
to rebuild the software.