This produces slightly better performance than the inline assembly,
and has the added benefit that it should be portable to other systems
that use gcc, not just x86-64.
Here are the results on my "AMD Athlon(tm) 7450 Dual-Core Processor"
with "gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3":
with portable 64H macros:
camellia : Schedule at 1659
camellia [ 23]: Encrypt at 431, Decrypt at 434
whirlpool : Process at 55
with inline assembly (with "memory clobber" for correctness):
camellia : Schedule at 1380
camellia [ 23]: Encrypt at 406, Decrypt at 403
whirlpool : Process at 50
with __builtin_bswap64:
camellia : Schedule at 1352
camellia [ 23]: Encrypt at 396, Decrypt at 391
whirlpool : Process at 46