This had been causing Camellia (the only cipher that uses these
macros) to fail when compiling "out-of-the-box" with gcc version
"4.3.3-5ubuntu4". I think this is because the compiler had no idea
that any memory access was going on in these macros.
Adding "memory" as a clobber solves the problem, but is probably
overkill. I suspect that if we specify the constraint for y
differently, we could get rid of both "memory" and __volatile__, which
would allow the compiler to optimize much more.
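A rough sketch of what I mean by specifying the constraint for y
differently, assuming x86 and gcc-style inline asm. The function names
and the u32_alias typedef are made up for illustration and are not the
macros in the tree. The idea is to hand the asm the memory at y as an
"m" operand instead of passing the pointer in a register:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not the macros from the tree. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* may_alias lets us view the byte buffer as a 32-bit word safely. */
typedef uint32_t __attribute__((__may_alias__)) u32_alias;

static inline void store32h_m(uint32_t x, unsigned char *y)
{
    /* "=m" tells the compiler that the 4 bytes at y are written, so
     * neither a "memory" clobber nor __volatile__ is needed.
     * "+r" because bswapl modifies our local copy of x in place. */
    __asm__ ("bswapl %1\n\t"
             "movl %1, %0"
             : "=m" (*(u32_alias *)y), "+r" (x));
}

static inline uint32_t load32h_m(const unsigned char *y)
{
    uint32_t x;
    /* "m" tells the compiler that the 4 bytes at y are read. */
    __asm__ ("movl %1, %0\n\t"
             "bswapl %0"
             : "=r" (x)
             : "m" (*(const u32_alias *)y));
    return x;
}
#else
/* Portable fallback so the sketch still builds elsewhere. */
static inline void store32h_m(uint32_t x, unsigned char *y)
{
    y[0] = (unsigned char)(x >> 24);
    y[1] = (unsigned char)(x >> 16);
    y[2] = (unsigned char)(x >> 8);
    y[3] = (unsigned char)x;
}

static inline uint32_t load32h_m(const unsigned char *y)
{
    return ((uint32_t)y[0] << 24) | ((uint32_t)y[1] << 16) |
           ((uint32_t)y[2] << 8)  |  (uint32_t)y[3];
}
#endif
```

Since every effect of the asm is now visible in its operands, the
compiler should be free to reorder or drop these like ordinary loads
and stores.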
Also, in gcc versions that support them, we should probably use the
bswap builtins (__builtin_bswap32 and __builtin_bswap64) instead.
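For illustration, the builtin-based variant might look roughly like
this. The helper names and the HAVE_BSWAP_BUILTIN macro are
hypothetical, the version check assumes the builtin appeared in gcc
4.3, and the load/store helpers assume a little-endian host such as
x86:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical feature test: assumes __builtin_bswap32 exists in
 * gcc 4.3 and later. */
#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
#define HAVE_BSWAP_BUILTIN 1
#endif

static inline uint32_t swap32(uint32_t v)
{
#ifdef HAVE_BSWAP_BUILTIN
    return __builtin_bswap32(v);
#else
    /* Plain C fallback for compilers without the builtin. */
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
#endif
}

/* Store x at y in big-endian order (little-endian host assumed). */
static inline void store32h_b(uint32_t x, unsigned char *y)
{
    uint32_t t = swap32(x);
    memcpy(y, &t, sizeof t);   /* memcpy avoids alignment/aliasing issues */
}

/* Load a big-endian 32-bit value from y (little-endian host assumed). */
static inline uint32_t load32h_b(const unsigned char *y)
{
    uint32_t t;
    memcpy(&t, y, sizeof t);
    return swap32(t);
}
```

With no asm involved, the compiler sees the memory accesses directly,
so the clobber question goes away entirely.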