added libtommath-0.14

2003-03-13 02:11:11 +00:00 · 2003-03-13 02:11:11 +00:00 · 82f4858291
commit 82f4858291
parent b66471f74f
94 changed files with 600 additions and 418 deletions
--- a/bn.pdf
+++ b/bn.pdf
--- a/bn.tex
+++ b/bn.tex
@ -1,15 +1,15 @@
 \documentclass{article}
 \begin{document}
-\title{LibTomMath v0.13 \\ A Free Multiple Precision Integer Library}
+\title{LibTomMath v0.14 \\ A Free Multiple Precision Integer Library \\ http://math.libtomcrypt.org }
 \author{Tom St Denis \\ tomstdenis@iahu.ca}
 \maketitle
 \newpage
 \section{Introduction}
-``LibTomMath'' is a free and open source library that provides multiple-precision integer functions required to form a basis
+``LibTomMath'' is a free and open source library that provides multiple-precision integer functions required to form a 
-of a public key cryptosystem.  LibTomMath is written entire in portable ISO C source code and designed to have an application
+basis of a public key cryptosystem.  LibTomMath is written entirely in portable ISO C source code and designed to have an 
-interface much like that of MPI from Michael Fromberger.  
+application interface much like that of MPI from Michael Fromberger.  
 LibTomMath was written from scratch by Tom St Denis but designed to be a drop-in replacement for the MPI package.  The 
 algorithms within the library are derived from descriptions as provided in the Handbook of Applied Cryptography and Knuth's
@ -23,8 +23,7 @@ LibTomMath was designed with the following goals in mind:
 \item Be written entirely in portable C.
 \end{enumerate}
-All three goals have been achieved.  Particularly the speed increase goal.  For example, a 512-bit modular exponentiation 
+All three goals have been achieved to one extent or another (actual figures depend on what platform you are using).
 is eight times faster\footnote{On an Athlon XP with GCC 3.2} with LibTomMath compared to MPI.
 Being compatible with MPI means that applications that already use it can be ported fairly quickly.  Currently there are 
 a few differences but there are many similarities.  In fact the average MPI based application can be ported in under 15
@ -54,16 +53,26 @@ make install
 Now within your application include ``tommath.h'' and link against libtommath.a to get MPI-like functionality.
 \subsection{Microsoft Visual C++}
 A makefile is also provided for MSVC (\textit{tested against MSVC 6.00 with SP5}) which allows the library to be used
 with that compiler as well.  To build the library type
 \begin{verbatim}
 nmake -f makefile.msvc
 \end{verbatim}
 Which will build ``tommath.lib''.  
 \section{Programming with LibTomMath}
 \subsection{The mp\_int Structure}
 All multiple precision integers are stored in a structure called \textbf{mp\_int}.  A multiple precision integer is
-essentially an array of \textbf{mp\_digit}.  mp\_digit is defined at the top of bn.h.  Its type can be changed to suit
+essentially an array of \textbf{mp\_digit}.  mp\_digit is defined at the top of ``tommath.h''.  The type can be changed 
-a particular platform.  
+to suit a particular platform.  
-For example, when \textbf{MP\_8BIT} is defined\footnote{When building bn.c.} a mp\_digit is a unsigned char and holds 
+For example, when \textbf{MP\_8BIT} is defined an mp\_digit is an unsigned char and holds seven bits.  Similarly 
-seven bits.  Similarly when \textbf{MP\_16BIT} is defined a mp\_digit is a unsigned short and holds 15 bits.  
+when \textbf{MP\_16BIT} is defined an mp\_digit is an unsigned short and holds 15 bits.   By default an mp\_digit is an 
-By default a mp\_digit is a unsigned long and holds 28 bits.  
+unsigned long and holds 28 bits which is optimal for most 32 and 64 bit processors.
 The choice of digit is particular to the platform at hand and what available multipliers are provided.  For 
 MP\_8BIT either a $8 \times 8 \Rightarrow 16$ or $16 \times 16 \Rightarrow 16$ multiplier is optimal.  When 
@ -83,20 +92,19 @@ $W$ is the number of bits in a digit (default is 28).
 \subsection{Calling Functions}
 Most functions expect pointers to mp\_int's as parameters.   To save on memory usage it is possible to have source
-variables as destinations.  For example:
+variables as destinations.  The arguments are read left to right so to compute $x + y = z$ you would pass the arguments
 in the order $x, y, z$.  For example:
 \begin{verbatim}
   mp_add(&x, &y, &x);           /* x = x + y */
-   mp_mul(&x, &z, &x);           /* x = x * z */
+   mp_mul(&y, &x, &z);           /* z = y * x */
-   mp_div_2(&x, &x);             /* x = x / 2 */
+   mp_div_2(&x, &y);             /* y = x / 2 */
 \end{verbatim}
-\section{Quick Overview}
+\subsection{Return Values}
 All functions that return errors will return \textbf{MP\_OKAY} if the function was successful.  It will return 
 \textbf{MP\_MEM} if it ran out of heap memory or \textbf{MP\_VAL} if one of the arguments is out of range.  
 \subsection{Basic Functionality}
 Essentially all LibTomMath functions return one of three values to indicate if the function worked as desired.  A 
 function will return \textbf{MP\_OKAY} if the function was successful.  A function will return \textbf{MP\_MEM} if
 it ran out of memory and \textbf{MP\_VAL} if the input was invalid.  
 Before an mp\_int can be used it must be initialized with 
 \begin{verbatim}
@ -106,7 +114,7 @@ int mp_init(mp_int *a);
 For example, consider the following.
 \begin{verbatim}
-#include "bn.h"
+#include "tommath.h"
 int main(void)
 {
   mp_int num;
@ -383,6 +391,18 @@ in $c$ and returns success.
 This function requires $O(N)$ additional digits of memory and $O(2 \cdot N)$ time.
 \subsubsection{mp\_mul\_2(mp\_int *a, mp\_int *b)}
 Multiplies $a$ by two and stores in $b$.  This function is hard coded to do a shift by one place so it is faster
 than calling mp\_mul\_2d with a count of one.  
 This function requires $O(N)$ additional digits of memory and $O(N)$ time.
 \subsubsection{mp\_div\_2(mp\_int *a, mp\_int *b)}
 Divides $a$ by two and stores in $b$.  This function is hard coded to do a shift by one place so it is faster
 than calling mp\_div\_2d with a count of one.
 This function requires $O(N)$ additional digits of memory and $O(N)$ time.
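The single-place shifts described for mp\_mul\_2 and mp\_div\_2 can be sketched on a bare digit array.  This is only an illustration under stated assumptions: the names \texttt{mul\_2\_digits} and \texttt{div\_2\_digits} are hypothetical, digits are held in \texttt{uint32\_t} with 28 bits used per digit, and sign handling, growing and clamping of a real mp\_int are left out.

```c
#include <stdint.h>

#define DIGIT_BIT 28                            /* bits per digit, as in the default build */
#define DIGIT_MASK ((1u << DIGIT_BIT) - 1u)

/* Hypothetical helper: b = a * 2 on n digits (least significant first).
 * b must have room for n + 1 digits; returns the number of digits written. */
int mul_2_digits(const uint32_t *a, int n, uint32_t *b)
{
    uint32_t carry = 0;
    int i;

    for (i = 0; i < n; i++) {
        /* the top bit of this digit becomes the low bit of the next one */
        uint32_t next = a[i] >> (DIGIT_BIT - 1);
        b[i] = ((a[i] << 1) | carry) & DIGIT_MASK;
        carry = next;
    }
    if (carry != 0) {
        b[n] = carry;
        return n + 1;
    }
    return n;
}

/* Hypothetical helper: b = a / 2 on n digits, walking from the top digit down. */
void div_2_digits(const uint32_t *a, int n, uint32_t *b)
{
    uint32_t carry = 0;
    int i;

    for (i = n - 1; i >= 0; i--) {
        /* the low bit of this digit becomes the top bit of the digit below */
        uint32_t next = a[i] & 1u;
        b[i] = (a[i] >> 1) | (carry << (DIGIT_BIT - 1));
        carry = next;
    }
}
```

Each loop reads a digit before overwriting it, so the source and destination arrays may be the same, which mirrors the library convention of allowing a source variable as the destination.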
 \subsubsection{mp\_mod\_2d(mp\_int *a, int b, mp\_int *c)}
 Performs the action of reducing $a$ modulo $2^b$ and stores the result in $c$.  If the shift count $b$ is less than 
 or equal to zero the function places $a$ in $c$ and returns success.  
@ -412,7 +432,7 @@ of $c$ is the maximum length of the two inputs.
 \subsection{Basic Arithmetic}
 \subsubsection{mp\_cmp(mp\_int *a, mp\_int *b)}
-Performs a \textbf{signed} comparison between $a$ and $b$ returning \textbf{MP\_GT} is $a$ is larger than $b$.
+Performs a \textbf{signed} comparison between $a$ and $b$ returning \textbf{MP\_GT} if $a$ is larger than $b$.
 This function requires no additional memory and $O(N)$ time.
@ -559,57 +579,6 @@ A very useful observation is that multiplying by $R = \beta^n$ amounts to perfor
 requires no single precision multiplications.  
 \section{Timing Analysis}
 \subsection{Observed Timings}
 A simple test program ``demo.c'' was developed which builds with either MPI or LibTomMath (without modification).  The
 test was conducted on an AMD Athlon XP processor with 266Mhz DDR memory and the GCC 3.2 compiler\footnote{With build
 options ``-O3 -fomit-frame-pointer -funroll-loops''}.    The multiplications and squarings were repeated 100,000 times 
 each while the modular exponentiation (exptmod) were performed 50 times each.  The ``inversions'' refers to multiplicative
 inversions modulo an odd number of a given size.  The RDTSC (Read Time Stamp Counter) instruction was used to measure the 
 time the entire iterations took and was divided by the number of iterations to get an average.  The following results 
 were observed.
 \begin{small}
 \begin{center}
 \begin{tabular}{c|c|c|c}
 \hline \textbf{Operation} & \textbf{Size (bits)} & \textbf{Time with MPI (cycles)} & \textbf{Time with LibTomMath (cycles)} \\
 \hline
 Inversion & 128 & 264,083  & 59,782   \\
 Inversion & 256 & 549,370  & 146,915   \\
 Inversion & 512 & 1,675,975  & 367,172   \\
 Inversion & 1024 & 5,237,957  & 1,054,158   \\
 Inversion & 2048 & 17,871,944  & 3,459,683   \\
 Inversion & 4096 & 66,610,468  & 11,834,556   \\
 \hline
 Multiply & 128 & 1,426   & 451     \\
 Multiply & 256 & 2,551   & 958     \\
 Multiply & 512 & 7,913   & 2,476     \\
 Multiply & 1024 & 28,496   & 7,927   \\
 Multiply & 2048 & 109,897   & 28,224     \\
 Multiply & 4096 & 469,970   & 101,171     \\
 \hline 
 Square & 128 & 1,319   & 511     \\
 Square & 256 & 1,776   & 947     \\
 Square & 512 & 5,399  & 2,153    \\
 Square & 1024 & 18,991  & 5,733     \\
 Square & 2048 & 72,126  & 17,621    \\
 Square & 4096 & 306,269  & 67,576   \\
 \hline 
 Exptmod & 512 & 32,021,586  & 3,118,435 \\
 Exptmod & 768 & 97,595,492  & 8,493,633 \\
 Exptmod & 1024 & 223,302,532  & 17,715,899     \\
 Exptmod & 2048 & 1,682,223,369   & 114,936,361      \\
 Exptmod & 2560 & 3,268,615,571   & 229,402,426       \\
 Exptmod & 3072 & 5,597,240,141   & 367,403,840      \\
 Exptmod & 4096 & 13,347,270,891   & 779,058,433      
 \end{tabular}
 \end{center}
 \end{small}
 Note that the figures do fluctuate but their magnitudes are relatively intact.  The purpose of the chart is not to
 get an exact timing but to compare the two libraries.  For example, in all of the tests the exact time for a 512-bit
 squaring operation was not the same.  The observed times were all approximately 2,500 cycles, more importantly they
 were always faster than the timings observed with MPI by about the same magnitude.  
 \subsection{Digit Size}
 The first major contribution to the time savings is the fact that 28 bits are stored per digit instead of the MPI 
@ -619,29 +588,59 @@ A savings of $64^2 - 37^2 = 2727$ single precision multiplications.
 \subsection{Multiplication Algorithms}
 For most inputs a typical baseline $O(n^2)$ multiplier is used which is similar to that of MPI.  There are two variants 
-of the baseline multiplier.  The normal and the fast variants.  The normal baseline multiplier is the exact same as the
+of the baseline multiplier: the normal variant and the fast comba variant.  The normal baseline multiplier is the exact same as 
-algorithm from MPI.  The fast baseline multiplier is optimized for cases where the number of input digits $N$ is less
+the algorithm from MPI.  The fast comba baseline multiplier is optimized for cases where the number of input digits $N$ 
-than or equal to $2^{w}/\beta^2$.  Where $w$ is the number of bits in a \textbf{mp\_word}.  By default a mp\_word is
+is less than or equal to $2^{w}/\beta^2$.  Where $w$ is the number of bits in a \textbf{mp\_word} or simply $lg(\beta)$.
-64-bits which means $N \le 256$ is allowed which represents numbers upto $7168$ bits.
+By default an mp\_word is 64 bits which means $N \le 256$ is allowed which represents numbers up to $7,168$ bits.  However,
 since the Karatsuba multiplier (discussed below) will kick in before that size the slower baseline algorithm (that MPI
 uses) should never really be used in a default configuration.  
-The fast baseline multiplier is optimized by removing the carry operations from the inner loop.  This is often referred
+The fast comba baseline multiplier is optimized by removing the carry operations from the inner loop.  This is often 
-to as the ``comba'' method since it computes the products a columns first then figures out the carries.  This has the
+referred to as the ``comba'' method since it computes the products in columns first and then figures out the carries.  To
-effect of making a very simple and paralizable inner loop.
+accommodate this the result of the inner multiplications must be stored in words large enough not to lose the carry bits.  
 This is why there is a limit of $2^{w}/\beta^2$ digits in the input.  This optimization has the effect of making a 
 very simple and efficient inner loop.
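A minimal sketch of the column-first idea follows.  It is not the library's routine: the name \texttt{comba\_mul}, the cap \texttt{COMBA\_MAX} and the use of bare \texttt{uint32\_t} digit arrays are assumptions made for illustration.  Products are accumulated into 64-bit words with no carry handling in the inner loop, and the carries are propagated in a single pass afterwards.

```c
#include <stdint.h>

#define DIGIT_BIT 28                            /* bits per digit, as in the default build */
#define DIGIT_MASK ((1u << DIGIT_BIT) - 1u)
#define COMBA_MAX 16                            /* hypothetical cap, far below the 256 digit limit */

/* Hypothetical helper: c = a * b where a has na digits and b has nb digits
 * (least significant first).  c must have room for na + nb digits.
 * Returns 0 on success and -1 if the inputs exceed the sketch's cap. */
int comba_mul(const uint32_t *a, int na, const uint32_t *b, int nb, uint32_t *c)
{
    uint64_t W[2 * COMBA_MAX];                  /* one wide accumulator per output column */
    int ix, iy, n = na + nb;

    if (na < 1 || nb < 1 || na > COMBA_MAX || nb > COMBA_MAX) {
        return -1;
    }
    for (ix = 0; ix < n; ix++) {
        W[ix] = 0;
    }

    /* inner loop: multiply and accumulate only, no carries */
    for (ix = 0; ix < na; ix++) {
        for (iy = 0; iy < nb; iy++) {
            W[ix + iy] += (uint64_t) a[ix] * b[iy];
        }
    }

    /* now figure out the carries in one pass */
    for (ix = 0; ix + 1 < n; ix++) {
        W[ix + 1] += W[ix] >> DIGIT_BIT;
        c[ix] = (uint32_t) (W[ix] & DIGIT_MASK);
    }
    c[n - 1] = (uint32_t) (W[n - 1] & DIGIT_MASK);
    return 0;
}
```

Each product of two 28-bit digits is below $2^{56}$, so a 64-bit accumulator can absorb on the order of $2^{8} = 256$ of them before overflowing, which is where the digit limit quoted in the text comes from.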
-For large inputs, typically 80 digits\footnote{By default that is 2240-bits or more.} or more the Karatsuba method is 
+\subsubsection{Karatsuba Multiplier}
-used.  This method has significant overhead but an asymptotic running time of $O(n^{1.584})$ which means for fairly large
+For large inputs, typically 80 digits\footnote{By default that is 2240-bits or more.} or more the Karatsuba multiplication
-inputs this method is faster.  The Karatsuba implementation is recursive which means for extremely large inputs they
+method is used.  This method has significant overhead but an asymptotic running time of $O(n^{1.584})$ which means for 
-will benefit from the algorithm.
+fairly large inputs this method is faster than the baseline (or comba) algorithm.  The Karatsuba implementation is 
 recursive which means for extremely large inputs they will benefit from the algorithm.
 The algorithm is based on the observation that if 
 \begin{eqnarray}
 x = x_0 + x_1\beta \nonumber \\
 y = y_0 + y_1\beta
 \end{eqnarray}
 Where $x_0, x_1, y_0, y_1$ are half the size of their respective summand then 
 \begin{equation}
 x \cdot y = x_1y_1\beta^2 + (x_0y_0 + x_1y_1 - (x_1 - x_0)(y_1 - y_0))\beta + x_0y_0
 \end{equation}
 It is trivial that from this only three products have to be produced: $x_0y_0, x_1y_1, (x_1-x_0)(y_1-y_0)$ which
 are all of half size numbers.  A multiplication of two half size numbers requires only $1 \over 4$ of the
 original work which means with no recursion the Karatsuba algorithm achieves a running time of ${3n^2}\over 4$.  
 The routine provided does recursion which is where the $O(n^{1.584})$ work factor comes from.
 The multiplication by $\beta$ and $\beta^2$ amount to digit shift operations.  
 The extra overhead in the Karatsuba method comes from extracting the half size numbers $x_0, x_1, y_0, y_1$ and
 performing the various smaller calculations.  
 The library has been fairly optimized to extract the digits using hard-coded routines instead of the higher
 level functions however there is still significant overhead to optimize away.
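As a sanity check on the three-product idea, here is a one-level sketch on machine words rather than mp\_ints.  The function name \texttt{karatsuba32} and the choice $\beta = 2^{16}$ are assumptions made for illustration only.  The middle term is formed as $x_0y_0 + x_1y_1 - (x_1 - x_0)(y_1 - y_0)$, which equals $x_1y_0 + x_0y_1$ and matches the comments in the Karatsuba sources later in this commit.

```c
#include <stdint.h>

/* Hypothetical helper: one level of Karatsuba on two 32-bit inputs,
 * each split into 16-bit halves (beta = 2^16 in this sketch). */
uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    int64_t x0 = x & 0xFFFFu, x1 = x >> 16;
    int64_t y0 = y & 0xFFFFu, y1 = y >> 16;

    /* the only three half-size products */
    int64_t x0y0 = x0 * y0;
    int64_t x1y1 = x1 * y1;
    int64_t diff = (x1 - x0) * (y1 - y0);       /* may be negative */

    /* middle coefficient, equal to x1*y0 + x0*y1 and never negative */
    int64_t mid = x0y0 + x1y1 - diff;

    /* multiplying by beta and beta^2 is just a shift */
    return ((uint64_t) x1y1 << 32) + ((uint64_t) mid << 16) + (uint64_t) x0y0;
}
```

The signed intermediate type is a design choice of the sketch: the difference product can be negative even though the middle coefficient cannot, so the subtraction is done before converting back to an unsigned value.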
 MPI only implements the slower baseline multiplier where carries are dealt with in the inner loop.  As a result even at
 smaller numbers (below the Karatsuba cutoff) the LibTomMath multipliers are faster.
 \subsection{Squaring Algorithms}
-Similar to the multiplication algorithms there are two baseline squaring algorithms.  Both have an asymptotic running
+Similar to the multiplication algorithms there are two baseline squaring algorithms.  Both have an asymptotic 
-time of $O((t^2 + t)/2)$.  The normal baseline squaring is the same from MPI and the fast is a ``comba'' squaring
+running time of $O((t^2 + t)/2)$.  The normal baseline squaring is the same from MPI and the fast method is 
-algorithm.  The comba method is used if the number of digits $N$ is less than $2^{w-1}/\beta^2$ which by default 
+a ``comba'' squaring algorithm.  The comba method is used if the number of digits $N$ is less than 
-covers numbers upto $3584$ bits.  
+$2^{w-1}/\beta^2$ which by default covers numbers up to $3,584$ bits.  
 There is also a Karatsuba squaring method which achieves a running time of $O(n^{1.584})$ after considerably large
 inputs.
@ -653,25 +652,31 @@ than MPI is.
 LibTomMath implements a sliding window $k$-ary left to right exponentiation algorithm.  For a given exponent size $L$ an
 appropriate window size $k$ is chosen.  There are always at most $L$ modular squarings and $\lfloor L/k \rfloor$ modular
-multiplications.   The $k$-ary method works by precomputing values $g(x) = b^x$ for $0 \le x < 2^k$ and a given base 
+multiplications.   The $k$-ary method works by precomputing values $g(x) = b^x$ for $2^{k-1} \le x < 2^k$ and a given base 
 $b$.  Then the multiplications are grouped in windows of $k$ bits.  The sliding window technique has the benefit 
 that it can skip multiplications if there are zero bits following or preceding a window.  Consider the exponent 
 $e = 11110001_2$ if $k = 2$ then there will be two squarings, a multiplication of $g(3)$, two squarings, a multiplication
 of $g(3)$, four squarings and a multiplication by $g(1)$.  In total there are 8 squarings and 3 multiplications.
-MPI uses a binary square-multiply method.  For the same exponent $e$ it would have had 8 squarings and 5 multiplications.  
+MPI uses a binary square-multiply method for exponentiation.  For the same exponent $e = 11110001_2$ it would have had to
-There is a precomputation phase for the method LibTomMath uses but it generally cuts down considerably on the number
+perform 8 squarings and 5 multiplications.  There is a precomputation phase for the method LibTomMath uses but it 
-of multiplications.  Consider a 512-bit exponent.  The worst case for the LibTomMath method results in 512 squarings and 
+generally cuts down considerably on the number of multiplications.  Consider a 512-bit exponent.  The worst case for the 
-124 multiplications.  The MPI method would have 512 squarings and 512 multiplications.  Randomly every $2k$ bits another 
+LibTomMath method results in 512 squarings and 124 multiplications.  The MPI method would have 512 squarings 
-multiplication is saved via the sliding-window technique on top of the savings the $k$-ary method provides.
+and 512 multiplications.  Randomly every $2k$ bits another multiplication is saved via the sliding-window 
 technique on top of the savings the $k$-ary method provides.
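The operation counts quoted for $e = 11110001_2$ can be reproduced with a small word-sized sketch.  Everything here is an assumption made for illustration: the names \texttt{expt\_stats}, \texttt{binary\_expt} and \texttt{sliding\_window\_expt} are hypothetical, the modulus is a nonzero 32-bit value reduced with the \texttt{\%} operator rather than Barrett or Montgomery reduction, the whole table $g(0), \ldots, g(2^k - 1)$ is precomputed for simplicity, and squarings of the initial value one are counted just as the text counts them.  Precomputation multiplications are not included in the tally.

```c
#include <stdint.h>

typedef struct {
    unsigned squarings;
    unsigned multiplications;
    uint64_t result;
} expt_stats;

/* (a * b) mod m; callers keep a, b < m < 2^32 so the product fits in 64 bits */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (a * b) % m;
}

/* Plain left to right binary square-and-multiply; m must be nonzero. */
expt_stats binary_expt(uint64_t b, uint32_t e, uint32_t m)
{
    expt_stats s = { 0, 0, 0 };
    int i = 31;

    s.result = 1 % m;
    b %= m;
    while (i >= 0 && ((e >> i) & 1u) == 0) {
        i--;                                    /* skip leading zero bits */
    }
    for (; i >= 0; i--) {
        s.result = mulmod(s.result, s.result, m);
        s.squarings++;
        if ((e >> i) & 1u) {
            s.result = mulmod(s.result, b, m);
            s.multiplications++;
        }
    }
    return s;
}

/* Sliding window with windows of at most k bits (1 <= k <= 4); m must be nonzero. */
expt_stats sliding_window_expt(uint64_t b, uint32_t e, uint32_t m, int k)
{
    expt_stats s = { 0, 0, 0 };
    uint64_t g[16];                             /* g[x] = b^x mod m */
    unsigned x;
    int i = 31, j, len;

    if (k < 1) k = 1;
    if (k > 4) k = 4;
    s.result = 1 % m;
    b %= m;
    g[0] = 1 % m;
    for (x = 1; x < (1u << k); x++) {
        g[x] = mulmod(g[x - 1], b, m);
    }
    while (i >= 0 && ((e >> i) & 1u) == 0) {
        i--;                                    /* skip leading zero bits */
    }
    while (i >= 0) {
        if (((e >> i) & 1u) == 0) {
            /* zero bit outside a window: square only */
            s.result = mulmod(s.result, s.result, m);
            s.squarings++;
            i--;
        } else {
            /* take up to k bits, shrinking so the window ends in a one bit */
            len = (i + 1 < k) ? i + 1 : k;
            while (((e >> (i - len + 1)) & 1u) == 0) {
                len--;
            }
            for (j = 0; j < len; j++) {
                s.result = mulmod(s.result, s.result, m);
                s.squarings++;
            }
            x = (e >> (i - len + 1)) & ((1u << len) - 1u);
            s.result = mulmod(s.result, g[x], m);
            s.multiplications++;
            i -= len;
        }
    }
    return s;
}
```

Calling both routines with the exponent \texttt{0xF1} and $k = 2$ gives 8 squarings in each case, with 3 multiplications for the sliding window and 5 for the binary method, in line with the figures above.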
 Both LibTomMath and MPI use Barrett reduction instead of division to reduce the numbers modulo the modulus given.
 However, LibTomMath can take advantage of the fact that the multiplications required within the Barrett reduction
-do not have to give full precision.  As a result the reduction step is much faster and just as accurate.  The LibTomMath code
+do not have to give full precision.  As a result the reduction step is much faster and just as accurate.  The LibTomMath 
-will automatically determine at run-time (e.g. when its called) whether the faster multiplier can be used.  The
+code will automatically determine at run-time (e.g. when it is called) whether the faster multiplier can be used.  The
 faster multipliers have also been optimized into the two variants (baseline and comba baseline).
 LibTomMath also has a variant of the exptmod function that uses Montgomery reductions instead of Barrett reductions
-which is faser.  As a result of all these changes exponentiation in LibTomMath is much faster than compared to MPI.  
+which is faster.  The code will automatically detect when the Montgomery version can be used (\textit{Requires the
 modulus to be odd and below the MONTGOMERY\_EXPT\_CUTOFF size}).  The Montgomery routine is essentially a copy of the 
 Barrett exponentiation routine except it uses Montgomery reduction.
 As a result of all these changes exponentiation in LibTomMath is much faster than in MPI.  On most ALU-strong
 processors (AMD Athlon for instance) exponentiation in LibTomMath is often more than ten times faster than MPI.   
 \end{document}
--- a/bn_fast_mp_invmod.c
+++ b/bn_fast_mp_invmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_fast_mp_montgomery_reduce.c
+++ b/bn_fast_mp_montgomery_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -100,14 +100,18 @@ fast_mp_montgomery_reduce (mp_int * a, mp_int * m, mp_digit mp)
    W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT);
  }
   /* now fix rest of carries */
  for (++ix; ix <= m->used * 2 + 1; ix++) {
    W[ix] += (W[ix - 1] >> ((mp_word) DIGIT_BIT));
  }
  {
    register mp_digit *tmpa;
-    register mp_word *_W;
+    register mp_word *_W, *_W1;
     /* now fix rest of carries */
    _W1 = W + ix;
    _W = W + ++ix;
    for (; ix <= m->used * 2 + 1; ix++) {
      *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT);
    }
    /* copy out, A = A/b^n
     *
--- a/bn_fast_s_mp_mul_digs.c
+++ b/bn_fast_s_mp_mul_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_fast_s_mp_mul_high_digs.c
+++ b/bn_fast_s_mp_mul_high_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_fast_s_mp_sqr.c
+++ b/bn_fast_s_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_2expt.c
+++ b/bn_mp_2expt.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_abs.c
+++ b/bn_mp_abs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_add.c
+++ b/bn_mp_add.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_add_d.c
+++ b/bn_mp_add_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_addmod.c
+++ b/bn_mp_addmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_and.c
+++ b/bn_mp_and.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_clamp.c
+++ b/bn_mp_clamp.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_clear.c
+++ b/bn_mp_clear.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_cmp.c
+++ b/bn_mp_cmp.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_cmp_d.c
+++ b/bn_mp_cmp_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_cmp_mag.c
+++ b/bn_mp_cmp_mag.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_copy.c
+++ b/bn_mp_copy.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_count_bits.c
+++ b/bn_mp_count_bits.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_div.c
+++ b/bn_mp_div.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_div_2.c
+++ b/bn_mp_div_2.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -46,6 +46,7 @@ mp_div_2 (mp_int * a, mp_int * b)
      *tmpb++ = 0;
    }
  }
  b->sign = a->sign;
  mp_clamp (b);
  return MP_OKAY;
 }
--- a/bn_mp_div_2d.c
+++ b/bn_mp_div_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -51,7 +51,9 @@ mp_div_2d (mp_int * a, int b, mp_int * c, mp_int * d)
  }
  /* shift by as many digits in the bit count */
  if (b >= DIGIT_BIT) {
     mp_rshd (c, b / DIGIT_BIT);
  }     
  /* shift any bit count < DIGIT_BIT */
  D = (mp_digit) (b % DIGIT_BIT);
--- a/bn_mp_div_d.c
+++ b/bn_mp_div_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_exch.c
+++ b/bn_mp_exch.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_expt_d.c
+++ b/bn_mp_expt_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_exptmod.c
+++ b/bn_mp_exptmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_exptmod_fast.c
+++ b/bn_mp_exptmod_fast.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_gcd.c
+++ b/bn_mp_gcd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_grow.c
+++ b/bn_mp_grow.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_init.c
+++ b/bn_mp_init.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_init_copy.c
+++ b/bn_mp_init_copy.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_init_size.c
+++ b/bn_mp_init_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_invmod.c
+++ b/bn_mp_invmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_jacobi.c
+++ b/bn_mp_jacobi.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_karatsuba_mul.c
+++ b/bn_mp_karatsuba_mul.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -37,8 +37,7 @@ int
 mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
 {
  mp_int  x0, x1, y0, y1, t1, t2, x0y0, x1y1;
-  int     B, err, x;
+  int     B, err;
  err = MP_MEM;
@ -59,13 +58,13 @@ mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
    goto Y0;
  /* init temps */
-  if (mp_init (&t1) != MP_OKAY)
+  if (mp_init_size (&t1, B * 2) != MP_OKAY)
    goto Y1;
-  if (mp_init (&t2) != MP_OKAY)
+  if (mp_init_size (&t2, B * 2) != MP_OKAY)
    goto T1;
-  if (mp_init (&x0y0) != MP_OKAY)
+  if (mp_init_size (&x0y0, B * 2) != MP_OKAY)
    goto T2;
-  if (mp_init (&x1y1) != MP_OKAY)
+  if (mp_init_size (&x1y1, B * 2) != MP_OKAY)
    goto X0Y0;
  /* now shift the digits */
@ -76,18 +75,32 @@ mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
  x1.used = a->used - B;
  y1.used = b->used - B;
  {
    register int x;
    register mp_digit *tmpa, *tmpb, *tmpx, *tmpy;
    /* we copy the digits directly instead of using higher level functions
     * since we also need to shift the digits
     */
    tmpa = a->dp;
    tmpb = b->dp;
    tmpx = x0.dp;
    tmpy = y0.dp;
    for (x = 0; x < B; x++) {
-    x0.dp[x] = a->dp[x];
+      *tmpx++ = *tmpa++;
-    y0.dp[x] = b->dp[x];
+      *tmpy++ = *tmpb++;
    }
    tmpx = x1.dp;
    for (x = B; x < a->used; x++) {
-    x1.dp[x - B] = a->dp[x];
+      *tmpx++ = *tmpa++;
    }
    tmpy = y1.dp;
    for (x = B; x < b->used; x++) {
-    y1.dp[x - B] = b->dp[x];
+      *tmpy++ = *tmpb++;
    }
  }
  /* only need to clamp the lower words since by definition the upper words x1/y1 must
--- a/bn_mp_karatsuba_sqr.c
+++ b/bn_mp_karatsuba_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -23,8 +23,7 @@ int
 mp_karatsuba_sqr (mp_int * a, mp_int * b)
 {
  mp_int  x0, x1, t1, t2, x0x0, x1x1;
-  int     B, err, x;
+  int     B, err;
  err = MP_MEM;
@ -41,22 +40,31 @@ mp_karatsuba_sqr (mp_int * a, mp_int * b)
    goto X0;
  /* init temps */
-  if (mp_init (&t1) != MP_OKAY)
+  if (mp_init_size (&t1, a->used * 2) != MP_OKAY)
    goto X1;
-  if (mp_init (&t2) != MP_OKAY)
+  if (mp_init_size (&t2, a->used * 2) != MP_OKAY)
    goto T1;
-  if (mp_init (&x0x0) != MP_OKAY)
+  if (mp_init_size (&x0x0, B * 2) != MP_OKAY)
    goto T2;
-  if (mp_init (&x1x1) != MP_OKAY)
+  if (mp_init_size (&x1x1, (a->used - B) * 2) != MP_OKAY)
    goto X0X0;
  {
    register int x;
    register mp_digit *dst, *src;
    src = a->dp;
    /* now shift the digits */
    dst = x0.dp;
    for (x = 0; x < B; x++) {
-    x0.dp[x] = a->dp[x];
+      *dst++ = *src++;
    }
    dst = x1.dp;
    for (x = B; x < a->used; x++) {
-    x1.dp[x - B] = a->dp[x];
+      *dst++ = *src++;
    }
  }
  x0.used = B;
@ -77,7 +85,7 @@ mp_karatsuba_sqr (mp_int * a, mp_int * b)
    goto X1X1;			/* t1 = (x1 - x0) * (x1 - x0) */
  /* add x0x0 and x1x1 */
-  if (mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
+  if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
    goto X1X1;			/* t2 = x0x0 + x1x1 */
  if (mp_sub (&t2, &t1, &t1) != MP_OKAY)
    goto X1X1;			/* t1 = x0x0 + x1x1 - (x1-x0)*(x1-x0) */
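The comments in the two Karatsuba hunks above rely on the identity x0y0 + x1y1 - (x1 - x0)(y1 - y0) = x0y1 + x1y0, which is what lets three half-size products replace four. The following is a minimal sketch of that identity on machine words rather than on `mp_int` values; the function name `karatsuba_mul32` and the 16-bit split are assumptions made for illustration and are not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 32-bit values with three 16x16 products (Karatsuba).
 * Hypothetical helper: the split point is fixed at 2^16 here, whereas
 * the library splits at B digits. */
uint64_t karatsuba_mul32(uint32_t a, uint32_t b)
{
    /* low and high halves of each operand */
    int64_t x0 = a & 0xFFFFu, x1 = a >> 16;
    int64_t y0 = b & 0xFFFFu, y1 = b >> 16;

    int64_t x0y0 = x0 * y0;               /* low product */
    int64_t x1y1 = x1 * y1;               /* high product */
    int64_t t1   = (x1 - x0) * (y1 - y0); /* may be negative */

    /* middle term: x0y0 + x1y1 - (x1-x0)(y1-y0) == x0*y1 + x1*y0 */
    int64_t mid = x0y0 + x1y1 - t1;
    assert(mid >= 0);

    /* recombine: x1y1 * 2^32 + mid * 2^16 + x0y0 */
    return ((uint64_t)x1y1 << 32) + ((uint64_t)mid << 16) + (uint64_t)x0y0;
}
```

For squaring the same identity collapses to x0x0 + x1x1 - (x1 - x0)^2 = 2*x0*x1, so the difference term is never negative before it is subtracted.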
--- a/bn_mp_lcm.c
+++ b/bn_mp_lcm.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_lshd.c
+++ b/bn_mp_lshd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -31,16 +31,31 @@ mp_lshd (mp_int * a, int b)
    return res;
  }
  {
    register mp_digit *tmpa, *tmpaa;
    /* increment the used by the shift amount then copy upwards */
    a->used += b;
    /* top */
    tmpa = a->dp + a->used - 1;
    /* base */
    tmpaa = a->dp + a->used - 1 - b;
    /* much like mp_rshd this is implemented using a sliding window
     * except the window goes the other way around.  Copying from
     * the bottom to the top.  see bn_mp_rshd.c for more info.
     */
    for (x = a->used - 1; x >= b; x--) {
-    a->dp[x] = a->dp[x - b];
+      *tmpa-- = *tmpaa--;
    }
    /* zero the lower digits */
    tmpa = a->dp;
    for (x = 0; x < b; x++) {
-    a->dp[x] = 0;
+      *tmpa++ = 0;
    }
  }
  mp_clamp (a);
  return MP_OKAY;
 }
--- a/bn_mp_mod.c
+++ b/bn_mp_mod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_mod_2d.c
+++ b/bn_mp_mod_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_mod_d.c
+++ b/bn_mp_mod_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_montgomery_calc_normalization.c
+++ b/bn_mp_montgomery_calc_normalization.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_montgomery_reduce.c
+++ b/bn_mp_montgomery_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_montgomery_setup.c
+++ b/bn_mp_montgomery_setup.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -18,36 +18,29 @@
 int
 mp_montgomery_setup (mp_int * a, mp_digit * mp)
 {
-  mp_int  t, tt;
+  unsigned long x, b;
  int     res;
-  if ((res = mp_init (&t)) != MP_OKAY) {
+/* fast inversion mod 2^32 
-    return res;
+ *
 * Based on the fact that 
 *
 * XA = 1 (mod 2^n)  =>  (X(2-XA)) A = 1 (mod 2^2n)
 *                   =>  2*X*A - X*X*A*A = 1
 *                   =>  2*(1) - (1)     = 1
 */
  b = a->dp[0];
  if ((b & 1) == 0) {
    return MP_VAL;
  }
-  if ((res = mp_init (&tt)) != MP_OKAY) {
+  x = (((b + 2) & 4) << 1) + b;	/* here x*a==1 mod 2^4 */
-    goto __T;
+  x *= 2 - b * x;		/* here x*a==1 mod 2^8 */
-  }
+  x *= 2 - b * x;		/* here x*a==1 mod 2^16; each step doubles the number of bits */
-
+  x *= 2 - b * x;		/* here x*a==1 mod 2^32 */
  /* tt = b */
  tt.dp[0] = 0;
  tt.dp[1] = 1;
  tt.used = 2;
  /* t = m mod b */
  t.dp[0] = a->dp[0];
  t.used = 1;
  /* t = 1/m mod b */
  if ((res = mp_invmod (&t, &tt, &t)) != MP_OKAY) {
    goto __TT;
  }
  /* t = -1/m mod b */
-  *mp = ((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - t.dp[0];
+  *mp = ((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - (x & MP_MASK);
-  res = MP_OKAY;
+  return MP_OKAY;
 __TT:mp_clear (&tt);
 __T:mp_clear (&t);
  return res;
 }
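The new mp_montgomery_setup body replaces a full mp_invmod call with a word-sized Newton (Hensel) iteration: if X*A = 1 + k*2^n, then X*(2 - X*A)*A = 1 - k^2 * 2^(2n), so each step doubles the number of correct low bits. The sketch below isolates that iteration on a fixed 32-bit word. The names `inverse_mod_2_32` and `montgomery_rho`, and the use of `uint32_t` instead of the `unsigned long` and `MP_MASK` seen in the diff, are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Return x such that b * x == 1 (mod 2^32).  b must be odd.
 * Hypothetical helper mirroring the iteration in the hunk above. */
uint32_t inverse_mod_2_32(uint32_t b)
{
    uint32_t x;

    assert((b & 1u) == 1u);           /* even values have no inverse mod 2^k */

    x = (((b + 2u) & 4u) << 1) + b;   /* b * x == 1 (mod 2^4)  */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^8)  */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^16) */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^32) */
    return x;
}

/* Return -1/b mod 2^digit_bits, the constant Montgomery reduction needs.
 * digit_bits is assumed to be between 1 and 32. */
uint32_t montgomery_rho(uint32_t b, unsigned digit_bits)
{
    uint32_t mask = (digit_bits >= 32u) ? 0xFFFFFFFFu
                                        : ((1u << digit_bits) - 1u);
    return (0u - inverse_mod_2_32(b)) & mask;
}
```

Because the inverse of an odd number is itself odd, the masked value is never zero, so negating and masking gives the same result as subtracting the masked inverse from 2^digit_bits as the diff does.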
--- a/bn_mp_mul.c
+++ b/bn_mp_mul.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_mul_2.c
+++ b/bn_mp_mul_2.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -50,6 +50,11 @@ mp_mul_2 (mp_int * a, mp_int * b)
 	if ((res = mp_grow (b, b->used + 1)) != MP_OKAY) {
 	  return res;
 	}
 	/* after the grow *tmpb is no longer valid so we have to reset it! 
 	 * (this bug took me about 17 minutes to find...!)
 	 */
 	tmpb = b->dp + b->used;
      }
      /* add a MSB of 1 */
      *tmpb = 1;
@ -61,5 +66,6 @@ mp_mul_2 (mp_int * a, mp_int * b)
      *tmpb++ = 0;
    }
  }
  b->sign = a->sign;
  return MP_OKAY;
 }
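The comment added to mp_mul_2 describes a general hazard: a pointer alias into a digit array stops being valid once the array has been reallocated, so it has to be re-derived from the (possibly moved) base pointer. A minimal sketch of the pattern follows, using a hypothetical `num_t` structure and helpers in place of `mp_int` and `mp_grow`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for mp_int: a growable array of digits. */
typedef struct {
    unsigned *dp;    /* digit storage */
    int       used;  /* digits in use */
    int       alloc; /* digits allocated */
} num_t;

/* Ensure room for at least `size` digits; returns 0 on success, -1 on failure. */
static int num_grow(num_t *n, int size)
{
    if (n->alloc < size) {
        unsigned *tmp = realloc(n->dp, sizeof *tmp * (size_t)size);
        if (tmp == NULL) {
            return -1;
        }
        n->dp = tmp;       /* the block may have moved */
        n->alloc = size;
    }
    return 0;
}

/* Append a most significant digit of 1, growing the array when needed. */
int num_append_carry(num_t *n)
{
    unsigned *top;

    assert(n != NULL && n->dp != NULL);
    top = n->dp + n->used;          /* alias, valid only until the next grow */

    if (n->alloc < n->used + 1) {
        if (num_grow(n, n->used + 1) != 0) {
            return -1;
        }
        top = n->dp + n->used;      /* re-derive: the old alias may dangle */
    }
    *top = 1;
    n->used++;
    return 0;
}
```

Dropping the second assignment to `top` reproduces the class of bug the comment mentions: the write lands in memory that realloc may already have released.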
--- a/bn_mp_mul_2d.c
+++ b/bn_mp_mul_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -32,9 +32,11 @@ mp_mul_2d (mp_int * a, int b, mp_int * c)
  }
  /* shift by as many digits in the bit count */
  if (b >= DIGIT_BIT) {
     if ((res = mp_lshd (c, b / DIGIT_BIT)) != MP_OKAY) {
       return res;
     }
  }     
  c->used = c->alloc;
  /* shift any bit count < DIGIT_BIT */
--- a/bn_mp_mul_d.c
+++ b/bn_mp_mul_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_mulmod.c
+++ b/bn_mp_mulmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_n_root.c
+++ b/bn_mp_n_root.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_neg.c
+++ b/bn_mp_neg.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_or.c
+++ b/bn_mp_or.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_rand.c
+++ b/bn_mp_rand.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_read_signed_bin.c
+++ b/bn_mp_read_signed_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_read_unsigned_bin.c
+++ b/bn_mp_read_unsigned_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_reduce.c
+++ b/bn_mp_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_rshd.c
+++ b/bn_mp_rshd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -20,7 +20,6 @@ mp_rshd (mp_int * a, int b)
 {
  int     x;
  /* if b <= 0 then ignore it */
  if (b <= 0) {
    return;
@ -32,14 +31,34 @@ mp_rshd (mp_int * a, int b)
    return;
  }
  {
    register mp_digit *tmpa, *tmpaa;
    /* shift the digits down */
    /* base */
    tmpa = a->dp;
    /* offset into digits */
    tmpaa = a->dp + b;
    /* this is implemented as a sliding window where the window is b-digits long
     * and digits from the top of the window are copied to the bottom
     *
     * e.g.
     b-2 | b-1 | b0 | b1 | b2 | ... | bb |   ---->
                 /\                   |      ---->
                  \-------------------/      ---->
    */         
    for (x = 0; x < (a->used - b); x++) {
-    a->dp[x] = a->dp[x + b];
+      *tmpa++ = *tmpaa++;
    }
    /* zero the top digits */
    for (; x < a->used; x++) {
-    a->dp[x] = 0;
+      *tmpa++ = 0;
    }
  }
  mp_clamp (a);
 }
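Both digit-shift routines now walk the array with two pointers a fixed distance apart. The direction matters because source and destination overlap: a right shift copies from the bottom up, a left shift from the top down. The sketch below shows both on a plain array. The type `digit_t` and the function names are assumptions, and the caller is assumed to have reserved room for `used + b` digits before a left shift.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int digit_t;   /* hypothetical stand-in for mp_digit */

/* Shift right by b whole digits; vacated top digits are zeroed. */
void shift_digits_right(digit_t *dp, int used, int b)
{
    digit_t *dst, *src;
    int x;

    assert(dp != NULL && used >= 0);
    if (b <= 0) {
        return;
    }
    if (b > used) {
        b = used;               /* everything shifts out */
    }
    dst = dp;                   /* bottom of the window */
    src = dp + b;               /* top of the window */
    for (x = 0; x < used - b; x++) {
        *dst++ = *src++;        /* copy upward through the array */
    }
    for (; x < used; x++) {
        *dst++ = 0;             /* zero the top digits */
    }
}

/* Shift left by b whole digits; returns the new digit count.
 * The array must already have room for used + b digits. */
int shift_digits_left(digit_t *dp, int used, int b)
{
    digit_t *top, *base;
    int x;

    assert(dp != NULL && used >= 0);
    if (b <= 0) {
        return used;
    }
    used += b;
    top  = dp + used;           /* one past the new top digit */
    base = dp + used - b;       /* one past the old top digit */
    for (x = used - 1; x >= b; x--) {
        *--top = *--base;       /* copy downward so nothing is overwritten early */
    }
    while (top != dp) {
        *--top = 0;             /* zero the lower digits */
    }
    return used;
}
```

The left shift uses pre-decrement from one-past-the-end positions so that no pointer ever steps below the start of the array, which differs slightly from the post-decrement form in the diff.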
--- a/bn_mp_set.c
+++ b/bn_mp_set.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_set_int.c
+++ b/bn_mp_set_int.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_shrink.c
+++ b/bn_mp_shrink.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_signed_bin_size.c
+++ b/bn_mp_signed_bin_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_sqr.c
+++ b/bn_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_sqrmod.c
+++ b/bn_mp_sqrmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_sub.c
+++ b/bn_mp_sub.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_sub_d.c
+++ b/bn_mp_sub_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_submod.c
+++ b/bn_mp_submod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_to_signed_bin.c
+++ b/bn_mp_to_signed_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_to_unsigned_bin.c
+++ b/bn_mp_to_unsigned_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_unsigned_bin_size.c
+++ b/bn_mp_unsigned_bin_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_xor.c
+++ b/bn_mp_xor.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_mp_zero.c
+++ b/bn_mp_zero.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_radix.c
+++ b/bn_radix.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_reverse.c
+++ b/bn_reverse.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_s_mp_add.c
+++ b/bn_s_mp_add.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
@ -55,8 +55,14 @@ s_mp_add (mp_int * a, mp_int * b, mp_int * c)
    register int i;
    /* alias for digit pointers */
    /* first input */
    tmpa = a->dp;
    /* second input */
    tmpb = b->dp;
    /* destination */
    tmpc = c->dp;
    u = 0;
--- a/bn_s_mp_mul_digs.c
+++ b/bn_s_mp_mul_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_s_mp_mul_high_digs.c
+++ b/bn_s_mp_mul_high_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_s_mp_sqr.c
+++ b/bn_s_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bn_s_mp_sub.c
+++ b/bn_s_mp_sub.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
--- a/bncore.c
+++ b/bncore.c
@ -10,10 +10,13 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>
-int     KARATSUBA_MUL_CUTOFF = 80,	/* Min. number of digits before Karatsuba multiplication is used. */
+/* configured for an AMD Duron Morgan core with etc/tune.c */
-        KARATSUBA_SQR_CUTOFF = 80,	/* Min. number of digits before Karatsuba squaring is used. */
+int     KARATSUBA_MUL_CUTOFF = 73,	/* Min. number of digits before Karatsuba multiplication is used. */
-        MONTGOMERY_EXPT_CUTOFF = 74;	/* max. number of digits that montgomery reductions will help for */
+        KARATSUBA_SQR_CUTOFF = 121,	/* Min. number of digits before Karatsuba squaring is used. */
        MONTGOMERY_EXPT_CUTOFF = 128;	/* max. number of digits that montgomery reductions will help for */
--- a/changes.txt
+++ b/changes.txt
@ -1,3 +1,16 @@
 Mar 15th, 2003
 v0.14  -- Tons of manual updates
       -- cleaned up the directory
       -- added MSVC makefiles
       -- source changes [that I don't recall]
       -- Fixed up the lshd/rshd code to use pointer aliasing
       -- Fixed up the mul_2d and div_2d to not call rshd/lshd unless needed
       -- Fixed up etc/tune.c a tad
       -- fixed up demo/demo.c to output comma-delimited results of timing
          also fixed up timing demo to use a finer granularity for various functions
       -- fixed up demo/demo.c testing to pause during testing so my Duron won't catch on fire
          [stays around 31-35C during testing :-)]
 Feb 13th, 2003
 v0.13  -- tons of minor speed-ups in low level add, sub, mul_2 and div_2 which propagate 
          to other functions like mp_invmod, mp_div, etc...
--- a/demo/demo.c
+++ b/demo/demo.c
@ -69,18 +69,32 @@ int mp_reduce_setup(mp_int *a, mp_int *b)
   }
   return mp_div(a, b, a, NULL);
 }
 int mp_rand(mp_int *a, int c)
 {
   long z = abs(rand()) & 65535;
   mp_set(a, z?z:1);
   while (c--) {
      s_mp_lshd(a, 1);
      mp_add_d(a, abs(rand()), a);
   }
   return MP_OKAY;
 }
 #endif
   char cmd[4096], buf[4096];
 int main(void)
 {
   mp_int a, b, c, d, e, f;
-   unsigned long expt_n, add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, inv_n;
+   unsigned long expt_n, add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, inv_n,
                 div2_n, mul2_n;
   unsigned rr;
   int cnt;
 #ifdef TIMER
   int n;
   ulong64 tt;
   FILE *log;
 #endif
   mp_init(&a);
@ -90,60 +104,66 @@ int main(void)
   mp_init(&e);
   mp_init(&f);
 #ifdef TIMER
 goto multtime;
      printf("CLOCKS_PER_SEC == %lu\n", CLOCKS_PER_SEC);
-      mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
+goto expttime;      
-      mp_read_radix(&b, "340282366920938463463574607431768211455", 10);
+
-      while (a.used * DIGIT_BIT < 8192) {
+      log = fopen("add.log", "w");
      for (cnt = 4; cnt <= 128; cnt += 4) {
         mp_rand(&a, cnt);
         mp_rand(&b, cnt);
         reset();
         for (rr = 0; rr < 10000000; rr++) {
             mp_add(&a, &b, &c);
         }
         tt = rdtsc();
         printf("Adding\t\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-         mp_sqr(&a, &a);
+         fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
         mp_sqr(&b, &b);
      }
      fclose(log);
-      mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
+      log = fopen("sub.log", "w");
-      mp_read_radix(&b, "340282366920938463463574607431768211455", 10);
+      for (cnt = 4; cnt <= 128; cnt += 4) {
-      while (a.used * DIGIT_BIT < 8192) {
+         mp_rand(&a, cnt);
         mp_rand(&b, cnt);
         reset();
         for (rr = 0; rr < 10000000; rr++) {
             mp_sub(&a, &b, &c);
         }
         tt = rdtsc();
-         printf("Subtracting\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
+         printf("Subtracting\t\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-         mp_sqr(&a, &a);
+         fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
         mp_sqr(&b, &b);
      }
      fclose(log);
 multtime:      
-   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
+   log = fopen("sqr.log", "w");
-   while (a.used * DIGIT_BIT < 8192) {
+   for (cnt = 4; cnt <= 128; cnt += 4) {
      mp_rand(&a, cnt);
      reset();
      for (rr = 0; rr < 250000; rr++) {
          mp_sqr(&a, &b);
      }
      tt = rdtsc();
      printf("Squaring\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_copy(&b, &a);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
   fclose(log);
-   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
+   log = fopen("mult.log", "w");
-   while (a.used * DIGIT_BIT < 8192) {
+   for (cnt = 4; cnt <= 128; cnt += 4) {
      mp_rand(&a, cnt);
      mp_rand(&b, cnt);
      reset();
      for (rr = 0; rr < 250000; rr++) {
-          mp_mul(&a, &a, &b);
+          mp_mul(&a, &b, &c);
      }
      tt = rdtsc();
      printf("Multiplying\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_copy(&b, &a);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
   fclose(log);
 expttime:  
   {
@ -157,6 +177,7 @@ expttime:
         "1214855636816562637502584060163403830270705000634713483015101384881871978446801224798536155406895823305035467591632531067547890948695117172076954220727075688048751022421198712032848890056357845974246560748347918630050853933697792254955890439720297560693579400297062396904306270145886830719309296352765295712183040773146419022875165382778007040109957609739589875590885701126197906063620133954893216612678838507540777138437797705602453719559017633986486649523611975865005712371194067612263330335590526176087004421363598470302731349138773205901447704682181517904064735636518462452242791676541725292378925568296858010151852326316777511935037531017413910506921922450666933202278489024521263798482237150056835746454842662048692127173834433089016107854491097456725016327709663199738238442164843147132789153725513257167915555162094970853584447993125488607696008169807374736711297007473812256272245489405898470297178738029484459690836250560495461579533254473316340608217876781986188705928270735695752830825527963838355419762516246028680280988020401914551825487349990306976304093109384451438813251211051597392127491464898797406789175453067960072008590614886532333015881171367104445044718144312416815712216611576221546455968770801413440778423979",
         NULL         
      };
   log = fopen("expt.log", "w");
   for (n = 0; primes[n]; n++) {
      mp_read_radix(&a, primes[n], 10);
      mp_zero(&b);
@ -183,12 +204,21 @@ expttime:
         exit(0);
      }
      printf("Exponentiating\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
   }   
   fclose(log);
 invtime:
   log = fopen("invmod.log", "w");
   for (cnt = 4; cnt <= 128; cnt += 4) {
      mp_rand(&a, cnt);
      mp_rand(&b, cnt);
      do {
         mp_add_d(&b, 1, &b);
         mp_gcd(&a, &b, &c);
      } while (mp_cmp_d(&c, 1) != MP_EQ);
   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
   mp_read_radix(&b, "234892374891378913789237289378973232333", 10);
   while (a.used * DIGIT_BIT < 8192) {
      reset();
      for (rr = 0; rr < 10000; rr++) {
          mp_invmod(&b, &a, &c);
@ -200,16 +230,18 @@ expttime:
         return 0;
      }
      printf("Inverting mod\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_sqr(&a, &a);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
      mp_sqr(&b, &b);
   }
   fclose(log);
   return 0;
 #endif
-   inv_n = expt_n = lcm_n = gcd_n = add_n = sub_n = mul_n = div_n = sqr_n = mul2d_n = div2d_n = 0;   
+   div2_n = mul2_n = inv_n = expt_n = lcm_n = gcd_n = add_n = 
   sub_n = mul_n = div_n = sqr_n = mul2d_n = div2d_n = cnt = 0;
   for (;;) {
       if (!(++cnt & 15)) sleep(3);
       /* randomly clear and re-init one variable, this has the effect of trimming the alloc space */
       switch (abs(rand()) % 7) {
@ -223,7 +255,7 @@ expttime:
       }
-       printf("%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%5d\r", add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, expt_n, inv_n, _ifuncs);
+       printf("%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu ", add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, expt_n, inv_n, div2_n, mul2_n);
       fgets(cmd, 4095, stdin);
       cmd[strlen(cmd)-1] = 0;
       printf("%s  ]\r",cmd); fflush(stdout);
@ -386,6 +418,28 @@ draw(&a);draw(&b);draw(&c);draw(&d);
                return 0;
             }
       } else if (!strcmp(cmd, "div2")) { ++div2_n;
             fgets(buf, 4095, stdin);  mp_read_radix(&a, buf, 10);
             fgets(buf, 4095, stdin);  mp_read_radix(&b, buf, 10);
             mp_div_2(&a, &c);
             if (mp_cmp(&c, &b) != MP_EQ) {
                 printf("div_2 %lu failure\n", div2_n);
                 draw(&a);
                 draw(&b);
                 draw(&c);
                 return 0;
             }
       } else if (!strcmp(cmd, "mul2")) { ++mul2_n;
             fgets(buf, 4095, stdin);  mp_read_radix(&a, buf, 10);
             fgets(buf, 4095, stdin);  mp_read_radix(&b, buf, 10);
             mp_mul_2(&a, &c);
             if (mp_cmp(&c, &b) != MP_EQ) {
                 printf("mul_2 %lu failure\n", mul2_n);
                 draw(&a);
                 draw(&b);
                 draw(&c);
                 return 0;
             }
       }             
   }
--- a/etc/makefile
+++ b/etc/makefile
@ -17,4 +17,4 @@ mersenne: mersenne.o
 	$(CC) mersenne.o $(LIBNAME) -o mersenne
 clean:
-	rm -f *.o *.exe pprime tune mersenne 
+	rm -f *.log *.o *.obj *.exe pprime tune mersenne 
--- a/etc/makefile.msvc
+++ b/etc/makefile.msvc
@ -0,0 +1,14 @@
 #MSVC Makefile
 #
 #Tom St Denis
 CFLAGS = /I../ /Ogityb2 /Gs /DWIN32 /W3
 pprime: pprime.obj
 	cl pprime.obj ../tommath.lib 
 mersenne: mersenne.obj
 	cl mersenne.obj ../tommath.lib
 tune: tune.obj
 	cl tune.obj ../tommath.lib	
--- a/etc/mersenne.c
+++ b/etc/mersenne.c
@ -3,7 +3,7 @@
 * Tom St Denis, tomstdenis@iahu.ca
 */
 #include <time.h>
-#include <bn.h>
+#include <tommath.h>
 int
 is_mersenne (long s, int *pp)
--- a/etc/tune.c
+++ b/etc/tune.c
@ -17,10 +17,10 @@ time_mult (void)
  mp_init (&c);
  t1 = clock ();
-  for (x = 8; x <= 128; x += 8) {
+  for (x = 4; x <= 128; x += 4) {
    for (y = 0; y < 1000; y++) {
    mp_rand (&a, x);
    mp_rand (&b, x);
    for (y = 0; y < 10000; y++) {
      mp_mul (&a, &b, &c);
    }
  }
@ -41,9 +41,9 @@ time_sqr (void)
  mp_init (&b);
  t1 = clock ();
-  for (x = 8; x <= 128; x += 8) {
+  for (x = 4; x <= 128; x += 4) {
    for (y = 0; y < 1000; y++) {
    mp_rand (&a, x);
    for (y = 0; y < 10000; y++) {
      mp_sqr (&a, &b);
    }
  }
@ -52,20 +52,54 @@ time_sqr (void)
  return clock () - t1;
 }
 clock_t
 time_expt (void)
 {
  clock_t t1;
  int     x, y;
  mp_int  a, b, c, d;
  mp_init (&a);
  mp_init (&b);
  mp_init (&c);
  mp_init (&d);
  t1 = clock ();
  for (x = 4; x <= 128; x += 4) {
    mp_rand (&a, x);
    mp_rand (&b, x);
    mp_rand (&c, x);
    if (mp_iseven (&c) != 0) {
      mp_add_d (&c, 1, &c);
    }
    for (y = 0; y < 10; y++) {
      mp_exptmod (&a, &b, &c, &d);
    }
  }
  mp_clear (&d);
  mp_clear (&c);
  mp_clear (&b);
  mp_clear (&a);
  return clock () - t1;
 }
 int
 main (void)
 {
-  int       best_mult, best_square;
+  int     best_mult, best_square, best_exptmod;
  clock_t best, ti;
  FILE   *log;
-  best_mult = best_square = 0;
+  best_mult = best_square = best_exptmod = 0;
  /* tune multiplication first */
  log = fopen ("mult.log", "w");
  best = CLOCKS_PER_SEC * 1000;
-  for (KARATSUBA_MUL_CUTOFF = 8; KARATSUBA_MUL_CUTOFF <= 128;
+  for (KARATSUBA_MUL_CUTOFF = 8; KARATSUBA_MUL_CUTOFF <= 128; KARATSUBA_MUL_CUTOFF++) {
       KARATSUBA_MUL_CUTOFF++) {
    ti = time_mult ();
    printf ("%4d : %9lu\r", KARATSUBA_MUL_CUTOFF, ti);
    fprintf (log, "%d, %lu\n", KARATSUBA_MUL_CUTOFF, ti);
    fflush (stdout);
    if (ti < best) {
      printf ("New best: %lu, %d         \n", ti, KARATSUBA_MUL_CUTOFF);
@ -73,13 +107,15 @@ main (void)
      best_mult = KARATSUBA_MUL_CUTOFF;
    }
  }
  fclose (log);
  /* tune squaring */
  log = fopen ("sqr.log", "w");
  best = CLOCKS_PER_SEC * 1000;
-  for (KARATSUBA_SQR_CUTOFF = 8; KARATSUBA_SQR_CUTOFF <= 128;
+  for (KARATSUBA_SQR_CUTOFF = 8; KARATSUBA_SQR_CUTOFF <= 128; KARATSUBA_SQR_CUTOFF++) {
       KARATSUBA_SQR_CUTOFF++) {
    ti = time_sqr ();
    printf ("%4d : %9lu\r", KARATSUBA_SQR_CUTOFF, ti);
    fprintf (log, "%d, %lu\n", KARATSUBA_SQR_CUTOFF, ti);
    fflush (stdout);
    if (ti < best) {
      printf ("New best: %lu, %d         \n", ti, KARATSUBA_SQR_CUTOFF);
@ -87,10 +123,30 @@ main (void)
      best_square = KARATSUBA_SQR_CUTOFF;
    }
  }
  fclose (log);
  /* tune exptmod */
  KARATSUBA_MUL_CUTOFF = best_mult;
  KARATSUBA_SQR_CUTOFF = best_square;
  log = fopen ("expt.log", "w");
  best = CLOCKS_PER_SEC * 1000;
  for (MONTGOMERY_EXPT_CUTOFF = 8; MONTGOMERY_EXPT_CUTOFF <= 192; MONTGOMERY_EXPT_CUTOFF++) {
    ti = time_expt ();
    printf ("%4d : %9lu\r", MONTGOMERY_EXPT_CUTOFF, ti);
    fflush (stdout);
    fprintf (log, "%d, %lu\n", MONTGOMERY_EXPT_CUTOFF, ti);
    if (ti < best) {
      printf ("New best: %lu, %d\n", ti, MONTGOMERY_EXPT_CUTOFF);
      best = ti;
      best_exptmod = MONTGOMERY_EXPT_CUTOFF;
    }
  }
  fclose (log);
  printf
-    ("\n\n\nKaratsuba Multiplier Cutoff: %d\nKaratsuba Squaring Cutoff: %d\n",
+    ("\n\n\nKaratsuba Multiplier Cutoff: %d\nKaratsuba Squaring Cutoff: %d\nMontgomery exptmod Cutoff: %d\n",
-     best_mult, best_square);
+     best_mult, best_square, best_exptmod);
  return 0;
 }
--- a/4
+++ b/4
@ -1,6 +1,6 @@
 CFLAGS  +=  -I./ -Wall -W -Wshadow -O3 -fomit-frame-pointer -funroll-loops
-VERSION=0.13
+VERSION=0.14
 default: libtommath.a
@ -60,7 +60,7 @@ docs:	docdvi
 	rm -f bn.log bn.aux bn.dvi
 clean:
-	rm -f *.pdf *.o *.a *.exe etclib/*.o demo/demo.o test ltmtest mpitest mtest/mtest mtest/mtest.exe \
+	rm -f *.pdf *.o *.a *.obj *.lib *.exe etclib/*.o demo/demo.o test ltmtest mpitest mtest/mtest mtest/mtest.exe \
        bn.log bn.aux bn.dvi *.log *.s mpi.c 
 	cd etc ; make clean
--- a/makefile.msvc
+++ b/makefile.msvc
@ -0,0 +1,26 @@
 #MSVC Makefile
 #
 #Tom St Denis
 CFLAGS = /I. /Ogityb2 /Gs /DWIN32 /W3
 default: library
 OBJECTS=bncore.obj bn_mp_init.obj bn_mp_clear.obj bn_mp_exch.obj bn_mp_grow.obj bn_mp_shrink.obj \
 bn_mp_clamp.obj bn_mp_zero.obj  bn_mp_set.obj bn_mp_set_int.obj bn_mp_init_size.obj bn_mp_copy.obj \
 bn_mp_init_copy.obj bn_mp_abs.obj bn_mp_neg.obj bn_mp_cmp_mag.obj bn_mp_cmp.obj bn_mp_cmp_d.obj \
 bn_mp_rshd.obj bn_mp_lshd.obj bn_mp_mod_2d.obj bn_mp_div_2d.obj bn_mp_mul_2d.obj bn_mp_div_2.obj \
 bn_mp_mul_2.obj bn_s_mp_add.obj bn_s_mp_sub.obj bn_fast_s_mp_mul_digs.obj bn_s_mp_mul_digs.obj \
 bn_fast_s_mp_mul_high_digs.obj bn_s_mp_mul_high_digs.obj bn_fast_s_mp_sqr.obj bn_s_mp_sqr.obj \
 bn_mp_add.obj bn_mp_sub.obj bn_mp_karatsuba_mul.obj bn_mp_mul.obj bn_mp_karatsuba_sqr.obj \
 bn_mp_sqr.obj bn_mp_div.obj bn_mp_mod.obj bn_mp_add_d.obj bn_mp_sub_d.obj bn_mp_mul_d.obj \
 bn_mp_div_d.obj bn_mp_mod_d.obj bn_mp_expt_d.obj bn_mp_addmod.obj bn_mp_submod.obj \
 bn_mp_mulmod.obj bn_mp_sqrmod.obj bn_mp_gcd.obj bn_mp_lcm.obj bn_fast_mp_invmod.obj bn_mp_invmod.obj \
 bn_mp_reduce.obj bn_mp_montgomery_setup.obj bn_fast_mp_montgomery_reduce.obj bn_mp_montgomery_reduce.obj \
 bn_mp_exptmod_fast.obj bn_mp_exptmod.obj bn_mp_2expt.obj bn_mp_n_root.obj bn_mp_jacobi.obj bn_reverse.obj \
 bn_mp_count_bits.obj bn_mp_read_unsigned_bin.obj bn_mp_read_signed_bin.obj bn_mp_to_unsigned_bin.obj \
 bn_mp_to_signed_bin.obj bn_mp_unsigned_bin_size.obj bn_mp_signed_bin_size.obj bn_radix.obj \
 bn_mp_xor.obj bn_mp_and.obj bn_mp_or.obj bn_mp_rand.obj bn_mp_montgomery_calc_normalization.obj
 library: $(OBJECTS)
 	lib /out:tommath.lib $(OBJECTS)
--- a/mtest/mtest.c
+++ b/mtest/mtest.c
@ -41,7 +41,7 @@ void rand_num(mp_int *a)
   unsigned char buf[512];
 top:
-   size = 1 + ((fgetc(rng)*fgetc(rng)) % 96);
+   size = 1 + ((fgetc(rng)*fgetc(rng)) % 512);
   buf[0] = (fgetc(rng)&1)?1:0;
   fread(buf+1, 1, size, rng);
   for (n = 0; n < size; n++) {
@ -57,7 +57,7 @@ void rand_num2(mp_int *a)
   unsigned char buf[512];
 top:
-   size = 1 + ((fgetc(rng)*fgetc(rng)) % 96);
+   size = 1 + ((fgetc(rng)*fgetc(rng)) % 512);
   buf[0] = (fgetc(rng)&1)?1:0;
   fread(buf+1, 1, size, rng);
   for (n = 0; n < size; n++) {
@ -73,6 +73,8 @@ int main(void)
   mp_int a, b, c, d, e;
   char buf[4096];
   static int tests[] = { 11, 12 };
   mp_init(&a);
   mp_init(&b);
   mp_init(&c);
@ -89,7 +91,7 @@ int main(void)
   }
   for (;;) {
-       n = 4; // fgetc(rng) % 11;
+       n =  fgetc(rng) % 13;
   if (n == 0) {
       /* add tests */
@ -235,6 +237,23 @@ int main(void)
      printf("%s\n", buf);      
      mp_todecimal(&c, buf);
      printf("%s\n", buf);      
   } else if (n == 11) {
      rand_num(&a);
      mp_mul_2(&a, &a);
      mp_div_2(&a, &b);
      printf("div2\n");
      mp_todecimal(&a, buf);
      printf("%s\n", buf);      
      mp_todecimal(&b, buf);
      printf("%s\n", buf);
   } else if (n == 12) {
      rand_num2(&a);
      mp_mul_2(&a, &b);
      printf("mul2\n");
      mp_todecimal(&a, buf);
      printf("%s\n", buf);      
      mp_todecimal(&b, buf);
      printf("%s\n", buf);
   }
   }
   fclose(rng);
--- a/timings.txt
+++ b/timings.txt
@ -1,36 +0,0 @@
 CLOCKS_PER_SEC == 1000
 Adding           128-bit =>  14534883/sec,       688 ticks
 Adding           256-bit =>  11037527/sec,       906 ticks
 Adding           512-bit =>   8650519/sec,      1156 ticks
 Adding          1024-bit =>   5871990/sec,      1703 ticks
 Adding          2048-bit =>   3575259/sec,      2797 ticks
 Adding          4096-bit =>   2018978/sec,      4953 ticks
 Subtracting      128-bit =>  11025358/sec,       907 ticks
 Subtracting      256-bit =>   9149130/sec,      1093 ticks
 Subtracting      512-bit =>   7440476/sec,      1344 ticks
 Subtracting     1024-bit =>   5078720/sec,      1969 ticks
 Subtracting     2048-bit =>   3168567/sec,      3156 ticks
 Subtracting     4096-bit =>   1833852/sec,      5453 ticks
 Squaring         128-bit =>   3205128/sec,        78 ticks
 Squaring         256-bit =>   1592356/sec,       157 ticks
 Squaring         512-bit =>    696378/sec,       359 ticks
 Squaring        1024-bit =>    266808/sec,       937 ticks
 Squaring        2048-bit =>     85999/sec,      2907 ticks
 Squaring        4096-bit =>     21949/sec,     11390 ticks
 Multiplying      128-bit =>   3205128/sec,        78 ticks
 Multiplying      256-bit =>   1592356/sec,       157 ticks
 Multiplying      512-bit =>    615763/sec,       406 ticks
 Multiplying     1024-bit =>    192752/sec,      1297 ticks
 Multiplying     2048-bit =>     53510/sec,      4672 ticks
 Multiplying     4096-bit =>     14801/sec,     16890 ticks
 Exponentiating   513-bit =>       531/sec,        47 ticks
 Exponentiating   769-bit =>       177/sec,       141 ticks
 Exponentiating  1025-bit =>        88/sec,       282 ticks
 Exponentiating  2049-bit =>        13/sec,      1890 ticks
 Exponentiating  2561-bit =>         6/sec,      3812 ticks
 Exponentiating  3073-bit =>         4/sec,      6031 ticks
 Exponentiating  4097-bit =>         1/sec,     12843 ticks
 Inverting mod    128-bit =>     19160/sec,      5219 ticks
 Inverting mod    256-bit =>      8290/sec,     12062 ticks
 Inverting mod    512-bit =>      3565/sec,     28047 ticks
 Inverting mod   1024-bit =>      1305/sec,     76594 ticks
--- a/timings2.txt
+++ b/timings2.txt
@ -1,36 +0,0 @@
 CLOCKS_PER_SEC == 1000
 Adding           128-bit =>  15600624/sec,       641 ticks
 Adding           256-bit =>  12804097/sec,       781 ticks
 Adding           512-bit =>  10000000/sec,      1000 ticks
 Adding          1024-bit =>   7032348/sec,      1422 ticks
 Adding          2048-bit =>   4076640/sec,      2453 ticks
 Adding          4096-bit =>   2424242/sec,      4125 ticks
 Subtracting      128-bit =>  10845986/sec,       922 ticks
 Subtracting      256-bit =>   9416195/sec,      1062 ticks
 Subtracting      512-bit =>   7710100/sec,      1297 ticks
 Subtracting     1024-bit =>   5159958/sec,      1938 ticks
 Subtracting     2048-bit =>   3299241/sec,      3031 ticks
 Subtracting     4096-bit =>   1987676/sec,      5031 ticks
 Squaring         128-bit =>   3205128/sec,        78 ticks
 Squaring         256-bit =>   1592356/sec,       157 ticks
 Squaring         512-bit =>    696378/sec,       359 ticks
 Squaring        1024-bit =>    266524/sec,       938 ticks
 Squaring        2048-bit =>     86505/sec,      2890 ticks
 Squaring        4096-bit =>     22471/sec,     11125 ticks
 Multiplying      128-bit =>   3205128/sec,        78 ticks
 Multiplying      256-bit =>   1592356/sec,       157 ticks
 Multiplying      512-bit =>    615763/sec,       406 ticks
 Multiplying     1024-bit =>    190548/sec,      1312 ticks
 Multiplying     2048-bit =>     54418/sec,      4594 ticks
 Multiplying     4096-bit =>     14897/sec,     16781 ticks
 Exponentiating   513-bit =>       531/sec,        47 ticks
 Exponentiating   769-bit =>       177/sec,       141 ticks
 Exponentiating  1025-bit =>        84/sec,       297 ticks
 Exponentiating  2049-bit =>        13/sec,      1875 ticks
 Exponentiating  2561-bit =>         6/sec,      3766 ticks
 Exponentiating  3073-bit =>         4/sec,      6000 ticks
 Exponentiating  4097-bit =>         1/sec,     12750 ticks
 Inverting mod    128-bit =>     17301/sec,       578 ticks
 Inverting mod    256-bit =>      8103/sec,      1234 ticks
 Inverting mod    512-bit =>      3422/sec,      2922 ticks
 Inverting mod   1024-bit =>      1330/sec,      7516 ticks
--- a/timings3.txt
+++ b/timings3.txt
@ -1,5 +0,0 @@
 Exponentiating   513-bit =>       531/sec,        94 ticks
 Exponentiating   769-bit =>       187/sec,       266 ticks
 Exponentiating  1025-bit =>        88/sec,       562 ticks
 Exponentiating  2049-bit =>        13/sec,      3719 ticks
--- a/tommath.h
+++ b/tommath.h
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #ifndef BN_H_
 #define BN_H_