added libtommath-0.14

2003-03-13 02:11:11 +00:00 · 2003-03-13 02:11:11 +00:00 · 82f4858291
commit 82f4858291
parent b66471f74f
94 changed files with 600 additions and 418 deletions
--- a/bn.pdf
+++ b/bn.pdf
--- a/bn.tex
+++ b/bn.tex
@ -1,15 +1,15 @@
 \documentclass{article}
 \begin{document}

-\title{LibTomMath v0.13 \\ A Free Multiple Precision Integer Library}
+\title{LibTomMath v0.14 \\ A Free Multiple Precision Integer Library \\ http://math.libtomcrypt.org }
 \author{Tom St Denis \\ tomstdenis@iahu.ca}
 \maketitle
 \newpage

 \section{Introduction}
-``LibTomMath'' is a free and open source library that provides multiple-precision integer functions required to form a basis
-of a public key cryptosystem.  LibTomMath is written entire in portable ISO C source code and designed to have an application
-interface much like that of MPI from Michael Fromberger.  
+``LibTomMath'' is a free and open source library that provides multiple-precision integer functions required to form a 
+basis of a public key cryptosystem.  LibTomMath is written entirely in portable ISO C source code and designed to have an 
+application interface much like that of MPI from Michael Fromberger.  

 LibTomMath was written from scratch by Tom St Denis but designed to be a drop-in replacement for the MPI package.  The 
 algorithms within the library are derived from descriptions as provided in the Handbook of Applied Cryptography and Knuth's
@ -23,8 +23,7 @@ LibTomMath was designed with the following goals in mind:
 \item Be written entirely in portable C.
 \end{enumerate}

-All three goals have been achieved.  Particularly the speed increase goal.  For example, a 512-bit modular exponentiation 
-is eight times faster\footnote{On an Athlon XP with GCC 3.2} with LibTomMath compared to MPI.
+All three goals have been achieved to one extent or another (actual figures depend on what platform you are using).

 Being compatible with MPI means that applications that already use it can be ported fairly quickly.  Currently there are 
 a few differences but there are many similarities.  In fact the average MPI based application can be ported in under 15
@ -54,16 +53,26 @@ make install

 Now within your application include ``tommath.h'' and link against libtommath.a to get MPI-like functionality.

+\subsection{Microsoft Visual C++}
+A makefile is also provided for MSVC (\textit{tested against MSVC 6.00 with SP5}) which allows the library to be used
+with that compiler as well.  To build the library type
+
+\begin{verbatim}
+nmake -f makefile.msvc
+\end{verbatim}
+
+This will build ``tommath.lib''.  
+
 \section{Programming with LibTomMath}

 \subsection{The mp\_int Structure}
 All multiple precision integers are stored in a structure called \textbf{mp\_int}.  A multiple precision integer is
-essentially an array of \textbf{mp\_digit}.  mp\_digit is defined at the top of bn.h.  Its type can be changed to suit
-a particular platform.  
+essentially an array of \textbf{mp\_digit}.  mp\_digit is defined at the top of ``tommath.h''.  The type can be changed 
+to suit a particular platform.  

-For example, when \textbf{MP\_8BIT} is defined\footnote{When building bn.c.} a mp\_digit is a unsigned char and holds 
-seven bits.  Similarly when \textbf{MP\_16BIT} is defined a mp\_digit is a unsigned short and holds 15 bits.  
-By default a mp\_digit is a unsigned long and holds 28 bits.  
+For example, when \textbf{MP\_8BIT} is defined an mp\_digit is an unsigned char and holds seven bits.  Similarly 
+when \textbf{MP\_16BIT} is defined an mp\_digit is an unsigned short and holds 15 bits.   By default an mp\_digit is an 
+unsigned long and holds 28 bits, which is optimal for most 32 and 64 bit processors.

 The choice of digit is particular to the platform at hand and what available multipliers are provided.  For 
 MP\_8BIT either a $8 \times 8 \Rightarrow 16$ or $16 \times 16 \Rightarrow 16$ multiplier is optimal.  When 
@ -83,20 +92,19 @@ $W$ is the number of bits in a digit (default is 28).

 \subsection{Calling Functions}
 Most functions expect pointers to mp\_int's as parameters.   To save on memory usage it is possible to have source
-variables as destinations.  For example:
+variables as destinations.  The arguments are read left to right, so to compute $z = x + y$ you would pass the arguments
+in the order $x, y, z$.  For example:
 \begin{verbatim}
   mp_add(&x, &y, &x);           /* x = x + y */
-   mp_mul(&x, &z, &x);           /* x = x * z */
-   mp_div_2(&x, &x);             /* x = x / 2 */
+   mp_mul(&y, &x, &z);           /* z = y * x */
+   mp_div_2(&x, &y);             /* y = x / 2 */
 \end{verbatim}

-\section{Quick Overview}
+\subsection{Return Values}
+All functions that return errors will return \textbf{MP\_OKAY} if the function was successful.  A function will return 
+\textbf{MP\_MEM} if it ran out of heap memory or \textbf{MP\_VAL} if one of the arguments is out of range.  

 \subsection{Basic Functionality}
-Essentially all LibTomMath functions return one of three values to indicate if the function worked as desired.  A 
-function will return \textbf{MP\_OKAY} if the function was successful.  A function will return \textbf{MP\_MEM} if
-it ran out of memory and \textbf{MP\_VAL} if the input was invalid.  
-
 Before an mp\_int can be used it must be initialized with 

 \begin{verbatim}
@ -106,7 +114,7 @@ int mp_init(mp_int *a);
 For example, consider the following.

 \begin{verbatim}
-#include "bn.h"
+#include "tommath.h"
 int main(void)
 {
   mp_int num;
@ -383,6 +391,18 @@ in $c$ and returns success.

 This function requires $O(N)$ additional digits of memory and $O(2 \cdot N)$ time.

+\subsubsection{mp\_mul\_2(mp\_int *a, mp\_int *b)}
+Multiplies $a$ by two and stores the result in $b$.  This function is hard coded to do a shift by one place so it is faster
+than calling mp\_mul\_2d with a count of one.  
+
+This function requires $O(N)$ additional digits of memory and $O(N)$ time.
+
+\subsubsection{mp\_div\_2(mp\_int *a, mp\_int *b)}
+Divides $a$ by two and stores the result in $b$.  This function is hard coded to do a shift by one place so it is faster
+than calling mp\_div\_2d with a count of one.
+
+This function requires $O(N)$ additional digits of memory and $O(N)$ time.
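The shift-by-one idea described for mp\_mul\_2 and mp\_div\_2 can be illustrated with a small sketch. This is not the library's code: the names `toy_mul_2` and `toy_div_2` are hypothetical, and the sketch assumes a little-endian array of 28-bit digits held in `uint32_t` with no sign handling and no memory management.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)

/* Hypothetical helper (not LibTomMath API): b = 2 * a.
 * a has n digits, least significant first; b must have room for n + 1. */
static void toy_mul_2(const uint32_t *a, int n, uint32_t *b)
{
    uint32_t carry = 0;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        /* top bit of this digit becomes the carry into the next digit */
        uint32_t next = a[i] >> (DIGIT_BIT - 1);
        b[i] = ((a[i] << 1) | carry) & DIGIT_MASK;
        carry = next;
    }
    b[n] = carry;
}

/* Hypothetical helper (not LibTomMath API): b = a / 2, both n digits. */
static void toy_div_2(const uint32_t *a, int n, uint32_t *b)
{
    uint32_t carry = 0;
    int i;

    assert(n > 0);
    for (i = n - 1; i >= 0; i--) {
        /* low bit of this digit becomes the top bit of the digit below */
        uint32_t next = a[i] & 1u;
        b[i] = (a[i] >> 1) | (carry << (DIGIT_BIT - 1));
        carry = next;
    }
}
```

Each loop touches every digit exactly once, which is where the $O(N)$ time comes from. A general shift routine would also have to handle whole-digit moves and a variable bit count, which this fixed one-bit version avoids.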
+
 \subsubsection{mp\_mod\_2d(mp\_int *a, int b, mp\_int *c)}
 Performs the action of reducing $a$ modulo $2^b$ and stores the result in $c$.  If the shift count $b$ is less than 
 or equal to zero the function places $a$ in $c$ and returns success.  
@ -412,7 +432,7 @@ of $c$ is the maximum length of the two inputs.
 \subsection{Basic Arithmetic}

 \subsubsection{mp\_cmp(mp\_int *a, mp\_int *b)}
-Performs a \textbf{signed} comparison between $a$ and $b$ returning \textbf{MP\_GT} is $a$ is larger than $b$.
+Performs a \textbf{signed} comparison between $a$ and $b$ returning \textbf{MP\_GT} if $a$ is larger than $b$.

 This function requires no additional memory and $O(N)$ time.

@ -559,57 +579,6 @@ A very useful observation is that multiplying by $R = \beta^n$ amounts to perfor
 requires no single precision multiplications.  

 \section{Timing Analysis}
-\subsection{Observed Timings}
-A simple test program ``demo.c'' was developed which builds with either MPI or LibTomMath (without modification).  The
-test was conducted on an AMD Athlon XP processor with 266Mhz DDR memory and the GCC 3.2 compiler\footnote{With build
-options ``-O3 -fomit-frame-pointer -funroll-loops''}.    The multiplications and squarings were repeated 100,000 times 
-each while the modular exponentiation (exptmod) were performed 50 times each.  The ``inversions'' refers to multiplicative
-inversions modulo an odd number of a given size.  The RDTSC (Read Time Stamp Counter) instruction was used to measure the 
-time the entire iterations took and was divided by the number of iterations to get an average.  The following results 
-were observed.
-
-\begin{small}
-\begin{center}
-\begin{tabular}{c|c|c|c}
-\hline \textbf{Operation} & \textbf{Size (bits)} & \textbf{Time with MPI (cycles)} & \textbf{Time with LibTomMath (cycles)} \\
-\hline
-Inversion & 128 & 264,083  & 59,782   \\
-Inversion & 256 & 549,370  & 146,915   \\
-Inversion & 512 & 1,675,975  & 367,172   \\
-Inversion & 1024 & 5,237,957  & 1,054,158   \\
-Inversion & 2048 & 17,871,944  & 3,459,683   \\
-Inversion & 4096 & 66,610,468  & 11,834,556   \\
-\hline
-Multiply & 128 & 1,426   & 451     \\
-Multiply & 256 & 2,551   & 958     \\
-Multiply & 512 & 7,913   & 2,476     \\
-Multiply & 1024 & 28,496   & 7,927   \\
-Multiply & 2048 & 109,897   & 28,224     \\
-Multiply & 4096 & 469,970   & 101,171     \\
-\hline 
-Square & 128 & 1,319   & 511     \\
-Square & 256 & 1,776   & 947     \\
-Square & 512 & 5,399  & 2,153    \\
-Square & 1024 & 18,991  & 5,733     \\
-Square & 2048 & 72,126  & 17,621    \\
-Square & 4096 & 306,269  & 67,576   \\
-\hline 
-Exptmod & 512 & 32,021,586  & 3,118,435 \\
-Exptmod & 768 & 97,595,492  & 8,493,633 \\
-Exptmod & 1024 & 223,302,532  & 17,715,899     \\
-Exptmod & 2048 & 1,682,223,369   & 114,936,361      \\
-Exptmod & 2560 & 3,268,615,571   & 229,402,426       \\
-Exptmod & 3072 & 5,597,240,141   & 367,403,840      \\
-Exptmod & 4096 & 13,347,270,891   & 779,058,433      
-
-\end{tabular}
-\end{center}
-\end{small}
-
-Note that the figures do fluctuate but their magnitudes are relatively intact.  The purpose of the chart is not to
-get an exact timing but to compare the two libraries.  For example, in all of the tests the exact time for a 512-bit
-squaring operation was not the same.  The observed times were all approximately 2,500 cycles, more importantly they
-were always faster than the timings observed with MPI by about the same magnitude.  

 \subsection{Digit Size}
 The first major contribution to the time savings is the fact that 28 bits are stored per digit instead of the MPI 
@ -619,29 +588,59 @@ A savings of $64^2 - 37^2 = 2727$ single precision multiplications.

 \subsection{Multiplication Algorithms}
 For most inputs a typical baseline $O(n^2)$ multiplier is used which is similar to that of MPI.  There are two variants 
-of the baseline multiplier.  The normal and the fast variants.  The normal baseline multiplier is the exact same as the
-algorithm from MPI.  The fast baseline multiplier is optimized for cases where the number of input digits $N$ is less
-than or equal to $2^{w}/\beta^2$.  Where $w$ is the number of bits in a \textbf{mp\_word}.  By default a mp\_word is
-64-bits which means $N \le 256$ is allowed which represents numbers upto $7168$ bits.
+of the baseline multiplier: the normal variant and the fast comba variant.  The normal baseline multiplier is the exact same as 
+the algorithm from MPI.  The fast comba baseline multiplier is optimized for cases where the number of input digits $N$ 
+is less than or equal to $2^{w}/\beta^2$, where $w$ is the number of bits in a \textbf{mp\_word} and $\beta = 2^{28}$ is the default digit radix.
+By default a mp\_word is 64-bits which means $N \le 256$ is allowed which represents numbers up to $7,168$ bits.  However,
+since the Karatsuba multiplier (discussed below) will kick in before that size the slower baseline algorithm (that MPI
+uses) should never really be used in a default configuration.  

-The fast baseline multiplier is optimized by removing the carry operations from the inner loop.  This is often referred
-to as the ``comba'' method since it computes the products a columns first then figures out the carries.  This has the
-effect of making a very simple and paralizable inner loop.
+The fast comba baseline multiplier is optimized by removing the carry operations from the inner loop.  This is often 
+referred to as the ``comba'' method since it computes the products one column at a time and only then figures out the carries.  To
+accommodate this the result of the inner multiplications must be stored in words large enough not to lose the carry bits.  
+This is why there is a limit of $2^{w}/\beta^2$ digits in the input.  This optimization has the effect of making a 
+very simple and efficient inner loop.
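A minimal sketch of the column-first idea follows. It is an illustration only, not the library's routine: `toy_comba_mul` is a hypothetical name, both inputs are assumed to have the same number of 28-bit digits, and 64-bit accumulators stand in for mp\_word. The inner loop only multiplies and adds; carries are resolved in a separate pass.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint64_t)1) << DIGIT_BIT) - 1)
#define MAX_DIGITS 256   /* 2^64 / (2^28)^2, the limit quoted in the text */

/* Hypothetical helper (not LibTomMath API): c = a * b.
 * a and b have n digits each, least significant first; c receives 2n digits. */
static void toy_comba_mul(const uint32_t *a, const uint32_t *b, int n,
                          uint32_t *c)
{
    uint64_t W[2 * MAX_DIGITS] = {0};   /* one wide accumulator per column */
    uint64_t carry = 0;
    int i, j, k;

    assert(n > 0 && n <= MAX_DIGITS);

    /* pass 1: accumulate the column sums, with no carry handling at all */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            W[i + j] += (uint64_t)a[i] * b[j];
        }
    }

    /* pass 2: propagate the carries once per column */
    for (k = 0; k < 2 * n; k++) {
        uint64_t t = W[k] + carry;
        c[k] = (uint32_t)(t & DIGIT_MASK);
        carry = t >> DIGIT_BIT;
    }
}
```

With 28-bit digits each product is below $2^{56}$, so a 64-bit accumulator can absorb at most 256 of them per column before it could overflow. That is the reason for the `MAX_DIGITS` assertion in this sketch.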

-For large inputs, typically 80 digits\footnote{By default that is 2240-bits or more.} or more the Karatsuba method is 
-used.  This method has significant overhead but an asymptotic running time of $O(n^{1.584})$ which means for fairly large
-inputs this method is faster.  The Karatsuba implementation is recursive which means for extremely large inputs they
-will benefit from the algorithm.
+\subsubsection{Karatsuba Multiplier}
+For large inputs, typically 80 digits\footnote{By default that is 2240-bits or more.} or more the Karatsuba multiplication
+method is used.  This method has significant overhead but an asymptotic running time of $O(n^{1.584})$ which means for 
+fairly large inputs this method is faster than the baseline (or comba) algorithm.  The Karatsuba implementation is 
+recursive, which means extremely large inputs benefit from the algorithm at every level of the recursion.
+
+The algorithm is based on the observation that if 
+
+\begin{eqnarray}
+x = x_0 + x_1\beta \nonumber \\
+y = y_0 + y_1\beta
+\end{eqnarray}
+
+where $x_0, x_1, y_0, y_1$ are half the size of $x$ and $y$ respectively, then 

+\begin{equation}
+x \cdot y = x_1y_1\beta^2 + (x_0y_0 + x_1y_1 - (x_1 - x_0)(y_1 - y_0))\beta + x_0y_0
+\end{equation}

+It follows that only three products have to be produced: $x_0y_0, x_1y_1, (x_1-x_0)(y_1-y_0)$ which
+are all of half size numbers.  A multiplication of two half size numbers requires only $1 \over 4$ of the
+original work which means with no recursion the Karatsuba algorithm achieves a running time of ${3n^2}\over 4$.  
+The routine provided does recursion which is where the $O(n^{1.584})$ work factor comes from.
+
+The multiplication by $\beta$ and $\beta^2$ amount to digit shift operations.  
+The extra overhead in the Karatsuba method comes from extracting the half size numbers $x_0, x_1, y_0, y_1$ and
+performing the various smaller calculations.  
+
+The library has been fairly optimized to extract the digits using hard-coded routines instead of the higher
+level functions; however, there is still significant overhead to optimize away.
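As a small worked check, one level of Karatsuba can be written out on machine integers. The sketch below is an assumption-laden illustration rather than the library's routine: `toy_karatsuba32` is a hypothetical name, the inputs are 32-bit values split at $\beta = 2^{16}$, and the middle term is formed as $x_0y_0 + x_1y_1 - (x_1 - x_0)(y_1 - y_0)$, which equals $x_1y_0 + x_0y_1$.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not LibTomMath API): one level of Karatsuba on
 * 32-bit inputs split into 16-bit halves, i.e. beta = 2^16. */
static uint64_t toy_karatsuba32(uint32_t x, uint32_t y)
{
    const int B = 16;
    int64_t x0 = (int64_t)(x & 0xFFFFu);
    int64_t x1 = (int64_t)(x >> B);
    int64_t y0 = (int64_t)(y & 0xFFFFu);
    int64_t y1 = (int64_t)(y >> B);

    /* the only three multiplications */
    int64_t x0y0 = x0 * y0;
    int64_t x1y1 = x1 * y1;
    int64_t t1   = (x1 - x0) * (y1 - y0);   /* may be negative */

    /* middle coefficient: equals x1*y0 + x0*y1, so it is never negative */
    int64_t mid = x0y0 + x1y1 - t1;

    assert(mid >= 0);

    /* multiplying by beta and beta^2 is just shifting */
    return ((uint64_t)x1y1 << (2 * B)) + ((uint64_t)mid << B) + (uint64_t)x0y0;
}
```

The difference product `t1` can be negative, which is why this sketch uses signed intermediates; a multiple precision version has to track that sign explicitly.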

 MPI only implements the slower baseline multiplier where carries are dealt with in the inner loop.  As a result even at
 smaller numbers (below the Karatsuba cutoff) the LibTomMath multipliers are faster.

 \subsection{Squaring Algorithms}

-Similar to the multiplication algorithms there are two baseline squaring algorithms.  Both have an asymptotic running
-time of $O((t^2 + t)/2)$.  The normal baseline squaring is the same from MPI and the fast is a ``comba'' squaring
-algorithm.  The comba method is used if the number of digits $N$ is less than $2^{w-1}/\beta^2$ which by default 
-covers numbers upto $3584$ bits.  
+Similar to the multiplication algorithms there are two baseline squaring algorithms.  Both have an asymptotic 
+running time of $O((t^2 + t)/2)$.  The normal baseline squaring is the same as in MPI and the fast method is 
+a ``comba'' squaring algorithm.  The comba method is used if the number of digits $N$ is less than 
+$2^{w-1}/\beta^2$ which by default covers numbers up to $3,584$ bits.  

 There is also a Karatsuba squaring method which achieves a running time of $O(n^{1.584})$ after considerably large
 inputs.
@ -653,25 +652,31 @@ than MPI is.

 LibTomMath implements a sliding window $k$-ary left to right exponentiation algorithm.  For a given exponent size $L$ an
 appropriate window size $k$ is chosen.  There are always at most $L$ modular squarings and $\lfloor L/k \rfloor$ modular
-multiplications.   The $k$-ary method works by precomputing values $g(x) = b^x$ for $0 \le x < 2^k$ and a given base 
+multiplications.   The $k$-ary method works by precomputing values $g(x) = b^x$ for $2^{k-1} \le x < 2^k$ and a given base 
 $b$.  Then the multiplications are grouped in windows of $k$ bits.  The sliding window technique has the benefit 
 that it can skip multiplications if there are zero bits following or preceding a window.  Consider the exponent 
 $e = 11110001_2$.  If $k = 2$ then there will be two squarings, a multiplication of $g(3)$, two squarings, a multiplication
-of $g(3)$, four squarings and and a multiplication by $g(1)$.  In total there are 8 squarings and 3 multiplications.  
+of $g(3)$, four squarings and a multiplication by $g(1)$.  In total there are 8 squarings and 3 multiplications.

-MPI uses a binary square-multiply method.  For the same exponent $e$ it would have had 8 squarings and 5 multiplications.  
-There is a precomputation phase for the method LibTomMath uses but it generally cuts down considerably on the number
-of multiplications.  Consider a 512-bit exponent.  The worst case for the LibTomMath method results in 512 squarings and 
-124 multiplications.  The MPI method would have 512 squarings and 512 multiplications.  Randomly every $2k$ bits another 
-multiplication is saved via the sliding-window technique on top of the savings the $k$-ary method provides.
+MPI uses a binary square-multiply method for exponentiation.  For the same exponent $e = 11110001_2$ it would have had to
+perform 8 squarings and 5 multiplications.  There is a precomputation phase for the method LibTomMath uses but it 
+generally cuts down considerably on the number of multiplications.  Consider a 512-bit exponent.  The worst case for the 
+LibTomMath method results in 512 squarings and 124 multiplications.  The MPI method would have 512 squarings 
+and 512 multiplications.  Randomly every $2k$ bits another multiplication is saved via the sliding-window 
+technique on top of the savings the $k$-ary method provides.
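The operation counts quoted for $e = 11110001_2$ can be reproduced with a toy model of the sliding window method. The sketch below is a simplified illustration and not the library's exptmod: `toy_exptmod_window` and `toy_counts` are hypothetical names, the modulus is assumed to be below $2^{32}$ so that plain 64-bit arithmetic suffices, the whole table $g(x)$ is precomputed for simplicity, and precomputation is not counted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: operation counts for one exponentiation. */
typedef struct {
    int squarings;
    int multiplications;
} toy_counts;

/* Hypothetical helper (not LibTomMath API): returns b^e mod m using a
 * left to right sliding window of k bits.  Assumes m < 2^32, 1 <= k <= 8. */
static uint64_t toy_exptmod_window(uint64_t b, uint32_t e, uint64_t m,
                                   int k, toy_counts *cnt)
{
    uint64_t g[256];
    uint64_t res;
    int started = 0;   /* set once the leading one bit has been seen */
    int i, s, x;

    assert(m > 0 && m < ((uint64_t)1 << 32));
    assert(k >= 1 && k <= 8);
    cnt->squarings = 0;
    cnt->multiplications = 0;

    /* g[x] = b^x mod m (whole table, for simplicity) */
    g[0] = 1 % m;
    for (x = 1; x < (1 << k); x++) {
        g[x] = (g[x - 1] * (b % m)) % m;
    }
    res = g[0];

    i = 31;
    while (i >= 0) {
        int width;
        uint32_t win;

        if (((e >> i) & 1u) == 0) {
            /* zero bit outside a window: square only (skip leading zeros) */
            if (started) {
                res = (res * res) % m;
                cnt->squarings++;
            }
            i--;
            continue;
        }

        /* a one bit opens a window of up to k bits */
        started = 1;
        width = (i + 1 < k) ? (i + 1) : k;
        win = (e >> (i + 1 - width)) & ((1u << width) - 1u);

        if (width == k) {
            /* full window: k squarings, then one table multiplication */
            for (s = 0; s < k; s++) {
                res = (res * res) % m;
                cnt->squarings++;
            }
            res = (res * g[win]) % m;
            cnt->multiplications++;
        } else {
            /* leftover bits at the end: plain square and multiply */
            for (s = width - 1; s >= 0; s--) {
                res = (res * res) % m;
                cnt->squarings++;
                if ((win >> s) & 1u) {
                    res = (res * g[1]) % m;
                    cnt->multiplications++;
                }
            }
        }
        i -= width;
    }
    return res;
}
```

Calling this with `e = 0xF1` and `k = 2` counts 8 squarings and 3 multiplications. With `k = 1` the same routine degenerates to the binary square-multiply method and counts 8 squarings and 5 multiplications for that exponent.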

 Both LibTomMath and MPI use Barrett reduction instead of division to reduce the numbers modulo the modulus given.
 However, LibTomMath can take advantage of the fact that the multiplications required within the Barrett reduction
-do not have to give full precision.  As a result the reduction step is much faster and just as accurate.  The LibTomMath code
-will automatically determine at run-time (e.g. when its called) whether the faster multiplier can be used.  The
+do not have to give full precision.  As a result the reduction step is much faster and just as accurate.  The LibTomMath 
+code will automatically determine at run-time (i.e. when it is called) whether the faster multiplier can be used.  The
 faster multipliers have also been optimized into the two variants (baseline and comba baseline).

 LibTomMath also has a variant of the exptmod function that uses Montgomery reductions instead of Barrett reductions
-which is faser.  As a result of all these changes exponentiation in LibTomMath is much faster than compared to MPI.  
+which is faster.  The code will automatically detect when the Montgomery version can be used (\textit{Requires the
+modulus to be odd and below the MONTGOMERY\_EXPT\_CUTOFF size}).  The Montgomery routine is essentially a copy of the 
+Barrett exponentiation routine except it uses Montgomery reduction.
+
+As a result of all these changes exponentiation in LibTomMath is much faster than in MPI.  On most ALU-strong
+processors (AMD Athlon for instance) exponentiation in LibTomMath is often more than ten times faster than MPI.   

 \end{document}
--- a/bn_fast_mp_invmod.c
+++ b/bn_fast_mp_invmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_fast_mp_montgomery_reduce.c
+++ b/bn_fast_mp_montgomery_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -100,14 +100,18 @@ fast_mp_montgomery_reduce (mp_int * a, mp_int * m, mp_digit mp)
    W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT);
  }

-  /* nox fix rest of carries */
-  for (++ix; ix <= m->used * 2 + 1; ix++) {
-    W[ix] += (W[ix - 1] >> ((mp_word) DIGIT_BIT));
-  }

  {
    register mp_digit *tmpa;
-    register mp_word *_W;
+    register mp_word *_W, *_W1;
+
+    /* now fix rest of carries */
+    _W1 = W + ix;
+    _W = W + ++ix;
+
+    for (; ix <= m->used * 2 + 1; ix++) {
+      *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT);
+    }

    /* copy out, A = A/b^n
     *
--- a/bn_fast_s_mp_mul_digs.c
+++ b/bn_fast_s_mp_mul_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_fast_s_mp_mul_high_digs.c
+++ b/bn_fast_s_mp_mul_high_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_fast_s_mp_sqr.c
+++ b/bn_fast_s_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_2expt.c
+++ b/bn_mp_2expt.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_abs.c
+++ b/bn_mp_abs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_add.c
+++ b/bn_mp_add.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_add_d.c
+++ b/bn_mp_add_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_addmod.c
+++ b/bn_mp_addmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_and.c
+++ b/bn_mp_and.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_clamp.c
+++ b/bn_mp_clamp.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_clear.c
+++ b/bn_mp_clear.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_cmp.c
+++ b/bn_mp_cmp.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_cmp_d.c
+++ b/bn_mp_cmp_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_cmp_mag.c
+++ b/bn_mp_cmp_mag.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_copy.c
+++ b/bn_mp_copy.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_count_bits.c
+++ b/bn_mp_count_bits.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_div.c
+++ b/bn_mp_div.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_div_2.c
+++ b/bn_mp_div_2.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -46,6 +46,7 @@ mp_div_2 (mp_int * a, mp_int * b)
      *tmpb++ = 0;
    }
  }
+  b->sign = a->sign;
  mp_clamp (b);
  return MP_OKAY;
 }
--- a/bn_mp_div_2d.c
+++ b/bn_mp_div_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -51,7 +51,9 @@ mp_div_2d (mp_int * a, int b, mp_int * c, mp_int * d)
  }

  /* shift by as many digits in the bit count */
-  mp_rshd (c, b / DIGIT_BIT);
+  if (b >= DIGIT_BIT) {
+     mp_rshd (c, b / DIGIT_BIT);
+  }     

  /* shift any bit count < DIGIT_BIT */
  D = (mp_digit) (b % DIGIT_BIT);
--- a/bn_mp_div_d.c
+++ b/bn_mp_div_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_exch.c
+++ b/bn_mp_exch.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_expt_d.c
+++ b/bn_mp_expt_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_exptmod.c
+++ b/bn_mp_exptmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_exptmod_fast.c
+++ b/bn_mp_exptmod_fast.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_gcd.c
+++ b/bn_mp_gcd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_grow.c
+++ b/bn_mp_grow.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_init.c
+++ b/bn_mp_init.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_init_copy.c
+++ b/bn_mp_init_copy.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_init_size.c
+++ b/bn_mp_init_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_invmod.c
+++ b/bn_mp_invmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_jacobi.c
+++ b/bn_mp_jacobi.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_karatsuba_mul.c
+++ b/bn_mp_karatsuba_mul.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -37,8 +37,7 @@ int
 mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
 {
  mp_int  x0, x1, y0, y1, t1, t2, x0y0, x1y1;
-  int     B, err, x;
-
+  int     B, err;

  err = MP_MEM;

@ -59,13 +58,13 @@ mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
    goto Y0;

  /* init temps */
-  if (mp_init (&t1) != MP_OKAY)
+  if (mp_init_size (&t1, B * 2) != MP_OKAY)
    goto Y1;
-  if (mp_init (&t2) != MP_OKAY)
+  if (mp_init_size (&t2, B * 2) != MP_OKAY)
    goto T1;
-  if (mp_init (&x0y0) != MP_OKAY)
+  if (mp_init_size (&x0y0, B * 2) != MP_OKAY)
    goto T2;
-  if (mp_init (&x1y1) != MP_OKAY)
+  if (mp_init_size (&x1y1, B * 2) != MP_OKAY)
    goto X0Y0;

  /* now shift the digits */
@ -76,18 +75,32 @@ mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
  x1.used = a->used - B;
  y1.used = b->used - B;

-  /* we copy the digits directly instead of using higher level functions
-   * since we also need to shift the digits
-   */
-  for (x = 0; x < B; x++) {
-    x0.dp[x] = a->dp[x];
-    y0.dp[x] = b->dp[x];
-  }
-  for (x = B; x < a->used; x++) {
-    x1.dp[x - B] = a->dp[x];
-  }
-  for (x = B; x < b->used; x++) {
-    y1.dp[x - B] = b->dp[x];
+  {
+    register int x;
+    register mp_digit *tmpa, *tmpb, *tmpx, *tmpy;
+
+    /* we copy the digits directly instead of using higher level functions
+     * since we also need to shift the digits
+     */
+    tmpa = a->dp;
+    tmpb = b->dp;
+
+    tmpx = x0.dp;
+    tmpy = y0.dp;
+    for (x = 0; x < B; x++) {
+      *tmpx++ = *tmpa++;
+      *tmpy++ = *tmpb++;
+    }
+
+    tmpx = x1.dp;
+    for (x = B; x < a->used; x++) {
+      *tmpx++ = *tmpa++;
+    }
+
+    tmpy = y1.dp;
+    for (x = B; x < b->used; x++) {
+      *tmpy++ = *tmpb++;
+    }
  }

  /* only need to clamp the lower words since by definition the upper words x1/y1 must
--- a/bn_mp_karatsuba_sqr.c
+++ b/bn_mp_karatsuba_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -23,8 +23,7 @@ int
 mp_karatsuba_sqr (mp_int * a, mp_int * b)
 {
  mp_int  x0, x1, t1, t2, x0x0, x1x1;
-  int     B, err, x;
-
+  int     B, err;

  err = MP_MEM;

@ -41,22 +40,31 @@ mp_karatsuba_sqr (mp_int * a, mp_int * b)
    goto X0;

  /* init temps */
-  if (mp_init (&t1) != MP_OKAY)
+  if (mp_init_size (&t1, a->used * 2) != MP_OKAY)
    goto X1;
-  if (mp_init (&t2) != MP_OKAY)
+  if (mp_init_size (&t2, a->used * 2) != MP_OKAY)
    goto T1;
-  if (mp_init (&x0x0) != MP_OKAY)
+  if (mp_init_size (&x0x0, B * 2) != MP_OKAY)
    goto T2;
-  if (mp_init (&x1x1) != MP_OKAY)
+  if (mp_init_size (&x1x1, (a->used - B) * 2) != MP_OKAY)
    goto X0X0;

-  /* now shift the digits */
-  for (x = 0; x < B; x++) {
-    x0.dp[x] = a->dp[x];
-  }
+  {
+    register int x;
+    register mp_digit *dst, *src;

-  for (x = B; x < a->used; x++) {
-    x1.dp[x - B] = a->dp[x];
+    src = a->dp;
+
+    /* now shift the digits */
+    dst = x0.dp;
+    for (x = 0; x < B; x++) {
+      *dst++ = *src++;
+    }
+
+    dst = x1.dp;
+    for (x = B; x < a->used; x++) {
+      *dst++ = *src++;
+    }
  }

  x0.used = B;
@ -77,7 +85,7 @@ mp_karatsuba_sqr (mp_int * a, mp_int * b)
    goto X1X1;			/* t1 = (x1 - x0) * (y1 - y0) */

  /* add x0y0 */
-  if (mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
+  if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
    goto X1X1;			/* t2 = x0y0 + x1y1 */
  if (mp_sub (&t2, &t1, &t1) != MP_OKAY)
    goto X1X1;			/* t1 = x0y0 + x1y1 - (x1-x0)*(y1-y0) */
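Note on the Karatsuba hunks above: the comments describe the middle term as x0y0 + x1y1 - (x1-x0)*(y1-y0), which equals x0*y1 + x1*y0 and so replaces two half-size products with one. The switch to s_mp_add is presumably safe because both squares are non-negative. The sketch below checks that identity on machine words rather than on mp_int values. The name karatsuba32 and the 16-bit split are assumptions made for illustration only and are not part of LibTomMath.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration, not library code: multiply two 32-bit values
 * with three 16x16 products instead of four, using the same identity as
 * the comments in mp_karatsuba_mul/mp_karatsuba_sqr. */
static uint64_t karatsuba32(uint32_t a, uint32_t b)
{
    /* split each input into a low and a high half (B = 16 bits here) */
    int64_t x0 = a & 0xFFFFu, x1 = a >> 16;
    int64_t y0 = b & 0xFFFFu, y1 = b >> 16;

    int64_t x0y0 = x0 * y0;
    int64_t x1y1 = x1 * y1;

    /* t1 = (x1 - x0) * (y1 - y0); this product may be negative */
    int64_t t1 = (x1 - x0) * (y1 - y0);

    /* mid = x0y0 + x1y1 - t1 = x0*y1 + x1*y0, which is never negative */
    int64_t mid = x0y0 + x1y1 - t1;

    /* recombine: x0y0 + mid * 2^16 + x1y1 * 2^32 */
    return (uint64_t)x0y0 + ((uint64_t)mid << 16) + ((uint64_t)x1y1 << 32);
}
```

Signed 64-bit temporaries are used because (x1 - x0) and (y1 - y0) can each be negative. The library handles the same situation through the sign field of mp_int.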
--- a/bn_mp_lcm.c
+++ b/bn_mp_lcm.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_lshd.c
+++ b/bn_mp_lshd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -31,16 +31,31 @@ mp_lshd (mp_int * a, int b)
    return res;
  }

-  /* increment the used by the shift amount than copy upwards */
-  a->used += b;
-  for (x = a->used - 1; x >= b; x--) {
-    a->dp[x] = a->dp[x - b];
-  }
+  {
+    register mp_digit *tmpa, *tmpaa;

-  /* zero the lower digits */
-  for (x = 0; x < b; x++) {
-    a->dp[x] = 0;
+    /* increment the used count by the shift amount, then copy upwards */
+    a->used += b;
+    
+    /* top */
+    tmpa = a->dp + a->used - 1;
+    
+    /* base */
+    tmpaa = a->dp + a->used - 1 - b;
+
+    /* much like mp_rshd this is implemented using a sliding window
+     * except the window goes the other way around, copying from
+     * the bottom to the top.  See bn_mp_rshd.c for more info.
+     */
+    for (x = a->used - 1; x >= b; x--) {
+      *tmpa-- = *tmpaa--;
+    }
+
+    /* zero the lower digits */
+    tmpa = a->dp;
+    for (x = 0; x < b; x++) {
+      *tmpa++ = 0;
+    }
  }
-  mp_clamp (a);
  return MP_OKAY;
 }
--- a/bn_mp_mod.c
+++ b/bn_mp_mod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_mod_2d.c
+++ b/bn_mp_mod_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_mod_d.c
+++ b/bn_mp_mod_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_montgomery_calc_normalization.c
+++ b/bn_mp_montgomery_calc_normalization.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_montgomery_reduce.c
+++ b/bn_mp_montgomery_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_montgomery_setup.c
+++ b/bn_mp_montgomery_setup.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -18,36 +18,29 @@
 int
 mp_montgomery_setup (mp_int * a, mp_digit * mp)
 {
-  mp_int  t, tt;
-  int     res;
+  unsigned long x, b;

-  if ((res = mp_init (&t)) != MP_OKAY) {
-    return res;
+/* fast inversion mod 2^32 
+ *
+ * Based on the fact that 
+ *
+ * XA = 1 (mod 2^n)  =>  (X(2-XA)) A = 1 (mod 2^2n)
+ *                   =>  2*X*A - X*X*A*A = 1
+ *                   =>  2*(1) - (1)     = 1
+ */
+  b = a->dp[0];
+
+  if ((b & 1) == 0) {
+    return MP_VAL;
  }

-  if ((res = mp_init (&tt)) != MP_OKAY) {
-    goto __T;
-  }
-
-  /* tt = b */
-  tt.dp[0] = 0;
-  tt.dp[1] = 1;
-  tt.used = 2;
-
-  /* t = m mod b */
-  t.dp[0] = a->dp[0];
-  t.used = 1;
-
-  /* t = 1/m mod b */
-  if ((res = mp_invmod (&t, &tt, &t)) != MP_OKAY) {
-    goto __TT;
-  }
+  x = (((b + 2) & 4) << 1) + b;	/* here x*a==1 mod 2^4 */
+  x *= 2 - b * x;		/* here x*a==1 mod 2^8 */
+  x *= 2 - b * x;		/* here x*a==1 mod 2^16; each step doubles the number of bits */
+  x *= 2 - b * x;		/* here x*a==1 mod 2^32 */

  /* t = -1/m mod b */
-  *mp = ((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - t.dp[0];
+  *mp = ((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - (x & MP_MASK);

-  res = MP_OKAY;
-__TT:mp_clear (&tt);
-__T:mp_clear (&t);
-  return res;
+  return MP_OKAY;
 }
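Note on the mp_montgomery_setup hunk above: the new code replaces a full mp_invmod call with a word-sized Newton iteration. If b*x == 1 (mod 2^k), then b * (x * (2 - b*x)) = 1 - (1 - b*x)^2 == 1 (mod 2^2k), so each step doubles the number of correct low bits. The sketch below isolates that computation so it can be checked on its own. The names inv_mod_2_32 and montgomery_rho are hypothetical, and it assumes a 32-bit unsigned type and a digit width below 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of LibTomMath: returns x such that
 * b * x == 1 (mod 2^32) for odd b, using the same steps as the diff. */
static uint32_t inv_mod_2_32(uint32_t b)
{
    uint32_t x;

    assert((b & 1u) != 0);            /* an even b has no inverse mod 2^32 */

    x = (((b + 2u) & 4u) << 1) + b;   /* b * x == 1 (mod 2^4)  */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^8)  */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^16) */
    x *= 2u - b * x;                  /* b * x == 1 (mod 2^32) */
    return x;
}

/* rho = -1/b mod 2^digit_bits; digit_bits < 32 is an assumption here */
static uint32_t montgomery_rho(uint32_t b, unsigned digit_bits)
{
    uint32_t mask = ((uint32_t)1 << digit_bits) - 1u;

    return (((uint32_t)1 << digit_bits) - (inv_mod_2_32(b) & mask)) & mask;
}
```

The seed works because b*b == 1 (mod 8) for every odd b. The expression ((b + 2) & 4) << 1 adds 8 exactly when b*b == 9 (mod 16), which corrects the fourth bit.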
--- a/bn_mp_mul.c
+++ b/bn_mp_mul.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_mul_2.c
+++ b/bn_mp_mul_2.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -50,6 +50,11 @@ mp_mul_2 (mp_int * a, mp_int * b)
 	if ((res = mp_grow (b, b->used + 1)) != MP_OKAY) {
 	  return res;
 	}
+
+	/* after the grow *tmpb is no longer valid so we have to reset it! 
+	 * (this bug took me about 17 minutes to find...!)
+	 */
+	tmpb = b->dp + b->used;
      }
      /* add a MSB of 1 */
      *tmpb = 1;
@ -61,5 +66,6 @@ mp_mul_2 (mp_int * a, mp_int * b)
      *tmpb++ = 0;
    }
  }
+  b->sign = a->sign;
  return MP_OKAY;
 }
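Note on the mp_mul_2 hunk above: the bug fixed here is a cached digit pointer that goes stale once mp_grow reallocates the digit array. The sketch below reproduces the pattern with a toy type so the fix can be seen in isolation. toy_int, toy_grow and toy_mul_2 are hypothetical stand-ins rather than the library's definitions, and the 28-bit digit width is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int digit_t;            /* hypothetical digit type (>= 32 bits assumed) */
#define DIGIT_BITS 28u
#define DIGIT_MASK ((1u << DIGIT_BITS) - 1u)

typedef struct {
    digit_t *dp;     /* digits, least significant first */
    int      used;   /* digits currently in use */
    int      alloc;  /* digits allocated */
} toy_int;

/* make sure at least `size` digits are allocated; new digits are zeroed */
static int toy_grow(toy_int *a, int size)
{
    digit_t *tmp;

    if (a->alloc >= size) {
        return 0;
    }
    tmp = realloc(a->dp, (size_t)size * sizeof(digit_t));
    if (tmp == NULL) {
        return -1;
    }
    memset(tmp + a->alloc, 0, (size_t)(size - a->alloc) * sizeof(digit_t));
    a->dp = tmp;
    a->alloc = size;
    return 0;
}

/* b = 2 * b, in place */
static int toy_mul_2(toy_int *b)
{
    digit_t *tmpb = b->dp;
    digit_t  carry = 0;
    int      x;

    for (x = 0; x < b->used; x++) {
        digit_t next = *tmpb >> (DIGIT_BITS - 1u);   /* bit shifted out */
        *tmpb = ((*tmpb << 1) | carry) & DIGIT_MASK;
        tmpb++;
        carry = next;
    }
    if (carry != 0) {
        if (toy_grow(b, b->used + 1) != 0) {
            return -1;
        }
        /* realloc may have moved the buffer, so re-derive the pointer */
        tmpb = b->dp + b->used;
        *tmpb = 1;
        b->used++;
    }
    return 0;
}
```

Without the re-derivation, the final store would go through the pre-realloc address. That only fails when the allocator actually moves the block, which is likely why the original bug was intermittent.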
--- a/bn_mp_mul_2d.c
+++ b/bn_mp_mul_2d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -32,9 +32,11 @@ mp_mul_2d (mp_int * a, int b, mp_int * c)
  }

  /* shift by as many digits in the bit count */
-  if ((res = mp_lshd (c, b / DIGIT_BIT)) != MP_OKAY) {
-    return res;
-  }
+  if (b >= DIGIT_BIT) {
+     if ((res = mp_lshd (c, b / DIGIT_BIT)) != MP_OKAY) {
+       return res;
+     }
+  }     
  c->used = c->alloc;

  /* shift any bit count < DIGIT_BIT */
--- a/bn_mp_mul_d.c
+++ b/bn_mp_mul_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_mulmod.c
+++ b/bn_mp_mulmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_n_root.c
+++ b/bn_mp_n_root.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_neg.c
+++ b/bn_mp_neg.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_or.c
+++ b/bn_mp_or.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_rand.c
+++ b/bn_mp_rand.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_read_signed_bin.c
+++ b/bn_mp_read_signed_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_read_unsigned_bin.c
+++ b/bn_mp_read_unsigned_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_reduce.c
+++ b/bn_mp_reduce.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_rshd.c
+++ b/bn_mp_rshd.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -20,7 +20,6 @@ mp_rshd (mp_int * a, int b)
 {
  int     x;

-
  /* if b <= 0 then ignore it */
  if (b <= 0) {
    return;
@ -32,14 +31,34 @@ mp_rshd (mp_int * a, int b)
    return;
  }

-  /* shift the digits down */
-  for (x = 0; x < (a->used - b); x++) {
-    a->dp[x] = a->dp[x + b];
-  }
+  {
+    register mp_digit *tmpa, *tmpaa;

-  /* zero the top digits */
-  for (; x < a->used; x++) {
-    a->dp[x] = 0;
+    /* shift the digits down */
+
+    /* base */
+    tmpa = a->dp;
+    
+    /* offset into digits */
+    tmpaa = a->dp + b;
+    
+    /* this is implemented as a sliding window where the window is b-digits long
+     * and digits from the top of the window are copied to the bottom
+     *
+     * e.g.
+     
+     b-2 | b-1 | b0 | b1 | b2 | ... | bb |   ---->
+                 /\                   |      ---->
+                  \-------------------/      ---->
+    */         
+    for (x = 0; x < (a->used - b); x++) {
+      *tmpa++ = *tmpaa++;
+    }
+
+    /* zero the top digits */
+    for (; x < a->used; x++) {
+      *tmpa++ = 0;
+    }
  }
  mp_clamp (a);
 }
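Note on the mp_lshd/mp_rshd hunks above: both shifts now walk two pointers a fixed distance of b digits apart instead of indexing a->dp[x] and a->dp[x + b] on every iteration. The sketch below shows the same sliding-window copies on a bare digit array. The function names and the plain-array interface are hypothetical and exist only for illustration. The caller is assumed to have reserved room for used + b digits before a left shift.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int digit_t;   /* hypothetical digit type */

/* shift right by b whole digits; returns the new used count */
static size_t shift_digits_right(digit_t *dp, size_t used, size_t b)
{
    digit_t *bottom, *top;
    size_t   x;

    if (b == 0) {
        return used;
    }
    if (b >= used) {             /* everything is shifted out */
        for (x = 0; x < used; x++) {
            dp[x] = 0;
        }
        return 0;
    }

    bottom = dp;                 /* destination: base of the array */
    top    = dp + b;             /* source: b digits further up    */
    for (x = 0; x < used - b; x++) {
        *bottom++ = *top++;
    }
    for (; x < used; x++) {      /* zero the vacated top digits */
        *bottom++ = 0;
    }
    return used - b;
}

/* shift left by b whole digits; returns the new used count */
static size_t shift_digits_left(digit_t *dp, size_t used, size_t b)
{
    digit_t *top, *base;
    size_t   x;

    if (b == 0 || used == 0) {
        return used;
    }

    top  = dp + used + b;        /* one past the new top digit */
    base = dp + used;            /* one past the old top digit */
    for (x = 0; x < used; x++) { /* copy downwards so nothing is overwritten early */
        *--top = *--base;
    }
    for (x = 0; x < b; x++) {    /* zero the lower digits */
        dp[x] = 0;
    }
    return used + b;
}
```

The left shift uses pre-decrement from one-past-the-end pointers, so no pointer is ever formed below the start of the array. That is a choice made for this sketch and differs slightly from the post-decrement form in the diff.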
--- a/bn_mp_set.c
+++ b/bn_mp_set.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_set_int.c
+++ b/bn_mp_set_int.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_shrink.c
+++ b/bn_mp_shrink.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_signed_bin_size.c
+++ b/bn_mp_signed_bin_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_sqr.c
+++ b/bn_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_sqrmod.c
+++ b/bn_mp_sqrmod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_sub.c
+++ b/bn_mp_sub.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_sub_d.c
+++ b/bn_mp_sub_d.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_submod.c
+++ b/bn_mp_submod.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_to_signed_bin.c
+++ b/bn_mp_to_signed_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_to_unsigned_bin.c
+++ b/bn_mp_to_unsigned_bin.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_unsigned_bin_size.c
+++ b/bn_mp_unsigned_bin_size.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_xor.c
+++ b/bn_mp_xor.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_mp_zero.c
+++ b/bn_mp_zero.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_radix.c
+++ b/bn_radix.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_reverse.c
+++ b/bn_reverse.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_s_mp_add.c
+++ b/bn_s_mp_add.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

@ -55,8 +55,14 @@ s_mp_add (mp_int * a, mp_int * b, mp_int * c)
    register int i;

    /* alias for digit pointers */
+    
+    /* first input */
    tmpa = a->dp;
+    
+    /* second input */
    tmpb = b->dp;
+    
+    /* destination */
    tmpc = c->dp;

    u = 0;
--- a/bn_s_mp_mul_digs.c
+++ b/bn_s_mp_mul_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_s_mp_mul_high_digs.c
+++ b/bn_s_mp_mul_high_digs.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_s_mp_sqr.c
+++ b/bn_s_mp_sqr.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bn_s_mp_sub.c
+++ b/bn_s_mp_sub.c
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

--- a/bncore.c
+++ b/bncore.c
@ -10,10 +10,13 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #include <tommath.h>

-int     KARATSUBA_MUL_CUTOFF = 80,	/* Min. number of digits before Karatsuba multiplication is used. */
-        KARATSUBA_SQR_CUTOFF = 80,	/* Min. number of digits before Karatsuba squaring is used. */
-        MONTGOMERY_EXPT_CUTOFF = 74;	/* max. number of digits that montgomery reductions will help for */
+/* configured for an AMD Duron Morgan core with etc/tune.c */
+int     KARATSUBA_MUL_CUTOFF = 73,	/* Min. number of digits before Karatsuba multiplication is used. */
+        KARATSUBA_SQR_CUTOFF = 121,	/* Min. number of digits before Karatsuba squaring is used. */
+        MONTGOMERY_EXPT_CUTOFF = 128;	/* max. number of digits that montgomery reductions will help for */
+
+
--- a/changes.txt
+++ b/changes.txt
@ -1,3 +1,16 @@
+Mar 15th, 2003
+v0.14  -- Tons of manual updates
+       -- cleaned up the directory
+       -- added MSVC makefiles
+       -- source changes [that I don't recall]
+       -- Fixed up the lshd/rshd code to use pointer aliasing
+       -- Fixed up the mul_2d and div_2d to not call rshd/lshd unless needed
+       -- Fixed up etc/tune.c a tad
+       -- fixed up demo/demo.c to output comma-delimited results of timing
+          also fixed up timing demo to use a finer granularity for various functions
+       -- fixed up demo/demo.c testing to pause during testing so my Duron won't catch on fire
+          [stays around 31-35C during testing :-)]
+       
 Feb 13th, 2003
 v0.13  -- tons of minor speed-ups in low level add, sub, mul_2 and div_2 which propagate 
          to other functions like mp_invmod, mp_div, etc...
--- a/demo/demo.c
+++ b/demo/demo.c
@ -69,18 +69,32 @@ int mp_reduce_setup(mp_int *a, mp_int *b)
   }
   return mp_div(a, b, a, NULL);
 }
+
+int mp_rand(mp_int *a, int c)
+{
+   long z = abs(rand()) & 65535;
+   mp_set(a, z?z:1);
+   while (c--) {
+      s_mp_lshd(a, 1);
+      mp_add_d(a, abs(rand()), a);
+   }
+   return MP_OKAY;
+}
 #endif

   char cmd[4096], buf[4096];
 int main(void)
 {
   mp_int a, b, c, d, e, f;
-   unsigned long expt_n, add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, inv_n;
+   unsigned long expt_n, add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, inv_n,
+                 div2_n, mul2_n;
   unsigned rr;
+   int cnt;

 #ifdef TIMER
   int n;
   ulong64 tt;
+   FILE *log;
 #endif

   mp_init(&a);
@ -90,60 +104,66 @@ int main(void)
   mp_init(&e);
   mp_init(&f);

-
 #ifdef TIMER
-goto multtime;
-
      printf("CLOCKS_PER_SEC == %lu\n", CLOCKS_PER_SEC);
-      mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
-      mp_read_radix(&b, "340282366920938463463574607431768211455", 10);
-      while (a.used * DIGIT_BIT < 8192) {
+goto expttime;      
+
+      log = fopen("add.log", "w");
+      for (cnt = 4; cnt <= 128; cnt += 4) {
+         mp_rand(&a, cnt);
+         mp_rand(&b, cnt);
         reset();
         for (rr = 0; rr < 10000000; rr++) {
             mp_add(&a, &b, &c);
         }
         tt = rdtsc();
         printf("Adding\t\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-         mp_sqr(&a, &a);
-         mp_sqr(&b, &b);
+         fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
      }
+      fclose(log);
 
-      mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
-      mp_read_radix(&b, "340282366920938463463574607431768211455", 10);
-      while (a.used * DIGIT_BIT < 8192) {
+      log = fopen("sub.log", "w");
+      for (cnt = 4; cnt <= 128; cnt += 4) {
+         mp_rand(&a, cnt);
+         mp_rand(&b, cnt);
         reset();
         for (rr = 0; rr < 10000000; rr++) {
             mp_sub(&a, &b, &c);
         }
         tt = rdtsc();
-         printf("Subtracting\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-         mp_sqr(&a, &a);
-         mp_sqr(&b, &b);
+         printf("Subtracting\t\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
+         fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
      }
+      fclose(log);
      
 multtime:      

-   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
-   while (a.used * DIGIT_BIT < 8192) {
+   log = fopen("sqr.log", "w");
+   for (cnt = 4; cnt <= 128; cnt += 4) {
+      mp_rand(&a, cnt);
      reset();
      for (rr = 0; rr < 250000; rr++) {
          mp_sqr(&a, &b);
      }
      tt = rdtsc();
      printf("Squaring\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_copy(&b, &a);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
+   fclose(log);
   
-   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
-   while (a.used * DIGIT_BIT < 8192) {
+   log = fopen("mult.log", "w");
+   for (cnt = 4; cnt <= 128; cnt += 4) {
+      mp_rand(&a, cnt);
+      mp_rand(&b, cnt);
      reset();
      for (rr = 0; rr < 250000; rr++) {
-          mp_mul(&a, &a, &b);
+          mp_mul(&a, &b, &c);
      }
      tt = rdtsc();
      printf("Multiplying\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_copy(&b, &a);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
+   fclose(log);

 expttime:  
   {
@ -157,6 +177,7 @@ expttime:
         "1214855636816562637502584060163403830270705000634713483015101384881871978446801224798536155406895823305035467591632531067547890948695117172076954220727075688048751022421198712032848890056357845974246560748347918630050853933697792254955890439720297560693579400297062396904306270145886830719309296352765295712183040773146419022875165382778007040109957609739589875590885701126197906063620133954893216612678838507540777138437797705602453719559017633986486649523611975865005712371194067612263330335590526176087004421363598470302731349138773205901447704682181517904064735636518462452242791676541725292378925568296858010151852326316777511935037531017413910506921922450666933202278489024521263798482237150056835746454842662048692127173834433089016107854491097456725016327709663199738238442164843147132789153725513257167915555162094970853584447993125488607696008169807374736711297007473812256272245489405898470297178738029484459690836250560495461579533254473316340608217876781986188705928270735695752830825527963838355419762516246028680280988020401914551825487349990306976304093109384451438813251211051597392127491464898797406789175453067960072008590614886532333015881171367104445044718144312416815712216611576221546455968770801413440778423979",
         NULL         
      };
+   log = fopen("expt.log", "w");
   for (n = 0; primes[n]; n++) {
      mp_read_radix(&a, primes[n], 10);
      mp_zero(&b);
@ -183,12 +204,21 @@ expttime:
         exit(0);
      }
      printf("Exponentiating\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
   }   
-
-   mp_read_radix(&a, "340282366920938463463374607431768211455", 10);
-   mp_read_radix(&b, "234892374891378913789237289378973232333", 10);
-   while (a.used * DIGIT_BIT < 8192) {
+   fclose(log);
+invtime:
+   log = fopen("invmod.log", "w");
+   for (cnt = 4; cnt <= 128; cnt += 4) {
+      mp_rand(&a, cnt);
+      mp_rand(&b, cnt);
+      
+      do {
+         mp_add_d(&b, 1, &b);
+         mp_gcd(&a, &b, &c);
+      } while (mp_cmp_d(&c, 1) != MP_EQ);
+      
      reset();
      for (rr = 0; rr < 10000; rr++) {
          mp_invmod(&b, &a, &c);
@ -200,16 +230,18 @@ expttime:
         return 0;
      }
      printf("Inverting mod\t%4d-bit => %9llu/sec, %9llu ticks\n", mp_count_bits(&a), (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt, tt);
-      mp_sqr(&a, &a);
-      mp_sqr(&b, &b);
+      fprintf(log, "%d,%9llu\n", cnt, (((unsigned long long)rr)*CLOCKS_PER_SEC)/tt);
   }
+   fclose(log);
   
   return 0;
  
 #endif

-   inv_n = expt_n = lcm_n = gcd_n = add_n = sub_n = mul_n = div_n = sqr_n = mul2d_n = div2d_n = 0;   
+   div2_n = mul2_n = inv_n = expt_n = lcm_n = gcd_n = add_n = 
+   sub_n = mul_n = div_n = sqr_n = mul2d_n = div2d_n = cnt = 0;
   for (;;) {
+       if (!(++cnt & 15)) sleep(3);
   
        /* randomly clear and re-init one variable, this has the effect of trimming the alloc space */
       switch (abs(rand()) % 7) {
@ -223,7 +255,7 @@ expttime:
       }
   
   
-       printf("%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%5d\r", add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, expt_n, inv_n, _ifuncs);
+       printf("%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu/%7lu ", add_n, sub_n, mul_n, div_n, sqr_n, mul2d_n, div2d_n, gcd_n, lcm_n, expt_n, inv_n, div2_n, mul2_n);
       fgets(cmd, 4095, stdin);
       cmd[strlen(cmd)-1] = 0;
       printf("%s  ]\r",cmd); fflush(stdout);
@ -386,7 +418,29 @@ draw(&a);draw(&b);draw(&c);draw(&d);
                return 0;
             }
                
-       }
+       } else if (!strcmp(cmd, "div2")) { ++div2_n;
+             fgets(buf, 4095, stdin);  mp_read_radix(&a, buf, 10);
+             fgets(buf, 4095, stdin);  mp_read_radix(&b, buf, 10);
+             mp_div_2(&a, &c);
+             if (mp_cmp(&c, &b) != MP_EQ) {
+                 printf("div_2 %lu failure\n", div2_n);
+                 draw(&a);
+                 draw(&b);
+                 draw(&c);
+                 return 0;
+             }
+       } else if (!strcmp(cmd, "mul2")) { ++mul2_n;
+             fgets(buf, 4095, stdin);  mp_read_radix(&a, buf, 10);
+             fgets(buf, 4095, stdin);  mp_read_radix(&b, buf, 10);
+             mp_mul_2(&a, &c);
+             if (mp_cmp(&c, &b) != MP_EQ) {
+                 printf("mul_2 %lu failure\n", mul2_n);
+                 draw(&a);
+                 draw(&b);
+                 draw(&c);
+                 return 0;
+             }
+       }             
       
   }
   return 0;   
--- a/etc/makefile
+++ b/etc/makefile
@ -17,4 +17,4 @@ mersenne: mersenne.o
 	$(CC) mersenne.o $(LIBNAME) -o mersenne
        
 clean:
-	rm -f *.o *.exe pprime tune mersenne 
+	rm -f *.log *.o *.obj *.exe pprime tune mersenne 
--- a/etc/makefile.msvc
+++ b/etc/makefile.msvc
@ -0,0 +1,14 @@
+#MSVC Makefile
+#
+#Tom St Denis
+
+CFLAGS = /I../ /Ogityb2 /Gs /DWIN32 /W3
+
+pprime: pprime.obj
+	cl pprime.obj ../tommath.lib 
+
+mersenne: mersenne.obj
+	cl mersenne.obj ../tommath.lib
+	
+tune: tune.obj
+	cl tune.obj ../tommath.lib	
--- a/etc/mersenne.c
+++ b/etc/mersenne.c
@ -3,14 +3,14 @@
 * Tom St Denis, tomstdenis@iahu.ca
 */
 #include <time.h>
-#include <bn.h>
+#include <tommath.h>

 int
 is_mersenne (long s, int *pp)
 {
-  mp_int    n, u, mu;
-  int       res, k;
-  long      ss;
+  mp_int  n, u, mu;
+  int     res, k;
+  long    ss;

  *pp = 0;

@ -85,7 +85,7 @@ __N:mp_clear (&n);
 long
 i_sqrt (long x)
 {
-  long      x1, x2;
+  long    x1, x2;

  x2 = 16;
  do {
@ -104,7 +104,7 @@ i_sqrt (long x)
 int
 isprime (long k)
 {
-  long      y, z;
+  long    y, z;

  y = i_sqrt (k);
  for (z = 2; z <= y; z++) {
@ -118,9 +118,9 @@ isprime (long k)
 int
 main (void)
 {
-  int       pp;
-  long      k;
-  clock_t   tt;
+  int     pp;
+  long    k;
+  clock_t tt;

  k = 3;

--- a/etc/pprime.c
+++ b/etc/pprime.c
@ -8,10 +8,10 @@
 #include "tommath.h"

 /* fast square root */
-static    mp_digit
+static  mp_digit
 i_sqrt (mp_word x)
 {
-  mp_word   x1, x2;
+  mp_word x1, x2;

  x2 = x;
  do {
@ -28,10 +28,10 @@ i_sqrt (mp_word x)


 /* generates a prime digit */
-static    mp_digit
+static  mp_digit
 prime_digit ()
 {
-  mp_digit  r, x, y, next;
+  mp_digit r, x, y, next;

  /* make a DIGIT_BIT-bit random number */
  for (r = x = 0; x < DIGIT_BIT; x++) {
@ -141,8 +141,8 @@ prime_digit ()
 int
 pprime (int k, int li, mp_int * p, mp_int * q)
 {
-  mp_int    a, b, c, n, x, y, z, v;
-  int       res, ii;
+  mp_int  a, b, c, n, x, y, z, v;
+  int     res, ii;
  static const mp_digit bases[] = { 2, 3, 5, 7, 11, 13, 17, 19 };

  /* single digit ? */
@ -329,10 +329,10 @@ __C:mp_clear (&c);
 int
 main (void)
 {
-  mp_int    p, q;
-  char      buf[4096];
-  int       k, li;
-  clock_t   t1;
+  mp_int  p, q;
+  char    buf[4096];
+  int     k, li;
+  clock_t t1;

  srand (time (NULL));

--- a/etc/tune.c
+++ b/etc/tune.c
@ -8,19 +8,19 @@
 clock_t
 time_mult (void)
 {
-  clock_t   t1;
-  int       x, y;
-  mp_int    a, b, c;
+  clock_t t1;
+  int     x, y;
+  mp_int  a, b, c;

  mp_init (&a);
  mp_init (&b);
  mp_init (&c);

  t1 = clock ();
-  for (x = 8; x <= 128; x += 8) {
-    for (y = 0; y < 1000; y++) {
-      mp_rand (&a, x);
-      mp_rand (&b, x);
+  for (x = 4; x <= 128; x += 4) {
+    mp_rand (&a, x);
+    mp_rand (&b, x);
+    for (y = 0; y < 10000; y++) {
      mp_mul (&a, &b, &c);
    }
  }
@ -33,17 +33,17 @@ time_mult (void)
 clock_t
 time_sqr (void)
 {
-  clock_t   t1;
-  int       x, y;
-  mp_int    a, b;
+  clock_t t1;
+  int     x, y;
+  mp_int  a, b;

  mp_init (&a);
  mp_init (&b);

  t1 = clock ();
-  for (x = 8; x <= 128; x += 8) {
-    for (y = 0; y < 1000; y++) {
-      mp_rand (&a, x);
+  for (x = 4; x <= 128; x += 4) {
+    mp_rand (&a, x);
+    for (y = 0; y < 10000; y++) {
      mp_sqr (&a, &b);
    }
  }
@ -52,20 +52,54 @@ time_sqr (void)
  return clock () - t1;
 }

+clock_t
+time_expt (void)
+{
+  clock_t t1;
+  int     x, y;
+  mp_int  a, b, c, d;
+
+  mp_init (&a);
+  mp_init (&b);
+  mp_init (&c);
+  mp_init (&d);
+
+  t1 = clock ();
+  for (x = 4; x <= 128; x += 4) {
+    mp_rand (&a, x);
+    mp_rand (&b, x);
+    mp_rand (&c, x);
+    if (mp_iseven (&c) != 0) {
+      mp_add_d (&c, 1, &c);
+    }
+    for (y = 0; y < 10; y++) {
+      mp_exptmod (&a, &b, &c, &d);
+    }
+  }
+  mp_clear (&d);
+  mp_clear (&c);
+  mp_clear (&b);
+  mp_clear (&a);
+
+  return clock () - t1;
+}
+
 int
 main (void)
 {
-  int       best_mult, best_square;
-  clock_t   best, ti;
+  int     best_mult, best_square, best_exptmod;
+  clock_t best, ti;
+  FILE   *log;

-  best_mult = best_square = 0;
+  best_mult = best_square = best_exptmod = 0;

  /* tune multiplication first */
+  log = fopen ("mult.log", "w");
  best = CLOCKS_PER_SEC * 1000;
-  for (KARATSUBA_MUL_CUTOFF = 8; KARATSUBA_MUL_CUTOFF <= 128;
-       KARATSUBA_MUL_CUTOFF++) {
+  for (KARATSUBA_MUL_CUTOFF = 8; KARATSUBA_MUL_CUTOFF <= 128; KARATSUBA_MUL_CUTOFF++) {
    ti = time_mult ();
    printf ("%4d : %9lu\r", KARATSUBA_MUL_CUTOFF, ti);
+    fprintf (log, "%d, %lu\n", KARATSUBA_MUL_CUTOFF, ti);
    fflush (stdout);
    if (ti < best) {
      printf ("New best: %lu, %d         \n", ti, KARATSUBA_MUL_CUTOFF);
@ -73,13 +107,15 @@ main (void)
      best_mult = KARATSUBA_MUL_CUTOFF;
    }
  }
+  fclose (log);

  /* tune squaring */
+  log = fopen ("sqr.log", "w");
  best = CLOCKS_PER_SEC * 1000;
-  for (KARATSUBA_SQR_CUTOFF = 8; KARATSUBA_SQR_CUTOFF <= 128;
-       KARATSUBA_SQR_CUTOFF++) {
+  for (KARATSUBA_SQR_CUTOFF = 8; KARATSUBA_SQR_CUTOFF <= 128; KARATSUBA_SQR_CUTOFF++) {
    ti = time_sqr ();
    printf ("%4d : %9lu\r", KARATSUBA_SQR_CUTOFF, ti);
+    fprintf (log, "%d, %lu\n", KARATSUBA_SQR_CUTOFF, ti);
    fflush (stdout);
    if (ti < best) {
      printf ("New best: %lu, %d         \n", ti, KARATSUBA_SQR_CUTOFF);
@ -87,10 +123,30 @@ main (void)
      best_square = KARATSUBA_SQR_CUTOFF;
    }
  }
+  fclose (log);
+
+  /* tune exptmod */
+  KARATSUBA_MUL_CUTOFF = best_mult;
+  KARATSUBA_SQR_CUTOFF = best_square;
+
+  log = fopen ("expt.log", "w");
+  best = CLOCKS_PER_SEC * 1000;
+  for (MONTGOMERY_EXPT_CUTOFF = 8; MONTGOMERY_EXPT_CUTOFF <= 192; MONTGOMERY_EXPT_CUTOFF++) {
+    ti = time_expt ();
+    printf ("%4d : %9lu\r", MONTGOMERY_EXPT_CUTOFF, ti);
+    fflush (stdout);
+    fprintf (log, "%d, %lu\n", MONTGOMERY_EXPT_CUTOFF, ti);
+    if (ti < best) {
+      printf ("New best: %lu, %d\n", ti, MONTGOMERY_EXPT_CUTOFF);
+      best = ti;
+      best_exptmod = MONTGOMERY_EXPT_CUTOFF;
+    }
+  }
+  fclose (log);

  printf
-    ("\n\n\nKaratsuba Multiplier Cutoff: %d\nKaratsuba Squaring Cutoff: %d\n",
-     best_mult, best_square);
+    ("\n\n\nKaratsuba Multiplier Cutoff: %d\nKaratsuba Squaring Cutoff: %d\nMontgomery exptmod Cutoff: %d\n",
+     best_mult, best_square, best_exptmod);

  return 0;
 }
--- a/4
+++ b/4
@ -1,6 +1,6 @@
 CFLAGS  +=  -I./ -Wall -W -Wshadow -O3 -fomit-frame-pointer -funroll-loops

-VERSION=0.13
+VERSION=0.14

 default: libtommath.a

@ -60,7 +60,7 @@ docs:	docdvi
 	rm -f bn.log bn.aux bn.dvi
 	
 clean:
-	rm -f *.pdf *.o *.a *.exe etclib/*.o demo/demo.o test ltmtest mpitest mtest/mtest mtest/mtest.exe \
+	rm -f *.pdf *.o *.a *.obj *.lib *.exe etclib/*.o demo/demo.o test ltmtest mpitest mtest/mtest mtest/mtest.exe \
        bn.log bn.aux bn.dvi *.log *.s mpi.c 
 	cd etc ; make clean

--- a/makefile.msvc
+++ b/makefile.msvc
@ -0,0 +1,26 @@
+#MSVC Makefile
+#
+#Tom St Denis
+
+CFLAGS = /I. /Ogityb2 /Gs /DWIN32 /W3
+
+default: library
+
+OBJECTS=bncore.obj bn_mp_init.obj bn_mp_clear.obj bn_mp_exch.obj bn_mp_grow.obj bn_mp_shrink.obj \
+bn_mp_clamp.obj bn_mp_zero.obj  bn_mp_set.obj bn_mp_set_int.obj bn_mp_init_size.obj bn_mp_copy.obj \
+bn_mp_init_copy.obj bn_mp_abs.obj bn_mp_neg.obj bn_mp_cmp_mag.obj bn_mp_cmp.obj bn_mp_cmp_d.obj \
+bn_mp_rshd.obj bn_mp_lshd.obj bn_mp_mod_2d.obj bn_mp_div_2d.obj bn_mp_mul_2d.obj bn_mp_div_2.obj \
+bn_mp_mul_2.obj bn_s_mp_add.obj bn_s_mp_sub.obj bn_fast_s_mp_mul_digs.obj bn_s_mp_mul_digs.obj \
+bn_fast_s_mp_mul_high_digs.obj bn_s_mp_mul_high_digs.obj bn_fast_s_mp_sqr.obj bn_s_mp_sqr.obj \
+bn_mp_add.obj bn_mp_sub.obj bn_mp_karatsuba_mul.obj bn_mp_mul.obj bn_mp_karatsuba_sqr.obj \
+bn_mp_sqr.obj bn_mp_div.obj bn_mp_mod.obj bn_mp_add_d.obj bn_mp_sub_d.obj bn_mp_mul_d.obj \
+bn_mp_div_d.obj bn_mp_mod_d.obj bn_mp_expt_d.obj bn_mp_addmod.obj bn_mp_submod.obj \
+bn_mp_mulmod.obj bn_mp_sqrmod.obj bn_mp_gcd.obj bn_mp_lcm.obj bn_fast_mp_invmod.obj bn_mp_invmod.obj \
+bn_mp_reduce.obj bn_mp_montgomery_setup.obj bn_fast_mp_montgomery_reduce.obj bn_mp_montgomery_reduce.obj \
+bn_mp_exptmod_fast.obj bn_mp_exptmod.obj bn_mp_2expt.obj bn_mp_n_root.obj bn_mp_jacobi.obj bn_reverse.obj \
+bn_mp_count_bits.obj bn_mp_read_unsigned_bin.obj bn_mp_read_signed_bin.obj bn_mp_to_unsigned_bin.obj \
+bn_mp_to_signed_bin.obj bn_mp_unsigned_bin_size.obj bn_mp_signed_bin_size.obj bn_radix.obj \
+bn_mp_xor.obj bn_mp_and.obj bn_mp_or.obj bn_mp_rand.obj bn_mp_montgomery_calc_normalization.obj
+
+library: $(OBJECTS)
+	lib /out:tommath.lib $(OBJECTS)
--- a/mtest/mtest.c
+++ b/mtest/mtest.c
@ -41,7 +41,7 @@ void rand_num(mp_int *a)
   unsigned char buf[512];

 top:
-   size = 1 + ((fgetc(rng)*fgetc(rng)) % 96);
+   size = 1 + ((fgetc(rng)*fgetc(rng)) % 512);
   buf[0] = (fgetc(rng)&1)?1:0;
   fread(buf+1, 1, size, rng);
   for (n = 0; n < size; n++) {
@ -57,7 +57,7 @@ void rand_num2(mp_int *a)
   unsigned char buf[512];

 top:
-   size = 1 + ((fgetc(rng)*fgetc(rng)) % 96);
+   size = 1 + ((fgetc(rng)*fgetc(rng)) % 512);
   buf[0] = (fgetc(rng)&1)?1:0;
   fread(buf+1, 1, size, rng);
   for (n = 0; n < size; n++) {
@ -72,6 +72,8 @@ int main(void)
   int n;
   mp_int a, b, c, d, e;
   char buf[4096];
+   
+   static int tests[] = { 11, 12 };

   mp_init(&a);
   mp_init(&b);
@ -89,7 +91,7 @@ int main(void)
   }

   for (;;) {
-       n = 4; // fgetc(rng) % 11;
+       n =  fgetc(rng) % 13;

   if (n == 0) {
       /* add tests */
@ -235,7 +237,24 @@ int main(void)
      printf("%s\n", buf);      
      mp_todecimal(&c, buf);
      printf("%s\n", buf);      
-   } 
+   } else if (n == 11) {
+      rand_num(&a);
+      mp_mul_2(&a, &a);
+      mp_div_2(&a, &b);
+      printf("div2\n");
+      mp_todecimal(&a, buf);
+      printf("%s\n", buf);      
+      mp_todecimal(&b, buf);
+      printf("%s\n", buf);
+   } else if (n == 12) {
+      rand_num2(&a);
+      mp_mul_2(&a, &b);
+      printf("mul2\n");
+      mp_todecimal(&a, buf);
+      printf("%s\n", buf);      
+      mp_todecimal(&b, buf);
+      printf("%s\n", buf);
+   }
   }
   fclose(rng);
   return 0;
--- a/timings.txt
+++ b/timings.txt
@ -1,36 +0,0 @@
-CLOCKS_PER_SEC == 1000
-Adding           128-bit =>  14534883/sec,       688 ticks
-Adding           256-bit =>  11037527/sec,       906 ticks
-Adding           512-bit =>   8650519/sec,      1156 ticks
-Adding          1024-bit =>   5871990/sec,      1703 ticks
-Adding          2048-bit =>   3575259/sec,      2797 ticks
-Adding          4096-bit =>   2018978/sec,      4953 ticks
-Subtracting      128-bit =>  11025358/sec,       907 ticks
-Subtracting      256-bit =>   9149130/sec,      1093 ticks
-Subtracting      512-bit =>   7440476/sec,      1344 ticks
-Subtracting     1024-bit =>   5078720/sec,      1969 ticks
-Subtracting     2048-bit =>   3168567/sec,      3156 ticks
-Subtracting     4096-bit =>   1833852/sec,      5453 ticks
-Squaring         128-bit =>   3205128/sec,        78 ticks
-Squaring         256-bit =>   1592356/sec,       157 ticks
-Squaring         512-bit =>    696378/sec,       359 ticks
-Squaring        1024-bit =>    266808/sec,       937 ticks
-Squaring        2048-bit =>     85999/sec,      2907 ticks
-Squaring        4096-bit =>     21949/sec,     11390 ticks
-Multiplying      128-bit =>   3205128/sec,        78 ticks
-Multiplying      256-bit =>   1592356/sec,       157 ticks
-Multiplying      512-bit =>    615763/sec,       406 ticks
-Multiplying     1024-bit =>    192752/sec,      1297 ticks
-Multiplying     2048-bit =>     53510/sec,      4672 ticks
-Multiplying     4096-bit =>     14801/sec,     16890 ticks
-Exponentiating   513-bit =>       531/sec,        47 ticks
-Exponentiating   769-bit =>       177/sec,       141 ticks
-Exponentiating  1025-bit =>        88/sec,       282 ticks
-Exponentiating  2049-bit =>        13/sec,      1890 ticks
-Exponentiating  2561-bit =>         6/sec,      3812 ticks
-Exponentiating  3073-bit =>         4/sec,      6031 ticks
-Exponentiating  4097-bit =>         1/sec,     12843 ticks
-Inverting mod    128-bit =>     19160/sec,      5219 ticks
-Inverting mod    256-bit =>      8290/sec,     12062 ticks
-Inverting mod    512-bit =>      3565/sec,     28047 ticks
-Inverting mod   1024-bit =>      1305/sec,     76594 ticks
--- a/timings2.txt
+++ b/timings2.txt
@ -1,36 +0,0 @@
-CLOCKS_PER_SEC == 1000
-Adding           128-bit =>  15600624/sec,       641 ticks
-Adding           256-bit =>  12804097/sec,       781 ticks
-Adding           512-bit =>  10000000/sec,      1000 ticks
-Adding          1024-bit =>   7032348/sec,      1422 ticks
-Adding          2048-bit =>   4076640/sec,      2453 ticks
-Adding          4096-bit =>   2424242/sec,      4125 ticks
-Subtracting      128-bit =>  10845986/sec,       922 ticks
-Subtracting      256-bit =>   9416195/sec,      1062 ticks
-Subtracting      512-bit =>   7710100/sec,      1297 ticks
-Subtracting     1024-bit =>   5159958/sec,      1938 ticks
-Subtracting     2048-bit =>   3299241/sec,      3031 ticks
-Subtracting     4096-bit =>   1987676/sec,      5031 ticks
-Squaring         128-bit =>   3205128/sec,        78 ticks
-Squaring         256-bit =>   1592356/sec,       157 ticks
-Squaring         512-bit =>    696378/sec,       359 ticks
-Squaring        1024-bit =>    266524/sec,       938 ticks
-Squaring        2048-bit =>     86505/sec,      2890 ticks
-Squaring        4096-bit =>     22471/sec,     11125 ticks
-Multiplying      128-bit =>   3205128/sec,        78 ticks
-Multiplying      256-bit =>   1592356/sec,       157 ticks
-Multiplying      512-bit =>    615763/sec,       406 ticks
-Multiplying     1024-bit =>    190548/sec,      1312 ticks
-Multiplying     2048-bit =>     54418/sec,      4594 ticks
-Multiplying     4096-bit =>     14897/sec,     16781 ticks
-Exponentiating   513-bit =>       531/sec,        47 ticks
-Exponentiating   769-bit =>       177/sec,       141 ticks
-Exponentiating  1025-bit =>        84/sec,       297 ticks
-Exponentiating  2049-bit =>        13/sec,      1875 ticks
-Exponentiating  2561-bit =>         6/sec,      3766 ticks
-Exponentiating  3073-bit =>         4/sec,      6000 ticks
-Exponentiating  4097-bit =>         1/sec,     12750 ticks
-Inverting mod    128-bit =>     17301/sec,       578 ticks
-Inverting mod    256-bit =>      8103/sec,      1234 ticks
-Inverting mod    512-bit =>      3422/sec,      2922 ticks
-Inverting mod   1024-bit =>      1330/sec,      7516 ticks
--- a/timings3.txt
+++ b/timings3.txt
@ -1,5 +0,0 @@
-Exponentiating   513-bit =>       531/sec,        94 ticks
-Exponentiating   769-bit =>       187/sec,       266 ticks
-Exponentiating  1025-bit =>        88/sec,       562 ticks
-Exponentiating  2049-bit =>        13/sec,      3719 ticks
-
--- a/tommath.h
+++ b/tommath.h
@ -10,7 +10,7 @@
 * The library is free for all purposes without any express
 * guarantee it works.
 *
- * Tom St Denis, tomstdenis@iahu.ca, http://libtommath.iahu.ca
+ * Tom St Denis, tomstdenis@iahu.ca, http://math.libtomcrypt.org
 */
 #ifndef BN_H_
 #define BN_H_