• Ingen resultater fundet

5.2 Reducing the Recursion Cycle Time

5.2.1 Optimisation Techniques

The strategy of the optimisation techniques is identical to the strategy of the techniques described in Chapter 3: The complexity of the intermedi-ate operation is reduced by means of precomputations that only have to be performed once for each change of the value of modulus. Since several consecutive modular multiplications are performed using the same value of modulus, the additional time for performing the precomputation becomes negligible. Furthermore,parallel computations are utilised. It turns out that the computation of consecutive intermediate operations can be overlapped and, therefore, that Montgomery’s multiplication can be efficiently computed on a pipelined hardware architecture. Appendix D describes three optimisa-tion techniques that, to some extent, can be applied independently of each other:

1. It is possible toavoid the multiplication operation in the quotient deter-minationby adjusting the value of the modulus. A similar optimisation technique for the traditional modular multiplication operation was pre-sented in Subsection 3.10.4, where the aim was to obtain a modulus value that is close to a power of two, i.e. the most significant digits have a fixed known value after the adjustment. In Montgomery’s mul-tiplication operation it is desirable to obtain a fixed known value of the least significant digitof modulus. Hereby, also the value ofm mod 2k,

5.2. REDUCING THE RECURSION CYCLE TIME 179 used in the quotient determination (5.3) becomes fixed. In Appendix D it is shown how to adjust the value of the modulus such thatm mod 2k becomes equal to 1, i.e. the multiplication operation in the quotient determination is avoided. The new modulus value, say m, is obtained through a simple scaling of m,

m =c·m, where c=m mod 2k. (5.4) The penalty for replacing the value of modulus with m is a larger residue range [0; 2m[ of the results of Montgomery’s multiplication.

However, in applications like modular exponentiation, where several intermediate multiplications are performed, the additional time for con-verting the final result into the residue range [0;m[ is vanishing. Since the scaling constant cbelongs to {1,3, . . . ,2k1} (note that m is odd when m is odd) the residue range of the intermediate products will, at worst, be [0; 2(2k1)m[. This implies that the required number of re-cursion cycles n in Montgomery’s multiplication method may increase by one. As previously mentioned, the number of recursion cycles is constrained by 2kn >4m.

2. It is possible to avoid the addition operation in the quotient determj-natjon by scaling the value of the multiplicand. An analogous optimi-sation technique for the traditional modular multiplication operation was described in Section 3.5, where the aim was to achieve a higher degree of parallelism in the computation. The means was an implicit scaling of the multiplier a with the constant 2k. Since a multiplica-tion by 2k is obtained through a left-shift, the scaling is implicit. In Montgomery’s multiplication operation it is advantageous to implicitly scale themultiplicand bwith the constant 2k. Hereby, the value ofaib, whereb denotes the scaled multiplicand, in the quotient determination (5.3) becomes divisible by 2k and, consequently, the addition operation in the quotient determination operation is avoided,

((Si+aib)·m) mod 2k= (Si·m) mod 2k, whereb mod 2k = 0.

In order to compensate for the multiplicand scaling, the number of recursion cycles in Montgomery’s multiplication method is increased by one,

(ai·b)r1 m (a·2kb)r(n+1), where r1·2km 1.

This penalty is identical to the penalty of using the implicit scaling technique on the traditional modular multiplication operation.

When b is scaled, the updating statement in Algorithm 5.1–1 can be formulated asSi+1 := (Si+qim) div 2k+aib. Hence, there is no explicit reference to the scaled multiplicand b.

3. It is possible toparallelise the computation of consecutive intermediate operations. The idea of performing an overlapped computation of the intermediate operation is introduced by Shand and Vuillemin in [SV93], where the technique is denoted “quotient pipelining”. The basic idea is to postpone the use of the quotient digit qi, computed on basis of information available in the ith recursion cycle, by d recursion cycles.

Hence, the term qim does not appear in the updating statement until the (i + d)th recursion cycle. Hereby, it is possible to overlap the computation of d consecutive intermediate operations by means of a pipelined computation scheme.

In Appendix D the quotient pipelining technique has been combined with the above described optimisation techniques. It is shown how to obtain a particular simple intermediate operation, where the quotient determination is a simple inspection of the least significant digit of the intermediate operand, qi := Si mod 2k, and the updating statement reduces to a shift-and-add operation, Si+1 := Si div 2k +Tid. The termTid is the sum of the multiplesqidmˆ and aib, where ˆm is a pre-computed operand. Both multiples are based on information available after the (i −d)th recursion cycle. Hence, the computation of Tid

can be performed by an architecture that has been pipelined into d stages. The operand ˆm is precomputed each time the value of modulus is changed. It is equal to (m + 1) div 2k(d+1), where m is the result of a scaling of m similar to (5.4),

m =c·m, where c=m mod 2k(d+1). (5.5) Since the quotient determination reduces to a simple inspection of the least significant digit ofSi, the quotient determination requires no cir-cuitry and, hence, the area contribution from this operation is zero.

5.2. REDUCING THE RECURSION CYCLE TIME 181 Compared to the quotient determination method used in the exponen-tiation processor described in Chapter 4 this is a significant improve-ment. The processor’s quotient determination unit had a significant contribution to the total area and, furthermore, it had a significant contribution to the recursion cycle time. In fact, the area of the proces-sor’s quotient determination unit increases by a rate of 2k for increasing radix 2k values.

In Appendix D an example hardware architecture for the computation of a radix 28Montgomery multiplication is discussed. It is seen that the recursion cycle time can be reduced to the time for a single shift-and-add operation, which is efficiently performed by a redundant shift-and-adder.

Furthermore, it is seen that the recursion cycle time is independent of the chosen radix. This property makes Montgomery multiplication superior to the traditional modular multiplication methods. According to Section 3.11 the recursion cycle time for the traditional modular multiplication methods can be expected to increase by a rate of log2 k for increasing radix 2k values. Hence, a significant step toward an efficient utilisation of high radices in modular multiplication has been achieved.

There is, however, a penalty imposed by using the quotient pipelining technique:

First, the number of recursion cycles increases with d since the use of quotient digitqi is postponed for d recursion cycles.

Second, because of the d pipeline buffers in the hardware archi-tecture, the result of the multiplication process is further delayed by d recursion cycles. If the input operands for the next multi-plication is known before the result of the present multimulti-plication is completed, it is possible to start up the next multiplication as soon as the first stage in the pipeline has completed the present computation. Hence, it may be possible to overlap two consecu-tive modular multiplications and, hence, to avoid the penalty of d extra recursion cycles per multiplication. The required prop-erty is, however, dependent on the application. So, the actual scheduling of the multiplications may be improved after a further investigation of the application.

Third, the implicit scaling of the multiplicand requires an addi-tional recursion cycle.

Finally, the residue range of the product increases to [0; 2m[, which at worst is [0; 2(2k(d+1)1)m[. So, due to the constraint 4m < 2kn, the number n of basic recursion cycles increases by up tod cycles compared to the original radix 2kversion of Montgomery’s method (5.2). Since m is odd, it cannot be a power of 2. Therefore, the constraint 4m < 2kn is obeyed by the following value of n:

n =

In total, including all the additional penalty terms, the worst case num-ber of recursion cycles for a radix 2k version of the optimised Mont-gomery multiplication Therefore, the delay parameterdshould be chosen as small as possible.

In Appendix D it is indicated how the term Tid can be computed by means of a pipelined Wallace Tree, see Subsection 3.2.2, containing about log2 k stages. The time for each stage is smaller than or equal to a redundant 4–2 addition, which corresponds to the delay of two full adders.

It should be mentioned, that the resulting modular product of the optimised Montgomery multiplication is redundantly represented. So, the time for a conversion into non-redundant representation should be added to the computing time indicated by (5.6). This issue is discussed in Appendix D as well. As for the traditional modular multiplication methods, it is possible to avoid the conversion by allowing the input

5.3. ADDITIONAL PROCESSING 183