Optimisation Techniques - Reducing the Recursion Cycle Time

5.2 Reducing the Recursion Cycle Time

5.2.1 Optimisation Techniques

The strategy of the optimisation techniques is identical to the strategy of the techniques described in Chapter 3: The complexity of the intermedi-ate operation is reduced by means of precomputations that only have to be performed once for each change of the value of modulus. Since several consecutive modular multiplications are performed using the same value of modulus, the additional time for performing the precomputation becomes negligible. Furthermore,parallel computations are utilised. It turns out that the computation of consecutive intermediate operations can be overlapped and, therefore, that Montgomery’s multiplication can be eﬃciently computed on a pipelined hardware architecture. Appendix D describes three optimisa-tion techniques that, to some extent, can be applied independently of each other:

1. It is possible toavoid the multiplication operation in the quotient deter-minationby adjusting the value of the modulus. A similar optimisation technique for the traditional modular multiplication operation was pre-sented in Subsection 3.10.4, where the aim was to obtain a modulus value that is close to a power of two, i.e. the most signiﬁcant digits have a ﬁxed known value after the adjustment. In Montgomery’s mul-tiplication operation it is desirable to obtain a ﬁxed known value of the least signiﬁcant digitof modulus. Hereby, also the value ofm mod 2^k,

5.2. REDUCING THE RECURSION CYCLE TIME 179 used in the quotient determination (5.3) becomes ﬁxed. In Appendix D it is shown how to adjust the value of the modulus such thatm mod 2^k becomes equal to 1, i.e. the multiplication operation in the quotient determination is avoided. The new modulus value, say m, is obtained through a simple scaling of m,

m =c·m, where c=m mod 2^k. (5.4) The penalty for replacing the value of modulus with m is a larger residue range [0; 2m[ of the results of Montgomery’s multiplication.

However, in applications like modular exponentiation, where several intermediate multiplications are performed, the additional time for con-verting the ﬁnal result into the residue range [0;m[ is vanishing. Since the scaling constant cbelongs to {1,3, . . . ,2^k⁻¹} (note that m is odd when m is odd) the residue range of the intermediate products will, at worst, be [0; 2(2^k−1)m[. This implies that the required number of re-cursion cycles n in Montgomery’s multiplication method may increase by one. As previously mentioned, the number of recursion cycles is constrained by 2^kn >4m.

2. It is possible to avoid the addition operation in the quotient determj-natjon by scaling the value of the multiplicand. An analogous optimi-sation technique for the traditional modular multiplication operation was described in Section 3.5, where the aim was to achieve a higher degree of parallelism in the computation. The means was an implicit scaling of the multiplier a with the constant 2^k. Since a multiplica-tion by 2^k is obtained through a left-shift, the scaling is implicit. In Montgomery’s multiplication operation it is advantageous to implicitly scale themultiplicand bwith the constant 2^k. Hereby, the value ofaib, whereb denotes the scaled multiplicand, in the quotient determination (5.3) becomes divisible by 2^k and, consequently, the addition operation in the quotient determination operation is avoided,

((Si+aib)·m) mod 2^k= (Si·m) mod 2^k, whereb mod 2^k = 0.

In order to compensate for the multiplicand scaling, the number of recursion cycles in Montgomery’s multiplication method is increased by one,

(ai·b)r⁻¹ ≡m (a·2^kb)r⁻⁽ⁿ⁺¹⁾, where r⁻¹·2^k≡m 1.

This penalty is identical to the penalty of using the implicit scaling technique on the traditional modular multiplication operation.

When b is scaled, the updating statement in Algorithm 5.1–1 can be formulated asSi+1 := (Si+qim) div 2^k+aib. Hence, there is no explicit reference to the scaled multiplicand b.

3. It is possible toparallelise the computation of consecutive intermediate operations. The idea of performing an overlapped computation of the intermediate operation is introduced by Shand and Vuillemin in [SV93], where the technique is denoted “quotient pipelining”. The basic idea is to postpone the use of the quotient digit qi, computed on basis of information available in the ith recursion cycle, by d recursion cycles.

Hence, the term qim does not appear in the updating statement until the (i + d)th recursion cycle. Hereby, it is possible to overlap the computation of d consecutive intermediate operations by means of a pipelined computation scheme.

In Appendix D the quotient pipelining technique has been combined with the above described optimisation techniques. It is shown how to obtain a particular simple intermediate operation, where the quotient determination is a simple inspection of the least signiﬁcant digit of the intermediate operand, qi := Si mod 2^k, and the updating statement reduces to a shift-and-add operation, Si+1 := Si div 2^k +Ti−d. The termTi−d is the sum of the multiplesqi−dmˆ and aib, where ˆm is a pre-computed operand. Both multiples are based on information available after the (i −d)th recursion cycle. Hence, the computation of Ti−d

can be performed by an architecture that has been pipelined into d stages. The operand ˆm is precomputed each time the value of modulus is changed. It is equal to (m + 1) div 2^k(d+1), where m is the result of a scaling of m similar to (5.4),

m =c·m, where c=m mod 2^k(d+1). (5.5) Since the quotient determination reduces to a simple inspection of the least signiﬁcant digit ofSi, the quotient determination requires no cir-cuitry and, hence, the area contribution from this operation is zero.

5.2. REDUCING THE RECURSION CYCLE TIME 181 Compared to the quotient determination method used in the exponen-tiation processor described in Chapter 4 this is a signiﬁcant improve-ment. The processor’s quotient determination unit had a signiﬁcant contribution to the total area and, furthermore, it had a signiﬁcant contribution to the recursion cycle time. In fact, the area of the proces-sor’s quotient determination unit increases by a rate of 2^k for increasing radix 2^k values.

In Appendix D an example hardware architecture for the computation of a radix 2⁸Montgomery multiplication is discussed. It is seen that the recursion cycle time can be reduced to the time for a single shift-and-add operation, which is eﬃciently performed by a redundant shift-and-adder.

Furthermore, it is seen that the recursion cycle time is independent of the chosen radix. This property makes Montgomery multiplication superior to the traditional modular multiplication methods. According to Section 3.11 the recursion cycle time for the traditional modular multiplication methods can be expected to increase by a rate of log₂ k for increasing radix 2^k values. Hence, a signiﬁcant step toward an eﬃcient utilisation of high radices in modular multiplication has been achieved.

There is, however, a penalty imposed by using the quotient pipelining technique:

• First, the number of recursion cycles increases with d since the use of quotient digitqi is postponed for d recursion cycles.

• Second, because of the d pipeline buﬀers in the hardware archi-tecture, the result of the multiplication process is further delayed by d recursion cycles. If the input operands for the next multi-plication is known before the result of the present multimulti-plication is completed, it is possible to start up the next multiplication as soon as the ﬁrst stage in the pipeline has completed the present computation. Hence, it may be possible to overlap two consecu-tive modular multiplications and, hence, to avoid the penalty of d extra recursion cycles per multiplication. The required prop-erty is, however, dependent on the application. So, the actual scheduling of the multiplications may be improved after a further investigation of the application.

• Third, the implicit scaling of the multiplicand requires an addi-tional recursion cycle.

• Finally, the residue range of the product increases to [0; 2m[, which at worst is [0; 2(2^k(d+1)−1)m[. So, due to the constraint 4m < 2^kn, the number n of basic recursion cycles increases by up tod cycles compared to the original radix 2^kversion of Montgomery’s method (5.2). Since m is odd, it cannot be a power of 2. Therefore, the constraint 4m < 2^kn is obeyed by the following value of n:

n =

In total, including all the additional penalty terms, the worst case num-ber of recursion cycles for a radix 2^k version of the optimised Mont-gomery multiplication Therefore, the delay parameterdshould be chosen as small as possible.

In Appendix D it is indicated how the term Ti−d can be computed by means of a pipelined Wallace Tree, see Subsection 3.2.2, containing about log₂ k stages. The time for each stage is smaller than or equal to a redundant 4–2 addition, which corresponds to the delay of two full adders.

It should be mentioned, that the resulting modular product of the optimised Montgomery multiplication is redundantly represented. So, the time for a conversion into non-redundant representation should be added to the computing time indicated by (5.6). This issue is discussed in Appendix D as well. As for the traditional modular multiplication methods, it is possible to avoid the conversion by allowing the input

5.3. ADDITIONAL PROCESSING 183

In document View of Exponentiation, Modular Multiplication and VLSI Implementation of High-Speed RSA Cryptography (Sider 192-197)