Additional Processing - View of Exponentiation, Modular Multiplication and VLSI Implementation

discussion concerning the traditional modular multiplication methods in Subsection 3.7.2.

5.3 Additional Processing

In applications, like large operand modular exponentiation, where several intermediate modular multiplications are performed, the time for perform-ing the additional processperform-ing required by Montgomery’s method is usually negligible. The additional processing can be divided into two groups:

Pre-processing. A series of computations must be performed before the Montgomery multiplications are initiated. The computations may be seen as a part of the initialisation process of the application. For a modular exponentiation, the pre-processing consists of the following tasks:

• Compute the neutral,rⁿmodm, for Montgomery’s multiplication composition.

• Compute r²ⁿ mod m, which is used in the transformation of operands into the M-residue domain.

• Computem mod 2^k(d+1), the multiplicative inverse of−mmodulo 2^k(d+1).

• Compute ˆmto be used in the updating statements. The expression for ˆm is (m + 1) div 2^k(d+1), where the scaled modulus m equals (m mod 2^k(d+1))· m.

Some authors have a quit eﬃcient solution to the pre-processing: In applications like the RSA crypto system the value ofm is a part of the key. Therefore, it is known prior to the application and, consequently, it can beassumedthat the “key” include the value ofmplus the deriva-tive values rⁿ mod m, r²ⁿ mod m and ˆm. Of course these additional values can be computed simultaneously with generating the RSA keys.

However, it cannot be expected that all of the equipment for perform-ing the encryption and decryption in a large (public) communication network are based on this particular implementation of the modular multiplications. There may very well be other implementations that

uses a diﬀerent set of derivatives. So, in a concrete application it can-not be assumed that these derivatives are part of the input.

Post-processing. After the ﬁnal multiplication, and transformation from the M-residue range, a result in the residue range [0; 2m[ has been obtained. Hence, a ﬁnal conversion into the residue range [0;m[ is required. A similar conversion is required for the traditional modular multiplication methods.

Regarding the computing time, none of the above computations are partic-ularly complex in comparison to a series of modular exponentiations. Of greater concern is the required circuitry of a hardware implementation that supports the additional processing. Hence, it is a challenging exercise to ﬁgure out how the additional processing may be performed by the existing hardware architecture without too many modiﬁcations and too much ad-ditional circuitry. This thesis does not include a solution to the exercise.

However, some ideas and hints are provided:

• According to Kornerup [Kor94b], the modular reduction technique used by Montgomery is similar to a method, proposed by Hensel [Hen08] in 1908, for computation of the multiplicative inverse of an odd number modulo 2ⁿ Indeed, Kornerup shows how a slightly modiﬁed version of Montgomery’s multiplication method can be applied for computing the value ofm mod 2^k(d+1).

• When the scaling constantc=m mod 2^k(d+1) is computed, the scaled modulus m =c·m may be computed by the existing hardware archi-tecture. Finally, the operand ˆm= (m + 1) div 2^k(d+1) is computed.

• It is not necessary to perform a complete reduction modulo m of rⁿ and r²ⁿ since the multiplication method accepts residues in the range [0; 2m[.

• Assume that rⁿ mod m has been computed. Then, it is possible to compute r²ⁿ by an exponentiation process, where the exponent is set equal to kn, and the base is set equal to 2rⁿ (mod m): Observe that 2rⁿ (mod m) in the range [0; 2 m[ can be obtained by a left-shift of rⁿ modm. Furthermore, observe that 2r ⁿ (modm) is the M-residue of 2.

Therefore, the exponentiation in the domain of M-residues will result

5.4. SUMMARY AND DISCUSSION 185 in the residue 2^knrⁿ (mod m) in the range [0; 2 m[. Hence, since r is equal to 2^k, the desired residue r²ⁿ (mod m) has been computed.

• It is possible to convert the ﬁnal result, say z, from the residue range [0; 2m[ into the residue range [0; 2m[ by means of an exact division by the scaling constantc, wherecis given by (5.5). Kornerup [Kor94b] has showed how to perform exact divisions by a slightly modiﬁed version of Montgomery’s multiplication method.

To be able to use the exact division technique, the operation used for transforming an operand from the M-residue domain, sayy, is modiﬁed (See Figure 5.1, page 174): Instead of computingz =y∗^M_Z_m1, the scaled result z = y∗^M_Z_m c is computed. Then from the congruence z ≡ c·z (mod c·m) it follows that z div c ≡ z (mod m). Furthermore, since z ∈ [0; 2c·m[, it follows that z ∈ [0; 2m[. Hence, a conversion into the residue range [0; 2m[ has been obtained by an exact division that may be computed by a slightly modiﬁed version of the Montgomery multiplication method.

5.4 Summary and Discussion

In this chapter, another representation of residues, denoted M-residues after Montgomery, has been discussed. The advantage of this representation is a more eﬃcient modular multiplication method, where the quotient determina-tion operadetermina-tion is simpler than the corresponding operadetermina-tion for the tradidetermina-tional modular multiplication methods.

It has been discussed how a combination of two optimisation techniques,

“adjustment of the modulus” and “shifting of the multiplicand”, leads to a particular simple quotient determination method, where a quotient digit is obtained through a simple inspection. Hence, there is no computation as-sociated with the operation and, consequently, no circuitry is required for the operation. This implies that the critical operation of the inter-mediate operation is the formation of the multiples qim and aib. Moreover, it has been discussed how a further combination with the “quotient pipelining”

technique, invented by Shand and Vuillemin [SV93], allows a pipelined com-putation of these multiples. The net eﬀect of these optimisations is a modular multiplication method, where the recursion cycle time is independent of the radix of the multiplication method.

Indeed, the recursion cycle time can be made as short as the delay of a single redundant addition. This is a signiﬁcant improvement of the re-cursion cycle time of the traditional modular multiplication methods (and the non-optimised Montgomery multiplication method, as well): According to Chapter 3 the recursion cycle time includes the time for determining a quotient digitqi, the time for computing the multiple qim, and the time for subtractingqim from the intermediate result. Compared to the new method for computing a Montgomery multiplication, this represents an overhead cor-responding to the time for a quotient determination plus the time for com-puting a multiple. In fact, this overhead contributes signiﬁcantly to the total recursion cycle time: Even though the time for the quotient determination may be reduced by adjusting the range of the modulus, see Subsection 3.10.4 the time for computingqimwill be longer than or equal to the delay of log₂ k redundant additions, where the radix is expressed as 2^k. Indeed, since the quotient digit range for the traditional methods are wider than the range for the optimised Montgomery method, the overhead will be longer than the delay of log₂k redundant additions. Hence,the recursion cycle time has been reduced by more than a factor log₂ k.

The aim of using high radices in the modular multiplication methods is to reduce the number of recursion cycles. In an ideal radix 2^k modular multiplication method the number of recursion cycles decreases by a factor of k, and the recursion cycle time is independent on the chosen radix. So, the computing time for an ideal high radix method decreases by a factor of k. This chapter has described a method with properties that approaches the ideal properties: The recursion cycle time is independent of the radix.

However, the optimisation techniques add a penalty term, 3(log₂ k+ 1), to the ideal number of recursion cycles. It might be possible, through a more careful scheduling of the computation on the pipelined architecture, to reduce this penalty term. Compared to the computing time for an ideal high radix method, the computing time for the non-optimised Montgomery method, and for the traditional methods in Chapter 3, has a penaltyfactor proportional to log₂ k. Hence, the optimisation techniques presented in this chapter can be viewed as a way of changing a multiplicative log₂ k time-penalty into an additive log₂ktime-penalty. It should be noted that the relative penalty for a givenk, i.e. fraction of penalty-cycles in the total number of recursion cycles, is decreasing for increasing operands. This means that the total computing time is relatively closer to the ideal computing time for e.g. 512 bit modular

In document View of Exponentiation, Modular Multiplication and VLSI Implementation of High-Speed RSA Cryptography (Sider 197-200)