• Ingen resultater fundet

discussion concerning the traditional modular multiplication methods in Subsection 3.7.2.

5.3 Additional Processing

In applications, like large operand modular exponentiation, where several intermediate modular multiplications are performed, the time for perform-ing the additional processperform-ing required by Montgomery’s method is usually negligible. The additional processing can be divided into two groups:

Pre-processing. A series of computations must be performed before the Montgomery multiplications are initiated. The computations may be seen as a part of the initialisation process of the application. For a modular exponentiation, the pre-processing consists of the following tasks:

Compute the neutral,rnmodm, for Montgomery’s multiplication composition.

Compute r2n mod m, which is used in the transformation of operands into the M-residue domain.

Computem mod 2k(d+1), the multiplicative inverse of−mmodulo 2k(d+1).

Compute ˆmto be used in the updating statements. The expression for ˆm is (m + 1) div 2k(d+1), where the scaled modulus m equals (m mod 2k(d+1))· m.

Some authors have a quit efficient solution to the pre-processing: In applications like the RSA crypto system the value ofm is a part of the key. Therefore, it is known prior to the application and, consequently, it can beassumedthat the “key” include the value ofmplus the deriva-tive values rn mod m, r2n mod m and ˆm. Of course these additional values can be computed simultaneously with generating the RSA keys.

However, it cannot be expected that all of the equipment for perform-ing the encryption and decryption in a large (public) communication network are based on this particular implementation of the modular multiplications. There may very well be other implementations that

uses a different set of derivatives. So, in a concrete application it can-not be assumed that these derivatives are part of the input.

Post-processing. After the final multiplication, and transformation from the M-residue range, a result in the residue range [0; 2m[ has been obtained. Hence, a final conversion into the residue range [0;m[ is required. A similar conversion is required for the traditional modular multiplication methods.

Regarding the computing time, none of the above computations are partic-ularly complex in comparison to a series of modular exponentiations. Of greater concern is the required circuitry of a hardware implementation that supports the additional processing. Hence, it is a challenging exercise to figure out how the additional processing may be performed by the existing hardware architecture without too many modifications and too much ad-ditional circuitry. This thesis does not include a solution to the exercise.

However, some ideas and hints are provided:

According to Kornerup [Kor94b], the modular reduction technique used by Montgomery is similar to a method, proposed by Hensel [Hen08] in 1908, for computation of the multiplicative inverse of an odd number modulo 2n Indeed, Kornerup shows how a slightly modified version of Montgomery’s multiplication method can be applied for computing the value ofm mod 2k(d+1).

When the scaling constantc=m mod 2k(d+1) is computed, the scaled modulus m =c·m may be computed by the existing hardware archi-tecture. Finally, the operand ˆm= (m + 1) div 2k(d+1) is computed.

It is not necessary to perform a complete reduction modulo m of rn and r2n since the multiplication method accepts residues in the range [0; 2m[.

Assume that rn mod m has been computed. Then, it is possible to compute r2n by an exponentiation process, where the exponent is set equal to kn, and the base is set equal to 2rn (mod m): Observe that 2rn (mod m) in the range [0; 2 m[ can be obtained by a left-shift of rn modm. Furthermore, observe that 2r n (modm) is the M-residue of 2.

Therefore, the exponentiation in the domain of M-residues will result

5.4. SUMMARY AND DISCUSSION 185 in the residue 2knrn (mod m) in the range [0; 2 m[. Hence, since r is equal to 2k, the desired residue r2n (mod m) has been computed.

It is possible to convert the final result, say z, from the residue range [0; 2m[ into the residue range [0; 2m[ by means of an exact division by the scaling constantc, wherecis given by (5.5). Kornerup [Kor94b] has showed how to perform exact divisions by a slightly modified version of Montgomery’s multiplication method.

To be able to use the exact division technique, the operation used for transforming an operand from the M-residue domain, sayy, is modified (See Figure 5.1, page 174): Instead of computingz =y∗MZm1, the scaled result z = y∗MZm c is computed. Then from the congruence z c·z (mod c·m) it follows that z div c z (mod m). Furthermore, since z [0; 2c·m[, it follows that z [0; 2m[. Hence, a conversion into the residue range [0; 2m[ has been obtained by an exact division that may be computed by a slightly modified version of the Montgomery multiplication method.

5.4 Summary and Discussion

In this chapter, another representation of residues, denoted M-residues after Montgomery, has been discussed. The advantage of this representation is a more efficient modular multiplication method, where the quotient determina-tion operadetermina-tion is simpler than the corresponding operadetermina-tion for the tradidetermina-tional modular multiplication methods.

It has been discussed how a combination of two optimisation techniques,

“adjustment of the modulus” and “shifting of the multiplicand”, leads to a particular simple quotient determination method, where a quotient digit is obtained through a simple inspection. Hence, there is no computation as-sociated with the operation and, consequently, no circuitry is required for the operation. This implies that the critical operation of the inter-mediate operation is the formation of the multiples qim and aib. Moreover, it has been discussed how a further combination with the “quotient pipelining”

technique, invented by Shand and Vuillemin [SV93], allows a pipelined com-putation of these multiples. The net effect of these optimisations is a modular multiplication method, where the recursion cycle time is independent of the radix of the multiplication method.

Indeed, the recursion cycle time can be made as short as the delay of a single redundant addition. This is a significant improvement of the re-cursion cycle time of the traditional modular multiplication methods (and the non-optimised Montgomery multiplication method, as well): According to Chapter 3 the recursion cycle time includes the time for determining a quotient digitqi, the time for computing the multiple qim, and the time for subtractingqim from the intermediate result. Compared to the new method for computing a Montgomery multiplication, this represents an overhead cor-responding to the time for a quotient determination plus the time for com-puting a multiple. In fact, this overhead contributes significantly to the total recursion cycle time: Even though the time for the quotient determination may be reduced by adjusting the range of the modulus, see Subsection 3.10.4 the time for computingqimwill be longer than or equal to the delay of log2 k redundant additions, where the radix is expressed as 2k. Indeed, since the quotient digit range for the traditional methods are wider than the range for the optimised Montgomery method, the overhead will be longer than the delay of log2k redundant additions. Hence,the recursion cycle time has been reduced by more than a factor log2 k.

The aim of using high radices in the modular multiplication methods is to reduce the number of recursion cycles. In an ideal radix 2k modular multiplication method the number of recursion cycles decreases by a factor of k, and the recursion cycle time is independent on the chosen radix. So, the computing time for an ideal high radix method decreases by a factor of k. This chapter has described a method with properties that approaches the ideal properties: The recursion cycle time is independent of the radix.

However, the optimisation techniques add a penalty term, 3(log2 k+ 1), to the ideal number of recursion cycles. It might be possible, through a more careful scheduling of the computation on the pipelined architecture, to reduce this penalty term. Compared to the computing time for an ideal high radix method, the computing time for the non-optimised Montgomery method, and for the traditional methods in Chapter 3, has a penaltyfactor proportional to log2 k. Hence, the optimisation techniques presented in this chapter can be viewed as a way of changing a multiplicative log2 k time-penalty into an additive log2ktime-penalty. It should be noted that the relative penalty for a givenk, i.e. fraction of penalty-cycles in the total number of recursion cycles, is decreasing for increasing operands. This means that the total computing time is relatively closer to the ideal computing time for e.g. 512 bit modular