Utilisation of Parallel Computations - View of Exponentiation, Modular Multiplication and VLSI

single article [FDG90]. Suppose the parallel computation of the right-to-left method in Section 2.3 is utilised. Then, two basic operations of the form 2^ky−qjm andx+aiy−qim, respectively, are computed in parallel. The ﬁrst of these operations is seen to be identical to the above discussed modular k-fold doubling operation. So, unless some method, that is faster than the computation of 2^ks+aib−qim, is proposed for the computation of 2^ks− qim there will be no beneﬁt of using the right-to-left modular multiplication method—not even if the parallel computation is utilised.

As seen from Equation (3.9) the left-to-right modular multiplication can be computed by n executions of the intermediate operation 2^ks+aib−qim where n denotes the number of radix 2^k digits required to represent the multiplier a. Assume / binary digits is suﬃcient to represent the value of a.

Then, compared to a radix 2 modular multiplication method, the required number of intermediate operations can be reduced to about _k by using a radix 2^k version of the operation. Indeed this observation is the motivation for considering the use of high radices: The aim of the high-radix modular multiplication approach is to obtain a faster computation by reducing the required number of intermediate operations. However, if the computing time for a single radix 2^k version of the intermediate operation increases by a rate greater than or equal to k, there may be no point in using a high radix. The remainder of this chapter is devoted to eﬃcient computation of intermediate operations of the form 2^ks+aib−qim or of a form similar to this.

3.5 Utilisation of Parallel Computations

Using the notation introduced in Section 2.3 the left-to-right method in Equa-tion (3.9) can be described by a set of recursive equaEqua-tions,

(a·b) mod m ≡m S0, where Si = Ri−qim, Sn= 0,

Ri = 2^kSi+1+aib. (3.11) The value of Ri is the result of a “shift-and-add” operation and Si denotes the result of a modular reduction of Ri. The quotient digit qi must be de-termined from the the value of Ri and modulus m. In Figure 3.4 a slice of the dependence graph for this method is depicted. Note that the dependence graph is more “ﬁne-grained” than the equations in (3.11), i.e. there is more

nodes in the graph than there is equations in (3.11): The black node cor-responds to an equation that expresses the determination of quotient digit qi and the shaded nodes correspond to equations that compute the multi-ples aib and qim. To keep the notation simple these equations have been left out. The dashed lines in the ﬁgure symbolise the start-point of a new cycle in the computation of the recursive equations. The computation of the nodes between two consecutive start-points is denoted the computation of a recursion cycle. As seen from the data dependencies the computing time for a single recursion cycle is determined by the time for adding the multiple aib to 2^kSi+1, the time for determining a quotient digit qi, the time for com-puting the multiple qim, and the time for subtracting qim from Ri. Note, that the multipleaib can be computedin parallel with the other operations and, hence, the time for computing aib has no inﬂuence on the time for a recursion cycle. This is becauseai andb are part of the input to the modular multiplication, so the only demand is that the computation ofaib should by completed before it is needed in a recursion cycle.

Figure 3.4: Slice of dependence graph for the left-to-right method based on a recursive evaluation of Si = (2^kSi+1+aib)−qim.

Now, it is possible to reduce the recursion cycle time by rearranging the computation of the values in (3.11). In Figure 3.4 it is seen that the determination of qi is delayed by the computation of Ri = 2^kSi+1+aib. It is, however, not necessary to perform the addition of aib after the modular reduction of Ri+1: It may be done before the subtraction of qi+1m, and in parallel with the determination of qi+1 and the computation of multiple qi+1m. Another set of recursive equations that mirrors such a computation

3.5. UTILISATION OF PARALLEL COMPUTATIONS 81 can be developed by rewriting (3.11),

Ri = 2^kSi+1+aib

= 2^k(Ri+1−qi+1m) +aib

= (2^kRi+1+aib)−2^kqi+1m

= Ti−2^kqi+1m, where Ti = 2^kRi+1+aib.

Hence, the computation of Ti can be performed in parallel with the compu-tation of 2^kqi+1m. If the initial value Sn= 0 is replaced by the initial value Rn= 0,(3.11) can be written as,

(a·b) mod m ≡m S0, where S0 = R0−q0m,

Ri = Ti−2^kqi+1m, Rn = 0, Ti = 2^kRi+1+aib

Further, to simplify these equations the recursion depth is enlarged by one, which means i ∈ {−1,0, . . . , n− 1}, and the multiplier digit a₋1 = 0 is introduced. This gives,

R1 =T1 −2^kq0m= (2^kR0 +a₋1b)−2^kq0m= 2^k(R0−q0m).

Therefore, the result S0 can be expressed as R₋1 div 2^k which is easily ob-tained by a right-shift of R₋1. Hereby, the rearrangement of (3.11) is com-pleted and the following set of recursive equations is obtained,

(a·b) mod m≡m R₋1, where Ri = Ti−2^kqi+1m, Rn = 0,

Ti = 2^kRi+1+aib. a₋1 = 0 (3.12) The new dependence graph in Figure 3.5 shows how a parallel computation of Ti and of 2^kqi+1m can be utilised to improve the computing time for a recursion cycle. Indeed, assuming that the most time consuming path in the ﬁgure is the path through the quotient determination node and the node computing the multiple 2^kqi+1m, a recursion cycle time for modular multiplication that is comparable to the recursion cycle time of SRT division methods is achieved. As long as the computing times for determination of a quotient digit and for computation of a multiple 2^kqi+1m have not be discussed, the recursion cycle times cannot be properly compared. However, this will be done in detail in the sections from 3.7 to 3.10. The computing

time for a SRT division recursion cycle turns out to be about equal to the computing time for a modular multiplication recursion cycle. According to (3.12) the cost of using this computation strategy is an additional recursion cycle. For large operands this additional cost is vanishing.

Figure 3.5: Slice of dependence graph for the left-to-right method based on a recursive evaluation of Ri = (2^kRi+1+aib)−2^kqi+1m.

It should be mentioned that the method illustrated by Figure 3.5 is similar to the method described in the article in Appendix A. This article describes a hardware architecture that, indeed, is a direct mapping of the ﬁve nodes in a single recursion cycle onto ﬁve processing elements. In Figure 3.6 the hard-ware architecture is shown. The VLSI processor described in Chapter 4 has a very similar hardware architecture. However, since the pipelined exponentia-tion method, see Secexponentia-tion 2.3.1, requires that two modular multiplicaexponentia-tions are performed simultaneously the architecture is pipelined. This implies that the computation of a recursion cycle is divided into two sub-computations and that a pipeline buﬀer is inserted somewhere in the architecture. Furthermore, the architecture of the VLSI processor has an additional multiplier register.

Assume a modular multiplication is based on (3.12) and assume the com-putation utilises the possibilities for improving the recursion cycle time by performing some of the operations in parallel. Then, as indicated by the dependence graph in Figure 3.5 and by the hardware architecture in Figure 3.6, the computing time is determined by the time for determining a quo-tient digit qi+1, the time for computing the multiple 2^kqi+1m, and the time for subtracting this multiple fromTi. These three operations will be the issue of the discussion in the following sections.

In document View of Exponentiation, Modular Multiplication and VLSI Implementation of High-Speed RSA Cryptography (Sider 93-97)