• Ingen resultater fundet

single article [FDG90]. Suppose the parallel computation of the right-to-left method in Section 2.3 is utilised. Then, two basic operations of the form 2ky−qjm andx+aiy−qim, respectively, are computed in parallel. The first of these operations is seen to be identical to the above discussed modular k-fold doubling operation. So, unless some method, that is faster than the computation of 2ks+aib−qim, is proposed for the computation of 2ks− qim there will be no benefit of using the right-to-left modular multiplication method—not even if the parallel computation is utilised.

As seen from Equation (3.9) the left-to-right modular multiplication can be computed by n executions of the intermediate operation 2ks+aib−qim where n denotes the number of radix 2k digits required to represent the multiplier a. Assume / binary digits is sufficient to represent the value of a.

Then, compared to a radix 2 modular multiplication method, the required number of intermediate operations can be reduced to about k by using a radix 2k version of the operation. Indeed this observation is the motivation for considering the use of high radices: The aim of the high-radix modular multiplication approach is to obtain a faster computation by reducing the required number of intermediate operations. However, if the computing time for a single radix 2k version of the intermediate operation increases by a rate greater than or equal to k, there may be no point in using a high radix. The remainder of this chapter is devoted to efficient computation of intermediate operations of the form 2ks+aib−qim or of a form similar to this.

3.5 Utilisation of Parallel Computations

Using the notation introduced in Section 2.3 the left-to-right method in Equa-tion (3.9) can be described by a set of recursive equaEqua-tions,

(a·b) mod m m S0, where Si = Ri−qim, Sn= 0,

Ri = 2kSi+1+aib. (3.11) The value of Ri is the result of a “shift-and-add” operation and Si denotes the result of a modular reduction of Ri. The quotient digit qi must be de-termined from the the value of Ri and modulus m. In Figure 3.4 a slice of the dependence graph for this method is depicted. Note that the dependence graph is more “fine-grained” than the equations in (3.11), i.e. there is more

nodes in the graph than there is equations in (3.11): The black node cor-responds to an equation that expresses the determination of quotient digit qi and the shaded nodes correspond to equations that compute the multi-ples aib and qim. To keep the notation simple these equations have been left out. The dashed lines in the figure symbolise the start-point of a new cycle in the computation of the recursive equations. The computation of the nodes between two consecutive start-points is denoted the computation of a recursion cycle. As seen from the data dependencies the computing time for a single recursion cycle is determined by the time for adding the multiple aib to 2kSi+1, the time for determining a quotient digit qi, the time for com-puting the multiple qim, and the time for subtracting qim from Ri. Note, that the multipleaib can be computedin parallel with the other operations and, hence, the time for computing aib has no influence on the time for a recursion cycle. This is becauseai andb are part of the input to the modular multiplication, so the only demand is that the computation ofaib should by completed before it is needed in a recursion cycle.

Figure 3.4: Slice of dependence graph for the left-to-right method based on a recursive evaluation of Si = (2kSi+1+aib)−qim.

Now, it is possible to reduce the recursion cycle time by rearranging the computation of the values in (3.11). In Figure 3.4 it is seen that the determination of qi is delayed by the computation of Ri = 2kSi+1+aib. It is, however, not necessary to perform the addition of aib after the modular reduction of Ri+1: It may be done before the subtraction of qi+1m, and in parallel with the determination of qi+1 and the computation of multiple qi+1m. Another set of recursive equations that mirrors such a computation

3.5. UTILISATION OF PARALLEL COMPUTATIONS 81 can be developed by rewriting (3.11),

Ri = 2kSi+1+aib

= 2k(Ri+1−qi+1m) +aib

= (2kRi+1+aib)−2kqi+1m

= Ti2kqi+1m, where Ti = 2kRi+1+aib.

Hence, the computation of Ti can be performed in parallel with the compu-tation of 2kqi+1m. If the initial value Sn= 0 is replaced by the initial value Rn= 0,(3.11) can be written as,

(a·b) mod m m S0, where S0 = R0−q0m,

Ri = Ti2kqi+1m, Rn = 0, Ti = 2kRi+1+aib

Further, to simplify these equations the recursion depth is enlarged by one, which means i ∈ {−1,0, . . . , n 1}, and the multiplier digit a1 = 0 is introduced. This gives,

R1 =T1 2kq0m= (2kR0 +a1b)−2kq0m= 2k(R0−q0m).

Therefore, the result S0 can be expressed as R1 div 2k which is easily ob-tained by a right-shift of R1. Hereby, the rearrangement of (3.11) is com-pleted and the following set of recursive equations is obtained,

(a·b) mod m≡m R1, where Ri = Ti2kqi+1m, Rn = 0,

Ti = 2kRi+1+aib. a1 = 0 (3.12) The new dependence graph in Figure 3.5 shows how a parallel computation of Ti and of 2kqi+1m can be utilised to improve the computing time for a recursion cycle. Indeed, assuming that the most time consuming path in the figure is the path through the quotient determination node and the node computing the multiple 2kqi+1m, a recursion cycle time for modular multiplication that is comparable to the recursion cycle time of SRT division methods is achieved. As long as the computing times for determination of a quotient digit and for computation of a multiple 2kqi+1m have not be discussed, the recursion cycle times cannot be properly compared. However, this will be done in detail in the sections from 3.7 to 3.10. The computing

time for a SRT division recursion cycle turns out to be about equal to the computing time for a modular multiplication recursion cycle. According to (3.12) the cost of using this computation strategy is an additional recursion cycle. For large operands this additional cost is vanishing.

Figure 3.5: Slice of dependence graph for the left-to-right method based on a recursive evaluation of Ri = (2kRi+1+aib)−2kqi+1m.

It should be mentioned that the method illustrated by Figure 3.5 is similar to the method described in the article in Appendix A. This article describes a hardware architecture that, indeed, is a direct mapping of the five nodes in a single recursion cycle onto five processing elements. In Figure 3.6 the hard-ware architecture is shown. The VLSI processor described in Chapter 4 has a very similar hardware architecture. However, since the pipelined exponentia-tion method, see Secexponentia-tion 2.3.1, requires that two modular multiplicaexponentia-tions are performed simultaneously the architecture is pipelined. This implies that the computation of a recursion cycle is divided into two sub-computations and that a pipeline buffer is inserted somewhere in the architecture. Furthermore, the architecture of the VLSI processor has an additional multiplier register.

Assume a modular multiplication is based on (3.12) and assume the com-putation utilises the possibilities for improving the recursion cycle time by performing some of the operations in parallel. Then, as indicated by the dependence graph in Figure 3.5 and by the hardware architecture in Figure 3.6, the computing time is determined by the time for determining a quo-tient digit qi+1, the time for computing the multiple 2kqi+1m, and the time for subtracting this multiple fromTi. These three operations will be the issue of the discussion in the following sections.