
2.3 Parallel Computation

2.3.1 Pipelined Computation

For the parallel computation on two processing elements the reduced computing time is obtained at the cost of an extra processing element. Now, as observed in Appendix A and B, it is also possible to perform the parallel computation on a single pipelined processing element. Then the overhead in hardware is reduced to the extra registers that implement the pipeline buffers. In general, to implement a two-stage pipelined computation of an operation denoted by the function $f$, it must be possible to decompose $f$ into functions $f_1$ and $f_2$ such that $f = f_2 \circ f_1$, i.e. $f$ is equal to the composition of $f_1$ and $f_2$. The efficiency of the pipelined computation is highly dependent on how the decomposition is done: First, the computing times for $f_1$ and $f_2$ should be balanced, and each should be about half of the computing time for $f$. This ensures that neither $f_1$ nor $f_2$ is a bottleneck in the pipeline and that the throughput of the pipeline is about twice the throughput of a non-pipelined computation. Second, the required number of registers for buffering the result from $f_1$ should be as small as possible.
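As a small software illustration of these criteria, the sketch below runs a stream of inputs through a balanced two-stage pipeline. The stage functions $f_1$ and $f_2$ here are arbitrary placeholders, not the multiplication stages discussed later; the point is only that, after a one-step fill, one complete result is produced per stage-time instead of per operation-time.

```python
# Minimal sketch of a two-stage pipelined computation of f = f2 . f1.
# The stage functions are hypothetical placeholders standing in for
# the two halves of a decomposed operation.

def f1(x):
    return x + 1              # hypothetical first stage

def f2(x):
    return 2 * x              # hypothetical second stage

def pipelined(stream):
    """Run a stream of inputs through the two stages.

    In every time step one f1 and one f2 operation execute in
    parallel; a single buffer register holds the intermediate
    result, mirroring the pipeline buffer discussed above.
    """
    buffer = None             # the pipeline buffer between f1 and f2
    results = []
    for x in stream + [None]: # one extra step drains the pipeline
        if buffer is not None:
            results.append(f2(buffer))               # finish previous input
        buffer = f1(x) if x is not None else None    # start new input
    return results

# k inputs finish after k + 1 steps instead of 2k sequential steps.
assert pipelined([1, 2, 3]) == [f2(f1(x)) for x in [1, 2, 3]]
```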

Figure 2.5: Dependence graph for pipelined right-to-left binary method.

In the following, a pipelined computation of the binary right-to-left method will be developed. The main operation in Equations (2.23) is the multiplication operation. Assume that the multiplication operation is decomposed into the operations denoted by $f_1$ and $f_2$, such that the composition of these gives

$$f_2 \circ f_1(X, Y, e) = X \ast (Y)^e =
\begin{cases}
X \ast Y & \text{if } e = 1\\
X & \text{if } e = 0
\end{cases} \qquad (2.24)$$
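The text leaves the concrete choice of $f_1$ and $f_2$ open. As a minimal sketch, one can let $f_1$ be the operand-selection step that resolves $(Y)^e$ to either $Y$ or 1, and let $f_2$ be the multiplication that consumes the buffered operand pair. This split is an illustration only, not the decomposition used in the implementation of Chapter 4, and the modulus is an arbitrary example value.

```python
# One possible decomposition of X * (Y)**e into f2 . f1.
# Here f1 merely selects the operand according to the exponent
# bit e, and f2 performs the (modular) multiplication.

N = 1009   # example modulus, chosen arbitrarily

def f1(X, Y, e):
    """Stage 1: resolve (Y)**e to a concrete operand; buffer the pair."""
    return (X, Y if e == 1 else 1)

def f2(buffered):
    """Stage 2: complete the multiplication X * Y' mod N."""
    X, Y_prime = buffered
    return (X * Y_prime) % N

# f2(f1(X, Y, e)) == X * Y**e mod N, as in Equation (2.24).
assert f2(f1(3, 5, 1)) == 15
assert f2(f1(3, 5, 0)) == 3
```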

Then, if this decomposition substitutes the multiplication operation in Equations (2.23), the following set of recursive equations describes a pipelined right-to-left binary method:

$$b^e = X_{n-1} \quad \text{where} \quad
\begin{aligned}
X_{i+1} &= f_2 \circ f_1(X_i, Y_{i+1}, e_{i+1}), & X_{-1} &= 1,\\
Y_{i+1} &= f_2 \circ f_1(Y_i, Y_i, 1), & Y_0 &= b
\end{aligned} \qquad (2.25)$$
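Before any scheduling concerns, Equations (2.25) can be transcribed directly into a sequential reference routine. The sketch below inlines the composed operation $f_2 \circ f_1$ as a conditional modular multiplication; the modulus and test values are arbitrary examples.

```python
def rl_binary_exp(b, e_bits, N):
    """Right-to-left binary method, transcribed from Equations (2.25).

    e_bits = [e_0, e_1, ..., e_{n-1}], least significant bit first.
    Returns b**e mod N.
    """
    X = 1        # X_{-1} = 1
    Y = b % N    # Y_0  = b
    for e_i in e_bits:
        X = (X * Y) % N if e_i else X   # X_{i+1} = f2 . f1(X_i, Y_{i+1}, e_{i+1})
        Y = (Y * Y) % N                 # Y_{i+1} = f2 . f1(Y_i, Y_i, 1)
    return X     # = X_{n-1} = b**e mod N

N = 1009
e = 0b101101
bits = [(e >> i) & 1 for i in range(e.bit_length())]
assert rl_binary_exp(7, bits, N) == pow(7, e, N)
```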

The dependence graph for these equations is shown in Figure 2.5. In this graph an unshaded node corresponds to one of the $f_1$ operations and a shaded node corresponds to one of the $f_2$ operations. The new dependence graph can be viewed as a modification of the old graph in Figure 2.3 where each old node is expanded into two new nodes. A multiplication now takes two operations. However, if it is assumed that the time for performing an $f_1$ or an $f_2$ operation is half of the time for performing a multiplication, then the computing time for a multiplication is unchanged. The idea behind the pipelined computation becomes clear when the dependence graph is mapped onto two processing elements, where one processing element performs $f_1$ operations and the other processing element performs $f_2$ operations.

The expected total hardware consumption for these two processing elements is about the same as the hardware consumption for a single processing element that is capable of performing a complete multiplication operation.

Hence, the new hardware architecture contains two processing elements that together form a pipelined version of one of the processing elements used in the previous hardware architecture in Figure 2.4.

A computation schedule is given by the equitemporal hyperplanes in Figure 2.5. The schedule has been chosen such that only one $f_1$ operation and only one $f_2$ operation is performed at the same time. The latency for the proposed pipelined computation is $2n + 1$ time units. This corresponds to the time for performing $n + \frac{1}{2}$ multiplications. The small increase in latency is common for all pipelined computations: A small overhead due to the buffering in the pipeline will appear, and the overhead, measured in time units, will usually be equal to the number of inserted buffers. In this case the pipeline has two stages and a single buffer is inserted between the stages. Regarding the throughput, it is obvious from the dependence graph that time step number 0 in the next computation can be overlapped with the current time step number $2n$. This results in a throughput of one exponential per $2n$ time units. Furthermore, a repetition of the discussion on page 42, of how the computation of $X_0$ can be replaced by a simple initialisation, will show that it is also possible to achieve a throughput of one exponential per $2(n-1)$ time units when the computation is pipelined.
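With one time unit equal to the time of an $f_1$ or $f_2$ operation, i.e. half a multiplication time, these figures can be summarised as

$$\begin{aligned}
\text{latency} &= (2n + 1)\ \text{time units} = \bigl(n + \tfrac{1}{2}\bigr)\ \text{multiplication times},\\
\text{throughput} &= \text{one exponential per } 2n\ \text{time units},\\
&\phantom{=}\ \text{or per } 2(n-1)\ \text{time units with the initialisation of } X_0.
\end{aligned}$$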

In Figure 2.6 the pipelined hardware architecture is shown together with an algorithmic description of the computation schedule and two invariants.

Figure 2.6: Pipelined computation of right-to-left binary method.

The hardware architecture is depicted in two states: The upper state corresponds to one of the even numbered time steps in Figure 2.5 and the lower state corresponds to one of the odd numbered time steps. There are three registers in the hardware architecture. Register $B$ is the pipeline buffer. It contains the result from an $f_1$ operation. The actual size of register $B$ depends on how $f_1$ is chosen. Register $R$ alternately holds the value $Y_i$ and the value $X_{i-1}$. Finally, register $Y$ is used for saving the value $Y_i$ during a period of two time units. As seen in Figure 2.5, $Y_i$ is needed in two consecutive time steps. The algorithm consists of an initialisation and a loop. Each cycle in the loop is divided into two phases: The first phase corresponds to an even numbered time step in Figure 2.5 and the second phase corresponds to an odd numbered time step. So, a cycle in the loop describes the computation of two consecutive time steps. According to Figure 2.5, register $R$ must be initialised to $b$ prior to the first phase of the first cycle. This is done before entering the loop. Furthermore, register $Y$ and register $R$ must be initialised to $b$ and 1, respectively, prior to the second phase of the first cycle. Since $Y$ is not altered during the first phase, $Y$ can also be initialised before the loop is entered. The initialisation of $R$ prior to the second phase can be done by initialising register $B$ to the value $f_1(1, 1, 0)$ prior to the first phase.

Then the computation in the first phase will result in an assignment of the value $f_2(f_1(1, 1, 0)) = 1$ to register $R$. Of course, the initialisation of register $R$ could also have been carried out by a simple assignment in a conditional control structure, e.g. if $i = 0$ then $R := 1$ else $R := f_2(B)$, during the first phase. The time step numbering in Figure 2.5 relates closely to $i$ in Figure 2.6. The even numbered time step $2i$ corresponds to the first phase of the $i$th cycle and the odd numbered time step $2i + 1$ corresponds to the second phase of the $i$th cycle.
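The register transfers just described can be checked with a short simulation. The sketch below assumes the illustrative $f_1$/$f_2$ split from the example after Equation (2.24) and the register names $B$, $R$ and $Y$ of Figure 2.6; the exact phase assignments are a plausible reading of the schedule, not a transcript of the original algorithm.

```python
def pipelined_rl_exp(b, e_bits, N):
    """Simulate the register-level schedule of Figure 2.6.

    Each loop cycle has two phases (an even and an odd time step
    of Figure 2.5).  B is the pipeline buffer, R alternately holds
    Y_i and X_{i-1}, and Y saves Y_i across the two phases.
    """
    def f1(X, Yp, e):                 # stage 1: operand selection (sketch)
        return (X, Yp if e else 1)

    def f2(buf):                      # stage 2: the actual multiplication
        return (buf[0] * buf[1]) % N

    R = b % N            # R := b  (holds Y_0 before the first phase)
    Y = b % N            # Y := b  (also set before entering the loop)
    B = f1(1, 1, 0)      # B := f1(1,1,0): the first phase then puts
                         # f2(B) = 1 (that is, X_{-1}) into R
    for e_i in e_bits:
        # First phase (even step 2i): start squaring Y_i, save Y_i,
        # and finish the pending X-multiplication.
        B, Y, R = f1(R, R, 1), R, f2(B)
        # Second phase (odd step 2i+1): start the X-update with bit
        # e_i, and finish the squaring (R becomes Y_{i+1}).
        B, R = f1(R, Y, e_i), f2(B)
    return f2(B)         # final step completes X_{n-1} = b**e

N = 1009
e = 0b110101
bits = [(e >> i) & 1 for i in range(e.bit_length())]
assert pipelined_rl_exp(3, bits, N) == pow(3, e, N)
```

Note how each phase issues one $f_1$ and one $f_2$ operation, and how the final $f_2(B)$ after the loop accounts for the extra time unit in the latency of $2n + 1$.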

The hardware requirement for the pipelined right-to-left binary computation corresponds to one multiplication unit, two registers and a pipeline buffer. Except for the buffer this is equal to the hardware requirement for the sequential right-to-left binary method. In Chapter 4 a VLSI implementation of a processor that can compute modular exponentials is described. The processor uses a pipelined right-to-left binary computation method for the exponentiation. The size of the pipeline buffer in this implementation is equal to the size of the other registers. So, at least when the multiplication operation is defined as modular multiplication, it is reasonable to count the buffer as a register. Hence, at the cost of a single additional register it is possible to halve the worst case computing time for the sequential binary method. The pipelined binary computation is superior to Thurber's modification, both with respect to computing time and with respect to hardware consumption. It can be concluded that it is better to use additional registers for a pipelined computation than to use registers for a table of precomputed values. However, as described by Brickell et al. in [BGMW92], it is possible to obtain very fast computation of exponentials by using a large table of precomputed values under the assumption that the computing time for this table can be disregarded. Indeed, in applications where the base is fixed over many consecutive exponentiations with varying exponents the table is computed only once.
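For comparison, the fixed-base idea can be illustrated in its simplest form: precompute the table of squarings $b^{2^i}$ once, after which each exponentiation costs only the conditional multiplications. This is a simplified sketch, not the actual scheme of [BGMW92].

```python
def precompute_table(b, n, N):
    """Precompute b**(2**i) mod N for i = 0..n-1; done once per base."""
    table = [b % N]
    for _ in range(n - 1):
        table.append((table[-1] * table[-1]) % N)
    return table

def fixed_base_exp(table, e_bits, N):
    """Exponentiate using only the multiplications selected by the
    exponent bits; all squarings were paid for in the precomputation."""
    X = 1
    for e_i, Y_i in zip(e_bits, table):
        if e_i:
            X = (X * Y_i) % N
    return X

N, b = 1009, 7
e = 0b101101
bits = [(e >> i) & 1 for i in range(e.bit_length())]
table = precompute_table(b, len(bits), N)
assert fixed_base_exp(table, bits, N) == pow(b, e, N)
```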

Shand and Vuillemin [SV93] have analysed the pipelined binary right-to-left method. The analysis is based on the description of the method in Appendix B. Apparently the description is unclear, because Shand and Vuillemin assume that the computation is scheduled on a single processing element by time-multiplexing. In Figure 2.7 the dependence graph from Figure 2.3 is shown with a time-multiplexed computation schedule. It is seen that the computation can be scheduled on a single processing element that alternately executes an $X_i \ast (Y_{i+1})^{e_{i+1}}$ operation and a $Y_i \ast Y_i$ operation.

This is what is meant by a time-multiplexed computation schedule. The computing time is $2n - 1$ time units, where a time unit refers to the time for performing a complete multiplication, and the computing time is equal to the worst case time for the sequential binary method. Shand and Vuillemin observe that in the average case the sequential binary method is 33 percent faster than a time-multiplexed computation of the parallel binary method.

Of course the time-multiplexed computation is slow. It corresponds to a sequential computation where all multiplications of the type $X_i \ast (Y_{i+1})^{e_{i+1}}$ are executed, even when $e_{i+1}$ is zero. In fact, the pipelined right-to-left binary computation is 33 percent faster than the average case for the sequential binary computation.
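Both 33 percent figures are consistent with a simple average-case operation count, measured in multiplication times, under the usual assumption that about half of the exponent bits are zero:

$$T_{\text{mux}} \approx 2n, \qquad
T_{\text{seq}} \approx n + \tfrac{n}{2} = \tfrac{3}{2}n, \qquad
T_{\text{pipe}} \approx n,$$

so the sequential method runs at $\tfrac{4}{3}$ the rate of the time-multiplexed one, and the pipelined computation needs only $\tfrac{2}{3}$ of the sequential average time; in both cases a difference of about 33 percent.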

Figure 2.7: Dependence graph for right-to-left binary method. Time-multiplexed computation schedule.