• Ingen resultater fundet

4.2 Processor Description

4.2.2 Modular Multiplication Unit

The modular multiplication unit is implementing a radix 25left-to-right mod-ular multiplication method. The method is based on a recursive evaluation of expressions of the form Ri = (25Ri+1 +aib)−25qi+1m. As described in Section 3.5, it is possible to utilise a parallel computation strategy, where the evaluation ofTi = 25Ri+1+aib is overlapped with the evaluation of 25qi+1m.

Furthermore, in order to improve the quotient determination complexity, the modular multiplication unit utilises the scaling technique presented in Sec-tion 3.9. The scaling constant, by which the modulusm and the multipliera are scaled, is 2r = 25. In fact, the modular multiplication method is identical

4.2. PROCESSOR DESCRIPTION 139 to the method described by Algorithm 3.9–1 with the modification, that the resulting product is a residue in the non-symmetric range [0; 2m[. Finally, as mentioned above, the modular multiplication unit is pipelined. This means that two modular products are simultaneously computed. Since the multi-plicandband the modulusmare common operands for both multiplications, it is sufficient to implement a simultaneous evaluation of the expressions RYi =TiY 25qi+1Y 2rm andRXi =TiX25qi+1X 2rm, where TiY = 25RYi+1+aYi b and TiX = 25RXi+1+aXi b. The radix 25 digitsaXi and aYi denote the digits at position i of the scaled multiplier operands.

Figure 4.3: Hardware architecture of the modular multiplication unit.

The hardware architecture of the modular multiplication unit is illustra-ted in Figure 4.3. All of the internal connections are communicating data

represented in the redundant carry save form. These connections are sym-bolised by double lines in the figure. The external connections are connected to the registers of the modular exponentiation unit. The external connec-tions are communicating (non-redundant) binary represented data. Before a modular multiplication process is started, the external modulus register M and the multiplicand register B must be properly initialised. To compute residues modulom ofy·b andx·b,M must hold the value−25+rm=−210m andB must hold the valueb. During the modular multiplication process the contents of the multiplier registers X and Y are shifted, digit by digit from the most significant end, into the modular multiplication unit. As stated by the stimulus condition of Algorithm 3.9–1, a multiplier register, say X, must hold the scaled multiplier 2rx = aXn1aXn2. . . aX0 plus an additional digitaX1 = 0. Since 2r = 25 this implies that X must be initialised with the value 210x. Similarly, registerY must be initialised with the value 210y. The number of radix 25 digits held by a multiplier register is ni =n + 1, where n denotes the number of digits used for representing the scaled multiplier.

Since a multiplier may be a result from a previous modular multiplication, it is known to belong to [0; 2m[ and, therefore, it may need 562 bits to be binary represented. So, when scaled by 25, the number of bits increases to 567. This means the number of radix 25 digits of the scaled multiplier is n = 5675 = 114. Hence, the number of radix 25 digits in the multiplier registers is ni = 115.

The modular multiplication unit contains two pipelined units for redun-dant addition and two pipelined units for computation of multiples. Further-more, a binary adder, used for converting the carry save represented results into binary representation, is included. The units with a grey-shaded frame in Figure 4.3 are the pipelined units. They are pipelined into two stages and, hence, each unit contains a register implementing the pipeline buffer.

The multiple unit denoted aB computes multiples of the multiplicand.

The unit has three input operands: The multiplicandB, and the two multi-plier digitsaXi1 andaYi1. The actual multiplier digit used in the computation alternates between a digit from registerY and a digit from registerX. A mul-tiple is produced in each clock period. The sequence of computed mulmul-tiples can be expressed as

aY113B, aX113B, aY112B, aX112B, . . . , aY1B, aX1B, (4.1) Themultiple unit denoted qM performs a similar computation. It computes

4.2. PROCESSOR DESCRIPTION 141 multiples of the modulus value in register M. Instead of receiving the quo-tient digit qi+1 to be used in the computation, the unit receives a truncated version ˆRi+1 of the intermediate result Ri+1. Hence, in this multiple unit, circuitry for determination of the quotient digits is included. The input ˆR is equal to the 12 most significant carry save digits from position 561 to 572 of R, i.e. ˆR=r572r571. . . r561.

Figure 4.4: Hardware architecture of the redundant adder denoted T. The redundant adder denoted T is implementing the addition operation in the expressions Ti = 25Ri+1+aiB. Since both terms in this expression are carry save represented, the adder can be identified as a 4–2 adder (see Subsection 3.2.2). The hardware architecture of the unit is shown in Figure 4.4. The register, T, implementing the pipeline buffer is buffering the result from the redundant addition. Since the result is carry save represented, register T corresponds to two registers for holding binary represented data.

The redundant adder denoted R is implementing the addition operation in the expressions Ri =Ti+qi+1M. (Recall that registerM contains the value

210m. Hence, the subtraction is converted to an ordinary addition). The implementation of this adder is similar to the other redundant adder.

The binary adder is used for converting the final results, RY1 and RX1,

into non-redundant binary representation. Since the computing time for a binary addition is relatively large compared to the computing time for the other units, more than a single clock period is assigned for this operation.

The number of extra clock periods is denotednw, the number ofwait states.

The required number of wait states depends on the actual clocking frequency.

Therefore, the parameternw can be configured by the user. While the pro-cessor is waiting for the conversion to be completed, the contents of all the registers, used in the computation of modular products, remain unchanged.

The binary adder was generated by the chip development tools. According to the data book [Cas91a], this so-called high speed adder uses acarry select architecture with a Manchester carry chain. (These addition techniques are described in standard books on computer arithmetic and VLSI design, e.g.

[WE92, Chapter 8]).

The data connections in Figure 4.3 and in Figure 4.4 are annotated with the values computed by the various units at a certain instant of time. The annotation can be viewed as a snapshot of the internal state of the modular multiplication unit. The snapshot shows the internal state just after the evaluation ofRXi+1. Because of the pipelined architecture, all of the pipelined units produce an alternating sequence of results as illustrated by (4.1). In general, when a unit produces a result marked with Y, it simultaneously consumes inputs marked withX, and vice versa.

It should be mentioned, that the final results are left-shifted versions of the residues modulo m of x·b and y·b. The results can be expressed by RY1/210m y·band byRX1/210m x·b. According to the above discussion, the multiplier registersX and Y must be initialised with 210x and with 210y prior to each modular multiplication. Hence, the updating of these registers with a result from a previous modular multiplication can be achieved by a simple load of the valueR1.

The modular multiplication unit supports the conversion of a residue, say R, in the range [0; 2m[ into the range [0;m[. As mentioned in the previous subsection, such a conversion is required for the final result of a modular exponentiation. The conversion is performed by a subtraction of m, i.e. the operation R m, and an inspection of the resulting sign: First register B is loaded with the value R. Then, by enforcing certain values to the multiplier digits and quotient digits, the following computation implements

4.2. PROCESSOR DESCRIPTION 143 the subtraction using the existing hardware architecture:

R1 := 0;

R0 := (25R1+a0B) +q1M, wherea0 = 25 and q1 = 0;

R1 := (25R0+a1B) +q0M, wherea1 = 0 andq0 = 1;

By insertion of the digit values, the computation is seen to result in R1 = 210B +M, which equals the value 210(R −m). Finally, by means of the binary adder the sign of R−m is computed.