Modular Multiplication Unit - Processor Description

4.2 Processor Description

4.2.2 Modular Multiplication Unit

The modular multiplication unit implements a radix $2^5$ left-to-right modular multiplication method. The method is based on a recursive evaluation of expressions of the form $R_i = (2^5 R_{i+1} + a_i b) - 2^5 q_{i+1} m$. As described in Section 3.5, it is possible to utilise a parallel computation strategy, where the evaluation of $T_i = 2^5 R_{i+1} + a_i b$ is overlapped with the evaluation of $2^5 q_{i+1} m$.

Furthermore, in order to reduce the complexity of the quotient determination, the modular multiplication unit utilises the scaling technique presented in Section 3.9. The scaling constant, by which the modulus $m$ and the multiplier $a$ are scaled, is $2^r = 2^5$. In fact, the modular multiplication method is identical to the method described by Algorithm 3.9–1, with the modification that the resulting product is a residue in the non-symmetric range $[0; 2m[$.

Finally, as mentioned above, the modular multiplication unit is pipelined. This means that two modular products are computed simultaneously. Since the multiplicand $b$ and the modulus $m$ are common operands for both multiplications, it is sufficient to implement a simultaneous evaluation of the expressions $R^Y_i = T^Y_i - 2^5 q^Y_{i+1} 2^r m$ and $R^X_i = T^X_i - 2^5 q^X_{i+1} 2^r m$, where $T^Y_i = 2^5 R^Y_{i+1} + a^Y_i b$ and $T^X_i = 2^5 R^X_{i+1} + a^X_i b$. The radix $2^5$ digits $a^X_i$ and $a^Y_i$ denote the digits at position $i$ of the scaled multiplier operands.

Figure 4.3: Hardware architecture of the modular multiplication unit.

The hardware architecture of the modular multiplication unit is illustrated in Figure 4.3. All of the internal connections carry data represented in the redundant carry save form. These connections are symbolised by double lines in the figure. The external connections are connected to the registers of the modular exponentiation unit and carry (non-redundant) binary represented data.

Before a modular multiplication process is started, the external modulus register $M$ and the multiplicand register $B$ must be properly initialised. To compute residues modulo $m$ of $y \cdot b$ and $x \cdot b$, $M$ must hold the value $-2^{5+r} m = -2^{10} m$ and $B$ must hold the value $b$. During the modular multiplication process the contents of the multiplier registers $X$ and $Y$ are shifted, digit by digit from the most significant end, into the modular multiplication unit. As stated by the stimulus condition of Algorithm 3.9–1, a multiplier register, say $X$, must hold the scaled multiplier $2^r x = a^X_{n-1} a^X_{n-2} \ldots a^X_0$ plus an additional digit $a^X_{-1} = 0$. Since $2^r = 2^5$, this implies that $X$ must be initialised with the value $2^{10} x$. Similarly, register $Y$ must be initialised with the value $2^{10} y$. The number of radix $2^5$ digits held by a multiplier register is $n_i = n + 1$, where $n$ denotes the number of digits used for representing the scaled multiplier.

Since a multiplier may be the result of a previous modular multiplication, it is known to belong to $[0; 2m[$ and, therefore, it may need 562 bits to be binary represented. When scaled by $2^5$, the number of bits increases to 567. This means that the number of radix $2^5$ digits of the scaled multiplier is $n = \lceil 567/5 \rceil = 114$. Hence, the number of radix $2^5$ digits in the multiplier registers is $n_i = 115$.
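The digit count above can be reproduced with a few lines of Python. This is only an arithmetic sketch; the function name and its parameter names are hypothetical, and the default of 562 bits is taken from the text.

```python
def multiplier_digit_counts(residue_bits=562, scale_bits=5, digit_bits=5):
    """Return (bits of the scaled multiplier, n, n_i) as used in the text."""
    scaled_bits = residue_bits + scale_bits   # scaling by 2^5 adds 5 bits
    n = -(-scaled_bits // digit_bits)         # ceiling division: radix 2^5 digits
    n_i = n + 1                               # one additional digit a_{-1} = 0
    return scaled_bits, n, n_i


print(multiplier_digit_counts())
```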

The modular multiplication unit contains two pipelined units for redundant addition and two pipelined units for computation of multiples. Furthermore, a binary adder, used for converting the carry save represented results into binary representation, is included. The units with a grey-shaded frame in Figure 4.3 are the pipelined units. They are pipelined into two stages and, hence, each unit contains a register implementing the pipeline buffer.

The multiple unit denoted aB computes multiples of the multiplicand. The unit has three input operands: the multiplicand $B$ and the two multiplier digits $a^X_{i-1}$ and $a^Y_{i-1}$. The actual multiplier digit used in the computation alternates between a digit from register $Y$ and a digit from register $X$. A multiple is produced in each clock period. The sequence of computed multiples can be expressed as

$$a^Y_{113}B,\; a^X_{113}B,\; a^Y_{112}B,\; a^X_{112}B,\; \ldots,\; a^Y_{-1}B,\; a^X_{-1}B \tag{4.1}$$

The multiple unit denoted qM performs a similar computation. It computes multiples of the modulus value in register $M$. Instead of receiving the quotient digit $q_{i+1}$ to be used in the computation, the unit receives a truncated version $\hat{R}_{i+1}$ of the intermediate result $R_{i+1}$. Hence, circuitry for determination of the quotient digits is included in this multiple unit. The input $\hat{R}$ is equal to the 12 most significant carry save digits of $R$, from position 561 to 572, i.e. $\hat{R} = r_{572} r_{571} \ldots r_{561}$.
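The input streams of the two multiple units can be illustrated with a short Python sketch. The first function lists the order in which the multiplier digits of sequence (4.1) are consumed; the second extracts the digits at positions 561 to 572 from a carry save number. The function names are hypothetical, and modelling a carry save number as a pair of integers (a sum word and a carry word) is an assumption made for illustration, not a description of the actual circuit.

```python
def multiple_schedule(top_index=113, bottom_index=-1):
    """Yield (register, digit index) pairs in the order of sequence (4.1)."""
    for i in range(top_index, bottom_index - 1, -1):
        yield ("Y", i)   # the digit from register Y is used first
        yield ("X", i)   # then the digit at the same position from register X


def truncate_carry_save(sum_word, carry_word, low=561, high=572):
    """Return the carry save digits at positions low..high as (sum bits, carry bits)."""
    width = high - low + 1            # 12 digits for the positions named in the text
    mask = (1 << width) - 1
    return (sum_word >> low) & mask, (carry_word >> low) & mask


schedule = list(multiple_schedule())
print(len(schedule), schedule[:4], schedule[-2:])
```

With 115 digits per multiplier register, the schedule contains 230 entries, one per clock period.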

Figure 4.4: Hardware architecture of the redundant adder denoted T.

The redundant adder denoted T implements the addition operation in the expressions $T_i = 2^5 R_{i+1} + a_i B$. Since both terms in this expression are carry save represented, the adder can be identified as a 4–2 adder (see Subsection 3.2.2). The hardware architecture of the unit is shown in Figure 4.4. The register T, which implements the pipeline buffer, buffers the result of the redundant addition. Since the result is carry save represented, register T corresponds to two registers for holding binary represented data.
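The principle of a 4–2 addition can be sketched in Python on word level. The sketch assumes that the 4–2 adder is composed of two levels of 3–2 carry save compression and that each carry save operand is a pair of non-negative integers (sum word, carry word); the actual structure is the one shown in Figure 4.4 and described in Subsection 3.2.2, so the function names and the decomposition here are illustrative only.

```python
def carry_save_add(a, b, c):
    """3-2 compression: three words in, a (sum, carry) pair with the same total out."""
    partial_sum = a ^ b ^ c                        # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1     # carries, moved one position left
    return partial_sum, carry


def four_two_add(x_sum, x_carry, y_sum, y_carry):
    """Add two carry save operands and return the result in carry save form."""
    s, c = carry_save_add(x_sum, x_carry, y_sum)
    return carry_save_add(s, c, y_carry)


# The operand values below are arbitrary illustration data.
t_sum, t_carry = four_two_add(13, 7, 21, 2)
print(t_sum + t_carry == 13 + 7 + 21 + 2)
```

No carry propagates along the word in either level, which is why the delay of such an adder does not depend on the operand length.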

The redundant adder denoted R implements the addition operation in the expressions $R_i = T_i + q_{i+1} M$. (Recall that register $M$ contains the value $-2^{10} m$. Hence, the subtraction is converted to an ordinary addition.) The implementation of this adder is similar to that of the other redundant adder.
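Written out, the conversion of the subtraction into an addition follows directly from the contents of register $M$ and the scaling constant $2^r = 2^5$:

```latex
\begin{align*}
R_i &= T_i + q_{i+1} M \\
    &= T_i + q_{i+1}\,\bigl(-2^{5+r} m\bigr) \\
    &= T_i - 2^{5} q_{i+1}\, 2^{r} m , \qquad r = 5 .
\end{align*}
```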

The binary adder is used for converting the final results, $R^Y_{-1}$ and $R^X_{-1}$, into non-redundant binary representation. Since the computing time for a binary addition is relatively large compared to the computing time for the other units, more than a single clock period is assigned for this operation. The number of extra clock periods is denoted $n_w$, the number of wait states. The required number of wait states depends on the actual clocking frequency. Therefore, the parameter $n_w$ can be configured by the user. While the processor is waiting for the conversion to be completed, the contents of all the registers used in the computation of modular products remain unchanged.
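How a suitable value of $n_w$ might be chosen can be sketched as follows. The sketch assumes that the binary addition occupies one regular clock period plus $n_w$ extra periods, and that the adder delay is known; the function name and the delay and period values are hypothetical and are not taken from the data book.

```python
def wait_states(adder_delay_ps, clock_period_ps):
    """Smallest n_w such that (1 + n_w) clock periods cover the adder delay."""
    periods_needed = -(-adder_delay_ps // clock_period_ps)   # ceiling division
    return max(0, periods_needed - 1)                        # the first period is not a wait state


# Hypothetical figures: a 45 ns adder delay at a 20 ns clock period.
print(wait_states(45_000, 20_000))
```

Times are given as integer picoseconds to avoid floating point rounding in the ceiling division.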

The binary adder was generated by the chip development tools. According to the data book [Cas91a], this so-called high speed adder uses a carry select architecture with a Manchester carry chain. (These addition techniques are described in standard books on computer arithmetic and VLSI design, e.g. [WE92, Chapter 8].)

The data connections in Figure 4.3 and in Figure 4.4 are annotated with the values computed by the various units at a certain instant of time. The annotation can be viewed as a snapshot of the internal state of the modular multiplication unit. The snapshot shows the internal state just after the evaluation of $R^X_{i+1}$. Because of the pipelined architecture, all of the pipelined units produce an alternating sequence of results as illustrated by (4.1). In general, when a unit produces a result marked with $Y$, it simultaneously consumes inputs marked with $X$, and vice versa.

It should be mentioned that the final results are left-shifted versions of the residues modulo $m$ of $x \cdot b$ and $y \cdot b$. The results can be expressed by $R^Y_{-1}/2^{10} \equiv_m y \cdot b$ and by $R^X_{-1}/2^{10} \equiv_m x \cdot b$. According to the above discussion, the multiplier registers $X$ and $Y$ must be initialised with $2^{10} x$ and with $2^{10} y$ prior to each modular multiplication. Hence, the updating of these registers with a result from a previous modular multiplication can be achieved by a simple load of the value $R_{-1}$.
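The recurrence and the register conventions described above can be sketched in Python for a single product. This is a minimal sketch under stated assumptions: the names `modmul`, `register_x` and `n_digits` are hypothetical, ordinary integers stand in for the carry save representation, and the quotient is obtained by an exact integer division rather than by the truncated estimate of Algorithm 3.9–1, which is not reproduced here. As a consequence the sketch returns $2^{10}$ times a residue in $[0; m[$, whereas the hardware method only guarantees the range $[0; 2m[$.

```python
DIGIT_BITS = 5                     # radix 2^5
SCALE_BITS = 5                     # scaling constant 2^r with r = 5
SHIFT = DIGIT_BITS + SCALE_BITS    # registers hold values scaled by 2^10


def modmul(register_x, b, m, n_digits):
    """Evaluate R_i = (2^5 R_{i+1} + a_i B) + q_{i+1} M for i = n-1, ..., -1.

    register_x holds 2^10 * x, i.e. the n digits of 2^r x followed by a_{-1} = 0.
    The returned value is R_{-1}.
    """
    big_m = -(m << SHIFT)          # register M holds -2^10 m
    scaled_m = m << SCALE_BITS     # 2^r m, used only by the simplified quotient below
    mask = (1 << DIGIT_BITS) - 1
    r = 0                          # R_n = 0
    for pos in range(n_digits, -1, -1):            # n + 1 register digits, most significant first
        a = (register_x >> (DIGIT_BITS * pos)) & mask
        q = r // scaled_m          # assumption: exact quotient, not the hardware estimate
        r = ((r << DIGIT_BITS) + a * b) + q * big_m
    return r


# Illustration values (arbitrary): a small modulus and two operands below it.
m, x, b = 1_000_003, 123_456, 654_321
r1 = modmul(x << SHIFT, b, m, n_digits=6)
print(r1 >> SHIFT == (x * b) % m)

# The result is already scaled by 2^10, so it can be loaded as the next multiplier.
r2 = modmul(r1, b, m, n_digits=6)
print(r2 >> SHIFT == (x * b * b) % m)
```

The second call illustrates the remark about register updating: no rescaling is needed between consecutive multiplications.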

The modular multiplication unit supports the conversion of a residue, say $R$, in the range $[0; 2m[$ into the range $[0; m[$. As mentioned in the previous subsection, such a conversion is required for the final result of a modular exponentiation. The conversion is performed by a subtraction of $m$, i.e. the operation $R - m$, and an inspection of the resulting sign. First, register $B$ is loaded with the value $R$. Then, by forcing the multiplier digits and quotient digits to certain values, the following computation implements the subtraction using the existing hardware architecture:

$R_1 := 0$;

$R_0 := (2^5 R_1 + a_0 B) + q_1 M$, where $a_0 = 2^5$ and $q_1 = 0$;

$R_{-1} := (2^5 R_0 + a_{-1} B) + q_0 M$, where $a_{-1} = 0$ and $q_0 = 1$;

By insertion of the digit values, the computation is seen to result in $R_{-1} = 2^{10} B + M$, which equals the value $2^{10}(R - m)$. Finally, by means of the binary adder, the sign of $R - m$ is computed.
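The two forced steps can be traced in Python with ordinary integers. The function name is hypothetical, and the final selection between $R$ and $R - m$ based on the sign is an assumption about how the inspection of the sign is used; the text only states that the sign is computed.

```python
def to_canonical_range(residue, m):
    """Map a residue in [0; 2m[ into [0; m[ using the two forced steps."""
    big_m = -(m << 10)                    # register M holds -2^10 m
    b = residue                           # register B is loaded with R
    r1 = 0
    r0 = ((r1 << 5) + 32 * b) + 0 * big_m       # a_0 = 2^5, q_1 = 0
    r_minus1 = ((r0 << 5) + 0 * b) + 1 * big_m  # a_{-1} = 0, q_0 = 1
    assert r_minus1 == (residue - m) << 10      # R_{-1} = 2^10 (R - m)
    # Assumed selection step: keep R if R - m is negative, otherwise take R - m.
    return residue if r_minus1 < 0 else r_minus1 >> 10


# Illustration values (arbitrary).
print(to_canonical_range(1_000_008, 1_000_003), to_canonical_range(7, 1_000_003))
```

Note that the forced digit $a_0 = 2^5$ lies outside the ordinary radix $2^5$ digit range 0 to 31, which is why it has to be forced rather than shifted in from a multiplier register.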

In document View of Exponentiation, Modular Multiplication and VLSI Implementation of High-Speed RSA Cryptography (pages 152-157)