Figure 11 shows the full architecture of the combined unit.

… Synopsys. Save the design in Verilog format.

6. Analyze the timing information and generate the timing report.

The procedure is illustrated in Figure 14.

The layout was created using SoC Encounter: the gate-level netlist was imported into Encounter for place and route. After the layout was created, the delay of each wire (route) was calculated. The timing information is saved in an .sdf file, which can be imported into Synopsys again to evaluate the impact of the interconnections on the cycle time.

The synthesis results are summarized in Table 5, Table 6 and Table 7.

Table 5 Critical path of reciprocal unit.

QLATCH | 4-1 multiplexer | CSA32 | Selection | Critical path (ns)
0.46 | 0.19 | 0.17 | 0.74 | 1.56
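The CSA32 and CSA42 stages on these critical paths are carry-save adders: they keep the residual as a (sum, carry) pair so that no carry has to ripple through the word within a cycle. The following is a minimal behavioural sketch on non-negative Python integers, assuming unlimited word length. The names `csa32` and `csa42` are hypothetical (chosen to echo the table headings) and are not the VHDL entities of the design; the 4:2 stage is modelled as two cascaded 3:2 stages, which is functionally equivalent but not necessarily the cell structure of [7].

```python
"""Behavioural model of 3:2 (CSA32) and 4:2 (CSA42) carry-save addition.

Hypothetical sketch on non-negative integers; the fixed word length and
two's-complement wrap-around of a real datapath are ignored.
"""
from __future__ import annotations


def csa32(a: int, b: int, c: int) -> tuple[int, int]:
    """Compress three operands into a (sum, carry) pair with no carry propagation."""
    partial_sum = a ^ b ^ c                      # per-bit sum, carries ignored
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weighted by 2
    return partial_sum, carry


def csa42(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Compress four operands into two by cascading two 3:2 stages."""
    s1, c1 = csa32(a, b, c)
    return csa32(s1, c1, d)


if __name__ == "__main__":
    s, c = csa42(13, 7, 22, 5)
    # The pair (s, c) represents s + c; a carry-propagate adder (CPA)
    # is needed only when a single binary result is required.
    print(s, c, s + c)  # prints: 23 24 47
```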

Table 6 Critical path of combined unit.

Control | 2-1 multiplexer | 4-1 multiplexer | CSA42 | Short adder | Selection | QLATCH | Critical path (ns)
0.40 | 0.14 | 0.09 | 0.36 | 0.47 | 0.40 | 0.11 | 1.97
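Because the stages listed in Table 5 and Table 6 lie in series on the critical path, each reported critical path should equal the sum of its stage delays. The short Python check below copies the figures from both tables and also reports the slowest stage. The column assignment for Table 5 follows the order recovered from the source layout, and the dictionary and function names are illustrative only.

```python
from __future__ import annotations

# Stage delays in ns, copied from Table 5 (reciprocal unit) and Table 6 (combined unit).
RECIPROCAL_STAGES = {"QLATCH": 0.46, "4-1 multiplexer": 0.19, "CSA32": 0.17, "Selection": 0.74}
COMBINED_STAGES = {
    "Control": 0.40, "2-1 multiplexer": 0.14, "4-1 multiplexer": 0.09, "CSA42": 0.36,
    "Short adder": 0.47, "Selection": 0.40, "QLATCH": 0.11,
}


def critical_path_ns(stages: dict[str, float]) -> float:
    """Series stages: path delay is the sum of stage delays, rounded to 0.01 ns as in the tables."""
    return round(sum(stages.values()), 2)


def slowest_stage(stages: dict[str, float]) -> str:
    """Name of the stage with the largest delay."""
    return max(stages, key=stages.get)


if __name__ == "__main__":
    for name, stages in (("reciprocal", RECIPROCAL_STAGES), ("combined", COMBINED_STAGES)):
        print(name, critical_path_ns(stages), slowest_stage(stages))
```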

Table 7 Area and power dissipation for reciprocal and combined unit

Unit | Area (µm²) | Power (mW)
Reciprocal unit | 183197.703125 | 29.3092
Combined unit | 328405.000000 | 41.8629

Table 8 Comparison of delay before and after layout

Unit | Before layout (ns) | After layout (ns) | Difference (%)
Reciprocal unit | 1.56 | 1.68 | 7.7
Combined unit | 1.97 | 2.42 | 22.8
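The Difference column of Table 8 is the relative growth of the critical path once interconnect delays from the layout are included. A minimal Python sketch of that calculation, using the before/after values from the table; the function name is hypothetical.

```python
def layout_increase_pct(before_ns: float, after_ns: float) -> float:
    """Relative growth of the critical path caused by interconnect, in percent."""
    return 100.0 * (after_ns - before_ns) / before_ns


if __name__ == "__main__":
    # Before/after-layout critical paths from Table 8.
    for unit, before, after in (("Reciprocal unit", 1.56, 1.68), ("Combined unit", 1.97, 2.42)):
        print(f"{unit}: {layout_increase_pct(before, after):.1f} %")
```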

Figure 14 Procedure of implementation (RTL-level design, synthesis with Synopsys, gate-level netlist, layout with SoC Encounter, SDF timing file, timing analysis with Synopsys).

Table 5 and Table 6 show that, for both the reciprocal and the square root reciprocal operation, the critical path corresponds to the digit-by-digit part. This means that the proposed schemes have the same cycle time as the corresponding digit-by-digit schemes, but need only half as many iterations as the digit-by-digit schemes alone. The total delay is therefore reduced roughly by half.

Since the approximation part has roughly the same complexity as the conventional digit-by-digit part, the proposed schemes need nearly twice the hardware of the corresponding digit-by-digit schemes.
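The halving argument can be made concrete with a back-of-the-envelope count. This is a minimal sketch under stated assumptions: a radix-4 digit-by-digit unit retires 2 result bits per iteration, the proposed scheme about 4 bits per cycle (as stated in the conclusions), both run at the same cycle time, and initialization, rounding and conversion cycles are ignored. The 53-bit precision and the use of the 1.56 ns pre-layout cycle time from Table 5 are illustrative assumptions, not figures reported for this design.

```python
import math


def iterations(n_bits: int, bits_per_cycle: int) -> int:
    """Cycles needed to produce n_bits result bits at bits_per_cycle per cycle."""
    return math.ceil(n_bits / bits_per_cycle)


def latency_ns(n_bits: int, bits_per_cycle: int, cycle_ns: float) -> float:
    """Iteration count times cycle time; setup and rounding cycles are ignored."""
    return iterations(n_bits, bits_per_cycle) * cycle_ns


if __name__ == "__main__":
    N_BITS = 53        # assumption: double-precision significand, for illustration only
    CYCLE_NS = 1.56    # Table 5 cycle time, assumed equal for both schemes
    conventional = latency_ns(N_BITS, 2, CYCLE_NS)   # radix-4: 2 bits per iteration
    proposed = latency_ns(N_BITS, 4, CYCLE_NS)       # about 4 bits per cycle
    print(f"conventional: {conventional:.2f} ns, proposed: {proposed:.2f} ns, "
          f"ratio: {proposed / conventional:.2f}")
```

Because of the ceiling, the ratio for an odd bit count is slightly above one half, which matches the "roughly by half" wording.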

For reciprocal, the area-time ratios of higher-radix schemes with respect to the radix-4 scheme are described in [2]. Therefore, we can compare the proposed design with more instances in the area-time space. Figure 15 shows the area-time ratios for the digit-by-digit radix-4 scheme, the radix-16 scheme using overlapped radix-4 stages and the very high radix-512 scheme, in comparison with the proposed design.

For the square root reciprocal, there are several implementations using radix-4 and very high radix, described in [3] and [5]. A radix-16 implementation with overlapped radix-4 iterations would be significantly more complex, because of the many conditional forms required, so we do not consider it. The scheme and implementation proposed in [3] has good area-delay figures, so we take [3] as the radix-4 design reference. For the very high radix implementation, we assume a cost in area and delay of 1.5 times (a conservative figure) the cost of a very high radix reciprocal unit with the same cycle time. Figure 15 also shows the ratios corresponding to the square-root reciprocal, using the conventional radix-4 reciprocal unit as a reference. From these estimations, it can be concluded that the proposed designs introduce attractive points in the area-time space.

Figure 15 Area-time space.
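The points in Figure 15 are normalized to the conventional radix-4 reciprocal unit. A minimal sketch of that normalization, using only the rough factors stated in the text: the proposed schemes need about twice the area and half the total time of the corresponding digit-by-digit unit, and a very high radix square-root reciprocal unit is assumed to cost 1.5 times the area and delay of a very high radix reciprocal unit. The class and function names are hypothetical, and the radix-16 and radix-512 ratios from [2] are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ATPoint:
    """Area and time normalized to the conventional radix-4 reciprocal unit."""
    area: float
    time: float

    @property
    def at_product(self) -> float:
        return self.area * self.time


def proposed_from_digit_by_digit(base: ATPoint, area_factor: float = 2.0,
                                 time_factor: float = 0.5) -> ATPoint:
    """Rough estimate from the text: about 2x area, about half the total delay."""
    return ATPoint(base.area * area_factor, base.time * time_factor)


def very_high_radix_rsqrt(recip: ATPoint, factor: float = 1.5) -> ATPoint:
    """Assumed cost of a very high radix square-root reciprocal unit (conservative 1.5x)."""
    return ATPoint(recip.area * factor, recip.time * factor)


if __name__ == "__main__":
    radix4 = ATPoint(1.0, 1.0)
    proposed = proposed_from_digit_by_digit(radix4)
    print(proposed, proposed.at_product)
```

Under these rough factors the proposed reciprocal unit sits at about the same area-time product as radix-4 but at half the latency, which is why it appears as a distinct point in the area-time space.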

Chapter 5

Conclusions

In this project, the entire process of designing a combined unit for reciprocal and square root reciprocal is presented. A new algorithm is used, which combines a digit-by-digit algorithm and a Newton-Raphson approximation. The approximation part is performed by a digit recurrence using the digits produced by the digit-by-digit part. In this way, the two parts can execute in an overlapped manner. Consequently, only half of the iterations are needed as compared to the conventional digit-recurrence algorithm.

For the design of the reciprocal unit and the combined unit, the radix-4 implementation is used and the residuals are represented in carry-save form. This choice achieves low latency with moderate hardware overhead. To make the architecture suitable for both reciprocal and square root reciprocal computation, a single quotient-digit selection function, developed in [3], is used.

The design was synthesized using a 0.18 µm standard cell library. From synthesis, the critical path, area and power dissipation are estimated. The synthesis results show that the cycle time is the same as that of the conventional digit-by-digit unit. Since the number of iterations is nearly half that of the digit-by-digit method alone, the total execution time is almost halved. On the other hand, the addition of the approximation part almost doubles the required area.

Since the proposed design produces four bits per cycle, for reciprocal it is about 50% faster than a radix-16 implementation with overlapped radix-4 stages, with an increase of 20% in area. The evaluation shows that the proposed design offers good figures in the area-time space.
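The Newton-Raphson approximation mentioned above owes its speed to quadratic convergence: each step roughly squares the relative error, doubling the number of correct bits. The sketch below shows the classical floating-point iterations for 1/d and 1/√d. It is the textbook form, not the digit-recurrence formulation used in the design, and the initial approximations (0.6 and 0.7) and function names are arbitrary illustrative choices.

```python
from __future__ import annotations


def nr_reciprocal(d: float, x0: float, steps: int) -> list[float]:
    """Newton-Raphson for 1/d: x <- x * (2 - d*x); relative error e <- e**2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x * (2.0 - d * x))
    return xs


def nr_rsqrt(d: float, y0: float, steps: int) -> list[float]:
    """Newton-Raphson for 1/sqrt(d): y <- y * (3 - d*y*y) / 2."""
    ys = [y0]
    for _ in range(steps):
        y = ys[-1]
        ys.append(0.5 * y * (3.0 - d * y * y))
    return ys


if __name__ == "__main__":
    d = 1.5
    for k, x in enumerate(nr_reciprocal(d, 0.6, 4)):
        # The error column roughly squares at every step.
        print(k, x, abs(1.0 - d * x))
```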

References

[1] E. Antelo, P. Montuschi, T. Lang, A. Nannarelli, “Low Latency Digit-Recurrence Reciprocal and Square-Root Reciprocal Algorithm and Architecture”, in Proceedings - 17th IEEE Symposium on Computer Arithmetic, ARITH-17 2005, pp. 147-154.

[2] M. Ercegovac and T. Lang, “Digital Arithmetic”, Morgan Kaufmann Publishers, 2003.

[3] T. Lang and E. Antelo, “Radix-4 Reciprocal Square-Root and Its Combination with Division and Square Root”, IEEE Trans. on Computers, vol. 52, no. 9, Sept. 2003, pp. 1100-1114.

[4] D. A. Patterson and J. L. Hennessy, “Computer Organization and Design – the hardware/software interface”, 2nd edition, Morgan Kaufmann Publishers Inc., 1998.

[5] E. Antelo, T. Lang and J. D. Bruguera, “Computation of √(x/d) in a Very-high Radix Combined Division/Square-Root Unit with Scaling and Selection by Rounding”, IEEE Trans. on Computers, vol. 47, no. 2, Feb. 1998, pp. 152-161.

[6] N. Takagi, “A Hardware Algorithm for Computing Reciprocal Square Root”, in Proc. 15th IEEE Symposium on Computer Arithmetic, 2001, pp. 94-100.

[7] A. Weinberger, “4–2 carry-save adder module”, IBM Technical Disclosure Bulletin, vol. 23, no. 8, Jan. 1981, pp. 3811-3814.

Appendices

A Reciprocal unit design

A.1 VHDL files included in the reciprocal unit design are listed below.

reciprocal.vhd (the top-level design)
control.vhd
convert.vhd
convert_app.vhd
cpa.vhd
cr_qcalc.vhd
csa32LSBs.vhd
csa42.vhd
digit_compute.vhd
digit_update2.vhd
extend.vhd
gl_csa32.vhd
gl_dualreg_ld.vhd
gl_mux21.vhd
gl_reg.vhd
mult2.vhd
q_regs.vhd
qds_adder.vhd
qds_table.vhd
qdsel.vhd
qlatch.vhd
quotient.vhd
rounding.vhd
sdet.vhd
shifter.vhd
tb2_reciprocal.vhd

Files used by both the reciprocal and the combined unit are listed below; these files are not shown in A.2 but in B.2.

cpa.vhd

csa42.vhd

gl_csa32.vhd

gl_dualreg_ld.vhd

gl_mux21.vhd

gl_reg.vhd

mult2.vhd

qlatch.vhd

In document: Design of a Combined Unit for Reciprocal and Square Root Reciprocal (pages 30-36)