
As expected, only the register bank entity improved, while the gain achieved in the CAU remained intact, which confirms that the estimation method works correctly. An additional power improvement of 44% over REG EN was observed in REG BANK.

This is due to the power consumed by the feedback multiplexers in the enabled registers and the disabling of the registers.

The overall conclusion of this experiment is that clock gating is confirmed as a way to reduce power consumption in sequential logic. More importantly, because it effectively serves as a means to isolate combinational fanout logic, it raises high expectations for power management based on operand isolation. How clock gating and operand isolation compare and combine with each other is the focus of the next experiment.

3.5 Operand Isolation

Operand isolation has not attracted considerable attention because of the power, area and timing overhead it incurs. With respect to area, the isolation circuitry is significant, in contrast to clock gating, which actually reduces area. As regards power, clock gating has a triple effect: it reduces power on the clock tree, at the register banks and in the functional units.

However, only the last of these effects is considered here, as operand isolation only relates to power in functional units. Under this condition, operand isolation by clock gating comes for free and its benefits are unconditional, as no additional logic is inferred. In contrast, the net power gain from operand isolation is what remains after the power overhead of the isolation circuitry has been deducted. Finally, the main reason why operand isolation has remained just an idea and has not made its way into commercial designs is its impact on timing, as the isolation logic is added on the critical paths of the design. In this experiment it is assumed that timing is not an issue; however, as described before, effort is put into keeping the imposed delay to a minimum.

In the following, three designs are compared against the PLAIN base design: OP ISOL, REG EN OP ISOL, CLK GATED OP ISOL.

The “OP ISOL” architecture

In this implementation, the register banks are free running and all inputs to all multipliers are operand isolated. This means that enable latches separate the shared input buses from the multipliers, as shown in figure 3.5. The adder and the subtractor are left unguarded, because of their insignificant power dissipation. Isolation logic could be added at the output of the multipliers in order to isolate the chained functional units both from unnecessary activity (MSF, MPF) and glitches from the multipliers in case of MCX instructions. Because of the insignificant contribution of the adder/subtractor units, the insertion of additional isolation logic is not justified. Yet, some secondary savings (see section 2.3.1) are still achieved by isolating the fanin logic.
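The effect of an isolation latch on input switching activity can be illustrated with a small behavioral model. This is only a sketch, not the design used in the experiment: the function names, the 16-bit sample values and the enable trace are assumptions chosen for illustration.

```python
def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def input_toggles(bus_values, enables, isolated):
    """Count bit toggles seen at a functional unit's input.

    bus_values: value on the shared input bus in each cycle
    enables:    True when the unit is needed in that cycle
    isolated:   if True, a latch holds the last value whenever
                the enable is low (hypothetical model)
    """
    seen = 0       # value currently visible at the unit's input
    toggles = 0
    for value, enable in zip(bus_values, enables):
        # Without isolation the unit always sees the bus value.
        new = value if (enable or not isolated) else seen
        toggles += hamming(seen, new)
        seen = new
    return toggles


# Made-up trace: the unit is needed only in cycles 0 and 3.
bus = [0x00FF, 0xFF00, 0x0F0F, 0x00FF]
en = [True, False, False, True]

free_running = input_toggles(bus, en, isolated=False)
latched = input_toggles(bus, en, isolated=True)
print(free_running, latched)
```

In this toy trace the latched input toggles far less than the free-running one, because bus activity in the cycles where the unit is idle never reaches it. The model ignores the power consumed by the latch itself.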

By applying such fine-grained operand isolation, it is ensured that only the necessary multipliers are active. The conditions under which the latches become transparent and the incurred delays are:

Isolate    Condition                 Depth
mult hh    opcode(3) or opcode(2)    1 + tL
mult hl    opcode(3)                 tL
mult lh    opcode(3)                 tL
mult ll    opcode(0)                 1 + tL

In the table, tL stands for the latch delay. As can be seen, the logic depth is limited to one gate delay plus tL. This is only possible if one-hot encoding is used for the opcode and resources are highly mutually exclusive. The area overhead of the isolation latches is 6.5% over the plain design; to visualize the overhead, it is approximately as much as an extra set of pipeline input registers. This point is used in the next chapter, where the duplication of input registers is evaluated as an alternative isolation method.

34 Chapter 3. Experiment 1: A Complex Arithmetic Unit

Figure 3.5: Latch-based operand isolation in the CAU design (blocks shown: pipeline register, four latches controlled by op_code, multipliers mult_hh, mult_hl, mult_lh and mult_ll, a MUX, a subtractor and an adder)
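The isolation conditions listed in the table above can be written down as Boolean functions of the opcode bits. The sketch below copies the conditions verbatim from the table; representing the opcode as a 4-bit integer, reading opcode(i) as bit i, and the function names are assumptions, and the source does not state which instruction each bit encodes.

```python
def bit(opcode, i):
    """Return bit i of the opcode (assumed to correspond to opcode(i))."""
    return (opcode >> i) & 1


def latch_enables(opcode):
    """Latch transparency per multiplier, as listed in the table."""
    return {
        "mult_hh": bool(bit(opcode, 3) or bit(opcode, 2)),
        "mult_hl": bool(bit(opcode, 3)),
        "mult_lh": bool(bit(opcode, 3)),
        "mult_ll": bool(bit(opcode, 0)),
    }


# Example with a one-hot opcode that has only bit 2 set.
print(latch_enables(0b0100))
```

Because each condition reads at most two opcode bits, no decoding logic is needed in front of the latch enables, which is consistent with the depths of tL and 1 + tL given in the table.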

The decomposition of power improvements in the CAU is presented in table 3.8.

Component            Improvement (%)
REG BANK             21.52
CAU TOTAL            66.08
  adder               1.77
  subtractor          1.36
  mult hh            15.31
  mult hl            19.14
  mult lh            18.89
  mult ll            11.38
  Isolation logic    -2.07

Table 3.8: Relative power improvement (%) in OP ISOL over PLAIN

Compared to clock gating, operand isolation yields an additional net improvement of 9% in the CAU. The isolation logic is responsible for 2% of total power consumption. The power overhead is therefore insignificant, even in the extreme case where all inputs are isolated.
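The CAU rows of table 3.8 can be tallied to separate the gross savings from the isolation overhead. The snippet is only an arithmetic check on the published figures; the variable names are illustrative.

```python
# Relative power improvement over PLAIN (%), copied from table 3.8.
cau_components = {
    "adder": 1.77,
    "subtractor": 1.36,
    "mult hh": 15.31,
    "mult hl": 19.14,
    "mult lh": 18.89,
    "mult ll": 11.38,
    "Isolation logic": -2.07,   # negative: the latches consume power
}

# Gross saving: all positive contributions; overhead: the latch power.
gross_saving = sum(v for v in cau_components.values() if v > 0)
overhead = -cau_components["Isolation logic"]
net_saving = gross_saving - overhead

print(f"gross saving: {gross_saving:.2f}%")
print(f"overhead:     {overhead:.2f}%")
print(f"net saving:   {net_saving:.2f}%")
```

The listed rows add up to a net 65.78%, slightly below the 66.08% reported as CAU TOTAL; the table does not itemize the remainder.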

There is also an improvement of about 20% in the register bank (21.52% in table 3.8). This is because, after the insertion of latches, the input registers see less capacitive load. Although the improvement appears significant, one cannot be sure of it until load capacitances have been back-annotated after physical layout.

The “CLK GATED OP ISOL” architecture

The previous experiment proved that operand isolation offers great power savings and that it outperforms clock gating. After all, clock gating is meant as a power reduction technique for sequential logic. The purpose of the next experiments is to investigate how operand isolation performs together with clock gating.

This design deploys both clock gated input registers and operand isolation latches. Power dissipation in the CAU remains unchanged, while a total improvement of 55.29% is observed in the register banks. This is an additional 7% over the CLK GATED design and is probably due to the reduced fanout load at the input registers.


The “CLK GATED OP ISOL OPT” architecture

The purpose of this experiment is to optimize the overlapping isolation effect of clock gating and isolation latches present in the previous design. It was shown that clock gating effectively isolates the mult hh and mult ll functional units. The corresponding isolation latches can therefore be removed. Doing so yields an additional 1% improvement in the CAU, due to the reduced power overhead of the isolation logic.

Overall this implementation yields the highest total power improvement of 69.7%.

The “DECOUPLED” architecture

Isolate         Condition    Depth
mult hh         opcode(3)    tL
mult hl         opcode(3)    tL
mult lh         opcode(3)    tL
mult ll         opcode(3)    tL
mult 16 sf H    opcode(2)    tL
mult 16 sf L    opcode(1)    tL

Table 3.9: Isolation conditions and logic depth in the DECOUPLED design

In the previous chapter it was mentioned that resource sharing can damage data correlations, which directly translates to higher switching activity. Based on that, the next design evaluates the scenario where separate resources are allocated to every single instruction.

The idea is similar to the one illustrated in figure 3.2, but at a lower level. To that end, two more 16x16 bit multipliers are instantiated, each with its own isolation logic, to perform the MPF and MSF instructions. The extra resources are grouped to form the Signed Fractional Arithmetic Unit (SFAU). The new isolation conditions are summarized in table 3.9.

Component            Improvement (%)
REG BANK             19.85
CAU TOTAL            79.70
  adder              79.70
  subtractor         80.09
  mult hh            80.72
  mult hl            80.57
  mult lh            80.57
  mult ll            80.67
  Isolation logic    -1.05
SFAU TOTAL          -13.37
  mult 16 sf H       -4.48
  mult 16 sf L       -8.03
  Isolation logic    -0.83

Table 3.10: Relative power improvement (%) in DECOUPLED

As table 3.9 shows, the delay of the isolation logic is only that of the latch, and hence minimal.

The power dissipation of every component is listed in table 3.10. The overall power distribution and absolute values are identical to those achieved with the OP ISOL architecture. The 13% power improvement in the CAU is entirely offset by the power dissipated in the SFAU, which implements the remaining instructions. In other words, nothing justifies the significant area overhead of two extra multipliers.
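The offset described above can be checked against the totals quoted in tables 3.8 and 3.10. The snippet is a plain arithmetic sketch; it assumes that all percentages are relative to the same PLAIN baseline so that they can be added, and the variable names are illustrative.

```python
# Relative improvements (%), as listed in tables 3.8 and 3.10.
op_isol_cau_total = 66.08
decoupled_cau_total = 79.70
decoupled_sfau_total = -13.37   # negative: the SFAU adds power

# Extra gain in the CAU when MPF/MSF move to dedicated multipliers.
extra_cau_gain = decoupled_cau_total - op_isol_cau_total

# Net result once the power of the SFAU is charged against that gain.
decoupled_net = decoupled_cau_total + decoupled_sfau_total

print(f"extra CAU gain in DECOUPLED: {extra_cau_gain:.2f}%")
print(f"DECOUPLED net (CAU + SFAU):  {decoupled_net:.2f}%")
print(f"OP ISOL CAU total:           {op_isol_cau_total:.2f}%")
```

Under that assumption the DECOUPLED net figure comes out at about 66.3%, within a fraction of a percent of the OP ISOL total.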


In conclusion, pure operand isolation yields a power saving of 66% in the CAU, identical to that of the decoupled design. Power dissipation in the REG BANK can be halved if clock gating is applied. However, it may not always be possible to find clock gating conditions, as the input registers in an architecture like the one in figure 3.2 can only be inactive during NOP instructions; overall, these are not expected to exceed 13% of the total instruction count [24]. Load-save instructions are also included in this percentage. Although they do not utilize the functional units, their fetched operands have to travel through the execution stage and should not invoke new computations.