Design Specification - Power Efficient Arithmetic Circuits for Application Specific Processors

3.2 Design Specification

The design example has intentionally been kept simple, both to illustrate the sources of power consumption and their causes and to make the power measurements taken before and after the application of power management techniques easy to interpret and correlate. This section deals with the functional specification of the design.

Instruction Set

The CAU implements the following instructions:

• Complex multiplication

• Multiplication on two pairs of 16-bit signed fractional numbers

• Multiplication of a single pair of 16-bit signed fractional numbers

• No operation

Possible additional instructions would be single/parallel addition/subtraction and multiply-accumulate instructions on a sequence of complex numbers.

The block diagram of the CAU is shown in figure 3.3. As in the overall architecture of the execution stage (figure 3.1), the individual results are merged through a 3-to-1 multiplexer controlled by the operation code control field. The concatenation modules, annotated by “&”, form the results of the individual instructions. They incur no logic beyond the manipulation of wires: the least significant bits are truncated to match the available precision, and the redundant wires used to evaluate overflow conditions are removed.

Although the overflow detection logic does not appear in the figure, the functionality is explained in a subsequent section.

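[Block diagram: four multipliers (mult_hh, mult_hl, mult_lh, mult_ll) fed by the operand pairs (A_h, B_h), (A_l, B_l), (A_l, B_h) and (A_h, B_l); a subtractor (-) and an adder (+); an output MUX with a constant '0' input, selected by op_code.]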

Figure 3.3: The complex arithmetic unit with enriched instruction set

The mnemonics assigned to the individual instructions in the order presented above are shown in table 3.1.

Instruction   Mux input   Op code
MCX           2           “1000”
MPF           0           “0100”
MSF           1           “0010”
NOP           -           “0001”

Table 3.1: CAU instruction set

28 Chapter 3. Experiment 1: A Complex Arithmetic Unit

The operation codes are one-hot encoded. Although this adds to the register count for the pipeline registers, it simplifies the control logic and improves timing, as will be explained later.
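As a reading aid, the instruction set of table 3.1 can be mimicked by a small behavioural model. The Python sketch below is not taken from the design. It assumes that the upper and lower halves of each 32-bit operand hold the real and imaginary parts (or the two independent fractional operands), that MSF uses the lower halves only and leaves the upper half of the result at zero, and that results are truncated to 16 bits without saturation. The names `cau`, `split` and `q15` are hypothetical.

```python
# Behavioural sketch of the CAU instructions (assumed semantics, see text).
MCX, MPF, MSF, NOP = 0b1000, 0b0100, 0b0010, 0b0001  # one-hot codes of table 3.1


def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def split(word):
    """Return the (high, low) signed 16-bit halves of a 32-bit word."""
    return to_signed16(word >> 16), to_signed16(word)


def q15(value):
    """Drop the 15 least significant bits and keep 16 bits (no saturation)."""
    return (value >> 15) & 0xFFFF


def cau(op, a, b, z_prev=0):
    """Return the 32-bit result Z for one-hot opcode op and operands a, b."""
    a_h, a_l = split(a)
    b_h, b_l = split(b)
    if op == MCX:  # assumed: high half = real part, low half = imaginary part
        re = a_h * b_h - a_l * b_l
        im = a_h * b_l + a_l * b_h
        return (q15(re) << 16) | q15(im)
    if op == MPF:  # two independent products, one per half
        return (q15(a_h * b_h) << 16) | q15(a_l * b_l)
    if op == MSF:  # assumed: lower halves only, upper result half zero
        return q15(a_l * b_l)
    return z_prev  # NOP: the result keeps its previous value
```

With 0x4000 representing 0.5, `cau(MCX, 0x40004000, 0x4000C000)` models (0.5 + 0.5j)(0.5 - 0.5j) and returns 0x40000000, that is 0.5 + 0j.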

A “no operation” is present in all instruction sets and if not designed carefully it can also result in unnecessary power consumption, for example if the output is reset to logical zero.

To avoid this situation, it is sufficient to disable the control signals that can alter the state of the machine. Operands can then maintain their previous value, since they are not going to be written. This saves power both in the register banks and in the functional units, which would otherwise change state and hence dissipate power. The main idea, which recurs throughout this report, is to prevent any useless switching activity.
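The effect of the two NOP policies can be illustrated with a rough activity count. The sketch below is only an illustration, under the assumption that the number of bit toggles on a 32-bit register (the Hamming distance between successive values) is an acceptable proxy for dynamic power. The function names and the example stream are hypothetical.

```python
def toggles(prev, new, width=32):
    """Number of bits that change between two register values."""
    return bin((prev ^ new) & ((1 << width) - 1)).count("1")


def total_toggles(stream, hold_on_nop):
    """Count bit toggles for a stream of values; None stands for a NOP.

    hold_on_nop=True  : the register keeps its previous value during a NOP.
    hold_on_nop=False : the register is reset to zero during a NOP.
    """
    state, count = 0, 0
    for value in stream:
        if value is None:
            nxt = state if hold_on_nop else 0
        else:
            nxt = value
        count += toggles(state, nxt)
        state = nxt
    return count


stream = [0xFFFF0000, None, 0xFFFF0000]  # value, NOP, same value again
held = total_toggles(stream, hold_on_nop=True)
reset = total_toggles(stream, hold_on_nop=False)
```

In this example, holding the value costs 16 toggles in total, whereas resetting to zero costs 48, because the register is cleared and then reloaded with the same data.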

Implementation

The top level design is split into two high level entities, a sequential (REG BANK) and a combinatorial (CAU). The former implements the I/O registers, which can be perceived as the pipeline registers of the execution stage of a processor. The latter is the complex arithmetic unit and it includes the arithmetic and control units. The power management circuitry will also be considered as a part of the CAU to ease evaluation of power figures.

The size of the design does not justify the architectural partition into a sequential and a combinational block, nor the code overhead it incurs. However, the partition allows hierarchical analysis reports (area and power) to be generated, which in turn allows an unambiguous evaluation of the performance of the power management techniques applied.

Registers

The CAU is a combinatorial circuit and its inputs and outputs are registered as shown in table 3.2.

Register   Width   Direction   Comment
in_A       32      I           Operand 1
in_B       32      I           Operand 2
opcode     4       I           Operation code
Z          32      O           Result
ovf        1       O           Overflow flag

Table 3.2: The input output registers

Clock gating, as described in section 2.2, is available as an option for all registers except “opcode”, which carries sensitive control information and should therefore always be enabled.

Arithmetic units

As illustrated in figure 3.3, the CAU comprises four multipliers, an adder and a subtractor.

The implementation of these units is listed in table 3.3.

Furthermore, each multiplier consists of a Wallace tree structure followed by a 25x25-bit adder with Brent-Kung architecture (see section 5.1).

Unit name    Implementation                             Width
adder        Ripple carry adder                         32x32
subtractor   Ripple carry subtractor                    32x32
mult_hh      Non-Booth encoded Wallace tree multiplier  16x16
mult_hl      Non-Booth encoded Wallace tree multiplier  16x16
mult_lh      Non-Booth encoded Wallace tree multiplier  16x16
mult_ll      Non-Booth encoded Wallace tree multiplier  16x16

Table 3.3: Implementation of arithmetic units after synthesis

At this point, the selection of implementations for the arithmetic units was left to the synthesis tool, based only on timing and area constraints. The idea was first to investigate the merits and overhead of the power management techniques alone and then to compare them against the results gained by the use of low power implementations of the same modules. Power management circuitry entails a certain power overhead that offsets the power savings, so there is a turn-over point, and one of the objectives of the thesis is to discover this point.

Another possible degree of freedom is the accuracy of the calculations. At this point, no precision is sacrificed before the final, full-precision result is truncated. It could be worth investigating whether precision in the computation could be traded off for reduced hardware and hence lower power consumption [32]. This would, however, require the design of customized arithmetic components.

Finally, timing and pipelining could be taken into account. Timing closure is not considered a problem; however, if units were to be pipelined, the insertion point of the pipeline stages should be power sensitive, aiming at reduced switching activity, as suggested in section 2.5.2 and applied in the design described in section 6.3.

Control logic

The control unit is responsible for the tasks of selecting the correct result, activating power management circuitry and detecting overflow.

For some applications (real-time signal processing, multimedia), performance is not to be compromised for power, so any power saving technique that degrades performance is unlikely to find wide acceptance and applicability. The requirements may be relaxed, however, if some slack is available. Both the clock gating and the operand isolation enable conditions are derived from the operation code. To minimize the delay, the depth of the inferred logic should be kept as low as possible. The absolute lower limit, which can occasionally be achieved, is zero. This occurs only when one-hot encoding, rather than binary encoding, is used for the operation code, as illustrated in table 3.4.

              1-hot                     Binary
Instruction   code     cond.   depth    code   cond.           depth
MCX           “1000”   op(3)   0        “11”   op(1)·op(0)     1
MPF           “0100”   op(2)   0        “10”   op(1)·¬op(0)    2
MSF           “0010”   op(1)   0        “01”   ¬op(1)·op(0)    2
NOP           “0001”   op(0)   0        “00”   ¬op(1)·¬op(0)   2

Table 3.4: Power management delay overhead vs. encoding style

For instance, for a one-hot encoded operation code and isolation logic that is transparent when the control input is high, no delay other than the propagation delay through the isolation latch is added to the timing path. This point will be further clarified when the architecture based on operand isolation is introduced later on.
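The difference in logic depth between the two encodings can be made concrete with a small model. The following Python sketch assumes that a binary-coded enable is decoded with one 2-input AND gate, preceded by an inverter for every opcode bit that must be zero, and that each gate counts as one level. The binary code assignment follows table 3.4; the function names are hypothetical.

```python
ONE_HOT_BIT = {"MCX": 3, "MPF": 2, "MSF": 1, "NOP": 0}      # opcode bit per instruction
ONE_HOT = {name: 1 << pos for name, pos in ONE_HOT_BIT.items()}
BINARY = {"MCX": 0b11, "MPF": 0b10, "MSF": 0b01, "NOP": 0b00}


def enable_one_hot(op, instr):
    """The enable is a single opcode wire: no gates on the path (depth 0)."""
    return (op >> ONE_HOT_BIT[instr]) & 1


def enable_binary(op, instr):
    """Return (enable, depth) when decoding a 2-bit binary opcode."""
    code = BINARY[instr]
    enable, inverted = 1, False
    for pos in (1, 0):
        wire = (op >> pos) & 1
        if not (code >> pos) & 1:  # a 0 in the code needs an inverter
            wire ^= 1
            inverted = True
        enable &= wire             # contribution to the 2-input AND gate
    return enable, (2 if inverted else 1)
```

Under these assumptions only the code “11” avoids the inverter level, which is why a single binary-coded instruction reaches depth 1 while the others need depth 2.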


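[Figure: bit layouts of the input, partial product and product words. S marks a sign bit and “.” the binary point; bit positions 14 to 0 are labelled on the input, 31 to 0 on the partial product and 32 to 0 on the product.]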

Figure 3.4: Representation of signed fractional intermediate results

The second role of the control logic is to set the overflow flag. In the design, the length of the intermediate results is chosen so that no overflow can occur. However, since arithmetic is performed on signed fixed point fractional numbers, the available range is limited from -1 inclusive to +1 exclusive. Overflow can occur either in the final real and imaginary parts of the complex multiplication, which range from -2 to +2, both inclusive, or in the intermediate products, which range from -1 to +1, both inclusive. Overflow on the partial products in the case of complex multiplication is not accounted for, as the subsequent addition/subtraction may restore the result to the legal range. For the MCX instruction, overflow is identified when bits 32 down to 30 of the product holding register are not identical (see figure 3.4).

For the MPF and MSF instructions, overflow has occurred if the bits in positions 31 and 30 of the partial product are different.
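The two overflow conditions can be written down directly as bit tests. The sketch below assumes 16-bit operands held as signed integers scaled by 2^15, so that a product or a sum of products is scaled by 2^30, and relies on Python integers behaving as sign-extended two's complement numbers under shifting. The function names are hypothetical.

```python
def bit(value, pos):
    """Two's complement bit of a (possibly negative) Python integer."""
    return (value >> pos) & 1


def mcx_overflow(result):
    """MCX: bits 32 down to 30 of the 33-bit sum must all be identical."""
    return len({bit(result, p) for p in (32, 31, 30)}) > 1


def mpf_msf_overflow(product):
    """MPF/MSF: bits 31 and 30 of the 32-bit partial product must match."""
    return bit(product, 31) != bit(product, 30)


MINUS_ONE = -32768  # -1.0 with 15 fractional bits
HALF = 16384        # +0.5 with 15 fractional bits

# (-1) * (-1) = +1 lies outside the range [-1, +1) and is flagged.
flag_pos_one = mpf_msf_overflow(MINUS_ONE * MINUS_ONE)
# 0.5 * 0.5 = 0.25 is representable and is not flagged.
flag_quarter = mpf_msf_overflow(HALF * HALF)
```

The only partial product that is flagged is (-1)·(-1), since +1 is the single product value that lies outside the representable range.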
