
A Pseudo-Synchronous Implementation Flow for WCHB QDI Asynchronous Circuits


Academic year: 2022


(1)

A Pseudo-Synchronous Implementation Flow for WCHB QDI Asynchronous Circuits

Yvain Thonnart, Edith Beigné, Pascal Vivet

CEA-LETI, Minatec, Grenoble, France

Async’2012, May 8th, 2012

DTU, Copenhagen, Denmark

(2)

Asynchronous circuits with synchronous CAD tools?

 Asynchronous circuits: a handcrafted piece of art

Entangled uneven loops

Requires minute attention to detail

Very valuable for specific needs

But very expensive design time

 Synchronous CAD tools: a powerful heavy machinery

Backed-up by big EDA companies

Obsessed about clocks

Scared of loops

 Pseudo-synchronous implementation

“Mass-produced”

Much cheaper design time

Can run fast, nevertheless!

Trick the chain link model

(3)

Outline

 Asynchronous circuits with synchronous CAD tools ?

 Pseudo-synchronous models for C-elements

 Pseudo-synchronous circuit implementation

 Benchmarking against asynchronous implementation

 Real-world implementations

 Conclusion & perspectives

(4)

DIMS WCHB pipeline: combinational loops & optimization

 Performance is given by the loops cycle times

 Design optimization needs to constrain those loops

 Synchronous CAD tools can’t handle them

need to cut the loops in the timing graph & constrain loop segments

 Where to cut for a systematic approach

 in the WCHB C-elements: the ones gathering forward and backward data (they must be reset)

[Figure: three-stage WCHB pipeline; each stage combines forward logic (Logic Fwd) and backward logic (Logic Bwd) through C-elements with a Reset input]

(5)

Asynchronous Implementation: cost & flaws

 Resulting timing constraints:

 For each WCHB C-element in the cell library, disable timing arcs to cut the loops

set_disable_timing C_element -from in -to out

 For each path segment between two WCHB C-elements, specify a target maximum delay

set_max_delay -from C/elt/inst1/out -to C/elt/inst2/in 0.5ns

 Limitation: The WCHB C-elements themselves are not optimized

 Minimal or no drive adaptation of cells depending on cell load

 No consideration of signal slope at path ends

 Cells can be moved back and forth during placement

Synchronous CAD tools do not manage asynchronous path ends correctly

 Use pseudo-synchronous models for WCHB C-elements

to cut timing loops without disabling timing arcs

to improve tool control over path ends

(6)

Pseudo-synchronous circuit timing paths

 Loops are cut naturally at pseudo-synchronous C-elts

 No need to disable a timing arc

 Creates 2 kinds of paths in WCHB pipeline:

forward paths

backward paths

 How to derive pseudo synchronous models ?

 How to constrain resulting paths ?

[Figure: three-stage WCHB pipeline with pseudo-synchronous C-elements (Reset=clk); forward paths run through Fwd Logic and backward paths through Bwd Logic between C-elements]

(7)

Asynchronous .lib characterization

 .lib files in Liberty format to model cell timing arcs

 As a function of input transition times and output capacitance

 4 values per arc : rise delay, fall delay, rise transition, fall transition

[Figure: C-element with inputs A, B and Reset and output Z; waveform of the A→Z arc when B=1 and Reset is inactive, showing rise_delay and rise_tran]

rise_delay(A→Z):

A input transition    Z output capacitance
                      10fF     40fF     100fF
10ps                  30ps     120ps    200ps
80ps                  80ps     160ps    250ps
200ps                 130ps    210ps    300ps

rise_tran(A→Z):

A input transition    Z output capacitance
                      10fF     40fF     100fF
10ps                  12ps     80ps     320ps
80ps                  20ps     85ps     320ps
200ps                 28ps     90ps     320ps

(8)

Pseudo-synchronous .lib derivation

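 C-element is modeled like a synchronous flip-flop

 Reset pin is used as a dummy clock input

 New arc uses first row of A→Z arc, old arcs are turned to setup checks

[Figure: C-element with inputs A, B and Clk (was Reset) and output Z; waveform showing the setup rise constraint of A relative to Clk, then rise_delay and rise_tran of the Clk→Z arc]

rise_delay(Clk→Z) reuses the first row of the previous rise_delay(A→Z) table; rise_tran(Clk→Z) likewise reuses the first row of rise_tran(A→Z).

setup_rise(A→Clk) is computed as the difference between the 1st column of the previous rise_delay(A→Z) and the new rise_delay(Clk→Z):

A input transition    rise_delay(A→Z) per Z output capacitance    setup_rise(A→Clk)
                      10fF     40fF     100fF
10ps                  30ps     120ps    200ps                     0ps
80ps                  80ps     160ps    250ps                     50ps
200ps                 130ps    210ps    300ps                     100ps

rise_tran(A→Z), whose first row gives rise_tran(Clk→Z):

A input transition    Z output capacitance
                      10fF     40fF     100fF
10ps                  12ps     80ps     320ps
80ps                  20ps     85ps     320ps
200ps                 28ps     90ps     320ps

setup_rise(B→Clk): idem with the previous B→Z arc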
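The derivation on this slide can be sketched in a few lines of Python. This is only an illustration using the table values shown on the slides; the function name derive_pseudo_sync and the list-of-rows layout are assumptions made for the example, not part of any characterization tool.

```python
# rise_delay(A->Z) in ps, as shown on the slides.
# Rows: A input transition (10ps, 80ps, 200ps).
# Columns: Z output capacitance (10fF, 40fF, 100fF).
RISE_DELAY_A_TO_Z = [
    [30, 120, 200],
    [80, 160, 250],
    [130, 210, 300],
]


def derive_pseudo_sync(delay_table):
    """Return (clk_to_z_delay, setup_rise) derived from an input->Z delay table.

    Hypothetical helper: mirrors the rule stated on the slide, nothing more.
    """
    # The new Clk->Z arc reuses the first row of the original arc.
    clk_to_z = list(delay_table[0])
    # Setup check per input transition: first column of the original arc
    # minus the new Clk->Z delay at the same (smallest) output capacitance.
    setup = [row[0] - clk_to_z[0] for row in delay_table]
    return clk_to_z, setup


clk_to_z, setup_rise = derive_pseudo_sync(RISE_DELAY_A_TO_Z)
print("rise_delay(Clk->Z) [ps]:", clk_to_z)
print("setup_rise(A->Clk) [ps]:", setup_rise)
```

With the slide's numbers this reproduces the 0ps, 50ps and 100ps setup values listed for the three A input transitions.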

(9)

Simple pseudo-synchronous constraint

 Declaring a clock on the reset signal constrains all paths to a given “dummy” period

Actual asynchronous cycle time given by biggest sum of 2 fwd + 2 bwd delays on the loops (for token + bubble)

as bad as 4x dummy target period

often less (2x-3x) as no hold fixing is done

 Dummy clock period limitation:

 Logic depth can be different on each path

 Relaxes all paths to worst path length

Actual throughput not optimal when forward and backward logic are not balanced (on most critical local loop)

Actual forward latency can be really sub-optimal (given by sum of fwd delays)

 What about over-constraining the design ?

 Negative slack is not a big deal for implementation, circuit is QDI after all !

 But over-constrained paths will distract the optimization kernels…
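As a rough illustration of the cycle-time and latency statements on this slide, the sketch below evaluates a few local loops. All delay values are hypothetical, and describing each loop as four segment delays (fwd, bwd, fwd, bwd) is a simplifying assumption for the example.

```python
def async_cycle_time(loops):
    """Cycle time = biggest sum of 2 forward + 2 backward delays over all loops."""
    return max(sum(segments) for segments in loops)


def forward_latency(fwd_delays):
    """Forward latency = sum of the forward delays along the pipeline."""
    return sum(fwd_delays)


# Hypothetical numbers, in ns. Every segment meets the dummy period.
dummy_period = 0.50
loops = [
    (0.40, 0.25, 0.38, 0.22),  # (fwd, bwd, fwd, bwd) of one local loop
    (0.50, 0.30, 0.42, 0.26),
    (0.35, 0.20, 0.33, 0.21),
]

cycle = async_cycle_time(loops)
ratio = cycle / dummy_period
latency = forward_latency([0.40, 0.50, 0.35])
print(f"cycle time {cycle:.2f} ns = {ratio:.2f}x dummy period")
print(f"forward latency {latency:.2f} ns")
```

Since each of the four segments is bounded by the dummy period, the cycle time can never exceed 4x that period; unbalanced forward and backward logic is what keeps the ratio from dropping further.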

(10)

Refined pseudo-synchronous timing constraints

 Use dummy clock declaration to identify paths, not to constrain design with a given period

 Declare clock to break loops, with any period (e.g. 0ns)

 Override delays on all paths with reg2reg set_max_delay constraints

set_max_delay 0.23ns -from C/elt/inst1 -to C/elt/inst2

(no pins given → preserve all arcs inferred by clock declaration)

 Resulting constraints very similar to asynchronous ones, but with no timing arc disabled

Better control on timing paths for optimization tools

Leverage on all existing asynchronous STA methods to predict performance
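A constraint file following this refined scheme could be generated along these lines. This is a sketch under assumptions: the instance names, the delay values, the clock name dummy_clk and the use of [get_ports ...] are illustrative, and delays are taken to be in the library time unit (ns here).

```python
def refined_constraints(reset_port, max_delays):
    """Build SDC lines: one dummy clock on Reset, one set_max_delay per path.

    max_delays maps (source C-element instance, destination C-element instance)
    to a target delay. Instances are given without pins, as on the slide, so
    that the arcs inferred from the clock declaration are preserved.
    """
    # The dummy clock only serves to break the loops; its period is irrelevant.
    lines = [f"create_clock -name dummy_clk -period 0 [get_ports {reset_port}]"]
    for (src, dst), delay in max_delays.items():
        lines.append(f"set_max_delay {delay} -from {src} -to {dst}")
    return lines


# Hypothetical forward and backward paths between two C-elements.
sdc = refined_constraints(
    "Reset",
    {
        ("C/elt/inst1", "C/elt/inst2"): 0.23,
        ("C/elt/inst2", "C/elt/inst1"): 0.18,
    },
)
print("\n".join(sdc))
```

Keeping the per-path values in a dictionary makes it easy to update them between place-and-route runs.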

(11)

WCHB isochronic forks handling

 Green fork needs no isochronic assumption

Both branches are acknowledged by protocol (C-element on point of reconvergence)

 Red forks should be isochronic (or relaxed)

Only one of the branches is acknowledged (reconvergence on a combinational gate) BUT

they always occur at path ends (previous logic is shared)

Shortest adversary path goes through 2 C-elements and at least 1 inverting bwd logic

Constraining paths through the fork for shortest possible delays (with refined ‘set_max_delay’ constraints) also balances any buffer tree needed at the fork

Adversary path isochronic hypothesis is easily met

[Figure: three-stage WCHB pipeline (Logic Fwd, Logic Bwd, C-elements with Reset) showing a fork with a slow branch and a fast branch; the fast branch is followed by the adversary path, drawn as a 1st segment and a 2nd segment]
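The relaxed isochronic-fork condition suggested by the figure can be written as a simple inequality check. This sketch assumes the usual adversary-path formulation (slow branch shorter than fast branch plus adversary path); the function name, the margin parameter and all delay values are hypothetical.

```python
def fork_is_safe(slow_branch, fast_branch, adversary_segments, margin=0.0):
    """Check the relaxed isochronic-fork assumption.

    The unacknowledged (slow) branch must settle before the transition that
    left through the fast branch comes back via the adversary path.
    All delays share one unit (ns here).
    """
    return slow_branch + margin < fast_branch + sum(adversary_segments)


# Hypothetical delays in ns. The adversary path crosses 2 C-elements and at
# least 1 inverting backward logic stage, so it is long compared with the
# skew between the two branches.
safe = fork_is_safe(
    slow_branch=0.12,
    fast_branch=0.05,
    adversary_segments=(0.20, 0.25),
)
unsafe = fork_is_safe(
    slow_branch=0.60,
    fast_branch=0.05,
    adversary_segments=(0.20, 0.25),
)
print(safe, unsafe)
```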

(12)

Pseudo-synchronous implementation flow

[Figure: implementation flow. Source → < Your preferred asynchronous synthesis method here > → Netlist.ref.v → Map & Opt → Place & IPO → CTS → Route & IPO → Netlist.final.v, GDS, SPEF → Delay Calc → SDF → Final sim. and DRC, LVS… → Tape-out. Files shown alongside the steps: Async.lib (C-element with Reset), PSync.lib (C-element with Reset=clk), PSyncIP.lib, preCTS.sdc, postCTS.sdc, dummy.ctsspec]

(13)

Linear pipeline case study

 Implemented down to layout with Cadence SoC Encounter

 STMicro 65nm LP technology

 Very narrow floorplan 20µm*600µm to model a long NoC link

[Figure: linear WCHB pipeline built from C-elements with Reset, x17 stages, MR4 data]

Physically implemented & optimized with different strategies

Instantiated 4x to inject the 4 different input values on each MR4

(14)

Timing constraints strategies

Asynchronous modeling

combinational loops broken at C-element inputs.

zero-delay target: ‘set_max_delay 0’ on all paths

zero slack: iterations on the place-and-route flow, adjusting per-path ‘set_max_delay’ values until the implementation reports a final slack of 0ps.

-40ps slack: same as above, but stop iterating as soon as the final negative slack is smaller than 40ps.

Pseudo-synchronous modeling

zero-delay target: ‘create_clock Reset -period 0’

simple: ‘create_clock Reset -period N’, with iterations until N cannot be reduced any further with a final slack of 0ps.

zero slack: ‘create_clock Reset -period 0’, plus iterations on per-path ‘set_max_delay’ values until the implementation reports a final slack of 0ps.

-20ps slack: same as above, with a 20ps target.
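The iterative strategies listed on this slide can be pictured as the loop below. It is only a sketch: the update rule (relax each violated target to the delay actually achieved) is an assumption, since the slides do not state how values are adjusted, and toy_flow is a hypothetical stand-in for a real place-and-route run.

```python
def refine_max_delays(run_flow, paths, slack_target=0.0, max_iter=20):
    """Iterate per-path max-delay targets until the worst slack meets slack_target.

    run_flow(targets) must return the achieved delay of every path.
    Use slack_target=-0.040 for a "-40ps slack" style stopping rule (ns units).
    """
    targets = {p: 0.0 for p in paths}  # start from the zero-delay target
    for iteration in range(1, max_iter + 1):
        achieved = run_flow(targets)
        worst_slack = min(targets[p] - achieved[p] for p in paths)
        if worst_slack >= slack_target:
            return targets, iteration
        # Assumed update rule: relax violated targets to what was achieved.
        for p in paths:
            if achieved[p] > targets[p]:
                targets[p] = achieved[p]
    return targets, max_iter


# Hypothetical flow model: each path has a best achievable delay, and
# over-constraining it costs 20% of the violation.
BEST = {"fwd": 0.30, "bwd": 0.20}


def toy_flow(targets):
    result = {}
    for path, best in BEST.items():
        t = targets[path]
        result[path] = t if t >= best else best + 0.2 * (best - t)
    return result


final_targets, iterations = refine_max_delays(toy_flow, list(BEST))
print(final_targets, iterations)
```

In this toy model the loop stops on the second run, once every target equals a delay the flow can actually meet.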

(15)

Benchmarking results @tt65_1.2V_25C

 With asynchronous modeling, disabling timing arcs to break loops at C-elements degrades performance

 Simple and 0-target pseudo-synchronous strategies are comparable in performance

Fewer iterations for 0 target, but slightly bigger area

 Ad-hoc pseudo-synchronous constraints give best results

[Figure: two histograms, number of occurrences versus cycle time (ps, 300 to 650) and versus latency (ps, 175 to 1025), for the strategies Async 0 target, Async 0ps slack, Async -40ps slack, Sync simple, Sync 0 target, Sync 0ps slack, Sync -20ps slack]

(16)

ANoC implementations

 ANoC router made of 6 kinds of WCHB processes

3 per input stage, 3 per output stage

Generic data path size

Any possible combination of input stages and output stages

 60 “generic” ‘set_max_delay’

constraints cover all possible arrangements of processes in NoC topology

60 values to refine for zero-slack strategies

 Recent implementation in 3 chips with industrial partnership in

2011/2012

2D-mesh based, in STMicro 65nm LP

Req-Resp Master-Slave based in STMicro 32nm and 28nm LP

[Figure: chip floorplans and NoC topologies. One diagram shows a 2D mesh of routers numbered 00 to 33 connecting units such as MEPHISTO, TRX_OFDM, ARM11, SME, SME_EXT, SME_WIDEIO, UDECASIP, TX_BIT, RX_BIT and 3D links; another diagram is labelled GANoC-L2 / GANoC-L3]

MAG3D: ST 65nm LP, 16 routers, 2 channels, 34b datapath, 1MGate ANoC

P2012_CO: ST 28nm LP, 10 routers, 76b requests, 68b responses, 400kGate ANoC

(17)

28nm P2012_CO ANoC synthesis results

 According to dummy period:

Area increase up to +30%

cycle time & latency reduction up to -30%

 Ad-hoc pseudo-sync. constraints allow for:

reproducible best performance @ 1280Mflit/s

with reasonable area increase by ~20% compared to under-constrained design

[Figure: Quality of Results. Area (KGate) and critical path length (ns) versus pseudo-sync dummy period (ns, 0.0 to 1.0), with negative-slack and positive-slack regions]

[Figure: Performance at tt28_1.00V_25C. Peak bandwidth (GB/s) and async. end-to-end latency (ns) versus pseudo-sync dummy period (ns, 0.0 to 1.0), with the ad-hoc max delay pseudo-sync constraints marked]

(18)

MAG3D implementation results

 Technology

STMicroelectronics CMOS 65nm low-power process

 Implementation strategy

Pseudo-synchronous hard-macro for routers

Mixed integration on top

Synchronous DfT

Pseudo-synchronous ANoC links

P&R Runtime ~ 17h

 ANoC Area

1M Gate

 Performance @tt65_1.2V_25C

7 routers path

~10 mm links

Average throughput: 850 Mflit/s

Average latency: 9.81ns

[Figure: measured NoC path on the chip layout, ~8.5mm]

(19)

Conclusion

Asynchronous circuits turned synchronous (not really…)

 For the designs → a bit more performance

 DIMS WCHB circuits are not as bad as you would think, are they?

 For the designers → a systematic approach for loop breaking and design constraints

 Large asynchronous designs within easy reach

 For the community → a “benevolent” betrayal

 Don’t banish me, please…

 For the industry → a comfortable, well-known CAD environment

 Energy-efficient off-the-shelf soft IPs

OK, they are actually asynchronous, but only if they ask…

 But will it work for more than ANoC or DIMS WCHB ?

(20)

Pseudo-synchronous timing paths in QDI (PCHB/PCFB/RSPCHB…) pipelines

 Up to 5 types of pseudo-synchronous paths instead of 2

(+ WCHB like paths for state variable in PCFB)

Not necessarily balanced in delays → ad-hoc constraints to be considered, dummy period could be insufficient

 When no Reset input is present on the cells, create and rely on an “internal pin” for dummy clock

pin(dummy) { direction : internal; […] } in .lib file

create_clock -name dummy_clk [$all_dummy_pins_in_design] in .sdc file

 Blue paths form an isochronic fork for “bubbles”

Need special handling to guarantee data deactivation before EN re-activation

[Figure: pipeline stage with EN and Ra signals, highlighting the paths that form the isochronic fork]

(21)

timing arcs diversion and timing margin

 Alternatives for relative delay constraint on isochronic fork

 specify ‘set_data_check’

 reduce max delay constraints separately on both paths to guarantee there is no positive slack

 Add security margin to data arcs

Compatible with simple dummy clk period constraint

Specify margin thanks to dummy clk transition time

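[Figure: cell model with inputs A, B, EN, Ra and a Dummy (or Reset) clock pin, output Z. Setup checks on the data inputs carry a security margin relative to EN, and the modified combinational arc is computed from the EN setup. The timing diagram shows the setup rise constraint of A relative to Clk, the margin specification, then rise_delay and rise_tran on Z]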

(22)

Many thanks to

 My co-authors for their 9-year contribution & support

 The reviewers for their inspiring feedback

 The audience for your questions ?
