Performance Analysis of GALS Datalink Based on Pausible Clocking

(1)

IHP

Im Technologiepark 25 15236 Frankfurt (Oder)

Germany

Xin Fan, Milos Krstic, Eckhard Grass fan@ihp-microelectronics.com

(2)

Outline

• Pausible clocking based GALS design – an overview

• Performance modeling of typical GALS datalinks

• Upper bound of throughput-tolerant clock/interconnect delays

• Moonrake: a SYNC/GALS OFDM BB TX chip in the 40-nm process

• Conclusions

(3)

Pausible clocking based GALS design – an overview

• What’s pausible clocking?

Temporarily stop the clock if its rising edge is too close to the input data, i.e., synchronize the clock activity according to the arrival time of data.

It was once expected to be a high-reliability, high-performance, low-overhead solution for GALS system design.

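[Figure: pausible-clocking input port. A MUTEX arbitrates between Req and rClk and issues Ack/Gnt; a programmable delay line and a C-element generate lClk (reset by nRst); a register and a flip-flop latch the data from TX to RX.]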

(4)

Pausible clocking based GALS design – an overview

• Fundamental limitations of applying pausible clocking:

Once the clock-tree distribution delay is taken into account, the assumptions that source-clock scheduling relies on to avoid metastability no longer hold.

Multi-cycle vs. sub-cycle clock tree delay.

$s_{sub} = T - w - t_{MUTEX}$, where

w: the MUTEX acknowledge window (off-phase duration of rClk),

t_MUTEX: the resolution time of the MUTEX under a certain MTBF condition.

R. Dobkin et al., “Data synchronization issues on GALS SoCs,” in ASYNC 2004.

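[Figure: pausible-clocking input port with clock-tree delay from lClk to sClk (MUTEX, C-element, programmable delay line, register, flip-flop), and a timing diagram of sClk, rClk and the MUTEX signals ri, ai, gi, annotated with w, T, t and s_sub.]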

(5)

Pausible clocking based GALS design – an overview

• Optimization for safe synchronization

For sub-cycle clock tree delay: maximize the safe timing region.

Use two register stages clocked by the MUTEX output, which gives $s_{sub} = w$.

Two cascaded delay lines with a feedback loop to maximize w.

X. Fan et al., “Analysis and optimization of pausible clocking based GALS design,” in ICCD 2009.

[Figure: optimized input port with two cascaded delay lines (Delay A, Delay B), a MUTEX, a C-element, and two register stages between TX and RX; signals Req, Ack, Gnt, rClk, lClk, sClk, nRst.]

(6)

Pausible clocking based GALS design – an overview

• Optimization of pausible clocking scheme

For multi-cycle clock-tree delay: the locally delayed latching (LDL) scheme from Technion arbitrates on the leaf registers instead of on the source clock.

Integrate the double-register scheme with LDL to maximize w.

R. Dobkin et al., “High rate data synchronization in GALS SoCs,” in TVLSI 2006.

[Figure: LDL-based input port with a delay line, a MUTEX, and two register stages between TX and RX; signals Req, Ack, Gnt, rClk, lClk, sClk.]

(7)

Pausible clocking based GALS design – an overview

• Architecture of pausible clocking based GALS design

Communication across clock domains is initiated by the output/input flow control logic of TX/RX synchronous functional blocks.

Data transfer between TX and RX is scheduled by the asynchronous handshake signals.

• Previous studies show that the pausible clocking based GALS datalink could reach a throughput of one data item every second clock cycle!?

[Figure: GALS datalink architecture. The TX core (OUT_REG, OUT_FLOW_CNTR) drives the output port controller OPC (op_te, op_ta, op_rp, op_ap, op_ri, op_ai, tx_clk); the input port controller IPC (ip_rp, ip_ap, ip_ri, ip_ai, ip_ta, ip_te, rx_clk) feeds the RX core (IN_REG, IN_FLOW_CNTR) through DL_REG. Each side has a gated ring oscillator with a MUTEX; OPC and IPC are connected by handshake signals.]

(8)

Performance modeling of typical GALS datalinks

• Data synchronization latency

The time interval from receiving an input data item, indicated by the input request, to sampling that data item with the local clock.

[Figure: two timing diagrams. The first shows sClk, rClk, lClk, rp/ap, te, ri and the data items Ds(n) and Ds(n+1), annotated with T, the clock-tree delay and t. The second shows sClk, rClk and the MUTEX signals ri, ai, gi, annotated with w, t and T.]

(9)

Performance modeling of typical GALS datalinks

• Synchronization latency function $L(t, w, \Delta_{CT})$

$$
L(t, w, \Delta_{CT}) =
\begin{cases}
T - t + \Delta_{CT}, & t \in [0, w); \\
3T/2 - w + \Delta_{CT}, & t \in [w, w]; \\
2T - t + \Delta_{CT}, & t \in (w, T).
\end{cases}
$$

For the uniform distribution of t within [0, T), we have:

$$
l = \frac{1}{T} \int_0^T L(t, w, \Delta_{CT}) \, dt = \frac{3}{2} T - (w - \Delta_{CT}).
$$

l is determined by the effective acknowledge window of the MUTEX, $(w - \Delta_{CT})$. If $(w - \Delta_{CT}) > T/2$, a sub-cycle latency can be achieved on average.

Given w, the minimum latency is obtained when $\Delta_{CT} = 0$.
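The closed form of the average latency can be checked numerically. The sketch below assumes the piecewise latency reads T - t + d_ct for t in [0, w), 3T/2 - w + d_ct at t = w, and 2T - t + d_ct for t in (w, T), where d_ct stands for the clock-tree delay. The function names and the numeric values are illustrative only and do not come from the slides.

```python
def sync_latency(t, w, d_ct, T):
    """Piecewise synchronization latency for a request arriving at phase t in [0, T)."""
    if t < w:
        return T - t + d_ct          # first branch: t in [0, w)
    if t == w:
        return 1.5 * T - w + d_ct    # single point t = w
    return 2 * T - t + d_ct          # last branch: t in (w, T)


def average_latency(w, d_ct, T, n=10_000):
    """Average of sync_latency over uniformly distributed t (midpoint rule)."""
    step = T / n
    total = sum(sync_latency((i + 0.5) * step, w, d_ct, T) for i in range(n))
    return total / n


# Hypothetical values in arbitrary time units.
T, w, d_ct = 10.0, 7.0, 1.0
numeric = average_latency(w, d_ct, T)
closed_form = 1.5 * T - (w - d_ct)
print(numeric, closed_form)
print("sub-cycle on average:", closed_form < T, "| w - d_ct > T/2:", (w - d_ct) > T / 2)
```

With these values the effective window w - d_ct exceeds T/2, so the average latency stays below one clock period, in line with the condition stated above.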

(10)

Performance modeling of typical GALS datalinks

• Data throughput in burst-mode communication

The handshake loop delay of an asynchronous datalink consists of three parts:

$$
T^{Loop}_{avg} = d^{IPC}_{avg} + d^{OPC}_{avg} + d_{INT}.
$$

• The optimization target:

$$
T^{Loop}_{avg} \le \max(T_{TX}, T_{RX}).
$$

[Figure: GALS datalink architecture with TX core, OPC, handshake signals, IPC and RX core, each side clocked by a gated ring oscillator with a MUTEX; same signals as in the architecture overview.]

(11)

Performance modeling of typical GALS datalinks

• An example: tightly coupled GALS datalink

[Figure: gate-level schematic of the tightly coupled OPC and IPC between the TX and RX gated ring oscillators (with MUTEXes), together with the signal transition sequences of OPC (op_te, op_rp, op_ap, op_ri, op_ai, op_ta) and IPC (ip_rp, ip_ap, ip_ri, ip_ai, ip_ta, ip_te).]

(12)

Performance modeling of typical GALS datalinks

• Average loop period of the tightly coupled datalink

It is the sum of the average synchronization latencies on the TX and RX sides and the interconnect delay between OPC and IPC:

$$
T^{Loop}_{avg} = \left[ \frac{3}{2} T_{RX} - (w_{RX} - \Delta^{RX}_{CT}) \right] + \left[ \frac{3}{2} T_{TX} - (w_{TX} - \Delta^{TX}_{CT}) \right] + d_{INT}.
$$

Given T_TX/T_RX close to 1 (so that T_TX ≈ T_RX ≈ T), we can derive the throughput-tolerant condition on the clock-tree and interconnect delays:

$$
(w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge 2T + d_{INT}.
$$

Note that, because w < T, this requirement can never be satisfied, even with zero clock-tree and interconnect delay.

This means a throughput drop is unavoidable in the tightly coupled datalink, even if TX and RX have only a tiny mismatch in operating frequency!
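A minimal sketch of this argument, assuming the average loop period is the sum of the two per-side average latencies 3T/2 - (w - d_ct) plus the interconnect delay d_int, and that throughput is preserved only when the loop period does not exceed max(T_TX, T_RX). Function names and numbers are hypothetical.

```python
def avg_latency(T, w, d_ct):
    """Average synchronization latency on one side: 3T/2 - (w - d_ct)."""
    return 1.5 * T - (w - d_ct)


def tight_loop_period(T_tx, w_tx, dct_tx, T_rx, w_rx, dct_rx, d_int):
    """Average handshake loop period of the tightly coupled datalink."""
    return avg_latency(T_tx, w_tx, dct_tx) + avg_latency(T_rx, w_rx, dct_rx) + d_int


def is_throughput_tolerant(T_tx, w_tx, dct_tx, T_rx, w_rx, dct_rx, d_int):
    """True if the loop period fits within the slower of the two clock periods."""
    loop = tight_loop_period(T_tx, w_tx, dct_tx, T_rx, w_rx, dct_rx, d_int)
    return loop <= max(T_tx, T_rx)


# Hypothetical best case: equal periods, zero clock-tree and interconnect delay.
T = 10.0
loop = tight_loop_period(T, 9.0, 0.0, T, 9.0, 0.0, 0.0)
print(loop)  # 12.0, already longer than T

# Sweep w over [0, T) on both sides: the condition never holds.
ever_ok = any(
    is_throughput_tolerant(T, k * T / 100, 0.0, T, k * T / 100, 0.0, 0.0)
    for k in range(100)
)
print(ever_ok)  # False
```

Even with w as large as 0.99 T on both sides, the two latencies sum to more than one clock period, which is the reason the tightly coupled link cannot sustain one data item per cycle.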

(13)

Upper bound of throughput-tolerant clock-tree/interconnect delay

• Loosely coupled GALS datalinks

(a) Reduce d_IPC by introducing concurrency in IPC:

$$
(w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge T + d_{INT}.
$$

(b) Reduce d_OPC by stopping the TX local clock if possible:

$$
(-\Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge d_{INT}.
$$

[Figure: signal transition sequences of OPC (op_te, op_rp, op_ap, op_ri, op_ai, op_ta) and IPC (ip_rp, ip_ap, ip_ri, ip_ai, ip_ta, ip_te) for variants (a) and (b).]
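The two conditions can be turned into upper bounds on the tolerable clock-tree delay. The sketch below assumes they read (w_TX - dct_TX) + (w_RX - dct_RX) >= T + d_INT for variant (a) and -dct_TX + (w_RX - dct_RX) >= d_INT for variant (b), where dct denotes the clock-tree delay. The helper names and the numbers are illustrative only.

```python
def condition_a(w_tx, dct_tx, w_rx, dct_rx, T, d_int):
    """Variant (a): concurrency in the IPC."""
    return (w_tx - dct_tx) + (w_rx - dct_rx) >= T + d_int


def condition_b(dct_tx, w_rx, dct_rx, d_int):
    """Variant (b): TX local clock stopped where possible."""
    return -dct_tx + (w_rx - dct_rx) >= d_int


def max_rx_clock_tree_delay_a(w_tx, dct_tx, w_rx, T, d_int):
    """Largest dct_rx that still satisfies condition (a)."""
    return (w_tx - dct_tx) + w_rx - T - d_int


def max_tx_clock_tree_delay_b(w_rx, dct_rx, d_int):
    """Largest dct_tx that still satisfies condition (b)."""
    return w_rx - dct_rx - d_int


# Hypothetical numbers in ns: T = 10, w = 9 on both sides.
bound_a = max_rx_clock_tree_delay_a(w_tx=9.0, dct_tx=0.0, w_rx=9.0, T=10.0, d_int=2.0)
bound_b = max_tx_clock_tree_delay_b(w_rx=9.0, dct_rx=2.0, d_int=2.0)
print(bound_a, bound_b)
```

Both bounds shrink linearly as the interconnect delay or the clock-tree delay on the other side grows, so a delay budget can be traded between the two.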

(14)

Upper bound of throughput-tolerant clock-tree/interconnect delay

• Comparison in data throughput of tightly coupled GALS datalink

[Figure: data per cycle (0.30 to 0.60) versus clock ratio T_TX/T_RX (1.28 down to 0.73). Simulation results are compared with Equation (6) for four settings: d_INT = 0, Δ_CT = 0; d_INT = T_RX/4, Δ_CT = 0; d_INT = 0, Δ_CT = T_RX/4; d_INT = T_RX/4, Δ_CT = T_RX/4.]

(15)

Upper bound of throughput-tolerant clock-tree/interconnect delay

• Comparison in throughput-tolerant delay in loosely coupled datalinks

[Figure: panels (a) and (b) plot the maximum tolerable clock-tree delays (Δ_CT^TX)max and (Δ_CT^RX)max in ns against the clock ratio (1.28 down to 0.73). Simulation results are compared with Equations (8) and (9) for several settings of d_INT and of the clock-tree delay on the opposite side.]

(16)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

• Architecture of GALS design

System partitioning accounts for both functional and physical criteria;

6 GALS blocks balanced in area and power dissipation;

16 asynchronous datalinks (32 IPC/OPC).

[Figure: block diagram of the GALS TX. Six GALS blocks (GALS BLOCK 1 to 6), each with its own pausible clock generator, are connected through IPC/OPC ports. Functional units shown: input data FIFO, input control, middle control, universal scrambler, universal FEC encoder [12:1], interleaver interface, interleavers [2:1], [4:3], [6:5], symbol mapping, mapper [4:1], pilot inserter, IFFT 64p [4:1], IFFT 4p, output stage.]

(17)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

• Comparison of GALS TX throughput with different datalinks

A data frame with QPSK modulation was processed by the GALS TX.

Simulation setup: RTL of the TX functional blocks and netlists of the IPC/OPC and clock generators, with clock-tree/interconnect delays back-annotated after layout.

[Figure: two bar charts over five datalink configurations (OPC-I to IPC-I w/o buffer, OPC-I to IPC-II w/o buffer, OPC-II to IPC-II w/o buffer, OPC-I to IPC-I w buffer, OPC-I to IPC-II w buffer), value axis from 0.5 to 1.0. One chart compares simulations with worst-case and best-case delays; the other compares simulations at d_INT = 0, T/4, T/2 and 3T/4.]

(18)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

• Power and area comparison of SYNC/GALS TXs

| Design | Cell area, post-synthesis (µm²) | Cell area, post-layout (µm²) | Power, post-layout (mW) | Power, measurement (mW) |
| SYNC w/o PLL | 2206895 | 2234712 | 234 | 252 |
| GALS TX | 2225823 | 2220080 | 225 | 237 |

| | GALS_BLK 1 | GALS_BLK 2 | GALS_BLK 3 | GALS_BLK 4 | GALS_BLK 5 | GALS_BLK 6 | TOTAL |
| AREA | 19% | 18% | 18% | 18% | 10% | 17% | 100% |
| POWER | 12% | 17% | 17% | 17% | 18% | 19% | 100% |

(19)

Conclusions

• The performance (synchronization latency and data throughput) of pausible clocking based GALS datalinks is dominated by w, which is in turn determined by the target resolution time of MUTEX:

MTBF → t_MUTEX → w = T - t_MUTEX.

• This can cause a large performance penalty in speed-aggressive designs, such as high-speed microprocessors. On the other hand, pausible clocking is well suited to integrating complicated systems that run at moderate speed.

• By the optimization of IPC/OPC, it is possible to tolerate clock and interconnect delays, to some extent, with little performance drop.

• The marginal hardware overhead caused by the GALS infrastructure based on pausible clocking can be compensated at the system level.
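The chain from MTBF to t_MUTEX to w can be illustrated with a small calculation. The slides do not give the MTBF model, so the sketch below assumes the commonly used exponential metastability model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The parameter values (tau, T_w, the frequencies and the MTBF target) are hypothetical and not taken from the Moonrake chip.

```python
import math


def resolution_time(mtbf, tau, t_window, f_clk, f_data):
    """Resolution time needed to reach a target MTBF (assumed exponential model)."""
    return tau * math.log(mtbf * t_window * f_clk * f_data)


def ack_window(T, t_mutex):
    """Acknowledge window from the slides: w = T - t_MUTEX."""
    return T - t_mutex


# Hypothetical parameters: 500 MHz clock, 250 MHz request rate, 20 ps tau and window.
tau, t_window = 20e-12, 20e-12
f_clk, f_data = 500e6, 250e6
target_mtbf = 1e10  # seconds

t_mutex = resolution_time(target_mtbf, tau, t_window, f_clk, f_data)
T = 1.0 / f_clk
w = ack_window(T, t_mutex)
print(f"t_MUTEX = {t_mutex * 1e9:.3f} ns, w = {w * 1e9:.3f} ns, w/T = {w / T:.2f}")
```

Under this assumed model a stricter MTBF target lengthens t_MUTEX only logarithmically, but at short clock periods even a fixed t_MUTEX takes a large share of T, which is why w limits speed-aggressive designs.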

(20)

Thank you!

Questions?