IHP
Im Technologiepark 25 15236 Frankfurt (Oder)
Germany
Performance Analysis of GALS Datalink Based on Pausible Clocking
Xin Fan, Milos Krstic, Eckhard Grass fan@ihp-microelectronics.com
Outline
• Pausible clocking based GALS design – an overview
• Performance modeling of typical GALS datalinks
• Upper bound of throughput-tolerant clock/interconnect delays
• Moonrake: a SYNC/GALS OFDM BB TX chip in the 40-nm process
• Conclusions
Pausible clocking based GALS design – an overview
• What’s pausible clocking?
Temporarily stop the clock if its rising edge is too close to the input data, i.e., synchronize the clock activity according to the arrival time of data.
It was once expected to offer high reliability and high performance at low overhead for GALS system design.
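[Figure: pausible clock generator on the RX side. A programmable delay line, a MUTEX and a C-element generate lClk/rClk; the Req/Ack handshake from TX is arbitrated against the clock, and the input data is captured in the RX register.]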
Pausible clocking based GALS design – an overview
• Fundamental limitations of applying pausible clocking:
Once the clock-tree distribution delay is taken into account, the assumptions behind source-clock scheduling for metastability avoidance no longer hold.
Multi-cycle vs. sub-cycle clock tree delay.
$s_{sub} = T - w_{MUTEX} - t_{MUTEX}$, where
$w = w_{MUTEX}$: the MUTEX acknowledge window (off-phase duration of rClk),
$t_{MUTEX}$: the resolution time of the MUTEX under a given MTBF condition.
R. Dobkin et al., “Data synchronization issues in GALS SoCs,” in ASYNC 2004.
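[Figure: the pausible clock generator with a clock tree between the root clock rClk and the leaf clock sClk, and a timing diagram of rClk, sClk and the MUTEX signals ri, gi, ai marking the period T, the window w, the arrival time t and the region $s_{sub}$.]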
Pausible clocking based GALS design – an overview
• Optimization for safe synchronization
For sub-cycle clock-tree delay: maximize the safe timing region.
Two register stages timed by the MUTEX output give $s_{sub} = w$.
Two cascaded delay lines with a feedback loop maximize $w$.
X. Fan et al., “Analysis and optimization of pausible clocking based GALS design,” in ICCD 2009.
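The effect of the two optimizations on the safe timing region can be sketched numerically. The sketch below assumes the basic scheme follows $s_{sub} = T - w - t_{MUTEX}$ and the double-register scheme follows $s_{sub} = w$, as stated on these slides. The function names and the sample values (T = 10, t_MUTEX = 1, in arbitrary time units) are hypothetical and chosen only for illustration.

```python
def safe_region_basic(T, w, t_mutex):
    """Safe region of the basic scheme, assuming s_sub = T - w - t_MUTEX."""
    return T - w - t_mutex


def safe_region_double_register(w):
    """Safe region with two register stages timed by the MUTEX output: s_sub = w."""
    return w


if __name__ == "__main__":
    T, t_mutex = 10.0, 1.0           # hypothetical values, arbitrary time units
    w_basic = T / 2                  # assumption: 50% duty cycle, so w is the off-phase T/2
    w_max = T - t_mutex              # assumption: w stretched to its upper limit T - t_MUTEX
    print(safe_region_basic(T, w_basic, t_mutex))
    print(safe_region_double_register(w_max))
```

Under these assumptions the double-register scheme benefits directly from a larger $w$, whereas in the basic scheme a larger $w$ shrinks the safe region.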
[Figure: optimized pausible clock generator with two cascaded delay lines (Delay A, Delay B) in a feedback loop and a second register stage timed by the MUTEX output.]
Pausible clocking based GALS design – an overview
• Optimization of pausible clocking scheme
For multi-cycle clock-tree delay: the locally delayed latching (LDL) scheme from Technion arbitrates on the leaf registers instead of on the source clock.
Integrate the double-register scheme with LDL to maximize $w$.
R. Dobkin et al., “High rate data synchronization in GALS SoCs,” in TVLSI 2006.
[Figure: LDL-based scheme with a delay line and MUTEX arbitrating against the leaf clock sClk, extended with a second register stage.]
Pausible clocking based GALS design – an overview
• Architecture of pausible clocking based GALS design
Communication across clock domains is initiated by the output/input flow control logic of TX/RX synchronous functional blocks.
Data transfer between TX and RX is scheduled by the asynchronous handshake signals.
• Previous studies report that a pausible clocking based GALS datalink can reach a throughput of one data item every two clock cycles. Does this hold in general?
[Figure: architecture of the GALS datalink. The TX core drives an OPC and the RX core is fed by an IPC; each side has IN_REG/OUT_REG, flow control logic, a MUTEX and a gated ring oscillator, and the two port controllers are connected by handshake signals. An inset timing diagram shows lClk, rClk, sClk and the synchronization latencies $D_s(n)$ and $D_s(n+1)$.]
Performance modeling of typical GALS datalinks
• Data synchronization latency
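The synchronization latency is the interval from the arrival of input data, indicated by the input request, to the sampling of that data by the local clock.
[Figure: timing diagram of rClk, sClk and the MUTEX signals ri, gi, ai, marking the period T, the window w and the arrival time t.]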
Performance modeling of typical GALS datalinks
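• Synchronization latency function $L(t, w, \Delta_{CT})$, where $\Delta_{CT}$ denotes the clock-tree delay:

$$
L(t, w, \Delta_{CT}) =
\begin{cases}
T + \Delta_{CT} - t, & t \in [0, w); \\
3T/2 + \Delta_{CT} - w, & t \in [w, w]; \\
2T + \Delta_{CT} - t, & t \in (w, T).
\end{cases}
$$

For a uniform distribution of $t$ within $[0, T)$, we have:

$$
\bar{l} = \frac{1}{T} \int_0^T L(t, w, \Delta_{CT}) \, dt = \left( \frac{3}{2} - \frac{w - \Delta_{CT}}{T} \right) T .
$$

$\bar{l}$ is determined by the effective acknowledge window of the MUTEX, $(w - \Delta_{CT})$. If $(w - \Delta_{CT}) \ge T/2$, a sub-cycle latency can be achieved on average.
Given $w$, the minimum latency is obtained when $\Delta_{CT} = 0$.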
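The closed-form average can be cross-checked numerically. The sketch below assumes the piecewise latency reads $T + \Delta_{CT} - t$ for $t < w$, $2T + \Delta_{CT} - t$ for $t > w$, and the mean of the two at $t = w$. The function names and sample values are hypothetical.

```python
def sync_latency(t, w, T, d_ct=0.0):
    """Synchronization latency for a request arriving at time t in [0, T)."""
    if not 0.0 <= t < T:
        raise ValueError("t must lie in [0, T)")
    if t < w:                      # request granted within the acknowledge window
        return T + d_ct - t
    if t > w:                      # request missed the window, waits one more cycle
        return 2.0 * T + d_ct - t
    return 1.5 * T + d_ct - w      # t == w: either outcome, averaged


def mean_latency_closed_form(w, T, d_ct=0.0):
    """Average latency (3/2 - (w - d_ct)/T) * T for uniformly distributed t."""
    return (1.5 - (w - d_ct) / T) * T


def mean_latency_numeric(w, T, d_ct=0.0, n=100_000):
    """Midpoint-rule estimate of (1/T) * integral of the latency over [0, T)."""
    step = T / n
    total = sum(sync_latency((i + 0.5) * step, w, T, d_ct) for i in range(n))
    return total * step / T


if __name__ == "__main__":
    T, w, d_ct = 10.0, 8.0, 1.0    # hypothetical values, arbitrary time units
    print(mean_latency_closed_form(w, T, d_ct))
    print(mean_latency_numeric(w, T, d_ct))
    # Sub-cycle on average whenever the effective window w - d_ct is at least T/2.
    print(mean_latency_closed_form(w, T, d_ct) <= T)
```

With these sample values the effective window is $w - \Delta_{CT} = 7 \ge T/2$, so the average latency stays below one clock period.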
Performance modeling of typical GALS datalinks
• Data throughput in burst-mode communication
The handshake loop delay of the asynchronous datalink consists of three parts:

$$ T^{Loop}_{avg} = d^{avg}_{OPC} + d^{avg}_{IPC} + d_{INT}. $$

• The optimization target:

$$ T^{Loop}_{avg} \le \max(T_{TX}, T_{RX}). $$

[Figure: architecture of the GALS datalink, with the handshake loop through the OPC on the TX side, the interconnect, and the IPC on the RX side.]
Performance modeling of typical GALS datalinks
• An example: tightly coupled GALS datalink
[Figure: gate-level schematic of the tightly coupled OPC and IPC, each with a MUTEX and a gated ring oscillator (signals op_te, op_ta, op_rp, op_ap, op_ri, op_ai, op_gi on the TX side; ip_te, ip_ta, ip_rp, ip_ap, ip_ri, ip_ai, ip_gi on the RX side), together with the signal transition graphs of the OPC and IPC handshakes.]
Performance modeling of typical GALS datalinks
• Average loop period of the tightly coupled datalink
It is the sum of the average synchronization latencies on the TX and RX sides and the interconnect delay between OPC and IPC, with $\Delta^{TX}_{CT}$ and $\Delta^{RX}_{CT}$ the clock-tree delays on each side:

$$
T^{Loop}_{avg} = \left( \frac{3}{2} - \frac{w_{RX} - \Delta^{RX}_{CT}}{T_{RX}} \right) T_{RX}
+ \left( \frac{3}{2} - \frac{w_{TX} - \Delta^{TX}_{CT}}{T_{TX}} \right) T_{TX} + d_{INT}.
$$

Given $T_{TX}/T_{RX}$ close to 1, so that $T_{TX} \approx T_{RX} = T$, we can derive the throughput-tolerant condition on clock-tree and interconnect delays:

$$ (w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge 2T + d_{INT}. $$

Because $w < T$, this requirement can never be satisfied, even with zero clock-tree and interconnect delay.
A throughput drop is therefore unavoidable in the tightly coupled datalink, even if TX and RX differ only slightly in operating frequency.
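The argument that the tightly coupled link cannot meet its target can be checked with a few lines of arithmetic. The sketch below assumes the average loop period is the sum of the two average synchronization latencies, $(3/2)T - (w - \Delta_{CT})$ on each side, plus the interconnect delay. The function names and sample values are hypothetical.

```python
def avg_sync_latency(T, w, d_ct):
    """Average synchronization latency (3/2)T - (w - d_ct) on one side."""
    return 1.5 * T - (w - d_ct)


def tight_loop_period(T_tx, w_tx, ct_tx, T_rx, w_rx, ct_rx, d_int):
    """Average handshake loop period of the tightly coupled datalink."""
    return (avg_sync_latency(T_rx, w_rx, ct_rx)
            + avg_sync_latency(T_tx, w_tx, ct_tx)
            + d_int)


def tight_condition_holds(T, w_tx, ct_tx, w_rx, ct_rx, d_int):
    """Throughput-tolerant condition for T_tx ~ T_rx = T."""
    return (w_tx - ct_tx) + (w_rx - ct_rx) >= 2.0 * T + d_int


if __name__ == "__main__":
    T = 10.0                       # hypothetical clock period, arbitrary time units
    w = 9.0                        # any w < T
    loop = tight_loop_period(T, w, 0.0, T, w, 0.0, 0.0)
    print(loop, loop <= T)
    print(tight_condition_holds(T, w, 0.0, w, 0.0, 0.0))
```

Even with zero clock-tree and interconnect delay, the loop period exceeds $T$ whenever $w < T$, which matches the statement on the slide.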
Upper bound of throughput-tolerant clock-tree/interconnect delay
• Loosely coupled GALS datalinks
(a) Reduce $d_{IPC}$ by introducing concurrency in the IPC:

$$ (w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge T + d_{INT}. $$

(b) Reduce $d_{OPC}$ by stopping the TX local clock if possible:

$$ (-\Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge d_{INT}. $$

[Figure: signal transition graphs of the OPC and IPC handshakes for the two loosely coupled variants (a) and (b).]
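Condition (a) can be turned into a delay budget. The sketch below assumes condition (a) reads $(w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge T + d_{INT}$ and solves it for the largest combined clock-tree delay. The function names and sample values are hypothetical, and variant (b) is deliberately left out.

```python
def max_total_clock_tree_delay(T, w_tx, w_rx, d_int):
    """Largest ct_tx + ct_rx that still satisfies condition (a).

    A negative result means the condition fails even with zero clock-tree delay.
    """
    return w_tx + w_rx - T - d_int


def condition_a_holds(T, w_tx, ct_tx, w_rx, ct_rx, d_int):
    """Direct check of the assumed condition (a)."""
    return (w_tx - ct_tx) + (w_rx - ct_rx) >= T + d_int


if __name__ == "__main__":
    T, w, d_int = 10.0, 9.0, 2.0   # hypothetical values, arbitrary time units
    budget = max_total_clock_tree_delay(T, w, w, d_int)
    print(budget)
    print(condition_a_holds(T, w, budget / 2, w, budget / 2, d_int))
```

Unlike the tightly coupled case, the budget is positive whenever $w_{TX} + w_{RX} > T + d_{INT}$, so some clock-tree delay can be absorbed without losing throughput.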
Upper bound of throughput-tolerant clock-tree/interconnect delay
• Comparison in data throughput of tightly coupled GALS datalink
[Figure: data per cycle (0.30 to 0.60) versus clock ratio $T_{TX}/T_{RX}$ (1.28 down to 0.73), comparing simulation with Equation (6) for four settings: $d_{INT} = 0, \Delta_{CT} = 0$; $d_{INT} = T_{RX}/4, \Delta_{CT} = 0$; $d_{INT} = 0, \Delta_{CT} = T_{RX}/4$; and $d_{INT} = T_{RX}/4, \Delta_{CT} = T_{RX}/4$.]
Upper bound of throughput-tolerant clock-tree/interconnect delay
• Comparison in throughput-tolerant delay in loosely coupled datalinks
[Figure: (a) maximum tolerable TX clock-tree delay $(\Delta^{TX}_{CT})_{max}$ in ns (0.0 to 3.5) versus clock ratio $T_{TX}/T_{RX}$ (1.28 down to 1.02), simulation versus Equation (8); (b) maximum tolerable RX clock-tree delay $(\Delta^{RX}_{CT})_{max}$ in ns (0.0 to 6.0) versus clock ratio $T_{TX}/T_{RX}$ (0.99 down to 0.73), simulation versus Equation (9). Each panel shows several combinations of $d_{INT}$ and the clock-tree delay on the opposite side.]
Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip
• Architecture of GALS design
System partitioning takes both functional and physical criteria into account;
6 GALS blocks balanced in area and power dissipation;
16 asynchronous datalinks (32 IPC/OPC).
[Figure: block diagram of the GALS TX. The datapath comprises the input data FIFO, universal scrambler, universal FEC encoder, interleaver, symbol mapper, pilot inserter, IFFT and output stage, with input and middle control. It is partitioned into GALS blocks 1 to 6, each with its own pausible clock generator, connected through IPC/OPC pairs.]
Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip
• Comparison of GALS TX throughput with different datalink implementations
A data frame with QPSK modulation was processed by the GALS TX.
Simulation setup: RTL of the TX functional blocks, netlists of the IPC/OPC and clock generators, with clock-tree and interconnect delays back-annotated after layout.
[Figure: two bar charts (vertical axis 0.5 to 1.0) for five datalink configurations: OPC-I to IPC-I w/o buffer, OPC-I to IPC-II w/o buffer, OPC-II to IPC-II w/o buffer, OPC-I to IPC-I w buffer, OPC-I to IPC-II w buffer. One chart compares simulations with worst-case and best-case delays; the other compares simulations at $d_{INT} = 0$, $T/4$, $T/2$ and $3T/4$.]
Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip
• Power and area comparison of SYNC/GALS TXs
                Cell area (µm²)                 Power dissipation (mW)
                Post-synthesis   Post-layout    Post-layout   Measurement
SYNC w/o PLL    2206895          2234712        234           252
GALS TX         2225823          2220080        225           237

         GALS_BLK 1  GALS_BLK 2  GALS_BLK 3  GALS_BLK 4  GALS_BLK 5  GALS_BLK 6  TOTAL
AREA     19%         18%         18%         18%         10%         17%         100%
POWER    12%         17%         17%         17%         18%         19%         100%
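The relative difference between the two designs follows directly from the figures in the table. The short sketch below computes it as a percentage of the SYNC value; the helper name is hypothetical, and the numbers are copied from the table above.

```python
def relative_change_percent(gals, sync):
    """Change of the GALS figure relative to the SYNC figure, in percent."""
    return (gals - sync) / sync * 100.0


# Values taken from the SYNC w/o PLL and GALS TX rows of the table.
AREA_POST_SYNTHESIS = relative_change_percent(2225823, 2206895)
AREA_POST_LAYOUT = relative_change_percent(2220080, 2234712)
POWER_POST_LAYOUT = relative_change_percent(225, 234)
POWER_MEASURED = relative_change_percent(237, 252)

if __name__ == "__main__":
    for name, value in [
        ("area, post-synthesis", AREA_POST_SYNTHESIS),
        ("area, post-layout", AREA_POST_LAYOUT),
        ("power, post-layout", POWER_POST_LAYOUT),
        ("power, measurement", POWER_MEASURED),
    ]:
        print(f"{name}: {value:+.2f}%")
```

A negative value means the GALS TX figure is below the SYNC figure; in this table only the post-synthesis area is higher for the GALS TX.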
Conclusions
• The performance (synchronization latency and data throughput) of pausible clocking based GALS datalinks is dominated by $w$, which is in turn determined by the target resolution time of the MUTEX:
MTBF → $t_{MUTEX}$ → $w = T - t_{MUTEX}$.
• This can cause a large performance penalty in speed-aggressive designs, such as high-speed microprocessors. On the other hand, the scheme is well suited to the integration of complex systems running at moderate speed.
• By the optimization of IPC/OPC, it is possible to tolerate clock and interconnect delays, to some extent, with little performance drop.
• The marginal hardware overhead caused by the GALS infrastructure based on pausible clocking can be compensated at the system level.
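The first two conclusions can be illustrated with a small calculation. The sketch below assumes $w = T - t_{MUTEX}$ and the average latency $(3/2)T - (w - \Delta_{CT})$ from the latency model, and expresses the result in clock cycles. The function names and the sample resolution time are hypothetical.

```python
def ack_window(T, t_mutex):
    """Acknowledge window w = T - t_MUTEX for a given MUTEX resolution time."""
    if t_mutex >= T:
        raise ValueError("t_mutex must be smaller than the clock period T")
    return T - t_mutex


def mean_latency_cycles(T, t_mutex, d_ct=0.0):
    """Average synchronization latency in clock cycles."""
    w = ack_window(T, t_mutex)
    return (1.5 * T - (w - d_ct)) / T


if __name__ == "__main__":
    t_mutex = 0.5                  # hypothetical resolution time, in ns
    for T in (1.0, 2.0, 10.0):     # hypothetical clock periods, in ns
        print(T, mean_latency_cycles(T, t_mutex))
```

With zero clock-tree delay the latency in cycles reduces to $1/2 + t_{MUTEX}/T$, so a fixed resolution time costs proportionally more as the clock period shrinks.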
Thank you!
Questions?