Performance Analysis of GALS Datalink Based on Pausible Clocking

Academic year: 2022


(1)

IHP

Im Technologiepark 25 15236 Frankfurt (Oder)

Germany

Performance Analysis of GALS Datalink Based on Pausible Clocking

Xin Fan, Milos Krstic, Eckhard Grass fan@ihp-microelectronics.com

(2)

Outline

Pausible clocking based GALS design – an overview

Performance modeling of typical GALS datalinks

Upper bound of throughput-tolerant clock/interconnect delays

Moonrake: a SYNC/GALS OFDM BB TX chip in the 40-nm process

Conclusions

(3)

Pausible clocking based GALS design – an overview

What’s pausible clocking?

Temporarily stop the clock if its rising edge is too close to the input data, i.e., synchronize the clock activity according to the arrival time of data.

It was once expected to be a high-reliability, high-performance, yet low-overhead solution for GALS system design.

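[Figure: pausible clocking scheme, showing the TX data register, the RX flip-flop (FF), the Req/Ack handshake, the MUTEX (Gnt, nRst), the C-element, the programmable delay line, and the clocks rClk and lClk.]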

(4)

Pausible clocking based GALS design – an overview

Fundamental limitations of applying pausible clocking:

Once the clock-tree distribution delay is taken into account, the assumptions behind scheduling the source clock to avoid metastability no longer hold.

Multi-cycle vs. sub-cycle clock-tree delay.

For a sub-cycle clock-tree delay, the safe timing region is $s_{sub} = T - w_{MUTEX} - t_{MUTEX}$, where

$w_{MUTEX}$ (written $w$ in the following): the MUTEX acknowledge window (off-phase duration of rClk),

$t_{MUTEX}$: the resolution time of the MUTEX under a given MTBF condition.
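As a rough numerical illustration of the safe timing region, the following Python sketch evaluates s_sub = T - w - t_MUTEX. The function name and the example numbers (a 10 ns period, a 5 ns acknowledge window and a 0.5 ns MUTEX resolution time) are illustrative assumptions, not values from the talk.

```python
def safe_region_basic(period, ack_window, t_mutex):
    """Safe clock-tree delay region s_sub = T - w - t_MUTEX.

    All arguments use the same time unit. The formula is taken as read
    from the slide; the parameter names are hypothetical.
    """
    if not 0 < ack_window < period:
        raise ValueError("acknowledge window must satisfy 0 < w < T")
    return period - ack_window - t_mutex


if __name__ == "__main__":
    # Assumed example: T = 10 ns, w = 5 ns, t_MUTEX = 0.5 ns.
    print(safe_region_basic(10.0, 5.0, 0.5))
```

A larger acknowledge window or a longer resolution time both shrink the region in this basic scheme, which is why the window and the safe region have to be traded off against each other.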

R. Dobkin et al., “Data synchronization issues on GALS SoCs,” in ASYNC 2004.

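[Figure: pausible clocking scheme with clock-tree delay, showing the TX data register, the RX flip-flop (FF), the Req/Ack handshake, the MUTEX (Gnt, nRst), the C-element, the programmable delay line, and the clocks rClk, lClk and sClk; below it, a timing diagram of sClk, rClk and the MUTEX signals ri, ai, gi, marking T, w, t and $s_{sub}$.]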

(5)

Pausible clocking based GALS design – an overview

Optimization for safe synchronization

For sub-cycle clock tree delay: maximize the safe timing region.

Two register stages timed by the MUTEX output, giving $s_{sub} = w$.

Two cascaded delay lines with a feedback loop to maximize w.

X. Fan et al., “Analysis and optimization of pausible clocking based GALS design,” in ICCD 2009.

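[Figure: optimized pausible clocking scheme, showing the TX data register, two register stages in front of the RX flip-flop (FF), the Req/Ack handshake, the MUTEX (Gnt, nRst), the C-element, the cascaded delay lines Delay A and Delay B, and the clocks rClk, lClk and sClk.]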

(6)

Pausible clocking based GALS design – an overview

Optimization of pausible clocking scheme

For multi-cycle clock tree delay: the locally delayed latching (LDL) scheme from Technion, which arbitrates on the leaf registers instead of on the source clock.

Integrate double-register scheme with LDL to maximize w.

R. Dobkin et al., “High rate data synchronization in GALS SoCs,” in TVLSI 2006

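[Figure: LDL-based scheme with double registers, showing the TX data register, two register stages in front of the RX flip-flop (FF), the Req/Ack handshake, the MUTEX (Gnt), the delay line, and the clocks rClk, lClk and sClk.]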

(7)

Pausible clocking based GALS design – an overview

Architecture of pausible clocking based GALS design

Communication across clock domains is initiated by the output/input flow control logic of TX/RX synchronous functional blocks.

Data transfer between TX and RX is scheduled by the asynchronous handshake signals.

Previous studies show that the pausible clocking based GALS datalink could reach a throughput of one data item every second clock cycle!?

[Figure: architecture of the GALS datalink. The TX core (OUT_REG, OUT_FLOW_CNTR) with its output port controller (OPC) and the RX core (IN_REG, IN_FLOW_CNTR) with its input port controller (IPC) are each clocked by a gated ring oscillator with a MUTEX (tx_clk, rx_clk). Handshake signals: op_te/op_ta, op_rp/op_ap, op_ri/op_ai on the TX side and ip_te/ip_ta, ip_rp/ip_ap, ip_ri/ip_ai on the RX side; DL_REG sits on the datalink.]

(8)

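Performance modeling of typical GALS datalinks

Data synchronization latency

The time interval from the arrival of an input data item, indicated by the input request, to the sampling of that data by the local clock.

[Figure: timing diagrams of sClk, rClk, lClk and the handshake signals (rp/ap, te, ri, ai, gi), marking the clock period T, the MUTEX acknowledge window w, the request arrival time t, and the latencies of two consecutive data items Ds(n) and Ds(n+1).]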

(9)

Performance modeling of typical GALS datalinks

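Synchronization latency function $L(t, w, \Delta_{CT})$, where $\Delta_{CT}$ is the clock-tree delay:

$$L(t, w, \Delta_{CT}) = \begin{cases} T + \Delta_{CT} - t, & t \in [0, w^-); \\ \tfrac{3}{2}T + \Delta_{CT} - w, & t \in [w^-, w^+]; \\ 2T + \Delta_{CT} - t, & t \in (w^+, T). \end{cases}$$

For the uniform distribution of $t$ within $[0, T)$, we have:

$$l = \frac{1}{T}\int_0^T L(t, w, \Delta_{CT})\,dt = \left(\frac{3}{2} - \frac{w - \Delta_{CT}}{T}\right) T.$$

$l$ is determined by the effective acknowledge window of the MUTEX, $(w - \Delta_{CT})$. If $(w - \Delta_{CT}) > T/2$, a sub-cycle latency can be achieved on average.

Given $w$, the minimum latency is obtained when $\Delta_{CT} = 0$.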
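The averaging step can be checked numerically. The sketch below assumes the piecewise model L = T + d_ct - t for requests arriving inside the acknowledge window (t < w) and L = 2T + d_ct - t for requests arriving after it, with the boundary point t = w taken as the mean of the two outcomes. Here `d_ct` stands for the clock-tree delay; the function names and the sample values are illustrative only.

```python
def sync_latency(t, period, window, d_ct):
    """Latency of a request arriving at time t in [0, T) (assumed piecewise model)."""
    if t < window:
        return period + d_ct - t            # granted in the current acknowledge window
    if t == window:
        return 1.5 * period + d_ct - window  # boundary: mean of the two outcomes
    return 2 * period + d_ct - t             # deferred to the next acknowledge window


def average_latency(period, window, d_ct, samples=100_000):
    """Midpoint-rule average of the latency for t uniform in [0, T)."""
    step = period / samples
    total = sum(sync_latency((i + 0.5) * step, period, window, d_ct)
                for i in range(samples))
    return total / samples


def closed_form(period, window, d_ct):
    """Closed-form average: 3T/2 - (w - d_ct)."""
    return 1.5 * period - (window - d_ct)


if __name__ == "__main__":
    # Assumed example: T = 10, w = 6, clock-tree delay = 1 (same time unit).
    print(average_latency(10.0, 6.0, 1.0), closed_form(10.0, 6.0, 1.0))
```

With these assumptions the numerical average and the closed form agree, and the average drops below one clock period exactly when the effective window w - d_ct exceeds T/2.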

(10)

Performance modeling of typical GALS datalinks

Data throughput in burst-mode communication

The handshake loop delay of the asynchronous datalink consists of 3 parts:

$$T^{Loop}_{avg} = d^{IPC}_{avg} + d^{OPC}_{avg} + d^{INT}.$$

The optimization target:

$$T^{Loop}_{avg} \le \max(T_{TX}, T_{RX}).$$

[Figure: architecture of the GALS datalink, as on the earlier overview slide: TX core with OPC, RX core with IPC, gated ring oscillators with MUTEXes, and the op_* / ip_* handshake signals.]

(11)

Performance modeling of typical GALS datalinks

An example: tightly coupled GALS datalink

[Figure: gate-level schematic of the tightly coupled OPC and IPC, including the TX and RX gated ring oscillators with their MUTEXes and the signals tx_clk, rx_clk, tx_data, rx_data, op_te, op_ta, op_gi, ip_te, ip_ta, ip_gi, tx_te_pending and rx_te_pending; below it, the signal transition graphs of the OPC (op_te, op_rp, op_ap, op_ri, op_ai, op_ta) and the IPC (ip_rp, ip_ap, ip_ri, ip_ai, ip_ta, ip_te).]

(12)

Performance modeling of typical GALS datalinks

Average loop period of the tightly coupled datalink

It is the sum of the average synchronization latencies on both the TX and RX sides and the interconnect delay between OPC and IPC:

$$T^{Loop}_{avg} = \left(\frac{3}{2} - \frac{w_{RX} - \Delta^{RX}_{CT}}{T_{RX}}\right) T_{RX} + \left(\frac{3}{2} - \frac{w_{TX} - \Delta^{TX}_{CT}}{T_{TX}}\right) T_{TX} + d^{INT}.$$

Given $T_{TX}/T_{RX}$ close to 1 (so that $T_{TX} \approx T_{RX} = T$), requiring $T^{Loop}_{avg} \le T$ gives the throughput-tolerant condition on the clock-tree delays $\Delta_{CT}$ and the interconnect delay $d^{INT}$:

$$(w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge 2T + d^{INT}.$$

Note that, since $w < T$, this requirement can never be satisfied, even with zero clock-tree and interconnect delay.

It means that a throughput drop is unavoidable in the tightly coupled datalink, even if TX and RX have only a tiny mismatch in operating frequency!
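To make the consequence concrete, the sketch below evaluates the average loop period of the tightly coupled datalink as the sum of the two average synchronization latencies, 3T/2 - (w - clock-tree delay) on each side, plus the interconnect delay. The throughput metric (data items per cycle of the slower clock, capped at 1) and all names and numbers are assumptions made for illustration.

```python
def tight_loop_period(t_tx, t_rx, w_tx, w_rx, ct_tx=0.0, ct_rx=0.0, d_int=0.0):
    """Average handshake loop period of the tightly coupled datalink."""
    latency_rx = 1.5 * t_rx - (w_rx - ct_rx)  # average RX synchronization latency
    latency_tx = 1.5 * t_tx - (w_tx - ct_tx)  # average TX synchronization latency
    return latency_rx + latency_tx + d_int


def data_per_cycle(loop_period, t_tx, t_rx):
    """Assumed metric: data items per cycle of the slower clock, capped at 1."""
    return min(1.0, max(t_tx, t_rx) / loop_period)


if __name__ == "__main__":
    T = 10.0
    # Assumed example: equal clocks, w = T/2 on both sides, no extra delays.
    loop = tight_loop_period(T, T, w_tx=T / 2, w_rx=T / 2)
    print(loop, data_per_cycle(loop, T, T))
```

With w = T/2 on both sides and no delays, the loop takes two clock periods, which matches the earlier observation of one data item every second clock cycle. Even with w close to T the loop period stays above T in this model.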

(13)

Upper bound of throughput-tolerant clock-tree/interconnect delay

Loosely coupled GALS datalinks

(a) Reduce $d^{IPC}$ by introducing concurrency in IPC:

$$(w_{TX} - \Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge T + d^{INT}.$$

(b) Reduce $d^{OPC}$ by stopping the TX local clock if possible:

$$(-\Delta^{TX}_{CT}) + (w_{RX} - \Delta^{RX}_{CT}) \ge d^{INT}.$$

[Figure: signal transition graphs of the OPC (op_te, op_rp, op_ap, op_ri, op_ai, op_ta) and the IPC (ip_rp, ip_ap, ip_ri, ip_ai, ip_ta, ip_te) for the loosely coupled datalinks (a) and (b).]
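The two loosely coupled schemes can be compared by solving each condition for the largest tolerable TX clock-tree delay. The sketch assumes the conditions read (w_TX - ct_TX) + (w_RX - ct_RX) >= T + d_INT for scheme (a) and -ct_TX + (w_RX - ct_RX) >= d_INT for scheme (b); the function names and the example values are hypothetical.

```python
def max_ct_tx_concurrent_ipc(period, w_tx, w_rx, ct_rx=0.0, d_int=0.0):
    """Scheme (a): largest ct_tx with (w_tx - ct_tx) + (w_rx - ct_rx) >= T + d_int."""
    return w_tx + w_rx - ct_rx - period - d_int


def max_ct_tx_stoppable_clock(w_rx, ct_rx=0.0, d_int=0.0):
    """Scheme (b): largest ct_tx with -ct_tx + (w_rx - ct_rx) >= d_int."""
    return w_rx - ct_rx - d_int


if __name__ == "__main__":
    # Assumed example (ns): T = 10, w = 8 on both sides, ct_rx = 1, d_int = 2.
    print(max_ct_tx_concurrent_ipc(10.0, 8.0, 8.0, ct_rx=1.0, d_int=2.0))
    print(max_ct_tx_stoppable_clock(8.0, ct_rx=1.0, d_int=2.0))
```

A negative result means the condition cannot be met for the given windows and delays. Under these assumptions scheme (b) always tolerates at least as much TX clock-tree delay as scheme (a), because w_TX < T.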

(14)

Upper bound of throughput-tolerant clock-tree/interconnect delay

Comparison in data throughput of tightly coupled GALS datalink

[Figure: data per cycle (axis from 0.30 to 0.60) versus clock ratio $T_{TX}/T_{RX}$ (from 1.28 down to 0.73) for the tightly coupled datalink. Simulation results are compared with Equation (6) for four settings: $d^{INT} = 0, \Delta_{CT} = 0$; $d^{INT} = T_{RX}/4, \Delta_{CT} = 0$; $d^{INT} = 0, \Delta_{CT} = T_{RX}/4$; and $d^{INT} = T_{RX}/4, \Delta_{CT} = T_{RX}/4$.]

(15)

Upper bound of throughput-tolerant clock-tree/interconnect delay

Comparison in throughput-tolerant delay in loosely coupled datalinks

[Figure, panel (a): maximum tolerable TX clock-tree delay $(\Delta^{TX}_{CT})_{max}$ in ns (axis from 0.0 to 3.5) versus clock ratio $T_{TX}/T_{RX}$ (from 1.28 down to 1.02); simulation compared with Equation (8) for several settings of $d^{INT}$ and $\Delta^{RX}_{CT}$ (0 or 2). Panel (b): maximum tolerable RX clock-tree delay $(\Delta^{RX}_{CT})_{max}$ in ns (axis from 0.0 to 6.0) versus clock ratio $T_{TX}/T_{RX}$ (from 0.99 down to 0.73); simulation compared with Equation (9) for several settings of $d^{INT}$ and $\Delta^{TX}_{CT}$.]

(16)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

Architecture of GALS design

System partitioned according to both functional and physical criteria;

6 GALS blocks balanced in area and power dissipation;

16 asynchronous datalinks (32 IPC/OPC).

[Figure: block diagram of the GALS OFDM baseband transmitter. Six GALS blocks (GALS BLOCK 1 to 6), each with its own pausible clock generator (Pausible Clock GEN 1 to 6), are connected through IPC/OPC pairs. Functional units shown: input data FIFO, input control, universal scrambler, universal FEC encoder, interleaver interface and interleavers, middle control, symbol mapping (mapper), pilot inserter, 64-point IFFT, 4-point IFFT, and output stage.]

(17)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

Comparison of GALS TX throughput achieved with different datalinks

A data frame with QPSK modulation was processed by the GALS TX.

RTL of TX functional blocks, netlist of IPC/OPC and clock generators, with back-annotated clock-tree/interconnect delays after layout.

[Figure: two bar charts of GALS TX throughput (vertical axis from 0.5 to 1.0) for five datalink configurations: OPC-I to IPC-I w/o buffer, OPC-I to IPC-II w/o buffer, OPC-II to IPC-II w/o buffer, OPC-I to IPC-I w buffer, and OPC-I to IPC-II w buffer. The first chart compares simulations with worst-case and best-case delays; the second compares simulations at $d^{INT} = 0$, $T/4$, $T/2$ and $3T/4$.]

(18)

Moonrake: a 40-nm SYNC/GALS OFDM BB TX chip

Power and area comparison of SYNC/GALS TXs

                Cell area (µm²)                 Power dissipation (mW)
                Post-synthesis   Post-layout    Post-layout   Measurement
SYNC w/o PLL    2206895          2234712        234           252
GALS TX         2225823          2220080        225           237

        GALS_BLK 1   GALS_BLK 2   GALS_BLK 3   GALS_BLK 4   GALS_BLK 5   GALS_BLK 6   TOTAL
AREA    19%          18%          18%          18%          10%          17%          100%
POWER   12%          17%          17%          17%          18%          19%          100%

(19)

Conclusions

The performance (synchronization latency and data throughput) of pausible clocking based GALS datalinks is dominated by $w$, which is in turn determined by the target resolution time of the MUTEX:

MTBF → $t_{MUTEX}$ → $w = T - t_{MUTEX}$.

This can cause a large performance penalty in speed-aggressive designs, such as high-speed microprocessors. On the other hand, it is well suited to complex system integration at moderate speed.

By the optimization of IPC/OPC, it is possible to tolerate clock and interconnect delays, to some extent, with little performance drop.

The marginal hardware overhead caused by the GALS infrastructure based on pausible clocking can be compensated at the system level.
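The dependency chain in the first conclusion (the MTBF target fixes the MUTEX resolution time, which fixes w) can be illustrated with the textbook synchronizer model MTBF = exp(t / tau) / (T_W * f_clk * f_data). Applying that model to the MUTEX is an assumption of this sketch, and the parameter names and every numeric value below are illustrative rather than taken from the talk.

```python
import math


def resolution_time(mtbf_s, tau_s, t_window_s, f_clk_hz, f_data_hz):
    """Resolution time t solving MTBF = exp(t / tau) / (T_W * f_clk * f_data)."""
    return tau_s * math.log(mtbf_s * t_window_s * f_clk_hz * f_data_hz)


def ack_window(period_s, t_mutex_s):
    """Acknowledge window w = T - t_MUTEX."""
    return period_s - t_mutex_s


if __name__ == "__main__":
    # Assumed values: tau = T_W = 20 ps, 500 MHz clock, 250 MHz data rate,
    # MTBF target of roughly 10,000 years expressed in seconds.
    t_mutex = resolution_time(3.15e11, 20e-12, 20e-12, 500e6, 250e6)
    print(t_mutex, ack_window(1 / 500e6, t_mutex))
```

Because the resolution time grows only logarithmically with the MTBF target but is subtracted from the full clock period, the relative loss in w becomes significant mainly when the clock period itself is short.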

(20)

Thank you!

Questions?
