(1)

INTRODUCTION TO OCTASIC ASYNCHRONOUS PROCESSOR TECHNOLOGY

Michel Laurence, Founder & CEO, michel.laurence@octasic.com

ASYNC 2012, Copenhagen, May 7-9, 2012

(2)

CONTENTS

• Background

• Asynchronous Circuits Description

• Processor Architecture and Operation

• Performance Analysis

• Conclusion

(4)

BACKGROUND ON OCTASIC

• Founded in 1998

• Headquartered in Montreal, Canada

• 85 employees

• Evolution:

• 1998-2000: Designed ASICs for others

• 2001: Converted to fabless model

• 2001-2003: VoIP Support Products (Synchronous):

– 2001: Voice Packetization Engine (OCT8304)

– 2003: Echo Cancellation Processor (OCT6100)

• 2004: DSPs (Asynchronous) for Voice, Video, and Wireless Baseband

– 2008: First Generation (OCT1010)

– 2011: Second Generation (OCT2224)

– …2013: Third Generation (OCTXXXX)

(5)

GENESIS OF MOVE INTO ASYNC DESIGN

First Processor Product

Specialized DSP for Echo Cancellation

• Entered the echo market 20 years late

• Success because of unique algorithm

Next Product – Generic DSP?

How to succeed?

• Settle on highest processing efficiency

– Processing Power / Power Consumption

• 2+X improvement needed to be able to succeed and displace incumbents

This led us fortuitously into the asynchronous world

• Started by removing the clock, the single greediest power culprit in synchronous designs

• … then tried to figure out how to make our circuits work

• … proceeded by trial and error until we arrived at our current async design and methodology


(6)

SET ADDITIONAL PRE-REQUIREMENTS

Use only standard ASIC library elements

No custom cells

Ease of porting - from one silicon node to the next / from one vendor to another

Use (as much as possible) standard CAD tools and concepts

To facilitate sign-off

To facilitate staff conversion training

Use an architecture presenting a traditional programming view

S/W paradigm (same look and feel)

Avoid software programming model changes

Programming model change is an almost insurmountable barrier to product adoption

Allow re-use of existing S/W

Transparent to programmers

Similar single-thread performance

Avoid forcing programmers to restructure algorithms

(7)

CONTENTS

• Background

Asynchronous Circuits Description (Basic)

• Processor Architecture and Operation

• Performance Analysis

• Conclusion

(8)

BASIC ASYNCHRONOUS CIRCUIT (1)

• Logic Elements: States In/Out, Logic Clouds, and Delay Chains

• States are latches or flip-flops

• Logic Clouds and delay chains use combinatorial logic

• Delay chains are statically or dynamically controlled

• Timing Elements: Pulses

• Pulses are asynchronous to each other and event (token) driven

• Timing verification is performed with standard STA (Static Timing Analysis) tools on each pulse (clock) domain

(9)

BASIC ASYNCHRONOUS CIRCUIT (2)

How does this map onto the traditional classification of async circuits?

• Single-rail bundled-data type for data transmission

• With a worst-case delay "Bundling Signal" to latch data

• However no formal reverse ACK signal for flow control

• Use a system of tokens to be described later

• Asynchronous Pipeline Structure: Static

• Formal latches/FF to store data in between stages

(10)

SIMPLIFIED DSP EXECUTION UNIT

• The 3 operand state registers are asynchronously loaded

• The instruction state register is asynchronously loaded

• When ready

(input registers loaded & output register released)

a launch pulse is generated

• Delay chain timing is modulated according to instruction

• Output state register is asynchronously loaded with result of instruction
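The launch rule above can be sketched in a few lines of Python. This is an illustrative model only, not Octasic's circuit: it uses two operands instead of three for brevity, and the `DELAY_PS` values, the `ExecutionUnit` class, and its method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical delay-chain settings per instruction, in picoseconds.
# These numbers are illustrative assumptions, not Octasic data.
DELAY_PS = {"add": 300, "sub": 300, "orr": 200, "mul": 900}

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "orr": lambda a, b: a | b,
    "mul": lambda a, b: a * b,
}


@dataclass
class ExecutionUnit:
    operands: List[int] = field(default_factory=list)  # operand state registers
    instruction: Optional[str] = None                   # instruction state register
    output: Optional[int] = None                        # output state register
    output_released: bool = True                        # previous result consumed?

    def load_operand(self, value: int) -> None:
        self.operands.append(value)

    def load_instruction(self, op: str) -> None:
        self.instruction = op

    def release_output(self) -> None:
        self.output_released = True

    def ready(self) -> bool:
        # Launch condition: input registers loaded and output register released.
        return (self.instruction is not None
                and len(self.operands) == 2
                and self.output_released)

    def launch(self, now_ps: int) -> Optional[int]:
        """Fire the launch pulse if ready; return the time the output is latched."""
        if not self.ready():
            return None
        delay = DELAY_PS[self.instruction]  # delay chain modulated by instruction
        a, b = self.operands
        self.output = OPS[self.instruction](a, b)
        self.output_released = False
        self.operands, self.instruction = [], None
        return now_ps + delay
```

In this model a unit that still holds an unconsumed result refuses to launch, which is the same back-pressure the "output register released" condition expresses.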

(11)

CONTENTS

• Background

• Asynchronous Circuits Description

Processor Architecture and Operation (Simplified)

Architecture, Silicon, and ILP Implementation

• Operation & Synchronization

• Performance Analysis

• Conclusion

(12)

SYNC VS ASYNC PROCESSOR IMPLEMENTATION

[Figure: synchronous pipeline. Stage functions: Fetch, Decode, Reg Reads, Execute, Branch, Output Write, Store. Units and stages: Fetch Unit (F0 F1 F2), Decode Unit (D0 D1 D2), Execution Unit (E0 E1 E2 E3 E4 E5), Store (S0). MEM load/store not shown.]

In a typical synchronous design, pipelining is used to boost performance and provide Instruction Level Parallelism (ILP)

How can we convert such a synchronous design into an asynchronous one?

(15)

SYNC VS ASYNC PROCESSOR IMPLEMENTATION

[Figure: the synchronous pipeline (Fetch Unit F0-F2, Decode Unit D0-D2, Execution Unit E0-E5, Store S0; MEM load/store not shown) alongside an Async Execution Unit drawn as State, Logic Cloud, State.]

Conversion Sync => Async:

• One way is to map each unit's functionality into an equivalent asynchronous unit

• But using this methodology will slow down the unit!

• How can we get the performance back?

(16)

ASYNC ILP IMPLEMENTATION (1)

(17)

ASYNC ILP IMPLEMENTATION (2)

To multiply the processing power of our processor we could use multiple Exec Units (EUs) operating in parallel

Now how can we transparently weave together those EUs ...

....so they behave as one processor?

(18)

ASYNC PROCESSOR ARCHITECTURE (2)

• Starting with the 8 execution units …

(19)

ASYNC PROCESSOR ARCHITECTURE (3)

• Adding a non-blocking combinatorial X-Bar switch to:

• connect the execution units data paths among themselves, and

• with external resources – register file, memory, etc.

(20)

ASYNC PROCESSOR ARCHITECTURE (4)

• Adding a CPU Register File to implement a load/store processor design:

(21)

ASYNC PROCESSOR ARCHITECTURE (5)

• Adding a Data Memory Load/Store unit

• to be able to load/store memory data into/from the CPU registers

(22)

ASYNC PROCESSOR ARCHITECTURE (6)

• Adding a Program Counter Control unit including a branch predictor;

• Coupled with an Instruction Fetch & Decode Unit

• to be able to load instructions into the execution units

(23)

ASYNC PROCESSOR ARCHITECTURE (7)

• Adding L1 Memory accessible for:

• Data, or

• Code

(25)

ASYNC PROCESSOR ARCHITECTURE (8)

• How does it map on silicon?

[Figure: die layout, built up over several slides. Labeled regions: two L1 Memory blocks (72KB each); one execution unit; a block of four (4) execution units; four such blocks; the X-Bar Switch; Register File & Processor Control Logic.]

There are in fact 16 Execution Units in this DSP core, not 8: 4 blocks of four (4) execution units.

(33)

PROCESSOR OPERATION – SIMPLIFIED ILP (1)

Assuming the operation of the Execution Units and the resources (registers, memory, …) is somehow synchronized, here is the overlapped flow of instructions that would result in the processor, realizing the Instruction Level Parallelism (ILP) mechanism that boosts performance.

I1: add r4,r3,r9

I2: sub r7,r4,#0x01

I3: orr r4,r3,#0x01

I4: add r7,r7,r3,lsl r5

I5: ldr r9,r7,r2

I6: sub r7,r4,#0x01

I17: sub r2,r4,#0x47

[Figure: instruction timeline across EU0, EU1, EU2, EU3, EU4, EU5 (EU6-EU15 not detailed). Axes: Time (pico-seconds) and Time (instruction cycles). Legend: Fetch Instr., Decode Instr., Load Reg., Execute Instr., Write Output Reg., Memory access. Register labels on the timeline: R3,R9; R4; R3; R7; R3,R5; R2; R4; memory access (M) with R7.]

(34)

PROCESSOR OPERATION – SIMPLIFIED ILP (1)

BTW, did you notice the instructions?

Hey, this is not a DSP!

This is an ARM processor!
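The I1 to I6 sequence carries read-after-write dependencies (I2, for instance, reads r4, which I1 writes), and these are what limit how far the EUs can overlap. A minimal sketch for listing them follows. It assumes the first register of each instruction is the destination and the remaining registers are sources; the `PROGRAM` table and the `raw_dependencies` helper are hypothetical names introduced for illustration.

```python
# Instruction sequence from the slide. Operand roles are an assumption:
# the first register is treated as the destination, the rest as sources.
PROGRAM = [
    ("I1", "add", "r4", ["r3", "r9"]),
    ("I2", "sub", "r7", ["r4"]),
    ("I3", "orr", "r4", ["r3"]),
    ("I4", "add", "r7", ["r7", "r3", "r5"]),
    ("I5", "ldr", "r9", ["r7", "r2"]),
    ("I6", "sub", "r7", ["r4"]),
]


def raw_dependencies(program):
    """Return (producer, consumer, register) for each read-after-write pair."""
    last_writer = {}  # register -> most recent instruction that wrote it
    deps = []
    for name, _op, dest, sources in program:
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], name, reg))
        last_writer[dest] = name
    return deps


for producer, consumer, reg in raw_dependencies(PROGRAM):
    print(f"{consumer} waits for {producer} to write {reg}")
```

Under these assumptions the pairs are I1→I2 (r4), I2→I4 (r7), I4→I5 (r7), and I3→I6 (r4); I3 has no producer in the window, so it can overlap freely with I1 and I2.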


(35)

PROCESSOR OPERATION ILP: REAL-WORLD EXAMPLE (2)

[Figure: program instruction flow against time (pico-seconds). Legend: Fetch Instr., Decode Instr., Load Reg., Execute Instr., Write Output Reg., Memory access.]

Note: Dependencies are no different than in the case of synchronous pipelined processors.

However, in the event of a pipeline stall, no dynamic power is consumed.

This time it is a DSP!

(36)

CONTENTS

• Background

• Asynchronous Circuits Description

Processor Architecture and Operation (Simplified)

• Architecture, Silicon, and ILP Implementation

Operation & Synchronization

• Performance Analysis

• Conclusion

(37)

OPERATION AND SYNCHRONIZATION (1)

[Figure: EU 1, EU 2, EU 3, EU 4, EU 5, EU 6, ..., EU N arranged in a ring around COMMON RESOURCES (Regs + Mem., etc.)]

This is an alternate simplified processor block diagram:

• the execution units (EUs) are mapped in a ring-like fashion

• the EUs have access to common resources:

• Register File

• Data Memory

• Code Memory

• X-Bar

• PC Control Logic

• a synchronization mechanism is needed to arbitrate and avoid conflicts in the EUs' access to the common resources

(38)

OPERATION AND SYNCHRONIZATION (2)

In contrast with a synchronous processor which is generally centrally controlled, this asynchronous processor has a fully distributed control system:

• Control is exercised individually by each Execution Unit (EU)

• Control tokens are passed asynchronously among the EUs in a ring fashion to synchronize accesses to common resources and avoid conflicts

• In the simplified model discussed herein, six (6) tokens are used:

• Instruction Fetch Token

• Register Read Token

• Launch Execution Token (X-Bar, Reg Ready)

• No Mis-Prediction Token (PC & Write Commit)

• Data Memory Token (Rd or Wr)

• Register Write Token

[Figure: token access logic built around a latch (D, G, Q) with signals TOKEN IN, TOKEN OUT, READY, and RESOURCE_REQ.]

(39)

OPERATION AND SYNCHRONIZATION (3)

Asynchronous control tokens are used to control and synchronize the overall operation of the processor.

• Control tokens are passed from one EU to the next in a ring fashion.

• When a token is owned by an EU it can use it to request services (via Req pulses)

• Once a service request has been sent, a certain time has elapsed, and certain conditions are met, or when the EU does not need the token (resource), the token is passed to the next EU.

• On start-up or after a flush (wrongly predicted branch), all tokens are assigned to the same EU.
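The token rules above can be modeled with a small ownership table. The sketch below is a behavioral illustration under simplifying assumptions: timing and the "certain conditions" are not modeled, and the `TokenRing` class, its methods, and the short token names are hypothetical, chosen to mirror the six tokens of the simplified model.

```python
# Short names for the six tokens of the simplified model (names are assumptions).
TOKENS = ["fetch", "reg_read", "launch", "no_mispredict", "data_mem", "reg_write"]


class TokenRing:
    def __init__(self, num_eus: int):
        self.num_eus = num_eus
        self.owner = {}
        self.reset()

    def reset(self, eu: int = 0) -> None:
        # Start-up or flush: every token is assigned to the same EU.
        self.owner = {token: eu for token in TOKENS}

    def use_and_pass(self, token: str, eu: int, needs_resource: bool) -> bool:
        """EU `eu` handles `token`, then hands it to the next EU in the ring.

        Returns True if the EU used the guarded resource before passing the
        token on, False if it simply forwarded the token.
        """
        if self.owner[token] != eu:
            raise RuntimeError(f"EU{eu} does not own the {token} token")
        self.owner[token] = (eu + 1) % self.num_eus
        return needs_resource


ring = TokenRing(num_eus=16)
ring.use_and_pass("reg_read", 0, needs_resource=True)   # EU0 reads registers
ring.use_and_pass("data_mem", 0, needs_resource=False)  # EU0 forwards unused token
```

Because each token visits the EUs in ring order, accesses to a given resource happen in program order even though no central controller sequences them.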

(40)

OCT2224 SOC ARCHITECTURE (1)

Asynchronous SoC Portion:

• 24 async DSP Cores

All other modules in the SoC, including the external interfaces, are synchronous:

• not power critical

• bought IP blocks

• ease of interface

(41)

CONTENTS

• Background

• Asynchronous Circuits Description

• Processor Architecture and Operation

Performance Analysis

• Conclusion

(42)

COMPARISON – DIE AREA

[Figure: die-area comparison of the TI C64+ core and the Octasic Opus 2 core]

• C64+ Mega-module: 4.5 x 4.8 = 21.6mm2 (six shown)

• C64+ Core: 2.7 x 3 = 8.1mm2

• Opus2 Core: 1.75 x 1.3 = 2.28mm2

• Texas Instruments (TI) is the leading DSP vendor in the industry;

• TI literature claims the C6472® is the most power-efficient high-performance DSP on the market. It features 6 C64+® cores;

• The C6472® is implemented in the same silicon technology as one of our DSPs, so it provides a reasonably fair benchmark*;

• The C6472® is a mature device, so fairly accurate data is available for area, power consumption, and processing capability*;

• The C64+® core area is ~8.1mm2 (estimate);

• Octasic's Opus2 core is 2.28mm2

Ratio of area: ~3.5
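The quoted ratio follows directly from the two core footprints. A quick check, using only the dimensions given on the slide (the variable names are illustrative):

```python
# Core footprints as given on the slide (the C64+ figure is an estimate).
c64_core_mm2 = 2.7 * 3.0      # ~8.1 mm2
opus2_core_mm2 = 1.75 * 1.3   # ~2.28 mm2

# Area ratio of the TI C64+ core to the Octasic Opus2 core.
ratio = c64_core_mm2 / opus2_core_mm2
print(f"area ratio: {ratio:.2f}")
```

The result lands between 3.5 and 3.6, consistent with the "~3.5" stated above.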

(45)

COMPARISON – POWER EFFICIENCY

[Figure: Efficiency (MMACS / mW) versus MMACS (0 to 12000), all in 90nm for comparison purposes. Series: TIC6472 GP at 1.0V, 1.1V, 1.2V; Opus2 GP at 0.9V, 1.0V, 1.1V, 1.2V; Opus3 GP at 0.9V, 1.0V, 1.1V, 1.2V.]

• TI C64+® core: used in TI's most power-efficient high-performance DSP device, the C6472®

• Opus2: 3X as power efficient and 1.7X as area efficient as the TI C6472® core (at TI's best power efficiency operating point)

• Opus3: 3X as power efficient and 3X as area efficient as the TI C6472® core (at TI's best power efficiency point)

*It is understood that any such data and comparison is never totally accurate and can be subject to many interpretations.

(48)

CONTENTS

• Background

• Asynchronous Circuits Description

• Processor Architecture and Operation

• Performance Analysis

Conclusion

(49)

CONCLUSION

Asynchronous technology does work!

• not only in universities and labs, but

• in real-life commercial products used by people worldwide

Asynchronous technology can be quite advantageous!

• area efficiency-wise,

....but more importantly...

• power efficiency-wise

• in the DSP processor market: ~3X more than equivalent synchronous products

• same for other processors and datapath engines

[Figure: the industry's smallest and lowest-power 2G/3G/4G basestation ...powered by an OCT2224 Async DSP]

Thank you!

Michel Laurence

michel.laurence@octasic.com
