• Ingen resultater fundet

High-Throughput Low-Energy Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "High-Throughput Low-Energy Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism"

Copied!
32
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

High-Throughput Low-Energy

Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism

Dr. Naoya Onizawa

Postdoctoral Fellow, McGill Univerisity naoya.onizawa@mail.mcgill.ca

Collaborators: S. Matsunaga, V. Gaudet, and T. Hanyu

18th IEEE International Symposium on Asynchronous

Circuits and Systems (ASYNC), May 7-9, 2012, Copenhagen

!

Acknowledgements

This research was supported by Japan Science Technology Agency (JST)

(2)

Content-Addressable Memory (CAM)

!   Associative memory

!   Parallel searching

!   Applications

!   Cache, Virus checking

!   Packet forwarding (40G, 100Gbps)

word length match Search

word

Mismatch Mismatch Mismatch

Location (Address)

# entries Match

(3)

Hardware-Implementation Issue

Goal: High-throughput low-overhead CAM

K. Pagiamtzis, et al (JSSC’04 vol.39-9)

Speed restriction

!   Packet length - 32bit (IPv4), 128,144bit (IPv6)

144-b data

"   Large matching delay

Search word

Pipelined approach

Search

word

"   Large area and

power dissipation

(4)

Concept

!   Operate as comparable to small CAM

Hide large matching delay to improve throughput

time Word 2

Word 3

In most cases,

search words “match”

Word 1 Small matching

delay

Search If ”Match”

”Mismatch”

High-speed access to small CAM Large

Large CAM

Large Large

Slow access

(5)

Approach

After searching first few bits, most blocks are mismatched

"  If unused blocks are found, it doesn’t need to wait to search new words until the current search is complete.

!   Assign search words to unused blocks

5.57x higher throughput at 8% cost of area

In use Pre-computation

block

Unused Unused Unused

(Find unused blocks before sending)

Unused

Partitioning

Search new words

after searching few bits

(6)

Table of Contents

"   Introduction to content-addressable memory

"   Overlapped search mechanism

"   Word overlapped search

"   Phase overlapped processing

"   Hardware implementation

"   Evaluation

"   Conclusion and future prospect

(7)

CAM

Word-parallel search in single cycle

Operation

1. Search all words in parallel

2. Find a matched word block

3. Output a matched location (address)

128-144 bits (IPv6)

stored word 0 stored word 1

stored word w-1

Input controller Input controller n bits Search word

match location

log2w bits

search lines match lines

encoder

word blocks

(8)

CAM Word Block

Long word length degrades throughput

Conditions

!   Match

!   Mismatch

Series of CAM cells (NAND-type CAM) Search word (n bits)

1 0 1

1 1 1

1

1 Sense amplifier

Stored word (n bits)

Throughput determined by word length in conventional CAM

(9)

CAM Characteristics

Matching probability of word blocks after k-bit search is

Most word blocks are not used after k-bit search (We set k=8 in the hardware implementation)

"   Use unused blocks to improve throughput

p

matched

= 1 2

!

"

# $

% &

k

Most word blocks are unused (mismatched).

(10)

Word Overlapped Search (WOS)

word block

MLS10 MLS11

k-bit 1st-stage

sub-word block (n-k)-bit 2nd-stage sub-word block

Segmentation circuit

Input controller

SL1 SL2

ML10 ML11

MLS1w-1 ML1w-1

ML20 ML21

ML1w-1

CAM block (self-timed circuit)

(synchronous circuit)

1. Partition word block to:

a) small k-bit block and b) large (n-k)-bit block by segmentation block

2. Segmentation block stores Its k-bit matched result

3. Latter block operates when the first block matches

Word block partitioned by segmentation block

CAM architecture based on segmentation method

(11)

WOS operation

Mismatch

Match

Mismatch

Search

sub-word 1 1.   First k-bit sub-word search

2.   Some sub-word block matched

Input connects all word blocks

(12)

Mismatch

Match

Mismatch

Search sub-word 2

Match

Mismatch

Mismatch

Search sub-word 1

WOS operation

After k-bit search, new search starts

Assign to unused block

Latter bits matching

(13)

Match

Mismatch

Search sub-word 3

Match

Mismatch

Mismatch

Search sub-word 2

Match

WOS operation

How to assign search words to unused blocks?

Assign to

unused block

Still operates

Latter bits matching (word 2)

Should not affect

the old matched block

(14)

Categorize Word Blocks

Categorize based on the first k-bit stored word

00000000 00000000

00000001

Group A

Group B

Same stored

k-bit data

(15)

Pre-Computation

SL1 SL2

Search line

registers Search line registers

Search line registers

Comparator

controllerMode enable mode

k bits

(n-k) bits

Search data

ctrl

Categorize search words using comparator

Input controller (m=1)

Compare “m” consecutive k-bit search words

If they are different,

they are in different groups (Category 1: fast mode)

Otherwise,

they are in the same group (Category 2: slow mode)

(16)

Timing Diagram (fast mode)

High-speed searching based on T

Send search words based on

short delay Ttst

Consecutive words are assigned to

unused blocks

SL1 Ctrl

SL2 ML10

ML20 ML11 ML21

Data1 Data2 Data3 Data0 Data1 Data2

Match

Match Match

Mismatch

Mismatch

Mismatch Mismatch

Mismatch

Mismatch Mismatch

Data4 Data3

Mismatch MLS10

Mismatch MLS11

Hold

Hold

(a)

(17)

Timing Diagram (slow mode)

Wait until the current search is complete

Two consecutive words use the same word block

SL1 Ctrl SL2 ML10

ML20

Data1 Data2 Data3

Data0

Match

Match Mismatch

Mismatch

Data1 Data2

Match

Slow

Hold Hold

MLS10 mode

Mismatch

fast

(18)

Average Search Delay

!   Category 1 – fast mode

Send search words based on the first k-bit delay (T

1st

)

!   Category 2 – slow mode

Send a new word after the current n-bit search is complete (T

slow

)

T

sa

= T

1st

1 ! m 1 2

"

# $ %

&

'

"

k

# $$ %

&

''+ T

slow

m 1

2

"

# $ %

&

'

"

k

# $$ %

&

''

(19)

Table of Contents

"   Introduction to content-addressable memory

"   Overlapped search mechanism

"   Word overlapped search

"   Phase overlapped processing

"   Hardware implementation

"   Evaluation

"   Conclusion and future prospect

(20)

Word Circuit (precharge)

SL BL BL

SL WL

ML

C C C Low

Low ML

Charge

Charge capacitance on match line

NAND-type word circuit NAND-type CAM cell

CAM cell

!  Dynamic logic

!  Series of pass transistors !  Match “ON”,

!  Mismatch “OFF”

(21)

Word Circuit (evaluate)

Match line remains high in mismatched case

Match operation Mismatch operation

Discharging capacitance on match line

"  Output goes high

1 0 1 High

High

Discharge Search word

1 0 1

High

Low

1 0 1 1 1 1

Not discharging

"  Output remains low

(22)

Synchronous Control (conventional)

2 phases are required every search

1 0 1

1 1 1

1 1 0

1 0 1

Search data Clk

High

Low

High

Low

1 0 1

1 1 1

1 1 0

Low Low

Low Low

ML0

ML1

ML2

1 0 1

Evaluate

Precharge

All word circuits are controlled by a global clock signal

(23)

Phase Overlapped Processing (POP)

Mismatched blocks always process new word

Each circuit is independently

controlled using local control signals

!  Matched word circuit

Move on to precharge phase

!  Mismatched word circuit Stay in evaluate phase

#Lowering switching activity of pre-charging signals

1 0 1

1 1 1

1 1 0

1 0 1 Lclk0

High Low

High

Low Lclk1

High Lclk2

High

(24)

WOS based POP

Searching words requires just 1 phase

Unused block can process without waiting precharge phase

1 0 1

1 1 1

1 1 0

1 0 1

Low

High

Low

1 0 1

1 1 1

1 1 0

Low High

Low Low

1 1 1

Evaluate

Precharge (Local)

Lctrl0 High

Lctrl1 High

Lctrl2 High High

High

Input can be changed at every phase Input are assigned to

unused block (WOS scheme)

(25)

Throughput Ratio

Throughput ratio = T

CS

T

CA

= 2(T

reg

+ T

1st

+ T

2nd

) T

1st

T

CS

= 2T

SS

= 2(T

reg

+ T

1st

+ T

2nd

) T

CA

= T

SA

= T

1st

1 ! m 1

2

"

# $ %

&

'

"

k

# $$ %

&

''+ T

slow

m 1

2

"

# $ %

&

'

"

k

# $$ %

&

''

Conventional Proposed

! T

1st

T

SS

T

SA

Synchronous search delay (evaluate phase) Asynchronous search delay

(26)

Table of Contents

"   Introduction to content-addressable memory

"   Overlapped search mechanism

"   Word overlapped search

"   Phase overlapped processing

"   Hardware implementation

"   Evaluation

"   Conclusion and future prospect

(27)

Circuit Implementation

Self-precharge circuit controls its word circuit

!  144-bit CAM word block with self-precharge circuit

!  Self-precharge circuit pre-charges after 2nd stage is complete.

!  Hierarchical 2nd stage block (17 local and 1 global match circuit)

C C C C C C

prec

Local match circuit

x8

x8

ML10 LML140

LML160

ML20

Segment circuit 8 bits 1st-stage

sub-word circuits

LML150 LML00

Global match circuit Self-precharge circuit

0 1 2 14 15 16

ML10 prec

ML10

(a)

(c)

(b)

(d)

weak feedback transistor

prec

prec

NAND-type cell

Store local

matched result

It isn’t affected by input changing

(28)

Ctrl 1.0 0

ML10

LML10 ML20 prec0

ML11 LML11

ML21 prec1

Voltage [V]

1.0 0 1.0 1.0 0 0 1.0

0 1.0 1.0 0

0 1.0 1.0 0 0

=261ps TCA

1.75 2.0 2.5 3.25

Time [ns]

3.0

Simulated Waveforms

HSPICE simulation under a 90nm CMOS technology

CAM operates based on T1st (259ps)

After search is complete,

self pre-charging is locally done.

Global match circuit uses only local

matched result.

(29)

Performance Comparison

Precharge Evaluate

-64.1%

-100%

Synchronous WOS WOS

POP+ 1.5

1.0

0

Average cycle time [ns]

-64.1%

-64.1% -82.0%

0.5

Conventional Proposed Cycle delay [ps] 1454 261 Energy

[fJ/bit/

search]

Match 0.0003 0.0006

Search 0.160 0.160 Control 0.103 0.001

Total 0.263 0.162

Area [Trs.] 372K 408K

5.57x throughput and 38% energy saving

256-word 144-bit binary CAM@90nm CMOS

Independent control reduces

switching activity of pre-charging

(30)

0 0.5 1 1.5 2 2.5

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

Performance Comparison

Better energy-delay product

Energy dissipation [fJ/bit/search]

Cycle time [ns]

VDD=0.6V

0.7V 0.8V

0.9V 1.0V

Conventional 65nm@JSSC

vol. 46-2

Hybrid(100nm)

@JSSC 40-1

(31)

Conclusion

High-throughput low-energy CAM

!   Word overlapped search

!   Use unused word block

!   Assign based on pre-computation

!   Phase overlapped search

!   Independent control of each word block

!   Search without waiting for precharge

"   5.57x throughput, 38% energy saving,

8% cost of area

(32)

Future Prospects

!   Circuit design considerations

!   Number of partitions

!   Timing robustness

!   Extend to Ternary CAM (TCAM)

!   Redesign input controller

!   Application specific design

!   Cache (TLB), virus checker

Referencer

RELATEREDE DOKUMENTER

• Smart search, maybe we can create a smart algorithm to solve the particular problem, eventhough the number of possible solutions is huge (P problems ...). • Local search:

The result of exact line search is normally a good approximation to the result, and this can make descent methods with exact line search find the local minimizer in fewer

Section 41(2) builds further on this by stating that: “every person shall have access to any court of law or any tribunal with jurisdiction for final settlement of legal

By clicking on a magnifying glass to the left of each related word, we get a list of the common words, and the overlapping of the pairs for the search word and the related word

Interface - EBSCOhost Search Screen - Advanced Search Database - CINAHL with Full Text. S25 behavioural pain score

What this use case allows is to give the user the opportunity to just search for the departures from a stop, based on his current position and time of the search or provide

The fi rst transition matrix experiment was based on two AOIs to measure the overall consultation behaviour in search-related operations (search and writing) and

When provided with search features that retain search queries, consumers would pay more attention to the available traceable memory and then adjust search criteria