High-Throughput Low-Energy
Content-Addressable Memory Based on Self-Timed Overlapped Search Mechanism
Dr. Naoya Onizawa
Postdoctoral Fellow, McGill Univerisity naoya.onizawa@mail.mcgill.ca
Collaborators: S. Matsunaga, V. Gaudet, and T. Hanyu
18th IEEE International Symposium on Asynchronous
Circuits and Systems (ASYNC), May 7-9, 2012, Copenhagen
!
Acknowledgements
This research was supported by Japan Science Technology Agency (JST)
Content-Addressable Memory (CAM)
! Associative memory
! Parallel searching
! Applications
! Cache, Virus checking
! Packet forwarding (40G, 100Gbps)
word length match Search
word
Mismatch Mismatch Mismatch
Location (Address)
# entries Match
Hardware-Implementation Issue
Goal: High-throughput low-overhead CAM
K. Pagiamtzis, et al (JSSC’04 vol.39-9)
Speed restriction
! Packet length - 32bit (IPv4), 128,144bit (IPv6)
144-b data
" Large matching delay
Search word
Pipelined approach
Search
word
" Large area and
power dissipation
Concept
! Operate as comparable to small CAM
Hide large matching delay to improve throughput
time Word 2
Word 3
In most cases,
search words “match”
Word 1 Small matching
delay
Search If ”Match”
”Mismatch”
High-speed access to small CAM Large
Large CAM
Large Large
Slow access
Approach
After searching first few bits, most blocks are mismatched
" If unused blocks are found, it doesn’t need to wait to search new words until the current search is complete.
! Assign search words to unused blocks
5.57x higher throughput at 8% cost of area
In use Pre-computation
block
Unused Unused Unused
(Find unused blocks before sending)
Unused
Partitioning
Search new words
after searching few bits
Table of Contents
" Introduction to content-addressable memory
" Overlapped search mechanism
" Word overlapped search
" Phase overlapped processing
" Hardware implementation
" Evaluation
" Conclusion and future prospect
CAM
Word-parallel search in single cycle
Operation
1. Search all words in parallel
2. Find a matched word block
3. Output a matched location (address)
128-144 bits (IPv6)
stored word 0 stored word 1
stored word w-1
Input controller Input controller n bits Search word
match location
log2w bits
search lines match lines
encoder
word blocks
CAM Word Block
Long word length degrades throughput
Conditions
! Match
! Mismatch
Series of CAM cells (NAND-type CAM) Search word (n bits)
1 0 1
1 1 1
1
1 Sense amplifier
Stored word (n bits)
Throughput determined by word length in conventional CAM
CAM Characteristics
Matching probability of word blocks after k-bit search is
Most word blocks are not used after k-bit search (We set k=8 in the hardware implementation)
" Use unused blocks to improve throughput
p
matched= 1 2
!
"
# $
% &
k
Most word blocks are unused (mismatched).
Word Overlapped Search (WOS)
word block
MLS10 MLS11
k-bit 1st-stage
sub-word block (n-k)-bit 2nd-stage sub-word block
Segmentation circuit
Input controller
SL1 SL2
ML10 ML11
MLS1w-1 ML1w-1
ML20 ML21
ML1w-1
CAM block (self-timed circuit)
(synchronous circuit)
1. Partition word block to:
a) small k-bit block and b) large (n-k)-bit block by segmentation block
2. Segmentation block stores Its k-bit matched result
3. Latter block operates when the first block matches
Word block partitioned by segmentation block
CAM architecture based on segmentation method
WOS operation
Mismatch
Match
Mismatch
Search
sub-word 1 1. First k-bit sub-word search
2. Some sub-word block matched
Input connects all word blocks
Mismatch
Match
Mismatch
Search sub-word 2
Match
Mismatch
Mismatch
Search sub-word 1
WOS operation
After k-bit search, new search starts
Assign to unused block
Latter bits matching
Match
Mismatch
Search sub-word 3
Match
Mismatch
Mismatch
Search sub-word 2
Match
WOS operation
How to assign search words to unused blocks?
Assign to
unused block
Still operates
Latter bits matching (word 2)
Should not affect
the old matched block
Categorize Word Blocks
Categorize based on the first k-bit stored word
00000000 00000000
00000001
Group A
Group B
Same stored
k-bit data
Pre-Computation
SL1 SL2
Search line
registers Search line registers
Search line registers
Comparator
controllerMode enable mode
k bits
(n-k) bits
Search data
ctrl
Categorize search words using comparator
Input controller (m=1)
Compare “m” consecutive k-bit search words
If they are different,
they are in different groups (Category 1: fast mode)
Otherwise,
they are in the same group (Category 2: slow mode)
Timing Diagram (fast mode)
High-speed searching based on T
Send search words based on
short delay Ttst
Consecutive words are assigned to
unused blocks
SL1 Ctrl
SL2 ML10
ML20 ML11 ML21
Data1 Data2 Data3 Data0 Data1 Data2
Match
Match Match
Mismatch
Mismatch
Mismatch Mismatch
Mismatch
Mismatch Mismatch
Data4 Data3
Mismatch MLS10
Mismatch MLS11
Hold
Hold
(a)
Timing Diagram (slow mode)
Wait until the current search is complete
Two consecutive words use the same word block
SL1 Ctrl SL2 ML10
ML20
Data1 Data2 Data3
Data0
Match
Match Mismatch
Mismatch
Data1 Data2
Match
Slow
Hold Hold
MLS10 mode
Mismatch
fast
Average Search Delay
! Category 1 – fast mode
Send search words based on the first k-bit delay (T
1st)
! Category 2 – slow mode
Send a new word after the current n-bit search is complete (T
slow)
T
sa= T
1st1 ! m 1 2
"
# $ %
&
'
"
k# $$ %
&
''+ T
slowm 1
2
"
# $ %
&
'
"
k# $$ %
&
''
Table of Contents
" Introduction to content-addressable memory
" Overlapped search mechanism
" Word overlapped search
" Phase overlapped processing
" Hardware implementation
" Evaluation
" Conclusion and future prospect
Word Circuit (precharge)
SL BL BL
SL WL
ML
C C C Low
Low ML
Charge
Charge capacitance on match line
NAND-type word circuit NAND-type CAM cell
CAM cell
! Dynamic logic
! Series of pass transistors ! Match “ON”,
! Mismatch “OFF”
Word Circuit (evaluate)
Match line remains high in mismatched case
Match operation Mismatch operation
Discharging capacitance on match line
" Output goes high
1 0 1 High
High
Discharge Search word
1 0 1
High
Low
1 0 1 1 1 1
Not discharging
" Output remains low
Synchronous Control (conventional)
2 phases are required every search
1 0 1
1 1 1
1 1 0
1 0 1
Search data Clk
High
Low
High
Low
1 0 1
1 1 1
1 1 0
Low Low
Low Low
ML0
ML1
ML2
1 0 1
Evaluate
Precharge
All word circuits are controlled by a global clock signal
Phase Overlapped Processing (POP)
Mismatched blocks always process new word
Each circuit is independently
controlled using local control signals
! Matched word circuit
Move on to precharge phase
! Mismatched word circuit Stay in evaluate phase
#Lowering switching activity of pre-charging signals
1 0 1
1 1 1
1 1 0
1 0 1 Lclk0
High Low
High
Low Lclk1
High Lclk2
High
WOS based POP
Searching words requires just 1 phase
Unused block can process without waiting precharge phase
1 0 1
1 1 1
1 1 0
1 0 1
Low
High
Low
1 0 1
1 1 1
1 1 0
Low High
Low Low
1 1 1
Evaluate
Precharge (Local)
Lctrl0 High
Lctrl1 High
Lctrl2 High High
High
Input can be changed at every phase Input are assigned to
unused block (WOS scheme)
Throughput Ratio
Throughput ratio = T
CST
CA= 2(T
reg+ T
1st+ T
2nd) T
1stT
CS= 2T
SS= 2(T
reg+ T
1st+ T
2nd) T
CA= T
SA= T
1st1 ! m 1
2
"
# $ %
&
'
"
k# $$ %
&
''+ T
slowm 1
2
"
# $ %
&
'
"
k# $$ %
&
''
Conventional Proposed
! T
1stT
SST
SASynchronous search delay (evaluate phase) Asynchronous search delay
Table of Contents
" Introduction to content-addressable memory
" Overlapped search mechanism
" Word overlapped search
" Phase overlapped processing
" Hardware implementation
" Evaluation
" Conclusion and future prospect
Circuit Implementation
Self-precharge circuit controls its word circuit
! 144-bit CAM word block with self-precharge circuit
! Self-precharge circuit pre-charges after 2nd stage is complete.
! Hierarchical 2nd stage block (17 local and 1 global match circuit)
C C C C C C
prec
Local match circuit
x8
x8
ML10 LML140
LML160
ML20
Segment circuit 8 bits 1st-stage
sub-word circuits
LML150 LML00
Global match circuit Self-precharge circuit
0 1 2 14 15 16
ML10 prec
ML10
(a)
(c)
(b)
(d)
weak feedback transistor
prec
prec
NAND-type cell
Store local
matched result
It isn’t affected by input changing
Ctrl 1.0 0
ML10
LML10 ML20 prec0
ML11 LML11
ML21 prec1
Voltage [V]
1.0 0 1.0 1.0 0 0 1.0
0 1.0 1.0 0
0 1.0 1.0 0 0
=261ps TCA
1.75 2.0 2.5 3.25
Time [ns]
3.0
Simulated Waveforms
HSPICE simulation under a 90nm CMOS technology
CAM operates based on T1st (259ps)
After search is complete,
self pre-charging is locally done.
Global match circuit uses only local
matched result.
Performance Comparison
Precharge Evaluate
-64.1%
-100%
Synchronous WOS WOS
POP+ 1.5
1.0
0
Average cycle time [ns]
-64.1%
-64.1% -82.0%
0.5
Conventional Proposed Cycle delay [ps] 1454 261 Energy
[fJ/bit/
search]
Match 0.0003 0.0006
Search 0.160 0.160 Control 0.103 0.001
Total 0.263 0.162
Area [Trs.] 372K 408K
5.57x throughput and 38% energy saving
256-word 144-bit binary CAM@90nm CMOS
Independent control reduces
switching activity of pre-charging
0 0.5 1 1.5 2 2.5
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35
Performance Comparison
Better energy-delay product
Energy dissipation [fJ/bit/search]
Cycle time [ns]
VDD=0.6V
0.7V 0.8V
0.9V 1.0V
Conventional 65nm@JSSC
vol. 46-2
Hybrid(100nm)
@JSSC 40-1