Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip


Qiaoyan Yu¹, José Cano², José Flich², Paul Ampadu³

¹ University of New Hampshire, USA

² Universitat Politècnica de València, Spain

³ University of Rochester, USA

(2)

Outline

• Introduction & Motivation
– Impact of permanent and transient errors on NoC routers
– Advanced topologies
• Proposed method
– LBDRhr
– Transient error control in LBDRhr
• Experimental results
• Summary and conclusions

(3)

Introduction

• Types of MPSoCs:
– Application-specific
  Fully irregular topologies
  System design totally customized
  E.g. Spidergon STNoC
– High-end
  Regular structures (2D mesh-based topologies)
  E.g. Tilera

(4)

Introduction

• Critical challenge in current NoCs: RELIABILITY
– Permanent errors
  E.g. due to defective components (links, routers)
  Solution based on fault-tolerant routing: Logic-based Distributed Routing (LBDR)
– Transient errors
  E.g. due to particle strike
  Solution based on error control coding: Inherent information redundancy (IIR)

Combining LBDR and IIR could be a good solution for addressing both permanent and transient errors in NoCs

(5)

Introduction & Motivation

• Problem: neither LBDR nor IIR can be applied to advanced NoC topologies and configurations
– LBDR approach
  Designed for 2D meshes
  Each router connects to one neighbouring router in each dimension and direction
  Not ready for transient errors
– IIR approach
  Designed for XY routing

[Figure: router with five ports: Local (to the PE), North, East, South and West]

(6)

Advanced Topologies

[Figure: three topologies: diagonal mesh, 2D-mesh with express channels, flattened butterfly. Legend entries: 1-hop links, 2-hop diagonal links, 2-hop straight links, 3-hop links. Port labels: EE port, NNN port.]

The initial 2D-mesh is the underlying topology!

(7)

Proposed Ideas

• To address fault tolerance for advanced topologies:
– Redesign the LBDR mechanism: LBDRhr (LBDR for high-radix networks)
  Adaptive routing algorithm supported
  2 virtual channels
  Deadlock-free for the high-radix topologies defined
– Develop a new method to detect transient errors in LBDRhr logic
  Exploits the inherent information redundancy in LBDRhr to significantly reduce the error control overhead

(8)

NoC Router Functionality

• Compute the routing direction for the next hop
• Pass the packet to its intended output port

Note: 24 is the maximum number of routing ports per router, but not all of them need to be implemented; this depends on the topology.

(9)

Permanent Error Management

• Previous method: Logic-Based Distributed Routing (LBDR)
– Four routing ports per switch (North, South, East, West)
– Two sets of bits: routing bits (Rxy, 2 per output port) and connectivity bits (Cx, 1 per output port)
– Minimal path support

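[Figure: LBDR logic. Two comparators (CMP) compare Xcurr with Xdst and Ycurr with Ydst to produce N', E', W' and S'. These signals are combined with routing bits (e.g. RNE, RNW) and connectivity bits (e.g. CN) to drive the output port signals (e.g. N).]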
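The LBDR logic above amounts to a comparator stage followed by a few gates per output port. The following is a minimal Python sketch for the North output only, not the authors' implementation. It assumes that X grows eastwards and Y grows southwards, and the names `lbdr_north`, `r_ne`, `r_nw` and `c_n` are hypothetical. The boolean expression is the N'' example from the adaptive-part slide of this deck.

```python
def lbdr_north(x_curr, y_curr, x_dst, y_dst, r_ne, r_nw, c_n):
    """Return True if the North output port may be requested.

    Assumption (not stated on the slides): X grows eastwards and
    Y grows southwards.
    """
    # Comparator stage: relative direction of the destination.
    n1 = y_dst < y_curr  # N'
    e1 = x_dst > x_curr  # E'
    w1 = x_dst < x_curr  # W'
    # Routing bits say whether the packet may turn East or West after
    # going North. A destination in the same column needs no turn.
    n2 = (n1 and e1 and r_ne) or (n1 and w1 and r_nw) or (n1 and not e1 and not w1)
    # The connectivity bit masks a missing or faulty North link.
    return bool(n2 and c_n)
```

With `c_n` cleared, the North port is never requested, which is how a permanently failed link is excluded from routing.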

(10)

Permanent Error Management

• LBDRhr
– Tolerates permanent link and router failures
– Implemented with three basic logic blocks
  1-hop, 2-hop and 3-hop ports
– Uses a few configuration bits to store local information about the neighboring routers
  8 configuration bits for routing purposes: Rxy
  2 bits for two deroute options (special cases) at every input port: DRx
  1 connectivity bit per output port: Cx

(11)

LBDRhr logic (common part)

Relative direction of the message's destination

X' set: dest is at least one hop away in X direction
XX' set: dest is at least two hops away in X direction
XXX' set: dest is at least three hops away in X direction

If XX' set -> X' set
If XXX' set -> XX' and X' set
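The signal definitions above can be sketched for the X dimension as follows. This is an illustration only: the function name `x_direction_signals` is hypothetical, and the assumption that a positive offset means East is not taken from the slides.

```python
def x_direction_signals(x_curr, x_dst):
    """Common-part signals for the X dimension.

    Assumption: a positive offset means the destination lies to the East.
    """
    offset = x_dst - x_curr
    return {
        "E'": offset >= 1,    # at least one hop East
        "EE'": offset >= 2,   # at least two hops East
        "EEE'": offset >= 3,  # at least three hops East
        "W'": offset <= -1,
        "WW'": offset <= -2,
        "WWW'": offset <= -3,
    }
```

Because each signal is a threshold on the same offset, the implications on the slide hold by construction, and East and West signals can never be set together. Those two properties are what the error detection logic later exploits.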

(12)

LBDRhr logic (adaptive part)

• 3-hop ports: one gate per output signal, e.g. NNNlbdr = NNN' & Cnnn
• 2-hop ports: one gate per output signal (3-hop ports have priority), e.g. NElbdr = N' & E' & Cne & !3hops
• 1-hop ports: routing restrictions taken into account, e.g. N'' = (N' & E' & Rne) | (N' & W' & Rnw) | (N' & !E' & !W')
• 1-hop ports: one gate per output signal (3-hop and 2-hop ports have priority), e.g. N''' = N'' & Cn & !3hop & !2hop
• In case of no solution at all (non-minimal path support): take the configured deroute option at the switch
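The priority among 3-hop, 2-hop and 1-hop ports can be sketched as below. This is a reduced illustration that models only the three example ports from the expressions above (NNN, NE and N), so the "3hops" and "2hops" terms collapse to single signals. That simplification and the function name `adaptive_outputs` are assumptions. In a full router these terms would presumably cover every implemented long-range port.

```python
def adaptive_outputs(nnn1, n1, e1, n2, c_nnn, c_ne, c_n):
    """Select among one 3-hop (NNN), one 2-hop (NE) and one 1-hop (N) port.

    nnn1, n1 and e1 are the common-part signals NNN', N' and E'.
    n2 is N'', i.e. N' after the routing restrictions were applied.
    """
    # 3-hop port: highest priority, gated only by its connectivity bit.
    nnn_out = nnn1 and c_nnn
    any_3hop = nnn_out
    # 2-hop port: allowed only when no 3-hop port was selected.
    ne_out = n1 and e1 and c_ne and not any_3hop
    any_2hop = ne_out
    # 1-hop port: allowed only when no longer port was selected.
    n_out = n2 and c_n and not any_3hop and not any_2hop
    return {"NNN": nnn_out, "NE": ne_out, "N": n_out}
```

Clearing `c_nnn` models a failed 3-hop link: the request then falls through to the shorter ports.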

(13)

LBDRhr logic (escape part)

(14)

Permanent Error Management

• Deadlock-free routing example

[Figure: 4x4 network (nodes 0 to 15) shown twice, with numbered routing steps. VC0: faulty 2D-mesh with express channels. VC1: faulty 2D-mesh. Annotation: "Deroute here".]

(15)

Previous Transient Error Control Methods

• Limitations of previous methods
– Need knowledge of error locations
– Consume large link switching power
– Increase area overhead, or
– Are limited to XY routing

[Figure: previous methods: flooding [1], triple modular redundancy [2], IIR [3]]

(16)

New Inherent Information Redundancy Extraction

• Forbidden signal patterns in routers are regarded as inherent information redundancy (IIR)
• More IIR is found in an LBDRhr-based router

[Figure: waveforms of "Request to East Port" and "Request to West Port", each shown with error and without error, for five forbidden patterns: (a) Switched single-request, (b) Switched multi-request, (c) Multi-switched request, (d) Bidirectional switched-request, (e) Mute-request]

(17)

Error Detection for CMP in Router

Err_1.1 = WWW' & EEE'

Err_2.1 = WWW' & !(WW' & W' & EE'_n & E'_n)

We keep the forbidden signal patterns of the opposite directions in mind to further constrain the signal patterns.
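The two checks above can be written down directly. A minimal Python sketch follows. It assumes that the `_n` suffix on the slide denotes the complemented signal, which the slide does not state, and the function name `cmp_errors` is hypothetical.

```python
def cmp_errors(s):
    """Evaluate Err_1.1 and Err_2.1 on a dict of comparator outputs.

    s maps signal names such as "W'" or "EEE'" to booleans.
    Assumption: "EE'_n" and "E'_n" on the slide mean NOT EE' and NOT E'.
    """
    # Err_1.1: the destination cannot be three hops West and three hops East.
    err_1_1 = s["WWW'"] and s["EEE'"]
    # Err_2.1: WWW' implies WW' and W', and excludes EE' and E'.
    consistent = s["WW'"] and s["W'"] and not s["EE'"] and not s["E'"]
    err_2_1 = s["WWW'"] and not consistent
    return err_1_1, err_2_1
```

For example, a destination three hops to the West with a transient fault clearing W' raises `err_2_1`, while the fault-free pattern raises neither flag.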

(18)

Error Detection for Multi-hop Logic

Err_3-hops = (NNE & SSW) | (EEN & SSW) | (EES & WWN) | (SSE & NNW) | (NNN & SSS) | (EEE & WWW) | (NNE & NNW) | (SSW & SSE) | (EEN & EES) | (EES & EEW) | (WWN & WWS)

Err_2-hops = (NN & SS) | (EE & WW) | (NE & SW) | (SE & NW) | (NE & SE) | (SW & SE)
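Each term in these expressions is a pair of output requests that can never be asserted together in fault-free operation. A minimal Python sketch of the 2-hop check, using exactly the six pairs listed above, is shown here. The names `FORBIDDEN_2HOP_PAIRS` and `err_2hops` and the representation of requests as a set of port names are assumptions made for illustration.

```python
# The six forbidden pairs from the Err_2-hops expression.
FORBIDDEN_2HOP_PAIRS = [
    ("NN", "SS"), ("EE", "WW"), ("NE", "SW"),
    ("SE", "NW"), ("NE", "SE"), ("SW", "SE"),
]

def err_2hops(requests):
    """Return True if the asserted 2-hop requests contain a forbidden pair.

    requests is a set of 2-hop port names asserted by the routing logic.
    """
    return any(a in requests and b in requests for a, b in FORBIDDEN_2HOP_PAIRS)
```

A single request never matches a pair, so an error that flips only one signal to a different but individually legal value goes undetected by this check.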

(19)

Experimental Results

• Error detection rate
• Reliability
• Flit throughput and latency
• Area, power and delay

(20)

Experimental Results: Setup

• Multiple NoC topologies
• LBDRhr routing
• 8-bit address
• Synthesized with a TSMC 65nm technology

[Figure: logic facilitating error injection in modeling. Labels: Error Injection Enable, Original Gate, Input, MUX, Output.]

(21)

Error Detection Rate in CMP

[Figure: error detection rate (axis from 0.4 to 1) for single, double and triple errors]

• No matter how the NoC size changes, the error detection rate for E' and W' is 100% because of the use of the internal node.
• The error detection rate for EE', WW', EEE' and WWW' is less than 100%.
– Only the occurrence of opposite direction pairs helps to detect errors in EE', WW', EEE' and WWW' (the non-zero subtraction output does not help to detect the errors that cause wrong EE', WW', EEE' and WWW').

(22)

Error Detection Rate in Multi-hop Logic

[Figure panels: 3-hop path; 2-hop path]

– The error detection rate varies only slightly across topologies.
– The error detection rate of the 2-hop logic improves as the number of errors injected into the logic increases, because more IIR is available.

(23)

Residual Error Rate Comparison

• The proposed method
– Reduces the residual error rate by two orders of magnitude over TMR
– Slightly varies the error detection rate

(24)

Flit Throughput and Latency

• The number of faulty links in each topology is increased until only the underlying 2D-mesh remains

(25)

Area, Power and Delay

Metric               LBDRhr without error detection   LBDRhr with proposed error detection   LBDRhr with TMR
Area (µm²)           342 (100%)                       363 (106.1%)                           806 (235.7%)
Delay (ns)           0.495 (100%)                     0.54 (109.1%)                          0.51 (103%)
Dynamic power (µW)   199.97 (100%)                    207.27 (103.7%)                        267.39 (133.7%)
Leakage power (µW)   1.8084 (100%)                    1.8405 (101.8%)                        4.0969 (226.5%)
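The percentages in the table are each value relative to the LBDRhr baseline without error detection. A short Python sketch that recomputes them from the raw numbers is given here. The helper name `relative` is hypothetical and the one-decimal rounding is an assumption chosen to match the table.

```python
def relative(value, baseline):
    """Value as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * value / baseline, 1)

# Raw values from the table: (baseline, proposed, TMR).
rows = {
    "area_um2": (342, 363, 806),
    "delay_ns": (0.495, 0.54, 0.51),
    "dynamic_uW": (199.97, 207.27, 267.39),
    "leakage_uW": (1.8084, 1.8405, 4.0969),
}

for name, (base, proposed, tmr) in rows.items():
    print(name, relative(proposed, base), relative(tmr, base))
```

Recomputing the ratios this way is a quick consistency check on the reconstructed table.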

(26)

Conclusions

• For transient errors, our method reduces the residual error rate and the average power consumption by up to 200x and 30%, respectively, over triple modular redundancy.
• For permanent errors, the proposed method is able to cover permanent failures of all the long-range links and 80% of the failure combinations of the 2D-mesh links.

(27)


Thank you!

(28)

Error Detection for Deroute Logic

• The mutual exclusivity of the four deroute directions is regarded as a new inherent information redundancy

N_deroute = ~DR[0] & ~DR[1]
E_deroute = ~DR[0] & DR[1]
W_deroute = DR[0] & ~DR[1]
S_deroute = DR[0] & DR[1]
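The four expressions above decode the two deroute bits into a one-hot direction, so any pattern with zero or several active directions must be the result of an error. A minimal Python sketch of the decoder and the check follows. The function names `deroute_select` and `deroute_error` are hypothetical.

```python
def deroute_select(dr0, dr1):
    """Decode the two deroute bits DR[0] and DR[1] into four direction signals."""
    return {
        "N": (not dr0) and (not dr1),
        "E": (not dr0) and dr1,
        "W": dr0 and (not dr1),
        "S": dr0 and dr1,
    }

def deroute_error(select):
    """Flag any pattern in which the number of active directions is not one."""
    return sum(bool(v) for v in select.values()) != 1
```

A fault-free decoder always yields exactly one active direction, so `deroute_error` only fires when a transient error corrupts one of the decoded signals.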
