Switch Design
a unified view of micro-architecture and circuits
Giorgos Dimitrakopoulos
Electrical and Computer Engineering Democritus University of Thrace (DUTH)
Xanthi, Greece dimitrak@ee.duth.gr
http://utopia.duth.gr/~dimitrak
Algorithms-Applications Algorithms-Applications
System abstraction
• Processors for computation
• Memories for storage
• IO for connecting to the outside world
• Network for communication and system integration
Operating System Operating System
Instruction Set Architecture Instruction Set Architecture
Microarchitecture Microarchitecture Register-Transfer Level Register-Transfer Level
Logic design Logic design
Circuits Circuits Devices Devices
Network Network
Processors
Processors Memory Memory
IO IO
Logic, State and Memory
• Datapath functions
– Controlled by FSMs – Can be pipelined
• Mapped on silicon chips
– Gate-level netlist from a cell library
– Cells built from transistors after custom layout
• Memory macros store large chunks of data
– Multi-ported register files for fast
local storage and access of data
On-Chip Wires
• Passive devices that connect transistors
• Many layers of wiring on a chip
• Wire width, spacing
depends on metal layer
– High density local
connections, Metal 1-5
– Upper metal layers > 6 are
wider and used for less dense
low-delay global connections
Future of wires: 2.5D – 3D integration
Evolution
Optical wiring
• Optical connections will be integrated on chip
– Useful when the power of electrical connections will limit the available chip IO bandwidth
• A balanced solution that involves both optical and
electrical components will probably win
Let’s send a word on a chip
• Sender and receiver on the same clock domain
• Clock-domain crossing just adds latency
• Any relation of the sender- receiver clocks is exploited
– Mesochronous interface
– Tightly coupled synchronizers
[AMD Zacate]
Point-to-point links: Flow control
S R
Data
S R
Data Valid
S Valid R
Stall Data
• Synchronous operation
– Data on every cycle
• Sender can stall
– Data valid signal
• Receiver can stall
– Stall (back-pressure) signal
• Either can stall
– Valid and Stall backpressure
– Partially decouple Sender and Receiver by adding a buffer at the receive side
S R
Stall
Data
Sender and Receiver decoupled by a buffer
• Receiver accepts some of the sender’s traffic even if the transmitted words are not consumed
– When to stop? How is buffer overflow avoided?
• Let’s see first how to build a buffer
• Clock-domain crossing can be tightly coupled within the buffer
Buffer organization
• A FIFO container that maintains order of arrival
– 4 interfaces (full, empty, put, get)
• Elastic
– Cascade of depth-1 stages – Internal full/empty signals
• Shift register in/Parallel out
– Put: shift all entries – Get: tail pointer
• Circular buffer
– Memory with head / tail pointers
– Wrap around array implementation
– Storage can be register based
Buffer implementation
• The same basic structure evolves with extra read/write flexibility
• Multiplexers and head/tail pointers handle data movement and addressing
Elastic Shift In/Parallel Out Circular array
Link-level flow control: Backpressure
• Link-level flow control provides a closed feedback loop to control the
flow of data from a sender to a receiver
• Explicit flow control (stall-go)
– Receiver notifies the sender when to stop/resume transmission
• Implicit flow control (credits)
– Sender knows when to stop to avoid buffer overflow
• For unreliable channels we need extra
mechanisms for detecting and handling
transmission errors
STALL-GO flow control
• One signal STALL/GO is sent back to the receiver
– STALL=0 (G0) means that the sender is allowed to send – STALL=1 (STALL) means that the sender should stop
– The sender changes its behavior the moment it detects a change to the backpressure signal
• Data valid (not shown) is asserted when new data are available
Stall
STALL-GO flow control: example
Stall
• In-flight words will be dropped or they will replace the ones that wait to be consumed
– In every case data are lost
• STALL and GO should be connected with the
buffer availability of the receiver’s queue
– The example assumes that the receiver is stalled or released for other
network reasons
• STALL should be asserted early enough
– Not drop words in-flight
– Timing of STALL assertion guarantees lossless operation
• GO should be asserted late enough
– Have words ready-to-consume before new words arrive – Correct timing guarantees high throughput
• Minimum buffering for full throughput and lossless operation should cover both STALL&GO re-action cycles
Buffering requirements of STALL&GO
Stall
If not available the link
remains idle
STALL&GO on pipelined and elastic links
• Traffic is “blind” during a time interval of Round-trip Time (RTT)
– the source will only learn about the effects of its transmission RTT after this transmission has started
– the (corrective) effects of a contention notification will only appear at the site of
Credit-based flow control
• Sender keeps track of the available buffer slots of the receiver
– The number of available slots is called credits
– The available credits are stored in a credit counter
• If #credits > 0 sender is allowed to send a new word
– Credits are decremented by 1 for each transmitted word
• When one buffer slot is made free in the receive side, the sender is notified to increase the credit count
– An example where credit update signal is registered first at the
receive side
Credit-based flow control: Example
0* means that credit counter is
incremented and decremented at the same cycle (ways and stays at 0) Credit Update
Available Credits
Credit-based flow control: Buffers and Throughput
Condition for 100% throughput
• The number of registers that the data and the credits pass through define the credit loop
– 100% throughput is guaranteed only when the number of available buffer slots at the receive side equals the registers of the credit loop
• Changing the available number of credits can reconfigure maximum throughput at runtime
– Credit-based FC is lossless with any buffer size > 0.
– Stall and Go FC requires at least one loop extra buffer space than credit-based FC
Credit loop
Link-level flow control enhancements
• Reservation based flow control
– Separate control and data functions
– Control links race ahead of the data to reserve resources
– When data words arrive, they can proceed with little overhead
• Speculative flow control
– The sender can transmit cells even without sufficient credits
• Speculative transmissions occur when no other words with available credits is eligible for transmission
– The receiver drops an incoming cell if its buffer is full
• For every dropped word a NACK is returned to the sender
• Each cell remains stored at the sender until it is positively acknowledged
– Each cell may be speculatively transmitted at most once
• All retransmissions must be performed when credits are available
– The sender consumes credit for every cell sent, i.e., for speculative as
well as credited transmissions.
Send a large message(packet)
• Send long packet of 1Kbit over a 32-bit-wire channel
– Serialize the message to 16 words of 32 bits
– Need 16 cycles for packet transmission
• Each packet is transmitted word-by-word
– When the output port is free, send the next word immediately
– Old fashioned Store-and-forward required the entire packet to reach each node before
initiating next transmission
Buffer allocation policies
• Each transmitted word needs a free downstream buffer slot
– When the output of the downstream node is blocked the buffer will hold the arriving words
• How much free buffering is guaranteed before sending the first word of a packet?
– Virtual Cut Through (VCT): The available buffer slots equal the words of the packet
• Each blocked packet stays together and consumes the buffers of only one node
– Wormhole: Just a few are enough
• Packet inevitably occupies the buffers of more nodes
• Nothing is lost due to flow control backpressure policy
VCT and Wormhole in graphics
Link sharing
• The number of wires of the link does not increase
– One word can be sent on each clock cycle
– The channel should be shared
• A multiplexer is needed at the output port of the
sender
• Each packet is sent un-interrupted
– Wormhole, and VCT behave this way – Connection is locked for a packet until
the tail of the packet passes the
output port
Who drives the select signals?
• The arbiter is responsible for selecting which packet will gain access to the output channel
– A word is sent if buffer slots are available downstream
• It receives requests from the inputs and grants only one of them
– Decisions are based on some internal priority state
Arbitration for Wormhole and VCT
• In wormhole and VCT the words of each packet are not mixed with the words of other packets
• Arbitration is performed once per packet and the decision is locked at the output for all packet duration
• Even if a packet is blocked downstream the connection does not change until the tail of the packet leaves the output port
– Buffer utilization managed by flow control mechanism
How can I place my buffers?
Let’s add some complexity: Networks
• A network of terminal nodes
– Each node can be a source or a sink
• Multiple point-to-point links connected with switches
• Parallel communication between components
Source/Sink Terminal Node
Switch
• Multiple input-output permutations should be supported
• Contention should be resolved and non-winning inputs should be handled
– Buffered locally
– Deflected to the network
• Separate flow control for each link
• Each packet needs to know/compute the path to its destination
Complexity affects the switches
• More than one terminal nodes can connect per switch
– Concentration good for bursty traffic
– Local switch isolates local traffic from the main network
How are the terminal nodes connected to the switch?
Switch design: IO interface
Separate flow control per link
Switch design: One output port
per-output requests
Let’s reuse the circuit we already have for one output port
• Move buffers to the inputs
Switch design: Input buffers
Data from input#1
Requests
for output #0
Switch design: Complete output ports
• How are the output requests computed?
Routing computation
• Routing computation generates per output requests
– The header of the packet carries the requests for each intermediate node (source routing)
– The requests are computed/retrieved based on the packet’s destination
(distributed routing)
Routing logic
• Routing logic translates a global destination address to a local output port request
– To reach node X from node Y should use output port #2 of Y
• A Lookup-table is enough for
holding the request vector that
corresponds to each destination
Switch building blocks
Running example of switch operation
• Switches transfer packets
• Packets are broken to flits
– Head flit only knows packet’s destination
• The wires of each link equals the bits of each flit
Buffer access
• Buffer incoming packets per link
• Read the destination of the head of each queue
Routing Computation/Request Generation
• Compute output requests and drive the output
arbiters
Arbitration-Multiplexer path setup
• Arbitrate per output
• The grant signals
– Drive the output multiplexers
– Notify the inputs about the arbitration outcome
Switch traversal
• Words H will leave the switch on the next clock
edge provided they have at least one credit
Link traversal
• Words going to a non-blocked output leave the switch
• The grants of a blocked output (due to flow control) are lost
– An output arbiter can also stall in case of blocked output
Head-Of-Line blocking: performance limiter
• The FIFO order of the input buffers limit the throughput of the switch
– The flit is blocked by the Head-of-Line that lost arbitration
– A memory throughput problem
Wormhole switch operation
• The operations can fit in the same cycle or they can be pipelined
– Extra registers are needed in the control path
– Registers in the input/output ports already present – LT at the end involves a register write
• Body/tail flits inherit the decisions taken by the head flits
Look-ahead routing
• Routing computation is based only on packet’s destination
– Can be performed in switch A and used in switch B
• Look-ahead routing computation (LRC)
Look-ahead routing
• The LRC is performed in parallel to SA
• LRC should be completed before the ST stage in the same switch
– The head flit needs the output port requests for the next
switch
Look-ahead routing details
• The head flit of each packet carries the output port
requests for the next switch together with the destination
address
Low-latency organizations
• Baseline
– SA precedes ST (no speculation)
• SA decoupled from ST
– Predict or Speculate arbiter’s decisions
– When prediction is wrong replay all the tasks (same as baseline)
• Do in different phases
– Circuit switching
– Arbitration and routing at setup phase
– At transmit only ST is needed since contention is already resolved
• Bypass switches
– Reduce latency under certain criteria
– When bypass not enabled same as baseline
ST
Setup
Xmit
LRC SA LT
SA
LRC ST LT SA ST
LRC
ST LT ST LT
Setup LT
Xmit
Prediction-based ST: Hit
Crossbar Buffer
X+
X- Y+
Y-
X+
X- Y+
Y- PREDICTOR
Crossbar is reserved
Idle state: Output port X+ is selected and reserved
1st cycle: Incoming flit is transferred to X+ without RC and SA
Correct 1st cycle: RC is performed The prediction is correct!
2nd cycle: Next flit is transferred to X+ without RC and SA
ARBITER
Prediction-based ST: Miss
X+
X- Y+
Y-
Idle state: Output port X+ is selected and reserved
Correct Dead flit
1st cycle: RC is performed The prediction is wrong! (X- is correct) 2nd/3rd cycle: Dead flit is removed; retransmission to the correct port
Buffer X+
X- Y+
Y-
PREDICTOR ARBITER
1st cycle: Incoming flit is transferred to X+ without RC and SA
KILL
Kill signal to X+ is asserted
Crossbar
@Miss: tasks replayed
as the baseline case
Speculative ST
• Assume contention doesn’t happen
– If correct then flit transferred directly to output port without waiting SA
– In case of contention replay SA
• Wasted cycle in the event of contention
– Arbiter decides what will be sent on the next cycle
Switch Fabric Control
B A A
clk port 0 port 1 grant valid out data out
0 1 4
cycle 2 3
A p0
A
A B p1
???
B A
A ?
B
A p0
B A
A
B
A B
Wins
A
Wins
XOR-based ST
• Assume contention never happens
– If correct then flit transferred directly to output port
– If not then bitwise=XOR all the
competing flits and send the encoded result to the link
– At the same time arbitrate and mask (set to 0) the winning input
– Repeat on the next cycle
• In the case of contention encoded outputs (due to contention) are resolved at the receiver
– Can be done at the output port of the switch too
Switch Fabric Control
B A B A
A A^B
A
0 1 4
cycle 2 3
clk port 0 port 1 grant valid out data out
A p0
A
A B p1 B^A
A
A
A
No Contention Contention B
Wins
XOR-based ST: Flit recovery
• Works upon simple XOR property.
– (A^B^C) ^ (B^C) = A
– Always able to decode by XORing two sequential values
• Performs similarly to speculative switches
– Only head-flit collisions matter
– Maintains previous router’s arbitration order
Coded
Flit Buffer
A A ^ B ^C B ^ C C A
0
0
B ^ C
1
A ^ B ^ C
C B ^ C B
Bypassing intermediate nodes
• Switch bypassing criteria:
– Frequently used paths
– Packets continually moving along the same dimension
• Most techniques can bypass some pipeline stages only for specific packet transfers and traffic patterns
– Not generic enough
3-cycle
SRC DST
3-cycle 3-cycle
Virtual bypassing paths
3-cycle 3-cycle 1-cycle
Bypassed
1-cycle
Bypassed
Circuit switching
• Network traversal done in phases
• Path reservation (multiple switch allocations) is done all at once
• Switch traversal finds no contention
– Data buffers are avoided
• Part of the reserved and unutilized path is needlessly blocked
Speculation-free low-latency switches
• Prediction and speculation drawbacks
– On miss-prediction(speculation) the tasks should be replayed – Latency not always saved. Depends on network conditions
• Merged Switch allocation and Traversal (SAT)
– Latency always saved – no speculation
– Delay of SAT smaller than SA and ST in series
Arbitration and Multiplexing
• Stop thinking arbitration and multiplexing separately
• One new algorithm that fits every policy
– Generic priority-based solution that works even when
arbitration and multiplexing are done separately
Round-robin arbitration
• Round-robin arbitration
– Most commonly used
– Start from the High-Priority position and grant the first active request you find after searching all cyclically all requests
– Granted input becomes lowest-priority for the next arbitration
• Cyclic search found in many other algorithms
• Transform each request and priority bit to a 2bit unsigned arithmetic symbol
– The request is the MSBit
• Round-robin arbitration is equivalent to finding the maximum symbol that lies in the rightmost position
• Cyclic search disappears
Let’s think out of the box
Working examples
• Maximum selection is done via a tree structure
– The rightmost maximum symbol always wins
• Direction flags (L,R) always point to the direction of the winning input
– Direction flags form the path to the winning input
Why not switch data in parallel?
Grant signals produced simultaneously
• When F=0 the maximum came from the Right
• When F=1 the maximum came from the Left
• Onehot, thermometer, weighted-binary grant signals can be derived by the tree of MAX nodes
Direction flag F
Wormhole/VCT MARX-based switches
SRAM-based input buffers
• Buffer reads and writes are treated as separate tasks
– Buffer write occurs always after link traversal
• A separate read and write port is required for maximum
performance
Speculative Buffer Read
• Buffer read occurs after SA for Head flits (no speculation)
• Buffer read can occur in parallel to SA (speculation)
– HOL Head flit is read out before knowing if it received a grant
– Once SA has finished speculation is removed for the rest flits
Pipelining and credits
• Credit loop begins from upstream SA stage
• Deep pipelining increases the buffering requirements for 100% throughput
– Elastic pipeline stages that can stall independently can
partially alleviate the problem
Bufferless
• Assume there are no buffers
– When a packet loses switch allocation it is:
• Dropped
• Deflected to any free output
• Deflection spreads contention in space (in the network)
– Allocation solves contention at each time slot but spreads it in time (next time slots)
• Deflection (or misrouting) can occur in buffered switches too
– Rotary router
Slicing
• Introduces hierarchy inside the switch
• When traffic is concentrated to certain outputs the switch suffers high performance penalties
• Intermediate buffers partially alleviate the loss
Dimension slicing Port slicing
How can we increase throughput?
Green flow is blocked until red passes the switch.
Physical channel left idle
• Decouple output port allocation from next-hop buffer allocation
• Contention present on:
– Output links (crossbar output port) – Input port of the crossbar
• Contention is resolved by time sharing the resources
– Mixing words of two packets on the same channel – The words are on different virtual channels
– Separate buffers at the end of the link guarantee no interference between the packets
Virtual Channels
Virtual channels
• Virtual-channel support does not mean extra links
– They act as extra street lanes
– Traffic on each lane is time shared on a common channel
• Provide dedicated buffer space for each virtual channel
– Decouple channels from buffers
– Interleave flits from different packets
• “The Swiss Army Knife for Interconnection Networks”
– Prevent deadlocks
– Reduce head-of-line blocking
– Provide QoS
Datapath of a VC-based switch
• Separate buffer for each VC
• Separate flow control signals (credits) for each VC
• The radix of the crossbar can stay the same
• Input VCs can share a
common input port of the crossbar
– On each cycle at most one VC
will receive a new word
• A switch connects input VCs to output VCs
• Routing computation (RC) determines the output port
– May restrict the output VCs that can be used
• An input VC should allocate first an output VC
– Allocation is performed by the VC allocator (VA)
• RC and VA are done per packet on the head flits and inherited to the rest flits of the packet
Input VCs Output VCs
Per-packet operation of a VC-based switch
Per-flit operation of a VC-based switch
• Flits with an allocated output VC fight for an output port
– Output port allocated by switch allocator
– The VCs of the same input share a common input port of the crossbar – Each input has multiple requests
(equal to the #input VCs)
• The flit leaves the switch provided that credits are available downstream
– Credits are counted per output VC
Input VCs Output VCs
Switch allocation
• All VCs at a given input port share one crossbar input port
• Switch allocator matches ready-to-go flits with
crossbar time slots
– Allocation performed on a cycle-by-cycle basis
– N×V requests (input VCs), N resources (output ports) – At most one flit at each input port can be granted
– At most one flit et each output port can be leave
• Other options need more crossbar ports (input-output
speedup)
Switch allocation example
• One request (arc) for each input VC
• Example with 2 VCs per input
– At most 2 arcs leaving each input
– At most 2 requests per row in the request matrix
• Matching:
– Each grant must satisfy a request
– Each requester gets at most one grant – Each resource is granted at most once
Inputs Outputs
0
0 1 2
2 1 Inputs
Outputs 0
2 1
0
2 1
Bipartite graph Request matrix
Separable allocation
• Matchings have at most one grant per row and per column
• Two phases of arbitration
– Column-wise and row-wise – Perform in either order
– Arbiters in each stage are independent
• But the outcome of each one affects the quality of the overall match
• Fast and cheap
• Bad choices in first phase can prevent second stage from generating a good matching
– Multiple iterations required for a good match
Input-first:
Output-first:
Implementation
Output first allocation
Input first allocation
Multi-cycle separable allocators
• Allocators produce better results if they run for many cycles
– On each cycle they try to fill the input-output match with new edges
• We don’t have the time to wait more than one cycle
• Run two or more allocators in parallel and interleave their
grants to the switch
Centralized allocator
• Wavefront allocation
– Pick initial diagonal
– Grant all requests on each diagonal
• Never conflict!
– For each grant, delete requests in same row, column
– Repeat for next diagonal
Switch allocation for adaptive routing
• Input VCs can request more than one output ports
– Called the set of Admissible Output Ports (AOP) – This adds an extra selection step (not arbitration) – Selection mostly tries to load balance the traffic
• Input-first allocation
– For each input VC select one request of the AOP – Arbitrate locally per input and select one input VC
– Arbitrate globally per output and select one VC from all fighting inputs
• Output-first allocation
– Send all requests of the AOP of each input VC to the outputs – Arbitrate globally per output and grant one request
– Arbitrate locally per input and grant an input VC
– For this input VC select one out of the possibly multiple grants of the
AOP set
VC allocation
• Virtual channels (VCs) allow multiple packet flows to share physical resources (buffers, channels)
• Before packets can proceed through router, need to claim ownership of VC buffer at next router
– VC acquired by head flit, is inherited by body & tail flits
• VC allocator assigns waiting packets at inputs to output VC buffers that are not currently in use
– N×V inputs (input VCs), N×V outputs (output VCs)
– Once assigned, VC is used for entire packet’s duration in the switch
Input VCs
Output VCs
VC allocation example
• Input VC match to an output VC simultaneously with the rest
– Even if it belongs to the same input
– No port constraint as in switch allocators
• VC allocation refers to allocating buffer id (output VC) on the next router
– Allocation can be both separable (2 arbitration steps) or centralized
Inputs VCs Output VCs
0 In#0 1
In#1
In#2
Out#0
Requests Grants
2 3 4 5
0 1 2 3 4 5
Out#1
Out#2
Inputs VCs Output VCs
0 In#0 1
In#1
In#2
Out#0 2
3 4 5
0 1 2 3 4 5
Out#1
Out#2
• Any-to-any flexibility in VC allocator is unnecessary
– Partition set of VCs to restrict legal requests
• Different use cases for VCs restrict possible transitions:
– Message class never changes
– Resource classes are traversed in order
• VCs within a packet class are functionally equivalent
• Can take advantage of these
properties to reduce VC allocator complexity!
Input – output VC mapping
VA single cycle or pipelined organization
• Header flits see longer latency than body/tail flits
– RC, VA decisions taken for head flits and inherited to the rest of the packet – Every flit fights for SA
• Can we parallelize SA and VA?
The order of VC and switch allocation
• VA first SA follows
– Only packets will an allocated output VC fight for SA
• VA and SA can be performed concurrently:
– Speculate that waiting packets will successfully acquire a VC – Prioritize non-speculative requests over speculative ones – Speculation holds only for the head flits (The body/tail flits
always know their output VC)
VA SA Description
Win Win Everything OK!! Leave the switch
Win Lose Allocated a VC
Retry SA (not speculative - high priority next cycle) Lose Win Does not know the output VC
Allocated output port (grant lost – output idle)
Lose Lose Retry both VA and SA
Speculative switch allocation
• Perform switch allocation in parallel with VC allocation
– Speculate that the latter will be successful – If so, saves delay, otherwise try again
– Reduces zero-load latency, but adds complexity
• Prioritize non-speculative requests
– Avoid performance degradation due to miss-speculation
• Usually implemented through secondary switch allocator
– But need to prioritize non-speculative grants
Free list of VCs per output
• Can assign a VC non-speculatively after SA
• A free list of output VCs exists at each output
– The flit that was granted access to this output
receives the first free VC before leaving the switch – If no VC available output port allocation slot is missed
• Flit retries for switch allocation
• VCs are not unnecessarily occupied for flits that
don’t win SA
VC buffer implementation
Static partitioning Dynamic partitioning
Linked-List
Shared Buffer
Implementation
VC-based switches with MARX units
• Merged Switch Allocation and Traversal can be applied to VC-based switches too
• VA can be run before or in parallel to SAT
VC-based switches with MARX units: Datapath
NoC: The science & art of on-chip connections
Network- on-Chip
MICRO ARCH
CIRCU IT S
Micro-architecture of Network-on-Chip Routers Giorgos Dimitrakopoulos, Springer, mid 2013
Micro-architecture of Network-on-Chip Routers Giorgos Dimitrakopoulos, Springer, mid 2013
ADVERTISEMENT
References (1)
• W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks, DAC 2001
• A. Kumar, et al. A 4.6tbits/s 3.6ghz single-cycle noc router with a novel switch allocator. In in 65nm CMOS”, ICCD-2007
• A. Kumar, et al. “Express virtual channels: towards the ideal interconnection fabric”, ISCA ’07
• H. Matsutani, et al. “Prediction router: A low-latency on-chip router architecture with multiple predictors”, IEEE Trans. Computers, 2011
• G. Michelogiannakis, J. Balfour, and W. Dally, “Elastic bufferflow control for on-chip networks”, HPCA 2009.
• Mitchell Hayenga, Mikko Lipasti, “The NoX Router”, MICRO 2011
• T. Moscibroda and O. Mutlu. A case for bufferless routing in on-chip networks, ISCA 2009
• R. Mullins, A. West, and S. Moore. Low-latency virtual-channel routers for on-chip networks. ISCA 2004
• L.-S. Peh and W. J. Dally. A delay model and speculative architecture for pipelined router HPCA 2001
• D. Wentzlaff, et al. “On-chip interconnection architecture of the tile processor. Micro, IEEE,2007
• Y. J. Yoon, et al. “Virtual channels vs. Multiple Physical Networks”, DAC 2010
• M. Azimi, et al. “Flexible and adaptive on-chip interconnect for terascale architectures,” Intel Technology Journal, 2009.
• A. Golander, et al. “A cost-efficient L1–L2 multicore interconnect: Performance, power, and area considerations,” IEEE TCAS-I 2011.
• P. Kumar, “Exploring concentration and channel slicing in on-chip network router,” HPCA 2009
• M. Galles, “Spider: A high-speed network interconnect,” IEEE Micro, 1997.
• A. S. Vaidya, et al. , “Lapses: A recipe for high performance adaptive router design”, HPCA 1999.
• C. Batten Interconnection Networks Course, Columbia University
• M. Katevenis, Packet Switch Architectures Course, University of Crete, Greece.
• W. J. Dally, “Virtual-Channel Flow Control,” ISCA 1990.
• D. U. Becker and W. J. Dally, “Allocator implementations for network-on-chip routers,”, SC 2009.
• S. S. Mukherjee, et al., “A comparative study of arbitration algorithms for the Alpha 21364 pipelined router,” ASPLOS 2002.
• Y. Tamir and H.-C. Chi, “Symmetric crossbar arbiters for VLSI communication switches,” IEEE Trans. on Par. and Distributed Systems, 1993.
• J. Hurt, et al. , “Design and implementation of high-speed symmetric crossbar schedulers,” in ICC 1999
• G. Ascia, et al., “Implementation and analysis of a new selection strategy for adaptive routing in networks-on-chip,” IEEE T. on Comp. 2008
• P. Salihundam, et al. , “A 2Tb/s 6x4 Mesh Network with DVFS and 2.3Tb/s/W router in 45nm CMOS,” in Symp. VLSI Circuits, 2010.
• P. Gupta and N. McKeown, “Design and implementation of a fast crossbar scheduler,” IEEE Micro 1999.
• J. Flich and D. Bertozzi (editors), “Network on Chip in the Nanoscale Era”, CRC Press, 2010
References (2)
• L. Pirvu et al. “The impact of link arbitration on switch performance,” HPCA, 1999.
• M. Coppola, et al. “Spidergon: A Novel On-Chip Communication Network” IEEE SOC 2004.
• W. Dally and C. Seitz. “Deadlock-Free Message Routing in Multiprocessor Interconnection Networks” IEEE Tran.on Computers, 1987
• M. Karol, “Input vs Output Queuing on a Space-Division Packet Switch”, In IEEE Transactions on Communications, 1987.
• Zhonghai Lu, et al. “Evaluation of on-chip networks using deflection routing”. In Proceedings of GLSVLSI, 2006.
• Zhonghai Lu, et al.. “Layered switching for networks on chip”. DAC 2007
• R. Ginosar, "Metastability and Synchronizers: A Tutorial," IEEE Design & Test, Sept/Oct. 2011.
• G. Dimitrakopoulos, D. Bertozzi, “Switch architecture”, in J. Flich and D. Bertozzi (editors), “Network on Chip in the Nanoscale Era”, CRC Press, 2010
• G. Dimitrakopoulos, “Logic-level Design of Basic Switch Components”, in J. Flich and D. Bertozzi (editors), “Network on Chip in the Nanoscale Era”, CRC Press, 2010
• G. Dimitrakopoulos E. Kalligeros, “Dynamic-Priority Arbiter and Multiplexer Soft Macros for On-Chip Networks Switches”, DATE 2012
• G. Dimitrakopoulos, E. Kalligeros, K. Galanopoulos, “Merged Switch allocation and traversal in Network-on-Chip Switches”, to appear in IEEE transactions on Computers (available at IEEExplore preprints)
• Se-Joong Lee et al. Packet-Switched On-Chip Interconnection Network for System-on-Chip Applications, IEEE TCAS II 2005.
• Donghyun Kim et al. A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on-Chip, ISCAS2005.
• Anh Tran and Bevan Baas, "RoShaQ: High-Performance On-Chip Router with Shared Queues,“ iCCD 2011
• Anh Tran et al. "A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms,“ IEEE Trans.on CAD, 2010.
• B. Dally and B. Towles, “Interconnection networks”, Morgan Kaufman 2004
• C.A. Nicopoulos, “ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers “, MICRO 2006.
• Clive Maxfield , “2D vs. 2.5D vs. 3D ICs 101” EE Times, Design 2012
• Mike Santarini , “2.5D ICs are more than a stepping stone to 3D Ics”, EE Times, Design 2012
• Nathan Binkert et al. , “The Role of Optics in Future High Radix Switch Design”, ISCA-2011
• Eylon Caspi “Design Automation for Streaming Systems”, PhD Thesis, Berkeley 2005
• C Minkenberg, M Gusat , “Design and performance of speculative flow control for high-radix datacenter interconnect switches”, JPDC 09
• Peh, Li-Shiuan and Dally, William J., "Flit-Reservation Flow Control," in HPCA 1999
• M. Gerla and L. Kleinrock. Flow Control: A Comparative Survey. IEEE Transactions on Communications, 1980.