Ahmed Abousamra, Rami Melhem, Alex Jones
University of Pittsburgh

(1)

(2)

Power efficiency has become a primary concern in the design of CMPs.

The NoC of Intel’s TeraFLOPS processor consumes more than 28% of the chip’s power.

Network messages can be classified into critical and non-critical.

It may be possible to send non-critical messages on a slower plane without hurting performance.

(3)

Baseline Single Plane NoC

Control Plane

• Carries control and coherence messages: data requests, invalidates, acknowledgments, …

• Operates at baseline’s voltage & frequency

Data Plane

• Carries data messages

• Operates at a lower voltage & frequency to save power

(4)

Motivation & Related work

Importance of data messages to performance

The Optimization Problem

Proposed Solution

Déjà Vu Switching

Analysis of the acceptable data plane speed

Evaluation Summary

(5)

[Figure: a cache line having 8 data words (1 2 3 4 5 6 7 8) and the data message that carries it. The message consists of a header, then the critical word (word 5), then the non-critical words (1 2 3 4 6 7 8).]

Subsequent miss for another word in the line → delayed cache hit
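The word ordering in the data message above can be sketched in a few lines. This is only an illustration of critical-word-first ordering as drawn on the slide; the function name `critical_word_first` and the list representation of a cache line are assumptions made for the example, not part of the original design.

```python
def critical_word_first(line_words, critical_index):
    """Return the payload order: the critical word, then the remaining words in line order.

    Hypothetical helper for illustration; the header flit is not modeled.
    """
    if not 0 <= critical_index < len(line_words):
        raise IndexError("critical_index out of range")
    critical = line_words[critical_index]
    rest = line_words[:critical_index] + line_words[critical_index + 1:]
    return [critical] + rest


# Cache line with 8 data words; the miss was for word 5 (index 4).
cache_line = [1, 2, 3, 4, 5, 6, 7, 8]
print(critical_word_first(cache_line, critical_index=4))
# [5, 1, 2, 3, 4, 6, 7, 8]
```

A later miss for, say, word 7 of the same line has to wait until that word arrives, which is the delayed cache hit the slide refers to.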

(6)

If delayed cache hits are overly delayed, performance can suffer.

[Figure: % of L1 misses that are delayed hits (y-axis, 0% to 50%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean.]

(7)

(8)

How slow can we operate the data plane to maximize energy savings while not impacting performance?

(9)

• Split the NoC physically into 2 planes: control + data

• On the data plane:

  • Use circuit switching to speed up communication.

  • Reduce voltage & frequency.

  • Send a control message to establish the circuit once a cache hit is detected.

  • Do not block the circuit establishment message: Déjà Vu Switching

• Analyze the acceptable slowdown of the data plane to minimize energy while maintaining performance.

(10)

[Figure: control plane and data plane; legend: request packet, reservation packet, data packet]

1) Upon a cache miss, a data request is sent.

(11)

2) Upon a data hit in the next cache level, the circuit reservation packet is sent.

(12)

3) Finally, the data packet is sent to the requester on the reserved circuit.
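The three steps can be summarized as a short message-flow sketch. It is a simplification written for this summary: the `Packet` record, the `serve_miss` function, and the integer node ids are hypothetical, and routing, timing, and the cache lookup itself are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    kind: str   # "request", "reservation", or "data"
    plane: str  # "control" or "data"
    src: int    # hypothetical node id of the sender
    dst: int    # hypothetical node id of the receiver


def serve_miss(requester, home):
    """Packets generated for one miss that hits in the next cache level, in order."""
    return [
        # 1) The miss sends a data request on the control plane.
        Packet("request", "control", requester, home),
        # 2) The hit in the next cache level sends a reservation packet,
        #    which sets up the circuit back to the requester.
        Packet("reservation", "control", home, requester),
        # 3) The data packet follows on the reserved circuit of the data plane.
        Packet("data", "data", home, requester),
    ]


for packet in serve_miss(requester=3, home=12):
    print(packet.kind, packet.plane, packet.src, "->", packet.dst)
```

Only the data packet uses the slower data plane; the request and the reservation travel on the control plane.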

(13)

[Figure: a router with reservation packets 1, 2, 3 on the control plane and data packets 1, 2, 3 on the data plane. Reserved circuits: East-North, West-North, West-East.]

(14)

Reserving the West-East connection:

• Reservation queue of the West input port: E at the head of the queue

• Reservation queue of the East output port: W at the head of the queue

(15)

[Figure: reservation queues of the input ports and of the output ports (North, West, South, East, Local), with the heads of the queues marked.]

(16)

(17)

(18)

Each input and output port must independently track the reserved circuits it is part of.

Any two reservation packets that share part of their paths must traverse all the shared links in the same order.

Data packets must be injected onto the data plane in the same order their reservation packets are injected onto the control plane.
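The per-port bookkeeping can be sketched as a pair of FIFO queues per router. This is a minimal model under stated assumptions, not the authors' hardware design: the `Router` class and its method names are hypothetical, each input port queues the output ports it has been reserved for, each output port queues the input ports reserved for it, and a connection is realized only when a reservation is at the head of both queues.

```python
from collections import deque

PORTS = ("N", "E", "S", "W", "L")  # North, East, South, West, Local


class Router:
    """Hypothetical model of the reservation queues of one data plane router."""

    def __init__(self):
        # Per input port: output ports, in reservation order.
        self.in_q = {p: deque() for p in PORTS}
        # Per output port: input ports, in reservation order.
        self.out_q = {p: deque() for p in PORTS}

    def reserve(self, in_port, out_port):
        """Record a circuit. Never blocks, even if either port is already reserved."""
        self.in_q[in_port].append(out_port)
        self.out_q[out_port].append(in_port)

    def ready_connections(self):
        """Connections whose reservation is at the head of both of its queues."""
        ready = []
        for in_port, queue in self.in_q.items():
            if queue:
                out_port = queue[0]
                if self.out_q[out_port][0] == in_port:
                    ready.append((in_port, out_port))
        return ready

    def release(self, in_port, out_port):
        """The data packet has passed: retire the reservation at both ports."""
        assert self.in_q[in_port][0] == out_port
        assert self.out_q[out_port][0] == in_port
        self.in_q[in_port].popleft()
        self.out_q[out_port].popleft()


router = Router()
router.reserve("W", "E")  # first reservation packet
router.reserve("E", "N")  # second
router.reserve("W", "N")  # third: accepted although W and N are already reserved
print(router.ready_connections())  # [('E', 'N'), ('W', 'E')]
router.release("W", "E")
router.release("E", "N")
print(router.ready_connections())  # [('W', 'N')]
```

The third reservation is queued rather than blocked, and its circuit becomes active only after the earlier circuits through the West input and North output are released. The sketch relies on the ordering constraints above: data packets must arrive in the order in which their reservations were queued.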

(19)

(20)

[Figure: timeline over cycles 1 to 13]

(21)

[Figure: timeline over cycles 1 to 13, with the reservation packet injected early at cycle 1]

(22)

(23)

(24)

We use the functional simulator, Simics, to simulate cache coherent CMPs of 16 and 64 cores.

We use Orion2 to get power numbers for the interconnect routers.

We evaluate with:

• Synthetic traces, which allow varying the network load

• Execution-driven simulation of parallel benchmarks

(25)

Baseline NoC:

• Single plane, 16-byte links, packet switched with a 3-cycle router pipeline, clocked at 4 GHz

Evaluated NoC:

• Control plane: 6-byte links, packet switched, 4 GHz

• Data plane: 10-byte links, circuit switched

Control and coherence packets: 1 flit

Data packets:

• Baseline NoC: 5 flits

• Data plane: 7 flits
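The flit counts can be checked with a little arithmetic. The sketch below shows one accounting that reproduces the 5 and 7 flits, and it rests on assumptions the slides do not state: a 64-byte cache line (8 words of 8 bytes each), one header flit per data packet on the packet-switched baseline, and no header flit on the circuit-switched data plane. Other accountings, such as a small header sharing a flit with data, also fit the numbers.

```python
import math

# Assumption: 8 data words of 8 bytes each. The slides say only "8 data words".
CACHE_LINE_BYTES = 64


def data_packet_flits(link_bytes, header_flits):
    """Flits needed to carry one cache line over links of the given width."""
    return header_flits + math.ceil(CACHE_LINE_BYTES / link_bytes)


# Baseline: 16-byte links, assumed one header flit.
baseline_flits = data_packet_flits(link_bytes=16, header_flits=1)
# Data plane: 10-byte links, assumed no header flit on a reserved circuit.
data_plane_flits = data_packet_flits(link_bytes=10, header_flits=0)

print(baseline_flits, data_plane_flits)  # 5 7
```

Under these assumptions the narrower data plane links add two flits per data packet compared with the baseline.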

(26)

Normalized NoC energy and completion time

[Figure: two panels, normalized energy consumption and normalized completion time (y-axes, 0% to 120%), versus traffic injection rate (0.01, 0.03, 0.05 request / cycle / node). Series: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

(27)

Normalized execution time on a 16-core CMP

[Figure: normalized execution time (y-axis, 0% to 120%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean.]

(28)

Normalized NoC energy on a 16-core CMP

[Figure: normalized energy consumption (y-axis, 0% to 100%).]

(29)

Normalized NoC energy and execution time on a 64-core CMP

[Figure: two panels, normalized execution time (y-axis, 0% to 140%) and normalized energy consumption (y-axis, 0% to 100%), for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, specjbb, water nsquared, water spatial, and the geometric mean.]

(30)

[Figure: performance relative to single plane baseline (y-axis, 90% to 110%) for PS 4/4, DV 4/4, PS 4/2.66, PS+CW 4/2.66, DV 4/2.66.]

(31)

L. Cheng et al. (ISCA'06) and A. Flores et al. (IEEE Trans. Computers'10): Heterogeneous NoC using wires of different latency and power characteristics to improve performance and reduce NoC energy.

Their proposal requires wide links (75 bytes); performance degrades with narrow links.

Our work differs in:

• Tying latency of messages to performance

• Using Déjà Vu Switching to compensate for slower …

(32)

Problem: Saving power in the NoC by reducing the data plane’s power consumption without impacting performance.

Delayed cache hits are important to performance.

Operating data plane in circuit-switched mode allows it to operate at reduced frequency.

Déjà Vu Switching allows reservation to proceed when resources are not currently available.

The constraints governing the speed of the data

(33)