
University of Pittsburgh Ahmed Abousamra Rami Melhem Alex Jones


Academic year: 2022


(1)


(2)

Power efficiency has become a primary concern in the design of CMPs.

The NoC of Intel’s TeraFLOPS processor consumes more than 28% of the chip’s power.

Network messages can be classified into critical and non-critical.

It may be possible to send non-critical messages on a slower plane without hurting performance.

(3)

Baseline: Single Plane NoC

Control Plane

Carries control and coherence messages: data requests, invalidates, acknowledgments, …

Operates at baseline’s voltage & frequency

Data Plane

Carries data messages

Operates at a lower voltage & frequency to save power

(4)

Motivation & Related work

Importance of data messages to performance

The Optimization Problem

Proposed Solution

Déjà Vu Switching

Analysis of the acceptable data plane speed

Evaluation

Summary

(5)

Cache line having 8 data words: 1 2 3 4 5 6 7 8

Data message: Header, 5 (critical word), 1 2 3 4 6 7 8 (non-critical words)

Delayed cache hit: a subsequent miss for another word in the line
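The reordering of the cache line into a data message can be illustrated with a minimal Python sketch. The function name `critical_word_first` and the zero-based `critical_index` parameter are illustrative assumptions, not names from the slides.

```python
from typing import List


def critical_word_first(line_words: List[int], critical_index: int) -> List[int]:
    """Return the words of a cache line with the critical word moved to the front.

    The remaining (non-critical) words keep their original relative order.
    """
    critical = line_words[critical_index]
    rest = line_words[:critical_index] + line_words[critical_index + 1:]
    return [critical] + rest


# The slide's example: the miss was for word 5 of an 8-word line.
payload = critical_word_first([1, 2, 3, 4, 5, 6, 7, 8], critical_index=4)
print(payload)  # [5, 1, 2, 3, 4, 6, 7, 8]
```

A later miss for any of the words still in flight (for example word 7 here) would be a delayed cache hit: the line is already on its way, so the miss waits for the data message rather than issuing a new request.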

(6)

If delayed cache hits are overly delayed, performance can suffer.

[Figure: % of L1 misses that are delayed hits (y-axis 0% to 50%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean.]

(8)

How slow can we operate the data plane to maximize energy savings while not impacting performance?

(9)

Split NoC physically into 2 planes: control + data

On data plane:

Use circuit switching to speed up communication.

Reduce voltage & frequency.

Send a control message to establish the circuit once a cache hit is detected.

Do not block circuit establishment message: Déjà Vu Switching

Analyze acceptable slowdown of the data plane to minimize energy while maintaining performance.

(10)

[Figure: request packet, reservation packet, and data packet shown on the control plane and the data plane.]

1) Upon a cache miss, a data request is sent

(11)

2) Upon a data hit in the next cache level, the circuit reservation packet is sent

(12)

3) Finally, the data packet is sent to the requester on the reserved circuit
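The three steps can be summarized as an ordered list of packets, each tagged with the plane it travels on. This is only an illustrative sketch: the `Plane` and `Packet` types, the node numbers, and the function name `miss_transaction` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Plane(Enum):
    CONTROL = "control"
    DATA = "data"


@dataclass(frozen=True)
class Packet:
    kind: str      # "request", "reservation", or "data"
    plane: Plane   # which physical plane carries the packet
    src: int       # sending node (hypothetical node id)
    dst: int       # receiving node (hypothetical node id)


def miss_transaction(requester: int, home: int) -> List[Packet]:
    """Packets generated by one cache miss that hits in the next cache level."""
    return [
        # 1) Upon a cache miss, a data request is sent.
        Packet("request", Plane.CONTROL, requester, home),
        # 2) Upon a hit in the next cache level, the reservation packet is sent.
        Packet("reservation", Plane.CONTROL, home, requester),
        # 3) The data packet follows on the reserved circuit.
        Packet("data", Plane.DATA, home, requester),
    ]


for pkt in miss_transaction(requester=3, home=12):
    print(pkt.kind, pkt.plane.value, pkt.src, "->", pkt.dst)
```

Only the last packet uses the data plane; the request and the reservation both travel on the control plane, which is why the control plane stays at the baseline's voltage and frequency.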

(13)

[Figure: a router with reservation packets 1, 2, 3 on the control plane and data packets 1, 2, 3 on the data plane. Reserved circuits: East-North, West-North, West-East.]

(14)

Reserving the West-East connection:

[Figure: the reservation queue of the West input port holds E and the reservation queue of the East output port holds W, shown at the head of the queues.]

(15)

[Figure: reservation queues of the input ports and of the output ports (North, West, South, East, Local), with the head of the queues marked.]

(16)

[Figure, continued: reservation queues of the input ports and of the output ports, with the head of the queues marked.]

(17)

[Figure, continued: reservation queues of the input ports and of the output ports (North, West, South, East, Local), with the head of the queues marked.]

(18)

Each input and output port must independently track the reserved circuits it is part of.

Any two reservation packets that share part of their paths must traverse all the shared links in the same order.

Data packets must be injected onto the data plane in the same order their reservation packets are injected onto the control plane.
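One way to picture the per-port tracking is a pair of FIFO queues per router, sketched below in Python. This is a simplified software model under stated assumptions, not the authors' hardware design: the class `DataPlaneRouter`, its method names, and the rule that a connection is realized only when the heads of both queues name each other are illustrative choices consistent with the queue figures.

```python
from collections import deque

PORTS = ("N", "E", "S", "W", "L")  # North, East, South, West, Local


class DataPlaneRouter:
    """Toy model of one data plane router with per-port reservation queues."""

    def __init__(self):
        # Input queue entries name the output port reserved for a future packet.
        self.in_q = {p: deque() for p in PORTS}
        # Output queue entries name the input port the packet will arrive on.
        self.out_q = {p: deque() for p in PORTS}

    def reserve(self, in_port, out_port):
        """Record a reservation; it is queued even if the ports are busy."""
        self.in_q[in_port].append(out_port)
        self.out_q[out_port].append(in_port)

    def ready_connections(self):
        """Connections whose reservation is at the head of both queues."""
        ready = []
        for in_port in PORTS:
            if self.in_q[in_port]:
                out_port = self.in_q[in_port][0]
                if self.out_q[out_port][0] == in_port:
                    ready.append((in_port, out_port))
        return ready

    def release(self, in_port, out_port):
        """Remove a reservation once its data packet has passed through."""
        assert (in_port, out_port) in self.ready_connections()
        self.in_q[in_port].popleft()
        self.out_q[out_port].popleft()


router = DataPlaneRouter()
router.reserve("W", "E")  # West-East
router.reserve("N", "E")  # waits behind West-East at the East output
router.reserve("W", "N")  # waits behind West-East at the West input
print(router.ready_connections())  # [('W', 'E')]
router.release("W", "E")
print(router.ready_connections())  # [('N', 'E'), ('W', 'N')]
```

Because the queues are strictly first-in first-out, the model only works if reservations and data packets arrive in the same relative order at every shared port, which is what the ordering constraints above require.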

(20)

[Figure: timeline, cycles 1 to 13.]

(21)

[Figure: timeline, cycles 1 to 13.]

Reservation Packet injected early at cycle 1

(24)

We use the functional simulator, Simics, to simulate cache coherent CMPs of 16 and 64 cores.

We use Orion2 to get power numbers for the interconnect routers.

We evaluate with:

Synthetic traces: allow varying the network load

Execution-driven simulation of parallel benchmarks

(25)

Baseline NoC:

Single plane, 16-byte links, packet switched with a 3-cycle router pipeline, clocked at 4 GHz

Evaluated NoC:

Control plane: 6-byte links, packet switched, 4 GHz

Data plane: 10-byte links, circuit switched

Packet sizes:

Control and coherence packets: 1 flit

Data packets: 5 flits on the baseline NoC, 7 flits on the data plane
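The flit counts are consistent with a simple calculation, sketched below in Python. Several inputs are assumptions not stated on the slide: a 64-byte cache line, one header flit per data packet on the baseline, and no separate header flit on the circuit-switched data plane. The helper names `flits` and `serialization_ns` are illustrative.

```python
import math

LINE_BYTES = 64  # assumption: cache line size, not stated on the slide


def flits(payload_bytes: int, link_bytes: int, header_flits: int = 0) -> int:
    """Number of flits needed to carry a payload over a link of given width."""
    return header_flits + math.ceil(payload_bytes / link_bytes)


def serialization_ns(n_flits: int, freq_ghz: float) -> float:
    """Time to push n_flits onto a link at one flit per cycle."""
    return n_flits / freq_ghz


# Baseline: 16-byte links, assumed 1 header flit + 4 payload flits.
baseline_flits = flits(LINE_BYTES, link_bytes=16, header_flits=1)
# Data plane: 10-byte links, assumed no separate header flit.
data_plane_flits = flits(LINE_BYTES, link_bytes=10)

print(baseline_flits, data_plane_flits)        # 5 7
print(serialization_ns(baseline_flits, 4.0))   # 1.25
print(serialization_ns(data_plane_flits, 2.0)) # 3.5
```

The two planes together use 6 + 10 = 16 bytes of link width, the same as the baseline. Under these assumptions the narrower, slower data plane takes noticeably longer to serialize a line, which is the latency that circuit switching and early reservation are meant to offset.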

(26)

Normalized NoC energy and completion time

[Figure, two panels: normalized energy consumption and normalized completion time (y-axis 0% to 120%) versus traffic injection rate (0.01, 0.03, 0.05 request / cycle / node). Series: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

(27)

Normalized execution time on a 16-core CMP

[Figure: normalized execution time (y-axis 0% to 120%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean. Series: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

(28)

Normalized NoC energy on a 16-core CMP

[Figure: normalized energy consumption (y-axis 0% to 100%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean. Series: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

(29)

Normalized NoC energy and execution time on a 64-core CMP

[Figure, two panels: normalized execution time (y-axis 0% to 140%) and normalized energy consumption (y-axis 0% to 100%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, specjbb, water nsquared, water spatial, and the geometric mean. Series: 4 GHz, 4/4 GHz, 4/3 GHz, 4/2.66 GHz, 4/2 GHz.]

(30)

Performance Relative to Single Plane Baseline

[Figure: relative performance (y-axis 90% to 110%) for barnes, blackscholes, bodytrack, fluidanimate, lu contig, lu noncontig, ocean contig, radiosity, radix, raytrace, specjbb, water nsquared, water spatial, and the geometric mean. Series: PS 4/4, DV 4/4, PS 4/2.66, PS+CW 4/2.66, DV 4/2.66.]

(31)

L. Cheng et al. (ISCA'06) and A. Flores et al. (IEEE Trans. Computers'10): Heterogeneous NoC using wires of different latency and power characteristics to improve performance and reduce NoC energy.

Proposal requires wide links (75 bytes), but performance degrades with narrow links.

Our work differs in:

Tying latency of messages to performance

Using Déjà Vu Switching to compensate for slower

(32)

Problem: Saving power in the NoC by reducing the data plane’s power consumption without impacting performance.

Delayed cache hits are important to performance.

Operating data plane in circuit-switched mode allows it to operate at reduced frequency.

Déjà Vu Switching allows reservation to proceed when resources are not currently available.

The constraints governing the speed of the data

(33)
