• Ingen resultater fundet

Analytical Performance Modeling

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "Analytical Performance Modeling"

Copied!
28
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

Analytical Performance Modeling of Hierarchical Interconnect Fabrics

Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella

Universitat Politècnica de Catalunya

Supported by Intel Corporation

International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark

(2)

Outline

Introduction

Hierarchical Chip Multiprocessors (CMPs) Performance modeling for CMPs

The cyclic dependency between latency and traffic

• Analytical performance modeling

– Modeling traffic – Modeling latency

– Methods to resolve the dependency

• Results and conclusions

(3)

The trends in CMP design

• Hundreds of computing units per chip

– Smaller, simpler, more power-efficient cores

• Advanced memory management

– Larger on-chip cache

– Increasing interconnect (IC) bandwidth

• Tiled architecture

R R R R

R R R R

R R R R

R R R R

Memor y Con tr ol ler Memor y Con tr ol ler C

L2 R

L1

(4)

Hierarchical interconnects

C+L1

L2

C+L1

L2

L3

IC ( Bus / Ring )

NI R

Dir

Tiled CMP with hierarchical interconnect

R

R R

Memor y Con tr ol ler Memor y Con tr ol ler IC

R

IC

IC IC

R R

R

• Exploit locality of memory references*

(5)

Design of CMP architecture

• Goal: efficient use of chip resources

– Maximize performance

– Fit area/power/thermal budget

• Multidimensional exploration space

(#cores / cache size /

memory hierarchy / IC topologies /…)

• Means: automated design space exploration

– Analytical performance models are essential

C C

L3

R D

R

MC MC

IC

R

IC

R

R

IC

R

IC

R

(6)

Contention modeling

• Contention impacts CMP performance

• Crucial evaluating hierarchical interconnects

– Is the required bandwidth sustainable?

R

R R

y Con tr ol ler y Con tr ol ler IC

R IC

IC IC

R R

R

# of wires? Router architecture?

Local IC topology?

(7)

Motivational example

(a) 8x8 mesh (b) 4x4 mesh with bus clusters

(c) 2x2 mesh with bus clusters

Estimation w/o contention is very

inaccurate!

48 cores, 16 cache modules Legend: core cache IC

0 2 4 6 8 10

(a) (b) (c)

Thr o u gh p u t (I PC)

No contention

With contention

(8)

Analytical modeling of CMP performance

• Analytical models for ICs:

Latency L as a function of traffic λ λ defined by the workload

Emphasis: λ depends on L!

• This work: resolve the cyclic dependency of traffic and latency

– Formulate λ as a function of L – Add existing model for L(λ) – Resolve the system efficiently

L λ IPC

Core

1

Core

i

Core

N

L

i

λ

i

Memory subsystem

  L    L    •••

(Throughput)

(9)

Outline

• Introduction

– Hierarchical Chip Multiprocessors (CMPs) – Performance modeling for CMPs

– The cyclic dependency between latency and traffic

Analytical performance modeling

Modeling traffic Modeling latency

Methods to resolve the dependency

• Results and conclusions

(10)

Modeling memory traffic

Traffic to memory (probability of a memory reference per cycle):

Average latency of memory access Memory access penalty

Core

λ L

Memory subsystem

Parameters of core executing some workload:

1. - ideal Cycles Per Instruction

2. - # Memory references Per Instruction

Real performance of in-order core:

(11)

Modeling average memory latency

• Average latency of memory requests for a core:

0 0,05 0,1 0,15 0,2 0,25

0 5 10

Miss Ratio

Cache size (Mb)

Latencies are calculated using

- Cache latencies

- Interconnect topology - Routing algorithm (XY)

Probabilities are calculated using

- Miss ratio dependency on cache size

Application

15% miss in 64K L1 5% miss in 1M L2

0 0,1 0,2 0,3 0,4

0 5 10

Miss Ra tio

Cache size (Mb)

Application

(12)

Modeling contention latency

CL

MC MC

R

CL

R

CL

R

CL

R

R

C C

L3

NI

D

Mesh NoC Bus-based cluster

Delays in queues are defined by extending M/G/1 queuing model:

“An Analytical Approach for Network-on-Chip Performance Analysis”,

Ogras et al., TCAD, 2010 (Best Paper Award)

(13)

System of non-linear equations

• Solve using numerical methods

• General methods are very slow

– 10x10 mesh (10K vars./eqns.) – MATLAB timeout after few hours

• Proposed methods:

– Fixed-point iteration – Bisection search for λ

The cyclic dependency of L and λ

Any “black-box”

model for L(λ)!

Analytical model for latency

(14)

Fixed-point iteration

+ Fast (10x10 mesh in several ms)

0 10 20 30 40 50

0 0,05 0,1 0,15 0,2

L, av e rag e la te n cy (c ycle s)

λ, average traffic rate (flits/cycle) L(λ) λ (L)

Characteristic of the IC Characteristic of

the cores/workload

Hop-count latency

(15)

Bisection search for λ

– Fast, as fixed-point

– Always converges to an approximate solution (good for homogeneous clusters)

0 10 20 30 40 50

0 0,05 0,1 0,15 0,2

L, av e rag e la te n cy (c ycle s)

λ, average traffic rate (flits/cycle) L(λ) λ (L)

Characteristic of the IC Characteristic of

the cores/workload

λ=0 λ(L

hop-count

)

(16)

Outline

• Introduction

– Hierarchical Chip Multiprocessors (CMPs) – Performance modeling for CMPs

– The cyclic dependency between latency and traffic

• Analytical performance modeling

– Modeling traffic – Modeling latency

– Methods to resolve the dependency

Results and conclusions

(17)

Performance of analytical methods

Test Mesh Cont. lat. Num. of var./eqn.

Runtime (sec)

MATLAB Fixed-Point Bisection

T1 2 x 2 5% 236 0.023 0.001 0.001

T2 4 x 4 13% 1224 1.412 0.001 0.002

T3 6 x 6 8% 3108 30.831 0.002 0.003

T4 8 x 8 12% 6128 408.539 0.006 0.010

T5 10 x 10 23% 10260 Timeout (1hr) 0.010 0.012 T6 10 x 10 46% 10260 Timeout (1hr) 0.022 0.015

T7 10 x 10 55% 10260 Timeout (1hr) NA 0.016

(18)

Case study: performance exploration

Parameter Value Chip area

Core area Core IPC

0

MPI

L1 size L2 size

Memory density Mesh dimensions MC latency

350 mm

2

1.25 mm

2

2.0

0.5

64, 128 Kb 64 Kb to 3 Mb 1 mm

2

/ Mb 2x2 to 16x16 100 cycles

0 0,05 0,1 0,15 0,2 0,25

0 2 4 6 8 10

Miss Ra tio

Cache size (Mb)

1062 configurations explored

Cache Size 64K 128K 256K 512K 1M 2M 4M 8M

Area* (mm

2

) 0.063 0.125 0.25 0.5 1.0 2.0 4.0 8.0

(19)

Simulation environment

Network simulation

Global (mesh)

memory

L3 cache node

Local (bus, ring, …) Bus

Core

Memory controller

• Verify model by simulation

• Cycle-accurate NoC simulator

– On top of BookSim 2.0

• Extensions

– Hierarchical networks – Bus topologies

– Probabilistic state-machines

for cores and memories

(20)

Faithfulness of the model

0 5 10 15 20 25 30 35

1 52 103 154 205 256 307 358 409 460 511 562 613 664 715 766 817 868 919 970 1021

Thr ou ghp u t (IPC)

Configurations sorted in descending order of throughput

Modeling

Simulation

(21)

Best-throughput ordering

Simulation time: 5.5 hours

Modeling time: 16.8 sec (>1000x faster)

0 200 400 600 800 1000

0 200 400 600 800 1000 Be st co n figur atio n s b y anal ysis th at includ e N

Number of best config. by simulation (N) Static latency Full latency

Ideal (Simulation) (1; 33)

(4; 44)

(1; 2) (4; 6)

(50; 64)

0 10 20 30 40 50 60 70

0 10 20 30 40 50 60

Bes t conf igu ra tion s b y analy sis tha t in clu d e N

Number of best configurations by simulation (N)

Static latency Full latency

Ideal (Simulation) No contention With contention

No contention With contention

Ideal (Simulation)

(22)

Conclusions

• Analytical modeling of contention in CMPs is essential

There exists cyclic dependency between latency and traffic of memory requests

• This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)

• Precision of the model is significantly improved

• Current work: out-of-order cores, heterogeneity

(23)

Backup

(24)

Sufficient for convergence of :

0 10 20 30 40 50

0 0,05 0,1 0,15 0,2

L, av er ag e la tency (cy cles )

L(λ) λ (L)

Fixed-point convergence issues

(25)

Bisection search

Latency model Traffic model

(26)

Average latency calculation

• Average Memory Access Time (AMAT):

(27)

Best configuration

- 6x6 mesh, 36 clusters, 5 cores/cluster - total 180 cores with 64K L1, 256K L2 - 68Mb total shared L3

Throughput = 30.81 IPC

R R R R R R

R R R R R R

R R R R R R

R R R R R R

R R R R R R

R R R R R R

Memor y Con tr ol ler Memor y Con tr ol ler

Memory Controller Memory Controller

C+L1

L2

C+L1

L2

C+L1

L2

L3 Bus

C+L1

L2

C+L1

L2

NI R

Dir

(28)

Runtime: Modeling vs Simulation

0,01 0,1 1 10 100 1000

0 100 200 300 400 500 600 700 800 900 1000 1100

Runtime (se co n d s)

Analytical Simulation

Modeling a CMP with

~700 components in 1 second

Referencer

RELATEREDE DOKUMENTER

For combining the between-lab and between-material heteroscedasticity together, new model LB3 is processed on the basis of model MB2: 8 distinct residuals and 8 distinct lab

Performance” was a master’s thesis project in the field of interaction design and consisted of an artistic performance titled “My Body, Your Room.” The live performance

The case studies were sub- jected to four sequential analytical perspectives: the impacts of methodological issues on the EEG results obtained (the focus of this paper); comparing

The essence of this paper is to investigate corporate identity constructions with a critical discourse analytical approach because there are limited studies that explore

Chapter 5: A Vocal based Analytical Method for Goose Behaviour Recognition, part of a journal paper from Sensors, presenting an algorithm for automatic recognition of goose

We applied a series of analytical methods, including a novel framework for sentiment analysis and econometric analysis, to reveal how public social media opinions influence

2 Analytical framework: conceptualising labor agency in global production networks Our analytical framework seeks to explain how a multinational company’s sourcing and CSR

The analytical aim of this paper is to explore how students’ scenario competence is enacted in a Power Game session. The empirical basis for my analysis of game-based