## Analytical Performance Models for NoCs with Multiple Priority Traffic Classes

Sumit K. Mandal, Umit Y. Ogras Arizona State University Raid Ayoub, Michael Kishinevsky Intel Corporation

CASES, October 15, 2019







## Outline

Need for analytical fabric model generation

## Background

- Networks-on-chip (NoC) used in industry
- Prior work on NoC performance analysis
- Prior work on queuing networks
- Proposed network transformations
- Experimental results
- Conclusion and future work



### **Performance Modeling for Emerging Applications**

<sup>o</sup>ercentage of Total Time

#### Examples of emerging applications

- Virtual reality, autonomous driving
- Machine learning, Al

#### System modeling challenges

- Both SW/HW complexities grow
- Long simulations needed for power, performance, and thermal analysis (*minutes instead of milliseconds*)

#### Research need

- Communication fabric: Central shared resource
- Fast and accurate system level modeling
- Automated generation of high-level performance models of industrial SoCs





## Outline

## Need for analytical fabric model generation

## Background

- Networks-on-chip (NoC) used in industry
- Prior work on NoC performance analysis
- Prior work on queuing networks
- Proposed network transformations
- Experimental results
- Conclusion and future work



## **Priority Aware Networks-on-Chip Basics**

- Industrial NoCs use routers with priority arbitration to achieve predictable latency
  - Packets in NoC have higher priority than new packets



- Mux' in routers modeled as priority arbiters and servers
- Inputs with filled color denote higher priority



## Example Fabric: Xeon Phi (KNL) Processor

### **Knights Landing Overview**



| TILE | 2 VPU | СНА       | 2 VPU |  |
|------|-------|-----------|-------|--|
|      |       | 1MB<br>L2 |       |  |
|      | Core  |           | Core  |  |

Chip: 36 Tiles interconnected by 2D Mesh Tile: 2 Cores + 2 VPU/core + 1 MB L2

Memory: MCDRAM: 16 GB on-package; High BW DDR4: 6 channels @ 2400 up to 384GB IO: 36 lanes PCIe Gen3. 4 lanes of DMI for chipset Node: 1-Socket only Fabric: Omni-Path on-package (not shown)

Vector Peak Perf: 3+TF DP and 6+TF SP Flops Scalar Perf: ~3x over Knights Corner Streams Triad (GB/s): MCDRAM : 400+; DDR: 90+

Source Intel: All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. KNL data are preliminary based on current expectations and are subject to change without notice. IBinary Compatible with Intel Xeon processors using Haswell Instruction Set (except TSX). <sup>2</sup>Bandwidth numbers are based on STREAM-like memory access pattern when MCDRAM used as flat memory. Results have been estimated based on internal Intel analysis and accounted for incommissional purposes only. Any difference in system



### **Performance Analysis of Communication Fabric**

- A 4x4 mesh with YX routing
- Source to destination latency (L<sub>SD</sub>) has four components
  - Waiting time in source queue  $(W_Q^S)$
  - Deterministic vertical latency  $(L_v)$
  - Waiting time at the junction  $(W_0^J)$
  - Deterministic horizontal latency  $(L_h)$
- L<sub>v</sub> and L<sub>h</sub> depend on the source-destination pair and fabric topology

$$L_{SD} = W_Q^S + L_v + W_Q^J + L_h$$

 W<sup>S</sup><sub>Q</sub> and W<sup>J</sup><sub>Q</sub> depend on injection rates at different queues and need detailed analysis





## Outline

- Need for analytical fabric model generation
- Background
  - Networks-on-chip (NoC) used in industry
  - Prior work on NoC performance analysis
  - Prior work on queuing networks
- Proposed network transformations
- Experimental results
- Conclusion and future work



### **Prior Work on Performance Analysis of Networks**

| NoC     | Priority-<br>based | Multiple<br>classes | Scalable          | Off-chip<br>Network    | Priority-<br>based | Multiple<br>classes | Scalable     |
|---------|--------------------|---------------------|-------------------|------------------------|--------------------|---------------------|--------------|
| [1,2,5] | ×                  | $\checkmark$        | $\checkmark$      | [6,7]                  | $\checkmark$       | ×                   | $\checkmark$ |
| [3,4]   | $\checkmark$       | x                   | $\checkmark$      | [8]                    | $\checkmark$       | $\checkmark$        | ×            |
|         |                    | NoC                 | Priority<br>based | /- Multiple<br>classes | Scalable           |                     |              |
|         |                    | This wor            | k 🗸               | $\checkmark$           | $\checkmark$       |                     |              |

[1] Ogras et al. "An analytical approach for network-on-chip performance analysis." *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems* 29.12 (2010).

[2] Bogdan et al. "Non-stationary traffic analysis and its implications on multicore platform design." *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems* 30.4 (2011).

[3] Kiasari et al. "An analytical latency model for networks-on-chip." IEEE Trans. on Very Large Scale Integration Systems21.1 (2013).

[4] Kashif et al. "Bounding buffer space requirements for real-time priority-aware networks." *Design Automation Conference (ASP-DAC),* 2014 19th Asia and South Pacific.

[5] Qian et al. "A support vector regression (SVR)-based latency model for network-on-chip (NoC) architectures." *IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems* 35.3 (2016).

[6] Bertsekas et al. Data networks. Vol. 2. New Jersey: Prentice-Hall International, 1992.

[7] Walraevens, Joris. Discrete-time queueing models with priorities. Diss. Ghent University, 2004.

[8] G. Bolch et al. Queuing Networks and Markov Chains. Wiley. 2006



## **Background: Queuing Systems**



- Kendall's notation for queuing discipline: A/B/m
- Arrival and departure may have different distribution (e.g. Poisson (M), Deterministic (D), General (G)).



W: average waiting time, T: service time, R: average residual time



## **Residual Time for Single Queue Node**

- Residual time (R): delay of serving the next token due to the remaining service time for currently processed token
- Arrival distribution is Geometric
  P{X = k} = p(1 p)^{k-1}

$$R_{avg} = \frac{1}{T_{tot}} \sum_{i=1}^{M(T_{tot})} \left( \sum_{\tau=0}^{T_i - 1} \tau \right)$$
$$= \frac{1}{2} \lambda \left( \overline{T^2} - \overline{T} \right) \text{ (for Geo/G/1)}$$
$$= \frac{1}{2} \rho (T - 1) \text{ (for Geo/D/1)}$$



## **Residual Time for Basic Priority Queue (Geo/G/1)**

Reminder: basic priority



 Expression for residual time of class 1 packets

$$R_1 = \frac{1}{2}\rho_1(T-1) + \frac{1}{2}\rho_2(T-1)$$

 Expression for residual time of class 2 packets

$$R_2 = \frac{1}{2}\rho_1(T+1) + \frac{1}{2}\rho_2(T-1)$$

 Class 2 packets have higher residual time due to lower priority



 $M(T_{tot})$  = total no. of packets arrived

 $\tau =$  intermediate variable for sum

during time interval T<sub>tot</sub>

e lab

## Outline

- Need for analytical fabric model generation
- Background
  - Networks-on-chip (NoC) used in industry
  - Prior work on NoC performance analysis
  - Prior work on queuing networks
- Proposed network transformations
- Experimental results
- Conclusion and future work

![](_page_12_Picture_9.jpeg)

## **Overview of the Automated Flow**

![](_page_13_Figure_1.jpeg)

### Limitation of Basic Priority Based Models (1)

![](_page_14_Figure_1.jpeg)

Applying basic priority equation for class 3 tokens results in **pessimistic solution** 

<u>Reason:</u> Not all tokens of class 1 will block tokens of class 3. Tokens of class 3 can occupy the server if a class 2 token is being served

![](_page_14_Picture_4.jpeg)

### Limitation of Basic Priority Based Models (2)

![](_page_15_Figure_1.jpeg)

Applying basic priority equation for class 3 tokens results in **optimistic solution** 

<u>Reason:</u> Tokens of class 2 will have effect of class 1 indirectly as class 2 tokens have to wait due to class 3 tokens

![](_page_15_Picture_4.jpeg)

## **Proposed Network Transformations**

- Extend decomposition method [1] to handle priority arbitration based multi-class networks in industry
- We identified two transformations
  - ST: structural transformation
  - RT: service rate transformation
- Complex priority-based networks are decomposed iteratively to systems of equivalent queues using ST/RT
- Obtain a closed form analytical expression for the equivalent system

[1] G. Bolch et al. Queuing Networks and Markov Chains. Wiley. 2006

![](_page_16_Picture_8.jpeg)

## **Structural Transformation (ST)**

![](_page_17_Figure_1.jpeg)

- Class 3 do not have to wait for all packets in Q<sub>1</sub>
  - Class 3 and 2 can be served simultaneously
- Class 3 packets will wait only when the server is busy serving class 1
  - Need to decompose class 1 and class 2
- Proposed transformation separates class 1 tokens and put in a virtual queue (Q<sub>2</sub>)
- Equivalence is demonstrated by the result shown on the right

![](_page_17_Figure_8.jpeg)

![](_page_17_Figure_9.jpeg)

![](_page_17_Picture_10.jpeg)

### **Analytical Model after Structural Transformation**

- ST enables us to derive closed form analytical equations
- Expression of residual time of class 1 in Q<sub>2</sub>'

$$R_1^{Q_2'} = \frac{1}{2}\rho_3(T-1) + \frac{\frac{1}{2}\rho_3T(C_{A_1}^2 + 1 - \mu)}{2} - \frac{\rho_1\mu}{2}$$

![](_page_18_Figure_4.jpeg)

Waiting time of class 1 in Q<sub>2</sub>'

$$W_1^{Q_2'} = \frac{R_1^{Q_2'}}{1 - \rho_1}$$

Residual time of class 3

$$R_3 = R_1^{Q_2'} + \rho_1$$

• Finally, waiting time of class 3

$$W_3 = \frac{R_3 + \rho_1 W_1^{Q_2'}}{1 - \rho_1 - \rho_3}$$

#### Comparison of Simulation and Analytical Models

![](_page_18_Figure_12.jpeg)

Geo/D/1 with T = 2, 3% average error

![](_page_18_Picture_14.jpeg)

## **Service Rate Transformation (RT)**

![](_page_19_Figure_1.jpeg)

#### **Observations**

- Packets in Q<sub>1</sub> effectively increase the service time of class 2 packet
  - Need to modify the service time of class 2 packet
- Challenging to model since not all packets in Q<sub>2</sub> will wait for packets in Q<sub>1</sub>
- Insight: Modified service time of class 2 is independent of incoming distribution of class 3

![](_page_19_Figure_7.jpeg)

#### **Proposed Approach**

- Decompose priority arbitration by modifying the service time of class 2
- Approximate first and second order moments of modified service time (µ̂)
- $\hat{\mu} = \frac{1}{\hat{T}} = \frac{1}{T + \Delta T}$ , calculation of  $\Delta T$  is shown next

![](_page_19_Picture_12.jpeg)

### Service Rate Transformation (RT): 1<sup>st</sup> moment

![](_page_20_Figure_1.jpeg)

Calculation of average busy period ( $\Delta \overline{T}$ )

Let p be the probability that the server is occupied by high priority token

If low priority token is blocked once, it will see a busy period of  $\frac{1}{T}\sum_{i=1}^{T} i = \frac{T+1}{2}$  in average

$$\Delta \overline{T} = \left(\frac{T+1}{2}\right) p(1-p) + \left(\frac{T+1}{2} + T\right) p^2(1-p) \dots.$$

$$\Rightarrow \Delta \overline{T} = \frac{pT}{1-p} - \left(\frac{T-1}{2}\right)p, \qquad p = \rho_1 + \lambda_1 R$$

T: service time

![](_page_20_Picture_8.jpeg)

### Service Rate Transformation (RT): 2<sup>nd</sup> moment

![](_page_21_Figure_1.jpeg)

![](_page_21_Picture_2.jpeg)

## **Model Generation Flow\***

**Input:** Injection rates for all traffic classes ( $\lambda$ ), Network topology, Routing algorithm

**Output:** Waiting time for all traffic classes

For each Queue and traffic class:

![](_page_22_Figure_4.jpeg)

## **A Representative Example**

![](_page_23_Figure_1.jpeg)

## Outline

Need for analytical fabric model generation

## Background

- Networks-on-chip (NoC) used in industry
- Prior work on NoC performance analysis
- Prior work on queuing networks
- Proposed network transformations

### Experimental results

Conclusion and future work

![](_page_24_Picture_9.jpeg)

## **Experimental Setup**

- We evaluated the proposed analytical models on
  - Ring
  - Mesh
- Simulation parameters
  - Simulation length: 10M cycles
  - Warm-up period: 5000 cycles
- Traffic load
  - Sweep from a very light load to  $\lambda_{max}$
  - $-\lambda_{max}$  is the injection rate at which the maximum server utilization is 1

![](_page_25_Figure_10.jpeg)

![](_page_25_Picture_11.jpeg)

![](_page_25_Picture_12.jpeg)

## **Evaluation on xPLORE: Mesh Topology**

- Traffic pattern is all to all with YX routing
- Injection rate for each source destination pair is equal

![](_page_26_Figure_3.jpeg)

- Achieve less than 4% error compared to simulation
- Proposed analytical models are 2-3 orders of magnitude faster than simulation models

![](_page_26_Picture_6.jpeg)

### Verification with Intel<sup>®</sup> Xeon<sup>®</sup> Scalable Server

- One variation of scalable Intel<sup>®</sup> Xeon<sup>®</sup>
  - 26 tiles with Core + LLC, 2 memory controllers on a 6x6 mesh
- Synthetic injection
  - All cores send packets to all caches
- Validated latencies for cache-coherency flow

![](_page_27_Figure_6.jpeg)

![](_page_27_Picture_7.jpeg)

## **Evaluation with Real Applications**

- Collected real app traces from gem5 in Full System mode
  - Applications (16-threaded) from PARSEC suite
  - Average statistics over 1M cycles

![](_page_28_Figure_4.jpeg)

## Outline

- Need for analytical fabric model generation
- Background
  - Networks-on-chip (NoC) used in industry
  - Prior work on NoC performance analysis
  - Prior work on queuing networks
- Proposed network transformations
- Experimental results
- Conclusion and future work

![](_page_29_Picture_9.jpeg)

## **Conclusion and Future Work**

- Industrial NoCs use priority-based routers
- Most NoC performance analysis techniques assume fair arbitration
- Priority-based models in literature do not consider multiple traffic classes
- Presented the first technique that handles both priority and multiple traffic classes

![](_page_30_Figure_5.jpeg)

# **THANK YOU**

# **BACK-UP**

### **Residual Time for Basic Priority Queue (Geo/G/1)**

![](_page_33_Figure_1.jpeg)

34

## **Residual Time for Single Queue Node**

- Residual time (R): delay of serving the next token due to the remaining service time for currently processed token
- Arrival distribution is Geometric
  P{X = k} = p(1 p)^{k-1}

$$\begin{split} R_{avg} &= \frac{1}{T_{tot}} \sum_{i=1}^{M(T_{tot})} \left( \sum_{\tau=0}^{T_i-1} \tau \right) \\ &= \frac{1}{T_{tot}} \sum_{i=1}^{M(T_{tot})} \frac{1}{2} T_i (T_i - 1) \\ &= \frac{M(T_{tot})}{T_{tot}} \frac{\sum_{i=1}^{M(T_{tot})} \frac{1}{2} T_i (T_i - 1)}{M(T_{tot})} \\ &= \frac{1}{2} \lambda \left( \overline{T^2} - \overline{T} \right) \text{ (for Geo/G/1)} \\ &= \frac{1}{2} \rho (T - 1) \text{ (for Geo/D/1)} \end{split}$$

![](_page_34_Figure_4.jpeg)

## **Decomposition Method**

- Can handle complex network with multi-class traffic <u>but</u> <u>limited to non-priority networks</u>
- Decompose the queuing network into individual queues of type G/G/1
- Approximate input/output traffic distribution of these G/G/1 queues using analytical expressions

![](_page_35_Figure_4.jpeg)

Coefficient of variation =  $\frac{standard \ deviation}{mean}$ 

[1] Pujolle, Guy, and Wu Ai. "A solution for multiserver and multiclass open queueing networks." *INFOR: Information Systems and Operational Research* 24.3 (1986): 221-230.

![](_page_35_Picture_7.jpeg)

## **Evaluation on Simulink**

- Simulation models have been built in Simulink based on priority aware network architecture
- We observe less than 2% error between simulation and analysis for rings and mesh

![](_page_36_Figure_3.jpeg)

- Traffic pattern is all to all (i.e. each node is sending tokens to all nodes) with YX routing
- Injection rate for each source destination pair is equal

![](_page_36_Picture_6.jpeg)

### **Evaluation on xPLORE: Ring Topology**

We observe less than 2% error between simulation and analysis for rings

![](_page_37_Figure_2.jpeg)

- xPLORE is a System-C based simulator for priority aware NoCs
- Traffic pattern is all to all (i.e. each node is sending tokens to all nodes) with YX routing
- Injection rate for each source destination pair is equal

### **Per Class Latency Comparison for Intel<sup>®</sup> Xeon<sup>®</sup>**

- Average accuracy for lowest priority class is 91%
  - Medium and highest priority class show 99% accuracy

![](_page_38_Figure_3.jpeg)

![](_page_38_Picture_4.jpeg)

### **Real Application (Streamcluster) Finer Grained**

![](_page_39_Figure_1.jpeg)

![](_page_39_Picture_2.jpeg)

| Full-System Simulat           | tion Time with 4×4 Mesh No         | С       |  |
|-------------------------------|------------------------------------|---------|--|
| With Garnet 2.0<br>Simulation | With Proposed<br>Analytical Models | Speedup |  |
| 12466 s                       | 4986 s                             | 2.5×    |  |

![](_page_40_Picture_2.jpeg)