# CSC 255/455 Software Analysis and Improvement

# Instruction Scheduling, Register Allocation, Partial Redundancy Removal

### Instructor: Chen Ding



COMP 412 FALL 2008

Local Instruction Scheduling — A Primer for Lab 3 —

Comp 412

Copyright 2008, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

### What Makes Code Run Fast?



- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- Execution time is order-dependent (and has been since the 60's)

## Assumed latencies (conservative)

| Operation<br>load<br>store | Cycles<br>3      | <ul> <li>Loads &amp; stores may or may not block</li> <li>Non-blocking ⇒fill those issue slots</li> </ul>                                        |
|----------------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| loadI<br>add               | 1                | <ul> <li>Branch costs vary with path taken</li> <li>Branches typically have delay slots</li> <li>Fill slots with unrelated operations</li> </ul> |
| mult<br>fadd               | 2<br>1           | <ul> <li>Percolates branch upward</li> <li>Scheduler should hide the latencies</li> </ul>                                                        |
| fmult<br>shift<br>branch   | 2<br>1<br>0 to 8 | Lab 3 will build a local scheduler                                                                                                               |

Comp 412, Fall 2008

| Exa | mple         |            |           |         |               |            |              |
|-----|--------------|------------|-----------|---------|---------------|------------|--------------|
|     |              |            | w ← w * : | 2 * x * | y * z         |            |              |
|     | <u>Simpl</u> | e sche     | dule      |         | <u>Schedu</u> | le loads   | <u>early</u> |
| 1   | loadAl       | r0,@w      | ⇒r1       | 1       | loadAl        | r0,@w      | ⇒r1          |
| 4   | add          | r1,r1      | ⇒r1       | 2       | loadAl        | r0,@x      | ⇒r2          |
| 5   | loadAl       | r0,@x      | ⇒r2       | 3       | loadAl        | r0,@y      | ⇒r3          |
| 8   | mult         | r1,r2      | ⇒r1       | 4       | add           | r1,r1      | ⇒r1          |
| 9   | loadAl       | r0,@y      | ⇒r2       | 5       | mult          | r1,r2      | ⇒r1          |
| 12  | mult         | r1,r2      | ⇒ r1      | 6       | loadAl        | r0,@z      | ⇒r2          |
| 13  | loadAl       | r0,@z      | ⇒ r2      | 7       | mult          | r1,r3      | ⇒r1          |
| 16  | mult         | r1,r2      | ⇒ r1      | 9       | mult          | r1,r2      | ⇒r1          |
| 18  | storeAl      | r1         | ⇒ r0,@w   | 11      | storeAl       | r1         | ⇒ r0,@w      |
| 21  | r1 is free   |            |           | 14      | r1 is free    |            |              |
|     | 2 registe    | ers, 20 cy | /cles     |         | 3 regist      | ers, 13 cy | cles         |

Reordering operations for speed is called *instruction scheduling* Comp 412, Fall 2008

### Instruction Scheduling (Engineer's View)

### The Problem

Given a code fragment for some target machine and the latencies for each individual operation, reorder the operations to minimize execution time

### The Concept



Produce correct code

The Task

- Minimize wasted cycles
- Avoid spilling registers
- Operate efficiently

3



- To capture properties of the code, build a precedence graph G
- Nodes  $n \in G$  are operations with type(n) and delay(n)
- An edge  $e = (n_1, n_2) \in G$  if & only if  $n_2$  uses the result of  $n_1$

| a:       | loadAl  | r0,@w | ⇒r1     |
|----------|---------|-------|---------|
| b:       | add     | r1.r1 | ⇒r1     |
| c:<br>d: | loadAl  | r0,@x | ⇒r2     |
| e:       | mult    | r1,r2 | ⇒r1     |
|          | IoadAl  | r0,@y | ⇒r2     |
| f:       | mult    | r1,r2 | ⇒r1     |
| g:       | IoadAl  | r0,@z | ⇒r2     |
| h:       | mult    | r1,r2 | ⇒ r1    |
| i:       | storeAl | r1    | ⇒ r0,@w |
|          |         |       |         |



The Code



Comp 412, Fall 2008

2

### Instruction Scheduling

### (Definitions

- A correct schedule S maps each n∈ N into a non-negative integer representing its cycle number, and
- 1.  $S(n) \ge 0$ , for all  $n \in N$ , obviously
- 2. If  $(n_1, n_2) \in E$ ,  $S(n_1) + delay(n_1) \leq S(n_2)$
- 3. For each type t, there are no more operations of type t in any cycle than the target machine can issue
- The <u>length</u> of a schedule S, denoted L(S), is  $L(\overline{S}) = max_{n \in N} (S(n) + delay(n))$

## The goal is to find the shortest possible correct schedule.

- S is <u>time-optimal</u> if  $L(S) \leq L(S_1)$ , for all other schedules  $S_1$
- A schedule might also be optimal in terms of registers, power, or space....

Comp 412, Fall 2008

# Instruction Scheduling (What's so difficult?

### **Critical Points**

- All operands must be available
- Multiple operations can be *ready*
- Moving operations can lengthen register lifetimes ٠
- Placing uses near definitions can shorten register lifetimes •
- Operands can have multiple predecessors
- Together, these issues make scheduling *hard* (NP-Complete)

Local scheduling is the simple case

- Restricted to straight-line code
- Consistent and predictable latencies

Comp 412, Fall 2008

## Instruction Scheduling: The Big Picture

- 1. Build a precedence graph, P
- 2. Compute a priority function over the nodes in P
- Use list scheduling to construct a schedule, one cycle at a time
  - a. Use a queue of operations that are ready
  - b. At each cycle I. Choose the highest priority ready operation and schedule
    - II. Update the ready queue

### Local list scheduling

- The dominant algorithm for twenty years
- A greedy, heuristic, local technique

Comp 412, Fall 2008



Comp 412, Fall 2008

# Scheduling Example

1. Build the precedence graph







9





1. Build the precedence graph

2. Determine priorities: longest latency-weighted path

| a: | loadAl  | r0,@w | ⇒r1     |
|----|---------|-------|---------|
| b: | add     | r1,r1 | ⇒ r1    |
| c: | loadAl  | r0,@x | ⇒ r2    |
| d: | mult    | r1,r2 | ⇒r1     |
| e: | loadAl  | r0,@y | ⇒r2     |
| f: | mult    | r1,r2 | ⇒ r1    |
| g: | loadAl  | r0,@z | ⇒r2     |
| h: | mult    | r1,r2 | ⇒r1     |
| i: | storeAl | r1    | ⇒ r0,@w |
|    |         |       |         |

The Code



Comp 412, Fall 2008

The Precedence Graph

10



6

8



5



7



**Register Allocation** Part of the compiler's back end



Critical properties

- Produce <u>correct</u> code that uses k (or fewer) registers
- Minimize added loads and stores
- Minimize space used to hold spilled values
- Operate efficiently O(n),  $O(n \log_2 n)$ , maybe  $O(n^2)$ , but not  $O(2^n)$

Comp 412, Fall 2008

#### Rutgers An Example : Bottom-up



| 5<br>6 | loadI<br>load<br>mult<br>loadI<br>sub<br>loadI<br>mult<br>sub<br>store | 1028<br>r1<br>r1, r2<br>5<br>r4, r2<br>8<br>r5, r6<br>r7, r3<br>r8 | $\begin{array}{c} \Rightarrow r1 \\ \Rightarrow r2 \\ \Rightarrow r3 \\ \Rightarrow r4 \\ \Rightarrow r5 \\ \Rightarrow r6 \\ \Rightarrow r7 \\ \Rightarrow r8 \\ \Rightarrow r1 \end{array}$ |
|--------|------------------------------------------------------------------------|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|--------|------------------------------------------------------------------------|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

NOTE: live sets on exit of each instruction

1

14

# **Code to Interference Graph**

### Schielke's Scheduling Study

Non-optimal list schedules (%) versus available parallelism 1 functional unit, randomly generated blocks of 10, 20, 50 ops

3 compute months of work



- 85,000 randomly generated blocks
- RBF found optimal schedules for > 80%
- Peak difficulty (for RBF) is around 2.8
- Tie-breaking matters because it affects the choice when queue has > 1 element 15

# Live Ranges

| $\begin{array}{c ccccccccccccccccccccccccccccccccccc$ | r1 r2 r3<br>   r1 r2 r3 r4<br>   r1 r3 r5<br>   r1 r3 r5 r6<br>   r1 r3 r7<br>   r1 r3 r7 |
|-------------------------------------------------------|-------------------------------------------------------------------------------------------|
|-------------------------------------------------------|-------------------------------------------------------------------------------------------|

### NOTE: live sets on exit of each instruction

cs415, spring 09 by Uli Kremer Lecture 1

Each color can be mapped to a distinct physical register

Comp 412, Fall 2008

Global Register Allocation

5

(Lavrov & (later) Chaitin)

Taking a global approach

- Abandon the distinction between local & global
- Make systematic use of registers or memory
- Adopt a general scheme to approximate a good allocation

**Interference Graph to Coloring** 

Graph coloring paradigm

- 1 Build an interference graph  $G_I$  for the procedure — Computing LIVE is harder than in the local case —  $G_T$  is not an interval graph
- 2 (try to) construct a k-coloring
  - Minimal coloring is NP-Complete
  - Spill placement becomes a critical issue
- 3 Map colors onto physical registers

Comp 412, Fall 2008

# Improvement in Coloring Scheme



6

47

Optimistic Coloring (Briggs, Cooper, Kennedy, and Torczon)

- If Chaitin's algorithm reaches a state where every node has k or more neighbors, it chooses a node to spill.
- Briggs said, take that same node and push it on the stack

   When you pop it off, a color might be available for it!



For example, a node n might have k+2 neighbors, but those neighbors might only use 3 (<k) colors</li>
 → Degree is a <u>loose upper bound</u> on colorability

# Lazy Code Motion

# (Partial Redundancy Elimination)



An expression is partially redundant at p if it is redundant along some, but not all, paths reaching p

### Example





COMP 512, Spring 2009

Lazy Code Motion

\* 3

#### Sources

- Knoop, Ruthing, Steffen, PLDI 1992
  Dreschsler and Stadel, SIGPLAN 1993
- Chapter 10, Engineering a compiler
- Intuition
  - computer available and anticipable expressions
  - · compute the earliest placement for each expression
  - push expressions down the CFG to the last point with no redundancy

### Solving a set of data flow equations

· availability, anticipable, earliest placement, later placement, insert, delete

### 27