### **CS:APP Chapter 4 Computer Architecture**

# Pipelined Implementation

### Randal E. Bryant adapted by Jason Fritts

http://csapp.cs.cmu.edu



### **General Principles of Pipelining**

- Goal
- Difficulties

### **Creating a Pipelined Y86 Processor**

- Rearranging SEQ to create pipelined datapath, PIPE
- Inserting pipeline registers
- Problems with data and control hazards

### Fundamentals of Pipelining



### **Real-World Pipelines: Car Washes**

#### **Sequential**



#### **Parallel**



#### Pipelined




#### Idea

- Divide process into independent stages
- Move objects through stages in sequence
- At any given time, multiple objects are being processed

### **Computational Example**



#### System

- Computation requires total of 300 picoseconds
- Additional 20 picoseconds to save result in register
- Must have clock cycle of at least 320 ps

# **3-Way Pipelined Version**



#### System

- Divide combinational logic into 3 blocks of 100 ps each
- Can begin new operation as soon as previous one passes through stage A.
  - Begin new operation every 120 ps
- Overall latency increases
  - 360 ps from start to finish
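
The cycle time, latency, and throughput of the two designs can be checked with a short calculation. This is only a sketch: the function name `timing` and its parameters are illustrative assumptions, and it assumes the combinational logic splits evenly across stages.

```python
def timing(logic_ps, reg_ps, stages):
    """Return (cycle_ps, latency_ps, throughput_gips) for an evenly split pipeline."""
    cycle_ps = logic_ps / stages + reg_ps   # each stage: its share of logic plus register load
    latency_ps = cycle_ps * stages          # one operation must pass through every stage
    throughput_gips = 1000.0 / cycle_ps     # operations per nanosecond = giga-operations/second
    return cycle_ps, latency_ps, throughput_gips


for k in (1, 3):
    cycle, latency, gips = timing(300, 20, k)
    print(f"{k}-stage: cycle {cycle:.0f} ps, latency {latency:.0f} ps, {gips:.2f} GIPS")
```

Pipelining raises throughput because the cycle shrinks from 320 ps to 120 ps, while latency grows from 320 ps to 360 ps because every operation now pays the register delay three times.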



### Unpipelined



Cannot start new operation until previous one completes

### **3-Way Pipelined**



Up to 3 operations in process simultaneously










# **Limitations: Nonuniform Delays**



- Throughput limited by slowest stage
- Other stages sit idle for much of the time
- Challenging to partition system into balanced stages



| Circuit            | Delay  |
|--------------------|--------|
| instruction memory | 220 ps |
| decode             | 70 ps  |
| register fetch     | 120 ps |
| ALU                | 180 ps |
| data memory        | 260 ps |
| register writeback | 120 ps |

20 ps delay added for hardware register at end of cycle

#### Single-cycle processor:

- Clock cycle = 220 + 70 + 120 + 180 + 260 + 120 + 20 = 990 ps
- Clock freq = 1 / 990 ps = 1 / (990 × 10<sup>-12</sup> s) = 1.01 GHz

#### Combine and/or split stages for pipelining

- Need to balance time per stage, since clock frequency is determined by the slowest stage
- Must maintain original order of stages, so can't combine non-neighboring stages (e.g. can't combine decode & data mem)

| Circuit            | Delay  |
|--------------------|--------|
| instruction memory | 220 ps |
| decode             | 70 ps  |
| register fetch     | 120 ps |
| ALU                | 180 ps |
| data memory        | 260 ps |
| register writeback | 120 ps |

20 ps delay added for hardware register at end of each cycle

#### 3-stage pipeline:

Best combination for minimizing clock cycle time:

| Stage                                       | Delay          | Total    |
|---------------------------------------------|----------------|----------|
| 1<sup>st</sup> stage – instr mem & decode   | 220 + 70 + 20  | = 310 ps |
| 2<sup>nd</sup> stage – reg fetch & ALU      | 120 + 180 + 20 | = 320 ps |
| 3<sup>rd</sup> stage – data mem & reg WB    | 260 + 120 + 20 | = 400 ps |

- Slowest stage is 400 ps, so clock cycle time is 400 ps
- Clock freq = 1 / 400 ps = 1 / (400 × 10<sup>-12</sup> s) = 2.5 GHz

| Circuit            | Delay  |
|--------------------|--------|
| instruction memory | 220 ps |
| decode             | 70 ps  |
| register fetch     | 120 ps |
| ALU                | 180 ps |
| data memory        | 260 ps |
| register writeback | 120 ps |

20ps delay added for hardware register at end of each cycle

#### 5-stage pipeline:

Best combination for minimizing clock cycle time:

| Stage                                      | Delay         | Total    |
|--------------------------------------------|---------------|----------|
| 1<sup>st</sup> stage – instr mem           | 220 + 20      | = 240 ps |
| 2<sup>nd</sup> stage – decode & reg fetch  | 70 + 120 + 20 | = 210 ps |
| 3<sup>rd</sup> stage – ALU                 | 180 + 20      | = 200 ps |
| 4<sup>th</sup> stage – data mem            | 260 + 20      | = 280 ps |
| 5<sup>th</sup> stage – reg WB              | 120 + 20      | = 140 ps |

- Slowest stage is 280ps, so clock cycle time is 280ps
- Clock freq = 1 / 280ps = 1 / 280\*10<sup>-12</sup> = 3.57 GHz

| Circuit            | Delay  |
|--------------------|--------|
| instruction memory | 220 ps |
| decode             | 70 ps  |
| register fetch     | 120 ps |
| ALU                | 180 ps |
| data memory        | 260 ps |
| register writeback | 120 ps |

20ps delay added for hardware register at end of each cycle

#### 9-stage pipeline:

- Assuming can split stages evenly into halves, thirds, or quarters
  - not a valid assumption, but useful for simplifying problem
- Best combination for minimizing clock cycle time:
  - Each circuit is its own stage, with 20ps added delay for reg
  - Split *instr mem* circuit into two stages, each 110+20ps
  - Split data mem circuit into two stages, each 130+20ps
  - Split ALU circuit into two stages, each 90+20ps
- Slowest stage is 150ps, so clock cycle time is 150ps
- Clock freq = 1 / 150ps = 1 / 150\*10<sup>-12</sup> = 6.67 GHz
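
The "best combination" for each pipeline depth can be found mechanically by trying every way of cutting the ordered list of circuits into contiguous groups. The sketch below is one possible way to do that; the names `DELAYS_PS`, `REG_PS`, and `best_cycle_ps` are illustrative assumptions, and the 9-stage case relies on the same simplifying assumption as the slide (circuits can be split evenly in half).

```python
from functools import lru_cache

DELAYS_PS = [220, 70, 120, 180, 260, 120]  # instr mem, decode, reg fetch, ALU, data mem, reg WB
REG_PS = 20                                # pipeline register delay added to every stage


def best_cycle_ps(delays, stages, reg_ps=REG_PS):
    """Smallest clock cycle when `delays` is cut into `stages` contiguous groups."""
    n = len(delays)

    @lru_cache(maxsize=None)
    def solve(start, k):
        # Minimal worst-group delay for delays[start:] split into k non-empty groups.
        if k == 1:
            return sum(delays[start:])
        best = float("inf")
        for end in range(start + 1, n - k + 2):
            first_group = sum(delays[start:end])
            best = min(best, max(first_group, solve(end, k - 1)))
        return best

    return solve(0, stages) + reg_ps


for k in (1, 3, 5):
    cycle = best_cycle_ps(DELAYS_PS, k)
    print(f"{k} stage(s): {cycle} ps cycle, {1000 / cycle:.2f} GHz")

# 9-stage case: instr mem, ALU, and data mem each split in half (simplifying assumption).
SPLIT_PS = [110, 110, 70, 120, 90, 90, 130, 130, 120]
print(f"9 stages: {best_cycle_ps(SPLIT_PS, 9)} ps cycle")
```

Because stages must keep their original order, only contiguous groupings are searched, which is why decode and data memory can never end up in the same stage.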

# **Limitations: Register Overhead**



- As we try to deepen the pipeline, the overhead of loading registers becomes more significant
- Percentage of clock cycle spent loading register:
  - 1-stage pipeline: 6.25%
  - 3-stage pipeline: 16.67%
  - 6-stage pipeline: 28.57%
- High speeds of modern processor designs obtained through very deep pipelining
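
The percentages above follow from the earlier computational example, assuming 300 ps of combinational logic split evenly across the stages and a 20 ps register load per stage. A minimal sketch of that arithmetic (the function name and default parameters are illustrative assumptions):

```python
def register_overhead_pct(stages, logic_ps=300, reg_ps=20):
    """Percentage of each clock cycle spent loading the pipeline register."""
    cycle_ps = logic_ps / stages + reg_ps   # per-stage logic plus register delay
    return 100.0 * reg_ps / cycle_ps


for k in (1, 3, 6):
    print(f"{k}-stage pipeline: {register_overhead_pct(k):.2f}%")
```

The register delay is fixed while the useful logic per stage shrinks, so the overhead fraction keeps growing as the pipeline deepens.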

# In Practice

- i386: 3 stages
- i486: 5 stages
- Pentium 3: 11 stages
- Pentium 4 (Willamette)
  - 20 stages
- Pentium 4 (Prescott)
  - 31 stages!
  - Up to 3.8GHz
  - Severe heat problems
  - Long pipeline actually hurt some applications' performance
  - 115 Watts dissipation



### Converting SEQ to PIPE, a pipelined datapath



### **SEQ Hardware**

- Stages occur in sequence
- One operation in process at a time
- To convert to pipelined datapath, start by adding registers between stages, resulting in 5 pipeline stages:
  - Fetch
  - Decode
  - Execute
  - Memory
  - Writeback



# **Converting to pipelined datapath**




CS:APP2e

# Problem: Fetching a new instruction each cycle

### **Two problems**

- PC generated in last stage of SEQ datapath
- PC sometimes not available until end of Execute or Memory stage

### PC needs to be computed early

- In order to fetch a new instruction every cycle, PC generation must be moved to first stage of datapath
- Solve first problem by moving PC generation from end of SEQ to beginning of SEQ

### **Use prediction to select PC early**

- Solve second problem by <u>predicting</u> next instruction from current instruction
- If prediction is wrong, squash (kill) predicted instructions

# **SEQ+ Hardware**

- Still sequential implementation
- Reorder PC stage to put at beginning

#### PC Stage

- Task is to select PC for current instruction
- Based on results computed by previous instruction

#### **Processor State**

- PC is no longer stored in register
- But, can determine PC based on other stored information





Start fetch of new instruction after current has been fetched

- Not enough time to fully determine next instruction
- Attempt to predict which instruction will be next
  - Recover if prediction was incorrect

### **Our Prediction Strategy**

### Predict next instruction from current instruction

#### **Instructions that Don't Transfer Control**

- Predict next PC to be valP
- Always reliable

#### **Call and Unconditional Jumps**

- Predict next PC to be valC (destination)
- Always reliable

#### **Conditional Jumps**

- Predict next PC to be valC (destination)
- Only correct if branch is taken
  - Typically right 60% of time

#### **Return Instruction**

Don't predict, just stall
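
The prediction strategy above amounts to a small selection function. The sketch below is illustrative only: the string instruction classes and the use of `None` to mean "no prediction, stall" are assumptions made for the example, not the processor's actual encoding.

```python
def predict_pc(icode, valC, valP):
    """Return the predicted next PC, or None when no prediction is made (ret)."""
    if icode in ("call", "jmp", "jXX"):
        return valC      # jump or call target; for jXX this is a guess ("taken")
    if icode == "ret":
        return None      # return address is not known yet, so stall instead
    return valP          # all other instructions fall through to the next one


# Conditional jump at 0x002 with target 0x011 and fall-through address 0x007.
print(hex(predict_pc("jXX", valC=0x011, valP=0x007)))
```

Only the conditional-jump case can be wrong, which is why recovery logic is needed for mispredicted branches but not for calls or unconditional jumps.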



### Recovering from PC Misprediction



#### **Mispredicted Jump**

- Will see branch condition flag once instruction reaches memory stage
- Can get fall-through PC from valA (value M\_valA)

#### **Return Instruction**

Will get return PC when ret reaches write-back stage (W\_valM)
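
Putting the two recovery cases together, the fetch stage picks its PC from three candidates. A minimal sketch, assuming string instruction classes and Python arguments named after the pipeline signals (the function `select_pc` itself is hypothetical):

```python
def select_pc(F_predPC, M_icode, M_Cnd, M_valA, W_icode, W_valM):
    """Choose the PC to fetch from in the current cycle."""
    if M_icode == "jXX" and not M_Cnd:
        return M_valA        # mispredicted branch: resume at the fall-through address
    if W_icode == "ret":
        return W_valM        # ret has read its return address from memory
    return F_predPC          # otherwise trust the prediction


# A not-taken jne in the memory stage whose fall-through address is 0x007.
print(hex(select_pc(0x01d, "jXX", False, 0x007, "nop", 0)))
```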





# **PIPE- Hardware**

 Pipeline registers hold intermediate values from instruction execution

### Forward (Upward) Paths

- Values passed from one stage to next
- Cannot jump past stages
  - e.g., valC passes through decode



# **Feedback Paths**

Important for distinguishing dependencies between pipeline stages

#### **Predicted PC**

Guess value of next PC

#### **Branch information**

- Jump taken/not-taken
- Fall-through or target address

#### Return point

Read from memory

#### **Register updates**

To register file write ports



# **Signal Naming Conventions**

### S\_Field

Value of Field held in stage S pipeline register

### s\_Field

Value of Field computed in stage S



### Dealing with Dependencies between Instructions





### Hazards

Problems caused by dependencies between separate instructions in the pipeline

#### **Data Hazards**

- Instruction having register R as source follows shortly after instruction having register R as destination
- Common condition, don't want to slow down pipeline

#### **Control Hazards**

- Mispredict conditional branch
  - Our design predicts all branches as being taken
  - Naïve pipeline executes two extra instructions
- Getting return address for ret instruction
  - Naïve pipeline executes three extra instructions

### Dealing with Dependencies between Instructions

### **Data Hazards**



### Data Dependencies - not a problem in SEQ



#### System

Each operation depends on result from preceding one


### Data Hazards - the problems caused by data dependences in pipelined datapaths



Result does not feed back around in time for next operation

Pipelining has changed behavior of system

### Data Dependencies between Instructions



#### Result from one instruction used as operand for another

- Read-after-write (RAW) dependency
- Dependency is between writeback stage of earlier instruction and decode stage of later instruction
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Get correct results
  - Minimize performance impact
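
The timing of a read-after-write dependency in a naive five-stage pipeline (no stalling, no forwarding) can be worked out from stage positions alone. The sketch below assumes one instruction is fetched per cycle, that the register file is updated at the end of the writer's write-back cycle, and that the reader reads during its decode cycle; all function names are illustrative.

```python
STAGES = ["F", "D", "E", "M", "W"]


def stage_cycle(index, stage):
    """Cycle in which the instruction at 0-based position `index` occupies `stage`."""
    return index + 1 + STAGES.index(stage)


def reads_stale_value(writer_index, reader_index):
    # The reader is too early if it decodes no later than the cycle in which
    # the writer is still in write-back.
    return stage_cycle(reader_index, "D") <= stage_cycle(writer_index, "W")


def nops_needed(writer_index=0):
    """Number of instructions that must separate a writer from a dependent reader."""
    gap = 0
    while reads_stale_value(writer_index, writer_index + 1 + gap):
        gap += 1
    return gap


print(nops_needed())
```

Under these assumptions three intervening instructions are required, because the writer's write-back happens three cycles after the next instruction's decode.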

### Data Dependencies – Loop-Carried Dependencies




### **Pipeline Demonstration**

|                      | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----------------------|---|---|---|---|---|---|---|---|---|
| irmovl \$1,%eax #I1  | F | D | E | M | W |   |   |   |   |
| irmovl \$2,%ecx #I2  |   | F | D | E | M | W |   |   |   |
| irmovl \$3,%edx #I3  |   |   | F | D | E | M | W |   |   |
| irmovl \$4,%ebx #I4  |   |   |   | F | D | E | M | W |   |
| halt #I5             |   |   |   |   | F | D | E | M | W |

Cycle 5: W holds I1, M holds I2, E holds I3, D holds I4, F holds I5


### **Data Dependencies: 3 Nop's**

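|                          | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|--------------------------|---|---|---|---|---|---|---|---|---|----|----|
| 0x000: irmovl \$10,%edx  | F | D | E | M | W |   |   |   |   |    |    |
| 0x006: irmovl \$3,%eax   |   | F | D | E | M | W |   |   |   |    |    |
| 0x00c: nop               |   |   | F | D | E | M | W |   |   |    |    |
| 0x00d: nop               |   |   |   | F | D | E | M | W |   |    |    |
| 0x00e: nop               |   |   |   |   | F | D | E | M | W |    |    |
| 0x00f: addl %edx,%eax    |   |   |   |   |   | F | D | E | M | W  |    |
| 0x011: halt              |   |   |   |   |   |   | F | D | E | M  | W  |

- The addl instruction depends on the first two instructions
  - addl depends upon %edx from the 1<sup>st</sup> instr
  - addl depends upon %eax from the 2<sup>nd</sup> instr
- Cycle 6, write-back stage (W) of the 2<sup>nd</sup> instr: R[%eax] $\leftarrow$ 3
- Cycle 7, decode stage (D) of addl: valA $\leftarrow$ R[%edx] = 10, valB $\leftarrow$ R[%eax] = 3

addl must wait 3 cycles after the 2<sup>nd</sup> instruction, so that it doesn't fetch the two registers before they've been written to the register file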

### **Data Dependencies: 2 Nop's**



0x000: irmovl \$10,%edx

0x006: irmovl \$3,%eax

0x00c: nop

0x00d: nop

0x00e: addl %edx,%eax

0x010: halt

#### If addl executes one cycle earlier, it gets the wrong value for %eax


### **Data Dependencies: 1 Nop**



0x000: irmovl \$10,%edx

0x006: irmovl \$3,%eax

0x00c: nop

0x00d: addl %edx,%eax

0x00f: halt

If addl executes two cycles earlier, it gets the wrong value for both %eax and %edx

### **Data Dependencies: No Nop**

|                          | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|--------------------------|---|---|---|---|---|---|---|---|
| 0x000: irmovl \$10,%edx  | F | D | E | M | W |   |   |   |
| 0x006: irmovl \$3,%eax   |   | F | D | E | M | W |   |   |
| 0x00c: addl %edx,%eax    |   |   | F | D | E | M | W |   |
| 0x00e: halt              |   |   |   | F | D | E | M | W |

Cycle 4:

- M: M\_valE = 10, M\_dstE = %edx
- E: e\_valE $\leftarrow$ 0 + 3 = 3, E\_dstE = %eax
- D: valA $\leftarrow$ R[%edx] = 0 (Error), valB $\leftarrow$ R[%eax] = 0 (Error)

Like the prior case, if addl executes three cycles earlier, it gets the wrong value for both %eax and %edx



## **Stalling for Data Dependencies**



- If instruction follows too closely after one that writes register, slow it down
- Hold instruction in decode
- Dynamically inject nop into execute stage

# **Stall Condition**

### **Source Registers**

srcA and srcB of current instruction in decode stage

### **Destination Registers**

- dstE and dstM fields
- Instructions in execute, memory, and write-back stages

### **Special Case**

- Don't stall for register ID 15 (0xF)
  - Indicates absence of register operand
- Don't stall for failed conditional move



### **Detecting Stall Condition**








## What Happens When Stalling?

| 0x000: | irmovl \$10,%edx |
|--------|------------------|
| 0x006: | irmovl \$3,%eax  |
| 0x00c: | addl %edx,%eax   |
| 0x00e: | halt             |

|            | Cycle 8               |
|------------|-----------------------|
| Write Back | bubble                |
| Memory     | bubble                |
| Execute    | 0x00c: addl %edx,%eax |
| Decode     | 0x00e: halt           |
| Fetch      |                       |

- Stalling instruction held back in decode stage
- Following instruction stays in fetch stage
- Bubbles injected into execute stage
  - Like dynamically generated nop's
  - Move through later stages
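
The stall condition and its effect on the pipeline registers can be illustrated with a tiny cycle-by-cycle model of the example program. This is a sketch under simplifying assumptions: instructions are plain tuples, `None` stands for a bubble or an empty stage, and the conditional-move special case is not modeled. The names `PROGRAM`, `must_stall`, and `snapshot` are hypothetical.

```python
RNONE = 0xF  # register ID meaning "no register operand"

# Each instruction: (text, source registers, destination register).
PROGRAM = [
    ("irmovl $10,%edx", (), "edx"),
    ("irmovl $3,%eax", (), "eax"),
    ("addl %edx,%eax", ("edx", "eax"), "eax"),
    ("halt", (), RNONE),
]


def must_stall(decode, later_stages):
    """True if the instruction in decode reads a register still pending in E, M, or W."""
    if decode is None:
        return False
    _, srcs, _ = decode
    pending = {ins[2] for ins in later_stages if ins is not None}
    pending.discard(RNONE)
    return any(src in pending for src in srcs)


def snapshot(program, cycle):
    """Return the contents of stages F, D, E, M, W during the given cycle."""
    todo = list(program)
    f = todo.pop(0) if todo else None
    d = e = m = w = None
    for _ in range(cycle - 1):
        if must_stall(d, (e, m, w)):
            e, m, w = None, e, m          # hold F and D, inject a bubble into E
        else:
            d, e, m, w = f, d, e, m       # every instruction advances one stage
            f = todo.pop(0) if todo else None
    return {s: (x[0] if x else "bubble") for s, x in zip("FDEMW", (f, d, e, m, w))}


print(snapshot(PROGRAM, 8))
```

In this model addl is held in decode during cycles 4 through 7, so by cycle 8 it has reached execute with bubbles ahead of it in the memory and write-back stages. An empty fetch stage is also reported as "bubble" here.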



**Pipeline Register Modes** 




# **Implementing Stalling**



#### **Pipeline Control**

- Combinational logic detects stall condition
- Sets mode signals for how pipeline registers should update


## **Data Forwarding**

### Naïve Pipeline

- Register isn't written until completion of write-back stage
- Source operands read from register file in decode stage
  - Needs to be in register file at start of stage

#### Observation

Value generated in execute or memory stage

### Trick

- Pass value directly from generating instruction to decode stage
- Needs to be available at end of decode stage



### **Data Forwarding Example**

0x000: irmovl \$10,%edx

0x006: irmovl \$3,%eax

0x00c: nop

0x00d: nop

0x00e: addl %edx,%eax

0x010: halt

- irmovl in writeback stage
- Destination value in W pipeline register
- Forward as valB for decode stage





### **Bypass Paths**

### **Decode Stage**

- Forwarding logic selects valA and valB
- Normally from register file
- Forwarding: get valA or valB from later pipeline stage

### **Forwarding Sources**

- **Execute: valE**
- Memory: valE, valM
- Write back: valE, valM


# **Data Forwarding Example #2**

#### # demo-h0.ys

0x000: irmovl \$10,%edx
0x006: irmovl \$3,%eax
0x00c: addl %edx,%eax
0x00e: halt

#### Register %edx

- Generated by ALU during previous cycle
- Forward from memory as valA

#### Register %eax

- Value just generated by ALU
- Forward from execute as valB



# **Forwarding Priority**


- 0x000: irmovl \$1, %eax
- 0x006: irmovl \$2, %eax
- 0x00c: irmovl \$3, %eax
- 0x012: rrmovl %eax, %edx
- 0x014: halt

#### Multiple Forwarding Choices

- Which one should have priority
- Match serial semantics
- Use matching value from earliest pipeline stage
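
The priority rule can be sketched as a search through the forwarding sources in pipeline order, starting at the execute stage. This is an illustrative model, not the processor's HCL: the function `select_operand` and the representation of sources as `(destination, value)` pairs are assumptions.

```python
RNONE = 0xF  # register ID meaning "no register operand"


def select_operand(src, regfile_value, sources):
    """Pick the decode-stage operand value for source register `src`.

    `sources` lists (dst_register, value) pairs ordered from the earliest
    pipeline stage (execute) to the latest (write back).
    """
    if src != RNONE:
        for dst, value in sources:
            if dst == src:
                return value      # first match is the most recent write in program order
    return regfile_value          # no pending write: use the register file


# rrmovl %eax,%edx in decode while the three irmovl instructions are in E, M, and W.
pending = [("eax", 3), ("eax", 2), ("eax", 1)]
print(select_operand("eax", regfile_value=0, sources=pending))
```

Choosing the earliest stage matches serial semantics because the instruction closest behind the reader in the pipeline is the last one that would have written the register in sequential execution.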






### Implementing Forwarding

- Add additional feedback paths from E, M, and W pipeline registers into decode stage
- Create logic blocks to select from multiple sources for valA and valB in decode stage



### **Limitation of Forwarding**

Decode stage (D): valA $\leftarrow$ M\_valE = 10, valB $\leftarrow$ R[%eax] = 0 (Error)

### **Avoiding Load/Use Hazard**



Decode stage (D): valA $\leftarrow$ W\_valE = 10, valB $\leftarrow$ m\_valM = 3

### **Detecting Load/Use Hazard**



| Condition       | Trigger                        |
|-----------------|--------------------------------|
| Load/Use Hazard | E_icode in { IMRMOVL, IPOPL } &&<br>E_dstM in { d_srcA, d_srcB } |

### **Control for Load/Use Hazard**



- Stall instructions in fetch and decode stages
- Inject bubble into execute stage

| Condition       | F     | D     | E      | M      | W      |
|-----------------|-------|-------|--------|--------|--------|
| Load/Use Hazard | stall | stall | bubble | normal | normal |

### Dealing with Dependencies between Instructions

### **Control Hazards**



### **Branch Misprediction Example**

| Address   | Instruction       | Comment                       |
|-----------|-------------------|-------------------------------|
| 0x000:    | xorl %eax,%eax    |                               |
| 0x002:    | jne t             | # Not taken                   |
| 0x007:    | irmovl \$1, %eax  | # Fall through                |
| 0x00d:    | nop               |                               |
| 0x00e:    | nop               |                               |
| 0x00f:    | nop               |                               |
| 0x010:    | halt              |                               |
| 0x011: t: | irmovl \$3, %edx  | # Target (Should not execute) |
| 0x017:    | irmovl \$4, %ecx  | # Should not execute          |
| 0x01d:    | irmovl \$5, %edx  | # Should not execute          |

#### Should only execute first 7 instructions

### **Branch Misprediction Trace**


# **Handling Misprediction**



#### **Predict branch as taken**

Fetch 2 instructions at target

#### **Cancel when mispredicted**

- Detect branch not-taken in execute stage
- On following cycle, replace instructions in execute and decode by bubbles
- No side effects have occurred yet

### **Detecting Mispredicted Branch**



| Condition                  | Trigger                 |
|----------------------------|-------------------------|
| Mispredicted Branch        | E_icode == IJXX && !e_Cnd |

### **Control for Misprediction**



| Condition           | F      | D      | E      | M      | W      |
|---------------------|--------|--------|--------|--------|--------|
| Mispredicted Branch | normal | bubble | bubble | normal | normal |

### **Return Example**

```
0x000:    irmovl Stack,%esp  # Initialize stack pointer
0x006:    call p             # Procedure call
0x00b:    irmovl $5,%esi     # Return point
0x011:    halt
0x020: .pos 0x20
0x020: p: irmovl $-1,%edi    # procedure
0x026:    ret
0x027:    irmovl $1,%eax     # Should not be executed
0x02d:    irmovl $2,%ecx     # Should not be executed
0x033:    irmovl $3,%edx     # Should not be executed
0x039:    irmovl $4,%ebx     # Should not be executed
0x100: .pos 0x100
0x100: Stack:                # Stack: Stack pointer
```

#### Previously executed three additional instructions

### **Incorrect Return Example**

#### # demo-ret

|        |                          | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|--------------------------|---|---|---|---|---|---|---|---|---|
| 0x023: | ret                      | F | D | E | M | W |   |   |   |   |
| 0x024: | irmovl \$1,%eax # Oops!  |   | F | D | E | M | W |   |   |   |
| 0x02a: | irmovl \$2,%ecx # Oops!  |   |   | F | D | E | M | W |   |   |
| 0x030: | irmovl \$3,%edx # Oops!  |   |   |   | F | D | E | M | W |   |
| 0x00e: | irmovl \$5,%esi # Return |   |   |   |   | F | D | E | M | W |

#### Incorrectly execute 3 instructions following ret




### **Correct Return Example**



- Inject bubble into decode stage
- Release stall when ret reaches write-back stage





### **Detecting Return**



| Condition      | Trigger                                          |
|----------------|--------------------------------------------------|
| Processing ret | IRET in { D_icode, E_icode, M_icode }            |

### **Control for Return**

#### # demo-retb

|        |                          | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------|--------------------------|---|---|---|---|---|---|---|---|---|
| 0x026: | ret                      | F | D | E | M | W |   |   |   |   |
|        | bubble                   |   | F | D | E | M | W |   |   |   |
|        | bubble                   |   |   | F | D | E | M | W |   |   |
|        | bubble                   |   |   |   | F | D | E | M | W |   |
| 0x00b: | irmovl \$5,%esi # Return |   |   |   |   | F | D | E | M | W |

| Condition      | F     | D      | E      | M      | W      |
|----------------|-------|--------|--------|--------|--------|
| Processing ret | stall | bubble | normal | normal | normal |



# Special Control Cases Detection

| Condition                  | Trigger                                                          |
|----------------------------|------------------------------------------------------------------|
| Processing ret             | IRET in { D_icode, E_icode, M_icode }                            |
| Load/Use Hazard            | E_icode in { IMRMOVL, IPOPL } &&<br>E_dstM in { d_srcA, d_srcB } |
| Mispredicted Branch        | E_icode == IJXX && !e_Cnd                                        |

#### Action (on next cycle)

| Condition           | F      | D      | E      | M      | W      |
|---------------------|--------|--------|--------|--------|--------|
| Processing ret      | stall  | bubble | normal | normal | normal |
| Load/Use Hazard     | stall  | stall  | bubble | normal | normal |
| Mispredicted Branch | normal | bubble | bubble | normal | normal |

## **Implementing Pipeline Control**



Combinational logic generates pipeline control signals
 Action occurs at start of following cycle


### **Pipeline Control Logic**

- A sequence of control instructions complicates the control logic
  - in particular, should stall in Decode stage (instead of bubble, as an initial inspection suggests)
- Load/use hazard should get priority
- ret instruction should be held in decode stage for additional cycle

| Condition       | F     | D            | E      | M      | W      |
|-----------------|-------|--------------|--------|--------|--------|
| Processing ret  | stall | bubble       | normal | normal | normal |
| Load/Use Hazard | stall | stall        | bubble | normal | normal |
| Combination     | stall | <u>stall</u> | bubble | normal | normal |
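
The detection conditions and actions, including the load/use plus ret combination, can be folded into one function. The sketch below is an illustrative Python rendering, not the processor's HCL: instruction codes are plain strings, and the function name `pipeline_control` and its dictionary result are assumptions.

```python
def pipeline_control(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Cnd):
    """Return the update mode of each pipeline register for the next cycle."""
    load_use = E_icode in ("IMRMOVL", "IPOPL") and E_dstM in (d_srcA, d_srcB)
    ret = "IRET" in (D_icode, E_icode, M_icode)
    mispredict = E_icode == "IJXX" and not e_Cnd

    f_mode = "stall" if (load_use or ret) else "normal"
    if load_use:
        d_mode = "stall"      # load/use takes priority: hold ret in decode
    elif mispredict or ret:
        d_mode = "bubble"
    else:
        d_mode = "normal"
    e_mode = "bubble" if (mispredict or load_use) else "normal"
    return {"F": f_mode, "D": d_mode, "E": e_mode, "M": "normal", "W": "normal"}


# ret in decode (reads %esp) right behind a load into %esp: the combination case.
print(pipeline_control("IRET", "IMRMOVL", "INOP", "esp", "esp", "esp", True))
```

Checking `load_use` before `ret` when choosing the decode mode is what gives the load/use hazard priority, so the ret is stalled in decode for one extra cycle rather than being replaced by a bubble.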

# **Pipeline Summary**

#### Concept

- Break instruction execution into 5 stages
- Run instructions through in pipelined mode

### Limitations

- Can't handle dependencies between instructions when instructions follow too closely
- Data dependencies
  - One instruction writes register, later one reads it
- Control dependency
  - Instruction sets PC in way that pipeline did not predict correctly
  - Mispredicted branch and return

# **Pipeline Summary**

### Data Hazards

- Read-after-write dependencies handled by forwarding
  - No performance penalty
- Load/use hazard requires one cycle stall

#### **Control Hazards**

- Cancel instructions when detect mispredicted branch
  - Two clock cycles wasted
- Stall fetch stage while ret passes through pipeline
  - Three clock cycles wasted
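
The penalties above can be combined into a rough cycles-per-instruction estimate: one cycle per instruction plus the wasted cycles weighted by how often each case occurs. The frequencies used in the example are illustrative assumptions, not measurements from the slides, apart from the 40% misprediction rate implied by the "right 60% of time" figure for conditional jumps.

```python
def cpi(load_use_freq, mispredict_freq, ret_freq):
    """Estimated cycles per instruction given per-instruction event frequencies."""
    return (1.0
            + 1 * load_use_freq      # load/use hazard: one stall cycle
            + 2 * mispredict_freq    # mispredicted branch: two cycles wasted
            + 3 * ret_freq)          # ret: three cycles wasted


# Assumed mix: 25% loads, 20% of which cause a load/use hazard;
# 20% conditional jumps, 40% of which are mispredicted; 2% returns.
estimate = cpi(load_use_freq=0.25 * 0.20,
               mispredict_freq=0.20 * 0.40,
               ret_freq=0.02)
print(f"CPI is about {estimate:.2f}")
```

With this assumed mix, mispredicted branches contribute the largest share of the penalty, which suggests why branch prediction accuracy matters so much in deeper pipelines.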

### **Control Combinations**

- Must analyze carefully
- First version had subtle bug
  - Only arises with unusual instruction combination