Skip to content

BPU model

Yan Logovsky edited this page Oct 1, 2018 · 17 revisions

This page is maintained by George Korepanov


Introduction

The Branch Prediction Unit is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure, thus enabling the fetch unit to continue fetching instructions without stalling the pipeline.

For more detailed description please refer to the latest BPU lecture.

Implementation

The operation of this unit is simulated inside BP class.

Interface

The BaseBP class provides very common interface of BPU as follows:

class BaseBP
{
    virtual bool is_taken( Addr PC) = 0;
    virtual Addr get_target( Addr PC) = 0;
    virtual void update( bool is_taken,
                         Addr branch_ip,
                         Addr target) = 0;
}
  • The first two methods are called by fetch stage, whenever it checks whether the next instruction is a branch to be taken. It uses get_target() to determine the IP to be fetched next cycle, and is_taken to store BPU prediction so that later memory stage could check if misprediction took place.
  • As mentioned, on memory stage the BPU prediction is compared with condition calculated by ALU and in case of misprediction, updates BPU information by calling update method.

bool is_taken( Addr PC)

Lookups the BP and if an entry holds info about requested IP, retrieves that info from it. Otherwise we have nothing else but return false (if we return true, we won't be able to return the target as well).

Addr get_target( Addr PC)

Lookups the BP entry and if it is TAKEN, returns predicted target. If there is no entry or it is predicted as NOT TAKEN, it returns PC + 4.

void update(bool is_taken, Addr branch_ip, Addr target)

Touches the existing BP entry in cache or creates a new one if it is not present (in the latter case we also have to reset the new entry with target to make it valid).

Internals

The implementation of described interface is very straight-forward. We just create and serve the cache holding BP entries, passing all the requests to corresponding entries (thus the real logic of operation is comprised within BP entries, not the BP class itself). In hardware BPU has it's own independent cache (BTB) for storing data, but in simulator we are free to reuse existing generic CacheTagArray implementation:

template<typename T>
class BP : public BaseBP
{
    std::vector<std::vector<T>> data;
    CacheTagArray tags;

Constructor

BP class is templated with a type of different entries. To create different modes of BP, you may use BPFactory class:

class BPFactory {
     BaseBP* create( std::string mode,
                     uint32 size,
                     uint32 ways,
                     uint32 branch_ip_size_in_bits = 32);

Tuning in command line

The branch predictor settings adjustin via command-line arguments. Currently allowed options are

--bp-mode arg (=dynamic_two_bit) branch prediction mode
--bp-size arg (=128)             BTB size in entries
--bp-ways arg (=16)              number of ways in BTB

bp-mode option defines the type of predictor used (see description below). Currently supported modes are

  • Simplest static predictor static_always_taken and static_always_not_taken
  • "Advanced" static predictor predicting only backward jumps to be taken static_backward_jumps
  • Dynamic predictor keeping only last branch outcome dynamic_one_bit
  • Default: Saturating dynamic predictor keeping only last branch outcome dynamic_two_bit
  • Two-level adaptive predictor keeping last two branch outcomes adaptive_two_level

bp-size and bp-ways parameters are responsible for BTB size and associativity respectively.

Predictor types

All of implemented predictors store IP's (instruction pointers) of branches (both conditional & unconditional) to be able to retrieve address of next instruction without decoding the current one. Thus the basic class BP implements common BTB (branch target buffer, practically the cache keeping IP's), encapsulating all the differences of implementation of specific predictor inside BPentry.

Static predictor

Static predictor simply predicts that jump will always be taken:

You may wonder, why static prediction actually works? The answer is simple. For example, loops are often created via backward jumps, like

li   $t0, 10         # t0 is a constant 10
li   $t1, 0          # t1 is our counter (i)
loop:
beq  $t1, $t0, end   # if t1 == 10 wear done
addi $t1, $t1, 1     # add 1 to t1
   ... # loop body
j loop               # jump back to the top
end:

In given example fetch stage after fetching j loop instruction does not know which instrcution to fetch next. However, when we enable static predictor, it predicts that jump will be taken and will also provide next IP (IP of loop label). And next cycle the right beq $t1, $t0, end instruction will be fetched.

"Smart" static predictor

Above example also illustrates the first flaw of plain static predictor: each time the beq $t1, $t0, end instruction is fetched, the predictor will wrongly tell that the branch will be taken and thus the pipeline will be flushed each time the loop is executed.

You may easily see that in the given loop example the "smart" static predictor would usually predict right direction.

Static prediction is the simplest branch prediction technique because it does not rely on information about the dynamic history of code executing. Instead, it predicts the outcome of a branch based solely on the branch instruction.

Dynamic predictor

When code structure is more complicated (e.g. nested if-then-else blocks within loops), the static predictor makes too many mistakes. The more intelligent solution would be to gather and keep statistics at run-time. The simplest variant is to remember the last outcome of the branch, and presume that next time the direction would be the same. This will mostly work for long loops, if-then-else blocks with seldomly changing conditions and other repetitive code.

Saturating predictor

Saturating predictor scheme

This sort of predictors are less aggressive: the prediction state changes from "taken" to "not taken" less abruptly, requiring two mispredictions two change the state. This is implemented as simple state machine with four states:

  • Strongly not taken
  • Weakly not taken
  • Weakly taken
  • Strongly taken

When a branch is evaluated, the corresponding state machine is updated. Branches evaluated as not taken decrement the state toward strongly not taken, and branches evaluated as taken increment the state toward strongly taken.

The advantage of the two-bit counter might arise e.g. in nested loops, when the inside loop-closing conditional jump is mispredicted once rather than twice (at the very beginning of the nested loop and the very end).

Two-level adaptive predictor

If an if statement is executed, for instance, three times, the decision made on the third execution might depend upon whether the previous two were taken or not. In such scenarios, two-level adaptive predictor works more efficiently than a saturation counter. Conditional jumps that are taken every second time or have some other regularly recurring pattern are not predicted well by the saturating counter. A two-level adaptive predictor remembers the history of the last N occurrences of the branch and uses one saturating counter for each of the possible 2N history patterns:

Two-level adaptive predictor scheme

Consider the predictor with depth (N) of 2. Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a zero. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones. It's clear from the given example that the general rule for a two-level adaptive predictor with an N-bit history is that it can perfectly predict any repetitive sequence with any period if all N-bit sub-sequences are different.

Clone this wiki locally