Skip to content

Implement delayed branches in Functional simulation #626

Closed
3 tasks
pavelkryukov opened this issue Oct 9, 2018 · 33 comments
Closed
3 tasks

Implement delayed branches in Functional simulation #626

pavelkryukov opened this issue Oct 9, 2018 · 33 comments
Assignees
Labels
3 Features of medium complexity or infrastructure enhancements bug Fixes a bug or potential bug in simulation. S1 — Branch prediction To solve the issue, you need knowledge about branch prediction

Comments

@pavelkryukov
Copy link
Member

pavelkryukov commented Oct 9, 2018

Delayed branch is one of the (obsolete) ways to fix problem of control hazards, but, however, it exists in MIPS and cannot be removed — there is a lot of binary files compatible with it.

The idea is to put an instruction independent from branch to be executed right after the branch. Then, in a case of branch misprediction, that instruction is not flushed from the pipeline:

Let's assume there is a following control flow:

if (branch)
   then instructions
else
   else instruction
independent instructions

Pipeline without delayed branches:

       0 1 2 3 4 5 6 7 8 9 10
branch F D E M W
then-1   F D E x
then-2     F D x
then-3       F x
else-1         F D E M W
independent-1    F D E M W
independent-2      F D E M W
       0 1 2 3 4 5 6 7 8 9 10

We save one cycle with a single branch delay slot:

       0 1 2 3 4 5 6 7 8 9 10
branch F D E M W
ind-1    F D E M W
then-1     F D x
then-2       F x
else-1         F D E M W
independent-2    F D E M W
       0 1 2 3 4 5 6 7 8 9 10

We save two cycles with two delay slots:

       0 1 2 3 4 5 6 7 8 9 10
branch F D E M W
ind-1    F D E M W
ind-2      F D E M W
then-1       F x
else-1         F D E M W
       0 1 2 3 4 5 6 7 8 9 10

How can we support branch delay slot in micro-architecture? There are straightforward steps:

  • Add configuration option for delay slots (0, 1, 2 or more). Currently we implement 0.
  • Build MIPS tests with delayed slots. Currently they are disable by .set noreorder directive in MIPS assembly files.
  • Implement delayed slots in FuncSim.
@pavelkryukov pavelkryukov added enhancement Adds a new feature to simulation. 4 Features of medium complexity which usually require infrastructure enhancements. S1 — Branch prediction To solve the issue, you need knowledge about branch prediction labels Oct 9, 2018
@pavelkryukov pavelkryukov changed the title Implement Delayed Branches Implement delayed branches Oct 9, 2018
@pavelkryukov pavelkryukov added 5 Same as 4, but requires good understanding of CPU microarchitecture. bug Fixes a bug or potential bug in simulation. and removed 4 Features of medium complexity which usually require infrastructure enhancements. enhancement Adds a new feature to simulation. labels Oct 12, 2018
@pavelkryukov pavelkryukov changed the title Implement delayed branches Implement delayed branches in Functional simulation Oct 12, 2018
@pavelkryukov pavelkryukov added 4 Features of medium complexity which usually require infrastructure enhancements. and removed 5 Same as 4, but requires good understanding of CPU microarchitecture. labels Oct 12, 2018
@agrachiv agrachiv self-assigned this Nov 28, 2018
@agrachiv
Copy link
Contributor

It is not clear, where exactly i should add this option for delay slots.

@pavelkryukov
Copy link
Member Author

pavelkryukov commented Nov 28, 2018

I don't know either. I suggest to start with implementing it in FuncSim.

@agrachiv
Copy link
Contributor

agrachiv commented Nov 28, 2018

So i should add bool enabe_delayed_branches argument here?
Trap run(uint64 instrs_to_run, int enabled_delayed_slots) final;

Or in the constructor?
explicit FuncSim( bool log = false, int enabled_delayed_slots);

@pavelkryukov
Copy link
Member Author

Usually constructor is preferrable, as you are able to look configuration in all class methods.

@agrachiv
Copy link
Contributor

So I should add function, that replaces instructions in instruction list, when delayed slots option is turned on?

@pavelkryukov
Copy link
Member Author

First of all, I advise to get sure you have the test trace which is failing without delayed branches.

that replaces instructions in instruction list, when delayed slots option is turned on?

What is the "instruction list"?

@agrachiv
Copy link
Contributor

agrachiv commented Nov 29, 2018

What is the "instruction list"?

List of instructions that run method need to proceed. I mean, changing instructions with each other shouldn't be the work of simulator, should it? In my vision order of the instructions should be changed in constructor, as it would be done before by compiler. Is my vision proper?

P.s: Or I have to work in model where simulator knows only instructions it already decoded? But then it is impossible to use delayed brunch method. On lecture we've been told that finding independent instructions is done with the compiler not with the pipeline. So is this issue's point is figuring out some algorithm to find independent instructions?

@agrachiv
Copy link
Contributor

And is there any branch prediction functionality in functional simulator now?

@pavelkryukov
Copy link
Member Author

And is there any branch prediction functionality in functional simulator now?

Yes, it is clearly stated in README file

List of instructions that run method need to proceed.

No, we do not have such a list. We fetch instructions one-by-one from memory, as the next instruction program counter is dependent on previous instruction result.

On lecture we've been told that finding independent instructions is done with the compiler

Right. But our simulator does not assume compiler may do this at the moment, so instead of this:

       0 1 2 3 4 5 6 7 8 9 10
branch F D E M W
ind-1    F D E M W
then-1     F D x
then-2       F x
else-1         F D E M W
independent-2    F D E M W
       0 1 2 3 4 5 6 7 8 9 10

we get this:

       0 1 2 3 4 5 6 7 8 9 10
branch F D E M W
ind-1    F D E x
then-1     F D x
then-2       F x
else-1         F D E M W
independent-2    F D E M W
       0 1 2 3 4 5 6 7 8 9 10

which is not what compiler wanted us to do.

So is this issue's point is figuring out some algorithm to find independent instructions?

The point is to execute 1 or 2 instruction after the branch independently from branch condition, i.e. the program counter change should be effective with a delay.

@agrachiv
Copy link
Contributor

Build MIPS tests with delayed slots. Currently they are disable by .set noreorder directive in MIPS assembly files.

Sorry for stupid questions, but i can't find anything to understand what should i do here. There are no tests for delayed slots in traces directory and i don't understand what set noreorder directive is.

@pavelkryukov
Copy link
Member Author

pavelkryukov commented Nov 29, 2018

GitHub has a nice feature called "search":
https://github.com/MIPT-ILab/mips-traces/search?q=set+noreorder&unscoped_q=set+noreorder

By default, MIPS compiler/assembler assumes the CPU (or simulator) supports branch delays and therefore it finds independent instructions to fill the delay slot. Since MIPT-MIPS does not support it, we had to disable them by setting the set noreorder directive to assembler files. If this directive is removed, MIPS assembler will reorder instructions to fill the branch delay slot, and simulation will fail.

@pavelkryukov
Copy link
Member Author

More detailed explanation how compiler/assembler works:

https://stackoverflow.com/questions/3807480/weird-mips-assembler-behavior-with-jump-and-link-instruction

@agrachiv
Copy link
Contributor

agrachiv commented Nov 29, 2018

GitHub has a nice feature called "search":

I was searching in this repository, so that i couldn't find anything, sorry.

So after I remove all .set noreorder detectives all the reorders I will be able to cover all the reorders by implementing delayed branches?

@pavelkryukov
Copy link
Member Author

pavelkryukov commented Nov 29, 2018

Actually, I never investigated deeply how assembler works and what brake in torture tests — I just disabled reordering and TT started to work,

Let's start with simpler test provided on SO;

.set noreorder
.section .text
.globl __start

__start:
   addi $a0, $0, 100
   addi $a1, $0, 200 
   jal test
   jr $zero # Required to halt simulation
test:
   add $v0, $a0, $a1
   jr $ra

If we build it with current flow:
https://github.com/MIPT-ILab/mips-traces/blob/0c8d51184a9918cdc80e68c6355deec07092fc5f/Makefile#L21-L27

we should get something like the following code (you can check it with mips-linux-gnu-objdump -EL -D -b test.out, not sure about exact options), which is perfectly executed on MIPT-MIPS at the moment.

Disassembly of section .text:

00000000 <__start>:
   0:   20040064    addi    a0,zero,100
   4:   200500c8    addi    a1,zero,200
   8:   0c000004    jal 10 <test>
   c:   00000000    nop
   10:   00000008    jr  zero
   14:   00000000    nop

00000010 <test>:
  18:   00851020    add v0,a0,a1
  1c:   03e00008    jr  ra
  20:   00000000    nop
  24:   00000000    nop

We need to get something like SO author had:

Disassembly of section .text:

00000000 <__start>:
   0:   20040064    addi    a0,zero,100
   4:   0c000003    jal c <test>    <--- Why is jal coming before addi?
   8:   200500c8    addi    a1,zero,200
   c:   00000008    jr  zero
   10: 00000000    nop

0000000c <test>:
   14:   03e00008    jr  ra  <--- Why is jr coming before add?
   18:   00851020    add v0,a0,a1
    ...

@agrachiv
Copy link
Contributor

agrachiv commented Nov 29, 2018

mips-linux-gnu-objdump -EL -D -b test.out

So in order to test it I need to use MIPS binutils or something else?

@pavelkryukov
Copy link
Member Author

So in order to test it I need to use MIPS binutils or something else?

Yes, that's the only way to build ELF files from MIPS assembler.
You should have installed it to run simulation unit tests, right?

@agrachiv
Copy link
Contributor

agrachiv commented Nov 30, 2018

I guess I should copy of all tests with noreorder because both functional simulator and performance simulator use same tests, and paste new path here:

TEST_CASE( "Torture_Test: Func_Sim")
{
    CHECK_NOTHROW( get_simulator_with_test<MIPS32>( TEST_PATH "/tt.core.universal.out")->run_no_limit() );
    CHECK_NOTHROW( get_simulator_with_test<MIPS32>( TEST_PATH "/tt.core32.le.out")->run_no_limit() );
    CHECK_NOTHROW( get_simulator_with_test<MIPS64>( TEST_PATH "/tt.core64.le.out")->run_no_limit() );
}

@pavelkryukov
Copy link
Member Author

If you want to run these tests with ctest, yes, but at the beginning you may run them just from shell with mipt-mips command.

@agrachiv
Copy link
Contributor

agrachiv commented Nov 30, 2018

That is how disassembly looks like:

004000d0 <__start>:
  4000d0:	20040064 	addi	a0,zero,100
  4000d4:	0c100039 	jal	4000e4 <test>
  4000d8:	200500c8 	addi	a1,zero,200
  4000dc:	00000008 	jr	zero
  4000e0:	00000000 	nop

004000e4 <test>:
  4000e4:	03e00008 	jr	ra
  4000e8:	00851020 	add	v0,a0,a1
  4000ec:	00000000 	nop

There are 2 instructions after the jump, and these 2 are always filled with nops or independent instructions, so simulation should work without any pipeline flushing or whatever, am I right?

@pavelkryukov
Copy link
Member Author

We do not touch pipeline at the moment, we are talking only about functional simulation (single-cycle implementation, func_sim.cpp).

simulation should work without any pipeline flushing or whatever, am I right?

Please run functional simulation at your own and see what it does.

@agrachiv
Copy link
Contributor

agrachiv commented Nov 30, 2018

I've just understood that functional simulator is single-cycled, but then it can't face any branches hazards. So what does 'implementing delayed branches' stands for if we talk about single-cycled simulator?

@pavelkryukov
Copy link
Member Author

Please run functional simulation at your own and see what it does.

Please, do this and share what instructions are executed by default (./mipt-mips -b test.out -d -f).

@agrachiv
Copy link
Contributor

I see the problem. So as soon as simulator meets branch condition it should execute 2 more instructions before jumping?

@pavelkryukov
Copy link
Member Author

pavelkryukov commented Nov 30, 2018

Right, but we should optionally select 0, 1 or 2.

That's the problem with delayed branches method Oleg explained on the lecture — whereas it works well in pipelined microarchitecture, it still should be supported in other implementations, like single-cycle or more complicated.

@agrachiv
Copy link
Contributor

That's the problem with delayed branches method Oleg explained on the lecture — whereas it works well in pipelined microarchitecture, it still should be supported in other implementations, like single-cycle or more complicated.

Why do we need to optionally select 0, 1 or 2 if compiler always put 2 instructions after the branch condition

@pavelkryukov
Copy link
Member Author

compiler always put 2 instructions after the branch condition

So far I don't see 2 instructions case, I see always 1 — please explain if I'm wrong

Why do we need to optionally select 0, 1 or 2?

  1. RISC-V has no branch delay option, thus we'll always choose 0 option
  2. MARS and SPIM has an option to execute without branch delay — we need to have it as well
  3. If compiler is instructed not to reorder instructions, we must have 0 (as we have at the moment).

@pavelkryukov
Copy link
Member Author

And, finally, we won't have a branch delay slot in PerfSim for some time, so we have to have no slots to ensure both simulators behave similarly.

@agrachiv
Copy link
Contributor

agrachiv commented Nov 30, 2018

This delayed option slots should be one of the command line parameters ideally, or I should just add it into constructor parameter as you said?

@pavelkryukov
Copy link
Member Author

Do as best as you can, we'll refactor that later.

@agrachiv
Copy link
Contributor

agrachiv commented Dec 2, 2018

Is alu.h related only to the functional simulator?

@pavelkryukov
Copy link
Member Author

No.

@pavelkryukov pavelkryukov added 3 Features of medium complexity or infrastructure enhancements and removed 4 Features of medium complexity which usually require infrastructure enhancements. labels Dec 20, 2018
@pavelkryukov
Copy link
Member Author

Committing 1 point as something like a solution was proposed

@pavelkryukov pavelkryukov pinned this issue Feb 6, 2019
@pavelkryukov
Copy link
Member Author

Closing as duplicate of #737

@pavelkryukov pavelkryukov unpinned this issue Feb 27, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
3 Features of medium complexity or infrastructure enhancements bug Fixes a bug or potential bug in simulation. S1 — Branch prediction To solve the issue, you need knowledge about branch prediction
Projects
None yet
Development

No branches or pull requests

2 participants