Common place to calculate instruction size so we do not mispredict it #57368

kunalspathak · 2021-08-13T18:51:28Z

During codegen, we try to calculate the instruction size to get an idea of how much memory is needed to store the machine code. Next we allocate the memory from runtime and then output the instructions in the memory. In some situation, we mispredict (overestimate) the size of instruction and during outputting, we end up outputting an instruction of smaller size. The root of the problem is because there are two separate code paths that estimate (emitIns* methods) and output (emitOut* methods) instruction. We should unify this code so we know accurate instruction size and hence reduce memory wastage and speedup the throughput by not going through estimate and then outputting phase separately.

Relevant discussion in #12840, #57041, #12178, #8748 (comment)

category:design
theme:intrinsics
skill-level:expert
cost:large
impact:medium

The text was updated successfully, but these errors were encountered:

ghost · 2021-08-13T18:51:32Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

During codegen, we try to calculate the instruction size to get an idea of how much memory is needed to store the machine code. Next we allocate the memory from runtime and then output the instructions in the memory. In some situation, we mispredict (overestimate) the size of instruction and during outputting, we end up outputting an instruction of smaller size. The root of the problem is because there are two separate code paths that estimate (emitIns* methods) and output (emitOut* methods) instruction. We should unify this code so we know accurate instruction size and hence reduce memory wastage and speedup the throughput by not going through estimate and then outputting phase separately.

Relevant discussion in #12840, #57041, #12178, #8748 (comment)

Author:	kunalspathak
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	Future

BruceForstall · 2022-06-14T22:32:24Z

Related: #50054

kunalspathak · 2022-12-29T02:40:14Z

I am just pasting here the design proposal that I made in #50054 (comment).

Existing design

Currently there are 4 steps that occur in code generation and here is a rough outline of what happens:

genGenerateMachineCode(): This phase iterates over all the GenTree* nodes (see genCodeForTreeNode()) and translates them to machine dependent data-structure instrDesc. They are not necessarily 1-to-1, but some GenTree* can produce more than one instrDesc* while others can skip it. Just as GenTree* maps to instrDesc*, during this phase, so does BasicBlock* maps to instrGroup* (instruction group), which is a collection of instrDesc* we create during this phase. instrDesc* are first created in a temporary buffer. Depending on the instruction, various fields are set in instrDesc* include registers, instruction, instruction format, etc. We also estimate the instruction size of each instruction that is saved in instrDesc*. Once we complete processing all the instructions for a single instrGroup*, we save the raw data of all instrDesc* in a memory region allocated in ig->igData field. (See emitSavIG()). Basically, we memcpy all the instrDsc* data (contiguous raw memory contents) created so far in ig->igData.
At this point, since we have instruction size of all instrDesc*, we allocate equivalent memory from runtime. This is the final code buffer where instruction encoding will be saved for execution.
genEmitMachineCode() : This phase iterates over each instrGroup* and then over each instrDesc* (which is present in ig->igData). It extracts all the metadata present in instrDesc* (See emitIssue1Instr()) and depending on the insFmt (instruction format), calls the appropriate emitOutput*() method to generate the raw encoding for that instruction. The raw encoding is saved in the code buffer that we requested in step 2 above. In this step, we also update the GC refs to reflect the true code address where GC refs goes live/dead.
Finally, we adjust any forward jump addresses to reflect the final code location.

Proposal

If we notice, step 1 has all the information needed for an instruction to be encoded. Step 3 uses this exact information to determine the encoding needed. The proposal is to combine step 1 and 3 in following manner:

A1. If we see emitOutputInstr(), it has a big switch-case depending on the instruction format present in instrDesc*. During step 1, since we already have all the information, we could just call various emitOutput*() methods. The only challenge at that point is we do not have final code memory where the encoded instructions will be stored. We can get around the problem by having a growable memory say ig->finalCode at instrGroup* level (See C1 below). The encoding can be accumulated in ig->finalCode during step 1 itself.

A2. After above step, we would perform step 2, just as it happens currently. Except, in this case, we would have accurate memory size from ig->finalCode that should be allocated from the runtime, lets call it finalCodeBuffer.

A3. Traverse ig->finalCode of all instrGroup* and memcpy() it to final code memory that we allocated in A2 above.

A4. In A1 step, we will have to somehow track the GC liveness by building a table (See C2 below) of which codeAddress (memory of ig->finalCode) needs update. Later, once code is moved to final code memory, we would have to write a routine which will efficiently map the codeAddress from ig->finalCode to finalCodeBuffer and search/replace all the addresses based on the mapping.

A5. Forward jump addresses will be addressed, the one they are fixed today.

Refactoring

B1. emitOutputInstr*() should be segregated based upon the instruction format that they handle. For example, here is a piece of code from emitxarch.cpp emitOutputInstr() method:

        case IF_ARD_RRD:
        case IF_AWR_RRD:
        case IF_ARW_RRD:
            code    = insCodeMR(ins);
            code    = AddSimdPrefixIfNeeded(ins, code, size);
            regcode = (insEncodeReg345(ins, id->idReg1(), size, &code) << 8);
            dst     = emitOutputAM(dst, id, code | regcode);
            sz      = emitSizeOfInsDsc(id);
            break;

IMO, we should just move such code in a separate method with appropriate name. In A1, we would just call the relevant methods, since we know what is the insFmt and instruction we are setting, 2 things that we check in above code.

B2. GC update related code should just be in a separate file IMO, today it is mixed with emitter code.

Data structures

C1. We could have two ways in which finalCode can be stored. [1] A growable contiguous memory of X bytes. Once X bytes are filled, we allocate 2X bytes, copy the contents of current memory to new memory and continue writing to the new memory. We continue growing it by 4X, 8X, etc. I think this option has a downside of over allocating memory after one point and additional cost of memcpy, everytime the buffer is grown. [2] A simple linked list of fixed size, which when filled, we just allocate next chunk and continue write instruction encoding in it. In step A3, we would just traverse the linked list and copy the contents to finalCodeBuffer.

C2. To track GC refs, currently the addr is used just to find the offset from the start. Once we start using ig->finalCode, we should be able to track the offset from the start, and might not need readjustment, as far as I understand the code. May be Bruce can correct me here.

Open Questinos

We will have to still do the branch tensioning and will have to see how we can compact the already encoded jmp instructions in the finalCode or finalCodeBuffer region.

kunalspathak · 2023-01-12T21:00:35Z

I gathered some data on how much mis-estimation we are doing today.

Collection	# of methods	Misestimated bytes
benchmarks-win-x64	128	561
libraries-pmi-win-x64	181	637
asp-win-x64	269	1193
benchmarks-unix-x64	95	440
libraries-pmi-unix-x64	302	1112

I will post the individual instruction distribution shortly.

kunalspathak · 2023-01-12T22:47:17Z

Benchmarks windows/x64

Instruction	Number of occurrence	Total bytes
ja	4	16
jae	14	56
jb	1	4
jbe	4	16
je	63	252
jg	7	28
jge	11	44
jl	32	128
jle	28	112
jmp	21	63
jne	30	120
jo	1	4
mov	3	9
movsxd	22	22
Grand Total	241	874

Libraries.pmi windows/x64

Instructions	Number of occurrences	Total bytes
crc32	4	4
imul	1	1
ja	14	56
jae	21	84
jb	3	12
jbe	4	16
je	55	220
jg	7	28
jge	3	12
jl	26	104
jle	23	92
jmp	33	99
jne	32	128
jo	5	20
js	2	8
mov	4	12
movsxd	53	53
vmovnti	4	4
vprefetchnta	1	1
vprefetcht0	1	1
vprefetcht1	1	1
vprefetcht2	1	1
Grand Total	298	957

coreclr_run windows/x64

Instructions	Number of occurrences	Total bytes
crc32	43	43
idiv	2	6
ja	1	4
jae	15	60
jb	6	24
jbe	10	40
je	72	288
jg	9	36
jge	10	40
jl	71	284
jle	44	176
jmp	24	72
jne	186	744
jp	17	68
js	3	12
mov	24	72
movsxd	105	105
vmovnti	20	20
vmovsd	10	30
vmovupd	42	126
vmulsd	4	12
vprefetchnta	3	3
vprefetcht0	3	3
vprefetcht1	3	3
vprefetcht2	3	3
Grand Total	730	2274

crossgen windows/x64

Instructions	Number of occurrences	Total bytes
crc32	4	4
imul	1	1
mov	138	414
movsxd	70	70
Grand Total	213	489

Note: The difference of grand total that we see in above tables vs. the one in #57368 (comment) is because we add NOP compensation bytes that reduces the difference between actual code size vs. memory allocated.

kunalspathak added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 13, 2021

kunalspathak added this to the Future milestone Aug 13, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 13, 2021

kunalspathak mentioned this issue Aug 13, 2021

imul instruction size misprediction #57179

Merged

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Aug 14, 2021

kunalspathak mentioned this issue Nov 12, 2021

XArch: Trim the code block to match the actual code length #61523

Merged

AndyAyersMS mentioned this issue Jan 18, 2023

Proposal: evolve RyuJIT emitter to improve flexibility #80589

Open

kunalspathak mentioned this issue Feb 3, 2023

WIP: Eliminate estimation and double use emitOutputInstr() #81587

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Common place to calculate instruction size so we do not mispredict it #57368

Common place to calculate instruction size so we do not mispredict it #57368

kunalspathak commented Aug 13, 2021 •

edited by BruceForstall

Loading

ghost commented Aug 13, 2021

BruceForstall commented Jun 14, 2022

kunalspathak commented Dec 29, 2022

kunalspathak commented Jan 12, 2023

kunalspathak commented Jan 12, 2023 •

edited

Loading

Common place to calculate instruction size so we do not mispredict it #57368

Common place to calculate instruction size so we do not mispredict it #57368

Comments

kunalspathak commented Aug 13, 2021 • edited by BruceForstall Loading

ghost commented Aug 13, 2021

BruceForstall commented Jun 14, 2022

kunalspathak commented Dec 29, 2022

Existing design

Proposal

Refactoring

Data structures

Open Questinos

kunalspathak commented Jan 12, 2023

kunalspathak commented Jan 12, 2023 • edited Loading

Benchmarks windows/x64

Libraries.pmi windows/x64

coreclr_run windows/x64

crossgen windows/x64

kunalspathak commented Aug 13, 2021 •

edited by BruceForstall

Loading

kunalspathak commented Jan 12, 2023 •

edited

Loading