Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate assembly text templates into avo programs #529

Closed
WojciechMula opened this issue Mar 11, 2022 · 25 comments
Closed

Translate assembly text templates into avo programs #529

WojciechMula opened this issue Mar 11, 2022 · 25 comments

Comments

@WojciechMula
Copy link
Contributor

The assembler procedures are currently generated from text templates. As @klauspost suggested, it would be easier to maintain this quite complex code with avo (https://github.com/mmcloughlin/avo).

@klauspost
Copy link
Owner

klauspost commented Mar 11, 2022

Yeah, no biggie. We can do it when it suits you. Should be pretty straightforward. If you'd like I can help out, or do the initial translation from your templates.

Will make things like temp register allocation and inlining functionality much easier.

@WojciechMula
Copy link
Contributor Author

I plan to add sequeceDecs.execute implementation as text templates, as this is something I already have. Then I'll convert decode and execute into avo. And then it will be easier to implement decodeSync.

@klauspost
Copy link
Owner

@WojciechMula I started to convert the sequenceDecs_decode_amd64, just to get you started.

@WojciechMula
Copy link
Contributor Author

@WojciechMula I started to convert the sequenceDecs_decode_amd64, just to get you started.

Great, I planned to work on this the next week. I wanted to complete the already opened PRs before.

@klauspost
Copy link
Owner

@WojciechMula https://gist.github.com/klauspost/8949f70d98dd94116392019f119087e5

Ended up doing it all, but did test it at all.

Place in a subfolder called _generate, as the s2.

I think we should just do a standalone cpuid check for now.

@klauspost
Copy link
Owner

klauspost commented Mar 13, 2022

You can rearrange as you see fit. I don't want avo as a dependency, since it isn't needed, that is why it is in a separate folder - that is my only real requirement.

The output if you're curious.

@klauspost
Copy link
Owner

Made a few tweaks and added more documentation. Unfortunately it crashes. Tried to debug a bit, but I will have to leave it for a while, until I've caught up on other things.

@WojciechMula
Copy link
Contributor Author

Made a few tweaks and added more documentation. Unfortunately it crashes. Tried to debug a bit, but I will have to leave it for a while, until I've caught up on other things.

That looks great, thank you. Let me continue your work, I'll work more on this on Tuesday.

@klauspost
Copy link
Owner

@WojciechMula Got it to not crash, but only output wrong values. I will see if I can get it to pass 🤞🏼

@klauspost
Copy link
Owner

@WojciechMula Did you manage to spot my mistake, otherwise I will take another look through the code.

@WojciechMula
Copy link
Contributor Author

@WojciechMula Did you manage to spot my mistake, otherwise I will take another look through the code.

Sorry, didn't check it yesterday. I planned to look at that problem this evening.

@klauspost
Copy link
Owner

Sure. I will take another pass through it.

@klauspost
Copy link
Owner

@WojciechMula FINALLY! After hours of debugging I finally got it working. The gist above is updated.

I have noted a few potential improvements - for now just getting it working is a big win.

This will make experiments with register allocations a bit easier as well.

@WojciechMula
Copy link
Contributor Author

@WojciechMula FINALLY! After hours of debugging I finally got it working. The gist above is updated.

I have noted a few potential improvements - for now just getting it working is a big win.

This will make experiments with register allocations a bit easier as well.

That's great news! Speaking of improvements, I'm wondering if it's possible to front-pad a bit-stream data with four zero bytes. Then we will not need to have to test off > 4.

@WojciechMula
Copy link
Contributor Author

@klauspost You wrote "TODO: We should be able to extract bits with BEXTRQ". I tried it already, the code is not shorter and isn't faster.

{{define "update_length2_bmi1"}}
    {{/* There's a lot of additional work to prepare control word for BEXTRQ.
         Peformance differences are negligible. */}}
    MOVQ    {{var "arg"}}, AX
    MOVB    AH, DL
    MOVBQZX DL, DX
    MOVQ    $64, CX
    SUBQ    br_bits_read, CX
    SUBQ    DX, CX
    MOVB    AH, CH                  // CH = moB (n)
    BEXTRQ  CX, br_value, BX
    ADDQ    DX, br_bits_read        // br.bitsRead += moB
    SHRQ    $32, AX                 // AX = mo (ofState.baselineInt(), that's the higer dword of moState)
    ADDQ    BX, AX                  // AX - mo + br.getBits(moB)
    MOVQ    AX, {{var "out"}}+{{var "out_ofs"}}(FP)
{{end}}

The current solution:

{{define "update_length2_bmi2"}}
    MOVQ    {{var "arg"}}, AX
    MOVQ    br_value, BX
    MOVB    AH, DL
    MOVBQZX DL, DX                      // DX = n
    LEAQ    (br_bits_read)(DX*1), CX    // CX = r.bitsRead + n
    ROLQ    CL, BX
    BZHIQ   DX, BX, BX
    MOVQ    CX, br_bits_read        // br.bitsRead += moB
    SHRQ    $32, AX                 // AX = mo (ofState.baselineInt(), that's the higer dword of moState)
    ADDQ    BX, AX                  // AX - mo + br.getBits(moB)
    MOVQ    AX, {{var "out"}}+{{var "out_ofs"}}(FP)
{{end}}

@klauspost
Copy link
Owner

Then we will not need to have to test off > 4.

We will. Bad codes can cause considerable underrun, much beyond 4 bytes.

The branch is fully predictable until it is taken, so potential performance improvement will be low-to-none.

@klauspost
Copy link
Owner

I am not concerned about further opts right now.

Once we have the pipeline implemented we can look at the overall picture and do microopts, like looking at register allocs, etc.

Having execute with history and decodeSync (doesn't need history AFAIR) would be a great start.

@WojciechMula
Copy link
Contributor Author

@klauspost Did you have any problems with any of these lines Load(ctx.Field("llState"), llState)? I'm getting the error "component is not primitive". I changed type of field to uint64 to work around this, but that's strange - decSymbol is mere alias for uint64.

@klauspost
Copy link
Owner

@WojciechMula Oh, yeah, totally forgot:

type decodeAsmContext struct {
	llTable   []decSymbol
	mlTable   []decSymbol
	ofTable   []decSymbol
	llState   uint64
	mlState   uint64
	ofState   uint64
	iteration int
	seqs      []seqVals
	litRemain int
}

@WojciechMula
Copy link
Contributor Author

I'll translate also Decompress4X from huff0 module.

@klauspost
Copy link
Owner

@WojciechMula I am not going to stop you :)

@WojciechMula
Copy link
Contributor Author

@klauspost I also saw you add some TODO items to sequenceDecs.decode, so I'll check them too.

@klauspost
Copy link
Owner

Did some small tweaks here: #537 - having to wait a bit before I can bench.

@mmcloughlin
Copy link
Contributor

👀

@WojciechMula
Copy link
Contributor Author

OK, now we have pure avo code. That's great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants