Super-instructions #16
Yeah, taking advantage of common bytecode sequences is something that I've thought would be worth exploring. If we did this, it would be helpful to have basic tooling to identify common patterns (e.g. from running the test suite or even an arbitrary given workload). It would also be important for such tooling to validate that any super-instructions we add are still worth having, since changes in the compiler can create or eliminate those patterns. Likewise it would be important to identify the more important workloads to target (just like with #10), or even just show that the specific workload isn't so critical.
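A minimal sketch of such tooling, using only the stdlib `dis` module: it counts statically adjacent opcode pairs in a code object, recursing into nested code objects. (This measures static occurrence; a dynamic profile like the one mentioned below would need instrumentation in the interpreter itself.)

```python
import dis
from collections import Counter

def count_opcode_pairs(code, counts=None):
    """Count adjacent opcode pairs in a code object, recursing into
    nested code objects (function bodies, comprehensions, ...)."""
    if counts is None:
        counts = Counter()
    ops = [ins.opname for ins in dis.get_instructions(code)]
    for a, b in zip(ops, ops[1:]):
        counts[(a, b)] += 1
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            count_opcode_pairs(const, counts)
    return counts

# Example: which static pairs dominate a small function?
def f(x, y):
    return x + y

pairs = count_opcode_pairs(f.__code__)
print(pairs.most_common(3))
```

Running this over a whole package (or the compiled test suite) instead of one function would give the kind of corpus statistics discussed later in this thread.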
Here's a derivative idea that may be worth exploring later: a dynamic flavor of super-instructions, where we recognize common patterns at runtime and build/run the collapsed case accordingly (sort of like a JIT). This introduces extra overhead so, to be worth it, it would have to be used frequently enough and, of course, more than offset that overhead.
Note that there's something in ceval.c that can dump out dynamic instruction pair occurrences. Search for dxpairs IIRC.
I've attempted something a little different: in the compiler itself, collapse LOAD_FAST + BINARY_ADD into a new opcode which duplicates the code of LOAD_FAST and BINARY_ADD. Many other opcodes can get this treatment if it works out, e.g. LOAD_CONST and other binary and unary operations (I also did BINARY_SUBSCR). Here's the branch: https://github.com/gvanrossum/cpython/tree/add-opcodes
UPDATE: This makes the bytecode more compact and saves dispatch time, at the cost of more cases in the switch. (I could reduce the code size by using more …)
That definitely sounds like something worth exploring.
I like how straightforward the change is. 🙂
Here are the results of running the benchmarks on the add-opcodes branch right now: results-add-opcodes.tar.gz (Note that the actual benchmark run took about 20 minutes.)
Hm, that shows small improvements for some number-heavy opcodes, but significant slowdowns on the pickle benchmarks. That's kind of weird, and why we need a separate machine (#19).
Some other pairs that could be combined easily using the same technique:
Of these, RETURN_CONST and RETURN_NONE show the most promise IMO (in my sample of 1 they are the most prominent). Note that there are other common combinations, but they require a different approach to dispatch, because both instructions in the pair have an operand. (Top 9 in my sample: LOAD_FAST + LOAD_ATTR, LOAD_FAST + LOAD_FAST, LOAD_FAST + STORE_ATTR, LOAD_GLOBAL + LOAD_FAST, LOAD_FAST + LOAD_METHOD, STORE_FAST + LOAD_FAST, LOAD_METHOD + LOAD_FAST, LOAD_GLOBAL + CALL_FUNCTION, STORE_ATTR + LOAD_FAST.) This is the approach (according to Mark) taken by Ertl.
See faster-cpython/cpython#2 (comment) for restrictions on super-instruction formation.
Things that can be done then include:
Some of these could be separate PRs to test the waters.
(Actually, code like that …)
Note that all of the work I've done so far is really combined instructions, and for that I've opened a new issue, #36. Let's keep this one (#16) for the original idea of super-instructions (also spelled "superinstructions") in the style of Ertl -- see top comment in the current issue.
Since runtime specialization is going to take a while I'm putting some effort into getting several (true) super-instructions implemented in the compiler. See https://github.com/faster-cpython/cpython/tree/super-instr |
UPDATE: Never mind, see later comment (it did move the needle, but in the wrong direction)
(The trickiest thing to clean up is that e.g. a sequence of three LOAD_FAST opcodes in a row is converted to two LOAD_FAST_LOAD_FAST opcodes followed by a regular LOAD_FAST. This works but is unexpected, and there's an assert in HALF_NEXTOPARG() that can fail because of this.)
Whoa, never mind. It seems I had the benchmarks backwards (I ran my branch first and then master). So it seems this is either a serious pessimization or there's too much variation in the benchmark times. |
Several benchmark runs later, these results are unfortunately correct: several benchmarks are much slower with the super-instructions -- up to 2.09x for sqlalchemy_declarative. I'm investigating this for bm_meteor_contest (which has no external dependencies, so is easier to analyze). So far the main suspicious thing is that there are several sequences of multiple super-instructions in a row. This benchmark has a dynamic execution profile showing 15% of opcode pairs being LOAD_FAST_LOAD_FAST. There's also a very common pair FOR_ITER+STORE_FAST (8%), where I notice that replacing that STORE_FAST with a super-instruction (as is done for all three occurrences of FOR_ITER in the solve() function) will make a PREDICT macro fail (FOR_ITER predicts STORE_FAST or UNPACK_SEQUENCE). (Then again, IIUC with gcc we're using computed gotos, which should disable prediction?) Food for thought.
Okay, I had a reference count bug. Here's a more reliable result. So it looks like it's up to 6% faster, up to 2% slower, and on average 1% faster.
I've tried this with the INT_ADD opcode added too. Results are similar, but alas:
Now that is a pretty crazy benchmark (400 times …).
UPDATE: Strangely, unpack_sequence was faster without INT_ADD, but the generated bytecode doesn't have an INT_ADD -- just a bunch of STORE_FAST_LOAD_FAST opcodes (one per line, the last opcode of the line). No idea yet what could be going on here.
I can't say I understand it, but clearly STORE_FAST_LOAD_FAST was a pessimization. Without this, the top faster and slower times are like this:
So mostly a wash. Is it worth investigating why unpickle_list is consistently slower?
Here are some TODOs I collected during a discussion with Mark and Eric:
This seems to have expired. Let's try to restart it. First of all, let's narrow this issue down to superinstructions that are inserted into the quickened code, but are not present in the original code object. Superinstructions that are inserted by the compiler into the code object are discussed in #36. We want to exclude from this list:
Which leaves this list of common pairs, compiled from various sources:
We are not just interested in the prevalence of the pairs, but in the profitability of combining them. I am suspicious of the STORE_FAST STORE_FAST pair: I suspect that it shows up a lot in tests, but less so in production code. That leaves just four:
It is worth noting that none of LOAD_FAST, STORE_FAST and LOAD_CONST can change the state from non-tracing to tracing. This gives us three more instructions that might be worth adding:
So is your conclusion to just add the no-trace instructions, or also the four specializations you added?
My conclusion is that we should run some experiments 🙂
Initial experiments show no speedup with the four super-instructions and ~2% speedup adding …
If you want to play, the code is here: https://github.com/faster-cpython/cpython/tree/super-instructions
I'd be curious about a further specialized …
Implemented as something like …
See also https://bugs.python.org/issue38381 . Maybe it's worth collecting statistics about how often this pair has the same argument.
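A rough way to collect that statistic statically, sketched with the stdlib `dis` module (it looks at compiled bytecode, not dynamic execution counts, and on newer CPythons the compiler may already fuse some of these pairs, so results vary by version):

```python
import dis
from collections import Counter

def store_load_same_arg_stats(code, stats=None):
    """For each STORE_FAST immediately followed by LOAD_FAST, record
    whether both instructions refer to the same local variable."""
    if stats is None:
        stats = Counter()
    ins = list(dis.get_instructions(code))
    for a, b in zip(ins, ins[1:]):
        if a.opname == "STORE_FAST" and b.opname == "LOAD_FAST":
            stats["same" if a.arg == b.arg else "different"] += 1
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            store_load_same_arg_stats(const, stats)
    return stats

def example(x):
    t = x + 1   # STORE_FAST t ...
    return t    # ... LOAD_FAST t: same argument

print(store_load_same_arg_stats(example.__code__))
```

Running this over a large corpus of compiled modules would show how often the same-argument case (the one bpo-38381 could elide) actually dominates.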
Could this be handled by a peephole optimization step that simply elides the pair of instructions? I suppose that's only safe if there are no …
If there are code, dir(), or traceback calls, or deletion, the pair of instructions will also be unsafe.
Is it worth making a super-instruction for UNPACK_SEQUENCE and its following STORE_FASTs? In bm_nbody.py of Pyperformance, these patterns are in hot loop bodies. We can identify these patterns in the compilation phase.

```
def advance(dt, n, bodies=SYSTEM, pairs=PAIRS):
 80          14 GET_ITER
 81          34 UNPACK_SEQUENCE          3
```
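To see how long the STORE_FAST runs after UNPACK_SEQUENCE actually are, here is a small sketch using only the stdlib `dis` module (static bytecode only; newer CPython compilers may already fuse adjacent STORE_FAST pairs, which would shorten the measured runs):

```python
import dis
from collections import Counter

def unpack_store_runs(code, runs=None):
    """For each UNPACK_SEQUENCE, count how many consecutive STORE_FAST
    instructions immediately follow it."""
    if runs is None:
        runs = Counter()
    ins = [i.opname for i in dis.get_instructions(code)]
    for idx, name in enumerate(ins):
        if name == "UNPACK_SEQUENCE":
            n = 0
            while idx + 1 + n < len(ins) and ins[idx + 1 + n] == "STORE_FAST":
                n += 1
            runs[n] += 1
    return runs

def g(seq):
    for a, b in seq:   # UNPACK_SEQUENCE followed by stores into a and b
        pass

print(unpack_store_runs(g.__code__))
```

A histogram of these run lengths over a benchmark's code objects would show whether a fixed-arity super-instruction (e.g. unpack-2) covers most cases.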
We don't want to distort our selection of superinstructions, or optimizations in general, to fit the …
Do we have evidence that …
I'd expect that 2 or more STORE_FAST are the most likely thing following UNPACK_SEQUENCE, given the abundance of common patterns like …
If you want more evidence, we can look through the top 100 packages for such sequences. From a quick grep of the stdlib, just 2 values is by far the most common.
I have downloaded all the wheels in the top 100 Python packages and counted the patterns of unpacking sequences. Here is the result:
Statistics show that unpacking 2 values is by far the most common. My grep patterns are:
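As an alternative to grepping the source text, the same arity statistics can be gathered more robustly with the stdlib `ast` module, which also catches unpacking in `for` targets. A minimal sketch:

```python
import ast
from collections import Counter

def unpack_arity_counts(source):
    """Count the arity of tuple/list-unpacking targets: plain
    assignments like `a, b = ...` and for-loop targets like
    `for k, v in ...`."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, (ast.Tuple, ast.List)):
                    counts[len(target.elts)] += 1
        elif isinstance(node, ast.For) and isinstance(
            node.target, (ast.Tuple, ast.List)
        ):
            counts[len(node.target.elts)] += 1
    return counts

src = """
a, b = 1, 2
x, y, z = f()
for k, v in d.items():
    pass
"""
print(unpack_arity_counts(src))  # -> Counter({2: 2, 3: 1})
```

Feeding each `.py` file of a downloaded wheel through `unpack_arity_counts` would reproduce the corpus numbers above without regex edge cases.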
Thanks, that is very useful!
I've hacked together something of a framework to analyse bytecode and look for patterns like the above, also using the top 100 PyPI packages (or it can be adapted for any other corpus of …)
Results for …
I have prototyped an UNPACK_SEQUENCE_ST super-instruction and measured a 1.008x speedup. The implementation is based on "Add five superinstructions". Is it worth doing this optimization? In ceval.c:
In specialize.c:
One percent overall improvement is nothing to sneeze at! Please submit a PR. (Sorry for the slow response!)
“1.006x faster” - that's 0.6%, correct?
Um, yeah, I misread. Still.
Updated the performance data with the latest code base. It's 1.008x faster.
I think this can be closed now that the code generator supports defining super-instructions.
Literature: look for work by Anton Ertl (Forth)
Basic idea: suppose in a runtime (!) execution profile we see that the sequence LOAD_FAST, LOAD_FAST occurs frequently. We can introduce a new opcode, LOAD_FAST_LOAD_FAST (LFLF), and blindly replace the first of every pair of LOAD_FAST opcodes with the super-instruction LFLF. The code then changes from e.g.

```
LOAD_FAST            x
LOAD_FAST            y
```

to

```
LOAD_FAST_LOAD_FAST  x
LOAD_FAST            y
```
The opcode case for LFLF would do the following: push the local indexed by its own oparg, read the oparg of the following instruction directly (without dispatching on its opcode) and push that local too, then advance the instruction pointer past both instructions.
This saves all the other stuff that's done for each opcode (tracing etc.), and it doesn't even read the second opcode. (If the second LOAD_FAST is preceded by an EXTENDED_ARG opcode we should not do the substitution, so that getting the argument for the second opcode is a single byte access and the p-counter update is a fixed add.)
The beauty is that if there's a jump to the second LOAD_FAST instruction, things just work, so this form of code rewriting can be very fast.
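The blind rewrite described above can be sketched as a toy model in Python, operating on a list of (opname, oparg) pairs rather than real bytecode; the opcode names and the EXTENDED_ARG restriction follow the text, everything else is illustrative:

```python
def quicken_lflf(instructions):
    """Replace the first LOAD_FAST of every adjacent LOAD_FAST pair
    with LOAD_FAST_LOAD_FAST, leaving the second instruction in place
    so that jumps targeting it keep working."""
    out = list(instructions)
    for i in range(len(out) - 1):
        first, second = out[i], out[i + 1]
        if first[0] == "LOAD_FAST" and second[0] == "LOAD_FAST":
            # Skip the rewrite when the second load's oparg wouldn't fit
            # in one byte (i.e. it would need EXTENDED_ARG), so the
            # super-instruction can fetch it with a single byte access.
            if second[1] <= 0xFF:
                out[i] = ("LOAD_FAST_LOAD_FAST", first[1])
    return out

code = [("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_ADD", 0)]
print(quicken_lflf(code))
```

Note that three LOAD_FASTs in a row become LFLF, LFLF, LOAD_FAST under this scheme: the middle LFLF is normally skipped over by the first one, but still executes correctly if it is itself a jump target.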