Super-instructions #16

Closed · gvanrossum opened this issue Mar 11, 2021 · 44 comments

@gvanrossum (Collaborator) commented Mar 11, 2021

Literature: look for work by Anton Ertl (Forth)

Basic idea: suppose in a runtime (!) execution profile we see that the sequence LOAD_FAST, LOAD_FAST occurs frequently. We can introduce a new opcode, LOAD_FAST_LOAD_FAST (LFLF), and blindly replace the first of every pair of LOAD_FAST opcodes with the super-instruction LFLF. The code then changes from e.g.

    LOAD_FAST 0 (x)
    LOAD_FAST 1 (y)
    BINARY_ADD

to

    LOAD_FAST_LOAD_FAST 0 (x)
    LOAD_FAST 1 (y)
    BINARY_ADD

The opcode case for LFLF would do the following:

  • load the first variable (x)
  • decode the argument for the second opcode
  • adjust the program counter to consume the second opcode
  • load the second variable (y)
  • DISPATCH()

This saves all the other stuff that's done for each opcode (tracing etc.), and it doesn't even read the second opcode. (If the second LOAD_FAST is preceded by an EXTENDED_ARG opcode we should not do the substitution, so that getting the argument for the second opcode is a single byte access and the p-counter update is a fixed add.)

The beauty is that if there's a jump to the second LOAD_FAST instruction, things just work, so this form of code rewriting can be very fast.
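
As a sketch (an assumption about the shape of the code, not an excerpt from any branch), the LFLF case in ceval.c could look roughly like this, reusing the interpreter's existing macros (TARGET, GETLOCAL, PUSH, _Py_OPARG, DISPATCH) and eliding the unbound-local error checks:

    case TARGET(LOAD_FAST_LOAD_FAST): {
        /* First load: identical to LOAD_FAST. */
        PyObject *value = GETLOCAL(oparg);
        Py_INCREF(value);
        PUSH(value);
        /* Decode the second instruction's argument and step over it.
           We never read its opcode, and we never fuse when the second
           load is preceded by EXTENDED_ARG, so this is a single byte
           access and a fixed increment of the program counter. */
        oparg = _Py_OPARG(*next_instr);
        next_instr++;
        value = GETLOCAL(oparg);
        Py_INCREF(value);
        PUSH(value);
        DISPATCH();
    }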

@ericsnowcurrently (Collaborator) commented Mar 11, 2021

Yeah, taking advantage of common bytecode sequences is something that I've thought would be worth exploring.

If we did this, it would be helpful to have basic tooling to identify common patterns (e.g. from running the test suite or even an arbitrary given workload). It would also be important for such tooling to validate that any super-instructions we add are still worth having, since changes in the compiler can create or eliminate those patterns. Likewise it would be important to identify the more important workloads to target (just like with #10), or even just show that the specific workload isn't so critical.

@ericsnowcurrently (Collaborator)

Here's a derivative idea that may be worth exploring later: a dynamic flavor of super-instructions, where we recognize common patterns at runtime and build/run the collapsed case accordingly (sort of like a JIT). This introduces extra overhead, so to be worth it the collapsed cases would have to run frequently enough to more than offset that overhead.

@gvanrossum (Collaborator, Author)

Note that there's something in ceval.c that can dump out dynamic instruction pair occurrences. Search for dxpairs IIRC.

@gvanrossum (Collaborator, Author) commented Mar 13, 2021

I've attempted something a little different: in the compiler itself, collapse

    LOAD_FAST x
    BINARY_ADD

into a new opcode

    FAST_ADD x

which duplicates the code of LOAD_FAST and BINARY_ADD. Many other opcodes can get this treatment if it works out, e.g. LOAD_CONST and other binary and unary operations (I also did BINARY_SUBSCR).

Here's the branch: https://github.com/gvanrossum/cpython/tree/add-opcodes

UPDATE: This makes the bytecode more compact and saves dispatch time, at the cost of more cases in the switch. (I could reduce the code size by using more gotos. :-) On a microbenchmark I see 5-10% speedup.
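
For illustration, a fused FAST_ADD case might look roughly like this; a hedged sketch, not the code on the branch, omitting BINARY_ADD's string-concatenation fast path and the UnboundLocalError formatting:

    case TARGET(FAST_ADD): {
        /* Fused LOAD_FAST oparg + BINARY_ADD.  The local is the right
           operand; the left operand is already on the stack. */
        PyObject *right = GETLOCAL(oparg);
        if (right == NULL) {
            goto error;  /* real code raises UnboundLocalError first */
        }
        Py_INCREF(right);
        PyObject *left = TOP();
        PyObject *sum = PyNumber_Add(left, right);
        Py_DECREF(left);
        Py_DECREF(right);
        SET_TOP(sum);
        if (sum == NULL) {
            goto error;
        }
        DISPATCH();
    }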

@ericsnowcurrently (Collaborator)

> UPDATE: This makes the bytecode more compact and saves dispatch time, at the cost of more cases in the switch. (I could reduce the code size by using more gotos. :-) On a microbenchmark I see 5-10% speedup.

That definitely sounds like something worth exploring.

@ericsnowcurrently (Collaborator)

I like how straightforward the change is. 🙂

@ericsnowcurrently (Collaborator)

Here are the results of running the benchmarks on the add-opcodes branch right now: results-add-opcodes.tar.gz

(Note that the actual benchmark run took about 20 minutes.)

@gvanrossum (Collaborator, Author)

Hm, that shows small improvements for some number-heavy benchmarks, but significant slowdowns on the pickle benchmarks. That's kind of weird, and it's why we need a separate benchmarking machine (#19).

@gvanrossum (Collaborator, Author)

Some other pairs that could be combined easily using the same technique:

  • LOAD_CONST + RETURN_VALUE = RETURN_CONST
  • LOAD_CONST (None) + RETURN_VALUE = RETURN_NONE
  • CALL_METHOD + POP_TOP = POP_METHOD
  • POP_TOP + LOAD_FAST = POP_TOP_AND_LOAD_FAST
  • CALL_METHOD + RETURN_VALUE = RETURN_METHOD
  • GET_ITER + FOR_ITER = GET_ITER_AND_FOR_ITER
  • LOAD_FAST + RETURN_VALUE = RETURN_FAST

Of these, RETURN_CONST and RETURN_NONE show the most promise IMO (in my sample of 1 they are the most prominent).

Note that there are other common combinations, but they require a different approach to dispatch, because both instructions in the pair have an operand. (Top 9 in my sample: LOAD_FAST + LOAD_ATTR, LOAD_FAST + LOAD_FAST, LOAD_FAST + STORE_ATTR, LOAD_GLOBAL + LOAD_FAST, LOAD_FAST + LOAD_METHOD, STORE_FAST + LOAD_FAST, LOAD_METHOD + LOAD_FAST, LOAD_GLOBAL + CALL_FUNCTION, STORE_ATTR + LOAD_FAST.) This is the approach (according to Mark) taken by Ertl.
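
As an example of how simple the one-operand combinations can be, here is a rough sketch of RETURN_CONST; the exit label is a stand-in for whatever RETURN_VALUE jumps to in the ceval.c revision being patched, so treat this as an assumption rather than a tested patch:

    case TARGET(RETURN_CONST): {
        /* Fused LOAD_CONST oparg + RETURN_VALUE: no push/pop needed. */
        retval = GETITEM(consts, oparg);
        Py_INCREF(retval);
        assert(f->f_iblock == 0);
        goto exit_returning;  /* stand-in for RETURN_VALUE's exit label */
    }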

@markshannon (Member) commented Mar 25, 2021

See faster-cpython/cpython#2 (comment) for restrictions on super-instruction formation.

I don't think there is much room for super-instructions implemented in the compiler, because:

  • They shouldn't include a bytecode that we will want to specialize, because specialization gains much more than removing the dispatch.
  • They can't cross line boundaries.
  • They can't include more than one instruction that has an operand.

Unfortunately, FAST_ADD, FAST_SUBSCR, CONST_ADD, CONST_SUBSCR all break rule 1.

INT_ADD is good because it is already specialized, and will rarely cross a line boundary.

Various combinations of LOAD_CONST, IS_OP and POP_JUMP_IF_XXX are also promising.

@gvanrossum (Collaborator, Author)

Things that can be done then include:

  • INT_ADD

  • LOAD_CONST(None), IS_OP, POP_JUMP_IF_XXX

  • RETURN_NONE, RETURN_CONST

  • Maybe add checks to skip these optimizations if they span lines, e.g.

        return (
          None
        )

Some of these could be separate PRs to test the waters.

@gvanrossum (Collaborator, Author)

(Actually, code like that return broken across lines already executes like it's all on the second line, so I think I needn't worry about that.)

@gvanrossum (Collaborator, Author)

See python/cpython#25090

@gvanrossum (Collaborator, Author) commented Mar 31, 2021

Note that all of the work I've done so far is really combined instructions, and for that I've opened a new issue, #36. Let's keep this one (#16) for the original idea of super-instructions (also spelled "superinstructions") in the style of Ertl -- see top comment in the current issue.

@gvanrossum (Collaborator, Author) commented Apr 22, 2021

Since runtime specialization is going to take a while, I'm putting some effort into getting several (true) super-instructions implemented in the compiler. See https://github.com/faster-cpython/cpython/tree/super-instr

@gvanrossum (Collaborator, Author) commented Apr 23, 2021

UPDATE: Never mind, see later comment (it did move the needle, but in the wrong direction)

There are a few things to clean up, but this seems to really move the needle for pyperformance:

Slower (15):
- regex_dna: 212 ms +- 1 ms -> 225 ms +- 1 ms: 1.06x slower
- regex_v8: 27.6 ms +- 0.3 ms -> 29.2 ms +- 0.1 ms: 1.06x slower
- spectral_norm: 174 ms +- 3 ms -> 182 ms +- 2 ms: 1.04x slower
- regex_effbot: 3.60 ms +- 0.10 ms -> 3.72 ms +- 0.01 ms: 1.03x slower
- json_loads: 31.1 us +- 0.2 us -> 32.1 us +- 0.3 us: 1.03x slower
- pickle_dict: 33.5 us +- 0.3 us -> 34.4 us +- 0.2 us: 1.03x slower
- unpack_sequence: 58.0 ns +- 0.6 ns -> 59.5 ns +- 0.6 ns: 1.03x slower
- fannkuch: 572 ms +- 3 ms -> 585 ms +- 4 ms: 1.02x slower
- scimark_sor: 248 ms +- 4 ms -> 254 ms +- 3 ms: 1.02x slower
- crypto_pyaes: 142 ms +- 1 ms -> 145 ms +- 2 ms: 1.02x slower
- pickle_list: 5.15 us +- 0.07 us -> 5.25 us +- 0.05 us: 1.02x slower
- pyflate: 811 ms +- 6 ms -> 824 ms +- 6 ms: 1.02x slower
- scimark_fft: 516 ms +- 3 ms -> 524 ms +- 4 ms: 1.01x slower
- nqueens: 123 ms +- 1 ms -> 124 ms +- 1 ms: 1.01x slower
- chameleon: 11.2 ms +- 0.1 ms -> 11.3 ms +- 0.2 ms: 1.01x slower

Faster (40):
- sqlalchemy_declarative: 381 ms +- 9 ms -> 182 ms +- 3 ms: 2.09x faster
- xml_etree_iterparse: 180 ms +- 10 ms -> 120 ms +- 2 ms: 1.50x faster
- xml_etree_parse: 251 ms +- 18 ms -> 168 ms +- 3 ms: 1.49x faster
- meteor_contest: 186 ms +- 21 ms -> 127 ms +- 0 ms: 1.46x faster
- sqlalchemy_imperative: 33.7 ms +- 2.4 ms -> 24.1 ms +- 0.5 ms: 1.40x faster
- regex_compile: 288 ms +- 26 ms -> 210 ms +- 1 ms: 1.37x faster
- genshi_text: 48.3 ms +- 3.4 ms -> 37.2 ms +- 0.4 ms: 1.30x faster
- tornado_http: 204 ms +- 16 ms -> 160 ms +- 4 ms: 1.27x faster
- raytrace: 847 ms +- 26 ms -> 667 ms +- 4 ms: 1.27x faster
- chaos: 168 ms +- 3 ms -> 134 ms +- 1 ms: 1.26x faster
- 2to3: 481 ms +- 2 ms -> 389 ms +- 2 ms: 1.24x faster
- genshi_xml: 91.7 ms +- 1.9 ms -> 75.3 ms +- 0.8 ms: 1.22x faster
- xml_etree_generate: 141 ms +- 8 ms -> 117 ms +- 2 ms: 1.20x faster
- json_dumps: 18.9 ms +- 0.9 ms -> 16.0 ms +- 0.2 ms: 1.18x faster
- xml_etree_process: 112 ms +- 5 ms -> 95.0 ms +- 0.5 ms: 1.18x faster
- pathlib: 26.1 ms +- 0.7 ms -> 22.9 ms +- 0.4 ms: 1.14x faster
- logging_format: 13.4 us +- 0.8 us -> 11.9 us +- 0.1 us: 1.13x faster
- dulwich_log: 99.3 ms +- 4.1 ms -> 88.7 ms +- 0.6 ms: 1.12x faster
- logging_simple: 12.1 us +- 0.7 us -> 10.9 us +- 0.1 us: 1.11x faster
- telco: 9.52 ms +- 0.22 ms -> 8.61 ms +- 0.22 ms: 1.11x faster
- pidigits: 222 ms +- 0 ms -> 204 ms +- 0 ms: 1.09x faster
- float: 141 ms +- 6 ms -> 129 ms +- 2 ms: 1.09x faster
- pickle_pure_python: 631 us +- 18 us -> 580 us +- 7 us: 1.09x faster
- django_template: 66.7 ms +- 3.8 ms -> 63.3 ms +- 0.7 ms: 1.05x faster
- mako: 19.9 ms +- 0.5 ms -> 19.0 ms +- 0.2 ms: 1.05x faster
- hexiom: 12.4 ms +- 0.2 ms -> 12.0 ms +- 0.1 ms: 1.04x faster
- scimark_monte_carlo: 141 ms +- 2 ms -> 137 ms +- 2 ms: 1.03x faster
- nbody: 175 ms +- 1 ms -> 170 ms +- 2 ms: 1.03x faster
- unpickle_pure_python: 415 us +- 4 us -> 404 us +- 4 us: 1.03x faster
- scimark_lu: 218 ms +- 2 ms -> 212 ms +- 1 ms: 1.03x faster
- sympy_sum: 244 ms +- 2 ms -> 239 ms +- 2 ms: 1.02x faster
- python_startup: 9.89 ms +- 0.05 ms -> 9.71 ms +- 0.04 ms: 1.02x faster
- pickle: 13.2 us +- 0.1 us -> 13.0 us +- 0.1 us: 1.02x faster
- python_startup_no_site: 6.55 ms +- 0.04 ms -> 6.47 ms +- 0.04 ms: 1.01x faster
- deltablue: 10.1 ms +- 0.2 ms -> 10.0 ms +- 0.2 ms: 1.01x faster
- scimark_sparse_mat_mult: 7.71 ms +- 0.07 ms -> 7.63 ms +- 0.18 ms: 1.01x faster
- sympy_str: 379 ms +- 3 ms -> 375 ms +- 6 ms: 1.01x faster
- sympy_integrate: 30.3 ms +- 0.2 ms -> 30.1 ms +- 0.2 ms: 1.01x faster
- unpickle_list: 5.42 us +- 0.05 us -> 5.38 us +- 0.09 us: 1.01x faster
- go: 285 ms +- 2 ms -> 283 ms +- 2 ms: 1.01x faster

Benchmark hidden because not significant (5): logging_silent, richards, sqlite_synth, sympy_expand, unpickle

Geometric mean: 1.09x faster

@gvanrossum (Collaborator, Author) commented Apr 23, 2021

(The trickiest thing to clean up is that e.g. a sequence of three LOAD_FAST opcodes in a row is converted to two LOAD_FAST_LOAD_FAST opcodes followed by a regular LOAD_FAST. This works but is unexpected, and there's an assert in HALF_NEXTOPARG() that can fail because of this.)

@gvanrossum (Collaborator, Author)

Whoa, never mind. It seems I had the benchmarks backwards (I ran my branch first and then master). So it seems this is either a serious pessimization or there's too much variation in the benchmark times.

@gvanrossum (Collaborator, Author) commented Apr 24, 2021

Several benchmark runs later, these results are unfortunately correct: several benchmarks are much slower with the super-instructions -- up to 2.09x for sqlalchemy_declarative. I'm investigating this with bm_meteor_contest (which has no external dependencies, so it's easier to analyze). So far the main suspicious thing is that there are several runs of multiple super-instructions in a row. This benchmark's dynamic execution profile shows that 15% of opcode pairs are LOAD_FAST + LOAD_FAST. There's also a very common pair, FOR_ITER + STORE_FAST (8%), where I notice that replacing that STORE_FAST with a super-instruction (as is done for all three occurrences of FOR_ITER in the solve() function) will make a PREDICT macro fail (FOR_ITER predicts STORE_FAST or UNPACK_SEQUENCE). (Then again, IIUC with gcc we're using computed gotos, which should disable prediction?)

Food for thought.

@gvanrossum (Collaborator, Author)

Okay, I had a reference count bug. Here's a more reliable result. So it looks like it's up to 6% faster, up to 2% slower, and on average 1% faster.

$ python3 -m pyperf compare_to  req-{1619196353,1619391121}-gvanrossum/*gz -G --min-speed 2
Slower (1):
- unpickle_list: 5.38 us +- 0.10 us -> 5.49 us +- 0.07 us: 1.02x slower

Faster (14):
- regex_dna: 225 ms +- 1 ms -> 212 ms +- 1 ms: 1.06x faster
- regex_v8: 29.1 ms +- 0.1 ms -> 27.6 ms +- 0.3 ms: 1.05x faster
- regex_effbot: 3.75 ms +- 0.08 ms -> 3.59 ms +- 0.03 ms: 1.04x faster
- spectral_norm: 182 ms +- 3 ms -> 175 ms +- 3 ms: 1.04x faster
- regex_compile: 210 ms +- 1 ms -> 203 ms +- 1 ms: 1.04x faster
- unpickle_pure_python: 405 us +- 5 us -> 391 us +- 4 us: 1.04x faster
- pickle_list: 5.28 us +- 0.11 us -> 5.11 us +- 0.05 us: 1.03x faster
- chameleon: 11.3 ms +- 0.3 ms -> 10.9 ms +- 0.1 ms: 1.03x faster
- pathlib: 23.1 ms +- 0.3 ms -> 22.5 ms +- 0.3 ms: 1.03x faster
- scimark_fft: 524 ms +- 4 ms -> 511 ms +- 4 ms: 1.02x faster
- raytrace: 665 ms +- 4 ms -> 649 ms +- 6 ms: 1.02x faster
- unpack_sequence: 59.4 ns +- 0.5 ns -> 58.1 ns +- 2.5 ns: 1.02x faster
- genshi_xml: 76.1 ms +- 2.9 ms -> 74.4 ms +- 0.8 ms: 1.02x faster
- scimark_sor: 254 ms +- 3 ms -> 248 ms +- 2 ms: 1.02x faster

Benchmark hidden because not significant (45): 2to3, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, go, hexiom, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pickle, pickle_dict, pickle_pure_python, pidigits, pyflate, python_startup, python_startup_no_site, richards, scimark_lu, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpickle, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster
$ 

@gvanrossum (Collaborator, Author) commented Apr 26, 2021

I've tried this with the INT_ADD opcode added too. Results are similar, but alas:

- unpack_sequence: 59.4 ns +- 0.5 ns -> 69.7 ns +- 0.6 ns: 1.17x slower

Now that is a pretty crazy benchmark (400 repetitions of a, b, c, d, e, f, g, h, i, j = to_unpack), but still, an unacceptable regression.

UPDATE: Strangely, unpack_sequence was faster without INT_ADD, but the generated bytecode doesn't have an INT_ADD -- just a bunch of STORE_FAST_LOAD_FAST opcodes (one per line, last opcode of the line). No idea yet what could be going on here. My current theory: that benchmark doesn't do enough loops and it's too susceptible to random variability. Note that it's only 59 ns per loop.

@gvanrossum (Collaborator, Author)

I can't say I understand it, but clearly STORE_FAST_LOAD_FAST was a pessimization. Without this, the top faster and slower times are like this:

Slower (2):
- unpickle_list: 5.38 us +- 0.09 us -> 5.52 us +- 0.09 us: 1.03x slower
- sqlite_synth: 3.35 us +- 0.06 us -> 3.43 us +- 0.05 us: 1.02x slower

Faster (11):
- chameleon: 11.3 ms +- 0.2 ms -> 11.0 ms +- 0.1 ms: 1.03x faster
- regex_compile: 210 ms +- 1 ms -> 204 ms +- 1 ms: 1.03x faster

So mostly a wash. Is it worth investigating why unpickle_list is consistently slower?

@gvanrossum (Collaborator, Author) commented Apr 26, 2021

Here are some TODOs I collected during a discussion with Mark and Eric:

  • static analyze how often STORE_FAST_LOAD_FAST is generated
    • in the stdlib: on master this pair occurred 0.87%; in my branch, the single opcode occurs 0.79%
      (presumably the difference is due to triples, which get counted twice, but optimized only once)
  • maybe also do dynamic analysis of the same
    • no surprise there -- about 4-5%, corresponding to the dynamic stats I gathered previously; in bm_unpack_sequence.py it's around 3%
  • consider not generating super-instructions crossing lines, then we can drop the use_tracing check from HALF_DISPATCH()

@markshannon (Member)

This seems to have expired. Let's try to restart it.

First of all, let's narrow this issue down to superinstructions that are inserted into the quickened code but are not present in the original code object.

Superinstructions that are inserted by the compiler into the code object are discussed in #36.

We want to exclude from this list:

  • Combined instructions (typically a pair where only one instruction has an operand)
  • Any pairs when one (or both) of the pair can be specialized
  • Any pair including FOR_ITER, for now. It is not clear what the best way to speed up for loops is.

Which leaves this list of common pairs, compiled from various sources:

  • LOAD_FAST LOAD_FAST
  • COMPARE_OP POP_JUMP_IF_FALSE
  • STORE_FAST LOAD_FAST
  • LOAD_FAST LOAD_CONST
  • LOAD_CONST COMPARE_OP
  • STORE_FAST STORE_FAST
  • LOAD_CONST LOAD_FAST

We are not just interested in the prevalence of the pairs, but also in the profitability of combining them.
Of the above, the simple pairs (combining LOAD_FAST, STORE_FAST, and LOAD_CONST) are worth combining because the dispatch overhead is a significant part of their execution time.
However, COMPARE_OP is a complex instruction and may benefit from specialization, so we'll leave that out for now as well.

I am suspicious of the STORE_FAST STORE_FAST pair. I suspect that it shows up a lot in tests, but less so in production code.
I could well be wrong, and would be happy to be proved wrong, but I'm going to leave it out for now.

That leaves just four:

  • LOAD_FAST LOAD_FAST
  • STORE_FAST LOAD_FAST
  • LOAD_FAST LOAD_CONST
  • LOAD_CONST LOAD_FAST

It's worth noting that none of LOAD_FAST, STORE_FAST, and LOAD_CONST can change the state from non-tracing to tracing.
Since tracing is always off when executing quickened instructions, this gives us another advantage. We don't need to check for tracing at the end of the instruction. In fact, we can take advantage of this to quicken LOAD_FAST, STORE_FAST and LOAD_CONST by replacing them with "NO TRACE" versions that do not check for tracing when dispatching.

This gives us three more instructions that might be worth adding (see the sketch after this list):

  • LOAD_FAST_NO_TRACE
  • LOAD_CONST_NO_TRACE
  • STORE_FAST_NO_TRACE
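
For example, the no-trace flavor of LOAD_FAST might look roughly like this, assuming a NOTRACE_DISPATCH() macro that skips the use_tracing check (the same macro the quickening prototype later in this thread uses); a sketch, not code from a branch:

    case TARGET(LOAD_FAST_NO_TRACE): {
        /* Body identical to LOAD_FAST; only the dispatch differs. */
        PyObject *value = GETLOCAL(oparg);
        if (value == NULL) {
            goto unbound_local_error;  /* stand-in for LOAD_FAST's error path */
        }
        Py_INCREF(value);
        PUSH(value);
        NOTRACE_DISPATCH();  /* no tracing check before the next opcode */
    }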

@gvanrossum (Collaborator, Author)

So is your conclusion to just add the no-trace instructions, or also the four super-instruction pairs you listed?

@markshannon (Member)

My conclusion is that we should run some experiments 🙂

@markshannon (Member)

Initial experiments show no speedup with the four super-instructions, and a ~2% speedup from adding STORE_FAST__STORE_FAST.
That might be a real effect, or just the unpack_sequence benchmark.
It is also on my laptop without restrictions on power management, so it's noisy.
From this I conclude that more experiments are necessary 🙂

If you want to play, the code is here: https://github.com/faster-cpython/cpython/tree/super-instructions

@sweeneyde

I'd be curious about a further specialized STORE_FAST__LOAD_FAST_SAME when the same variable is stored and then loaded, as is the case in some list comprehensions:

.\python.bat -c "from dis import dis; dis(lambda: [x + 1 for x in my_list])"
Running Release|x64 interpreter...
  1           0 LOAD_CONST               1 (<code object <listcomp> at 0x0000016D0DF4F210, file "<string>", line 1>)
              2 MAKE_FUNCTION            0
              4 LOAD_GLOBAL              0 (my_list)
              6 GET_ITER
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x0000016D0DF4F210, file "<string>", line 1>:
  1           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                 6 (to 18)
              6 STORE_FAST               1 (x)   <----------------------
              8 LOAD_FAST                1 (x)   <----------------------
             10 LOAD_CONST               0 (1)
             12 BINARY_ADD
             14 LIST_APPEND              2
             16 JUMP_ABSOLUTE            2 (to 4)
        >>   18 RETURN_VALUE

Implemented as something like:

    TARGET(STORE_FAST__LOAD_FAST_SAME): {
        PyObject *val = TOP();  // value stays on top of the stack
        Py_INCREF(val);
        SETLOCAL(oparg, val);
        DISPATCH();
    }

See also https://bugs.python.org/issue38381. Maybe it's worth collecting statistics about how often this pair has the same argument.

@JelleZijlstra (Contributor)

Could this be handled by a peephole optimization step that simply elides the pair of instructions? I suppose that's only safe if there are no locals() calls, and no other references to the variable.

@zcpara commented Aug 14, 2021

Eliding the pair of instructions will also be unsafe if there are code-introspection, dir(), or traceback calls, or if the variable is deleted.

@zcpara commented Aug 14, 2021

Is it worth making a super-instruction for UNPACK_SEQUENCE and the STORE_FASTs that follow it? In bm_nbody.py from pyperformance, these patterns occur in hot loop bodies. We can identify them during the compilation phase.

    def advance(dt, n, bodies=SYSTEM, pairs=PAIRS):
        for i in range(n):
            for (([x1, y1, z1], v1, m1),
                 ([x2, y2, z2], v2, m2)) in pairs:

     80          14 GET_ITER
            >>   16 FOR_ITER               112 (to 242)
                 18 UNPACK_SEQUENCE          2
                 20 UNPACK_SEQUENCE          3
                 22 UNPACK_SEQUENCE          3
                 24 STORE_FAST               5 (x1)
                 26 STORE_FAST               6 (y1)
                 28 STORE_FAST               7 (z1)
                 30 STORE_FAST               8 (v1)
                 32 STORE_FAST               9 (m1)

     81          34 UNPACK_SEQUENCE          3
                 36 UNPACK_SEQUENCE          3
                 38 STORE_FAST              10 (x2)
                 40 STORE_FAST              11 (y2)
                 42 STORE_FAST              12 (z2)
                 44 STORE_FAST              13 (v2)
                 46 STORE_FAST              14 (m2)

@markshannon (Member)

We don't want to distort our selection of superinstructions, or optimizations in general, to fit the pyperformance benchmark suite.

Do we have evidence that UNPACK_SEQUENCE followed by a sequence of STORE_FASTs is a common pattern?
We now have STORE_FAST__STORE_FAST as a superinstruction, which will help with the above patterns.

@gvanrossum (Collaborator, Author)

I'd expect that 2 or more STORE_FAST are the most likely thing following UNPACK_SEQUENCE, given the abundance of common patterns like

    x, y = point
    for x, y in points: ...
    for k, v in d.items(): ...
    for i, x in enumerate(a): ...
    for x, y in zip(xs, ys): ...
    for root, dirs, files in os.walk(...): ...

If you want more evidence we can look through the top 100 packages for such sequences.

From a quick grep of the stdlib, just 2 values is by far the most common.

@frank-king commented Aug 19, 2021

I have downloaded all the wheels in the top 100 Python packages and counted the sequence-unpacking patterns.

Here is the result:

    Pattern                  Lines   Percentage
    for x, y in ...           5842        23.5%
    x, y = ...               13244        52.8%
    for x, y(, z)+ in ...      125         0.5%
    x, y(, z)+ = ...          5852        23.3%
    total                    25063         100%

The statistics show that 2 STORE_FASTs are the most likely to follow an UNPACK_SEQUENCE (about 76.2%), but in the remaining 23.8% of cases the UNPACK_SEQUENCE is followed by 3 or more STORE_FASTs.


My grep patterns are:

    Pattern                  Regex
    for x, y in ...          'for \w+, \w+ in' or 'for \(\w+, \w+\) in'
    for x, y(, z)+ in ...    'for \w+, \w+(, \w)+ in' or 'for \(\w+, \w+(, \w)+\) in'
    x, y = ...               '^\s*\w+, \w+ =' or '^\s*\(\w+, \w+\) ='
    x, y(, z)+ = ...         '^\s*\w+, \w+(, \w+)+ =' or '^\s*\(\w+, \w+(, \w+)+\) ='

@gvanrossum (Collaborator, Author) commented Aug 19, 2021 via email

@pxeger commented Aug 19, 2021

I've hacked together something of a framework to analyse bytecode and look for patterns like the above, also using the top 100 PyPI packages (or it can be adapted for any other corpus of pyc files, really). You may find it helpful: https://github.com/pxeger/pycstats

@pxeger commented Aug 20, 2021

Results for UNPACK_SEQUENCE followed by any two instructions (this is based on the compiled output, unweighted by how many times the instructions are executed, so not quite representative):

72975 UNPACK_SEQUENCE STORE_FAST STORE_FAST
993 UNPACK_SEQUENCE LOAD_FAST STORE_ATTR
744 UNPACK_SEQUENCE STORE_NAME STORE_NAME
592 UNPACK_SEQUENCE STORE_FAST UNPACK_SEQUENCE
588 UNPACK_SEQUENCE STORE_FAST LOAD_FAST
287 UNPACK_SEQUENCE LOAD_FAST LOAD_CONST
280 UNPACK_SEQUENCE STORE_DEREF STORE_FAST
256 UNPACK_SEQUENCE STORE_DEREF STORE_DEREF
236 UNPACK_SEQUENCE STORE_FAST STORE_DEREF
219 UNPACK_SEQUENCE UNPACK_SEQUENCE STORE_FAST
162 UNPACK_SEQUENCE STORE_FAST LOAD_GLOBAL
55 UNPACK_SEQUENCE LOAD_FAST LOAD_FAST
52 UNPACK_SEQUENCE LOAD_DEREF STORE_ATTR
51 UNPACK_SEQUENCE LOAD_FAST LOAD_ATTR
50 UNPACK_SEQUENCE STORE_FAST POP_BLOCK
20 UNPACK_SEQUENCE STORE_FAST LOAD_DEREF
20 UNPACK_SEQUENCE STORE_FAST JUMP_FORWARD
19 UNPACK_SEQUENCE STORE_FAST LOAD_CONST
18 UNPACK_SEQUENCE STORE_GLOBAL STORE_GLOBAL
16 UNPACK_SEQUENCE LOAD_DEREF LOAD_ATTR
10 UNPACK_SEQUENCE LOAD_GLOBAL STORE_ATTR
8 UNPACK_SEQUENCE LOAD_GLOBAL LOAD_CONST
7 UNPACK_SEQUENCE STORE_FAST SETUP_FINALLY
5 UNPACK_SEQUENCE LOAD_NAME STORE_ATTR
4 UNPACK_SEQUENCE STORE_NAME UNPACK_SEQUENCE
4 UNPACK_SEQUENCE STORE_GLOBAL STORE_FAST
4 UNPACK_SEQUENCE STORE_FAST JUMP_ABSOLUTE
3 UNPACK_SEQUENCE STORE_DEREF UNPACK_SEQUENCE
3 UNPACK_SEQUENCE STORE_FAST UNPACK_EX
3 UNPACK_SEQUENCE UNPACK_SEQUENCE STORE_DEREF
3 UNPACK_SEQUENCE UNPACK_SEQUENCE UNPACK_SEQUENCE
2 UNPACK_SEQUENCE STORE_DEREF LOAD_DEREF
1 UNPACK_SEQUENCE LOAD_GLOBAL LOAD_ATTR
1 UNPACK_SEQUENCE EXTENDED_ARG STORE_NAME
1 UNPACK_SEQUENCE EXTENDED_ARG UNPACK_EX

@zcpara commented Aug 26, 2021

I have prototyped an UNPACK_SEQUENCE_ST super-instruction, which produced a 1.008% speedup. The implementation is based on the "Add five superinstructions" PR.

Is it worth doing this optimization?

In ceval.c

TARGET(UNPACK_SEQUENCE_ST): {
    PREDICTED(UNPACK_SEQUENCE_ST);
    PyObject *seq = POP(), *item, **items;
    if (PyTuple_CheckExact(seq) &&
        PyTuple_GET_SIZE(seq) == oparg) {
        /* Tuple fast path: store the items directly into the locals
           named by the following STORE_FAST instructions. */
        items = ((PyTupleObject *)seq)->ob_item;
        for (int i = 0; i < oparg; i++) {
            item = items[i];
            Py_INCREF(item);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else if (PyList_CheckExact(seq) &&
             PyList_GET_SIZE(seq) == oparg) {
        /* List fast path: same as the tuple case. */
        items = ((PyListObject *)seq)->ob_item;
        for (int i = 0; i < oparg; i++) {
            item = items[i];
            Py_INCREF(item);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else if (unpack_iterable(tstate, seq, oparg, -1,
                             stack_pointer + oparg)) {
        /* General case: unpack_iterable() wrote the items (new
           references) above the stack top in reverse order. */
        items = stack_pointer + oparg - 1;
        for (int i = 0; i < oparg; i++) {
            item = *(items - i);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else {
        /* unpack_iterable() raised an exception */
        Py_DECREF(seq);
        goto error;
    }
    Py_DECREF(seq);
    PREDICT(LOAD_FAST);
    NOTRACE_DISPATCH();
}

In specialize.c

case UNPACK_SEQUENCE: {
    /* Form the superinstruction only if the next oparg instructions
       are all STORE_FAST. */
    int j = 1;
    for (; j <= oparg; ++j) {
        if (_Py_OPCODE(instructions[i+j]) != STORE_FAST) {
            break;
        }
    }
    if (j == (oparg+1)) {
        instructions[i] = _Py_MAKECODEUNIT(UNPACK_SEQUENCE_ST, oparg);
        i += oparg;
    }
    break;
}

@gvanrossum (Collaborator, Author)

One percent overall improvement is nothing to sneeze at! Please submit a PR. (Sorry for the slow response!)

@ssweber commented Sep 16, 2021

“1.006x faster” - that's 0.6%, correct?

@gvanrossum (Collaborator, Author)

Um, yeah, I misread. Still.

@zcpara commented Sep 22, 2021

I've updated the performance data with the latest code base. It's 1.008x faster.

https://bugs.python.org/issue45260
python/cpython#28519

@gvanrossum (Collaborator, Author)

I think this can be closed now that the code generator supports defining super-instructions.

Repository owner moved this from Todo to Done in Fancy CPython Board Nov 16, 2022