Super-instructions #16

Closed · gvanrossum opened this issue Mar 11, 2021 · 44 comments

@gvanrossum (Collaborator) commented Mar 11, 2021

Literature: look for work by Anton Ertl (Forth)

Basic idea: suppose in a runtime (!) execution profile we see that the sequence LOAD_FAST, LOAD_FAST occurs frequently. We can introduce a new opcode, LOAD_FAST_LOAD_FAST (LFLF), and blindly replace the first of every pair of LOAD_FAST opcodes with the super-instruction LFLF. The code then changes from e.g.

    LOAD_FAST 0 (x)
    LOAD_FAST 1 (y)
    BINARY_ADD

to

    LOAD_FAST_LOAD_FAST 0 (x)
    LOAD_FAST 1 (y)
    BINARY_ADD

The opcode case for LFLF would do the following:

  • load the first variable (x)
  • decode the argument for the second opcode
  • adjust the program counter to consume the second opcode
  • load the second variable (y)
  • DISPATCH()

This saves all the other stuff that's done for each opcode (tracing etc.), and it doesn't even read the second opcode. (If the second LOAD_FAST is preceded by an EXTENDED_ARG opcode we should not do the substitution, so that getting the argument for the second opcode is a single byte access and the p-counter update is a fixed add.)

The beauty is that if there's a jump to the second LOAD_FAST instruction, things just work, so this form of code rewriting can be very fast.
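
As a sketch (an assumption about the shape of the code, not an excerpt from any branch), the LFLF case in ceval.c could look roughly like this, reusing the interpreter's existing macros (TARGET, GETLOCAL, PUSH, _Py_OPARG, DISPATCH) and eliding the unbound-local error checks:

    case TARGET(LOAD_FAST_LOAD_FAST): {
        /* First load: identical to LOAD_FAST. */
        PyObject *value = GETLOCAL(oparg);
        Py_INCREF(value);
        PUSH(value);
        /* Decode the second instruction's argument and step over it.
           We never read its opcode, and we never fuse when the second
           load is preceded by EXTENDED_ARG, so this is a single byte
           access and a fixed increment of the program counter. */
        oparg = _Py_OPARG(*next_instr);
        next_instr++;
        value = GETLOCAL(oparg);
        Py_INCREF(value);
        PUSH(value);
        DISPATCH();
    }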

@ericsnowcurrently (Collaborator) commented Mar 11, 2021

Yeah, taking advantage of common bytecode sequences is something that I've thought would be worth exploring.

If we did this, it would be helpful to have basic tooling to identify common patterns (e.g. from running the test suite or even an arbitrary given workload). It would also be important for such tooling to validate that any super-instructions we add are still worth having, since changes in the compiler can create or eliminate those patterns. Likewise it would be important to identify the more important workloads to target (just like with #10), or even just show that the specific workload isn't so critical.

@ericsnowcurrently (Collaborator)

Here's a derivative idea that may be worth exploring later: a dynamic flavor of super-instructions, where we recognize common patterns at runtime and build/run the collapsed case accordingly (sort of like a JIT). This introduces extra overhead, so to be worth it the collapsed cases would have to run frequently enough to more than offset that overhead.

@gvanrossum (Collaborator, Author)

Note that there's something in ceval.c that can dump out dynamic instruction pair occurrences. Search for dxpairs IIRC.

@gvanrossum (Collaborator, Author) commented Mar 13, 2021

I've attempted something a little different: in the compiler itself, collapse

    LOAD_FAST x
    BINARY_ADD

into a new opcode

    FAST_ADD x

which duplicates the code of LOAD_FAST and BINARY_ADD. Many other opcodes can get this treatment if it works out, e.g. LOAD_CONST and other binary and unary operations (I also did BINARY_SUBSCR).

Here's the branch: https://github.com/gvanrossum/cpython/tree/add-opcodes

UPDATE: This makes the bytecode more compact and saves dispatch time, at the cost of more cases in the switch. (I could reduce the code size by using more gotos. :-) On a microbenchmark I see 5-10% speedup.
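
For illustration, a fused FAST_ADD case might look roughly like this; a hedged sketch, not the code on the branch, omitting BINARY_ADD's string-concatenation fast path and the UnboundLocalError formatting:

    case TARGET(FAST_ADD): {
        /* Fused LOAD_FAST oparg + BINARY_ADD.  The local is the right
           operand; the left operand is already on the stack. */
        PyObject *right = GETLOCAL(oparg);
        if (right == NULL) {
            goto error;  /* real code raises UnboundLocalError first */
        }
        Py_INCREF(right);
        PyObject *left = TOP();
        PyObject *sum = PyNumber_Add(left, right);
        Py_DECREF(left);
        Py_DECREF(right);
        SET_TOP(sum);
        if (sum == NULL) {
            goto error;
        }
        DISPATCH();
    }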

@ericsnowcurrently (Collaborator)

> UPDATE: This makes the bytecode more compact and saves dispatch time, at the cost of more cases in the switch. (I could reduce the code size by using more gotos. :-) On a microbenchmark I see 5-10% speedup.

That definitely sounds like something worth exploring.

@ericsnowcurrently (Collaborator)

I like how straightforward the change is. 🙂

@ericsnowcurrently (Collaborator)

Here are the results of running the benchmarks on the add-opcodes branch right now: results-add-opcodes.tar.gz

(Note that the actual benchmark run took about 20 minutes.)

@gvanrossum (Collaborator, Author)

Hm, that shows small improvements for some number-heavy benchmarks, but significant slowdowns on the pickle benchmarks. That's kind of weird, and it's why we need a separate benchmarking machine (#19).

@gvanrossum (Collaborator, Author)

Some other pairs that could be combined easily using the same technique:

  • LOAD_CONST + RETURN_VALUE = RETURN_CONST
  • LOAD_CONST (None) + RETURN_VALUE = RETURN_NONE
  • CALL_METHOD + POP_TOP = POP_METHOD
  • POP_TOP + LOAD_FAST = POP_TOP_AND_LOAD_FAST
  • CALL_METHOD + RETURN_VALUE = RETURN_METHOD
  • GET_ITER + FOR_ITER = GET_ITER_AND_FOR_ITER
  • LOAD_FAST + RETURN_VALUE = RETURN_FAST

Of these, RETURN_CONST and RETURN_NONE show the most promise IMO (in my sample of 1 they are the most prominent).

Note that there are other common combinations, but they require a different approach to dispatch, because both instructions in the pair have an operand. (Top 9 in my sample: LOAD_FAST + LOAD_ATTR, LOAD_FAST + LOAD_FAST, LOAD_FAST + STORE_ATTR, LOAD_GLOBAL + LOAD_FAST, LOAD_FAST + LOAD_METHOD, STORE_FAST + LOAD_FAST, LOAD_METHOD + LOAD_FAST, LOAD_GLOBAL + CALL_FUNCTION, STORE_ATTR + LOAD_FAST.) This is the approach (according to Mark) taken by Ertl.
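
As an example of how simple the one-operand combinations can be, here is a rough sketch of RETURN_CONST; the exit label is a stand-in for whatever RETURN_VALUE jumps to in the ceval.c revision being patched, so treat this as an assumption rather than a tested patch:

    case TARGET(RETURN_CONST): {
        /* Fused LOAD_CONST oparg + RETURN_VALUE: no push/pop needed. */
        retval = GETITEM(consts, oparg);
        Py_INCREF(retval);
        assert(f->f_iblock == 0);
        goto exit_returning;  /* stand-in for RETURN_VALUE's exit label */
    }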

@markshannon (Member) commented Mar 25, 2021

See faster-cpython/cpython#2 (comment) for restrictions on super-instruction formation.

I don't think there is much room for super-instructions implemented in the compiler, because:

  • They shouldn't include a bytecode that we will want to specialize, because specialization gains much more than removing the dispatch.
  • They can't cross line boundaries.
  • They can't include more than one instruction that has an operand.

Unfortunately, FAST_ADD, FAST_SUBSCR, CONST_ADD, CONST_SUBSCR all break rule 1.

INT_ADD is good because it is already specialized, and will rarely cross a line boundary.

Various combinations of LOAD_CONST, IS_OP and POP_JUMP_IF_XXX are also promising.

@gvanrossum (Collaborator, Author)

Things that can be done then include:

  • INT_ADD

  • LOAD_CONST(None), IS_OP, POP_JUMP_IF_XXX

  • RETURN_NONE, RETURN_CONST

  • Maybe add checks to skip these optimizations if they span lines, e.g.

        return (
          None
        )

Some of these could be separate PRs to test the waters.

@gvanrossum (Collaborator, Author)

(Actually, code like that return broken across lines already executes like it's all on the second line, so I think I needn't worry about that.)

@gvanrossum (Collaborator, Author)

See python/cpython#25090

@gvanrossum (Collaborator, Author) commented Mar 31, 2021

Note that all of the work I've done so far is really combined instructions, and for that I've opened a new issue, #36. Let's keep this one (#16) for the original idea of super-instructions (also spelled "superinstructions") in the style of Ertl -- see top comment in the current issue.

@gvanrossum (Collaborator, Author) commented Apr 22, 2021

Since runtime specialization is going to take a while, I'm putting some effort into getting several (true) super-instructions implemented in the compiler. See https://github.com/faster-cpython/cpython/tree/super-instr

@gvanrossum (Collaborator, Author) commented Apr 23, 2021

UPDATE: Never mind, see later comment (it did move the needle, but in the wrong direction)

There are a few things to clean up, but this seems to really move the needle for pyperformance:

Slower (15):
- regex_dna: 212 ms +- 1 ms -> 225 ms +- 1 ms: 1.06x slower
- regex_v8: 27.6 ms +- 0.3 ms -> 29.2 ms +- 0.1 ms: 1.06x slower
- spectral_norm: 174 ms +- 3 ms -> 182 ms +- 2 ms: 1.04x slower
- regex_effbot: 3.60 ms +- 0.10 ms -> 3.72 ms +- 0.01 ms: 1.03x slower
- json_loads: 31.1 us +- 0.2 us -> 32.1 us +- 0.3 us: 1.03x slower
- pickle_dict: 33.5 us +- 0.3 us -> 34.4 us +- 0.2 us: 1.03x slower
- unpack_sequence: 58.0 ns +- 0.6 ns -> 59.5 ns +- 0.6 ns: 1.03x slower
- fannkuch: 572 ms +- 3 ms -> 585 ms +- 4 ms: 1.02x slower
- scimark_sor: 248 ms +- 4 ms -> 254 ms +- 3 ms: 1.02x slower
- crypto_pyaes: 142 ms +- 1 ms -> 145 ms +- 2 ms: 1.02x slower
- pickle_list: 5.15 us +- 0.07 us -> 5.25 us +- 0.05 us: 1.02x slower
- pyflate: 811 ms +- 6 ms -> 824 ms +- 6 ms: 1.02x slower
- scimark_fft: 516 ms +- 3 ms -> 524 ms +- 4 ms: 1.01x slower
- nqueens: 123 ms +- 1 ms -> 124 ms +- 1 ms: 1.01x slower
- chameleon: 11.2 ms +- 0.1 ms -> 11.3 ms +- 0.2 ms: 1.01x slower

Faster (40):
- sqlalchemy_declarative: 381 ms +- 9 ms -> 182 ms +- 3 ms: 2.09x faster
- xml_etree_iterparse: 180 ms +- 10 ms -> 120 ms +- 2 ms: 1.50x faster
- xml_etree_parse: 251 ms +- 18 ms -> 168 ms +- 3 ms: 1.49x faster
- meteor_contest: 186 ms +- 21 ms -> 127 ms +- 0 ms: 1.46x faster
- sqlalchemy_imperative: 33.7 ms +- 2.4 ms -> 24.1 ms +- 0.5 ms: 1.40x faster
- regex_compile: 288 ms +- 26 ms -> 210 ms +- 1 ms: 1.37x faster
- genshi_text: 48.3 ms +- 3.4 ms -> 37.2 ms +- 0.4 ms: 1.30x faster
- tornado_http: 204 ms +- 16 ms -> 160 ms +- 4 ms: 1.27x faster
- raytrace: 847 ms +- 26 ms -> 667 ms +- 4 ms: 1.27x faster
- chaos: 168 ms +- 3 ms -> 134 ms +- 1 ms: 1.26x faster
- 2to3: 481 ms +- 2 ms -> 389 ms +- 2 ms: 1.24x faster
- genshi_xml: 91.7 ms +- 1.9 ms -> 75.3 ms +- 0.8 ms: 1.22x faster
- xml_etree_generate: 141 ms +- 8 ms -> 117 ms +- 2 ms: 1.20x faster
- json_dumps: 18.9 ms +- 0.9 ms -> 16.0 ms +- 0.2 ms: 1.18x faster
- xml_etree_process: 112 ms +- 5 ms -> 95.0 ms +- 0.5 ms: 1.18x faster
- pathlib: 26.1 ms +- 0.7 ms -> 22.9 ms +- 0.4 ms: 1.14x faster
- logging_format: 13.4 us +- 0.8 us -> 11.9 us +- 0.1 us: 1.13x faster
- dulwich_log: 99.3 ms +- 4.1 ms -> 88.7 ms +- 0.6 ms: 1.12x faster
- logging_simple: 12.1 us +- 0.7 us -> 10.9 us +- 0.1 us: 1.11x faster
- telco: 9.52 ms +- 0.22 ms -> 8.61 ms +- 0.22 ms: 1.11x faster
- pidigits: 222 ms +- 0 ms -> 204 ms +- 0 ms: 1.09x faster
- float: 141 ms +- 6 ms -> 129 ms +- 2 ms: 1.09x faster
- pickle_pure_python: 631 us +- 18 us -> 580 us +- 7 us: 1.09x faster
- django_template: 66.7 ms +- 3.8 ms -> 63.3 ms +- 0.7 ms: 1.05x faster
- mako: 19.9 ms +- 0.5 ms -> 19.0 ms +- 0.2 ms: 1.05x faster
- hexiom: 12.4 ms +- 0.2 ms -> 12.0 ms +- 0.1 ms: 1.04x faster
- scimark_monte_carlo: 141 ms +- 2 ms -> 137 ms +- 2 ms: 1.03x faster
- nbody: 175 ms +- 1 ms -> 170 ms +- 2 ms: 1.03x faster
- unpickle_pure_python: 415 us +- 4 us -> 404 us +- 4 us: 1.03x faster
- scimark_lu: 218 ms +- 2 ms -> 212 ms +- 1 ms: 1.03x faster
- sympy_sum: 244 ms +- 2 ms -> 239 ms +- 2 ms: 1.02x faster
- python_startup: 9.89 ms +- 0.05 ms -> 9.71 ms +- 0.04 ms: 1.02x faster
- pickle: 13.2 us +- 0.1 us -> 13.0 us +- 0.1 us: 1.02x faster
- python_startup_no_site: 6.55 ms +- 0.04 ms -> 6.47 ms +- 0.04 ms: 1.01x faster
- deltablue: 10.1 ms +- 0.2 ms -> 10.0 ms +- 0.2 ms: 1.01x faster
- scimark_sparse_mat_mult: 7.71 ms +- 0.07 ms -> 7.63 ms +- 0.18 ms: 1.01x faster
- sympy_str: 379 ms +- 3 ms -> 375 ms +- 6 ms: 1.01x faster
- sympy_integrate: 30.3 ms +- 0.2 ms -> 30.1 ms +- 0.2 ms: 1.01x faster
- unpickle_list: 5.42 us +- 0.05 us -> 5.38 us +- 0.09 us: 1.01x faster
- go: 285 ms +- 2 ms -> 283 ms +- 2 ms: 1.01x faster

Benchmark hidden because not significant (5): logging_silent, richards, sqlite_synth, sympy_expand, unpickle

Geometric mean: 1.09x faster

@gvanrossum (Collaborator, Author) commented Apr 23, 2021

(The trickiest thing to clean up is that e.g. a sequence of three LOAD_FAST opcodes in a row is converted to two LOAD_FAST_LOAD_FAST opcodes followed by a regular LOAD_FAST. This works but is unexpected, and there's an assert in HALF_NEXTOPARG() that can fail because of this.)

@gvanrossum (Collaborator, Author)

Whoa, never mind. It seems I had the benchmarks backwards (I ran my branch first and then master). So it seems this is either a serious pessimization or there's too much variation in the benchmark times.

@gvanrossum (Collaborator, Author) commented Apr 24, 2021

Several benchmark runs later, these results are unfortunately correct: several benchmarks are much slower with the super-instructions -- up to 2.09x for sqlalchemy_declarative. I'm investigating this with bm_meteor_contest (which has no external dependencies, so it's easier to analyze). So far the main suspicious thing is that there are several runs of multiple super-instructions in a row. This benchmark's dynamic execution profile shows that 15% of opcode pairs are LOAD_FAST + LOAD_FAST. There's also a very common pair, FOR_ITER + STORE_FAST (8%), where I notice that replacing that STORE_FAST with a super-instruction (as is done for all three occurrences of FOR_ITER in the solve() function) will make a PREDICT macro fail (FOR_ITER predicts STORE_FAST or UNPACK_SEQUENCE). (Then again, IIUC with gcc we're using computed gotos, which should disable prediction?)

Food for thought.

@gvanrossum (Collaborator, Author)

Okay, I had a reference count bug. Here's a more reliable result. So it looks like it's up to 6% faster, up to 2% slower, and on average 1% faster.

$ python3 -m pyperf compare_to  req-{1619196353,1619391121}-gvanrossum/*gz -G --min-speed 2
Slower (1):
- unpickle_list: 5.38 us +- 0.10 us -> 5.49 us +- 0.07 us: 1.02x slower

Faster (14):
- regex_dna: 225 ms +- 1 ms -> 212 ms +- 1 ms: 1.06x faster
- regex_v8: 29.1 ms +- 0.1 ms -> 27.6 ms +- 0.3 ms: 1.05x faster
- regex_effbot: 3.75 ms +- 0.08 ms -> 3.59 ms +- 0.03 ms: 1.04x faster
- spectral_norm: 182 ms +- 3 ms -> 175 ms +- 3 ms: 1.04x faster
- regex_compile: 210 ms +- 1 ms -> 203 ms +- 1 ms: 1.04x faster
- unpickle_pure_python: 405 us +- 5 us -> 391 us +- 4 us: 1.04x faster
- pickle_list: 5.28 us +- 0.11 us -> 5.11 us +- 0.05 us: 1.03x faster
- chameleon: 11.3 ms +- 0.3 ms -> 10.9 ms +- 0.1 ms: 1.03x faster
- pathlib: 23.1 ms +- 0.3 ms -> 22.5 ms +- 0.3 ms: 1.03x faster
- scimark_fft: 524 ms +- 4 ms -> 511 ms +- 4 ms: 1.02x faster
- raytrace: 665 ms +- 4 ms -> 649 ms +- 6 ms: 1.02x faster
- unpack_sequence: 59.4 ns +- 0.5 ns -> 58.1 ns +- 2.5 ns: 1.02x faster
- genshi_xml: 76.1 ms +- 2.9 ms -> 74.4 ms +- 0.8 ms: 1.02x faster
- scimark_sor: 254 ms +- 3 ms -> 248 ms +- 2 ms: 1.02x faster

Benchmark hidden because not significant (45): 2to3, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, go, hexiom, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pickle, pickle_dict, pickle_pure_python, pidigits, pyflate, python_startup, python_startup_no_site, richards, scimark_lu, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpickle, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster
$ 

@gvanrossum (Collaborator, Author) commented Apr 26, 2021

I've tried this with the INT_ADD opcode added too. Results are similar, but alas:

- unpack_sequence: 59.4 ns +- 0.5 ns -> 69.7 ns +- 0.6 ns: 1.17x slower

Now that is a pretty crazy benchmark (400 repetitions of a, b, c, d, e, f, g, h, i, j = to_unpack), but still, an unacceptable regression.

UPDATE: Strangely, unpack_sequence was faster without INT_ADD, but the generated bytecode doesn't have an INT_ADD -- just a bunch of STORE_FAST_LOAD_FAST opcodes (one per line, last opcode of the line). No idea yet what could be going on here. My current theory: that benchmark doesn't do enough loops and it's too susceptible to random variability. Note that it's only 59 ns per loop.

@gvanrossum (Collaborator, Author)

I can't say I understand it, but clearly STORE_FAST_LOAD_FAST was a pessimization. Without this, the top faster and slower times are like this:

Slower (2):
- unpickle_list: 5.38 us +- 0.09 us -> 5.52 us +- 0.09 us: 1.03x slower
- sqlite_synth: 3.35 us +- 0.06 us -> 3.43 us +- 0.05 us: 1.02x slower

Faster (11):
- chameleon: 11.3 ms +- 0.2 ms -> 11.0 ms +- 0.1 ms: 1.03x faster
- regex_compile: 210 ms +- 1 ms -> 204 ms +- 1 ms: 1.03x faster

So mostly a wash. Is it worth investigating why unpickle_list is consistently slower?

@gvanrossum (Collaborator, Author) commented Apr 26, 2021

Here are some TODOs I collected during a discussion with Mark and Eric:

  • static analyze how often STORE_FAST_LOAD_FAST is generated
    • in the stdlib: on master this pair occurred 0.87%; in my branch, the single opcode occurs 0.79%
      (presumably the difference is due to triples, which get counted twice, but optimized only once)
  • maybe also do dynamic analysis of the same
    • no surprise there -- about 4-5%, corresponding to the dynamic stats I gathered previously; in bm_unpack_sequence.py it's around 3%
  • consider not generating super-instructions crossing lines, then we can drop the use_tracing check from HALF_DISPATCH()

@markshannon (Member)

This seems to have expired. Let's try to restart it.

First of all, let's narrow this issue down to superinstructions that are inserted into the quickened code but are not present in the original code object.

Superinstructions that are inserted by the compiler into the code object are discussed in #36.

We want to exclude from this list:

  • Combined instructions (typically a pair where only one instruction has an operand)
  • Any pairs when one (or both) of the pair can be specialized
  • Any pair including FOR_ITER, for now. It is not clear what the best way to speed up for loops is.

Which leaves this list of common pairs, compiled from various sources:

  • LOAD_FAST LOAD_FAST
  • COMPARE_OP POP_JUMP_IF_FALSE
  • STORE_FAST LOAD_FAST
  • LOAD_FAST LOAD_CONST
  • LOAD_CONST COMPARE_OP
  • STORE_FAST STORE_FAST
  • LOAD_CONST LOAD_FAST

We are not just interested in the prevalence of the pairs, but also in the profitability of combining them.
Of the above, the simple pairs (combining LOAD_FAST, STORE_FAST, and LOAD_CONST) are worth combining because the dispatch overhead is a significant part of their execution time.
However, COMPARE_OP is a complex instruction and may benefit from specialization, so we'll leave that out for now as well.

I am suspicious of the STORE_FAST STORE_FAST pair. I suspect that it shows up a lot in tests, but less so in production code.
I could well be wrong, and would be happy to be proved wrong, but I'm going to leave it out for now.

That leaves just four:

  • LOAD_FAST LOAD_FAST
  • STORE_FAST LOAD_FAST
  • LOAD_FAST LOAD_CONST
  • LOAD_CONST LOAD_FAST

It's worth noting that none of LOAD_FAST, STORE_FAST, and LOAD_CONST can change the state from non-tracing to tracing.
Since tracing is always off when executing quickened instructions, this gives us another advantage. We don't need to check for tracing at the end of the instruction. In fact, we can take advantage of this to quicken LOAD_FAST, STORE_FAST and LOAD_CONST by replacing them with "NO TRACE" versions that do not check for tracing when dispatching.

This gives us three more instructions that might be worth adding (see the sketch after this list):

  • LOAD_FAST_NO_TRACE
  • LOAD_CONST_NO_TRACE
  • STORE_FAST_NO_TRACE
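
For example, the no-trace flavor of LOAD_FAST might look roughly like this, assuming a NOTRACE_DISPATCH() macro that skips the use_tracing check (the same macro the quickening prototype later in this thread uses); a sketch, not code from a branch:

    case TARGET(LOAD_FAST_NO_TRACE): {
        /* Body identical to LOAD_FAST; only the dispatch differs. */
        PyObject *value = GETLOCAL(oparg);
        if (value == NULL) {
            goto unbound_local_error;  /* stand-in for LOAD_FAST's error path */
        }
        Py_INCREF(value);
        PUSH(value);
        NOTRACE_DISPATCH();  /* no tracing check before the next opcode */
    }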

@gvanrossum (Collaborator, Author)

So is your conclusion to just add the no-trace instructions, or also the four super-instruction pairs you listed?

@markshannon (Member)

My conclusion is that we should run some experiments 🙂

@markshannon (Member)

Initial experiments show no speedup with the four super-instructions, and a ~2% speedup from adding STORE_FAST__STORE_FAST.
That might be a real effect, or just the unpack_sequence benchmark.
It is also on my laptop without restrictions on power management, so it's noisy.
From this I conclude that more experiments are necessary 🙂

If you want to play, the code is here: https://github.com/faster-cpython/cpython/tree/super-instructions

@sweeneyde

I'd be curious about a further specialized STORE_FAST__LOAD_FAST_SAME when the same variable is stored and then loaded, as is the case in some list comprehensions:

.\python.bat -c "from dis import dis; dis(lambda: [x + 1 for x in my_list])"
Running Release|x64 interpreter...
  1           0 LOAD_CONST               1 (<code object <listcomp> at 0x0000016D0DF4F210, file "<string>", line 1>)
              2 MAKE_FUNCTION            0
              4 LOAD_GLOBAL              0 (my_list)
              6 GET_ITER
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x0000016D0DF4F210, file "<string>", line 1>:
  1           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                 6 (to 18)
              6 STORE_FAST               1 (x)   <----------------------
              8 LOAD_FAST                1 (x)   <----------------------
             10 LOAD_CONST               0 (1)
             12 BINARY_ADD
             14 LIST_APPEND              2
             16 JUMP_ABSOLUTE            2 (to 4)
        >>   18 RETURN_VALUE

Implemented as something like:

    TARGET(STORE_FAST__LOAD_FAST_SAME): {
        PyObject *val = TOP();  // value stays on top of the stack
        Py_INCREF(val);
        SETLOCAL(oparg, val);
        DISPATCH();
    }

See also https://bugs.python.org/issue38381. Maybe it's worth collecting statistics about how often this pair has the same argument.

@JelleZijlstra (Contributor)

Could this be handled by a peephole optimization step that simply elides the pair of instructions? I suppose that's only safe if there are no locals() calls, and no other references to the variable.

@zcpara commented Aug 14, 2021

Eliding the pair of instructions will also be unsafe if there are code-introspection, dir(), or traceback calls, or if the variable is deleted.

@zcpara commented Aug 14, 2021

Is it worth making a super-instruction for UNPACK_SEQUENCE and the STORE_FASTs that follow it? In bm_nbody.py from pyperformance, these patterns occur in hot loop bodies. We can identify them during the compilation phase.

    def advance(dt, n, bodies=SYSTEM, pairs=PAIRS):
        for i in range(n):
            for (([x1, y1, z1], v1, m1),
                 ([x2, y2, z2], v2, m2)) in pairs:

     80          14 GET_ITER
            >>   16 FOR_ITER               112 (to 242)
                 18 UNPACK_SEQUENCE          2
                 20 UNPACK_SEQUENCE          3
                 22 UNPACK_SEQUENCE          3
                 24 STORE_FAST               5 (x1)
                 26 STORE_FAST               6 (y1)
                 28 STORE_FAST               7 (z1)
                 30 STORE_FAST               8 (v1)
                 32 STORE_FAST               9 (m1)

     81          34 UNPACK_SEQUENCE          3
                 36 UNPACK_SEQUENCE          3
                 38 STORE_FAST              10 (x2)
                 40 STORE_FAST              11 (y2)
                 42 STORE_FAST              12 (z2)
                 44 STORE_FAST              13 (v2)
                 46 STORE_FAST              14 (m2)

@markshannon (Member)

We don't want to distort our selection of superinstructions, or optimizations in general, to fit the pyperformance benchmark suite.

Do we have evidence that UNPACK_SEQUENCE followed by a sequence of STORE_FASTs is a common pattern?
We now have STORE_FAST__STORE_FAST as a superinstruction, which will help with the above patterns.

@gvanrossum (Collaborator, Author)

I'd expect that 2 or more STORE_FAST are the most likely thing following UNPACK_SEQUENCE, given the abundance of common patterns like

    x, y = point
    for x, y in points: ...
    for k, v in d.items(): ...
    for i, x in enumerate(a): ...
    for x, y in zip(xs, ys): ...
    for root, dirs, files in os.walk(...): ...

If you want more evidence we can look through the top 100 packages for such sequences.

From a quick grep of the stdlib, just 2 values is by far the most common.

@frank-king commented Aug 19, 2021

I have downloaded all the wheels in the top 100 Python packages and counted the sequence-unpacking patterns.

Here is the result:

    Pattern                  Lines   Percentage
    for x, y in ...           5842        23.5%
    x, y = ...               13244        52.8%
    for x, y(, z)+ in ...      125         0.5%
    x, y(, z)+ = ...          5852        23.3%
    total                    25063         100%

The statistics show that 2 STORE_FASTs are the most likely to follow an UNPACK_SEQUENCE (about 76.2%), but in the remaining 23.8% of cases the UNPACK_SEQUENCE is followed by 3 or more STORE_FASTs.


My grep patterns are:

    Pattern                  Regex
    for x, y in ...          'for \w+, \w+ in' or 'for \(\w+, \w+\) in'
    for x, y(, z)+ in ...    'for \w+, \w+(, \w)+ in' or 'for \(\w+, \w+(, \w)+\) in'
    x, y = ...               '^\s*\w+, \w+ =' or '^\s*\(\w+, \w+\) ='
    x, y(, z)+ = ...         '^\s*\w+, \w+(, \w+)+ =' or '^\s*\(\w+, \w+(, \w+)+\) ='

@gvanrossum (Collaborator, Author) commented Aug 19, 2021 via email

@pxeger commented Aug 19, 2021

I've hacked together something of a framework to analyse bytecode and look for patterns like the above, also using the top 100 PyPI packages (or it can be adapted for any other corpus of pyc files, really). You may find it helpful: https://github.com/pxeger/pycstats

@pxeger commented Aug 20, 2021

Results for UNPACK_SEQUENCE followed by any two instructions (this is based on the compiled output, unweighted by how many times the instructions are executed, so not quite representative):

72975 UNPACK_SEQUENCE STORE_FAST STORE_FAST
993 UNPACK_SEQUENCE LOAD_FAST STORE_ATTR
744 UNPACK_SEQUENCE STORE_NAME STORE_NAME
592 UNPACK_SEQUENCE STORE_FAST UNPACK_SEQUENCE
588 UNPACK_SEQUENCE STORE_FAST LOAD_FAST
287 UNPACK_SEQUENCE LOAD_FAST LOAD_CONST
280 UNPACK_SEQUENCE STORE_DEREF STORE_FAST
256 UNPACK_SEQUENCE STORE_DEREF STORE_DEREF
236 UNPACK_SEQUENCE STORE_FAST STORE_DEREF
219 UNPACK_SEQUENCE UNPACK_SEQUENCE STORE_FAST
162 UNPACK_SEQUENCE STORE_FAST LOAD_GLOBAL
55 UNPACK_SEQUENCE LOAD_FAST LOAD_FAST
52 UNPACK_SEQUENCE LOAD_DEREF STORE_ATTR
51 UNPACK_SEQUENCE LOAD_FAST LOAD_ATTR
50 UNPACK_SEQUENCE STORE_FAST POP_BLOCK
20 UNPACK_SEQUENCE STORE_FAST LOAD_DEREF
20 UNPACK_SEQUENCE STORE_FAST JUMP_FORWARD
19 UNPACK_SEQUENCE STORE_FAST LOAD_CONST
18 UNPACK_SEQUENCE STORE_GLOBAL STORE_GLOBAL
16 UNPACK_SEQUENCE LOAD_DEREF LOAD_ATTR
10 UNPACK_SEQUENCE LOAD_GLOBAL STORE_ATTR
8 UNPACK_SEQUENCE LOAD_GLOBAL LOAD_CONST
7 UNPACK_SEQUENCE STORE_FAST SETUP_FINALLY
5 UNPACK_SEQUENCE LOAD_NAME STORE_ATTR
4 UNPACK_SEQUENCE STORE_NAME UNPACK_SEQUENCE
4 UNPACK_SEQUENCE STORE_GLOBAL STORE_FAST
4 UNPACK_SEQUENCE STORE_FAST JUMP_ABSOLUTE
3 UNPACK_SEQUENCE STORE_DEREF UNPACK_SEQUENCE
3 UNPACK_SEQUENCE STORE_FAST UNPACK_EX
3 UNPACK_SEQUENCE UNPACK_SEQUENCE STORE_DEREF
3 UNPACK_SEQUENCE UNPACK_SEQUENCE UNPACK_SEQUENCE
2 UNPACK_SEQUENCE STORE_DEREF LOAD_DEREF
1 UNPACK_SEQUENCE LOAD_GLOBAL LOAD_ATTR
1 UNPACK_SEQUENCE EXTENDED_ARG STORE_NAME
1 UNPACK_SEQUENCE EXTENDED_ARG UNPACK_EX

@zcpara commented Aug 26, 2021

I have prototyped an UNPACK_SEQUENCE_ST super-instruction, which produced a 1.008% speedup. The implementation is based on the "Add five superinstructions" PR.

Is it worth doing this optimization?

In ceval.c

TARGET(UNPACK_SEQUENCE_ST): {
    PREDICTED(UNPACK_SEQUENCE_ST);
    PyObject *seq = POP(), *item, **items;
    if (PyTuple_CheckExact(seq) &&
        PyTuple_GET_SIZE(seq) == oparg) {
        /* Tuple fast path: store the items directly into the locals
           named by the following STORE_FAST instructions. */
        items = ((PyTupleObject *)seq)->ob_item;
        for (int i = 0; i < oparg; i++) {
            item = items[i];
            Py_INCREF(item);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else if (PyList_CheckExact(seq) &&
             PyList_GET_SIZE(seq) == oparg) {
        /* List fast path: same as the tuple case. */
        items = ((PyListObject *)seq)->ob_item;
        for (int i = 0; i < oparg; i++) {
            item = items[i];
            Py_INCREF(item);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else if (unpack_iterable(tstate, seq, oparg, -1,
                             stack_pointer + oparg)) {
        /* General case: unpack_iterable() wrote the items (new
           references) above the stack top in reverse order. */
        items = stack_pointer + oparg - 1;
        for (int i = 0; i < oparg; i++) {
            item = *(items - i);
            SETLOCAL(_Py_OPARG(*next_instr), item);
            next_instr++;
        }
    }
    else {
        /* unpack_iterable() raised an exception */
        Py_DECREF(seq);
        goto error;
    }
    Py_DECREF(seq);
    PREDICT(LOAD_FAST);
    NOTRACE_DISPATCH();
}

In specialize.c

case UNPACK_SEQUENCE: {
    /* Form the superinstruction only if the next oparg instructions
       are all STORE_FAST. */
    int j = 1;
    for (; j <= oparg; ++j) {
        if (_Py_OPCODE(instructions[i+j]) != STORE_FAST) {
            break;
        }
    }
    if (j == (oparg+1)) {
        instructions[i] = _Py_MAKECODEUNIT(UNPACK_SEQUENCE_ST, oparg);
        i += oparg;
    }
    break;
}

@gvanrossum (Collaborator, Author)

One percent overall improvement is nothing to sneeze at! Please submit a PR. (Sorry for the slow response!)

@ssweber commented Sep 16, 2021

“1.006x faster” - that's 0.6%, correct?

@gvanrossum (Collaborator, Author)

Um, yeah, I misread. Still.

@zcpara commented Sep 22, 2021

I've updated the performance data with the latest code base. It's 1.008x faster.

https://bugs.python.org/issue45260
python/cpython#28519

@gvanrossum (Collaborator, Author)

I think this can be closed now that the code generator supports defining super-instructions.

Repository owner moved this from Todo to Done in Fancy CPython Board Nov 16, 2022