Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is PREDICT() worth it any more? #496

Closed
gvanrossum opened this issue Nov 15, 2022 · 11 comments
Closed

Is PREDICT() worth it any more? #496

gvanrossum opened this issue Nov 15, 2022 · 11 comments

Comments

@gvanrossum
Copy link
Collaborator

The PREDICT() macro is already a no-op on compilers with computed GOTO. So it's only used on Windows (and a few esoteric platforms). Is it worth the bother? We should ideally benchmark this. In the past it's been claimed to have substantial benefits.

@brandtbucher
Copy link
Member

I've wondered the same thing. We could benchmark this on Linux pretty easily by disabling computed gotos and timing the results with and without PREDICT turned on. I imagine that would be very close to whatever the Windows speedup is (but without requiring a Windows benchmarking setup).

@brandtbucher
Copy link
Member

Based on Linux builds on our benchmarking machine, disabling USE_COMPUTED_GOTOS makes things ~1% slower (I thought it would be more, but it could be that our new superinstructions do most of the heavy lifting when reducing this part of the dispatch overhead):

Slower (45):
- logging_format: 6.25 us +- 0.09 us -> 6.79 us +- 0.12 us: 1.09x slower
- logging_simple: 5.71 us +- 0.09 us -> 6.07 us +- 0.07 us: 1.06x slower
- genshi_xml: 46.1 ms +- 0.6 ms -> 48.8 ms +- 1.9 ms: 1.06x slower
- html5lib: 58.5 ms +- 2.6 ms -> 61.8 ms +- 2.7 ms: 1.06x slower
- deltablue: 3.19 ms +- 0.04 ms -> 3.37 ms +- 0.07 ms: 1.06x slower
- dulwich_log: 61.1 ms +- 0.4 ms -> 64.3 ms +- 0.8 ms: 1.05x slower
- scimark_lu: 104 ms +- 2 ms -> 110 ms +- 2 ms: 1.05x slower
- regex_compile: 128 ms +- 1 ms -> 135 ms +- 2 ms: 1.05x slower
- raytrace: 278 ms +- 4 ms -> 292 ms +- 9 ms: 1.05x slower
- deepcopy: 324 us +- 2 us -> 338 us +- 3 us: 1.05x slower
- pidigits: 189 ms +- 0 ms -> 198 ms +- 0 ms: 1.04x slower
- tornado_http: 92.9 ms +- 1.4 ms -> 97.1 ms +- 1.4 ms: 1.04x slower
- sympy_str: 282 ms +- 3 ms -> 294 ms +- 3 ms: 1.04x slower
- sqlglot_normalize: 105 ms +- 1 ms -> 109 ms +- 1 ms: 1.04x slower
- sympy_expand: 459 ms +- 5 ms -> 479 ms +- 3 ms: 1.04x slower
- unpickle: 12.9 us +- 0.1 us -> 13.4 us +- 1.4 us: 1.04x slower
- sympy_sum: 164 ms +- 2 ms -> 169 ms +- 2 ms: 1.03x slower
- gunicorn: 1.07 ms +- 0.01 ms -> 1.11 ms +- 0.01 ms: 1.03x slower
- sqlglot_optimize: 50.4 ms +- 0.2 ms -> 52.1 ms +- 0.2 ms: 1.03x slower
- sqlglot_parse: 1.32 ms +- 0.01 ms -> 1.36 ms +- 0.01 ms: 1.03x slower
- django_template: 32.3 ms +- 0.3 ms -> 33.3 ms +- 0.4 ms: 1.03x slower
- unpickle_pure_python: 201 us +- 2 us -> 207 us +- 2 us: 1.03x slower
- sqlglot_transpile: 1.61 ms +- 0.02 ms -> 1.66 ms +- 0.02 ms: 1.03x slower
- pprint_safe_repr: 672 ms +- 6 ms -> 691 ms +- 7 ms: 1.03x slower
- pprint_pformat: 1.38 sec +- 0.01 sec -> 1.41 sec +- 0.01 sec: 1.03x slower
- aiohttp: 994 us +- 9 us -> 1.02 ms +- 0.01 ms: 1.03x slower
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.9 ms +- 0.2 ms: 1.02x slower
- coroutines: 25.0 ms +- 0.2 ms -> 25.5 ms +- 0.5 ms: 1.02x slower
- scimark_sor: 103 ms +- 1 ms -> 105 ms +- 1 ms: 1.02x slower
- 2to3: 246 ms +- 1 ms -> 250 ms +- 1 ms: 1.02x slower
- chaos: 66.9 ms +- 0.8 ms -> 68.0 ms +- 0.6 ms: 1.02x slower
- scimark_fft: 309 ms +- 3 ms -> 314 ms +- 3 ms: 1.01x slower
- pickle_pure_python: 284 us +- 4 us -> 288 us +- 2 us: 1.01x slower
- deepcopy_memo: 33.7 us +- 0.4 us -> 34.2 us +- 0.3 us: 1.01x slower
- float: 73.0 ms +- 1.0 ms -> 74.0 ms +- 0.9 ms: 1.01x slower
- deepcopy_reduce: 2.90 us +- 0.04 us -> 2.94 us +- 0.04 us: 1.01x slower
- json_loads: 24.1 us +- 0.9 us -> 24.3 us +- 0.3 us: 1.01x slower
- xml_etree_process: 53.0 ms +- 0.7 ms -> 53.6 ms +- 0.4 ms: 1.01x slower
- unpickle_list: 4.92 us +- 0.10 us -> 4.97 us +- 0.04 us: 1.01x slower
- xml_etree_generate: 76.8 ms +- 0.7 ms -> 77.6 ms +- 0.7 ms: 1.01x slower
- xml_etree_iterparse: 102 ms +- 2 ms -> 103 ms +- 2 ms: 1.01x slower
- python_startup_no_site: 6.23 ms +- 0.01 ms -> 6.28 ms +- 0.01 ms: 1.01x slower
- python_startup: 8.56 ms +- 0.01 ms -> 8.62 ms +- 0.01 ms: 1.01x slower
- coverage: 98.2 ms +- 1.3 ms -> 98.9 ms +- 1.3 ms: 1.01x slower
- hexiom: 6.13 ms +- 0.03 ms -> 6.16 ms +- 0.05 ms: 1.01x slower

Faster (22):
- nbody: 94.7 ms +- 2.3 ms -> 88.6 ms +- 1.5 ms: 1.07x faster
- pycparser: 1.13 sec +- 0.02 sec -> 1.09 sec +- 0.02 sec: 1.04x faster
- pathlib: 18.0 ms +- 0.3 ms -> 17.4 ms +- 0.3 ms: 1.03x faster
- mdp: 2.65 sec +- 0.02 sec -> 2.57 sec +- 0.01 sec: 1.03x faster
- regex_v8: 22.2 ms +- 0.3 ms -> 21.6 ms +- 0.3 ms: 1.03x faster
- pickle_list: 4.16 us +- 0.06 us -> 4.06 us +- 0.04 us: 1.03x faster
- pickle: 10.3 us +- 0.1 us -> 10.0 us +- 0.1 us: 1.03x faster
- generators: 76.4 ms +- 0.7 ms -> 74.6 ms +- 0.5 ms: 1.02x faster
- genshi_text: 21.1 ms +- 0.3 ms -> 20.7 ms +- 0.3 ms: 1.02x faster
- pyflate: 402 ms +- 7 ms -> 393 ms +- 4 ms: 1.02x faster
- chameleon: 6.50 ms +- 0.07 ms -> 6.38 ms +- 0.07 ms: 1.02x faster
- scimark_sparse_mat_mult: 4.12 ms +- 0.08 ms -> 4.04 ms +- 0.05 ms: 1.02x faster
- unpack_sequence: 48.7 ns +- 0.6 ns -> 48.0 ns +- 0.8 ns: 1.01x faster
- fannkuch: 380 ms +- 2 ms -> 375 ms +- 3 ms: 1.01x faster
- mako: 9.64 ms +- 0.07 ms -> 9.53 ms +- 0.06 ms: 1.01x faster
- logging_silent: 91.3 ns +- 0.9 ns -> 90.3 ns +- 1.5 ns: 1.01x faster
- richards: 41.7 ms +- 0.6 ms -> 41.3 ms +- 1.0 ms: 1.01x faster
- go: 134 ms +- 1 ms -> 132 ms +- 1 ms: 1.01x faster
- json_dumps: 9.48 ms +- 0.14 ms -> 9.40 ms +- 0.10 ms: 1.01x faster
- regex_effbot: 3.60 ms +- 0.02 ms -> 3.58 ms +- 0.02 ms: 1.01x faster
- scimark_monte_carlo: 65.6 ms +- 0.8 ms -> 65.3 ms +- 0.8 ms: 1.00x faster
- regex_dna: 203 ms +- 0 ms -> 202 ms +- 1 ms: 1.00x faster

Benchmark hidden because not significant (15): async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, crypto_pyaes, json, meteor_contest, mypy, nqueens, pickle_dict, spectral_norm, sqlite_synth, telco, thrift, xml_etree_parse

Geometric mean: 1.01x slower

When USE_COMPUTED_GOTOS is turned off, disabling PREDICT doesn't appear to have any effect:

Slower (28):
- generators: 74.6 ms +- 0.5 ms -> 84.4 ms +- 0.3 ms: 1.13x slower
- scimark_sparse_mat_mult: 4.04 ms +- 0.05 ms -> 4.27 ms +- 0.06 ms: 1.06x slower
- mdp: 2.57 sec +- 0.01 sec -> 2.68 sec +- 0.02 sec: 1.04x slower
- telco: 6.30 ms +- 0.15 ms -> 6.54 ms +- 0.18 ms: 1.04x slower
- pickle_list: 4.06 us +- 0.04 us -> 4.19 us +- 0.06 us: 1.03x slower
- float: 74.0 ms +- 0.9 ms -> 76.3 ms +- 1.1 ms: 1.03x slower
- nbody: 88.6 ms +- 1.5 ms -> 91.3 ms +- 2.8 ms: 1.03x slower
- json: 4.70 ms +- 0.11 ms -> 4.84 ms +- 0.17 ms: 1.03x slower
- pickle: 10.0 us +- 0.1 us -> 10.3 us +- 0.1 us: 1.02x slower
- json_dumps: 9.40 ms +- 0.10 ms -> 9.56 ms +- 0.14 ms: 1.02x slower
- pickle_dict: 30.8 us +- 0.1 us -> 31.2 us +- 0.1 us: 1.01x slower
- hexiom: 6.16 ms +- 0.05 ms -> 6.24 ms +- 0.04 ms: 1.01x slower
- pathlib: 17.4 ms +- 0.3 ms -> 17.6 ms +- 0.3 ms: 1.01x slower
- deepcopy_memo: 34.2 us +- 0.3 us -> 34.5 us +- 0.6 us: 1.01x slower
- go: 132 ms +- 1 ms -> 134 ms +- 1 ms: 1.01x slower
- sqlglot_transpile: 1.66 ms +- 0.02 ms -> 1.67 ms +- 0.02 ms: 1.01x slower
- sqlglot_parse: 1.36 ms +- 0.01 ms -> 1.37 ms +- 0.02 ms: 1.01x slower
- json_loads: 24.3 us +- 0.3 us -> 24.6 us +- 0.2 us: 1.01x slower
- pyflate: 393 ms +- 4 ms -> 396 ms +- 4 ms: 1.01x slower
- sympy_expand: 479 ms +- 3 ms -> 482 ms +- 5 ms: 1.01x slower
- sqlite_synth: 2.60 us +- 0.04 us -> 2.62 us +- 0.04 us: 1.01x slower
- mako: 9.53 ms +- 0.06 ms -> 9.57 ms +- 0.04 ms: 1.00x slower
- nqueens: 81.9 ms +- 0.8 ms -> 82.3 ms +- 0.7 ms: 1.00x slower
- regex_v8: 21.6 ms +- 0.3 ms -> 21.7 ms +- 0.2 ms: 1.00x slower
- unpickle_pure_python: 207 us +- 2 us -> 208 us +- 2 us: 1.00x slower
- 2to3: 250 ms +- 1 ms -> 251 ms +- 1 ms: 1.00x slower
- sqlglot_optimize: 52.1 ms +- 0.2 ms -> 52.2 ms +- 0.3 ms: 1.00x slower
- regex_dna: 202 ms +- 1 ms -> 202 ms +- 1 ms: 1.00x slower

Faster (23):
- regex_effbot: 3.58 ms +- 0.02 ms -> 3.32 ms +- 0.02 ms: 1.08x faster
- unpack_sequence: 48.0 ns +- 0.8 ns -> 44.9 ns +- 0.6 ns: 1.07x faster
- pidigits: 198 ms +- 0 ms -> 189 ms +- 0 ms: 1.04x faster
- xml_etree_process: 53.6 ms +- 0.4 ms -> 52.8 ms +- 0.4 ms: 1.01x faster
- chaos: 68.0 ms +- 0.6 ms -> 67.1 ms +- 0.7 ms: 1.01x faster
- pprint_pformat: 1.41 sec +- 0.01 sec -> 1.40 sec +- 0.01 sec: 1.01x faster
- django_template: 33.3 ms +- 0.4 ms -> 32.9 ms +- 0.3 ms: 1.01x faster
- pycparser: 1.09 sec +- 0.02 sec -> 1.07 sec +- 0.02 sec: 1.01x faster
- thrift: 755 us +- 27 us -> 746 us +- 9 us: 1.01x faster
- scimark_sor: 105 ms +- 1 ms -> 104 ms +- 1 ms: 1.01x faster
- logging_format: 6.79 us +- 0.12 us -> 6.72 us +- 0.11 us: 1.01x faster
- scimark_fft: 314 ms +- 3 ms -> 311 ms +- 3 ms: 1.01x faster
- unpickle_list: 4.97 us +- 0.04 us -> 4.92 us +- 0.12 us: 1.01x faster
- logging_simple: 6.07 us +- 0.07 us -> 6.01 us +- 0.10 us: 1.01x faster
- crypto_pyaes: 75.1 ms +- 1.3 ms -> 74.5 ms +- 0.8 ms: 1.01x faster
- pickle_pure_python: 288 us +- 2 us -> 285 us +- 4 us: 1.01x faster
- pprint_safe_repr: 691 ms +- 7 ms -> 686 ms +- 5 ms: 1.01x faster
- sympy_str: 294 ms +- 3 ms -> 292 ms +- 2 ms: 1.01x faster
- python_startup_no_site: 6.28 ms +- 0.01 ms -> 6.25 ms +- 0.01 ms: 1.00x faster
- aiohttp: 1.02 ms +- 0.01 ms -> 1.01 ms +- 0.01 ms: 1.00x faster
- sqlglot_normalize: 109 ms +- 1 ms -> 109 ms +- 1 ms: 1.00x faster
- python_startup: 8.62 ms +- 0.01 ms -> 8.59 ms +- 0.01 ms: 1.00x faster
- deepcopy: 338 us +- 3 us -> 337 us +- 3 us: 1.00x faster

Benchmark hidden because not significant (31): async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, chameleon, coroutines, coverage, deepcopy_reduce, deltablue, dulwich_log, fannkuch, genshi_text, genshi_xml, gunicorn, html5lib, logging_silent, meteor_contest, mypy, raytrace, regex_compile, richards, scimark_lu, scimark_monte_carlo, spectral_norm, sympy_integrate, sympy_sum, tornado_http, unpickle, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.00x slower

So this could be worth exploring as a cleanup opportunity (probably after observing similar MSVC results first).

@gvanrossum
Copy link
Collaborator Author

Agreed, this looks somewhat promising. Let's wait until we have a reliable way to benchmark this on Windows. (Didn't I read somewhere that we need 2.5% perf difference to be sure a benchmark result isn't noise?)

@ericsnowcurrently
Copy link
Collaborator

(Didn't I read somewhere that we need 2.5% perf difference to be sure a benchmark result isn't noise?)

#480

@gvanrossum
Copy link
Collaborator Author

gvanrossum commented Nov 30, 2022

Another way to evaluate the effectiveness of PREDICT() without benchmarking: look at opcode counts and successors for opcodes that have a PREDICT() in them. We should be able to tell how often the PREDICT() is effective from that.

Separately, PREDICT() in opcodes that may be specialized aren't very effective unless the specialized versions also use PREDICT(). So we should either add PREDICT() to the specialized versions, or remove it from the base opcode.

@gvanrossum
Copy link
Collaborator Author

  • LIST_APPEND: not specialized; 97% successors are the predicted JUMP_BACKWARD
  • SET_ADD: not specialized; 100% successors are the predicted JUMP_BACKWARD
  • GET_ANEXT: not specialized; doesn't show up in successor count stats
  • GET_AWAITABLE: not specialized; doesn't show up in successor count stats
  • DICT_MERGE: not specialized; 100% successors are the predicted CALL_FUNCTION_EX
  • MAP_ADD: not specialized; only 8.6% successors are the predicted JUMP_BACKWARD (!)
  • MATCH_MAPPING: not specialized; doesn't show up in successor count stats
  • MATCH_SEQUENCE: not specialized; doesn't show up in successor count stats
  • GET_YIELD_FROM_ITER: not specialized; 100% successors are the predicted LOAD_CONST
  • BEFORE_ASYNC_WITH: not specialized; doesn't show up in successor count stats

Conclusion: no need to worry about specializations here; half of these opcodes don't show up in the stats (because they're not used in any of the benchmarks); four of the others are clearly effective; MAP_ADD seems questionable. Perhaps we can just drop the PREDICT() calls from the 5 that don't show up (being very specialized opcodes) and from MAP_ADD?

@earonesty
Copy link

windows is far from esoteric tho! seems like it's still worth it on windows, which is a critical o/s

@markshannon
Copy link
Member

One thing we could do, is strip the PREDICT() macros from bytecodes.c and automatically insert the prediction logic using data from the stats.
The current, manually inserted predictions aren't very good, so it should be easy to do better.

E.g. according to the stats, BEFORE_WITH is followed by POP_TOP about 80% of the time, STORE_FAST__LOAD_FAST the remaining 20%, and everything else some negligible fraction.

We could generate this code at the end of BEFORE_WITH:

if (opcode == POP_TOP) goto target_POP_TOP;
if (opcode == STORE_FAST__LOAD_FAST) goto target_STORE_FAST__LOAD_FAST;
GOTO_DISPATCH();

(The if ... goto ... will need to wrapped in a macro, for portability)

@gvanrossum
Copy link
Collaborator Author

Since PREDICT() only matters on the MS compiler, maybe we could just manually insert more appropriate PREDICT() calls based on the stats? Where are the most recent stats?

@markshannon
Copy link
Member

I think we should automate this.
https://github.com/faster-cpython/ideas/tree/main/stats look for the one from the most recent Sunday.

@markshannon
Copy link
Member

The PREDICT macros are gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants